Bits from Biology for Computational Intelligence

Computational intelligence is broadly defined as biologically-inspired computing. Usually, inspiration is drawn from neural systems. This article shows how to analyze neural systems using information theory to obtain constraints that help identify the algorithms run by such systems and the information they represent. Algorithms and representations identified information-theoretically may then guide the design of biologically inspired computing systems (BICS). The material covered includes the necessary introduction to information theory and the estimation of information theoretic quantities from neural data. We then show how to analyze the information encoded in a system about its environment, and also discuss recent methodological developments on the question of how much information each agent carries about the environment either uniquely, or redundantly or synergistically together with others. Last, we introduce the framework of local information dynamics, where information processing is decomposed into component processes of information storage, transfer, and modification -- locally in space and time. We close by discussing example applications of these measures to neural data and other complex systems.


INTRODUCTION
Computational intelligence (CI) is broadly defined as biologically-inspired computing. CI often must deal with ill-posed problems, and the field draws inspiration from natural information processing systems, as these cannot afford the luxury of dismissing any problem that happens to cross their path as 'ill-posed'. Instead, natural systems have evolved algorithms to approximately solve problems relevant to them: algorithms that are adapted to their often limited resources and that yield 'good enough' solutions.
These algorithms may then serve as an inspiration for artificial information processing systems to solve similar problems under tight constraints of computational power, data availability, and time.
One way to use this inspiration is to copy and incorporate as much biological detail as possible in the artificial system, in the hope to also copy the emergent information processing of the biological system.
However, even small errors in copying the parameters of a system may compromise success. Therefore, it may be useful to derive inspiration also in a more abstract way that is directly linked to the information processing carried out by a biological system. But how can we gain insight into this information processing without caring about its biological implementation?
The formal language to quantitatively describe and dissect information processing, in any system, is provided by information theory. For our particular question we can exploit the fact that information theory does not care about the nature of the variables that enter the computation or information processing. Thus, it is in principle possible to treat all relevant aspects of biological computation, and of biologically inspired computing systems, in one natural framework.
Here, we will first review some information theoretic preliminaries. Then we will systematically present how to analyze biological computing systems, especially neural systems, using methods from information theory, and discuss how these information theoretic results can inspire artificial computing systems. We will close with a brief review of studies where this information theoretic point of view has served this goal.

INFORMATION THEORETIC PRELIMINARIES
In this section, we introduce the necessary terminology and notation, and define the basic information theoretic quantities that later analyses build on. Experts in information theory may proceed immediately to Section 2.2, which discusses the use of information theory in neuroscience.

Terminology and Notation
To analyze neural systems and biologically-inspired computing systems (BICS) alike, and to show how the analysis of one can inspire the design of the other, we have to establish a common terminology. Neural systems and BICS have the common property that they are composed of various smaller parts that interact. These parts will be called agents in general, but we will also refer to them as neurons or brain areas where appropriate. The collection of all agents will be referred to as the system.
We define that an agent $\mathcal{X}$ in a system produces an observed time series $\{x_1, \ldots, x_t, \ldots, x_N\}$ which is sampled at time intervals $\Delta$. For simplicity we choose $\Delta = 1$, and index our measurements by $t \in \{1, \ldots, N\} \subseteq \mathbb{N}$. The time series is understood as a realization of a random process $X$. The random process is a collection of random variables (RVs) $X_t$, sorted by an integer index $t$. Each RV $X_t$, at a specific time $t$, is described by the set of all its $J$ possible outcomes $A_{X_t} = \{a_1, \ldots, a_j, \ldots, a_J\}$, and their associated probabilities $p_{X_t}(x_t = a_j)$. Since the probability of an outcome $p_{X_t}(x_t = a_j)$ may change with $t$ in nonstationary random processes, we indicate the RV that the probabilities belong to by a subscript: $p_{X_t}(\cdot)$. In sum, the physical agent $\mathcal{X}$ is conceptualized as a random process $X$, composed of a collection of RVs $X_t$, that produce realizations $x_t$ according to the probability distributions $p_{X_t}(x_t)$. When referring to more than one agent, the notation generalizes to $\mathcal{X}, \mathcal{Y}, \mathcal{Z}, \ldots$. An overview of the complete notation can be found in Table 1.

Estimation of Probability Distributions for Stationary and Non-stationary Random Processes
In general, the probability distributions of the $X_t$ are unknown. Since knowledge of these probability distributions is essential to computing any information theoretic measure, the distributions have to be estimated from the observed realizations $x_t$ of the RVs. This is only possible if we have some form of replication of the processes we wish to analyze. From such replications the probabilities are estimated, for example, by counting relative frequencies, or by density estimation (Kozachenko and Leonenko, 1987; Kraskov et al., 2004; Victor, 2005).

Table 1. Notation

$\mathcal{X}, \mathcal{Y}, \mathcal{Z}$ : agent in a system
$X, Y, Z$ : random process
$X_t, Y_t, Z_t$ : random variable (at time point $t$); whenever necessary, the index $t$ is detailed as $t_1, t_2, \ldots, t_k$; for stationary processes, the index $t$ can be omitted
$x_t, y_t, z_t$ : realization of the random variable (at time point $t$)
$a_j$ : specific outcome of a random variable
$p_{X_t}(x_t = a_j)$ : probability that $X_t$ has the specific outcome $a_j$
$A_{X_t} = \{a_1, \ldots, a_j, \ldots, a_J\}$ : set of all possible outcomes of $X_t$
$X^{(c)}, X^{(c)}_t$ : cyclostationary process and cyclostationary random variable
$X^{(s)}, X^{(s)}_t$ : stationary process and stationary random variable
$\mathbf{X}_t, \mathbf{x}_t$ : state space representation of $X$ at $t$
$\mathbf{X}^{-}_{t-u}$ : state space representation of $X$ at $t - u$; the superscript minus serves as a reminder that $\mathbf{X}^{-}_{t-u}$ is in the past of $X_t$
$u$ : assumed interaction delay between two processes
$\delta$ : physical or true interaction delay between two processes
$S_i, R_j$ : random variables referring to stimuli ($S_i$) or responses ($R_j$)
$R = \{R_1, R_2\}$ : joint variable (in this example, of two responses)
$H(X)$ : entropy
$H(X|Y)$ : conditional entropy
$h(x)$ : information content
$h(x|y)$ : conditional information content
$I(X : Y)$ : mutual information; note that the colon is used to separate the random variables between which we compute $I$
$I(X : Y|Z)$ : conditional mutual information
$i(x : y)$ : local mutual information
$i(x : y|z)$ : local conditional mutual information
$X, Y$ : the comma is used to separate random variables
$X_1, X_2; Y_1, Y_2$ : the semicolon is used to separate sets of random variables

In general, the probability $p_{X_t}(x_t = a_j)$ of obtaining the $j$-th outcome $x_t = a_j$ at time $t$ has to be estimated from replications of the process at the same time point $t$, i.e. via an ensemble of physical replications of the systems in question. These replications can often be obtained in BICS via multiple simulation runs, or even physical replications if the systems in question are very small and/or simple. For complex physically embodied BICS and for neural systems, generating a sufficient number of replications of a process is often impossible. Therefore, one either resorts to repetitions of parts of the process in time, to the generation of cyclostationary processes, or even assumes stationarity. All three possibilities will be discussed in the following.
General Repetitions in Time. If our random process can be repeated in time, then the probability of obtaining the value $x_t = a_j$ can be estimated from observations made at a sufficiently large set $M$ of time points $t + k$, where we know by design of the experiment that the process repeated itself. That is, we know that the RVs $X_{t+k}$ at certain time points $t + k$ have probability distributions identical to the distribution at $t$ that is of interest to us:

$$p_{X_t}(a_j) = p_{X_{t+k}}(a_j) \quad \forall (t+k) \in M, \; a_j \in A_{X_t}. \tag{1}$$

If the set $M$ of times that the process is repeated at is large enough, we obtain a reliable estimate of $p_{X_t}(\cdot)$.

The Cyclostationary Case. Cyclostationarity can be understood as a specific form of repeating parts of the random process, where the repetitions occur after regular intervals $T$. For cyclostationary processes $X^{(c)}$ we assume (Gardner, 1994; Gardner et al., 2006) that there are RVs $X^{(c)}_{t+nT}$ at times $t + nT$ that have the same probability distribution as $X^{(c)}_t$:

$$\exists T \in \mathbb{N} : p_{X_t}(a_j) = p_{X_{t+nT}}(a_j) \quad \forall t, n \in \mathbb{N}, \; t < T, \; a_j \in A_{X_t}. \tag{2}$$

This condition guarantees that we can estimate the necessary probability distributions $p_{X_t}(\cdot)$ of the RV $X^{(c)}_t$ by looking at other RVs $X^{(c)}_{t+nT}$ of the process $X^{(c)}$.
Stationary Processes. Finally, for stationary processes $X^{(s)}$, we can substitute $T$ in eq. 2 by $T = 1$ and obtain:

$$p_{X_t}(a_j) = p_{X_{t+n}}(a_j) \quad \forall t, n \in \mathbb{N}, \; a_j \in A_{X_t}. \tag{3}$$
In the stationary case the probability distribution $p_{X_t}(\cdot)$ can be estimated from the entire set of measured realizations $x_t$. Thus, we will drop the subscript index indicating the specific RV, i.e. write $p_{X_t}(\cdot) = p(\cdot)$, $X_t = X$ and $x_t = x$, when the process is stationary, and also when stationarity is irrelevant (e.g. when talking only about a single RV).
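For a stationary process, the estimation by relative frequencies described above amounts to a simple count over the whole observed series. The following minimal Python sketch illustrates this (the function name `estimate_pmf` is our own; real analyses would add bias correction or density estimation as discussed below):

```python
from collections import Counter

def estimate_pmf(observations):
    """Estimate p(a_j) for a stationary process by relative frequencies."""
    counts = Counter(observations)
    n = len(observations)
    return {outcome: c / n for outcome, c in counts.items()}

# A toy stationary binary series with p(0) = 0.75, p(1) = 0.25 by construction.
x = [0, 0, 0, 1] * 250
p = estimate_pmf(x)
print(p)  # -> {0: 0.75, 1: 0.25}
```

For nonstationary processes, the same counting would instead run over an ensemble of replications at a fixed time point $t$, or over the repetition times in the set $M$.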

Basic Information Theory
Based on the above definitions we now define the necessary basic information theoretic quantities. To put a focus on the often neglected local information theoretic quantities that will become important later on, we will start with the Shannon information content of a realization of a RV.
To this end, we assume a (potentially nonstationary) random process X consisting of X 1 , X 2 , . . . , X N .
The law of total probability states that

$$\sum_{x_1, \ldots, x_N} p(x_1, x_2, \ldots, x_N) = 1,$$

and the product rule yields

$$p(x_1, x_2, \ldots, x_N) = p(x_1)\, p(x_2, \ldots, x_N \mid x_1).$$

All realizations of the process starting with a specific $x_1$ thus together have probability mass $p(x_1)$, and occupy a fraction of $p(x_1)/1$ of the original probability space. Obtaining $x_1$ can therefore be interpreted as informing us that the full realization lies in this fraction of the space. Thus, the reduction in uncertainty, or the information gained from $x_1$, must be a function of $1/p(x_1)$. To ensure that subsequent realizations from independent RVs yield additive amounts of information, we take the logarithm of this ratio to obtain the Shannon information content (Shannon and Weaver, 1948) (also see MacKay (2003)), which measures the information provided by a single realization $x_i$ of a RV $X_i$:

$$h(x_i) = \log \frac{1}{p(x_i)}.$$

Typically, we take $\log_2$, giving units in bits.
The average information content of a RV $X_i$ is called the entropy $H$:

$$H(X_i) = \sum_{x_i} p(x_i) \log \frac{1}{p(x_i)}.$$
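The information content and the entropy defined above translate directly into a few lines of Python (a plug-in sketch with our own function names `info_content` and `entropy`; it assumes the probabilities are already known or well estimated):

```python
import math

def info_content(p_x):
    """Shannon information content h(x) = log2(1/p(x)), in bits."""
    return math.log2(1.0 / p_x)

def entropy(pmf):
    """Entropy H(X) = sum_x p(x) log2(1/p(x)), in bits."""
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

pmf = {'heads': 0.5, 'tails': 0.5}  # a fair coin
print(info_content(pmf['heads']))    # -> 1.0 (each outcome carries one bit)
print(entropy(pmf))                  # -> 1.0 (so the average is also one bit)
```

Note how the entropy is simply the expectation of the information content over the distribution, in line with the definitions above.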
The information content of a specific realization $x$ of $X$, given we already know the outcome $y$ of another variable $Y$ which is not necessarily independent of $X$, is called the conditional information content:

$$h(x|y) = \log \frac{1}{p(x|y)}.$$

Averaging this over all possible outcomes of $X$, given their probabilities $p(x|y)$ after the outcome $y$ was observed, and then averaging over all possible outcomes $y$, which occur with probability $p(y)$, yields the conditional entropy:

$$H(X|Y) = \sum_y p(y) \sum_x p(x|y) \log \frac{1}{p(x|y)}.$$

The conditional entropy $H(X|Y)$ can be described from various perspectives: $H(X|Y)$ is the average amount of information that we get from making an observation of $X$ after having already made an observation of $Y$. In terms of uncertainties, $H(X|Y)$ is the average remaining uncertainty in $X$ once $Y$ was observed. We can also say that $H(X|Y)$ is the information in $X$ that cannot be directly obtained from $Y$.
The conditional entropy can be used to derive the amount of information directly shared between two variables $X, Y$. This is because the mutual information of two variables $X, Y$, written $I(X : Y)$, is the total average information in one variable ($H(X)$) minus the average information in this variable that cannot be obtained from the other variable ($H(X|Y)$). Hence the mutual information (MI) is defined as:

$$I(X : Y) = H(X) - H(X|Y). \tag{13}$$

Similarly to the conditional entropy, we can also define a conditional mutual information between two variables $X, Y$, given that the value of a third variable $Z$ is known:

$$I(X : Y | Z) = H(X|Z) - H(X|Y, Z).$$

The above measures of mutual information are averages. Although average values are used more often than their localized counterparts, it is perfectly valid to inspect local values for MI (like the information content $h$, above). This 'localizability' was in fact a requirement that both Shannon and Fano postulated for proper information theoretic measures (Fano, 1961; Shannon and Weaver, 1948), and there is a growing trend in neuroscience (Lizier et al., 2011a) and in the theory of distributed computation (Lizier, 2013, 2014a) to return to local values. For the above measures of mutual information, the localized forms are listed in the following. The local mutual information $i(x : y)$ is defined as:

$$i(x : y) = \log \frac{p(x|y)}{p(x)}, \tag{14}$$

while the local conditional mutual information is defined as:

$$i(x : y|z) = \log \frac{p(x|y,z)}{p(x|z)}. \tag{15}$$

When we take the expected values of these local measures, we obtain the mutual and conditional mutual information. These measures are called local because they allow one to quantify mutual and conditional mutual information between single realizations. Note, however, that the probabilities $p(\cdot)$ involved in equations 14 and 15 are global in the sense that they are representative of all possible outcomes. In other words, a valid probability distribution has to be estimated irrespective of whether we are interested in average or local information measures.
We also note that local MI and local conditional MI may be negative, unlike their averaged forms (Fano, 1961;Lizier, 2014a). This occurs for the local MI where the measurement of one variable is misinformative about the other variable, i.e. where the realization y lowers the probability p(x|y) below the initial probability p(x). This means that the observer expected x less after observing y than before, but x occurred nevertheless. Therefore, y was misinformative about x.
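The possibility of negative local MI can be illustrated with a small worked example. In the following sketch (the joint distribution is hypothetical, and `local_mi` is our own function name), matching outcomes of two correlated binary variables are informative, while mismatched outcomes are misinformative:

```python
import math

# Hypothetical joint pmf p(x, y) for two correlated binary variables.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}

def local_mi(x, y):
    """Local mutual information i(x : y) = log2( p(x|y) / p(x) ), in bits."""
    p_x_given_y = p_xy[(x, y)] / p_y[y]
    return math.log2(p_x_given_y / p_x[x])

print(local_mi(0, 0))  # positive: observing y = 0 raises p(x = 0)
print(local_mi(0, 1))  # negative: observing y = 1 lowered p(x = 0), yet x = 0 occurred

# The average over the joint distribution recovers the (non-negative) MI.
avg_mi = sum(p_xy[(x, y)] * local_mi(x, y) for (x, y) in p_xy)
print(avg_mi)
```

The average over all realizations is non-negative even though individual local values are not, exactly as stated above.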
2.1.4 Estimating information theoretic quantities from data

Before we advance to specific information theoretic analyses of neural data, it must be stressed that the estimation of information theoretic measures from finite data is a difficult task. The naive estimation of probabilities by empirically observed frequencies, followed by plugging these probabilities into the above definitions, almost inevitably leads to serious bias problems (Treves and Panzeri, 1995; Panzeri et al., 2007a; Victor, 2005). This situation can be improved to some degree by using binless density estimators (Victor, 2005; Kozachenko and Leonenko, 1987; Kraskov et al., 2004). However, usually statistical testing against surrogate data or empirical control data will be necessary to judge whether a non-zero value of a measure indicates an effect or just the bias (see e.g. Lindner et al. (2011)).
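Both the bias problem and the surrogate-testing remedy can be demonstrated in a few lines. In this sketch (our own illustration; `plugin_mi` implements the naive plug-in estimator), two variables that are independent by construction nevertheless receive a non-zero MI estimate, and shuffling one variable generates a surrogate distribution against which the observed value can be compared:

```python
import math
import random
from collections import Counter

def plugin_mi(xs, ys):
    """Naive plug-in estimate of I(X : Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

random.seed(0)
n = 200
xs = [random.randint(0, 3) for _ in range(n)]
ys = [random.randint(0, 3) for _ in range(n)]  # independent of xs

mi_obs = plugin_mi(xs, ys)
print(mi_obs)  # non-zero despite independence: this is the estimation bias

# Surrogate test: shuffling ys destroys any pairing; re-estimating many times
# yields the distribution of MI values expected under the null hypothesis.
surrogates = []
for _ in range(100):
    random.shuffle(ys)
    surrogates.append(plugin_mi(xs, ys))
# If mi_obs is typical of the surrogate distribution, there is no evidence
# for a real dependence; the non-zero value reflects bias alone.
```

With perfectly dependent data the same estimator behaves as expected (e.g. identical uniform variables over four outcomes yield 2 bits), which is why the surrogate comparison, not the raw value, carries the statistical conclusion.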

Signal Representation and State Space Reconstruction
The random processes that we analyze in the agents of a computing system usually have memory. This means that the RVs that form the process are no longer independent, but depend on variables in the past. In this setting, a proper description of the process requires looking at the present and past RVs jointly. In general, if there is any dependence between the $X_t$, we have to form the smallest collection of variables $\mathbf{X}_t = (X_t, X_{t_1}, X_{t_2}, \ldots, X_{t_i}, \ldots)$ with $t_i < t$ that jointly make $X_{t+1}$ conditionally independent of all $X_{t_k}$ with $t_k < \min(t_i)$, i.e.:

$$p(x_{t+1} \mid \mathbf{x}_t, x_{t_k}) = p(x_{t+1} \mid \mathbf{x}_t) \quad \forall t_k < \min_i(t_i). \tag{16}$$

A realization $\mathbf{x}_t$ of such a sufficient collection $\mathbf{X}_t$ of past variables is called a state of the random process $X$ at time $t$.
A sufficient collection of past variables, also called a delay embedding vector, can always be reconstructed from scalar observations for low-dimensional deterministic systems, as shown by Takens (1981). Unfortunately, most real world systems have high-dimensional dynamics rather than being low-dimensional and deterministic. For these systems it is not obvious that a delay embedding similar to Takens' approach would yield the desired results. In fact, many systems require an infinite number of past random variables when only a scalar observable of the high-dimensional stochastic process is accessible (Ragwitz and Kantz, 2002). Nevertheless, the behavior of scalar observables of most of these systems can be approximated well by a finite collection of such past variables for all practical purposes (Ragwitz and Kantz, 2002); in other words, these systems can be approximated well by a finite order, one-dimensional Markov process according to eq. 16.
Note that without proper state space reconstruction, information theoretic analyses will almost inevitably miscount information in the random process. Indeed, the importance of state space reconstruction cannot be overstated: failures to reconstruct states properly have been shown to lead to false positive findings and reversed directions of information transfer; imperfect state space reconstruction is also the cause of the failure of transfer entropy analysis demonstrated in Smirnov (2013); and it has been shown to impede the otherwise clear identification of coherent moving structures in cellular automata as information transfer entities (Lizier et al., 2008c).
In the remainder of the text we therefore assume proper state space reconstruction. The resulting state space representations are indicated by bold case letters, i.e. X t and x t refer to the state variables of X.
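The construction of delay embedding vectors itself is mechanically simple, as the following sketch shows (the function name `delay_embed` is ours; choosing the embedding dimension `dim` and delay `tau` appropriately is the hard part in practice, e.g. via the criteria of Ragwitz and Kantz (2002)):

```python
def delay_embed(x, dim, tau=1):
    """Build delay vectors x_t = (x[t], x[t - tau], ..., x[t - (dim-1)*tau]).

    Returns one state tuple per admissible time index t.
    """
    start = (dim - 1) * tau
    return [tuple(x[t - k * tau] for k in range(dim))
            for t in range(start, len(x))]

x = [0.1, 0.2, 0.3, 0.4, 0.5]
print(delay_embed(x, dim=3))
# -> [(0.3, 0.2, 0.1), (0.4, 0.3, 0.2), (0.5, 0.4, 0.3)]
```

Each tuple is one realization of the state variable $\mathbf{X}_t$; information theoretic measures are then computed on these states rather than on the scalar samples.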

WHY INFORMATION THEORY IN NEUROSCIENCE?
It is useful to organize our understanding of neural (and biologically-inspired) computing systems into three major levels, originally proposed by David Marr (Marr, 1982), and to then see at which level information theory provides insights:
• At the level of the task the neural system or the BICS is trying to solve (task level), we ask what information processing problem a neural system (or a part of it) tries to solve. Such problems could for example be the detection of edges or objects in a visual scene, or maintaining information about an object after the object is no longer in the visual scene. It is important to note that questions at the task level typically revolve around entities that have a direct meaning to us, e.g. objects or specific object properties used as stimulus categories, operationally defined states, or concepts such as attention or working memory. An example of an analysis carried out purely at this level is the investigation of whether a person behaves as an optimal Bayesian observer (see references in Knill and Pouget (2004)).
• At the algorithmic level we ask what entities or quantities of the task level are represented by the neural system and how the system operates on these representations using algorithms. For example, a neural system may represent either absolute luminance or changes of luminance of the visual input.
An algorithm operating on either of these representations may then, for example, try to identify an object in the input that is causing the luminance pattern by a brute force comparison to all luminance patterns ever seen (and stored by the neural system). Alternatively, it may try to further transform the luminance representation, via filtering etc., before inferring the object via a few targeted comparisons.
• At the (biophysical) implementation level, we ask how the representations and algorithms are implemented in neural systems. Descriptions at this level are given in terms of the relationships between various biophysical properties of the neural system or its components, e.g. membrane currents or voltages, the morphology of neurons, spike rates, chemical gradients, etc. A typical study at this level might, for example, aim at reproducing observed physical behavior of neural circuits, such as gamma-frequency (>40 Hz) oscillations in local field potentials, by modeling the biophysical details of these circuits from the ground up (Markram, 2006).
This separation of levels of understanding served to resolve important debates in neuroscience, but there is also growing awareness of a specific shortcoming of this classic view: results obtained by careful study at any of these levels do not constrain the possibilities at any other level (see the after-word by Poggio in Marr (1982)). For example, the task of winning a game of Tic-Tac-Toe (task level) can be reached by a brute force strategy (algorithmic level) that may be realized in a mechanical computer (implementation level) (Dewdney, 1989). Alternatively, the very same task can be solved by flexible rule use (algorithmic level) realized in biological brains (implementation level) of young children (Crowley and Siegler, 1993).
As we will see, missing relationships between Marr's levels can be filled in by information theory: In Section 3 we show how to link the task level and the implementation level by computing various forms of mutual information between variables at these two levels. These mutual informations can be further decomposed into the contributions of each agent in a multi-agent system, as well as information carried jointly. This will be covered in Section 4. In Section 5 we use local information measures to link neural activity at the implementation level to components of information processing at the algorithmic level, such as information storage and transfer. This will be done per agent and time step, and thereby yields a sort of information theoretic "footprint" of the algorithm in space and time. To be clear, such an analysis will only yield this "footprint", not identify the algorithm itself. Nevertheless, this footprint is a useful constraint when identifying algorithms in neural systems, because various possible algorithms to solve a problem will clearly differ with respect to this footprint. Section 6 covers current attempts to define the concept of information modification. We close with a short review of some example applications of information theoretic analyses of neural data, and describe how they relate to Marr's levels.

NEURAL CODES FOR EXTERNAL STIMULI
As introduced above, information theory can serve to bridge the gap between the task level, where we deal with properties of a stimulus or task that bear a direct meaning to us, and the implementation level, where we record physical indices of neural activity, such as action potentials. To this end we use the mutual information (eq. 13) and derivatives thereof to answer questions about neural systems like these:
1. Which (features of) neural responses (R) carry information about which (features of) stimuli (S)?
2. How much does an observer of a specific neural response r, i.e. a receiving brain area, change its beliefs about the identity of a stimulus s, from the initial belief p(s) to the posterior belief p(s|r) after receiving the neural response r?
3. Which specific neural response r is particularly informative about an unknown stimulus s from a certain set of stimuli?
4. Which stimulus s leads to responses that are informative about this very stimulus, i.e. to responses that can "transmit" the identity of the stimulus to downstream neurons?
The empirical answers to these questions bear important implications for the design of BICS. For example, the encoding of an environment in a BICS may be modeled on that of a neural system that successfully lives in the same environment. In the following paragraphs we will show how to answer the above questions 1-4 using information theory.

1. Which neural responses (R) carry information about which stimuli (S)? This question can be easily answered by computing the mutual information I(S : R) between stimulus identity and neural responses.
Despite its deceptive simplicity, computing this mutual information can be very informative about neural codes. This is because both the description of what constitutes a stimulus and what constitutes a response rely on what we consider to be their relevant features. For example, presenting pictures of fruit as the stimulus set, we could compute the mutual information between neural responses and the stimuli described as red versus green fruit, or described as apples versus pears. The resulting mutual information will differ between these two descriptions of the stimulus set, allowing us to see how the neural system partitions the stimuli. Likewise, we could extract features $F_i(r)$ of neural responses $r$, such as the time of the first spike (e.g. Johansson and Birznieks (2004)), or the relative spike times (Havenith et al., 2011; O'Keefe and Recce, 1993).
Comparing the mutual information for two features, $I(S : F_1(R))$ and $I(S : F_2(R))$, allows us to identify the feature carrying the most information. This feature is potentially the one also read out internally by other stages of the neural system. However, when investigating individual stimulus or response features, one should also keep in mind that several stimulus or response features might have to be considered jointly, as they could carry synergistic information (see Section 4, below).
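Such a feature comparison can be sketched on hypothetical trial data, where a first-spike-time feature discriminates the stimuli perfectly while a spike-count feature ignores them (the data, feature names, and the plug-in function `mi` are our own illustration, not a documented experiment):

```python
import math
from collections import Counter

def mi(pairs):
    """Plug-in estimate of I(S : F) in bits from (stimulus, feature) pairs."""
    n = len(pairs)
    psf = Counter(pairs)
    ps = Counter(s for s, _ in pairs)
    pf = Counter(f for _, f in pairs)
    return sum((c / n) * math.log2((c / n) / ((ps[s] / n) * (pf[f] / n)))
               for (s, f), c in psf.items())

# Hypothetical recordings: stimulus 'A' always yields an early first spike,
# 'B' a late one, while the spike count ignores the stimulus identity.
trials = [('A', 'early', 3), ('A', 'early', 4),
          ('B', 'late', 3), ('B', 'late', 4)] * 25
first_spike = [(s, f1) for s, f1, _ in trials]
spike_count = [(s, f2) for s, _, f2 in trials]
print(mi(first_spike))  # -> 1.0 bit: fully discriminates A vs B
print(mi(spike_count))  # -> 0.0 bits: carries no stimulus information
```

Here the first-spike time would be the candidate feature for internal read-out; with real data, the bias and surrogate-testing caveats of Section 2.1.4 apply.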
2. How much does an observer of a specific neural response r, i.e. a receiving neuron or brain area, change its beliefs about the identity of a stimulus s, from the prior belief p(s) to the posterior belief p(s|r) after receiving the neural response r? This question is natural to ask in the setting of Bayesian brain theories (Knill and Pouget, 2004). Since this question addresses a quantity associated with a specific response ($r$), we have to decompose the overall mutual information between the stimulus variable and the response variable ($I(S : R)$) into more specific information terms. As this question is about a difference in probability distributions, before and after receiving $r$, it is naturally expressed in terms of a Kullback-Leibler divergence between $p(s)$ and $p(s|r)$. The resulting measure is called the specific surprise $i_{sp}$ (DeWeese and Meister, 1999):

$$i_{sp}(S : r) = \sum_s p(s|r) \log \frac{p(s|r)}{p(s)}.$$

It can be easily verified that $I(S : R) = \sum_r p(r)\, i_{sp}(S : r)$. Hence $i_{sp}$ is a valid decomposition of the mutual information into more specific, response dependent contributions. Similarly, we have $i_{sp}(S : r) = \sum_s p(s|r)\, i(s : r)$, giving the relationship between the (fully) localized MI (eq. 14) and $i_{sp}(S : r)$ as a partially-localized MI. As a Kullback-Leibler divergence, $i_{sp}$ is always positive or zero. This simply indicates that any incoming response will either update our beliefs (leading to a positive Kullback-Leibler divergence), or not (in which case the Kullback-Leibler divergence will be zero). From this it immediately follows that $i_{sp}$ cannot be additive: if, of two subsequent responses $r_1, r_2$, the first leads us to update our beliefs about $s$ from $p(s)$ to $p(s|r_1)$, but the second leads us to revert this update, i.e.
$p(s|r_1, r_2) = p(s)$, then $i_{sp}(S : r_1, r_2) = 0 \neq i_{sp}(S : r_1) + i_{sp}(S : r_2|r_1)$. Loosely speaking, a series of surprises and belief updates does not necessarily lead to a better estimate. This fact had been largely overlooked in early applications of this measure in neuroscience, as pointed out by DeWeese and Meister (1999). Some caution is therefore necessary when interpreting results from the literature before 1999 that were obtained using this particular decomposition of the mutual information.
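The specific surprise and its decomposition property can be checked numerically on a small hypothetical stimulus-response distribution (both the numbers and the function name `i_sp` are our own illustration):

```python
import math

# Hypothetical joint distribution over stimuli s1, s2 and responses r1, r2.
p_sr = {('s1', 'r1'): 0.4, ('s1', 'r2'): 0.1,
        ('s2', 'r1'): 0.1, ('s2', 'r2'): 0.4}
p_s = {'s1': 0.5, 's2': 0.5}
p_r = {'r1': 0.5, 'r2': 0.5}

def i_sp(r):
    """Specific surprise: D_KL[ p(s|r) || p(s) ], in bits."""
    return sum((p_sr[(s, r)] / p_r[r])
               * math.log2((p_sr[(s, r)] / p_r[r]) / p_s[s])
               for s in p_s)

# Each term is non-negative, and the p(r)-weighted average recovers I(S : R).
mi = sum(p_r[r] * i_sp(r) for r in p_r)
print(i_sp('r1'), mi)
```

In this symmetric example both responses carry the same specific surprise, so their weighted average coincides with each term; in general the terms differ across responses.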
3. Which specific neural response r is particularly informative about an unknown stimulus from a certain set of stimuli? This question asks how much the knowledge of $r$ is worth in terms of an uncertainty reduction about $s$, i.e. an information gain. In contrast to the question about an update of our beliefs above, we here ask whether this update increases or reduces uncertainty about $s$. This question is naturally expressed in terms of conditional entropies, comparing our uncertainty before the response, $H(S)$, with our uncertainty after receiving the specific response $r$, $H(S|r)$. The resulting difference is called the (response-) specific information $i_r(S : r)$ (DeWeese and Meister, 1999):

$$i_r(S : r) = H(S) - H(S|r),$$

where $H(S|r) = \sum_s p(s|r) \log \frac{1}{p(s|r)}$. Again, it is easily verified that $I(S : R) = \sum_r p(r)\, i_r(S : r)$. However, here the individual contributions $i_r(S : r)$ are not necessarily positive. This is because a response $r$ can lead from a probability distribution $p(s)$ with a low entropy $H(S)$ to some $p(s|r)$ with a high entropy $H(S|r)$. Accepting such 'negative information' terms makes the measure additive for two subsequent responses:

$$i_r(S : r_1, r_2) = i_r(S : r_1) + i_r(S : r_2|r_1).$$

The negative contributions $i_r(S : r)$ can be interpreted as responses $r$ that are misinformative in the sense of an increase in uncertainty about the average outcome of $S$ (compare the misinformation on the fully local scale indicated by negative $i(x : y)$; see Section 2.1.3).
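A negative response-specific information is easy to construct: a response that turns a sharply peaked prior into a flat posterior increases the uncertainty about the stimulus. The following sketch uses hypothetical numbers of our own choosing:

```python
import math

def H(pmf):
    """Entropy in bits of a discrete distribution."""
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

# Prior: stimulus 'a' is strongly expected (hypothetical numbers).
p_s = {'a': 0.9, 'b': 0.05, 'c': 0.05}
# A response r that leaves all stimuli equally likely *increases* uncertainty.
p_s_given_r = {'a': 1/3, 'b': 1/3, 'c': 1/3}

i_r = H(p_s) - H(p_s_given_r)  # response-specific information i_r(S : r)
print(i_r)  # negative: r is misinformative about the average outcome of S
```

Note the contrast with the specific surprise of question 2, which would assign this same response a positive value, since it does update the observer's beliefs.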
4. Which stimulus s leads to responses r that are informative about the stimulus itself? In other words, which stimulus is reliably associated with responses that are relatively unique to this stimulus, so that we learn about the occurrence of this specific stimulus from the response unambiguously? Here we ask about stimuli that are encoded well by the system, in the sense that they lead to responses that are informative to a downstream observer. In this type of question, a response is considered informative if it strongly reduces the uncertainty about the stimulus, i.e. if it has a large $i_r(S : r)$. We then ask how informative the responses to a given stimulus $s$ are on average, over all responses that the stimulus elicits with probabilities $p(r|s)$:

$$i_{SSI}(s : R) = \sum_r p(r|s)\, i_r(S : r).$$

The resulting measure $i_{SSI}(s : R)$ is called the stimulus specific information (SSI) (Butts, 2003).
Again, it can be verified easily that $I(S : R) = \sum_s p(s)\, i_{SSI}(s : R)$, meaning that $i_{SSI}$ is another valid decomposition of the mutual information. Just like the response specific information terms that it is composed of, the stimulus specific information can be negative (Butts, 2003).
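The chain from response-specific information to SSI, and the decomposition of the mutual information, can be verified numerically on a small hypothetical distribution (the numbers and function names are our own illustration):

```python
import math

# Hypothetical joint distribution over stimuli and responses.
p_sr = {('s1', 'r1'): 0.4, ('s1', 'r2'): 0.1,
        ('s2', 'r1'): 0.1, ('s2', 'r2'): 0.4}
p_s = {'s1': 0.5, 's2': 0.5}
p_r = {'r1': 0.5, 'r2': 0.5}

def H_S():
    return sum(p * math.log2(1.0 / p) for p in p_s.values())

def H_S_given(r):
    return sum((p_sr[(s, r)] / p_r[r]) * math.log2(p_r[r] / p_sr[(s, r)])
               for s in p_s)

def i_r(r):
    """Response-specific information i_r(S : r) = H(S) - H(S|r)."""
    return H_S() - H_S_given(r)

def i_ssi(s):
    """Stimulus specific information: p(r|s)-weighted average of i_r(S : r)."""
    return sum((p_sr[(s, r)] / p_s[s]) * i_r(r) for r in p_r)

mi = sum(p_s[s] * i_ssi(s) for s in p_s)  # recovers I(S : R)
print(i_ssi('s1'), mi)
```

Averaging the SSI over stimuli and averaging the response-specific information over responses both recover the same mutual information, as stated in the text.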
The stimulus specific information has been used to investigate which stimuli are encoded well in neurons with a specific tuning curve; it was demonstrated that the specific stimuli that were encoded best changed with the noise level of the responses (Butts and Goldman, 2006) (Figure 1). Results of this kind may for example be important to consider in the design of BICS that will be confronted with varying levels of noise in their environments.

IMPORTANCE OF THE STIMULUS SET AND RESPONSE FEATURES
It may not be immediately visible in the above equations, but central quantities of the above treatment, such as $H(S)$ and $H(S|r)$, depend strongly on the choice of the stimulus set $A_S$. For example, if one chooses to study the human visual system with a set of "visual" stimuli in the far infrared end of the spectrum, $I(S : R)$ will most likely be very small and the analysis futile (although, done properly, a zero value of $i_{SSI}(s : R)$ for all stimuli will correctly point out that the human visual system does not care or code for any of the infrared stimuli). Hence, characterizing a neural code properly hinges to a large extent on an appropriate choice of stimuli. In this respect, it is safe to assume that a move from artificial stimuli (such as gratings in visual neuroscience) to more natural ones will alter our view of neural codes in the future. A similar argument holds for the response features that are selected for analysis. If any feature is dropped, or not measured at all, this may distort the information measures above. This may even happen if the dropped feature, say the exact spike time variable $R_{ST}$, seems to carry no mutual information with the stimulus variable when considered alone, i.e. $I(S : R_{ST}) = 0$. This is because there may still be synergistic information that can only be recovered by looking at other response variables jointly with $R_{ST}$. For example, it would be possible in principle that neither spike time $R_{ST}$ nor spike rate $R_{SR}$ carries mutual information with the stimulus variable when considered individually, i.e. $I(S : R_{ST}) = I(S : R_{SR}) = 0$. Still, when considered jointly they may be informative: $I(S : R_{ST}, R_{SR}) > 0$. The problem of omitted response features is almost inevitable in neuroscience, as the full sampling of all parts of a neural system is typically impossible, and we have to work with sub-sampled data.
Considering only a subset of (response) variables may dramatically alter the apparent dependency structure in the neural system (see Priesemann et al. (2009) for an example). Therefore, the effects of subsampling should always be kept in mind when interpreting results of studies on neural coding.
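The scenario of a feature that is uninformative alone but informative jointly with another can be made concrete with a toy distribution. The XOR coding below is a hypothetical construction for illustration only (the variable names R_ST, R_SR follow the text; the `mutual_information` helper is ours, a plug-in estimate from samples):

```python
# Toy illustration (hypothetical): spike timing R_ST and spike rate R_SR are
# individually uninformative about S, but jointly determine it (XOR coding).
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of I(A : B) in bits from a list of (a, b) samples."""
    n = len(pairs)
    c_ab = Counter(pairs)
    c_a = Counter(a for a, _ in pairs)
    c_b = Counter(b for _, b in pairs)
    return sum(c / n * log2(c * n / (c_a[a] * c_b[b])) for (a, b), c in c_ab.items())

# All four (R_ST, R_SR) patterns equally likely; the stimulus is their XOR.
samples = [(r_st ^ r_sr, (r_st, r_sr)) for r_st in (0, 1) for r_sr in (0, 1)]

print(mutual_information([(s, r[0]) for s, r in samples]))  # I(S : R_ST) = 0.0
print(mutual_information([(s, r[1]) for s, r in samples]))  # I(S : R_SR) = 0.0
print(mutual_information(samples))                          # I(S : R_ST, R_SR) = 1.0
```

Dropping either response feature from the analysis here would make the stimulus look entirely unencoded, which is exactly the distortion described above.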

INFORMATION IN ENSEMBLE CODING -- PARTIAL INFORMATION DECOMPOSITION
In neural systems information is often encoded by ensembles of agents, as evidenced by the success of various 'brain reading' and decoding techniques applied to multivariate neural data (e.g. Kriegeskorte et al. (2008)). Knowing how the information in the ensemble is distributed over the agents can inform the designer of BICS about strategies to distribute the relevant information about a problem over the available agents. These strategies determine properties like the coding capacity of the system as well as its reliability. For example, a reliable strategy would represent the same information in multiple agents, making their information redundant. In contrast, maximizing capacity would require exploiting the full combinatorial possibilities of the agents' states, making their coding synergistic.
Here, we investigate the most basic ensemble of just two agents to introduce the concepts of redundant, synergistic and unique information (Williams and Beer, 2010; Bertschinger et al., 2014; Harder et al., 2013; Griffith and Koch, 2014; Lizier et al., 2013), and note that encoding in larger ensembles is still a field of active research. More specifically, we consider an ensemble of two neurons and their responses, {R_1, R_2}, after stimulation with stimuli s ∈ A_S = {s_1, s_2, . . .}, and try to answer the following questions:

1. What information does R_i provide about S? This is the mutual information I(S : R_i) between the responses of one neuron i and the stimulus set.

2. What information does the joint variable R = {R_1, R_2} provide about S? This is the mutual information I(S : R_1, R_2) between the joint responses of the two neurons and the stimulus set.

3. What information does the joint variable R = {R_1, R_2} have about S that we cannot get from observing both variables R_1, R_2 separately? This information is called the synergy, or complementary information, of {R_1, R_2} with respect to S: CI(S : R_1; R_2).

4. What information does one of the variables, say R_1, hold individually about S that we cannot obtain from any other variable (R_2 in our case)? This information is the unique information of R_1 about S: UI(S : R_1 \ R_2).

5. What information does one of the variables, again say R_1, have about S that we could also obtain by looking at the other variable alone? This information is the redundant, or shared, information of R_1 and R_2 about S: SI(S : R_1; R_2).
Interestingly, only questions 1 and 2 can be answered using standard tools of information theory such as the mutual information. In fact, answering questions 3 to 5, i.e. quantifying unique, redundant and synergistic information, requires new mathematical concepts, as will be shown below.
Before we present more details, we would like to illustrate the above questions by a thought experiment in which three visual neurons are recorded simultaneously while being stimulated with one of a set of four stimuli (Figure 2). Two of the neurons have almost identical receptive fields (RF_1, RF_2), while the third one has a collinear but spatially displaced receptive field (RF_3) (Figure 2 (A)). These neurons are stimulated with one of the following stimuli (Figure 2 (B)): s_1 does not contain anything at the receptive fields of the three neurons, and the neurons stay inactive; s_2 is a small bar with the preferred orientation of neurons 1, 2; s_3 is a similar small bar, but over the receptive field of neuron 3, instead of 1, 2; s_4 is a long bar covering all receptive fields in the example. To make things easy, let us encode the responses that we get from these three neurons (colored traces in Figure 2 (B)) in binary form, with a "1" simply indicating that there was a response in our response window (gray boxes behind the activity traces in Figure 2). If we assume the stimuli to be presented with equal probability (p(S = s_i) = 1/4, i = 1, . . . , 4), then the entropy of the stimulus set is H(S) = 2 bit. Obviously, none of the information terms above can be larger than these 2 bits. We also see that each neuron shows activity (binary response = 1) in half of the cases, yielding an entropy H(R_j) = 1 for the responses of each neuron. The responses of the three neurons fully specify the stimulus, and therefore I(S : R_1, R_2, R_3) = H(S) = 2 bit.

To see the mutual information between an individual neuron's response and the stimulus we may compute I(S : R_i). To do this, we remember that H(S) = 2 and use that the number of equiprobable outcomes for S drops by half after observing a single neuron (e.g. after observing a response r_1 = 1 of neuron 1, two stimuli remain possible sources of this response, s_2 or s_4). This gives H(S|R_i) = 1, and I(S : R_i) = 1. Hence, each neuron provides one bit of information about the stimulus when considered individually. Already here we see something curious: although each neuron has 1 bit about the stimulus, together they have only 2, not 3 bits. We can see the reason for this 'vanishing bit' when considering responses from pairs of neurons, especially the pair {R_1, R_2}.
What is the information in joint variables formed from pairs of neurons? If we first look at neurons 1 and 2, their responses to each stimulus are identical. Each of the neurons provides one bit of information about the stimulus. Even if we look at the two of them jointly ({R_1, R_2}) we still get only one bit: I(S : R_1, R_2) = 1. This is because the information carried by their responses is redundant. To see this, consider that we cannot decide between stimuli s_1 and s_3 if we get the result (r_1 = 0, r_2 = 0), and we can also not decide between stimuli s_2 and s_4 when observing (r_1 = 1, r_2 = 1); other combinations of responses do not occur here. We see that neurons 1 and 2 have exactly the same information about the stimulus, and a measure of redundant information should yield the full 1 bit in this case (we will later see this intuitive argument again as the 'self-redundancy' axiom (Williams and Beer, 2010)).
To understand the concept of synergy, we next consider the output of responses {R_1, R_3} from example neurons 1, 3. We will transform these responses by a network that implements the mathematical XOR function, such that a downstream neuron N at the output end of this XOR-network gets activated only if there is one small bar on the screen (i.e. one of our neurons R_1 or R_3 gets activated, but not both), but neither for no stimulus nor for the long bar (Figure 2 (C)). We will now investigate the mutual information between {R_1, R_3}, R_1, R_3 and N. In this case the individual mutual informations of each neuron R_1, R_3 with the downstream neuron N are zero (I(N : R_i) = 0). However, the mutual information between these two neurons considered jointly and the downstream neuron N is still 1 bit, because the response of N is fully determined by its two inputs: I(N : R_1, R_3) = 1. Thus, there is only synergistic information between R_1 and R_3 about N.
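The numbers in this thought experiment can be verified with a short plug-in calculation. This is a sketch; the binary response table encodes the four stimuli of the example, and the `mutual_information` helper is our own, not part of the original treatment:

```python
# Plug-in check of the thought experiment: responses (R1, R2, R3) per stimulus
# follow the example in the text; N = R1 XOR R3 is the downstream neuron.
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of I(A : B) in bits from a list of (a, b) samples."""
    n = len(pairs)
    c_ab = Counter(pairs)
    c_a = Counter(a for a, _ in pairs)
    c_b = Counter(b for _, b in pairs)
    return sum(c / n * log2(c * n / (c_a[a] * c_b[b])) for (a, b), c in c_ab.items())

# stimulus -> binary responses of the three neurons (stimuli equiprobable)
responses = {'s1': (0, 0, 0), 's2': (1, 1, 0), 's3': (0, 0, 1), 's4': (1, 1, 1)}

print(mutual_information([(s, r[0]) for s, r in responses.items()]))          # I(S : R1) = 1.0
print(mutual_information([(s, (r[0], r[1])) for s, r in responses.items()]))  # I(S : R1, R2) = 1.0 (redundancy)
print(mutual_information([(s, r) for s, r in responses.items()]))             # I(S : R1, R2, R3) = 2.0

xor = [(r[0] ^ r[2], (r[0], r[2])) for r in responses.values()]
print(mutual_information([(n, r[0]) for n, r in xor]))  # I(N : R1) = 0.0
print(mutual_information(xor))                          # I(N : R1, R3) = 1.0 (synergy)
```

The redundancy of neurons 1 and 2 and the purely synergistic coding of N thus drop out of the same few lines of counting.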
We now introduce the mathematical framework of partial information decomposition that formalizes the intuition in the above examples, and consider a decomposition of the mutual information between a set of two right-hand side, or input, variables R_1, R_2 and a left-hand side, or output, variable S, i.e. I(S : R_1, R_2). In general, for a decomposition of this mutual information into unique, redundant and synergistic information to make sense, the total information from any one variable, say I(S : R_1) from R_1, should be decomposable into the unique information term UI(S : R_1 \ R_2) and the redundant, or shared, information SI(S : R_1; R_2) that both variables have about S:

I(S : R_1) = UI(S : R_1 \ R_2) + SI(S : R_1; R_2).

Similarly, the total information I(S : R_1, R_2) from both variables should be decomposable into the two unique information terms UI(S : R_1 \ R_2) and UI(S : R_2 \ R_1) of each R_i about S, the redundant, or shared, information SI(S : R_1; R_2) that both variables have about S, and the synergistic, or complementary, information CI(S : R_1; R_2) that can only be obtained by considering {R_1, R_2} jointly:

I(S : R_1, R_2) = UI(S : R_1 \ R_2) + UI(S : R_2 \ R_1) + SI(S : R_1; R_2) + CI(S : R_1; R_2).
(The logic behind the estimation of these quantities, illustrated in Figure 3, is as follows. The aim is to quantify UI, SI, and CI. UI(S : R_1 \ R_2) is unknown in classical information theory but, by the assumption motivated from game theory below, constant on ∆_P. With the aim to ultimately estimate UI(S : R_1 \ R_2), one considers the conditional mutual informations I_Q(S : R_1 | R_2) on ∆_P; these are known from information theory but depend on the choice of Q, as does CI_Q(S : R_1; R_2). Hence, if one finds a Q such that CI_Q(S : R_1; R_2) = 0, then UI(S : R_1 \ R_2) can be calculated directly from I_Q(S : R_1 | R_2); otherwise one chooses a Q that minimizes CI_Q and thereby obtains an upper bound for UI. Last, CI(S : R_1; R_2) is calculated from the known quantities I(S : R_1, R_2), UI(S : R_1 \ R_2), UI(S : R_2 \ R_1) and SI(S : R_1; R_2).)

Measures of redundant information are commonly required to satisfy the following axioms (Williams and Beer, 2010):

1. Symmetry: The redundant information that variables R_1, R_2, . . ., R_n have about S is invariant under reordering of the R_i.

2. Self-redundancy: The redundant information that R_1 shares with itself about S is just the mutual information I(R_1 : S).

3. Monotonicity: The redundant information that variables R_1, R_2, . . ., R_n have about S is smaller than or equal to the redundant information that variables R_1, R_2, . . ., R_n−1 have about S. Equality holds if R_n−1 is a function of R_n.
These three axioms also lead to global positivity, i.e. SI(· : ·) ≥ 0, CI(· : ·) ≥ 0 and UI(· : ·) ≥ 0 (Williams and Beer, 2010). As said above, these axioms are uncontroversial, although some authors restrict them to only two input variables R_1, R_2, as detailed below (Harder et al., 2013). These axioms, however, are not sufficient to uniquely define a measure of either redundant, unique or synergistic information. Therefore, various additional axioms, or assumptions, have been proposed (Williams and Beer, 2010; Harder et al., 2013; Bertschinger et al., 2014; Lizier et al., 2013; Griffith and Koch, 2014; Timme et al., 2014) that are not all compatible with each other (Bertschinger et al., 2012). Here we discuss, as an example, the recent choice of an assumption by Bertschinger et al. (2014) to define a measure of unique information, which is in fact equivalent to another formulation proposed by Griffith and Koch (2014). The reasons for selecting this particular assumption are that at the time of writing it comes with the richest set of derived theorems, and that it has an appealing link to game theory and utility functions, and thus to measures of success of an agent or a BICS. We note at the outset that this is one of the measures that are defined only for two "input" variables R_1, R_2 and one "output" S (although the R_i themselves may be multivariate RVs). For more details on this restriction see Rauh et al. (2014).
The basic idea of the definition by Bertschinger and colleagues comes from game theory and states that someone (say Alice) who has access to one input variable R_1 with unique information about an output variable S must be able to prove that her variable has information not available in the other. To prove this, Alice can design a bet on the output variable (by choosing a suitable utility function) such that someone else (say Bob) who has only access to the other input variable R_2 will on average lose this bet. Via some intermediate steps, this leads to the (defining) assumption that the unique information only depends on the two marginal probability distributions P(s, r_1) and P(s, r_2), but not on the full joint distribution P(s, r_1, r_2). In other words, the unique information UI should not change on the space ∆_P of probability distributions Q that share these marginals with P:

∆_P = {Q ∈ ∆ : Q(S = s, R_1 = r_1) = P(S = s, R_1 = r_1) and Q(S = s, R_2 = r_2) = P(S = s, R_2 = r_2) for all s ∈ A_S, r_1 ∈ A_R_1, r_2 ∈ A_R_2},

where ∆ is the space of all probability distributions on the support of S, R_1, R_2. This motivates the following definition of a measure ŨI of unique information:

ŨI(S : R_1 \ R_2) = min_{Q ∈ ∆_P} I_Q(S : R_1 | R_2),

where I_Q(S : R_1 | R_2) is a conditional mutual information computed with respect to the joint distribution Q(s, r_1, r_2) instead of P(s, r_1, r_2). Note that this conditional mutual information I_Q(S : R_1 | R_2) does change on ∆_P, and that only its minimum is a measure of the (constant) unique information (see Figure 3). As stated above, knowing one of the three parts UI, SI, CI is enough to compute the others. Therefore, the matching definitions of measures for redundant (S̃I) and synergistic (C̃I) information are:

S̃I(S : R_1; R_2) = max_{Q ∈ ∆_P} CoI_Q(S; R_1; R_2),

C̃I(S : R_1; R_2) = I(S : R_1, R_2) − min_{Q ∈ ∆_P} I_Q(S : R_1, R_2),

where CoI_Q(S; R_1; R_2) = I(S : R_1) − I_Q(S : R_1 | R_2) is the so-called co-information (equivalent to the redundancy minus the synergy) for the distribution Q(s, r_1, r_2).
Among the notable properties of the measures defined this way are the facts that they can be found by convex optimization, and that all three measures above have been explicitly shown to be non-negative.
Moreover, the above measures are bounds for any definitions of synergy CI, redundancy SI and unique information UI that satisfy equations 22 and 23. That is, it can be shown that

ŨI(S : R_1 \ R_2) ≥ UI(S : R_1 \ R_2), S̃I(S : R_1; R_2) ≤ SI(S : R_1; R_2), C̃I(S : R_1; R_2) ≤ CI(S : R_1; R_2)

holds.
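For small discrete systems the optimization over ∆_P can be carried out by brute force. The sketch below evaluates the unique, shared and complementary information for the binary XOR example from the text. The parametrization of ∆_P by q_s = Q(R_1 = 1, R_2 = 1 | s) is our own construction, and a coarse grid search stands in for the proper convex optimization:

```python
# Brute-force sketch of the optimization over Delta_P for the XOR example
# (S = R1 XOR R2 with R1, R2 uniform); grid search replaces convex optimization.
from collections import defaultdict
from math import log2

def joint_from_q(q0, q1):
    """Q(s, r1, r2) sharing the XOR example's (S,R1)- and (S,R2)-marginals."""
    Q = {}
    for s, q in ((0, q0), (1, q1)):
        Q[(s, 0, 0)] = 0.5 * q
        Q[(s, 1, 1)] = 0.5 * q
        Q[(s, 0, 1)] = 0.5 * (0.5 - q)
        Q[(s, 1, 0)] = 0.5 * (0.5 - q)
    return Q

def marginal(Q, keep):
    m = defaultdict(float)
    for k, p in Q.items():
        m[tuple(k[i] for i in keep)] += p
    return m

def mi(Q, a, b):
    """I_Q between variable groups a and b (indices 0: S, 1: R1, 2: R2)."""
    m_ab, m_a, m_b = marginal(Q, a + b), marginal(Q, a), marginal(Q, b)
    na = len(a)
    return sum(p * log2(p / (m_a[k[:na]] * m_b[k[na:]]))
               for k, p in m_ab.items() if p > 0)

def cond_mi(Q):
    """I_Q(S : R1 | R2)."""
    m_sz, m_rz, m_z = marginal(Q, (0, 2)), marginal(Q, (1, 2)), marginal(Q, (2,))
    return sum(p * log2(p * m_z[(r2,)] / (m_sz[(s, r2)] * m_rz[(r1, r2)]))
               for (s, r1, r2), p in Q.items() if p > 0)

P = joint_from_q(0.5, 0.0)                              # the actual XOR distribution
grid = [(a / 100, b / 100) for a in range(51) for b in range(51)]
UI = min(cond_mi(joint_from_q(a, b)) for a, b in grid)  # unique info of R1
SI = mi(P, (0,), (1,)) - UI                             # I(S:R1) is fixed by the marginals
CI = mi(P, (0,), (1, 2)) - min(mi(joint_from_q(a, b), (0,), (1, 2)) for a, b in grid)
print(round(UI, 3), round(SI, 3), round(CI, 3))         # XOR: 0.0 0.0 1.0
```

For XOR the minimum is attained at the uniform distribution (q_0 = q_1 = 1/4), which lies in ∆_P, so the full bit of mutual information is classified as synergy, with no unique or shared contributions.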
The field of information decomposition has seen a rapid development since the initial study of Williams and Beer, however, some major questions remain unresolved so far. Most importantly, the definitions above have acceptable properties, but apply only for the case of decomposing mutual information into contributions of two (sets of) input variables. The structure of such a decomposition for more than two inputs is an active area of research at the moment.

ANALYZING DISTRIBUTED COMPUTATION IN NEURAL SYSTEMS
Analyzing neural coding and goal functions in a domain-independent way. The analysis of neural coding strategies presented above relies on our a priori knowledge of the set of task-level (e.g. stimulus) features that is encoded in neural responses at the implementation level. If we have this knowledge, information theory will help us to link the two levels. This is somewhat similar to the situation in cryptography where we consider a code 'cracked' if we obtain a human-readable plain text message, i.e. we move from the implementation level (encrypted message) to the task level (meaning). However, what happens if the plain text were in a language that one has never heard of? In this case, we would potentially crack the code without ever realizing it, as the plain text still has no meaning for us.
The situation in neuroscience bears resemblance to this example in at least two respects: First, most neurons do not have direct access to any properties of the outside world; rather, they receive nothing but input spike trains. All they ever learn and process must come from the structure of these input spike trains. Second, if we as researchers probe the system beyond early sensory or motor areas, we have little knowledge of what is actually encoded by the neurons deeper inside the system. As a result, proper stimulus sets become hard to choose. In this case, the gap between the task and the implementation level may actually become too wide for meaningful analyses, as noticed recently by Carandini (2012).
Instead of relying on descriptions of the outside world (and thereby involve the task level), we may take the point of view that information processing in a neuron is nothing but the transformation of input spike trains to output spike trains. We may then try to use information theory to link the implementation and algorithmic level, by retrieving a 'footprint' of the information processing carried out by a neural circuit.
This approach only builds on a very general agreement that neural systems perform at least some kind of information processing. This information processing can be decomposed into the component processes of (1) information storage, (2) information transfer, and (3) information modification. A decomposition of this kind had already been formulated by Turing (see Langton (1990)), and was recently formalized by Lizier et al. (2014) (see also Lizier et al. (2013)):

• Information storage quantifies the information contained in the past state variable Y_{t−1} of a process that is used by the process at the next RV at time t, Y_t (Lizier et al., 2012b). This relatively abstract definition means that we will see at least a part of the past information again in the future of the process, but potentially transformed. Hence, information storage can be naturally quantified by a mutual information between the past and the future of a process.
• Information transfer quantifies the information contained in the past state variables X_{t−u} of one source process X that can be used to predict information in the future variable Y_t of a target process Y, in the context of the past state variables Y_{t−1} of the target process (Schreiber, 2000; Paluš, 2001; Vicente et al., 2011).
• Information modification quantifies the combination of information from various source processes into a new form that is not (trivially) predictable from any subset of these source processes.
Based on Turing's general decomposition of information processing (Langton, 1990), Lizier and colleagues recently proposed an information theoretic framework to quantify distributed computations in terms of all three component processes locally, i.e. for each part of the system (e.g. neurons or brain areas) and each time step (Lizier et al., 2008c, 2012b). This framework is called local information dynamics and has been successfully applied to unravel computation in swarms (Wang et al., 2011), in Boolean networks (Lizier et al., 2011b), and in neural models (Boedecker et al., 2012) and data (Wibral et al., 2014a) (also see Section 7 for details on these example applications).
In the following we present both global and local measures of information transfer, storage and modification, beginning with the well established measures of information transfer and ending with the highly dynamic field of information modification.

INFORMATION TRANSFER
The analysis of information transfer was formalized initially by Schreiber (2000) and Paluš (2001), and has seen a rapid surge of interest in neuroscience and general physiology.

Definition
Information transfer from a process X (the source) to another process Y (the target) is measured by the transfer entropy (TE) functional (Schreiber, 2000):

TE(X → Y, t, u) = I(Y_t : X_{t−u} | Y_{t−1}),

where I(· : · | ·) is the conditional mutual information, Y_t is the RV of process Y at time t, and X_{t−u}, Y_{t−1} are the past state-RVs of processes X and Y, respectively. The delay variable u in X_{t−u} indicates that the past state of the source is to be taken u time steps into the past to account for a potential physical interaction delay between the processes. This parameter need not be chosen ad hoc, as it was recently proven for bivariate systems that the above functional is maximized if the parameter u is equal to the true delay δ of the information transfer from X to Y. This relationship allows one to estimate the true interaction delay δ from data by simply scanning the assumed delay u:

δ = arg max_u TE(X → Y, t, u).

The TE functional can be linked to Wiener-Granger type causality (Wiener, 1956; Granger, 1969; Barnett et al., 2009). More precisely, for systems with jointly Gaussian variables, transfer entropy is equivalent (up to a constant factor of 2) to linear Granger causality (see Barnett et al. (2009) and references therein). However, whether the assumption of jointly Gaussian variables is appropriate in a neural setting must be checked carefully for each case (note that Gaussianity of each marginal distribution is not sufficient); in fact, EEG source signals were found to be non-Gaussian (Wibral et al., 2008). We also note that TE has recently been given a thermodynamic interpretation by Prokopenko and Lizier (2014).

Transfer Entropy Estimation
When the probability distributions entering eq. 28 are known (e.g. in an analytically tractable neural model), TE can be computed directly. However, in most cases the probability distributions have to be derived from data. When probabilities are estimated naively from the data via counting, and when these estimates are then used to compute information theoretic quantities such as the transfer entropy, we speak of a "plug-in" estimator. Indeed, such plug-in estimators have been used in the past, but they come with serious bias problems (Panzeri et al., 2007b). Therefore, newer approaches to TE estimation rely on a more direct estimation of the entropies that TE can be decomposed into (Kraskov et al., 2004; Gomez-Herrero et al., 2010; Vicente et al., 2011; Wibral et al., 2014b). These estimators still suffer from bias problems, but to a lesser degree (Kraskov et al., 2004). We therefore restrict our presentation to these approaches.
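Although the remainder of this section follows the nearest-neighbor approaches, the plug-in idea itself is quick to sketch for discrete data. The coupled process below, where Y copies X's previous value with probability 0.9, is an invented example; the bias mentioned above shows up as a small nonzero estimate in the uncoupled direction:

```python
# Plug-in TE sketch for binary data: estimate probabilities by counting,
# then evaluate TE(X -> Y) = I(Y_t : X_{t-1} | Y_{t-1}).
import random
from collections import Counter
from math import log2

def transfer_entropy(x, y):
    """Plug-in estimate of I(Y_t : X_{t-1} | Y_{t-1}) in bits."""
    triples = list(zip(y[1:], x[:-1], y[:-1]))         # (y_t, x_{t-1}, y_{t-1})
    n = len(triples)
    c_syz = Counter(triples)
    c_yz = Counter((yt, yp) for yt, _, yp in triples)  # (y_t, y_{t-1})
    c_xz = Counter((xp, yp) for _, xp, yp in triples)  # (x_{t-1}, y_{t-1})
    c_z = Counter(yp for _, _, yp in triples)          # y_{t-1}
    return sum(c / n * log2(c * c_z[yp] / (c_yz[(yt, yp)] * c_xz[(xp, yp)]))
               for (yt, xp, yp), c in c_syz.items())

random.seed(0)
x = [random.randint(0, 1) for _ in range(10000)]
y = [0] + [xi if random.random() < 0.9 else 1 - xi for xi in x[:-1]]  # noisy copy

print(transfer_entropy(x, y))  # close to 1 - H(0.9), i.e. about 0.53 bit
print(transfer_entropy(y, x))  # close to 0; the small excess is the plug-in bias
```

For longer histories or continuous-valued signals the number of probability cells explodes, which is precisely why the bias-corrected nearest-neighbor estimators described next are preferred.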
Before we can proceed to estimate TE we will have to reconstruct the states of the processes (see Section 2.1.5). One approach to state reconstruction is time delay embedding (Takens, 1981). It uses past variables X_{t−nτ}, n = 1, 2, . . . that are spaced in time by an interval τ. The number of these variables and their optimal spacing can be determined using established criteria (Faes et al., 2012; Small and Tse, 2004; Ragwitz and Kantz, 2002). The realizations of the state variables can be represented as vectors of the form

x_t^d = (x_t, x_{t−τ}, . . . , x_{t−(d−1)τ}),

where d is the dimension of the state vector. Using this vector notation, the transfer entropy estimator writes:

TE_SPO(X → Y, t, u) = I(Y_t : X_{t−u}^{d_x} | Y_{t−1}^{d_y}),
where the subscript SPO (for self-prediction optimal) is a reminder that the past states of the target, y_{t−1}^{d_y}, have to be constructed such that conditioning on them is optimal in the sense of taking the active information storage in the target correctly into account: if one were to condition on y_{t−w}^{d_y} with w > 1, instead of y_{t−1}^{d_y}, then the self-prediction for Y_t would not be optimal and the transfer entropy would be overestimated.
We can rewrite equation 32 using a representation in the form of four entropies H(·) (differential entropies for continuous-valued RVs), as:

TE_SPO(X → Y, t, u) = H(Y_{t−1}^{d_y}, X_{t−u}^{d_x}) − H(Y_t, Y_{t−1}^{d_y}, X_{t−u}^{d_x}) + H(Y_t, Y_{t−1}^{d_y}) − H(Y_{t−1}^{d_y}).

Entropies can be estimated efficiently by nearest-neighbor techniques. These techniques exploit the fact that the distances between neighboring data points in a given embedding space are inversely related to the local probability density: the higher the local probability density around an observed data point, the closer are the next neighbors. Since nearest-neighbor estimators are data efficient (Kozachenko and Leonenko, 1987; Victor, 2005), they make it possible to estimate entropies in high-dimensional spaces from limited real data.
Unfortunately, it is problematic to estimate TE by simply applying a naive nearest-neighbor estimator for the entropy, such as the Kozachenko-Leonenko estimator (Kozachenko and Leonenko, 1987), separately to each of the terms appearing in equation 33. The reason is that the dimensionality of the state spaces involved in equation 33 differs largely across terms, creating bias problems. These are overcome by the Kraskov-Stögbauer-Grassberger (KSG) estimator, which fixes the number of neighbors k in the highest dimensional space (spanned here by y_t, y_{t−1}^{d_y}, x_{t−u}^{d_x}) and projects the resulting distances to the lower dimensional spaces as the range to look for additional neighbors there (Kraskov et al., 2004). After adapting this technique to the TE formula (Gomez-Herrero et al., 2010), the suggested estimator can be written as:

TE_SPO(X → Y, t, u) = ψ(k) + ⟨ψ(n_{y_{t−1}^{d_y}} + 1) − ψ(n_{y_t, y_{t−1}^{d_y}} + 1) − ψ(n_{y_{t−1}^{d_y}, x_{t−u}^{d_x}} + 1)⟩_t,

where ψ denotes the digamma function, the angle brackets (⟨·⟩_t) indicate averaging over time for stationary systems, or over an ensemble of replications for non-stationary ones, and k is the number of nearest neighbors used for the estimation. n_(·) refers to the number of neighbors which are within a hypercube that defines the search range around a state vector. As described above, the size of the hypercube in each of the marginal spaces is defined based on the distance to the k-th nearest neighbor in the highest dimensional space.

Interpretation of transfer entropy as a measure at the algorithmic level. TE describes computation at the algorithmic level, not at the level of a physical dynamical system. As such it is not optimal for inference about causal interactions, although it has been used for this purpose in the past. The fundamental reason for this is that information transfer relies on causal interactions, but causal interactions do not necessarily lead to nonzero information transfer (Ay and Polani, 2008; Chicharro and Ledberg, 2012). Instead, causal interactions may serve active information storage alone (see next section), or force two systems into identical synchronization, where information transfer becomes effectively zero. This might be summarized by stating that transfer entropy is limited to effects of a causal interaction from a source to a target process that are unpredictable given the past of the target process alone. In this sense, TE may be seen as quantifying the causal interactions currently in use for the communication aspect of distributed computation. Therefore, one may say that TE measures predictive, or algorithmic, information transfer.
A simple thought experiment may serve to illustrate this point: When one plays an unknown record, a chain of causal interactions serves the transfer of information about the music from the record to your brain. Causal interactions happen between the record's grooves and the needle, the magnetic transducer system behind the needle, and so on, up to the conversion of pressure modulations to neural signals in the cochlea that finally activate your cortex. In this situation, there undeniably is information transfer, as the information read out from the source, the record, at any given moment is not yet known in the target process, i.e. the neural activity in the cochlea. However, this information transfer ceases if the record has a crack, making the needle skip and repeat a certain part of the music. Obviously, no new information is transferred, which under certain mild conditions is equivalent to no information transfer at all. Interestingly, an analysis of TE between sound and cochlear activity will yield the same result: The repetitive sound leads to repetitive neural activity (at least after a while). This neural activity is thus predictable from its own past, under the condition of vanishing neural 'noise', leaving no room for a prediction improvement by the sound source signal. Hence, we obtain a TE of zero, which is the correct result from a conceptual point of view. Remarkably, at the same time the chain of causal interactions remains practically unchanged. Therefore, a causal model able to fit the data from the original situation will have no problem fitting the data from the situation with the cracked record as well. Again, this is conceptually the correct result, but this time from a causal point of view.
The difference between an analysis of information transfer in a computational sense and a causality analysis based on interventions has been demonstrated convincingly in a recent study. The same authors also demonstrated why an analysis of information transfer can yield better insight than the analysis of causal interactions if the computation in the system is to be understood. The difference between causality and information transfer is also reflected in the fact that a single causal structure can support diverse patterns of information transfer (functional multiplicity), and the same pattern of information transfer can be realized with different causal structures (structural degeneracy), as shown by Battaglia (2014b).

Local information transfer
As transfer entropy is formally just a conditional mutual information, we can obtain the corresponding local conditional mutual information (equation 15) from equation 32. This quantity is called the local transfer entropy (Lizier et al., 2008c). For realizations x_t, y_t of two processes X, Y at time t it reads:

te_SPO(X → Y, t, u) = log ( p(y_t | y_{t−1}^{d_y}, x_{t−u}^{d_x}) / p(y_t | y_{t−1}^{d_y}) ).

As said earlier in the section on basic information theory, the use of local information measures does not eliminate the need for an appropriate estimation of the probability distributions involved. Hence, for a non-stationary process these distributions will still have to be estimated via an ensemble approach for each time point for the RVs involved, e.g. via physical replications of the system, or via enforcing cyclostationarity by design of the experiment.
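Local transfer entropy can be sketched for a binary toy process (the noisy-copy coupling below is an assumption for the demo, not from the original text). Individual local values can be negative, marking misinformative samples, while their time average recovers the plug-in TE:

```python
# Local TE sketch: pointwise log-ratio values for a binary noisy-copy process.
import random
from collections import Counter
from math import log2

def local_te(x, y):
    """Local TE values log2[p(y_t|y_prev,x_prev) / p(y_t|y_prev)], plug-in."""
    triples = list(zip(y[1:], x[:-1], y[:-1]))         # (y_t, x_{t-1}, y_{t-1})
    c_syz = Counter(triples)
    c_yz = Counter((yt, yp) for yt, _, yp in triples)
    c_xz = Counter((xp, yp) for _, xp, yp in triples)
    c_z = Counter(yp for _, _, yp in triples)
    return [log2(c_syz[(yt, xp, yp)] * c_z[yp] / (c_yz[(yt, yp)] * c_xz[(xp, yp)]))
            for yt, xp, yp in triples]

random.seed(1)
x = [random.randint(0, 1) for _ in range(5000)]
y = [0] + [xi if random.random() < 0.8 else 1 - xi for xi in x[:-1]]  # noisy copy

lte = local_te(x, y)
print(sum(lte) / len(lte))  # time average = plug-in TE, close to 1 - H(0.8)
print(min(lte) < 0)         # misinformative (negative) local values occur: True
```

The negative values arise exactly at the time steps where the noisy coupling flips the copied bit, i.e. where observing the source misleads the prediction of the target.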
The analysis of local transfer entropy has been applied with great success in the study of cellular automata to confirm the conjecture that certain coherent spatio-temporal structures traveling through the network are indeed the main carriers of information transfer (Lizier et al., 2008c) (see further discussion at Section 7.4). Similarly, local transfer entropy has identified coherent propagating wave structures in flocks as information cascades (Wang et al., 2012) (see Section 7.5), and indicated impending synchronization amongst coupled oscillators (Ceguerra et al., 2011).

Common Problems and solutions
Typical problems in TE estimation encompass (1) finite sample bias, (2) the presence of non-stationarities in the data, and (3) the need for multivariate analyses. In recent years all of these problems have been addressed, at least in isolation, as summarized below:

1. Finite sample bias can be overcome by statistical testing using surrogate data, where the observed source states x_{t−u}^{d_x} are reassigned to other RVs of the process, such that the temporal order underlying the information transfer is destroyed (for an example see the procedures suggested in the TE estimation literature). This reassignment should conserve as many data features of the single process realizations as possible.
2. As already explained in the section on basic information theory above, non-stationary random processes in principle require that the necessary estimates of the probabilities in equation 28 are based on physical replications of the systems in question. Where this is impossible, the experimenter should design the experiment in such a way that the processes are repeated in time. If such cyclostationary data are available, then TE should be estimated using ensemble methods as described in Gomez-Herrero et al. (2010) and implemented in the TRENTOOL toolbox (Wollstadt et al., 2014).
3. So far, we have restricted our presentation of transfer entropy estimation to the case of just two interacting random processes X, Y, i.e. a bivariate analysis. In a setting that is more realistic for neuroscience, one deals with large networks of interacting processes X, Y, Z, . . . . In this case various complications arise if the analysis is performed in a bivariate manner. For example, a process Z could transfer information with two different delays δ_{Z→X}, δ_{Z→Y} to two other processes X, Y. In this case, a pairwise analysis of transfer entropy between X, Y will yield an apparent information transfer from the process that receives information from Z with the shorter delay to the one that receives it with the longer delay (common driver effect). A similar problem arises if information is transferred first from a process X to Y, and then from Y to Z. In this case, a bivariate analysis will also indicate information transfer from X to Z (cascade effect). Moreover, two sources may transfer information purely synergistically, i.e. the transfer entropy from each source alone to the target is zero, and only considering them jointly reveals the information transfer.
From a mathematical perspective, this problem seems to be easily solved by introducing the complete transfer entropy (Lizier et al., 2008c), which is defined in terms of a conditional transfer entropy (Lizier et al., 2008c):

TE(Y → X | Z) = I(X_t : Y^-_{t−δ} | X^-_{t−1}, Z^-),     (36)

where the state-RV Z^- is a collection of the past states of one or more processes in the network other than X, Y. We label eq. 36 a complete transfer entropy TE^c when Z^- comprises the past states of the set of all processes in the network other than X, Y. However, even for small networks of random processes, the joint state space of all the variables entering eq. 36 quickly becomes too large to sample from realistic amounts of data. It was therefore suggested to analyze the information transfer in a network iteratively, selecting information sources for a target in each iteration either based on the magnitude of apparent information transfer (Faes et al., 2012) or on its significance (Lizier and Rubinov, 2012; Stramaglia et al., 2012). In the next iteration, already selected information sources are added to the conditioning set (Z^- in equation 36), and the next search for information sources is started. The approach of Stramaglia and colleagues is particular here in that the conditional mutual information terms are computed at each level as a series expansion, following a suggestion by Bettencourt et al. (2008). This allows for an efficient computation, as the series may truncate early and the search can proceed to the next level. Importantly, these approaches also consider synergistic information transfer from more than one source variable to the target. For example, a variable transferring information purely synergistically with Z^- may be included in the next iteration, given that the other variables it transfers information with are already in the conditioning set.

^9 Again, cryptography may serve as an example here. If an encrypted message is received, there will be no discernible information transfer from encrypted message to plain text without the key. In the same way, there is no information transfer from the key alone to the plain text. It is only when encrypted message and key are combined that the relation between the combination of encrypted message and key on the one side and the plain text on the other side is revealed.
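The cascade effect and its resolution by conditioning can be illustrated with a toy sketch (our own hypothetical example, not from the studies cited above; plug-in estimators, history length 1): in a chain X → Y → Z, the bivariate transfer entropy suggests a spurious link X → Z, which vanishes once the past of the intermediate process Y is added to the conditioning set, in the spirit of eq. 36.

```python
import random
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in Shannon entropy (bits) of a list of hashable symbols."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def cond_mi(a, b, c):
    """Plug-in conditional mutual information I(A:B|C) in bits."""
    return (entropy(list(zip(a, c))) + entropy(list(zip(b, c)))
            - entropy(list(zip(a, b, c))) - entropy(c))

random.seed(0)
N = 5000
x = [random.randint(0, 1) for _ in range(N)]   # source process X
y = [0] + x[:-1]                               # Y copies X with delay 1
z = [0] + y[:-1]                               # Z copies Y with delay 1

# Align target Z_t, its own past Z_{t-1}, the candidate source X_{t-2},
# and the intermediate variable Y_{t-1}.
zt    = z[3:]
zpast = z[2:-1]
xsrc  = x[1:-2]
ymid  = y[2:-1]

# Bivariate TE(X -> Z) at delay 2 appears large (cascade effect) ...
te_bivariate = cond_mi(zt, xsrc, zpast)

# ... but conditioning additionally on Y's past removes the spurious transfer.
te_conditional = cond_mi(zt, xsrc, list(zip(zpast, ymid)))

print(te_bivariate, te_conditional)   # ~1 bit vs ~0 bits
```

The same pattern, with the conditioning set grown iteratively, underlies the greedy network-inference approaches discussed above.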

ACTIVE INFORMATION STORAGE
Before we present explicit measures of active information storage, a few comments may serve to avoid misunderstanding. Since we analyze neural activity here, measures of active information storage are concerned with information stored in this activity -- rather than in synaptic properties, for example.^10 As laid out above, storage is conceptualized here as a mutual information between past and future states of neural activity. From this it is clear that there will not be much information storage if the information contained in the future states of neural activity is low in general. If, on the other hand, these future states are rich in information but bear no relation to past states, i.e. are unpredictable, information storage will again be low. Hence, large information storage occurs for activity that is rich in information but, at the same time, predictable.
Thus, information storage gives us a way to define the predictability of a process that is independent of the prediction error: information storage quantifies how much future information of a process can be predicted from its past, whereas the prediction error measures how much information cannot be predicted.
If both are quantified via information measures, i.e. in bits, the error and the predicted information add up to the total amount of information in a random variable of the process. Importantly, these two measures may lead to quite different views about the predictability of a process. This is because the total information can vary considerably over the process, and the predictable and the unpredictable information may thus vary almost independently. This is important for the design of BICS that use predictive coding strategies.
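The claimed additivity can be checked numerically in a small sketch (our own toy example; plug-in estimates, history length 1): for a binary process, the stored (predictable) information I(X_t : X_{t−1}) and the remaining (unpredictable) conditional entropy H(X_t | X_{t−1}) add up to the total entropy H(X_t).

```python
import random
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in Shannon entropy (bits) of a list of hashable symbols."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

random.seed(1)
# A mostly-alternating binary process: predictable, but not perfectly so.
x = [0]
for _ in range(9999):
    x.append(1 - x[-1] if random.random() < 0.9 else x[-1])

past, future = x[:-1], x[1:]
h_total = entropy(future)   # total information per step

# Predicted information I(X_t : X_{t-1}) and prediction error H(X_t | X_{t-1}).
stored  = entropy(past) + entropy(future) - entropy(list(zip(past, future)))
h_error = entropy(list(zip(past, future))) - entropy(past)

# The two quantities partition the total information in the next sample.
print(stored + h_error, h_total)
```

Varying the flip probability changes how the total splits between the two terms, while their sum stays fixed by the marginal statistics -- the point made above about predictive coding strategies.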
Before turning to the explicit definition of measures of information storage, it is worth considering which temporal extent of 'past' and 'future' states we are interested in: Most globally, predictive information (Bialek et al., 2001) or excess entropy (Crutchfield and Packard, 1982; Grassberger, 1986; Crutchfield and Feldman, 2003) is the mutual information between the semi-infinite past and semi-infinite future of a process before and after time point t. In contrast, if we are interested in the information currently used for the next step of the process, the mutual information between the semi-infinite past and the next step of the process, the active information storage (Lizier et al., 2012b), is of greater interest. Both measures are defined in the next paragraphs.

^10 See the distinction made between passive storage in synaptic properties and active storage in dynamics by Zipser et al. (1993).

Predictive information / Excess entropy
Excess entropy is formally defined as:

E_X(t) = lim_{k→∞} I(X^{k−}_t : X^{k+}_t),     (37)

where X^{k−}_t = {X_t, X_{t−1}, ..., X_{t−k+1}} and X^{k+}_t = {X_{t+1}, ..., X_{t+k}} indicate collections of the past and future k variables of the process X.^11 These collections of RVs (X^{k−}_t, X^{k+}_t) in the limit k → ∞ span the semi-infinite past and future, respectively. In general, the mutual information in equation 37 has to be evaluated over multiple realizations of the process. For a stationary process, however, E_X(t) is not time-dependent, and equation 37 can be rewritten as an average over time points t and computed from a single realization of the process -- at least in principle (we have to consider that the process must run for an infinite time to allow the limit lim_{k→∞} for all t):

E_X = lim_{k→∞} ⟨ i(x^{k−}_t : x^{k+}_t) ⟩_t,     (38)

where i(· : ·) is the local mutual information from equation 14, and x^{k−}_t, x^{k+}_t are realizations of X^{k−}_t, X^{k+}_t. The limit k → ∞ can be replaced by a finite k_max if a k_max exists such that conditioning on X^{k_max−}_t renders X^{k_max+}_t conditionally independent of any X_l with l ≤ t − k_max.
Even if the process in question is non-stationary, we may look at values that are local in time, as long as the probability distributions are derived appropriately (see Section 2.1.2):

e_X(t) = lim_{k→∞} i(x^{k−}_t : x^{k+}_t).     (39)
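A minimal numeric sketch may make the block-to-block construction concrete (our own example; plug-in estimation with a finite block length k = 2 rather than the limit k → ∞): for a noise-free period-4 sequence, the past block determines the phase of the cycle, so the finite-k excess entropy approaches log2 4 = 2 bits.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in Shannon entropy (bits) of a list of hashable symbols."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def excess_entropy(seq, k):
    """Plug-in estimate of I(past k-block : future k-block), pooled over time."""
    pairs = [(tuple(seq[t - k:t]), tuple(seq[t:t + k]))
             for t in range(k, len(seq) - k + 1)]
    pasts = [p for p, _ in pairs]
    futures = [f for _, f in pairs]
    return entropy(pasts) + entropy(futures) - entropy(pairs)

seq = [0, 0, 1, 1] * 250          # period-4 process, 1000 samples
print(excess_entropy(seq, k=2))   # close to 2 bits: the blocks store the phase
```

For an i.i.d. sequence the same estimate is close to zero (up to plug-in bias), illustrating that excess entropy measures structure shared between past and future rather than raw entropy.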

Active Information Storage
From the perspective of the dynamics of information processing, we might not be interested in information that is used by a process at some time far in the future, but at the next point in time, i.e. information that is said to be 'currently in use' for the computation of the next step (the realization of the next RV) in the process (Lizier et al., 2012b). To quantify this information, a different mutual information is computed, namely the active information storage (AIS):

A_X(t) = lim_{k→∞} I(X^{k−}_{t−1} : X_t).     (40)

Again, if the process in question is stationary, then A_X(t) = const. = A_X and the expected value can be obtained from an average over time -- instead of an ensemble of realizations of the process -- as:

A_X = lim_{k→∞} ⟨ i(x^{k−}_{t−1} : x_t) ⟩_t,     (41)

which can be read as an average over local active information storage (LAIS) values a_X(t):

a_X(t) = lim_{k→∞} i(x^{k−}_{t−1} : x_t).     (42)

Even for nonstationary processes we may investigate local active storage values, given that the corresponding probability distributions are properly obtained from an ensemble of realizations of X_t, X^{k−}_{t−1}:

a_X(t) = lim_{k→∞} i(x^{k−}_{t−1} : x_t),     (43)

where the PDFs entering i(· : ·) are now time-dependent. Again, the limit k → ∞ can be replaced by a finite k_max if a k_max exists such that conditioning on X^{k_max−}_{t−1} renders X_t conditionally independent of any X_l with l ≤ t − k_max (see equation 16).

^11 In principle these could harness embedding delays, as defined in eq. 31.
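The estimators above can be sketched in a few lines (our own toy example; plug-in PDFs, history length k = 1): for a mostly periodic binary sequence, the average AIS is high, while the local values a_X(t) turn negative exactly where the sequence violates its own regularity, i.e. where the past is misinformative about the next value.

```python
from collections import Counter
from math import log2

def local_ais(seq, k=1):
    """Plug-in local active information storage a_X(t) for history length k."""
    pairs = [(tuple(seq[t - k:t]), seq[t]) for t in range(k, len(seq))]
    m = len(pairs)
    joint = Counter(pairs)
    past = Counter(p for p, _ in pairs)
    nxt = Counter(x for _, x in pairs)
    # a_X(t) = log2 [ p(past, x) / (p(past) p(x)) ]
    return [log2((joint[(p, x)] / m) / ((past[p] / m) * (nxt[x] / m)))
            for p, x in pairs]

seq = [0, 1] * 500          # perfectly alternating: highly predictable
seq[500] = 1 - seq[500]     # one 'surprising' violation of the pattern

lais = local_ais(seq, k=1)
mean_ais = sum(lais) / len(lais)
print(mean_ais)        # close to 1 bit on average
print(min(lais) < 0)   # True: the violation is misinformative locally
```

The negative local values are exactly the behaviour exploited below when interpreting negative LAIS after stimulus onset in neural recordings.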

Interpretation of information storage as a measure at the algorithmic level

As laid out above, information storage is a measure of the amount of information in a process that is predictable from its past. As such, it quantifies, for example, how well activity in one brain area A can be predicted by another area, e.g. by learning the statistics of A. Hence, questions about information storage arise naturally when asking about the generation of predictions in the brain, e.g. in predictive coding.

THE LOCAL INFORMATION DYNAMICS STATE SPACE
The two measures of local active information storage and local transfer entropy introduced in the preceding sections may be fruitfully combined by pairing storage and transfer values at each point in time and for each agent. The resulting space has been termed the "local information dynamics state space" and has been used to investigate the computational capabilities of cellular automata, by pairing a(y_{j,t}) and te(x_{i,t−1} → y_{j,t}) for each pair of sources and targets x_i, y_j at each time point (Lizier et al., 2012a).
Here, we suggest that this concept may be used to disentangle various neural processing strategies.
Specifically, we suggest to pair the sum^12 over all local active information storage in the inputs x_i of a target y_j (at the relevant delays u_i, obtained from an analysis of transfer entropy) with the sum of outgoing local information transfers from this target to further targets z_k, for each agent y_j and each time point t:

( Σ_i a(x_{i,t−u_i}) , Σ_k te(y_{j,t} → z_{k,t+u_k}) ),     (44)

where the sources x_i and second-order targets z_k are defined by the conditions te(x_{i,t−u_i} → y_{j,t}) ≠ 0 and te(y_{j,t} → z_{k,t+u_k}) ≠ 0, respectively.
The resulting point set can be used to answer the important question of whether the aggregate outgoing information transfer of an agent is high for predictable or for surprising input. The former information processing function amounts to a sort of filtering, passing on reliable (predictable) information, and would be linked to something reliable being represented in the activity. The latter information processing function is a form of prediction error encoding, where high outgoing information transfer is triggered when surprising, unpredictable information is received (also see Figure 4).
Note that for this type of analysis recordings of at least triplets of connected agents are necessary.
This may pose a considerable challenge in experimental neuroscience, but may be extremely valuable for disentangling the information processing goal functions of, for example, the various cortical layers. This type of analysis will also be valuable for understanding the information processing in evolved BICS, as in these systems the availability of data from triplets of agents poses no problem.
^12 More complex ways of combining incoming active information storage are conceivable.

INFORMATION DECOMPOSITION

Langton (1990) described information modification as an interaction between transmitted and/or stored information that results in a modification of one or the other. Attempts to define information modification more rigorously implemented this basic idea. First attempts at defining a quantitative measure of information modification resulted in a heuristic measure termed the local separable information, in which the local active information storage and the sum over all pairwise local transfer entropies into the target are taken:

s_{X_t} = a(x_t) + Σ_{g=1}^{G} te(z_{t−,g} → x_t),     (49)

with V_{X_t\X_{t−1}} = {Z_{t−,1}, ..., Z_{t−,G}} indicating the set of G past state variables of all processes Z_{t−,i} that transfer information into the target variable X_t; note that X_{t−1}, the history of the target, is explicitly not part of this set. The index t− is a reminder that only past state variables are taken into account, i.e. t− < t. As shown above, the local measures entering the sum are negative if they are mis-informative about the future of the target. Eventually the overall sum, or separable information, might also be negative, indicating that neither the pairwise information transfers nor the history could explain the information contained in the target's future. This has been interpreted as a modification of either stored or transferred information.
While this first attempt provided valuable insights in systems like elementary cellular automata, it is ultimately heuristic. A more rigorous approach is to look at a decomposition of the local information h(x_t) in the realization of a random variable, to shed more light on the question of which part of this information may be due to modification. In this view, the overall information H(X_t) in the future of the target process (or its local form, h(x_t)) can be explained by looking at all sources of information and the history of the target jointly, at least up to the genuinely stochastic part (innovation) in the target (also see equations 51, 52). In contrast, we cannot decompose this information into pairwise mutual information terms only. As described in the following, the remainder after exhausting pairwise terms is due to synergistic information between information sources, and this has motivated the suggestion to define information modification based on synergy.
To see the differences between a decomposition considering variables jointly and one using only pairwise terms, consider a series of subsets formed from the set of all variables Z_{t−,i} (defined above; ordered by i here) that can transfer information into the target, except variables from the target's own history. The bold typeface in Z_{t−,i} is a reminder that we work with a state space representation where necessary. We create a series of subsets V^g_{X_t\X_{t−1}} such that V^g_{X_t\X_{t−1}} = {Z_{t−,1}, ..., Z_{t−,g−1}}, i.e. the g-th subset only contains the first g − 1 sources. We can decompose the collective transfer entropy from all our source variables, TE(V_{X_t\X_{t−1}} → X_t), as a series of conditional mutual information terms, incrementally increasing the set that we condition on:

TE(V_{X_t\X_{t−1}} → X_t) = Σ_{g=1}^{G} I(X_t : Z_{t−,g} | X_{t−1}, V^g_{X_t\X_{t−1}}).     (50)

The total entropy of the target H(X_t) can then be written as:

H(X_t) = I(X_t : X_{t−1}) + TE(V_{X_t\X_{t−1}} → X_t) + H(W_{X_t}),     (51)

where W_{X_t} is the genuine innovation in X_t. If we rewrite the decomposition in equation 51 in its local form:

h(x_t) = i(x_t : x_{t−1}) + Σ_{g=1}^{G} i(x_t : z_{t−,g} | x_{t−1}, v^g_{X_t\X_{t−1}}) + h(w_{x_t}),     (52)

and compare to equation 49, we see that the difference between the potentially mis-informative sum s_{X_t} in equation 49 and the fully accounted-for information in h(x_t) from equation 52 lies in the conditioning of the local transfer entropies. In equation 49, the context that the source variables provide for each other is neglected, and synergies and redundancies (see Section 4) are not properly accounted for. Importantly, the results of both equations (49, 52) are identical if no information is provided either redundantly or synergistically by the sources Z_{t−,g}. This observation led Lizier and colleagues to propose a more rigorously defined measure of information modification based on the synergistic part of the information transfer from the source variables Z_{t−,g} and the target's history X_{t−1} to the target X_t.
This definition of information modification has several highly desirable properties. However, it relies on a suitable definition of synergy, which is currently only available for the case of two source variables (see Section 4). As there is currently considerable debate on how to define the part of the mutual information I(Y : {X_1, ..., X_i, ...}) that is synergistically provided by a larger set of source variables X_i, the question of how to best measure information modification may be considered open.
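The gap between the pairwise sum in equation 49 and the full decomposition can be made concrete with a classic toy case (our own sketch; plug-in estimators): a target computing the XOR of two random sources receives no pairwise transfer from either source and stores nothing, yet the collective transfer from both sources together accounts for the full bit -- the transfer is purely synergistic.

```python
import random
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in Shannon entropy (bits) of a list of hashable symbols."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def cond_mi(a, b, c):
    """Plug-in conditional mutual information I(A:B|C) in bits."""
    return (entropy(list(zip(a, c))) + entropy(list(zip(b, c)))
            - entropy(list(zip(a, b, c))) - entropy(c))

random.seed(2)
N = 5000
z1 = [random.randint(0, 1) for _ in range(N)]
z2 = [random.randint(0, 1) for _ in range(N)]
x = [0] + [a ^ b for a, b in zip(z1[:-1], z2[:-1])]  # X_t = Z1_{t-1} XOR Z2_{t-1}

xt, xpast = x[1:], x[:-1]
s1, s2 = z1[:-1], z2[:-1]

te_1 = cond_mi(xt, s1, xpast)                     # pairwise TE(Z1 -> X): ~0
te_2 = cond_mi(xt, s2, xpast)                     # pairwise TE(Z2 -> X): ~0
te_joint = cond_mi(xt, list(zip(s1, s2)), xpast)  # collective TE: ~1 bit

print(te_1, te_2, te_joint)
```

Here the separable information of equation 49 would be close to zero while h(x_t) is one bit; the missing bit is exactly the synergistic component discussed above.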

ACTIVE INFORMATION STORAGE IN NEURAL DATA
Here, we present two very recent applications of (L)AIS to neural data, along with their estimation strategies for the PDFs. In both, estimation of (L)AIS was carried out using the Java Information Dynamics Toolkit (Lizier, 2014b, 2012), and state space reconstruction was performed in TRENTOOL (for details, see Gomez et al., 2014; Wibral et al., 2014a). The first study investigated AIS in magnetoencephalographic (MEG) source signals from patients with autism spectrum disorder (ASD), and reported a reduction of AIS in the hippocampus in patients compared to healthy controls (Gomez et al., 2014) (Fig. 5). In this study, the strategy for obtaining an estimate of the PDF was to use only baseline data (between stimulus presentations) to guarantee stationarity of the data. Results from this study align well with predictive coding theories (Rao and Ballard, 1999; Friston et al., 2006) of ASD (also see Gomez et al. (2014), and references therein). The significance of this study in the current context lies in the fact that it explicitly sought to measure the information processing consequences, at the algorithmic level, of changes in neural dynamics in ASD at the implementation level. The second study (Wibral et al., 2014a) analyzed LAIS in voltage-sensitive dye (VSD) imaging data from cat visual cortex. The study found low LAIS in the baseline before the onset of a visual stimulus, negative LAIS directly after stimulus onset, and sustained increases in LAIS for the whole stimulation period, despite changing raw signal amplitude (Fig. 6). In this study all available data were pooled, both from baseline and stimulation periods, and also across all recording sites (VSD image pixels).
Pooling across time is unusual, but reasonable insofar as neurons themselves also have to deal with nonstationarities as they arise, and a measure of neurally accessible LAIS should reflect this. Pooling across all sites in this study was motivated by the argument that all neural pools seen by VSD pixels are capable of the same dynamic transitions as they were all in the same brain area. Thus, pixels were treated as physical replications for the estimation of the PDF. In sum, the evaluation strategy of this study is applicable to nonstationary data, but delivers results that strongly depend on the data included. Its future application therefore needs to be informed by precise estimates of the time scales at which neurons may sample their input statistics.

ACTIVE INFORMATION STORAGE IN A ROBOTIC SYSTEM
Recurrent neural networks (RNNs) of the reservoir computing type consist of a reservoir of nodes or artificial neurons connected in some recurrent network structure (Maass et al., 2002; Jaeger and Haas, 2004). Typically, this structure is constructed at random, with only the output neurons' connections trained to perform a given task. This approach is becoming increasingly popular for non-linear time-series modeling and robotic applications (Dasgupta et al., 2013; Boedecker et al., 2012). The use of intrinsic-plasticity-based techniques (Schrauwen et al., 2008) is known to assist the performance of such RNNs in general, although this method is still outperformed on memory capacity tasks, for example, by the implementation of certain changes to the network structure (Boedecker et al., 2009).
To address this issue, Dasgupta et al. (2013) add an on-line rule to adapt the "leak-rate" of each neuron based on the AIS of its internal state. The leak-rate is reduced where the AIS is below a certain threshold, and increased where it is above. The technique was shown to improve performance on delayed memory tasks, both for benchmark tests and in embodied wheeled and hexapod robots. Dasgupta et al. (2013) describe the effect of their technique as speeding up or slowing down the dynamics of the reservoir based on the time-scale(s) of the input signal. In terms of Marr's levels, we can also view this as an intervention at the algorithmic level, directly adjusting the level of information storage in the system in order to affect the higher-level computational goal of enhanced performance on memory capacity tasks. It is particularly interesting to note the connection in information storage features across these different levels here.
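A minimal sketch of such an AIS-gated adaptation may clarify the rule (our own simplification of the scheme described by Dasgupta et al. (2013); the threshold, step size and bounds are illustrative assumptions, not values from the paper):

```python
def adapt_leak_rate(leak, ais, threshold=0.1, step=0.01,
                    leak_min=0.05, leak_max=1.0):
    """Adapt a reservoir neuron's leak-rate based on its active information storage.

    AIS below threshold -> reduce the leak-rate (slow the neuron's dynamics,
    lengthening its effective memory); AIS above threshold -> increase it
    (speed the neuron up). Threshold and step size are illustrative only.
    """
    if ais < threshold:
        leak -= step
    else:
        leak += step
    return min(leak_max, max(leak_min, leak))

# A neuron storing little information is slowed down ...
print(adapt_leak_rate(0.5, ais=0.02))
# ... while one above threshold is sped up.
print(adapt_leak_rate(0.5, ais=0.4))
```

Applied online per neuron, such a rule adjusts each unit's time scale toward the time scale(s) of its input signal, which is the effect Dasgupta et al. (2013) describe.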

BALANCE OF INFORMATION PROCESSING CAPABILITIES NEAR CRITICALITY
It has been conjectured that the brain may operate in a self-organized critical state (Beggs and Plenz, 2003), and recent evidence demonstrates that the human brain is at least very close to criticality, albeit slightly sub-critical (Priesemann et al., 2014). This prompts the question of what advantages operating in such a critical state would deliver. From a dynamical systems perspective, one may suggest that the balance of stability (from ordered dynamics) with perturbation spreading (from chaotic dynamics) in this regime (Langton, 1990) gives rise to the scale-free correlations and emergent structures that we associate with computation in natural systems. From an information dynamics perspective, one may suggest that the critical regime represents a balance between the capabilities of information storage and information transfer in the system, with too much of either one eroding the ability of emergent structures to carry out the complementary function (Lizier et al., 2011b, 2008b; Langton, 1990).
Several studies have upheld this interpretation of maximised but balanced information processing properties near the critical regime. In a study of random Boolean networks, it was shown that TE and AIS are in an optimal balance near the critical point (Lizier et al., 2008b, 2011b). This is echoed by findings for recurrent neural networks (Boedecker et al., 2012), for the maximisation of transfer entropy in the Ising model (Barnett et al., 2013), and for the maximization of entropy in neural models and recordings (Haldeman and Beggs, 2005; Shew and Plenz, 2013). From Marr's perspective, we see here that at the algorithmic level the optimal balance of these information processing operations yields the emergent and scale-free structures associated with the critical regime at the implementation level. This reflects the ties between Marr's levels as described in Section 7.2. These theoretical findings on computational properties at the critical point are of great relevance to neuroscience, due to the aforementioned importance of criticality in this field.

LOCAL INFORMATION DYNAMICS IN CELLULAR AUTOMATA
Cellular automata (CAs) are discrete dynamical systems consisting of an array of cells that synchronously update their value as a function of a fixed number of spatially neighbouring cells, using a uniform rule (Wolfram, 2002). CAs are a classic complex system where, despite their simplicity, emergent structures arise. These include gliders, which are coherent structures moving against regular background domains. These gliders and their interactions have formed the basis of the analysis of cellular automata as canonical examples of nature-inspired distributed information processing (e.g. in a distributed "density" classification process to determine whether the initial state had a majority of "1" or "0" states) (Mitchell, 1998). In particular, (moving) gliders were conjectured to transmit information across the CA, static gliders to store information, and their collisions or interactions to process information in "computing" new macro-scale dynamics of the CA.
Local transfer entropy, active information storage and separable information were applied to CAs to produce spatiotemporal local information dynamics profiles in a series of experiments (Lizier et al., 2008c, 2012b; Lizier, 2014a, 2013). The results of these experiments confirmed the long-held conjectured correspondence between our qualitative understanding of emergent information processing in complex systems and our new ability to quantify such information processing via these measures. These insights could only be gained by using local information measures, as studying averages alone tells us nothing about the presence of these spatiotemporal structures. For our purposes, a crucial step was the extension of this analysis to a CA rule (known as φ_par) which was evolved to perform the density classification task outlined above (Lizier, 2013; Lizier et al., 2014), since we may interpret this with Marr's levels (Section 2.2). Spatiotemporal profiles of local information dynamics for a sample run of this density classification rule are shown in Figure 7, and may be reproduced using the DemoFrontiersBitsFromBiology2014.m script in the demos/octave/CellularAutomata demonstration distributed with the Java Information Dynamics Toolkit (Lizier, 2014b). In this example, the classification of the density of the initial CA state is the clear goal of the computation (task level). At the algorithmic level, our local information dynamics analysis allowed direct identification of the roles of the emergent structures arising in the CA after a short initial transient (Figure 7). For example, this analysis revealed markers that CA regions had identified local majorities of "0" or "1" (see the wholly white or black regions, or checkerboard patterns indicating uncertainty). These regions are identified as storing this information in Figure 7(b).
The analysis also quantifies the role of several glider types in communicating the presence of these local majorities and the strength of those majorities (see the slow and faster glider structures identified as information transfer in Figure 7(c) and Figure 7(d)), and the role of glider collisions resolving competing local majorities.
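The flavour of such an analysis can be sketched compactly (our own minimal example, not the JIDT demo script referenced above: plug-in estimation, short history k = 3, pooled over all cells and time steps of an elementary CA running rule 110):

```python
import random
from collections import Counter
from math import log2

def eca_step(cells, rule):
    """One synchronous update of an elementary CA with periodic boundaries."""
    n = len(cells)
    return [(rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
            for i in range(n)]

random.seed(3)
width, steps, k = 64, 300, 3
rows = [[random.randint(0, 1) for _ in range(width)]]
for _ in range(steps):
    rows.append(eca_step(rows[-1], 110))

# Pool (k-step past, next value) samples over all cells and time steps,
# skipping an initial transient before structures emerge.
samples = [(tuple(rows[t - j][i] for j in range(k, 0, -1)), rows[t][i])
           for t in range(50, steps + 1) for i in range(width)]
m = len(samples)
joint = Counter(samples)
past = Counter(p for p, _ in samples)
nxt = Counter(x for _, x in samples)

# Average active information storage over the pooled samples.
ais = sum(cnt / m * log2((cnt / m) / ((past[p] / m) * (nxt[x] / m)))
          for (p, x), cnt in joint.items())
print(ais)   # positive: the structured rule-110 dynamics store information
```

The local values underlying this average, mapped back onto (cell, time), are what produce profiles like Figure 7(b), highlighting domains as storage and gliders as transfer.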

INFORMATION CASCADES IN SWARMS AND FLOCKS
Swarming or flocking refers to the collective behaviour exhibited in movement by a group of animals (Lissaman and Shollenberger, 1970;Parrish and Edelstein-Keshet, 1999), including the emergence of patterns and structures such as cascades of perturbations travelling in a wave-like manner, splitting and reforming of groups and group avoidance of obstacles. Such behaviour is thought to provide biological advantages, e.g. protection from predators. Realistic simulation of swarm behaviour can be generated using three simple rules for individuals in the swarm, based on separation, alignment and cohesion with others (Reynolds, 1987). Wang et al. (2012) analysed the local information storage and transfer dynamics exhibited in the patterns of motion in a swarm model, based on time-series of (relative) headings and speeds of each individual. Most importantly, this analysis quantitatively revealed the coherent cascades of motion in the swarm as waves of large, coherent information transfer (as had previously been conjectured, e.g. see Couzin et al. (2006) and Bikhchandani et al. (1992)).
These "information cascades" are analogous to the gliders in CAs (above). When viewed using Marr's levels, they have a similar algorithmic role of carrying information coherently and efficiently across the swarm, while the implementation of the information here is simply in the relative heading and speed of the individuals. The goal of the computation (task level) for the swarm depends on the current environment, but may be to avoid predators, or to efficiently transport the whole group to nesting or food sites.

TRANSFER ENTROPY GUIDING SELF-ORGANISATION IN A SNAKEBOT

Lizier et al. (2008a) inverted the usual use of transfer entropy, applying it for the first time as a fitness function in the evolution of adaptive behaviour, as an example of guided self-organisation (Prokopenko, 2009, 2014). This experiment utilised a snakebot -- a snake-like robot with separately controlled modules along its body, whose individual actuation was evolved via genetic programming (GP) to maximise transfer entropy between adjacent modules. The actual motion of the snake emerged from the interaction between the modules and their environment. While the approach did not result in a particularly fast-moving snake (as had been hypothesised), it did result in coherent travelling information waves along the snake, which were revealed only by local transfer entropy.

These coherent information waves are akin to gliders in CAs and cascades in swarms (above), suggesting that such waves may emerge as a resonant mode in evolution for information flow. This may be because they are robust and optimal for coherent communication over long distances, and may be simple to construct via evolutionary steps. Again, we may use Marr's levels here to identify the goal of the computation (task level) as transferring information between the snake's modules (perhaps information about the terrain encountered). At the algorithmic level, the coherent waves carry this information efficiently along the snake's whole body, while the implementation is simply in the attempted actuation of the modules' joints and their interaction (tempered by the environment).

CONCLUSION AND OUTLOOK
Neural systems perform acts of information processing in the form of distributed (biological) computation, and many of the more complex computations and emergent information processing capabilities remain mysterious to date. Information theory can help to advance our understanding in two ways. On the one hand, neural information processing can be decomposed into its component processes of information storage, transfer and modification using information theoretic tools. This allows us to derive constraints on the possible algorithms served by the observed neural dynamics. On the other hand, the representations that these algorithms operate on can be guessed at by analyzing the mutual information between human-understandable descriptions of relevant concepts and quantities in our experiments and indices of neural activity. This helps to identify which parts of the real world neural systems care about. However, care must be taken when asking such questions about neural codes, as the question of how neurons code jointly has not been solved completely to date. Taken together, the knowledge about representations and possible algorithms describes the operational principles of neural systems at Marr's algorithmic level, and may hint at solutions to the ill-defined real-world problems that biologically inspired computing systems have to face with their constrained resources.