Statistical Physics of Pairwise Probability Models

Statistical models for describing the probability distribution over the states of biological systems are commonly used for dimensional reduction. Among these models, pairwise models are very attractive in part because they can be fit using a reasonable amount of data: knowledge of the mean values and correlations between pairs of elements in the system is sufficient. Not surprisingly, then, using pairwise models for studying neural data has been the focus of many studies in recent years. In this paper, we describe how tools from statistical physics can be employed for studying and using pairwise models. We build on our previous work on the subject and study the relation between different methods for fitting these models and evaluating their quality. In particular, using data from simulated cortical networks we study how the quality of various approximate methods for inferring the parameters in a pairwise model depends on the time bin chosen for binning the data. We also study the effect of the size of the time bin on the model quality itself, again using simulated data. We show that using finer time bins increases the quality of the pairwise model. We offer new ways of deriving the expressions reported in our previous work for assessing the quality of pairwise models.


Introduction
In biological networks the collective dynamics of thousands to millions of interacting elements, generating complicated spatiotemporal structures, is fundamental for the function.Until recently, our understanding of these structures was severely limited by technical difficulties for simultaneous measurements from a large number of elements.Recent technical developments, however, are making it more and more common that experimentalists record data from larger and larger parts of the system.A living example of this story is what is now happening in Neuroscience.Until a few years ago, neurophysiology meant recording from a handful of neurons at a time when studying early stages of sensory processing (e.g. the retina), and only single neurons when studying advanced cortical areas (e.g.inferotemporal cortex).This limit is now rapidly going away, and people are collecting data from larger and larger populations (see e.g. the contributions in [1]).However, such data will not help us understand the system per se.It will only give us large collections of numbers, and most probably we will have the same problem that we had with the system, but now with the data set: we can only look at small parts of it at a time, without really making use of its collective structures.To use such data and make sense of it we need to find systematic ways to build mathematically tractable descriptions of the data, descriptions with lower dimensionality than the data itself, but involving relevant dimensions.
In recent years, binary pairwise models have attracted a lot of attention as parametric models for studying the statistics of spike trains of neuronal populations [2,3,4,5,6,7], the statistics of natural images [8], and inferring neuronal functional connectivities [3,9].These models are the simplest parametric models that can be used to describe correlated neuronal firing.To build them one only needs the knowledge of the distribution of spike probability over pairs of neurons, and, therefore, these models can be fit using reasonable amounts of data.As is the case for any parametric model, one would like to know how useful a pairwise model is, that is, how well it can describe the statistics of spike trains and whether specific questions about the network can be studied using the fitted model and its parameters.Furthermore, one would like to have fast and reliable ways to fit the model.These issues can be naturally studied in the framework provided by statistical physics.Starting from a probability distribution over the states of a number of elements, statistical physics deals with computing quantities such as correlation functions, entropy, free energy, energy minima, etc., through exact or carefully developed approximate methods.It allows one to give quantitative answers to questions such as: what is the entropy difference between real data and the parametric model?Is there a closed-form equation relating model parameters to the correlation functions?How well does a pairwise model approximate higher order statistics of the data?
In this paper, after briefly reviewing the experimental results on pairwise models, we discuss the statistical physics approaches that are useful for studying these models and what we find by employing them.We do this with the aim of providing a coherent framework for fitting and using pairwise models as well as evaluating their quality.We describe a number of approximate methods for fitting pairwise models, their underlying assumptions, their relation to each other, and the physical intuition behind them.We also study the quality of pairwise models to model the probability distribution over the spike trains.We have studied these issues in our previous work [5,6], and, here, we expand on this work in three ways.We study how the quality of approximate inference methods described in [6] and the quality of the pairwise models depend on the bin size chosen for binning the data.We describe new ways of deriving the perturbative expansion we reported in [5] as well as of one approximation that we find to perform well (the TAP approximation).We also study how the probability of synchronous spikes according to the independent and pairwise models compare with the true probabilities.
In the first part of the paper, we describe the naive mean-field approximation and the independent-pair approximation for fitting the model parameters.When applied to binned spike trains, these approximations provide reliable estimates of the model parameters when the size of the system and/or the time bins are small.We then describe how these approximations can be corrected to provide very accurate approximations for estimating the parameter even for large populations and large time bins: the Thouless-Anderson-Palmer approximation and the Sessak-Monasson approximation.
We then go on to study the quality of the pairwise model.As briefly reviewed in sec.2, several experimental reports showed that the pairwise models can provide perfect approximations to the true probability distribution of small populations of cells.Following this experimental work, Roudi et al [5] performed a theoretical analysis of the pairwise models to understand if their success on small populations can be extended to real-sized system.To do this, they derived equations for the entropy difference between the pairwise model and the data, as well as between an independent-neuron model and the data.This was done in the regime where the average number of spikes generated in a time bin by the whole population is small compared to one, i.e. when N δ ≪ 1 where where N is the number of neurons in the populations, ν i is the firing rate of neuron i and δt is the size of the time bins.We denote this regime "the low-rate regime" (also called "the perturbative regime" [5]), however, we emphasize that it is really the quantity N δ that defines this regime, and not only the firing rate.
In [5], the authors showed that in the low-rate regime, most of the difference between the true entropy and the entropy of the independent-neuron model can be explained by the pairwise model.However, this fact happens regardless of whether the true distribution for large N can be well approximated by the pairwise model.In other words, observing a good pairwise model in the low-rate regime does not tell us if the pairwise model will be a good model for the real sized system.A crucial step in the derivation of entropy differences in [5] was to express the pairwise distribution in the so called Sarmanov-Lancaster representation [10,11].Here we show that one can derive the same expressions by approximating the partition function of a Gibbs distribution (Eq.(3) in sec.1.1).In addition to recovering the low-rate expansion of the entropies, we use the results to find the difference between the probability of synchronous spikes according to the true model, the pairwise model and the independent model.We also show that one can derive the results of [5] by extending the idea behind the independentpair approximation to triplets of neuron, i.e. the independent-triplet approximation.By taking the limit ν i δt → 0 of the triplet approximation to the entropy of a given distribution, we show that the results of [5] can be recovered in a considerably simpler way.
An important step in using binary pairwise models is making a binary representation of the spike trains.This is done by binning the spike trains into small time bins and assigning zero or one to each bin depending on whether there is a spike in it or not.In fact, it was predicted in [5] that using finer and finer time bins, improves the quality of the pairwise model.In this work we study the influence of the bin size on the quality of the approximate methods of fitting as well as the quality of the model.Finally, we discuss two possible extensions of the binary pairwise models and mention a number of important questions that they raise.We first describe the extensions to non-binary variables useful for studying the statistics of modular models of cortical networks.We then describe an extension of the pairwise model to a model with asymmetric connections which gives promising results for discovering the synaptic connectivity from neural spike trains.

The binary pairwise model
In a binary pairwise model, starting from the spikes recorded from N neurons, one first divides the spike trains into small time bins.One then builds a binary representation of the spike trains by assigining a binary spin variable s i (t) to each neuron i and each time bin t, with s i (t) = −1 if neuron i has not emitted any spikes in that time bin, and s i (t) = 1 if it has emitted one spike or more.From this binary representation, the means and correlations between the neurons are computed as where T is the total number of time bins.The binary pairwise model of the data is then built by considering the following distribution over a set of N binary variable s = (s 1 , s 2 . . ., s N ) and chosing the parameters, h i and J ij such that the means and pairwise correlations under this distribution matches those of the data defined in Eq. ( 2), that is Following statistical physics terminology, the parameters h i are usually called the external fields and J ij , the pairwise couplings or pairwise interactions.One important property of the pairwise distributions in Eq. ( 3) is that it has the maximum amount of entropy among all the distribution that have the same mean and pairwise correlations as the data, and it is thus usually called the maximum entropy pairwise model.It is also called the Ising model, in accordance with the Ising model introduced in statistical physics as a simple model of magnetic materials.Although typically written in terms of ±1 spin variables, it is sometimes useful to write the pairwise distribution of Eq. (3) in terms of boolean variables r i = (s i + 1)/2.In the rest of the paper, we sometimes use such boolean representation as some of the calculations and equations become considerably simpler in this representation.We denote the external fields and the couplings in the boolean representations by H i and J ij .
The external fields and pairwise couplings can be found using both exact and approximate methods as discussed in details in [6] and reviewed briefly in the following sections.Before reviewing these methods, we first review the experimental studies that have used the maximum entropy pairwise model to study the statistics of spike trains.

Review of experimental results
The maximum entropy pairwise model as a model for describing the statistics of neural firing patterns was introduced in [2] and [3].This work used the Ising model to study the response of retinal ganglion cells to natural movies [2], steady spatially uniform illumination and white noise [3], as well as the spontaneous activity of cultured cortical networks [2].The main goal of these studies was to find out how close the fitted pairwise model is to the true experimentally computed distribution over s.
As a measure of distance, these studies compared the entropy difference between the true distribution and the pairwise model and compared it with the entropy difference between the the true distribution and an independent model.The independent model is a distribution that has the same mean as the data but assumes the firing of each neuron is independent from the rest, i.e.
The measure of misfit can thus be defined as where the overline indicates averaging with respect to many samples of N neurons.The results of Schneidman et al showed that for populations of size N = 10 or so, ∆ was around 0.1.In order words, in terms of entropy difference, the pairwise model offered a ten fold improvement over the independent model.In the other study, Shlens et al [3] found ∆ ∼ 0.01 for N = 7.These authors also considered a slightly different model in which only the pairwise correlation between the adjacent cells was used in the fit, and correspondingly the pairwise interactions between non-adjacent cells were set to zero.The results showed that this adjacent pairwise model also performed very well, with ∆ ∼ 0.02 on average.It is important to note that in Schneidman et al the stimulus induced long range correlations between cells, while in the data studied Shlens et al the correlations extended only to nearby cells.Following these studies, Tang et al [4] reproduced these observations to a large extent in other cortical preparations and also concluded that the pairwise model can successfully approximate the multi-neuron firing patterns.In a very recent study [7], Shlens et al extended their previous analysis to population of up to 100 neurons concluding that the adjacent pairwise model performs very well even in this case, with ∆ ∼ 0.01 − 0.02.The studies of Shlens et al were done without stimulation or with white noise stimulus, situations in which neurons do not exhibit strong long range correlations.It is still unclear how the pairwise model (not necessarily adjacent) will perform for cases in which neuronal correlations exist between neurons separated by large distance, i.e., when stimulated by natural scenes.
The assumption that ∆ is a good measure of distance rests upon the assumption that we are interested in finding out how different the true and model distributions are over the whole space of possible spike patterns.This can be appreciated when we note that the definition in Eq. ( 6) is equivalent to ∆ = D KL (p true ||p pair )/D KL (p true ||p ind ), where D KL (p||q) is the Kullback-Leibler divergence between two distribution p(s) and q(s) defined as D KL (p||q) = s p(s) log(p(s)/q(s)).Using this observation, we can think of ∆ as a weighted sum of the difference between the log probability of states according to the true distribution and according to the model distribution normalized by the distance to the independent model.Of course, we may not be interested in finding how different the two distribution are over the space of all possible states, but only how differently the model and true distributions assign probabilities to a subset of important states.For instance, in a particular setting, it may be important only to build a good model for the probability of all states in which some large number of neurons M fire simultaneously (i.e., when i r i = M ), regardless of how different the two distribution are on the rest of the states.It was found in [12] that the pairwise model offers a significant improvement over the independent model in modelling the experimental probability of synchronous multi-neuron firing of up to N = 15 neurons.

Approximations for fitting binary pairwise models
The commonly used Boltzmann Learning algorithm for fitting the parameters of the Ising model is a very slow process, particularly for large N .Although effort has been made to speed up the Boltzmann learning algorithm [13], such modified Boltzmann algorithms still require many gradient descent steps and long Monte Carlo runs for each step.This fact motivates the development of fast techniques for calculating the model parameters which do not rely on gradient descent or Monte Carlo sampling.A number of such approximations have been studied in [6].In what follows we describe these approximations, and in particular the relation between the simpler approximations (Naive mean-field approximation and the independent-pair approximation) to the more advanced ones (TAP approximation and Sessak-Monasson approximation).
3.1 Naive mean-field approximation (nMF) and the independent-pair approximations (IP) The simplest method [14,15] for finding the parameters of the Ising models from the data uses mean field theory: where m i = s i data .These equations express the effective field that determines the magnetization m i as the external field plus the sum of the influences of other spins through their average values m j , weighted by the couplings J ij .Differentiating with respect to a magnetization m j gives the inverse susceptibility (i.e., inverse correlation) matrix for i = j, where C ij = s i s j data − m i m j .Thus, if one knows the means m i and correlations C ij of the data, one can use Eq. ( 8) to find the J ij and then solve Eq. ( 7) to find the h i .We call this approximation "naive" mean-field theory (abbreviated nMFT) to distinguish it from the TAP approximation described below, which is also a mean-field theory.
In the independent-pair (IP) approximation, one solves the two-spin problem for neurons i and j, ignoring the rest of the network.This yields the following expressions for the parameters in terms of the means and correlations: where h j i is the external field acting on i when it is considered in a pair with j.It has been noted in [6] that in the limit m i → −1 and m j → −1, J IP ij matches the leading order of the low-rate expansion derived in [5].Although the couplings found in the independent-pair approximation can be directly used as an approximation to the true values of J ij , relating the fields h j i found from the independent-pair approximation to those of the model is slightly tricky.The reason is that the expression we find depends on which j we took to pair with neuron i.It is natural to think that we can sum h j i over j, i. e., over all possible pairings of cell i, to find the Ising model parameter h i .In doing so, however, we should be careful.The expression in Eq. (9b) has two types of terms, those that only depend on i, i.e. the first terms in the following decomposition, and those that involve j, i.e. the second and third terms below The first terms is the field that would have been acting on i if it were not connected to any other neuron, and the rest are contributions from interactions with j.By simply summing h j i over j, we will be overcounting this term, once for each pairing.In other words, the correct independent-pair approximation for h i will be Although simple in its derivation and intuition, in the limit m i → −1 for all i, Eq. ( 11) recovers both the leading term and the first order corrections of the low-rate expansion, as shown in Appendix B.
The simple naive mean-field and independent-pair approximations have been shown to perform well in deriving the parameters of the Ising model when the population size is small.
In [6], we showed that for data binned at 10 ms, the naive mean-field and the IP approximations perform well in deriving the parameters of the Ising model when the population size is small.In Fig. 1 we extend this study and evaluate how the quality of these approximations depend on the size of the time bin, δt.In this figure, we plot the R 2 value between the couplings found from nMFT and IP approximations and the results of long Boltzmann runs as a function of the time bin chosen to bin the data.We do this for both N = 40 and N = 100.The simulations used to generate the spike trains and the Boltzmann learning procedure used in this figure are the same as those reported in [6], with two exceptions.The first one is that here we use more gradient descent steps and longer Monte Carlo runs, namely, 60000 gradient descent steps and 40000 Monte Carlo steps per gradient descent step.The second one is that here we use 10000 seconds worth of data for estimating the means and correlations.This is 2.5 times larger than what we used before.Both of these improvements were made to ensure reliable estimates of the parameters, as well as the means and correlations, particularly for fine time bins.As can be seen in Fig. 1, increasing either N or δt results in a decay in the quality of the IP approximation, as well as its low-rate limit.For the case of nMFT, a reasonable performance is observed only for δt = 2ms and N = 40.For large populations sizes and/or time bins nMFT is a bad approximation.Given the strong dependence of the quality of these simple approximations on population size and the size of the time bin, we describe below how one can extend these approximations to obtain more accurate expressions for finding the external fields and couplings of the pairwise model for large N and δt.

Extending the independent-pair approximation
Extending the independent-pair approximation is in principle straightforward.Instead of solving the problem of two isolated spins, we can solve the problem of three spins, i, j, and k as shown in Appendix C.This will lead to the independent-triplet (IT) approximation.In Fig. 1, we show the quality of the IT approximation for finding the couplings as compared to the Boltzmann solutions.For N = 40, we see that the IT approximation provides an improvement over IP for different values of δt.In Fig. 1, we also looked at the quality of the low rate limit of the IT approximation.In Appendix C we show that, in the same way that the low rate limit of the IP approximation gives us the leading order terms of the low-rate expansion in [5], the low rate limit of the IT approximation gives us the first order corrections to it.For N = 40, this is evident in the fact that the low-rate limit of IT outperforms the low-rate limit of IP for δt ≤ 15 ms.When the population size is large, however, IT and its low rate limit outperform IP for only very fine time bins.Even for δt = 4 ms, IP and IT and their low rate limits perform very bad.
One can of course build on the idea of the IP and IT approximations and consider n=4, 5, . . .spins.However, for any large value of n, this will be impractical and computationally expensive, for the following reason.As described in Appendix C for the case of the independent-triplet approximation, there are two steps in building an independent-n-spin approximation.The first one is to express the probability of each of the 2 n possible states of a set of n spins in terms of the means and correlations of these spins.This requires inverting the 2 n × 2 n matrix.The second step, which is only present for n ≥ 3, is to express all correlations functions in terms of the means and pairwise correlations.Both of these steps become exponentially hard as n grows.Nonetheless, as shown in Appendix C, even going to the triplet level can be a very useful  exercise, as it offers a new way of computing the difference between the entropy of the true model and that of the independent model (Eq.(15a) in sec.4.1) as well as the difference between the entropy of the true model and that of the pairwise model (Eq.(15b) 4.1).This derivation is considerably simpler than the original derivation of these equations based on the Sarmanov-Lancaster representation of the probability distribution described in [5] as well as the derivations in Appendix A, in which one starts by expanding the partition function of a Gibbs distribution.Furthermore, the IT approximation also yields a relation between the couplings and the means and pairwise correlations that coincides with leading term and the corrections found by low-rate expansion, as shown in Appendix C.

Extending the naive Mean-Field: TAP equations
The naive mean-field, independent-pair and independent-triplet approximations are good for fitting the model parameters when the typical number of spikes generated by the whole population in a time bin is small compared to one, i.e., when N δ ≪ 1 [5,6].For large and/or high firing rate populations, however, such approximations perform poorly for inferring the model parameters.It is possible to make simple corrections to the naive mean-field approximation such that the resulting approximation performs well even for large populations.This is the so called Thouless-Anderson-Palmer (TAP) approximation [16].The idea dates back to Onsager, who added corrections to the naive mean-field approximation, taking into account the effect of the magnetization of a spin i on itself via its influence on another spin j.Subsequently, it was shown that the resulting expression was exact for infinite-range spin-glass models [16].The TAP equations are Differentiation with respect to m j then gives (i = j) One can solve Eqs. ( 13) for the J ij and, after substituting the result in Eqns.(12), solve Eqns.(12) for the h i [15,14].
There are several ways to derive this expression for the pairwise distribution Eq. ( 3).In Appendix D, we show that these equations can also be derived from the celebrated Belief Propagation algorithm used in combinatorial optimization theory [17].When applied to spike trains from populations of up to 200 neurons, the inversion of TAP equations was shown to give remarkably accurate results [6] for fitting the pairwise model.In Fig. 2, we show scatter plots comparing the couplings found by the TAP approximation versus the Boltzmann results for N = 100 and δt = 2, 10, 32 ms.The TAP approximations does well in all cases, and this is quantified in Fig. 1.In Fig. 1, we demonstrate the power of inverting the TAP equations for inferring the couplings for various time bins for both N = 40 and N = 100.

Sessak-Monasson approximation (SM)
Most recently, Sessak and Monasson [18] developed a perturbative expansion expressing the fields and couplings of the Ising distribution as a series expansion in the pairwise correlations function C ij .Some of the terms of the expression they found for the couplings could be summed up.It was noted in [6], that one can think of the resulting expression as a combination of the naive mean-field approximation and the independent-pair approximation.The Sessak-Monasson result can be written as The reason why the last terms should be subtracted is discussed below.Let us consider two neurons connected to each other.In the independent-pair approximation we calculate the fields and couplings for this pair exactly within the assumption that they do not affect the rest of the network and vice versa.If we were to find the coupling between this pair of neurons using the naive meanfield approximation Eq. ( 8), the result would just be the last term in Eq. (14).The reason why we should subtract it is now clear: the first term in Eq. ( 14) includes a naive mean-field solution to the pair problem.We subtract this part and replace it by the exact solution of the pair problem.Fig. 1 shows that this result is very robust to changing δt, although for N = 100, we can note a small decay in R 2 with δt.The good performance of the SM approximation can be also seen in the scatter plots shown for N = 100 in Fig. 2.These observations support the SM approximation as a very powerful way of inferring the functional connections.Following the observation made in [5], in Fig. 1, we also show how a simple averaging of the best approximate methods, i.e., TAP inversion and SM can provide a very accurate approximation to the couplings across different time bin and population sizes.

Assessment of model quality
The experimental result that the binary pairwise models provide very good models for the statistics of spike trains is very intriguing.However, the message they carry about the architecture and function of the nervous system is not clear.This is largely due to the fact that, as reviewed in sec.2, the experimental studies were conducted on populations of small number of neurons (N ∼ 10) and their implications on the real sized system are not trivial.Is it the case that observing a very good pairwise model on a subsystem of a large system constrains the structure and function of the real sized network?Does it mean that there is something unique about the role of pairwise interactions in the real sized system?Answering this question depends to

Entropy difference
The extrapolation problem was first addressed in [5] by analyzing the dependence of the misfit measure ∆ (defined in Eq. ( 6)) on N .The authors considered an arbitrary true distribution and computed the KL divergence between this distribution and an independent-neuron model as well as between it and the pairwise model.This was done using a perturbative expension in N δ ≪ 1.The results was the following equations: where g ind and g pair are constant that do not depend on N or δt and are defined in Eqs.(A-11a) and (A-11a).Using these expression for the KL divergences yields Eqs. ( 15) and (16) show that for small N δ, ∆ will be very close to 0, independent of the structure of the true distribution.In other words, in this regime, a very good pairwise-model fit is a generic property and does not tell us anything new about the underlying structure of the true probability distribution.It is important to note that the perturbative expansion is always valid if δt is small enough.That is, simply by choosing a sufficiently small time bin, we can push ∆ as close to 0 as we want.In all panels, the black stars represent quantities as computed directly from the simulated data, while the red squares show the predictions of the low-rate expansion, i.e.Eqs. ( 15) and ( 16).We have used 18000 seconds of simulated data for computing the plotted quantities and have corrected for finite sampling bias as described in [6].
In [5], g pair and g ind are related to the parameters of the pairwise model up to corrections of O(N δ).In Appendix A, we present a different derivation by expanding the partition function of a true Gibbs probability distribution around the partition function of a distribution without couplings.In the following subsection, we also use the results of this derivation to compare the probability of synchronous spikes under the model and the true distributions.Furthermore, in Appendix C, we extend the idea beyond the independent-pair approximation and approximate the entropy of a given distribution as a sum over the entropies of triplets of isolated neurons.We show that this approach leads to Eqs. (15) in a substantially simpler way than those reported in [5] and Appendix A.
In Fig. 3, we show how D KL (p|| p ind ), D KL (p|| p pair ) and ∆ vary with N and δt for data generated from a simulated network.We have also plotted the predictions of the low-rate expansion Eqs. ( 15) and (16).As shown in these figures, the low-rate expansion nicely predicts the behaviour of the measurements from the simulations particularly for small N and δt.For δt = 10 ms, we have δ 10 = 0.076 and for δt = 2 ms, we have δ 2 = 0.019 (note that this gives δ 10 /δ 2 = 3.95, a ratio that would have been equal to 5 if the bins were independent).Both the results from the low-rate expansion and those found directly from the simulations show that using finer time bins decrease ∆ for fixed N .

Probability of simultaneous spikes
As discussed in sec.2, in addition to entropic measures such as KL divergence and ∆, which in a sense asks how well the model approximates the experimental probabilities of all possible spike patterns, we can restrict our quality measure to a subset of possible spike patterns.We can, for instance, ask how well the pairwise model approximates the probability of M simultaneous spikes.Similar to the case of ∆, before getting too impressed about the power of pairwise models in approximating the true distributions, we should find out what we expect in the case of an arbitrary, or random, true probability distribution.
Suppose now that we have a distribution over a set of variables of the form of Eq. (A-1).For this distribution, the probability that a set of M neurons, I = {i 1 , i 2 , . . ., i M } out of the whole population of N fire in a time bin while the rest do not is Averaging over all possible I we get where H, J and K are the means of the fields and pairwise, third-order etc coupling.In Appendix A, we show that, to leading order in N δ, the external fields and pairwise couplings of fitted models (independent or pairwise) match those of the true model (see Eqs. (A-7a) and (A-7b)).Using this, we see that To have well defined behviour in large N limit, one should have J N ∼ 1 and KN 2 ∼ 1. Eq. (19) show that, for M/N ≪ 1, both the independent and pairwise models are close to the true distribution.For M/N ∼ O(1) (of course still M/N < 1), the difference between the model and true probabilities of observing M synchronous spike increases.For all ranges of M , the difference is larger for the independent model.These predictions are consistent with the experimental results found in the retina [12].
In the above calculation, we compute the difference between the true and model values of q the mean of the log probability of M synchronous spikes.However, one can ask about the log mean probability of M synchronous spikes, i.e.
Caculating w is in theory very hard, because the second term above involves calculating the averages of exponential functions of variables.However, if the population is homogenous enough, such that the average couplings and fields from one sample of M neurons to the next does not change much, we can approximate the average of the exponential of a variable with the exponential of the average it.Doing this, will again lead to Eq. (19).The difference between the two measure w and q will most likely appear for small M where the averages of the fields and couplings depends on the sample of M neurons more strongly than when M is large.

Extensions of the binary pairwise model
In the previous sections, we described various approximate methods for fitting a pairwise model of the type of Eq. ( 3).We also studied how good a model it will be for spike trains, using analytical calculations and computer simulations.As we describe below, there are two issues with a model of the type of Eq. ( 3) that lead to new directions for extending the pairwise models studied here.The first issue is the use of binary variables as a representation of the states of the system.For fine time bins and neural spike trains, the binary representation serves its purpose very well.However, in many other systems, a binary representation will be a naive simplification.Examples of such systems are modular models the cortex in which the state of each cortical module is describe by a variable taking a number of states usually much larger than 2. In the subsection 5.1 we briefly describe a simple non-binary model useful for modelling the statistics of such systems.
The second issue is that by using Eq. ( 3) in cortical networks, one is essentially approximating the statistics of a highly non-equilbium system with asymmetric physical interactions, e.g. a balanced cortical network, by an equilibrium distribution with symmetric interactions.This manifests itself in a lack of a simple relationship between the functional connectives to real physical connections.In our simulations we observed that there was no obvious relation between the synaptic connectivity and the inferred functional connections.Second, as we showed here and in our previous work, for large populations, the model quality decays.Although one can avoid this decay by decreasing δt as N grows, eventually one will get into the regime of very fine δt, where the assumption of independent bins used to build the model does not hold any more and one should start including the state transitions in the spike patterns [5].In fact, Tang et al [4] showed that even in the cases that the pairwise distribution of Eq. ( 3) is a good model for predicting the distribution of spike patterns, it will not be a good one for predicting the transition probabilities between them.These observations encourage one to go beyond an equilibrium distribution with symmetric weights.In the second extension, described in subsection 5.2, we propose one such model, although a detailed study of the properties of such model is beyond the scope of this paper.

Extension to non-binary variables
The binary representation is probably a good one for spike trains binned into fine time bins.However, for larger time bins where there is a considerable probability of observing more than one spike in a bin, as well as for a number of other systems, the binary representation may only serve as a naive simplification and going to non-binary representations is warranted.Example of such systems include the protein chains and modular cortical models.In probabilistic models of protein chains, each site is represented by a non-binary variable that takes one of its possible q states depending on the amino acid that sits on that site.In a number of models for the operations of cortical networks, one considers a network of interconnected modules, the state of each of which is represented by a non-binary variable [19,20].Each state of one such variable correspond to e.g. one of the many memory states stored in the corresponding module.
For a set of non-binary variables σ = (σ 1 , σ 2 , . . ., σ N ), σ i = 1 . . .q, one can simply write down a maximum entropy pairwise Gibss distribution as where α and β go from 1 . . .q and index the q possible states of each variable.For q = 2, the above distribution reduces to the binary case with boolean variables, and when one forces J αβ ij = 0 for α = β one recovers the q − state Potts model.Similar to the binary case, here also one is given the experimentally observed values of u ασ i data and u ασ i u βσ j data and wants to infer the fields and couplings that consistent with them.
The approximate methods described in this paper can be adopted, with some effort, to the case of non-binary variable as well.In particular, it is easy to derive the difference between the entropy of the true distribution, the pairwise model and the independent model in the low-rate limit of the non-binary model.Assuming that u ασ i data = O(ǫ) for α = 1, the low-rate regime in the case of non-binary variables is characterised by N (q − 1)ǫ ≪ 1.In this regime, similar to the binary case, S pair − S true ∝ (N (q − 1)ǫ) 3  and S ind − S true ∝ (N (q − 1)ǫ) 2 and ∆ = N (q − 1)ǫ, and consequently the pairwise model performs very well in the low-rate regime.
As we described before, the low-rate regime is where most experimental studies on binary pairwise models were performed, and the result of the low-rate expansion of the entropies explains the reported success of binary pairwise models in those studies.On the other hand, the low-rate regime of the non-binary variable may be of little use.This is because the systems to which the non-binary representation should be applied are unlikely to fall into the low-rate regime.For instance, in the case of modular memory networks the low-rate regime would be the case in which the network spends a significantly larger time in its ground state (no memory retrieved) compared to the time it spends operating and retrieving memory.A more likely scenario is the one where all the states (memory or no-memory) have approximately similar probabilities of occurrence over a period of time.How useful pairwise models are in describing the statistics of such nonbinary systems away from the trivial regime of the low-rate expansion is not known.Studying the quality of non-pairwise models in these cases and developing efficient ways to fit such models, in particular based on extensions of the powerful approximations such TAP and SM to non-binary variables will be the focus of future work.In particular, it is important to note that writing the SM approximation for the non-binary case will be a straightforward task in light of the relation we described between the SM, nMFT and IP approximations in sec.3.4

Extension to dynamics and asymmetric interactions
The Glauber model [21] is the simplest dynamical model that has a stationary distribution equal to the Ising model distribution Eq. ( 3).It is defined by a simple stochastic dynamics in which at each timestep δt = τ 0 /N one spin is chosen randomly and updated, taking the value +1 (i.e., the neuron spikes) with probability Although the interactions J ij in the static Ising model are symmetric (any antisymmetric piece would cancel in computing the distribution (3)), a Glauber model with asymmetric J ij is perfectly possible [22,23].This kinetic Ising model is closely related to another class of recently-studied network models called generalized linear models (GLMs) [24,25,26].In a GLM, neurons receive a net input from other neurons of the linear form, and spike with a probability per unit time equal to a function f of this input.Maximum-likelihood techniques have been developed for solving the inverse problem for GLMs (finding the linear kernels that give the stimulus input and that from the other neurons in the network, given spike train data) [24,25].They have been applied successfully to analyzing spatiotemporal correlations in populations of retinal ganglion cells [26].
The Glauber model looks superficially like a GLM with instantaneous interactions and f (x) equal to a logistic sigmoid function 1/(1+exp(−2x)).But there is a difference in the dynamics: spins are not spikes.In the Glauber model, a spin/neuron retains its value (+1 or −1) until it is chosen again for updating.Since the updating is random, this persistence time is exponentially distributed with mean τ 0 .Thus a Glauber-model "spike" has a variable width, and the autocorrelation function of a free spin exhibits exponential decay with a time constant of τ 0 .In the GLM, in contrast, a spike is really a spike and the autocorrelation function (for constant input) is a delta-function at t = 0.
The time constant characterizing the kernel J ij (τ ) in a GLM has a similar effect to τ 0 in the Glauber model, but a GLM with an exponential kernel is not exactly equivalent to the Glauber model.In the GLM, the effect of a presynaptic spike is spread out and delayed in time by the kernel, but once it is felt, the postsynaptic firing rate changes immediately.In the Glauber model, the presynaptic "spike" is felt instantaneously and without delay, but the firing state of the neuron takes on the order of τ 0 to change in response.
GLMs grew out of a class of single-neuron models called LNP models.The name LNP comes from the fact that there is a linear (L) filtering of the inputs, the result of which is fed to a nonlinear (N) function that specifies an instantaneous Poisson (P) firing rate.In the earlier studies, the focus was on the sensory periphery, where the input was an externally specified "stimulus".An aim of this modeling effort was to improve on classical linear receptive field models.Thus, in the usual formulation of a GLM network, one writes the total input as a sum of two terms, one linear in the stimulus as in the LNP model and the other linear in the spike trains of the other neurons.Of course, one can trivially add a "stimulus" term in a Glauber model, so this differences is not an essential one.
Similarly, interactions with temporal kernels J ij (τ ) can be included straightforwardly in a Glauber model.Such a model is equivalent to a GLM in the limit τ 0 → 0. (One has to multiply the kernels by 1/τ while taking the limit, because the integrated strength of a "Glauber spike" is proportional to τ 0 , while that of an ordinary spike in a GLM is 1.) One can derive a learning algorithm for a Glauber model, given its history, in a standard way, by maximizing the likelihood of the history.The update rule, which is exact in the same way that Boltzmann learning is for the symmetric model, is where δs j (t) = s j (t) − s j , the average is over the times t i at which unit i is updated, and One can also get a simple and potentially useful approximate algorithm which requires no iteration by expanding the tanh in eqn.(24) to first order in the J ij (i.e., to first order around the independent-neuron model).Then at convergence (δJ ij = 0) we have which is a simple linear matrix equation that can be solved for the J ij .

Discussion
The brain synthesizes higher-level concepts from multiple inputs received, and in many fields of research scientists are interested in inferring as simple a description as possible, given potentially vast amounts of data.Such processes of learning take a set of average values, correlations, or any other pattern in the data, and from these arrive at another representation, which is useful for speed or accuracy of prediction, for compressed storage of the data, as a grounds for decision making, or in any other aspect which improves the functionality or competitiveness of the brain or the researcher.Models and algorithms for learning in these contexts have been studied, in great detail, in neuroscience, image processing, and many other fields, for quite some time; consider e.g. the introduction of Boltzmann machines more than a quarter of a century ago [27], and the now more than ten-year old monograph on learning in graphical models [28].
As one could have expected, there is a trade-off between the complexity of the model to be learned and the efficiency of the learning algorithms.Boltzmann machines are, in principle, able to learn very complex models, and provably so, but convergence is then quite slow; modern applications of those methods center on various restricted Boltzmann machine models, which can be learned faster [29].
When we consider very large systems and/or problems where convergence time of the learning is a serious concern, then we are primarily interested in those very fast processes of learning which could be nick-named "immediate understanding", or "iterative understanding".By this we mean that the outcome of learning should be read out directly by one or a series of mathematical transformations of the data.The central issue is then obviously the accuracy of the learning outcomes.In this paper we have revisited some classical and some more recent algorithms of this kind that take their inspiration from statistical physics.The two simplest algorithms considered are naive mean-field and independent pair approximation.As we have shown for neural data, both of these perform poorly except for small systems (small N ), of for systems which have little variability (small value of N δ), where δ here can be thought of as a proxy for the deviation from a uniform state).On the other hand, one generalization of each method, the TAP approximation for the naive mean-field and the recent Sessak-Monasson approximation for the independent pair approximation, perform much better.We believe that there is scope for further ideas, and note particularly the recent reported generalization of the TAP equations to learning within a Bethe-lattice approximation; this approach, so far only carried out for pairwise binary models, could be a promising avenue for learning more generally in large systems where an underlying connectivity is locally tree-like [30].We also studied the quality of pairwise models using a variety of methods.We derived mathematical expressions relating the quality of pairwise models to the size of the population and the lower order statistics of the spike trains employing a number of approximation schemes.
To conclude, we have here framed the presentation in terms of inferring representations of neural data and assessing the goodness of the model.Although our focus was on neural data analysis, it is important to note that similar problems appear in many other fields of modern biology, for instance, e.g. in network reconstruction of gene regulatory networks, in genomic assembly in metagenomics projects, and in many other problems.Given the present explosion in sequencing technologies it is conceivable that the more novel applications will soon appear outside neuroscience.

Appendices
A The partition function, entropy and moments of a Gibbs distribution in the limit N δ → 0 Suppose we have a true distribution of the following form In this Appendix, we first find the relation between the the external fields and pairwise and third order couplings.As we show below, this allows us to compute the probability of synchronous spikes.We also compute the entropy of this distribution, in the small spike probability regime to derive the expression in Eqs.(15).This task can be accomplished much more easily if we rewrite the distribution of Eq. (A-1) in terms of zero-one variables r i , instead of the the spin variables s i , i.e.
where the auxiliary inverse temperature β is introduced because it will allow us to compute the entropy, as will become clear later.The crucial step in computing the relation between the moments, parameters and entropy of a distribution is computing its partition function.To compute the partition function of Eq. (A-2a), Z, we note that where 0 indicates averaging with respect to the distribution Note that p 0 is not the independent model for p in Eq. (A-2) but only the part of this distribution that includes the fields.In fact, as we show in the following (Eq.(A-7a)) the fields of the independent model to p only match H i to O(N δ) corrections, i.e.H i = H ind i + O(N δ).When dealing with corrections to the fields and couplings, this note will be important.
Since r n i 0 = r i 0 ≡ δ i , a term with l distinct indices in the expansion of the term inside the average in Eq. (A-3) is of the order of δ l .Therefore, to O((N δ) 3 ), we have where the sums over n, n 1 , n 2 , n 3 and n 4 run from 1 to infinity.Performing these sums yields where φ ij = exp(βJ ij ) − 1 and φ ijk = exp(βK ijk ) − 1.From Eq. (A-6), we can immediately compute the relation between the means, pairwise and three-point correlation functions and the parameters of the distribution.For β = 1, we have The relations between means and pairwise correlations and the external fields and pairwise couplings in Eq. (A-7) a and b to their leading orders were reported previously in [5], using a slightly different approach.However, the corrections and three-neuron correlations were not computed there.An interesting result of this calculation is a relation between the three-neuron correlations for the pairwise distribution, i.e. when K ijk = 0, and the lower moments The fact that, to leading order in N δ, the external fields and couplings are determined by means and pairwise correlations allows us to compute the leading-order probabilities of synchronous spikes reported as we did in sec.4.2.
We can now use Eq.(A-6) to find the entropies of the distribution in Eq. (A-2a) From Eqs.(A-7), for an independent fit to the distribution Eq.(A-2a), we have Using the fact that mi = δ i + O(N δ 2 ) from Eq. (A-7a), and from Eq. (A-7b), we get Eq.(15a) with g ind defined as we can use the definition of the means and correlations and write Inverting the matrix of coefficients we can express the probabilities in terms of the means, pairwise and third order correlations.Using the result in and similar equations for the other fields and couplings, we can also express these parameters in terms of means, pairwise and third order correlations.Before using Eqs.(C-3) as approximations for the parameters of a pairwise model, we need to perform two more steps.The first step is a familiar one that we noted when dealing with the external fields in the independent-pair approximation, namely the fact that H jk i depends on j and k in addition to i and that K k ij depends on k.We can use the same logic that we used in building an independent-pair approximation to the external fields, and build an approximation to the external fields that does not depended on j and k, and an approximation to the couplings that does not depend on k.For example, for the case of the pairwise couplings we get The second step has to do with the fact that Eqs.(C-3) (as well as their transformed version after performing the first step, e.g.Eq.C-4) depend on the third order correlations in addition to the pairwise correlations and the means.Hence to derive an expression that relates model parameters to pairwise correlations and means we should first find the third order correlation in terms of them.Note that this step is not present in the independent-pair approximation.To express the third order correlations in terms of the lower order statistics we take advantage of the following equation p ijk (0, 0, 1)p ijk (0, 1, 0)p ijk (1, 0, 0)p ijk (1, 1, 1) = p ijk (0, 0, 0)p ijk (0, 1, 1)p ijk (1, 0, 1)p ijk (1, 1, 0). (C-5) Writing the probabilities in terms of the moments, this equation can be solved to find the third order correlations in terms of the means and pairwise correlations.The equations have two imaginary solutions for Cijk , which are unphysical, and one real solution, which is the correct solution to be considered.The resulting expression for Cijk in terms of the means and pairwise correlations is complicated.However, in the limit of mi , mj , mk → 0, it can be shown to have the same form as the one reported in Eq. (A-8).We noted in the text that in this limit, the IP approximations to the couplings will give the same result as the leading order term of Eq. (A-7b) for the couplings.With the independent-triplet approximation, we can go one step further, and as can be shown by doing a small amount of algebra, we can recover O(N δ) corrections to the couplings in that we calculated in Eq. (A-7b).
As mentioned in sec.3.2, one can continue the above process to build approximations based on quadruples of spins and so on.However, this soon becomes difficult in practice for the reason that solving equations of the type Eq.(C-5) to find the higher moments in terms of the means and pairwise correlations will be as difficult as the original problem of finding the external fields and couplings of the original N body problem in terms of the means and correlations.Nevertheless, this simple triplet expansion offer an alternative derivation of Eqs.(15), simpler than the derivations in [5] and Appendix A.Here we show this for Eq.(15b), as deriving Eq. (15a) will be similar but less invoved.To derive Eq. (15b) using the triplet expansion, we approximate the entropy of the whole system of N neurons as a sum of the entropies of all triplets denoted by S ijk .We then expand the resulting expression keeping terms of up to O(δ 3 ) noting that Cijk ∼ O(δ 3 ), and Cij ∼ O(δ 2 ).The result takes the form of where Q(x) = x log(x).For a pairwise model, the independent-triplet approximation to the entropy in the limit m → 0 has the same form, except that Cijk of the true model should be replaced by Cpair ijk = ( Cij Cik ) Cjk )( mi mj mk ) −1 (see Eq. (A-8)), i. e.

D Derivation of TAP equations from Belief Propagation
In this appendix we derive the TAP equations (Eq.( 12)) starting from the Belief Propagation update rules.Let us begin with the result to be established.The TAP equations are a set of nonlinear equations for the magnetizations m i , which we will write: Here h i is the external field acting on spin i, ǫJ ij are the pairwise couplings, and the notationj ∈ ∂ i means that the sum is over neurons j connected with neuron i.As we also did in the text in Eq. ( 12), the above equation is generally quoted with ǫ set to one and without the error term of cubic order in ǫ.
Starting from the pairwise distribution Eq. ( 3), we define the following distribution, with the auxiliary variable ǫ that we set to 1 in the end of our calculation and the exact marginal distribution over spin s i is defined by where the sum goes over all spins except s i .The exact magnetization of spin s i is then Belief Propagation is a family of methods for approximately computing the marginal distributions of probability distributions [31,32,17].Since the model defined by Eq. (D-2) contains only pairwise interactions, it is convenient to adopt the pairwise Markovian Random Field formalism of Yedidia, Freeman and Weiss [32].Note that the important recent contribution by Mézard and Mora uses a more general formalism, which may prove to be more convenient in the perspective of extending an "inverse BP" learning algorithm beyond pairwise models [30].
Belief Propagation applied to the model in Eq. (D-2) in the Yedidia-Weiss-Freeman formalism is built on probability distributions, called BP messages, associated with every directed link in the graph.If i → j is such a link, starting at i and ending at j, then the BP message η i→j (s j ) is a probability distribution on the variable s j associated to node where the link ends.For Ising spins one can use the parametrization where m i→j is a real number called the cavity magnetization.BP is characterized by two equations, the Belief Propagation update equations, and the Belief Propagation output equations.The BP update equations are used iteratively to find a fixed point, which is an extremum of the Bethe free energy.At the fixed point, the BP update equations form a (large) set of compatibility conditions for the η's which, for the model Eq.(D-2), read η j→i (s i ) = 1 Ω j→i s j e h j s j +ǫJ ij s i s j k∈∂ j \i η k→j (s j ). (D-6) The BP output equations determine the marginal probability distributions from the η's and read where Ω j→i and Ω i in Eqs.(D-6) and (D-7) are normalizations.To lighten the notation, we will not distinguish between the exact marginals, as in Eq. (D-3), and the approximate marginals from BP, as in Eq. (D-7).

Figure 1 :
Figure1: The quality of various approximations for different time bins and populations sizes.Here we plot the R 2 values between the couplings obtained using various approximate methods and those found from Boltzmann learning versus δt (a) N = 40 and (b) N = 100.In both panel the colour code is as follows.Black, SM; Red, TAP; Blue, SM-TAP hybrid; Green, nMFT; Cyan, IP; Magenta lowrate limit of IP; Cyan with dashed line, IT; Megenta dashed lines, low rate limit of IT.For both population sizes and all time bins, the TAP, SM and hybrid approximations perform very well.

Figure 2 :
Figure 2: Scatter plots showing the results of TAP and SM approximations versus the Boltzmann results for various time bin sizes δt and N = 100.Panels (a), (b) and (c) show the TAP results versus the Boltzmann results for data binned at 2 ms, 10 ms and 32 ms, respectively.(d), (e) and (f ) show the same but for the SM approximation.Note that the structure of the error in estimating the couplings from the TAP equations changes when the size of the time bin is increased.

Figure 3 :
Figure 3: The quality of the pairwise and independent models for different time bins and populations sizes.(a) D KL (p true ||p ind ) versus N , (b) D KL (p true ||p pair ) versus N and (c) ∆ versus N all for δt = 10 ms.(d),(e), (f ) show the same things for δt = 2 ms.In all panels, the black stars represent quantities as computed directly from the simulated data, while the red squares show the predictions of the low-rate expansion, i.e.Eqs.(15) and(16).We have used 18000 seconds of simulated data for computing the plotted quantities and have corrected for finite sampling bias as described in[6].