^{®}

^{ 1 }. MATLAB is a numerical computing environment and programming language which is used by most neurophysiosiological laboratories to store, preprocess and plot experimental data. In our view, the reason for the choice of MATLAB for the implementation of such routines is that it allows interactive and rapid development of algorithms, though at the cost of some performance overhead. Traditionally, information calculations have not been demanding in terms of memory usage or CPU time because the information calculations were restricted to relatively small neural populations as a consequence of the limited sampling bias problem. Therefore, it has been convenient to perform the analysis with the tools used to obtain, preprocess and store the data. However, over the last few years, the CPU and memory requirements of information calculations for neural data has significantly increased. This is due to a number of reasons. First, the improvement of the techniques to correct for the sampling bias problem (Panzeri et al., 2007 ) has allowed the information theoretic analysis of larger populations. Second, some of these bias corrections techniques are computationally intensive. Third, in the context of understanding whether the correlation structure of neural activity can be described by simple low order models, it has become important to compute distributions with maximum entropy in the presence of various sets of constraints (Schneidman et al., 2006 ; Shlens et al., 2006 ; Tang et al., 2008 ). These calculations are particularly demanding in terms of processor and memory resources. Fourth, while most information analysis has been applied to spike trains, in the context of the development of brain machine interfaces it has become important to evaluate the information content of other types of brain signals, such as local field potentials (LFPs) or Electroencephalograms (EEGs) which are analog in nature and must be represented at each time step (Belitski et al., 2008 ; Montemurro et al., 2008 ; Rubino et al., 2006 ; Waldert et al., 2008 ). The manipulation of these signals stretches computational requirements much more than using spikes, which due to their sparse binary nature can be represented compactly, for example by storing only the spike arrival times.

*s*selected with probability

*P*(

*s*) from a stimulus set

**S**consisting of S elements, and the consequent response (either of a single neuron or an ensemble of neurons) is recorded and quantified in a certain post-stimulus time window. The aim of information theoretic analysis is to gain insight into how the neurons represent the stimuli. In most applications this is done by examining the information content of different candidate neural codes. To carry out such an analysis, the first step is to choose the neural code. In practice this means choosing a way to quantify the neuronal response that reflects our assumption of what is most salient in it. For example, if we think that only spike counts (not the precise temporal pattern of spikes) are important, we choose a spike-count code: we define a post-stimulus response interval and count the number of spikes it contains on each repetition (trial) of a stimulus. In most cases, the neural response is quantified as a discrete, multi-dimensional array

**r**= {

*r*

_{1},…,

*r*} of dimension

_{L}*L*. For example, to quantify the spike count response of a population of

*L*cells,

*r*would be the number of spikes emitted by cell i on a given trial in the response window. Alternatively, to quantify the spike timing response of a single neuron, the response window is divided into L bins of width Δ

_{i}*t*, so that

*r*is the number of spikes fired in the

_{i}*i*-th time bin (Strong et al., 1998 ). Here Δ

*t*is the assumed time precision of the code and can be varied parametrically to characterize the temporal precision of the neural code. We denote by

**R**the set of possible values taken by the response array.

*Shannon entropy*, referred to hereafter as

*entropy*, which is a measure of the uncertainty associated with a random variable. Intuitively one can posit some desirable properties of any uncertainty measure. It should be

*continuous*; that is small changes in the underlying probabilities should result in small changes in the uncertainty. It should be

*symmetric*; that is the measure should not depend on the labelling or ordering of the variables and outcomes. The measure should take its maximum value when all outcomes are equally likely and for systems with uniform probabilities, the measure should increase with the number of outcomes. Finally, the measure should be

*additive*; that is it should be independent of how the system is grouped or divided into parts. It can be shown (Cover and Thomas, 2006 ) that any measure of uncertainty about the neural responses satisfying these properties has the form

*P*(

**r**) is the probability of observing response

**r**across all trials to all stimuli. The response entropy quantifies how neuronal responses vary with the stimulus and thus sets the capacity of the spike train to convey information. In Eqs 1 and 2 the summation over

**r**is over all possible neuronal responses. However, neurons are typically noisy; their responses to repetitions of an identical stimulus differ from trial to trial.

*H*(

**R**) reflects both variation of responses to different stimuli and variation due to trial-to-trial noise. Thus

*H*(

**R**) is not a pure measure of the stimulus information actually transmitted by the neuron. We can quantify the variability specifically due to noise, by measuring the so-called

*noise entropy*, which is the entropy conditional on stimulus presentation:

*s*is over all possible stimuli.

*P*(

**r**|

*s*) is the probability of observing a particular response

**r**given that stimulus

*s*is presented. Experimentally,

*P*(

**r**|

*s*) is determined by repeating each stimulus on many trials, while recording the neuronal responses. The probability

*P*(

*s*) is usually chosen by the experimenter. The noise entropy quantifies the irreproducibility of the neuronal responses at fixed stimulus. The noisier is a neuron, the greater is

*H*(

**R**|

**S**). The information that the neuronal response transmits about the stimulus is the difference between the response entropy and the noise entropy. This is known as the mutual information

*I*(

**S**;

**R**) between stimuli and responses (in the following abbreviated to information).

*I*(

**S**;

**R**) is the most general measure of correlation between the stimuli and the neural responses, because it automatically takes into account contributions of correlations at all orders. Third, computing information does not require specifying a stimulus–response model; it only requires computing the response probabilities in response to each stimulus condition. Therefore, the calculation of information does not require spelling out which stimulus features (e.g., contrast, orientation, etc.) are encoded. Fourth,

*I*(

**S**;

**R**) takes into account the full stimulus–response probabilities, which include all possible effects of stimulus-induced responses and noise. Thus, it does not require the signal to be modeled as a set of response functions plus noise and is applicable even to situations when such decompositions are difficult or dubious. The last three points show that information theory can, in principle, be applied to any type of neural signal, including responses such as LFPs or spikes that are clearly nonlinear and difficult to model by a set of standard functions. Fifth, it is possible to analyze and combine the information given by different measures of neural activity e.g. spike trains and LFPs. These two signals have a very different nature and signal to noise ratios. Therefore, a certain increase of the peak height of an LFP cannot be compared to a certain change in the spike train to understand how well LFPs or spikes encode stimuli. In contrast, with information theory the LFPs and spikes can be directly compared because information theory projects both signals onto a common scale that is meaningful in terms of stimulus knowledge.

^{®}. However, the Python programming language (van Rossum, 1995 ) combined with the numerical and scientific libraries NumPy and SciPy (Jones et al., 2001 ) provide a compelling alternative for scientific programming. Python is a modern, fully object-oriented programming language that is powerful, flexible and easy to learn. The NumPy library provides a multi-dimensional array object and associated vectorised operations, and SciPy enhances this with a range of scientific functions using the NumPy array object. The syntax is familiar to anyone coming from a background with MATLAB or another C derivative language and there are a comprehensive set of tools for plotting and interactive use (IPython and Matplotlib). Assignments are by reference rather than by copying, which allows finer grained control of memory usage, and there are several ways to rapidly extend the system with external code written in FORTRAN and C. The flexibility and good design of the Python language make large projects much more manageable than with MATLAB, where each function must reside in a separate file and refactoring to reduce code repetition grows increasingly difficult with project size. Python is a well developed language, with libraries available for almost any conceivable task, such as GUI development, network communication, support for different file formats, etc. It is possible to read and write MATLAB binary files, and even call MATLAB commands from within the Python environment, which allows for a smooth transition and means that time invested in an existing MATLAB code base is not wasted. Finally, the Python tool set is

*open source*

^{ 2 }, rather than a proprietary product, which has several obvious advantages for scientific work. Its free availability allows better reproducibility of the results, since all interested parties are free to run the software without an expensive license. It is also inherently future-proof, since it will always be possible to obtain and use the version for which the code was written, whereas a commercial product may be withdrawn at some point in the future.

*limited sampling bias*) in estimates of entropies and information. This bias is the difference between the expected value of the quantity considered, computed from probability distributions estimated with

*N*trials or samples, and its value computed from the true probability distribution. The bias constitutes a significant practical problem, because its magnitude is often of the order of the information values to be evaluated, and because it cannot be alleviated simply by averaging over many neurons with similar characteristics.

### Origins of the Bias

*H*(

**R**) and the noise entropy

*H*(

**R**|

**S**) are biased downwards. That is, the estimated value is less than the true value, and the estimated value increases with the number of trials used, asymptotically approaching the true value. Intuitively, this is because finite sampling means it is less likely that the full range of responses will be included and so the measured responses seem less variable than they really are. In addition, estimates of

*H*(

**R**|

**S**) are significantly more biased than those of

*H*(

**R**), since the latter depends on

*P*(

**r**) which is calculated with data gathered across all stimuli and is better sampled than the conditional distributions, which are each sampled with data from a single stimulus only. The bias in the mutual information is then the difference between the bias of

*H*(

**R**) and that of

*H*(

**R**|

**S**). This results in an upward bias in the information, since the magnitude of the bias of

*H*(

**R**|

**S**) is greater, and its sign is reversed in Eq. 3. Again, this makes sense intuitively, since the finite sampling can introduce spurious stimulus-dependent differences in the response probabilities, which make the stimuli seem more discernible and hence the neuron more informative than it really is.

### Bias Correction Methods

#### Panzeri–Treves (PT)

*asymptotic sampling regime*, when the number of trials is large enough that every possible response occurs many times, an analytical approximation for the bias (i.e. the difference between the true value and the plug-in estimate) of entropies and information can be obtained (Miller, 1955 ; Panzeri and Treves, 1996 ).

*R*

_{s}. The simplest approach is to approximate

*R*

_{s}by the count of responses that are observed at least once – this is the “naive” count. However due to finite sampling this will be an underestimate of the true value. A Bayesian procedure (Panzeri and Treves, 1996 ) can be used to obtain a more accurate value.

#### Quadratic Extrapolation (QE)

*N*, where

*N*is the number of trials (Strong et al., 1998 ; Treves and Panzeri, 1995 ). For example, for the information:

*N*/2 and

*N*/4 trials and fitting the resulting values to the polynomial expression above. This allows an estimate of the parameters

*a*and

*b*and hence

*I*

_{true}(

**S**;

**R**). To use all available data, estimates of two subsets of size

*N*/2 and four subsets of size

*N*/4 are averaged to obtain the values for the extrapolation. Together with the full length data calculation, this requires seven different evaluations of the quantity being estimated.

#### Nemenman–Shafee–Bialek (NSB)

#### Shuffled Information Estimator (I_{sh})

*H*

_{ind}(

**R**|

**S**) is the noise entropy that would be obtained if each individual component

*r*of the response array

_{i}**r**were independent of any other component

*r*(

_{j}*i*≠

*j*) at fixed stimulus; that is the entropy calculated from the distribution

*P*

_{ind}(

**r**|

*s*) = Π

*(*

_{i}P*r*|

_{i}*s*). Since this value depends only on the first order marginal values of the response, it has a small bias.

*H*

_{sh}(

**R**|

**S**) is the entropy that results when stimulus conditional response correlations are removed by “shuffling” the data. That is, for each stimulus

*s*, the individual response components

*r*are shuffled independently across trials, to obtain a new set of vector responses

_{i}**r**. Both of these values provide estimates of the entropy of the system if correlations were removed and become equal for an infinite number of trials. However, with finite trials,

*H*

_{ind}(

**R**|

**S**) shows a small bias, while

*H*

_{sh}(

**R**|

**S**) shows a much larger bias, which is of the same order of magnitude as that of

*H*(

**R**|

**S**), but typically slightly more negative. Using these properties, a so-called shuffled information estimator,

*I*

_{sh}, can be computed as

*I*

_{sh}(

**S**;

**R**) =

*I*(

**S**;

**R**) since

*H*

_{sh}(

**R**|

**S**) =

*H*

_{ind}(

**R**|

**S**). For small numbers of trials, the biases of

*H*

_{sh}(

**R**|

**S**) and

*H*(

**R**|

**S**) approximately cancel out, leaving the bias of

*I*

_{sh}(

**S**;

**R**) dominated by that of

*H*(

**R**) −

*H*

_{ind}(

**R**|

**S**) which is much smaller than that of the normal information estimate

*I*(

**S**;

**R**). Using this shuffling technique, combined with entropy bias correction methods as described above, can reduce the number of trials needed for a reliable estimate by a factor of four (Montemurro et al., 2007b ; Panzeri et al., 2007 ).

#### James–Stein Shrinkage (“Shrink”) Estimator

*p*of each response

_{r}*r*are determined by

*P*

^{ML}

_{r}is the normal maximum likelihood estimate from frequency counts and

*t*is the shrinkage target. The maximum entropy uniform distribution is suggested as a convenient target in Hausser and Strimmer (2008 ). The shrinkage intensity λ is then given by the following

_{r}#### Comparative performance of different estimators

*I*(

**S**;

**R**) with respect to the plug-in estimator, and the NSB correction is especially effective. For the James–Stein shrinkage estimator, a uniform target distribution was used, and this may account for the relatively poor performance of that method outside of the asymptotic regime. Figure 1 B shows that the bias-corrected estimation of information is much improved by using

*I*

_{sh}(

**S**;

**R**) rather than

*I*(

**S**;

**R**). The use of

*I*

_{sh}(

**S**;

**R**) makes the residual errors in the estimation of information much smaller and almost independent from the bias correction method used. Taking into account both bias correction performance and computation time, for this simulated system the best method to use is the shuffled information estimator combined with the Panzeri–Treves analytical correction. Using this, an accurate estimate of the information is possible even when the number of samples per stimulus is $\frac{{R}}{{4}}$ where

*R*is the dimension of the response space.

**Figure 1. Comparison of the performance of different bias correction methods.**The methods were applied to spike trains of eight simulated somatosensory cortical neurons (see text). The information estimates

*I*(

**S**;

**R**) and

*I*

_{sh}(

**S**;

**R**) are plotted as a function of the available number of trials per stimulus.

**(A)**Mean ± SD/2 (over 50 simulations) of

*I*(

**S**;

**R**).

**(B)**Mean ± SD/2 (over 50 simulations) of

*I*

_{sh}(

**S**;

**R**). This calculation is very similar to that in Panzeri et al. (2007 , Figure 3 ), which also used realistic simulations of cortical spike trains (the only difference was that for this figure, the simulated population did not contain any correlations). This figure was produced using the Python library for bias corrections described in Section “A Python Library for Information Theoretic Estimates”, and the code to produce it is available at http://code.google.com/p/pyentropy/ .

### A Python Library for Information Theoretic Estimates

^{ 3 }. There are similar packages available in other languages, such as the R entropy library

^{ 4 }and the MATLAB Spike Train Analysis Toolbox

^{ 5 }, but the authors are not aware of any similar Python package.

^{Xn}possible values in the X space and Ym

^{Yn}in the Y space for each trial. X and Y are provided as integer arrays with values in [0, Xm − 1] and [0, Ym − 1] respectively with Xn, Yn rows representing the constituent variables and a column for each trial. It is important the columns match, that is the value of X in a given column corresponds to the same trial as the value of Y in the same column, but there are no further requirements on the format of the input. SortedDiscreteSystem requires the input trials to be grouped in values of the variable Y. This allows much more efficient sampling of the required probability distributions, since the trials for a given Y value can be easily isolated without having to search through the whole data set. This requires the space Y to be a single finite alphabet variable, so it should be decimalised beforehand if necessary. The class is initialised as s = SortedDiscreteSystem(X, X_dims, Ym, Ny) where X, X_dims are as above and Ym is the number of possible values for the single variable Y space. Ny is an array containing the number of trials available for each Y value. For example, Ny[] is the number of trials available with Y = , and the corresponding X values are found at X[ : Ny[]]. Both of these classes inherit from a base class BaseSystem which contains the common entropy and information calculations, reducing code duplication and increasing maintainability.

**S**, while X would be the response space

**R**. Since the stimuli are usually controlled by the experimenter, the results are often available already sorted by stimulus, allowing use of the more efficient SortedDiscreteSystem class. Mutual information is symmetric,

*I*(

**X**;

**Y**) =

*I*(

**Y**;

**X**), so in fact the stimulus and response spaces can be provided in any order, but due to the way the conditional probabilities are sampled it is strongly suggested that the smaller of the two spaces be provided as the Y parameter.

**R**and Y to the stimulus space

**S**, denote respectively

*H*(

**R**),

*H*(

**S**),

*H*(

**R**|

**S**), ${\sum}}_{{i}{=}{1}}^{{R}{n}}{H}{(}{\mathbf{R}}{i}{)$,

*H*

_{ind}(

**R**),

*H*

_{ind}(

**R**|

**S**),

*H*

_{sh}(

**R**|

**S**) and χ(

**R**). χ(

**R**) is a quantity needed for the information breakdown of (Pola et al., 2003 ) and is reported in Eq. 25 therein. This function will first decimalise the X and Y spaces, if required (if

*n*> 1) which involves converting the length-

*n*base-

*m*words representing the values for each space to a single decimal integer value in [0,

*m*− 1]. The probabilities required for the requested output entropies are then computed using the sampling method specified. “naive” represents the standard histogram bin counting method which is usually used. The add-constant estimator (Schürmann and Grassberger, 1996 ) is implemented through the “beta:x”method. The β parameter is provided after the colon in the option, so “beta:.1” would use the add-constant estimator with β = 0.01. The “shrink” option selects the James–Stein shrinkage estimator (Hausser and Strimmer, 2008 ). All the entropy estimates are currently implemented in pure Python, except for the NSB estimator. This is implemented using existing publicly available optimised codes

^{n}^{ 6 }. We have not yet implemented a direct link to the NSB codes, but instead write the data for analysis to a file, for processing by the standalone external program before reading back results from a file. Python’s heritage as a scripting language makes this process of reading and writing formatted files and programmatically calling an external program from the code very easy. The functions s.I() and s.Ish() can be used to obtain the mutual information estimate and shuffled mutual information estimate respectively, provided the required entropies have been computed. Similarly s.pola_decomp() will return the computed values for the decomposition of the mutual information presented in Pola et al. (2003) , again provided the required entropies were computed.

*docstrings*, which are embedded in the source and accessible through the interactive interpreter. Having the code documented in this way makes it easier for others to understand and contribute to.

*view*is created, but points to the same original data. In contrast, in MATLAB, taking such a slice always results in the extracted row being copied in memory to a new array object. As discussed, the object-oriented nature of Python allows code reuse through inheritance. To give an example of the performance of the Pyentropy library, for the preparation of the data for the Plugin, PT and QE methods in Figure 1 , the time taken using the Pyentropy library on a 2.4GHz Core 2 Duo laptop was 439 s. This includes data simulation for 50 trials at each sample size. The same task, using similar MATLAB code on an equivalent laptop was 987 s. There is also work in progress to extend the Pyentropy code with a more direct calculation of the core estimates in Cython. Cython is a language for writing C extensions to Python, and it shares a very similar syntax. This provides an easy way to quickly develop fast C modules to speed up the execution of Python code.

### Correlations and Maximum Entropy Models

*maximum entropy*(Montemurro et al., 2007b ; Schneidman et al., 2003 ; Tang et al., 2008 ; Victor, 2006 ), as follows.

*K*variables only, or whether and to what degree higher order interactions are present and important. The maximum entropy technique compares the measured response probability to one that takes into account all the observed interactions of up to

*K*elements but does not impose any additional structure on the data. Measuring all interactions of up to

*K*variables means measuring all the marginal response probabilities involving up to

*K*variables. Therefore any probability matching the observed interactions of up to

*K*elements must obey (apart from the usual non negativity and normalization constraints) the following linear constraints. Here we consider a response vector

**r**= {

*r*

_{1},…,

*r*

_{L}} of dimension

*L*, with each variable

*r*taking values from a finite alphabet

_{i}**A**containing

*m*elements.

*P*(

_{K}**r**) enforcing equality of the marginal values of a given order to those of the true distribution

*P*(

**r**). These marginals are denoted by η with subscript indices representing the variables involved in the marginal and superscript indices the corresponding values. The

*a*

^{th}order constraint applies for all unique combinations of

*a*variables, and every permutation of possible values that those variables can take. Thus the

*a*

^{th}line above represents ${{m}}^{{a}}{\left(}\begin{array}{c}{L}\\ {a}\end{array}{\right)}$ constraints, the product of permutations of

*a*values with choices of

*a*variables.

*P*(

_{K}**r**) with maximum entropy among those satisfying the above constraints is the one that does not impose the presence of any additional higher order correlations or interactions between the variables. To choose a distribution with lower entropy would correspond to the assumption of some additional structure that we do not know; to choose one with a higher entropy would necessarily violate the constraints that we wish to enforce.

*i*

_{1},…,

*i*label the subsets of a variables among the total

_{a}*L*considered. The set of indices

*r*

_{i1},...,

*r*

_{ia}labels a specific set of values of these variables. The first term in the sum is a finite alphabet Kronecker delta function which takes the value 1 when the variables of the argument specified by the subscript indices take the values specified by the superscript indices, and 0 otherwise. As with the marginal constraints, the second sum for each order is over all unique combinations of

*a*variables and all permutations of

*a*values that those variables can take; there are ${{m}}^{{a}}{\left(}\begin{array}{c}{L}\\ {a}\end{array}{\right)}$ summands, and the same number of distinct θ coefficients of that order.

*P*(

_{K}**r**; θ) compatible with all the known interactions up to

*K*-th order, we need to find the θ coefficients with up to

*K*indices to construct the solution above. These can be determined from the knowledge of the experimental η marginal probabilities of up to K elements through a set of algebraic equations, as detailed in the following section.

*P*(

**r**). However, the same procedure could be in principle applied to

*P*(

**r**|

*s*).

### An Algorithm for Finite-Alphabet Maximum Entropy Solutions

*coordinate systems*. The most obvious way of characterising a discrete probability distribution is by specifying the full list of probabilities for each element of the space. For example, if we have a finite alphabet response vector

**r**= {

*r*

_{1},…,

*r*} as above, then there are

_{L}*m*possible values for

^{L}**r**and so the probability distribution

*P*(

**r**) can be characterised by

*m*− 1 probability values, since one degree of freedom is removed by the normalisation constraint. These are called the

^{L}*p*-coordinates. An alternative way of uniquely determining a probability distribution is by listing the marginal probability values. As mentioned in the previous section, there are ${{m}}^{{k}}{\left(}\begin{array}{c}{L}\\ {k}\end{array}{\right)}$ marginals containing of order

*k*, so the collection of all marginals has ${{\sum}}_{{k}{=}{1}}^{{L}}{{m}}^{{k}}{\left(}\begin{array}{c}{L}\\ {k}\end{array}{\right)}}{=}{{m}}^{{L}}{-}{1$ elements. This way of describing the probability is called the η-coordinates. For the final characterisation of a probability distribution, we consider the form suggested by Eq. 10. Taking

*K*=

*L*,

*P*(

_{K}**r**) =

*P*(

**r**) and Eq. 10 shows that any probability can be computed from the set of coefficients, θ. Again there are ${{m}}^{{k}}{\left(}\begin{array}{c}{L}\\ {k}\end{array}{\right)}$ coefficients of each order

*k*. θ

_{0}is fixed by the normalisation condition, so again we have

*m*− 1 numbers that uniquely identify the probability distribution. Expressing a probability distribution in this way is also known as the

^{L}*log-linear*form, and the coefficients, θ are called the log-linear

*effects*. Here we refer to them as the θ-coordinates.

**p**denotes a vector describing a probability distribution in the

*p*-coordinates, η denotes a vector of η-coordinate values and θ a vector of θ-coordinates. The

**p**vector is ordered so that the value of the vector at a given index represents the probability of the underlying state which, when interpreted as a length

*L*base

*m*word, has the decimal value of the index. This ordering was chosen since it is easy to convert between state values and vector indices using existing change of basis functions. The vector η = (η

_{1}, η

_{2},…,η

**) where η**

_{L}**is the set of all marginals of order**

_{i}*i*and similarly θ = (θ

_{1}, θ

_{2},…,θ

**). The ordering of the vector within the subsets of different orders is arbitrary, however it is important that the subsets θ**

_{L}_{i}and η

_{i}share the same ordering for each

*i*.

#### Coordinate Transformations

*The key transformation is that from*

**p transforms.***p*-coordinates to η-coordinates. This is a linear transformation which performs the summation of relevant probabilities for calculating the marginal. With the coordinates arranged in vectors, as described above, it can be expressed as

*A*is a square matrix containing binary values. Each row of

*A*contains a 1 in the column for each

*p*coordinate that contributes to that marginal. The inverse transformation,

*p*coordinates from η coordinates is simply

*A*is invertible since it is square and all its constituent rows are linearly independent.

*For the θ–*

**p transforms.***p*transformations, first notice from Eq. 10 that in vector form

**p**=

*e*

^{θ0 + AT0}. This is because, for a given probability, the θ terms required are those corresponding to the non-zero elements of that specific state vector. Similarly, for a given probability, that probability will appear in the sum for the marginals corresponding to the same non-zero elements of the state vector. The marginals that a given probability appears in are given by the columns of the matrix

*A*, so provided the θ vector is ordered in the same way as the η vector, the sum of θ terms required in the exponential of Eq. 10 for each probability is given by

*A*θ. By evaluating Eq. 10 for the zero state vector ${{p}}_{{0}}{=}{P}{\left(}{{\left\{}{{r}}_{{i}}{=}{0}{\right\}}}_{{i}{=}{1}}^{{L}}{\right)}$ we see that the constant factor in the log-linear model,

^{T}*e*

^{θ0}, is in fact

*p*

_{0}. From

**P**=

*p*

_{0}

*e*

^{ATθ}, it is trivial to obtain the following transformation from

*p*coordinates to θ coordinates.

**p**we must compute

*p*

_{0}from the theta vector. The normalisation condition requires that ∑

_{p + p0}= 1, since the vector

**p**does not include the

*p*

_{0}value. Substituting the expression above gives ${{p}}_{{0}}{\displaystyle {\sum}{{e}}^{{{A}}^{{T}{\theta}}}}{+}{{p}}_{{0}}{=}{1}{\Rightarrow}{{p}}_{{0}}{=}{{\left(}{1}{+}{\displaystyle {\sum}{{e}}^{{{A}}^{{T}{\theta}}}}{\right)}}^{{-}{1}}$ yielding

#### Numerical Optimisation

*K*to those of the measured distribution corresponds to setting the low order η-coordinates of the maximum entropy solution equal to those of the measured distribution. From Eq. 10 the maximum entropy constraint is enforced by setting the high order components of the θ-coordinates to zero. By enforcing these constraints simultaneously, we obtain a set of

*N*simultaneous equations in

*N*unknowns, where ${N}{=}{\displaystyle {{\sum}}_{{j}{=}{1}}^{{k}}{{m}}^{{j}}}{\left(}\begin{array}{c}{L}\\ {j}\end{array}{\right)}$ is the number of coordinates up to order

*k*. Again

*m*is the size of the finite alphabet.

*represents the*

_{k}*N*low order (up to order

*k*) marginals of the sampled distribution. θ

_{k}, θ

_{k}, represent the low and high order theta coordinates of the maximum entropy distribution. $\stackrel{{\u02c7}}{{p}}{(}{\xb7}{)}$ denotes the coordinate transformation from θ to

*p*coordinates from Eq. 14 and ${\stackrel{{\u02c7}}{{\eta}}}_{{k}}{(}{\xb7}{)}$ denotes the coordinate transformation in Eq. 11 but with only the low order marginals returned. Setting the high order theta’s, θ

_{k+}, to zero ensures that there are no higher order interactions. It is then possible to find the low order theta’s that produce the same low order marginals as the sampled distribution, η

*. These low order theta’s, θ*

_{k}_{k}, completely characterise the maximum entropy distribution. In vector form the equations are:

_{k}are determined by numerically solving the equation above, one can convert back to

*p*-coordinates to obtain the corresponding maximum entropy distribution and calculate its entropy.

### Python Implementation

*workspace*was exploited. Here a setup script is used to define the coordinate transformation functions in the global workspace, from where they can be easily called by other scripts or used to interactively investigate data. In Python, an object-oriented approach was taken featuring two main classes. The first of these, AmariSolve, contains the parameters related to the underlying probability distribution, the required coordinate transformations and the code for performing the numerical solution. This is initialised with two parameters, the number of variables and the finite alphabet of each variable, since this is the only information required to implement the solution. The second class, AmariSystem, contains the data related to a specific system being studied, and contains the sampled probability distributions, calculated maximum entropy distributions and associated entropies. In this way the data independent analysis code is separated from the system specific code and data – the idea being that a single AmariSolve instance can be used on different data sets, providing the dimensions of the probability space are the same. It was found this approach gave much more flexibility than the global workspace, which could be confusing to manage during development, for example by requiring a full copy of the setup script to be maintained for every change to the algorithm investigated.

*A*which provides the transformation between probabilities (

*p*-coordinates) and marginals (η-coordinates). A recursive function is used in a loop over each order, to compute the elements of

*A*row by row. The code implements the longhand approach used for manual calculation of smaller matrices. The idea is that each marginal is the sum over all variables not fixed by the specification of the marginal. For each order a vector called terms is created which contains all base m words of length

*L*−

*o*, where

*o*is the order being considered. Then for each marginal, if columns of the appropriate value are inserted into the appropriate position in the terms array, the result contains a row for each probability state included in that marginal. These are converted to decimal, which directly gives the index in the probability vector, and the corresponding columns in

*A*are set to 1. To cover the different marginals, first the alphabet value and then the position is looped over. For orders higher than one, this process is recursive, so the first alphabet value is looped over, then within that the first position, then within that the second alphabet, then the second position and so on. This transformation matrix can be very large since its dimensions are the dimensions of the full probability space. However, it is highly sparse in structure, so in both implementations the provided sparse array construct was used to reduce the amount of memory required. In SciPy, the sparse array module is very flexible, providing a number of formats and datatypes. The advantage of this was that the binary matrix

*A*could be stored as a sparse array of 8-bit integers in SciPy, which provided a factor of eight memory saving over the 64-bit double which is the only type the MATLAB sparse matrix supports. Equations 12 and 13 show that some coordinate transformations require inversion of the matrix

*A*. Although this is not required directly for the computation of the maximum entropies, it was frequently useful while investigating properties of the system and of the different maximum entropy solutions. SciPy offers a very flexible direct interface to the UMFPACK

^{ 7 }library of sparse solvers (Davis, 2004 ), that allowed us to easily pre-factor the matrix and store the results allowing rapid calculation of the coordinate transforms when needed.

*A*containing only the rows required and Bsmall is the transpose of this. Asmall is extracted from

*A*using the

*slice*operator, for example in Python, Asmall = A[:l, :]. Python again provides a significant advantage here in terms of memory used. In MATLAB, any such slice results in a

*copy*of the data. However, with NumPy, the slice results in a

*view*of the original data. Similarly, in NumPy the transpose is also a view, with a different starting point and striding, but the same data buffer as the original array. In MATLAB the transpose operation also produces a copy.

**def** solvefunc(self, theta_un, Asmall, Bsmall, eta_sampled):

b = np.exp(Bsmall.matvec(theta_un))

y = eta_sampled(Asmall.matvec(b)/(b.sum() + 1) )

**return** y

*A*in smaller blocks, writing the rows and columns of the non-zero elements directly to files on disk to reduce memory overhead. Once this procedure was completed a sparse matrix in coordinate (COO) format could be generated directly from these files, and then converted to compressed sparse column (CSC) format for efficient matrix-vector multiplication. This is another example of where good results were obtained by using low level features that would not have been available in MATLAB.

*n*= 4,

*m*= 9 (four variables each taking 1 of 9 values). The MATLAB code took 17 s with a peak resident memory usage of 340 MB and the Python code took 12 s with a peak resident memory usage of 110 MB. These results are typical of our experience across a range of parameter values. The numerical optimisation routine took almost exactly the same time in both systems, with the difference being due to the improved performance of the sampling of the probability distributions in Python. This is likely to be due to the reduced amount of data copying needed with NumPy when using slicing and other array operations.

*A*must be calculated and held in memory. The ability to use an 8-bit integer for this binary matrix with Python provided a factor of 8 memory saving over the MATLAB equivalent. More significantly, the algorithm requires extraction of the submatrix of up to the relevant order, and the transpose of that, which in MATLAB consists of copies (meaning for each order the data is copied in memory three times, once for the full matrix

*A*, once for the extracted Asmall for the given order, and once for the transpose thereof, Bsmall). As an example, this meant that on a workstation with 2 GB of RAM the largest binary probability space that could be analysed up to order 3 was 12 variables for the MATLAB implementation, but 18 variables for the Python version. It is also worth noting that, while being similar to MATLAB, the Python language is a great pleasure to work with.

#### Example of application to thalamic neural recordings

**Figure 2. Responses of a VPm neuron to white noise vibrissa stimulation.**

**(A)**Vibrissa position as a function of time in units of stimulus SD (1 SD = 70 µm).

**(B)**Spikes fired by the neuron in response to 70 repetitions of the stimulus shown in

**(A)**.

*t*= 4 ms and quantified the response of the considered VPm neuron as a binary sequence of 1’s and 0’s (spikes or silence in that bin respectively), characterising the neural response

**r**as non-overlapping binary words of length

*L*extracted from this signal. We then considered the probability of response

*P*(

**r**) in response to all patterns of whisker stimulation obtained from the non-repeated white noise sequences, and we compared its entropy to that of the maximum entropy probability

*P*(

_{K}**r**) at level

*K*(

*K*= 1,…,3) and to the entropy of the true distribution. Results are reported in Figure 3 . We found that the lowest order model (

*K*= 1, which considers spikes in each bin as independent from each other) provides an entropy very close to that carried by higher order probability models. The difference between lower and higher order entropies becomes proportionally larger as the length

*L*of the binary word increases. However, differences remain small: for

*L*= 14, the difference between the independent-model,

*K*= 1 entropy and the true one remain within 3%. This suggests that the spike train could be quantitatively well described even by a simple model that ignores correlations between spikes at different time bins. It should be noted that in the Python implementation of this calculation, the limit on the maximum number of time bins

*L*and the order

*K*that could be analysed was set by the number of trials available and the effectiveness of the sampling bias corrections implemented, whereas in the corresponding MATLAB implementation the limit was reached when the available memory was consumed. For a binary system as described here that limit was

*L*= 12,

*K*= 2 on our workstation. This highlights the advantages of Python for these implementations.

**Figure 3. Response entropy of a VPm neuron to white noise vibrassa stimulation.**The full response entropy [

*H*(

**R**) denoted

*H*in the figure] is shown together with that of maximum entropy models preserving first [

*H*

^{(1)}], first and second [

*H*

^{(2)}] and up to third order [

*H*

^{(3)}] marginal densities. The response is treated as non-overlapping words of length 6 (panel

**A**), 10 (panel

**B**) and 14 (panel

**C**) bins, with each bin of 4 ms duration.

^{14}which is computationally equivalent to the case of the binary response of 14 simultaneously recorded neurons.

^{ 8 }project is a consortium effort to create a virtual laboratory for neurophysiology (Gibson et al., 2008 ), and is one example of project attempting to provide a centralised organisational structure for collaborative computing in neuroscience, as discussed above. CARMEN is an e-Science Pilot Project funded by the Engineering and Physical Sciences Research Council (UK) and involves investigators from 11 UK universities.

### Python Web Services

^{ 9 }. Web services are well suited to collaborative computing services, and they have been proven as a successful model for e-Science through their use in the bioinformatics community. They are also used as the foundation of the analysis code in the CARMEN project described above. Web services are operating system, location and language neutral. This is exploited in CARMEN to allow dynamic deployment of services to different computational nodes, and also simplifies the use and integration of analysis code written in a range of languages.

^{ 10 }is a standard XML based messaging format used to pass data and parameters to an analysis service, and then receive the results back. All clients and web services are capable of passing and decoding SOAP messages. The other pivotal standard is that of the Web Services Description Language (WSDL)

^{ 11 }, an XML document for the description of a web service; that is the method calls it provides, the arguments they require and the results they return. The WSDL that represents a web service is sufficiently informative to allow automatic generation of clients capable of binding to the service.

^{ 12 }a generic toolkit capable of exposing legacy applications as web services. Initially, we have created Python scripts that run as command line applications. This is straightforward since Python includes an excellent tool for easily parsing command line options. InstantSOAP provides a native command line processor to wrap any command line application into a web service through the creation of a single XML file. Work is currently in progress to extend InstantSOAP to natively support Python services, allowing direct deployment of a Python function as a web service, without requiring the developer to understand the web services stack, a significant barrier to entry in developing web services in any language. Python’s licensing model is also important in the deployment of distributed services; MATLAB suffers from licensing restrictions for collaborative deployment. This makes it harder both to provide open services to a large number of users and to employ the dynamic deployment architecture through which code may run on a number of computational nodes. For example, whilst CARMEN is capable of providing MATLAB web services, it is through compiled MATLAB scripts, supported by the MATLAB runtime environment, and has no native interface to MATLAB

*per se*, adding additional complexity to the procedure of creating, deploying and managing web services. There are also a number of ongoing technical challenges related to running the compiled MATLAB binaries within the web service environment.

^{ 13 }, it provides much of the power available from low level C coding with a numerical library, but with greatly reduced complexity and development time. For example, a major advantage for our maximum entropy application was the way we were able to fine tune the use of the sparse matrix structures. The interactive nature, familiar to users of MATLAB, is crucial to aid research, both in terms of investigation of data as well as development of algorithms. Compared to MATLAB, we have seen performance increases in moving our code to Python, particularly related to memory management in the case of our more demanding algorithms. In addition, increased productivity and code manageability, for example from the ability to use object-oriented programming techniques, speed development and ease collaboration with other researchers.

^{ 14 }, which we have used to run existing MATLAB code provided by colleagues for preprocessing data. Initially the required packages were difficult to install, requiring compilation from source of a range of packages with complicated dependencies. Actually getting the software installed was therefore the greatest challenge when we began using Python. However, since then, the community has done a lot of work in improving this process, and there are now regular binary releases of all the important components, as well as a number of projects that distribute a complete scientific tool chain with all required components through a common installer

^{ 15 }. Another challenge was adapting to the pass by reference semantics of Python rather than the pass by value style of MATLAB, as well as adapting to 0 based indexing. However, once these mental adjustments had been made we found ourselves more productive with Python than we were with MATLAB. Other disadvantages of Python are that the documentation of the included functions, while still available interactively, is not as comprehensive as that provided with MATLAB and the plotting functionality provided by matplotlib, is not quite as easy to use or well developed as the MATLAB version, especially with regard to 3D plotting.

^{ 16 }, which we are finding significantly easier to use and less error prone than the MATLAB equivalent (the MEX interface). Another area we are actively investigating in the use of parallelism. In many cases our problems are

*embarrassingly parallel*, for example calculating information theoretic bias-corrected quantities over a number of data sets or computing maximum entropy solutions of different orders and conditional distributions. A number of open source solutions exist for parallel computing with Python, and we are investigating using these features of IPython to easily distribute these types of jobs to available machines.

**^**The Mathworks, Inc, Natick, MA. http://www.mathworks.com/**^**“Open source is a development method for software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost, and an end to predatory vendor lock-in.” http://www.opensource.org/**^**See http://code.google.com/p/pyentropy/**^**See http://www.strimmerlab.org/software/entropy/index.html**^**See http://neuroanalysis.org/toolkit/**^**From http://nsb-entropy.sourceforge.net/**^**http://www.cise.ufl.edu/research/sparse/umfpack/**^**http://www.carmen.org.uk/**^**http://www.w3.org/TR/ws-gloss/**^**http://www.w3.org/TR/soap/**^**http://www.w3.org/TR/wsdl**^**http://instantsoap.sourceforge.net/**^**“An enhanced interactive Python shell and architecture for interactive parallel computing”, http://ipython.scipy.org/ (Perez and Granger, 2007 )**^**“A high-level Python to MATLAB bridge”, http://mlabwrap.sourceforge.net/**^**See for example http://www.pythonxy.com/ and http://www.enthought.com/products/epd.php**^**The Cython language, “C extensions for Python”, http://cython.org/

*in vitro*.

*J. Neurosci.*28, 505–518.