A Statistical Description of Neural Ensemble Dynamics

The growing use of multi-channel neural recording techniques in behaving animals has produced rich datasets that hold immense potential for advancing our understanding of how the brain mediates behavior. One limitation of these techniques is they do not provide important information about the underlying anatomical connections among the recorded neurons within an ensemble. Inferring these connections is often intractable because the set of possible interactions grows exponentially with ensemble size. This is a fundamental challenge one confronts when interpreting these data. Unfortunately, the combination of expert knowledge and ensemble data is often insufficient for selecting a unique model of these interactions. Our approach shifts away from modeling the network diagram of the ensemble toward analyzing changes in the dynamics of the ensemble as they relate to behavior. Our contribution consists of adapting techniques from signal processing and Bayesian statistics to track the dynamics of ensemble data on time-scales comparable with behavior. We employ a Bayesian estimator to weigh prior information against the available ensemble data, and use an adaptive quantization technique to aggregate poorly estimated regions of the ensemble data space. Importantly, our method is capable of detecting changes in both the magnitude and structure of correlations among neurons missed by firing rate metrics. We show that this method is scalable across a wide range of time-scales and ensemble sizes. Lastly, the performance of this method on both simulated and real ensemble data is used to demonstrate its utility.


INTRODUCTION
The growing use of multi-channel neural recording techniques in behaving animals has opened up new possibilities for understanding how the brain mediates behavior. These techniques are generating datasets that record the output of ever increasing numbers of neurons Suner et al., 2005;Nicolelis, 2008). Unfortunately, current recording techniques do not provide information about the underlying anatomical connections between the recorded neurons. Without these anatomical constraints, the set of possible network diagrams grows exponentially with increasing ensemble size. This limitation poses formidable challenges to the analysis and interpretation of these datasets (Palm et al., 1988;Averbeck et al., 2006). Several groups have focused upon the challenging inverse problem of inferring the connectivity among the individual neurons from ensemble data (Brown et al., 2001;Truccolo et al., 2005;Eldawlatly et al., 2009). However, the combination of expert knowledge and ensemble data is generally insufficient to select a unique model from the vast space of possible network diagrams. Thus, analysts are often faced with an intractable model selection problem (Sivia, 1996;Ghahramani, 1998).
A promising alternative comes from the field of statistical mechanics (Jaynes, 1957). This approach endeavors to estimate macro properties of a system, for example its entropy, from observations of its constituent parts. Recent applications of this approach in neuroscience have modeled the neural ensemble by maximizing its entropy subject to constraints imposed by estimated features of the neural ensemble. Importantly, these maximum entropy models do not require a priori specification of hidden variables or interactions between variables, unlike parametric methods such as Hidden Markov (Abeles et al., 1995;Jones et al., 2007) or point process models (Brown et al., 2001;Truccolo et al., 2005;Eldawlatly et al., 2009). These maximum entropy models have shown that by including only estimates of neural firing rates and pairwise interactions as parameters, one may generate a surprisingly good estimate of the frequency of observing any ensemble pattern (Schneidman et al., 2006;Tang et al., 2008). This suggests that tracking changes in these features over time provides a good approximation of the dynamics of the neural ensemble.
Our goal is to leverage these insights to provide experimentalists with a set of tools to aid in the exploratory data analysis (Tukey, 1962;Mallows, 2006) of neural ensemble datasets. Currently, investigators utilizing the techniques of statistical mechanics in neuroscience have used long segments of continuous ensemble data to describe the state of the brain in equilibrium (Schneidman et al., 2006;Shlens et al., 2009). To describe the dynamics of neural ensembles we must estimate changes in neural firing rates and pairwise interactions on shorter time-scales, which is not as established as analyzing time-varying changes in neural firing rates (Abeles, 1982b). We address this issue by using a Bayesian approach to estimating ensemble correlations on time-scales comparable with behavior. This estimator offers an answer to the question: how does one keep the use of prior information to a minimum while still producing sound estimates?
An additional motivation for our work comes from the realization that the large datasets generated by multi-channel ensemble recordings often go underutilized because they are cumbersome to load and tedious to scan. Here we leverage unsupervised learning algorithms, which work well at detecting well defined features and are tireless, to aid experimentalists, who are excellent at a wide range of pattern recognition tasks but fatigue easily. The method described below may be thought of as providing an answer to the question, "When are the dynamics of my neural ensemble data changing?" Our contribution consists of adapting techniques from signal processing (Dasu et al., 2006(Dasu et al., , 2009 and Bayesian statistics (Wolpert and Wolf, 1995) to track changes in the dynamics of neural ensemble data on time-scales comparable with behavior. This is achieved by expanding the statistical description of ensembles provided by Schneidman et al. (2006) into a framework allowing for the use of smaller sample sizes, thereby providing the temporal resolution required for comparison with behavior. Of course decreasing the number of samples may increase both the bias and variance of any estimate from the data. Moreover, many of the possible ensemble patterns may not be represented in the dataset. This results in the need to address the influence of low sample density upon our estimates. The method detailed below addresses this issue, and is scalable across a wide range of ensemble sizes. Importantly, it is capable of detecting changes in the correlation structure of ensemble data missed by firing rate metrics, allowing one to disassociate changes in ensemble correlations from changes in neural firing rates.

METHOD OVERVIEW
The proposed method combines a spatial data-clustering technique (Dasu et al., 2006(Dasu et al., , 2009 with a Bayesian estimator of the KL-divergence (Kullback, 1959) between discrete distributions over neural ensemble patterns (Wolpert and Wolf, 1995). The KLdivergence is calculated between pairs of probability distributions which share the same domain. Its output is a positive quantity with values greater than zero indicating the distributions compared are not the same. The efficiency of the KL-divergence for hypothesis testing and classification has been studied extensively (Kullback, 1959;Cover and Thomas, 1991). Here the KL-divergence is used to generate a one-dimensional time-series for tracking changes in the dynamics of the ensemble, relative to statistical null hypotheses about the underlying neural firing rates and pairwise interactions.
To calculate the KL-divergence one must provide a method for transforming the neural ensemble data into a probability mass function. Like others (Schneidman et al., 2006;Tang et al., 2008;Marre et al., 2009), we define a discrete joint distribution over possible ensemble patterns. One consequence of this formalism is that unobserved, but possible, ensemble patterns must be assigned a probability through the use of a prior distribution. The choice of a prior distribution will be expounded below. To keep the prior from unduly influencing our estimates, we use an adaptive quantization scheme to compress the complete set of ensemble patterns into a smaller set of multinomial categories. The data structure used to achieve this compression is the kdq-tree. Originally developed by Dasu et al. (2006) for detecting changes in multi-dimensional, streaming telecommunications data, this data structure aggregates regions of low sample density, and may be thought of as a datadriven binning scheme. Importantly, the kdq-tree was originally developed specifically for cases where the distribution generating the dataset is unknown.
First, samples of ensemble data of equal size, representing the null hypothesis and test data, are filtered through the kdq-tree. Next, the multinomial samples output by the kdq-tree are input to a routine that calculates the Bayesian estimator of the KL-divergence to determine whether the test data conform to the null hypothesis. The time-series of KL-divergence values are then examined for significant deviations from the null hypothesis. Since the null hypothesis is defined by features of the neural ensemble estimated within a previous epoch, significant deviations demarcate changes in these features across time, and therefore changes in the dynamics of the neural ensemble. The processing of ensemble data by this method is schematized in Figure 1A, from conversion of the dataset via the kdq-tree through examining the time-series of KL-divergence values to test the null hypothesis.
We next provide a detailed exposition of the method. We demonstrate its application and performance upon simulated ensemble data. Furthermore, it will be shown to be capable of detecting changes in the correlation structure of ensemble data missed by firing rate metrics. We conclude by demonstrating its utility upon neural ensemble data collected from awake behaving rodents. Figure 1 provides an overview of the method and an example of the phenomenon of interest to us. These data were collected from the somatosensory barrel cortex (S1bf) of a behaving rat (unpublished data). Figure 1 shows both local field potential (LFP; Figure 1B) and ensemble data ( Figure 1C) collected from a rat's S1bf. Here we are interested in examining whether the dynamics of the ensemble reflect changes in the LFP. Halfway through the time-series, there is a clear change in the LFP data. This feature is the so-called μrhythm commonly observed in idle rats (Semba et al., 1980;Wiest and Nicolelis, 2003;Tort et al., 2010). Figure 1D demonstrates the ability of our method to detect this change in the dynamics of the ensemble data. Subpanels 1 and 2 zoom in on a few cycles of the μ-rhythm to illustrate that this change in the LFP is correlated with a change in the patterns of activity emitted by the ensemble. In particular, the negativity of the μ-rhythm coincides with an increase in the probability of correlated firing among the neurons. The role played by the LFP might just as easily have been a stimulus presentation, a basic motor response, a response signaling a decision or any other variable of interest.

EXPOSITION TRANSFORMING ENSEMBLE DATA INTO A PROBABILITY MASS FUNCTION
We next illustrate the details of the transformation from ensemble data to probability mass functions. This is done in the context of checking if the results of Schneidman et al. (2006) hold for our ensemble data from S1bf. Here the generalized iterative scaling algorithm (Darroch and Ratcliff, 1972) is used to fit the maximum entropy pairwise model to the transformed ensemble data.

Frontiers in Computational Neuroscience
www.frontiersin.org The KL-divergence output of this method, which uses only the ensemble data as input. The shaded gray area indicates the mean ± SD, respectively, of the KL-divergence under the null hypothesis of independence among the neurons. Subpanel 1: A zoomed in example of the background LFP, ensemble activity, and KL-divergence. Subpanel 2: The onset of the μ-rhythm, ensemble data and KL-divergence at higher resolutions. Note the correlations in the ensemble data during the negative phase of the μ-rhythm. Figure 2A shows a typical example of the weak pairwise crosscorrelations observed between units within S1bf. For this dataset, the cross-correlation peaks were centered at approximately zero lag and were 20-40 ms wide (the inset of Figure 2A shows the details of an example peak). This is an important piece of experimental information used to set a bin width parameter used to simplify the cross-correlation structure (set to 20 ms here). An alternative is to assume a Markov order and fit a Markov model to the sequence of ensemble patterns (see Marre et al., 2009). This was not done here because the Markov order is a free parameter that vastly increases the complexity of the model. We will show that it is possible to detect subtle changes in the correlation structure of ensemble data using this binning scheme. Another feature of this transformation is the designation of any neural activity within a bin as either active (1) or inactive (0), as opposed to integer spike counts. This simplification loses the information individual neurons transmit via brief bursts of activity, but it allows for a complete description of the distribution of ensemble patterns based solely on the number of recorded neurons. Furthermore, we will show that the information conveyed by these bursts is recoverable by tracking the ensemble firing rate in parallel.
This processing of the ensemble data results in a series of row vectors, one for each time bin, ranging over all possible ensemble patterns, from completely inactive (all 0's) to maximally active (all 1's; Figure 2B y i is then: The operator C(y i /e) counts the number of times ensemble pattern y i appears in the sample. Normalizing these counts to one and ranking them in descending MLE probability determines the probability mass function ( Figure 2C).
We now fit maximum entropy models to the MLE probability distribution to test whether these provide a good fit to the ensemble data. First, we fit the independent model, which assumes the probability of observing any ensemble pattern is the product of the probabilities of the individual active and inactive neurons within a pattern. The independent model is calculated as: Here the X j designates one of the M neurons in the ensemble and y ji indicates the jth component of ensemble pattern, y i . We then fit the pairwise model, which also incorporates the simultaneously observed pairwise correlations. Fitting the maximum entropy pairwise model consists of maximizing the Shannon entropy subject to the constraints that it must be consistent with the expected firing rates and pairwise correlations measured from the dataset: The probability the model assigns to each ensemble pattern is simplified here as p i . The first term on the right of the equals sign is the Shannon Entropy. The first constraint, with parameter λ 0 , is a normalization constraint and the subsequent sum of constraints, with parameters λ r , require the model output match the expectations of these measured quantities, f r (p) (Jaynes, 1957). The form of the resulting model is obtained by setting ∂Q/∂p i = 0 and solving for p i : The denominator satisfies the normalization constraint, making this model a probability mass function. Figure 2D shows the improved fit of the pairwise model (P2) compared against the independent model (P1) for this, on average, weakly correlated ensemble. This analysis validates the result of Schneidman et al. (2006): a model incorporating the firing rates and pairwise interactions within a neural ensemble does provide a good fit to these ensemble data. Consequently, tracking changes in these features over time provides a good approximation to the dynamics of the ensemble.
To increase our temporal resolution of the dynamics of the ensemble we must make estimates using fewer samples. This introduces a host of issues that must be taken into account. For instance, to our knowledge all analyses of the joint distribution of ensemble patterns (Schneidman et al., 2006;Tang et al., 2008), including our own, observe that the low firing rates of neurons and weak correlations between them result in ensembles rarely if ever producing many of the possible patterns ( Figure 2C). Low sample density and outright missing data pose a challenge to any statistical analysis of ensemble data at higher temporal resolutions, because a probability mass must be assigned to all y i ∈ Y. Our solution is to first compress these data into multinomial samples via the kdq-tree and then multiple these by a prior distribution when calculating the Bayesian estimate of the KL-Divergence.

COMPRESSION VIA THE kdq-tree
The kdq-tree is used to compress the ensemble data before multiplying it by a prior distribution. Otherwise maximum-a posteriori estimates of the KL-divergence are liable to be more a reflection of the prior distribution than the dataset (Skilling, 1985). Therefore, to aggregate regions of low sample density, and thereby reduce the influence of the prior distribution, we employ the kdq-tree (Dasu et al., 2006).

Frontiers in Computational Neuroscience
www.frontiersin.org The kdq-tree data structure provides a data-driven approach to compressing regions of low sample density that scales linearly in the number of dimensions and data points (Dasu et al., 2006). This makes its use computationally efficient for a wide range of ensemble and sample sizes. The general principle followed by the kdq-tree is to bin finely where data density is high and coarsely where it is low. The parameters of the kdq-tree are an integer that sets the minimum number of data points a bin must contain before it is subdivided, and a real number that sets a constraint upon the width of any bin along any dimension. For ensemble patterns the latter parameter is always set to 1/2 since all dimensions consist of the values 0 and 1. The intuition behind the kdq-tree may be apprehended by representing it as a binary tree data structure (Figures 3A,B).
In Figures 3A,B each fork represents a splitting of a parent node (bin) along a single dimension, and the K leaves of the binary tree are the terminal nodes (the resulting bins). Each node contains the boundaries of a bin and the number of samples contained therein. As one moves down the tree, from top to bottom, the bins get smaller (see Algorithm 1). Figure 3A shows the binary tree representation of a kdq-tree fit to a simulated ensemble of five neurons. In this case, all of the possible 2 5 = 32 ensemble patterns were visited with sufficient frequency to generate the entire tree (Depth = 5, Terminal Nodes = 32). Figure 3B details a similar binary tree structure, but in this case not all of the ensemble patterns contain enough samples to generate the complete tree, resulting in a pruned tree (Depth = 5, Terminal Nodes = 15). In both examples, this binary tree is constructed iteratively by cycling through each of the five variables in the ensemble and at each step determining whether to split the parent node into two children nodes. This process continues until no node meets the criteria for subdivision set by the parameters. This compression into multinomial samples eliminates missing data, with only a slight loss of empirical entropy (Figures 3C,D: data from an ensemble of 10 recorded neurons). This transformation allows us to track differences in samples of ensemble data as differences in the posterior distributions using the Bayesian estimator of the KL-divergence.

TRACKING THE KL-DIVERGENCE
The KL-divergence is defined as: It is a convex function between probability density functions and is bound between 0, indicating no difference between the distributions, and +∞. It is only defined for the case where the domains of p 1 and p 2 are the same. We interpret the KL-divergence as the information for discriminating p 1 from p 2 . In general, the KL-divergence from p 2 to p 1 is not equal to that from p 1 to p 2 . With this in mind, it is important to note that our calculations of the KL-divergence are always relative to a null hypothesis, with the distribution of this null represented by p 2 .
The Bayesian estimator for the KL-divergence is derived according to the Laplace convolution method (Wolpert and Wolf, 1995), which requires the specification of a prior distribution. Based upon Figure 2C, we know that many of our ensemble states are unlikely to be observed. What we want then is a prior distribution that is minimally informative while being responsive to updates from sparse data. To achieve this end, we chose the conjugate prior for multinomial likelihoods functions, the Dirichlet distribution, and set all the parameters of the distribution equal to 0.5 (Krichevsky and Trofimov, 1981). The first moment of Bayesian estimator of the KL-divergence according to the posterior distributions of p 1 and p 2 is calculated, and derived in the Appendix (see also Berkes et al., 2011). Figure 4 details the behavior of the kdq-tree, our choice of prior, and the estimator of the KL-divergence when applied to the output of two simulated 10 unit ensembles. In this case the null hypothesis was that the test samples were all drawn from the same distribution that generated the first 500 samples (note the zero values in Figures 4C,F, which are the samples used for the null hypothesis). The test window of 500 samples was slid one sample at a time, forward in time, over the entire dataset. At each step the test data was processed as described above and the KL-divergence was estimated between the null and test data.
These simulations were designed to undergo either a change in firing rates ( Figure 4A) or pairwise correlations ( Figure 4D) beginning at sample 2001 and persisting through sample 4000. Specifically, in Figure 4B all variables were assigned an initial firing rate and, at sample 2001, each underwent a random change in firing rate (All firing rates were drawn from a distribution derived from unit activity recorded within the S1bf of five behaving rats. All unit waveforms had a signal-to-noise ratio of 4:1 relative to each channel's background voltage fluctuations, unpublished data.). In Figure 4E, initial firing rates were drawn as in Figure 4B, but here at sample 2001 all variables underwent an increase in pairwise correlations [from correlation(x i , x j ) ∼ N (0.005, 0.05) to ∼N (0.8, 0.05)] while firing rates remained the same (the simulation engine of Macke et al., 2009 was used to generate these example data). Figures 4C,F show that the Bayesian estimator of the KL-divergence is able to signal both of these changes.
An examination of Figure 4 illustrates that before applying this method to real data we must derive for the time-series of KL-divergence values a rejection threshold for ruling out the null hypothesis. Specifically, we must provide a means for rejecting the null hypothesis that the KL-divergence matches what one would expect if repeated finite samples were drawn from the null distribution. In Figure 4, this variability is manifest in the behavior of the KL-divergence time-series between samples 1000 and 2000 and again after sample 2501. The magnitude of this variance is a function of the ensemble size, the number of samples within a window, and the null hypothesis considered. We next examine the variance of the null hypothesis en route to defining a rejection threshold for the time-series of the KL-divergence values. We then validate the method's behavior and performance upon simulated ensemble data for a variety of null hypotheses.

CHECKING INTUITIONS AND ESTIMATING A REJECTION THRESHOLD
To evaluate the behavior of our method, we calculated the expected variance of the resulting KL-divergence for a range of ensemble sizes, sample sizes, and neural features. This was achieved by drawing samples from simulated ensembles generated by stationary distributions. Before estimating the SD of the posterior distribution of the KL-divergence, for multinomial likelihood functions, we derived the second moment of this distribution according to the posteriors of p 1 and p 2 (see Appendix for details).
The simulation engine of Macke et al. (2009) was then used to generate ensembles of binary variables with mean firing rates and pairwise correlations matched to those observed experimentally. The simulated ensemble data was then binned at 20 ms and the sample size parameter of the kdq-tree was set to five samples. The details of these simulations for varying ensemble and window sizes are shown in Figure 5. Each data point represents the mean ± SD of the KL-divergence. Several important intuitions are apparent from these simulations. First, the mean and SD of the KL-divergence is inversely proportional to the window size. Second, the mean is directly proportional to the number of variables in the ensemble (Figure 5A). This means that as the ensemble size increases, relative to the sample size, the likelihood of mistaking pairs of samples from a single distribution for samples from different distributions increases. Alternatively, stationary systems appear more variable if brief observations are made instead longer ones. Third, the mean of our estimates of the KL-divergence are inversely proportional to the degree of correlation among the variables (Figure 5B). In the extreme, if the variables within a system are completely correlated, the distribution reduces to a binomial distribution, greatly reducing the expected KL-divergence. These intuitions are important if we are to differentiate interesting features of the dataset from expected fluctuations in its activity.
It is important to emphasis that the Bayesian estimator of the KL-divergence is positively biased, especially for large ensembles. This is in line with observations made by others about the calculation of Shannon Entropy from ensemble data (Paninski, 2003). This bias is less of a concern for us because we are interested in tracking differences in the time-series of the KL-divergence values, not their absolute values. With these observations in hand, we now estimate the rejection threshold relative to the null hypothesis upon the time-series of KL-divergence values.
Our general goal is to detect epochs in which the dynamics of the ensemble move away from the distribution of the null hypothesis. The calculation of the rejection threshold will depend upon the null hypothesis under consideration. For example, for the null hypothesis of homogeneity among adjacent samples, a surrogate dataset is created by time shuffling the ensemble patterns to break up any temporal structure in the sequence of ensemble data. These surrogate data are then processed according to the method and the mean and SD of the resulting time-series of KL-divergence values are used to set the rejection threshold (see Algorithm 2 for a test of homogeneity among adjacent samples).
When considering the null hypothesis of independence among the neurons within an ensemble we generate pairs of surrogate independent samples by shuffling the time indices of each neuron within each window. This preserves the firing rates of all the neurons within the window while disrupting any correlations among them. These data are then processed and the rejection threshold is calculated as above. Figure 6 details our method's ability to detect changes in the dynamics of the ensemble data missed by the commonly used population, or ensemble, firing rate metric (Laubach et al., 2000;Friedrich and Laurent, 2001;Dorris and Glimcher, 2004) as well as changes in the correlation structure of an ensemble.
In Figure 6A, a simulated ensemble of 10 variables was generated so that the column sum for each sample was constant (five  This was done to provide an example of a disassociation of a change in ensemble firing rate from a change in ensemble correlations. Our method and the ensemble firing rate were calculated from these simulated data. For both analyses a 1 sample bin and a 100 sample window were used. For the ensemble firing rate this 100 sample window was used to calculate a running average. This matched the time-scales of the analyses. For the kdq-tree, the sample density parameter was set at 5. The null hypothesis tested was homogeneity among samples. The KL-divergence was able to detect this change in the correlation structure (Figure 6B), and by design the ensemble firing rate could not ( Figure 6C). In Figure 6D, a simulated ensemble of 10 variables was generated such that they were independent from samples 1-1000, from samples 1001-2000 variables 1-5 were correlated, from samples 2001-3000 variables 6-10 were correlated, and they were all independent again from samples 3001-4000. Here we used a window size of 500 samples. To detect this change in correlation structure using our method, we evaluated two null hypotheses. The first null hypothesis was that the variables were independent (Figure 6E). The second null hypothesis was that adjacent samples were homogeneous ( Figure 6F). Figure 6E shows that our method was able to reject the null of independence between samples 1000 and 3000. Figure 6F shows a rejection of the null of homogeneity almost immediately as the leading edge enters the first epoch of correlated data. It then decreases as both windows enter this epoch before increasing again and reaching a maximum around sample 2500, as each of the adjacent windows occupies one of the two differently correlated epochs. Taken together, this demonstrates the ability of our method to detect a change in the structure of neural ensemble correlations.
To demonstrate the performance of our method under difficult conditions, among variables. Again, the firing rates of the variables within the simulated ensembles were drawn from an empirical distribution derived from chronic extracellular recordings in five behaving rats (unpublished data). The parameters considered were the number of variables in the ensemble, the window size, the significance level for detection, and the sample density parameter for the kdq-tree. In all cases, 100 epochs of 5500 samples each were generated, and an increase in ensemble correlations from 0 to 0.1 occurred from samples 4501 to 5000. The null hypothesis evaluated was that the samples all came from the same distribution that generated the samples within the first window. The significance level was set relative to the rejection threshold of the null distribution calculated as described above. The confidence interval method of Dasu et al. (2009) was used to mark detections. Detections were marked when the number of significant KL-divergence values within a 100 sample window exceeded the proportion expected by chance. Performance was classified as "Detected,""Late,""False," and "Miss."A change was logged as"Detected"if the time of detection was within two window lengths of the actual change. Otherwise, any detection outside this interval but before another simulated change in these data was marked "Late." If the number of detections was greater than the number of epochs, then this excess was logged as "False," indicating false alarms. If no detection was signaled between two changes, it was logged as a "Miss." The most important result of these simulations is that it is far easier to detect small, widespread changes in the correlations among units within large ensembles than within smaller ensembles. This implies that if small, widespread changes in ensemble correlations are coincident with behavior, then increasing the number of recorded units in combination with this method should increase the ability of experimenters to detect this feature. This point will be considered further in the discussion. From the simulations, it was also clear that matching the window size to the duration of the change increased the number of detections, which recommends considering multiple time-scales when investigating ensemble correlations. As might be expected, increasing the significance level for detection decreased the number of false alarms but increased the number of false negatives. The threshold upon sample density was found to have only a small effect upon detection performance for the range of values considered.

APPLICATION TO REAL DATA
Having explicated our method, we now demonstrate its application to real neural ensemble data (Figure 7). Our goal was to detect the occurrence of the μ-rhythm apparent in the LFP using only the ensemble data (Figures 7A,B). To begin, the bin width parameter was set to 20 ms and a window of 200 binned samples was used. For comparison against a comparable estimate of ensemble firing rate, the ensemble data was binned as both binary activations and spike counts. The KL-divergence was set to evaluate the null hypothesis that the neurons within the ensemble fire independently of one another. The kdq-tree was constructed using the complete data sorted in descending order by firing rate (Figure 7B). For the kdqtree, the integer threshold upon data density was set at five samples. After compression, the resulting multinomial samples were then used to estimate the KL-divergence. The mode ± the SD of the time-series of surrogate KL-divergence values representing the null hypothesis is plotted to mark when the null is rejected (Figure 7D). The mode ± the SD of the ensemble firing rate time-series set the rejection threshold for this estimator (Figure 7C).
The most apparent difference between these analyses is that the KL-divergence was capable of signaling the sustained epoch of increased ensemble correlations (Figure 7D), whereas the ensemble firing rate only clearly signals the tail end of the second bout of the μ-rhythm (Figure 7C). These epochs are of particular interest because they signal a disassociation of changes in pairwise Frontiers in Computational Neuroscience www.frontiersin.org interactions from changes in neural firing rates. Both features detect a decrease in ensemble activity prior to the initiation of the μ-rhythm, but then the KL-divergence signals an increase in correlated activity missed by the ensemble firing (Figure 7, subpanel 1). Interestingly, the cadence of the μ-rhythms coincides with a plateau and subsequent decrease in the KL-divergence, while the ensemble firing rate signals a burst of activity (Figure 7, subpanel 2). An examination of the ensemble rasters validates the description of the dataset provided by these features. The initiation of the μ-rhythm begins with a decrease in ensemble activity followed by an increase in ensemble correlations without an appreciable change in unit firing rates. The bouts of μ-rhythms are then terminated by a burst of activity, which is detected by the ensemble firing rate. The ability of the proposed method to detect changes in ensemble correlations, in conjunction with the population firing rate's sensitivity to bursts of activity, paints a rich picture of these data. Moreover, by augmenting this method with the calculation of the ensemble firing rate, the spike count information lost when transforming the ensemble data to generate the joint distribution is recovered. This example shows that tracking both the KLdivergence ( Figure 7D) and the ensemble firing rate (Figure 7C) makes it straightforward to disassociate changes in firing rates from changes in the higher order moments of ensemble data. Altogether, our method provides an automated process for generating a succinct summary of neural ensembles dynamics.

DISCUSSION
Contemporary neurophysiological techniques for recording from behaving subjects track the output of ensembles of neurons. Put simply, the ensemble is the set of recorded neurons. This is done with minimal knowledge about the anatomical connections among the recorded neurons or any unobserved inputs that drive them. Until the advent of technology capable of detailing the relevant neural networks in vivo, progress will depend upon the ability of neuroscientists to make sound inferences about the structure and influences upon neural ensemble activity. This is to say, the impressive parametric models that have been developed for describing ensemble interactions (Brown et al., 2001;Truccolo et al., 2005;Eldawlatly et al., 2009) are only as convincing as the experimental evidence which supports them. While the body of work demonstrating some relationship between the structure of ensemble data and behavior is growing (Deadwyler and Hampson, 1997;Durstewitz et al., 2010;Truccolo et al., 2010), the functional significance of transient fluctuations in the dynamics of neural ensembles remains an open question. This led us to develop a computational method that utilizes unsupervised learning algorithms for the purpose of tracking changes in the dynamics of neural ensembles on time-scales comparable with behavior.
Our approach was to synthesize the non-parametric method of Dasu et al. (2006Dasu et al. ( , 2009 with the statistical mechanics description of ensemble data provided by Schneidman et al. (2006). These components were chosen to match well the practice of Frontiers in Computational Neuroscience www.frontiersin.org exploratory data analysis (Tukey, 1962;Mallows, 2006). As such, they are non-parametric and unsupervised, reflecting the fact that the mechanisms generating ensemble data are largely unknown and their covariance with behavior remains to be investigated. The kdq-tree was chosen for its ability to aggregate poorly estimated regions of the data space ( Figure 2D). Moreover, because it scales linearly in the number of variables and data points, it is appropriate for a wide range of ensemble and window sizes (Figure 5). Furthermore, the use of the Bayesian estimator of the KL-divergence (Kullback, 1959;Wolpert and Wolf, 1995) provided us with a sound framework for evaluating possible differences between ensemble data sampled over intervals short enough for making comparisons with behavior. Moreover, the Bayesian estimator allowed us to incorporate prior information about the dataset. In particular, the use of the Dirichlet prior with parameters all set to 0.5 biased us toward a sparse posterior over ensemble patterns, in accordance with experimental observations (Figure 2). Together, this allowed us to track changes in the dynamics of ensemble data by inspecting the time-series of the KL-divergence values relative to the corresponding expected variance of the null distribution (Figure 7). Methods such as principal component analysis (Jolliffe, 2002) and factor analysis (Yu et al., 2009) were not used to reduce the dimensionality of the dataset because of their reliance upon the assumption that these data come from a Gaussian distribution. Because the set of ensemble patterns is unordered, smoothing methods such as kernel density estimation (Rosenblatt, 1956;Botev et al., 2010) would only be appropriate after first selecting an arbitrary ordering of the ensemble patterns. It is worth noting that the order in which the kdq-tree evaluates variables is arbitrary, and other data compression schemes are worth considering if they are well suited to the peculiarities of ensemble data. Moreover, simulations demonstrated the free parameter upon sample density to be rather robust (Table 1). Lastly, the choice to leave the spike bin and window size as user-specified free parameters reflects the view that these require expert knowledge for their specification, and will depend upon the experimental preparation under observation.
The exposition and demonstrations provided herein illustrate our method's efficacy for evaluating a range of hypotheses about the dynamics of ensemble data. These include detecting changes in the structure of pairwise interactions among neurons within an ensemble, distinct from changes in neural firing rates (Figure 6). Throughout, we made the assumption that the features of interest would manifest as transient changes in the dynamics of the ensemble activity. This reflects that general observation that changes in the dynamics of neural ensembles are observable as transient modulations of neural firing rates and pairwise interactions. On the contrary, one might imagine an ensemble could shift from one sustained equilibrium state to another. Such a change would be clear from an inspection of the times-series of KL-divergence values under the null hypothesis of stationarity and would recommend a partitioning of the time-series of these data prior to further analysis. An alternative to this unsupervised approach would be a supervised learning scheme in which a classifier is built using training data to validate whether some ensemble data carries information about a behavior of interest (Churchward et al., 1997). The KL-divergence has been used extensively for classification (Kullback, 1959) and our method could easily be adapted to such a framework by an appropriate partitioning of the dataset to generate training data for each presumed class.
There are a few differences between our method and those of others, which are both principled and methodological. For instance, analyses such as the ISI distance method of Kreuz et al. (2007) or the gravity method of Lindsey and Gerstein (2006) are designed to detect synchronous events involving subsets of neurons within an ensemble. We did not take this approach for three reasons. First, we wished to avoid treating the recorded ensemble as a neural network, because of the experimental limitations listed above. Therefore, we adopted a framework that is agnostic to whether changes are caused by interactions between the recorded neurons or by unobserved inputs. Second, sets of neurons do not appear to fire in rigid patterns, i.e., sync-fire chains (Abeles, 1982a), but in a stochastic manner amenable to statistical analysis. Third, outside of a few areas within the brain which do show a high degree of synchrony, e.g., CA1 of the hippocampus, there is a paucity of experimental evidence for widespread, strong correlations among neurons in most brain areas. The norm is the observation of weak pairwise correlations (Schneidman et al., 2006;Tang et al., 2008;Shlens et al., 2009). It remains unclear why these periodic synchronizations are not observed at their downstream targets. Is it a due to random delays between the neural oscillator and its target(s)? If so, our method is capable of detecting the influence of such an upstream neural oscillator without having to model the explicit neuron-to-neuron interactions. This could be done by first applying our method to neural ensemble data from the downstream area under the null hypothesis of independence and then comparing the resulting time-series of KL-divergence values to the time-series of the neural oscillator.
Methodologically, by being grounded within the framework of statistical hypothesis testing, our method captures a notion of prior expectation many ad hoc methods lack. Some form of a prior expectation is important when analyzing complex systems, because simple changes in variance can result in incredible variability, producing myriad red herrings. This being said, other methods may be more sensitive to novel patterns of interest in the dataset, and in future work we will extend the set of null hypotheses to include a wider range of neural features. Ultimately, which method will most clearly illustrate the relationship between neural ensemble activity and behavior is an empirical question. In particular, the use of maximum entropy models in neuroscience has been extended to include both temporal interactions among ensemble patterns (Marre et al., 2009) and to argue for higher order correlations in ensemble data (Ohiorhenuan et al., 2010). In addition, a forthcoming extension will calculate the inverse from significant changes in the ensemble dynamics to the best estimate of the set of neurons that contributed to the change.
In conclusion, we presented a flexible method for signaling changes in the dynamics of neural ensemble data on time-scales comparable with behavior. We demonstrated the validity and utility of this method and recommend its use to complement existing analyses. This method is particularly sensitive to widespread, transient fluctuations in the correlations among neurons within an ensemble (Table 1). Importantly, it is capable of disassociating changes in ensemble correlation structure from changes in Frontiers in Computational Neuroscience www.frontiersin.org ensemble firing rate (Figure 6). This makes it an excellent candidate for mining ensemble data in search of evidence for hypotheses ranging from the reader mechanisms governing neural computation (Buzsaki, 2010) to the role of oscillations in the brain (Fries, 2005).
The application left to experimentalists is to observe the covariance between large values of the KL-divergence and other variables of interest. A MATLAB ® implementation of this method, along with an addition visualization tool, is available at: http://code.google.com/p/kdq-bayes-kl

APPENDIX
In this appendix the Bayesian estimator for the first and second moments of the KL-divergence over discrete distributions is derived according to the method of Wolpert and Wolf (1995). This appendix is meant to stand alone and so some of the results of Wolpert and Wolf (1995) are recapitulated. The interested reader should consult the original work of Wolpert and Wolf (1995).
The original result of Wolpert and Wolf (1995) was derived for a single, discrete distribution. When applying this method to derive the Bayesian estimator for the KL-divergence it is necessary to consider two discrete distributions, which share the same domain. This extension has been demonstrated for a quadratic loss function in "Determining Whether Two Data Sets Are From The Same Distribution" by Wolpert (1996).

PRELIMINARIES
We are interested in using a data set n to estimate some function of a probability distribution Q(p). To estimate Q(p) from the data n we must first determine the probability density function P(p|n). We know that we are working with multinomial samples here, so by Bayes's theorem P(p|n) is given by P(n) = dp P(n|p) P(p).
Note that because of cancelation, the constant N !/( m i=1 n i !) does not appear in P(p|n). Thus, P(p|n) ∝ m i=1 p n i i P(p). Therefore, the kth order moment of Q(p) given n is (∫dpQ k (p) P(p|n). If we define then the kth moment of Q k (p) may be expressed as q k /q 0 . In the following, for simplicity P(p) will be assumed to be uniform, i.e., P(p) ∝ (p)Θ(p), where Θ(p) = Π i θ(p i ), the Heaviside theta function, Δ(p) ≡ δ(Σ i p i − 1), and the proportionality constant is set by the normalization condition ∫dpP(p) = 1.
When deriving the Bayesian estimator for the KL-divergence we utilize a Dirichlet prior, for Re(α i ) > 0. Lastly, to be consistent with the notation of Wolpert and Wolf (1995) we define is a functional of its first argument and a function of its second argument.

RECAPITULATION OF DERIVATIONS BY WOLPERT AND WOLF (1995)
In Theorem 1 it is shown that if a function H (p) factors as H (p) = m i=1 h i (p i ), then the general form of the integral ∫(dp H (p)Δ(p)Θ(p) is that of a convolution product of m terms.
Define the Laplace convolution operator ⊗ by Frontiers in Computational Neuroscience www.frontiersin.org Proof. The p i may not be independently integrated since the constraint m i=1 p i = 1 exists. This constraint is crucial for deriving the closed form solution, and is reflected in the explicit definition of the integral dp Δ(p) Θ(p) H (p) = ∞ 0 dp 1 . . . (1 − (p 1 + . . . + p m−1 )).
Define the m variables τ k , k = 1,. . ., m, recursively by τ 1 ≡ m i=1 p i = 1 and τ k ≡ τ k − 1 − p k − 1 . Since τ k = τ 1 − k−1 i=1 p i , our integral may be rewritten as dp Δ(p) Θ(p) H (p) = τ 1 0 dp 1 h 1 (p 1 ) τ 2 0 dp 2 h 2 (p 2 ) . . . Since the convolution operator is both commutative and associative, we can repeat this procedure and write the integral above as Q.E.D. When deriving the Bayesian estimator for the KL-divergence we will need to calculate variants of integrals of the form I [p q 1 1 ln r 1 (p 1 ) . . . p q m m ln r m (p m ), n]. The key to these derivations is the fact that ∂ r n p n = p n ln r (p), which immediately leads to the following. The justification for the interchange of the derivative and integral is provided in Appendix C of Wolpert and Wolf (1995). In using Theorem 4, note that since N = m i=1 n i , we have ∂ n i N = 1. Following the notation of Wolpert and Wolf (1995) we define Φ (n) (z) = Ψ (n − 1) (z) and Φ (n) (z 1 , z 2 ) ≡ Φ (n) (z 1 ) − Φ (n) (z 2 ), where Ψ (n) (z) is the polygamma function ψ (n) (z) = ∂ Proof. Similar to, but far more laborious than, the proof of Theorem 5. This proof is not detailed here.

NOVEL DERIVATION OF THE BAYESIAN ESTIMATOR FOR THE KL-DIVERGENCE
The derivation of the first and second moments for the KL-Divergence according to the Laplace Convolution method of Wolpert and Wolf (1995). We consider a system of m possible states and an associated vector of m probabilities for those states p = (p i ), 1 i m, m i=1 p i = 1. For two multinomial samples, let the total number of count across states be N for each sample and denote the vectors of state counts by n j = (n j (i)), 1 ≤ i ≤ m, j = {1,2} and m i=1 n j (i) = N j , N 1 = N 2 = N . Here we define the KL-divergence as: The derivation requires the evaluation of the following integrals:

Frontiers in Computational Neuroscience
www.frontiersin.org