Spike sorting of heterogeneous neuron types by multimodality-weighted PCA and explicit robust variational Bayes

Takekawa, Takashi; Isomura, Yoshikazu; Fukai, Tomoki

doi:10.3389/fninf.2012.00005

METHODS article

Front. Neuroinform., 19 March 2012

Volume 6 - 2012 | https://doi.org/10.3389/fninf.2012.00005

Takashi Takekawa ¹*
Yoshikazu Isomura ²
Tomoki Fukai ¹*

1. Laboratory for Neural Circuit Theory, RIKEN Brain Science Institute, Wako, Japan
2. Brain Science Institute, Tamagawa University, Machida, Japan

Abstract

This study introduces a new spike sorting method that classifies spike waveforms from multiunit recordings into spike trains of individual neurons. In particular, we develop a method to sort a spike mixture generated by a heterogeneous neural population. Such spike sorting has significant practical value but was previously difficult. The method combines a feature extraction method, which we term “multimodality-weighted principal component analysis” (mPCA), and a clustering method based on variational Bayes for Student's t mixture model (SVB). The performance of the proposed method was compared with that of conventional methods on simulated and experimental data sets. We found that mPCA efficiently extracts highly informative features as clusters that are clearly separable in a relatively low-dimensional feature space. The SVB was implemented explicitly without relying on maximum a posteriori (MAP) inference for the “degree of freedom” parameters. The explicit SVB is faster than the conventional SVB derived with MAP inference and works more reliably over various data sets that include spiking patterns that are difficult to sort. For instance, spikes of a single bursting neuron may be separated incorrectly into multiple clusters, whereas those of a sparsely firing neuron tend to be merged into clusters for other neurons. Our method showed significantly improved performance in spike sorting of these “difficult” neurons. A parallelized implementation of the proposed algorithm (EToS version 3) is available as open-source code at http://etos.sourceforge.net/.

Introduction

Since a vast number of neurons are simultaneously active in the brain, the analysis of action potentials (spikes) of multiple neurons is crucial for uncovering the principles of brain computation. Electrical activity of multiple neurons can be recorded with high temporal resolution using electrodes located outside of neural cell bodies (O'Keefe and Recce, 1993; Wilson and McNaughton, 1993; Fyhn et al., 2007). The extracellularly recorded data contain spikes of many neurons surrounding the tip of the electrodes, and all spike-like signals belonging to a single neuron have to be correctly labeled as activity of the same neuron. This process, known as spike sorting (Lewicki, 1998; Brown et al., 2004; Buzsáki, 2004), consists of three major steps: detecting spike candidates, extracting the features of spikes, and classifying the extracted features (Abeles, 1982; Csicsvari et al., 1998; Wood et al., 2004). Since the classification of redundant high-dimensional data is generally difficult due to the “curse of dimensionality” (Bishop, 2006), we have to extract the features of raw spike data in a low-dimensional space. Principal component analysis (PCA) finds the directions of maximum variance in the data distribution and has often been used for dimensionality reduction. PCA can remove the redundancy in the data since principal components are mutually uncorrelated. However, there is no guarantee that the data are classified into well-separated clusters along the directions of large variance.

Rather, a component useful for the classification is one that exhibits multiple clusters in its distribution. Throughout this paper we use the word “multimodality” to indicate the existence of multiple peaks in data distributions. Several multimodality-based feature extraction methods have been proposed. In these methods, the original waveforms were preprocessed by some means, for instance by wavelet transform (WT) (Hulata et al., 2000; Quian Quiroga et al., 2004; Pavlov et al., 2007), and the multimodality of the preprocessed components was evaluated by the Kolmogorov–Smirnov (KS) test (Quian Quiroga et al., 2004), model evidence (Takekawa et al., 2010) or Shannon's information (Yang et al., 2010). Although these methods can reduce the data dimension by picking up the multimodal components, the redundancy in the data still remains.

Here, we introduce a novel method for feature extraction, namely, multimodality-weighted PCA (mPCA). The mPCA is a class of weighted PCA (Câmara de Macedo et al., 2008) that eliminates the redundant representation of features by emphasizing the informative components. We rescale each component of the data so that its variance coincides with its multimodality and then apply PCA to the rescaled data. The rescaling significantly reduces the influence of components that are distributed unimodally with large variances and enables PCA to obtain uncorrelated components whose distributions are strongly multimodal. We evaluate the multimodality of the feature distribution by performing the KS test to measure the deviation from normality. We compare the performance of mPCA with that of PCA, an improved multimodality pick-up algorithm (mPICK) and Graph Laplacian features (GLF), which project high-dimensional data onto a low-dimensional space while preserving the topological (i.e., clustering) structure of the original data (Belkin and Niyogi, 2003; He and Niyogi, 2004). GLF is a linear mapping, solves the difficulties arising from the non-linearity of Laplacian eigenmaps in model-based clustering (Chah et al., 2011), and exhibits excellent performance in spike sorting (Ghanbari et al., 2011). However, the computational cost of GLF increases drastically with data size. We show that mPCA is computationally much cheaper.

Another difficulty we attempt to overcome is the inaccurate spike sorting of bursting neurons and sparsely firing neurons. The two patterns of firing make contradictory demands on spike sorting. Spikes from a bursting neuron yield broad feature distributions with distorted shapes, which tend to be separated into multiple clusters. In contrast, a sparsely firing neuron yields small clusters in the feature space that may be wrongly merged into clusters belonging to more active neurons. To overcome these difficulties, we explicitly solve a variational Bayes algorithm for Student's t mixture models (SVB) for spike clustering. Namely, we introduce a prior for the degree of freedom (DOF) parameters of Student's t distribution and explicitly evaluate this probability distribution by numerical integration. The conventional implementation of SVB (MAP-SVB) treats the values of the DOF parameters as constants and estimates them by maximum a posteriori (MAP) inference (Svensén and Bishop, 2005; Archambeau and Verleysen, 2007). To show the superiority of explicit SVB over MAP-SVB in the analysis of real physiological data, we also tested our spike-sorting method on spike data obtained by simultaneous extracellular and intracellular recordings (Harris et al., 2000; Henze et al., 2000).

Materials and methods

Figure 1 summarizes the major steps of the algorithms tested in this study: (1) detecting and clipping out spike candidates via amplitude thresholding of a high-pass filtered signal and a window function; (2) applying WT to the spike waveforms; (3) extracting the features of the spike waveforms in the feature space spanned by the wavelet coefficients; (4) classifying the extracted features to identify spikes belonging to single neurons. For comparison, we also tested the methods that do not apply WT and extract features directly from spike waveforms.

Figure 1

Detection of spike candidates and calculation of spike waveforms

Spike detection was performed as in the previous study (Takekawa et al., 2010). After high-pass filtering raw signals, spikes were detected by amplitude thresholding. The high-pass filter was designed to subtract Gaussian smoothed signals from the raw signals. The threshold was set to μ_robust [h(t)]−f_thr σ_robust [h(t)], where h(t) is the high-pass filtered signal, f_thr is the threshold factor and μ_robust, σ_robust are robust estimates of the average and the standard deviation, respectively (Hoaglin et al., 1983; Quian Quiroga et al., 2004; Takekawa et al., 2010).
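The thresholding step described above can be sketched as follows. This is a minimal illustration, not the EToS implementation: it assumes the median for μ_robust and the scaled median absolute deviation (1.4826 × MAD) for σ_robust, since the text only states that robust estimates are used, and the smoothing width `smooth_sigma` (in samples) is a hypothetical parameter.

```python
import numpy as np

def gaussian_highpass(raw, smooth_sigma):
    """High-pass filter: subtract a Gaussian-smoothed copy from the raw signal."""
    half = int(4 * smooth_sigma)
    t = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (t / smooth_sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(raw, half, mode="reflect")   # avoid edge shrinkage
    smooth = np.convolve(padded, kernel, mode="valid")
    return raw - smooth

def robust_mean_std(h):
    """Assumed robust estimators: median and scaled MAD."""
    mu = np.median(h)
    sigma = 1.4826 * np.median(np.abs(h - mu))
    return mu, sigma

def detect_spikes(h, f_thr):
    """Return indices of local minima below mu_robust - f_thr * sigma_robust."""
    mu, sigma = robust_mean_std(h)
    threshold = mu - f_thr * sigma
    mid = h[1:-1]
    is_peak = (mid < threshold) & (mid <= h[:-2]) & (mid < h[2:])
    return np.flatnonzero(is_peak) + 1, threshold
```

The sign of the threshold (mean minus a multiple of the standard deviation) means that this sketch looks for negative-going deflections of the filtered signal.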

For each detected spike candidate, we interpolated the discrete waveform around the peak with a quadratic spline and determined the precise spike timing as the peak of the interpolated curve. A spike in general exhibits slightly different peak times at different channels. To avoid detecting the same spike more than once, waveforms detected within a time window of 0.5 msec were regarded as the same spike. Then, we resampled the filtered signal at the same sampling rate as the filtered data in the range of discrete times [−τ₁ : τ₂] while applying a window function, where τ = 0 refers to the precise spike timing. The window function is defined in terms of the normal distribution with a parameter s, where s = τ₁ if τ < 0 and s = τ₂ otherwise. We will determine adequate values of these time constants later.
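As a rough illustration of the timing refinement and duplicate removal, the sketch below uses a three-point parabola through the peak sample and its neighbours as a simplified stand-in for the quadratic spline, and keeps the earliest detection within each window. Both choices are assumptions; the text does not specify which of several near-coincident detections is retained.

```python
import numpy as np

def refine_peak(h, i):
    """Sub-sample peak position from a parabola through samples i-1, i, i+1."""
    y0, y1, y2 = h[i - 1], h[i], h[i + 1]
    denom = y0 - 2.0 * y1 + y2
    if denom == 0.0:          # flat neighbourhood: keep the integer index
        return float(i)
    return i + 0.5 * (y0 - y2) / denom

def merge_close(times, window):
    """Keep a detection only if it is more than `window` after the last kept one."""
    kept = []
    for t in sorted(times):
        if not kept or t - kept[-1] > window:
            kept.append(t)
    return kept
```

With times expressed in msec, `merge_close(times, 0.5)` corresponds to the 0.5 msec window mentioned above.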

Feature extraction

We applied mPCA with the KS test for normality to the wavelet coefficients for feature extraction. The wavelet coefficients are calculated by multi-resolution analysis with the Cohen-Daubechies-Feauveau 9/7 (CDF97) wavelet (Cohen et al., 1992; Daubechies, 1992; Takekawa et al., 2010). The multi-resolution analysis is analogous to the discrete Fourier transform and transforms data in the time domain into time-frequency coefficients while preserving the data dimension. To evaluate the performance of the method, we applied PCA, GLF, mPICK, and mPCA to the data set of resampled waveforms or to the wavelet coefficients of the waveforms. Below we outline the frameworks of these feature extraction algorithms.

Principal component analysis

The algorithm of PCA is well described in the literature and is only briefly reviewed here (Bishop, 2006). The original D-dimensional data X = {x_n}^N_{n = 1} is reduced to D′-dimensional data through the linear transformation V^T X^C, where X^C = {x_n − E[x]}^N_{n = 1}, and the projection matrix V is constructed from the eigenvectors corresponding to the largest D′ eigenvalues λ₁ ≥ λ₂ ≥ ··· ≥ λ_D′ of the covariance matrix of X^C. The data points exhibit the largest D′ variances in the D′-dimensional subspace thus obtained.
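The projection just described can be sketched in a few lines of NumPy. Note one convention change made for convenience: the data matrix here has one row per data point (shape N × D), so the projection V^T X^C becomes `Xc @ V`. The function name is illustrative only.

```python
import numpy as np

def pca_project(X, n_components):
    """X has shape (N, D). Returns the (N, D') projection and the top eigenvalues."""
    Xc = X - X.mean(axis=0)                     # centred data X^C
    cov = np.cov(Xc, rowvar=False)              # D x D covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_components]
    V = eigvec[:, order]                        # eigenvectors of the largest eigenvalues
    return Xc @ V, eigval[order]
```

The sample variance of each projected column equals the corresponding eigenvalue, which is the sense in which the subspace captures the largest D′ variances.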

Graph Laplacian features

Below, the definition and derivation of GLF are briefly reviewed. Details are found in Ghanbari et al. (2011). As in the case of PCA, the original D-dimensional data set X = {x_n}^N_{n = 1} is reduced to a D′-dimensional data set through the transformation Y = A^T X, where A = {a_d}^D′_{d = 1} and a_d is a D-dimensional vector. It is desirable for classification that neighboring points in the original D-dimensional space remain close to each other after projection to the low-dimensional space (He and Niyogi, 2004).

Such a projection A can be obtained by solving a minimization problem in which W is a weight matrix and Y = {y_n}^N_{n = 1} is the reduced data set. Data points i and j are connected by an edge if i is among the K-nearest neighbors of j, or vice versa. The weight of the edge connecting these points is set as W_ij = exp(−|x_i − x_j|²/t). If the two points are not among the K-nearest neighbors of one another, W_ij = 0. The scaling parameter t is defined in terms of the most distant point amongst the K-nearest neighbors of each x_i. We used K = 5 in this paper.
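The construction of the weight matrix W can be sketched as below. The scaling parameter t is passed in as an argument rather than computed, because its defining formula is not reproduced here; the function name and the row-per-point layout are likewise illustrative assumptions.

```python
import numpy as np

def knn_heat_weights(X, k, t):
    """X has shape (N, D). Symmetric K-nearest-neighbour graph with heat-kernel weights."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared pairwise distances
    d2 = np.maximum(d2, 0.0)                           # clip round-off negatives
    d2_noself = d2.copy()
    np.fill_diagonal(d2_noself, np.inf)                # a point is not its own neighbour
    nn = np.argsort(d2_noself, axis=1)[:, :k]          # k nearest neighbours per point
    n = X.shape[0]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nn.ravel()] = True
    adj = adj | adj.T                                  # edge if i in kNN(j) "or vice versa"
    return np.where(adj, np.exp(-d2 / t), 0.0)
```

This dense version stores an N × N matrix, which is one way to see why the cost of graph-based features grows quickly with the number of spikes.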

It is possible to rewrite the minimization problem as the following eigenvalue problem (Ghanbari et al., 2011): where with B and Q being N × N matrices defined as,

The projection matrix A = {a_d}^D′_{d = 1} is constructed from the eigenvectors corresponding to the largest D′ eigenvalues λ₁ ≥ λ₂ ≥ ··· ≥ λ_D′ of the matrix B.

Multimodality pickup

If the values of some components are distributed with multiple peaks, we may use these components to separate a large number of clusters in the data. In mPICK, we picked up the wavelet coefficients that are distributed with multiple peaks by employing the KS test for normality (Press et al., 1992; Quian Quiroga et al., 2004), which evaluates the deviation of a given distribution from the normal (unimodal) distribution. Namely, for a given one-dimensional data set x, the KS test uses the maximum absolute difference between the cumulative distribution function (CDF) of the normalized data x′ and that of the standard normal distribution, Φ: M_L[x] = max_y |CDF_x′(y) − Φ(y)|.

We select the components corresponding to D′ largest values of M_L [x]. Note that we use the robust statistical estimation for the mean and variance of the normalized data in order to minimize the effect of outliers. When we use mPICK without WT, we apply KS test to the distribution of the values at each time point of all the detected spike waveforms and pick up the time points that yield large multimodality. It is noted that the redundancy can be generally large in the features extracted by mPICK.
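A minimal sketch of the multimodality score and the pick-up step is given below. The robust normalization uses the median and the scaled median absolute deviation, which is an assumption (the text only says that robust estimates of the mean and variance are used), and the function names are illustrative.

```python
import math
import numpy as np

def ks_multimodality(x):
    """Max |empirical CDF of robustly normalized x - standard normal CDF|."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    sigma = 1.4826 * np.median(np.abs(x - mu))   # assumed robust scale (must be > 0)
    z = np.sort((x - mu) / sigma)
    n = z.size
    phi = 0.5 * (1.0 + np.array([math.erf(v / math.sqrt(2.0)) for v in z]))
    upper = np.arange(1, n + 1) / n - phi        # empirical CDF just after each point
    lower = phi - np.arange(0, n) / n            # empirical CDF just before each point
    return float(max(upper.max(), lower.max()))

def mpick(X, n_components):
    """X has shape (N, D). Keep the D' columns with the largest multimodality."""
    scores = np.array([ks_multimodality(X[:, d]) for d in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:n_components]
    return X[:, keep], keep, scores
```

Because columns are selected independently of one another, two strongly correlated multimodal columns can both be kept, which is the redundancy noted above.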

Multimodality-weighted PCA

To reduce the redundancy, we scale the data points in each component dimension so that the variance of the scaled data along that dimension coincides with its multimodality. This scaling emphasizes the multimodality of the data distribution and dramatically increases the chance of detecting components showing strong multimodality among components with large variances. We define the procedure of mPCA explicitly as follows. Each component x_d of the data set (d = 1,…, D) is scaled using the multimodality M_L[x_d] defined in the previous section, giving the scaled data set X^M, which is then reduced to D′-dimensional data through P^T X^M. The projection matrix P is constructed from the eigenvectors corresponding to the largest D′ eigenvalues λ₁ ≥ λ₂ ≥ ··· ≥ λ_D′ of the covariance matrix of X^M.
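The following sketch takes the stated goal literally: each column is standardized and multiplied by the square root of its multimodality, so that its variance equals M_L[x_d], and ordinary PCA is then applied. The exact scaling formula, the use of the ordinary (rather than robust) mean and standard deviation in the scaling, and the median/MAD normalization inside the KS score are all assumptions of this illustration.

```python
import math
import numpy as np

_norm_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))

def ks_multimodality(x):
    """KS distance between robustly normalized x and the standard normal."""
    mu = np.median(x)
    sigma = 1.4826 * np.median(np.abs(x - mu))
    z = np.sort((x - mu) / sigma)
    n = z.size
    phi = _norm_cdf(z)
    return float(max((np.arange(1, n + 1) / n - phi).max(),
                     (phi - np.arange(n) / n).max()))

def mpca(X, n_components):
    """X has shape (N, D). Returns projected data, projection matrix P, and scores."""
    m = np.array([ks_multimodality(X[:, d]) for d in range(X.shape[1])])
    # Assumed scaling: unit variance first, then variance set to the multimodality.
    Xm = np.sqrt(m) * (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    cov = np.cov(Xm, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]
    P = eigvec[:, order]
    return Xm @ P, P, m
```

On data with one wide unimodal column and one narrow bimodal column, plain PCA would favour the wide column, whereas this rescaling makes the bimodal column dominate the leading component.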

Clustering with variational Bayes for Student's t mixture model

The optimal number of components in a mixture model can be determined by several criteria, including Akaike's information criterion (Akaike, 1974), the Bayesian information criterion (Schwarz, 1978), minimum description length (Rissanen, 1978) or minimum message length (Wallace and Boulton, 1968; Agusta and Dowe, 2002). Then, for a given number of components, we may estimate the optimal values of the model's parameters by the maximum likelihood method implemented by the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Alternatively, Bayesian inference treats the model's parameters as probabilistic variables and calculates their probability distributions (Bernardo and Smith, 1994). Furthermore, variational Bayes (VB) algorithms provide EM-like methods to calculate the lower bound of the model evidence, i.e., the free energy, for Gaussian mixture models (Attias, 1999) and Student's t mixture models (Svensén and Bishop, 2005; Archambeau and Verleysen, 2007). VB for Student's t mixture models (SVB) exhibited excellent model selection performance in spike sorting (Takekawa et al., 2010). Below, we outline the framework of our SVB method. The mathematical details of the SVB algorithm are found in Takekawa and Fukai (2009).

Statistical models and parameters

Student's t distributions have long tails compared with Gaussian distributions and hence are frequently used for modeling data containing outliers. This is the case for spike sorting, since multiunit recordings detect a number of noisy spikes from distant neurons. Student's t distribution can be written in terms of normal and Gamma distributions as an integral over a scaling variable u, where x is a D-dimensional data point. The parameters ν, μ, and S are the DOF parameter, the component mean vector and the component precision matrix, i.e., the inverse of the covariance matrix, respectively. Normal and Gamma distributions are defined in Section “Distributions.” Student's t distribution is thus a mixture of infinitely many normal distributions with the same mean. The scaling parameter u for the precision S depends on the parameter ν through the Gamma distribution, and a smaller value of ν corresponds to a heavier tail of the distribution.
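For reference, the representation of Student's t distribution as a mixture of infinitely many normal distributions is commonly written in the following standard form (see, e.g., Bishop, 2006). The symbol 𝒯, the shape-rate convention for the Gamma distribution 𝒢, and the choice to write the second argument of 𝒩 as a covariance are notational assumptions made here and may differ from the conventions of Section “Distributions.”

```latex
\mathcal{T}(x \mid \nu, \mu, S)
  = \int_{0}^{\infty}
      \mathcal{N}\!\left(x \,\middle|\, \mu, (uS)^{-1}\right)\,
      \mathcal{G}\!\left(u \,\middle|\, \tfrac{\nu}{2}, \tfrac{\nu}{2}\right)\, du
```

In this form the Gamma distribution has mean one for any ν and concentrates around u = 1 as ν grows, so the distribution approaches a normal distribution with precision S, while small ν gives appreciable weight to small u, i.e., to broad normal components and hence heavy tails.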

Our mixture model is described as a weighted sum of Student's t distributions: where M is the number of clusters and θ = {α_m, ν_m, μ_m, S_m}^M_{m = 1} represents the remaining model parameters. The weights α_m are non-negative, and the parameters ν_m, μ_m, and S_m stand for the DOF, mean, and precision matrix of the m-th cluster, respectively. Introducing the latent label variables z = {z_m}^M_{m = 1} and the latent scaling variables u = {u_m}^M_{m = 1}, we can rewrite Student's t mixture model as a latent variable model:

The variable z_m is unity if the data point belongs to the m-th cluster and is zero otherwise. Therefore, z_m ∈ {0, 1} and only a single component of z can take a non-vanishing value. The variable u_m is necessary to analytically treat Student's t distribution in VB clustering. For a set of observations X = {x_n}^N_{n = 1}, the sets of variables Z = {z_n}^N_{n = 1} and U = {u_n}^N_{n = 1} are called “latent variables”, where N represents the number of data points and the m-th component of z_n (u_n), i.e., z_nm (u_nm), stands for z_m (u_m) for the n-th data point. The latent variables are not direct observables but are inferred through a statistical model from other observed variables. The latent variables generally represent the degree to which variables move together. Hence, they play a crucial role in clustering of statistical data.
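One way to build intuition for the latent variables is to sample from the generative model: draw a label z from the weights α, draw a scale u from a Gamma distribution governed by ν, and then draw x from a normal distribution whose precision is u times S. The sketch below assumes the standard scale-mixture form of Student's t distribution (Gamma with shape ν/2 and rate ν/2); the function name and argument layout are illustrative.

```python
import numpy as np

def sample_t_mixture(n, alpha, nu, mu, S, rng):
    """Draw n points from a Student's t mixture via its latent variables.

    alpha: (M,) weights; nu: (M,) DOF; mu: (M, D) means; S: (M, D, D) precisions.
    """
    alpha = np.asarray(alpha, dtype=float)
    nu = np.asarray(nu, dtype=float)
    mu = np.asarray(mu, dtype=float)
    M, D = mu.shape
    labels = rng.choice(M, size=n, p=alpha)                 # latent label z
    # Gamma(shape = nu/2, rate = nu/2) corresponds to scale = 2/nu in NumPy.
    u = rng.gamma(shape=nu[labels] / 2.0, scale=2.0 / nu[labels])
    chol = np.linalg.cholesky(np.linalg.inv(S))             # factors of S^{-1}
    eps = rng.standard_normal((n, D))
    # x | z, u ~ Normal(mu_z, (u S_z)^{-1})
    x = mu[labels] + np.einsum("nij,nj->ni", chol[labels], eps) / np.sqrt(u)[:, None]
    return x, labels, u
```

Points drawn with small u lie far from their cluster mean, which is how the model accommodates outliers without inflating the cluster's precision matrix.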

VB calculations for Student's t mixture models

VB is a general technique for approximating the posterior probability distribution of continuous variables. It calculates an approximate distribution of the posterior, assuming that the parameter variables and the latent variables are mutually independent. This assumption significantly reduces the cost of computations. Thus, in VB, we alternately update the probability distributions of the parameters and the latent variables independently for a given prior distribution. In this study, we employ factorized priors: a Dirichlet distribution for the weights α_m, an exponential distribution for the DOF parameters ν_m, and a normal-Wishart distribution for the means μ_m and precision matrices S_m, with {κ₀, ξ₀, η₀, γ₀, μ₀, Σ₀} being the hyper parameters of the prior (see Section “Distributions”).

Introducing a test distribution function q_M (Z, U, θ) to approximate the posterior p(Z, U, θ|X, M) and assuming a factorization approximation q_M(Z, U, θ) = q_M (Z, U)q_M(θ), we can describe the test function for model parameters q_M(θ) and latent variables q_M(Z, U) by hyper parameters and respectively: where

We can update these test functions by using an EM-like iterative procedure.

In the M-step, the hyper parameters for model parameters are updated using the data X and the current fixed hyper parameters for latent variables where and

In the E-step, are updated using fixed obtained in the previous M-step. where

Since the range of the integrations is from zero to infinity, on-demand evaluation of these integrals at every step of the EM algorithm is quite time consuming. To avoid the heavy calculations, we may fix ν_m at the constant values estimated by MAP inference. Alternatively, here we explicitly treat ν_m as probabilistic variables and calculate the integrations by interpolating their values from a numerical table calculated a priori with Mathematica version 7 (Wolfram Research, Inc., Champaign, IL, 2008).

Model evidence and iterative algorithm

Using the ρ_nm calculated in the E-step, we can evaluate the model evidence as

The variable ρ_nm represents the likelihood that the n-th data point belongs to the m-th cluster. Therefore, the sum ∑^M_{m = 1} ρ_nm represents the degree to which the data point is described by the mixture model.

The reduced D′-dimensional data was decorrelated and renormalized before VB clustering so that μ_robust and σ_robust are zero and unity, respectively. To reduce the effect of initial conditions, we preprocessed the data by k-means clustering (MacQueen, 1967) with a sufficiently large number of clusters and used the resultant clusters as initial conditions for VB clustering. We then iterated the E and M steps until (F^new − F^old)/N < 10⁻⁶ was satisfied and eliminated a cluster if its size or its variance was small or if it yielded a negative contribution to F. We can calculate the contribution to F of cluster m as
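The stopping rule can be illustrated with a generic driver loop. The callable `step` is a hypothetical placeholder that performs one combined E and M update and returns the current free energy F; the update equations themselves and the cluster-elimination logic are not reproduced in this sketch.

```python
def iterate_until_converged(step, n_points, tol=1e-6, max_iter=1000):
    """Repeat step() until the per-point gain in free energy drops below tol.

    Returns the final free energy and the number of iterations after the first.
    """
    f_old = step()
    for it in range(1, max_iter + 1):
        f_new = step()
        if (f_new - f_old) / n_points < tol:   # (F_new - F_old) / N < 1e-6
            return f_new, it
        f_old = f_new
    return f_old, max_iter
```

Dividing the gain by N makes the tolerance independent of the number of spikes, so the same threshold can be used for data sets of different sizes.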

Many of the initial clusters were rapidly eliminated according to the criteria. Since most of the terms necessary for these evaluations appear in the calculations at E-step, no additional computational cost arises.

Each penalty term can be further calculated as

Distributions

The normal, Gamma, Dirichlet, exponential, Wishart and normal-Wishart distributions used in the text are defined as follows, respectively:

Data set and numerical methods

We compared the performance of the proposed algorithm with that of other methods. To this end, we used publicly available data sets of numerically simulated multiunit spike trains (Quian Quiroga et al., 2004; data sets are available at http://www2.le.ac.uk/departments/engineering/research/bioengineering/neuroengineering-lab/spike-sorting). The merit of this database is that the correct answers to spike sorting and the levels of difficulty are known for all the data sets. We employed the most difficult data sets, C_Easy2_noise20 [Ex2(0.20)], C_Difficult1_noise20 [Ex3(0.20)], and C_Difficult2_noise20 [Ex4(0.20)], in this study. All data sets contain spikes from three simulated neurons (see Figure 2). To obtain noisy signals, averaged spike waveforms with various amplitudes were added to each spike train at random times. In each data set, the standard deviation of the noise was varied between 5 and 20% of the peak spike amplitudes. The simulated neural activity exhibits a firing rate of 20 Hz and a refractory period of 2 msec. The sampling rate of all simulated data was assumed to be 24 kHz.

Figure 2

We also used experimental data obtained by simultaneous extracellular and intracellular recordings (Harris et al., 2000; Henze et al., 2000; data sets are available at http://crcns.org/data-sets/hc/hc-1). In these data, the correct sequence of spikes is known at least for the single neuron recorded intracellularly, which implies that the correct answers to spike sorting are already partially known. We employed two different data sets, d11222.001 and d14521.001, in this study because the intracellularly recorded neuron exhibited burst firing in d11222.001 and generated only 181 spikes during the whole recording period in d14521.001. The data sets were recorded at 20 kHz.

We implemented our spike sorting algorithms in C++ with linear algebra routines from the LAPACK library (http://www.netlib.org/lapack/) and OpenMP parallelization (http://www.openmp.org/). The program was compiled with the Intel Compiler using the LAPACK implementation of the Math Kernel Library (Intel Corp.) and executed in a Mac OS X environment (Mac Pro; 2 × 2.93 GHz Quad-Core Intel Xeon; Apple Inc.).

Results

The proposed method was tested on simulated and experimental data, and the results were compared with those of other methods.

Detection of spike candidates

In spike detection, we used f_thr = 3 in thresholding for the simulated data and f_thr = 4 for the simultaneous intracellular-extracellular recording data. Figure 2 shows examples of the spikes detected in the simulated data, in which spikes belonging to three different neurons are marked by different symbols. Note that the correct answers are known for the artificial spike data. Spikes from different neurons were sometimes detected as a single spike (synchronized spikes) if their temporal locations were close to each other (see S in Figure 2). While true spikes were rarely missed (i.e., almost no false negatives), noisy signals were sometimes detected as spurious spikes (false positives: see FP in Figure 2).

Each simulated data set contains spike trains of three neurons and noisy spikes, and we used the artificial spike data simulated at the highest noise level. Our method detected 3973 (Ex2), 3883 (Ex3), and 3916 (Ex4) candidate spikes in the artificial data sets, which contain 3526, 3414, and 3493 correct spikes, respectively. The numbers of false positive, false negative, and synchronized spikes were 530, 6, and 77 (Ex2), 540, 0 and 72 (Ex3), and 496, 1 and 70 (Ex4), respectively, in these data sets. We obtained about 14,000 spike candidates in each data set of the simultaneous intracellular-extracellular recordings.