Modern machine learning outperforms GLMs at predicting spikes

Neuroscience has long focused on finding encoding models that effectively ask “what predicts neural spiking?” and generalized linear models (GLMs) are a typical approach. It is often unknown how much of explainable neural activity is captured, or missed, when fitting a GLM. Here we compared the predictive performance of GLMs to three leading machine learning methods: feedforward neural networks, gradient boosted trees (using XGBoost), and stacked ensembles that combine the predictions of several methods. We predicted spike counts in macaque motor (M1) and somatosensory (S1) cortices from standard representations of reaching kinematics, and in rat hippocampal cells from open field location and orientation. In general, the modern methods (particularly XGBoost and the ensemble) produced more accurate spike predictions and were less sensitive to the preprocessing of features. This discrepancy in performance suggests that standard feature sets may often relate to neural activity in a nonlinear manner not captured by GLMs. Encoding models built with machine learning techniques, which can be largely automated, more accurately predict spikes and can offer meaningful benchmarks for simpler models.


A central tool of neuroscience is the tuning curve, which maps aspects of external stimuli to neural responses. The tuning curve can be used to determine what information a neuron encodes in its spikes. For a tuning curve to be meaningful, it is important that it accurately predicts the neural response. Often, however, methods are chosen that sacrifice accuracy for simplicity. Since inaccurate methods may systematically miss aspects of the neural response, any choice of predictive method should be compared with methods that predict as accurately as possible.

A common predictive model for neural data is the Generalized Linear Model (GLM) (1)(2)(3)(4). The GLM performs a nonlinear operation upon a linear combination of the input features, which are often called external covariates. Typical covariates are stimulus features, movement vectors, or the animal's location, and may include covariate history or spike history. In the absence of history terms, the GLM is also referred to as a linear-nonlinear Poisson (LN) cascade. The nonlinear operation is usually held fixed, though it can be learned (5, 6), and the linear weights of the combined inputs are chosen to maximize the agreement between the model fit and the neural recordings. This optimization problem of weight selection is convex, allowing a global optimum, and can be solved with efficient algorithms (7). The assumption of Poisson firing statistics can often be loosened (8) as well, allowing the modeling of a broad range of neural responses. Due to its ease of use, perceived interpretability, and flexibility, the GLM has become a popular model of neural spiking.

The GLM's central assumption is that the neural function is linear in the space of inputs (i.e., prior to the nonlinearity). The GLM thus cannot learn arbitrary multi-dimensional functions of the inputs. While this assumption may hold in certain cases (8, 9), neural responses can in general be very nonlinear (5, 10). When the nonlinearity is such that it cannot be captured by a scalar link function, a GLM will poorly predict neural activity and will learn misleading feature weights (as the optimal weight on one input will depend on the values of other inputs). To avoid this situation it is common to mathematically transform the features and thus obtain a new set that yields better spike predictions and thus better matches the linearity assumption of the GLM. In keeping with the machine learning literature, we call this step of choosing inputs feature engineering. When features have been found that are exactly linear with respect to neural activity after the link function, a GLM trained upon them will match or outperform more nonlinear methods. A GLM should thus be compared with benchmark methods (rather than iterating on features based on GLM fit quality alone). To check whether guessed features are indeed linear with respect to neural activity prior to the nonlinearity, it is important that GLMs be compared with general nonlinear models that can express more complex stimulus-response relationships.

Machine learning (ML) methods for regression have improved dramatically since the invention of the GLM. Many ML methods require little feature engineering and do not need to assume linearity. Top-performing methods (as judged by the frequency of winning solutions on Kaggle, a ML competition website (11)) include neural networks (12), gradient boosted trees (13), and ensemble techniques. These methods are now relatively easy to implement in a few lines of code in a scripting language such as Python, enabled by well-supported machine learning packages such as scikit-learn (14), Keras (15), Theano (16), and XGBoost (13). The greatly increased predictive power of modern ML methods is now very accessible and could improve the state of the art in encoding models across neuroscience.
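As a rough illustration of this accessibility (not the exact architectures or hyperparameters used in this study), the snippet below fits a Poisson GLM, a small feedforward network, and gradient boosted trees to the same hypothetical design matrix X and spike-count vector y, each in a few lines. The Keras import path reflects current packaging (tensorflow.keras) rather than the standalone Keras/Theano stack cited above.

```python
import numpy as np
import xgboost as xgb
from sklearn.linear_model import PoissonRegressor
from tensorflow import keras

# Hypothetical data: one row of covariates and one spike count per 50-ms bin.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
y = rng.poisson(np.exp(0.3 * X[:, 0] - 0.2 * X[:, 1] ** 2))

# GLM: linear weights passed through an exponential inverse link.
glm = PoissonRegressor().fit(X, y)

# Feedforward neural network with an exponential output to keep rates positive.
nn = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="exponential"),
])
nn.compile(optimizer="adam", loss="poisson")
nn.fit(X, y, epochs=5, batch_size=128, verbose=0)

# Gradient boosted trees with a Poisson objective appropriate for count data.
gbt = xgb.XGBRegressor(objective="count:poisson", n_estimators=200, max_depth=3)
gbt.fit(X, y)
```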

Here we applied several ML methods, including artificial neural networks, gradient boosted trees, and ensembles, to the task of spike prediction and evaluated their performance alongside a GLM. We compared the methods on data from three separate brain areas. These areas differed greatly in the effect size of covariates and in their typical spike rates, and thus served to evaluate the strengths of these methods across different conditions. In each area we found that the tested ML methods could more accurately predict spiking than the GLM with typical feature choices, demonstrating both the power of these methods and the extent of input nonlinearity that the GLM could not account for. Tuning curves built for these features with a GLM would thus not capture the full nature of neural activity. We provide our implementing code in an accessible format so that all neuroscientists may easily test and compare these methods on other datasets.

Data
We tested our methods at predicting spikes for neurons in the macaque primary motor cortex, the macaque primary somatosensory cortex, and the rat hippocampus. All animal use procedures were approved by the relevant institutional animal care and use committees. The macaque motor cortex data consisted of previously published electrophysiological recordings from 82 neurons in the primary motor cortex (M1) (17). The neurons were sorted from recordings made during a two-dimensional center-out reaching task with eight targets. In this task the monkey grasped the handle of a planar manipulandum that controlled a cursor on a computer screen and simultaneously measured the hand location and velocity (Fig. 1). After training, an electrode array was implanted in the arm area of area 4 on the precentral gyrus.

Spikes were discriminated using Offline Sorter (Plexon, Inc.), then counted and collected in 50-ms bins. The neural recordings used here were taken from a single session lasting around 13 minutes.

The macaque primary somatosensory cortex (S1) data was recorded during a two-dimensional random-pursuit reaching task and was previously unpublished. In this task, the monkey gripped the handle of the same manipulandum. The monkey was rewarded for bringing the cursor to a series of randomly positioned targets appearing on the screen. After training, an electrode array was implanted in the arm area of area 2 on the postcentral gyrus, which receives a mix of cutaneous and proprioceptive afferents. Spikes were processed as for M1. The data used for this publication derives from a single recording session lasting 51 minutes.

As with M1 (described in Results), we processed the hand position, velocity, and acceleration accompanying the S1 recordings in an attempt to obtain linearized features. The hand position and velocity components (x, y, v_x, v_y) were found to be the most successful features for the GLM. Since cells in the arm area of S1 have been shown to have approximately sinusoidal tuning curves relating to movement direction (18), we also tested the same feature transformations as were performed for M1 but did not observe any increase in predictive power.
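For concreteness, a sketch of how such a kinematic design matrix might be assembled from binned hand-position samples; the function and array names are ours, not from the published pipeline.

```python
import numpy as np

BIN_S = 0.05  # 50-ms bins, as used for the M1 and S1 preprocessing

def kinematic_features(position):
    """Build per-bin covariates from hand position sampled once per bin.

    `position` is a hypothetical (n_bins, 2) array of x, y coordinates.
    Returns the four-column matrix [x, y, v_x, v_y]; direction and speed are
    also returned since later analyses use them as engineered features.
    """
    velocity = np.gradient(position, BIN_S, axis=0)          # finite-difference v_x, v_y
    speed = np.linalg.norm(velocity, axis=1)
    direction = np.arctan2(velocity[:, 1], velocity[:, 0])   # radians in (-pi, pi]
    X = np.column_stack([position, velocity])
    return X, direction, speed
```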

The third dataset consists of recordings from 58 neurons in the CA1 region of the rat dorsal hippocampus during an open-field task in which the rat's location and head orientation were tracked.

Treatment of Spike and Covariate History
We slightly modified our data preparation methods for spike prediction when spike and covariate history terms were included as regressors (Fig. 6). To construct spike and covariate history filters, we convolved 10 raised cosine bases (built as in (22)) with the binned spikes and covariates. The longest temporal basis included times up to 250 ms before the time bin being predicted. This process resulted in 120 total covariates per sample (10 current covariates, 100 covariate temporal filters, and 10 spike history filters). We predicted spikes in 5-ms bins (rather than 50 ms) to allow for the modeling of more precise time-dependent phenomena, such as refractory effects. The cross-validation scheme also differs from the main analysis of this paper, as using randomly selected splits of the data would result in the appearance in the test set of samples that were in history terms of training sets, potentially resulting in overfitting. We thus employed a cross-validation routine that splits the data continuously in time, assuring that no test-set sample has appeared in any form in the training sets.

Neural networks are well known for their success at supervised learning tasks; more comprehensive reviews can be found elsewhere (12). Here we implemented a simple feedforward neural network and, for the analysis with history terms, an LSTM, a recurrent neural network architecture that allows the modeling of time dependencies on multiple time scales (29). We point out that a feedforward neural network with no hidden layers is equivalent in mathematical form to a GLM.

XGBoost builds its prediction as a sum of regression trees f_k. In practice, it is simpler to choose the functions f_k via gradient boosting, which minimizes a second-order approximation of the loss function (34).
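The following sketch shows one way to build such history features. It uses a simplified, linearly spaced raised cosine basis rather than the exact (log-stretched) bases of (22), and the bin counts and array names are illustrative only.

```python
import numpy as np

def raised_cosine_basis(n_bases=10, window_bins=50):
    """Raised cosine bumps tiling `window_bins` history bins (here 50 x 5 ms = 250 ms).

    Simplified, linearly spaced variant; (22) spaces the bumps on a log-time axis.
    """
    t = np.arange(window_bins)
    centers = np.linspace(0, window_bins - 1, n_bases)
    width = centers[1] - centers[0]
    arg = np.clip((t[None, :] - centers[:, None]) * np.pi / (2 * width), -np.pi, np.pi)
    return 0.5 * (1 + np.cos(arg))                      # shape (n_bases, window_bins)

def history_features(signal, bases):
    """Convolve one binned signal (spike counts or a covariate) with each basis,
    using only strictly past bins so the predicted bin is never leaked."""
    feats = []
    for b in bases:
        causal = np.convolve(signal, b, mode="full")[: len(signal)]
        feats.append(np.roll(causal, 1))                # shift by one bin: history only
    X = np.stack(feats, axis=1)
    X[0, :] = 0.0                                       # undo the wrap-around from roll
    return X

# Usage: 10 history columns per signal; stacking the current covariates, the
# filtered covariates, and the filtered spike train gives the 120-column design.
bases = raised_cosine_basis()
spikes = np.random.default_rng(0).poisson(0.1, size=2000)   # placeholder 5-ms bins
spike_history = history_features(spikes, bases)              # (2000, 10)
```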

XGBoost offers several additional parameters to optimize performance and prevent overfitting. Many of these describe the training criteria for each tree. We optimized some of these parameters for a single neuron in each dataset using Bayesian optimization (again over a validation set different from the final test set). These parameters included the number of trees to train, the maximum depth of each decision tree, the minimum weight allowed on each decision leaf, the data subsampling ratio, and the minimum gain required to create a new decision branch.

The pseudo-R2 compares each model's log-likelihood against that of a null model that predicts the mean firing rate. The null model can be replaced by another fitted model (e.g. the GLM) to directly compare two methods, in which case we refer to the score as the 'comparative pseudo-R2'. The comparative pseudo-R2 is referred to elsewhere as the 'relative pseudo-R2', renamed here to avoid confusion with the difference of two standard pseudo-R2 scores both measured against the mean (26).

We used 8-fold cross-validation (CV) when assigning a final score to the models. The input and spike data were split into eight non-overlapping folds, with each fold serving once as the held-out test set for models trained on the remaining data.
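A sketch of this scoring, under the assumption (consistent with the description above and with (26)) that the pseudo-R2 is the deviance-based score built from Poisson log-likelihoods; passing another model's predictions as the reference yields the comparative pseudo-R2. The function names are ours.

```python
import numpy as np
from scipy.special import gammaln

def poisson_log_likelihood(y, rate):
    """Total Poisson log-likelihood of spike counts y under predicted rates."""
    rate = np.clip(rate, 1e-9, None)
    return np.sum(y * np.log(rate) - rate - gammaln(y + 1))

def pseudo_r2(y, y_model, y_reference=None):
    """Pseudo-R2 of a model's predicted rates.

    With y_reference=None the reference is the mean rate (standard pseudo-R2);
    pass another model's predictions (e.g. the GLM's) for the comparative score.
    """
    if y_reference is None:
        y_reference = np.full(len(y), y.mean())
    ll_model = poisson_log_likelihood(y, y_model)
    ll_reference = poisson_log_likelihood(y, y_reference)
    ll_saturated = poisson_log_likelihood(y, y)      # model that predicts y exactly
    return 1.0 - (ll_saturated - ll_model) / (ll_saturated - ll_reference)
```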

We applied several machine learning methods to predict spike counts in three brain regions and compared the quality of the predictions to those of a GLM. Our primary analysis centered on neural recordings from the macaque primary motor cortex (M1) during reaching (Fig. 1). We examined the methods' relative performance on several feature sets.

To test that all methods work reasonably well in a trivial case, we trained each to predict spiking from a simple, engineered representation of movement: the cosine and sine of the hand velocity direction (a linear combination of a sine and cosine curve is a phase-shifted cosine, by identity, allowing the GLM to learn the proper preferred direction). We observed that each method identified a similar tuning curve (Fig. 2b) and that the bulk of the neurons in the dataset were just as well predicted by each of the methods (Fig. 2a, c) (though the ensemble was slightly more accurate than the GLM, with a mean comparative pseudo-R2 of 0.06 [0.043 - 0.084], 95% bootstrapped confidence interval (CI)). The similar performance suggested that an exponentiated cosine is a nearly optimal approximating function of the neural response to movement direction alone, as was previously known (40). This classic example thus illustrated that all methods can in principle estimate tuning curves.
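A minimal sketch of this trivial case on simulated data, assuming the engineered features are cos θ and sin θ of the movement direction and using scikit-learn's Poisson GLM (the study's own GLM implementation may differ):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
theta = rng.uniform(-np.pi, np.pi, 10_000)            # hand-velocity direction per bin
true_rate = np.exp(0.2 + 1.0 * np.cos(theta - 1.0))   # cosine-tuned cell, preferred dir = 1 rad
y = rng.poisson(true_rate)                            # simulated spike counts

X = np.column_stack([np.cos(theta), np.sin(theta)])   # engineered, GLM-linear features
glm = PoissonRegressor(alpha=0.0).fit(X, y)

# exp(b + w1*cos(theta) + w2*sin(theta)) = exp(b + A*cos(theta - phi)),
# so the preferred direction is recoverable from the two fitted weights:
preferred_direction = np.arctan2(glm.coef_[1], glm.coef_[0])
```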

The exact form of the nonlinearity of the neural response to a given feature is rarely known prior to analysis, but this lack of knowledge need not impact our prediction ability. To illustrate the ability of modern machine learning to find the proper nonlinearity, we performed the same analysis as above but omitted the initial cosine feature-engineering step. Trained on only the hand velocity direction, in radians, which changes discontinuously at ±π, all methods but the GLM closely matched the predictive power they attained using the engineered feature (Fig. 3a).
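A minimal sketch of this no-engineering case on simulated cosine-tuned counts: XGBoost is given only the raw direction in radians, and each model's learned tuning curve is read out by predicting on a grid of angles. The variable names and hyperparameters are illustrative.

```python
import numpy as np
import xgboost as xgb
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(1)
theta = rng.uniform(-np.pi, np.pi, 20_000)          # raw hand-velocity direction
rate = np.exp(0.5 + 1.5 * np.cos(theta - 2.0))      # ground-truth cosine tuning
y = rng.poisson(rate * 0.05)                        # counts in 50-ms bins

X_raw = theta.reshape(-1, 1)                        # no feature engineering

glm = PoissonRegressor().fit(X_raw, y)              # forced to be log-linear in theta
gbt = xgb.XGBRegressor(objective="count:poisson", n_estimators=300, max_depth=3)
gbt.fit(X_raw, y)

# Read out each model's tuning curve on a grid of directions.
grid = np.linspace(-np.pi, np.pi, 100).reshape(-1, 1)
glm_curve = glm.predict(grid)   # monotone in theta; cannot form a bump
gbt_curve = gbt.predict(grid)   # recovers the peak at the preferred direction
```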

As expected, the GLM failed at generating a meaningful tuning curve (Fig. 3b). Both trends were consistent across the population of recorded neurons (Fig. 3c). The neural net, XGBoost, and ensemble methods can thus learn the nonlinearity of single features without requiring manual feature transformation.

The inclusion of multiple features raises the possibility of nonlinear feature interactions that may elude a GLM. We found this to be the case for the four-dimensional set of hand position and velocity (x, y, v_x, v_y). While all methods gained predictive power relative to models using movement direction alone, the GLM failed to match the other methods (Fig. 4a, c). With feature engineering of these inputs, the methods improved to variable degrees, with the GLM improving to the level of the neural network (Fig. 5). XGBoost and the ensemble still predicted spikes better than the GLM (Fig. 5c).

To ensure that these results were not specific to the motor cortex, we extended the same analyses to primary somatosensory cortex (S1) data. We again predicted neural activity from hand movement and speed, here without spike or covariate history terms. The ML methods outperformed the GLM for all but three of the 52 neurons, indicating that firing rates in S1 generally relate nonlinearly to hand position and velocity (Fig. 7a). Each of the three ML methods performed similarly for each neuron. The S1 neural function was thus equally learnable by each method, which is surprising given the dissimilarity of the neural network and XGBoost algorithms. This situation would occur if learning had saturated near ground truth, though this cannot be proven definitively to be the case. It is at least clear from the underperformance of the GLM that the relationship of S1 activity to these covariates is nonlinear beyond the assumptions of the GLM.

We asked if the same trends of performance would hold for the rat hippocampus dataset, which was characterized by very low mean firing rates but strong effect sizes. All methods were trained on a list of squared distances to a grid of place fields and on the rat head orientation, as described in Methods. Even more than for the neocortical data, neurons were described much better by XGBoost and the ensemble than by the GLM (Fig. 7b). Many neurons shifted from being completely unpredictable by the GLM (pseudo-R2 near zero) to very predictable by XGBoost and the ensemble (pseudo-R2 above 0.2). These neurons thus have responses that do not correlate with firing in any one Gaussian place field. We note that the neural network performed poorly, likely due to the very low firing rates of most hippocampal cells (Supp. Fig. 2). The median spike rate of the 58 neurons in the dataset was just 0.2 spikes/s, and it was only on the four neurons with rates above 1 spike/s that the neural network achieved pseudo-R2 scores comparable to the GLM. The relative success of XGBoost was interesting given the failure of the neural network, and supported the general observation that boosted trees can work well with smaller and sparser datasets than neural networks generally require. For hippocampal cells, then, a method leveraging decision trees such as XGBoost or the ensemble captures more structure in the neural response and thereby demonstrates a deficiency of the GLM's parameterization.
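A sketch of this hippocampal feature construction; the grid spacing, arena size, and function name are illustrative assumptions rather than the study's exact values.

```python
import numpy as np

def place_field_features(position, orientation, arena_size=1.0, grid_n=5):
    """Squared distances to a grid of candidate place-field centers, plus orientation.

    `position` is a hypothetical (n_bins, 2) array of x, y locations in the open
    field and `orientation` an (n_bins,) array of head directions in radians.
    The 5x5 grid and unit arena are assumptions for illustration.
    """
    centers = np.linspace(0, arena_size, grid_n)
    cx, cy = np.meshgrid(centers, centers)
    centers = np.column_stack([cx.ravel(), cy.ravel()])       # (grid_n**2, 2)
    diffs = position[:, None, :] - centers[None, :, :]         # (n_bins, n_centers, 2)
    sq_dists = np.sum(diffs ** 2, axis=-1)                     # (n_bins, n_centers)
    return np.column_stack([sq_dists, orientation])
```

Under a single Gaussian place field, the log firing rate is linear in the squared distance to that field's center, which is why this parameterization is the natural linearizing choice for a GLM with an exponential inverse link.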

We contrasted the performance of GLMs with various machine learning techniques at the task of predicting binned spikes in three brain regions. We found that the tested ML methods, especially XGBoost and the ensemble, routinely predicted spikes more accurately than did the GLM. Typical feature engineering only partially bridged the performance gap. Machine learning methods, especially LSTMs, also outperformed GLMs when history terms were included. The ML methods performed comparably well with and without feature engineering, even for the very low spike rates of the hippocampus dataset, indicating they could serve as convenient performance benchmarks for improving simpler encoding models. These findings indicate that instances where standard ML methods outperform GLMs may be common for neural data and standard feature choices.

When a GLM fails to explain data as well as more expressive, nonlinear methods, the current parameterization of inputs must relate to the data more nonlinearly than is assumed by the GLM. This unaccounted nonlinearity may produce feature weights that do not reflect true feature importance. In the extreme case where the neural response to some feature x does not correlate with exp(x), for example, a GLM will incorrectly predict no dependence on x whatsoever. The only way to ensure reliable interpretations of feature weights as importances is to find an input parameterization that maximizes the GLM's predictive power. ML methods can assist this process by acting as benchmarks and indicating how much nonlinearity remains to be explained. This point relates to a possible objection to this work, which is that the GLM underperforms simply because we have selected the wrong input features. This is indeed true, as it is always theoretically possible to linearize features such that a GLM obtains equal predictive power. We mean to point out here that ML methods can highlight the deficiency of features that might have otherwise seemed uncontroversial. When applying a GLM to neural data, then, it is important to compare its predictive power with standard ML methods to ensure the neural response is properly understood.

Advanced ML methods are not widely considered to be interpretable, and some may worry that this diminishes their scientific value as encoding models. We can better discuss this issue with a more precise definition of interpretability. Following Lipton, we make the distinction between a method's post-hoc interpretability, the ease of justifying its predictions, and its transparency, the degree to which its operation and internal parameters are human-readable or easily understandable (42). A GLM is certainly more transparent than many ML methods due to its algorithmic simplicity. It is also generally more conducive to post-hoc interpretation, though there is room here for modern ML methods. It is possible, for example, to visualize the aspects of stimuli that most elicit a predicted response.

Automated tools now exist that can handle the entire regression workflow (51, 52). Applied to neuroscience, these tools will allow researchers to gain descriptive power over current methods even with simple, out-of-the-box implementations.

Machine learning methods perform quite well and make minimal assumptions about the form of neural encoding.

Models that seek to understand the form of the neural code can test if they systematically misconstrue the relationship between stimulus and response by comparing their performance to these benchmarks. Encoding models built with machine learning can thus greatly aid the construction of models that capture arbitrary nonlinearity and more accurately describe neural activity.

The code used for this publication is available at https://github.com/KordingLab/spykesML. We invite researchers to adapt it freely for future problems of neural prediction.

Figure captions (fragments):
"... same example neuron as in Figure 3, the neural net and XGBoost maintained the same predictive power, while the GLM was unable to extract a relationship between direction and spike rate. (b) XGBoost and neural nets displayed reasonable tuning curves, while the GLM reduced to the average spiking rate (with a small slope, in this case). (c) Most neurons in the population were poorly fit by the GLM, while the ML methods achieved the performance levels of Figure 2. The ensemble performed the best of the methods tested."
"... high-dimensional dependence. (c) Even with feature engineering, XGBoost and the ensemble consistently achieve pseudo-R2 scores higher than the GLM, though the neural net does not. The neuron selected at left is marked with black arrows."
"... figures (see Methods) and pseudo-R2 scores should not be compared directly across figures. All methods outperform the GLM, indicating that the inclusion of history terms does not alone allow the GLM to capture the full nonlinear relationship between covariates and spike rate."
"... the GLM in the macaque S1 dataset. Interestingly, the neural network, XGBoost and the ensemble scored very similarly for each neuron in the 52-neuron dataset. (b) Many neurons in the rat hippocampus were described well by XGBoost and the ensemble but poorly by the GLM and the neural network. The poor neural network performance in the hippocampus was due to the low rate of firing of most neurons in the dataset (Supp. Fig. 2). Note the difference in axes; hippocampal cells are generally more predictable than those in S1."