
Edited by: Ad Aertsen, University of Freiburg, Germany

Reviewed by: Alexander G. Dimitrov, Washington State University Vancouver, USA; James McFarland, University of Maryland, College Park, USA; Jannis Hildebrandt, Carl von Ossietzky University of Oldenburg, Germany

*Correspondence: Oliver Schoppe

Jan W. H. Schnupp

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Good metrics of the performance of a statistical or computational model are essential for model comparison and selection. Here, we address the design of performance metrics for models that aim to predict neural responses to sensory inputs. This is particularly difficult because the responses of sensory neurons are inherently variable, even in response to repeated presentations of identical stimuli. In this situation, standard metrics (such as the correlation coefficient) fail because they do not distinguish between explainable variance (the part of the neural response that is systematically dependent on the stimulus) and response variability (the part of the neural response that is not systematically dependent on the stimulus, and cannot be explained by modeling the stimulus-response relationship). As a result, models which perfectly describe the systematic stimulus-response relationship may appear to perform poorly. Two metrics have previously been proposed which account for this inherent variability: Signal Power Explained (SPE; Sahani and Linden, 2001) and the normalized correlation coefficient (CC_norm; Hsu et al., 2004). Of the two, CC_norm is better behaved in that it is effectively bounded between −1 and 1, and values below zero are very rare in practice and easy to interpret. However, it was hitherto not possible to calculate CC_norm directly; instead, it was estimated using imprecise and laborious resampling techniques. Here, we identify a new approach that can calculate CC_norm quickly and accurately. As a result, we argue that it is now a better choice of metric than SPE.

Evaluating the performance of quantitative models of neural information processing is an essential part of their development. Appropriate metrics enable us to compare different models and select those which best describe the data. Here, we are interested in developing improved metrics to assess models of the stimulus-response relationships of sensory neurons, in the challenging (but common) situation where the stimulus-response relationship is complex, and neuronal responses are highly variable. In this case, the development of appropriate performance metrics is not trivial, and so there is a lack of consensus about which metrics are to be used.

The classical way to record and model neural responses has been to repeatedly present an animal with a small, well-defined set of stimuli (such as sinusoidal gratings of different orientations, or sounds of different frequencies). The neural responses to repeated presentations of each stimulus are then averaged. Using a small stimulus set, it may be possible to present the same stimulus enough times that this averaging succeeds in reducing the effect of neuronal response variability (Dörrscheidt, 1981).

However, recent work in sensory neuroscience has increasingly focused on the responses of neurons to complex stimuli, such as natural scenes and sounds (Atencio and Schreiner, 2008). With such rich stimulus sets, it is rarely possible to repeat every stimulus often enough to average away the trial-to-trial variability of the responses.

Different approaches for taking neural variability into account when measuring model performance have been developed.

Based not only on pairs, but on even larger sets of neural responses to independent stimulus repetitions, Sahani and Linden developed a very insightful decomposition of the recorded signal into signal power and noise power (Sahani and Linden, 2001).

Focusing on coherence and the correlation coefficient, Hsu and colleagues developed a method to normalize those measures by their upper bound (CC_max), which is given by the inter-trial variability (Hsu et al., 2004); the result is the normalized correlation coefficient (CC_norm). Following their suggestion, the upper bound can be approximated by looking at the similarity between one half of the trials and the other half of the trials (CC_half). This measure has also been used by Gill et al., and others have worked with related variants of CC_half (Laudanski et al.).

The two metrics SPE and CC_norm have been developed independently, but they both attempt, in different ways, to provide a method for assessing model performance independent of neuronal response variability. Here, we analyze these metrics, show for the first time that they are closely related, and discuss the shortcomings of each. We provide a new, efficient way to directly calculate CC_norm and show how it can be used to accurately assess model performance, overcoming previous shortcomings.

Neural responses are often measured as the membrane potential (Machens et al., 2004) or, as considered here, as trains of action potentials (spikes), binned in time. Averaging the binned spike counts over the $N$ repeated presentations of the stimulus yields the recorded firing rate (the PSTH):

$y(t) = \frac{1}{N}\sum_{n=1}^{N} R_n(t)$

where $R_n$ is the recorded response of the $n$-th of the $N$ trials, and each element $R_n(t)$ contains the number of spikes that were recorded in the corresponding time bin. Note that, given the trial-to-trial variability of sensory responses, the recorded firing rate $y$ is only a noisy estimate of the true, stimulus-driven firing rate.
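The construction of the recorded firing rate from the trial-by-bin response matrix can be sketched in a few lines. Below is a minimal NumPy sketch (the paper's own code is Matlab; the helper name and the toy spike counts are our illustration):

```python
import numpy as np

def psth(R):
    """Recorded firing rate y(t): mean spike count per bin across the N trials."""
    R = np.asarray(R, dtype=float)  # shape (N trials, T time bins)
    return R.mean(axis=0)

# Toy example: 4 trials, 5 time bins of hypothetical spike counts.
R = np.array([[0, 1, 2, 1, 0],
              [1, 1, 3, 0, 0],
              [0, 2, 2, 1, 1],
              [1, 0, 3, 2, 0]])
y = psth(R)  # -> [0.5, 1.0, 2.5, 1.0, 0.25]
```

Each entry of `y` is the across-trial average for one time bin; with few trials, it remains a noisy estimate of the underlying rate.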

With the recorded firing rate $y$ and a model's prediction $\hat{y}$ of it in hand, we need a metric that quantifies how well the prediction matches the observation.

Two somewhat related metrics which are widely applied in statistics are the "coefficient of determination" ($R^2$) and the mean squared error (MSE).

The $R^2$ compares the residual power of the prediction errors with the power of the recorded signal around its mean:

$R^2 = 1 - \frac{\sum_t \left(y(t) - \hat{y}(t)\right)^2}{\sum_t \left(y(t) - \bar{y}\right)^2}$

This makes the $R^2$ easy to interpret for good models: a value of 1 indicates a perfect prediction, and a value of 0 indicates a prediction that performs no better than simply predicting the mean firing rate.

^{1}

Applying this reformulation to Equation 5 reveals that:

$R^2 = \frac{2\,\mathrm{Cov}(y,\hat{y}) - \mathrm{Var}(\hat{y}) - \left(\bar{y} - \bar{\hat{y}}\right)^2}{\mathrm{Var}(y)}$

Consider a particularly bad model, which produces predictions that are no better than the output of a random number generator. The covariance between the predictions and the neural responses will then be close to zero, but the variance (i.e., the power) of the predicted signal may nevertheless be large. The $R^2$ of such a model will therefore be strongly negative: the prediction variance enters the numerator with a negative sign, and there is no covariance to offset it.

This can be illustrated with a simple hypothetical example. Imagine a visual cortex simple cell responding to a sinusoidal contrast grating stimulus with a sinusoidal modulation of its firing rate, so its observed response is a sine wave, say with an amplitude of ±1 spikes/s around a mean firing rate of 10 spikes/s at a modulation rate of 1 Hz. Let us further assume that model A predicts sinusoidal firing at a 2 Hz modulation rate with an amplitude of ±2 spikes/s around a mean of 10 spikes/s, and model B predicts sinusoidal firing at 2 Hz with an amplitude of ±1 spikes/s around a mean of 100 spikes/s. Since neither model A nor B correctly predicted the rate of the sinusoidal firing rate modulations, and because sine waves of different frequencies are orthogonal, both models will have covariance of zero with the observed data. Thus, both have a negative $R^2$: model A because its predicted fluctuations are too large, and model B, far more severely, because of its large bias.
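The arithmetic of this example is easy to verify numerically. A short Python/NumPy sketch (the sampling grid and the `r_squared` helper are ours, for illustration only):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

t = np.arange(0.0, 1.0, 0.001)                 # 1 s sampled at 1 kHz
y = 10 + np.sin(2 * np.pi * 1 * t)             # observed: 1 Hz, +/-1 around 10 spikes/s
model_a = 10 + 2 * np.sin(2 * np.pi * 2 * t)   # wrong rate, fluctuations too large
model_b = 100 + np.sin(2 * np.pi * 2 * t)      # wrong rate, large bias

print(r_squared(y, model_a))   # -4.0: the uncorrelated prediction's variance is penalized
print(r_squared(y, model_b))   # -16201.0: the bias term dominates
```

Because both predictions are orthogonal to the observation, the covariance term vanishes and only the penalties remain, exactly as in the equation above.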

(Figure: predictions modified as $\hat{y}_{\alpha} = \alpha\,\hat{y}$, including $\alpha$ set to 0; top right panel: the original prediction.)

These examples illustrate that models can be thought of as being wrong in different ways. They can be “biased,” predicting an incorrect overall mean response rate, they can be “scaled wrong,” predicting fluctuations that are too small or too large, or they can fail to predict the trends and dynamics of the data, leading to near zero covariance between observation and prediction. Different metrics of model performance will differ in how sensitive they are to these different types of error.

Negative values of the $R^2$ are thus a generic outcome for poor models rather than a rare pathology, and they are difficult to interpret: the magnitude of a negative $R^2$ mixes bias, scaling, and covariance errors in a way that cannot be disentangled.

Of the three different types of error just discussed (large bias, poor scaling, and small covariance), all three drive the $R^2$ down, whereas the correlation coefficient, considered next, is sensitive only to the last.

Another measure widely used in statistics, Pearson's product-moment correlation coefficient, can also be used to assess the similarity of two time-varying signals. The correlation coefficient quantifies the linear correlation and maps it to a value between −1 and +1. To distinguish it from a normalized variant that will be used later in this section, the (absolute) correlation coefficient will from now on be abbreviated as CC_abs. It is defined as:

$CC_{abs} = \frac{\mathrm{Cov}(y,\hat{y})}{\sqrt{\mathrm{Var}(y)\,\mathrm{Var}(\hat{y})}}$

CC_abs satisfies many of the criteria that one might desire in a good measure of model performance. It quantifies the similarity between observation and prediction, it is bounded between −1 and +1, and it can be interpreted easily and unambiguously. The normalization by the square root of the variances makes CC_abs insensitive to scaling errors, and the subtraction of the means in the covariance makes CC_abs insensitive to bias, so that only the ability of the model to predict the temporal pattern of the response is assessed. However, CC_abs might be small either because the model predictions are genuinely poor, or because the neural responses are so variable that even the predictions of a perfect model could not correlate well with a noisy PSTH. This confound limits the interpretability of CC_abs. This was also noted by Hsu and colleagues, who went on to develop an approach to quantify and account for the inherent noise in neural data (Hsu et al., 2004).^{2}
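The insensitivity of CC_abs to bias and scaling errors can be verified directly. A brief NumPy sketch (the simulated data are hypothetical, for illustration only):

```python
import numpy as np

def cc_abs(y, y_hat):
    """Pearson correlation: covariance normalized by the square roots of the variances."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    cov = np.mean((y - y.mean()) * (y_hat - y_hat.mean()))
    return cov / np.sqrt(y.var() * y_hat.var())

rng = np.random.default_rng(0)
y = rng.normal(size=1000)                        # stand-in for a recorded PSTH
y_hat = y + rng.normal(scale=1.0, size=1000)     # a noisy but correlated prediction

c0 = cc_abs(y, y_hat)
c1 = cc_abs(y, 5.0 + 0.1 * y_hat)  # re-biased and re-scaled prediction
# c0 and c1 agree up to floating-point error: CC_abs ignores bias and (positive) scaling
```

Only errors in the temporal pattern of the prediction, i.e., in the covariance, can lower CC_abs.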

Their normalized correlation coefficient is the ratio $CC_{norm} = CC_{abs}/CC_{max}$, where CC_max is the maximum correlation coefficient that can be expected given the variability of the data, i.e., CC_max is the correlation coefficient between the recorded firing rate and the true underlying, stimulus-driven firing rate. The true firing rate is unknown, but CC_max can nevertheless be calculated using the formulae in Equation 9. Following the methods of Hsu et al. (2004), CC_half is determined by splitting the data set into halves, and calculating the correlation coefficient between the PSTH constructed from the first half and the PSTH constructed from the second half of the trials. This approach determines CC_max by effectively extrapolating from CC_half to the number of trials in the full data set.

Note that there are many different ways to split the trials into two halves, and each split yields a somewhat different value of CC_half. Thus, in theory, the best estimate would average over all possible splits, with CC_half calculated for each possible split. In practice, this resampling technique can be computationally expensive, given the fact that there are already 92,378 combinations for a data set of 20 trials, so typically only a random subset of splits is used, yielding an imprecise estimate of CC_max.
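The split-half resampling procedure can be sketched as follows (Python/NumPy; the number of random splits is a practical compromise rather than the full enumeration, the extrapolation formula $CC_{max} = \sqrt{2/(1 + 1/CC_{half})}$ follows the Hsu et al.-style correction, and the helper names are ours):

```python
import numpy as np

def cc(a, b):
    return np.corrcoef(a, b)[0, 1]

def cc_max_split_half(R, n_splits=100, seed=0):
    """Estimate CC_max by resampling random half-splits of the trials.

    For each split, correlate the PSTH of one half of the trials with the PSTH of
    the other half; average CC_half over splits, then extrapolate to CC_max.
    """
    R = np.asarray(R, float)
    n = R.shape[0]
    rng = np.random.default_rng(seed)
    halves = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        h1, h2 = perm[: n // 2], perm[n // 2:]
        halves.append(cc(R[h1].mean(axis=0), R[h2].mean(axis=0)))
    cc_half = np.mean(halves)
    if cc_half <= 0:           # data too noisy for a meaningful extrapolation
        return np.nan
    return np.sqrt(2.0 / (1.0 + 1.0 / cc_half))
```

Even with 100 random splits, this is a subsample of the 92,378 possible half-splits of 20 trials, so the result remains an estimate.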

In summary, CC_norm provides a feasible method for capturing model performance independently of noise in the neural responses. It gives values bounded between −1 and +1 (in practice, they are bounded between 0 and +1, as model predictions are either correlated or not correlated, but typically not anti-correlated, with the firing rate). Furthermore, the measure lends itself to unambiguous interpretation, and its limitations are well-known. Finally, it is normalized so that its value does not depend on the variability of a particular data set. Thus, the normalized correlation coefficient CC_norm fulfills the criteria for a useful measure of model performance, but its current definition is based on a laborious and potentially imprecise resampling technique.

As will have become clear in the previous sections, the two measures SPE and CC_norm follow the same logic in that both measure prediction accuracy and normalize it by a quantification of the inherent reproducibility of the neural responses that are to be modeled (the signal power and CC_max, respectively). In this section we will show that these two approaches to normalization not only follow the same logic, but are mathematically largely equivalent. This provides a deeper insight into the underlying concept and gives rise to a more elegant and efficient technique to normalize the correlation coefficient.

Following the methods of Sahani and Linden (2001), the power of the recorded responses can be decomposed into the signal power SP, the part of the response power that is reproducible across trials, and the noise power NP, the part that is not.^{3} The signal power can be estimated directly from the recorded trials:

$SP = \frac{1}{N-1}\left(N\,\mathrm{Var}(y) - \frac{1}{N}\sum_{n=1}^{N}\mathrm{Var}(R_n)\right)$

where $R_n$ is the recorded neural response of the $n$-th of the $N$ trials, and $\mathrm{Var}(\cdot)$ denotes the variance across time. The noise power is the remainder of the average single-trial power:

$NP = \frac{1}{N}\sum_{n=1}^{N}\mathrm{Var}(R_n) - SP$

Furthermore, using Equation 14, the ratio of the noise power to the signal power provides a compact description of the trial-to-trial variability of a recording, which will allow us to relate the two normalization approaches below.
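The decomposition can be computed in a few lines. A NumPy sketch, assuming an N × T matrix of binned single-trial responses (the helper name and the simulated data are ours):

```python
import numpy as np

def signal_noise_power(R):
    """Sahani-Linden-style decomposition of response power into signal and noise power.

    R: (N trials, T bins) array of responses. Variances are taken across time.
    """
    R = np.asarray(R, float)
    n = R.shape[0]
    y = R.mean(axis=0)                      # PSTH
    trial_power = R.var(axis=1).mean()      # average single-trial power
    sp = (n * y.var() - trial_power) / (n - 1)   # signal power estimate
    np_ = trial_power - sp                  # noise power: the remainder
    return sp, np_

# Simulated check: a hypothetical unit-variance "true rate" plus unit-variance trial noise
rng = np.random.default_rng(0)
s = rng.normal(size=5000)
R = s + rng.normal(size=(20, 5000))
sp, npow = signal_noise_power(R)   # both should come out near 1
```

Averaging across trials shrinks the noise in the PSTH by a factor of N, and the formula exploits exactly this to separate the two power components.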

For CC_norm the normalization factor is the inverse of CC_max and, following the methods of Hsu et al. (2004), it has hitherto been estimated by resampling. We now show that CC_max can be computed directly by exploiting the relation between the signal power and CC_norm.

The coherence quantifies the linear correspondence of two signals as a function of frequency, and plays, in the frequency domain, the role that the squared correlation coefficient plays in the time domain.

The derivation of the coherence function between the true underlying firing rate and the observed neural response is analogous to that of the squared correlation coefficient between both signals (also see Hsu et al., 2004). This allows us to express CC_max as:

$CC_{max} = \sqrt{\frac{SP}{\mathrm{Var}(y)}}$

Combining Equation 17 with Equation 15 now allows us to express the inverse of CC_max as:

$\frac{1}{CC_{max}} = \sqrt{\frac{\mathrm{Var}(y)}{SP}}$

Based on Equations 8 and 9, the normalized correlation coefficient CC_norm between the recorded firing rate $y$ and the prediction $\hat{y}$ can thus be written directly as:

$CC_{norm} = \frac{\mathrm{Cov}(y,\hat{y})}{\sqrt{SP \cdot \mathrm{Var}(\hat{y})}}$

In other words, we can now express CC_norm as a simple function of quantities that are directly computable from the data. This reveals several insights. First, both measures, SPE and CC_norm, use the covariance to quantify the prediction accuracy and take the neural variability into account by normalizing with the signal power SP. This was not obvious from the original definition of CC_norm, because both SPE and CC_norm quantify the similarity of the prediction and the neural response solely based on the covariance of both signals. It is well known that the (normalized) correlation coefficient is based on covariance, but it has hitherto not been made explicit that this is also the case for SPE. Second, how both measures quantify neural variability is not only related, but mathematically equivalent. Third, in order to calculate CC_norm it is not necessary to laboriously compute an approximation to CC_max from repeated subsampling of the data to generate computationally inefficient and potentially imprecise estimates of CC_half. Instead, the normalization factor can be explicitly calculated with Equation 27, using Equation 13 for the signal power; this yields a fast and exact computation of CC_norm (left panel of the figure).
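Putting the pieces together, CC_norm can be computed directly, with no resampling. A NumPy sketch (the function name and the simulated data are ours); note how a perfect model of a noisy cell scores near 1 under CC_norm while its plain correlation with the PSTH stays below 1:

```python
import numpy as np

def cc_norm(R, y_hat):
    """Direct CC_norm: covariance of PSTH and prediction, normalized by
    sqrt(signal power * prediction variance) instead of sqrt(Var(y) * Var(y_hat))."""
    R, y_hat = np.asarray(R, float), np.asarray(y_hat, float)
    n = R.shape[0]
    y = R.mean(axis=0)
    sp = (n * y.var() - R.var(axis=1).mean()) / (n - 1)   # signal power estimate
    cov = np.mean((y - y.mean()) * (y_hat - y_hat.mean()))
    return cov / np.sqrt(sp * y_hat.var())

# A perfect model of a noisy cell: CC_norm is ~1 even though CC_abs is not.
rng = np.random.default_rng(0)
true_rate = rng.normal(size=5000)                 # hypothetical stimulus-driven rate
R = true_rate + rng.normal(size=(20, 5000))       # 20 noisy trials
ccn = cc_norm(R, true_rate)                       # close to 1 (up to estimation error)
cca = np.corrcoef(R.mean(axis=0), true_rate)[0, 1]   # below 1, due to trial noise
```

The only quantities needed are the covariance, the prediction variance, and the signal power, all computable in a single pass over the data.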

In summary, CC_norm as defined in Equation 27 provides an insightful measure of model performance. It quantifies the prediction accuracy using the covariance and isolates model performance by taking the amount of intrinsic variability in the observed neural responses into account. It is in theory bounded between −1 and 1, and in practice values below zero are very rarely observed. If they do occur, their interpretation is unambiguous: negative CC_norm implies anticorrelation between prediction and data. CC_norm thus behaves uniformly well whether called upon to quantify the performance of good or of poor models, in contrast to SPE, which closely tracks the square of CC_norm for good models, but becomes increasingly harder to interpret as model performance declines.

(Figure: SPE plotted against the signed square of CC_norm, i.e., CC_norm multiplied by its absolute value, to demonstrate that negative values only occur for SPE. The solid red line shows the values for the simulation of the earlier figure; as SPE approaches the signed square of CC_norm for good predictions, the red line approaches the dashed black line.)

The previous sections show the problems caused by the missing lower bound of SPE, and the advantages of the direct calculation of CC_norm. We now illustrate these points with electrophysiological recordings and model predictions.

All animal procedures were approved by the local ethical review committee and performed under license from the UK Home Office. Ten adult pigmented ferrets (seven female, three male; all >6 months of age) underwent electrophysiological recordings under anesthesia. Full details are as in the study by Bizley et al. (2005).

Natural sounds were presented via Panasonic RPHV27 earphones, which were coupled to otoscope specula that were inserted into each ear canal, and driven by Tucker-Davis Technologies System III hardware (48 kHz sample rate). The sounds had root mean square intensities in the range of 75–82 dB SPL. For Experiment 1, we presented 20 sound clips of 5 s duration each, separated by 0.25 s of silence. Sound clips consisted of animal vocalizations (ferrets and birds), environmental sounds (water and wind) and speech. The presentation of these stimuli was repeated in 20 trials. For Experiments 2 and 3, we presented 45 sound clips of 1 s duration, again separated by gaps of silence. The sound clips consisted of animal vocalizations (sheep and birds), environmental sounds (water and wind) and speech. The presentation of these stimuli was repeated in 10 trials. The silent gaps and the first 0.25 s thereafter have been removed from the data set.

For Experiment 1, the responses of 119 single neurons were predicted with an LN model, a widely used class of models comprising a linear and a nonlinear stage (Chichilnisky, 2001). The model was fitted on 90% of the data set, and we then measured CC_norm for the model predictions of the neural response to the remaining 10% of the data set. This procedure was repeated 10 times, each time with a distinct 10% of data. The model performance was computed as the mean across all 10 performance measurements.
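The cross-validation loop just described can be sketched generically. In this Python sketch, `fit` and `score` are placeholders (e.g., an LN model fit and CC_norm); the function names and interfaces are ours, not the paper's:

```python
import numpy as np

def cross_validated_score(X, R, fit, score, n_folds=10):
    """Mean held-out score over ~10% folds of the time axis.

    X: (T, ...) stimulus representation; R: (N trials, T bins) responses.
    fit(X_train, R_train) -> predictor; score(R_test, prediction) -> float.
    """
    T = X.shape[0]
    folds = np.array_split(np.arange(T), n_folds)   # distinct ~10% segments
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(T), test_idx)
        model = fit(X[train_idx], R[:, train_idx])  # fit on the remaining 90%
        scores.append(score(R[:, test_idx], model(X[test_idx])))
    return float(np.mean(scores))
```

Averaging over the ten folds uses every part of the recording for testing exactly once, while the model is always evaluated on data it was not fitted to.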

We predicted neuronal responses to acoustic stimuli with different model classes in order to address the question of how the choice of a performance measure affects the interpretability of the results in a practical setting. To this end, we measured the predictive performance of models with two different methods, SPE and CC_norm. The right panel of the figure plots both measures against each other; rather than raw CC_norm values, here we chose to plot the signed square of CC_norm as a percentage on the x-axis. This choice is motivated by the fact that the square of the correlation coefficient, also known as the coefficient of determination, quantifies the "proportion of variance explained" by a statistical regression model, and the signed square of CC_norm is therefore directly comparable to SPE, i.e., both provide very similar, sensible measures of "percent explainable variance explained." However, as expected from the theoretical analysis of both measures in the previous sections, this relation diminishes for cases in which the models poorly predicted the neuronal response. For those cases where there is little or no correspondence between the prediction and the response, the value of CC_norm approaches zero (by definition), and for some of those cases, the value of SPE becomes strongly negative, even when the corresponding |CC_norm| × CC_norm indicates that the model was able to capture as much as 20–30% of the explainable, stimulus-driven variability in the neural firing rate. Thirty percent variance explained may not be a stellar performance for a model, but it certainly does not seem deserving of a negative test score. Indeed, the experimental results are generally in accordance with the simulation, shown as a red line in the right panel of the figure. In this simulation, designed to probe SPE and CC_norm for a wide range of good and bad predictions, a good prediction was deteriorated by adding an increasing amount of white noise. Just as for the data from the three experiments, the SPE values closely follow the signed square of CC_norm for good predictions, but go deep into negative values for noisy predictions. For comparison, the SPE and CC_norm values of the example in the bottom right panel of the earlier figure are also indicated.


Inter-trial variability of neural responses to repeated presentations of stimuli poses a problem for measuring the performance of predictive models. The neural variability inherently limits how similar one can expect the prediction of even a perfect model to be to the observed responses. Thus, when using prediction accuracy as a measure of performance, inherent response variability is a confound, and the need to control for this has been widely acknowledged (e.g., Panzeri and Treves, 1996).

Different approaches for taking neural variability into account when measuring model performance have been developed. To get an unbiased estimate of model performance, Hsu and colleagues proposed to normalize the correlation coefficient by its upper bound (CC_max), which is given by the inter-trial variability (Hsu et al., 2004); the result is the normalized correlation coefficient (CC_norm). Following their suggestion, the upper bound can be approximated by looking at the similarity between one half of the trials and the other half of the trials (CC_half). This measure has also been used by Gill et al., and others have worked with related variants of CC_half (Laudanski et al.).

In this study we have analyzed in detail two measures of model quality that account for neural response variability, SPE and CC_norm. We have revealed the shortcomings of SPE, consolidated both approaches, and arrived at several insights. First, both measures quantify prediction accuracy using the covariance (and normalize it by the signal power). Second, the way both measures quantify neural variability is mathematically equivalent. And finally, it is not necessary to approximate CC_max using computationally expensive and inexact resampling methods, because CC_norm can be calculated directly via Equation 27.

This consolidated definition of CC_norm is not only more elegant, precise, and efficient, but it also sheds light on how CC_norm can be interpreted. It is almost identical to the well-known Pearson's correlation coefficient CC_abs, except that the variance (power) of the recorded signal is replaced with the signal power, i.e., the explainable part of that variance. CC_norm has been shown to fulfill the criteria of Section 2 for insightful measures: it is bounded, interpretable, and comparable across data sets. Thus, CC_norm is a well-defined and helpful tool to assess model performance.^{4}

Note, however, that CC_norm cannot be estimated accurately if the data are excessively noisy. Equation 28 requires dividing by an estimate of the signal power; if the true signal power is close to zero, estimation error can make CC_norm estimates unstable, allowing them to become spuriously large (if the signal power is underestimated). Thus, whenever estimates of the signal power or CC_max are small or have a very wide confidence interval, CC_norm values must be treated with caution.
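This caution can be built into an implementation. In the NumPy sketch below, the guard threshold `min_sp` is our choice, not a value from the paper; for pure-noise data the signal-power estimate hovers near zero (and may even be negative), and the guarded function declines to return a value:

```python
import numpy as np

def safe_cc_norm(R, y_hat, min_sp=1e-6):
    """CC_norm with a guard: refuse to normalize by a vanishing or negative
    signal-power estimate, which would make the result spuriously large."""
    R, y_hat = np.asarray(R, float), np.asarray(y_hat, float)
    n = R.shape[0]
    y = R.mean(axis=0)
    sp = (n * y.var() - R.var(axis=1).mean()) / (n - 1)
    if sp < min_sp:            # data too noisy: CC_norm is not meaningful here
        return np.nan
    cov = np.mean((y - y.mean()) * (y_hat - y_hat.mean()))
    return cov / np.sqrt(sp * y_hat.var())

# Pure noise, no stimulus-locked signal: the signal-power estimate is near zero.
rng = np.random.default_rng(0)
R = rng.normal(size=(10, 1000))
```

In practice one might also report a confidence interval for the signal power (e.g., via bootstrapping over trials) alongside CC_norm.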

OS: initiated the project; developed methodology; wrote and tested code implementing methods; analyzed method performance both analytically and through experiment; lead author on paper. NH, BW, AK, JS: guided research, co-wrote manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

This work was supported by a Wellcome Trust grant (WT076508AIA) and a BBSRC grant (BB/H008608/1). OS was supported by the German National Academic Foundation.

^{1}Please note that we do not use the notation of Sahani and Linden (2001) here, in order to keep the notation consistent throughout this article.

^{2}The expression for CC_max can be derived from the work of Hsu et al. (2004): the relation between the coherence $\gamma^2$ and the squared correlation coefficient $CC^2$ allows one to replace coherence-based quantities with CC_max and CC_half. In the notation of Hsu and colleagues, the extrapolation from the split-half correlation reads $CC_{max} = \sqrt{2/\left(1 + \sqrt{1/CC_{half}^2}\right)}$.

^{3}Again, please note that Sahani and Linden (2001) use a different notation.

^{4}Matlab code for all measures can be found on GitHub: