A Comparison of Temporal Response Function Estimation Methods for Auditory Attention Decoding

The decoding of selective auditory attention from noninvasive electroencephalogram (EEG) data is of interest in brain-computer interface and auditory perception research. The current state-of-the-art approaches for decoding the attentional selection of listeners are based on temporal response functions (TRFs). In the current context, a TRF is a function that maps features of sound streams to EEG responses. It has been shown that when the envelope of attended speech and EEG responses are used to derive TRF mapping functions, the TRF model predictions can be used to discriminate between attended and unattended talkers. However, the predictive performance of TRF models depends on how the model parameters are estimated. A number of TRF estimation methods have been published, along with a variety of datasets. It is currently unclear whether any of these methods performs better than the others, as they have not yet been compared side by side on a single standardized dataset in a controlled fashion. Here, we present a comparative study of the ability of different TRF estimation methods to classify attended speakers from multi-channel EEG data. The performance of the TRF estimation methods is evaluated using different performance metrics on a set of labeled EEG data from 18 subjects listening to mixtures of two speech streams.


INTRODUCTION
A fundamental goal of auditory neuroscience is to understand the mapping between auditory stimuli and the cortical responses they elicit. In magneto-/electro-encephalography (M/EEG) studies, this mapping has predominantly been measured by examining the average cortical evoked response potential (ERP) to a [...]

MATERIAL AND METHODS
Temporal response functions can be used to predict the EEG response to a multi-talker stimulus from the attended speech envelope or, alternatively, to reconstruct the attended speech envelope from the EEG response. The first case is denoted a "forward TRF" (as it maps from speech features to neural data) and the second a "backward TRF" (as it maps from neural data back to speech features). The TRF methods described below map a matrix X = [x_(t,f),c] to a matrix Y = [y_t]:

Ŷ = Xw,

where Ŷ is the TRF model prediction in the form of a time-dimension t vector, X is the TRF model regressor matrix, and w contains the TRF filter coefficients. The filter coefficients can be estimated via ordinary least squares (OLS):

w = (X^T X)^(-1) X^T Y,

where X^T X is the estimated covariance matrix and X^T Y is the estimated cross-covariance matrix. The ordinary least-squares solution was here computed using the Cholesky decomposition method, via the mldivide routine in Matlab. One advantage of the OLS estimator is that it has no additional hyperparameters that must be optimized. However, in practice the OLS estimator is often outperformed by the regularized solutions described in the following subsections. This is often the case when the regressor, X, is high-dimensional, has highly correlated columns, and has a poorly estimated covariance matrix given limited amounts of training data.
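The forward construction above can be sketched in a few lines of NumPy. This is an illustrative reconstruction rather than the study's code: the lagged design-matrix helper and the data dimensions are our own assumptions, and the solver applies the normal equations directly.

```python
import numpy as np

def build_lagged_matrix(stimulus, n_lags):
    """Design matrix of time-lagged copies of the stimulus envelope.

    Column c holds the stimulus delayed by c samples, so a linear model
    over the columns implements an FIR filter (the TRF kernel).
    """
    n_samples = len(stimulus)
    X = np.zeros((n_samples, n_lags))
    for lag in range(n_lags):
        X[lag:, lag] = stimulus[:n_samples - lag]
    return X

def trf_ols(X, Y):
    """OLS estimate w = (X^T X)^(-1) X^T Y via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ Y)
```

Given a known filter w_true and noiseless responses Y = X @ w_true, trf_ols recovers w_true up to numerical precision.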
The Ridge Regression solution adds a scaled identity matrix to the covariance estimate:

w = (X^T X + λI)^(-1) X^T Y,

where λ is the regularization parameter that controls the amount of parameter shrinking.

The LRA-based regression relies on a low-rank approximation of the covariance matrix, X^T X. This is achieved by employing a singular value decomposition (SVD) of X^T X:

X^T X = USV^T,

where U and V are orthonormal matrices that contain, respectively, the left and right singular vectors, and S is a diagonal matrix containing the singular values. The number of diagonal elements, K, to retain is typically chosen such that a diagonal element is retained if the sum of the eigenvalues to be kept covers a fraction λ of the overall sum, with 0 < λ ≤ 1. Note that the regularization parameter, λ, here is analogous to λ for Ridge Regression, but the values are not comparable between the two.

The Shrinkage approach regularizes the covariance estimate by shrinking it toward a scaled identity matrix with some tuning parameter, λ. In the context of regression, the Shrinkage solution is

w = ((1 − λ) X^T X + λνI)^(-1) X^T Y,

where ν is here defined as the average eigenvalue of the covariance matrix X^T X (its trace divided by its dimension). When λ = 0, the solution reduces to the OLS estimate, and when λ = 1 the covariance term becomes spherical.

For simplicity, and to be consistent with previous studies (23, 9, 10), this paper characterizes the goodness of the fit using Pearson's correlation coefficients.
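For concreteness, the Ridge and Shrinkage solutions can be sketched as follows (a sketch, not the study's implementation; the function names are ours and `lam` plays the role of λ):

```python
import numpy as np

def trf_ridge(X, Y, lam):
    """Ridge solution: w = (X^T X + lam*I)^(-1) X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def trf_shrinkage(X, Y, lam):
    """Shrinkage solution: w = ((1-lam) X^T X + lam*nu*I)^(-1) X^T Y,
    where nu is the average eigenvalue of X^T X (trace / dimension)
    and 0 <= lam <= 1.
    """
    cov = X.T @ X
    nu = np.trace(cov) / cov.shape[0]
    reg = (1 - lam) * cov + lam * nu * np.eye(cov.shape[0])
    return np.linalg.solve(reg, X.T @ Y)
```

With lam = 0 both estimators reduce to the OLS solution; increasing lam pulls the coefficients toward smaller norm (Ridge) or the covariance toward the spherical estimate nu*I (Shrinkage).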

In the forward case, multiple EEG channels are predicted by the TRF. Rather than using multiple correlation coefficients to characterize the regression accuracy in this case, we chose to take the average of the correlation coefficients between the predicted channels and the actual EEG data as a validation score.
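A minimal sketch of this validation score (our own helper, assuming predictions and EEG are arrays of shape (n_samples, n_channels)):

```python
import numpy as np

def average_correlation(predicted, actual):
    """Mean Pearson correlation across EEG channels.

    predicted, actual: arrays of shape (n_samples, n_channels).
    """
    scores = []
    for ch in range(actual.shape[1]):
        # Pearson r between the predicted and measured channel.
        r = np.corrcoef(predicted[:, ch], actual[:, ch])[0, 1]
        scores.append(r)
    return float(np.mean(scores))
```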
The assumption behind this approach is that the low correlation scores of uninformative channels will average out. Performance was also evaluated on a classification task based on the TRF model. The task of the classifier was to decide, on the basis of the recorded EEG and the two simultaneous speech streams presented to the listener (see Section 2.4), which stream the subject was attending to. The classifier had to make this decision on the basis of a segment of test data, the duration of which was varied as a parameter (1, 3, 5, 7, 10, 15, 20 and 30 s) and which will be referred to as the decoding segment length. This duration includes the kernel length of the TRF (500 ms). The position of this interval was stepped in 1-s increments.

Longer decoding segments yield more reliable correlation estimates but fewer classification decisions per minute; there is thus an optimal decoding segment duration. A number of metrics to compute the information transfer rate (ITR) have been proposed. The most common is the Wolpaw ITR (36), which is calculated in bits per minute as:

ITR = V [log2 N + P log2 P + (1 − P) log2((1 − P)/(N − 1))],

where V is the speed in trials per minute, N is the number of classes, and P is the classifier accuracy. We also report the Nykopp ITR, which assumes that a classification decision does not need to be made on every trial (21). This can be done by first calculating the confusion matrix p for classifier outputs where the classifier decision function exceeded a given threshold. This threshold is adjusted to maximize:

I = Σ_i p(w_i) Σ_j p(ŵ_j | w_i) log2 [ p(ŵ_j | w_i) / p(ŵ_j) ],

where p(w_i) is the probability of the actual class being class i, p(ŵ_j | w_i) is the probability of the predicted class being class j given the actual class being class i, and p(ŵ_j) is the probability of the predicted class being class j.

The TRF models were all trained and tested using a nested cross-validation procedure in which the data were divided into 10 folds.
One fold was held out for testing, while data from the remaining 9 folds were used to compute the TRF. An additional cross-validation loop on the remaining 9 folds was used to tune the hyperparameters. In this inner cross-validation, the regularization parameter was adjusted to maximize the correlation coefficient between the TRF model prediction and the actual measured data. For the Ridge and Lasso regularization schemes, which allow a regularization parameter between zero and infinity, a parameter sweep was performed between 10^-6 and 10^8 in 54 logarithmically spaced steps:

λ_k = λ_0 · 10^(14k/53), k = 0, 1, ..., 53,

where λ_0 ≡ 10^-6. For the LRA, Elastic Net, and Shrinkage schemes, where the regularization parameter range is between 0 and 1, a parameter sweep was performed between 10^-6 and 1 using a log-sigmoid transfer function that compresses the values between 0 and 1. The weights of the TRF models generated for each inner cross-validation fold were then averaged to generate an overall cross-validated model that could then be applied to the test set.

[...] the Technical University of Denmark (DTU). The recording sampling rate was 48 kHz. Each recording was divided into 50-s long segments, for a total of 65 segments.
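The two regularization-parameter sweeps described above can be generated as follows. This is a sketch under one assumption: we take the log-sigmoid to be the logistic function applied to ln λ, i.e. λ/(1 + λ), which maps the 10^-6 to 10^8 sweep into (0, 1).

```python
import numpy as np

# Ridge/Lasso: 54 logarithmically spaced values from 1e-6 to 1e8.
lambdas = np.logspace(-6, 8, num=54)

# LRA / Elastic Net / Shrinkage: compress the same sweep into (0, 1)
# with a log-sigmoid, sigma(ln(lam)) = lam / (1 + lam)  [our assumption].
lambdas_01 = lambdas / (1.0 + lambdas)
```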

The 50-s long speech segments were used to generate auditory scenes comprising a male and a female talker.

Electroencephalography (EEG) data were recorded from 19 subjects in an electrically shielded room while they were listening to the stimuli described above. Data from one subject were excluded from the analysis due to missing data from several trials. The data were recorded using a Biosemi Active 2 system, with a sampling rate of 512 Hz. Sixty-four-channel EEG data (10/20 system) were recorded from the scalp.

Six additional electrodes were used to record the EEG at the mastoids and the vertical and horizontal electrooculogram (V- and H-EOG). [...] typically resulting in the removal of one or two components. Lastly, the EOG channels were removed from the data, which were then referenced to a common average over all channels.

For the TRF analysis, the EEG was band-passed between 1-9 Hz using a windowed-sinc type I linear-phase finite-impulse-response (FIR) filter, shifted by its group delay to produce a zero-phase response (35), with a conservatively chosen order of 128 in order to minimize ringing effects. This frequency range was selected because cortical responses have been shown to time-lock to speech envelopes in this range (23). [...]

Given the non-Gaussian distribution of the regression accuracies (range -1 to 1) and classification performance metrics (range 0 to 1), Fisher Z-transforms and arcsine transforms, respectively, were applied to these measures prior to statistical tests and correlations.
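The zero-phase band-pass step described above can be sketched with SciPy. The sampling rate and filter settings below are illustrative assumptions, not the study's exact preprocessing:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def bandpass_zero_phase(eeg, fs, lo=1.0, hi=9.0, order=128):
    """Windowed-sinc (type I) linear-phase FIR band-pass, shifted by its
    group delay to give an approximately zero-phase response.

    eeg: 1-D signal; fs: sampling rate in Hz (assumed value for this demo).
    """
    taps = firwin(order + 1, [lo, hi], pass_zero=False, fs=fs)
    delay = order // 2                       # group delay of a type I FIR
    padded = np.concatenate([eeg, np.zeros(delay)])
    # Filter, then discard the first `delay` samples to undo the shift.
    return lfilter(taps, 1.0, padded)[delay:]
```

Applied to a mixture of a 5 Hz and a 20 Hz sinusoid at fs = 64 Hz, the filter passes the 5 Hz component nearly unchanged and strongly attenuates the 20 Hz component.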

RESULTS
The TRF estimation methods introduced in Section 2 were used to decode attended speech envelopes from low-frequency EEG activity. The following sections analyze the results with metrics of 1) regression accuracy, 2) classification accuracy, and 3) information transfer rate.

Results are shown for each of the regularization schemes, for both forward and backward TRF models. For each regularization scheme, the regularization parameter(s) are tuned to maximize regression accuracy.

These parameter values are then used for all regression and classification comparisons. The regression-accuracy analysis compares the different regularization schemes in predicting test data using the optimal regularization parameter.

The TRF estimation methods, except for the OLS method, use regularization techniques to prevent overfitting and therefore require a selection of appropriate tuning parameters. Figure 2 shows the correlation coefficient between predicted (validation set) data and the actual target data (regression accuracy) over a range of regularization parameters. In general, there is a broad region where the validation regression accuracy is flat; accuracy peaks in this region before quickly falling off with increasing λ. It is apparent that the regression accuracies obtained with backward TRF models are generally higher than those obtained with forward TRF models.

For each regression method (and each value of α for Elastic Net), the TRF model was estimated and the optimal λ determined on the training/validation set. This optimal model was then applied to the test set, and the regression accuracy was compared between regression methods. This is shown in Figure 4.

Figure 4. Test set regression accuracies (r_attend) for each TRF estimation method plotted against r_unattend. Left: results from the forward modeling approach. Right: results from the backward modeling approach. For each scheme (represented by a color), each point represents average data from one subject. The black line shows r_attend = r_unattend.

We further sought to investigate how the different TRF models perform in terms of discriminating between attended and unattended speech on a limited segment of data. The duration of the segment was varied as a parameter (1, 3, 5, 7, 10, 15, 20 and 30 s). This was characterized on held-out test data for each TRF method, using the λ value that yielded the maximum regression accuracy in the validation data.

DISCUSSION
In this study, we systematically investigated the effects of TRF estimation methods on the ability to decode and classify attended speech envelopes from single-trial EEG responses to speech mixtures. The performance of stimulus/EEG decoders based on forward TRF models (mapping from attended speech envelopes to multi-channel EEG responses) and backward TRF models (mapping from EEG responses back to speech envelopes) was compared. It was found that the backward TRF models outperformed the forward TRF models in terms of classification accuracy. We hypothesize that TRF models predict audio (the backward model) better than EEG data (the forward model) because the EEG data contain substantial activity from other brain processes. It is simpler to filter out these signals, as is done in the backward model, than it is to predict them (as the forward model would need to do to accurately predict the EEG response). Additionally, we chose to focus on 1-9 Hz EEG activity, as the attentional modulation of EEG data has been found to be prominent in this range. It is likely that other neural frequency bands also robustly track attended speech (e.g. high gamma power (25)) and that the neural decoders could potentially benefit from having access to them. This is, however, outside the scope of this paper.

Backward TRF models combine information across many EEG channels to improve the signal-to-noise ratio. This makes them fairly robust to spatially correlated artifact activity (e.g. electro-ocular and muscle artifacts) when trained on data from a large number of electrodes. This is also reflected in the high classification accuracies that were obtained with the backward models. However, for the relatively high number of electrodes used in this study, it was found that the spatio-temporal reconstruction filters were effective only when properly regularized.

We generally found that the reconstruction accuracies (r_attend) plateaued over a large range of λ values for linear TRF models (Figure 2). In fact, fixing the regularization parameter to a high value did not strongly affect the decoding accuracies compared to performing a hyperparameter search (this was tested with Ridge Regression with a fixed large λ value). [...] yielded the best regression accuracy. However, it was found that this did not lead to a better classification accuracy compared to other L2-based regression schemes (i.e. Ridge, Shrinkage and LRA), due to an associated increased variance in the correlation coefficient computed over short decoding segment lengths.

It has been reported that, in practice, the Ridge Regression approach appears to perform better than LRA (33). While LRA yielded marginally lower mean regression accuracy and classification performance than Ridge Regression, this difference was not found to be significant. LRA removes lower-variance components after the eigendecomposition of X^T X, essentially applying a hard threshold. In contrast, Ridge Regression applies a smooth down-weighting of lower-variance components (3).
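The contrast can be made concrete by comparing how the two schemes weight the eigen-components of X^T X (an illustrative toy spectrum, not data from this study):

```python
import numpy as np

# Toy eigenvalue spectrum of X^T X, sorted in descending order.
eigvals = np.array([10.0, 5.0, 1.0, 0.1, 0.01])

# Ridge: each component i is down-weighted smoothly by s_i / (s_i + lam).
lam = 1.0
ridge_weights = eigvals / (eigvals + lam)

# LRA: components are kept (weight 1) while the retained eigenvalues cover
# at most a fraction lam_frac of the total variance, and dropped after that.
lam_frac = 0.99
cum = np.cumsum(eigvals) / np.sum(eigvals)
lra_weights = (cum <= lam_frac).astype(float)
```

Ridge never discards a direction outright, it only attenuates low-variance ones, whereas LRA applies an all-or-nothing cut at the variance threshold.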

The information transfer rate results provide insight into how classification performance can be optimized.

It is worth noting that the ITR measures represent particular points along the ROC curve, each corresponding to a particular trade-off between sensitivity and specificity.
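The Wolpaw ITR referred to here can be sketched as follows (the function name and the clamping at chance-level accuracy are our conventions):

```python
import numpy as np

def wolpaw_itr(accuracy, n_classes, trials_per_minute):
    """Wolpaw ITR in bits per minute:
    V * [log2(N) + P*log2(P) + (1-P)*log2((1-P)/(N-1))].
    """
    p, n = accuracy, n_classes
    if p >= 1.0:
        bits = np.log2(n)          # perfect accuracy transfers log2(N) bits
    elif p <= 1.0 / n:
        bits = 0.0                 # at or below chance: no information
    else:
        bits = (np.log2(n) + p * np.log2(p)
                + (1 - p) * np.log2((1 - p) / (n - 1)))
    return trials_per_minute * bits
```

For a two-class decoder making one decision per second (V = 60 trials/min), an accuracy of 0.9 gives roughly 32 bits/min.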

The ITR results in the present study suggest a 3-5 s decoding segment length to achieve the maximum bit-rate. It should be noted that this assumes that switches in attention can occur frequently, on the order of the decoding segment length. In cases where switches in attention are known a priori to be sparse, it may instead be more desirable to increase the decoding segment length and sacrifice bit-rate to put more emphasis on accuracy, since the loss in bit-rate due to long decoding segments is only evident during attention switches. Such an approach was taken by O'Sullivan and colleagues (24), where the theoretical performance of a real-time TRF decoding system was characterized for switches in attention every 60 s.

In that study, a decoding segment length of 15-20 s was reported as optimal to achieve the best speed-accuracy tradeoff.