Real-Time Tracking of Selective Auditory Attention From M/EEG: A Bayesian Filtering Approach

Humans are able to identify and track a target speaker amid a cacophony of acoustic interference, an ability which is often referred to as the cocktail party phenomenon. Results from several decades of studying this phenomenon have culminated in recent years in various promising attempts to decode the attentional state of a listener in a competing-speaker environment from non-invasive neuroimaging recordings such as magnetoencephalography (MEG) and electroencephalography (EEG). To this end, most existing approaches compute correlation-based measures by either regressing the features of each speech stream to the M/EEG channels (the decoding approach) or vice versa (the encoding approach). To produce robust results, these procedures require multiple trials for training purposes. Also, their decoding accuracy drops significantly when operating at high temporal resolutions. Thus, they are not well-suited for emerging real-time applications such as smart hearing aid devices or brain-computer interface systems, where training data might be limited and high temporal resolutions are desired. In this paper, we close this gap by developing an algorithmic pipeline for real-time decoding of the attentional state. Our proposed framework consists of three main modules: (1) Real-time and robust estimation of encoding or decoding coefficients, achieved by sparse adaptive filtering, (2) Extracting reliable markers of the attentional state, and thereby generalizing the widely-used correlation-based measures thereof, and (3) Devising a near real-time state-space estimator that translates the noisy and variable attention markers to robust and statistically interpretable estimates of the attentional state with minimal delay. Our proposed algorithms integrate various techniques including forgetting factor-based adaptive filtering, ℓ1-regularization, forward-backward splitting algorithms, fixed-lag smoothing, and Expectation Maximization. 
We validate the performance of our proposed framework using comprehensive simulations as well as application to experimentally acquired M/EEG data. Our results reveal that the proposed real-time algorithms perform nearly as accurately as the existing state-of-the-art offline techniques, while providing a significant degree of adaptivity, statistical robustness, and computational savings.


DYNAMIC ENCODING AND DECODING MODELS: PARAMETER ESTIMATION
Recall that the encoder/decoder estimation problems can be posed as the following optimization problem:

\hat{\theta}_k = \arg\min_{\theta} \sum_{j=1}^{k} \lambda^{k-j} \left\| y_j - X_j \theta \right\|_2^2 + \gamma \left\| \theta \right\|_1, \quad k = 1, 2, \ldots, K. \quad (S1)

At each window k, for k = 1, \ldots, K, the encoding/decoding coefficients \theta_k are updated based on the new measurements, i.e., y_k and X_k, and on previous measurements through the forgetting factor mechanism, while applying sparsity-promoting priors on the coefficients.
There are several standard optimization techniques that can be used to find the minimizer in (S1). Offline algorithms such as interior point methods do not meet the real-time requirements of our dynamic estimation. The SPARLS algorithm was introduced in (Babadi et al., 2010) to solve the problem in (S1) through EM iterations, and it has been successfully adopted in (Akram et al., 2017) to estimate encoding coefficients in a dynamic fashion. However, the EM algorithm and the constant step-size in SPARLS may result in low convergence rates. Hence, to adapt our estimation procedure for real-time applications, we use the Forward-Backward Splitting (FBS) method (Combettes and Pesquet, 2011), also known as the proximal gradient method, to solve for \theta_k in (S1). FBS is suited for optimization problems where the objective function can be expressed as the sum of a differentiable term, e.g., the log-likelihood term in (S1), and a simple non-differentiable term, e.g., the \ell_1-norm in (S1). Problems of this type arise frequently in signal processing and machine learning (Jenatton et al., 2010; Duchi and Singer, 2009; Figueiredo et al., 2007).
In summary, each FBS iteration for the optimization problem in (S1) includes two steps: 1) taking a descent step along the gradient of the log-likelihood term, and 2) applying a soft-thresholding shrinkage operator (Goldstein et al., 2014; Sheikhattar et al., 2015). This procedure provides an algorithm that uses recursive and low-complexity updates in an online fashion to solve Eq. (S1) upon the arrival of a new data window. The optimization problem in (S1) can be rewritten as:

\hat{\theta}_k = \arg\min_{\theta} \; \theta^\top A_k \theta - 2\, b_k^\top \theta + \gamma \left\| \theta \right\|_1, \quad \text{with} \quad A_k = \lambda A_{k-1} + X_k^\top X_k, \quad b_k = \lambda b_{k-1} + X_k^\top y_k, \quad (S2)

where A_k and b_k can be updated recursively. Algorithm 1 summarizes the steps of the FBS algorithm to solve for \theta_k in (S1), when moving from window k-1 to window k, as well as the required recursive update rules for A_k and b_k. The parameter S_FBS in Algorithm 1 denotes the stopping condition for the FBS algorithm, which can be a maximum iteration number or a convergence criterion on the objective function.
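To make the window update concrete, the two FBS steps together with the recursive updates of A_k and b_k can be sketched as follows. This is a minimal illustration with a fixed step-size and fixed iteration count, not the FASTA-based implementation used in the paper; the default parameter values are placeholders.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding (proximal operator of t * ||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fbs_window_update(theta, A_prev, b_prev, X_k, y_k,
                      lam=0.9167, gamma=0.005, tau=None, n_iter=20):
    """One window update of the sparse adaptive filter via forward-backward
    splitting, warm-started at the previous estimate `theta`."""
    # Recursive (forgetting-factor) updates of the sufficient statistics.
    A_k = lam * A_prev + X_k.T @ X_k
    b_k = lam * b_prev + X_k.T @ y_k
    # A safe fixed step-size below 2 / L, where L = 2 * lambda_max(A_k).
    if tau is None:
        tau = 0.9 / (2 * np.linalg.norm(A_k, 2))
    for _ in range(n_iter):
        grad = 2 * (A_k @ theta - b_k)      # forward (gradient) step
        theta = soft_threshold(theta - tau * grad, tau * gamma)  # backward step
    return theta, A_k, b_k
```

Warm-starting at the previous window's estimate is what makes a small, fixed number of inner iterations sufficient in the online setting.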

Algorithm 1 Parameter Estimation in Dynamic Encoding and Decoding Models by Forward-Backward Splitting

Input: step-size \tau, regularization parameter \gamma, previous estimates \hat{\theta}_{k-1}, A_{k-1}, b_{k-1}, and new data X_k, y_k.
1: A_k = \lambda A_{k-1} + X_k^\top X_k
2: b_k = \lambda b_{k-1} + X_k^\top y_k
3: \theta = \hat{\theta}_{k-1}
4: while S_FBS is not satisfied do
5:   \theta = \theta - 2\tau \left( A_k \theta - b_k \right)   [forward/gradient step]
6:   \theta_i = \mathrm{sign}(\theta_i) \max\left( |\theta_i| - \tau\gamma, 0 \right), for each element of \theta   [backward/shrinkage step]
7: end while
8: \hat{\theta}_k = \theta.
Remark 1. A proper step-size choice in Alg. 1 at each FBS iteration is crucial to the convergence of the algorithm. For a fixed step-size, it has been shown that \tau < 2 / L(\nabla f_k) ensures the stability and convergence of the algorithm (Combettes and Pesquet, 2011), where L(\cdot) denotes the Lipschitz constant, and f_k denotes the log-likelihood term in (S1). Through standard Cauchy-Schwarz and triangle inequality manipulations, we can derive the simple upper bound L(\nabla f_k) \le L_{ub} = 2 \sum_{j=1}^{k} \lambda^{k-j} \, \mathrm{trace}\left( X_j^\top X_j \right), implying that \tau < 2 / L_{ub} ensures stability; however, this loose upper bound may decrease the convergence rate of the algorithm. Thus, it is more beneficial to ensure stability through backtracking and to employ acceleration schemes such as adaptive step-size selection or Nesterov's method (Goldstein et al., 2014). In this work, we have used the FASTA software package (Goldstein et al., 2014), available online (Goldstein et al., 2015), which has built-in implementations of all the foregoing FBS step-size adjustment methods.
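A quick numerical check illustrates why the trace-based bound is loose: for a positive semi-definite matrix, the trace (the sum of all eigenvalues) upper-bounds the largest eigenvalue, often by a wide margin. The random design matrix below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 40))
A = X.T @ X                          # Hessian of the quadratic term is 2 * A
L_tight = 2 * np.linalg.norm(A, 2)   # exact Lipschitz constant of the gradient
L_ub = 2 * np.trace(A)               # trace-based upper bound from the remark
assert L_ub >= L_tight               # the bound always holds for PSD matrices
ratio = L_ub / L_tight               # ratio >> 1 forces a much smaller step
```

A large ratio means the safe step-size 2 / L_ub is far smaller than necessary, which is exactly why backtracking or adaptive step-size selection converges faster in practice.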

DYNAMIC STATE-SPACE MODEL: PARAMETER ESTIMATION
Recall that p_k denotes the probability of attending to speaker 1 at instance k, for k = 1, \ldots, K_A. Although each k corresponds to a data window in time, we refer to it as an instance so as not to conflate it with the fixed-lag sliding window used for state estimation. The parameter K_A denotes the number of instances in fixed-lag smoothing, as shown in Figure S1 (reproduced from Figure 2 for completeness).

Figure S1. The parameters involved in state-space fixed-lag smoothing.
The linear state-space model, which we apply to logit(p_k) = \ln\left( \frac{p_k}{1 - p_k} \right), can be summarized as:

z_k = z_{k-1} + w_k, \quad w_k \sim \mathcal{N}(0, \eta_k), \quad \eta_k \sim \text{Inverse-Gamma}(a_0, b_0), \quad P(n_k = 1) = p_k = \frac{1}{1 + e^{-z_k}}, \quad k = 1, \ldots, K_A, \quad (S3)

where z_k denotes the hidden state, the m_k^{(i)}'s represent the attention markers, and n_k represents a binary random variable taking the value 1 or 2 depending on the attended speaker at instance k, for k = 1, \ldots, K_A. The observation equations of the state-space model, which relate the observed m^{(1)}_{1:K_A} and m^{(2)}_{1:K_A} to the hidden variables of the state-space model in Eq. (S3), can be summarized as:

m_k^{(i)} \mid n_k \sim \begin{cases} \text{Log-Normal}\left( \mu^{(a)}, \rho^{(a)} \right), & n_k = i \\ \text{Log-Normal}\left( \mu^{(u)}, \rho^{(u)} \right), & n_k \neq i \end{cases}, \quad i = 1, 2, \quad (S4)

where \left( \mu^{(a)}, \rho^{(a)} \right) and \left( \mu^{(u)}, \rho^{(u)} \right) denote the parameters of the attended and unattended Log-Normal distributions, respectively. The parameters of the state-space model are, therefore, \Omega = \left\{ z_{1:K_A}, \eta_{1:K_A}, \mu^{(a)}, \rho^{(a)}, \mu^{(u)}, \rho^{(u)} \right\}, given the observed attention markers m^{(1)}_{1:K_A} and m^{(2)}_{1:K_A}. For notational simplicity, hereafter we use the boldface version of a variable to denote a vector containing all its instances, e.g., z := z_{1:K_A}. The inference problem for \Omega can be expressed as:

\hat{\Omega} = \arg\max_{\Omega} \; \ln P\left( m^{(1)}, m^{(2)} \mid \Omega \right) + \ln P(\Omega), \quad (S5)

where the log-likelihood and the log-prior are respectively expanded as:

\ln P\left( m^{(1)}, m^{(2)} \mid \Omega \right) = \sum_{k=1}^{K_A} \ln \left[ p_k \, P\left( m_k^{(1)} \mid \mu^{(a)}, \rho^{(a)} \right) P\left( m_k^{(2)} \mid \mu^{(u)}, \rho^{(u)} \right) + (1 - p_k) \, P\left( m_k^{(1)} \mid \mu^{(u)}, \rho^{(u)} \right) P\left( m_k^{(2)} \mid \mu^{(a)}, \rho^{(a)} \right) \right], \quad (S6)

\ln P(\Omega) = \sum_{k=1}^{K_A} \left[ -\frac{(z_k - z_{k-1})^2}{2\eta_k} - \frac{1}{2} \ln \eta_k - (a_0 + 1) \ln \eta_k - \frac{b_0}{\eta_k} \right] + \ln P\left( \mu^{(a)}, \rho^{(a)} \right) + \ln P\left( \mu^{(u)}, \rho^{(u)} \right) + \text{cnst.} \quad (S7)

Similar to the treatment in (Akram et al., 2016), we use an Expectation Maximization (EM) algorithm with n as the latent variables to infer \Omega. Note that the optimization problem in (S5) is non-convex in general; thus, the choice of initial conditions and of hyperparameters for the priors is important for reaching a desirable local maximum. Given the estimate \hat{\Omega}^{(\ell)} of \Omega at the \ell-th EM iteration, we next derive the E-step and M-step of the (\ell+1)-th EM iteration.

The E-step
In the E-step, the surrogate function Q\left( \Omega \mid \hat{\Omega}^{(\ell)} \right) is formed as:

Q\left( \Omega \mid \hat{\Omega}^{(\ell)} \right) = \mathbb{E}\left[ \frac{1}{K_A} \ln P\left( m^{(1)}, m^{(2)}, n \mid \Omega \right) \right] + \ln P(\Omega), \quad (S8)

where the expectation of the complete log-likelihood \ln P\left( m^{(1)}, m^{(2)}, n \mid \Omega \right) needs to be calculated with respect to n given m^{(1)}, m^{(2)}, \hat{\Omega}^{(\ell)}. For notational simplicity, hereafter we drop the n \mid m^{(1)}, m^{(2)}, \hat{\Omega}^{(\ell)} subscript of the conditional expectations.
We have used a normalized version of the log-likelihood in Eq. (S8) for two reasons. First, the window length K_A is a hyperparameter in our framework, which we can modify to find the optimal trade-off between the dimensionality of the state-space and the history-dependence of the model. Thus, to be able to change the window length while keeping the priors fixed, it is important to normalize the contribution of the log-likelihood in (S8). Second, as noted before, we have a non-convex inference problem, which makes the resulting local maximum dependent on the conjugate priors used. We can use samples of the m_k^{(i)}'s to estimate the attended and unattended Log-Normal distributions and tune the hyperparameters to these distributions. By normalizing the log-likelihood term, we enforce informative, empirical prior distributions that guide the inference procedure towards a plausible local maximum. For instance, for the correlation-based attention marker, we expect a plausible solution to result in the attended Log-Normal distribution being concentrated around larger correlation values than the unattended distribution. Nevertheless, the forthcoming derivations can be carried out without the normalization factor 1/K_A in a similar fashion.
Let I_u(v) represent the indicator function, i.e., equal to one if v = u and zero otherwise. Conditioning on n and using the conditional independence of m^{(1)} and m^{(2)} given n and \Omega, the expected log-likelihood in (S8) can be simplified as:

\mathbb{E}\left[ \frac{1}{K_A} \ln P\left( m^{(1)}, m^{(2)}, n \mid \Omega \right) \right] = \frac{1}{K_A} \sum_{k=1}^{K_A} \sum_{j=1}^{2} \mathbb{E}\left[ I_j(n_k) \right] \left[ \ln P\left( m_k^{(1)} \mid n_k = j, \Omega \right) + \ln P\left( m_k^{(2)} \mid n_k = j, \Omega \right) + \ln P\left( n_k = j \mid z_k \right) \right]. \quad (S9)

Note that P\left( m_k^{(i)} \mid n_k, \Omega \right) pertains to either the attended or the unattended Log-Normal distribution in Eq. (S4), depending on the values of i and n_k. Considering that the n_k's are binary random variables and that the expectations are with respect to n \mid m^{(1)}, m^{(2)}, \hat{\Omega}^{(\ell)}, the term \mathbb{E}\left[ I_j(n_k) \right] can be computed for j = 1, 2 using Bayes' rule and conditional independence; for j = 1:

\mathbb{E}\left[ I_1(n_k) \right] = \frac{ \hat{p}_k^{(\ell)} \, P\left( m_k^{(1)} \mid \hat{\mu}^{(a)}, \hat{\rho}^{(a)} \right) P\left( m_k^{(2)} \mid \hat{\mu}^{(u)}, \hat{\rho}^{(u)} \right) }{ \hat{p}_k^{(\ell)} \, P\left( m_k^{(1)} \mid \hat{\mu}^{(a)}, \hat{\rho}^{(a)} \right) P\left( m_k^{(2)} \mid \hat{\mu}^{(u)}, \hat{\rho}^{(u)} \right) + \left( 1 - \hat{p}_k^{(\ell)} \right) P\left( m_k^{(1)} \mid \hat{\mu}^{(u)}, \hat{\rho}^{(u)} \right) P\left( m_k^{(2)} \mid \hat{\mu}^{(a)}, \hat{\rho}^{(a)} \right) }, \quad (S10)

where the parameters of the Log-Normal distributions are those of \hat{\Omega}^{(\ell)}, \hat{p}_k^{(\ell)} = 1 / \left( 1 + e^{-\hat{z}_k^{(\ell)}} \right), and \hat{z}_k^{(\ell)} is the estimate of z_k from the previous EM iteration. Note that \mathbb{E}\left[ I_1(n_k) \right] = 1 - \mathbb{E}\left[ I_2(n_k) \right], as n_k is a binary random variable. Defining \varepsilon_k^{(\ell)} := \mathbb{E}\left[ I_1(n_k) \right], the surrogate function can be written compactly as Q\left( \Omega \mid \hat{\Omega}^{(\ell)} \right) in (S11), where the cnst. term includes all the terms that are independent of \Omega.
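The responsibility computation of the E-step can be sketched as follows. Working in the log domain avoids numerical underflow when the Log-Normal densities are small; the parameter values used in any call are hypothetical.

```python
import numpy as np

def lognormal_logpdf(m, mu, rho):
    """Log-density of a Log-Normal with location mu and scale rho (variance of ln m)."""
    return -np.log(m) - 0.5 * np.log(2 * np.pi * rho) \
           - (np.log(m) - mu) ** 2 / (2 * rho)

def responsibilities(m1, m2, z, mu_a, rho_a, mu_u, rho_u):
    """E-step weights eps_k = E[I_1(n_k)]: posterior probability that
    speaker 1 is attended, given both markers and the current state estimate."""
    p = 1.0 / (1.0 + np.exp(-z))               # prior P(n_k = 1) from the state
    log_s1 = np.log(p) + lognormal_logpdf(m1, mu_a, rho_a) \
                       + lognormal_logpdf(m2, mu_u, rho_u)
    log_s2 = np.log1p(-p) + lognormal_logpdf(m1, mu_u, rho_u) \
                          + lognormal_logpdf(m2, mu_a, rho_a)
    # Normalize in the log domain for numerical stability.
    mx = np.maximum(log_s1, log_s2)
    s1, s2 = np.exp(log_s1 - mx), np.exp(log_s2 - mx)
    return s1 / (s1 + s2)
```

The function is vectorized over the K_A instances, so one call produces all the \varepsilon_k^{(\ell)} weights for the current window.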

The M-step
In the M-step, we maximize Q\left( \Omega \mid \hat{\Omega}^{(\ell)} \right) in Eq. (S11) with respect to \Omega; the maximizers form the parameter updates for the (\ell+1)-th EM iteration. As we observe in Eq. (S11), having n as the latent variables separates the terms in Q into those depending on the Log-Normal distribution parameters and those depending on the state-space parameters, i.e., z and \eta. The derivation of the update rules for the distribution parameters is straightforward, by taking the derivatives of Q with respect to \mu^{(a)}, \mu^{(u)}, \rho^{(a)}, and \rho^{(u)} and solving for their joint zero-crossings; the resulting closed-form formulas are given in Eqs. (S12)-(S15). For instance, in the absence of the conjugate-prior terms, the attended-distribution updates reduce to responsibility-weighted sample means and variances of \ln m_k^{(i)}:

\mu^{(a)} = \frac{1}{K_A} \sum_{k=1}^{K_A} \left[ \varepsilon_k^{(\ell)} \ln m_k^{(1)} + \left( 1 - \varepsilon_k^{(\ell)} \right) \ln m_k^{(2)} \right], \quad \rho^{(a)} = \frac{1}{K_A} \sum_{k=1}^{K_A} \left[ \varepsilon_k^{(\ell)} \left( \ln m_k^{(1)} - \mu^{(a)} \right)^2 + \left( 1 - \varepsilon_k^{(\ell)} \right) \left( \ln m_k^{(2)} - \mu^{(a)} \right)^2 \right],

with symmetric expressions for \mu^{(u)} and \rho^{(u)} obtained by swapping the weights \varepsilon_k^{(\ell)} and 1 - \varepsilon_k^{(\ell)}. It remains to maximize Q\left( \Omega \mid \hat{\Omega}^{(\ell)} \right) with respect to z and \eta; note that this joint maximization is non-convex in general. To this end, consider the following auxiliary state-space model with parameters (z', \eta') and binary observations n'_k:

z'_k = z'_{k-1} + w'_k, \quad w'_k \sim \mathcal{N}(0, \eta'_k), \quad \eta'_k \sim \text{Inverse-Gamma}(a_0, b_0), \quad P(n'_k = 1) = \frac{1}{1 + e^{-z'_k}}, \quad k = 1, \ldots, K_A. \quad (S16)
For the inference problem in (S16), the log-posterior can be expressed as:

\ln P\left( z', \eta' \mid n' \right) = \sum_{k=1}^{K_A} \left[ n'_k z'_k - \ln\left( 1 + e^{z'_k} \right) \right] - \sum_{k=1}^{K_A} \left[ \frac{\left( z'_k - z'_{k-1} \right)^2}{2\eta'_k} + \frac{1}{2} \ln \eta'_k + (a_0 + 1) \ln \eta'_k + \frac{b_0}{\eta'_k} \right] + \text{cnst.} \quad (S17)

If we replace the observations n'_k in (S17) with \varepsilon_k^{(\ell)}, for k = 1, 2, \ldots, K_A, the inference problem becomes equivalent to maximizing Q\left( \Omega \mid \hat{\Omega}^{(\ell)} \right) in (S11) with respect to z and \eta.
In (Smith and Brown, 2003; Smith et al., 2004), the inference of the parameters in (S16) has been carried out through the EM algorithm, where in each iteration a Kalman filtering and smoothing algorithm is employed together with Gaussian approximations. Similar to (Akram et al., 2016), we refer to this EM algorithm as the inner EM, so as not to confuse it with the EM algorithm we have already adopted, which we call the outer EM hereafter. The basic idea behind the inner EM is to approximate the solutions to (S17) as:

\left( \hat{z}', \hat{\eta}' \right) \approx \left( z'^{*}, \eta'^{*} \right), \quad (S18)

where \eta'^{*} is estimated through the inner EM with z' as the latent variables, and z'^{*} is simply the result of a Kalman filtering and smoothing algorithm applied to (S16) with \eta' = \eta'^{*}. In order to make the inference procedure suitable for real-time implementation, we can avoid the inner EM and instead use crude estimates of \eta'^{*} in (S18). Note that \varepsilon_k^{(\ell)}, which acts as the observation n'_k in (S16) for k = 1, 2, \ldots, K_A, is equal to P\left( n_k = 1 \mid m^{(1)}, m^{(2)}, \hat{\Omega}^{(\ell)} \right); we can thus treat d_k := \mathrm{logit}\left( \varepsilon_k^{(\ell)} \right) - \hat{z}_{k-1}^{(\ell)} as a sample of \mathcal{N}(0, \eta'_k). Therefore, considering the Inverse-Gamma prior, a crude estimate for \eta'^{*}_k can be calculated for k = 1, 2, \ldots, K_A as the mode of the resulting Inverse-Gamma posterior:

\eta'^{*}_k = \frac{b_0 + d_k^2 / 2}{a_0 + 3/2}. \quad (S19)

If K_A is small enough, we can simplify the state-space model of (S16) by assuming a single variance, i.e., \eta' = \eta'_k for k = 1, 2, \ldots, K_A, and using an estimate similar to (S19) for \eta'^{*}. In this model, however, the crude estimate is more reliable, as it is based on K_A samples rather than a single sample. Considering a normalized log-likelihood and the same Inverse-Gamma prior on \eta', the estimate for \eta'^{*} can be computed as:

\eta'^{*} = \frac{b_0 + \frac{1}{2 K_A} \sum_{k=1}^{K_A} d_k^2}{a_0 + 3/2}. \quad (S20)

After estimating \eta'^{*}_k in (S19) for k = 1, 2, \ldots, K_A, or \eta'^{*} in (S20), we can proceed as before to estimate z'^{*}, i.e., using a Kalman filtering and smoothing algorithm with Gaussian approximations, as in (S18).
These estimates, namely z^{*} and \eta^{*}, form approximate solutions for z and \eta in the original problem of maximizing Q\left( \Omega \mid \hat{\Omega}^{(\ell)} \right) in (S11) with respect to the state-space parameters. Next, we discuss the details of the inner EM algorithm, as in (Akram et al., 2016), used to solve for z' and \eta' in (S16). As mentioned before, the idea is to use an EM algorithm together with Gaussian approximations to maximize P\left( \eta' \mid n' \right), and to then maximize the likelihood of z' with respect to the observations and the estimated variances. Considering z' as the latent variables, the surrogate function Q\left( \eta' \mid \hat{\eta}'^{(\ell)} \right) at the \ell-th inner EM iteration is calculated as:

Q\left( \eta' \mid \hat{\eta}'^{(\ell)} \right) = -\sum_{k=1}^{K_A} \left[ \frac{\mathbb{E}\left[ \left( z'_k - z'_{k-1} \right)^2 \right]}{2\eta'_k} + \frac{1}{2} \ln \eta'_k + (a_0 + 1) \ln \eta'_k + \frac{b_0}{\eta'_k} \right] + \text{cnst.}, \quad (S21)

where the expectations are with respect to z' \mid n', \hat{\eta}'^{(\ell)}, and the cnst. term contains all the terms that are independent of \eta'.
In the M-step of the inner EM algorithm, Q\left( \eta' \mid \hat{\eta}'^{(\ell)} \right) is maximized with respect to \eta' to calculate the updated variances for the next inner EM iteration. Taking the derivative of (S21) with respect to \eta'_k and equating it to zero results in the following update rule:

\hat{\eta}'^{(\ell+1)}_k = \frac{ b_0 + \frac{1}{2} \left[ \left( \bar{z}_{k|K_A} - \bar{z}_{k-1|K_A} \right)^2 + \sigma^2_{k|K_A} + \sigma^2_{k-1|K_A} - 2\sigma^2_{k,k-1|K_A} \right] }{ a_0 + 3/2 }, \quad (S22)

where the parameters \bar{z}_{k|K_A} and \sigma^2_{k|K_A} in Eq. (S22) are respectively the mean and the variance of z'_k given n'_{1:K_A} and \hat{\eta}'^{(\ell)}. If we consider the Gaussian approximation z'_{k_1} \mid n'_{1:k_2} \sim \mathcal{N}\left( \bar{z}_{k_1|k_2}, \sigma^2_{k_1|k_2} \right), these parameters can be computed in a forward and a backward pass similar to the conventional Kalman filtering and smoothing algorithms. The corresponding filtering equations for 1 \le k \le K_A are summarized as:

\bar{z}_{k|k-1} = \bar{z}_{k-1|k-1},
\sigma^2_{k|k-1} = \sigma^2_{k-1|k-1} + \hat{\eta}'^{(\ell)}_k,
\bar{z}_{k|k} = \bar{z}_{k|k-1} + \sigma^2_{k|k-1} \left( n'_k - \frac{1}{1 + e^{-\bar{z}_{k|k}}} \right),
\sigma^2_{k|k} = \left[ \frac{1}{\sigma^2_{k|k-1}} + \frac{e^{\bar{z}_{k|k}}}{\left( 1 + e^{\bar{z}_{k|k}} \right)^2} \right]^{-1}. \quad (S23)

Note that the third equation in (S23) is a non-linear equation in \bar{z}_{k|k} whose solution can be approximated through standard approaches such as Newton's method. The last two equations in (S23) come from the Gaussian approximation: assuming that z'_{k-1} \mid n'_{1:k-1} is Gaussian, the mean \bar{z}_{k|k} of the Gaussian approximation is calculated as the mode of \ln P\left( z'_k \mid n'_{1:k} \right), and its variance \sigma^2_{k|k} is computed as the negative inverse Hessian of \ln P\left( z'_k \mid n'_{1:k} \right) evaluated at the estimated mean \bar{z}_{k|k} (Tanner, 1991). The smoothing equations are the same as those used for fixed-interval smoothing. Therefore, for 1 \le k \le K_A - 1, we have:

s_k = \sigma^2_{k|k} / \sigma^2_{k+1|k},
\bar{z}_{k|K_A} = \bar{z}_{k|k} + s_k \left( \bar{z}_{k+1|K_A} - \bar{z}_{k+1|k} \right),
\sigma^2_{k|K_A} = \sigma^2_{k|k} + s_k^2 \left( \sigma^2_{k+1|K_A} - \sigma^2_{k+1|k} \right). \quad (S24)

The \sigma^2_{k,k-1|K_A} term in (S22) is a lagged covariance term that can be computed using the covariance smoothing algorithm (De Jong and Mackinnon, 1988):

\sigma^2_{k,k-1|K_A} = s_{k-1} \, \sigma^2_{k|K_A}. \quad (S25)

Having calculated the variances \eta'^{*} from the inner EM algorithm, z'^{*} can be estimated using a single forward and backward pass with \eta' = \eta'^{*}, similar to that used within the inner EM algorithm. In summary, we have transformed the problem of maximizing (S11) with respect to z and \eta into that of inferring z' and \eta' in (S16), by identifying n'_k with \varepsilon_k^{(\ell)} for k = 1, \ldots, K_A. We have then solved the latter problem through an EM algorithm combined with Gaussian approximations and Kalman filtering and smoothing, yielding z^{*} = z'^{*} and \eta^{*} = \eta'^{*} in the original problem.
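The forward filtering pass of (S23), with the nonlinear mean update solved by Newton's method, followed by the fixed-interval smoothing pass of (S24) and the lagged covariances of (S25), can be sketched as below. The initial conditions `z0` and `s0` are illustrative assumptions.

```python
import numpy as np

def filter_smooth(eps, eta, z0=0.0, s0=1.0):
    """Approximate Kalman filtering/smoothing for the Bernoulli-observation
    state-space model, with eps playing the role of the observations n'_k.
    eta[k] is the state-noise variance at step k; returns smoothed means,
    smoothed variances, and lag-one smoothed covariances."""
    K = len(eps)
    z_pred, s_pred = np.zeros(K), np.zeros(K)   # one-step predictions
    z_filt, s_filt = np.zeros(K), np.zeros(K)   # filtered estimates
    z_prev, s_prev = z0, s0
    for k in range(K):
        z_pred[k] = z_prev
        s_pred[k] = s_prev + eta[k]
        # Solve z = z_pred + s_pred * (eps_k - sigmoid(z)) by Newton's method.
        z = z_pred[k]
        for _ in range(50):
            p = 1.0 / (1.0 + np.exp(-z))
            g = z - z_pred[k] - s_pred[k] * (eps[k] - p)
            z -= g / (1.0 + s_pred[k] * p * (1.0 - p))
        p = 1.0 / (1.0 + np.exp(-z))
        z_filt[k] = z
        s_filt[k] = 1.0 / (1.0 / s_pred[k] + p * (1.0 - p))
        z_prev, s_prev = z_filt[k], s_filt[k]
    # Fixed-interval smoothing (backward pass) plus lag-one covariances.
    z_sm, s_sm = z_filt.copy(), s_filt.copy()
    cov = np.zeros(K)
    for k in range(K - 2, -1, -1):
        s_k = s_filt[k] / s_pred[k + 1]
        z_sm[k] = z_filt[k] + s_k * (z_sm[k + 1] - z_pred[k + 1])
        s_sm[k] = s_filt[k] + s_k ** 2 * (s_sm[k + 1] - s_pred[k + 1])
        cov[k + 1] = s_k * s_sm[k + 1]
    return z_sm, s_sm, cov
```

Feeding consistently large responsibilities (near 1) drives the smoothed states positive, i.e., towards attending speaker 1, as expected from the sigmoid link.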

Algorithm 2 Parameter Estimation in Dynamic State-Space Model
Input: attention markers m^{(1)}_{1:K_A} and m^{(2)}_{1:K_A}.
1: initialize \hat{\Omega}^{(0)} using the estimates from the previous instance
2: \ell = 0
3: while S_EM is not satisfied do
4:   compute \varepsilon^{(\ell)}_{1:K_A} based on Eq. (S10)
5:   update the parameters of the Log-Normal distributions, i.e., \mu^{(a)}, \mu^{(u)}, \rho^{(a)}, \rho^{(u)}, based on equations (S12), (S13), (S14), and (S15), respectively
6:   update the state-space variances, i.e., \eta_{1:K_A}, using the inner EM algorithm or the crude estimates in equations (S19) and (S20)
7:   update the hidden states in the state-space model, i.e., z_{1:K_A}, using a Kalman filtering and smoothing algorithm with Gaussian approximations
8:   set \hat{\Omega}^{(\ell+1)} as the updated parameter set, including the updated distribution parameters, variances, and hidden states of the state-space model
9:   \ell \leftarrow \ell + 1
10: end while

Algorithm 2 summarizes the overall inference procedure within a fixed-lag window of length K_A. Going back to Fig. S1, reproduced from the paper, we assume that k = k_0 is the current instance, and the goal is to infer the attentional state at instance k = k_0 - K_F based on the attention markers within the window indexed from 1 to K_A, given by m^{(i)}_k for i = 1, 2 and k = 1, \ldots, K_A. We initialize the state-space model parameter set \Omega using the estimates from the previous instance, and the output of Algorithm 2, i.e., \hat{\Omega}, is used for initialization at the next instance. Defining f(\cdot) as the sigmoid function, f\left( \hat{z}_{K_A - K_F | K_A} \right) determines the estimated probability of attending to speaker 1 at k = k_0 - K_F, and f\left( \hat{z}_{K_A - K_F | K_A} \pm 1.645 \, \hat{\sigma}_{K_A - K_F | K_A} \right) represents the 90% confidence interval of this estimate, where \hat{\sigma}^2_{K_A - K_F | K_A} represents the inferred variance of \hat{z}_{K_A - K_F}, calculated through the discussed Gaussian approximations. The parameter S_EM in Algorithm 2 is a stopping condition for the outer EM, which can be a limit on the number of iterations.
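The final mapping from the smoothed state and its variance to an attention probability with a 90% confidence interval amounts to passing the mean and the 1.645-sigma points through the sigmoid; a minimal sketch:

```python
import numpy as np

def attention_estimate(z_hat, var_hat):
    """Map the smoothed state mean and variance to the estimated probability
    of attending to speaker 1 and its 90% confidence interval."""
    f = lambda z: 1.0 / (1.0 + np.exp(-z))   # the sigmoid link f(.)
    half = 1.645 * np.sqrt(var_hat)          # 90% two-sided normal quantile
    return f(z_hat), (f(z_hat - half), f(z_hat + half))
```

Because the sigmoid is monotone, the interval endpoints in the probability domain are simply the transformed endpoints in the state domain; no further approximation is needed at this step.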

SMOOTHING EFFECT OF STATE-SPACE MODELING
In this section, we discuss the smoothing effect of the proposed state-space estimation and compare it with that of sliding Gaussian kernel smoothers. Recall that the Inverse-Gamma conjugate prior on ⌘ k 's in Eq. (S3) controls the degree of smoothing in the state-space model. If this prior favors smaller values of ⌘ k 's, the consecutive changes in z k 's and thereby p k 's will be smaller, which results in a larger smoothing effect. We tune the Inverse-Gamma prior through the hyperparameters a 0 and b 0 as in Eq. (S3) to match the auditory attention dynamics. Therefore, we expect that the corresponding smoothing effect will make the state-space estimates robust to the stochastic fluctuations in the attention markers, while capturing the attention switching instances with a small transition delay. Fig. S2-A shows the output of the correlation-based attention marker in Case 2 of the simulation study in the main manuscript (row D of Fig. 4). The output of the real-time estimator with 1.5 s forward-lag (as in row F of Fig. 4) is shown in Fig. S2-B. We also consider two non-causal Gaussian kernel smoothers with the same delay of 1.5 s for fairness of comparison. Both kernels provide a clearer picture of the attentional state by smoothing out the stochastic fluctuations of Fig. S2-A. However, unlike the output of the state-space estimator, they do not provide statistically interpretable results. First, based on Figs. S2-C and -D, we can only obtain a binary decision on the attended speaker at each instance. The state-space estimates, however, provide a probabilistic measure of the attentional state as shown in Fig. S2-B, together with statistical confidence intervals. The red arrows in Fig. S2-C and -D mark instances where strong fluctuations in the attention markers result in misclassification. For instance, the smoothed markers with kernel 2 imply an attention switch earlier than the 30 s mark (upward arrow, Fig. S2-D). 
Such abrupt classification errors can be undesirable for applications such as BCI systems or smart hearing aids, as the devices would need to modify their settings back and forth within a short time period. The state-space model prevents these instances of misclassification, thanks to the confidence intervals of the estimated p_k's (the middle arrows), which help rule out such false-alarm events.
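For completeness, a sliding non-causal Gaussian kernel smoother of the kind compared against above can be sketched as follows. The kernel width and the exact window handling are illustrative assumptions, not the kernels used in Fig. S2.

```python
import numpy as np

def kernel_smooth(markers, sigma_s=2.0, dt=0.25, lag_s=1.5):
    """Non-causal sliding Gaussian kernel smoother: at each instance, average
    the markers with Gaussian weights, looking at most `lag_s` seconds ahead
    (matching the forward-lag allowed to the state-space estimator)."""
    K = len(markers)
    out = np.empty(K)
    max_ahead = int(lag_s / dt)
    for k in range(K):
        idx = np.arange(0, min(K, k + max_ahead + 1))
        w = np.exp(-((idx - k) * dt) ** 2 / (2 * sigma_s ** 2))
        out[k] = np.sum(w * markers[idx]) / np.sum(w)
    return out
```

Note that the output is a smoothed marker value, not a probability: turning it into a decision still requires thresholding, which is precisely the step that lacks the confidence intervals provided by the state-space estimator.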

ENCODING MODEL SIMULATION
This section provides a simulated example to motivate our MEG analysis, in which we use an encoding model and take the M100 component of the Temporal Response Function (TRF) as the attention marker.

Simulation Settings
Consider the following generative model:

e_t = \sum_{l=0}^{L_e} \tau^{(1)}_l s^{(1)}_{t-l} + \sum_{l=0}^{L_e} \tau^{(2)}_l s^{(2)}_{t-l} + \mu + n_t, \quad (S26)

where e_t, s^{(1)}_t, and s^{(2)}_t respectively denote the auditory component of the neural response, the speech envelope of speaker 1, and the speech envelope of speaker 2. We have used the same speech signals as in the EEG simulation. The kernels \tau^{(1)}_t and \tau^{(2)}_t are referred to as the TRFs for speakers 1 and 2. We have set \mu = 0.001 as the unknown constant mean and n_t \overset{iid}{\sim} \mathcal{N}\left( 0, 2.5 \times 10^{-7} \right) as the observation noise. We assume an attention modulation effect on the M100 component of the TRFs. Figure S3 shows two cases for the TRFs \tau^{(1)}_t and \tau^{(2)}_t: in the left panels (case 1), there is a strong attention modulation effect on the M100 components, and in the right panels (case 2), this effect is weakened. In both cases, the attention is on speaker 1 during the [0, 30) s interval and on speaker 2 during the (30, 60] s interval. Also, we have considered a length of 0.4 s for the TRFs. Row B in Fig. S3 shows examples of the attended and unattended TRFs for each of the two cases. In case 1, there is a large difference between the magnitudes of the M100 components in the attended and unattended TRFs, while in case 2, this difference is small compared to our estimation accuracy. We have also considered three higher-latency components in the TRFs which, similar to the M50 component, are not modulated by the attentional state. As shown in row A of Fig. S3, a zero-mean i.i.d. Gaussian noise is added to the TRF components as well. Note that, similar to the EEG simulation, we have used a Gaussian kernel with a standard deviation of 10 ms to smooth the TRFs. This smoothness property is also observed in TRFs estimated from experimentally recorded MEG signals (Ding and Simon, 2012a,b).
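A sketch of this generative model is given below. The sampling rate is inferred from the 0.25 s / 50-sample windows and 5 ms increments stated later; the envelope surrogates and the analytic TRF shapes are illustrative assumptions standing in for the actual speech envelopes and smoothed TRFs.

```python
import numpy as np

fs = 200                        # assumed sampling rate (Hz); T_s = 5 ms
T, L = 60 * fs, int(0.4 * fs)   # 60 s trial, 0.4 s TRFs
rng = np.random.default_rng(0)

# Stand-ins for the speech envelopes (the paper reuses real speech envelopes).
s1, s2 = np.abs(rng.standard_normal((2, T)))

# Toy TRFs: an M50-like peak plus an attention-modulated M100-like trough.
lags = np.arange(L) / fs
trf = lambda m100: 0.5 * np.exp(-((lags - 0.05) / 0.01) ** 2) \
                 - m100 * np.exp(-((lags - 0.10) / 0.01) ** 2)
tau1, tau2 = trf(1.0), trf(0.3)   # speaker 1 attended (larger M100 magnitude)

# The generative model of Eq. (S26): convolutions plus mean and noise.
mu, noise = 0.001, rng.normal(0, np.sqrt(2.5e-7), T)
e = np.convolve(s1, tau1)[:T] + np.convolve(s2, tau2)[:T] + mu + noise
```

Truncating the full convolutions to the first T samples keeps the model causal, matching the lagged-sum form of Eq. (S26).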

Parameter Selection
For the encoder estimation parameters in Algorithm 1, we have considered consecutive non-overlapping windows of length 0.25 s, i.e., W = 50, resulting in K = 240 instances, and we have assumed the same 0.4 s length for the TRFs, i.e., L_e = 80. We have chosen \gamma = 0.005 through cross-validation and \lambda = 0.9167, which results in an effective window length of 3 s for encoder estimation. Considering the smoothing Gaussian kernel used in the forward model, we have used the Gaussian dictionary matrix G_0 \in \mathbb{R}^{(L_e+1) \times (L_e+1)} for each speaker in the encoder estimation step to enforce smoothness in the TRFs. The dictionary columns consist of overlapping Gaussian kernels with a standard deviation of 10 ms, whose means cover the 0 s to 0.4 s lag range in T_s = 5 ms increments. As a result, considering the simultaneous estimation of the two TRFs, the overall dictionary matrix is G = diag(1, G_0, G_0).
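The construction of the Gaussian dictionary can be sketched as follows; the column normalization is an assumption not specified in the text.

```python
import numpy as np

Ts, Le = 0.005, 80                 # 5 ms increments, 0.4 s TRF length
lags = np.arange(Le + 1) * Ts      # lag axis from 0 to 0.4 s
sigma = 0.010                      # 10 ms kernel standard deviation

# Columns of G0 are Gaussian kernels centered at every lag increment.
G0 = np.exp(-((lags[:, None] - lags[None, :]) ** 2) / (2 * sigma ** 2))
G0 /= G0.sum(axis=0)               # normalize columns (an assumption)

# Overall dictionary: a free constant-mean term plus one block per speaker.
n = 2 * (Le + 1) + 1
G = np.zeros((n, n))
G[0, 0] = 1.0
G[1:Le + 2, 1:Le + 2] = G0
G[Le + 2:, Le + 2:] = G0
```

Estimating the coefficients in this dictionary, then multiplying by G, yields TRFs that are smooth by construction, which is the intended role of the forward-model smoothing kernel.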
We have used the FASTA package (Goldstein et al., 2014) with Nesterov's acceleration method to implement the forward-backward splitting algorithm. All the prior distribution parameters of the state-space models are set similarly to the EEG simulation in the paper, where a_0 = 2.008, b_0 = 0.2016, and the prior parameters of the attended and unattended distributions were tuned based on a separate 15 s sample trial. For the real-time state-space estimator, we have used a sliding window of length 15 s with a fixed forward-lag of 1.5 s, i.e., K_A = \lfloor 15 f_s / W \rfloor and K_F = \lfloor 1.5 f_s / W \rfloor. The sample trial for tuning the distribution parameters can be thought of as an initialization step for the estimator prior to its real-time application. Figure S4 shows the results of our estimation framework. Row A contains the estimated TRFs for the encoding model. The major components of the TRFs are retrieved in the estimates, while the \ell_1-norm penalty in Eq. (S1) has significantly denoised these components as compared with the original noisy versions in row A of Fig. S3. Row B in Fig. S4 displays the extracted magnitudes of the M100 components from the estimated TRFs at each instance. The attention marker in this case is defined as the magnitude of the M100 component, where the M100 component is calculated as the minimum value of the TRF estimate around the 100 ms lag. Notice that there is a significant statistical difference between the extracted M100 components of the attended and unattended speakers in case 1, while the estimated M100 components are highly variable in case 2 and do not show a strong attention modulation effect.

Figure S3. The TRFs \tau^{(1)}_t and \tau^{(2)}_t used for the simulation model in Eq. (S26). A) TRFs for case 1 (strong modulation in M100 components) and case 2 (weak modulation in M100 components). B) Snapshots of the attended and unattended TRFs for the two cases.

Estimation Results
Rows C and D of Fig. S4 show the outputs of the batch-mode and real-time state-space estimators, respectively. In case 1, both the batch-mode and real-time estimators perform well in tracking the attentional state. Note that the sharp drop of the attention probability near ~30 s in row D is due to the fact that, at each instance, the real-time estimator does not observe the attention markers beyond the 1.5 s forward lag, whereas the batch-mode estimator estimates the probabilities given the entire trial. In case 2, the batch-mode estimator performs well even though the M100 components are not visually indicative of the attentional state; however, the classification confidence decreases considerably, especially in the (30, 60] s interval. The real-time estimator in case 2 closely follows the batch-mode estimator, but is more sensitive to the fluctuations of the extracted M100 components. Thus, its performance undergoes further degradation going from case 1 to case 2, as compared with that of the batch-mode estimator. The red arrows in rows C and D of case 2 in Fig. S4 mark instances where the reduced robustness of the real-time estimator results in misclassifications, while the batch-mode estimator classifies the attended speaker correctly.

Figure S4. Estimation results of application to simulated MEG data: A) Estimated TRFs for case 1 (strong modulation in M100 components) and case 2 (weak modulation in M100 components). B) Estimated M100 magnitudes as the attention markers. C) Outputs of the batch-mode estimator as the estimated probability of attending to speaker 1. D) Outputs of the real-time estimator as the estimated probability of attending to speaker 1. The real-time estimator is less robust to the statistical fluctuations in the extracted M100 components, which can result in misclassifications, as shown for two example instances marked by red arrows. However, it follows the general trend of the batch-mode estimator closely, despite having only online access to the data.