Neural Oscillations in Speech: Don't be Enslaved by the Envelope

In a recent “Perspective” article (Giraud and Poeppel, 2012), Giraud and Poeppel lay out in admirable clarity how neural oscillations and, in particular, nested oscillations at different time scales, might enable the human brain to understand speech. They provide compelling evidence for “enslaving” of ongoing neural oscillations by slow fluctuations in the amplitude envelope of the speech signal, and propose potential mechanisms for how slow theta and faster gamma oscillatory networks might work together to enable a concerted neural coding of speech. This model is unparalleled in its fruitful incorporation of state-of-the-art computational models and neurophysiology (e.g., the intriguing pyramidal–interneuron gamma loops, PING – which will unfortunately not be observable in healthy, speech-processing humans within the near future). The authors propose a scenario focused on theta and gamma, where problems in speech comprehension are sorted out if (and only if) the brain syncs well enough to the amplitude fluctuations of the incoming signal.

However, while we enjoy the "perspective" Giraud and Poeppel (2012) are offering, it seems to oversimplify the available evidence in at least three key respects: First, how "slow" is a slow neural oscillation? Although it might be troublesome to reliably record fast, local gamma oscillations outside the skull, we can do so with satisfying precision in the lower-frequency ranges. So, why not allow the model to gain specificity and, accordingly, be specific about the ranges in which effects were observed? Giraud and Poeppel report the range of rates in which amplitude fluctuations in speech occur as 4-7 Hz (p. 511), 1-5 Hz (Figure 2), 5-10 Hz (p. 514, Figure 5), and <10 Hz (p. 514). Moreover, neural "theta" is defined as 1-8 Hz (Figure 1), 4-8 Hz (p. 511), 2-6 Hz (Figure 6), and 8.33 Hz (120 ms, p. 514). Also, they show the most focal coupling of gamma power with the phase of an 8-Hz oscillation, which is textbook alpha. The trouble is that, if we cut loosely across the boundaries between delta and theta or theta and alpha, we might overlook important functional differentiations between these frequency bands. On the delta-theta end, it has been demonstrated that delta (here: 1.4 Hz) phase covaries with theta (here: 7.8 Hz) oscillatory power in macaque auditory cortex (Lakatos et al., 2005), at least implying that theta oscillations themselves are slaves to lower-frequency masters. On the theta-alpha end, auditory evoked perturbations hint at an intimate, but antagonistic relationship of neural theta and alpha. Independent of the ongoing debate regarding whether the evoked potential reflects an additive brain response or a phase reset of ongoing neural oscillations (for review, see Sauseng et al., 2007), time-frequency representations of auditory evoked brain activity are typically characterized by initially strong phase alignment (i.e., increased phase coherence across trials) that spans theta as well as alpha frequencies.
This is often followed by a dissociation: alpha (>8 Hz) steeply decreases in power, while theta (<7 Hz) power remains high (e.g., Shahin et al., 2009).
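The point about loosely drawn band boundaries can be made concrete with a little arithmetic. The sketch below takes the four "theta" definitions quoted above and checks which frequency bands each one overlaps. The band edges (delta 1-4 Hz, theta 4-8 Hz, alpha 8-13 Hz) are an assumption for illustration only; exact boundaries differ between laboratories, and the function name `bands_touched` is hypothetical.

```python
# Band edges in Hz. ASSUMPTION: conventional textbook values, treated as
# half-open intervals [low, high); they are not taken from the article.
BANDS = {"delta": (1.0, 4.0), "theta": (4.0, 8.0), "alpha": (8.0, 13.0)}

# Ranges labelled "theta" as quoted in the text above (low, high) in Hz.
# The 120-ms period is converted to a single frequency of about 8.33 Hz.
REPORTED_THETA = {
    "Figure 1": (1.0, 8.0),
    "p. 511": (4.0, 8.0),
    "Figure 6": (2.0, 6.0),
    "p. 514 (120 ms)": (1000.0 / 120.0, 1000.0 / 120.0),
}


def bands_touched(lo, hi):
    """Return the names of all bands that the range lo..hi overlaps."""
    touched = []
    for name, (b_lo, b_hi) in BANDS.items():
        if lo == hi:
            # A single frequency belongs to the band that contains it.
            inside = b_lo <= lo < b_hi
        else:
            # Two ranges overlap if each starts before the other ends.
            inside = lo < b_hi and hi > b_lo
        if inside:
            touched.append(name)
    return touched


for source, (lo, hi) in REPORTED_THETA.items():
    print(f"{source}: {lo:.2f}-{hi:.2f} Hz -> {bands_touched(lo, hi)}")
```

Under these assumed edges, only one of the four definitions stays within theta; two also reach into delta, and the 120-ms period falls in alpha.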
To sum up this point, Giraud and Poeppel (2012, p. 511) argue for a "principled relation between the time scales present in speech and the time constants underlying neuronal cortical oscillations," but what if the time scales present in speech cross functional boundaries between oscillatory bands in the human brain? Put simply, if delta vs. theta bands, or theta vs. alpha bands, do subserve discontinuous, separable processing modes in the auditory and speech-processing domain, then further speaking of "slow neural oscillations" will hinder rather than benefit our understanding. Recently, we observed a negative correlation of alpha and theta power in response to speech, and it was the peri- and post-stimulation alpha suppression that best indexed speech comprehension (Obleser and Weisz, 2012). Note that in this study, effects were attained with an intelligibility manipulation that relied on spectral changes only; envelope changes were less effective in modulating alpha suppression, and did not affect theta power at all.
This leads us to our next point: an over-emphasis on the speech envelope. Amplitude envelope and syllable rate are currently very much emphasized in the speech and vocalizations literature (e.g., Luo and Poeppel, 2007; Chandrasekaran et al., 2009; Ghitza and Greenberg, 2009), likely because (a) they are easily quantified, and (b) as outlined above, we are best at measuring relatively low-frequency brain oscillations. Hence, it is tempting to focus on these slow envelope fluctuations. However, the speech envelope is readily obscured in noisy backgrounds and reverberant environments (Houtgast and Steeneken, 1985), and intact spectral content can be used by the listener to at least partially compensate for degraded temporal envelope information (Sheft et al., 2008). Indeed, although the temporal envelope of speech has been shown to be very important for comprehension (e.g., Drullman et al., 1994a,b), there is good evidence that the spectral content of the speech signal is at least as decisive for speech intelligibility (if not more so; Xu et al., 2005; Lorenzi et al., 2006; Luo and Poeppel, 2007; Obleser et al., 2008; Obleser and Weisz, 2012; Scott and McGettigan, 2012). Moreover, it has recently been suggested that the temporal envelope and spectral content of natural speech (or conspecific vocalizations in non-human animals) are non-independent, and that speech comprehension performance is in fact best predicted from the presence of a "core" spectrotemporal modulation region in the modulation transfer function of a stimulus (Elliott and Theunissen, 2009). This view is supported by observations of single neurons or populations of neurons with receptive fields matching the spectrotemporal modulation transfer function of natural sounds (i.e., speech, conspecific vocalizations) in songbirds, marmosets, and humans (Nagarajan et al., 2002; Mesgarani and Chang, 2012).
In addition, we have ample evidence that slow brain oscillations become phase-locked to slow spectral regularities in an auditory signal, even in the absence of amplitude envelope fluctuations (Figure 1). Using simple non-speech stimuli without any envelope profile whatsoever, we find that spectral regularities in the 3-Hz range effectively entrain neural delta oscillations. Although a number of neurophysiological experiments have shown similarities between the neural encoding of frequency- and amplitude-modulation, suggesting the possibility of shared neural mechanisms (Gaese and Ostwald, 1995; Liang et al., 2002; Hart et al., 2003), the point we make here simply concerns the relative scientific inattention to slow spectral fluctuations as a mechanism for entrainment of low-frequency neural oscillations to speech.
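To illustrate what "without any envelope profile" means, the following minimal sketch synthesizes a frequency-modulated (FM) tone and an amplitude-modulated (AM) tone, both modulated at 3 Hz, and compares how much their short-time level varies. It is not the stimulus-generation code behind Figure 1: the sampling rate, carrier frequency, FM depth, window length, and all function names are assumptions chosen for illustration, and short-time RMS stands in for a proper envelope estimate.

```python
import math
import statistics

FS = 8000          # sampling rate in Hz (assumed)
DUR = 1.0          # duration in s (assumed); holds three full 3-Hz cycles
CARRIER = 1000.0   # carrier frequency in Hz (assumed)
MOD_RATE = 3.0     # modulation rate in Hz, matching the 3-Hz range in the text
FM_DEPTH = 200.0   # peak frequency deviation in Hz (assumed)


def fm_tone():
    """Sinusoidal FM: frequency swings around CARRIER, amplitude stays at 1."""
    out = []
    for n in range(int(FS * DUR)):
        t = n / FS
        # Phase is the time integral of
        # 2*pi*(CARRIER + FM_DEPTH * sin(2*pi*MOD_RATE*t)).
        phase = (2 * math.pi * CARRIER * t
                 - (FM_DEPTH / MOD_RATE) * math.cos(2 * math.pi * MOD_RATE * t))
        out.append(math.sin(phase))
    return out


def am_tone():
    """Sinusoidal AM: fixed carrier, envelope cycles between 0 and 1."""
    out = []
    for n in range(int(FS * DUR)):
        t = n / FS
        envelope = 0.5 * (1 + math.sin(2 * math.pi * MOD_RATE * t))
        out.append(envelope * math.sin(2 * math.pi * CARRIER * t))
    return out


def windowed_rms(signal, win_s=0.02):
    """RMS level in consecutive non-overlapping windows of win_s seconds."""
    win = int(FS * win_s)
    return [math.sqrt(sum(x * x for x in signal[i:i + win]) / win)
            for i in range(0, len(signal) - win + 1, win)]


def envelope_variation(signal):
    """Coefficient of variation of short-time RMS (near 0 = flat envelope)."""
    rms = windowed_rms(signal)
    return statistics.pstdev(rms) / statistics.mean(rms)


cv_fm = envelope_variation(fm_tone())
cv_am = envelope_variation(am_tone())
print(f"FM level variation: {cv_fm:.3f}   AM level variation: {cv_am:.3f}")
```

The FM tone carries a clear 3-Hz regularity in frequency while its level stays essentially constant, whereas the AM tone's level rises and falls with the modulator.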
Finally, Peelle et al. have recently demonstrated that the goodness of phase-locking to speech is influenced by non-envelope "bottom-up" spectral content and "top-down" linguistic information (Peelle et al., 2012); better phase-locking was associated with the presence of linguistic information in stimuli that were identical in terms of amplitude envelope characteristics. Thus, envelope information alone can predict neither the intelligibility of speech (Nourski et al., 2009; Obleser and Weisz, 2012) nor the goodness of phase-locking to the speech signal (but see Howard and Poeppel, 2010). In contrast to Giraud and Poeppel's (2012) strong focus on entrainment by the amplitude envelope as the vehicle for speech comprehension, we therefore want to emphasize that neural entrainment and speech comprehension are likely to be multi-causal in nature.
Overriding and underlying the first two points is a chicken-and-egg problem. Giraud and Poeppel (2012) - quite explicitly -
Figure 1. (A) FM stimuli. The modulation spectrum shows frequency (y axis; 200-1800 Hz, scaled linearly) and amplitude variations (color scaling) as a function of time (x axis). Note that there are no systematic variations in amplitude envelope to which brain rhythms could entrain. (B) AM stimuli. The amplitude envelope fluctuation is periodic (also visible in the color fluctuation in the modulation spectrum, scaled the same as in (A)), and the rate falls into the range observed in natural speech. (C) EEG brain response to FM. Inter-trial phase coherence (calculated from the complex output of wavelet convolution) and power (derived from FFT) quantified the degree of entrainment. For FM stimuli, peaks in both phase coherence (p = 0.03) and power (p = 0.006) were observed at 3 Hz (delta) and at the 6-Hz harmonic (p = 0.03 and p = 0.001, resp.; Picton et al., 2003). (D) EEG brain response to AM. A single peak in phase coherence and power was observed at 3 Hz (both p = 0.03).
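For readers unfamiliar with the entrainment measure named in the Figure 1 caption, inter-trial phase coherence can be sketched as the length of the average unit phase vector across trials: 1 if every trial has the same phase at a given frequency, near 0 if phases are random. The example below is a simplified illustration on simulated data, not the analysis behind Figure 1. In particular, it estimates phase by projecting each trial onto a single complex sinusoid rather than by wavelet convolution, and the trial count, noise level, jitter, and function names are all assumptions.

```python
import cmath
import math
import random


def phase_at(trial, freq, fs):
    """Phase of `trial` at `freq` (Hz), from projection onto a complex sinusoid.

    ASSUMPTION: this single-frequency projection replaces the wavelet
    convolution mentioned in the caption; it yields one phase per trial.
    """
    coeff = sum(x * cmath.exp(-2j * math.pi * freq * n / fs)
                for n, x in enumerate(trial))
    return cmath.phase(coeff)


def itpc(phases):
    """Length of the mean unit phase vector across trials (0 to 1)."""
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))


def simulate_trials(n_trials, freq, fs, dur, phase_locked, rng):
    """Simulated trials: a sinusoid at `freq` plus Gaussian noise."""
    trials = []
    for _ in range(n_trials):
        # Entrained trials share one phase (with small jitter);
        # non-entrained trials get a uniformly random phase.
        if phase_locked:
            phi = rng.gauss(0.0, 0.3)
        else:
            phi = rng.uniform(-math.pi, math.pi)
        trials.append([math.sin(2 * math.pi * freq * n / fs + phi)
                       + rng.gauss(0.0, 0.5)
                       for n in range(int(fs * dur))])
    return trials


rng = random.Random(0)        # fixed seed so the simulation is repeatable
FREQ, FS, DUR = 3.0, 200, 2.0  # 3-Hz rhythm; sampling rate and duration assumed

locked = simulate_trials(40, FREQ, FS, DUR, True, rng)
unlocked = simulate_trials(40, FREQ, FS, DUR, False, rng)

itpc_locked = itpc([phase_at(tr, FREQ, FS) for tr in locked])
itpc_unlocked = itpc([phase_at(tr, FREQ, FS) for tr in unlocked])
print(f"ITPC entrained: {itpc_locked:.2f}   not entrained: {itpc_unlocked:.2f}")
```

Because the measure uses only phase, it is insensitive to the amplitude of the response, which is why the caption reports phase coherence and power separately.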