Hierarchical Processing for Speech in Human Auditory Cortex and Beyond

The anatomical connectivity of the primate auditory system suggests that sound perception involves several hierarchical stages of analysis (Kaas et al., 1999), raising the question of how the processes required for human speech comprehension might map onto such a system. One intriguing possibility is that earlier areas of auditory cortex respond to acoustic differences in speech stimuli, but that later areas are insensitive to such features. Providing a consistent neural response to speech content despite variation in the acoustic signal is a critical feature of “higher level” speech processing regions because it indicates they respond to categorical speech information, such as phonemes and words, rather than idiosyncratic acoustic tokens. In a recent fMRI study, Okada et al. (2010) used multi-voxel pattern analysis (MVPA) to investigate neural responses to spoken sentences in canonical auditory cortex (i.e., superior temporal cortex), using a design modeled after Scott et al. (2000). Okada et al. (2010) used a factorial design that crossed speech clarity (clear speech vs. intelligible noise vocoded speech) with frequency order (normal vs. spectrally rotated). Noise vocoding reduces the amount of spectral detail in the speech signal but faithfully preserves temporal information. Depending on the reduction in spectral resolution (i.e., the number of bands used in vocoding), noise vocoded speech can be highly intelligible, especially following training. By contrast, spectral rotation of the speech signal renders it almost entirely unintelligible without any change in overall level of spectral detail. Thus, the clear and vocoded sentences used by Okada et al. (2010) provided two physically dissimilar presentations of intelligible speech that the authors could use to identify acoustically insensitive neural responses; spectrally rotated stimuli allowed the authors to look for response changes due to intelligibility, independent of reductions in spectral detail.
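The effect of spectral rotation described above can be illustrated with a toy signal. The following is a minimal sketch, not the stimulus-generation procedure of Okada et al. (2010) or Scott et al. (2000). It assumes an 8 kHz sampling rate (a hypothetical value chosen for the example), in which case flipping the sign of every other sample inverts the spectrum about 2 kHz. The function names `spectral_rotate` and `tone_power` are illustrative only.

```python
import math


def spectral_rotate(samples):
    """Invert the spectrum about fs/4 by flipping the sign of every odd sample.

    Multiplying sample n by (-1)**n moves a component at frequency f
    to fs/2 - f, so low frequencies become high and vice versa.
    """
    return [s if n % 2 == 0 else -s for n, s in enumerate(samples)]


def tone_power(samples, freq_hz, fs_hz):
    """Normalized squared magnitude of a single DFT bin at freq_hz."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * k / fs_hz)
             for k, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * k / fs_hz)
             for k, s in enumerate(samples))
    return (re * re + im * im) / (n * n)


FS_HZ = 8000     # assumed sampling rate, so the rotation axis is 2000 Hz
N_SAMPLES = 800  # 0.1 s; holds a whole number of cycles of both test tones

# Stand-in for a speech signal: a pure 500 Hz tone.
tone = [math.sin(2 * math.pi * 500 * k / FS_HZ) for k in range(N_SAMPLES)]
rotated = spectral_rotate(tone)

# Rotation only flips signs, so the total energy is unchanged.
energy_before = sum(s * s for s in tone)
energy_after = sum(s * s for s in rotated)

print("500 Hz bin before/after: ",
      tone_power(tone, 500, FS_HZ), tone_power(rotated, 500, FS_HZ))
print("3500 Hz bin before/after:",
      tone_power(tone, 3500, FS_HZ), tone_power(rotated, 3500, FS_HZ))
print("energy before/after:     ", energy_before, energy_after)
```

In this sketch the 500 Hz component reappears at 3500 Hz while the total energy stays the same. This mirrors the point made in the text: rotation rearranges spectral content without reducing the overall amount of spectral detail.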

In a standard whole-brain univariate analysis, Okada et al. (2010) found intelligibility-related responses (i.e., intelligible activity > unintelligible activity) in large portions of the superior temporal lobes bilaterally, as well as smaller activations in left inferior frontal gyrus, posterior fusiform gyrus, and premotor cortex. The authors then chose maxima for each participant within anatomically defined regions (bilateral posterior, middle, and anterior superior temporal sulcus [STS], as well as Heschl's gyrus) and performed MVPA to assess the ability of these regions to discriminate among the four acoustic conditions. They found that Heschl's gyrus could reliably distinguish all conditions, despite showing a similar average hemodynamic response in the traditional mass univariate analysis. Regions of the STS showed varying degrees of sensitivity to acoustic information. In the left hemisphere, posterior STS was the most acoustically insensitive, followed by anterior STS and Heschl's gyrus (no reliable middle STS activation was identified). In the right hemisphere, the greatest acoustic insensitivity was observed in middle STS, close to Heschl's gyrus, followed by anterior and posterior STS, respectively. The authors interpret these findings as generally consistent with a hierarchical structure for speech processing in the temporal lobe, with regions of STS in both hemispheres playing a critical role in abstract phonological processes as indicated by their high acoustic insensitivity.
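The observation that a region can distinguish conditions despite a similar average response can be made concrete with simulated data. The sketch below is a toy illustration under stated assumptions and is not the analysis pipeline of Okada et al. (2010). The voxel templates, noise level, and trial counts are invented for the example, and a simple nearest-centroid classifier stands in for whatever classifier a real MVPA study would use.

```python
import random


def simulate_trials(template, n_trials, noise_sd, rng):
    """Draw noisy multi-voxel response patterns around a condition template."""
    return [[v + rng.gauss(0.0, noise_sd) for v in template]
            for _ in range(n_trials)]


def centroid(trials):
    """Average pattern across trials, voxel by voxel."""
    return [sum(col) / len(col) for col in zip(*trials)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(pattern, centroids):
    """Assign the label whose training centroid is closest to the pattern."""
    return min(centroids,
               key=lambda label: squared_distance(pattern, centroids[label]))


rng = random.Random(0)  # fixed seed so the simulation is repeatable
N_VOXELS = 20           # hypothetical region size
N_TRIALS = 20           # hypothetical trials per condition and per split
NOISE_SD = 0.5          # hypothetical trial-to-trial noise

# Two conditions with opposite voxel patterns but the same mean across voxels.
templates = {
    "A": [1.0 if v % 2 == 0 else -1.0 for v in range(N_VOXELS)],
    "B": [-1.0 if v % 2 == 0 else 1.0 for v in range(N_VOXELS)],
}

train = {lab: simulate_trials(t, N_TRIALS, NOISE_SD, rng)
         for lab, t in templates.items()}
test = {lab: simulate_trials(t, N_TRIALS, NOISE_SD, rng)
        for lab, t in templates.items()}

# "Univariate" summary: one number per condition, averaged over voxels and trials.
region_mean = {lab: sum(sum(p) for p in trials) / (N_TRIALS * N_VOXELS)
               for lab, trials in test.items()}

# "Multivariate" summary: classification accuracy on held-out trials.
centroids = {lab: centroid(trials) for lab, trials in train.items()}
n_correct = sum(classify(p, centroids) == lab
                for lab, trials in test.items() for p in trials)
accuracy = n_correct / (2 * N_TRIALS)

print("mean response per condition:", region_mean)
print("held-out classification accuracy:", accuracy)
```

Because the two simulated conditions differ only in which voxels respond, the region-wide mean is nearly identical across conditions while the pattern classifier separates them easily. The same logic explains why a univariate contrast and MVPA can give different answers for one region.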
With these results, Okada et al. (2010) partially replicate previous univariate fMRI results reported by Davis and Johnsrude (2003). Davis and Johnsrude measured neural activity in response to multiple speech conditions that were equally intelligible but differed acoustically. To achieve this, three different forms of speech degradation were employed: noise vocoded speech, speech segmented by noise bursts, and speech in continuous background noise. Within each type of speech degradation there were three matched levels of intelligibility (confirmed by pre-tests and behavioral ratings collected in the scanner). The authors first identified regions sensitive to speech intelligibility by correlating neural activity with behavioral performance, and then examined the degree to which each of these regions was also sensitive to the acoustic form of the stimuli. Activity in regions close to primary auditory cortex depended on the type of degradation, but other intelligibility-responsive regions were insensitive to this acoustic information. These acoustically insensitive areas included regions located anterior to peri-auditory areas bilaterally, posterior to left peri-auditory cortex, and left inferior frontal gyrus. This arrangement is broadly consistent with the anatomical organization of primate auditory cortex and suggests high levels of acoustic insensitivity in both anterior and posterior regions of left superior temporal cortex, in line with the univariate analysis reported by Okada et al. (2010) but in contrast to their multivariate results. These conflicting findings regarding acoustic sensitivity in anterior temporal regions could be a result of either (a) the experimental design and specific stimuli used or (b) differing sensitivity of multivariate and univariate analysis methods, a question that requires further investigation.
Moving beyond the temporal lobe, the results of Davis and Johnsrude (2003) highlight the role of left inferior frontal cortex in speech comprehension. Activity in left inferior frontal cortex is common, although not universal, in neuroimaging studies of connected speech (e.g., Humphries et al., 2001; Davis and Johnsrude, 2003; Crinion and Price, 2005; Rodd et al., 2005, 2010; Wingfield et al., 2006). In primates, regions of prefrontal cortex have extensive anatomical connections to auditory belt and parabelt regions (Romanski et al., 1999) and are thus well positioned to modulate the operation of lower-level auditory areas. Davis and Johnsrude (2003) provided evidence linking this fronto-temporal modulation with the recovery of meaning from an impoverished acoustic signal by showing that inferior frontal responses were elevated for distorted-yet-intelligible speech compared to both clear speech and unintelligible noise. This result suggests that activation of inferior frontal regions is a neural correlate of the more effortful listening that is required for the comprehension of degraded speech.
Indeed, the relationship between intelligibility and listening effort also deserves consideration in interpreting temporal lobe responses to degraded speech. Okada et al. (2010) treated clear and vocoded speech as having similar intelligibility. This may be true in the sense that word report is equivalent (at ceiling); however, clear and vocoded conditions differ substantially in whether this intelligibility is achieved effortlessly (as for clear speech), or with considerable effort (for vocoded speech). In the Okada et al. (2010) study this difference in listening effort cannot be distinguished from sensitivity to acoustic differences between stimuli. One way to control for these effects would be to match (below-ceiling) intelligibility across different types of acoustic degradation. Using this approach, Davis and Johnsrude (2003) observed that both inferior frontal and peri-auditory regions of the STS showed elevated signal for intelligible but degraded speech compared to both clear speech and noise. Like intelligibility-sensitive regions, the areas responding to listening effort demonstrated a hierarchical organization (i.e., differential degrees of acoustic sensitivity), and hence it might be that these two effects are confounded in temporal lobe responses observed by Okada et al. (2010).
On a related topic, we also note that Okada et al. (2010) used continuous fMRI, meaning that the auditory stimuli were presented in the midst of considerable background noise. Although one might assume that any such confounds would apply equally to all conditions tested, in fact, vocoded and clear sentences are not equally intelligible in the presence of background noise, even if word report scores are equivalent (at ceiling) when tested in quiet (Faulkner et al., 2001). Furthermore, even if participants are able to hear the sentences, scanner noise introduces significant additional task components related to segregating the auditory stream of interest. To date, effects of scanner noise have been associated with changes in neural activity using univariate approaches (Seifritz et al., 2006; Gaab et al., 2007; Peelle et al., 2010a); the effect on multivariate results is unknown. Generally, however, the results of auditory fMRI studies employing a standard continuous scanning sequence must be viewed with caution given the additional perceptual processes required.
Despite the caveats discussed above, how might the results of Okada et al. (2010) inform our understanding of speech processing? The authors suggest that their findings are consistent with a hierarchy for intelligible speech processing that starts with Heschl's gyrus, followed by anterior and posterior-going streams that progressively increase in acoustic invariance. In particular, their finding of acoustic sensitivity in anterior temporal regions stands in direct opposition to the original Scott et al. (2000) study and several follow-ups (Narain et al., 2003; Scott et al., 2006), as well as Davis and Johnsrude (2003), which argue that anterior temporal responses are largely acoustically invariant. Because of their focus on canonical regions of auditory cortex, Okada et al. (2010) do not discuss regions outside the superior temporal lobe as part of this hierarchy (Figure 1A). Expanding on this description, we argue that a hierarchical model of speech comprehension necessarily includes regions of motor, premotor, and prefrontal cortex (Figure 1B) as part of multiple parallel processing pathways that radiate outward from primary auditory areas (Davis and Johnsrude, 2007). Electrophysiological studies in non-human primates demonstrate auditory responses in frontal cortex, suggesting not only strong frontal involvement, but that these regions may indeed be viewed as part of the auditory system (Romanski and Goldman-Rakic, 2002). In addition to prefrontal cortex, left posterior inferotemporal cortex is also critically involved in the speech intelligibility network, especially in accessing or integrating semantic representations (Crinion et al., 2003; Rodd et al., 2005). Although anatomical studies in primates have long emphasized the extensive and highly parallel anatomical coupling between auditory and frontal cortices (Seltzer and Pandya, 1989; Hackett et al., 1999; Romanski et al., 1999; Petrides and Pandya, 2009), frontal regions have only recently become a prominent feature of models of speech processing (Hickok and Poeppel, 2007; Rauschecker and Scott, 2009).
Unfortunately, discussions of auditory sentence processing often focus almost exclusively on the importance of superior temporal responses, even when frontal or inferotemporal activity is present (Humphries et al., 2001; Obleser et al., 2008; Okada et al., 2010), resulting in an incomplete picture of the neural mechanisms involved in speech comprehension.
In summary, there is now consensus that hierarchical processing is a key organizational aspect of the human cortical auditory system. The results of Okada et al. (2010) uniquely bring into question the degree to which anterior temporal cortex is acoustically insensitive, suggesting a more posterior locus for abstract phonological processing. Challenges for future studies include placing hierarchical organization in the temporal lobe within the broader context of larger networks for auditory and language processing, and clarifying the functional contribution of different parallel auditory processing pathways to comprehension of spoken language under varying degrees of effort.