A Generative Model of Speech Production in Broca’s and Wernicke’s Areas

Price, Cathy J; Crinion, Jenny; MacSweeney, Mairéad

doi:10.3389/fpsyg.2011.00237

ORIGINAL RESEARCH article

Front. Psychol., 16 September 2011

Sec. Psychology of Language

volume 2 - 2011 | https://doi.org/10.3389/fpsyg.2011.00237

Cathy J. Price ¹*
Jenny T. Crinion ²
Mairéad MacSweeney ²

1. Wellcome Trust Centre for Neuroimaging, University College London, London, UK
2. Institute of Cognitive Neuroscience, University College London, London, UK

Abstract

Speech production involves the generation of an auditory signal from the articulators and vocal tract. When the intended auditory signal does not match the produced sounds, subsequent articulatory commands can be adjusted to reduce the difference between the intended and produced sounds. This requires an internal model of the intended speech output that can be compared to the produced speech. The aim of this functional imaging study was to identify brain activation related to the internal model of speech production after activation related to vocalization, auditory feedback, and movement in the articulators had been controlled. There were four conditions: silent articulation of speech, non-speech mouth movements, finger tapping, and visual fixation. In the speech conditions, participants produced the mouth movements associated with the words “one” and “three.” We eliminated auditory feedback from the spoken output by instructing participants to articulate these words without producing any sound. The non-speech mouth movement conditions involved lip pursing and tongue protrusions to control for movement in the articulators. The main difference between our speech and non-speech mouth movement conditions is that prior experience producing speech sounds leads to the automatic and covert generation of auditory and phonological associations that may play a role in predicting auditory feedback. We found that, relative to non-speech mouth movements, silent speech activated Broca’s area in the left dorsal pars opercularis and Wernicke’s area in the left posterior superior temporal sulcus. We discuss these results in the context of a generative model of speech production and propose that Broca’s and Wernicke’s areas may be involved in predicting the speech output that follows articulation. These predictions could provide a mechanism by which rapid movement of the articulators is precisely matched to the intended speech outputs during future articulations.

Introduction

Speech production is a complex multistage process that converts conceptual ideas into acoustic signals that can be understood by others. The stages include conceptualization of the intended message; word retrieval; selection of the appropriate morphological forms; sequencing of phonemes, syllables, and words; phonetic encoding of the articulatory plans; initiation and coordination of sequences of movements in the tongue, lips, and laryngeal muscles that vibrate the vocal tract; and the control of respiration for vowel phonation and prosody. In addition to this feedforward sequence, auditory and somatosensory processing of the spoken output is fed back to the motor system for online correction of laryngeal and articulatory movements (Levelt et al., 1999; Guenther et al., 2006). This self-monitoring process is thought to be essential for learning to speak in a first (native) or second language but also plays a role in adult/fluent speech production, particularly when the auditory feedback is distorted. The sensorimotor interactions involved in monitoring the spoken response require an internal model of the intended speech to which the output can be matched (Borden, 1979; Paus et al., 1996; Heinks-Maldonado, 2005). The aim of the current study was to identify brain responses related to the internal model and to consider how these responses might predict auditory output prior to auditory or sensorimotor feedback.

The concept of internal models that predict the sensory consequences of an action is not specific to speech production. In the motor system, internal models that finesse motor control are referred to as “forward models” (Miall, 1993; Wolpert et al., 1995). More generally, forward models are examples of generative models that the brain may use for both perception (Helmholtz, 1866; MacKay, 1956; Gregory, 1980; Ballard et al., 1983; Friston, 2001, 2005) and active inference (Friston, 2010). The underlying principle of a generative model of brain function is that higher-level systems predict the inputs to lower-levels; and the resulting prediction error is then used to optimize future predictions – a scheme known as predictive coding.
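The predictive coding principle described above can be illustrated with a toy example. The sketch below is not taken from any of the cited models: it assumes a single scalar prediction, a fixed learning rate, and a simple error-driven update, all of which are hypothetical simplifications chosen only to show how a prediction error can be used to improve the next prediction.

```python
def predictive_coding(observations, learning_rate=0.2, initial_prediction=0.0):
    """Track a signal by repeatedly correcting a prediction with its own error.

    `learning_rate` and `initial_prediction` are illustrative assumptions,
    not parameters from the study.
    """
    prediction = initial_prediction
    errors = []
    for observed in observations:
        # Lower level: mismatch between the sensory input and the prediction.
        error = observed - prediction
        errors.append(error)
        # Higher level: the error is fed back to the source of the prediction.
        prediction += learning_rate * error
    return prediction, errors


# A constant "sensory input" of 1.0 repeated over 30 trials.
final_prediction, errors = predictive_coding([1.0] * 30)
print("final prediction:", final_prediction)
print("first and last error:", errors[0], errors[-1])
```

With a constant input, the prediction error shrinks on every trial, which is the sense in which prediction error is said to optimize future predictions.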

Recent accounts of forward models in speech production have varied in how the auditory predictions and feedback are implemented (see Figure 1). For example, in the model proposed by Tian and Poeppel (2010), a motor efference copy is generated during motor planning and fed into a forward model of motor processing which in turn feeds into a second forward model of sensory (auditory) processing (see first panel in Figure 1). This perspective differs from that proposed by Guenther et al. (2006), in which auditory and sensorimotor predictions are generated in parallel with motor commands rather than in series as in the Tian and Poeppel (2010) model (see second panel in Figure 1). The parallel processing of predictions and motor commands in Guenther et al. (2006) is more consistent with predictive coding accounts in generative models of active inference (Friston, 2010; Friston et al., 2010) where higher-level representations (i.e., prior knowledge of movements and their associations) drive the motor commands and predict the sensory responses in parallel (see third panel in Figure 1). However, in the Guenther et al. (2006) model, mismatches between the sensory response and the predicted sensory response (i.e., the prediction errors) are fed back to the motor system. This differs from the predictive coding in generative models (third panel of Figure 1) where the prediction errors are fed back to the source of the predictions (i.e., the high-level representations) in order to optimize future predictions and minimize future prediction error. In addition, predictions in generative models are propagated in a hierarchical fashion through the system. For example, the third panel of Figure 1 shows that higher-level representations predict phonological associations of words and phonological processing predicts acoustic associations of words, with potentially many intervening stages that are not illustrated.

Figure 1

Although the importance of a forward model of speech output during articulation is well recognized (Heinks-Maldonado, 2005; Christoffels et al., 2007; Hawco, 2009), no previous functional imaging study has attempted to identify the anatomical location of brain activation related to the forward model of speech output during articulation. This requires an experimental paradigm that activates speech production but controls for processing related to (a) auditory feedback and (b) movement of the articulators. Instead, previous functional imaging studies that have investigated the self-monitoring of speech have primarily focused on activation related to auditory feedback rather than auditory predictions. This has involved altering rather than eliminating the auditory feedback (Paus et al., 1996; Hashimoto and Sakai, 2003; Ford et al., 2005; Fu et al., 2006; Christoffels et al., 2007; Toyomura et al., 2007; Tourville et al., 2008; Takaso et al., 2010). The results have highlighted activation changes in the superior temporal gyri but do not distinguish activation related to predicting speech from activation related to changes in auditory feedback. In contrast to this prior work, our study used a speech task that did not involve the generation of sound or auditory feedback because our aim was to identify brain activation that might be related to the internal model that predicts speech output during articulation.

To isolate brain activation associated with the internal model of speech output, we compared the production of speech to the production of non-speech mouth movements. The key difference between these conditions is that articulation of speech typically results in auditory speech processing whereas the production of non-speech mouth movements is not associated with auditory speech, although there may be some degree of acoustic association. In the speech condition, participants repeatedly articulated the words “one” and “three” without generating any sounds. This task places minimal demands on conceptualization of the intended message, word retrieval, the selection of the appropriate morphological forms, sequencing, respiration control, prosody, and auditory processing of the spoken output. However, silent articulation of words does not eliminate the experience of previously learnt auditory associations that have been tightly coupled with movement in the articulators during speech production (i.e., we have auditory imagery of the words “one” or “three” as they are silently articulated). These auditory images of speech may play a role in predicting the auditory consequences of speech production (Tian and Poeppel, 2010).

The words “one” and “three” were chosen because they have very distinct muscle movements that could be approximately matched in the non-speech mouth movement condition. Articulating “one” primarily involves lip pursing whereas articulating “three” primarily involves tongue protrusion and retraction. In the non-speech mouth movement condition, participants either pursed their lips (in a kissing action), protruded and retracted their tongue, or alternated between these movements. By including three different levels of non-speech mouth movements (lips repeatedly, tongue repeatedly, lips alternating with tongue), we were able to compare activation for different types of articulators (lips versus tongue) and also manipulate the complexity of the movements. For example, we were able to check whether increased activation for speech compared to non-speech was observed in areas where activation increased with the complexity of the movements (i.e., for alternating between different movements compared to repeatedly making the same movement).

Having controlled for auditory feedback and movement of the articulators, we predicted that activation related to the forward/generative model of auditory processing during speech production would be observed in the left ventral premotor cortex and/or the superior temporal gyrus/sulcus. These predictions are made on the basis of prior proposals by Guenther et al. (2006) who link the internal model of speech sound maps to the ventral premotor cortex; and Hickok et al. (2011) who link an internal model of motor processing to the premotor cortex; an internal model of auditory processing to the superior temporal gyrus/sulcus; and the translation between auditory and motor processing to the area they refer to as Spt (in the Sylvian fissure between the planum temporale (PT) and ventral supramarginal gyrus).

In addition to dissociating brain activation for speech and non-speech mouth movements, we also looked for activation that was common to both speech and non-speech mouth movements relative to finger tapping and visual fixation. Previous imaging studies have distinguished different systems involved in the motor control of speech: An articulatory “preparatory loop” that includes the inferior frontal, anterior insula, supplementary motor area, and superior cerebellum; an executive loop including the motor cortex, thalamus, putamen, caudate, and inferior cerebellum (Riecker et al., 2005) and a feedback loop including the postcentral gyri, the supratemporal plane, and the superior temporal gyri (Dhanjal et al., 2008; Peschke et al., 2009). The involvement of these regions in non-speech as well as speech mouth movements has already been demonstrated. For example, Chang et al. (2009) compared speech to non-speech orofacial movements and vocal tract gestures (whistle, cry, sigh, cough) and found common activations in the inferior frontal gyrus, the ventral premotor cortex, the supplementary motor area, the superior temporal gyrus, the insula, the supramarginal gyrus, the cerebellum, and the basal ganglia. This suggests a general role for these regions in orofacial movements and their auditory consequences.

By including a visual fixation baseline, we could also identify activation that was common to both finger and mouth movements; and control for inner speech that occurs independently of mouth movements during free thought.

Materials and Methods

Functional imaging data were acquired using positron emission tomography (PET). For the current study of speech production there are two advantages of using PET rather than fMRI: the PET scanning environment is quieter for recording the presence or absence of speech output; and the regional cerebral blood flow (rCBF) signals are not distorted by air flow through the articulators. The study was approved by the local hospital ethics committee.

Participants

We scanned 12 right-handed, native English speakers who had normal or corrected vision and hearing and no history of neurological disease or mental illness. All gave written informed consent. One participant was subsequently excluded for reasons given below. The remaining 11 subjects (10 male) had a mean age of 34 years (range 19–68). The predominance of male participants is a consequence of using PET scanning, which is not appropriate for women of child-bearing age. Our results did not change when the one female was removed (n = 10; mean age = 32 years, age range = 19–52); we therefore did not exclude the female participant. Inter-subject variability in our results was investigated and reported (see Figure 2) to ensure consistency across participants, despite the wide range of ages and unequal distribution of males and females.

Figure 2

Table 1

Location of speech activations	Speech > mouth and fingers					Speech > mouth
	Co-ordinates (x,y,z in MNI)			Z score	k	Z score	k
Left posterior superior temporal sulcus	−52	−38	2	5.2	103	4.5	110
Left dorsal pars opercularis	−50	20	30	4.9	153	4.3	153
	Speech and mouth > fingers					Tongue > lips
Left pre-central gyrus	−54	6	6	5.8	1658
	−58	−2	14	7.4
	−62	−6	26	7.5		4.5	139
	−48	−12	32	8.0
Right pre-central gyrus	58	−4	10	6.0	1632
	64	−6	26	8.1		5.8	234
	58	−8	30	8.0
	Speech, mouth and fingers > fixation
Left pre/post-central gyrus	−56	−12	16	6.7	1212
	−58	−8	30	8.0
	−48	−20	36	7.4
	−46	−12	42	7.4
Right post-central gyrus	+64	−14	30	5.6	576
Left posterior cerebellum	−16	−60	−24	6.9	513
Right posterior cerebellum	+28	−62	−24	7.7	903
Left anterior insula	−38	2	+4	5.5	67
Left planum temporale	−46	−38	+14	5.7	53

Location of activation for speech relative to non-speech mouth movements and finger movement; and for all movement tasks relative to fixation; at peaks that were significant at p < 0.05 after correction for multiple comparisons across the whole brain in height (Z > 4.7) or extent (k > 90 voxels).

k = number of voxels significant at p < 0.001 uncorrected.
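The table note can be read as a two-part rule: a peak is reported if it survives correction in height or in extent. The sketch below applies that reading to a few rows copied from Table 1. It is a simplification under stated assumptions: the function name and the treatment of the two criteria as independent cut-offs are illustrative, and the extent cut-off is taken to be a cluster size of more than 90 voxels.

```python
HEIGHT_Z = 4.7        # height threshold from the table note
EXTENT_VOXELS = 90    # extent threshold, read here as a cluster size in voxels


def criteria_met(z_score, cluster_size):
    """Return which of the two corrected criteria a peak satisfies."""
    met = []
    if z_score > HEIGHT_Z:
        met.append("height")
    if cluster_size > EXTENT_VOXELS:
        met.append("extent")
    return met


# (label, Z score, k) copied from Table 1.
peaks = [
    ("Left pSTS, speech > mouth and fingers", 5.2, 103),
    ("Left pSTS, speech > mouth", 4.5, 110),
    ("Left dorsal pars opercularis, speech > mouth", 4.3, 153),
    ("Left anterior insula, all movement > fixation", 5.5, 67),
    ("Left planum temporale, all movement > fixation", 5.7, 53),
]

for label, z_score, cluster_size in peaks:
    print(label, "->", criteria_met(z_score, cluster_size))
```

On this reading, the speech > mouth effects are significant in extent only, whereas the anterior insula and planum temporale peaks are significant in height only.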

Paradigm

There were four conditions: silent speech, non-speech mouth movements, finger tapping, and visual fixation. Each condition was repeated in three different blocks (with one block equivalent to one 90 s PET scan). In all 12 scans, a black circle, presented every 750 ms, was used as an external stimulus to pace movement production. During the three speech scans, participants were instructed to articulate the word “one” or “three” in time with the stimulus. They were specifically instructed to move their mouths as if they were speaking but without generating any sound (i.e., silent mouth movements). In one of the three speech scans, they articulated the word “one” on every trial; in a second, they articulated the word “three” on every trial; and in the third, they alternated the articulation of “one” and “three,” with one speech utterance per stimulus. In the three non-speech mouth movement scans, participants pursed their lips in time with the stimulus, protruded and retracted their tongue, or alternated between pursing their lips and protruding and retracting their tongue. In the three finger tapping scans, participants made a two-finger movement in one scan, a three-finger movement in another scan, and alternated between the two-finger movement and three-finger movement in the third scan. The two-finger movement involved a tap of their index finger followed by a tap of their middle finger on a table placed under their arm in the scanner. The three-finger movement involved a tap of their index finger followed by a tap of their middle finger followed by a tap of their fourth finger. Participants practiced these movements before the scan, and they were referred to as “double drum” and “triple drum,” respectively. In the three baseline scans, participants were instructed “Please look at the flashing dot and try to empty your mind.”
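The structure of the paradigm can be laid out in a few lines. The sketch below enumerates the 12 scans and works out how many pacing stimuli fit into one scan. It assumes that the pacing stimulus ran for the full 90 s of each scan, which the text does not state explicitly, and the variant labels are illustrative names rather than the authors' terms.

```python
SCAN_DURATION_S = 90.0       # one block = one 90 s PET scan (from the text)
STIMULUS_INTERVAL_S = 0.75   # black circle presented every 750 ms (from the text)

# Four conditions, each repeated in three scans. Variant labels are illustrative.
CONDITIONS = {
    "silent speech": ["one", "three", "one/three alternating"],
    "non-speech mouth": ["lips", "tongue", "lips/tongue alternating"],
    "finger tapping": ["two-finger", "three-finger", "two/three alternating"],
    "visual fixation": ["fixation", "fixation", "fixation"],
}


def stimuli_per_scan(duration_s=SCAN_DURATION_S, interval_s=STIMULUS_INTERVAL_S):
    """Number of pacing stimuli, assuming they fill the whole scan."""
    return round(duration_s / interval_s)


scans = [
    (condition, variant)
    for condition, variants in CONDITIONS.items()
    for variant in variants
]

print("scans per participant:", len(scans))
print("pacing stimuli per scan:", stimuli_per_scan())
```

Under that assumption, each scan contains 120 paced movements (90 s divided by 0.75 s).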

All responses during all conditions were video recorded to ensure that the data collected were consistent with the experimental aims (e.g., mouth movements without sound during the speech condition). A scan/condition was repeated if the participant did not follow the instructions correctly. This only happened once for three different participants and in each case the repeated scan replaced the faulty scan. One subject (20-year-old male) did not follow the instructions in two different scans and was therefore excluded from the final analyses (n = 11). There was no further behavioral analysis because, in the final data sets, each condition was accurately performed (i.e., error free). Moreover, the functional imaging data showed no activation in the primary auditory cortex during any condition. This is consistent with the participants performing all conditions silently.

Data acquisition

Functional activation images were acquired using a SIEMENS/CPS ECAT EXACT HR+ (model 962) PET scanner (Siemens/CTI, Knoxville, TN, USA). Each participant had 12 or 13 PET scans (see previous section), to measure rCBF using bolus infusion of radioactively labeled water (H₂¹⁵O). The dose received was 9 mCi per measurement, as approved by the UK Administration of Radioactive Substances Advisory Committee (ARSAC). Using statistical parametric mapping (SPM99), scans from each subject were realigned using the first as a reference, transformed into a standard MNI space (Ashburner and Friston, 1997) and smoothed with a Gaussian kernel of 8 mm FWHM. Structural MRI images for each subject were obtained for coregistration with the PET data.
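A smoothing kernel specified by its full width at half maximum (FWHM) can be converted to a Gaussian standard deviation with the standard relation sigma = FWHM / (2 * sqrt(2 * ln 2)). The sketch below applies this to the 8 mm kernel reported above. It is a generic illustration of the conversion, not a reproduction of the SPM99 smoothing routine, and the function names are hypothetical.

```python
import math


def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian FWHM (mm) to its standard deviation (mm)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def gaussian(x_mm, sigma_mm):
    """Unnormalized Gaussian profile, equal to 1.0 at the center."""
    return math.exp(-(x_mm ** 2) / (2.0 * sigma_mm ** 2))


sigma = fwhm_to_sigma(8.0)
print("sigma for an 8 mm FWHM kernel (mm):", sigma)

# Sanity check: at half the FWHM (4 mm from the center) the profile is at half height.
print("height at 4 mm:", gaussian(4.0, sigma))
```

For an 8 mm FWHM kernel the standard deviation is roughly 3.4 mm.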

Statistical analysis

Statistical analysis used standardized procedures (Friston et al., 1995). This involved ANCOVA with subject effects modeled and global activity included as a subject-specific covariate. The condition and subject effects were estimated according to the general linear model at each voxel. The statistical model included 10 conditions: fixation (summed over three scans), the three finger tapping conditions, the three non-speech mouth movement conditions, and the three speech conditions. The statistical contrasts of interest identified activation that was greater for (1) all speech than all non-speech mouth and finger conditions; (2) all speech than all non-speech mouth movements; (3) all speech and all non-speech mouth movements relative to all finger movements; (4) all movement conditions relative to fixation; (5) non-speech tongue movements relative to non-speech lip movements or vice versa; and (6) alternating between movements of the same type (e.g., non-speech lip/tongue/lip) versus repetition of the same movement (e.g., non-speech tongue/tongue/tongue). The statistical threshold was set at p < 0.05 after family-wise error (FWE) correction for multiple comparisons across the whole brain in height or extent. To ensure that activation in contrast (4) reflected common activation for all types of movement, we used the inclusive masking option in SPM to exclude voxels that were not significantly activated (at p < 0.001 uncorrected) by (7) speech > fixation, (8) non-speech mouth > fixation, and (9) finger movements > fixation. As the inclusive masking removes voxels from activation maps that are highly significant (p < 0.05 corrected), it makes the results more conservative rather than less.
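The contrasts listed above can be written as weight vectors over the 10 modeled conditions. The sketch below is one plausible encoding, not the authors' SPM specification: the condition names, their order, and the choice to average within each side of a contrast are all assumptions. It also shows inclusive masking as a voxel-wise logical AND, which is why masking can only remove voxels from a map.

```python
# Hypothetical names and ordering for the 10 modeled conditions.
CONDITIONS = [
    "fixation",
    "finger_two", "finger_three", "finger_alternating",
    "mouth_lips", "mouth_tongue", "mouth_alternating",
    "speech_one", "speech_three", "speech_alternating",
]


def contrast(positive, negative):
    """Weights: mean of `positive` conditions minus mean of `negative` ones."""
    weights = dict.fromkeys(CONDITIONS, 0.0)
    for name in positive:
        weights[name] = 1.0 / len(positive)
    for name in negative:
        weights[name] = -1.0 / len(negative)
    return [weights[name] for name in CONDITIONS]


def with_prefix(prefix):
    return [name for name in CONDITIONS if name.startswith(prefix)]


speech, mouth, finger = with_prefix("speech"), with_prefix("mouth"), with_prefix("finger")

contrasts = {
    "speech > mouth and fingers": contrast(speech, mouth + finger),
    "speech > mouth": contrast(speech, mouth),
    "speech and mouth > fingers": contrast(speech + mouth, finger),
    "all movement > fixation": contrast(speech + mouth + finger, ["fixation"]),
    "tongue > lips": contrast(["mouth_tongue"], ["mouth_lips"]),
    "alternating > repeated (mouth)": contrast(
        ["mouth_alternating"], ["mouth_lips", "mouth_tongue"]
    ),
}


def effect(weights, condition_means):
    """Contrast value for one voxel, given its estimated condition means."""
    return sum(w * m for w, m in zip(weights, condition_means))


def inclusive_mask(candidate, *masks):
    """Keep a voxel only if it is significant in the map and in every mask."""
    return [c and all(m[i] for m in masks) for i, c in enumerate(candidate)]


# Made-up voxel that responds only during the three speech conditions.
speech_only_voxel = [0.0] * 7 + [1.0] * 3
print(effect(contrasts["speech > mouth"], speech_only_voxel))

# Three voxels: only the first survives both masks.
print(inclusive_mask([True, True, False], [True, False, True], [True, True, True]))
```

Each weight vector sums to zero, so a voxel with the same mean in every condition yields a contrast value of zero.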

Results

Greater activation for silent speech than non-speech mouth movements

There were two areas where activation was significantly higher for silent speech than non-speech mouth movements: the left posterior superior temporal sulcus (pSTS) and the left dorsal pars opercularis within the inferior frontal gyrus extending into the left middle frontal gyrus. In each of these areas, activation was also higher for speech than finger movements and for speech relative to the visual fixation baseline. The loci and significance of these effects are shown in Table 1 and Figure 2.

Other effects

Both speech and non-speech mouth movements resulted in extensive activation in bilateral pre-central gyri relative to finger tapping and visual fixation (see Table 1 and green areas in Figure 2 for details). In addition, activation that was common to speech, non-speech mouth movements, and finger tapping (relative to the visual fixation baseline) was observed bilaterally in the postcentral gyri, superior cerebellum, inferior cerebellum, putamen, with left lateralized activation in the thalamus, insula, supratemporal plane, and supplementary motor area (see Table 1 and Figure 2 which represents a subset of these regions in red). Common activation in these areas may relate to shared processing functions. For example, it has been proposed that activation in the anterior insula is related to the voluntary control of breathing during speech production (Ackermann and Riecker, 2010). It might therefore be the case that all three motor tasks (speech, non-speech mouth movements, and finger tapping) involve voluntary control of breathing in time with the motor activity. Alternatively, common activation might reflect different functions that could not be anatomically distinguished in the current study. As the current study is concerned with differential activation for speech relative to non-speech mouth movements, we do not discuss the common activations further.

The only other significant effect was observed when non-speech tongue movements were compared to non-speech lip movements. This effect is shown in blue in Figure 2. The MNI co-ordinates of this effect (x = +64, y = −6, z = 26; Z score = 5.9; and x = −58, y = −6, z = 26; Z score = 4.1) correspond to those previously reported for tongue movements (Corfield et al., 1999; Pulvermuller et al., 2006). The consistency of this effect with recent functional imaging (Takai et al., 2010) and early electrocortical mapping (Penfield and Rasmussen, 1950) provides reassuring support that our study had sufficient power to identify effects of interest with high precision. We did not see significantly increased activation for non-speech lip relative to tongue movements; nor did we see differential activation when any of the conditions that alternated between two movements (e.g., lips/tongue/lips) were compared to the corresponding conditions in which the same movement was repeated continuously (e.g., lips/lips/lips or tongue/tongue/tongue).

Discussion

Silently articulating the words “one” and “three” strongly activated left inferior frontal and superior temporal language regions compared to lip pursing, tongue movements, finger tapping, and visual fixation. The left inferior frontal activation was located in the left dorsal pars opercularis and therefore corresponds to classic Broca’s area. The left superior temporal activation was located in the left pSTS and therefore corresponds to classic Wernicke’s area. We suggest that, during speech production, activation in these classic language areas is related to covertly generated auditory associations that are evoked automatically, and in synchrony, with highly familiar mouth movements that have previously been intimately associated with sound production and thus auditory feedback. In contrast, lip pursing, tongue, and finger movements are less practiced actions that are not intimately associated with speech sounds although they may have acoustic associations. The location and function of these activations are discussed below, in the context of generative models of perception and active inference (Friston, 2010; Friston et al., 2010). These data lead us to propose that Broca’s and Wernicke’s areas may play a role in predicting the auditory response during articulation, even in the absence of auditory feedback.

The activation in the dorsal pars opercularis extended anteriorly into the left inferior frontal sulcus (see Figure 2). It does not, therefore, correspond to the ventral premotor site of the speech sound maps proposed in the model by Guenther et al. (2006). It is also anterior to the more posterior premotor areas that respond during the observation of hand actions (Caspers et al., 2010), speech perception (Skipper et al., 2007; Callan et al., 2010), mirror neurons (Morin and Grezes, 2008; Kilner et al., 2009), and phonetic encoding during speech production (Papoutsi et al., 2009). Nevertheless, it does correspond to the area that is activated during both inner and overt speech tasks, for example, silent phonological decisions on written words (Poldrack et al., 1999; Devlin et al., 2003), lip reading (Fridriksson et al., 2009; Turner et al., 2009), overt speech production (Jeon et al., 2009; Whitney et al., 2009; Holland et al., 2011), and sentence comprehension (Bilenko et al., 2009; Mashal et al., 2009; Tyler et al., 2010). Moreover, it is not differentially activated by articulating words silently (as in the current study) or saying them aloud (see Price et al., 1996). Therefore the activation is more likely to reflect a fundamental property of speech production than atypical task-specific processing (e.g., the act of inhibiting the production of sounds following instructions to articulate silently). Given the minimal demands on conceptual, lexical, and auditory processing in the current study, we suggest that increased activation in the left dorsal pars opercularis for silently articulating words relative to non-speech mouth movements is related to higher-level representations of learnt words that predict the auditory consequences of well learnt speech articulations. Further we propose that these “predictions” are sent to auditory processing regions in the PT and the pSTS. 
Confirmation of this hypothesis requires a functional connectivity study with high temporal resolution to determine how activation in the left dorsal pars opercularis interacts with that in the superior temporal gyrus and sulcus.

The left pSTS activation that we observed during the silent articulation of speech is associated with phonological processing of speech sounds (Scott et al., 2000). The same STS area is also activated by written words in the absence of auditory inputs (Booth et al., 2003; Richardson et al., 2011). In addition, Leech et al. (2009) associated the left pSTS with learnt auditory associations. Specifically, they used a video game to train participants to associate novel, acoustically complex, artificial non-linguistic sounds with visually presented aliens. After training, viewing aliens alone, with no accompanying sound, activated the left pSTS, with activation in this area proportional to how well the auditory categories representing each alien had been learnt. As Leech et al. (2009) point out, part of what makes speech special is the extended experience that we have with it throughout development and this includes acoustic familiarity, enhanced audio–visual associations, and auditory memory in addition to the higher-level processing that is specific to speech (e.g., phonology and semantics). The activation that we observe in left pSTS may therefore reflect auditory associations of the articulated words. This might either be a consequence of auditory predictions from the left dorsal pars opercularis, or the left pSTS may itself play an active role in generating the predicted acoustic input during articulation (see the generative model in Figure 1). As acknowledged above, future functional connectivity studies using data with high temporal resolution will be required to distinguish these alternatives.

We did not find speech-selective activation in the lower bank of the Sylvian fissure that has been referred to as the PT, left supratemporal plane (STP), or Sylvian parietal temporal junction (Spt). The Sylvian fissure is the sulcus above the superior temporal gyrus but our speech-selective activation was in the pSTS, which is the sulcus below the superior temporal gyrus. We did, nevertheless, confirm the involvement of PT/STP/Spt in speech production because we found common PT/STP/Spt activation for speech, mouth movements, and finger movements, relative to fixation. In other words, as shown previously (Binder et al., 2000), PT/STP/Spt was activated by speech but activation in this region was not specific to speech.

The observation of activation in PT/STP/Spt during finger tapping movements is surprising. Traditionally, PT has been considered to be an auditory association area that is important for speech but not more activated for speech than tone stimuli (Binder et al., 2000). More recent proposals suggest that the PT/STP/Spt region is an interface for speech perception and speech production (Wise et al., 2001; Hickok et al., 2009) and involved in anticipating the somatosensory consequences of movements in the articulators (Dhanjal et al., 2008). Our finding that PT/STP/Spt activation is observed for finger tapping and mouth movements might suggest an even more general role in sensorimotor processing. Alternatively, it might be the case that finger tapping and non-speech mouth movements have low level acoustic associations that are predicted during the movements that have previously been associated with such sounds. In other words, we are proposing that, during speech production, auditory predictions are generated at (a) the level of acoustic associations of any type of movement (in PT/STP/Spt) and (b) the phonology associated with learnt words (in pSTS), see lower part of Figure 1.

How do our results fit with the models illustrated in Figure 1? As emphasized above, the full answer to this question requires techniques with higher temporal resolution that can characterize how all the speech production areas interact and influence one another during articulation. Nevertheless, our data do allow us to test the anatomical hypotheses from the different models. Specifically, the Tian and Poeppel (2010) model suggests that the forward model of auditory processing is in the sensory cortex and the Guenther et al. (2006) model suggests their speech sound maps are in the ventral premotor cortex. In contrast, the effects that we observed for speech processing in the left dorsal pars opercularis and pSTS are in higher-level association areas, not in sensory areas or the ventral premotor cortex. The Spt activation that we observed for speech, non-speech, and finger tapping movements might plausibly correspond to the model proposed by Hickok et al. (2011) in which Spt translates an internal model of motor processing to an internal model of auditory processing. However, the Hickok et al. (2011) model does not provide an interpretation of our speech-selective activation in left dorsal pars opercularis or pSTS. Thus, the anatomical predictions of the previous models do not explain our data. We therefore propose a new anatomical model. Within the framework of the generative model, illustrated in Figure 1, we suggest that the activation we observed in the left dorsal pars opercularis corresponds to processing in higher-level areas that predicts the auditory and motor consequences of speech; and the pSTS activation corresponds to phonological processing that may be involved in predicting the auditory response in PT/STP/Spt. 
Future studies are now required to investigate the validity of this proposal and to test how higher-level systems predict inputs to lower levels, and how prediction error is used to optimize future predictions (Friston, 2010; Friston et al., 2010). We speculate that, during overt speech production, top-down predictions from higher-level areas optimize auditory processing of the heard response by minimizing the prediction error (i.e., the mismatch between the produced and predicted response). In parallel, the prediction error is fed back to the higher-level regions and used to optimize future motor commands and auditory predictions.
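The idea that prediction error is used to optimize future predictions can be illustrated with a deliberately simple numerical sketch. It is a toy illustration only and is not the model tested in this study: the scalar "response", the function names, and the learning rate are all assumptions introduced for the example. On each simulated articulation, the prediction is moved toward the heard response by a fraction of the mismatch, so the prediction error shrinks over repetitions.

```python
def update_prediction(prediction, feedback, learning_rate=0.5):
    """Return the updated prediction and the prediction error.

    The error is the mismatch between the heard response (feedback)
    and the predicted response. All quantities are hypothetical scalars.
    """
    error = feedback - prediction
    return prediction + learning_rate * error, error


def simulate(heard_response=1.0, initial_prediction=0.0, steps=10,
             learning_rate=0.5):
    """Return the absolute prediction error on each repeated articulation."""
    prediction = initial_prediction
    errors = []
    for _ in range(steps):
        prediction, error = update_prediction(
            prediction, heard_response, learning_rate)
        errors.append(abs(error))
    return errors


if __name__ == "__main__":
    for step, err in enumerate(simulate(), start=1):
        print(f"articulation {step}: |prediction error| = {err:.4f}")
```

With these assumed settings the error halves on every repetition. A hierarchical account would require predictions and errors at several levels, which this sketch does not attempt.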

In conclusion, we found that regions corresponding to distinct parts of Broca’s and Wernicke’s areas were activated for mouth movements that have previously been learnt as words and therefore have well-established auditory associations. We therefore suggest that the dorsal pars opercularis part of Broca’s area and the pSTS part of Wernicke’s area are involved in predicting the auditory consequences of well-rehearsed articulations. In addition, we propose that the left dorsal pars opercularis and pSTS may be involved in generating and maintaining a forward generative model of expected speech, which can be used as a template for auditory prediction. Mismatches between the auditory predictions and auditory feedback can then be fed to the articulators to improve the precision of subsequent output. These audio–motor interactions are particularly important during speech acquisition in childhood, in those with hearing loss, and when adults learn a new language. They are also needed to modify the intensity of speech output in noisy environments and when auditory feedback is altered (e.g., by delay on the telephone). We speculate that the devastating impact of damage to Broca’s and Wernicke’s areas on speech production may in part be related to the importance of dorsal pars opercularis and pSTS for auditory–motor integration of speech.

Statements

Acknowledgments

This work was funded by the Wellcome Trust. We would like to thank Karl Friston for useful discussions and our radiographers (Amanda Brennan and Janice Glensman) for their help scanning the participants.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1
Ackermann, H., and Riecker, A. (2010). The contribution(s) of the insula to speech production: a review of the clinical and functional imaging literature. Brain Struct. Funct. 214, 419–433. doi: 10.1007/s00429-010-0257-x
2
Ashburner, J., and Friston, K. (1997). Multimodal image coregistration and partitioning – a unified framework. Neuroimage 6, 209–217.
3
Ballard, D. H., Hinton, G. E., and Sejnowski, T. J. (1983). Parallel visual computation. Nature 306, 21–26. doi: 10.1038/306021a0
4
Bilenko, N. Y., Grindrod, C. M., Myers, E. B., and Blumstein, S. E. (2009). Neural correlates of semantic competition during processing of ambiguous words. J. Cogn. Neurosci. 21, 960–975. doi: 10.1162/jocn.2009.21073
5
Binder, J. R., Frost, J. A., Hammeke, T. A., Bellgowan, P. S., Springer, J. A., Kaufman, J. N., and Possing, E. T. (2000). Human temporal lobe activation by speech and nonspeech sounds. Cereb. Cortex 10, 512–528. doi: 10.1093/cercor/10.5.512
6
Booth, J. R., Burman, D. D., Meyer, J. R., Gitelman, D. R., Parrish, T. B., and Mesulam, M. M. (2003). Relation between brain activation and lexical performance. Hum. Brain Mapp. 19, 155–169. doi: 10.1002/hbm.10111
7
Borden, G. J. (1979). An interpretation of research of feedback interruption in speech. Brain Lang. 7, 307–319. doi: 10.1016/0093-934X(79)90025-7
8
Callan, D., Callan, A., Gamez, M., Sato, M. A., and Kawato, M. (2010). Premotor cortex mediates perceptual performance. Neuroimage 51, 844–858. doi: 10.1016/j.neuroimage.2010.02.027
9
Caspers, S., Zilles, K., Laird, A. R., and Eickhoff, S. B. (2010). ALE meta-analysis of action observation and imitation in the human brain. Neuroimage 50, 1148–1167. doi: 10.1016/j.neuroimage.2009.12.112
10
Chang, S. E., Kenney, M. K., Loucks, T. M., Poletto, C. J., and Ludlow, C. L. (2009). Common neural substrates support speech and non-speech vocal tract gestures. Neuroimage 47, 314–325. doi: 10.1016/S1053-8119(09)70777-3
11
Christoffels, I. K., Formisano, E., and Schiller, N. O. (2007). Neural correlates of verbal feedback processing: an fMRI study employing overt speech. Hum. Brain Mapp. 28, 868–879. doi: 10.1002/hbm.20315
12
Corfield, D. R., Murphy, K., Josephs, O., Fink, G. R., Frackowiak, R. S., Guz, A., Adams, L., and Turner, R. (1999). Cortical and subcortical control of tongue movement in humans: a functional neuroimaging study using fMRI. J. Appl. Physiol. 86, 1468–1477.
13
Devlin, J. T., Matthews, P. M., and Rushworth, M. F. (2003). Semantic processing in the left inferior prefrontal cortex: a combined functional magnetic resonance imaging and transcranial magnetic stimulation study. J. Cogn. Neurosci. 15, 71–84. doi: 10.1162/089892903321107837
14
Dhanjal, N. S., Handunnetthi, L., Patel, M. C., and Wise, R. J. (2008). Perceptual systems controlling speech production. J. Neurosci. 28, 9969–9975. doi: 10.1523/JNEUROSCI.2607-08.2008
15
Ford, J. M., Gray, M., Faustman, W. O., Heinks, T. H., and Mathalon, D. H. (2005). Reduced gamma-band coherence to distorted feedback during speech: when what you say is not what you hear. Int. J. Psychophysiol. 57, 143–150. doi: 10.1016/j.ijpsycho.2005.03.002
16
Fridriksson, J., Moser, D., Ryalls, J., Bonilha, L., Rorden, C., and Baylis, G. (2009). Modulation of frontal lobe speech areas associated with the production and perception of speech movements. J. Speech Lang. Hear. Res. 52, 812–819. doi: 10.1044/1092-4388(2008/06-0197)
17
Friston, K. (2005). A theory of cortical responses. Philos. Trans. Biol. Sci. 360, 815–836. doi: 10.1098/rstb.2005.1622
18
Friston, K. (2010). The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11, 127–138. doi: 10.1038/nrn2787
19
Friston, K. J. (2001). Dynamic representations and generative models of brain function. Brain Res. Bull. 54, 275–285. doi: 10.1016/S0361-9230(00)00436-6
20
Friston, K. J., Daunizeau, J., Kilner, J., and Kiebel, S. J. (2010). Action and behavior: a free-energy formulation. Biol. Cybern. 102, 227–260. doi: 10.1007/s00422-010-0364-z
21
Friston, K. J., Holmes, A. P., Poline, J. B., Grasby, P. J., Williams, S. C., Frackowiak, R. S., and Turner, R. (1995). Analysis of fMRI time-series revisited. Neuroimage 2, 45–53. doi: 10.1006/nimg.1995.1007
22
Fu, C. H., Vythelingum, G. N., Brammer, M. J., Williams, S. C., Amaro, E. Jr., Andrew, C. M., Yágüez, L., van Haren, N. E., Matsumoto, K., and McGuire, P. K. (2006). An fMRI study of verbal self-monitoring: neural correlates of auditory verbal feedback. Cereb. Cortex 16, 969–977. doi: 10.1093/cercor/bhj039
23
Gregory, R. L. (1980). Perceptions as hypotheses. Philos. Trans. R. Soc. Lond. B Biol. Sci. 290, 181–197. doi: 10.1098/rstb.1980.0090
24
Guenther, F. H., Ghosh, S. S., and Tourville, J. A. (2006). Neural modeling and imaging of the cortical interactions underlying syllable production. Brain Lang. 96, 280–301. doi: 10.1016/j.bandl.2005.06.001
25
Hashimoto, Y., and Sakai, K. L. (2003). Brain activations during conscious self-monitoring of speech production with delayed auditory feedback: an fMRI study. Hum. Brain Mapp. 20, 22–28. doi: 10.1002/hbm.10119
26
Hawco, C. S. (2009). Control of vocalization at utterance onset and mid-utterance: different mechanisms for different goals. Brain Res. 1276, 131–139. doi: 10.1016/j.brainres.2009.04.033
27
Heinks-Maldonado, T. H. (2005). Fine-tuning of auditory cortex during speech production. Psychophysiology 42, 180–190. doi: 10.1111/j.1469-8986.2005.00272.x
28
Helmholtz, H. (1866). “Concerning the perceptions in general,” in Treatise on Physiological Optics, Vol. III, 3rd Edn. (translated by Southall, J. P. C. 1925 Opt. Soc. Am. Section 26, reprinted New York: Dover, 1962).
29
Hickok, G., Houde, J., and Rong, F. (2011). Sensorimotor integration in speech processing: computational basis and neural organization. Neuron 69, 407–422. doi: 10.1016/j.neuron.2011.01.019
30
Hickok, G., Okada, K., and Serences, J. T. (2009). Area Spt in the human planum temporale supports sensory-motor integration for speech processing. J. Neurophysiol. 101, 2725–2732. doi: 10.1152/jn.91099.2008
31
Holland, R., Leff, A. P., Josephs, O., Galea, J. M., Desikan, M., Price, C. J., Rothwell, J. C., and Crinion, J. (2011). Speech facilitation by left inferior frontal stimulation. Curr. Biol. 21, 1–5. doi: 10.1016/j.cub.2010.11.056
32
Jeon, H. A., Lee, K. M., Kim, Y. B., and Cho, Z. H. (2009). Neural substrates of semantic relationships: common and distinct left-frontal activities for generation of synonyms vs. antonyms. Neuroimage 48, 449–457. doi: 10.1016/j.neuroimage.2009.06.049
33
Kilner, J. M., Neal, A., Weiskopf, N., Friston, K. J., and Frith, C. D. (2009). Evidence of mirror neurons in human inferior frontal gyrus. J. Neurosci. 29, 10153–10159. doi: 10.1523/JNEUROSCI.2668-09.2009
34
Leech, R., Holt, L. L., Devlin, J. T., and Dick, F. (2009). Expertise with artificial nonspeech sounds recruits speech-sensitive cortical regions. J. Neurosci. 29, 5234–5239. doi: 10.1523/JNEUROSCI.5758-08.2009
35
Levelt, W. J., Roelofs, A., and Meyer, A. S. (1999). A theory of lexical access in speech production. Behav. Brain Sci. 22, 1–38; discussion 38–75. doi: 10.1017/S0140525X99451775
36
MacKay, D. M. (1956). “The epistemological problem for automata,” in Automata Studies, eds C. E. Shannon and J. McCarthy (Princeton, NJ: Princeton University Press), 235–251.
37
Mashal, N., Faust, M., Hendler, T., and Jung-Beeman, M. (2009). An fMRI study of processing novel metaphoric sentences. Laterality 14, 30–54.
38
Miall, R. C. (1993). Is the cerebellum a Smith predictor. J. Mot. Behav. 25, 203–216. doi: 10.1080/00222895.1993.9941639
39
Morin, O., and Grezes, J. (2008). What is “mirror” in the premotor cortex? A review. Neurophysiol. Clin. 38, 189–195. doi: 10.1016/j.neucli.2008.02.005
40
Papoutsi, M., de Zwart, J. A., Jansma, J. M., Pickering, M. J., Bednar, J. A., and Horwitz, B. (2009). From phonemes to articulatory codes: an fMRI study of the role of Broca’s area in speech production. Cereb. Cortex 19, 2156–2165. doi: 10.1093/cercor/bhn239
41
Paus, T., Perry, D. W., Zatorre, R. J., Worsley, K. J., and Evans, A. C. (1996). Modulation of cerebral blood flow in the human auditory cortex during speech: role of motor-to-sensory discharges. Eur. J. Neurosci. 8, 2236–2246. doi: 10.1111/j.1460-9568.1996.tb01187.x
42
Penfield, W., and Rasmussen, T. (1950). The Cerebral Cortex of Man. New York: The Macmillan Company.
43
Peschke, C., Ziegler, W., Kappes, J., and Baumgaertner, A. (2009). Auditory-motor integration during fast repetition: the neuronal correlates of shadowing. Neuroimage 47, 392–402. doi: 10.1016/j.neuroimage.2009.03.061
44
Poldrack, R. A., Wagner, A. D., Prull, M. W., Desmond, J. E., Glover, G. H., and Gabrieli, J. D. (1999). Functional specialization for semantic and phonological processing in the left inferior prefrontal cortex. Neuroimage 10, 15–35. doi: 10.1006/nimg.1999.0441
45
Price, C. J., Moore, C. J., and Frackowiak, R. S. (1996). The effect of varying stimulus rate and duration on brain activity during reading. Neuroimage 3, 40–52. doi: 10.1016/S1053-8119(96)80588-X
46
Pulvermuller, F., Huss, M., Kherif, F., Moscoso del Prado Martin, F., Hauk, O., and Shtyrov, Y. (2006). Motor cortex maps articulatory features of speech sounds. Proc. Natl. Acad. Sci. U.S.A. 103, 7865–7870. doi: 10.1073/pnas.0509989103
47
Richardson, F. M., Seghier, M. L., Leff, A. P., Thomas, M. S., and Price, C. J. (2011). Multiple routes from occipital to temporal cortices during reading. J. Neurosci. 31, 8239–8247. doi: 10.1523/JNEUROSCI.6519-10.2011
48
Riecker, A., Mathiak, K., Wildgruber, D., Erb, M., Hertrich, I., Grodd, W., and Ackermann, H. (2005). fMRI reveals two distinct cerebral networks subserving speech motor control. Neurology 64, 700–706. doi: 10.1212/01.WNL.0000152156.90779.89
49
Scott, S. K., Blank, C. C., Rosen, S., and Wise, R. J. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123, 2400–2406. doi: 10.1093/brain/123.12.2400
50
Skipper, J. I., Goldin-Meadow, S., Nusbaum, H. C., and Small, S. L. (2007). Speech-associated gestures, Broca’s area, and the human mirror system. Brain Lang. 101, 260–277. doi: 10.1016/j.bandl.2007.02.008
51
Takai, O., Brown, S., and Liotti, M. (2010). Representation of the speech effectors in the human motor cortex: somatotopy or overlap? Brain Lang. 113, 39–44. doi: 10.1016/j.bandl.2010.01.008
52
Takaso, H., Eisner, F., Wise, R. J., and Scott, S. K. (2010). The effect of delayed auditory feedback on activity in the temporal lobe while speaking: a positron emission tomography study. J. Speech Lang. Hear. Res. 53, 226–236. doi: 10.1044/1092-4388(2009/09-0009)
53
Tian, X., and Poeppel, D. (2010). Mental imagery of speech and movement implicates the dynamics of internal forward models. Front. Psychol. 1:166. doi: 10.3389/fpsyg.2010.00166
54
Tourville, J. A., Reilly, K. J., and Guenther, F. H. (2008). Neural mechanisms underlying auditory feedback control of speech. Neuroimage 39, 1429–1443.
55
Toyomura, A., Koyama, S., Miyamaoto, T., Terao, A., Omori, T., Murohashi, H., and Kuriki, S. (2007). Neural correlates of auditory feedback control in human. Neuroscience 146, 499–503. doi: 10.1016/j.neuroscience.2007.02.023
56
Turner, T. H., Fridriksson, J., Baker, J., Eoute, D. Jr., Bonilha, L., and Rorden, C. (2009). Obligatory Broca’s area modulation associated with passive speech perception. Neuroreport 20, 492–496. doi: 10.1097/WNR.0b013e32832940a0
57
Tyler, L. K., Shafto, M. A., Randall, B., Wright, P., Marslen-Wilson, W. D., and Stamatakis, E. A. (2010). Preserving syntactic processing across the adult life span: the modulation of the frontotemporal language system in the context of age-related atrophy. Cereb. Cortex 20, 352–364. doi: 10.1093/cercor/bhp105
58
Whitney, C., Weis, S., Krings, T., Huber, W., Grossman, M., and Kircher, T. (2009). Task-dependent modulations of prefrontal and hippocampal activity during intrinsic word production. J. Cogn. Neurosci. 21, 697–712. doi: 10.1162/jocn.2009.21056
59
Wise, R. J., Scott, S. K., Blank, S. C., Mummery, C. J., Murphy, K., and Warburton, E. A. (2001). Separate neural subsystems within “Wernicke’s area.” Brain 124, 83–95.
60
Wolpert, D. M., Ghahramani, Z., and Jordan, M. I. (1995). An internal model for sensorimotor integration. Science 269, 1880–1882. doi: 10.1126/science.7569931

Summary

Keywords

speech production, auditory feedback, PET, fMRI, forward model

Citation

Price CJ, Crinion JT and MacSweeney M (2011) A Generative Model of Speech Production in Broca’s and Wernicke’s Areas. Front. Psychology 2:237. doi: 10.3389/fpsyg.2011.00237

Received

02 February 2011

Accepted

30 August 2011

Published

16 September 2011

Volume

2 - 2011

Edited by

Albert Costa, University Pompeu Fabra, Spain

Reviewed by

Xing Tian, New York University, USA; Pascale Tremblay, The University of Trento, Italy

This is an open-access article subject to a non-exclusive license between the authors and Frontiers Media SA, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and other Frontiers conditions are complied with.

*Correspondence: Cathy J. Price, Wellcome Trust Centre for Neuroimaging, Institute of Neurology, University College London, 12 Queen Square, London WC1N 3BG, UK. e-mail: c.price@fil.ion.ucl.ac.uk

This article was submitted to Frontiers in Language Sciences, a specialty of Frontiers in Psychology.

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
