The neural control of singing

Zarate, Jean Mary

doi:10.3389/fnhum.2013.00237

REVIEW article

Front. Hum. Neurosci., 03 June 2013

Sec. Cognitive Neuroscience

Volume 7 - 2013 | https://doi.org/10.3389/fnhum.2013.00237

This article is part of the Research Topic: Sensory-motor control and learning of musical performance

Jean Mary Zarate*

Department of Psychology, New York University, New York, NY, USA

Singing provides a unique opportunity to examine music performance: the musical instrument is contained wholly within the body, thus eliminating the need for creating artificial instruments or tasks in neuroimaging experiments. Here, more than two decades of voice and singing research will be reviewed to give an overview of the sensory-motor control of the singing voice, starting from the vocal tract and leading up to the brain regions involved in singing. Additionally, to demonstrate how sensory feedback is integrated with vocal motor control, recent functional magnetic resonance imaging (fMRI) research on somatosensory and auditory feedback processing during singing will be presented. The relationship between the brain and singing behavior will also be explored by examining: (1) neuroplasticity as a function of various lengths and types of training, (2) vocal amusia due to a compromised singing network, and (3) singing performance in individuals with congenital amusia. Finally, the auditory-motor control network for singing will be considered alongside dual-stream models of auditory processing in music and speech to refine both these theoretical models and the singing network itself.

Most of the literature on sensory-motor control in music production and training-induced plasticity focuses on trained instrumental musicians or learning paradigms with musical instruments (e.g., learning to play short piano melodies). Singing, however, provides a unique opportunity to examine sensory-motor processes during musical production, since the instrument is already contained within the body; there is no need to create artificial instruments to assess motor control mechanisms with neuroimaging or any other experimental approach. Moreover, the adult vocal apparatus is highly trained to produce nuanced utterances in both song and speech. Across their lifetime, healthy non-musicians have sung (or have attempted to sing) a full repertoire of songs in socially and culturally specific settings (e.g., “Happy Birthday” or their national anthem). Additionally, healthy individuals can control their vocal pitch and/or output intensity to indicate the intent of a sentence (e.g., declarative statements vs. questions vs. commands), set the emotional context for a conversation (e.g., happiness, anger, sadness), or in tonal languages, distinguish between words and their meanings. Singers, on the other hand, undergo many years of extensive sensory-motor training and practice to exert much finer vocal control during more difficult tasks, such as singing fast vocal runs (e.g., melismata, melodic embellishments) or maintaining a melodic passage as someone else simultaneously sings a harmonic line. Therefore, using singing tasks to test groups with different levels of singing experience offers a rare opportunity to determine how musical experience specifically enhances sensory-motor control of this particular instrument, beyond the remarkable feats it already can perform. However, the mechanisms by which the vocal instrument is precisely controlled for singing are highly complex and thus require multiple networks for vocal motor control and sensory feedback processing.

Sensory-Motor Control of Vocalization

Sensory-Motor Control Observed From the Vocal Tract

When air passes through the glottis (opening of the larynx) and causes the vocal folds surrounding the glottis to vibrate at a particular rate, the resulting vibration rate determines the fundamental frequency (i.e., perceived pitch) of the voice (Sundberg, 1987). Different intrinsic and extrinsic laryngeal muscles interact to regulate fundamental frequency by altering the length of the vocal folds, thus changing the rate of vocal-fold vibration (Hirano et al., 1969; Sundberg, 1987). The precise control of laryngeal muscles is maintained in part by laryngeal reflexogenic control systems, in which receptors within the larynx adjust muscular contractions during perturbations. For instance, during vocalization, the uneven airflow passing through the glottis stimulates the myotatic mechanoreceptors in the intrinsic laryngeal muscles; these stretch-sensitive receptors initiate reflexive muscular adjustments to ensure that the vocal folds remain at the intended length and tension and therefore maintain a steady vocal pitch (Wyke, 1974). Additional reflexogenic systems work in concert with the intrinsic laryngeal reflexogenic system to ensure a stable vocalization (Wyke, 1974). Vocalization also involves the coordination of many other muscles, including the diaphragm and abdominal/thoracic muscles to provide airflow and regulate vocal output intensity, and articulatory muscles (e.g., lip, jaw, and tongue muscles, Hardcastle, 1976; Sundberg, 1987). The articulatory muscles contain somatosensory receptors that play a role in generating different vocal-tract configurations, which shape the formant frequencies that contribute toward vowel formation and vocal timbre (Sundberg, 1987; Jürgens, 2002; Perkell, 2012).

Similar to the somatosensory contribution to reflexogenic vocal control systems, auditory feedback also plays a role in reflex-like adjustments of ongoing vocal motor control. For instance, a slight decrease in auditory feedback amplitude elicits a quick increase in vocal output amplitude, which is known as the Lombard reflex (Lombard, 1911). During speech production, when the first formant frequency is shifted so that a produced vowel (e.g., /ε/) sounds like a different one (e.g., /æ/), the vocal motor system immediately compensates for the formant shift (Houde and Jordan, 1998, 2002; Purcell and Munhall, 2006a,b). Arguably, the most relevant auditory-vocal motor correction for singers deals with vocal pitch. When the pitch of auditory feedback is shifted up or down as participants vocalize for a few seconds (either at a comfortable pitch or to match a target pitch), investigators have observed pitch-shift responses, during which vocal pitch is adjusted quickly in the opposite direction of the feedback shift (Anstis and Cavanagh, 1979; Burnett et al., 1998; Larson, 1998; Hain et al., 2000; Jones and Munhall, 2000, 2005; Larson et al., 2000; Burnett and Larson, 2002; Liu and Larson, 2007; Jones and Keough, 2008). These pitch-shift responses often have two components: (1) an early pitch-shift response of 25–50 cents (irrespective of the pitch-shift magnitude) that occurs 100–150 ms after the pitch shift; and (2) a late pitch-shift response with a latency of 250–600 ms, whose magnitude and direction can be under voluntary control if participants are instructed to make a specific response (e.g., change pitch to either oppose or follow the pitch shift; Burnett et al., 1998; Larson, 1998; Hain et al., 2000).
Interestingly, prolonged exposure to feedback that is incrementally pitch-shifted over numerous trials can produce aftereffects in which intended vocal pitch and vocal output are mismatched, such that vocal pitch is automatically adjusted even when auditory feedback is returned to normal (Jones and Munhall, 2000, 2005; Jones and Keough, 2008).
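The response magnitudes above are given in cents, a logarithmic unit in which 100 cents equal one equal-tempered semitone and 1200 cents equal one octave. The following is a minimal sketch of the conversion between cents and frequency; the function names and the 220 Hz example pitch are illustrative assumptions and are not taken from the studies reviewed here.

```python
import math

# 100 cents per equal-tempered semitone, 12 semitones per octave.
CENTS_PER_OCTAVE = 1200.0


def cents_between(f_ref_hz: float, f_hz: float) -> float:
    """Signed interval from f_ref_hz to f_hz in cents (positive if f_hz is higher)."""
    return CENTS_PER_OCTAVE * math.log2(f_hz / f_ref_hz)


def shift_by_cents(f_hz: float, cents: float) -> float:
    """Frequency obtained by shifting f_hz up (or down, if negative) by `cents`."""
    return f_hz * 2.0 ** (cents / CENTS_PER_OCTAVE)


# Illustrative values only: a sung 220 Hz note heard through a +200-cent shift.
heard_hz = shift_by_cents(220.0, 200.0)

# Frequency ratios corresponding to 25- and 50-cent adjustments.
ratio_25 = shift_by_cents(1.0, 25.0)
ratio_50 = shift_by_cents(1.0, 50.0)
```

Under this definition, an adjustment of 25 to 50 cents corresponds to a change in fundamental frequency of roughly 1.5% to 3%, which gives a sense of how small the early pitch-shift response is.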

Neural Networks Governing Sensory-Motor Control of Vocalization

Brain regions involved in vocal motor control

Multiple neural networks are required for precise control of the “phonatory” muscles mentioned above. The reticular formation of the pons and medulla has direct connections to the motoneurons for all phonatory muscles (Figure 1, white boxes, Thoms and Jürgens, 1987), and thus may coordinate phonatory muscle groups to generate complete vocal patterns (Jürgens and Hage, 2007). This region receives excitatory input from two distinct neural pathways of vocal control (Figure 1; Jürgens, 2009; Owren et al., 2011). The first vocal control pathway (Figure 1, green boxes) contains the anterior cingulate cortex (ACC) and the midbrain periaqueductal gray (PAG), both of which produce vocalizations when stimulated electrically or pharmacologically (Müller-Preuss and Jürgens, 1976; Müller-Preuss et al., 1980; Suga and Yajima, 1988; Dujardin and Jürgens, 2005). The second neural pathway includes the primary motor cortex (M1, Figure 1, blue box) and two subcortical loops—comprised of putamen, globus pallidus, pontine gray, and cerebellum—that modulate vocal motor commands from M1 and subsequently send modified motor programs via the ventrolateral thalamus back to M1; electrical stimulation of the ventral part of M1 elicits vocalizations, as well as individual movements of the jaw, tongue, and lips (Penfield and Rasmussen, 1950).

Figure 1. Neural networks of vocal motor control (central column), somatosensory (left) and auditory feedback processing (right), and hypothesized regions of sensory-motor control of voice [modified from a model proposed by Jürgens (2009)]. The vocal motor control hierarchy starts with the generation of complete vocal patterns from the reticular formation and phonatory motoneurons (white boxes), and then the next highest level of control (green boxes) stems from the anterior cingulate cortex (ACC) and periaqueductal gray (PAG), which can initiate and emotionally motivate vocal responses. The highest level of vocal control comes from the primary motor cortex (M1, blue box; its modulatory brain regions are not depicted), which is responsible for producing learned vocalizations (i.e., speech and song). Somatosensory feedback (dotted arrow) from various receptors distributed throughout the vocal tract is processed in the ascending somatosensory pathway (yellow boxes, left; black slanted lines indicate that only selected regions of this pathway are shown) and transmitted to the primary and secondary somatosensory cortex (S1, S2). Auditory feedback (dashed arrow) from the vocalization is processed by the ascending auditory pathway and auditory cortical regions (orange boxes, right). Potential neural regions that integrate sensory feedback processing with vocal motor control are indicated with red-outlined boxes, and their shared connections are represented by red arrows: (A) the PAG, (B) ACC, and (C) the insula (in purple, classified as a higher-order associative area).

In humans, these networks form a tripartite hierarchy of vocal motor control (Figure 1, center column; Simonyan and Horwitz, 2011): (1) the reticular formation constitutes the lowest level, at which complete vocal patterns are generated; (2) the next level comprises the ACC and the PAG, which are credited with the voluntary initiation and emotional/motivational control of vocalizations (Jürgens, 2002, 2009); and (3) the highest level of vocal control occurs in M1 (and its modulatory brain regions), which is associated with the generation of learned vocalizations, such as speech and song (Jürgens, 2002, 2009). Importantly, this functional distinction of M1 is based on humans' unique possession of direct connections between the phonatory region of M1 (i.e., the ventral portion) and the motoneurons of phonatory muscles (see Figure 1); bilateral lesions to this M1 region destroy the ability to speak and sing (Jürgens, 2009), while innate vocalizations (e.g., shrieking and crying) that may be controlled by the ACC and PAG are left intact. In contrast, damage to the modulatory brain regions associated with M1 (e.g., putamen, globus pallidus, pontine gray, and cerebellum) can result in speech disorders such as stuttering and dysarthria (Ackermann et al., 1992; Jürgens, 2002; Alm, 2004). Lesions in the second level of vocal control may lead to mutism (attributed to PAG damage; Esposito et al., 1999) or loss of emotional/motivational intonation in speech (following damage to the ACC; Simonyan and Horwitz, 2011). Thus, the functional organization of vocal motor control in humans is concurrently hierarchical and parallel, since damage to brain regions within the second or third levels does not abolish all vocalizations.

Neural processing of somatosensory feedback

Various somatosensory receptors transmit feedback about the current state of the vocal motor system (e.g., placement of articulators, respiration, etc.) via the glossopharyngeal and vagus nerves and the ascending somatosensory pathway, which includes the nuclei gracilis, solitarius, and spinalis nervi trigemini and the medial lemniscus in the medulla, and the ventral posteromedial nucleus in the thalamus (Jürgens and Kirzinger, 1985; Willis, 1986). The thalamus sends somatosensory information to primary and secondary somatosensory cortex (S1 and S2), as well as the insula (Jones and Powell, 1970; Augustine, 1996; Jürgens, 2002; Ackermann and Riecker, 2004, 2010). More specifically, the ventral portion of the primary somatosensory cortex (S1)—posteriorly adjacent to the M1 phonatory area that governs vocalizations and individual movements of the articulators (Penfield and Rasmussen, 1950)—processes somatosensory information about articulatory movements (Grabski et al., 2012), while the anterior portion of the insula is recruited particularly during overt vocalizations (compared to covert speech and song, Riecker et al., 2000) and may contribute to voluntarily controlled respiration during vocalizations in general (Ackermann and Riecker, 2010).

Neural processing of auditory feedback during singing

As each sung note reaches a singer's ear as auditory feedback, the different frequencies within that particular vocal pitch are transduced by the organ of Corti on the basilar membrane of the cochlea (Hudspeth, 2000). The frequency characteristics that are required to perceive the pitch are transmitted and/or processed along different parts of the ascending auditory pathway, comprising the cochlear nucleus, lateral lemniscus, inferior colliculus, and the medial geniculate nucleus of the thalamus (Griffiths et al., 2001), before the extracted frequencies (and many other attributes of sounds) are further processed in primary and secondary auditory cortex within Heschl's gyrus. In particular, pitch information may be processed specifically by a (rightward lateralized) pitch-sensitive area located in lateral Heschl's gyrus, reported to be involved in conscious pitch perception (Griffiths, 2003; Bendor and Wang, 2006). This region may also be involved in organizing pitches in a hierarchical fashion, since patients with lesions in this region displayed much higher discrimination thresholds than controls when asked to indicate the direction of pitch change between two notes (Johnsrude et al., 2000). Processing pitch changes or melodic phrases within a sung passage recruits additional auditory cortical regions outside of Heschl's gyrus, including regions in the right superior temporal gyrus (STG), planum polare, and planum temporale (Zatorre et al., 1994; Patterson et al., 2002; Hyde et al., 2008). When pitch comparisons are performed within a sequence of tones or short melodies, increased activity is observed within right auditory and frontal cortical regions, presumably during tonal working memory processes, compared to passive melody perception (Zatorre et al., 1994).
Melodic phrase comparisons in the same key, which may be done to ensure correct melodic reproduction, engage extensive activity within several auditory cortical regions along bilateral STG, whereas melodic phrase comparisons across a pitch transposition (i.e., a key change) engage additional activity from the intraparietal sulcus (IPS; Foster and Zatorre, 2010).

Aside from providing details about vocal pitch, auditory feedback can also provide information about vocal timbre, which is argued to be processed specifically along the superior temporal sulcus (STS, Belin et al., 2000). Kriegstein and Giraud (2004) discovered three functionally distinct regions along the STS. The anterior STS is associated with familiar voice recognition, while the mid/anterior STS preferentially responds to the spectral characteristics of voices. The posterior STS (pSTS), which is recruited during recognition of unfamiliar voices, may be involved in analyzing spectral details (or the changes therein) of voices over time (Kriegstein and Giraud, 2004; Warren et al., 2006). Given that the pSTS is also recruited in response to presentation of frequency-modulated sweeps of pure tones (Poeppel et al., 2004) and phonological processing (Hickok and Poeppel, 2007), this region may be involved generally in processing spectrotemporal fluctuations in sound, including notable changes in auditory feedback.

Potential substrates for integrating sensory feedback with vocal motor control

The constituents of the vocal motor network associated with voluntary initiation and emotional/motivational control of vocalizations—the PAG and ACC—receive both somatosensory and auditory input, and thus form two potential substrates for sensory-motor control of vocalization (Figure 1, red-outlined boxes and arrows). The PAG (Figure 1A) receives somatosensory input via afferent projections from the nucleus gracilis (implicated in respiratory control, Hannig and Jürgens, 2006) and nuclei solitarius and spinalis nervi trigemini (kinesthetic and proprioceptive information, Jürgens and Kirzinger, 1985; Yoshida et al., 2000), as well as auditory information from the inferior colliculus and lateral lemniscus (Dujardin and Jürgens, 2005), all of which may facilitate initiating vocalizations in response to external stimuli or adjusting vocalizations based on sensory feedback. For example, when connections to the cerebrum are severed, the Lombard reflex is preserved during PAG-induced vocalizations coupled with auditory masking, suggesting that the PAG may govern auditory-motor control during involuntary auditory-vocal reflexes (e.g., Lombard reflex, formant- and pitch-shift responses) without additional control from cortical regions (Nonaka et al., 1997). The ACC (Figure 1B) directly receives somatosensory input from S2 and auditory input from auditory cortical regions along the STG and STS (Jürgens, 1983; Barbas et al., 1999). This region also receives these types of sensory input indirectly from S1 and auditory association areas via the insula (Mesulam and Mufson, 1982; Augustine, 1996). Since the insula is a gateway of both somatosensory and auditory information for the ACC, this region itself may provide another substrate for sensory-motor control of vocalization (Figure 1C, purple box). 
In particular, the anterior insula, whose cytoarchitecture and projections classify it as an association area that integrates different modalities (e.g., auditory, visual, somatosensory, motor, etc., Rivier and Clarke, 1997; Lewis et al., 2000; Bamiou et al., 2003; Ackermann and Riecker, 2004), is engaged specifically during voiced speech and song, relative to covert or internal versions (Riecker et al., 2000; but see Hillis et al., 2004; Ackermann and Riecker, 2010 for conflicting clinical evidence of the insula's role in speech production).

Neuroimaging evidence: a general functional network for human vocalization

Neuroimaging studies from the past two decades have confirmed that many regions within vocal motor and sensory networks are recruited during various overt speech and song tasks, including: word or letter generation (Paus et al., 1993); syllable repetition (Riecker et al., 2005); singing a note repeatedly (Perry et al., 1999), in a sustained fashion (Zarate and Zatorre, 2008), or while changing vowels in particular rhythms (Jungblut et al., 2012); repeating syllables, spoken words, and sung or hummed melodies (Özdemir et al., 2006); humming, speaking, or singing lyrics of a well-known song (Formby et al., 1989; Jeffries et al., 2003); reciting the months of the year or singing a familiar melody (Riecker et al., 2000); telling a story (Schulz et al., 2005); improvising word phrases, melodies, or harmonies (Brown et al., 2004, 2006); spontaneous and synchronized speaking and singing (Saito et al., 2006); and singing an Italian aria (Kleber et al., 2007). Taken together, this neuroimaging evidence indicates that a general functional network for human vocalization (including speech and song) comprises the brain regions reviewed in the preceding sections: M1, ACC, basal ganglia, thalamus, and cerebellum for vocal motor control; S1 and S2 for somatosensory feedback processing; bilateral auditory cortical regions (primary auditory cortex and a pitch-sensitive region within Heschl's gyrus, various portions of STG and STS) for auditory feedback processing; and the insula, presumably for multimodal processing of sensory feedback. In addition, premotor and parietal areas are recruited during human vocalization, and their functional roles will be further discussed below.

Until this point, both speech and song studies have been included to outline the brain regions associated with general vocal control in humans, since speaking and singing employ common mechanisms involved in vocal production. Moving forward, we will focus more on singing studies to examine how musical training modulates the general functional network for human vocalization as it is used for singing.

Training Effects on the Sensory-Motor Control of Singing

Vocal Training Effects on the Neural Correlates of Sensory-Motor Control of Singing

In general, due to their extensive auditory-motor training and experience, musicians excel in various auditory and motor tasks. For instance, previous studies report that musicians perform better at pitch, timbre, and voice discrimination tasks than non-musicians (Kishon-Rabin et al., 2001; Tervaniemi et al., 2005; Chartrand and Belin, 2006; Micheyl et al., 2006). In addition to possessing better auditory discrimination skills than non-musicians, musicians also display more precise control over the vocal apparatus in the absence of proper auditory feedback. For example, trained singers sang more accurately with masked auditory feedback than non-musicians (Schultz-Coulton, 1978), yet one study reported the reverse (Watts et al., 2003). However, Watts' group of singers may have had less vocal training than the singers in Schultz-Coulton's study; Watts suggested that during the earlier stages of vocal training, more emphasis is placed on monitoring auditory feedback for vocal accuracy (Watts et al., 2003), which may account for their recruited singers' greater vocal inaccuracy with masked feedback compared to non-musicians. In fact, in a longitudinal study with trained singers performing various slow and fast singing tasks, vocal accuracy was not differentially affected by masked auditory feedback either before or after 3 years of vocal training (Mürbe et al., 2004), which suggests that auditory feedback may not play a crucial role in vocal accuracy after extensive vocal training. Nevertheless, vocal accuracy did improve during slow singing tasks with masked feedback after vocal training, which Mürbe et al. (2004) attributed to training-enhanced “neuromuscular memory of pitch” (p. 240). This implies that trained singers may rely more on somatosensory feedback to make sure that notes are produced properly, since they can still sing accurately for some time after losing their hearing (Wyke, 1974).
Indeed, a functional magnetic resonance imaging (fMRI) singing study demonstrated that both vocal students (enrolled in a performance program) and professional opera singers recruited more activity within S1 and somatosensory association cortex than amateur singers, and moreover, the amount of singing practice positively correlated with the activity in these regions (Kleber et al., 2010). In a more recent fMRI study, Kleber et al. (2013) effectively reduced the amount of somatosensory feedback available by applying a topical anesthetic to the vocal folds just prior to singing in the MR scanner. The investigators determined that under vocal-fold anesthesia, singers displayed reduced activity in the right anterior insula, whereas non-musicians showed enhanced insular activity. Additionally, this region exhibited decreased functional connectivity to M1, S1, and auditory cortex in singers under topical anesthesia, while functional connectivity increased between these regions in non-musicians with anesthetized vocal folds. Notably, singers still sang more accurately under anesthesia than non-musicians, despite the observed reduction of insular activity and functional connectivity. Both of Kleber's experiments provide evidence that: (1) singers may rely more heavily on somatosensory feedback as a function of vocal training and practice, and (2) singers, perhaps by virtue of their training, can regulate activity within the right anterior insula to “disengage” or ignore somatosensory feedback when it is perturbed or deemed unreliable and might otherwise significantly alter their singing performance.

Similar to the somatosensory feedback perturbation induced in Kleber's recent study, Zarate and colleagues (2008, 2010b) utilized pitch-shifted auditory feedback with fMRI techniques to target explicitly the brain regions involved in auditory-vocal motor control in singing. As discussed earlier, pitch-altered feedback elicits pitch-shift responses that often contain early and late components. Larson and colleagues suggested that the early pitch-shift response, which may be governed by the midbrain PAG, is a more automatic reaction used to stabilize vocal output by correcting small, unexpected fluctuations in vocal pitch; the late pitch-shift response, on the other hand, may be under more voluntary control (perhaps controlled by regions such as the auditory cortex and ACC) and thus may contribute to vocal pitch control during speaking and singing (Burnett et al., 1998; Larson, 1998; Hain et al., 2000; Liu and Larson, 2007). Indeed, although trained singers exhibited early pitch-shift responses to briefly pitch-shifted feedback, they were still able to maintain their intended goal for vocalization (either sustaining a steady pitch or glissandos; Burnett and Larson, 2002; Hafke, 2008), perhaps due to enhanced top–down control of the late pitch-shift response that resulted from years of vocal training. In contrast, non-musicians may not exhibit such precise vocal control over the late pitch-shift response. To assess the effects of extensive vocal training on pitch control in singing, Zarate and colleagues (2008, 2010b) tested singers and non-musicians with two singing tasks that required different types of top–down voluntary control: (1) an “ignore” task where subjects were required to hold their pitch steady, despite hearing pitch-shifted auditory feedback; and (2) a “compensate” task in which subjects had to voluntarily adjust their vocal pitch precisely to correct for the pitch shift.
The authors hypothesized that ignoring a small pitch shift would not only elicit an early pitch-shift response, but also target the PAG relative to the compensate task, which was specifically designed to engage their proposed cortical substrates for auditory-motor control of vocal pitch—auditory cortex, insula, and ACC (Zarate and Zatorre, 2008; Zarate et al., 2010b).
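To make the two task demands concrete, the sketch below scores a single trial against the ideal vocal adjustment for each task: no change in produced pitch for “ignore,” and an equal and opposite change for “compensate.” This is an illustrative assumption about how such tasks could be scored, not the analysis used by Zarate and colleagues; the function names and the example frequencies are hypothetical.

```python
import math


def cents_between(f_ref_hz: float, f_hz: float) -> float:
    """Signed interval from f_ref_hz to f_hz in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_hz / f_ref_hz)


def ideal_vocal_change_cents(task: str, feedback_shift_cents: float) -> float:
    """Ideal change in produced pitch for each task (hypothetical scoring rule)."""
    if task == "ignore":
        # Hold the produced pitch steady regardless of the shifted feedback.
        return 0.0
    if task == "compensate":
        # Move the voice by the same amount in the opposite direction,
        # so that the shifted feedback lands back on the intended pitch.
        return -feedback_shift_cents
    raise ValueError(f"unknown task: {task!r}")


def task_error_cents(task: str, feedback_shift_cents: float,
                     f_before_hz: float, f_after_hz: float) -> float:
    """Produced pitch change minus the ideal change; 0 means perfect performance."""
    produced_change = cents_between(f_before_hz, f_after_hz)
    return produced_change - ideal_vocal_change_cents(task, feedback_shift_cents)


# Hypothetical trial: feedback shifted up by 200 cents, voice lowered by 200 cents.
f_before = 220.0
f_after = f_before * 2.0 ** (-200.0 / 1200.0)
compensate_error = task_error_cents("compensate", 200.0, f_before, f_after)
ignore_error = task_error_cents("ignore", 200.0, f_before, f_after)
```

In this hypothetical trial the same vocal behavior is a perfect response in the compensate task but a 200-cent error in the ignore task, which is why the two tasks place different demands on top–down control.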

Due to the temporal limitations of fMRI methodology, Zarate et al. (2010b) were not able to determine whether the PAG is involved particularly with eliciting early pitch-shift responses, since these responses have a latency that is shorter than the best temporal resolution for fMRI. Nevertheless, two interesting cortical findings from their singing tasks were observed. First, both groups recruited the IPS and dorsal premotor cortex (dPMC) in each pitch-shifted singing task, compared to singing with normal feedback (Zarate and Zatorre, 2008). The authors suggested that since the IPS is associated with transformations of sensory input for motor preparation (Astafiev et al., 2003; Grefkes et al., 2004; Tanabe et al., 2005), it was recruited specifically during transformations of auditory input (see Foster and Zatorre, 2010; Zatorre et al., 2010; Foster et al., 2013) into spatial information within the frequency domain (i.e., up or down). This “frequency spatial information” can then be used by the dPMC, an area that receives indirect connections from auditory and parietal areas via the insula (Mufson and Mesulam, 1982) and is implicated in conditional sensory-motor associations (Petrides, 1986; Chouinard and Paus, 2006), to prepare a vocal response (e.g., maintain steady vocal output or correct for the pitch shift). Second, despite the observed lack of performance differences in the compensate task (i.e., both groups voluntarily adjusted for the pitch-shifted feedback to a similar extent), different neural substrates for auditory-motor control were recruited in each group. Compared to singers, the non-musicians exhibited more activity within the dPMC while voluntarily correcting for the pitch shift (Figure 2A; Zarate and Zatorre, 2008); the authors proposed that the dPMC was recruited selectively in non-musicians as they learned to associate a pitch-shift “cue” in auditory feedback with a corrective adjustment in vocal pitch.
Therefore, this region may constitute a basic substrate for voluntary auditory-motor control of vocal pitch (Zarate and Zatorre, 2008) and perhaps music production in general—after more training and practice, the dPMC is recruited less in non-musicians during the same musical production task that was learned (and assessed with fMRI) at earlier stages of an experiment (Chen et al., 2012). Indeed, rather than recruiting the dPMC, singers engaged auditory cortex within the pSTS, anterior insula, and ACC for this task (Figure 2B; Zarate and Zatorre, 2008; Zarate et al., 2010b). Moreover, voluntary vocal-control singing tasks (i.e., compensating for and ignoring large pitch shifts in feedback) specifically enhanced the functional connectivity between the pSTS and IPS (Figure 2C; Zarate et al., 2010b). Given the IPS' role in sensory-motor transformations, Zarate and colleagues suggested that within singers, the auditory cortex and IPS jointly process and extract pitch-shift information that can be used to control vocal pitch (e.g., magnitude and direction of the pitch shift). Since the auditory cortex is functionally connected to the insula and ACC (Zarate and Zatorre, 2008; Zarate et al., 2010b), the pitch-shift information may be sent via the anterior insula to the ACC for initiation of the task-appropriate vocal motor program (i.e., maintain the originally produced note or correct for the shift). The authors proposed that these four cortical regions constitute an experience-dependent network for auditory-motor control of the singing voice, which may be recruited increasingly as a function of more vocal training and practice.

Figure 2. Brain regions involved in auditory-motor control of singing, as observed in non-musicians and singers. (A) When voluntarily correcting for a 200-cent pitch shift in auditory feedback (“compensate 200c” task), non-musicians recruited more activity within the dorsal premotor cortex (dPMC) than singers. (B) Singers engaged the posterior superior temporal sulcus (pSTS), anterior cingulate cortex (ACC), and anterior insula (aINS) when performing the “compensate 200c” task. (C) Analyses of task-modulated functional connectivity revealed that relative to singing with normal auditory feedback, the 200-cent pitch shift specifically enhanced functional connectivity between right pSTS and intraparietal sulcus (IPS) during both the “ignore 200c” and “compensate 200c” tasks, as well as the postcentral gyrus (containing somatosensory cortex) during the “ignore 200c” task. Data from Zarate and colleagues (2008, 2010b).

Short-Term Training Effects on Auditory and Vocal Skills and their Neural Correlates

Based on the studies above, trained singers may have more precise vocal control compared to non-musicians, due to extensive vocal training that recruits an experience-dependent cortical network and/or selectively gates access to sensory feedback within this network. However, Amir et al. (2003) determined that instrumental musicians (without formal vocal training) also sang more accurately than non-musicians in a simple pitch-matching task, in which subjects were required to sing a note that was just presented. Additionally, two studies report a significant correlation between pitch discrimination and vocal accuracy in both instrumental musicians and non-musicians—individuals who sang more accurately also had better discrimination skills (Amir et al., 2003; Watts et al., 2005). If this observed correlational relationship is a causal one, as these studies suggest, then refining pitch-discrimination skills may lead to better vocal accuracy. For instance, many studies have reported that auditory training improves pitch discrimination both at the training frequency and at other non-trained frequencies (Demany, 1985; Delhommeau et al., 2002, 2005; Ari-Even Roth et al., 2003). Furthermore, the effects of auditory training with pure tones also generalize to more complex tones (Grimault et al., 2003). In light of these observations and the proposed causal relationship between pitch discrimination and vocal accuracy, the newly enhanced ability to discriminate between pitches (following training) may increase the likelihood of detecting slight errors in vocal output, which may result in increased vocal accuracy. In turn, these training-induced behavioral changes are often accompanied by neural plasticity. For example, after non-musicians had received pitch-discrimination training, improved pitch discrimination was accompanied by enhanced auditory cortical responses (Bosnyak et al., 2004). Additionally, when non-musicians were trained to associate specific piano keys with their corresponding pitches and play short piano melodies, significant training-induced increases in cortical activity were observed within auditory, sensorimotor, frontal, and parietal regions (Bangert and Altenmüller, 2003; Lahav et al., 2007).
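
The correlational relationship described above can be pictured with a short sketch. It assumes that vocal accuracy in a pitch-matching task is scored as the absolute deviation (in cents) of the sung fundamental frequency from the target, and that this score is related to a pitch-discrimination threshold across participants with Pearson's r. Both choices, and all function names, are illustrative assumptions; the scoring and statistics actually used in the cited studies are not specified here.

```python
import math
from typing import Sequence


def pitch_error_cents(target_hz: float, sung_hz: float) -> float:
    """Absolute deviation of a sung note from its target, in cents."""
    return abs(1200.0 * math.log2(sung_hz / target_hz))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

Under these assumptions, each participant contributes one mean pitch error and one discrimination threshold, and a positive r would mean that participants with larger (worse) thresholds also sing with larger errors. A correlation of this kind does not by itself establish the causal relationship discussed in the text.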

Therefore, to examine whether: (1) singing accuracy improves subsequent to auditory training, and (2) singing enhanced by auditory training specifically engages the experience-dependent network for auditory-motor control in singing (i.e., auditory cortex, IPS, anterior insula, and ACC), Zarate et al. (2010a) tested two groups of non-musicians—an experimental group that received training to improve their auditory discrimination skills, and a control group that received no training—with auditory discrimination and singing tasks. In this study, the investigators employed more naturalistic melodic singing tasks to target the experience-dependent network, since accurate production of novel melodies requires auditory-motor control in a fashion similar to voluntarily correcting for pitch-shifted feedback; the auditory feedback of the currently produced note may be monitored in order to produce the correct pitch interval to the next note. Although the experimental group displayed enhanced auditory discrimination skills and training-induced changes in auditory task-associated neural activity (Zatorre et al., 2012), they did not show significant improvements in singing performance or recruit the experience-dependent network for auditory-motor control in singing (Zarate et al., 2010a). Consequently, Zarate et al. (2010a) concluded that auditory training alone (at least in an experimental setting) is not sufficient to improve vocal performance or recruit the experience-dependent network for auditory-motor control of singing (auditory cortex, IPS, anterior insula, and ACC); perhaps only simultaneous enhancements in both auditory and vocal motor skills via extensive training (e.g., voice lessons) would bring forth improvements in vocal performance and engage this particular network.

Sensory-Motor Control of Singing in Other Populations

Acquired Vocal Amusia

Clinical evidence that complements the proposed roles of the auditory cortex, IPS, S1, insula, and premotor regions during singing comes from case reports of brain lesions that result in vocal amusia or oral-expressive amusia (for a review, see Berkowska and Dalla Bella, 2009; Stewart et al., 2009). For instance, a woman with cortical atrophy in the right temporal lobe and insula, as well as diminished blood flow to right frontal and temporal regions, exhibited signs of progressive amusia and aprosodia—she gradually became incapable of perceiving and producing well-known melodies and affective intonation or prosody in speech (Confavreux et al., 1992). Additionally, a female tango singer who suffered a right-lateralized cerebral infarction presented with damage to right Heschl's gyrus and STG, inferior parietal regions including supramarginal gyrus and S1, and posterior insula; her music perception was greatly diminished post-stroke (relative to speech discrimination), and her singing was considered less stable within single notes, less accurate in pitch, and monotonous in affect (Terao et al., 2006).

While the two previous cases with damage to auditory cortex, insula, and other regions within the singing network presented with deficits in both music perception and production, two additional cases present perhaps the strongest evidence for these regions' involvement specifically for singing in the absence of impaired auditory perception. In a female patient who suffered a stroke in the right hemisphere affecting the lateral frontal lobe and M1, STG, insula, S1, and inferior parietal lobe, investigators observed impaired affective intonation in speech and the inability to sing pitch intervals accurately, while familiar-song perception and singing rhythms or melodic contour were relatively preserved (Murayama et al., 2004). Finally, a male amateur singer with right-lateralized damage to his posterior temporal lobe, inferior parietal lobe, insula, and inferior frontal gyrus presented with relatively spared speech comprehension and production, prosodic perception and production, music perception, and rhythm production; however, he exhibited specifically impaired pitch-interval production (Schön et al., 2004). This rather pure case of vocal amusia—in the absence of aphasia, aprosodia, and “perceptual” amusia—demonstrates that the damaged brain regions, which overlap with the areas outlined by Zarate and colleagues (2008, 2010b), contribute to the finely-grained sensory-motor control of singing.

Congenital Amusia

Recall that the same neural network is recruited for singing in healthy individuals, irrespective of the amount of vocal training or experience (see section Neuroimaging Evidence: A General Functional Network For Human Vocalization). However, when pitch processing is compromised as observed in congenital amusia (Ayotte et al., 2002; Peretz and Hyde, 2003; Foxton et al., 2004)—due to cortical malformations in the STG and inferior frontal gyrus (Hyde et al., 2007) and disrupted structural and functional connectivity (Loui et al., 2009; Hyde et al., 2011)—it may be assumed that pitch production in singing would similarly be affected. Yet, as observed in Murayama et al.'s (2004) and Schön et al.'s (2004) case reports, a dissociation between pitch perception and production skills can exist—following a stroke, spared pitch perception does not necessarily preclude inaccurate pitch production. Conversely, some individuals with congenital amusia still can sing pitch changes in the correct direction (e.g., up vs. down), match target notes, and sing familiar song excerpts somewhat accurately, despite observed problems with pitch perception (Ayotte et al., 2002; Loui et al., 2008; Dalla Bella et al., 2009; Hutchins et al., 2010).

Based on this behavioral evidence, as well as observations of singing in the general population, Berkowska and Dalla Bella proffered a “vocal sensorimotor loop” model to outline two functional pathways within the song system that may explain observations of accurate-pitch and poor-pitch singing (Berkowska and Dalla Bella, 2009; Dalla Bella et al., 2011). In this model, the authors list potential brain regions—based on previous neuroimaging studies, many of which are included in the section Neuroimaging Evidence: A General Functional Network For Human Vocalization—that contribute to mechanisms underlying singing, such as: regions within the STG for processing auditory input, which includes the auditory target to be reproduced and auditory feedback; dorsal prefrontal cortex, inferior sensorimotor cortex, area “Spt” within the planum temporale, and insula for auditory-motor mapping and memory access; supplementary motor area, ACC, and insula for motor preparation; and ventral M1 for vocal motor execution. Berkowska and colleagues also make distinctions between two pathways—a covert pathway involved in pitch discrimination (that can be compromised in congenital amusia), and an overt pathway involved in pitch production—but they do not clarify which of the aforementioned brain regions belong to each pathway. Congenital amusia may be due to a structural and functional “disconnection” between right auditory and inferior frontal cortical regions that contribute to pitch processing—although the right auditory cortex exhibits differential responses to pitch changes, the right inferior frontal cortex does not show a correlated increase in activity, as it does in normal listeners (Hyde et al., 2011). Even though this particular covert pathway is affected, auditory input (e.g., presented auditory targets, auditory feedback, etc.) can still be processed by auditory cortex (Moreau et al., 2009; Peretz et al., 2009; Moreau et al., 2013). Hypothetically speaking, auditory input may then be processed further by IPS (depending on the amount of vocal training), anterior insula, and premotor regions (dPMC or ACC) for auditory-motor control of singing based on Zarate's findings (Zarate and Zatorre, 2008; Zarate et al., 2010b), rendering vocal production relatively spared in some instances of congenital amusia.

Comparisons with Models of Auditory Processing

Berkowska and Dalla Bella's (2009) and Dalla Bella et al.'s (2011) vocal sensorimotor loop model for singing, when enriched with neuroimaging evidence from Zarate and Zatorre (2008), Hyde et al. (2011), and Loui et al. (2009), potentially consists of auditory and inferior frontal cortex in the covert perception pathway (Figure 3, blue arrow), and auditory cortex, IPS, anterior insula, and premotor areas in the overt production pathway (Figure 3, red arrows). These updated pathways resemble the more recognized (and widely debated) dual-stream model for auditory processing, which was first proposed by Rauschecker and Tian (2000). The dorsal stream was originally suggested to be specialized for processing auditory spatial information (the “where” pathway), while the ventral stream was credited with processing auditory object/sound identity information (the “what” pathway). The scientific debate focuses mostly on competing accounts and hypotheses of the dorsal stream's contributions, which include: (1) processing spectral changes over time (the “where in frequency” or “how” pathway, Belin and Zatorre, 2000); (2) extracting relevant sound features and matching them with stored templates of motor responses (the “do” pathway, Warren et al., 2005); (3) transforming auditory representations of speech into motor programs for speech gestures (Hickok and Poeppel, 2000, 2004, 2007); and (4) comparing between feedforward and feedback mechanisms (Rauschecker and Scott, 2009).

Figure 3. A revised version of Berkowska and Dalla Bella's (2009) and Dalla Bella and colleagues' (2011) vocal sensorimotor loop model for singing, updated with findings from Zarate and colleagues' (2008, 2010b) fMRI studies. The covert pathway for pitch perception (blue arrow) includes auditory cortex and inferior frontal gyrus (IFG), while the overt pathway for vocal pitch production (red arrows) comprises auditory cortex (STG/STS), intraparietal sulcus (IPS), anterior insula (aINS), anterior cingulate cortex (ACC), and dorsal premotor cortex (dPMC). Brain regions that are not visible normally from this lateral brain view are indicated in boxes outlined with dashes. Box colors are retained from Figure 1: light orange for auditory processing, green for vocal motor control, purple for multimodal processing.

For our purposes here, the most relevant dorsal-stream models are the spectrotemporal processing account from Belin and Zatorre (2000) and auditory-motor transformation hypotheses for auditory spatial processing and speech from Warren et al. (2005) and Hickok and Poeppel (2000, 2004, 2007). It should be noted, however, that the auditory-motor control network for singing conflicts with the latter two models, in which area Spt in the planum temporale is the sole neural substrate for auditory-motor transformations (Hickok and Poeppel, 2000, 2004; Warren et al., 2005; Hickok and Poeppel, 2007). Zarate's singing research (2008, 2010b) provides empirical evidence both supporting and, perhaps, updating these dorsal-stream models—auditory cortex and IPS process and extract pitch changes from feedback, and the pitch information is sent from these regions via the insula to premotor areas for vocal motor adjustments. Therefore, according to these neuroimaging findings, transformations of task-relevant auditory features into subsequent motor responses may not take place in only one brain region, as purported by the Warren et al. and Hickok/Poeppel models, but rather may be parceled among a network of different areas within the dorsal auditory stream. Thus, it could be argued that many brain regions along the dorsal auditory stream are involved in processing “how” auditory features change over time before executing or “doing” a specific motor act in response to these auditory events, regardless of the particular modality—be it information related to auditory space, speech, or music.

Conclusion

This review has drawn on findings from over 20 years of research to outline a general neural network for song and speech production (section Neuroimaging Evidence: A General Functional Network For Human Vocalization). Within this functional network, cortical substrates that are specific for the sensory-motor control of singing pitch and are sensitive to the amount of vocal training have been identified (Figure 4): the pSTS and IPS for auditory processing and transformation for motor output (light orange boxes), S1 for somatosensory processing (yellow box), anterior insula (in purple, both for auditory-motor integration and somatosensory feedback gating), and premotor regions for vocal motor preparation and response initiation (dPMC and ACC, in green). When the auditory-related findings are placed within a larger framework—a dual-pathway (i.e., perception vs. production), sensory-motor model for singing (Berkowska and Dalla Bella, 2009)—these music-specific findings can then be linked to broader research interests in auditory cognition, such as auditory spatial localization and speech perception/production, due to the auditory-motor control network's similarity to prevalent dual-stream models of auditory processing as a whole.

Figure 4. Neural substrates for sensory-motor control of singing that are sensitive to the amount of vocal training [based on findings from Kleber et al. (2010, 2013), Zarate and Zatorre (2008), Zarate et al. (2010b)]. Brain regions that are not visible normally from this lateral brain view are indicated in boxes outlined with dashes, and box colors are retained from Figures 1 and 3. Activity within primary somatosensory cortex (S1) increases as a function of the amount of weekly vocal practice, suggesting a greater reliance on somatosensory feedback with more training and experience. After extensive vocal training and practice, the anterior insula (aINS) can serve a gating function for somatosensory feedback. Features within auditory feedback are processed and extracted by auditory cortex (STG/STS) and the intraparietal sulcus (IPS), and task-relevant auditory information is sent via the aINS to the dorsal premotor cortex (dPMC)—in people with little to no formal vocal training—or to the anterior cingulate cortex (ACC) in experienced singers to voluntarily adjust vocal output according to the singing task demands.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The author thanks Robert J. Zatorre, Ph.D. and David Poeppel, Ph.D. for their invaluable mentorship and support. This work was supported in part by grants from the GRAMMY Foundation®, the Eileen Peters McGill Majors Fellowship, and the Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT).

References

Ackermann, H., and Riecker, A. (2004). The contribution of the insula to motor aspects of speech production: a review and a hypothesis. Brain Lang. 89, 320–328. doi: 10.1016/S0093-934X(03)00347-X

Ackermann, H., and Riecker, A. (2010). The contribution(s) of the insula to speech production: a review of the clinical and functional imaging literature. Brain Struct. Funct. 214, 419–433. doi: 10.1007/s00429-010-0257-x

Ackermann, H., Vogel, M., Petersen, D., and Poremba, M. (1992). Speech deficits in ischaemic cerebellar lesions. J. Neurol. 239, 223–227.