The Linked Dual Representation model of vocal perception and production

The voice is one of the most important media for communication, yet there is a wide range of abilities in both the perception and production of the voice. In this article, we review this range of abilities, focusing on pitch accuracy as a particularly informative case, and look at the factors underlying these abilities. Several classes of models have been posited describing the relationship between vocal perception and production, and we review the evidence for and against each class of model. We look at how the voice is different from other musical instruments and review evidence about both the association and the dissociation between vocal perception and production abilities. Finally, we introduce the Linked Dual Representation (LDR) model, a new approach which can account for the broad patterns in prior findings, including trends in the data which might seem to be countervailing. We discuss how this model interacts with higher-order cognition and examine its predictions about several aspects of vocal perception and production.


INTRODUCTION
One of the most important abilities of humans is the capacity to communicate complex ideas quickly and efficiently. Although there are many ways of communicating with each other, including methods as diverse as body language, signing, and smoke signals, by far the most important medium is the voice. Singing and speech are cultural universals which rely on the voice being physically produced and perceived; these two processes are necessary for communication to occur. Understanding the relationship between vocal perception and production, then, is critical to understanding communication, the nature of the mental processes underlying it, and the most fundamental abilities of humanity.
Singing, even more than speech, has been one of the most profitable places to look for insights into vocal perception and production. On the production side, it involves a similar degree and type of vocal control as speech, and both create a similar type of signal to be perceived by a listener. Furthermore, because of the stylistic communication goals of music, small variations in the produced signal are generally more important than in speech and have thus been the focus of comparatively more research. Since speech and singing both use similar aspects of the vocal signal, the research on perception and production of the voice in a musical context can be informative about how people use their voices in the context of speech. Indeed, many who study this field consider music to have a special relationship with speech processing, due in large part to their overlap and the greater demands of precision of processing in music (see Moreno et al., 2009 or Patel, 2011). This makes singing a particularly interesting and fruitful place to understand the connection (or lack thereof) between perception and production. Furthermore, these findings may shed some light on how other domains divide processing for these functions.
Three basic model architectures have been proposed to explain the relationship between vocal perception and production (Figure 1). The simplest such theory posits that perception necessarily precedes vocal production (Figure 1, left). Thus, when we imitate speech or music, we first construct a symbolic representation of the vocal stimulus. This symbolic representation is then used to construct the vocal-motor representation. These vocal-motor representations are used to issue the appropriate commands to the vocal tract to create the intended sounds. That is, we imitate our symbolic representation of the sound. This model has the benefit of being intuitive and straightforward. It predicts a causal connection between perception and production abilities such that a deficit in our conscious pitch perception abilities would impair our pitch production abilities, while pitch production impairments would not negatively affect our pitch perception abilities.
However, there are alternate models. A motor model of vocal perception (Figure 1, center) would predict the opposite processing stream, where vocal stimuli are first processed for their motor-relevant features, and only afterwards are relayed into our conscious perception for symbolic representation. Such a model preserves the correlation between perception and production, but makes the reverse predictions of the naïve model: vocal production impairments should negatively affect vocal perception abilities, but not vice-versa. Finally, dual-route models (Figure 1, right) predict that vocal stimuli are processed for motor-relevant features and conscious, symbolic representations along two different, independent pathways. This model predicts that vocal perception and production abilities should be uncorrelated, and each can be improved or impaired without affecting the other. These models all have analogues in the speech domain. To take just a few examples, the general auditory account (Diehl et al., 2004), the motor theory of speech processing (Liberman and Mattingly, 1985), and the dual-stream model of speech (Hickok and Poeppel, 2007) mirror the general architectures of the models in Figure 1 from left to right, respectively.
FIGURE 1 | Three proposed models of perception and production.
In this review, we will be examining the many factors that affect perception and production abilities, with an eye toward how perception and production might relate to each other and the neural mechanisms underlying each type of ability. We will look at the evidence for each basic type of model and show how different types of evidence point toward structurally different models. Based on this evidence, we introduce the Linked Dual Representation (LDR) model, a synthesis of the relevant features of these prior models that has the potential to explain why vocal perception and production can appear to be both correlated and dissociable abilities. Finally, we will look at the implications and predictions specific to the LDR model and lay out some possible lines of research.

PRODUCTION OF THE SINGING VOICE
Anybody who has ever been serenaded by "Happy Birthday" could tell you that there can be quite large individual differences in singing ability. Even among people who have never received any formal music training, we can find both potential future stars and those who cannot seem to find the key. One of the major reasons for individual differences in singing is the fact that singers have such a large number of variables to control simultaneously. To be a good singer, one needs to control the pitch, timbre, timing, and loudness of the voice, with many of these factors changing both between and within individual tones. Of course, part of what makes singing good or bad is culturally dependent. For example, a Western operatic voice is inappropriate for a Hindustani raga, and vice-versa. Within cultures, too, there are stylistic factors that will affect the judgment of performances: a very skilled country-western singer may sound quite out of place in an R&B recording. Taking stylistic concerns into account, we can identify certain factors that contribute to a good singing performance within particular styles. For example, one of the more well-known and studied of these is the singer's formant. This feature, which is really a compression of the 4th and 5th formants (those regions of the frequency spectrum at which the voice is most resonant; these help define the timbre of the voice) into one large amplitude formant, is a marker of good singing in the Western operatic style (Sundberg, 1987) and is typically achieved by lowering the larynx. Producing a singer's formant can help a solo singer to be heard over an orchestra by concentrating amplitude at frequencies which are not as loud in an orchestra (Sundberg, 1987).
Studies of the particular characteristics that make a good vocal style for musical theatre (i.e., belting; Sundberg et al., 1993; Cleveland et al., 2003), country music (i.e., "twang"; Sundberg and Thalén, 2010), and others (Borch and Sundberg, 2011) have also revealed unique techniques for those styles. On the other side of the spectrum, studies of poor singers have found a number of acoustical markers that differentiate them from good singers. These include jitter (which captures irregularity in the microstructure of pitch), shimmer (which captures irregularity in the microstructure of amplitude), and harmonic-to-noise ratio (which captures the strength of harmonic vs. inharmonic frequencies), among others (Titze, 2000; Sataloff, 2005).
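As an illustration of how such markers can be quantified, the following minimal sketch computes one commonly used "local" perturbation measure: the mean absolute difference between consecutive cycles, divided by the mean value. Applied to glottal cycle durations it gives a jitter estimate, and applied to per-cycle peak amplitudes it gives a shimmer estimate. The function name and the sample values are hypothetical, and this particular formula is an assumption about how the measures might be operationalized; the studies cited above may use other definitions.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive cycle values,
    normalized by the mean value (a unitless ratio).

    Pass cycle durations to estimate jitter, or per-cycle peak
    amplitudes to estimate shimmer. Illustrative sketch only.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


# Hypothetical cycle durations in milliseconds (not real data).
steady_periods = [5.0, 5.0, 5.0, 5.0]
irregular_periods = [4.0, 6.0, 4.0, 6.0]

jitter_steady = local_perturbation(steady_periods)
jitter_irregular = local_perturbation(irregular_periods)
```

A perfectly regular voice yields a value of zero under this definition, and larger values indicate more cycle-to-cycle irregularity.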
However, across all singing styles, one of the most important factors in determining the quality of singing is pitch accuracy. For example, in a study assessing the views of music educators on the singing abilities of non-musicians, intonation (pitch accuracy) was rated as the single most important factor in whether or not a non-musician was perceived as having talent (Watts et al., 2003a). Because of its importance, pitch accuracy is also one of the most widely studied factors in the literature on singing ability (e.g., Dalla Bella et al., 2007; Pfordresher and Brown, 2007; Hutchins and Peretz, 2012a). For example, in a study of untrained singers asked to sing a well-known song in either a city park or a lab setting, Dalla Bella et al. (2007) found a range of singing abilities. These singers showed a great amount of variance in the number of pitch interval errors. All of the participants in the park setting had at least one pitch interval error of greater than a semitone, and a few sang incorrectly on over half of the intervals of the song (there were a total of 31 intervals in the song). Singers performing the same song in a laboratory setting had fewer errors, but nevertheless showed a great deal of variability in performance. Interestingly, the number of errors in the time dimension was much lower across all participants in both groups, indicating that timing accuracy does not seem to be as indicative of singing ability as pitch accuracy.
In another study of note, Pfordresher and Brown (2007) studied singers performing single pitches, single intervals, and short melodies. This study also found a range of abilities on each task, with most being able to sing with an average pitch within one semitone of a target pitch, but some being very inaccurate, as high as 250 cents in error (1 semitone = 100 cents). Their results also indicated that poor pitch singers tend to be inaccurate both in single tones and in intervals and melodies. Poor singers tended to compress intervals. A further investigation (Pfordresher et al., 2010) demonstrated the variability of both single tone and interval tuning, even within individual singers. Here, over 50% of participants showed a standard deviation of greater than 100 cents in their singing, indicating widespread imprecision and considerable variability both within and between singers. Numerous other studies have looked at pitch-related singing abilities in the population; these have found consistent variation within non-musicians and consistently better pitch abilities in musicians than non-musicians (e.g., Amir et al., 2003; Watts et al., 2003b; Demorest and Clements, 2007; Nikjeh et al., 2009; Hutchins and Peretz, 2012a). Pitch matching ability also tends to increase in children during their elementary and middle school years (Green, 1990; Yarbrough et al., 1991). Thus, it seems that there is a wide range of abilities in the general population to produce vocal pitches accurately. This wide range of abilities, in combination with the importance of pitch matching in singing, makes pitch matching one of the best ways to study vocal-motor control, providing an insight into the accuracy of individuals' vocal-motor representations.
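For readers less familiar with the cents scale used above, the following sketch shows the standard conversion from a pair of frequencies to a signed pitch error in cents, using the equal-tempered definition in which an octave (a 2:1 frequency ratio) spans 1200 cents. The function name and the example frequencies are illustrative assumptions and are not taken from any of the studies cited.

```python
import math


def cents_error(produced_hz, target_hz):
    """Signed pitch error in cents: positive = sharp, negative = flat.

    Uses cents = 1200 * log2(produced / target), so that one
    equal-tempered semitone equals 100 cents.
    """
    if produced_hz <= 0 or target_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(produced_hz / target_hz)


# Hypothetical example: a target of A4 (440 Hz) sung one semitone sharp.
target = 440.0
one_semitone_sharp = target * 2 ** (1 / 12)
error = cents_error(one_semitone_sharp, target)
```

Because the scale is logarithmic, an error of a given size in cents corresponds to the same musical distance regardless of the register in which it is sung.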

FACTORS AFFECTING SINGING ABILITY
One of the most common assumptions about singing is that poor perception ability drives poor production ability. If people cannot hear pitches accurately, then it stands to reason that they will be inaccurate at imitating those pitches. This is the prediction of the perception-based model (Figure 1, left). Several studies have investigated this hypothesis, and the evidence is mixed. Using a variety of different singing and pitch perception tasks, some studies have found evidence of a correlation between the two abilities (e.g., Amir et al., 2003; Watts et al., 2005; Moore et al., 2007; Estis et al., 2009, 2011). However, many others, using similar designs, have failed to find a significant correlation (e.g., Bradshaw and McHenry, 2005; Dalla Bella et al., 2007; Pfordresher and Brown, 2007; Moore et al., 2008), which argues more for a dual-route model of perception and production (Figure 1, right), making the overall evidence mixed at best.
Two studies addressing this issue are worth pointing out in particular. First, in one of the few studies to use an experimental design, Zarate et al. (2010a) trained participants to better perceive small variations in pitch in the context of micromelodies. However, although they improved at perception, they did not improve in their abilities to produce these same small pitch changes. They concluded that perceptual training does not aid singing ability, thus contradicting the perceptual-based model. Second, in their 2007 study, Pfordresher and Brown found no correlation between pitch perception abilities and their imitation tasks, nor any problems with vocal pitch range in their sample. Thus, they posited that sensori-motor mismappings were the best remaining explanation for poor singing ability in most cases, such that perceived tones were incorrectly mapped onto motor outputs.
In order to sort out the causes of poor singing ability, Hutchins and Peretz (2012a) used a novel methodology involving a new instrument called a slider. This slider produced a synthesized vocal tone that was subject to many of the same limitations as the human voice, including a very fine scale of pitch control. Instead of using their vocal apparatus, though, the participant played the slider by pressing a finger onto a touch-sensitive strip. Thus, it provided a measurement of pitch matching ability independent of the ability to control one's vocal musculature. Pitch-matching ability on the slider was compared to the ability to vocally match a synthesized vocal tone and a prior recording of one's own voice. Participants who could match the pitch with the slider but not with their voice were thus likely to have a vocal-motor control impairment as their primary cause of singing inaccuracies. Those who could match the pitch with the slider and match the recording of their own voice (which had the same timbre as their attempts to match it), but not the synthesized vocal tone, were likely to have a sensori-motor impairment as their primary cause of singing inaccuracies. These singers had a specific difficulty in translating between the timbre of the synthesized voice and the timbre of their own voice. Because their primary deficit was neither in perceiving the relationships among tones, nor in controlling their vocal muscles, but in connecting their perception to an appropriate production, this is considered to be a type of sensori-motor impairment. Finally, those singers who failed at matching pitch both with the slider and the voice are likely to have a perceptual deficit.
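The diagnostic logic described above can be summarized as a small decision rule. The sketch below is an illustrative reading of that logic, not the authors' analysis code: the function name, the boolean inputs, and the "accurate" and "unclassified" outcomes are assumptions introduced here, and the criterion for what counts as a successful match is deliberately left to the caller.

```python
def classify_singer(slider_ok, own_voice_ok, synth_voice_ok):
    """Assign a likely primary cause of pitch-matching inaccuracy.

    slider_ok      -- matched the target pitch using the slider
    own_voice_ok   -- vocally matched a recording of one's own voice
    synth_voice_ok -- vocally matched the synthesized vocal tone
    """
    if slider_ok and own_voice_ok and synth_voice_ok:
        # Assumption: success on every task means no impairment.
        return "accurate"
    if not slider_ok and not own_voice_ok and not synth_voice_ok:
        # Fails with both the slider and the voice.
        return "perceptual deficit"
    if slider_ok and own_voice_ok and not synth_voice_ok:
        # Can match own timbre but not the synthesized timbre.
        return "sensori-motor (timbre) deficit"
    if slider_ok and not own_voice_ok and not synth_voice_ok:
        # Can match with the slider but not with the voice.
        return "vocal-motor control impairment"
    # Patterns the text does not describe are left unclassified.
    return "unclassified"


example = classify_singer(slider_ok=True, own_voice_ok=True, synth_voice_ok=False)
```

Keeping the three task outcomes as separate inputs makes explicit that the slider isolates perceptual ability from vocal control, while the own-voice condition isolates timbre translation.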
The results showed about 20% of singers had a vocal-motor control impairment, 35% had a sensori-motor (timbre) deficit, and only 5% had a perceptual deficit. Participants were universally better at matching pitch with the slider than with their voice, and the results showed a wide range of singing abilities among non-musicians. Singing ability was not aided by multiple attempts, nor was it improved by a visualization of their produced pitch. Although these results show that perception is not a limiting factor in most people's pitch imitation ability, there was nevertheless a modest correlation among non-musicians (r = 0.4) between accuracy on the slider and with their voice. These results point to a strong effect of motor and sensori-motor factors on singing ability, with a moderate influence of perceptual ability. This pattern of results suggests aspects of both the perceptual-based model and the dual-route model of vocal perception and production.
Other studies have also shown effects of the target's timbre on pitch-matching ability. Singers are better able to match the pitch of vocal targets with a similar voice than the pitch of instruments (Watts and Hall, 2008) and better able to match the pitch of their own voice than the pitch of other targets (Moore et al., 2008). Poor singers are especially aided by using a human, rather than synthetic, target pitch (Léveque et al., 2012). Educators also report that children tend to be able to match pitch better when modeling a similar voice (reviewed in Goetze et al., 1990).
A number of functional imaging studies have investigated the brain areas that support singing production. These studies have localized the "singing network," which includes the auditory cortex, insula, supplementary motor area and anterior cingulate, as well as parts of the motor cortex specific to the mouth/lips and larynx (Perry et al., 1999; Brown et al., 2004; Özdemir et al., 2006; Kleber et al., 2007). This network is involved in motor production, motor planning of sequences, motor initiation, and articulation.
Singing ability is also reflected in neural activation patterns. For example, as might be expected, highly trained singers show more recruitment of laryngeal and mouth areas of the somatosensory cortex than less-trained singers, an effect related to the amount of singing practice (Kleber et al., 2010). They also show more activation in non-cortical regions, such as the basal ganglia, the thalamus, and the cerebellum (Kleber et al., 2010). Other studies using a pitch-shift paradigm, in which the singer's auditory feedback is manipulated while producing the tones, have shown that experienced singers recruit more areas of the singing network than untrained singers (Zarate and Zatorre, 2008). This methodology has shown a particularly strong role of the dorsal premotor cortex in regulating and controlling responses to auditory feedback; this area is thus thought to be highly involved in the interface between perception and production (Zarate and Zatorre, 2008; Zarate et al., 2010b).

PERCEPTION OF THE SUNG VOICE
GENERAL PITCH PERCEPTION ABILITIES
While there has been a good amount of research on singing ability and the factors underlying singing ability, there has been considerably less research on vocal perception. However, we know a great deal about auditory perception in general. In the case of pitch, we can measure just-noticeable differences (or difference limens); in some cases these can be as low as five cents (Zwicker and Fastl, 1999). Individual differences in pitch difference limens, which can be considerable, could contribute to differences in vocal pitch perception abilities. The timbre of tones can also affect pitch perception abilities. Changes in timbre interfere with pitch judgments (Melara and Marks, 1990a,b,c; Krumhansl and Iverson, 1992), and timbre and pitch have been shown not to be perceptually independent (Melara and Marks, 1990a,b,c; Krumhansl and Iverson, 1992; Pitt, 1994; Warrier and Zatorre, 2002). Musicians seem to be less susceptible to timbral interference of pitch processing, however (Beal, 1985; Pitt and Crowder, 1992; Pitt, 1994).
There is also considerable variability in preferences and judgments of musical intervals. Listeners will show differences between what they consider to be an acceptably-tuned musical interval or note (Rakowski, 1990; Vurma and Ross, 2006; Hutchins et al., 2012), as well as differences in their identification judgments of intervals (Siegel and Siegel, 1977; Halpern and Zatorre, 1979). There are also individual differences related to musical training in preferences in listening to certain types of consonant vs. dissonant intervals (McDermott et al., 2010).
Experience and training can play a large role in pitch perception ability, as evidenced by the differences between musicians and non-musicians (e.g., Pitt, 1994; Moreno and Besson, 2006; Moreno et al., 2009; McDermott et al., 2010; Hutchins et al., 2012). Even among non-musicians, pitch discrimination abilities can be improved with extra training (Zarate et al., 2010a). Tone-language speakers, too, show better pitch perception abilities, presumably due to their greater experience in pitch processing (Pfordresher and Brown, 2009; Bidelman et al., 2013a). Among bilinguals, there is also evidence of causality running in the opposite direction, such that musical ability is predictive of the ability to discriminate and produce non-native speech sounds, both for linguistic tones (Gottfried et al., 2004; Alexander et al., 2005) and for non-tone phonemes (Slevc and Miyake, 2006). Musically trained participants are also better at detecting pitch changes in speech in a foreign language (Marques et al., 2007).
One of the most important neurological correlates of pitch processing ability is the auditory brainstem response (ABR). This response mimics the pitch and some timbral characteristics of a presented tone (Krishnan, 2007; Skoe and Kraus, 2010) and occurs very early in processing, being recorded typically with less than a 10 ms lag following the stimulus. One characteristic of the ABR that is of particular interest is the fact that trained musicians show a higher-fidelity ABR with a shorter lag than non-musicians; this higher-fidelity ABR correlates with better ability to make behavioral pitch judgments (Kraus et al., 2009; Bidelman et al., 2011). This benefit is not limited to musicians but generalizes to other groups with high expertise in pitch, such as tonal language speakers (Krishnan et al., 2008; Bidelman et al., 2013b). Other studies have shown that the ABR preserves timbral characteristics more accurately in people with musical backgrounds (Kraus et al., 2009; Bidelman and Krishnan, 2010; Strait et al., 2012). This early benefit in pitch and timbre perception seems to precede cortical representations of pitch and timbre and may be transformed to a more conceptual-level representation of the response as it is transmitted upwards (Bidelman et al., 2013a). This response most likely occurs before any task-relevant effects have time to affect the neural representation. Thus, the fidelity of the brainstem response is a good candidate to affect the accuracy of both pitch perception and production, and may be an indicator of the earliest level of perceptual processing.

CONGENITAL AMUSIA
One way of learning about the causes and effects of pitch perception, as well as its relationship to production and to the domain of language, is by looking at cases where pitch perception is compromised. Congenital amusia, which is a neurogenetic disorder characterized by impaired music perception ability in the absence of brain damage or hearing or cognitive impairments (Peretz, 2008), provides this kind of test case. This condition is formally diagnosed by the Montreal Battery of Evaluation of Amusia (MBEA; Peretz et al., 2003). The majority of congenital amusics seem to suffer from a selective pitch perception deficit. Amusics are impaired at detecting pitch changes of less than a semitone (Hyde and Peretz, 2004) and distinguishing between rising and falling pitches (Foxton et al., 2004; Liu et al., 2010). Amusics also seem to be somewhat impaired in timbre perception (Tillmann et al., 2009; Marin et al., 2012) and memory for pitch (e.g., Gosselin et al., 2009; Tillmann et al., 2009; Williamson et al., 2010). Their condition often leads to amusics not enjoying or seeking out music. Subjectively, they report that music seems like noise; thus it is reasonable to suspect a vicious circle here, where amusics tend to listen to music less often, thus gaining less experience with processing it, making listening even less rewarding than it otherwise might have been.
As would be expected from this type of condition, amusics are impaired in their singing abilities as well. Congenital amusics are judged as poor singers and make considerably more pitch errors in singing a well-known song than do matched controls (Dalla Bella et al., 2009; Tremblay-Champoux et al., 2010). They are also well below controls at matching single pitches. However, there are some signs that amusics are not uniformly poor at singing. Certain amusics seem to sing considerably better than would be predicted by their poor perceptual abilities (Dalla Bella et al., 2009; Tremblay-Champoux et al., 2010), and amusics as a whole are aided when directly imitating a model, rather than singing from memory (Tremblay-Champoux et al., 2010). For example, one amusic, ML, is able to sing an array of songs just as well as or better than unimpaired individuals despite her inability to hear errors in songs. These types of findings suggest that conscious perceptual ability may not be a hard limit on amusics' singing abilities. Further evidence for this and its implications will be reviewed later in this paper.
Anatomic and functional MRI studies have shown several differences between congenital amusics and unimpaired individuals. Congenital amusics typically show reduced white matter in the right inferior frontal gyrus, as well as thicker cortices in both that area and the right auditory cortex (Hyde et al., 2007). There is some evidence that there may be differences between amusics and controls in the left analogues of those regions as well (Mandell et al., 2007). In the right hemisphere, these two regions also show reduced functional connectivity (Hyde et al., 2011), and diffusion tensor imaging has shown reduced anatomical connectivity in the right arcuate fasciculus connecting these two regions (Loui et al., 2009). There is some evidence that different regions of the arcuate fasciculus may correlate with pitch perception ability and the discrepancy between perception and production ability (Loui et al., 2009), but this has yet to be corroborated.
Electrophysiological evidence also supports the relationship between pitch perception abilities and frontal-auditory connectivity. Amusics show a normal mismatch negativity (MMN) response (a pre-conscious response to deviations in sound generated in the auditory cortex, Näätänen et al., 2007) to small deviations in pitch which they are unable to consciously detect (Moreau et al., 2009; Peretz et al., 2009). These same deviations, however, generate no P3b response, normally indicative of attentive processing (Moreau et al., 2013). These components, then, seem to be markers of conscious and unconscious pitch perception ability. Taken together, the evidence indicates that frontal regions, auditory regions, and the connection between them regulate normal pitch perception ability, and that there may be anatomically and functionally distinct regions responsible for conscious and unconscious pitch processing. While the regions and processes investigated in these studies are not voice-specific, this type of pitch processing is likely a precursor to voice-specific perception and production abilities, which may also be anatomically and functionally distinct.

IS VOCAL PITCH PERCEPTION SPECIAL?
One possible explanation of amusics' better-than-expected singing abilities is that our ability to perceive vocal pitch (and by extension, the processes underlying this ability) may be different from our ability to perceive the pitch of non-vocal tones, such as instruments or synthesized tones. While it is obvious that we can distinguish between the voice and other instruments, not many studies have examined the uniqueness of vocal musical perception. One clue that there may be fundamental differences between vocal and non-vocal pitch perception comes from the tuning perception literature. It has been noticed that pitch errors seem to be less noticeable when produced by a voice than by other instruments (Seashore, 1938; Sundberg, 1979). For example, Lindgren and Sundberg (as cited in Sundberg, 1979, 1982) showed that musically experienced listeners would accept as in-tune up to 50-70 cents of tuning errors in a recording of a highly trained singer. Another study looked at recordings of 10 professional singers performing the same song, and found that listeners were highly variable in their assessments of the tuning, with out-of-tune notes being accepted as in-tune and well-tuned notes sometimes being judged as out-of-tune (Sundberg et al., 1996). In contrast, studies of acceptable tuning in synthesized tones show a much smaller range of acceptable tuning, with listeners accepting only 10-15 cents of error (Fyk-in van Besouw et al., 2008). This seems to indicate that listeners use different criteria when judging the pitch of the voice vs. other instruments.
To investigate this effect in a well-controlled manner, Hutchins and Peretz (2012a) directly compared tuning judgments of real and synthesized voices. Musicians and non-musicians listened to pairs of tones and judged them as the same or different. Listeners were less likely to notice the differences in tuning when the tone pairs were real voices than when they were synthesized voices; this pattern held across musicians and non-musicians. Non-musicians needed the two tones to be 50 cents apart to reliably notice the difference between two real vocal tones, compared with only 30 cents for synthesized vocal tones. This pattern held in musicians as well. Hutchins et al. (2012) found very similar results for tuning judgments of a trained voice vs. a violin and extended these findings to a melodic context. This difference in acceptable and noticeable tuning between voices and other timbres was termed the Vocal Generosity Effect and may be evidence of special processing of voices in a musical context as it is consistent across different voices and instruments.
Different types of tuning errors between vocal and non-vocal stimuli are also found in production. Trained singers tend to show more tuning errors than trained instrumentalists. Trained singers have a propensity to begin a note flat (Seashore, 1938), and analyses of recordings of professional singers show deviations of more than 40 cents, both sharp and flat (Prame, 1997). In contrast, studies of violin and wind instruments show average deviations less than 20 cents. This difference in production ability comes despite the fact that people have considerable amounts of experience using their voice. In experts, though, there is a tendency for instrumentalists to practice much more than vocalists (as the voice tends to tire out after a couple of hours of practice). In addition, singers typically use considerably more vibrato than do performers on other instruments, such as the violin (Prame, 1997; Mellody and Wakefield, 2000). Vibrato is sometimes thought to be a way of hiding tuning errors (Yoo et al., 1998), although listeners are nevertheless capable of making quite accurate tuning judgments even for tones with very high-amplitude vibrato (Shonle and Horan, 1980). However, unlike the case of perception, many of these differences between voice and instruments can be explained by the unique motoric requirements of vocal production, which are substantially different from those required by any other instrument.
If the voice is processed differently from other instruments, then we should see special neural processes and regions devoted to vocal perception and production. And indeed, there is evidence for just such effects. Belin et al. (2000) showed evidence for subregions of the auditory cortex particularly sensitive to voice perception, called temporal voice areas. These are located bilaterally along the mid superior temporal sulcus, and respond to the voice independent of its linguistic content. Temporal voice areas become less active as the vocal signal is degraded by filtering, indicating a sensitivity to the quality of the input that was reflected in both fMRI and behavioral voice discrimination judgments. Electrophysiological studies also indicate special processing of the voice, with vocal sounds eliciting a fronto-temporal positivity/occipital negativity when compared to environmental sounds or birdsong, peaking around 200 ms post-stimulus (Charest et al., 2009). Another study found a similar frontal positivity of sung tones compared to instrumental sounds, but a bit later, likely due to the more similar acoustic characteristics of these stimuli (Levy et al., 2001), although an MEG study failed to show any differences between similar types of stimuli (Gunji et al., 2003). To the best of our knowledge, no one has yet run an fMRI study comparing activation from perceiving humming to that of perceiving instruments to look for vocal-specific regions involved in music processing. Given the specificity of the motor demands of singing, we would expect to find some such regions; such an experiment would provide an important contribution to the field.

THE RELATIONSHIP BETWEEN PERCEPTION AND PRODUCTION
To truly understand the nature of perception and production abilities, it is helpful to examine their relationship to each other, specifically the link between conscious vocal perception acuity and vocal production accuracy. The evidence reviewed so far shows a moderate, but not overwhelming correlation between perception and production abilities, which suggests a connection, rather than dissociation, between the two. This points more toward a perceptual-based or motor model of perception and production, rather than a dual route model (see Figure 1). However, other lines of evidence tend to argue against the simple and motor models, and dual-route models have been suggested to explain this pattern of findings (Griffiths, 2008).

PERCEPTION-PRODUCTION DISSOCIATIONS IN CONGENITAL AMUSIA
Some of the best evidence arguing for a dual-route model of perception and production comes from congenital amusics. Although most congenital amusics, who have severely impaired pitch perception abilities, are impaired in their singing ability, there is evidence that some amusics nevertheless retain the ability to sing accurately. Dalla Bella et al. (2009) identified three amusics (out of eleven tested) who were unimpaired at singing the correct intervals in a well-known song, including one who was unimpaired even without the aid of the lyrics, a condition in which most amusics fail to complete more than a few notes of the song. Another study tested congenital amusics in a single-pitch matching task and found that, despite amusics' overall inaccurate performance, they showed a consistent, linear relationship between their imitations and the target tones.
These studies hint that amusics may demonstrate better overall singing ability than would be predicted from their abilities on perceptual tasks. Recently, a number of studies have attempted to directly compare perception and production abilities in amusia, to serve as direct tests of vocal perception and production models. Loui et al. (2008) presented three amusics with two-note sequences and asked them to imitate the interval, then to describe whether the second note had been higher or lower than the first. The amusics were impaired at describing the direction of the second note, but they performed similarly to controls at singing an interval that went in the correct direction, although they were still inaccurate at producing an interval of the correct distance.
Some of our recent work also demonstrates a similar discrepancy between pitch perception and production ability in amusics. In one ongoing study, we tested amusics' pitch matching abilities with the slider and a vocal imitation condition (the same as used in Hutchins and Peretz, 2012a, Experiment 1; see above). As expected, amusics as a group performed worse than matched controls at both slider and vocal pitch matching. However, we found two participants who performed at levels comparable to normal participants on the vocal imitation task and, notably, better than their performance on the slider. This is a pattern of results not found among normal participants, who almost invariably show excellent pitch matching performance on the slider, even among non-musicians. This demonstrates that for these two amusics, their vocal pitch matching ability was not constrained by their pitch perception ability, arguing against the perceptual-based model of pitch perception and production.
Another of our studies looked at the pitch shift effect. This effect is an automatic compensatory response to a sudden shift in pitch of the feedback of a sung or spoken utterance. When most participants hear such a shift in their own voice, there is a quick reaction to change the pitch of their voice in the opposite direction. We tested amusics and controls in a pitch shift paradigm, where a pitch shift would occur in the middle of an imitative response. Our results showed that a subset of amusics showed a preserved pitch shift effect, showing normal pitch shift responses to both large (2 semitone) and small (25 cent) shifts. This is strong evidence that amusics do process even small pitch shifts when they are relevant to vocal-motor control. In addition, this study also found evidence of a correlation between the pitch shift effect and pitch matching accuracy (absent of any shift), strengthening the idea that this retained pitch shift response is related to generally preserved vocal-motor control. Together, this presents a strong contrast with amusics' previously documented disabilities in consciously perceiving small pitch changes.
We also see evidence for dissociation of vocal perception and production abilities in amusics' use of pitch in speech. Unlike in tone languages, pitch is non-lexical in most European languages. However, it plays a strong role in prosody and can determine the meaning of certain types of statement/question pairs. Liu et al. (2010) showed that amusics were somewhat poorer than controls at discriminating between statements and questions differing only in pitch contour. However, just as with intervals (Loui et al., 2008), they were better at imitating the pitch contour of these same sentences (although still below the level of matched controls). Hutchins and Peretz (2012b) tested amusics with speech examples containing pitch changes that did not systematically alter the meaning of the sentence. In this experiment, amusics showed an impaired ability to perceive pitch changes between sentences, but no impairment at imitating those same pitch differences, compared to controls. Similarly, in the pitch shift study (Hutchins and Peretz, 2013), we found no difference between pitch shift responses to spoken vs. sung utterances. The fact that pitch perception-production dissociation occurs across music and speech indicates that it is a function of vocal pitch perception and control, rather than a function of music.
Neural evidence also supports the dissociation between pitch perception and production in amusics. Loui et al. (2009) found that pitch perception abilities were correlated with tract density along the superior route of the arcuate fasciculus, whereas the lower route was correlated with the difference between their perception and production abilities. Although this is a somewhat complicated story (all the more so because the association runs in the reverse direction to some other theories of dual-route processing, e.g., Goodale and Milner, 1992; Hickok and Poeppel, 2004), this is the first evidence of direct correlations between these dissociations in amusics and specific neuroanatomical structures.

EVIDENCE FOR PERCEPTION-PRODUCTION DISSOCIATIONS IN NORMAL SUBJECTS
A few studies have shown similar evidence for dissociations between perception and production abilities in an unimpaired population. In one study, Hafke (2008) used a vocal pitch shift paradigm to test trained singers. She found that they showed a normal pitch shift effect, even when the shifts were so small that the participants were unaware that they had occurred at all. This is similar to the pattern of results found among congenital amusics (Hutchins and Peretz, 2013). Vurma (2010) showed a related effect, demonstrating that trained singers' musical interval production abilities are more finely honed than their abilities to perceive the same intervals. Results such as these indicate that the independence of vocal-motor pitch control from conscious pitch perception is not limited to cases such as amusia, which again argues against a perceptual-based model.
The reverse pattern, better conscious perception than production ability, is even more common in normal participants. Hutchins and Peretz (2012a) showed that almost every participant was more capable of matching pitch with an instrument than with their voice, in many cases by over an order of magnitude. This pattern held true for musicians and non-musicians alike and demonstrated that poor vocal pitch accuracy does not lead to poor pitch perception ability, as would be predicted by a motor theory. However, there was a moderate correlation between instrumental and vocal pitch matching abilities, arguing against a dual-route theory. A few other studies have found evidence of such perception-production connections (e.g., Amir et al., 2003; Watts et al., 2005; Moore et al., 2007; Estis et al., 2009, 2011), though others have failed to do so (Bradshaw and McHenry, 2005; Dalla Bella et al., 2007; Pfordresher and Brown, 2007; Moore et al., 2008). The preponderance of evidence shows a weak connection between pitch perception and singing ability, but also indicates that poor pitch perception ability is not necessarily the main cause of poor singing ability.
Similar evidence of this dissociation comes from second language learners. Many late second language learners will gain the ability to comprehend a second language, but will nevertheless be unable to speak it with any degree of fluency. Other second language learners, however, will show the opposite pattern, where their production ability will outstrip their comprehension ability. This latter pattern is typically shown by people who need to perform or deliver information in a second language, such as the singer who performs a Mozart opera without speaking a word of German, whereas the former is more characteristic of an immigrant immersed in a second language who does not have the opportunity or inclination to speak it often. Again, as with pitch in singing, perception and production ability in a second language will broadly correlate, but are nevertheless dissociable abilities.

THE LINKED DUAL REPRESENTATION MODEL
Across these studies, we see two main patterns emerging. First, there is a trend for people who are poor at pitch perception to be worse singers, holding across amusics and unimpaired people. This correlation is not perfect, however, and perception does not determine pitch matching abilities. Second, in many cases, people's production abilities can outstrip their perceptual limitations (or vice versa); this pattern can arise in both perceptually impaired and unimpaired people. To account for these two main patterns we propose a new model of adult human vocal perception and production: the LDR model (Figure 2). Like a dual-route model, the LDR model predicts that vocal information can be processed in two distinct ways. First, it can be encoded as a symbolic representation, such that we gain conscious knowledge of the identifiable features of the vocal stimulus. This process, which is what we normally equate with conscious perception, allows us to determine whether a tone is higher or lower than another, the same or different from another, and allows us to make identification and categorization judgments. Second, vocal information can be encoded as a motoric representation, such that it enables reproduction, imitation, or generative production. The LDR model predicts that vocal information can be directly encoded as a motoric representation, without mediation through a symbolic representation. Just as a point in space can be represented with Cartesian or polar coordinates, each of which is better suited to particular calculations, these symbolic and motor representations support different kinds of behaviors. However, unlike other dual-route models, the LDR model also predicts that the vocal-motor representation can be mediated by the symbolic representation (see Figure 2).
Whereas most dual-route models fail to predict the broad correlation seen between vocal perception and production abilities (e.g., Goodale and Milner, 1992; Griffiths, 2008), this aspect of the model is designed to incorporate this effect. The LDR model predicts that a vocal-motor representation is influenced directly by the low-level perceptual information, but also indirectly by our conscious perception, identification, and category judgments of the information. This is a unidirectional link between the symbolic and vocal-motor representations; the latter cannot directly affect the former. Finally, there is a process of feedback from production back to low-level perception; this process is taken to reflect both auditory feedback from actual productions as well as efferent feedback from actualized motor plans. All of these processes are variable in strength and are influenced by top-down mechanisms, similar to the way in which executive function can moderate transfer effects between speech and music (Moreno and Bidelman, 2013). The relative influence of the symbolic and direct motoric encoding of a tone on its production can be mediated by the task requirements and context. Even the degree to which a tone is initially encoded symbolically or motorically is influenced by the intention of the listener. A listener who is tasked with comparing a note to a template or identifying an interval will preferentially encode it symbolically, whereas the same input would lead to a stronger vocal-motor encoding in the context of an imitation task. These effects can be visualized as a change in the relative sizes of the arrows.
This model, although motivated by pitch, is intended to apply to other aspects of vocal processing, including timbre, loudness, and phonemic processing. There is nothing about symbolic representation or motoric encoding which does not apply equally to other aspects of vocal tones. This generalization is motivated by several factors, including amusics' impairment in speech perception but not production (Hutchins and Peretz, 2012b), and variability in speech perception and production abilities among normal participants in contexts such as second language learning. However, the applicability of this model to speech warrants further study. The model assumes that initial perception of these attributes can vary across individuals; this variance is passed along to subsequent steps and can influence the accuracy of both types of encoding. It also assumes that individuals can vary in skill in transforming between these different representations accurately, independently of their initial perceptual abilities. Together, these variances in different abilities can explain the patterns of individual difference in perception, discrimination, and imitation abilities.
Taken together, this model provides a more complete explanation of the data than previously proposed models by combining some of the features of previous models. For example, similar to other dual-route models that have been proposed, the LDR model is able to predict dissociations between perception and production among congenital amusics. This model posits that congenital amusics are impaired at encoding pitch symbolically and are thus poor at tasks such as categorization or identification of pitch. Because symbolic representations are responsible for our awareness of pitch, congenital amusics also have diminished awareness of pitch, leading to their lower enjoyment of music. However, they retain their ability to encode pitch as a vocal-motor code. Thus, in some cases, they retain their ability to imitate pitches and respond to pitch changes, often just as well as normal participants. However, they are still, on average, below the abilities of normal participants, which is due to the lack of contribution from a symbolic representation of pitch. A similar argument using naturally occurring variances in abilities can also explain why normal individuals will occasionally show a similar dissociation between conscious perception and production abilities.
However, straightforward dual-route models are unable to explain cases where there seems to be a relationship between perception and production. In contrast, the influence of the symbolic representation on the vocal-motor encoding in the LDR model allows it to explain the moderate correlation between pitch perception ability and imitation ability. Furthermore, this route of influence also allows us to explain the broad correspondence between what we produce and what we hear: most people's imitative responses broadly line up with their perceptual judgments (although not in one-to-one correspondence). This processing flow, and the independent variance in these abilities, can explain why individual differences in perception and production abilities co-vary but are not perfectly predictive.
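The qualitative argument above can be illustrated with a toy numerical sketch. It is not an implementation of the LDR model, which is stated only verbally; every quantitative choice here is an assumption made for illustration. Specifically, the sketch assumes independent Gaussian noise at the low-level, symbolic, and direct motoric encoding steps, treats the vocal-motor code as a weighted blend of the direct motoric and symbolic encodings (the hypothetical `link_weight`), and draws listeners' noise levels from arbitrary uniform ranges. Low-level noise is held fixed across listeners so that any correlation arises from the symbolic link alone.

```python
import math
import random


def rms_errors(low_level_sd, symbolic_sd, motor_sd, link_weight):
    """Return (perception_rms, production_rms) in cents for one listener.

    Toy assumption: the symbolic code is the low-level percept plus symbolic
    noise, and the vocal-motor code is a blend of the direct motoric encoding
    (weight 1 - link_weight) and the symbolic encoding (weight link_weight).
    """
    perception = math.sqrt(low_level_sd ** 2 + symbolic_sd ** 2)
    production = math.sqrt(
        low_level_sd ** 2
        + ((1.0 - link_weight) * motor_sd) ** 2
        + (link_weight * symbolic_sd) ** 2
    )
    return perception, production


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def population_correlation(link_weight, n_listeners=500, seed=1):
    """Correlation between perception and production error across listeners."""
    rng = random.Random(seed)
    perception, production = [], []
    for _ in range(n_listeners):
        symbolic_sd = rng.uniform(5.0, 60.0)  # arbitrary illustrative range
        motor_sd = rng.uniform(5.0, 60.0)     # arbitrary illustrative range
        p, q = rms_errors(10.0, symbolic_sd, motor_sd, link_weight)
        perception.append(p)
        production.append(q)
    return pearson(perception, production)


if __name__ == "__main__":
    print("no link  :", round(population_correlation(0.0), 2))
    print("with link:", round(population_correlation(0.5), 2))
    # Hypothetical amusic-like profile: very noisy symbolic encoding,
    # ordinary direct motoric encoding.
    print("amusic-like:", rms_errors(10.0, 150.0, 20.0, 0.3))
    print("control    :", rms_errors(10.0, 20.0, 20.0, 0.3))
```

Under these assumptions, setting the link weight to zero makes the sketch behave like a plain dual-route model, with essentially no correlation between perception and production error, whereas a nonzero weight yields a positive but imperfect correlation. In the amusic-like profile, production error stays well below perception error yet remains above that of the control profile, in line with the pattern described above.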

FUTURE DIRECTIONS
The LDR model makes several predictions, which would be profitable to explore in future research. First, because this model is assumed to apply to all vocal abilities, rather than specifically to the domain of music or speech, this model predicts that vocal perception and production abilities should be domain-independent. We would expect to find that, in general, people who are better at singing should be better at using their voice for speaking and vice versa. It has already been shown that congenital amusics are unimpaired at speech imitation (Hutchins and Peretz, 2012b), and they typically report no general speech production problems. The LDR model predicts that this general phenomenon should carry over to an unimpaired population as well. For example, trained singers should be better at speech imitation, and people skilled at manipulating their voices (such as voice actors) should be better than average at singing. This leads to the interesting prediction that training in singing should also help public speaking ability (above and beyond the benefit of simply becoming more comfortable performing in front of others). Similar relationships should also be found between experts in speech and music perception (such as speech therapists or piano tuners). However, the model also predicts that these abilities are task-dependent: better singers are not necessarily better at perceiving speech sounds. Showing such a pattern would help confirm the domain-generality of this model. A particularly interesting aspect of this prediction arises when considering the case of dyslexia, which is fundamentally an impairment in reading and writing skills. Many instances of dyslexia are assumed to arise from an impairment of phonetic abilities (Bryant, 1978, 1983; Bruck, 1992), which can be considered to be difficulty forming an adequate motor representation of speech sounds (Heilman et al., 1996; Hickok and Poeppel, 2004; D'Ausilio et al., 2009).
The LDR model bears a few similarities to dual-route models of sentence reading, which assume that phonological and whole-word routes are mediated by separate neural pathways (e.g., the Dual Route Cascaded model of Coltheart et al., 1993). Both models explain dyslexics' particular difficulties with reading non-words. However, the LDR model puts the phonological difficulties of dyslexics in the context of a general impairment of vocal-motor encoding. Because of this, we would predict that dyslexics should be worse than non-dyslexics at tasks requiring speech imitation and that they would be particularly influenced by the mediating influence of the symbolic representation of phonemic sounds. Thus, dyslexics should be particularly sensitive to the categorical representations of sounds and less able than non-dyslexics to imitate within-category variations in speech sounds.
Another unique prediction of the LDR model comes from taking the dynamics of the system into account. Although a production response can be constructed directly from the input or mediated by the symbolic encoding of the input, the latter route to motor responses involves more steps and would thus take more time to perform. This explains several interesting facts about the timing of vocal responses. In the pitch shift task, for example, responses occur very rapidly and automatically, typically around 100-200 ms after the pitch shift. However, when asked to consciously control the pitch shift response (by inhibiting it, for example), participants are unable to do so as quickly and take another 200-300 ms to make a conscious adjustment to their automatic shift response (Burnett et al., 1998). Our model posits that the controlled response must come through conscious awareness via a symbolic representation of vocal pitch, whereas the automatic response comes directly from a motor representation of the feedback, creating the different time courses of the two responses.
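The timing argument can be made concrete with a small arithmetic sketch in Python. It uses only the two latency ranges quoted above and assumes, purely for illustration, that the extra time needed for conscious control simply adds to the automatic response latency; the names used are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (earliest_ms, latest_ms)

# Latency ranges quoted in the text, in milliseconds after the pitch shift.
AUTOMATIC_RESPONSE_MS: Interval = (100.0, 200.0)  # direct vocal-motor route
EXTRA_FOR_CONTROL_MS: Interval = (200.0, 300.0)   # added time for conscious adjustment


def add_intervals(a: Interval, b: Interval) -> Interval:
    """Add two latency ranges bound by bound (assumes the delays are additive)."""
    return (a[0] + b[0], a[1] + b[1])


def controlled_response_window() -> Interval:
    """Approximate window for a consciously controlled response."""
    return add_intervals(AUTOMATIC_RESPONSE_MS, EXTRA_FOR_CONTROL_MS)


if __name__ == "__main__":
    earliest, latest = controlled_response_window()
    print(f"controlled response: roughly {earliest:.0f}-{latest:.0f} ms after the shift")
```

Under this additivity assumption, a consciously controlled adjustment would be expected roughly 300-500 ms after the shift, compared with 100-200 ms for the automatic response.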
A similar effect can be found in speech shadowing. Listeners have the ability to shadow a stream of speech (e.g., Chistovich, 1960; Chistovich et al., 1960; Marslen-Wilson, 1973) with a delay as short as 150 ms. While both close and distant shadowing can be quite accurate, and are subject to the same global effects of context (Marslen-Wilson, 1973), those who shadow speech quickly typically report that they were repeating the material "before they understood [it]" (see also Marslen-Wilson, 1985), whereas the distant shadowers reported knowing what the words were before repeating them. Marslen-Wilson (1985) described evidence that, in certain cases, distant shadowers were more affected by the meaning of words than close shadowers, a fact that makes sense if close shadowers were using a direct encoding from vocal input to vocal motor code and distant shadowers made use of the slower route through symbolic representation of words in their shadowing. Interestingly, when close shadowers were forced to consider the meaning of the words they were shadowing, their performance became slower, more like that of distant shadowers (Marslen-Wilson, 1985), a process which can also be explained by the latency of the two analysis paths. Our model would also make the counterintuitive prediction that variation in the speech sounds, such as in different regional accents, would be more likely to be preserved in close shadowers than distant shadowers, due to the normalization process inherent in creating symbolic representations of the stream of speech.
These dynamical properties of the model could be tested directly using absolute pitch possessors. We would predict that in a vocal matching task, requiring a speeded response would make more use of the direct route to a vocal-motor encoding, bypassing the symbolic representation of pitch. However, forcing a delayed response (past the length of the sensory buffer) would lead to greater mediation of the symbolic representation. Because absolute pitch listeners are able to categorize pitches into distinct pitch classes (Takeuchi and Hulse, 1993; Levitin and Rogers, 2005), we would expect that these listeners would be more influenced by their categorizations when making delayed responses, whereas non-absolute pitch listeners should merely show a general decrease in accuracy over longer timescales (as in Estis et al., 2009).
One final avenue worth considering is the connection between the LDR model and the mirror neuron system. This system, which is hypothesized to underlie our abilities to recognize the connections between our actions and those of others (Rizzolatti et al., 2001; Kohler et al., 2002; Rizzolatti and Craighero, 2004), may be of great importance in the ability to imitate others' actions (Brass and Heyes, 2005; Heyes, 2011) and may play a role in speech processing as well (Rizzolatti and Arbib, 1998; although the importance of mirror neurons is not universally agreed upon, see Hickok, 2009, for example). The LDR model's ability to represent an input as a motor code and a symbolic code may be related to the mirror neuron system's purported ability to mediate between these two codes, and it may well be that dissociations between perceptual and production abilities are more likely to be found in people with poorer mirror neuron systems. As both of these models intend to describe the relationship between perception and imitation tasks, further research into their connection (or lack thereof) could be very revealing.

CONCLUSION
There is a great deal of variability in vocal perception and performance abilities and only a modest correlation between the two. Vocal perception and production are highly related to speech and musical processing, and we see evidence of a relationship in abilities between the two domains. However, despite the link between vocal perception and production abilities, there is growing evidence supporting a dissociation between them, both in impaired and unimpaired individuals. The LDR model can explain both these broad trends in the data and makes several new predictions about speech imitation, singing, and response timing. We believe this model will help to interpret a wide variety of experiments and can create a common framework for understanding vocal perception and production.