Edited by: Noel Nguyen, Université d'Aix-Marseille, France
Reviewed by: Laura S. Casasanto, Stony Brook University, USA; Alessandro D'Ausilio, Istituto Italiano di Tecnologia, Italy; Wolfram Ziegler, City Hospital Munich, Germany
*Correspondence: Maëva Garnier, GIPSA-LAB, Speech and Cognition Department, UMR CNRS 5216, Grenoble Université, 11 rue des Mathématiques, BP 46, 38402 Saint Martin d'Hères Cedex, France
This article was submitted to Cognitive Science, a section of the journal Frontiers in Psychology.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Speakers unconsciously tend to mimic their interlocutor's speech during communicative interaction. This study examines the neural correlates of phonetic convergence and deliberate imitation, in order to explore whether imitation of phonetic features, deliberate or unconscious, might reflect a sensory-motor recalibration process. Sixteen participants listened to vowels with pitch varying around the average pitch of their own voice, and then produced the identified vowels, while their speech was recorded and their brain activity was imaged using fMRI. Three degrees and types of imitation were compared (unconscious, deliberate, and inhibited) using a go-nogo paradigm, which enabled the comparison of brain activations during the whole imitation process, its active perception step, and its production. Speakers followed the pitch of voices they were exposed to, even unconsciously, without being instructed to do so. After being informed about this phenomenon, 14 participants were able to inhibit it, at least partially. The results of whole brain and ROI analyses support the view that both deliberate and unconscious imitations are based on similar neural mechanisms and networks, involving regions of the dorsal stream, during both perception and production steps of the imitation process. While no significant difference in brain activation was found between unconscious and deliberate imitations, the degree of imitation appears to be determined by processes occurring during the perception step. Four regions of the dorsal stream (bilateral auditory cortex, bilateral supramarginal gyrus (SMG), and left Wernicke's area) showed an activity that correlated significantly with the degree of imitation during the perception step.
When they interact, speakers tend to imitate their interlocutor's posture (Shockley et al.,
Most of the literature considers this convergence phenomenon as primarily driven by social or communicative motivations. Convergence behaviors may aim at placing the interaction on a “common ground” of sounds and gestures, which is hypothesized to improve communication at the social level and/or at the intelligibility level.
Several theories predict that speakers converge more toward people they like, and from whom they want to be liked in return (Byrne,
Phonetic convergence is also believed to improve communication at the intelligibility level. Producing speech sounds and lexical forms that are more similar to the interlocutor's own repertoire may facilitate phonetic decoding and lexical access. However, no study has yet shown such intelligibility benefits [although, on the other hand, previous studies showed that it is easier to understand an accent after imitating it (Adank et al.,
Several additional observations lead us to partly reconsider the idea that phonetic convergence may be primarily driven by social and communicative motivations. First, phonetic convergence was also observed in non-interactive tasks of speech production (Goldinger,
Rizzolatti and colleagues (e.g., Iacoboni et al.,
Regarding speech, a number of models also support the idea of a direct matching between perception and motor systems (for reviews, see Galantucci et al.,
To sum up, all these observations and models support the idea that humans have shared representations of the motor commands of an action and of its sensory consequences. This functional coupling between perception and action systems, through these shared representations, suggests that perception not only consists of decoding information but also contributes to the automatic and involuntary “update” or “recalibration” of these shared sensory-motor representations.
This brings us to reconsider the mechanisms underlying the phenomenon of phonetic convergence and to explore the hypothesis that automatic and involuntary imitation of phonetic features might reflect sensory-motor learning, taking place as soon as speech is perceived. In favor of this hypothesis is the fact that speakers modify their way of speaking not only during the interaction with their interlocutor, but also after the interaction. This “after-effect” concerns not only speech production, but also speech perception: vowel categorization was found to be modified after repeated exposure to someone else's speech (Sato et al.,
The present study aimed at determining the neural substrates of phonetic convergence and more particularly at: (1) understanding whether phonetic convergence and deliberate imitation of speech are underpinned by the same neurocognitive mechanisms, (2) examining to what extent sensory-motor brain areas are involved during deliberate and unconscious imitations of speech, and (3) better understanding the degree of control and consciousness that one can have over imitation and its inhibition.
On the basis of previous studies, showing the involvement of the dorsal stream in voluntary imitation of speech (Damasio and Damasio,
To explore these hypotheses, we simultaneously recorded speech signals and neural responses of 16 participants, in three tasks of speech imitation with varying degrees of will and consciousness: voluntary imitation, phonetic convergence, intended inhibition of phonetic convergence. In these tasks, we focused on one phonetic feature particularly sensitive to that phenomenon:
Sixteen right-handed, healthy participants (11 males and 5 females, 27 ± 5 years old), all native French speakers, volunteered to participate in the experiment. None of them had any speaking or hearing disorders. None of them had previously received explicit information about phonetic convergence phenomena. The study received ethical approval from the Centre Hospitalier Universitaire de Grenoble, from the Comité de Protection des Personnes pour la Recherche Biomédicale de Grenoble and from the Agence Française de Sécurité Sanitaire des Produits.
The experiment consisted of three tasks of interest and two reference tasks.
T1. Reference task: passive auditory perception of vowels.
T2. Vowel production task. The vowels to be produced were played to the participant through headphones. Participants were expected to partly and unconsciously imitate these stimuli (convergence effect).
T3. Vowel production reference task. The vowels to be produced were displayed on a screen viewed by the participant. Participants were expected to produce vowels according to their own speech representations.
T4. Vowel imitation task. As in T2, vowels were played to the participant through headphones. Participants were asked to produce these vowels and to “imitate the voice heard”.
T5. Vowel production and convergence inhibition task. Participants were briefly informed about the existence of convergence phenomena. As in T2 and T4, vowels were played to the participant through headphones. They were asked to produce these vowels as close as they could to their habitual production, trying not to follow the stimuli.
Participants were simply informed that the experiment would consist of the production and perception of vowels. The first two tasks were presented as such to the participants, so that they would not suspect that the audio stimuli could influence their own production. The voluntary imitation and inhibition tasks were thus left for the end of the experiment. These five tasks were followed by an anatomical brain scan. The whole procedure was completed in one and a half hours.
The audio stimuli used in the conditions T1, T2, T4, and T5 consisted of 27 different vowels, specifically selected for each participant. First, a vowel database with modified pitches was created from 3 French vowels ([e], [oe], [o]) produced by a reference male speaker and a reference female speaker. Pitches were artificially shifted in steps of 5 Hz from 80 to 180 Hz for the male vowels, and from 150 to 350 Hz for the female vowels. This pitch manipulation was performed using the PSOLA module integrated in Praat, which makes it possible to modify pitch without affecting formants or speech rate. Before the experiment, each participant was also recorded while producing a series of vowels, in order to determine his/her habitual pitch (see Table
Participant | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | S11 | S12 | S13 | S14 | S15 | S16 |
Gender | M | F | F | M | M | M | M | F | M | F | M | M | M | M | F | M |
Age (years) | 22 | 30 | 27 | 23 | 25 | 39 | 24 | 30 | 24 | 25 | 38 | 28 | 25 | 27 | 27 | 23 |
Average habitual pitch (Hz) | 129 ± 3 | 219 ± 6 | 198 ± 5 | 127 ± 3 | 111 ± 1 | 130 ± 5 | 131 ± 3 | 298 ± 6 | 131 ± 5 | 239 ± 12 | 125 ± 6 | 135 ± 9 | 126 ± 4 | 146 ± 4 | 237 ± 18 | 121 ± 8 |
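The pitch grid of the stimulus database can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, and the selection rule (the nine grid pitches closest to the habitual pitch, crossed with the three vowels to give 27 stimuli) is an assumption, since the text only states that 27 vowels were selected for each participant.

```python
VOWELS = ("e", "oe", "o")  # the three French vowels named in the text


def pitch_grid(low_hz, high_hz, step_hz=5):
    """Return every target pitch from low_hz to high_hz inclusive."""
    return list(range(low_hz, high_hz + 1, step_hz))


MALE_GRID = pitch_grid(80, 180)     # 80, 85, ..., 180 Hz
FEMALE_GRID = pitch_grid(150, 350)  # 150, 155, ..., 350 Hz


def select_stimuli(habitual_hz, grid, n_levels=9):
    """ASSUMPTION: keep the n_levels grid pitches closest to the habitual pitch.

    The actual selection rule used in the study is not given in the text.
    """
    by_distance = sorted(grid, key=lambda hz: abs(hz - habitual_hz))
    closest = sorted(by_distance[:n_levels])
    # Cross the selected pitch levels with the three vowels
    return [(vowel, hz) for vowel in VOWELS for hz in closest]


# Example with a habitual pitch of 129 Hz (first value in the table above)
stimuli = select_stimuli(129, MALE_GRID)
print(len(stimuli), stimuli[0], stimuli[-1])
```

With nine pitch levels and three vowels, the sketch yields 27 (vowel, pitch) pairs centered on the participant's habitual pitch, which matches the stimulus count reported in the text.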
The two reference tasks (T1 and T3) consisted of 54 trials. The 27 audio or visual stimuli described above were presented in a pseudo-random order and in alternation with 27 “void” stimuli (i.e., no sound in T1 or no displayed vowel in T3). These “void” stimuli were used as a baseline for the comparison of neural activations. Each trial lasted 10 s. Stimuli were played (T1) or displayed (T3) during the first 500 ms. One second later, a fixation cross was displayed for 500 ms, which indicated when the participant had to produce the vowel in the T3 condition.
The three tasks of interest (T2, T4, and T5) consisted of 81 trials. In these tasks, the 27 audio stimuli were presented twice, in a pseudo-random order and in alternation with the 27 “void” stimuli (i.e., no sound). Concretely, one third of the time, audio stimuli were followed by a green cross, indicating to the participant that he/she should produce the vowel (“Go”). Another third of the time, audio stimuli were followed by a red cross, meaning that the participant should remain quiet (“No Go”). The last third of the time, no stimulus was played and a red cross was displayed (Baseline). This go/no-go paradigm makes it possible to compare the neural activations in a double task of speech production and perception with those in a task of “active” perception, i.e., when participants perceive vowels with the goal of producing them afterwards, but ultimately without carrying out any motor action.
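The trial structure of a task of interest can be illustrated with a short sketch. It is a hypothetical reconstruction: the dictionary fields and the function name are invented for illustration, it assumes that each stimulus occurs once as a Go trial and once as a NoGo trial, and a seeded shuffle stands in for the pseudo-random ordering, whose exact constraints are not described in the text.

```python
import random
from collections import Counter


def build_trials(n_stimuli=27, seed=0):
    """Build the 81 trials of one task of interest (T2, T4, or T5)."""
    trials = []
    for stim in range(n_stimuli):
        # ASSUMPTION: each audio stimulus is used once per cue type
        trials.append({"stimulus": stim, "cue": "green", "kind": "go"})
        trials.append({"stimulus": stim, "cue": "red", "kind": "nogo"})
        # Silent baseline trial: no sound, red cross
        trials.append({"stimulus": None, "cue": "red", "kind": "baseline"})
    # ASSUMPTION: a plain seeded shuffle replaces the study's ordering rules
    random.Random(seed).shuffle(trials)
    return trials


trials = build_trials()
counts = Counter(trial["kind"] for trial in trials)
print(len(trials), dict(counts))
```

In this layout, NoGo trials isolate the active perception step, and comparing Go with NoGo trials isolates the production step, as in the contrasts described in the text.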
Visual instructions were displayed on a screen located behind the participant, using a video projector and the Presentation software (Neurobehavioral Systems, Albany, USA). Participants could read them by reflection in a mirror placed above their eyes. Audio stimuli were played through MRI-compatible headphones. The audio level was set high enough for participants to hear the stimuli correctly despite the earplugs they wore for protection from the scanner noise. Vowel productions were recorded with a microphone placed 1 m away from the participant's mouth.
Anatomic and functional images were acquired with a whole body 3T scanner (Bruker MedSpec S300) equipped with a transmit/receive quadrature volume head coil. The fMRI experiment consisted of five functional runs and one anatomical run. Functional images were obtained using a T2*-weighted, echoplanar imaging (EPI) sequence with whole-brain coverage (
The acoustic analyses were performed using Praat software. A semi-automatic procedure was used to segment vowels on the basis of intensity and duration criteria. Hesitations and mispronunciations were removed from the analyses.
The stimuli were specific to each participant, with
Data were analyzed with the software SPM5 (Statistical Parametric Mapping, Wellcome Trust Centre for Neuroimaging, London, UK). The fMRI data of one participant (S3) were corrupted by artifacts from a metallic pin and could therefore not be included in the analysis. The results reported in the fMRI data section of this article thus concern the 15 remaining participants.
For each participant, functional images were realigned, normalized in the reference space of the Montreal Neurological Institute (MNI) and smoothed with a 6-mm-wide Gaussian low-pass filter.
The hemodynamic responses corresponding to the experimental conditions were then estimated with a general linear model, including the characterization of a unique impulse response for each functional scan and taking body movements into account through regressors of non-interest.
Eight T-contrasts were tested (see Table
Using SPM, a flexible factorial group analysis was conducted from these individual contrasts, corresponding to a One-Way repeated measures ANOVA (one factor TASK with 8 levels).
Eight T-contrasts were tested in order to identify brain regions specifically involved in each task of vowel perception and/or production, compared to a resting condition. Two conjunctions were calculated from the first four contrasts examining neural correlates of vowel perception, as well as from the four following contrasts examining neural correlates of vowel production. Two F-contrasts tested the main effect between the vowel perception conditions (1,2,3,4) and between the vowel production conditions (5,6,7,8).
For these contrasts, statistical significance was considered for
The 3D coordinates of the center of gravity of the activated clusters, normalized in the MNI reference space, were assigned to functional areas of the brain using the SPM Anatomy toolbox, on the basis of cytoarchitectonic probabilities. When not assigned in the SPM Anatomy toolbox, brain regions were labeled using Talairach Daemon (Lancaster et al.,
This study hypothesizes that the dorsal stream is involved in speech imitation and phonetic convergence. Particular attention was therefore paid to neural activations in regions of the dorsal stream. With the SPM Anatomy toolbox, 7 ROIs were defined in both hemispheres, from the cytoarchitectonic probability of
– Region TE (including TE1.0, TE1.1, and TE1.2)
– Region TE3 (Wernicke's area, including the Spt area)
– Supramarginal Gyrus (IPC PF, PFm, PFcm)
– Region BA6 (premotor cortex and supplementary motor area)
– Regions BA44 and BA45 (Broca's area)
– The Insula
Using Marsbar, eight T-contrasts (similar to Table
Using SPSS software, a One-Way repeated measures ANOVA was then conducted on these individual differences of neural activation observed in each ROI. Statistical significance was considered for
Finally, we performed a Pearson correlation analysis to determine the correlation between the average activation of each ROI, for each participant, in the deliberate and unconscious imitation tasks, and their demonstrated degree of imitation (defined from the behavioral data, as the slope coefficient between their produced
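The Pearson coefficient used in this correlation analysis can be sketched in a few lines. The function below is a generic textbook implementation, not the authors' SPSS procedure, and the per-participant values in the example are invented purely for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL values: mean ROI activation and imitation slope,
# one pair per participant (not data from the study)
roi_activation = [0.2, 0.5, 0.4, 0.9, 0.7]
imitation_slope = [0.11, 0.29, 0.27, 0.88, 0.57]
print(round(pearson_r(roi_activation, imitation_slope), 3))
```

A coefficient close to 1 would indicate that participants with stronger activation in the ROI also imitated more; assessing significance additionally requires a test on the coefficient, which is left out of this sketch.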
Figure
– participants were able to imitate almost perfectly the pitch of the audio stimuli (T4; slope coefficient of 0.87,
– participants unconsciously followed the pitch of the audio stimuli in the production task when vowels were presented auditorily (T2;
– participants were able to inhibit almost completely this convergence effect when informed about its existence (T5; slope coefficient of 0.08,
At the individual level, however, varying behaviors were observed. Figure
Participant | Imitation: slope | r | p | Convergence: slope | r | p | Inhibition: slope | r | p |
S1 | 0.90 | 0.96 | <0.001 | 0.88 | 0.98 | <0.001 | 0.15 | 0.52 | =0.005 |
S2 | 0.98 | 0.98 | <0.001 | 0.29 | 0.59 | =0.001 | 0.16 | 0.48 | =0.012 |
S3 | 1.00 | 0.98 | <0.001 | 0.83 | 0.77 | <0.001 | 0.10 | 0.22 | =0.26 |
S4 | 0.87 | 0.96 | <0.001 | 0.78 | 0.97 | <0.001 | −0.09 | −0.21 | =0.285 |
S5 | 0.87 | 0.95 | <0.001 | 0.19 | 0.36 | =0.005 | −0.11 | −0.36 | =0.069 |
S6 | 0.62 | 0.93 | <0.001 | 0.11 | 0.38 | =0.005 | 0.03 | 0.19 | =0.336 |
S7 | 0.65 | 0.89 | <0.001 | 0.27 | 0.75 | <0.001 | 0.06 | 0.32 | =0.099 |
S8 | 0.93 | 0.96 | <0.001 | 0.69 | 0.86 | <0.001 | 0.51 | 0.70 | <0.001 |
S9 | 0.89 | 0.99 | <0.001 | 0.57 | 0.88 | <0.001 | 0.22 | 0.31 | =0.112 |
S10 | 0.98 | 0.99 | <0.001 | 0.98 | 0.99 | <0.001 | 0.65 | 0.85 | <0.001 |
S11 | 0.78 | 0.93 | <0.001 | 0.45 | 0.74 | <0.001 | 0.01 | 0.02 | =0.952 |
S12 | 0.94 | 0.96 | <0.001 | 0.89 | 0.94 | <0.001 | −0.86 | −0.40 | =0.039 |
S13 | 0.87 | 0.99 | <0.001 | 0.88 | 0.99 | <0.001 | 0.29 | 0.61 | <0.001 |
S14 | 0.78 | 0.90 | <0.001 | 0.39 | 0.80 | <0.001 | 0.10 | 0.35 | =0.078 |
S15 | 1.00 | 0.96 | <0.001 | 0.44 | 0.77 | <0.001 | −0.20 | −0.50 | =0.009 |
S16 | 0.75 | 0.93 | <0.001 | 0.46 | 0.66 | <0.001 | 0.18 | 0.35 | =0.076 |
Some participants demonstrated better imitation abilities than others but all of them were able to follow variations of pitch (slope coefficients from 0.62 to 1.00).
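The slope coefficient used here as the degree of imitation can be illustrated with an ordinary least-squares fit. This sketch assumes that produced pitch is regressed on stimulus pitch; the function name and the example values are hypothetical and are not taken from the study's data.

```python
def imitation_slope(stimulus_hz, produced_hz):
    """Least-squares slope of produced pitch regressed on stimulus pitch.

    A slope near 1 means the speaker follows the stimulus pitch closely;
    a slope near 0 means the produced pitch ignores the stimulus.
    """
    if len(stimulus_hz) != len(produced_hz) or len(stimulus_hz) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_s = sum(stimulus_hz) / len(stimulus_hz)
    mean_p = sum(produced_hz) / len(produced_hz)
    sxx = sum((s - mean_s) ** 2 for s in stimulus_hz)
    if sxx == 0:
        raise ValueError("stimulus pitch must vary")
    sxy = sum((s - mean_s) * (p - mean_p)
              for s, p in zip(stimulus_hz, produced_hz))
    return sxy / sxx


# HYPOTHETICAL speaker who follows half of each deviation from 130 Hz
stimulus = [110, 120, 130, 140, 150]
produced = [130 + 0.5 * (s - 130) for s in stimulus]
print(imitation_slope(stimulus, produced))
```

Under this definition, a perfect imitator yields a slope of 1 and a speaker who keeps a constant habitual pitch yields a slope of 0, which is how the values in the table above can be read.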
Five speakers (see bottom panel of Figure
Eight speakers (see top panel of Figure
Great inter-speaker variability was observed in the inhibition task too. Ten out of 16 speakers (S3, S4, S5, S6, S7, S9, S11, S12, S14, S16) did not show a significant correlation between their produced
Three speakers (S1, S2, S13) showed a significant and positive correlation between their produced
Two speakers (S8 and S10) also showed a significant correlation between their produced
Finally, one of the speakers (S15) even showed a significant but negative correlation between her produced
The classical neural networks for speech perception and production were observed in the reference tasks of passive vowel perception and vowel production from visual instructions. Surface rendering of brain activity observed in these reference tasks is displayed in the top left panels of (Figures
Vowel production from visual instructions induced bilateral activations of the premotor, primary motor, and sensorimotor cortices, and of the SMA. Bilateral activations were also observed in the IFG (pars opercularis and triangularis) and in the STG, extending to the rolandic operculum and the SMG. Additional activations were displayed bilaterally in superior and posterior parts of the parietal cortex, including the precuneus, the associative cortex, and the angular gyrus. Further activity was found in the left inferior temporal gyrus, and bilaterally in the cerebellum, the cingulate cortex (anterior and middle part in the left hemisphere, middle part only in the right hemisphere), and the visual cortex.
This shared perception network involves bilateral activation of the STG, extending to the rolandic operculum and to the left Insula. Frontal regions participate in this network in the left hemisphere only, in particular Broca's area (pars opercularis and triangularis of the IFG), and the frontal region BA8. It also involves inferior parietal regions in both hemispheres: the SMG, extending to the rolandic operculum on the left side, and the angular gyrus on the right side. Further shared activations were found in the limbic system (right thalamus and left posterior cingulate cortex). A significant activation of the dorsolateral prefrontal region BA46 was also observed during the perception step of deliberate and inhibited imitations (NoGo trials, see Figure
Similarly, the typical network for speech production was also observed in the three other speech production tasks with deliberate, unconscious, or inhibited imitations (t-contrast between the Go and the NoGo trials). Surface rendering of the conjunction of the brain activity observed in all the production tasks is displayed in the right panel of Figure
This shared production network involves bilateral activations in the premotor, primary motor and sensorimotor cortices, extending to the IFG (pars triangularis) and to the SMA. It also involves the primary auditory cortex in the STG, extending to the rolandic operculum, and to the Insula. Further shared activations were found in posterior parietal regions, including the precuneus and the associative cortex, as well as in the limbic system (anterior cingulate gyrus, thalamus), the cerebellum, the putamen, the red nucleus, and the right basal ganglia (substantia nigra).
No brain region was found to be significantly modulated in activation between the four speech production tasks either.
The ROI analysis showed a significant modulation of brain activity in the auditory cortex and Wernicke's area, bilaterally, between the four vowel perception tasks with varying degrees and types of imitation. No tendency was observed toward increasing or decreasing activation with the degree of imitation.
For the production tasks, the ROI analysis again highlighted the right auditory cortex as a brain region of the dorsal stream whose activity is significantly modulated between the four vowel production tasks with varying degrees and types of imitation. The left Insula and the right SMG were two additional regions of the dorsal stream that demonstrated a significant modulation of their neural activation. No tendency was found toward increasing or decreasing activation of these regions with the degree of imitation.
Again, in the two active perception tasks, preparing a deliberate or unconscious imitation, brain activity in both left and right auditory cortices was found to correlate significantly with the degree of subsequent imitation. So did activity in left Wernicke's area and in the SMG, bilaterally.
On the other hand, no brain region showed activation that correlated significantly with the degree of imitation, either for the production step alone of the deliberate and unconscious imitation tasks (Go-NoGo) or for the entire process of these tasks (Go).
In line with previous studies on phonetic convergence, our results show that speakers follow and unconsciously imitate the phonetic features (here
As in previous studies, we also observed a great inter-individual variability in the degree of deliberate imitation, with slope coefficients observed ranging from 0.62 to 1.00. Such a result is consistent with Pfordresher and Brown (
At the neural level, the typical networks for speech production and perception were observed, in agreement with previous studies on vowel production and perception (Özdemir et al.,
The first questions addressed in this study were to determine whether phonetic convergence relied on the same mechanism and neural network as deliberate imitation, and to what extent brain regions related to sensori-motor integration were involved in that potentially shared network. Neither the whole brain analysis nor the ROI analysis showed any significant modulation in brain activation between the two tasks of deliberate and unconscious imitations (see Tables
Another question concerned the steps of the perception-production process at which the imitative process occurs: is imitation included in the perception process, in the production one, or in both? The ROI analysis revealed that the activation in several regions of the dorsal stream was significantly modulated between vowel production and perception reference tasks, and both the perception and production steps of unconscious imitation: in the auditory cortex and Wernicke's area, bilaterally, for the perception step; in the right auditory cortex, the supramarginal gyrus, and the left insula for the production step. In the case of deliberate imitation, significant changes in activation were also found in these ROIs, but only for the perception step, as compared to the vowel perception reference task. Finally, ROIs in the dorsal stream whose activation correlated with the degree of imitation were found for the perception step of imitative processes only. No such region was found for the production step, or for the whole imitative processes. These observations support the idea that (1) the imitation process requires both perception and production steps of the sensori-motor loop, and that (2) the degree of imitation is determined by processes occurring during the perception step. This second point supports the hypothesis that perception intrinsically includes an automatic update of sensori-motor representations from the speech inputs.
A last question dealt with the degree of control and consciousness that one can have over phonetic convergence and its inhibition. The behavioral results of this study showed that phonetic convergence can be inhibited to some extent. Great inter-speaker variability was observed: some speakers were able to inhibit this unconscious imitation completely (or even with an overcompensation), others only partially, while some speakers could not inhibit it at all. At the neural level, no additional region or network, outside the typical networks of speech production and perception, appeared to be specifically involved in imitation inhibition. It is worth noting that a significant activation was observed during the perception step of deliberate and inhibited imitation in the dorsolateral prefrontal region BA46, an area commonly associated with attention, resource allocation, and verbal self-monitoring (Indefrey and Levelt,
The different behavioral and neural observations of this study support the hypothesis that phonetic convergence may not only be driven by social or communicative motivations, but that it may primarily be the consequence of an automatic process of sensorimotor recalibration. This has some important implications for speech production and perception, and for understanding how internal models and phonetic representations are learnt and updated. Indeed, many previous studies have shown how speakers modify their speech production to compensate for perturbations of their auditory or proprioceptive feedback (Abbs and Gracco,
This possible involvement of sensorimotor recalibration processes also has implications for the communicative and social aspects of phonetic convergence. Imitation may facilitate communication not only by improving our likeability or our intelligibility for the interlocutor, but also by helping
From these findings, the involvement of sensori-motor recalibration processes in phonetic convergence, and its potential explanation of higher-level communicative and social effects (inter-individual differences and phonetic talent, i.e., the ability to learn a second language, empathy and likability, intelligibility enhancement, …) remain to be investigated in future studies.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This study was supported by research grants from the Centre National de la Recherche Scientifique (CNRS) and the Agence Nationale de la Recherche (ANR SPIM—Imitation in speech: from sensori-motor integration to the dynamics of conversational interaction).