Gesture Facilitates the Syntactic Analysis of Speech

Holle, Henning; Obermeier, Christian; Schmidt-Kassow, Maren; Friederici, Angela Dorkas; Ward, Jamie; Gunter, Thomas C

doi:10.3389/fpsyg.2012.00074

ORIGINAL RESEARCH article

Front. Psychol., 19 March 2012

Sec. Psychology of Language

Volume 3 - 2012 | https://doi.org/10.3389/fpsyg.2012.00074


Henning Holle¹*

Christian Obermeier²

Maren Schmidt-Kassow³

Angela D. Friederici⁴

Jamie Ward⁵

Thomas C. Gunter⁴*

¹ Department of Psychology, University of Hull, Hull, UK
² Minerva Research Group “Neurocognition of Rhythm in Communication”, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
³ Institute of Medical Psychology, Johann Wolfgang Goethe University, Frankfurt am Main, Germany
⁴ Neuropsychology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
⁵ School of Psychology and Sackler Centre for Consciousness Science, University of Sussex, Brighton, UK

Recent research suggests that the brain routinely binds together information from gesture and speech. However, most of this research focused on the integration of representational gestures with the semantic content of speech. Much less is known about how other aspects of gesture, such as emphasis, influence the interpretation of the syntactic relations in a spoken message. Here, we investigated whether beat gestures alter which syntactic structure is assigned to ambiguous spoken German sentences. The P600 component of the Event Related Brain Potential indicated that the more complex syntactic structure is easier to process when the speaker emphasizes the subject of a sentence with a beat. Thus, a simple flick of the hand can change our interpretation of who has been doing what to whom in a spoken sentence. We conclude that gestures and speech are integrated systems. Unlike previous studies, which have shown that the brain effortlessly integrates semantic information from gesture and speech, our study is the first to demonstrate that this integration also occurs for syntactic information. Moreover, the effect appears to be gesture-specific and was not found for other stimuli that draw attention to certain parts of speech, including prosodic emphasis, or a moving visual stimulus with the same trajectory as the gesture. This suggests that only visual emphasis produced with a communicative intention in mind (that is, beat gestures) influences language comprehension, but not a simple visual movement lacking such an intention.

Introduction

When we talk to one another, communication does not only take place in the auditory domain, but also simultaneously in the visual domain. Conversational gestures, which are movements of the hands that co-occur with speech, are one important example of such cross-modal communication. The fact that conversational gestures are reliably elicited during spontaneous speech (even when talking on the phone, Bavelas et al., 2008) suggests that gestures serve an important communicative function that cannot be completely achieved by speech alone. So why do we gesture when we speak? Research addressing this question has looked for functions that gesture may have either in the speaker (where they may facilitate the act of speaking, Krauss, 1998) or in the listener (where gesture may convey additional information not found in speech). With respect to the latter, it has been found that – contrary to some initial negative findings (Krauss et al., 1991, 1995) – listeners are sensitive to the additional information provided by gesture. For instance, several groups have found that semantically incongruent gesture–speech pairings interfere with language comprehension, and we have reported evidence that gestures can disambiguate lexically ambiguous words (Holle and Gunter, 2007; Obermeier et al., 2011). Thus, the semantic information provided by gestures interacts with the semantic information of speech, and recent brain imaging studies have indicated that the left inferior frontal gyrus (Willems et al., 2007, 2009) and the left posterior temporal lobe (Holle et al., 2008, 2010) are crucially involved in this interaction.

However, these insights relate mainly to representational gestures (e.g., gesturing writing with a pencil while saying write). In comparison, the relationship between speech and non-representational gestures (of which beat gestures are an important sub-category) is much less well understood. Beats are short, rhythmic hand movements that match the cadence of speech (McNeill, 1992; Hubbard et al., 2009). These gestures accompany speech pervasively and appear not to be under intentional control (Alibali et al., 2001). With respect to their potential communicative function, it has been suggested that they accent or emphasize portions of their co-expressive speech (Efron, 1941/1972). But how does such gestural emphasis influence our interpretation of a speaker’s utterance? And if gestural emphasis does have an effect on language comprehension, would this effect then be specific to gesture, or would other forms of emphasis (e.g., pitch accents, visual movement) have the same impact? We hypothesized that beats might help a listener to figure out who is doing what to whom in sentences that are temporarily ambiguous with respect to their syntactic structure.

The Present Study

Our paradigm exploited the quite flexible word order of German, which allows expression of the same meaning (e.g., “The dog bites the mailman”) in either a subject-initial word order (“[Der Hund]_Subj beisst [den Postboten]_Obj”) or an object-initial word order (“[Den Postboten]_Obj beisst [der Hund]_Subj”). Based on this phenomenon of word order variation, we created sentences that were temporarily ambiguous with respect to their subject and object role. Consider the example sentence provided in Figure 1. Up to the sentence-final word, there are two possible interpretations (at least in German). First, it could be that the woman has greeted the men (assumed argument order: Subject–Object–Verb, SOV). Alternatively, it could also be that the men have greeted the woman (argument order: OSV). Only at the sentence-final word does it become clear who has actually been doing what to whom. Note that SOV and OSV structures are not treated as equally probable in German¹. Instead, there is a strong preference to analyze an ambiguous initial noun phrase as the subject of the sentence (Haupt et al., 2008). Therefore, a disambiguation toward the OSV structure is somewhat unexpected and elicits additional processing costs which can be observed at the disambiguating word (a) on a behavioral level as increased reading times (Schriefers et al., 1995; Bader and Meng, 1999) and (b) on an electrophysiological level as an increased P600 (Osterhout and Holcomb, 1992; Friederici et al., 1993; Hagoort et al., 1993; Knoeferle et al., 2008). The P600 is a positive-going deflection of the event related brain potential (ERP) peaking around 600 ms after the onset of the critical word. It is reliably elicited whenever an ambiguous input is disambiguated toward the syntactically more complex alternative (for review, see Haupt et al., 2008). Although there are several suggestions about the functional significance of the P600, one theme these proposals have in common is that of reanalysis, be it in terms of a specific syntactic reanalysis (Friederici, 2002) or a more general reanalysis, including perceptual errors (van de Meerendonk et al., 2010).

FIGURE 1

Figure 1. Materials and Results. Original German sentences, as uttered in our video stimuli, as well as literal English translation in italics. Full German SOV sentence: Peter sagt, dass die Frau die Männer gegrüßt hat. English gloss: Peter says that the woman has greeted the men. Full German OSV sentence: Peter sagt, dass die Frau die Männer gegrüßt haben. English gloss: Peter says that the men have greeted the woman. ERPs were time-locked to the critical sentence-final words (underlined). ERPs for the preferred Subject–Object–Verb order (SOV) are shown in blue and the ERPs for the less preferred Object–Subject–Verb order (OSV) are shown in yellow. Grand-average ERPs (Experiment 1: n = 24; Experiment 2: n = 19; Experiment 3: n = 23) were averaged across four regions of interest: Anterior-Left (AL), Anterior-Right (AR), Posterior-Left (PL), and Posterior-Right (PR). Text highlighted by red bars indicates those portions of speech emphasized either by a visual beat gesture (Experiment 1), a beat-induced pitch accent (Experiment 2) or a moving point (Experiment 3). Bar graphs show the amplitude of the P600 effect (±SEM). ROIs in which the P600 effect is significant at p < 0.05 are marked by an asterisk (*).

The remainder of the manuscript describes three EEG experiments that investigate to what extent different types of emphasis cues (Experiment 1: visual beats; Experiment 2: the auditory pitch accents normally associated with visual beats; Experiment 3: the visual movement associated with beats) can help to reduce or even abolish the P600 effect usually associated with a disambiguation toward the more complex OSV structure. Such a reduction would imply that emphasis cues may bias toward (or prevent a de-selection of) an alternative syntactic structure.

Experiment 1: Seeing Beats

Beat gestures are inherently multimodal. In addition to the obvious visual component, making a beat also affects the speech that it accompanies (increased pitch, increased duration, see Krahmer and Swerts, 2007). In Experiment 1, we were specifically interested in the impact of the visual component on syntactic disambiguation, while controlling for auditory differences. Therefore, we assembled spoken sentences taken from a non-beating speaker that were phonologically identical until the sentence-final word (see below) and combined them with a video of a speaker either producing no gesture, a beat emphasizing the first noun phrase (NP1), or a beat emphasizing the second noun phrase (NP2). In this way, we ensured that all observed ERP effects can only be due to the presence of a visual beat, and not to the associated auditory pitch accents.

Methods

Participants

Thirty-five German-speaking students participated in Experiment 1 after giving written informed consent following the guidelines of the Ethics committee of the University of Leipzig. Two participants had to be excluded due to excessive EEG artifacts, and nine because they had an overall error rate on the behavioral task exceeding 40%. The remaining 24 participants (12 female, mean 25 years of age, range 21–29) were right-handed (mean laterality coefficient 95.7, Oldfield, 1971). All participants had normal or corrected-to-normal vision and none reported any known hearing deficit.

Stimuli

All sentences consisted of a matrix clause, followed by a complement clause in perfect tense. The noun in the matrix clause was always a proper name. All verbs in the complement clause were transitive verbs requiring a direct accusative object. Of the two noun phrases in the complement clause, one NP was always a feminine singular, whereas the other NP was always plural (masculine, feminine, or neuter), making the two NPs case-ambiguous (either nominative or accusative case). This case ambiguity made all sentences temporarily ambiguous with respect to their syntactic structure. Sentences were only disambiguated at the sentence-final auxiliary verb, either toward a preferred SOV or a non-preferred OSV structure (for a stimulus example, see Figure 1).

The experimental sentences were created out of a set of 240 noun phrase–noun phrase–verb combinations (NP–NP–V). All verb participles had the form of ge + verb stem, which is a very common form of verb participle generation in German. No participle was repeated within the set of NP–NP–V combinations.

Recording and splicing. First, a recording list was assembled, containing only the subject-initial versions of the 240 NP–NP–V sets in two different variations: (1) A subject-initial structure, where the singular NP was followed by the plural NP (e.g., Peter knows, that the woman the men greeted has) and (2) a reversed subject-initial structure, where the plural NP was followed by the singular NP (e.g., Peter knows, that the men the woman greeted have). This was possible because the NPs of all sentences were carefully selected to contain as little semantic bias as possible, which allowed us to create semantically plausible reversed versions of each sentence.

Next, a professional speaker produced all stimuli from the recording list with a natural prosody without producing any beat gestures. We instructed the speaker to produce the stimuli with a broad focus, i.e., she was told to not highlight a particular word in the sentence. Each stimulus was produced at least twice. The acoustic signal of each sentence was visually inspected and acoustically tested to determine the time of six events: (1) onset of the determiner of NP1, (2) primary stress of NP1, (3) onset of the determiner of NP2, (4) primary stress of NP2, (5) onset of the participle, and (6) onset of the sentence-final auxiliary.

Using this timing information, the audio recordings described in the previous paragraph were then recombined in a cross-splicing procedure detailed in Figure 2. In short, we extracted segments of the recorded sentences on the basis of the available timing information, and recombined these segments. The outcome of the splicing procedure was four different structures of each experimental item: (1) an SOV structure, where the first NP was singular and the second NP plural (SOV: NP1_sg–NP2_pl), (2) an SOV structure, where the order of the NPs was reversed (SOV: NP1_pl–NP2_sg), (3) an OSV structure, in which the first NP was singular and the second NP plural (OSV: NP1_sg–NP2_pl), and (4) an OSV structure with a reversed order of the two NPs (OSV: NP1_pl–NP2_sg, see Figure 2). Finally, all cross-spliced sentences were normalized to the same average sound pressure level using PRAAT (Boersma and Weenink, 2010). These stimuli formed the auditory component of our audiovisual stimuli for Experiment 1.

FIGURE 2

Figure 2. Outline of the cross-splicing procedure for Experiment 1. Literal translation of the sentence on the left (portions used for cross-splicing are highlighted in blue): Peter says, that the woman the men greeted has. Literal translation of sentence on the right (highlighted in orange): Peter says, that the men the woman greeted have.

Gesture-recording and audiovisual synchronization. We re-invited the actress to record the beat gestures used for Experiment 1. She was placed in a comfortable chair with armrests and uttered a subset of the experimental sentences, either with no beat gesture, a left hand beat gesture accentuating the first NP, or a left hand beat gesture accentuating the second NP of the sentence. We also recorded versions where the right hand accentuated either the first or the second NP. Our gestures always started and ended in the central gesture space. They consisted of a rapid lowering of the forearm, a quick wrist movement at the apex, and a return to the resting position in the central space. Thus, although some degree of iconicity is present (according to McNeill, 1992; e.g., more than two movement phases), they have many key characteristics of a beat gesture (beginning and end in central gesture space, wrist movement at the apex of the gesture). To minimize the influence of facial cues, the face of the actress was covered with a nylon stocking.

The video recordings were then combined with the previously created cross-spliced audio recordings. In the beat videos, the apex of the beat movement was always synchronized with the primary stress of the respective NP, resulting in a natural synchronization of gesture and speech (Levelt et al., 1985; McNeill, 1992). Each video contained 600 ms of silence before the onset of the sentence, where the speaker was also not moving. Example videos are provided as Supplementary Material.

As has been mentioned previously, the experimental stimuli were based on a set of 240 NP–NP–V combinations. Each item existed in 20 different versions, which differed with respect to their Structure (SOV or OSV), Number_Order (singular–plural or plural–singular), Emphasis (no beat, beat on NP1, beat on NP2), and Side (left hand beat, right hand beat). Thus, the total stimulus set for Experiment 1 consisted of 4800 video clips. These video clips were divided into 24 experimental lists, ensuring that each participant saw only one version of an item. Thus, all item-specific effects were counterbalanced across participants.
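The stimulus counts above can be checked with a short enumeration. This is only a sketch: it assumes that Side applies solely to clips that contain a beat (a clip without a beat has no hand side), which is one reading of the design that reproduces the reported 20 versions per item. The constant and function names are illustrative and do not come from the original materials.

```python
from itertools import product

STRUCTURES = ("SOV", "OSV")
NUMBER_ORDERS = ("sg-pl", "pl-sg")

# Assumption: Emphasis and Side are not fully crossed, because a clip
# without a beat cannot have a left-hand or right-hand variant.
EMPHASIS_SIDE = [("no beat", None)] + [
    (phrase, side) for phrase, side in product(("NP1", "NP2"), ("left", "right"))
]


def versions_per_item():
    """Return every (structure, number_order, emphasis, side) cell of the design."""
    return [
        (structure, order, emphasis, side)
        for structure, order, (emphasis, side) in product(
            STRUCTURES, NUMBER_ORDERS, EMPHASIS_SIDE
        )
    ]


N_ITEMS = 240  # NP-NP-V combinations reported in the text
versions = versions_per_item()
total_clips = N_ITEMS * len(versions)
print(len(versions), total_clips)  # 20 4800
```

Under this reading, 2 structures × 2 number orders × 5 emphasis/side cells gives 20 versions, and 240 items × 20 versions gives the 4800 clips stated above.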

Procedure

The participants were seated in a dimly lit, sound-attenuated chamber facing a computer screen. The videos were centered on a black background and extended for 10° visual angle horizontally and 8° vertically. A trial started with a fixation cross on the screen, which was presented for 300 ms, followed by the video presentation. Two seconds after sentence offset, a question concerning the content of the preceding sentence was presented until the participant responded with a button press (“yes” vs. “no”) or until 3 s had elapsed. This question always had the form “Was NP1/NP2 verb-ed?” (e.g., “Was the woman greeted?”). Each response was immediately followed by a feedback stimulus (500 ms) that informed the subjects about the accuracy of the response (correct/incorrect). The next trial began 1700 ms later with the presentation of a fixation cross.

An experimental session (excluding time for electrode application) lasted approximately 60 min. Stimuli were presented in four blocks each consisting of 60 items. Key assignment for correct (left or right) was counterbalanced across participants. Each participant received one of the 24 experimental lists.

ERP recording

The EEG was recorded from 63 Ag/AgCl electrodes (Electro-Cap International). It was amplified using a PORTI-32/MREFA amplifier (DC to 135 Hz) and digitized online at 500 Hz. Electrode impedance was kept below 5 kΩ. Data were re-referenced offline to linked mastoids. Vertical and horizontal electrooculograms (EOG) were also measured.

Data analysis

Single-subject ERPs were calculated for each of the six experimental conditions. The epochs were time-locked to the onset of the disambiguating sentence-final auxiliary verb and lasted from 200 ms pre-stimulus onset to 1000 ms post-stimulus onset. A 200-ms pre-stimulus baseline was used. Four regions of interest (ROIs) were defined: anterior-left (AL): AF7, AF3, F7, F5, F3, FT7, FC5, FC3; anterior-right (AR): AF4, AF8, F4, F6, F8, FC4, FC6, FT8; posterior-left (PL): TP7, CP5, CP3, P7, P5, P3, PO7, PO3; posterior-right (PR): CP4, CP6, TP8, P4, P6, P8, PO4, PO8. An automatic artifact rejection using a 200-ms sliding window was performed on the EOG channels (±30 μV) and on the EEG channels (±40 μV) and was double-checked by visual inspection. Overall, approximately 20% of the trials did not enter statistical analysis due to artifacts or incorrect responses. Based on visual inspection of the data, a time window from 200 to 350 ms was used to analyze the early negativity, whereas a time window of 500–800 ms was used for the analysis of the P600 effects.
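The epoch measures described above can be illustrated with a small sketch in plain Python. It uses the reported sampling rate (500 Hz), epoch limits (−200 to 1000 ms), baseline, and analysis windows. The function names and the toy epoch are hypothetical, and the rejection rule (peak-to-peak range within each 200-ms window) is an assumption, since the text gives only the thresholds and not how they were applied.

```python
SFREQ_HZ = 500         # sampling rate reported in the text
EPOCH_START_MS = -200  # epoch onset relative to the auxiliary verb


def ms_to_index(t_ms):
    """Convert a latency in ms (relative to word onset) to a sample index."""
    return round((t_ms - EPOCH_START_MS) * SFREQ_HZ / 1000)


def baseline_correct(epoch):
    """Subtract the mean of the 200-ms pre-stimulus interval from every sample."""
    n_base = ms_to_index(0)
    base = sum(epoch[:n_base]) / n_base
    return [v - base for v in epoch]


def mean_amplitude(epoch, start_ms, end_ms):
    """Mean voltage in the half-open window [start_ms, end_ms)."""
    segment = epoch[ms_to_index(start_ms):ms_to_index(end_ms)]
    return sum(segment) / len(segment)


def exceeds_threshold(epoch, limit_uv, window_ms=200):
    """Flag an epoch if any sliding window has a peak-to-peak range above limit_uv.

    Assumption: the exact rejection criterion is not specified in the text.
    """
    width = round(window_ms * SFREQ_HZ / 1000)
    return any(
        max(epoch[i:i + width]) - min(epoch[i:i + width]) > limit_uv
        for i in range(len(epoch) - width + 1)
    )


# Toy single-channel epoch (made-up values): a constant 2 µV offset with a
# 5 µV plateau between 500 and 800 ms.
raw = [2.0] * ms_to_index(1000)
for i in range(ms_to_index(500), ms_to_index(800)):
    raw[i] = 7.0

corrected = baseline_correct(raw)
early = mean_amplitude(corrected, 200, 350)  # early-negativity window
late = mean_amplitude(corrected, 500, 800)   # P600 window
rejected = exceeds_threshold(corrected, limit_uv=40.0)
print(early, late, rejected)  # 0.0 5.0 False
```

In a full analysis, the window means would additionally be averaged over the eight electrodes of each ROI and over all accepted trials of a condition before entering the ANOVA.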

The statistical analysis of the ERP data was performed in two steps. The first analysis targeted the processing strategy in the absence of any visual cue. This analysis was implemented by means of a repeated-measures ANOVA using the within-subject factors Structure (SOV, OSV), Region (anterior, posterior), and Hemisphere (left, right). The second analysis explored how the brain response to the experimental sentences is modulated by a concurrent emphasis cue. This was done by means of a repeated-measures ANOVA with the factors Structure (SOV, OSV), Emphasis (NP1, NP2), Region (anterior, posterior), and Hemisphere (left, right).
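As an illustration of the first analysis step, the sketch below computes the main effect of Structure. It relies on a general property of repeated-measures designs in which every factor has two levels: the F test for a main effect, with 1 and n − 1 degrees of freedom, equals the squared paired t statistic computed on each participant's mean across the remaining factors (here, the four ROIs). The function name and the toy amplitudes are invented for illustration and are not data from the study.

```python
from math import sqrt
from statistics import mean, stdev


def structure_main_effect(osv, sov):
    """Main effect of Structure for a fully within-subject, two-level design.

    osv, sov: one list per participant with the four ROI amplitudes
    (AL, AR, PL, PR) for that structure.
    Returns (F, df_effect, df_error).
    """
    # Per-participant Structure effect, averaged over Region and Hemisphere.
    diffs = [mean(o) - mean(s) for o, s in zip(osv, sov)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t * t, 1, n - 1


# Toy data for four hypothetical participants (µV, made-up values).
toy_osv = [
    [0.5, 0.5, 1.5, 1.5],
    [1.5, 1.5, 2.5, 2.5],
    [2.5, 2.5, 3.5, 3.5],
    [1.5, 1.5, 2.5, 2.5],
]
toy_sov = [[0.0, 0.0, 0.0, 0.0] for _ in toy_osv]

f_value, df_effect, df_error = structure_main_effect(toy_osv, toy_sov)
print(round(f_value, 2), df_effect, df_error)
```

With 24 participants, the same computation yields the degrees of freedom reported in the Results, F(1, 23). Interactions with Region or Hemisphere require their own contrasts and error terms and are not covered by this sketch.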

Only effects that involve the critical factors of Structure or Emphasis are reported. Before entering statistical analysis, the data were filtered offline with a high-pass filter of 0.2 Hz. For presentation purposes only, an additional 10-Hz low pass filter was used.

Results

Behavioral data

Generally, participants were very accurate in responding to the target questions (87% correct). The first analysis that targeted the behavior in the absence of a gesture cue indicated that accuracy was lower for OSV (83.1%) than for SOV (92.3%) structures, as indicated by a significant main effect of Structure [F(1, 23) = 19.5, p < 0.0001]. The second analysis tested how the presence of a beat gesture modulated the behavioral performance. The corresponding ANOVA with the factors Structure (SOV, OSV) and Beat (NP1, NP2) revealed only a main effect of Structure [F(1, 23) = 29.56, p < 0.0001], which was due to a lower accuracy for object-initial (83.8%) as compared to subject-initial structures (91.5%). The main effect of Beat as well as the interaction of Structure and Beat were not significant (both Fs < 1).

ERP data

Our first analysis of the ERP data again targeted the processing mode in the absence of a particular emphasis cue. The top left part of Figure 1 shows the ERPs time-locked to the onset of the disambiguating auxiliary verb when the speaker did not produce an accompanying beat gesture. In this case, processing an object-initial as compared to a subject-initial structure elicited an early negativity in the time window from 200 to 350 ms, followed by a late positivity in the time window from 500 to 800 ms. Whereas the early negativity is broadly distributed across the scalp, the late positivity appears to be maximal at left posterior electrodes (see also Supplementary Material). On the basis of its scalp distribution, polarity, and latency, the late positivity was identified as a P600.

The statistical analysis for the early time window (200–350 ms) revealed that the early negativity was more pronounced for OSV than for SOV structures [F(1, 23) = 5.16, p < 0.05]. For the P600 time window (500–800 ms), the corresponding ANOVA yielded a significant main effect of Structure [F(1, 23) = 4.64, p < 0.05] indicating that the P600 was more pronounced when sentences were disambiguated toward the non-preferred object-initial structure.

To sum up, processing OSV as compared to SOV structures in the absence of a gesture cue elicited a broadly distributed early negativity, followed by a broadly distributed P600.

Next, we examined how these ERP patterns are modulated by the presence of a beat gesture. When the speaker produced a beat gesture on the first ambiguous noun phrase of the sentence, the ERP pattern appears to be similar to the pattern observed without a gesture: an early, broadly distributed negativity, followed by a P600 (see Figure 1, top center). However, when the beat accentuated the second ambiguous noun phrase, only an early negativity appears to be present, whereas the P600 effect is virtually absent (see Figure 1, top right).

The ANOVA for the early negativity yielded significant main effects of Structure [F(1, 23) = 20.71, p < 0.0001] and Beat [F(1, 23) = 6.60, p < 0.05], but no interaction between these two factors (F < 1). The main effect of Structure indicated that the negativity was more pronounced for object-initial as compared to subject-initial sentences, whereas the main effect of Beat reflected a generally more negative ERP when the beat fell on the second as compared to the first ambiguous NP.

The statistical analysis for the P600 time window revealed a significant main effect of Beat [F(1, 23) = 12.85, p < 0.005], a significant two-way interaction of Structure by Region [F(1, 23) = 4.58, p < 0.05] as well as a significant three-way interaction of Structure by Region by Hemisphere [F(1, 23) = 5.01, p < 0.05]. Most importantly, there was also a three-way interaction of Structure by Beat by Region [F(1, 23) = 5.71, p < 0.05]. Post hoc tests indicated that the P600 effect for OSV as compared to SOV structures was only significant at posterior sites when the beat accentuated the first NP [F(1, 23) = 9.57, p < 0.01], but not when it accompanied the second NP [F(1, 23) < 1].

To test whether a beat on the second ambiguous noun phrase might interfere with the processing of the SOV structure, we directly compared the P600 elicited by SOV structures presented without a beat with the P600 for SOV structures accompanied by a beat on the second noun phrase. No significant main effect of Beat or interactions involving this factor were observed (all F < 2.99, all p > 0.1), indicating that a beat on the second noun phrase did not interfere with the processing of the standard SOV order.

Finally, we tested whether the lack of a P600 effect when beats emphasized NP2 might be partly driven by an increase in the preceding negativity. Although the interaction of Beat by Structure was not significant for the preceding negativity (see above), Figure 1 suggests that at least numerically, the early negativity is more pronounced for beats on NP2 than for beats on NP1. To directly test this possibility, we compared the ERP difference of (NP1: OSV–SOV) with (NP2: OSV–SOV) in the early time window at posterior sites (where the P600 modulation was significant). The corresponding paired t-test was not significant [t(23) = 1.18, p = 0.25], suggesting that there is no reliable difference in the early negativity between beats on NP1 and beats on NP2. Thus, what happens during the time window of the early negativity cannot explain what happens later on during the P600 time window.
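The comparison of (NP1: OSV–SOV) with (NP2: OSV–SOV) is a difference of differences and can be sketched as follows. The helper names and all numbers are hypothetical and serve only to show the structure of the test. Each list holds one posterior mean amplitude per participant for the early time window.

```python
from math import sqrt
from statistics import mean, stdev


def structure_effect(osv, sov):
    """Per-participant OSV minus SOV amplitude difference."""
    return [o - s for o, s in zip(osv, sov)]


def paired_t(a, b):
    """Paired t statistic and its degrees of freedom (n - 1)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1


# Toy posterior amplitudes (µV) for four hypothetical participants.
np1_osv = [-1.0, -2.0, -1.5, -0.5]
np1_sov = [0.0, 0.0, 0.0, 0.0]
np2_osv = [-2.0, -2.0, -1.0, -1.5]
np2_sov = [0.0, 0.0, 0.0, 0.0]

effect_np1 = structure_effect(np1_osv, np1_sov)  # beat on NP1
effect_np2 = structure_effect(np2_osv, np2_sov)  # beat on NP2
t_value, df = paired_t(effect_np1, effect_np2)
print(round(t_value, 2), df)
```

The sketch returns only the statistic and its degrees of freedom. Obtaining a p value would additionally require the t distribution, for example from a statistics package.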

One question the results from Experiment 1 raise is why we observed an interaction of Beat by Structure in the ERPs, but not in the behavioral data. In the behavioral results we found a strong main effect of Structure, which was due to a lower accuracy score for OSV as compared to SOV structures. This may be seen as further evidence that additional processing costs arise when temporarily ambiguous syntactic structures are disambiguated toward the non-preferred object-initial structure (Schriefers et al., 1995; Bader and Meng, 1999). Beyond that, however, we believe that the behavioral data should be interpreted with caution. To avoid a contamination of the ERP data through the motor preparation and execution of the response, the question prompting participants to respond was displayed 2 s after the offset of the video. This leaves a considerable period of time for slower, offline processes (e.g., metalinguistic reasoning) to kick in, which most likely have influenced the response of our participants. The ERPs are in this sense a purer measure, because they provide a direct reflection of the online processes taking place at the disambiguating region of the sentence.

To summarize the results from Experiment 1, only the P600, but not the early negativity was modulated by the presence of a beat gesture. When either no beat accompanied the sentence or the beat fell on the first NP, we observed strong P600 effects. However, when the beat highlighted the second NP, the P600 effect was abolished.

Experiment 2: Hearing Beats

In the first experiment, we observed that visual beat gestures can abolish the P600 usually associated with syntactically more complex sentences. Given that beat gestures tend to influence the speech they accompany (increased pitch and duration, see Krahmer and Swerts, 2007), one obvious question is whether the auditory pitch accents induced by a beat gesture also abolish the P600 effect. Therefore, we designed another experiment, which in some respects is a mirror version of Experiment 1. We took the speech of a speaker either not producing a beat gesture, or producing a beat on either one of two noun phrases, and paired the speech with a video of a non-gesturing speaker. Thus, visual stimulation is perfectly controlled for, and all modulations of the standard structure effect can only be attributed to the auditory manipulation.