Using Gesture to Facilitate L2 Phoneme Acquisition: The Importance of Gesture and Phoneme Complexity

Hoetjes, Marieke; van Maastricht, Lieke

doi:10.3389/fpsyg.2020.575032

ORIGINAL RESEARCH article

Front. Psychol., 23 November 2020

Sec. Cognition

Volume 11 - 2020 | https://doi.org/10.3389/fpsyg.2020.575032

Centre for Language Studies, Radboud University, Nijmegen, Netherlands

Abstract

Most language learners have difficulties acquiring the phonemes of a second language (L2). Unfortunately, they are often judged on their L2 pronunciation, and segmental inaccuracies contribute to miscommunication. Therefore, we aim to determine how to facilitate phoneme acquisition. Given the close relationship between speech and co-speech gesture, previous work unsurprisingly reports that gestures can benefit language acquisition, e.g., in (L2) word learning. However, gesture studies on L2 phoneme acquisition present contradictory results, implying that both specific properties of gestures and phonemes used in training, and their combination, may be relevant. We investigated the effect of phoneme and gesture complexity on L2 phoneme acquisition. In a production study, Dutch natives received instruction on the pronunciation of two Spanish phonemes, /u/ and /θ/. Both are typically difficult to produce for Dutch natives because their orthographic representation differs between both languages. Moreover, /θ/ is considered more complex than /u/, since the Dutch phoneme inventory contains /u/ but not /θ/. The instruction participants received contained Spanish examples presented either via audio-only, audio-visually without gesture, audio-visually with a simple, pointing gesture, or audio-visually with a more complex, iconic gesture representing the relevant speech articulator(s). Preceding and following training, participants read aloud Spanish sentences containing the target phonemes. In a perception study, Spanish natives rated the target words from the production study on accentedness and comprehensibility. 
Our results show that combining gesture and speech in L2 phoneme training can lead to significant improvement in L2 phoneme production, but both gesture and phoneme complexity affect successful learning: Significant learning only occurred for the less complex phoneme /u/ after seeing the more complex iconic gesture, whereas for the more complex phoneme /θ/, seeing the more complex gesture actually hindered acquisition. The perception results confirm the production findings and show that items containing /θ/ produced after receiving training with a less complex pointing gesture are considered less foreign-accented and more easily comprehensible as compared to the same items after audio-only training. This shows that gesture can facilitate task performance in L2 phonology acquisition, yet complexity affects whether certain gestures work better for certain phonemes than others.

Preliminary versions of parts of this paper were presented at the International Congress of Phonetic Sciences in August 2019 in Melbourne, Australia (Van Maastricht et al., 2019), at the 29th conference of the European Second Language Association in August 2019 in Lund, Sweden (Hoetjes et al., 2019b), and at the Gesture and Speech in Interaction conference in September 2019 in Paderborn, Germany (Hoetjes et al., 2019a). The current paper includes a more detailed theoretical background, description of the experimental methods, and discussion of the findings, as well as more advanced statistical analyses over the complete data set in the case of Study I and analyses over a new data set in the case of Study II.

Introduction

Human communication is multimodal: When people communicate face-to-face, they do not only use speech but also various non-verbal communicative cues, such as facial expressions and hand gestures. In this study, we focus on one of these aspects of non-verbal communication, namely co-speech hand gestures, within the context of foreign language learning. There is general agreement in the literature that speech and co-speech gestures are closely related and that they are integrated in various ways (McNeill, 1992; Kendon, 2004; Wagner et al., 2014). This is apparent, for example, from the close temporal and semantic coordination between speech and gesture: roughly speaking, speech and gesture tend to express the same thing at the same time (see Gullberg, 2006, for an overview). Moreover, the integration between speech and gesture is reflected in the parallel development of the two modalities: For instance, in first language (L1) acquisition, it has been shown that gestures play a facilitating role in vocabulary learning in children, with gesture production predicting their subsequent lexical and syntactic development (e.g., Goldin-Meadow, 2005). Both modalities have also been shown to break down in a parallel way, for example during disfluencies (e.g., Seyfeddinipur, 2006; Graziano and Gullberg, 2018) or as a result of aphasia (Van Nispen et al., 2016). In short, the relationship between speech and gesture plays a crucial role in our communicative processes. Given this close relationship between speech and gesture in communication, the possible benefit of gesture in learning contexts has been a topic of research in different scientific fields, one of which is second language (L2) acquisition. While gesture is often intuitively used by teachers in classrooms (cf. Smotrova, 2017), very little is known about the specifics of the interplay between both modalities in a learning context.
Hence, in the current study, we compare the use of different types of gestures in the context of L2 phoneme acquisition to determine in which way gesture and phoneme complexity in L2 training affect the phoneme productions of Dutch learners of Spanish (Study I) and the perceptions of Spanish natives with respect to these non-native productions (Study II). Before turning to the specifics of our research, we first review the relevant literature.

Multimodality in Learning Contexts

Gesture can play a facilitative role in various kinds of learning situations. For example, previous work has shown that students take teachers’ gestures into account and that teachers can thus use gesture to help students learn mathematical concepts (e.g., Goldin-Meadow et al., 1999; Yeo et al., 2018). Focusing on L2 learning, various studies have shown that gestures can play a facilitative role in the acquisition of L2 vocabulary, both by children and adults. Tellier (2008), for example, had 5-year-old French children learn English words associated with either a picture or a gesture and found that the gesture group did better than the picture group. For adults, Kelly et al. (2009) likewise found that native speakers of English learned novel Japanese words better when the words were presented with hand gestures than when they were presented without hand gestures. In these studies, iconic gestures were used, which have a clear semantic relationship to the lexical items they accompany. The conclusion we can draw from these findings is that presenting semantic information in several modalities strengthens learners’ memory of the words’ semantic meaning (e.g., Tellier, 2008; Kelly et al., 2009; Macedonia et al., 2011).

Apart from vocabulary acquisition, it is important for L2 learners to also learn how to correctly pronounce the sounds of their target language. On the one hand, phoneme acquisition is one of the aspects of L2 acquisition learners generally find most difficult (see, e.g., collected papers in Bohn and Munro, 2007), while on the other hand, an atypical pronunciation is an aspect of speech that is very salient to native listeners (see Derwing and Munro, 2009 and the references therein), even if it doesn’t necessarily affect their perceived ease of comprehensibility or actual processing of the L2 speech (Munro and Derwing, 1999; Van Maastricht et al., 2016). Moreover, pronunciation is often one of the aspects of the L2 that learners are eager to acquire since most of them aim to sound as native-like as possible in the L2 (Timmis, 2002; Derwing, 2003). A native-like pronunciation is especially important given that a clear non-native pronunciation has been shown to negatively affect the way speakers are perceived (Lev-Ari and Keysar, 2010) and segmental inaccuracies contribute to miscommunication (Caspers and Horłoza, 2012).

Given the tight relationship between speech and gesture and the fact that gestures can facilitate L1, and even L2, development, it is not such a strange idea that gesture may also play a role in L2 phoneme acquisition. Anecdotally, L2 teachers report regularly using gestures in the classroom when teaching different aspects of L2 phonology, but there are also empirical reasons to assume that gestures could play a facilitative role in L2 phoneme acquisition even though, to date, most research on multimodal L2 phonology acquisition has not focused on gestures. For instance, Hazan et al. (2005) have shown that multimodal training on English phoneme contrasts, in this case through the audiovisual modality as compared to through the auditory modality only, generally benefitted the production and perception of L2 phonemes by Japanese learners of English. Hardison (2003) reports similar results with Japanese and Korean intermediate-level learners of English and found that improvement in phoneme perception also led to improved phoneme production, which she attributes to the fact that the audiovisual training leaves multiple memory traces, while the auditory training only left one.

Using a form of multimodal training that is similar to a gesture, Zhang et al. (2020) studied the facilitative effect of hand-clapping on L2 pronunciation. They showed that French words produced by Chinese adolescents were rated as marginally more nativelike after they had seen and reproduced training videos in which the speaker clapped to visualize the rhythmic structure of the French words as compared to seeing a speaker that did not move her hands and not moving their own hands. They also found a significant effect of training condition on final syllable duration, reflecting the final stress placement that is typical of French, with longer final syllable lengths for items produced after the clapping condition. Like hand-clapping, gestures are not only visual but also consist of movements. Hence, these previous findings would suggest that using gesture in language training, as opposed to using only auditory input or visual input without movements, could facilitate L2 phoneme acquisition. Indeed, some previous studies have been conducted specifically on the role of gestures in the acquisition of L2 tonal and phonemic contrasts. However, the results of these studies are inconclusive.

Gesture and L2 Phonology

On the one hand, there is previous work suggesting that gestures can indeed play a role in the acquisition of certain aspects of L2 phonology, such as the perception of L2 tones and intonation contours. Kelly et al. (2017) conducted a study in which native speakers of English listened to different types of Japanese phonemic contrasts. The speech sounds contrasted in vowel length or in sentence-final intonation. Participants were presented with training on the relevant phonemic differences, followed by videos showing either speech without gestures, speech with congruent metaphoric gestures visualizing the contrast, where the gestures’ meaning was in line with the phonemic meaning (short vs. long vowel, or rising vs. falling intonation), or speech with incongruent gestures (e.g., a short vowel with a long gesture). After each video, participants had to indicate whether they perceived the audio to contain a long vs. short vowel, or rising vs. falling intonation. Although results were not clear-cut for the vowel length contrasts, congruent gestures did help participants to correctly perceive intonational contrasts, as compared to incongruent gesture or no gesture conditions. In a similar vein, work by Hannah et al. (2017) on Mandarin tones used speech-accompanying congruent and incongruent metaphoric gestures and found that perceivers often relied on the visual cues they received, which in the case of incongruence between speech and gesture resulted in participants incorrectly perceiving what they had heard. Gluhareva and Prieto (2017) used beat gestures rather than metaphoric gestures, and showed that viewing beat gestures during discourse prompts improved L2 pronunciation, as measured by accentedness ratings by English natives of short stories produced by Catalan learners of English. Moreover, recent work by Li et al. (2020) on the L2 acquisition of Japanese vowel-length contrasts found that gesture (versus no gesture) did not improve L2 vowel length perception but did facilitate correct L2 vowel length production.

On the other hand, other work suggests that gestures do not facilitate the acquisition of some aspects of L2 phonology. In Kelly et al. (2017), for example, viewing gestures did not facilitate the perception of phonemic vowel length distinctions. Several other studies also did not report positive effects of gesture on L2 phoneme perception. For instance, in work by Kelly et al. (2014) and by Hirata et al. (2014), the L2 acquisition of phonemic vowel length contrasts was investigated by letting English-speaking naïve learners of Japanese observe or also produce gestures related to the syllable or the mora structure of the target word. In an auditory identification task, no differences between the training conditions were found. The authors suggest that this could mean that gestures are not suited for learning phonetic distinctions¹. Earlier work by Kelly and Lee (2012) qualifies this point of view somewhat by stating that gesture may help in acquiring phonetically easy phonemic contrasts, but hinders the acquisition of phonetically hard contrasts because iconic gestures could add too much semantic content to the spoken input, which complicates the acquisition of new phonemes since the learner is simultaneously paying attention to the novel sounds and the contents of the gesture. Hence, they suggest that “it is possible that gesture facilitates local processing of speech sounds only for familiar phonemes in one’s native language” (p. 804), which is a relevant factor in the present study.

This contrast between gestures playing a facilitative role in certain contexts but hindering L2 acquisition in others has, in some cases, even been shown within studies. As discussed above, Kelly et al. (2017), for example, showed that similar metaphoric gestures helped for perceiving non-native intonation contours, but did not help in perceiving vowel length differences. Likewise, Morett and Chang (2015) studied the acquisition of L2 Mandarin lexical tone perception by English learners and found that gestures that visualize the target pitch contour helped acquisition, while gestures referring to the semantic meaning of the word hindered correct tone identification. Clearly, the role of gestures in the L2 acquisition of phonemes is not straightforward. As prior studies used varying research methods and focused on different aspects of L2 phonology, it remains unclear whether the contradictory findings within the field of L2 phonology acquisition are due to methodological discrepancies or to the fact that the specific properties of the gestures used in training, as well as the properties of the phonetic feature to be acquired, contribute to the effectiveness of the use of gesture in L2 pronunciation training. It has been suggested (Kelly et al., 2014) that using gestures for complex L2 input, for example because the learner has a low proficiency or because the contrast in question is hard to acquire, may hinder rather than help acquisition. In those cases, the processing resources needed for the interpretation of the speech might be prioritized over those needed to process the gesture. This would be in contrast with easy L2 acquisition contexts, where gestures that may play a beneficial role can be processed alongside speech. In any case, the lack of agreement between the different studies in this domain means that it is hard to draw clear conclusions, and indeed, Kelly et al. (2017, p. 1) suggest that “gestures help with some –but not all- novel speech sounds in a foreign language.”

The Present Study

What most previous studies on L2 phoneme acquisition have in common is that they generally focus on learners’ perception skills, that is, whether certain types of language training result in learners being able to recognize or distinguish between different phonemes. In most cases, we do not yet know to what extent these results can be extended to learners’ production of L2 phonemes. In other words, can a certain type of training result in L2 learners’ improved ability to pronounce the phonemes in the L2? Hence, one of the goals of this study is to focus on L2 phoneme production. Also, one potential reason for the diverging findings in previous work is that the effect of gestures in L2 phoneme training on L2 phoneme perception has been investigated using various types of gestures and hand movements, but without directly comparing them. Studies have, for example, looked at the use of beats (Gluhareva and Prieto, 2017), which are simple rhythmical gestures, but also at, arguably more complex, metaphoric gestures (Kelly et al., 2014, 2017), which are like iconic gestures in the sense that they show a clear semantic relationship between the movement and the content of speech, but are produced during abstract speech. We are unaware of previous work incorporating deictic (i.e., pointing) gestures in L2 acquisition or of work on L2 phoneme acquisition comparing the effect of different types of gestures. These differences between studies make it hard to draw clear conclusions about the educational value of different types of gestures. Differences in the speech-gesture relationship between types of gestures mean that their potential role in L2 acquisition is not self-evident. Hence, another goal of this study is to compare different types of gestures and the role they may play in the acquisition of L2 phoneme production.

In the current study, we thus aim to investigate whether different types of gestures can facilitate L2 learners’ productions of two different L2 phonemes, which vary in complexity. We do so by conducting two experimental studies. In our production task (Study I), we provide Dutch learners of Spanish with training on two phonemes that are typically difficult for them: /u/ and /θ/. We have chosen to approach L2 phoneme acquisition within the context that will likely be typical for adult L2 learners: They usually learn the L2 in a classroom setting and, in contrast to infants, are generally able to read, which means they often receive a large part of their instruction from written textbooks and exercises. Part of the challenge therefore lies in making the correct association between spelling and sound. This means that producing the right L2 phoneme is not only dependent on whether they are familiar with the sound itself but also on whether they are accustomed to relating that particular sound to the correct grapheme. Prior research (e.g., Escudero et al., 2014) has shown that stimuli with incongruent grapheme-phoneme mapping hinder L2 performance in various areas. We employed this distinction in order to manipulate phoneme complexity in our study: While there are subtle phonetic differences between the production of /u/ in Dutch and Spanish, it is a segment that is present in both the Dutch and the Spanish phoneme inventory. The difficulty for Dutch learners of Spanish lies in the fact that, in Spanish, the phoneme that corresponds to the grapheme < u > is always /u/, whereas in Dutch several phonemes correspond to the grapheme < u >, for instance, /œ/ as in dun (“thin”) and /y/ as in pure (“pure”). In combination with other vowels, even more variation is possible, with realizations such as /œy/, /ø/, or /ɑu/, as in muis (“mouse”), leuk (“fun”), or rauw (“raw,” Kooij and Van Oostendorp, 2003).
Conversely, when it comes to the acquisition of /θ/, the challenge is twofold: not only is /θ/ not a part of the Dutch phoneme inventory² and thus a new segment for which a category needs to be created, its only corresponding grapheme in Spanish is < z >,³ while in Dutch < z > is typically pronounced as /z/ or /s/. In sum, while /u/ requires a novel grapheme to phoneme correspondence, /θ/ requires both a novel grapheme to phoneme correspondence and the creation of a new category in the phoneme inventory. These differences between /u/ and /θ/ allow us to manipulate phoneme complexity in our production task.

Our Dutch learners of Spanish received instruction on /u/ and /θ/ in one of four conditions: audio-only (AO), audio-visual (AV), audio-visual with a pointing gesture (AV-P), or audio-visual with an iconic gesture (AV-I). The AO condition serves as a baseline, to which we will compare the other conditions, of which the latter two contain either a less or more complex gesture: A pointing gesture was chosen as a less complex gesture, as it has no intrinsic semantic meaning and only serves to draw the listener’s attention to a specific feature in the context, in our case, the mouth of the native speaker of Spanish pronouncing an example item. An iconic gesture was chosen as a more complex gesture, as it does have intrinsic semantic meaning because it illustrates to the listener which articulator is involved in the production of the target sounds and in which way it should be used. Our analyses will focus on whether gesture complexity and phoneme complexity affect the production of the target phonemes by Dutch learners of Spanish. In a perception task (Study II), Spanish natives listened to words containing the target phonemes that were produced by the Dutch learners of Spanish before and after AV, AV-P, or AV-I training and judged them on foreign accentedness and comprehensibility.

Based on previous studies (Hardison, 2003; Hazan et al., 2005), we hypothesize that adding audio-visual information to L2 phoneme training will facilitate phoneme acquisition, as compared to providing only audio information. Given that some previous work (e.g., Hannah et al., 2017; Kelly et al., 2017) has shown that gestures can be helpful in the acquisition of certain phonemes, we expect that including gestures in language training will be more beneficial than not including them, but possibly only in a context that is less cognitively demanding, that is, when producing /u/, but not /θ/ (Kelly and Lee, 2012). This would be in line with an embodied approach to cognition, which implies that not only performing but also seeing gestures benefits memory performance, which is essential in our phoneme production task (Madan and Singhal, 2012). Finally, given the lack of previous work that directly compares the role of different types of gestures in language acquisition, we cannot predict different effects between different types of gestures, but we speculate that there might be a difference between the potential facilitative effect of deictic and iconic gestures, based on the cognitive resources needed to process them. If this indeed affects their effectiveness in L2 pronunciation training, one would expect that pointing gestures might be more helpful than iconic gestures, which would be more cognitively demanding and thus leave fewer processing resources available for the perception and acquisition of the phoneme itself.

Study I

Method

Participants

In study I, 50 native speakers of Dutch, who did not speak any Spanish, took part. They were 28 women and 22 men, with a mean age of 25 years old (range 18–61 years old). Participants had no auditory or visual impairments that could affect their participation. Participants were recruited via the Radboud University research participation system and received either credits or a small financial reward for taking part.

Design

Study I consisted of a pretest – training – posttest paradigm. We used a between-subjects design in which participants took part in one of four experimental training conditions: AO (n = 12), AV (n = 13), AV-P (n = 13), or AV-I (n = 12). The dependent variable was the pronunciation of the target phonemes, coded as either on-target or not.

Materials

Sentences

In the pretest and the posttest, participants read out loud 16 Spanish four-word sentences (in one of two randomized orders) that were easy to parse, half of which were experimental items. In each experimental item, the first syllable of the two-syllable noun in the sentence contained either /u/ or /θ/ (e.g., La nube es blanca, la zeta es verde). Each of the two target phonemes occurred in four target words, for /u/: muro, nube, ruta, suma; for /θ/: zeta, zorro, zueco, zumo. The eight remaining filler items also contained the target phonemes, but at different positions within the words or the sentence. The filler items were not analyzed. The target phonemes were embedded in the four-word sentences and presented to participants one at a time on PowerPoint slides. Each written sentence was accompanied by a picture illustrating the meaning of the sentence (see Figure 1). This was done to make the task more interesting and to help participants understand the semantic meaning of the sentence.

FIGURE 1

Training

After the pretest and before the posttest, participants received training on how to pronounce the target phonemes /θ/ and /u/ (in counterbalanced order) in Spanish. This training consisted of a set of three PowerPoint slides for each phoneme. On the first slide, written information was given on how to pronounce the target phoneme. Specifically, participants were told that the Spanish pronunciation of both graphemes differs from the Dutch pronunciation of these graphemes. Moreover, participants were explicitly instructed which articulatory gestures are necessary for nativelike pronunciation (i.e., “when pronouncing the letter “u” in Spanish, you need to round your lips” and “when pronouncing the letter “z” in Spanish, you need to place your tongue between your teeth and push out the air”). Apart from the written text, participants were also given an example of a native speaker of Spanish pronouncing the target phoneme in isolation. On the two following slides, participants were given two examples of the pronunciation of the target phoneme embedded within an example sentence. These examples (all produced by the same native speaker of Spanish) were accompanied by the written sentence and a picture illustrating the meaning of the sentence, in the same way as during the pretest and posttest (see Figure 2). The training was self-paced and participants took roughly 3 to 4 min to complete it. They were free to listen to/view the example fragments as many times as they wanted.

FIGURE 2

To manipulate training condition, the visual information given in the examples during the training varied, while the same audio was dubbed over all conditions. In the AO condition, participants heard the audio examples but did not see any video recordings of the speaker. In the AV condition, participants saw a video clip of the speaker producing the examples, but the speaker did not move her hands. In the AV-P condition, participants saw videos in which the speaker produced a pointing gesture toward her mouth while she produced the target phoneme. In the AV-I condition, participants saw the speaker produce an iconic gesture while she produced the target phoneme (see Figure 3 for examples). This iconic gesture represented the articulatory gesture needed for on-target phoneme production, as was explained verbally on the first training slide. For /u/, the iconic gesture was a one-handed gesture representing the rounding of the lips, and for /θ/, the iconic gesture was a one-handed gesture indicating that the speaker should push their tongue between their teeth. Both iconic gestures were made with one hand, roughly equally complex with respect to finger configuration, and not necessarily representing all articulators in the gesture but only the most relevant one for the learner. In the case of /θ/, Dutch learners of Spanish are familiar with non-sibilant fricatives (e.g., /f/ and /v/) but not interdental ones, so they need to know that they should push their tongue out of their mouth, which is only possible by placing it in between the teeth and lips. Concerning /u/, Dutch learners of Spanish need to know that correct pronunciation requires a stronger rounding of the lips than needed for any of the Dutch vowels. 
We performed a posttest for our stimuli among 42 native speakers of Dutch in which we compared the iconic and pointing gestures used for both phonemes with respect to how useful they found the gesture in the context of the L2 training for that specific phoneme, how intuitive they found the gesture in that context and whether they thought they understood why the gesture was chosen in that context. No significant differences were found between gesture type conditions or phoneme conditions for any of our measures, nor did the test reveal any significant interactions. This suggests that any differences between the iconic gestures concerning the way they visualize the relevant articulator did not affect our results.

FIGURE 3

Procedure

To minimize distractions for the participants, the experiment took place in a soundproof booth. The language used throughout the experiment, except for the Spanish sentences during pretest, training, and posttest, was Dutch. After participants had received instructions and signed a consent form, they were recorded while they read the 16 Spanish sentences out loud into a microphone (pretest). The pretest was followed first by a language background questionnaire, and then by one of the four types of pronunciation training. After the pronunciation training, participants were again recorded while they read out loud the same 16 Spanish sentences in a reordered version (posttest). Both the pretest and posttest were self-paced and participants were invited to repeat the sentences until they were satisfied with their pronunciation. The last production of each sentence was used for analysis. After completing all tasks, participants were debriefed.

Results

The audio recorded during the pretest and the posttest was annotated using Praat (Boersma and Weenink, 2018) concerning the production of the target phonemes. Two phonetically trained coders annotated the 1600 target phonemes (50 speakers × 16 sentences × 2 testing moments), and distinguished between a nativelike production (i.e., as a native speaker of Iberian Spanish would do) and several non-nativelike productions that are typical for native speakers of Dutch (for /θ/, these were /s/, /z/, or “other”; for /u/, these were /y/, /ə/, /Y/, or “other”). In the current analyses, nativelike productions were distinguished from the various non-nativelike productions, collapsing over the various non-target options. There was an overlap of 50% in coding and a good inter-rater reliability (K = 0.900, p < 0.001). Productions of target phonemes from the same sentences were compared between the pretest and the posttest, resulting in four different outcome options: (1) the participant was able to produce the target phoneme in the pretest, but not anymore at the posttest; (2) the participant was not able to produce the target phoneme at either the pretest or the posttest; (3) the participant was able to produce the target phoneme both at the pretest and at the posttest; (4) the participant was unable to produce the target phoneme at the pretest, but able to do so at the posttest. Figure 4 and Table 1 summarize the results per learning outcome separated by gesture condition and phoneme. In Table 1, the results are presented in terms of raw frequencies, while percentages are presented in Figure 4. First, we will inspect the data descriptively, followed by inferential statistics in the form of a mixed effects logistic regression analysis in which we distinguished between cases of “learning” (i.e., option 4), and “no learning” (i.e., collapsing options 1–3).
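
The four-way outcome coding described above, and its collapse into "learning" versus "no learning", can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' annotation or analysis code: the function names `classify_outcome` and `is_learning` and the boolean inputs are hypothetical, and the outcome labels are borrowed from the column headings of Table 1.

```python
def classify_outcome(pretest_on_target: bool, posttest_on_target: bool) -> str:
    """Label one pretest/posttest pair for the same sentence (hypothetical helper)."""
    if pretest_on_target and not posttest_on_target:
        return "Unlearning"    # option 1: on-target at pretest only
    if not pretest_on_target and not posttest_on_target:
        return "Never Able"    # option 2: off-target at both moments
    if pretest_on_target and posttest_on_target:
        return "Always Able"   # option 3: on-target at both moments
    return "Learning"          # option 4: on-target at posttest only


def is_learning(outcome: str) -> bool:
    """Binary response for the regression: option 4 versus options 1-3 collapsed."""
    return outcome == "Learning"


if __name__ == "__main__":
    for pre in (True, False):
        for post in (True, False):
            label = classify_outcome(pre, post)
            print(pre, post, label, is_learning(label))
```

Only the fourth combination counts as learning; the other three are treated alike in the binary response, even though they describe different trajectories.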

FIGURE 4

TABLE 1

Training Condition	Learning /u/	Learning /θ/	Always Able /u/	Always Able /θ/	Never Able /u/	Never Able /θ/	Unlearning /u/	Unlearning /θ/	Total
AO	9	14	36	0	0	32	0	0	91
AV	16	19	35	1	1	32	0	0	104
AV-P	15	25	32	0	2	26	2	0	102
AV-I	21	10	23	0	2	37	1	1	95
Total	61	68	126	1	5	127	3	1	392⁴

Frequency of Training outcomes for /u/ and /θ/, separated by training condition.

The target outcome is printed in bold.

When inspecting the raw data per training condition for the cases in which learning occurred, the Dutch learners of Spanish, in general, appear to benefit from receiving both auditory and visual information. For both phonemes, the cases of learning increase as more visual information is added, except in the AV-I condition: while the L2 learners who aimed to produce a /u/ benefitted most from seeing an iconic gesture during training, the participants who aimed to produce a /θ/ appeared to benefit most from seeing a pointing gesture.
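To make the descriptive comparison concrete, the share of "learning" outcomes per condition and phoneme can be computed from the raw frequencies in Table 1. This is a sketch under one assumption: the denominator is taken to be all observations in a condition-by-phoneme cell, which is not necessarily how the percentages in Figure 4 were computed. The dictionary layout and function name are illustrative only.

```python
# Raw frequencies copied from Table 1, per condition and phoneme, in the
# order (learning, always able, never able, unlearning).
TABLE1 = {
    "AO":   {"u": (9, 36, 0, 0),  "theta": (14, 0, 32, 0)},
    "AV":   {"u": (16, 35, 1, 0), "theta": (19, 1, 32, 0)},
    "AV-P": {"u": (15, 32, 2, 2), "theta": (25, 0, 26, 0)},
    "AV-I": {"u": (21, 23, 2, 1), "theta": (10, 0, 37, 1)},
}


def learning_share(condition: str, phoneme: str) -> float:
    """Proportion of 'learning' outcomes among all outcomes in one cell.

    Assumption: the denominator is every observation in the cell.
    """
    counts = TABLE1[condition][phoneme]
    return counts[0] / sum(counts)


for condition, by_phoneme in TABLE1.items():
    for phoneme in by_phoneme:
        share = 100 * learning_share(condition, phoneme)
        print(f"{condition:5s} /{phoneme}/: {share:.1f}% learning")
```

For example, the AO condition for /u/ contains 9 learning cases out of 45 observations, a share of 20%. Under this denominator, the highest share for /u/ falls in the AV-I condition and the highest share for /θ/ in the AV-P condition, in line with the description above.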

We used R (version 3.6.1, R Core Team, 2019) and the lme4 package (Bates et al., 2015) to conduct a mixed effects logistic regression analysis, which models a binary outcome variable. The theoretically relevant predictors Gesture Condition (AO, AV, AV-P, or AV-I) and Phoneme (/u/ or /θ/) were included as fixed factors, and Training Outcome (Learning or No Learning) served as the response variable. Random intercepts were added for Speaker and Item. Adding random slopes resulted in models that either failed to converge or had inferior fit. Significance was assessed via likelihood ratio tests comparing the full model to a model lacking only the relevant effect. The complete model provided the best fit as determined by the Akaike Information Criterion; see Table 2 for a complete overview of all effects and coefficients.

TABLE 2

Learning vs. Not Learning	β estimate	Std. error	z value	p value	95% CI lower (Odds Ratio)	Odds Ratio	95% CI upper (Odds Ratio)
Intercept	−2.06	0.68	−3.06	0.002	0.03	0.13	0.48
Gesture Condition_AV	0.86	0.77	1.12	0.263	0.52	2.37	10.74
Gesture Condition_AV-P	0.90	0.76	1.19	0.236	0.55	2.47	11.01
Gesture Condition_AV-I	1.79	0.77	2.32	0.020	1.32	6.01	27.33
Phoneme_/θ/	0.88	0.73	1.20	0.232	0.57	2.41	10.14
Phoneme_/θ/ * Gesture Condition_AV	−0.44	0.76	−0.59	0.558	0.15	0.64	2.82
Phoneme_/θ/ * Gesture Condition_AV-P	0.23	0.73	0.32	0.753	0.30	1.26	5.30
Phoneme_/θ/ * Gesture Condition_AV-I	−2.30	0.78	−2.95	0.003	0.02	0.10	0.46

Random effects	Variance	Standard deviation
Speaker	1.593	1.262
Item	0.423	0.650

Estimated effects and coefficients for Training Outcome.

The intercept represents the following combination of variable levels: Gesture Condition = AO, Phoneme = /u/. Asterisks (*) represent interactions; subscripts signal the level of a categorical variable. Significant p-values are printed in bold. The model used in this analysis can be described as Training Outcome ∼ Gesture Condition × Phoneme + (1| participant) + (1| item).
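Written out, a model of this form corresponds to the following logistic mixed model. This is a sketch under standard assumptions for a random-intercept logistic regression fitted with lme4 (logit link, normally distributed random intercepts); the index notation is introduced here for illustration and does not appear in the original. Here i indexes speakers, j indexes items, g_i is the gesture condition of speaker i, and p_j is the phoneme targeted by item j. With AO and /u/ as reference levels, the corresponding terms are zero.

```latex
\operatorname{logit} \Pr(\text{Learning}_{ij} = 1)
  = \beta_0 + \beta_{G}[g_i] + \beta_{P}[p_j] + \beta_{GP}[g_i, p_j] + u_i + v_j,
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
v_j \sim \mathcal{N}(0, \sigma_v^2)
```

In Table 2, the estimated variances of the random intercepts are 1.593 for Speaker and 0.423 for Item.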

The analysis revealed that the condition to which a participant was assigned significantly predicted whether learning occurred, but only when comparing the AV-I condition to the AO condition (the baseline condition; β = 1.79, p < 0.05). As gesture condition changes from AO to AV-I, the odds of learning (rather than not learning) are multiplied by 6.01. In other words, in general, a participant is more likely to learn in the AV-I condition than in the AO condition. In addition, there was an interaction between Phoneme and Gesture Condition (β = −2.30, p < 0.01), suggesting that the success of being in the AV-I condition depended on whether the participant aimed to produce a /u/ or a /θ/. The odds ratio for this interaction was 0.10: when the phoneme is /θ/ instead of /u/, the change in the odds of learning associated with moving from AO to AV-I is multiplied by 0.10. In other words, when the phoneme to be produced is /θ/ instead of /u/, participants are less likely to learn in the AV-I condition.
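The relation between the coefficients and the odds ratios in Table 2 can be checked with a short calculation. The sketch below assumes that odds ratios are obtained by exponentiating the β estimates and uses a Wald-type interval (β ± 1.96 × SE) as an approximation. The method behind the reported confidence intervals is not stated, and the coefficients are rounded, so the results only approximate the table. Function and variable names are illustrative.

```python
import math

# Estimates and standard errors copied from Table 2 (rounded to two decimals).
BETA_AVI, SE_AVI = 1.79, 0.77            # Gesture Condition_AV-I (reference: AO, /u/)
BETA_THETA_AVI, SE_THETA_AVI = -2.30, 0.78  # Phoneme_/θ/ * Gesture Condition_AV-I


def odds_ratio(beta: float) -> float:
    """Exponentiate a log-odds coefficient to obtain an odds ratio."""
    return math.exp(beta)


def wald_ci(beta: float, se: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for the odds ratio (Wald-type; an assumption)."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# AV-I versus AO for the reference phoneme /u/.
or_avi_u = odds_ratio(BETA_AVI)
ci_avi_u = wald_ci(BETA_AVI, SE_AVI)

# Multiplicative change in that odds ratio when the phoneme is /θ/.
or_interaction = odds_ratio(BETA_THETA_AVI)

# AV-I versus AO for /θ/: add the interaction on the log-odds scale, which is
# the same as multiplying the two odds ratios. No interval is computed here,
# because that would need the covariance of the two estimates.
or_avi_theta = odds_ratio(BETA_AVI + BETA_THETA_AVI)

print(f"AV-I vs AO, /u/: OR = {or_avi_u:.2f}")
print(f"Interaction: OR multiplier = {or_interaction:.2f}")
print(f"AV-I vs AO, /θ/: OR = {or_avi_theta:.2f}")
```

Under these assumptions, the AV-I versus AO odds ratio is roughly 6 for /u/ but falls below 1 for /θ/ (about 0.60), which is the pattern the interaction describes: the iconic gesture is associated with higher odds of learning for /u/ and lower odds for /θ/.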

Interim Discussion

In summary, Study I showed that, in general, adding audio-visual information to phoneme pronunciation training aided target-like production. However, the complexity of the gesture produced by the trainer in combination with the complexity of the target phoneme affected L2 learners’ success. Only when producing the less complex phoneme /u/ did participants benefit from seeing a more complex, iconic gesture, making the AV-I condition the one in which L2 learners were most likely to learn. Conversely, when aiming to produce the more challenging phoneme /θ/, seeing a more complex gesture was actually detrimental to L2 learners, resulting in less learning than in all other conditions. Additionally, the analysis corroborates our theoretical predictions concerning the complexity level of both phonemes. L2 learners often already produced /u/ in a target-like way during the pretest, whereas they generally remained unable to produce /θ/ correctly during the posttest. This confirms that /u/ is inherently a less complex phoneme for Dutch learners of Spanish than /θ/.