The Role of Native-Language Knowledge in the Perception of Casual Speech in a Second Language

Mitterer, Holger; Tuinman, Annelie

doi:10.3389/fpsyg.2012.00249

ORIGINAL RESEARCH article

Front. Psychol., 13 July 2012

Sec. Cognition

Volume 3 - 2012 | https://doi.org/10.3389/fpsyg.2012.00249

This article is part of the Research Topic: Ecological aspects of speech perception


Holger Mitterer* and Annelie Tuinman

Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands

Casual speech processes, such as /t/-reduction, make word recognition harder. Word recognition is also harder in a second language (L2). Combining these challenges, we investigated whether L2 learners have recourse to knowledge from their native language (L1) when dealing with casual speech processes in their L2. In three experiments, production and perception of /t/-reduction were investigated. An initial production experiment showed that /t/-reduction occurred in both languages and patterned similarly in proper nouns but differed when /t/ was a verbal inflection. Two perception experiments compared the performance of German learners of Dutch with that of native speakers for nouns and verbs. Mirroring the production patterns, German learners’ performance strongly resembled that of native Dutch listeners when the reduced /t/ was part of a word stem, but deviated where /t/ was a verbal inflection. These results suggest that a casual speech process in a second language is problematic for learners when the process is not known from the learner’s native language, similar to what has been observed for phoneme contrasts.


Speech perception studies are often performed under ideal circumstances, with participants listening to their native language in a sound-proof booth. Experimental stimuli have usually been recorded under similarly optimal circumstances by a speaker who is carefully reading out loud. Outside the laboratory, however, the situation is often less ideal. Environmental noises are common, speakers are less careful than during reading, and the language we listen to might not be our native language. All of these influences make speech perception harder.

To start, speech perception in a second language is notoriously difficult. Non-native listeners have the disadvantage that the second language (L2) often contains phoneme contrasts that they are unfamiliar with given their native language (L1). A large body of research (see, e.g., Strange, 1995; Bohn and Munro, 2007, for an overview) has shown that mismatches between the phoneme repertoires of the L1 and L2 cause perceptual difficulties. A classical example is the segmental contrast between /r/ and /l/ in English, which is difficult for Japanese listeners because these speech sounds match a single Japanese category equally well (e.g., Underbakke et al., 1988; Bradlow et al., 1997; Ingram and Park, 1998; Cutler et al., 2006). Cutler and Otake (2004) tested auditory repetition priming with English /r/-/l/ minimal pairs – such as right-light – and Japanese listeners. This contrast is difficult for these listeners and, consequently, priming was observed.

These difficulties of L2 listeners are not limited to minimal pairs. Research in psycholinguistics has shown that the recognition of any word entails activation of multiple word candidates which compete for recognition: For example, when listeners hear the word captain, not only is the word captain activated, but words such as cap and capitol are also temporarily activated (Davis et al., 2002). During this process, difficult L2 contrasts create additional competitors for non-native listeners. For instance, Dutch learners of English, who have a hard time distinguishing the vowels /ε/ and /æ/, activate the word deaf /dεf/ when hearing the longer word daffodil /dæfədɪl/ (Broersma and Cutler, 2011).

As this shows, listening to a non-native language is, compared to L1 listening, an adverse condition by itself. Non-native listeners are confronted with difficult segmental contrasts leading to spurious lexical activation. To make matters worse, L2 perception seems to be more strongly affected by additional challenges. For example, several studies have shown that, in the presence of background noise, speech perception by non-native listeners is more strongly affected than that of native listeners (see Lecumberri et al., 2010, for an overview). Non-native listeners not only require a better signal-to-noise ratio (SNR) for the recognition of speech in noise, but also need larger changes in the SNR to recover (i.e., their psychometric functions of recognition rate over noise levels are shallower than those of native listeners, see Van Wijngaarden et al., 2002).

However, as indicated above, noise-masking is not the only challenge for speech perception. Spontaneous speech is also much harder to comprehend than carefully read speech. Little research, however, has focused on how non-native listeners deal with the adverse conditions created by casual speech. Is casual speech, like background noise, especially difficult to overcome for non-native listeners?

This is an important question. After all, casual speech is what, as language users, we mostly hear and produce. This speech is extremely variable, as is evident from casual speech corpora of American English (Dilley and Pitt, 2007; Pitt et al., 2007), Dutch (Oostdijk, 2000; Pluymaekers et al., 2005), and French (Torreira et al., 2010). In these corpora, processes such as assimilation, epenthesis, and extreme reduction occur frequently and result in pronunciation variation and potential ambiguities. For instance, consonant reduction such as the deletion of /t/ in lost confronts listeners with unintended words, in this case loss. The question of how listeners deal with the resulting non-canonical forms in perception has recently received widespread attention. Alterations can often (temporarily) mislead listeners, and result in word recognition being harder than it would have been for the canonically pronounced versions (Ernestus et al., 2002; Sumner and Samuel, 2005, 2009; Tucker and Warner, 2007; Warner et al., 2009). But native listeners are still able to cope with the extreme variability that occurs in spontaneous speech, in large part because they make use of information conveyed by phonetic detail and context. With regard to phonetic detail in compensation for place assimilation, listeners are able to distinguish the /p/ of English ripe in ripe berries from the assimilated final /t/ of right berries (Gow, 2002), even though the classical phonological analysis (cf. Gussenhoven and Jacobs, 1998) prescribed both to be transcribed as [p] (see Spinelli et al., 2003, for a similar example). With regard to context, listeners make use of the fact that most reductions are conditioned by phonological context (Gaskell and Marslen-Wilson, 1996; Gow, 2003; Mitterer and Blomert, 2003; Mitterer and McQueen, 2009), and are more likely to infer an altered or deleted segment in contexts that facilitate reduction.

However, all of this research has been carried out with native listeners. Given that even these experienced listeners are often burdened by reductions, what is going to happen when non-native listeners hear the same sort of input? Hear it they will, because L2 listeners cannot permanently confine themselves to speech situations in which the input is as close to canonical perfection as it is in the classroom or on language tapes. Therefore, we investigate here how casual speech processes affect L2 listening.

Obviously, not all non-native casual speech processes may be equally difficult for a non-native listener, just as not all non-native phoneme contrasts are equally difficult. With regard to phoneme contrasts, for instance, the Perceptual Assimilation Model (Best and Tyler, 2007) assumes different types of contrast relations, which are more or less difficult for the learner. In short, contrasts that are similar in L1 and L2 pose few problems for L2 learners. If this also holds for casual speech processes, casual speech processes that occur in the L2 but not in the L1 should be particularly hard for non-native listeners. In line with this assumption, Tuinman et al. (2011) show that Dutch listeners have trouble with the process of /r/-intrusion in their L2 English, in which an /r/ is inserted between two vowels at a word boundary [as in “I saw(r) a film today” in the Beatles’ song A Day in the Life]. As this /r/-insertion does not occur in Dutch, it should be difficult for Dutch learners, and, indeed, it is. However, it is unclear if the reverse is also true. Does an L2 casual speech process cause less of a problem if it does occur in the L1?

To answer this question, we tested whether German advanced learners of Dutch are able to compensate for /t/-reduction in Dutch. There are two reasons to choose the process of /t/-reduction. The first is that the process of /t/-reduction is found in many languages (Guy, 1980) and patterns very similarly in the Germanic languages English, German, and Dutch, so that German learners are familiar with /t/-reduction from their native language experience. Second, /t/-reduction in Dutch has been intensively studied (Mitterer and Ernestus, 2006; Janse et al., 2007; Mitterer and McQueen, 2009). Based on these studies, we focus on three aspects that have been shown to influence compensation for /t/-reduction by Dutch L1 listeners: phonetic detail, preceding phonological context, and higher-level knowledge, such as lexical and syntactic knowledge. First of all, Dutch allows various gradations of /t/-reduction, and listeners are highly sensitive to the subtle differences in phonetic detail that result from this. Secondly, preceding context is important as /t/-reduction is more likely after /s/ than after /n/ in Dutch and listeners take this into account in perception. Finally, listeners also make use of higher-level knowledge, and are more likely to restore a reduced /t/ if this “leads to a word.” That is, they are more likely to report a /t/ at the end of “fros…” (frost being a word) than at the end of “blis…” (blist being an English non-word).

However, before we can use /t/-reduction as a case study we need to know exactly how similarly it patterns in Dutch and German. Here, it is important to note that /t/-reduction is more likely the more informal the speech. Mitterer and Ernestus (2006) found a much higher incidence of /t/-reduction in face-to-face conversations than in audio book recordings. This means that the pre-existing spontaneous speech corpora for Dutch and German cannot be used for a quantitative comparison, because it is difficult to judge whether the corpora are similar in formality (e.g., the German Kiel Corpus is based on a business-appointments scenario, and interlocutors address each other with the honorific “Sie” form). Therefore, we decided to run a production study, which allows us to exert the level of control necessary for a quantitative comparison.

Critically, we will focus on whether /t/-reduction in German is more likely after /s/ than after /n/, which is the pattern attested for Dutch (Mitterer and Ernestus, 2006). Mitterer et al. (2008) showed that, in perception, listeners take the context into account, that is, they are more likely to perceptually restore a reduced /t/ after /s/ than after /n/. This pattern is partially perceptually motivated: Listeners in fact have a hard time hearing the reduction of /t/ after /s/ (see Steriade, 2001; Mitterer et al., 2006a, for elaboration of such perceptual motivations of phonetic reductions). However, language learning also seems to play a role in this context effect, making it an interesting candidate for L2 effects.

Additionally, we investigated the pattern of /t/-reduction in nouns – in which the /t/ was part of the word’s stem – and /t/-reduction in verbs, in which /t/ is a verbal inflection for the third-person singular. There is evidence to suggest that morphological variables influence reduction processes (Guy and Boyd, 1990). The morphological influence may be different for German than for Dutch, because German is morphologically richer with many different verb conjugations, a three- rather than a two-gender system, and more grammatical cases for nouns. This is also the case for verb inflection. Where German uses four different conjugations (first sing.: -e; second sing.: -st; third sing. and second pl.: -t; first and third pl.: -en), Dutch has only three conjugations, which are also more consistently used, with a single conjugation for all plurals (-en) and a simpler scheme for the singular (first: Ø; second and third: -t).
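The two present-tense paradigms just described can be written down as small lookup tables. The following Python sketch is illustrative only: the dictionary layout and the names `GERMAN_ENDINGS`, `DUTCH_ENDINGS`, `distinct_endings`, and `cells_with_t` are assumptions made here, and the endings are simply those listed above, ignoring any irregularities.

```python
# Present-tense endings as listed in the text ("" stands for the null ending).
# Keys are (person, number); this layout is an illustrative assumption.
GERMAN_ENDINGS = {
    (1, "sg"): "e", (2, "sg"): "st", (3, "sg"): "t",
    (1, "pl"): "en", (2, "pl"): "t", (3, "pl"): "en",
}
DUTCH_ENDINGS = {
    (1, "sg"): "", (2, "sg"): "t", (3, "sg"): "t",
    (1, "pl"): "en", (2, "pl"): "en", (3, "pl"): "en",
}

def distinct_endings(paradigm):
    """Return the set of different endings used in a paradigm."""
    return set(paradigm.values())

def cells_with_t(paradigm):
    """Return the (person, number) cells whose ending is exactly -t."""
    return sorted(cell for cell, ending in paradigm.items() if ending == "t")

print(len(distinct_endings(GERMAN_ENDINGS)))  # 4
print(len(distinct_endings(DUTCH_ENDINGS)))   # 3
print(cells_with_t(GERMAN_ENDINGS))           # [(2, 'pl'), (3, 'sg')]
print(cells_with_t(DUTCH_ENDINGS))            # [(2, 'sg'), (3, 'sg')]
```

In both tables the third-person singular carries -t, which is the cell targeted by the verb task, while the remaining cells are distributed differently across the two languages.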

Experiment 1

In order to elicit casual speech in a controlled setting we used a blending task for nouns and a conjugation task for verbs. Both have previously been used successfully to elicit connected speech (Stephenson and Harrington, 2002; Zimmerer and Reetz, 2011). The tasks ask participants to produce a sentence that is only cued by words on the screen, that is, it is not a reading task. Instead, the tasks were tailored in such a way that participants had to formulate a sentence by themselves in order to draw attention away from the enunciation.

Method

Participants

Ten Dutch and 10 German speakers took part in this experiment. The Dutch participants were students at the University of Nijmegen, the Netherlands, and members of the Max Planck Institute’s subject pool. Some of the German participants were also taken from this population; others were employees at the Max Planck Institute with basic knowledge of Dutch. The German participants had on average 2.7 years of exposure to Dutch and more than 10 years of exposure to English. Participants were asked in what proportions they used their languages for study/work, in peer-groups, and with family. Dutch was the main language used for study/work purposes (74%), while language use with peer-groups was slightly dominated by German (49 vs. 41% for Dutch). The Dutch participants all had some formal teaching in German (5 years), but indicated that they did not use German for study, in peer-groups, or with family, with the exception of one participant who used German occasionally (10%) for study purposes. None of the participants reported any hearing loss. All were volunteers and received a small fee for participation.

Materials, design, and procedure

The experiment consisted of two production tasks and was constructed in such a way that both groups of participants performed the same tasks. The experiment was run on a standard PC, with the NESU package controlling stimulus presentation. Participants were tested one at a time in a sound-proof booth. They sat at a comfortable reading distance from the computer screen and had a microphone and a two-button response box in front of them.

The first part of the experiment examined /t/-reduction in verbs with a sentence generation task. Participants saw words in random order on the computer screen (e.g., in Dutch, bij/Maarten/de bushalte/wonen, Engl., “close to/Maarten/the bus stop/live”). They were instructed to produce a sentence with the third-person singular present – which ends in /t/ in both Dutch and German – using these words. The stem of the critical verb ended in either /n/ or /s/ and the verb was always presented in its infinitive form. In the example, the correct response was Maarten woont bij de bushalte (Engl., “Maarten lives close to the bus stop”). The words following the verb always started with a consonant to prevent resyllabification of the verb-final /t/ to the onset of the following word. Participants received 40 different stimulus sentences. The presentation of the sentences was random and different for every participant.

The second part of the experiment tested /t/-reduction after /n/ and after /s/ in proper names, using a blending task. Participants saw two non-existent place names (e.g., Toestwoud and Liekbeek for Dutch), and made a new place name with the first part of the first place name and the second part of the second place name (in this case Toestbeek). The first part of the place name ended in either /nt/ or /st/, so that half of the /t/s were preceded by /n/ and the other half by /s/. The new place name had to be produced in a sentence frame cued by the name of a store (e.g., groenteboer/Gemüsehändler, “greengrocer”) and a product (e.g., appels/Äpfel, Engl. “apples”), and the intended sentence was Bij de groenteboer in Toestbeek koop ik appels (“At the greengrocer in Toestbeek I buy apples”). Again, participants received 40 different sentence frames and the presentation of the sentences was random and different for every participant.

Both parts started with four practice trials. Each trial (experimental and practice trials) began with a blank screen. Then, the words were presented on the screen. After 1500 ms the message “Press the right button to continue” was displayed on the screen in either Dutch or German, so that participants could continue with the next sentence.

Results

The 1600 sentences (10 speakers of each language × 80 tokens) were analyzed for /t/-reduction in the critical words. On the basis of visual inspection of the sound files using PRAAT (Boersma and Weenink, 2005), the productions of /t/ were classified and it was judged whether the /t/ was present or not. The classification was done in accordance with the method employed in Mitterer and Ernestus (2006), who distinguished five variants of /t/ (canonical, weak and strong frication, closure-only, and complete deletion, see Figure 2 for re-synthesized examples). After this initial classification, the first two signals (full /t/ and strong frication) were coded as present, the others as deleted. One may wonder why the coding scheme does not include the option of glottal stops as a phonetic correlate of an underlying /t/ (glottal stops can sometimes replace /t/ in German). An analysis of the German Kiel Corpus (IPDS, 1994) shows, however, that the glottal stop is almost exclusively used for /t/ in a pre-schwa position (accounting for more than 93% of all /t/-glottal stop replacements), so it is unlikely to be frequent in the current environment.
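The two-step coding described above (a five-way classification collapsed into present vs. deleted) can be sketched as follows. This is a minimal illustration, not the scripts used in the study: the string labels, the function names `code_t` and `deletion_rate`, and the example tokens are all hypothetical.

```python
# Five variants of word-final /t/ (after Mitterer and Ernestus, 2006).
# The string labels are hypothetical names chosen for this sketch.
PRESENT_VARIANTS = {"canonical", "strong_frication"}
DELETED_VARIANTS = {"weak_frication", "closure_only", "complete_deletion"}

def code_t(variant):
    """Collapse a five-way label into the binary present/deleted coding."""
    if variant in PRESENT_VARIANTS:
        return "present"
    if variant in DELETED_VARIANTS:
        return "deleted"
    raise ValueError(f"unknown variant: {variant!r}")

def deletion_rate(labels):
    """Proportion of tokens coded as deleted."""
    coded = [code_t(v) for v in labels]
    return coded.count("deleted") / len(coded)

# Number of analyzed sentences: 2 languages x 10 speakers x 80 tokens.
n_sentences = 2 * 10 * 80
print(n_sentences)  # 1600

# Hypothetical labels for four tokens, not data from the experiment.
example = ["canonical", "weak_frication", "strong_frication", "complete_deletion"]
print(deletion_rate(example))  # 0.5
```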

Figure 1. Proportion of /t/-deletion in the same production tasks (Experiment 1) by Dutch and German speakers.

Figure 2. Realizations of word-final /t/ in four Dutch casual speech utterances from full production (top row) to deletion (bottom row). See text for details.

Figure 1 shows the results of the transcription. Overall, /t/ was deleted in 25% of the nouns and 40% of the verbs. This indicates that the tasks were successful in focusing attention away from careful enunciation, allowing the application of optional casual speech processes. It is also clear that the rate of /t/-reduction was not the same for all conditions. Dutch participants tended to delete /t/ more often if the /t/ was preceded by an /s/ rather than an /n/. This pattern was consistently observed for both nouns and verbs. German participants, however, showed a different pattern for verbs, with /t/-reduction in fact being more likely in verbs with an /nt/-coda than in verbs with an /st/-coda.

For the statistical analysis, the results were analyzed with a linear mixed-effects model with a binomial link function to account for the categorical nature of the dependent variable (cf. Dixon, 2008). Participant and item were entered as random factors with a maximal random effect structure and Noun/Verb, Native Language, and Preceding Context as fixed factors. The fixed factors were contrast-coded, with the preceding context /n/ and the Native Language German coded as −0.5. A positive regression weight would hence indicate that more /t/s were produced in the /s/-context, and by the Dutch participants. In contrast coding for binary variables, one level is coded as −0.5 and the other as 0.5, so that regression weights for simple effects in the regression models show the overall effect of a given variable for the complete data set (cf. Barr, 2008). This coding has the advantage that the predictors for the main effects are linearly independent from the predictor for the interaction. The analysis with all three factors showed a significant three-way interaction (b = 2.7, z = 1.97, p < 0.05). To understand the nature of this interaction, separate analyses were performed for nouns and verbs.
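The advantage of the ±0.5 coding can be checked numerically. The sketch below builds a fully balanced two-factor design (Preceding Context × Native Language) and compares contrast coding with 0/1 dummy coding. The balanced design and the helper names `design_rows` and `dot` are assumptions made for illustration. In this balanced case the main-effect columns are orthogonal to the interaction column under contrast coding, but not under dummy coding.

```python
from itertools import product

def design_rows(context_codes, language_codes):
    """One row per cell of a balanced Context x Language design."""
    rows = []
    for c, l in product(context_codes.values(), language_codes.values()):
        rows.append({"context": c, "language": l, "interaction": c * l})
    return rows

def dot(rows, a, b):
    """Dot product of two predictor columns."""
    return sum(r[a] * r[b] for r in rows)

# Contrast coding as in the text: /n/ and German are -0.5, /s/ and Dutch +0.5.
contrast = design_rows({"n": -0.5, "s": 0.5}, {"German": -0.5, "Dutch": 0.5})
# Dummy (0/1) coding, shown only for comparison.
dummy = design_rows({"n": 0, "s": 1}, {"German": 0, "Dutch": 1})

# Contrast coding: both main-effect columns are orthogonal to the interaction.
print(dot(contrast, "context", "interaction") == 0)   # True
print(dot(contrast, "language", "interaction") == 0)  # True
# Dummy coding: the interaction column overlaps with the main-effect columns.
print(dot(dummy, "context", "interaction"))           # 1
```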

Table 1 shows the regression weights for the regression models for the proper names and verbs. For proper names there were no significant differences between Dutch and German speakers. The Preceding Context did have a significant effect in that /t/ was more often reduced after /s/ than after /n/. For verbs, there were no main effects of Native Language or Preceding Context, but there was a significant interaction. Separate analyses for German and Dutch speakers showed that only the Dutch speakers showed a significant effect of Preceding Context, reducing /t/ more often after /s/ than after /n/ (b = −1.89, p < 0.05), while no such effect was observed for German speakers (p > 0.1).

Table 1. Experiment 1: Regression weights for the models for nouns and verbs.

Discussion

The results show that Dutch and German are similar but not identical with regard to /t/-reduction. In both languages, /t/-reduction was more likely after /s/ than after /n/. This replicates the results in the corpus studies of Mitterer and Ernestus (2006), and therefore raises the question why /t/ is more likely to be reduced after /s/. One possibility is that /t/ is more predictable after /s/ than after /n/, making it more prone to deletion according to hypotheses such as the “Smooth Signal Redundancy Hypothesis” (Aylett and Turk, 2006). To evaluate that, we calculated the type and token frequencies of the codas /n/ vs. /nt/ and /s/ vs. /st/ using frequencies from the SubtLex Corpus (Keuleers et al., 2010). As it turns out, the ratios (Frequency of /Ct/ divided by the Frequency of /C/) indicate that the likelihood of a coda with /t/ is higher for /n/ than for /s/ (Types: 1.54 vs. 0.16; Tokens: 0.14 vs. 0.10). That is, a /t/ is more predictable after /n/ than after /s/, so predictability cannot explain the higher reduction rate after /s/. Mitterer et al. (2008) have provided another explanation for the tendency to reduce /t/ especially after /s/. They argued that this is an example of a perceptual constraint that imposes itself on production. The reduction of /t/ is less salient after the spectrally similar /s/ than after the spectrally dissimilar /n/.
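The predictability measure used above (frequency of the /Ct/ coda divided by the frequency of the /C/ coda, computed over types and over tokens) can be sketched as follows. The toy lexicon is entirely hypothetical and is not drawn from the SubtLex Corpus. The function names are likewise assumptions. The sketch only shows how the two ratios are formed.

```python
def coda_ratio(freq_ct, freq_c):
    """Frequency of the /Ct/ coda divided by the frequency of the /C/ coda."""
    if freq_c <= 0:
        raise ValueError("freq_c must be positive")
    return freq_ct / freq_c

# Hypothetical toy lexicon: one (coda, token frequency) pair per word type.
# These numbers are invented for illustration and are NOT SubtLex data.
TOY_LEXICON = [("n", 120), ("n", 30), ("nt", 15),
               ("s", 200), ("st", 20), ("st", 5)]

def type_and_token_ratio(lexicon, consonant):
    """Return (type ratio, token ratio) for /Ct/ vs. /C/ codas."""
    plain = [f for coda, f in lexicon if coda == consonant]
    with_t = [f for coda, f in lexicon if coda == consonant + "t"]
    type_ratio = coda_ratio(len(with_t), len(plain))     # counts of word types
    token_ratio = coda_ratio(sum(with_t), sum(plain))    # summed frequencies
    return type_ratio, token_ratio

print(type_and_token_ratio(TOY_LEXICON, "n"))  # (0.5, 0.1)
print(type_and_token_ratio(TOY_LEXICON, "s"))  # (2.0, 0.125)
```

A higher ratio for a given consonant means that a following /t/ is more predictable in that coda.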

Our main focus was, however, to investigate the role of morphological differences. In line with our suspicion, reduction of a morphological /t/ was different in German than in Dutch. In Dutch, verbs patterned just as nouns, with more reduction after /s/ than after /n/ but, in German, /t/-reduction was independent of the preceding context. A possible explanation for this pattern is that the morphological /t/ has no special status in Dutch, a morphologically less rich language than German, and hence does not differ from a /t/ that has no morphological role. In the morphologically richer language German, however, the morphological status seems to block the influence of the preceding phonological context.

What do these results mean for the perception of /t/-reduction in Dutch by German learners? If the similarity of the L1 and the L2 pattern governs the perception of casual speech processes in L2, German listeners may be able to perform very similarly to native Dutch listeners for cases in which the /t/ has no morphological role. Then, the L1 knowledge of German learners matches the Dutch L2. However, /t/-reduction in verbs may be more of a challenge for German learners of Dutch, because, here, the L1 knowledge does not fit the pattern of the L2 that well.

To investigate this, we presented German and Dutch listeners with varying amounts of acoustic evidence for word-final /t/ in verbs, nouns, and adjectives. Five realizations of /t/, from full production to complete deletion, were presented in two acoustic contexts, after /n/ (where /t/-reduction is unlikely) and after /s/ (where /t/-reduction occurs frequently). These five levels of the /t/-Ø continuum are based on findings from a corpus study on word-final /t/ in Dutch (Mitterer and Ernestus, 2006). In each sentence, listeners judged whether the target word ended in /t/ or not.

If non-native listeners have difficulty in exploiting the phonological cues for /t/, they might base their judgments of whether the target word ended in a /t/ or not on higher-level information such as lexical status and, in verbs, syntax. Therefore, we also added as a factor whether the /t/ is “prescribed” by syntax or lexical status. In Experiment 2, the /t/ had no morphological role, but was part of the stem of a noun or an adjective, and lexical information prescribed the presence of a /t/ (charmant “charming,” charman being a Dutch non-word) or not (kanon, “gun,” kanont being a Dutch non-word). In Experiment 3, target words were verbs (e.g., ren “run,” kus “kiss”), which makes it possible for listeners to use grammar to predict whether or not the ending should be /t/. The Dutch present tense third-person singular inflection is /t/ (e.g., zij rent, “she runs”) while the first-person inflection is null (e.g., ik ren, “I run”). Hence, listeners should be biased to expect a /t/ at the end of a verb that is preceded by the third-person singular pronoun zij and biased to expect no /t/ at the end of a verb that is preceded by the first-person singular pronoun ik.

Experiment 2

In this experiment, we investigated the perception of reduced /t/. As materials we used synthesized speech (following Mitterer and Ernestus, 2006). It may be argued that this defeats our purpose of investigating speech perception in ecologically valid listening situations. However, Mitterer and McQueen (2009) have shown that all the basic effects reported by Mitterer and Ernestus (2006) with synthetic speech can be replicated with natural speech. As argued in Mitterer and McQueen (2009, p. 258), the use of synthetic speech hence does not lead to ecologically invalid results, as long as the acoustic patterns are based on patterns observed in spontaneous speech (as done by Mitterer and Ernestus, 2006). Stated otherwise, the unusual voice characteristics of synthetic speech do not seem to influence how listeners deal with reduced speech.