Foreign Languages Sound Fast: Evidence from Implicit Rate Normalization

Bosker, Hans Rutger; Reinisch, Eva

doi:10.3389/fpsyg.2017.01063

ORIGINAL RESEARCH article

Front. Psychol., 28 June 2017

Sec. Psychology of Language

Volume 8 - 2017 | https://doi.org/10.3389/fpsyg.2017.01063

Hans Rutger Bosker¹,²*

Eva Reinisch³

¹Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
²Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands
³Institute of Phonetics and Speech Processing, Ludwig Maximilian University of Munich, Munich, Germany

Anecdotal evidence suggests that unfamiliar languages sound faster than one’s native language. Empirical evidence for this impression has, so far, come from explicit rate judgments. The aim of the present study was to test whether such perceived rate differences between native and foreign languages (FLs) have effects on implicit speech processing. Our measure of implicit rate perception was “normalization for speech rate”: an ambiguous vowel between short /a/ and long /a:/ is interpreted as /a:/ following a fast but as /a/ following a slow carrier sentence. That is, listeners did not judge speech rate itself; instead, they categorized ambiguous vowels whose perception was implicitly affected by the rate of the context. We asked whether a bias towards long /a:/ might be observed when the context is not actually faster but simply spoken in a FL. A fully symmetrical experimental design was used: Dutch and German participants listened to rate matched (fast and slow) sentences in both languages spoken by the same bilingual speaker. Sentences were followed by non-words that contained vowels from an /a-a:/ duration continuum. Results from Experiments 1 and 2 showed a consistent effect of rate normalization for both listener groups. Moreover, for German listeners, across the two experiments, foreign sentences triggered more /a:/ responses than (rate matched) native sentences, suggesting that foreign sentences were indeed perceived as faster. Moreover, this FL effect was modulated by participants’ ability to understand the FL: those participants that scored higher on a FL translation task showed less of a FL effect. However, opposite effects were found for the Dutch listeners. For them, their native rather than the FL induced more /a:/ responses. Nevertheless, this reversed effect could be reduced when additional spectral properties of the context were controlled for. Experiment 3, using explicit rate judgments, replicated the effect for German but not Dutch listeners. 
We therefore conclude that the subjective impression that FLs sound fast may have an effect on implicit speech processing, with implications for how language learners perceive spoken segments in a FL.

Introduction

It is a common impression that foreign languages (FLs) seem to be spoken faster than one’s own native language. This subjective impression manifests itself, for instance, in remarks of many language learners, frequently asking their interlocutors if they can please slow down. The effect has been termed the ‘Gabbling Foreigner Illusion’ by Cutler (2012, p. 338) and has attracted the attention of speech scientists for many decades (cf. Osser and Peng, 1964).

Empirical evidence for this FL effect (as it will be referred to throughout this paper) in speech rate perception has been provided with tasks in which listeners had to judge or sort the speech rate of sentences in different languages. For instance, Schwab and Grosjean (2004) presented recordings of French short stories, read at various rates, to a group of 96 native French speakers (i.e., native listeners) and a group of 96 Swiss German speakers (i.e., non-native listeners). They observed a clear FL effect in the rate judgments collected: on average, non-native listeners reported a higher speaking rate compared to the native listeners, even though both groups had been presented with the same French recordings. Moreover, the authors found a negative correlation between this FL effect and FL comprehension scores: the better the learners were able to understand the content of the stories, the smaller the FL effect (i.e., the smaller the difference in rate judgments to the native listeners).

Similar evidence has been found in a symmetrical experimental design by Pfitzinger and Tamashima (2006), who asked German and Japanese listeners to order sentences in both languages according to their perceived rate. It appeared that Japanese listeners overestimated the speech rate of German by 7.5% (relative to the German participants), and German listeners overestimated Japanese speech rate by 9.1% (relative to the Japanese participants).

Critically, the use of a symmetrical design and the presence of the FL effect in both listener groups in Pfitzinger and Tamashima (2006) suggest that its origin cannot solely be explained on the basis of differences in the rhythmic structure of the two languages. German is considered a “stress timed” language, where stressed syllables alternate with unstressed syllables (Grabe and Low, 2002). Japanese, in contrast, is considered a “mora-timed” language (Ramus et al., 1999). Due to these differences in rhythm, the two languages differ in the number and nature of allowed syllable structures; for instance, German allows for more complex structures than Japanese. This in turn could have influenced how speech rate is perceived. If speech rate is measured as the number of syllables per second, rate could be expected to be higher for Japanese than German since potentially more syllables fit into a second given the simpler syllable structures in Japanese. However, despite these differences in language structure as well as potential differences in processing strategies associated with rhythm (Cutler, 2012), both listener groups judged the FL as faster.

Interestingly, empirical evidence for the FL effect has even been found in closely related language pairs, such as French and Spanish that are both considered to be “syllable timed” languages (Ramus et al., 1999). Schwab (2014) collected rate judgments from native (L1 Spanish) and non-native (L1 French) speakers of Spanish and showed that the non-native French speakers overestimated the speech rate in Spanish. Differences in rhythmic patterns between languages are hence unlikely to cause the FL effect.

This leads to the question of the psycholinguistic origin of the FL effect. One suggestion has been that it relates to speech segmentation strategies: resolving continuous speech into words is less efficient in non-native languages than in one’s native language (Cutler et al., 1983, 1986, 1989; Cutler, 2012). Language skills and knowledge are weaker in non-native listeners (Segalowitz, 2010), and as a consequence non-native listeners cannot draw on the same prosodic, phonotactic, and lexical strategies as native listeners can to efficiently extract words from continuous speech. Thus, their segmentation of continuous speech produced in a FL is slowed.

Neurophysiological support for delayed segmentation in non-native listeners has been provided by an ERP study by Snijders et al. (2007). Analyses of ERP responses to word repetitions in isolation revealed no difference between natives and non-natives: both groups showed a more positive ERP response to later presentations of the same word. However, when the word repetitions were embedded in continuous speech, ERP repetition effects were only observed in the native listeners, not in the non-native listeners. This indicates that segmentation and detection of words in continuous speech are particularly difficult for non-native listeners and hence could indeed relate to the FL effect.

So far, the implications of the FL effect for spoken communication have been limited to the overall impression that the listener has of the speech rate of a particular speaker. That is, researchers have only studied the FL effect by collecting explicit rate judgments. Participants in the studies introduced earlier were explicitly instructed to pay close attention to the speech rate in the speech materials and to provide evaluative judgments about the speech rate of a given stimulus after the stimulus had finished. Such experimental paradigms do not allow assessment of how the FL effect affects the cognitive processes involved in online speech comprehension. Moreover, because the judgments are provided relatively late in perceptual processing, they can be biased by many other factors such as stereotypes about how fast a certain language sounds. In fact, acoustic measures of speed of articulation have been shown to only explain 53% of the variance of explicitly perceived speed judgments (Bosker et al., 2013).

Therefore, the present study investigated whether and how the FL effect would impact online speech processing. Rather than collecting explicit rate judgments, speech rate perception was tested implicitly by means of the ‘rate normalization’ paradigm.

It has long been known that the perceived speech rate of a surrounding sentence can influence the perception of subsequent target words (Pickett and Decker, 1960). For instance, in the German minimal word pair bannen /banǝn/ “to ban” – Bahnen /ba:nǝn/ “tracks,” the vowel /a/ in the first syllable is short in bannen but longer in Bahnen. The perception of a vowel with a manipulated duration ambiguous between /a/ and /a:/ may be biased towards a particular interpretation depending on the perceived speech rate of the surrounding sentence (Reinisch, 2016a,b). That is, if the target vowel is presented following a fast carrier sentence, target perception is biased towards the long vowel /a:/. If it is presented in a slow carrier sentence, perception is biased towards short /a/. This effect has been taken as evidence that listeners interpret segmental durations relative to the surrounding speech rate, hence referred to as ‘rate normalization.’ The measure can be taken as measuring ‘implicit’ rate perception since listeners are asked to identify a target word rather than directly judge the rate of the context.

The present study adapted the ‘rate normalization’ paradigm to investigate implicit speech rate perception in a FL. Specifically, we asked whether a ‘rate normalization’ context effect (i.e., fast speech biasing perception towards a long vowel /a:/) may be observed when the context is not actually faster but simply spoken in a FL.

Note that a previous study (Bosker et al., 2017) has used implicit rate normalization to demonstrate effects of cognitive load on the perception of speech rate. In that study, carrier sentences were shown to be perceived as faster when listeners were taxed by a simultaneously presented difficult visual search task. The same principle may apply to the perception of a FL: words in a FL are harder to segment out of the continuous speech stream (Snijders et al., 2007), thus taxing the perceptual system, and consequently inducing a higher perceived speech rate.

To test the FL effect, we adopted a fully symmetrical design, with parallel experiments involving two listener groups listening to two different languages. The languages studied here were German and Dutch because both languages have a phonological /a-a:/ vowel duration contrast (for details, see Method), allowing for comparison of /a-a:/ categorization across the two languages. Note that, despite related vocabulary, German and Dutch are not mutually comprehensible without explicit focus or prior training. Importantly, the use of two closely related languages with similar grammar, syllable structures, and rhythm, allowed for maximal control of these structural factors while only varying the language.

If the FL effect (i.e., the impression that FLs sound fast) does not only impact explicit evaluative judgments but also the online processing of speech, we may find that German listeners report more long target vowels (i.e., /a:/) after Dutch carrier sentences (a language unknown to them) than after rate-matched German sentences (their native language). The opposite should hold for Dutch listeners (i.e., German as their FL should sound faster). Because the two languages are closely related, the presence of a FL effect would suggest that it is indeed knowledge of the language that drives the effect.

Moreover, along these lines and based on the studies by Schwab and Grosjean (2004) and Schwab (2014), we would expect this Language effect to interact with listeners’ ability to understand the FL: listeners who understand more words in the FL – here also referred to as higher proficiency in the FL¹ – should show less of a FL effect.

Experiment 1

Method

Participants

A group of native Dutch participants (N = 27; 18 females, 9 males; M_age = 23) with little knowledge of German was recruited from the Max Planck Institute’s participant pool. Another group of native German participants (N = 23; 15 females, 8 males; M_age = 23) with little knowledge of Dutch was recruited. Of these 23 German participants, 20 participants were recruited from the student population at the University of Munich; the remaining three participants were recruited from the Max Planck Institute’s participant pool. All participants reported normal hearing and gave written informed consent as approved by the Ethics Committee of the Social Sciences department of Radboud University (project code: ECSW2014-1003-196). Overall proficiency in the FL was assessed by means of self-reported listening skills. Participants rated “how well you understand spoken [Dutch/German]” on a scale from 1 (“absolutely no understanding”) to 7 (“very much understanding”): M_{Dutch Group} (SD) = 2.9 (1.0); M_{German Group} (SD) = 0.8 (1.4); t(48) = 6.158, p < 0.001.

Design and Materials

A female German-Dutch bilingual speaker (bilingual from birth; no accent in either language) was recorded producing 30 sentences in German and 30 sentences in Dutch. The Dutch sentences were paraphrases of the German sentences, matching in number of syllables (see Appendix). None of the sentences contained any /a/ or /a:/ vowels since these made up the critical contrast for the targets. Each sentence was recorded with one of three minimal pairs in sentence-final position, selected to be non-words in either language: faft – faaft, fapt – faapt, fap – faap.

From these recordings, carrier sentences (i.e., all speech up to target onset) were excised. Using PSOLA in Praat (Boersma and Weenink, 2016), the total duration of each Dutch–German sentence pair was set to the mean duration of that pair. That is, the speaking rate of each sentence pair was equalized. Since the bilingual speaker produced the sentences at a rather slow speech rate, these (duration matched) carrier sentence pairs formed the slow condition in the experiments. Linear compression by a factor of 0.6 resulted in the fast condition.
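The duration matching described above amounts to simple arithmetic, sketched here in Python. This is an illustration only, not the Praat/PSOLA procedure itself: the function name and the example durations (2.0 s and 2.4 s) are hypothetical, and the sketch assumes that "compression by a factor of 0.6" means that the fast duration is 0.6 times the slow duration.

```python
def match_pair_durations(dur_dutch_s, dur_german_s, compression=0.6):
    """Plan the duration manipulation for one Dutch-German sentence pair.

    Both members of the pair are set to the mean duration of the pair
    (slow condition); the fast condition is assumed to be that duration
    multiplied by the compression factor.
    """
    target = (dur_dutch_s + dur_german_s) / 2.0  # mean duration of the pair
    return {
        "slow_duration": target,
        "scale_dutch": target / dur_dutch_s,    # > 1 lengthens, < 1 shortens
        "scale_german": target / dur_german_s,
        "fast_duration": target * compression,
    }


# Hypothetical durations in seconds, chosen only for illustration.
plan = match_pair_durations(2.0, 2.4)
```

With these made-up values, the shorter sentence is lengthened and the longer one shortened so that both end up at the pair mean before compression is applied.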

Target non-words were manipulated with the aim to create an /a-a:/ duration continuum that is categorized similarly by Dutch and German listeners. In German, the contrast between /a/ (e.g., bannen “to ban”) and /a:/ (e.g., Bahnen “tracks”) is cued by temporal properties alone (i.e., without consistent co-variation of spectral properties; Jessen, 1993; Pätzold and Simpson, 1997; Reinisch, 2016a,b), with /a/ having a shorter duration than /a:/. In Dutch, the vowel contrast is cued by both spectral (/ɑ/ has relatively low formant values, particularly F2) and temporal properties (/ɑ/ has a relatively short duration; Adank et al., 2004; Escudero et al., 2009; Reinisch and Sjerps, 2013; Bosker, 2017a; Bosker et al., 2017). Because temporal variation influences both German and Dutch listeners in /a-a:/ categorization, a duration continuum from /a/ to /a:/ was created, while spectral properties of all steps on the continuum were controlled to be ambiguous for all listeners.

One particular /a:/ vowel token was selected for manipulation using Burg’s LPC method and PSOLA in Praat. A two-dimensional spectral-temporal continuum was created around the average F2 and duration values of the speaker in both languages. Based on a pretest of this two-dimensional continuum with Dutch (N = 15) and German (N = 12) listeners (none participated in any of the other experiments), the most ambiguous spectral values (F1 = 655 Hz; F2 = 1280 Hz) were selected to be used in a five-step duration continuum from 120 to 160 ms in steps of 10 ms for the main experiments. These five spectrally ambiguous vowel tokens were categorized similarly by Dutch (average % /a:/ categorization: 55%) and German listeners (average % /a:/ categorization: 51%). This observation was confirmed with a Generalized Linear Mixed Model (GLMM) with a logistic linking function that was fit with the predictors Vowel Duration, Listener Group, their interaction and with Participant as a random factor (β = 0.299; p > 0.35). These vowel tokens were spliced into three consonantal frames (/f_p/; /f_pt/; /f_ft/) resulting in 15 target non-words.

Procedure

In Experiment 1, each trial started with the presentation of a fixation cross. After 500 ms, the carrier sentence was presented, followed by a silent interval of 100 ms, followed by the target. At target offset, the fixation cross was replaced by a screen with two response options, one on the left, one on the right (position of /a/-/a:/ non-words counter-balanced across participants). Participants entered their response as to which of the two response options they heard (fap or faap, etc.) by pressing “1” for the option on the left, or “0” for the option on the right. After their response (or timeout after 4 s), the screen was replaced by an empty screen for 500 ms, after which the next trial was initiated.

Language (native vs. foreign) was blocked, with order counter-balanced across participants. Participants were presented with 15 carriers in their L1 and the other 15 carriers in their FL to avoid carrier familiarity effects across blocks. One language block included 150 randomized trials: 15 carriers × 2 rates × 5 vowel steps; the particular consonantal frame was selected using a Latin Square design. Participants were allowed to take a break in between language blocks.
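The composition of one language block can be made concrete with a short Python sketch. The crossing of 15 carriers, 2 rates, and 5 vowel steps follows the description above; the rotation used to assign a consonantal frame to each trial is a hypothetical stand-in for the Latin Square, whose exact layout is not specified here, and the function and field names are illustrative.

```python
from itertools import product
import random

CARRIERS = list(range(1, 16))            # 15 carrier sentences per block
RATES = ("slow", "fast")
VOWEL_MS = (120, 130, 140, 150, 160)     # five-step duration continuum
FRAMES = ("fap/faap", "fapt/faapt", "faft/faaft")  # the three non-word pairs


def build_block(participant_index, seed=0):
    """Return the 150 randomized trials of one language block."""
    trials = []
    for carrier, rate, vowel_ms in product(CARRIERS, RATES, VOWEL_MS):
        step = VOWEL_MS.index(vowel_ms)
        # Hypothetical rotation standing in for the Latin Square assignment.
        frame = FRAMES[(carrier + step + participant_index) % len(FRAMES)]
        trials.append(
            {"carrier": carrier, "rate": rate, "vowel_ms": vowel_ms, "frame": frame}
        )
    random.Random(seed).shuffle(trials)  # randomize trial order within the block
    return trials


block = build_block(participant_index=0)
```

Under this assumed rotation, each frame occurs equally often within a block, and shifting `participant_index` rotates which frame is paired with which carrier and vowel step.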

In order to assess participants’ recognition accuracy of the FL materials, participants were asked to translate the first 15 trials of the FL block into their L1. These first 15 trials all involved unique carrier sentences that participants had not heard before. Participants entered their translation after having given their categorization response; that is, they typed out their translation on the computer keyboard. Participants’ recognition accuracy was assessed by percentage of keywords correct. In order to match the L1 and FL blocks, participants also transcribed the first 15 trials of the L1 block.

Results

The Dutch group performed significantly better at translating German than the German group did in translating Dutch (in % keywords correct): M_{Dutch Group} (SD) = 54.3 (36.1); M_{German Group} (SD) = 30.9 (33.2); t(724) = 8.892, p < 0.001.

Before analyzing the categorization data, trials with missing categorization responses (n = 53; <1%) were excluded from analyses. Categorization data, calculated as the percentage of /a:/ responses (% /a:/), are presented in Figure 1, separately for each listener group. As expected, an increase in target vowel duration led all listeners to report more /a:/ responses (all lines have a positive slope). The difference between the solid and dashed lines indicates an influence of the carrier’s speech rate, with faster speech rates (dashed lines) biasing perception towards the long vowel /a:/. Importantly, differences between the blue and red lines indicate effects of the precursor’s language, and it would seem that the language effect is in opposite directions for the two listener groups.

FIGURE 1. Average categorization data (in % /a:/ responses) of Experiment 1. (Left) Data from the Dutch listener group; (Right) data from the German listener group.

We quantified these effects using a GLMM (Quené and Van den Bergh, 2008) with a logistic linking function as implemented in the lme4 library, version 1.0.5 (Bates et al., 2015) in R (R Development Core Team, 2012). The dependent variable was response /a:/ (coded as 1) or /a/ (coded 0). Fixed effects were Vowel Duration (continuous predictor, centered, and scaled around the mean), Carrier Rate (categorical predictor, with slow speech rate coded as -0.5 and fast speech rate as +0.5), Language (categorical predictor, with L1 coded as -0.5 and FL coded as +0.5), Listener Group (categorical predictor, with Dutch coded as -0.5 and German coded as +0.5), and the interaction between Language and Listener Group. The use of deviation coding of two-level categorical factors (i.e., coded with +0.5 and -0.5) allows us to test main effects of these predictors, since with this coding the grand mean is mapped onto the intercept. Participant and Carrier Item were entered as random factors with by-participant and by-carrier random slopes for Carrier Rate and Language (Barr et al., 2013). A more extended model also including random slopes for Listener Group failed to converge.
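The rationale for deviation coding can be illustrated with a toy example in Python. The sketch below uses ordinary least squares on made-up, balanced values rather than the logistic GLMM of the study; all numbers and names are hypothetical. It shows that with -0.5/+0.5 coding the intercept equals the grand mean, whereas with 0/1 (treatment) coding the intercept equals the mean of the reference level.

```python
def ols_simple(x, y):
    """Closed-form ordinary least squares for one predictor: y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up outcome values for a balanced two-level factor (L1 vs. FL).
y_l1 = [0.2, 0.4]   # mean 0.3
y_fl = [0.8, 1.0]   # mean 0.9
y = y_l1 + y_fl

deviation = [-0.5, -0.5, 0.5, 0.5]   # L1 = -0.5, FL = +0.5
treatment = [0, 0, 1, 1]             # L1 = 0 (reference), FL = 1

dev_intercept, dev_slope = ols_simple(deviation, y)
trt_intercept, trt_slope = ols_simple(treatment, y)
```

In both codings the slope is the FL minus L1 difference; only the meaning of the intercept, and hence of the other coefficients evaluated at it, changes.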

The GLMM revealed a significant effect of Vowel Duration (β = 0.792, z = 38.430, p < 0.001), with longer vowel durations increasing the percentage of /a:/ responses. The effect of Carrier Rate (β = 0.483, z = 5.700, p < 0.001) indicated that the faster the carrier’s speech rate, the higher the percentage of /a:/ responses. An effect of Language (β = -0.343, z = -2.860, p = 0.004) indicated that there was a lower percentage of /a:/ responses when the vowel was preceded by a FL carrier. However, an interaction between Language and Listener Group (β = 0.976, z = 4.280, p < 0.001) revealed that this only held for the Dutch group; the German group showed an opposite pattern, with a higher percentage of /a:/ responses after FL carriers. Taking categorization differences as indices of perceived rate, this suggests that, while for Dutch listeners foreign speech appeared to sound slower than their native language, Germans did show the expected pattern that FL speech sounds fast.
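To see how the interaction reverses the Language effect across groups, the group-specific effects implied by the reported coefficients can be worked out from the deviation coding. The Python sketch below is illustrative arithmetic on the fixed-effect estimates given above (in log-odds); it is not a re-analysis, and the variable names are ours.

```python
import math

BETA_LANGUAGE = -0.343        # main effect of Language (FL vs. L1), log-odds
BETA_LANG_X_GROUP = 0.976     # Language x Listener Group interaction
GROUP_CODE = {"Dutch": -0.5, "German": 0.5}   # deviation coding of Listener Group


def language_effect(group):
    """Language effect (log-odds) implied by the model for one listener group."""
    return BETA_LANGUAGE + GROUP_CODE[group] * BETA_LANG_X_GROUP


effects = {group: language_effect(group) for group in GROUP_CODE}
# Dutch:  -0.343 - 0.488 = -0.831 (fewer /a:/ responses after FL carriers)
# German: -0.343 + 0.488 =  0.145 (more /a:/ responses after FL carriers)

# Odds ratios below 1 mean lower odds of an /a:/ response after FL carriers.
odds_ratios = {group: math.exp(beta) for group, beta in effects.items()}
```

The implied effect is negative for the Dutch group and positive for the German group, which matches the opposite patterns described in the text.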

In order to test whether the Language effects observed were modulated by participants’ ability to understand the FL, the GLMM was extended with the predictor Translation Accuracy (continuous predictor, centered, and scaled around the mean), and the interactions between Translation Accuracy and other fixed effects. This extended GLMM modeled the data marginally better [χ²(4) = 8.339, p = 0.079] than the initial model reported above. It revealed similar effects as the previous model (i.e., effects of Vowel Duration, Carrier Rate, Language, and Language × Listener Group interaction); however, it also showed a three-way interaction between Language, Listener Group, and Translation Accuracy (β = -0.245, z = -2.680, p = 0.007). Post hoc analyses, run on the data from the Dutch and German listener groups separately, revealed that this three-way interaction is explained by a negative effect of Translation Accuracy on the Language effect in the German group (β = -0.130, z = -2.029, p = 0.042), but a positive effect of Translation Accuracy on the Language effect in the Dutch group (β = 0.128, z = 1.989, p = 0.047; see Figure 2). This suggests that, for the German group, the better the Germans understood the FL, the less of a difference there was between their native and FL categorization patterns. That is, the more ‘proficient’ the German listener, the less fast Dutch sounds to them (in line with our predictions). However, the post hoc analyses for the Dutch group suggest that the better a Dutch listener understands German, the faster German sounds (contrary to our predictions).
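The model comparison reported above is a likelihood-ratio test, and its p-value can be checked by hand. The Python sketch below uses the closed-form survival function of the chi-square distribution for even degrees of freedom, so no statistics library is needed; the helper name is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k: P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0    # j = 0 term of the sum
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j   # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total


# Reported comparison of the extended and initial models: chi-square(4) = 8.339.
p_value = chi2_sf_even_df(8.339, 4)
```

The result is close to 0.08, in line with the marginal improvement reported for the extended model.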

FIGURE 2. Individual participants’ foreign language (FL) effect (y-axis; calculated as % /a:/ responses in FL minus L1; positive values indicate a higher percentage /a:/ responses in the FL) plotted against individual participants’ translation accuracy (x-axis; in % keywords correct) in the FL. German participants are indicated by green “G”; Dutch participants by orange “NL.” The green line gives the regression line for the German group; the orange line gives the regression line for the Dutch group.
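The per-participant FL effect plotted on the y-axis of Figure 2 can be computed as follows. This is a minimal Python sketch; the response vectors are made up, and the function names are ours.

```python
def percent_long(responses):
    """Percentage of /a:/ responses, with /a:/ coded as 1 and /a/ as 0."""
    return 100.0 * sum(responses) / len(responses)


def fl_effect(fl_responses, l1_responses):
    """% /a:/ responses in the FL block minus % /a:/ responses in the L1 block."""
    return percent_long(fl_responses) - percent_long(l1_responses)


# Made-up responses for one participant: 75% /a:/ in the FL, 50% in the L1.
example_effect = fl_effect([1, 1, 0, 1], [1, 0, 0, 1])
```

A positive value, as in this toy case, corresponds to more /a:/ responses after FL carriers than after L1 carriers.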

Discussion

Experiment 1 found partial support for the hypothesis that a FL sounds fast, with consequences for online speech processing. German listeners indeed reported a higher percentage of long vowel (/a:/) responses when the target vowel followed a FL carrier sentence compared to a (rate matched) L1 carrier sentence. This suggests that when the German participants listened to Dutch (to them, a FL), they perceived the carrier sentence as relatively fast, biasing their perception of subsequent ambiguous vowels towards the long vowel /a:/; similar to how actually (acoustically) fast speech biases perception towards /a:/. Moreover, a three-way interaction indicated that this Language effect in the German group was modulated by their ability to comprehend the Dutch sentences: the better they understood the sentences, the less fast they sounded (i.e., the fewer /a:/ responses).

However, the Dutch participant group showed the opposite pattern. Where the German group reported more /a:/ responses after listening to a FL (Dutch) carrier sentence, the Dutch participants reported fewer /a:/ responses after listening to their FL (German). This would suggest that, to Dutch listeners, German actually sounds slow relative to Dutch, in contrast to our predictions. Moreover, an unexpected three-way interaction suggested that the better the Dutch listeners understood the German sentences, the faster German sounded to them.

In Experiment 1, the German and Dutch carrier sentences were matched in their temporal characteristics: both members of each sentence pair had the same number of syllables and the exact same sentence duration. However, the spectral properties of the carrier sentences were not controlled. Note that, although Dutch and German are closely related languages and we used close paraphrases of the sentences in both languages (see Appendix), the vowels occurring in the Dutch and German sentences differed (i.e., as part of the different vocabularies). This difference in vowels meant that the average formant values of the Dutch and German carrier sentences differed despite the fact that the same bilingual speaker had produced the two sentence sets. Specifically, the Dutch average F2 was lower (F2 = 1739 Hz [SD = 149]) than the German average F2 (F2 = 1865 Hz [SD = 143]; t(29) = -4.082; p < 0.001).

Because the Dutch /ɑ-a:/ contrast is also cued by spectral properties, its perception is sensitive to the spectral properties of the sentence context as well. For instance, raising the average F2 in the surrounding sentence may bias Dutch listeners towards reporting fewer /a:/ targets (Reinisch and Sjerps, 2013; Bosker et al., 2017). This process, known as spectral normalization (Sjerps et al., 2011), may potentially explain why, in Experiment 1, the Dutch listeners reported fewer /a:/ responses after the German carrier sentences with a relatively higher average F2. The different vowels in the German sentences, with a relatively high average F2, may have induced spectral normalization in the Dutch listeners, biasing their perception of the target vowels towards /ɑ/. In contrast, in German, the /a-a:/ contrast is a temporal one that is likely not sensitive to spectral context effects. Therefore, it could be the case that the difference in formants between the Dutch and German carrier sentences influenced the Dutch group but not the German group. Experiment 2 was designed to investigate this potential explanation by matching the average second formant values of the Dutch and German sentences.