# LEXICAL TONE PERCEPTION IN INFANTS AND YOUNG CHILDREN: EMPIRICAL STUDIES AND THEORETICAL PERSPECTIVES

EDITED BY: Leher Singh, Denis Burnham, Jessica Hay, Liquan Liu and Karen Mattock

PUBLISHED IN: Frontiers in Psychology

#### Frontiers Copyright Statement

© Copyright 2007-2019 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.

The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.

Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.

Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.

As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.

All copyright, and all rights therein, are protected by national and international copyright laws.

The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use. ISSN 1664-8714 ISBN 978-2-88963-061-5 DOI 10.3389/978-2-88963-061-5


## LEXICAL TONE PERCEPTION IN INFANTS AND YOUNG CHILDREN: EMPIRICAL STUDIES AND THEORETICAL PERSPECTIVES

Topic Editors:

Leher Singh, National University of Singapore, Singapore
Denis Burnham, Western Sydney University, Australia
Jessica Hay, University of Tennessee, United States
Liquan Liu, Western Sydney University, Australia; University of Oslo, Norway
Karen Mattock, Western Sydney University, Australia

Image: Lexical Tone Perception in Infants and Young Children: Empirical studies and theoretical perspectives by Zheng Bao is licensed under CC-BY

In psycholinguistic research there has traditionally been a strong emphasis on understanding how particular types of languages are processed and learned. In particular, Romance and Germanic languages (e.g., English, French, German) have, until recently, received more attention than other types, such as Chinese languages. This has led to selective emphasis on the phonological building blocks of European languages, consonants and vowels, to the exclusion of lexical tones which, like consonants and vowels, determine lexical meaning, but unlike consonants and vowels are based on pitch variations. Lexical tone is pervasive; it is used in at least half of the world's languages (Maddieson, 2013), e.g., most Asian and some African, Central American, and European languages. This Research Topic brings together a collection of recent empirical research on the processing and representation of lexical tones across the lifespan, with an emphasis on advancing knowledge of how tone systems are acquired.

The articles focus on various aspects of tone: early perception of tones, influences of tone on word learning, the acquisition of new tone systems, and production of tones. One set of articles reports on tone perception at the earliest stage of development, in infants learning either tone or non-tone languages. Tsao and Chen et al. demonstrate that infants' sensitivity to Mandarin lexical tones, as well as pitch, improves over the first year of life in native and non-native learners, in contrast to traditional accounts of perceptual narrowing for consonants and vowels. Götz et al. report a different pattern of perception for Cantonese tones and further demonstrate influences of methodological approaches on infants' tone sensitivity. Fan et al. demonstrate that sensitivity to less well-studied properties of tone languages, such as neutral tone, may develop after the first year of life. Cheng and Lee ask a similar question in an electrophysiological study and report effects of stimulus salience on infants' neural response to native tones.

In a complementary set of studies focused on tone sensitivity in word learning, Burnham et al. demonstrate that infants bind tones to newly learned words if they are learning a tone language, either monolingually or bilingually, although object-word binding was also influenced by the properties of individual tones. Liu and Kager chart a developmental trajectory over the second year of life in which infants narrow in their interpretation of non-native tones. Choi et al. investigate how learning a tone language can influence uptake of other suprasegmental properties of language, such as stress, and demonstrate that native tone sensitivity in children can facilitate stress sensitivity when learning a stress-based language. Finally, two studies focus on sensitivity to pitch in a sub-class of tone languages: pitch accent languages. In a study on Japanese children's abilities to recognise words they know, Ota et al. demonstrate a limited sensitivity to native pitch contrasts in toddlers. In contrast, Ramachers et al. demonstrate comparatively strong sensitivity to pitch in native and non-native speakers of a different pitch accent system (Limburgian) when learning new words.

Several studies focus on learning new tone systems. In a training study with school-aged children, Kasisopa et al. demonstrate that tone language experience increases children's abilities to learn new tone contrasts. Poltrock et al. demonstrate similar advantages of tone experience in learning new tone systems in adults. In an electrophysiological study, Liu et al. demonstrate order effects in adults' neural responses to new tones, discussing implications for learning tone languages as an adult. Finally, Hannah et al. demonstrate that extralinguistic cues, such as facial expression, can support adults' learning of new tone systems.

Three studies investigate tone production. Rattanasone et al. demonstrate kindergartners' asynchronous mastery of tones: tone sandhi forms are acquired later than base forms. In a study interrogating a corpus of adult tone production, Han et al. demonstrate that mothers produce tones in a distinct manner when speaking to infants; tone differences are emphasised more when speaking to infants than to adults. Combining perception and production of tones, Wong et al. report asynchronous development of tone perception and tone production in children.

The Research Topic also includes a series of Opinion pieces and Commentaries addressing the broader relevance of tone and pitch to the study of language acquisition. Curtin and Werker discuss ways in which tone can be integrated into their model of infant language development (PRIMIR). Best discusses the phonological status of lexical tones and considers how recent empirical research on tone perception bears on this question. Kager focuses on how language learners distinguish lexical tones from other sources of pitch variation (e.g., affective and pragmatic) that also inform language comprehension. Finally, Antoniou and Chin unite evidence of tone sensitivity from children and adults and discuss how these areas of research can be mutually informative.

Psycholinguistic studies of lexical tone acquisition have burgeoned over the past 13 years. This collection of empirical studies and opinion pieces provides a state-of-the-art panoply of the psycholinguistic study of lexical tones, and demonstrates its coming of age. The articles in this Research Topic will help address the hitherto Eurocentric non-tone language research emphasis, and will contribute to an expanding narrative of speech perception, speech production, and language acquisition that includes all of the world's languages. Importantly, these studies underline the scientific promise of drawing from tone languages in psycholinguistic research; the research questions raised by lexical tone are unique and distinct from those typically applied to more widely studied languages and populations. The comprehensive study of language acquisition can only benefit from this expanded focus.

Citation: Singh, L., Burnham, D., Hay, J., Liu, L., Mattock, K., eds. (2019). Lexical Tone Perception in Infants and Young Children: Empirical studies and theoretical perspectives. Lausanne: Frontiers Media. doi: 10.3389/978-2-88963-061-5

# Table of Contents

*Pitch Perception in the First Year of Life, a Comparison of Lexical Tones and Musical Pitch*

Ao Chen, Catherine J. Stevens and René Kager

Puisan Wong, Wing M. Fu and Eunice Y. L. Cheung

Beverly Hannah, Yue Wang, Allard Jongman, Joan A. Sereno, Jiguo Cao and Yunlong Nie

*Constraints on Tone Sensitivity in Novel Word Learning by Monolingual and Bilingual Infants: Tone Properties are More Influential Than Tone Familiarity*

Denis Burnham, Leher Singh, Karen Mattock, Pei J. Woo and Marina Kalashnikova

*The Effects of Lexical Pitch Accent on Infant Word Recognition in Japanese*

Mitsuhiko Ota, Naoto Yamane and Reiko Mazuka

*What can Lexical Tone Training Studies in Adults Tell us About Tone Processing in Children?*

Mark Antoniou and Jessica L. L. Chin

Ying-Ying Cheng and Chia-Ying Lee

Suzanne Curtin and Janet F. Werker

Silvana Poltrock, Hui Chen, Celia Kwok, Hintat Cheung and Thierry Nazzi

Leher Singh, Denis Burnham, Jessica Hay, Liquan Liu and Karen Mattock

# Pitch Perception in the First Year of Life, a Comparison of Lexical Tones and Musical Pitch

Ao Chen<sup>1,2</sup>\*, Catherine J. Stevens<sup>3</sup> and René Kager<sup>1</sup>

<sup>1</sup> Utrecht Institute of Linguistics, Utrecht University, Utrecht, Netherlands, <sup>2</sup> Communication Science School, Beijing Language and Culture University, Beijing, China, <sup>3</sup> The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Sydney, NSW, Australia

Pitch variation is pervasive in speech, regardless of the language to which infants are exposed. Lexical tone perception is influenced by general sensitivity to pitch. We examined whether lexical tone perception develops in parallel with the perception of pitch in another cognitive domain, namely music. Using a visual fixation paradigm, 101 Dutch infants aged 4 and 12 months were tested on their discrimination of Chinese rising and dipping lexical tones as well as comparable three-note musical pitch contours. The 4-month-old infants failed to show a discrimination effect in either condition, whereas the 12-month-old infants succeeded in both conditions. These results suggest that lexical tone perception may reflect and relate to general pitch perception abilities, which may serve as a basis for developing more complex language and musical skills.

#### Edited by:

Leher Singh, National University of Singapore, Singapore

#### Reviewed by:

Xiuli Tong, University of Hong Kong, Hong Kong
Weiyi Ma, Macquarie University, Australia

\*Correspondence: Ao Chen, irischen71@hotmail.com

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 25 October 2016; Accepted: 16 February 2017; Published: 09 March 2017

#### Citation:

Chen A, Stevens CJ and Kager R (2017) Pitch Perception in the First Year of Life, a Comparison of Lexical Tones and Musical Pitch. Front. Psychol. 8:297. doi: 10.3389/fpsyg.2017.00297

Keywords: lexical tone, musical pitch, perception development, cross-domain cognition, infancy

### INTRODUCTION

The perceptual reorganization hypothesis assumes that acquiring native phonology involves learning the specific phonemic contrasts present in the to-be-learned language, whereas sensitivity to non-native contrasts gradually decreases. Such perceptual tuning occurs in the second half of the 1st year (Werker and Tees, 1984; Kuhl et al., 1992). Yet previous studies disagree on how the perception of lexical tones, or pitch contours realized on single syllables, changes in the 1st year of life. It is widely agreed that infants are highly sensitive to speech prosody (e.g., Mehler and Christophe, 1995; Nazzi et al., 1998; Soderstrom et al., 2011; Frota et al., 2014). With regard to lexical tones, several studies have found evidence for such a decline in discrimination among non-tone language learning infants between 4 and 9 months (Harrison, 2000; Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013). Other studies, however, have found that sensitivity to lexical tones is maintained beyond the presumed perceptual reorganization window. Liu and Kager (2014) found that from 4 months onward, up until 17–18 months, Dutch infants were able to discriminate Chinese high-level and falling tones. When the acoustical distance between the two tones was reduced through manipulation, no discrimination was found between 9 and 15 months, yet the 5- and 17–18-month-olds succeeded at discrimination. English learning 14-month-old infants are able to learn words that are solely distinguished by lexical tones, and by 19 months, they are still able to discriminate Chinese rising and falling tones (Quam and Swingley, 2010; Hay et al., 2015). In addition, although non-tone language speakers find lexical tones notoriously difficult (Kiriloff, 1969; Bluhme and Burr, 1971; Shen, 1989), they can be fairly accurate at discriminating them (Burnham et al., 1996, 2015; So and Best, 2010; Chen et al., 2015). Non-tone language listeners' acoustical sensitivity to lexical tones cannot simply reflect an effect of "nativeness"; it may instead reflect sensitivity to pitch in language in general. Regardless of the salience of lexical tones, native tone language learning infants do not fully acquire lexical tones until childhood, and global intonation contours interfere with the recognition of lexical tones (Singh and Chee, 2016; Singh and Fu, 2016). In addition, although lexical tones are phonemic in Chinese, when learning novel words, 3-year-old Chinese children are more tolerant of lexical tone mispronunciations than of vowel mispronunciations (Ma et al., 2017). In sum, lexical tone perception seems flexible and exhibits a complex course of development.

It has long been debated whether language ability reflects domain specific mechanisms or whether it is the product of domain general development (e.g., Piaget, 1926; Fodor, 1983; Chomsky, 1986; Pinker, 1994; Tomasello, 2003). Language and music, two types of uniquely human sophisticated functions, are often compared to understand this question. Language and music are parallel in many aspects (Trehub, 2003). For both, pitch plays a fundamental role, and pitch contour (i.e., the shape of pitch patterns) forms a salient cue in perception (Yip, 2002; Trehub and Hannon, 2006). In the language domain, crosslinguistically, intonation at the phrase and sentence level is largely encoded by pitch contour. Questions are commonly realized with a rising pitch contour whereas statements often carry a falling contour (e.g., Gussenhoven, 2004). In many languages, emphasis on certain information, or "focus," is often realized by raising the pitch of the emphasized part and compressing the pitch of the following part (Xu, 2011). In tone languages, lexical tones are used in a phonemic way to distinguish meaning at the lexical level (Yip, 2002). In music, pitch relations (rather than the specific pitch levels at which these relations are exhibited) are central for music perception and also play a role in memory. For example, for the vast majority of listeners, the same song played at a different pitch level is readily recognizable (e.g., Trehub and Hannon, 2006; Trainor and Hannon, 2013). In addition, adults are more sensitive to differences in the "global contour" (i.e., the pattern of ups and downs) of melodies than to "intervals" (i.e., exact pitch distance between notes; e.g., Cuddy and Cohen, 1976; Dowling, 1978; Bartlett and Dowling, 1980; Schiavetto et al., 1999).

Although some pitch processing skills have been argued to be music specific (Hauser and McDermott, 2003; Peretz and Coltheart, 2003; Peretz et al., 2003), many studies have found positive correlations between pitch perception in the language and music domains, which suggests domain general cognitive mechanisms in pitch processing (e.g., Wong and Perrachione, 2007; Wong et al., 2012; Bidelman et al., 2013, among many others). Speaking a tone language natively modulates neural response to non-speech pitch (e.g., Chandrasekaran et al., 2007; Bidelman et al., 2011).

For music processing, the encoding of pitch contour is visible from very early on. Infants as young as 2 months are able to discriminate familiar and novel songs (Plantinga and Trainor, 2009), and by 6 months, like adults, infants discriminate between songs by attending to the pitch contour rather than to the specific pitch levels at which they are played (Trainor et al., 2004; Plantinga and Trainor, 2005). Eight- to 11-month-old infants are sensitive to both contour-violating and contour-nonviolating note changes, yet contour violation has been found to be perceptually more salient for infants than contour-sharing interval differences (Trehub et al., 1984, 1987). Moreover, infants are able to extract abstract pitch contour from the absolute pitch level at which it is played (Cohen et al., 1987; Trainor and Trehub, 1992). It should be noted that although infants discriminate songs from very early on (Trainor et al., 2004; Plantinga and Trainor, 2005, 2009), the songs differed not only in contour but also in rhythmic and temporal information. When using manipulated stimuli exhibiting contour differences alone, discrimination has only been attested in samples of infants older than 6 months (Trehub et al., 1984, 1987; Trainor and Trehub, 1992). It remains unknown whether younger infants are also sensitive to contour violation.

Although shared processing of lexical tone and music has been widely investigated among adults, not much is known about whether pitch perception development is related in these two domains in infancy. Mattock and Burnham (2006) tested both tone (Chinese and Cantonese) and non-tone (English) language learning infants on their discrimination of Thai tones as well as violin analogs of the tones. For the lexical tones, a decline of sensitivity was observed between 6 and 9 months among the English infants, but not among the Chinese infants. For the violin stimuli, however, both groups succeeded in the discrimination at both ages. By 10 months, native Japanese infants' brain responses to pitch accents realized on words and to pure tones whose fundamental frequency was extracted from these words showed different lateralization patterns (Sato et al., 2010). These findings suggest that pitch perception develops in a domain specific manner. However, Mattock and Burnham (2006) and Sato et al. (2010) tested infants with non-speech rather than musical stimuli, as the analogs of lexical tones did not have a musical structure. The non-speech stimuli have no real life function, yet pitch contour is essential for perception and appreciation of music. In addition, these studies assume that lexical tones (or pitch accents) are phonological for infants, although non-tone language listeners may simply perceive them as musical (Chen et al., 2016).

In the current study, we investigate whether development observed in lexical tone perception may reflect general sensitivity to pitch. We tested Dutch 4- and 12-month-old infants on their discrimination of lexical tones and comparable three-note musical melodies, both differing in pitch contour. A non-native pitch contrast was chosen so that the developmental change cannot be attributed to learning the specific tonal exemplars, and the music stimuli were manipulated so as to share similar properties with the lexical tones. We chose 4- and 12-month-olds since these age groups precede and follow perceptual reorganization, which allows us to observe whether development in lexical tone perception is language specific. As Dutch infants have shown high sensitivity to the contrast of Chinese high-level and high-falling tone (Liu and Kager, 2014), and to prevent a ceiling effect, we used two perceptually similar lexical tones (Hume and Johnson, 2001; Ma et al., 2017), namely the Chinese rising and dipping tones, as the stimuli. Since we focus on the acoustic perception that underlies music and language processing, the infants were tested on their discrimination of single tokens of lexical tones and musical melodies, which prevented possible interference from normalization (Singh et al., 2004; Singh, 2008; Shi, 2010; Chen and Kager, 2015). If pitch contour perception develops in a domain general way, then we would expect a similar trajectory in both domains, possibly age-related enhancement. On the other hand, if development occurs in a domain specific manner, then based on the perceptual reorganization hypothesis (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013) we would expect the 12-month-olds to be less sensitive than the 4-month-olds to the lexical tones, as these are linguistically irrelevant for the Dutch infants. For the musical stimuli, given the high sensitivity to musical pitch contour among adults, a maintained or enhanced discrimination of the musical melodies should be observed.

### MATERIALS AND METHODS

### Participants

One hundred and one infants were included in the analysis. All the infants were healthy full-term monolingual Dutch infants. There were 54 4-month-old infants (age range 4:01–4:29), 28 (18 boys, 10 girls) in the lexical tone condition and 26 (13 boys, 13 girls) in the music condition. There were 47 12-month-old infants (age range 12:01–12:29), 23 (10 boys, 13 girls) in the lexical tone condition and 24 (16 boys, 8 girls) in the music condition. Another 17 4-month-old infants were tested but excluded from analysis due to crying (N = 2), fussiness (N = 4), equipment failure (N = 1), experimenter error (N = 1), and failure to meet the habituation criterion (N = 9, see below). Another 27 12-month-old infants were excluded from analysis due to crying (N = 7), fussiness (N = 4), equipment failure (N = 3), experimenter error (N = 2), parental interference (N = 2), and failure to meet the habituation criterion (N = 9).

As the experiment was not invasive and was conducted in a natural environment, the Utrecht Institute of Linguistics did not require ethical approval at the time that the experiment was conducted. The experiments were conducted in accordance with the guidelines of the Utrecht Institute of Linguistics and the Declaration of Helsinki. Written consent was obtained from caregivers for all participating infants.

### Stimuli

For the lexical tones, in order to prevent a ceiling effect (Liu and Kager, 2014), Mandarin Chinese rising tone (T2) and dipping tone (T3) were used as stimuli, as they have been found to be relatively difficult to discriminate (Hume and Johnson, 2001; Chen et al., 2015). We used /ma/ as the tone-bearing syllable, as an initial nasal consonant ensured continuous pitch. A female Mandarin speaker recorded the two syllables. Then the pitch contours of naturally produced /ma2/ and /ma3/ were extracted with the software PRAAT (Boersma and Weenink, 2009). After the duration of these two contours was normalized to 450 ms, the time-normalized pitch contour of T2 was re-synthesized onto the original T3 syllable using the PSOLA method (Moulines and Laroche, 1995). Time normalization ruled out duration as a potential confounding factor in the experiment. Five native Mandarin speakers listened to the stimuli and all agreed that the stimuli sounded like natural, normal speech. As young infants have shown difficulties in normalizing variable tokens (Singh et al., 2004; Singh, 2008; Shi, 2010), we used only a single token of each tone to prevent improvement in normalization from being a confounding factor for any development observed. To ensure comparability between tasks, we did not transpose the melodies in the music condition.

For the musical melodies, 16th notes of D4, E4, F4, and C4 with a piano timbre were synthesized using a Nyquist script<sup>1,2</sup>. The notes were generated on the C4 (middle C) scale, along which the fundamental frequency of A4 equals 440 Hz, with the default duration (250 ms) of 16th notes in Nyquist. After synthesizing the four single notes separately, D, E, and F were concatenated to obtain a three-note rising melody, D-E-F, and D, C, and F were concatenated to obtain a three-note dipping melody, D-C-F. These two melodies were normalized to 450 ms and were then used as stimuli in this experiment. All the notes belonged to the C major scale, which prevented possible discrimination based on key membership (Cohen et al., 1987). The two melodies had identical initial and final pitches, and the middle note determined global contour. This ensured that the infants would not be able to discriminate the melodies by attending only to the onset or the offset. The difference between the two musical melodies was expected to be salient, as the middle note changed the pitch "direction" (e.g., up and down) rather than the "degree" of rising or falling (Trehub et al., 1984). The musical melodies and lexical tones had comparable contours, namely one rising and one dipping. **Figure 1** plots the pitch contours of the speech stimuli.
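The relation between the two melodies can be illustrated with a short Python sketch. It is not the authors' Nyquist script: it assumes twelve-tone equal temperament around the stated A4 = 440 Hz reference, and the names `note_to_hz` and `contour` are hypothetical helpers introduced only for this illustration.

```python
# Illustrative sketch only; assumes equal temperament with A4 = 440 Hz.
A4_HZ = 440.0   # tuning reference stated in the text
A4_MIDI = 69    # MIDI note number of A4 (standard convention)

# Semitone offsets of the natural notes within an octave.
SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(name: str) -> float:
    """Equal-tempered frequency of a natural note such as 'D4'."""
    letter, octave = name[0], int(name[1:])
    midi = 12 * (octave + 1) + SEMITONE[letter]
    return A4_HZ * 2 ** ((midi - A4_MIDI) / 12)


def contour(melody):
    """Direction of each pitch step: +1 up, -1 down, 0 unchanged."""
    freqs = [note_to_hz(n) for n in melody]
    return [(b > a) - (a > b) for a, b in zip(freqs, freqs[1:])]


RISING = ["D4", "E4", "F4"]   # D-E-F
DIPPING = ["D4", "C4", "F4"]  # D-C-F

if __name__ == "__main__":
    for melody in (RISING, DIPPING):
        print(melody, [round(note_to_hz(n), 2) for n in melody], contour(melody))
```

Under these assumptions the two melodies share their first and last notes and differ only in the direction of the first step, which is the property the stimulus design relies on.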

### Procedure

A visual habituation paradigm adapted from Liu and Kager (2014) was used, which has been found to be suitable for testing infants as young as 4 months. During the experiment, infants sat on their parent's lap in the test cabin, and a 14-inch screen at the front displayed the visual stimuli, an infant-friendly colorful picture. The visual stimuli were contingent with the auditory stimuli, and the infants' looking time to the visual stimuli was used as the indicator of their attention to the auditory stimuli. The auditory stimuli were presented at a comfortable volume through a frontal speaker. The parent listened to background music through headphones to prevent possible interaction with the infants. A hidden camera mounted above the screen recorded the infants' looking behavior. The experimenter observed the video of the infants live and recorded whether the infant looked at the visual stimuli. For each trial, once the infant looked at the screen, the experimenter pressed a "looking" button on a button box to start the auditory stimuli. Whenever the infant looked away, the experimenter pressed another "non-looking" button on the same button box, and if the infant looked back to the screen, the experimenter pressed the "looking" button again. A trial ended if the infant looked away for more than 2 s, and an attention getter immediately appeared on the screen. Once the infant looked back at the screen, the experimenter started the next trial in the same way described above. The looking time of each trial as well as each look was automatically calculated on the experimenter's computer.

<sup>1</sup>http://audacity.sourceforge.net/help/nyquist

<sup>2</sup>http://www.cs.cmu.edu/~music/music.software.html

The experiment consisted of a habituation phase and a test phase. The total looking time of the first three trials in the habituation phase was used as a baseline for measuring habituation. Starting from the fourth trial, the total looking time of each set of three consecutive habituation trials was calculated, and once this looking time was less than 65% of the total looking time of the first three habituation trials, the habituation criterion was met, and the test phase started automatically. The habituation phase had a minimum of six trials and a maximum of 12 trials. Infants who failed to meet the habituation criterion within 12 trials were excluded from further analysis. The stimuli used for habituation were counter-balanced among the participants at each age for each condition. In the test phase, the infants were presented with one "old" trial, which was the same sound that they had heard in the habituation phase, followed by a "novel" trial, which was the new sound that they had not previously heard. If the infants were able to detect the difference between the two sounds, then upon hearing the novel trial, their listening time should recover. In both phases, a trial could have a maximum of 30 repetitions of the stimuli, with an inter-stimulus interval of 1 s. The same visual stimuli were used for the habituation and test phases. We did not counter-balance the order of test trials, and the current procedure was expected to highlight the discrimination response if there was any.
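The habituation criterion described above can be sketched in Python as follows. This is a minimal illustration, not the software used in the experiment. It assumes that the three-trial window slides one trial at a time and that the first window checked covers trials 4 to 6, since the text does not say whether windows overlap; the function name and parameters are hypothetical.

```python
# Illustrative sketch of the habituation criterion; the overlapping
# sliding window is an assumption, not a detail given in the text.
def trials_to_habituate(looking_times, ratio=0.65, window=3,
                        min_trials=6, max_trials=12):
    """Return the 1-based trial at which the criterion is met, or None.

    looking_times: total looking time (s) per habituation trial, in order.
    The criterion is met when the summed looking time of the most recent
    `window` trials drops below `ratio` times the sum of the first
    `window` trials (the baseline).
    """
    baseline = sum(looking_times[:window])
    last_trial = min(len(looking_times), max_trials)
    for end in range(min_trials, last_trial + 1):
        recent = sum(looking_times[end - window:end])
        if recent < ratio * baseline:
            return end
    return None  # criterion not met within max_trials: infant is excluded


if __name__ == "__main__":
    print(trials_to_habituate([10, 10, 10, 9, 8, 7, 5, 4]))
    print(trials_to_habituate([10] * 12))
```

With a non-overlapping window (trials 4 to 6, then 7 to 9, and so on) the loop would instead step by `window`; either reading is compatible with the stated minimum of six and maximum of 12 trials.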

### RESULTS

TABLE 1 | Mean habituation time (s) and mean number of trials needed for habituation; raw looking time (s) to old and novel trial, and mean number of tokens in old and novel trial, separated by age group and condition.

Numbers in brackets are standard deviations. Time was measured in seconds.

**Table 1** lists the raw looking time in the habituation phase and test phase in both conditions for both age groups. Before the analysis of test trials, infants' response in the habituation phase was examined. A univariate ANOVA, taking condition and age as independent variables, found a significant main effect of age, F(3,97) = 6.48, p < 0.05 (partial η<sup>2</sup> = 0.063), where 4-month-olds needed more time to reach the habituation criterion. Condition, on the other hand, showed no significant effect, F(3,97) = 0.89, n.s. No significant interaction between age and condition was found, F(3,97) = 0.002, n.s. These findings suggest comparable habituation patterns for the music and the lexical tone condition. Next, the raw looking time of the infants was log transformed (base 10) to correct for skew (Gomez and Gerken, 1999; Gao et al., 2011). The log transformed looking times (logLT) of both age groups to both trial types fit a normal distribution. A repeated measures ANOVA was carried out with the logLT, where trial type (old/novel) was the within-subject factor, and condition (music/speech) and age (4/12-month-old) were between-subject factors. Trial type as well as condition showed a significant main effect, F<sub>trialtype</sub>(1,97) = 5.20, p < 0.05 (partial η<sup>2</sup> = 0.051); F<sub>domain</sub>(1,97) = 4.84, p < 0.05 (partial η<sup>2</sup> = 0.047). The main effect of age was not significant, F<sub>age</sub>(1,97) = 1.58, n.s. A significant interaction was found between age and trial type, F(1,97) = 4.50, p < 0.05 (partial η<sup>2</sup> = 0.044). Post hoc analyses found that, after merging domains, only the 12-month-old infants showed a significantly longer logLT to the novel trial, t(46) = −2.88, p < 0.05. No other interaction was found to be significant. **Figure 2** depicts the logLT of the infants in each condition. As can be seen, for the 4-month-olds, no increase in looking time was observed for the novel trial in either condition. Such an increase, however, was found for the 12-month-old group in both conditions. The main effect of trial type was mainly driven by the 12-month-olds. In addition, both age groups had longer looking times in the lexical tone condition.
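As a reading aid, the partial η<sup>2</sup> values reported with the test-phase ANOVA can be related to the F statistics through the standard identity partial η<sup>2</sup> = (F × df<sub>effect</sub>) / (F × df<sub>effect</sub> + df<sub>error</sub>). The sketch below is illustrative only; the function names are hypothetical, and because the published F values are rounded, recomputed effect sizes may differ slightly in the last digit.

```python
import math


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def log_looking_time(seconds):
    """Base-10 log transform applied to a raw looking time (s)."""
    return math.log10(seconds)


# Main effect of trial type, F(1,97) = 5.20
print(round(partial_eta_squared(5.20, 1, 97), 3))
# Age x trial type interaction, F(1,97) = 4.50
print(round(partial_eta_squared(4.50, 1, 97), 3))
```

Both recomputed values agree with the effect sizes reported for these two tests (0.051 and 0.044).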

### DISCUSSION

In the current study, we investigated whether lexical tone perception develops in parallel with the perception of pitch in another cognitive domain, namely music. The 4-month-olds did not show a discrimination effect in either the lexical tone or the music condition. For the lexical tones, at the age of 4 months, which has been assumed to precede the perceptual reorganization of lexical tones (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013), Dutch infants failed to show a discrimination effect. Importantly, without inter-token variation, presumably the infants did not need to represent the lexical tones as phonological categories, but only needed to discriminate the lexical tones acoustically. The lack of a discrimination effect suggests that the 4-month-old infants did not perceive the acoustic difference between the two lexical tones. Similarly, without transpositions, the infants did not need to equalize the pitch contours played at different pitch levels before they could detect the contour violation, yet no discrimination was found. It is likely that the skills that adult listeners readily make use of when processing music are not fully mature at the beginning of life (Dowling, 1978; Schiavetto et al., 1999). The lack of a discrimination effect in both conditions suggests that at 4 months, the infants are not proficient at processing the acoustic attributes that are exploited by linguistic and musical structures.

By 12 months, a parallel enhancement was observed in both the music and the language conditions. Importantly, what we show in the current study is that language input may not be the only factor driving perceptual development, and the perceptual behavior elicited by linguistic stimuli may reflect a general auditory rather than language-specific development. As the infants were not exposed to lexical tones in their ambient input, the improvement cannot be explained by learning the lexical tones per se, but must reflect a general ability in dealing with pitch in speech. The similar developmental trajectory in both domains suggests that improved auditory pitch acuity may form a common basis for developing cognitively more advanced skills in language and music. The enhanced pitch perception may correlate with auditory maturation. Although frequency tuning is mature at birth at the cochlea level (Abdala et al., 1996), frequency resolution becomes adult-like between 3 and 6 months (Spetner and Olsho, 1990). The auditory brainstem also matures within the first 6 months after birth, and the maturation of the auditory cortex continues into childhood (see Moore and Linthicum, 2007, for a review). At this moment, it is hard to infer whether the processing of musical and speech pitch recruited the same neural resources within the sample, yet basic auditory abilities seem to develop in a domain-general fashion. The physiological basis for successful discrimination of pitch realized on ecologically valid and spectrally complex sounds needs further investigation. It would be interesting for future studies to investigate how such improved perception contributes to higher level processing such as phonological categorization or representation of musical pitch contours across pitch levels and musical instruments, and whether these abilities also show a comparable developmental trajectory in language and music.

So far, the perception of non-native lexical tones has been mostly studied in infants between 6 and 9 months (Harrison, 2000; Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013), and lexical tones are considered to be non-native phonological contrasts for infants learning a non-tone language. Pitch variation, however, is a language universal. The need to distinguish and understand intonation may help infants improve their sensitivity to pitch in general, which is reflected in their discrimination of lexical tones. It is possible that the 12-month-old Dutch infants assimilated T2 to a salient pitch contour in Dutch, the question rise. Non-tone language adults have been found to maintain a high psycho-acoustically based perceptual sensitivity to non-native lexical tones (Burnham et al., 1996, 2015; So and Best, 2010; Chen et al., 2015). Non-native infants' sensitivity to lexical tones can remain after the assumed perceptual reorganization window (Liu and Kager, 2014; Chen and Kager, 2015; Hay et al., 2015). In the current study, we used a perceptually more similar contrast than those used in Liu and Kager (2014; Hume and Johnson, 2001), and a progression from 4 to 12 months was observed. A growing body of evidence shows that the perception of speech sounds does not follow a single developmental trajectory (Narayan et al., 2010; Liu and Kager, 2014; Mazuka et al., 2014; Tsuji and Cristia, 2014; Tyler et al., 2014), and infants do not completely lose sensitivity to non-native contrasts. Our results, together with these other studies, lead to the question of what underlies perceptual attunement. It is possible that when infants grow older, they become less capable of perceiving non-native contrasts phonologically, but at the same time, psycho-acoustical perception may improve. Yet whether better auditory perception can be found in general for speech sounds after 9 months, or whether such improvement is restricted to certain types of speech sounds, such as vowels (Mazuka et al., 2014) and pitch, needs further investigation. Perceptual narrowing is well motivated given the need to efficiently process environmentally relevant distinctions (Scott et al., 2007) and by observations that adults cannot learn a language as easily as infants. The inability to perceive non-native contrasts has been claimed to be one of the hindrances to proficient learning in adults. Yet more effort should be made to understand what exactly complicates non-native language perception and when exactly we lose the ease of perceiving non-native contrasts.

In the music domain, sensitivity to contour differences has been claimed to be visible from very early on (Plantinga and Trainor, 2009; Stefanics et al., 2009). However, Plantinga and Trainor (2009) tested 2-month-old infants with songs, and such discrimination only called for coarse representation of the melodies, as the songs differed from one another on multiple dimensions, including rhythm and tempo. Our task, on the other hand, tested the detection of contour violation with manipulated stimuli, and the 4-month-olds failed. Hence, it is possible that young infants are able to coarsely represent pitch contours, yet their accurate perception of pitch details is still under-developed. In our task, the middle note violated the contour, and the edge notes were not informative. Several studies have proposed an "edge benefit" in rule learning, namely that the edge serves as the anchoring position, and items in a stream are memorized relative to the edge item (Hitch, 1996; Henson, 1998; Endress et al., 2005). It may be the case that young infants have difficulties perceiving pitch change at a medial position, which may hinder them in noticing the change of contour efficiently. It would be interesting for future studies to test whether young infants could more easily detect a contour violation occurring at an edge position.

Finally, it should be acknowledged that our musical stimuli were generated to match the lexical tones. The constituent notes had a slightly shorter duration compared to previous studies (e.g., Trainor and Trehub, 1992). It might be the case that for the younger group, the short duration prevented the infants from sufficiently representing each individual note, where the violation of contour was realized. When presented with the same stimuli, the 12-month-olds did show a clear discrimination effect. This suggests that the better contour violation perception at 12 months may be due to a higher temporal resolution in auditory perception (Morrongiello et al., 1984; Werner et al., 1992). Nevertheless, our musical stimuli were ecologically valid, as a 16th note has a duration of 125 ms when the tempo is 120 beats-per-minute. In addition, our stimuli were highly representative of pitch in speech and pitch in music: the musical ones were composed of discrete notes without segmental information, whereas the lexical tones had continuous pitch contours and were realized on syllables. Therefore, the distinction between music and speech stimuli was still maintained, and the results support the view that infants show a general enhancement in auditory pitch perception in the first year of life.
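The 125 ms figure follows from simple tempo arithmetic, sketched below. The function name is hypothetical, and the sketch assumes that the beat at 120 beats-per-minute is a quarter note, so that a 16th note is one quarter of a beat.

```python
def note_duration_ms(bpm, notes_per_beat):
    """Duration of one note in ms, given tempo and subdivisions per beat.

    Assumption: the beat is a quarter note, so a 16th note corresponds
    to notes_per_beat = 4.
    """
    beat_ms = 60_000 / bpm  # one beat lasts 500 ms at 120 bpm
    return beat_ms / notes_per_beat


print(note_duration_ms(120, 4))  # 16th note at 120 bpm
```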

### CONCLUSION

In the current study, we tested Dutch 4- and 12-month-old infants on their discrimination of pitch contours realized in speech, specifically, the Chinese rising and dipping tones, as well as musical stimuli exhibiting analogous pitch contours. We found that the 4-month-olds failed to show discrimination in either condition, whereas the older group succeeded in both conditions. These findings suggest that pitch perception develops in a domain-general fashion in early infancy, and that development in speech perception may reflect a more general auditory enhancement rather than a language-specific development.

### AUTHOR CONTRIBUTIONS

AC contributed to the design of the work, acquisition and analysis of the data and drafting the work. CS contributed to the interpretation of the data, drafting and revising the work. RK contributed to the design of the work, interpretation of the data, drafting and revising the work.

### REFERENCES


Pinker, S. (ed.). (1994). The Language Instinct. New York, NY: Morrow.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Chen, Stevens and Kager. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# From Lexical Tone to Lexical Stress: A Cross-Language Mediation Model for Cantonese Children Learning English as a Second Language

William Choi<sup>1</sup> , Xiuli Tong<sup>1</sup> \* and Leher Singh<sup>2</sup>

<sup>1</sup> Division of Speech and Hearing Sciences, The University of Hong Kong, Hong Kong, Hong Kong, <sup>2</sup> Department of Psychology, National University of Singapore, Singapore, Singapore

This study investigated how Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity among Cantonese children who learned English as a second language (ESL). Five-hundred-and-sixteen second-to-third grade Cantonese ESL children were tested on their Cantonese lexical tone sensitivity, English lexical stress sensitivity, general auditory sensitivity, and working memory. Structural equation modeling revealed that Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity both directly, and indirectly through the mediation of general auditory sensitivity, in which the direct pathway had a larger relative contribution to English lexical stress sensitivity than the indirect pathway. These results suggest that the tone-stress association might be accounted for by joint phonological and acoustic processes that underlie lexical tone and lexical stress perception.

#### Edited by:

F-Xavier Alario, Centre National de la Recherche Scientifique (CNRS), France

#### Reviewed by:

Fanny Elise Meunier, Centre National de la Recherche Scientifique (CNRS), France

Yi Du, Institute of Psychology (CAS), China

#### \*Correspondence:

Xiuli Tong xltong@hku.hk

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 18 September 2016 Accepted: 15 March 2017 Published: 31 March 2017

#### Citation:

Choi W, Tong X and Singh L (2017) From Lexical Tone to Lexical Stress: A Cross-Language Mediation Model for Cantonese Children Learning English as a Second Language. Front. Psychol. 8:492. doi: 10.3389/fpsyg.2017.00492

Keywords: lexical prosody, tone sensitivity, stress sensitivity, prosodic transfer, ESL

## INTRODUCTION

Suprasegmental features such as lexical tones and lexical stress surface with high frequency across languages of the world (Gussenhoven, 2004) and serve as two primary means by which languages contrast words using suprasegmental cues (e.g., Cutler and Chen, 1997). Relative to the segmental dimension, e.g., consonants and vowels (e.g., Chien et al., 2008; Keung and Ho, 2009; Wang et al., 2009), there is a comparative paucity of research on the extent to which sensitivity to suprasegmental cues generalizes across languages over the course of first language (L1) and second language (L2) development. Noticeably, there is emerging research demonstrating lexical prosodic transfer, a process through which adults and children who learn English as a second language (ESL) capitalize on similarities in the structure of lexical tones and lexical stress in a way that allows them to harness their perceptual sensitivity to L1 lexical tones in the service of L2 English lexical stress perception (e.g., Nguyen et al., 2008; Wang, 2008; Yu and Andruski, 2010; Tong et al., 2015a, 2016). In the context of Cantonese ESL children, a previous study reported that Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity (Tong et al., 2016). Little is known, however, about how Cantonese lexical tone sensitivity contributes to English lexical stress sensitivity among Cantonese ESL children. More specifically, attested links between Cantonese lexical tone perception and English lexical stress perception could be direct, as in Tong and colleagues' study, or they could be mediated by other candidate processes. In addition to its theoretical significance, this question has substantial practical significance given the role of English lexical stress sensitivity in English word reading and English reading comprehension not only amongst English speaking children (e.g., Arciuli et al., 2010; Holliman et al., 2010) but also amongst Cantonese ESL children (e.g., Choi et al., 2016a). To further unpack the contribution of Cantonese lexical tone sensitivity to English lexical stress sensitivity in Cantonese ESL children, we proposed and evaluated two structural equation models, i.e., a full model and a nested model (**Figure 1**). The full model consisted of two possible pathways through which Cantonese lexical tone sensitivity might contribute to English lexical stress sensitivity, i.e., the direct pathway and an indirect pathway with the mediation of general auditory sensitivity. The nested model was nested within the full model, and consisted only of the indirect pathway.

### Contribution of Cantonese Lexical Tone Sensitivity to English Lexical Stress Sensitivity: The Direct Pathway

We hypothesize that there is a direct pathway from Cantonese lexical tone sensitivity to English lexical stress sensitivity. One theoretical foundation of this hypothesis lies in the structural and functional similarities between Cantonese lexical tone and English lexical stress. In terms of their composition, the assignment of lexical tone and of lexical stress has a common gestural basis in that they both involve modulating the rate of laryngeal vibration. The primary consequence of this is variation in the fundamental frequency of speech (vocal pitch). Although theoreticians have sometimes likened the way in which vocal pitch is manipulated to convey both tone and stress (e.g., Chrabaszcz et al., 2014), there are important differences in how tone and stress are assigned. In English, stress consists primarily of variation in vocal pitch, duration and intensity (Fry, 1958). In contrast to stress in English, lexical tone in Cantonese involves the use of vocal pitch (as well as amplitude, duration and other spectral factors) to distinguish lexical items at the syllable level and a tone is assigned to every syllable (Chao, 1968). Tones may change in a sentential context although each tone maintains its basic form even in a multi-word context. Stress and tone further differ in their relative scope of influence in the two languages investigated in the present study: lexical tones distinguish a broad set of words in Cantonese whereas lexical stress distinguishes a small set of words in English.

Tone and stress are therefore similar in structure, both being driven by a similar set of acoustic concomitants. In terms of the magnitude of fundamental frequency variation, there is a high correspondence between values assigned to tone bearing syllables in tone languages and to stress bearing syllables in stress languages, although the rate of fluctuation in fundamental frequency is higher in tone marking than in stress marking (Eady, 1982). However, subtle variations aside, tone and stress marking are compositionally similar. They are also functionally similar, both distinguishing minimally contrastive forms. Using Cantonese as an example, /ma/ in a high level tone /ma1/ means 'mother,' while /ma/ in a high rising tone /ma2/ means 'horse.' Similarly, in English, the words "CONtent" /'kantεnt/ and "conTENT" /kən'tεnt/ vary minimally by stress placement and represent different lexical items.

Theoretical support for links between stress and tone comes from phonological assimilation (Kuhl, 1991; Best, 1995; Flege, 1995; Best and Tyler, 2007). Specifically, models of non-native speech perception generally posit that L2 sounds are categorically perceived within, or assimilated to, the L1 sound classes by L2 learners (see Cutler, 2012 for a detailed review). A parallel claim in the suprasegmental dimension is suggested by previous studies that have demonstrated the "tonalization" of English lexical stress, in which English lexical stress was perceived as high tones by Cantonese ESL listeners (Luke, 2000; Lai, 2003; Chan, 2007). Furthermore, Tong et al. (2015a) proposed that lexical tone and lexical stress learning gave rise to the formation of a "general suprasegmental prototype" common to Cantonese lexical tone and English lexical stress, motivated by the acoustic similarity of Cantonese lexical tone and English lexical stress, and the association between Cantonese lexical tone sensitivity and English lexical stress sensitivity. Collectively, these studies suggest the possibility that Cantonese ESL listeners assimilated English lexical stress into their native tonal system, giving rise to a possible direct contribution of Cantonese lexical tone sensitivity to English lexical stress sensitivity.

### Contribution of Cantonese Lexical Tone Sensitivity to English Lexical Stress Sensitivity: The Indirect Pathway

Tone-stress links could be mediated by language-general auditory sensitivities to acoustic-phonetic variation. Notably, Wang et al. (2005) proposed a relation between Mandarin lexical tone sensitivity and general auditory sensitivity. Similarly, Zhang and McBride-Chang (2010) raised the possibility that auditory sensitivity to rhythmic changes influenced sensitivity to Cantonese lexical tones among Cantonese children. The purported relation between general auditory sensitivity and lexical tone sensitivity was later examined among Cantonese children by means of structural equation modeling (Zhang and McBride-Chang, 2014). In Zhang and McBride-Chang's best-fit model, general auditory sensitivity was associated with Cantonese lexical tone perception. Specifically, the frequency discrimination tasks and the amplitude modulation task, presumably reflecting temporal and rhythmic sensitivities, respectively, correlated significantly with Cantonese lexical tone perception. This might suggest that tone language speakers harness lower level auditory perceptual sensitivity in the service of tone perception. The relation between general auditory sensitivity and suprasegmental speech perception may also extend to English lexical stress. As described above, English lexical stress patterns are acoustic variations in fundamental frequency, duration, intensity and formant frequency of one syllable relative to another (e.g., Chrabaszcz et al., 2014). This raises the possibility that sensitivity to English lexical stress is associated with general auditory sensitivity as well as the further possibility that commonalities in lexical tone and lexical stress perception in tone language learners may be mediated by general auditory sensitivity.

Our proposition of the general auditory sensitivity mediated pathway is suggested by previous studies of lexical prosodic transfer from lexical tone to English lexical stress at the acoustic level (Nguyen et al., 2008; Wang, 2008; Yu and Andruski, 2010). In a series of tasks designed to measure lexical stress discrimination, identification and matching, the studies set out to explore the acoustic cues attended to by ESL adult listeners in perceiving English lexical stress. In a lexical stress identification task, Wang systematically manipulated the fundamental frequency, duration and intensity cues of the English lexical stress stimuli. As reflected by the reliance scores for the specific acoustic cues (see Wang, 2008, p. 113, for the computation of reliance scores), Mandarin ESL listeners showed a greater reliance on fundamental frequency, and lesser reliance on duration and intensity relative to native English listeners when identifying lexical stress position. In a later study, Yu and Andruski reported that Mandarin ESL listeners consistently relied on fundamental frequency for identifying trochaic and iambic stress patterns in English real words, pseudowords and hums, and treated duration only as a secondary cue when identifying iambic stress patterns under the pseudoword condition. Native English listeners, on the other hand, relied on a more varied set of acoustic cues, including fundamental frequency, duration, intensity and vowel quality, across different stress patterns and linguistic conditions. Similar results have been reported among Vietnamese ESL listeners, who attended to fundamental frequency, but not duration, in a stress matching task. Subtle variations aside, the above studies offer converging evidence that L1 tonal listeners make use of acoustic cues to tone perception, specifically fundamental frequency, in order to process lexical stress. This suggests that ESL adults with L1 tone language experience draw on the acoustic commonalities across their languages to streamline processing of suprasegmental cues.

The current study set out to investigate how Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity in Cantonese children learning English as a second language (ESL). Specifically, we tested how Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity, but not the other way round. Our proposition was based on non-native speech perception models, e.g., the Speech Learning Model (Flege, 1995), which posited that L2 speech perception was susceptible to L1 influence (see Cutler, 2012, for a detailed review). Additionally, it has been suggested that the direction of cross-language transfer is predominantly governed by language proficiency, in which transfer occurs from the more proficient language to the less proficient language (e.g., Hernandez et al., 1994; Zhang et al., 2010). In the current study, the Cantonese ESL children were sequential bilinguals who were more proficient in Cantonese than English, and they had been actively developing their Cantonese tonal system since birth even though they had not yet reached adult-like performance (Ciocca and Lui, 2003). Based on the above, it was conceivable that the skills involved in Cantonese lexical tone perception were likely to be drawn upon to scaffold the sensitivity to L2 English lexical stress, instead of the other way around. In order to determine the pathways underlying the relationship, we investigated whether the contribution was direct or mediated by general auditory sensitivity, or both. We addressed these questions by means of structural equation modeling given that it is a well-established statistical method for mediation analysis. Unlike traditional regression analysis, which is suited to evaluating single regression equations, structural equation modeling allows for the simultaneous evaluation of a system of regression equations essential for mediation analysis (Nachtigall et al., 2003; MacKinnon et al., 2007). To illustrate, **Figure 1** (Left) depicts the system of regression equations under investigation: Cantonese lexical tone sensitivity has an effect on general auditory sensitivity, general auditory sensitivity has an effect on English lexical stress sensitivity, and Cantonese lexical tone sensitivity also has an effect on English lexical stress sensitivity. Critically, structural equation modeling allows the same variable, i.e., general auditory sensitivity, to represent a regressand in one equation (Cantonese lexical tone sensitivity to general auditory sensitivity) and a regressor in another equation (general auditory sensitivity to English lexical stress sensitivity), which crucially informs a mediation analysis.
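For readers who prefer notation, the system of regression equations described above can be written out as a minimal sketch. The symbols are assumptions introduced here for illustration and do not appear in the original study: $T$ is Cantonese lexical tone sensitivity, $A$ is general auditory sensitivity, $S$ is English lexical stress sensitivity, $W$ is working memory (the control variable), and intercepts are omitted on the assumption of standardized variables.

```latex
\begin{align}
A &= a\,T + \varepsilon_A \\
S &= c'\,T + b\,A + d\,W + \varepsilon_S
\end{align}
% Indirect (auditory-mediated) effect of T on S: a b
% Direct effect of T on S:                        c'
% Total effect of T on S:                         c' + a b
```

Under this notation, general auditory sensitivity $A$ is the outcome in the first equation and a predictor in the second. The model containing only the indirect pathway corresponds to the constraint $c' = 0$, whereas the model with both pathways leaves $c'$ free to be estimated.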

To examine the research questions, we proposed two models: a partial mediation model (the full model) consisting of both direct (Cantonese lexical tone sensitivity to English lexical stress sensitivity) and auditory-mediated pathways (Cantonese lexical tone sensitivity to general auditory sensitivity to English lexical stress sensitivity), and a full mediation model (the nested model) that only includes the auditory-mediated pathway (**Figure 1**). As for the latent variables, we assessed second-to-third grade Cantonese ESL children on a range of abilities, including Cantonese lexical tone sensitivity, English lexical stress sensitivity, general auditory sensitivity and working memory. As shown in **Figure 1**, we included working memory as a control variable to control for the variance in English lexical stress sensitivity as it had been found to relate to suprasegmental speech perception (Mattys and Samuel, 2000).

### MATERIALS AND METHODS

### Participants

A sample of 516 (276 boys and 240 girls) second-to-third grade Cantonese ESL children was recruited from primary schools in Hong Kong. The mean age of the participants was 8 years and 5 months (SD = 7.81 months). We chose to study this age on account of prior research showing that by this age, children would have fully acquired all six contrastive tones (Ciocca and Lui, 2003) and stress (Tong et al., 2015a). All children were native Cantonese speakers and L2 English learners and were born to Cantonese speaking parents. They had all learned English as of Grade 1. Thus, all children had been learning English for a minimum of 2–3 years. All children were raised in a Cantonese-speaking environment and had learned and spoken Cantonese since birth.

### Materials and Procedure

#### Cantonese Lexical Tone Sensitivity

An odd-one-out tone discrimination task (Tong et al., 2014, 2016; Choi et al., 2016a,b) was modified to assess children's sensitivity to Cantonese lexical tones. There were 48 trials, each consisting of three real Cantonese monosyllabic words, one of which differed in tone from the others (e.g., /sIn1/, /sa1/, /s5u2/). There were altogether 144 words. To ensure the words were familiar to the children, the Cantonese words we selected were high frequency words representing common objects or concepts. They were matched approximately on ratings of familiarity and syllabic structure, i.e., either CV or CVC. The pilot testing showed that 7-year-old Cantonese children were able to understand the meanings of the presented words.

Each trial presented one of the eight minimum possible tone contrasts, i.e., mid level-low level, high rising-low rising, high level-mid level, high level-low level, low rising-low level, low falling-low level, low falling-low rising and high level-high rising, as in previous studies of tone perception in children of a similar age (e.g., Ciocca and Lui, 2003; Tong et al., 2014, 2016; Choi et al., 2016a,b). There were six repetitions for each tone contrast in the whole test.

In each trial during the testing, three Cantonese real words (e.g., /sIn1/, /sa1/, /s5u2/) were presented audibly via an amplification system to the children, with an inter-stimulus interval of 400ms. The positions of the target word (e.g., /s5u2/) in the word sequence (e.g., /sIn1/, /sa1/, /s5u2/) were counterbalanced across trials. The children selected the word they identified as carrying a different lexical tone from the other two words by indicating on the testing booklet the position of the word. Prior to the testing, three practice trials with corrective feedback were given to ensure the children's full understanding of the instruction of the test. All participants reported that they heard the Cantonese words clearly, and understood the task requirements, and responded correctly in the practice trials. The number of correct responses was tallied out of the 48 trials, and summed to yield an accuracy rate for each participant.
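The design of this task (eight tone contrasts, six repetitions each, target position counterbalanced, accuracy scored out of 48) can be sketched as follows. This is a hypothetical illustration rather than the authors' materials: the function names are invented, and the simple rotation of target positions stands in for whatever counterbalancing and trial ordering the study actually used.

```python
from collections import Counter
from itertools import cycle

# The eight tone contrasts listed in the task description.
CONTRASTS = [
    ("mid level", "low level"), ("high rising", "low rising"),
    ("high level", "mid level"), ("high level", "low level"),
    ("low rising", "low level"), ("low falling", "low level"),
    ("low falling", "low rising"), ("high level", "high rising"),
]


def build_trials(contrasts, repetitions=6):
    """Create trials with the odd-one-out position rotated over 1, 2, 3."""
    positions = cycle([1, 2, 3])  # assumed counterbalancing scheme
    return [
        {"contrast": contrast, "target_position": next(positions)}
        for contrast in contrasts
        for _ in range(repetitions)
    ]


def accuracy(responses, trials):
    """Proportion of trials on which the chosen position is the target."""
    correct = sum(r == t["target_position"] for r, t in zip(responses, trials))
    return correct / len(trials)


trials = build_trials(CONTRASTS)
position_counts = Counter(t["target_position"] for t in trials)
print(len(trials), dict(position_counts))
```

Because six repetitions divide evenly by three positions, this rotation places the target in each position twice per contrast and 16 times overall.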

#### English Lexical Stress Sensitivity

A "DEEdee" task (e.g., Whalley and Hansen, 2006; Goswami et al., 2010) was adopted to assess children's sensitivity to stress patterning in spoken English. This task has successfully assessed sensitivity to stress patterning among second-to-third grade Cantonese learners of English (Choi et al., 2016b). There were two practice items and 18 test items. All test items were pre-recorded items and consisted of highly familiar names or titles of children's books converted to a reiterative syllable "dee." For example, the phrase "aLAddin" was replaced with three synthesized tokens "deeDEEdee" (stressed DEE syllable flanked by two unstressed dee syllables). In each trial, children were audibly presented with the spoken phrase "aLAddin," followed by the two DEEdee phrases, e.g., "deeDEEdee" and "DEEdeedee." The children then chose a match to the spoken phrase, by indicating on the testing booklet whether the match was the "first" or the "second" DEEdee phrase. As in a previous study (Choi et al., 2016b), the English learning Cantonese children reported that they were familiar with the English words presented. Response accuracy was logged for each child out of 18 trials and summed to yield an accuracy rate.

### General Auditory Sensitivity

We adopted the beat perception in music task (Goswami et al., 2013) to measure children's general auditory sensitivity. This task adopted a forced-choice paradigm, and each trial consisted of two series of musical notes, each having a pulse rate of 500 ms. The numbers of "same" and "different" trials were identical, and the order of presentation of these two types of trials was pseudorandomized. In different trials, the two series of musical notes exhibited metrical changes in the accented notes. For example, in a different trial, the accented notes in one sequence might exhibit a 100 ms or 166 ms delay in rhythmic structure. Children indicated on the testing booklet whether the two stimuli presented were the same or different. In total, there were 24 trials preceded by two practice trials.

### Working Memory

A serial-order reconstruction task adapted from Majerus et al. (2006) was used as a measure of working memory. This task probed short-term retention of order information. The task was presented as a game, in which children heard sequences of animal names (lion, cat, dog, cock, bear, wolf, and monkey) in Cantonese, with sequence length increasing from 3 to 7 names. All animal names were common words in Cantonese, and all were monosyllabic. Children reconstructed the order of presentation of the animals by putting a digit (1–7) in the boxes under the animals' pictures. The maximum possible number of correct trials in this task was 10.

### RESULTS

Descriptive statistics for the results of all tasks are summarized in **Table 1**. Of particular interest were correlations between Cantonese lexical tone sensitivity, English lexical stress sensitivity and general auditory sensitivity, all of which were significant (ps < 0.01). These correlations suggest that these variables share common variance required for structural equation modeling.

TABLE 1 | Means, Standard Deviations, Reliabilities, and Inter-correlations of All Variables


N = 516; ∗∗∗p < 0.001; ∗∗p < 0.01.

### Testing Direct and Indirect Contributions of Cantonese Lexical Tone Sensitivity to English Lexical Stress Sensitivity: Nested Models Comparisons

Latent variable structural equation modeling of the covariance matrix was conducted with LISREL 8.80 (Joreskog and Sorbom, 2007). The four latent variables, i.e., Cantonese lexical tone sensitivity, English lexical stress sensitivity, general auditory sensitivity and working memory, were modeled with the Cantonese lexical tone discrimination task, DEEdee task, beat perception in music task and animal task, respectively (**Figure 1**).

The partial mediation and full mediation models were nested, as the latter was derived from the former by fixing one parameter (the direct effect from Cantonese lexical tone sensitivity to English lexical stress sensitivity) to zero. Given that the difference in chi-square values between two nested models is itself chi-square distributed, with degrees of freedom equal to the difference in degrees of freedom between the two models (Steiger et al., 1985), we used a chi-square difference test to determine which model better explained the relation between Cantonese lexical tone sensitivity and English lexical stress sensitivity.

The chi-square difference between the partial mediation model and the full mediation model was significant, Δχ²(1, N = 512) = 31.19, p < 0.001. According to Schermelleh-Engel et al. (2003), a significant chi-square difference indicates that the null hypothesis of equal fit for the partial and full mediation models is rejected, that is, the two models do not fit the data equally well. In such a case, the less restrictive model with the smaller chi-square is preferable because it fits significantly better than the more restrictive model with the larger chi-square. Thus, the less restricted partial mediation model, which had a smaller chi-square, χ²(7, N = 512) = 11.83, than the full mediation model, χ²(8, N = 512) = 43.02, was preferred.
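The nested-model comparison can be retraced from the reported chi-square values. The following is a minimal sketch, not the LISREL procedure used in the study: it relies on the fact that a chi-square variate with one degree of freedom is the square of a standard normal variate, so its upper-tail probability has a closed form. The function and variable names are illustrative assumptions.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variate with 1 degree of freedom.

    For df = 1, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    where Z is a standard normal variate.
    """
    return math.erfc(math.sqrt(x / 2.0))


# Model chi-square values and degrees of freedom as reported in the text.
chi2_full, df_full = 43.02, 8        # full mediation (direct path fixed to zero)
chi2_partial, df_partial = 11.83, 7  # partial mediation (direct path estimated)

delta_chi2 = chi2_full - chi2_partial
delta_df = df_full - df_partial

# The closed form above is only valid because delta_df equals 1 here.
p_value = chi2_sf_df1(delta_chi2)

print(f"delta chi2 = {delta_chi2:.2f}, delta df = {delta_df}, p = {p_value:.1e}")
```

The difference of 31.19 on one degree of freedom yields a p value far below 0.001, in agreement with the reported test. For a difference in degrees of freedom other than 1, a general chi-square survival function would be needed instead of the closed form.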

We evaluated the goodness of fit of the data to the partial mediation model with several goodness-of-fit indices, i.e., chi-square, comparative fit index (CFI), normed fit index (NFI), non-normed fit index (NNFI), and root-mean-square error of approximation (RMSEA). According to Hu and Bentler (1999), a value of 0.95 or above for CFI, NFI, and NNFI, and a value less than 0.06 for RMSEA denote a well-fitting model. The partial mediation model fit the data well, χ²(7, N = 512) = 11.83, CFI = 0.98, NFI = 0.96, NNFI = 0.97, RMSEA < 0.06, predicting 14% of variance in English lexical stress sensitivity.
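As an illustration of how the reported indices relate to the cut-offs, the sketch below checks the incremental fit indices against the 0.95 criterion and derives an RMSEA point estimate from the reported chi-square, degrees of freedom, and sample size. The formula used, sqrt(max(χ² − df, 0) / (df × (N − 1))), is one common convention and is an assumption here; the text reports only that RMSEA was below 0.06, and the software may apply a slightly different variant (e.g., N instead of N − 1).

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Cut-offs cited in the text (Hu and Bentler, 1999).
INCREMENTAL_MIN = 0.95  # minimum for CFI, NFI, NNFI
RMSEA_MAX = 0.06        # RMSEA should fall below this value

# Values reported for the partial mediation model.
reported = {"CFI": 0.98, "NFI": 0.96, "NNFI": 0.97}
rmsea_est = rmsea(chi2=11.83, df=7, n=512)

incremental_ok = all(value >= INCREMENTAL_MIN for value in reported.values())
rmsea_ok = rmsea_est < RMSEA_MAX

print(f"RMSEA estimate = {rmsea_est:.3f}")
print(f"incremental indices meet cut-off: {incremental_ok}; RMSEA meets cut-off: {rmsea_ok}")
```

Under this convention the estimate is roughly 0.037, which is consistent with the reported RMSEA < 0.06.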

Next, we evaluated the significance of the structural paths based on the z value associated with the unstandardized estimates of the path weights (Bentler, 2006). According to Bentler, a value of 1.96 or above for the z value indicates that the pathway is significant. In the partial mediation model, both direct and auditory-mediated pathways were significant, ps < 0.01 (**Figure 2**). These results suggest that Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity both directly and mediated through general auditory sensitivity.

### Relative Contributions of the Direct and Auditory-Mediated Pathways

We examined the relative contributions of the direct and indirect pathways by examining differences in the product coefficients of the two pathways (MacKinnon et al., 2007). In the partial mediation model, the product coefficients of the direct and indirect pathways were 0.09 and 0.02 (0.04 × 0.53), respectively (**Figure 3**). Thus, the coefficient difference of the two pathways was 0.09 − 0.02 = 0.07. With reference to MacKinnon and colleagues, a positive value of the coefficient difference indicates that the direct pathway has a larger relative contribution to the relationship between lexical tone and lexical stress perception than the indirect pathway.

FIGURE 2 | Best fit model (partial mediation model) for the relation between Cantonese lexical tone sensitivity and English lexical stress sensitivity. C\_Tone = Cantonese lexical tone sensitivity; E\_Stress = English lexical stress sensitivity; Aud = General auditory sensitivity; WM = Working memory; ToneDI = Cantonese lexical tone discrimination task; DEE = English lexical stress perception task; APPT-0 = Beat perception in music task (0 ms delay); APPT-100 = Beat perception in music task (100 ms delay); APPT-166 = Beat perception in music task (166 ms delay); Animal = Working memory task. The numerical values represent the standardized factor loadings. ∗∗∗p < 0.001; ∗∗p < 0.01.
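The arithmetic behind the comparison of the direct and indirect pathways can be retraced in a few lines. This is a sketch using only the coefficients quoted in the text; the labels `path_a` and `path_b` are hypothetical names for the two legs of the mediated pathway, since the text does not state which coefficient belongs to which leg.

```python
# Coefficients as quoted in the text.
direct_effect = 0.09
path_a = 0.04  # one leg of the mediated pathway (hypothetical label)
path_b = 0.53  # the other leg of the mediated pathway (hypothetical label)

# The indirect effect is the product of the two legs.
indirect_effect = path_a * path_b

# The text rounds the indirect effect to two decimals before subtracting.
difference_reported = round(direct_effect - round(indirect_effect, 2), 2)
difference_unrounded = direct_effect - indirect_effect

print(f"indirect effect = {indirect_effect:.4f}")
print(f"difference (rounded inputs) = {difference_reported:.2f}")
print(f"difference (unrounded) = {difference_unrounded:.4f}")
```

Whether the indirect effect is rounded before or after subtraction, the difference remains positive, which is the property the comparison depends on.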

### DISCUSSION

The current study set out to investigate how Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity within second-to-third grade Cantonese ESL children. Results indicate that Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity both directly, and indirectly through the mediation of general auditory sensitivity. In terms of relative contribution, Cantonese lexical tone sensitivity made a larger direct contribution than indirect contribution to English lexical stress sensitivity.

Consistent with a previous study (Tong et al., 2016), we have shown the contribution of Cantonese lexical tone sensitivity to English lexical stress sensitivity among Cantonese ESL children. This finding has extended previous studies on L1 and L2 segmental dimension (Comeau et al., 1999; Chien et al., 2008; Keung and Ho, 2009), by demonstrating that the contribution of L1 to L2 phonological skills of Cantonese ESL children is also evident at the suprasegmental dimension. Placed in the context of previous studies on adult suprasegmental speech perception (Nguyen et al., 2008; Wang, 2008; Yu and Andruski, 2010), the results suggest that Cantonese ESL children are able to exploit common sources of phonological variation across languages, and the perceptual operations underlying L1 lexical tone perception might be recruited in service of L2 English lexical stress perception, consistent with the predictions of the non-native speech perception models, e.g., the Speech Learning Model (Flege, 1995). The Cantonese ESL children therefore appear to profit from cross-language commonalities by virtue of the finding that sensitivity to suprasegmental variation in one language facilitates prosodic sensitivity in the other language. Furthermore, the current study has further extended Tong and colleagues' study by uncovering the underlying pathways through which Cantonese lexical tone sensitivity contributes to English lexical stress sensitivity. In particular, we found that Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity through two pathways – the indirect pathway involving the mediation of general auditory sensitivity, and the direct pathway without the mediation of general auditory sensitivity.

### Direct Contribution from Cantonese Lexical Tone Sensitivity to English Lexical Stress Sensitivity

The direct pathway through which Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity might be accounted for by joint phonological processes that underlie lexical tone and lexical stress perception. One possible shared component may be the ability to extract suprasegmental phonological information from full-spectral speech that includes segmental variation. Specifically, Cantonese lexical tones and English lexical stress are both instantiated on segments, most centrally, on the vowel (Cutler and Chen, 1997). In the Cantonese lexical tone discrimination task, which required children to integrate segmental variation and identify the odd tone, children had to extract tonal information from speech in order to compare the lexical tones of the target and distractors. Likewise, in the English stress perception task, children were required to identify reiterative stress patterns corresponding to that of a real word e.g., "deeDEEdee" for "aLAddin." Similarly, children had to extract the lexical stress pattern from the speech in order to match the reiterative lexical stress pattern with that of the real words. Each of these abilities involved extracting suprasegmental variation from full-spectral input and applying this variation to a new word. This account is in line with previous neurophysiological (Choi et al., 2017) and behavioral studies of tone perception (Repp and Lin, 1990; Lee and Nusbaum, 1993; Tong et al., 2008, 2014) which demonstrated that segmental and suprasegmental information were processed integrally rather than independently, suggesting that speech arrives at our senses as an integrated signal that has to be segregated in response to task demands. The ability to segregate the signal in this way and to extract and re-apply suprasegmental phonological information from full-spectral speech may underlie the direct contribution of Cantonese lexical tones sensitivity to English lexical stress sensitivity.

Another possible shared phonological component might be the phonological encoding of suprasegmental information. To date, evidence suggests that lexical tones (Singh et al., 2015; Tong et al., 2015b; Choi et al., 2016a; Luo et al., 2016) and lexical stress (Ashby and Clifton, 2005; Arciuli et al., 2010; Goswami et al., 2013) are encoded as essential components of phonological representations in Chinese and English, respectively. For example, in a test of toddlers' sensitivity to mispronunciations of tones, Chinese toddlers were very sensitive to lexical tones in a word recognition paradigm (Singh et al., 2015). Similarly, Curtin (2010) demonstrated that English-learning infants are very sensitive to lexical stress when learning new words. Placed in the current context, it might be the case that the joint skill in phonological encoding of lexical tones and lexical stress underlies the direct contribution of Cantonese lexical tone sensitivity to English lexical stress sensitivity. Speculatively, our results might be taken to imply that English lexical stress and Cantonese lexical tone were not entirely separate representations, either in the case that L2 English lexical stress was assimilated to L1 Cantonese lexical tone categories (Luke, 2000; Lai, 2003; Chan, 2007), or that L2 English lexical stress and L1 Cantonese lexical tones shared a common "general suprasegmental prototype" as posited by Tong et al. (2015a). These claims were not directly evaluated in the current study, and await further evidence.

### Indirect Contribution from Cantonese Lexical Tone Sensitivity to English Lexical Stress Sensitivity: General Auditory Sensitivity as a Mediator

The indirect pathway through which Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity was via the effects of general auditory sensitivity. In line with previous psychoacoustic studies of English lexical stress perception by tonal listeners (Nguyen et al., 2008; Wang, 2008; Yu and Andruski, 2010), the present results suggest that the contribution of Cantonese lexical tone sensitivity to English lexical stress sensitivity was driven not solely by the phonological interpretation of lexical tone/stress but also by a more general perceptual sensitivity to acoustic-phonetic variation. Empirically, our finding has extended previously attested links between general auditory sensitivity and Cantonese lexical tone sensitivity (Zhang and McBride-Chang, 2010, 2014) by further identifying a link between general auditory sensitivity and English lexical stress sensitivity, in part through which Cantonese lexical tone sensitivity contributed to English lexical stress sensitivity. As mentioned previously, lexical tones and lexical stress share at least one common acoustic cue, most notably, fundamental frequency (f0). Given the prominent role of pitch in tone perception (e.g., Gandour, 1981, 1983; Khouw and Ciocca, 2007; Tong et al., 2014) and in stress perception (Yu and Andruski, 2010), it stands to reason that listeners' sensitivity to lexical tones and lexical stress may depend on their sensitivity to acoustic pitch. In particular, general auditory sensitivity might be a common construct engaged in the perception of lexical tones and lexical stress at lower auditory levels of speech perception articulated in theoretical models of speech perception (e.g., see McMurray et al., 2011 for C-CuRE Model; Tong et al., 2014 for TTRACE Model; Choi et al., 2017 for TTRACE+ Model).

### Relative Contributions of the Direct versus Indirect Pathways

With regard to the relative contributions of the direct and indirect pathways, the current results suggest that the contribution of Cantonese lexical tone sensitivity to English lexical stress sensitivity was more strongly associated with a direct relationship than an indirect relationship. This suggests that the relationship in suprasegmental perception across languages is more heavily influenced by phonological factors than by sensitivity to acoustic-phonetic variation. In terms of the nature of test stimuli, the Cantonese tone discrimination task and English stress perception task both involved the use of real and frequent words. However, the beat perception in music task involved nonspeech tones, which did not resemble familiar contours for tones or stress and were thus linguistically irrelevant. We hypothesize that the similarity between tone and stress in terms of structural properties and communicative functions may predispose these cues to shared processing mechanisms.

### Theoretical and Practical Implications

In terms of theoretical significance, the present findings inform the literature on the pathways through which L1 phonological skill contributes to L2 phonological skill at the suprasegmental dimension. These findings have practical significance for L2 learners: one might expect the presence of two similar, but distinct, sources of phonological variation across languages, such as tone and stress, to cause confusion for L2 learners. Together with previous studies (e.g., Tong et al., 2015a, 2016), the present findings suggest that instead, Cantonese ESL children may harness their sensitivities to lexical tone in the service of lexical stress perception. In particular, the results suggest that L1 Cantonese lexical tone sensitivity contributes to the development of L2 English lexical stress sensitivity directly and indirectly through the mediation of general auditory sensitivity. It is therefore conceivable that improving lexical tone sensitivity might bolster sensitivity to lexical stress, given the evidence of transfer from the present study. Taken a step further, it is also possible that English L1 bilingual learners might benefit from mastering contrastive stress forms in order to enhance their understanding of the Cantonese tone inventory. This question was not tested in the current study, and future research is needed to explore it.

Despite these implications, it should be noted that the present study used a cross-sectional design, which limits causal inference. Thus, longitudinal data are needed to study the developmental changes of the three metalinguistic skills tested herein, i.e., Cantonese lexical tone sensitivity, English lexical stress sensitivity and general auditory sensitivity. This may help delineate causality among the three variables. Apart from what was suggested in this study, it might also be possible that children with a better ability to segregate different auditory pitches have greater potential for acquiring both lexical tone and lexical stress. The above claims can be evaluated using cross-lagged modeling, which requires a longitudinal design. Additionally, in the current study, only perception tasks were adopted to evaluate children's sensitivity to lexical tones and lexical stress. Production tasks of lexical tone and lexical stress may give a more complete picture regarding the developmental changes of lexical tone and lexical stress sensitivities among Cantonese ESL children. Similarly, multiple tasks can be adopted in measuring general auditory sensitivity and working memory, such as a pitch interval discrimination task and a visual working memory task. Also, future studies may seek to establish associations or dissociations between the neural perceptual mechanisms underlying lexical tone, lexical stress and other prosodic information such as intonation (Gandour et al., 2003), and how they might be shaped by language experience (Xu et al., 2006).

### CONCLUSION

The current study has gone beyond identifying the contribution of Cantonese lexical tone sensitivity to English lexical stress sensitivity among Cantonese ESL children, and further explored the pathways underlying the contribution. Results suggest that although lexical stress and lexical tone are distinct sources of variation that are used dissociatively in English and Cantonese, sensitivity to these properties of language develops interdependently in Cantonese children who learn English as L2. Our findings suggest that children may be able to harness their sensitivity to suprasegmental phonology of their L1 to efficiently process different sources of suprasegmental phonology in their L2. It is possible that a L2 learner's ability to detect and draw on cross-language commonalities may be a fundamental principle of learning that stimulates the growth of knowledge in both languages.

### ETHICS STATEMENT

We obtained ethical approval from The University of Hong Kong, Education Faculty Research Ethics Committee. Written consent forms were obtained from school principals, parents and the students prior to testing.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### ACKNOWLEDGMENTS

This research was supported, in part, by the ECS/RGC Early Career Scheme, the Hong Kong Special Administrative Regions Research Grants Council (27402514) to XT. The authors thank Professor Usha Goswami for her generous support for the use of the general auditory sensitivity task. The authors would like to acknowledge the principals of the Tsung Tsin Primary School and Kindergarten (Dr. Tam, W. L.), Po Leung Kuk Camões Tan Siu Lin Primary School (Mr. Yeung, V. M.), Munsang College Primary School (Ms. Leung, K. Y.), S. T. F. A. Wu Mien Tuen Primary School (Mr. Lau, H. Y.), and SKH Chu Oi Primary School (Ms. Lui, W. Y.) for their valuable support and participation in the study. The authors also acknowledge Dr. Lok Yin Joyce Kwan for her valuable suggestions on statistical methods.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer FEM and the handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2017 Choi, Tong and Singh. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Perceptual Improvement of Lexical Tones in Infants: Effects of Tone Language Experience

Feng-Ming Tsao\*

Department of Psychology, National Taiwan University, Taipei, Taiwan

To learn words in a tonal language, tone-language learners should develop better abilities for perceiving not only consonants and vowels but also lexical tones. The divergent trend of enhanced sensitivity to native phonetic contrasts and reduced sensitivity to non-native phonetic contrasts is theoretically essential for evaluating the effects of listening to an ambient language on speech perception development. The loss of sensitivity in discriminating lexical tones among non-tonal language-learning infants was apparent between 6 and 12 months of age, but only a few studies have examined trends in differentiating native lexical tones in infancy. The sensitivity in discriminating lexical tones among 6–8 and 10–12 month-old Mandarin-learning infants (n = 120) was tested in Experiment 1 using three lexical tone contrasts of Mandarin. Facilitation by linguistic experience was shown in one tonal contrast (Tone 1 vs. 3), but both age groups performed similarly in the other two tonal contrasts (Tone 2 vs. 4; Tone 2 vs. 3). In Experiment 2, 6–8 and 10–12 month-old Mandarin-learning infants (n = 90) were tested with tonal contrasts that have pitch contours either similar to or inverse from lexical tones in Mandarin, and perceptual improvement was shown only in a tonal contrast with familiar pitch contours (i.e., Tone 1 vs. 3). In Experiment 3, 6–8 and 10–12 month-old English-learning infants (n = 40) were tested with the Tone 1 vs. 3 contrast of Mandarin and showed an improvement in the perception of non-native lexical tones. This study reveals that tone-language learning infants develop more accurate representations of lexical tones around their first birthday, and the results of both tone and non-tone language-learning infants imply that the rate of development depends on listening experience and the acoustical salience of specific tone contrasts.

Keywords: infant lexical tone perception, pitch contour, native and non-native speech perception, developmental trends, Mandarin lexical tones

### INTRODUCTION

Perceptual sensitivity to consonants and vowels undergoes rapid changes during the first year of life. Infants start with a universal capacity to distinguish the phonemes of native and foreign languages (Eimas et al., 1971; Streeter, 1976), and improved sensitivity in discriminating native phonemes emerges between 6 and 12 months of age (Kuhl et al., 2006; Tsao et al., 2006). Similar to consonants and vowels, lexical tones distinguish lexical meanings of syllables in tonal languages: the most well-known example of a tone language is Mandarin Chinese, which boasts the largest number of first-language speakers worldwide (Lewis et al., 2015).

#### Edited by:

Leher Singh, National University of Singapore, Singapore

#### Reviewed by:

Rushen Shi, Université du Québec à Montréal, Canada Liquan Liu, Western Sydney University, Australia

> \*Correspondence: Feng-Ming Tsao tsaosph@ntu.edu.tw

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 09 November 2016 Accepted: 27 March 2017 Published: 11 April 2017

#### Citation:

Tsao F-M (2017) Perceptual Improvement of Lexical Tones in Infants: Effects of Tone Language Experience. Front. Psychol. 8:558. doi: 10.3389/fpsyg.2017.00558

The developmental trends of infants distinguishing consonants and vowels from both native and foreign languages are well-documented (Werker et al., 2012), but only a few studies have explored the developmental trajectories of lexical tone perception in non-tonal language-learning infants (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Liu and Kager, 2014; Singh and Fu, 2016; Singh et al., 2016). It remains unclear whether infants learning a tonal language as their first language improve in their sensitivity in distinguishing lexical tones during the second half of their first year of life.

There is increasing evidence to suggest that infants acquire detailed information about their native language by listening to and analyzing linguistic inputs during the first year of life (Kuhl et al., 2008; Werker et al., 2012). By 6 months of age, infants engage in a detailed analysis of the distributional properties of the sounds contained in their ambient language, which alters their perception such that they tend to focus more on native-like phonetic processing (Kuhl et al., 1992; Maye et al., 2008). By 10–12 months of age, the developmental change in the phoneme perception of infants is apparent. There is a steep decline in the discrimination of non-native phonemes (Werker and Tees, 1984; Palmer et al., 2012) and an improvement in that of native phonemes (Kuhl et al., 2006; Tsao et al., 2006), reflecting changes that depend on linguistic experience. Although rapid changes in differentiating consonant contrasts between 6 and 12 months of age were reported in numerous studies, few studies have reported the maintenance of perceptual sensitivity. For example, 10–12 month-old English-learning infants tested on their ability to discriminate the /d/ vs. /ð/ contrast of English performed similarly to 6–8 month-old infants of the same language (Polka et al., 2001). The language-specific pattern of differentiating the English /d/ vs. /ð/ contrast emerged later than 12 months of age, when 4-year-old English-speaking children performed better than French-speaking children of the same age in distinguishing the English /d/ vs. /ð/ contrast (Sundara et al., 2006).

Regarding the perceptual development of phonetic segments, several theoretical models, such as attunement, perceptual learning and maturation theories, have been proposed to interpret the effects of language experience on developmental trajectories of speech perception in infancy (Aslin and Pisoni, 1980). Studies that show a perceptual decline in the discrimination of non-native consonants and a perceptual improvement in the discrimination of native consonants have provided greater support to theories of attunement and perceptual learning than to other models. Attunement theory assumes that, with increasing listening experience with the ambient language, phonologically relevant contrasts become finely tuned, whereas phonologically irrelevant contrasts remain broadly tuned or are attenuated. In other words, attunement theory predicts three developmental trajectories of discriminating native and non-native phonetic contrasts: facilitation, maintenance, and loss. Perceptual learning theory assumes that the development of speech perception depends on the frequency of occurrence and relative acoustical discriminability of specific phonetic contrasts, and that the rate of development could be slow or fast. Although attunement theory has gained more support than perceptual learning theory, some hybrid of theories best describes the development of specific categories of phonetic discrimination (Aslin and Pisoni, 1980). Would the perceptual development trends predicted by attunement theory, perceptual learning theory, or their combination be evident in tonal perception development?

Despite the extensive literature on infant perception of phonetic segments (e.g., vowels and consonants), the developmental trends of lexical tone perception in tonal and non-tonal language learners have not been fully explored (Singh and Fu, 2016). Nevertheless, some studies have reported mixed findings regarding whether the perceptual decline in the discrimination of lexical tones is universal in non-tonal language-learning infants before their second birthday. Some studies have demonstrated a perceptual decline that occurred among English-learning infants between 4 and 9 months of age when discriminating lexical tones of Thai or Cantonese (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013). Compared with French-learning 6-month-old infants, reduced sensitivity to discriminating lexical tones of Thai has also been reported among 10-month-old infants learning the same language (Cabrera et al., 2015). However, 19-month-old English-learning infants were able to discriminate lexical tone contrasts of Mandarin (Hay et al., 2015, Experiment 3). Dutch-learning infants were able to discriminate a Mandarin lexical tone contrast with larger pitch differences between 5 and 18 months of age; however, their sensitivity in distinguishing that same tonal contrast with smaller pitch differences was reduced between 9 and 15 months of age, and improved at approximately 18 months of age (Liu and Kager, 2014). These studies raise questions regarding whether the experience of listening to a non-tonal language either reduces or maintains infants' sensitivity in distinguishing lexical tones after 9 months of age, and the results of Liu and Kager (2014) suggest that the acoustical discriminability of contrasts affects the development of tone sensitivity.

Reduced sensitivity to lexical tone contrasts among non-tonal language learners reveals that listening to an ambient language shifts the perceptual organization of lexical tones, and partially supports the attunement theory because a loss in sensitivity to tone is predicted by this model. Assessing tone perception among tonal language learners is not only necessary to reveal the developmental trends of differentiating native tone contrasts, but enhanced sensitivity to native tone contrasts is also theoretically required to evaluate attunement theory of speech perception development. In addition to listening to a tonal language, if development of tone perception depends on relative acoustical discriminability of specific tone contrasts, the perceptual learning model assumes that rate of development is slow for infants to distinguish acoustically similar tone contrasts. In other words, facilitation as well as maintenance of differentiating native tone contrasts across ages are predicted by models of speech perception.

It is therefore important to assess whether the native phonological system facilitates or maintains tonal-language learning infants' sensitivity to native tonal contrasts while non-tonal language learners change their sensitivity to non-native lexical tones. Such an investigation would help construct a better conceptual framework through which the development of native and non-native tone sensitivity could be explored between 6 and 12 months of age. Mandarin-learning infants and Cantonese-learning infants have been reported to show language-specific listening preferences for their native lexical tones at approximately 5 months of age (Yeung et al., 2013). However, it is still unclear whether exposure to a tonal language would either facilitate or maintain infants' sensitivity in the discrimination of native tone contrasts around their first birthdays.

The rate of tone perception development might vary with the relative acoustical salience of tone contrasts. In infant- and child-directed speech, the average heights and contours of the fundamental frequency (F0) distinguish four lexical tones in Mandarin; however, some tones have similar F0 contours (Liu et al., 2009). **Figure 1** illustrates the F0 contours of the four lexical tones in Mandarin. Tone 1 is a high-level tone and Tone 4 is a high-falling tone. The pitch directions of both Tones 1 and 4 are not greatly altered within a syllable. However, Tones 2 (mid-rising tone) and 3 (low-dipping tone) exhibit similar F0 contours in isolated syllables: both have a concave F0 shape. The acoustical similarity between Tones 2 and 3 results in the frequent confusion of this tone contrast by non-tonal language speakers (Wang et al., 1999; So and Best, 2010). In contrast, although Tones 2 and 4 exhibit a similar average F0, they have different F0 contours: a rising F0 contour for Tone 2 and a falling F0 contour for Tone 4. For English-speaking adults, the Tone 2 vs. 3 pair is the most difficult to discriminate, followed by the Tone 2 vs. 4 pair, and the Tone 1 vs. 3 pair is the easiest (e.g., Wang et al., 1999). Three-year-old Mandarin-speaking children confuse Tone 3 with Tone 2 more readily than other tone pairs (Wong et al., 2005). Acoustical salience of tone contrasts also affects the discrimination of lexical tones in preverbal infants. Tsao (2008) reported that 12-month-old Mandarin-learning infants were more accurate in discriminating the contrast between Tones 1 and 3 than those between Tones 2 and 4 and Tones 2 and 3. Tsao's (2008) results suggested that the growth rate for distinguishing tone contrasts between 6 and 12 months in Mandarin-learning infants might vary with the acoustical salience of tone contrasts. The acoustical salience of consonant contrasts influences infants' abilities to differentiate syllable-initial consonants between 6 and 12 months of age (e.g., Narayan et al., 2010). Adopting tone contrasts that vary in acoustical salience is therefore conceptually essential for examining whether the rate of tone perception development depends on both the listening experience with lexical tones and the relative acoustical discriminability of tone contrasts.

Although both pitch height (measured by the mean fundamental frequency) and pitch direction (measured by the time of pitch direction change or the slope of pitch contour) (Liu et al., 2009; Chandrasekaran et al., 2010) are acoustical correlates of Mandarin lexical tones, the perceptual weights of these acoustical cues vary with speakers' levels of proficiency in identifying and discriminating lexical tones. For Mandarin speakers, the pitch direction (or pitch contour) is perceptually weighted more heavily than the pitch height. In contrast, English speakers tend to weigh pitch height more than they do pitch direction (Gandour and Harshman, 1978). The perceptual weight difference between the height and direction of pitch also reflects individual differences among non-tonal language speakers when perceiving lexical tones. English-speaking adults who are more accurate in labeling the pitch pattern (level, rising, and falling) of lexical tones also weigh the pitch direction more heavily than they weigh the pitch height (Chandrasekaran et al., 2010). In brief, adult speakers who are able to track pitch contour exhibit better perception of Mandarin tones. In addition to exploring the general trends of differentiating tone contrasts between 6 and 12 months of age, to further examine the developmental mechanism of tone perception, it is essential to explore whether infants attune to language-specific pitch contours while improving their perceptual sensitivity to native tonal contrasts.

The acoustical features of lexical tones, i.e., pitch height and contours, are also acoustical parameters of linguistic prosody. Although variations of pitch contour within syllables do not change the lexical meanings of English syllables, 8- to 12-month-old English-learning infants showed an improvement in their ability to utilize prosodic patterns between syllables (i.e., word stress) in the segmentation of words and phrases from continuous speech (Soderstrom et al., 2003; Thiessen et al., 2005; Seidl, 2007). If the improvements in the ability of English-learning infants to process linguistic prosody generalized to pitch features of lexical tones, the accuracy of discriminating lexical tones by English-learning infants might either not decline or even improve for each tonal contrast of a non-native language before their first birthday.

To reiterate, this study aimed to examine developmental trajectories of native and non-native tone perception among infants between 6 and 12 months of age. In addition, this study explored whether the sensitivity to acoustical features of language-specific lexical tones, such as pitch contours, enhances tone perception around the first birthday. Three experiments were conducted to address these questions. Experiment 1 was designed to explore developmental trends of native lexical tone perception among Mandarin-learning infants. The acoustical salience of lexical tone contrasts refers to the magnitude of the differences between acoustical parameters essential to differentiate lexical tones (i.e., pitch height and contour). The acoustically most salient contrast has the largest acoustical difference, i.e., the Tone 1 vs. 3 contrast. To vary the acoustical salience of tonal contrasts, the following tone contrasts were used: Tone 1 vs. Tone 3, Tone 2 vs. Tone 4, and Tone 2 vs. Tone 3. If lexical tone perception underwent a marked change between 6 and 12 months of age, the older Mandarin-learning infants would outperform the younger ones in the discrimination of native lexical tones. However, if the rate of development depends on the interaction between listening experience and relative acoustical salience of tone contrasts, developmental trends of differentiating native tone contrasts would vary with tone contrasts. Improved sensitivity to discriminate tone contrasts might be observed for acoustically more salient contrasts, but maintenance of perceptual sensitivity might be shown for acoustically less salient contrasts. Experiment 2 explored whether Mandarin-learning infants relied on language-specific pitch contours to discriminate tonal contrasts by testing their sensitivity to two tonal contrasts that differed only in whether their pitch contours were native to Mandarin. The pitch contours of one tonal contrast were similar to the lexical tones in Mandarin, but the other tonal contrast included a contour that was the inverse of a Mandarin lexical tone. The prediction of Experiment 2 was that older Mandarin-learning infants would outperform their younger peers in discriminating tone contrasts with pitch contours similar to Mandarin tones. Experiment 3 employed a cross-language design to examine the developmental trends in the perception of non-native lexical tones among 6–8- and 10–12-month-old English-learning infants. The hypothesis was that the acoustical salience of the tone contrast and the improvement in processing linguistic prosody among English-learning infants around the first birthday would also enhance English-learning infants' ability to distinguish tone contrasts with greater acoustical salience. In addition, if the 10–12-month-old Mandarin-learning infants demonstrated higher accuracy in discriminating Mandarin tones than the English-learning infants at the same age, it would indicate that listening to lexical tones provides additional benefits that facilitate the development of lexical tone perception.

### EXPERIMENT 1: DEVELOPMENT OF NATIVE LEXICAL TONE PERCEPTION

### Method

#### Participants

Two age groups of Mandarin-learning infants (n = 120) participated in the study: (a) 10–12-month-olds: Tone 1 vs. 3 (n = 20, girls n = 10, mean age = 10.96 months, SD = 1.23 months), Tone 2 vs. 3 (n = 20, girls n = 6, mean age = 11.10 months, SD = 0.82 months), and Tone 2 vs. 4 (n = 20, girls n = 9, mean age = 11.12 months, SD = 0.74 months); (b) 6–8-month-olds: Tone 1 vs. 3 (n = 20, girls n = 8, mean age = 7.33 months, SD = 0.50 months), Tone 2 vs. 3 (n = 20, girls n = 8, mean age = 7.32 months, SD = 0.44 months), and Tone 2 vs. 4 (n = 20, girls n = 11, mean age = 7.32 months, SD = 0.38 months). Eighteen additional infants failed to complete the testing procedures due to their inability to pass the conditioning. Results of a χ<sup>2</sup> test on the rates of infants who could not pass the conditioning phase of the tone discrimination procedure indicated that neither the age nor the tone effect reached significance, at 6–8 months, χ<sup>2</sup>(2) = 0.156, p = 0.925, and at 10–12 months, χ<sup>2</sup>(2) = 0.252, p = 0.882. The pre-established criteria for inclusion in the study were that infants had no known visual or auditory deficits, were born full term (±14 days from the due date), were delivered without complications, had a normal birth weight (2.5–4.5 kg), and were developing normally. In addition, the members of the infants' immediate families had no history of hearing loss. Parents were paid NT\$ 600 for their child's participation in the experiment.

Mandarin-learning infants were recruited either from the lists of names on the House Registry of the Da-An and Chung-Cheng Areas, Taipei City, Taiwan, or through an advertisement notice posted on the Internet. Although Taiwan is a multilingual society, Mandarin is the most dominant language spoken in homes. The Mandarin-dominant (or -only) language environment of Taiwanese infants was verified through a language background questionnaire, which was administered to the caregiver before the study began. This study was carried out in accordance with the recommendations of 'American Psychological Association ethical standards' and 'Research Ethics Committees of National Taiwan University' with written informed consent from all participants. All parents gave written informed consent in accordance with the Declaration of Helsinki.

#### Stimuli

The speech stimuli were Tone 1 [tɕʰi1] (duration = 690 ms), Tone 2 [tɕʰi2] (duration = 600 ms), Tone 3 [tɕʰi3] (duration = 770 ms), and Tone 4 [tɕʰi4] (duration = 482 ms) syllables, recorded in a sound-attenuation booth by a female native Mandarin speaker at a normal speaking rate, and digitized with the speech analysis software Computerized Speech Lab (CSL 4400) at a 22,050 Hz sampling rate and 16-bit resolution. The use of naturally produced speech stimuli instead of computer-synthesized stimuli provided the most natural tokens by which lexical tone sensitivity in infants could be examined. Because acoustical salience between tonal contrasts was reported to affect the accuracy of discriminating tonal contrasts among 1-year-old Mandarin-learning infants (Tsao, 2008), this experiment adopted three tone contrasts that differed in average F0 and F0 contour: (1) the Tone 1 vs. 3 pair was acoustically the most distinct; (2) Tone 2 vs. 3 was acoustically the most similar; and (3) Tone 2 vs. 4 had a moderate acoustical similarity. The duration, average F0, F0 range, and turning point [= (time of the minimal F0 ÷ tone duration) × 100%] are acoustical correlates of lexical tones (Liu et al., 2007). Acoustical correlates of lexical tones were assessed using the speech analysis software Praat (Boersma and Weenink, 2011). For speech stimuli in this experiment, lexical tones were only manifested on vowels. **Figure 1** illustrates the F0 contours of the four lexical tones and **Table 1** lists the acoustical features of lexical tones. The duration of lexical tones is an acoustical correlate of lexical tones in natural speech (Liu et al., 2009) and was preserved in the digitized speech stimuli. The durations of the syllable-initial consonant [tɕʰ] were 238 ms (Tone 1), 240 ms (Tone 2), 216 ms (Tone 3), and 192 ms (Tone 4), respectively. The speech samples were edited with the sound-editing software Sound Forge 7.0 (Sony, 2004) to equalize the root mean square (RMS) levels of each syllable.
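The turning point defined above can be illustrated with a minimal sketch. The function name, the sampled-contour representation, and the example values are hypothetical, and the sketch assumes that the first and last F0 samples mark the onset and offset of the tone; it is not the Praat procedure used in the study.

```python
def turning_point_percent(f0_times_ms, f0_values_hz):
    """Time of the F0 minimum as a percentage of tone duration.

    Assumption: the first and last samples mark tone onset and offset,
    so their time difference stands in for the tone duration.
    """
    if len(f0_times_ms) != len(f0_values_hz) or len(f0_times_ms) < 2:
        raise ValueError("need two or more paired time and F0 samples")
    duration_ms = f0_times_ms[-1] - f0_times_ms[0]
    if duration_ms <= 0:
        raise ValueError("tone duration must be positive")
    # Index of the lowest F0 sample in the contour.
    i_min = min(range(len(f0_values_hz)), key=lambda i: f0_values_hz[i])
    time_of_min_ms = f0_times_ms[i_min] - f0_times_ms[0]
    return time_of_min_ms / duration_ms * 100.0


# Hypothetical dipping contour (illustrative values, not the study's data).
times_ms = [0, 100, 200, 300, 400, 500]
f0_hz = [210, 190, 170, 160, 180, 205]
print(turning_point_percent(times_ms, f0_hz))
```

A later turning point corresponds to a contour whose dip occurs later in the tone, which is one way the dipping and rising contours can be told apart.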

#### Apparatus

Speech stimuli were presented using a personal computer (HP Compaq DC7100). The sounds were amplified (Yamaha RX V350) and delivered to infants in an adjoining sound-treated test room via a loudspeaker (Bowers & Wilkins DM303). Parents and experimenters wore headphones (SONY MDR-CD 280) and listened to music from a CD during the tests, so they could not distinguish between the stimuli presented to the infants. Infants' responses were monitored in the control room using a digital camera (SONY Handycam PC350) and a video monitor. Operated by an experimenter, who pushed a button on a handheld switch, the computer used a data acquisition board (National Instrument PCI-6503) to activate the reinforcer and record the infants' head-turn responses.

#### Test Suite

The test suite consisted of two rooms. In the sound-attenuation test room, an infant was held on his or her parent's lap, facing forward while an assistant sat at a 90-degree angle to the infant's right side. The assistant maintained the infant's attention by manipulating a series of engaging, silent toys to bring the infant's gaze to midline (straight ahead of the infant). A bank of two visual reinforcers was located at a 90-degree angle to the infant's left side, and each consisted of a dark Plexiglas box (13″ × 13″ × 13″) containing a commercially available mechanical toy (e.g., a dancing snowman). The toys were not visible until activated, at which point the lights mounted inside the box were illuminated. The visual reinforcers were placed on either side of the loudspeaker, at the infant's eye level. A camera located in front of the infant fed an image of the test room to the adjoining control room, where an experimenter observed the infant's behavior.

#### Infant Testing Procedure

The Head-Turn (HT) testing procedure has been previously used to explore developmental changes in consonant perception among infants 6–12 months of age (Kuhl et al., 2006; Tsao et al., 2006). Infants were first trained to produce a head turn for visual reinforcement whenever the "background" speech sound (e.g., [tɕʰi1]), which was repeated once every 2 s, was changed to the "target" speech sound (e.g., [tɕʰi3]). The pitch contours of Tones 2 and 3 are acoustically more similar to each other than those of the other two lexical tones (i.e., Tones 1 and 4). To reduce the possibility that large acoustical differences between the target speech sounds of the tonal contrasts would also contribute to performance differences among tone contrasts, the target tone of each contrast was one of these contour tones. Tone 3 was the target tone for the Tone 1 vs. 3 and the Tone 2 vs. 3 contrasts, while Tone 2 was the target tone for the Tone 2 vs. 4 contrast. The experimental protocol required a two-step training phase followed by a Test phase, all of which were computer-controlled. While the speech stimuli were playing in the background, the assistant played with toys to get the infant's attention and distract the infant's attention from the speech stimuli.

The first step of the training phase consisted of Conditioning (+ Intensity). During this phase, infants were trained to associate the presentation of the target speech sound with the activation of visual reinforcers. The target sound interrupted the repetitive presentation of the background speech sound, and was presented at a level that was 4 dBA higher than that of the background speech sound. During the training phase, every trial was considered a target trial. The target stimulus was presented three times in a row. The onset-to-onset interstimulus interval was 2000 ms. The infant quickly learned to anticipate the visual reinforcer when the speech sound was changed from the background to the target. The infant had to respond to the sound change within 6 s after the first presentation of the target sound in order to watch the visual reinforcement. When the infant correctly anticipated the visual reinforcers with a head turn on two consecutive trials, the test proceeded to the next training phase, Conditioning (− Intensity).

In the Conditioning (− Intensity) phase, the target sound was presented at the same intensity level as the background sound; the infants used only the phonetic difference between the sounds as a cue. All other parameters of the experiment remained the same. The infants needed to correctly produce three anticipatory head turns to advance to the Test phase. Those who failed to pass the two-phase training within 30 trials were excluded from the sample. The speech stimuli were the same in both Conditioning and Test phases, similar to those in other infant studies using the head-turn procedure (Kuhl et al., 2006; Tsao et al., 2006). The Test phase consisted of 30 trials, with an equal number of Change and Control (no-change) trials presented in random order. Infants completed both training and testing phases in about 20 min on the same day.
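As an illustration of how a Test phase of this kind can be scored, the sketch below computes a percentage of correct responses. It assumes that a correct response is either a head turn on a Change trial (hit) or no head turn on a Control trial (correct rejection); the text does not spell out the scoring rule, so this is an assumption, and the function name and the example session are hypothetical.

```python
def percent_correct(trials):
    """Score a head-turn Test phase.

    trials: list of (is_change_trial, head_turn) boolean pairs.
    Assumption: hits and correct rejections both count as correct.
    """
    if not trials:
        raise ValueError("no trials to score")
    correct = 0
    for is_change_trial, head_turn in trials:
        # Turn on a Change trial, or no turn on a Control trial.
        if is_change_trial == head_turn:
            correct += 1
    return 100.0 * correct / len(trials)


# Hypothetical 30-trial session: 15 Change and 15 Control trials.
session = (
    [(True, True)] * 11      # hits
    + [(True, False)] * 4    # misses
    + [(False, False)] * 10  # correct rejections
    + [(False, True)] * 5    # false alarms
)
print(percent_correct(session))
```

Under this scoring rule, an infant who never turns or who turns on every trial scores 50% when Change and Control trials are equally frequent, which is why 50% serves as the chance level.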

In all phases of training and testing, trials were initiated by the research assistant, who showed toys to the infants in the test room. The assistant initiated trials when infants appeared ready (focusing on the toys held by the assistant). The experimenter could not hear the stimuli presented during the trials (a computer-controlled gating network cut out the sound during the trial), and was unaware of the type of trial that was automatically selected by the computer. The experimenter judged the head turn and pushed a button on a hand-held switch connected to the computer through the data acquisition board to indicate a head turn. The assistant could not hear the stimuli being presented at any time during the experiment, but was informed that a trial was underway by a small light that was automatically activated for the duration of a trial (out of the infant's view). This was necessary information for the assistant as she was instructed not to change the toy in the midst of a trial.

### Results and Discussion

An Age × Contrast two-way ANOVA of the percentage of correct responses revealed that 10–12-month-old Mandarin-learning infants (M = 69.86%, SD = 12.96) performed better than their 6–8-month-old counterparts (M = 59.64%, SD = 5.74), F(1,114) = 51.22, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.310. Further, perceptual accuracy significantly varied by the tone contrast, F(2,114) = 21.55, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.274. The Age × Contrast interaction was significant, F(2,114) = 18.39, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.244, showing that the developmental trends in the discrimination of lexical tones varied by tonal contrasts. **Figure 2** shows the percentage of correct responses by infants aged 6–8 and 10–12 months while distinguishing native lexical tone contrasts.
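The partial eta squared values reported above can be recovered from each F ratio and its degrees of freedom through the standard identity η<sub>p</sub><sup>2</sup> = (F × df<sub>effect</sub>) / (F × df<sub>effect</sub> + df<sub>error</sub>). The short sketch below applies that identity; the function name is illustrative and is not taken from the study's analysis scripts.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F ratio and its degrees of freedom.

    Follows from eta_p^2 = SS_effect / (SS_effect + SS_error) and
    F = (SS_effect / df_effect) / (SS_error / df_error).
    """
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and dfs positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Effects reported for Experiment 1.
for label, f_value, df_effect in [
    ("Age", 51.22, 1),
    ("Contrast", 21.55, 2),
    ("Age x Contrast", 18.39, 2),
]:
    print(label, round(partial_eta_squared(f_value, df_effect, 114), 3))
```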

Among tonal contrasts, the Bonferroni post hoc test (p < 0.001) showed that the Tone 1 vs. 3 contrast (M = 71.37%, SD = 12.63) was easier for infants to discriminate than the other two contrasts, i.e., the Tone 2 vs. 4 contrast (M = 61.13%, SD = 7.87) and the Tone 2 vs. 3 contrast (M = 61.76%, SD = 9.76). To further examine the interaction effect between Age and Tone contrast, separate one-way ANOVAs on age effect were run for each contrast. For the Tone 1 vs. 3 contrast, 10–12-month-olds (M = 82.51%, SD = 5.83) performed significantly better than their 6–8-month-old counterparts (M = 60.23%, SD = 5.69), F(1,38) = 149.78, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.798. Performance of both infant groups in discriminating the Tone 1 vs. 3 contrast was significantly above chance level (percentage of correct response = 50%) at p < 0.001, one-sample t-test, 6–8-month-old infants, t(19) = 8.05; 10–12-month-old infants, t(19) = 24.94. However, the perceptual improvement shown for the Tone 1 vs. 3 contrast was not observed in discrimination of the other tone contrasts. For the Tone 2 vs. 4 contrast, older infants (M = 62.32%, SD = 9.03) did not perform significantly more accurately than younger infants (M = 59.93%, SD = 6.53), F(1,38) = 0.915, p = 0.345. Performance of both infant groups was significantly above chance level at p < 0.001, one-sample t-test, 6–8-month-old infants, t(19) = 6.80; 10–12-month-old infants, t(19) = 6.10. Further, for the Tone 2 vs. 3 contrast, the performance difference between the older (M = 64.75%, SD = 12.26) and younger infants (M = 58.77%, SD = 5.11) was not significant, F(1,38) = 4.06, p = 0.051, η<sub>p</sub><sup>2</sup> = 0.097. Performance of both infant groups was significantly above chance level at p < 0.001, one-sample t-test, 6–8-month-old infants, t(19) = 7.67; 10–12-month-old infants, t(19) = 5.38. The result that the Tone 1 vs. 3 contrast, the acoustically more distinct contrast, is easier than other tonal contrasts for infants to distinguish suggests that acoustical salience between tonal contrasts affects the developmental trends of native lexical tone perception.
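The comparisons against chance can be checked from the reported group means and standard deviations with the usual one-sample t statistic, t = (M − 50) / (SD / √n). The sketch below uses a hypothetical function name and the rounded summary values from the text, so recomputed statistics may differ from the reported ones in the last digit.

```python
import math


def one_sample_t(mean, sd, n, mu0=50.0):
    """One-sample t statistic against a reference value (chance = 50%)."""
    if n < 2 or sd <= 0:
        raise ValueError("need n >= 2 and a positive SD")
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error


# Tone 1 vs. 3 contrast, n = 20 per age group (degrees of freedom = 19).
t_younger = one_sample_t(60.23, 5.69, 20)
t_older = one_sample_t(82.51, 5.83, 20)
print(round(t_younger, 2), round(t_older, 2))
```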

Results of this experiment showed that, between 6 and 12 months of age, the developmental rates of distinguishing lexical tones varied by tone contrasts. Significant improvement was observed in the Tone 1 vs. 3 contrast; this trend is consistent with previous findings that have shown an increasing sensitivity to native consonants (Kuhl et al., 2006; Tsao et al., 2006; Narayan et al., 2010). However, this developmental trend was less obvious in the other two contrasts, Tone 2 vs. 4 and Tone 2 vs. 3. The results of this experiment reveal a trend that Mandarin-learning infants improve their perceptual sensitivity to discriminate native lexical tones around their first birthdays, but the acoustical salience of tonal contrast would impact the learning rate in developing lexical tones.

### EXPERIMENT 2: PERCEPTUAL DEVELOPMENT OF PITCH CONTOURS AMONG MANDARIN-LEARNING INFANTS

Results of Experiment 1 revealed that exposure to a lexical-tone language interacts with the acoustical salience of lexical tones in the development of lexical tone perception. Pitch contour and height are acoustical cues of lexical tones, but tonal-language speaking adults perceptually weigh pitch contour more than pitch height (Gandour and Harshman, 1978; Gandour, 1984; Chandrasekaran et al., 2010). Would the perceptual improvement of differentiating the Tone 1 vs. 3 contrast in Experiment 1 be the result of increased tuning to the familiar pitch contours of this tone contrast among Mandarin-learning 10–12-month-old infants? Experiment 2 explored tonal perception development among Mandarin-learning infants by examining whether 10–12-month-old infants would outperform 6–8-month-old infants in discriminating tonal contrasts with familiar pitch contours. Two sets of tonal contrasts were used in Experiment 2; the pitch height of each lexical tone was the same, and pitch contour difference was the only valid cue to perceptually distinguish the lexical tones. To generate a familiar tonal contrast, one tonal contrast included pitch contours similar to Tones 1 and 3 of Mandarin lexical tones, but the novel contrast included the inverse pitch contour of Tone 3 and the non-inverse pitch contour of Tone 1.

### Method

#### Participants

The participants were 90 Mandarin-learning infants in Taiwan who were tested in two lexical-tone conditions: (1) familiar lexical-tone contrast, 7-month-olds (n = 23, mean age = 7.53 months, SD = 0.69 months, boys n = 10) and 11-month-olds (n = 23, mean age = 11.4 months, SD = 0.32 months, boys n = 15), and (2) novel lexical-tone contrast, 7-month-olds (n = 21, mean age = 7.10 months, SD = 0.29 months, boys n = 12) and 11-month-olds (n = 23, mean age = 11.13 months, SD = 0.25 months, boys n = 13). Thirteen additional infants failed to complete the testing procedures because of their inability to pass the conditioning phase. Results of a χ<sup>2</sup> test on the rate of infants who could not pass the conditioning indicated that neither the age nor the tone contrast effect reached significance, at 7 months, χ<sup>2</sup>(1) = 0.331, p = 0.565, and at 11 months, χ<sup>2</sup>(1) = 0.754, p = 0.385. The pre-established criteria for inclusion in the experiment were the same as in Experiment 1. Parents were paid NT\$ 600 for their child's participation in the experiment. This study was carried out in accordance with the recommendations of 'American Psychological Association ethical standards' and 'Research Ethics Committees of National Taiwan University' with written informed consent from all participants. All parents gave written informed consent in accordance with the Declaration of Helsinki.

#### Stimuli, Equipment, and Phonetic Testing Procedure

The speech stimuli were Mandarin consonant-vowel syllables ([tɕʰi], duration = 668 ms) with three patterns of pitch contour (two familiar tones and one novel tone). These tones formed two tonal contrasts in the experiment. For the familiar contrast, the pitch contours of the speech stimuli were similar to Tones 1 and 3 of Mandarin. To generate the novel tone contrast, the pitch contour of one stimulus was similar to Tone 1, but the pitch contour of the other stimulus was the inverse of Tone 3, a pattern that does not exist in any lexical tone of Mandarin. **Figure 3** depicts the pitch contours of the speech stimuli. The pitch direction of inverse Tone 3 is generally similar to Tone 4 (falling tone) of Mandarin, but with a later onset of the pitch fall. Therefore, combining inverse Tone 3 and non-inverse Tone 1 would generate a novel tone contrast for Mandarin-learning infants. To control the effects of acoustical salience on phonetic discrimination, the average pitch height (mean F0 = 217 Hz) and the vowel formant structures were the same for all speech stimuli, and the pitch contour was the only acoustical parameter by which to distinguish lexical tones. To generate more natural stimuli, the speech stimuli were modified from a naturally produced token using the sound-modification software Praat (Boersma and Weenink, 2011). The testing procedure for the phonetic discrimination was the same as in Experiment 1. For both familiar and novel contrasts, Tone 1 was the background sound; Tone 3 was the target sound in the familiar contrast and inverse Tone 3 was the target sound in the novel contrast.

### Results and Discussion

**Figure 4** displays the percentages of correct lexical tone discrimination at 7 and 11 months of age. The results of a two-way ANOVA (between-subject factors, Age: 7 vs. 11 months; Tonal contrast: familiar vs. novel) showed that older infants (M = 73.86%, SD = 10.12) performed better than younger infants (M = 68.30%, SD = 10.72), F(1,86) = 6.85, p = 0.010, η<sub>p</sub><sup>2</sup> = 0.074, and the familiar contrast (M = 73.47%, SD = 11.95) was easier than the novel contrast (M = 68.71%, SD = 8.78), F(1,86) = 5.08, p = 0.027, η<sub>p</sub><sup>2</sup> = 0.056. The Age × Contrast interaction effect was not significant, F(1,86) = 0.801, p = 0.373. However, given the a priori hypotheses of no tone contour effect at 7 months and a contour preference emerging at 11 months, planned comparisons (simple effects tests) were conducted to verify the prediction that tone discrimination varies by pitch contour within each age group. At 7 months of age, infants performed similarly in discriminating both familiar (M = 69.70%, SD = 11.93) and novel (M = 66.78%, SD = 9.26) tone contrasts, as indicated by a planned comparison, t(42) = 0.901, p = 0.373, d = 0.274. Both 7-month-old infant groups performed above chance level at p < 0.001, one-sample t-test, familiar contour group, t(22) = 7.92; novel contour group, t(20) = 8.30. In contrast, at 11 months of age, infants were more accurate in distinguishing the familiar tone contrast (M = 77.25%, SD = 10.95) compared to the novel tone contrast (M = 70.48%, SD = 8.12), t(44) = 2.38, p = 0.022, d = 0.702. The performance of both 11-month-old infant groups was above chance level at p < 0.001, one-sample t-test, familiar contour group, t(22) = 11.94 and novel contour group, t(22) = 12.10.
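The planned comparisons can be approximated from the reported summary statistics. The sketch below computes a pooled-variance independent-samples t statistic and Cohen's d; it assumes that d was standardized by the pooled standard deviation, which the text does not state, and the function names are illustrative. Because the inputs are rounded, recomputed values agree with the reported ones only to about two decimal places.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)


def independent_t_and_d(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, Cohen's d, degrees of freedom) from summary statistics.

    Assumption: equal-variance t-test and d based on the pooled SD.
    """
    sp = pooled_sd(sd1, n1, sd2, n2)
    diff = mean1 - mean2
    t_value = diff / (sp * math.sqrt(1.0 / n1 + 1.0 / n2))
    cohens_d = diff / sp
    return t_value, cohens_d, n1 + n2 - 2


# 11-month-olds: familiar (n = 23) vs. novel (n = 23) contrast.
t11, d11, df11 = independent_t_and_d(77.25, 10.95, 23, 70.48, 8.12, 23)
print(round(t11, 2), round(d11, 3), df11)
```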

The results of Experiment 2 revealed that the improved accuracy in distinguishing lexical tones between 6 and 12 months of age is evident with a familiar tone contrast that contains pitch contours similar to native lexical tones, but not with a novel tone contrast whose pitch contour patterns do not exist in the native lexical tones. Since pitch contour was the only acoustical cue for infants to distinguish lexical tones, and the performance advantage of the familiar tone contrast was observed only among older infants, the results suggest that Mandarin-learning infants perceptually fine-tune to the pitch contours of the lexical tones in their native language around 10–12 months of age.

### EXPERIMENT 3: DEVELOPMENT OF NON-NATIVE LEXICAL TONE PERCEPTION

Results of Experiments 1 and 2 revealed that Mandarin-learning infants develop better sensitivity in discriminating the Tone 1 vs. 3 contrast around 12 months of age. However, to fully address the issue that listening to a tonal language shapes language-specific perceptions of lexical tones in early infancy, it is essential to examine whether the infants learning a non-tonal language also change their sensitivity for perceiving lexical tones. Perceptual decline in distinguishing lexical tones of a foreign language was repeatedly reported among non-tonal language learners after 9 months of age (Mattock and Burnham, 2006; Liu and Kager, 2014; Cabrera et al., 2015).

In the present experiment, English-learning infants were tested with the Tone 1 vs. 3 contrast for which a developmental trend was clearly shown among Mandarin-learning infants in previous experiments. Therefore, the results of this experiment would be compared with those of the Mandarin-learning infants.

### Method

#### Participants


This experiment included 6–8-month-old (n = 19, mean age = 7.40 months, SD = 0.23 months, boys n = 9) and 10–12-month-old (n = 21, mean age = 10.87 months, SD = 0.17 months, boys n = 9) English-learning infants. Seven additional infants failed to pass the conditioning and were excluded from the final data analysis. Results of a χ<sup>2</sup> test on the rates of infants who could not meet the criterion of the conditioning phase in the tone discrimination procedure indicated that neither the age nor the language effect reached significance, at 6–8 months, χ<sup>2</sup>(1) = 0.168, p = 0.681, and at 10–12 months, χ<sup>2</sup>(1) = 0.138, p = 0.711. The pre-established criteria for inclusion in the study were the same as those employed in the previous experiments. Parents were paid US\$ 10 for participating in this experiment. American infants were recruited through the database of names of the Infant Studies Subject Pool (ISSP) at the University of Washington. This study was carried out in accordance with the recommendations of 'American Psychological Association ethical standards' and 'IRB of University of Washington' with written informed consent from all participants. All parents gave written informed consent in accordance with the Declaration of Helsinki.

#### Stimuli, Equipment, and Phonetic Testing Procedure

As in Experiment 1, the lexical tone stimuli were naturally produced Mandarin tokens of Tone 1 and Tone 3. The testing procedure for the phonetic discrimination was the same as in Experiment 1.

### Results and Discussion

The results of the English- and Mandarin-learning infants on the discrimination of the Mandarin Tone 1 vs. 3 contrast are illustrated in **Figure 5**. Together with the data collected from the Mandarin-learning infants in Experiment 1, the percentage of correct responses of English-learning infants was examined using a 2 (Language background) × 2 (Infant age) ANOVA to examine the development of tone perception. Results showed that the older infants from both language backgrounds were generally more accurate than their younger peers in discriminating the tone contrast, F(1,76) = 56.65, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.427. The language background factor was not significant, F(1,76) = 3.32, p = 0.072. Performance of English-learning infants at both ages was above chance level at p < 0.001, one-sample t-test, 6–8-month-old group, t(18) = 3.82; 10–12-month-old group, t(20) = 10.48. However, a significant Age × Language background interaction, F(1,76) = 8.60, p = 0.004, η<sub>p</sub><sup>2</sup> = 0.102, was observed, which indicated that improved accuracy in distinguishing lexical tones varied by the infants' language backgrounds.

To further examine the developmental trajectories of perceiving lexical tones in infancy, separate one-way ANOVAs were run. The results of Experiment 1 showed that the older Mandarin-learning infants discriminated the Tone 1 vs. 3 contrast more accurately than the younger infants. This perceptual improvement was also observed for the non-native lexical tones: the older English-learning infants (M = 72.38%, SD = 9.78) were more accurate than their younger counterparts (M = 62.59%, SD = 14.37), F(1,38) = 6.45, p = 0.015, η<sub>p</sub><sup>2</sup> = 0.145. This result led to the following question: "Is language-specific tone perception apparent at a younger age, around 6–8 months, or at a later age, around 10–12 months?" At the age of 6–8 months, English-learning infants performed similarly to Mandarin-learning infants at the same age, F(1,37) = 0.47, p = 0.499. In contrast, at 10–12 months, Mandarin-learning infants outperformed English-learning infants in detecting lexical tone differences, F(1,39) = 16.02, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.291. Results of this experiment revealed that language-specific lexical tone perception is not apparent among infants aged between 6 and 8 months, but it is apparent around the age of 10–12 months.
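The partial eta-squared values reported in this section can be cross-checked against the F statistics and their degrees of freedom, since for a single effect partial η² = (F × df_effect) / (F × df_effect + df_error). The sketch below is an illustrative check added for the reader, not part of the original analysis; the function name `partial_eta_squared` is hypothetical, and the input values are the ones reported in the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error, reported value), all taken from the text above
reported = [
    ("Age main effect", 56.65, 1, 76, 0.427),
    ("Age x Language interaction", 8.60, 1, 76, 0.102),
    ("Age effect, English-learning infants", 6.45, 1, 38, 0.145),
    ("Language effect at 10-12 months", 16.02, 1, 39, 0.291),
]

for label, f_value, df1, df2, value in reported:
    estimate = partial_eta_squared(f_value, df1, df2)
    print(f"{label}: computed {estimate:.3f}, reported {value:.3f}")
```

Under this formula, each computed value rounds to the effect size reported for the corresponding test.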

Previous studies reported that infants' performance in discriminating non-native lexical tone contrasts declined between 6 and 9 months of age (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Liu and Kager, 2014). However, the results of the present experiment revealed a different trend: an improved sensitivity in the perception of non-native lexical tones after 10 months of age. The result that English-learning 10–12-month-olds outperformed younger English-learning infants in the discrimination of a lexical tone contrast (i.e., the Mandarin Tone 1 vs. 3 contrast) suggests that listening experience with specific lexical tones may not be the only mechanism by which infants learn lexical tones. Other abilities in speech perception development, such as detecting prosodic patterns of words and phrases in English (Jusczyk et al., 1999; Soderstrom et al., 2003; Seidl, 2007), might also contribute to the development of lexical tone perception.

### GENERAL DISCUSSION

This study explored two issues related to the development of lexical tone perception in three experiments. The first concerned the developmental trends in the perception of native and non-native lexical tones between 6 and 12 months of age, while the second asked whether infants learning a tone language fine-tune to the pitch contour of lexical tones as their tone perception develops. The results of Experiment 1 on Mandarin-learning infants showed diverse trends in the discrimination of native lexical tones between 6 and 12 months of age. The improvement in distinguishing tonal contrasts was observed only for the Tone 1 vs. 3 contrast, whereas older and younger infants performed similarly when they were tested with the Tone 2 vs. 3 and Tone 2 vs. 4 contrasts. Results of Experiment 1 revealed both facilitation and maintenance in discriminating native tonal contrasts, and suggested that the relative complexity of pitch contours among tonal contrasts may influence the learning rates of lexical tones. Experiment 2 utilized speech stimuli with familiar and novel pitch contours of Mandarin lexical tones to explore whether Mandarin-learning infants improved their ability to perceive pitch contours between 6 and 12 months of age, and results showed that the fine-tuning to pitch contours was apparent with the familiar tone contrast, but not with the novel contrast. Results of Experiment 3 showed that older English-learning infants outperformed their younger counterparts in perceiving the Tone 1 vs. 3 contrast of Mandarin, indicating an improvement in the perception of non-native lexical tones. Additionally, 10–12-month-old Mandarin-learning infants were more accurate than their English-learning counterparts in distinguishing Mandarin lexical tones, suggesting that the experience of listening to a tonal language facilitates infants' ability to form detailed representations of lexical tones around 12 months of age.

On the perceptual development of phonetic segments, studies on consonant and vowel perception have reported an improvement in the discrimination of phonetic segments in infants' native languages between 6 and 12 months of age (Polka and Bohn, 1996; Kuhl et al., 2006; Tsao et al., 2006; Narayan et al., 2010; Pons et al., 2012). The current study extended these findings on the perception of native phonetic segments to lexical tones, the suprasegmental units in phonology. Results of this study reveal a trend of native tone perception: tonal-language learners show a language-general ability to discriminate tone contrasts of native and foreign languages at 4–6 months of age (Mattock and Burnham, 2006; Yeung et al., 2013), and infants raised in a tonal language improve their accuracy in distinguishing native tones between 6 and 12 months of age. The improved sensitivity to native tones is shown only for the Tone 1 vs. 3 contrast, whereas the rate of development is relatively slow for the Tone 2 vs. 3 and Tone 2 vs. 4 contrasts. Results of the current study are consistent with previous studies. The current study produced multiple indicators that the rates of developing native tone perception vary with tone contrasts and, therefore, with acoustical salience. English-learning infants also improved in the discrimination of non-native tone contrasts with relatively large acoustical salience. The multiple trends in discriminating native and non-native lexical tones suggest that a hybrid of attunement and perceptual learning theories (Aslin and Pisoni, 1980) would better account for the interaction effects of language experience and acoustical salience on tone perception development. In addition, the results imply that several mechanisms may help infants acquire lexical tones.

First, the enhanced ability to perceive acoustical parameters of spoken words between 6 and 12 months of age might help infants tune to valid acoustical features for processing lexical tones of words. The speech stimuli in Experiment 1 did not manipulate the critical acoustical parameters of lexical tones, but the acoustical salience of these tone contrasts varied, suggesting an effect of acoustical salience on the learning rate of native lexical tones. Spectral cues to lexical tones, such as average pitch height and pitch contour, are major acoustical cues to lexical tones (Liu et al., 2007; Chandrasekaran et al., 2010). The pitch contour was the only acoustical cue to distinguish tones in Experiment 2, the results of which showed that older Mandarin-learning infants performed better in the discrimination of the tone contrast with familiar pitch contours (similar to the Tone 1 vs. 3 contrast in Experiment 1) than in the discrimination of the tone contrast with novel pitch contours, but that this difference between familiar and novel tone contrasts was not apparent at younger ages. Therefore, the results of Experiment 2 showed an increasing sensitivity to the pitch contour of native lexical tones between 6 and 12 months of age, supporting the acoustical account of lexical tone perception development. The results of Experiment 3, showing that the 10–12-month-old English-learning infants performed better than younger infants of the same language background in distinguishing the acoustically salient tone contrast, suggest that the acoustical salience account is also applicable to developmental changes seen with non-native tone perception.

Although pitch height and contour are the major acoustical parameters of lexical tones, results of these experiments imply that older Mandarin-learning infants differentiate tone contrasts with distinct contours (e.g., Tone 1 vs. 3) by attending to the pitch contour difference, but they might additionally attend to the initial segment of lexical tones (e.g., the first half) when discriminating tone contrasts with similar contours (e.g., Tone 2 vs. 3 and Tone 2 vs. 4). However, older Mandarin-learning infants are not more effective than younger infants when attending to the onset rather than the whole segment of the tone contour when discriminating contour tones. The F0 of tone onsets differs among contour tones, but the directions of pitch change in the initial part are very similar. The pitch directions of Tones 2, 3, and 4 in Experiment 1 have similar trends in tone onset (shown in **Figure 1**), and the pitch directions of the novel tone and Tone 1 in Experiment 2 are almost parallel in the tone onset (shown in **Figure 2**). Therefore, older Mandarin-learning infants would not perform better than younger infants in the discrimination of tone contrasts with similar onset contours. The importance of pitch onset in perceiving lexical tones was reported in Cantonese-speaking 5–6-year-old children when they identified lexical tones with similar pitch contours (Tong et al., 2014). Future studies might manipulate pitch directions of tone onset to assess the role of perceiving pitch onset in developing native lexical tones between 6 and 12 months of age.

The acoustical account of tone perception development has been proposed (Singh and Fu, 2016), and several infant studies on tonal perception provide supporting evidence. In addition to the current study, the effect of acoustical salience on lexical tone contrasts was observed among infants raised in Singapore learning native lexical tones between 6 and 9 months of age (Fu et al., 2015). One-year-old Mandarin-learning infants were more accurate at distinguishing acoustically more distinct tone contrasts than acoustically more similar contrasts (Tsao, 2008). The difference in sensitivity to musical pitch between 4- and 12-month-old Dutch-learning infants was congruent with the improved performance in lexical tone perception; thus, older Dutch-learning infants performed better than younger infants when discriminating the Mandarin tone contrast, suggesting that the improved ability to perceive acoustical features of pitch contour is essential for developing lexical tones (Chen et al., 2017). In addition to fundamental frequency, the perceptual weights of spectral and temporal modulation cues of speech signals also vary between tonal and non-tonal language speakers (Xu and Pfingst, 2003; Cabrera et al., 2014). Adult speakers of non-tonal languages rely on amplitude modulation (AM, the relatively slow variation of amplitude over time) information to recognize lexical tones, while Mandarin speakers utilize frequency modulation (FM, the variation of instantaneous frequency) cues to identify and discriminate lexical tones (Xu and Pfingst, 2003; Wang et al., 2011; Cabrera et al., 2014). In line with studies involving adults, French-learning 10-month-old infants preferred AM cues over FM cues in distinguishing lexical tones, but Mandarin-learning infants of the same age utilized FM cues more than AM cues in tone perception (Cabrera et al., 2015). These studies suggest that acoustical features of lexical tones in infants' native language affect the learning rates of developing lexical tones in infancy.

Second, another mechanism for developing lexical tone perception would be associated with infants' ability to process linguistic functions of supra-segmental units, such as word stress and sentence intonation (Singh and Fu, 2016). In tonal languages, lexical tones are essential elements for constructing syllables, and they function like consonants and vowels in distinguishing lexical meanings of syllables. This phonemic function of lexical tones could result in a developmental trajectory of lexical tones in infancy similar to the trends of consonants and vowels, as reduced accuracy in discriminating lexical tones of a foreign language was reported among non-tonal language learners between 6 and 12 months of age (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Liu and Kager, 2014; Cabrera et al., 2015). Results of Experiment 3 showed that, for non-native lexical tones, improved sensitivity was observed when English-learning infants distinguished the Mandarin Tone 1 vs. 3 contrast. Improvement in the perception of non-native phonemes that are not included in the phonetic inventory of infants' native language is rarely documented among infants aged between 6 and 12 months; nonetheless, this trend of improving non-native lexical tone perception is not entirely unexpected. Recent studies have reported that during the second year of life, infants learning non-tonal languages exhibit either better sensitivity than younger peers (Liu and Kager, 2014) or an ability to distinguish the lexical tones of Mandarin at approximately 18 months of age (Hay et al., 2015, Experiment 3; Singh et al., 2014; Zhao and Hay, 2015).

Besides phonemic functions, lexical tones are supra-segmental units of phonetics that are expressed with speech prosody. Prosodic information of stressed syllables facilitates word segmentation for English-learning infants (Jusczyk et al., 1999), and English-learning infants rely more on prosodic information than on phonotactic cues in word segmentation at approximately 9–11 months of age (Mattys et al., 1999; Johnson and Seidl, 2009). Infants learning non-tonal languages detect the prosody of basic emotions very early in life (Mastropieri and Turkewitz, 1999; Singh et al., 2002), and children's abilities to utilize emotional prosody to recognize a speaker's emotions behind the words continue to develop during early childhood (Quam and Swingley, 2012). The increasing ability to utilize prosodic information in the perception of words and emotions in English-learning infants might facilitate their efforts to distinguish prosodic features in a foreign language; it also reveals a developmental trend of non-native tone perception that is different from the trend of perceptual decline for consonant and vowel contrasts of foreign languages.

The intonation of a sentence is one of the prosodic cues used to differentiate statements and questions. The pitch directions of certain lexical tones in Mandarin are similar to those of sentence intonations in English. The rising pitch direction of Tone 2 is similar to the intonation of questions, and the falling pitch direction of Tone 4 is similar to the intonation of statements. Dutch-speaking adults were more attentive to the pitch movement of Tone 2 and Tone 4 when intonations served a post-lexical function, e.g., differentiating statements and questions (Braun and Johnson, 2011). In future studies, exploring whether English-learning infants exhibit performance changes when distinguishing Tone 2 vs. Tone 4 between 6 and 12 months of age would help to test the assumption that improving prosodic perception facilitates the development of the perception of non-native lexical tones.

Would these two developmental mechanisms compete with each other or work together for tone perception development in infancy? The present finding that 10–12-month-old Mandarin-learning infants are more accurate in detecting tonal differences of Mandarin than English-learning infants of the same age suggests that, for infants learning a tone language, improvement in tuning to language-specific lexical tone acoustics combines with the improving ability to perceive speech prosody in developing their perception of lexical tones.

### CONCLUSION

Multiple developmental trajectories in distinguishing native lexical tone contrasts were found in Mandarin-learning infants between 6 and 12 months of age, and improving perceptual sensitivity was apparent in the Tone 1 vs. 3 contrast, the contrast with greater acoustical salience. In addition, a perceptual advantage of Mandarin-learning infants utilizing familiar pitch contours was found among 8–10-month-old infants. For non-native lexical tones, older English-learning infants outperformed their younger counterparts in the discrimination of the Mandarin tone contrast. In addition, 10–12-month-old Mandarin-learning infants distinguished lexical tones more accurately than English-learning infants at the same age. Therefore, this paper suggests that both the fine-tuning to acoustical features of lexical tones and the improving ability to process prosodic features of suprasegmental units contribute to the development of lexical tone perception before infants' first birthdays.

### AUTHOR CONTRIBUTIONS

F-MT conducted data collection and prepared the manuscript.

### FUNDING

This research was funded with grants from the Ministry of Science and Technology, Taiwan (NSC95-2413-H-002-019-MY2; MOST102-2923-H-002-001-MY3).

### ACKNOWLEDGMENT

The author thanks Prof. Patricia Kuhl at the Institute of Learning and Brain Sciences, University of Washington, for her assistance in data collection.

### REFERENCES



**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Tsao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Perception and Representation of Lexical Tones in Native Mandarin-Learning Infants and Toddlers

#### Rushen Shi<sup>1</sup>\*, Jun Gao<sup>2</sup>, André Achim<sup>1</sup> and Aijun Li<sup>2</sup>

<sup>1</sup> Département de Psychologie, Université du Québec à Montréal, Montréal, QC, Canada, <sup>2</sup> Phonetics and Speech Science Lab, Institute of Linguistics, Chinese Academy of Social Sciences, Beijing, China

We investigated the perceptual development of lexical tones in native tone-learning infants during the first 2 years of life, focusing on two important stages of phonological acquisition: the preverbal and vocabulary explosion stages. Experiment 1 examined monolingual Mandarin-Chinese-learning 4- to 13-month-olds' discrimination of similar lexical tones in Mandarin, Tone 2 (T2, rising) vs. Tone 3 (T3, low-dipping). Infants were habituated to exemplars of one tone (either T2 or T3), and tested with new exemplars of the habituated tone vs. the contrasting tone. Results show that looking time increased for the contrasting tone, but not for new exemplars of the habituated tone, suggesting that infants discriminated the two tones as separate categories. Furthermore, infants' discrimination of the tones was comparable across ages. Experiment 2 tested whether tones are distinguished in toddlers' lexicon. Monolingual Mandarin-learning 19- to 26-month-olds were presented with pairs of objects while one was named. Targets were familiar words bearing T2 or T3, either correctly pronounced (CP) or mispronounced (MP) in tone. We found that word recognition was equally successful in CP and in MP trials when T2 was mispronounced as T3 and T3 as T2, indicating that T2 and T3 are confusable. In contrast, recognition failed when T2 and T3 words were mispronounced as Tone 4 (T4, falling), showing that T4 was represented as a distinct category. Results show that toddlers have difficulty encoding similar tones distinctly in known words. The T2-T3 contrast is particularly challenging because of Tone 3 Sandhi, which changes T3 to T2 when it precedes another T3. At the stage when toddlers track the meaning of T2 and T3 words and track the sandhi alternations, they seem to overgeneralize the two tones as variants of one functional category, reflecting perceptual organization at the level of phonemic learning.

Keywords: lexical tones, infant speech processing, lexical representation, phonological neutralization, language acquisition

#### Edited by:

Jessica Hay, University of Tennessee, Knoxville, United States

#### Reviewed by:

Carolyn Quam, Portland State University, United States; Henny Yeung, Simon Fraser University, Canada

> \*Correspondence: Rushen Shi shi.rushen@uqam.ca

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 11 November 2016 Accepted: 16 June 2017 Published: 21 July 2017

#### Citation:

Shi R, Gao J, Achim A and Li A (2017) Perception and Representation of Lexical Tones in Native Mandarin-Learning Infants and Toddlers. Front. Psychol. 8:1117. doi: 10.3389/fpsyg.2017.01117

## INTRODUCTION

Within the first year of life infants make significant advances in acquiring the native-language sound system. They initially perceive both native and non-native consonant and vowel contrasts, and gradually reorganize their perception according to the native language categories (e.g., Werker and Tees, 1984; Kuhl et al., 1992; Polka and Werker, 1994). In particular, during the second half of the first year of life, infants' sensitivity to non-native contrasts declines, while native contrasts continue to be discriminable. This reorganization is largely driven by distributional analysis of the input (e.g., Maye et al., 2002; Anderson et al., 2003).

To establish the full phonological system of the native language, infants would subsequently need to understand the relevance of the phonetic categories for distinguishing word meaning, and to acquire a lexicon as well as the associated phonemic structure. During early word learning shortly after the first year of life, infants confuse similar-sounding segments in certain tasks. For example, in Stager and Werker (1997) 14-month-old infants confused /b/ and /d/ in a word-object association task. The confusion seems to be due to task difficulty and the processing demand of word learning, since infants at this age succeeded in perceiving phonetic details in studies using more sensitive word-learning tasks (Ballem and Plunkett, 2005; Yoshida et al., 2009). Further, similar-sounding segments in familiar words are distinguished from an early age. Several studies have shown that even at the stage when the receptive lexicon is small, recognition is affected if a consonant or a vowel of a familiar word is mispronounced (Swingley and Aslin, 2002; Fennell and Werker, 2003; Mani and Plunkett, 2007). For instance, infants' looking time was affected when ball was mispronounced as doll (Fennell and Werker, 2003); when car was mispronounced as cur, visual fixation to the named object picture decreased (Swingley and Aslin, 2002). Moreover, toddlers showed graded sub-segmental representations for familiar words in a sensitive mispronunciation task (White and Morgan, 2008), similar to adults. Taken together, native segmental categories seem well distinguished in the early lexicon, especially for words that infants know well, although phonetically similar segments may be confusing for infants in certain word-learning tasks.

Lexical tones are phonemic and are found in many languages (e.g., in Asia). Much less research has been conducted on early perceptual development of lexical tones. The present study investigated the perception and representation of native lexical tones in Mandarin-Chinese-learning children at two important stages of learning: the preverbal stage, and the vocabulary explosion stage. Specifically, we inquired (1) whether native tone-language-learning preverbal infants, who know a limited number of words and have not yet acquired a sophisticated phonological system, discriminate lexical tone contrasts, and if they do, (2) whether toddlers subsequently represent the tonal contrasts in familiar words. These questions thus concern the development from early phonetically based tonal discrimination to later representation of tonal contrasts in the lexicon. The latter is essential for acquiring a mature phonological system.

Mandarin-Chinese has four lexical tones: high (T1), rising (T2), low-dipping (T3), and falling (T4). In Chao's 5-level pitch notation (Chao, 1930) the four tones are 55, 35, 214, and 51. The fundamental frequency (F0) is the primary acoustic correlate of lexical tones. The tone-bearing unit is the syllable (Xu and Wang, 2001). Other acoustic cues to tonal contrasts also exist. For instance, as shown in **Figure 1**, T1 and T4 are shorter than T2 and T3, with T3 being the longest in isolation (e.g., Xu, 1997). T3 is often produced with a distinct creaky voice at low pitch. Among all the tonal contrasts in Mandarin, the T2-T3 contrast is widely considered to be the most similar in pitch pattern. Nevertheless, the contrast is supported by multiple acoustic cues. Even non-tone-speaking teenagers can discriminate this contrast based purely on acoustic processing (Pierce et al., 2014).

The tones in Mandarin differ in their phonological structure, with T3 being the most complicated. T3 is subject to sandhi (the Tone 3 Sandhi rule), according to which T3 is realized as a T2-like rising tone (35 in Chao's notation, i.e., it is neutralized to T2) when T3 immediately precedes another T3, and T3 is a low tone (11 in Chao's notation) before any other tone. Utterance-final and citation T3 (see **Figure 1**) has the most complex contour (214 in Chao's notation). In other words, the rising, the low, and the complex contour are the three variants of T3. Tone 3 Sandhi is a rule that applies generally across lexical items that bear T3 as the underlying tonal representation. Sandhi alternations also occur with other tones, although they only apply to a few specific lexical items. For example, the negation particle bu4 and the numeral yi1 ("one"), both highly frequent, go through sandhi alternations depending on context: they are realized as T2 when preceding T4, and as T4 when preceding all other tones. These item-specific alternations need to be learned as exceptions to the general non-alternating pattern of T4 and T1 words, unlike the learning of the Tone 3 Sandhi rule. T2, T4, and the utterance-final and citation variants of T3 are contour tones, whereas T1 and the low variant of T3 are level tones. Across tone languages, contour tones are considered more complex than level tones (Yip, 2002). For instance, a rising contour tone can be described in terms of combined tone height features (e.g., LH for T2 in Mandarin, with L and H representing the Low and High features) whereas a level tone can be represented with a single tone height feature (e.g., H for T1 in Mandarin). The feature representations for T3 are complex, with the utterance-final and citation variant as LLH (or L plus a post-lexical floating H, depending on theories), and non-final variants as L and LH. The T2-T3 contrast is hence the most phonologically complex one in Mandarin.
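The Tone 3 Sandhi description above can be sketched as a small rule-application function. This is a deliberately simplified illustration, not a model used in the study: it assumes tones are given as integers 1 to 4, applies the rule to adjacent syllables on the basis of their underlying tones, ignores prosodic and morphosyntactic domains, and leaves aside the item-specific alternations of bu4 and yi1. The names `CITATION_CHAO` and `surface_tones` are hypothetical.

```python
# Chao 5-level pitch values for the four Mandarin citation tones (from the text).
CITATION_CHAO = {1: "55", 2: "35", 3: "214", 4: "51"}


def surface_tones(underlying):
    """Map underlying tones (1-4) to Chao values under a simplified Tone 3 Sandhi:
    T3 -> 35 before another T3, 11 before any other tone, 214 in final position."""
    surface = []
    for i, tone in enumerate(underlying):
        if tone not in CITATION_CHAO:
            raise ValueError(f"unknown tone: {tone}")
        is_last = i == len(underlying) - 1
        if tone == 3 and not is_last:
            # Non-final T3: rising before an underlying T3, low otherwise.
            surface.append("35" if underlying[i + 1] == 3 else "11")
        else:
            # All other tones, and final or citation T3, keep their citation value.
            surface.append(CITATION_CHAO[tone])
    return surface


print(surface_tones([3, 3]))  # T3 + T3 -> ['35', '214']
print(surface_tones([3, 1]))  # T3 + T1 -> ['11', '55']
print(surface_tones([2, 3]))  # T2 + T3 -> ['35', '214']
```

In this sketch the T3 + T3 and T2 + T3 sequences receive identical surface values, which is the neutralization of T3 to T2 referred to in the text.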

In the three sections below we first review previous research on preverbal tone-learning infants' discrimination of native and non-native lexical tones. Next, we discuss studies on infants' and toddlers' tonal processing in word segmentation, word learning and word comprehension tasks. Finally, we present the hypotheses of our present study on Mandarin learners' discrimination of two native tones at the preverbal stage and their perception of the two tones in familiar words at the vocabulary explosion stage.

### Tone-Learning Infants' Discrimination of Lexical Tones during the First Year of Life

Compared to consonants and vowels, much less is known about infants' discrimination of tones during the initial stages of learning, and only a few studies have tested preverbal tone-learning infants' perception of native tones (Harrison, 2000; Tsao, 2008; Yeung et al., 2013). Harrison (2000) was the first to test the discrimination of lexical tones in preverbal babies. Using the Conditioned Headturn Procedure, he showed that 6- to 8-month-old Yoruba-learning infants discriminated synthetic tones similar to the high tone vs. the mid tone in Yoruba, and their performance was consistent with that of adult native listeners.

Yeung et al. (2013) familiarized 4- and 9-month-old infants with one Cantonese tone (either high-rising or mid-level, i.e., 25 and 33 in Chao's notation), and the two tones were presented in three types of test trials: the familiarized tone, the contrasting tone, and both (alternating). Cantonese-learning infants did not show any differential looking times to the three types of test trials after being familiarized to the 33-tone. After they were familiarized with the 25-tone, they showed different looking times for alternating trials vs. 25-tone trials, although looking in 33-tone trials did not differ from looking in either alternating trials or 25-tone trials. Infants thus showed partial evidence supporting the discrimination of the tonal contrast. These results are difficult to interpret, as the patterns were not consistent across conditions and trials. The authors' predicted preference for alternating over non-alternating trials was not systematically observed. In this kind of task, discrimination is interpreted indirectly from preference. Infants may discriminate the contrast and prefer the more dynamic alternating trials; or they may discriminate the contrast but prefer the more familiar non-alternating trials. A systematic group preference for one type of test trials (for example, alternating over non-alternating) would be clear support for successful discrimination. However, the lack of a systematic preference, as is the case in one of the familiarization groups in Yeung et al. (2013), does not necessarily mean a lack of discrimination.

In addition to Cantonese-learning infants, Yeung et al. (2013) also tested Mandarin-learning 4- and 9-month-olds' perception of those two Cantonese tones (25-tone and 33-tone), which are similar to Mandarin T2 (rising) and T3 (low-dipping). After being familiarized to the 33-tone, Mandarin-learning babies showed no looking difference in the test phase, similar to the response of Cantonese-learning babies. After the 25-tone familiarization, looking was longer in 25-tone trials than in 33-tone trials, and longer in alternating trials than in 33-tone trials, but no looking difference was observed between 25-tone and alternating trials. Their preferential pattern differed from that of Cantonese-learning infants. Similar to the Cantonese babies, Mandarin-learning babies showed evidence of discrimination only in one of the familiarization conditions, with a complex pattern of preference. As discussed earlier, the non-predictability of these results was likely due to the nature of the task, which tested preference, but not necessarily discrimination. We suggest that the habituation task might be better suited to directly reveal discrimination. In such tasks infants are habituated to one member of a contrast, and then tested with the same habituated member and the contrasting member. Because habituation reflects a decrease in interest over time, a looking recovery to the new member, but not to the habituated member, is predicted when infants can discriminate the contrast. Conversely, if they cannot discriminate the contrast, they should show no looking increase upon hearing the new member relative to the old member during the test phase. In the present study we tested Mandarin-learning babies' discrimination of native tones using a habituation task.

Like the Cantonese contrast in Yeung et al. (2013), the Thai rising vs. low contrast is also similar to the T2-T3 contrast in Mandarin. Using the Conditioned Headturn Procedure, Mattock and Burnham (2006) showed that 6- and 9-month-old Chinese infants discriminated this Thai contrast, indicating that they might have assimilated the Thai contrast to their native contrasts (25-tone vs. 33-tone in Cantonese, or T2 vs. T3 in Mandarin).

Only one previous study has tested infants' discrimination of native tones in Mandarin. Using the Conditioned Headturn Procedure, Tsao (2008) tested Taiwan-Mandarin-learning 10–12-month-olds' discrimination of the T1-T3, T2-T3, and T2-T4 contrasts. Infants discriminated T1-T3 (73% correct) better than T2-T3 (61%) and T2-T4 (58%), and performance on T2-T3 and T2-T4 was comparable. The superior T1-T3 discrimination was expected. Even non-Mandarin adults find these two tones perceptually distinct (So and Best, 2010). Their F0 height and trajectories are non-overlapping. T2 and T3 are generally considered more similar, with the F0 onset being relatively low for both. In citation, T2 and T3 both move up in F0 toward the offset. T2 and T4 are acoustically more dissimilar than T2-T3, as they involve opposite F0 trajectories (see **Figure 1**). However, the T2-T4 and T2-T3 contrasts were discriminated equivalently in Tsao's study, and both were less discriminable than T1-T3. In that task each infant was first taught to respond to a tonal change in a contrast, and the stimuli used for the teaching then served as the stimuli for testing that infant. Only infants who passed the training criterion were included in the test phase. The experiment was designed for testing the relative discriminability of the three contrasts after the training. It would be interesting to test whether the tonal contrasts can be discriminated spontaneously, i.e., entirely based on infants' prior experience with the native language. In the present study we directly tested whether Mandarin-learning preverbal infants can discriminate native tones without any training.

### Lexical Tones in Toddlers' Developing Lexicon

Around the age of 1 year, children start building a lexicon and develop a sophisticated phonological system associated with the lexicon. In addition to encoding consonant and vowel contrasts, tone-language children need to encode the lexical tone of words. A recent study suggests that infants close to 1 year of age distinguish native lexical tones when recognizing words. Specifically, in an auditory speech segmentation/recognition task Singh and Foong (2012) first familiarized English-Mandarin bilinguals with isolated word forms, and then tested the infants with passages containing the target forms. They found that at 11 months of age infants recognized the Mandarin target forms in passages only when the forms matched the familiarized forms in tone, but not when the tone was mismatched, similar to the results of monolingual Mandarin-learning infants in Shi (2009); however, when familiarized and tested with English stimuli, the bilingual infants at this age ignored lexical-tone-like pitch changes in target words and recognized the words regardless of whether their pitch matched or mismatched with the familiarized form in tone. In a subsequent study 18- and 24-month-old Mandarin-English bilinguals encoded T2 (rising) and T4 (falling) distinctly when learning to map novel objects to novel words (Singh et al., 2014). This tonal contrast was also distinguished during word learning by monolingual English-learning infants at 14 months of age in Hay et al. (2015) and at 18 months in Singh et al. (2014), but not at 17–19 months in Hay et al. (2015). Nevertheless, although the 17-to-19-month-old English learners in Hay et al. (2015) failed to encode the T2-T4 distinction during word learning, they were still able to discriminate the contrast in an auditory habituation task, suggesting that sensitivity to tonal contrasts remains more acute for acoustic-phonetic based discrimination than for phonemic based lexical encoding. 
In a similar auditory habituation study (Shi et al., 2017) the discrimination of T1 (high) and T4 (falling) showed no decline in French-learning infants from 4 to 11 months of age.

How are lexical tones represented in toddlers' familiar words? A few studies have addressed this question, primarily with children older than 2.5 years of age, who have acquired a reasonable-sized lexicon. In Singh et al. (2015) Mandarin-English bilinguals aged 2.5–3.5 years distinguished the Mandarin T1-T2, T1-T4, and T2-T4 contrasts during familiar word comprehension. They looked less at the named object when its tone was mispronounced than when it was correctly pronounced. The same effect was shown for T1-T4 in Mandarin-speaking preschoolers; however, these children failed to detect the mispronunciations between T2 and T3 (Singh et al., 2017).

Wong et al. (2005) examined tonal recognition in monolingual Mandarin-speaking children, using a picture-pointing task. Three-year-olds were presented with familiar words, including tonal minimal pairs. Recognition accuracy was high for T1, T2, and T4 targets (nearly 90%), and lower for T3 targets (69%). The errors were mostly mis-perception of T3 as T2. Interestingly, the confusion was unidirectional; T2 was rarely mis-perceived. This asymmetry seems to be related to Tone 3 Sandhi, which neutralizes T3 to T2. The T2-T3 asymmetry was also observed in adult Mandarin listeners in a recent ERP study (Li and Chen, 2015), in which mismatch negativity effects were greater and earlier when the stimulus presentation changed from T2 to T3 than when the change was from T3 to T2. That is, the presentation of T3 in the latter case automatically activated T2 as a variant of T3, causing a weak response when T2 was subsequently heard. The authors noted that this weak response was comparable to within-category tone processing.

The T3 targets in Wong et al. (2005) were utterance-final, where the tone sandhi should not happen. Children's confusion of T3 as T2 thus suggests a partial understanding of Tone 3 Sandhi, i.e., an over-neutralization of T3 to T2 without understanding the appropriate context. Phonological neutralization often occurs between similar segments. For example, the word-medial /t/ and /d/ in latter and ladder in American English are neutralized as a flap. Syllable-final obstruents become devoiced in German (e.g., /d/ neutralized to /t/). Similar segments such as /t/ and /d/ share many phonetic features and acoustic properties. In general, dissimilar segments (e.g., /b/-/h/) are less likely to be subject to neutralization. Tone 3 Sandhi is likely related to the fact that T3 and T2 are acoustically similar. The differentiation of the two tones at the lexical level might therefore be challenging for children due to Tone 3 Sandhi.

### The Present Study

Considering the scarcity of data on the acquisition of native lexical tones during the initial 2 years of life, the present study examined Mandarin-learning infants' and young toddlers' perception of T2 and T3 in Mandarin. These two tones are interesting because they are acoustically similar and may be affected by the Tone 3 Sandhi rule. We thus focused on two stages of learning. In Experiment 1 we tested whether preverbal babies, who are either prior to or at the beginning of building a lexicon, can discriminate T2 and T3 in a habituation/dishabituation task. At this stage, tone learning should be largely based on the distributional properties of the acoustic patterns of tonal categories in the native language, or on other mechanisms independent of an infant knowing a lexicon (e.g., Yeung and Werker, 2009; Feldman et al., 2013). We hypothesized that at this stage infants' organization of the tones should be simpler, and they should be able to perceive tones based on pure auditory-phonetic processing. Following this stage, children face a harder task: they must build a sophisticated phonemic system, which requires them to encode tonal (in addition to segmental) distinctions across words in their lexicon. Do toddlers represent the phonetically similar and neutralization-prone T2 and T3 distinctly for known words? The status of lexical tones for words familiar to young toddlers below age two has not been studied previously in online comprehension tasks. Thus, in Experiment 2 we used this task to test whether toddlers, who begin to have a reasonable-sized lexicon, distinguish the phonetically similar, neutralization-prone T2-T3 contrast as well as the dissimilar, non-neutralizable T2-T4 and T3-T4 contrasts for familiar words. We note that the T2-T4 and T2-T3 contrasts are equally discriminated by Mandarin-learning infants (Tsao, 2008) at 10–12 months of age. 
In the present study we hypothesized that the additional factor of lexical neutralization due to Tone 3 Sandhi might lead to the confusion of T2 and T3 for familiar words in young toddlers.

### EXPERIMENT 1

### Methods

### Participants

Participants were 20 Mandarin-learning 4- to 13-month-olds residing in Beijing (mean: 08;29 days; range: 4;22–13;20; girls: 13). The infants were monolingual Beijing-Mandarin (i.e., standard Mandarin) learners. Seven other infants were tested but were excluded from the analysis due to fussiness (4) and experimenter errors (3). Our interest here was to inquire generally whether Mandarin-learning infants at the preverbal stage can discriminate T2 and T3. We therefore treated our infants as one single group. We decided to set the youngest age at 4 months, since in previous research tone-learning infants from 4 months of age showed evidence of discriminating lexical tones (Yeung et al., 2013). Moreover, tonal discrimination in previous studies did not change across age during the first year of life for tone-learners (Mattock and Burnham, 2006; Yeung et al., 2013).

#### Stimuli

We chose the syllable can /ts<sup>h</sup> an/ in T2 (can2 "disabled") and T3 (can3 "tragic") because the words are unknown to preverbal and early verbal infants (and absent in the Mandarin early vocabulary corpus of Hao et al., 2008). The decision to use unknown words was important, as our goal in this experiment was to assess infants' early discrimination ability without any possible influence of familiar words. A female native speaker of Mandarin recorded many repetitions of the words in a lively voice. Overall, can3 tokens were longer than can2 tokens. We carefully selected a subset of can2 and can3 tokens which overlapped in duration. The final stimuli were 13 tokens of can2 and 13 tokens of can3. T2 tokens were on average 718 ms (range: 631-806; SD: 63), and T3 tokens 717 ms (range: 630–802; SD: 63). Moreover, the tokens were adjusted to have comparable amplitude. Thus, T2 and T3 here were more similar acoustically than usual. These controls enabled us to better assess the contribution of F0 to infants' discrimination of T2 and T3. Our initial plan was to conduct a further experiment including additional acoustic cues if infants could not discriminate the tones in Experiment 1. **Table 1** shows the F0 measures of the stimuli. The values of the measures indicate that for T2, F0 increased greatly and consistently from the onset region to the offset of the contour, whereas the F0 contour remained relatively low for T3, with a center dip. The pattern is similar to the examples in **Figure 1**. The maximum F0 occurred at the tonal offset for both tones. The time point of the minimum F0 (i.e., inflection point) along the contours differed with respect to tones. Specifically, the minimum F0 for T2 occurred around the tonal onset (on average 7.85% from the beginning of the tone), followed by a continuous increase. On the other hand, the F0 of T3 decreased from the beginning to a minimum value toward the middle part of the tone (on average 43.12% from the onset). 
The dip in F0 was accompanied mostly by a distinct creaky voice. In sum, T2 and T3 tokens differed highly significantly in nearly all of the F0 measures, as shown in **Table 1**.

The visual stimulus for all trials was a colorful checkerboard-like image. The attention-getter was a jumping star accompanied by the sound of birdsong.

#### Procedure

Infants were tested individually in a sound-attenuated chamber. The child sat on the parent's lap, facing a central monitor that displayed the visual stimuli. Loudspeakers adjacent to both sides of the monitor simultaneously played auditory stimuli. A computer in the neighboring room controlled the presentation of the audio-visual stimuli and recorded the child's looking times. A researcher blind to the stimuli and design observed the infant and started each trial when the child looked at the monitor. Parents heard masking music through noise-cancelation headphones.

TABLE 1 | Acoustic measures (means and standard deviations) of the T2 (rising) and T3 (low-dipping) stimuli in Experiment 1.

#### Design

Each habituation and test trial was started when the infant looked at the front central monitor, and terminated when she looked away for at least 2 s or when the maximum trial length (about 21 s) elapsed. Between trials, the attention-getter was automatically presented to attract the infant back to the monitor. Each infant was habituated to seven tokens of one tone, either can2 or can3. The seven tokens of one tone were presented randomly without replacement, and the set was repeated (with tokens always in a random order) until the infant became habituated. The six other tokens of each tone were reserved for test trials. The inter-stimulus interval (ISI) within each trial was 1,000 ms. When the total looking time of three consecutive habituation trials declined to 50% of the first three habituation trials, the habituation criterion was reached, and the test phase began. All infants heard the same test stimuli, in two types: Same (new tokens of the habituated tone) and Different (the non-habituated tone). The order of the trial types was counterbalanced across infants. The use of new exemplars for the Same tone in the test phase was important for our design: if infants increased their looking time upon hearing the exemplars of Different tone (relative to their looking during the last habituation trial), but not upon hearing new exemplars of the Same tone (relative to their looking during the last habituation trial), the response would indicate category discrimination. On the other hand, if infants increased looking equally in both the Same and Different test trials (relative to the last habituation trial), this response would simply indicate the detection of any new tokens rather than the discrimination of tonal categories.
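The habituation criterion described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software used in the study: the function name and its parameters are hypothetical, and it assumes that the comparison window slides one trial at a time and never overlaps the first three (baseline) trials, a detail the text does not specify.

```python
from typing import Optional, Sequence


def habituation_trial(
    looking_times: Sequence[float],
    window: int = 3,      # number of consecutive trials summed (from the text)
    ratio: float = 0.5,   # criterion: decline to 50% of the baseline (from the text)
) -> Optional[int]:
    """Return the 0-based index of the trial on which the criterion is first met.

    Assumption (not stated in the source): the comparison window slides by one
    trial and starts only after the baseline window has ended.
    """
    if len(looking_times) < 2 * window:
        return None  # not enough trials to compare against the baseline
    baseline = sum(looking_times[:window])
    for end in range(2 * window, len(looking_times) + 1):
        recent = sum(looking_times[end - window:end])
        if recent <= ratio * baseline:
            return end - 1
    return None  # criterion never reached


if __name__ == "__main__":
    # Hypothetical looking times in seconds, one per habituation trial.
    times = [20.0, 18.0, 16.0, 15.0, 12.0, 10.0, 9.0, 7.0]
    print(habituation_trial(times))
```

With these hypothetical looking times the baseline sum is 54 s, so the criterion is met once three consecutive trials sum to 27 s or less.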

### Results and Discussion

We calculated the looking times (in seconds) of the test trials and the last habituation trial. Because the data of two of these three measures were significantly skewed (beyond two standard errors) across babies, transformation was needed before the analysis of variance (Csibra et al., 2016). To bring the skewness below one standard error within each trial type, we log-transformed (base 10) the data after subtracting a constant (1.3) from each looking time. This transformation corrected the skewness and made the data acceptably symmetrical for all three measures. A 2 × 3 ANOVA was then conducted, with the Habituation Tone (T2 vs. T3) as the between-subject factor and Trial Comparison (Last Hab, Same, Different) as the within-subject factor. The results showed a significant effect of Trial Comparison, F(2, 36) = 3.952, p = 0.028, but no effect of Habituation Tone, F(1, 18) = 0.014, p = 0.907, and no interaction of these factors, F(2, 36) = 0.851, p = 0.435. That is, infants who were habituated to either T2 or T3 responded in the same fashion.
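The transformation applied to the looking times can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the constant (1.3) and the base (10) come from the text, and the standard error of skewness uses a common textbook formula that may not be the exact one the authors applied.

```python
import math


def transform_looking_time(seconds: float, constant: float = 1.3) -> float:
    """Log-transform (base 10) a looking time after subtracting a constant."""
    shifted = seconds - constant
    if shifted <= 0:
        raise ValueError("looking time must exceed the subtracted constant")
    return math.log10(shifted)


def skewness_standard_error(n: int) -> float:
    """Common textbook standard error of sample skewness for n observations.

    Assumption: the source does not say which formula was used.
    """
    if n < 3:
        raise ValueError("need at least three observations")
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


if __name__ == "__main__":
    print(transform_looking_time(11.3))   # a hypothetical 11.3 s look
    print(skewness_standard_error(20))    # 20 infants, as in Experiment 1
```

Under this formula, skewness beyond roughly ±1.02 would count as more than two standard errors for a sample of 20 infants.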

Given the significant effect of Trial Comparison in the above analysis, the trial types were then analyzed in paired t-tests. The results revealed longer looking for Different (mean = 0.567, SD = 0.374, SE = 0.084) than Last Hab (mean = 0.324, SD = 0.311, SE = 0.069) [t(19) = 2.465, p = 0.023], and for Different than Same (mean = 0.293, SD = 0.458, SE = 0.102) [t(19) = 2.676, p = 0.015], but no difference for Same and Last Hab [t(19) = −0.359, p = 0.724], all 2-tailed and uncorrected, as shown in **Figure 2**. Moreover, none of these pairwise differences correlated with age in days (|r| ≤ 0.262, p ≥ 0.265), suggesting that infants across ages (4–13 months) responded similarly.

Since both the Different and Same trials presented new exemplars after habituation, the results support category discrimination. Infants only increased their looking time upon hearing a new tonal category, but not upon hearing new exemplars of the habituated tone.

The results of Experiment 1 show that the phonetically similar T2-T3 contrast is discriminable at the pure phonetic level by preverbal Mandarin-learning babies. Our next question was whether this similar contrast is subsequently distinguished in words at a stage when children have established a sizable lexicon.

### EXPERIMENT 2

In Experiment 2 we chose to test toddlers aged 19–26 months, an age range characterized by vocabulary explosion. Toddlers of this age should have a reasonable-sized lexicon and are engaged in the learning of more advanced phonological knowledge. We presented toddlers with T2 and T3 familiar words. Besides correct pronunciations (CP), two types of tonal mispronunciations (MP) were presented: acoustically similar MPs (T2 mispronounced as T3, i.e., T2-to-T3; T3 mispronounced as T2, i.e., T3-to-T2) and dissimilar MPs (T2 mispronounced as T4, i.e., T2-to-T4; T3 mispronounced as T4, i.e., T3-to-T4). The similar MPs were relevant for neutralization (related to Tone 3 Sandhi) whereas the dissimilar MPs were not. We tested whether the two types of MPs were equally perceivable during word comprehension.

We used both the similar (T2 vs. T3) and dissimilar contrasts (T3 vs. T4, T2 vs. T4) to reveal how T2 and T3 are represented in the developing lexicon. We needed to include the dissimilar contrasts because they would likely show a mispronunciation effect, thus allowing us to confirm that a possible lack of a mispronunciation effect for the similar T2-to-T3 and T3-to-T2 changes would not be because of any peculiarity of the task. We hypothesized that although the T2-T3 contrast was easily discriminable during early infancy at the acoustic-phonetic level, toddlers might not represent this contrast distinctly for words due to the complexity of the tonal system at the lexical level and the sandhi rule related to the two tones. Furthermore, we hypothesized that T2 and T3 should be represented distinctly from T4, since T4 is acoustically dissimilar from either tone and there is no sandhi rule affecting the T2-T4 and T3-T4 contrasts.

## Methods

### Participants

Participants were 64 monolingual Beijing-Mandarin-learning 19- to 26-month-olds residing in Beijing (mean: 21;29; range: 19;01–26;26; girls: 26). The data of 31 other toddlers were excluded due to fussiness (16), no interest in the task (9), parental interference (6), and researcher error (1). Children at this age should have acquired a sizable vocabulary according to the report of Hao et al. (2008). In their corpus the mean expressive vocabulary size of Beijing-Mandarin-learning children was 168 words (SD = 114) at 19 months of age, and 376 (SD = 189) at 26 months of age.

### Stimuli

Stimuli included monosyllabic T2 and T3 words (yang2 "sheep," wan3 "bowl") for the key trials. These key words are familiar to toddlers, as they appear in the majority of Mandarin-learning toddlers' production by 19 months of age in the early vocabulary corpus of Hao et al. (2008). Hao et al. (2008) did not collect data on toddlers' receptive vocabulary. Nevertheless, they reported both receptive and productive vocabularies for younger infants, with the former greatly exceeding the latter. For example, they reported that 16-month-old infants' mean productive vocabulary was 17 words, whereas their mean receptive vocabulary was 116 words. We can therefore infer that most toddlers in the Hao et al. database must be able to comprehend our key words by 19 months of age.

We also created two types of mispronunciations for these key words: 1) similar: T2 was mispronounced as T3 (i.e., the word yang2 ("sheep") was mispronounced as yang3: MP-yang3), and T3 as T2 (i.e., the word wan3 ("bowl") was mispronounced as wan2: MP-wan2); 2) dissimilar: T2 as T4 (i.e., yang2 mispronounced as yang4: MP-yang4) and T3 as T4 (i.e., wan3 mispronounced as wan4: MP-wan4). We note that the MP forms are existing words in Mandarin, but they are mostly unfamiliar to young children. In particular, the words yang3 ("oxygen"), wan2 ("pill"), yang4 ("appearance") and wan4 ("wrist/ankle") are uncommon object labels for toddlers and are all absent in the early vocabulary corpus of Hao et al. (2008).

In addition, we included 16 other familiar words as fillers<sup>1</sup> (e.g., hua1 "flower," etc.) to make the task interesting to toddlers [see details in Appendix A (Supplementary Material)].

The same female Mandarin-Chinese speaker as in Experiment 1 recorded the speech stimuli in a sound-attenuated chamber. The final stimuli included two tokens for each target word and one token of the instruction utterances kan! ("look!") and zai nar? ("Where is it?").

For the key words, the CP mean duration was 614.25 ms (SD = 177.25), and the MP 518.75 ms (SD = 127.13). Tone 2 tokens were on average 459 ms (SD = 6.98) in length, Tone 3 tokens 743.75 ms (SD = 29.98), and T4 tokens 447.75 ms (SD = 39.61). Appendix B (Supplementary Material) shows the F0 trajectories of the first token of T2 and T3 words.

Visual stimuli were colorful pictures of objects for key words, filler words, and distractors. A picture of a laughing baby accompanied by the sound of a baby's laughter served as the attention-getter between trials.

### Procedure and Design

The equipment and room setup were the same as in Experiment 1. Toddlers were tested individually in the same sound-attenuated chamber as in Experiment 1. We used a within-subject design. Each test trial presented the images of two objects simultaneously on the far left and far right side of a 42-inch monitor; during a trial one object was named (i.e., the target), and the other unnamed (i.e., the distractor). The key trials presented the key words as the target in four CP trials (two CP-yang2 trials; two CP-wan3 trials), two similar MP trials (MP\_23: one MP-yang3 trial, one MP-wan2 trial), and two dissimilar MP trials (MP\_4: one MP-yang4 trial, one MP-wan4 trial), for a total of eight trials. MP\_23 referred to trials in which T3 was mispronounced as T2, and trials in which T2 was mispronounced as T3. These trials tested whether the similar contrast of T2 vs. T3 was confusable to children in both directions. MP\_4 referred to trials that presented T2-to-T4 or T3-to-T4 mispronunciations, which tested whether T2 and T3 were perceived as distinct from T4.

Images of two unfamiliar objects for which children have no words, a roller (painting tool) and a badger, were distractors in key trials in which they were paired, respectively, with the target images (bowl and sheep) (see **Figure 3**). To control for animacy, the roller was paired with the bowl, and the badger with the sheep. We used the unfamiliar distractors to make the measure more sensitive, as this would more likely lead children to decrease looking to the target upon hearing similar-sounding mispronunciations (White and Morgan, 2008).

The remaining were filler trials, in which the targets were always correctly pronounced. Trial order was quasi-randomized with the constraints that adjacent trials did not contain the same objects, and that no more than three consecutive trials presented targets with the same tone or on the same side. Key trials were always separated by filler trials. Four quasi-randomized orders were created. Toddlers were assigned randomly to four groups, and each group was tested with one of the four orders [see Appendix B (Supplementary Material)].

All trials were constructed with the same timeline. Images of two objects appeared for 2.1 s in silence, followed by the utterance kan! ("look!," 458 ms) and then a 442 ms silence. The target word began exactly after 3 s from the trial onset, and zai nar? ("Where is it?") began 1 s later, followed by the second presentation of the target word starting at the end of 5 s. The object pictures stayed for the whole trial of 6.5 s.
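The trial timeline above is internally consistent, as the following short sketch shows by adding up the stated durations. All constant and function names are hypothetical; the durations are taken from the text.

```python
# Durations in milliseconds, as stated in the trial description.
SILENT_PREVIEW_MS = 2100     # images shown in silence
KAN_MS = 458                 # the utterance kan! ("look!")
PAUSE_AFTER_KAN_MS = 442     # silence after kan!
ZAI_NAR_DELAY_MS = 1000      # zai nar? begins 1 s after the first target onset
SECOND_TARGET_ONSET_MS = 5000
TRIAL_END_MS = 6500


def trial_timeline() -> dict:
    """Return the onset (ms from trial start) of each event in a trial."""
    first_target = SILENT_PREVIEW_MS + KAN_MS + PAUSE_AFTER_KAN_MS
    return {
        "images": 0,
        "kan": SILENT_PREVIEW_MS,
        "first_target": first_target,
        "zai_nar": first_target + ZAI_NAR_DELAY_MS,
        "second_target": SECOND_TARGET_ONSET_MS,
        "trial_end": TRIAL_END_MS,
    }


if __name__ == "__main__":
    for event, onset_ms in trial_timeline().items():
        print(f"{event:>14}: {onset_ms} ms")
```

The silent preview, the utterance, and the pause sum to 3,000 ms, which matches the stated onset of the first target word.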

### Results

Videos of participants were coded offline by another researcher blind to the stimuli and design of the experiment using an in-house computer program. The coding was done at 25 frames/sec. For each frame, the looking was coded as left, right or elsewhere. We analyzed the 360–2,000 ms window from the onset of the first presentation of the target word, as in previous studies (e.g., Swingley and Aslin, 2002). The starting point of 360 ms was to account for the time needed for the child to initiate an eye movement. Within this window, the proportion of looking to target (PLT) was calculated by dividing the total looking time to the target by the sum of the looking times to the target and to the distractor.
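The PLT measure can be sketched from frame-by-frame codes as follows. This is an illustrative sketch rather than the in-house program: the function name, the string labels, and the rule that a frame belongs to the window when its start time falls inside it are all assumptions. The frame rate (25 frames/sec, i.e., 40 ms per frame) and the 360–2,000 ms window come from the text.

```python
from typing import Optional, Sequence

FRAME_MS = 40  # 25 frames per second


def proportion_looking_to_target(
    codes: Sequence[str],        # one code per frame: "left", "right" or "elsewhere"
    target_side: str,            # side on which the named object appeared
    target_onset_ms: int,        # onset of the first target word within the trial
    window_start_ms: int = 360,
    window_end_ms: int = 2000,
) -> Optional[float]:
    """Target looking divided by target plus distractor looking in the window.

    Returns None when the child looked at neither picture during the window.
    """
    if target_side not in ("left", "right"):
        raise ValueError("target_side must be 'left' or 'right'")
    distractor_side = "right" if target_side == "left" else "left"
    lo = target_onset_ms + window_start_ms
    hi = target_onset_ms + window_end_ms
    target_frames = 0
    distractor_frames = 0
    for index, code in enumerate(codes):
        frame_start = index * FRAME_MS
        if lo <= frame_start < hi:  # assumption: frame assigned by its start time
            if code == target_side:
                target_frames += 1
            elif code == distractor_side:
                distractor_frames += 1
    total = target_frames + distractor_frames
    return target_frames / total if total else None


if __name__ == "__main__":
    # Hypothetical trial: 163 frames, target word onset at 3,000 ms.
    frames = ["elsewhere"] * 163
    frames[84:104] = ["left"] * 20     # 20 frames on the target
    frames[104:114] = ["right"] * 10   # 10 frames on the distractor
    print(proportion_looking_to_target(frames, "left", 3000))
```

Frames coded as elsewhere are excluded from the denominator, in line with the definition given in the text.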

A one-way repeated-measures ANOVA was conducted, with Pronunciation (CP vs. MP\_23 vs. MP\_4) as the within-subject factor. The results revealed a significant effect of Pronunciation [CP: mean = 0.61, SD = 0.17, SE = 0.02; MP\_23: mean = 0.64, SD = 0.28, SE = 0.04; MP\_4: mean = 0.48, SD = 0.27, SE = 0.03; F(1.819, 114.581) = 7.773, p = 0.001, Greenhouse-Geisser corrected].

<sup>1</sup>To verify if these fillers were indeed familiar words to our toddlers, we analyzed their comprehension of the named targets in filler trials within the same time window that was used for analyzing the key trials, i.e., 360–2,000 ms from the onset of the target word. For each filler word, the proportion of looking to the target was compared with the 0.5 chance level. We found that the toddlers indeed knew the filler words. Looking to targets was significantly above chance for all the filler words (p levels ranged from .000 to .022, two-tailed) except the one in the first trial of the whole experiment, which was expected since in the very first trial children were just getting acquainted with the equipment and the task.

Subsequent pairwise comparisons were conducted using two-tailed t-tests. PLTs in CP and MP\_23 did not differ from each other (p = 0.722), but both were higher than MP\_4 (p = 0.001, p = 0.004). PLTs in both CP trials and MP\_23 trials were significantly above the 0.5 chance level [t(63) = 5.131, p < 0.0005; t(63) = 3.898, p < 0.0005], whereas the PLT in MP\_4 trials was at chance [t(63) = −0.535, p = 0.594]. The results are shown in **Figure 4**. Given that the age range of our toddlers was from 19 to 26 months, we further explored whether toddlers' tonal perception during word comprehension changed within this age range. In particular, we analyzed the correlation between age and each pairwise comparison (i.e., age with "CP minus MP\_23"; age with "CP minus MP\_4"; age with "MP\_23 minus MP\_4"). The results showed no significant correlation between age and the pairwise comparisons (r = 0.225, p = 0.077; r = −0.051, p = 0.691; r = −0.131, p = 0.302), suggesting that toddlers across ages in our sample responded similarly to the test trials.

Thus, children recognized the targets equally well in both CP and similar MP trials, but not in dissimilar MP trials. **Figure 5** shows the looking timecourse during the analysis window, revealing that the recognition patterns in CP and similar MP trials were comparable, both diverging from the recognition pattern in the dissimilar MP trials. PLTs (proportion of looking to target) in the three trial types before naming, that is, in the window just preceding the target word onset (the same size as the post-onset analysis window) within the same trial, were comparable (CP: mean = 0.50, SE = 0.02; MP\_23: mean = 0.54, SE = 0.03; MP\_4: mean = 0.49, SE = 0.03) (p > 0.4) and were not different from chance (p ≥ 0.23), 2-tailed.

We further analyzed the specific tones of the key words in a 2 × 3 ANOVA, with Pronunciation (CP, MP\_23, MP\_4) and Tone (T2 vs. T3 targets) as within-subject factors. Since Tone 3 Sandhi involves a unidirectional T3-to-T2 change, a significant Pronunciation × Tone interaction was expected if the unidirectionality affected children's responses. In that case, we could then predict that children should detect the T2-to-T3 mispronunciation, but not the T3-to-T2 mispronunciation. No such interaction was expected if children had an overgeneralized representation (i.e., treating T2 and T3 as one functional category). Results showed again a significant effect of Pronunciation [F(1.746, 80.339) = 4.251, p = 0.022, Greenhouse-Geisser corrected], but no significant main effect of Tone [F(1, 46) = 2.960, p = 0.092], and crucially, no Pronunciation × Tone interaction [F(1.864, 85.742) = 0.006, p = 0.992, Greenhouse-Geisser corrected]. The lack of any interaction with Tone indicates that responses to T2 and T3 targets followed the same patterns (i.e., confusion of T2 vs. T3 and T3 vs. T2; discrimination of T2 vs. T4 and T3 vs. T4). Recognition of the targets in CP trials (PLTs: T2 mean = 0.57, T2 SD = 0.25; T3 mean = 0.65, T3 SD = 0.198) and both T2-to-T3 MP trials (PLT: mean = 0.59, SD = 0.34) and T3-to-T2 MP trials (PLT mean = 0.67, SD = 0.36) was equally successful.

### DISCUSSION

Lexical tones are an important part of the phonological system in many languages. The goal of the present study was to understand the acquisition of native lexical tones during the initial stages of development. We focused on children at two important stages of phonological acquisition during the first 2 years of life: preverbal babies, who have a limited vocabulary, and toddlers, who have a reasonable-sized lexicon. Experiment 1 demonstrates that Mandarin-learning preverbal babies can discriminate acoustically similar tones in their native language, namely T2 vs. T3, which exhibit similar pitch trajectories. Notably, babies discriminated the two tones even though we eliminated the duration and amplitude cues. These results suggest that Mandarin-learning infants during the first year of life are highly sensitive to the pitch patterns of the two tones.

During the second year of life infants engage in more active word learning, and their lexicon grows significantly, particularly when they reach the vocabulary explosion stage several months before age two. In Experiment 2 we asked whether the similar T2-T3 contrast, which was perceivable at the preverbal stage, was subsequently encoded in toddlers' lexicon. Our results show that toddlers did not detect mispronunciations of T2 as T3, and T3 as T2; proportions of looking to target for these mispronunciations were at the same level as for correct pronunciations, i.e., equally successful recognition in both cases. On the other hand, recognition failed when T2 and T3 were mispronounced as T4; that is, toddlers detected the T2-to-T4 and T3-to-T4 mispronunciations. These results indicate that unlike T2-T3, the T2-T4 and T3-T4 contrasts are distinct at the lexical level for toddlers.

The failure to detect the T2-T3 contrast in Experiment 2 might be due to the acoustic similarity of the two tones. The pitch patterns of T2 and T3 are the most similar among all tonal contrasts in Mandarin. Stager and Werker (1997) showed that during word learning, similar segments such as /b/-/d/ were not distinguished by 14-month-olds, although the effect was due to young word learners' temporary processing limitation under certain task conditions. The confusion was absent when the word-learning task was made easier or when slightly older infants (17-month-olds) were tested (e.g., Werker et al., 2002; Fennell and Werker, 2003; Yoshida et al., 2009). With regard to our Experiment 2, previous studies on familiar word comprehension are most pertinent for consideration. In a previous study on familiar word recognition in English by White and Morgan (2008), toddlers' looking to targets varied according to the degree of mispronunciation of the word onset consonant, with reduced mispronunciation effects for smaller phonetic deviances than larger ones, indicating graded lexical representations. It is possible that toddlers do not distinguish T2 and T3 in familiar words due to their acoustic/perceptual similarity, and have less sensitivity to this contrast during word comprehension.

However, toddlers' T2-T3 confusion differs from the broader evidence that phonetically similar consonants and vowels are distinguished in infants' earliest familiar words (e.g., Swingley and Aslin, 2002; Fennell and Werker, 2003; Mani and Plunkett, 2007). Notably, even the smallest deviances in the sensitive task of White and Morgan (2008) still yielded a mispronunciation effect (significantly less looking to target in 1-feature MP trials than in CP trials), meaning that the most similar contrasts remained discriminable for toddlers, the same as for adults. That is, continuity was maintained from phonetic discrimination in early infancy to subsequent phonological development and to mature representation in adults. Our toddlers, however, were very different. They showed a complete lack of any mispronunciation effect for T2-to-T3 and T3-to-T2 deviances. This result was at odds with the clear discrimination of T2 and T3 shown in Experiment 1. We note that T2-T3 stimuli were made more acoustically similar than usual in Experiment 1, but this did not impede discrimination. Moreover, T2-T3 and T2-T4 were discriminated equally in Tsao (2008), i.e., comparable in perceptual salience. However, our toddlers distinguished T2-T4 but totally confused T2-T3. This was striking since T2-T3 stimuli were more distinct acoustically in Experiment 2 than in Experiment 1. The results of our experiments suggest that there may be reasons beyond acoustically based perceptual salience for T2-T3 development, as discussed below.

The T2-T3 confusion may be because children had heard the same lexical items in both tones in the input due to Tone 3 Sandhi. Both tones occur in surface realizations for the same words, e.g., xiao in xiao3tu4 "little bunny" vs. xiao2ma3 "little horse" (the numbers here indicate the surface realizations of tones). In xiao2ma3, the underlying T3T3 sequence surfaces as T2T3 due to Tone 3 Sandhi. Children might store both variants of T3 (the low variant and the T2-like rising variant) for words such as mai ("buy"), xiao ("little"), hao ("good"). What is more complicated is that T3T3 does surface in certain syntactic structures, against Tone 3 Sandhi. For example, bi3 and ma3 remain as T3T3 when surfacing in [gou2-dog [[bi3-than ma3-horse] kuai4-fast]] "dogs are faster than horses" (Duanmu, 2007). Thus, by observing the tonal changes in some known words, children can overgeneralize T2 and T3 as free variations across words, neutralizing them as variants within one functional category. Our results are consistent with this possibility. The T2-T3 confusion has also been observed in older children, who did not detect T3-to-T2 and T2-to-T3 mispronunciations during word comprehension (Singh et al., 2017). Wong et al. (2005), however, showed in a different task that 3-year-olds advance in their understanding of Tone 3 Sandhi, thus confusing only the T3-to-T2 change but not vice versa in word recognition. This asymmetry resembles native Mandarin adult listeners' asymmetrical responses in the ERP study of Li and Chen (2015), consistent with Tone 3 Sandhi.

The two ideas, acoustic similarity and sandhi alternation, are in fact related. As described in the Introduction, neutralization rules in natural languages tend to occur for phonemes that are acoustically/phonetically similar, such as the cases of flapping in English, obstruent devoicing neutralization in German and Tone 3 Sandhi in Mandarin. The two ideas cannot be easily separated, and the results of Experiment 2 are consistent with both. Nevertheless, the results of Experiment 1 in the present study suggest that T2 and T3 are perceptually discriminable. Even non-tone-speaking teenagers can discriminate T2 and T3 as successfully as do Mandarin-speaking peers (Pierce et al., 2014), indicating that the contrast is sufficiently salient acoustically. Thus, the complete lack of any mispronunciation effect between T2 and T3 in our toddlers is likely due to phonological reasons such as neutralizations related to Tone 3 Sandhi.

In our word comprehension experiment we used only monosyllabic words to test T2 and T3. There are in fact many bisyllabic (and some trisyllabic) compound words containing T2 and T3 (e.g., ping2guo3 "apple," yi3zi "chair," tuo1xie2 "slipper," chang2jing3lu4 "giraffe") that young toddlers know. Tone processing in compounds might be more challenging for learners due to coarticulation of neighboring tones that applies generally at the phonetic level (Gauthier et al., 2007a,b; Shi, 2009). It would be interesting to examine children's processing of lexical tones in compound words in future research.

In sum, our experiments show that during the first year of life tone-learning babies can discriminate similar lexical tones in their native language, as they do for similar consonant and vowel contrasts (e.g., Werker and Tees, 1984; Polka and Werker, 1994). However, during the second year of life when tones become organized in the developing lexicon, toddlers fail to distinguish similar tones in words, while they successfully represent dissimilar tonal contrasts in words. Toddlers' lexical representation seems to be affected by hearing words that go through neutralization in the input (also see recent work on consonant neutralization in Van der Feest and Johnson, 2016). A phonetic contrast that is acquired early in infancy seems to be reorganized and overgeneralized as one functional category (containing multiple variants) at the lexical stage, as toddlers focus on building a vocabulary and establishing a phonemic system.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Institute of Linguistics, CASS, China, with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved for ethics by the Institute of Linguistics, CASS, China.

### AUTHOR CONTRIBUTIONS

RS: Conceptualized and designed the study, supervised the construction and the execution of the experiments, directed the data analyses, and wrote the article. JG: Prepared/tested the experiments, coded/organized the data, and conducted the statistical analyses under the supervision of the first author. AA: Conducted further detailed statistical analysis of the data and participated in the writing. AL: Lead for obtaining the Chinese funding for the experimental procedure and the testing, and participated in discussions during the study.

### ACKNOWLEDGMENTS

This research was supported by the National Social Science Fund of China (Project No.: 08AYY02). The results were presented at the Speech Prosody 2010 meeting and the IASCL 2011 congress. We thank Mireille Babineau, Zhao Zhang, Leilei Zheng, and Ziyu Xiong for research assistance. We thank all families for participating in this study.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2017.01117/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Shi, Gao, Achim and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Cantonese-Speaking Children Do Not Acquire Tone Perception before Tone Production—A Perceptual and Acoustic Study of Three-Year-Olds' Monosyllabic Tones

Puisan Wong\*, Wing M. Fu and Eunice Y. L. Cheung

*Division of Speech and Hearing Sciences, University of Hong Kong, Pokfulam, Hong Kong*

Models of phonological development assume that speech perception precedes speech production and that children acquire suprasegmental features earlier than segmental features. Studies of Chinese-speaking children challenge these assumptions. For example, Chinese-speaking children can produce tones before two-and-a-half years but are not able to discriminate the same tones until after 6 years of age. This study compared the perception and production of monosyllabic Cantonese tones directly in 3-year-old children. Twenty children and their mothers identified Cantonese tones in a picture identification test and produced monosyllabic tones in a picture labeling task. To control for lexical biases on tone ratings, the mother- and child-productions were low-pass filtered to eliminate lexical information and were presented to five judges for tone classification. Detailed acoustic analysis was performed. Contrary to the view that children master lexical tones earlier than segmental phonemes, results showed that 3-year-old children could not perceive or produce any Cantonese tone with adult-like proficiency and incorrect tone productions were acoustically different from criterion. In contrast to previous findings that Cantonese-speaking children mastered tone production before tone perception, we observed more accuracy during speech perception than production. Findings from Cantonese-speaking children challenge some of the established tenets in theories of phonological development that have been tested mostly with native English speakers.

Keywords: lexical tone, acoustic analysis, pitch analysis, fundamental frequency, pitch contours, pitch production, pitch discrimination, Cantonese tones acquisition

#### Edited by:

*Denis Burnham, Western Sydney University, Australia*

#### Reviewed by:

*Nan Xu Rattanasone, Macquarie University, Australia; Chutamanee Onsuwan, Thammasat University, Thailand; Denis Burnham, Western Sydney University, Australia*

> \*Correspondence: *Puisan Wong pswResearch@gmail.com*

#### Specialty section:

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

Received: *30 October 2016* Accepted: *10 August 2017* Published: *29 August 2017*

#### Citation:

*Wong P, Fu WM and Cheung EYL (2017) Cantonese-Speaking Children Do Not Acquire Tone Perception before Tone Production—A Perceptual and Acoustic Study of Three-Year-Olds' Monosyllabic Tones. Front. Psychol. 8:1450. doi: 10.3389/fpsyg.2017.01450*

### INTRODUCTION

Lexical tone is the use of pitch variations to contrast lexical meaning (Yip, 2002). Models of phonological development assume that acquisition of lexical tone and other suprasegmental features (prosody) is early, rapid, and complete before the mastery of segmental features (vowels and consonants). Studies of children who are acquiring Indo-European languages (English, French, Hindi) support such assumptions (see Werker and Tees, 1984; Kuhl et al., 1992; Dehaene-Lambertz and Houston, 1998; Peña et al., 2012). However, studies of lexical tone production in Sino-Tibetan languages such as Thai, Mandarin, and Cantonese report mixed results (see the review by Singh and Fu, 2016). Thai has three level tones (high-level, mid-level, and low-level), a rising tone and a falling tone (Abramson, 1986). In a study of Thai speech perception (Tuaycharoen, 1977, in Li and Thompson, 1977) and an acoustic study (Onsuwan et al., 2014), children learning Thai as their first language had fully mastered the five tones at 2 years of age. The first tones to be mastered were the mid-level and low-level tones, followed by the rising tone and finally by the high-level and falling tones. Mandarin has a simpler tone system than Thai. The four Mandarin tones (high, rising, low/dipping, and falling) are contrasted by tone shapes. Based on perceptual judgments of naturally produced tones, early studies reported that children master the production of the four Mandarin tones between one-and-a-half and around 3 years of age. One large-scale cross-sectional study and one longitudinal study reported the earliest age of acquisition. Hua and Dodd (2000) examined Mandarin tone and segmental productions in isolated words and connected speech in 129 children between the ages of 1.6 and 4.6 years and reported that children as young as 1.6 made no tone errors. Hua (2002) then followed four children's Mandarin tone productions in spontaneous speech from 1 to 2 years of age and concluded that children's tone productions were stabilized before 2.0, supporting the findings of Hua and Dodd (2000). Other studies have reported a slightly later age of acquisition for Mandarin tones (Chao, 1973/1951; Li and Thompson, 1977; Clumeck, 1980). The order of acquisition of Mandarin tones varies across studies, although most report that the rising tone is more difficult and the latest to be acquired by children (Li and Thompson, 1977; Clumeck, 1980). However, recent studies that controlled for lexical biases in tone judgment by asking judges to identify the tones in filtered speech reported that 5- and 6-year-olds do not produce Mandarin tones in monosyllabic words as well as adults do (Wong et al., 2005; Wong, 2012a,b, 2013; Wong and Strange, 2017).

Cantonese has a more complex tone system than Mandarin. There are three level tones [HL (T1), ML (T3), LL (T6)], two rising tones [HR (T2), LR (T5)], and one falling tone [LF (T4)] (see **Table 1**), and these are contrasted by both pitch heights and pitch shapes. The relative pitch levels and pitch shapes of tones have been conventionally represented by a numerical system suggested by Chao (1947) based on an auditory impression. In this system, each tone is notated with a two-digit number indicating the pitch level at tonal onset and offset. Each digit ranges from one to five, with "1" and "5" representing the lowest and highest pitch of a person's typical pitch range, respectively. For example, HL (T1) is notated as 55 because it is perceived to be produced with the highest pitch of the speaker from tonal onset to tonal offset (see the third column in **Table 1**). **Figure 1** shows the pitch contours of the six tones produced by native adult speakers.
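The two-digit notation can be illustrated with a minimal Python sketch that reads the onset and offset digits and derives the tone shape. Only the value 55 for T1 is stated in the text; the other numerals below are commonly cited values assumed here for illustration and are not taken from Table 1. The name `describe_tone` is hypothetical.

```python
# Two-digit tone numerals: first digit = pitch level at onset,
# second digit = pitch level at offset (1 = lowest, 5 = highest).
# ASSUMPTION: only "55" for T1 comes from the text; the other
# values are commonly cited numerals used purely for illustration.
TONE_NUMERALS = {
    "T1": "55",  # high level (HL)
    "T2": "25",  # high rising (HR)
    "T3": "33",  # mid level (ML)
    "T4": "21",  # low falling (LF)
    "T5": "23",  # low rising (LR)
    "T6": "22",  # low level (LL)
}


def describe_tone(numeral):
    """Return (shape, onset, offset) for a two-digit tone numeral."""
    onset, offset = int(numeral[0]), int(numeral[1])
    if offset > onset:
        shape = "rising"
    elif offset < onset:
        shape = "falling"
    else:
        shape = "level"
    return shape, onset, offset


if __name__ == "__main__":
    for tone, numeral in TONE_NUMERALS.items():
        print(tone, numeral, describe_tone(numeral))
```

Under these assumed numerals the sketch yields three level tones, two rising tones, and one falling tone, matching the inventory described above.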

Previous studies of tone production with Cantonese-speaking children suggested that children make no tone errors after two-and-a-half years of age (Tse, 1978; To et al., 2013), supporting the established view that lexical tones are acquired early. However, studies of tone perception with Cantonese-speaking children report that 6-year-old children were not able to discriminate tones at the level of native-speaking adults (Lee et al., 2002, 2015; Ciocca and Lui, 2003). Comparing these results suggests that acquisition of lexical tone production in Cantonese precedes tone perception. However, no study has compared Cantonese tone perception and production in the same children. This methodological gap was the motivation for the present study to examine tone perception and production ability in 3-year-old Cantonese-speaking children. Another goal was to compare the acoustic features of Cantonese tone perception and production in children and adults to determine how well 3-year-old children perceive and produce monosyllabic Cantonese tones and confirm whether tone production precedes tone perception. Three-year-old children were recruited for several reasons. Most studies report that Cantonese children master Cantonese tone production before 3 years of age (at around two-and-a-half years of age), but no study has compared Cantonese tone perception and production in 3-year-old children directly. In addition, as a study to examine Cantonese tone production abilities with both perceptual and acoustic methods, focusing on one critical age group allows more detailed and thorough examination of tone perception and production.

Extant studies suffered from a number of limitations. First, accuracy of tone productions was determined by rating tones with natural unfiltered stimuli in the presence of segmental information. With the expectation of a target word, a rater may be unable to ignore segmental information and may fail to detect potential tone errors, which could lead to transcription biases (Oller and Eilers, 1975). Second, none of the studies included an adult reference group for comparison and the criterion for determining mastery was not defined in most studies. Therefore, it is unclear if children's tone productions are in fact adult-like. Third, most studies used only one judge (usually the experimenter) to score tone production. Inter-judge or intra-judge reliability was rarely reported. Fourth, no study has examined the acoustic properties of productions to validate the perceptual findings.

There is evidence that when these methodological limitations are corrected, the age of mastery for Cantonese tones is relatively late. Barry and Blamey (2004) elicited monosyllabic Cantonese tone productions from eight children (range = 3.8–6.0), five adults, and a group of sixteen children with cochlear implants. A non-native speaker of Cantonese identified target tones in productions based on perceived pitch, which could have reduced the effect of lexical expectation. The findings were that although tone productions were not error-free in normal-hearing adults, children produced the tones with much lower accuracy, showing that children as old as 6 years of age did not produce Cantonese tones as well as adults. Children's error patterns included confusions among the three level tones [HL (T1) vs. ML (T3) and ML (T3) vs. LL (T6)], between the two rising tones [HR (T2) vs. LR (T5)], and between the low-falling and low-level tones [LF (T4) vs. LL (T6)]. To compare the acoustic characteristics of tone productions, the fundamental frequencies of tone onset and offset were measured and plotted against one another. Sizes and distances of ellipses representing the clusters of the measurements of the tones produced by the three speaker groups were compared. The results showed that normal-hearing adults had small ellipses located in a relatively small tonal space, unlike both typical and hearing-impaired children. The acoustic findings supported their perceptual findings that the tones produced by 4–6-year-old typically developing Cantonese children were not adult-like. Although the results of Barry and Blamey (2004) challenge the assumption of early acquisition of lexical tones in other studies, the sample size was low (n = 8) and there was a wide range of ages in the typical children. Furthermore, although the study compared the acoustic characteristics of productions, only two points of the pitch contour were measured and no information on the shapes and pitch level of the tone contours was reported. Thus, further study with more detailed acoustic analysis on a larger group of children would provide more detailed information on the acoustic characteristics of children's Cantonese tone productions.

#### TABLE 1 | The six tones in Cantonese.

pitch (upper panels) and normalized pitch (lower panels).

A series of studies on Mandarin tone production in Mandarin-speaking children reported protracted lexical tone development (Wong et al., 2005; Wong, 2012a,b, 2013; Wong and Strange, 2017). In these studies, children and their mothers labeled pictures representing monosyllabic and disyllabic words familiar to young children. The productions were low-pass filtered to preserve the pitch information and eliminate lexical information. Judges who were blind to the experimental design categorized the children's and adults' tones based on the pitch information in the filtered stimuli. Perceived accuracy of children's tone productions in filtered stimuli was compared to that of mothers to determine mastery. The results showed that the judges categorized the filtered tones produced by the mothers with complete accuracy and significantly better than the tones produced by 3–5-year-old children (Wong et al., 2005; Wong, 2012a,b, 2013; Wong and Strange, 2017). Wong (2012a) conducted an acoustic study to compare children's and adults' Mandarin tone productions and found that children's tones, in which the target tones were correctly identified by the judges, had acoustic features similar to those of adults' tones, though not all acoustic parameters were adult-like. Children's tones in which target tones were incorrectly identified by judges were acoustically different from adults' and children's correct productions, supporting the perceptual findings in their studies. The findings questioned the assumption in speech and language acquisition models that suprasegmental units are acquired before segmental units.

Only one study has examined tone perception and production in the same group of children (Wong et al., 2005) and no study has compared Cantonese tone perception and production in the same group of children. Wong et al. (2005) reported that 3-year-old Mandarin-speaking children perceived four tones with complete accuracy, but tone production accuracy was significantly lower, suggesting that tone perception precedes tone production. Intriguingly, studies on children's identification of Cantonese tones report an age of acquisition of tone perception much later than the age of acquisition of tone production reported in production studies, posing a challenge to the conventional assumption in models of phonological development that speech perception precedes speech production. For example, Ching (1984) asked four typically developing Cantonese-speaking children to identify the six tones in the syllable /ji/ by pointing to one of six pictures upon hearing the word. The study found that children did not reach an adult criterion for tone identification until 10 years of age. Ciocca and Lui (2003) modified the design of Ching (1984) and examined tone identification in adults and 60 Cantonese-speaking children between the ages of 4 and 11 years using the same stimuli but with a two-alternative forced-choice task. In accordance with the findings in Ching (1984), they reported that children's identification of Cantonese tones was not adult-like until 10 years of age. However, because the six words formed by the combination of the syllable /ji/ and the six tones were not of equal familiarity to young children, the findings of these two studies may have been confounded by word familiarity effects.

Two studies examined children's Cantonese tone identification in words familiar to children and found a slightly earlier age of acquisition of Cantonese tone identification, though still much later than the age of acquisition of Cantonese tone production reported in most previous studies. Lee et al. (2002) presented three pairs of Cantonese tones in monosyllabic words with a live voice to 2–3-year-old children for identification using a four-choice picture-pointing task. All stimuli were judged by two experienced speech therapists to be familiar to young children. They reported an accuracy rate of 91% for Tones 1, 2, and 4. Without examining the full set of tones and without a reference group, it remains unclear when children reach the fully skilled criterion. Lee et al. (2015) examined Cantonese tone identification in familiar monosyllabic words in 200 3–10-year-old children and 25 adults. Upon hearing the target word, participants were asked to point to one of four pictures, with one representing the target word, another representing a word that formed a tone minimal pair with the target word, and the other two representing words that had the same initial consonant or vowel as the target word. The results showed that children identified tones in familiar words with adult-like accuracy at 6 years of age, far later than the reported age of mastery of the production of tones. However, without testing perception and production accuracy in the same group of children, the relationship between children's tone perception and production remains unclear.

The unexpected finding that Cantonese-speaking children fully master the production of six tones earlier than their mastery of Cantonese tone identification calls for a reexamination of children's acquisition of tones. As a first step, this study examined monosyllabic Cantonese tone perception and production in the same group of 3-year-old Cantonese-speaking children and provided detailed comparisons on the acoustic characteristics of adults' tones and children's correct and incorrect productions to test the tenets in theories of phonological development that (a) children rapidly acquire suprasegmental features in their language and fully master lexical tones before 3 years of age, well before their full mastery of the segmental features (Li and Thompson, 1977; Snow, 1997, 2006; Hua and Dodd, 2006), and that (b) speech perception precedes speech production (Edwards, 1974; Greenlee, 1980). Specific research questions were (1) How well do 3-year-old children perceive the Cantonese tones? (2) How well do 3-year-old children produce the Cantonese tones? (3) What are the relationships between children's tone perception and production ability? and (4) What are the acoustic characteristics of children's correct and incorrect Cantonese tone productions?

### METHODS

The study was approved by the Human Research Ethics Committee of the University of Hong Kong (date of approval: December 9, 2015).

### Participants

### Children

Twenty Cantonese-speaking children (8 girls, 12 boys) with a mean age of 3.07 (range = 3.01–3.11) participated in the study. Their mothers provided written informed consent for the children's participation. All were born in Hong Kong, and raised in Cantonese-speaking families. Cognitive, language, and speech developmental milestones reported by the mothers were within normal range. All children scored within normal limits on the Short Form A in the Hong Kong Cantonese Tone Identification Test (CanTIT; Lee, 2012), a standardized test that examines children's Cantonese tone perception ability (more information below), and the Cantonese Oral Language Deficiency Early Identification Test for Pre-primary Children (學前兒童粵語表達能力識別測驗; Po Leung Kuk, 2012), which assesses children's oral language ability. All children passed hearing screening at 500, 1,000, 2,000, and 4,000 Hz at 25 dB HL bilaterally, under headphones using pure-tone audiometry.

### Adults

Mothers of the 20 children (n = 20) with a mean age of 37 (range = 32–48) years participated in the study. All mothers provided written informed consent for the participation of themselves and their children, and passed a telephone screening in which they repeated two syllables in six tones to ensure that they perceived and produced the six tones. All recruited mothers were Cantonese native speakers and had not lived overseas for more than 12 months. All mothers passed the same hearing screening.

### Stimuli

### Stimuli for Tone Perception Test

The Short Form A of CanTIT (Lee, 2012) was employed to evaluate tone perception accuracy of the mothers and children. The test items comprised 30 monosyllabic words. In each trial, four pictures, with one representing the target word, one representing another word that formed a tone minimal pair with the target word (the tone distractor), and two pictures representing two other words that had the same vowel (vowel distractor), or initial consonant (consonant distractor) as the target word were displayed on the screen. The target words were recorded by a male speaker in the sentence-final position of the carrier phrase: "邊幅 \_\_\_ [Which picture shows\_\_\_\_?]."

### Stimuli for Tone Production Test

Thirty-nine monosyllabic words depicted in color pictures were employed as production stimuli for both child and adult speakers (**Table 2**). Twelve of the words were also found in the tone perception test. Twenty-nine of the words formed a tone minimal pair with another word, covering the 15 tonal contrasts of the six Cantonese tones, whereas the other 10 words were singletons without a minimal pair counterpart. Twenty-four of the words, three to six words for each tone category, were highly familiar words produced by 80–100% of 30-month-old Cantonese-speaking children growing up in Hong Kong based on parents' reports in the Cantonese Communicative Development Inventory (CCDI; Tardif et al., 2009).
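The figure of 15 tonal contrasts follows from pairing each of the six tones with every other tone once. A short Python sketch makes the count explicit; the tone labels are the T1 to T6 labels used in the text.

```python
from itertools import combinations

# The six Cantonese tones, labeled as in the text.
TONES = ["T1", "T2", "T3", "T4", "T5", "T6"]

# Every unordered pair of distinct tones is one tonal contrast.
contrasts = list(combinations(TONES, 2))

if __name__ == "__main__":
    print(len(contrasts))  # 6 choose 2 = 15
    for first, second in contrasts:
        print(f"{first} vs. {second}")
```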

### Procedures for Child and Adult Speakers

Each mother-and-child pair attended a 2-h session in a quiet room at home. Mothers were asked to fill out a background questionnaire. The tone production test was administered prior to the tone identification test to prevent delayed imitations and children were tested before mothers to avoid an exposure effect. Children were instructed to label the pictures presented on a computer screen with monosyllabic words. Three practice trials were presented first to familiarize the participants with the testing procedures. After that, the thirty-nine experimental stimuli were randomly presented. Simple questions such as "咩 [What is this]?" or "隻雀仔做緊咩[What is the bird doing]?" were used to elicit spontaneous productions. If the children failed to produce the target words spontaneously in isolation, sentence completion such as "係公園會見到好多[In the park, we can see a lot of \_\_\_\_]" was employed. All productions were digitally recorded.

After the tone production task, the CanTIT tone perception test was administered. A target word was randomly presented in a carrier phrase over the headphones. The children were instructed to point to one of the four pictures displayed on the computer screen corresponding to the word they heard. The experimenter clicked on the selected picture. Three practice trials were included to familiarize the children with the testing procedure, followed by 30 experimental trials. After the children finished the tone production and perception tasks, mothers were asked to label the pictures, and then took part in CanTIT. After that, the language tests for the children and hearing screenings for the children and the mothers were carried out.

## Perceptual Judgment of the Produced Tones

#### Judges

To determine accuracy of the tones produced by the mothers and children, five native Cantonese speakers (four females, one male; mean age = 21 years, range = 19–23 years) were recruited as judges. All were undergraduate students studying Speech and Hearing Sciences at The University of Hong Kong and had received phonetics training. Cantonese was their strongest and dominant language. They passed a screening test on tone judgment of filtered stimuli, with a passing criterion of 80% accuracy. No speech, language, or hearing difficulties were reported.

### Stimuli for Tone Judgment

The stimuli for tone rating included 750 child productions and 778 adult productions collected using the procedures that were described above. Thirty of the children's productions were not included due to failing to label the picture within the 1-min recording time-frame for the trial (n = 21), poor quality of recording (n = 6), and production of non-target words (n = 3). Two productions from mothers were excluded due to no recording or production of a non-target word. All practice trials were excluded for tone judgment. The tones collected were low-pass filtered to eliminate segmental information while retaining F0 information. Because children speak with a higher F0, child productions were low-pass filtered at 500 Hz whereas adult productions were low-pass filtered at 400 Hz. The filtered stimuli were then normalized to 68 dB to ensure that all tokens had the same overall root-mean-square amplitudes. All tones were blocked by speakers to assist the judges' normalization of the speaker's pitch range (Wong and Diehl, 2003). Altogether, 20 blocks of adult productions and 20 blocks of child productions were created.

#### TABLE 2 | Word stimuli for tone production.

*<sup>a</sup>Words produced by at least 80% of the 30-month-old children as reported in Cantonese Communicative Development Inventory (CCDI) (Tardif et al., 2009).*

*<sup>b</sup>Words in square brackets [ ] indicate the 18 highly familiar words selected for data analysis.*

*<sup>c</sup>Words produced by less than 80% of the 30-month-old children in CCDI (Tardif et al., 2009).*

\**Words presented in both perception and production tasks as target words.*

\*\**Words presented in both perception and production tasks but were used as tone distractors in the perception test.*
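The amplitude equalization applied to the filtered stimuli for tone judgment can be sketched in Python as scaling each token so that its root-mean-square (RMS) amplitude matches a common target. This is a minimal illustration and not the authors' procedure: the function names and the target value are hypothetical, the sample lists stand in for real audio, and the mapping from a linear target to the 68 dB level reported in the text is not modeled.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def scale_to_rms(samples, target_rms):
    """Scale samples so their RMS equals target_rms (hypothetical helper).

    A silent token (RMS of zero) is returned unchanged, because no
    gain can bring it to the target.
    """
    current = rms(samples)
    if current == 0:
        return list(samples)
    gain = target_rms / current
    return [s * gain for s in samples]


if __name__ == "__main__":
    # Two made-up tokens with different overall amplitudes.
    quiet = [0.01, -0.02, 0.015, -0.01]
    loud = [0.4, -0.5, 0.45, -0.35]
    target = 0.1  # ASSUMPTION: arbitrary linear target, not 68 dB
    for token in (quiet, loud):
        print(round(rms(scale_to_rms(token, target)), 6))
```

After scaling, both tokens have the same overall RMS amplitude, so loudness differences between tokens no longer vary with the speaker or recording level.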

### Procedures for Tone Judgment

Tone rating was carried out in a quiet room. The judges attended multiple sessions to categorize the tones in the 40 blocks of stimuli at their own pace. Productions by different speakers and trials within each block were randomly presented to the judges. The judges listened to the sounds at a comfortable level via headphones, and indicated their decision by selecting the tone number from a list that appeared on the computer screen (e.g., 1 = Tone 1 媽). They also re-rated, at a minimum, four blocks of child productions and four blocks of adult productions (20% of the data) for intra-rater reliability.

### Acoustic Analysis

Acoustic analyses were performed on the recorded tones produced by adults and children.

### Segmentation and Vocal Pulse Checking

Segmentation was performed on unfiltered stimuli. The speech signals were manually segmented into three sections: the initial section, the pitch section, and the final section. The initial section started from the onset of articulation of the target word (e.g., the burst for a stop consonant, the beginning of the fricative noise for fricatives) to the end of the first pitch cycle. Thus, the initial section included any unvoiced initial consonants, irregular pitch cycles, and the first regular pitch cycle. The final section started from the beginning of the final regular pitch cycle and ended at the end of the articulation for the word. Thus, the final section consisted of the last regular cycle of the pitch contour and the irregular cycles with very low amplitude. The pitch section included all the vocal pulses in the voiced initial consonants, voiced final consonants, and the vowels (i.e., the vocalic portion of the production), except the first and last regular pitch cycle (Boersma and Weenink, 2014).

### Acoustic Parameters

Segmentations obtained from the unfiltered stimuli were applied to the low-pass filtered sound files. The pulse markings generated by Praat were manually checked for accuracy. The pitch contour in the pitch section was divided into 10 intervals of equal duration. F0 values in Hertz (Hz) at 10 time points were obtained and converted to semi-tones, with 1 Hz as the reference frequency, using a custom-written script (Prosody Pro 6.1.3 beta; Xu 2005–2016). The mean pitch in semi-tones of each speaker across all productions was computed and referred to as the "speaker mean." The initial, final, minimum, maximum, and mean pitch in semi-tones relative to the speaker mean, called Pitch Heights, were computed for each production by subtracting the speaker mean from the pitch values. Altogether, five pitch parameters were obtained for each tone production: "Mean Pitch Height" (i.e., mean pitch − speaker mean pitch), "Initial Pitch Height" (i.e., initial pitch − speaker mean pitch), "Final Pitch Height" (i.e., final pitch − speaker mean pitch), "Min Pitch Height" (i.e., minimum pitch − speaker mean pitch), and "Max Pitch Height" (i.e., maximum pitch − speaker mean pitch).
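The conversion and normalization steps described above can be illustrated with a minimal sketch. It is not the Praat script used in the study: the function names and the F0 values are hypothetical, and pooling all samples to obtain the speaker mean is an assumption, since the text does not say how the mean was aggregated.

```python
import math

def hz_to_semitones(f0_hz, ref_hz=1.0):
    """Convert F0 in Hz to semitones relative to ref_hz (12 semitones per octave)."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def speaker_mean(contours_st):
    """Mean pitch over all samples of all productions by one speaker.

    Assumption: all F0 samples are pooled before averaging.
    """
    samples = [value for contour in contours_st for value in contour]
    return sum(samples) / len(samples)

def pitch_heights(contour_st, mean_st):
    """Five Pitch Height measures: each pitch value minus the speaker mean."""
    return {
        "initial": contour_st[0] - mean_st,
        "final": contour_st[-1] - mean_st,
        "min": min(contour_st) - mean_st,
        "max": max(contour_st) - mean_st,
        "mean": sum(contour_st) / len(contour_st) - mean_st,
    }

# Hypothetical F0 samples (Hz) at 10 time points for two productions by one speaker.
productions_hz = [
    [220, 221, 222, 222, 223, 223, 224, 224, 225, 225],  # level-like contour
    [200, 196, 193, 192, 194, 199, 206, 214, 223, 232],  # dipping, then rising
]
productions_st = [[hz_to_semitones(f) for f in contour] for contour in productions_hz]
mean_st = speaker_mean(productions_st)
heights = [pitch_heights(contour, mean_st) for contour in productions_st]
```

Because the speaker mean is subtracted in semitones, the resulting Pitch Heights describe where a production sits within the speaker's own range, which is what allows adult and child productions to be compared.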

In order to compare the shape and direction of the F0 contours, the slope of the second half of the tone contour was calculated. The second half of the syllable was selected because perceptual cues for tones are carried in the second half of the syllable (Xu, 2001; Xu and Wang, 2001; Khouw and Ciocca, 2007) and the pitch targets for the tones are best approached toward the end of the syllable (Xu, 2001; Xu and Liu, 2006). Also, the pitch contours at tonal onset are affected by several factors, including the aspiration of the initial consonant (Xu and Xu, 2003), the tone transition, and the tone in the preceding syllable (Xu, 2001; Xu and Liu, 2006). For example, to produce a rising tone, the pitch contour in the initial portion of the syllable moves downward from the regular pitch of the speaker to a minimum pitch level before moving upward, resulting in a falling contour in the first half of the syllable and a rising contour in the second half of the syllable (see **Figure 1**). Our previous study (Wong and Ng, 2017) showed that if the initial 50% of the tone contours was included for acoustic analysis, 7% of 143 HR (T2) and 60% of 142 LR (T5) productions by adults had maximum and minimum pitch in the first 50% of the syllable, resulting in a falling contour for acoustic analysis. This occurred even though the second half of the syllable had a rising contour and listeners consistently identified the productions as rising tones, creating a mismatch between the acoustic and perceptual measures.
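The reason for restricting the slope to the second half can be seen in a small sketch. The text does not state how the slope was computed, so the least-squares fit over the last five of the ten samples is an assumption, and the contour values are hypothetical.

```python
def least_squares_slope(values):
    """Slope of the best-fit line through equally spaced values (units per step)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

def second_half_slope(contour):
    """Slope over the second half of a time-normalized contour."""
    return least_squares_slope(contour[len(contour) // 2:])

# Hypothetical rising tone (semitones): dips from the onset, then rises.
contour_st = [90.0, 89.0, 88.0, 87.5, 88.0, 89.0, 90.0, 91.0, 92.0, 93.0]
first_half = least_squares_slope(contour_st[:5])  # negative: onset dip
second_half = second_half_slope(contour_st)       # positive: the rise listeners hear
```

In this example the first half has a falling slope and the second half a rising one, so a measure taken over the first half would misdescribe a tone that is perceived as rising.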

### RESULTS

Productions of a mother (M302) and her child (C302) were excluded because the overall accuracy and the mean accuracy in five of the six tones of this mother's productions were outliers or extreme values compared to those of the other adults. In addition, two productions of non-target words from another two adults and 27 child productions, which included productions interrupted by too many clicks (n = 4), productions without pitch information after filtering (n = 1) and no response trials (n = 22), were excluded from analysis. Subsequent analyses were based on 714 child productions and 739 adult productions from 19 pairs of mothers and children.

In the following analyses, children's tone perception ability was examined before tone production ability. After that, the relationship between children's perception and production ability was determined. Finally, the acoustic characteristics of children's tone productions were investigated.

To investigate children's tone perception ability, (1) children's and adults' tone perception accuracy was compared to determine if children's tone perception performance was adult-like, (2) children's perception accuracy for the six tones was compared to investigate whether children perceived some tones better than others, and (3) the major error patterns in children's tone perception were identified.

To examine children's tone production ability, (1) inter-judge and intra-judge reliability were examined to determine the degree of consistency in the judges' rating of the adults' and children's filtered tones, (2) children's tone production accuracy in highly familiar words and relatively less familiar words was compared to determine whether word familiarity was confounded in children's tone production scores, (3) adults' tone production accuracy in words highly familiar to young children was examined to establish the criteria for determining tone production mastery in children, (4) children's tone production accuracy in familiar words was compared to adults' to determine if children's tone production accuracy was adult-like, (5) the rank order of accuracy of the six tones was compared in adults' and children's productions to determine whether some of the tones were more difficult for children to produce than others and whether the order of production accuracy of the six tones in children was similar to that of the adults, and (6) error patterns in adults' and children's tone productions were examined and compared.

To examine the relationships between children's tone production and perception ability, (1) the accuracy rates in children's tone perception and production were compared, and (2) correlation analysis was performed on children's perception and production scores.

To examine the acoustic characteristics of children's correct and incorrect tones, (1) the tone contours of adults' correct productions, and children's correct and incorrect productions were presented for visual comparison, (2) children's incorrect productions that constituted the major error patterns in children's errors were identified, and (3) the seven acoustic parameters in adults' correct productions, children's correctly perceived tones, and children's incorrect tones that delineated the major error patterns were compared to examine the acoustic similarities and differences in the tones among the three groups of productions.

### Children's Tone Perception Accuracy

To determine how well children perceived the tones, adults' and children's perception accuracy measured by CanTIT was compared. The results showed that the adult group identified all tones with ceiling accuracy (range = 99–100%). On the other hand, children identified the six tones with much lower accuracy: HL (T1) (M = 93%, SD = 11.95%), HR (T2) (M = 72%, SD = 19.22%), ML (T3) (M = 79%, SD = 20.52%), LF (T4) (M = 79%, SD = 19.41%), LR (T5) (M = 72%, SD = 13.85%), and LL (T6) (M = 82%, SD = 17.51%) (**Figure 2**). Because ceiling performance was noted in the adult group, a Mann-Whitney test with the participant group as the between-subject variable was used to examine whether children's tone perception accuracy was different from that of adults. The results showed that children perceived all six tones with significantly lower accuracy than adults, ps ranging from <0.001 to 0.009, r = 0.30–0.90. Children's mean perception accuracy of the six tones, from the highest to the lowest accuracy, was HL (T1), LL (T6), ML (T3), LF (T4), HR (T2), and LR (T5).
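For readers unfamiliar with the Mann-Whitney test, the sketch below shows how the U statistic and its z approximation are obtained from two independent groups. It is a plain-Python illustration rather than the statistical package used in the study: the accuracy scores are hypothetical, the tie-corrected normal approximation without continuity correction is an assumed choice, and the function names are invented for this example.

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def mann_whitney(group_a, group_b):
    """Return (U for group_a, z) using the tie-corrected normal approximation."""
    n_a, n_b = len(group_a), len(group_b)
    n = n_a + n_b
    pooled = list(group_a) + list(group_b)
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - n_a * n_b / 2) / math.sqrt(variance) if variance > 0 else 0.0
    return u_a, z

# Hypothetical per-participant accuracy (%) on one tone; not the study's data.
children = [58, 67, 75, 83, 50, 92, 67, 75]
adults = [100, 100, 92, 100, 100, 100, 100, 100]
u_children, z = mann_whitney(children, adults)
```

A rank-based test is a natural choice here because the adult scores cluster at ceiling, which produces many ties and a strongly skewed distribution.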

To determine whether children's tone perception ability varied among the tones, a Wilcoxon signed-rank test was conducted to compare children's perceptual accuracy across the six tones. The results indicated that children's perception accuracy of HL (T1) (M = 93%, SE = 2.74%) was significantly higher than that of LR (T5) (M = 72%, SE = 3.18%), Z = −4.066, p < 0.001, r = 0.660. No other significant difference was found. Two error patterns that occurred over 10% of the time were the misperception of LR (T5) as HR (T2) and the misperception of HR (T2) as LF (T4) (**Table 3**).
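The reported effect size appears consistent with the common conversion r = |Z| / √N, where N is the total number of observations. The text does not state the formula, so the conversion and the choice of N = 2 × 19 (two scores from each of the 19 children) are assumptions used only to illustrate the arithmetic.

```python
import math

def z_to_r(z, n_observations):
    """Effect size r = |Z| / sqrt(N)."""
    return abs(z) / math.sqrt(n_observations)

# 19 children, each contributing a score for both tones in the pair (assumed N = 38).
r = z_to_r(-4.066, 2 * 19)
print(round(r, 3))  # close to the reported r = 0.660
```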

### Perceived Accuracy of Children's Tone Productions

To determine whether children produced the tones as accurately as adults, the five judges' perceptual accuracy of the adults' and children's tones was compared.

#### Inter-judge Reliability

Fleiss' kappa (κ), which adjusts for chance-level agreement, was used to determine the consistency in the tone ratings among the five judges. Following the conventional standards for the interpretation of the kappa coefficient (Landis and Koch, 1977; Posner et al., 1990), the results showed substantial inter-judge agreement on the ratings of adult productions (κ = 0.788) and moderate inter-judge agreement on the ratings of child productions (κ = 0.538). When the productions of all speakers were collapsed, substantial inter-judge agreement was found (κ = 0.674), indicating high overall inter-judge reliability.

*Percentages in square brackets [ ] indicate that more than 10% of the target tone were judged as another tone.*

*Shaded cells indicate correct perception of the tones.*

#### Intra-judge Reliability

To determine how consistent each judge was in their own ratings, Cohen's kappa (κ) was computed. Based on the conventional interpretation of the kappa values in the literature, in which kappa values between 0.81 and 1.00 are considered as reaching almost perfect agreement and kappa values between 0.61 and 0.80 are considered as having substantial agreement (Landis and Koch, 1977; Posner et al., 1990), all judges showed almost perfect intra-rater agreement on their ratings of adults' tone productions (κ = 0.832–0.873), except one judge who reached substantial intra-judge agreement (κ = 0.773). For children's productions, all judges reached substantial intra-judge agreement (κ = 0.644–0.691).
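The two agreement statistics used in this section, Fleiss' kappa for agreement among several judges and Cohen's kappa for two sets of ratings, can be sketched as follows. The function names are hypothetical and the code is a minimal illustration, not the software used in the study; the verbal labels follow the Landis and Koch (1977) bands.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: one list per item, holding the label chosen by each judge."""
    n_items = len(ratings)
    n_judges = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})
    # Observed agreement: proportion of agreeing judge pairs, averaged over items.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_judges) / (n_judges * (n_judges - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(c[k] for c in counts) / (n_items * n_judges) for k in categories]
    expected = sum(p * p for p in shares)
    return (observed - expected) / (1 - expected)

def cohen_kappa(first, second):
    """Agreement between two ratings of the same items (e.g., rating and re-rating)."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    c1, c2 = Counter(first), Counter(second)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Verbal label for a kappa value (Landis and Koch, 1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Applied to the values reported here, `landis_koch(0.788)` returns "substantial" and `landis_koch(0.538)` returns "moderate". Both kappa functions are undefined when chance agreement equals 1 (all ratings in a single category), a case this sketch does not guard against.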

### Effect of Word Familiarity on Children's Tone Production Accuracy

Production accuracy of the tones was defined as the judges' correct identification of the target tones. Among the stimuli for tone production (**Table 2**), 24 of the words were reported in Tardif et al. (2009) to be produced by more than 80% (M = 93%, range = 87–100%) of 30-month-old children growing up in Hong Kong and six words were reported to be produced by <80% (M = 54%, range = 25–79%) of 30-month-old Hong Kong children. To determine whether children's tone production accuracy was affected by word familiarity, tone accuracy in these two groups of words with high and low familiarity was compared. A two-way mixed ANOVA, with speaker group (adults, children) as the between-subject variable and word familiarity (high familiarity, low familiarity) as the within-subject factor, showed a significant main effect of word familiarity, F(1,36) = 4.655, p = 0.038, r = 0.338; a significant main effect of speaker group, F(1,36) = 111.598, p < 0.001, r = 0.689; and no significant interaction effect between speaker group and word familiarity, F(1,36) = 1.686, p = 0.202, r = 0.212. Follow-up pairwise comparisons with Bonferroni adjustments indicated that children produced less familiar words (M = 48%, SD = 18.18%) with significantly lower accuracy than words with high familiarity (M = 55%, SD = 7.88%), t(18) = −2.063, p = 0.02, d = −0.534. No significant difference in tone accuracy between familiar and less familiar words was found for adults, p = 0.547. To eliminate the confounding factor of word familiarity, 18 highly familiar words (three for each tone), which were produced by at least 90% (M = 94%, range = 90–100%) of 30-month-old children in Hong Kong (Tardif et al., 2009), were selected for subsequent analyses (**Table 2**).

#### TABLE 4 | Confusion matrices of the tones produced by adults and children.

*Percentages in square brackets [ ] indicate that more than 10% of the target tone were judged as another tone.*

*Shaded cells indicate correctly perceived tone productions.*

### Adults' and Children's Tone Production Accuracy in Familiar Words

Adults' tone productions on the 18 highly familiar words were perceived by the judges with ceiling accuracy for HL (T1), HR (T2), LF (T4), and LR (T5) (range = 93–99%) and with lower accuracy for ML (T3) (M = 79%, SD = 13.60%) and LL (T6) (M = 67%, SD = 14.38%). On the other hand, all tones produced by children were perceived with much lower accuracy and larger variability: HL (T1) (M = 59%, SD = 25.10%), HR (T2) (M = 47%, SD = 27.12%), ML (T3) (M = 46%, SD = 22.43%), LF (T4) (M = 63%, SD = 26.23%), LR (T5) (M = 74%, SD = 21.00%), and LL (T6) (M = 38%, SD = 12.38%) (**Figure 3**, **Table 4**).

To determine whether children's tone production accuracy was lower than that of adults, a two-way mixed ANOVA with speaker group as the between-subject factor and tones as the within-subject factor was performed. The results revealed a significant main effect of speaker group, F(1,36) = 205.75, p < 0.001, r = 0.922, representing a large effect size; a significant main effect of tones, F(3.871, 139.361) = 17.326, p < 0.001, r = 0.570; and no significant interaction effect between speaker group and tones, F(3.871, 139.361) = 1.848, p = 0.125, r = 0.221. Pairwise comparisons with Bonferroni adjustments between adults' and children's production accuracy for each tone showed that the perceived accuracy of children's productions on each of the six tones was significantly lower than that of adult productions, t(36) = −4.593 to −7.054, all p < 0.001, r = 0.597 to 0.744. Given that mothers' tone production accuracy reached ceiling for some tones, a Mann-Whitney test was performed and gave the same results, U = 17.5 to 58.0, z = −3.990 to −4.381, all p < 0.001, r = −0.647 to −0.784.
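The r values reported for the ANOVA effects in this paragraph appear consistent with taking the square root of partial eta squared, computed from the F ratio and its degrees of freedom. The text does not state this, so the relationship is inferred from the reported numbers and should be read as an assumption; the function names are hypothetical.

```python
import math

def partial_eta_squared(f_ratio, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return f_ratio * df_effect / (f_ratio * df_effect + df_error)

def effect_size_r(f_ratio, df_effect, df_error):
    """Effect size r taken as the square root of partial eta squared (assumed)."""
    return math.sqrt(partial_eta_squared(f_ratio, df_effect, df_error))

# F ratios and degrees of freedom as reported for the speaker group x tone ANOVA.
r_group = effect_size_r(205.75, 1, 36)               # reported r = 0.922
r_tones = effect_size_r(17.326, 3.871, 139.361)      # reported r = 0.570
r_interaction = effect_size_r(1.848, 3.871, 139.361)  # reported r = 0.221
```

Each computed value agrees with the reported one to within 0.001, which is within the rounding of the reported F ratios.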

#### Order of Tone Production Accuracy in Familiar Words in Adults and Children

Pairwise comparisons of adults' tone production accuracy among the six tones were conducted to examine whether adults' tone production accuracy differed across the six tones. The results with Bonferroni correction showed that LL (T6) (M = 67%, SE = 3.08%) was produced with significantly lower accuracy than HL (T1), HR (T2), LF (T4), and LR (T5), t(18) = −6.769 to −9.335, all ps < 0.001, r = 0.707 to 0.830, while ML (T3) (M = 78%, SE = 4.24%) was produced with significantly lower accuracy than LR (T5), t(18) = −7.270, p = 0.001, r = 0.732, showing that adults did not produce LL (T6) or ML (T3) as accurately as the other tones.

As for children's productions, the order of mean accuracy of the six tones, arranged from the highest to the lowest accuracy, was LR (T5), LF (T4), HL (T1), HR (T2), ML (T3), and LL (T6). Pairwise comparisons on the perceived accuracy of children's tones revealed that the perceived accuracy for LL (T6) (M = 38%, SD = 12.38%) was significantly lower than that of HL (T1) (M = 59%, SD = 25.10%), LF (T4) (M = 63%, SD = 26.23%), and LR (T5) (M = 74%, SD = 21.00%), t(18) = −2.672 to −8.191, ps ranging from <0.001 to 0.019, r = 0.463 to 0.727. The perceived accuracy for HR (T2) (M = 47%, SD = 27.12%) and ML (T3) (M = 46%, SD = 22.43%) was significantly lower than that of LR (T5) (M = 74%, SD = 21.00%), t(18) = −2.965 to −4.263, ps ranging from <0.001 to 0.004, r = 0.487 to 0.547, suggesting that children produced LR (T5), LF (T4), and HL (T1) better than HR (T2), ML (T3), and LL (T6).

### Error Patterns in Adults' and Children's Tone Productions

**Table 4** shows the confusion matrices of adults' and children's tone productions. The shaded cells indicate judges' correct identification of the target tones. Percentages in square brackets represent error patterns that occurred more than 10% of the time. For adults, the major error pattern was the confusion between ML (T3) and LL (T6). In comparison, children demonstrated more diverse confusion patterns. Children tended to produce HL (T1) as ML (T3); HR (T2) as LR (T5); ML (T3) as HL (T1) or LL (T6); LF (T4) as ML (T3) or LL (T6); LR (T5) as HR (T2); LL (T6) as HL (T1), ML (T3) or LF (T4).
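The way a confusion matrix and its major error patterns are derived from individual judgments can be sketched as follows. The judgments are hypothetical and the function names are invented for illustration; only the 10% threshold comes from the text.

```python
TONES = ["T1", "T2", "T3", "T4", "T5", "T6"]

def confusion_percentages(judgments):
    """Row-normalized confusion matrix (%) from (target, judged) pairs."""
    counts = {target: {judged: 0 for judged in TONES} for target in TONES}
    for target, judged in judgments:
        counts[target][judged] += 1
    matrix = {}
    for target in TONES:
        total = sum(counts[target].values())
        matrix[target] = {
            judged: (100.0 * n / total if total else 0.0)
            for judged, n in counts[target].items()
        }
    return matrix

def major_error_patterns(matrix, threshold=10.0):
    """(target, judged) pairs where a wrong tone exceeds the threshold (%)."""
    return [
        (target, judged)
        for target in TONES
        for judged in TONES
        if target != judged and matrix[target][judged] > threshold
    ]

# Hypothetical judgments, not the study's data.
judgments = (
    [("T3", "T3")] * 7 + [("T3", "T6")] * 3      # 30% of T3 heard as T6
    + [("T1", "T1")] * 19 + [("T1", "T3")] * 1   # 5% of T1 heard as T3
)
matrix = confusion_percentages(judgments)
patterns = major_error_patterns(matrix)
```

With these inputs only the T3-as-T6 confusion exceeds the threshold, so it is the only pattern that would appear in square brackets.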

### Relationship between Children's Tone Production and Tone Perception Accuracy

To examine the relationship between children's tone production and perception accuracy, first, a two-way repeated measures ANOVA was used to determine whether children performed similarly in tone perception and production. There was a main effect of testing mode (i.e., perception vs. production), F(1, 18) = 135.5, p < 0.001, ηp² = 0.883, indicating that regardless of tone type, children perceived the tones better than they produced them. There was also a significant main effect of tones, F(5, 90) = 3.960, p = 0.003, ηp² = 0.180, and a significant interaction effect between modes and tones, F(5, 90) = 6.907, p < 0.001, ηp² = 0.277. Pairwise comparisons between children's perception and production accuracy for each tone with Bonferroni adjustments revealed that children's tone perception accuracy was significantly better than their tone production accuracy for all tones (ps ranging from <0.001 to 0.029, r = 0.835–0.966), except for LR (T5), p = 0.566, r = 0.335. Second, Pearson product-moment correlation was used to examine whether there was any predictive relationship between children's perception and production accuracy. The results revealed no significant correlation between children's overall tone production accuracy and their overall tone perception accuracy based on all stimuli (p = 0.344, r² = 0.053) or the 12 stimuli presented in both the production and perception tasks (p = 0.419, r² = 0.039).
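The correlation step can be illustrated with a short sketch of the Pearson coefficient and the proportion of shared variance (r²). The scores are hypothetical and the function name is invented for this example; significance testing is omitted.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical per-child accuracy (%); not the study's data.
perception = [72, 85, 90, 64, 78, 81]
production = [55, 48, 61, 52, 44, 58]
r = pearson_r(perception, production)
shared_variance = r ** 2  # proportion of variance in one score explained by the other
```

An r² as small as those reported here (0.053 and 0.039) means that a child's perception score explains only about 4–5% of the variance in that child's production score.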

### Acoustic Properties of the Produced Tones

#### Accuracy Groups

To compare the acoustic characteristics of children's correct and incorrect productions, the tone productions were categorized into three accuracy groups based on the judgment results. Productions correctly judged by 80% or more of the judges (i.e., 4 or 5 judges) were considered as "correct." There were a total of 142 child correct (CC) productions and 278 adult correct (AC) productions. The "incorrect group" consisted of productions correctly judged by 0–40% of the judges (i.e., 0, 1, or 2 of the judges). There were 148 child incorrect (CI) productions and 25 adult incorrect productions. Productions correctly judged by 60% of the judges (n = 49 for children's productions, and n = 38 for adults' productions) and the incorrect productions of adults (n = 25) were excluded from further analysis. Thus, in the following analyses only AC, CC, and CI productions were compared.
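The grouping rule above can be written as a small function. The function name is hypothetical, and integer comparisons are used only to avoid floating-point edge cases at the 80% and 40% boundaries.

```python
def accuracy_group(n_correct, n_judges=5):
    """Classify one production by how many judges identified the target tone."""
    if not 0 <= n_correct <= n_judges:
        raise ValueError("n_correct must be between 0 and n_judges")
    if 5 * n_correct >= 4 * n_judges:  # at least 80% of judges
        return "correct"
    if 5 * n_correct <= 2 * n_judges:  # at most 40% of judges
        return "incorrect"
    return "excluded"                  # 60% with five judges

groups = [accuracy_group(k) for k in range(6)]
```

With five judges, productions identified by 0, 1, or 2 judges fall in the incorrect group, those identified by 3 judges are excluded, and those identified by 4 or 5 judges fall in the correct group.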

### Pitch Contours of Adults' Correct and Children's Correct and Incorrect Productions

Appendix 1 in Supplementary Material shows the time-normalized average pitch contours of the correct and incorrect tones produced by each child and the correct productions of their mothers. Correct productions are in blue while incorrect productions are in pink. Children's productions are denoted by solid lines while mothers' productions are denoted by dotted lines.

By visual inspection, the pitch contours of adults' correct productions mostly followed the expected pitch heights and pitch shapes of the target tones; that is, HL (T1), HR (T2), LF (T4), and LR (T5) had high and level, high and rising, low and falling, and low and rising contours, respectively. However, the contours of ML (T3) and LL (T6) did not appear to be level but had a slightly falling slope.

The shapes of the pitch contours of children's correct tone productions were generally similar to those of AC productions though some deviations were observed. Many of the pitch contours of children's incorrect productions did not follow the expected shapes, and showed many more variations among speakers than adults' and children's correct productions. In general, the pitch contours of children's incorrect HL (T1) productions were not as flat or as high as those in children's correct productions (Appendix 1 in Supplementary Material). The rising slopes of children's incorrect HR (T2) productions were not as steep as those in the CC productions. This could explain why some of children's incorrect HR (T2) productions were judged as LR (T5). In general, children's incorrect ML (T3) and LL (T6) productions fell more sharply than adults' productions. On the other hand, children's incorrect LF (T4) productions did not fall as steeply as adults' correct productions.

**Figure 1** shows the average pitch contours of the six tones by each accuracy group in the upper panels and the pitch contours, adjusted for individual differences in the vocal pitch of the speaker (i.e., the measures of pitch heights, which were computed by subtracting the mean pitch of the speaker from the measured pitch), in the lower panels. As indicated in the figure, the pitch height measures appeared to successfully normalize the intrinsic pitch of speakers of different age groups. The acoustic analyses below provide further evidence of this. Appendix 2 in Supplementary Material shows the average pitch contours of the six tones of the three accuracy groups. Note that due to the large variations in the pitch contours in children's incorrect productions, the average plot of the incorrect productions may not be a good representation of the pitch shapes and levels of individual incorrect productions.

### Acoustic Similarities and Differences between Adults' Correct Tones and Children's Correct and Incorrect Tone Productions

Statistical analyses were performed to compare the acoustic parameters in adults' correct productions, children's correct productions, and children's incorrect productions. Because children's tone error patterns varied substantially (**Table 4B**), only the 10 major error patterns (i.e., error patterns that occurred in more than 10% of children's productions) were analyzed. Few children contributed to both correct and incorrect productions of the same tones, making it impossible to perform a within-subject analysis. To examine whether children's incorrect productions were acoustically different from children's and adults' correct productions, and whether the acoustic characteristics of children's incorrect productions justified the incorrect ratings of the judges (e.g., if HL (T1) was perceived as ML (T3), whether the incorrect HL (T1) productions had a lower mean pitch than the correct HL (T1) productions), children's correct and incorrect productions were treated as the between-group variable.

Two two-way mixed ANOVAs, using the production patterns as the between-subject variable and the acoustic parameters (six measures of pitch heights or pitch slope) as the within-subject factor, were conducted for each tone to compare the acoustic differences in adults' correct productions, children's correct productions, and children's incorrect productions. For example, to examine the pitch levels of children's LF (T4) productions, adults' correct LF (T4) productions, children's correct LF (T4) productions, and children's LF (T4) productions that were misidentified as ML (T3) by more than 50% of the judges (i.e., 3 or more of the 5 judges) were selected for analyses. These three patterns were treated as the between-subject variable. The six acoustic parameters of pitch height (i.e., height of initial pitch, height of final pitch, height of minimum pitch, height of maximum pitch, and height of mean pitch) of the productions were used as the within-subject factor. The results for each tone after correction for multiple comparisons are presented in **Table 5**.

As **Table 5** shows, there was no significant difference in the initial pitch height of any of the tones in AC, CC, and CI productions, except for HL (T1), in which children's productions had lower pitch height than adults' productions. The findings suggest that the pitch height measures effectively normalized the initial pitch levels of adult and child productions.

#### **Acoustic characteristics of children's correct productions**

Not all children's tone productions that were correctly perceived by most of the judges were acoustically adult-like. Though there was no significant difference in the seven acoustic measures between adults' and children's correct ML (T3), LF (T4), LR (T5), and LL (T6) productions, children's correct HL (T1) productions were produced with significantly lower pitch than adults' correct HL (T1) productions, whereas children's correct HR (T2) productions did not rise as sharply as adults' HR (T2) productions (**Table 5**).

#### **Acoustic characteristics of children's incorrect productions**

Children's incorrect tone productions were acoustically different from adults' and children's correct tone productions (**Table 5**). Children's incorrect HL (T1) productions that were perceived as ML (T3) had lower minimum, maximum, final and mean pitch than children's and adults' correct HL (T1) productions, justifying the judges' (mis-)categorization of the productions as ML (T3).

The pitch contours of children's incorrect HR (T2) productions that were perceived by most judges as LR (T5) rose less steeply than children's and adults' correct HR (T2) productions and did not reach a final and maximum pitch as high as adults' correct productions, justifying the judges' categorization of these productions as LR (T5).

Children's incorrect ML (T3) productions that were perceived by most of the judges as HL (T1) had higher final, minimum, maximum, and mean pitch levels than the correct ML (T3) productions by adults and children, while children's incorrect ML (T3) productions that were perceived as LL (T6) had final pitch lower than the correct ML (T3) productions, matching the perceptual judgment of the judges.

Children's incorrect LF (T4) productions that were perceived as ML (T3) productions did not fall as sharply as children's and adults' correct LF (T4) productions and had final, minimum, and mean pitch higher than the correct productions. Children's LF (T4) productions that were misperceived as LL (T6) productions did not reach a final and minimum pitch as low as children's correct LF (T4) productions.

Children's incorrect LR (T5) productions that were perceived as HR (T2) had a lower minimum pitch, reached a higher final pitch and had pitch contours that rose more sharply than the correct LR (T5) productions.

Children's incorrect LL (T6) productions that were misperceived as HL (T1) had maximum and mean pitch that were significantly higher than those of the correct productions. The final and minimum pitch heights of children's incorrect LL (T6) productions perceived as ML (T3) were higher than those in the correct productions of LL (T6), and the final and minimum pitch heights of children's incorrect LL (T6) productions perceived as LF (T4) were lower than those in the correct productions, though the differences did not reach significance, likely due to insufficient power for the multiple comparisons.

Overall, the results showed that the acoustic characteristics of the correct and incorrect tone productions justified the judges' perceptual judgments of the tones.

## DISCUSSION

This study examined 3-year-old children's Cantonese tone perception and production accuracy to test whether lexical tones are acquired rapidly before 3 years of age, as most previous literature has suggested, and whether children's tone production ability is acquired ahead of their tone perception ability as expected. The results showed that, contrary to the view that children master lexical tones earlier than segmental phonemes, children could not perceive or produce Cantonese tones with adult-like proficiency by the age of 3 years, and their incorrect tone productions were acoustically different from correct productions. Contrary to previous findings, we observed higher tone accuracy in perception than in production.

Our first research question was how well children perceive Cantonese tones. Consistent with the findings in the Lee et al. (2015) study, our results show that 3-year-old Cantonese-speaking children are still developing their tone perceptual skills and do not identify any of the six tones with adult-like accuracy. Perception accuracy of the six tones in descending order was HL (T1), LL (T6), ML (T3), LF (T4), HR (T2), LR (T5). However, only HL (T1) was perceived significantly better than LR (T5). The findings suggest that HL (T1) is the easiest while LR (T5) is the most difficult tone for 3-year-old children to identify. Also similar to the findings reported by Lee et al. (2015), in this study children confused HR (T2) with LR (T5). However, in contrast to Lee et al. (2015), we did not find substantial confusion between ML (T3) and LL (T6), or LF (T4) and LL (T6), in children's tone identification. The present results therefore extend understanding of tone acquisition in Cantonese.

#### TABLE 5 | Acoustic similarities and differences of correct and incorrect tones produced by children and adults.

*AC, CC, and CI represent adult-correct, child-correct, and child-incorrect productions, respectively. ">" indicates "is higher than" or "rises more sharply than"; "<" indicates "is lower than" or "does not rise as steeply as"; and "a<" indicates "falls less steeply than." Shaded cells highlight significant differences.*

\**Represents 0.05 significance level.* \*\**Represents 0.01 significance level.*

Our second research question was how well children produce the six Cantonese tones. Our findings showed that children's tone production accuracy was affected by word familiarity, even though most of the less familiar words tested were also found in young children's vocabulary (e.g., 樹 tree, tongue, 頸 neck, 梨 pear) (**Table 2**). This implies that future studies examining children's tone perception and production need to control for word familiarity.

Consistent with previous findings on Cantonese tones produced by adults (Ciocca and Lui, 2003; Barry and Blamey, 2004; Lee et al., 2015; Wong and Ng, 2017), the results showed that Cantonese tones produced by adults were not error free. Among the six tones, adults produced HL (T1), HR (T2), LF (T4), and LR (T5) with ceiling accuracy. However, considerable confusion was found between their production of ML (T3) and LL (T6), resulting in significantly lower accuracy in these tones. Lee et al. (2015) also reported less-than-perfect identification of adult productions of ML (T3)–LL (T6) and HR (T2)–LR (T5) using unfiltered stimuli. Barry and Blamey (2004) found the greatest overlap between the tone ellipses of ML (T3), LR (T5), and LL (T6) in adult productions, suggesting little differentiation among these tones even in adults.

The present findings qualify previous reports that children can produce Cantonese tones in multisyllabic words and connected speech before age 2.6 (Tse, 1978; So and Dodd, 1995; To et al., 2013): 3-year-old children produced lexical tones with low accuracy rates (**Table 4**) and did not produce any of the six tones with adult-like accuracy. The discrepancies in findings can be explained by methodological differences. The present study controlled lexical expectation in tone judgment by asking judges to categorize tones in filtered productions. Therefore, tone ratings were based exclusively on pitch information without linguistic support or contextual information. Previous studies showing early mastery of tone production did not control for potential lexical expectation effects, which may give rise to perceptual illusions (Oller and Eilers, 1975) and may lead to overestimations of children's tone production ability. The results are consistent with Barry and Blamey (2004), who controlled transcriber biases by asking an English speaker to rate tone productions based on the perceived pitch contours and found that children as old as 6 years of age had not mastered the production of tones, suggesting that transcriber lexical expectation may have confounded the findings of early studies. The present study also ensured much tighter control on the context of tone production by examining only spontaneously produced monosyllabic tones. In other studies, imitated responses were used (e.g., So and Dodd, 1995; To et al., 2013), which may have inflated the scores of children. The lower accuracy rates for children in this study compared with Barry and Blamey (2004) may be explained by the younger age of participants (3.0 in this study vs. 3.8 to 6.0 in Barry and Blamey, 2004).

In terms of the relative production difficulties of the six tones for the children, the order of accuracy of the six tones in descending order was LR (T5), LF (T4), HL (T1), HR (T2), ML (T3), and LL (T6), with LR (T5), LF (T4), and HL (T1) easier for children to produce than HR (T2), ML (T3), and LL (T6). These findings are consistent with reports from Tse (1978) and Barry and Blamey (2004) that HL (T1) and LF (T4) are among the easiest tones for children to produce and Tse (1978) and So and Dodd (1995) that LL (T6) is the most difficult. On the other hand, the finding in this study that children produce LR (T5) with the highest accuracy contradicted the findings in previous studies that LR (T5) is one of the most difficult tones for children to master (Tse, 1978; So and Dodd, 1995; Barry and Blamey, 2004). It is not easy to speculate on factors contributing to these differences in studies due to the differences in the methodology used. Future studies using similar methods to those used in the present study are recommended to confirm the finding.

With respect to the error patterns in children's productions, as expected, children produced more tone errors and had more diverse error patterns in comparison to adults (**Table 4**). Children mostly confused tones with similar pitch contour shapes (i.e., between the two rising tones, and amongst the three level tones and the low-falling tone). Little confusion was found between tones that have very different tone shapes (e.g., rising tones vs. level tones and rising tones vs. falling tones). These error patterns were similar to but slightly more diverse than the error patterns reported in Barry and Blamey (2004), likely due to the age disparities in the children in the two studies.

Turning to our third question, the examination of the acoustic characteristics of correct and incorrect tone productions showed that correct tones had acoustic characteristics similar to those produced by adults, although not all acoustic properties in children's correct productions were adult-like. The pitch levels and pitch shapes of correct ML (T3), LF (T4), LR (T5), and LL (T6) productions were adult-like. However, HL (T1) tones were not produced with pitch levels as high as adult productions and HR (T2) tones were produced with lower rising slopes. Children's incorrect productions were acoustically different from those of the same tones produced by adults, as expected. The acoustic characteristics of production errors matched the expected acoustic characteristics of the tones selected in error by judges, providing acoustic justification for the judges' classification of the tones in filtered speech.

The final research question addressed the relationship between children's perception and production ability. Given that tone identification ability may be affected by the demand of the tasks (e.g., the number of tone minimal contrasts in the alternatives), tone identification scores should be interpreted with caution. Nevertheless, the results showed that 3-year-old children perceived the six tones significantly better than they could produce them. There was little relationship between tone perception and production accuracy. For example, the one tone that children perceived with highest accuracy [i.e., HL (T1)] was not produced to criterion, while the tone that children perceived with the lowest accuracy [i.e., LR (T5)] was not the tone with the worst production. These findings suggest that accurate tone perception is not sufficient for accurate tone production. Other factors may play a role in determining children's tone production accuracy.

Several factors may account for Cantonese tone production errors. Given acoustic proximity between some Cantonese tones, it is possible that children have not mastered accurate categorical perception of tones and, therefore, have difficulty producing correct tones. Previous work on tone perception with Cantonese-speaking children shows that children do not correctly identify all tones until after age 6.0 (Lee et al., 2015) and 10.0 (Ching, 1984; Ciocca and Lui, 2003). Moreover, correlation analysis revealed little association between tone perception and production; that is, the order of accuracy of the six tones in tone production did not follow the same pattern as in tone perception. Children who scored 100% accuracy in perception of HL (T1) produced HL (T1) with accuracy rates ranging from 20 to 87%. Taken together, the findings suggest that good perception does not guarantee accurate production. Future studies using the same set of familiar words to test tone perception and production in the same group of children will be needed to examine the relationship between tone perception and production.
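As a rough illustration of how the two accuracy orders diverge, the sketch below computes a Spearman rank coefficient from the ordinal positions of the six tones in perception and in production. It is not the correlation analysis reported by the authors, whose details are not given here. It assumes that the perception list at the start of this discussion is in descending order of accuracy (as the surrounding text suggests), and it uses only rank positions, not accuracy scores. With six items, such a coefficient is descriptive at best.

```python
# Accuracy ranks (1 = most accurate) read off the orders given in the text.
# Assumption: both lists are in descending order of accuracy.
perception_order = ["T1", "T6", "T3", "T4", "T2", "T5"]
production_order = ["T5", "T4", "T1", "T2", "T3", "T6"]


def ranks(order):
    """Map each tone label to its 1-based position in the list."""
    return {tone: position for position, tone in enumerate(order, start=1)}


def spearman_rho(order_a, order_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    rank_a, rank_b = ranks(order_a), ranks(order_b)
    n = len(order_a)
    sum_d_squared = sum((rank_a[t] - rank_b[t]) ** 2 for t in rank_a)
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


rho = spearman_rho(perception_order, production_order)
print(f"Spearman rho between perception and production ranks: {rho:.2f}")
```

The coefficient summarizes tone-level ranks only; an analysis of the relationship within individual children would need the per-child scores.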

Physiological limitations in speech motor control may also account for late acquisition of tones. To produce adequate tonal differentiation among acoustically similar tones, fine-tuned speech motor control is required. However, given that laryngeal structures such as the vocal folds of young children are not fully developed until adolescence (Crelin, 1987; Kent and Vorperian, 1995) and speech control is immature (Smith, 2006; Smith et al., 2006), it is likely that children are still acquiring the skills to regulate pitch differences among tone categories. Wong (2013) provided a physiological explanation of children's tone development and proposed that the order of accuracy of the four Mandarin tones followed the degree of articulatory complexity required to produce the tones. The present results are compatible with that account since the acoustic results from Cantonese are similar to those from Mandarin-speaking children as in Wong (2012a, 2013). Three-year-old Cantonese-speaking children produced the high level tone with pitch contours at a lower level than adults, and the high rising tone with significantly reduced slopes and significantly lower pitch at the offset of the tone compared with the correct productions of adults. These acoustic similarities in the production of similar pitch contours in 3-year-old children across Chinese languages with two different tone systems support a physiological constraint on tone production during development. However, future studies testing children's speech motor control when producing various pitch heights and patterns are needed to provide direct evidence to confirm this observation.

Inconsistent tonal input in the linguistic environment could be another contributing factor to slow acquisition of Cantonese tone production in the present study. Several studies have reported evidence of a tone-merging process in recent years in Hong Kong, affecting three tone pairs: HR (T2)–LR (T5), ML (T3)–LL (T6), and LF (T4)–LL (T6) (Mok and Wong, 2010; Mok et al., 2013). These patterns of change in tone within modern Hong Kong overlap with the confusion patterns found in children's tone productions in the present study. Therefore, the changing tonal system may influence the accuracy of tone production in some speakers and, thus, the auditory input to young children.

In sum, the results show that Cantonese-speaking children do not master the perception or production of monosyllabic Cantonese tones by the age of 3 years, indicating that the acquisition of tone is a more protracted process than previous studies have suggested. None of the six tones were perceived or produced by Hong Kong children with adult-like accuracy. Children perceived tones with comparable accuracy, except that HL (T1) was perceived significantly more accurately than LR (T5). Confusion between HR (T2) and LR (T5) in perception was noted. Tone production was uniformly less accurate than tone perception in the same children, with HR (T2), ML (T3), and LL (T6) being produced with lower accuracy than LR (T5), LF (T4), and HL (T1). The findings therefore challenge the prevailing view in phonological development that suprasegmental features are acquired rapidly and early in young children, and earlier than segmental features. In our view, these results call for a review of established developmental milestones for phonological development in Cantonese-speaking children. This has implications for theories of phonological development and for the assessment of delays in phonological development.

### ETHICS STATEMENT

The study was approved by the Human Research Ethics Committee of the University of Hong Kong. Mothers provided written informed consent for the participation of themselves and their children.

### AUTHOR CONTRIBUTIONS

PW was the thesis supervisor of WF and EC. PW designed the study. WF and EC collected the data. PW, WF, and EC performed data analysis, and drafted the paper. PW prepared the manuscript. All authors approved the final manuscript.

### FUNDING

This work was partly supported by the Seed Funding Programme for Basic Research from The University of Hong Kong to the first author (Grant No: 201611159068).

### ACKNOWLEDGMENTS

The authors thank Brendan Weekes for editing the manuscript, Bradley McPherson for proofreading an earlier version of the manuscript, and the children and their mothers for participating in the study.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2017.01450/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Wong, Fu and Cheung. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# How Native Prosody Affects Pitch Processing during Word Learning in Limburgian and Dutch Toddlers and Adults

#### Stefanie Ramachers<sup>1</sup> \*, Susanne Brouwer<sup>2</sup> and Paula Fikkert<sup>2</sup>

<sup>1</sup> Department of German Language and Culture, Centre for Language Studies, Radboud University, Nijmegen, Netherlands, <sup>2</sup> Department of Dutch Language and Culture, Centre for Language Studies, Radboud University, Nijmegen, Netherlands

In this study, Limburgian and Dutch 2.5- to 4-year-olds and adults took part in a word learning experiment. Following the procedure employed by Quam and Swingley (2010) and Singh et al. (2014), participants learned two novel word-object mappings. After training, word recognition was tested in correct pronunciation (CP) trials and mispronunciation (MP) trials featuring a pitch change. Since Limburgian is considered a restricted tone language, we expected that the pitch change would hinder word recognition in Limburgian, but not in non-tonal Dutch listeners. Contrary to our expectations, both Limburgian and Dutch children appeared to be sensitive to pitch changes in newly learned words, indicated by a significant decrease in target fixation in MP trials compared to CP trials. Limburgian and Dutch adults showed very strong naming effects in both trial types. The results are discussed against the background of the influence of the native prosodic system.

Keywords: lexical tone, word learning, word recognition, preferential looking, bidialectalism, Limburgian, mispronunciations

### INTRODUCTION

Acquiring the sound structure of a language entails finding out which phonetic contrasts are meaningful in the native language (L1) and storing them as part of a word's lexical representation. Children need to learn to assign appropriate interpretations to many different sorts of phonetic variation, and separate variation that is lexically meaningful (i.e., phonemic variation) from variation that is not (e.g., speaker variation). Many studies have looked into the developmental perception of speech sound contrasts in the first year of life and into the way they are processed during word learning and recognition at later ages (e.g., Jusczyk, 1997; Stager and Werker, 1997; Swingley and Aslin, 2000; Kuhl, 2004; White and Morgan, 2008). This research has focused mainly on segmental contrasts, whereas approximately 60–70% of the world's languages employ pitch differences to distinguish words in addition to vocalic and consonantal contrasts (Yip, 2002). The aim of the present study is to add to the field of lexical tone acquisition by investigating the role of pitch contrasts during novel word learning. This is examined in child and adult speakers of Limburgian dialects of Dutch. Limburgian<sup>1</sup> is a restricted tone language yielding an intriguing interaction between lexical and intonational tones. Limburgian participants' performance in a word learning experiment is compared to that of a control group of monolingual child and adult speakers of Dutch.

<sup>1</sup>Note that Limburgian is an umbrella term for many different dialects. Not all of these dialects have lexical tone, which will be discussed in detail later on.

#### Edited by:

Leher Singh, National University of Singapore, Singapore

#### Reviewed by:

Marina Kalashnikova, Western Sydney University, Australia Carolyn Quam, Portland State University, United States

> \*Correspondence: Stefanie Ramachers stmr.ramachers@gmail.com

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 24 June 2017 Accepted: 07 September 2017 Published: 22 September 2017

#### Citation:

Ramachers S, Brouwer S and Fikkert P (2017) How Native Prosody Affects Pitch Processing during Word Learning in Limburgian and Dutch Toddlers and Adults. Front. Psychol. 8:1652. doi: 10.3389/fpsyg.2017.01652

Pitch variation is meaningful in all languages of the world (Yip, 2002; Gussenhoven, 2004; Singh and Fu, 2016). Tone languages such as Mandarin Chinese use pitch to distinguish words, similar to what phonemes do at the segmental level. Some tone languages make very extensive use of lexical pitch. Mandarin Chinese specifies every mora for tone, ignoring toneless neutral syllables (Duanmu, 2000). Other tone languages are more restricted in their use of lexical pitch. These languages, for example Tokyo Japanese, have been referred to as either PITCH-ACCENT LANGUAGES or RESTRICTED TONE LANGUAGES (Voorhoeve, 1973; Hyman, 2009). Whether there is a clear-cut distinction between tone languages and restricted tone languages is heavily debated. What they have in common is that pitch, be it to a greater or lesser extent, is necessary for determining the meaning of a word. Following Hyman's (2001, 2009) definition, we take the term 'tone language' to refer to languages that use pitch to distinguish between words.

Importantly, in non-tone languages like Dutch and English, pitch is not used to distinguish between words – except in a few very rare minimal pairs that differ in word stress (e.g., Dutch VOORkomen 'appear' vs. voorKOMEN 'prevent'), in which case pitch is only one of several correlated cues to stress. The fact that pitch is not lexically distinctive in non-tone languages might prevent speakers of these languages from distinguishing monosyllables that differ in pitch only (Schaefer and Darcy, 2014) and from encoding pitch information when building novel lexical representations (Braun et al., 2014).

Despite the abovementioned functional differences, non-tone language listeners often show sensitivity to non-native lexical tones (e.g., Hallé et al., 2004; So and Best, 2010, 2014; Liu and Kager, 2014; Ramachers et al., 2017). This sensitivity is mostly shown in perceptual tasks without lexical involvement (i.e., discrimination tasks; e.g., Broselow et al., 1987; So and Best, 2008, 2010, 2014; Liu and Kager, 2014; Schaefer and Darcy, 2014; Ramachers et al., 2017). Several factors have been put forward recently to account for these findings, the most important one being the role of prosody in the L1.

The PERCEPTUAL ASSIMILATION MODEL FOR SUPRASEGMENTALS (PAM-S; So and Best, 2014) states that non-native pitch contrasts tend to be perceived according to their degree of similarity to native pitch patterns. Indeed, a number of studies on the perception of non-native pitch patterns have shown that prosodic experience from listeners' L1 guides their perception of non-native pitch patterns (e.g., Broselow et al., 1987; So and Best, 2008, 2010, 2014). For example, English listeners presumably discriminate Mandarin tone 4 (falling) due to assimilation to their statement intonation category (e.g., Broselow et al., 1987; So and Best, 2008), and Dutch listeners in Braun and Johnson (2011) probably perceived utterance-final Mandarin tone 2 (rising) as Dutch question intonation. Following these observations, the question is thus no longer whether non-tone language listeners discriminate lexical tones, but whether they interpret them as lexically relevant.

When acquiring a lexicon, tone language learners need to learn to ascribe lexical relevance to pitch changes and encode tone lexically. Conversely, non-tone language learners have to learn to disregard pitch changes that occur within words, despite the fact that they might still discriminate these pitch changes at lower levels of processing (e.g., in a purely perceptual task).

### Integration of Pitch into Lexical Representations

Recent work suggests that child and adult speakers of tone languages behave differently from non-tone language speakers in exploiting contrastive pitch contours when learning words. Tone language speakers attend to pitch information and exploit it during lexical access, whereas non-tone language speakers do not, or at least do so to a lesser extent (e.g., Quam and Swingley, 2010; Braun et al., 2014; Singh et al., 2014; Hay et al., 2015). These previous studies primarily discussed the lexical integration of pitch by non-tone language listeners. Few of them looked at the interpretation of (non-)native pitch by tone language listeners, and those that did focused on typically studied tone languages like Mandarin Chinese. However, within the family of tone languages, large differences exist.

First, tone languages differ with respect to the functional load of tone, which depends on the tonal inventory (i.e., the number of tones, and, related to that, their information value), the distributional restrictions of tones (i.e., can they appear on any syllable?), the importance of tones for lexical disambiguation (i.e., how many minimal pairs are there in the language?), and the extent to which f0 is the only cue to the tonal distinction (i.e., do duration or voice quality play a role?) (e.g., Pierrehumbert and Beckman, 1988; Kristoffersen, 2000; Wang et al., 2004; Tong et al., 2008; Wu et al., 2012). The smaller the inventory, the larger the amount of distributional restrictions and the smaller the number of tonal minimal pairs, the more restricted a tone system is (Voorhoeve, 1973). The functional load of lexical pitch patterns in the L1 has been assumed to influence sensitivity to word-level pitch in speakers of these languages (e.g., Wang et al., 2004; Wu et al., 2012; Schaefer and Darcy, 2014; Goss, 2015).

A second difference within the family of tone languages lies in the complexity of their intonation systems. Typically, tone languages do not have complex intonation systems (e.g., Gussenhoven and van der Vliet, 1999) and, as a consequence, the pronunciation of a word with a certain lexical tone is rather stable across different contexts. In Standard Chinese, for example, different intonations only cause changes in pitch height, not in pitch contours (Wu, 2000). However, some more restricted tone systems, like Norwegian, Swedish, and Limburgian, do show complex intonation systems. In these languages, intonation tones interact with lexical tones, causing variation in surface realizations (i.e., contours) of a lexical tone (e.g., Gussenhoven, 2000a; Riad, 2013). It has been suggested that surface variability in the contours of lexical tones can delay the acquisition of lexical tone assignment (Demuth, 1995; Ota, 2003).

In the present study, we investigated lexical encoding of tone in Limburgian. By studying a language with a low functional load for a binary tone contrast embedded in a complex intonation system, this study widens our understanding of the influence of the functional load of tone and tonal surface variability on the acquisition and processing of a lexical tone system. By comparing Limburgians to a control group of non-tonal Dutch peers, we also address the influence that cross-linguistic differences in the functionality of pitch have on pitch processing. Before elaborating on Limburgian, we first review the existing literature that typically studied the lexical integration of pitch in non-tone language speakers and/or in tone languages with a high functional load for tone.

Quam and Swingley (2010) tested recognition of newly learned words carrying a tone in a bimodal preferential looking experiment adopting a mispronunciation paradigm. The idea behind mispronunciation paradigms is that successful detection of form-meaning mismatches requires the prior establishment of novel representations that include the tonal or segmental specification of interest. If the lexical representation of the newly acquired word is impoverished or incomplete with respect to, for example, its tonal specification, word recognition will not be hindered by tonal variability in the input signal.
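The logic of the paradigm can be made concrete with a small numerical sketch. It assumes that word recognition is indexed by the proportion of looking time directed at the target object, and that a mispronunciation effect is the drop in that proportion from CP to MP trials. The looking times below are invented for illustration and are not data from any of the cited studies; the function and variable names are likewise hypothetical.

```python
from statistics import mean

# Hypothetical looking times in milliseconds (invented for illustration):
# (trial type, time on target, time on distractor).
trials = [
    ("CP", 1800, 700),
    ("CP", 1500, 1000),
    ("MP", 1200, 1300),
    ("MP", 1000, 1000),
]


def target_proportion(target_ms, distractor_ms):
    """Share of object-directed looking time spent on the target."""
    return target_ms / (target_ms + distractor_ms)


def mean_proportion(trial_list, trial_type):
    """Mean target proportion over all trials of one type."""
    return mean(
        target_proportion(t, d) for kind, t, d in trial_list if kind == trial_type
    )


cp = mean_proportion(trials, "CP")
mp = mean_proportion(trials, "MP")
mispronunciation_effect = cp - mp  # positive when the MP hinders recognition

print(f"CP: {cp:.2f}  MP: {mp:.2f}  effect: {mispronunciation_effect:.2f}")
```

Under these assumptions, an effect near zero would be the pattern expected when the changed dimension is not part of the stored representation, and a clearly positive effect would be expected when it is.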

In their study, English 30-month-old toddlers and adults were taught a novel pseudo-word as a label for a new toy. Subsequently, the target was either correctly pronounced (CP), i.e., with the trained tone, or mispronounced (MP), i.e., with a change in tone or a change in vowel. Quam and Swingley (2010) showed that both children and adults interpreted the changes in accordance with their native phonology. Word recognition was hindered by a vowel change, but not by a change in pitch. At least by 30 months of age, English children have thus learned to disregard pitch at the level of words.

In a paradigm similar to that of Quam and Swingley (2010), Singh et al. (2014) showed that, at 18 months, mono- and bilingual English learners were equally sensitive to tonal and vowel MPs, but at 24 months they no longer treated pitch as lexically contrastive, in accordance with their native phonology and in line with Quam and Swingley (2010). Mandarin-English bilinguals<sup>2</sup> who were dominant in Mandarin were sensitive to both vowel and tonal MPs at both ages. The authors suggest that, at 18 months, toddlers may over-assign weight to postlexical pitch information due to its high attentional appeal and by virtue of having observed its linguistic significance, either at the post-lexical or at the paralinguistic level.

Similar findings come from a series of experiments by Hay et al. (2015). In an associative word learning task using the two-object switch procedure (Stager and Werker, 1997), 14-month-old but not 17- and 19-month-old learners of English interpreted pitch differences as properties of words. According to Hay et al. (2015, p. 10), between 14 and 17–19 months, children go through a phase of "interpretive narrowing." With growing linguistic experience, they become more specific about what forms of words should be treated as lexically contrastive. Nevertheless, 17- and 19-month-olds continued to be sensitive to the difference between falling and rising pitch contours in a discrimination task that did not involve label-object mappings. To sum up, the studies above show that there is a shift in English children's interpretation of the lexical relevance of pitch patterns in the course of the second year of life.

A study that compared the ability to store lexical tones (in this case Mandarin tones) among adult speakers of languages differing in their lexical and post-lexical use of prosody is reported in Braun et al. (2014). The languages under investigation (German, Japanese, French, and Mandarin) differed with respect to the lexical status of word-level prosody as well as the complexity of the post-lexical pitch system (i.e., the number of utterance-level contrasts). German, a stress language, makes use of word-level prosody. Moreover, it has a relatively rich intonational system. French does not assign word stress to lexical items and would appear to have less pitch variability at the utterance-level. Japanese has word-level prosody in the form of pitch-accents. However, as in French, utterance-level pitch variability is more restricted. Speakers of Mandarin, Japanese, German, and French had previously shown sensitivity to Chinese tones in purely perceptual tasks.

The aim in Braun et al. (2014) was to see if the ability to lexically encode pitch in a word learning paradigm depended on experience with lexical or post-lexical prosody. Participants' recognition of newly learned words was tested in tonal and segmental mismatch conditions. As hypothesized, performance was modulated by the different prosodic structures of the participants' L1. The Mandarin group outperformed all the other groups. More surprisingly, German participants significantly outperformed Japanese and French listeners. Japanese and French listeners did not differ significantly from each other. The authors argue that the number of L1 utterance-level pitch contrasts, rather than the availability of word-level pitch contrasts, is beneficial for building long-term representations of lexical tone. However, German participants might have benefited both from their experience with f0 as a cue to word stress and as a cue to post-lexical intonation. Importantly, the fact that f0 is hardly used to signal lexical distinctiveness in German does not prevent German speakers from perceiving and lexically encoding pitch information.

Much less is known about the lexical integration of pitch by speakers of more restricted tone languages like Limburgian. The next section provides more information on the lexical tone system in Limburgian.

### The Limburgian Dialects of Dutch

The Limburgian dialects of Dutch belong to the Central Franconian dialect continuum, which covers the provinces of Limburg in the Netherlands and Belgium as well as the north of the German Rhineland-Palatinate and the southwest of North Rhine-Westphalia (Gussenhoven, 2000a; Fournier, 2008; see **Figure 1**).

The Dutch province of Limburg has about 1.1 million inhabitants<sup>3</sup>, 75% of whom speak a Limburgian dialect (Driessen, 2006). Limburgian is a regional linguistic variety of Standard Dutch, the official language used in formal and institutional settings. Differences exist at the phonological, morphosyntactic, and lexical level, but still, mutual intelligibility is fairly high (Van Bezooijen and Van den Berg, 1999) due to the existence of many cognates. Probably the most striking difference between Limburgian and Dutch is the fact that many Limburgian dialects have lexical tone.<sup>4</sup> Pitch is used in both languages as a cue to word stress and in post-lexical intonation (e.g., Gussenhoven, 1988; Gussenhoven and van der Vliet, 1999).

<sup>2</sup>From personal communication with the authors, we know that the second language of the Mandarin bilinguals was English.

<sup>3</sup>www.cbs.nl

In this study, the focus is on the dialect of Roermond. The choice to focus on one particular dialect instead of on Limburgian as a whole stems from the fact that Limburgian is not a homogeneous linguistic variety. Limburgian is to be understood as an umbrella term for many different dialects. Comparable to the pitch-accents in different varieties of Japanese, Norwegian, and Swedish (Wetterlin, 2007; Tamaoka et al., 2014), the Limburgian tones may have different phonetic realizations across dialects, be embedded in different intonational systems or may be absent altogether (e.g., Gussenhoven, 2000a; Gussenhoven and Peters, 2008). The choice for the dialect of Roermond is partly motivated by the existence of a series of tone perception and production studies with adult speakers of Roermond Dutch (Fournier et al., 2006; Fournier, 2008; Fournier et al., 2010). Moreover, its vocabulary and (tonal) grammar are well documented (e.g., Kats, 1939, 1985; Gussenhoven, 2000b).

In Roermond Dutch, haas [ha:s] with falling pitch (accent 1) means 'hare,' whereas haas with falling-rising pitch (accent 2) means 'glove.' In a small number of frequent nouns, pitch also serves a grammatical function with accent 1 systematically indicating plurality (see **Figures 2**, **3**). In the Roermond dialect, the primary acoustic cue to the tone contrast is f0.

Lexical tone in Limburgian<sup>5</sup> has a lower functional load than tone in many Chinese dialects. There are few minimal pairs (approximately 80; Fournier, 2008), and there is only a two-way contrast. Gussenhoven and Peters (2008, p. 88) assume that "the word accent contrast (. . .) amounts to a contrast between the absence of lexical tone (Accent 1) and its presence (Accent 2)." Moreover, the contrast can only be realized on syllables with main stress, meaning that an unbound multisyllabic morpheme can only carry one accent. In this respect, Limburgian is comparable to, for example, Japanese (Kubozono, 1993; Tamaoka et al., 2014), Swedish (Gussenhoven, 2004; Riad, 2013), and Norwegian (Kristoffersen, 2000; Wetterlin, 2007; Steien and Van Dommelen, 2016). With respect to the domain of realization of lexical tone, Limburgian is more akin to tone languages such as Mandarin (Burnham et al., 2014), as the pitch contrast is realized within a single syllable.

Apart from the relatively small number of minimal pairs, any primary stressed bimoraic syllable is pronounced either with accent 1 or with accent 2 (Gussenhoven, 2000b). For example, in Roermond Limburgian, boum [blUm] ('tree') carries accent 2, whereas sjaop [Sl:p] ('sheep') carries accent 1. Pronouncing any of these words with the wrong accent would turn them into a non-existing word. Pitch is thus assumed to be part of a word's mental representation.
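The two roles of the accent contrast described above (distinguishing minimal pairs, and being an obligatory part of every stressed word form) can be illustrated with a toy lexicon. The sketch below uses only the Roermond examples cited in the text; the dictionary layout and the function names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy lexicon built only from the Roermond examples cited in the text.
# Keys are (orthographic form, accent); values are English glosses.
lexicon = {
    ("haas", 1): "hare",
    ("haas", 2): "glove",
    ("boum", 2): "tree",
    ("sjaop", 1): "sheep",
}


def tonal_minimal_pairs(lex):
    """Return entries that share a form and differ only in accent."""
    by_form = defaultdict(list)
    for (form, accent), gloss in lex.items():
        by_form[form].append((accent, gloss))
    pairs = []
    for form, entries in by_form.items():
        for first, second in combinations(sorted(entries), 2):
            pairs.append((form, first, second))
    return pairs


def lookup(lex, form, accent):
    """Return the gloss, or None when the form-accent combination is not a word."""
    return lex.get((form, accent))


pairs = tonal_minimal_pairs(lexicon)
print(pairs)                       # only 'haas' forms a tonal minimal pair
print(lookup(lexicon, "boum", 2))  # an existing word
print(lookup(lexicon, "boum", 1))  # wrong accent: no such word
```

In this toy setting, a wrong accent on haas yields a different word, whereas a wrong accent on boum or sjaop yields no word at all.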

By studying Limburgian speakers' sensitivity to pitch changes, we could shed more light onto the lexical representations of accent 1 and accent 2. The FEATURALLY UNDERSPECIFIED LEXICON MODEL (Lahiri and Reetz, 2002) can be used to formulate predictions on this matter. If the lexical representation of a word is incomplete with respect to its tonal specification, tonal features present in the input signal cannot mismatch with an underspecified (i.e., empty) slot in the lexicon. In this case, word recognition cannot be hindered by tonal variability in the input. If it is indeed the case that accent 2 is the underlyingly specified accent, Limburgians would be sensitive to mispronunciations of accent 2 (leading to a mismatch), but not or to a lesser extent to mispronunciations of accent 1 (leading to a no-mismatch).

As in any other language, pitch in Limburgian also serves postlexical functions. Limburgian dialects have complex intonation systems (Gussenhoven and van der Vliet, 1999). As a result, the pitch contours of the accents vary as a function of information status, sentence type, and position in the utterance.

<sup>4</sup>Note that some scholars have questioned whether the Limburgian accents come from lexical tones. They argue that there is no lexical tone in Limburgian, but that the contrast emerges from different foot structures (e.g., Köhnlein, 2016).

<sup>5</sup>Henceforth, the term Limburgian is used to refer to those Limburgian dialects that use lexical tone.

'those are two rabbits.' The rhyme of the target word carries accent 1.

Surface variation due to tone-intonation interactions can also be observed in Swedish (Bruce, 1977; Riad, 2013), but to a lesser extent than in Limburgian (Gussenhoven, 2004). It has been suggested that the reliability of the mapping between underlying tones and their surface realizations has a large impact on the acquisition of a lexical tone system (Demuth, 1995; Ota, 2003). In addition, Rost and McMurray (2010) have shown that allophonic variability, unlike variability like speaker differences, can be problematic for creating phonologically specific representations of new words. Children might have a hard time distinguishing allophonic from phonemic variation, not knowing what to add to their lexical representations, leading to initially/temporarily under- or over-specified representations. Limburgian listeners are confronted with a considerable amount of allophonic (or allotonic) variation in lexical tone contours. Furthermore, this variation cannot be ignored since it does signal meaningful information at the post-lexical level. In light of this variation, it could be a challenge to recover the underlying tone system for young learners of Limburgian.

Yet another source of variation in Limburgians' input is due to the fact that most Limburgians also speak Dutch and are considered bidialectal (Cornips, 2014). Hardly any studies on the mapping of sounds to meaning focused on children acquiring two languages, let alone on children acquiring multiple dialects or regional varieties of the same language (for a review, see Fennell et al., 2016). Extant studies have shown that learning novel minimal pair words in both mono- and bilinguals is favored when children listen to a speaker that sounds like people from their environment (e.g., Mattock et al., 2010; Fennell and Byers-Heinlein, 2014). In word recognition studies with known words, the use of cognates can hinder the detection of mispronunciations, at least in close-language bilinguals (e.g., Ramon-Casas and Bosch, 2010). As a consequence of the highly variable input Limburgians are exposed to (Durrant et al., 2015), the higher probability of hearing accented speech (e.g., Bosch and Ramon-Casas, 2011) and the large amount of lexical overlap in the input (e.g., Sebastián-Gallés and Bosch, 2009), Limburgian children might exhibit a more lenient treatment of mispronunciations.

### Aims of the Present Study

In this study, we ask whether pitch plays a role in novel word recognition for children acquiring Roermond Limburgian in comparison to a control group of children acquiring Dutch. We aimed to answer two questions. First, do children acquiring Roermond Limburgian encode pitch information as part of their lexical entries when learning novel words? Second, do they behave differently from Dutch age-matched peers in this respect? To see whether their interpretation of pitch is adult-like or not yet fully developed, we also tested Limburgian and Dutch adults. Limburgian and Dutch 2.5- to 4-year-olds (Experiment 1) as well as adults (Experiment 2) participated in a bimodal preferential looking experiment (Golinkoff et al., 1987). Following the procedure employed by Quam and Swingley (2010) and Singh et al. (2014), participants learned two novel word-object mappings. After training, word recognition was tested in correct pronunciation (CP) trials and mispronunciation (MP) trials featuring a pitch change.

In light of previous findings (Singh et al., 2014, 2015), we expected Limburgians to be sensitive to MPs involving pitch. However, a change in pitch might only hinder word recognition to a minor extent in Limburgian due to the relatively restricted nature of the Limburgian tonal system. Other characteristics of the Limburgian speakers' input that could lead to (temporarily) weaker MP effects are the large amount of surface variation in the contours of the Limburgian tones, phonetic variation due to their exposure to multiple regional variants of a language (Durrant et al., 2015), and possibly also the fair amount of Dutch cognates without a tonal specification (but see Van der Feest and Johnson, 2016).

As for our Dutch participants, Ramachers et al. (2017) have shown that Dutch 6- to 12-month-old infants reliably discriminate the Limburgian tones in a discrimination task (see also Liu and Kager, 2014; Chen and Kager, 2016). Here we ask whether Dutch participants still attend to pitch in a higher-level task that requires lexical encoding of pitch. Based on previous research with non-tone language speakers (e.g., Quam and Swingley, 2010; Singh et al., 2014; Hay et al., 2015), we expected that changes in pitch would not hinder Dutch subjects' recognition of newly learned words.

However, adult speakers of German showed sensitivity to word-level pitch differences despite the fact that German has no lexical tone (Braun et al., 2014). Also, de Bree et al. (2008) showed that Dutch 36-month-olds were sensitive to mis-stressing. The fact that 3-year-old Dutch children appear to be sensitive to word-level suprasegmental properties might also facilitate their encoding of other word-level prosodic features, like lexical tone.

For the adults, in principle the same expectations hold. However, due to accumulated linguistic experience, Limburgian adults might have learned not to rely on pitch alone during online language comprehension. We expected Limburgian adults to notice a change in tone, but it is an open question how strongly it will hinder word recognition. Dutch adults might also still show sensitivity to pitch differences by virtue of their accumulated linguistic experience with post-lexical intonation and word stress (but see Quam and Swingley, 2010).

### EXPERIMENT 1

### Materials and Methods

#### Participants

A total number of 41 Limburgian toddlers were recruited via health care institutions and daycare centers in the city of Roermond in the Dutch province of Limburg. Twenty-three children with a mean age of 40.9 months (SD = 5.9 months; range = 31–49 months; 6 boys) were included in the analysis. An additional 18 toddlers were tested but excluded from analysis because they failed to contribute sufficient data. For a detailed description of trial, block and participant exclusion criteria we refer to the section "Data Pre-processing and Analysis" and **Table A1** in the **Appendix**.

Children in Limburg are often exposed to quite heterogeneous linguistic input. As a result, it is difficult to find toddlers who have only been exposed to one particular dialect, in our case Roermond Limburgian. Children from the municipality of Roermond who were exposed to any East-Limburgian dialect (Bakker and van Hout, 2012), spoken by at least one parent or caregiver, were allowed to participate. The realization of the word prosodic contrast within the East-Limburgian dialect region does not show much variation (Heijmans, 2003). Based on parental report (missing N = 1), using an adapted version of the PaBiQ (COST Action IS0804, 2011)<sup>6</sup> administered during a telephone interview, the language input provided at home to 22 of the Limburgian children was as follows: (a) both parents speak a different East-Limburgian dialect (N = 9), (b) one parent speaks an East-Limburgian dialect, the other Standard Dutch (N = 8), (c) both parents speak the same East-Limburgian dialect (N = 3), and (d) one parent speaks an East-Limburgian dialect, the other a dialect from another Limburgian dialect region (N = 2). All children were reported to understand both Limburgian and Dutch. Moreover, 19 out of 22 children were reported to speak Limburgian, and all participants were reported to speak Dutch. All Limburgian toddlers thus picked up on Dutch, even if they were not addressed in it by (one of) their parents, but for example by friends or at daycare. All toddlers could thus be considered bidialectals. For language use in the home (input quantity) parents were asked a series of questions with rating scale responses about the languages used by each household member to the child. From this, a proportion of language use in the home was derived.
The questionnaire also contained a language richness measure (input quality), as defined by the extent to which children were exposed to storytelling, either as read from books or produced spontaneously, the expression of feelings, educational games (e.g., counting and spelling), labeling new objects, and media (e.g., television, PC, and tablet). Eighteen out of twenty-two children had higher input quantity scores in Limburgian than in Dutch. Seventeen out of twenty-two children had higher or equal input quality scores in Limburgian than in Dutch. See **Table A2** in the **Appendix** for more details.

A total number of 40 Dutch toddlers were recruited from the subject pool of the Baby Research Center of Radboud University, Nijmegen, Netherlands. All infants grew up in monolingual Dutch-speaking families. Thirty-five toddlers with a mean age of 36.8 months (SD = 1.8 months; range = 34–40 months; 13 boys) were included in the analysis. An additional five participants were excluded from the analysis for not contributing enough data (N = 4) and because one pair of children were twins (N = 1; the child contributing the least number of trials was excluded).

To make sure that none of the Dutch toddlers had substantial experience with a Limburgian dialect or any other tone language, their parents were asked questions related to the linguistic input of their child during an intake phone call. A child was regarded to have substantial experience with a tone language and thus not suitable for participation if: (a) one of the parents or primary caregivers was a native speaker of a tone language, (b) the child had weekly contact with a native tone language speaker.

None of the participants had known developmental disorders or delays and none of them had substantial exposure to a language other than Limburgian or Dutch. Ethical approval for the study was obtained from the Ethiek Commissie Faculteit der Sociale Wetenschappen (ECSW) at Radboud University in Nijmegen, Netherlands. Caregivers signed an informed consent and received a picture book or a small monetary compensation for their participation.

<sup>6</sup>This questionnaire is a translation/adaptation of the Questionnaire for Parents of Bilingual Children (COST Action IS0804, 2011). It is the short version of a longer questionnaire piloted by research groups in several countries within COST Action IS0804, which was in part based on the ALEQ (Paradis, 2011) and the ALDeQ (Paradis et al., 2010).

### Apparatus

Limburgian children were tested in a dimly lit office using a portable lab set-up in a daycare center in Roermond. They sat in front of a 24-inch LCD screen (Philips 249C4QHSB) and were recorded via a digital video camera (Sony HC40) mounted on a tripod below the table. Behind the monitor were two speakers (Logitech Z130). The video camera broadcast the recording to a 13-inch Apple MacBook Air. Recordings were made with the video software Vidi (version 0.4.7). The experiment was presented using the LOOK software (Meints and Woodford, 2008), run on a laptop (HP EliteBook Folio 9470m). During testing, experimenter and caregiver listened to masking music through noise-canceling headphones (Sennheiser HME 110).

Dutch children were tested in a dimly lit room in the Baby Research Center at Radboud University, Nijmegen, Netherlands. The experiment was run in a test booth (size: 128 cm × 177 cm), which is partly closed by black wooden partitions, left and right from the 47-inch television screen (LG 47LK530 ZC). A digital video camera (Sony Handycam DCR\_HC85E PAL) was placed 30 cm below the screen, hidden by a black curtain with an opening for the lens. The video camera provided a broadcast of the infant's behavior to a monitor behind the TV. Recordings for offline coding were made using Virtual Dub (Version 1.9.11). The experiment was controlled using the LOOK software (Meints and Woodford, 2008). Experimenter and caregiver wore noise-canceling headphones (Sennheiser HMEC 300) that played masking music.

### Procedure

The procedure employed was the intermodal preferential looking paradigm (Golinkoff et al., 1987). The experiment lasted approximately 10 min and consisted of two blocks, separated by a 1-min break. In each block, children would learn one novel word-object mapping and subsequently it was tested how they reacted to a pitch change in the newly learned word. Each child thus learned two new words, one with accent 1 and one with accent 2. Half of the participants learned the accent 1 word first and half learned the accent 2 word first. Each block featured a different pair of objects. A visual overview of a block is presented in **Figure 4**.

A block started with an encouraging introduction phase inviting the participant to play a game. In the following object familiarization phase, the child was familiarized with two novel toy objects appearing simultaneously at the far left and far right side of the screen. The objects were presented for 9 s. The child heard (in Limburgian or in Dutch): "Look! What are those? They look great! Do you like them too?" One of these objects (the target) would be labeled in the subsequent learning phase. The other one (the distracter) would remain nameless. Target side during object familiarization was counterbalanced across blocks. The purpose of this phase was twofold: Familiarization of stimuli prior to labeling usually boosts levels of retention (e.g., Hilton and Westermann, 2016) and it lowers the task demand (e.g., Fennell, 2012).

After object familiarization, the child proceeded to the learning phase. During this ostensive-labeling phase, participants were taught a new word carrying either accent 1 or accent 2. The phase consisted of four trials of 30 s each. In the first and the third trial, the target appeared bouncing in front of a natural landscape and was labeled 10 times in each trial in sentences like: "Look! This is a [target]! A [target]! Can you see it? There's the [target]!" In total, the child heard 20 repetitions of the target label. Presenting a number of repetitions is in line with previous research on retention of novel word-object mappings (e.g., Quam and Swingley, 2010; Singh et al., 2014; Hilton and Westermann, 2016). Note that the target label always appeared in focus-final position in a declarative sentence. In this way, the phonetic realization of the Limburgian tones was held constant, and the child thus did not have to abstract away from different surface realizations. In trials two and four, the distracter object appeared in the same scenario and was talked about for an equal amount of time, but crucially, it did not receive a label. We tried to encourage the child to wonder what the name of the distracter was. The target and distracter object were presented for an equal amount of time to prevent a familiarity preference for one of both objects in the subsequent test phase. The order of trials was the same across blocks and participants.

Following the learning phase, the child entered the test phase that consisted of four test trials and four filler trials. In test trials, the target and the distracter toy appeared side by side on the screen. Children were asked to "Look at the [target]." Target onset was always at 2500 ms to enable children to inspect both objects before naming and to establish a baseline preference. To maximize engagement, a second sentence like: "Can you find it?" followed 1000 ms after target offset. Test trials lasted 7 s.

In two of the test trials, the label for the target object was correctly pronounced [Correct Pronunciation (CP) trials], while in the other two, the label was mispronounced [Mispronunciation (MP) trials]. This MP involved a change in pitch: A word taught with accent 1 was mispronounced with accent 2 and vice versa. Recall that during test trials the novel target item was paired with a novel, unlabeled distracter item. The presence of a nameless distracter offered participants the possibility to consider the mispronounced version of the target label to be a novel label for the unlabeled distracter. This presupposes the use of the principle of mutual exclusivity (ME; Markman, 1990). This principle guides people to map novel words to unfamiliar rather than familiar referents. The use of ME to identify referents of novel words has been reliably demonstrated in infants from 16 months of age (e.g., Halberda, 2003) and in monolingual, bilingual, and bidialectal preschool children (e.g., Markman and Wachtel, 1988; Diesendruck and Markson, 2001; Durrant, 2014; Singh et al., 2014; Kalashnikova et al., 2015). The procedure with a novel target and a novel distracter object has been successfully applied in similar word learning studies with 1.5- to 2-year-olds (Singh et al., 2014), 2.5-year-olds (Quam and Swingley, 2010), and 3- to 5-year-olds (Singh and Quam, 2016).

FIGURE 4 | Visual overview of an experimental block.

Order of test trials was pseudo-randomized in such a way that the target would never appear on the same side more than twice in a row. Moreover, all children were presented at least one CP trial before the first MP trial. This resulted in three trial orders. To make sure children would remain engaged in the task, four filler trials involved correct pronunciations of four well-known words (e.g., Singh et al., 2015; Buckler and Fikkert, 2016). Test phases across all versions started with a filler trial to help children understand the nature of the task. Test and filler trials were presented in an alternating fashion.
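The ordering constraints can be made concrete with a small sketch. It assumes two CP and two MP test trials per block, as described for the test phase, and reads "at least one CP trial before the first MP trial" as "the first test trial is a CP trial". The function names are hypothetical and are not part of the LOOK software. Under these assumptions, enumeration yields exactly three condition orders, which is consistent with the three trial orders mentioned above, although the source does not list them.

```python
from itertools import permutations

def condition_orders(n_cp=2, n_mp=2):
    """Distinct CP/MP orders in which a CP trial precedes the first MP trial."""
    trials = ["CP"] * n_cp + ["MP"] * n_mp
    unique = sorted(set(permutations(trials)))
    # With at least one MP trial present, the constraint means the first trial is CP.
    return [order for order in unique if order[0] == "CP"]

def sides_ok(sides, max_run=2):
    """True if the target never appears on the same side more than max_run times in a row."""
    run = 1
    for prev, cur in zip(sides, sides[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return False
    return True

orders = condition_orders()
for order in orders:
    print(order)
```

A side sequence such as left, left, right, left passes `sides_ok`, whereas three consecutive trials with the target on the left do not.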

Between blocks, children watched a 1-min video featuring farm animals and animal noises. The second block had the same structure as the first block but featured a new object-pair, one of which would receive a novel label. Object labels and tones were counterbalanced across participants. Each child was tested on his/her sensitivity to tonal MPs of accent 1 and accent 2 to test for asymmetries in tone sensitivity (e.g., Francis and Ciocca, 2003; Shi et al., 2017). Throughout the experiment, trials were preceded by a purple flashing light in the screen center and were initiated once the child fixated the attention getter.

#### Stimuli

For this experiment, we created two pseudo-word pairs: taaf<sup>1/2</sup> [ta:f] and moon<sup>1/2</sup> [mo:n].<sup>7</sup> We decided to teach each participant two words instead of one to reduce the possibility that any effects were idiosyncratic to a particular word. Moreover, in this way all participants could learn one word with accent 1 and one word with accent 2.

The segments and phonotactics of the target stimuli were equally compatible with Limburgian and Dutch, and both pseudo-word pairs were derived from existing tonal minimal pairs in Limburgian to ensure that they were legal with both tones.<sup>8</sup> Additionally, we controlled for phonological neighborhood density, since the existence of phonological neighbors could hinder children from using their full phonological sensitivity (e.g., Swingley et al., 1998; Swingley and Aslin, 2007) or from using the principle of ME (e.g., Jarvis et al., 2004). We considered a word a phonological neighbor if the item differed from the novel word by substituting, adding or deleting a single phoneme (Luce and Pisoni, 1998; Swingley and Aslin, 2002). We only considered words from the Lexilijst Nederlands (Schlichting and Spelberg, 2002) that are supposed to be produced and known by 15- to 27-month-old Dutch children. Taaf had no phonological neighbors known to children of this age, whereas moon had one phonological neighbor for the Dutch participants (maan [ma:n], 'moon'), and two for the Limburgian participants (maon<sup>1</sup> [ml:n], 'moon'; sjoon<sup>2</sup> [So:n], 'shoe').
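The neighbor criterion (a difference of exactly one substituted, added, or deleted phoneme) can be sketched as follows. This is an illustration only: the phoneme symbols are treated as opaque strings, tone is ignored, and the function name and transcription tuples are hypothetical rather than taken from the Lexilijst.

```python
def is_neighbor(word, other):
    """True if two phoneme tuples differ by exactly one substitution, addition, or deletion."""
    if len(word) == len(other):
        # Same length: exactly one position may differ (substitution).
        return sum(a != b for a, b in zip(word, other)) == 1
    if abs(len(word) - len(other)) != 1:
        return False
    longer, shorter = (word, other) if len(word) > len(other) else (other, word)
    # Lengths differ by one: removing one phoneme from the longer form must give the shorter.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

MOON = ("m", "o:", "n")   # novel word moon
TAAF = ("t", "a:", "f")   # novel word taaf
MAAN = ("m", "a:", "n")   # Dutch maan, 'moon'

print(is_neighbor(MOON, MAAN))  # one vowel substitution
print(is_neighbor(TAAF, MAAN))  # more than one phoneme differs
```

In this sketch maan counts as a neighbor of moon because only the vowel differs, while taaf and maan differ in two positions and are therefore not neighbors.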

Carrier sentences were recorded in Limburgian and Dutch. Target stimuli were recorded in and spliced from Limburgian carrier sentences to guarantee tone accuracy.<sup>9</sup> All stimuli were recorded in a child-friendly way by a female native speaker of Dutch and of an East-Limburgian dialect spoken in the municipality of Roermond. She reported to be dominant in Limburgian, but was equally proficient in Dutch and was trained in speaking accentless Standard Dutch. For Limburgian children, pre-experimental instructions as well as the experiment itself were in Limburgian. For Dutch children, the entire procedure was in Dutch. Across language contexts, only the tokens of the target stimuli taaf and moon were the same. Care was taken that the Dutch and Limburgian stimuli were recorded with the same intent and enthusiasm. The target stimuli were recorded multiple times with accent 1 as well as accent 2 and always appeared in a declarative focus-final context to avoid differences in the phonetic realization of the tones. Recordings were made in a sound-attenuated booth using Adobe Audition (version CS6, 44.1 kHz). Stimuli were equalized for intensity to 65 dB and prepared for the experiment using Praat (version 5.3.35; Boersma and Weenink, 2012). For stimuli excision we followed the guidelines presented in Turk et al. (2006).

In total, 12 tokens of taaf<sup>1</sup>, taaf<sup>2</sup>, moon<sup>1</sup>, and moon<sup>2</sup> were selected, based on intuition of a native speaker of an East-Limburgian dialect [the first author] and careful listening by a trained phonetician [Carlos Gussenhoven]. Ten tokens were used in the learning phase, the remaining two in the CP trials in the test phase. For all tokens we measured maximum and minimum f0, f0 range (max f0 to min f0), average f0, and duration of the tone bearing portion as well as the duration of the entire token. Measurements were done manually, taking auditory as well as spectral properties into account. Independent t-tests revealed that accent 1 and accent 2 tokens differed significantly from each other with respect to minimum f0, maximum f0, and f0 range (see **Table A3** in the **Appendix**).

The four filler trials involved correct pronunciations of known words. One filler pair consisted of a cow and a horse, and the other of a car and a ball. Items were chosen for their very high frequency in the productive vocabulary of the age group at test, according to the Lexilijst Nederlands (Schlichting and Spelberg, 2002).

The visual target stimuli consisted of four plush toy objects of an animate character (see **Figure 5**). All objects had different, vibrant colors (pink, blue, purple, and yellow) and shapes. The pink and blue object (**Figures 5A,B**) were paired as well as the purple and yellow object (**Figures 5C,D**). Pairs were matched in visual complexity, brightness, and size. A paired-samples t-test comparing the mean proportion of looking time toward the target (M = 0.51, SD = 0.08) and the distracter object (M = 0.50, SD = 0.08) during the object familiarization phase showed that participants did not show a preference for the target object prior to the learning (i.e., labeling) phase [t(57) = 0.59, p > 0.05].

In the object familiarization phase and the test phase, the stimuli consisted of photographs of the objects against a gray background. During the learning phase, the objects bumped up and down against the background of a natural scene. Filler stimuli in the test phase consisted of photographs of a horse, a cow, a car, and a ball against a gray background. Two different pictures per object were used across blocks to minimize boredom effects.

#### Data Pre-processing and Analysis

Children's video recordings were coded offline using ELAN (version 4.5.0; Wittenburg et al., 2006) with a resolution of 40 fps.

<sup>7</sup>Superscripts indicate accents 1 and 2.

<sup>8</sup>The pseudoword taaf comes from Limburgian graaf [Göa:f], meaning 'grave' with accent 2 and 'count' with accent 1. The pseudoword moon comes from Limburgian sjoon [So:n], meaning 'shoe' with accent 2 and 'beautiful' with accent 1.

<sup>9</sup>Note that some of the Limburgian stimuli were spliced too, since the selected tokens of the target stimuli did not always appear in the desired carrier sentence in the original recordings.

FIGURE 5 | The visual target stimuli used in the experiment. Objects (A,B) always appeared as a pair as well as objects (C,D).

In test trials, target onset was always at 2500 ms. The 2500 ms window prior to target onset was labeled the pre-naming window. The post-naming window lasted 2000 ms, starting 367 ms after target onset (e.g., Swingley and Aslin, 2000; Quam and Swingley, 2010; Altvater-Mackensen et al., 2013; Singh et al., 2014). The coder was blind to trial type and target side. A random 20% of the videos was recoded by a second experienced coder. The correlation between two coders was very strong (Pearson's r = 0.801, p < 0.001).

To ensure that our analyses were based on clean data and to enable within-subject comparisons of CP vs. MP trials and of accent 1 vs. accent 2 words, we maintained a number of trial, block, and participant exclusion criteria. **Table A1** in the **Appendix** provides a detailed overview of exclusion.

Test trials were excluded if (1) a child looked less than 500 ms during the 2000 ms post-naming window (e.g., Quam and Swingley, 2010; Singh et al., 2014; Tsuji et al., 2016), (2) the participant fixated only one of two objects during the 2500 ms pre-naming window (e.g., White and Morgan, 2008; Mani and Plunkett, 2011; Singh et al., 2015; Buckler and Fikkert, 2016), (3) an equipment or experimenter error occurred, or (4) a participant refused to participate (e.g., by getting up and walking around) and the experiment had to be aborted.

A block was excluded if (1) a participant did not contribute at least one valid trial per condition (CP and MP) during the test phase (e.g., Buckler and Fikkert, 2016; Tsuji et al., 2016), or (2) total looking time during target and/or distracter learning trials was under 20 s out of a total of 60 s (e.g., Tsuji et al., 2016). The latter criterion is based on the assumption that children who pay more attention to the novel objects during learning should be better able to retain the novel word-object mapping (Hilton and Westermann, 2016).

Participants were excluded from the analyses if (1) at least one block had to be excluded, (2) an equipment failure or experimenter error occurred, or (3) other conditions were not met, e.g., if a participant's linguistic background was inappropriate or if we had twin participants.
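The nested exclusion rules can be summarized in a short sketch. The data layout (the `Trial` fields and the looking-time arguments) is hypothetical, and the learning-phase criterion is read here as requiring at least 20 s of looking in both the target and the distracter learning trials, which is one possible interpretation of "and/or". Exclusions based on linguistic background or twin status are not modeled.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    condition: str               # "CP" or "MP"
    post_look_ms: float          # total looking time in the 2000 ms post-naming window
    pre_looked_target: bool      # fixated the target in the pre-naming window
    pre_looked_distracter: bool  # fixated the distracter in the pre-naming window
    error: bool = False          # equipment/experimenter error or aborted session

def trial_valid(t):
    """Apply the trial-level criteria (1) to (4)."""
    return (t.post_look_ms >= 500
            and t.pre_looked_target
            and t.pre_looked_distracter
            and not t.error)

def block_valid(trials, target_learning_s, distracter_learning_s):
    """A block needs a valid CP and a valid MP trial plus enough looking during learning."""
    valid_conditions = {t.condition for t in trials if trial_valid(t)}
    enough_learning = target_learning_s >= 20 and distracter_learning_s >= 20
    return {"CP", "MP"} <= valid_conditions and enough_learning

def participant_included(blocks):
    """A participant is kept only if every block is valid."""
    return all(block_valid(*block) for block in blocks)
```

For example, a block containing one valid CP trial and one valid MP trial with 45 s and 30 s of looking during learning would be retained, whereas the same block with only 15 s of looking at the target during learning would be dropped, and with it the participant.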

Children's target recognition was inferred from the presence of a naming effect that is typically measured as an increase in target fixation upon hearing the target label relative to a baseline looking measure (e.g., Swingley and Aslin, 2000; Singh et al., 2015). To calculate the naming effect, the increase in the proportion of target looking (PTL) between the pre-naming and post-naming window of a test trial was calculated [i.e., post-naming PTL (T/[T + D]) − pre-naming PTL (T/[T + D]), where T and D are the looking times to the target and the distracter], resulting in a difference score. Computing naming effects by taking each individual participant's pre-naming values into account serves to control for possible effects of preference for a particular stimulus (e.g., White and Morgan, 2008; Quam and Swingley, 2010; Mani and Plunkett, 2011; Singh et al., 2015). A paired-samples t-test showed a small yet significant difference in PTL between object familiarization phase (M = 0.51, SD = 0.08) and pre-naming window (M = 0.53, SD = 0.07), t(57) = −2.05, p = 0.045, Cohen's d = −0.27. Moreover, a one-sample t-test showed that pre-naming PTL differed significantly from chance: t(57) = 3.56, p = 0.001, Cohen's d = 0.47. Thus, it appears that the target object had become slightly more interesting than the distracter after the learning phase due to repeated labeling (e.g., Schafer and Plunkett, 1998). To control for a possible effect of this target preference, we chose the post- minus pre-naming PTL measure as our dependent variable.
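As an illustration of the difference score, the sketch below computes PTL in the two windows from frame-coded looks. The sample format (a timestamp in ms paired with "T", "D", or None), the half-open window boundaries, and the function names are assumptions made for this example; the window positions follow the values given above (target onset at 2500 ms, post-naming window starting 367 ms later and lasting 2000 ms).

```python
TARGET_ONSET_MS = 2500
PRE_WINDOW = (0, TARGET_ONSET_MS)
POST_WINDOW = (TARGET_ONSET_MS + 367, TARGET_ONSET_MS + 367 + 2000)

def ptl(samples, window):
    """Proportion of target looking, T / (T + D), within a half-open time window."""
    start, end = window
    target = sum(1 for t, look in samples if start <= t < end and look == "T")
    distracter = sum(1 for t, look in samples if start <= t < end and look == "D")
    total = target + distracter
    return target / total if total else None

def naming_effect(samples):
    """Post-naming PTL minus pre-naming PTL; None if either window has no looks."""
    pre = ptl(samples, PRE_WINDOW)
    post = ptl(samples, POST_WINDOW)
    if pre is None or post is None:
        return None
    return post - pre

# Hypothetical coded samples: equal looking before naming, mostly target after naming.
samples = (
    [(t, "T") for t in range(0, 1000, 200)]
    + [(t, "D") for t in range(1000, 2000, 200)]
    + [(t, "T") for t in range(3000, 4600, 200)]
    + [(4600, "D"), (4800, "D")]
)
print(naming_effect(samples))
```

With these made-up samples the pre-naming PTL is 0.5 and the post-naming PTL is 0.8, so the naming effect is about 0.3. A positive score indicates increased target looking after naming.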

Naming effects were calculated and compared for CP and MP trials. If children notice the MP, the naming effect will be significantly less strong in MP than in CP trials. However, it is important to inspect the naming effect in MP trials more closely to gain insight into the strength of the MP effect. First, even if the naming effect in MP trials is significantly weaker than the naming effect in CP trials, it can still be positive and significantly above zero (as attested for one-feature segmental MPs in White and Morgan, 2008). This indicates that target recognition is hindered to some extent, but that recognition still takes place. Secondly, the naming effect in MP trials might not differ significantly from 0, signaling uncertainty, meaning that target recognition is hindered to such extent that recognition fails (as attested for two- and three-feature segmental MPs in White and Morgan, 2008, and for tonal MPs in Singh et al., 2014, 2015). Thirdly, a significant negative naming effect would point to a preference for the distracter object and can be seen as evidence for the formation of a novel mapping between the auditory label and the distracter object based on ME (e.g., Swingley and Aslin, 2000; White and Morgan, 2008; Mani and Plunkett, 2011).
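The decision logic described in this paragraph can be written out as a small sketch. The function name, the argument names, the returned labels, and the fixed alpha of 0.05 are assumptions for illustration; the inputs stand for the mean MP naming effect, the p-value of the MP effect tested against zero, and the p-value of the CP vs. MP comparison.

```python
def interpret_mp(mp_mean, p_mp_vs_zero, p_cp_vs_mp, alpha=0.05):
    """Map test outcomes onto the interpretations of the MP naming effect."""
    if p_cp_vs_mp >= alpha:
        # The MP naming effect is not reliably weaker than the CP naming effect.
        return "no detectable MP effect"
    if p_mp_vs_zero >= alpha:
        return "recognition fails"
    if mp_mean > 0:
        return "recognition hindered but still succeeds"
    return "distracter preference (novel mapping based on ME)"

# Hypothetical values for illustration only.
print(interpret_mp(mp_mean=0.10, p_mp_vs_zero=0.01, p_cp_vs_mp=0.01))
```

The first branch covers the case in which children do not notice the MP at all, which precedes the three outcomes distinguished in the text.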

### RESULTS

**Figure 6** shows naming effects for Limburgian and Dutch toddlers in the CP and MP condition.

To verify that word learning was successful, the naming effect in CP trials was compared to zero for each group by means of a one-sample t-test. For both Limburgian and Dutch toddlers, there was a significant positive naming effect in CP trials (Limburgian: M = 0.25, SD = 0.15, t(22) = 8.28, p < 0.001, Cohen's d = 1.73; Dutch: M = 0.18, SD = 0.23, t(34) = 4.60, p < 0.001, Cohen's d = 0.78). From this we can conclude that both participant groups learned the novel word-object mapping.
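For a one-sample t-test, Cohen's d can be recovered from the test statistic as d = t / √n, with n the number of participants (df + 1). The sketch below applies this standard relation to the reported values as a consistency check; whether the effect sizes were originally computed this way is an assumption, and the function name is hypothetical.

```python
import math

def one_sample_d(t, n):
    """Cohen's d for a one-sample t-test, recovered from t and sample size n."""
    return t / math.sqrt(n)

print(round(one_sample_d(8.28, 23), 2))  # Limburgian toddlers, t(22) = 8.28
print(round(one_sample_d(4.60, 35), 2))  # Dutch toddlers, t(34) = 4.60
```

Both values round to the effect sizes reported above (1.73 and 0.78).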

Next, a three-way mixed ANOVA with Condition (CP vs. MP) and Tone (Accent 1 vs. Accent 2) as within-subjects factors and Language (Limburgian vs. Dutch) as the between-subjects factor was conducted to evaluate the possible influence of language and pitch change on the naming effect. Results revealed a significant main effect of Condition, F(1,56) = 8.53, p = 0.005, η<sub>p</sub><sup>2</sup> = 0.13, observed power = 0.82, with a significantly stronger naming effect in CP trials (M = 0.21, SD = 0.20) than in MP trials (M = 0.09, SD = 0.24). No other significant main effects or interactions were found (all ps > 0.1). Both Limburgian and Dutch children thus treated the pitch change as lexically relevant as indicated by a significantly weaker naming effect in MP trials compared to CP trials. Mean PTL values and standard deviations for pre- and post-naming windows per Condition and Language are listed in **Table 1**.
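Partial eta squared can be recovered from an F statistic and its degrees of freedom as F·df<sub>effect</sub> / (F·df<sub>effect</sub> + df<sub>error</sub>). The sketch below applies this standard identity to the reported main effect of Condition as a consistency check; the function name is hypothetical.

```python
def partial_eta_squared(f, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f * df_effect) / (f * df_effect + df_error)

print(round(partial_eta_squared(8.53, 1, 56), 2))  # main effect of Condition, F(1,56) = 8.53
```

The result rounds to the reported value of 0.13.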

TABLE 1 | Mean proportion of target looking in pre- and post-naming windows per group and condition for the toddlers.


Standard deviations in parentheses.

To investigate the strength of the MP, the naming effect in MP trials was compared to zero by means of a one-sample t-test. The test revealed a significant positive naming effect (M = 0.09, SD = 0.24; t(57) = 2.81, p < 0.01, Cohen's d = 0.37). Thus, despite the naming effect being weaker in MP than CP trials, target recognition was still possible in MP trials. From this we can infer that the pitch change only hindered word recognition to a minor extent.<sup>10</sup>

We next tested Limburgian and Dutch adults in the same experiment to find out whether the sensitivity to pitch in both the Limburgian and Dutch children in Experiment 1 was adultlike or whether it reflected a not yet fully developed phonological system.

### EXPERIMENT 2

As with the Limburgian children, we expected Limburgian adults to notice a change in tone, but it was an open question how strongly it would hinder word recognition. Adult speakers might have learned not to rely on pitch too much during online language comprehension because of the relatively low functional load of lexical tone and because pitch has no lexical relevance in their second L1, Dutch.

Speakers of Dutch were expected not to attend to pitch during the recognition of newly learned words. However, if the sensitivity exhibited by the Dutch children was dependent on their knowledge of pitch as a cue to word stress and/or intonation, Dutch adults might still be sensitive to pitch differences by virtue of their accumulated experience with the native prosodic system (but see Quam and Swingley, 2010).

### Materials and Methods

#### Participants

Limburgian adults were recruited and tested in a public library in Roermond. The Limburgian listeners (N = 14, 5 males) ranged in age from 26 to 72 years (M = 53.6 years). An additional 10 participants were excluded from the analysis because (1) they reported speaking a dialect other than one from the East-Limburgian dialect region (N = 4), (2) they could only contribute one of two blocks due to exclusion of test trials (N = 3), or (3) they failed to learn the novel word-object mapping in one or two blocks, signaled by a mean PTL equal to or smaller than 0.50 in the post-naming window of CP trials (N = 3).<sup>11</sup> All included Limburgian participants were born and raised in the East-Limburgian dialect region and lived there at the time of test. All of them reported actively using an East-Limburgian dialect. The Limburgian participants also had native command of Dutch, except for two participants who reported very good or good command. All of them can thus be considered bidialectals.

<sup>10</sup>Some previous studies found age-related differences in the sensitivity to pitch changes in tone language learning bilinguals (e.g., Singh et al., 2015). Since we also tested tone language learning 'bilinguals' spanning exactly this age range, we ran an additional mixed ANOVA on our Limburgian sample including Age as a between-subjects variable, comparing younger (31–38 months, N = 11) to older (42–49 months, N = 12) children. The analysis yielded a main effect of Condition, F(1,21) = 4.63, p = 0.04, and a marginally significant Condition × Tone × Age interaction, F(1,21) = 3.21, p = 0.088, suggesting that the effect of Condition in the younger children is carried by the accent 2 items whereas in older children it is carried by the accent 1 items. No other significant main effects or interactions were attested (all ps > 0.05). As suggested by the reviewers, we also ran an ANCOVA including Age as a covariate. This analysis only yielded a marginally significant Condition × Tone × Age interaction, F(1,21) = 3.38, p = 0.08, suggesting that the effect of Condition only holds for accent 1 items. This could after all signal a trend toward a perceptual asymmetry, indicating that accent 1 is the lexically specified tone. Increasing the sample size could perhaps increase the significance of this result, but was outside the scope of our study.

Dutch adults were recruited at Radboud University, Nijmegen, Netherlands, and tested at the Baby Research Center of the same university. The Dutch listeners (N = 22, 7 males) ranged in age from 18 to 40 years (M = 23 years). None of them had weekly contact with people speaking a Limburgian dialect in their presence. Moreover, none of them grew up or lived in the province of Limburg. An additional two participants were excluded from the analysis due to the exclusion of one of the two blocks.

All Limburgian and Dutch participants reported some degree of non-native command of one or more non-tonal languages (i.e., English, German, French, Spanish, Arabic, and Polish) as indicated on a six-point scale ranging from poor to native command, but none of them had experience with a tone language. All participants reported normal hearing and no speech, language, or attention deficits. Because musical experience can influence pitch processing (e.g., Burnham and Brooker, 2002; Burnham et al., 2015), we kept the number of musically trained individuals comparable across groups. Six of the Limburgian participants (43%) and eight of the Dutch participants (36%) reported having had over 3 years of musical training. Ethical approval for the study was obtained from the Ethics Assessment Committee (EAC) of the Faculty of Arts at Radboud University, Nijmegen, Netherlands. Participants signed an informed consent form and took part in the experiment either voluntarily or for a small fee.

#### Apparatus, Stimuli, and Procedure

The apparatus, stimuli, and procedure of the adult experiment were comparable to Experiment 1, as in Quam and Swingley (2010), who also tested children and adults under similar conditions. For the Limburgian adults we used the same portable set-up as the Limburgian children, but they were tested in a quiet, darkened room in a public library. To minimize external interference, stimuli were presented through noise-canceling headphones (Sennheiser HME 110). Dutch adults were tested under the exact same conditions as the Dutch children.

Regarding the procedure, we added extra filler trials (16 instead of 4) to the test phase to distract adult participants' attention away from the purpose of the experiment, leading to a total number of 20 trials. Participants were told before the study that they would be helping to test an experiment designed for 3-year-olds.

A paired-samples t-test, comparing the mean PTL toward the target (M = 0.51, SD = 0.05) and the distracter object (M = 0.49, SD = 0.05) during the object familiarization phase, showed that adult participants did not show a preference for the target object prior to the learning phase [t(35) = 0.73, p > 0.1]. After the experiment, adults completed a language background questionnaire.

#### Data Pre-processing and Analysis

A random 20% of the adult videos was recoded by a second experienced coder. Inter-coder reliability was excellent (Pearson's r = 0.937, p < 0.001).

Post-naming PTL was calculated within a 1000 ms window, starting 367 ms after target onset. We could have shifted the analysis window for adults earlier in time, but since earlier studies have shown that this does not have consequences for the results (e.g., Swingley, 2009), we retained the starting point of 367 ms post-target onset.<sup>12</sup>
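The window-based measure can be illustrated with a minimal Python sketch. It assumes frame-by-frame gaze codes with timestamps in milliseconds and defines PTL as target frames divided by target plus distracter frames within the window; the data layout, the labels, and the 40 ms frame spacing are hypothetical and not taken from the study.

```python
from typing import Iterable, Tuple

WINDOW_START_MS = 367    # offset after target onset, as reported in the text
WINDOW_LENGTH_MS = 1000  # window duration, as reported in the text


def post_naming_ptl(frames: Iterable[Tuple[int, str]], target_onset_ms: int) -> float:
    """Proportion of target looking within the post-naming window.

    `frames` holds (timestamp_ms, label) pairs, with label in
    {"target", "distracter", "away"}. Frames coded "away" are ignored
    (an assumption about how PTL is normalized).
    """
    start = target_onset_ms + WINDOW_START_MS
    end = start + WINDOW_LENGTH_MS
    target = distracter = 0
    for timestamp, label in frames:
        if start <= timestamp < end:
            if label == "target":
                target += 1
            elif label == "distracter":
                distracter += 1
    total = target + distracter
    if total == 0:
        raise ValueError("no target or distracter looks inside the window")
    return target / total


def hypothetical_label(timestamp: int) -> str:
    """Toy gaze pattern: distracter first, brief look away, then target."""
    if timestamp < 1600:
        return "distracter"
    if timestamp < 1800:
        return "away"
    return "target"


# One hypothetical trial coded every 40 ms, with target onset at 1000 ms.
trial = [(t, hypothetical_label(t)) for t in range(0, 3000, 40)]
ptl = post_naming_ptl(trial, target_onset_ms=1000)
```

In this toy trial the window spans 1367 to 2367 ms, so shifting `WINDOW_START_MS` would be the only change needed to inspect an earlier window.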

As with the child data, we found a significant difference between PTL during object familiarization (M = 0.51, SD = 0.05) and PTL in the pre-naming window (M = 0.56, SD = 0.12), t(35) = −2.73, p = 0.01, Cohen's d = −0.45. Moreover, a one-sample t-test showed that pre-naming PTL differed significantly from chance: t(35) = 3.16, p = 0.003, Cohen's d = 0.53. Thus, it appears that for the adults, too, the target object had become more interesting than the distracter after the learning phase. We again chose the post-naming minus pre-naming PTL measure as our dependent variable.

### RESULTS

Naming effects for Limburgian and Dutch adults in CP and MP conditions are depicted in **Figure 7**.

<sup>12</sup>A post hoc inspection of the adults' looking behavior in an earlier time window indeed showed that they were on target immediately after target onset. Changing the analysis window would thus not have changed the results.

<sup>11</sup>The drop-out rate might be due to the testing conditions: Participants were personally invited to participate and had to interrupt what they were doing. Moreover, in contrast to typically tested student populations, our participants might not have known what to expect. Some of them might not have been that motivated but accepted the invitation to avoid disappointment. These factors could have influenced their attention during the experiment.

To ensure that the adult participants successfully learned the novel word-object pairings, the naming effect in CP trials was first compared to zero for each language group by means of a one-sample t-test. For both Limburgian and Dutch adults, there was a significant positive naming effect in CP trials [Limburgian: M = 0.36, SD = 0.13, t(14) = 10.69, p < 0.001, Cohen's d = 2.86; Dutch: M = 0.41, SD = 0.14, t(22) = 14.28, p < 0.001, Cohen's d = 3.04]. From this we can conclude that both participant groups learned the novel word-object mappings.

Next, a three-way mixed ANOVA with Condition (CP vs. MP) and Tone (Accent 1 vs. Accent 2) as within-subjects factors and Language (Limburgian vs. Dutch) as the between-subjects factor was conducted. The analysis yielded no main effects or interactions (all ps > 0.05).

As in the CP trials, the naming effect in MP trials was significantly above zero [M = 0.34, SD = 0.22; t(38) = 9.53, p < 0.001, Cohen's d = 1.53].

The absence of an effect of Condition or Language is probably due to participants showing very strong naming effects in both CP and MP trials, as becomes clear from the PTL measures in **Table 2**. As can be inferred from Quam and Swingley (2010), the procedure used should be sensitive enough to yield a vowel MP effect. However, Quam and Swingley (2010) did not test native tone language speakers and thus did not show whether the method is equally suited to yield sensitivity to a change in pitch. This means that we cannot rule out the possibility that our findings are due to a task effect.

Our adult data thus provide no evidence of an effect of pitch variation on the recognition of newly learned words.

### DISCUSSION

In this study, we asked whether pitch plays a larger role in novel word learning and recognition in children acquiring East-Limburgian compared to a control group of children acquiring Standard Dutch. To see whether their interpretation of pitch was adult-like or not yet fully developed, we also tested Limburgian and Dutch adults.

Our main finding is that both Limburgian and Dutch children pay attention to pitch changes in newly learned words. However, children still preferred the target over the distracter object upon hearing a pitch change, indicating that a change in tone did not hinder word recognition to a great extent. Regarding our adult data, we can conclude that both Limburgian and Dutch adults succeeded in learning novel word-object mappings. However, we cannot draw conclusions about their interpretation of pitch changes due to very strong naming effects in both CP and MP conditions. In the next section, we will first discuss the findings from Experiment 1 with Limburgian and Dutch toddlers.

TABLE 2 | Mean proportion of target looking in pre- and post-naming windows per group and condition for the adult participants.

Standard deviations in parentheses.

### The Lexical Encoding of Pitch in Limburgian and Dutch Toddlers

The finding that Limburgian children were sensitive to MPs involving pitch was in line with previous word recognition studies with tone language learners (Singh et al., 2014, 2015). However, as signaled by the positive naming effect in MP trials, the pitch change did not inhibit target recognition. This pattern of results is in line with toddlers' responses to one-feature segmental MPs in White and Morgan (2008). However, previous studies investigating Mandarin found no naming effects in tonal MP conditions (Singh et al., 2014, 2015), suggesting that pitch changes are more detrimental to word recognition in Mandarin than in Limburgian. We would like to suggest three explanations for this finding.

First, the fact that Limburgian children recognized the target word despite a tonal change might be due to the relatively low functional load of tone. One of the factors contributing to the functional load of a contrast is the number of minimal pairs. The low frequency of tonal minimal pairs, plus the fact that listeners can mostly rely on sentence context for disambiguation, might mitigate the reliance on pitch in perceiving Limburgian. Similar explanations have been put forward by Cutler (1986) for the role of lexical stress in English and by Cutler and Otake (1999), Sekiguchi and Nakajima (1999), and Goss (2015) for the influence of pitch-accent on word recognition in Japanese. This reasoning is in line with the hypothesis that phonological category learning is driven by contrast in the vocabulary (Dietrich et al., 2007). However, Dietrich et al. (2007) argue on the basis of the results of a word recognition study that 18-month-olds' native-like performance cannot have been the result of top-down information from the lexicon. The tested age group did not seem to know many minimal pairs involving the distinctions at test. We thus cannot assume that children need minimal pairs to decide whether a contrast is phonologically meaningful or not.

A second explanation for the Limburgians' lenient treatment of MPs might be tonal surface variability. Recall that Limburgian listeners are confronted with a considerable amount of allotonic variation in lexical tone contours, but this variation cannot be ignored since it does signal meaningful information at the post-lexical level. In light of this pitch variation, it could be a challenge to recover the underlying tone system, at least for young learners (Demuth, 1995; Ota, 2003; Rost and McMurray, 2010). A replication of our study with Swedish children could provide additional insight into the effect of surface variation on developing tonal representations.

A third factor that may have influenced our Limburgian participants' behavior is variation due to their exposure to multiple (closely related) linguistic varieties. Hardly any studies on the mapping of sounds to meaning have focused on children acquiring two languages, let alone on children acquiring multiple dialects or regional varieties of the same language (for a review, see Fennell et al., 2016). One type of variation due to bidialectalism comes from exposure to different dialects and Limburgian-accented Dutch. Evidence for the effects of dialect-related variation on the phonological representation of known words is scarce. Durrant et al. (2015) showed that variable phonological input as a result of dialect variation has an impact on the specificity of lexical representations in 20-month-old British English multidialectal toddlers. In a preferential looking paradigm, they were tested on their sensitivity to single-feature MPs of monosyllabic known words. MPs involved changes of onset consonants or of the vowel nuclei that were phonemic in all the varieties at test. The authors' main finding was that multidialectal infants, unlike monodialectal infants, did not treat MPs of familiar words differently from CPs, suggesting that long-term exposure to regional linguistic variation leads to a broadening of phonetic categories or poorer use of phonological information in word recognition.

Another type of variation due to bidialectalism stems from lexical overlap. Limburgians know many cognates that do not have a tonal specification in Dutch. As such, they receive mixed evidence for the lexical relevance of pitch. Possibly, this mixed evidence (temporarily) leads them to assign less weight to pitch as a lexically contrastive feature. The existing evidence points in another direction, though. Van der Feest and Johnson (2016) tested 24-month-old Dutch toddlers who received mixed distributional evidence for the lexical contrastivity of fricative voicing. Toddlers were exposed to Limburgian-accented Dutch (which maintains the fricative voicing contrast) and to Dutch as spoken in the Nijmegen region (where the fricative voicing contrast is neutralized). Children treated fricative voicing as lexically relevant only in a Limburgian-accented context. The authors conclude that toddlers who receive mixed evidence for a phonological contrast due to variation in accents in their input do not simply treat the contrast as allophonic, nor do they ignore the contrast. Rather, they appear to track two sets of statistics, one for each variant, as bilingual children have been argued to do (e.g., Sundara and Scutellaro, 2011). Studies showing that the presence of mixed distributional evidence for a lexical tone contrast does not lead to less specific lexical representations were carried out by Singh et al. (2014, 2016). Twelve- to thirteen-month-old Mandarin-English bilinguals who, like our Limburgian participants, received mixed evidence for the lexical relevance of pitch, noticed tonal MPs in a Mandarin version of the one-object switch-task, but not in a non-tonal English version (Singh et al., 2016). In a preferential looking paradigm, also 18- and 24-month-old Mandarin-English bilinguals were sensitive to tonal MPs (Singh et al., 2014; but see Singh and Quam, 2016, for different results in a task involving language switching). 
From these findings we can probably infer that our Limburgian participants' lenient treatment of tonal MPs was not the result of their exposure to non-tonal cognates in Dutch. It could, however, be the case that their long-term exposure to dialect-related variation leads to a more general relaxation of phonetic boundaries, resulting in less well-specified lexical representations (e.g., Durrant et al., 2015). To investigate whether the latter explanation holds, future studies should test Limburgians' responses to a variety of tonal and segmental MPs of familiar words, similar to the Durrant et al. study.

The fact that Dutch toddlers responded to pitch variation in a word learning task is not in line with previous studies on the lexical encoding of tone in non-tone language children (e.g., Quam and Swingley, 2010; Singh et al., 2014; Hay et al., 2015). These studies have shown that, from some point in development, English toddlers ignore pitch information during word learning. However, comparisons to these prior studies are difficult because these studies did not directly compare performance of tone and non-tone language learning children (i.e., in one statistical analysis). Moreover, prior studies testing non-tone language children have been restricted to learners of English, making it impossible to generalize their results to all non-tone language learners. We want to put forward three explanations for Dutch toddlers' sensitivity to word-level pitch.

First, Dutch toddlers could have interpreted the Limburgian pitch patterns as post-lexical intonation, as has also been put forward as an explanation for successful lexical tone discrimination in Ramachers et al. (2017). More specifically, toddlers might over-assign weight to post-lexical factors in novel word learning tasks by virtue of having observed their communicative significance at other levels of linguistic structure (e.g., Singh et al., 2014; Hay et al., 2015). Similarly, Braun et al. (2014) proposed that extensive utterance-level prosody in the L1 is helpful for storing pitch information as part of novel mental representations. On the other hand, Frota et al. (2012) showed that, by age 3, European Portuguese children do notice stress changes, but no longer treat intonation changes in newly learned words as lexically relevant.

A second possible explanation for the behavior of the Dutch toddlers also relates to L1 intonation. In a word recognition study, Fikkert and Chen (2011) showed that Dutch 24-month-olds have knowledge of appropriate native intonation patterns. Particularly in imperatives, Dutch toddlers strongly preferred a high-low pitch pattern combined with a strong-weak (trochaic) stress pattern. In our study, the target sentences in the test trials were always imperatives. Possibly, our Dutch toddlers' behavior could have been influenced by their expectations of what a well-formed imperative sounds like. An imperative that ends in a high-low pitch pattern (i.e., accent 1) could be preferred over an imperative ending in a low-high pitch pattern (i.e., accent 2). This would result in Dutch children systematically fixating the target less if pronounced with accent 2, regardless of the trained tone. In this case we should have found an interaction involving our variables Language and Tone. Since we attested no such interaction, our data provide no evidence for the suggestion that Dutch children's expectations regarding well-formed imperatives have influenced their behavior in our study.

The third explanation of the fact that Dutch toddlers noticed a pitch change in a novel word is that they might have perceived the Limburgian tone contrast as a quantity contrast rather than as a pitch contrast. Previous research has shown that the shape of a pitch pattern can indeed affect the perceived duration of the tone bearing vowel (e.g., Lehiste, 1976; Pisoni, 1976; Yu, 2010; Gussenhoven and Zhou, 2013). Despite the fact that the Limburgian tones' primary acoustic cue is pitch rather than duration, we think it is possible that speakers of Dutch perceived the pitch difference as a difference in duration. Previous research has shown that native and non-native speakers may give different degrees of attention to acoustic cues under the influence of the different functions and/or distributions of these cues in the L1 (Gandour and Harshman, 1978; Cebrian, 2006; Ueyama, 2000). For example, Gandour and Harshman (1978) showed cross-linguistic differences in the importance attributed to duration as a cue for tone perception, presumably reflecting the different linguistic status of vowel duration in their participants' L1s. In light of the fact that duration is an acoustic cue to lexical contrast in Dutch (i.e., word stress and vowel quantity) and Dutch children's early sensitivity to these contrasts (e.g., Dietrich et al., 2007; de Bree et al., 2008), we propose that the Dutch children in our study could have drawn upon their knowledge of this cue when perceiving a non-native tone contrast.

Anecdotal evidence from adult speakers of Dutch seems to strengthen this claim. Naïve speakers of Dutch who imitate the Limburgian tones tend to lengthen the stressed syllable of accent 2 words relative to accent 1 words (e.g., Ueyama, 2000). The impression that the citation form of accent 2 is longer in duration than the respective accent 1 form could be due to the more complex pitch pattern of accent 2 (H*LH) compared to accent 1 (H*L), assuming that changes in f0 can go hand in hand with a perceptual increase in duration (e.g., Lehiste, 1976; Rietveld and Gussenhoven, 1987; but see Gussenhoven and Zhou, 2013). In fact, Heijmans (2003) reports a formerly tonal dialect just outside the East-Limburgian area in which the tonal contrast was in large part reinterpreted as a length contrast. In future research, Dutch listeners could be presented with tonal minimal pairs and asked to judge explicitly which one sounds longer (e.g., Lehiste, 1976).

Until now, we have assumed different explanations for the behavior of the Limburgian and Dutch toddlers, despite their behavior being comparable. Lastly, we would like to mention the possibility that their behavior can be based on the same explanation. Recall that the only prosodic difference between Limburgian and Dutch is the fact that pitch is lexically relevant in Limburgian. Both languages make use of vowel duration, word stress, and intonation. We therefore cannot exclude the possibility that the Limburgians might not perceive the difference between accent 1 and accent 2 as a pitch contrast, but as a durational contrast.

Another finding that deserves some attention, especially in light of ongoing typological discussions about the phonological status of the Limburgian word prosodic contrast (e.g., Köhnlein, 2016, and references therein), is that Limburgian children were sensitive to MPs of both accent 1 and accent 2. Gussenhoven and Peters (2008) assume that accent 2 is the lexically specified tone, but our data provide no evidence for a perceptual asymmetry due to lexical (under)specification of one of the accents. It is possible that we did not attest an asymmetry due to a lack of power. However, an inspection of the means did not reveal a trend toward such an asymmetry. More research is needed to draw conclusions on this matter.

In the next section, we will turn to the findings from Experiment 2 with Limburgian and Dutch adults.

### The Lexical Encoding of Pitch in Limburgian and Dutch Adults

In line with Quam and Swingley (2010), who used a very similar design, the Limburgian and Dutch adults in our study successfully learned novel word-object pairings. However, both groups showed very strong naming effects in both CP and MP trials, possibly masking effects of Condition and/or Language. Their high recognition scores could either mean that the task was not sensitive enough [but see Quam and Swingley (2010)], or that our participants did not notice a pitch change within a word, or both.<sup>13</sup>

Besides the pitch change condition, Quam and Swingley (2010) also included a vowel MP condition. In this condition, English participants exhibited a marginally significant negative naming effect, whereas they showed a significant positive naming effect in both the pitch MP and in the CP condition. Their effect of Condition thus rested on the significant negative naming effect induced by the vowel MP. They found no significant difference between the performance in pitch MP and CP conditions, which is in line with the behavior of our participants. In a future study, it would be valuable to include one or more segmental MP conditions in addition to a tonal MP condition (e.g., Quam and Swingley, 2010; Singh et al., 2014, 2015).

With respect to our Limburgian participants, it could be that lexical tone in Limburgian does not have the same priority as segments as a cue to word recognition. A similar claim has been made for Japanese (e.g., Goss, 2015). Since adult Limburgians have accumulated ample linguistic experience, they might have learned not to rely heavily on pitch during online language comprehension because of the relatively low functional load of lexical pitch and/or because pitch has no lexical relevance in their second L1, Dutch. However, in light of the finding by Braun et al. (2014), who showed that adult speakers of German were very sensitive to Mandarin tone contrasts in a word learning paradigm, we strongly believe that the absence of effects in our study is due to task effects. To increase the demands on memory load in a future task, we could use disyllabic stimuli and/or teach participants multiple tonal minimal pairs simultaneously (e.g., Braun et al., 2014).

Due to the lack of effects of Language, Condition or Tone in the adult study, we cannot draw conclusions on the phonological status of the Limburgian tone contrast. A lexical accent correctness judgment task (e.g., Goss and Tamaoka, 2015) or a lexical decision task with either phonological priming (e.g., Cutler and Otake, 1999) or semantic priming with tonal MPs could potentially advance our understanding of the lexical status of the Limburgian word prosodic contrast.

One important limitation that we want to mention at this point pertains to the input that both child and adult Limburgian participants were exposed to during the learning phase of the current experiment. Recall that they were presented with multiple tokens of the target word, but that the prosodic context was held constant. That is, participants did not have to deal with the surface variation with which they are usually confronted due to tone-intonation interactions in natural language input. It would be interesting to see how Limburgian toddlers and adults would perform if this surface variation were included in the learning phase.

<sup>13</sup>Yet another factor that could have influenced the results is the specific testing conditions (see footnote 11). Possibly, participants were not attentive enough due to a lack of motivation.

### CONCLUSION

Both Limburgian and Dutch 2.5- to 4-year-old children are sensitive to lexical pitch information in novel words. This indicates that they store pitch information as part of their novel lexical entries. Due to a lack of effects in our adult study, we cannot draw conclusions on the lexical encoding of pitch in Limburgian and Dutch adults. Since pitch is not contrastive at the word-level in Dutch, Dutch listeners should recognize words irrespective of their pitch pattern. Dutch toddlers' sensitivity to word-level pitch probably reflects their growing knowledge of the native prosodic system. They could either have perceived the different pitch patterns in terms of intonation (e.g., Singh et al., 2014), or in terms of vowel duration. The Limburgian toddlers' behavior was in line with our expectations since pitch is assumed to be part of Limburgian lexical representations. The fact that a pitch change only hindered word recognition to a minor extent, and possibly not at all in Limburgian adults, could be due to the specific input conditions that Limburgians are exposed to. Future studies could include speakers of Swedish, since word-level pitch in Swedish also has a relatively low functional load and also shows a relatively high amount of surface variation, to corroborate that functional load and phonetic variability indeed have an impact on lexical tone processing.

### REFERENCES


Bruce, G. (1977). Swedish Word Accents in Sentence Perspective. Lund: Gleerup.

Buckler, H., and Fikkert, P. (2016). Dutch and German 3-year-olds' representations of voicing alternations. Lang. Speech 59, 236–265. doi: 10.1177/0023830915587038

### AUTHOR CONTRIBUTIONS

SR conceptualized the study, recruited participants, collected data, conducted data analyses, interpreted the results, and drafted the manuscript. SB conducted data analyses, interpreted results, and revised the manuscript. PF assisted in the conceptualization of the study, interpreted results, and revised the manuscript.

### FUNDING

This research was supported by a grant from the Netherlands Organization for Scientific Research to SR (NWO Promoties in de Geesteswetenschappen, no. 322-75-001).

### ACKNOWLEDGMENTS

Thanks to all participating parents and their children from Nijmegen and Roermond, the Baby Research Center in Nijmegen, Netherlands, daycare center 'Ot en Sien' in Roermond, and GGD Limburg Noord in Roermond. We would like to thank Carlos Gussenhoven for his assistance in creating the target stimuli and for discussion about the results and earlier versions of this manuscript. We thank Dean Hermans, Chrissy Laurentzen, and Romy Roumans for recruiting and testing adult participants. Thanks also to the research group First Language Acquisition at Radboud University Nijmegen for valuable comments on this manuscript and to Daniel Swingley and Leher Singh for valuable discussion about the experimental design. We also thank the reviewers for their questions and suggestions.


Proceedings of the 32nd Annual Boston University Conference on Language Development (BUCLD 32), eds H. Chan, H. Jacob, and E. Kapia (Somerville, MA: Cascadilla Press), 60–71.


Change in Phonology and Morphology, ed. A. Lahiri (Berlin: Mouton de Gruyter), 215–260.



Jusczyk, P. (1997). The Discovery of Spoken Language. Cambridge, MA: MIT Press.


lexical tone variation. Cognition 142, 1–11. doi: 10.1016/j.cognition.2015.05.010


in Linguistics in the Netherlands 16, eds R. van Bezooijen and R. Kager (Amsterdam: John Benjamins), 1–12.

Van der Feest, S., and Johnson, E. (2016). Input-driven differences in toddlers' perception of a disappearing phonological contrast. Lang. Acquis. 23, 89–111. doi: 10.1080/10489223.2015.1047096

Voorhoeve, J. (1973). Safwa as a restricted tone system. Stud. Afr. Linguist. 4, 1–22.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Ramachers, Brouwer and Fikkert. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## APPENDIX

TABLE A1 | Number and percentage of excluded trials, blocks, and participants per language and age group for the child study.


TABLE A2 | Means, standard deviations, and ranges of proportions of input quantity and quality for the Limburgian children (missing N = 1).


TABLE A3 | Acoustic measurements of the target stimuli.


<sup>∗</sup>TBP stands for tone bearing portion. Note that the TBP for moon stimuli consisted of the entire rhyme (i.e., [o:n]) whereas for taaf stimuli it consisted of the nucleus (i.e., [a:]). Standard deviations in parentheses.

# Cross-Modal Association between Auditory and Visuospatial Information in Mandarin Tone Perception in Noise by Native and Non-native Perceivers

Beverly Hannah<sup>1</sup>, Yue Wang<sup>1</sup>, Allard Jongman<sup>2</sup>\*, Joan A. Sereno<sup>2</sup>, Jiguo Cao<sup>3</sup> and Yunlong Nie<sup>3</sup>

<sup>1</sup> Language and Brain Lab, Department of Linguistics, Simon Fraser University, Burnaby, BC, Canada, <sup>2</sup> Phonetics and Psycholinguistics Laboratory, Department of Linguistics, University of Kansas, Lawrence, KS, United States, <sup>3</sup> Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada

#### Edited by:

Leher Singh, National University of Singapore, Singapore

#### Reviewed by:

Gang Peng, Hong Kong Polytechnic University, Hong Kong

Bencie Woll, University College London, United Kingdom

\*Correspondence: Allard Jongman jongman@ku.edu

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 23 June 2017 Accepted: 10 November 2017 Published: 04 December 2017

#### Citation:

Hannah B, Wang Y, Jongman A, Sereno JA, Cao J and Nie Y (2017) Cross-Modal Association between Auditory and Visuospatial Information in Mandarin Tone Perception in Noise by Native and Non-native Perceivers. Front. Psychol. 8:2051. doi: 10.3389/fpsyg.2017.02051

Speech perception involves multiple input modalities. Research has indicated that perceivers establish cross-modal associations between auditory and visuospatial events to aid perception. Such intermodal relations can be particularly beneficial for speech development and learning, where infants and non-native perceivers need additional resources to acquire and process new sounds. This study examines how facial articulatory cues and co-speech hand gestures mimicking pitch contours in space affect non-native Mandarin tone perception. Native English as well as Mandarin perceivers identified tones embedded in noise with either congruent or incongruent Auditory-Facial (AF) and Auditory-FacialGestural (AFG) inputs. Native Mandarin results showed the expected ceiling-level performance in the congruent AF and AFG conditions. In the incongruent conditions, while AF identification was primarily auditory-based, AFG identification was partially based on gestures, demonstrating the use of gestures as valid cues in tone identification. The English perceivers' performance was poor in the congruent AF condition, but improved significantly in AFG. While the incongruent AF identification showed some reliance on facial information, incongruent AFG identification relied more on gestural than auditory-facial information. These results indicate positive effects of facial and especially gestural input on non-native tone perception, suggesting that cross-modal (visuospatial) resources can be recruited to aid auditory perception when phonetic demands are high. The current findings may inform patterns of tone acquisition and development, suggesting how multi-modal speech enhancement principles may be applied to facilitate speech learning.

Keywords: cross-modal association, gesture, audio-visual, tone perception, Mandarin, English

## INTRODUCTION

From infancy onward, language users are continually tasked with solving the cross-modal binding problem in processing multi-sensory linguistic stimuli (Spence, 2011). According to the theory of embodied-grounded cognition, language processing involves embodying a pre-stored linguistic representation grounded across sensory and motor systems based on communicative contexts (Barsalou, 2008; Borghi et al., 2013). This account predicts joint contributions of sensory and motor systems to perception, indicating that sensory signals from multiple input modalities for the same event must be matched for processing. This requires the perceiver to determine that the inputs are cross-modally associated in a meaningful manner (e.g., spatiotemporally correspondent, semantically congruent) (Ernst and Bülthoff, 2004; Fujisaki and Nishida, 2007; Spence, 2011).

In face-to-face speech interactions, such cross-modal connections may involve integrating auditory acoustic information and visual articulatory configurations (which produce the acoustic output) (Jongman et al., 2003; Munhall et al., 2004), or associating speech with manual gestures (which share similar semiotic representations) (McNeill, 2005; McCafferty, 2006; Kelly et al., 2017). Cross-modal binding may also take the form of cross-sensory equivalence, such as the metaphoric use of spatial stimuli (e.g., high or low) to equate speech information (e.g., high or low pitch) (Bernstein and Edelstein, 1971; Marks, 1987). Moreover, if multi-sensory binding requires reference to pre-stored linguistic representations (Barsalou, 2008), perceivers of different languages may engage in different integration patterns depending on which sensory input they give more weight to (Ernst, 2007; Parise and Spence, 2009; Wang et al., 2009; Hirata and Kelly, 2010).

The present study explores the issues regarding multisensory integration and equivalence in cross-modal association by examining native and non-native Mandarin lexical tone perception with auditory and facial articulatory input as well as hand gestures tracing tonal contours. The goal is to test the strength of the cross-modal association between acoustic information and visuospatial information for tone, to determine whether facial and gestural inputs bias tone perception in a linguistically significant manner and how linguistic experience affects such cross-modal association.

### Facial Cues for Speech

Regarding cross-modal association between auditory and visual facial information, previous research has observed auditory-face vowel binding in infants as young as 2 months of age (Kuhl and Meltzoff, 1982; Patterson and Werker, 2003). During critical times for learning their native language (L1), infants have been found to shift their gaze patterns from looking primarily at the speaker's eyes to the speaker's mouth (Lewkowicz and Hansen-Tift, 2012). Similarly, for adults, complementary visual articulatory information can be recruited to improve signal quality when listening comprehension conditions are less than ideal, such as when learning a second language (L2), listening to non-native speakers, or perceiving speech in noise (Summerfield, 1983; Jongman et al., 2003; Hazan et al., 2006; Wang et al., 2008, 2009; Kawase et al., 2014; Kim and Davis, 2014).

However, the extent to which facial cues enhance speech perception may vary depending on multiple factors, including the nature of the speech input, the intelligibility of the auditory cue, information value and saliency of the visual cue, the linguistic experience of perceivers, and processing load (Hardison, 1999; Hazan et al., 2005). For example, in terms of the nature of the speech input, when conversing in noisy environments or when repeatedly asked for clarification, speakers may modify their speaking style to produce clear speech with exaggerated visual cues for jaw displacement, lip stretching, lip rounding, and duration (Kim and Davis, 2014; Tang et al., 2015). These modifications have been shown to enhance speech intelligibility (Sumby and Pollack, 1954; Gagné et al., 2002). Research has further revealed that the facial areas that may provide the most visual benefit depend on the type of information sought, such as the eyebrows and upper part of the face for prosody or the lower part of the face for word-level information (Cavé et al., 1995; Lansing and McConkie, 1999; Swerts and Krahmer, 2008). Moreover, when the intelligibility of the auditory cue is degraded, such as in a noisy environment, perceivers rely more on visual cues to enhance speech perception (Sumby and Pollack, 1954; Summerfield, 1979; Hazan et al., 2010). With respect to visual saliency, it has been found that the more visually salient sounds (e.g., labial consonants) offer greater perceptual gains than less visually salient sounds (e.g., alveolar and velar consonants) (McGurk and MacDonald, 1976; Hazan et al., 2006, 2010). 
Perceiver experience may also affect the extent of visual benefit, with non-native perceivers (compared to native perceivers) showing increased reliance on visual information and enhanced visual benefit, such as in the perception of audiovisually mismatched McGurk tokens (e.g., auditory /ba/ with visual /ga/) produced by non-native speakers (Sekiyama and Tohkura, 1993; Chen and Hazan, 2007). However, the addition of visual information may be inhibitory, particularly when task demands are high. For example, unfamiliar L2 visual cues may cause difficulty in non-native perception (Wang et al., 2009; Kawase et al., 2014). Likewise, perception may be impeded by excessive processing load, when task and attentional demands exceed perceptual load capacities (Lavie, 1995; Alsius et al., 2005).

### Mapping Facial Cues to Tonal Distinctions

Of particular relevance for the present study is the contribution of facial cues to the perception of lexical tone, a prosodic feature used to distinguish word meanings. While there is consensus among previous studies with respect to how visual cues benefit segmental speech perception, findings on visual effects at the prosodic level have been inconclusive.

Some studies have indeed revealed facial effects on the perception of prosody (Lansing and McConkie, 1999; Chen and Massaro, 2008). Head, neck, and eyebrows are shown to produce visible cues to tone perception (Burnham et al., 2001a,b; Mixdorff et al., 2005; Chen and Massaro, 2008), as well as perception of intonation, stress, and contrastive focus (Yehia et al., 2002; Munhall et al., 2004; Kim et al., 2014). For instance, back and forth head movement may accompany a change in tone contours (Attina et al., 2010), dropping of head or chin may signal a dipping tone (Chen and Massaro, 2008), and eyebrow raising may be related to increases in vocal pitch (Huron and Shanahan, 2013).

However, it has been further revealed that these visual cues are not necessarily used until they are brought to perceivers' attention (Chen and Massaro, 2008), or when listening conditions are challenging, such as in noisy backgrounds (Mixdorff et al., 2008), or in a non-native language (Burnham et al., 2006; Smith and Burnham, 2012). Furthermore, research has not revealed any robust articulatory correlates to tone perception in terms of mouth movements. Although tone production results show different mouth movement patterns for different tones, no perception data are available to validate their roles in intelligibility (Attina et al., 2010). Thus it is not clear whether the facial cues revealed in these studies are articulatorily relevant cues to signal tonal category distinctions or are simply attention-grabbing cues. This is presumably because tonal production does not rely on vocal tract configurations and may thus be less visually salient (and more auditorily dominant). However, it may also be the case that the upward or downward articulatory movements of the head or eyebrows (e.g., head dipping, eyebrow raising or lowering) found in previous research (e.g., Kim et al., 2014) are associated with the raising or lowering of pitch (Huron and Shanahan, 2013). Further research is needed to determine whether facial cues (along with acoustic cues) modulate tonal distinctions in a linguistically significant manner.

### Co-speech Gestures

Co-speech gestures may aid both the production and the perception of speech (see Hostetter, 2011 for a review). The production of co-speech manual gestures can aid speakers in the management of processing load, where gestures in physical space can act as additional visuospatial working memory resources, thereby relieving the speech modality of having to shoulder the entirety of the semiotic burden of communication (Goldin-Meadow et al., 2001; Wesp et al., 2001). Research has also shown that gesturing during speech can improve fluency, especially when retrieving lexical items that contain spatial content (Rauscher et al., 1996).

Speech marked with gestures provides collocutors with concomitant auditory and visual accents, focus cues, and visual representations of the speaker's message (Alibali et al., 2001; Krahmer and Swerts, 2007). Previous studies have indicated that perceiving gestures with speech can indeed aid speech perception in an L1. For example, beat gestures have been shown to alter the perception of word prominence, providing additional parsing and focus cues for the perceiver (Krahmer and Swerts, 2007; Biau and Soto-Faraco, 2013). Co-speech gestures also enable perceivers to more easily represent visuospatial aspects in speech comprehension (Wu and Coulson, 2007). This bimodality increases the total possible communicative value of an utterance (McNeill, 2000), where a mismatch of speech and gesture can express two beliefs at once, signaling a gap in understanding or a transitional state of learning (Goldin-Meadow et al., 1993).

However, the extent to which gestures may facilitate speech perception in an L2 exhibits more complex patterns. On the positive end, the addition of gestural input has indeed been shown to facilitate lexical access, as well as rhythm and syllabification in L2 learners (Sueyoshi and Hardison, 2005; McCafferty, 2006). Gestures have also been used by L2 instructors to aid sentence comprehension and vocabulary learning (Barnett, 1983; Lazaraton, 2004; Gullberg, 2006; Gluhareva and Prieto, 2016). Gestures are not always beneficial to learning, though, especially when the tasks involve a fine-grained phonetic perceptual judgment. For example, iconic gestures were found to aid word learning only when word pairs containing the target phonemic contrasts were highly dissimilar in their surrounding segmental contexts, or when words were learned in isolation (Kelly et al., 2009; Kelly and Lee, 2012). Similarly, hand gestures signaling length of sounds were not effective in helping learners to discriminate phonemic vowel durational differences (Hirata and Kelly, 2010; Hirata et al., 2014; Kelly et al., 2014). Together, these findings may suggest that gestures are more effective for enhancing perception of non-native speech in higher (lexical and sentential) linguistic domains when fine-tuned phonemic distinctions are not the primary focus. When perceivers face highly demanding phonemic tasks that impose a high processing load, additional gestural input may be distracting.

### Mapping Gestures to Tonal Distinctions

Although little research has explored the use of gesture in linguistic pitch processing, the cross-modal association between auditory pitch and spatial movement has been well established in the general cognitive domain (Casasanto et al., 2003). Capturing pitch in gestures owes its inspiration to the illustrative aids of musical expression. For example, to create strong audiospatial connections, early stage learners in the Kodály music education system are encouraged to kinesthetically engage in their experience of music using gestures and physical movements (Houlahan and Tacka, 2008). Music teachers are taught to enhance pitch perception using hand levels and diagrams of melodic contours (Welch, 1985; Apfelstadt, 1988), and young singers are trained to improve their pitch accuracy using such gestures (Liao, 2008; Liao and Davidson, 2016).

Indeed, it has been claimed that pitch is audio-spatial in representation, which implies that pitch perception should inescapably be affected by spatial information (Connell et al., 2013). When participants were asked to represent stimulus sounds gesturally in a three-dimensional space, higher pitch was generally found to correlate with higher elevation in space (Küssner et al., 2014). Moreover, upward and downward hand gestures were shown to bias pitch perception in the direction of the gesture; and such gesture-directed pitch perception bias appeared to be driven by spatial mechanisms rather than verbal labeling strategies, since the bias remained under increased verbal memory load but disappeared under increased spatial memory load (Connell et al., 2013). It has further been found that musicians are more consistent in associating gestures with pitch height and direction in space than non-musicians, indicating that the strength of cross-modal associations can develop with experience (Küssner et al., 2014).

In a linguistic context, gestures have indeed been shown to affect the perception of pitch in a few studies on intonation and lexical tone perception in an L2. Kelly et al. (2017) found that upward or downward hand movement congruous with the direction of the intonational pitch contour (rising or falling, respectively) could facilitate perception of intonation in an L2, whereas incongruous gesture-pitch matching was disruptive. However, the effects of gesture on lexical tone perception and learning are not clear. Morett and Chang (2015) trained English perceivers to learn Mandarin tone words either with or without viewing hand gestures tracing tone contours. The results showed greater post-training improvements for the group who received training with gesture compared to the no-gesture training group. However, this difference only held true for the word-meaning association task, which involved matching the 12 tone words used in training to their respective meanings; whereas for tone identification in both the trained and novel stimuli, the gesture-training group showed no additional benefit over the no-gesture training group. As such, it is not clear whether the facilitative effects exhibited by the gesture-training group could be attributed to effective cross-modal pitch-gesture association or a result of memorization using verbal labeling strategies (cf. Connell et al., 2013, as discussed above), as participants might have memorized the word-meaning associations of the 12 words that appeared in training.

### The Present Study

The empirical findings reviewed above support the predictions of the theory of embodied cognition (Barsalou, 2008; Borghi et al., 2013), in that facial articulatory and co-speech gestural cues can be effectively integrated with auditory-acoustic information to enhance speech perception. However, several issues remain when it comes to cross-modal association in lexical tone processing.

First, regarding the link between auditory and visual information in lexical tone perception, it is not clear to what extent the facial cues (such as head, eyebrow, and mouth movements) revealed in the previous studies (e.g., Burnham et al., 2001a,b; Mixdorff et al., 2005; Chen and Massaro, 2008) are articulatorily required or spatially relevant cues to signal tonal category distinctions, as opposed to merely attention-grabbing cues, since (unlike segments) tone production is not triggered by vocal tract configurations. Similarly, with respect to equating acoustic and spatial information for tone, previous work has not been able to determine if the facilitative effects of gesture on tone learning are due to effective cross-modal pitch-gesture association or arbitrary mnemonic devices (Morett and Chang, 2015). Moreover, research has not directly compared the relative weighting of multiple inputs in tone perception. The contribution of audio, facial and gestural inputs may be affected by the relative saliency of various input modalities such that more salient cues or combinations of cues are weighted more heavily in perception (Hazan et al., 2006; Chen and Hazan, 2007; Wang et al., 2009). However, perception may also be impeded by excessive processing load when too many input modalities are involved, especially for non-native perceivers facing demanding L2 phonemic tasks (Alsius et al., 2005; Hirata and Kelly, 2010).

An investigation of multi-modal tone perception to fill these gaps in the literature has significant theoretical implications with respect to the extent to which speech processing enjoys shared representations across sensory-motor domains. As discussed above, lexical tone provides a unique test case for cross-modal binding due to the nature of its articulation (which is independent of vocal tract configurations), acoustics (with its perceived pitch being visuospatial), and linguistic status (being phonemic and difficult for non-tonal perceivers).

The present study thus examines the cross-modal association in the perception of Mandarin tones with Audio-Facial (AF, involving speaker facial movements in tone production) and Audio-FacialGestural (AFG, also involving speaker hand gestures tracing tone contours) input modalities by native (Mandarin) and non-native (English) perceivers. To test cross-modal association, the combination of auditory and visual tone input was manipulated to be either congruent, where the auditory tone and visual tone match (e.g., A-Rising + F-Rising in AF, or A-Rising + FG-Rising in AFG), or incongruent, where the auditory tone and visual tone are mismatched (e.g., A-Rising + F-Falling, or A-Rising + FG-Falling).

In terms of establishing a meaningful cross-modal association, we hypothesize that, for facial effects, if perceivers are able to effectively incorporate facial tonal cues (e.g., eyebrow raising, head dipping) as non-arbitrary articulatorily or spatially relevant cues, they would more accurately identify tones with congruent (than incongruent) audio and facial input. Likewise, for gestural effects, if perceivers are able to establish an acoustic-visuospatial link for pitch, they would more accurately identify tones when gestural input is available, and would be more accurate with congruent (than incongruent) audio and gestural input. However, if such audio-visual links are arbitrary, resulting from attentional or mnemonic strategies, we should instead find that congruent and incongruent input result in equal performance. Moreover, regarding the relative weighting of the different input modalities, we expect better performance and increased visual weighting in the AFG condition than the AF condition, since hand gestures along with facial movements provide more input resources than facial movements alone. Finally, comparing native and non-native effects, we expect native Mandarin (relative to English) perceivers to rely less on visual information, as they possess firmly established auditory tone categories. In contrast, English perceivers would be more affected by facial and gestural input than Mandarin perceivers, as presumably they need additional resources to process challenging L2 tones. However, the presence of multiple input sources could also be detrimental for non-native perceivers if they increase processing load.

## MATERIALS AND METHODS

This study was carried out with the approval of the Office of Research Ethics at Simon Fraser University (SFU) with written informed consent from all participants.

### Perceivers

Fifty-two native and non-native Mandarin perceivers participated in the perception experiment. The native perceiver group consisted of 26 native speakers of Mandarin (15 female) born and raised in northern China or Taiwan, aged 19–33 (mean: 24). The non-native perceiver group consisted of 26 native speakers of English (14 female) born and raised in Western Canada or the United States, aged 19–30 (mean: 24), with no prior tone language experience or formal musical training (as per the criteria used in Cooper and Wang, 2012). All perceivers reported normal hearing and normal or corrected-to-normal vision, and no history of speech or language disorders.

### Stimuli

#### Characteristics and Types of Stimuli

Eight monosyllabic Mandarin words [(/jε/ pinyin: ye, /joυ/ pinyin: you) × four tones (Level, Rising, Dipping, Falling)] were chosen as the experimental stimuli for this study. Eight additional monosyllabic Mandarin words were used as tone familiarization stimuli (/tw3/ pinyin: duo × four tones) and task familiarization stimuli (/k3/ pinyin: ge × four tones).

Two modalities were recorded for the target stimuli: Audio + Facial, where the speaker's facial movements were presented while speaking a corresponding target stimulus; and Audio + FacialGestural, where the speaker made a matched tone-contour-shaped hand gesture in the space next to their face as indicated by an acetate tone graph on the LCD feedback monitor of the video camera while speaking (e.g., making a high steady left to right hand movement while saying a high-level tone, making a slanted dropping hand movement for a falling tone). The stimuli were further edited to create two incongruent audio-visual stimulus types, where the auditory tone input does not match the facial and gestural tone input (see the section "Stimulus Editing" for details on stimulus editing). Thus, in total, four types of stimuli were developed, as illustrated in **Figure 1**: (1) congruent Audio and Facial tone input (AF-C), (2) incongruent Audio and Facial tone input (AF-I), (3) congruent Audio and FacialGestural tone input (AFG-C), and (4) incongruent Audio and FacialGestural tone input (AFG-I).

#### Speakers and Recording

The experimental stimuli were produced by two (one male, one female) native Mandarin-speaking instructors with experience teaching college-level introductory Mandarin classes. Two additional native Mandarin speakers (one male, one female) produced the audio-only tone familiarization stimuli. The speakers were aged 25–35 and reported no history of speech or language disorders.

Audio-visual recordings were made of the speakers producing the target tokens in citation form in the AF condition in a sound-attenuated booth at the Language and Brain Lab at SFU. The speakers were positioned such that their head, eyebrow, and mouth movements were clearly visible. These facial features were kept neutral except during speech, when facial actions consisted of mouth/jaw opening, lip stretching, and head dipping. Neck bulges and ligament movements were also consistently visible. In the AFG condition, recorded after the AF condition, the same speakers were additionally asked to simultaneously say each token and trace a matching tone contour in the space next to their face as indicated by an acetate graph on the LCD feedback monitor of the video camera. Speakers started with their mouths closed and hands lowered, and returned to the rest position between tokens. The Mandarin characters and Pinyin romanizations were presented to the speakers via PowerPoint slides. Videos were captured on a high definition camcorder (Canon Vixia HF30) at a recording rate of 30 fps. Concurrent high quality audio was recorded using a Shure KSM109 microphone at 48 kHz. All speakers provided written consent to have their images included in publications.

FIGURE 1 | Four types of experimental stimuli, exemplified using the syllable ye with Level tone (yē) and Dipping tone (yě): (1) upper-left panel – congruent Audio and Facial tone input (AF-C): yě, (2) lower-left panel – incongruent Audio and Facial tone input (AF-I): audio yē + video yě, (3) upper-right panel – congruent Audio and FacialGestural tone input (AFG-C): yě, and (4) lower-right panel – incongruent Audio and FacialGestural tone input (AFG-I): audio yē + video yě.

#### Stimulus Editing

The videos were edited using Final Cut Pro X to contain one token per stimulus, with the separately recorded high quality audio replacing the audio track captured by the on-camera microphone, using an automated FFMPEG script.

The intensity of the audio track for each stimulus video was normalized to 65 dB SPL. As the stimuli were derived from the Mandarin perceivers' native language, tone identification tasks in clear (no-noise) audio would likely result in ceiling performance based exclusively on the audio input for the Mandarin group. In order to improve measurement sensitivity and facilitate the examination of audio-facial and audio-facialgestural associations, the stimuli were embedded in cafeteria noise following previous research (e.g., Wang et al., 2008, 2009). The signal-to-noise ratio (SNR) was established empirically in a pilot in which Mandarin and English participants were tested on a smaller subset of audio-only stimuli embedded in noise at SNRs of +10, +5, 0, −5, −10, −12, −15, and −18 dB. At −12 dB, the error rate was 15% for the Mandarin group and 54% for the English group. At −15 dB, the tonal information started to be significantly masked by noise, resulting in poorer than chance-level performance (particularly for the English group). On the other hand, at −10 dB, the Mandarin group's performance was close to ceiling (with less than 10% error rate). Thus, −12 dB was adopted as the optimal SNR, keeping the Mandarin group below ceiling without completely masking the tonal information. Consequently, the 65 dB SPL audio track for each stimulus was embedded in 77 dB SPL cafeteria noise using FFMPEG.
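The level arithmetic behind the chosen SNR can be sketched as follows. This is a minimal illustration, not the FFMPEG procedure used in the study: the function names (`snr_db`, `rms`, `noise_gain_for_snr`) and the toy waveforms are assumptions introduced here, and the sketch assumes that SNR is computed from RMS amplitudes.

```python
import math


def snr_db(signal_level_db, noise_level_db):
    """SNR in dB for two levels given on the same dB scale (e.g., dB SPL)."""
    return signal_level_db - noise_level_db


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def noise_gain_for_snr(signal, noise, target_snr_db):
    """Linear gain for `noise` so that the signal-to-noise RMS ratio
    equals `target_snr_db` after scaling (hypothetical helper)."""
    current_snr_db = 20 * math.log10(rms(signal) / rms(noise))
    return 10 ** ((current_snr_db - target_snr_db) / 20)


# Levels reported in the text: 65 dB SPL speech in 77 dB SPL noise.
print(snr_db(65, 77))  # -12

# Toy waveforms (made-up values) to show the scaling step.
speech = [1.0, -1.0, 1.0, -1.0]
babble = [0.5, -0.5, 0.5, -0.5]
gain = noise_gain_for_snr(speech, babble, target_snr_db=-12)
scaled = [gain * x for x in babble]
achieved = 20 * math.log10(rms(speech) / rms(scaled))
print(round(achieved, 6))  # -12.0
```

At −12 dB the speech RMS amplitude is about a quarter of the noise RMS amplitude (10^(−12/20) ≈ 0.25), which helps explain why tone identification becomes difficult even for native perceivers.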

The videos were mirrored horizontally so that the tone contour trace in AFG videos would travel left to right for perceivers during the experiment. AF videos were also mirrored for consistency. Each video was 4 s long to ensure that all the articulatory and gestural movements were captured.

For each modality, syllable and speaker, each auditory tone was paired with a tone-congruent video as well as the three other tone-incongruent videos, producing four tone-congruent pairings (one for each tone) and 12 tone-incongruent pairings (with all the possible audio and visual tone pairings differing in tone, e.g., audio-Level + video-Rising). Thus, for example, a tone-incongruent AFG auditory Level tone, visual Rising tone (AFG-A1V2) stimulus would contain the visual track from the original AFG Rising tone recording paired with the auditory track from the AFG Level tone recording. All videos presented during the experiment were cross-spliced in this manner, including the tone-congruent ones, in order to keep the treatment consistent across all stimuli. To accomplish this, a tone-congruent AFG auditory Level tone, visual Level tone (AFG-A1V1) stimulus would contain the visual track from the original AFG recording paired with the auditory track from the AF recording.

Additionally, to address the potential effects of the durational differences across tones (particularly for the tone-incongruent stimuli) in the audio–video pairings, an auditory duration-modified set of stimuli was created. For each stimulus, syllable onsets and offsets for the audio track of each video were manually marked, and the syllable durations were extracted in Praat (Boersma and Weenink, 2015). For each tone pairing, the durational difference between the original and replacement audio tones was calculated, and a stretch/compression factor was applied to the replacement tone based on the original tone duration. Then the replacement audio file was stretched or compressed accordingly. These duration-modified audio segments were then overlaid onto the original video using Final Cut Pro X, aligning the replacement audio with the syllable onset of the original. Thus, the duration-modified pairings were well matched for duration in the auditory and visual input. However, considering that the stretching or compressing of the audio files may affect the (spectral and temporal) naturalness of tones, the duration-unmodified condition with natural audio was also retained. Both sets of stimuli were presented to perceivers as part of the experiment.
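The duration-matching step can be illustrated with a short sketch. It assumes the stretch/compression factor is the ratio of the original syllable duration to the replacement syllable duration, which the text implies but does not state as a formula. The function names and example durations are hypothetical, and the actual time-stretching of the audio is not shown.

```python
def stretch_factor(original_duration_s: float, replacement_duration_s: float) -> float:
    """Factor by which the replacement audio must be scaled in time.

    Values above 1 stretch the replacement audio, values below 1 compress it.
    """
    if original_duration_s <= 0 or replacement_duration_s <= 0:
        raise ValueError("durations must be positive")
    return original_duration_s / replacement_duration_s


def modified_duration(replacement_duration_s: float, factor: float) -> float:
    """Duration of the replacement audio after applying the factor."""
    return replacement_duration_s * factor


# Hypothetical durations: a 0.50 s original syllable and a 0.40 s replacement.
factor = stretch_factor(0.50, 0.40)
print(round(factor, 2))                           # stretch factor
print(round(modified_duration(0.40, factor), 2))  # matches the original duration
```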

In total, each participant perceived 384 test stimuli (192 AF, 192 AFG) over the course of two sessions. Each of the two modalities consisted of 96 incongruent trials (12 tone pairings × 2 syllables × 2 duration modification conditions × 2 speakers) and 96 congruent trials (4 tone pairings × 2 syllables × 3 repetitions × 2 duration modification conditions × 2 speakers). Moreover, eight additional audio files (4 tones × 1 syllable "duo" × 2 speakers) not embedded in noise were included as tone familiarization stimuli, and eight additional tone-congruent, noise-embedded audio–video files in each (AF, AFG) modality (4 tones × 1 syllable "ge" × 2 speakers) were prepared as task practice stimuli before the experiment.
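The trial counts follow directly from the factorial design, as the sketch below confirms. The factor levels are taken from the text; the dictionary and function names are illustrative only.

```python
from math import prod

# Factor levels per modality, as listed in the text.
INCONGRUENT_FACTORS = {
    "tone pairings": 12,
    "syllables": 2,
    "duration modification conditions": 2,
    "speakers": 2,
}
CONGRUENT_FACTORS = {
    "tone pairings": 4,
    "syllables": 2,
    "repetitions": 3,
    "duration modification conditions": 2,
    "speakers": 2,
}
MODALITIES = ("AF", "AFG")


def trial_count(factors: dict) -> int:
    """Number of trials in a fully crossed design."""
    return prod(factors.values())


incongruent = trial_count(INCONGRUENT_FACTORS)
congruent = trial_count(CONGRUENT_FACTORS)
per_modality = incongruent + congruent
total = per_modality * len(MODALITIES)

print(incongruent, congruent, per_modality, total)
```

The three repetitions of each congruent pairing are what bring the congruent trials up to the same number as the incongruent trials.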

All AF and AFG tokens were evaluated by two native Mandarin speakers for accuracy as well as audio–video quality check. The speakers correctly identified the stimuli and rated them as satisfactory exemplars of the intended tones and gestures.

### Procedures

The experiments were conducted in a sound-attenuated perception booth at the Language and Brain Lab at SFU. Stimuli were presented using Paradigm Stimulus Presentation software (Perception Research Systems, 2007) on 15-inch LCD monitors. Video stimuli were presented at 1024 × 576 resolution, and audio was presented using AKG circumaural headphones. Participants were scheduled for two 1-h test sessions, separated by a break of at least 1 h. They were compensated with their choice of \$30 cash or SFU course credits.

Prior to the test sessions, participants were introduced to the Mandarin lexical tone system with the eight tone-familiarization stimuli described above, presented in an audio-only condition with the tone descriptors "Level," "Rising," "Dipping," and "Falling" provided visually on the screen. These descriptors capture height and direction of auditory pitch as well as gestural movements. Participants subsequently practiced identifying these tone stimuli by pressing the corresponding buttons on the keyboard. No feedback was given. The participants were required to perform above chance before continuing. They all met this inclusion criterion and none had to repeat the task (Mandarin group mean: 97.3%, SD: 7.3%; English group mean: 72.0%, SD: 16.4%).

After completion of the tone-familiarization exercise, participants moved on to the test sessions. Each session contained half of the stimulus set and was further divided into two test blocks by modality of presentation: AF and AFG. Each block consisted of four practice trials, followed by 96 experimental trials. The speakers, syllables, tones, tone congruency, and duration modification factors were randomized for presentation within each block. Block presentation was counter-balanced across session and participants. The task in each experimental trial required perceivers to watch and listen to a stimulus video of a speaker producing a target token, and then respond to the question "Which tone did you perceive?" by identifying the tone as Level, Rising, Dipping, or Falling, and pressing the correspondingly labeled button on the keyboard. Perceivers were instructed to respond as quickly as possible after the response screen appeared, and were given a maximum of 4 s to respond.

### RESULTS

### Effects of Audio-Visual Congruency

First, to evaluate tone perception as a function of the congruency of auditory and visual (facial gestural) input, tone perception accuracy was compared in the congruent and incongruent conditions. The auditory input served as the basis for tone accuracy measurements in the incongruent conditions. **Figure 2** illustrates these congruency comparisons.

The data were submitted to multilevel mixed effect logistic regression with Congruency (Congruent, Incongruent), Group (Mandarin, English), and Modality (AF, AFG) as fixed factors. A random effect was added on the intercept term to account for different perceivers. Factors peripheral to the focus of the study were also adjusted for in the analysis, including Duration modification (Modified, Unmodified), Tone (Level, Rising, Dipping, Falling), Speaker gender (Male, Female), Syllable (Ye, You), and Repetition (1, 2, 3). The estimated coefficients, summarized in **Table 1** below, reveal significant main effects of Congruency, Group, and Modality, as well as main effects of Duration modification and Tone. For brevity, only significant interactions involving Congruency, Group, and Modality (the main factors of concern here) are reported. The significant effects involving Duration modification and Tone will be further analyzed in the Sections "Effects of Duration of the Auditory and Visual Input" and "Effects of Individual Tones," respectively.

As shown in **Table 1**, the logistic regression revealed a significant three-way interaction of Congruency × Group × Modality, as well as two-way interactions of Congruency × Group and Congruency × Modality. To further assess these interactions, likelihood ratio tests for each modality were conducted to determine whether including a Congruency × Group interaction term would improve the model fit compared to a reduced model excluding the interaction term but retaining Congruency, Group, Duration Modification, Tone, Speaker gender, Syllable, and Repetition as factors. Significant interactions of Congruency × Group were found for both AF [χ²(1) = 11.50, p < 0.001] and AFG [χ²(1) = 17.77, p < 0.001] modalities. With the same approach, significant Congruency × Modality interactions were observed for the Mandarin [χ²(1) = 155.90, p < 0.001] and English [χ²(1) = 383.64, p < 0.001] groups.
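The logic of a likelihood ratio test with one added interaction term can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: the log-likelihood values are hypothetical, the function names are assumptions, and the p-value formula is the standard upper tail of a chi-square distribution with one degree of freedom.

```python
import math


def likelihood_ratio_statistic(loglik_reduced: float, loglik_full: float) -> float:
    """Test statistic comparing a full model against a nested reduced model."""
    return 2 * (loglik_full - loglik_reduced)


def chi2_sf_df1(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if statistic < 0:
        raise ValueError("statistic must be non-negative")
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical log-likelihoods chosen so that the statistic equals 11.50.
statistic = likelihood_ratio_statistic(loglik_reduced=-105.75, loglik_full=-100.0)
print(statistic, chi2_sf_df1(statistic) < 0.001)

# Statistics reported in the text for the Congruency x Group term.
for reported in (11.50, 17.77):
    print(reported, chi2_sf_df1(reported) < 0.001)
```

Both reported statistics fall below the 0.001 threshold under this calculation, in line with the p-values given in the text.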

These significant interactions motivated further comparisons of Congruency within each Modality and Group, using Wald tests. First, in the AF modality, for Mandarin perceivers, accuracy was unexpectedly higher in the incongruent condition (AF-I Mean: 97.7%, SD: 5.5%) than the congruent condition (AF-C Mean: 96.8%, SD: 4.6%) [AF-C/AF-I = 0.61, CI = (0.42, 0.91), z = 2.51, p = 0.012], although it should be noted that performance was close to ceiling in both congruency conditions (**Figure 2**). In contrast, for the English group, tone accuracy was significantly higher in AF-C (Mean: 49.1%, SD: 23.5%) than in AF-I (Mean: 43.1%, SD: 20.2%) [AF-C/AF-I = 1.37, CI = (1.19, 1.58), z = −4.53, p < 0.001], showing the expected positive effects when congruent auditory and facial information was presented. The positive effects of congruency were also revealed in the AFG modality, where congruent auditory and facial-gestural input (AFG-C) produced higher tone accuracy compared to incongruent input (AFG-I) for both the Mandarin (AFG-C Mean: 98.7%, SD: 2.4%; AFG-I Mean: 88.1%, SD: 22.2%) [AFG-C/AFG-I = 29.88, CI = (17.13, 52.11), z = −12.21, p < 0.001] and English (AFG-C Mean: 73.2%, SD: 19.0%; AFG-I Mean: 32.3%, SD: 23.1%) [AFG-C/AFG-I = 8.99, CI = (7.61, 10.62), z = −26.42, p < 0.001] groups.
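For readers who wish to relate the reported odds ratios and confidence intervals to the Wald statistics, the sketch below recovers an approximate z value from an odds ratio and its interval. It assumes the intervals are 95% Wald intervals that are symmetric on the log-odds scale, which the text does not state explicitly. Because the reported values are rounded, the result only approximates the published statistic, and its sign depends on which condition serves as the reference level.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def wald_z_from_odds_ratio(odds_ratio: float, ci_lower: float, ci_upper: float) -> float:
    """Approximate Wald z from an odds ratio and its (assumed) 95% Wald interval."""
    log_odds_ratio = math.log(odds_ratio)
    # The interval spans 2 * Z_95 standard errors on the log-odds scale.
    standard_error = (math.log(ci_upper) - math.log(ci_lower)) / (2 * Z_95)
    return log_odds_ratio / standard_error


# Mandarin AF comparison reported in the text: AF-C/AF-I = 0.61, CI = (0.42, 0.91).
z = wald_z_from_odds_ratio(0.61, 0.42, 0.91)
print(round(abs(z), 1))  # magnitude close to the reported z = 2.51
```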

In sum, the results demonstrate more effective perception with congruent (than incongruent) auditory and visual (facial/gestural) input for both native (Mandarin) and non-native (English) perceivers, with the exception of the Mandarin AF condition where ceiling performance was observed.

TABLE 1 | Summary of mixed effect logistic regression model for tone identification accuracy.

### Effects of Input Modality

To determine the extent to which the AFG modality relative to AF affected tone perception for both Mandarin and English perceivers, congruent audio-visual trials were analyzed using logistic regression with Group and Modality as fixed effects, with model adjustments for the same peripheral and random factors as reported in the section "Effects of Audio-visual Congruency." A significant main effect of Modality was observed across groups, where tone identification was more accurate in AFG (Mean: 86.0%, SD: 18.6%) than in AF (Mean: 72.9%, SD: 29.4%), [AFG/AF = 3.56, CI (3.11, 4.08), z = 18.70, p < 0.001]. A significant main effect of Group was observed across modalities, with Mandarin perceivers (Mean: 97.8%, SD: 3.8%) outperforming English perceivers (Mean: 61.2%, SD: 24.4%), [Mandarin/English = 48.42, CI (26.05, 90.02), z = 12.38, p < 0.001]. No significant Modality × Group interaction was observed [χ²(1) = 1.51, p = 0.22].

Despite the lack of interaction, the Mandarin perceivers' near-ceiling performance in both AF and AFG conditions motivated further Wald tests comparing Modality for each Group. The results confirmed that tone accuracy was superior in AFG compared to AF for both native Mandarin perceivers [AFG/AF = 2.78, CI (1.81, 4.27), z = 4.77, p < 0.001] and English perceivers [AFG/AF = 3.55, CI (3.10, 4.07), z = 18.67, p < 0.001]. A significant effect of Group for each Modality was observed, with native Mandarin perceivers outperforming English perceivers in both AF [Mandarin/English = 47.94, 95% CI (24.77, 92.75), z = 11.50, p < 0.001] and AFG [Mandarin/English = 44.25, CI (20.28, 96.54), z = 9.50, p < 0.001]. **Figure 3** illustrates these differences in performance by Modality and Group.

These results indicate the benefit of gesture, where AFG produced higher tone identification rates compared to AF for both native (Mandarin) and non-native (English) groups. The perceptual benefits of gesture were more pronounced for the non-native group, as the native perceivers achieved very high tone identification accuracy rates with and without gesture, as expected.

### Effects of Perceptual Weighting of Auditory and Visual Input

In order to quantify the effects of visual (facial and gestural) relative to auditory information on perception in incongruent AF and AFG conditions, a perceptual weighting analysis in the present section sorted perceiver responses for each token into three categories: correct response based on auditory tone input (A), correct response based on visual tone input (V), or one of the remaining (Other, O) two tones (since participants were given all four tones as response options). For example, in the case of a token consisting of an audio Rising tone cross-spliced with a visual Falling tone, a Rising response would be coded as A, Falling as V, and a Level or Dipping tone response as O. **Figure 4** shows the varying proportion of responses by Modality and Group, categorized as A, V, and O, for the Mandarin and English perceiver groups in AF and AFG conditions.
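The coding scheme described above can be expressed as a small function. The following is a sketch under the stated definitions of A, V, and O; the function names and the example trials are hypothetical.

```python
from collections import Counter

TONES = ("Level", "Rising", "Dipping", "Falling")


def code_response(audio_tone: str, visual_tone: str, response: str) -> str:
    """Code a response to an incongruent token as A, V, or O."""
    for tone in (audio_tone, visual_tone, response):
        if tone not in TONES:
            raise ValueError(f"unknown tone: {tone}")
    if audio_tone == visual_tone:
        raise ValueError("the weighting analysis applies to incongruent tokens only")
    if response == audio_tone:
        return "A"
    if response == visual_tone:
        return "V"
    return "O"


def response_proportions(trials: list) -> dict:
    """Proportion of A, V, and O responses over (audio, visual, response) trials."""
    if not trials:
        raise ValueError("at least one trial is required")
    counts = Counter(code_response(*trial) for trial in trials)
    return {category: counts[category] / len(trials) for category in ("A", "V", "O")}


# The example from the text: audio Rising cross-spliced with visual Falling.
print(code_response("Rising", "Falling", "Rising"))   # A
print(code_response("Rising", "Falling", "Falling"))  # V
print(code_response("Rising", "Falling", "Dipping"))  # O
```

Because every response falls into exactly one category, the three proportions for a perceiver always sum to one.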

For each Group and Modality, Friedman's tests with subsequent Wilcoxon–Nemenyi–McDonald–Thompson post hoc tests were conducted to determine the rank order of the participant responses in the A, V, and O response categories. Within-group and modality weighting of A and V proportions were then evaluated using pairwise t-tests. These within-group proportions were subsequently submitted to two-sample t-tests in order to determine the differences in perceptual weighting between modalities.

For Mandarin perceivers in the AF modality, Friedman's test results indicated that the A, V, and O response categories were not equally preferred [χ²(2) = 11.69, p < 0.001]. As expected, the A-based responses were significantly greater than both V-based and O responses (ps < 0.001), whereas the latter two categories did not differ significantly (p = 0.306). Likewise, in AFG, significant differences among response categories were also observed [χ²(2) = 13.06, p < 0.001]; responses to A significantly outweighed V, which in turn outweighed O (ps < 0.001). However, between-modality comparisons using two sample t-tests showed that the A-based responses were significantly greater in AF than in AFG [t(25) = 2.49, p = 0.020], whereas the V-based responses were significantly greater in AFG than AF [t(25) = 2.31, p = 0.029].

For the English group, the AF condition also revealed significant differences in audio-visual weighting [χ²(2) = 10.61, p < 0.001], with post hoc tests indicating greater A-based responses over O, which in turn significantly outranked V-based responses (ps < 0.001). In the AFG condition, significant differences between category responses were observed as well [χ²(2) = 11.69, p < 0.001]. However, in contrast to the other results of A-dominant response patterns, with gesture, English perceivers' responses following the visual input increased to the extent that V exceeded both the A and O categories (ps < 0.001); while the latter two did not differ (p = 0.079). Comparisons between the AF and AFG modalities showed that English perceivers gave significantly more A responses in AF than AFG [t(25) = 4.63, p < 0.001], whereas they gave significantly more V responses in AFG than in AF [t(25) = 8.67, p < 0.001].

A: % correct responses based on audio tone input, V: % correct responses based on visual tone input, O: % other tone responses; MAND: Mandarin group, ENG: English group.

Standard deviation results of the incongruent conditions at the group level have thus far suggested an inverse relationship between the variables of auditory (A) and visual (V) response, where A decreases when V increases. Pearson correlation coefficients were calculated to determine the linearity of this relationship in each Group and Modality. Overall, strong negative correlations were found for A and V tone responses for all groups and modalities in the incongruent conditions. In the Mandarin group, significant negative correlations were found for both AF (r = −0.97, n = 26, p < 0.001) and AFG (r = −0.98, n = 26, p < 0.001). Similarly, in the English group, there were significant negative correlations in AF (r = −0.88, n = 26, p < 0.001) as well as in AFG (r = −0.88, n = 26, p < 0.001). **Figure 5** illustrates these relationships by plotting % Audio Tone against % Visual Tone for each Group, Modality, and Participant.
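The correlation between audio-based and visual-based response rates can be computed per participant as sketched below. The percentages are invented for illustration and are not the study's data; the function name is likewise an assumption.

```python
import math


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(ss_x * ss_y)


# Hypothetical per-participant percentages of A and V responses.
audio_pct = [95, 80, 60, 40]
visual_pct = [3, 15, 30, 50]
r = pearson_r(audio_pct, visual_pct)
print(r < 0)  # True: V responses rise as A responses fall
```

Since A, V, and O proportions sum to 100% for each participant, a strong negative correlation between A and V is expected whenever O responses are relatively rare.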

Finally, cross-group comparisons using pairwise t-tests showed that both Mandarin [t(25) = 2.31, p = 0.029] and English perceivers [t(25) = 8.67, p < 0.001] significantly increased their V weighting from AF to AFG. Within each group, the differences for audio and visual responses between AF and AFG were then calculated for each participant. A two sample t-test of the differences revealed that the increase in visual weighting with the inclusion of gesture in the AFG condition was significantly greater for English perceivers (22.8%) than for Mandarin perceivers (8.5%) [t(50) = 3.18, p = 0.003].

To summarize, the analysis of the incongruent data showed that in stimuli where auditory and visual cues for tone were mismatched, both native (Mandarin) and non-native (English) perceivers increased their visual weighting when highly salient gesture cues were available in the AFG modality (as compared to AF). Furthermore, the non-native group weighted the visual tone even more highly than the auditory tone input when gestures were present.

### Effects of Duration of the Auditory and Visual Input

As discussed in the section "Speakers and Recording," two sets of stimuli were created for the incongruent stimuli: the duration-modified set with modified audio tone duration to match the duration of the visual tone, and the duration-unmodified set with the natural audio tone duration retained. The main effect of Duration modification in the full model logistic regression in the section "Effects of Audio-visual Congruency" motivated further analysis on the incongruent data to determine if durational congruency affects perception as a function of Modality and Group. A likelihood ratio test between the full model (including all two- and three-way interactions) and the reduced model excluding the interaction term indicated no significant Group × Modality × Duration modification interaction for either the Auditory-based responses [χ²(1) = 0.60, p = 0.44] or the Visual-based responses [χ²(1) = 0.7244, p = 0.395]. This result indicated that duration modification affected all groups and modalities in the same way, and therefore no further analysis was undertaken.

### Effects of Individual Tones

The significant main effect of Tone (p = 0.012) observed in the full model logistic regression in the section "Effects of Audio-visual Congruency" motivated additional analyses of potential individual tone effects as functions of Modality and Group, for both the audio-visual congruent and incongruent data. First, likelihood ratio tests between the full model and the reduced model, which excluded the interaction term, were used to assess the Tone × Modality × Group interactions. If significant interactions were found, further Friedman's tests with Wilcoxon–Nemenyi–McDonald–Thompson post hoc tests were employed to tease apart the differing effects of individual tones in each modality and for each group. **Figure 6** illustrates individual tone perception in AF and AFG for Mandarin and English perceivers in terms of (a) percent correct identification in the congruent conditions, (b) percentage of responses matching the auditory tone in the incongruent conditions, and (c) percentage of responses that matched visual tone in the incongruent conditions.

Likelihood ratio tests of the Congruent data (**Figure 6A**) revealed a significant interaction of Group × Modality × Tone [χ²(3) = 16.561, p < 0.001]. Although identification accuracy for each tone was generally very high in the Mandarin perceiver group, Friedman's test showed significant differences in the AF condition [χ²(3) = 32.92, p < 0.001], with post hoc pairwise comparisons indicating significantly higher accuracy for Rising, Dipping, and Falling tones than Level tone (ps < 0.001). A significant tone effect was also observed in AFG [χ²(3) = 10.61, p = 0.014], showing better perception of Dipping and Falling tones than Rising tone (ps = 0.031). A significant effect of tone was also present for the English group in the AF condition [χ²(3) = 16.27, p = 0.001], with Dipping tone performing better than both Rising and Falling tones (ps ≤ 0.017). Likewise, in AFG, the significant tone effect [χ²(3) = 13.49, p = 0.004] was due to better identification of Dipping than Rising tone (p = 0.002).

Analysis of the incongruent data based on audio responses (**Figure 6B**) only revealed a significant Group × Tone interaction [χ²(3) = 55.96, p < 0.001]. Across modalities, significant differences between tones were found for the Mandarin group [χ²(3) = 21.01, p < 0.001], but not for the English group [χ²(3) = 2.16, p = 0.130]. For the Mandarin perceivers, Dipping tone outperformed Rising tone, and Falling tone outperformed both Level and Rising tones (ps ≤ 0.024).

The video response analysis of the incongruent data (**Figure 6C**) also only revealed a significant Group × Tone interaction [χ²(3) = 38.76, p < 0.001]. Across modalities, significant differences between tones were found in both Mandarin [χ²(3) = 32.30, p < 0.001] and English [χ²(3) = 17.63, p < 0.001]. For the Mandarin group, Dipping tone responses were greater than all the other tones (ps ≤ 0.003). For the English group, Level, Rising, and Dipping tones all outperformed Falling tone (ps ≤ 0.020). Moreover, the Modality × Tone interaction was also significant [χ²(3) = 25.97, p < 0.001]. Across groups, significant differences between tones were found in AF [χ²(3) = 3.18, p = 0.008] and AFG [χ²(3) = 5.69, p < 0.001] modalities. In AF, Rising was better than Falling tone (p = 0.008); and in AFG, Dipping was better than all the other tones (ps < 0.001).

Overall, the most notable result of the individual tone analysis was how frequently Dipping tone was observed to outperform the other tones on the measures that included a visual component, especially in the AFG modality.

### Summary

Taken together, the results show better performance with congruent (than incongruent) auditory and visual (facial/gestural) input across groups, indicating that perceivers are able to make cross-modal associations between acoustic, visual articulatory, and spatial pitch information. This association was not caused by durational congruency effects. Furthermore, the addition of gestural input increases perceptual accuracy as well as visual weighting over facial input, across all tones and especially for the Dipping tone. The perceptual benefits of visual, particularly gestural, input were more pronounced for the non-native group than the native group.

### DISCUSSION

### Facial Effects

For facial congruency effects, we hypothesized that if perceivers could effectively integrate visual facial cues as articulatorily relevant cues, they would achieve better performance in the audio-facial congruent than incongruent condition. While native Mandarin perceivers did not show the expected congruency effects, presumably due to their ceiling-level performance, the non-native results support our hypothesis in that English perceivers could more accurately identify tones with congruent (than incongruent) auditory and facial information. The English results are consistent with previous findings that facial cues for tone are more likely used by non-native perceivers who find themselves in a challenging non-native phonetic situation, than by native perceivers (Chen and Massaro, 2008; Smith and Burnham, 2012). The fact that the current stimuli were presented in cafeteria noise further added to the challenge. In fact, previous research has shown that, with increased auditory noise, perceivers' attention increasingly shifts from the eyes to the mouth, indicating a shift to articulatorily relevant cues (Vatikiotis-Bateson et al., 1998). The current non-native performance indeed suggests that the perceivers may have been able to incorporate specific facial movements as articulatorily relevant cues for tone, such as head movements previously found to be tone-relevant (Attina et al., 2010; Kim et al., 2014). Although the current design does not allow us to specify which particular cues contributed to the perceptual patterns, the congruency comparisons indicate that these cues are not just arbitrary cues, since if perceivers did not associate visual cues with auditory cues in a meaningful manner, their perception would not have been different for the congruent and incongruent audio-visual stimuli. Although the results from Mandarin perceivers did not directly support our hypothesis, their high performance in both congruent and incongruent conditions indicates that they possess firmly established auditory tonal categories sufficient for accurate perception, making it less likely for them to be misled by incongruent facial information; as was also shown in previous research (Wang et al., 2008).

FIGURE 6 | Individual tone (Level, Rising, Dipping, Falling) perception by Group (Mandarin, English) and Modality (AF, AFG). (A) Percent correct identification in audio-visual congruent conditions. (B) Percentage of responses matching the auditory tone in incongruent conditions. (C) Percentage of responses matching the visual tone in incongruent conditions.

The perceivers' different weighting of audio-visual information is further evidenced by their response patterns in the incongruent condition, where the responses following audio versus visual input were 97.7% and 0.7%, respectively, for Mandarin perceivers, but 43.2% and 21.2%, respectively, for English perceivers. These patterns support our hypothesis that the weighting of auditory and visual information would vary depending on the language background of the perceivers. Consistent with previous findings (Mixdorff et al., 2005), the Mandarin group relied almost exclusively on auditory input, which was sufficient for their accurate perception. The English group, on the other hand, showed a greater reliance on facial information, as was also shown in prior studies (Smith and Burnham, 2012).

Categorizing the current non-native results by individual tone, Dipping tone perception tended to be more accurate than most of the other tones in the audio-facial congruent condition; although in the incongruent condition, the Dipping tone was not more frequently responded to than the other tones following either auditory or visual input. The superior performance in the perception of the audio-facial congruent Dipping tone demonstrates effective integration of auditory and facial cues, indicating the recruitment of valid visual cues in perception. Indeed, in tone perception, Dipping tone is the candidate with the most visually salient features relative to other tones, with a noticeable head dipping or jaw lowering motion corresponding to the articulatory configuration for the turning point of the tone (Chen and Massaro, 2008; Smith and Burnham, 2012).

In face-to-face conversation, there is no reasonable expectation of hearing sounds issuing from a speaker that are in direct opposition to their articulatory configurations, so both the native and non-native performance patterns can be reasonably accounted for in relation to strategically recruiting articulatorily relevant auditory and visual cues in a complementary but integrative manner as needed in perception.

### Gestural Effects

In the present study, both native and non-native perceivers were more accurate in the congruent (than the incongruent) condition, where the hand movements were in the same direction and shape as the tone contours, supporting our hypothesis that perceivers were able to make a cross-modal connection between the tone gesture and the auditory tone. These results compare well with the findings of intonational contrasts that hand-intonation contour congruent gestures resulted in greater accuracy than incongruent gestures (Kelly et al., 2017), and that (non-speech) pitch perception could be swayed upward or downward in the direction of the gesture (Connell et al., 2013; Küssner et al., 2014). The results further confirmed the nature of the gesture-pitch association in the perception of phonemic tone that was not identified in previous research (e.g., Morett and Chang, 2015). That is, the association is due to the audio-spatial nature of pitch (Connell et al., 2013) rather than memorization of arbitrary labeling of a gesture with a specific tone since perceivers' performance in the congruent and incongruent conditions was different.

In spite of the positive gesture-pitch association across groups, native and non-native perceivers exhibit different audio-visual weighting patterns with the addition of gestural input. Although the Mandarin group showed increased visual weighting as an effect of adding tone gestures to the visual stimuli, their perception appeared to be overwhelmingly audio-based (with the audio vs. visual responses being 88%:9.2% in the incongruent condition), despite the non-optimal listening condition with the tonal stimuli embedded in noise. This pattern is aligned with the previous claim for audio-facial speech perception that the existence of robust auditory categories in native perceivers makes visual input weighted less and thus visual distraction less likely (Mixdorff et al., 2005; Wang et al., 2008). For the English group, however, the proportion of visual-based responses was so large that they out-weighed the proportion of audio-based responses (44.0%:32.4%). These results also agree with the previous findings that non-native (relative to native) perceivers attach greater weight to visual input (Hazan et al., 2006; Wang et al., 2008, 2009). They further provide evidence supporting the claim in the audio-facial domain that the degree of visual weighting positively correlates with the saliency of visual contrasts (Hazan et al., 2006). The individual tone results corroborate the saliency account showing that across groups the Dipping tone tends to be more accurately and frequently responded to, as compared to most of the other tones. This is presumably because the dipping gesture involves a trajectory of a combined falling to rising movement, which is more visually salient than a rising or falling contour alone.

Overall, despite the fact that gestures generally provide redundant cues to concurrent speech (Hostetter, 2011), perceivers are able to cross-modally relate visuospatial gestural tone information to the auditory tone information, and use them effectively when necessary.

### General Discussion

Taken together, the results reveal that cross-modal binding occurred in both AF and AFG conditions. However, a comparison of the two modality conditions showed that perceivers across groups (especially the non-natives) were able to identify tones more accurately and respond to the visual input more frequently when gestural input was available than audio-facial input alone. These results support our hypothesis of an increased visual weighting in the AFG over the AF condition, indicating the effectiveness of hand gestures as an additional and salient input source. The facilitative effects of gestures in non-native tone perception also suggest that this additional channel of input did not make the task more demanding, or increase perceivers' processing or attention load, as suggested in previous research in non-pitch-related domains (e.g., Alsius et al., 2005; Hirata et al., 2014; Kelly et al., 2014). This result may be due to the audio-spatial nature of pitch (Connell et al., 2013). The fact that cross-modal pitch-gesture binding occurs in the perception of phonemic tone as well as for non-speech and musical pitch (e.g., Liao and Davidson, 2007; Connell et al., 2013; Küssner et al., 2014) indicates that this binding may exist universally in perceivers' sensory-motor systems and may not need to be learned. Thus, the facilitative effects of the gesture-pitch association can overcome the processing load issue in phonetically demanding contexts.

The current results of an established cross-modal binding both between auditory and facial tone and between auditory and gestural tone suggest shared representations of pitch processing across sensory-motor domains, supporting the theory of embodied-grounded cognition (Barsalou, 2008; Borghi et al., 2013). Specifically, the present AFG results coupled with the previous findings (e.g., Liao and Davidson, 2007; Connell et al., 2013; Küssner et al., 2014) of an acoustic-visuospatial binding for pitch in speech, non-speech, and music imply that pitch representation is not only grounded across sensory-motor systems but also across cognitive domains. This pre-stored representation of pitch information can be recruited to aid perception in a cross-modal and cross-domain fashion. The shared representation account may not only address the gesture-tone link in the AFG results, but also explain the facial articulatory and auditory connection of tone in the AF condition. As discussed previously, the production of tone (unlike that of segments) does not rely on vocal tract configurations, and thus may not necessarily involve visible mouth movements corresponding to specific features of sounds (e.g., lip spreading or rounding for vowels). Nonetheless, previous research has indicated that certain articulatorily relevant cues, such as head dipping, eyebrow raising and lowering (Burnham et al., 2001a,b; Huron and Shanahan, 2013; Kim et al., 2014), do affect pitch perception and production. Relating these results to the account of audio-spatial nature of pitch, it is likely to be the case that the head or eyebrow movements accompanied by tone productions provide spatial equivalence to pitch trajectories, similar to the function of hand gestures. The effective utilization of visual information in the perception of the Dipping tone in both AF and AFG conditions provides a good example illustrating this shared process, in that the dipping pitch trajectory may be visual-spatially realized both as a head dipping and a dipping hand movement. As such, the current audio-visual binding results from facial and gestural domains may be accounted for by common underlying mechanisms in terms of shared acoustic and spatial processing.

### CONCLUSION

The present findings support our hypothesis that perceivers can make cross-modal, non-arbitrary, audio-spatial correspondences between acoustic and visual tonal cues, and bind them during perception. These findings thus speak to a shared representation of tone across auditory-acoustic, articulatory, and visuospatial domains. On the other hand, the differences in audio-visual weighting between the AF and AFG modalities, and between native and non-native groups, also provide evidence of domain-specific and experience-based influences in tone perception. These patterns advance the previous integrative pitch processing account (Zatorre and Gandour, 2008) by extending the findings to the spatial gestural domain, thus providing insight into how the interaction of multi-sensory and cognitive processing is orchestrated by lower- and higher-level mechanisms.

The theoretical implications of this study point to directions for further research. First, in terms of audio-visual weighting, as the current study involved cafeteria noise intended to increase the level of visual reliance, the results attributed to the enhanced visual effect may not apply to the same stimuli perceived in quiet. Future research could compare how the weighting of cross-modal tonal information varies in different perceptual environments. Furthermore, regarding effects of linguistic experience, the intermodal relations in tone perception exhibited in the current study may be particularly beneficial for tone language development and learning, where infants and non-native learners need multiple resources to acquire new pitch contrasts. However, research tracing the developmental and learning trajectories is needed to establish how these resources can be utilized effectively at different stages of acquisition. Finally, the current results demonstrate the existence of meaningful cross-modal links for tone without identifying which facial and/or gestural cues contribute to the perception of specific tones. Extended research may seek to quantify the relationship between tone perception and specific visual cues to determine the nature of the shared representation of tone across auditory-acoustic, articulatory, and visuospatial domains. Together, research along these avenues informs how multi-modal speech enhancement principles can be applied to achieve effective human speech interactions.

### AUTHOR CONTRIBUTIONS

BH mainly contributed to experimental design, stimulus development, data collection, analysis and write-up. YW, AJ, and JS mainly contributed to study planning, experimental design, data analysis and write-up. JC and YN mainly contributed to statistical analysis.

## FUNDING

This study was supported by a Social Sciences and Humanities Research Council of Canada (SSHRC) research grant (SSHRC Insight Grant 435-2012-1641).

### ACKNOWLEDGMENTS

The authors would like to thank Sylvia Cho, Katelyn Eng, Eleanor Hendriks, Keith King-Wui Leung, Danielle Weber, and Jennifer Williams from the Language and Brain Lab at Simon Fraser University for their assistance in data collection and analysis. Portions of this study were presented at the 5th Joint Meeting of the Acoustical Society of America and Acoustical Society of Japan, Honolulu, HI, United States, November, 2016.

### REFERENCES



Mixdorff, H., Hu, Y., and Burnham, D. (2005). "Visual cues in Mandarin tone perception," in Proceedings of the INTERSPEECH-2005, Lisbon, 405–408.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Hannah, Wang, Jongman, Sereno, Cao and Nie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Constraints on Tone Sensitivity in Novel Word Learning by Monolingual and Bilingual Infants: Tone Properties Are More Influential than Tone Familiarity

Denis Burnham<sup>1</sup>\*, Leher Singh<sup>2</sup>, Karen Mattock<sup>1,3</sup>, Pei J. Woo<sup>4</sup> and Marina Kalashnikova<sup>1</sup>

*<sup>1</sup> The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Sydney, NSW, Australia, <sup>2</sup> Department of Psychology, National University of Singapore, Singapore, Singapore, <sup>3</sup> School of Social Sciences and Psychology, Western Sydney University, Sydney, NSW, Australia, <sup>4</sup> Department of Psychology, Sunway University, Kuala Lumpur, Malaysia*

#### Edited by:

*Chia-Ying Lee, Academia Sinica, Taiwan*

#### Reviewed by:

*Yuchun Chen, Fu Jen Catholic University, Taiwan Gang Peng, Hong Kong Polytechnic University, Hong Kong*

\*Correspondence: *Denis Burnham denis.burnham@westernsydney.edu.au*

#### Specialty section:

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

Received: *13 July 2017* Accepted: *01 December 2017* Published: *04 January 2018*

#### Citation:

*Burnham D, Singh L, Mattock K, Woo PJ and Kalashnikova M (2018) Constraints on Tone Sensitivity in Novel Word Learning by Monolingual and Bilingual Infants: Tone Properties Are More Influential than Tone Familiarity. Front. Psychol. 8:2190. doi: 10.3389/fpsyg.2017.02190*

This study compared tone sensitivity in monolingual and bilingual infants in a novel word learning task. Tone language learning infants (Experiment 1, Mandarin monolingual; Experiment 2, Mandarin-English bilingual) were tested with Mandarin (native) or Thai (non-native) lexical tone pairs which contrasted static vs. dynamic (high vs. rising) tones or dynamic vs. dynamic (rising vs. falling) tones. Non-tone language, English-learning infants (Experiment 3) were tested on English intonational contrasts or the Mandarin or Thai tone contrasts. Monolingual Mandarin language infants were able to bind tones to novel words for the Mandarin High-Rising contrast, but not for the Mandarin Rising-Falling contrast; and they were insensitive to both the High-Rising and the Rising-Falling tone contrasts in Thai. Bilingual English-Mandarin infants were similar to the Mandarin monolinguals in that they were sensitive to the Mandarin High-Rising contrast and not to the Mandarin Rising-Falling contrast. However, unlike the Mandarin monolinguals, they were also sensitive to the High-Rising contrast in Thai. Monolingual English learning infants were insensitive to all three types of contrasts (Mandarin, Thai, English), although they did respond differentially to tone-bearing vs. intonation-marked words. Findings suggest that infants' sensitivity to tones in word learning contexts depends heavily on tone properties, and that this influence is, in some cases, stronger than effects of language familiarity. Moreover, bilingual infants demonstrated greater phonological flexibility in tone interpretation.

Keywords: word learning, lexical tone, monolingual/bilingual, infant, native/non-native

### INTRODUCTION

The use of pitch is ubiquitous in human languages (Gussenhoven, 2004). However, the functions served by pitch variation differ markedly across languages. The majority of the world's languages spoken by the majority of the world's population (Fromkin, 1978; Yip, 2002) use pitch to differentiate the meanings of words. These languages include classic tone languages, such as Mandarin Chinese and Thai, grammatical tone languages such as Yoruba and Sesotho, as well as pitch accent languages, such as Japanese and Swedish. In all these languages, the use of pitch (as well as other cues to some extent) is applied at the syllable level to alter the meanings of words. However, pitch is also used across all the world's languages to communicate relevant information such as a speaker's emotional state, their communicative intent, and words they intend to stress (Fernald and Mazzie, 1991; Banse and Scherer, 1996; van Heuven and Haan, 2002). The multiplexing of pitch in human languages can therefore potentially introduce challenges for the young language learner. Learners of tone languages must differentiate lexical changes in pitch (i.e., lexical tone) from nonlexical changes in pitch (e.g., intonation, shifts in vocal emotion, stress, and intent), appreciating the distinct functions served by each source of pitch variation. Learners of non-tone languages must attune to the fact that their language incorporates pitch variation, but that this variation does not signal lexical contrast. Moreover, bilingual learners of both a tone and of a non-tone language, such as Mandarin Chinese and English, must learn that tone serves a different set of functions in each of their languages. As a result, they must differentiate the various functions of pitch in a language-selective manner. 
The focus of the current study is to determine how early word learners negotiate different types of native and non-native (lexical and non-lexical) pitch variation in relation to their language background when learning new words.

Prior research has investigated infants' sensitivity to lexical tone variation in infancy primarily via speech discrimination and novel word learning paradigms. Research in speech discrimination has focused on the basic question of whether infants of different language backgrounds (specifically, tone and non-tone language exposure) demonstrate sensitivity to lexical tone contrasts. This research complements a long tradition of research conducted with vowels and consonants that shows that infants demonstrate perceptual narrowing over the first year for many phonetic contrasts, as revealed by a selective sensitivity to vowel and consonant contrasts that feature in their native language and a reduced sensitivity to those that do not (Eimas et al., 1971; Werker and Tees, 1984; Polka and Werker, 1994; but see Best and Tyler, 2007). Studies on infant perception of lexical tones have yielded mixed findings. Firstly, some studies suggest that lexical tone undergoes a similar developmental progression to that charted for vowels and consonants; that in their first year infants raised in a tone language environment remain sensitive to lexical tone contrasts whereas those raised in a non-tone language environment demonstrate reduced sensitivity to lexical tone contrasts. Specifically, in a tone discrimination study, Mattock and Burnham (2006) investigated Thai lexical tone discrimination in Chinese (Cantonese and Mandarin) and English learning infants at 6 and 9 months of age. They found that only Chinese learning infants remain sensitive to lexical tone contrasts at 9 months and that English learning infants demonstrate a decline in sensitivity to lexical tone contrasts at 9 months. Interestingly, tone-exposed infants demonstrated sustained sensitivity to lexical tone contrasts even though the tones on which they were tested were non-native (Thai) tones. 
This points to broad-based early sensitivity to lexical tones in tone-exposed infants that may not be specific to the native tone inventory. In a similar study, Yeung et al. (2013) tested English, Mandarin, and Cantonese exposed infants at 4- and 9-months on Cantonese lexical tones. Like Mattock and Burnham (2006) (and repeated in Mattock et al., 2008), Yeung et al. (2013) reported a decline in discrimination of lexical tones at 9 months in English learning infants. They also reported sustained tone sensitivity in Mandarin and Cantonese infants at 4 and 9 months. However, even at 4 months Mandarin and Cantonese infants responded in different ways to one of the Cantonese tone contrasts used, which the authors interpreted as evidence for specific effects of the native tone inventory on tone perception within tone language learners. Their findings therefore point to language-selective perception of lexical tones within tone language learners.

Secondly, and in contrast to the studies described above, there is evidence opposing the emergence of language-selective sensitivity to tones in infancy. In particular, in a study of Mandarin tone perception in Dutch-exposed infants, Liu and Kager (2014) reported U-shaped development in infants' sensitivity to Mandarin lexical tones between 5 and 18 months; infants demonstrated strong tone sensitivity prior to 8 months and after 12 months. In a second study, in which they presented infants with very subtle Mandarin tone contrasts, only 5–6- and 17–18-month-old infants showed discrimination. Likewise, when presented with a different pair of Mandarin tones, Chen and Kager (2015) reported an increase in tone sensitivity in Dutch learning infants between 4 and 12 months. In a more recent study investigating Dutch infants' sensitivity to Limburgian tones, Ramachers et al. (2017) reported a similar increase in sensitivity to lexical tones in Dutch-exposed infants between 6 and 12 months.

Speech discrimination tasks provide clear evidence that tone-exposed infants remain sensitive to lexical tone during infancy, although it remains unclear whether they are selectively sensitive to native (vs. non-native) tone contrasts. What is less clear is whether non-tone exposed infants demonstrate a decline in sensitivity to lexical tones, with some studies demonstrating a decline (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013) and others a temporary decline (Liu and Kager, 2014) or facilitation with increasing age (Chen and Kager, 2015; Ramachers et al., 2017). One interpretation of studies showing sustained or increased sensitivity to tones in non-tone language learners would be that these learners maintain a similar lexical interpretation of tones to their tone learning counterparts. This question can be more directly addressed by investigating how infants incorporate lexical tone variation into the process of learning new words in relation to their language background.

Past studies investigating tone sensitivity in novel word learning provide convergent evidence that both tone- and non-tone language learning infants demonstrate high sensitivity to lexical tones. In a study on novel word learning using a preferential looking paradigm, Singh et al. (2014) found that both bilingual infants learning English and Mandarin, and non-tone language learning infants (either monolingual in one or bilingual in two non-tone languages) incorporated lexical tones into newly learned words at 18 months. Only at 24 months did non-tone language learning infants disregard lexical tone variation when learning new words. Similarly, in a study investigating lexical tone sensitivity in novel word learning using the habituation-based Switch paradigm, Hay et al. (2015) demonstrated that English learning infants integrated Mandarin lexical tones into newly-learned words at 17 months but not at 19 months. Interestingly, this period of tone sensitivity in non-tone language learning infants was extended if infants were learning two non-tone languages. Thus, instead of the transition from incorporating to disregarding tone in word learning occurring between 17 and 19 months (Hay et al., 2015), for bilingual non-tone learning infants this change occurs between 19 and 22 months (Estes and Hay, 2015). These findings point to differences in tone sensitivity between monolinguals and bilinguals between 19 and 22 months, even though both groups were learning non-tone languages. Further differences between monolingual and bilingual learners with respect to lexical tone sensitivity were reported by Singh et al. (2016). In this study, monolingual Mandarin learners at 12–13 months were compared with bilingual English-Mandarin learners at the same age for their sensitivity to lexical tones when learning novel words. Results revealed that bilingual English-Mandarin learners were more sensitive to lexical tones when learning words in Mandarin than their Mandarin monolingual counterparts. This was not attributed to greater tone sensitivity in bilinguals in general, as the same bilingual infants were not sensitive to Mandarin lexical tones when learning words in English. In contrast, Mandarin monolingual learners only demonstrated a similar degree of sensitivity to Mandarin tones when learning words in Mandarin 6 months later at 18 months.

Prior investigations comparing monolingual and bilingual infants on their understanding of native sound-to-meaning relations point to greater phonological flexibility in bilingual infants. As discussed above, Estes and Hay (2015) demonstrated a prolonged period of flexibility in bilingual infants' interpretation of pitch variation. Past studies investigating sensitivity to consonants have converged upon a similar conclusion: while all infants demonstrate perceptual narrowing of consonants over the first year of life, there is evidence for a postponement (i.e., delayed onset) (Garcia-Sierra et al., 2011; Ferjan Ramírez et al., 2017) as well as a protraction (i.e., delayed offset) in this process in bilingual infants (Petitto et al., 2012). Empirical reports of prolonged phonological flexibility in bilingual infants have led to conclusions that bilingualism may lead to greater phonological openness such that early learners are less tethered to the native phonological inventory (Kuhl et al., 2008). Indeed, prior studies suggest that bilingual infants continue to incorporate non-native phonological variation into newly learned words when monolingual infants no longer do so (Estes and Hay, 2015; Singh, 2017). This suggests that a prolonged course of perceptual narrowing in bilingual infants may lead to bilingual infants accepting a broader range of variation as lexically relevant when learning new words.

Findings from novel word learning studies suggest that infants from varied language backgrounds demonstrate early sensitivity to lexical tone contrasts. However, these studies have relied exclusively on sensitivity to Mandarin tones and also to a particular Mandarin tone contrast. Specifically, conclusions by Estes and Hay (2015), Singh et al. (2014), and Hay et al. (2015) were based on infants' sensitivity to a single tone contrast, the Mandarin rising/falling contrast, which is significant for the interpretation of their findings. Rising/falling pitch contours draw an important pragmatic distinction in English, Mandarin, and many other languages, specifically, the question/statement difference (Bolinger, 1958). Moreover, infants are highly sensitive to this distinction even if they are not learners of a tone language (Frota et al., 2014). This raises the possibility that tone- and non-tone language learning infants may be sensitive to the pragmatic functions of this distinction rather than to lexical tone distinctions. It remains to be seen whether these sensitivities generalise (i) to non-native tone inventories and (ii) to other Mandarin tone contrasts. That is, it is important to know (i) whether tone sensitivity in word-learning paradigms is language-specific or whether learners of tone languages possess a broad-based sensitivity to tones, and (ii) whether tone word learning is dependent on the pitch contour properties (relatively static or more dynamic) and whether such pitch characteristics of tones might override effects of nativeness.

Regarding tone familiarity, tone language infants' sensitivity to lexical tones has been consistently observed in prior studies (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Liu and Kager, 2014; Chen and Kager, 2015; Ramachers et al., 2017). However, only one of these (Mattock and Burnham, 2006) suggests that tone discrimination generalizes to non-native (unfamiliar) tones. Regarding tone properties, some infant tone discrimination studies (Mattock and Burnham, 2006; Liu and Kager, 2014; Chen and Kager, 2015; Ramachers et al., 2017) show differential performance with more confusable (similar pitch direction) vs. less confusable tones, but this has yet to be investigated in tone-based word-learning studies. Moreover, the influence of bilingualism on tone sensitivity remains unclear. Although findings reported by Singh et al. (2016) point to a bilingual advantage in tone sensitivity for some Mandarin tones, it remains unknown whether this advantage extends to other tone pairs and to non-native tone contrasts.

In this study, we investigated the role of (i) tone familiarity (native vs. non-native tones), (ii) language background (monolingual/bilingual), and (iii) pitch properties of tones (static-dynamic/dynamic-dynamic) in novel word learning. Infants were tested using the Switch paradigm at 17 months given that prior studies have demonstrated effects of language background on tone sensitivity at 17–18 months using this paradigm (Estes and Hay, 2015; Hay et al., 2015). Three experiments were conducted to investigate tone-based word learning of native vs. non-native tones in 17-month-old monolingual infants acquiring a tonal language (Mandarin, Experiment 1), bilingual infants acquiring a tonal and a non-tonal language (Mandarin-English, Experiment 2), and monolingual infants acquiring a non-tonal language (English, Experiment 3). Tone familiarity was manipulated by varying the language of the stimuli. For Mandarin monolinguals and Mandarin-English bilinguals, native Mandarin contrasts and non-native Thai lexical tone contrasts were used. For English monolinguals, English intonational contrasts and non-native Thai and Mandarin lexical tone contrasts were used.

As different language learners use different cues to differentiate tones (see for example, Burnham and Francis, 1997; Burnham et al., 2014), and as we wished to keep tone contrasts acoustically similar across the native (Mandarin) and non-native (Thai) language stimuli, we used a priori bases to characterize pitch contrasts. The first was whether the tones in any particular tone contrast differed in their overall pitch movement, i.e., whether pitch was relatively static or dynamic over time and, if dynamic, then the direction of the contour was also used to characterize tones.

Using Chao values, in which numbers are used to signal tone height at initial, (mid), and final time points, Mandarin and Thai both have High tones (Thai 45, Mandarin 55), Rising tones (Thai 315, Mandarin 35 and 214), and Falling tones (Thai 241, Mandarin 51) [with Thai also having Mid (33) and Low (21) tones, which do not match easily with Mandarin tones]. The Static-Dynamic, High vs. Rising, contrast was chosen as a contrast on which the members of the pair differed in the degree of contour, relatively static (High) vs. relatively dynamic (Rising) (55 vs. 35 in Mandarin, and 45 vs. 315 in Thai). The Dynamic-Dynamic, Rising vs. Falling, contrast was chosen as the other tone contrast in each language because, while both tones are relatively dynamic, their contours move in opposite directions over time in both Mandarin (35 vs. 51) and Thai (315 vs. 241). Lexical tone is not used in English, but for comparison purposes intonation contours were used that approximate the same tone contours used in Mandarin and Thai. A Static-Dynamic pair, Order- vs. Statement-shaped syllables, and a Dynamic-Dynamic pair, Statement- vs. Question-shaped syllables, were used. These can be characterized as Mid-Falling vs. High-Falling and High-Falling vs. Mid-Rising, respectively. While these do not exactly match the High-Rising and Rising-Falling tones, use of these intonational contrasts ensured that each group heard a pitch contrast that formed a part of native language input. The intonation contours were only used for the English monolingual group to investigate whether they were differentially sensitive to native English contours or non-native Mandarin or Thai contours (tones). Plots of the three Mandarin and three Thai lexical tones and the three English intonation contours are shown in **Figure 1** in the General Methods section.
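The a priori characterization of tone pairs described above can be illustrated with a minimal Python sketch. The Chao values are taken from the text; the function names (`shape`, `contrast_type`) and the rule that a tone counts as "relatively static" when its Chao levels span no more than one step are illustrative assumptions made here, not a procedure reported by the authors.

```python
# Illustrative sketch only: classify tones from their Chao values.
# ASSUMPTION: a tone is "relatively static" if its Chao levels span at most
# one step (so Mandarin 55 and Thai 45 both count as static). This threshold
# is hypothetical and chosen only to reproduce the labels used in the text.

def shape(chao: str) -> str:
    """Return 'static', 'rising', or 'falling' for a Chao tone string."""
    levels = [int(digit) for digit in chao]
    if max(levels) - min(levels) <= 1:
        return "static"
    # For dynamic tones, compare the final level with the initial level.
    return "rising" if levels[-1] > levels[0] else "falling"


def contrast_type(tone_a: str, tone_b: str) -> str:
    """Label a tone pair, e.g. 'Static-Dynamic' or 'Dynamic-Dynamic'."""
    kinds = ["Static" if shape(t) == "static" else "Dynamic"
             for t in (tone_a, tone_b)]
    # Put 'Static' first so the label matches the naming used in the text.
    kinds.sort(key=lambda kind: kind != "Static")
    return "-".join(kinds)


if __name__ == "__main__":
    pairs = {
        "Mandarin High vs. Rising": ("55", "35"),
        "Mandarin Rising vs. Falling": ("35", "51"),
        "Thai High vs. Rising": ("45", "315"),
        "Thai Rising vs. Falling": ("315", "241"),
    }
    for name, (a, b) in pairs.items():
        print(name, "->", contrast_type(a, b), (shape(a), shape(b)))
```

Under this assumed rule, the two High vs. Rising pairs are labelled Static-Dynamic and the two Rising vs. Falling pairs Dynamic-Dynamic, with opposite contour directions within each Dynamic-Dynamic pair.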

### Predictions

It was predicted that bilingual and monolingual Mandarin infants would demonstrate similar levels of sensitivity to Mandarin lexical tones. Although Singh et al. (2016) demonstrated that at 12–13 months bilingual English-Mandarin learners had greater sensitivity to Mandarin tones than monolingual Mandarin learners, by 17–18 months monolingual Mandarin learners showed the same level of sensitivity to Mandarin lexical tones. Given that infants were tested at 17 months here, it was predicted that monolingual Mandarin and bilingual English-Mandarin learners would have similar levels of ability with Mandarin tones. However, it was also predicted that bilingual infants may demonstrate additional sensitivity to non-native tone contrasts (Thai) in view of past research attesting greater phonological flexibility in bilingual infants in segmental (Garcia-Sierra et al., 2011; Petitto et al., 2012; Ferjan Ramírez et al., 2017; Singh, 2017) and suprasegmental perception (Estes and Hay, 2015). Effects of tone contrast due to differences in pitch properties of tones were also predicted on account of prior studies demonstrating contrast-specific effects on the order of acquisition of individual Mandarin tones (Wong et al., 2005; Wong, 2012a,b, 2013). Finally, differences in effects of nativeness and pitch properties of tone pairs across monolingual and bilingual groups will be explored.

### GENERAL METHODS

Methods common to the three experiments are set out ahead of specific methods for each.

### Materials and Apparatus

The stimuli for each of the three experiments in each of the four conditions (native/non-native × Static-Dynamic/Dynamic-Dynamic) are set out in **Table 1**, and their fundamental frequency contours are shown in **Figure 1**.

A native female speaker of Malaysian Mandarin was audio-recorded producing the syllable /kha/ (Pinyin "ka") with the Mandarin tones T1 [55], T2 [35], and T4 [51], in order to create a Static-Dynamic contrast, [55-35], and a Dynamic-Dynamic contrast, [35-51]. Only one of these syllables is a word in Mandarin, but it is of low frequency and is certainly not a word that would be high frequency in speech addressed to infants: [kha55] is a homophone, being either (a) 咖, the first noun in the compound word meaning coffee (frequency = 4,366, percentile = 76.0, where 1 is the lowest and 100 percentile the highest possible frequency), or (b) 喀, onomatopoeic of the coughing sound (frequency = 3,830, percentile = 75.1). The other two used here, [kha35] and [kha51], are not words in Mandarin (Da, 2015).

TABLE 1 | Language and Tone Contrast Familiarity and Tone Contrast Properties used in the four conditions of the habituation then switch task in Experiments 1, 2, and 3.

A native female speaker of Thai was audio-recorded producing the syllable /khaa/ with the Thai tones High [45], Rising [315], and Falling [241], in order to create a Static-Dynamic contrast, [45-315], and a Dynamic-Dynamic contrast, [315-241]. All three of these are words in Thai: [khaa45] means to trade; [khaa315] means leg; and [khaa241] is a homophone, meaning (a) I, me (this is antiquated); (b) value; or (c) to kill. These are all relatively low frequency, but frequency is of no concern for the Thai stimuli as they simply served as non-native stimuli for the Mandarin background and English background infants. No Thai infants were tested.

A female native English speaker was recorded producing the syllable /ka/ with the following intonation contours: statement, order, and question.

For each set of language stimuli, the syllables were extracted from the recording and concatenated into 20 s strings with an inter-stimulus interval (ISI) of 500 ms. The visual stimuli consisted of video recordings of two colorful novel objects (a molecule and a crown) moving slowly along the horizontal axis in the center of the screen. Additionally, a video of a moving toy (spinning water-wheel) and an audio recording of the novel word /pok/ produced by a female speaker were used in the pre- and post-test phases of the task.

Stimuli were presented using Habit X1.0 software (Cohen et al., 2004) on a computer screen with the audio presented through loudspeakers located behind the screen. Infants sat on their caregiver's lap ∼60 cm away from the screen. Caregivers listened to masking sounds through headphones. The experimenter observed the infant through a CCTV camera in an adjacent room and controlled the presentation of the stimuli.

### Procedure

Each infant completed one of the between-subjects conditions of the task: native Static-Dynamic, native Dynamic-Dynamic, non-native Static-Dynamic, or non-native Dynamic-Dynamic. At the start of the task, infants were presented with the attention getter, a flashing red light on the screen accompanied by a beeping sound. Once they had fixated the screen, the experimental task commenced. First, infants completed an habituation phase in which they saw each object (molecule or crown) paired with a different sound stimulus (e.g., crown + /ka/ Tone A, molecule + /ka/ Tone B, with the nature of Tone A and Tone B depending on the experiment). The habituation phase proceeded until infants reached the habituation criterion (decrease of 50% or more in looking time in two consecutive trials in comparison to the mean looking duration over the first three habituation trials) or after reaching the maximum of 24 habituation trials. After that, infants completed two test trials. One was a Same trial, in which the infants saw one of the object-sound pairings from the habituation phase (e.g., crown + /ka/ Tone A). The other was a Switch trial where infants saw the same object but paired with the sound that corresponded to the other object in the habituation phase (e.g., crown + /ka/ Tone B). Infants also completed a pre- and post-test trial at the start and end of each session (**Figure 2**). The pairings between the visual and auditory stimuli, the objects chosen for the test phase, and the order of Same and Switch trial presentation were all counterbalanced between participants.
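The stopping rule for the habituation phase can be sketched in Python as follows. This is a minimal illustration, not the Habit X software's implementation: the function name `habituation_end` is hypothetical, and two details are assumptions made here because the text does not specify them, namely that each of the two consecutive trials must individually fall to 50% or less of the baseline, and that the two criterion trials may not overlap with the three baseline trials.

```python
from typing import Optional, Sequence

BASELINE_TRIALS = 3    # first three habituation trials define the baseline
WINDOW = 2             # two consecutive trials must meet the criterion
CRITERION = 0.5        # looking time at or below 50% of the baseline mean
MAX_TRIALS = 24        # habituation ends after 24 trials regardless


def habituation_end(looking_times: Sequence[float]) -> Optional[int]:
    """Return the 1-based trial on which habituation ends, or None if the
    phase would still be running after the trials supplied.

    ASSUMPTIONS (not stated in the source): each trial in the window is
    compared with the baseline separately, and the window starts only after
    the baseline trials.
    """
    trials = list(looking_times)[:MAX_TRIALS]
    if len(trials) >= BASELINE_TRIALS:
        baseline = sum(trials[:BASELINE_TRIALS]) / BASELINE_TRIALS
        threshold = CRITERION * baseline
        for last in range(BASELINE_TRIALS + WINDOW, len(trials) + 1):
            window = trials[last - WINDOW:last]
            if all(t <= threshold for t in window):
                return last
    # Criterion never met: stop at the maximum if it has been reached.
    return MAX_TRIALS if len(trials) >= MAX_TRIALS else None


if __name__ == "__main__":
    # Baseline mean is 10 s; trials 5 and 6 are both at or below 5 s.
    print(habituation_end([10.0, 10.0, 10.0, 8.0, 4.0, 4.0]))
```

A different reading of the criterion (for example, comparing the mean of the two consecutive trials with the baseline) would change only the `all(...)` test.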

### EXPERIMENT 1: MONOLINGUAL MANDARIN INFANTS

In Experiment 1, four groups of Monolingual Mandarin environment infants were tested with native Mandarin tone contrasts (High-Rising, T1 [55] vs. T2 [35]; Rising-Falling, T2 [35] vs. T4 [51]) and non-native contrasts (High-Rising, Thai [45] vs. [315]; Rising-Falling, Thai [315] vs. [241]).

### Participants

Thirty-three 17-month-old infants (16 female; M age = 523.06 days [17.26 months], SD = 13.07) participated. One additional infant participated but was excluded from final analyses due to experimenter error. Infants were randomly assigned to one of four groups: Native Rising-Falling (n = 9), Native High-Rising (n = 8), Non-native Rising-Falling (n = 8), Non-native High-Rising (n = 8). Parents were asked to complete a brief questionnaire about their infants' language environment and exposure. All infants were acquiring Mandarin as their first language and had no more than 10% exposure to any additional language (M = 6.08%, SD = 3.3) as reported by their primary caregiver. Twenty-nine infants were growing up in Singapore and four infants were growing up in Malaysia. All infants were typically-developing and were not at risk for sensory or developmental disorders.

### Results

Given that infant looking time data were not normally distributed, all raw looking time scores were subjected to a log transformation so that the data could be analyzed using Analyses of Variance.
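As a brief illustration of this preprocessing step, the sketch below log-transforms raw looking times in seconds. The base of the logarithm is not stated in the text; base 10 is assumed here, and the function names are hypothetical. Looking times must be strictly positive for the transform to be defined.

```python
import math


def log_transform(looking_times_s):
    """Base-10 log of each looking time in seconds (base 10 is an assumption)."""
    return [math.log10(t) for t in looking_times_s]


def back_transform(log_value):
    """Convert a base-10 log looking time back to seconds."""
    return 10 ** log_value
```

Under this assumption, a log-transformed value of 1.0 corresponds to a raw looking time of 10 s.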

First, infants' performance in the habituation phase and the pre- and post-test trials was compared across the four test groups (Native High-Rising, Native Rising-Falling, Non-native High-Rising, Non-native Rising-Falling) (see **Table 2**, Experiment 1). Total looking duration, F(3, 29) = 0.784, p = 0.513, η <sup>2</sup> = 0.075, and the number of habituation trials, F(3, 29) = 0.622, p = 0.607, η <sup>2</sup> = 0.060, did not differ across groups. Similarly, looking duration did not differ between pre- and post-test trials, F(1, 29) = 2.156, p = 0.153, η <sup>2</sup> = 0.069, and there was no effect of group, F(3, 29) = 0.722, p = 0.547, η <sup>2</sup> = 0.069, and no significant pre-/post-trial × group interaction, F(3, 29) = 0.943, p = 0.433, η <sup>2</sup> = 0.089. Thus, there was no systematic bias in attention between the groups, and within groups there was no general fatigue over time: attention did not diminish between pre- and post-test trials.

Looking times in test trials for Native/Non-native, High-Rising/Rising-Falling, and Same/Switch trials are shown in **Figure 3**. To assess infants' performance in these test trials, looking durations for the Same and Switch trials across the native vs. non-native and the stimulus type conditions were compared. A 2 (Native, Non-native) × 2 (High-Rising, Rising-Falling) × 2 (Same, Switch) ANOVA showed no main effect of Same/Switch, F(1, 29) = 2.212, p = 0.148, η <sup>2</sup> = 0.071, Native/Non-native, F(1, 29) = 1.006, p = 0.324, η <sup>2</sup> = 0.034, or Tone Type, F(1, 29) = 2.177, p = 0.151, η <sup>2</sup> = 0.070, and no Same/Switch × Native/Non-native, F(1, 29) = 0.622, p = 0.437, η <sup>2</sup> = 0.021, Same/Switch × Tone Type, F(1, 29) = 0.022, p = 0.882, η <sup>2</sup> = 0.001, or Native/Non-native × Tone Type interactions, F(1, 29) = 0.006, p = 0.805, η <sup>2</sup> = 0.002. However, there was a significant three-way Same/Switch × Native/Non-native × Tone Type interaction, F(1, 29) = 8.594, p = 0.007, η <sup>2</sup> = 0.229 (see **Figure 3**).
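The effect sizes reported alongside each F test can be related to the F value and its degrees of freedom. The sketch below assumes that the reported η <sup>2</sup> values are partial eta squared, which satisfies η <sup>2</sup> = (F × df_effect) / (F × df_effect + df_error); the function name is hypothetical. This assumption is not stated in the text, but it reproduces, for example, the value reported for the three-way interaction.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    effect = f_value * df_effect
    return effect / (effect + df_error)


# Three-way interaction reported above: F(1, 29) = 8.594
three_way = partial_eta_squared(8.594, 1, 29)
```

Here `three_way` is approximately 0.229, matching the reported value to three decimal places.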

To investigate the source of the interaction, infants' performance was analyzed separately for the static-dynamic (High-Rising) and dynamic-dynamic (Rising-Falling) conditions. In the High-Rising condition, there was no main effect of Same/Switch, F(1, 14) = 2.094, p = 0.170, η <sup>2</sup> = 0.130, or Native/Non-native, F(1, 14) = 0.808, p = 0.384, η <sup>2</sup> = 0.055, but these two factors did interact, F(1, 14) = 10.823, p = 0.005, η <sup>2</sup> = 0.436. Infants looked significantly longer in response to the Switch than the Same trials in the native (Same M = 0.671, SE = 0.121; Switch M = 1.049, SE = 0.083), t(7) = −2.923, p = 0.022, d = 1.283, but not the non-native condition (Same M = 1.028, SE = 0.067; Switch M = 0.881, SE = 0.092), t(7) = 1.572, p = 0.160, d = 0.644. On the other hand, in the Rising-Falling condition there were no main effects of Same/Switch, F(1, 15) = 0.681, p = 0.422, η <sup>2</sup> = 0.043 (Same M = 0.796, SE = 0.116; Switch M = 0.740, SE = 0.098), or Native/Non-native, F(1, 15) = 0.278, p = 0.606, η <sup>2</sup> = 0.018 (Same M = 0.702, SE = 0.123; Switch M = 0.947, SE = 0.104), and also no significant interaction, F(1, 15) = 1.746, p = 0.206, η <sup>2</sup> = 0.104. Thus, when learning tone-bearing words, monolingual Mandarin infants were sensitive to native but not to non-native tone contrasts in this task. However, this was only the case for the static-dynamic (High-Rising) tone pairs; they did not look significantly longer to the Switch trial for the native dynamic-dynamic (Rising-Falling) tone pair.
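The text does not say how Cohen's d was computed for the Same vs. Switch comparisons. As a rough check only, the sketch below assumes one common definition: the mean difference divided by the root mean square of the two condition standard deviations, with each SD recovered from the reported standard error as SE × √n. The function name is hypothetical. Applied to the rounded means and standard errors above, this assumption gives values within about 0.01 of the reported d values, but it should not be read as the authors' stated method.

```python
import math


def cohens_d_from_se(mean_a, se_a, mean_b, se_b, n):
    """Cohen's d from two condition means and standard errors (n per condition).

    Assumes d = (mean_b - mean_a) / sqrt((sd_a**2 + sd_b**2) / 2),
    with each SD recovered as SE * sqrt(n).
    """
    sd_a = se_a * math.sqrt(n)
    sd_b = se_b * math.sqrt(n)
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_sd


# Native High-Rising group (n = 8): Same vs. Switch
d_native = cohens_d_from_se(0.671, 0.121, 1.049, 0.083, 8)
# Non-native High-Rising group (n = 8): Same vs. Switch
d_nonnative = cohens_d_from_se(1.028, 0.067, 0.881, 0.092, 8)
```

The sign of `d_nonnative` is negative under this convention because looking was numerically shorter on Switch than on Same trials in that group.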

When learning novel words, monolingual Mandarin infants were sensitive to static vs. dynamic (High-Rising) native tones but not to dynamic-dynamic (Rising-Falling) native tones. They were not sensitive to either type of non-native tone contrast (static vs. dynamic or dynamic-dynamic tone pairs).

*<sup>a</sup>Log-transformed looking duration (seconds).*

### EXPERIMENT 2: BILINGUAL MANDARIN-ENGLISH INFANTS

In Experiment 2, four groups of bilingual Mandarin-English environment infants were tested with the same contrasts as in Experiment 1, native tone contrasts (High-Rising, Mandarin T1 [55] vs. T2 [35]; Rising-Falling, Mandarin T2 [35] vs. T4 [51]) and non-native contrasts (High-Rising, Thai High [45] vs. Rising [315]; Rising-Falling, Thai Rising [315] vs. Falling [241]).

### Participants

Thirty-two 17-month-old infants (16 female; M age = 524.72 days [17.25 months], SD = 18.02) were included in the study. An additional five infants participated but were excluded due to failure to comply with the language selection criteria. Twenty-two infants were being raised in Singapore and ten infants were being raised in Malaysia. Infants were randomly assigned to four groups according to two between-subjects experimental conditions, native vs. non-native and High-Rising vs. Rising-Falling (Native High-Rising n = 8, Native Rising-Falling n = 8, Non-native High-Rising n = 8, Non-native Rising-Falling n = 8).

All infants were typically-developing and were not at risk for sensory or developmental disorders. Parents were asked to complete a questionnaire about their infants' language environment and exposure. Infants' weekly language exposure ranged from 26 to 72% (M = 51.48, SD = 13.69) for Mandarin and from 25 to 68% (M = 45.9, SD = 13.97) for English. Sixteen children were reported to have some exposure to a third language, but this exposure was <10% (M = 5.6%, SD = 3.5%). Analysis revealed that degree of language exposure had no effect on the results<sup>1</sup>.

### Results

Performance in the habituation phase and pre- and post-test trials of the four between-subjects conditions revealed that infants' looking duration, F(3, 28) = 0.339, p = 0.798, η <sup>2</sup> = 0.035, and number of habituation trials, F(3, 28) = 0.685, p = 0.569, η <sup>2</sup> = 0.068, did not differ across groups. Similarly, looking duration did not differ between pre- and post-test trials, F(1, 28) = 2.332, p = 0.138, η <sup>2</sup> = 0.077, there was no effect of group, F(3, 28) = 0.332, p = 0.803, η <sup>2</sup> = 0.034, and there was no significant trial × group interaction, F(3, 28) = 0.134, p = 0.939, η <sup>2</sup> = 0.014. Thus, there was no systematic bias in attention between the groups, and within groups there was no general fatigue over time: attention did not diminish between pre- and post-test trials.

Log-transformed looking times in test trials for Native/Non-native, High-Rising/Rising-Falling, and Same/Switch trials are shown in **Figure 4**. A 2 (Native, Non-native) × 2 (High-Rising, Rising-Falling) × 2 (Same, Switch) ANOVA was conducted to assess infants' performance in the test phase. There were no main effects of Same/Switch, F(1, 28) = 1.340, p = 0.257, η <sup>2</sup> = 0.046, Native/Non-native, F(1, 28) = 1.553, p = 0.223, η <sup>2</sup> = 0.053, or Tone Type, F(1, 28) = 0.650, p = 0.427, η <sup>2</sup> = 0.023. However, in contrast to the Mandarin monolingual group, there was a significant Same/Switch × Tone Type interaction, F(1, 28) = 6.273, p = 0.018, η <sup>2</sup> = 0.183. No other two- or three-way interaction was significant (Same/Switch × Native/Non-native, F(1, 28) = 2.003, p = 0.168, η <sup>2</sup> = 0.067; Native/Non-native × Tone Type, F(1, 28) = 0.599, p = 0.445, η <sup>2</sup> = 0.021; Same/Switch × Native/Non-native × Tone Type, F(1, 28) = 0.968, p = 0.334, η <sup>2</sup> = 0.033).

<sup>1</sup>To check this, we conducted a 2 × 2 × 2 ANCOVA with native/non-native, Static-Dynamic/Dynamic-Dynamic as the between-subjects variables, Same/Switch as the within-subjects variable, and percentage of Mandarin exposure as the covariate. The pattern of results was the same as that reported in the main text. There was no main effect of Same/Switch, F(1, 27) = 0.473, p = 0.497, η <sup>2</sup> = 0.017, native/non-native, F(1, 27) = 1.688, p = 0.25, η <sup>2</sup> = 0.059, tone type, F(1, 27) = 0.195, p = 0.662, η <sup>2</sup> = 0.007, or exposure to Mandarin, F(1, 27) = 0.128, p = 0.724, η <sup>2</sup> = 0.005, and no Same/Switch by native/non-native, F(1, 27) = 2.207, p = 0.080, η <sup>2</sup> = 0.109, Same/Switch by exposure to Mandarin, F(1, 27) = 0.240, p = 0.628, η <sup>2</sup> = 0.009, native/non-native by tone type, F(1, 27) = 0.702, p = 0.410, η <sup>2</sup> = 0.025, or Same/Switch by native/non-native by tone type, F(1, 27) = 0.038, p = 0.847, η <sup>2</sup> = 0.001, interactions. The only significant interaction was Same/Switch by Tone Type, F(1, 27) = 9.710, p = 0.004, η <sup>2</sup> = 0.265, as reported.

The source of this Same/Switch × Tone Type interaction was investigated by assessing infants' performance separately in the High-Rising and Rising-Falling conditions. In the High-Rising condition, infants produced significantly longer looks in the Switch (M = 0.948, SE = 0.077) than in the Same trials (M = 0.736, SE = 0.078), F(1, 14) = 5.004, p = 0.042, η <sup>2</sup> = 0.263. This was the case for both native and non-native conditions, as there was no significant effect of Native/Non-native, F(1, 14) = 0.094, p = 0.764, η <sup>2</sup> = 0.007, nor was there a Same/Switch × Native/Non-native interaction, F(1, 14) = 0.069, p = 0.796, η <sup>2</sup> = 0.005. Infants looked longer in the Switch trials when they were presented with a High-Rising tone contrast, either the native Mandarin T1 [55] vs. T2 [35], or the non-native Thai High [45] vs. Rising [315] contrast. For the Rising-Falling tone types, however, there were no significant differences in infants' looking duration in the Switch (M = 0.567, SE = 0.057) and Same (M = 0.945, SE = 0.064) trials, F(1, 14) = 1.374, p = 0.261, η <sup>2</sup> = 0.089, and there was no Native/Non-native effect, F(1, 14) = 2.519, p = 0.135, η <sup>2</sup> = 0.153, and no Same/Switch × Native/Non-native interaction, F(1, 14) = 4.631, p = 0.056, η <sup>2</sup> = 0.238.

Like the Mandarin monolingual infants, Mandarin-English bilingual infants were not sensitive to Rising and Falling tones in one of their native languages, Mandarin, nor in a non-native tone language, Thai. Also like the Mandarin monolingual infants, they were sensitive to High and Rising tones in their own tone language, Mandarin. Unlike the monolingual Mandarin infants, however, the bilinguals were also sensitive to non-native High and Rising tones in Thai.

### EXPERIMENT 3: MONOLINGUAL ENGLISH INFANTS

In Experiment 1, Monolingual Mandarin language infants learned words on the basis of a native high-rising but not a rising-falling contrast. In Experiment 2, Bilingual Mandarin language infants learned words on the basis of a native high-rising but not a rising-falling contrast, and also on the basis of a non-native high-rising but not a rising-falling contrast. It could be that, over and above any advantage for bilingual over monolingual infants' perception of non-native tone contrasts, the high-rising contrast is particularly salient independent of tone language experience. To test this, we added a third experiment in which infants with experience of a non-tone language, English, were tested. For this group, the high rising tone is also native in that it conveys a question form, but it is non-lexical. In this sense, testing sensitivity to a high-rising contrast in addition to a rising-falling contrast serves to qualify our interpretation of the findings of Experiments 1 and 2. Specifically, if Monolingual English learning infants cannot learn words based on the high-rising contrast, then we presume selective sensitivity to this contrast in tone language learners is not stimulus driven, but is guided by phonological knowledge. In this group, we also took the opportunity to investigate sensitivity to native English intonational contrasts. The purpose of this was to address two additional questions which could not be answered by tone language learners. First, we sought to investigate whether infants only bind pitch to word meanings if their language binds pitch to word meanings, or whether they demonstrate a general sensitivity to contrastive pitch movements when learning new words even if their language does not lexicalize pitch. Prior studies (e.g., Singh et al., 2014; Hay et al., 2015) have demonstrated that English monolingual learners do bind Mandarin tones to word meaning; however, these studies were both based on sensitivity to a single rising-falling contrast. 
We have yet to learn whether these sensitivities are present in equal measure for other lexical tone contrasts and, moreover, for native intonational contrasts. A second question derives from the fact that some pitch movements in English intonational systems, such as the question/statement contrast, correspond in pitch direction to lexical tone contrasts. Contrasting sensitivity to similar lexical and intonational contrasts in English monolingual infants may reveal whether non-tone language learning infants demonstrate a selective sensitivity to native phonological variation in pitch or whether they maintain a generalized sensitivity to isomorphic pitch contours, native or not.

In Experiment 3, four groups of Monolingual English environment infants were tested. The non-native condition consisted of lexical tone contrasts (High-Rising, Mandarin T1 [55] vs. T2 [35], or Thai [45] vs. [315], counterbalanced between infants; Rising-Falling, Mandarin T2 [35] vs. T4 [51], or Thai [315] vs. [241], counterbalanced between infants) (see **Table 1**). The native condition consisted of contrasts of English intonation: English Order vs. Statement (Mid/Falling vs. High/Falling), and Statement vs. Question (High-Falling vs. Mid/Rising).

### Participants

Thirty-one 17-month-old infants (22 female; M age = 532.39 days [17.5 months], SD = 12.75) were included in this experiment. An additional six infants participated but were excluded due to fussiness and failure to complete the experiment. Infants were randomly assigned to the four groups: native Order vs. Statement (n = 8), native Statement vs. Question (n = 8), non-native High vs. Rising (n = 8, 4 tested on Mandarin and 4 on Thai tones), non-native Rising vs. Falling (n = 7, 4 tested on Mandarin and 3 on Thai tones). Using a parental questionnaire about infants' language environment and exposure, it was confirmed that all infants were acquiring English as their first language and had no exposure to any additional language. Twenty-nine children were growing up in the United Kingdom, and two infants were growing up in Australia. All infants were typically-developing and were not at risk for sensory or developmental disorders.

### Results

Habituation trial data are presented in **Table 2** (Experiment 3). Comparison of infants' performance in the pre- and post-test and habituation phases across the four groups revealed no between-group differences in total looking duration, F(3, 27) = 0.313, p = 0.816, η <sup>2</sup> = 0.034, or the number of habituation trials, F(3, 27) = 0.863, p = 0.472, η <sup>2</sup> = 0.087. Similarly, there was no difference in looking duration in the pre- and post-test trials, F(1, 27) = 1.023, p = 0.321, η <sup>2</sup> = 0.037, and there was no effect of group, F(3, 27) = 1.304, p = 0.293, η <sup>2</sup> = 0.127, and no significant pre-/post-trial × group interaction, F(3, 27) = 0.998, p = 0.409, η <sup>2</sup> = 0.100. Thus, there was no systematic bias in attention between the groups, and within groups there was no general fatigue over time: attention did not diminish between pre- and post-test trials.

Log-transformed looking times in test trials for Native/Non-native, Tone Type, and Same/Switch trials are shown in **Figure 5**. To compare looking durations for the Same and Switch trials across the native vs. non-native and the tone type conditions, a 2 (Native, Non-native) × 2 (Static-Dynamic, Dynamic-Dynamic) × 2 (Same, Switch) ANOVA was conducted. This yielded no main effect of Same/Switch, F(1, 27) = 0.209, p = 0.651, η <sup>2</sup> = 0.008, or Tone Type, F(1, 27) = 1.887, p = 0.181, η <sup>2</sup> = 0.065. However, the main effect of Native/Non-native was significant, F(1, 27) = 5.359, p = 0.028, η <sup>2</sup> = 0.166. Monolingual English infants who were presented with non-native Mandarin and Thai lexical tones (M = 0.811, SE = 0.072) produced significantly longer looks than infants presented with native intonation contours (M = 0.579, SE = 0.010). Importantly, there were no interactions of Same/Switch × Native/Non-native, F(1, 27) = 0.191, p = 0.665, η <sup>2</sup> = 0.007, Same/Switch × Tone Type, F(1, 27) = 0.718, p = 0.404, η <sup>2</sup> = 0.026, Native/Non-native × Tone Type, F(1, 27) = 2.366, p = 0.136, η <sup>2</sup> = 0.081, or of Same/Switch × Native/Non-native × Tone Type, F(1, 27) = 0.389, p = 0.538, η <sup>2</sup> = 0.014. Infants' looking duration did not differ significantly in response to the Switch and Same trials in the native or the non-native conditions involving either the Static-Dynamic or the Dynamic-Dynamic contrasts.

Monolingual English infants were not sensitive to native intonational contrasts (Order vs. Statement, or Statement vs. Question) nor to non-native lexical tone contrasts (High vs. Rising or Rising vs. Falling) when learning novel words. While not making these fine distinctions between pitch contours, they did attend to unfamiliar non-native lexical tones to a greater extent than to familiar intonation patterns.

### DISCUSSION

The results of the three experiments are summarized in **Table 3**. As can be seen, each group of learners interpreted pitch movements in distinct ways. The results for each of the three groups are summarized below.

### Monolingual Mandarin Learning Infants

Monolingual Mandarin learning infants only contrasted words using the Mandarin High-Rising contrast. They did not contrast words using a Mandarin Rising-Falling contrast. They also did not contrast words using Thai contrasts with similar pitch properties to Mandarin tones.

### Bilingual Mandarin-English Learning Infants

Bilingual Mandarin-English learning infants, like Mandarin monolinguals, demonstrated sensitivity to the Mandarin High-Rising contrast, but not to the Mandarin Rising-Falling contrast. However, unlike Mandarin monolingual learners, their sensitivity to a native Mandarin High-Rising contrast extended to the non-native Thai High-Rising tone contrast.

### Monolingual English Learning Infants

Monolingual English learning infants, in contrast to Mandarin-exposed infants (both monolingual and bilingual), did not contrast words by any type of pitch contrast included in the experiment (Mandarin contrasts, Thai contrasts, intonational contrasts). Nevertheless, they were sensitive to pitch in that they attended to lexical tone-bearing words to a greater extent than intonationally marked words.

These findings suggest that participants' language background as well as the pitch properties of individual pitch/tone pairs influenced pitch sensitivity in novel word learning. Below, we discuss the results of each group in turn.

### Mandarin Monolinguals

Findings from Mandarin monolingual infants suggest that even for native learners, tone distinctions are acquired asynchronously. Asynchronies in tone sensitivity have been demonstrated in production (e.g., Wong, 2012a,b, 2013) and in tone discrimination (Tsao, 2017), but not, thus far, in infant word learning. Prior studies investigating tone discrimination in Mandarin infants point to emerging asynchronies in tone sensitivity between 6–8 and 10–12 months of age (Tsao, 2017). Specifically, in Tsao's (2017) tone discrimination study, 10- to 12-month-old Mandarin learning infants were more sensitive to T1 vs. T3 (high, 55, vs. dipping, 214, a different Static-Dynamic contrast to that used in this study) than to the same Dynamic-Dynamic contrast used here (T2 vs. T4; Rising, 35, vs. Falling, 51). Along similar lines, in a familiar word recognition paradigm, Ma et al. (2017) found that Mandarin monolingual toddlers were more sensitive to mispronunciations of familiar words introduced by a Static-Dynamic contrast (T1-T3, 55-214) than by Dynamic-Dynamic contrasts (T3-T4, 214-51 or T2-T3, 35-214). Nevertheless, there is evidence that younger, 6- and 9-month-old, English-language infants discriminate the dynamic-dynamic Thai Rising-Falling tone contrast and do so even better than the static-dynamic Low-Rising tone contrast (Mattock and Burnham, 2006). This suggests, reminiscent of the Stager and Werker (1997) consonant-based discrimination vs. word learning experiments, that in tone word learning tasks, previously discriminable tone contrasts may be difficult to bind to novel words. These findings suggest that, irrespective of tone contrast discriminability, the ease with which Mandarin-learning infants bind tones to novel words is constrained across novel word learning and familiar word recognition, with advantages consistently linked to Static-Dynamic contrasts. Furthermore, our findings suggest that Mandarin monolingual learners orient toward linking native High-Rising tones to words, but not to linking High-Rising Thai tones to novel words. Further research could investigate the question of tone properties more closely by determining whether complex dynamic tones (those involving double dynamic [fall and rise] contours, such as Tone 315 in Thai) are more challenging when associating words with meaning. This possibility is supported by the late encoding of complex Mandarin tones (e.g., 214) even for Mandarin monolingual infants learning novel words (Ma et al., 2017).

Our findings invite the question as to why Mandarin learning infants were insensitive to Rising-Falling contrasts, either in Mandarin or in Thai. One possibility is that this contrast overlaps with the question/statement distinction in Mandarin (Yuan, 2004, 2006; Zeng et al., 2004), which does not differentiate words, but rather specifies communicative intent. It is possible that the structure of the current version of the Switch task (i.e., no contextual cues or other cues to speaker intentionality) renders the rising/falling tone contrast truly ambiguous. If interpreted as a question/statement contrast, language learners should indeed not bind this contrast to word meanings, but if interpreted as a tone contrast, they should rely on it to differentiate words. Prior studies investigating Mandarin learners' abilities to resolve question vs. statement forms with rising and falling tones suggest that their ability to reconcile intonational contrast with lexical tone develops quite late. Only at 4–5 years of age (and not at 3–4 years) do children recognize rising and falling tones regardless of whether they are expressed in rising and falling pitch contours (Singh and Chee, 2016). Even adult speakers of Mandarin demonstrate some processing costs when tone and intonation are potentially confusable (Yuan, 2004). It is therefore possible that Mandarin monolingual infants did not bind rising and falling tone variants to novel words on the grounds that these tones overlap with non-lexical contrasts present in the input. Further research could qualify this possible explanation by testing Mandarin monolingual infants on a nonreferential discrimination paradigm to determine whether they could discriminate these tones outside of a word learning context, as per the above mentioned paradigm (Stager and Werker, 1997, Experiment 4). 
Alternatively, it is possible that language-identifying cues (e.g., carrier sentences in Mandarin) would have facilitated a lexical interpretation of Rising-Falling contrasts.

Indirect support for infants' lexical interpretation of rising-falling contrasts comes from (i) Singh et al. (2016) who, using the same word learning task, found that 18-month-old monolingual Mandarin learners distinguish both a subtle Dynamic-Dynamic, rising vs. fall/rise (Mandarin Tone 2, 35 vs. Tone 3, 214), and a Static-Dynamic, high level vs. fall/rise (Tone 1, 55 vs. Tone 3, 214) distinction, and (ii) the finding that, when provided with strong referential support and context to signify a lexical tone contrast, 18-month-old Mandarin learning infants do bind rising and falling tones to word meaning (Singh et al., 2014). Even though non-tone language adults successfully discriminate Rising-Falling lexical tones in Mandarin (T2 vs. T4, Wang et al., 1999), and in Thai (Rising, 315, vs. Falling, 241, Burnham et al., 2014), it is possible that these tones are uniquely complex for infants on account of their substantial overlap with question/statement forms. As suggested by the results here, this intonation-tone overlap may result in greater confusion in infants, who may still be engaged in the task of functionally differentiating pitch movements, than in adults.

### Mandarin-English Bilinguals

Findings from bilingual infants suggest a similar advantage for High-Rising contrasts and a similar lack of sensitivity to Rising-Falling contrasts. However, the difference between bilingual and monolingual participants in their perception of the Thai High-Rising contrast suggests important differences in monolingual and bilingual learners' tone percepts. The finding that monolingual Mandarin participants were not sensitive to a Thai tone contrast but bilingual learners were suggests the possibility of greater phonological flexibility in bilingual infants. This is consistent with previous data suggesting that bilingual infants demonstrate more lenient phonological boundaries for consonant variation (Garcia-Sierra et al., 2011; Petitto et al., 2012; Ferjan Ramírez et al., 2017; Singh, 2017; see also Estes and Hay, 2015 for effects of bilingualism on tone sensitivity). On account of more relaxed phonological boundaries, it is possible that the "grain size" of monolingual tone space may be smaller than that of bilingual infants. Prior studies suggest that the reduced granularity of the bilingual phonological space may facilitate the uptake of words in unfamiliar languages (e.g., Singh, 2017). However, it is possible that this may also complicate language learning. For example, at some point, native and non-native tone variation must be differentiated, allowing for the acquisition of more than one tone language. On one hand, it is evident from the current study that native tone sensitivity is not reduced in bilingual vs. monolingual learners, so there is no evidence of a bilingual cost to learning Mandarin. On the other hand, it remains to be seen whether a prolonged openness to non-native phonological variation could introduce a cost to learning other tone languages. In other words, the risk-to-opportunity ratio conferred by maintaining phonological flexibility remains to be determined.

### English Monolinguals

English monolingual infant learners were impervious to the integration of pitch movements when learning novel words. This is consistent with past studies using the Switch task demonstrating that English monolingual learners bound Mandarin rising and falling contours to word meanings before, but not after, 14 months (Hay et al., 2015). However, monolingual English learners, when primed with referential cues, continue to integrate tone into word meaning up to 18 months of age (Singh et al., 2014). Given this, the finding that English learners did not link pitch movements to word meanings is surprising in light of recent studies suggesting that non-tone language (Dutch and English) infants and toddlers become increasingly sensitive to a range of Mandarin lexical tone distinctions with age (Chen and Kager, 2015; Liu and Kager, 2015; Chen et al., 2017; Tsao, 2017). However, see also Shi et al. (2017), who found stable discrimination over age of particular Mandarin tone contrasts by French infants. In addition, there is evidence of a decrease in tone sensitivity with age for Thai tones (Mattock and Burnham, 2006) and Cantonese tones (Yeung et al., 2014) in English learning infants.

Thus far, all sources of evidence for increased sensitivity to Mandarin tones in non-tone language learning monolingual infants rest on data from tone discrimination tasks. In this regard, it is of interest here that monolingual English infants did respond to pitch differences: they discriminated (albeit between groups) native intonation and non-native tone stimuli, as shown by greater attention to (non-native) lexical tone syllables than to (native) intonational syllables. This could be interpreted as evidence that English learning monolingual infants do not treat all sources of pitch variation alike; there may be differences for lexical level (tones) vs. utterance level (intonation). Instead, they may recognize certain pitch contrasts (i.e., lexical tones) as foreign and unfamiliar, leading to a novelty preference for these sources of variation over familiar pitch variation (i.e., intonation). However, the task of interest here was binding differences between lexical tones or intonations to newly learned words, a step beyond discrimination. Thus, it is possible that while non-tone language learning monolingual infants do not consistently demonstrate perceptual narrowing for tones (and may show age-related facilitation), they do indeed demonstrate functional narrowing for tones such that tone becomes dissociated from word meaning with age. In other words, English learning infants' appreciation of the fact that tone does not serve a lexical function in English may mature in tandem with their increasing sensitivity to pitch movements in non-lexical contexts, such as auditory discrimination. Moreover, the fact that English language non-tone infants did not bind particular intonations to newly learned words in this study may be completely understandable: in English, while intonations are discriminable, they are not used to label words. Pitch discrimination abilities are integral to language comprehension in English (Cutler et al., 1997) and in all languages, and as such, infants' selective sensitivity to pitch movements in auditory discrimination tasks but not in lexical tasks may actually reflect maturation and refinement in the functional differentiation of pitch.

### SUMMARY AND CONCLUSIONS

In this set of studies, the range of pitch contrasts to which infants were exposed was broadened from prior studies. The result is a more complex picture than has been revealed by previous research that has focused almost exclusively on the Mandarin rising/falling contour. The results suggest different degrees of tone sensitivity in word learning for different tone contrasts. Findings invite the possibility that tone contrasts that aggregate with intonational contrasts (i.e., rising/falling contrasts) may be more complex to negotiate, particularly in the absence of linguistic context, for native monolingual and bilingual learners alike. In contrast, high and rising tone contrasts were bound to meaning in native tone learners. In comparing infants exposed monolingually and bilingually to Mandarin, our findings point to greater phonological flexibility in tone boundaries by bilingual learners. In sum, our findings extend and expand existing accounts of how infants interpret tone and pitch variation to suggest particularly strong effects of pitch properties on tone sensitivity in novel word learning. These effects appear to be stronger than those of language familiarity in guiding novel word learning.

### ETHICS STATEMENT

Ethics approval for testing human participants was granted by the Western Sydney Human Research Ethics Committee (approval number H7330). Infants' caregivers received detailed information about the study and signed an informed consent form. Infants were accompanied by their caregivers at all times during their visit to the four labs (U. Lancaster, NUS, Sunway U, and Western Sydney U), and the task was discontinued immediately if the infants showed any signs of discomfort or if their caregivers so wished.

### AUTHOR CONTRIBUTIONS

DB: leader of the project; lead chief investigator on funded grant that provided funds for the project; with KM, designed the experiments and stimuli; management of all aspects of the project including running the experiments, analyzing results, and writing the manuscript. LS: supervision of collection of bilingual data and monolingual Mandarin data in Singapore; management of stimulus material collection in Singapore; significant writing of manuscript, especially the Introduction; intellectual input to the project, especially to the interpretation of the results. KM: chief investigator on funded grant that provided funds for the project; with DB, designed the experiments and stimuli; management of monolingual English data collection at Lancaster University; editing and commenting on manuscript. PW: significant input to the design of the experiments; supervision of collection of bilingual data and monolingual Mandarin data in Malaysia; management of stimulus material collection in Malaysia; editing and commenting on manuscript. MK: data collection of monolingual English data at Lancaster University; supervision of monolingual English data collection at MARCS Institute, WSU; co-ordination of bilingual data collection between Malaysia and Singapore; analysis of data; preparation of figures; editing and commenting on manuscript.

### ACKNOWLEDGMENTS

This project was primarily supported by the Australian Research Council (DP0988201) to DB (lead investigator) and KM; and was also supported by a grant from the Ministry of Education, Singapore to LS (MOE Tier 1 Grant FY2013-FRC2-009). The authors wish to thank Dr. Benjawan Kasisopa for assistance with the design of the stimuli and for collecting the Thai, Mandarin, and English speech samples and then selecting and editing the speech stimuli. We thank the following people for recruitment and testing of participants: Ms. Jhia Mae Woo and Ms. Ai Jia Tan in Malaysia, Ms. Felicia Poh in Singapore, and Ms. Candice Michael in Sydney. We are very grateful for the time and commitment of all the caregivers and infants who participated in the study in Lancaster, Kuala Lumpur, Singapore, and Sydney.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Burnham, Singh, Mattock, Woo and Kalashnikova. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Effects of Lexical Pitch Accent on Infant Word Recognition in Japanese

#### Mitsuhiko Ota<sup>1</sup> \*, Naoto Yamane<sup>2</sup> and Reiko Mazuka<sup>2,3</sup>

<sup>1</sup> School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, United Kingdom, <sup>2</sup> Laboratory for Language Development, RIKEN Brain Science Institute, Wako, Japan, <sup>3</sup> Department of Psychology and Neuroscience, Duke University, Durham, NC, United States

Learners of lexical tone languages (e.g., Mandarin) develop sensitivity to tonal contrasts and recognize pitch-matched, but not pitch-mismatched, familiar words by 11 months. Learners of non-tone languages (e.g., English) also show a tendency to treat pitch patterns as lexically contrastive up to about 18 months. In this study, we examined whether this early-developing capacity to lexically encode pitch variations enables infants to acquire a pitch accent system, in which pitch-based lexical contrasts are obscured by the interaction of lexical and non-lexical (i.e., intonational) features. Eighteen 17-month-olds learning Tokyo Japanese were tested on their recognition of familiar words with the expected pitch or the lexically opposite pitch pattern. In early trials, infants were faster in shifting their eye gaze from the distractor object to the target object than in shifting from the target to the distractor in the pitch-matched condition only. In later trials, however, infants showed faster distractor-to-target than target-to-distractor shifts in both the pitch-matched and pitch-mismatched conditions. We interpret these results to mean that, in a pitch-accent system, the ability to use pitch variations to recognize words is still in a nascent state at 17 months.

#### Edited by:

Denis Burnham, Western Sydney University, Australia

#### Reviewed by:

Mariapaola D'Imperio, Aix-Marseille University, France; Marilyn Vihman, University of York, United Kingdom

#### \*Correspondence:

Mitsuhiko Ota mits@ling.ed.ac.uk

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 30 August 2017 Accepted: 26 December 2017 Published: 12 January 2018

#### Citation:

Ota M, Yamane N and Mazuka R (2018) The Effects of Lexical Pitch Accent on Infant Word Recognition in Japanese. Front. Psychol. 8:2354. doi: 10.3389/fpsyg.2017.02354

Keywords: pitch accent, intonation, Japanese, infants, word recognition

## INTRODUCTION

### Complexities in Learning Pitch-Based Lexical Contrasts

Infants must learn the sound categories that mark lexical contrasts in their language. Because every language differentiates words using segments (e.g., consonants and vowels), one of the tasks that infants universally have to engage in is to discover segmental phonetic differences that are lexically contrastive. Much of this process takes place during the first year and a half of life. Infants typically begin to lose perceptual sensitivity to acoustic differences that do not correspond to native segmental categories between 6 and 8 months for vowels (Kuhl et al., 1992; Polka and Werker, 1994) and between 8 and 12 months for consonants (Werker and Tees, 1984). They become able to distinguish familiar and novel words using acoustic differences that do correspond to native segmental categories as early as 11 months (Swingley and Aslin, 2000; Vihman et al., 2004; Swingley, 2005; Mani and Plunkett, 2010).

Some languages also distinguish lexical items with suprasegmental phonetic features such as pitch and duration. There is now a growing body of research on how infants acquire linguistic systems that mark lexical contrasts through variations in pitch, whose primary acoustic correlate is the fundamental frequency (F0) (e.g., Li and Thompson, 1977; Clumeck, 1980; Harrison, 2000; Hua and Dodd, 2000; Mattock and Burnham, 2006; Mattock et al., 2008; Singh et al., 2008; Sato et al., 2010; Singh and Foong, 2012; Yeung et al., 2013; Singh et al., 2015; Singh and Chee, 2016; see Ota, 2016; Singh and Fu, 2016 for overviews). Most previous work on this topic has focused on the development of infants learning a '(lexical) tone language,' a language that specifies the pitch height or contour of the syllables in each word, and on comparing it with the development of infants learning a language that does not use pitch to mark lexical contrasts (i.e., a 'non-tone language').

Findings from this line of research have revealed some interesting characteristics of the developmental trajectories of segmental and tonal contrasts. First, perceptual reorganization for pitch variations appears to occur earlier than that for segmental differences. Infants learning a non-tone language such as English or French lose perceptual sensitivity to certain pitch contrasts (e.g., rising vs. fall-rise) between 4 and 9 months, while infants learning a lexical tone language such as Mandarin, Thai or Yoruba maintain such perceptual sensitivity and also begin to show evidence of native tonal categories as early as 4 months (Harrison, 2000; Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013). The onset of these changes precedes the perceptual changes witnessed for segmental contrasts by a few months, suggesting that infants' ability to adapt to phonetic distributions in the linguistic environment is more advanced for pitch (or F0) than for phonetic dimensions related to segments (e.g., voice onset time, formant transitions).

Second, infant learners show robust readiness to incorporate pitch patterns into lexical information, whether or not their language uses pitch to encode lexical contrasts. Perhaps not surprisingly, tone-language learners begin to lexically encode pitch patterns before the end of the first year. For example, Singh and Foong (2012) tested Mandarin–English bilinguals on their ability to recognize word forms that were matched or mismatched on the tone of familiarized real words. While 9-month-olds incorrectly recognized both pitch-matched and mismatched Mandarin words, 11-month-olds correctly recognized only pitch-matched words. By 17–18 months, Mandarin-learning infants can also integrate tonal differences in novel word-object associations learned through short laboratory exposures (Singh et al., 2014, 2016). What is unexpected, though, is that learners of non-tone languages also associate pitch variations with novel word forms, in some cases up to 18 months (Singh et al., 2014; Hay et al., 2015). In Singh et al. (2014), for example, English-learning 18-month-olds distinguished newly learned words on the basis of pitch patterns. This tendency disappears by 2.5 years, when we see clear evidence that English-learning children treat pitch-differing words as lexically equivalent, reflecting the non-lexical nature of pitch contrasts in the language (Quam and Swingley, 2010). It should be noted that not all types of pitch contrasts are incorporated into lexical information with equal readiness, even when the contrasts are present in the ambient language. In Burnham et al. (2017), both monolingual Mandarin-learning and bilingual English–Mandarin 17-month-olds were able to differentiate novel words on the basis of the native Mandarin high vs. rising tone contrast but not on the native rising vs. falling tone contrast. In addition, bilingual English–Mandarin 17-month-olds were capable of using a non-native (Thai) version of the high vs. rising contrast to learn novel words, but not the non-native Thai rising vs. falling contrast. Thus, infants' capacity to lexically integrate pitch information is not unique to tone language learners, but it is constrained to some extent by the characteristics of the pitch contrast.

Overall, the existing literature suggests that tonal development is characterized by a precocious perceptual specification for pitch-related contrasts and readiness to incorporate pitch variations as lexical information. However, simple comparison of tone languages and non-tone languages may miss some of the potential complexities involved in mastering pitch phonology. First, the functions played by pitch in human languages are not limited to differentiation of words. In addition to marking lexical contrasts in some languages, pitch variations are also systematically used in intonation (or 'postlexical' contrasts) to indicate structures and contrasts above the word level (e.g., phrasal boundaries, focus, question vs. statement) and in paralinguistic expressions to signal speaker states (e.g., emotions, degrees of involvement, arousal) (Ladd, 2008). Because these non-lexical functions of pitch exist in all languages, systematic variations in pitch will be attested even if they are not used to mark lexical contrasts. This can explain why infants learning a non-tone language do not lose their sensitivity to all pitch variations. English-learning infants may become unresponsive to rising vs. low tones, but they continue to show good discrimination of rising vs. falling tones (Mattock et al., 2008), most likely because the latter contrast is encountered in the intonation patterns they are exposed to. It also provides an account of why learners of non-tone languages remain open-minded about the lexical vs. non-lexical status of pitch as late as 18 months (Singh et al., 2014), as infants must see enough evidence that pitch patterns do not correlate with word-level meanings before they abandon lexical interpretations of tonal variations. The multifunctionality of pitch variations can be a source of challenge to learners of tone languages too, as lexical tones are overlaid on intonational pitch movements. In Mandarin learners, it may not be until 4–5 years of age that children can identify certain tonal differences when they appear in intonational phrases with pitch movements that counteract those of lexical tones (Singh and Chee, 2016). The difficulty exhibited by younger Mandarin learners in learning novel lexical contrasts on the basis of the rising vs. falling contrast compared to the high vs. rising contrast may be attributable to the fact that the rising-falling difference also marks an intonational contrast in the language (Burnham et al., 2017).

A second potential source of complication in learning pitch-based lexical contrasts is that the pitch patterns associated with individual words may not always be constant. Such variability may come from a phonological rule governing lexical tones (i.e., tone sandhi) or an interaction between lexical and intonational features of pitch. An example of tone sandhi is what is known as Sandhi Rule 1 in Mandarin, by which a dipping tone (Tone 3) becomes a rising tone (Tone 2) when followed by another dipping tone. A word like hěn ('very') is therefore produced with either a dipping tone (e.g., hěn jìn 'very near') or a rising tone (e.g., hén yuǎn 'very far') depending on the following word or morpheme. The variability caused by sandhi may at least partly explain why Mandarin children as old as 3 years of age have difficulty in perceiving and producing the distinction between dipping and rising tones in familiar words (Li and Thompson, 1977; Clumeck, 1980; Wong et al., 2005; Shi et al., 2017). An example of variability introduced by an interaction of lexical and intonational features can be seen in Swedish. In (Stockholm) Swedish, words fall into two lexical pitch accent categories: Accent 1 and Accent 2. When initially stressed disyllabic words are produced in isolation, Accent 1 words have one pitch peak (e.g., anden 'the duck') whereas Accent 2 words have two (e.g., anden 'the ghost'). However, the second peak in Accent 2 words is an intonational feature (i.e., sentence stress), which disappears in non-focus positions. The variability of word accents caused by the tone-intonation interaction obscures the lexically relevant tonal contrast (Ota, 2006), and may be one of the reasons why Swedish-learning children show confusion between Accents 1 and 2 during their first 2 years (Plunkett and Strömqvist, 1992).

Here we investigate the developmental consequences of these complexities in pitch-based phonology by examining infants' word recognition in the lexical pitch accent system of Tokyo Japanese. A lexical pitch accent system differs from a canonical tone language system in that tones are specified in words in a much sparser way, usually only on one syllable of the word. But the overall pitch of a word is also shaped by intonation, creating a pitch contour that is a composite of lexical and non-lexical features. In a lexical pitch accent system, therefore, the challenge of mastering lexical tone contrasts is compounded by the issues described above. Learners must tease apart, within each word, the components of pitch patterns that are determined by lexical contrasts from those determined by non-lexical factors. They also need to determine how to represent the relevant pitch information that is associated with individual words even when those words may not always carry the same pitch pattern. The details of these aspects of pitch phonology in Japanese are described in the section below.

### Pitch Accent in Tokyo Japanese

Tokyo Japanese has only one type of tonal pattern that is lexically relevant, which is realized as a falling pitch contour. Words are either accented or unaccented. Unaccented words are not marked by the lexical falling pitch. Accented words have one 'accented' syllable, which carries the falling pitch contour within itself if it contains a long vowel or a nasal coda, but otherwise exhibits the pitch fall between itself and the following syllable. The pitch shape of individual words is also determined by a variety of intonational features, the most relevant of which for this study is the phrase-initial rise that marks the beginning of an accentual phrase. The interaction of the falling pitch accent and the phrase-initial rise is illustrated in the disyllabic minimal triplets in **Figure 1**, where the blue line above each word indicates a stylized F0 contour (in reality, there will be some interruptions in the F0 tracks due to the lack of voicing in /ʃ/). The contrast between the three words is fully visible when they are followed by another word or morpheme. The unaccented /haʃi/ 'edge' shows no rapid pitch fall (**Figure 1a**), but the initially accented /háʃi/ 'chopsticks' has a pitch fall between the first and second syllable (**Figure 1b**) and the finally accented /haʃí/ 'bridge' has a fall extending from the final syllable onto the following nominative marker (**Figure 1c**). The contrast between the unaccented /haʃi/ 'edge' and the finally accented /haʃí/ 'bridge,' however, is not observable when there is no following word or morpheme within the phrase (cf. **Figures 1d,f**). Furthermore, the rising pitch pattern shown in those two words disappears when they are not in phrase-initial position (**Figures 1g,i**), as the rise is a feature that marks the beginning of an accentual phrase. In contrast, the initially accented /háʃi/ 'chopsticks' is consistently marked by a falling contour.

**Figure 1** also shows an autosegmental analysis of the structure underlying these pitch contours, based on the Pierrehumbert–Beckman model of Japanese prosodic structure (Beckman and Pierrehumbert, 1986; Pierrehumbert and Beckman, 1988) and its successor, the J-Tobi model (Venditti, 2005). Under this framework, the lexically defined pitch fall is seen as a realization of H<sup>∗</sup>L, a sequence of high (H) and low (L) tones. The H<sup>∗</sup> portion of this tone combination docks on to the syllable that is lexically marked as accented. The onset of an accentual phrase is marked by a delimitative low tone (%L), followed by a high phrasal tone (H-), unless the realization of the latter is preempted by the presence of the lexical H<sup>∗</sup>. Captured in this analysis is the composite nature of the pitch patterns exhibited by these words in different contexts, which can be understood as combinations of two types of basic tones (H and L) assigned at different levels (i.e., words and phrases).<sup>1</sup>
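The composite nature of these contours can be made concrete with a small sketch. The function below is only an illustrative simplification, not part of the Pierrehumbert–Beckman or J-Tobi models: the function name, its parameters, and the reduction of each syllable to a single 'H' or 'L' label are assumptions introduced here, and the within-syllable fall on long vowels and nasal codas is not modeled. It combines a phrase-level rise with an optional lexical fall, roughly as described for the nine panels of Figure 1.

```python
def stylized_contour(accent, n_syllables=2, phrase_initial=True, particle=False):
    """Return one coarse tone label ('H' or 'L') per syllable.

    accent: None for an unaccented word, or the 1-based index of the
            lexically accented syllable (hypothetical encoding).
    particle: whether a following morpheme (e.g., nominative ga) is present.
    """
    total = n_syllables + (1 if particle else 0)
    if phrase_initial:
        # Phrase-level tones: low onset followed by a phrasal high (a rise).
        tones = ["L"] + ["H"] * (total - 1)
    else:
        # No phrase-initial rise: stylized here as a flat stretch.
        tones = ["H"] * total
    if accent is not None:
        # Lexical fall: the accented syllable is high, everything after it is low.
        tones[accent - 1] = "H"
        for i in range(accent, total):
            tones[i] = "L"
    return tones


# Followed by a particle, all three word types differ.
print(stylized_contour(None, particle=True))  # unaccented 'edge'
print(stylized_contour(1, particle=True))     # initially accented 'chopsticks'
print(stylized_contour(2, particle=True))     # finally accented 'bridge'
```

Under this stylization, the unaccented and finally accented words receive identical labels when no morpheme follows, and both lose the rise when they are not phrase-initial, whereas the initially accented word keeps its fall in every context.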

While the interaction of lexical and non-lexical (intonational) pitch in Japanese words may be revealed unambiguously in such segmentally identical words, most words that a learner encounters do not come in minimal tonal pairs or triplets. Rather, words with different pitch profiles are typically also segmentally different, as illustrated in **Figure 2**. Given this type of input, how does a learner of Tokyo Japanese go about teasing apart the lexical and non-lexical components of pitch patterns? In particular, when do they understand that the variable pitch patterns associated with the unaccented /isu/ 'chair' (**Figures 2a,d,g**) and finally accented /inu/ 'dog' (**Figures 2c,f,i**) lexically mark those words in contrast with the falling pitch contour of the initially accented /neko/ 'cat' (**Figures 2b,e,h**)? How do they encode that information in their lexical knowledge of /isu/ and /inu/? Do they use pitch patterns to recognize those words even though they can be sufficiently identified on the basis of their segmental composition?

It is still not clear whether these aspects of the pitch accent phonology deter Japanese-learning infants from identifying the lexically relevant pitch contrasts. There is evidence that Japanese infants develop early sensitivity to the acoustic differences involved in the contrasts. As early as 4 months, they are capable of discriminating the falling vs. rising difference manifested in isolated words such as /háʃi/ ('chopsticks') (**Figure 1d**) and /haʃí/ ('bridge') (**Figure 1f**) (Sato et al., 2010). By 10 months, they begin to show left-hemispheric dominance in processing the same pitch contrast embedded in words, but not when the contrast is presented in pure tones, suggesting that their perception of pitch contours becomes specialized for linguistic processing between 4 and 10 months (Sato et al., 2010). In contrast, there is scant empirical information as to when pitch contrasts become lexically incorporated in Japanese learners. Studies based on production data show that 15- to 24-month-olds consistently produce a falling contour for isolated initially accented words such as /neko/ ('cat') (**Figure 2d**), but vary in the extent to which they can produce a rising contour for isolated words with no or a non-initial accent such as /inu/ ('dog') (**Figure 2f**) (Hallé et al., 1991; Ota, 2003). This could be interpreted as evidence that Japanese-learning infants of this age have identified and learned the lexical falling pitch pattern but not the phrase-initial rise. However, a failure to produce a rising pitch contour may also be due to the additional articulatory effort required to produce a pitch rise compared to a pitch fall (Snow, 1998). The existing literature, therefore, fails to answer the question of how learning lexical contrasts in a lexical pitch accent language compares to the development of tone or non-tone languages.

<sup>1</sup>These models of Japanese prosody also propose higher levels of structure that assign non-lexical tones (the 'intermediate phrase' and 'utterance' in Pierrehumbert–Beckman, and the 'intonation phrase' in J-Tobi). These levels are not included in the discussion here as they do not have immediate bearing on our study.

FIGURE 1 | Three segmentally identical Japanese words contrasting in pitch accent. Blue lines are stylized F0 contours. In (a–c), hashi is followed by a nominative marker /ga/. In (d–f), it is the only word in an accentual phrase (and therefore, phrase-initial). In (g–i), it is not the initial word in an accentual phrase. Tonal analysis is given below each item. H∗L is a pitch accent assigned at the word level. L% marks the onset of the accentual phrase (shown in curly brackets), and is followed by a phrasal H tone (H–).

### Purpose of the Current Study


Previous work indicates that learners of tone languages (e.g., Mandarin) can use pitch in recognition of familiar words by 11 months and in novel word learning by 18 months. Learners of non-tone languages (e.g., English) before 18 months are also able to lexically encode pitch variations. This suggests that regardless of what lexical role pitch plays in the target language, infants before 18 months are capable of extracting the relevant pitch patterns associated with lexical input and encoding them in their lexicon. Can this ability also be exploited in learning a pitch accent system such as Japanese despite the complexities described above, which might obscure the lexically relevant patterns? This should be possible if Japanese infants are tracking the whole range of pitch patterns that are associated with individual words. For example, they may store exemplars of the final-accented word /inu/ 'dog' with a rising contour (**Figures 2c,f**) and a flat pattern (**Figure 2i**), allowing them to recognize both patterns as familiar forms even before they master the role of the accentual phrase. From that point of view, we expect Japanese-learning infants before 18 months of age to be able to differentiate words on the basis of pitch variations that correspond to a lexical contrast (i.e., rising vs. falling contour).

In this study, we investigated this question by experimentally testing the extent to which modifications in pitch contour can affect recognition of words that Japanese infants are likely to be familiar with. Words that infants frequently hear in their linguistic input are subject to natural variation in pitch including, crucially, the phrase-initial intonational marking that makes the rising pitch a variable feature. Testing recognition of familiar words, therefore, allows us to see whether infants overcome such input variability in integrating pitch information into lexical representations. To this end, we employed the mispronunciation paradigm (Swingley and Aslin, 2000) to test Japanese-learning 17-month-olds and examined their recognition of phrase-initial words with no accent or a final lexical pitch accent (e.g., /inu/ 'dog' in **Figure 2c**) when we imposed a falling pitch contour on those words, making them (incorrectly) initially accented. If, by this age, Japanese infants have developed an understanding of the lexical function of this pitch contrast, they should show better recognition of the test words with the correct (i.e., rising) contour than with the incorrect (i.e., falling) contour.

## MATERIALS AND METHODS

### Overview

The participants in the experiment were eighteen 17-month-olds learning Tokyo Japanese. In each trial during the experiment, the infants saw two pictures on the monitor, accompanied by a recorded sentence naming one of the visual objects. In some trials, the target picture was named with the 'correct' pitch contour on the test word, while in others, it was named with an 'incorrect' pitch contour. There were also some filler trials in which a cartoon character familiar to many Japanese children was named with the correct pronunciation. Infants' fixation to the visual objects was recorded using an eye-tracker.

### Participants

The 18 participants ranged in age from 17 months and 4 days (520 days) to 17 months and 30 days (546 days), with a mean of 17 months and 20 days (537 days). Half of them were female. One additional participant was tested but not included in the analysis due to eye-tracking failure caused by fussiness. All infants were born full-term and had no known history of ear infection or hearing problems. All infants also had parents who grew up in the vicinity of Tokyo, where the lexical accent of the test words followed the patterns illustrated in **Figures 1**, **2**. None of them was reported to have regular exposure to languages other than Japanese. Written informed consent was obtained from the parents of the participants.

### Materials

### Auditory Stimuli

The test words comprised three sets of words: experimental words produced with the expected pitch contour ('Correctly Pronounced' or 'CP' words), experimental words produced with an unexpected pitch contour ('Mispronounced' or 'MP' words), and filler words, which were names of cartoon characters, always produced with the correct pitch contour. The CP and MP versions of the experimental words were created from 3 disyllabic words (inu 'dog,' isu 'chair,' and ashi 'leg') and 3 trisyllabic words (sakana 'fish,' kuruma 'car,' and oshiri 'bottom/buttocks'). They either had a lexical pitch accent on the final syllable (inu, ashi) or no pitch accent (the rest). Each of these words was embedded in the carrier passage Mite! Soo, [target] ('Look! Yes, [target]'), and said in such a way that it formed an independent prosodic phrase at the end of the sentence. The CP version had a rising pitch contour, as expected for a phrase-initial word without initial lexical accent. The MP version had a falling pitch contour, which (incorrectly) signals an initial pitch accent. Each carrier passage was followed by one of the additional phrases, kawaii ne ('Isn't that cute.'), omoshiroi ne ('Isn't that interesting.') or wakatta ka na ('Did you get it?'). These phrases were added simply to break the monotony of the carrier passages without affecting the interpretation of the critical component of the stimuli. Combination of the additional phrase with the main part of the carrier passage was fully crossed. **Figure 3** shows schematic representations of these experimental stimuli, and **Figure 4** gives actual F0 extractions from the CP and MP versions of the recordings for kuruma 'car.'

The filler words were Ampamman, Doraemon, Mikkii (Mickey Mouse) and Puu-san (Winnie the Pooh). The first two occurred in the carrier passage Are? \_\_\_ da. Omoshiroi ne. ('Hm? That's \_\_\_. Isn't that interesting.') and the other two in the carrier passage A! \_\_\_ da yo. Kawaii ne. ('Oh! There's \_\_\_. Isn't that cute.').

FIGURE 3 | Examples of carrier passages with the experimental items. Each passage was also followed by kawaii ne, omoshiroi ne, or wakatta ka na. Lines above the words are stylized representations of the pitch contour.

The stimuli were read by a female native speaker of Japanese, using infant-directed speech, and digitally recorded in a soundproof room at a sampling rate of 44.1 kHz (16 bit). Sound files were spliced so that the same recording of the carrier passages was used across experimental words. They were also normalized for amplitude.

### Visual Stimuli

The visual stimuli were colored illustrations of the objects and characters corresponding to the experimental and filler words: a dog, a chair, a leg, a fish, a car, buttocks, Ampamman, Doraemon, Mickey Mouse, and Winnie the Pooh. The images were yoked in pairs based on their semantic characteristics: dog with fish, leg with buttocks, chair with car, Ampamman with Doraemon, and Mickey Mouse with Winnie the Pooh. They were presented side by side against a black background on a 24-inch wide-screen monitor (1920 pixels × 1200 pixels, approximately 57.3 cm × 45.0 cm). On the screen, the pictures were approximately 480 pixels × 360 pixels in size and separated by about 480 pixels.

### Procedure

The experiment was conducted in a dimly lit sound-proof room. Infants sat on their parent's lap, approximately 60 cm away from the stimulus-presenting monitor. Parents listened to masking music played through a headset so that they could not hear the auditory stimuli, and were also asked to look down to prevent their eyes from being targeted by the tracking device. The experiment was monitored by a researcher, who sat in a control area outside the room and watched the procedure through a closed-circuit TV monitor. Stimulus presentation was controlled by the E-Prime 2.0 software (Psychology Software Tools, Pittsburgh, PA, United States). Auditory stimuli were played through loudspeakers placed below the TV monitor. Eye-gaze data from the infants were collected using a Tobii T60XL eye-tracking system.

Before the experimental trials, a five-point calibration routine was run in order to calibrate the eye-tracker to the infant's eyes. The experimental trials consisted of 12 test trials and 4 filler trials, for a total of 16 trials. Each trial was 8 s long, and began with the presentation of two images appearing side by side at the vertical center of the screen. The images simultaneously moved at a steady pace toward the top of the screen, then to the bottom and back to the center at the end of the trial. The carrier passage began 2 s after the beginning of the trial. The onset of the test word (both experimental words and the fillers) occurred at 5 s. Between the trials, an animated sequence of a rotating smiley face was played. When the infant's gaze was fixated on the center of the screen, the experimenter started the following trial.

Four stimulus sets were used, each with two blocks of presentation. The second and fourth stimulus sets reversed the block order of the first and third. The third and fourth sets were left-right reflections of the first and second. Each of the six experimental words was tested once in each block, under the CP condition in one block and under the MP condition in the other. Each of the four filler words was tested once in each experiment, in either the first or second block. Each picture served twice as the target (on the right in one block, and on the left in the other) and twice as the distractor (also on the right in one block, and on the left in the other). Presentation order was randomized within block.
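The counterbalancing scheme can be sketched in code as follows. This is a minimal illustration under stated assumptions rather than the lists actually used: which words were CP in the first block, which side each target appeared on, and the placement of two fillers per block are hypothetical choices made here, as are all function and field names. Only the relations between the four sets (reversed block order, left-right reflection) follow the description above.

```python
import random

WORDS = ["inu", "isu", "ashi", "sakana", "kuruma", "oshiri"]
FILLERS = ["Ampamman", "Doraemon", "Mikkii", "Puu-san"]


def opposite(value, pair):
    """Return the other member of a two-item pair."""
    a, b = pair
    return b if value == a else a


def base_set():
    """Build a hypothetical first stimulus set as [block1, block2]."""
    block1, block2 = [], []
    for i, word in enumerate(WORDS):
        cond = "CP" if i % 2 == 0 else "MP"   # assumed assignment
        side = "left" if i < 3 else "right"   # assumed target side
        block1.append({"word": word, "cond": cond, "side": side})
        # The other block uses the other condition and the other side.
        block2.append({"word": word,
                       "cond": opposite(cond, ("CP", "MP")),
                       "side": opposite(side, ("left", "right"))})
    for i, name in enumerate(FILLERS):
        trial = {"word": name, "cond": "filler",
                 "side": "left" if i % 2 == 0 else "right"}
        (block1 if i < 2 else block2).append(trial)  # assumed: two per block
    return [block1, block2]


def reverse_blocks(stim_set):
    return [stim_set[1], stim_set[0]]


def mirror(stim_set):
    """Left-right reflection of every trial in a set."""
    return [[dict(t, side=opposite(t["side"], ("left", "right")))
             for t in block] for block in stim_set]


def build_sets():
    set1 = base_set()
    set3 = mirror(set1)
    return {1: set1, 2: reverse_blocks(set1), 3: set3, 4: reverse_blocks(set3)}


def presentation_order(stim_set, seed=0):
    """Shuffle trials within each block, leaving block order intact."""
    rng = random.Random(seed)
    return [rng.sample(block, len(block)) for block in stim_set]
```

Calling `build_sets()` yields four sets of 16 trials each (12 test and 4 filler), and `presentation_order` returns a within-block randomization for a given set.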

### RESULTS

If, by 17 months, Japanese infants have learned that disyllabic words without an initial pitch accent must not have a falling pitch contour, they should be more accurate or faster at fixating on the target image in CP trials than in MP trials. If their understanding of lexical pitch accent is robust enough, we expect to find this effect throughout the experimental trials. However, previous work on early lexical representation using a similar paradigm found that mispronunciation effects sometimes diminish over the course of the experiment (Vihman et al., 2004). This occurs presumably because infants begin to accept the mispronounced versions of the familiar words in later trials when the lexical encoding of the critical contrasts is fragile. We therefore included trial order (i.e., first vs. second block) as a factor in our analysis.

The analysis was carried out using onset-contingent eye-movement data, which are summarized in **Figure 5**. These graphs display the time course of eye movement from the temporal onset of the test word, separately for the first block (top panel) and the second block (bottom panel). Within each panel, trials are aggregated into different lines depending on the condition (CP vs. MP) and the object at which the infant was looking at the word onset (target vs. distractor). For the purpose of the analysis, we call the object that matches the test word segmentally the 'target' picture whether the pitch contour was correct or not. For example, the picture of the dog was the target for both /inu/ (CP) and /inu/ (MP) and the yoked picture of the fish was the distractor for those words. Conversely, the picture of the fish was the target for both /sakana/ (CP) and /sakana/ (MP). The y-axis shows the proportion of fixation shifts to the opposite visual object for each 40 ms from the word onset. In the case of target-initial trials, this is the proportion of looks to the distractor over the sum of target and distractor looks. In the case of distractor-initial trials, this is the proportion of looks to the target over the sum of target and distractor looks. The analysis did not include trials in which the infant was looking at neither the target object nor the distractor at the onset of the test word, which accounted for 22.4% of the data.
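The onset-contingent measure described above can be illustrated with a short Python sketch. It assumes, hypothetically, that each trial is coded as a list of gaze labels per 40 ms bin (`"T"` for target, `"D"` for distractor, `None` for looks elsewhere), with the first bin at word onset; the function name and the toy data are illustrative only and are not the authors' pipeline.

```python
def shift_proportions(trials, onset_look):
    """Per-bin proportion of looks to the object opposite to the onset look.

    trials: list of per-bin gaze labels ("T", "D" or None), bin 0 = word onset.
    onset_look: "T" for target-initial trials, "D" for distractor-initial trials.
    """
    opposite = "D" if onset_look == "T" else "T"
    # Keep only trials that start on the requested object; trials starting
    # on neither object are excluded, as in the analysis described above.
    kept = [t for t in trials if t[0] == onset_look]
    if not kept:
        return []
    n_bins = min(len(t) for t in kept)
    proportions = []
    for b in range(n_bins):
        # Denominator: looks to either object in this bin.
        looks = [t[b] for t in kept if t[b] in ("T", "D")]
        if looks:
            proportions.append(sum(g == opposite for g in looks) / len(looks))
        else:
            proportions.append(float("nan"))
    return proportions


# Toy data: five trials of four bins each (invented for illustration).
toy_trials = [
    ["D", "D", "T", "T"],
    ["D", "D", "D", "T"],
    ["D", None, "T", "T"],
    ["T", "T", "D", "D"],
    [None, "T", "T", "T"],  # excluded: onset look on neither object
]
distractor_initial = shift_proportions(toy_trials, "D")
target_initial = shift_proportions(toy_trials, "T")
```

Running the function separately for each combination of condition and block would give the kind of curves that are plotted per panel.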

Following previous literature on fixation latency of this age range, we chose to analyze the gaze data from 360 to 2000 ms after word onset (Fernald et al., 1998), and modeled the time course of fixation shifts using growth-curve analysis (Mirman, 2014). All modeling was carried out using the lme4 package (Bates et al., 2015) in R. Time bins of 40 ms were created from the word onset and transformed to second-order orthogonal polynomial values to avoid correlations between time terms. We first ran two base models, one with the linear time term and one with both the linear and quadratic time terms. Both models also included by-participant random intercepts and slopes. As comparison of these models showed that adding a quadratic term to a linear-only model improved the model fit [χ<sup>2</sup>(4) = 42.71, p < 0.001], all subsequent models were built with linear and quadratic time terms (both with polynomial values). Next we ran an omnibus analysis using the two time terms (Time and Time<sup>2</sup>), Onset Look (Target vs. Distractor), Condition (MP vs. CP) and Block (1st vs. 2nd) as fixed effects (including their interactions), as well as participant random effects on both Time and Time<sup>2</sup>, and participant-by-condition random effects on both Time and Time<sup>2</sup>. This analysis yielded significant 4-way, 3-way and 2-way interactions involving Block and the other fixed effects (see **Table 1** for full results).
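To make the role of the orthogonal time terms concrete, the following Python sketch builds linear and quadratic terms that are uncorrelated with each other, in the spirit of what R's `poly(time, 2)` is generally used for in growth-curve analysis. It assumes the analysis window is sampled at 360, 400, ..., 2000 ms; whether both endpoints were included is not stated in the text, and the function name is hypothetical.

```python
import math


def orthogonal_time_terms(times):
    """Return (linear, quadratic) terms that are centered, unit-length
    and mutually orthogonal, built by Gram-Schmidt on t and t**2."""
    n = len(times)
    mean_t = sum(times) / n
    linear = [t - mean_t for t in times]               # remove the constant
    norm = math.sqrt(sum(v * v for v in linear))
    linear = [v / norm for v in linear]

    squares = [t * t for t in times]
    mean_sq = sum(squares) / n
    quad = [s - mean_sq for s in squares]              # remove the constant
    proj = sum(q * l for q, l in zip(quad, linear))
    quad = [q - proj * l for q, l in zip(quad, linear)]  # remove the linear part
    qnorm = math.sqrt(sum(v * v for v in quad))
    quad = [v / qnorm for v in quad]
    return linear, quad


# Assumed bin onsets in ms, converted to seconds for numerical convenience.
bins_ms = list(range(360, 2001, 40))
time_lin, time_quad = orthogonal_time_terms([ms / 1000 for ms in bins_ms])
```

Because the two terms are orthogonal, the estimate for Time is not confounded with the estimate for Time<sup>2</sup>, which is the motivation given above for the transformation.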

In order to tease apart these interactions, we proceeded to build separate models for the two blocks. In these models, Block and its interactions with other factors were removed. The results for Block 1 are given in **Table 2**. There were significant interactions between Onset Look and Condition on Time and Time<sup>2</sup>, with the linear term indicating a generally faster overall shift from the distractor to the target for the CP condition relative to the MP condition (Estimate = 0.624, SE = 0.139, p < 0.001) and the quadratic term indicating more acceleration in the distractor-to-target shift for the CP condition relative to the MP condition (Estimate = 0.273, SE = 0.139, p = 0.049). There was also a significant interaction between Onset Look and Condition, reflecting an overall higher level of distractor-to-target shift in the CP condition than the MP condition (Estimate = 0.048, SE = 0.023, p = 0.037). In addition, there was an effect of Condition on Time, suggesting that the overall speed of shift was slower for the CP condition relative to the MP condition (discounting the Condition × Onset Look interaction mentioned above) (Estimate = −0.256, SE = 0.100, p = 0.010). However, there were no interactions of Onset Look and the time terms. These results indicate that the infants were more likely to shift their gaze from the distractor to the target object and did so faster than target-to-distractor shifts but only in the CP condition. In short, their distractor-to-target response was contingent on hearing the target word with the correct pitch contour.
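As a reading aid, the relation between the reported estimates, standard errors and p-values can be approximated with a Wald z-test. The sketch below assumes that the p-values rest on a normal approximation, which the text does not state; the function name and the dictionary labels are hypothetical, and the numbers are those quoted above for Block 1.

```python
import math


def wald_p(estimate, se):
    """Two-sided p-value for z = estimate / se under a standard normal."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2))


# (estimate, SE, reported p) as quoted in the text for Block 1.
block1_effects = {
    "Onset Look x Condition x Time^2": (0.273, 0.139, 0.049),
    "Onset Look x Condition": (0.048, 0.023, 0.037),
    "Condition x Time": (-0.256, 0.100, 0.010),
}

for label, (est, se, p_reported) in block1_effects.items():
    print(f"{label}: z = {est / se:.2f}, "
          f"approx. p = {wald_p(est, se):.3f} (reported {p_reported})")
```

Small discrepancies in the third decimal are expected, since the estimates and standard errors are themselves rounded.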

The results for Block 2 are given in **Table 3**. There was a significant interaction between Onset Look and Condition on Time, indicating a generally slower overall shift from the distractor to the target for the CP condition relative to the MP condition (Estimate = −0.439, SE = 0.115, p < 0.001). However, there was again a significant interaction between Onset Look and Condition, indicating an overall higher level of distractor-to-target shift in the CP condition than the MP condition (Estimate = 0.144, SE = 0.024, p < 0.001). These outcomes are likely due to the changing rates in the distractor-to-target shift in the MP condition, which showed little movement up to about 1000 ms post-naming, but a rapid increase toward the 1400 ms point, after which it plateaued. In comparison, the temporal change in the CP condition was more monotonic. Importantly, there was also a significant effect of Onset Look on Time, showing that the distractor-to-target shift was faster than the target-to-distractor shift across conditions (Estimate = 1.200, SE = 0.086, p < 0.001). In addition, there was an effect of Condition on Time, this time suggesting that the overall speed of shift was faster for the CP condition relative to the MP condition (discounting the Condition × Onset Look interaction mentioned above) (Estimate = 0.371, SE = 0.082, p < 0.001). These results indicate that, unlike in Block 1, infants were more likely to shift their gaze from the distractor to the target in both the MP and CP conditions, although the onset of the response was delayed in the MP condition compared to the CP condition.

The overall level of distractor-to-target shift was higher in the second block than in the first. In Block 1, the proportion of distractor-to-target shift did not reach 50% even between 1500 and 2000 ms in either the CP (mean = 39.1%) or MP (mean = 31.0%) condition. In Block 2, the mean shift proportion between 1500 and 2000 ms was 55.5% for the CP condition and 49.6% for the MP condition, although the difference in distractor-to-target shift between the two conditions was not statistically significant.

### DISCUSSION

In this study, we examined whether 17-month-old Japanese-learning infants understand the contrastive nature of the pitch patterns in familiar words. Our focus was on phrase-initial unaccented and finally accented disyllabic words such as /isu/ 'chair' and /inú/ 'dog,' which have a rising pitch pattern as opposed to the falling pitch pattern found in initially accented disyllabic words such as /néko/ 'cat.' A point of particular interest was that the pitch rise is not a unique lexical marker of the unaccented and finally accented words, and the lexical contrast needs to be understood as a lack of the falling pitch contour that unambiguously defines initially accented words. We predicted that Japanese-learning infants should be able to learn this contrast by exploiting the type of ability exhibited by both tone and non-tone language learners of similar ages to encode pitch information in lexical representation. The results of our experiment present some evidence that 17-month-olds indeed utilize pitch information in recognizing words such as /isu/ and /inu/. In early trials, infants were faster in shifting their gaze from the distractor object to the target object when the test word correctly had a rising pitch contour than when it incorrectly had a falling contour. This part of the results indicates that despite the variable realizations of the pitch contours, Japanese-learning infants by this age have internalized some information about one of the possible pitch patterns (i.e., the rising contour) of these words to the extent that the online recognition process was facilitated by pitch-matching.

This difference between the correct and incorrect conditions, however, did not persist into later trials, during which infants showed faster distractor-to-target shifts than target-to-distractor shifts both when the test words were 'mispronounced' with a falling contour and when they were correctly pronounced with a rising contour. Although the pitch-mismatched words caused a slight delay in the onset of the distractor-to-target shift, they induced as much target object fixation as did the pitch-matched words within 2 s. The willingness infants exhibited in accepting such mappings suggests that the lexical encoding of pitch information is not firmly established enough to reject a mismatch in pitch in later trials. This outcome is similar to that from one of the experiments conducted by Vihman et al. (2004) in which they tested 11-month-old English-learning infants on their auditory recognition of familiar words (e.g., baby) and mis-stressed words (e.g., ba'by) compared to rare words that are assumed to be unfamiliar (e.g., bridle). Tests using the head-turn preference paradigm showed no difference in the preference for mis-stressed words vs. rare words during the first half of the experiment, indicating that recognition of familiar words was blocked by the incorrect placement of stress. However, mis-stressed words were significantly preferred over rare words in the second half, suggesting that after exposure to examples such as ba'by, the infants began to regard the stress-mismatched words as familiar words. The emergent tendency to accept the pitch-mismatched words in our experiment might have been induced further by the nature of the task, which involved visual stimuli presented in pairs. In a visual world paradigm, participants' processing of prosodic information can be guided incrementally by the contextual expectations signaled by the visual stimuli (Kurumada et al., 2014). In the case of the current experiment, once the infants register, for example, the fact that there is a picture of a dog (/inu/) as well as of a fish (/sakana/) on the screen, they are more likely to look toward the dog upon hearing the pitch-mismatched /inu/, simply because of its better segmental match with one of the options presented. The extent to which such expectation effects might have affected the outcome of our study can be gauged by testing the same linguistic stimuli using the wordform-only design employed in Vihman et al. (2004) and several other studies on the lexical representation of familiar words in infants (e.g., Hallé and de Boysson-Bardies, 1994, 1996; Swingley, 2005; Vihman and Majorano, 2017).

#### TABLE 1 | Summary of the omnibus growth-curve model.

Parameter estimates are for CP relative to MP (Condition), Block 2 relative to Block 1, and distractor-initial relative to target-initial (Onset Look). <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01, <sup>∗∗∗</sup>p < 0.001.

Parameter estimates are for CP relative to MP (Condition) and distractor-initial relative to target-initial (Onset Look). <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01, <sup>∗∗∗</sup>p < 0.001.

These methodological considerations notwithstanding, the results here indicate that 17-month-olds are still in a nascent state when it comes to their grasp of the lexical import of the rise/fall contrast in Japanese. This timing of development seems rather protracted given the evidence that Japanese infants can perceptually discriminate the same contrast as early as 4 months (Sato et al., 2010), and both Mandarin and English infants of similar or younger ages are capable of encoding a rise/fall contrast in novel words through brief lab exposure (Singh et al., 2014; Hay et al., 2015). As foreshadowed in the Introduction, such a delay is most likely caused by the variable realization of pitch patterns introduced by the interaction of lexical and nonlexical factors in Japanese pitch phonology. A Japanese infant who hears the word /inu/ 'dog' sometimes with a pitch rise and sometimes with a flat pitch pattern may conclude (correctly) that the rise/plateau alternation is lexically irrelevant, but may fail to notice, precisely because of this variability, that the contrast between a rise or a plateau, on the one hand, and a fall, on the other, is lexically relevant. Note that such input variability is not a feature of experiments that demonstrate successful mapping of lexical tones with novel words by both Mandarin and English infants (e.g., Singh et al., 2014; Hay et al., 2015), because in these studies, the stimuli are played consistently in one type of lexical tone during familiarization. Hence, the ability to lexically encode pitch from invariable exemplars does not guarantee successful extraction of lexically contrastive pitch patterns in the face of variable realizations. Further support for this interpretation comes from a finding reported by Shi et al. (2017) for familiar word recognition by Mandarin learners. As in our study, Shi et al.
(2017) used the mispronunciation paradigm with visual references, and tested whether monolingual Mandarin learners between 19 and 26 months would recognize familiar words when an incorrect tone was assigned. Their participants detected mispronunciations involving Tone 2 (rising tone) and Tone 4 (falling tone) or Tone 3 (dipping tone) and Tone 4, demonstrating that they have internalized these tonal contrasts in their lexical knowledge. However, the same individuals did not detect mispronunciations involving the contrast between Tones 2 and 3. Shi et al. (2017) reject perceptual confusion as a source of this failure because younger Mandarin learners are capable of discriminating Tones 2 and 3. Instead, they attribute the lack of mispronunciation effects for Tone 2/3 to the variable realization of Tone 2. As discussed in the Section "Introduction," in Mandarin, Tone 3 (dipping tone) is realized as Tone 2 (rising tone) when followed by another Tone 3. Mandarin infants, therefore, are exposed to words whose pitch pattern alternates between a dipping contour and a rising contour, potentially leading them to inaccurately encode both dipping and rising patterns as contextually constant representations of Tone 3 words. Variability is also a potential factor behind the apparently late pitch phonology development in Limburgian (Ramachers et al., 2017). Like Japanese, Limburgian has one type of tonal contrast that is lexically assigned to a syllable in each word, but its pitch realization varies dramatically across intonational contexts (e.g., declarative, interrogative, and continuation) (Gussenhoven and van der Vliet, 1999). Ramachers et al. (2017) trained 2.5- to 4-year-olds on novel word-object associations and subsequently tested their word recognition using a mispronunciation design. 
Their Limburgian learners fixated on the target object even when they heard a pitch-mismatched version of the novel word, suggesting that the pitch differences were not treated as a lexical contrast. It is difficult to compare this result with that of our study, given the differences in age, methodology (in particular, the use of novel words as opposed to familiar words), and linguistic environment (the Limburgian toddlers were also heavily exposed to Dutch)<sup>2</sup> . Yet, they are both consistent with the notion that the task of learning pitch contrasts could be made arduous when their realizations are subject to variability due to non-lexical factors.

<sup>2</sup>Another source of complication is that pitch mispronunciation did not block word recognition in either their age-matched Dutch toddlers or adult Limburgian speakers.

A slightly different point that is nevertheless pertinent to the issue of variability is the phonological contexts in which words tend to appear in the learner's speech input. If infants hear words such as /inu/ 'dog' and /isu/ 'chair' predominantly in single-word utterances (as in **Figures 2d,f**), the pitch contrast against initially accented words such as /neko/ (**Figure 2e**) will be more noticeable because it will be realized as a difference between a rising and a falling contour in the large majority of the cases, and because the size of the phonological material over which the critical contrast is expressed is small (i.e., a single word, which is also the entire phrase and utterance). This means that the question as to how easily learners can unravel the prosodic phonology that underlies the observable pitch patterns in the language is dependent not only on the nature of the system (e.g., lexical tone, lexical pitch, intonation) but also on how the critical contrasts are made more apparent by the distributional relationship between words, phrases and utterances in the ambient input. This principle may also apply to the development of non-tone languages. For example, Frota et al. (2014) demonstrated that both 5–6-month-olds and 8–9-month-olds learning European Portuguese (EP) could discriminate the intonational patterns associated with the declaratives (HL<sup>∗</sup> L%) vs. yes-no questions (HL<sup>∗</sup> LH%) of the language. In Soderstrom et al. (2011), however, infants between 4 and 24 months failed to classify the declarative vs. yes-no question patterns in English, albeit showing a preference for yes-no questions. One likely explanation for these different results is that the stimuli in Frota et al. (2014) were single-word utterances consisting of disyllabic words whereas those in Soderstrom et al. (2011) were multi-word utterances with the critical prosodic differences marked mostly at the end of the utterance.
Furthermore, there is an indication that the proportion of single-word intonational phrases in infant-directed speech is much higher in EP than in English (Frota et al., 2014). For these reasons, the intonational contrast between declaratives and yes-no questions may be more tractable in EP than in English. An analysis of infant-directed Japanese may reveal that Japanese does not lean heavily toward single-word prosodic phrases as EP does, thus showing less scaffolding for the learner in this respect.

There are other factors that could pose a challenge to acquiring a lexical pitch accent system, especially in contrast to a lexical tone system. First, pitch has a lower functional load in pitch accent languages than in many tone languages. Because a lexical pitch accent system typically has only one type of lexically significant pitch pattern (e.g., a fall), which is also assigned only up to one syllable per word, it has far fewer minimal pairs that rely solely on pitch differences in comparison to tone languages. As such, the function that pitch plays in lexical contrasts may be less readily noticeable by the learner. Second, there may be a difference in perceptual salience between a pitch accent and lexical tones. Lexical tones are typically realized within a syllable, so the contour pattern is audible as a continuous sonorant unit. In contrast, single syllable realization of a lexical pitch accent can be limited to certain types of syllables (e.g., those that contain a long vowel or a sonorant coda in Japanese), and the contour of a pitch accent is otherwise interrupted by a syllable boundary. It is possible that learners find it more difficult to perceive pitch movements that are phonetically discontinuous. There may also be acoustic differences when similar pitch contours are compared between tone languages and lexical pitch languages. While the mean onset-to-offset F0 movements in our rising (232–388 Hz) and falling (375–184 Hz) pitch items are fairly comparable with, for example, Singh et al.'s (2014) Mandarin stimuli for rising/Tone 2 (221–346 Hz) and falling/Tone 4 (324–206 Hz), the F0 movements in the phrase-initial rise may be less pronounced in naturalistic infant-directed Japanese (Ota, 2003).
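One common way to compare F0 movements across stimulus sets is to express each onset-to-offset excursion in semitones. The following Python sketch applies the standard conversion, 12 · log2(f_end / f_start), to the mean values quoted above. This is an illustrative calculation rather than one reported by the authors, and the dictionary labels are hypothetical.

```python
import math


def semitones(f_start_hz, f_end_hz):
    """Signed pitch excursion in semitones (positive = rise, negative = fall)."""
    return 12 * math.log2(f_end_hz / f_start_hz)


# Mean onset and offset F0 values (Hz) as quoted in the text.
contours = {
    "Japanese rising (this study)": (232, 388),
    "Japanese falling (this study)": (375, 184),
    "Mandarin Tone 2, rising (Singh et al., 2014)": (221, 346),
    "Mandarin Tone 4, falling (Singh et al., 2014)": (324, 206),
}

for label, (start, end) in contours.items():
    print(f"{label}: {semitones(start, end):+.1f} semitones")
```

A semitone scale is logarithmic, so it makes excursions that start from different baseline frequencies easier to compare than raw differences in Hz.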

Our study examined only part of the knowledge 17-month-olds may have of the pitch accent system in Japanese. All the target words investigated here either had no lexical accent or an accent on the final syllable. Future studies should include testing of infants' response to initially accented words mispronounced with a rising contour as opposed to the correct falling contour. We predict that 17-month-olds should display stronger sensitivity to this mismatch because initially accented words are consistently marked by a falling contour (cf. **Figures 2b,e,h**), making any deviation from the pattern straightforwardly anomalous. An equally important issue that has been left unexplored here is how the non-lexical (i.e., intonational) component of pitch patterns is acquired. This can be decomposed into two issues. First, infants must learn that pitch changes caused by non-lexical factors, such as phrasal boundaries, do not have lexical consequences. This question can be addressed by testing, for example, whether Japanese-learning infants recognize words with no or non-initial accent in phrase-initial as well as non-initial position (cf. **Figures 2g–i**), where the rising contour disappears. Second, infants must also learn that certain pitch patterns are required by sentence structure or meaning, rather than words. This can be examined by testing whether infants detect anomalies in utterances that lack a phrase-initial rise when one is expected (e.g., **Figures 2d,f–i**). If lexical encoding of invariable pitch patterns plays an important role in the initial phase of pitch development, we expect such intricacies of non-lexical pitch phonology to be acquired only after some amount of lexical information has accumulated in the learner, for it is only when the contribution of word-level prosody is understood that many aspects of intonational phonology become evident.
In this regard, it is interesting to note that there is a consensus emerging from research on early speech production in non-tone languages, including Catalan, Dutch, English, and Spanish, that the timing of intonational development is linked not to sentence length but to lexical knowledge (Chen and Fikkert, 2007; DePaolis et al., 2008; Prieto et al., 2012).

To summarize, 17-month-old Japanese infants have internalized some lexically relevant pitch information of familiar words, but the information does not withstand the pressure to segment-match a pitch-mismatched word. On the one hand, this means that by this age infants can extract lexically relevant pitch patterns in the face of variability introduced by non-lexical (intonational) factors. On the other hand, however, lexical knowledge of pitch contrast in 17-month-old Japanese infants does not appear to be on a par with that found in similar-aged Mandarin infants, at least where comparable pitch contour differences (i.e., rising vs. falling) are concerned. Further research will shed light on whether such differences reflect the developmental complexities involved in decoupling lexical and intonational features in pitch phonology. In this respect, examination of the development of pitch accent languages offers insights that complement those emerging from relatively well-researched systems such as lexical tone languages and non-tone languages. The current study constitutes a step toward a more comprehensive understanding of how non-segmental lexical contrasts develop during infancy.

### ETHICS STATEMENT


This study was carried out in accordance with the recommendations of the British Psychological Society with written informed consent from all subjects. The protocol was approved by the School of Philosophy, Psychology and Language Sciences, University of Edinburgh.

### AUTHOR CONTRIBUTIONS

MO designed the study, analyzed the data, and drafted the manuscript. NY prepared the experimental materials and set-up, collected and analyzed the data, and co-wrote the manuscript. RM supervised the project and co-wrote the manuscript.

### FUNDING

This study was supported by AHRC Research Leave scheme AH/E000320/1 awarded to MO, and JSPS Grant-in-Aid for Scientific Research S 16H06319 and MEXT Grant-in-Aid for Scientific Research on Innovative Areas #4903 (Co-creative Language Evolution) 17H06382 awarded to RM.

### REFERENCES

from emerging intonation in Catalan and Spanish. J. Child Lang. 39, 221–257. doi: 10.1017/S030500091100002X


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ota, Yamane and Mazuka. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# What Can Lexical Tone Training Studies in Adults Tell Us about Tone Processing in Children?

Mark Antoniou\* and Jessica L. L. Chin

The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Sydney, NSW, Australia

A growing number of studies on the acquisition of lexical tone by adult learners have revealed that factors such as language background, musical experience, cognitive abilities, and neuroanatomy all play a role in determining tone learning success. On the basis of these findings, it has been argued that the effectiveness of tone learning in adulthood depends on individual differences in these factors. However, it is not clear whether similar individual differences play an analogous role in tone learning in childhood. Indeed, relatively few studies have made comparisons between how adults and children learn lexical tones. Here, we review recent developments for tone learning in both adults and children. The review covers tone training in a range of contexts, including in naive listeners, in native speakers of other tone languages, in listeners with varying levels of musical experience, and in individuals with speech and hearing disorders. Finally, we discuss the parallels between adult and child tone learning, and provide recommendations concerning how findings in adult tone training can provide insights into tone learning for children by accommodating the needs of individual learners.

Keywords: lexical tone, training, acquisition, individual differences, adults, children

#### Edited by:

Jessica Hay, University of Tennessee, Knoxville, United States

#### Reviewed by:

René Kager, Utrecht University, Netherlands

Tianlin Wang, University of Wisconsin-Madison, United States

\*Correspondence: Mark Antoniou m.antoniou@westernsydney.edu.au

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 31 August 2017 Accepted: 03 January 2018 Published: 23 January 2018

#### Citation:

Antoniou M and Chin JLL (2018) What Can Lexical Tone Training Studies in Adults Tell Us about Tone Processing in Children? Front. Psychol. 9:1. doi: 10.3389/fpsyg.2018.00001

### PERCEPTION OF TONES

In recent years, researchers have developed a sophisticated understanding of lexical tone acquisition in adults. Long-term experiences such as language and music exert a persistent influence that shapes the perception of lexical tone, and has implications for subsequent training and acquisition of non-native tone contrasts. To date, the vast majority of this work has been conducted on adults. In this review, we summarize the adult tone research literature and highlight several of the emerging themes that may guide future research and subsequently elucidate our understanding of tone processing in children.

Many of the world's languages use pitch patterns called lexical tones as a contrastive feature. These tone languages, such as Mandarin, Thai, and Cantonese, use lexical tones to differentiate the meanings of words. For example, the Mandarin syllable /ma/ can mean 'mother,' 'hemp,' 'horse,' or 'scold' depending on which of the four Mandarin tones is used. Similar pitch variations are not lexically meaningful in non-tone languages such as English. Such language differences have been shown to have a profound effect on the processing of lexical tones. For instance, native Mandarin Chinese listeners show an advantage when identifying tones (Gottfried and Suiter, 1997), and also show evidence of strong categorical perception of Mandarin Chinese tones, whereas English listeners do not (Xu et al., 2006; Wu and Lin, 2008). Taiwan Mandarin listeners perceive tones quasi-categorically but French listeners perceive tones psychophysically rather than as contrastive linguistic categories (Hallé et al., 2004). Native listeners of Mandarin, Cantonese, and German show comparable boundaries for rising and falling tone continua, but the Mandarin and Cantonese speakers were more categorical in their discrimination than the German listeners, who were more psychophysical (Peng et al., 2010). These native language advantages have also been observed in neuroscientific investigations of tone processing (Gandour et al., 2000). Native tone listeners show an advantage in cortical processing (Wong et al., 2004; Chandrasekaran et al., 2007, 2009b), more faithful and robust subcortical encoding of tone (Krishnan et al., 2005), and also potentially left-hemisphere specialization (Wang et al., 2004). Tone languages also vary considerably in the size and composition of their tone inventories, and this has consequences for the perception of non-native tones.

One possible explanation for how language background shapes tone processing is that tone and non-tone language speakers rely on different acoustic cues when discerning lexical tones. Specifically, language experience has been shown to shape perception of pitch such that listeners attend to pitch information that is meaningful in their native language (Braun and Johnson, 2011), thereby affecting perception of pitch in non-native languages (Schaefer and Darcy, 2014). For native Mandarin speakers, the primary cue for tone contrasts is the fundamental frequency (F0) contour (Xu, 1997; Liu and Samuel, 2004); conversely, native English speakers appear to rely on absolute height when differentiating Mandarin tone contrasts (Wang et al., 2003a). Further, Chinese listeners integrate consonantal and tonal information, whereas English listeners perceive tones and consonants as dimensions that may be separated (Lin and Francis, 2014).

Evidence is mixed concerning whether tone language speakers have an advantage when perceiving tones in a non-native language. On the one hand, numerous studies have shown that prior tone language experience improves subsequent perception of non-native tones. For example, Cantonese listeners outperformed Mandarin and English listeners on Cantonese tones, and Mandarin listeners outperformed Cantonese listeners who in turn outperformed English listeners on Mandarin tones (Lee et al., 1996; Schaefer and Darcy, 2014). Further, native Mandarin speakers identified Mandarin tones more accurately than non-native speakers of varying Mandarin experience (ranging from 1 to 4 years) and this pattern remained the same under talker variability or increased noise (Lee et al., 2010). Moreover, as experience with a tone language increases, so does the ability to correctly perceive contextual variations that affect tone identity and fine-grained acoustic differences between certain tone contrasts. This is important because tone identification critically depends on the preceding context (Moore and Jongman, 1997), and the ability to discriminate acoustically similar tones as well as the complex phonological relationship between them (Hao, 2012). Inexperienced listeners tend to assimilate second language (L2) tones to native language (L1) tones with the most similar acoustic properties (i.e., F0 height and contour) whereas experienced listeners are also sensitive to higher order phonological tone changes (Wu et al., 2014). Native tone language experience also facilitates perception of non-native tones when spoken by multiple talkers (Chang et al., 2017).

However, there is also evidence that prior tone language experience may interfere with non-native tone perception. For instance, Cantonese listeners were constrained by their native phonology (e.g., the phonemic status and the F0 patterns of Cantonese tones) on a task using Mandarin tones, but similar constraints were not observed in Japanese or English listeners, suggesting that the native phonological system may interfere when perceiving non-native tones (So and Best, 2010). Similarly, So (2005) found that native Cantonese speakers encountered more difficulty than native Japanese speakers when distinguishing between Mandarin tones 1 and 4, as well as tones 2 and 3. Tone language experience may also interfere with non-speech tone processing. Mandarin listeners were hindered in their perception of flat and falling pitch contours of non-speech stimuli and misidentified these stimuli more often than did English listeners (Bent et al., 2006). In summary, language experience shapes perception of native, non-native and non-speech tones. Therefore, differing language experiences are likely to have a profound effect on subsequent tone learning in a foreign language. Bearing this in mind, let us next turn to training studies that have attempted to teach learners with varying levels of tone language experience to discern unfamiliar, novel tones.

### TONE TRAINING IN NAIVE LISTENERS

It is well-established that the perception of non-native lexical tone contrasts is difficult for adult L2 learners (Burnham and Francis, 1997; Wang et al., 1999, 2003a; Wayland and Guion, 2004), particularly for those whose L1 does not make use of pitch height and movement to signal changes in word meaning. Nevertheless, speech training can improve tone identification accuracy in such listeners who possess no prior tone language experience (Wang et al., 1999). Neuroscientific studies have confirmed that tone training paradigms result in reliable changes in the learners' brains (Wang et al., 2003b). For example, successful versus unsuccessful learners of a tone speech training program showed different patterns of brain activation following training (Wong et al., 2007a). In a similar training study, individuals who were better at learning non-native tones showed larger repetition suppression in the left inferior frontal gyrus (Asaridou et al., 2016). Brain plasticity changes have also been observed in the auditory brainstem following short term tone training (Song et al., 2008). Subsequent studies have examined the effectiveness of different training types, and the factors that are likely to contribute to successful learning outcomes, including the learnability of tones, more generally.

Tone training has also been investigated in studies that have examined naive learners' abilities to track statistical regularities in the environment. Distributional learning experiments manipulate these statistical regularities to induce learning of one or two categories by presenting unimodal or bimodal stimulus distributions, respectively. In one such study, Australian English listeners were trained on a Thai tone contrast, and those who were exposed to a bimodal distribution learned better than those exposed to a unimodal distribution, but this bimodal advantage only emerged when the task required that they attend to the stimuli (i.e., the bimodal advantage was not observed for passive learning), suggesting that auditory attention is necessary for tracking statistical regularities (Ong et al., 2015). In a subsequent study comparing Mandarin native listeners to Australian English musicians, the Mandarin natives showed distributional learning as above, but Australian English musicians benefitted from both bimodal and unimodal exposure (Ong et al., 2017).
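The logic of a distributional learning manipulation can be sketched as follows. This is a minimal illustration under stated assumptions: the eight-step continuum, the presentation counts, and the function name `build_exposure` are hypothetical and are not taken from the studies cited above. The point is only that both groups hear the same continuum steps the same total number of times, and differ in which steps are presented most often.

```python
import random
from collections import Counter

# Hypothetical presentation counts for an 8-step tone continuum (steps 1..8).
# Both lists sum to the same total, so overall exposure is matched.
BIMODAL = [1, 4, 2, 1, 1, 2, 4, 1]   # two peaks, near the continuum endpoints
UNIMODAL = [1, 1, 2, 4, 4, 2, 1, 1]  # a single peak in the middle


def build_exposure(counts, seed=0):
    """Return a shuffled list of continuum steps following the given counts."""
    steps = [step for step, c in enumerate(counts, start=1) for _ in range(c)]
    rng = random.Random(seed)  # seeded so the sequence is reproducible
    rng.shuffle(steps)
    return steps


bimodal_trials = build_exposure(BIMODAL)
unimodal_trials = build_exposure(UNIMODAL)

# Same number of trials in each condition; only the distribution shape differs.
print(len(bimodal_trials), len(unimodal_trials))
print(sorted(Counter(bimodal_trials).items()))
print(sorted(Counter(unimodal_trials).items()))
```

Under this kind of design, better post-exposure discrimination of the endpoint tokens in the bimodal group than in the unimodal group is taken as evidence that listeners tracked the shape of the distribution.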

Tone word learning studies require participants to accurately perceive lexical tones in order to differentiate newly learned words, often containing the same segments. Although it has been suggested that listeners may be more aware of phonological segments than tones (Burnham et al., 2011), a word learning study by Antoniou and Wong (2016) found that English listeners were more successful at learning to map non-native tone contrasts to meaning than a non-native prevoiced-unaspirated voicing contrast, suggesting that tone contrasts may be easier to acquire than some consonantal distinctions (possibly because the learners' native language interfered with the learning of a non-native voicing contrast). In another study, non-tone language speakers were able to learn a vocabulary of non-native tone words, although considerable individual differences were observed, and tone word learning performance correlated with pretraining pitch perception ability and music experience (Wong and Perrachione, 2007). In a novel word learning experiment, Mandarin–English bilinguals were better than English monolinguals at using tone to identify novel words, and their performance correlated with degree of Mandarin dominance (Quam and Creel, 2012). Other word learning studies have investigated whether different elements of training paradigm design can boost tone learning. For English listeners, it has been suggested that orthography (e.g., tone marks) leads to better learning outcomes (Showalter and Hayes-Harb, 2013). Additionally, instructing native English speakers to focus on pitch direction, rather than pitch height, improves performance on a tone categorization task (Chandrasekaran et al., 2016).

A growing number of studies are taking into account individual differences among learners by examining pretraining abilities (i.e., sensitivity to tones), and memory availability, and examining how these interact with training paradigm designs such as those mentioned above. In a study involving native English speakers, pitch identification ability was a better predictor of performance on a Mandarin word learning task than musicality, language aptitude, or general cognitive ability and predicted generalization to new talkers (Bowles et al., 2016). High stimulus variability improved learning for learners with strong pretraining abilities but hindered the performance of low-aptitude individuals (Perrachione et al., 2011; Sadakata and McQueen, 2014). This suggests that learners with differing pretraining abilities will likely benefit from tailored training approaches that take these individual differences into account rather than one-size-fits-all approaches. Consistent with this idea, in a comparison of learners with high and low pretraining pitch sensitivity, learners with low pretraining pitch sensitivity showed the greatest improvements when lexical pitch pattern training preceded lexical training (Ingvalson et al., 2013).

In a study examining older adults, learning performance was best predicted by declarative memory capacity rather than baseline sensitivity for pitch patterns or working memory capacity (Ingvalson et al., 2017). This finding suggests that older adults may benefit from non-native speech training paradigms that have been tailored to the needs of individual learners, and the variables that predict performance differ across the lifespan (i.e., pitch pattern sensitivity in young adulthood vs. declarative memory capacity in older adulthood). Given that training older adults demonstrably relies on different predictors of tone learning performance than young adults (likely due to the effects of age-related cognitive decline in older adulthood), training children will also likely require a different set of predictors because it is during the course of childhood that the foundations of cognitive abilities are established. The crucial point is that training studies involving children should also measure pretraining abilities with a view to tailoring training to maximize learning outcomes.

This summary of the field on tone training in naive learners suggests that those with better pretraining abilities will benefit most from training. Who is a good versus poor learner appears to be dependent on perceptual and neurophysiological differences. It is also likely that differences can be accounted for by differences in cognitive ability (MacDonald, 2008; Majerus et al., 2008) or variations in language background (Iverson and Evans, 2009), although these studies have not examined tone training. Given the importance of learners' pretraining abilities, training paradigms must take into account individual differences among learners in order to produce the best training outcomes. Further, the contribution of such factors is likely to vary across the lifespan. A fruitful avenue for future research would be to examine the contribution of such individual differences in children, and how these contributions vary as the child develops.

### NON-TONAL L1 SPEAKERS LEARNING A TONAL L2

We have thus far reviewed tone training in naive listeners with no tone language experience. Other studies have examined how varying degrees of L2 experience with a tone language affect processing and learning of lexical tones. First, let us look at studies of non-tonal L1 speakers who are actively acquiring a tone language as their L2.

Non-tonal L1 speakers who learn a tonal L2 provide a fascinating opportunity to examine the flexibility of the human perceptual system to attend to acoustic cues that serve a critical linguistic function in only one of their languages. A tone language learner may rely on either F0 height or direction when perceiving pitch, depending on their language background. L1 English–L2 Mandarin speakers, who rely on F0 height in their L1 and F0 direction in their L2, were exposed to four of the six Cantonese tones, with English monolinguals and L1 Mandarin speakers serving as the control groups. L2 learners, as well as the Mandarin controls, were significantly better at discriminating the contour–level tone pairs than the level–level tone pairs. This suggests that L2 experience increased L2 learners' sensitivity to F0 direction in the perception of unfamiliar tones (Qin and Jongman, 2016). It should be noted that the Mandarin controls performed significantly worse than the other groups when perceiving level–level tones, suggesting that L1 tone experience did not provide a perceptual advantage for all tone pairs. Similarly, Mandarin speakers outperformed L1 English–L2 Mandarin speakers who in turn outperformed English monolinguals in non-linguistic and Mandarin discrimination tasks and a pitch-shift task. Further, discrimination of musical tones and discrimination of Mandarin tones were correlated (Ning et al., 2014). Japanese-speaking Mandarin learners of elementary and intermediate proficiency levels were exposed to utterances in quiet and noisy listening conditions. When recognizing L2 Mandarin speech, the Japanese-speaking Mandarin learners were affected by their L2 proficiency, the semantic context, F0 contours, and noise (Zhang et al., 2016). In another study, students in an introductory Chinese language course were trained to identify tones via three different training types. Those who received training with visual pitch contours and pinyin performed better than students trained with visual contours only or with tone numbers and pinyin (Liu et al., 2011). Wang (2013) found that introductory (first semester) learners of Mandarin identified tone 3 most accurately regardless of L1 background (Hmong, Japanese, American English). Hmong speakers perceived Mandarin tones less accurately than the Japanese and English groups, experiencing the most difficulty in perceiving tone 1, but significantly improved in accuracy after training. Japanese speakers did not benefit from pitch accent in their L1, as they performed similarly to the English speakers in accuracy. By the end of training, all groups improved in accuracy, and Hmong and Japanese speakers were neither advantaged nor disadvantaged by their L1 prosodic backgrounds.

Several studies have observed that L1 intonation patterns exert an effect on the perception of non-native tones. For instance, when presented with Mandarin tone continua, Taiwanese listeners were more sensitive to between-category differences than within-category, whereas French listeners were equally sensitive to differences across continuum steps (Hallé et al., 2004). In another study, Mandarin and English native listeners showed different error patterns when perceiving Cantonese tones; the English listeners identified the high rising tone accurately, perhaps because it was perceived as similar to English question intonation, but poorly identified low rising and low falling tones because they did not map onto any native intonation pattern (Francis et al., 2008). The implication of these studies is that non-tone language speakers can map non-native tones onto the intonational contours used in their native language, and this may in turn influence non-native tone processing.

Neural investigations have shown that reliable brain changes follow a semester of Mandarin learning in college (Wang et al., 2003b). In this functional neuroimaging study, American English speakers studying beginner-level Mandarin completed eight sessions of tone training. Locations of activation in the cortex remained the same in pre- and post-training scans, including the left medial frontal gyrus, and bilaterally in the inferior frontal, middle temporal, and superior temporal gyri. Enrichment plasticity was observed in the early stages of L2 learning, shown by the expansion of cortical regions and recruitment of additional cortical areas specialized toward similar language functions, namely within the left superior temporal gyrus and the right inferior frontal gyrus. In sum, these studies demonstrate that even relatively short-term tone language learning (e.g., over a semester) leads to reliable learning advantages in acquiring novel tones, and also results in reliable learning-related brain changes.

### TONAL L1 SPEAKERS LEARNING A TONAL L2

Other training studies have investigated whether tone language speakers possess an advantage when it comes to learning the tones of an unfamiliar tone language. There is indeed evidence that sensitivity to tones in one language may boost perception of tones in another language (Wayland and Guion, 2004), such that knowledge of a tone language (e.g., Mandarin Chinese) may improve learning of tones in another (e.g., Thai) relative to controls that lack tone language experience (Wayland and Li, 2008). There is also evidence that speakers of tone languages exhibit superior performance in pitch-recognition tasks (Caldwell-Harris et al., 2015). The explanation for such advantages may lie in the native tone language speaker's ability to attend to the critical acoustic cues that differentiate lexical tones, even in non-native tone languages. For instance, native speakers of Mandarin Chinese show greater perceptual sensitivity to pitch contour differences later in the signal, while English speakers are more sensitive to earlier pitch differences (Kaan et al., 2007). This is consistent with neurophysiological studies that have shown that brainstem mechanisms for pitch encoding, as reflected in pitch-tracking accuracy and pitch strength, are more sensitive in tone (Chinese, Thai) than non-tone (English) language speakers (Krishnan et al., 2010). These differences in brainstem encoding give tone language speakers an advantage in perceiving minute changes in pitch, and may ultimately bear on tone learning outcomes.

Furthermore, native tone language speakers are capable of learning new tone contours in adulthood. Studies examining tone learning in adults have confirmed that language background affects both attentive and non-attentive processing of tone contrasts, but processing of pitch is malleable even in adulthood (Kaan et al., 2008; Chandrasekaran et al., 2009a). Additionally, forming new speech categories that depend on unfamiliar perceptual dimensions (such as lexical tone for non-tone language speakers) results in stronger gamma activity and more coherent alpha-band activity than forming new categories using familiar dimensions (Kaan et al., 2013).

The above evidence suggests that tone language experience brings subsequent tone learning advantages and that the adult brain is capable of learning novel tone contrasts from a foreign tone language. Whether similar advantages arising from prior tone language experience occur in children remains to be seen.

### TONE LEARNING IN INDIVIDUALS WITH MUSICAL EXPERIENCE

Both music and lexical tone place great importance on pitch, and thus a growing number of studies have investigated whether musical expertise results in tone language processing (and ultimately learning) advantages. Experience-dependent bidirectional transfer effects have been observed between speech and music (Chang et al., 2016). On the one hand, musicians show advantages in cortical (Schön et al., 2004; Marie et al., 2011, 2012) and subcortical (Wong et al., 2007b) processing of pitch. On the other hand, tone language speakers show enhanced musical pitch processing (Chandrasekaran et al., 2009a; Bidelman et al., 2013) and more robust brainstem encoding of musical pitch (Bidelman et al., 2011). Musical training has been shown to facilitate lexical tone identification, but the degree of facilitation is modulated by the tone in question and the type of acoustic input (Lee and Hung, 2008).

In terms of training, musical or tone language experience is associated with significantly better learning of non-native tone-based words, as compared to individuals with neither musical training nor tone language experience (Cooper and Wang, 2012). Further, the combination of tone language and musical experience did not result in an additive advantage for Thai musicians above and beyond either experience alone. In a separate study, musicians who completed a 2-day training protocol identified pitch contours more accurately than non-musicians, although their pitch contour abstraction ability (to other stimuli) was similar to that of non-musicians (Wayland et al., 2010).

Therefore, musical experience improves pitch encoding and leads to some lexical tone training advantages, although this is modulated by other factors. Similar effects would presumably emerge in children with music experience, although this warrants systematic investigation.

### TONE TRAINING IN INDIVIDUALS WITH SPEECH AND HEARING DISORDERS

Tone training has also been investigated in individuals with speech and hearing disorders (e.g., amusia) and hearing impairments (e.g., cochlear implant recipients). A small but growing body of research has examined the congenital disorder amusia (or tone-deafness) that impairs the ability to perceive pitch in language and music (Peretz, 2001; Ayotte et al., 2002). The general finding from this research literature has been that amusic speakers of non-tone languages consistently perform worse than non-amusic controls when exposed to lexical and non-speech tones. French amusics were poorer than controls at discriminating Mandarin lexical tones, although there was considerable overlap in performance (Nguyen et al., 2009). In another study, French amusics experienced greater difficulty discriminating lexical tones taken from Mandarin or Thai words, and acoustic analyses revealed that amusics relied on cues such as sound duration and intensity to compensate for their pitch perception deficit (Tillmann et al., 2011). British English amusics showed impaired discrimination, identification, and imitation of statements and questions that differed in pitch in the final word (Liu et al., 2010). Further, those amusics with smaller pitch thresholds tended to perceive intonation more accurately. English-speaking amusics were poorer at processing speech sounds (phonological and phonemic awareness), indicating deficits in sound processing that are not restricted to the domain of music (Jones et al., 2009).

Amusics who are native speakers of a tone language are also impaired in their ability to discriminate and identify lexical and non-speech tones. The majority of these studies focus on amusic Mandarin speakers. In one study, Mandarin amusics experienced difficulty identifying and discriminating Mandarin tones, with some participants exhibiting signs of lexical tone agnosia, that is, an inability to distinguish lexical tones (Nan et al., 2010). Interestingly, no analogous deficits were observed in Mandarin tone production, implying that congenital amusia primarily impairs the perception of pitch. Mandarin amusics have also shown poorer performance than controls for tasks relying on pitch sensitivity, but are not impaired when completing tasks involving multiple acoustic cues (Liu et al., 2012a). They have greater difficulty recognizing the pitch direction of discrete tones, rather than gliding tones, a pattern observed for both speech and non-speech stimuli (Liu et al., 2012b). Furthermore, Mandarin amusics were impaired in their perception of both lexical and non-speech intonation patterns (Jiang et al., 2010). Mandarin amusics who had undergone pitch sensitivity training had improved tone identification thresholds for both speech and music, matching the performance of non-amusic controls (Liu et al., 2017).

Cochlear implant recipients are also impaired in their ability to perceive pitch, and this has serious implications for speakers of tone languages. For example, Mandarin-native cochlear implant users scored between 30 and 50% on Mandarin tone recognition tests (Wei et al., 2004). To address this, training regimens have been developed that aim to improve cochlear implant users' perception of lexical and non-speech tones. Training methods range from training with musical instruments (Yucel et al., 2009) to various computer-assisted training software programs such as Computer-Assisted Speech Training (Fu and Galvin, 2007) and the Melodic Contour Training Program (Lo et al., 2015). Music training has been shown to improve accuracy in musical perception, and also has implications for improving speech (e.g., pitch) processing (Gfeller et al., 2015). Computer-assisted training software offers auditory training for adult cochlear implant recipients, and has been shown to be effective in improving cochlear implant recipients' speech and music perception (Fu and Galvin, 2007). Melodic contour training has also been effective in improving cochlear implant users' perception of question/statement intonation and consonants in quiet environments (Lo et al., 2015). In a study of postlingually deafened adult cochlear implant recipients who underwent 20 h of auditory computer-assisted training over 4 weeks, 6 of the 7 subjects improved in speech recognition performance using electronic-only stimulation and electronic and acoustic stimulation. However, improvements were not observed in those who underwent acoustic-only stimulation (Zhang et al., 2012).

These studies suggest that even individuals who have difficulty perceiving pitch (i.e., both amusics and cochlear implant users) may improve their perceptual accuracy following science-based training interventions. Although more research is needed, this body of work provides hope to many affected individuals faced with the challenge of acquiring a tone language under challenging conditions, and paves the way for future interventions involving children with similar conditions.

### TONE ACQUISITION IN CHILDREN

The sections of the review thus far have covered studies that have examined tone learning in adults. The remainder of the article will be devoted to covering the work that has examined these abilities in children, and proposing how the field may be advanced by addressing the research questions that have been raised in the adult tone learning literature.

Using discrimination tasks adapted for infants, researchers have begun to understand the timeline of the developmental processes that underpin language development, including the development of sensitivity to lexical tones (Nazzi et al., 1998; Cheng et al., 2013). During the first year of life, infants attune to the elements of their native language and discrimination of non-native language elements deteriorates, in a process termed perceptual reorganization. It is now clear that language-specific speech perception follows a more complex developmental schedule than had been previously thought. Infants first attune to native lexical stress and tone patterns by 5 months of age, then vowels at 6–8 months, consonants at 9–12 months, and phoneme duration at 18 months (Yeung et al., 2013). Studies on tone acquisition in infants have uncovered the developmental window during which perceptual reorganization for lexical tones occurs. In a series of studies, Mattock and colleagues demonstrated that Chinese infants are able to discriminate both Thai lexical tones and non-speech (violin) tones equally well at both 6 and 9 months of age. In contrast, English infants lose their ability to discriminate Thai rising versus falling and rising versus low level lexical tones between 6 and 9 months, but their ability to discriminate non-speech tones is unaffected, suggesting that lexical tone perception in the first year of life is critically dependent on the native language environment (Mattock and Burnham, 2006; Mattock et al., 2008). Additionally, infants develop sensitivity to their native tone distinctions in an asymmetric fashion (Harrison, 2000).

Given that infants' sensitivity to non-native lexical tones diminishes over the first year of life, this raises questions regarding the specific nature of the resulting perceptual constraint and how it interacts with language experience. In a distributional learning study, 5-, 11-, and 14-month-old Dutch infants were familiarized with a unimodal or bimodal distribution of high-level versus high-falling Mandarin tones; the 5-month-olds discriminated both, the 11-month-olds discriminated the bimodal distribution, but the 14-month-olds were not able to discriminate either (Liu and Kager, 2017b). Subsequent studies have demonstrated that although infants' ability to discriminate Mandarin high-level versus high-falling tone contrasts diminishes by 9 months of age, their sensitivity rebounds by 18 months (Liu and Kager, 2014), and perhaps even sooner in the case of bilinguals (Liu and Kager, 2017a).

Indeed, studies involving bilingual infants supplement findings from monolinguals and demonstrate that early experience with multiple languages may improve perceptual flexibility. In one such study, 7.5-month-old Mandarin–English bilingual infants were able to recognize English words that were matched in pitch and Mandarin words matched in tone. 9-month-olds recognized English words that were mismatched in pitch or Mandarin words that contrasted in tone. By 11 months, however, infants had learned to correctly recognize English words that were pitch-matched and -mismatched, but only recognized tonal matches in Mandarin (Singh and Foong, 2012). Interestingly, a perceptual shift has been observed in Mandarin–English bilingual children such that 2.5–3.5-year-old toddlers' word recognition abilities are more affected by deviations in lexical tones than in segments, but 4–5-year-old preschoolers are more affected by deviations in segments than in tones (Singh et al., 2015). This observation is consistent with evidence that when 2.5–3.5-year-old Mandarin toddlers and 4–5-year-old preschoolers were presented with Mandarin words where intonation (question/statement) or tone (rising/falling) were manipulated, toddlers made errors due to intonation, whereas preschoolers recognized tone words regardless of intonation (Singh and Chee, 2016).

Further changes continue to emerge as a result of long-term experiences as the child enters school age and progresses toward adolescence and adulthood. In Cantonese children, tone recognition improves from ages 4 to 6, and from ages 6 to 10, at which point children perform as accurately as adults (Ciocca and Lui, 2003). Mandarin children tend to produce tones less accurately than adults, based on the ratings of native judges (Wong et al., 2005). Further, children perceived the level, rising, and falling tones accurately, but struggled to perceive and produce the dipping tone (tone 3). Children identified Mandarin tone 4 least accurately and tone 1 consistently, and they mostly confused Mandarin tones 2 and 3, followed by tones 1 and 4 (Li et al., 2017). The composition of the native tone inventory also shapes children's perception of non-native tones. For instance, Cantonese has a six-tone system, whereas Mandarin only has four tones, and it was observed that Cantonese-speaking first graders performed better in tone awareness tasks than their Mandarin-speaking counterparts (Chen et al., 2004). Moreover, long-term experience with music predicts better tone identification in Italian-speaking children between the ages of 6 and 8, but it does not predict phonological identification (Delogu et al., 2010).

In sum, the available evidence suggests that similar long-term experiential effects may be observed in children as in adults. Native language background, complexity of the native tone inventory, and prior music experience all contribute to tone processing in children. It is not yet clear, however, how such experiences interact with the emergence of tone processing abilities in children, and which factors take precedence at which timescales of development.

### TONE TRAINING IN CHILDREN

Tone training studies in young children have revealed that they are initially sensitive to acoustic differences between lexical tones, but those children from non-tone language backgrounds gradually learn to ignore such pitch differences as lexically relevant. Infant studies adapt word learning and familiarization tasks so that they are age-appropriate in their attentional and task demands, including how responses are measured and the use of a reduced number of trials. Singh et al. (2008) familiarized 7.5- and 9-month-old English infants with words and observed that while the 7.5-month-olds recognized words with matching pitch contours, the 9-month-olds treated words with mismatched pitch contours as equivalent. In another study, 14-month-old English infants learned labels for two novel objects that were differentiated by differing pitch contours, whereas 17- and 19-month-olds were unable to learn the picture-label pairings despite being able to differentiate the pitch contours in a separate task (Hay et al., 2015). This suggests that 14-month-olds are flexible learners when it comes to perceiving sounds that distinguish words, but for 17–19-month-olds with a non-tonal native language, lexical tone is no longer treated as relevant for differentiating words. In another study, English-speaking adults and 2-year-olds learned a new word that would later undergo either a phonemic or pitch change. Changes in vowel quality impaired word recognition, but changes in pitch contour did not, indicating that by the age of two, English-learning children disregard variations in pitch when recognizing words (Quam and Swingley, 2010). These studies suggest that experience with a non-tonal language constrains perceptual flexibility and integration of lexical tone into the learning of novel words. Interestingly, while this seems to be the case for monolingual infants, there is evidence that bilingual infants remain sensitive to non-native lexical tone differences longer than monolingual infants of the same age and are able to use non-native pitch contours to differentiate newly learned words even at 17–19 months but not at 22 months (Graf Estes and Hay, 2015).

Studies on bilingual infants suggest that bilingualism leads to certain perceptual advantages that may aid tone learning. At 12–13 months of age, Mandarin–English bilingual infants were able to use tone to differentiate newly learned words in Mandarin (but not English), whereas Mandarin monolingual infants were unable to learn the words until 17–18 months even though they were capable of discriminating the tones (Singh et al., 2016). At 18 months, bilingual children are predisposed to process tone as lexically relevant regardless of their native language, but at 24 months, only tone-language-speaking children continue to do so (Singh et al., 2014). These findings suggest that bilinguals remain perceptually flexible for longer than monolinguals (see Graf Estes and Hay, 2015; Hay et al., 2015) and may be sensitive to a wider variety of acoustic dimensions when learning label-object mappings to differentiate novel words at this age. Language-specific sensitivity continues to develop beyond toddlerhood. Mandarin–English bilingual 3 to 4- and 4 to 5-year-olds completed a word learning experiment and were presented with words that were matched or mismatched in tone and presented in English or Mandarin contexts. The 4–5-year-old preschoolers were able to process tone as lexically meaningful in Mandarin and disregard it in English, but the 3–4-year-olds could not (Singh and Quam, 2016).

Studies on children from tone language backgrounds have shown that while they are capable of learning novel tone categories, their perceptual performance continues to develop and improve throughout childhood. For instance, both 2- and 3-year-old monolingual Mandarin Chinese children struggled to recognize words in the presence of vowel and tone variation, but sensitivity to these features was age-dependent. Specifically, only 2-year-olds performed poorly in word recognition in response to tone variation, while 3-year-olds showed insensitivity to tone variation and were able to use tones to learn new words, although tone 3 words were most difficult to learn (Ma et al., 2017). Further, tone learning abilities continue to develop throughout childhood. 6-, 10-, 14-, and 19-year-olds completed computerized Mandarin training in six 40-min training sessions over a period of 2 weeks, and children at all ages showed significant improvement, but not controls (who played computer games for the same time) (Wang and Kuhl, 2003). In a study investigating the perceptual abilities that correlate best with language development (as indexed by a narrative story-retelling task), Cantonese school-aged children in grades 1–6 completed a series of AX discrimination tasks that assessed their temporal and pitch-based auditory abilities. Temporal abilities were measured using a music rhythm task in which pairs of melodies were presented and some trials contained a change in rhythm caused by having a musical note occur in a different location. Pitch abilities were measured using a pitch pattern perception task involving non-speech pitch contours, and a music scale task where some melodies contained a musical note that differed by four semitones.
Both temporal and pitch abilities correlated with language development, but only pitch abilities (i.e., performance on pitch pattern perception and music scale tasks) explained unique variance in narrative ability scores after accounting for age (Antoniou et al., 2015). This suggests that pitch abilities play a crucial role in linguistic development of tone-language-speaking children. Further, children in grades 5 and 6 did not match the level of performance of adults in terms of their temporal and pitch abilities, suggesting that sensitivity to these dimensions continues to develop into adolescence.

Other than studies that have looked at training using speech stimuli, there is also evidence that music training in childhood can boost pitch processing. In one such study, 8-year-old Portuguese children who completed music training for 6 months improved their reading ability and pitch discrimination in speech, but those who completed painting training for the same amount of time did not (Moreno et al., 2009). These findings are supported by the observation that 8-year-old children with several years of music experience showed an advantage in detecting subtle pitch deviations both in musical notes and lexical tones (Magne et al., 2006). These results reveal positive transfer effects from music training to speech processing in children, analogous to the long-term experiential effects of musicianship on speech processing observed in adult musicians.

These studies provide useful starting points, but many questions remain concerning the tone learning abilities of children. Very little attention has been paid to individual differences in child learners and thus it is not clear what makes some children more successful learners than others. Useful variables to consider include native language experience, pretraining tone sensitivity, prior music experience, working memory availability, and neurophysiological differences. By isolating the combination of factors that matter most for successful tone learning, it may ultimately be possible to tailor tone training proactively to the needs of individual child learners, including those with speech and communication disorders or hearing impairments, which we will now review.

### TONE PROCESSING IN CHILDREN WITH HEARING IMPAIRMENTS, SPEECH AND DEVELOPMENTAL DISORDERS

A small number of studies have examined children with speech and developmental disorders or hearing impairments. These studies offer some clues concerning how the effects of such conditions may be ameliorated by behavioral interventions. A number of these studies have examined tone language-learning children with cochlear implants. Children with cochlear implants performed significantly worse than normal hearing controls in Cantonese tone perception, and their tone perception developed in a pattern differing from normal hearing children (Lee et al., 2002). Additionally, children with profound hearing impairment tend to produce tones less accurately than those with mild hearing loss (Cheung et al., 2014). These observations are consistent with findings from the adult literature.

There is also some promising evidence that children with cochlear implants benefit from training interventions. Children with newly-fitted cochlear implants participated in a family-oriented music training program that consisted of pitch, note, and rhythm discrimination exercises on an electric keyboard. Children who underwent musical training had more interest in music and, after 24 months, showed greater development in all areas of music perception. However, only modest improvements were observed for speech perception in the musically-trained children relative to controls (Yucel et al., 2009). In another study, Mandarin-speaking children with cochlear implants completed melodic contour identification training for 10 weeks. Performance improved after 4 weeks of training, and no performance decline was observed at the 8-week follow-up (Fu et al., 2015). Computerized speech training delivered for half an hour per day, 5 days per week, for 10 weeks improved vowel, consonant, and tone recognition performance of hearing-impaired children and these benefits were maintained 2 months after the cessation of training (Wu et al., 2007). These studies suggest that although cochlear implants are limited in their capacity to effectively provide F0 information, training interventions can teach children who use cochlear implants to attend to other available acoustic cues (e.g., temporal envelope cues) and improve perception of lexical tones.

Recent studies have begun to shed light on how tone processing is influenced by other speech and developmental disorders that affect children. Work by Wong et al. (2009) has demonstrated that poor tone identification in children with specific language impairment is not only affected by vocabulary knowledge, but some children also had deficits in pitch processing and/or pitch categorization. Moreover, a study comparing Chinese children with dyslexia to chronological-age controls and reading-level controls found that children with dyslexia had a later developmental ceiling, and their lexical tone awareness distinguished them from typically developing children across the primary school years. A perceptual training intervention was employed with the goal of improving lexical tone awareness and character naming in dyslexic children. Only second-grade children improved in both aspects; in comparison, fourth-grade children showed improved performance in lexical tone awareness only (Wang et al., 2017). This suggests that children with dyslexia may benefit from perceptual training, but more research is needed to maximize any training-related outcomes. Furthermore, a study involving Mandarin children with autism demonstrated that they were able to comprehend and discriminate tones equivalently to their typically developing peers, but they made different errors when presented with the tone 2–3 contrast. Additionally, children with both autism and significant language problems treated nonce word stimuli like pure tone stimuli, thus showing unstable abstract representations of tones (Lu, 2016). These studies demonstrate that perception of lexical tones in one's native language may be affected by a variety of speech and developmental disorders. Encouragingly, there are early indications that perceptual training interventions may be effective in improving lexical tone perception in some of these populations, although further research is needed.

Although great strides have been made in developing a detailed understanding of how children attune to the native tone system, work on the disorders that affect lexical tone processing is still in its infancy. Encouragingly, there are already signs that children with speech and developmental disorders or hearing impairments benefit from training interventions designed to improve tone processing. Future work should strive to improve the efficacy of such interventions.

### CONCLUSION

In this review, we have summarized recent developments in the tone learning literature for both adults and children. The literature covered addresses tone training across a range of adult learners differing in their prior language experience from naive listeners, to L2 tone language learners, to native speakers of other tone languages, to those with varying levels of musical experience, as well as individuals with speech and hearing disorders. Below we relate each of these factors to the most relevant research conducted in infants and children.

Evidence from speech perception, word learning, and neurophysiological studies on adults has demonstrated that prior language experience exerts a profound and persistent influence on the processing of non-native lexical tones.

Native tone language experience may aid processing of similar tone contrasts, but may also in some cases interfere with the processing of non-native tones (e.g., when two non-native falling tones are perceptually assimilated to a single native falling tone category). Additionally, speakers of non-tone languages may benefit from similarities to native intonational patterns (Hallé et al., 2004). In general, fewer studies have been conducted on infants and children than adults, and the evidence is largely in congruence with the data on adults. Native language tone processing advantages have been observed in infants (Mattock and Burnham, 2006), toddlers (Singh and Chee, 2016), and children (Ciocca and Lui, 2003). Benefits have also been observed for non-native tone training (Wang and Kuhl, 2003), although even school-aged children do not match adults in their pitch perception abilities (Antoniou et al., 2015), suggesting that sensitivity to tones continues to develop throughout childhood and into adolescence. Studies on bilingual infants have advanced our understanding of the consequences of bilingualism by revealing that they are flexible perceivers, able to integrate non-native tones earlier and for longer than age-matched monolingual peers. Future work is needed to explore the conditions that give rise to such bilingual advantages in perceptual flexibility, including language pairings (L1 tone–L2 non-tone language vs. L1 tone–L2 tone language vs. L1 non-tone–L2 non-tone language), intonational patterns present in the child's known languages, and mapping between known languages and the target language. The paucity of neurophysiological studies on infants' processing of lexical tone also provides fertile ground for significant contributions to knowledge to be made.

Aside from native language experience, musicianship is another long-term experience that profoundly alters processing of pitch both in speech and non-speech stimuli. While research on the experiential effects of music on adults is growing rapidly, research into the effects of musical training in childhood on lexical tone processing is still in its infancy; however, the few studies that have been conducted have reported that musical training is beneficial for pitch processing (Magne et al., 2006; Moreno et al., 2009).

Several studies on adults present converging evidence that pretraining perceptual abilities account for word learning performance (Perrachione et al., 2011; Sadakata and McQueen, 2014). Further, pretraining perceptual abilities can be used to tailor training to maximize learning outcomes (Ingvalson et al., 2013). Research on infants and children has neglected such pretraining individual differences. Several studies have reported large degrees of variation in young children, but the underlying factors that account for this variation are not yet understood. Thus, there is still much to learn about individual differences in lexical tone processing, especially in individuals experiencing communicative difficulties.

Studies on speech and hearing disorders in adults have revealed that several conditions may lead to difficulties in lexical tone processing. Adults with congenital amusia are impaired in their ability to identify and discriminate lexical tones (Peretz, 2001; Ayotte et al., 2002). Similarly, adults with cochlear implants have difficulties correctly identifying lexical tones (Wei et al., 2004). Encouragingly, both conditions have been shown to respond to training interventions designed to improve pitch processing. There have not been many studies examining effects of speech and hearing disorders on lexical tone processing in children. Nevertheless, there are some positive indications that children with cochlear implants benefit from training to improve the accuracy with which they perceive pitch (Fu et al., 2015) and appreciate music (Yucel et al., 2009) possibly by teaching children with cochlear implants to attend to other available acoustic cues.

In sum, research into the development of sensitivity to lexical tone has revealed that perception, word learning, and subsequent language development follow a complex schedule that is influenced by long-term experience such as active language exposure and bilingualism. In comparison, work on adults has revealed long-term experiences such as language background and musicianship greatly affect processing of non-native lexical tones, that short-term training can improve non-native tone processing, and that individual differences among learners may predict tone learning outcomes. Exploring the interaction of these factors in childhood tone acquisition and tone learning will advance the field of infant and child lexical tone processing and lead to improved learning and interventions for those encountering difficulties in processing lexical tones.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### ACKNOWLEDGMENTS

This work was supported by Australian Research Council Discovery Early Career Researcher Award DE150101053 to MA.





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Antoniou and Chin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Monolingual and Bilingual Infants' Ability to Use Non-native Tone for Word Learning Deteriorates by the Second Year After Birth

Liquan Liu<sup>1,2,3,4</sup>\* and René Kager<sup>2</sup>

<sup>1</sup> School of Social Sciences and Psychology, Western Sydney University, Sydney, NSW, Australia, <sup>2</sup> Utrecht Institute of Linguistics-OTS, Utrecht University, Utrecht, Netherlands, <sup>3</sup> MARCS Institute for Brain, Behaviour & Development, Western Sydney University, Sydney, NSW, Australia, <sup>4</sup> Centre of Excellence for the Dynamics of Language, Australian Research Council, Canberra, ACT, Australia

#### Edited by:

Judit Gervain, Centre National de la Recherche Scientifique (CNRS), France

#### Reviewed by:

Nayeli Gonzalez-Gomez, Oxford Brookes University, United Kingdom

Laurianne Cabrera, UMR 8242 Laboratoire Psychologie de la Perception (LPP), France

#### \*Correspondence:

Liquan Liu l.liu@westernsydney.edu.au; liquan82@gmail.com

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 25 June 2017 Accepted: 24 January 2018 Published: 15 March 2018

#### Citation:

Liu L and Kager R (2018) Monolingual and Bilingual Infants' Ability to Use Non-native Tone for Word Learning Deteriorates by the Second Year After Birth. Front. Psychol. 9:117. doi: 10.3389/fpsyg.2018.00117

Previous studies reported a non-native word learning advantage for bilingual infants at around 18 months. We investigated developmental changes in infant interpretation of sounds that aid in object mapping. Dutch monolingual and bilingual (exposed to Dutch and a second non-tone-language) infants' word learning ability was examined on two novel label–object pairings using syllables differing in Mandarin tones as labels (flat vs. falling). Infants aged 14–15 months, regardless of language backgrounds, were sensitive to violations in the label–object pairings when lexical tones were switched compared to when they were the same as habituated. Conversely at 17–18 months, neither monolingual nor bilingual infants demonstrated learning. Linking with existing literature, infants' ability to associate non-native tones with meanings may be related to tonal acoustic properties and/or perceptual assimilation to native prosodic categories. These findings provide new insights into the relation between infant tone perception, learning, and interpretative narrowing from a developmental perspective.

Keywords: label–object mapping, lexical tone, bilingualism, interpretive narrowing, perceptual assimilation

### INTRODUCTION

As new language learners, young infants need to determine the possible sound forms in the ambient environment that entail lexical relevance. They must learn to ignore acoustic sound contrasts that do not carry meanings. This task may be more challenging for infants exposed to more than one language, a situation that applies to more than 50% of the world population (Grosjean, 2010), although our understanding of how bilingual infants acquire language is largely derived from research on monolingual infants. Similarities between monolingual and bilingual developmental trajectories can reveal fundamental learning mechanisms and highlight the nature of bilingual learning, whereas differences may reflect specific learning strategies and outcomes stemming from different learning environments. Tone languages make up more than 60% of the world's languages (Yip, 2002). The current study adds to our understanding of non-native tone-language learning and investigates the intersection of linguistic and lexical development by examining the learning of minimal pairs involving a tonal contrast across non-tone-language learning monolingual and bilingual infants in the second year after birth.

Infants have an astounding sensitivity to speech sounds in the ambient environment. As such, newborns discriminate between non-native languages through different rhythmic classes (Mehler et al., 1988; Nazzi et al., 1998a), pitch contours (Nazzi et al., 1998b), and lexical stress patterns (Sansavini et al., 1997). In the first year of life, infants tune in to their native sound inventories and tune out of non-native contrasts, a process known as perceptual attunement (Werker and Tees, 1984; Anderson et al., 2003; Kuhl et al., 2006; Watson et al., 2014). The language-specific attunement occurs around 8–12 months for consonants (Werker et al., 1981; Best et al., 1995) and around 6–8 months for vowels (Kuhl et al., 1992). By the end of the first year, infants' sensitivity to non-native consonant and vowel contrasts greatly decreases. These perceptual patterns extend to adulthood (Tsushima et al., 1994; Tsao et al., 2000). As for the attunement of lexical tones, tone-language learning infants maintain and improve their tonal sensitivity (Harrison, 2000; Mattock and Burnham, 2006; Yeung et al., 2013; Shi et al., 2017a; Tsao, 2017). Meanwhile, non-tone-language learning infants' sensitivity to tones greatly decreases at 9 months (Mattock and Burnham, 2006; Mattock et al., 2008; Liu and Kager, 2014; Cabrera et al., 2015; Shi et al., 2017b).

After perceptual attunement, listeners do not appear to totally lose sensitivity to tones. Instead, categorical perception (Hallé et al., 2004; Xu et al., 2006; Chen et al., 2015) and neuroimaging studies (Gandour et al., 2000; Kaan et al., 2008) suggest non-tone-language listeners are sensitive to tones, though perceiving them in an acoustic manner. In other words, non-tone-language listeners appear to demonstrate acoustic instead of linguistic processing of tones, determined by a number of factors across contexts, experiences, and modalities (Burnham et al., 2015a,b). A tonal sensitivity rebound has been found in non-tone-language learning infants in the second year after birth, resulting in a U-shaped tonal perceptual trajectory (Liu and Kager, 2014). Tested in a visual habituation paradigm, Dutch monolingual infants show a rebound in sensitivity to a tonal contrast at 17–18 months. No similar U-shaped pattern across the first 2 years after birth has been reported for the perception of non-native consonant and vowel contrasts (Liu and Kager, 2015a,b). However, since non-tone-language adult listeners are sensitive to tones (Xu et al., 2006), such a rebound is not entirely unexpected. The rebounded sensitivity is attributed to infants' acoustic (or phonetic, hereinafter) rather than linguistic (or phonological, hereinafter) sensitivity in light of non-tone-language adult listeners' acoustic perception of tones (Hallé et al., 2004; Jongman et al., 2017). This is similar to English infants' discrimination of non-native Zulu click sounds given the acoustic dissimilarity of these sounds to the native inventory (Best et al., 1988, 1995).

The question arises whether infants growing up learning two languages follow the same trajectory as monolinguals. Following the previous study reporting a U-shaped tonal perceptual trajectory (Liu and Kager, 2014), a follow-up study reports that non-tone-language learning infants from bilingual backgrounds show a rebound to the same tonal contrast at approximately 11–12 months after birth, 6 months earlier than their monolingual peers. Similar findings have been reported for the perception of consonant and vowel contrasts (Bosch and Sebastián-Gallés, 2003; Burns et al., 2007; Albareda-Castellot et al., 2011; Liu and Kager, 2016b). Similar to that of monolinguals, the rebounded sensitivity in non-tone-language learning bilingual infants matches adult data in suggesting that non-tone-language listeners perceive lexical tones acoustically (Hallé et al., 2004; Jongman et al., 2017). More generally, it matches previous literature showing that adult tone-language and non-tone-language listeners use different acoustic cues when perceiving lexical tones (Lee et al., 2008, 2010; Huang and Johnson, 2010; Cabrera et al., 2014).

Additionally, several factors such as enhanced auditory sensitivity and cognitive advantages have been proposed to account for the bilingual perceptual difference (Liu and Weidemann, 2017). The rebounded sensitivity in non-tone-language learning infants and the bilingual difference are crucial for the current study as they lead to further questions: What is the nature of non-tone-language learning infants' tonal perception at the rebound stage: acoustic or linguistic? Does rebounded sensitivity influence infants' learning ability of non-native words? Furthermore, does a bilingual difference in perception lead to a better outcome in learning?

These questions can be answered through label–object mapping involving non-native tonal minimal pairs. Specifically, if infants perceive tones linguistically at the rebounded stage, they are expected to be able to learn words contrasting in tone. Alternatively, if their perception is acoustically driven, non-tone-language learning infants' tonal word learning ability should deteriorate with age. Infants initially accept a wide range of word forms (e.g., non-speech sounds; Woodward and Hoyne, 1999; Hirsh-Pasek et al., 2000; Namy, 2001), and recognize early word forms with increasing linguistic experience (Jusczyk and Hohne, 1997; Tincoff and Jusczyk, 1999; Swingley and Aslin, 2002; Fennell and Werker, 2003; Bergelson and Swingley, 2012, 2013). They are able to recognize frequently used words and map novel labels to novel objects as early as 6 months (Shukla et al., 2011; Bergelson and Swingley, 2012). At 12 months, infants have developed native phonotactics, and continue to show label–object mappings for non-native sound contrasts (Jusczyk and Luce, 1994; MacKenzie et al., 2011, 2012). Dutch infants of 18 months interpret vowel duration as lexically contrasting, but English learners of the same age do not (Dietrich et al., 2007). This is in keeping with the different vowel properties of Dutch and English. By 20 months, their ability to associate non-speech symbols or non-native sounds with objects deteriorates (Namy and Waxman, 1998; Woodward and Hoyne, 1999; May and Werker, 2014; Saffran, 2014; Hay et al., 2015). In the second year after birth, infants appear to have formed sound categories from native language which they adopt to guide the acquisition of words, suggesting the experience of a second attunement (Werker and Tees, 2005). That is, infants refine what they consider to be possible word forms, and their early linguistic learning entails not only language-relevant acoustic cues but also linguistic interpretation at appropriate levels of linguistic analysis.

Tested by a label–object mapping paradigm in which infants are required to map two novel sounds with novel objects, 14- and 20-month-olds successfully associate dissimilar-sounding words with novel objects (lif-neem; Stager and Werker, 1997; chook-dal, Bijeljac-Babic et al., 2009). They do not typically succeed in associating minimal pair acoustic features with novel objects (bih-dih), possibly limited by their low vocabulary size (Werker et al., 2002) or task difficulty (Yoshida et al., 2009). Nevertheless, they are able to do so when additional information is provided, such as (1) increased referential cues (Fennell and Waxman, 2010), (2) additional object familiarization (Fennell, 2012), (3) enhanced speaker variability (Rost and McMurray, 2009), (4) added social interaction (Mani and Plunkett, 2008), (5) reduced task difficulty (visual choice procedure, Yoshida et al., 2009), and (6) similar lexical contexts (frequently occurring phonemes, Thiessen, 2007). At 17–20 months, infants are able to associate novel objects with minimal pair words (Werker et al., 2002). Their performance is tightly correlated with current and future language comprehension and production skills (Bernhardt et al., 2007). Few studies have directly examined the tonal word learning ability among non-tone-language learning monolingual and bilingual infants. English-learning infants succeed in label–object mapping of monosyllabic words that differ in a tonal contrast in Mandarin Chinese (T2 rising vs. T4 falling) at 14 months but fail at 17 (Burnham et al., 2017) and 19 months (Hay et al., 2015). They also fail to associate objects with Mandarin T1 flat vs. T2 rising contrast at 17 months (Burnham et al., 2017). Taken together, infants' native word learning ability increases between 14 and 20 months, while their non-native word learning ability decreases (Hay et al., 2012, 2015).

It remains unclear whether infants' diverse linguistic experience may prolong the developmental trajectory in word development. Some studies show non-prolongation of the developmental trajectory. Monolingual and bilingual infants appear to experience linguistic milestones and developmental trajectories at similar time windows (Swain, 1972; Vihman, 1985; Pearson et al., 1993, 1995; Petitto and Kovelman, 2003; Vihman et al., 2007; Werker and Byers-Heinlein, 2008; Werker et al., 2009; Hoff et al., 2012; Werker, 2013; De Houwer et al., 2014; Singh et al., 2014; Liu et al., 2017), with matched number of lexical concepts (Pearson et al., 1993; Pearson and Fernandez, 1994; Junker and Stockman, 2002; Thordardottir et al., 2006; Hoff et al., 2012; De Houwer et al., 2014; Liu et al., 2017). When appropriate contextual carriers are given (e.g., speaker matching their language environment), monolingual and bilingual infants both learn certain minimal pairs (/bos/-/gos/, Mattock et al., 2010; /kεm/-/gεm/, Fennell and Byers-Heinlein, 2014) at 17 months. On the other hand, prolonged word learning trajectories in bilingual infants have also been reported (Friedrich and Friederici, 2004; Conboy and Mills, 2006; Kaushanskaya and Marian, 2009; Marchman et al., 2010; Byers-Heinlein, 2013; Singh, 2017; Singh et al., 2017). Nine-month-old Chinese–English bilingual infants recognize both Chinese and English words contrasted in tone (Singh and Foong, 2012), even though English does not use tone to differentiate meaning. While monolingual infants begin to succeed at learning two minimal paired non-words (/bI/-/dI/) around 17 months, bilinguals do not succeed until 20 months (Fennell et al., 2007; Werker, 2013). Bilingual children are behind their monolingual peers in vocabulary size when a single language, especially the non-dominant language, is compared from 8 months up to 10 years (Bialystok et al., 2010; Gauthier and Genesee, 2011). Eighteen-month-old bilingual but not monolingual infants keep their flexibility for a prolonged period and continue to show the mapping of labels that minimally contrast in pitch contour to novel objects, and the ability deteriorates at 22 months (Graf Estes and Hay, 2015). Both monolingual English-learning and bilingual English–Chinese-learning infants are able to detect tonal substitutions as mispronunciations at 18 months, but monolingual infants are no longer able to do so at 24 months, while the bilinguals are still able to do so. Bilingual infants' performance appears to be language-specific: they can detect tone mispronunciations in Chinese but not English contexts (Singh et al., 2014). In sum, the results of word learning studies with monolinguals and bilinguals suggest that infants attribute linguistic relevance to tones in a language-specific fashion between 18 and 24 months. By the end of the second year, infants' ability to use lexical tone for word learning is in accordance with their native language and exposure.

Discrepancies between monolingual and bilingual infants' novel word learning outcomes may be attributed to a number of factors such as different use of learning strategies (Sebastián-Gallés et al., 2012) and task design (Singh et al., 2012). Infants from a multilingual environment do not use mutual exclusivity to the same degree as their monolingual peers when learning words (Byers-Heinlein and Werker, 2009). They are sensitive to environmental differences and contextual cues (Mattock et al., 2010; Fennell and Byers-Heinlein, 2014). In addition, differences in time windows at which non-tone-language learning infants' tonal label–object mapping ability decreases may be due to different testing paradigms, which may elicit different levels of sensitivity.

Apart from these factors, acoustic properties of the tonal contrast may also play a role. A number of studies use T2–T4 (rising–falling) in Mandarin Chinese as the target contrast (Singh and Foong, 2012; Hay et al., 2015), with the pitch directions close to those in the interrogative-narrative intonation patterns in many languages, such as English and Spanish. This potentially introduces an effect of perceptual assimilation, assimilating a non-native phoneme into one's native phonemic category (Best, 1994; Soderstrom et al., 2011; Tyler et al., 2014). If perceptual assimilation plays a role in tonal processing and learning, we would expect non-native listeners to perform better on the T2–T4 (rising vs. falling) contrast than on the T1–T4 contrast, in which the assimilation of T1 remains unclear. In addition, an expansion of investigation to other non-native tonal contrasts and reexamination of the word learning of 18-month-old non-tone-language learning infants is necessary to further understand the impact of age, stimuli, and language background on interpretive narrowing in word learning (Stager and Werker, 1997; Hay et al., 2015), the process by which infants restrict the types of sounds that can be mapped to word meanings.

To understand the impact of linguistic diversity on novel word learning ability, we tested monolingual and bilingual infants, both of whom lacked prior experience with lexical tones. To reduce the effect of perceptual assimilation on tonal word learning, we used a tonal contrast different from that of previous studies (which used T2 rising vs. T4 falling). Linking to the previous question concerning the nature of tone perception among infants learning non-tone languages, the research questions are: How does non-tone-language learning Dutch infants' label–object mapping ability for sound-to-meaning pairs involving lexical tone contrasts develop during the second year of life? Do learning patterns differ between non-tone-language learning monolingual and bilingual Dutch infants? We adopted a word learning paradigm using the same stimuli as in the previous visual habituation paradigm (Stager and Werker, 1997) and tested monolingual and bilingual infants at two age ranges (14–15 and 17–18 months). To understand the effect of the acoustic properties of the contrast on learning and to reduce the effect of potential perceptual assimilation, a contrast in Mandarin Chinese (T1 level vs. T4 falling) was used. We predicted successful learning at 14–15 months for both monolingual and bilingual Dutch infants and left the prediction open for 17- to 18-month-olds. We hypothesize that the tonal rebound is acoustic/phonetic in nature and hence would not positively affect word learning. Bilingual infants may show enhanced performance for word learning due to their flexibility in learning non-native tone-to-word pairings (Graf Estes and Hay, 2015). Alternatively, bilingual infants may undergo the perceptual rebound earlier than monolinguals. These two factors may affect bilinguals' ability to learn new words contrasting on tones at 17–18 months.

### EXPERIMENT 1

### Participants

Sixty-four (30 male) typically developing Dutch infants aged 14–15 months participated in the experiment. All bilingual infants had a non-tone or pitch-accented language as the other native language since birth apart from Dutch (mean Dutch exposure 55 ± 15%). As evaluated by a multilingual infant questionnaire (Liu and Kager, 2016a), infants' degree of exposure to the non-dominant language was no less than 20%. Participating families came from similar socioeconomic backgrounds with the same level of parental education. Data from 14 infants were excluded for fussiness (4), crying (3), and inattentiveness [looking time (LT) less than 1 s in five consecutive trials] during the experiment (2), as well as failure to reach the habituation criterion (5, defined in Procedure). The detailed attrition rate for each group is listed in **Appendix 1**. In the final sample, data of 20 monolingual and 20 bilingual infants (bilingual language backgrounds listed in **Appendix 2**) were incorporated into the analysis (mean age: 447 ± 13.7 days). Parents of the participants confirmed no language impairments as well as normal hearing for their children, and provided written informed consent for the study. At present and at the time of the study, the experiment endorses the WMA Declaration of Helsinki - Ethical Principles for Medical Research Involving Human Subjects, as well as The Netherlands Code of Conduct for Scientific Practice issued in 2004 (revised in 2012) by the Association of Universities in the Netherlands (VSNU). The Ethical Assessment Committee of Utrecht Institute of Linguistics, Utrecht University gave positive advice on the current study.

### Stimuli

A Mandarin tonal contrast not tested in previous word learning studies (T1 level vs. T4 falling) was selected to create the stimuli for the label–object association in the current study. The syllable /ta/ was selected as the tone-bearing syllable. /ta1/ "build" and /ta4/ "big" are both legal words in Mandarin Chinese. A Mandarin female speaker's speech production was recorded by Audacity (open source computer program) via Genelec 1029A active speaker recording system in a sound-attenuated room of Utrecht Institute of Linguistics, Utrecht University Phonetics Lab. Four natural pairs of T1–T4 were recorded for each sound to increase within-speaker variation. Examples of the pitch contours and spectrograms of a T1–T4 pair of stimuli are provided in **Figures 1A,B**. A ball was selected as the familiar stimulus, and the novel objects consisted of two distinct, multicolored images moving back and forth horizontally on the monitor (**Figure 2**).

### Procedure

A version of the label–object mapping paradigm similar to previous studies (Graf Estes and Hay, 2015; Hay et al., 2015) was adopted. The paradigm included a pre-test, a habituation, a test, and a post-test phase. In the pre-/post-test phases, infants saw a moving ball along with 10 tokens of the word "ball." The purpose was to test infants' initial and general attention, as well as to familiarize them with the program. During habituation, infants were familiarized with the associations between two novel moving objects (**Figure 2**) and the corresponding sound labels (**Figure 1**). The novel label–object pairings were counter-balanced across infants, such that some infants were familiarized with Object1–T1 and Object2–T4 pairs, and the others with Object1–T4 and Object2–T1 pairs. Infants went through two to six blocks depending on their speed of habituation. Each block had four trials, two for each label–object mapping. Within each block, the trial orders were quasi-randomized among six non-repeated options: AABB, ABBA, ABAB, BAAB, BABA, and BBAA. The trials were infant-gaze controlled with a maximum of 20 s per trial. Each trial ended after infants looked away for 2 s consecutively. The inter-stimulus interval was 1 s across phases. When participants' LTs to both label–object pairings within a block dropped to 65% of those in the first block, the habituation criterion was reached. Infants failing to reach this criterion within a maximum of six blocks were excluded from analysis. During the test phase, participants had four trials in either Switch–Same–Switch–Same or Same–Switch–Same–Switch order. In the Same trials, participants heard the same label–object mappings as during habituation. In the Switch trials, labels were linked to the opposite objects shown in habituation, leading to discrepancies in the sound-object mapping and breaking the association.
For instance, if an infant was familiarized with the Object1–T1 and Object2–T4 pairs during habituation, the Same trials in test would still be Object1–T1 and Object2–T4, and the Switch trials would be Object1–T4 and Object2–T1. A longer recovery of attention (in LT) during the broken association in comparison to the familiarized mapping would suggest that infants have successfully established the mapping in the habituation phase. Data of two instead of one trial per trial type (Same vs. Switch) were collected to ensure that the results obtained in the test phase were not due to chance. The test ended with a happy Dutch song "Alle eendjes zwemmen in het water" ("All ducklings are swimming in the water") to enhance infants' joyful emotions when leaving the test booth.
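The habituation logic described above can be illustrated with a minimal Python sketch. It is not the software used in the study; the function names, the data layout (one summed LT per pairing per block), and the reading of "non-repeated" as sampling block orders without replacement are assumptions made for illustration. The 65% ratio, the six-block maximum, and the six trial orders come from the procedure.

```python
import random
from itertools import permutations

# The six distinct orderings of two A trials and two B trials per block.
BLOCK_ORDERS = sorted({"".join(p) for p in permutations("AABB")})

def habituation_reached(first_block, block, ratio=0.65):
    """True if LT to each pairing is at or below `ratio` of its first-block LT.

    Assumption: each block is a dict of summed LTs (in seconds) per pairing.
    """
    return all(block[k] <= ratio * first_block[k] for k in ("A", "B"))

def blocks_until_habituation(block_lts, max_blocks=6, ratio=0.65):
    """Return the 1-based block number at which the criterion is met, or None."""
    first = block_lts[0]
    for number, block in enumerate(block_lts[1:max_blocks], start=2):
        if habituation_reached(first, block, ratio):
            return number
    return None  # such an infant would be excluded from analysis

def draw_block_orders(n_blocks, rng):
    """Assumed reading of 'non-repeated': sample orders without replacement."""
    return rng.sample(BLOCK_ORDERS, n_blocks)

# Hypothetical looking times, not data from the study.
example = [{"A": 30.0, "B": 28.0}, {"A": 25.0, "B": 20.0}, {"A": 18.0, "B": 17.0}]
print(blocks_until_habituation(example))  # 3
print(draw_block_orders(4, random.Random(0)))
```

In this hypothetical example the second block does not meet the criterion (25.0 s exceeds 65% of 30.0 s), whereas the third block does for both pairings.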

In a sound-attenuated test booth of Utrecht Institute of Linguistics, Utrecht University, infants were seated on their caretaker's lap, facing a flat screen monitor, a hidden loudspeaker and a hidden camera approximately 1 m away. Infants' responses were observed through a closed circuit TV. An experimenter recorded infants' LTs using a button box. The test was presented using the Flexible Experimental Programme (Veenker, 2007), a C-based program designed by a university technician. Caretakers and experimenters were kept blind to the audio stimuli by listening to masking music over headphones during the entire test.

### Results

A repeated measures analysis of variance (ANOVA) was conducted with the average LT during test as the dependent variable, Trial type (Same vs. Switch) as a within-subject factor, and Language (monolingual vs. bilingual) as a between-subject factor. The effect of Trial type was significant, F(1,38) = 8.467, p = 0.006, η<sup>2</sup> = 0.182. The interaction between Language and Trial type was not, F(1,38) = 0.161, p = 0.691, η<sup>2</sup> = 0.004. The data suggest that infants in both groups succeeded in mapping a novel non-native tonal contrast onto novel objects. Tests of between-subject effects showed that Language was not a significant factor, F(1,38) = 0.520, p = 0.475, η<sup>2</sup> = 0.014. Both monolingual and bilingual infants showed longer LT in Switch trials than in Same trials (**Figure 3**) in the test phase. In addition, habituation time, habituation direction, and the number of blocks did not differ between monolingual and bilingual infants (ps > 0.361). Both monolingual and bilingual infants appeared to learn the minimal pairs contrasted in tones. To further investigate non-tone-language learning infants' word learning ability, we tested infants of an older age in the next experiment.
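As a reading aid, the reported effect sizes can be related to the F statistics. The sketch below assumes that the reported η<sup>2</sup> values are partial eta squared, which can be recovered as F·df<sub>effect</sub> / (F·df<sub>effect</sub> + df<sub>error</sub>); the text does not state which variant was computed, and the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Trial type main effect, F(1,38) = 8.467
print(round(partial_eta_squared(8.467, 1, 38), 3))  # 0.182
# Language x Trial type interaction, F(1,38) = 0.161
print(round(partial_eta_squared(0.161, 1, 38), 3))  # 0.004
```

Under this assumption the recomputed values match the reported ones for the Trial type effect and the interaction; small deviations elsewhere can arise because the published F values are themselves rounded.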

### EXPERIMENT 2

### Participants

Fifty-one (25 male) typically developing Dutch infants of 17–18 months participated in the study. The same language background criteria as in Experiment 1 were adopted. Data from 11 infants were excluded for fussiness (2), crying (3), inattentiveness (1), failure to reach the habituation criterion (4), and dyslexic background in the family (1). In the final sample, data of 20 monolingual and 20 bilingual infants (language background listed in **Appendix 2**) were incorporated into the analysis (mean age: 537 ± 12.3 days).

### Stimuli and Procedure

The same stimuli and procedure as in Experiment 1 were adopted.

### Results

A repeated measures ANOVA was conducted with the same factors as in Experiment 1. The main effect of Trial type (Same vs. Switch) on LT was not significant, F(1,38) = 1.642, p = 0.208, η<sup>2</sup> = 0.041, nor was the interaction between Language and Trial type, F(1,38) = 0.001, p = 0.976, η<sup>2</sup> < 0.001. Tests of between-subject effects showed that the effect of Language was not significant, F(1,38) = 0.009, p = 0.925, η<sup>2</sup> < 0.001. Neither monolingual nor bilingual infants showed longer LT in Switch than in Same trials (**Figure 4**). In addition, habituation time, habituation direction, and the number of blocks did not differ between monolingual and bilingual infants (ps > 0.400).

### DISCUSSION

This paper investigated the ability to learn label–object associations of a non-native tonal contrast in toddlers acquiring non-tonal languages, testing the generality of interpretive narrowing. A tonal contrast different from previous studies (Singh et al., 2014; Graf Estes and Hay, 2015) was adopted for a better understanding of the effect of acoustic properties and perceptual assimilation on learning. Results shed light on Dutch infants' non-native interpretive narrowing process with two main findings. First, infants were able to establish associations between novel tones and objects at 14–15 months, whereas they failed to do so at 17–18 months. Second, the current results indicated similar developmental trajectories between monolingual and bilingual infants in word learning involving novel and non-native sound contrasts.

### Infants' Fast Label–Object Mapping of Non-native Tones

Infants maintain detailed representations from the input, paying attention to acoustic, linguistic, and many other cues (Swingley and Aslin, 2002). Nevertheless, they need to ignore variability in the input in order to form abstract categories. Between the stages of learning sounds and learning words, an interpretive narrowing in infants' usage of acoustic detail has been suggested (Stager and Werker, 1997). The finding that 14- to 15-month-old non-tone-language learning infants were able to establish associations between novel tones and novel objects is in line with previous studies, indicating that pitch contour may remain an important acoustic cue for word learning. Infants of 17–18 months, however, no longer exhibit a learning effect, showing incongruent results (Graf Estes and Hay, 2015; Hay et al., 2015). The overall trend suggests a reduction of linguistic function in non-native tones among non-tone-language learning infants, conforming to the trend of interpretive narrowing observed for consonant and vowel contrasts.

The observed decrease may be attributed to a natural decay of linguistic function with no relevant exposure from the environment under a second perceptual attunement (Werker and Tees, 2005) where infants concentrate on selecting the lexically contrastive properties from their native language. Contrasts that are not relevant to infants' native language may remain acoustically perceptible (Best et al., 1988). Nevertheless, they are not used for a linguistic function. Since no systematic functional use of lexical tones is present, non-tone-language learning infants never develop tonal categories to map the relevant input. Nor do they pay attention to the tonal variation on a lexical level, exemplifying a "use it or lose it" scenario. It is not a decreased tonal sensitivity that affects the ability to abstract and form categories because of the sensitivity rebound observed in previous studies (Liu and Kager, 2014, 2017c). The deterioration may also reflect the loss of a general ability to abstract as well as create a tonal proto-category. Establishing a lexical representation requires building a link between acoustic exemplars from the ambient environment and word meaning, and subsequently setting up an abstract, categorical representation. After (the first) perceptual attunement, infants have established category boundaries based on their native language inventories and set up categories that matter in meaning differences to guide word learning. It thus becomes increasingly difficult to create new representations for non-native input. Non-tone-language learning infants' decreased tonal sensitivity may affect their ability to abstract and form categories for unattended acoustic dimensions. This is similar to studies discussing (late) learners' relative difficulties with specific non-native words (Best and Tyler, 2007; Best et al., 2009). However, this explanation does not conform to infants' rebounded tonal sensitivity at 17–18 months reported in previous studies (Liu and Kager, 2014, 2017b), which should facilitate generalization of tonal categories.

Linking perception with label–object mapping, successful learning involving a non-native contrast may rely on a number of elements including the exposure to that contrast (Kaan et al., 2007; Liu and Kager, 2011, 2017c), the residual ability of creating categories from acoustic input, and the potential interference from native categories.

### The Effect of Bilingualism on Infant Language Development

The current experiments do not find any significant differences between monolingual and bilingual word learning abilities. Although native sound and word learning trajectories remain debatable between monolingual and bilingual infants, similar learning patterns were found in the current study. This pattern is similar to some word learning experiments showing that early bilingual exposure does not interfere with infants' fundamental word learning ability (Mattock et al., 2010; Byers-Heinlein et al., 2013), but different from some other experiments in which advantages (Singh et al., 2014; Graf Estes and Hay, 2015; Burnham et al., 2017) or delays (Fennell et al., 2007) are observed in bilingual populations. Without relevant input to establish sound categories, neither monolingual nor bilingual non-tone-language learning infants appear to treat word-level pitch as linguistically relevant in the second half of the second year after birth.

### Toward an Integrative View of Non-native Tonal Word Learning

The same stimuli were used in our previous studies in which a visual habituation paradigm was adopted to track non-tone-language learning monolingual and bilingual infants' discrimination from 5 to 18 months (Liu and Kager, 2014, 2017b). The earlier results need to be discussed in relation to the current ones in order to compare the development of tonal discrimination and word learning ability. Non-tone-language learning infants discriminated the same tone contrast at 14–15 and 17–18 months. Lexical representations may be encoded in fine detail, even though these details may not be necessary for linguistic functions such as native vocabulary acquisition (Swingley and Aslin, 2002). Although infants' auditory sensitivity to non-native tones rebounds in later infancy and presumably extends to adulthood, 17- to 18-month-old infants do not show label–object mapping using non-native lexical tones in the current study. We hypothesize that non-tone-learning infants show an acoustic instead of linguistic perception of tones by the end of the second year after birth, resembling non-tone-language adults (Hallé et al., 2004; Jongman et al., 2017).

Data from the current experiments are crucial to understanding the time course of infants' word learning ability. In line with previous studies (Graf Estes and Hay, 2015; Hay et al., 2015), infants map non-native tonal contrasts to novel objects at 14–15 months, suggesting flexibility in word learning ability for non-native contrasts even after tonal perceptual attunement (Werker and Tees, 2005). The lack of ability to establish label–object association at 17–18 months, for both monolinguals and bilinguals, is consistent with previous findings for monolingual infants (Hay et al., 2015) but contrasts with those for bilinguals (observed at 22 months, Singh et al., 2014; Graf Estes and Hay, 2015). Such a difference may be attributed to a number of factors such as stimuli or testing paradigms. The procedure used in Singh et al. (2014) introduces two phases of familiarization before training infants on novel label–object mappings: first, participants are familiarized with the task procedure using frequent word–object pairs, and second, novel objects are directly presented to the infants. This practice may largely reduce the task difficulty and lead to a better learning effect, resulting in successful mapping at a relatively later age. Moreover, the difference across studies may also be due to an effect of perceptual assimilation of the non-native contrasts (e.g., successful learning of the T2–T4 contrast in Graf Estes and Hay (2015) vs. unsuccessful learning of T1–T4 in the present study). The distinction between T2 and T4 (rising vs. falling) may be better assimilated and more easily perceived than that between T1 and T4 (flat vs. falling). Although non-tone-language learning infants' perception is arguably acoustically rather than linguistically based after perceptual attunement, their word learning ability appears to be contrast-dependent, influenced by listeners' linguistic experience and possibly native categories.
Perceptual salience is another factor that may play a role in tone perception. The T2–T4 contrast tested in previous studies may be more salient than the current T1–T4 contrast. However, English infants of 18 months fail to learn a salient, non-native minimal pair contrasted in vowel duration (Dietrich et al., 2007), indicating that perceptual salience may contribute more to acoustic discrimination (e.g., Best et al., 1988; Liu and Kager, 2014; Ramachers et al., 2017) than to linguistic interpretation and as such its effect may be limited during interpretive narrowing.

By the end of the second year of life, infants may maintain detailed representations of acoustic details supported by their auditory sensitivity, but this sensitivity may not present itself in a label–object mapping task, especially given isolated stimuli (Fennell and Waxman, 2010). Infants may retain detailed acoustic information given their general auditory sensitivity. However, they may focus on establishing abstract categories during category learning. This hypothesis fits the developmental framework of the Processing Rich Information from Multidimensional Interactive Representations (PRIMIR) model (Werker and Curtin, 2005). PRIMIR assumes the availability of rich information in the speech input and proposes that infants perceive and acquire information along three interactive, multidimensional planes: a general perceptual plane, a (meaningful) word form plane, and a phonemic plane. In any situation, the processing of input information depends on the joint activity of three dynamic filters: initial perceptual biases, developmental stage, and environmental demands. In the current experiment, for instance, infants' lexical use of non-native tonal information decreases despite their initial perceptual biases toward lexical pitch. Their performance in the word learning task is hypothesized to be influenced by the task design as well as the acoustics of the specific tonal contrast.

The PRIMIR model has been further extended to bilingual infants (Curtin et al., 2011). Bilingual infants are required to determine which language is relevant in the context of the specific task at hand (Mattock et al., 2010). As lexical tonal information is absent in the linguistic context of Dutch monolingual and bilingual infants, no difference is observed in the current word learning task. Future models of speech processing may extend their predictions on contrast learning and learnability.

### CONCLUSION

This paper addresses how early language learners determine which acoustic dimensions in their environment differentiate word meanings. Non-tone-language learning monolingual and bilingual infants are able to construct linguistic representations of Mandarin T1–T4 tones at 14–15 but not at 17–18 months. Linking the current findings with previous literature, we hypothesize that infants' perception of non-native tones is more acoustic than linguistic in the later phase of language development, that is, mainly based on the acoustic properties of tones. In addition, given the different outcomes across contrasts (T1–T4 vs. T2–T4) between the current and previous studies, we are inclined to suggest a role for perceptual assimilation in non-native word learning (Best, 1994). That is, non-tone-language learning infants' tonal label–object mapping ability is affected by intonation contours from their native language, and facilitation may occur when acoustic similarities/overlaps occur between non-native tones and native intonation (e.g., T2–T4). Given that differences may also lie in the paradigms used across associative word learning studies, the suggestion should be considered with caution. Last but not least, bilingual infants appear to at least keep the same pace as their monolingual peers in the word learning of non-native tonal contrasts.

### AUTHOR CONTRIBUTIONS

LL contributed to the experiments. LL and RK contributed to the manuscript.

### FUNDING

LL received an international Ph.D. grant from Utrecht Institute of Linguistics, Utrecht University to carry out this research, and a Start-up grant from School of Social Sciences and Psychology, Western Sydney University for open access publication.

### ACKNOWLEDGMENTS

We dearly thank Cedric Cheng, Mengru Han, Mieke du Toit, Shan Fan, Stephen Politzer-Ahles, Tianlin Wang, Tomas Lentz, and Vincent van Buul for their feedback on the paper. We sincerely thank the Babylab group and experimental phonology group members of the Utrecht Institute of Linguistics OTS, Utrecht University, for their valuable feedback. Special thanks go to Frontiers in Psychology journal reviewers. As always, we are in debt to all families that participated in our research in Utrecht, the Netherlands.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Liu and Kager. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## APPENDIX

#### Appendix 1 | Attritions across ages and language backgrounds.

#### Appendix 2 | Bilingual language background.



# One Way or Another: Evidence for Perceptual Asymmetry in Pre-attentive Learning of Non-native Contrasts

#### Liquan Liu<sup>1,2,3</sup>\*, Jia Hoong Ong<sup>3,4</sup>, Alba Tuninetti<sup>2,3</sup> and Paola Escudero<sup>2,3</sup>

<sup>1</sup> School of Social Sciences and Psychology, Western Sydney University, Penrith, NSW, Australia, <sup>2</sup> The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Penrith, NSW, Australia, <sup>3</sup> Centre of Excellence for the Dynamics of Language, Australian Research Council, Canberra, ACT, Australia, <sup>4</sup> Division of Linguistics and Multilingual Studies, School of Humanities, Nanyang Technological University, Singapore, Singapore

Research investigating listeners' neural sensitivity to speech sounds has largely focused on segmental features. We examined Australian English listeners' perception and learning of a supra-segmental feature, pitch direction in a non-native tonal contrast, using a passive oddball paradigm and electroencephalography. The stimuli were two contours generated from naturally produced high-level and high-falling tones in Mandarin Chinese, differing only in pitch direction (Liu and Kager, 2014). While both contours had similar pitch onsets, the pitch offset of the falling contour was lower than that of the level one. The contrast was presented in two orientations (standard and deviant reversed) and tested in two blocks with the order of block presentation counterbalanced. Mismatch negativity (MMN) responses showed that listeners discriminated the non-native tonal contrast only in the second block, reflecting indications of learning through exposure during the first block. In addition, listeners showed a later MMN peak for their second block of test relative to listeners who did the same block first, suggesting linguistic (as opposed to acoustic) processing or a misapplication of perceptual strategies from the first to the second block. The results also showed a perceptual asymmetry for change in pitch direction: listeners who encountered a falling tone deviant in the first block had larger frontal MMN amplitudes than listeners who encountered a level tone deviant in the first block. The implications of our findings for second language speech and the developmental trajectory for tone perception are discussed.

Keywords: electroencephalography, mismatch negativity, speech processing, tone, pitch direction, learning, perceptual asymmetry

#### Edited by:

Guillaume Thierry, Bangor University, United Kingdom

#### Reviewed by:

Mariapaola D'Imperio, Aix-Marseille Université, France; Chiara Gambi, Cardiff University, United Kingdom

#### \*Correspondence:

Liquan Liu l.liu@westernsydney.edu.au; liquan82@gmail.com

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 06 September 2017 Accepted: 31 January 2018 Published: 20 March 2018

#### Citation:

Liu L, Ong JH, Tuninetti A and Escudero P (2018) One Way or Another: Evidence for Perceptual Asymmetry in Pre-attentive Learning of Non-native Contrasts. Front. Psychol. 9:162. doi: 10.3389/fpsyg.2018.00162

### INTRODUCTION

More than 60% of the world's languages are tonal languages in which word-level pitch variations are used to distinguish meanings by signaling prosodic contrasts at syllable and/or word levels of linguistic representation (Yip, 2002; Maddieson, 2005). Speech perception research has largely focused on consonants and vowels, and less is known regarding the processing of lexical tones. The investigation of tones, a suprasegmental feature, provides an opportunity to examine the relationship between listeners' experience with cross-domain, time-varying pitch patterns and the (neural) processing of prosody on a lexical level. To provide a more comprehensive understanding of speech perception, this study is among the first to examine how adult listeners process non-native tonal distinctions at the neural level and specifically how changes in pitch direction are reflected in brain waves that can be measured using electroencephalography (EEG).

Although tone perception is determined by a number of factors such as context, experience, and modality (Burnham et al., 2015a,b), it is well known that native speakers of a tone language treat tonal variation as linguistically meaningful from infancy through adulthood. Despite the fact that neonates universally distinguish pitch contour differences at the word level (Nazzi et al., 1998), young infants and children growing up learning tone languages retain and improve their tonal sensitivity (Harrison, 2000; Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Tsao, 2017; but see Shi et al., 2017a). Importantly, native speakers of a tonal language perceive lexical tones in a categorical manner (Gandour, 1978; Hallé et al., 2004; Content and Perwez, 2011), similarly to other speech segments, and their tone perception is subject to abstract rules (e.g., tone sandhi) in their native phonological system (Hume and Johnson, 2001; Politzer-Ahles et al., 2016). Categorical perception of pitch is not confined to lexical tone perception, but extends also to pitch accent alignment perception in intonational languages (D'Imperio and House, 1997). Recent neuro-imaging studies confirm that native listeners process tones similarly to other speech segments in the left hemisphere and with the activation of the left frontal operculum, which demonstrates that the phonological processing of suprasegmental units also occurs near Broca's area (Gandour et al., 2000; Brown-Schmidt and Canseco-Gonzalez, 2004; Xi et al., 2010).

In contrast, non-tone language speakers appear to process tones in a non-linguistic manner, with predominant neural activation in the right hemisphere (Gandour et al., 1998, 2000, 2004). Indeed, tone and non-tone language listeners have differential perceptual trajectories for tones shortly after birth. Non-tone learning infants, though showing initial sensitivity to tones just as their tone language peers, attune to their native language at around 9 months and treat tonal changes as linguistically irrelevant (Mattock and Burnham, 2006; Mattock et al., 2008). In other words, while "tone babies" tune in to lexical tones, "non-tone babies" tune out (Werker and Tees, 2002; Kuhl et al., 2006) and their tonal sensitivity deteriorates. In the 2nd year, a tonal perceptual rebound occurs for non-tone learning infants, who start to be more sensitive to tonal differences (Liu, 2014; Liu and Kager, 2014, 2017a). However, a number of word learning experiments illustrate that this rebound in sensitivity is unlikely to be linguistic and instead may be acoustic, as non-tone language-learning infants ignore lexical pitch variations which do not yield meaningful changes and they do not associate different lexical tones with different objects by the end of their 2nd year (Singh et al., 2014; Hay et al., 2015; Liu and Kager, 2018). Non-tone language adult listeners appear to follow the same pattern and perceive tones in a psycho-acoustic fashion (Gandour et al., 2000; Hallé et al., 2004; Xu et al., 2006a; Kaan et al., 2008; Chen et al., 2015). Importantly, Chen et al. (2016) have shown that due to the absence of relevant exposure to encourage abstraction of tonal categories, identification and learning of tones become increasingly difficult for non-tone language adult listeners, just like non-tone learning infants.

Previous research has shown that perceiving tone contrasts is not always difficult, as listeners are able to use speech modulation cues (e.g., frequency modulation, Cabrera et al., 2015) and some contrasts are easier to discriminate than others (Whalen and Xu, 1992; Huang and Johnson, 2010). As non-tone language listeners' perception is likely to be acoustic, the observed variability may derive from the intrinsic acoustic properties of tones. Tone, or linguistic pitch, is an attribute of multiple dimensions, with pitch height, contour and direction serving as primary perceptual cues (Gandour, 1983; Chandrasekaran et al., 2009; Yeung et al., 2013). Listeners' discrimination ability may largely depend on their previous experience of these tonal properties, such that tone language experience or music training may sharpen listeners' overall pitch sensitivity (Wang et al., 2003; Wong et al., 2007; Kaan et al., 2008; Dittinger et al., 2016; Ong et al., 2016). Indeed, comparisons of tone language and non-tone language listeners suggest that extensive tone language experience allows listeners to pay more attention to certain pitch cues, such as pitch slope and direction, than listeners without tone language experience do (Gandour and Harshman, 1978). Alternatively, non-tone language listeners' perception of lexical tones may depend on how such tones are categorized in terms of the listeners' native phonology (Singh and Chee, 2016). Specifically, although non-tone language listeners have no experience with tones or tonal categories, their knowledge of native intonation may affect non-native tone perception. Pitch contours of middle-rising [T2] vs. high-falling [T4] tones in Mandarin Chinese, for instance, are close to the interrogation vs. narration intonation contours in many non-tone languages such as English (Hay et al., 2015).
Similarities such as those may increase the perceptual salience of certain non-native tonal contrasts for listeners who perceptually assimilate them to a native intonation contrast (So and Best, 2014). The question as to how non-tone language listeners perceive (the majority of other) tones that have no counterpart in intonation is still unanswered.

Without the influence of native categories, listeners' perception of tones may depend on the acoustic salience of the contrast, which varies as a function of the distance in perceptual space and cue weightings between the two members of the contrast (Escudero and Boersma, 2004; Escudero, 2005). Acoustic salience modulates listeners' ability for contrast discrimination under the pressure of language-specific perceptual attunement. Some acoustically salient contrasts, such as Zulu clicks (Best et al., 1988, 1995), voiceless fricative place contrasts from Nuu-Chah-Nulth /x/-/χ/ (Tyler et al., 2014), English /ε/-/æ/, German /u/-/y/ (Polka and Bohn, 1996), and Limburgian pitch accents (Ramachers et al., 2017) remain discriminable across ages, despite them being non-native. Conversely, some less salient native contrasts, such as the Dutch /i/-/I/ vowel contrast (Liu and Kager, 2016), are not well discriminated until a relatively later age.

Tonal acoustic salience is predominantly determined by three major cues: pitch height, pitch contour, and pitch direction (Gandour, 1983). However, very few studies have directly compared tonal acoustic salience by examining these properties.

Relating specifically to tonal contrasts, behavioral evidence suggests that both tone language and non-tone language listeners exhibit ceiling performance when discriminating a salient high-level [T1] vs. high-falling [T4] tonal contrast in Mandarin Chinese (Liu and Kager, 2014; Shi et al., 2017b). However, tone language listeners outperformed non-tone language listeners when perceiving a similar contrast that was made less salient by shrinking the pitch distance between the two tones (Liu et al., 2017). Although the results of behavioral studies demonstrate that native speakers outperform non-native speakers in contrasts with less acoustic salience, an investigation of neural responses to three pitch contour contrasts using a passive oddball paradigm (Chandrasekaran et al., 2007) suggests this may be dependent on the tonal contrast itself. The authors found that native Chinese listeners had a larger mismatch negativity (MMN) response than English listeners when discriminating salient tonal contrasts such as high-level [T1] vs. middle-rising [T2], and high-level [T1] vs. dipping [T3] tones in Mandarin Chinese. In contrast, no clear MMN difference between language groups was shown for a non-salient tonal contrast such as middle-rising [T2] vs. dipping [T3] tones, which is notoriously difficult to discriminate in isolation because the two tones are similar in both their acoustic and their phonological (sandhi effect) properties.

The discrepancies between the behavioral and neural evidence call for further studies in tonal processing given that behavioral responses may reflect a late attention-modulated auditory processing stage, while neurophysiological responses can represent an earlier, pre-attentive stage of brainstem (Xu et al., 2006b) and cerebral cortical processing of pitch (Chandrasekaran et al., 2007). Importantly, non-native listeners may show an MMN for contrasts they cannot discriminate in behavioral tasks (Kraus et al., 1995b; Näätänen et al., 2007; Lipski et al., 2012), which may also apply to non-native tone contrasts. Some recent neurophysiological studies suggest that listeners' developmental trajectory for pitch processing depends on neural maturation and the discriminability of tonal changes (Lee et al., 2012; Cheng et al., 2013; Peter et al., 2016). No neurophysiological study thus far has investigated the specific perceptual cue of pitch direction. The current study examines non-tone language listeners' tonal perception of pitch direction using EEG to investigate factors affecting non-native tone perception at an early perceptual level. The MMN has been used extensively to examine the perception of non-native speech contrasts, either for the purposes of second language learning or to examine the neural bases of acoustic-phonetic processing (for a review, see Näätänen et al., 2007), making it an excellent tool to examine early perceptual processing of non-native tonal contrasts. Furthermore, the MMN provides a more sensitive measure than behavioral data because it allows us to examine pre-attentive sensitivity (that is, not requiring overt attention or response) to contrasts that may not be perceived behaviorally (e.g., Kraus et al., 1995b). The MMN is a negative-going response seen particularly in the frontal electrodes and it indexes when a change occurs in a stream of auditory stimuli. 
For non-native speech perception, the MMN captures pre-attentional perception of infrequent stimuli and is used to test whether participants can perceive the difference between two stimuli that differ either acoustically or phonetically. It is obtained by subtracting the ERP response to a frequent, or standard, stimulus from the ERP response that occurs when there is a switch to an infrequent, or deviant, stimulus, and it occurs between 150 and 250 ms after the onset of the switch. The change from the standard to the deviant stimulus is responsible for the MMN response, and the MMN is elicited independently of attentional processes, so behavioral tasks are not needed to detect this waveform (Sams et al., 1984; Näätänen and Winkler, 1999; Näätänen, 2001).

In order to directly compare behavioral and pre-attentive results, we used a non-salient tonal contrast from previous behavioral experiments (**Figure 1**, contrast B, Liu and Kager, 2014, 2018; Liu et al., 2017). The two tonal tokens, derived from the level and falling tones in Mandarin Chinese, differed only in their slopes. Unlike previous studies, which typically tested tonal contrasts in one orientation (e.g., Kaan et al., 2008), we measured the contrasting sounds in both orientations in a passive oddball listening paradigm. That is, two orientations of change were examined in this contrast, with one sound serving as deviant in one condition and as standard in the other. Listeners may show different/asymmetrical perception when the standard and deviant switch places (Law et al., 2013), possibly due to the different acoustic salience of the two orientations. Although we predict that listeners may retain a certain degree of ability to perceive non-native tones acoustically, it remains unclear whether the different orientations of the standard and the deviant lead to changes in neural discrimination. Such discrimination patterns among non-tone language listeners may also further our understanding of second language speech processing and tonal language acquisition.

### MATERIALS AND METHODS

### Participants

The final sample consisted of 28 adults (20 females; M<sub>age</sub> = 22.19 years, SD<sub>age</sub> = 6.36, range = 18–48). Approximately half the participants were monolingual Australian English speakers (n = 13); the rest reported speaking at least one additional language (n = 15). However, all participants were naïve to tone or pitch accent languages. A handful of participants reported being musically trained (n = 5; ranging from 1 to 4 years) but none were still practicing music at the time of testing. Participants provided their written informed consent prior to participating and they received course credit or were reimbursed for their participation. Six further participants were tested but were excluded from analysis due to excessive artifacts in their EEG data (see EEG data recording and analysis below). The study protocol was approved by the Western Sydney University Human Research Ethics Committee.

### Stimuli

As a tone language, Mandarin Chinese has four tones (/ta/ high-level [T1] 'take,' middle-rising [T2] 'reach,' low-dipping [T3] 'beat,' high-falling [T4] 'big'). The exact contrast used in previous experiments (Liu and Kager, 2014) was also used in the current study. A pair of natural tokens of the syllable /ta/ bearing the Mandarin high-level [T1] and high-falling [T4] tones was produced by a female Mandarin speaker in a soundproof booth at the phonetics lab of Utrecht University in the Netherlands. Tokens were recorded using the open source computer program Audacity via a microphone (active speaker Genelec 1029A, sampling rate at 44,100 Hz). Tokens were equalized for intensity and duration using the computer program PRAAT (Boersma and Weenink, 2009). To avoid a ceiling effect due to the high acoustic salience of the T1–T4 contrast (Huang and Johnson, 2010; Sun and Huang, 2012), an acoustically contracted contrast was created from the T1–T4 tonal contrast by manipulating the F0 direction to reduce the acoustic salience of the contrast. Four interpolation points along the pitch contours (at 0, 33, 67, and 100%) were introduced. The F0 values occurring at 3/8 and 3/4 of the pitch distance of the original T1–T4 contrast were calculated at these interpolation points. Two new pitch contours were generated by linking these points. The contracted level-falling tonal contrast (**Figure 1**, contrast B) shares similar acoustic properties with the natural T1–T4 contrast (**Figure 1**, contrast A), except for featuring a narrower distance between the pitch contours, thus shrinking the perceptual distance between the two tokens. A previous categorical perception study reported that Chinese listeners showed a categorical boundary at the position of step 3 along an 8-step continuum from T1 (step 1) to T4 (step 8), the exact step where contracted T1 resides.
Meanwhile, non-tone-language (Dutch) listeners' categorical boundary was after step 4, falling in the middle of the continuum (Liu et al., 2017). The F0 excursions and semitone differences of the stimuli are listed in Supplementary Table A. Pitch duration was adjusted to 100 ms to fit the EEG experimental scheme. Perceivable differences may occur between phonetic categories during categorical perception with native listeners (Gandour, 1978; Wu and Lin, 2008). However, for non-native listeners, just noticeable acoustic differences may be sufficient for discrimination.
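The contraction step described above can be illustrated with a short sketch. It is not the PRAAT procedure that was actually used: the F0 values are hypothetical placeholders (the real values are listed in Supplementary Table A), the function name `contract` is introduced here for illustration only, and the sketch assumes that the fractions 3/8 and 3/4 are measured linearly in Hz from the T1 contour toward the T4 contour.

```python
def contract(t1, t4, fraction):
    """Return the contour lying `fraction` of the way from t1 to t4
    at each interpolation point (linear in Hz; an assumption)."""
    return [a + fraction * (b - a) for a, b in zip(t1, t4)]

# Hypothetical F0 values (Hz) at 0, 33, 67 and 100% of the contour.
t1_hz = [280.0, 280.0, 280.0, 280.0]   # high-level tone
t4_hz = [300.0, 270.0, 220.0, 180.0]   # high-falling tone

contracted_level = contract(t1_hz, t4_hz, 3 / 8)
contracted_falling = contract(t1_hz, t4_hz, 3 / 4)

# Distance between the two contours before and after contraction.
original_distance = [b - a for a, b in zip(t1_hz, t4_hz)]
contracted_distance = [
    f - l for l, f in zip(contracted_level, contracted_falling)
]
```

Under these assumptions, the distance between the contracted contours at every interpolation point is 3/4 − 3/8 = 3/8 of the original T1–T4 distance, which is one way of reading the narrower contrast B in **Figure 1**.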

### Procedure

Listeners were presented with a passive oddball paradigm, in which a frequently presented stimulus (the standard) is interspersed with infrequent presentations of another token (the deviant) (Näätänen, 2001; Näätänen et al., 2007). The current study contained two separate blocks: one in which the contracted level pitch was presented as the standard and the contracted falling pitch as the deviant (Dev-Falling), and the other in which the reverse happened (Dev-Level). Within each block, the probability of the standard was 0.80 and that of the deviant was 0.20. The stimuli were presented in a pseudorandom order such that at least three and no more than eight standard stimuli were presented between successive deviant stimuli. Each block started with 20 standards and contained a total of 500 trials, so that both blocks together comprised 1000 trials. The inter-stimulus interval was randomly varied between 600 and 700 ms. Together, the two blocks resulted in approximately 20 min of listening. After each oddball block, participants were presented with a control block in which they heard only the deviant stimulus from the preceding oddball block, repeated 100 times (which lasted approximately 1 min per deviant stimulus). This way, we were able to compare the responses to the same number of deviant stimuli in the oddball block (100) and in the control block (100). Participants were assigned to counterbalanced conditions in which they received either the Dev-Level block first (Dev-Level First) or the Dev-Falling block first (Dev-Falling First), to examine the influence of previously heard tokens on the second block.
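One way to generate a trial order that satisfies these constraints is sketched below. The actual presentation software and randomization routine are not specified in the text, so the function `make_oddball_sequence` and its parameter names are hypothetical. The sketch further assumes that the block ends on a deviant and that the 20 leading standards precede the first deviant directly.

```python
import random

def make_oddball_sequence(n_trials=500, p_deviant=0.20, n_lead=20,
                          min_gap=3, max_gap=8, seed=0):
    """Return a list of 'S' (standard) and 'D' (deviant) labels."""
    rng = random.Random(seed)
    n_dev = round(n_trials * p_deviant)      # 100 deviants
    n_std = n_trials - n_dev                 # 400 standards
    n_gaps = n_dev - 1                       # runs of standards between deviants
    # Standards left over once every gap holds the minimum number.
    spare = n_std - n_lead - min_gap * n_gaps
    if spare < 0 or spare > (max_gap - min_gap) * n_gaps:
        raise ValueError("constraints cannot be satisfied")
    gaps = [min_gap] * n_gaps
    # Hand out the spare standards one at a time to randomly chosen gaps.
    while spare > 0:
        i = rng.randrange(n_gaps)
        if gaps[i] < max_gap:
            gaps[i] += 1
            spare -= 1
    sequence = ["S"] * n_lead + ["D"]
    for gap in gaps:
        sequence += ["S"] * gap + ["D"]
    return sequence

sequence = make_oddball_sequence()
# Inter-stimulus intervals in ms, drawn uniformly between 600 and 700.
isi_rng = random.Random(1)
isis_ms = [isi_rng.uniform(600, 700) for _ in sequence]
```

With 400 standards, 20 of them leading, and 99 gaps between 100 deviants, the mean gap is about 3.8 standards, so most gaps sit near the lower bound of three under these assumptions.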

Participants were tested within a single session in sound-attenuated booths at The MARCS Institute for Brain, Behaviour and Development at Western Sydney University. They were instructed to avoid excessive movement. During presentation of the blocks, they watched a self-selected movie with subtitles. They were told that they would hear some sounds, which they should disregard, and that they should pay attention to the movie. The stimuli were presented binaurally via Etymotic earphones with the intensity kept at 70 dB SPL.

### EEG Data Recording and Analysis

Electroencephalogram (EEG) data were recorded from a 64-channel active BioSemi system, with Ag/AgCl electrodes placed according to the international 10/20 system and fitted to the participant's head size. Six external electrodes were used: on the right and left mastoids for offline reference, below and above the right eye, and on the left and right temples to record eye movements. The electrode offset was kept below 50 mV and the data were recorded at a 512 Hz sampling rate.

The pre-processing and analysis of the data were done using EEGLAB (Delorme and Makeig, 2004) and ERPLAB (Lopez-Calderon and Luck, 2014). The data were first re-referenced to the average of the right and left mastoids and were then bandpass filtered with half-power cut-offs at 0.1 and 30 Hz at 12 dB/octave. The data were epoched from −100 to 600 ms relative to stimulus onset and were baseline corrected by subtracting the mean voltage in the 100 ms pre-stimulus interval from each sample in the epoch. Independent component analysis (ICA) was done to identify and remove noisy EEG channels and eye-movement components based on activity power spectrum, scalp topography, and activity over trials. Noisy EEG channels that were removed were then interpolated using spherical spline interpolation. Artifact rejection was done automatically for anything above 70 µV on any channel. Participants with more than 40% of artifact-contaminated epochs were subsequently excluded from further analyses (n = 6). The epochs were then averaged separately for standards (excluding the first 20 standards and the standards immediately following a deviant stimulus), for each deviant token, and for each control block.
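The baseline correction, amplitude-based rejection, and participant exclusion rule can be sketched as follows. The analysis itself was carried out in EEGLAB and ERPLAB; this Python sketch only restates the logic. All function names are hypothetical, the epoch is treated as a single channel for brevity (the text applies the criterion to every channel), and the rejection threshold is passed in as a parameter expressed in the same unit as the data.

```python
def baseline_correct(epoch, n_baseline):
    """Subtract the mean of the first n_baseline samples (the
    pre-stimulus interval) from every sample in the epoch."""
    baseline = sum(epoch[:n_baseline]) / n_baseline
    return [v - baseline for v in epoch]

def is_artifact(epoch, threshold):
    """True if any sample exceeds the threshold in absolute value."""
    return any(abs(v) > threshold for v in epoch)

def exclude_participant(epochs, threshold, max_bad_fraction=0.40):
    """True if MORE than max_bad_fraction of the epochs are contaminated."""
    n_bad = sum(is_artifact(e, threshold) for e in epochs)
    return n_bad / len(epochs) > max_bad_fraction

# At 512 Hz, a 100 ms baseline spans about 51 samples.
n_baseline_samples = round(0.100 * 512)

# Toy data: two baseline samples followed by two post-stimulus samples.
corrected = baseline_correct([1.0, 3.0, 2.0, 6.0], n_baseline=2)

clean = [0.0, 10.0, -20.0]
noisy = [0.0, 95.0, -20.0]
excluded_60pct = exclude_participant([noisy] * 3 + [clean] * 2, threshold=70.0)
excluded_40pct = exclude_participant([noisy] * 2 + [clean] * 3, threshold=70.0)
```

Because the criterion is "more than 40%", a participant with exactly 40% contaminated epochs is retained in this sketch.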

Two difference waves were computed by subtracting the mean event-related potential (ERP) response to each control stimulus from the mean ERP response to its deviant counterpart. These difference waves were then grand-averaged across participants. In the grand-averaged waveform, we searched for a negative peak within the 100 to 250 ms time window after consonant production to ensure that we were measuring the response to the tone. This corresponds to the 120 to 270 ms time window post-stimulus onset, so that the consonant was not analyzed as part of the MMN response to the tone. We then centered a 40 ms time window at the peak and measured the mean amplitude in that window for each individual participant (e.g., Brandmeyer et al., 2012; Tuninetti et al., 2017). These individual mean amplitudes were our measure of MMN amplitude in further statistical analyses. Latency was measured for each participant by searching for the most negative peak within the same 40 ms window, which was defined on the grand-averaged waveform. These individual latencies were then used as the measure of MMN latency in subsequent statistical analyses.
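The measurement steps above can be summarized in a minimal sketch. It uses synthetic waveforms (a negative Gaussian-shaped deflection peaking at 200 ms) rather than real data, and all function names are hypothetical; the actual measurements were made with ERPLAB.

```python
import math

def difference_wave(deviant, control):
    """Deviant minus control ERP, sample by sample."""
    return [d - c for d, c in zip(deviant, control)]

def grand_average(waves):
    """Average a list of equally long waveforms across participants."""
    return [sum(col) / len(waves) for col in zip(*waves)]

def peak_index(wave, times, t_min, t_max):
    """Index of the most negative sample with t_min <= time <= t_max."""
    candidates = [i for i, t in enumerate(times) if t_min <= t <= t_max]
    return min(candidates, key=lambda i: wave[i])

def mean_amplitude(wave, times, center, half_width=20.0):
    """Mean voltage in a 40 ms window centered on `center` (ms)."""
    vals = [v for v, t in zip(wave, times)
            if center - half_width <= t <= center + half_width]
    return sum(vals) / len(vals)

# Synthetic example: time axis in ms, two participants.
times = [float(t) for t in range(-100, 601, 2)]

def synthetic_deviant(amplitude):
    return [-amplitude * math.exp(-((t - 200.0) / 30.0) ** 2) for t in times]

control = [0.0] * len(times)
waves = [difference_wave(synthetic_deviant(a), control) for a in (2.0, 4.0)]

# 1) Locate the peak on the grand average within 120-270 ms.
grand = grand_average(waves)
peak_time = times[peak_index(grand, times, 120.0, 270.0)]

# 2) Measure each participant in the 40 ms window around that peak.
amplitudes = [mean_amplitude(w, times, peak_time) for w in waves]
latencies = [times[peak_index(w, times, peak_time - 20.0, peak_time + 20.0)]
             for w in waves]
```

Defining the window on the grand average and then measuring each participant within it keeps the measurement window identical across participants, at the cost of constraining individual latencies to a 40 ms range.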

### RESULTS

Mismatch negativity amplitudes, latencies and locality were measured at nine channels (Fz, FCz, Cz, F3, F4, FC3, FC4, C3, C4) in line with previous studies (e.g., Colin et al., 2009; Tuninetti et al., 2017). These were analyzed in two separate mixed analyses of variance (ANOVAs) with a between-subject factor of Group (Dev-Level First, Dev-Falling First) and within-subject factors of Deviant (Dev-Level, Dev-Falling), anteriority [frontal (F), frontocentral (FC), central (C)], and laterality (left, middle, and right). Peak amplitude and latency may reflect different processing mechanisms, likely based on activating different neural populations (Horváth et al., 2008): the former indicates the robustness of listeners' discrimination as well as the acoustic/phonetic difference between the stimuli, while the latter reflects the time needed to process the difference between the standard and deviant stimuli (e.g., Cheour et al., 2002). Both are used as measures of auditory perceptual processing at early pre-attentive levels for native and non-native speech perception (e.g., Kraus et al., 1995a; Cheour et al., 2002). As the MMN tends to occur at frontal (F) and fronto-central (FC) sites, we expected to see increased MMN amplitude at those sites, suggesting that the auditory change between standard and deviant stimuli caused an involuntary attentional switch (Escera et al., 1998; Näätänen et al., 2007).

### MMN Mean Amplitude

**Figure 2** shows the grand-averaged MMN component recorded at Fz electrode (e.g., Näätänen et al., 2007; Horváth et al., 2008; Tuninetti et al., 2017) in response to two Deviant types—Dev-Level and Dev-Falling—for the two groups separately (Dev-Level First and Dev-Falling First).

We first determined whether participants showed MMN responses at the Fz electrode by comparing the MMN amplitude against zero for each test block by group. The results of the one-sample t-tests revealed a significant MMN only in the second test block, regardless of which deviant was tested (**Figure 3**, see Supplementary Table B for mean MMN amplitude by each electrode). Specifically, the Dev-Falling First group exhibited a significant MMN in the Dev-Level test block [t(13) = 4.133, p = 0.001, d = 1.10] but not in the Dev-Falling test block [t(13) = 1.571, p = 0.14, d = 0.42].

FIGURE 3 | Mean MMN amplitude for the two Groups (Dev-Level First and Dev-Falling First) by Block. The smaller dots represent individual data points. Error bars represent one standard error. Asterisks represent significant MMN amplitude.

Conversely, the Dev-Level First group exhibited a significant MMN in the Dev-Falling test block [t(13) = 2.39, p = 0.03, d = 0.64] but not in the Dev-Level test block [t(13) = 1.69, p = 0.11, d = 0.45].
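As a consistency check on the effect sizes reported for these one-sample t-tests, the sketch below recomputes Cohen's d from each t value. The text does not state how d was obtained; the sketch assumes the common one-sample conversion d = t / √n, with n = 14 per group as implied by the 13 degrees of freedom. The function name is hypothetical.

```python
import math

def cohens_d_from_t(t_value, n):
    """One-sample Cohen's d recovered from t (assumes d = t / sqrt(n))."""
    return t_value / math.sqrt(n)

# (reported t, reported d) pairs from the four one-sample tests above.
reported = [(4.133, 1.10), (1.571, 0.42), (2.39, 0.64), (1.69, 0.45)]

recomputed = [round(cohens_d_from_t(t, 14), 2) for t, _ in reported]
```

Under this assumption, the recomputed values match the reported effect sizes to two decimal places.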

A mixed ANOVA on the mean MMN amplitude yielded a main effect of Anteriority [F(2,52) = 4.00, p = 0.024, η<sup>2</sup><sub>g</sub> = 0.002], which is qualified by a significant Group × Anteriority interaction [F(2,52) = 3.70, p = 0.031, η<sup>2</sup><sub>g</sub> = 0.002; see **Figure 4**]. A post hoc Tukey test revealed that participants in the Dev-Falling First group showed a larger MMN amplitude than those in the Dev-Level First group in the frontal (F) electrode region (p = 0.024; Dev-Falling First: M = −2.53 µV, SD = 3.75 vs. Dev-Level First: M = −1.36 µV, SD = 2.83) but the two groups did not differ in the frontal-central (FC; p > 0.2; Dev-Falling First: M = −2.18 µV, SD = 3.30 vs. Dev-Level First: M = −1.60 µV, SD = 2.89) and central (C; p > 0.2; Dev-Falling First: M = −1.88 µV, SD = 3.12 vs. Dev-Level First: M = −1.34 µV, SD = 2.39) regions. This frontal locus is typical of MMN studies (Näätänen et al., 1997, 2007; Liu and Holt, 2011), and indicates an involuntary switch in attention caused by the auditory change, which is the basis for the MMN response. No other main effects or interactions reached significance.

### MMN Peak Latency

A mixed ANOVA on the mean MMN peak latency yielded a main effect of Deviant [F(1,26) = 389.83, p < 0.001, η<sup>2</sup><sub>g</sub> = 0.821] and a significant Group × Deviant interaction [F(1,26) = 21.96, p < 0.001, η<sup>2</sup><sub>g</sub> = 0.206; see **Figure 5**. See Supplementary Table C for mean MMN peak latency by each electrode]. A post hoc Tukey test revealed that for Dev-Level, participants in the Dev-Level First group showed an earlier peak than those in the Dev-Falling First group (p < 0.001, Dev-Level First group: M = 178.25 ms, SD = 16.49; Dev-Falling First group: M = 192.20 ms, SD = 16.30). For Dev-Falling, the reverse was true: participants in the Dev-Falling First group had an earlier peak than those in the Dev-Level First group (p < 0.001, Dev-Falling First group: M = 243.97 ms, SD = 15.82; Dev-Level First group: M = 262.25 ms, SD = 15.78). In other words, participants who heard a given deviant in their second test block tended to show a later peak than those who heard the same deviant in their first block. No other main effects or interactions reached significance.

## DISCUSSION

The current experiment examined whether listeners growing up in a non-tone language environment can discriminate tones with only pitch directional differences. Unlike most previous studies measuring non-native tone discrimination, we used neurophysiological measures, which are more sensitive to early pre-attentive responses than behavioral measures. This is particularly interesting for non-native speech perception as previous studies have shown that non-native listeners exhibit an MMN response for contrasts they did not discriminate in behavioral tasks (Kraus et al., 1995b; Näätänen et al., 2007; Lipski et al., 2012). Listeners' perception of our non-salient tonal contrast was tested in two orientations via a passive oddball listening paradigm, as the switch between the standard and deviant within the same contrast may lead to different acoustic salience and subsequently asymmetrical perception (Law et al., 2013). Results revealed that although non-native listeners were not able to discriminate the difficult tone contrast in the first presentation block, as their MMN amplitudes did not differ from zero, they appeared to learn to discriminate the tonal contrast within the duration of the experiment, as their MMN amplitudes differed significantly from zero in their second testing block. In addition, the overall MMN peak latency was earlier for Dev-Level than for Dev-Falling, and participants showed a later peak when a given deviant was tested in their second block than those who heard the same deviant in their first block. This may suggest a shift from acoustic to linguistic processing, the latter of which is arguably slower (Cheour et al., 2002; Horváth et al., 2008). Alternatively, the later peak in the second block may simply be due to the change of token orientation (the standard became deviant and vice versa), which may result in a processing cost.
Finally, listeners who did the Dev-Falling first exhibited larger MMN amplitudes in the frontal electrode region than those who did Dev-Level first, exhibiting an effect reminiscent of perceptual asymmetry, which suggests an interaction between contrast salience and learning. These three findings are further discussed below.

Our first finding, that participants did not show a significant MMN until the second block, indicates that listeners were able to perceive a non-native tonal contrast with low salience, yet not without effort. The lack of baseline discrimination in the first block indicates that listeners require exposure to achieve successful discrimination. Additionally, their sensitivity may be facilitated by the standard-deviant reversal between the two blocks, which may help them discover the acoustic difference between the two tokens. The directional difference in presentation across blocks, as well as the familiarization with the novel tonal information in the first block, enabled non-native listeners to learn the tonal contrast and resulted in neural discrimination in the second block. Our finding is in line with that of a previous neurophysiological study demonstrating that after lexical tone training, English speakers show increased activation in the left superior temporal gyrus and emergent activation in the right inferior frontal gyrus, which in turn shows learning effects among second language learners (Wang et al., 2003). However, because we did not find a significant interaction between blocks in the MMN amplitude, future studies are needed to address the type of learning (e.g., representational vs. acoustic-phonetic) occurring across blocks.

Behavioral data have demonstrated that non-tone learning infants are able to discriminate the same contrast around 18 months, and infants' tonal sensitivity is likely to be acoustic rather than linguistic (Liu and Kager, 2014, 2017a). Following previous studies, we predicted that listeners may retain a certain degree of acoustic perception of non-native tones. The current findings, however, suggest otherwise: listeners did not discriminate the contrast initially. They appeared to learn to distinguish the contrast on the fly during the experiment, with little evidence suggesting that their discrimination ability stemmed from prior or residual sensitivity to tones. It seems that listeners' prior sensitivity, if any, was not applied to the current difficult/non-salient contrasts, which is in line with infant perception studies showing that some non-native contrasts may not elicit mismatch responses in the 1st year of life (Rivera-Gaxiola et al., 2005). This further indicates a "use-it-or-lose-it" tendency when perceiving non-salient non-native contrasts from infancy through adulthood.

Phonetic learning has often been shown among studies testing listeners' ability to track frequency distributions across ages and over time (Maye et al., 2002, 2008; Escudero et al., 2011; Escudero and Williams, 2014; Ong et al., 2016, 2017; Liu and Kager, 2017c). Specifically, listeners' perception can be altered by the distributional information embedded in the ambient environment. The Second Language Linguistic Perception (L2LP) model (Escudero and Boersma, 2004; Escudero, 2005; van Leussen and Escudero, 2015) predicts that auditory mappings for new dimensions that are not utilized in listeners' native language (such as lexical tone to Australian English listeners) can be easily created, and L2 learners can learn via distributional learning. Our finding suggests that learning is possible with frequent repetitions of the target sounds to be discriminated. In other words, listeners can learn to discriminate a phonetic contrast merely through exposure to the specific target tokens instead of being trained on a pre-set statistical distribution. Such exposure may also be perceived as an extreme version of a bimodal Gaussian distribution with only the two peaks presented. Following the neural commitment theory (Kuhl et al., 2008) and the L2LP model, we hypothesize that rapid neural learning of a phonological distinction may be related to cumulative commitment of specific neural activation. Specifically, the first block paved the neural path for listeners who then showed more robust discrimination in the second block.

Moreover, research has shown that listeners can acquire statistical information of phonetic categories fairly rapidly, some in less than 3 min for certain foreign contrasts. However, longer exposure time is required to trigger learning for contrasts that are less salient (Yoshida et al., 2010). In our pre-attentive study, the overall effect of learning surfaced after 10 min of exposure (i.e., the time for each test block). As different pitch orientations yielded distinct learning effects in the second block, listeners' perceptual and learning ability for L2 speech sounds may be interpreted as a function of the type of contrast (e.g., intrinsic salience, perceptual assimilation) and degree (e.g., length) of exposure in the experiment. Furthermore, our proposal implies that listeners are able to abstract and retain memory of pitch directional cues albeit non-native. Listeners across ages appear to shift their acoustic/phonetic cue weighting and learning strategies in natural language learning environments (Escudero et al., 2009; Lany and Saffran, 2013; Tuninetti et al., 2015; Liu and Kager, 2017c). Our non-native listeners may have begun to weigh the pitch direction cue higher than other (e.g., segmental) cues, which could have guided them to successful perception and learning of our difficult tone contrast.

Listeners exhibited an earlier MMN peak latency for Dev-Level than for Dev-Falling: After exposure to one tone deviant, the second tone deviant was processed later relative to those who did the same test first. Since non-tone language listeners perceive tones psycho-physically, paying attention to pitch height, including onset and offset (Gandour and Harshman, 1978), the current finding may be caused by listeners' sensitivity to the most contrastive aspect of the deviant relative to the standard. Specifically, the level tone has both high pitch onset and offset whereas the falling tone has a high pitch onset and a low pitch offset. In the case of Dev-Level, the most contrastive aspect of the deviant is at its early portion since a relatively lower pitch offset of the falling tone standard is followed by a relatively higher pitch onset of the level tone deviant. Conversely, in the case of Dev-Falling, since the relatively low pitch offset of the falling tone deviant is followed by a relatively high pitch onset of the level tone standard, the most contrastive aspect of the deviant is at its later portion.

We also found that regardless of presentation order, listeners exhibited later peak latency in the second block, suggesting that their processing time was affected by the contrast encountered in the first block. We speculate that this may be caused by listeners' perceptual reorganization from faster acoustic processing to slower linguistic processing (assuming that ERP waveforms with later latency, such as P300 or N400, are typically associated with attentional, linguistic processing), thus reflecting learning and perceptual attunement. This seems to contradict previous studies that have shown decreased latency and increased amplitude after listeners are trained on (or have sufficient exposure to) non-native contrasts (Cheour et al., 2002; Horváth et al., 2008) implying that neural populations reacting to each stimulus respond faster to the change from standard to deviant after training. The discrepancy between our finding and those of previous studies may be task- and/or stimulus-driven. In our experiment, no training session was provided to participants and no MMN was observed in the first block, which suggests that listeners were using the same neuronal generators for both standard and deviant after limited exposure. In the second block, when the stimuli orientation order was switched, the same neuronal populations may have still responded to the same stimuli but gradually attuned to different acoustic parameters, leading to a reorganization of the response, and therefore, to a slower peak latency. The increase in latency may reflect that the standard and deviant are indeed two different stimuli that elicit separate responses and processing may gradually shift from more acoustic to more linguistic. If more blocks (e.g., a third block) had been provided, we might have seen a decrease in latency, reflecting more native-like L2 processing with more exposure.

Alternatively, the generally later peak latency in the second block might be due to listeners' listening strategy or to residual effects from the first block. For instance, the Dev-Level First group may have learned to discriminate the level tone deviant from the falling tone standard based on the pitch onset of the deviant in the first block. However, in the second block, when the falling tone became the deviant, participants who adopted the same strategy may have incurred some processing cost, because the strategy was no longer helpful: the pitch onset of the deviant is similar to the pitch onset of the standard, and the contrast lies in the later portion of the deviant.

While the current study cannot disentangle these two possible explanations, the peak latency interaction effect implies that listeners engaged in some form of learning, consistent with our interpretation of the MMN amplitude findings described above. The observed latency change may thus signal a change in processing and may act in tandem with amplitude changes as a convergent measure of sensitivity to the auditory change. Whether such change is driven by enhanced acoustic sensitivity or by linguistic processing requires further examination, including but not limited to longer exposure time or a training phase.

Our last finding was an asymmetry in MMN amplitude observed between Group and Block: the Dev-Falling First group showed larger MMN amplitude than the Dev-Level First group in the frontal region. As no MMN was elicited in the first block, it remained unclear if contrasts presented in different orientations were of equal salience. However, the presentation order, or the tonal directional changes across blocks, appears to induce such perceptual asymmetry. The processing differences in the second block were dependent on the type of contrast listeners were exposed to in the first block. The questions arise as to why listeners showed emergent directional asymmetrical perception and why there was a perceptual asymmetry in the presentation order of the directional change between the two tones.

Using similar pre-attentive paradigms, a number of previous studies attribute the asymmetrical MMN patterns induced by presentation order to the phonological level of speech processing, the decoding of physical sounds into linguistic percepts or phones, and the under-specification of phonological representations (Lahiri and Reetz, 2010; Cornell et al., 2013). Under-specification hypotheses are related to human abstract learning, which involves the mapping of phones onto abstract linguistic structure and the ways these linguistic units are represented in long-term memory. In terms of abstract representation, it appears that certain representations are stored in a more inclusive, flexible, and less feature-specific manner; that is, they are underspecified. When listening to a speech contrast, MMN responses are larger when the standard is specified than when it is underspecified (Shafer et al., 2004; Schluter et al., 2016, 2017). This explanation has been adopted to account for the asymmetric discrimination performance for consonants (Gaskell, 2003; Hestvik and Durvasula, 2016), vowels (Scharinger et al., 2012; De Jonge and Boersma, 2015), tone sandhi (Politzer-Ahles et al., 2016) and pitch height (Law et al., 2013). However, this explanation does not fit the present study well, as it remains unclear which of the two tones is (under-)specified for non-native listeners.

An alternative explanation comes from prototypicality theories: MMN responses are often larger when the standards are relatively prototypical members of their phonological category and the deviants are not (Ikeda et al., 2002). Prototypicality is also applicable to the situation when listeners perceptually assimilate non-native phonemes to native categories (e.g., Perceptual Assimilation Model, Best, 1994; Best and Tyler, 2007; L2LP, Escudero, 2005; Kriengwatana and Escudero, 2017). While the potential transfer from non-native tones to native prosodic categories remains a matter of debate, it is unclear whether the prototypicality explanation applies to listeners' perception of a non-native contrast with no evident corresponding native category, or which tone is more "typical" should such a corresponding category exist.

A third explanation is related to speech sound articulation discussed in the Natural Referent Vowel framework (Polka and Bohn, 2003, 2011): MMN responses are larger when the deviant is more articulatorily "peripheral" (e.g., tongue blade near the edges of the vowel space in speech production) than the standard (e.g., tongue blade near the center). This explanation is also unlikely, as non-tone language listeners should have no corresponding motor memory of tone. Under the same rationale, the observed perceptual asymmetry is unlikely to be attributable to any lexical effects (Shtyrov and Pulvermüller, 2002) or phonotactic probability differences (Bonte et al., 2005).

Our last explanation stems from studies originally designed to test the under-specification hypothesis. Both tone and non-tone language speakers show similar behavioral (Chen et al., 2015) and neural (Politzer-Ahles et al., 2016) asymmetrical patterns when discriminating the T2–T3 contrast in Mandarin Chinese, indicating that such perceptual asymmetry may not stem from phonological under-specification alone but may instead be acoustic or phonetic in origin. Similar traces surface in infancy, where 4-month-old Dutch and Japanese infants both present a coronal-labial perceptual asymmetry such that coronals are discriminated from labials but not vice versa (Tsuji et al., 2015). As it is unlikely that infants have formed a mature native phonology at this age, the asymmetry should be considered acoustic or phonetic rather than phonological. The cross-linguistic perceptual biases may be grounded in the acoustic-phonetic properties of the input and successively contribute to the phonological architecture during language acquisition (Polka and Bohn, 2011).

Crucially, perceptual biases may be determined by factors such as acoustic salience, which plays a significant role in speech perception from infancy through adulthood (Chandrasekaran et al., 2007, 2009). As listeners' perceptual and learning ability seems to be related to the type of pitch direction to which they are initially exposed, the emergent asymmetry may reside in the level of salience between the two directions. Specifically, Dev-Falling (i.e., a level tone as the standard and a falling tone as the deviant) may be perceptually less salient than Dev-Level (i.e., a falling tone as the standard and a level tone as the deviant) as the former may resemble a more natural sounding decline in speech also known as downdrift, or the tendency for pitch to decline gradually near the end of a narrative phrase (Lindau, 1986; Myers, 1999). Speakers often signal the topic closure by a pitch fall, and introduce a new topic by resetting the onset height to a high pitch (Wichmann, 2000), which is a phenomenon that has been categorized as a global or semi-global intonation feature (Cruttenden, 1997; Hirst and Di Cristo, 1998; Zerbian, 2010). This indicates that downdrift may be a general perceptual bias in natural speech perception and production and that it may be more difficult for listeners to detect a pitch contrast with a falling tendency than with a rising one. The Dev-Falling First group completed a relatively less salient direction of change in the first block, followed by an 'easier' direction of change (Dev-Level) in the second. Their increased performance compared to the Dev-Level First group may show that initial exposure to a difficult, less salient contrast may trigger enhanced perception or learning, possibly because listeners' acoustic sensitivity is heightened in the second block when facing the easier contrast. 
In addition, the MMN amplitude difference between deviant groups resides in the frontal region, suggesting that the pitch directional changes may have caught listeners' attention as the testing paradigm may function as an "involuntary attention switch" (Näätänen et al., 2007). Thus, the downdrift effect may lead to distinct acoustic salience between Dev-Level and Dev-Falling, resulting in a divergent degree of learning and asymmetrical processing.

We hypothesize that the perception of non-salient contrasts may become an exercise for listeners' ears and improve their overall perceptual sensitivity. Thus, a challenging information-processing environment may actually enhance learning. This has strong implications for language acquisition and specifically the establishment of phonological categories. Children exposed to a multilingual environment, for instance, have a more challenging task than their monolingual peers, with more sound categories to acquire in the same phonetic space (Kuhl et al., 2008). However, bilingual children have been shown to outperform monolinguals when detecting language changes (Kuipers and Thierry, 2012), perceiving native and non-native speech contrasts (Shafer et al., 2011; Petitto et al., 2012; Liu and Kager, 2016, 2017a), and learning words (Graf Estes and Hay, 2015; Singh, 2017), despite the fact that they may not receive as much language input. The bilingual advantage may thus be the result of a more challenging learning environment which leads to heightened sensitivity across domains (Liu and Kager, 2017b).

In sum, our results show that listeners are able to discriminate non-native tones after short exposure to target tonal tokens, with implications for L2 learning of tones. Specifically, after 10 min of exposure, non-tone language listeners demonstrated sensitivity to pitch direction, and this exposure was associated with neural changes in both MMN latency and amplitude. Perceptual learning of phonetic categories may occur simply through exposure to the given targets without distributional information, although distributional learning may further facilitate the learning trajectory and reduce the time required to successfully discriminate target tokens (Escudero, 2005). We also observed a residual effect from the first test block on the subsequent block in terms of peak latency, possibly due to a misapplication of a perceptual strategy from the first to the second test block. Finally, manipulating the presentation order of directional change induced a perceptual asymmetry across blocks. Although we leave the reasons for the asymmetry open, we hypothesize that its underlying cause is the differential acoustic salience between the two directional changes, and that it is thus likely to be acoustic rather than phonological. This novel finding leads to follow-up questions, such as whether listeners across ages and language backgrounds demonstrate the same propensity to show better responses under greater perceptual challenge, and whether the observed asymmetry is restricted to tones or extends to other (segmental) features. Overall, this study advances our understanding of the neural encoding of linguistic pitch, shedding light on non-native tone perception and phonological development.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations and approval of the Human Research Ethics Committee (HREC) of Western Sydney University (approval number: H11383). All participants gave written informed consent in accordance with the Declaration of Helsinki.

### AUTHOR CONTRIBUTIONS

LL and PE contributed to grant application, experimental design, and manuscript writing. JO contributed to experimental design, experimental testing, and manuscript writing. AT contributed to manuscript writing.

### FUNDING

This project was funded by a Transdisciplinary and Innovation Research Grant from the Australian Research Council (ARC) Centre of Excellence for the Dynamics of Language [CE140100041], which was awarded to LL. JO, AT, and PE's work and the publication of this research were also supported by the ARC Centre of Excellence for the Dynamics of Language.

### ACKNOWLEDGMENTS

We are grateful for the feedback from researchers of the School of Social Sciences and Psychology and the MARCS Institute for Brain, Behaviour and Development at Western Sydney University.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2018.00162/full#supplementary-material






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Liu, Ong, Tuninetti and Escudero. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Perception of Lexical Neutral Tone Among Adults and Infants

Shanshan Fan<sup>1,2</sup>\*, Aijun Li<sup>2</sup> and Ao Chen<sup>3,4</sup>

<sup>1</sup> School of Preparatory Education, Beijing Language and Culture University, Beijing, China, <sup>2</sup> Institute of Linguistics, Chinese Academy of Social Sciences, Beijing, China, <sup>3</sup> School of Communication Science, Beijing Language and Culture University, Beijing, China, <sup>4</sup> Utrecht Institute of Linguistics OTS, Utrecht University, Utrecht, Netherlands

Neutral tone (T0) is a special tone form in Mandarin that contains tonal and stress information. Compared with canonical tones, T0 has a much shorter duration and a reduced pitch contour. Its tonal contour is determined by the preceding canonical tone. However, not much is known about the perception of tonal and stress information in T0. In the current study, we investigate (1) whether T0 can be perceived as lexically unstressed by stress-language listeners; and (2) how Mandarin (tone language) and Dutch (stress language)-learning infants perceive T0. Three experiments were conducted. In Experiment 1, Dutch adults identified T0 as unstressed when presented with disyllabic sequences ending in T0. In Experiment 2, we used the visual fixation paradigm to test 4- to 6-month-old and 10- to 12-month-old Dutch and Mandarin infants on pseudoword discrimination (/pan1san4/ [high-level + high-falling] and /pan1san0/ [high-level + mid-falling]). T4 and T0 exhibit similar falling contours. The results show that (1) after being habituated to neutral tone sequences (/pan1san0/), Dutch infants discriminated the T1T0–T1T4 contrast; and (2) neither age group of Mandarin infants discriminated the tone contrast. Assuming Mandarin infants' lack of discrimination might be due to the similar F0 contours, we tested Mandarin infants in Experiment 3 using a more salient contrast, /pan1san2/ (high-level + mid-rising) and /pan1san0/. While no overall discrimination was observed, those who were habituated to /pan1san0/ demonstrated discrimination. Dutch infants' continued discrimination across ages suggests that they might process the neutral–canonical tone contrast as lexical stress rather than as tonal information. Overall, Mandarin infants' failure implies that the representation of T0 is not complete during their 1st year of life; the acquisition of tonal categories may therefore take longer than we expected.

Keywords: perceptual reorganization, tone acquisition, lexical neutral tone, lexical stress, cross-language comparison

#### Edited by:

Liquan Liu, Western Sydney University, Australia

#### Reviewed by:

Caicai Zhang, Hong Kong Polytechnic University, Hong Kong; Zhen Qin, Shanghai Jiao Tong University, China

\*Correspondence:

Shanshan Fan fanshanshan@blcu.edu.cn

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 30 August 2017; Accepted: 26 February 2018; Published: 23 March 2018

#### Citation:

Fan S, Li A and Chen A (2018) Perception of Lexical Neutral Tone Among Adults and Infants. Front. Psychol. 9:322. doi: 10.3389/fpsyg.2018.00322

### INTRODUCTION

Lexical tones are pitch variations that distinguish lexical meanings. Mandarin is the most widely studied tone language, in which four canonical tones are used to distinguish word meanings, namely T1 (high-level; 55 in Chao tone letters), T2 (mid-rising; 35), T3 (low-dipping; 21/214) and T4 (high-falling; 51). For example, the following words have different meanings based on canonical tones: /ma1/ (妈, mother), /ma2/ (麻, numb), /ma3/ (马, horse), and /ma4/ (骂, to scold). Besides the four canonical tones, Mandarin has a neutral tone (T0), which never occurs independently or at the beginning of a word. It is always preceded by a canonical tone. Neutral tone can distinguish word meanings, such as /tuŋ1ɕi1/ (东西, east and west) and /tuŋ1ɕi0/ (东西, things), and appears in different lexical and syntactic contexts, including reduplication, affixation, lexeme type, directional complements, complement particles, etc. With regard to lexeme type, words are distinguished solely by the presence of neutral tone without any other morphological or grammatical marker, as in the /tuŋ1ɕi1/ vs. /tuŋ1ɕi0/ pair above (Luo and Wang, 2002; Lin, 2012). In the present study, we focus on the lexeme type.

Neutral tone is acoustically light, with a shorter duration and a reduced pitch contour. It has been referred to as unstressed or weakly stressed in previous studies (Chao, 1979; Zhu, 2002; Lu and Wang, 2005; Wei, 2005; Duanmu, 2007; Cao, 2008; Deng, 2010; Jia, 2011; Bao and Lin, 2014). The tonal contour of T0 is determined by the preceding canonical tone. When preceded by T1, T2, or T4, the tonal contour of T0 is falling; when preceded by T3, the tonal contour is mid-level (Chao, 1979; Wu, 1992; Kong and Lü, 1998; Luo and Wang, 2002; Lin and Wang, 2013; Zhang and Li, 2016). Neutral tone has a lower pitch register and a narrower pitch range. Pitch patterns are shown in **Figure 1**, where the dashed lines denote sequences ending with a neutral tone. The duration of neutral tone is about 50% of its corresponding canonical tone (Lin and Yan, 1980; Lin, 1983; Lee, 2003) or about 60% of the preceding canonical tone (Cao, 1986; Li, 2017). In summary, neutral tone contrasts with canonical tone lexically because the neutral tone is unstressed and has a distinct pitch pattern. Neutral tone thus possesses properties of both lexical stress and lexical tone.

The acoustic correlates of neutral tone are duration, F0, intensity, and spectral features (i.e., vowel reduction, initial consonant voicing, and spectral tilt steepening). The main acoustic correlates of neutral tone are F0 and duration (Lin and Yan, 1980; Lin, 1983; Cao, 1986; Yang, 1989; Wang, 2004; Chen and Xu, 2006; Li and Fan, 2015), with F0 being more important than duration (Cao, 1986; Wang, 2004; Li and Fan, 2015; Li, 2017). Spectral tilt is a reliable cue, but it is less important than duration (Zhong et al., 2001). Intensity is not reliable (Lin and Yan, 1980; Lin, 1983). The same acoustic correlates are found for lexical stress in stress languages, with duration being the most reliable cue for lexical stress in Dutch (Sluijter and van Heuven, 1995, 1996; van Heuven and de Jonge, 2011).

Previous research has revealed inconsistencies regarding how infants perceive lexical tones and lexical stress early in life. Some studies found supportive evidence for the perceptual reorganization of lexical tones, which occurred around 9 months. For example, prior to 6 months, both tone- and non-tone-language infants can discriminate lexical tones. By around 9 months, non-tone-language infants' sensitivity to lexical tones declines, whereas no such decline is observed among tone-language infants (Mattock and Burnham, 2006; Mattock et al., 2008). Some other studies, however, reported different results. For instance, in Liu and Kager (2014), 5- to 18-month-old Dutch infants showed continuous discrimination of the Mandarin T1–T4 contrast. But when the phonetic distance between T1 and T4 was reduced, the infants no longer demonstrated discrimination. In Chen and Kager (2016), 4-month-old Dutch infants failed to discriminate a non-salient Mandarin tonal contrast (T2–T3), yet 6- and 12-month-old infants succeeded. Infants may not be born with the ability to discriminate all the native contrasts and may especially need time to learn phonetically non-salient contrasts (Sundara et al., 2006; Narayan et al., 2010). For lexical tones, Shi (2010) discovered that Mandarin infants were only able to categorize phonetically variable lexical tones gradually after 8 months. In Tsao (2008), 12-month-old Mandarin infants discriminated T1–T3 better than T2–T3/T2–T4 contrasts. Taken together, early discrimination of lexical tones appears to exhibit a complex developmental pattern, where successful discrimination might relate to the phonetic salience of particular tonal contrasts.

In terms of lexical stress, in studies supporting perceptual reorganization, infants' stress perception appears to shift from universal discrimination to their native language at 9 months of age (Sansavini et al., 1997; van Ooijen et al., 1997; Hohle et al., 2009; Skoruppa et al., 2009, 2013). For example, newborn French infants could discriminate stress-initial and stress-final words (Sansavini et al., 1997), while 9-month-old French infants failed to discriminate stress contrast at a phonological level. Hence, French infants adapted their stress perception to their native language by 9 months. Nine-month-old Spanish infants, whose native language has contrastive lexical stress, demonstrated discrimination (Skoruppa et al., 2009). In some other studies, however, the discrimination of contrastive lexical stress requires sufficient exposure to ambient input (Weber et al., 2004; Keij and Kager, 2013; Butler et al., 2015). For instance, 5-month-old German infants could discriminate between stress-initial and stress-final pseudowords, yet 4-month-old German infants could not (Weber et al., 2004). In summary, attunement seems flexible in early language perception. It might be modulated by ambient language input for lexical tone and lexical stress. For lexical tone, participants' discrimination could be related to the acoustic salience of particular stimuli.

Besides acoustic salience, the order of stimulus presentation may influence discrimination as well. Perceptual asymmetry was found in previous studies on the discrimination of both segments (Polka and Werker, 1994; Polka and Bohn, 1996, 2003) and suprasegments (Weber et al., 2004, 2005; Tsao, 2008; Chen, 2013; Segal et al., 2016). In Segal et al. (2016), when discriminating between initial and final lexical stress, Hebrew infants showed better discrimination when presented first with the less common initial-stress pattern. German-learning infants also showed a similar perceptual asymmetry when perceiving lexical stress, namely that change detection was easier for infants when a trochee, the predominant stress pattern, was embedded among iambs rather than the other way around (Weber et al., 2004). For early perception of lexical tones, Mandarin infants discriminated the T1–T3 contrast better if they were presented with T1 first than the other way around (Tsao, 2008). The mechanism underlying such asymmetry is not fully understood, yet it may be related to statistical distribution in the input. When habituated to an atypical pattern in ambient input, infants may consolidate such a pattern in representation and subsequently discriminate the frequent pattern in the input from the infrequent one. Yet if infants are habituated to the frequent pattern in the input, they might perceive the infrequent pattern as a non-prototypical realization of the frequent one.

The statistical distribution of particular phonological features in the input influences infants' perceptions of such features. Scholars have largely agreed that infants are sensitive to statistical distribution in speech input (e.g., Saffran et al., 1996; Maye et al., 2002). Infants prefer predominant patterns to which they are exposed in their native language, and such preferences are established with accumulating exposure (Jusczyk et al., 1993). In the current study, we compared stress-language (Dutch) and tone-language (Mandarin) infants on their discrimination of canonical and neutral tones. Because neutral tone carries lexical stress and tonal information, it serves as a feasible means to investigate early attunement to lexical tone or stress as the result of ambient input. We posed the following questions in the current study: (1) whether Mandarin infants can discriminate between neutral and canonical tones, and whether such discrimination is influenced by the acoustic salience of the tones; and (2) whether Dutch listeners perceive neutral tone as tonal or as lexical stress, and whether perceptual reorganization can be observed for neutral tone. We began by testing whether tone- and stress-language-speaking adults perceived neutral tone as unstressed, which served as a baseline for the subsequent infant experiments. Next, we tested 4- to 6-month-old and 10- to 12-month-old Dutch and Mandarin infants on their discrimination of the Mandarin canonical–neutral tone contrast. If Dutch infants perceived the canonical–neutral tone contrast as lexical stress, we would expect successful discrimination at both ages; on the other hand, if they perceived it as tonal, discrimination may only be successful for the younger group. For Mandarin infants, we expected them to be capable of discriminating the contrasts at both ages. However, considering that sequences with neutral tone occur less frequently than those involving canonical tones, it may take time for Mandarin infants to learn these contrasts. In this case, we would expect only the 10- to 12-month-old Mandarin infants to discriminate the contrasts.

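**FIGURE 1 |** Pitch patterns of disyllabic sequences; dashed lines denote sequences ending with a neutral tone. The vertical axis is the z-score; the horizontal axis is the normalized duration (Li and Gao, 2017).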

### EXPERIMENT 1: ADULTS' PERCEPTIONS OF NEUTRAL TONE

To understand whether Dutch adult listeners perceive neutral tone as unstressed, a discrimination task and an identification task were conducted in Experiment 1. In the discrimination task, participants were required to discriminate disyllabic sequences ending in a neutral tone from those ending in a canonical tone. If Dutch adult listeners perceived neutral tone as unstressed, they would discriminate the canonical–neutral tone contrast successfully. In the identification task, participants were required to identify the position of stress in the disyllabic sequences. Because duration is the most reliable cue for lexical stress in Dutch, and neutral tone exhibits a shorter duration compared with canonical tones, we predicted that Dutch adult listeners would identify the neutral tone as unstressed. For Mandarin listeners, given that T0 is a category in their native phonology, we assumed they would succeed in the discrimination task and thus be able to identify the neutral tone as unstressed.

### The Discrimination Task

#### Stimuli

The pseudoword /pansan/ was selected as the tone-bearing sequence, which is a well-formed sequence phonotactically in Mandarin and Dutch. All possible tone combinations were included except T3T3, which is always produced as T2T3 due to the Mandarin tone sandhi process. In total, 19 target pseudowords were obtained, including 15 disyllabics ending with a canonical tone (4 × 4 − 1 = 15) and 4 disyllabics ending with a neutral tone (TnT0; n = 1, 2, 3, or 4). Another 20 tonal pairs of real words in Mandarin were added as fillers, which carried the same segments but different canonical tones, such as /ʂʅ2tʂʰaŋ2/ (时长, duration) vs. /ʂʅ4tʂʰaŋ3/ (市场, market).
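The stimulus count can be checked with a short enumeration. The following is a minimal sketch only: the `pan{t}san{t}` labels and variable names are shorthand assumed here for illustration and are not the authors' stimulus files.

```python
from itertools import product

CANONICAL = [1, 2, 3, 4]  # T1-T4; the neutral tone (T0) is added separately

# All canonical + canonical combinations except T3T3, which is excluded
# because tone sandhi turns it into T2T3.
canonical_final = [
    f"pan{a}san{b}"
    for a, b in product(CANONICAL, repeat=2)
    if (a, b) != (3, 3)
]

# Each canonical first syllable followed by a neutral-tone second syllable.
neutral_final = [f"pan{a}san0" for a in CANONICAL]

targets = canonical_final + neutral_final
print(len(canonical_final), len(neutral_final), len(targets))  # 15 4 19
```

Enumerating the set rather than hard-coding it makes the T3T3 exclusion explicit and confirms the totals of 15, 4, and 19 reported in the text.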

All stimuli were produced by a 35-year-old male native Mandarin speaker. The speaker was born and raised in Beijing. No disorder was reported related to reading, speaking, or listening. Nineteen pseudowords were recorded along with 40 filler words in the soundproof room of the phonetics lab at the Chinese Academy of Social Sciences (CASS) using Cool Edit Pro 2.0 at a sample rate of 44,100 Hz.

#### Participants

Eighteen Mandarin adult listeners were tested, 10 males and 8 females, with an average age of 20.8 years (SD = 1.9). Another participant took part in the test but was excluded due to equipment failure. All participants were born and raised in Beijing, without reported hearing or speech disorders.

Eighteen Dutch adult listeners were tested, 6 males and 12 females, with an average age of 23.7 years (SD = 4.7). They were born and raised in the Netherlands. None of the participants had been exposed to any tone language, and no hearing or speech disorders were reported.

#### Procedures

The AX paradigm was adopted. Participants were presented with pairs of stimuli and required to indicate whether the two stimuli were the same or different. The series consisted of 30 pairs of different stimuli (AX or XA) and 19 pairs of identical stimuli (AA or XX). For each different pair, the comparison was only conducted between a sequence ending in a canonical tone and its corresponding neutral tone form. Taking /pan1san1/ as an example, its neutral tone form was "/pan1san0/". The different pairs were "/pan1san0/ vs. /pan1san1/" and "/pan1san1/ vs. /pan1san0/", and the identical pairs were "/pan1san1/ vs. /pan1san1/" and "/pan1san0/ vs. /pan1san0/". Another 80 pairs of fillers included different pairs such as "/ʂʅ2tʂʰaŋ2/ (时长, duration) vs. /ʂʅ4tʂʰaŋ3/ (市场, market)" and identical pairs such as "/ʂʅ2tʂʰaŋ2/ (时长, duration) vs. /ʂʅ2tʂʰaŋ2/ (时长, duration)."
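The composition of the target trial list can be reconstructed from this description. The sketch below assumes that the "corresponding neutral tone form" of a sequence keeps the first-syllable tone and replaces the second tone with T0 (as in the /pan1san1/ example); the labels and helper name are hypothetical and fillers are omitted.

```python
from itertools import product

CANONICAL = [1, 2, 3, 4]
# Canonical-final tone combinations; T3T3 is excluded (tone sandhi).
canonical_final = [
    (a, b) for a, b in product(CANONICAL, repeat=2) if (a, b) != (3, 3)
]


def label(first, second):
    """Hypothetical shorthand for a /pansan/ token with two tone digits."""
    return f"pan{first}san{second}"


# Different pairs: each canonical-final sequence against its neutral-tone
# counterpart, in both presentation orders.
different_pairs = []
for first, second in canonical_final:
    canonical = label(first, second)
    neutral = label(first, 0)  # assumption: same first tone, T0 second
    different_pairs.append((neutral, canonical))
    different_pairs.append((canonical, neutral))

# Identical pairs: every target item paired with itself.
items = [label(a, b) for a, b in canonical_final]
items += [label(a, 0) for a in CANONICAL]
identical_pairs = [(item, item) for item in items]

print(len(different_pairs), len(identical_pairs))  # 30 19
```

Under these assumptions the counts match the 30 different and 19 identical target pairs reported above.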

A practice phase preceded the experiment. Seven pairs of stimuli were used to familiarize participants with the procedure. Each trial started with a fixation cross, followed by two audio stimuli with an inter-stimulus interval of 200 ms. When the audio stimuli concluded, two buttons were shown on the screen, labeled as "Same (F)" and "Different (J)." Participants provided their response by pressing either "F (Same)" or "J (Different)" on the keyboard. The next trial started automatically after the participant had responded. The inter-trial interval was 500 ms. ZEP was used to control the procedures, randomize stimuli, and collect participants' responses (Veenker, 2013).

#### Results

The accuracy rate was calculated by dividing the number of correct responses by the total number of trials for each participant. For identical pairs, the accuracy rate was 93% (SD = 1.24) for Mandarin listeners and 97.1% (SD = 0.55) for Dutch listeners. For different pairs, the accuracy rate was 91.7% (SD = 0.74) for Mandarin listeners and 90.5% (SD = 0.99) for Dutch listeners.

**Figure 2** illustrates the accuracy rates of Mandarin and Dutch adult listeners. To better understand participants' sensitivity to the canonical–neutral contrast, d-prime (d′) was calculated. An independent-samples t-test was conducted on d-prime with language group as the independent variable. No difference was found between Dutch and Mandarin adult listeners [t(34) = −0.57, p > 0.05].
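The text does not state how d′ was derived from the same/different responses. One common approximation, shown here purely as an illustration, is d′ = z(H) − z(F), where H is the proportion of "different" responses on different pairs and F is the proportion of "different" responses on identical pairs. The log-linear correction and the response counts in the sketch are assumptions, not values from the study.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Approximate d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count) keeps both rates
    away from 0 and 1, where the z-transform is undefined.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical listener: 30 different pairs, 19 identical pairs.
example = d_prime(hits=27, misses=3, false_alarms=1, correct_rejections=18)
print(round(example, 2))
```

With 18 listeners in each language group, an independent-samples t-test on per-listener d′ values has 18 + 18 − 2 = 34 degrees of freedom, which agrees with the reported t(34).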

Both Dutch and Mandarin adult listeners could discriminate neutral and canonical tones. To further investigate whether they perceived the neutral tone as unstressed, we conducted the following identification task.

### The Identification Task

#### Stimuli

The stimuli in the discrimination task were used in the identification task. Participants were asked to indicate the stress position in disyllabic sequences ending in a neutral tone as well as those ending in a canonical tone.

To ensure that the Dutch participants understood the task, we used Dutch lexical stress minimal pairs as practice stimuli. We selected three minimal pairs in Dutch: "kaNON – KAnon," "voorNAAM – VOORnaam," and "SERvisch – serVIES" (capital letters denote the stressed syllables) (Cutler and van Donselaar, 2001). A female Dutch native speaker produced all minimal pairs by reading each pair twice. All recordings were completed in the phonetics lab at Utrecht University using Audacity at a sample rate of 32,000 Hz. Five native Dutch-speaking phoneticians selected the most naturally produced pairs for the identification task.

#### Participants

Another 15 Mandarin listeners different from those in the discrimination task were tested, 5 males and 10 females (mean age = 21.5 years; SD = 0.58). All Mandarin listeners were born and raised in Beijing, and no hearing disorders were reported. Another two participants were tested but excluded due to dyslexia (N = 1) and a reported background in phonetics (N = 1).

Another 14 Dutch listeners were tested, 6 males and 8 females (mean age = 25.9 years; SD = 8.4). All participants were born and raised in the Netherlands and reported no exposure to tone languages. No hearing disorders were reported. Another 6 Dutch listeners were tested but excluded for failing to identify the lexical stress for Dutch stress minimal pairs.

#### Procedures

A forced-choice procedure was adopted. The experiment was preceded by a practice phase in which 12 trials were used to familiarize the participants with the task. In each trial, Dutch listeners heard one word of the Dutch stress minimal pairs, such as "kaNON." The participants were required to identify the position of lexical stress. They were asked to give their response by clicking one of the buttons labeled "Initial (Strong-weak, Sw)", "Equal", or "Final (weak-Strong, wS)". Each word of the Dutch stress minimal pairs was repeated twice. Mandarin listeners heard 12 trials. They were presented with Mandarin disyllabic sequences, such as "/ʂʅ4tʂʰaŋ3/ (市场, market)," and were required to indicate whether the word had initial, equal, or final stress by clicking the corresponding button.

Each trial in the test phase began with a fixation cross, after which an audio stimulus was presented. Participants were required to give their responses by clicking one of the buttons labeled "Initial," "Equal," or "Final." Another two buttons were below these, labeled "Repeat" and "Next." Participants were allowed to listen to the stimulus again by clicking "Repeat." By clicking the "Next" button, participants submitted their options and activated the next trial. ZEP was used to control the procedures, randomize stimuli, and collect participants' responses (Veenker, 2013).

#### Results

The accuracy rate in the practice phase was calculated for Dutch listeners. Only data from participants with accuracy rates over 80% in the practice phase were submitted for further analysis, including 6 males and 8 females (mean age = 25.9 years; SD = 8.4).

For each option, we calculated selection percentages under different tonal conditions, including neutral tone, canonical tone, and each individual tonal combination. Taking "Initial" as an example, the percentage was calculated by dividing the number of responses indicating "Initial" by the total number of responses. A Chi-square test was conducted to assess whether participants' responses depended on their language background when presented with disyllabic sequences ending in a neutral tone. The results were not significant, χ²(2) = 0.63, p > 0.05. Hence, when presented with disyllabic sequences ending in a neutral tone, no difference appeared between Mandarin and Dutch listeners. Both groups predominantly selected "Initial" for sequences ending in a neutral tone. However, for sequences with two canonical tones, a significant relationship emerged between participants' responses and language background [χ²(2) = 24.15, p < 0.01, ϕ = 0.24]. Mandarin listeners tended to identify stimuli as "Equal" (56.9%), while 23.8% of Dutch listeners selected "Initial," 35.2% chose "Equal," and 41% chose "Final." Participants' responses across the three options are listed in **Table 1**.
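The test of independence described above can be sketched as follows. The response counts are hypothetical placeholders, not the study's data, and the function names are introduced here for illustration only. The closed-form p-value is exact only for 2 degrees of freedom, which is the case for a 2 (language group) × 3 (response option) table.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def p_value_df2(chi2):
    """Upper-tail p-value of a chi-square statistic; valid only for df = 2."""
    return math.exp(-chi2 / 2)


# Hypothetical counts of "Initial", "Equal", "Final" responses per group.
counts = [
    [20, 60, 25],  # group A
    [25, 37, 43],  # group B
]
chi2, df = chi_square_independence(counts)
p = p_value_df2(chi2)
```

Applied to the reported statistics, the same closed form gives p ≈ 0.73 for χ²(2) = 0.63 and p < 0.001 for χ²(2) = 24.15, in line with the thresholds reported in the text.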

For particular tonal combinations ending in a neutral tone, 80% of Mandarin listeners identified T1T0 as "Sw (Initial)." Percentages for other sequences involving neutral tone were T2T0 (80%), T3T0 (86.7%), and T4T0 (93.3%). Seventy-eight percent of Dutch listeners identified T1T0 as "Sw." The percentages for other sequences involving neutral tone were T2T0 (64.3%), T3T0 (78.6%), and T4T0 (100%). All percentages are plotted in **Figure 3**.

To select the tonal contrast for the subsequent infant experiment, we also analyzed participants' identification of disyllabic sequences with two canonical tones. The ending T4 was predominantly perceived as stressed. For Mandarin listeners, the percentages selecting "wS (final)" were 80% for T1T4, 33.3% (T2T4), 73.3% (T3T4), and 20% (T4T4). For Dutch listeners, the percentages selecting "wS (final)" were 50% for T1T4, 64.3% (T2T4), 92.9% (T3T4), and 35.7% (T4T4) (see **Figure 4**). Except for tonal combinations ending with T4, Mandarin listeners tended to perceive disyllabic sequences with two canonical tones as being of "equal weight."

### Discussion

Dutch (stress language) adult listeners could discriminate disyllabic sequences ending in a neutral tone from those ending in a canonical tone. In addition, Mandarin and Dutch listeners identified a neutral tone as unstressed. Overall, Mandarin listeners tended to perceive sequences with two canonical tones as having equal stress, consistent with the fact that Mandarin generally lacks word stress.

Because neutral tone contains simultaneous tonal and stress information, the following notions warrant investigation: (1) how infants process neutral tone, and whether younger infants (4- to 6-month-olds) and older infants (10- to 12-month-olds) respond differently; and (2) whether ambient language input influences infants' perceptions. Specifically, we asked the following questions in Experiment 2: Can Mandarin- and Dutch-learning infants discriminate neutral tones from canonical tones? Do stress-language-learning infants perceive neutral tone as unstressed? Given that 10- to 12-month-old infants are attuned to their native language, would 4- to 6-month-old infants respond differently from 10- to 12-month-olds? To this end, we tested 4- to 6-month-old and 10- to 12-month-old Dutch and Mandarin infants on their discrimination of disyllabic sequences that ended in either a canonical (T4) or a neutral tone (T0). Since neutral tone is a native phonological category for Mandarin-learning infants, we predicted they would be able to discriminate between neutral tone and canonical tone throughout the 1st year of life. No difference was expected between 4- to 6-month-old and 10- to 12-month-old Mandarin infants. Given that Dutch adults were able to identify neutral tone as unstressed, if Dutch infants processed neutral tone as lexical stress, they would show continuous discrimination throughout the 1st year of life. Thus, no difference was expected between 4- to 6-month-old and 10- to 12-month-old Dutch infants. However, if they processed the difference between T1T4 and T1T0 as tonal information instead of lexical stress, 10- to 12-month-old Dutch infants would fail to distinguish the tonal contrast, while 4- to 6-month-old Dutch infants were expected to discriminate the contrast successfully.

TABLE 1 | The selection percentage of "Initial," "Equal," and "Final" in neutral/canonical tone combinations for Dutch and Mandarin listeners.

### EXPERIMENT 2: INFANTS' DISCRIMINATION OF CANONICAL TONE AND NEUTRAL TONE

### Stimuli

The pseudoword /pansan/ was used as the stimulus. In the above discrimination and identification tasks, Dutch listeners identified T1T4 as "wS (final)" and T1T0 as "Sw (initial)," respectively. Hence, we selected these two sequences as stimuli for the infant experiment.

A 32-year-old female Mandarin native speaker produced neutral tone (/pan1san0/) and canonical tone (/pan1san4/) in infant-directed speech (IDS). Each stimulus was produced 20 times. Recordings were completed in the soundproof room of the phonetics lab at CASS using Cool Edit Pro 2.0 at a sample rate of 16,000 Hz. Another five Mandarin native speakers judged the naturalness of recordings on a continuum from 1 (extremely unnatural) to 5 (very natural). Two phoneticians selected the six most natural tokens for T1T0 (/pan1san0/) and six most natural tokens for T1T4 (/pan1san4/).

For the canonical tone sequence /pan1san4/, the average duration of the first syllable was 259.7 ms (SD = 10.8), and the average duration of the second syllable was 316.8 ms (SD = 23.4). For the neutral tone sequence /pan1san0/, the first syllable was 269.2 ms on average (SD = 11.8), and the second syllable was 216 ms on average (SD = 41.8). For F0 contour, 10 F0 points were extracted along the F0 contour using Praat (Boersma and Weenink, 2013). For T1T4 combinations, the maximal F0 value of T4 was 307.4 Hz, and its minimal F0 value was 234.3 Hz with a range of 136.08 Hz. For T1T0 combinations, the maximal F0 value of T0 was 303.9 Hz, and its minimal F0 value was 218.5 Hz with a range of 85.4 Hz. The averaged F0 contours are shown in **Figure 5**, where T4 (high-falling) and T0 (mid-falling) exhibited similar falling contours.

### Participants

Fifty-two Dutch-learning infants were tested: 23 were between 4–6 months old (mean age = 4;18, SD = 0.7, 11 males and 12 females), and 29 were between 10–12 months old (mean age = 11;3, SD = 0.8, 19 males and 10 females). Another 20 infants were tested but excluded due to fussiness (N = 6), parental intervention (N = 1), and not being habituated (N = 13). All Dutch-learning infants were born and raised in Dutch-speaking families where Dutch was the only language in use. All parents reported normal hearing of the infants. Dutch-learning infants were tested in the infant lab at Utrecht University (UU).

Among the 24 Mandarin-learning infants tested, 8 were between 4–6 months old (mean age = 5;9, SD = 0.9, 4 males and 4 females), and 16 were between 10–12 months old (mean age = 11;21, SD = 1.1, 6 males and 10 females). Another 16 infants were tested but excluded due to fussiness (N = 5), parental intervention (N = 1), not being habituated (N = 5), equipment failure (N = 3), dialect interference in the input (N = 1), and being a preterm infant (N = 1). All Mandarin-learning infants were born and raised in Mandarin-speaking families where Mandarin was the only language in use. All parents reported normal hearing of the infants. Mandarin-learning infants were tested in the infant lab at CASS, Beijing.

### Procedures

A visual fixation procedure was adopted. During the experiment, a parent sat in a chair in the test cabin listening to music played through headphones to prevent possible intervention. The infant sat on his/her parent's lap, facing the screen at the front of the test cabin. The screen was one meter away from the infant, and the visual stimuli were played on the screen during the experiment. Two loudspeakers were situated on both sides of the cabin along with a hidden video camera above the screen. The camera was connected to a screen on the control desk, which was used to observe infants' responses to stimuli in real time. The control desk was in a separate room next to the test cabin.

There were four phases: pre-test, habituation, test, and post-test. The pre- and post-test were used to test infants' general attention. In the habituation phase, infants were habituated to either canonical tone (T1T4, /pan1san4/) or neutral tone (T1T0, /pan1san0/). In the test phase, canonical tone and neutral tone sequences alternated between trials. In the habituation phase, three tokens from each category were used to habituate Mandarin- and Dutch-learning infants. Chen and Kager (2016) reported that 4-month-old Dutch infants could not normalize multiple tokens of tonal contrast. Thus, only one additional token was used in the test phase for Dutch infants, whereas another three tokens were used in the test phase for Mandarin infants.

Each trial started with an attention-getter. Once the infant looked at the screen, the attention-getter faded out, and the visual stimuli and audio stimuli were played. The infant's looking time and non-looking time were recorded by the experimenter on the control computer. When the average looking time of three consecutive trials was shorter than 50% of the average looking time of the first three trials, the habituation criterion was met, and the test phase started automatically. The habituation phase had a maximum of 16 trials.
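The habituation criterion described above can be sketched as a sliding-window check. This is an illustration under stated assumptions, not the software used in the study: the function name is hypothetical, and it is assumed here that the comparison window may not overlap the three baseline trials, so the earliest trial at which the criterion can be met is the sixth.

```python
def habituation_trial(looking_times, window=3, ratio=0.5, max_trials=16):
    """Return (last habituation trial, criterion_met) for one infant.

    Criterion: the mean looking time over `window` consecutive trials falls
    below `ratio` times the mean of the first `window` trials.
    Assumption: the comparison window does not overlap the baseline window.
    Trials are numbered from 1; at most `max_trials` trials are considered.
    """
    times = looking_times[:max_trials]
    if len(times) < window:
        return len(times), False
    baseline = sum(times[:window]) / window
    for end in range(2 * window, len(times) + 1):
        recent = sum(times[end - window:end]) / window
        if recent < ratio * baseline:
            return end, True
    return len(times), False


# Hypothetical looking times in seconds for one infant.
trial, met = habituation_trial([10, 9, 11, 8, 6, 5, 4, 4, 3])
```

With these invented values the baseline mean is 10 s, and the mean of trials 6 to 8 is the first to drop below 5 s, so the test phase would begin after trial 8. An infant who never meets the criterion simply reaches the 16-trial ceiling.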

The test phase consisted of four trials, two of which were identical to the habituated tone sequences (same trials). The other two trials were the tone sequences that were not used in the habituation phase (novel trials). The same trials and novel trials alternated. Infants' looking times during the same trials and novel trials were recorded by the control computer. If they were able to discriminate the tonal sequences, then their looking time during the novel trials would presumably be longer than during the same trials.

### Results

To correct for skewness, the raw looking time was logarithmically transformed. Dutch- and Mandarin-learning infants were divided into two age groups: 4- to 6-month-olds and 10- to 12-month-olds. For each age group, the log-transformed looking times (LogLT) of the same trials and of the novel trials were compared.
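A minimal sketch of the transformation and the same-versus-novel comparison follows. The base of the logarithm and the unit of the raw looking times are not stated in the text; base 10 and milliseconds are assumed here. The differences are taken as same minus novel, so a negative t indicates longer looking on novel trials. The data are invented for illustration.

```python
import math
from statistics import mean, stdev


def paired_t_log(same_ms, novel_ms):
    """Paired t statistic and degrees of freedom on log10 looking times.

    Each infant contributes one same-trial and one novel-trial value
    (for example, the average over the two trials of each type).
    Differences are computed as same minus novel.
    """
    diffs = [math.log10(s) - math.log10(n) for s, n in zip(same_ms, novel_ms)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical looking times in milliseconds for four infants.
same = [5000, 6000, 7000, 5500]
novel = [6000, 8000, 7500, 7000]
t_stat, dof = paired_t_log(same, novel)
```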

#### Dutch-Learning Infants

We conducted a 2 (trial type: same/novel) × 2 (habituated category: neutral tone/canonical tone) × 2 (age group: 4- to 6-month-olds/10- to 12-month-olds) mixed effect ANOVA. Trial type was the within-subject factor. The between-subject factors were habituated category and age group. Trial type showed a main effect [F(1,48) = 6.3, p < 0.05, η<sub>p</sub><sup>2</sup> = 0.12], with the looking time in the novel trials being significantly longer than in the same trials (p < 0.05). Age group had no main effect [F(1,48) = 0.22, p > 0.05], nor did the habituated category [F(1,48) = 0.39, p > 0.05]. There was a significant interaction between trial type and the habituated category, F(1,48) = 4.5, p < 0.05, η<sub>p</sub><sup>2</sup> = 0.09. When infants were habituated to T1T0, the looking time in the novel trials was significantly longer than in the same trials, t(23) = −2.58, p < 0.05. However, when infants were habituated to T1T4, there was no difference between the same trials and the novel trials, t(27) = −0.28, p > 0.05. **Figure 6** plots the infants' looking times separated by habituated tone. Neither the interaction between trial type and age group [F(1,48) = 1.80, p > 0.05] nor the three-way interaction among trial type, habituated category, and age group [F(1,48) = 0.04, p > 0.05] was significant.
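The partial eta squared values reported with these F tests can be recovered from the F statistic and its degrees of freedom through the standard identity: partial eta squared = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to two of the reported effects; the function name is introduced here for illustration.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Reported effects for the Dutch-learning infants, both with df = (1, 48).
trial_type = partial_eta_squared(6.3, 1, 48)   # main effect of trial type
interaction = partial_eta_squared(4.5, 1, 48)  # trial type x habituated category
```

Rounded to two decimals, these come out at 0.12 and 0.09, matching the effect sizes given in the text.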

Although there was no interaction between trial type and age group, to better capture the perception pattern within each age group, we looked at the data for 4- to 6-month-old and 10- to 12-month-old infants separately. Dutch 4- to 6-month-old infants looked longer at the novel trials (average LogLT = 3.85, SD = 0.27) than the same trials (average LogLT = 3.73, SD = 0.28), t(22) = −2.26, p < 0.05. No difference was found between the same trials (average LogLT = 3.74, SD = 0.32) and novel trials (average LogLT = 3.78, SD = 0.29) for 10- to 12-month-old infants, t(22) = −0.84, p > 0.05. These findings suggest that 4- to 6-month-old infants might be more sensitive to the neutral–canonical contrast than 10- to 12-month-old infants. **Figure 7** plots the infants' log-transformed looking time in the same and novel trials for each age group.

#### Mandarin-Learning Infants

We conducted the same 2 (trial type: same/novel) × 2 (habituated category: neutral tone/canonical tone) × 2 (age group: 4- to 6-month-olds/10- to 12-month-olds) mixed effect ANOVA on the data obtained from Mandarin infants. Trial type failed to show a main effect [F(1,20) = 0.03, p > 0.05]. Habituated category did not show a main effect, F(1,20) = 0.003, p > 0.05. Age group did not show a significant main effect either, F(1,20) = 0.10, p > 0.05. The interaction between trial type and age group was marginally significant, F(1,20) = 3.58, p = 0.07, η<sub>p</sub><sup>2</sup> = 0.15. No interaction was found between trial type and habituated category [F(1,20) = 0.09, p > 0.05] or among trial type, habituated category, and age group [F(1,20) = 0.31, p > 0.05].

Because we found a marginally significant interaction between age group and trial type, we conducted paired t-tests with 4- to 6-month-old and 10- to 12-month-old infants separately. The results showed no difference between the same trials and novel trials for the 4- to 6-month-old infants [t(7) = 1.41, p > 0.05] or the 10- to 12-month-old infants [t(15) = −1.61, p > 0.05]. **Figure 8** shows the looking time in the same and novel trials for 4- to 6-month-old and 10- to 12-month-old infants.

### Discussion

In the present experiment, we tested Dutch and Mandarin infants' discrimination of canonical (T1T4, /pan1san4/) and neutral tone (T1T0, /pan1san0/). Regardless of age, Dutch-learning infants were able to discriminate the T1T4–T1T0 contrast. Given that Dutch adults perceived T1T4 and T1T0 as "wS" and "Sw," respectively, Dutch infants likely processed the neutral–canonical tone contrast as lexical stress instead of tonal information. Perceptual asymmetry was found for Dutch infants: those who were habituated to T1T0 discriminated the T1T4–T1T0 contrast; those habituated to T1T4 did not. For languages with a predominant initial stress pattern, such as English, German, and Dutch, infants demonstrated an initial stress preference (Fikkert, 1993; Jusczyk et al., 1993; Friederici et al., 2007). For Dutch infants, the trochaic pattern, which is T1T0 in the present study, may be more salient to perceive than the less frequent iambic T1T4 pattern. When presented with T1T0 first, it might be easier for infants to consolidate the representation of the salient pattern, which allows for later successful discrimination. When presented with the iambic pattern, infants may accept such a pattern as a non-prototypical realization of the trochaic pattern.

Both the 4- to 6-month-old and the 10- to 12-month-old Mandarin infants unexpectedly failed to discriminate the neutral–canonical tone contrast, and no perceptual asymmetry emerged. In previous studies, acoustic salience influenced participants' discrimination (Tsao, 2008; Liu and Kager, 2014; Chen and Kager, 2016). With regard to T1T4 and T1T0, T4 and T0 showed a similar falling pitch contour and similar register (Li, 2017). It is possible that T0 and T4 were not distinctive enough for Mandarin infants to discriminate them. According to the perceptual assimilation model (Best, 1994), Mandarin infants might perceive T1T0 as a realization of T1T4 and vice versa. As such, the failed discrimination could have been caused by the acoustic similarities between T1T4 and T1T0.

Therefore, we tested Mandarin infants in Experiment 3 using a more salient canonical–neutral tone contrast, the T1T2–T1T0 contrast. Unlike the T1T4 and T1T0 contrast, where both T4 and T0 exhibit a falling contour, in the T1T2 and T1T0 contrast, T2 exhibits a rising contour and T0 exhibits a falling contour (see **Figure 9**). The pitch contour of T2 thus differs more from that of T0 than the pitch contour of T4 does (Li, 2017). If the phonetic similarity between T1T4 and T1T0 indeed hindered Mandarin infants' discrimination, the more salient acoustic difference would be expected to allow Mandarin infants to discriminate T1T2 and T1T0.

### EXPERIMENT 3: MANDARIN INFANTS' DISCRIMINATION OF T1T2 AND T1T0

### Stimuli

The pseudoword /pansan/ was again used as the stimulus. Infants were tested on their discrimination of T1T2 and T1T0. T1T2 and T1T0 carried saliently different pitch contours: T1T2 was "high-level + mid-rising," but T1T0 was "high-level + mid-falling." As T0 was realized in a shortened duration, T1T0 exhibited a "long-short" duration pattern.

The same female Mandarin native speaker produced /pan1san2/ in IDS 20 times. Recordings were completed in the soundproof room of the phonetics lab at CASS, using Cool Edit Pro 2.0 at a sample rate of 16,000 Hz.

Another five native Mandarin speakers judged the naturalness of the recordings on a continuum from 1 (extremely unnatural) to 5 (very natural). Two phoneticians selected the six most natural tokens of /pan1san2/. The six tokens of neutral tone (/pan1san0/), which were used in Experiment 2, were also used in the present experiment. In the six tokens of each category, three were used in the habituation phase, and another three were used in the test phase.

For the canonical tone sequence of T1T2 (/pan1san2/), the average duration of the first syllable was 253.5 ms (SD = 12), and the second syllable was 411.3 ms (SD = 15.7). For the neutral tone sequence of T1T0 (/pan1san0/), the average duration of the first syllable was 269.2 ms (SD = 11.8), and the second syllable was 216 ms (SD = 41.8). For tonal contours, 10 F0 values were extracted on the F0 contour of each tone using Praat (Boersma and Weenink, 2013). For the T1T2 sequences, the maximal F0 value of T2 was 323.7 Hz, and its minimal F0 value was 226 Hz with a range of 97.7 Hz. For T1T0 sequences, as used in Experiment 2, the maximal F0 value of T0 was 303.9 Hz, and its minimal F0 value was 218.5 Hz with a range of 85.4 Hz. The average F0 contours are presented in **Figure 9**.

### Participants

Thirty-five Mandarin-learning infants different from those in Experiment 2 were tested. Eight of them were 4–6 months old (mean age = 5;16, SD = 0.7, 4 males and 4 females) and 27 were 10–12 months old (mean age = 10;9, SD = 2.8, 16 males and 11 females). Another 34 infants were tested but later excluded due to fussiness (N = 10), parental intervention (N = 7), dialect interference in the input (N = 6), not being habituated (N = 4), equipment failure (N = 4), and experimenter error (N = 3). All Mandarin-learning infants were born and raised in Mandarin-speaking families where Mandarin was the only language in use. All parents reported normal hearing of the infants. Mandarin-learning infants were tested in the infant lab at CASS, Beijing.

### Procedures

The experimental procedures were the same as in Experiment 2 (see section "Procedures" under the section "Experiment 2: Infants' Discrimination of Canonical Tone and Neutral Tone"). All experiments were completed in the infant lab at CASS, Beijing.

### Results

To correct for skewness, the raw looking time was logarithmically transformed. Mandarin-learning infants were divided into two age groups: 4–6 months old and 10–12 months old. For each age group, the LogLT in the same trials and novel trials were compared.

We conducted the same analysis as in Experiment 2: a 2 (trial type: same/novel) × 2 (habituated category: neutral tone/canonical tone) × 2 (age group: 4–6/10–12 months old) mixed effect ANOVA. Trial type served as the within-subject factor. The between-subject factors were habituated category and age group. Trial type showed no main effect [F(1,31) = 1.68, p > 0.05], nor did habituated category [F(1,31) = 0.03, p > 0.05] or age group [F(1,31) = 0.42, p > 0.05]. There was also no interaction between trial type and age group, F(1,31) = 1.18, p > 0.05. The interaction between trial type and habituated category, however, was significant, F(1,31) = 4.53, p < 0.05, η<sub>p</sub><sup>2</sup> = 0.13. No significant interaction was found among trial type, habituated category, and age group, F(1,31) = 0.22, p > 0.05.

We further split the data according to the habituated category to examine the interaction between trial type and habituated category. Paired t-tests were conducted to compare infants' looking time in the same trials and novel trials. When infants were habituated to canonical tones (T1T2, /pan1san2/), there was no difference between looking time in the same trials and novel trials [t(14) = 1.5, p > 0.05]. However, when infants were habituated to neutral tone (T1T0, /pan1san0/), their looking time in the novel trials was significantly longer than in the same trials [t(19) = −2.51, p < 0.05]. **Figure 10** shows the interaction between trial type and habituated category.

### Discussion

When presented with an acoustically salient contrast (T1T2–T1T0), neither the 4- to 6-month-old infants nor the 10- to 12-month-old infants showed a discrimination effect. Despite the fact that neutral and canonical tones are contrastive phonetically and phonologically, Mandarin-learning infants did not seem to discriminate the neutral–canonical tone contrast during their 1st year of life.

Nevertheless, when discriminating the T1T2 and T1T0 contrast, perceptual asymmetry was evident. When infants were habituated to the sequence of T1T0 (/pan1san0/), Mandarin infants discriminated neutral (T1T0, /pan1san0/) and canonical tones (T1T2, /pan1san2/) successfully. But when they were habituated to the sequence of T1T2 (/pan1san2/), they did not show discrimination. The directional asymmetry might be related to the statistical distribution of tonal patterns in Mandarin, where canonical tones are used to distinguish lexical meanings and are therefore more common than neutral tones. About 3.8% of Mandarin vocabulary involves neutral tones (Zhu, 2009). Thus, habituating infants to the uncommon T1T0 may allow them to develop a representation of neutral tone, which would allow for later discrimination between T1T0 and T1T2. However, once habituated to the common T1T2, the uncommon T1T0 might be processed as a realization of T1T2.

### GENERAL DISCUSSION

In the present study, we investigated the perception of neutral tone for tone- and non-tone (stress)- language listeners, both adults and infants. In Experiment 1, a discrimination task and an identification task were conducted for Mandarin and Dutch adult listeners. The results showed that Dutch adult listeners were able to discriminate between Mandarin neutral tones and canonical tones. In addition, Mandarin and Dutch listeners both identified neutral tones as unstressed. When presented with disyllabic sequences ending in a canonical tone, Mandarin listeners tended to identify the two syllables in the sequences as having "equal" stress, consistent with the claim that Mandarin does not have word level stress except for neutral tones.

In Experiments 2 and 3, we tested infants' discrimination of the neutral–canonical tone contrast using the visual fixation paradigm. In Experiment 2, Mandarin and Dutch infants were tested on the discrimination of a non-salient neutral–canonical tone contrast, namely T1T4 and T1T0. Results showed that Dutch infants could discriminate between neutral tones and canonical tones regardless of age. Because Dutch infants discriminated the contrast continuously, they might process the neutral–canonical contrast as a lexical stress contrast, which exists in their native language. Perceptual asymmetry was found for Dutch infants: when they were habituated to the T1T0 sequence (/pan1san0/), they discriminated the canonical–neutral contrast; when they were habituated to the T1T4 sequence, however, they could not discriminate it. Trochee is the predominant stress pattern in Dutch (van Heuven and Hagman, 1988; Leyden and van Heuven, 1996). For Dutch infants, the trochaic pattern (T1T0) may be more salient than the less frequent iambic pattern (T1T4). When presented with T1T0 first, it might be easier for the infants to consolidate the representation of the salient pattern, which allows for successful discrimination later. When presented with the iambic pattern, infants may accept the pattern as a non-prototypical realization of the trochaic pattern. Unexpectedly, however, neither the 4- to 6-month-old nor the 10- to 12-month-old Mandarin infants discriminated the T1T4–T1T0 contrast. In previous studies, tonal discrimination was related to acoustic salience (Tsao, 2008; Liu and Kager, 2014; Chen and Kager, 2016). Both T4 (high-falling tone) and T0 (mid-falling tone) exhibited a falling tonal contour with a similar register. Even though Mandarin infants could form a prototype of falling tone through intensive habituation, they may perceive T4 and T0 as two realizations of the same tonal category.

To this end, a more salient contrast was used as stimuli in Experiment 3: T1T2 (/pan1san2/) and T1T0 (/pan1san0/). Compared with the T4–T0 pair, the T2–T0 pair differs more in tonal contour, with opposite directions of pitch movement (Li, 2017). T1T2 is "high-level + mid-rising," and T1T0 is "high-level + mid-falling." Although no discrimination was found overall, we found a perceptual asymmetry similar to that of the Dutch infants in Experiment 2: when infants were habituated to T1T0 (/pan1san0/), they discriminated the neutral–canonical tone contrast successfully; when habituated to T1T2 (/pan1san2/), infants failed to discriminate the contrast. Taken together, acoustic salience appeared to affect infants' discrimination. Perceptual asymmetry emerged when infants discriminated a salient contrast. Directional asymmetry might reflect the statistical distribution of the tonal pattern in Mandarin. Canonical tones are more common in Mandarin, and neutral tones are more restricted in distribution. When habituated to the uncommon T1T0, it is likely that infants could form a new representation for neutral tone, thereby facilitating discrimination of the neutral–canonical tone contrast.

Infants' different responses and perceptual asymmetry in Experiments 2 and 3 may reflect properties of their native languages, such as the statistical distribution of tonal/stress pattern in the input. In Dutch, duration is the most reliable cue for lexical stress. Like native Dutch adults, Dutch infants may perceive the T1T0–T1T4 contrast as lexical stress, where T4 carries a longer duration than T0. In Mandarin, however, lexical meanings are distinguished by pitch variations. The failed discrimination between T1T0 and T1T4 might reflect the fact that both T0 (mid-falling tone) and T4 (high-falling tone) were perceived as realizations of the same falling tone. It seems that 10- to 12-month-old infants already weigh phonetic cues according to their distribution in the input (Jusczyk et al., 1993).

Based on results from the infant experiments, one could conclude that Mandarin infants in the 1st year of life cannot discriminate between neutral tone and canonical tone contrast, although in Mandarin, neutral tones contrast with canonical tones phonetically and phonologically. Acoustic salience may also affect infants' discrimination. Mandarin infants showed perceptual asymmetry when presented with a salient contrast but not when presented with a non-salient contrast.

Particular neutral tone types may also influence infants' discrimination. In the present study, the neutral tone in the lexeme type carried the pattern of "XY," where X and Y stand for two different syllables. Under this condition, infants could only rely on phonetic cues to discriminate neutral and canonical tones without access to any morphological information. But in other contexts in which neutral tones are used, such as in reduplication and affixation, infants could predict neutral tones by using morphological cues, including the pattern of "XX" or "X/tsɿ0/" in which the second syllable is uttered in a neutral tone. Thus, the lexical neutral tone in the form of "XY" (Y can be said in either a neutral tone or a canonical tone, leading to different lexical meanings) might be more difficult to identify than the reduplication and affixation types. In Zhu (2002), a study of the production of neutral tones, reduplication occurred as early as 14 months but remained unstable at 24 months. Affixation and the lexeme types of neutral tones emerged at 17 months and stabilized earlier than reduplication. In the current research, we approached the discrimination of neutral tone and canonical tone contrast based mainly on phonetic cues. Reduplication and affixation types will be explored in future studies. Conceivably, infants may be able to represent and distinguish neutral tones from canonical tones in contexts where morphological markers are present.

The present study had two limitations. First, the sample size of 4- to 6-month-old Mandarin infants was small. However, previous studies reported that the discrimination performance of 10- to 12-month-old infants was better than that of 4- to 6-month-old infants for native contrasts (Narayan et al., 2010; Shi, 2010). In this study, 10- to 12-month-old Mandarin infants were unable to discriminate the neutral tone and canonical tone contrast. Hence, we hypothesized that 4- to 6-month-old Mandarin infants would not have discriminated the canonical–neutral tone contrast even if the sample size had been larger. Nevertheless, increasing the sample size would likely boost the statistical power of the results and might reveal an interaction between age group, trial type, and/or habituation category, rendering the developmental pattern more observable. Further testing of 4- to 6-month-olds will be conducted when practicalities allow. We invite future studies for replication, and we leave the issue open for further investigation. Second, a potential factor affecting Mandarin and Dutch infants' perceptions is tonal variability. We used multiple tokens in the habituation for Mandarin and Dutch infants, expecting that Mandarin infants would be able to represent lexical tones phonologically rather than phonetically; however, our results did not support such a hypothesis. Dutch infants, on the other hand, were able to map variable tokens in the habituation to a single token in the test phase, suggesting to some extent that they had access to abstract representation of the disyllabic sequences. The failure of the Mandarin infants might be due to the much lower frequency of neutral tones than canonical tones in the input, which could have led to the infants' bias of accepting the variable tokens of neutral tones as realizations of canonical tones. Future studies may use single tokens to test Mandarin infants and investigate whether they can identify the difference between canonical and neutral tones on a phonetic level.

Several issues remain crucial for future studies. First, lexical knowledge might play a role in learning neutral tones in the lexeme type, and stimuli in a particular morphological structure may help highlight the developmental changes in perceiving neutral tones. For neutral tones in the context of reduplication (e.g., /ma1ma0/, mother) and affixation (e.g., /tùuo1ts10/, table), infants could utilize morphological cues to perceive neutral tones. For neutral tones in the lexeme type (the stimuli used in the current study), cues other than phonetic ones might be needed to perceive the neutral tone. Lexical meanings may facilitate infants' discrimination of neutral and canonical tones, but the infants tested in the current study were so young that they had not yet developed sufficient knowledge of word meaning. As infants grow older, they may become capable of using word meaning to establish the representation of neutral tones. Future studies should test whether older infants are able to use lexical meaning to discriminate between sequences ending in neutral and canonical tones.

Second, more attention should be paid to the perception of disyllabic tonal sequences. Disyllabic words are the predominant prosodic unit in Mandarin and occur more frequently than monosyllabic words (Feng, 1997; Wang, 2008). Hence, it is possible that infants learn disyllabic words holistically rather than as concatenations of individual tones, especially given the observation that individual lexical tones are influenced by preceding and following tones due to articulation (Xu, 1997). Previous studies only used monosyllabic tones as stimuli (Mattock and Burnham, 2006; Mattock et al., 2008; Tsao, 2008; Chen, 2013; Liu and Kager, 2014; Chen and Kager, 2016), which may not reflect the actual language learning process. Knowledge of early perceptions of disyllabic canonical tone sequences among Mandarin infants will shed light on whether the acquisition of sequences involving neutral tones differs from that of sequences involving canonical tones.

In summary, future studies will yield a more detailed picture of neutral tone perception from the perspectives of phonetics, phonology, and word learning. Subsequent studies will provide deeper insights into how suprasegmental information is processed in early perception.

### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Institute of Linguistics (CASS, China) and the Utrecht Institute of Linguistics OTS (Netherlands) with written informed consent from all adult participants/infants' parents. All participants/infants' parents gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved for ethics by the Institute of Linguistics (CASS, China) and the Utrecht Institute of Linguistics OTS (Netherlands).

### AUTHOR CONTRIBUTIONS

AL, AC, and SF conceived and designed the study, and reviewed and edited the manuscript. SF performed the experiments and wrote the paper. All authors approved the manuscript.


### FUNDING

This study was supported by the National Natural Science Foundation of China (Nos. 61175016 and 61304250), the National Social Science Foundation of China (No. 15ZDB103), the CASS innovation project "Key Laboratory of Phonetics and Speech Science", and the China Exchange Program (CEP) between KNAW and CASS (No. 11CDP004).

### ACKNOWLEDGMENTS

We thank Babylab group members at Utrecht University and Chinese Academy of Social Sciences for their help in designing the experiments, setting up the computer program, and recruiting participants. We sincerely thank all the families who participated in our research.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Fan, Li and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Lexical Tones in Mandarin Chinese Infant-Directed Speech: Age-Related Changes in the Second Year of Life

Mengru Han<sup>1</sup> \*, Nivja H. de Jong<sup>2,3</sup> and René Kager<sup>1</sup>

<sup>1</sup> Utrecht Institute of Linguistics (OTS), Utrecht University, Utrecht, Netherlands, <sup>2</sup> Leiden University Centre for Linguistics (LUCL), Leiden University, Leiden, Netherlands, <sup>3</sup> Leiden University Graduate School of Teaching (ICLON), Leiden University, Leiden, Netherlands

Tonal information is essential to early word learning in tone languages. Although numerous studies have investigated the intonational and segmental properties of infant-directed speech (IDS), only a few studies have explored the properties of lexical tones in IDS. These studies mostly focused on the first year of life; thus little is known about how lexical tones in IDS change as children's vocabulary acquisition accelerates in the second year (Goldfield and Reznick, 1990; Bloom, 2001). The present study examines whether Mandarin Chinese mothers hyperarticulate lexical tones in IDS addressing 18- and 24-month-old children, an age at which children are learning words rapidly, compared with adult-directed speech (ADS). Thirty-nine Mandarin Chinese–speaking mothers were tested in a semi-spontaneous picture-book-reading task, in which they told the same story to their child (IDS condition) and to an adult (ADS condition). Results for the F0 measurements (minimum F0, maximum F0, and F0 range) of tone in the speech data revealed a continuum of differences among IDS addressing 18-month-olds, IDS addressing 24-month-olds, and ADS. Lexical tones in IDS addressing 18-month-old children had a higher minimum F0, higher maximum F0, and larger pitch range than lexical tones in ADS. Lexical tones in IDS addressing 24-month-old children showed more similarity to ADS tones with respect to pitch height: there were no differences in minimum F0 and maximum F0 between ADS and IDS. However, F0 range was still larger in IDS. These results suggest that lexical tones are generally hyperarticulated in Mandarin Chinese IDS addressing 18- and 24-month-old children despite the change in pitch level over time. Mandarin Chinese mothers hyperarticulate lexical tones in IDS when talking to toddlers and potentially facilitate tone acquisition and word learning.

Keywords: infant-directed speech, lexical tone, prosody, Mandarin Chinese, age effect, word learning

#### Edited by:

Leher Singh, National University of Singapore, Singapore

#### Reviewed by:

Jae Yung Song, University of Wisconsin System, United States; Nan Xu Rattanasone, Macquarie University, Australia

> \*Correspondence: Mengru Han han.mengru@gmail.com

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 23 June 2017 Accepted: 15 March 2018 Published: 04 April 2018

#### Citation:

Han M, de Jong NH and Kager R (2018) Lexical Tones in Mandarin Chinese Infant-Directed Speech: Age-Related Changes in the Second Year of Life. Front. Psychol. 9:434. doi: 10.3389/fpsyg.2018.00434

## INTRODUCTION

In tone languages, pitch is employed to differentiate lexical meanings. Consequently, in order to recognize or learn a word, a tone-language-learning infant must develop sensitivity to lexical pitch contours in addition to consonants and vowels; conversely, infants who learn non-tone languages need to pay attention to consonants and vowels but ignore pitch contours at the lexical level. Though a number of studies have looked at infants' discrimination, recognition, and acquisition of tones (see Singh and Fu, 2016 for a review), only a few studies have examined lexical tones in early language input, i.e., infant-directed speech (IDS). The results drawn from these studies are inconsistent; some suggest that tones in IDS are hypoarticulated, while others show that they are hyperarticulated compared with tones in adult-directed speech (ADS). Moreover, most previous studies have focused on IDS in the first year of life, when perceptual reorganization is taking place (Werker and Tees, 1984); comparatively little is known about how tonal input changes in the second year, when children start to become verbal and gain vocabulary at a rapid speed (Bloom, 2001). As tonal information is crucial to distinguishing word meanings, the current study investigates whether lexical tones in Mandarin Chinese IDS addressing 18- and 24-month-old children are hyperarticulated and, if so, whether the tonal cues change depending on the age of the child.

Infant-directed speech is a speech register caregivers (typically mothers) use when addressing their infants, and as such it is an important type of input in early language acquisition (Soderstrom, 2007; Cristià, 2013). IDS is known to exhibit exaggerated intonation compared with ADS, including higher pitch, a larger pitch range, and greater pitch variation (Fernald and Simon, 1984; Fernald et al., 1989). These types of prosodic modifications are found in IDS in the majority of world languages, including both non-tone languages, such as English and German (Fernald and Simon, 1984; Fernald et al., 1989; Cristià, 2013), and tone languages, such as Mandarin Chinese, Cantonese, and Thai (Grieser and Kuhl, 1988; Kitamura et al., 2001; Xu Rattanasone et al., 2013). Despite the near-universality of exaggerated intonation, the degree of exaggeration may show cross-linguistic or cross-cultural differences. For instance, American English IDS was found to exaggerate prosody more than British English, Japanese, German, French, and Italian IDS (Fernald et al., 1989). In the IDS of tone languages, lexical tone (pitch at the lexical level) interacts with exaggerated intonation (pitch at the intonational level); as a result, the prosodic modifications expressed in tone-language IDS may differ in meaningful ways from those found in non-tone-language IDS. For instance, Kitamura et al. (2001) found that, although Thai IDS exhibited exaggerated intonation compared with Thai ADS, it was less exaggerated than Australian English IDS.

IDS is often claimed to facilitate language acquisition, although conflicting views have been proposed (Gleitman et al., 1984; Soderstrom, 2007). One line of research has shown that, compared with ADS, IDS attracts infants' attention more effectively. Infants—even newborns—prefer listening to IDS over ADS (Cooper and Aslin, 1990). This listening preference is probably largely attributable to the positive affect of IDS (Singh et al., 2002). Positive affect is a common characteristic of IDS, and one that shares similar prosodic features with exaggerated intonation (Kitamura and Burnham, 1998). When they manipulated affect and speech register in IDS to examine 6-month-old children's listening preference, Singh et al. (2002) found that higher pitch and greater pitch variation alone did not account for infants' preference; positive affect was also required.

The robust evidence that infants prefer listening to IDS, however, does not necessarily indicate that such speech carries a particular linguistic function in terms of language learning. Another line of research has been devoted to identifying the well-specified linguistic information encoded by IDS. A number of studies have explored two questions on this topic. First, are the segmental (mainly vocalic) and suprasegmental (tonal) properties of IDS hyperarticulated compared with those of ADS? It may seem, at first blush, that the exaggerated intonation of IDS entails vowel hyperarticulation. However, it is also possible that exaggerated intonation provides more variable vowels, and thus poses a learning problem for vowel categorization. Similarly, exaggerated intonation need not naturally result in tone hyperarticulation; on the contrary, it may distort tonal cues at the syllabic level. Second, if the segmental and suprasegmental properties of IDS are indeed hyperarticulated, is this hyperarticulation expressed in a way that may support language acquisition? Previous investigations into this possibility have produced mixed results on the segmental level (vowels and consonants) and few results of any kind on the suprasegmental level (lexical tones).

An example of vowel hyperarticulation was identified by Kuhl et al. (1997), who compared the articulation of three point vowels (/i/, /a/, and /u/) between ADS and IDS addressing 2- to 5-month-old infants in American English, Russian, and Swedish. They analyzed the "vowel triangles" for the three vowels in IDS and ADS; a larger vowel triangle indicated that the vowels were more distinctive from each other. The results showed that in all three languages, mothers expanded the vowel triangles in IDS compared with ADS, suggesting that mothers produced more distinctive vowels in IDS. Similar results have been obtained in other languages, including Taiwanese Mandarin (Liu et al., 2003), French, and Japanese (Dodane and Al-Tamimi, 2007). However, contradictory findings have also been reported. First, vowel hyperarticulation seems to be restricted to point vowels (/i/, /a/, and /u/); when comparing other vowel contrasts such as [i – I] in American English, Cristià and Seidl (2014) did not find these contrasts to be enhanced in IDS. Second, while robust evidence of vowel hyperarticulation exists for multiple languages, other languages seem to show no trace of this phenomenon. For example, vowels in Cantonese IDS toward 3- to 12-month-old infants were not hyperarticulated compared with vowels in Cantonese ADS (Xu Rattanasone et al., 2013). Similarly, a recent study comparing the vowels in natural Japanese IDS addressing 18- to 24-month-old children with the vowels in read Japanese speech found that, although the IDS vowels were more variable, they did not necessarily show more clarity compared with those in ADS (Miyazawa et al., 2017).
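The vowel-triangle comparison can be illustrated with a minimal sketch. It assumes each point vowel is summarized by a mean (F1, F2) pair in Hz and that the triangle's size is taken as its area; the formant values and the names `triangle_area`, `ads`, and `ids` are hypothetical and are not taken from Kuhl et al. (1997).

```python
def triangle_area(p1, p2, p3):
    """Area of the triangle spanned by three (x, y) points (shoelace formula)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0


# Hypothetical mean (F1, F2) values in Hz for the three point vowels.
# These numbers are illustrative only and do not come from any cited study.
ads = {"i": (300.0, 2300.0), "a": (750.0, 1300.0), "u": (350.0, 900.0)}
ids = {"i": (280.0, 2700.0), "a": (900.0, 1400.0), "u": (330.0, 800.0)}

ads_area = triangle_area(ads["i"], ads["a"], ads["u"])
ids_area = triangle_area(ids["i"], ids["a"], ids["u"])

# A ratio above 1 would correspond to an expanded vowel space in IDS.
expansion_ratio = ids_area / ads_area
print(ads_area, ids_area, round(expansion_ratio, 2))
```

With these made-up values the IDS triangle is larger than the ADS triangle, which is the pattern described as vowel hyperarticulation; real analyses would average formants over many tokens per speaker before computing areas.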

The mixed results on vowel hyperarticulation in IDS are only magnified in studies investigating whether IDS supports language acquisition. On the one hand, Song et al. (2010) showed that vowel hyperarticulation in IDS improved word recognition in 19-month-old children. On the other hand, in a perception study on 6- and 7-month-old children, Trainor and Desjardins (2002) found that the exaggerated pitch contours in IDS helped children's discrimination of vowels, whereas high pitch hampered vowel discrimination. In sum, whether or not vowels in IDS are hyperarticulated—and whether such hyperarticulation, if it exists, helps children's language acquisition—is still debatable.

A similar debate may be extended to tone hyperarticulation. Hypothetically, the exaggerated intonation of IDS might affect tonal properties in one of two ways: lexical tones in IDS may be either hyperarticulated or distorted (hypoarticulated) by the exaggerated prosody. Two types of acoustic evidence may indicate tone hyperarticulation in IDS. First, tones' acoustic cues may be more prominent in IDS than in ADS. For example, as fundamental frequency (F0) is the primary cue to tone in Mandarin Chinese (Howie, 1976), tone hyperarticulation can be indicated by a larger F0 range for Tone 2 (mid-rising tone), Tone 3 (low-dipping tone), and Tone 4 (high-falling tone). Tone 1, a high-level tone, may have a higher F0 in IDS than in ADS. Additionally, tone duration, a secondary cue (Blicher et al., 1990), may be lengthened in IDS for all four tones. Second, enhancement of tonal contrasts is a possible indicator of tone hyperarticulation in IDS. Such enhancement can be measured by comparing the pitch differences between tone pairs in ADS and IDS, or indicated by a larger tone triangle in IDS (e.g., Tang et al., 2017, reviewed below). To date, only a handful of studies have looked at lexical tones in IDS, and those that have performed perceptual or acoustic measurements report conflicting results.
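As a rough illustration of the first type of evidence, the sketch below computes minimum F0, maximum F0, and F0 range for a single tone token. It assumes the contour is a list of F0 samples in Hz with unvoiced frames coded as 0; the function name `f0_measures` and the contour values are hypothetical. The semitone conversion is a common normalization added here for illustration, not a measure attributed to the studies cited above.

```python
import math


def f0_measures(contour_hz):
    """Return min, max, and range of F0 for one tone token.

    Frames with F0 <= 0 are treated as unvoiced and ignored (an assumption
    about how the pitch tracker codes missing values).
    """
    voiced = [f for f in contour_hz if f > 0]
    if not voiced:
        raise ValueError("contour contains no voiced frames")
    f0_min, f0_max = min(voiced), max(voiced)
    return {
        "min_hz": f0_min,
        "max_hz": f0_max,
        "range_hz": f0_max - f0_min,
        # Range in semitones: 12 * log2(max / min).
        "range_st": 12.0 * math.log2(f0_max / f0_min),
    }


# Hypothetical Tone 4 (high-falling) contours; values are illustrative only.
ads_tone4 = f0_measures([260.0, 240.0, 210.0, 180.0, 160.0])
ids_tone4 = f0_measures([400.0, 350.0, 300.0, 0.0, 250.0, 200.0])

print(ads_tone4["range_hz"], ids_tone4["range_hz"])
```

In this toy example the IDS token has both a higher maximum F0 and a larger F0 range than the ADS token, which is the pattern that would count as hyperarticulation under the first criterion.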

Results from several studies support the distortion prediction. Papoušek and Hwang (1991) found that tone contours in Mandarin Chinese IDS did not correspond to phonologically expected tone contours. In their study, participants were instructed to produce preselected utterances in role-play contexts, imagining the addressee was a child or an adult. The authors speculated that speakers intuitively sacrificed tonal information at the syllabic level in order to accommodate the IDS intonation. Though the study's results shed light on people's intuitive prosodic tuning when talking to children, they do not tell us much about tone production in natural IDS, when mothers and children interact directly. In a later study, Kitamura et al. (2001) collected IDS data from Thai speakers in a more natural setting. Specifically, the researchers recorded the spontaneous speech of mothers interacting with their children naturally at home, every 3 months, from birth until the infants were 12 months old (IDS condition); they also recorded the same participants interacting with adults (ADS condition). They then asked trained Thai phonologists to judge whether the tones in utterance-initial and utterance-final positions remained identifiable. The results showed that tones were slightly less identifiable in Thai IDS than ADS, especially in utterance-final positions.

While these studies suggest that tones may be distorted in IDS compared with ADS, there is also evidence that mothers hyperarticulate tones in IDS. Following the methods in Kuhl et al. (1997), Liu et al. (2007) investigated whether the hyperarticulation observed for vowels also applied to tones in Taiwanese Mandarin. They performed an acoustic analysis on four Taiwanese Mandarin tones in speech directed at 10- to 12-month-old children. Their stimuli consisted of 12 disyllabic words in which the first syllable (target syllable) varied from Tones 1 to 4 and the second syllable remained Tone 1. In the IDS condition, mothers and their infants played together with pictures or objects corresponding to these stimuli; in the ADS condition, the same mothers talked to an experimenter about the children's interests in these target words. Mean F0, F0 range, and duration of the vowels of the target syllables were compared between the two conditions. The results showed that Taiwanese Mandarin tones produced in IDS had a raised mean F0, enlarged F0 range, and lengthened duration, suggesting that mothers tended to hyperarticulate tones when speaking to their infants.

Two studies further tested tone hyperarticulation in Cantonese IDS with different measurements. Xu Rattanasone et al. (2013) investigated Cantonese tones in the speech of mothers talking to their 3-, 6-, 9-, and 12-month-old children. The stimuli consisted of three of the six tones in the Cantonese tone inventory: tones 55, 25, and 21. The authors adopted a tone triangle measure from Barry and Blamey (2004). For each tone, F0 values were measured at the point of maximal vowel amplitude and at 50% of the maximum amplitude. These two values were plotted for three tones, making a tone triangle. Similar to the vowel triangles in Kuhl et al. (1997), a larger tone triangle indicated more distinctive tonal contrasts. The results showed that tone triangles were larger in IDS than ADS at 3, 6, and 9 months, indicating tone hyperarticulation for these age groups. However, the observed hyperarticulation was reduced for 12-month-olds, indicating that tones in speech to infants are more distinctive until children reach 12 months of age, at which point tones in ADS and IDS become similar. Notably, the larger tone triangle found for 3-, 6-, and 9-month-olds mainly stemmed from differences between the high-level tone (55) and the low-level tone (21) (Xu, 2008, p. 111); thus, it remains unknown whether these larger tone triangles indicate tone hyperarticulation across the whole tone inventory. In a recent study, Wong and Ng (2017) examined Cantonese tone hyperarticulation in IDS (toward 7- to 12-month-old infants), using both native judgment and acoustic analysis. They found that tones in Cantonese IDS had higher F0 and longer duration than tones in ADS, but such differences did not seem to facilitate adults' perception of tonal contrasts. Using the tone triangle measure in Xu Rattanasone et al. (2013), Tang et al. (2017) examined tone hyperarticulation in Northern Mandarin. Interestingly, they only found tone hyperarticulation (for both tone space and duration) when the target tones were in utterance-final position.

Taken together, these findings indicate that Cantonese tones are hyperarticulated in early IDS compared with ADS, but that the degree of hyperarticulation diminishes by the end of the first year. They also suggest that tone hyperarticulation may be restricted to certain tones or positions (as in Northern Mandarin, where tone hyperarticulation is only present in utterance-final positions). In other words, it has not been conclusively established that lexical tones in IDS are hyperarticulated across the board. To date, studies of IDS have been conducted on different languages, with different data collection methods and different measurements, and have yielded conflicting results. These methodological issues must be taken into consideration before we draw any conclusions about the hyperarticulation of tone in IDS.

First, the tone languages studied in the existing literature on IDS include Cantonese, Mandarin Chinese, Northern Mandarin, and Taiwanese Mandarin, all of which have different tonal systems and prosodic patterns (e.g., Chen et al., 2009). It is certainly possible that the interaction of tone and prosodic modifications in IDS may show cross-linguistic differences. In fact, even among variants of the same language, the characteristics of IDS can differ; for example, as noted above, American English IDS tends to be more exaggerated than British English IDS (Fernald et al., 1989), which suggests that IDS in languages or dialects with different tonal systems may differ significantly as well.

Second, speech elicitation methods used in previous studies range from reading tasks to spontaneous speech, and from home settings to laboratory settings. Papoušek and Hwang (1991) used scripted speech, while Liu et al. (2007) selected target words to elicit speech during mother–child interaction (semi-spontaneous), and Kitamura et al. (2001) collected spontaneous speech data in natural interactions at home. Prosody tends to differ in read speech vs. (semi-)spontaneous speech (De Ruiter, 2015). In spontaneous speech (elicited during "natural mother–child interaction"), the speech context varies according to the activity that is taking place, for example, reading books, playing with toys, or changing diapers. Furthermore, in typical experimental settings, the speech contexts for ADS and IDS conditions are rather different from each other. It is not surprising, then, that IDS may be more distinct from ADS in certain contexts, and less distinct in other contexts. Given this degree of variability, it is not clear whether the large differences between ADS and IDS reported in certain previous studies may actually have been due to the very different settings and activities in the two conditions.

Finally, previous studies have employed a wide range of analyses to compare the ADS and IDS conditions. Kitamura et al. (2001) used native judgment, whereas other researchers performed acoustic analyses; among the studies that conducted acoustic analyses, different measurements were used. These methodological differences further complicate the task of determining whether or not lexical tones are hyperarticulated in IDS.

Besides the methodological issues discussed above, the different ages of the children in the various studies may also have contributed to the contradictory results. Studies on vowel and tone hyperarticulation to date have mostly focused on IDS directed at children in the first year of life, and these results have often been interpreted from the perspective of "perceptual reorganization" (Werker and Tees, 1984). There is robust evidence showing that infants undergo perceptual reorganization during the first 12 months of life, as their perception of phonetic categories shifts from language-universal to language-specific. This shift is reflected in infants' progressively better discrimination of native contrasts and poorer discrimination of non-native contrasts. Such perceptual reorganization develops for consonants, vowels, and lexical tones. Mandarin-learning infants, for instance, show improvement in their discrimination of lexical tones between 6 and 9 months of age, while infants who are learning a non-tone language (e.g., English and Dutch) show a decline in their ability to discriminate tonal contrasts over the same age range (Mattock and Burnham, 2006; Liu and Kager, 2014). Thus, findings on tone hyperarticulation during infancy are usually interpreted as evidence for the facilitating effects of IDS on tone perception: as infants' speech perception becomes progressively tuned to their native (tonal) language, tone hyperarticulation becomes less prominent. Xu (2008, p. 99), for instance, pointed out that her findings, which indicate that tone hyperarticulation declines at 12 months, are consistent with perceptual reorganization research. However, during the same period of perceptual reorganization, children also start to acquire words. Infants start to show recognition of common words as early as 6–9 months (Bergelson and Swingley, 2012), and usually utter their first words around their first birthday. In the second year of life, both receptive and productive vocabulary grow at an astonishing speed (Goldfield and Reznick, 1990; Bloom, 2001).

Since tonal information is crucial to word meaning in tone languages, it is important to examine whether tone hyperarticulation persists when children are becoming proficient word-learners in the second year. The general prosodic modifications in IDS are known to change based on the child's stage of language development (Stern et al., 1983; Kitamura et al., 2001). In general, IDS becomes more ADS-like as children grow older. Taking the perspective of word learning, tone hyperarticulation may not stop when children are 1 year old; on the contrary, it may persist, aiding children's lexical development as they move into the word-learning phase. As most studies to date have focused on the first year of life, little is known about whether tone hyperarticulation remains present in the second year. Consequently, the timeline of age-related changes in tone hyperarticulation is not well-described in the literature.

Two studies have investigated age-related changes in lexical tones in IDS, but both focused on the first year of life, prior to the lexical spurt. Kitamura et al. (2001) showed that lexical tones in Thai were distorted in IDS directed at children up to 9 months old, but that IDS directed at 12-month-old children did not differ significantly from ADS in tone identification. Results from Xu Rattanasone et al. (2013) showed similar age-related changes: Cantonese tones were hyperarticulated in IDS compared to ADS until 12 months of age, at which point this hyperarticulation was reduced. The authors interpreted their results as evidence that mothers modify their speech according to children's stages of language development. As infants tune their tone perception toward their native language in the first year of life (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013), tone hyperarticulation declines accordingly.

If age-related changes in IDS are explicitly tied to perceptual reorganization, we should expect any differences between ADS and IDS to diminish and disappear altogether as children reach 12 months of age. However, in a longitudinal study, Liu et al. (2009) found that speech directed to 5-year-old children still showed both general prosodic exaggeration and tone hyperarticulation compared with ADS, though it was less exaggerated than IDS directed at preverbal children. But what happens to IDS directed at children between infancy (up to 12 months) and school age (5 years old)? There is a gap in the existing investigations of tone hyperarticulation during this period. The present study seeks to fill that gap by asking what happens to tones in IDS in the second year of life, when children start to talk and learn vocabulary at a high rate. It remains an open question whether mothers speaking tone languages alter their tones in IDS to facilitate tone acquisition (and, consequently, lexical development) in their children. If tone hyperarticulation is not restricted to supporting perceptual reorganization, we should find evidence for tone hyperarticulation in IDS addressing 18- and 24-month-olds.

The current study set out to investigate tone hyperarticulation in Mandarin Chinese IDS at two points in time, both of which occur during the second year of life (the period of the lexical spurt). Our main research questions are: (1) Are tones in Mandarin IDS addressed to 18- and 24-month-old children generally hyperarticulated compared to tones in ADS? If so, we should expect to observe a larger F0 range for Tone 2, Tone 3, and Tone 4, a higher F0 for Tone 1, and possibly longer duration for tones in IDS vs. ADS, as shown by Liu et al. (2007). In addition to these general measures, we explored whether the lexical tonal contrast between Tones 1 and 4 was enhanced in IDS. (2) Do lexical tones in Mandarin Chinese IDS change when the mother is addressing an 18-month-old child vs. a 24-month-old child? We predict that as children's vocabulary size increases significantly from 18 to 24 months, the lexical tonal cues change. Specifically, tonal cues in IDS should be more similar to ADS when children reach 24 months old. To address these questions, we collected speech samples from a story-telling task, where mothers told a story containing target words featuring four Mandarin Chinese tones to their 18- and 24-month-old children (IDS condition), and to an adult control (ADS condition).

### MATERIALS AND METHODS

### Participants

Thirty-nine Mandarin-Chinese-speaking mother–child dyads participated in this study. The participant sample comprised two age groups: 18-month-olds (N = 21; mean age = 18;15; age range = 17;21–18;27; girls N = 9) and 24-month-olds (N = 18; mean age = 24;15; age range = 23;27–24;27; girls N = 10). All participants were recruited from kindergartens in Yichang, China. All the participant mothers spoke Mandarin Chinese<sup>1</sup> (the official language in China), as well as a dialect (in this case, Southwest Mandarin). The participant children heard this dialect in their language community, but were exposed to Mandarin Chinese at home, at kindergarten, and in the national media. This type of bilingual language background is common for most people in China (Li and Lee, 2006). To obtain a homogeneous group of participants, we set these criteria in our recruiting interview: (1) the mothers should speak Mandarin Chinese with good proficiency; (2) the mothers should mostly speak Mandarin Chinese to their children at home; and (3) the children should be learning Mandarin Chinese as one of their first languages.

### Materials

A picture book titled Xiaotuzi de yitian ("Bunny's day") was designed to elicit four target words for 18- and 24-month-old children (see **Table 1**). On each page of the book, one word appeared on the left side, and a corresponding picture appeared on the right side. The pages contained no text beyond these target words. An additional six pages were used as fillers and to make the story coherent. The target words were all disyllabic nouns, of which the first syllable was always Tone 2 (a rising tone), and the second syllables varied from Tones 1 to 4. We chose Tone 2 for the first syllable in order to ensure consistent tonal coarticulation effects (i.e., carry-over effects on the following tone) across tokens and registers.

### Procedure

Participants were tested in a quiet room. Before the experiment, mothers were given a few minutes to get familiar with the book. In the IDS condition, the child sat on his or her mother's lap, and the mother was instructed to read the story to her child the way she usually did at home. The mothers were specifically told they could use any sentences; the only requirement was to include the words on each page. In the ADS condition, the mothers were instructed to tell the story to the experimenter (female, a native speaker of Mandarin Chinese), taking into account that she was a college student. This was done to control the speech context and content in both conditions. The order of the two conditions was counterbalanced across participants. A ZOOM H1 recorder (with 16-bit resolution and a sampling rate of 44.1 kHz) was used to make audio recordings, and all sessions were videotaped. Each experimental session took about 15–20 min. All families received a book as a gift after the session.

### DATA ANALYSIS AND RESULTS

### Data Analysis

The beginnings and endings of the target syllables (the second syllable of each target word) were annotated and extracted from the recordings in PRAAT (Boersma and Weenink, 2017), following the phonetic segmentation principles in Skarnitzl and Machač (2011). In total, 713 target syllables were extracted; of these, 47 syllables (6.6%) were excluded due to background noise or interference from a child's voice.

We chose to acquire the maximum and minimum F0 for each syllable by marking them manually, rather than limiting tone measures to any specific segment(s) within the syllables. This was done for two reasons. First, the domain of tones (or Tone Bearing Units (TBUs)) is phonologically determined, and what constitutes a TBU in Mandarin Chinese is debatable (see Zhang, 2014, p. 81 for a review). Phonetic studies have shown that the voiced parts of syllables (i.e., vowels, initial voiced consonants, prenuclear onglides, and nasal codas) may convey tonal information (Howie, 1974; Duanmu, 2007). In studies involving acoustic analyses of lexical tones in IDS, however, the common practice has been to identify tones based on the F0 measures on vowels, potentially leading to the exclusion of other segments that may carry pitch contours. Second, contextual tonal variations in natural speech (for example, anticipatory and carry-over effects in adjacent Mandarin tones; Xu, 1997) may also make it difficult to extract pitch measures accurately using an automatic method. In previous studies, the stimuli were either monosyllabic (Xu Rattanasone et al., 2013) or associated with the first syllable of the target words in natural speech, where the carry-over effects from the pre-target syllables were uncertain (Liu et al., 2007). Such methods disregard the potential for contextual impact from adjacent tones. In the current study, we made sure that the first syllable of the target words was always Tone 2 (a rising tone), so that the first syllable had a similar effect on the second tone for each target word.

<sup>1</sup>We use the term "Mandarin Chinese" in this paper in reference to "Putonghua" or "Standard Chinese," the official language spoken in China. It should be distinguished from Taiwanese Mandarin, another variety of Mandarin Chinese spoken in Taiwan.

Taking these issues into account, to get a more accurate picture of tonal information, the first author manually marked the maximum F0 and minimum F0 following the methods from Chen and Gussenhoven (2008). As a secondary cue to tones, durations of syllables were extracted automatically using a Praat script (Lennes, 2017). Using these techniques, we obtained four dependent measures for each target syllable: Minimum F0, Maximum F0, F0 range (Maximum F0 – Minimum F0), and Duration of syllables (in seconds). Tone 1 was excluded in the F0 range analyses since it is a flat tone, for which the pitch height (not the pitch range) is the major cue.
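The four dependent measures can be illustrated with a minimal sketch. The class name, field names, and token values below are hypothetical and serve only to show how F0 range is derived from the two manually marked F0 values, and how Tone 1 is left out of the range analyses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToneToken:
    """One annotated target syllable (all field names are illustrative)."""
    tone: int            # Mandarin tone category, 1 to 4
    min_f0_hz: float     # manually marked Minimum F0
    max_f0_hz: float     # manually marked Maximum F0
    duration_s: float    # syllable duration in seconds

    def f0_range_hz(self) -> Optional[float]:
        """Maximum F0 minus Minimum F0; None for Tone 1, which is
        excluded from the F0 range analyses."""
        if self.tone == 1:
            return None
        return self.max_f0_hz - self.min_f0_hz


# Hypothetical tokens, not data from the study.
tokens = [
    ToneToken(tone=1, min_f0_hz=250.0, max_f0_hz=262.0, duration_s=0.31),
    ToneToken(tone=4, min_f0_hz=180.0, max_f0_hz=290.0, duration_s=0.27),
]
ranges = [t.f0_range_hz() for t in tokens]
```

Returning `None` for Tone 1, rather than a numeric range, keeps those tokens from entering a range analysis by accident.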

For all the F0 measures, we followed Liu et al. (2007) and used two scales: (1) Hz, a linear pitch scale that has been used traditionally in phonetic research; and (2) Equivalent-rectangular-bandwidth-rate (ERB), which has been found to better describe pitch perception (Hermes and van Gestel, 1991).
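As an illustration of the second scale, the sketch below converts Hz to ERB-rate. It assumes the conversion formula commonly attributed to Hermes and van Gestel (1991), E = 16.7 log10(1 + f/165.4); the text does not state which formula was applied, so the constants and the function name are assumptions.

```python
import math


def hz_to_erb_rate(f_hz: float) -> float:
    """Convert a frequency in Hz to ERB-rate.

    Assumed formula: E = 16.7 * log10(1 + f / 165.4).
    """
    if f_hz < 0:
        raise ValueError("frequency must be non-negative")
    return 16.7 * math.log10(1.0 + f_hz / 165.4)


# The same 100 Hz step is a smaller ERB-rate step higher in the pitch range.
low_step = hz_to_erb_rate(200.0) - hz_to_erb_rate(100.0)
high_step = hz_to_erb_rate(300.0) - hz_to_erb_rate(200.0)
```

Because the scale is logarithmic in shape, equal Hz intervals shrink on the ERB-rate scale as frequency rises, which is why results are worth checking on both scales.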

### Results

### General Tone Hyperarticulation

To understand whether tones differed between (i) ADS and IDS and (ii) IDS directed at 18-month-olds and IDS directed at 24-month-olds, we used linear mixed-effects models for all analyses. In the models, we included fixed factors of Age (18-month-old/24-month-old), Condition (ADS/IDS) and Tone (Tones 1, 2, 3, and 4) on these dependent measures: Minimum F0 (in Hz and ERB), Maximum F0 (in Hz and ERB), F0 range (in Hz and ERB, for Tone 2, Tone 3 and Tone 4, excluding Tone 1), and Syllable duration (in seconds), with Participant Number as a random factor, and allowing for random slopes for Condition and Tone (Barr et al., 2013). All dependent measures were square-root transformed from raw data to obtain a more nearly normal distribution (indicated by W in the Shapiro–Wilk test).

We used the lme4 package (Bates et al., 2017) in the R environment (R Development Core Team, 2016) for all data analyses. For each dependent measure, we took the backward elimination approach, starting with a model that included all fixed effects plus the random factor, and all interactions between them (the most complex model)<sup>2</sup> (Bates et al., 2015a). Then, we used the "step" function in the lmerTest package (Bates et al., 2015b, p. 15) to reduce the models by eliminating non-significant factors or interactions. When we arrived at an interaction of the fixed effects Condition and Age in the final models, we split the data by Age and built further models for each age group<sup>3</sup>. For Maximum F0 (Hz) and Minimum F0 (ERB), the models with maximal random effects failed to converge. Therefore, we excluded Tone as a random effect<sup>4</sup> for these two measures. As the results were consistent across Hz and ERB for all pitch measures, we only present results in Hz here. The results of F0 measures in ERB can be found in Supplementary Material. In the following subsections, we report on the final models for each dependent measure. Our main aim was to investigate the general tone hyperarticulation phenomenon in Mandarin Chinese IDS; hence we focus on the fixed effects of Condition and Age, as well as the interaction between these two factors. To further explore whether tonal contrasts are enhanced in IDS, we present an exploratory analysis on the enhancement of Tone 1 – Tone 4 contrast in Section "Exploring the Enhancement of Tone 1 – Tone 4 Contrast."
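As a reading aid, the most complex model (footnote 2) can be written in the generic linear mixed-effects form below. The notation is an assumption about how the R formula maps onto a standard specification and is not taken from the original analysis.

$$
\sqrt{y_{ij}} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + \mathbf{z}_{ij}^{\top}\mathbf{u}_{j} + \varepsilon_{ij},
\qquad \mathbf{u}_{j} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2})
$$

Here $y_{ij}$ is the dependent measure for token $i$ of participant $j$; $\mathbf{x}_{ij}$ codes Condition, Tone, Age, and all their interactions; and $\mathbf{z}_{ij}$ codes the by-participant random intercept together with the random slopes for Condition and Tone. Dropping the Tone columns from $\mathbf{z}_{ij}$ corresponds to the reduced random-effects structure in footnote 4.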

For Maximum F0 (Hz) (**Figure 1**), the final model (**Table 2**) revealed a significant main effect of Condition (p = 0.001), as well as a significant interaction of Condition and Age (p = 0.015). To further examine the different effects of Condition on Maximum F0 in the two age groups, we split the data by Age. The models for the two age groups showed a significant main effect of Condition for the 18-month group (β = 1.401, SE = 0.398, t = 3.516, p = 0.002), but not for the 24-month group, suggesting that there was no effect of Condition on Maximum F0 for IDS directed at 24-month-olds. The final models for Maximum F0 and Minimum F0 for each age group can be found in Supplementary Material. Thus, the Maximum F0 of lexical tones was higher in IDS than in ADS only in the 18-month-old group. By the time children were 24 months old, there was no difference between the two speech registers with respect to Maximum F0.

Results for Minimum F0 (Hz) (**Figure 2**) showed a similar pattern: the final model (**Table 3**) showed a significant main effect of Condition (p = 0.030) and a significant interaction of Condition and Age (p = 0.014). When we split the data by Age, we found that there was a significant main effect of Condition for the 18-month-old group (β = 0.589, SE = 0.224, t = 2.630, p = 0.010), but not for the 24-month-old group, as Condition was not in the final model. The results reveal that, similar to Maximum F0, Minimum F0 was also significantly higher in IDS addressing 18-month-old children than in ADS, while no similar differences in Minimum F0 arose between ADS and IDS for the 24-month-old group.

<sup>2</sup>An example of the R code for these models is: sqrt(max\_hz) ~ Condition \* Tone \* Age + (1 + Condition + Tone | Participant Number).

<sup>3</sup>An example of the R code is: sqrt(max\_hz) ~ Condition \* Tone + (1 + Condition + Tone | Participant Number).

<sup>4</sup>An example of the R code is: sqrt(max\_hz) ~ Condition \* Tone + (1 + Condition | Participant Number).

TABLE 2 | Final model for Maximum F0 (Hz).

<sup>∗</sup>p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.

For the measure of F0 range (**Figure 3**), the final model (**Table 4**) only showed a significant main effect of Condition (p = 0.006); no interaction between Age and Condition was observed on this measure, suggesting that lexical tones in Mandarin Chinese IDS have a larger F0 range than ADS tones across the two age groups.

The last measure was duration (**Figure 4**). For this measure, the final model (**Table 5**) did not include Condition, suggesting that there was no effect of Condition on duration for either the 18-month-old or the 24-month-old groups.

### Exploring the Enhancement of Tone 1 – Tone 4 Contrast

Our main goal was to provide a global measure of tone hyperarticulation; however, tone hyperarticulation may also suggest that tonal contrasts are enhanced in IDS. In addition to comparing the tonal cues between ADS and IDS, we explored whether the contrast between Tone 1 and Tone 4 was enhanced in IDS<sup>6</sup>. Both the contrast between Tone 1 (high-level tone) and Tone 4 (high-falling tone) and the contrast between Tone 2 (mid-rising tone) and Tone 3 (low-dipping tone) are typically used in studies on infant tone perception (e.g., Chen et al., 2017; Liu and Kager, 2017). As the realization of Tone 3 has a large degree of variation in spontaneous speech depending on various factors (e.g., Tone 3 sandhi and the position of a Tone 3 syllable in an utterance; see Yip, 2002), it is impossible to gauge the enhancement of the Tone 2 – Tone 3 contrast from the current data. Thus, we opted instead to focus on the tonal contrast between Tones 1 and 4 and explored whether this tonal contrast was enhanced in IDS as compared with ADS. Since Tones 1 and 4 are mainly distinguished by pitch range, if the difference in pitch range between Tones 1 and 4 was larger in IDS than in ADS, we can conclude that the contrast between Tones 1 and 4 was enhanced in IDS.

FIGURE 2 | Box plots of Minimum F0 (Hz) for ADS and IDS addressing 18- and 24-month-old children.

TABLE 3 | Final model for Minimum F0 (Hz).

<sup>∗</sup>p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.

<sup>5</sup>The box plots in our paper show the first and third quartiles, medians, and outliers (included in analysis). All Y-axes are square-root transformed.

<sup>6</sup> It should be noted that the results reported in Section "Exploring the Enhancement of Tone 1 – Tone 4 Contrast" are exploratory. Our tokens per tone were few, resulting in low statistical power for this specific analysis. An alternative way of measuring tonal contrast enhancement would be to compare the tone space area of selected tones (Tones 1, 2, and 4) in ADS and IDS as in Tang et al. (2017). Tang et al. (2017) found that tones in Mandarin Chinese IDS were only hyperarticulated at the utterance-final positions. However, in our current data set, only 18 out of 39 participants had tones (Tones 1, 2, and 4) produced at utterance-final positions in both ADS and IDS conditions, meaning that tone space area could only be compared for these participants. As a result, such an analysis was not feasible for our data set.

TABLE 4 | Final model for F0 range (Hz).

<sup>∗</sup>p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.

First, we included all occurrences of Tones 1 and 4 across the two age groups in the analysis. A paired-samples t-test showed a marginally significant difference (t = −2.024, df = 35, p = 0.051) in the pitch-range difference (Tone 4 minus Tone 1) between IDS (mean = 97.724 Hz, sd = 75.934) and ADS (mean = 66.987 Hz, sd = 60.927). As in Liu et al. (2007), we then considered only the first two occurrences of each tone. A paired-samples t-test showed a significant difference (t = −2.294, df = 31, p = 0.029) in the pitch-range difference (Tone 4 minus Tone 1) between the two conditions (ADS: mean = 65.408 Hz, sd = 55.525; IDS: mean = 103.397 Hz, sd = 88.669). Taken together, these results suggest that the contrast between Tones 1 and 4 was enhanced in IDS, especially for the first two occurrences of the target syllables.
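The paired comparison can be sketched as follows. Each value stands for one mother's pitch-range difference (Tone 4 minus Tone 1) in one condition. The numbers are hypothetical, and the sign convention (ADS minus IDS) is an assumption, chosen so that a larger IDS value yields a negative t, as in the statistics reported above.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(ads: Sequence[float], ids: Sequence[float]) -> Tuple[float, int]:
    """Return (t, df) for a paired-samples t-test on ADS minus IDS."""
    if len(ads) != len(ids) or len(ads) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(ads, ids)]
    n = len(diffs)
    # t is the mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical per-participant values in Hz, not data from the study.
ads_contrast = [60.0, 70.0, 50.0, 80.0]
ids_contrast = [90.0, 95.0, 70.0, 120.0]
t_value, df = paired_t(ads_contrast, ids_contrast)
```

The sketch stops at the test statistic and degrees of freedom; obtaining the p-value requires the t distribution, which the Python standard library does not provide.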

#### Results Summary

Both Minimum F0 and Maximum F0 of lexical tones were higher (in both Hz and ERB) in IDS addressing 18-month-old children than in ADS, but no similar differences were observed between ADS and IDS addressing 24-month-old children. This pattern suggests that mothers in the study raised the pitch level of tones when they addressed 18-month-old children, but maintained ADS-like pitch height when addressing 24-month-olds. F0 range (Hz and ERB), on the other hand, showed a difference between ADS and IDS across ages: F0 range was larger in IDS compared with ADS for both 18- and 24-month-olds. As for duration, our results showed that tones were not lengthened in either age group.

FIGURE 4 | Box plots of syllable duration (s) for ADS and IDS addressing 18- and 24-month-old children.

TABLE 5 | Final model for syllable duration (s).

<sup>∗</sup>p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.

Our results showed that tone hyperarticulation was present in IDS addressing 18- and 24-month-old children, but the specific tonal cues differed between the two groups: for 18-month-olds, Tone 1 had a higher F0 in IDS, and Tones 2, 3, and 4 had a higher F0 and a larger F0 range in IDS. For 24-month-olds, all four tones remained at the same pitch level in the two speech registers, though Tones 2, 3, and 4 still had a larger pitch range in IDS. As a secondary cue to lexical tone (Blicher et al., 1990), duration did not differ between ADS and IDS in either age group. In addition, an exploratory analysis showed that the contrast between Tones 1 and 4 was enhanced in IDS.

### DISCUSSION AND CONCLUSION

This study examined lexical tones in Mandarin Chinese IDS addressing 18- and 24-month-old children, at the age of the vocabulary spurt. The study had two main goals: to test whether tones are hyperarticulated in IDS compared with ADS in Mandarin Chinese, and to explore how tones in IDS vary with the age of the addressee during the period of the vocabulary spurt. To accomplish these goals, we measured the acoustic cues of lexical tones in ADS and IDS in a semi-spontaneous storytelling task. The results demonstrated that tone hyperarticulation and age-related changes are observed in Mandarin Chinese IDS addressing toddlers.

Our research questions were: (i) Are tones in Mandarin IDS addressed to 18- and 24-month-old children hyperarticulated compared to tones in ADS? (ii) Do lexical tones in Mandarin Chinese IDS change when the mother is addressing an 18-month-old child vs. a 24-month-old child? Our results build on past studies on lexical tones in IDS addressing preverbal children (Liu et al., 2007; Xu Rattanasone et al., 2013), while extending that research to the second year of life, the period of the vocabulary spurt. Our findings show that tone hyperarticulation remains present in speech to toddlers, even after their phonetic perception has tuned to their native language and they have started learning words. Specifically, we found that, in speech addressed to 18-month-old children, both the minimum and maximum F0 of tones were higher in IDS than in ADS, and the F0 range was larger, but the tones were not lengthened. These F0 measures are consistent with the findings of Liu et al. (2007) for IDS addressing 12-month-old children. In speech addressed to 24-month-old children, we found that the pitch height of lexical tones had normalized to the ADS standard, while F0 range remained larger in IDS than in ADS. Tone duration does not appear to differ between toddler-addressed IDS and ADS.

Taken in the context of previous studies exploring lexical tones in IDS addressing preverbal children and preschool children, our results contribute to the timeline of tonal changes in IDS by providing evidence for tone hyperarticulation in the second year. Xu Rattanasone et al. (2013) demonstrated that tones in Cantonese IDS are hyperarticulated when talking to children from 3 to 9 months, but that hyperarticulation declines as children approach 12 months. In their study on Taiwanese Mandarin, Liu et al. (2007) found that when addressing 12-month-old children, mothers exaggerated every acoustic correlate of tone in IDS, including producing a higher F0, larger F0 range, and longer duration. The current findings fill a crucial gap in the timeline and suggest that tone hyperarticulation may continue until children reach their second birthday. Liu et al. (2009) compared tones in Taiwanese Mandarin–speaking mothers' speech to preverbal children (IDS, age range: 0;7–1;0) and speech to preschool children (CDS: age: 5;0), and found that the degree of tone exaggeration was much less in CDS than IDS. Based on this evidence, we may tentatively trace a developmental trajectory of tone hyperarticulation in IDS: hyperarticulation is notably salient from birth to 12 months in both F0 and duration measures, remains present for F0 measures of tone at 18 months, but begins normalizing toward the ADS standard by the end of the second year. By 24 months, the degree of pitch height difference between ADS and IDS drops significantly for all four tones, although pitch range (of Tones 2, 3, and 4) remains larger in IDS compared with ADS. However, simply combining these findings is not sufficient to produce a complete picture of the developmental trajectory of how lexical tones change with age in IDS, since the studies noted above investigated different tone languages, and adopted different acoustic measures and different elicitation methods.

A question that follows from these findings is: why do tonal cues in IDS change over time? It seems likely that the change in the pitch level (minimum and maximum F0) is related to the general prosodic exaggeration, as the degree of prosodic exaggeration in IDS may also decline from 18 to 24 months when children have become more verbal and their word learning accelerates. However, since studies on tone hyperarticulation (including the current study) usually focus on the syllabic level, little is known about whether tonal cues coincide with other prosodic features of IDS. Crucially, our results showed that the pitch range (of Tones 2, 3, and 4) remained enlarged in IDS even when the pitch height had declined to the ADS level at 24 months, suggesting that mothers may hyperarticulate lexical tones during the period of the vocabulary spurt in support of word learning.

A relationship between the quality of IDS and children's language development has often been assumed in research on IDS (e.g., Fernald and Simon, 1984), and the hyperarticulation phenomenon has been offered up as evidence for the facilitative effects of IDS on language acquisition. However, although phonetic input is clearly exaggerated in IDS, at the same time, it is also highly variable compared with the input observed in ADS (Adriaans and Swingley, 2017). Might this variability make it more difficult for infants to form phonetic categories? Adriaans and Swingley's (2017) research suggests not: the authors used categorization models to train two datasets of hyperarticulated vowels (IDS-characterized) and non-exaggerated vowels (ADS-characterized), and found that the highly variable vowels in IDS favored phonetic categorization compared with the non-exaggerated vowels. However, empirical research on whether IDS indeed supports language acquisition (and, more specifically, tone acquisition and word learning in tone languages) is surprisingly lacking. Future research should examine whether raised pitch and/or enlarged pitch range indeed facilitates children's word recognition and word learning.

Thus, we must be cautious in interpreting our results as direct evidence for the linguistic function of IDS in word learning. Indeed, although the current study demonstrates that tone hyperarticulation remains present in language input during the vocabulary spurt period, it does not necessarily indicate that children benefit from this linguistic phenomenon. Several studies have explored the correlation between the quality of IDS and children's language outcomes. For instance, Liu et al. (2003) found that the vowel space in Taiwanese Mandarin IDS toward preverbal children (6–8 months; 10–12 months) is related to infants' performance on speech discrimination. Hartman et al. (2017) further showed that the quality of vowels in early English IDS may predict vocabulary size among 2-year-olds. In a word-learning study, Ma et al. (2011) found that 21-month-old English-learning children could only learn words in the IDS condition, while 27-month-old children could learn words successfully in both IDS and ADS conditions. For tone languages, the correlated question, whether tone hyperarticulation in IDS indeed benefits lexical word learning, remains under-investigated. At this point, no research exists directly comparing word learning under ADS and IDS conditions in tone languages, and the literature offers no insight into how tones in language input correlate with vocabulary outcomes.

The pitch measures of tones addressed to our 24-month-old group showed a different pattern from the findings in Liu et al. (2007). In addition to the different age groups under investigation, the inconsistent results may be attributable to language-specific properties. Even though Mandarin Chinese (spoken in mainland China) and Taiwanese Mandarin are varieties of the same language, their sentential prosody differs (Chen et al., 2009), which may in turn affect the prosody of IDS. Literature comparing the prosody of IDS in Taiwanese Mandarin and Mandarin Chinese is lacking. As British English and American English IDS exhibit different prosodic features (Fernald et al., 1989), one direction invited by the current research is to compare tone hyperarticulation in different tone languages, as well as in different varieties of the same tone language.

A limitation of the current design is that we used only one target word for each tone. We also took steps to avoid generating contrasts between the target words by ensuring that the phonemes of the target syllables differed from each other. As vowels and tones may interact (Hoole and Hu, 2004), our results are not generalizable to all syllable–tone combinations in Mandarin Chinese. However, it should be noted that this line of research typically relies on a rather small set of stimuli due to the practicalities of testing children. For example, Kuhl et al. (1997) used one target word per vowel; Liu et al. (2007) had twelve syllables for four tones, but they only included "the first two clear tokens of each target word"; Tang et al. (2017) also used one syllable for each tone. Even though there has been some agreement on the salience of tone hyperarticulation in the first year of life, it has not been established that tone hyperarticulation is present across the board. Meta-analysis of existing tone hyperarticulation studies may provide a better understanding of this issue. Also, the current study used a cross-sectional design. We found no effect of Age across ADS and IDS, indicating that there are no group differences between the 18- and 24-month-old groups. However, a timetable of changes in tone hyperarticulation over time remains to be revealed by longitudinal studies.

Another useful future direction for study would be to examine whether tone hyperarticulation is related to the prosodic marking of focused words. Previous research has shown that, in English IDS, mothers tend to put contextually new words (focused words) at utterance-final positions, and these focused words usually carry prosodic marking in the form of higher pitch and a larger pitch range (Fernald and Mazzie, 1991). Relatedly, Tang et al. (2017) showed that tone hyperarticulation in Northern Mandarin only occurs at utterance-final position. In their experimental design, toys corresponding to the target words were provided one by one to each participant (thus, each target word was contextually new). As Mandarin Chinese and English are both SVO languages, it is certainly possible that tone hyperarticulation in Northern Mandarin, as in English, tends to occur when the lexical item in question is the focus of an utterance.

### CONCLUSION

This study investigated the tone hyperarticulation phenomenon in Mandarin Chinese IDS in the second year of life and revealed age-related changes of tonal cues in IDS addressed to 18-month-old vs. 24-month-old children. These findings may contribute to an understanding of the role of IDS in tone acquisition and word learning. Mothers may hyperarticulate lexical tones in order to provide more fine-grained information for language acquisition. However, it may be premature to interpret these findings as direct evidence for the linguistic function of IDS.

### ETHICS STATEMENT

There was no ethics committee at the Utrecht Institute of Linguistics (UiL OTS), Utrecht University, when the data were collected for this study. This study was approved by UiL OTS and was carried out in accordance with the research guidelines of UiL OTS. All participants gave written informed consent.

### AUTHOR CONTRIBUTIONS

MH contributed to the design of the experiments, data collection, phonetic annotation, data analysis, and drafting the manuscript. NdJ and RK contributed to the experimental design, data analysis, and revision of the manuscript.

### FUNDING

This study was supported by the Postgraduate Scholarship awarded to MH by Chinese Scholarship Council. This research was funded by Utrecht Institute of Linguistics, Utrecht University.

### ACKNOWLEDGMENTS

We are grateful to Taohualing Kindergarten (Mrs. Aihua Zou and Mrs. Yuhong Li), Gezhouba Dongshan Kindergarten (Mrs. Huan He) and Gezhouba Early Education Center (Mrs. Hong Xie) in Yichang, China for their kind support and coordination in recruiting participants. We thank Run Chai for her help with part of data collection. We thank all families who participated in this study. We also acknowledge the members of Utrecht Babylab and Menghui Shi for their input on experimental design and data analysis. We thank the reviewers for their valuable comments and suggestions.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2018.00434/full#supplementary-material

### REFERENCES




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Han, de Jong and Kager. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Perceptual Reorganization of Lexical Tones: Effects of Age and Experimental Procedure

Antonia Götz<sup>1</sup>\*, H. Henny Yeung<sup>2</sup>, Anna Krasotkina<sup>3</sup>, Gudrun Schwarzer<sup>3</sup> and Barbara Höhle<sup>1</sup>

<sup>1</sup> Linguistics Department, University of Potsdam, Potsdam, Germany, <sup>2</sup> Department of Linguistics, Simon Fraser University, Burnaby, BC, Canada, <sup>3</sup> Developmental Psychology, Justus-Liebig University Gießen, Giessen, Germany

Findings on the perceptual reorganization of lexical tones are mixed. Some studies report good tone discrimination abilities for all tested age groups, others report decreased or enhanced discrimination with increasing age, and still others report U-shaped developmental curves. Since prior studies have used a wide range of contrasts and experimental procedures, it is unclear how specific task requirements interact with discrimination abilities at different ages. In the present work, we tested German and Cantonese adults on their discrimination of Cantonese lexical tones, as well as German-learning infants between 6 and 18 months of age on their discrimination of two specific Cantonese tones using two different types of experimental procedures. The adult experiment showed that German native speakers can discriminate between lexical tones, but native Cantonese speakers show significantly better performance. The results from German-learning infants suggest that 6- and 18-month-olds discriminate tones, while 9-month-olds do not, supporting a U-shaped developmental curve. Furthermore, our results revealed an effect of methodology, with good discrimination performance at 6 months after habituation but not after familiarization. These results support three main conclusions. First, habituation can be a more sensitive procedure for measuring infants' discrimination than familiarization. Second, the previous finding of a U-shaped curve in the discrimination of lexical tones is further supported. Third, discrimination abilities at 18 months appear to reflect mature perceptual sensitivity to lexical tones, since German adults also discriminated the lexical tones with high accuracy.

#### Keywords: perceptual reorganization, lexical tones, U-shaped curve, habituation, familiarization

### INTRODUCTION

During the first year of life, infants' perception abilities may change for stimuli that are not present or not relevant in their environment. For example, in the linguistic domain, perceptual changes have been detected in infants' sensitivity to native and non-native speech sounds. With increased experience with their native language, infants show an enhanced ability to distinguish between native speech sounds, whereas the initial sensitivity to non-native speech sounds decreases. This pattern of perceptual reorganization has been shown for consonants (Werker and Tees, 1984; Rivera-Gaxiola et al., 2005), vowels (Polka and Bohn, 1996, 2011; Tsuji and Cristia, 2014), lexical tones (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Liu and Kager, 2014; Singh and Fu, 2016), and word stress (Höhle et al., 2009; Skoruppa et al., 2009; Bijeljac-Babic et al., 2012).

#### Edited by:

Liquan Liu, Western Sydney University, Australia

#### Reviewed by:

Stefanie Ramachers, Radboud University Nijmegen, Netherlands; Tianlin Wang, University of Wisconsin-Madison, United States

\*Correspondence: Antonia Götz, antonia.goetz@uni-potsdam.de

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 31 August 2017; Accepted: 21 March 2018; Published: 06 April 2018

#### Citation:

Götz A, Yeung HH, Krasotkina A, Schwarzer G and Höhle B (2018) Perceptual Reorganization of Lexical Tones: Effects of Age and Experimental Procedure. Front. Psychol. 9:477. doi: 10.3389/fpsyg.2018.00477

However, research in recent years has converged on the idea that this picture is too simplistic. On the one hand, not all linguistically relevant sound contrasts are easily discriminable by young infants (Narayan et al., 2010; for a review, see Maurer and Werker, 2014). On the other hand, there are non-native sound contrasts that are discriminable by children beyond the typical ages of perceptual reorganization, and even by adults (for consonantal contrasts, see Best et al., 2001; for vocalic contrasts, see Mazuka et al., 2014). The present paper investigates the potential perceptual reorganization of lexical tones by infants learning non-tone languages. Previous research on lexical tone discrimination in infants is characterized by a rather complex pattern of findings: prior studies have found evidence for an increase, a decrease, and no-change in infants' and toddlers' ability to discriminate non-native tone contrasts across ages (for an overview, see **Table 1**). These divergent findings may be related to a number of dimensions on which these studies varied, including the tone contrasts used, the native language of the participants, and the experimental procedures. Our study focuses on the latter factor and compares the effects of familiarization vs. habituation in the initial exposure phase on German-learning infants' discrimination of a Cantonese tone contrast. In familiarization experiments infants are exposed to certain stimuli for a fixed time, thus the exposure is experimenter-controlled. In contrast, exposure in habituation is infant-controlled as the infant needs to reach a specific criterion (decrease in looking time) to proceed to the test phase. Thus, the latter type of preexposure may be more sensitive to the performance of individual infants.
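The difference between experimenter-controlled and infant-controlled exposure can be made concrete with a minimal sketch of a habituation criterion. The window size and the 50% threshold below are assumptions chosen purely for illustration; they are not the criterion used in any of the studies discussed here, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def reached_habituation(looking_times: Sequence[float],
                        window: int = 3,
                        criterion: float = 0.5) -> bool:
    """Return True if mean looking time over the last `window` trials has
    dropped below `criterion` times the mean of the first `window` trials.

    `window` and `criterion` are illustrative assumptions.
    """
    if len(looking_times) < 2 * window:
        # Require non-overlapping baseline and comparison windows.
        return False
    baseline = sum(looking_times[:window]) / window
    recent = sum(looking_times[-window:]) / window
    return recent < criterion * baseline


def trials_to_criterion(looking_times: Sequence[float],
                        window: int = 3,
                        criterion: float = 0.5) -> Optional[int]:
    """Number of trials an infant needs before the test phase would start,
    or None if the criterion is never met."""
    for n in range(1, len(looking_times) + 1):
        if reached_habituation(looking_times[:n], window, criterion):
            return n
    return None
```

Under a rule of this kind, two infants with different encoding speeds receive different amounts of exposure, whereas a familiarization phase would end after the same fixed time for both.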

We will first review prior studies on infants' and adults' perception of lexical tones and then present three experimental studies. In the first study, Cantonese tone discrimination in adult native speakers of Cantonese was compared to that in adult native speakers of German. In the second study, the discrimination of the high-rising and the mid-level Cantonese tones was tested in German-learning infants between 6 and 18 months of age using a familiarization procedure. The third experiment investigated discrimination of the same tone contrast in 6- and 9-month-old German infants using a habituation procedure.

### PREVIOUS STUDIES ON INFANTS' NON-NATIVE LEXICAL TONE PERCEPTION

A detailed review of infant tone perception can be found elsewhere (Singh and Fu, 2016). Here, we focus on studies that have investigated how infants learning a non-tonal language as their native language perceive different tones from various tone systems and we incorporate some more recent studies on infant tone perception. Furthermore, our review will also highlight details of prior experimental methods.

The first studies that tested perceptual reorganization of lexical tones provided evidence for a decline in tone discrimination by infants learning a non-tone language. Mattock and Burnham (2006) compared English and Chinese (Mandarin- or Cantonese-learning) infants at 6 and 9 months on their discrimination of Thai rising vs. falling as well as rising vs. low tones using the Conditioned Head-Turn (CHT) paradigm. Infants were first trained to perform a head-turn whenever an auditory background stimulus (a syllable carrying one tone) was replaced by the target stimulus (the segmentally same syllable with another tone). In the test phase, which started after three consecutive correct head-turns in training, the number of correct head-turns to a stimulus change was the dependent variable. Both 6- and 9-month-old Chinese-learning infants discriminated both tone contrasts, but English-learning infants showed a decrease in their discrimination from 6 to 9 months of age, with an overall higher performance for the rising-falling than for the rising-low contrast.

Mattock et al. (2008) extended this study to 4-month-old infants learning English or French, while continuing to test 6- and 9-month-olds acquiring these languages. They used a visual fixation paradigm (i.e., they measured infants' looking time at a central visual display during auditory stimulus presentation), where infants were initially exposed to a syllable representing either a low or a rising Thai tone for 30 s in a familiarization phase. In the test phase, two trial types were presented: four alternating trials that contained both the familiarized and the non-familiarized tone, and four non-alternating trials that only contained tokens of the familiarized tone. In this Stimulus Alternation Preference Procedure (SAPP), the 4- and 6-month-olds but not the 9-month-olds showed significantly longer looking times for the alternating trials compared to the non-alternating trials, with no difference across the language groups.

Yeung et al. (2013) tested 4- and 9-month-olds learning Cantonese, Mandarin, and English on Cantonese tones that were similar to the Thai contrast (high-rising vs. mid-level tones) investigated by Mattock and colleagues. Using a modification of the SAPP, infants heard three trial types in the test phase: four alternating trials (familiarized and non-familiarized tone intermixed), two non-alternating trials only containing the familiarized tone, and two non-alternating trials only containing the non-familiarized tone. With this modification, discrimination and preference could be measured in the looking times obtained within the same experiment: that is, differences between the alternating and non-alternating trials would indicate discrimination while the direction of differences between the non-alternating trials would indicate preference. The English-learning infants showed a decline in the ability to discriminate these contrasts while this was not the case for the Mandarin or Cantonese infants. Moreover, infants learning one of the tonal languages showed an asymmetrical performance pattern with better discrimination when they were familiarized with the high-rising tone than with the mid-level tone.

While these studies showed a decline in discrimination ability for non-tone language learners, others have found enhanced perceptual abilities with increasing age (Chen and Kager, 2016; Chen et al., 2017; Tsao, 2017). Chen and Kager (2016) as well as Chen et al. (2017) tested Dutch-learning infants' discrimination of the Mandarin low-rising and low-dipping tones. Unlike Mattock et al. (2008) and Yeung et al. (2013), who used familiarization in the initial exposure phase, these studies first habituated infants by repeatedly exposing them to one of the tones until their looking time had decreased by a predefined percentage. Then in the test phase, one trial of the habituated tone and one trial of the non-habituated tone were presented. The results from both studies suggest successful discrimination in 6- and 12-month-olds but not in 4-month-olds. The authors concluded from their results that, with increasing age, infants develop more fine-grained acoustic discrimination abilities for pitch information. Increasing perceptual sensitivity was also observed by Tsao (2017), who tested 6–8 and 10–12-month-old Mandarin- and English-learning infants using the CHT paradigm on the Mandarin high-level vs. low-dipping tones. Both language groups showed discrimination at both ages and their discrimination ability was enhanced with increasing age.

A third pattern found in the literature is that infants show no changes in their discrimination ability with increasing age (Liu and Kager, 2014, 2017; Ramachers et al., 2017; Shi et al., 2017; Tsao, 2017). Ramachers et al. (2017) tested Dutch and Limburgian<sup>1</sup> 6-, 9-, and 12-month-old infants with Limburgian falling vs. falling-rising tones. After the infants were habituated with one tone, they were presented with trials that only contained the habituated tone (non-alternating) or with a mixture of the habituated and the non-habituated tones (alternating). Looking time to a central visual display was the dependent measure, and results showed that Dutch infants at all ages (with no previous exposure to this specific dialect) discriminated the Limburgian tone contrast. Ramachers et al. (2017) argue that Dutch intonation has pitch contours (H\*L and H\*LH%) that are acoustically comparable to the Limburgian tones (Gussenhoven, 2004), which may have led to a maintenance of discrimination. Shi et al. (2017) came to a similar result when testing French-learning 4-, 8-, and 11-month-old infants. They habituated the infants to one instance of two Mandarin tone contrasts: either one token from the perceptually close rising vs. low-dipping contrast or one from the perceptually more distinct high-level vs. falling contrast. Infants were then tested on their discrimination of the habituated and the non-habituated tones. The infants showed successful discrimination across all three age groups with slight indications of a decline only for the perceptually close contrast. The authors discuss their findings as an indication of the emerging impact of native phonology and of the acoustic salience of the tested contrast in the perception of the non-native tone patterns.

Finally, a fourth developmental pattern was observed by Liu and Kager (2014), who tested the discrimination of the Mandarin high-level vs. high-falling tonal contrast in Dutch infants between 5 and 18 months of age using the visual fixation paradigm implemented with a habituation procedure. Their study revealed perceptual sensitivity at all ages when using naturally recorded speech stimuli. However, they found a U-shaped developmental curve in a second experiment, in which synthesized stimuli with smaller acoustic differences of the same contrast were used. Specifically, Dutch-learning infants at 5–6 and 17–18 months of age discriminated the contrast in these materials, but not the intermediate age groups. This U-shaped development was also found in a group of bilingual infants learning Dutch and another non-tone language (Liu and Kager, 2017). In line with Shi et al. (2017), the authors interpreted the finding that Dutch-learning infants regain their ability to discriminate the tones as a result of their experience with the native (Dutch) intonation system and its modulation by the acoustic salience of the contrast. To our knowledge, the two studies by Liu and Kager (2014, 2017) are the only ones that have tested tone perception across a larger age range extending into the second year of life and that have found evidence for a U-shaped learning curve.

<sup>1</sup>Limburgian is a dialect of Standard Dutch that uses word-level pitch for marking lexical and grammatical differences.

In sum, previous studies have shown that infants' non-native tone perception is probably influenced by a large number of factors, including age, task demands, the acoustic salience of the target tone contrast, and the prosodic systems of the native languages of the infant participants. Thus, developmental change in language acquisition and the experimental observation of this change seem to be dependent on a complex interaction of different factors. This links up with findings that show that older children and adult speakers of non-tone languages can also identify and discriminate lexical tones, even though their performance is typically below that of native speakers of the particular language (Burnham and Francis, 1997; Hallé et al., 2004; Francis et al., 2008; So and Best, 2010; Hay et al., 2015). The adult perception of L2 tones has been shown to be influenced by various factors, among others by the L1 lexical tone system (if the L1 is a tone language) or the use of pitch variation for post-lexical functions (e.g., different intonation or phrasing patterns) in the native language (Wayland and Li, 2008; Caldwell-Harris et al., 2015), but also by specific task conditions (e.g., duration of the interstimulus interval, requirement to count backwards during the interstimulus interval) that can show differential effects on non-native and native speakers' performance (Lee et al., 1996). One explanation for good tone discrimination abilities in adult speakers of non-tonal languages is that listeners might draw on their knowledge of the native intonation system for identifying and discriminating lexical tones (Francis et al., 2008). For instance, Francis et al. (2008) found that English listeners were highly accurate in identifying the Cantonese high-rising tone, which the authors linked to the acoustic similarity of this Cantonese tone to the rising intonation pattern of questions in English. Another possibility derives from the acoustic salience of the tested contrast. Highly acoustically salient tone contrasts are easier to discriminate independent of the native language background (Hallé et al., 2004). Given these findings that tone discrimination in adult speakers of non-tonal languages is possible, but is modulated by several factors, adult speakers' performance also needs to be considered when studying perceptual reorganization of tone discrimination in early infancy.

### THE CURRENT STUDY

The above-reviewed research on infants' non-native tone perception reflects the influence of several factors on experimental outcomes: acoustic properties of the tones used in the experiments, characteristics of the prosodic systems of the native languages of the participants, and also aspects of the experimental procedures. The studies that have found a perceptual decline with increasing age have mainly used familiarization procedures (Mattock et al., 2008; Yeung et al., 2013), whereas all studies that have found patterns of (re-)increased or maintained sensitivity across age have used infant-controlled habituation or conditioning procedures (Liu and Kager, 2014, 2017; Hay et al., 2015; Chen and Kager, 2016; Chen et al., 2017; Ramachers et al., 2017; Shi et al., 2017; Tsao, 2017). This suggests that habituation may be the more robust procedure to reveal discrimination abilities in infants. In line with this consideration, a recent test–retest reliability study suggests that habituation results are more consistent and reveal larger effects at the group level than familiarization (Cristia et al., 2016). One reason for this could be that infants in a habituation procedure enter the test phase of the experiment with an individually controlled encoding state of the stimulus. The duration of the exposure during the habituation procedure is dependent on infants' response to the stimulus. In contrast, familiarization has a fixed duration that does not take into account individual differences in the speed of encoding the stimuli. According to the model by Hunter and Ames (1988), the degree of familiarity with the exposed stimulus (which depends on an interaction of stimulus complexity and the infants' age as an indicator of developmental level) determines whether an infant prefers the familiar or the novel stimulus in the test phase. Therefore, group results may reflect heterogeneous individual patterns of novelty or familiarity preferences, which may lead to null effects. This inconsistency in the direction of preferences is actually predicted after familiarization in some cases but is never predicted after habituation. Thus, the conflicting results on infants' tone perception obtained across different studies may at least partly be related to the use of different pre-exposure techniques.

The present study had two main objectives. First, we further investigated the U-shaped development found by Liu and Kager (2014) using another tone contrast and testing a population with a different native language than Dutch. To this end, discrimination of a Cantonese tone contrast was tested with German-learning infants between 6 and 18 months of age, as well as with a group of German and Cantonese adults. Second, we wanted to pursue the question of methodological impacts on the results in infant discrimination studies. For that reason, the effect of using a familiarization or a habituation technique on the discrimination performance of 6- and 9-month-olds was investigated by testing these two age groups with two different experimental procedures.

Before testing infants, we first asked whether the target tone contrast would be discriminated by adult speakers of German. We tested a group of German adults on their ability to discriminate Cantonese tone contrasts and compared the results to the performance of a group of adult native speakers of Cantonese. Our prediction was that German adults may be able to discriminate these tones in an AXB task but that Cantonese speakers should outperform the German speakers. An AXB task was chosen to reduce the effects of memory load. Different tokens of syllables from the same tonal category were used to force listeners to discriminate categorically rather than acoustically.

### EXPERIMENT 1: ADULTS' DISCRIMINATION OF CANTONESE LEXICAL TONES

## Methods

#### Participants

Ten native Cantonese speakers (19–31 years, 5 female) and 14 native German speakers (22–31 years, 8 female) participated in this study. None of the native German speakers had any language competence in Cantonese or another tone language. Although all participants reported L2 proficiency in English, they considered themselves to be monolingual. All participants reported normal hearing abilities. The study was approved by the Ethics Committee of the University of Potsdam. Written informed consent in accordance with the Declaration of Helsinki was obtained from all participants.

#### Stimuli

The stimuli for the adult experiment comprised five different Cantonese lexical tones: high-rising (Tone 25), mid-level (Tone 33), low-falling (Tone 21), low-rising (Tone 23), and low-level (Tone 22). Although our experiments with the German infants (see below) were restricted to testing the discrimination of only Tone 33 and Tone 25<sup>2</sup>, we examined more tone contrasts in the adult experiment. This was done in order to minimize any effects of only presenting two tones repeatedly, which may draw the participants' attention to their specific acoustic differences and thus foster enhancement of discrimination during the experiment. A second reason for including multiple tones was to generate a broader picture of German adults' processing of lexical tones.

A female native speaker of Cantonese produced 40 segmentally different CV and CVC syllables in each of these five tones leading to 200 different syllables overall (e.g., the syllables /jin/ and /se/, each produced with five different tones). Half of the stimuli were CV and the other half CVC syllables. All syllables had a legal German phonotactic structure and were meaningful Cantonese words. To create acoustic variability, the speaker produced each stimulus four times. An acoustic analysis of the pitch patterns of the stimuli was conducted using PRAAT (see **Table 2**; Boersma and Weenink, 2016). Pitch contours were measured by sampling at three different time points within the vowel: at initial, middle (at 50%), and final position. **Figure 1** illustrates an example of the five different pitch contours of the syllable /jin/. The pitch contour of level tones showed no change across the syllable (Tone 22, Tone 33), whereas for contour tones a pitch rise (Tone 23, Tone 25) or fall (Tone 21) occurred at the end of the syllable. For the experiment, all stimuli were normalized in intensity.

#### Procedure

Both Cantonese and German adults performed an AXB discrimination task. In this task, participants needed to discriminate between ten different tone pairs. The five tone types were combined with each other, such that Stimulus A and B of a trial were always segmentally identical syllables but belonged to different tone categories; X also had the same segmental structure and belonged either to the same tone category as A or as B. An AXB task was chosen to reduce the effects of memory load compared to an ABX task. The X in an AXB task is equally distant from A and B, which prevents a mapping bias to the B stimulus (Best et al., 2001; Hallé et al., 2004; Strange and Shafer, 2008). Within a trial, different tokens of the syllables from the same tonal category were used to force listeners to discriminate categorically rather than acoustically (Best et al., 1988; Polka, 1991, 1992), thereby increasing the likelihood of finding language-specific effects.

All values are f0 means; standard deviations are given in parentheses. The analysis was done at three different positions: at the initial, middle, and final position of the pitch contour.

Four different trial types with the four possible orders of the stimuli were presented: AAB, ABB, BAA, and BBA. Each participant heard each of the 40 types of syllables combined with only one tone contrast. The pairing was randomized and counterbalanced across the participants (e.g., one participant heard the contrast Tone 25–Tone 33 on the syllable /se/, while another participant heard the contrast Tone 22–Tone 33 on the same syllable). Therefore, every participant heard each of the 40 syllables during the experiment but the tone contrast that was instantiated on these syllables varied across the participants.

<sup>2</sup>This tone contrast was also used in the study by Yeung et al. (2013) that tested English-learning infants. Given the prosodic similarity between English and German, we expected this tone contrast to generate similar effects in German-learning infants.

Each tone contrast occurred with four different syllables for each participant. During the experiment, each syllable-tone pairing was presented four times, once in each trial type. This resulted in an overall number of 160 trials for each participant (4 syllables × 10 tone contrasts × 4 trial orders). These trials were divided into four blocks of 40 trials, in order to allow pauses in between. Each block only contained one of the trial types for a syllable-tone pair. The trials within a block were presented in a pseudo-randomized order, with the same tone contrast never occurring twice in a row. The stimuli within trials were separated by an interstimulus interval of 1,000 ms; the intertrial interval was 3,000 ms. An interstimulus interval of 1,000 ms was chosen because previous studies have shown that language-specific effects are more clearly revealed with long interstimulus intervals (Werker and Logan, 1985). The maximum response time for the participants was 2,500 ms, measured from the offset of the last syllable. The pause between blocks was controlled by the participant, and the experiment continued when the participant pressed a button. In total, the experiment lasted around 20 min.
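The trial arithmetic described above can be checked with a short sketch that enumerates the ten tone pairs and builds one participant's trial list. The syllable labels and the simple rotation used to assign contrasts to syllables are assumptions for illustration only; the randomization, counterbalancing across participants, and blocking used in the experiment are not reproduced here.

```python
from itertools import combinations

TONES = ["25", "33", "21", "23", "22"]        # the five Cantonese tones tested
TRIAL_ORDERS = ["AAB", "ABB", "BAA", "BBA"]   # first, X, and third stimulus
SYLLABLES_PER_CONTRAST = 4


def tone_contrasts(tones):
    """All unordered pairs of distinct tones (5 tones give 10 contrasts)."""
    return list(combinations(tones, 2))


def build_trials(syllables, tones=TONES):
    """Assign one contrast to each syllable and expand it into four AXB trials.

    Contrasts are assigned by rotation, which is a hypothetical stand-in
    for the randomized pairing described in the text.
    """
    contrasts = tone_contrasts(tones)
    if len(syllables) != len(contrasts) * SYLLABLES_PER_CONTRAST:
        raise ValueError("expected 4 syllables per tone contrast")
    trials = []
    for i, syllable in enumerate(syllables):
        tone_a, tone_b = contrasts[i % len(contrasts)]
        for order in TRIAL_ORDERS:
            sequence = tuple(tone_a if ch == "A" else tone_b for ch in order)
            # X (the middle item) matches either the first or the third item.
            answer = "first" if order[1] == order[0] else "third"
            trials.append({"syllable": syllable, "contrast": (tone_a, tone_b),
                           "order": order, "tones": sequence, "answer": answer})
    return trials


# Hypothetical syllable labels standing in for the 40 recorded syllables.
syllables = [f"syl{n:02d}" for n in range(1, 41)]
trials = build_trials(syllables)
```

With 40 syllables, this yields 160 trials, and each of the ten contrasts contributes 16 of them (4 syllables × 4 trial orders).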

Participants were instructed to decide whether the second syllable was more similar to the first or to the third syllable; otherwise, they were not instructed to attend to any specific part of the syllables. The experiment and the participants' responses on a keyboard were controlled with OpenSesame (Mathôt et al., 2012) and run on a laptop. All trials were presented over headphones in a silent room.

### Results

**Figure 2** summarizes the percentages of correct responses given for all contrasts by both language groups. Statistical analyses were run on the number of correct responses as the dependent variable. The performance of both language groups was significantly higher than predicted by chance for all tone contrasts (one-sample t-tests against chance level, all p's < 0.001). This was also true for the tone contrast relevant to the infant study (Tone 33–Tone 25). Most importantly, a one-sample t-test against chance revealed above-chance performance in German adults (t = 18.55, p < 0.001) for this contrast.
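As a reminder of what a one-sample t-test against chance computes, the following minimal sketch derives the t statistic from per-participant scores. The function name is hypothetical, and the chance level is an assumption: in a two-alternative AXB decision, guessing yields half of the trials correct. Obtaining a p-value additionally requires the t distribution with n − 1 degrees of freedom (available, for example, through `scipy.stats.ttest_1samp`), which is left out to keep the sketch dependency-free.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def one_sample_t(scores: Sequence[float], chance: float) -> Tuple[float, int]:
    """Return the t statistic and degrees of freedom for a test of
    mean(scores) against a fixed chance level."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two observations")
    standard_error = stdev(scores) / sqrt(n)   # sample SD with n - 1 in the denominator
    return (mean(scores) - chance) / standard_error, n - 1
```

For a contrast tested on 16 trials per participant, `chance` would be 8 correct responses; a large positive t indicates that the group mean lies well above guessing.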

As a next step, we compared different models that were computed with the function glmer from the lme4 package (Bates et al., 2015) in R (R Core Team, 2017). The models were compared using the anova function. The best-fitting model [lowest Akaike Information Criterion (AIC; Akaike, 1998) and significant difference in the Chi-square test] included item and subject as random factors and the interaction of language group (Cantonese and German) and tone contrast (the 10 different tone contrasts) as fixed factors; see **Table 3**. Additionally, we asked about musical experience: participants were asked whether they had learned to play an instrument and, if so, for how long they had played it. Model comparison revealed that musical experience (years playing an instrument) did not modulate the outcome of our data. Compared to the model including the interaction of tone contrast and language group, the model including musical experience had a higher AIC (2183.4 compared to 2175.6) and did not fit significantly better according to the Chi-square test (p = 0.19).
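The selection logic can be illustrated with a small sketch of the AIC definition, AIC = 2k − 2 ln L, and of choosing the model with the lowest value. The two AIC values are the ones reported above; the function names and model labels are hypothetical, and this is not a re-implementation of the `anova` comparison in R.

```python
from typing import Dict


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(aics: Dict[str, float]) -> str:
    """Label of the model with the lowest AIC."""
    return min(aics, key=aics.get)


# AIC values reported in the text; the labels are illustrative.
reported = {
    "contrast x group": 2175.6,
    "contrast x group + musical experience": 2183.4,
}
preferred = best_by_aic(reported)
delta = reported["contrast x group + musical experience"] - reported[preferred]
```

Because each extra parameter adds 2 to the AIC, a predictor that barely improves the likelihood raises rather than lowers the criterion, which is the pattern reported for musical experience.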

In general, our results reveal good performance in both groups, but show that German native listeners performed less accurately than the native Cantonese listeners (86.5 vs. 93.4%, respectively). The statistical analysis showed that the overall performance differed significantly between the two language groups (β = −2.253, SE = 0.758, z = −2.973, p < 0.01). However, this group difference was not significant across all contrasts, as indicated by the interaction of tone contrast with group. Cantonese listeners best discriminated high-rising (25) vs. mid-level (33), high-rising (25) vs. low-level (22), and mid-level (33) vs. low-rising (23), each at a level of 98.7%. German adults performed best on the discrimination of mid-level (33) vs. low-falling (21). For both groups, the contrast high-rising (25) vs. low-rising (23) was the most difficult contrast.

With respect to the infant experiments, we were especially interested in how native and non-native adults perceive the difference between high-rising and mid-level tones. Our results revealed that the Cantonese adults discriminated Tone 25 vs. Tone 33 significantly better than the German listeners (β = −2.503, SE = 0.871, z = −2.874, p < 0.01). Furthermore, native listeners discriminated Tone 25 vs. Tone 22 (β = −2.567, SE = 0.786, z = −3.265, p < 0.01), Tone 33 vs. Tone 23 (β = −2.047, SE = 0.850, z = −2.409, p < 0.01), Tone 21 vs. Tone 23 (β = −1.818, SE = 0.713, z = −2.549, p < 0.05), and Tone 23 vs. Tone 22 (β = −1.127, SE = 0.336, z = −3.358, p < 0.001) significantly better than the non-native German listeners. The discrimination for the other tone contrasts was not significantly different between the Cantonese and the German listeners.

### Discussion

The first experiment tested the discrimination of Cantonese lexical tones by adult German listeners without knowledge of Cantonese and by native speakers of Cantonese. Three main findings were obtained: First, German native speakers were able to distinguish between different lexical tones. Second, native Cantonese speakers outperformed German listeners in their overall discrimination abilities. Third, there was variation in German listeners' discrimination performance depending on the specific contrast: while the discrimination reached native-like levels for some contrasts, performance was below that of native speakers for other contrasts. This is in line with other discrimination studies that have shown good discrimination by non-native listeners, but an overall better performance by native listeners (Lee et al., 1996; Burnham and Francis, 1997; Cutler and Chen, 1997; Francis et al., 2008).

However, the picture becomes less clear when comparing performance on each tone contrast separately. Some tone contrasts (high-rising vs. mid-level, high-rising vs. low-level, low-rising vs. mid-level, low-rising vs. low-level, and low-rising vs. falling) are harder to discriminate for German than for Cantonese native speakers. However, there are also contrasts for which both language groups show comparable levels of high performance (high-rising vs. low-falling, mid-level vs. low-falling, and low-level vs. falling). Further, there are two contrasts for which both language groups show comparably lower performance (high-rising vs. low-rising, mid-level vs. low-level). It is striking that the pairs that are highly discriminable by both groups contain one level and one contour tone or two contour tones with frequency changes in opposite directions, while the tone pairs that are harder to discriminate are both level tones or show the same direction of frequency change. This pattern suggests that for non-native as well as for native tone discrimination, acoustic properties and the acoustic distance of the specific tone contrast are relevant for their discriminability. In addition, it is possible that German listeners assimilate some of the tones to their native intonation system. This would then support a language-specific account of adult tone perception. It is noteworthy that all contrasts that are highly discriminable for the German listeners contain the falling Tone 21. The good discrimination seen here might stem from familiarity with the German intonation system, which uses falling contours for neutral statements (Grice and Baumann, 2002). That is, similar to what Francis et al. (2008) have proposed for English listeners, German native speakers might use their knowledge of the native intonation system to discriminate non-native lexical tones.

TABLE 3 | Results from the model comparison of the adult perception experiment.

The comparison is organized hierarchically. The first model was compared to the second model, which fit the data better. The second model was then compared to the third, and so forth. The comparison revealed the best fit for the model that includes the interaction of tone contrast and group as a fixed effect and subject and item as random effects (\*\*\* indicates p < 0.001).

To summarize, our findings from the first experiment revealed that German native speakers discriminate Cantonese lexical tones highly accurately, but native listeners perform significantly better. The overall good discrimination performance for German listeners could be explained by acoustic salience and/or assimilation to the native prosody. Our results thus showed that native and non-native adults' performance may differ depending on the specific contrast. Discrimination abilities in adults should therefore be considered before testing potential changes in infants' non-native sound discrimination. Overall, the most important finding from our first experiment is that German adults can discriminate the tone contrast that was used in our infant studies (Tone 33 vs. Tone 25), but that their performance was below that of native speakers of Cantonese. The finding that German adults can hear the difference between these tones increases the likelihood of observing a U-shaped developmental pattern, or perceptual enhancement with increasing age. But the finding that native Cantonese listeners are more accurate in discriminating these two tones suggests that their discrimination is not only due to a large acoustic distance, but is also affected by the native language of the listener.

### EXPERIMENT 2: TESTING 6-, 9-, AND 18-MONTH-OLDS USING A FAMILIARIZATION PROCEDURE

Here we contribute new data to the infant tone perception literature by testing German infants' perception of the Cantonese Tone 33 vs. Tone 25 contrast that had previously been used in a study with English-learning infants by Yeung et al. (2013). Similar to Liu and Kager (2014), we included a wider age range than Yeung et al. did in order to test for evidence of a U-shaped developmental curve in German 6-, 9-, and 18-month-olds. Following the Yeung et al. study, we used a procedure involving familiarization, but the discrimination abilities during the test phase were assessed with the head-turn preference procedure.


## Methods

### Participants

In total, 88 monolingual German-learning infants participated in this experiment: 30 6-month-olds (Mage = 182 days; range = 168–194 days; 14 girls), 30 9-month-olds (Mage = 275 days; range = 258–289 days; 18 girls), and 28 18-month-olds (Mage = 540 days; range = 526–556 days; 13 girls). An additional 16 infants were tested, but excluded from the analysis for the following reasons: crying (n = 8), fussiness (n = 5), technical error (n = 1), and pre-term birth (n = 2). Another two infants were excluded because at least one of the main caretakers grew up in an area in which the local German dialect uses word-level pitch contrasts (Werth, 2011). The remaining infants were all born full-term. According to parental report, infants did not suffer from repeated or acute ear infections, and there were no indications of atypical development or any experience with a tone language. This study was carried out in accordance with the recommendations of the Ethics Committees of the University of Potsdam with written informed consent given by the parents in accordance with the Declaration of Helsinki.

### Stimuli

For this study, we used the stimuli from Yeung et al. (2013): Cantonese CV syllables (/tɕʰi/) with either a high-rising (Tone 25) or mid-level (Tone 33) tone. In total, there were four different tokens of each tone. For detailed acoustic properties of the syllables, see Yeung et al. (2013).

The familiarization phase included only tokens of either Tone 25 or Tone 33. During the test phase, single syllables were concatenated into two different types of sequences: non-alternating (tokens from one tone category) and alternating sequences (tokens from both tone categories). In total, the test phase contained eight trials: four non-alternating and four alternating trials. Two of the non-alternating trials included only tokens of Tone 25 and the other two only tokens of Tone 33. In the alternating trials, tone types were intermixed: the first four tokens of the trial alternated between the two tones, and the remaining tokens followed in random order. The tokens were separated by an interstimulus interval of 1 s. Half of the alternating trials started with Tone 25, the other half with Tone 33, and they contained the same number of both tone types. During the familiarization phase, the maximal trial length was 15 s and during the test phase it was 30 s.

### Procedure

The experiment was run with the head-turn preference procedure (Hirsh-Pasek et al., 1987; Jusczyk and Aslin, 1995), which differed from Yeung et al.'s use of visual fixation, but still measured auditory preference by recording the duration of attention to a visual stimulus while an acoustic stimulus was presented. Infants sat on their caretaker's lap in a booth and first fixated on a flashing green lamp in front of them. Next, the experimenter, who sat in a second room and monitored the infant's gaze via a camera mounted above the green light, started the experimental trial by pressing a button. Then, one of the red lights mounted on the left or the right side inside the booth began to flash. As soon as the infant fixated the now blinking red light, the experimenter started the acoustic stimulus. The trial ended when the infant either looked away for more than 2 s, or when the end of the acoustic stimulus was reached. To start the next trial, the experimenter pressed a button and the green light in front of the infant again began to flash. Infants' looking duration (listening time) was coded online by the experimenter via a button box connected to a computer.
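The trial-termination rule described above can be illustrated with a short Python sketch. It is not the software used in the study: the function name `listening_time`, the representation of looks as `(start, end)` intervals in seconds, and the choice not to count look-away time toward listening time are all assumptions made for illustration.

```python
def listening_time(looks, stimulus_end, max_look_away=2.0):
    """Sum the time an infant spent fixating the light within one trial.

    looks: chronologically sorted (start, end) fixation intervals in seconds
           (hypothetical format, not the study's button-box output).
    stimulus_end: time at which the acoustic stimulus finishes.
    max_look_away: a look-away longer than this ends the trial (2 s in the text).
    """
    total = 0.0
    previous_end = None
    for start, end in looks:
        if start >= stimulus_end:
            break  # the stimulus had already finished
        if previous_end is not None and start - previous_end > max_look_away:
            break  # the trial ended during the preceding look-away
        # Assumption: only fixation time counts, clipped at the stimulus end.
        total += min(end, stimulus_end) - start
        previous_end = end
    return total


# Two looks separated by a 1 s gap are kept; the 3.5 s gap ends the trial.
example_looks = [(0.0, 4.0), (5.0, 9.0), (12.5, 15.0)]
print(listening_time(example_looks, stimulus_end=30.0))
```

In this example the third look is ignored because it follows a look-away longer than 2 s, so only the first two looks contribute to the listening time.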

The experiment consisted of a familiarization and a test phase. During the familiarization phase, infants were presented with only one of the tones (either Tone 25 or Tone 33, counterbalanced across participants) until they had accumulated 30 s of listening time. A maximal trial length of 15 s ensured that the infant looked at least once to each side of the booth during the familiarization. The test phase followed immediately after the familiarization phase and consisted of eight trials: two non-alternating trials of Tone 25, two non-alternating trials of Tone 33, and four alternating trials. These eight test trials were the same for all infants. During the test phase, the presentation order of alternating and non-alternating trials was pseudo-randomized; two alternating or non-alternating trials never followed each other directly (i.e., N-A-N-A-N-A-N-A or A-N-A-N-A-N-A-N). The test phase was additionally divided into two blocks: in each block, each trial type (alternating, non-alternating Tone 25, non-alternating Tone 33) was presented at least once. The presentation order of alternating, non-alternating Tone 25, and non-alternating Tone 33 was counterbalanced across infants, so that each of the trial types was presented in every position during the test phase. To check the reliability of the online measures of listening time (which was automatically calculated based on the experimenter's button pressing), 50% of the videos (randomly selected) obtained during the experimental session were re-coded by a second experienced coder using the specialized software ELAN (Wittenburg et al., 2006). The intercoder reliability was Pearson's r = 0.99, p < 0.001.
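The ordering constraints on the eight test trials can be made concrete with a minimal Python sketch. It assumes that each block of four trials holds two alternating trials and one non-alternating trial of each tone, which follows from the constraints stated above. The function name `make_test_order`, the labels `A`, `N25`, and `N33`, and the use of a random shuffle in place of the study's counterbalancing scheme are illustrative assumptions.

```python
import random


def make_test_order(start_with_alternating, rng):
    """Build one eight-trial test order satisfying the stated constraints.

    A   = alternating trial
    N25 = non-alternating Tone 25 trial
    N33 = non-alternating Tone 33 trial
    """
    order = []
    for _ in range(2):  # two blocks of four trials each
        non_alternating = ["N25", "N33"]
        # Assumption: a shuffle stands in for counterbalancing across infants.
        rng.shuffle(non_alternating)
        for trial in non_alternating:
            # Strictly interleave alternating and non-alternating trials.
            pair = ["A", trial] if start_with_alternating else [trial, "A"]
            order.extend(pair)
    return order


print(make_test_order(start_with_alternating=True, rng=random.Random(1)))
```

Each call yields either an A-N-A-N-A-N-A-N or an N-A-N-A-N-A-N-A order in which every block of four trials contains all three trial types.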

### Results

The averaged listening times for each trial type were entered as dependent variable into the statistical analysis. The mean listening times separated by age group and condition are displayed in **Figure 3**. For the statistical analysis, listening times were logarithmically transformed in order to create a normal distribution of the residuals. Data were analyzed with R (R Core Team, 2017) and linear mixed models with the lmer function from the package lme4 (Bates et al., 2015). Model comparison revealed that the model including the interaction of Condition (alternating, non-alternating Tone 25, and non-alternating Tone 33) × Age Group (6-, 9-, and 18-months) as fixed effect and trial number and subject as random factors fit our data best (**Table 4**). This indicates that the effect of condition on listening times differs across age groups. Furthermore, the comparison revealed that the tone used for the familiarization did not modulate the results, as including this factor did not improve the model fit (indicated by higher AIC and no significant difference in the Chi-square test).
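The logic of comparing nested models by AIC and a Chi-square (likelihood ratio) test can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a reproduction of the analysis, which was run in R with lme4. The log-likelihoods and parameter counts below are hypothetical placeholders, and the helper names `chi2_sf`, `aic`, and `likelihood_ratio_test` are introduced here for illustration.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then a recurrence in steps of 2.
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        q += math.exp(k / 2.0 * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate better fit."""
    return 2 * n_params - 2 * log_likelihood


def likelihood_ratio_test(ll_reduced, k_reduced, ll_full, k_full):
    """Chi-square statistic, degrees of freedom, and p value for nested models."""
    statistic = 2.0 * (ll_full - ll_reduced)
    df = k_full - k_reduced
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical values: a model without and with the interaction term.
ll_reduced, k_reduced = -812.4, 7
ll_full, k_full = -803.1, 11
print(aic(ll_reduced, k_reduced), aic(ll_full, k_full))
print(likelihood_ratio_test(ll_reduced, k_reduced, ll_full, k_full))
```

A richer model is preferred when it has the lower AIC and the likelihood ratio test is significant; when adding a factor raises the AIC and the test is not significant, as reported for the familiarization tone, the simpler model is retained.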

As the model showed a significant interaction of Age Group × Condition, we calculated separate models for each age group. Detailed statistical information for all age groups is provided in **Table 5**. These models also included subject and trial number as random factors and Condition as fixed effect. Familiarization was not included as a fixed effect, as the previous general model did not show an effect for the familiarization tone.

FIGURE 3 | ... indicating that only the 18-month-olds discriminated the lexical tones.

TABLE 4 | Results from the model comparison of the familiarization paradigm. The comparison is organized hierarchically. The first model was compared to the second model, which fit the data better. The second model was then compared to the third, and so forth. Trial number refers to each individual trial; familiarization refers to the type of familiarization tone. Results from the Chi-square test and AIC score revealed the best model fit for the model that includes the interaction of Age and Condition as fixed effect and subject and trial number as random effects (\*\*\* indicates p < 0.001).

For the 6-month-olds, the listening times for the alternating trials (M = 10.6 s, SD = 7.9 s) did not differ significantly from the listening times for the non-alternating Tone 25 trials (M = 11.9 s, SD = 7.6 s) nor from those for the non-alternating Tone 33 trials (M = 11.3 s, SD = 7.9 s). The effect sizes (Cohen's d) for alternating vs. non-alternating Tone 25 were d = −0.249, and for alternating vs. non-alternating Tone 33 d = −0.108.

The 9-month-olds also did not show significant differences in their listening times for the alternating trials (M = 7.63 s, SD = 5.17 s) compared to the non-alternating Tone 25 (M = 7.55 s, SD = 4.89 s) or the non-alternating Tone 33 (M = 7.53 s, SD = 5.93 s) trials. The effect sizes (Cohen's d) for alternating vs. non-alternating Tone 25 were d = −0.009, and for alternating vs. non-alternating Tone 33 d = 0.132.

However, the 18-month-olds showed significantly longer listening times for the alternating trials (M = 9.07 s, SD = 6.87 s) than for the non-alternating Tone 33 trials (M = 6.89 s, SD = 5.47 s). The difference between alternating and non-alternating Tone 25 trials (M = 8.15 s, SD = 5.90 s) was not significant. The effect sizes (Cohen's d) for alternating vs. non-alternating Tone 25 trials were d = 0.087, and for alternating vs. non-alternating Tone 33 trials d = 0.323.

### Discussion

The results from this experiment did not provide evidence that 6- and 9-month-old German-learning infants discriminate the Cantonese Tone 25–Tone 33 contrast. Only the 18-month-olds showed discrimination abilities for this contrast. However, discrimination showed up only in the comparison of the listening times to alternating sequences and non-alternating sequences containing Tone 33. No evidence of discrimination occurred between alternating sequences and non-alternating sequences that only contained Tone 25. This indicates some kind of asymmetry in the perception of these tones by German 18-month-olds.

Taken together, these results are only partly congruent with our prediction of perceptual reorganization and a U-shaped learning curve in tone perception. On the one hand, the differences in the results between the 9- and the 18-month-olds are in line with the observations by Liu and Kager (2014), who report an increase in the discrimination of Mandarin tone contrasts by Dutch-learning infants across these ages. Furthermore, our finding that 18-month-olds discriminate the tones is in line with our findings from Experiment 1 since German adults can also discriminate this contrast. However, what is missing is evidence of a decline in perceptual sensitivity between 6 and 9 months of age, as neither the 6- nor the 9-month-olds gave any indication of discriminating the contrast. So far, our result pattern for German-learning children is mostly compatible with the hypothesis of an age-related enhancement in tone perception, which is consistent with the findings of previous studies with Dutch-learning (Chen and Kager, 2016; Chen et al., 2017) or English-learning (Tsao, 2017) infants. Given the fact that German 7- to 8-month-old infants have been shown to be sensitive to pitch variations (Wellmann et al., 2012; Abboub et al., 2016), the assumption that even 9-month-olds may not yet be able to discriminate the tone contrasts based on pitch information is not likely. However, it might be that infants at this age focus on sound contrasts that mark lexical distinctions in their native language. Since this is not the case for pitch differences on the syllabic level, 9-month-olds might ignore these pitch differences.

TABLE 5 | Detailed results of the statistical analysis of the familiarization experiment for each age group. The estimates represent the log-transformed listening times. The results indicate that only the 18-month-olds, but not the 6- and 9-month-old infants, discriminate the contrast, as shown by longer listening times to the alternating trials. All models included Condition as fixed effect and subject and trial number as random effects, the structure identified as the best fit by the overall model comparison (\* indicates p < 0.05).

There may be at least two other potential explanations for our failure to find indications of a decline in discrimination in the two younger age groups that we tested. The first one is that perceptual reorganization for these tone contrasts has set in before 6 months of age. Recall that Yeung et al. (2013) tested 4- and 9-month-old but not 6-month-old English-learning infants with the same tone contrasts as were used with the German infants. They found discrimination in the 4-month-olds but not in the 9-month-olds. Comparing the English-, Mandarin-, or Cantonese-learning 4-month-olds in that study revealed that all language groups discriminated between the tones, but that the preference patterns for the different stimulus types were not the same across the groups. This suggests language-specific influences on tone perception already at this early age, leaving open the possibility that we would have found evidence for perceptual reorganization in German infants younger than 6 months. Nevertheless, a number of other studies using different stimuli and testing infants exposed to different languages found non-native tone discrimination in 6-month-olds (Mattock and Burnham, 2006; Mattock et al., 2008). This suggests that the perceptual decline for lexical tone contrasts is not necessarily completed by the age of 6 months.

A second explanation for our failure to find evidence for changes in the younger infants' tone perception is methodological in nature: the method used in our experiment may not have been suitable to demonstrate infants' ability to discriminate the tones. As argued above, the effect of familiarization may be modulated by characteristics of the stimuli and the participants, making this type of pre-exposure less than optimal for uncovering discrimination abilities for all types of stimuli at all ages. Hence, our third experiment reinvestigated 6- and 9-month-olds' discrimination of the same contrast as in the previous experiment but using a habituation procedure during the exposure phase.

Before we come to the third experiment, the results of the 18-month-olds deserve some consideration. As stated above, their listening times were longer for the alternating trials compared to the Tone 33 non-alternating trials, but not compared to the Tone 25 non-alternating trials. This pattern seems to be caused by enhanced listening times for the non-alternating Tone 25 sequences (compared to the non-alternating Tone 33 sequences). Listening times reflect specific preferences that infants have for stimuli that are presented during the experiment, and such preferences can emerge in the course of the experiment (when a familiarization phase is included) or can also be caused by some inherent properties of the stimuli (e.g., acoustic saliency, familiarity, etc.). Our results suggest that for German-learning infants, high-rising tones attract more attention compared to mid-level tones. Interestingly, Yeung et al. (2013) also found that the Mandarin-learning (but not the Cantonese-learning) infants showed longer listening times to Tone 25 compared to Tone 33. In contrast, the English-learning 4-month-olds showed a preference for listening to Tone 33 compared to Tone 25. The authors suggested that these differences in preference speak against an acoustic explanation that applies across languages, and instead suggest a language-specific preference for a certain tone type. A similar explanation may hold for the results of the German 18-month-olds. Their greater attention to Tone 25 than to Tone 33 indicates that they prefer pitch contours over level tones, which may be driven by the function that pitch contours have in German. In intonation languages like German, rising pitch contours often occur at the end of clauses, where a pragmatic function is to mark the utterance as a question or to indicate that the sentence is not yet finished (Grice and Baumann, 2002; Spinelli et al., 2017). The preference for the Cantonese contour Tone 25 may thus be interpreted as an indication that the 18-month-old German infants have started to learn about these pragmatic functions of rising contours. We will discuss this point in more detail in the general discussion.

### EXPERIMENT 3: TESTING 6- AND 9-MONTH-OLDS USING A HABITUATION PROCEDURE

## Methods

### Participants

Thirty monolingual German-learning infants participated in this experiment: 15 6-month-old (Mage = 182 days, range = 168–195 days; 8 girls) and 15 9-month-old (Mage = 207 days, range = 255–289 days; 7 girls) infants. An additional 12 infants were tested but excluded from the analysis for the following reasons: crying (n = 3), failure to reach the habituation criterion (n = 7), listening times <500 ms for at least one of the four test trials (n = 1), and fussiness (n = 1). Infants from Experiment 3 did not participate in the previous Experiment 2. All infants were born full-term and according to parental report none of the infants suffered from any repeated or acute ear infections. None of the infants showed indications of atypical development or had experience with a tone language. This study was carried out in accordance with the recommendations of the Ethics Committees of the University of Potsdam with written informed consent from all parents in accordance with the Declaration of Helsinki.

### Stimuli

The tone contrast for this experiment was identical to the contrast in Experiment 2. For habituation and test phases, the same four tokens as used in Experiment 2 were re-arranged into new sound files. Since we had four tokens of each tone, we decided to use all tokens in the habituation and test phases in order to allow more acoustic variation within each phase. Stimuli were separated by an interstimulus interval of 1 s, resulting in a speech string of 40 s. During the experimental trials, a black and white checkerboard was displayed on a screen (e.g., Horowitz, 1974; Stager and Werker, 1997). Between trials, infants saw a silent bouncing ball to redirect their attention to the screen.

### Procedure

Infants sat on the caretaker's lap, facing a monitor at a distance of ∼1.2 meters in a silent room. A camera positioned above the presentation screen monitored infants' looking behavior. Stimulus presentation was controlled and infants' looking behavior was coded online with Habit 2 (Version 2.1.25, Oakes et al., 2015). All acoustic stimuli were presented with an intensity of 65 dB over loudspeakers, which were placed behind the screen. One trial consisted of a 40 s speech string. Trials started as soon as the infant fixated the screen and the experimenter pressed a key. The length of each trial was controlled by the infant's behavior: the trials ended when infants either looked away for more than 2 s, or the maximum trial duration was reached.

The experiment consisted of three phases: habituation, test, and post-test phase. The maximum number of trials within the habituation phase was 18 trials. The habituation criterion was reached when infants' mean listening time across three consecutive trials decreased to 50% of the mean listening time of the first three trials. Infants who did not reach the criterion were excluded from the analysis. All infants were habituated with Tone 25. The test phase started immediately after infants reached the habituation criterion or after the maximum number of trials was presented. In the test phase two trials with the novel (Tone 33) and two trials with the habituated (Tone 25) tone, each with a maximum duration of 40 s, were presented. The presentation order of the two novel and habituated tone trials was counterbalanced across infants. Half of the infants started the test phase with a trial containing the novel tone and the other half with a trial containing the habituated tone. A post-test phase followed directly after the test phase. During the post-test phase, a completely novel auditory stimulus was presented to verify the infants' attention to the task. The post-test trial differed segmentally from the tone stimuli. In total, 50% of the participants (randomly selected) were re-coded (frame by frame, 25 fps) by a second coder using the specialized software ELAN (Wittenburg et al., 2006). The inter-coder reliability was r = 0.98, p < 0.001.
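The habituation criterion described above can be expressed as a short Python sketch. It is not the Habit 2 implementation. The text does not say whether the three-trial window slides trial by trial or may overlap with the first three trials, so the sketch assumes a sliding window that is first evaluated at trial 4; the function name `trials_to_habituation` and the example listening times are hypothetical.

```python
def trials_to_habituation(listening_times, window=3, ratio=0.5, max_trials=18):
    """Return the 1-based trial at which the criterion is met, or None.

    listening_times: per-trial listening times in seconds, in presentation order.
    window: number of trials averaged (three in the text).
    ratio: criterion as a proportion of the baseline mean (50% in the text).
    max_trials: maximum number of habituation trials (18 in the text).
    """
    times = listening_times[:max_trials]
    if len(times) <= window:
        return None
    baseline = sum(times[:window]) / window  # mean of the first three trials
    # Assumption: a sliding window, evaluated from the first trial after baseline.
    for end in range(window + 1, len(times) + 1):
        recent = sum(times[end - window:end]) / window
        if recent <= ratio * baseline:
            return end
    return None


# Hypothetical infant: baseline mean 18 s, so the criterion is a mean of 9 s or less.
print(trials_to_habituation([20, 18, 16, 12, 8, 6, 5]))
```

An infant for whom the function returns `None` after 18 trials corresponds to the "failure to reach the habituation criterion" exclusion category.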

### Results

The averaged listening times for the novel and the habituated stimuli served as dependent variable. Mean listening times to the different trial types for the two age groups are displayed in **Figure 4**. Discrimination is indicated by a longer listening time for either the novel or the habituated tone. On average, infants needed about 6.08 trials (SD = 4.1) to reach the habituation criterion. Both age groups accumulated a comparable amount of listening time to the stimuli during habituation (91.95 s at 6 months, and 91.55 s at 9 months).

Again, all listening times were logarithmically transformed to fulfill the assumption of normal distribution of the residuals. The statistical analysis was performed with R (R Core Team, 2017) by using linear mixed models with the lmer function from the package lme4 (Bates et al., 2015). Again, we compared different models in order to test the best model fit using the anova function. The results from a Chi-square test as well as the lowest AIC revealed best fit for a model including the interaction of Age Group (6- and 9-months) and Condition (novel and habituated tone) as fixed factor and subject as random factor. In contrast to Experiment 2, trial number did not lead to a better model fit and was therefore excluded from further analysis. The missing effect of trial number was probably due to the smaller number of test trials. For details on the statistical analysis, see **Table 6**.

Since the interaction of Condition × Age Group was found to be significant, we performed separate analyses for each age group. Detailed statistical information can be found in **Table 7**. All comparisons were also calculated with the lmer function with Condition as fixed factor and subject as random factor. The 6-month-olds showed significantly longer listening times to the novel tone (M = 8.52 s, SD = 5.24 s) compared to the habituated tone (M = 5.11 s, SD = 4.30 s). In contrast, the 9-month-olds' listening times to the novel tone (M = 6.31 s, SD = 5.15 s) were not significantly different from those to the habituated tone (M = 5.98 s, SD = 4.12 s). Effect sizes (Cohen's d) were calculated for the 6-month-olds, d = −0.435, and for the 9-month-olds, d = 0.048.

TABLE 6 | Results from the model comparison of the habituation paradigm. The comparison is organized hierarchically. The first model was compared to the second model, which fit the data better. The second model was then compared to the third. The second model fit the data best and included Age and Condition as fixed effects and subject as random effect. In contrast to Experiment 2, trial number did not lead to a better model fit and was therefore excluded from further analysis. Note that habituation type was not included in the models because all infants were habituated with the same tone (Tone 25) (\* indicates p < 0.05).

TABLE 7 | Detailed results from the statistical analysis of the habituation experiment for each age group. The estimates represent the log-transformed listening times. Results indicated that the 6-month-olds discriminate between Tone 25 and Tone 33, but the 9-month-olds do not. All separate models included Condition as fixed effect and subject as random effect (\* indicates p < 0.05, \*\*\* indicates p < 0.001).

### Discussion

Our results from the habituation experiment clearly show an age-related decline in perceptual sensitivity for the contrast of Cantonese high-rising and mid-level lexical tones. While the 6-month-olds succeeded in discriminating the tones, the 9-month-olds did not show any evidence of discrimination. The decline in perceptual sensitivity between 6 and 9 months is in line with previous studies on lexical tone perception in infants (Mattock and Burnham, 2006; Mattock et al., 2008; Liu and Kager, 2014). These findings support the idea of perceptual reorganization for lexical tones between the ages of 6 and 9 months (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013) and extend this observation to German-learning infants.

### GENERAL DISCUSSION

The studies presented here pursued two main goals. The first one was to investigate whether further evidence can be obtained for a U-shaped development in the discrimination of non-native tone contrasts that is characterized by an initial decline and a later re-increase of perceptual sensitivity. The second goal was to investigate whether a procedure that involves habituation in the exposure phase of the experiment provides clearer evidence of infants' discrimination of lexical tones than a procedure that uses familiarization during the exposure phase of the experiment.

Summarizing the results across the three experiments, our overall findings suggest a U-shaped developmental pattern for tone discrimination in speakers and learners of German. First, German adults are able to discriminate the Cantonese high-rising vs. mid-level tones although their performance was below that of native Cantonese speakers. Second, we found a decline in the ability to discriminate these tones between the ages of 6 and 9 months: while 6-month-olds showed a clear dishabituation and thus discrimination effect in our last experiment, the results from the 9-month-olds did not indicate any discrimination of the tones across the two experiments. Third, evidence for a decline between the ages of 6 and 9 months was only obtained after habituation, but not after familiarization. We will first discuss the implications of our findings for the understanding of perceptual reorganization in infants and then consider methodological implications.

### UNDERSTANDING DEVELOPMENTAL TRAJECTORIES FOR TONE DISCRIMINATION

Overall, the results from our study suggest a developmental trajectory in the tone discrimination of German-learning infants that is identical to what Liu and Kager (2014) found for Dutch-learning infants: good discrimination at 6 and 18 months of age, but not at 9 months. Our study extends the findings from Liu and Kager (2014), who used the Mandarin high-level and high-falling tones, to a different tone contrast from another language and to learners of a different L1. This is an important finding as it shows that the U-shaped developmental pattern that was reported for the first time by Liu and Kager (2014) can be replicated and does indeed generalize to a new tone type. In addition, our study revealed that the tone contrast that was used in our infant study can also be discriminated by adult speakers of German, but at a significantly lower level than by native speakers of Cantonese. In contrast, for other tone contrasts tested in Experiment 1, discrimination reached native-like performance in adult speakers of German. This suggests that the adult discrimination of Tone 25 and Tone 33 is not only based on the acoustic saliency of the phonetic contrast. This in turn suggests that the U-shaped developmental pattern for this tone contrast is based on perceptual reorganization influenced by the acquisition of phonological properties of the native language and is not only due to a change in the acoustic sensitivity to pitch information.

As already discussed in previous studies (Liu and Kager, 2014; Ramachers et al., 2017; Shi et al., 2017), we assume that the intonation system of the native language and the relation of the tested non-native tone contrast to this system is crucial. Changes in pitch contours are not a unique characteristic of tone languages, as they are also relevant for the intonation of languages like German. In intonation languages, pitch movements have post-lexical functions indicating prosodic (and syntactic) phrasing and pragmatic functions, such that infants growing up with a non-tone language are not fully naïve to pitch variations. In German, rising pitch contours with a nuclear pitch accent (L\*H) are related to sentence-internal boundaries of prosodic phrases and to Yes-No Questions (Grice and Baumann, 2002; Gussenhoven, 2004; Petrone et al., 2017). Since questions are frequently used in communication with infants and toddlers to catch their attention (Spinelli et al., 2017), and even infants and toddlers discriminate question from declarative intonation contours (Geffen and Mintz, 2011; Soderstrom et al., 2011), our finding that German toddlers discriminate high-rising from mid-level tones at 18 months of age lines up with findings from other studies that assume that the native language intonation system has an impact on lexical tone perception in speakers of non-tone languages. Their growing knowledge of German intonation and its relation to the syntactic and pragmatic system may have sharpened, or re-sharpened, their processing of the tonal information in the Cantonese stimuli. However, 5-month-old English-learning infants can discriminate between statements and questions marked by their different prosodic contours (flat vs. rising contour: Geffen and Mintz, 2011; Soderstrom et al., 2011) and German 8-month-olds can detect phrase boundaries that are marked by pitch changes in combination with final lengthening (Wellmann et al., 2012). Given these results, the question arises why a decline in perceptual sensitivity to pitch as marking lexical tone is observed in learners of non-tone languages.

If the assumption that growing knowledge about the language-specific intonation system affects tone discrimination is correct, then the discrimination abilities of 6-month-olds and those of 18-month-olds probably do not rely on the same mechanisms. Discrimination of non-native contrasts in young infants has typically been attributed to extremely sensitive acoustic perception in early development (Aslin et al., 1998), which allows the discrimination of all kinds of minimal sound contrasts. Perceptual reorganization then maintains or sharpens the discrimination of contrasts that are relevant in the linguistic system of the native language, but leads to a decline in the discrimination of sound contrasts that are not relevant in the linguistic system. Thus, we assume that the younger infants still process tone stimuli in a more acoustic manner, and while an infant's native language is expected to influence these results (cf. Yeung et al.'s (2013) findings of language-specific differences in preferences for pitch contours across languages at 4 months of age), there should not be any decline in the ability to perceive differences in contours until a point in the development when infants must learn the linguistic functions of either tonal or intonational contrasts.

The results from the experiment using the habituation procedure with 6- and 9-month-old German infants, along with prior work illustrating the classic pattern of perceptual reorganization, suggest that 9 months of age is perhaps a critical age of interest (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Liu and Kager, 2014). Because our (null) results for 9-month-olds were obtained across both experimental paradigms, we do not consider them to be a reflection of methodological issues. We propose that this decrease in tone discrimination around 9 months is an indication of a milestone in infants' linguistic development, when infants begin to reorganize their perceptual systems to understand how pitch is functionally used in their target language, with an emphasis on word-level meanings. For infants learning German, within-syllable pitch information is not lexically informative, and so like other 9-month-old learners of non-tonal languages, they may start to ignore pitch cues from this age.

A study by Hay et al. (2015) provides data that are related to this general idea. They found that 14-month-old English-learning infants can still use a Mandarin rising and falling tone contrast in word learning by mapping novel objects to labels that differ only in pitch contours. However, 17- and 19-month-olds tested with the same procedure did not respond to this labeling violation (for similar results with English-learning 2-year-olds, see Quam and Swingley, 2010). Testing the 19-month-olds on pure discrimination of the tones using a habituation task further revealed that these older infants could nevertheless discriminate the target tones. Hay et al. (2015) discuss this change across ages as an indication that infants get increasingly more specific about the sound contrasts that they consider to be lexically contrastive. Therefore, the older toddlers do not attend to tone contrasts in a word learning scenario, although they can discriminate them in other contexts. Infants and toddlers in our study were younger, but it may still be the case that their performance reflects shifts in attention related to lexical development. As Bergelson and Swingley (2012) have shown, infants from 6 to 9 months of age may already be strongly focused on word learning, and may be particularly attuned to sound contrasts that are lexically contrastive in their language (i.e., German), while largely ignoring sound contrasts that are not. In intonation languages, attention to tonal information may then potentially increase again when children start to detect semantic or pragmatic functions of the intonational patterns in their language, which could explain why at 18 months German and Dutch infants again showed discrimination of the lexical tones. Further research would be necessary to test this hypothesis.

Future research must explore these ideas further, as lexical development might not be the only factor explaining the dip in discrimination abilities. Other factors, like salience of the contrast, might interact with the lexical development: for example, previous tone discrimination studies have not shown a perceptual decline at 9 months for certain tone contrasts (e.g., Liu and Kager, 2014, 2017; Ramachers et al., 2017; Shi et al., 2017; Tsao, 2017). Relatedly, a perceptual shift has also been reported in the visual domain around the same age, suggesting parallel development across perceptual domains. Data from Lewkowicz and Hansen-Tift (2012) have also shown a U-shaped function in visual scanning, such that infants around 8 to 9 months of age look at the mouth, whereas 4- and 12-month-olds look at the eyes. This shift may be symptomatic of a general increase in attention to certain units (segmental relative to suprasegmental information). Much remains unclear about why infants from 8 to 10 months of age show a specific developmental pattern with respect to tone perception.

### METHODOLOGICAL COMPARISONS

The difference in the 6-month-olds' results between the familiarization and the habituation experiment lines up with previous research, since most other studies have shown lexical tone discrimination with habituation procedures (Liu and Kager, 2014, 2017; Chen and Kager, 2016; Ramachers et al., 2017; Shi et al., 2017; Tsao, 2017), whereas a decline in perceptual sensitivity has mostly been found with studies using a familiarization procedure (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2014). Similar to the findings from Cristia et al. (2016), our results show that the habituation procedure generates larger effect sizes at the group level. Both habituation and familiarization procedures are based on accustoming the participants to one type of stimulus and then measuring differences in the response to the old vs. a new stimulus. As stated in the introduction, we assume that habituation procedures are more adapted to individual variation because they stop the initial exposure phase only when the behavior of the infant indicates a specific level of familiarity with the stimulus. In contrast, familiarization-based procedures use a fixed amount of time or number of presentations and do not take individual differences in processing the stimuli into account. A comparison of the exposure time in our two experiments shows large differences: recall that the familiarization in Experiment 2 was fixed to 30 s of exposure to one of the tones. However, in the habituation experiment, infants needed about six presentation trials and accumulated an overall listening time to the tones of about 90 s before they reached the criterion, suggesting that they had more exposure to the crucial stimulus than the infants in Experiment 2.
This difference may explain why the 6-month-olds discriminate the two tones after habituation, but not after familiarization: the amount of exposure may not have been sufficient for this age group to encode the stimulus in a way that allowed for its discrimination from another stimulus during the test phase. This also suggests that 6-month-olds may show discrimination after a longer familiarization [for effects of familiarization duration on infants' discrimination performance, see Bijeljac-Babic et al. (2012)]. The effect of trial number observed for Experiment 2 corroborates these considerations. Across the test phase, the listening times in 6-month-olds changed: while there was no evidence of discrimination in the first four trials, infants showed significantly different listening times to the two tones in the last trials<sup>3</sup>. This change over the experiment did not hold for the 9-month-olds<sup>4</sup>, which underlines that the discrimination performance by the 9-month-olds was not affected by the methodological modulation but that the effects of perceptual reorganization are rather robust in this age group.
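The contrast between a fixed familiarization phase and an infant-controlled habituation phase can be illustrated with a minimal Python sketch. The criterion used here (mean listening time over the most recent three trials falling below 50% of the mean over the first three trials) is a common sliding-window rule and is an assumption made for illustration only; it is not claimed to be the criterion applied in Experiment 3. The function name, its parameters, and the listening times are hypothetical.

```python
def trials_to_criterion(listening_times, window=3, ratio=0.5):
    """Return the 1-based trial at which habituation is reached, or None.

    Assumed criterion (illustrative, not taken from the study): the mean
    listening time over the most recent `window` trials is below `ratio`
    times the mean over the first `window` trials. The two windows are not
    allowed to overlap.
    """
    if len(listening_times) < 2 * window:
        return None
    baseline = sum(listening_times[:window]) / window
    for end in range(2 * window, len(listening_times) + 1):
        recent = sum(listening_times[end - window:end]) / window
        if recent < ratio * baseline:
            return end
    return None


# Hypothetical per-trial listening times in seconds (not data from the study).
example_times = [24.0, 20.0, 19.0, 10.0, 9.0, 8.0, 7.0]

trial = trials_to_criterion(example_times)
habituation_exposure = sum(example_times[:trial])

# Fixed exposure reported for the familiarization phase of Experiment 2.
FAMILIARIZATION_EXPOSURE_S = 30.0

print(f"criterion reached after trial {trial}")
print(f"exposure: habituation {habituation_exposure:.0f} s "
      f"vs. familiarization {FAMILIARIZATION_EXPOSURE_S:.0f} s")
```

Under this kind of rule the exposure phase ends at a different point for each infant, whereas a familiarization phase ends after the same fixed duration regardless of how the infant responds.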

However, other factors might also explain the different findings in our two experiments: for example, the higher number of different trial types in the SAPP may have made infants' responses less sensitive across the conditions. The SAPP as used in our experiment and in the study by Yeung et al. (2013) included three trial types (one non-alternating containing the familiarized tone, one non-alternating containing the novel tone, and one alternating), whereas the studies that used habituation during the initial exposure phase only presented two different trial types, as we did in our Experiment 3 (habituated tone and novel tone: Chen and Kager, 2016; Chen et al., 2017; Shi et al., 2017; or habituated tone and alternating: Ramachers et al., 2017), or only one trial type (the novel tone: Liu and Kager, 2014, 2017) during the test phase. Our two experiments with the infants also differed in another aspect of the experimental procedure. In Experiment 2, the duration of a head-turn to the presentation side of the acoustic stimulus was measured, while in Experiment 3 we measured visual fixation on a central monitor. We consider it unlikely that this methodological difference was responsible for the differential results across the two experiments, since listening times were the dependent variable in both cases. Moreover, head-turning vs. visual fixation was not considered a highly relevant factor in modulating test-retest reliability data in the analysis by Cristia et al. (2016).

However, the difference in the results of our experiments across the two testing conditions underlines the importance of the methodological decisions made for experiments with infants. To make research undertaken by different labs more comparable, a higher standardization of the methods used for specific research questions is desirable. We agree with Cristia et al. (2016) that this is specifically important for infant research as it is slow and costly, and therefore needs the close collaboration of researchers across institutions and languages.

### CONCLUSIONS

Taken together, our findings suggest an age-related decline in the discrimination of lexical tones between 6 and 9 months with an additional perceptual recovery at the age of 18 months in German-learning infants. The perceptual recovery in toddlers might be driven by their acquisition of the native intonation and pragmatic system, whereas the discrimination at 6 months of age may be attributed to universal listening abilities. The decline in the ability to discriminate a non-native contrast was only evident when using habituation, but not when using familiarization, suggesting that methodological aspects are important to consider in the interpretation of findings from infant studies.

<sup>3</sup>First four trials: Alternating vs. Non-Alternating Tone 25 (β = 0.112, SE = 0.148, t = 0.760, p = 0.45), Alternating vs. Non-Alternating Tone 33 (β = 0.034, SE = 0.148, t = 0.233, p = 0.82). Last four trials: Alternating vs. Non-Alternating Tone 25 (β = 0.226, SE = 0.111, t = 2.017, p = 0.04), Alternating vs. Non-Alternating Tone 33 (β = 0.121, SE = 0.111, t = 1.084, p = 0.28).

<sup>4</sup>First four trials: Alternating vs. Non-Alternating Tone 25 (β = −0.020, SE = 0.148, t = −0.137, p = 0.89), Alternating vs. Non-Alternating Tone 33 (β = −0.183, SE = 0.148, t = −1.242, p = 0.22). Last four trials: Alternating vs. Non-Alternating Tone 25 (β = 0.0085, SE = 0.11, t = 0.073, p = 0.94), Alternating vs. Non-Alternating Tone 33 (β = 0.000098, SE = 0.11, t = 0.001, p = 0.99).

### AUTHOR CONTRIBUTIONS

AG contributed to the design of the work, acquisition and analysis of the data, and drafting of the work. AK contributed to the design of the work and revising of the manuscript. GS contributed to the design of the work and revising of the manuscript. HY contributed to the design and stimuli construction, as well as to the revising of the manuscript. BH contributed to the design of the work, interpretation of the data, and drafting and revising of the manuscript. All authors gave final approval of the version to be submitted. All authors approved the final version and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

### ACKNOWLEDGMENTS

We thank the Potsdam BabyLab Team for recruiting and testing the infants. Thanks to all parents and their children who participated in this study. The research presented here was funded by the DFG (German Research Foundation) as part of the Research Unit Crossing the Borders (FOR 2253) with grants to BH (HO 1960/19-1) and GS (Schw 665/12-1).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Götz, Yeung, Krasotkina, Schwarzer and Höhle. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Development of Mismatch Responses to Mandarin Lexical Tone in 12- to 24-Month-Old Infants

Ying-Ying Cheng<sup>1,2,3</sup> and Chia-Ying Lee<sup>1,2,4,5</sup>\*

<sup>1</sup> Brain and Language Laboratory, Institute of Linguistics, Academia Sinica, Taipei, Taiwan, <sup>2</sup> Institute of Neuroscience, National Yang-Ming University, Taipei, Taiwan, <sup>3</sup> Aim for the Top University Project, National Taiwan Normal University, Taipei, Taiwan, <sup>4</sup> Institute of Cognitive Neuroscience, National Central University, Taoyuan, Taiwan, <sup>5</sup> Research Center for Mind, Brain and Learning, National Chengchi University, Taipei, Taiwan

This study explores the development of mismatch responses (MMRs) to Mandarin lexical tone changes in infants at 12, 18, and 24 months of age using the multi-deviant oddball paradigm with the low dipping Tone 3 (T3) as the standard, the high level Tone 1 (T1) as the large, and the high rising Tone 2 (T2) as the small deviant. The results show that the large acoustic change between T1/T3 elicited mismatch negativity (MMN) in all three age groups. The small acoustic change between T2/T3 elicited a positive mismatch response (P-MMR) at 12 and 18 months of age, but no MMR was found to the T2/T3 change at 24 months. The coexistence of MMN and P-MMR in the same age group implies that different mechanisms were used for discriminating large and small deviants. Infants were able to detect the T1/T3 change automatically and showed adult-like MMN as early as 6 months of age. However, the detection of the T2/T3 change remains effortful in infants under 24 months of age. These findings support the notion that MMN and P-MMR may be used to index the maturation of speech perception.

Keywords: mismatch negativity (MMN), positive mismatch response (P-MMR), infant, lexical tone, Mandarin, event-related potentials (ERPs)

### INTRODUCTION

Discriminating ambient phonetic contrasts is an infant's first step in processing language. Infants' speech perception has been hypothesized to provide a foundation for future word learning (Werker and Yeung, 2005). For decades, studies on how language experience influences the development of speech perception were mainly focused on consonants and vowels. Both behavioral (Werker and Tees, 1984; Polka and Werker, 1994; Kuhl et al., 2006) and electrophysiological (Cheour et al., 1998b; Rivera-Gaxiola et al., 2005) studies have reported declines in discriminating non-native phonetic contrasts and improvement in discriminating native contrasts between 6 and 12 months of age. Moreover, acoustic characteristics of a contrast also play a role in the developmental timetable (Polka et al., 2001; Sundara et al., 2006; Narayan et al., 2010). The current study aims to explore how acoustic characteristics of a contrast might affect the developmental trajectory of electrophysiological response to Mandarin tonal change in infancy.

As this study involves infants' electrophysiological responses to Mandarin lexical tones, a description of the Mandarin tones and of Mandarin-learning infants' and children's learning of these tones is provided ahead of a review of perception of lexical tones, electrophysiological responses and electrophysiological responses to lexical tones in infancy. Lexical tone is one of the features that determine the meaning of a syllable in Mandarin. There are four lexical tones in Mandarin: the high level tone (T1), the high rising tone (T2), the low dipping tone (T3) and the high falling tone (T4). Studies have shown that children make few errors in producing Mandarin tones at 2 to 3 years of age (Li and Tompson, 1977; Hua and Dodd, 2000; Lin et al., 2008). However, the four tones differ in their acquisition rate: the production of T1 and T4 is mastered earlier than that of T2 and T3. As for perception, pitch height and contour are crucial for categorizing Mandarin tones (Gandour and Harshman, 1978; Gandour, 1983). Based on the pitch contour, T1, whose F0 remains level over time, is the most distinct from the other three tones, whereas T2 and T3 are acoustically most similar to each other. Tsao (2017) reported that 6- to 8-month-old infants discriminated T1/T3, T2/T3, and T2/T4 contrasts above chance level using the head-turn paradigm. Discrimination performance for the T1/T3 contrast improved in 10- to 12-month-old infants, but discrimination of the other two contrasts did not improve in the same group of infants. Tsao (2008) also showed that infants at 12 months of age discriminated the T1/T3 contrast more accurately than the T2/T3 and T2/T4 contrasts. At 3 years of age, children could still easily confuse T3 with T2 in a picture-pointing task (Wong et al., 2005). These findings suggest that the size of acoustic changes could affect the developmental timetable of discriminating lexical tone contrasts from 10 months of age.

#### Edited by:

Denis Burnham, Western Sydney University, Australia

#### Reviewed by:

Laurianne Cabrera, UMR 8242 Laboratoire Psychologie de la Perception (LPP), France

Gang Peng, The Hong Kong Polytechnic University, Hong Kong

\*Correspondence: Chia-Ying Lee chiaying@gate.sinica.edu.tw

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 30 August 2017 Accepted: 16 March 2018 Published: 10 April 2018

#### Citation:

Cheng Y-Y and Lee C-Y (2018) The Development of Mismatch Responses to Mandarin Lexical Tone in 12- to 24-Month-Old Infants. Front. Psychol. 9:448. doi: 10.3389/fpsyg.2018.00448

Although the perception of vowels and consonants in infancy has been well explored, relatively few studies have investigated the development of lexical tone perception in infancy. Studies across different tonal languages have suggested the phonological representation of lexical tones could attune to ambient language in the first year of life (Harrison, 2000; Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Cabrera et al., 2015). For example, non-tone language (English and French) infants showed an age-related decline in Thai lexical tone discrimination between 6 and 9 months (Mattock et al., 2008), while tone-language (Mandarin) infants perform equally well at 6 and 9 months for speech (Thai) and non-speech (violin) tone discrimination (Mattock and Burnham, 2006). Yeung et al. (2013) reported that English infants showed declines in Cantonese tone discrimination from 4 to 9 months of age. Moreover, they found that native Mandarin and native Cantonese infants were able to discriminate Cantonese tones at both ages. However, the Mandarin and Cantonese groups showed distinct preferences. Yeung et al. (2013) thus suggested that the perceptual reorganization for lexical tones could begin as early as 4 months of age. Also, the cues used in tone discrimination may change with age. For example, Cabrera et al. (2015) reported that Mandarin and French infants performed equally well in discriminating Mandarin tonal contrasts at 6 months of age, whereas at 10 months of age Mandarin infants relied more on frequency-modulation cues and French infants relied more on amplitude-modulation cues.

In contrast, some studies have shown that the discrimination of lexical tone contrasts in non-tone language infants does not always decline with age (Liu and Kager, 2014; Shi et al., 2017). Dutch infants' performance in discriminating the Mandarin T1/T4 contrast showed a U-shaped developmental pattern; that is, infants can discriminate the T1/T4 change at 5–6 and 17–18 months but not at ages in between. Moreover, the rebound of sensitivity is larger when the contrast is acoustically more distinct (Liu and Kager, 2014). In another study, French 4- to 11-month-old infants' discrimination of the acoustically similar T2/T3 contrast declined with increasing age, but their discrimination of the acoustically less similar T1/T4 contrast remained constant across ages (Shi et al., 2017). Taken together, the extent to which sensitivity to lexical tone contrasts declines in non-tone language infants could depend on the size of the acoustic difference. Tsao (2008, 2017) examined native Mandarin infants' sensitivity in discriminating Mandarin lexical tones that varied in acoustic similarity. These studies found that 6- to 8-month-old infants discriminated the acoustically dissimilar T1/T3 contrast and the acoustically similar T2/T3 contrast equally well. By 10 to 12 months, infants showed improved accuracy in discriminating the T1/T3 contrast but no such improvement for the T2/T3 contrast (Tsao, 2008, 2017). As with the Shi et al. (2017) study, this also suggests that the size of acoustic differences plays a role in the developmental timetable of lexical tone discrimination (Tsao, 2017).

A growing body of studies has used mismatch negativity (MMN), an event-related potential (ERP) component for auditory change detection, to investigate the development of speech perception. Typically, MMN is observed in a passive oddball paradigm by subtracting ERPs to the standard sounds from those to the deviant sounds. MMN can be elicited without the participant attending to the stimuli. Therefore, it has been widely used in studying auditory perception in infancy. MMN is hypothesized to index automatic change detection when the incoming sound violates the regularity of the previously exposed sequence (Winkler, 2007; Näätänen et al., 2011). MMN amplitude increases and latency decreases as the magnitude of change increases. Furthermore, MMN amplitude can be shaped by the accumulation of language experience in both infancy and adulthood. For example, MMN amplitude to native vowel contrasts increases between 6 and 12 months, while that to non-native contrasts decreases (Cheour et al., 1998b). MMN to Finnish vowel contrasts in fluent Finnish-learning Hungarians is comparable to that of native Finnish speakers, but Hungarians naïve to Finnish did not show an MMN (Winkler et al., 1999). This indicates that MMN could index the development of phonological representation in language acquisition. Therefore, in this study ERP was used as a tool to explore how maturation and the size of acoustic changes might affect the development of neurophysiological responses to Mandarin lexical tone contrasts.
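The subtraction that yields a mismatch response can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis pipeline of the present study: epochs are represented as plain lists of voltages for a single channel, the function names are hypothetical, and the toy values and the 200 to 400 ms window are invented for the example.

```python
from statistics import fmean


def average_erp(epochs):
    """Average equal-length single-trial epochs sample by sample."""
    return [fmean(samples) for samples in zip(*epochs)]


def difference_wave(standard_epochs, deviant_epochs):
    """Deviant ERP minus standard ERP at every time point."""
    standard_erp = average_erp(standard_epochs)
    deviant_erp = average_erp(deviant_epochs)
    return [d - s for d, s in zip(deviant_erp, standard_erp)]


def mean_amplitude(wave, times_ms, start_ms, end_ms):
    """Mean of the wave over the half-open window [start_ms, end_ms)."""
    return fmean(v for v, t in zip(wave, times_ms) if start_ms <= t < end_ms)


# Toy single-channel data in microvolts (illustrative values only).
times_ms = [0, 100, 200, 300, 400]
standard_epochs = [[0.0, 1.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 3.0, 2.0, 0.0]]
deviant_epochs = [[0.0, 1.0, -1.0, -2.0, 0.0],
                  [0.0, 1.0, -3.0, -2.0, 0.0]]

wave = difference_wave(standard_epochs, deviant_epochs)
amplitude = mean_amplitude(wave, times_ms, 200, 400)
print(wave, amplitude)
```

In this toy case the mean amplitude of the difference wave in the chosen window is negative, which is the sign pattern associated with MMN; a positive value in the same window would correspond to a positive mismatch response.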

In adults, MMN is typically characterized as a frontally distributed negativity peaking between 100 and 250 ms after the onset of a stimulus. In infants younger than 12 months, studies have reported adult-like MMN to changes in pure tones (Alho et al., 1990; Cheour et al., 2002a,b), durations (Brannon et al., 2004, 2008), vowel contrasts (Cheour-Luhtanen et al., 1995; Cheour et al., 1998a; Kushnerenko et al., 2001; Martynova et al., 2003), and tonal contrasts (Cheng et al., 2013). MMN in infants generally peaks around 300 ms or even later and persists for a longer period compared with MMN in adults. By contrast, other studies have observed a positive mismatch response (P-MMR) between 200 and 450 ms in infants. P-MMRs have also been reported for changes in various features, such as changes in the frequency of pure tones (Leppänen et al., 1997; Morr et al., 2002; Novitski et al., 2007), vowel durations (Friederici et al., 2002), and phonetic contrasts (Dehaene-Lambertz and Dehaene, 1994; Dehaene-Lambertz and Baillet, 1998; Dehaene-Lambertz and Pena, 2001) in newborns and infants younger than 5 months of age. In sum, the polarity and latency of mismatch responses in infancy are highly inconsistent across studies. Therefore, in the following we use MMR as a superordinate term to refer to either MMN or P-MMR found between 100 and 450 ms.

Positive mismatch response has been mainly found in younger infants, especially for smaller deviants. However, the characteristics of P-MMR remain unclear. Studies measuring MMRs across ages have reported P-MMR at 2 to 3 months of age, whereas adult-like MMN is revealed at 4 to 6 months of age and becomes more dominant as the children grow older (Kushnerenko et al., 2002; Trainor et al., 2003; He et al., 2007). Since P-MMRs have mainly been found at younger ages, the presence of P-MMR has been suggested to be related to infants' maturational status. Other studies have reported that P-MMR tends to be found when it is more difficult to discriminate the change. For example, a smaller pure tone deviant (1000 Hz vs. 1200 Hz) elicits P-MMR in infants younger than 12 months, whereas a larger deviant (1000 Hz vs. 2000 Hz) elicits adult-like MMN from as young as 2 months (Morr et al., 2002). P-MMR has also been found in children as old as 6 to 7 years of age, for smaller deviants (frequency: 1000 Hz vs. 1060 or 1030 Hz; phoneme: "ba" vs. "ta" or "da") presented with relatively short inter-stimulus intervals (ISI) (Maurer et al., 2003a,b). Similar to MMRs elicited by changes in pure tones, P-MMR to a small deviant and MMN to a large deviant have also been observed for phonemic contrasts in 6-month-old infants (Cheng et al., 2013, 2015) and preschoolers (Lee et al., 2012). This suggests that stimulus-related factors, such as short ISI and smaller deviants, also determine the presence of P-MMR.

In addition, studies have shown that P-MMR is more likely to be found in children from disadvantaged language and reading backgrounds. Children with specific language impairment (SLI) required a frequency deviant of more than 10% relative to the 1000 Hz standard to elicit MMN, whereas a deviant of 2–5% is sufficient for the transition from P-MMR to MMN in the age-matched controls (Ahmmed et al., 2008). Also, children with a family history of dyslexia tend to have more positive P-MMRs than their age-matched controls (Maurer et al., 2003a,b). Furthermore, a recent study suggested that the polarity of MMRs could depend on language experience. A relatively difficult English /ta/ vs. /pa/ contrast elicited P-MMR in 11- to 14-month-old infants who had been exposed only to impoverished language input but elicited MMN in an age-matched group who had been exposed to richer language input (Garcia-Sierra et al., 2016). In summary, the presence of P-MMR depends on maturation, the difficulty of the contrasts, the presentation of stimuli, and individual language backgrounds. All these factors should be taken into consideration when investigating the development of MMRs in infancy.

The aim of this study is to systematically explore how maturation and deviant size affect the development of MMRs to lexical tone contrasts in native Mandarin infants. Regarding MMN to lexical tone contrasts, few studies have examined how deviant size affects MMN to Mandarin lexical tones. Cheng et al. (2013) used the acoustically distinct T1/T3 contrast as the large deviant and the acoustically similar T2/T3 contrast as the small deviant and demonstrated that MMN to the T1/T3 contrast has larger amplitude and earlier latency than MMN to the T2/T3 contrast in adults. This finding is congruent with Chandrasekaran et al. (2007), who reported that the deviant size effect on MMN was only found in native Mandarin speakers and not in English speakers without prior experience with a tonal language. Hsu et al. (2014) used the same set of stimuli as Cheng et al. (2013) in a magnetoencephalography (MEG) study to investigate how the deviant size modulates the neural generators underlying the magnetic mismatch response (MMNm) for detecting different magnitudes of lexical tone changes. The more distinct T1/T3 contrast showed larger MMNm in the left hemisphere in comparison with the less distinct T2/T3 contrast. Most critically, the source analysis demonstrated that deviant size affected laterality and the time course of activations in the temporal and frontal cortex. The large deviant showed a greater left-lateralization in superior and middle temporal gyrus. Meanwhile, a set of frontal generators was activated at a later time window to the small deviant, which reflects different top-down mechanisms in responding to large and small deviants (Hsu et al., 2014).

Other studies have also used the same set of stimuli to investigate the developmental trajectories of Mandarin lexical tone perception in preschoolers (Lee et al., 2012) and early infancy (Cheng et al., 2013). The T1/T3 contrast elicited an adult-like MMN in 4-, 5-, and 6-year-olds, but the T2/T3 contrast elicited P-MMR (Lee et al., 2012). The presence of MMN to the T1/T3 contrast suggests that the transition from P-MMR to MMN should occur at a younger age. Cheng et al. (2013) reported that MMRs to the T1/T3 contrast switched from P-MMR in newborns to MMN at 6 months of age. As for the T2/T3 contrast, no significant MMR was found in newborns, and P-MMR was found at 6 months of age. Cheng et al. (2013) suggested that the deviant size effect could be observed in newborns and infants at 6 months of age. However, there is still a gap in empirical evidence between 1 year and 4 years of age.

Meanwhile, other studies with preschoolers and school-age children (Liu et al., 2014; Chen et al., 2016) reported no P-MMR but a late negativity between 385 and 535 ms to the T2/T3 contrast, which the authors likened to an adult-like late discriminative negativity (LDN). Liu et al. (2014) suggested that the single-deviant paradigm used in their study might reduce contextual difficulty in comparison with the multi-deviant oddball paradigm and result in the absence of P-MMR. However, further studies are required to examine this account. LDN is a frontocentrally distributed negativity between 400 and 700 ms after stimulus onset. The LDN has a number of specific qualities as follows. The LDN is predominant in children (Korpilahti et al., 1995; Cheour et al., 2001; Korpilahti et al., 2001; Bishop et al., 2011) and tends to decrease with age (Hommet et al., 2009; Bishop et al., 2011; Liu et al., 2014). In addition, the LDN is more prominent in response to smaller deviants (Bishop et al., 2011) and in children with SLI (Bishop et al., 2010; Kujala and Leminen, 2017), dyslexia (Neuhoff et al., 2012) and attention deficit/hyperactivity disorder (Yang et al., 2015). The LDN is suggested to reflect additional processing for sounds that are difficult to discriminate (Bishop et al., 2011; Liu et al., 2014) and could be associated with higher cognitive functions such as attention-related processing or long-term memory (Neuhoff et al., 2012; Kujala and Leminen, 2017). Chen et al. (2016) reported that a subgroup of 3-year-olds with persistent language delay (PLD) showed a positive MMR to the T2/T3 contrast in 185–335 ms, even though the grand mean of all their 3-year-old participants showed LDN. Taken together, although the ERPs elicited by the T2/T3 contrast are inconsistent across studies, there is a consensus that the T2/T3 contrast does not elicit stable MMN in early childhood. In sum, these studies suggest that the deviant size modulates the polarity of MMR. Given that MMN indexes a pre-attentive automatic change detection, the transition from P-MMR to MMN for the T1/T3 contrast at 6 months of age suggests that the phonological representation matures for automatically detecting the T1/T3 change by that age. However, the absence of MMN to the T2/T3 contrast suggests that the processing of the less distinct small lexical tone contrasts is still in the process of developing.

The current study aims to further explore the developmental trajectory of MMRs to Mandarin lexical tones from 12 to 24 months by using the same stimuli as Cheng et al. (2013). Given the observation of MMN to the T1/T3 contrast at 6 months in Cheng et al. (2013), an adult-like MMN was expected to be seen from 12 to 24 months of age. As for the T2/T3 contrast, Cheng et al. (2013) reported no MMR in newborns and a P-MMR at 6 months of age. Other studies reported inconsistent findings regarding whether T2/T3 would elicit P-MMR in toddlers (Liu et al., 2014; Chen et al., 2016). Therefore, the developmental trajectory of MMR to the T2/T3 contrast and the deviant size effect across ages will be the critical observations in this study.

### MATERIALS AND METHODS

### Participants

EEG was collected from three groups of infants aged 12, 18, and 24 months. For the 12-month-old group, 28 infants attended, but only 14 completed the experiment (3 girls; mean age: 12 months 5 days; range: 11 months 22 days to 12 months 14 days). For the 18-month-old group, 26 infants attended, of whom 20 completed the experiment (7 girls; mean age: 18 months 5.3 days; range: 17 months 25 days to 18 months 17 days). As for the 24-month-old group, 29 infants attended, of whom 19 completed the experiment (7 girls; mean age: 24 months 6.4 days; range: 24 months 1 day to 24 months 18 days). All infants were full-term (gestational age ranged from 37 to 40 weeks) and their parents were native speakers of Mandarin Chinese. All infants passed the otoacoustic emission test for hearing screening at birth. Infants' cognitive function was assessed using the Bayley Scales of Infant Development-Second Edition: the Mental Developmental Index (BSID-II MDI) before they participated in the EEG recording. Most participants had their BSID-II MDI score within normal range; each group had one infant whose BSID-II MDI score fell in the borderline range (<85 and >70), but none fell below the normal range in their follow-up assessment on the BSID-II MDI 6 months later.

### Design of ERP Experiments Stimuli

The stimuli were the same as those used in Lee et al. (2012) and Cheng et al. (2013). The stimuli consisted of the syllables yi1 (T1), yi2 (T2), and yi3 (T3), which share the same vowel [i] but differ in their pitch contour. Syllable yi1 (T1) is a high-level tone with the fundamental frequency (F0) around 230 Hz. Syllable yi2 (T2) is a high rising tone with F0 rising from 180 to 200 Hz. Syllable yi3 (T3) is a low dipping tone with F0 descending from 100 to 80 Hz and then rising back to 100 Hz. T3 was assigned as the standard; T1 was assigned as the large deviant (level vs. contour); T2 was assigned as the small deviant (contour vs. contour). Stimuli were spoken by a female native speaker of Mandarin and recorded at 16 bits with a sampling rate of 44 kHz. The intensity of the stimuli was normalized to 70 dB, and the duration of stimuli was scaled to 250 ms with Sony Sound Forge 9.0 software.

### Procedure of Multi-Deviant Oddball Paradigm

During data collection, infants were seated in a high chair or on their caregiver's lap and watched silent cartoons or puppet play, which kept them engaged and minimized their movement. The stimuli were presented at a sound pressure level (SPL) of 70 dB through a set of loudspeakers placed approximately 75 cm in front of the infant. The experimental session started with 20 repetitions of the standard (T3) followed by 1000 trials composed of 80% of the standard (T3), 10% of the large deviant (T1), and 10% of the small deviant (T2). The stimuli were presented in a pseudo-randomized sequence, in which at least two standards were presented between any two deviants. In each trial, stimuli lasted for 250 ms with a 500 ms ISI. The whole experiment took about 40 min.
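The sequence constraints described above can be illustrated with a short sketch. This is not the authors' stimulus-delivery code; the function name `make_oddball_sequence`, the seed, and the way spare standards are scattered across gaps are assumptions made only for illustration. The sketch builds 20 lead-in standards followed by 1000 trials (800 T3, 100 T1, 100 T2) with at least two standards between any two deviants.

```python
import random

def make_oddball_sequence(n_trials=1000, n_lead_in=20, p_deviant=0.10,
                          min_gap=2, seed=0):
    """Return a list of 'T3' (standard), 'T1' (large deviant), 'T2' (small deviant).

    Hypothetical helper: the randomization scheme is an assumption, only the
    proportions and the minimum-gap rule come from the text.
    """
    rng = random.Random(seed)
    n_each = round(n_trials * p_deviant)           # 100 trials per deviant type
    deviants = ["T1"] * n_each + ["T2"] * n_each
    rng.shuffle(deviants)
    n_standards = n_trials - len(deviants)          # 800 standards
    # Reserve min_gap standards before every deviant, then scatter the
    # remaining standards at random over the gaps (plus one trailing gap).
    spare = n_standards - min_gap * len(deviants)
    if spare < 0:
        raise ValueError("not enough standards to satisfy min_gap")
    extra = [0] * (len(deviants) + 1)
    for _ in range(spare):
        extra[rng.randrange(len(extra))] += 1
    sequence = ["T3"] * n_lead_in
    for i, dev in enumerate(deviants):
        sequence += ["T3"] * (min_gap + extra[i]) + [dev]
    sequence += ["T3"] * extra[-1]
    return sequence

seq = make_oddball_sequence()
```

Reserving the minimum gap first and distributing the leftover standards afterwards guarantees the spacing rule by construction, so no rejection sampling is needed.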

### EEG Recording and Data Analysis

EEG signals were amplified by NuAmps (Neuroscan Inc.) in direct current (DC) mode, with 100 Hz low-pass and 60 Hz notch filters. Signals were recorded continuously and digitized at a rate of 500 Hz. Signals were recorded from FPz, F3, Fz, F4, C3, C4, O1, O2, and left (M1) and right mastoids (M2) through Ag/AgCl electrodes held with an elastic cap (QuickCap, Neuromedical Supplies, Sterling, VA, United States). Eye movement was monitored with two electrodes attached to the supra-outer canthus of the left eye and infra-outer canthus of the right eye. In the online recording, FPz was considered as ground, and Fz was taken as reference.

For offline processing, the EEG data were re-referenced to the average of M1 and M2. The continuous EEG was segmented into epochs of 800 ms including 100 ms pre-stimulus intervals for baseline correction. A 1 to 20 Hz bandpass filter (zero phase shifting, 12 dB/oct) was applied. Trials with voltage variation exceeding ±100 µV on any electrode were rejected from further analysis. The first 20 standards were excluded from the analysis and only those standards preceded by at least three standards were analyzed to fully control the sequence effect. The grand-averaged ERPs for the standard, the large deviant, and the small deviant were calculated for each participant and electrode. The average number of trials and their standard deviations for each deviant in each age group were: 68.79 (13.55) for T1 and 67.57 (10.75) for T2 in 12-month-old group; 69.9 (17.46) for T1 and 69.5 (17.46) for T2 in 18-month-old group; 72 (12.77) for T1 and 71.89 (12.59) for T2 in 24-month-old group.
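The epoch bookkeeping described above can be sketched as follows. The helper names are hypothetical, and two points are assumptions rather than details given in the text: the ±100 µV criterion is read here as an absolute-amplitude limit on baseline-corrected samples, and epochs are represented as plain lists of µV values per channel.

```python
FS_HZ = 500  # digitization rate reported in the recording section

def ms_to_samples(ms, fs=FS_HZ):
    """Convert a duration in milliseconds to a number of samples."""
    return int(round(ms * fs / 1000))

def baseline_correct(epoch, n_base):
    """Subtract the mean of the first n_base (pre-stimulus) samples."""
    base = sum(epoch[:n_base]) / n_base
    return [v - base for v in epoch]

def is_clean(epoch_by_channel, limit_uv=100.0):
    """Accept a trial only if no channel exceeds +/- limit_uv (assumed reading)."""
    return all(abs(v) <= limit_uv
               for samples in epoch_by_channel.values()
               for v in samples)

def usable_standard_indices(sequence, n_lead_in=20, min_preceding=3):
    """Indices of standards kept for averaging: lead-in trials are dropped and
    each kept standard must follow at least min_preceding standards."""
    keep = []
    for i, label in enumerate(sequence):
        if i < n_lead_in or label != "T3":
            continue
        previous = sequence[i - min_preceding:i]
        if all(p == "T3" for p in previous):
            keep.append(i)
    return keep

# An 800 ms epoch with a 100 ms baseline at 500 Hz
n_epoch, n_base = ms_to_samples(800), ms_to_samples(100)
```

With these settings an epoch holds 400 samples, of which the first 50 form the baseline.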

Mismatch response is typically distributed at frontal to central sites; therefore, we analyzed electrodes F3, Fz, F4, C3, and C4. To screen the time course of MMRs, we performed a two-tailed paired t-test between the standard and each deviant on each sample point in intervals between 100 and 500 ms. Paired t-tests were conducted independently at each of the selected sites. MMR was considered meaningful and is reported here when significant (p < 0.05) time points extended consecutively for more than 30 ms (Guthrie and Buchwald, 1991). To handle the problem of multiple comparisons, we further examined the identified MMRs by a cluster-based random permutation analysis (Maris and Oostenveld, 2007). First, the consecutive time points with an alpha level less than 0.05 were grouped into clusters. A cluster-level test statistic was calculated by summing all the individual t-values within each cluster. Then, computing 1000 randomized cluster-level statistics created a null distribution. Finally, the actual observed cluster-level statistics were compared against the null distribution. If the summed t-value of a cluster fell into the highest or lowest 2.5 percentile, the cluster was considered to be significant (alpha < 0.05, two-tailed) in the cluster-based permutation. Clusters with intervals longer than 30 ms and the significance of cluster-based permutations are reported in the following section.
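A minimal sketch of this two-stage procedure at a single electrode is given below. It is an illustration under stated assumptions, not the analysis code used in the study: the randomization is implemented as random sign-flipping of each participant's deviant-minus-standard difference wave, the same cluster definition is applied to every permuted data set, the critical t-value is passed in by the caller (for example 2.160 for df = 13, two-tailed alpha = 0.05), and the two-tailed comparison is made on absolute cluster sums. All function names are hypothetical.

```python
import math
import random

def paired_t(diffs):
    """Paired t statistic for one sample point across participants."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)

def find_clusters(t_values, t_crit, min_len):
    """Runs of consecutive same-sign points with |t| > t_crit, at least
    min_len samples long. Returns (start, stop_exclusive, summed_t) tuples."""
    clusters, start, run_sign = [], None, 0
    for i in range(len(t_values) + 1):       # one extra step closes the last run
        sign = 0
        if i < len(t_values) and abs(t_values[i]) > t_crit:
            sign = 1 if t_values[i] > 0 else -1
        if start is not None and sign != run_sign:
            if i - start >= min_len:
                clusters.append((start, i, sum(t_values[start:i])))
            start = None
        if start is None and sign != 0:
            start, run_sign = i, sign
    return clusters

def cluster_permutation(standard, deviant, t_crit, min_len,
                        n_perm=1000, seed=0):
    """standard, deviant: one list of samples per participant.
    Returns (start, stop, summed_t, p) for each observed cluster."""
    diffs = [[d - s for s, d in zip(s_row, d_row)]
             for s_row, d_row in zip(standard, deviant)]
    n_samples = len(diffs[0])

    def t_series(rows):
        return [paired_t([row[j] for row in rows]) for j in range(n_samples)]

    observed = find_clusters(t_series(diffs), t_crit, min_len)
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        flipped = []
        for row in diffs:
            sign = rng.choice((-1, 1))       # assumed randomization scheme
            flipped.append([sign * v for v in row])
        sums = [c[2] for c in find_clusters(t_series(flipped), t_crit, min_len)]
        null.append(max(sums, key=abs, default=0.0))
    results = []
    for start, stop, t_sum in observed:
        n_extreme = sum(1 for x in null if abs(x) >= abs(t_sum))
        results.append((start, stop, t_sum, n_extreme / n_perm))
    return results
```

At the 500 Hz sampling rate reported above, the 30 ms duration criterion corresponds to `min_len=15` samples.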

### RESULTS

### MMR at 12 Months

Mismatch responses were present but not particularly robust at 12 months of age (**Figure 1**). For the T1/T3 contrast, a negative cluster was found at C4 in the 150–182 ms interval, but it was not significant in the cluster-based permutation (p = 0.122). No positive cluster was found for the T1/T3 contrast. As for the T2/T3 contrast, positive clusters were found between 302 and 358 ms at F3, between 302 and 370 ms at C3, and between 388 and 464 ms at C4. The cluster-based permutation showed significant P-MMR to T2/T3 at F3 (p = 0.037), C3 (p = 0.027), and C4 (p = 0.024). The intervals of the clusters and the significance in the cluster-based permutation are summarized in **Table 1**.

<sup>∗</sup>p < 0.05 in the cluster-based permutation.

### MMR at 18 Months

Mismatch responses to the two contrasts showed distinctly different patterns (**Figure 2**). For the T1/T3 contrast, negative clusters were found at all selected electrodes. Their intervals were 212–278 ms at F3, 214–284 ms at Fz, 222–288 ms at F4, 210–250 ms at C3, and 248–278 ms at C4. The cluster-based permutation showed that the MMN to T1/T3 was significant at F3 (p = 0.026), Fz (p = 0.039), and F4 (p = 0.042). For the T2/T3 contrast, positive clusters were found in two intervals. An early positive cluster was found between 122 and 170 ms at F3, between 122 and 162 ms at Fz, between 130 and 162 ms at C3, and between 134 and 168 ms at C4. The early positive clusters fulfilled the criterion of significance at only F3 (p = 0.042) in the cluster-based permutation. In later intervals, positive clusters were found at all selected electrodes. Their intervals were 284–392 ms at F3, 310–384 ms at Fz, 308–372 ms at F4, 286–360 ms at C3, and 330–388 ms at C4. The cluster-based permutation showed significant P-MMR to T2/T3 at F3 (p = 0.007), Fz (p = 0.025), F4 (p = 0.039), and C3 (p = 0.033).

### MMR at 24 Months

At 24 months, the T1/T3 contrast elicited MMN. The T2/T3 contrast did not elicit MMR, but a negative cluster was found in intervals later than 400 ms (**Figure 3**). For T1/T3, negative clusters were found in the intervals 222–274 ms at F3, 218–284 ms at Fz, 232–278 ms at F4, and 222–296 ms at C4. The cluster-based permutation showed significant MMN to T1/T3 at Fz (p = 0.02) and C4 (p = 0.018). Following the MMN, a positive cluster was found for T1/T3 in the interval 392–424 ms at F3, but it was not significant in the cluster-based permutation. The positive cluster was no longer found for T2/T3. Instead, a negative cluster was found for T2/T3 in the 428–500 ms interval at F3 (p = 0.024). This interval was relatively late in comparison with MMR, which suggests that the late negativity might be a component related to change detection but distinct from MMN. The nature of the late negativity for T2/T3 remains unclear, and this will be discussed in the next section.

### DISCUSSION

This study applied the same set of stimuli as Cheng et al. (2013) to explore how deviant size affects the development of MMRs to Mandarin lexical tone discrimination in infants from 12 to 24 months of age. MMN has been well-established in adults to index automatic change detection (Winkler, 2007; Näätänen et al., 2011). However, studies with infants and young children often report a P-MMR instead of an MMN, and demonstrate the developmental change from P-MMR to MMN with age (Kushnerenko et al., 2002; Morr et al., 2002; Lee et al., 2012; Cheng et al., 2013, 2015). Thus, the transition from P-MMR to MMN could serve as a neural marker to index when infants may automatically detect the auditory changes of a set of phonological contrasts. Our findings show that an acoustically large change (T1/T3) elicits MMN in infants at all three ages: 12, 18, and 24 months. In contrast, the acoustically small change (T2/T3) elicits no MMN at any age and a P-MMR in infants at 12 and 18 months of age but not at 24 months. Together, the results of this study and that of Cheng et al. (2013) indicate the developmental trajectory of MMR from birth to 24 months of age. The large deviant T1/T3 elicits P-MMR in newborns. This P-MMR transitions into an MMN at 6 months of age and this MMN is sustained at 12, 18, and 24 months of age. As for the small deviant T2/T3, no MMR is found in newborns. The P-MMR appears at 6 months and is sustained at 12 and 18 months but disappears at 24 months. As the T1/T3 and T2/T3 contrasts differ in the pattern of MMRs in all age groups from birth to 24 months of age, it is possible that two types of underlying mechanisms support the discrimination of the T1/T3 and T2/T3 contrasts and that these change with development.
Although it is unclear whether the absence of MMR at 24 months implies that a transition from P-MMR to MMN would occur at a later age, our current data suggest that infants under 24 months of age are not able to detect the change between T2 and T3 automatically. The potential applications of how the polarity of MMR may be used to index the maturation of lexical tone perception are discussed below.

Following the P-MMR to T2/T3 reported at 6 months of age by Cheng et al. (2013), the current study showed that P-MMR to T2/T3 remained until 18 months and disappeared at 24 months. This is consistent with the idea that P-MMR tends to be found at younger ages (Dehaene-Lambertz and Dehaene, 1994; Friederici et al., 2002; Kushnerenko et al., 2002; Shafer et al., 2011), and to more difficult discriminations, i.e., to smaller deviants (Morr et al., 2002; Lee et al., 2012; Cheng et al., 2013, 2015). Despite the indication from these results that the P-MMR might reflect a less mature speech discrimination process, the functional significance of the P-MMR remains unclear. Nevertheless, given that Tsao (2017) has demonstrated that 6- to 12-month-old infants can discriminate T2/T3 at above chance level, the P-MMR elicited by T2/T3 in the current study might imply a less mature change detection mechanism than for the MMN.

Indeed, the coexistence of MMN to the T1/T3 contrast and P-MMR to the T2/T3 contrast at 12 and 18 months suggests that detecting the two contrasts could depend on different mechanisms. Similar observations for the coexistence of MMNs and P-MMRs have also been reported in other studies with infants (Morr et al., 2002; Friedrich et al., 2009; Cheng et al., 2013, 2015). Friedrich et al. (2004, 2009) suggested that P-MMR could reflect the effort to perceptually categorize the incoming stimuli before the change detection becomes automatic. Our finding of the coexistence of MMN and P-MMR between 6 and 18 months suggests that stimulus-dependent factors might affect whether effortful processes of perceptual categorization or more automatic processes are used for auditory change detection. Moreover, the coexistence of MMN and P-MMR is not limited to infancy. It has been found in preschoolers and school-age children (Ahmmed et al., 2008; Lee et al., 2012), especially those who have a history of language and reading disability (Maurer et al., 2003a), and in adults, when a contrast is extremely difficult to discriminate (Kuo et al., 2014). In this latter adult study, it was found that a 1-channel cochlear implant (CI) simulation of the T1/T4 contrast elicited P-MMR in adults with normal hearing, while the natural spoken T1/T4 contrast and the 8- and 32-channel simulations of the T1/T4 contrast elicited MMN. This presence of P-MMR in adults when the spectro-temporal properties of speech sound are drastically degraded supports the idea that P-MMR may index effortful discrimination. Together with Cheng et al. (2013), this series of MMN studies on infants' lexical tone discrimination shows that the T2/T3 contrast elicits no MMR at birth and P-MMR from 6 to 18 months of age. These findings suggest that, for infants under 24 months of age, phonological representations are still developing and are not yet sufficient to automatically discriminate small deviant changes of Mandarin tones.

The current study found two types of deviant size effects. Between 12 and 18 months, the large deviant T1/T3 contrast elicited MMN, but the small deviant T2/T3 contrast elicited P-MMR. Thus, the deviant size effect was reflected in the polarity of MMRs. This pattern is congruent with the deviant size effect on MMRs to lexical tone changes in 6-month-old infants (Cheng et al., 2013), and in 4- to 6-year-old preschoolers (Lee et al., 2012). However, the data show that 24-month-old infants exhibit MMN to the T1/T3 contrast but no MMR to the T2/T3 contrast. So the pattern of deviant size effect turned from the polarity of MMR into the presence or absence of MMN. Other studies have reported the disappearance of MMR in a particular age period. For example, Morr et al. (2002) examined how the deviant size affects the maturation of MMRs to the small (1000/1200 Hz) and large (1000/2000 Hz) frequency changes in infants from 3 to 47 months of age. They found that in infants under 12 months, a small frequency deviant elicited P-MMR, while a large deviant elicited MMN. In other age groups between 13 and 47 months, the large deviant continuously elicited MMN, but no significant MMR was found for the small deviant. The absence of MMR in certain age periods suggests that P-MMR may not transition to MMN immediately, and the time required for the transition from P-MMR to MMN could depend on the discriminability of contrasts. Cheng et al. (2013) reported that MMRs to the more discriminable T1/T3 change switched from P-MMR to MMN between the newborn period and 6 months. The current study showed that the less discriminable T2/T3 contrast elicited P-MMR until 18 months of age, but there was no MMR at 24 months of age, which suggests that brain response to the T2/T3 contrast requires a longer period to transition from P-MMR to MMN than does the brain response to T1/T3. However, the current data are not sufficient to evaluate how long it would take to switch from P-MMR to MMN. 
Further study is required to determine the age of emergence of MMN to T2/T3.

However, the absence of MMR to T2/T3 in infants at 24 months of age is unexpected, since Lee et al. (2012), using the same experimental design, reported P-MMR to T2/T3 in children between 4 and 6 years of age. Lee et al. (2012) suggested that discrimination of T2/T3 remains effortful in preschoolers. Meanwhile, other studies using the single-deviant paradigm reported no MMR to T2/T3 in children at 3, 5, and 6 years of age (Liu et al., 2014; Chen et al., 2016). One potential account may involve individual differences. Previous studies have suggested that it is more likely to find a P-MMR than an MMN in children with disadvantaged language and reading backgrounds, such as those with SLI (Ahmmed et al., 2008) or a family history of dyslexia (Maurer et al., 2003a,b). Chen et al. (2016) reported no MMR to the T2/T3 contrast in the 185–355 ms interval in 3-year-old children (n = 30). However, when they subdivided the children into three groups (PLD, n = 10; late bloomer, LB, n = 10; and typical language development, TLD, n = 10), they found P-MMR for PLD and MMN for TLD children. That is, in grouped results, the MMNs of those who achieve automatic detection (N-responders) may be masked by the P-MMRs of those who still rely on less mature processing (P-responders). Given that Lee et al. (2012) have consistently reported P-MMR in preschoolers from 4 to 6 years, the current absence of P-MMR in the 24-month-old infants may be due to individual variations in their language abilities. Unfortunately, both the current study and Lee et al. (2012) had relatively small sample sizes (n = 14∼19 for each age group), too small for subgrouping the participants into different ability groups. Further studies with larger sample sizes and adding behavioral measures of language development are required to explore the proportion of P-responders or N-responders to small deviants in the TLD population. In this way, ERP measures could provide further information about speech perception in early infancy.

In addition, the T2/T3 contrast elicited a late negativity in infants at 24 months of age, with neither P-MMR nor MMN preceding this late negativity. A possible account is that the late negativity may resemble the LDN, which reflects higher cognitive functions for discriminating the T2/T3 contrast. LDN is typically a long-lasting negative deflection from 400 to 700 ms (Korpilahti et al., 1995; Cheour et al., 2001; Korpilahti et al., 2001; Bishop et al., 2011). Yang et al. (2015) used the same set of stimuli as the current study to examine auditory change detection of Mandarin lexical tones in 6- to 12-year-old children with or without ADHD. In response to the T2/T3 contrast, both groups elicited significant LDN from 400 to 700 ms, and neither P-MMR nor MMN preceded this LDN. However, in children between 4 and 6 years of age, the T2/T3 contrast elicited a late negativity in the 385–535 ms interval (Liu et al., 2014; Chen et al., 2016). In the current study, the late negativity for the T2/T3 contrast at 24 months was restricted to between 428 and 500 ms, rather than a long-lasting negative deflection from 450 to 800 ms that is typically shown in the LDN. Taken together, the late negativity elicited by the T2/T3 contrast in children younger than 6 years shows earlier and more restricted latency than the typical LDN does. Whether this late negativity is the typical LDN remains unclear. Another possibility is suggested by a study by Shafer et al. (2011) who found an nMMR, which was a negative component peaking around 340 ms, preceded by a P-MMR in response to English vowel contrast in 3-year-old children. The peak latency of the nMMR shifted approximately 25 ms per year earlier from 3 to 5 years of age (Shafer et al., 2010). Shafer et al. (2010) suggested that the nMMR is an emergence of MMN in the early developmental stage. 
In the current study, the latency of the late negativity to the T2/T3 contrast found at 24 months resembles that of the nMMR reported in Shafer et al.'s (2010) studies. However, the late negativity in the current study is not preceded by a P-MMR. Taken together, neither LDN nor nMMR can be used to account for the current finding of the late negativity to the T2/T3 contrast at 24 months. Further studies are required to examine whether it resembles LDN or it is an emergence of MMN.

It is worth noting that MMN in the 12-month-old group was evident at only one electrode in a limited interval, in contrast to the widely distributed frontal-to-central MMN in the other age groups. When we carefully inspected the status of infants in the EEG collection environment, 12-month-old infants were more likely than those in other age groups to become restless. This 12-month-old group also had a higher rate of attrition (50%) and fewer accepted trials, which could result in less observable MMNs. It is critical to shorten the duration of data collection and improve the quality of data in future studies of the development of MMN.
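The attrition comparison can be checked directly from the attendance and completion counts given in the Participants section. The snippet below is only a worked arithmetic example; the dictionary layout and function name are illustrative.

```python
# (attended, completed) per age group, taken from the Participants section
groups = {
    "12 months": (28, 14),
    "18 months": (26, 20),
    "24 months": (29, 19),
}

def attrition_rate(attended, completed):
    """Proportion of infants who attended but did not complete the session."""
    return (attended - completed) / attended

rates = {age: attrition_rate(*counts) for age, counts in groups.items()}
for age, rate in rates.items():
    print(f"{age}: {rate:.1%}")
# 12 months: 50.0%
# 18 months: 23.1%
# 24 months: 34.5%
```

On these counts the 12-month-old group lost half of its attendees, against roughly a quarter and a third in the two older groups.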

### CONCLUSION

The current study documented MMRs to Mandarin lexical tone discrimination in infants at 12, 18, and 24 months of age. An adult-like MMN to the large deviant T1/T3 contrast was found across all age groups, whereas the small deviant T2/T3 contrast elicited P-MMR at 12 and 18 months of age and no MMR at 24 months of age. These findings suggest that 12- to 24-month-old infants can automatically discriminate T1 and T3, whereas categorization of the acoustically similar tone pair T2 and T3 remains effortful in infants under 24 months of age. In this regard, Kooijman et al. (2013) measured brain responses in a segmentation task in infants at 7 months of age. The majority of 7-month-old infants showed a positive response and a minority showed a left negativity that resembles the responses observed in 10-month-old infants (Kooijman et al., 2005). Critically, these negative responders had higher scores of expressive vocabulary and sentence processing skills at 3 years than did the positive responders, which suggests the polarity of ERP effect may be an important indicator of the maturation of language processing. Taken together, our findings support the notion that the polarity of MMRs may serve as a neural marker to index the maturation of speech perception in infancy.

### ETHICS STATEMENT

The study protocols were approved by the Human Subject Research Ethics Committee/Institutional Review Board (IRB) of Academia Sinica, Taiwan. Written consent forms were obtained from parents for their infants' participation.

### AUTHOR CONTRIBUTIONS

Y-YC designed and prepared the experiments; acquired, analyzed, and interpreted the data; and drafted and revised this article. C-YL conceptualized this study; supervised and approved the experimental design; interpreted the data; revised the draft; and approved the final version of this article. Y-YC and C-YL agreed to be accountable for all aspects of this study.

### ACKNOWLEDGMENTS

We thank Dr. Hsin-Chi Wu, Dr. Ming-Tao Yang, and Dr. Lu-Lu Zhao for their support in our data collection in Taipei Tzu Chi Hospital and we thank all the families that participated in this study.

### REFERENCES


Gandour, J. (1983). Tone perception in far eastern-languages. J. Phon. 11, 149–175.


stimuli in children. Brain Lang. 76, 332–339. doi: 10.1006/brln.2000.2426


evidence from nasal place discrimination. Dev. Sci. 13, 407–420. doi: 10.1111/j.1467-7687.2009.00898.x


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Cheng and Lee. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Five-Year-olds' Acoustic Realization of Mandarin Tone Sandhi and Lexical Tones in Context Are Not Yet Fully Adult-Like

Nan Xu Rattanasone<sup>1,2,3</sup>\*, Ping Tang<sup>1,3</sup>, Ivan Yuen<sup>1,2,3</sup>, Liqun Gao<sup>4</sup>\* and Katherine Demuth<sup>1,2,3</sup>

*<sup>1</sup> Department of Linguistics, Macquarie University, Sydney, NSW, Australia, <sup>2</sup> Center for Language Sciences, Macquarie University, Sydney, NSW, Australia, <sup>3</sup> ARC Center of Excellence in Cognition and its Disorders, Macquarie University, Sydney, NSW, Australia, <sup>4</sup> Centre for Speech, Language and the Brain, Beijing Language and Culture University, Beijing, China*

### Edited by:

*Leher Singh, National University of Singapore, Singapore*

#### Reviewed by:

*Ao Chen, Utrecht University, Netherlands Mengru Han, Utrecht University, Netherlands*

#### \*Correspondence:

*Nan Xu Rattanasone nan.xu@mq.edu.au Liqun Gao gaolq@blcu.edu.cn*

#### Specialty section:

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

Received: *05 September 2017* Accepted: *07 May 2018* Published: *28 May 2018*

#### Citation:

*Xu Rattanasone N, Tang P, Yuen I, Gao L and Demuth K (2018) Five-Year-olds' Acoustic Realization of Mandarin Tone Sandhi and Lexical Tones in Context Are Not Yet Fully Adult-Like. Front. Psychol. 9:817. doi: 10.3389/fpsyg.2018.00817*

Large numbers of children around the world are learning tone languages, but few studies have examined the acoustic properties of children's early tone productions. Even more scarce are acquisition studies on tone sandhi, a tone change phenomenon which alters the surface realization of lexical tones. Two studies using perceptual coding report the emergence of lexical tone and tone sandhi at around 2 years (Li and Thompson, 1977; Hua and Dodd, 2000). However, the only acoustic study available shows that 3-year-olds are not yet adult-like in their lexical tone productions (Wong, 2012). This raises questions about when children's productions become acoustically adult-like and how their tone productions differ from those of adults. These questions were addressed in the current study, which compared Mandarin-speaking pre-schoolers' (3–5-year-olds) tone productions to those of adults. A picture naming task was used with disyllabic real words familiar to pre-schoolers. Overall, children produced appropriate tone *contours* for all tones, i.e., level for tone 1, rising for tones 2, 3 and full sandhi, falling for tone 4 and half sandhi. However, children's productions were not adult-like for tones 3, 4, and the sandhi forms, in terms of coordinating *pitch range, slope and curvature*, with little evidence of development across ages. These results suggest a protracted process in achieving adult-like acoustic realization of both lexical and sandhi tones.

#### Keywords: lexical tone acquisition, tone sandhi, mandarin, acoustic analysis, pre-schoolers

### INTRODUCTION

Despite recent interest in tone languages, little is known about the acquisition of lexical tones compared to segments (i.e., vowels and consonants). This is in spite of the pervasiveness of tone languages: it is estimated that more than half the world's languages are tonal (Yip, 2002). Especially lacking is knowledge about children's early productions of lexical tone, and if and how these differ from adult forms. This is also the case for phonological processes that involve lexical tone change (tone sandhi). For example, Mandarin has tone sandhi processes whereby the surface tone changes depending on tonal context, i.e., from tone 3 to a rising or falling tone. The acquisition of such phonological processes has not attracted much attention in the field of language acquisition. Given the large population of children learning tone languages, understanding how lexical tone and tone sandhi processes are acquired is crucial for providing a comprehensive account of language acquisition above the level of the segment. In this paper, we examine the early production of both lexical and sandhi tones in terms of their acoustic realizations to determine if pre-schoolers' productions are adult-like.

Languages that have lexical tone manipulate pitch height and pitch contours to change the meanings of words. Whereas in English rising and falling pitch contours on words are typically associated with prosodic information such as intonation and focus, in lexical tone languages these can change the meaning of the word. A well-studied lexical tone language is Mandarin, with the largest population of speakers around the world. Mandarin has a four-tone system with one level and three contour tones; tone 1 has a high level contour ("mā": mother), tone 2 a rising contour ("má": hemp), tone 3 a dipping contour ("mǎ": horse), and tone 4 a falling contour ("mà": reprimand). See **Figure 1** for pitch contours across time on the four lexical tones. While all four lexical tones appear in the productions of Mandarin-speaking children by the 1-word stage of development, confusion between tones 2 and 3 (rising and dipping tones) continues into the 2/3-word stage of development, finally disappearing as longer sentences are produced (Li and Thompson, 1977). Only one study has reported on the acoustic characteristics of lexical tone produced by Mandarin-speaking 3-year-olds in North America (Wong, 2012). Using monosyllabic words, the study showed that 3-year-olds did not yet have adult-like productions in terms of pitch range and slope, especially for tone 3, indicating that young children face challenges in producing complex tonal contours. This is mirrored in perception studies with 3-year-olds showing difficulty with tone identification, especially for tone 3, which has the most complex tone contour (Wong et al., 2005). Another study using perceptual coding of Mandarin-speaking Taiwanese 4- and 5-year-olds' tone productions showed that older pre-schoolers still had a substantial number of atypical productions, with no changes over development (Wong, 2013). Together these studies suggest that Mandarin-speaking pre-schoolers are still learning to produce tone in an adult-like manner.

These studies point to a protracted acquisition period for Mandarin lexical tones, especially when compared to other tone languages with larger tone inventories. For example, Cantonese is a language with a six-tone system, but children are reported to have acquired all tones by the age of 2 (So and Dodd, 1995). This includes tones with very similar pitch contours, e.g., three level tones (high, mid, and low) and two rising tones (high and mid). The same early acquisition of the full tone inventory is also observed in Thai, a language with 5 tones including three level tones (high, mid, and low), a rising tone and a falling tone, where all tones were present by the 2-word stage (Tuaycharon, 1977). Thus, a larger tone inventory and similarities between tonal contours do not appear to delay tone acquisition. One obvious reason might be that the perceptual coding (rather than acoustic analysis) used in these studies overestimated children's abilities. Indeed, during early stages of language acquisition, children can produce acoustic contrasts that may not be detected by the listener (Scobbie et al., 2000). However, another possibility is the role of Mandarin tone sandhi in acquisition. Tone change processes such as tone sandhi, where children must learn to associate multiple surface forms with their underlying forms, might contribute to a protracted acquisition process. However, little is known about the acoustic realizations of children's early productions of tone sandhi, whether they are adult-like, and whether tone sandhi is acquired together with lexical tones.

There are two contexts for the tone sandhi process in Mandarin. The full sandhi context occurs when two tone 3 syllables occur in succession (tones 3–3), and the first becomes a rising tone. The half sandhi context occurs when tone 3 is followed by any other tone (tones 1, 2, or 4), and is realized with a falling pitch. See **Figure 2** for pitch contours plotted over time for the full and half sandhi tones. Therefore, except in the utterance final position, tone 3 is always realized as full or half sandhi in connected speech. Previous studies have reported that tone sandhi emerges by the 2/3-word stage of development (around 2 years), when children begin to combine words (Li and Thompson, 1977; Hua and Dodd, 2000). However, it is unclear how sandhi forms are acoustically realized in children's productions. To date, we know of only one study which has reported on the acoustic characteristics of tone sandhi productions by pre-schoolers (Xu Rattanasone et al., 2016). That study reported that 3-year-olds' production of Mandarin tone sandhi on known words had tonal contours that are consistent with the sandhi forms. However, no adult control group was used and so it remains unclear to what extent 3-year-olds' productions are acoustically adult-like. Given previous reports on the protracted acquisition of tone sandhi, it is unlikely that 3-year-olds' productions would be adult-like at this early age. Indeed, in Bantu languages such as Sesotho, where lexical and grammatical tone interact, tone sandhi processes begin to emerge only by 3 years or later, as children learn more about the grammar of the language (Demuth, 1993).
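The two sandhi contexts described above amount to a simple context rule, sketched below for illustration. The function names and contour labels are hypothetical, and the sketch covers only the cases stated in the text: tone 3 followed by tone 3 (full sandhi), tone 3 followed by tones 1, 2, or 4 (half sandhi), and utterance-final tone 3 (unchanged). Other contexts, such as a following neutral tone, are deliberately left out.

```python
LEXICAL_CONTOUR = {1: "level", 2: "rising", 3: "dipping", 4: "falling"}

def surface_realization(tone, next_tone=None):
    """Surface contour of one syllable given the lexical tone that follows it.

    next_tone=None marks the utterance-final position.
    """
    if tone not in LEXICAL_CONTOUR:
        raise ValueError(f"unknown lexical tone: {tone}")
    if tone != 3 or next_tone is None:
        return LEXICAL_CONTOUR[tone]
    if next_tone == 3:
        return "full sandhi (rising)"
    if next_tone in (1, 2, 4):
        return "half sandhi (falling)"
    raise ValueError(f"context not covered by this sketch: {next_tone}")

def realize(tones):
    """Apply the rule to a sequence of lexical tones, e.g. a disyllabic word."""
    following = list(tones[1:]) + [None]
    return [surface_realization(t, nxt) for t, nxt in zip(tones, following)]

print(realize([3, 3]))  # ['full sandhi (rising)', 'dipping']
print(realize([3, 1]))  # ['half sandhi (falling)', 'level']
```

In a disyllabic word only the first syllable can therefore carry a sandhi form, which is why disyllables provide a natural test context for both lexical and sandhi tones.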

Currently, it is unclear why some studies have reported Mandarin tone acquisition to be a protracted process. This could be related to the presence of a tone sandhi process or difficulty in producing the adult-like forms of both lexical and sandhi tones. One possibility is that children are producing global tonal contours that are consistent with lexical and sandhi tones (level, rising and falling), but are not yet able to make finer acoustic adjustments in an adult-like manner (e.g., pitch range). Indeed, previous studies of 3-year-olds have shown that children are producing global tonal contours that are consistent with lexical and sandhi tone targets (Xu Rattanasone et al., 2016), but these are not yet adult-like in terms of pitch range, slope and curvature (Wong, 2012). A recent study reporting on adult ratings of child productions showed that compared to adult productions, children's productions were rated as being less accurate, especially in complex phonetic contexts such as disyllables (Wong and Strange, 2017). These complex phonetic contexts include the tone sandhi context, but it is unclear from that study whether children were producing sandhi forms. Therefore, it remains unclear when children might reach adult-like productions on the acoustic characteristics of pitch range, slope and curvature for both lexical and sandhi tones in contexts such as disyllables.

In this study, we addressed the question of whether preschoolers' lexical and sandhi tone productions are acoustically adult-like by comparing 3-, 4-, and 5-year-olds' productions to adult forms. All participants were monolinguals raised in Beijing. First, we report on lexical tone productions. Based on previous research, we expected that all children might produce global tonal contours that are consistent with the four lexical tones (level, rising, dipping and falling). However, we also predicted that children might not reach adult-like levels on acoustic measures such as pitch range, slope and curvature. We also expected that there might be a developmental effect whereby older 5-year-olds' productions would be more acoustically adult-like than the productions of younger children. Secondly, we report on tone sandhi productions from the same groups of children compared to adults. Based on one previous study (without an adult control group), we expected children's global tonal contours for tone sandhi productions to be consistent with full and half sandhi forms (rising vs. falling). No study has yet reported on children's productions of pitch range, slope and curvature for tone sandhi, but based on lexical tone, we predicted that children would not be adult-like on these measures. However, 5-year-olds' productions might be more adult-like than those of younger 3-year-olds.

### MATERIALS AND METHODS

### Participants and Design

Participants included 27 3-year-olds with a mean age of 3;10 (range 3;5–3;11; 12 boys and 9 girls), 22 4-year-olds with a mean age of 4;7 (range 4;0–4;11; 7 boys and 16 girls) and 25 5-year-olds with a mean age of 5;7 (range 5;0–5;11; 16 boys and 15 girls). No participants were excluded.

All children were recruited in Beijing from the preschool associated with the Beijing Language and Culture University. The study was conducted in accordance with the ethics protocol approved by Macquarie University's Human Ethics Panel. All child participants received stickers for their participation and the preschool received book donations for all children to use at the center.

A total of 16 adult female controls, with a mean age of 24 years (range: 19–35 years), were recruited. All adults were native speakers of Beijing Mandarin and were undertaking graduate or postgraduate training in Sydney. Written consent was provided prior to participation in the study, and they were paid \$20 for their travel and time.

A within-subjects design was used. All participants were asked to name all lexical tone and tone sandhi items during testing.

### Stimuli

The stimuli included a total of 28 disyllabic words familiar to pre-schoolers (**Table 1**). To elicit the lexical tones, 12 disyllabic words with tones 1, 2, and 4 on the first syllable and tones 1–4 on the second syllable were chosen. It was not possible to find enough familiar words for pre-schoolers all beginning with tone 1 to avoid tone co-articulation effects; therefore, an equal number of words beginning with tones 2 and 4 (rising and falling contours) were also included to elicit a range of tonal contexts. It was also not possible to avoid some words ending in nasal /n/ and /ŋ/ codas, which can have the effect of lowering the pitch of the syllable.

To elicit full sandhi, four disyllabic tone 3-3 words were chosen. For half sandhi, 12 disyllabic words were chosen, with tone 3 as the first syllable and tones 1, 2, and 4 as the second syllable. This resulted in a total of 16 sandhi items. An additional two practice items in the full sandhi form (a puppy and a pony) were used at the beginning of each session to help train children to perform the task. These training items were not analyzed.

Most syllables had a CV structure, and where possible contained a stop or fricative/affricate onset to facilitate acoustic coding. However, a few contained a lateral or nasal onset, and some contained a nasal coda. Two versions of the test were created, each with a different randomization for the presentation order of words. See Appendix for **Table A1** on durations of each tone by syllable.


TABLE 1 | List of disyllabic stimuli words.

### Equipment

A total of 34 non-proprietary photographic images representing each of the 32 test and 2 practice items were selected from Google images. The images were presented one at a time using Microsoft PowerPoint 2013 delivered on an Apple iPad 2. The recordings were collected using a Zoom H2 digital voice recorder with lapel mic and the recordings were exported as PCM files.

### Procedure

Testing was conducted in a quiet area in the preschool. Each child was greeted by the native Mandarin-speaking experimenter, the first author. The task was explained as a picture naming game where children named the pictures on an iPad and received stickers for playing the game. Two practice trials were given, and for children who could not provide an answer after three attempts, the experimenter provided the answer, e.g., "puppy." The child was then asked to repeat the label before moving on to the next item. The children were encouraged to provide answers independently during the practice trials.

All children could perform the elicitation task during the test trials. However, there were two items which most children could not name, i.e., the tones 3-2 word (gloss: rainbow) and the tones 2-1 word (gloss: lobster). For these items, the experimenter named the items, but the imitations from the children were not analyzed.

The same procedure was used for testing all children as well as the adult control.

### Data Analysis

The productions were acoustically coded in Praat (Boersma and Weenink, 2012) by a trained coder who is a native speaker of Mandarin. Two additional native speakers listened to all tokens. No mis-productions on consonant or vowel segments were identified by any of the three listeners, so all productions were included and contributed to the final analyses. A total of 60,090 tokens were analyzed, 17,860 from the 3-year-olds, 14,920 from the 4-year-olds, 17,260 from the 5-year-olds, and 11,050 from the adults. The tones were extracted from the vocalic portion of the target syllable (and nasal if present); this was the second syllable for lexical tone words and the first for tone sandhi words. The vocalic portion was identified from the onset to the cessation of higher formants. In cases where the second syllable had a nasal onset, anti-resonance and simplification of the waveform were used to identify the onset of the second syllable. f0 points were tracked within the annotated interval using the autocorrelation algorithm in Praat, and these f0 points were checked and manually revised to correct for any "doubling" or "halving" errors in pitch tracking. The revised pitch track was then interpolated and smoothed with a bandwidth of 20 Hz. f0 was then extracted in 10 equal steps for each syllable. The raw f0 values were transformed into semitones, with a reference frequency of 100 Hz, for analysis.
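The last two steps of this pipeline (time-normalized sampling and the semitone transform) can be illustrated with a short sketch. It uses the standard conversion of 12 times the base-2 logarithm of f0 over the reference frequency, with the 100 Hz reference stated above. The function names are hypothetical, and sampling at equally spaced points that include both endpoints of the interval is an assumption, since the text does not specify where the 10 steps fall. This is not the Praat script used in the study.

```python
import math

REF_HZ = 100.0  # reference frequency stated in the text


def hz_to_semitones(f0_hz, ref_hz=REF_HZ):
    """Convert an f0 value in Hz to semitones relative to ref_hz."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def sample_equal_steps(times, values, n_points=10):
    """Linearly interpolate a contour at n_points equally spaced times.

    times must be strictly increasing and contain at least two entries.
    Assumption: the first and last sample sit on the interval edges.
    """
    start, end = times[0], times[-1]
    samples = []
    j = 0
    for i in range(n_points):
        t = start + (end - start) * i / (n_points - 1)
        # Advance to the segment [times[j], times[j + 1]] that contains t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        weight = (t - t0) / (t1 - t0)
        samples.append(values[j] + weight * (values[j + 1] - values[j]))
    return samples


# Illustrative (made-up) pitch track for one syllable: time in s, f0 in Hz.
track_times = [0.00, 0.05, 0.10, 0.15, 0.20]
track_hz = [220.0, 230.0, 250.0, 280.0, 300.0]
contour_st = [hz_to_semitones(f) for f in sample_equal_steps(track_times, track_hz)]
print([round(st, 2) for st in contour_st])
```

Because the semitone scale is logarithmic, a doubling of f0 always corresponds to 12 semitones, which makes pitch ranges more comparable across speakers with different absolute pitch.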

### RESULTS

To examine whether children's lexical tone and tone sandhi productions were adult-like, second order polynomial models were conducted for each tone separately (6 models in total: 4 for lexical tones and 2 for sandhi forms). Alpha was set at 0.008 after Bonferroni adjustment was made for multiple comparisons. In all models, children's productions were compared to the adult controls. The first order linear trends compared the steepness of the slopes, and the larger estimates indicated steeper slopes with larger differences between f0 onset and offset, i.e., greater pitch range. The second order quadratic trends compared the areas under the curve, with larger estimates indicating larger areas, i.e., more curvy contours.
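To make the roles of the linear and quadratic terms concrete, the sketch below fits both to a single 10-point contour and also shows the Bonferroni arithmetic (0.05 divided by 6 models, reported as 0.008). It is an illustration under stated assumptions, not the authors' analysis: the terms are built as simple centered, mutually orthogonal polynomials, the fit is an ordinary per-contour projection rather than a mixed model, and all function names are hypothetical.

```python
def trend_terms(n_points=10):
    """Centered linear and quadratic time terms for equally spaced points."""
    steps = range(1, n_points + 1)
    mean_step = sum(steps) / n_points
    linear = [s - mean_step for s in steps]
    mean_square = sum(x * x for x in linear) / n_points
    # Subtracting the mean makes the quadratic term orthogonal to the
    # intercept; symmetry makes it orthogonal to the linear term as well.
    quadratic = [x * x - mean_square for x in linear]
    return linear, quadratic


def fit_trends(contour):
    """Return (linear, quadratic) estimates for one f0 contour."""
    linear, quadratic = trend_terms(len(contour))
    mean_f0 = sum(contour) / len(contour)
    centered = [y - mean_f0 for y in contour]
    slope = sum(l * y for l, y in zip(linear, centered)) / sum(l * l for l in linear)
    curve = sum(q * y for q, y in zip(quadratic, centered)) / sum(q * q for q in quadratic)
    return slope, curve


# Bonferroni adjustment for six models (4 lexical tones + 2 sandhi forms).
N_MODELS = 6
BONFERRONI_ALPHA = 0.05 / N_MODELS

rising = [float(i) for i in range(10)]           # steady rise
falling = [9.0 - i for i in range(10)]           # steady fall
dipping = [(i - 4.5) ** 2 for i in range(10)]    # symmetric dip

for name, contour in (("rising", rising), ("falling", falling), ("dipping", dipping)):
    print(name, fit_trends(contour))
print("adjusted alpha:", round(BONFERRONI_ALPHA, 3))
```

In this sketch the sign of the linear estimate separates rising from falling contours and its magnitude tracks the onset-to-offset f0 difference, while the quadratic estimate grows as the contour departs from a straight line.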

Since children have higher pitch than adults, data for each age group were centered around the group means to ensure that only differences in f0 contour are analyzed and not the absolute f0 differences between children and adults. All analyses were conducted in R (R Core Team, 2013) using lmer() as provided by the lmerTest package, which builds on lme4 (Bates et al., 2015) and supplies Satterthwaite adjustments to denominator degrees of freedom. The model included f0 measured over 10 time points (every 10%) as the dependent variable with Age group (3-, 4-, 5-year-olds, and Adults) as the fixed factor. Each speaker was entered as a random variable with random intercept estimated separately for each age group. The models for lexical tones (1–4) are reported first, followed by tone sandhi (full and half sandhi). See **Tables 2**, **3** for fixed effects model estimates of lexical tone and tone sandhi as well as the R code for estimating the maximal model.

TABLE 2 | Results for f0 of lexical tone across 10 time points.

\*\*\**p* < 0.001, \*\**p* < 0.01, \**p* < 0.05. Bold effects are still significant after Bonferroni adjustment (alpha < 0.008). R code: `lmer(f0Centered ~ (Linear + Quadratic) * AgeGroup + (Linear + Quadratic | Participant:AgeGroup))`.
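The group-mean centering described for these models can be sketched as follows. The example subtracts each age group's mean f0 (in semitones) from that group's observations, so that groups with different overall pitch become comparable in contour shape. The function name, the data layout, and the numbers are hypothetical, and the sketch assumes one mean per age group across all of that group's data points.

```python
def center_by_group(records):
    """Subtract each group's mean from its f0 values.

    records: list of (group_label, f0_semitones) pairs.
    Returns a list of (group_label, centered_f0) pairs in the same order.
    """
    totals, counts = {}, {}
    for group, f0 in records:
        totals[group] = totals.get(group, 0.0) + f0
        counts[group] = counts.get(group, 0) + 1
    means = {group: totals[group] / counts[group] for group in totals}
    return [(group, f0 - means[group]) for group, f0 in records]


# Made-up values: children sit higher in absolute pitch than adults.
data = [
    ("Adult", 12.0), ("Adult", 14.0), ("Adult", 16.0),
    ("3-year-olds", 20.0), ("3-year-olds", 21.0), ("3-year-olds", 22.0),
]
for group, centered in center_by_group(data):
    print(group, centered)
```

After centering, each group has a mean of zero, so any remaining group difference in a fitted model reflects contour shape (slope and curvature) rather than overall pitch height.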

### Lexical Tones

We predicted that children would produce global tonal contours consistent with level, rising, dipping and falling tones but would not be adult-like in producing pitch range, slope and curvature. We also predicted that older 5-year-olds' productions might be more adult-like than those of younger 3-year-olds. The results for lexical tone are shown in **Table 2** and **Figure 3**. After Bonferroni adjustments for multiple models (6: 4 lexical tones and 2 sandhi tones), alpha was set at 0.008.

For tone 1, there was a significant linear trend and a significant interaction of that trend with age. The significant linear trend in the absence of a significant quadratic trend suggests that tone 1 productions from children and adults had a level contour, consistent with the contour expected for tone 1. The linear interaction with 3- and 4-year-olds, with significant negative estimates compared to adults, suggests that children's tone 1 productions had a flatter slope than those of adults.

The results for lexical tone 2 showed significant linear and quadratic trends, suggesting that both children and adults produced a curved rising f0 contour. There were significant interactions with age for both the linear and quadratic trends. The positive effect on the linear term for 3-year-olds suggests that they produced f0 contours with steeper slopes than adults and, therefore, a larger f0 range. The significant negative effect on the quadratic term for 5-year-olds suggests that they produced a flatter f0 curve than adults. No other significant interactions were found.

The results for lexical tone 3 showed significant linear and quadratic trends, suggesting that both children and adults produced a curved falling f0 contour. Since tone 3 has a negative-going contour, the positive effect on the linear term for all three child ages suggests that children had a flatter f0 slope compared to adults, and produced tone 3 with a smaller f0 range. There were no significant interactions with the quadratic trend, which suggests that the curviness of the f0 contours in the child and adult productions did not differ.

The results for tone 4 showed significant main effects for both linear and quadratic trends and interactions with age for all three age groups. The linear and quadratic trends suggest that both children and adults produced curved falling f0 contours. The positive effect of all child groups on the linear and quadratic terms suggests that children produced flatter f0 curves and slopes compared to adults, with reduced f0 range.

Overall, the results on lexical tones suggest that children were adult-like for producing global tonal contours consistent with a level contour for tone 1, rising for tone 2, dipping for tone 3 and falling for tone 4. They were also adult-like on f0 range, slope and curvature for tone 1, and mostly adult-like for tone 2. However, all children produced tone 3 with flatter slope and reduced f0 range compared to adults. Children's production of tone 4 differed the most from that of adults, with children producing both reduced f0 range and flatter f0 curves. The results did not show any consistent developmental changes across age, suggesting that older 5-year-olds were not more adult-like in their productions than younger 3-year-olds.

### Tone Sandhi

We predicted that children might produce the correct global tonal contours that are consistent with full and half sandhi (rising and falling) but would not be adult-like in producing pitch range, slope and curvature. However, children's productions might be more adult-like for older 5-year-olds than younger 3-year-olds. The results for the sandhi forms are shown in **Table 3** and **Figure 4**.

\*\*\**p* < 0.001, \*\**p* < 0.01, \**p* < 0.05. Bold effects are still significant after Bonferroni adjustment (alpha < 0.008). R code: `lmer(f0Centered ~ (Linear + Quadratic) * AgeGroup + (Linear + Quadratic | Participant:AgeGroup))`.

For full sandhi, there was a significant main effect of linear and quadratic trends and an interaction with age for the quadratic trend. The linear and quadratic trends suggest that both children and adults produced curved rising f0 contours. The negative effects of all child groups on the quadratic term suggest that children produced full sandhi with flatter f0 contours than adults.

The results for half sandhi showed significant main effects of linear and quadratic trends and a significant interaction with age for the linear trend. The linear and quadratic trends suggest that both children and adults produced curved falling f0 contours. The positive effects of all child groups on the linear term suggest that children produced half sandhi with flatter f0 slopes and reduced f0 range compared to adults. These results suggest that children are not yet adult-like in their tone sandhi productions for f0 range, slope and contour, even for the oldest age group.

### DISCUSSION

The aim of this study was to examine the acoustic realizations of lexical and sandhi tones in the productions of pre-schoolers (3-, 4-, and 5-year-olds) to determine if and when they become adult-like. First, all global contours on the children's lexical tone productions were consistent with the productions by adults: a level contour for tone 1, a curved rising contour for tone 2, a curved downward dipping contour for tone 3, and a falling contour for tone 4.

However, in terms of pitch range, slope and curvature, the acoustic analysis of lexical tones suggests that children were not achieving adult-like productions across all tones. While child and adult productions of tones 1 and 2 were the least different, tones 3 and 4 showed much more difference between child and adult productions. For tone 1, 3- and 4-year-olds produced pitch contours with a smaller pitch range and a flatter pitch contour compared to adults, but children were adult-like by 5 years. Tone 2 also showed few differences between child and adult productions, with 3-year-olds producing a larger pitch range and slope compared to adults, and 5-year-olds producing a contour that is less curvy compared to adults. No other group differences were found. Therefore, for tone 2, despite having a curved contour, most preschoolers produced it in an adult-like manner consistent with a rising tone.

Children's productions of tones 3 and 4 differed the most from adult productions in terms of pitch range, slope and curvature. For tone 3, children across all three age groups had a reduced pitch range and slope compared to adults. However, children did not show any challenges in producing curved contours for the complex tone 3; in fact, the degree of curvature did not differ from adult productions. For tone 4, all children's productions were reduced in pitch range and slope, as well as having a flatter contour with less curvature compared to adult productions. These results suggest that for the two lexical tones with a falling contour, tones 3 and 4, pre-schoolers are still struggling to coordinate pitch range, slope and curvature, even at the age of 5.

The second aim of this study was to examine the acoustic realizations of tone sandhi in children's productions. The analyses for both full and half sandhi suggest that children produced global tonal contours that are consistent with full and half sandhi tones (rising and falling). However, children are not yet adult-like on pitch range, slope and curvature. Compared to adults, children produced full sandhi contours with a flatter rising curve and half sandhi contours with a smaller falling pitch range and slope, again showing challenges in producing adult-like forms.

The results from both lexical tone and tone sandhi suggest that children are still fine-tuning their control and coordination of pitch range and slope with curvature, especially for tones 3, 4, and the sandhi forms. This provides support for Wong (2012, 2013) and suggests that reaching adult-like tone realization on specific acoustic measures is a protracted process. However, our study also found that even 3-year-olds could produce the overall tonal contours consistent with level, rising, dipping and falling tones, important for maintaining tone category distinctions. This may help explain why studies using perceptual coding have reported earlier acquisition of lexical tones compared to studies using acoustic measures; the former may have captured children's ability to produce global tonal contours that are consistent with the different tone categories (Hua and Dodd, 2000), whereas the latter identified the implementation of specified acoustic measures (pitch range, slope and curvature) that are not yet adult-like (Wong, 2012). Together with our study, these results suggest the gradual acquisition of tone realization, with children producing global contours first, and later fine-tuning of pitch range, slope and curvature. Studies with older children, and on tonal coarticulation in a range of tone contexts will be needed to determine when this fine-tuning reaches adult-like acoustic values.

Similarly, for tone sandhi, children are producing rising and falling contours consistent with the two tone sandhi forms, but are still fine-tuning pitch range, slope and curvature. However, our study used only real words and avoided low frequency words which pre-schoolers might not know. It is therefore possible that the sandhi forms examined here were lexicalized as tone 2 for full sandhi and a phonetic variant of tone 3 for half sandhi without children fully understanding how and where sandhi processes apply. Therefore, future studies are needed to examine children's ability to apply tone sandhi processes to novel words, examining their ability to generalize their knowledge about these phonological processes to word learning.

Our study did not find any developmental effects for either lexical tones or tone sandhi forms. Therefore, some caution must be taken when interpreting the results on differences observed across the age groups. For example, the results showed that for T2, 3-year-olds produced a more rising contour and 5-year-olds produced a less curvy contour, but there were no overall developmental effects. This must be interpreted with the general result showing that children's productions are not adult-like for any contour tones (i.e., tones 3 and 4, and full and half sandhi). The differences across age groups might therefore be part of children's general early difficulty in coordinating pitch range, slope and curvature to achieve adult-like productions, with the exception of the level T1 where children had achieved adult-like production by 5 years. However, the question of developmental changes in tone productions would be better answered in future longitudinal studies that track children as they develop mastery over tone production.

Finally, the lack of developmental changes for tone sandhi might be related to the use of known words in this study. It is possible that children might show developmental effects in their ability to apply tone sandhi processes when learning new words using novel items. Our results also raise questions about if and how non-adult-like productions may affect children's tone comprehension abilities. It is possible that children are less sensitive to changes in pitch range, slope and curvature but can track overall tonal contours. It is also possible that other acoustic cues are being favored by children, i.e., duration and turning point for the contour tones. Addressing these questions in future research will provide a comprehensive understanding of tone acquisition and the link between production and perception.

## CONCLUSION

Mandarin-speaking children produced adult-like global tone contours for lexical tone and tone sandhi that were consistent with the level (tone 1), rising (tone 2 and full sandhi), dipping (tone 3), and falling (tone 4 and half sandhi) tone categories, showing that 3–5-year-olds have good knowledge about lexicalized forms of lexical tone and tone sandhi. However, pre-schoolers are still fine-tuning their control over coordinating pitch range, slope and curvature, especially for contour tones 2, 3, and 4, and the sandhi forms. Achieving adult-like acoustic realizations of lexical tone and tone sandhi is a protracted process, probably fully attained after the age of 5.

## AUTHOR CONTRIBUTIONS

NX, the project leader, developed the research question, designed the experiments, collected the data, performed the data analysis, and wrote up drafts of this paper in collaboration with the coauthors. PT assisted in coding data, data analysis and interpretation, and contributed to drafts of the manuscript. IY contributed to the design and implementation of the stimuli, assisted in training coders for acoustic coding of the data, and contributed to drafts of the manuscript. LG assisted in recruitment and data collection, and contributed to drafts of the manuscript. KD contributed to shaping the research question, stimuli and research design, and to drafts of the manuscript.

### FUNDING

Macquarie University Research and Development Grant #1547620. Australian Research Council (ARC) Laureate Fellowship grant #130100014 (Demuth). ARC Centre of Excellence for Cognition and its Disorders grant #CE110001021.


### ACKNOWLEDGMENTS

We thank Phillip Chen for his help with recruiting adult participants and acoustic coding, as well as members of the Child Language Lab for their helpful comments in various aspects of this project. We also thank the students at Beijing Language and Culture University for their helpful assistance in testing children. Finally we thank Dr. Peter Humburg for statistical advice.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Xu Rattanasone, Tang, Yuen, Gao and Demuth. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## APPENDIX

TABLE A1 | Mean and SD of durations in millisecond for all tones in the 1st and 2nd syllable positions.


*Acoustic analyses were conducted on the 1st Syllable for Tone Sandhi and 2nd syllable for Lexical Tones.*

# PRIMIR on Tone

Suzanne Curtin<sup>1</sup> \* and Janet F. Werker <sup>2</sup> \*

<sup>1</sup> Department of Psychology, University of Calgary, Calgary, AB, Canada, <sup>2</sup> Department of Psychology, The University of British Columbia, Vancouver, BC, Canada

Keywords: PRIMIR, tone, speech perception, language acquisition, word learning, bilingualism, infancy

Processing Rich Information from Multidimensional Interactive Representations (PRIMIR) was first developed to address divergent findings in the literature surrounding early speech perception and word learning (Werker and Curtin, 2005; Curtin and Werker, 2007). Specifically, there was a controversy in the literature at the time indicating that while under some circumstances infants and toddlers are able to discriminate a mispronunciation of a known word, in other testing circumstances, they had difficulty using phonetic/phonological differences to guide word learning. PRIMIR was developed as a framework for understanding why infants attend to some kinds of information over others in different testing situations. Integral to PRIMIR are the representations and processing that simultaneously impact how the infant interprets the information in the signal. PRIMIR assumes infants have access to a rich signal containing multi-sensory information. It further assumes that infants have initial biases that work together with learning mechanisms (e.g., statistical, comparison/contrast) to help process, organize, and store the information gleaned from the signal. Information is stored within emergent interactive representational planes: general perceptual, word form, and phoneme, and is processed using three dynamic filters: initial biases (e.g., preferences for speech, native-language rhythm, and infant-directed speech), the developmental level of the infant, and the task that the child is facing (e.g., discriminating sounds, learning words). While the initial biases draw the infant's attention to the linguistic signal, the developmental level and the task will influence what information is attended to, stored, and used at any given time. That is, the states of the infant's various representations and the task itself will shift how the infant interprets the information. 
With this in mind, rather than clear-cut developmental boundaries of when an infant might show evidence of acquiring any particular linguistic ability, the process is more multiplexed, with the use of some types of information potentiated only once the relevant representations are in place. While the input that the infant has access to is rich, some aspects of the signal are inherently more salient than others, and some of those properties that are perhaps not as inherently salient can become enhanced through experience, development, and their contribution to category formation. In other words, some sound properties have raw acoustic salience, while others require converging sources of evidence (be it multi-sensory or contextual) to be processed, discriminated, or learned. Tone is a paradigmatic example of how acoustic salience (pitch), and the representational structures that are in place influence its processing.

In considering the interface of initial perceptual biases, the extraction of word forms and connecting them to concepts, and the emergence of phonemes, PRIMIR initially focused on segments as both carriers of linguistic contrast (e.g., /dag/ "dog" vs. /bag/ "bog") and of indexical cues, such as talker and affect. However, pitch is perhaps an even more telling example of a perceptual property that can be implemented both as a tone (carried on vowels, voiced/sonorant consonants, or syllables) to designate and contrast meaning and as an indexical signal of an array of functions from focus to emotional valence. The range of articles in this special issue eloquently describes these multiple functions and addresses the unique challenges for learners.

#### Edited by:

Leher Singh, National University of Singapore, Singapore

#### Reviewed by:

René Kager, Utrecht University, Netherlands

#### \*Correspondence:

Suzanne Curtin scurtin@ucalgary.ca Janet F. Werker jwerker@psych.ubc.ca

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 03 May 2018 Accepted: 30 May 2018 Published: 15 June 2018

#### Citation:

Curtin S and Werker JF (2018) PRIMIR on Tone. Front. Psychol. 9:1007. doi: 10.3389/fpsyg.2018.01007

We highlight some of these findings to illustrate how PRIMIR can help explain how infants simultaneously learn to use pitch for indexical information and tone for word learning.

Consistent with PRIMIR are a number of studies examining the discrimination and functional use of native and nonnative tones. In some studies, findings are consistent with how phonetic sensitivities develop within the General Perceptual plane. For example, babies growing up in both tone and non-tone languages discriminate tones at a younger age (4–6 months), and stop doing so by 9 months unless they grow up in a tone language (e.g., Mattock and Burnham, 2006; Yeung et al., 2013; Liu and Kager, 2014). However, also consistent with PRIMIR, there are variations in performance as a function of the discrimination procedure (Gotz et al., this issue), and as a function of the acoustic salience of the tones being compared (Chen et al., this issue; Cheng and Lee, this issue; Tsao, this issue). A finding in the tone literature that uniquely supports PRIMIR is that even in the cases where there is a decline in non-native tone discrimination between 6 and 9 months, there is a rebound in discrimination by 18 months (Gotz et al., this issue; Liu and Kager, this issue) but an inability to use these (non-native) tone distinctions to guide word learning at 18 months (Burnham et al., this issue; Liu and Kager, this issue). These findings reveal simultaneous access to multiple representational planes. Importantly, within the PRIMIR framework, emergent phonemes help to direct information about lexical processing around 18 months of age. Thus, while infants of this age can still access acoustic information contained within the General Perceptual plane and direct attention in a discrimination task (as seen in the rebound in tone discrimination), the native phonological system constrains the mapping of non-native contrasts to distinct words, further suggesting that by 18 months, infants treat lexical tones as phonemic elements.

In our refocused version of PRIMIR (Curtin et al., 2011), we began to explore how bilingual infants' experiences with their dual language input may result in them approaching tasks differently than monolingual peers. We had not yet considered, however, that their dual language experience might boost their attention to acoustic and phonetic information in tone. Nor have we considered what type of dual language experience (e.g., tone/non-tone, two non-tone) might boost attention. Bilingual infants learning two non-tone languages show a rebound in tone discrimination 6 months earlier than monolingual infants (Liu and Kager, 2017). Bilingual English-Mandarin infants demonstrate an ability to use the acoustically salient Thai tone contrast in a word learning task while monolingual Mandarin infants do not (Burnham et al., this issue). The extent to which dual language experience shifts infants' use of the dynamic filters across the representational planes is an exciting new direction for the further development of PRIMIR.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

The writing of this article was supported by the Social Sciences and Humanities Research Council of Canada (grant 435-2017-0120 to SC and grant 435-2014-0917 to JW) and the Natural Sciences and Engineering Research Council of Canada (grant 327319-2012 to SC).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Curtin and Werker. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Adult Learning of Novel Words in a Non-native Language: Consonants, Vowels, and Tones

Silvana Poltrock<sup>1,2,3\*</sup>, Hui Chen<sup>1,2</sup>, Celia Kwok<sup>4</sup>, Hintat Cheung<sup>4</sup> and Thierry Nazzi<sup>1,2\*</sup>

<sup>1</sup> Université Paris Descartes, Sorbonne Paris Cité, Paris, France, <sup>2</sup> CNRS, Laboratoire Psychologie de la Perception, Paris, France, <sup>3</sup> Department Linguistik, Universität Potsdam, Potsdam, Germany, <sup>4</sup> Department of Linguistics and Modern Language Studies, The Education University of Hong Kong, Tai Po, Hong Kong

#### Edited by:

Jessica Hay, University of Tennessee, Knoxville, United States

#### Reviewed by:

Mariapaola D'Imperio, Aix-Marseille Université, France; Mireille Besson, Institut de Neurosciences Cognitives de la Méditerranée (INCM), France; Aaron D. Mitchel, Bucknell University, United States

#### \*Correspondence:

Silvana Poltrock poltrocks@gmail.com; Thierry Nazzi thierry.nazzi@parisdescartes.fr

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 16 October 2017 Accepted: 26 June 2018 Published: 24 July 2018

#### Citation:

Poltrock S, Chen H, Kwok C, Cheung H and Nazzi T (2018) Adult Learning of Novel Words in a Non-native Language: Consonants, Vowels, and Tones. Front. Psychol. 9:1211. doi: 10.3389/fpsyg.2018.01211

While words are distinguished primarily by consonants and vowels in many languages, tones are also used in the majority of the world's languages to cue lexical contrasts. However, studies on novel word learning have largely concentrated on consonants and vowels. To shed more light on the use of tonal information in novel word learning and its relationship with the development of phonological categories, the present study explored how adults' ability to learn minimal pair pseudowords in a tone language is modulated by their native phonological knowledge. Twenty-four adult speakers of each of three languages were tested: Cantonese, Mandarin, and French. Eye-tracking was used to record the eye movements of these learners while they were watching animated cartoons in Cantonese. On each trial, adults had to learn two new label-object associations, where the labels differed minimally by a consonant, a vowel, or a tone. Learning would therefore attest to participants' ability to use phonological information to distinguish the paired words. Results first revealed that adult learners in each language group performed better than chance in all conditions. Moreover, compared to native Cantonese adults, both Mandarin- and French-speaking adults performed worse on all three contrasts. In addition, French adults were worse on tones when compared to Mandarin adults. Lastly, the advantage for consonantal information in native lexical processing predicted by the "division of labor" proposal was not found for Cantonese-speaking adults, thus confirming crosslinguistic differences in consonant/vowel weight between speakers of tonal vs. non-tonal languages. These findings establish rapid novel word learning in a non-native language (long-term learning will have to be further assessed), modulated by native phonological knowledge. The implications of the findings of this adult study for further infant word learning studies are discussed.

Keywords: word learning, minimal pairs, non-native speech perception, tones, adults

### INTRODUCTION

Learning words is a crucial step in learning a language, no matter whether it is one's initial, native language as an infant, or a new, non-native language as an adult. Importantly, learning new words requires the ability to process relevant phonetic information and represent it in proper phonological categories. This ability is largely based on which phonetic variations are relevant to word meaning and how phonological categories are established in one's native language. In all languages, lexical representations include segmental information related to the identity of consonants and vowels constituting the word forms. In many languages, lexical representations also include suprasegmental information, such as lexical stress, pitch accents or tones. Phonological repertoires vary across languages, and the same is true for lexically-relevant prosodic information. For instance, both Cantonese and Mandarin have tonal systems, but their systems differ in both number and identity of tones (Wang, 1963; Mandarin: Cheng, 1966; Hashimoto, 1972; Howie, 1976; Cantonese: Bauer and Benedict, 1997; Duanmu, 2000; Yip, 2002), as illustrated in **Figure 1.** The present study will explore the interplay between word learning and phonological processing (of consonants, vowels and tones) by comparing three groups of adults when learning minimal pairs of Cantonese words in their native (Cantonese-speaking adults) vs. a non-native (Mandarin- and French-speaking adults) language.

Decades of research have established that speech perception becomes language-specific during the first year of life, as attested by decreases in ability to discriminate many (though not all, see below) non-native phonological contrasts (that is, contrasts not used in one's native language), and increases in the ability to discriminate native contrasts. These changes have been found to happen later for consonants (by 10–12 months of age, e.g., Werker and Tees, 1984a; Best et al., 1988; Rivera-Gaxiola et al., 2005), than for vowels (by 6 months of age, Kuhl et al., 1992; Polka and Werker, 1994). The developmental timing for tones is less clear, as changes are usually reported around 10 months of age (Mattock and Burnham, 2006; Mattock et al., 2008; Liu and Kager, 2014; Cabrera et al., 2015) although evidence for changes as early as 4 months has been found in one study (Yeung et al., 2013).

These developmental changes in speech perception, which attest to the early acquisition of the phonological repertoire of the native language, have continued effects in adulthood. Speech perception difficulties have been found for the processing of non-native consonants (e.g., Werker and Tees, 1984b), non-native vowels (e.g., Polka, 1995), and non-native tones (Gandour et al., 2000; Hallé et al., 2004; So and Best, 2010). This is attested by the fact that adults will sometimes have difficulties identifying some non-native sounds, and/or discriminating between non-native sounds. For example, regarding consonant perception, the fact that the English /r/ consonant does not have an equivalent in either Japanese or German has been found to lead to differences in how Japanese- and German-speaking adults perceive this sound (in contrast to English /l/) when compared with English-speaking adults (e.g., Miyawaki et al., 1975; Iverson et al., 2003). Moreover, perception of this non-native contrast differs across the two language groups, with more difficulty observed for the Japanese-speaking adults, who appear to form only one sound category, compared to the German-speaking adults, who perceive two sound categories (e.g., Iverson et al., 2003). This further shows that these processing difficulties stem from interference with the native phonological system. With respect to tones, many studies have found that speakers of languages that do not use tone contrasts at the lexical level identify and discriminate non-native tones with more difficulty than speakers of tonal languages (Gandour et al., 2000; Hallé et al., 2004; So and Best, 2010). Even though some discrimination ability is found in speakers of non-tonal languages, they appear to process tones differently. This is attested, for example, by the fact that (non-tonal) French-speaking adults, while being able to discriminate non-native Mandarin tones, perceive these tones less categorically than Mandarin-speaking adults. Some have proposed to link this to their lack of phonological categories for tones (Hallé et al., 2004).

The first goal of the present study was thus to explore the effects of such perceptual changes on adults' ability to learn new words in an unfamiliar language. Although this is a situation that adults have to cope with when learning a new language, it has received surprisingly little attention to this day (but see Chandrasekaran et al., 2010; Cooper and Wang, 2012, 2013, for training studies on English-speaking adults' processing of Cantonese or Mandarin tones in lexical contexts). Here, we evaluated monolingually-raised Mandarin- and French-speaking adults' ability to learn new words in Cantonese, and compared their performance to baseline data from Cantonese-speaking adults. This was done in a laboratory setting, during which, on each trial, adults had to learn a pair of Cantonese pseudowords that differed by either a consonant, a vowel or a tone. Given the above language specialization findings at the perceptual level, evidenced by difficulties in low-level (discrimination or identification) non-native processing, we predict that adults (and infants from 6 to 12 months onward) should have more difficulty learning new words in a non-native language than in their native language, because these words are made up of some sounds that do not belong to the native phonological repertoire. Hence, overall performance should be higher for Cantonese-speaking adults than for Mandarin- and French-speaking adults, and it might even be that the latter two groups fail at learning. Another possibility is that linguistic distance between the native language and the language of the stimuli affects performance. Since Cantonese and Mandarin, both being Sino-Tibetan languages, share many phonological, morphological, and syntactic properties (Li, 1937; Gong, 1980; DeLancey, 2009), which is not the case with Cantonese and Indo-European French, Mandarin-speaking adults might have higher overall performance than French-speaking adults.

One important feature of our experimental design is the fact that on each trial adults had to learn a pair of new pseudowords. Therefore, for learning to take place, adults had to process the phonological contrast distinguishing the two paired words, which allowed us to explore in more detail the interplay between phonological and lexical processing in this process of acquiring new words. To begin with, the fact that the pseudowords contrasted in either consonant, vowel, or tone information allowed us to evaluate differential processing of these three phonological sound categories. This second goal of the present study was motivated by the proposal that consonants carry more information about the lexicon, whereas vowels play a more important role in syntactic and prosodic processing (Nespor et al., 2003). For example, in word reconstruction studies in which English, Dutch, and Spanish listeners hear pseudowords and have to transform them into real words, they preserve consonantal over vocalic information, changing kebra into cobra rather than zebra (van Ooijen, 1996; Cutler et al., 2000). Evidence for this bias for consonantal information in lexical processing (often referred to as the C-bias in the literature) is supported by studies with adults across several non-tonal languages (Dutch, English, French, Italian, Spanish) and a variety of different methods such as word learning (e.g., Bonatti et al., 2005; Creel et al., 2006; Toro et al., 2008; Havy et al., 2014) and lexical access (e.g., New et al., 2008; Carreiras et al., 2009; Delle Luche et al., 2014; New and Nazzi, 2014).

Importantly though, when this project was started, little was known about the C-bias in non-European languages, and in particular in tone languages. Tone languages provide a particularly interesting test of the C-bias as lexical tones are mostly carried by vowels. This might affect performance, in two opposing ways. Indeed, the need for speakers of tone languages to attend to tones to identify words might increase their attention to the vowels (which carry them), and thus increase the weight given to vowels compared to consonants in tone languages compared to non-tone languages. This might result in a lack of bias or in a reversed advantage in processing vocalic information. In contrast, the fact that vowels carry tones might make the acoustic realization of vowels more variable in tone than in non-tone languages, making them more difficult to process and identify. If so, the consonant bias in lexical processing found in non-tone languages might be even more pronounced in tone languages.

To date, only two studies have explored this issue, but have focused on levels other than word learning: lexical access to known words, and word form segmentation in an artificial language. First, in a word reconstruction study based on van Ooijen (1996) and testing lexical access, Wiener and Turnbull (2016) asked participants to transform a pseudoword into a real word by changing either a consonant, a vowel (in fact, to conform to Chinese phonology, they were asked to change the final; in Chinese phonology, and in the stimuli used in that study, the final corresponds to V, VV, or VVN), a tone, or any of the three. Results show effects of condition, corresponding to the fact that Mandarin-speaking adults appear to preferentially change tones over both consonants and vowels/finals, with vowels appearing to be the least mutable sound category, contrary to what had been found in Dutch, English, and Spanish (van Ooijen, 1996; Cutler et al., 2000). These findings suggest a different balance in the weight given to consonants and vowels, with less weight given to consonants (or more weight given to vowels), by Mandarin-speaking adults. Second, an artificial language study exploring whether Cantonese-speaking adults use consonants or vowels (and tones) to segment a fluent speech stream revealed that they could not use consonantal information alone, but could rely either on vocalic information alone (although the difference between the consonant and vowel conditions was not significant), or more likely on a combination of vocalic and tonal information (Gómez et al., 2017). This also suggests a different balance in the weight given to consonants and vowels by Cantonese-speaking adults as compared to French- or Italian-speaking adults (Bonatti et al., 2005; Toro et al., 2008). The present study will add to this literature by providing the first evaluation of this issue in a word learning task for tonal language speakers, for either the native language (Cantonese adults processing Cantonese stimuli) or a foreign language (Mandarin adults processing Cantonese stimuli). It will also provide the first evidence of whether the C-bias, found in French-speaking adults when processing native stimuli, would also extend to the processing of non-native stimuli (French adults processing Cantonese stimuli).

The results regarding the use of tonal information by French-speaking adults when learning words will also inform our understanding of the link between phonological and lexical processing. Previous studies on tone perception/identification have established that even though adult speakers of non-tonal languages have more difficulty with tone processing than speakers of tone languages (e.g., Gandour et al., 2000; Hallé et al., 2004; So and Best, 2010), they tend to perform above chance levels. Recently, Liu and Kager (2014) found that tone discrimination in non-tonal Dutch-learning infants follows a U-shaped function: while 5–6-month-olds can discriminate a Mandarin tone contrast easily, their sensitivity declines between 8 and 15 months, but they regain sensitivity to it by 17–18 months. They suggested that this rebound in sensitivity might be related to the acquisition of the intonation of the native language. If this increased sensitivity is not limited to low levels of processing, then the French speakers in our experiment might perform at above chance levels on the tone-contrasted trials. This prediction is supported by previous findings on English-speaking adults (Chandrasekaran et al., 2010; Cooper and Wang, 2012, 2013) showing tone processing in lexical contexts, although in these studies, participants were subject to intense word training (several training sessions of about 30 min, in which some feedback was provided). It is unclear whether non-tone language speaking adults would show sensitivity to tone information in a less intensive task.

The present study used eye-tracking to investigate the ability of Cantonese-, Mandarin-, and French-speaking adults to learn pairs of pseudowords in Cantonese, while processing fine phonetic information (consonant vs. vowel vs. tone information), whether used contrastively in the native language or not. On each of 24 trials, adults saw a pair of cartoons. In each cartoon, an unfamiliar object was presented visually while 6 sentences in Cantonese, each containing a pseudoword labeling that object, were heard. Between the two cartoons, the pseudowords differed by either a consonant (8 times), a vowel (8 times) or a tone (8 times). Adults were then tested on whether they had been able to learn the words following this short word learning phase, by presenting them with the two unfamiliar objects side-by-side, and observing their pattern of object looking before (prenaming phase) and after (postnaming phase) one of the objects was named. The current procedure was based on Experiment 1a of Havy et al. (2014) in which French-speaking adults were taught pairs of new pseudowords that differed either by a consonant or by a vowel. A comparison of performance in the two conditions to evaluate the consonant bias revealed that adults increased their looking times toward the target object (mean percentage of looking time at the target object in the postnaming minus the prenaming phase) similarly in both the consonant and vowel conditions, but that latencies to shift from the distractor to the target at the time of naming were shorter in the consonant than in the vowel condition, establishing a consonant bias. To explore the relative strength of the processing of consonant, vowel, and tone information in our three linguistic groups, we similarly analyzed changes in mean percentage of looking times at the target object between the prenaming and postnaming phases (which also evaluates whether the pseudowords were learned), and latencies to shift from the distractor to the target at the time of naming. We also performed cluster-based permutation analyses on the time course of looking times in the different conditions to determine when in processing looking times to the target differ among the three conditions.

## MATERIALS AND METHODS

### Participants

Seventy-two adults were tested in total: 24 Mandarin-speaking adults (age range 21–27 years, mean age: 23.9 years, 21 female), 24 French-speaking adults (age range 21–38 years, mean age: 25.2 years, 12 female), and 24 Cantonese-speaking adults (age range 20–43 years, mean age: 27.4 years, 20 female) who served as the native speaker control group. French- and Mandarin-speaking participants had no knowledge of Cantonese (and French adults had no knowledge of other tone languages either). All participants had grown up monolingual, and no further background information (such as musical abilities) was collected. French-speaking adults were tested in Paris; Cantonese- and Mandarin-speaking adults were tested in Hong Kong, the latter group within a week of their arrival in Hong Kong. Before the experiment started, written informed consent was obtained from all participants. Both the experimental protocol and consent procedures were reviewed and approved by the CERES (Comité d'évaluation éthique des projets de recherche) of the Université Paris Descartes and the Human Research Ethics Committee of the Education University of Hong Kong. All data were obtained according to the principles expressed in the Declaration of Helsinki.

### Stimuli

#### Speech Stimuli

The speech stimuli (presented in **Table 1**) consisted of 24 pairs of disyllabic CVCV Cantonese pseudowords, differing by a minimal phonological contrast of 1 feature (except for two 2-feature contrasts for consonants, and two 2-feature contrasts for vowels, which could not be avoided due to phonological and lexical constraints in Cantonese). All contrasts were on the first syllable of the words, and the second syllable was always associated with the high level tone (T1). Eight pairs involved a consonant contrast (e.g., /**k**<sup>h</sup>ɔ2.lɛ1/ − /**t**<sup>h</sup>ɔ2.lɛ1/), 8 involved a vowel contrast (e.g., /p<sup>h</sup>**u**2.fɔ1/ − /p<sup>h</sup>**y**2.fɔ1/), and 8 involved a tone contrast (e.g., /p<sup>h</sup>a**5**.mi1/ − /p<sup>h</sup>a**6**.mi1/). The tones of the target syllables in the consonant and vowel pairs varied across trials.

While these pseudowords were all contrastive in Cantonese, some were not necessarily contrastive in French and/or Mandarin, as they were likely to assimilate to the same category in those languages, either equally well (such as /**k**<sup>h</sup>ɛ4.t<sup>h</sup>œ1/ − /**k**ɛ4.t<sup>h</sup>œ1/ in French, where [k<sup>h</sup>] and [k] are both allophones of /k/), or with one of the sounds assimilating better than the other (such as /kɛ1.tsɛ1/ − /k**œ**1.tsɛ1/ in Mandarin, which has a front-mid-unrounded vowel [ɛ] but not a front-mid-rounded vowel [œ]). See detailed explanations in the Appendix.

The words were presented in sentences in Cantonese. For the familiarization, they were embedded in a short passage, and appeared in six different sentences. In the test phase, one of the two words was designated twice, in two sentences (see details in the "Animated Cartoons" section below). All speech stimuli were recorded in a quiet room by a female native adult speaker of Hong Kong Cantonese. One audio file of each condition can be found in the Supplementary Material (Consonant trial: /**k**<sup>h</sup>ɔ2.lɛ1/ − /**t**<sup>h</sup>ɔ2.lɛ1/; Vowel trial: /p<sup>h</sup>**u**2.fɔ1/ − /p<sup>h</sup>**y**2.fɔ1/; Tone trial: /p<sup>h</sup>a**5**.mi1/ − /p<sup>h</sup>a**6**.mi1/). Note that while Mandarin and French adults did not speak Cantonese, the structure of the cartoon (with the moving object and the 6 sentences all embedding the target word) made it clear that each target word (which was thus the most frequent content word in each passage) was meant to name the object presented at the same time (which is confirmed by the results, see below).

Tone labels: T1 (High Level 55), T2 (High Rising 25), T3 (Mid-Level 33), T4 (Low Falling 21), T5 (Low Rising 23), T6 (Low Level 22).

#### Object Stimuli

Images of eight pairs of objects differing in shape, color and texture (see **Figure 2**) were taken from a previous study by Gonzalez-Gomez et al. (2013). The reason for using clearly different objects was to facilitate learning of the word-object pairings. All objects were selected so that they would look novel to the participants. All 8 object pairs were used 3 times, once in each condition (consonant, vowel, tone). This was done in order to ensure that overall performance differences across conditions could not be due to the objects used.

#### Animated Cartoons

The audio recordings were included in animated cartoons that have been successfully used in a computer-controlled word-learning task in toddlers by Gonzalez-Gomez et al. (2013). An example of a cartoon is illustrated in **Figure 3**.

On each trial, a female character behind a black board presented the two objects, one at a time (**Figure 3**, learning phase). The first object always appeared in the left upper corner of the screen. At the beginning, the object moved horizontally in the upper left part of the display, while it was labeled three times ("Look! A [label]! This is a [label]. Look at what I'm doing with the [label]!"). Then, the object started shifting down, while it was labeled one more time ("I'm putting the [label] here"). It started moving vertically in the lower left part of the screen and was labeled two more times ("Have you seen the [label]? Have a look at the [label]!") before disappearing. The second object was always introduced in the upper right corner of the display and followed a trajectory analogous to that of the first object. The cartoon experimenter followed the objects' movements with her eyes. Participants were successively trained on each label-object pairing for 30 s. The entire learning phase lasted 1 min and each label was repeated 6 times.

After the learning phase, participants were tested immediately on the given contrast. There was a close up on the face of the cartoon experimenter saying: "Look!" in order to direct the participants' fixations to the center of the screen. After the face disappeared, the two objects appeared at the same time, each on the side where it had appeared during the learning phase, and started moving synchronously in a vertical way, for 5000 ms, while the out-of-sight speaker said: "Look at the [target]! Where's the [target]?" about half way through the presentation in order to divide the test phase into a prenaming and a postnaming phase of equal duration (**Figure 3**, test phase). Since the material was originally designed to test and compare performance in both adults and toddlers, and since it has been shown that it takes 367 ms for infants and toddlers to program eye movements (e.g., Swingley and Aslin, 2000), the cartoons were constructed so that the onset of the postnaming phase corresponded to the onset of the first target word + 367 ms for consonant and tone trials; while it corresponded to the onset of the first vowel of the target word + 367 ms for vowel trials (hence, it corresponded to the onset of the contrasting phoneme in all trials). However, for the adult data analyses, we changed the timing by time-locking the onset of the postnaming phase 200 ms after the onset of the contrasting phoneme (as usually done in adult studies, e.g., Barr, 2008), and reducing the size of the pre- and post-naming phases to 2,000 ms around this time point. Note that since there is debate whether tonal cues are already present in onset consonants, or whether they mostly become available with vowel onset, we conducted a preliminary analysis (see results section "Time course analysis for Cantonese speakers: onset of tonal information use") to explore this issue in our data.
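The adult time-locking described above amounts to simple window arithmetic, which can be sketched as follows. This is an illustrative sketch only, not the authors' analysis script: the function name `analysis_windows` and the example onset value are hypothetical, and it assumes that each phase spans 2,000 ms on its side of the boundary. The 200 ms offset comes from the text.

```python
def analysis_windows(contrast_onset_ms, offset_ms=200, window_ms=2000):
    """Return (prenaming, postnaming) windows as (start, end) tuples in ms.

    The postnaming phase is assumed to start `offset_ms` after the onset
    of the contrasting phoneme; each phase is assumed to span `window_ms`.
    """
    boundary = contrast_onset_ms + offset_ms
    prenaming = (boundary - window_ms, boundary)
    postnaming = (boundary, boundary + window_ms)
    return prenaming, postnaming


# Hypothetical trial: the contrasting phoneme starts 2500 ms into the test phase.
pre, post = analysis_windows(2500)
print(pre, post)  # (700, 2700) (2700, 4700)
```

For tone trials, the same function would be called with the vowel onset rather than the consonant onset if tonal information is taken to become available at the vowel.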

Every object pair was associated with one pseudoword pair in each experimental condition (e.g., objects A and B were associated with /k<sup>h</sup>ɔ2.lɛ1/ and /t<sup>h</sup>ɔ2.lɛ1/ in the consonant condition, with /p<sup>h</sup>u2.fɔ1/ − /p<sup>h</sup>y2.fɔ1/ in the vowel condition, and with /p<sup>h</sup>a5.mi1/ − /p<sup>h</sup>a6.mi1/ in the tone condition), for use in 24 different trials. Four versions of each cartoon were created so that in half of the trials, object/label A was the target (and consequently object/label B the distractor in those trials) and, in addition, the target was presented as first object in 50% of the trials and as second object in the other 50%. This yielded a total stimulus set of 96 movies, all having a resolution of 1280 × 930 pixels. Presentation of each of the four versions of each cartoon was counterbalanced across participants.
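The size of the stimulus set follows from crossing the 24 pseudoword pairs with the two counterbalancing factors. A minimal sketch of that enumeration is given below; the variable names and the tuple layout are hypothetical and serve only to make the count explicit.

```python
from itertools import product

N_PAIRS = 24  # pseudoword pairs (8 consonant, 8 vowel, 8 tone)

# Each cartoon varies which label is the target (A or B) and whether the
# target is introduced first or second during the learning phase.
versions = list(product(("A", "B"), ("first", "second")))

movies = [(pair, target, position)
          for pair in range(1, N_PAIRS + 1)
          for target, position in versions]

print(len(versions), len(movies))  # 4 96
```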

### Apparatus and Procedure

In Paris, the movies were presented on a 17-inch TFT monitor (1280 × 1024 pixel resolution) with an integrated Tobii T60 eye-tracking system, which was run by a Dell computer. The presentation of the stimuli and the storing of the data were performed with the Tobii Studio software. In Hong Kong, a Tobii TX300 was used, which was run by a Dell computer, with videos presented on a Tobii TX300 screen unit with a 1920 × 1080 pixel resolution.

Each participant was tested individually in a quiet, dimly lit laboratory room and watched 24 test trials in total. As French- and Mandarin-speaking participants had no knowledge of Cantonese, they received a warm-up trial in Cantonese, in which the two pseudowords used were phonetically different in every single segment (/ka/ - /su/) and in which subtitles were presented, in order to familiarize them with the task.

There were 12 pseudo-randomized orders, which were each presented to two participants in each of the three language groups. Four sub-blocks of 6 trials were presented. After every sub-block, the participant could take a break for as long as s/he wished. In each sub-block, there were 2 consonant trials (one with target on the left, one with target on the right), 2 vowel trials (left, right), and 2 tone trials (left, right). Consequently, within a subject, half of the time the target word was on the left, half of the time it was on the right. All 3 conditions were presented in the first 3 trials and there were never more than 2 target-left or 2 target-right trials in a row. None of the words was presented twice, but the objects occurred three times during the test. Note that the same object pairs were not presented within the same sub-block, in order to prevent learning interference. After creating order 1 with those constraints, it was mirrored to get order 2 (e.g., order 1: trial 1 - trial 24; order 2: trial 24 - trial 1). For order 3, we shuffled the trials of order 1 so that the ones that occurred in the first half in order 1 appeared in the second half (and the other way around). Order 4 was again a mirror of order 3. Orders 5-8 and 9-12 were exactly like orders 1-4 but the conditions were differently assigned following a Latin square design (e.g., order 1: Vpair1, Tpair8, Ctrial3 . . . ; order 5: Tpair1, Cpair8, Vtrial3 . . . ; order 9: Cpair1, Vpair8, Ttrial3 . . . Note that the number of each pair here means the specific object pair that was used). As a consequence, between-subjects counterbalancing ensured that each object-word pair was presented and tested on the right and left side equally often and occurred in all 3 conditions at the same serial position. The experiment lasted approximately 30 min.
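The derivation of the 12 orders from a single base order can be sketched as follows. This is a hedged illustration, not the procedure actually used to generate the lists: the function `build_orders` and the toy base order are hypothetical, the "shuffle" for order 3 is assumed to be a plain swap of the two halves, and the sketch does not enforce the other constraints (target side, sub-block composition, object-pair spacing). The condition rotation V → T → C follows the example given in the text.

```python
ROTATE = {"V": "T", "T": "C", "C": "V"}  # Latin-square rotation of conditions


def build_orders(order1):
    """Derive 12 presentation orders from a base order.

    `order1` is a list of (condition, object_pair) tuples.
    """
    half = len(order1) // 2
    order3 = order1[half:] + order1[:half]   # swap halves (assumed reading of "shuffled")
    base = [order1, order1[::-1], order3, order3[::-1]]  # orders 1-4
    orders = list(base)
    for _ in range(2):                       # orders 5-8, then 9-12
        base = [[(ROTATE[c], p) for c, p in o] for o in base]
        orders.extend(base)
    return orders


# Toy base order with 6 trials instead of 24 (hypothetical content).
toy = [("V", 1), ("T", 8), ("C", 3), ("C", 5), ("V", 2), ("T", 7)]
orders = build_orders(toy)
print(len(orders))    # 12
print(orders[4][:3])  # [('T', 1), ('C', 8), ('V', 3)]
print(orders[8][:3])  # [('C', 1), ('V', 8), ('T', 3)]
```

Because the rotation is applied to whole orders, each object pair keeps its serial position while cycling through all three conditions, which is the property the counterbalancing is meant to guarantee.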

### Data Analysis

The eye-tracking data used for the analysis consisted of the binocular gaze position (X and Y coordinates) at each timestamp, that is, every 16.6 ms for French-speaking adults and every 3.3 ms for Mandarin- and Cantonese-speaking adults. Trials in which no data were available for the postnaming phase were discarded from the analyses (27/1728 trials). The data were analyzed in R (version 3.4.3, R Core Team, 2017, http://www.r-project.org) using the eyetrackingR package (Dink and Ferguson, 2015, http://www.eyetrackingr.com) for the latency and the growth curve analysis as well as for the cluster-based permutation analysis.
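Because the two eye-trackers deliver samples at different rates (roughly 60 and 300 samples per second, given the intervals above), time course data are easier to compare once they are aggregated into common time bins. The sketch below shows one generic way of doing so in Python. It is an assumption for illustration only: the analyses reported here were run in R with eyetrackingR, and the function names, the 50 ms bin width, and the synthetic data are hypothetical.

```python
def bin_samples(samples, bin_ms=50):
    """Average on-target flags within fixed-width time bins.

    `samples` is a list of (timestamp_ms, on_target) tuples, where
    on_target is 1 (gaze on target) or 0 (gaze on distractor).
    Returns {bin_start_ms: proportion of on-target samples}.
    """
    sums, counts = {}, {}
    for t, on_target in samples:
        start = int(t // bin_ms) * bin_ms
        sums[start] = sums.get(start, 0) + on_target
        counts[start] = counts.get(start, 0) + 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


def synthetic(rate_hz, duration_ms=200, switch_ms=100):
    """Synthetic gaze record: on the distractor until switch_ms, then on the target."""
    n = int(duration_ms * rate_hz / 1000)
    return [(i * 1000 / rate_hz, int(i * 1000 / rate_hz >= switch_ms))
            for i in range(n)]


slow, fast = synthetic(60), synthetic(300)
print(len(slow), len(fast))                    # 12 60
print(bin_samples(slow) == bin_samples(fast))  # True
```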

### RESULTS

### Time Course Analysis for Cantonese Speakers: Onset of Tonal Information Use

To evaluate the issue of whether the onset of tones should be time-locked to the onset of the consonants or the vowels of the syllables in which they were embedded, we first plotted the time course of the Cantonese adults' target looking behavior during the test phase based on two analyses. In the first one, as originally planned when preparing the videos, the postnaming phase was aligned with the beginning of the onset consonant of the target words (see **Figure 4**, top panel). In the second analysis, we corrected the time course aligning the postnaming phase to the onset of the vowel (see **Figure 4**, bottom panel). On average, we corrected for 127 ms (range 32–235 ms). As can be seen from the comparison of the two figures, similar identification curves are found for the consonant and vowel conditions, with a very similar timing. While word recognition appears delayed in the tone condition compared to the other two conditions when recognition is time-locked to the onset of the consonant, this delay disappears when it is time-locked to the onset of the vowel. This suggests that tonal information is more likely available from vowel onset rather than consonant onset for the current set of pseudowords, and that the speed of use of tonal information in native processing is similar to that of consonantal and vocalic information.

Given the above findings, all analyses presented in the following sections are based on the recalculation of the pre/postnaming phase for the tone-contrasted trials, taking vowel onset +200 ms as the beginning of the postnaming phase. Note however that equivalent analyses time-locked to consonant onset provided the same pattern of results.

### Accuracy-Overall Analysis

We first calculated the mean proportion of target looking (PTL = total looking time to target / total looking time to both objects) on each trial for both the pre- and postnaming phase. For this purpose, two areas of interest (AOI) were defined (575 × 895 pixels), each including one object. Time stamps that were not in any of the AOIs were treated as missing data, so that the calculated proportion of looking to one AOI is always relative to both AOIs, resulting in values between 0 and 1 (i.e., a proportion value of 0.5 means that each AOI was looked at equally long). Word learning is typically reflected by the naming effect, which corresponds to an increase in the proportion of target looking between the pre- and the postnaming phases that is significantly above 0 (e.g., Singh et al., 2015). The purpose of this prenaming correction is to control for looking preferences that are independent of the labeling. Difference scores between the pre- and postnaming phases were therefore calculated for each adult and each of the 24 contrast pairs, and then averaged for the 3 types of contrasts (see **Figure 5**). Zero corresponds to no increase in looking to target between the pre- and postnaming phases (chance performance). Positive difference scores mean an increase in target looking proportion.
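As an illustration only, the PTL and difference-score computation described above can be sketched in Python (the reported analyses were run in R). The sketch assumes equally spaced gaze samples, so that sample counts stand in for looking time; the function names and the `'target'`/`'distractor'`/`None` labels are hypothetical, and returning `None` for a phase without usable data is an assumption.

```python
def proportion_target_looking(samples):
    """PTL for one phase of one trial.

    `samples` holds one AOI label per timestamp: 'target', 'distractor',
    or None for gaze outside both AOIs (treated as missing data).
    """
    target = sum(1 for s in samples if s == "target")
    distractor = sum(1 for s in samples if s == "distractor")
    total = target + distractor
    if total == 0:
        return None  # assumption: no usable gaze data in this phase
    return target / total


def naming_effect(pre_samples, post_samples):
    """Difference score: postnaming PTL minus prenaming PTL."""
    pre = proportion_target_looking(pre_samples)
    post = proportion_target_looking(post_samples)
    if pre is None or post is None:
        return None
    return post - pre


# Hypothetical trial with five samples per phase.
pre = ["target", "distractor", None, "distractor", "target"]
post = ["target", "target", "target", None, "distractor"]
effect = naming_effect(pre, post)  # positive value = more target looking after naming
```

A score of 0 then corresponds to chance performance, and per-participant means of these scores would feed the comparisons against 0.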

For each of the three types of contrasts, adults in each language group exhibit an above chance naming effect (all ps < 0.001; see **Table 2** for details). This establishes that adults in all language groups could learn the words in all conditions, even though all stimuli were in Cantonese, a language not known by the Mandarin- and French-speaking adults.

To test for differences between language group and type of contrast, a 2-way ANOVA with the main factors of native language (Cantonese, Mandarin, French) and type of contrast (consonant, vowel, tone) was performed. A main effect of language [F(2, 69) = 16.96; p < 0.001] was found. T-tests revealed that Cantonese-speaking adults had a larger naming effect (0.40) than both Mandarin- [0.29, t(46) = 3.92; p < 0.001] and French-speaking adults [0.22, t(46) = 6.19; p < 0.001], whose performance differed only marginally [t(46) = 1.92; p = 0.06]. This indicates an advantage of learning in one's native language vs. in an unknown language, and furthermore points toward a linguistic distance effect, as Mandarin and Cantonese are related languages while French is unrelated to Cantonese.

There was also a main effect of type of contrast [F(2, 138) = 10.46; p < 0.001], naming effects being larger for both consonants (0.33) and vowels (0.34) than for tones [0.26; t(71) = 3.34, p = 0.001, and t(71) = 3.34; p = 0.001, respectively]. This indicates that tone contrasts were overall more difficult to process than consonant and vowel contrasts. Performance between the consonant and vowel condition did not differ [t(71) = 0.65, p = 0.51]. In addition, the native language x type of contrast interaction was significant [F(4, 138) = 4.86; p = 0.001]. This indicates that performance for the different types of contrasts was differently affected in the three language groups. Compared to native Cantonese-speaking adults, both Mandarin- and French-speaking adults performed significantly worse on all three contrasts [tone: 0.25 vs. 0.41, t(46) = 4.09; p < 0.001; 0.12 vs. 0.41, t(46) = 8.06; p < 0.001; vowel: 0.31 vs. 0.40, t(46) = 2.34; p = 0.02; 0.30 vs. 0.40, t(46) = 2.57; p = 0.01; consonant: 0.32 vs. 0.40, t(46) = 2.16; p = 0.04; 0.26 vs. 0.40, t(46) = 3.86; p < 0.001]. Additionally, French-speaking adults performed worse on tone contrasts than Mandarin-speaking adults [0.12 vs. 0.25, t(46) = 2.77; p = 0.008].

Comparing conditions within each language, taking all 8 trials per condition into account, no difference in performance between the consonant and vowel conditions was found for the three language groups [Cantonese speakers: 0.40 vs. 0.40, t(23) = 0.07, p = 0.94; Mandarin speakers: 0.32 vs. 0.31, t(23) = 0.24, p = 0.82; French speakers: 0.26 vs. 0.30, t(23) = 1.16, p = 0.26]. Performance on tone contrasts was lower than in the other two conditions for French speakers [0.12 vs. 0.28, t(23) = 5.14, p < 0.001], but not for Mandarin [0.25 vs. 0.32, t(23) = 1.61, p = 0.12] and Cantonese speakers [0.41 vs. 0.40, t(23) = 0.49, p = 0.63]. Redoing these analyses removing the Single Category trials and the Category Goodness trials in each condition (see details in Appendix) confirmed the lack of difference in performance between the 8 consonant and 5 vowel native-like/Two Category pairs for Mandarin [0.32 vs. 0.35, t(23) = 0.93, p = 0.36], and the 6 consonant and 7 vowel native-like/Two Category pairs for French [0.29 vs. 0.32, t(23) = 0.58, p = 0.56].

### Latency Analysis

Second, following Havy et al. (2014), we examined the participants' latency in shifting from the distractor to the target object, that is, the time needed to orient from the initially fixated distractor object to the target object after labeling. Faster latencies to the target object in a condition would indicate a processing advantage compared to the other conditions. In a first step, distractor-initial trials were defined as those in which participants fixated the distractor object at the onset of the pivotal phoneme (first consonant of the target word for consonant trials; first vowel for tone and vowel trials). These distractor-initial trials corresponded to, on average, 46% of all the trials (Cantonese: 45%; Mandarin: 45%; French: 48%). From those trials, we excluded trials in which participants shifted before the postnaming phase began (i.e., within the next 200 ms), as these saccades were probably programmed before the name of the target was processed (Cantonese: 21%; Mandarin: 22%; French: 10%), trials in which participants did not shift at all (Cantonese: 1%; Mandarin: 6%; French: 7%), and outliers, that is, values more than 2.5 standard deviations above or below the mean (Cantonese: 1%; Mandarin: 2%; French: 2%).

TABLE 2 | Naming effect broken down by language and condition.
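The trial selection for the latency analysis described above can be sketched as a simple filter. This is a minimal Python illustration, not the authors' R code: the dictionary keys `first_aoi` and `shift_ms` are hypothetical, and computing the 2.5 SD criterion over one pooled list of latencies is an assumption, since the text does not say whether the mean was taken per participant, per group, or overall.

```python
from statistics import mean, stdev

POSTNAMING_DELAY_MS = 200  # shifts earlier than this count as pre-programmed saccades
OUTLIER_SD = 2.5           # cutoff in standard deviations from the mean


def select_latencies(trials):
    """Return the shift latencies (ms) that survive the exclusion steps.

    Each trial is a dict with:
      'first_aoi': AOI fixated at the onset of the pivotal phoneme
      'shift_ms':  latency of the first shift to the target, or None if no shift
    """
    kept = []
    for trial in trials:
        if trial["first_aoi"] != "distractor":
            continue  # keep distractor-initial trials only
        if trial["shift_ms"] is None:
            continue  # participant never shifted to the target
        if trial["shift_ms"] < POSTNAMING_DELAY_MS:
            continue  # shift occurred before the postnaming phase began
        kept.append(trial["shift_ms"])
    if len(kept) < 2:
        return kept  # too few values to estimate a standard deviation
    m, sd = mean(kept), stdev(kept)
    return [x for x in kept if abs(x - m) <= OUTLIER_SD * sd]


# Hypothetical data: eleven ordinary latencies plus one of each excluded kind.
trials = [{"first_aoi": "distractor", "shift_ms": ms} for ms in range(450, 551, 10)]
trials += [
    {"first_aoi": "distractor", "shift_ms": 3000},  # extreme value
    {"first_aoi": "distractor", "shift_ms": 150},   # too early
    {"first_aoi": "distractor", "shift_ms": None},  # no shift
    {"first_aoi": "target", "shift_ms": 500},       # not distractor-initial
]
latencies = select_latencies(trials)
```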

Mean latencies and standard deviations are shown in **Table 3** for each language and condition. We fitted a linear mixed model with the function lmer of the R package lme4, with random effects for participants and items (Bates et al., 2015), and used the package languageR (Baayen, 2015) to obtain p-values. The model included fixed effects of condition (compared in sliding contrasts: Consonants-Vowels; Vowels-Tones), language group (also compared in sliding contrasts: French vs. Cantonese; Cantonese vs. Mandarin), and the interaction between condition and language group. We decided to test the consonant-vowel contrast to be able to compare with previously reported results, and the tone-vowel comparison because of the same target phoneme onset. As for the language contrasts, we took Cantonese as the native speaker reference group with which to compare both non-native speaker groups. The output measure was mean shift latency. The only significant differences were found between Cantonese and Mandarin participants (β = 163.31, SE = 50.23, t = 3.25, p = 0.002) and between Cantonese and French participants (β = −199.87, SE = 49.14, t = 4.07, p < 0.001), with the Cantonese participants having overall faster latencies than each of the two other language groups. This points, again, to a general native language advantage. Importantly, the conditions did not differ from each other or interact with language.
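For readers unfamiliar with sliding contrasts, the sketch below builds the successive-difference coding that the term usually refers to (the kind produced in R by `MASS::contr.sdif`). Whether the authors used that particular function is not stated, so this is an assumption, and `sliding_contrasts` is a hypothetical helper. With this coding, each coefficient estimates the difference between two adjacent levels and the intercept estimates the grand mean. The demonstration reuses the consonant, vowel, and tone naming effects reported in the accuracy analysis purely as example numbers.

```python
from fractions import Fraction


def sliding_contrasts(k):
    """k x (k-1) successive-difference contrast matrix for a k-level factor."""
    return [
        [Fraction(-(k - j), k) if i <= j else Fraction(j, k) for j in range(1, k)]
        for i in range(1, k + 1)
    ]


contrasts = sliding_contrasts(3)

# Example means, ordered consonant, vowel, tone.
means = [0.33, 0.34, 0.26]
grand_mean = sum(means) / len(means)
# Coefficients under this coding: differences between adjacent levels.
betas = [means[j + 1] - means[j] for j in range(len(means) - 1)]

# Intercept plus contrast-weighted coefficients reproduces each level mean.
rebuilt = [
    grand_mean + sum(float(contrasts[i][j]) * betas[j] for j in range(len(betas)))
    for i in range(len(means))
]
```

Under this coding, a two-contrast model of a three-level factor tests exactly the two adjacent comparisons named in the text and no others.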

### Growth Curve Analysis

Third, we conducted a Growth Curve Analysis (GCA), which includes time as a predictor, to estimate whether differences between conditions emerged over time within each language group. As dependent measure we took the transformed proportion data during the postnaming phase using the empirical logit (elog, aggregated in 100 ms time bins) and analyzed it with a weighted mixed-effects linear regression model within the eyetrackingR package (modeled after Mirman et al., 2008). For each language group separately, we entered condition (again compared in sliding contrasts: Consonants-Vowels; Vowels-Tones), orthogonal polynomials (linear, quadratic and cubic time component), and the interaction between each time term and condition as fixed effects. Participants and items were entered as random effects into the model.

TABLE 3 | Mean shift latencies in ms and their SDs (in brackets), broken down by language and condition.
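The empirical logit transformation mentioned above follows a standard formula (Barr, 2008), sketched here in Python for a single time bin. The function name and arguments are hypothetical; `target_samples` and `total_samples` stand for the number of gaze samples on the target and on either AOI within one 100 ms bin.

```python
import math


def empirical_logit(target_samples, total_samples):
    """Empirical logit of target looking in one bin, with its regression weight.

    Adding 0.5 to both counts keeps the value finite when the proportion
    is exactly 0 or 1, where the ordinary logit is undefined.
    """
    y, n = target_samples, total_samples
    elog = math.log((y + 0.5) / (n - y + 0.5))
    variance = 1.0 / (y + 0.5) + 1.0 / (n - y + 0.5)
    return elog, 1.0 / variance  # weight = inverse of the estimated variance


balanced, _ = empirical_logit(5, 10)    # equal looking to both objects
all_target, _ = empirical_logit(10, 10)  # only the target was fixated
```

In a weighted regression, bins with more samples and less extreme proportions receive larger weights.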

For Cantonese-speaking adults (see **Figure 6A**), conditions (Consonant-Vowel; Vowel-Tone) did not differ in their mean target looking, but both contrasts interacted (marginally) significantly with time (linear parameter: β = −0.93, SE = 0.21, p < 0.001; β = 0.52, SE = 0.21, p = 0.01; quadratic parameter: β = 0.45, SE = 0.21, p < 0.03; β = −0.37, SE = 0.28, p = 0.08).

For Mandarin-speaking adults (see **Figure 6B**), there was no significant main effect of the Consonant-Vowel and the Vowel-Tone contrast (both ps > 0.22), indicating no differences in the overall target looking in the postnaming phase between those conditions. We found a significant interaction between the Consonant-Vowel contrast and time (specifically, the quadratic and cubic parameter: β = 0.94, SE = 0.28, p < 0.001; β = −0.72, SE = 0.28, p = 0.009), and between the Vowel-Tone contrast and time (linear parameter: β = −0.71, SE = 0.28, p = 0.01).

For French-speaking adults (see **Figure 6C**), the GCA revealed a significant main effect of the Vowel-Tone contrast on the intercept term, confirming the overall lower target fixations for the tone trials relative to the vowel trials (β = −0.70, SE = 0.19, p = 0.001). In addition, the Vowel-Tone contrast interacted (marginally) significantly with time (linear time parameter: β = −0.39, SE = 0.21, p = 0.06; quadratic time parameter: β = 0.43, SE = 0.21, p = 0.04), suggesting divergent linear and non-linear temporal trajectories for tone and vowel trials. While the Consonant-Vowel contrast on the intercept term was not significant (β = 0.05, SE = 0.19, p = 0.81), its interaction with time was marginally significant (linear time parameter: β = 0.38, SE = 0.21, p = 0.06; cubic time parameter: β = −0.40, SE = 0.21, p = 0.06). This indicates that the temporal trajectory tends to differ between these conditions, although these differences are only trends, in line with the lack of mean target looking time differences during the postnaming phase between the consonant and vowel trials.

Note that eyetrackingR fits curves using orthogonal polynomials so that the estimated time parameters are independent from each other. As a consequence, the condition effect on the intercept corresponds to differences averaged across the entire postnaming phase. In a second model, we used natural polynomials in order to obtain so-called anticipatory effects, that is mean differences between conditions at the onset of postnaming phase (see Barr, 2008). These analyses revealed no significant effect of condition on the intercept term (all ps > 0.35). Thus, it can be ruled out that differences between conditions were already present before the postnaming phase started, that is before the critical information in a trial was processed.
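The difference between the two parameterizations can be illustrated with a small pure-Python sketch. It is not the eyetrackingR implementation: `orthogonal_polynomials` is a hypothetical helper that orthogonalizes the powers of time by Gram-Schmidt, and the linear example curve is made up. Because every orthogonal time term sums to zero across bins, the intercept equals the mean of the curve over the whole window, whereas with natural time the intercept is the fitted value at the first bin.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonal_polynomials(n_bins, degree):
    """Unit-length polynomial time terms that are orthogonal to each other and to a constant."""
    t = [float(i) for i in range(n_bins)]
    basis = [[1.0] * n_bins]  # constant term, kept only for orthogonalization
    for d in range(1, degree + 1):
        v = [ti ** d for ti in t]
        for b in basis:  # remove the part already explained by lower-order terms
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        basis.append([vi / norm for vi in v])
    return basis[1:]


n_bins = 20  # e.g., 20 bins of 100 ms
polys = orthogonal_polynomials(n_bins, 3)

# Made-up target-looking curve that rises linearly from 0.5.
curve = [0.5 + 0.02 * t for t in range(n_bins)]
time = list(range(n_bins))
mean_time = sum(time) / n_bins
mean_curve = sum(curve) / n_bins
centered_time = [t - mean_time for t in time]
slope = dot(centered_time, [c - mean_curve for c in curve]) / dot(centered_time, centered_time)

orth_intercept = mean_curve                          # average over the whole window
natural_intercept = mean_curve - slope * mean_time   # fitted value at bin 0
```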

### Cluster-Based Permutation Analysis

To further explore the different temporal trajectories that the results of the GCAs indicated, we conducted a cluster-based permutation analysis (Maris and Oostenveld, 2007) for each language group separately to identify the exact time periods where conditions differ significantly from each other. As dependent measure we took the proportion of target looking within each 100 ms bin across the postnaming phase (20 bins). In a first step, this analysis compares conditions at each time bin with a t-test and identifies any time period(s) of adjacent bins in which conditions significantly differ. As t-threshold we chose an α-level of 0.05 (two-tailed). This yields cluster-level t-value(s), which correspond to the sum of all single-sample t-values within the time period(s). In a second step, it generates a Monte Carlo distribution against which to compare the cluster-level t-value(s) by randomly assigning the trials to conditions and repeating step 1 several times (for our data: 1000 times). This results in a Monte Carlo p-value for each observed time cluster, which reflects the probability that this cluster could have occurred simply by chance.
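The two steps described above can be sketched in a few lines of Python. This is a simplified illustration rather than the eyetrackingR routine used for the reported results: it uses a Welch t statistic per bin, a fixed threshold passed in by the caller, the largest cluster of each permutation as the null distribution, and made-up data; all function names are hypothetical.

```python
import random
from statistics import mean, variance


def t_per_bin(a, b):
    """Welch t statistic at each time bin; a and b are lists of trials (lists of bin values)."""
    ts = []
    for col_a, col_b in zip(zip(*a), zip(*b)):
        se = (variance(col_a) / len(col_a) + variance(col_b) / len(col_b)) ** 0.5
        ts.append((mean(col_a) - mean(col_b)) / se if se > 0 else 0.0)
    return ts


def find_clusters(ts, threshold):
    """Runs of adjacent same-sign bins with |t| > threshold, as (first_bin, last_bin, summed_t)."""
    clusters, start = [], None
    for i, t in enumerate(ts + [0.0]):  # trailing 0.0 closes a cluster at the last bin
        active = abs(t) > threshold
        if start is not None and (not active or (t > 0) != (ts[start] > 0)):
            clusters.append((start, i - 1, sum(ts[start:i])))
            start = None
        if active and start is None:
            start = i
    return clusters


def cluster_permutation_test(a, b, threshold, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = find_clusters(t_per_bin(a, b), threshold)
    pooled, n_a = a + b, len(a)
    null_max = []
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # randomly reassign trials to the two conditions
        perm = find_clusters(t_per_bin(shuffled[:n_a], shuffled[n_a:]), threshold)
        null_max.append(max((abs(c[2]) for c in perm), default=0.0))
    return [
        (first, last, mass, (sum(m >= abs(mass) for m in null_max) + 1) / (n_perm + 1))
        for first, last, mass in observed
    ]


# Made-up data: 5 trials per condition, 20 bins, conditions differ in bins 5 to 9.
offsets = [-0.004, -0.002, 0.0, 0.002, 0.004]
cond_b = [[0.5 + offsets[(k + b) % 5] for b in range(20)] for k in range(5)]
cond_a = [[v + (0.2 if 5 <= b <= 9 else 0.0) for b, v in enumerate(trial)] for trial in cond_b]
# 2.306 is the two-tailed critical t for alpha = 0.05 with 8 degrees of freedom.
results = cluster_permutation_test(cond_a, cond_b, threshold=2.306)
```

Each entry of `results` gives the first and last bin of a cluster, its summed t-value, and its Monte Carlo p-value.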


This analysis revealed no differences between conditions for the Cantonese-speaking group. For Mandarin-speaking adults, Consonant and Vowel trials diverged from 300 to 900 ms during the postnaming phase (cluster t = 16.29, Monte Carlo p = 0.02) with Consonant trials having higher target looking proportions. While Tone trials did not differ from Vowel trials, they did from Consonant trials between 400 and 2000 ms during the postnaming phase (cluster t = 51.91, Monte Carlo p < 0.001), again Consonant trials having higher target looking proportions. Interestingly, redoing these analyses removing the Single Category and Category Goodness trials in each condition (see details in Appendix) there was no difference between conditions any more. For French-speaking adults, two significant clusters were found: both Tone and Vowel trials and Tone and Consonant trials diverged from 200 to 2000 ms during the postnaming phase (cluster t = 63.43, Monte Carlo p < 0.001; cluster t = 62.27, Monte Carlo p < 0.001, respectively), with Tone trials having lower target fixations. Consonant and Vowel trials did not differ. This was still the case after removing the Single Category and Category Goodness trials in the vowel and consonant conditions (see details in Appendix).

### DISCUSSION

In this study, we investigated whether and how adults can quickly learn new minimal pair words in a non-native tone language, Cantonese, and whether this ability is modulated by native phonological knowledge. We tested this learning ability in Mandarin- and French-speaking adults, using Cantonese pseudowords differing minimally in either a consonant, a vowel, or a tone, and compared their performance to that of native Cantonese-speaking adults. Overall, we found that all three groups of adults performed at above chance levels in learning the pseudowords, and this held for all three types of contrasts. Also, compared to native Cantonese-speaking adults, both Mandarin- and French-speaking adults performed worse on all three types of contrasts. Furthermore, French-speaking adults performed even worse on tones when compared to Mandarin-speaking adults.

The present findings first establish that adults in all three language groups could rapidly learn new words in a computer-based situation, after only 6 repetitions of each word. Note that the present interpretation in terms of word learning needs to be qualified by the fact that the present study does not establish long-term retention of lexical items, and that performance could result from simple associations between the pseudowords and either the objects or the side of the screen on which the objects were presented. Future studies will have to further probe our word-learning interpretation, using designs testing for word learning independent of object localization, and in long-term memory, for example adapting the word-learning design used in Dittinger et al. (2016). While this word-learning finding is in part trivial for the Cantonese-speaking adults (though see more discussion on this below), it holds even when the new words were presented to Mandarin- and French-speaking adults, for whom Cantonese was a non-native language, and who had no knowledge of Cantonese prior to taking part in the experiment. Our findings reveal a significant effect of nativeness status, as overall, Cantonese-speaking adults performed better than the other two groups (in overall performance and shift latency analyses).

The effect of linguistic distance is less clear-cut. Indeed, although Cantonese is closer to Mandarin than to French (at many levels including phonology, morphology and syntax, Li, 1937; Gong, 1980; DeLancey, 2009), this did not significantly impact overall performance and shift latencies, as French-speaking adults performed at the same overall level as Mandarin-speaking adults, in spite of a trend in the expected direction for overall performance (see further discussion in the Appendix for a more fine-grained approach). Our findings thus establish robust word learning abilities in a non-native language in adulthood, which contrast with the difficulties that adults have in learning some specific aspects of the phonology and syntax of non-native languages (e.g., Flege et al., 1999; Birdsong and Molis, 2001; Dupoux et al., 2008; Boll-Avetisyan et al., 2016). This difference might be due to the fact that while the acquisition of the phonology and syntax of one's native language is to a great extent completed in the first years of life, vocabulary acquisition is a lifelong, continuing process that allows for the acquisition of specialized vocabularies (as when, for example, becoming a -developmental- psychologist!) or learning the names of new objects and concepts (e.g., to "log into" a "googledoc" on one's "iphone") in the native language.

Importantly, these word learning abilities were found in a specific learning context in which adults had to learn words presented in pairs, and in which the sound forms of the two words differed only by a consonant, vowel or tone. The fact that Mandarin- and French-speaking adults succeeded in learning the word pairs in all three conditions establishes that they could process fine segmental (consonantal and vocalic) and suprasegmental (tonal) information in doing so, and that they were establishing representations of the word forms that included specific segmental or tonal information. This finding is particularly striking for the French speakers' performance with tone contrasts, given that tones are not used in French at the lexical level. It could be due to the fact that these contrasts were introduced to the adults in minimal pairs of words, where they had to pay attention to the fine phonetic detail in order to distinguish the objects and memorize the words. Further research will be needed to explore whether our French-speaking adults would have failed to use such precise phonetic information if they had not been presented with minimal pairs, leading to lower or at-chance performance. Importantly though, the ability of the French-speaking adults to use tonal information when learning words suggests that the rebound in tone discrimination found in late infancy in Dutch, another non-tonal language (Liu and Kager, 2014), interpreted in relation to the acquisition of the intonation of the native language, would not be limited to low levels of processing, but would extend to the lexical level.

Our findings also establish that adult performance is not solely based on the acoustic distance between the contrasted sounds, but is also dependent on their native phonological system. At this more fine-grained level, language distance appears to play a role, as our results clearly show that the Mandarin-speaking adults performed better than the French-speaking adults in learning words distinguished by Cantonese tonal contrasts. Since there was no difference in performance between the two language groups for consonants and vowels, this effect likely indicates that Mandarin-speaking adults, as experienced tone language users, exhibit greater ability in processing non-native tonal information at the lexical level, when compared to the non-tone user French speakers. In the Appendix, we present exploratory analyses, based on individual trials analyses, that allow some evaluation of the Perceptual Assimilation Model (PAM; for consonants: Best, 1995; for vowels: Tyler et al., 2014; for tones: Hallé et al., 2004) applied here at the level of word learning rather than speech processing.

Besides providing data on the phonological/lexical interface in processing a new, non-native language, our results also provide an evaluation of the use of tonal information in word learning, and its impact on processing consonantal and vocalic information at the lexical level in native speakers of a tone language. Regarding the use of tonal contrasts, we found that tonal contrasts are as important as consonantal and vocalic contrasts in processing word meanings for native Cantonese-speaking adults. This is revealed by the overall accuracy analyses showing that Cantonese adults perform at the same level in all three contrast conditions. Our time course analysis further shows that all three kinds of contrasts are processed at the same speed from the onset of the contrasting phonemes. For the tones, the comparison of our two analyses time-locked to consonant vs. vowel onset suggests that tonal information became available from the onset of the vowel. This might be related to the fact that 6 of the 8 pairs we presented started with unvoiced consonants, so that tonal information was mostly carried by the vowels. Whether a similar pattern would be found for syllables starting with voiced consonants would need to be evaluated in an experimental design counterbalancing the two types of consonants.

Furthermore, our work bears on the issue of the relative weight given to consonantal and vocalic information in lexical processing. Previous studies on various Indo-European languages (English, Dutch, French, Italian, Spanish) have found that adults have a consonant bias in accessing or learning words (e.g., van Ooijen, 1996; Cutler et al., 2000; Bonatti et al., 2005; Creel et al., 2006; New et al., 2008; Toro et al., 2008; Carreiras et al., 2009; Delle Luche et al., 2014; Havy et al., 2014; New and Nazzi, 2014). This supports the "division of labor" proposal by Nespor et al. (2003) that consonants are given more weight than vowels in lexical processing (while vowels are given more weight than consonants at the prosodic/syntactic levels). Accordingly, in the present study, we investigated Cantonese, a tone language, where lexical meanings are also crucially cued by tones. Our interest came from the fact that since tones are essentially associated with the voiced portions of syllables, which mostly correspond to vowels (and nasal codas) in Cantonese, and since only a few onset consonants (/j, w, m, n, ŋ/) are voiced in that language, the relative weight given to consonants and vowels might be different from what has been found for Indo-European languages. The effect of tones could either increase the weight given to vowels (since they carry both segmental and tonal information, compared to only segmental information in non-tone languages) or decrease their weight even further (due to additional acoustic variation related to tonal differences and the fact that each vowel in Cantonese can carry 6 different tones).

Our findings did not reveal any advantage for either consonantal or vocalic information in lexical processing for Cantonese-speaking adults, as shown by their overall similar performance in the consonant and vowel contrast conditions. This finding differs from all previous findings on adult speakers of non-tonal languages, and in particular from the clear C-bias found in latency analyses when French adults learn new words (Havy et al., 2014). The present null effect (lack of difference between the C and V conditions), which thus needs to be interpreted with caution, might be taken as evidence that Cantonese-speaking adults pay more attention to vowels that carry tonal information than non-tone language users, resulting in a lack of C-bias. This interpretation needs to be considered cautiously given that a null effect was also found in the other two language groups, including in the French-speaking adults, who have been documented to have a C-bias in lexical processing when processing words in their native language (e.g., Bonatti et al., 2005; New et al., 2008; Havy et al., 2014). This lack of effect in the French- and Mandarin-speaking adults could mean that there is something in the acoustics of the stimuli used in the present study that does not support a C-bias. Alternatively, it could mean that the C-bias only operates in the native language, or in languages in which adults have sufficient experience/knowledge, hence the null effect found here for the French-speaking (and Mandarin-speaking) adults who had no or limited knowledge of Cantonese. Importantly though, our interpretation in terms of lack of a C-bias in Cantonese is corroborated by two recent studies having explored similar issues in either Mandarin- (Wiener and Turnbull, 2016) or Cantonese-speaking (Gómez et al., 2017) adults, using a word reconstruction and a word form segmentation task, respectively. As discussed in the introduction, their findings differ from those previously found in Indo-European languages, failing to find a clear C-bias in both languages, thus suggesting a different balance in the weight given to consonants and vowels in these two tone languages.

The above findings that begin to establish crosslinguistic differences in consonant/vowel weight between adult listeners of tonal vs. non-tonal languages are to be considered in relation to infant studies on Indo-European languages that have shown that the C-bias is modulated in infancy both developmentally and crosslinguistically (see Nazzi et al., 2016, for a complete review). Indeed, results from French and Italian show that while a C-bias is found from 8 months onward (Nazzi, 2005; Hochmann et al., 2011; Poltrock and Nazzi, 2015; Nishibayashi and Nazzi, 2016), it is not present up to 6 months of age (Benavides-Varela et al., 2012; Bouchon et al., 2015; Nishibayashi and Nazzi, 2016; Hochmann et al., 2018). Moreover, a C-bias could not be attested before 30 months in British English-learning infants (Nazzi et al., 2009; Floccia et al., 2014), and Danish-learning 20-month-olds demonstrate a V-bias (Højen and Nazzi, 2016). Taken together, these studies suggest that the C-bias is acquired and that its acquisition depends on the phonological and lexical properties of the native language.

Given that Cantonese- and Mandarin-speaking adults appear to have a reduced or reversed bias (Wiener and Turnbull, 2016; Gómez et al., 2017; present study), it is of great interest to expand research on the consonant bias to infants and toddlers learning a tone language, which was one of the original motivations for setting up the present study. At present, only one study has started to explore this issue in (Mandarin-dominant) Mandarin-English bilingual toddlers (aged 2.5–3.5 years) and preschoolers (aged 4–5 years). In a word recognition task exploring their sensitivity to mispronunciations of known words, the toddlers were found to be more sensitive to tone than consonant and vowel mispronunciations, while the reverse pattern was found in preschoolers (Singh et al., 2015). However, at both ages, no differences in sensitivity were found between consonant and vowel mispronunciations. Future studies will have to expand on this first finding, exploring such effects in younger monolingual infants learning various tone languages, and exploring various aspects of lexical processing, including both word learning and lexical comprehension.

In conclusion, the present study establishes adults' word learning abilities in an unknown language, and shows that level of performance is modulated by how the phonologies of the native and non-native languages map onto each other. The findings also provide evidence suggesting that being a speaker of a tonal language reduces the consonant bias in lexical processing previously found in adult speakers of several Indo-European languages, probably due to the fact that tones are carried by vowels more than by consonants. However, no clear bias could be found for either consonants or vowels, and future studies will have to further probe the link between phonological and lexical processing in tone languages. These findings nevertheless lay the foundations for equivalent developmental studies that will inform our understanding of what determines the phonological biases that are observed in lexical processing.

### AUTHOR CONTRIBUTIONS

SP created the experimental stimuli and design, conducted data analyses and drafted the manuscript. CK assisted in stimuli creation, recruited participants and collected data in Hong Kong. HuiC interpreted the data and drafted the manuscript. HC and TN conceptualized the study and supervised all stages of the project.

### ACKNOWLEDGMENTS

This work was partly funded by ANR-13-BSH2-0004 to TN, and ANR-15-CE28-0011 to TN and HC. We would like to thank the participants for their time, and Sylvie Margules for help running the experiment.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01211/full#supplementary-material

### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Poltrock, Chen, Kwok, Cheung and Nazzi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Training Children to Perceive Non-native Lexical Tones: Tone Language Background, Bilingualism, and Auditory-Visual Information

Benjawan Kasisopa<sup>1</sup> \*, Lamya El-Khoury Antonios<sup>1</sup>, Allard Jongman<sup>2,3</sup>, Joan A. Sereno<sup>2,3</sup> and Denis Burnham<sup>1</sup>

*<sup>1</sup> MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Penrith, NSW, Australia, <sup>2</sup> Department of Linguistics, College of Liberal Arts & Sciences, University of Kansas, Lawrence, KS, United States, <sup>3</sup> Phonetics & Psycholinguistics Laboratory, Department of Linguistics, College of Liberal Arts & Sciences, University of Kansas, Lawrence, KS, United States*

#### Edited by:

*Judit Gervain, Centre national de la Recherche Scientifique (CNRS), France*

#### Reviewed by:

*Yang Zhang, University of Minnesota Twin Cities, United States Arturo Hernandez, University of Houston, United States*

\*Correspondence: *Benjawan Kasisopa b.kasisopa@westernsydney.edu.au*

#### Specialty section:

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

Received: *29 September 2017* Accepted: *31 July 2018* Published: *04 September 2018*

#### Citation:

*Kasisopa B, El-Khoury Antonios L, Jongman A, Sereno JA and Burnham D (2018) Training Children to Perceive Non-native Lexical Tones: Tone Language Background, Bilingualism, and Auditory-Visual Information. Front. Psychol. 9:1508. doi: 10.3389/fpsyg.2018.01508*

This study investigates the role of language background and bilingual status in the perception of foreign lexical tones. Eight groups of participants, consisting of children of 6 and 8 years from one of four language background (tone or non-tone) × bilingual status (monolingual or bilingual) groups (Thai monolingual, English monolingual, English-Thai bilingual, and English-Arabic bilingual), were trained to perceive the four Mandarin lexical tones. Half the children in each of these eight groups were given auditory-only (AO) training and half auditory-visual (AV) training. In each group Mandarin tone identification was tested before and after (pre- and post-) training with both an auditory-only test (ao-test) and an auditory-visual test (av-test). The effect of training on Mandarin tone identification was minimal for 6-year-olds. On the other hand, 8-year-olds, particularly those with tone language experience, showed greater pre- to post-training improvement, and this was best indexed by ao-test trials. Bilingual vs. monolingual background did not facilitate overall improvement due to training, but it did modulate the efficacy of the Training mode: for bilinguals both AO and AV training, and especially AO, resulted in performance gain; but for monolinguals training was most effective with AV stimuli. Again this effect was best indexed by ao-test trials. These results suggest that tone language experience, be it monolingual or bilingual, is a strong predictor of learning unfamiliar tones; that monolinguals learn best from AV training trials and bilinguals from AO training trials; and that there is no metalinguistic advantage due to bilingualism in learning to perceive lexical tones.

Keywords: lexical tone, auditory-visual, speech perception, bilingualism, perceptual attunement

### INTRODUCTION

Like consonants and vowels, lexical tone is subject to perceptual attunement as a product of specific language experience. However, unlike consonants and vowels, lexical tone is not used to distinguish meaning in all the languages of the world. While tone languages comprise 70% of the world's languages (Yip, 2002) and more than 50% of the world's population speak a tone language (Fromkin, 1978), one of the most prevalent world languages, English, and one that has hosted the vast majority of language development studies, is not a tone language. On the other hand, another of the most prevalent languages, Mandarin, is a tone language. This paper concerns training 6- and 8-year-old children to perceive novel lexical tones and whether such training is assisted by visual information for tone, previous tone language experience, and bilingual vs. monolingual experience. The experimental study is prefaced by exposition of the nature of lexical tone, attunement to lexical tone in infancy and in school-aged children, tone perception in monolingual and bilingual populations, and perceptual training methods for children.

### Lexical Tone

Lexical tone is a linguistic device contributing to the semantic realization of words. The main cue for lexical tone is fundamental frequency, perceived as pitch, but lexical tone is also characterized by variations in amplitude, duration and voice quality (Yip, 2002). Tones vary in type, level/static or contour/dynamic, with small or rapid pitch variation over time, respectively (Abramson, 1978). Tone languages also vary in the number of tones: for example, Thai has five tones, three level and two contour, Cantonese has three level and three contour tones, and Mandarin, the tone language to be investigated here, has one level and three contour tones (Yip, 2002). **Figure 1** shows the pitch patterns over time of the four Mandarin tones and the corresponding meanings when spoken on the syllable /ma/. Tone 1 is a High-Level [T55<sup>1</sup> ] tone; and tones 2, 3 and 4 are contour tones identified as a Mid-Rising [T35], Low-Falling-Rising [T214], and High-Falling [T51] (Hallé et al., 2004).
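
As an illustration of the Chao notation used above, the following minimal sketch encodes the four Mandarin tones and classifies each as level or contour. The dictionary layout and the `tone_type` helper are illustrative assumptions rather than part of the study materials; the rule applied (a tone is level when its Chao digits do not change) is a simplification of the level/contour distinction described in the text.

```python
# Chao tone numerals and labels for the four Mandarin tones, as given in the text.
MANDARIN_TONES = {
    1: ("55", "High-Level"),
    2: ("35", "Mid-Rising"),
    3: ("214", "Low-Falling-Rising"),
    4: ("51", "High-Falling"),
}


def tone_type(chao: str) -> str:
    """Classify a Chao numeral as 'level' (digits all equal) or 'contour'.

    Hypothetical helper: a simplification for illustration only.
    """
    return "level" if len(set(chao)) == 1 else "contour"


for number, (chao, label) in MANDARIN_TONES.items():
    print(f"Tone {number} [T{chao}] {label}: {tone_type(chao)}")
```

Under this rule only Tone 1 is classified as level, in line with the description of Mandarin as having one level and three contour tones.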

Tones 2 and 3 are the most difficult to perceive for first language (L1) children and second language (L2) adults. For example, Wong et al. (2005) found that 3-year-old Mandarin speaking children can accurately identify Tones 1, 2, and 4, but often confuse Tone 2 with Tone 3, and Li and Thompson (1977) showed that Mandarin-speaking children acquire Tones 2 and 3 later than Tones 1 and 4. Nevertheless, discrimination of the difficult Tones 2 and 3 improves dramatically after training, whereas the more easily discriminated Tones 1 and 4 are relatively resistant to improvement (Wang et al., 1999; Smith and Burnham, 2012).

### Perceptual Attunement in Infancy

Difficulties in discrimination of non-native language sounds by L2 tone language learners would appear to be related to perceptual attunement in infancy. Newborn infants perceive both native and non-native speech contrasts in a similar manner, but this language-general speech perception becomes more language-specific over infants' first year—there is a decline in discrimination performance for non-native speech contrasts while that for native speech contrasts is maintained or improves (Burnham and Mattock, 2010). Such perceptual attunement is evident between 7 and 11 months for consonants (Werker and Tees, 1984), between 4 and 6 months for vowels (Polka and Werker, 1994), and around the same age for lexical tones.

**Figure 1.** Pitch patterns over time of the four Mandarin tones on the syllable /ma/ meaning "mother" [ma55]; "hemp" [ma35]; "horse" [ma214]; and "scold" [ma51], produced by four female native speakers of Beijing Mandarin dialect.

Mattock and Burnham (2006), testing infants from tone-language and non-tone-language environments, found that English-language infants' discrimination performance declined between 6 and 9 months for Thai tones, but not for violin sounds created to have the same F0 contours as the tones, whereas Chinese infants showed no decline for either. Together with similar studies with English and French infants (Mattock et al., 2008), these results show perceptual attunement for lexical tones that is specific to speech. More recently, Yeung et al. (2013) found distinctly different patterns of Cantonese tone perception at 4 months between Cantonese infants (for whom the tones were native), Mandarin infants (tones non-native), and English non-tone language infants. These results suggest that there is perceptual attunement for lexical tones by at least 4 months of age, even before that for consonants and vowels (but see also Choi et al., 2017).

### Perceptual Attunement in Childhood

While there is perceptual attunement in infancy toward native and away from non-native sounds, non-native sounds are still perceivable and especially so under certain conditions (e.g., Werker and Logan, 1985), otherwise L2 learning would not be possible. Over and above this residual ability to perceive non-native sounds, there is now known to be a second period of perceptual attunement at the onset of reading for both consonants (Burnham et al., 1991; Burnham, 2003; Horlyck et al., 2012) and vowels (Burnham and Torstensson, 1995). Burnham (Burnham et al., 1991; Burnham, 2003) showed an intensified reduction in perceptual discrimination of non-native speech contrasts at 6, but not 4 or 8, years. This is strongest in children with better reading and reading-related ability (Burnham, 2003; Horlyck et al., 2012), and is a function of duration of school experience, rather than maturation per se (Horlyck et al., 2012).

<sup>1</sup>These numbers are Chao tone values which indicate tone height and contour at the start, (middle), and end of the syllable (Chao, 1930).

Burnham suggests that this reduced attention to non-native speech sounds is a response to the onset of reading instruction and that it may assist reading processes, specifically phoneme-to-grapheme mapping. This is consistent with the fact that the reduced attention to non-native speech sounds is ameliorated by 8 years, an age at which reading usually becomes more automatic (Burnham, 2003).

No studies have yet investigated whether this second period of attunement also occurs for lexical tones. To that end, this study will test primary (elementary) school children in First Grade (6–7 years old) and Third Grade (8–9 years old) for their perception of non-native lexical tones.

### Tone Perception and Training: Auditory and Auditory-Visual Information

#### Auditory Training of Tone Perception

Auditory training improves non-native tone perception in adult populations. For example, Wang et al. (1999) tested American-English adults in an auditory training regime for Mandarin tones involving learning the six possible pairings of the four Mandarin tones in eight sessions spaced over 2 weeks. There was a 21% increase in identification accuracy from pre-training to post-training test, an improvement which generalized to the perception of new stimuli by new speakers and which was maintained at 25% in a 6-month retention test (Wang et al., 1999). One of the strengths of the training was thought to be the use of a variety of monosyllabic Mandarin words and a variety of speakers. These will be implemented in the current study as well.

In another study, tone language Mandarin Chinese and non-tonal English speakers were trained to perceive Cantonese Chinese lexical tones (Francis et al., 2008). Both groups showed a similar initial performance and significant improvement in identification following training. However, English and Mandarin Chinese participants found particular tones difficult and others easier to identify, with English listeners improving significantly on the low-rising (23) and low-level (22) tones while Mandarin listeners showed significant improvement only on the low-falling (21) tone (Francis et al., 2008). While these results augur well for training non-tone language speakers to learn to perceive lexical tones, other studies have not been as successful with non-tonal speakers. Wayland and Guion (2003) investigated native English and native Chinese speakers' identification and discrimination of Thai tones. A significantly greater improvement from pre-training to post-training test was observed in the native Chinese group than in the native English group, whose performance even declined over time. However, English speakers with some experience with Thai showed greater improvement in the perception of Thai tones compared to English speakers with no experience. Therefore, it can be concluded that previous lexical tone experience in a tone system, be it either as an L1 or an L2, may transfer to the perception of tones in a different tone system, at least for adult learners (Wayland and Guion, 2003, 2004). In the study reported here, such prior tone language experience was investigated in children with respect to transfer to learning tones in a new unfamiliar tone system.

#### Auditory-Visual Tone Perception and Training

It is now well established that speech perception is multimodal, particularly with respect to auditory and visual information and particularly for consonants and vowels (McGurk and McDonald, 1976; Campbell et al., 1998; Vatiktiotis-Bateson et al., 2000). Evidence for auditory-visual perception of tone has been later in emerging. In two preliminary studies, Burnham found native Cantonese adults' identification of native tones presented in a Visual-Only (VO) mode was significantly better than chance for tones in running speech (but not words in isolation), for tones on monophthongal (but not diphthongal) vowels, and for contour (but not level) tones (Burnham et al., 2001a). In addition, both non-native Thai listeners and non-tonal Australian English adults were shown to make use of (presumably language-general) visual information in their discrimination of Cantonese tones (Burnham et al., 2001b).

Further studies have shown ubiquitous visual augmentation of tone perception in auditory-visual over auditory-only presentations (Mixdorff et al., 2005a,b,c; Smith and Burnham, 2012; Burnham et al., 2014). For instance, Burnham et al. (2014), investigating the perception of Thai tones in noise, found better tone perception in AV than AO conditions irrespective of language background: visual augmentation was equivalent in tone language (Thai, Cantonese, Mandarin), pitch-accent (Swedish), and non-tone language (English) adults. Interestingly, Burnham et al. (2014) also found that non-tone-language English adults were much better than tone language or pitch-accent language adults in perceiving tone in VO situations (see also Smith and Burnham, 2012), presumably because those with no tone language experience use all available (e.g., visual) information for perceiving tones, while tone and pitch-accent language adults are accustomed to relying upon the perceptually more salient auditory information for tone.

#### Auditory-Visual Speech Perception in Children

Auditory-visual speech perception, at least for consonants, is evident early in development, even in infancy (Rosenblum et al., 1997; Burnham and Dodd, 2004; Desjardins and Werker, 2004). For example, 4½-month-old infants perceive the McGurk effect (auditory [ba] dubbed onto visual [ga]) as "da" or "tha" significantly more often than as "ba" (Burnham and Dodd, 2004). Nevertheless, there is further development of auditory-visual speech perception across childhood. In the original McGurk effect report (McGurk and McDonald, 1976) adults reported the auditory-visual fusion more than did children of 7 to 8 and 3 to 5 years. Subsequent studies have shown this reduced visual influence in children to be robust. There is more use of visual information by adults than by 4–6-year-old children (Massaro, 1984; Massaro et al., 1986), and there is a monotonic increase in visual speech perception across childhood from 5-, 7-, 9-, and 11-year-olds to adults (Hockley and Polka, 1994). This developmental increase is possibly related to articulation experience. Desjardins et al. (1997) showed that preschool children who make substitution errors in articulation are less influenced by visual cues than are children who can correctly produce consonants. In addition, between 6 and 8 years, the same ages as those tested in the study to be reported here, there is a large increase in the incidence of the McGurk effect in English-language (but not Japanese) children (Sekiyama and Burnham, 2008), which appears to be related to the onset of reading instruction (Erdener and Burnham, 2013).

### Tone Perception and Training in Children

On the basis of the above studies, we may expect some effect of visual information on speech perception in school-age children. However, while there is ample evidence for this with consonants, there are, as yet, no studies on children's auditory-visual perception of tone or on auditory-visual training of tone perception in children. Nevertheless, there are studies on the auditory training of tone perception in children (Wang and Kuhl, 2003; Sereno and Maniwa, 2006; Sereno, 2017). Wang and Kuhl (2003) trained monolingual American English 6-, 10-, and 14-year-old children and young adults to perceive Mandarin tones. Music, pictures, and sound effects were presented in the training program to engage the children and they also received rewards during the training. Training included six sessions spaced over 2 weeks with six different speakers of Mandarin Chinese. Accuracy in perceiving the Mandarin tones significantly improved in all age groups, but much more markedly in the young adults. It was suggested that the factors influencing lower performance among younger participants may be cognitive maturity, resulting in difficulty in completing tasks, as well as experience with language in general. This study showed that six training sessions were effective and sufficient to improve tone perception, at least in the older participants. Only auditory-alone training was used; no facial speech information was presented. In the study presented here, both auditory-only and auditory-visual training will be included with the expectation that auditory-visual training could enhance children's learning of non-native tones.

### Lexical Tone Perception in Bilinguals

A bilingual person is one who displays language abilities in two languages that they use frequently in many aspects of their daily lives (Grosjean, 2010). Bilingualism plays various roles in children's language development. One key advantage is heightened metalinguistic awareness, which relates to understanding the elements that make up language, including rules and patterns (Campbell and Sais, 1995; Jensen, 2008). While bilingual children may perform less well than their monolingual peers on linguistic tasks, they invariably do better on executive control tasks (Friesen and Bialystok, 2012). Whether this then results in better ability to learn lexical tones is unknown, as there is no information thus far on any bilingual advantage for children learning lexical tones. Most lexical tone studies have been conducted with monolingual populations, and have shown that tone language experience facilitates non-native tone perception. One exception is a study by Singh and Foong (2012), who investigated the age at which Chinese-English bilingual infants are able to recognize and distinguish between non-phonemic and phonemic pitch and lexical tone contrasts in each language. In a word matching task, 11-month-old (but not 7.5- or 9-month-old) Chinese-English bilingual infants correctly recognized words whether they were pitch-matched or pitch-mismatched in English, but only correctly recognized words when they were pitch(tone)-matched in Mandarin. Thus, the perceptual attunement found for lexical tones early in infancy around 4 months appears to develop further in tone/non-tone language bilinguals such that by 11 months there is selective attunement depending on the language context.

The study reported here is the first to focus on the intricacies of training non-native lexical tone perception to monolingual vs. bilingual children with or without tone language experience. Four groups of primary (elementary) school students in two age groups, First grade (6–7 year-olds) and Third grade (8–9 year-olds), were trained (using either Auditory-Only or Auditory-Visual stimuli) to perceive non-native, Mandarin, lexical tone contrasts. Two of the four groups were bilingual: one bilingual group with two non-tonal language backgrounds, English and Arabic (**Bi-Eng/Arabic**); the other bilingual group with one non-tone, English, and one tone, Thai, language background (**Bi-Eng/Thai**). In addition, there were two groups of monolingual children – one non-tonal, English (**Mono-Eng**) and one tonal (**Mono-Thai**).

### THE EXPERIMENT: TRAINING NON-NATIVE LISTENERS TO PERCEIVE MANDARIN TONES

In this study, children were trained to perceive the four Mandarin tones using Auditory-Only (AO) or Auditory-Visual (AV) computer-based 4-alternative forced-choice identification tasks across six training sessions.

Since tone language experience has been shown to facilitate lexical tone perception, it is expected that children with tone language experience would be better able to perceive foreign lexical tones, and those with a non-tone language background will have difficulty perceiving tones. Thus perception accuracy and improvement over training on Mandarin tones is expected to be better for children with tone language experience (Bi-Eng/Thai and Mono-Thai groups) than those without (Bi-Eng/Arabic and Mono-Eng groups).

Moreover, it is expected that bilinguals (Bi-Eng/Thai and Bi-Eng/Arabic) should show better performance than monolinguals (Mono-Thai and Mono-Eng) due to greater metalinguistic awareness that comes with the ability to attend to and transfer across languages.

In addition, as there has been found to be visual augmentation of auditory tone perception in adults, it is expected that groups of children given Auditory-Visual training will perform better than those given Auditory-Only training, although this is proposed tentatively, as visual perception of tone has not yet been studied in children.

Finally, while there are only two years between the two age groups here, 6 and 8 years, it is possible that the reduced ability to perceive non-native speech contrasts in the second period of perceptual attunement around reading onset may affect the younger, 6-year-old, more than the older, 8-year-old, children.

As there has been found to be a relation between children's speech perception and reading and reading-related abilities (Burnham, 2003; Horlyck et al., 2012) and between children's phonological and tonological awareness and their reading ability (Burnham et al., 2011), we also included tests of English language phonological awareness (phoneme deletion and word and non-word reading ability) for the three groups with English as one of their languages (Bi-Eng/Arabic, Bi-Eng/Thai, Mono-Eng). It was expected that there would be a stronger relationship between phonological awareness and tone perception for the 8-year-olds than the 6-year-olds (given that at 6 years there is an intensification of perceptual attunement; Burnham, 2003) and possibly a stronger relationship between phonological awareness and tone perception for bilingual than monolingual children due to the former's greater metalinguistic awareness (Campbell and Sais, 1995; Jensen, 2008).

### METHODS

### Participants

A sample of 81 primary school students participated in this study. The children were either bilingual (Bi-Eng/Thai or Bi-Eng/Arabic) or monolingual (Mono-Thai or Mono-Eng) and they had either a non-tone (Bi-Eng/Arabic or Mono-Eng) or tone (Bi-Eng/Thai or Mono-Thai) language background. Within each language group, there were two age groups, First Grade 6 to 7 years [6yo] and Third Grade 8–9 years [8yo], and within each language × age sub-group children were randomly assigned to either an Auditory-Only (AO) or an Auditory-Visual (AV) training group, prior to and irrespective of their scores on the Pre-training tests.

All participants' parents reported their children had normal hearing in both ears. Numbers, ages and distribution of the participants in each of the four language groups are as follows:


Children's parents or guardians gave informed consent to participate in the experiment. For the bilingual groups, the Bi-Eng/Arabic children were recruited from St. Charbel's College and Al Jabal Karm El Mohr community, and the Bi-Eng/Thai group from Buddharangsee Thai Community Language School, both in Sydney, Australia. All participants' parents or guardians received an AUD50 gift voucher as reimbursement for travel expenses. For the monolingual groups, the Mono-Eng participants were recruited from Deerfield Elementary School, in Lawrence, Kansas, USA, where participants' parents or guardians received USD60 as reimbursement. The Mono-Thai children were recruited from Wat Baan Maa School in Ayutthaya, Thailand. The participants' parents or guardians received a gift voucher worth THB500 as travel expense reimbursement. All children in the study received small gifts and a certificate of participation at the end of each session. The criteria for bilingualism in this study for the two bilingual groups were (a) at least one parent was a native speaker of Thai or Arabic, depending on the group, and spoke to their child in that language on a daily basis; (b) the children systematically learned both languages either at their normal school or at a language school; and (c) the parents reported that their children used each language on a daily basis. Parents or guardians also filled out a questionnaire regarding their child's language and musical training background. Only one child in the Bi-Eng/Arabic and three in the Mono-Eng groups had received musical training, while about 50% of the Bi-Eng/Thai group were engaged in Thai musical training at the time of testing as a requirement of the Thai language Saturday school curriculum. The study was conducted under the Western Sydney University and the University of Kansas Human Research Ethics Committees' approval in accordance with the Code of Ethics of the World Medical Association (Declaration of Helsinki).

### Design

The study employed a mixed, 2 × 2 × 2 × 2 × (2 × 2) factorial design: 2 Language Background groups (bilingual or monolingual) × 2 Tone Language Experience groups (tonal or non-tonal) × 2 Age groups (6yo or 8yo) × 2 Training Modes (Auditory-Only [AO] or Auditory-Visual [AV]) as between-subject factors, and 2 Test Types (auditory-only [ao] or auditory-visual [av]) × 2 Test Phases (Pre-/Post-training) as within-subject factors. The main dependent variable for accuracy was the percentage of correct tones identified. Participants in each Language × Age group were randomly assigned to either the AO or AV Training in equal numbers for each group. The order of the Test trials (i.e., ao and av) was counterbalanced between participants within groups. In addition, the three groups who had English as (one of) their language(s) were also given phonological awareness tests, including a phoneme deletion and a reading ability (words and non-words) test.
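
To make the structure of the design concrete, the sketch below enumerates its cells. The factor names, level labels, and variable names are illustrative assumptions chosen to mirror the description above; nothing here reproduces the authors' analysis scripts.

```python
from itertools import product

# Between-subject factors (each child belongs to exactly one combination).
BETWEEN = {
    "language_background": ("bilingual", "monolingual"),
    "tone_experience": ("tonal", "non-tonal"),
    "age": ("6yo", "8yo"),
    "training_mode": ("AO", "AV"),
}

# Within-subject factors (each child contributes a score in every combination).
WITHIN = {
    "test_type": ("ao", "av"),
    "test_phase": ("pre", "post"),
}

# The four language groups named in the text, keyed by the first two factors.
GROUP_LABELS = {
    ("bilingual", "tonal"): "Bi-Eng/Thai",
    ("bilingual", "non-tonal"): "Bi-Eng/Arabic",
    ("monolingual", "tonal"): "Mono-Thai",
    ("monolingual", "non-tonal"): "Mono-Eng",
}

between_cells = list(product(*BETWEEN.values()))
within_cells = list(product(*WITHIN.values()))

print(f"{len(between_cells)} between-subject cells")
print(f"{len(within_cells)} scores per child")
for background, tone, age, mode in between_cells:
    print(GROUP_LABELS[(background, tone)], age, mode)
```

The 16 between-subject cells correspond to the eight language × age groups, each split into AO and AV training, and each child contributes four test scores (ao and av, pre and post).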

### Perception Test Stimuli

Stimuli used in the experiment were auditory-visual recordings of monosyllabic Mandarin syllables from six native speakers of Beijing Mandarin Chinese (4 females and 2 males). One female speaker (Speaker F3) provided stimulus items for the introductory session and the practice test. Another female speaker (Speaker F4) provided stimulus items for the Pre- and Post-training Test Phase stimuli. Four other speakers, two female (F1, F2) and two male (M1, M2) provided stimulus items for the training sessions.

The recordings were conducted in a soundproof booth in the Face and Voice Lab at MARCS Institute for Brain, Behavior and Development, Western Sydney University. The speakers were asked to produce target syllables via an AKG lapel microphone connected to a Sony HD video camera. The recording sessions were recorded via the Adobe Premiere Pro CS5 program (www.adobe.com). Sound was sampled at a frequency of 48,000 Hz. Target syllables were presented stationary in the center of a computer screen one at a time. Two separate files were then created from the same recording session for each stimulus, one for the Auditory-Only (AO) training and another for the Auditory-Visual (AV) training condition. Note that for the AO condition, a still image of the speaker was presented.

A total of 12 stimuli were created for the introductory session. These comprised Speaker F3 productions of 3 syllables (/ma/, /ni/, /ga/) with the four Mandarin tones (T55, T35, T214 and T51). Another set of 24 stimuli were created for the practice session. These comprised Speaker F3 productions of 3 syllables (/ma/, /na/, /p<sup>h</sup>a/), spoken with each of the four Mandarin tones, as well as one repetition of the resultant 12 trials.

A total of 96 stimuli (all words), 24 for each of the four Mandarin tones, were created for the pre-training and post-training tests. These comprised Speaker F4 producing 24 syllables on each of the four Mandarin tones. The stimuli were identical in the pre-training and post-training tests but were re-randomized in each test. These 24 syllables were [Pinyin: can, chu, chuang, di, fa, gu, guo, han, hou, lang, nao, pai, peng, qian, qiao, qie, qu, shao, tui, xiang, xing, xue, yu, zuo].

A total of 144 stimuli (words and non-words), not present in the pre-training and post-training tests or practice, were created for the six training sessions (see **Table A** in the Supplementary Material for all training stimuli). For each session, 24 stimuli were presented by each of the four speakers (F1, F2, M1, and M2). Thus each training session included 96 trials, and stimuli were randomized in each training session.
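
The stimulus counts reported in this section can be checked with a few lines of arithmetic. The sketch below is illustrative only; the variable names are assumptions, and the reading that each training session used a distinct set of 24 of the 144 training items is one interpretation consistent with the reported totals rather than something stated explicitly.

```python
N_TONES = 4

# Introductory session: 3 syllables x 4 tones.
intro_stimuli = 3 * N_TONES

# Practice session: 3 syllables x 4 tones, plus one repetition of those 12 trials.
practice_stimuli = 3 * N_TONES * 2

# Pre- and post-training tests: the 24 Pinyin syllables listed in the text x 4 tones.
TEST_SYLLABLES = [
    "can", "chu", "chuang", "di", "fa", "gu", "guo", "han",
    "hou", "lang", "nao", "pai", "peng", "qian", "qiao", "qie",
    "qu", "shao", "tui", "xiang", "xing", "xue", "yu", "zuo",
]
test_stimuli = len(TEST_SYLLABLES) * N_TONES

# Training: 144 items over six sessions (assumed: a distinct set per session),
# each item presented by the four training speakers.
TRAINING_SPEAKERS = ("F1", "F2", "M1", "M2")
training_items = 144
n_sessions = 6
items_per_session = training_items // n_sessions
trials_per_session = items_per_session * len(TRAINING_SPEAKERS)

print(intro_stimuli, practice_stimuli, test_stimuli)
print(items_per_session, trials_per_session)
```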

### Phonological Awareness Tests

Three phonological awareness tests were used. A Phoneme Deletion test was adapted from Tyler and Burnham (2006) and McDougall et al. (1994) and consisted of three practice trials and 18 test trials. Children were asked to pronounce a word omitting a particular sound, e.g., "say 'train' without the 't' sound." Each correct response scored 1 and the dependent variable was the proportion of correct phoneme deletions. The task was presented via audio files in Microsoft PowerPoint on the laptops used for the perception test. The other two tests were the "Sight Word Efficiency" test, consisting of 108 words, and the "Phonemic Decoding Efficiency" test, consisting of 66 phonotactically legal (in English) non-words, both from the Test of Word Reading Efficiency-Second Edition (TOWRE-2, Form A). The child was first given eight practice words or non-words to read aloud. Then, in each test, they were asked to read aloud as many words or non-words as possible in 45 s. The dependent variable was the number of words or non-words read accurately in 45 s.

### Procedure

The perceptual identification tests were run on DMDX software (Forster and Forster, 2003) in three different locations with up to four laptops running simultaneously in each location: Bi-Eng/Arabic and Bi-Eng/Thai in Australia, Mono-Eng in the USA, and Mono-Thai in Thailand. All audio was played through high-quality headphones connected to USB audio capture sound cards, interfaced with the laptops. The sound level was initially set at 65 dB HL for each participant, but adjusted, as required, to a comfortable listening level after the practice trials. An identification task was used. On each trial a single word was played through the headphones and children were instructed to identify which tone was played by responding on a USB-connected button box with four colored buttons, red, orange, yellow and green, representing the four Mandarin tones: red for Tone 55, orange for Tone 35, yellow for Tone 214, and green for Tone 51. Participants were instructed to pay attention to both the auditory and visual aspects of the presentations in all sessions, but there were no specific instructions given about what specific cues to attend to. Information sheets, consent forms, and language questionnaires were distributed to parents or guardians of the children prior to the first session and were collected from parents at the first session.
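
The button-to-tone mapping and the percentage-correct measure can be sketched as follows. This is a minimal illustration, not the DMDX scripts used in the study: the function name and the trial format are assumptions, as is the choice to count a missing response as incorrect.

```python
# Button colors and the Chao values of the tones they represent, as in the text.
BUTTON_TO_TONE = {
    "red": "55",
    "orange": "35",
    "yellow": "214",
    "green": "51",
}

# With four response alternatives, guessing yields 25% correct on average.
CHANCE_PERCENT = 100 / len(BUTTON_TO_TONE)


def percent_correct(trials):
    """Percentage of trials on which the pressed button matches the target tone.

    `trials` is a sequence of (target_tone, button) pairs; `button` is None
    when no response was made (assumed here to count as incorrect).
    """
    trials = list(trials)
    if not trials:
        raise ValueError("no trials to score")
    n_correct = sum(
        1 for target, button in trials if BUTTON_TO_TONE.get(button) == target
    )
    return 100 * n_correct / len(trials)


example = [("55", "red"), ("35", "green"), ("214", "yellow"), ("51", None)]
print(percent_correct(example), CHANCE_PERCENT)
```

Expressing chance level explicitly is useful when reading the raw accuracy scores, since a 4-alternative task sets the guessing baseline at 25%.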

Training in the AO and AV training mode groups was exactly the same except that in AV training words were presented auditory-visually via the articulating speaker's face, and in AO training words were presented auditorily with the dynamic video turned off and a static image of the speaker's face on the screen. The tone training program consisted of six sessions. The first was an introductory session to allow participants to become familiarized with the task, and included pre-training tests (both ao and av) for all participants irrespective of training (AO or AV) mode; a practice session (with AO or AV training depending on the training group); and the first training session (AO or AV). The second to fifth sessions consisted of only a training session (AO or AV, depending on group) and the sixth session consisted of AO or AV training plus post-training tests (ao and av tests for all participants irrespective of training group). Feedback ("Good Job!!" for correct responses; "Sorry!!" for incorrect responses, and "Sorry, Press Faster!!" for missing responses) was given in practice and in AO or AV training sessions but not in the pre-training or post-training ao and av tests. In all training sessions, participants were also rewarded with a short cartoon clip every time they made three consecutive correct responses. The six sessions were scheduled at two sessions per week over 3 weeks. In the third session, the three groups with English as one of their languages (Mono-Eng, Bi-Eng/Arabic, and Bi-Eng/Thai) also completed the three phonological awareness tasks: Phoneme Deletion, Word reading, and Non-word reading.
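
The trial-by-trial feedback and reward rule described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that the count of consecutive correct responses restarts after each cartoon reward; the text does not specify whether the count restarts or slides.

```python
def feedback(target, response):
    """Return the feedback message for one training trial.

    `response` is the tone chosen by the child, or None if no response was made.
    """
    if response is None:
        return "Sorry, Press Faster!!"
    return "Good Job!!" if response == target else "Sorry!!"


def run_feedback(trials, streak_for_reward=3):
    """Return a (message, reward_shown) pair for each (target, response) trial.

    Assumption: the streak counter resets to zero once a reward is shown.
    """
    streak = 0
    results = []
    for target, response in trials:
        message = feedback(target, response)
        streak = streak + 1 if message == "Good Job!!" else 0
        reward = streak == streak_for_reward
        if reward:
            streak = 0
        results.append((message, reward))
    return results


session = [("55", "55"), ("35", "35"), ("214", "214"), ("51", None), ("55", "35")]
for message, reward in run_feedback(session):
    print(message, "+ cartoon clip" if reward else "")
```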

Testing and training were conducted in quiet classrooms at each school; or at MARCS BabyLab, Western Sydney University Bankstown Campus; or at a public library near the participant's home. A maximum of four children were tested at one time, using four identical laptops and associated hardware and software.

### RESULTS

The results are presented in three parts: (a) a comparison of total raw accuracy in pre-training vs. post-training tests, irrespective of Training Mode (AO/AV) and test mode (ao/av) in an Age (6 vs. 8 years) × Language Background (Bilingual vs. Monolingual) × Tone Language Experience (Tonal vs. Non-tonal) × (Mean Test Score – Pre-/Post-Training) design with repeated measures on Pre- vs. Post-Training scores; (b) an analysis of a percentage gain due to training dependent variable derived from the Pre- and Post-Test scores (see formula below) in an Age × Language Background × Tone Language Experience × Training Mode (AO/AV) × (ao vs. av Tests) design, with repeated measures on ao vs. av tests; and (c) a set of correlations between the phoneme deletion and word and non-word reading ability tests and the pre- and post-tests and gain due to training for the three English-speaking groups (Mono-Eng, Bi-Eng/Arabic, Bi-Eng/Thai) for whom data on the phonological awareness and reading tests were collected.

### Raw Accuracy

Raw percentage correct data were first analyzed to show the absolute level of performance Post-Training compared to Pre-Training as a product of the group factors. A 2 × 2 × 2 × (2) Analysis of Variance (ANOVA) was conducted with Age, Language Background and Tone Experience as between-subject factors and Phase (Pre- vs. Post-training test) as the within-subject factor. All factors have two levels so no planned contrasts were required. Alpha was set at 0.05 and the effect sizes are given for significant differences (critical F = 3.898).

The results are graphically presented in **Figure 2**. As can be seen, there was a general improvement from pre-training to post-training and this Phase main effect was significant, F(1,65) = 7.61, p < 0.01, ηp² = 0.077, with Post-Training Mean = 28.43, and SD = 0.09, and Pre-Training Mean = 25.76, and SD = 0.06.

As can be seen in **Figure 2**, there was Pre- to Post-Training improvement for three of the four 6-year-old groups, and all of the four 8-year-old groups, with other interactions also apparent. Accordingly, while the Phase main effect was unaffected by Language Background, Monolingual vs. Bilingual, it was qualified by Age and Tone Experience: there was a Phase × Age, F(1,65) = 9.90, p < 0.005, ηp² = 0.395, and a Phase × Age × Tone Experience, F(1,65) = 15.40, p < 0.001, ηp² = 0.505, interaction. As can be seen in **Figure 2**, these interactions are due to (i) greater improvement from pre- to post-training by 8-year-olds than by 6-year-olds, and (ii) especially greater improvement for 8-year-olds with Tone Language experience, irrespective of whether the tonal experience is in a monolingual or bilingual context.

The decrease in performance from pre- to post-training by the monolingual tone language (Thai) 6-year-olds is puzzling. These children had just begun instruction in reading and writing at school, including learning the orthographic representation of Thai tones, a regular but complicated 4-way interaction of initial consonant class, final consonant manner, vowel length, and tone diacritics (Kasisopa et al., 2013, 2016; see Davis et al., 2015). It is possible that these as yet non-automatic, controlled processes involved in learning the orthographic representation of Thai tones, coupled with intensive training on foreign (Mandarin) tones, resulted in overload and confusion at the perceptual level through interference from L1 phoneme-to-grapheme/grapheme-to-phoneme levels. This explanation is clearly speculative and requires further research.

### Performance Gain

While the above analysis shows effects of training on tone perception, it may be noted that many of the Pre- and some of the Post-test scores hover around chance level (25%, given there are 4 Mandarin tones). This raises the issue of the degree of improvement given the initial level of performance and the equivalence of improvements from an initial level of chance responding vs. a higher level of initial responding. To accommodate such differences a dependent variable was derived as follows:

$$\text{Performance Gain} = \frac{\text{Posttest \% correct} - \text{Pretest \% correct}}{\text{Pretest \% correct}} \times 100\%$$

Thus if a child had 20% correct on Pre-Training and 30% on Post-Training—the Performance Gain would be 50%; or if there was the same absolute increase of 10% from 50% on Pre-training to 60% on Post-training, the percentage improvement would be 20%. Thus this measure takes into account the initial level of performance in the pre-training test and represents the percent improvement in relation to that level. Mean and Standard Error Performance Gain for each of the four Language Background × Bilingual Status groups are shown for AO/AV training groups in ao and av tests in **Figure 3** as well as in **Table 1** alongside the number and percentage of participants who showed pre- to post-training improvement in each group (see also **Table B** in the Supplementary Material for individual Performance Gain scores for each participant).
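The two worked cases above can be reproduced with a short Python sketch of the Performance Gain formula. The function name and the guard against a zero pre-training score are illustrative assumptions; the source does not say how a score of 0% at pre-training, for which the ratio is undefined, would be handled.

```python
def performance_gain(pre_pct: float, post_pct: float) -> float:
    """Percentage improvement relative to the pre-training score.

    Both arguments are percent-correct scores (0 to 100).
    """
    if pre_pct <= 0:
        # Assumption: the ratio is undefined when the baseline is zero.
        raise ValueError("pre-training score must be greater than zero")
    return (post_pct - pre_pct) / pre_pct * 100.0

# The same absolute increase of 10 percentage points gives different gains
# depending on the starting level.
print(round(performance_gain(20, 30), 1))  # child starting at 20% correct
print(round(performance_gain(50, 60), 1))  # child starting at 50% correct
print(round(performance_gain(30, 25), 1))  # a decrease yields a negative gain
```

Because the measure is relative to baseline, scores that start near the 25% chance level can produce large positive or negative gains from small absolute changes.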

Performance Gain scores were analyzed in an Age × Language Background × Tone Experience × Training Mode between-subject factor × Test Type (ao or av) within-subject factor ANOVA. The only significant main effect was for Age, F(1,65) = 8.09, p < 0.01, ηp² = 0.111. Age also interacted with two other factors: there was an interaction of Age × Tone, F(1,65) = 15.19, p < 0.001, ηp² = 0.189, and of Age × Tone × ao/av test, F(1,65) = 9.54, p < 0.01, ηp² = 0.128. This set of results is represented in **Figure 4**. As can be seen, 8-year-olds showed more Performance Gain than 6-year-olds. There was more Performance Gain for Tone language than Non-Tone language background children, but this was only evident in the 8-year-olds. Finally, while the Tone > Non-Tone advantage for 8-year-olds was evident in both ao and av tests, Performance Gain was greater when indexed in ao tests.

FIGURE 2 | Mean percentage correct on Pre-training and Post-training test trials as a function of Age (6yo vs. 8yo), Language Background (Bilingual vs. Monolingual) and Tone Experience (Non-Tone vs. Tone). Error bars represent standard errors.

FIGURE 3 | Mean percentage of Pre- to Post-training improvement gain by each of the four language groups at 6 and 8 years, with AO or AV Training and ao and av Test Types. Error bars represent standard errors.

The above Age and Tone Language background results are independent of whether the children were monolingual or bilingual and whether they were trained with AO or with AV stimuli. Turning to Training Mode and Monolingual/Bilingual, the Training Mode × Monolingual/Bilingual interaction and the Training Mode × Monolingual/Bilingual × ao/av interaction were both very close to significance: Training Mode × Monolingual/Bilingual, F(1,65) = 3.91, p > 0.05, ηp² = 0.057; Training Mode × Monolingual/Bilingual × ao/av tests, F = 3.76, p > 0.05, ηp² = 0.055 (critical F = 3.98). Given these close-to-significant interactions and the significant interaction of ao vs. av tests with Age and Tone Language in the results above, and in order to avoid a Type II error in this first test of the effect of training mode on lexical tone perception, these two results approaching significance were followed up in simple effect tests of Training Mode × Monolingual/Bilingual at each level of test type, ao tests and av tests. These revealed a non-significant Training Mode × Monolingual/Bilingual interaction for av tests, F(1,65) = 0.70, p > 0.1, ηp² = 0.011, but a significant Training Mode × Monolingual/Bilingual interaction for ao tests, F(1,65) = 5.19, p < 0.03, ηp² = 0.074. This set of results is represented in **Figure 5**. As can be seen, Bilingual participants show greater Performance Gain after training with AO stimuli, whereas Monolingual participants show greater Performance Gain after training with AV stimuli, and this is especially the case when indexed by ao test trials.
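For readers who wish to relate the F ratios in the Performance Gain analyses to the reported effect sizes, one standard identity is ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the F(1,65) values quoted in this section; whether the authors computed ηp² this way or took it directly from their statistics package is not stated, so the function is an illustration rather than a description of their procedure.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# F ratios reported in the Performance Gain analyses, all with df = (1, 65).
reported_f = {
    "Age": 8.09,
    "Age x Tone": 15.19,
    "Age x Tone x ao/av test": 9.54,
    "Training Mode x Mono/Bilingual (ao tests)": 5.19,
}
for effect, f_value in reported_f.items():
    print(effect, round(partial_eta_squared(f_value, 1, 65), 3))
```

Rounded to three decimals, these computations agree with the ηp² values given in the text for the same effects (0.111, 0.189, 0.128, and 0.074).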

### Correlations With Phonological Awareness and Reading

Correlations, with age partialed out, between the phoneme deletion, word and non-word reading tests with the pretraining, post-training, and performance gain due to training were conducted for the three groups with English as one of their languages (Mono-Eng, Bi-Eng/Arabic, Bi-Eng/Thai).
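As an illustration of what partialing out age involves, a first-order partial correlation can be written as r_xy.z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)), where z is the controlled variable. The sketch below implements this textbook formula; the scores are invented for illustration and are not data from the study, and the authors' own software and procedure are not described in the text.

```python
import math
from typing import Sequence

def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y with z held constant (first-order partial)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical scores for six children (illustration only).
age = [6.1, 6.4, 6.8, 7.9, 8.2, 8.5]
phoneme_deletion = [8, 11, 10, 15, 17, 16]
pre_av_accuracy = [22, 27, 24, 30, 29, 33]

r = partial_r(
    pearson_r(phoneme_deletion, pre_av_accuracy),
    pearson_r(phoneme_deletion, age),
    pearson_r(pre_av_accuracy, age),
)
print(f"partial r (age controlled) = {r:.2f}")
```

When the controlled variable is unrelated to both measures, the partial correlation equals the ordinary correlation; the more both measures depend on age, the more the partial value departs from it.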

There were, not surprisingly, correlations between the three language measures: phoneme deletion and word reading, r(62) = 0.50, p < 0.001; phoneme deletion and non-word reading, r(62) = 0.59, p < 0.001; and word and non-word reading, r(62) = 0.75, p < 0.001.

More important are correlations between any of the three language measures and the tone training scores. The only significant correlation of this nature was between phoneme deletion and the pre-training av-test, r(62) = 0.26, p < 0.05. This indicates that children's phonological awareness, in this case their proficiency on a phoneme deletion task, is positively related to their initial identification of the four Mandarin tones presented in auditory-visual mode.
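The significance of a correlation such as r(62) = 0.26 can be checked by converting it to a t statistic, t = r·√(df / (1 − r²)), and comparing the result with the two-tailed critical value. The sketch below does this; the critical value of roughly 2.00 for about 60 degrees of freedom at α = 0.05 is quoted from standard t tables as an approximation, and the function name is illustrative.

```python
import math

def t_from_r(r: float, df: int) -> float:
    """t statistic for testing a correlation coefficient against zero."""
    return r * math.sqrt(df / (1 - r ** 2))

# Assumption: approximate two-tailed critical t for about 60 df at alpha = 0.05.
APPROX_CRITICAL_T = 2.00

t_value = t_from_r(0.26, 62)
print(f"t(62) = {t_value:.2f}, exceeds critical value: {t_value > APPROX_CRITICAL_T}")
```

The resulting t of about 2.12 lies just above the approximate critical value, which is in line with the reported p < 0.05.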

## DISCUSSION

### Summary of Results

This study examined the role of tone vs. non-tone language experience, monolingualism vs. bilingualism, and auditory-only vs. auditory-visual training of foreign lexical tone contrasts. The results are summarized under four headings below, followed by discussion of the results.

#### Training—Effects of Age and Language Factors

Training was effective: there was a general improvement in performance from pre- to post-training. Training was most effective for 8-year-olds; 6-year-olds showed only limited effects of training. Training was more effective if children had tone language experience, an advantage evident in the 8- but not the 6-year-olds. These effects of age and tone language on training were most clearly indexed by the ao rather than the av pre- and post-training tests.

#### Language Background and Training Mode

There was a differential effect for the type of training: Monolingual children improved markedly with AV training but not at all with AO training, whereas Bilingual children improved markedly with AO training and to a lesser extent with AV training. However, these effects were only apparent when indexed by ao tests.

#### Correlation With Language Measures

For children with English as their only, or as one of their, language(s), proficiency on a phoneme deletion task was positively related to Mandarin tone identification in auditory-visual pre-training test trials. As this was before training began, it shows that those children good at manipulating phonemes in (one of) their native language(s) were also good at perceiving what were completely novel phonological elements for the Mono-Eng, the Bi-Eng/Arabic and the Bi-Eng/Thai groups. This advantage did not extend to training: there was no advantage for good phoneme deleters in learning about foreign tones, just in their initial perception of foreign tones.

The results bear on a number of issues which are discussed below ahead of a discussion of limitations and suggestions for future research.

#### **Age**

There are two possible reasons why training was more effective with the older 8-year-olds than the younger 6-year-olds: task difficulty, and reduced sensitivity to foreign sounds. First, it may be that the task employed here was demanding in terms of the degree of sustained attention required. For example, while in the procedure used here the pre- and post-training trials were the same, the training trials incorporated variation of both speakers and words. Wang et al. (1999) trained adults on a variety of monosyllabic Mandarin words spoken by a variety of speakers and found especially resilient learning. In an adaptation for children, Wang and Kuhl (2003) also found a high degree of learning. However, over their six training sessions they graded the difficulty of the tasks (2 weeks ABX, 2 weeks 2AFC identification, then 2AFC with speaker variation) and within each pair of sessions they trained easier tone pairs first. While the Wang and Kuhl (2003) study and the study reported here shared the variability of speakers and words, here the task would have been more difficult because (i) tasks were not graded and (ii) a single presentation 4AFC identification task was used. It remains for future studies to adapt the procedures here and those in the Wang studies (Wang et al., 1999; Wang and Kuhl, 2003) to derive optimal L2 training regimes, especially for younger (e.g., 6-year-old) children.

Secondly, irrespective of task difficulty, 6-year-old children may have reduced sensitivity to foreign sounds. Burnham and colleagues (Burnham et al., 1991; Burnham, 2003), investigating what has been called a second period of perceptual attunement, have shown that 6-year-olds, compared with both 8-year-olds and also 4-year-olds, have reduced sensitivity to L2 sounds and suggest that this is an adaptive device which facilitates attention to the difficult task of phoneme-to-grapheme mapping involved in reading. Burnham contends that at 4 years this process has not begun, and by 8 years the process has become relatively automated, whereas at 6 years this attentional filtering is most useful. Whether this explains the results here cannot be fully ascertained without a 4-year-old comparison group, and it remains for future research to investigate this issue further.

#### **Test trials and generalization of training**

In this experiment children were given both ao- and av-test trials pre- and then post-training. The training was either with AO stimuli in one group or with AV stimuli in another group. In addition, the stimulus words and speakers were different in the pre- and post-training test phase on the one hand and in the Training trials on the other. Therefore, generalization of training can be indexed in two ways. First, any improvement after training can be considered generalization because the training and test stimuli differed (although there could be an across-the-board improvement because the pre- and post-training stimuli were from the same pool). In this sense then, any performance gain from pre- to post-training, such as those gains found in this study, can be considered as both learning and generalization of learning. Second, generalization can be indexed by any performance gain across both the ao- and the av-tests, irrespective of whether the training used AO or AV materials. A confounding factor in the interpretation of the results with respect to this type of generalization is that ao-tests proved to be more sensitive indices of performance gain than were av-tests. Nevertheless, it can be concluded that, in general, generalization of training was best for 8-year-olds, and especially for 8-year-olds with tone language experience whether that be monolingual (Mono-Thai) or bilingual experience (Bi-Eng/Thai).

#### **Tone language experience**

Participants with tone-language experience (the Bi-Eng/Thai and Mono-Thai groups) benefitted more from training than those with no tone language experience (Bi-Eng/Arabic and Mono-Eng), irrespective of whether the children were monolingual or bilingual. In addition, those with tone language experience (especially the 8-year-olds) also showed better generalization of training across test type, ao and av. This supports previous findings that tone language experience facilitates adult lexical tone perception (e.g., Burnham et al., 2014) and extends these findings to children. Moreover, these data provide information about two aspects of language learning. First, the data tell us that there is some perceptual or conceptual information about lexical tones that is general across tone languages (or at least for the two tone languages here, Mandarin, the target language, and Thai, the language of prior experience). Second, the data tell us that any metalinguistic advantage or extra skills learned as a product of learning more than one language is independent of the skills required for learning to perceive lexical tone in a tone language. Each of these is discussed in further detail below.

#### **Task difficulty and differences between tone languages**

Mandarin and Thai tone inventories differ on a number of dimensions: Mandarin has 4 tones and Thai 5; Thai has 2 level tones and 3 contour tones, Mandarin has 1 level and 3 contour tones; all 5 Thai tones are of similar duration, whereas Mandarin tones differ markedly in duration. Thus Mandarin and Thai are quite distinct with respect to their tones and this has two interesting implications with respect to the results obtained here. First, given these differences, it is reassuring that there was an effect of (Thai) tone language background on the learning of the target tones in Mandarin, i.e., that there was transfer of learning from Thai tones to learning Mandarin tones. Second, the differences between Thai and Mandarin may have played a part in the relatively small performance gains in tone perception here. Further studies in which the background language and target tone language are more similar with respect to their tones, e.g., Thai and Cantonese (6 tones: 3 level and 3 contour, all tones of similar duration), may result in greater performance gains. More generally, the relative salience of differences between tones within a particular language and the relative difficulty of discriminating tones in one tone system vs. that in another system is largely unknown (but see Burnham et al., 2017). Recent studies have shown that tone perception develops for a specific tone system (Yeung et al., 2013) and that non-native tone language speakers have difficulty with tones that are similar or overlap with their native tone systems (Hao, 2012). Much more research is required on what particular parameters make particular tones or tone systems easier or more difficult to learn.

#### **Monolingual and bilingual children**

While monolingual vs. bilingual status of the children did not in itself facilitate tone learning in children, it did contribute to the mode of training that was most effective, as measured by the performance gain between pre- and post-training ao-test trials. The Auditory-Visual (AV) mode of training was the most effective for monolingual children, whereas for bilingual children Auditory-Only (AO) training, and, to a lesser extent, AV training resulted in performance gains. The source of this difference is not clear. One possibility is that exposure to a greater range of linguistic differences and devices, as would be the case for bilingual children, allowed them to (i) learn from a range of parameters, including auditory information alone or auditory and visual information combined, and (ii) learn that, even though there is visual information for tone (Burnham et al., 2001a,b, 2014; Smith and Burnham, 2012), the auditory information is by far the most salient. This is speculative and requires more definitive evidence.

#### **Phonological awareness**

English phoneme deletion ability (in the Mono-Eng, Bi-Eng/Arabic, and Bi-Eng/Thai groups) was positively related to pre-training auditory-visual test trial performance. Although there was no relationship here between reading and tone perception, the results are reminiscent of those of Burnham et al. (2011) who found a significant relationship between Thai children's reading ability and their phonological and tonological awareness, and between Australian English children's reading and their phonological awareness. Thus here, the ability to manipulate phonemes is related to the ability to perceive foreign speech sounds, and in Burnham et al. (2011) reading ability is related to the ability to manipulate (foreign) phonemes and tonemes. Further research is required to investigate the nature of any three-way relationship between reading, phonological awareness, and foreign speech sound perception (and of especial interest here, perception of lexical tones), the findings of which could be relevant to children's propensity to learn a second language, especially a tone language.

### Limitations and Future Directions

A number of limitations can be noted.

#### Training and Test

The post-training test implicitly tested for generalization across speakers and words, and, in addition, these trials provided implicit tests of generalization from training mode (be it AO or AV) to test mode, as both ao and av tests were given irrespective of training. The downside of this is that tests between trained and untrained stimuli and voices could not be conducted. It is possible that children, even the younger 6-year-olds, may have performed better on trained than untrained stimuli and voices. This should be remedied in future studies. The upside is that any improvement as a result of training indicated generalization of training. So the performance gains obtained here, while modest, are robust.

A related point concerns variability. As discussed above, variability improves the robustness of learned distinctions (Wang et al., 1999), but variability should be optimized for the age and maybe the language background of the children. Here it was not.

A final point on this theme is that for both clusters of results (the age and tone language experience cluster, and the training mode and monolingual/bilingual language experience cluster) the ao-tests were more sensitive measures of improvement than were the av-tests. And, even though auditory and auditory-visual modes differentially affected training outcomes in monolingual and bilingual children, the indexation of such training was still generally better on ao-tests. The reason for this is unclear. In future studies it would seem that ao-tests should be preferred.

#### Phonological Awareness

We included English language tests of phonological awareness here and found a positive relationship between phoneme deletion and pre-training auditory-visual test performance. Future studies should investigate this further by including reading tests across languages, phonological awareness tests across languages, and also tests of morphological awareness (McBride-Chang et al., 2003) and even executive function, in order to determine predictors of good lexical tone learning.

#### Instructions

No specific instructions were given. Children were simply told to pay attention to both the auditory and visual aspects of the speakers as we wished to determine whether children naturally pick up relevant lexical tone cues in an experimental setting. In real-life L2 learning situations such experimentally objective procedures may not be desired; indeed any relevant cue could and should be made available. In this regard, Chen and Massaro (2008) tested Mandarin perceivers' Visual-Only (VO) identification of the four Mandarin tones. (Remember that Mandarin language adults are worse than English language adults in VO tone perception; Smith and Burnham, 2012; Burnham et al., 2014.) Initially the Chen and Massaro adults performed only just above chance and were better for the 55 and 214 tones than the 35 or the 51 tones. In a follow-up test adults were told about visible movements in tone perception and instructed to pay attention to movements of the neck, head, and mouth. Visual-Only tone perception improved significantly. Further work on perceivable visible cues for tone perception is required to facilitate L2 tone learning regimes.

#### Tone Difficulty

The Chen and Massaro (2008) results also raise the issue of the relative difficulty of identification of individual tones and discrimination of tone pairs. Although the results reported here were based on perception across all Mandarin tones, the data also showed some differences in how participants of different ages and language backgrounds learned the Mandarin tones. Details of performance on the different tones for each language background group and each age are shown in **Table C** (Supplementary Material) and some comments on these are provided here. Generally, the high Static tone (T55) was the easiest tone to learn for the monolingual non-tonal group while the Dynamic tones (either T214 or T51) were the most difficult. The results for the monolingual tonal group were exactly the opposite: the Static tone (T55) was the most difficult to learn while the Dynamic tones (T214 and T35) were the easiest. The data are a little less definite for the bilingual language background groups. Nevertheless, it appears that 6-year-olds in both bilingual groups found the Static tone (T55) the easiest to learn while the Dynamic tones (T35, T214, and T51) were similarly difficult, whereas the 8-year-old bilingual groups found the Dynamic tone T214 the easiest to learn. The fact that the participants with a tonal background found the generally difficult Dynamic tones T35 and T214 (Chang, 2011) in Mandarin relatively easy to learn in this study is quite interesting. However, as the task used in this study was tone identification, the distinctive contours, rising and dipping, of these two rising tones might help in identifying them. The results might well be different if participants were asked to discriminate between these two rising tones; the task might be much more difficult. Future work must take into account such differences, but at the moment there are no objective criteria for determining difficulty of tone perception within and between languages. We (Burnham et al., 2017) are currently collecting data on the perception of tone pairs from three different tone languages by adults from five different language backgrounds in order to leach out some such criteria.

### CONCLUSIONS

The results of this study show that 8-year-olds and to some extent 6-year-olds can learn to identify Mandarin tones. The training procedure included feedback during practice and training sessions and, in training, rewards of animated cartoons after three consecutive correct responses. The same stimuli were used in pre- and post-training tests, but there was considerable variability of training stimuli in terms of speakers and word contexts, variability that perhaps hindered learning in children of 8 and especially 6 years of age.

Nevertheless, there was successful training of 6- and 8-year-olds to identify non-native lexical tones. The main findings were that tone language experience with Thai facilitated learning to identify Mandarin tones, irrespective of whether the experience was monolingual Thai, or bilingual Thai-English. Monolingual vs. bilingual experience per se had no effect on tone training. However, the modality of training appeared to be differentially effective: as indexed in ao-test trials, monolingual children improved markedly with Auditory-Visual training stimuli but not at all with Auditory-Only training, whereas bilingual children improved markedly with Auditory-Only training and to a lesser extent with Auditory-Visual training. These results suggest an interesting interaction between language experience and training method that may have important implications for L2 training techniques and facilitate L2 training regimes.

### AUTHOR CONTRIBUTIONS

BK and LE-KA: experimental design, data collection and analyses, data management, manuscript drafting and revising, approving the manuscript to be published, and agreeing to be accountable for all aspects of the work in this manuscript. AJ, JAS, and DB: experimental design, supervision of data collection and data management, manuscript drafting and revising, approving the manuscript to be published, and agreeing to be accountable for all aspects of the work in this manuscript.

### FUNDING

This research was supported by an Australian Research Council (ARC) Discovery grant (DP0988201) to the last author and some funding for an honors project to the first author.

### ACKNOWLEDGMENTS

This work was partially conducted while AJ and JAS were on sabbatical leave at the MARCS Institute for Brain, Behaviour and Development with support from The University of Kansas. The authors wish to thank Dr. Chutamanee Onsuwan from Thammasat University for supporting testing in Thailand.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2018.01508/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Kasisopa, El-Khoury Antonios, Jongman, Sereno and Burnham. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## How Do Infants Disaggregate Referential and Affective Pitch?

#### René Kager\*

Utrecht Institute of Linguistics OTS, Utrecht University, Utrecht, Netherlands

Infants are faced with the challenge of disaggregating functions of pitch in the ambient language into affective, pragmatic or referential (the latter in tone languages only). This mini review discusses several factors that might facilitate the disaggregation of referential and affective pitch in infancy: acoustic characteristics of infant-directed speech, recognition of vocal affect, facial cues accompanying affective prosody, and lateralization of affective and referential prosody in the brain. It proposes two hypotheses concerning the role of audiovisual cues and brain lateralization.

Keywords: infant speech perception, pitch processing, infant language representation, lexical tone acquisition, lexical tone perception

This article discusses potential factors to facilitate the disaggregation of referential and affective pitch in infancy: acoustic characteristics of infant-directed speech, recognition of vocal affect, facial cues accompanying affective prosody, and lateralization of affective and referential prosody in the brain. It proposes two hypotheses concerning the role of audiovisual cues and brain lateralization.

#### Edited by:

Leher Singh, National University of Singapore, Singapore

#### Reviewed by:

Carolyn Quam, Portland State University, United States; Feng-Ming Tsao, National Taiwan University, Taiwan

> \*Correspondence: René Kager r.w.j.kager@uu.nl

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 04 May 2018 Accepted: 10 October 2018 Published: 31 October 2018

#### Citation:

Kager R (2018) How Do Infants Disaggregate Referential and Affective Pitch? Front. Psychol. 9:2093. doi: 10.3389/fpsyg.2018.02093

Among the many acoustic cues in speech, fundamental frequency (perceived as pitch) is arguably the one that, cross-linguistically, has the widest range of linguistic and para-linguistic uses (Gussenhoven, 2004). Universally, pitch signals affective use (for example, expressing happiness by high average pitch and wide pitch range) and pragmatic use (for example, marking a question by rising pitch is a universal tendency). Exclusively in tone languages, pitch supports referential use by contrasting word meanings (for example, Cantonese /fan/ "divide" carries a high-level tone; "angry" a mid-rising tone). Infants born into tone languages (a term which includes "pitch accent" languages; Hyman, 2009) are faced with the challenge of discovering how pitch patterns in the ambient language distinguish different word meanings; hence, they must disaggregate pitch in the input into non-referential and potentially referential information. Infants learning a (non-tone) lexical stress language must discover that pitch has no direct, but only indirect referential significance as one of the cues associated with stress (next to other cues, e.g., duration). Detection of the referential significance of pitch poses a critical challenge for infants when they are learning their first words. Yet several studies suggest infants discover the presence/absence of lexical tones before their first birthday. Tone-learning infants retain their initial ability to discriminate tones, while infants exposed to a non-tone language lose it between 6 and 9 months (Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Liu and Kager, 2014; Götz et al., 2018), before losing the ability to learn tone-to-word associations, which they still possess at 9 months (Yeung et al., 2014), by 18 months (Singh et al., 2014; Hay et al., 2015; Burnham et al., 2018; Liu and Kager, 2018). How are infants able to disaggregate pitch into non-referential affective and referential linguistic information?
Infants' environments are rich in affective content, as infant-directed speech (IDS) is characterized by exaggerated pitch contours reflecting "free vocal expression of emotion" (Trainor et al., 2000), which attracts infants' attention (Cooper and Aslin, 1990; Werker et al., 1994), yet does not a priori facilitate tone acquisition, as it may partially obscure contrastive shapes of tones (Papoušek and Hwang, 1991; Kitamura et al., 2002). Pitch exaggeration in IDS may be partly compensated by tonal hyper-articulation (Liu et al., 2007; Xu Rattanasone et al., 2013; Tang et al., 2017), yet to what extent precisely is an open issue. In order to facilitate disaggregation of referential and affective pitch, young infants may draw on their ability to recognize vocal and visual expression of affect.

The ability to interpret speech prosody as having affective value emerges early in life. Pitch contours in IDS presumably carry innately specified affective meanings to young infants, eliciting attention, arousal, approval, and disapproval (see Fernald, 1992, for a review). Neonates show an increase in eye opening responses to happy vocal stimuli as compared to other expressions (angry, sad, neutral), but only for their native language (Mastropieri and Turkewitz, 1999), suggesting prenatal influence on perception of vocal affect. By 5 months, infants reliably discriminate affect, detecting changes in vocal affect from sad to happy (Walker-Andrews and Grolnick, 1983); 7-month-olds show different ERP responses to affective (happy or angry) vs. neutral prosody (Grossmann et al., 2005). Yet infants' ability to discriminate affect may not provide a reliable basis for affective-referential pitch disaggregation; perhaps it should be matched by an ability to understand emotion in speech. However, this ability is not developed until 4–5 years (Quam and Swingley, 2012). School-aged children (around age 10) experience difficulties integrating vocal affect with lexical content (Friend, 2000, 2001, 2003; Friend and Bryant, 2000; Morton and Trehub, 2001; Morton et al., 2003). Since the ability to understand emotion in speech develops so slowly, it is worth exploring how affective-referential pitch disaggregation during the first year of life might be supported not only by auditory/vocal cues, but also by visual/facial cues.

By 4–6 months of age, infants, in spite of their reduced visual processing, can discriminate their native language from other languages partly by relying on visual cues accompanying gestures such as vocalic lip rounding (Weikum et al., 2007). In comparison, visual cues to tonal gestures are weak and unreliable to native listeners (Chen and Massaro, 2008; Hannah et al., 2017). Young infants (4-month-olds) can detect different emotions (happy, angry, sad) when presented with facial-vocal cues (Flom and Bahrick, 2007), an ability emerging prior to affect detection based on unimodal cues (Walker-Andrews, 1997). In light of infants' early sensitivity to facial-vocal cues to affect, the hypothesis can be proposed that affective-referential pitch disaggregation draws on facial affective cues accompanying vocal affect. By labeling pitch information as affective, infants may focus their linguistic attention on residual pitch information that has no clear affective interpretation, which includes referential information.

A neurological marker of affective-referential pitch disaggregation may be obtained in the hemispheric specialization for linguistic and affective pitch. A functional asymmetry between the right hemisphere (RH; dominant in processing pitch changes and emotional vocalization) and the left hemisphere (LH; dominant in processing speech, in particular segmental information) occurs in neonates (Dehaene-Lambertz, 2000; Peña et al., 2003). Native listeners process linguistically relevant lexical pitch dominantly in LH (Wang et al., 2001); affective pitch dominantly in RH (Edmondson et al., 1987). Yet hemispheric lateralization of linguistic and affective pitch processing remains a controversial issue (Wong, 2002; Zatorre and Gandour, 2008). Turning to infant studies, early RH specialization for pitch processing is found in neonates (Arimitsu et al., 2011); 3-month-old Japanese infants show stronger RH responses to natural speech, which includes pitch contours, as compared to prosodically flattened speech (Homae et al., 2006). The processing of lexical pitch is lateralized to LH in Japanese infants between 4 and 10 months (Sato et al., 2010; see Minagawa-Kawai et al., 2011 for discussion). Plausibly, the disaggregation of affective-referential pitch involves a functional specialization of the brain's hemispheres: general (affective and linguistic) pitch processing starts out in RH, while disaggregation amounts to a lateralization of linguistic pitch processing to LH. Infants' detection of affect, guided by vocal-facial cues, provides the key ability. A second hypothesis is proposed to this effect: the more emotional speech is, the more dominant RH becomes in speech processing; conversely, less emotional speech implies a decreased role for RH in pitch processing, enabling a partial shift of pitch processing to LH, the dominant hemisphere for speech processing. This predicts that (the perceived amount of) facial affect influences the locus of pitch processing in the infant brain.

In sum, affective-referential pitch disaggregation by infants may be accomplished by a combination of two (possibly innate) abilities, matching the two hypotheses stated above: (a) recognition of affect in pitch contours and integration of audiovisual (vocal-facial) cues on affect; (b) hemispheric specialization for pitch processing, where RH acts as "emotion attractor" and LH as "language attractor." Integrating research on early tone perception, audiovisual affect recognition, and hemispheric specialization may open a new perspective on how infants manage to detect the presence/absence of lexical tone in their native language.

### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and has approved it for publication.

### FUNDING

The Consortium on Individual Development (CID) is funded through the Gravitation program of the Dutch Ministry of Education, Culture, and Science and the Netherlands Organization for Scientific Research (NWO Grant No. 024.001.003).

### REFERENCES

Arimitsu, T., Uchida-Ota, M., Yagihashi, T., Kojima, S., Watanabe, S., Hokuto, I., et al. (2011). Functional hemispheric specialization in processing phonemic and prosodic auditory changes in neonates. Front. Psychol. 2:202. doi: 10.3389/fpsyg.2011.00202

Burnham, D., Singh, L., Mattock, K., Woo, P. J., and Kalashnikova, M. (2018). Constraints on tone sensitivity in novel word learning by monolingual and bilingual infants: tone properties are more influential than tone familiarity. Front. Psychol. 8:2190. doi: 10.3389/fpsyg.2017.02190

Chen, T. H., and Massaro, D. W. (2008). Seeing pitch: visual information for lexical tones of Mandarin-Chinese. J. Acoust. Soc. Am. 123, 2356–2366. doi: 10.1121/1.2839004


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Kager. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Diversity of Tone Languages and the Roles of Pitch Variation in Non-tone Languages: Considerations for Tone Perception Research

#### Catherine T. Best\*

MARCS Institute, Western Sydney University, Penrith, NSW, Australia

Keywords: tone processing, non-native speech perception, prosodic hierarchy, tone language diversity, non-tone languages

#### Edited by:

Jessica Hay, The University of Tennessee, Knoxville, United States

#### Reviewed by:

H. Henny Yeung, Simon Fraser University, Canada; Thierry Nazzi, Université Paris Descartes, France

> \*Correspondence: Catherine T. Best c.best@westernsydney.edu.au

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 03 May 2018 Accepted: 05 February 2019 Published: 26 February 2019

#### Citation:

Best CT (2019) The Diversity of Tone Languages and the Roles of Pitch Variation in Non-tone Languages: Considerations for Tone Perception Research. Front. Psychol. 10:364. doi: 10.3389/fpsyg.2019.00364

All languages employ consonants and vowels as discrete contrastive subcomponents of the basic timing units of words (syllables). These two classes of phonemes are used to differentiate between words, whose meanings can be categorically changed by switching even a single vowel or consonant, as in <pat> vs. <cat> or <pet>. They populate the lowest level of the phonological hierarchy, the segmental tier, and both classes are obligatory across spoken languages. But only some languages also make use of lexical tones, contrastive sub-syllabic fundamental frequency (pitch) variations referred to as tonemes (e.g., Jones, 1944), which for those languages comprise a third class of phonemic elements. Perceptual researchers often assume tones to be suprasegmental (e.g., So and Best, 2010, 2011, 2014; Liu et al., 2018; Poltrock et al., 2018), i.e., to extend across the consonants and vowels of the target syllable. While in a phonetic sense tones extend across the voiced segments of a syllable, such observations may not straightforwardly reflect the more abstract phonological properties of tones (e.g., see Wang, 1967; Hyman, 2011a,b). Indeed, several tone phonologists claim that lexical tones function as segments in tone languages (e.g., Lin, 1989; Duanmu, 1990, 1994). For the following paragraphs we adopt the phonological view that lexical tones function in tone languages at the segmental level, along with consonants and vowels. However, we return later to consider their phonological status and its relevance for understanding lexical tone perception by native and non-native listeners.

Unlike consonants and vowels, lexical tones are optional<sup>1</sup>. Many languages of Europe, the Americas, Oceania, Africa and even Asia function perfectly well without them. But lexical tones are, nevertheless, a popular option. They are employed in 60–70% of existing languages (Yip, 2002), including many Asian, African and indigenous American languages as well as a few European and South Pacific languages (Maddieson, 2013). It is important to note, nonetheless, that lexical tone forms and usage vary widely across tone languages (e.g., Hyman, 2011a, 2016; Remijsen, 2016)<sup>2</sup>. Some include tonemes with temporally-changing pitch trajectories (contour tone languages) while others use only level pitches (register tone languages). Some rely only on pitch specifications for tone contrasts while others have been claimed to also incorporate phonation distinctions<sup>3</sup>. Some have seven or more contrastive tones while others have as few as two. Some apply tone values to all syllables while others restrict tones to accented syllables of specific words (lexical pitch accent<sup>4</sup>). Some use tones only for stem morphemes while others use tone to mark grammatical or morphological alternations. Tone languages also differ in their degree of reliance on lexical tone distinctions, ranging from extensive, i.e., high functional load, to quite restricted use, i.e., low functional load.

<sup>1</sup>While lexical stress and gemination are also optional (non-obligatory) phonological features used for lexical contrast, they both are defined across multiple timing units. Lexical stress is a contrastive relationship realized across two or more syllables, while gemination involves repetition of the same segment across two adjacent morae, either within a syllable or across a syllable/morpheme boundary. Given our focus on lexical tones, we will not discuss them except if/as relevant to perception of tones.

Moreover, languages that lack lexical tones (non-tone languages) are far from devoid of systematic pitch variations. All spoken languages use pitch and contour paralinguistically, e.g., to convey information about emotions and talker gender and age. More importantly for our discussion of lexical tones, all languages also use pitch variation linguistically to mark intonation distinctions at supra-syllabic (metrical) levels of the phonological hierarchy: prosodic word, phonological phrase, intonational phrase, and utterance tiers (the prosodic hierarchy: e.g., Beckman and Pierrehumbert, 1986; Nespor and Vogel, 1986; Selkirk, 1986; Pierrehumbert and Beckman, 1988), which are most often examined using the ToBI (Tones and Break Indices) framework and transcription system (see Beckman et al., 2006), an approach that has also been applied to lexical tones (e.g., Francis et al., 2008). Clearly, then, phonological use of pitch distinctions is familiar to non-tone language speakers, at higher metrical levels of their language.

The crucial difference between tone and non-tone languages is that tone languages use contrastive pitch specifications at every level of the phonological hierarchy, whereas non-tone languages have a gap in contrastive use of pitch at the segmental level. As a result, non-tone language speakers are likely to perceive non-native lexical tones in terms of paralinguistic information and/or as native-language (L1) prosodic distinctions. For example, they may perceive non-native lexical tones as L1 intonational phrase (e.g., Hallé et al., 2004) and/or stress contrasts (e.g., So and Best, 2010, 2011, 2014). Such a discrepancy in phonological tiers between the lexical tones of the non-native stimulus language and the higher prosodic level(s) at which non-tone L1 listeners perceive the pitch variations as distinctive may explain why non-tone L1 adults often err in perceiving, producing and remembering the lexical tones of names and words in a tone language (McGinnis, 1997), including even very proficient English-L1 speakers of L2 Mandarin (Wong and Perrachione, 2007). In tone word training studies, non-tone L1 listeners learn novel words' consonant-vowel patterns faster and more accurately than their lexical tones (Wong and Perrachione, 2007). They also display substantial individual variation in learning, which correlates with variations in their tone discrimination performance in non-lexical tasks (e.g., Wong et al., 2007; Chandrasekaran et al., 2010). Nonetheless, learning tones in words is more challenging than mere tone discrimination, which is clearly above chance even prior to training (e.g., 78% correct discrimination in a pre-test: Wong and Perrachione, 2007)<sup>5</sup>.

Unique insights into how language experience shapes phonological knowledge could be gained from studies of non-native and native tone perception that exploit the diversity of lexical tone systems, and probe how a range of contrast types are perceived in relation to prosodic distinctions at higher tiers of the phonological hierarchy. Most prior studies of lexical tone perception by infants and young children, however, have drawn their target stimuli and native listeners from a small set of Asian languages that have contour tone systems, though there are some exceptions (e.g., Yoruba, an African register tone language: Harrison, 2000; Japanese, an Asian pitch accent language: Nazzi et al., 1998; Sato et al., 2009; Ota et al., 2018). The non-native listeners have often been non-tonal L1 speakers naïve to the target tone language, though in a few studies their L1s have been pitch accent languages (e.g., So and Best, 2010) or other contour tone languages (e.g., So and Best, 2010, 2011, 2014; Reid et al., 2015). Another potential limitation of much prior research with young children is that often only discrimination has been tested (e.g., Harrison, 2000; Mattock and Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013; Liu and Kager, 2014; Hay et al., 2015; Cheng and Lee, 2018). However, more recent studies have extended the investigation to word recognition and learning (Singh and Foong, 2012; Singh et al., 2014; Hay et al., 2015), including a number of papers in this Special Topic volume (e.g., Liu and Kager, 2018; Ota et al., 2018; Burnham et al., 2019; and several other papers discussed below). Other recent advances include studies on the developmental relationship between perception of lexical tones and perception of higher-tier linguistic information such as stress and prosody (Quam and Swingley, 2010; Liu and Kager, 2014; Singh and Chee, 2016; Choi et al., 2017; Ma et al., 2017) and paralinguistic features such as pitch variations that convey emotions (e.g., Kager, 2018).

<sup>2</sup>In addition, neither phonetic nor phonological notation for tones has been standardized or widely adopted to the same extent as for consonants and vowels (International Phonetic Alphabet [IPA], 2015). There are a number of competing and inconsistently used systems. Chao (1930) numbers ("letters") have been adopted most often, primarily but not only for Asian languages. However, even when used, Chao numbers are applied within each language relativistically, making direct comparison between tones of different languages not as straightforward as one might expect. The IPA offers a schizoid choice between tone diacritics on the vowel or pictographic symbols placed next to the syllable; neither are used as widely as Chao numbers. And some researchers instead use idiosyncratic, language-specific tone symbols (e.g., Thai) and/or names that are sometimes but not always English-lexified (e.g., Mandarin rising, falling, dipping, high level; but Vietnamese sắc, ngang, ngã, huyền, hỏi and nặng [or merged hỏi-nặng in South Vietnamese]). None of these notation approaches systematically reflects effects of phonetic context and sandhi rules on the phonetic form of tones as they are actually realized in connected speech.

<sup>3</sup>As these claims have referred to creaky voice (very widely spaced pitch pulses) and glottalization (temporary lack of pitch pulsing) it is not entirely clear to me that they are necessarily categorically different from pitch specification. For example, perhaps they could indicate very to maximally low pitch.

<sup>4</sup>While it remains a matter of debate whether lexical pitch accent is a type of lexical tone, for heuristic purposes, languages that use only pitch accents, such as Japanese, are considered tone languages in this paper. They are assumed to be specified at the segmental tier of the phonological hierarchy in such languages, rather than at the higher timing tiers, as Duanmu and Lin have posited for non-pitch-accent tone languages.

<sup>5</sup>Similar findings have been reported for discrimination vs. higher-level perceptual tasks involving non-native lexical stress contrasts (e.g., Skoruppa et al., 2009, 2013).

The six articles I was invited to comment on have each extended that recent progress in our understanding of the early development of native and non-native perception of lexical tones. All expand beyond the issues addressed in most previous research, although five of them maintain the typical focus on Asian contour tone languages, specifically the most-often-studied language, Mandarin, and a second widely-spoken Chinese language, Cantonese. Chen et al. (2017) found that infants learning Dutch, a non-tone language, discriminated both a difficult Mandarin contour tone contrast (T2-T3) and matched tritone piano melodies at 12 but not 4 months, despite lacking exposure to lexical tones in their environment. The authors interpret these results as evidence that development of pitch contour perception is mediated by domain-general rather than language-tuned mechanisms. In a second paper, however, although both Mandarin-learning and English-learning infants also discriminated another Mandarin tone contrast (T1-T3) better at 12 than at 6 months, the Mandarin infants showed significantly greater improvement, which indicates that language-specific experience does enhance lexical tone discrimination (Tsao, 2017). Moreover, in a different categorial discrimination task both 4- and 13-month-old Mandarin-learning infants discriminated the Mandarin T2-T3 contrast (same as in Chen et al., 2017), but Mandarin 2-year-olds failed to detect T2-T3 tone mispronunciations of known words (Shi et al., 2017). The latter finding mirrors a previously-observed discrepancy between infants' basic discrimination of a consonant contrast as compared to their later poor recognition of that same contrast when it occurs in words (Stager and Werker, 1997).

Older children were the participants in the other three articles, two of which examined Cantonese-learning children. In one, 3-year-olds failed to perceive or produce Cantonese tones like adults but, consistent with a classic speech development hypothesis, they were more accurate in tone perception than production (Wong et al., 2017). In the other, Cantonese 3rd-graders' lexical tone sensitivity was found to correlate with their sensitivity to lexical stress in L2-English words (Choi et al., 2017). The remaining article (Ramachers et al., 2017) took an important additional step away from the past by using a European pitch accent language, Limburgian, rather than an Asian contour tone language in which tones carry high functional load in the lexicon but no grammatical function. Limburgian's binary level-tone distinction, which is embedded in a complex intonation system, carries a low functional load, but contributes both to lexical items and to a morphological alternation for a few frequent nouns in which falling pitch indicates plurality. No evidence of effects of language experience was found for Limburgian- versus Dutch-learning 2.5- and 4-year-olds' learning of novel Limburgian words with lexical tone: children of both ages were sensitive to tone mispronunciations of the newly-learned words. The authors inferred that the children's lexical representations for the novel items included tone specifications.

These papers individually and together advance our knowledge about the development of young children's perception and production of lexical tones, of their phonological representation of tones in words, and of the impact that speaking a native tone language may have on children's perception of lexical stress in a non-tone second language they are learning. Nonetheless, there is still a long way to go in understanding the role of experience in perception and phonological representation of lexical tone contrasts. Ideally, future research should include a wider range of non-Asian languages, including register tone as well as contour tone languages, and wider variations in the functional loads and morpho-grammatical functions of lexical tones across languages. Cross-language comparisons across a wider range of lexical tone systems will be needed to identify where, how and why perceptual assimilation of non-native lexical tones to higher prosodic tiers in the native languages of non-tone L1 listeners may break down. Similarly, use of the full range of lexical tone types and systems will be needed to determine whether, when and how young non-tone language learners may shift from perceiving non-native lexical tones as potential segmental contrasts (like consonants and vowels) to assimilating them as native prosodic patterns, and on the other hand to better understand how and when young learners of tone languages begin to tease apart lexical tones (segmental tier) from not only paralinguistic indexical information (talker identity, gender, emotion etc.) but also linguistic prosodic information in their language.

Understanding the phonological status of lexical tones could provide an important linguistic basis for predicting and interpreting both native and non-native tone perception and early learning. However, it has not yet been resolved whether the lexical tones of tone languages serve suprasegmental or segmental functions, and in the latter case whether they constitute a third class of phonological segments or serve as phonological features of vowels or of consonants. As briefly summarized in the following paragraphs, certain sources of evidence and/or theoretical analyses appear to be consistent with each of these possibilities. Unfortunately, the nature of the evidence differs among them, making it difficult to decide among them. Further research and theoretical analyses will be needed to tease them apart. It is likely that the answer will depend on whether the approach focuses on tone production and phonological processes in lexical tone languages, or whether the approach focuses instead on native or non-native perception. With the former approach the answer may vary depending on what types of tone systems the target languages have, whereas with the latter approach the answer should vary according to whether the listener groups have tone or non-tone L1s.

The question of the suprasegmental vs. segmental status of lexical tones in tone languages has been addressed primarily via phonological analysis of diachronic and synchronic data on tones as produced in a range of languages. In classic generative phonology tones were considered to be segmental in nature (e.g., Chomsky and Halle, 1968). Furthermore, as noted earlier, Duanmu (1990, 1994) and Lin (1989) also concluded from the phonological evidence that tones function as segments in tone languages, and of course for their native speakers. Based on cross-language phonological analyses, Hyman also concluded that tones serve segmental functions in tone languages, though he reasoned that in addition, unlike consonants and vowels, tones also can and do serve metrical (suprasegmental) functions. Thus, the consensus from a phonological point of view is that lexical tones function as segments in the languages that employ them contrastively, although they can also serve suprasegmental functions in those languages.

This leads us to the next question: do lexical tones constitute a third class of phonological segments in addition to consonants and vowels in tone languages, or do they instead serve as optional phonological features of vowels or consonants? In the classic generative phonology framework of Chomsky and Halle (1968), lexical tones were treated as an optional set of vowel features, i.e., not as a separate third class of segments. On the other hand, several lines of phonological evidence suggest that lexical tones may function as consonantal features (rather than as a third segmental class) in tone languages. Firstly, the emergence of lexical tones during the historical evolution of a language (tonogenesis) is much more likely to arise via diachronic changes in laryngeal features of consonants, e.g., through trans-phonologization of voicing contrasts, than from diachronic changes in vowels (see Maddieson, 1984; Whalen et al., 1993; Ratliff, 2015; Remijsen, 2016; for ongoing consonant voicing-related tonogenesis in Seoul Korean, see Silva, 2006a,b). Secondly, some articulatory studies of speech production in tone languages have demonstrated that the laryngeal gesture that produces a lexical tone is coupled with the constriction gesture for the onset consonant of the tone-bearing syllable rather than being coupled with its vowel nucleus (Gao, 2009; Mücke et al., 2012; Hu, 2016). However, a recent articulatory study instead found that certain Mandarin tones differentially shift tongue body position in production of adjacent vowels (Shaw et al., 2016), which may be consistent with viewing them as vowel features. Alternatively, the phonological analyses of Duanmu (1990, 1994), Lin (1989) and Hyman (2011a,b) posit that although tones interact with consonants and vowels in various ways, depending on the specific tone language, tones are autonomous. This implies that in their views, tones are a separate, optional third segmental class, distinct from vowels and consonants.

Thus, there does not appear to be a clear consensus from phonological and articulatory studies as to whether lexical tones function as a third, separate class of segments, or instead serve as vowel features or consonant features. Nor do neurocognitive studies resolve the issue. Some report a dissociation of tone processing from both consonant and vowel processing (Li et al., 2010), while others report partial dissociation of brain activation during tone vs. vowel production (Liu et al., 2006), and still others observed similar production difficulties with tones and consonants, but not with vowels, in non-fluent aphasic speakers of Mandarin (Packard, 1986).

Can we form a clearer picture based on existing cross-language tone perception studies? On the one hand, many reports on early developmental changes in non-native lexical tone perception appear compatible with the idea that tones are phonologically associated with consonants. For example, English-learning infants have been found to discriminate non-native Mandarin tone contrasts at 6 months but not at 9 months (e.g., Mattock and Burnham, 2006), consistent with numerous reports of a developmental decline around 10 months in discrimination of many non-native consonant contrasts and at odds with reports of an earlier decline at 5–6 months for non-native vowel contrasts (e.g., Werker and Tees, 1999). On the other hand, findings from a recent eye-tracking study of novel tone-language word learning by native, non-native tone L1 and non-native non-tone L1 adults indicate that tone processing appears to be more tightly time-locked to the vowel than the consonant onset in the words (Poltrock et al., 2018).

Further complicating things are other developmental findings suggesting that language-specific changes in consonant perception appear somewhat earlier, by 8 months, in French-learning than English-learning infants (Hoonhorst et al., 2009). And language-specific differences may emerge even earlier, by 4 months, in non-native English- and Mandarin-learning, and native Cantonese-learning infants' perceptual preferences for Cantonese tones (Yeung et al., 2013), in contrast to the previously reported language-specific decline in discrimination of non-native tones by 9 months (Mattock and Burnham, 2006). Yet other studies indicate instead that even 2- to 3-year-old monolingual tone language learners are not yet adultlike in their learning and recognition of spoken words, for which they are more strongly affected by vowel variation than tone variation (Ma et al., 2017), and they may not be able to perceptually disentangle the intonational vs. lexical basis for pitch variations until 4–5 years of age (Singh and Chee, 2016). In another study of monolingual Mandarin learners, however, 2- to 3-year-olds showed greater sensitivity to lexical tone mispronunciations than vowel or consonant mispronunciations of just-learned novel Mandarin words, whereas 4- to 5-year-olds reversed that pattern, showing greater sensitivity to vowel or consonant mispronunciations than to tone mispronunciations (Singh et al., 2015). By comparison, in a study of monolingual English- and monolingual Mandarin-learning children both groups detected either tone or vowel mispronunciations of just-learned novel Mandarin words at 18 months, but only Mandarin-learning children detected the tone mispronunciations at 24 months (Singh et al., 2014). In sum, then, existing perceptual investigations also fail to provide a clear answer to the question of whether tones form a separate segmental class or instead serve as features of vowels or consonants.

The challenge for further research is how to design tests of whether young children, or adults for that matter, perceive tones as features of consonants or vowels or as different from both, and of how that pattern may differ for native listeners vs. non-native listeners/learners of different types of tone languages or non-tone languages. Future research will also need to take into account that all languages, whether or not they use lexical tones, employ prosodic pitch distinctions at higher tiers of the phonological hierarchy. This means that speakers of so-called non-tone languages are not lacking entirely in experience with phonological information being conveyed by pitch variations, and can refer to native pitch settings at a higher tier of the prosodic hierarchy when perceiving non-native lexical tones. Conversely, it also means that for speakers or learners of a tone language there is potential for ambiguity or confusion over which phonological tier is being represented by a given tonal pattern. Such confusions could be the root cause of apparent developmental "dips" in tone sensitivity even in children whose native language uses lexical tones.

A key unanswered question for listeners from non-tone L1s is whether and how assimilating tones to native prosodic contrasts may help or hinder learning the lexical tones of words in a tone language. More specifically, it is an open question whether and how cross-tier perceptual influences differ quantitatively and/or qualitatively from perceiving non-native consonant and vowel contrasts with reference to same-tier native contrasts (for an excellent step toward addressing this see Braun and Johnson, 2011). These issues need to be carefully considered in any attempt to extend existing theoretical models of non-native and L2 speech perception, such as the Perceptual Assimilation Model (PAM: Best, 1995; Best and Tyler, 2007) or the Speech Learning Model (SLM: e.g., Flege, 1995), to the perception of non-native lexical tones by non-tone L1 listeners. Both models were developed specifically to account for cross-language perception of non-native consonants and vowels with reference to native segments, and can be extended fairly straightforwardly to predicting discrimination and categorization of non-native tones by adult listeners whose L1s are other tone languages, i.e., within the segmental tier. But neither model was designed to address the cross-tier perceptual relationships that are likely to come into play in non-native tone perception by listeners of non-tone L1s. Nonetheless, some studies have begun to examine perceptual assimilation of non-native tones to native intonation distinctions in non-tone listeners (e.g., So and Best, 2010, 2011, 2014), and the results suggest that such assimilations may be less categorical than are assimilations to another lexical tone system. The most comprehensive understanding of native and non-native tone perception and its development is likely to require studies in which the target stimuli are taken from a wider range of types of tone languages, and the listeners' L1s are representative of a wider range of tone and non-tone languages. There is still much to learn about perception of lexical tones, and how it changes developmentally in both native and non-native listeners.

### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and has approved it for publication.

### FUNDING

Preparation of this paper was supported in part by Australian Research Council grant DP130104237.

### ACKNOWLEDGMENTS

Many thanks to the two reviewers (Thierry Nazzi) for their thoughtful and constructive feedback and suggestions on the original submission of this Opinion. Appreciation also to Jessica Hay for her assistance and patience throughout the process of revising and resubmitting.

### REFERENCES

Chao, Y. R. (1930). A system of tone-letters. Maitre Phonetiq. 5, 24–27.

Chen, A., Stevens, C. J., and Kager, R. (2017). Pitch perception in the first year of life, a comparison of lexical tones and musical pitch. Front. Psychol. 8:297. doi: 10.3389/fpsyg.2017.00297

Wang, W. S.-Y. (1967). Phonological features of tone. Int. J. Am. Ling. 33, 93–105. doi: 10.1086/464946


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Best. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Editorial: Lexical Tone Perception in Infants and Young Children: Empirical Studies and Theoretical Perspectives

Leher Singh<sup>1</sup>\*, Denis Burnham<sup>2</sup>, Jessica Hay<sup>3</sup>, Liquan Liu<sup>4</sup> and Karen Mattock<sup>4</sup>

*<sup>1</sup> Department of Psychology, National University of Singapore, Singapore, Singapore, <sup>2</sup> MARCS Institute for Brain, Behaviour & Development, Western Sydney University, Penrith, NSW, Australia, <sup>3</sup> Department of Psychology, The University of Tennessee, Knoxville, TN, United States, <sup>4</sup> Department of Psychology, Western Sydney University, Penrith, NSW, Australia*

Keywords: language acquisition, psycholinguistic abilities, Mandarin Chinese, lexical tones, speech perception

**Editorial on the Research Topic**

### **Lexical Tone Perception in Infants and Young Children: Empirical Studies and Theoretical Perspectives**

Traditional theories of language development and speech processing have been derived from psycholinguistic research that has primarily focused on a particular subset of language types. Specifically, Romance and Germanic languages (e.g., English, French, German) have, until recently, received more attention than other types of languages, such as Chinese languages. This has led to selective emphasis on consonants and vowels, the phonological building blocks of European languages, in theories of language, to the exclusion of other phonological building blocks, such as lexical tone. Like consonants and vowels, variations in tones determine lexical meaning, but unlike consonants and vowels, lexical tones are based on pitch variations. Lexical tone is pervasive; it is used in at least half of the world's languages (Maddieson, 2013), including most Asian and some African, Central American, and European languages.

This Research Topic brings together a collection of recent empirical research on the processing and representation of lexical tones across the lifespan with an emphasis on advancing knowledge on how tone systems are acquired and enriching current theories of language processing and development. The articles focus on various aspects of tone: its early perception, its influences on word learning, the acquisition of new tone systems, and tone production.

One set of articles reports on tone perception at the earliest stage of development, in infants learning either tone or non-tone languages. Tsao and Chen et al. demonstrate that, in contrast to traditional accounts of perceptual narrowing for consonants and vowels, infants' sensitivity to Mandarin lexical tone, as well as pitch, improves over the first year of life in both native and non-native learners. Götz et al. report a U-shaped developmental trajectory for Cantonese tone perception and illustrate how the choice of methodological approaches can influence findings on infants' tone sensitivity. Fan et al. demonstrate that sensitivity to less well-studied properties of tone languages, such as neutral tone, may develop after the first year of life. Cheng and Lee investigate native tone discrimination in an electrophysiological study during the second year of life and report effects of stimulus salience on infants' neural response to tones.

In a complementary set of studies focused on tone sensitivity in word learning, Burnham et al. demonstrate that infants bind tones to newly-learned words if they are learning a tone language, either monolingually or bilingually, although it was also found that object-word binding was influenced by the properties of individual tones. Shi et al. also demonstrate effects of stimulus properties on tone-object binding in native learners of Mandarin. Liu and Kager chart a developmental trajectory over the second year of life in which infants narrow in their interpretation of non-native tones. Choi et al. investigate how learning a tone language can influence uptake of other suprasegmental properties of language, such as stress, and demonstrate that native tone sensitivity in children can facilitate stress sensitivity when learning a stress-based language. Finally, two studies focus on sensitivity to pitch in a sub-class of tone languages: pitch accent languages. In a study on Japanese children's abilities to recognize words they know, Ota et al. demonstrate a limited sensitivity to native pitch contrasts in toddlers. In contrast, Ramachers et al. demonstrate comparatively strong sensitivity to pitch in native and non-native speakers of a different pitch accent system (Limburgian) when learning new words.

Several studies focus on learning new tone systems. In a training study with school-aged children, Kasisopa et al. demonstrate that tone language experience increases children's abilities to learn new tone contrasts. Poltrock et al. show similar advantages of tone experience in learning new tone systems in adults. In an electrophysiological study, Liu et al. demonstrate order effects in adults' neural responses to new tones, discussing implications for learning tone languages as an adult. Finally, Hannah et al.'s work suggests that extralinguistic cues, such as facial expression, can support adults' learning of new tone systems.

In the first of three studies investigating tone production, Rattanasone et al. report the results of a study demonstrating kindergartners' asynchronous mastery of tones, i.e., delayed acquisition of tone sandhi forms relative to base forms. In a study examining a corpus of adult tone production, Han et al. demonstrate that mothers produce tones in a distinct manner when speaking to infants: tone differences are emphasized more when speaking to infants than to adults. Finally, combining perception and production of tones, Wong et al. report asynchronous development of tone perception and tone production in children.

The Research Topic also includes a series of Opinion pieces and Commentaries addressing the broader relevance of tone and pitch to the study of language acquisition. Curtin and Werker discuss ways in which tone can be integrated into their model of infant language development (PRIMIR). Best discusses the phonological status of lexical tones and considers how recent empirical research on tone perception bears on this question. Kager focuses on how language learners distinguish lexical tones from other sources of pitch variation (e.g., affective and pragmatic) that also inform language comprehension. Finally, Antoniou and Chin unite evidence of tone sensitivity from children and adults and discuss how these areas of research can be mutually informative.

Over the past decade, psycholinguistic studies of lexical tone acquisition have begun to burgeon. This collection of empirical studies and opinion pieces provides a state-of-the-art panoply of the psycholinguistic study of lexical tones, and attests to its relevance to language acquisition research. The articles in this Research Topic will help address past biases toward European non-tone languages, and will contribute to an expanding narrative of speech perception, speech production, and language acquisition that draws from a greater diversity of languages. Importantly, these studies underline the scientific promise of psycholinguistic research on tone languages; the research questions raised by the study of lexical tone are new and complement those typically applied to more widely studied languages and populations. Studies on lexical tone will continue to enrich psycholinguistic research in language acquisition and processing in a way that brings us closer to universal principles of language development.

#### Edited and reviewed by:

*Manuel Carreiras, Basque Center on Cognition, Brain and Language, Spain*

\*Correspondence: *Leher Singh psyls@nus.edu.sg*

#### Specialty section:

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

Received: *01 May 2019* Accepted: *07 May 2019* Published: *22 May 2019*

#### Citation:

*Singh L, Burnham D, Hay J, Liu L and Mattock K (2019) Editorial: Lexical Tone Perception in Infants and Young Children: Empirical Studies and Theoretical Perspectives. Front. Psychol. 10:1195. doi: 10.3389/fpsyg.2019.01195*

### REFERENCES

Maddieson, I. (2013). "Tone," in The World Atlas of Language Structures Online, eds M. S. Dryer and M. Haspelmath (Leipzig: Max Planck Institute for Evolutionary Anthropology). Available online at: https://wals.info/chapter/13

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

### AUTHOR CONTRIBUTIONS

LS, DB, JH, and LL drafted and edited the editorial. KM helped to draft and edit the proposal for this research topic.

Copyright © 2019 Singh, Burnham, Hay, Liu and Mattock. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.