Exploring the Complexity of the L2 Intonation System: An Acoustic and Eye-Tracking Study

Phonological research has demonstrated that English intonation, variably referred to as prosody, is a multidimensional and multilayered system situated at the interface of information structure, morphosyntactic structure, phonological phenomena, and pragmatic functions. The structural and functional complexity of the intonational system, however, is largely under-addressed in L2 pronunciation teaching, leading to a lack of spontaneous use of intonation despite successful imitation in classrooms. Focusing on contrastive and implicational sentence stress, this study explored the complexity of the English intonation system by investigating how L1 English and Mandarin-English L2 speakers use multiple acoustic features (i.e., pitch range, pitch level, duration, and intensity) in signaling contrastive and implicational information and how one acoustic feature (maximum pitch level) is affected by information structure (contrast), morphosyntactic structure (phrasal boundary), and a phonological phenomenon (declination) in L1 English and Mandarin-English L2 speakers' speech. Using eye-tracking technology, we also investigated (1) L1 English and Mandarin-English L2 speakers' real-time processing of lexical items that carry information structure (i.e., contrast) and typically receive stress in L1 speakers' speech; (2) the influence of visual enhancement (italics and bold) on L1 English and Mandarin-English L2 speakers' processing of contrastive information; and (3) L1 English and Mandarin-English L2 speakers' processing of pictures with contrastive information. Statistical analysis using linear mixed-effects models showed that L1 English speakers and Mandarin-English L2 speakers differed in their use of acoustic cues in signaling contrastive and implicational information. They also differed in the use of maximum pitch level in signaling sentence stress influenced by contrast, phrasal boundary, and declination. 
We did not find differences in L1 English and Mandarin-English L2 speakers' processing of contrastive and implicational information at the sentence level, but the two groups of participants differed in their processing of contrastive information in passages and pictures. These results suggest that processing limitations may be the reason why L2 speakers did not use English intonation spontaneously. The findings of this study also suggest that Complexity Theory (CT), which emphasizes the complex and dynamic nature of intonation, is a theoretical framework that has the potential to bridge the gap between L2 phonology and L2 pronunciation teaching.


INTRODUCTION
The past 30 years have witnessed evolutionary and far-reaching advances in the field of L2 pronunciation, including establishment of the intelligibility principle (Munro and Derwing, 1995; Murphy, 2014), development of a holistic approach acknowledging the importance of both segmental and suprasegmental features (Anderson-Hsieh et al., 1992; Derwing et al., 1998; Levis and Levis, 2018), and development of technologies such as speech visualization (Levis and Pickering, 2004), automated speech recognition (ASR) (Cucchiarini and Strik, 2018) and speech synthesis (Ding et al., 2019).
Despite significant advances, challenges remain. Levis (1999), for example, stated that "[p]resent intonational research is almost completely divorced from modern language teaching. . . " (p. 37). The lack of a guiding theory causes many issues. For example, teachers may only focus on the aspects that they are conscious of using, resulting in an overemphasis on the attitudinal and emotional aspects in the teaching of suprasegmental features (Levis, 1999). Further, while teachers may assume that successful imitation and reproduction of target intonation patterns in classrooms leads to spontaneous production, students "may walk out of the class without having accepted the system at all. Or they may think intonation is simply decorative" (Gilbert, 2014). Gilbert's observation echoes what Allen (1971) had noted half a century ago: "there is little carry-over into the students' own conversations outside the classroom and the listen and repeat approach has never yielded satisfactory long-term results" (p. 79).
One main issue that sets research and teaching apart is the lack of applicability of research theories and models.
In the past few decades, the field of L2 phonology has seen a number of highly influential theories, approaches and models including Autosegmental-Metrical phonology (AM) (Pierrehumbert, 1980), the Systemic Functional Approach (Halliday, 2015), and Discourse Intonation (Brazil, 1980) as well as the PENTA model (Xu, 2004), the Kiel intonation model (KIM) (Kohler, 1995), and the Fujisaki model (Mixdorff, 2000). While these theories and models have gained wide popularity in research and software development, direct transfer of these theories and models into classroom teaching has faced tremendous difficulties. For example, in his attempt to apply Discourse Intonation in language teaching, Chapman (2007) found that both students and teachers encountered difficulties such as identifying rising and falling tone as well as locating tone-unit boundaries and prominence. The core of this issue is that L2 phonology and L2 pronunciation teaching, albeit sharing the same underlying subject of investigation, focus on different issues and have different goals. L2 phonology analyzes speech samples in an effort to develop a theory or a model that explains underlying schema of speech production. L2 pronunciation teaching, on the other hand, focuses on the development of learners' abilities in navigating the system of intonation in spontaneous speech. Larsen-Freeman (2017) pointed out that the field of second language acquisition is dominated by an approach which "seeks to understand phenomena by taking them apart" (Larsen-Freeman, 2017, p. 22). However, it could be the case that "it is from the components and their relationships that the system we are trying to understand emerges. If we isolate components artificially, we lose the essence of the phenomena we are attempting to describe" (Larsen-Freeman, 2017, p. 29). To promote L2 speakers' spontaneous use of intonation, a systematic view of intonation is needed. 
Complexity Theory (CT), which views the relationship among phonological features and phenomena as an interrelated and dynamic system, thus is a more appropriate theoretical framework for L2 pronunciation teaching.

LITERATURE REVIEW

Intonation
Scholars have long acknowledged that English intonation relates to multiple acoustic features and perceptual phenomena. Palmer (1922), for example, stated that "all phenomena connected with this musical pitch or tone [such as word-prominence, word-group prominence, intensity, command, doubt, concession, reassurance, etc.] are designated by the term Intonation" (p. 7). Halliday (2015) argued that intonation is a system with three phonological and systemic variables: tonality, tonicity, and tone. Pierrehumbert (1980) viewed the intonation system as having three components: a grammar of phrasal tunes, a metrical representation of the text, and rules that line up the tune with the text. Over the past century, numerous scholars have defined intonation from different perspectives and with different focuses. Some scholars focused on pragmatic meaning and phrasal structure (Gussenhoven, 2004; Levis and Wichmann, 2015), others emphasized pitch patterns or tones (Kingdon, 1958; O'Connor and Arnold, 1973), and still others highlighted the emotional and attitudinal aspects (Bolinger, 1989) (see Table 1).
Despite differences in approaches and focuses, scholars generally agree that the system of intonation includes multiple suprasegmental features and is closely related to information structure, morphosyntactic structure, and pragmatic functions (Gilbert, 2014; Levis and Wichmann, 2015). Gilbert (2014), for example, stated that "[i]n English, prosodic cues serve as navigation guides to help the listener follow the intentions of the speaker. These signals communicate emphasis and make clear the relationship between ideas (new and old information) so that listeners can readily identify these relationships and understand the speaker's meaning" (p. 123). Wennerstrom (1998) proposed that there is an intonation system in English that functions at the discourse level to signal relationships in information structure and to mark interdependencies among constituents; she proposed a model in which intonation functions as a grammar of cohesion. These studies pointed out the importance of intonation and the necessity of teaching English intonation to L2 English speakers.

| Scholars | Definition | Suprasegmental features | Related variables |
| --- | --- | --- | --- |
| Ladd (2008) | "The use of suprasegmental phonetic features to convey 'postlexical' or sentence-level pragmatic meanings in a linguistically structured way" (p. 4). | Suprasegmental phonetic features | Pragmatic meanings, linguistic structure |
| Pickering (2018) | "The term intonation is narrowly defined in English as the use of pitch structure over the length of a given utterance" (p. 2). "Intonation is the grammatical system that includes our use of pitch, pause, and prominence (or sentence stress)…" (p. 3). | Pitch, pause, prominence | |
| O'Connor and Arnold (1973) | "When we talk about English intonation we mean the pitch patterns of spoken English, the speech tunes or melodies, the musical features of English" (p. 1). | Pitch, tunes/melody, musical features | |
| Kingdon (1958) | "The active elements of intonation are the Tones, which always occur in association with stresses" (p. 3). | Tones, stress | |
| Levis and Wichmann (2015) | "The use of pitch variations in the voice to communicate phrasing and discourse meaning in varied linguistic environments" (p. 139). | Pitch | Phrasing and discourse meaning |
| Gussenhoven (2004) | "Intonation is treated as the use of phonological tone for non-lexical purposes, or – to put it positively – for the expression of phrasal structure and discourse meaning" (p. 12). | Phonological tone | Phrasal structure and discourse meaning |
| Bolinger (1989) | "Intonation manages to do what it does by continuing to be what it is, primarily a symptom of how we feel about what we say, or how we feel when we say" (p. 1). | Pitch/tone, prominence, intensity | Attitudes |

primary stress (Hahn, 2004; Celce-Murcia et al., 2010). Sentence stress denotes the relative emphasis or prominence a word receives primarily by the manipulation of fundamental frequency (F0), duration, and intensity of the stressed syllable as well as the modification of vowel quality. Sentence stress plays a central role in English intonation because of its close connection with discourse meaning and information structure. It is frequently used to signal given vs. new (Pierrehumbert and Hirschberg, 1990) and contrast (Liu, 2020). It is also commonly used to make corrections (Ip and Cutler, 2016) or help listeners to anticipate what the speaker is going to say (Levis and Levis, 2018).
Sentence stress affects intelligibility, which is defined as "the extent to which a listener actually understands an utterance" (Derwing and Munro, 2005, p. 385). For example, investigating L1 English speakers' processing, comprehension, and evaluation of speech with correctly placed, misplaced, and missing sentence stress, Hahn (2004) found that the same speaker was more intelligible when sentence stress was used correctly. One of the multiple functions of intonation that impacts intelligibility is sentence stress used to show contrast. Levis and Levis (2018) argued that contrastive stress is a "high-value pronunciation feature" that should be given more attention in L2 pronunciation teaching.

L2 Sentence Stress
Prior studies investigating Mandarin-English L2 speakers' intonation showed that even advanced-level speakers face difficulties in using English sentence stress effectively (Chun, 1982; Wennerstrom, 1998; Pickering, 2001, 2004). This issue relates to both sentence stress realization and placement. Pickering (2001), for example, found that the pitch structure of Chinese international teaching assistants' speech is relatively flat and monotone compared to that of native speaker teaching assistants. Chun (1982) found that "Chinese speakers sometimes failed to place sentence stress on the appropriate word or syllable" (p. 386), pointing out the issue of stress placement.
Some scholars assumed that Mandarin-English L2 speakers' ineffective use of English intonation is due to the lexical tone system in Mandarin (Clennell, 1997). The claim is that Mandarin uses pitch variation to signal lexical tone, prohibiting it from using the same acoustic feature to indicate sentence stress. However, recent studies investigating the Mandarin sentence stress system suggested that this may be a false assumption (Xu, 1999; Chen and Gussenhoven, 2008; Kabagema-Bilan et al., 2011). Ouyang and Kaiser (2015), for instance, found that all three suprasegmental features that English uses to indicate sentence stress (fundamental frequency (F0), duration, and intensity) are also used to encode sentence stress in Mandarin. They further concluded that Mandarin uses sentence stress to indicate discourse-level information and make contrasts. Analyzing five types of focus in Mandarin and English, Ip and Cutler (2016) found that Mandarin speakers showed greater increase in pitch range and pitch level for new-information focus.
Despite similarities between English and Mandarin sentence stress, there is evidence of a lack of transfer of suprasegmental features from L1 Mandarin to L2 English. For example, comparing L1 English speakers' use of sentence stress to Mandarin-English L2 speakers' sentence stress in both English and Mandarin, Liu (2020) found that Mandarin-English L2 speakers did not use pitch to indicate English sentence stress even though they resembled L1 English speakers in the signaling of sentence stress when speaking in L1 Mandarin.
It is worth noting that the lack of transfer of similar phonological features from L1 to L2 is not a language-specific issue. For example, investigating the use of prosodic cues in German L2 English speakers' English and German speech, O'Brien et al. (2014) found that German-English L2 speakers do not transfer all prosodic uses from L1 to L2. The findings suggested that even speakers of Germanic languages that share similar morphosyntactic structure and uses of phonological cues with English do not transfer intonational cues directly from L1 to L2.
One factor that may account for the lack of transfer of phonological cues from L1 to L2 is the complex and dynamic nature of language. As Ortega-Llebaria and Colantoni (2014) stated, ". . . acquiring intonation in a L2 not only is an issue of learning to perceive and produce the target melody but, crucially, involves a new mapping between form and meaning that is affected by L1 transfer" (p. 351). The fact that intonation involves multiple acoustic cues and needs to be used with consideration of the information structure, morphosyntactic structure, and phonological phenomena may pose a significant challenge to L2 speakers. Complexity Theory (CT), which views the system of intonation as complex and dynamic, thus will be informative in L2 phonology research and L2 pronunciation teaching.

Complexity Theory
Language is a complex dynamic system that "emerges bottom-up from interactions of multiple agents in speech communities" (Larsen-Freeman, 2017, p. 49). Language is also a social endeavor constantly influenced and shaped by the interaction and accommodation among different individuals, speech communities, and communities of practice (CoP) (Gumperz, 1971; Labov, 1972; Giles et al., 1987; Lippi-Green, 1989; Holmes and Meyerhoff, 1999). In the system of intonation, two types of complexity have been identified: structural complexity and functional complexity.
Structural complexity, also referred to as inherent or absolute complexity (Housen et al., 2019), captures the intrinsic complexity of a system. Encompassing multiple interrelated and interacting variables, the system of English intonation involves structural complexity. For example, when a speaker stresses a constituent, multiple segmental and suprasegmental features (i.e., vowel quality, pitch, duration, and intensity) are manipulated. The changes in these features affect not only the use of prosodic features at the syllable and word level, but also the intonational contour of the entire intonational phrase.
Another level of complexity is derived from the dynamic relationship between intonation and other variables such as information structure, morphosyntactic structure, and phonological phenomena. We list six examples of this functional complexity. Intonation dynamically encodes information structure (e.g., new vs. old, contrast, etc.) (Pierrehumbert and Hirschberg, 1990; Hahn, 2004; Levis and Wichmann, 2015), indicates grammatical structure (Pickering, 2018), signals discourse-level meanings and implications (Levis and Wichmann, 2015; Pickering, 2018), regulates speaker-listener interactions (Wennerstrom, 2001; Hellermann, 2003), directs listeners' attention (Chun, 1988; Gilbert, 2014), and expresses emotions and attitudes (Horley et al., 2010; Pell and Kotz, 2011). Functional complexity poses challenges to L2 speakers because not only can intonation be optionally used for these functions, it is also governed and dynamically shaped by all these aspects. In this sense, using L2 intonation is not simply manipulating a collection of acoustic cues, but navigating the entire linguistic repertoire while representing and balancing the influence of numerous linguistic and non-linguistic variables in real time.

The Present Study
The present study explores the structural and functional complexity of the intonation system by comparing L1 English and Mandarin-English L2 speakers' use of acoustic cues in speech production and visual processing of contrastive and implicational information that typically receives stress in L1 English speakers' speech. The present study both serves as an example for research investigating intonation from a CT perspective and offers insights into Mandarin-English L2 speakers' use of English intonation and the dynamic mapping between L1 and L2.
In their discussion of a developing cognitive system, DeBot et al. (2007) asserted, "the system is in constant complex interaction with its environment and internal sources. Its multiple interacting components produce one or many self-organized equilibrium points, whose form and stability depend on the system's constraints" (p. 14). As applied to L2 phonological phenomena, sentence stress, situated at the nexus of information structure, morphosyntactic structure, and other linguistic and non-linguistic variables, is appropriate for investigation. Treating every speech sample as an equilibrium point, acoustic analysis reveals information about how phonological features are used. From a Complexity Theory (CT) perspective, the present study investigated L1 English and Mandarin-English L2 speakers' use of various acoustic features in signaling English sentence stress and the influence of information structure, morphosyntactic structure, and phonological phenomena.
Eye-tracking technology has been used widely in the field of second language research as a means to investigate L1 and L2 speakers' parsing of temporarily ambiguous sentences (Papadopoulou and Clahsen, 2003; Dussias and Sagarra, 2007), processing of lexical and morphological cues (Kambe et al., 2001; Lew-Williams and Fernald, 2010), and processing of spoken language (Tanenhaus, 2007; Ito and Speer, 2008). Researchers found "systematic relations between fixation duration and the characteristic of the fixated words" (Dussias, 2010, p. 150), providing information about the incremental processing of sentence comprehension. Using eye-tracking technology, this study explored real-time processing of lexical items (1) that are contrastive or non-contrastive, (2) at different positions within written sentences, (3) that do or do not carry implications, and (4) that appear with or without visual enhancement. The guiding research questions are:
• How do L1 and Mandarin-English L2 speakers use intonational features (i.e., pitch range, pitch level, duration, and intensity) to signal contrast and implication in speech production?
• How do L1 and Mandarin-English L2 speakers use a phonological feature (maximum pitch level) influenced by information structure (contrast), morphosyntactic structure (sentence boundaries), and phonological phenomena (declination)?
• How do L1 and Mandarin-English L2 speakers orally produce and visually process lexical items and pictorial information that are contrastive or implicational and that typically receive stress in L1 speakers' speech?
• How do L1 and Mandarin-English L2 speakers orally produce and visually process contrastive information written with and without visual enhancement (italics and bold)?

Participants
Ten subjects participated in the study. Five were L1 English speakers and five were Mandarin-English L2 speakers enrolled in degree programs at a university in the US. Participants' biographical information is summarized in Tables 2 and 3.

Procedure
The study was conducted in a Language Acquisition and Visual Attention lab at a large research university in the northeast. Participants were seated in front of a PC connected to an EyeLink 1000 Plus eye-tracker. All participants went through a calibration process. Then, the participants were presented with a scenario with pictures contextualizing the first experiment.
In the first experiment, participants saw 18 sentences on the computer screen in random order. Participants saw one sentence at a time and each sentence was presented in a single line. Participants were asked to first read the sentence silently. Then, when they were ready, they read the sentence aloud. In the second experiment, the participants saw three sets of sentences in random order. Each set contained two sentences with the same wording but different meanings or implications, which were presented in parentheses. Participants were asked to first read silently and then read aloud the sentences to express the meanings or implications included in the parentheses. In the third experiment, participants read silently and then read aloud two short passages with contrastive information. In the fourth experiment, participants saw an eight-frame picture cartoon that describes a single story. They were asked to look at the pictures silently and prepare to tell the story. Then, they were asked to tell the story in their own words. The experiment materials and data analysis are further discussed within each experiment in the following sections.
While participants silently read the sentences and passages (in experiments I-III) and viewed the pictures (in experiment IV), the eye-tracker documented the fixation count, fixation percentage, dwell time, dwell percentage, run count, and regression of the Areas of Interest (AOI), which were the contrastive/implicational information or equivalent places in the distractors. The eye-tracker was recalibrated before each experiment. When the participants read aloud the information, they were audio-recorded using both an audio recording application software (Audacity) and a handheld recorder (Zoom H4N).
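As an illustration of how the AOI measures named above relate to one another, the following is a minimal sketch that derives them from a list of fixations in a single trial. It is not the EyeLink Data Viewer implementation: the `Fixation` class, the `aoi_measures` function, and the sample trial are hypothetical, and the sketch assumes that trial dwell time equals the summed duration of all fixations in the trial and that a run is a stretch of consecutive fixations inside the AOI. Regressions are omitted because they require word-order information.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Fixation:
    """One fixation in a trial (hypothetical structure, for illustration)."""
    start_ms: int   # fixation onset within the trial
    end_ms: int     # fixation offset within the trial
    in_aoi: bool    # True if the fixation falls inside the area of interest

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def aoi_measures(fixations: List[Fixation]) -> Dict[str, float]:
    """Compute AOI measures for one trial from its ordered fixations."""
    total_count = len(fixations)
    # Assumption: trial dwell time is the summed duration of all fixations.
    total_time = sum(f.duration_ms for f in fixations)
    aoi = [f for f in fixations if f.in_aoi]
    dwell_time = sum(f.duration_ms for f in aoi)

    # A run starts whenever gaze enters the AOI from outside it.
    run_count = 0
    previously_inside = False
    for f in fixations:
        if f.in_aoi and not previously_inside:
            run_count += 1
        previously_inside = f.in_aoi

    return {
        "fixation_count": len(aoi),
        "fixation_percentage": 100.0 * len(aoi) / total_count if total_count else 0.0,
        "dwell_time_ms": dwell_time,
        "dwell_percentage": 100.0 * dwell_time / total_time if total_time else 0.0,
        "run_count": run_count,
    }


# Made-up trial: gaze enters the AOI twice.
trial = [
    Fixation(0, 200, False),
    Fixation(220, 420, True),
    Fixation(440, 540, True),
    Fixation(560, 760, False),
    Fixation(780, 880, True),
]
print(aoi_measures(trial))
```

Percentages rather than raw counts make trials of different lengths comparable, which may be why fixation percentage and dwell percentage are the measures entered into the later processing models.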

Data Elicitation and Analysis
Participants' fixation count (total number of fixations within the interest area), fixation percentage (percentage of total fixations in a trial falling within the current interest area), dwell time (total time, in milliseconds, spent on the current interest area), dwell percentage (percentage of trial dwell time spent on the current interest area), run count (number of times the interest area was entered and left), and regression on the focused areas were documented using the eye-tracker (definitions taken from the EyeLink Data Viewer User's Manual, Version 1.11.900, pp. 39-40). Participants' speech data were analyzed using Praat version 6.0.37 (Boersma and Weenink, 2018). Maximum pitch level, minimum pitch level, and pitch range in Hertz and in semitones relative to 1 Hz, as well as the duration of words in seconds and the intensity of words in decibels (dB), were extracted using a Praat script. Participants' pitch in Hertz and pitch in semitones were normalized based on each participant's average pitch level. We used R (R Core Team, 2017) and lme4 (Bates et al., 2015) to analyze the speech and processing data. Linear mixed-effects models were constructed for the statistical analyses; p-values were obtained by likelihood ratio tests.
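The semitone scale mentioned above follows the standard conversion of 12 semitones per doubling of frequency. The sketch below shows that conversion with a 1 Hz reference, together with one possible speaker normalization. The function names are hypothetical, and subtracting each speaker's mean semitone value is only an assumption: the text does not specify how the normalization was carried out.

```python
import math
from typing import List


def hz_to_semitones(f_hz: float, ref_hz: float = 1.0) -> float:
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12.0 * math.log2(f_hz / ref_hz)


def pitch_range_semitones(max_hz: float, min_hz: float) -> float:
    """Pitch range of a word: maximum minus minimum, in semitones."""
    return hz_to_semitones(max_hz) - hz_to_semitones(min_hz)


def normalize_to_speaker_mean(values_st: List[float]) -> List[float]:
    """Center a speaker's semitone values on that speaker's mean.

    Assumption: normalization by subtraction of the speaker mean; the
    study's own procedure may differ.
    """
    mean_st = sum(values_st) / len(values_st)
    return [v - mean_st for v in values_st]


# Made-up maximum pitch values (Hz) for one hypothetical speaker.
speaker_max_hz = [180.0, 220.0, 260.0]
speaker_max_st = [hz_to_semitones(f) for f in speaker_max_hz]
print(normalize_to_speaker_mean(speaker_max_st))
```

Because the semitone scale is logarithmic, a pitch range expressed in semitones does not depend on the reference frequency, which makes ranges comparable across speakers with different average pitch.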

EXPERIMENT 1
In Experiment 1, we investigated L1 and Mandarin-English L2 speakers' production and processing of sentence stress using a contextualized sentence read aloud task. The scenario we used was a Christmas Tree decoration task adapted from Ito and Speer (2008). The participants were told that a friend of theirs, Martin, is decorating a Christmas tree using items from an ornament board. Martin is trying to select two of the ornaments at a time and the participants were tasked with telling him which ones to hang. In the contextualizing process, the participants were asked to tell Martin what to hang based on circled items on pictures of the ornament board. Then, after participants understood the scenario, they were asked to complete the task using 18 written sentence prompts.
The sentences belong to six sentence sets. Each set has three sentences: one sentence with contrastive information presented not at phrasal boundaries, one sentence with contrastive information presented at phrasal boundaries, and a distractor that does not include contrastive information (see Appendix A for a complete list of sentences).
• First, hang the blue drum, then hang the yellow drum ("blue" and "yellow" are contrastive and not at phrasal boundaries).
• First, hang the blue drum, then hang the blue ball ("drum" and "ball" are contrastive and at phrasal boundaries).
• First, hang the blue drum, then hang the pink ball (distractor: no contrastive information in the sentence).

All sentences were scrambled and presented to the participants in random order. For each sentence, the participants were asked to read the sentence silently to themselves first, and then produce the sentence as if they were providing Martin directions on which items to hang on the Christmas tree. Our first question was: (1) How do L1 and Mandarin-English L2 speakers use intonational features (i.e., pitch range, pitch level, duration, and intensity) to signal contrast and implication in speech production?
To answer this question, we analyzed speech data using linear mixed-effects models that predicted normalized pitch range, normalized maximum pitch level, duration, and intensity. We used L1 (English vs. Mandarin), contrast (contrastive vs. non-contrastive), and the interaction between L1 and contrast as the fixed effects. Individual participants, sentences, and words were included in the model as the random effects. The results showed that in L1 English speakers' speech, there were statistically significant differences between the contrastive and non-contrastive information regarding the use of pitch range (estimate = 1.136, SE = 0.107, df = 445.824, t = 10.609, ***p < 0.0001), maximum pitch level (estimate = 0.77, SE = 0.111, df = 455.753, t = 6.932, ***p < 0.0001), and duration (estimate = 0.112, SE = 0.012, df = 460.980, t = 9.284, ***p < 0.0001). However, there was no significant difference in the intensity of contrastive and non-contrastive information (estimate = 0.243, SE = 0.382, df = 457.689, t = 0.635, p = 0.526).
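For readers checking the coefficients reported above, a t statistic for a fixed effect is the estimate divided by its standard error. The sketch below recomputes it from the rounded values in the text; small discrepancies from the reported t values are expected because the published estimates and standard errors are rounded. The function name `wald_t` is hypothetical.

```python
def wald_t(estimate: float, se: float) -> float:
    """t statistic of a coefficient: estimate divided by its standard error."""
    return estimate / se


# (measure, estimate, SE, reported t) as given in the text for L1 English speakers.
reported = [
    ("pitch range", 1.136, 0.107, 10.609),
    ("maximum pitch level", 0.77, 0.111, 6.932),
    ("duration", 0.112, 0.012, 9.284),
    ("intensity", 0.243, 0.382, 0.635),
]

for measure, estimate, se, t_reported in reported:
    t = wald_t(estimate, se)
    print(f"{measure}: recomputed t = {t:.3f}, reported t = {t_reported}")
```

The same relation implies that the sign of t must match the sign of the estimate, which is a quick consistency check on any reported coefficient.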
The results showed differences in the acoustic cues that L1 and Mandarin-English L2 speakers used to signal contrastive information. L1 speakers used pitch and duration to indicate contrastive stress. Mandarin-English L2 speakers, by contrast, did not signal contrast using pitch or duration. Their use of greater intensity may have been intended to signal contrastive information.
Our second research question was: (2) How do L1 and Mandarin-English L2 speakers use a phonological feature (maximum pitch level) influenced by information structure (contrast), morphosyntactic structure (sentence boundaries), and phonological phenomena (declination)?
To answer question (2), we analyzed participants' use of maximum pitch level of contrastive and non-contrastive information at different positions within sentences (see Table 4). Specifically, we explored how L1 and Mandarin-English L2 speakers' pitch level is affected by (1) information structure (contrastive vs. non-contrastive), (2) morphosyntactic structure (at sentence boundary vs. not at sentence boundary), and (3) phonological phenomena (declination). We used the position of the words, contrast (contrastive vs. non-contrastive), and the interaction between position and contrast as the fixed effects to predict normalized maximum pitch level. Individual participants, sentences, and words were entered into the model as random effects. In L1 English speakers' speech, there was a significant difference between the maximum pitch level of the contrastive and non-contrastive information (estimate = 0.632, SE = 0.255, df = 222.926, t = 2.476, *p = 0.014). We also found differences among positions. Compared to the words in position 1, words in position 3 had significantly lower pitch level (estimate = −0.882, SE = 0.255, df = 222.926, t = −3.457, ***p = 0.0006). There was also a significant difference between words in position 1 and words in position 4 (estimate = −1.148, SE = 0.262, df = 223.083, t = −4.379, ***p < 0.0001). The position differences suggested the effect of declination. There were no significant differences between words in position 1 and words in position 2 (estimate = 0.256, SE = 0.255, df = 222.926, t = 1.005, p = 0.316), suggesting an influence of an H- (high) boundary tone at the phonological boundary (Pierrehumbert and Hirschberg, 1990) used to signal non-finality.

FIGURE 1 | Speech production of L1 English and Mandarin-English L2 speakers' contrastive and non-contrastive stress.

Frontiers in Communication | www.frontiersin.org
These results showed that in L1 English speakers' speech, one individual intonational cue, maximum pitch level, is affected by multiple variables including information structure (i.e., contrast), morphosyntactic structure (i.e., phrasal boundary), and phonological phenomena (i.e., declination). Although Mandarin-English L2 speakers' maximum pitch level reflected potential influence of declination, L2 speakers did not show schema that reflect functional complexity at a level comparable to L1 speakers.
The third and fourth research questions were: (3) How do L1 and Mandarin-English L2 speakers orally produce and visually process lexical items and pictorial information that are contrastive or implicational and that typically receive stress in L1 speakers' speech? (4) How do L1 and Mandarin-English L2 speakers orally produce and visually process contrastive information written with and without visual enhancement (italics and bold)?
To answer questions (3) and (4), we analyzed the subset of sentences with visual enhancement (i.e., bold and italics) signaling the contrastive information (two sets with italics and two sets with bold). We hypothesized that the orthographical conventions used in English may have different implications or functions in the Chinese logographic writing system (Mair, 1996, p. 200). These differences may affect the processing of written information and the production of contrastive information.

Sample sentence set with italicized words
• First, hang the yellow tree, then hang the white tree ("yellow" and "white" are contrastive, signaled by italics, and not at sentence boundaries).
• First, hang the yellow tree, then hang the yellow star ("tree" and "star" are contrastive, signaled by italics, and at sentence boundaries).
• First, hang the yellow tree, then hang the white bell (no contrastive information, no visual enhancement).

Sample sentence set with words in bold
• First, hang the green egg, then hang the brown egg ("green" and "brown" are contrastive, signaled by bold, and not at sentence boundaries).
• First, hang the green egg, then hang the green sock ("egg" and "sock" are contrastive, signaled by bold, and at sentence boundaries).
• First, hang the green egg, then hang the orange tree (no contrastive information, no visual enhancement).
Participants' pitch range was analyzed using L1, visual enhancement (no visual enhancement, italics, and bold), and contrast (contrastive vs. non-contrastive) as the fixed effects and individual participants, sentences, and words as the random effects. The results show that L1 English speakers used a significantly greater pitch range to signal contrastive information regardless of the use of visual enhancement (estimate = 1.04, SE = 0.2, df = 216.14, t = 5.09, ***p < 0.0001). Mandarin-English L2 speakers, on the other hand, did not use pitch range to signal contrastive information even when the information was enhanced by bold or italics (estimate = −0.1, SE = 0.17, df = 223.56, t = −0.62, p = 0.54) (see Figure 3). We then investigated whether there were processing differences, as indicated by the fixation percentage and dwell percentage on the area of interest (AOI), using L1, visual enhancement, and the interaction between these two factors as the fixed effects, and individual participants, sentences, and words as the random effects in our models. The results suggest that there were no statistically significant differences between L1 and Mandarin-English L2 speakers in fixation percentage (estimate = 2.31, SE = 2.26, df = 20.19, t = 1.024, p = 0.32) or dwell percentage (estimate = 2.82, SE = 2.32, df = 25.75, t = 1.217, p = 0.235). Also, whether the contrastive information was visually enhanced by italics or bold did not lead to significant processing differences.
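To make the two processing measures concrete, the following minimal sketch shows one common way to compute fixation percentage and dwell percentage for a rectangular AOI. It assumes that fixation percentage is the share of fixations landing in the AOI and that dwell percentage is the share of total fixation time spent there. The section above does not restate the exact definitions used by the eye-tracking software, so these definitions, the `Fixation` and `AOI` classes, and all coordinates and durations are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    """One fixation from a hypothetical eye-tracker export."""
    x: float            # horizontal gaze position in pixels
    y: float            # vertical gaze position in pixels
    duration_ms: float  # fixation duration in milliseconds


@dataclass
class AOI:
    """Rectangular area of interest in screen pixels (hypothetical layout)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, fix: Fixation) -> bool:
        return self.left <= fix.x <= self.right and self.top <= fix.y <= self.bottom


def fixation_percentage(fixations, aoi):
    """Share of fixations (by count) that land inside the AOI, in percent."""
    if not fixations:
        return 0.0
    inside = sum(1 for f in fixations if aoi.contains(f))
    return 100.0 * inside / len(fixations)


def dwell_percentage(fixations, aoi):
    """Share of total fixation time spent inside the AOI, in percent."""
    total = sum(f.duration_ms for f in fixations)
    if total == 0:
        return 0.0
    inside = sum(f.duration_ms for f in fixations if aoi.contains(f))
    return 100.0 * inside / total


# Invented trial: four fixations, one of which falls on the target word's AOI.
trial = [
    Fixation(100, 200, 180),
    Fixation(320, 205, 300),  # inside the AOI
    Fixation(500, 210, 220),
    Fixation(640, 198, 100),
]
word_aoi = AOI(left=300, top=180, right=380, bottom=230)

print(fixation_percentage(trial, word_aoi))
print(dwell_percentage(trial, word_aoi))
```

In this invented trial a single long fixation gives a dwell percentage higher than the fixation percentage, which is why the two measures are reported separately. Dwell time in some software also includes saccades within the AOI, which this simplification leaves out.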

EXPERIMENT 2
The second experiment was a read-aloud task. Participants were given three sets of sentences (adapted from POSE-test). For each set of sentences, two different implications were given in parentheses, each of which could be signaled by changing which constituent is stressed in the same sentence (for a complete list of sentences used in Experiment 2, refer to Appendix B).
• My brother is a doctor (not my sister).
• My brother is a doctor (not a teacher).
The participants were asked to read the sentences silently first, and then to read each sentence aloud to express the implication in the parentheses. While they were reading the sentences silently, their fixation and dwell times and percentages were measured using an eye-tracker. When reading aloud, participants were asked not to read the information in the parentheses, and the sentences were recorded. The research questions we asked in Experiment 2 were: (1) How do L1 and Mandarin-English L2 speakers use intonational features (i.e., pitch range, pitch level, duration, and intensity) to signal contrast and implication in speech production? (3) How do L1 and Mandarin-English L2 speakers orally produce and visually process lexical items and pictorial information that are contrastive or implicational and that typically receive stress in L1 speakers' speech?
Linear mixed-effects models were constructed with L1, implication, and the interaction between L1 and implication as the fixed effects, and individual participants, sentences, and words as the random effects. The results show that L1 speakers used a significantly greater pitch range (estimate = 1.2515, SE = 0.2687, df = 101.0587, t = 4.658, ****p < 0.0001), higher maximum pitch level (estimate = 0.61312, SE = 0.20432, df = 100.838, t = 3.001, **p = 0.003), and longer duration (estimate = 0.09075, SE = 0.02539, df = 102.385, t = 3.574, ***p = 0.0005) to express the information related to the implication. There was no significant difference in the intensity that L1 speakers used to encode implicational and non-implicational information. For Mandarin-English L2 speakers, there were no significant differences between the implicational and non-implicational conditions for any of the prosodic features we analyzed. In terms of the processing of implicational information, we fitted linear mixed-effects models with language as the fixed effect to predict both the fixation percentage and the dwell percentage. The results showed that there were no significant differences between L1 and Mandarin-English L2 speakers' processing of the information related to the implication, as indicated by their fixation percentage (estimate = 1.880, SE = 3.998, df = 7.895, t = 0.470, p = 0.6508) and dwell percentage (estimate = 3.344, SE = 4.920, df = 7.918, t = 0.680, p = 0.516).
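The production model described above can be written out as follows. This is a sketch of the specification as we read it from the description. It assumes random intercepts only, since the text does not say whether random slopes were included. The symbols ($y$, $\beta$, $u$, $v$, $w$) are our notation and do not appear in the original analysis.

```latex
% y_{ijk}: acoustic measure (e.g., normalized pitch range) for participant i,
%          sentence j, word k
% L1_i:    speaker group (L1 English vs. Mandarin-English L2)
% Imp_{jk}: whether the word carries the implication
\begin{aligned}
y_{ijk} &= \beta_0 + \beta_1\,\mathrm{L1}_i + \beta_2\,\mathrm{Imp}_{jk}
          + \beta_3\,(\mathrm{L1}_i \times \mathrm{Imp}_{jk})
          + u_i + v_j + w_k + \varepsilon_{ijk}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \quad
v_j \sim \mathcal{N}(0, \sigma_v^2), \quad
w_k \sim \mathcal{N}(0, \sigma_w^2), \quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

Under this reading, $\beta_3$ captures whether the effect of implication on the acoustic measure differs between the two speaker groups, while $u_i$, $v_j$, and $w_k$ absorb participant-, sentence-, and word-level variability.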

EXPERIMENT 3
We used two passage read-aloud tasks to further investigate L1 and Mandarin-English L2 speakers' production and processing of contrastive information with and without visual enhancement in less predictable text structures. The first passage was adapted from a New York Times article. The contrastive information in this passage was signaled using italics. The lexical items analyzed were: "must," "may," "allows," and "requires."

There are roughly 6,000 languages in the world. Are they mostly the same or are they different from each other? Fifty years ago, a famous linguist pointed out a crucial fact about differences among languages. He said, "Languages differ essentially in what they must convey and not in what they may convey." What this means is this: if different languages influence our minds in different ways, this is not because of what our language allows us to think but rather because of what it habitually requires us to think about.
The second passage we presented was adapted from Hahn (2004). This passage has contrastive information not signaled by any orthographic symbols. The lexical items analyzed were: "personal (1)," "group (1)," "group (2)," and "personal (2)."

I will start by defining the topic for today, which is individualism and collectivism. Individualism concerns the placing of personal goals ahead of group goals. And collectivism concerns placing group goals ahead of personal goals. So let's suppose you have a conflict at work about break time. Let's say your co-workers want longer breaks, but you want shorter breaks. If you're a collectivist, you'll give in to the group. But if you're an individualist, you'll go against the group.
Participants were directed to read each of these passages silently first, during which an eye-tracker was used to measure their processing of the contrastive lexical items where the areas of interest (AOI) were set. Then the participants were asked to read the two passages aloud in a natural way. The results answer questions (3) and (4) at the passage level: (3) How do L1 and Mandarin-English L2 speakers orally produce and visually process lexical items and pictorial information that are contrastive or implicational and that typically receive stress in L1 speakers' speech? (4) How do L1 and Mandarin-English L2 speakers orally produce and visually process contrastive information written with and without visual enhancement (italics and bold)?
Experiment 3 differs from Experiments 1 and 2 in that all analyzed lexical items are contrastive. Thus, we compared L1 and Mandarin-English L2 speakers' use of intonational cues in signaling the contrastive information, rather than comparing how each group signaled contrastive as opposed to non-contrastive information. We used L1, visual enhancement, and the interaction between these two factors as the fixed effects, and individual participants, sentences, and words as the random effects, to predict normalized pitch range, normalized maximum pitch level, duration, and intensity in the analyses of speech production. We found that L1 English speakers used a greater pitch range to signal the contrastive information compared to Mandarin-English L2 speakers both when the contrastive information was visually enhanced (estimate = −1.423, SE = 0.305, df = 20.835, t = −4.668, ***p = 0.0001) and when it was not (estimate = −0.9547, SE = 0.3048, df = 20.8352, t = −3.132, *p = 0.005). L1 English speakers also used longer duration in indicating the contrastive information compared to Mandarin-English L2 speakers when italics were used (estimate = −0.131, SE = 0.036, df = 14.724, t = −3.658, **p = 0.0024). However, there was no significant difference in the two groups' use of duration when the contrastive information was not visually enhanced (estimate = −0.059, SE = 0.036, df = 14.724, t = −1.66, p = 0.118). There was no significant difference between L1 and Mandarin-English L2 speakers in the maximum pitch level or intensity with or without visual enhancement.
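Because speakers differ in their habitual pitch register, pitch measures are typically put on a speaker-relative scale before groups are compared. The sketch below illustrates one common option, z-scoring each speaker's values against that speaker's own mean and standard deviation. This is an assumption for illustration only. This section does not restate the study's normalization procedure, and the function name, speaker labels, and pitch values are invented.

```python
from statistics import mean, stdev


def z_normalize(values):
    """Return z-scores of one speaker's values (using the sample standard deviation)."""
    if len(values) < 2:
        raise ValueError("need at least two measurements per speaker")
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        # All values identical: every measurement sits at the speaker's mean.
        return [0.0 for _ in values]
    return [(v - centre) / spread for v in values]


# Invented maximum-pitch measurements in Hz for two hypothetical speakers
# with different pitch registers.
max_pitch_hz = {
    "speaker_A": [180.0, 220.0, 200.0],
    "speaker_B": [110.0, 130.0, 120.0],
}

normalized = {spk: z_normalize(vals) for spk, vals in max_pitch_hz.items()}

for spk, zs in normalized.items():
    print(spk, [round(z, 2) for z in zs])
```

Although the two invented speakers differ by roughly 80 Hz in register, their normalized values coincide, which is the property that makes between-speaker and between-group comparisons meaningful.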
In terms of the processing of contrastive information in passages, when contrastive information was not visually enhanced, Mandarin-English L2 speakers did not fixate on the contrastive information at a percentage comparable to the L1 English speakers. When italics were used to signal contrastive information in passages, there was no difference between the two groups.

The production data support findings in Experiments 1 and 2 by showing differences in the use of pitch range and maximum pitch level by L1 English and Mandarin-English L2 speakers in speech production of contrastive information. The processing data suggested that L1 English and Mandarin-English L2 speakers differ in the processing of contrastive information at the passage level. These findings suggested that Mandarin-English L2 speakers may face challenges processing information with rich contextual information and less predictable text structure. Further, the use of visual enhancement may facilitate visual processing of contrastive information at the passage level.

EXPERIMENT 4
In Experiment 4, we investigated L1 and Mandarin-English L2 speakers' processing and production of contrastive information in a picture narrative task. In this task, participants were given a set of eight pictures that depict a single story. Participants were asked to spend a couple of minutes looking at the pictures and then describe the story in English. This picture narrative task was adapted from Derwing et al. (2004) to elicit extemporaneous speech from speakers (see Appendix C). The research question that Experiment 4 answered was: (3) How do L1 and Mandarin-English L2 speakers orally produce and visually process lexical items and pictorial information that are contrastive or implicational and that typically receive stress in L1 speakers' speech?
Two researchers listened to the recordings and identified the stressed words with the assistance of the speech analysis software Praat. The researchers found that L1 speakers used intonational cues to signal contrastive information. The following is a transcription of the story told by an L1 speaker, with the stressed information in capital letters.

"Two people are walking around in a big city. One man is leaving the building and one woman is ENTERING the building. But they bumped into each other at the entrance and they both dropped. . . their bags. Um. . . so they picked it up and walked away. And when the man gets home, he realizes that he has WOMAN's bag. And when the woman gets to work, she realizes that she has the man's bag." (L1 participant #1)

We found that when telling the story, some Mandarin-English L2 speakers did not specify as much contrastive information in their speech as the L1 speakers did. The transcript below illustrates this tendency.

"In an apartment, um. . . um. . . a woman and a man run into each, each other. They took the package. They, they took the package. They make, make a mistake. So, when they return home, they open the package and found they make a mistake." (Mandarin-English L2 participant #3)

We also found that, in some cases, Mandarin-English L2 speakers included contrastive information in the story but did not use intonational cues to signal the contrastive information.

"It's a big city, a woman and a man are walking on the street and carrying the same suitcases. And they crushed each other at the corner of the street, and the suitcases dropped on the ground. They stand up and pick up their suitcases and walked away. But when the man got home, he found a dress in the suitcase, so he actually got the woman's suitcase. And the woman is in. . . When the woman got home, she found a tie in her suitcase." (Mandarin-English L2 participant #4)

We used heatmaps to show the areas that the participants paid attention to, and noted differences. We found that L1 speakers paid attention to the details of the pictures in a more holistic way, especially when contrastive information was presented. For instance, in the top two frames in Figure 4, L1 speakers spent longer gazing at both the man and the woman as well as the items they found in the suitcases. The Mandarin-English L2 speakers, however, paid less attention to the details and did not attend to the contrastive information. For example, in the bottom two frames in Figure 4, one Mandarin-English L2 speaker focused on the man and the item held by the woman, and another Mandarin-English L2 speaker focused on the woman and the item held by the man.

DISCUSSION
Analyzing L1 English and Mandarin-English L2 speakers' speech production and processing of written and pictorial information from a Complexity Theory (CT) perspective, the present study found that sentence stress is a multidimensional and multifaceted feature dynamically connected with a series of linguistic and nonlinguistic variables. To promote spontaneous use of intonational features like sentence stress, a systematic view that takes into consideration both structural and functional complexity is needed. In Experiment 1 and Experiment 2, L1 English speakers used a collection of intonational cues, including pitch range, maximum pitch level, and duration, to signal contrastive and implicational information in sentences. Mandarin-English L2 speakers did not show differences in the use of any of these acoustic features to encode information structure. However, Mandarin-English L2 speakers used greater intensity when producing contrastive information in Experiment 1. The findings in Experiment 3 also showed that, compared to Mandarin-English L2 speakers, L1 English speakers used greater pitch range and higher maximum pitch level when signaling contrastive information in passages. Altogether, these findings suggested that Mandarin-English L2 speakers have difficulties in the integrated use of multiple acoustic cues in signaling contrastive or implicational information. Mandarin-English L2 speakers may, as in Experiment 1, manipulate one acoustic cue (intensity) in an attempt to signal stress. However, intensity was an intonational feature that L1 speakers did not rely on when signaling sentence stress in the same context. These findings suggested the need to address the structural complexity of sentence stress.
To address the structural complexity of the system of intonation, the relationship between different acoustic features needs to be clarified. Intonation teaching will benefit from helping learners to understand intonation as a complex system encompassing multiple interacting features. In addition to chapters focusing on individual features, textbook chapters that adopt a systematic view of intonation and that summarize the relationship among different suprasegmental features will help teachers and learners to develop a systematic view of intonation. Teacher training and preparation programs would do well to also focus more on the systematic use of intonation features and the complexity within the system of intonation.
The results of Experiment 1 showed that L1 English speakers' use of a single intonational cue (i.e., maximum pitch level) is affected by multiple interrelated variables and phenomena, including information structure (contrast), morphosyntactic structure (phrasal boundary), and a phonological phenomenon (declination). Mandarin-English L2 speakers' speech, while still showing a general declination trend, did not reflect the influence of information structure or morphosyntactic structure. The results of Experiment 2 further demonstrated that L1 English speakers use intonational features to highlight the lexical item associated with the meanings and implications of the sentences, whereas Mandarin-English L2 speakers did not signal implicational information with acoustic cues. These results suggested that learners either were unaware of the connection among intonation, meaning or implication, and information structure, or were incapable of navigating the system of intonation while taking into consideration all influential variables. Processing limitations may be a crucial factor. As O'Brien and Féry (2015) commented about L2 speakers' use of information structure, "[a]n appeal to processing limitations might predict that L2 learners, regardless of their L1s, may have difficulty coordinating all of the potential cues at their disposal when producing structures at the interface. . . it may be that L2 learners rely on a particular default strategy (e.g., making use of a single syntactic or phonological structure or the same article, regardless of discourse status) as the result of their being unable to integrate all of the types of information in real time" (p. 405). Thus, promoting L2 speakers' awareness of the functions of intonation and their ability to coordinate different segmental and suprasegmental cues for pragmatic purposes may help address the functional complexity of intonation and facilitate more target-like spontaneous use of English intonation.
In Experiments 1 and 2, we did not find significant differences in L1 English and Mandarin-English L2 speakers' processing of contrastive or implicational information as indicated by their fixation percentage and dwell percentage on the focused lexical items. These results support Ip and Cutler's (2016) statement that "Information structure is a linguistic universal" (p. 330). These findings also suggested that when the structure of sentences is relatively stable and predictable, L1 English and Mandarin-English L2 speakers did not differ in their processing of written information. However, in Experiment 3, we found that Mandarin-English L2 speakers' fixation percentage on the contrastive information was significantly lower than that of the L1 speakers when no visual enhancement was used to indicate the information in a passage. In Experiment 4, we found differences in the processing of the contrastive information in the pictures: L1 speakers focused more holistically on the contrastive information, while Mandarin-English L2 speakers gazed at information that was not contrastive. These results suggested that L1 English and Mandarin-English L2 speakers differ in visual processing when the information structure and morphosyntactic structure are more complex and less predictable. These results further supported O'Brien and Féry's (2015) hypothesis that processing limitations may be the crucial factor in the processing and production of L2 intonation.
In Experiments 1 and 3, Mandarin-English L2 speakers did not use intonational cues in their speech to signal contrastive information even when the information was cued in text by visual enhancement such as italics and bold. The results suggested a lack of familiarity with the L2 orthographical conventions and with the connection between visually enhanced information (italicized words) and speech production (sentence stress). Pronunciation textbooks use visual enhancements (e.g., italics) for pedagogical purposes (e.g., to indicate the placement of sentence stress). However, the use of visual enhancement such as italics in authentic materials is often a conscious choice that authors make to convey their intent (e.g., contrast, implication, etc.). When teaching pronunciation using pedagogical materials with visual enhancements, teachers need to explicitly point out the three-way connection among the constituents that are visually enhanced, the functions and rationale for the use of visual enhancement, and the role of intonation in conveying the meanings and functions in speech production.

CONCLUSION
This study found that the structural and functional complexity of the system of intonation poses challenges to Mandarin-English L2 speakers. Complexity Theory (CT), which emphasizes the connection and interaction of interwoven variables within intonation, is an appropriate framework for L2 pronunciation research and teaching. A systematic view that highlights the complex and dynamic nature of intonation is recommended. Future studies researching the processing limitations of L2 speakers are needed. Further research investigating the dynamic mapping between L1 and L2 intonation is also recommended.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Boston University Institutional Review Board. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
DL and MR jointly identified the theoretical underpinning, designed, and carried out the experiments in the study. DL analyzed the data with the support from MR. DL and MR jointly wrote the manuscript. All authors contributed to the article and approved the submitted version.