Discovering Words in Fluent Speech: The Contribution of Two Kinds of Statistical Information

Thiessen, Erik D; Erickson, Lucy C

doi:10.3389/fpsyg.2012.00590

ORIGINAL RESEARCH article

Front. Psychol., 17 January 2013

Sec. Psychology of Language

Volume 3 - 2012 | https://doi.org/10.3389/fpsyg.2012.00590

This article is part of the Research Topic: The naïve language expert: How infants discover units and regularities in speech

Erik D. Thiessen*

Lucy C. Erickson

Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, USA

To efficiently segment fluent speech, infants must discover the predominant phonological form of words in the native language. In English, for example, content words typically begin with a stressed syllable. To discover this regularity, infants need to identify a set of words. We propose that statistical learning plays two roles in this process. First, it provides a cue that allows infants to segment words from fluent speech, even without language-specific phonological knowledge. Second, once infants have identified a set of lexical forms, they can learn from the distribution of acoustic features across those word forms. The current experiments demonstrate both processes are available to 5-month-old infants. This demonstration of sensitivity to statistical structure in speech, weighted more heavily than phonological cues to segmentation at an early age, is consistent with theoretical accounts that claim statistical learning plays a role in helping infants to adapt to the structure of their native language from very early in life.

Introduction

The ability to segment words from fluent speech is taken for granted by adults, but it represents a major accomplishment for infants. Unlike the white spaces between words on the written page, pauses do not consistently mark word boundaries in fluent speech. This is not troublesome for adults, who can identify word boundaries in large part due to their familiarity with the word forms in their native language (e.g., Nazzi et al., 2005; Norris and McQueen, 2008). Infants, though, begin the task of word segmentation unable to take advantage of familiar word forms. The challenge faced by infants is comparable to the task faced by adults attempting to identify words spoken in a foreign language. Nevertheless, infants succeed in this task before they have amassed a large lexicon of familiar word forms (e.g., Jusczyk and Aslin, 1995; Bortfeld et al., 2005). Two cues have been suggested to play a role in infants’ earliest ability to segment words from fluent speech: conditional statistical information, and information about the prosodic structure of words (Thiessen and Saffran, 2003). These cues are likely to work together in natural languages, but an open developmental question is which is available to infants earlier in development. In this series of experiments, we will examine the hypothesis that sensitivity to conditional structure is available from an earlier age, and that statistical learning helps infants discover the predominant prosodic structure of words in their native language.

There is no doubt that information about the prosodic structure of words plays a role in infants’ and adults’ word segmentation. The difference between stressed and unstressed syllables is perceptually available to infants from a young age (e.g., Jusczyk and Thompson, 1978; Weber et al., 2005). To the extent that stressed and unstressed syllables systematically occur in particular word positions, this distinction can serve as a cue to word boundaries. In English, for example, most bisyllabic content words follow a trochaic pattern: they begin with a stressed syllable, which is followed by an unstressed syllable (Cutler and Carter, 1987). English-learning infants prefer to listen to trochaic words over words with a weak-strong (iambic) pattern (Jusczyk et al., 1993). When exposed to a stream of syllables, English-learning infants and English-speaking adults treat the stressed syllables as word onsets (e.g., Cutler and Norris, 1988; Echols et al., 1997; Jusczyk et al., 1999). Importantly, though, not all languages show this trochaic predominance; lexical items in other languages may be predominantly iambic. Therefore, English learners’ trochaic bias is likely acquired from experience with the language (Thiessen and Saffran, 2007).

By contrast, sensitivity to conditional statistical information does not require language-specific knowledge; it is a cue to word segmentation that is available cross-linguistically. This cue is relevant to word segmentation because sounds within a word are more likely to co-occur than sounds across word boundaries (Hayes and Clark, 1970). For example, copter is very likely to occur after heli; but many words could potentially occur after helicopter. Conditional statistics – such as transitional probability (e.g., Saffran et al., 1996) – reflect the likelihood of co-occurrence among elements of the input. A body of prior research indicates that both infants and adults are able to segment words from fluent speech on the basis of conditional statistical information. For example, artificial language experiments demonstrate that after exposure to a sequence of syllables, both infants and adults are able to distinguish between syllable groups with high conditional relations (i.e., words), and syllable groups with low conditional relations, such as groupings that occur across word boundaries (e.g., Aslin et al., 1998; Thiessen and Saffran, 2004).

A variety of computational accounts have been proposed to explain sensitivity to conditional statistical information (for discussion, see Frank et al., 2010). The most successful of these models – clustering models – search for and store clusters of statistically coherent elements (e.g., Perruchet and Vinter, 1998; Orban et al., 2008). These models predict that after exposure to speech, participants should have extracted a set of candidate lexical items (e.g., Giroux and Rey, 2009). Research with both infants and adults is consistent with this prediction. For example, after exposure to a stream of synthesized speech, infants accept words from that stream when they appear in English utterances (Saffran, 2001). Similarly, infants and adults learn labels for novel objects more easily when provided the opportunity to segment the labels from fluent speech (Graf Estes et al., 2007; Mirman et al., 2008).

In word segmentation tasks, this clustering process means that exposure to fluent speech leads learners to extract a set of candidate lexical items. Evidence that learners extract clusters of statistically coherent elements can be seen even for non-linguistic stimuli (e.g., Fiser and Aslin, 2005), suggesting that this extraction is a domain-general aspect of conditional statistical learning.

The fact that infants are capable of extracting and storing word forms is consistent with a statistical bootstrapping account of the development of word segmentation (Thiessen and Saffran, 2003). On this account, infants initially rely on language-universal cues – such as sensitivity to conditional statistical information – to segment words from fluent speech. Once they have identified and stored a set of word forms, they can identify the acoustic features that are consistent across them (e.g., Lew-Williams and Saffran, 2012). For example, if infants are exposed to a set of words in which stress consistently occurs on the first syllable, they will acquire a trochaic bias (Thiessen and Saffran, 2007). Once infants have discovered the acoustic features that are consistent in their proto-lexicon, they can use these features as cues to subsequent word segmentation (e.g., Johnson and Jusczyk, 2001).

This transition from language-general to language-specific cues is thought to take place between 7 and 9 months. While 7-month-old infants rely on conditional statistical information to segment fluent speech, 9-month-old infants favor lexical stress, even if segmenting on the basis of stress contradicts conditional statistical information (Johnson and Jusczyk, 2001; Thiessen and Saffran, 2003). Recent research by Höhle et al. (2009), however, indicates that infants as young as 6 months are familiar with the predominant prosodic structure of words in their native language. Höhle et al. suggest that 6 months is below the age at which infants are able to segment words from fluent speech via conditional statistical cues. If so, the statistical bootstrapping account of infants’ prosodic learning is necessarily incorrect. Instead, this would suggest that language-specific prosodic cues may be the earliest cue infants use to segment words from fluent speech. Additionally, it would suggest that knowledge about the prosodic form of words arises from some source other than statistical learning, such as learning solely from words in isolation.

However, the claim that infants below 6 months are unable to segment speech on the basis of conditional statistical information may be incorrect. Evidence suggests that young infants and even neonates are sensitive to conditional statistical information (Kirkham et al., 2002; Teinonen et al., 2009; Kudo et al., 2011). Further, one prior experiment indicates that 5- to 6-month-old infants are able to segment fluent speech via conditional statistical information (Johnson and Tyler, 2010). In Experiment 1, we seek to provide additional evidence that infants are able to segment fluent speech below 6 months of age. Additionally, we will investigate whether infants at this young age prioritize conditional statistical information over lexical stress as a cue to word segmentation, consistent with the statistical bootstrapping account. In Experiment 2, we will investigate whether infants in this age range are capable of learning to use lexical stress as a cue to word segmentation.

Experiment 1A

Within the word segmentation literature, it is commonly held, on the basis of a seminal study by Jusczyk and Aslin (1995), that infants develop the ability to segment fluent speech by 7.5 months. Some researchers have asserted that before this age infants lack the ability to extract words from fluent speech on the basis of statistical structure (e.g., Höhle and Weissenborn, 2003). Others have proposed that the ability to segment words from fluent speech via transitional probabilities is intact earlier (e.g., Thiessen and Saffran, 2003; Johnson and Tyler, 2010). Evidence from neuroimaging is consistent with this claim (e.g., Teinonen et al., 2009; Kudo et al., 2011). The goal of Experiment 1A was to provide further behavioral evidence that infants are capable of segmenting fluent speech via conditional statistical information below 6 months. To do so, we exposed 5-month-old infants to an artificial language in which the only cue to segmentation is higher conditional relations between syllables within words relative to syllables spanning word boundaries (part-words). If the ability to segment speech does not emerge until later than 7 months, these 5-month-old infants should not discriminate between words and part-words following familiarization with this fluent speech stream. However, if the ability to parse speech on the basis of statistical cues is intact at an earlier age, infants should discriminate between words and part-words.

Materials and Methods

Participants

Data were obtained from 10 participants between the ages of 5.0 and 5 months, 14 days (M = 5.10). To obtain data from 10 infants, it was necessary to run 13 infants. The additional three infants were excluded for crying during the testing session (1), average looking times of less than 3.0 s (1), or experimenter error (1). A sample size of 10 infants was used based on a power analysis using an effect size calculated from Thiessen and Saffran’s (2003) Experiment 3, of which this experiment is a replication with a younger age group.

Stimuli

The stimuli used in this experiment were identical to those used in Thiessen and Saffran’s (2003) Experiment 3. Infants were exposed to an artificial language containing four bisyllabic nonsense words: diti, bugo, dapu, and dobi. The language was synthesized using MacinTalk, and all syllables were produced with neutral stress. This language was constructed such that two of the words – dapu and dobi – occurred twice as often (90 times) as the other two words (diti and bugo, each of which occurred 45 times). This ensures that test item foils can be constructed that differ solely on their conditional probabilities, rather than on the frequency with which infants hear them (for discussion, see Aslin et al., 1998). Words occurred in a pseudo-random order, with the constraint that no word could follow itself. Syllable-to-syllable transitional probabilities were 100% within a word, and 33% at word boundaries. Because there were no pauses or other acoustic cues to word boundaries in this artificial language, the conditional probabilities (high within a word, low at boundaries) provided the only cue to word segmentation.

Two kinds of test items were created to assess infants’ ability to segment the language: words and part-words. The word test items were the infrequent words (diti and bugo) from the artificial language. Part-words were syllable conjunctions that occurred across the two more frequent words (bida and pudo). During the infants’ exposure to the artificial language, both words and part-words occurred equally often. Therefore, any difference in infants’ responses to these two kinds of test items is not due to the frequency with which they have heard the words or part-words.
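The contrast between words and part-words can be made concrete with a minimal Python sketch that estimates syllable-to-syllable transitional probabilities from a stream. The four words are those of the artificial language, but the word order in WORD_ORDER and the helper name transitional_probabilities are assumptions made for illustration. The toy stream is far shorter than the familiarization stream and does not reproduce its word frequencies, so the boundary values it yields should not be read as the 33% reported above.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Return {(a, b): P(b | a)} estimated from adjacent syllable pairs."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a successor, so the
    # probabilities leaving any syllable sum to 1.
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


# Hypothetical word order (an assumption, not the familiarization stream).
# No word follows itself, matching the constraint described above.
WORD_ORDER = ["dapu", "dobi", "diti", "dapu", "bugo", "dobi",
              "dapu", "dobi", "bugo", "dapu", "diti", "dobi"]

# Every syllable in this language is two letters long.
stream = [word[i:i + 2] for word in WORD_ORDER for i in (0, 2)]
tp = transitional_probabilities(stream)

for label, pair in [("word diti", ("di", "ti")),
                    ("word bugo", ("bu", "go")),
                    ("part-word bida", ("bi", "da")),
                    ("part-word pudo", ("pu", "do"))]:
    print(f"{label}: TP = {tp[pair]:.2f}")
```

Because each syllable belongs to exactly one word, every within-word transition in the toy stream has a probability of 1, whereas transitions that span a word boundary are split among several possible following words and are therefore lower.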

Procedure

Infants were tested individually in a sound-attenuated testing room, seated on a caregiver’s lap 150 cm away from a 32′′ LCD monitor. An experimenter outside the testing room observed the infant over closed-circuit video and recorded the duration of his or her gaze at the central monitor using the Habit X software (Cohen et al., 2004). To eliminate bias, parents were asked to wear headphones, and the experimenter was blind to the nature of the stimuli being presented. Two speakers situated next to the central LCD monitor were used to present the audio stimuli.

At the beginning of the experiment, the infants’ attention was attracted to the central LCD monitor by the presentation of a colorful Winnie the Pooh video, accompanied by an attention-getting phrase. Once the infant looked at the central monitor, the video was replaced by a static image of a checkerboard, and the artificial language began to play. The checkerboard remained on screen, and the language continued to play, for 2 min. At the end of this time, the attention-getting movie reappeared on the screen.

Once infants focused their gaze on the central monitor, the test phase began. During this phase, 12 test trials were presented. Six of these trials were word trials, and six were part-word trials. Each test item occurred on three trials during the testing phase. Test trials were presented in random order. A test trial began with the attention-getting movie playing on the central monitor, drawing the infants’ gaze forward. When the observing experimenter pressed a key indicating that the infant had fixated, the monitor displayed a video of a looming green ball on a black background, while the speakers began to play repetitions of the test item (either a word or a part-word) separated by 1.4 s pauses. For as long as the infant maintained their gaze on the central monitor, the test trial continued, up to a maximum of 20 s. When the infant looked away for more than two consecutive seconds, the test trial ended and the attention-getting video reappeared on the central monitor.
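The trial-termination rule described above can be sketched in a few lines of Python. This is not the Habit X implementation; the function name trial_looking_time, the input format (an ordered list of gaze segments), and two interpretive choices are assumptions: that the 20 s maximum applies to accumulated looking time, and that look-aways of 2 s or less do not count toward looking time.

```python
MAX_LOOK_S = 20.0        # maximum stated in the procedure
LOOK_AWAY_LIMIT_S = 2.0  # a longer continuous look-away ends the trial


def trial_looking_time(segments):
    """Accumulate looking time for one test trial.

    segments: ordered (is_looking, duration_s) pairs, where each
    look-away segment is one continuous look away (hypothetical format).
    """
    total = 0.0
    for is_looking, duration_s in segments:
        if is_looking:
            total += duration_s
            if total >= MAX_LOOK_S:
                return MAX_LOOK_S  # trial is cut off at the maximum
        elif duration_s > LOOK_AWAY_LIMIT_S:
            break  # look-away exceeded the limit, so the trial ends
    return total


# Illustrative gaze record (made up): a 1.0 s glance away is tolerated,
# and the 2.5 s look-away ends the trial before the final segment.
example = [(True, 3.0), (False, 1.0), (True, 4.5), (False, 2.5), (True, 10.0)]
print(trial_looking_time(example))
```

Under a different reading of the procedure, for instance one in which the 20 s limit applies to elapsed trial time, the cap would need to be tracked on a separate clock.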

Results

If infants were able to successfully segment the artificial language, they should respond differently to word test trials than to part-word test trials (e.g., Saffran et al., 1996). While, in principle, any group-level preference indicates that infants are able to differentiate the items, the experiments most similar to this one have resulted in a novelty preference (e.g., Thiessen and Saffran, 2003, 2007). If infants in this experiment behave in the same way, they should look longer at test items that violate their expectations (i.e., part-words) than at test items that fit what they have learned (i.e., words).

The results were consistent with prior experiments using these stimuli. Infants in this experiment displayed a novelty preference, listening longer to part-words (M = 8.10 s, SE = 0.90) than to words (M = 6.78 s, SE = 1.34; see Figure 1). A paired-samples t-test (all t-tests reported here and in subsequent experiments are two-tailed) revealed that the difference in listening times as a function of test item type was significant, t(9) = 2.609, p < 0.05. After familiarization, 5-month-old infants distinguished between words and part-words, indicating that they had succeeded in parsing the speech signal.
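For readers less familiar with the test, the paired-samples t statistic is the mean of the per-infant differences divided by its standard error, with n - 1 degrees of freedom. The Python sketch below shows the computation; the function name paired_t and the toy numbers are assumptions for illustration and are not the infants' looking times, which are not listed in the text. The final lines only check that the reported statistic exceeds the standard two-tailed critical value for 9 degrees of freedom.

```python
import math
import statistics


def paired_t(first, second):
    """Return (t, df) for a paired-samples t-test on two equal-length lists."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # Standard error of the mean difference, using the sample (n - 1) SD.
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1


# Made-up values in seconds, for illustration only (NOT the study's data).
toy_part_words = [1.0, 2.0, 3.0, 4.0]
toy_words = [0.0, 0.0, 1.0, 1.0]
t_value, df = paired_t(toy_part_words, toy_words)
print(f"toy data: t({df}) = {t_value:.3f}")

# Two-tailed critical value of t for df = 9 at alpha = .05.
CRITICAL_T_DF9 = 2.262
REPORTED_T = 2.609  # statistic reported in the Results
print(REPORTED_T > CRITICAL_T_DF9)
```

With 10 infants the test has 9 degrees of freedom, and 2.609 lies above 2.262, which is consistent with the reported p < 0.05.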

Figure 1. Looking times to words and part-words in Experiment 1A. Error bars indicate standard error.

Discussion

The fact that infants were able to segment the artificial language used in this experiment is inconsistent with the common assertion that speech segmentation does not begin until around 7 months of age (e.g., Jusczyk and Aslin, 1995; Höhle and Weissenborn, 2003). Instead, it is consistent with prior results indicating that infants are sensitive to conditional statistical information from a young age (Kirkham et al., 2002; Teinonen et al., 2009; Johnson and Tyler, 2010; Kudo et al., 2011). Indeed, to our knowledge the infants in this experiment are younger than any prior group of infants in a behavioral word segmentation experiment. The fact that they successfully segmented the speech stream raises the possibility that word segmentation in native language environments may begin at younger ages than previously thought. Moreover, the 5-month-olds in this experiment demonstrated sensitivity to conditional statistical information at a younger age than that at which any prior experiment has found sensitivity to language-specific acoustic cues to segmentation, such as lexical stress patterns. As such, these results are consistent with the hypothesis that conditional statistical information is one of the first cues available to infants as they begin to discover word forms in speech.

Experiment 1B

Experiment 1A demonstrated that 5-month-old infants are able to segment word forms from speech solely on the basis of conditional probability information. In Experiment 1B, we were interested in how infants of this age behave when statistical cues to word identity are placed in direct conflict with lexical stress, an acoustic cue thought to be very salient to infants (e.g., Gleitman et al., 1988; Echols and Newport, 1992). Much research attests to infants’ early sensitivity to prosodic information (e.g., Mehler et al., 1988) and to a preference, emerging at 9 months in English-exposed infants, for trochaic words (consisting of a strong/weak pattern) over iambic words (weak/strong; Jusczyk et al., 1993). Additionally, 7.5-month-old infants in English-speaking environments are so reliant on lexical stress that they display a trochaic bias during segmentation: when exposed to passages containing the sequence “guiTAR#is,” they segment the trochaic sequence “TARis” from fluent speech even when it occurs less frequently than the iambic sequence “guiTAR” (Jusczyk et al., 1999).

In the present experiment, we were interested in whether infants would extract units from familiarization on the basis of conditional information (i.e., extract syllable pairings characterized by high transitional probabilities) or on the basis of lexical stress cues (i.e., trochees following the dominant pattern of English). Based on the prior finding that 7-month-olds ignore stress cues, segmenting items on the basis of conditional information (Thiessen and Saffran, 2003), we predicted that 5-month-old infants in this study would also extract units according to this language-universal strategy rather than according to lexical stress, which requires language-specific knowledge about words. If infants of this age segment statistical words rather than trochaic disyllables, this would provide strong support for the idea that conditional information is a powerful language-universal cue that could be recruited to acquire language-specific knowledge such as the preferred position of stressed syllables within word forms. In contrast, if these infants extract trochees from the speech stream, even when they are characterized by low transitional probabilities, this would be consistent with the early rhythmic segmentation hypothesis proposed by Nazzi and colleagues (e.g., Nazzi and Ramus, 2003; Nazzi et al., 2006; Höhle et al., 2009; Mersad and Nazzi, 2011). According to this hypothesis, early segmentation is based on the rhythmic unit of the native language, which derives from infants’ early sensitivity to language rhythm.