Are the Products of Statistical Learning Abstract or Stimulus-Specific?

Vouloumanos, Athena; Brosseau-Liard, Patricia E; Balaban, Evan; Hager, Alanna D

doi:10.3389/fpsyg.2012.00070

ORIGINAL RESEARCH article

Front. Psychol., 23 March 2012

Sec. Psychology of Language

Volume 3 - 2012 | https://doi.org/10.3389/fpsyg.2012.00070


Athena Vouloumanos ¹*
Patricia E. Brosseau-Liard ²
Evan Balaban ³
Alanna D. Hager ⁴

1. Department of Psychology, New York University, New York, NY, USA
2. Department of Psychology, University of British Columbia, Vancouver, BC, Canada
3. Department of Psychology, McGill University, Montreal, QC, Canada
4. Department of Psychology, University of Victoria, Victoria, BC, Canada

Abstract

Learners can segment potential lexical units from syllable streams when statistically variable transitional probabilities between adjacent syllables are the only cues to word boundaries. Here we examine the nature of the representations that result from statistical learning by assessing learners’ ability to generalize across acoustically different stimuli. In three experiments, we compare two possibilities: that the products of statistical segmentation processes are abstract and generalizable representations, or, alternatively, that products of statistical learning are stimulus-bound and restricted to perceptually similar instances. In Experiment 1, learners segmented units from statistically predictable streams, and recognized these units when they were acoustically transformed by temporal reversals. In Experiment 2, learners were able to segment units from temporally reversed syllable streams, but were only able to generalize in conditions of mild acoustic transformation. In Experiment 3, learners were able to recognize statistically segmented units after a voice change but were unable to do so when the novel voice was mildly distorted. Together these results suggest that representations that result from statistical learning can be abstracted to some degree, but not in all listening conditions.

Introduction

One of the first tasks for a novice language-learner is to extract words from what is essentially a continuous stream of sound. Whereas written text provides spaces between words to mark word boundaries, there are no clear acoustic cues in speech input as to the location of word boundaries (e.g., Cole et al., 1980). Learners must thus generate hypotheses about where possible boundaries between words might be. At the same time, the problem is made especially challenging because naturally occurring speech signals in the environment are acoustically variable. As Church and Fisher (1998, p. 523) put it, “Locating and identifying words in continuous speech presents a number of well-known perceptual problems. These problems fall into two interrelated classes: word segmentation and acoustic/phonetic variability.” The current paper explores the intersection of these two problems of acquisition by asking whether a segmentation process that tracks statistically variable transitional probabilities across adjacent elements could help a learner segment potential words across acoustically variable environments.

Full-fledged words are rich, abstract, lexical entries that speakers of a language can recognize across contexts, across speakers, and across pronunciations. Representing speech sound patterns as potential lexical units is important for identifying new occurrences of these words, and essential for eventually mapping these words onto meanings. Here, we examine the nature of the representations extracted by statistical segmentation processes. We consider whether statistical segmentation processes produce “presemantic” sound representations that behave like lexical units or “acoustic” representations that are stimulus-bound. Since statistical segmentation studies to date have largely used acoustically identical sounds during training and testing, the nature of the lexical units learned from statistical segmentation processes remains to be examined.

Though learners use various sources of linguistic information – for example, prosodic (Johnson and Jusczyk, 2001), phonotactic (Mattys and Jusczyk, 2001), and allophonic (Jusczyk et al., 1999) – to segment words from the speech stream, a statistical process that keeps track of statistical regularities (e.g., transitional probabilities) between sounds might play a role in word segmentation (Saffran et al., 1996a). Using such a statistical mechanism, adjacent sounds that occur sequentially with a higher transitional probability would be segmented as part of the same word, while transitions with low or near-zero probability would likely mark boundaries between different words. Tracking statistically variable transitional probabilities has the advantage of being universally available for acquisition, unlike phonotactic constraints or prosodic patterns which require previous language-specific knowledge.
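The transitional probability referred to above is conventionally written as a conditional relative frequency. The notation below is a sketch of that standard definition; the symbols X and Y (two adjacent syllables) and the function name freq are labels chosen here for illustration and do not appear in the original text.

```latex
\mathrm{TP}(Y \mid X) = \frac{\mathrm{freq}(XY)}{\mathrm{freq}(X)}
```

Under this definition, a syllable that is always followed by the same syllable yields a transitional probability of 1, whereas a syllable that can be followed by several different syllables yields a lower value for each of them.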

Both adults (Saffran et al., 1996b) and infants (Saffran et al., 1996a) can use transitional probabilities to segment a speech stream. However, statistical learning mechanisms appear to be neither domain-specific, nor specific to humans. In addition to speech sounds, this general-purpose learning mechanism can track probabilities in tone sequences and musical timbres (Saffran et al., 1999; Tillmann and McAdams, 2004) and visual patterns (Fiser and Aslin, 2002; Kirkham et al., 2002; Turk-Browne et al., 2005) and thus is not specific to language learning. It is also not exclusive to humans, as cotton-top tamarin monkeys and rodents can also learn from statistical regularities in speech input (Hauser et al., 2001; Newport et al., 2004; Toro and Trobalon, 2005). Its availability for segmenting non-speech and its use by non-humans bring into question the lexical nature of the products of statistical learning.

The starting point to our investigation is the fact that speech is an extremely variable acoustic signal. For instance, changes in speaking rate affect different phonemes differently (Jusczyk and Luce, 2002). Similarly, coarticulation between adjacent speech sounds renders every instance of a word acoustically different depending on the sounds that precede and follow it (Curtin et al., 2001). Variability can be detrimental to speech processing. In adults, variability has a detrimental effect on speech perception, especially under distorted conditions or in low-probability phrases (Mullennix et al., 1989). More variability in a familiarization stream of speech leads to more encoding of acoustic details and a larger influence of acoustic properties on later performance (Ju and Luce, 2006). When acoustic variation is introduced, infants perform less well than when the acoustic content is constant (Grieser and Kuhl, 1989; Swingley and Aslin, 2000). Other studies suggest that 2-month-old infants can perceptually normalize across talkers but also encode the difference between the voices of different talkers, and that they take longer to habituate to a phoneme uttered by several talkers than to the same phoneme uttered by a single talker (e.g., Jusczyk et al., 1992). Variability therefore seems generally detrimental to speech perception except in some cases where variability can improve performance (e.g., Goldinger et al., 1991).

Though variability in the speech signal generally results in less efficient and less accurate speech processing (Mullennix et al., 1989; Van Tasell et al., 1992), humans are able to ignore or compensate for a great degree of natural variation and artificial distortion in their use of spoken language. Adults and infants readily ignore incorrect or incomplete information in the acoustic realization of words. For example, replacing a phoneme with a noise burst does not impair adults’ recognition of words, and often the noise burst goes undetected (Warren, 1970). Even infants are able to recognize a familiar word in which the initial phoneme has been changed, which suggests that their representations of familiar words are not tied to specific sounds within the words (Swingley and Aslin, 2002). A particularly strong example of how little impact acoustic variation can have on speech recognition is that of whispered speech. Samuel (1988) showed that selective adaptation occurs for phonemes across normal and whispered speech, despite large differences in the phonemes’ acoustic properties between the two contexts. Words, or lexical units, are thus partly represented abstractly, allowing speakers to recognize the same lexical unit across natural variability in different speakers, contexts, and pronunciations.

Human speech perception is robust even when the temporal or spectral properties of speech have been distorted (for reviews, see Scott and Johnsrude, 2003; Davis and Johnsrude, 2007). Saberi and Perrott (1999) found that intelligibility remains almost perfect even when words have been severely distorted: comprehensible English sentences were divided into segments that were then either locally time-reversed, or temporally delayed. Time-reversals of segments up to 50 ms in length were identified perfectly accurately, with accuracy decreasing with longer segments, reaching chance performance at around 130 ms reversals. Time delays were much less detrimental to performance. A spectral manipulation called band pass-filtering, which removes much spectral information from speech including formant transitions (Stickney and Assmann, 2001) can also result in highly intelligible speech with as few as three or four bands when slowly varying temporal information is available (Shannon et al., 1995; Davis et al., 2005; Hervais-Adelman et al., 2008). Despite extreme variability in the signal, adults are able to understand speech in a consistent way and recognize the same word in different contexts. Although lexical representations are not fully abstract, instead representing acoustic information specific to individuals and to contexts (Goldinger, 1996, 1998; Kraljic and Samuel, 2005; Werker and Curtin, 2005), adults’ proficiency at recognizing the same lexical unit in different acoustic contexts requires that humans draw on an abstract linguistic representation that is in part independent of specific acoustic properties.

Here, we use variability as a tool to understand the products of statistical learning. Although humans can use statistical regularities in fluent speech to readily acquire information about possible lexical units, little is known about how learners represent these newly segmented units, specifically whether segmented units are coded as acoustic units or as putative lexical entries. Only a few previous studies bear directly on this issue. Studies in which infants were trained on nonsense streams presented either in infant-directed speech (Thiessen et al., 2005) or with stressed syllables (Thiessen and Saffran, 2003) showed that infants were able to recognize segmented “words” from the “part-words” when these were subsequently presented in adult-directed or unstressed speech. Although this suggests some degree of abstraction in the representation of segmented speech units, the higher pitch and greater length of infant-directed speech and stressed syllables typically attract infant attention (e.g., Fernald, 1985), and this increased attention could account for infants’ better performance (see also footnote 1 in Thiessen and Saffran, 2003 reporting higher attrition to the monotone training). Similarly, adults can recognize shorter test tokens after segmentation training on lengthened tokens (Saffran et al., 1996b). Syllable lengthening, however, is present in natural speech (especially in word-final or sentence-final positions; Klatt, 1975) and thus adults in these studies may have normalized acoustic differences with which they were already familiar. More recently, infants and adults were shown to be able to map newly segmented words onto novel objects in a word learning task (Graf Estes et al., 2007; Mirman et al., 2008). 
Although both adults and infants more easily mapped the high probability syllable–strings onto referents, consistent with their representation as lexical units, better performance may also have been due to better encoding and recognition of the high probability syllable–string simply because of its higher probability. Much like the process of lexicalization of new words (Gaskell and Dumay, 2003), initial encoding of novel syllable–strings as putative words may have relied on a stimulus-linked memory trace that is not actually integrated into the lexicon. These studies therefore do not directly distinguish between abstract and acoustic representations.

In the domain of artificial grammar learning, which requires computations across statistically regular units, results conflict: some studies point to abstract representations (Altmann et al., 1995), and others to stimulus-specific representations (Conway and Christiansen, 2006). Perhaps the best evidence for representing statistically regular information in an abstract and generalizable format comes from the visual domain. Turk-Browne et al. (2005) demonstrated that adult learners presented with statistical regularities instantiated in colored visual stimuli (novel green or red shapes presented sequentially) were able to abstract these regularities to black shapes during test. This suggests that products of statistical learning, at least in the visual domain, can be generalized to perceptually different stimuli. At the same time, however, studies that examined whether statistical learning transfers across modalities found that the products of statistical learning were stimulus-specific (Conway and Christiansen, 2006). Whether statistical learning in the auditory domain can be generalized to perceptually different stimuli remains to be tested. In this paper, we take a complementary approach to investigating how the output of statistical learning processes is represented by focusing on adults’ ability to restore acoustically variable and distorted speech.

In three experiments, we investigate how auditory units segmented through statistical processes are represented and abstracted by training and testing adults on acoustically dissimilar units. The acoustic characteristics of the sounds were temporally distorted by locally time-reversing the speech (see Figure 1). Previous studies show that reversals of 50 ms result in near perfect intelligibility, whereas reversals of 100 ms only allow listeners to recover 50% of the utterance. Performance is linearly related to the duration of the reversed segment (Saberi and Perrott, 1999). Time-reversal distorts the temporal envelope and the fine structure of the spectrum. Much acoustic evidence for place, manner, and voicing of speech segments is conveyed in temporal regions where the speech spectrum is changing rapidly, that is, when the vocal tract opens and closes rapidly to produce consonants (Stevens, 1980; Liu, 1996). At the same time, because consonants are briefer in duration and lower in intensity than vowels, which have a longer spectral steady state, they are less robust than vowels and more vulnerable to distortion (e.g., Miller and Nicely, 1955; Assmann and Summerfield, 2004). Time-reversals therefore affect the speech intelligibility of consonants more than vowels.

Figure 1

Adults were familiarized with a series of trisyllabic nonsense words that were concatenated into 10-min streams of continuous speech, in which transitional probabilities between adjacent segments were the only cues to word boundaries (e.g., Saffran et al., 1996a). Familiarization consisted of either normal speech, or the same speech stream, temporally distorted by locally time-reversing 50-ms or 100-ms segments (Saberi and Perrott, 1999), giving the sounds a reverberant quality. Participants were subsequently tested on their ability to discriminate the familiar “words” from trisyllabic “non-words” whose component syllables were drawn from the familiarization stream but which had been subjected to acoustic transformations (temporal reversals, or a voice change to a different gender). We used non-words rather than part-words because evidence for learning was more robust with non-words (e.g., Saffran et al., 1996b) providing a starting point for examining the influence of acoustic distortions. Our goal was to determine whether products of segmentation through statistical learning are tied to the specific acoustic properties of the training environment or if the units are represented more abstractly, allowing newly acquired words to be recognized when they differ acoustically.
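The local time-reversal manipulation described above can be illustrated with a short sketch. It is not the authors' stimulus-processing code: the function name, the list-of-samples representation, and the toy sampling rate are assumptions made here for illustration, and the 2.5-ms amplitude ramps applied at reversal boundaries are not modelled.

```python
def locally_time_reverse(samples, sample_rate_hz, segment_ms):
    """Reverse each consecutive window of `segment_ms` milliseconds.

    The windows keep their original order; only the samples inside
    each window are played backwards. A final, shorter window is
    reversed as well.
    """
    seg_len = int(round(sample_rate_hz * segment_ms / 1000))
    if seg_len < 1:
        raise ValueError("segment must span at least one sample")
    out = []
    for start in range(0, len(samples), seg_len):
        out.extend(reversed(samples[start:start + seg_len]))
    return out


# Toy signal (hypothetical): 10 samples at 1000 Hz, reversed in 4-ms windows.
toy = list(range(10))
reversed_toy = locally_time_reverse(toy, sample_rate_hz=1000, segment_ms=4)
print(reversed_toy)  # [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```

With real audio, the same function would be called with segment_ms set to 50 or 100. Because the window boundaries are fixed, applying the function twice with the same settings returns the original signal.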

Experiment 1: Representing Statistical Products: Abstracting to Distorted Speech

In Experiment 1A, we replicate the findings of previous statistical learning studies (e.g., Saffran et al., 1997) using different trisyllabic words and a shorter familiarization stream (similar to Peña et al., 2002). In Experiment 1B, participants were familiarized with the same speech stream as 1A, and then tested on words that were time-reversed every 50 ms, a type of distortion that is not severe enough to impair recognition of known English words (Saberi and Perrott, 1999). In Experiment 1C, participants were tested with words time-reversed every 100 ms, which is a more drastic form of distortion that has been shown to impair speech recognition (Saberi and Perrott, 1999).

Materials and methods

Participants

Forty-five participants were recruited through the undergraduate participant pool, or through ads posted on the university campus. Participation in the study was compensated with either course credit in one psychology course or $5. Participants gave informed consent and reported being fluent in English and having no hearing impairment. Participants were assigned to either condition A (15), condition B (15), or condition C (15). All procedures were approved by the research ethics board at McGill University.

Stimuli

The experimental materials consisted of 12 consonants (b, d, f, g, h, k, l, m, p, r, t, w) and 6 vowels (/ah/, /ay/, /aw/, /oh/, /oo/, /ee/) combined to form 18 consonant–vowel syllables (see Table 1). Syllables were generated individually in DECtalk 5.0 (Fornix Corporation, UT, USA). The settings for the electronic female voice “Betty” were used, with a “speech rate” (not related to syllable rate) setting of 205 words per minute, a richness of 0, a smoothness of 70%, an average pitch of 180 Hz, and a pitch range of 0. Syllables were 228–388 ms (average = 296 ms) in length (see Table 1 for details). These syllables were combined into six trisyllabic units, hereafter called “words,” which were concatenated into four different 10-min speech streams. Only two streams were used in conditions A and B, whereas all four streams were used in condition C. The order of words in each of the streams was randomized, with the only constraint being that the same word could not occur twice in a row. Transitional probabilities between syllables within a word were 1.0, but only 0.2 for syllables spanning word boundaries (since each word in the stream is followed randomly by one of the other five words). For 1A and 1B, the amplitude of the speech stream was attenuated every 50 ms with a 2.5-ms decrease in amplitude and a 2.5-ms increase in amplitude. For 1C, amplitude was attenuated every 100 ms with a 2.5-ms decrease in amplitude and a 2.5-ms increase in amplitude. Attenuations at reversal boundaries prevented the noise bursts that arise from abrupt changes in amplitude in artificially reversed speech and that change the properties of the percept.
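The stream construction and the resulting transitional probabilities can be sketched as follows. This is an illustrative simulation rather than the procedure used to build the actual stimuli: the function names, the random seed, and the stream length of 6000 words (chosen only to give stable estimates, not to match the 10-min streams) are assumptions. The six words are those listed in Table 1.

```python
import random
from collections import Counter

WORDS = [
    ("pee", "fah", "boe"), ("roo", "wee", "law"), ("hoe", "lay", "gee"),
    ("kaw", "goo", "pah"), ("may", "taw", "foo"), ("dah", "koe", "way"),
]


def build_stream(words, n_words, rng):
    """Concatenate randomly ordered words; a word never follows itself."""
    stream, previous = [], None
    for _ in range(n_words):
        word = rng.choice([w for w in words if w != previous])
        stream.extend(word)
        previous = word
    return stream


def transitional_probabilities(stream):
    """Estimate P(next | current) as count(current, next) / count(current)."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])  # the last syllable has no successor
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


rng = random.Random(0)  # fixed seed: an arbitrary choice for reproducibility
stream = build_stream(WORDS, n_words=6000, rng=rng)
tps = transitional_probabilities(stream)

print(tps[("pee", "fah")])            # within-word transition
print(round(tps[("boe", "roo")], 2))  # transition across a word boundary
print(("lay", "hoe") in tps)          # a non-word bigram never occurs
```

In this simulation the within-word estimate is exactly 1.0, estimates across word boundaries cluster around 0.2, and bigrams drawn from the non-words in Table 1 do not occur in the stream at all.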

Table 1

Syllable 1	ms	Syllable 2	ms	Syllable 3	ms	Word
WORDS
Pee	229	Fah	388	Boe	267	Peefahboe
Roo	304	Wee	272	Law	336	Rooweelaw
Hoe	329	Lay	298	Gee	228	Hoelaygee
Kaw	298	Goo	267	Pah	306	Kawgoopah
May	286	Taw	298	Foo	330	Maytawfoo
Dah	300	Koe	285	Way	298	Dahkoeway
	291.0		301.3		294.2	Average length

Syllable 1	ms	Syllable 2	ms	Syllable 3	ms	Non-word
NON-WORDS
Lay	298	Hoe	329	Pee	229	Layhoepee
Wee	272	Kaw	298	Roo	304	Weekawroo
Boe	267	Gee	228	Taw	298	Boegeetaw
Goo	267	Dah	300	May	286	Goodahmay
Law	336	Way	298	Fah	388	Lawwayfah
Pah	306	Foo	330	Koe	285	Pahfookoe
	291.0		297.2		298.3	Average length

The words and non-words used in the three experiments and the lengths of their component syllables in ms.

The six non-words were created with the following constraints: half began with a medial syllable and half began with a final syllable. No syllable occurred in the same position in the non-words as it did in the words (e.g., if “Wee” was in medial position in a word, it would not occur medially in a non-word) so as to prevent learners from falsely recognizing a nonsense word from positional encoding of its component syllables. Finally, each of the vowels occurred once in each position in the words and once in each position in the non-words but was never repeated within a given word, or given non-word.
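The constraints just listed can be checked mechanically against the items in Table 1. The sketch below is one way to do so; the function names and the report keys are chosen here for illustration, and the vowel check assumes the Table 1 spellings, in which each syllable is one consonant letter followed by a vowel spelling.

```python
WORDS = [
    ("pee", "fah", "boe"), ("roo", "wee", "law"), ("hoe", "lay", "gee"),
    ("kaw", "goo", "pah"), ("may", "taw", "foo"), ("dah", "koe", "way"),
]
NON_WORDS = [
    ("lay", "hoe", "pee"), ("wee", "kaw", "roo"), ("boe", "gee", "taw"),
    ("goo", "dah", "may"), ("law", "way", "fah"), ("pah", "foo", "koe"),
]


def word_position(syllable, words):
    """Return 0, 1, or 2: where `syllable` sits inside its familiarization word."""
    for word in words:
        if syllable in word:
            return word.index(syllable)
    raise ValueError(f"{syllable!r} is not in any word")


def vowels_balanced(items):
    """Each vowel occurs once per position and never twice within an item."""
    def vowel(syllable):
        return syllable[1:]  # assumes a one-letter consonant spelling

    per_position = all(
        len({vowel(item[p]) for item in items}) == len(items) for p in range(3)
    )
    within_item = all(len({vowel(s) for s in item}) == 3 for item in items)
    return per_position and within_item


def check_non_words(words, non_words):
    onsets = sorted(word_position(nw[0], words) for nw in non_words)
    half = len(non_words) // 2
    return {
        "no_syllable_keeps_position": all(
            word_position(s, words) != p
            for nw in non_words
            for p, s in enumerate(nw)
        ),
        "half_medial_half_final_onsets": onsets == [1] * half + [2] * half,
        "vowels_balanced": vowels_balanced(words) and vowels_balanced(non_words),
    }


report = check_non_words(WORDS, NON_WORDS)
print(report)
```

For the Table 1 items, all three checks come out True.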

The test trials were composed of the six trisyllabic words presented in the familiarization streams, as well as six trisyllabic non-words which consisted of the same syllables as the words, but in new orderings, such that the transitional probability between adjacent syllables of these non-words was 0, compared to transitional probabilities of 1.0 for syllables within the words. We opted to use non-words (as in Saffran et al., 1996a, Experiment 1; Saffran et al., 1996b, Experiment 1; Saffran et al., 1997, Experiments 1 and 2), rather than part-words, because we were interested in whether learners could abstract to new materials and not the specifics of which transitional probabilities – trigrams or bigrams – learners were using. Learners may have extracted either bigrams or trigrams from the training streams. In 1A, words and non-words were acoustically identical to the training phase, replicating previous statistical learning studies (Saffran et al., 1997). In 1B, the same six words and six non-words were time-reversed every 50 ms. In 1C, the six words and six non-words were time-reversed every 100 ms¹.

Apparatus and procedure

Participants were tested individually in a small quiet room, using an Apple G4 400 MHz computer. All participants were familiarized with one of the 10-min speech streams played over two speakers placed on either side of the monitor using QuickTime (Apple, Cupertino, CA, USA). They were instructed to actively listen to the stream, but not to try to memorize the stream or do anything other than simply listening.

To create the test pairs, each of the six words was paired with each of the six non-words exhaustively for 36 total test trials. Each particular word was played first in three test trials and second in three test trials. In the test phase, participants were randomly presented with 36 pairs of words and non-words, and asked to choose which member of the pair seemed more familiar. In half of the 36 trials, words were presented first, and in the other half, non-words were presented first. For participants in 1A, these words and non-words were normal, participants in 1B were tested on 50-ms time-reversed materials, and participants in 1C were tested on the 100-ms time-reversed materials. Guessing was encouraged if participants were not sure of the answer. Stimuli were presented and responses recorded using PsyScope 1.2.5 (Cohen et al., 1993). After each pair was presented, a short pause allowed participants to press the “z” key on the keyboard, labeled “1,” if they believed the first trisyllabic sequence to be more familiar, and to press the “m” key, labeled “2,” if they believed the second sequence to be more familiar. The number of correct responses in the normal and reversed test conditions was then calculated. For three participants, only 35 trials were recorded, and thus the percentage of correct responses was scored out of 35. The entire session lasted approximately 20 min.
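The test design can be sketched in a few lines. This is not the PsyScope script used in the study: the function names, the dictionary layout, and the parity rule used to decide which item is played first are assumptions made here; the parity rule is simply one scheme that satisfies the stated counterbalancing.

```python
import random

WORDS = ["peefahboe", "rooweelaw", "hoelaygee", "kawgoopah", "maytawfoo", "dahkoeway"]
NON_WORDS = ["layhoepee", "weekawroo", "boegeetaw", "goodahmay", "lawwayfah", "pahfookoe"]


def build_test_trials(words, non_words, rng):
    """Pair every word with every non-word and counterbalance presentation order."""
    trials = []
    for i, word in enumerate(words):
        for j, non_word in enumerate(non_words):
            word_first = (i + j) % 2 == 0  # assumed rule: alternate by parity
            pair = (word, non_word) if word_first else (non_word, word)
            trials.append({
                "pair": pair,
                "word": word,
                "word_first": word_first,
                "correct_key": "1" if word_first else "2",  # key labels from the text
            })
    rng.shuffle(trials)  # random presentation order
    return trials


def percent_correct(trials, responses):
    """Score only trials with a recorded response (None marks a missing one)."""
    scored = [(t, r) for t, r in zip(trials, responses) if r is not None]
    hits = sum(t["correct_key"] == r for t, r in scored)
    return 100.0 * hits / len(scored)


trials = build_test_trials(WORDS, NON_WORDS, random.Random(1))
print(len(trials))                          # 36
print(sum(t["word_first"] for t in trials))  # 18
```

Scoring against the number of recorded responses mirrors the handling of the three participants for whom only 35 trials were recorded.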

Results and discussion

Preliminary analyses were conducted in order to assess whether there was any difference in performance caused by particular familiarization streams. Since no significant effect was found, the different streams were combined in subsequent analyses. In 1A, words were chosen over non-words on average in 64.5% of test trials, reliably above chance [all t-tests are two-tailed one-sample tests; t(14) = 7.05, p < .001; see Figure 2]². In 1A, we thus replicated the findings of previous statistical learning studies by demonstrating that adults can segment words using statistically occurring transitional information (Saffran et al., 1997).

Figure 2

In Experiments 1B and 1C, we examined whether the type of representation segmented from a statistical speech stream allows learners to recognize words when they have been acoustically distorted by 50 and 100-ms time-reversals. In Experiment 1B, with 50-ms time-reversed materials, words were chosen over non-words in 61.3% of test trials. A planned two-tailed one-sample t-test indicated that this proportion was significantly different from chance [t(14) = 5.61, p < .001]. In Experiment 1C, 100-ms time-reversed words were selected on average in 54.4% of trials, significantly above chance [planned two-tailed one-sample t-test t(14) = 3.84, p = .002]. Performance differed between conditions, F(2,42) = 8.39, p = .001, with performance on 100-ms reversals being reliably worse than normal words (LSD Mean Difference = 10.1, p < .001) and 50-ms reversals (LSD Mean Difference = 6.96, p = .009). Learners were able to recognize segmented words under both distortion conditions, although performance was worse with larger distortion.
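The one-sample tests against chance reported above follow the usual formula, t = (mean − chance) / (SD / √n) with n − 1 degrees of freedom. The sketch below shows that computation using only the Python standard library. The scores in the example are hypothetical placeholders, not data from the study, and chance is taken to be 50% because each trial is a two-alternative choice.

```python
import math
import statistics


def one_sample_t(scores, chance=50.0):
    """Return (t, df) for a one-sample t-test of `scores` against `chance`."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample SD (n - 1 in the denominator)
    t = (mean - chance) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical percent-correct scores for 15 participants (NOT the study's data).
hypothetical_scores = [58.3, 66.7, 61.1, 72.2, 55.6, 63.9, 69.4, 52.8,
                       61.1, 75.0, 58.3, 66.7, 63.9, 55.6, 69.4]
t_value, df = one_sample_t(hypothetical_scores)
print(df)  # 14
print(t_value)
```

Converting t to a two-tailed p value requires the t distribution with the given degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.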

Adults were able to restore speech sounds that were acoustically distorted and to map them to their representation of newly segmented units; however, restoration became more difficult as the amount of acoustic distortion increased. Interestingly, adults’ performance in restoring newly learned segmented novel units parallels their ability to restore known speech which includes higher-level syntactic and semantic cues (Saberi and Perrott, 1999). This suggests that syntactic and semantic information is not essential for speech restoration, since our speech units were nonsense words presented in isolation, but that statistical regularities can suffice for generalization. The products of statistical learning are abstract enough to allow learners to recognize newly segmented units when they have been acoustically distorted.

Experiment 2: Acquiring Putative Lexical Units from Distorted Statistical Speech Streams

Adults have surprisingly robust abilities to recover known words from speech signals distorted both in frequency (Blesser, 1972; Remez et al., 1981; Shannon et al., 1995) and in time (e.g., Saberi and Perrott, 1999; Greenberg and Arai, 2004). In Experiment 1 we showed that listeners can recognize newly segmented speech units which have been altered with an artificial temporal distortion, suggesting that listeners can perceptually restore newly segmented speech material. However, other language learning processes show some dissociation between mechanisms involved in acquiring knowledge (e.g., acquiring a rule is easier from speech than non-speech; Marcus et al., 2007) and mechanisms involved in using knowledge (e.g., rules are applied equally to speech and non-speech; Marcus et al., 2007). In Experiment 2 we examine the extent to which humans can acquire and generalize novel lexical forms under distorted listening conditions.

Previous studies show that adult listeners can selectively adapt to certain types of acoustic distortions in a process called perceptual learning; however, perceptual learning is greatly facilitated when listeners are given feedback on sentences containing known words (e.g., Davis et al., 2005). Since the current study contains neither syntactic nor semantic cues, it is not clear whether language learners will be able to segment units under distorted conditions.

This experiment investigates whether statistical learning can occur with distorted input, and, if so, whether the representation thereby acquired will be equally resistant to acoustic change. In Experiment 2A, participants were familiarized with a 10-min 50-ms time-reversed speech stream; half of them were tested on 50-ms time-reversed words which maintain the acoustic distortion, and the other half were tested on non-distorted words that mirror the amount of acoustic transformation in Experiment 1B. In Experiment 2B, participants were familiarized with a 10-min 100-ms time-reversed stream; half of them were tested on 100-ms time-reversed words, maintaining the amount of distortion, and the other half were tested on non-distorted words, thus mirroring the amount of acoustic transformation from Experiment 1C.