Integrating Voice Quality Cues in the Pitch Perception of Speech and Non-speech Utterances

Kuang, Jianjing; Liberman, Mark

doi:10.3389/fpsyg.2018.02147

ORIGINAL RESEARCH article

Front. Psychol., 29 November 2018

Sec. Psychology of Language

Volume 9 - 2018 | https://doi.org/10.3389/fpsyg.2018.02147


Department of Linguistics, University of Pennsylvania, Philadelphia, PA, United States

Abstract

Pitch perception plays a crucial role in speech processing. Since F0 is highly ambiguous and variable in the speech signal, effective pitch-range perception is important in perceiving the intended linguistic pitch targets. This study argues that the effectiveness of pitch-range perception can be achieved by taking advantage of other signal-internal information that co-varies with F0, such as voice quality cues. This study provides direct perceptual evidence that voice quality cues, as an indicator of pitch range, can affect pitch-height perception. A series of forced-choice pitch classification experiments with four spectral conditions was conducted to investigate the degree to which manipulating spectral slope affects pitch-height perception. Both non-speech and speech stimuli were investigated. The results suggest that the pitch classification function is significantly shifted under different spectral conditions. Listeners are likely to perceive a higher pitch when the spectrum has higher high-frequency energy (i.e., tenser phonation). The direction of the shift is consistent with the correlation between voice quality and pitch range. Moreover, cue integration is affected by the speech mode, where listeners are more sensitive to relative differences within an utterance when hearing speech stimuli. This study generally supports the hypothesis that voice quality is an important enhancement cue for pitch range.

Introduction

Pitch perception is crucial to speech processing, as speakers use pitch to communicate important linguistic information like tone and intonation. Although pitch refers to an auditory property, in speech studies the term is often used interchangeably with its acoustic correlate, fundamental frequency (F0). At the same time, speakers differ in F0 ranges such that there may be overlap in the acoustic signals of “high” and “low” F0 for different speakers, as well as for different speakers’ phonetic (e.g., tonal) categories. In order to ascertain the linguistic pitch intended by a speaker, listeners must effectively locate the pitch within its speaker’s pitch range.

Speaker normalization is known to be a challenge for automatic tone recognition, yet it is an effortless process for human listeners. Speaker normalization is certainly easier when listeners have previously been exposed to a voice or when context is available (e.g., Wong and Diehl, 2003). However, studies (e.g., Honorof and Whalen, 2005; Lee, 2009; Lee et al., 2010) have shown that speaker normalization is even more efficient and effective than previously assumed, as listeners are able to identify the pitch location of very brief voice samples (e.g., only six glottal periods available) in an unknown speaker’s range, without any contextual cues. This suggests that listeners must use other signal-internal information that co-varies with F0 as cues to perceive pitch range.

Both Honorof and Whalen (2005) and Lee et al. (2010) speculated that voice quality, defined as the variability in the spectrum due to the variability of glottal constriction and vocal-fold contacts, could be such a cue. This speculation is plausible because systematic co-variation between F0 and voice quality has been found in both speech production studies (e.g., Kuang, 2017) and singing studies (e.g., Hollien and Michel, 1968; Hollien, 1974; Titze, 1988; Roubeau et al., 2009). That is, voice quality continuously changes as a speaker’s F0 increases or decreases in a nonlinear but predictable manner, and certain pitch ranges are bound to certain types of voice quality. For example, the lowest pitch range is often associated with vocal fry, and the highest pitch range is associated with tense voice and falsetto.

Indeed, a study based on Mandarin speakers (Lee, 2009) found that voice-quality-related spectral cues (i.e., H1-H2, the relative amplitude difference between the first harmonic and the second harmonic; and H1-A3, the amplitude difference between the first harmonic and the third formant) were correlated with tone classification between high and low. However, that study further noted that F0 was the only significant predictor for identification accuracy in the regression model. Bishop and Keating (2012) replicated Honorof and Whalen’s (2005) experiment and found that acoustic measures of voice quality had only a very small effect on pitch location ratings. They suggested that voice quality only indirectly influences pitch perception, possibly through its information about sex. This is plausible, since talker processing has been shown to interact with linguistic processing (e.g., Mullennix and Pisoni, 1990). However, since a multi-speaker design was used in these previous studies, and voice quality cues were not explicitly manipulated and controlled, it is impossible to tease apart the indirect gender effect of voice quality (i.e., through the additional processing of the talker’s gender) from its direct signal-internal effect (i.e., through the co-variation between pitch and voice quality). Therefore, although the co-variation between pitch and voice quality has been found in production studies, it remains to be shown whether such a co-variation relationship also exists in speech pitch perception.

Nonetheless, outside of speech studies, psychoacoustic studies have generally suggested that spectral properties (usually referred to as “timbre” in this body of literature) of the signal can directly interfere with the perception of pitch height (e.g., Melara and Marks, 1990; Krumhansl and Iverson, 1992; Singh and Hirsh, 1992; Allen and Oxenham, 2014; to cite a few). A common finding from these studies is that there are interactions between pitch and timbre in speeded classification tasks. Listeners were instructed to attend to either timbre changes or pitch changes, while both dimensions simultaneously varied. Listeners’ pitch classification was more accurate and faster when the timbre dimension was “congruent” with the F0 dimension. Various types of spectrum have been explored in this body of literature, and have been found to be able to interfere with pitch perception: for example, natural timbres from different musical instruments (e.g., Krumhansl and Iverson, 1992; Marozeau et al., 2003); different values of duty cycles of square waves (Melara and Marks, 1990); the location of the center frequency of harmonic complex tones (e.g., Warrier and Zatorre, 2002; Russo and Thompson, 2005; Silbert et al., 2009; Allen and Oxenham, 2014); and the spectral locus of complex tones (Singh and Hirsh, 1992). Although various types of timbre have been tested, most studies only used non-speech stimuli, while speech-related studies are relatively rare. Therefore, it remains unclear whether spectral information is integrated in speech-related pitch perception as well, and it is possible that listeners ignore spectral cues in speech tasks, as speech is subject to very different neural processing. For example, studies have shown that listeners behave differently when processing speech and non-speech stimuli (e.g., Liberman, 1970; Repp, 1982), and neural imaging studies have, similarly, found that people use different parts of the brain to process linguistic and non-linguistic pitch (e.g., Merrill et al., 2012).

Although speech-related studies on the interaction between spectrum/timbre and pitch are very rare, Stoll (1984) and Krishnan et al. (2011) showed that timbre and pitch are probably integrated in the speech domain as well, since pitch perception is influenced by the manipulation of vowel formants, which is known to influence the overall shape of the spectrum. It is worth pointing out that there is a co-variation between vowel height and F0 in production as well; high vowels are naturally produced with higher F0 (e.g., Whalen and Levitt, 1995). Linguistically meaningful spectral variation is not limited to vowel quality, as other dimensions such as voice quality also significantly affect the shape of the vocal spectrum. Therefore, it remains to be shown what kind of linguistically meaningful spectral variation is integrated into the perception of linguistic pitch targets. Specifically, in this study, we ask whether voice quality can function as an indicator of pitch range and therefore affect the perception of pitch height.

Taken together, in linguistic studies, it remains unclear whether and how voice quality cues interfere with linguistically meaningful pitch perception (e.g., tone perception); in psychoacoustic studies, it remains unclear whether the interaction between timbre and pitch occurs in the domain of speech as well, and if so, whether speech mode plays a role. The present study bridges the gaps in the linguistic and psychoacoustic literature in those respects.

The voice quality cue tested in this study is spectral slope. It has been well established that the relative slope of the voice source spectrum is one of the most important acoustic correlates of voice quality (see Gobl and Ni Chasaide, 2012 for a general review). A relatively steep spectral slope is associated with a breathier voice, and a flat spectral slope is associated with a tenser or creakier voice (the latter also characterized by pulse-to-pulse variability). The spectral tilt is usually measured as the amplitude of the fundamental (H1) relative to some higher-frequency components (e.g., H1-H2, H1-A1, H1-A2, and H1-A3, where A1, A2, and A3 are the amplitudes of the harmonics nearest the first, second, and third formants). These measures have been found to be reliable indicators of phonation contrasts across languages (e.g., Southern Yi: Kuang and Keating, 2014; Green Mong: Andruski and Ratliff, 2000; White Hmong: Esposito, 2012; Takhian Thong Chong: DiCanio, 2009; Sui/Kuai: Abramson et al., 2004; Javanese: Thurgood, 2004; Ju|’hoansi: Miller, 2007; Santa Ana Valle Zapotec: Esposito, 2010; Mazatec: Garellek and Keating, 2011; Gujarati: Khan, 2012), and of voice quality classification in perceptual spaces (e.g., Kreiman et al., 2007, 2014; Garellek et al., 2016). Therefore, the working hypothesis of the current study is that, if voice quality can affect pitch perception, manipulating the spectral slope of a voice should be able to shift listeners’ perception of pitch height. This hypothesis is tested with both non-speech and speech stimuli.

The stimuli in this study were designed to resemble the prosody of natural utterances. The F0 contours (see the Materials and Methods section for details), which contain two F0 peaks, are similar to the design in previous studies on prominence perception (e.g., Terken, 1991; Gussenhoven et al., 1997). One question raised in those studies was how listeners perceive the relative prominence of the two F0 peaks: whether they rely more on the local pitch targets (such as comparison with the other peak) or more on the global pitch range (the overall pitch height of the utterance within the speaker’s range). It was found that both global and local targets play important roles in prominence perception (Gussenhoven et al., 1997). Although our study does not explicitly refer to prominence, a similar question can be examined here: if voice quality does contribute to pitch height normalization, does it contribute to the normalization of the global pitch range or to the normalization of the local pitch targets, and does speech mode play any role in the normalization strategies?

Experiment 1: Pitch Perception With Non-Speech Stimuli

Materials and Methods

Stimuli

Similar to our previous pilot study (Kuang and Liberman, 2015), complex tones varying in pitch and spectral cues were synthesized. The stimuli were four sets of sine-wave overtones with two peaks, which were created by convolving a Hamming window with a sawtooth whose baseline F0 value is always 120 Hz. The pitch contour was designed to simulate the prosody of natural utterances. To manipulate the F0 cues, the F0 of the first peak was always set to 169.34 Hz, while the second peak was a pitch continuum with 11 steps between 153.06 and 187.36 Hz, with an interval of 0.35 semitones. At step 6, peak 1 and peak 2 are identical in F0. The F0 range of these pitch contours roughly covers the upper half of the comfortable pitch range of a male speaker (Baken and Orlikoff, 2000). Pitch manipulation is illustrated in Figure 1.
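The arithmetic of the peak 2 continuum can be checked with a short sketch. It is not the synthesis code used for the stimuli; the function names are hypothetical, and the only assumption is that the 11 steps are equally spaced on a semitone (logarithmic) scale between the two endpoints given above.

```python
import math

PEAK1_F0 = 169.34  # Hz, fixed first peak (value from the text)
LOW_F0 = 153.06    # Hz, lowest peak 2 value (value from the text)
HIGH_F0 = 187.36   # Hz, highest peak 2 value (value from the text)
N_STEPS = 11       # number of continuum steps (value from the text)


def semitones(f_high, f_low):
    """Interval from f_low up to f_high in semitones."""
    return 12.0 * math.log2(f_high / f_low)


def peak2_continuum(low=LOW_F0, high=HIGH_F0, n_steps=N_STEPS):
    """Peak 2 F0 values, assuming equal semitone spacing between the endpoints."""
    step_st = semitones(high, low) / (n_steps - 1)
    return [low * 2.0 ** (i * step_st / 12.0) for i in range(n_steps)]


if __name__ == "__main__":
    for step, f0 in enumerate(peak2_continuum(), start=1):
        offset = semitones(f0, PEAK1_F0)
        print(f"step {step:2d}: {f0:7.2f} Hz ({offset:+.2f} st re peak 1)")
```

Under this assumption the step size works out to about 0.35 semitones, and step 6 falls on the peak 1 value, which agrees with the description above.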

FIGURE 1

To manipulate voice quality-related spectral cues, two source spectra, one with a tilted slope and the other with a flat slope, were first created. In the tilted spectrum, overtone amplitude decreases with a 1/F slope, to a point 15 dB below the fundamental (Figure 2A). As can be seen here, as a result of the tilted slope, the first harmonic is relatively more prominent than the higher-frequency harmonics. By contrast, in the flat spectrum (Figure 2B), the overtone amplitude is kept constant, so the first harmonic is not prominent in the spectrum. In voice quality terms, the flat spectrum, which has more energy in high-frequency harmonics than the tilted spectrum, indicates a tenser voice.
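The contrast between the two source spectra can be illustrated in terms of harmonic levels. The sketch below is one possible reading of the description, not the authors' synthesis procedure: it assumes that the 1/F slope applies to amplitude (about 6 dB per octave) and that the tilted spectrum levels off once it reaches 15 dB below the fundamental. All function names are hypothetical.

```python
import math

# Assumption: the tilted spectrum stops falling 15 dB below the fundamental.
FLOOR_DB = -15.0


def tilted_level_db(k):
    """Level (dB re the fundamental) of harmonic k under an assumed 1/F amplitude slope.

    k = 1 is the fundamental. The level is clipped at FLOOR_DB.
    """
    return max(-20.0 * math.log10(k), FLOOR_DB)


def flat_level_db(k):
    """Level of harmonic k in the flat spectrum: all harmonics equal."""
    return 0.0


def h1_minus_h2(level_db):
    """H1-H2 (dB) for a spectrum given as a harmonic-number -> level function."""
    return level_db(1) - level_db(2)


if __name__ == "__main__":
    for k in range(1, 9):
        print(f"H{k}: tilted {tilted_level_db(k):6.2f} dB, flat {flat_level_db(k):5.2f} dB")
    print(f"H1-H2 tilted: {h1_minus_h2(tilted_level_db):.2f} dB")
    print(f"H1-H2 flat:   {h1_minus_h2(flat_level_db):.2f} dB")
```

Under these assumptions, the tilted spectrum has a positive H1-H2 (a more prominent first harmonic, as in breathier phonation), whereas the flat spectrum has an H1-H2 of zero, in line with the tenser percept described above.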

FIGURE 2

The two types of source spectrum were then applied to the two peaks of the complex tones, resulting in four spectral conditions, as summarized in Table 1. Intended voice quality combinations are indicated in relative terms.

Table 1

Set	Peak1 spectrum	Peak2 spectrum	Intended VQ combination
Set BB	Tilted	Tilted	Breathier + Breathier
Set TT	Flat	Flat	Tenser + Tenser
Set BT	Tilted	Flat	Breathier + Tenser
Set TB	Flat	Tilted	Tenser + Breathier

Summary of manipulations of the stimuli.

Tilted spectrum, original spectrum; flat spectrum, boosted spectrum.

Therefore, there were 44 distinct stimuli (11 F0 steps × 4 spectral conditions) in total. All stimuli were 1 s in duration.

Procedure

A forced-choice pitch classification task was used to test how listeners categorize pitch values in different spectral conditions. Ten copies of each stimulus were presented in random order to each listener. For each trial, the listeners were asked to attend to pitch, and judge whether the second peak is higher or lower than the first peak by clicking on the corresponding buttons on the computer screen. All testing took place in a soundproof booth with stimuli presented over Sennheiser 280 headphones.
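The trial structure implied by this procedure (44 stimuli, ten copies each, random order) can be sketched as follows. The presentation software actually used is not described here, so the function name and the use of a seeded random generator are illustrative assumptions.

```python
import random
from itertools import product

CONDITIONS = ("BB", "TT", "BT", "TB")  # spectral conditions from Table 1
F0_STEPS = range(1, 12)                # peak 2 continuum steps 1-11
REPETITIONS = 10                       # copies of each stimulus per listener


def build_trial_list(seed=None):
    """Return a shuffled list of (condition, f0_step) trials for one listener."""
    rng = random.Random(seed)  # a per-listener seed is an assumption for reproducibility
    stimuli = list(product(CONDITIONS, F0_STEPS))  # 44 distinct stimuli
    trials = stimuli * REPETITIONS                 # ten copies of each stimulus
    rng.shuffle(trials)
    return trials


if __name__ == "__main__":
    trials = build_trial_list(seed=1)
    print(len(trials), "trials; first five:", trials[:5])
```

With these settings, each listener would complete 440 trials in total.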

Subjects

Fifty-eight participants, aged between 18 and 22 (half female), were recruited from the student population at the University of Pennsylvania. All of them reported speaking English as their primary language. None of them reported having received extensive musical training. Three of them failed to complete the task as instructed (i.e., clicked on the same answer for all trials), and thus were excluded from the analysis. None of the participants reported having hearing issues.

Predictions

Figure 3 depicts the predictions of the experiment. As shown in Figure 3A, if listeners do not pay attention to spectral cues, there should be no shift in the pitch classification function. On the other hand, if listeners do pay attention to spectral cues, there should be a significant shift in the pitch classification function, as indicated in Figure 3B. Set BT (tilted/breathier + flat/tenser) would receive the most “peak 2 is higher” responses, while set TB would receive the fewest “peak 2 is higher” responses. Note that, despite the way Figure 3 is plotted, we do not assume categorical perception in the pitch classification.

FIGURE 3

Results

Figure 4 shows the proportion of “peak 2 is higher” responses across all listeners. The main effects of spectral conditions were evaluated using an MCMC generalized linear mixed-effects model (MCMCglmm package in R). F0 steps (1–11) and spectral conditions (BT, BB, TT and TB) were the fixed factors, and random intercepts and slopes were included for subjects. Main effects of spectral conditions are summarized in Table 2. The results are reported as means of regression coefficients, followed by 95% highest posterior density intervals in square brackets and associated p-values. As shown in Table 2, significant effects were found between every two spectral conditions, which means that the pitch classification function is significantly shifted in each spectral condition. The proportion of “peak 2 is higher” responses was in the order of Set BT (tilted/breathier + flat/tenser) > Set TT (flat/tenser + flat/tenser) > Set BB (tilted/breathier + tilted/breathier) > Set TB (flat/tenser + tilted/breathier; see Figure 3).
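The descriptive quantity plotted in Figure 4, the proportion of “peak 2 is higher” responses per spectral condition and F0 step, can be computed as in the sketch below. This covers only the descriptive summary and not the mixed-effects model, which was fitted in R. The input format (one tuple per trial) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def proportion_peak2_higher(responses):
    """Proportion of "peak 2 is higher" responses per (condition, f0_step) cell.

    responses: iterable of (condition, f0_step, chose_peak2_higher) tuples,
    pooled across listeners. This trial format is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n "higher" responses, n trials]
    for condition, step, chose_higher in responses:
        cell = counts[(condition, step)]
        cell[0] += int(bool(chose_higher))
        cell[1] += 1
    return {key: higher / total for key, (higher, total) in counts.items()}
```

Called on the pooled trial records, the function returns a dictionary keyed by (condition, step), from which one classification curve per spectral condition could be drawn.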

FIGURE 4

Table 2

	BB	TT	BT
TT	1.3[1.2,1.5], p < 0.001
BT	1.7[1.6,1.8], p < 0.001	0.4[0.3,0.6], p < 0.001
TB	0.4[0.3,0.5], p < 0.001	1.8[1.7,2.0], p < 0.001	2.5[2.4,2.7], p < 0.001

Main effects of spectral conditions between every two conditions.

Means of regression coefficients, followed by 95% highest posterior density intervals in square brackets and associated p-values.

Overall, the perception of pitch height was strongly biased by the spectral cues. As can be seen in Figure 4, compared to set BB, the pitch classification function for set BT (breathier + tenser) was dominated by “peak 2 is higher” responses, even when peak 2 was about 10 Hz lower than peak 1. By contrast, the pitch classification function for set TB (tenser + breathier) was shifted in the opposite direction. In this condition, listeners hardly ever heard a higher peak 2, even when peak 2 was about 10 Hz higher than peak 1. In other words, when the second peak was tenser than the first peak, listeners tended to perceive a higher pitch, and when the second peak was breathier than the first peak, they tended to perceive a lower pitch. Interestingly, the pitch classification functions for set BB (breathier + breathier) and set TT (tenser + tenser) were also significantly different, with set TT more in favor of “peak 2 is higher”. This suggests that listeners were also sensitive to the overall “voice quality” of the utterances.