Vocal Age Disguise: The Role of Fundamental Frequency and Speech Rate and Its Perceived Effects

Skoog Waller, Sara; Eriksson, Mårten

doi:10.3389/fpsyg.2016.01814

ORIGINAL RESEARCH article

Front. Psychol., 21 November 2016

Sec. Psychology of Language

Volume 7 - 2016 | https://doi.org/10.3389/fpsyg.2016.01814

Sara Skoog Waller^*

Mårten Eriksson

Department of Social Work and Psychology, Faculty of Health and Occupational Studies, University of Gävle, Gävle, Sweden

The relationship between vocal characteristics and perceived age is of interest in various contexts, as is the possibility to affect age perception through vocal manipulation. A few examples of such situations are when age is staged by actors, when ear witnesses make age assessments based on vocal cues only or when offenders (e.g., online groomers) disguise their voice to appear younger or older. This paper investigates how speakers spontaneously manipulate two age related vocal characteristics (f₀ and speech rate) in an attempt to sound younger versus older than their true age, and whether the manipulations correspond to actual age related changes in f₀ and speech rate (Study 1). Further aims of the paper are to determine how successful vocal age disguise is by asking listeners to estimate the age of generated speech samples (Study 2) and to examine whether or not listeners use f₀ and speech rate as cues to perceived age. In Study 1, participants from three age groups (20–25, 40–45, and 60–65 years) agreed to read a short text under three voice conditions. There were 12 speakers in each age group (six women and six men). They used their natural voice in one condition, attempted to sound 20 years younger in another and 20 years older in a third condition. In Study 2, 60 participants (listeners) listened to speech samples from the three voice conditions in Study 1 and estimated the speakers’ age. Each listener was exposed to all three voice conditions. The results from Study 1 indicated that the speakers increased fundamental frequency (f₀) and speech rate when attempting to sound younger and decreased f₀ and speech rate when attempting to sound older. Study 2 showed that the voice manipulations had an effect in the sought-after direction, although the achieved mean effect was only 3 years, which is far less than the intended effect of 20 years. Moreover, listeners used speech rate, but not f₀, as a cue to speaker age. It was concluded that age disguise by voice can be achieved by naïve speakers even though the perceived effect was smaller than intended.

Introduction

The human voice changes from childhood and throughout an individual’s lifespan because of biochemical and physiological changes affecting the speech mechanism, as well as a result of sociolinguistic influence. Regularities in this variation allow listeners to make fairly accurate assessments of a speaker’s age from his or her voice and may also be used by speakers to give the impression of being younger or older than they actually are. Listeners generally rely on several voice parameters in their age estimates. For example, jitter, shimmer, noise and tremor have been found to influence estimation of speaker age (Brückl and Sendlmeier, 2003; Schötz, 2006; Harnsberger et al., 2008), yet fundamental frequency (f₀) and speech rate are widely accepted as being particularly important (e.g., Linville, 1996; Harnsberger et al., 2008; Skoog Waller et al., 2015). However, it is unknown if f₀ and speech rate are actually modulated when speakers try to sound either younger or older, and if so, whether manipulations in f₀ and speech rate correspond to actual age related changes in the same voice parameters.

Speech rate decreases with age for both female and male speakers (e.g., Harnsberger et al., 2008; Skoog Waller et al., 2015) while changes in f₀ look different in male speakers compared to female speakers. For female speakers, f₀ does not change much until the menopause after which a drop occurs. In contrast, f₀ in aging male speakers follows a U-function being lowest between 40 and 50 years, reaching the level of 20–30 years at age 60–70 years, and then continues to rise (see review by Linville, 1996).

Listeners are relatively accurate in estimating speaker age. Several studies (Shipp and Hollien, 1969; Huntley et al., 1987; Neiman and Applegate, 1990; Braun, 1996; Brückl and Sendlmeier, 2003) have reported robust correlations (0.70–0.90) between estimated speaker age and the chronological age of the speakers. One factor leading to imprecise estimations is a bias toward the mean population age. Older speakers are regularly estimated as younger than they actually are while younger speakers are estimated as older than they are (see Shipp and Hollien, 1969; Hollien and Tolhurst, 1978; Huntley et al., 1987; Braun, 1996; Braun and Cerrato, 1999; Brückl and Sendlmeier, 2003; Skoog Waller et al., 2015).

Individuals may want to sound younger or older for numerous reasons. Actors on stage, in film and other media incessantly make portrayals in relation to the spectrum of age that draw on beliefs about vocal aging (Marshall and Lipscomb, 2010). In this context it is of value to understand how certain voice characteristics are related to perceived age.

For young asylum seekers age estimation is often a more fateful matter because special laws regulate the rights for admittance of minors. However, the age estimations are based on uncertain methods (Sauer et al., 2016) and the final decision is an overall assessment from various sources.

In the daily life of most people, age assessments are made in judgements and descriptions of speakers based on their voices. Such descriptions are also frequently made by victims and witnesses of crime who have encountered perpetrators under poor visual conditions (Yarmey et al., 1996; Yarmey, 2001, 2004). Testimonies may be based on observations made in the dark, or the perpetrator may have hindered the victim or witness from seeing him by using force or by wearing some kind of mask. Some descriptions are based solely on acoustic information, e.g., when a perpetrator has not been observed visually but heard over the phone. Witnesses often provide assessments about the age of unknown perpetrators and such information can indeed be valuable in crime investigations. It is therefore important for law enforcers to have knowledge about the grounds on which age estimations are made (such as the relation between specific voice parameters and age estimates) and how precise estimations can be expected to be.

In some forensic cases interception may be performed to provide voice recordings that can be used to identify criminals through forensic voice analysis. In other cases identification may be achieved by ear witnesses. In either case, voice identification is subject to error at a relatively high rate (Boë, 2000) and may often be further afflicted by the fact that criminals frequently disguise their voices in order to obstruct identification (Reich and Duke, 1979; Orchard and Yarmey, 1995; Boë, 2000; Neuhauser, 2008; Suneetha, 2013). Voice disguise can be performed in various ways, some of them with the help of electronic devices, others by mechanical means such as putting a handkerchief or the hand over the mouth or pinching the nostrils (Perrot and Chollet, 2012). Künzel (2000) notes that 15–25% of the cases processed at the speaker identification section at BKA (the German Federal Police Office) contained common non-instrumental forms of vocal disguise including whisper, falsetto, quirky voice, imitation of dialect or foreign accent and age disguise with the intention to sound younger or older. Vocal age disguise is sometimes performed by online groomers when telephone contact is established between a groomer and a victim (e.g., Whittle et al., 2013).

In online grooming cases and similar crimes with the intention to abuse minors, the interest is primarily that of adults and older people to sound younger than their true age. However, there is reason to believe that older speakers are not as skilled as young speakers in modulating their voices due to physiological changes such as increased stiffness of vocal cord tissues. For example, older language learners usually have a more pronounced accent than younger ones (Stevens, 1999; Piske et al., 2001). Identification of voice parameters that are resistant to disguise would be of value for crime investigations.

Many recent studies on the effects of voice disguise concern the design of automatic speaker recognition systems to be used by the police (e.g., Perrot and Chollet, 2008; Zhang and Tan, 2008; Wu et al., 2014). However, such systems can never replace human perception in a witness situation because they require recording of the offenders’ voice, which is not always possible. Hence, effects of disguise on human perception will always be important. The effects of voice disguise on estimations of speaker age have previously been studied by Lass et al. (1982). Their study was based on young adults attempting to disguise their true age by sounding younger or older. Small differences in perceived age in the attempted directions were described although no inferential statistics were reported and no description of how (in terms of speech parameters) the voices were changed was given. No more recent study has investigated age disguise by vocal manipulation although the application of such research is more current today than 30 years ago due to recent phenomena such as online grooming.

The purpose of the present research was to extend the study of Lass et al. (1982) in several ways. In a first study (Study 1), we analyzed how women and men from various age groups spontaneously manipulate two of the most important age related voice parameters (f₀ and speech rate) when instructed to disguise their voice to sound younger versus older and if the manipulations corresponded to actual age related changes in f₀ and speech rate. The purpose of Study 2 was to examine the effects of vocal age disguise on perceived age. The study of Lass et al. (1982) was extended by including speakers from three age groups. Finally, the direct effects of f₀ and speech rate on estimated age were examined in a cross-study analysis which also allowed us to investigate the relative contribution of each parameter.

Study 1

The purpose of the first study was to investigate how female and male speakers from various age groups spontaneously manipulate f₀ and speech rate when instructed to sound younger or older, and if the direction of the manipulations would correspond to the direction of actual age related changes in f₀ and speech rate in female and male speakers. Speech rate decreases rather continuously with age in both female and male speakers (Harnsberger et al., 2008; Skoog Waller et al., 2015) while f₀ decreases notably after menopause in female speakers and follows a U-function in male speakers, being lowest during middle age (Linville, 1996). Thus, if vocal age disguise imitates actual vocal aging young men could be expected to speak with decreased f₀ to sound older, while middle aged and older men could be expected to increase their f₀ to sound older. To sound younger, on the other hand, middle aged men could be expected to increase f₀ while older men would be expected to decrease f₀.

Method

Participants

Voices from 36 speakers recruited from students and staff at the University of Gävle were used. The speakers were from three age groups: 20–25 years (M = 23.38 years, SD = 1.19), 40–45 years (M = 42.25 years, SD = 3.22) and 60–65 years (M = 62.67 years, SD = 1.87). There were 12 speakers in each age group (six women and six men). All speakers were non-smoking native speakers of Swedish. The studies reported in this paper were conducted in accordance with the Declaration of Helsinki and the ethical guidelines given by the American Psychological Association. All participants (listeners and speakers) were adults and participated with informed consent. The listeners and the speakers signed an information agreement form. The experiment caused no harm to any party, the identity of the participants has been kept confidential, and no conflict of interest can be identified.

Material and Procedure

Speech samples of read speech with duration between 9 and 12 s were recorded in a quiet laboratory setting using a dynamic microphone placed 15 cm from the speaker’s mouth. Participation was rewarded with a movie ticket.

Voice Conditions

The speakers in the two older age groups were instructed to sound around 20 years younger in one condition, to use their natural voice in another condition and to sound around 20 years older in a third condition. We did not include speech samples from speakers in the youngest age group disguised to sound younger because that voice condition would have required the speakers to try to sound like children of 0–5 years of age, which is quite another task than what was required in the other voice conditions. The youngest age group (20–25) was instructed to sound around 20 years older in one voice condition and to use their natural voice in another. Thus, 96 speech samples in all were obtained from the 36 speakers (24 speakers × 3 conditions + 12 speakers × 2 conditions).

Analyses

The voices were edited in Audacity 1.2.6¹. A standard feature was used to compress the dynamic range of the recordings, making the loudest parts softer while keeping the volume of the soft parts the same. The threshold value was set to -12 dB and the ratio was set to 2:1. The speech samples were then normalized for intensity by setting the maximum intensity of all the samples to the same value. The acoustic analyses on speech rate and fundamental frequency (f₀) were made in Praat 5.4.06², a software tool for analyzing, synthesizing and manipulating speech.
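The two editing steps described above (compression at a -12 dB threshold with a 2:1 ratio, followed by normalization to a common maximum) can be illustrated with a minimal sketch. It assumes a simplified, memoryless compressor applied sample by sample to amplitudes in the range [-1, 1]; Audacity's own compressor works on a signal envelope with attack and decay settings that are not reported here, so this is not a reimplementation of it. The function names and the `target_peak` value are hypothetical, since the common maximum used in the study is not stated.

```python
import math


def compress_sample(x, threshold_db=-12.0, ratio=2.0):
    """Static downward compression of one sample in [-1, 1] (simplified sketch)."""
    if x == 0.0:
        return 0.0
    level_db = 20.0 * math.log10(abs(x))
    if level_db <= threshold_db:
        return x  # soft parts keep their original level
    # Above the threshold, the excess level is divided by the ratio.
    out_db = threshold_db + (level_db - threshold_db) / ratio
    return math.copysign(10.0 ** (out_db / 20.0), x)


def peak_normalize(samples, target_peak=1.0):
    """Scale a sample so that its maximum absolute amplitude equals target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def process(samples, threshold_db=-12.0, ratio=2.0, target_peak=1.0):
    """Compress, then normalize. target_peak is an assumed value, not from the paper."""
    compressed = [compress_sample(s, threshold_db, ratio) for s in samples]
    return peak_normalize(compressed, target_peak)
```

With these settings a full-scale sample (0 dB) is brought down to -6 dB before normalization, whereas a sample at -20 dB is left untouched by the compression step.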

The data were computed and analyzed in SPSS 22.0 using mixed analysis of variance (ANOVA) models. Post hoc analyses were computed using the Bonferroni correction and the level of significance was set at 0.05. Because the study design did not include young speakers seeking to sound younger, two analyses were performed on fundamental frequency and speech rate respectively. The first included three voice conditions (young, natural, old) as a within-subject variable and two age groups (40–45, 60–65 years) as a between-subjects variable. The second analysis consisted of two voice conditions (natural, old) and three age groups (20–25, 40–45, 60–65 years). Sex of the speaker was included in both analyses because it is known that voices of women are higher than those of men (e.g., Titze, 1994). Mauchly’s test of sphericity indicated that the assumption of sphericity had not been violated (W > 0.90).
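As an illustration of the post hoc procedure, the Bonferroni correction can be sketched as follows. With three voice conditions there are three pairwise comparisons, so each raw p-value is multiplied by three (equivalently, compared against 0.05/3). The p-values in the example are purely illustrative and are not taken from the study; the function names are ours.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which comparisons remain significant after the correction."""
    return [p_adj < alpha for p_adj in bonferroni_adjust(p_values)]


# Hypothetical raw p-values for three pairwise comparisons
# (young vs. natural, natural vs. old, young vs. old); not data from the paper.
example_p_values = [0.004, 0.012, 0.030]
example_flags = significant_after_bonferroni(example_p_values)
```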

Results and Discussion

Fundamental Frequency

Mean and standard deviation of f₀ for women and men over voice conditions and age groups are shown in Table 1. The mean f₀ was about the same for female voices between 20–25 and 40–45 years in the natural condition but lower for female voices 60–65 years. This change in female voices would be expected from the description of Linville (1996). However, f₀ for the male speakers in the natural condition followed an inverted U-function with the men 40–45 years at the peak which is contrary to the development described by Linville (1996). Yet, this comparison is between groups and might be due to individual variation. Importantly though, both female and male speakers raised f₀ when disguised as younger and lowered f₀ when disguised as older.

TABLE 1. Mean and standard deviation of voice parameters for 18 female voices and 18 male voices over conditions and age groups in Study 1.

The pattern in f₀ observed from Table 1 was supported by a 3 × 2 × 2 mixed analysis of variance with voice condition (young, natural, old) as the within-subject variable and speaker age group (40–45, 60–65 years) and sex (female, male) as the between-subjects variables. The analysis revealed main effects of voice condition and sex but no interaction effects. Hence, speakers did only to some extent manipulate f₀ in directions corresponding to actual age related changes in f₀. Speakers used higher f₀ (M = 166.19 Hz, SD = 47.79) to sound younger compared with their undisguised voice (M = 156.40 Hz, SD = 41.95) and lower f₀ to sound older (M = 145.20 Hz, SD = 40.77, F[2,40] = 16.68, p < 0.001, MSE = 158.76, $η_{p}^{2}$ = 0.46; both differences were verified by a post hoc test using the Bonferroni correction, p < 0.05), which corresponds to the direction of actual f₀ change in female but not entirely in male speakers. As expected, the voices of female speakers (M = 193.75 Hz, SD = 19.57) were higher-pitched than those of male speakers (M = 118.11 Hz, SD = 15.88, F[1,20] = 105.67, p < 0.001, MSE = 974.66, $η_{p}^{2}$ = 0.84).

The results above were supported by a 2 × 3 × 2 ANOVA with voice condition (natural, older) as the within-subject variable and age group (20–25, 40–45, 60–65 years) and sex (female, male) as the between-subjects variables. The analysis again revealed a main effect of voice condition (F[1,30] = 14.57, p = 0.001, MSE = 113.72, $η_{p}^{2}$ = 0.33) but no interactions. The speakers used a lower f₀ when disguised as old compared with the natural voices. Women’s f₀ was also higher than that of men (F[1,30] = 194.37, p < 0.001, MSE = 553.80, $η_{p}^{2}$ = 0.87). Neither analysis yielded a main effect of age group (Figure 1).

FIGURE 1. Change in fundamental frequency (f₀) when women and men disguise their voice. Error bars indicate the standard error of the mean (SEM).
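The effect sizes reported for the f₀ analyses can be cross-checked against the corresponding F statistics. The sketch below assumes the standard identity for partial eta squared, F × df_effect / (F × df_effect + df_error); the paper does not state how the values were obtained, and the function and variable names are ours. Under that assumption, the recomputed values agree with the reported ones to within about 0.01, a difference attributable to rounding of the published F values.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    ss_ratio = f_value * df_effect  # equals SS_effect / MS_error
    return ss_ratio / (ss_ratio + df_error)


# (label, F, df_effect, df_error, value reported in the text) for the f0 analyses
REPORTED_F0_EFFECTS = [
    ("voice condition, 3 x 2 x 2 analysis", 16.68, 2, 40, 0.46),
    ("sex, 3 x 2 x 2 analysis", 105.67, 1, 20, 0.84),
    ("voice condition, 2 x 3 x 2 analysis", 14.57, 1, 30, 0.33),
    ("sex, 2 x 3 x 2 analysis", 194.37, 1, 30, 0.87),
]


def recompute(effects):
    """Return (label, recomputed value, reported value) for each effect."""
    return [
        (label, partial_eta_squared(f, df1, df2), reported)
        for label, f, df1, df2, reported in effects
    ]


if __name__ == "__main__":
    for label, value, reported in recompute(REPORTED_F0_EFFECTS):
        print(f"{label}: recomputed {value:.3f}, reported {reported:.2f}")
```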

Speech Rate

Mean and standard deviation of speech rate for women and men over voice conditions and age groups are shown in Table 1. Both female and male speakers spoke faster when disguised as younger and slower when disguised as older. This was first confirmed by a 3 × 2 × 2 mixed analysis of variance with voice condition (young, natural, old) as the within-subject variable and speaker age group (40–45, 60–65 years) and sex (female, male) as the between-subjects variables. The analysis demonstrated a main effect of voice condition. Speakers spoke faster (M = 4.37 syll/s, SD = 0.61) when disguised as younger as compared with their natural voices (M = 3.82 syll/s, SD = 0.50) and slower when disguised as older (M = 3.12 syll/s, SD = 0.76, F[2,60] = 47.68, p < 0.001, MSE = 0.189, $η_{p}^{2}$ = 0.71; both differences were verified by a post hoc test using the Bonferroni correction, p < 0.05). There was also a main effect of age such that speakers aged 40–45 years spoke faster than speakers aged 60–65 years (F[2,30] = 3.98, p < 0.029, MSE = 0.791, $η_{p}^{2}$ = 0.21). Finally, there was also an interaction between voice condition and age group (F[2,60] = 5.32, p = 0.001, MSE = 0.184, $η_{p}^{2}$ = 0.26) indicating that speakers aged 40–45 years increased their speech rate more when attempting to sound younger compared to the speakers 60–65 years old.

The results were further supported by a 2 × 3 × 2 mixed ANOVA with voice condition (natural, older) as the within-subject variable, and age group (20–25, 40–45, 60–65 years) and sex (female, male) as between-subjects variables. There was a main effect of voice condition (F[1,30] = 45.07, p < 0.001, MSE = 0.142, $η_{p}^{2}$ = 0.60) but no significant interactions. Speakers spoke slower when attempting to disguise their voice to sound 20 years older (M = 3.38 syll/s, SD = 0.77) compared with no disguise (M = 3.97 syll/s, SD = 0.56). Thus, speakers manipulated speech rate in the direction corresponding to actual age related change. Inclusion of the younger age group led to a main effect of age group (F[2,30] = 6.18, p = 0.006, MSE = 0.613, $η_{p}^{2}$ = 0.98). Speakers aged 20–25 years spoke faster (M = 4.09 syll/s, SD = 0.42) than speakers aged 60–65 years (M = 3.29 syll/s, SD = 0.69) as confirmed by a post hoc test using the Bonferroni correction (Figure 2).

FIGURE 2. Change in speech rate (syllables/s) when speakers from three age groups disguise their voice. Error bars indicate the standard error of the mean (SEM).

In sum, subjects increased f₀ and speech rate as compared with their natural voice when trying to sound younger, whereas they decreased f₀ and speech rate when trying to sound older. No interaction between f₀ and age or between speech rate and age could be verified. The change in speech rate was larger than the change in f₀, as indicated by the effect sizes. In addition, differences in speech rate were found between the speakers as a function of their age. No effects of chronological age were revealed for f₀, although f₀ was sensitive to the sex of the speaker.

Study 2

The purpose of the second study was to investigate how successful the voice disguise from Study 1 was by asking naïve listeners to estimate the speakers’ age. The study of Lass et al. (1982) was extended by including voices from three age groups. We expected to replicate Lass et al.’s (1982) finding that young speakers are able to manipulate their voices to sound older. However, we believed that middle aged and older speakers would be less successful than young speakers at disguising their voices to sound younger or older. Because f₀ lies in a different range for women than for men, we also asked whether women and men were equally good at modifying their voices to sound a different age. Finally, we asked whether disguising the voice to sound younger was as effective as disguising the voice to sound older.