ORIGINAL RESEARCH article

Front. Comput. Sci., 03 July 2023
Sec. Human-Media Interaction
Volume 5 - 2023 | https://doi.org/10.3389/fcomp.2023.1204211

Comparing alignment toward American, British, and Indian English text-to-speech (TTS) voices: influence of social attitudes and talker guise

  • Phonetics Lab, Department of Linguistics, University of California, Davis, Davis, CA, United States

Text-to-speech (TTS) voices, which vary in their apparent native language and dialect, are increasingly widespread. In this paper, we test how speakers perceive and align toward TTS voices that represent American, British, and Indian dialects of English, and the extent to which social attitudes shape patterns of convergence and divergence. We also test whether top-down knowledge of the talker, manipulated as a “human” or “device” guise, mediates these attitudes and accommodation. Forty-six American English-speaking participants completed identical interactions with 6 talkers (2 from each dialect) and rated each talker on a variety of social factors. Accommodation was assessed through AXB perceptual similarity judgments from a separate group of raters. Results show that speakers had the strongest positive social attitudes toward the Indian English voices and converged toward them the most. Conversely, speakers rated the American English voices as less human-like and diverged from them. Finally, speakers overall showed more accommodation toward TTS voices that were presented in a “human” guise. We discuss these results through the lens of Communication Accommodation Theory (CAT).

1. Introduction

Linguistic accommodation is a phenomenon in which an interlocutor converges toward or diverges from another interlocutor's speech patterns (also known as alignment, mirroring, and imitation). According to Communication Accommodation Theory (CAT; Giles, 1973; Giles et al., 1991), accommodation is a strategic process that speakers use to serve both functional and social purposes. For example, converging toward another speaker can facilitate comprehension between two interlocutors (Audience Design, Clark and Murphy, 1982; Street and Giles, 1982; Thakerar et al., 1982; see also Interactive Account, Garrod and Pickering, 2007). This convergence happens for various linguistic features, such as vowel quality (Babel, 2010, 2012; Pardo et al., 2010; Walker and Campbell-Kibler, 2015), prosody (Bosshardt et al., 1997; D'Imperio and German, 2015; D'Imperio and Sneed, 2015), or syntax (Bock, 1986; Weatherholtz et al., 2014). Convergence can also create or maintain positive social ties and signal in-grouping with another interlocutor (see also Similarity Attraction Theory, Byrne, 1971; Giles et al., 1987). Multiple social factors have been shown to mediate phonetic alignment, including gender (Namy et al., 2002; Pardo, 2006), perceived attractiveness (Babel, 2012; Michalsky and Schoormann, 2017), and conversational roles (Pardo, 2006; Pardo et al., 2010; Zellou et al., 2021b). In many cases, linguistic and social factors interact to create more nuanced patterns of phonetic alignment (Babel, 2010; Walker and Campbell-Kibler, 2015).

The current study focuses on inter-dialectal phonetic accommodation, or the extent to which a speaker of one variety converges toward or diverges from the acoustic-phonetic features produced by a speaker of another dialect. Dialects differ from one another both in linguistic features, such as phonetic distance, and in the social attitudes they evoke, providing the opportunity to investigate the impact of phonetic and social factors on alignment patterns. This is an active area of research (Babel, 2010, 2012; Kim et al., 2011; Rao, 2013; Chakrani, 2015; Walker and Campbell-Kibler, 2015; Ross et al., 2021). Previous work in this area has found that phonetic distance between dialects can be a strong predictor of alignment patterns; however, the direction of the effect is mixed across studies. On the one hand, some studies have observed stronger convergence toward dialects that are more similar to the speaker's dialect, i.e., that have smaller phonetic distances (Kim et al., 2011; Rao, 2013). For example, Kim et al. (2011) studied convergence patterns between same- and different-dialect pairs and found that same-dialect pairs showed more convergence than different-dialect pairs, which the authors interpret as evidence that large phonetic distances discourage alignment. On the other hand, other studies have found that larger phonetic distances encourage alignment (Babel, 2010, 2012; Walters et al., 2013; Walker and Campbell-Kibler, 2015). For example, Walker and Campbell-Kibler (2015) also conducted a shadowing task with same- and different-dialect pairs. Their results showed that shadowers converged more toward model talkers whose dialects had a larger phonetic distance from their own; additionally, shadowers converged more on lexical items whose vowels had greater variability between dialects. It is important to note that Walker and Campbell-Kibler (2015) used a difference-in-difference (DID) measure, which has been shown to be limited in its approach (e.g., Cohen Priva and Sanker, 2019); however, further analyses have found DID estimates to be useful when used alongside more holistic measures (Ross et al., 2021).

Moreover, some work has shown stronger convergence toward interlocutors who speak language varieties deemed more socially favorable (Chakrani, 2015), and in many cases, these social attitudes can mediate convergence motivated by linguistic factors (Babel, 2010; Weatherholtz et al., 2014; Clopper and Dossey, 2020; Ross et al., 2021). For instance, in a case study of speech in a natural conversational setting, Chakrani (2015) found overall convergence among Arabic speakers toward speakers of prestigious Mashreqi (Middle Eastern) dialects, and divergence away from speakers of non-prestigious Maghrebi (North African) dialects. Notably, convergence and divergence shifted over time throughout the course of the interactions, and divergence was triggered when social conversational norms were not followed.

Social perceptions additionally affect phonetic alignment patterns in situations where phonetic differences encourage alignment. For example, Babel (2010) studied convergence by New Zealand English (NZE) speakers toward Australian English (AuE) speakers and reported overall convergence; however, the extent of convergence was dependent on both phonetic distance and social factors. On the one hand, participants aligned more to vowels with larger phonetic distances from their own. On the other hand, social factors mediated both vowel-specific alignment and overall alignment. While speakers aligned more to vowels with large phonetic distances, this was only the case for AuE vowels that were not clearly identifiable as AuE by NZE speakers (cf. Hay et al., 2006). Vowels with recognizable differences were not imitated as much, suggesting that the social identities of NZE speakers affected alignment behavior at a subconscious level. NZE speakers' social attitudes toward AuE speakers, as measured by an Implicit Association Task, were also a significant predictor of overall alignment patterns, such that NZE speakers with pro-AuE scores were more likely to converge toward AuE speakers. Ross et al. (2021) further explored this phenomenon by conducting a shadowing task with talkers of the Mid-Atlantic and General American dialects. The authors used target words with phonetic variables differing between the two dialects, as well as words with no distinguishing dialect features as a baseline. Their results showed that dialect-specific features facilitated convergence; however, this convergence was mediated by social beliefs, such that participants did not align toward stigmatized features in the Mid-Atlantic dialect. These results were consistent with previous work on alignment toward stigmatized dialect features (e.g., Clopper and Dossey, 2020).

Recent research has begun to explore the phenomenon of linguistic accommodation in human–computer communication. The Computers Are Social Actors (CASA) theory posits that despite the top-down knowledge that they are communicating with a computer, humans still treat computers as social actors, and behave similarly toward them as they would another person (Nass et al., 1994). A large body of research supports CASA and has shown that humans show similar alignment patterns toward computers as they do in human-human communication (Bell et al., 2003; Branigan et al., 2003; Cohn et al., 2019; Zellou et al., 2021b). These alignment patterns are motivated by linguistic differences and happen at various levels, including syntactically (Branigan et al., 2003; Pearson et al., 2004), phonetically (Cohn et al., 2019; Gessinger et al., 2021; Zellou et al., 2021b), lexically (Branigan et al., 2011; Cowan et al., 2015), and prosodically (Bell et al., 2003; Suzuki and Katagiri, 2007). Current research has further investigated these phenomena by assessing vocal alignment toward voice-enabled digital assistants (voice artificial intelligence or voice-AI), such as Amazon's Alexa and Apple's Siri (Cohn et al., 2019, 2021; Zellou and Cohn, 2020; Zellou et al., 2021b; Aoki et al., 2022), and has found evidence that social factors, such as gender (Cohn et al., 2019; Snyder et al., 2019) and conversational role (Zellou et al., 2021b), additionally affect human-computer alignment.

An open question is whether alignment in human–voice-AI communication directly mirrors alignment patterns in human-human communication, or whether different strategies are applied with a device interlocutor. For example, Cohn et al. (2019) conducted a shadowing task with human and text-to-speech (TTS) model talkers (using Apple's Siri voices) and investigated gender-mediated phonetic alignment for both types of interlocutors. They found that male voices were imitated more than female voices for both human and device model talkers, indicating that similar gender-mediated social patterns of alignment are at play during shadowing with device talkers as with human talkers. However, gender had a larger effect on alignment toward human voices than alignment toward device voices, leading the authors to conclude that computers are not treated identically to other humans. This conclusion raised further questions about these differing alignment patterns: were participants showing different alignment patterns toward TTS voices due to the synthetic acoustic features of their voices, or due to the top-down knowledge that they are a device? Zellou and Cohn (2020) explored this question by presenting human and TTS voices in guises (e.g., a TTS voice presented as a device or as a human) in a shadowing task. Their results demonstrated that speakers were more likely to align in vowel duration toward TTS voices when they were presented in a device guise. They also found, however, that voice type overall (human or device) was not a significant predictor of alignment patterns, suggesting that the acoustic differences between human and TTS voices were not the main driver of differences in alignment patterns. Other work has shown that people have distinct expectations about the communicative competence of technology; for example, participants explicitly rate a TTS voice as less competent and less human-like than a human voice (Cohn et al., 2022), and more robotic TTS voices as less competent, relative to more human-like TTS voices (Cowan et al., 2015; Zellou et al., 2021a). Additionally, even with identical audio, the apparent guise of a talker (cued by an image of a human or device silhouette) can lead to worse performance on a speech-in-noise task (Aoki et al., 2022). Together, these findings suggest that top-down information about the speaker could shape communicative pressures in an interaction, such as leading to greater alignment toward apparent device addressees.

In many ways, interacting with spoken technology might parallel cross-dialectal communication; it is not uncommon for users to select TTS voices from other dialects (e.g., American users choosing a British English Siri voice; Bilal and Barfield, 2021). Some prior work has examined cross-dialectal perceptions of these voices. For example, Tamagawa et al. (2011) tested how NZ English speakers' attitudes toward TTS voices in different dialects of English affected their overall rating of the quality of care in a healthcare setting. The authors found that US voices were rated as more robotic than NZ or British voices and received lower performance ratings. The authors take this to be evidence that people will have lower-quality experiences with robots whose voices are rated as more robotic. They additionally hypothesized that the NZ voice was preferred due to its attractiveness as a local accent, and predicted that humans prefer TTS voices that are similar to their own accent, consistent with CAT. In another study, Cowan et al. (2015) tested lexical alignment between Irish English speakers and US and Irish English human and TTS voices in a picture-naming task. Irish English-speaking participants were assigned either a human or computer partner who spoke either US or Irish English, and the authors found that participants were more likely to select US lexical items when interacting with a US English-speaking partner, rather than selecting their standard Irish English lexical items. Interestingly, interlocutor type was not a significant predictor of task outcomes, meaning this effect was consistent whether the partner was a human or a computer. The findings from these studies highlight gaps in understanding how dialectal differences mediate patterns of alignment in both human–human and human–technology communication.

1.1. Current study

The current study investigates patterns of phonetic alignment among US English speakers toward TTS voices in different dialects of English. We aim to examine how social factors, specifically dialectal biases, contribute to alignment patterns toward apparent device and apparent human speakers. We conducted an interactive task in which participants produced target words, presented in a list, to an addressee; participants produced each target word after hearing the interlocutor's production of that word. Addressees spoke three dialects of English: General American English, British English, and Indian English. We assessed dialectal biases by collecting social ratings for each voice from the participants after they completed their interaction with each voice. To further explore how top-down knowledge of a talker's guise affects alignment, we presented two versions of our task: one in which all voices were presented in a device guise, and one in which all voices were presented in a human guise. We assessed convergence through an AXB perceptual rating task (Pardo, 2013; Pardo et al., 2017), in which an independent group of participants rated whether a participant's pre- or post-exposure production was more acoustically similar to the model talker's production.

Our study focused on three English dialects: General American English (US), British English (specifically, Received Pronunciation, a formal register of British English; RP), and Indian English (IN). Both RP and IN English differ from US English in vowel quality and vowel length, and IN English additionally differs from US English in the voice-onset time (VOT) of word-initial stops (Awan and Stine, 2011). The target words were chosen to emphasize the phonetic distance from US English to another US speaker (no change), from US to RP (small change), and from US to IN English (larger change) (Bent et al., 2021). Relative to US English, the stimuli differ in either vowel length or quality in RP and/or IN English, and in VOT in IN English (Wells, 1982; Schmitt, 2007; Awan and Stine, 2011). The target words selected for this study are provided in Table 1.

Table 1. Target words and their differing features by dialect.

To test the effect of social perceptions on alignment, we use perceived prestigiousness as a measure of social bias. Previous research has shown that RP English is typically perceived as highly prestigious, while IN English is seen as less prestigious (Giles, 1970, 1973; Coupland and Bishop, 2007). Thus, these three dialects were selected to create a comparison of a prestigious dialect and a non-prestigious dialect against a baseline comparator. Though historically RP has been presented as “prestigious” and IN as “non-prestigious,” we collected speaker-specific prestigiousness attitude ratings to determine our population-specific attitudes toward each dialect. If social biases prove to be strong predictors of alignment, we expect participants to align toward the dialect with the most positive ratings, and potentially diverge from the dialect with the lowest ratings.

For replicability and generalizability, all voices used in the current study were TTS voices generated from widely available systems (Amazon, Google, and Apple). We additionally vary whether the guise of the talker is congruent (shown with an image of a device) or incongruent (shown with an image of a human). If alignment is driven by functional reasons, we expect participants to align the most toward device-guise voices in an effort to communicate more effectively (Cowan et al., 2015; Cohn et al., 2022). Conversely, if alignment is driven by similarity attraction (Byrne, 1971), we might expect participants to align more toward human-guise voices (Gessinger et al., 2021).

In the following sections, we detail a norming study (Experiment 1, Section 2) in which we select the voices, the interactive speech production study (Experiment 2, Section 3), and the perceptual similarity study (Experiment 3, Section 4). Data for all three experiments, as well as supplementary data, are provided in an Open Science Framework repository for the project1.

2. Experiment 1: voice norming study

In order to select the voices to use in our experiment, we conducted a voice norming study online via Qualtrics. Our goal was to identify 6 voices (2 per dialect) with the most salient stereotypical accent (rated as being the “strongest” accent) and similar human-likeness ratings across voices.

2.1. Materials

We tested 9 US voices (4 Amazon, 2 Google, 3 Apple), 6 RP voices (2 Amazon, 2 Google, 2 Apple), and 5 IN voices (2 Amazon, 2 Google, 1 Apple). We used all female voices to control for gender effects in alignment (Namy et al., 2002). We created an audio file for each voice, in which the voice produced a target stimulus in the form of a question (“The word, peak, is what number on your list?”), mirroring the presentation style of the stimuli in the subsequent interactive task. Recordings were produced using the Amazon Polly console through Amazon Web Services (AWS), the Google Actions console, and the command line on an Apple computer. All recordings were amplitude normalized to 65 dB.
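As a rough illustration, the sketch below shows how such a recording could be generated programmatically and scaled to a common level; it assumes the paws AWS SDK and tuneR packages in R, and the file names are hypothetical (the authors used the Polly console, the Google Actions console, and the macOS command line rather than this exact script).

```r
# Sketch: synthesizing one norming stimulus with Amazon Polly (via paws)
# and scaling audio to a fixed RMS level as a stand-in for 65 dB
# amplitude normalization (the exact dB target depends on the reference).
library(paws)   # AWS SDK for R
library(tuneR)  # audio input/output

polly <- paws::polly()
resp <- polly$synthesize_speech(
  Text         = "The word, peak, is what number on your list?",
  VoiceId      = "Salli",    # one of the selected US voices
  Engine       = "neural",
  OutputFormat = "pcm",      # 16-bit mono PCM
  SampleRate   = "16000"
)
writeBin(resp$AudioStream, "salli_peak.pcm")  # hypothetical output path

# Scale a mono tuneR Wave object to a target RMS; a real pipeline would
# also guard against clipping when the gain exceeds 1.
normalize_rms <- function(wave, target_rms = 0.05) {
  s <- wave@left / (2^(wave@bit - 1))   # samples rescaled to [-1, 1]
  gain <- target_rms / sqrt(mean(s^2))  # linear gain toward the target
  wave@left <- as.integer(round(wave@left * gain))
  wave
}
```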

2.2. Participants

Fifty-five participants (48 female, 7 male; mean age: 20.3; SD: ± 2.0) completed the study. All participants were recruited from the University of California, Davis, psychology subjects pool, and received course credit for their participation. All participants reported English as their first language and no hearing impairments. The study was approved by the UC Davis institutional review board (IRB) and subjects completed informed consent before participating.

2.3. Procedure

For each voice, participants listened to the recorded audio file and then rated the voice on three dimensions (accent strength, perceived age, and human-likeness) using a sliding scale from 0 to 100 on which every whole integer was a possible option. Each voice was presented one at a time, and participants rated each voice immediately after exposure. Voices were blocked by dialect and randomly presented within block, and participants rated all voices. Each participant additionally completed several listening comprehension questions, consisting of semantically unpredictable sentences produced by a human at a relatively lower intensity (45 dB), as an attention check.

2.4. Analysis and results

Mean ratings for each voice tested are reported in Supplementary Table 1 in Supplement A, and raw data are available in the OSF repository. Ratings provided by participants who failed the attention check were excluded, and the remaining ratings were averaged for each voice on accent strength, perceived age, and human-likeness. To ensure a strong indication of each voice's dialect, we selected the two TTS voices per dialect with the highest average accent strength ratings. These voices also had roughly similar human-likeness scores, except for the US voices, which scored substantially lower overall on human-likeness (mean rating: 34.1). Given these parameters, we selected AWS Polly neural Salli (accent strength: 65.8; human-likeness: 62.5) and AWS Polly neural Joanna (accent strength: 66.9; human-likeness: 38.5) for the US dialect, AWS Polly Amy (accent strength: 61.0; human-likeness: 68.7) and Google's Google-GB2 (accent strength: 69.5; human-likeness: 52.0) for the RP dialect, and AWS Polly Aditi (accent strength: 58.2; human-likeness: 57.3) and Google's Google-IN1 (accent strength: 71.8; human-likeness: 64.1) for the IN dialect.
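The selection step itself reduces to a small aggregation over the ratings data; a minimal sketch, assuming a long-format data frame with hypothetical column names (not the authors' code):

```r
# Sketch: average ratings per voice, then keep the two most
# accent-salient voices per dialect
library(dplyr)

selected <- ratings %>%
  filter(passed_attention_check) %>%              # drop failed checks
  group_by(dialect, voice) %>%
  summarise(accent    = mean(accent_strength),
            humanlike = mean(human_likeness), .groups = "drop") %>%
  group_by(dialect) %>%
  slice_max(accent, n = 2)   # top two voices per dialect by accent strength
```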

3. Experiment 2: interactive task

Our interactive task was designed to approximate a turn-taking conversation in which speakers repeated a word after hearing a model talker say it, but in a controlled communicative context (e.g., “The word, peak, is what number on your list?” “The word peak is number five.”).

3.1. Materials

Twenty target words were selected using the dialectal criteria discussed in the introduction; namely, we selected words that differed in vowel length, vowel quality, or VOT between two or more dialects. The items, with their lexical sets and targeted differing features by dialect, are listed in Table 1. We specifically selected monosyllabic words with CV(C)C structure that were low-frequency (mean Zipf value: 2.98; Brysbaert and New, 2009), as high-frequency items are less susceptible to imitation. Each stimulus was presented in a sentence (e.g., “The word, beak, is what number on your list?”) to avoid over-emphasis in pronunciation, and to simulate a conversational format for the interactive experiment.
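For reference, a word's Zipf value is the base-10 log of its frequency per billion words, i.e., log10(frequency per million) + 3; a one-line sketch with an assumed SUBTLEX-style frequency table (column names are hypothetical):

```r
# Sketch: compute Zipf values and keep low-frequency candidate words
subtlex$zipf <- log10(subtlex$freq_per_million) + 3
candidates  <- subset(subtlex, zipf < 3.5)  # low-frequency items only
```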

As with the stimuli for the voice norming study, stimuli were created using AWS and the Google Actions console. For each voice, we generated individual recordings of each stimulus within a sentence, along with other conversational snippets to help promote the feeling of an interactive conversation, such as introductory remarks (e.g., “Hello, my name is Rebecca. Let's get started”), using a prototypical, culturally appropriate woman's name for each voice (see Supplementary Table 2 in Supplement B for a full list of utterances). Each voice had 6 possible pseudorandomized lists for stimuli presentation. The final product for each version of each voice was a single audio file that started with an introduction, looped through all 20 stimuli (providing the participant 3 s to respond to each query), and ended with a closing remark to signal the end of the interaction. All stimuli were amplitude normalized to 65 dB.

3.2. Participants

Sixty participants (all female; mean age: 19.3, SD: ± 2.1) completed the study; 30 were assigned to the condition with the human guise, and 30 were assigned to the condition with the device guise in a between-subjects design. All participants were recruited from the University of California, Davis, psychology subjects pool, and received course credit for their participation. Students who participated in the voice norming study (experiment 1) were excluded from participating in the interactive task. All participants reported English as their first language and no hearing impairments. The study was approved by the UC Davis institutional review board (IRB) and subjects completed informed consent before participating.

3.3. Procedure

The experiment was conducted online via Qualtrics, and participants' recordings were captured with Pipe2. Participants were told that they would be taking part in an interactive experiment and would be communicating with a series of speakers. Before the start of the experiment, we asked participants to read a list of 20 sentences, each of which included one of the target words (e.g., “The word beak is a rhyme with seek”). These recordings served as pre-exposure baseline productions for analysis.

To investigate whether apparent humanity affects alignment patterns in human–voice-AI communication, our experiment contained a between-subjects guise manipulation in which each talker was shown either as a smart speaker (device guise) or as a human talker (human guise) (see Figure 1). To elicit this top-down knowledge of interlocutor guise, the silhouette of either a smart speaker or a woman was shown throughout the experimental trials.

Figure 1. The depiction of a device-guise voice (A) and a human-guise voice (B). Participants were told they would be connecting with an interlocutor over Skype to elicit more natural-sounding, conversational interactions.

Each interaction started with a Skype sound to simulate “connecting” with the model talker. After the “connection” was established, the model talker introduced themself and initiated a series of questions (schematized in Figure 2). The talker would first ask a question (e.g., “The word, cog, is what number on your list?”), to which the participant responded using a templated response shown on the screen (e.g., “The word cog is number one.”), with 3 s to respond. Participants were explicitly instructed to read the templated response provided on the screen to ensure that they used the stimulus in their response. The talker would verbally acknowledge the response (e.g., “Awesome”), then move on to the next question. This question-and-response sequence was repeated once for each stimulus (20 in total). Participants interacted with all 6 voices for a total of 120 post-exposure productions (20 stimuli × 6 voices), interacting with each voice for approximately 4 min. Voices were blocked by dialect and randomly presented within block, and dialect blocks were randomly ordered.

Figure 2. An example of the templated responses presented to each participant during the interactive task (in addition to the depiction of the voice shown in Figure 1), along with a sample interaction.

At the end of each interaction, a new block of questions appeared in which participants were asked about their experience with the talker. They were instructed to “please answer the following questions about your experience with [name]”, where [name] was the assigned name of the voice. First, they were asked to identify the talker's nationality (Where do you think [name] was from?) from a set of options3; this question was designed to test whether participants could identify the regional origin of the speaker's dialect. Next, participants provided ratings using sliding scales (0–100, where every whole integer was a possible option) to assess other socio-indexical features of the talker's voice. Based on the literature examining prestige in human-human interaction (e.g., Cargile and Bradac, 2001; Fuertes et al., 2012; McCullough et al., 2019), we asked participants to rate the talker's perceived intelligence [How intelligent did (name) seem? (0 = unintelligent to 100 = intelligent)] and perceived socioeconomic status [What do you think is (name)'s socioeconomic status? (0 = poor to 100 = wealthy)]. Socioeconomic status was included in order to gauge potential social biases toward speakers of different dialects of English (Dragojevic, 2017), such that low socioeconomic status scores would suggest more negative biases toward a speaker. In order to investigate social closeness, a factor shown to influence alignment (e.g., Giles et al., 1991), we also asked them to rate the talker's friendliness [How friendly was (name)? (0 = unfriendly to 100 = very friendly)]. Finally, as all voices were TTS voices, we asked them to rate the naturalness/human-likeness of the talker's voice [How natural does (name) sound? (0 = robotic to 100 = natural)]. Naturalness was included to investigate whether voices presented in an inauthentic guise, as a human, were rated as less natural than those presented in an authentic guise (cf. Zellou and Cohn, 2020). Furthermore, previous literature has shown that speakers have more positive attitudes toward TTS voices that are rated as less robotic (Tamagawa et al., 2011).

3.4. Social ratings analysis and results

Participants successfully identified the dialect of the US and IN voices (86% and 90% accuracy, respectively) but showed lower accuracy in identifying the RP voices (74% accuracy). Of the incorrect answers, 17% of respondents labeled the RP voices as either Australian or New Zealander, both dialects closely related to RP English. Despite this discrepancy in accuracy for RP voices, participants reported that they were, on average, roughly equally familiar with British and Indian English accents (RP familiarity: 67.6, se: ± 3.2; IN familiarity: 65.6, se: ± 3.7). Participants additionally reported high familiarity with US accents (89.3; se: ± 2.4).

We modeled participants' ratings of the voices' naturalness, friendliness, intelligence, and socioeconomic status in separate linear regression models; mean ratings are shown in Figure 3. We additionally modeled a composite attitude rating for each voice by participant, calculated by averaging the friendliness, intelligence, and socioeconomic status ratings. For each model, fixed effects included Dialect, a three-level factor that was sum coded with two contrasts (IN: 1, 0; RP: −1, −1; US: 0, 1), and Guise, a two-level factor that was sum coded with one contrast (human: 1; device: −1), as well as their interaction. Random effects included by-Participant random intercepts. Model estimates for fixed effects are reported in Tables 2–6. No estimates for two-way interactions reached significance; they are therefore reported in Supplement C in Supplementary Tables 3–7.
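A minimal specification of one of these rating models, assuming lme4 and a long-format data frame with hypothetical column names (a sketch, not the authors' code), could look as follows:

```r
# Sketch: the composite-attitude model (one of the five rating models),
# with the sum-coded contrasts described in the text
library(lme4)

d$dialect <- factor(d$dialect, levels = c("IN", "RP", "US"))
contrasts(d$dialect) <- matrix(c(1, -1, 0,    # contrast 1: IN (1), RP (-1), US (0)
                                 0, -1, 1),   # contrast 2: IN (0), RP (-1), US (1)
                               nrow = 3)
d$guise <- factor(d$guise, levels = c("human", "device"))
contrasts(d$guise) <- matrix(c(1, -1), nrow = 2)  # human = 1, device = -1

# Composite attitude = mean of friendliness, intelligence, and SES ratings
d$composite <- rowMeans(d[, c("friendliness", "intelligence", "ses")])

m_comp <- lmer(composite ~ dialect * guise + (1 | participant), data = d)
summary(m_comp)
```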

Figure 3. Average attitude ratings by dialect and guise with standard error bars. The composite score is an average of the friendliness, intelligence, and perceived socioeconomic status ratings by participant per voice.

Table 2. Model estimates for the composite attitude score linear regression, including re-leveled fixed effects.

Table 3. Model estimates for the friendliness score linear regression, including re-leveled fixed effects.

Table 4. Model estimates for the intelligence score linear regression, including re-leveled fixed effects.

Table 5. Model estimates for the perceived socioeconomic status score linear regression, including re-leveled fixed effects.

Table 6. Model estimates for the naturalness score linear regression, including re-leveled fixed effects.

Our first model evaluated composite attitude scores. Model estimates, reported in Table 2, showed significant effects of both Dialect and Guise. As seen in Figure 3, IN and RP voices had significantly higher composite attitude scores (68.0; se: ± 1.5 and 67.9; se: ± 1.4, respectively) than US voices (56.2; se: ± 1.5). Voices presented in an authentic guise (as a device) also had significantly higher composite attitude scores (66.2; se: ± 1.1) than those presented in an inauthentic guise (as a human; 62.0; se: ± 1.4). There were no significant interactions between Dialect and Guise.

Our subsequent models evaluated the individual scores for friendliness, intelligence, socioeconomic status, and naturalness separately (model outputs in Tables 36). The models revealed that, on average, IN voices were rated significantly higher in friendliness, intelligence, and naturalness. RP voices were also rated as having higher intelligence and socioeconomic status. US voices were rated significantly lower across all categories assessed. Voices presented in an authentic guise, as a device, received significantly higher scores in intelligence, socioeconomic status, and naturalness than voices presented in an inauthentic guise, as a human.

4. Experiment 3: AXB perceptual similarity rating task

To holistically evaluate whether speakers from Experiment 2 converged toward or diverged from the TTS voices, we conducted an AXB similarity rating task (Pardo, 2013; Pardo et al., 2017). The purpose of this task was to have a separate group of raters judge whether the pre- or post-exposure production was more similar to the model talker's production.

4.1. Materials

Of the 60 participants from Experiment 2, 14 were excluded due to multiple speech production errors, sound quality issues, or incorrect completion of the task; thus, productions from 46 interaction participants, balanced across guise, were rated. Stimuli consisted of the shadowers' baseline productions of the words, the model talkers' productions of the target words, and the participants' post-exposure productions. To generate the AXB stimuli, we created 23 pseudorandomized lists, balancing whether the baseline or post-exposure production appeared as the 1st or 3rd sound. We concatenated the productions with 0.25 s of silence between them. Each word recording was amplitude normalized to 65 dB. Productions that contained artifacts or mispronunciations (e.g., “cook” [kʊk] for “kook” [kuk]) were excluded.
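Assembling each AXB triad is a simple concatenation; a sketch using tuneR with hypothetical file names, assuming mono recordings at the same sampling rate and bit depth:

```r
# Sketch: build one A-X-B trial with 0.25 s of silence between items
library(tuneR)

a <- readWave("pre_beak.wav")    # A: baseline (pre-exposure) production
x <- readWave("model_beak.wav")  # X: model talker production
b <- readWave("post_beak.wav")   # B: post-exposure production

# 0.25 s of silence matching the recordings' format
sil <- Wave(left = rep(0L, round(0.25 * a@samp.rate)),
            samp.rate = a@samp.rate, bit = a@bit)

writeWave(bind(a, sil, x, sil, b), "axb_beak_trial.wav")
```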

4.2. Participants

One hundred and twenty four participants (92 female, 28 male, 4 gender queer; mean age: 19.5, SD: ± 2.9) completed the AXB study. All participants were recruited from the University of California, Davis, psychology subjects pool, and received course credit for their participation. Students who participated in the norming (experiment 1) or interactive tasks (experiment 2) were excluded from participating in the AXB rating task. All participants reported English as their first language and no hearing impairments. The study was approved by the UC Davis institutional review board (IRB) and subjects completed informed consent before participating.

4.3. Procedure

The experiment was conducted online via Qualtrics. On each trial, raters heard one AXB stimulus, in which A and B were the interactive participant's pre- and post-exposure productions (order balanced across stimuli) and X was the model talker's production. Raters were then asked to judge whether the 1st or the 3rd production sounded more like the 2nd. We provided an additional option, “N/A; audio artifact or technical difficulty,” in case raters experienced technical difficulties. Each rater was randomly presented one of the 23 lists, each consisting of productions from 2 interactive participants, and evaluated all productions in that list, resulting in an average of 480 total ratings (2 participants × 240 productions). In total, each participant's productions were rated by at least 4 raters.

4.4. Analysis and results

Mean perceived similarity (assessing convergence) by dialect and guise, as rated in the AXB task, is shown in Figure 4. We modeled AXB responses with a mixed-effects logistic regression using the brms package in R (Bürkner, 2018). Responses to the perceptual similarity task were coded as a binary variable based on whether the rater identified the post-exposure production as the one most like the model talker (= 1) or not (= 0). Attitude was calculated as a composite score across the friendliness, intelligence, and socioeconomic status ratings for each voice by speaker, and this score was standardized. The model included fixed effects for Dialect (three levels: IN, US, and RP), Guise (two levels: human and device), Composite Attitude (continuous, standardized), and Naturalness Rating (continuous, standardized), as well as all two-, three-, and four-way interactions. Dialect was sum coded with two contrasts (IN: 1, 0; RP: −1, −1; US: 0, 1), and Guise was sum coded with one contrast (human: 1; device: −1). We also included random intercepts by Participant, Model Talker, Rater, and Word, as well as by-Participant random slopes for Dialect, Attitude, and Naturalness and all possible two- and three-way interactions between them. This model led to several divergent transitions in the sampling process; our final model therefore included all of the above parameters without random intercepts by Model Talker. The final random effects structure is provided in Equation 1, and the posterior distributions of the model's fixed effects are plotted in Figure 5. We consider a model estimate reliable if its 95% credible interval (CI) does not include 0, or if over 95% of the sampled posterior distribution lies above or below 0 in the predicted direction.

response ~ dialect * guise * attitude * naturalness + (1 + dialect * attitude * naturalness | participant) + (1 | rater) + (1 | word)    (1)
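A brms call consistent with Equation 1 might look as follows (a sketch with assumed variable names and default priors; the dialect and guise contrasts are set as described above):

```r
# Sketch of the final AXB model (Bürkner, 2018); post_chosen = 1 when the
# rater picked the post-exposure token as more similar to the model talker
library(brms)

m_axb <- brm(
  post_chosen ~ dialect * guise * attitude_z * naturalness_z +
    (1 + dialect * attitude_z * naturalness_z | participant) +
    (1 | rater) + (1 | word),
  family = bernoulli(),
  data   = axb
)
fixef(m_axb)  # posterior means and 95% credible intervals for fixed effects
```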
Figure 4. Mean perceived similarity by dialect and guise with standard error bars. The black line at 50% represents chance convergence.

Figure 5. Posterior distributions for the fixed effects of the logistic regression. Means for each parameter are plotted as black dots along with their 95% credible intervals.

Our model estimates by Dialect were reliable for the IN dialect and US dialect; we see reliable convergence to the IN voices and divergence from the US voices. Attitude and Naturalness scores, however, were not reliable predictors of convergence patterns. We also observed an effect of Guise, with greater convergence toward voices in the apparent human guise. Model estimates are summarized in Table 7. Estimates for two-, three-, and four-way interactions were not reliable and are thus listed in Supplement C in Supplementary Table 8.

Table 7. Model estimates for each predictor variable, including re-leveled fixed effects.

5. Discussion

This research aimed to examine how speakers phonetically accommodate cross-dialectal interlocutors, using human-computer communication as a case study. The current study tested how social biases affect alignment patterns in human communication with voice-AI, and additionally explored the effect of top-down knowledge (i.e., talker humanity) on alignment patterns. We asked participants to interact with TTS voices in US, British, and Indian dialects of English, which vary in phonetic distance from one another and in perceived social prestige. We evaluated the effect of humanity by deploying two versions of our experiment: one in which the TTS voices were presented in an authentic guise (as devices), and one in which the TTS voices were presented as humans.

Our results revealed that participants aligned the most toward IN voices and diverged from US voices. We predicted that speakers would align the most toward RP voices and the least toward IN voices, with US voices acting as a baseline. These predictions were based on previous research on attitudes toward standard British dialects and Indian dialects of English (Giles, 1970, 1973; Coupland and Bishop, 2007), which found that standard British dialects were typically seen as prestigious, and Indian dialects as less prestigious. However, attitude scores revealed that our participant pool does not hold these same beliefs: IN voices scored the highest for friendliness, tied with RP voices for intelligence, and scored the second highest for perceived socioeconomic status, resulting in the highest average composite attitude score. Thus, given these attitude scores, speakers phonetically converged toward the voices with the highest attitudinal ratings (the IN voices) and phonetically diverged from the voices with the lowest attitudinal ratings (the US voices). These alignment patterns are in line with predictions from CAT, which asserts that speakers will converge toward an interlocutor that they have positive social feelings toward (Giles et al., 1987, 1991). US voices were additionally rated as less natural (i.e., more robotic), in line with participant ratings of US TTS voices in Tamagawa et al. (2011). Despite this pattern, our statistical analysis found that attitude scores were not a reliable predictor of alignment patterns in our data above and beyond the dialect patterns, suggesting that speakers are relatively consistent in both ratings and alignment.

While tentative, another possible explanation for greater alignment toward Indian English is phonetic distance. Recent work has shown a greater number of phonetic “edits” from American English varieties to Hindi-English than to RP (Bent et al., 2021). In our stimuli, many items contained a contrast in VOT in addition to a contrast in vowel length or quality, meaning that there was more space for US speakers to converge toward IN voices than RP voices. Previous research has found that large phonetic distances encourage alignment (e.g., Walker and Campbell-Kibler, 2015), but others have claimed that large phonetic distances lead to less alignment between dialects of the same language (e.g., Kim et al., 2011). Future work performing an acoustic analysis to measure the difference in pre- and post-exposure productions of the stimuli can further explore this question.

Another possible explanation for these results is that the novelty of an IN-accented voice led participants to pay more attention to, and thus converge more toward, IN voices. Previous research has found that novel voices attract more attention, and increased attention toward an interlocutor's voice might lead to more convergence (e.g., Babel et al., 2014). One of the questions we asked each participant in the exit survey was, “Rate how familiar you are with (American, British, Indian) English accents.” We found that average familiarity ratings for interactive participants were similar for British and Indian accents. However, this question targeted participants' experiences with these accents in general; thus, it is possible that a participant could be familiar with a given accent, but not within a voice-AI context.

We additionally found an effect of model talker humanity on phonetic alignment patterns, such that speakers aligned more toward voices presented as a human than as a device. We predicted that one of two phenomena could occur with alignment toward voices in different guises: participants would either align more toward device-guise voices in an effort to facilitate communication (Cowan et al., 2015; Cohn et al., 2022; given prior beliefs about TTS voices being less competent), or align more toward human-guise voices, in line with Similarity Attraction Theory (Byrne, 1971). Our results demonstrate that participants reliably aligned toward human-guise voices, indicating that top-down knowledge of another speaker's identity is sufficient to affect alignment patterns, despite identical bottom-up acoustic features. These findings also suggest that social factors may play a stronger role than functional factors in alignment patterns with TTS voices. Our findings are in line with similar studies by Gessinger et al. (2021) and Aoki et al. (2022), but contra Zellou and Cohn (2020). Although participants aligned more toward human-guise voices, those voices received lower average naturalness, friendliness, intelligence, and socioeconomic status scores than device-guise voices, suggesting that TTS voices presented in an inauthentic guise as a human trigger feelings of disgust or discomfort, in line with a possible Uncanny Valley effect (Mori, 1970; Mitchell et al., 2011).

The findings of this research have theoretical implications for the fields of linguistics and communication, as well as tangible implications for our understanding of dyadic and group intercultural communication. Our research builds upon previous findings in human-human communication and demonstrates that speakers show different phonetic accommodation patterns in intra- and inter-dialectal dyads in human-computer communication. As the use of virtual assistants becomes more commonplace in professional group settings, understanding how humans perceive and interact with cross-cultural voices can inform improvements in virtual assistant technology and lead to more productive uses for teams.

5.1. Limitations and future directions

This study has several limitations that can serve as avenues for future research. First, we exclusively used TTS voices. Future work using human voices, and comparing them to TTS voices, can shed further light on both dialect-level effects and the role of guise in shaping ratings and accommodation behaviors. Relatedly, we used just two “exemplars” of speakers from each dialect category, all of them female voices; future work using a wider variety of dialects, as well as speaker characteristics (e.g., varying in age, gender, race/ethnicity), can probe how social attitudes shape cross-cultural speech interactions more broadly. Another limitation of our study was that participants had difficulty accurately identifying RP voices compared to US or IN voices (74% accuracy vs. 86% and 90%, respectively). Future studies could more strongly signal dialect, perhaps using country flags alongside model talker images, to avoid misidentification of each dialect.

Another possible limitation in our study design was the question selection for gauging social and dialectal biases. It is possible that an assessment such as the Implicit Association Task (Babel, 2010) would yield better measurements of implicit biases than questions that draw the participants' attention to what is being rated. Furthermore, our ratings asked participants to rate the talkers' socio-indexical features on a sliding scale from 0 to 100, building on related work assessing human and TTS voices (Cohn et al., 2020; Zellou and Cohn, 2020). However, it is possible that participants vary in the way they interpret the 100-point scale; future work using Likert responses with defined categories could reduce cognitive load and better reflect differences across participants (for a discussion, see Ouwehand et al., 2021).

Finally, another consideration for future research is to investigate individual differences in alignment patterns. For example, conducting individual analyses of participants' phonetic spaces prior to measuring alignment could more thoroughly investigate differences in phonetic distance or reveal speaker-specific alignment patterns.

6. Conclusion

This research tested how social attitudes mediated human–voice-AI alignment patterns to TTS voices in dialects of English. We found that participants converged the most toward the dialect with the highest attitude ratings – the Indian English dialect. We hypothesize that linguistic factors, such as phonetic distance, may additionally contribute to alignment patterns in our data. We additionally found that participants converged more toward TTS voices presented as a human than TTS voices presented as a device. Taken together, these findings reveal a rich interplay of social factors in cross-dialectal speech interactions, which will be even more relevant with advances in computer-mediated and technology-directed communication.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://doi.org/10.17605/OSF.IO/U3JQE.

Ethics statement

The studies involving human participants were reviewed and approved by the UC Davis Institutional Review Board. The patients/participants provided their written informed consent to participate in this study.

Author contributions

ND, MC, and GZ contributed to the conception and design of the study. ND and MC programmed the experiments. ND performed data cleaning and statistical analysis. All authors contributed to manuscript drafting and revision, and read and approved the submitted version.

Funding

This research was supported by the National Science Foundation SBE Postdoctoral Research Fellowship to MC under grant no. 1911855 and an Amazon research grant to GZ.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcomp.2023.1204211/full#supplementary-material

Footnotes

1. https://doi.org/10.17605/OSF.IO/U3JQE

2. https://addpipe.com/

3. Options: Australia, Canada, India, Ireland, New Zealand, South Africa, United Kingdom, United States of America, Other/Non-identifiable.

References

Aoki, N. B., Cohn, M., and Zellou, G. (2022). The clear speech intelligibility benefit for text-to-speech voices: effects of speaking style and visual guise. JASA Exp. Lett. 2, 045204. doi: 10.1121/10.0010274

PubMed Abstract | CrossRef Full Text | Google Scholar

Awan, S., and Stine, C. (2011). Voice onset time in Indian English-accented speech. Clin. Ling. Phonetics 25, 998–1003. doi: 10.3109/02699206.2011.619296

PubMed Abstract | CrossRef Full Text | Google Scholar

Babel, M. (2010). Dialect divergence and convergence in New Zealand English. Lang. Soc. 39, 437–456. doi: 10.1017/S0047404510000400

CrossRef Full Text | Google Scholar

Babel, M. (2012). Evidence for phonetic and social selectivity in spontaneous phonetic imitation. J. Phon. 40, 177–189. doi: 10.1016/j.wocn.2011.09.001

CrossRef Full Text | Google Scholar

Babel, M., McGuire, G., Walters, S., and Nicholls, A. (2014). Novelty and social preference in phonetic accommodation. Lab. Phonol. 5, 123–150. doi: 10.1515/lp-2014-0006

CrossRef Full Text | Google Scholar

Bell, L., Gustafson, J., and Heldner, M. (2003). Prosodic adaptation in human-computer interaction. Proc. ICPHS 3, 833–836.

Google Scholar

Bent, T., Holt, R. F., Engen, K. J., van Jamsek, I. A., Arzbecker, L. J., Liang, L., and Brown, E. (2021). How pronunciation distance impacts word recognition in children and adults. J. Acous. Soc. Am. 150, 4103. doi: 10.1121/10.0008930

PubMed Abstract | CrossRef Full Text | Google Scholar

Bilal, D., and Barfield, J. K. (2021). Hey there! what do you look like? user voice switching and interface mirroring in voice-enabled digital assistants (VDAs). Proc. Assoc. Inf. Sci. Technol. 58, 1–12. doi: 10.1002/pra2.431

CrossRef Full Text | Google Scholar

Bock, J. K. (1986). Syntactic persistence in language production. Cogni. Psychol. 18, 355–387. doi: 10.1016/0010-0285(86)90004-6

CrossRef Full Text | Google Scholar

Bosshardt, H. G., Sappok, C., Knipschild, M., and Hölscher, C. (1997). Spontaneous imitation of fundamental frequency and speech rate by nonstutterers and stutterers. J. Psycholing. Res. 26, 425–448. doi: 10.1023/A:1025030120016

PubMed Abstract | CrossRef Full Text | Google Scholar

Branigan, H., Pickering, M., Pearson, J., McLean, J., and Nass, C. (2003). “Syntactic alignment between computers and people: the role of belief about mental states,” in Proceedings of the Twenty-fifth Annual Conference of the Cognitive Science Society. Hillsdale, NJ, USA: Lawrence Erlbaum Associate, 186–191.

Google Scholar

Branigan, H. P., Pickering, M. J., Pearson, J., McLean, J. F., and Brown, A. (2011). The role of beliefs in lexical alignment: Evidence from dialogs with humans and computers. Cognition 121, 41–57. doi: 10.1016/j.cognition.2011.05.011

PubMed Abstract | CrossRef Full Text | Google Scholar

Brysbaert, M., and New, B. (2009). Moving beyond Kučera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behav. Res. Methods 41, 977–990. doi: 10.3758/BRM.41.4.977

PubMed Abstract | CrossRef Full Text | Google Scholar

Bürkner, P. C. (2018). Advanced Bayesian multilevel modeling with the R package brms. R J. 10, 395–411. doi: 10.32614/RJ-2018-017

CrossRef Full Text | Google Scholar

Byrne, D. (1971). The Attraction Paradigm. London: Academic Press.

Google Scholar

Cargile, A. C., and Bradac, J. J. (2001). Attitudes toward language: a review of speaker-evaluation research and a general process model. Annal. Int. Commun. Assoc. 25, 347–382. doi: 10.1080/23808985.2001.11679008

CrossRef Full Text | Google Scholar

Chakrani, B. (2015). Arabic interdialectal encounters: Investigating the influence of attitudes on language accommodation. Lang. Commun. 41, 17–27. doi: 10.1016/j.langcom.2014.10.006

CrossRef Full Text | Google Scholar

Clark, H. H., and Murphy, G. L. (1982). Audience design in meaning and reference. Adv. Psychol. 9, 287–299. doi: 10.1016/S0166-4115(09)60059-5

CrossRef Full Text | Google Scholar

Clopper, C. G., and Dossey, E. (2020). Phonetic convergence to Southern American English: Acoustics and perception. J. Acous. Soc. Am. 147, 671. doi: 10.1121/10.0000555

PubMed Abstract | CrossRef Full Text | Google Scholar

Cohen Priva, U., and Sanker, C. (2019). Limitations of difference-in-difference for measuring convergence. Lab. Phonol. 10, 1–29. doi: 10.5334/labphon.200

CrossRef Full Text | Google Scholar

Cohn, M., Ferenc Segedin, B., and Zellou, G. (2019). “Imitating siri: socially-mediated vocal alignment to device and human voices,” in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. University of California, Davis, 1813–1817.

Google Scholar

Cohn, M., Jonell, P., Kim, T., Beskow, J., and Zellou, G. (2020). “Embodiment and gender interact in alignment to TTS voices,” in Proceedings of the Cognitive Science Society (Montreal, QC), 220–226.

Google Scholar

Cohn, M., Predeck, K., Sarian, M., and Zellou, G. (2021). Prosodic alignment toward emotionally expressive speech: comparing human and Alexa model talkers. Speech Commun. 135, 66–75. doi: 10.1016/j.specom.2021.10.003

CrossRef Full Text | Google Scholar

Cohn, M., Segedin, B. F., and Zellou, G. (2022). Acoustic-phonetic properties of Siri- and human-directed speech. J. Phonetics 90, 101123. doi: 10.1016/j.wocn.2021.101123

CrossRef Full Text | Google Scholar

Coupland, N., and Bishop, H. (2007). Ideologised values for british accents. J. Socioling. 11, 74–93. doi: 10.1111/j.1467-9841.2007.00311.x

CrossRef Full Text | Google Scholar

Cowan, B. R., Branigan, H. P., Obregón, M., Bugis, E., and Beale, R. (2015). Voice anthropomorphism, interlocutor modelling and alignment effects on syntactic choices in human–computer dialogue. Int. J. Hum. Comput. Studies 83, 27–42. doi: 10.1016/j.ijhcs.2015.05.008

CrossRef Full Text | Google Scholar

D'Imperio, M., and German, J. S. (2015). “Phonetic detail and the role of exposure in dialect imitation,” in Proceedings of the 18th International Congress of Phonetic Sciences (Glasgow).

Google Scholar

D'Imperio, M. D., and Sneed, J. (2015). Phonetic Detail and the Role of Exposure in Dialect Imitation. 18th International Congress of Phonetic Sciences.

Google Scholar

Dragojevic, M. (2017). Language Attitudes. Oxford Research Encyclopedia of Communication. Oxford: Oxford University Press.

Google Scholar

Fuertes, J. N., Gottdiener, W. H., Martin, H., Gilbert, T. C., and Giles, H. (2012). A meta-analysis of the effects of speakers' accents on interpersonal evaluations. Eur. J. Soc. Psychol. 42, 120–133. doi: 10.1002/ejsp.862

CrossRef Full Text | Google Scholar

Garrod, S., and Pickering, M. (2007). Alignment in dialogue. Oxford Handb. Psycholing. 5, 1–16. doi: 10.1093/oxfordhb/9780198568971.013.0026

CrossRef Full Text | Google Scholar

Gessinger, I., Raveh, E., Steiner, I., and Möbius, B. (2021). Phonetic accommodation to natural and synthetic voices: behavior of groups and individuals in speech shadowing. Speech Commun. 127, 43–63. doi: 10.1016/j.specom.2020.12.004

CrossRef Full Text | Google Scholar

Giles, H. (1970). Evaluative reactions to accents. Educ. Rev. 41, 211–227. doi: 10.1080/0013191700220301

CrossRef Full Text | Google Scholar

Giles, H. (1973). Accent mobility: a model and some data. Anthropol. Ling. 15, 87–105.

Google Scholar

Giles, H., Coupland, J., and Coupland, N. (1991). Accommodation theory: communication, context, and consequences. Contexts Accommod. 14, 1–68. doi: 10.1017/CBO9780511663673.001

Giles, H., Mulac, A., Bradac, J. J., and Johnson, P. (1987). Speech accommodation theory: the first decade and beyond. Annal. Int. Commun. Assoc. 10, 13–48. doi: 10.1080/23808985.1987.11678638

Hay, J., Nolan, A., and Drager, K. (2006). From fush to feesh: exemplar priming in speech perception. Ling. Rev. 23, 351–379. doi: 10.1515/TLR.2006.014

Kim, M., Horton, W. S., and Bradlow, A. R. (2011). Phonetic convergence in spontaneous conversations as a function of interlocutor language distance. Lab. Phonol. 2, 125–156. doi: 10.1515/labphon.2011.004

McCullough, E. A., Clopper, C. G., and Wagner, L. (2019). The development of regional dialect locality judgments and language attitudes across the life span. Child Dev. 90, 1080–1096. doi: 10.1111/cdev.12984

Michalsky, J., and Schoormann, H. (2017). Pitch convergence as an effect of perceived attractiveness and likability. Proc. Interspeech 2017, 2253–2256. doi: 10.21437/Interspeech.2017-1520

Mitchell, W. J., Szerszen Sr, K. A., Lu, A. S., Schermerhorn, P. W., Scheutz, M., and MacDorman, K. F. (2011). A mismatch in the human realism of face and voice produces an uncanny valley. i-Perception 2, 10–12. doi: 10.1068/i0415

Mori, M. (1970). Bukimi no tani (the uncanny valley). Energy 7, 33–35.

Namy, L. L., Nygaard, L. C., and Sauerteig, D. (2002). Gender differences in vocal accommodation: the role of perception. J. Lang. Soc. Psychol. 21, 422–432. doi: 10.1177/026192702237958

Nass, C., Steuer, J., and Tauber, E. (1994). “Computers are social actors,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '94), 122–129. doi: 10.1145/259963.260288

Ouwehand, K., van der Kroef, A., Wong, J., and Paas, F. (2021). Measuring cognitive load: are there more valid alternatives to Likert rating scales? Front. Educ. 6, 702616. doi: 10.3389/feduc.2021.702616

Pardo, J. S. (2006). On phonetic convergence during conversational interaction. J. Acous. Soc. Am. 119, 2382–2393. doi: 10.1121/1.2178720

Pardo, J. S. (2013). Measuring phonetic convergence in speech production. Front. Psychol. 4, 559. doi: 10.3389/fpsyg.2013.00559

Pardo, J. S., Jay, I. C., and Krauss, R. M. (2010). Conversational role influences speech imitation. Attention Percep. Psychophys. 72, 2254–2264. doi: 10.3758/BF03196699

Pardo, J. S., Urmanche, A., Wilman, S., and Wiener, J. (2017). Phonetic convergence across multiple measures and model talkers. Attention Percep. Psychophys. 79, 637–659. doi: 10.3758/s13414-016-1226-0

Pearson, J., Pickering, M., Branigan, H., McLean, J., Nass, C., and Hu, J. (2004). “The influence of beliefs about an interlocutor on lexical and syntactic alignment: evidence from human-computer dialogues,” in Proceedings of the 10th Annual Conference on Architectures and Mechanisms for Language Processing (AMLaP).

Rao, G. N. (2013). Measuring phonetic convergence: Segmental and suprasegmental speech adaptations during native and non-native talker interactions (Dissertation). Faculty of the Graduate School of the University of Texas at Austin, Texas, United States.

Ross, J. P., Lilley, K. D., Clopper, C. G., Pardo, J. S., and Levi, S. V. (2021). Effects of dialect-specific features and familiarity on cross-dialect phonetic convergence. J. Phonet. 86, 101041. doi: 10.1016/j.wocn.2021.101041

Schmitt, H. (2007). The case for the epsilon symbol (ε) in RP DRESS. J. Int. Phon. Assoc. 37, 321–328. doi: 10.1017/S0025100307003131

Snyder, C., Cohn, M., and Zellou, G. (2019). Individual variation in cognitive processing style predicts differences in phonetic imitation of device and human voices. Proc. Interspeech 2019, 116–120. doi: 10.21437/Interspeech.2019-2669

Street, R. L., and Giles, H. (1982). “Speech accommodation theory: a social cognitive approach to language and speech behavior,” in Social Cognition and Communication, 193–226.

Suzuki, N., and Katagiri, Y. (2007). Prosodic alignment in human-computer interaction. Connect. Sci. 19, 131–141. doi: 10.1080/09540090701369125

Tamagawa, R., Watson, C. I., Kuo, I. H., Macdonald, B. A., and Broadbent, E. (2011). The effects of synthesized voice accents on user perceptions of robots. Int. J. Soc. Robotics 3, 253–262. doi: 10.1007/s12369-011-0100-4

Thakerar, J. N., Giles, H., and Cheshire, J. (1982). “Psychological and linguistic parameters of speech accommodation theory,” in Advances in the Social Psychology of Language, 205–255.

Walker, A., and Campbell-Kibler, K. (2015). Repeat what after whom? Exploring variable selectivity in a cross-dialectal shadowing task. Front. Psychol. 6, 1–18. doi: 10.3389/fpsyg.2015.00546

Walters, S. A., Babel, M. E., and McGuire, G. (2013). The role of voice similarity in accommodation. Proc. Meetings Acoustics 19, 060047. doi: 10.1121/1.4800716

Weatherholtz, K., Campbell-Kibler, K., and Jaeger, T. F. (2014). Socially-mediated syntactic alignment. Lang. Var. Change 26, 387–420. doi: 10.1017/S0954394514000155

Wells, J. (1982). Accents of English. Cambridge: Cambridge University Press.

Zellou, G., and Cohn, M. (2020). “Top-down effect of apparent humanness on vocal alignment toward human and device interlocutors,” in Proceedings of the 42nd Annual Meeting of the Cognitive Science Society, 3490–3496.

Zellou, G., Cohn, M., and Block, A. (2021a). Partial compensation for coarticulatory vowel nasalization across concatenative and neural text-to-speech. J. Acous. Soc. Am. 149, 3424–3436. doi: 10.1121/10.0004989

Zellou, G., Cohn, M., and Kline, T. (2021b). The influence of conversational role on phonetic alignment toward voice-AI and human interlocutors. Lang. Cognit. Neurosci. 36, 1298–1312. doi: 10.1080/23273798.2021.1931372

Keywords: voice-activated artificially intelligent (voice-AI) assistant, human-computer interaction, phonetic accommodation, dialect imitation, apparent guise

Citation: Dodd N, Cohn M and Zellou G (2023) Comparing alignment toward American, British, and Indian English text-to-speech (TTS) voices: influence of social attitudes and talker guise. Front. Comput. Sci. 5:1204211. doi: 10.3389/fcomp.2023.1204211

Received: 11 April 2023; Accepted: 16 June 2023;
Published: 03 July 2023.

Edited by:

Ingo Siegert, Otto von Guericke University Magdeburg, Germany

Reviewed by:

Sai Sirisha Rallabandi, Technical University of Berlin, Germany
Vered Silber-Varod, Tel Aviv University, Israel

Copyright © 2023 Dodd, Cohn and Zellou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Nicole Dodd, ncdodd@ucdavis.edu
