The Look of (Un)confidence: Visual Markers for Inferring Speaker Confidence in Speech

Mori, Yondu; Pell, Marc D.

doi:10.3389/fcomm.2019.00063

ORIGINAL RESEARCH article

Front. Commun., 21 November 2019

Sec. Psychology of Language

Volume 4 - 2019 | https://doi.org/10.3389/fcomm.2019.00063


1. School of Communication Sciences and Disorders, McGill University, Montréal, QC, Canada
2. Centre for Research on Brain, Language and Music, Montréal, QC, Canada

Abstract

Evidence suggests that observers can accurately perceive a speaker's static confidence level, related to their personality and social status, by only assessing their visual cues. However, less is known about the visual cues that speakers produce to signal their transient confidence level in the content of their speech. Moreover, it is unclear what visual cues observers use to accurately perceive a speaker's confidence level. Observers are hypothesized to use visual cues in their social evaluations based on the cue's level of perceptual salience and/or their beliefs about the cues that speakers with a given mental state produce. We elicited high and low levels of confidence in the speech content by having a group of speakers answer general knowledge questions ranging in difficulty while their face and upper body were video recorded. A group of observers watched muted videos of these recordings to rate the speaker's confidence and report the face/body area(s) they used to assess the speaker's confidence. Observers accurately perceived a speaker's confidence level relative to the speakers' subjective confidence, and broadly differentiated speakers as having low compared to high confidence by using speakers' eyes, facial expressions, and head movements. Our results argue that observers use a speaker's facial region to implicitly decode a speaker's transient confidence level in a situation of low-stakes social evaluation, although the use of these cues differs across speakers. The effect of situational factors on speakers' visual cue production and observers' utilization of these visual cues are discussed, with implications for improving how observers in real world contexts assess a speaker's confidence in their speech content.

Introduction

During conversation, speakers produce visual cues that (often inadvertently) demonstrate their confidence level in or commitment to the content of their speech. Here, confidence refers to a transient mental state indexing a speaker's subjective level of certainty in a concept and/or word as it is retrieved (or a retrospective metamemory judgment) (Nelson and Narens, 1990; Boduroglu et al., 2014). This demonstration of confidence is not to be confused with speakers (un)consciously communicating confidence related to their social status or personality traits (e.g., sitting up straight) (Tenney et al., 2008; Nelson and Russell, 2011; Locke and Anderson, 2015) and/or speakers confidently presenting information that was previously retrieved from memory and rehearsed. For example, politicians or TV news anchors generally display composure in their facial expressions and posture, and produce minimal facial movements to mark their high confidence and neutral emotional state (Coleman and Wu, 2006; Swerts and Krahmer, 2010). In these latter instances, a speaker's confidence is described as a static mental state without considering the variable speech content speakers spontaneously produce. Yet, speakers in these instances can still experience a transient mental state of confidence, reflecting the ongoing memory retrieval and dynamic emotional states speakers have during natural conversation.

Despite this differentiation in the factors underlying a speaker's conveyed confidence, interlocutors likely do not make this distinction when decoding a speaker's visual cues and drawing social inferences from them. Rather, interlocutors may simply try to detect a speaker's level of certainty in the information speakers present. Research suggests that visual cues broadly referring to a speaker's confidence level are automatically decoded by observers (Moons et al., 2013) and can impact enduring social assessments of a speaker, such as in job interviews (DeGroot and Motowidlo, 1999; DeGroot and Gooty, 2009) or courtrooms with expert witnesses (Cramer et al., 2009, 2014). In these contexts, speakers are asked questions they may not be expecting or prepared for, which can result in them producing visual and vocal cues that mark their confidence level in their speech content (Brosy et al., 2016). Interlocutors can then use these non-verbal cues to infer a speaker's credibility, trustworthiness or believability (Cramer et al., 2009; Birch et al., 2010; Jiang and Pell, 2015, 2017; Jiang et al., 2017), such as when determining if a speaker is lying (Depaulo et al., 2003) or persuading others to adopt their stance (Scherer et al., 1973). Thus, it is important to understand how the visual cues that speakers produce as a result of their transient confidence in their speech content can impact observers' impressions of their confidence level. However, it is unclear how observers infer a speaker's confidence level strictly from the visual cues they produce while speaking, such as when answering questions that tap general (shared) knowledge (Swerts and Krahmer, 2005; Kuhlen et al., 2014).

When speakers spontaneously communicate their knowledge via answering questions, they may produce visual cues for two main reasons (Smith and Clark, 1993). First, speakers' visual cues can indicate their cognitive processes for semantic activation and lexical retrieval, whereby concepts are accessed from memory and communicated using language. The speed and success of this process are influenced by properties of the concepts retrieved. According to models of semantic memory, a concept is activated more strongly during lexical retrieval if it is encoded more frequently and stored longer in memory (Anderson, 1983a,b). For example, when responding to the trivia question, “In what sport is the Stanley cup awarded?” (Smith and Clark, 1993; Jiang and Pell, 2017), a speaker should take longer to respond when the target concept (i.e., hockey) is activated less strongly (e.g., if the speaker is not from a hockey-playing nation or is not interested in sports). The speaker may also produce visual cues to mark the ease (or difficulty) of this process, which can vary across speakers depending on their background knowledge, cultural background and level of non-verbal expressiveness (Sullins, 1989; McCarthy et al., 2006, 2008; Zaki et al., 2009). However, this hypothesis for visual cue production does not necessarily involve speakers interacting with others.

Another reason why speakers may produce visual cues to indicate their confidence level in the content of their speech is for pragmatic purposes, as they consider how they appear to others during an interaction. According to the Gricean Maxim of Quality, during conversation speakers should not say information they believe is false or they have an insufficient amount of evidence for (Grice, 1975). To follow this maxim, speakers should indicate their level of certainty in their response to a question, which can be done through their (un)conscious production of visual cues. When speakers have low confidence in their speech content, the visual cues they produce may represent an unconscious mechanism that allows them to save face in the social context (Goffman, 1967, 1971; Visser et al., 2014). That is, speakers may furnish salient visual cues signaling their lack of commitment to the linguistic message so that their audience will be less critical of errors in the message. Conversely, speakers can produce visual cues to signal and pragmatically reinforce their certainty to others (Moons et al., 2013) or to feign certainty/false confidence. For example, speakers frequently gesture before speaking to retrieve words from memory (Rimé and Schiaratura, 1991; Krauss et al., 1996). A speaker's lexical retrieval may also be signaled by their facial expressions and changes in eye gaze (Goodwin and Goodwin, 1986; Bavelas and Gerwing, 2007). Goodwin and Goodwin (1986) examined gestures related to searching for a word in videotaped conversations produced in natural settings. They found that speakers produced a “thinking face” when word searching, which was often preceded by a change in gaze direction. This thinking face can involve the corners of the lips being turned downwards, the eyes widening, eyebrow movement, pursing of the lips or a stretching or slackening of the lips (Ekman and Friesen, 1978; Goodwin and Goodwin, 1986; Swerts and Krahmer, 2005). 
This facial expression demonstrates active involvement of various parts of the face during word retrieval. Based on these findings, one can postulate that speakers who have low confidence in the content of their speech are more likely to produce this face, possibly with an averted gaze, compared to speakers who have high confidence in their speech content. This pragmatic process can also interact with the cognitive process of lexical retrieval. Some researchers hypothesize that people avert their gaze from others when thinking to reduce their arousal level, so they can concentrate (Argyle and Cook, 1976; Patterson, 1976, 1982; Doherty-Sneddon and Phelps, 2005). This change in gaze can allow speakers to successfully retrieve and produce target words (Glenberg et al., 1998). These differences in gaze behavior could be a relevant sign of a speaker's confidence level.

When put together, different body and face movements can refer to a speaker's confidence level in the content of their speech in several ways: by demonstrating a speaker's retrieval of words from memory, their (un)conscious effort to communicate their (un)certainty to others for social purposes and/or their level of concentration in answering questions during face-to-face interactions. By focusing on how speakers respond to general knowledge (trivia) questions, the current study is likely to capture those visual cues that are most relevant to processes for retrieving semantic information from memory and socially communicating a level of certainty to others.

Observers are hypothesized to use various visual cues when evaluating a speaker's social traits due to cognitive and social processes. Firstly, from a social ecological approach, observers are thought to attend to visual cues that are more perceptually salient (McArthur, 1981; Fiske and Taylor, 1984). Observers' attention to these cues can influence the social impressions they form of a speaker (McArthur and Solomon, 1978). Since speakers with low confidence may produce cues that involve more movement or are more marked (e.g., a thinking face) compared to when speakers have high confidence in the content of their speech, observers may be more attentive to using these cues when evaluating a speaker's confidence level. However, variability in speakers' level of non-verbal expressiveness may not allow observers to detect a large difference in marked features indicating high vs. low confidence. It is also unclear whether a speaker with high confidence is detected by the absence of cues indicating low confidence or if there are other types of cues that are uniquely produced and perceived.

Secondly, observers may differentiate speakers' confidence level based on their visual cues by considering the speaker's perspective. This process can allow observers to accurately detect a speaker's thoughts/feelings to try to understand why speakers are producing certain visual cues (i.e., mentalizing or empathizing). With this process, observers may also be affected by cues that are more perceptually salient (Kuhlen et al., 2014). For example, in Kuhlen et al. (2014), when observers watched muted videos of speakers responding to trivia questions and rated them as having low confidence, observers showed increased activity in the mentalizing network (e.g., medial prefrontal cortex and temporoparietal junction). The researchers suggest this activation occurred because speakers perceived to have low confidence produce more salient visual cues compared to speakers with high confidence (Kuhlen et al., 2014). However, it is unknown what visual cues observers attended to when mentalizing. Also, this brain activation was limited to speakers perceived to have low confidence; no significant effect was found when speakers were perceived to have high confidence.

Observers' evaluations of speakers with high confidence may be affected by another aspect of mentalizing: observers' expectations or beliefs about the visual cues a speaker with high confidence (stereotypically) produces (Schmid Mast et al., 2006). For example, in Murphy (2007), speakers were asked to appear intelligent (Acting condition) or were not given any instruction about their behavior (Control condition) as they engaged in an informal conversation with another speaker. Speakers in the Acting condition were more likely to display eye contact while speaking, a serious face and an upright posture compared to speakers in the Control condition. Of these cues, only a speaker's eye contact while speaking was significantly correlated with their perceived intelligence (Murphy, 2007). This difference in the frequency of visual cues produced by the Acting and Control speakers was likely influenced by speakers' beliefs about displays of intelligence, which may be similar to the visual cues indicating a speaker's high confidence in the content of their speech. Also, despite speakers' beliefs about producing a serious face and upright posture, these may not have been reliable cues of a speaker's perceived intelligence. This result demonstrates the reduced reliability of observers' beliefs about the cues a speaker with a given mental state produces. Not all visual cues produced by speakers may meaningfully contribute to evaluating a speaker's perceived confidence. Overall, little is known about how speakers' confidence level in their speech content can impact the social impressions that observers form and the visual cues that observers use to evaluate a speaker's confidence level.

The visual cues that observers use to decode a speaker's mental state are often studied in the literature on interpersonal sensitivity or empathic accuracy (Hall et al., 2001; Schmid Mast et al., 2006). This ability has been measured by having participants provide a perceptual rating of an inferred state or trait in a speaker and then indicate the types of cues that influenced their judgment. For example, in Mann et al. (2008), participants watched videos of speakers who were lying or telling the truth in one of three conditions: visual only (i.e., muted videos), audio only (i.e., audio recording, no video) or audiovisual (i.e., full video). After seeing and/or hearing the speaker, participants indicated whether the speaker was lying or telling the truth, and then rated how often they paid attention to each speaker's speech content, vocal behavior (pauses, stutters, pitch etc.) or visual behavior (gaze, movements, posture etc.) in their judgments. The researchers only analyzed the audiovisual condition and found that participants reported paying more attention to the speaker's visual cues compared to their speech or vocal cues (Mann et al., 2008). Although the researchers did not report the visual only condition, this methodology demonstrates the impact of the visual communication channel on observers' evaluations. With this study, we aim to better understand the specific visual cues that affect observers' judgments of a speaker's transient state of confidence.

The purpose of this study was to identify major visual cues that observers use to evaluate a speaker's confidence level in the content of their speech, and to examine how these cues differ when speakers experience high vs. low confidence. Based on the literature, we focused on the production of facial expressions (e.g., a thinking face), facial movements (changes in gaze and eyebrow movements), and gross postural movements because of their reported association to speaker confidence. We also explored hand movements as indicating a speaker's confidence level, as they can cue lexical retrieval difficulties during speech production (Rimé and Schiaratura, 1991; Krauss et al., 1996). Following Swerts and Krahmer (2005), we posed general knowledge questions to spontaneously elicit high and low levels of confidence in a group of individuals, who were video recorded as they responded and then subjectively rated their own confidence level. A group of observers then watched muted versions of the videos and indicated the face or body areas they used when evaluating the speaker's confidence level.

We predicted there would be a strong correspondence between speakers' subjective confidence and how observers perceived their confidence level from visual cues (i.e., perceived confidence), demonstrating observers' accurate detection of a speaker's transient mental state of confidence. We anticipated that observers would use more visual cues to decode speakers with low compared to high confidence because of the greater perceptual salience of these cues and observers' ability to mentalize with speakers conveying low confidence. These visual cues indicating low confidence may include a “thinking face,” changes in gaze, eyebrow and head movements, and postural movements. In contrast, high confidence may be more marked by direct eye contact, a serious face and an upright posture. The relationship between visually-derived confidence impressions and more sustained social traits of the speaker (competence, attractiveness, trustworthiness, etc.) as well as variability in speakers' production of visual cues and observers' use of these visual cues were also explored.

Part I: Production Study

Methods

Speakers

Ten native Canadian English speakers (McGill University students) volunteered for the study (five males and five females, Mean age = 21.5 years, SD = 2.84, different racial/ethnic identities). Speakers reported having normal or corrected-to-normal vision and normal hearing. Three speakers wore eyeglasses during testing which did not obscure visibility of the eye or eyebrow region in the video recordings.

Materials

Recordings were gathered while administering a trivia (or general knowledge) question task, following previous work (Smith and Clark, 1993; Brennan and Williams, 1995; Swerts and Krahmer, 2005; Visser et al., 2014). The stimuli were adapted from a corpus of 477 general knowledge statements (in English) constructed and presented in written format to 24 Canadian participants in a pilot study. In that study, items were first perceptually validated to specify whether each statement was a known general knowledge fact (based on the group accuracy rate) and to indicate how confident participants were in their knowledge for each statement (mean confidence rating on a scale from 1 to 5). Statements covered a range of topics including science, history, culture, sports and literature. Based on the hit rate for the validation group, a set of 20 less-known general knowledge statements was selected with a hit rate range of 0.12 to 0.64 (out of 1) and a mean confidence rating range of 1.6 to 3.48 (out of 5). A set of 20 well-known general knowledge statements was selected with a hit rate range of 0.88 to 1 (out of 1) and a mean confidence rating range of 4.04 to 4.96 (out of 5). For the purposes of this study, the selected statements were then transformed into questions. For example, the statement: “Carmine is a chemical pigment that is red in color” became the question: “What color is the chemical pigment, carmine?” The well-known and less-known general knowledge questions were used to elicit a state of high and low confidence in the speaker regarding the speech content, respectively. The selected list of questions based on less-known or well-known general knowledge is provided in the Supplementary Material.

Elicitation Procedure

Each speaker was individually video recorded in a sound-attenuated recording booth in an experimental testing laboratory, to allow for high quality audiovisual recordings. Speakers sat at a table with a laptop computer and a Cedrus RxB60 response pad (with five buttons labeled from 01 to 05 consecutively) located directly in front of the computer. A video camera mounted on a tripod was positioned behind the laptop monitor along with a wall-mounted loudspeaker. The loudspeaker allowed speakers to communicate with the examiner, who was located outside the recording booth and could not see the speaker. The examiner faced a window to the recording booth that was covered by a white curtain, preventing their facial expressions from influencing the speaker's behavior (see Smith and Clark, 1993; Swerts and Krahmer, 2005, for similar methods). The experiment was controlled by SuperLab 5 presentation software (Cedrus Corporation, 2014).

To elicit naturalistic expressions associated with a speaker's confidence in the content of their speech, speakers engaged in a question-response paradigm with the hidden examiner who spoke in a neutral tone of voice. Participants answered a series of less-known and well-known general knowledge questions from the described corpus by formulating a complete sentence while looking at the video camera (e.g., Examiner: “What color is a ruby?,” Speaker: “A ruby is red”). Speakers were instructed to guess if they did not know the answer to a question. No visual or verbal feedback on the quality of the speaker's performance was provided to the speaker. After answering each question, speakers used the response pad to rate how confident they were in their response on a 5-point scale (1 = not at all confident, 5 = very confident; the 5-point scale was simultaneously presented on the computer screen). Since speakers always had to provide an answer, a confidence rating of five can be interpreted as the speaker having high confidence in their response, not that they had high confidence in not knowing the answer. Speakers answered each question once. The 20 well-known and 20 less-known general knowledge questions were asked in a randomized order over four blocks with 10 questions/block, always by the same examiner (YM). For a small number of trials in which speakers requested clarification after hearing the general knowledge question, the examiner repeated the question but did not provide supplementary details. The examiner asked speakers to repeat their response when it was not formulated in a complete sentence, although this rarely happened. This procedure resulted in a total of 400 video recordings (10 speakers × 2 types of general knowledge questions × 20 questions) for analysis. Speakers were compensated $10 CAD and the recording session took approximately 1 h.

Video Analysis

The video recordings were edited using Windows Movie Maker to isolate the response portion of each trial, defined as the interval from the offset of the examiner's question to the offset of the speaker's verbal response. Videos of speakers' responses to the well-known and less-known trivia questions were an average duration of 5.05 s (SD = 2.85 s) and 9.88 s (SD = 7.52 s), respectively. The audio channel of the individual files was then removed to create muted versions of each response. During editing, data from one female speaker had to be discarded due to a recording artifact (i.e., her videos showed a strong reflection of the computer screen against her eyeglasses). In addition, several trials using less-known general knowledge questions (n = 30) were excluded because the speaker initiated a dialogue with the examiner seeking clarification. We hypothesized that these videos would not reflect speakers' immediate and spontaneous expressions of low confidence via their visual cues.

For the remaining nine speakers, question accuracy and self-ratings of confidence (i.e., subjective confidence) for each question were then analyzed to ensure the questions consistently elicited intended high or low confidence states (see Supplementary Material for mean hit rate and subjective confidence ratings for each question). This step was crucial because our analyses sought to identify visual cues associated with unambiguous conditions that elicit high vs. low confidence. Question accuracy was determined by assessing the speech content of the response, for which a specific level of detail was required. For example, for the less-known general knowledge question, “Who invented the Theory of Relativity?,” the response, “a scientist” was marked as incorrect. Table 1 supplies data on the frequency and subjective confidence ratings of the nine speakers by trivia question type and accuracy. We retained data for 14 well-known general knowledge questions and 17 less-known general knowledge questions which consistently elicited the target confidence state without excessive variability across speakers (see section Results for more details). In total, this process resulted in 249 muted videos of single utterances from 9 different speakers to be used in the Perception Study: 126 responses to well-known general knowledge questions (9 speakers × 14 well-known general knowledge questions) + 123 responses to less-known general knowledge questions (9 speakers × 17 less-known general knowledge questions – 30 excluded videos).

Table 1

Subjective confidence rating	Well-known questions			Less-known questions
Response accuracy	n	M	SD	n	M	SD
Accurate	119	4.39	1.12	7	2.00	1.42
Inaccurate	7	2.43	1.50	116	1.46	0.91

Speakers' subjective confidence ratings (out of 5) as a function of the type of general knowledge question and response accuracy, with the frequency of occurrence of each combination (out of the 249 muted videos).
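
As a consistency check on Table 1, the cell counts and cell means can be recombined into pooled means by question type and by accuracy. The sketch below is illustrative only: it is written in Python rather than the R used for the study's analyses, the function and variable names are our own, and because it works from the rounded cell means printed in the table, the pooled values can differ from the means reported in the Results by about 0.01.

```python
# Cell summaries transcribed from Table 1: (n, mean subjective confidence).
TABLE_1 = {
    ("well-known", "accurate"): (119, 4.39),
    ("well-known", "inaccurate"): (7, 2.43),
    ("less-known", "accurate"): (7, 2.00),
    ("less-known", "inaccurate"): (116, 1.46),
}


def pooled_mean(cells):
    """Weighted mean of cell means, each weighted by its number of videos."""
    total_n = sum(n for n, _ in cells)
    return sum(n * mean for n, mean in cells) / total_n


def cells_where(position, label):
    """Select cells whose key matches label at position (0 = type, 1 = accuracy)."""
    return [cell for key, cell in TABLE_1.items() if key[position] == label]


# Total number of muted videos across the four cells.
total_videos = sum(n for n, _ in TABLE_1.values())

by_type = {t: pooled_mean(cells_where(0, t)) for t in ("well-known", "less-known")}
by_accuracy = {a: pooled_mean(cells_where(1, a)) for a in ("accurate", "inaccurate")}

if __name__ == "__main__":
    print("videos:", total_videos)
    for label, value in {**by_type, **by_accuracy}.items():
        print(f"{label}: {value:.2f}")
```

Pooling by question type corresponds to the manipulation check on well-known vs. less-known questions, and pooling by accuracy corresponds to the comparison of correct vs. incorrect responses.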

Coding of Visual Cues

A group of six coders specified the visual cues within each facial/body region characterizing speakers' confidence level. Coders were native speakers of Canadian or American English (4 female/2 male, mean age = 23.33, SD = 2.58) who were blind to the purpose of the experiment and did not know any of the speakers, as determined at the onset of the study.

The coders were tested individually and were instructed to watch the muted videos and characterize specific visual cues they observed in the video without any associated context. Before the coding procedure, they learned a list of visual cues of interest: for changes in gaze direction, cues included, “sustained eye contact,” “upward gaze,” “downward gaze”, and “sideways gaze”; for facial expressions, cues included, “thinking,” “happy,” “amused,” “serious” and “embarrassed” expressions; and for shifts in posture (in the speaker's seat), cues included “still,” “forward,” “backward”, or “sideways”. Coders were provided a paper copy of descriptions for the different visual cue subcategories to use as a reference during the procedure. The subcategories of visual cues were determined based on previous studies (Goodwin and Goodwin, 1986; Krahmer and Swerts, 2005; Swerts and Krahmer, 2005; Cramer et al., 2009). See Figure 1 for an illustration of some of these visual cues.

Figure 1

For the subcategories of changes in gaze direction, sustained eye contact was described as the speaker looking straight ahead at the camera. Descriptions of the different facial expressions were based on how they are typically described in the literature. A thinking expression (or thinking face) involved the corners of the lips turned downwards, widened eyes, pursed or stretched lips, wrinkled nose, and raised or furrowed eyebrows (Ekman and Friesen, 1978; Goodwin and Goodwin, 1986; Swerts and Krahmer, 2005). A happy expression involved a smile with the corners of the lips elongated and turned upwards, and raised cheeks (Ekman and Friesen, 1978; Sato and Yoshikawa, 2007). An amused expression involved similar movements as a happy expression as well as the head shifted backwards, an upward gaze, and tongue biting (Ekman and Friesen, 1976, 1978; Ruch, 1993; Keltner, 1996). A serious expression involved eye contact and minimal head movement without furrowed eyebrows or lip corner movement. An embarrassed expression involved a suppressed smile with minimal teeth showing, the head turned downwards or sideways, a downward gaze and face touching (Haidt and Keltner, 1999; Heerey et al., 2003; Tracy et al., 2009).

Each visual cue category (changes in gaze direction, facial expression, shifts in posture) was coded in separate blocks, and the order of these blocks was counterbalanced across coders. Each trial consisted of a fixation point in the middle of the screen (3,000 ms), a muted video (of variable duration), and the visual cue rating scale (presented until the coder responded). Coders could select all the subcategories of visual cues that they saw by clicking on these labels on the computer screen, unless they had chosen sustained eye contact or a still posture, in a block for changes in gaze direction or shifts in posture, respectively. In this case, they were instructed to not select any other gaze or posture cues, respectively. The coders' click responses indicated the presence of a cue, while not clicking indicated the absence of a cue. Stimuli were randomly assigned to ten blocks (≈20 videos/block), separated by a short break. No time limit was imposed, and each video could be repeatedly watched by the coder until they were satisfied.

Inter-rater Reliability

Gwet's AC1 (or Gwet's first-order agreement coefficient) (Gwet, 2008; Wongpakaran et al., 2013) was used to calculate inter-rater reliability of the coded visual cues because it can be used with categorical data involving more than two raters and more than two categories (McCray, 2013). This measure was used to calculate inter-rater reliability for the presence/absence of each visual cue subcategory. Gwet's AC1 is an agreement coefficient which ranges from 0 to 1 (McCray, 2013). Compared to other Cohen's Kappa measures for multiple raters, Gwet's AC1 provides a measure of statistical significance, is positively biased toward the agreement/disagreement between raters, and is able to handle missing data (i.e., it does not exclude items that were coded by fewer than six coders) (McCray, 2013). Gwet's AC1 was calculated using an R script called “agree.coeff3.dist.R” (Advanced Analytics LLC, 2010; Gwet, 2014). The magnitude of inter-rater agreement was determined by comparing the Gwet AC1 coefficient to the Altman benchmark scale (Altman, 1991).

For the subcategories of changes in gaze direction (eye contact, upward gaze, downward gaze, sideways gaze) there was fair agreement (Gwet AC1 coefficient = 0.26, SE = 0.02, 95% CI = 0.21, 0.30, p < 0.001). For the subcategories of facial expressions (thinking, serious, happy, amused, embarrassed expressions), there was also fair agreement (Gwet AC1 = 0.25, SE = 0.02, 95% CI = 0.21, 0.29, p < 0.001). For the subcategories of shifts in posture (still posture, forward, backward, and sideways shifts), there was moderate agreement (Gwet AC1 = 0.58, SE = 0.03, 95% CI = 0.53, 0.63, p < 0.001).
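
To make the agreement statistic concrete, the following is a minimal sketch of the multi-rater AC1 point estimate as defined in Gwet (2008). It is not the “agree.coeff3.dist.R” script used in the study, it omits the standard error and significance test, and the function names, the data layout (one list of coder labels per video, with None marking a missing rating) and the toy data are our own assumptions. The benchmark cut-offs follow the commonly cited Altman (1991) bands.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Point estimate of Gwet's AC1 for several raters and categories.

    ratings: one list per item holding each rater's label (None = missing).
    categories: all labels a rater could assign.
    """
    # Count how many raters chose each category, item by item.
    per_item = []
    for item in ratings:
        labels = [label for label in item if label is not None]
        if labels:
            per_item.append(Counter(labels))
    if not per_item:
        raise ValueError("no rated items")

    # Observed agreement: share of agreeing rater pairs, averaged over
    # items that were rated by at least two raters.
    pair_agreement = []
    for counts in per_item:
        r_i = sum(counts.values())
        if r_i >= 2:
            agreeing = sum(c * (c - 1) for c in counts.values())
            pair_agreement.append(agreeing / (r_i * (r_i - 1)))
    if not pair_agreement:
        raise ValueError("no item was rated by two or more raters")
    p_a = sum(pair_agreement) / len(pair_agreement)

    # Chance agreement from the average prevalence of each category.
    n_items = len(per_item)
    q = len(categories)
    p_e = 0.0
    for k in categories:
        pi_k = sum(c[k] / sum(c.values()) for c in per_item) / n_items
        p_e += pi_k * (1 - pi_k)
    p_e /= (q - 1)

    return (p_a - p_e) / (1 - p_e)


def altman_label(coefficient):
    """Map a coefficient onto the Altman (1991) benchmark bands."""
    for upper, label in ((0.20, "poor"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "good")):
        if coefficient <= upper:
            return label
    return "very good"


if __name__ == "__main__":
    # Toy data (hypothetical): presence (1) / absence (0) of one cue,
    # three videos, six coders, one missing rating.
    toy = [[1, 1, 1, 1, 1, 0], [0, 0, 0, 1, 0, None], [1, 0, 1, 0, 1, 1]]
    ac1 = gwet_ac1(toy, categories=[0, 1])
    print(round(ac1, 3), altman_label(ac1))
```

Applied to the coefficients reported here, altman_label returns "fair" for 0.26 and 0.25 and "moderate" for 0.58, matching the descriptions in the text.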

Results

Analyses focused on characterizing speakers based on their confidence level by examining speakers' subjective confidence (i.e., self-ratings) in their responses for less-known and well-known general knowledge questions and the specific visual cues speakers produced during these states. Statistical analyses were performed and figures were created using R statistical computing and graphics software in RStudio (R Core Team, 2017).

Manipulation Check for Speaker's High and Low Confidence

The purpose of these analyses was to ensure that the two types of trivia questions (well-known and less-known general knowledge) posed to speakers elicited a state of high and low confidence, respectively. As expected, speakers were more accurate in responding to the well-known general knowledge questions (M = 0.94, SD = 0.23) than the less-known general knowledge questions (M = 0.06, SD = 0.23, d = 3.83). This effect was seen across all speakers based on each speaker's mean accuracy by trivia question type, t(8) = 20.75, p < 0.001, 95% CI [0.79, 0.99]. Seven (out of the nine) speakers answered all the well-known general knowledge questions correctly and six (out of the nine) speakers incorrectly answered all the less-known general knowledge questions. Speakers were also subjectively more confident following their response to well-known general knowledge questions (M = 4.29, SD = 1.22) than to less-known general knowledge questions (M = 1.49, SD = 0.96, d = 2.92). This effect was seen across all speakers based on each speaker's mean subjective confidence ratings by trivia question type, t(8) = 16.71, p < 0.001, 95% CI [2.39, 3.15]. Following their response to well-known general knowledge questions, speakers rated themselves as “very confident” (i.e., a rating of 5 out of 5) for 68.3% of trials and following their response to less-known general knowledge questions, speakers rated “not at all confident” (i.e., a rating of 1 out of 5) for 72.4% of trials.

When responses were grouped by accuracy rather than by trivia question type, mean self-ratings of confidence were significantly higher when speakers answered a question correctly (M = 4.26, SD = 1.26) than when they answered it incorrectly (M = 1.51, SD = 0.98, d = 2.18). This effect was seen across all speakers based on each speaker's mean subjective confidence ratings by their trivia question accuracy, t(8) = 14.87, p < 0.001, 95% CI [2.30, 3.15]. When speakers answered correctly, they rated themselves as “very confident” for 68.3% of the trials, and when speakers answered incorrectly, they rated themselves as “not at all confident” for 71.5% of the trials. Based on these data, it can be argued that the speakers' responses to well-known vs. less-known general knowledge questions elicited representative and highly differentiated states of high vs. low confidence in the speech content. For the remaining analyses of visual cues, we will therefore refer to responses to well- and less-known general knowledge questions as reflecting high and low confidence conditions, respectively.

Characterizing Speakers' Confidence Level From Coded Visual Cues

To further characterize speakers' subjective confidence, we analyzed the specific visual cues that speakers produced when they had high or low confidence in their speech content. Given that inter-rater reliability was fair to moderate, we adopted a relatively strict criterion: a visual cue was considered present if indicated by at least 5 out of the 6 coders (83% agreement), otherwise it was coded as absent. Coders most frequently observed speakers producing a still posture (57.3% of all items) and a change in gaze direction (upward, downward and sideways gaze) (46.1% of all items) irrespective of speaker confidence level, and a serious facial expression was most often observed in the high confidence condition (51.5% of items).
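The presence criterion described above can be sketched as a simple threshold over coder judgments. This is a minimal illustration, assuming each coder's judgment for a cue is stored as a boolean; the function name and the example votes are hypothetical and do not come from the study's data.

```python
def cue_present(coder_votes, min_agree=5):
    """Return True if at least `min_agree` coders marked the cue as present.

    coder_votes: one boolean per coder (six coders in the study).
    """
    return sum(1 for vote in coder_votes if vote) >= min_agree


# Hypothetical votes from six coders for a single cue on two items.
item_a = [True, True, True, True, True, False]   # 5 of 6 agree: present
item_b = [True, True, True, True, False, False]  # 4 of 6 agree: absent

print(cue_present(item_a), cue_present(item_b))
```

Under this rule, a cue endorsed by only four of the six coders is treated as absent, which trades some sensitivity for fewer spurious cue labels.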

We performed chi-square tests to determine if the speaker's confidence level (high or low) varied independently of the types of visual cues speakers produced using their gaze, facial expression and posture, where these cues were either present or absent. For the speaker's gaze, the relationship between sustained eye contact and confidence level was significant, χ2 (1) = 5.24, p = 0.02, Cramer's V = 0.16. When speakers produced sustained eye contact, it more often occurred when speakers had high confidence (20.8% of items) compared to low confidence (8.6% of items). When speakers produced an upward gaze, it more often occurred when speakers had low confidence (17.1% of items) compared to high confidence (3.0% of items), χ2 (1) = 9.80, p = 0.002, Cramer's V = 0.22. The other shifts in gaze (downward or sideways) were not significantly associated with a confidence condition, downward gaze: χ2 (1) = 2.10, p = 0.15, Cramer's V = 0.10; sideways gaze: χ2 (1) = 3.77, p = 0.05, Cramer's V = 0.14. A downward gaze was reported for 23.8% of low confidence items and 14.9% of high confidence items, and a sideways gaze occurred for 21.9% of low confidence items and 10.9% of high confidence items. These patterns suggest that upward eye movements are more prevalent when speakers experience low confidence, whereas sustained eye contact occurs more frequently in cases of high confidence.
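The chi-square tests for gaze can be sketched as follows for a 2 x 2 table (confidence level by cue present/absent). The item counts are not stated in the text: the sketch assumes 101 high confidence and 105 low confidence items, back-calculated from the reported percentages (e.g., 20.8% of 101 is 21 items). It also assumes Yates' continuity correction, which R's `chisq.test` applies to 2 x 2 tables by default. The function name is hypothetical.

```python
import math


def yates_chi_square(a, b, c, d):
    """Chi-square test (df = 1) with Yates' correction for a 2 x 2 table.

    Layout: rows = confidence (high, low); columns = cue (present, absent).
    Returns (chi2, p, cramers_v).
    """
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail for df = 1
    cramers_v = math.sqrt(chi2 / n)     # for a 2 x 2 table, min(r, c) - 1 = 1
    return chi2, p, cramers_v


# ASSUMED counts, inferred from the reported percentages (not given in the text).
N_HIGH, N_LOW = 101, 105
eye_contact = yates_chi_square(21, N_HIGH - 21, 9, N_LOW - 9)
upward_gaze = yates_chi_square(3, N_HIGH - 3, 18, N_LOW - 18)

for name, (chi2, p, v) in [("eye contact", eye_contact), ("upward gaze", upward_gaze)]:
    print("%s: chi2(1) = %.2f, p = %.3f, V = %.2f" % (name, chi2, p, v))
```

Under these assumptions the sketch gives chi-square values of 5.24 and 9.80, with Cramer's V of 0.16 and 0.22, matching the values reported above for sustained eye contact and upward gaze.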

When characterizing the speaker's facial expressions, it was first noted that coders rarely identified facial expressions as being “amused,” “embarrassed,” or “happy” (always less than 2% of items in any condition). When speakers were characterized as having a “serious” face, this occurred significantly more often when speakers displayed high confidence, χ2 (1) = 41.11, p < 0.001, Cramer's V = 0.45. A serious expression occurred for 51.5% of high confidence items, compared to only 9.5% of low confidence items. In contrast, when speakers had low confidence, there were significantly more instances of a “thinking face,” χ2 (1) = 6.00, p = 0.01, Cramer's V = 0.17 (low confidence = 13.3% of items, high confidence = 3.0% of items). These patterns suggest that the most prominent facial cues were a thinking expression for low confidence and a serious expression for high confidence.

For the speaker's posture, the relationship between speakers producing a still posture and the speaker's confidence level was significant, χ2 (1) = 5.93, p = 0.01, Cramer's V = 0.17. When speakers had a still posture, it more often occurred when they had high confidence (66.3% of items) compared to low confidence (48.6% of items). Forward, backward, and sideways shifts in posture occurred more frequently when speakers had low confidence in their response, although these features were rarely identified in our stimuli for either confidence level (less than 3% of items in any condition). This pattern suggests that sitting still was the most salient postural feature in our data, and that it was linked more often to a high confidence state.

We also analyzed whether speakers' self-ratings of confidence for the general knowledge questions related to the presence/absence of each visual cue and observed similar patterns (see Supplementary Material).

Discussion

This study allowed us to elicit states of high and low confidence in speakers based on the speech content. Specifically, speakers were less accurate and gave lower subjective confidence ratings in their responses to the less-known general knowledge questions compared to the well-known general knowledge questions. This result replicates previous findings that used a similar task (e.g., Smith and Clark, 1993; Swerts and Krahmer, 2005).

We also replicated previous findings in terms of the visual cues that speakers with high and low confidence, respectively, produce (Goodwin and Goodwin, 1986; Doherty-Sneddon and Phelps, 2005; Swerts and Krahmer, 2005; Murphy, 2007). For example, speakers with high confidence in their speech content were more likely to produce sustained eye contact, a serious facial expression, and no shifts in posture. These cues are similar to those reported in Murphy (2007), where speakers who were asked to appear intelligent (Acting condition) during an informal conversation with another speaker were also more likely to display eye contact while speaking, a serious face, and an upright posture. Thus, the visual cues produced by speakers with high confidence in the content of their speech may be related to the cues produced by speakers conveying high intelligence. Moreover, there was a medium to large effect size when speakers produced a serious facial expression. This may suggest that a serious facial expression is a reliable visual cue for differentiating speakers with high vs. low confidence. We also found that speakers with low confidence were more likely to produce an upward gaze and a thinking facial expression. This result supports previous findings that cite the presence of these visual cues when speakers are trying to retrieve words from memory (Smith and Clark, 1993) and/or to save face from others in a social context (Goffman, 1967, 1971; Visser et al., 2014). Overall, the effect sizes for most of the produced visual cues ranged from small to medium, potentially due to variability in speakers' production of these cues.

Having created high and low confidence speaker conditions based on speakers' level of certainty in their speech content, we next investigated, in the Perception Study, how observers differentiated between these confidence levels using a speaker's visual cues. The prevalence of the visual cues produced in this Production Study is expected to influence the visual cues that observers use to discern speakers of high and low confidence.