Encouraging participant embodiment during VR-assisted public speaking training improves persuasiveness and charisma and reduces anxiety in secondary school students

Valls-Ratés, Ïo; Niebuhr, Oliver; Prieto, Pilar

doi:10.3389/frvir.2023.1074062

ORIGINAL RESEARCH article

Front. Virtual Real., 03 October 2023

Sec. Virtual Reality and Human Behaviour

Volume 4 - 2023 | https://doi.org/10.3389/frvir.2023.1074062

This article is part of the Research Topic "Editors Showcase: Embodiment in Virtual Reality"

Ïo Valls-Ratés¹,²*†‡

Oliver Niebuhr²†‡

Pilar Prieto¹,³†‡

¹Department of Translation and Language Sciences, Universitat Pompeu Fabra, Barcelona, Spain
²Centre for Industrial Electronics, University of Southern Denmark, Sonderborg, Denmark
³ICREA, Institució Catalana de Recerca i Estudis Avançats, Barcelona, Catalonia

Practicing public speaking to simulated audiences created in virtual reality environments is reported to be effective for reducing public speaking anxiety. However, little is known about whether this effect can be enhanced by encouraging the use of gestures during VR-assisted public speaking training. In the present study two groups of secondary school students underwent a three-session public speaking training program in which they delivered short speeches to VR-simulated audiences. One group was encouraged to “embody” their speeches through gesture while the other was given no instructions regarding the use of gesture. Before and after the training sessions participants completed a pre-training and a post-training session, respectively, each of which consisted of delivering a similar short speech to a small live audience. At pre- and post-training sessions, participants’ levels of anxiety were self-assessed, their speech performances were rated for persuasiveness and charisma by independent raters, and their verbal output was analyzed for prosodic features and gesture rate. Results showed that both groups significantly reduced their self-assessed anxiety between the pre- and post-training sessions. Persuasiveness and charisma ratings increased for both groups, but to a significantly greater extent in the gesture-using group. However, the prosodic and gestural features analyzed showed no significant differences across groups or from pre- to post-training speeches. Thus, our results seem to indicate that encouraging the use of gesture in VR-assisted public speaking practice can help students be more charismatic and their delivery more persuasive before presenting in front of a live audience.

1 Introduction

Apart from improving their public speaking skills (Boyce et al., 2007), giving secondary school students the opportunity to practice public speaking has been shown to improve their social skills (Morreale et al., 2000), self-confidence, and acceptance by their peers (Bailey, 2018), while lessening the risk that they will not engage in critical thinking during class (Blume et al., 2010). Given these potential benefits, it is clear that schools should provide as many opportunities for public speaking practice as possible. However, given the large number of students that many have to manage and the extensive syllabus they are expected to cover, teachers are often reluctant to devote much class time to practicing public speaking (Schneider et al., 2017), which also requires teachers to ensure that the social climate in the classroom is sufficiently safe and positive (Adler, 1980) for anxious students to overcome their fear of speaking to an audience (Kougl, 1980). Finally, students themselves are reported to put most of their preparation effort into writing the script of what they will say, spending at most 5 minutes on practicing their oral delivery (see Pearson et al., 2006).

Virtual reality technology (henceforth VR) can be used as a supplementary tool for rehearsing oral presentations or speeches in the classroom by means of a VR headset that gives wearers the visual 3-D illusion that they are standing in front of an artificially generated audience. The effectiveness of this tool in preparing students for speaking before real audiences has been demonstrated by research, as we will see below. However, in the present study we will explore whether combining such VR-assisted training with “embodiment”, in the sense of an encouraged use of gestures while speaking, will make student speakers both less anxious and more effective in subsequent experiences speaking to a live audience than VR-assisted practice in public speaking alone. Note that the current study is part of a set of three studies investigating VR effects on public speaking performance and public speaking anxiety. The first two studies focused on learning after VR-assisted training compared to non-VR-assisted training. The present study focuses on gestures by comparing two VR-assisted conditions. Thus, the series of studies looks at public speaking performance and public speaking anxiety across a sequence of training conditions, from non-VR to VR to gesture-activated VR (see Valls-Ratés, 2023 to appear, for an overview of the three studies).

This paper is organized as follows. In Section 1 we discuss the utility of VR for training public speaking (1.1), previous literature on the value of VR for reducing public speaking anxiety (1.2) and for training public speaking performance (1.3), and the role of embodiment in oral communication (1.4), before stating our goals and hypotheses (1.5). Our methods are described in Section 2 and our experimental results in Section 3. Finally, a discussion and conclusions are offered in Section 4.

1.1 Using VR to train public speaking

While VR technology is now widely utilized for recreational purposes (Peeters, 2019), VR-simulated environments are also increasingly used in education to promote active learning (Legault et al., 2019). VR can elicit the subjective illusion known as presence, the illusion of “being there” in the scenery that the VR technology recreates, even though the user consciously knows that the environment depicted is simulated (Armel and Ramachandran, 2003). VR users feel immersed in this virtual environment (Slater et al., 2006) and engage in it as active participants, to a much more intense degree than what they experience when they use a laptop or phone (Bowman and Hodges, 1999; Slater and Sanchez-Vives, 2016). VR-simulated environments have been shown to be an effective learning tool (Mikropoulos and Natsis, 2011), in part because they stimulate student enthusiasm and motivation (Dalgarno and Lee, 2010), to the extent that students are reported to be keen to adopt VR technology for their own educational purposes or encourage its adoption by educational institutions (Vallade et al., 2020).

With regard to training for public speaking in particular, research has shown that the speaking style of VR users addressing a simulated audience tends to be more listener-oriented in terms of its prosodic characteristics. To our knowledge, five studies have compared the features of speech when it is delivered to a live audience with speech delivered to a VR-simulated audience, three of them focusing on prosody. In the first, Niebuhr and Michalsky (2018) showed that the prosody of 24 university student participants as they practiced giving a speech in front of a VR-simulated audience was more conversational and listener-oriented than the prosody of students practicing alone, without an audience. The VR-assisted speech was characterized by a higher fundamental frequency (f0¹) level, a larger f0 range, and a slower speaking rate. Interestingly, the speech of students practicing alone underwent an increasing “prosodic erosion” effect whereby the more the students repeated their speeches, the progressively lower and narrower the speech melody of their delivery became; by contrast, the VR-assisted speakers exhibited much less of this effect (see also Niebuhr and Tegtmeier, 2019). Also, VR-assisted speakers spoke for a longer time, made fewer pauses and used a higher intensity level. In the second study, which was carried out with 30 female elementary school teachers, Remacle et al. (2021) demonstrated that a VR-simulated classroom was able to induce in teachers’ speech vocal features that were very similar to those they used in the classroom. In line with the findings by Niebuhr and Michalsky (2018), the participants’ f0 values, f0 variation and voice intensity levels were all much higher in speech delivered to a class, whether real or simulated, compared to unprepared speech delivered to the experimenter. A similar example is the study by Selck et al. (2022), which showed that speakers adjust the vocal effort of their speech according to how far away the interlocutor is. Selck et al. found a similar adjustment to the speaker-listener distance also in VR dialogues, especially when the effect of visual immersion was complemented with a 3D acoustic-ambiance immersion. In the third study related to prosody, Valls-Ratés (2023, chapter II) found that as secondary school students practiced before VR-simulated audiences, their prosody became audience-oriented, that is, stronger, more effortful and louder, although they did not perform more gestures.

The remaining two studies focused not only on the prosody of VR users but also on other features. Notaro et al. (2021) explored the effects of VR on the fluency and gesture rate of 13 participants who performed the same speech twice, first in front of a live audience and then in front of a VR-simulated audience but also in the presence of the same live audience. The authors concluded that participants’ speech displayed larger f0 variation and higher intensity levels when they addressed the virtual audience. In the VR condition speakers also paused more often and reduced their speech rate as well as the number of meaningless gestures per minute, pointing to the possibility that when speaking to a VR audience they exerted greater control over their gestures. Finally, focusing on an L2 setting, Thrasher (2022) conducted a study with 25 learners of French performing two VR tasks and two classroom tasks to assess the impact of VR on the students’ anxiety and French comprehensibility. Native French-speaking raters assessed the audio files and found speeches performed while using VR to be more comprehensible than speeches performed in the classroom. They also concluded that VR made participants less anxious than in-class tasks and they rated low-anxiety participants as easier to understand than high-anxiety participants, regardless of the performance context.

Overall, research suggests that speakers using VR to address a simulated audience are willing to adopt a more engaging listener-oriented way of speaking. Therefore, it is reasonable to expect that practicing public speaking using VR technology has the potential to not only improve the public speaking performance of high-school students but also in the process reduce public speaking anxiety (henceforth PSA).

1.2 The effect of VR-assisted training on public speaking anxiety

In line with current educational practices in Western countries, secondary-level students are increasingly expected to stand in front of the class and deliver expository talks, with their classmates and teacher as audience. Unsurprisingly, some students are more comfortable being the sole focus of attention than others, and a certain proportion of the students in any class may experience what has been labeled PSA when asked to present in front of an audience. Physiologically, PSA is manifested by a wide range of symptoms such as increased heart and breathing rates, nausea, a dry mouth or sweating (Smith et al., 2005; Boyce et al., 2007; Tse, 2012), but the psychological reality of PSA has been amply documented through the use of self-reported measures of anxiety.

In the last few decades, a body of research has shown that VR technology is useful to reduce PSA in clinical contexts (e.g., Wallach et al., 2009; Lister et al., 2010; Wallach et al., 2011; Lister, 2016; Lindner et al., 2018; Yuen et al., 2019; Zacarin et al., 2019) as well as in educational settings (see Daniels et al., 2020 for a review). However, this technology has not been the only treatment for anxiety and other types of phobias such as fear of heights, arachnophobia or claustrophobia. In the field of psychology, treatments such as Cognitive Therapy or Cognitive Behavioral Therapy have been widely employed to help patients reduce or overcome PSA (e.g., Anderson et al., 2005; Wallach et al., 2009).

To our knowledge, four studies focusing on the impact of VR-assisted public speaking practice on PSA have been carried out in university settings, generally by comparing participant self-reported levels of distress, communication competence, willingness to communicate, and/or physiological measures before and after training sessions, and all of them found that VR has a stronger impact in reducing PSA than other approaches (Heuett and Heuett, 2011; Boetje and Van Ginkel, 2020; LeFebvre et al., 2020; Rodero and Larrea, 2022). Heuett and Heuett (2011) compared VR to Visualization treatment (Ayres and Hopf, 1985) and reported that, although both groups were successful at diminishing anxiety, VR participants significantly increased their willingness to communicate and their self-perceived communication competence. Boetje and Van Ginkel (2020) concluded that rehearsing with VR two times after having received feedback reduces PSA and improves oral skills more effectively than rehearsing only once. Rodero and Larrea’s (2022) study with 100 university students investigated the role of distractors (e.g., someone coughing in the audience or a member of the audience asking a question) in participants’ public speaking performance and anxiety. Comparing the performance of students who had rehearsed their speeches in the VR environment with that of students in a control group who had rehearsed their speeches in front of an instructor, they concluded that those practicing with VR reduced their self-assessed and physiologically measured anxiety significantly more than the control group. The authors speculated that the use of distractors more closely simulates what the speakers can expect from a live audience, making them feel more prepared and self-confident and more able to concentrate. The study by LeFebvre et al. (2020), involving one group of 17 students, also reported significant changes in PSA from pre- to post-test with students using VR to train their oral skills. Their results suggest that VR minimizes the cognitive strain on speakers when they rehearse because, unlike when they practice alone, they are freed from having to imagine the scene and setting of the live audience they will ultimately have to face.

To the best of our knowledge, only two studies have explored the role that VR environments can play in reducing PSA in secondary school students. In the study by Kahlon et al. (2019), a group of 27 adolescents (aged 13–16) diagnosed with PSA underwent a single 90-min therapist-led session in which they performed various oral exercises using VR. Participant self-reports at one and three months after the session showed diminished PSA levels, although the lack of control or comparison groups made it impossible to clearly identify the sources underlying this decrease. In the other study, carried out by the authors of the present paper, Valls-Ratés et al. (2022) compared the public speaking performance of 50 students before and after they had practiced giving a 2-min speech, either in front of a VR audience or alone in a classroom. Students assessed their own anxiety levels before and after rehearsing, and 15 independent raters also rated participant performance for persuasiveness in pre- and post-training speeches, which were in addition analyzed for prosodic features as well as gesture rate. Though both groups significantly reduced their self-perceived anxiety at post-training and developed a more audience-oriented prosody, the raters detected no significant differences in the persuasiveness of delivery nor in the charisma of speakers in either group.

1.3 The effect of VR-assisted training on public speaking performance

Several studies have assessed the potential benefits of VR-assisted public speaking training for mitigating PSA and boosting public speaking performance. However, the few studies exploring the latter line of research came up with mixed results.

Sakib et al. (2019) performed a VR-assisted experiment involving 26 university students that included eight practice sessions with a pre- and post-test design consisting of giving a short speech before a live audience. Results showed improvements in the quality of the performance and self-assessed anxiety indicators at post-test. Nonetheless, the experimental design lacked a control element, limiting the external validity of the study’s findings.

Similarly, in a study involving two groups of 11 pre-university students each, Van Ginkel et al. (2020) compared the effect of practicing a speech either using VR or alone in front of an instructor. Immediately after speaking, both groups received feedback. The feedback offered to members of the first group was based on immediate feedback automatically produced by the VR system regarding the speaker’s use of voice, eye contact, and posture and gestures during the speech, while the second group received delayed feedback based simply on the instructor’s direct observations. The authors concluded that in the VR condition both the VR environment and the feedback the VR system provided were effective at increasing eye contact and speech rate when participants gave their final speech to classmates in the last session of the study. Nevertheless, Van Ginkel et al. acknowledged that it was difficult to claim that the outcomes were a direct result of the VR-assisted rehearsal itself because the instructions received by participants, feedback, and practice outside the workshop might also have affected the results.

For their part, Kryston et al. (2021) analyzed how the quality of speech delivery by 140 students and their PSA levels were affected by practicing a speech in a VR-simulated setting compared to not practicing at all. Results indicated that VR training sessions did not affect the PSA self-reported by students, but that VR-assisted practice yielded higher quality speech ratings than no practice.

In the context of L2 learning, Gao (2022) compared a VR condition to a traditional multimedia technology condition to boost the English pronunciation skills of 90 Chinese university students. Results showed that both conditions were successful in improving oral English skills, but the VR condition outperformed the control condition.

On the whole, previous findings regarding the value of VR-assisted training for public speaking seem to point to a gain in general public speaking performance. Nonetheless, more research is needed to assess the impact of VR in public speaking training, especially in secondary education, where studies are scarce (Kahlon et al., 2019; Valls-Ratés et al., 2022).

1.4 Embodiment in VR-assisted training and in public speaking

The term embodiment refers to the interaction between the physical activity of our bodies and the (technological) environment, implying a strong connection between mind and body (Kilteni et al., 2012). Within the embodied cognition paradigm, body and environment have been related to cognitive processes and embodiment has been shown to be grounded in physical perceptive and motor systems (e.g., Barsalou, 1999; Shapiro, 2014). In this paper, we use the term embodiment to refer to the participants’ strong activation of the body’s meaningful movements during VR public speaking experiences. Even though embodiment is related to the well-known ‘sense of presence’ in VR research (many authors have pointed out the correlation between higher levels of sense of presence and body movement; see Slater et al., 1995; Slater et al., 1998; Bianchi-Berthouze et al., 2007), here we will focus on encouraging participants’ embodiment. That is, even though we will not measure or directly systematically vary participants’ sense of presence, it is reasonable to assume that a higher amount of body engagement in creating nonverbal meanings (together with the speaker’s prosody) will be not just more natural and effective, but it will also stimulate a higher sense of presence, for reasons outlined below.

The connection between body movements and the ensemble of sensations felt when a person is interacting with a VR-simulated environment was explored in a study by Slater et al. (1998) in which the researchers assessed the sense of presence of participants interacting with VR environments. Participants were asked to walk through a VR forest and count the trees with unhealthy leaves. In one condition, the trees varied from short to tall while in the other they were consistently taller than normal eye level. Thus, in the first condition participants had to turn their heads around and up and down and if necessary bend down, while in the second such movements were unnecessary. The authors found that participants who made more body movements while performing the tasks reported a significantly higher sense of presence (see also Slater et al., 1995). In a similar vein, Bianchi-Berthouze et al. (2007) found that body movement not only increased the engagement of participants, but also played a role in the affective way in which participants got involved in the task, resulting in engagement scores being positively correlated with how much the participant moved (see also Pallavicini and Pepe, 2020 for a decrease of participants’ anxiety and body movement while playing VR video games). This body engagement is one of the factors that influences the sense of presence reported by VR users (Sanchez-Vives and Slater, 2005).

Outside the area of VR, the term embodiment has been used in the context of oral discourse performance to refer to the gesturing movements characteristically made by speakers when they speak, in other words, the participation of the body in the delivery of spoken messages. In the last few decades much of the literature has paid particular attention to how body movements and co-speech gestures are linked to language and thought (e.g., McNeill, 1992), that is, the way speakers use their faces, hands, or other body parts helps them express their ideas and, ultimately, is a reflection of their thinking (Hostetter and Alibali, 2019). Various theories have arisen in this connection, such as the gestures-as-simulated-action framework (e.g., Hostetter and Alibali, 2004; 2008; see also Hostetter and Alibali, 2018 for a review; see also Kita, 2000; McNeill, 2005), all of them sharing the view that embodied knowledge is directly reflected in speech-accompanying gestures.

Crucially, in the present paper we hypothesize that the encouragement of body engagement and the use of co-speech gesturing during VR-assisted public speaking training can trigger an improvement in public speaking performance. Research has shown that actively moving the body and gesturing while speaking (and even prompting an interlocutor to do so) facilitates language and cognitive processing tasks, perhaps because it increases access to words and neural activation (e.g., Krauss et al., 2000). Gesturing has been shown to help communicate spatial imagery (e.g., Alibali, 2005) and perform complex motor tasks (Feyereisen and Harvard, 1999). The visual-spatial imagery of gesturing also seems to help speakers package spatio-motor information into units that are compatible with speech (e.g., Kita, 2000). Gesturing while explaining a task is a predictor of how soon speakers will master the task (Church and Goldin-Meadow, 1986; Pine et al., 2004), and spontaneously gesturing while performing a task improves memory retention (Alibali and Goldin-Meadow, 1993; Cook and Goldin-Meadow, 2006). Even the form of the gesture is important: a study by Thomas and Lleras (2009) with participants trying to solve a problem while occasionally either swinging their arms or moving them in other ways demonstrated that the participants could solve the problem more easily when swinging their arms than when performing other arm movements. The authors concluded that specific movements seemed able to guide learners’ higher order cognitive processing. Importantly for the present study, previous studies have also shown that the experience of physical movement can have a direct effect on diminishing anxiety, as well as clinical depression (Wang et al., 2014; Gunnell et al., 2016; Korczak et al., 2017; McMahon et al., 2017). Repeated, rhythmic gestures in the form of aerobic exercises are negatively correlated with trait anxiety and depression and positively related to both physical health and self-concept (e.g., McDonald and Hodgdon, 1991; Fox, 2000). All in all, the results of this line of research indicate that the physiological changes triggered by one or multiple sessions of physical activity have a direct and positive effect on cognitive functioning (see Donnelly et al., 2016 for a review).

On a related note, different types of embodiment in public discourse have a clear effect on the listeners’ assessments of the speeches; for example, the specific style of gesturing used by the speaker can directly influence the audience’s evaluations. Specifically, various studies have found that listeners find gesturing speakers more self-assured and skilled (Maricchiolo et al., 2009), warmer and more in control of their performance (Gnisci and Pace, 2014) and more pleasant (Kelly and Goldsmith, 2004) than speakers who do not gesture. Despite this, some recent studies suggest that while audiences favor a moderate amount of gesture by speakers, excessive gesturing is felt to diminish the effectiveness of delivery as much as little or no gesturing (e.g., Rodero, 2022; Rodero et al., 2022). Posture also sends a message: various studies have shown that open postures convey high power and closed postures low power (Darwin, 1872; Hall et al., 2005; Carney et al., 2010). Other research suggests that postures not only send messages to viewers but also reinforce feelings of either dominance or submission in those who adopt them, which can also make public speakers feel more or less self-confident (Cuddy et al., 2012). People who adopt high power poses feel more powerful, positive, in control, optimistic about the future, and focused on their ambitions (e.g., Anderson and Galinsky, 2006; Burgmer and Englich, 2012). However, evidence for the effect of power postures on speakers’ feelings is mixed (e.g., Ranehill et al., 2015; Davis et al., 2017; Latu et al., 2017), and many of the existing studies are underpowered.

In sum, it seems that encouraging the use of embodiment during VR-assisted public speaking training has the potential to help boost oral skills after intervention and reduce the public speaking anxiety of participants. Crucially, within a VR simulation context, it might well be that actively moving the body has an enhancing effect on the sense of presence that users experience, as has been reported by the studies reviewed in this section.

1.5 The present study: goals and hypotheses

Despite the considerable research outlined above, relatively few of these studies have focused on how VR could be used to improve training in public speaking skills, for secondary school students in particular. In addition, to our knowledge there has been no research so far on whether VR-assisted training in public speaking will be more effective—in terms of not only a more effective speaker performance but also reduced PSA—if speakers are encouraged to embody their speech during VR training, that is, to accompany their verbal message with moderate amounts of appropriate gesturing. Previous studies have shown that the use of VR does not automatically stimulate a more frequent use of gestures (e.g., Selck et al., 2022; Valls-Ratés et al., 2023). Therefore, the present study will investigate whether VR-assisted public speaking training in which participants are explicitly instructed to actively move their body will diminish speaker PSA and boost their public speaking performance after intervention to a greater degree than the same training without any instructions to use embodiment. Importantly, the study will include a comprehensive assessment of the students’ public speaking performance before and after their VR-assisted training sessions which will include the participants’ self-perceived levels of anxiety, listeners’ perception of persuasiveness and charisma, and an assessment of the prosodic and gestural features of the pre- and post-training speeches.

The fundamental research question of the study is whether VR-assisted training that encourages an embodied delivery will improve speaker effectiveness and reduce self-perceived anxiety. We hypothesize that such training will 1) diminish speaker anxiety, 2) make the delivery of participants more audience-oriented in terms of specific use of prosodic features and gesture rate, and 3) make participants sound more charismatic and their messages more persuasive.

2 Method

2.1 Participants

A total of 78 students aged 16 to 17 were recruited from four secondary schools located in two central city districts of Barcelona. Although the city of Barcelona is characterized overall by a high percentage of Catalan-Spanish bilingualism, the degree to which one or the other language dominates in a particular neighborhood varies considerably. However, the schools chosen here were selected on the grounds that the bilingualism of their student bodies (as well as the middle-class socio-economic status of their families²) would have fairly uniform features (on average, students at all four schools reported that they used Catalan roughly 80% of the time in their daily lives).

Of the original 78 participants, data from eight participants had to be disregarded for one or both of the following two reasons: 1) the participant failed to attend one of the practice training sessions or to perform the post-training task; and 2) their speeches in the pre- or post-training task lasted less than a minute or contained fewer than two supporting arguments. The mean age of the 70 remaining participants (71.43% female/28.57% male) was 16.45 years (SD = 0.36). All participants were typically developing adolescents and had no history of speech, language, or hearing difficulties.
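The sample figures reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the original analysis; the variable names are illustrative, and it assumes that the reported percentages correspond to whole-number participant counts.

```python
# Counts reported in Section 2.1
recruited = 78
excluded = 8
retained = recruited - excluded  # participants whose data were analyzed

# Assumption: the reported 71.43% female share maps onto a whole number of participants
female = round(retained * 71.43 / 100)
male = retained - female

# Recompute the shares from the inferred counts, to two decimals as in the text
female_pct = round(100 * female / retained, 2)
male_pct = round(100 * male / retained, 2)

print(retained, female, male)  # 70 50 20
print(female_pct, male_pct)    # 71.43 28.57
```

Under this assumption, the percentages are consistent with 50 female and 20 male participants among the 70 retained.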

The study was formally endorsed by the governing boards of all four schools, which treated the proposed training sessions as an extra-curricular activity that was carried out on the school premises.

2.2 Materials for the public speaking tasks

Since the experiment involved asking students to individually perform a total of five public speaking tasks, two in front of a real audience constituting the pre-training and post-training, and three in front of VR-simulated audiences constituting the practice sessions, it was felt necessary to control for the topics on which participants would speak on each occasion by mandating the same topic for each participant. In order to select topics that would be of interest to adolescents, an initial selection of 10 topics was made by the authors based on a long list of suggested topics taken from a public website for teachers of public speaking (www.myspeechclass.com). This list was fitted into an anonymous online survey asking respondents to rate on a seven-point scale how interesting they felt each topic would be, and a link to the survey was emailed to lists of about 75 17-year-olds, 58 of whom responded. The four topics receiving the highest scores overall from these respondents were chosen for the experiment.

For every speaking task, participants were provided with a set of printed instructions that included the topic for their speech and a list of five arguments they could employ to defend their ideas (see Supplementary Appendix). All participants received the same instructions. While the topic and arguments for the pre- and post-training speeches were identical, the topics for each of the three practice sessions were different, as were the accompanying arguments. Arguments provided were intended as guidance; participants were not required to use them in their speeches, nor were they told to employ a particular number of arguments.

The instructions and procedures of the experiment were piloted by four 17-year-old students in a 3-h session that enabled the researcher to refine and validate the final instructions and topics. The language of all materials and procedures was Catalan. It was also the language used by participants to deliver their speeches.

2.3 Experimental design

One week prior to the pre-training speech to a live audience, an information session was held by the experimenter in each of the high schools. The session served the purpose of explaining the experimental procedure and overall schedule. Participants were informed that the training period would comprise five sessions, each involving the preparation and delivery of a public speech, but that only the first and last sessions would be in front of a live audience, which would consist of three real people. Participants were also given the opportunity at this time to familiarize themselves with the use of VR goggles. Participants were specifically informed that their speeches had to be persuasive, since their audiences would consist of three representatives of the Catalan government who might be swayed to initiate policy (e.g., allocating more government spending to school field trips to the countryside) based on what they had heard.

After the information session, the researcher randomly divided participants from each school into two groups, both of which would participate in the subsequent public speaking practice sessions in front of a VR-simulated audience. One of the two groups would be encouraged by the researcher to accompany their speech with gesture (henceforth the Gesture Activated VR group, n = 40), while the other would receive no instructions with regard to their use of gesture while speaking (henceforth the Non-Gesture Activated VR group, n = 30). Even though this study explores the differences in gesture encouragement while using VR, we considered that it was clearer to label the two groups “Gesture Activated VR” and “Non-Gesture Activated VR”.

Three such sessions were planned because a single session was felt to provide insufficient time for participants to become comfortable speaking in a VR-simulated environment. Research has shown that visual context-to-target associations can be learned effectively after three repetitions in VR (Zellin et al., 2014).

Though all participants performed the three practice speeches to a VR audience following the same basic instructions, the participants in the Gesture Activated VR group were given the following additional instruction in writing right before each of the three training sessions: “Remember to use your whole body to express yourself fully”.

Finally, as noted above, all participants again performed a speech to a live audience of the same three “government representatives” as a post-training. The topic on which they were instructed to speak was identical to that used for the pre-training. The full duration of the experiment was 5 weeks. The experimental design is shown schematically in Figure 1.

FIGURE 1. Experimental design.

2.4 Procedure

All public speaking performances were carried out individually by each participant in a silent room at each participating school and were video-recorded. They were supervised by the first author, who also managed the collection of data with the help of an assistant. For the pre- and post-training public speaking tasks, three 24-year-old university students also attended the session and acted as the live audience (the “government representatives”). Neither the research assistant nor the three members of the audience were aware of the goals of the study. To prevent our behavioral data from being biased by experimenter effects (see Rosenthal, 1976), the first author welcomed participants and informed them about the procedure but was present neither in the practice room nor in the room where participants gave the pre- and post-test speech.

Before the pre-training public speaking performance to the live audience, participants were given the written instructions and left alone for 2 minutes to mentally prepare what they planned to say. The topic prompt was “Do you think that adolescents should spend more time in nature?” They then proceeded to the room where the “government representatives” were seated and delivered their speech. They were allowed a maximum of 2 minutes to do so.

The first of the three training sessions took place a week later, and the second and third were conducted over the following 2 weeks. As with the pre-training speech, participants had 2 minutes after receiving the written instructions to individually plan their speech. After the 2 minutes of preparation had elapsed, they went to the adjacent classroom, where the experimenter fitted them with a Clip Sonic® VR headset, to which a smartphone was attached. A week after the third training session, participants individually performed the post-training public speaking task, speaking about the same topic and to the same audience as in the pre-training task.

2.4.1 VR equipment

The study used a free-of-charge VR interface application installed on the smartphone called BeyondVR©. When the phone screen is viewed through special cardboard glasses, it gives the user the impression that they are standing in front of an audience of 40 people. Screenshots of the virtual audience are linked from the online version of the article. The computer-generated low-fidelity audiences make gestures and body movements resembling those that a live audience would make while listening to a speaker. However, the audiences generated by this application do not react to what the speaker says, nor can they be manipulated to behave in different ways. Participants were not able to see their own body while wearing the VR headset nor could they see a virtual representation of their body in the VR environment. Participants were able to monitor their speaking time by referring to a timer displayed in their field of vision by the headset.

2.5 Anxiety measures

Speaker anxiety was self-reported by participants just prior to entering the room where they would give their pre- and post-training speeches using the Subjective Units of Distress Scale (SUDS; Wolpe, 1969). SUDS has been frequently used in cognitive-behavioral treatments and exposure practices to evaluate treatment progress, as well as for other research purposes. More specifically, the SUDS has been widely used in the analysis of speaker anxiety (e.g., North et al., 1998; Bartholomay and Houlihan, 2016; Takac et al., 2019) and is a validated instrument in which the reporting individual indicates his or her levels of anxiety in various contexts, using a 100-point scale where ‘0’ represents no distress whatsoever and ‘100’ represents the most intense distress imaginable. Each ten-point interval on the scale is accompanied by a brief description of how the participant might feel, so that the participant identifies with its meaning in the most specific way possible.

2.6 Public speaking performance measures

A total of 140 pre- and post-training test speeches were obtained from the 70 participants. They ranged from 1 to 2 min in duration, the mean being 1:23 min.

As noted above, these speeches were assessed for 1) perceived persuasiveness and charisma (2.6.1); 2) prosodic parameters (2.6.2); and 3) manual gesture rate (2.6.3).

2.6.1 Perceived persuasiveness and charisma

The impression created by each speech on a listener was measured in terms of the perceived persuasiveness of the speech and the perceived charisma of the speaker.

Persuasion has been defined as “the deliberate attempt to change thoughts, feelings, account, or behavior of others” (Rocklage et al., 2018: 1). More specifically, Scheidel (1967: 1) defines persuasion as “the activity in which the speaker and the listener are conjoined and in which the speaker consciously attempts to influence the behavior of the listener by transmitting audible and visual language”. It has been shown that the perception of persuasion is modulated not only by the specific information transmitted by the speaker but also by the prosodic characteristics of the oral discourse (e.g., Burgoon et al., 1990; Krauss et al., 1996; Manusov and Patterson, 2006; Jackob et al., 2011; Yokoyama and Daibo, 2012), as well as by the gestural performance (Mehrabian and Williams, 1969; Ekman et al., 1976; Kelly and Goldsmith, 2004; Maricchiolo et al., 2009; Peters and Hoetjes, 2017). For example, more varied intonation, greater fluency, and faster speaking rate are likely to convey more credibility and overall persuasiveness (Jackob et al., 2011), and greater vocal variety enhances the impression of competence, character, and sociability in a speaker (Addington, 1971; Ray, 1986).

Charisma has been widely studied, as it is a key aspect of leadership and social interaction. Contrary to the earliest definitions of charisma, which defined it as innate or almost magical (Weber, 1968), it is now regarded as an ability that can be taught and learnt. According to a recent terminological refinement of the concept by Michalsky and Niebuhr (2019), charisma represents a particular communication style. As Niebuhr and Neitsch (2020: 358) point out, [charisma] gives a speaker leader qualities through symbolic, emotional, and value-based signals. Three classes of charisma effects are to be distinguished in the [public speaking] context, namely, 1) conveying emotional involvement and passion inspires listeners and stimulates their creativity; 2) conveying self-confidence triggers and strengthens the listeners’ intrinsic motivation; 3) conveying competence creates confidence in the speakers’ abilities and hence in the achievement of (shared) goals or visions. Inspiration, motivation, and trust together have a strongly persuasive impact by which charismatic speakers are able to influence their listeners’ attitudes, opinions, and actions.

In the present study, a group of 15 raters (9 women and 6 men, aged 23 to 63, all university-educated) assessed speakers’ persuasiveness and charisma based on the video recordings of the pre- and post-training test speeches. The first author of the study led a 1-h training session in which the raters, guided by the definitions of persuasiveness and charisma offered above, observed a public speaker and then rated their performance.

After training, the 15 raters were asked to watch each of the 140 video recordings embedded in an online questionnaire created using Alchemer (https://www.alchemer.com). After raters had viewed each speech, they were asked to answer two questions: “On a scale of 1–7, where 1 is “totally unpersuasive” and 7 is “extremely persuasive”, rate the persuasiveness of the message” and “On a scale of 1–7, where 1 is “totally uncharismatic” and 7 is “extremely charismatic”, rate the degree of charisma conveyed by the speaker” (see other studies that have employed perceptive ratings of charisma; e.g., Rosenberg and Hirschberg, 2009; Berger et al., 2017; Siegert and Niebuhr, 2021; Weninger et al., 2012). Raters were instructed to assess persuasiveness and charisma holistically and spontaneously, analyzing neither the words nor the rhetorical figures the speakers employed. The scores for both persuasiveness and charisma variables ranged from 15 to 105.
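The 15 to 105 range is what one would obtain if each speech's score were the sum of the 15 raters' 1–7 ratings. The text does not state the aggregation rule explicitly, so the following minimal sketch treats summation as an assumption; the function and constant names are illustrative only.

```python
# Assumption: a speech's score is the sum of the 15 raters' 1-7 ratings.
N_RATERS = 15
SCALE_MIN, SCALE_MAX = 1, 7


def summed_score(ratings):
    """Return the summed rating for one speech after basic validation."""
    if len(ratings) != N_RATERS:
        raise ValueError(f"expected {N_RATERS} ratings, got {len(ratings)}")
    if any(not SCALE_MIN <= r <= SCALE_MAX for r in ratings):
        raise ValueError("each rating must lie on the 1-7 scale")
    return sum(ratings)


# Theoretical bounds of the summed score.
lowest = summed_score([SCALE_MIN] * N_RATERS)
highest = summed_score([SCALE_MAX] * N_RATERS)
print(lowest, highest)
```

Under this assumption the theoretical bounds coincide with the reported 15 to 105 range.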

The 140 speeches were presented in pairs in a randomized order to make it easier for raters to spot differences by comparing the same speaker at two different times. This was done to make ratings more sensitive; it did not introduce a bias, as the raters did not know that they were rating before–after comparisons. To avoid rater fatigue, the questionnaire was divided into several units. The assessment tasks for all presentations took approximately 6 hours in total. Raters received financial compensation of 10 euros per hour. The inter-rater reliability score (ICC) across raters was found to be excellent (0.904; i.e., results are considered reliable, as the score exceeded 0.7) (Koo and Li, 2016).

2.6.2 Prosodic measures

Acoustic-prosodic analysis of all 140 speeches was performed automatically by means of the ProsodyPro script by Xu (2013) and the supplementary analysis script by De Jong and Wempe (2009), both using the PRAAT (gender-specific) default settings (Boersma and Weenink, 2007). The analysis included a total of 20 different prosodic parameters, namely, five f0 parameters, seven duration parameters, and eight voice quality parameters.

The five f0 parameters were f0 minimum and maximum, f0 variability (in terms of the standard deviation), mean f0 and f0 range. A value was determined for each prosodic phrase for all five f0 parameters. Measured values were checked manually for plausibility. Correction of outliers or missing values was performed by taking measurements manually. Additionally, all f0 values were recalculated from Hz to semitones (st) relative to a base value of 100 Hz. The prosodic domain of calculation for those f0 values was the interpausal unit (IPU), which was automatically detected. The criterion for the detection of an IPU boundary was the presence of a silent gap interval ≥300 ms, with silent gap being defined as a drop in intensity >25 dB.
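The Hz-to-semitone recalculation can be illustrated with the standard logarithmic conversion, in which one octave (a doubling of frequency) equals 12 st. This is a sketch assuming that standard formula; the exact implementation inside the analysis scripts is not described in the text, and the function name is hypothetical.

```python
import math

BASE_HZ = 100.0  # reference value stated in the text


def hz_to_semitones(f0_hz, base_hz=BASE_HZ):
    """Convert an f0 value in Hz to semitones relative to base_hz."""
    if f0_hz <= 0 or base_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f0_hz / base_hz)


# The reference itself maps to 0 st; one octave above maps to +12 st.
for hz in (100.0, 150.0, 200.0):
    print(hz, round(hz_to_semitones(hz), 2))
```

Expressing f0 in semitones makes pitch ranges comparable across speakers with different baseline pitch levels.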

The tempo domain consisted of the following seven parameters: total number of syllables, total number of silent pauses (>300 ms, which is above the perceived disfluency threshold in continuous speech, Lövgren and Doorn, 2005), total time of the presentation (including silences), total speaking time (excluding silences), the speech rate (syllables per second including pauses), the net syllable rate (or articulation rate, i.e., syll/s excluding pauses) as well as average syllable duration (ASD). ASD is a parameter that closely correlates with the fluency of speech (Rasipuram et al., 2016; Spring et al., 2019).
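The relations among these tempo measures can be sketched as follows. The input values are hypothetical, and the sketch assumes that ASD is speaking time divided by the number of syllables (that is, the reciprocal of the net syllable rate), which the text does not spell out.

```python
PAUSE_THRESHOLD_S = 0.3  # silent gaps above 300 ms count as pauses


def tempo_measures(n_syllables, total_time_s, silence_durations_s):
    """Derive the tempo measures for one speech from basic counts."""
    # Gaps at or below the threshold stay part of the speaking time.
    pauses = [d for d in silence_durations_s if d > PAUSE_THRESHOLD_S]
    speaking_time_s = total_time_s - sum(pauses)
    if n_syllables <= 0 or speaking_time_s <= 0:
        raise ValueError("need positive syllable count and speaking time")
    return {
        "n_pauses": len(pauses),
        "speaking_time_s": speaking_time_s,
        "speech_rate": n_syllables / total_time_s,            # syll/s incl. pauses
        "articulation_rate": n_syllables / speaking_time_s,   # syll/s excl. pauses
        "asd_s": speaking_time_s / n_syllables,               # assumed definition
    }


# Hypothetical speech: 360 syllables in 100 s with four silent gaps.
example = tempo_measures(360, 100.0, [0.5, 0.2, 1.5, 8.0])
print(example)
```

Because pauses are excluded from speaking time, the net syllable rate is never lower than the speech rate for the same recording.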

The domain of voice quality measurements included the eight parameters that are very frequently used in phonetic research (e.g., for analyzing emotional or expressive speech, see Banse and Scherer, 1996; Liu and Xu, 2014): harmonic-amplitude difference (f0 corrected, i.e., h1*-h2*), cepstral peak prominence (CPP), harmonicity (HNR), h1-A3, spectral center of gravity (CoG), formant dispersion (F1-F3), jitter, and shimmer. Voice quality measurements were based on the prosodic phrase, that is, one value per prosodic phrase was calculated. Also, all values were manually checked and, if necessary, corrected by a trained phonetician who conducted a visual inspection of the measurement tables and marked potential outliers, in particular, implausible values such as “0 Hz” or “600 Hz” for mean f0 and f0 maximum or an F1-F3 formant dispersion of “−1 Hz”, etc. These were corrected by manual re-measurements (or deleted from the dataset).

2.6.3 Manual gesture measures

All manual communicative gestures were annotated by considering the gestural stroke (the most effortful part of the gesture, which usually constitutes its semantic unit; Kendon, 2004; McNeill, 1992; Rohrer et al., 2020). Non-communicative gestures such as self-adaptors (e.g., scratching, touching hair; Ekman and Friesen, 1969) were excluded. Gesture rate was calculated per speech as the total number of gestures produced relative to the phonation time in minutes (gestures/phonation time).
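As a small illustration of the gesture-rate definition, the sketch below divides a gesture count by phonation time expressed in minutes. The counts are hypothetical and the function name is illustrative.

```python
def gesture_rate_per_minute(n_gestures, phonation_time_s):
    """Gestures per minute of phonation (silences already excluded)."""
    if phonation_time_s <= 0:
        raise ValueError("phonation time must be positive")
    return n_gestures / (phonation_time_s / 60.0)


# Hypothetical speech: 40 communicative gestures over 75 s of phonation.
example_rate = gesture_rate_per_minute(40, 75.0)
print(example_rate)
```

Normalizing by phonation time rather than total time keeps the measure from being deflated by long silent pauses.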

2.7 Statistical analyses

Statistical analyses were performed using IBM SPSS Statistics 19. A number of generalized linear mixed models (GLMMs) were run for the following dependent variables: self-perceived anxiety (SUDS), persuasiveness, charisma, gesture rate, and the set of 20 prosodic parameters (5 for f0, 7 for duration and 8 for voice quality). All the GLMMs included Condition (two levels: Gesture Activated VR group and Non-Gesture Activated VR group) and Time (two levels: pre-training; post-training) and their interactions as fixed factors. Subject was set as a random factor. Pairwise comparisons and post hoc tests were carried out for the significant main effects and interactions.
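Descriptively, the Condition × Time interaction in such a 2 × 2 design corresponds to a difference of differences between cell means. The sketch below illustrates that idea with hypothetical scores. It is not the GLMM itself, which was fitted in SPSS and additionally models Subject as a random factor; all names and values here are assumptions for illustration.

```python
from statistics import mean

# Hypothetical scores: scores[condition][time] -> per-participant values.
scores = {
    "gesture": {"pre": [50, 54, 58], "post": [58, 60, 65]},
    "non_gesture": {"pre": [51, 55, 56], "post": [53, 57, 58]},
}


def pre_post_change(condition):
    """Mean post-training score minus mean pre-training score."""
    cell = scores[condition]
    return mean(cell["post"]) - mean(cell["pre"])


# Difference of differences: how much more one group changed than the other.
interaction_contrast = pre_post_change("gesture") - pre_post_change("non_gesture")
print(pre_post_change("gesture"), pre_post_change("non_gesture"), interaction_contrast)
```

A contrast near zero would mean both groups changed by a similar amount, which is the pattern a non-significant interaction reflects.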

2.8 Ethical approval

This study was approved by the Universitat Pompeu Fabra’s Ethical Review Board for Research Projects (Comissió Institucional de Revisió Ètica de Projects CIREP-UPF) and also received approval from Recercaixa Project [2017 ACUP 00249]. Prior written informed consent was obtained from each participant and/or their parents or legal guardians, as appropriate.

3 Results

3.1 Self-assessed anxiety

The GLMM analysis for SUDS showed a main effect of Condition (F (1,140) = 4.805, p = .030), which indicated that in general (at both pre- and post-training) Non-Gesture Activated VR group values were higher than Gesture Activated VR group values (β = 10.071, SE = 4.595, p = .030), and a main effect of Time (F (1,140) = 41.889, p < .001), showing that SUDS values were lower at post-training regardless of the condition (β = 12.381, SE = 1.913, p < .001). Also, a significant interaction between Condition and Time was obtained (F (1,140) = 4.474, p = .036). Post-hoc analyses revealed a significant difference between the two groups at post-training, showing a lower SUDS score for the Gesture Activated VR condition compared to the Non-Gesture Activated VR condition (β = 16.429, SE = 2.470, p < .001, g = 0.66). From pre- to post-training the Non-Gesture Activated VR condition significantly decreased their values (β = 8.333, SE = 2.922, p = .005, g = 0.47), and so did the Gesture Activated VR condition (β = 16.429, SE = 2.470, p < .001, g = 0.74). The graph in Figure 2 shows the mean SUDS scores separated by Condition (Gesture Activated VR group and Non-Gesture Activated VR group) and Time (pre-training and post-training). Table 1 displays the descriptive statistics for SUDS.
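The effect sizes above are reported as g, presumably Hedges' g. The text does not say how g was computed (in particular for the within-group pre-post comparisons), so the following is only a sketch of the standard two-independent-groups formula with a small-sample correction, applied to hypothetical data.

```python
import math
from statistics import mean, stdev


def hedges_g(group_a, group_b):
    """Standard Hedges' g for two independent samples (assumed formula)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled standard deviation from the two sample variances.
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / df
    d = (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
    # Small-sample correction factor J = 1 - 3 / (4 * df - 1).
    return d * (1.0 - 3.0 / (4.0 * df - 1.0))


# Hypothetical scores for two small groups.
g_example = hedges_g([2, 4, 6], [1, 3, 5])
print(round(g_example, 3))
```

The correction factor shrinks the uncorrected standardized difference slightly, and the shrinkage becomes negligible as group sizes grow.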

FIGURE 2. Mean SUDS values at pre- and post-training for both Non-Gesture Activated and Gesture Activated VR conditions.

Table 1. Descriptive statistics for SUDS in each of the two conditions.

3.2 Perceived persuasiveness and charisma

The GLMM analysis for persuasiveness showed a near-significant main effect of Condition (F (1,112) = 3.778, p = .054, g = ), which indicated that Non-Gesture Activated VR group values showed a tendency to be lower than Gesture Activated VR values (β = 7.281, SE = 3.746, p = .054), and a main effect of Time (F (1,112) = 24.552, p < .001), showing that persuasiveness values were higher at post-training independently of the condition (β = 4.588, SE = .909, p < .001). Also, a significant interaction between Condition and Time was obtained (F (1,112) = 4.560, p = .035). Post-hoc analyses revealed a significant difference between the two groups at post-training (β = 9.256, SE = 3.719, p = .014, g = 0.67), showing higher persuasiveness scores for the Gesture Activated VR condition. From pre- to post-training the Non-Gesture Activated VR condition significantly increased their values (β = 2.607, SE = 1.253, p = .04, g = 0.17), and so did the Gesture Activated VR condition (β = 6.500, SE = 1.211, p < .001, g = 0.44). The graph in Figure 3 shows the mean persuasiveness scores separated by Condition (Gesture Activated VR group and Non-Gesture Activated VR group) and Time (pre-training and post-training). Table 2 displays the descriptive statistics for persuasiveness.

FIGURE 3. Mean Persuasiveness values at pre- and post-training for both Non-Gesture Activated and Gesture Activated VR conditions.

Table 2. Descriptive statistics for Persuasiveness in each of the two conditions.

Regarding charisma, the GLMM analysis showed a main effect of Time (F (1,112) = 13.109, p < .001), which indicated that pre-training scores were lower for both conditions (β = 2.945, SE = .813, p < .001). The analysis also showed a significant interaction between Time and Condition (F (1,112) = 5.717, p = .018). Post-hoc analyses revealed a significant difference between the two groups at post-training (β = 9.664, SE = 3.813, p = .013, g = 0.67). The charisma scores of the Gesture Activated VR group were significantly higher at post-training than at pre-training (β = 4.889, SE = 1.139, p < .001, g = 0.33); by contrast, the charisma scores for the Non-Gesture Activated VR condition did not significantly differ from pre- to post-training. The graph in Figure 4 shows the mean charisma scores separated by Condition (Gesture Activated VR group and Non-Gesture Activated VR group) and Time (pre-training and post-training). Table 3 displays the descriptive statistics for charisma.

FIGURE 4. Mean Charisma values at pre- and post-training for both Non-Gesture Activated and Gesture Activated VR conditions.

Table 3. Descriptive statistics for Charisma in each of the two conditions.

3.3 Prosodic parameters

3.3.1 F0

Regarding the f0 domain, five GLMMs were applied to our target variables, namely, minimum and maximum f0, f0 variability (in terms of the standard deviation), mean f0 and f0 range. Table 4 shows the results of those GLMM analyses in terms of main effects (Time and Condition), as well as interactions between Time and Condition. Summarizing, a main effect of Time was obtained only for f0 maximum, meaning that the post-training values in both groups were lower than the pre-training values. A main effect of Condition was only obtained for f0 mean, meaning that the participants in the Gesture Activated VR group produced lower f0 values across both pre- and post-training phases. No significant interactions were obtained for any of the variables.

Table 4. Summary of the GLMM analyses for the 5 f0 variables, in terms of main effects and interactions.

3.3.2 Tempo

Regarding tempo, a set of seven GLMMs were applied to our target variables, namely, total number of syllables, total number of silent pauses, total time of the presentation, total speaking time, the speech rate, the net syllable rate and ASD. Table 5 shows the results of those GLMM analyses in terms of main effects (Time and Condition), as well as interactions between Time and Condition. Summarizing, a main effect of Time was obtained only for number of syllables. A main effect of Condition was obtained for four variables, namely, number of silent pauses, speech rate, net syllable rate and ASD, meaning that the participants in the Gesture Activated VR group had lower speech-rate and net-syllable-rate (or articulation-rate) values, as well as higher ASD values. No significant interactions emerged for this domain either.

Table 5. Summary of the GLMM analyses for the seven duration variables, in terms of main effects and interactions.

3.3.3 Voice quality

In the domain of voice quality measurements, a set of eight GLMMs were applied to our target variables, namely, h1*-h2*, h1-A3, CPP, Harmonicity, CoG, formant dispersion 1–3, shimmer, and jitter. Table 6 shows the results of those GLMM analyses in terms of main effects (Time and Condition), as well as interactions between Time and Condition. Summarizing, a main effect of Time was obtained for six variables, namely, h1-A3, CPP, CoG, formant dispersion 1-3, shimmer, and harmonicity, meaning that pre-training values were lower across groups for all the variables except for CoG and shimmer. A main effect of Condition was obtained for four variables, namely, h1*-h2*, h1-A3, shimmer, and jitter, meaning that the participants in the Gesture Activated VR group produced lower values compared to the Non-Gesture Activated VR group, both at pre- and post-training. No significant interactions were found for any of the variables.

Table 6. Summary of the GLMM analyses for the 8 voice variables, in terms of main effects and interactions.

3.4 Manual gesture rate

To assess whether the additional embodiment instruction given to the participants of the Gesture Activated VR group was effective, we counted the number of manual gestures performed by participants in both conditions during their pre-training speech, as well as in the first and third VR-assisted training sessions. As noted above, gesture rate was calculated as the total number of hand gestures produced relative to the phonation time in minutes. The results showed that the mean gesture rate at pre-training was 42.27 gestures per minute for the Non-Gesture Activated VR group and 32.89 gestures per minute for the Gesture Activated VR group. For training sessions 1 and 3, the mean gesture rates were 28.53 gestures per minute for the Non-Gesture Activated VR group and 25.27 per minute for the Gesture Activated VR group. Crucially, the difference from pre-training to training session 1 was a reduction of 13.74 for the Non-Gesture Activated VR group compared with a reduction of only 7.62 for the Gesture Activated VR group. These results clearly indicate that Gesture Activated VR participants maintained their gesture rate when they underwent the training sessions, their relative use of manual gestures being higher than that of the Non-Gesture Activated VR participants.
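The reductions quoted above can be rechecked directly from the reported group means. The relative reduction computed at the end is an added illustration and not a figure reported in the study; the dictionary keys are illustrative names.

```python
# Mean gesture rates (gestures per minute) as reported in the text.
pre_training = {"non_gesture": 42.27, "gesture": 32.89}
vr_training = {"non_gesture": 28.53, "gesture": 25.27}

reductions = {}
for group, pre_rate in pre_training.items():
    absolute = round(pre_rate - vr_training[group], 2)
    reductions[group] = {
        "absolute": absolute,              # drop in gestures per minute
        "relative": absolute / pre_rate,   # drop as a share of the pre-training rate
    }

for group, values in reductions.items():
    print(group, values["absolute"], f"{values['relative']:.1%}")
```

Both the absolute and the relative drop are smaller for the Gesture Activated VR group, although both groups gestured less in the VR sessions than at pre-training.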

A GLMM was applied to these data. A main effect of Time was obtained (F (1,114) = 4.276, p = .041), meaning that at post-training values were higher across groups (β = 2.895, SE = 1.400, p = .041). A main effect of Condition was also obtained (F (1,114) = 10.144, p = .002), meaning that Gesture Activated VR scores were higher across both pre- and post-training phases (β = 11.229, SE = 3.167, p = .001).

4 Discussion and conclusion

The central aim of the study was to investigate whether explicitly instructing secondary students to use gesture during a three-session VR-assisted public speaking training program would help reduce their levels of PSA and, in addition, enhance the quality of their performance in front of a small live audience after training. Therefore, a between-subjects experiment with a pre-training speech, three training sessions, and a post-training speech was designed so that we could compare pre- to post-training speeches between a group of students instructed to embody their speeches while speaking to the VR audience and a group who received no such instruction. One of the key features of the study was that it included a comprehensive assessment of the students’ public speaking performance before and after their VR-assisted training sessions. Specifically, the study assessed whether presenters giving their post-training speech reported lower levels of anxiety and displayed higher levels of persuasiveness and charisma, and/or produced a more audience-oriented speech from the point of view of prosodic and gestural features. In order to make the VR technology accessible to everyone, the study utilized a cost-effective setup consisting of cardboard glasses attached to a phone. This allowed us to recommend the application to students and instructors who showed interest in practicing public speaking at home or at school after the completion of the experiment.

In relation to the effects on anxiety, our results showed a significant reduction in the degree of anxiety in both Non-Gesture Activated VR and Gesture Activated VR conditions. First, these results support previous VR training studies that reported a reduction in the self-assessed PSA levels of participants in clinical (e.g., Lister, 2016; Lindner et al., 2018; Yuen et al., 2019; Zacarin et al., 2019) and educational settings (e.g., Heuett and Heuett, 2011; Kahlon et al., 2019). Second, a key finding of the study is that the embodiment prompt during the VR training sessions triggered a significantly stronger effect in the reduction of self-perceived anxiety among participants in this condition as compared with the participants in the Non-Gesture Activated VR condition. These results expand previous findings on the positive effects that physical activity has on mental health (i.e., wellbeing and self-concept, as reported in McDonald and Hodgdon, 1991; Fox, 2000) and cognitive functioning (see Donnelly et al., 2016 for a review), as well as on the reduction of anxiety (e.g., Korczak et al., 2017).

Focusing now on the effects of embodiment on persuasiveness and charisma, a key finding of the present study is that the participants in the Gesture Activated VR condition increased their persuasiveness and charisma ratings from pre- to post-training to a greater extent than the participants in the Non-Gesture Activated VR condition. Perceptual ratings of persuasiveness and charisma were used, as has been done in previous studies analyzing speakers’ persuasiveness or charisma (e.g., Maricchiolo et al., 2010; Jackob et al., 2011; Berger et al., 2017), and a very high level of inter-rater reliability was confirmed.

The present results seem to be connected to recent findings from research showing that the activation of the body and gesturing while performing speaking tasks has direct consequences on speakers’ cognitive processes because it helps speakers to reduce the amount of cognitive resources they need to formulate speech (Wagner et al., 2004), enhances their problem-solving abilities (Thomas and Lleras, 2009), and improves their ability to retain memories of things they have just learned (e.g., Alibali and Goldin-Meadow, 1993). Along these lines, we contend that our results constitute further evidence in support of the embodied cognition paradigm as a successful way to encourage learning through the activation of the body. As studies from numerous fields in neuroscience, linguistics, and cognitive science have claimed, “the highest percentage of human cognitive ability is based on bodily capabilities to produce knowledge” (Kosmas, 2019: 3) (see also Wilson, 2002; Gallese and Lakoff, 2005). We can speculate that by reminding participants to use their bodies to enhance their expressiveness, the speeches produced by the Gesture Activated VR group may have been enriched by this awareness of the body as a tool for the construction of effective discourse (Kalantzis and Cope, 2004). Moreover, this body activation may have favored a stronger feeling of self-confidence that was key to rater perceptions that they were more charismatic speakers and their messages more persuasive (McDonald and Hodgdon, 1991; Fox, 2000).

Another important factor that might explain the positive results obtained by Gesture Activated VR participants is the relationship between body movement and the greater sense of presence they perhaps experienced in the simulated VR environment. Following up on previous results (e.g., Slater et al., 1995; Slater et al., 1998), the fact that participants in the Gesture Activated VR condition received the instruction to use their body to increase their expressiveness could have enhanced their sense of presence and the VR experience could have been more immersive to them than to participants in the other condition (Bianchi-Berthouze, 2013). Encouraging participants to use their bodies could have triggered a more realistic and vivid VR experience, and this sense of enhanced presence was then transferred to the post-training live audience context, since crucially speakers in this group were perceived as more persuasive and charismatic. Although the study did not include any measure of presence, in our view it would be interesting to include this measure in future studies in order to analyze its relationship with gesture use and embodiment measures.

Regarding the effects of the Gesture Activated VR condition on prosodic parameters, no significant interactions were obtained for the f0, tempo, or voice quality parameters, meaning that the addition of an embodiment instruction while employing VR did not lead to any differences in these prosodic parameters in the pre- and post-training speeches. These results contradict our expectations, given the reported relation between the prosodic features of speeches and their persuasiveness (e.g., Kelly and Goldsmith, 2004; Maricchiolo et al., 2009; Jackob et al., 2011; Yokoyama and Daibo, 2012; Peters and Hoetjes, 2017). Nevertheless, a possible explanation for the lack of significant changes in the prosodic parameters in post-training speeches is that already in the pre-training session the Gesture Activated VR group showed significant differences in the majority of the prosodic parameters compared to the Non-Gesture Activated VR group. These differences suggest that the Gesture Activated VR group had a higher level of audience-orientation right from the start and kept that high level also after training. That is, the Gesture Activated VR group was already performing well while the Non-Gesture Activated VR group was not able to improve further, which, in combination, prevented interaction effects from emerging. With regard to the five f0 values, no significant melodic changes were observed between the pre- and post-training speeches across groups. Even though no significant interactions were found, the Gesture Activated VR group showed a general tendency to produce a less thin and breathy but more harmonious and sonorous voice, key attributes of speech perceived as charismatic. The Gesture Activated VR group also used fewer pauses and a reduced net syllable rate, which is consistent with the listener-oriented speaking style that makes speeches more persuasive and is likely to signal greater credibility (Jackob et al., 2011).

With regard to gesture rate, no significant differences were found across conditions between the pre- and post-training speeches. We expected to observe a significantly higher rate of gesturing in the Gesture Activated VR group because of the explicit instruction they had received in that regard. Though there was a higher relative increase in gesture rate at post-training for the Gesture Activated VR group, the difference between the two conditions was not significant. Therefore, our hypothesis regarding a higher gesture rate for the Gesture Activated VR condition was not supported. Interestingly, however, the fact that the embodiment instruction did not cause participants to perform significantly more gestures at post-training is consistent with the results of previous studies showing that the most effective and credible speaking style is characterized not by a very extensive use of gesture but rather by a moderate one (e.g., Dargue et al., 2019; Rodero, 2022; Rodero et al., 2022).

Taken together, the results of our prosodic and gesture analyses of the student-produced speeches revealed no significant differences in prosodic or gesture parameters across groups. This is somewhat surprising given that significant gains were obtained in perceived persuasiveness and charisma in the embodied condition. We expected a more charismatic style and an increase in discourse persuasiveness to be reflected in the use of specific prosodic and gestural parameters. Gesture rate, then, might not be a suitable measure of a speaker’s overall multimodal behavior, which also involves gesture amplitude and timing in co-creating communicative meanings together with prosody, as well as a bundle of features such as eye gaze patterns, facial expressions, and body posture (Signorello et al., 2012). This suggests that further and more detailed analyses of multimodal behavior would be needed for these data.

In summary, we can conclude that explicitly instructing students to use gestures when they are practicing public speaking in a VR-assisted environment has the potential to boost some performance parameters after the intervention, when the students are asked to speak before a live audience. Specifically, it can help make the students less anxious, as well as more charismatic and persuasive. Our results have important educational implications. First, they confirm the value of applying VR technology in the classroom to enable students to practice their oral skills, in the process increasing their self-confidence and awareness of their oral communicative strengths (e.g., Van Ginkel et al., 2019), thereby leading to more charismatic delivery (Niebuhr and Michalsky, 2018; Niebuhr and Tegtmeier, 2019). Second, they show that adding embodiment instructions as a complementary technique can augment the positive effects of VR-assisted training on subsequent public speaking tasks. In general, our results confirm and expand previous results on the positive value of embodied learning approaches in language education: not only can embodied learning add emotional and motivational value to language learning contexts by virtue of the fact that physical activities make classroom learning more enjoyable (Hanks and Eckstein, 2019; Kosmas and Zaphiris, 2019; see Jusslin et al., 2022 for a review), but it can also heighten student interest, overall wellbeing, and self-confidence (Cannon, 2017; Hanks and Eckstein, 2019; Mathias and von Kriegstein, 2022).

Several limitations must be considered. First of all, the study was conducted with a sample of 17-year-old students, so the results cannot be safely generalized to other age groups, as PSA could vary with age. In addition, our two groups of participants were not controlled for gender, and it would have been interesting to assess possible gender differences in the outcomes obtained.

Second, participants could not see their hands, either real or virtual, as they performed their speeches, which may have inhibited or otherwise distorted their embodiment behavior. Being able to see virtual hands and/or a full virtual body would contribute to the sense of presence experienced by participants, something that we did not measure here. The sense of ownership of a virtual body that users can experience when seeing their virtual bodies in VR environments could not arise in this study, as the VR application used did not offer this feature. Future research could implement the gesture-encouraging condition with a VR application that includes this feature.

Third, though anxiety levels were measured, the instrument used depended on self-reporting. Although SUDS has been widely used in public speaking studies and represents a validated overall measure of emotional distress (e.g., Tanner, 2012), adding objective measures such as electrophysiological data would allow us to obtain a more fine-grained picture of participant anxiety levels and compare them with other measures. Also, our analyses of persuasiveness and charisma would have been more comprehensive had they included an assessment of the cogency of the arguments deployed by speakers. And as we have noted, considerable work needs to be done to clarify the relationship between persuasiveness and charisma on the one hand and prosodic and gestural features on the other.

Finally, future longitudinal studies could be carried out in which public speaking practice before VR-simulated audiences takes place over more or longer sessions, possibly in combination with various feedback strategies.

In conclusion, the results of the present investigation offer further hints on how VR-simulated environments can be most effectively used by secondary students to sharpen their public speaking skills. Specifically, they show that the addition of a brief embodiment instruction suggesting that speakers combine their oral performance with the use of gestures not only seems to make for a more vivid VR experience but possibly also leads to reduced anxiety and concomitant gains in public speaking performance. These results have important academic implications, suggesting as they do that VR technology can be profitably employed as a complementary and powerfully engaging tool for the teaching of oral communication at the secondary school level.

Data availability statement

The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by the Universitat Pompeu Fabra’s Ethical Review Board for Research Projects (Comissió Institucional de Revisió Ètica de Projectes, CIREP-UPF). The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants’ legal guardians/next of kin.

Author contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Funding

This work benefited from funding awarded by the Spanish Ministry of Economy and Competitiveness (PGC2018-097007-B-I00 and PID2021-123823NB-I00) and the Generalitat de Catalunya (2017 SGR_971). We also acknowledge support from the Recercaixa Project (RecerCaixa 2017ACUP 00249) and the Department of Translation at Universitat Pompeu Fabra through a 1-year doctoral grant to the first author.

Acknowledgments

We are deeply indebted to the student participants at the four Barcelona schools (Institut Fort Pius, Institut Quatre Cantons, Institut Vila de Gràcia and Institut Icària) for believing in the experiment and being so enthusiastic. We also thank the school board and teachers for being so supportive of the project. We are much obliged to Florence Baills, Mariia Pronina, and Patrick Rohrer (members of the GrEP group) for their help during data collection and to Júlia Florit-Pons, Yuan Zhang, and Xiaotong Xi for their help with the statistical analysis. Thanks are likewise due to Gemma Balaguer Fort, Elisenda Bernal, Gemma Boleda, and Emma Rodero for contributing to our research as members of the MA thesis committee and PhD research plan jury, whose questions proved invaluable. Finally, a special thanks to the 15 raters, who kindly took 6 hours of their time to assess the persuasiveness and charisma of all 140 student-produced speeches.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frvir.2023.1074062/full#supplementary-material

Footnotes

¹Fundamental frequency (f0) refers to the rate at which the vocal cords vibrate during speech or singing. It is commonly measured in hertz (Hz) or cycles per second (cps). The f0 of an individual is primarily determined by the length of their vocal cords, which is correlated with their overall body size. Typically, f0 values range from 80 to 450 Hz, with males generally having lower voices than females and children (Bäckström et al., 2003). The phonational range of an individual, which is the range of frequencies they can produce, tends to decrease with age.

²According to statistics published annually by the municipal government of Barcelona, retrieved 15 October 2022 from: https://ajuntament.barcelona.cat/estadistica/catala/Anuaris/Anuaris/anuari19/cap06/C0616010.htm.

References

Addington, D. W. (1971). The effect of vocal variations on ratings of source credibility. Speech Monogr. 38, 242–247. doi:10.1080/03637757109375716