L2 Vocabulary Teaching by Social Robots: The Role of Gestures and On-Screen Cues as Scaffolds

Social robots are receiving an ever-increasing interest in popular media and scientific literature. Yet, empirical evaluation of the educational use of social robots remains limited. In the current paper, we focus on how different scaffolds (co-speech hand gestures vs. visual cues presented on the screen) influence the effectiveness of a robot second language (L2) tutor. In two studies, Turkish-speaking 5-year-olds (n = 72) learned English measurement terms (e.g., big, wide) either from a robot or a human tutor. We asked whether (1) the robot tutor can be as effective as the human tutor when they follow the same protocol, (2) the scaffolds differ in how they support L2 vocabulary learning, and (3) the types of hand gestures affect the effectiveness of teaching. In all conditions, children learned new L2 words equally successfully from the robot tutor and the human tutor. However, the tutors were more effective when teaching was supported by the on-screen cues that directed children's attention to the referents of target words, compared to when the tutor performed co-speech hand gestures representing the target words (i.e., iconic gestures) or pointing at the referents (i.e., deictic gestures). The types of gestures did not significantly influence learning. These findings support the potential of social robots as a supplementary tool to help young children learn language but suggest that the specifics of implementation need to be carefully considered to maximize learning gains. Broader theoretical and practical issues regarding the use of educational robots are also discussed.


INTRODUCTION
Educational technologies are becoming commonplace in schools and homes across the world. While most attention has been given to screen technologies, such as tablets, and apps used with them (Herodotou, 2018; Papadakis et al., 2019), other digital devices such as robots are also becoming available for common use, either independently or with screen technology. According to a 2020 report, the educational robot market is expected to grow 16% over the next 5 years across the world (Mordor Intelligence, 2020). The global trends are also reflected in academic research. By May 2017, 101 empirical papers (with 309 study results) concerning educational robots had been published from different parts of the world such as North America, East Asia, Europe, and the Middle East, and 58% of these studies tested children (Belpaeme et al., 2018a). Importantly, however, most studies thus far have focused on the affective components of the learning experience (e.g., whether the learner enjoyed the lesson or not) without evaluating learning gains, and many also have small sample sizes and lack a control group. This study aims to address these limitations and to gain insights into specific ways to maximize educational robots' benefits for young learners.
Here, we focus on second language (L2) teaching because fostering L2 skills is critical for the academic and social success of children in the increasingly globalized world, and because, as described in the next section, it has been suggested that the unique characteristics of social robots may be particularly suited for language teaching. In addition, we aim to gain a better picture of how social robots should be used in language education by examining the role of different scaffolds: hand gestures performed by the robot tutor and visual cues presented on the screen accompanying the robot.

Social Robot Tutors in Language Teaching
Social robots are autonomous or semi-autonomous robots that interact and communicate with humans while following the behavioral norms expected by the people with whom the robot is intended to interact (e.g., Bartneck and Forlizzi, 2004). A growing body of literature highlights the potential of social robots in education (e.g., Mubin et al., 2013; Belpaeme et al., 2018a), and more specifically, in teaching first (L1) and second language (L2) to typically as well as atypically developing children (e.g., Kanero et al., 2018; van den Berghe et al., 2019; Oranç et al., 2020). Kanero et al. (2018) list the adaptability and the ability to perform actions and gestures as the notable strengths of social robots to support teachers in educational settings. First, social robots can use their sensors to detect learners' motivational and educational needs and adapt their behavior accordingly. Therefore, robot tutors can provide individualized training on children's own time and offer opportunities for learning that might exceed what the teacher can offer in a given day. Second, with its physical body, a social robot can perform various gestures, which are known to facilitate language learning (e.g., Tellier, 2008; Macedonia et al., 2011; Wakefield et al., 2018). Some also suggest that not only the ability to perform gestures, but the physical presence per se might contribute to learning. For example, surveying 33 experimental works, Li (2015) identified the general pattern in which robots are more persuasive and perceived more positively when they are physically with the user than when the robot or another character is presented on the screen. A few studies found that children prefer physically-present robots over an on-screen avatar (Leite et al., 2008; Kose-Bagci et al., 2009; Jost et al., 2012; Looije et al., 2012), though whether the physical presence can affect language learning is unknown (but see Kennedy et al., 2015).
Finally, teachers themselves also deem social robots as valid support in their classrooms (Fridin and Belokopytov, 2014;Serholt et al., 2014).
Despite social robots' unique characteristics and ever-increasing public interest in them, there have not been many carefully controlled experiments that examined the potential benefits of social robots in education (Belpaeme et al., 2018b; Kanero et al., 2018). More specifically for our purposes, the empirical findings on the effectiveness of robot tutors in language teaching to typically-developing children are mixed (e.g., Kanda et al., 2004; Moriguchi et al., 2011; Tanaka and Matsuzoe, 2012; Mazzoni and Benvenuti, 2015). Especially regarding vocabulary teaching, robots are found to be merely as effective as, or even less effective than, human tutors and other digital devices (e.g., Hyun et al., 2008; Moriguchi et al., 2011; see Vogt et al., 2019 for a large-scale study). The effectiveness of a robot tutor might vary depending on the alignment between multiple factors in the learning environment, such as the type of learning task (Tazhigaliyeva et al., 2016) and the level of social support (Saerbeck et al., 2010). Based on this idea, we theorized that other scaffolds available in the teaching environment might interact with the effectiveness of robot-led language lessons.

Aligning Affordances of Scaffolds With the Learning Task: Role of Gestures and On-Screen Cues
According to recent instructional approaches, focusing on the effectiveness of technology for supporting learning on its own, i.e., simply testing whether technology is effective or not, provides a limited view; instead, the facilitative role of any technology might vary depending on the specific topics or conditions. A key question in educational designs is to what extent a particular scaffold aligns with the type of representation that enables a particular learner to successfully complete a specific task (Gibson, 1979). Thus, the focus is not just on the learner, the task, or the technology but the nexus of all three. While prior work on educational technology in general, and social robots in particular, focused on different features of the technology, here we evaluate the effectiveness of social robots in a wider context and examine how the role of social robots varies as a function of the scaffolds available in the learning environment.
Digital learning environments differ not only in the specific instructional technology they leverage but also on several other dimensions. In a typical L2 tutoring context, the teacher provides auditory information, i.e., the L2 word, and the instruction is supported by a visual component, i.e., a picture of the object to be labeled. To gather students' attention, teachers typically provide additional visual scaffolds, also referred to as focusing or signaling cues (Jones and Plass, 2002; Marulis and Neuman, 2010). Here we examine two types of scaffolds that are frequently used and have been effective in teaching vocabulary: static attentional cues provided on a screen, also referred to as on-screen cues (Höffler and Leutner, 2007), and dynamic co-speech hand gestures (Wakefield et al., 2018).
The first set of scaffolds we consider are static visual attentional cues (i.e., focusing/signaling cues), which consist of cues such as arrows or highlighters (Lowe and Boucheix, 2011). Prior research showed that visual cues make instruction lead to more robust learning by focusing the learner's attention on relevant information (e.g., De Koning et al., 2009). Robot tutors are typically used with other scaffolds such as touchscreen devices that can provide additional signaling cues to direct children's attentional focus. The literature on the effectiveness of on-screen cues has been mixed, and prior work primarily focused on adults (e.g., Tabbers et al., 2004; De Koning et al., 2009). Little is known about how on-screen cues influence children's learning, and how they compare to other cues. Almost nothing is known about the integration of on-screen cues in social robots' teaching although most studies implicitly combine robots with screens.
The second set of scaffolds includes dynamic co-speech gestural cues. Speakers of all ages and backgrounds move their hands as they speak. These co-speech gestures come in different types including pointing, iconic, beat gestures, and emblems (McNeill, 1992), with pointing and iconic gestures being most frequently used in teaching contexts. Pointing gestures, also known as deictic gestures, are gestures that indicate an object, entity, or location through the extension of the index finger or whole hand. Iconic gestures are gestures that depict an action or shape of an object, e.g., drawing a circle in the air to describe round or opening hands wide to describe big. Teachers use gestures extensively in L2 teaching (Kusanagi, 2015), and gestures facilitate language learning in both first language (L1) and L2 (Goldin-Meadow and Wagner, 2005). Instruction that contains gestures typically promotes better learning compared to instruction that does not (e.g., Valenzeno et al., 2003, but see Singer and Hostetter, 2011; Congdon et al., 2018). When L2 instruction is accompanied by gestures, adults and children learn and retain novel nouns in L2 better compared to no gesture or meaningless gestures (Tellier, 2008; Macedonia et al., 2011). Further, the facilitating role of gesture is greater for children than adults (Hostetter, 2011).
Although a strength of social robots over other technological tools is their ability to gesture (Kanero et al., 2018), empirical evidence on the effectiveness of robot tutors' gestures remains scarce and mixed: some studies report that the use of gestures by robots might enhance learning (Conti et al., 2017), while others report that gesture might detrimentally influence performance (Yadollahi et al., 2018) or have no effect (Vogt et al., 2019). For example, one study found that 5- to 6-year-old children recalled stories more accurately when stories were narrated by an animated robot (that used gestures, eye gaze, and expressive intonation) than by an inexpressive human teacher (Conti et al., 2017). While this study was one of the first steps in understanding the role of social robots' gestures, the two conditions differed not only on gesture use but also on voice tone and eye gaze. In contrast, another study reported that a social robot's pointing gestures distracted children from comprehending the text, especially those with lower reading proficiency (Yadollahi et al., 2018). A recent study by Vogt and colleagues, which did not find an additional benefit of gestures over a touchscreen tablet, focused on iconic gestures only (Vogt et al., 2019). Overall, existing studies present mixed findings on the effects of gestures in children's learning with robots, and each has focused on a single type of gesture. Thus, how different gesture types might influence a robot tutor's teaching effectiveness is another gap we aim to fill.
Overall, research on the effectiveness of robot tutors and how it could be influenced by different additional scaffolds remains scarce. To our knowledge, no prior study compared the role of different scaffolds in supporting robot tutors' teaching effectiveness.

Current Study
The current study asks whether (1) children can effectively learn new L2 words from a robot tutor, (2) scaffolds differ in how they support robot teaching (by comparing co-speech hand gestures and on-screen attentional cues), and (3) the type of gesture affects the effectiveness of L2 vocabulary teaching. To answer our main research questions, Study 1 tested a robot tutor. In Study 2, we tested the same questions with a human tutor to examine whether the pattern of results is unique to a robot tutor or would generalize across tutors. In both studies, 5-year-old Turkish-speaking children were taught English measurement adjectives such as big and high. We used measurement adjectives because these are typically covered in school curricula and are central for early STEM education (Bishop, 1988). Further, although the majority of the work on the role of gestures in word learning focused on nouns and verbs (Wakefield et al., 2018; Aussems and Kita, 2019), prior work also established the benefit of using gestures for adjectives (O'Neill et al., 2002). We chose 5-year-old children as participants because they would be familiar with the measurement terms in their native language and because previous studies suggest that younger children may struggle to be engaged in a lesson with a social robot (Moriguchi et al., 2011; Baxter et al., 2017). The gesture condition tested deictic gestures pointing to the picture on a computer screen representing the word to be learned, and iconic gestures representing the meaning of the word. In the on-screen cue condition, a red rectangle was presented around the referent picture on the computer screen. Based on the prior literature, we hypothesized that children would effectively learn new words from a robot tutor. Given the small and mixed literature on scaffolds, however, we formed multiple predictions regarding the effect of scaffolds. On the one hand, given prior work on gestures in teaching, we may observe better learning in the gesture condition compared to the on-screen cue condition. Alternatively, if gestures simply serve as attentional cues, there would be no difference between gesture types, or between the gesture and on-screen cue conditions.

STUDY 1: ROBOT TUTOR
Our social robot tutor was NAO, a 54-cm-tall humanoid robot by Softbank Robotics (see Figure 1). NAO has a child-friendly appearance and abilities to perform hand gestures and has been used successfully in many human-robot interaction studies involving young children (e.g., Belpaeme et al., 2018b).

Participants
Thirty-eight children participated in Study 1 with the Robot Tutor (M age = 69.85 months, SD = 4.18, 21 females). All but two children were tested in a quiet room at their own preschool, located in a large city in Turkey (Istanbul or Bursa); the remaining two children were tested in the lab in Istanbul. All participants were free of vision or hearing impairments. The initial sample consisted of 42 children, but one child was excluded because they knew four of the eight English words to be taught and three children quit the study before it ended. A combination of convenience and snowball sampling techniques was used. Initially, multiple preschools that did not offer extensive English education were contacted. When a school expressed interest in the study, the consent forms were sent to families via the teacher or principal. For all children included in the study, parents provided consent, and the children themselves provided verbal assent at the beginning of the study. All procedures were approved by the Koç University Committee on Human Research.

Stimuli
Children learned four pairs of English measurement adjectives: small and big, wide and narrow, high and low, and tall and short. We first selected six pairs of words that were listed as measurement adjectives in the kindergarten math curricula in the Common Core in the US. The selected words were also balanced in terms of word frequency, familiarity, and imageability (see Coltheart, 1981; Zeno et al., 1995). In selecting the final set of words, we first identified gestures that would typically be used to describe the adjectives (see Table 1 for descriptions of iconic gestures produced for each adjective and its associated object). We then selected a subsample of these gestures that could be performed by NAO, and videotaped NAO performing the gestures. Twenty-seven adults (M age = 33.19 years; SD = 6.50; 10 females) were asked to watch the videos of NAO gesturing, and rated how well the gesture represented the corresponding word. Word-gesture pairs that received an average rating of below two were excluded (thick and thin, near and far). The gestures for the remaining words were on average rated as 3.1 (SD = 0.05). We then created images to describe the target measurement adjectives using objects that should be familiar to children in our age range. These images included two balls (big and small), two doors (wide and narrow), two kites (high and low), and two flowers (tall and short) (see Figure 1 for an example and see Supplementary Material for all images as well as videos of NAO's gestures).

Design
In all conditions, the robot verbally taught the target adjectives and the images were presented using Microsoft PowerPoint on a 13-inch laptop screen, but the trials differed in terms of additional scaffolds provided. We used a mixed design where Gesture Type (Deictic, Iconic) was a between-subject factor and Scaffold Type (Gesture, Screen Cue) was a within-subject factor. All children went through two conditions: one of the two Gesture Type conditions (Deictic or Iconic) and the On-Screen Cue condition. In the On-Screen Cue condition, the tutor did not perform any gestures. Instead, a red rectangle appeared around the corresponding object to draw attention to the image on the computer screen (Figure 1A). The presence of pictures for the words closely mimics typical learning contexts for L2 learning (e.g., Jones and Plass, 2002). In the Iconic Gesture condition, the tutor performed an iconic gesture while teaching the word (Figure 1B; see also Figure 2 for an example of phases for iconic gestures). In the Deictic Gesture condition, the tutor pointed to the object on the computer screen while teaching the word (Figure 1C). Because NAO has three fingers that cannot move independently of one another, both NAO and the human tutor used palm pointing instead of index pointing.
In each Scaffold Type condition (Gesture vs. Screen Cue), children learned two pairs of words. Each condition consisted of three blocks, where the word pairs were repeated. The word pairs were counterbalanced such that half of the children learned two pairs of words (e.g., big and small, high and low) in the Gesture condition, whereas the other half learned the same pairs in the On-Screen Cue condition. The order of conditions was also counterbalanced across children. Twenty-one of the 38 children participated in the Iconic Gesture + On-Screen Cue condition and 17 children participated in the Deictic Gesture + On-Screen Cue condition.
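The counterbalancing scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grouping of the four word pairs into two sets follows the example in the text (big/small with high/low), and the labels `GESTURE_TYPES`, `WORD_SETS`, `ORDERS`, and `build_cells` are hypothetical names, not part of the study materials.

```python
from itertools import product

# Between-subject factor: each child sees only one gesture type.
GESTURE_TYPES = ["Deictic", "Iconic"]

# Assumed split of the four word pairs into two sets (based on the
# example in the text; the actual split is not fully specified).
WORD_SETS = {
    "A": [("big", "small"), ("high", "low")],
    "B": [("wide", "narrow"), ("tall", "short")],
}

# Within-subject factor: both scaffold conditions, in either order.
ORDERS = [("Gesture", "Screen Cue"), ("Screen Cue", "Gesture")]


def build_cells():
    """Enumerate every counterbalancing cell of the mixed design."""
    cells = []
    for gesture_type, gesture_set, order in product(GESTURE_TYPES, WORD_SETS, ORDERS):
        # The word set not used with gestures is taught with the on-screen cue.
        screen_set = "B" if gesture_set == "A" else "A"
        cells.append({
            "gesture_type": gesture_type,
            "order": order,
            "words": {
                "Gesture": WORD_SETS[gesture_set],
                "Screen Cue": WORD_SETS[screen_set],
            },
        })
    return cells


cells = build_cells()
for cell in cells:
    print(cell["gesture_type"], cell["order"], cell["words"]["Gesture"])
```

Under these assumptions the design has eight cells (2 gesture types × 2 word-set assignments × 2 condition orders), and every child encounters all four word pairs exactly once across the two scaffold conditions.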

Procedure
All children met individually with the human experimenter and the Robot tutor in a quiet room. Prior to the experiment, children were asked if they knew what each English target adjective meant. One child who knew four of the eight target adjectives was excluded from the dataset. Four children who knew two adjectives (big and small) were included in the overall data analysis but their responses for the big and small questions were excluded.
The child was seated in front of a 13-inch laptop where all images were presented. The Robot tutor sat across from the child, behind the laptop (see Figure 1). A human experimenter first introduced the robot to the child, but had no further interaction with the child, and wirelessly controlled the robot using a Wizard of Oz technique while pretending to complete paperwork. The robot taught two pairs of measurement adjectives per block. Each pair of adjectives was taught with a specific object presented on the screen (e.g., a ball for the words small and big). At the end of each block, children were presented with the image of two objects (e.g., a small ball and a big ball) on the screen, and the robot asked the child to point to the object that corresponded to the target adjective (e.g., small). Children were asked 4 questions per block per condition. Thus, the maximum score on the test was 24 (4 questions × 3 blocks for the Gesture condition and 4 questions × 3 blocks for the Screen Cue condition). Immediately after completing the three blocks, children also completed a generalization task designed to evaluate whether they could extend the newly learned adjectives to novel objects. Children were presented with a series of new images with new objects representing the same set of eight adjectives (e.g., cars for the adjectives big and small) on the computer screen, and asked to point to the object that corresponded to each of the eight adjectives (see Figure 3 for an example item). The maximum score on generalization questions was 8. Responses were coded online by another experimenter, but the sessions were also videotaped for possible offline coding. The entire session took 15-20 min (see the Appendix for a full description and verbatim transcription of the procedure).

Results
We first examined if children learned the L2 words. One-sample t-tests showed that children performed significantly better than chance both on the test, t(36) = 10.536, p < 0.001, and on the generalization questions, t(36) = 4.672, p < 0.001. In both the Gesture condition and the Screen Cue condition, children performed significantly better than chance on the test (Gesture: t(37) = 8.047, p < 0.001; Screen Cue: t(36) = 9.481, p < 0.001).
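For readers less familiar with the comparison against chance, the following is a minimal sketch of a one-sample t-test in plain Python. The scores are made up purely for illustration and are not the study data. The chance level of 0.5 per question is an assumption based on the two-picture pointing task, and `one_sample_t` is a hypothetical helper name.

```python
import math
import statistics


def one_sample_t(scores, chance):
    """Return the t statistic and degrees of freedom for H0: mean == chance."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)      # sample SD (n - 1 in the denominator)
    standard_error = sd / math.sqrt(n)
    return (mean - chance) / standard_error, n - 1


# Hypothetical total test scores (out of 24) for eight children.
scores = [18, 15, 20, 13, 17, 16, 19, 14]

# Assumed chance level: two pictures per question, so 0.5 x 24 questions.
CHANCE = 0.5 * 24

t_stat, df = one_sample_t(scores, CHANCE)
print(f"t({df}) = {t_stat:.3f}")
```

The resulting t value would then be compared against the t distribution with n - 1 degrees of freedom to obtain a p value, a step that statistical packages perform automatically.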

Accuracy for the Test Questions as a Function of Gesture Type
Generalized Linear Mixed-Effects Models (GLMMs) were run using SPSS Statistics 23.0 (SPSS Inc., Chicago IL) with accuracy as the dependent variable. The logit was used as the link function to account for the dichotomous (correct vs. incorrect) dependent variable. This first model was to see if the specific Gesture Type (Iconic vs. Deictic) made a difference in children's learning. This first GLMM included Gesture type (Deictic, Iconic) with Block (1, 2, 3) as fixed factors and Subject and Word (referring to the specific adjective used) as random factors. No effects were significant, ps > 0.10. In other words, learning did not vary as a function of whether the tutor used a Deictic or Iconic gesture. Thus, in the subsequent analysis, we collapsed over Gesture type.
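One way to write down a model of this kind is shown below. It is a sketch rather than the exact specification fitted in SPSS: it assumes main effects only, dummy coding with Block 1 as the reference level, and random intercepts for Subject (index i) and Word (index j), none of which is stated explicitly in the text.

```latex
\operatorname{logit}\,\Pr(Y_{ij} = 1)
  = \beta_0
  + \beta_1\,\mathrm{GestureType}_i
  + \beta_2\,\mathrm{Block2}_{ij}
  + \beta_3\,\mathrm{Block3}_{ij}
  + u_i + w_j,
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
w_j \sim \mathcal{N}(0, \sigma_w^2)
```

Here Y_ij is 1 when child i answers the question for word j correctly and 0 otherwise, so each β is a difference in log-odds of a correct response.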

Accuracy for the Test Questions as a Function of Scaffold-Type
Another GLMM was run to examine the role of Scaffold Type (Gesture vs. Screen Cue) on accuracy. This model included fixed effects for Scaffold Type (Gesture, Screen Cue), with Block (1, 2, 3) as fixed factors and Subject and Word as random factors. Figure 4 represents the average accuracy. Results revealed a main effect of Scaffold Type (F(1, 906) = 3.931, p = 0.048). Bonferroni pairwise comparison post-hoc tests showed that the Screen Cue condition was associated with higher accuracy than the Gesture condition (β = 0.455, SE = 0.230, 95% CI [0.005, 0.906]). The odds of giving a correct rather than an incorrect response in the Screen Cue condition were estimated to be exp(0.455) = 1.58 times the corresponding odds in the Gesture condition, all other things being equal. Thus, children were more likely to give a correct response in the Screen Cue condition compared to the Gesture condition.
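The conversion from the logit coefficient to an odds ratio can be reproduced with a short calculation. The sketch below uses the reported β and SE; the interval is a normal-approximation (Wald) interval with z = 1.96, which is an assumption and may differ slightly from how SPSS computed the reported bounds. `odds_ratio_with_ci` is a hypothetical helper name.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Exponentiate a logit coefficient and its Wald interval."""
    lower = beta - z * se
    upper = beta + z * se
    return math.exp(beta), (math.exp(lower), math.exp(upper))


# Reported Study 1 estimate for Screen Cue vs. Gesture.
odds_ratio, (ci_low, ci_high) = odds_ratio_with_ci(beta=0.455, se=0.230)
print(f"odds ratio = {odds_ratio:.2f}")
print(f"approximate 95% CI on the odds scale: [{ci_low:.2f}, {ci_high:.2f}]")
```

Because the lower bound on the log-odds scale is just above zero, the lower bound of the odds ratio is just above 1, which matches the marginal p value reported for this effect.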

Accuracy for the Generalization Questions
Parallel models were run on generalization questions' accuracy. No effects reached statistical significance when examining the role of Scaffold Type or the role of specific Gesture Type, all ps > 0.10.

Interim Summary
Children were able to learn words from a Robot tutor and perform above chance on both test and generalization questions. Performance was also above chance both when teaching was accompanied by Gestures and when it was accompanied by On-Screen Cues. However, accuracy was significantly higher when the robot tutor's teaching was supported by on-screen cues as compared to gestures. The main goal of the current study was to examine factors that influence the teaching effectiveness of robot tutors. However, these results raised a follow-up question of whether the role of scaffolds was unique to a robot tutor or would also generalize to a human tutor. Thus, we conducted a follow-up study using a human tutor to examine whether the pattern of results and the learning of the children would mimic the findings with the Robot tutor.

Method

Participants
A new group of 41 children participated in the follow-up study that used a human tutor (M age = 67.6 months, SD = 4.98, 24 females). All but three children were tested in a quiet room at their own preschool, located in a large city in Turkey (Istanbul or Bursa); the remaining three children were tested in the lab in Istanbul. The initial sample consisted of 44 children, but three were excluded: one child knew four of the eight English words to be taught, one child was asked questions in the wrong order due to experimenter error, and one child quit the study before it ended. All other details (including recruitment methods) were the same as Study 1.

Stimuli
Same as Study 1.

Design
Same as Study 1. Out of the 41 children, 23 participated in the Iconic Gesture + On-Screen Cue condition and 18 participated in the Deictic Gesture + On-Screen Cue condition.

Procedure
All children met individually with the human experimenter, who was an adult female, in a quiet room. The experimenter served as the tutor. The child was seated in front of a 13-inch computer screen where all images were presented. The Human tutor sat across from the child. All other details were the same as Study 1.

Accuracy for the Test Questions as a Function of Gesture Type
GLMMs were run with accuracy as the dependent variable. This first model was to see if the specific Gesture Type (Iconic vs. Deictic) made a difference in children's learning. This first GLMM included Gesture type (Deictic, Iconic) with Block (1, 2, 3) as fixed factors and Subject and Word as random factors. No effects were significant, ps > 0.10. In other words, mimicking the results with the Robot tutor, there was no significant effect of Gesture type, meaning learning did not vary as a function of whether the Human tutor used a Deictic or Iconic gesture. Thus, in the subsequent analysis, we collapsed over Gesture type.

Accuracy of Test Questions as a Function of Scaffold-Type
Another GLMM was run with accuracy as the dependent variable to examine the role of Scaffold Type (Gesture vs. Screen Cue). This second model included fixed effects for Scaffold Type (Gesture, Screen Cue), with Block (1, 2, 3) as fixed factors and Subject and Word as random factors. Figure 4 represents the average accuracy. Results of this GLMM analysis revealed a main effect of Scaffold Type (F(1, 968) = 5.267, p = 0.022). Bonferroni pairwise comparison post-hoc tests showed that the Screen Cue condition was associated with higher accuracy than the Gesture condition (β = 0.467, SE = 0.204, 95% CI [0.068, 0.867]). The odds of giving a correct rather than an incorrect response in the Screen Cue condition were estimated to be exp(0.467) = 1.59 times the corresponding odds in the Gesture condition, all other things being equal. Thus, children were more likely to give a correct response in the Screen Cue condition compared to the Gesture condition. Children performed better on the 3rd block compared to the 1st block, indicating learning over time (β = 0.530, SE = 0.199, 95% CI [−0.920, −0.0140], p = 0.008). Thus, the pattern of results mimicked the results with the Robot tutor.

Accuracy for the Generalization Questions
Parallel models were run on generalization questions' accuracy. No effects reached statistical significance, all ps > 0.10 when examining the role of Scaffold Type or the role of specific Gesture Type.

Interim Summary
Children were able to learn words from a Human tutor and perform above chance on both test and generalization questions. Performance was also above chance both when teaching was accompanied by Gestures or On-Screen Cues. Similar to the finding with the Robot tutor, accuracy was higher when teaching was accompanied by On-Screen Cues as compared to Gestures.

Exploratory Comparison of Robot and Human Tutor Across Study 1 and Study 2
Studies 1 and 2 were separately analyzed as one study was complete before the other and thus children were not randomly assigned to the two tutors. Nevertheless, we conducted an exploratory analysis to compare the robot and human tutors. We built an additional GLMM with accuracy as the dependent variable, Tutor (Human, Robot), Scaffold Type (Gesture vs. Screen Cue), and Block (1, 2, 3) as fixed factors, and Subject and Word as random factors.

DISCUSSION
Our goal was to examine the role of social robots in L2 vocabulary learning. We asked whether (1) children can effectively learn new words from a robot tutor, (2) scaffolds differ in how they support robot teaching, and (3) the type of gesture affects effectiveness in L2 vocabulary teaching. First, consistent with our hypothesis, we found that children were able to learn L2 words from a social robot. Second, we showed that children were able to learn when the tutor (robot or human) either gestured or used on-screen cues. Children were able to learn L2 words with both types of scaffolds, but learning outcomes were better when the teaching was supported by on-screen cues than when the tutor gestured. Finally, the type of gesture did not significantly influence L2 vocabulary learning. Below we further discuss our results.

Social Robot and Human Tutors in L2 Vocabulary Teaching
Our results showed that young children were able to effectively learn new L2 vocabulary from a social robot. Children performed above chance not only on the test but also on generalization trials in which they were asked to associate the learned words with novel images. Our results are consistent with prior literature but provide novel insights by comparing social robots and human tutors in L2 vocabulary teaching. Although we refrain from emphasizing a trend indicating the robot promoted better learning outcomes than the human tutor, we can state that children learned measurement adjectives in an L2 from a social robot tutor as successfully as they learned from a human tutor. Possible explanations for the robot tutor's success include the robot's novelty. Anecdotally, children in the current study expressed great excitement about the robot tutor, and recent reviews also emphasize high enjoyment and anthropomorphic tendencies for robots in children in our age range (Ahmad et al., 2019; van Straten et al., 2020). A study similarly did not find an effect of tutor type, but showed that children gazed more at a robot tutor than a human tutor (Westlund et al., 2017). Another study with 10- to 13-year-olds also found more frequent gaze toward a robot compared to a human (Serholt and Barendregt, 2016). More frequent gaze might indicate a high interest in robots or simply novelty preference. One limitation of our study is that we did not have access to eye gaze data. Future studies should examine if eye gaze on the tutor or gestures mediates learning outcomes. Another limitation of our study is that the process of designing gestures could be improved.
While we confirmed that robot gestures were interpretable by adults and while children's performance with the robot tutor did not differ from the human tutor condition, future research could leverage gestures produced by the children in this age range and also examine the information children gather from robot gestures in more detail.

Role of Gestures and On-Screen Cues in L2 Vocabulary Teaching
We demonstrated that the effectiveness of the tutor (social robot and human) varies as a function of the other scaffolds in the environment. In doing so, we extended the prior literature by explicitly focusing on the role of different scaffolds for robot tutors; thus, we explored not only whether social robots aid learning but also how they might aid learning. More specifically, our results showed that children's learning outcomes were better when the tutor's (both social robot and human) teaching was supported by on-screen cues than when it was supported by co-speech hand gestures. Although the results are inconsistent with our hypothesis and were initially surprising, they dovetail with a recent large-scale study showing no beneficial effect of the robot's gestures on learning when teaching English vocabulary to young children (Vogt et al., 2019). Not all studies observe facilitating effects of gesture on learning (Congdon et al., 2018). Individuals vary greatly in the amount of information they glean from gestures (Demir-Lira et al., 2018; Kartalkanat and Göksun, 2020). Some of this variability is due to age and prior knowledge level (Puccini and Liszkowski, 2012; Post et al., 2013; Novack et al., 2016), and gestures seem to help learners who are ready to learn, but not learners with low background knowledge (Post et al., 2013; Congdon et al., 2018). For example, when learning grammar rules from animations, children with low language skills performed worse when the rules were taught with gestures than when there were no gestures (Post et al., 2013). In a recent study, when social robots used pointing gestures during a reading comprehension task, children with higher proficiency benefited, whereas children with lower proficiency did not (Yadollahi et al., 2018). The children in our study did not know much English, and thus our results are consistent with previous research suggesting that gestures might only help learners who have some prior knowledge. We also did not find differences between the iconic and deictic gesture conditions; again, gesture type might make a difference only for learners with a certain level of background knowledge.
In many earlier studies, the gesture condition has been compared to control conditions where children were presented with speech only, in other words, conditions with no other scaffold in the environment (e.g., Demir-Lira et al., 2018). Some studies compared gestures to conditions with other educational supports, such as real objects (e.g., Novack et al., 2016). These studies present mixed results: gestures are sometimes more and sometimes less effective than interacting with real objects (Congdon et al., 2018). The mixed findings highlight the importance of comparing the educational effectiveness of gestures with that of other scaffolds available in the learning environment, such as those provided by a screen device, as was the case in our study. Taken together, our study adds to the literature as the first to compare gestures with an on-screen cue condition in which attentional support was provided on a screen.
In terms of task characteristics, for beginner learners, focusing on a single visual scene can aid processing (Atkinson, 2005; Kalyuga, 2005). A previous study on discovery learning using the NAO robot and a touchscreen device also suggests that it might be natural for children to look mostly at the screen instead of the robot (Kennedy et al., 2015). In the current study, children heard the measurement word to be learned (e.g., big) and were also presented with an image associated with the word on the computer screen (e.g., big ball). This context closely mimics typical L2 teaching contexts where children need to coordinate information presented by the tutor and supplementary visuals (such as pictures in a book or on a screen) to gather the meaning of a word (e.g., Jones and Plass, 2002). The gesture condition required children to shift their attention back and forth between the screen and the tutor. In contrast, in the on-screen cue condition, the visual image and the on-screen cue were both on the computer screen. Overall, gestures might have placed higher attentional demands on children than on-screen cues did. A rich body of literature does support the use of non-verbal aids in children's learning, and emerging work compares different types of non-verbal scaffolds. For example, a recent study on word learning in preschoolers reported that pictures, compared to gestures, might reduce demands on working memory (Rowe et al., 2013). For the beginner learners in our study, providing the visual information and cues on the same screen might have made it easier to match images to auditory labels. More research should be conducted to understand the role of educational technologies and to evaluate how different features interact with each other to help or hinder learning. The ways in which the role of gestures might evolve as children become more proficient in L2 can also be explored.
In terms of the educational implications of our findings, our results are consistent with an emerging broader literature suggesting that the role of scaffolds might vary depending on multiple factors in the learning environment, more specifically the design/technology, the learner, and task characteristics. Instead of a one-size-fits-all approach, our study emphasizes the importance of aligning any educational technology with the particular task at hand as well as with the particular characteristics of the learner (Lowe and Boucheix, 2011). Going forward, gestural vs. screen cues can be leveraged to different extents, depending on the background knowledge of the learner. For example, beginner learners could benefit more from static, concrete cues, and over time cues could become more representational and abstract as the learner gathers further background information, a possibility that should be tested in future studies. If robots are introduced to educational contexts, their role should be evaluated in relation to other supports available in the teaching environment. Moreover, although our design did not provide this feature, social robots could be programmed to respond contingently and to vary instruction depending on the child's needs, which is an important future direction for educational research.
In summary, the findings suggest that children can learn new words equally well from a robot tutor or a human tutor when a screen is used as an intermediary medium to present the learning material. Given the potential of social robots in the classroom, identifying factors that facilitate their use in teaching will benefit the development of more supportive environments for L2 teaching. Our results also emphasize the importance of tailored educational environments as opposed to a one-size-fits-all approach. Future educational designs could reach maximal effectiveness by leveraging tools that best match the constraints of the task, the learning goals, and the learner's needs.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Koç University Committee on Human Research. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS
ÖD-L, JK, CO, TG, and AK conceived the study with feedback from SK and IF. ÖD-L, JK, CO, SK, and IF were in charge of collecting the data. ÖD-L analyzed the data in consultation with JK, CO, TG, and AK. ÖD-L drafted the manuscript, and all authors critically edited it. All authors extensively contributed to the project and approved the final submitted version of the manuscript.

FUNDING
This study was conducted as part of the L2TOR project, supported by the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 688014, awarded to AK as the PI and TG as the co-PI for the Koç University site.

ACKNOWLEDGMENTS
We thank all the schools that participated in the study and the families and children who generously gave their time. We also thank Merve Aslan and Orhun Uluşahin for help with data collection and Sevval Konyali for illustrations.