I Can See It in Your Eyes: Gaze as an Implicit Cue of Uncanniness and Task Performance in Repeated Interactions With Robots

Over the past years, extensive research has been dedicated to developing robust platforms and data-driven dialog models to support long-term human-robot interactions. However, little is known about how people's perception of robots and engagement with them develop over time and how these can be accurately assessed through implicit and continuous measurement techniques. In this paper, we explore this by involving participants in three interaction sessions with multiple days of zero exposure in between. Each session consists of a joint task with a robot as well as two short social chats with it before and after the task. We measure participants' gaze patterns with a wearable eye-tracker and gauge their perception of the robot and engagement with it and the joint task using questionnaires. Results disclose that aversion of gaze in a social chat is an indicator of a robot's uncanniness and that the more people gaze at the robot in a joint task, the worse they perform. In contrast with most HRI literature, our results show that gaze toward an object of shared attention, rather than gaze toward a robotic partner, is the most meaningful predictor of engagement in a joint task. Furthermore, the analyses of gaze patterns in repeated interactions disclose that people's mutual gaze in a social chat develops congruently with their perceptions of the robot over time. These are key findings for the HRI community as they entail that gaze behavior can be used as an implicit measure of people's perception of robots in a social chat and of their engagement and task performance in a joint task.


Introduction
An essential precondition for understanding the development of people's perception of robots in repeated interactions is the development of measurement techniques suitable for long-term assessment. To date, the measurement of people's perception of robots relies almost solely on questionnaires and interviews. However, these have several limitations. First, they only capture people's perception at one specific moment in time. This means that, while changes in perception can be detected between the different points of measurement, it is not possible to relate these changes to particular events within the interaction. Second, capturing changes in people's perception over time as accurately as possible requires multiple points of measurement. However, filling out questionnaires interrupts people's interactive experience and can therefore reduce their involvement with the robot and with the task they perform with it. Finally, self-report measures are prone to bias. Repeatedly filling out the same questionnaire may cause people to remember previous answers, which can decrease the accuracy of the measurement and reveal the purpose of the experiment. To accurately study people's perception of robots in long-term interactions and relate changes in perception to specific actions of the robot, it is thus important to develop more implicit and continuous measurement techniques. In this paper, we explore gaze patterns as a potential continuous method for capturing people's perception of robots.
Previous studies in Social Psychology investigating the relationship between gaze behavior and people's perception of a human partner reached conflicting conclusions. On the one hand, people gaze more at each other when they share feelings of warmth and liking and when seeking friendship [17,26,54]. On the other hand, they show a similar behavior pattern (i.e., longer fixation) when interacting with unconventional-looking people, for instance, those carrying a facial stigma [33,34]. This is because a stimulus that does not match prior knowledge and expectations captures more attention than a stimulus that perfectly matches them [10,29]. Transferring this to human-robot interaction (HRI), we can expect people to look more often at robots that they like, but also to stare longer at unconventional robots, such as realistic androids eliciting uncanny feelings [36]. Indeed, when presented with uncanny robots, people have sometimes avoided looking at them [53] and at other times stared at them longer [35]. In social robotics, experiments on the meaning of mutual gaze have focused almost solely on uncanny robots and still images. In this paper, we hence aim to discover whether findings from non-interactive scenarios translate to real face-to-face interactions with robots and whether mutual gaze in a social chat can be an implicit measure of people's perception of robots, in particular of likability and uncanniness.
In the literature, the few experiments that tracked gaze towards a robot in actual human-robot interactions mainly used joint tasks involving multiple objects of attention (e.g., a touchscreen) as a test-bed (e.g., [13,24,44]). We argue that in these contexts, gaze towards the robot has a meaning distinct from the one it has in a face-to-face social conversation, as the robot and the objects involved in the joint task compete for the same attentional resources. The second focus of this paper is thus to understand whether the gaze participants allocate to the robot in a joint task is related to engagement and task performance, and what the gaze people direct to the other objects involved in the joint task means. Previous long-term HRI studies on gaze focused exclusively on how gaze towards the robot developed over multiple sessions of the same activity (e.g., [2,3,50]). In this study, we also investigate how gaze towards the other foci of attention in the joint task varies over time and whether participants' mutual gaze in a social chat preceding and following the joint task changes across repeated interaction sessions.
This paper presents an experiment in which participants were involved in three interaction sessions with the blended robotic head Furhat [6], with multiple days of zero exposure between sessions. To achieve meaningful variations in how the robot was perceived, we manipulated its humanlikeness by applying three facial textures with different anthropomorphic features. Each interaction session was divided into a geography-themed cooperative game with the robot (serving as a joint task) and a face-to-face social chat before and after the game. To track and analyze gaze patterns, participants wore eye-tracking glasses throughout the interaction session. At different points in the sessions, they were asked to self-report their perception of the robot and their engagement with it and with the collaborative game. The questionnaires were used to gain novel insights into the suitability of gaze patterns as an implicit measure of people's perception of a robot and of their engagement and performance in a joint task. The multiple interaction sessions enabled us to track the progression of gaze within and between interactions and to understand whether and how gaze patterns change over time.
2 Related Work

Mutual Gaze and Liking
Goffman [20] was one of the first to state that the direction of gaze plays a crucial role in the initiation and maintenance of social encounters and can be an indicator of social attention. Exline et al. [19] showed that the amount of mutual gaze increases when a person is drawn to another individual, either in an affiliative or in a competitive way. It is through mutually held gaze that two people commonly establish their openness to each other's communication, and averting the eyes in a face-to-face interaction can be read as a cut-off act as well as a sign of dislike [23,48].
Studies on mutual gaze in HRI have mostly focused on how implementing such nonverbal behavior on a robot influences users' perception [27,37]. A 2017 review identified three main lines of gaze research in HRI: (1) human responses to a robot's gaze, (2) the design of gaze features for robots, and (3) computational tools to implement social gaze in robots [1]. The first attempt to use gaze as a means to assess interest, liking, and engagement in HRI was made by Sidner et al. [51], who involved participants in a demo interaction with the penguin robot Mel and used gaze to understand whether manipulating the robot's behavior influenced the amount of mutual gaze it attracted. Lemaignan et al. [30] measured the gaze direction of children involved in a collaborative task with the NAO robot (i.e., teaching handwriting skills to a robot) and used it to compute their with-me-ness, the extent to which a human is with the robot over the course of an interactive task. They obtained a with-me-ness value by comparing the child's focus of attention at a certain point in time with a set of expected attentional targets for that moment of the interaction. Kennedy et al. [24] found that children interacting with a physical robot gazed at it significantly more often than children interacting with a virtual one, and that they spent significantly more seconds per minute gazing at the robot in the physical robot condition than in the virtual one. Similarly, Papadopoulos et al. [44] used gaze to estimate adults' social engagement with the robot in a memory game with the NAO robot, and Castellano et al. [13,14] did the same for children playing chess with the iCat robot.
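With-me-ness compares, at each moment, where the child is looking with the set of targets the task makes relevant at that moment. The following is a minimal sketch of that idea, assuming that gaze has already been labeled per discrete time step and that every step counts equally; Lemaignan et al.'s exact formulation and weighting may differ, and the target labels and example data are hypothetical.

```python
from typing import Sequence, Set


def with_me_ness(foci: Sequence[str], expected: Sequence[Set[str]]) -> float:
    """Fraction of time steps in which the observed focus of attention
    is one of the attentional targets expected at that step."""
    if len(foci) != len(expected):
        raise ValueError("foci and expected must have the same length")
    if not foci:
        return 0.0
    hits = sum(1 for focus, targets in zip(foci, expected) if focus in targets)
    return hits / len(foci)


# Hypothetical per-second annotation of a short teaching episode.
foci = ["robot", "tablet", "paper", "robot", "elsewhere"]
expected = [{"robot"}, {"tablet"}, {"tablet", "robot"}, {"robot"}, {"robot", "tablet"}]
print(with_me_ness(foci, expected))  # 0.6
```

Three of the five steps match an expected target, giving 0.6. A weighted variant (e.g., discounting steps where several targets are acceptable) would be a design choice on top of this sketch.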
These studies seem to suggest that the most compelling interaction conditions are those that elicit the longest gaze towards the robot, partially supporting the positive relationship between mutual gaze and liking in the context of HRI. However, only a few of these studies specifically related participants' gaze toward the robot to metrics of likability and used it as an implicit measure of participants' perception of a robot [51]. Moreover, most of the reviewed studies focused on the allocation of gaze towards a robot during a task (e.g., a chess game). In such a context, the robot and the task at hand compete for the same attentional resources; hence, gaze towards the robot is no longer a precise measure of the robot's likability and of participants' social attunement with it, because it is constrained by participants' willingness to complete the task [15,16,46,47]. In this study, we thus examine robot-directed gaze in two separate situations: during a collaborative game, and in a face-to-face social chat between the participant and the robot occurring before and after the game. In the social chat, we focus on the mutual gaze that the robot attracts and use it as a predictor of participants' perceptions. In the collaborative game, we focus on participants' gaze patterns and examine whether these can predict task performance, perceived involvement with the robot, and involvement with the game. The aim is to understand whether mutual gaze in a face-to-face social chat increases with the robot's likability and which gaze patterns are related to task performance and engagement in the joint task.

Stigma, Staring and the Uncanny Valley
Staring is defined as gaze that persists regardless of the behavior of the other person [26]. The novel stimulus hypothesis posits that behavioral avoidance of people who appear physically different (e.g., pregnant women) is mediated by a conflict between a desire to stare at novel stimuli and a desire to adhere to a norm against staring when the novel stimulus is another person [29]. Langer and colleagues discovered that, when staring is not negatively sanctioned, it varies as a function of the novelty of the observed subject, whereas, when norms against it are in place, staring is inhibited. In line with this, Kleck [25] found that participants looked at a research confederate carrying a physical stigma significantly more than at one not carrying it, and Madera and Hebl [33] found that interviewers of facially stigmatized interviewees (i.e., with a port-wine stain) spent considerably more time looking at the location of the applicants' stigma than interviewers evaluating non-stigmatized applicants.
In HRI, the uncanny valley hypothesis posits that the likability of a robot increases with its humanlikeness up to a point where it drops abruptly. This drop in likability, known as the uncanny valley, is reached when a robot is almost indistinguishable from a healthy human, but some of its features still point to its artificiality and hence elicit eeriness [36]. As uncanny robots might be perceived as more novel, and the eeriness they generate might be equated to that elicited by a stigma, one can hypothesize that robots perceived as uncanny elicit more staring than robots that are not perceived as such. However, in line with the extant literature on the positive relationship between liking and mutual gaze, one can also posit that uncanny robots attract less direct gaze, as they are less likable and elicit more discomfort.
Minato et al. [35] were the first to investigate whether the uncanniness of an android robot could affect people's gaze behavior. They gauged the gaze direction of people involved in a face-to-face conversation with three interlocutors: a human girl, a motionless android robot shaped as a girl, and the same android robot with a moving head, eyes, and neck. They found that people looked significantly more at the eyes of the android robots than at those of the human girl and consequently suggested fixation time as an implicit measure of uncanniness. Strait et al. [53] exposed participants to pictures of real humans and robots varying in humanlikeness (low, medium, high) and found that participants fixated highly humanlike robots less than the other agents when the whole body was taken into account, and more than the artificial agents when the head and the eyes were considered. Smith and Wiese [52] studied the effects of a robot's appearance on delayed disengagement. They asked participants to orient their gaze to a target dot appearing on either side of a screen after fixating an agent in its center, and measured the time it took participants to reorient their gaze. Although reaction times were expected to increase when processing stimuli with a negative connotation, their results did not reveal any significant difference across agents varying in humanlikeness (e.g., non-social, robot, robotoid, humanoid, human).
A similar study was carried out by Li et al. [31], who investigated both static and video stimuli. In the static image experiment, reaction times to the mechanical robot were slower than those to the android robot and the real human. In the video-based experiment, on the contrary, reaction times to the android robot and the real human were slower than those to the mechanical robot. Since these related studies mostly focus on non-interactive stimuli and their results do not point in a clear direction with respect to the two alternative hypotheses on the meaning of mutual gaze, we further explore whether mutual gaze in a face-to-face interaction with a social robot varying in humanlikeness is related to its perceived uncanniness and, if so, whether this relation aligns with the mutual gaze-liking or the novel stimulus hypothesis.

Tracking Gaze Over Time
When it comes to the progression of gaze over time in interactive scenarios with social robots, most of the related work focuses on Child-Robot Interaction (cHRI). In this context, pivotal work has been carried out by Baxter et al. [9] and Kennedy et al. [24], who focused on changes in gaze patterns within an interaction session. Baxter and colleagues measured children's gaze behavior towards the robot during a joint task by calculating a number of gaze metrics (i.e., mean length of each gaze to the robot and length of gaze to the robot per minute) within a predefined time window and comparing them with the same metrics gauged in subsequent time windows. By splitting the interactions into three equal parts, they found that the gaze directed to the robot decreased from the first to the final third of the joint task and interpreted this result as a decrease in engagement with the robot over time. Kennedy et al. [24] used the same approach in a collaborative sorting task and found that gaze towards the robot decreased significantly between the first and the second third of the interaction and then remained roughly constant. Like Baxter et al., they ascribed this drop and subsequent stabilization to a reduction of engagement over time due to the wearing-off of the novelty effect.
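The window-based metrics used by Baxter et al. and Kennedy et al. can be sketched as follows. The sketch assumes that gaze at the robot is available as (start, end) episodes in seconds and that an episode crossing a window boundary is split between windows, so it counts as one gaze in each; the original studies may have handled boundary-crossing gazes differently, and the episode data below are invented for illustration.

```python
from typing import List, Tuple

Episode = Tuple[float, float]  # (start_s, end_s) of one gaze at the robot


def window_metrics(episodes: List[Episode], duration: float,
                   n_windows: int = 3) -> List[Tuple[float, float]]:
    """For each of n equal windows, return (mean length of each gaze in s,
    seconds of gaze at the robot per minute)."""
    width = duration / n_windows
    results = []
    for w in range(n_windows):
        lo, hi = w * width, (w + 1) * width
        # Clip each overlapping episode to the window (assumption: a gaze
        # spanning a boundary is counted once in each window it touches).
        lengths = [min(e, hi) - max(s, lo) for s, e in episodes if s < hi and e > lo]
        total = sum(lengths)
        mean_len = total / len(lengths) if lengths else 0.0
        per_minute = total / (width / 60.0)
        results.append((mean_len, per_minute))
    return results


# Hypothetical 3-minute interaction split into thirds of 60 s each.
episodes = [(0, 10), (20, 30), (50, 70), (100, 105), (150, 152)]
print(window_metrics(episodes, 180.0))
# [(10.0, 30.0), (7.5, 15.0), (2.0, 2.0)]
```

In this toy example, both metrics decline from the first to the last third, which is the kind of pattern the cited studies interpreted as waning engagement.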
Along this line, but with a stronger focus on long-term cHRI, is the work of Serholt and Barendregt [50]. They involved 30 children in three sessions of play with a NAO robot in a map reading task and analyzed children's behavioral reactions to three implicit probes: a greeting, feedback/praise, and a question. One of the behavioral markers employed to assess children's reactions to the probes was gaze towards the robot, which they considered a sign of social engagement. Serholt and Barendregt found that the most common response to the three probes was directing the gaze towards the robot, and that this response decreased slightly over time. The authors suggested that one way to counteract this decrease in children's engagement with the robot over time was to implement responsive robot behaviors that could facilitate bonding. Ahmad and colleagues moved in this direction by studying how different types of robot adaptation to children's states could influence social engagement and learning [2][3][4]. They ran several long-term studies (three to four sessions) involving children in joint tasks with a NAO robot (i.e., a snakes and ladders game, a mathematical learning task, and a vocabulary learning task) and evaluated the effect of different types of robot adaptation (e.g., memory and emotion adaptation) on children's engagement with it. They measured children's social engagement with the robot through a number of behavioral metrics, including the gaze directed to the robot. In line with the suggestion of Serholt and Barendregt [50], they found that the gaze allocated to the robot during the joint task increased across sessions when the robot behaved empathetically [2,3], and that children learned significantly more over time when interacting with the empathetic robot [2] or when the robot gave them positive and supportive feedback [4].
The literature discussed above shows that gaze has been consistently used to measure social engagement with robots over time. However, in most cases, gaze has been manually annotated [2-4, 9, 24, 50]. While several researchers have proposed automated methods to gauge gaze allocation [7,28,30], only one study [18] has used such methods to assess the development of social engagement with robots over the course of an interaction, and, to the best of our knowledge, no one has used them to monitor the direction of gaze towards different foci across repeated interactions. For this study, we automatically annotate gaze with a deep learning-based object detector built on YOLOv4 [11] and investigate how gaze patterns in a joint task develop across three interaction sessions with multiple days of zero exposure in between. We think that automatic gaze tracking holds promise for the online assessment of engagement and could be used for real-time reward estimation in co-adaptive scenarios.
The main focus of long-term gaze studies has been engagement. However, Strait et al. [53] and Minato et al. [35] show that gaze can also be a meaningful predictor of a robot's uncanniness. From our previous work, we know that: (1) mere exposure to a robot changes people's initial perceptions of it [43]; (2) progressively exposing people to the multimodal behaviors of a robot improves their perception of it [39]; and (3) the perceptual dimensions that contribute to people's mental image of the robot stabilize over time [41,42]. Hence, besides investigating the role of gaze patterns in a joint task with a focus on engagement, in this paper we also seek to understand how mutual gaze in a social chat develops over time within and between interaction sessions and how it relates to people's perception of the robotic interaction partner. Indeed, if mutual gaze were found to be a meaningful predictor of people's perception of robots, it could be used to track the development of people's mental image of a robot over time. To the best of our knowledge, this approach has not been attempted before.

Research Questions
This work aims to further our understanding of the meaning of gaze in two types of interactions with robots: face-to-face social chats and joint tasks. In the former, we focus on mutual gaze and attempt to understand whether it is related to people's perception of the robot. In the latter, we focus on people's gaze towards the robot and the other objects involved in the game and explore which gaze pattern is related to participants' task performance, involvement with the game, and involvement with the robot. Hence, we pose the following research questions:
RQ1 Is the mutual gaze directed to the robot in a face-to-face social chat a predictor of people's perception of the robot?
RQ2 Which gaze pattern in a joint task is predictive of people's engagement and task performance?
Extant literature has found that gaze towards a robot decreases over the course of an interaction [9,24,35]. Similarly, we attempt to understand whether mutual gaze towards the robot decreases between two equally long social chats occurring before and after a joint task. Moreover, we explore whether it changes over three repeated interaction sessions. This way, we aim to answer the following research question:
RQ3 Does the mutual gaze directed to the robot change between a pre- and post-game face-to-face social chat and across repeated interactions?
Previous research has further found that the amount of gaze directed to a robot in a joint task declines slightly across repeated interactions [2,3,50]. As these works overlooked the gaze participants direct to the other objects involved in the joint task (e.g., tablet and touchscreen), it is difficult to establish whether the decline they observe in gaze towards the robot really reflects a reallocation of attentional resources elsewhere. In this paper, we gauge both the gaze directed to the robot and the gaze directed to the other objects involved in the game (i.e., the tablet and the shared touchscreen) and attempt to understand how gaze as a whole changes over repeated interactions. Thus, we pose the following research question:
RQ4 Do gaze patterns in a joint task change across repeated interactions?
Since it has been shown that a robot's level of humanlikeness affects the amount of gaze it attracts [35,53], in this study we vary the humanlikeness of the robot with which participants interact. This way, we aim to answer the following research questions:
RQ5a Does the level of humanlikeness of the robot affect the amount of mutual gaze directed to it in a face-to-face social chat?
RQ5b Does the level of humanlikeness of the robot affect people's gaze patterns during the joint task?
Unlike previous research, which mainly focused on android robots and compared them with less humanlike robotic platforms [35,53], we keep the robot's embodiment constant across conditions by using a blended embodiment, and we manipulate the humanlikeness of the robot exclusively by changing its facial texture.

Methodology
We designed an experiment involving participants in three interaction sessions (within-subject variable) with a social robot displaying three levels of humanlikeness (between-subject variable): humanlike, mechanical, and a morph between the two (cf. Fig. 1). The interaction sessions had an average of 6.9 days of zero exposure in between (S1-S2: M = 6.76, SD = 1.83; S2-S3: M = 7.05, SD = 2.41). Each session was divided into three phases: (1) a social chat with the robot, (2) a joint task to perform, and (3) a final social chat.

Participants
We recruited 60 participants from an international Master's course in Computer Science at Uppsala University to participate in the experiment. Five participants were excluded because they had previously interacted with the robot, two because they suspected the robot to be remotely controlled, and one because of eye-tracking failures occurring in all three sessions. The remaining 52 participants (M=38; F=13, 1 undisclosed) had an age comprised between 19  The study was approved by the regional ethics board, and participants were compensated with course credits for their time.

Scenario
Our experiment aimed to study people's gaze patterns in a face-to-face interaction and in a joint task. We thus designed a scenario consisting of two distinct parts: a geography-themed collaborative Rapid Dialogue Game (RDG) and a social chat. In the collaborative RDG-Map game, the human and the robot were tasked with identifying as many countries as possible on the world map [40]. Participants had the role of the tutor in this scenario. They saw a map with one country highlighted as the target. Their goal was to verbally describe this country to the robot, which acted as a learner with limited initial knowledge of the world map. Once the robot gained sufficient confidence about the described country, it made a guess and showed it on a shared screen placed between the human and the robot. For each country correctly identified, the team received 2 points if the robot guessed the country on the first try and 1 point if it guessed it only on the second try. The more countries the human-robot team identified within the 10-minute time limit, the higher their score and the larger the robot's knowledge base became. The game score and the remaining time were displayed on the shared screen positioned between the two players.
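The scoring rule of the RDG-Map game reduces to a small function. This sketch assumes, as the description implies, that the robot gets at most two guesses per country and that a country it fails to identify scores nothing; the function names and the encoding of attempts are hypothetical.

```python
from typing import Optional, Sequence


def round_points(correct_attempt: Optional[int]) -> int:
    """Points for one target country: 2 if the robot guessed it on the
    first try, 1 on the second try, 0 otherwise (None = never guessed)."""
    return {1: 2, 2: 1}.get(correct_attempt, 0)


def game_score(attempts: Sequence[Optional[int]]) -> int:
    """Total team score over all countries described within the time limit."""
    return sum(round_points(a) for a in attempts)


# Hypothetical game: first try, second try, missed, first try.
print(game_score([1, 2, None, 1]))  # 5
```

Because the score rewards first-try guesses twice as much, a tutor who gives precise descriptions gains more than one who simply cycles through many countries.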
Before and after the game, the robot engaged the human in a two-minute social chat. The chat's content varied between sessions but not between participants, and involved topics such as favorite games, countries that the human and the robot had visited, and future travel plans. In the second and third sessions, the robot remembered a few countries from the previous game interactions and facts from previous social chats.

Robot Embodiment & Behavior
To alter the anthropomorphic appearance of the robot while limiting confounding factors in the embodiment, we used a Furhat V1 blended robot platform for our experiment [6]. Furhat is a head-only robot with a semi-translucent mask onto which a virtual face is projected from within. Animating the virtual face texture allows the robot to move its mouth in sync with speech, perform facial expressions, and change gaze direction. In addition, the robot's two motors can be used to change the head's pitch and yaw. Taken together, the virtual animations of the face and the physical head movements allow the robot to direct its focus of attention accurately enough for a human interaction partner to detect it [5]. Three different facial textures with varying degrees of anthropomorphic features were used in our experiment (cf. Fig. 1). The humanlike texture was based on a photograph of a human face. Similarly, the mechanical texture used a picture of a mechanical robot's face, with parts such as screws visible in the texture. The morph texture was created by blending the humanlike and mechanical textures, keeping features from both.
Figure 3: Overview of the study procedure over the three interactive sessions. Note that Q2 in S1 measures the perception of the robot after the first impression, while Q2 in S2 and S3 measures the recall of the robot without imminent exposure to it.

The robot's verbal and non-verbal behavior in the interaction sessions was remote-controlled by a researcher, who followed detailed instructions to select the robot's verbal responses from a set of utterances provided by an interface (cf. Fig. 2, top left). The researcher was trained during 50 online sessions to ensure that the robot's behavior was comparable across participants. The gaze behavior of the robot differed between the social chat and the collaborative game. In the social chat, the robot autonomously tracked the participant's head and kept eye contact. In the game, instead, the robot focused its gaze on the shared screen. To make the robot's behavior appear as natural as possible, the human controller occasionally directed the robot's gaze to the bottom left or right to simulate thinking during the social chat. Similarly, in the joint task, the controller directed the robot's gaze towards the human game partner when long periods of silence occurred. This means that, in the game context, the shared screen acted as an object of shared attention and the iPad as an object of exclusive attention for the participants (cf. Fig. 2). It also means that, while participants' gaze towards the robot in the social chat can be considered mutual (i.e., when participants looked at the robot, they made eye contact with it), in the joint task it cannot, as the robot only rarely gazed at the participants.
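The robot's default gaze targets and the controller's occasional overrides described above can be summarized as a small policy function. This is an illustrative sketch only: the phase names and target labels are hypothetical, and in the study the overrides were issued manually by the trained controller rather than by code.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    SOCIAL_CHAT = "social_chat"
    GAME = "game"


def robot_gaze_target(phase: Phase, operator_override: Optional[str] = None) -> str:
    """Return the robot's gaze target: a manual override if one is given,
    otherwise the default target for the current phase."""
    if operator_override is not None:
        # e.g. "bottom_left"/"bottom_right" to simulate thinking in the chat,
        # or "participant" after a long silence in the game
        return operator_override
    if phase is Phase.SOCIAL_CHAT:
        return "participant_head"  # autonomous head tracking keeps eye contact
    return "shared_screen"  # default target during the joint task


print(robot_gaze_target(Phase.GAME))                 # shared_screen
print(robot_gaze_target(Phase.GAME, "participant"))  # participant
```

Keeping the override as a separate argument mirrors the split between autonomous defaults and human-controlled adjustments, and it makes explicit why gaze at the robot counts as mutual only in the chat phase.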

Questionnaires & Recordings
To measure participants' perception of the robot and their engagement with it and with the game, we asked them to complete a series of questionnaires. Before their first interaction with the robot, they filled out a demographic questionnaire (Q1). The second questionnaire (Q2) was used to capture people's perception of the robot. It contained questions about the robot's perceived anthropomorphism (5 items on a 5-point Likert scale from the Godspeed questionnaire, α = .91; [8]), likability and threat (5-point Likert scale; likability: α = .83, perceived threat: α = .89; [49]), as well as its perceived warmth, competence, and discomfort (Robotic Social Attributes Scale; 18 items on a 7-point Likert scale; warmth: α = .92; competence: α = .95; discomfort: α = .90; [12]). In the first session, Q2 was filled out immediately after the social chat with the robot to collect people's first impression of it. In the second and third sessions, it was instead completed before the first social chat to capture participants' recalled perception of the robot before seeing it again (cf. Fig. 3). The final questionnaire (Q3) was filled out after the post-game social chat. It contained the same questions as Q2, but also additional scales to measure participants' involvement with the robot and with the game (User Engagement Questionnaire; 9 items on a 5-point Likert scale; involvement: α = .71; [38]).
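The internal-consistency values (α) reported for each scale are Cronbach's alpha. The standard formula is α = k/(k - 1) · (1 - Σᵢ σ²ᵢ / σ²_X), where k is the number of items, σ²ᵢ the variance of item i across participants, and σ²_X the variance of participants' summed scores. The sketch below implements it with sample variances; the toy scores are made up and unrelated to the study's data.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """items[i][p] is the score of participant p on item i
    (k items, each answered by the same n participants)."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    totals = [sum(scores) for scores in zip(*items)]  # summed score per participant
    item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


# Two hypothetical 5-point items answered by four participants.
print(round(cronbach_alpha([[1, 2, 3, 4], [2, 2, 3, 3]]), 3))  # 0.8
```

Values around .70 or higher, like those reported for the scales above, are conventionally taken as acceptable internal consistency.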
Participants were equipped with Tobii Glasses 2 (cf. Fig. 4), which recorded the experimental session from a first-person view with a full HD wide-angle camera. These glasses also tracked the participants' gaze direction at a sampling rate of 100 Hz. Further processing of the gaze data is described in Section 5. Following the Ethographic and Laban-Inspired Coding System of Engagement (ELICSE) proposed by Perugia et al. [45,46], we focused on three foci of attention in the interaction: the robot, the shared screen, and the tablet, and measured the percentage of time participants gazed at each focus of attention during the different phases of the interaction session (i.e., the social chats and the collaborative game). To ensure that the eye-tracker would not disturb participants during their interactions, we ran a pilot study with 6 participants (3 with and 3 without the eye-tracking glasses). Neither the participants wearing the eye-tracker nor the control group found the recording setup intrusive. Participants' interaction with Furhat was further recorded using a close-range Sennheiser microphone, two webcams, a Kinect, and a RealSense camera. These recordings were used to answer different research questions and are hence not discussed in this paper.
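Because the eye-tracker samples at a fixed rate (100 Hz), the percentage of time spent on each focus of attention amounts to counting labeled gaze samples per phase. A minimal sketch, assuming each sample has already been labeled with a phase and a focus; dropping samples without a valid gaze estimate from the denominator is our assumption, and the labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

FOCI = ("robot", "shared_screen", "tablet")


def gaze_percentages(
    samples: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, Dict[str, float]]:
    """samples: (phase, focus) pairs from a fixed-rate eye-tracker, where focus
    is an object label, 'other', or None for a sample with no valid gaze."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for phase, focus in samples:
        if focus is None:
            continue  # assumption: invalid samples are excluded from the total
        counts[phase][focus] += 1
    result = {}
    for phase, c in counts.items():
        total = sum(c.values())  # valid samples in this phase, incl. 'other'
        result[phase] = {f: 100.0 * c[f] / total for f in FOCI}
    return result


# Hypothetical labeled samples (100 samples = 1 s at 100 Hz).
samples = ([("pre_chat", "robot")] * 300 + [("pre_chat", "other")] * 100
           + [("pre_chat", None)] * 50 + [("game", "robot")] * 200
           + [("game", "shared_screen")] * 500 + [("game", "tablet")] * 300)
print(gaze_percentages(samples)["game"])
# {'robot': 20.0, 'shared_screen': 50.0, 'tablet': 30.0}
```

Whether lost samples should shrink the denominator or count as "elsewhere" is a design choice that can shift the percentages when tracking quality differs between phases.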

Experiment Setup & Procedure
The interaction space was set up with a table on which the shared touch screen, the Furhat robot, and the iPad were placed (cf. Fig. 2). Participants stood on one side of the table. The robot was placed in front of them, roughly at the height of their eyes. The shared screen was positioned between the participant and the robot. A professional lighting system ensured even illumination for the video recordings and visibility of the robot's face texture.
During the first session (S1), the experiment was explained to participants and they were asked to give informed consent. They then filled out Q1 on the iPad while the robot was still covered with a blanket. The researcher leading the experiment removed the blanket from the robot's head before manually starting the interaction. After the two-minute pre-game social chat, the robot asked participants to fill out Q2 on the iPad; it then automatically continued with the map game and the post-game social chat. At the end of the session, the robot prompted participants to respond to Q3. The second (S2) and third (S3) sessions started with the researcher asking participants to fill out Q2 based on their memory of the previous session, before uncovering the robot. The pre-game social chat, the game interaction, and the post-game social chat were then performed without breaks in between. Hence, while Q3 was always filled out at the same time, immediately after the post-game social chat, Q2 was completed after the pre-game social chat in S1 and before it in S2 and S3 (cf. Fig. 3). While participants responded to a questionnaire, the robot displayed idling behavior, looking around the room and away from the human interaction partner. Participants were fully debriefed about the purpose of the study after the entire experiment was completed.

Data Processing
To understand which object participants were focusing on, we developed an object detector for the first-person video stream of the wearable eye-tracker. The object detector was implemented in the open-source neural network framework Darknet, using the real-time object detector YOLOv4 [11]. For the purpose of this study, we used a version of YOLOv4 pre-trained on the MS COCO data set, which contains object categories such as cars, TV screens, and people [32], and added 409 labeled images of tablets and the Furhat robot. The resulting model achieved an mAP of 89.06%, with an AP of 85.07%, 89.35%, and 92.75% for the shared touchscreen, the robot, and the tablet, respectively.
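The reported mAP is consistent with the unweighted mean of the three per-class APs, which is the usual definition of mean average precision:

```latex
\mathrm{mAP} = \frac{1}{3}\left(\mathrm{AP}_{\text{screen}} + \mathrm{AP}_{\text{robot}} + \mathrm{AP}_{\text{tablet}}\right)
= \frac{85.07 + 89.35 + 92.75}{3}\,\% = \frac{267.17}{3}\,\% \approx 89.06\,\%
```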
To run the analyses on gaze, we extracted the percentage of gaze directed to the robot, the screen, and the tablet in each interaction phase. To this end, we compared the gaze coordinates provided by the Tobii eye-tracking system in every frame with the objects detected in the video stream, and labeled each frame as inside the bounding box of the robot, the screen, or the tablet, or as "somewhere else". The Tobii system failed to detect participants' pupils in, on average, 11.55% of the frames (SD = 8.9%); these frames were annotated as "Not applicable". Interaction phases containing more than 50% undetected frames were excluded from the analysis. To correct for inaccuracies due to the inexact positioning of the bounding boxes in the first-person video, we applied a filter to the resulting object annotations. The filtering algorithm detected runs of one or two consecutive frames labeled as outside the bounding box of an object that occurred in the middle of a larger block of frames labeled as inside the bounding box of that object. If the distance between the frames labeled as outside and those labeled as inside the bounding box was lower than or equal to 110.14 pixels (5% of the maximum video distance), we changed the label of the outlier frames to the label of the surrounding block of frames.
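The labeling and smoothing steps above can be sketched as follows. This is a minimal Python illustration under several assumptions that the text leaves open, not the authors' implementation: bounding boxes are axis-aligned `(x_min, y_min, x_max, y_max)` tuples in scene-video pixels; overlapping boxes are resolved by a fixed priority; the "distance" in the filter is taken as the distance from each outlier gaze point to the surrounding object's bounding box; and a "larger block" is read as both neighbouring runs being longer than the outlier run. Taking the maximum video distance as the diagonal of the 1920 × 1080 frame, 5% of it is about 110.145 pixels, which matches the reported 110.14 up to rounding. All function names are hypothetical.

```python
from math import hypot
from typing import Optional

BBox = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Point = tuple[float, float]  # gaze point in scene-video pixels

FOCI = ("robot", "screen", "tablet")
# Assumption: "max. video distance" = diagonal of the 1920x1080 scene video
MAX_DIST_PX = 0.05 * hypot(1920, 1080)  # ~110.145 px; the text reports 110.14


def label_frame(gaze: Optional[Point], boxes: dict[str, BBox]) -> str:
    """Label a frame with the detected object whose box contains the gaze point."""
    if gaze is None:  # pupils not detected in this frame
        return "NA"
    x, y = gaze
    for focus in FOCI:  # assumption: fixed priority if boxes overlap
        box = boxes.get(focus)
        if box is not None and box[0] <= x <= box[2] and box[1] <= y <= box[3]:
            return focus
    return "elsewhere"


def dist_to_box(gaze: Point, box: BBox) -> float:
    """Euclidean distance from a gaze point to a box (0 if the point is inside)."""
    x, y = gaze
    dx = max(box[0] - x, 0.0, x - box[2])
    dy = max(box[1] - y, 0.0, y - box[3])
    return hypot(dx, dy)


def smooth_labels(labels: list[str], gazes: list[Optional[Point]],
                  boxes: list[dict[str, BBox]], max_run: int = 2,
                  max_dist: float = MAX_DIST_PX) -> list[str]:
    """Relabel 1-2 frame blips that interrupt a longer block on the same object."""
    # Collapse the sequence into maximal runs: (label, start, end_exclusive)
    runs, start = [], 0
    for k in range(1, len(labels) + 1):
        if k == len(labels) or labels[k] != labels[start]:
            runs.append((labels[start], start, k))
            start = k
    out = list(labels)
    for (prev, ps, pe), (lab, s, e), (nxt, ns, ne) in zip(runs, runs[1:], runs[2:]):
        blip = e - s
        if lab == "NA" or blip > max_run or prev != nxt or prev not in FOCI:
            continue
        # Assumption: "larger block" = both neighbouring runs are longer than the blip
        if pe - ps <= blip or ne - ns <= blip:
            continue
        # Relabel only if every blip frame lies within max_dist of that object's box
        if all(prev in boxes[f] and dist_to_box(gazes[f], boxes[f][prev]) <= max_dist
               for f in range(s, e)):
            out[s:e] = [prev] * blip
    return out
```

Undetected ("NA") frames are never relabeled here, since without a gaze point no distance can be computed.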
Two annotators manually labeled three of the videos frame by frame using the software ELAN 5.9. The inter-rater agreement between the two annotators, which was calculated on one video, was excellent (κ = .98; [21]). When comparing the automated annotations to the manual ones, the system achieved a similarly excellent average κ of .97.
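The κ values above are chance-corrected agreement scores. Assuming Cohen's kappa computed over the per-frame labels (the text does not name the variant), a minimal sketch of the computation is given below; the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa for two equally long frame-by-frame label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed proportion of frames on which both raters agree
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's marginal label frequencies
    expected = sum(ca[lab] * cb[lab] for lab in ca.keys() | cb.keys()) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout (kappa undefined)
    return (observed - expected) / (1 - expected)
```

Comparing the automated labels with each manually annotated video and averaging the resulting values is one plausible way to obtain an average κ such as the .97 reported above.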

Results
In the following, we use: (i) perception of the robot to refer to the subscales anthropomorphism, perceived threat, likability, warmth, competence, and discomfort; (ii) engagement to refer to the subscales involvement with the game and involvement with the robot; and (iii) task performance to refer to participants' game score. Moreover, for the gaze metrics, we use: (a) mutual gaze to refer to the percentage of gaze directed to the robot during the pre- and post-game social chats; and (b) gaze patterns in the joint task to refer to the percentage of gaze directed to the robot, the screen, and the tablet during the game interaction. All dependent variables used in the statistical analyses were normally distributed and met the assumption of equality of variances.
Table 1: Mean (M) and standard deviation (SD) of the different perceptual dimensions (Q3) per session.
In summary, while extant work showed that agents with ambiguous anthropomorphic cues are perceived as more uncanny (see [22] for a review), we did not find a significant difference in perceived threat and discomfort between the ambiguous (morph) robot and the other two versions of Furhat. However, we did find a significant difference in positive perceptions between the mechanical and the humanlike robot, and between the latter and the morph one. These results show that our manipulation worked, although only partially. The lack of a clear differentiation between the morph and the mechanical robot could be ascribed to the many facial features the two robots had in common. In the future, the humanlike characteristics of the morph robot should be strengthened to increase its recognizability and enhance its ambiguity, and hence its uncanniness.
Engagement and Task Performance. To understand the effects of our study design on engagement and task performance, we conducted a repeated-measures MANOVA with the same independent variables (humanlikeness as between-subjects factor; interaction session as within-subjects factor) and with involvement with the robot, involvement with the game, and task performance (i.e., game score) as dependent variables. Results disclosed a significant main effect of interaction session (F(6, 35) = 9.005, p < .001, ηp² = .607) and a trend-level main effect of humanlikeness (F(6, 78) = 1.936, p = .085, ηp² = .130) on the linear composite of the three dependent variables. No interaction effect between humanlikeness and interaction session was present (F(12, 72) = .866, p = .584, ηp² = .126). The univariate analyses showed a significant main effect of interaction session on task performance (F(2, 80) = 37.208, p < .001, ηp² = .482), but not on involvement with the robot (F(2, 80) = .576, p = .564, ηp² = .014) or with the game (F(2, 80) = 1.606, p = .207, ηp² = .039). In contrast, the tests of between-subjects effects disclosed a significant main effect of humanlikeness on involvement with the robot (F(2, 40) = 3.595, p = .037, ηp² = .152) and a trend-level main effect on involvement with the game (F(2, 40) = 2.759, p = .075, ηp² = .121), but no significant effect on task performance (F(2, 40) = 2.142, p = .131, ηp² = .097).
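The partial eta squared values can be checked against the reported F statistics. Since ηp² = SS_effect / (SS_effect + SS_error) and F = (SS_effect / df_effect) / (SS_error / df_error), it follows that ηp² = F · df_effect / (F · df_effect + df_error). A small Python helper for this check (the name is hypothetical):

```python
def partial_eta_squared(f: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    # eta_p^2 = SS_eff / (SS_eff + SS_err), with SS_eff / SS_err = F * df_eff / df_err
    return f * df_effect / (f * df_effect + df_error)
```

For example, F(2, 80) = 37.208 gives about .482 and F(2, 40) = 3.595 gives about .152, matching the univariate values reported above.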

Post-hoc analyses with a Bonferroni correction disclosed a significant difference in task performance between S1 (M = 22.720, SD = 9.59) and S2 (M = 28.419, SD = 10.15, p < .001), between S2 and S3 (M = 30.558, SD = 10.24, p = .019), and between S1 and S3 (p < .001). Moreover, they showed a significant difference in involvement with the robot between the humanlike (M = 4.255, SD = .519) and the morph robot (M = 3.752, SD = .519, p = .036), but not between the humanlike and the mechanical robot (M = 3.949, SD = .519, p = .351) or between the mechanical and the morph robot (p = 1.00). Similarly, a trend-level difference in involvement with the game was present between the humanlike (M = 4.327, SD = .552) and the morph robot (M = 3.855, SD = .552, p = .077), but not between the humanlike and the mechanical robot (M = 4.051, SD = .552, p = .552) or between the mechanical and the morph robot (p = 1.00). Interestingly, while engagement was higher in the humanlike robot condition than in the morph condition, task performance was higher, albeit not significantly, for the morph robot than for the humanlike robot (cf. Fig. 5).
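Several adjusted p-values above are reported as exactly 1.00. This is what the usual Bonferroni adjustment produces: each raw p-value is multiplied by the number of comparisons (here, three pairwise comparisons among three sessions or three robot conditions) and capped at 1. A minimal sketch, with a hypothetical function name:

```python
def bonferroni(p_values: list[float]) -> list[float]:
    """Bonferroni-adjusted p-values: multiply by the number of comparisons, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With three comparisons, any raw p-value above 1/3 is therefore reported as 1.00.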
Gaze Patterns as Predictors of Engagement and Task Performance (RQ2). To understand whether participants' gaze patterns during the joint task predicted their engagement and task performance, we ran separate regression analyses with the percentage of gaze directed towards the robot, towards the screen, and towards the tablet during the game interaction as predictors, and with involvement with the robot, involvement with the game, and task performance as dependent variables. The percentage of gaze directed to the robot during the game was not a significant predictor of involvement with the robot (β = .021, t(130) = .243, p = .809; r² = .00) or of involvement with the game (β = −.041, t(130) = −.470, p = .639; r² = .002). However, it was a significant negative predictor of task performance (β = −.249, t(130) = −2.917, p = .004; r² = .062), meaning that the more participants looked at the robot during the game, the lower their game score.
The percentage of gaze directed to the screen during the game was not a significant predictor of involvement with the robot (β = .128, t(130) = 1.466, p = .145; r² = .016). However, it was a significant predictor of involvement with the game (β = .192, t(130) = 2.220, p = .028; r² = .037) and especially of task performance (β = .305, t(130) = 3.643, p < .001; r² = .093). This indicates that the more participants focused on the object of shared attention (the screen), the more engaged they were with the game and the higher their game score.
Finally, the percentage of gaze directed to the tablet during the game was not a significant predictor of involvement with the robot (β = −.118, t(130) = −1.351, p = .179; r² = .014), but it was a significant negative predictor of involvement with the game (β = −.172, t(130) = −1.983, p = .049; r² = .030) and of task performance (β = −.248, t(130) = −2.907, p = .004; r² = .061). In contrast to gaze directed to the screen, the more participants looked at the object of exclusive attention (the tablet), the less engaged they were with the game and the lower their game score.
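Each of the regressions above has a single predictor, so the standardized coefficient equals Pearson's r and r² = β². A minimal sketch of these quantities, assuming ordinary least squares on the pooled observations (t(130) implies n − 2 = 130) and treating them as independent, which may not match the authors' exact procedure; the function name is hypothetical:

```python
from math import sqrt
from statistics import mean


def standardized_simple_regression(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Standardized slope, t statistic (df = n - 2) and r^2 for one predictor."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    beta = sxy / sqrt(sxx * syy)  # with one predictor, beta equals Pearson's r
    r2 = beta ** 2
    t = beta * sqrt((n - 2) / (1 - r2))  # t test of the slope against zero
    return beta, t, r2
```

The relation r² = β² matches the pairs reported above, e.g., .305² ≈ .093 and (−.249)² ≈ .062.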
Further univariate tests showed a significant main effect of time on the percentage of gaze directed to the screen (F(2, 60) = 7.143, p = .002, ηp² = .192) and to the tablet (F(2, 60) = 9.666, p < .001, ηp² = .244), but not on the percentage of gaze directed to the robot (F(2, 60) = 2.202, p = .119, ηp² = .068). Post-hoc analyses with a Bonferroni correction revealed a significant decrease in the percentage of gaze directed to the screen in the joint task from S1 (M = .423, SD = .185) to S3 (M = .371, SD = .191, p = .023) and from S2 (M = .425, SD = .190) to S3 (p = .007), but not from S1 to S2 (p = 1.00; cf. Fig. 7). Moreover, they revealed a significant increase in the percentage of gaze directed to the tablet in the joint task from S1 (M = .374, SD = .172) to S3 (M = .458, SD = .183, p = .004) and from S2 (M = .390, SD = .163) to S3 (p = .002), but not from S1 to S2 (p = 1.00; cf. Fig. 7). This suggests that gaze patterns in the joint task changed over time, with a decrease in gaze towards the object of shared attention making room for an increase in gaze towards the object of exclusive attention.
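To make the within-subject test of time concrete, below is a sketch of a classic one-way repeated-measures ANOVA (sphericity assumed, complete cases only). It is a simplification: the univariate tests reported above may have been run within the mixed design that also included humanlikeness as a between-subjects factor, so this version will not necessarily reproduce the reported degrees of freedom. The function name and data layout are hypothetical.

```python
from statistics import mean


def rm_anova_one_way(data: list[list[float]]) -> tuple[float, int, int, float]:
    """One-way repeated-measures ANOVA.

    `data[i][j]` is participant i's value in session j (complete cases only).
    Returns (F, df_effect, df_error, partial eta squared).
    """
    n, k = len(data), len(data[0])
    grand = mean(v for row in data for v in row)
    session_means = [mean(row[j] for row in data) for j in range(k)]
    subject_means = [mean(row) for row in data]
    ss_total = sum((v - grand) ** 2 for row in data for v in row)
    ss_session = n * sum((m - grand) ** 2 for m in session_means)
    ss_subject = k * sum((m - grand) ** 2 for m in subject_means)
    # Residual variability after removing session and between-participant effects
    ss_error = ss_total - ss_session - ss_subject
    df_effect, df_error = k - 1, (k - 1) * (n - 1)
    f = (ss_session / df_effect) / (ss_error / df_error)
    return f, df_effect, df_error, ss_session / (ss_session + ss_error)
```

Removing the between-participant sum of squares from the error term is what distinguishes this test from a between-subjects one-way ANOVA on the same numbers.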

Discussion
Mutual Gaze and Uncanniness (RQ1). In our experiment, mutual gaze in a social chat was a negative predictor of uncanniness (i.e., perceived threat and discomfort) and a positive predictor of likability (i.e., likability and warmth). The negative relation between mutual gaze and uncanniness and the positive relation between mutual gaze and likability lend support to the mutual gaze-liking hypothesis: robots perceived as uncanny seem to elicit gaze aversion, whereas robots perceived as likable attract more gaze. In this sense, our work extends previous findings on the relationship between gaze and uncanniness to face-to-face interactions with robots. Moreover, it suggests that people's mutual gaze in an interaction with a robot can be used as an implicit and continuous measure of the robot's uncanniness and likability. Future work should corroborate this finding in a less exploratory way, by exposing people to robots explicitly manipulated in their level of uncanniness and likability and assessing whether our results still hold.
Figure 7: Development of gaze toward the robot, the screen, and the tablet over the three sessions of the game interaction. S1 = Session 1; S2 = Session 2; S3 = Session 3.
Shared Gaze, Engagement and Task Performance (RQ2). Participants who gazed at the screen longer during the game interaction felt more involved with the game and performed better. As the screen acted as the object of shared attention in this study, these results entail that the more participants shared the focus of their attention with the robot, the more involved they felt with the game and the better they performed. This claim is further supported by the fact that gaze directed towards the object of exclusive attention (i.e., the tablet) negatively predicted involvement with the game and task performance, and gaze directed to the robot negatively predicted participants' performance. Overall, gaze patterns in a joint task predict task performance and involvement with the game: in a joint task involving tangible artifacts (e.g., the screen), shared attention signals higher involvement with the task and predicts better performance. In contrast, gaze directed to the object of exclusive attention (e.g., the tablet) marks lower involvement with the task and poorer task performance, and gaze directed to the robot marks poorer task performance.
In contrast with most HRI literature, which employs gaze towards the robot as one of the core metrics of social engagement in a joint task, we did not find a relationship between gaze towards the robot and participants' perceived involvement with it. This supports our suspicion that, in a joint task, the gaze allocated to the robot is not a precise measure of people's attunement with it, because it is constrained by participants' drive to complete the task. Combining this result with our findings on mutual gaze, we posit that gaze towards the robot does indicate social engagement, but only in interactions that do not involve tangible artifacts, for instance, face-to-face social dialogues. Joint tasks involving tangible artifacts call for the allocation of attentional resources to the object where the activity takes place (in our case, the shared screen) rather than to the agent with which the activity is performed. Hence, we argue that, in these tasks, one can feel involved with the robot at a subjective-experiential level even without overtly expressing this involvement at a behavioral level. Future work should test these preliminary findings in other joint tasks to see whether they still hold.
In this study, we found that: (1) the humanlike robot elicited more engagement than the morph robot; (2) the amount of mutual gaze in the social chat was a negative predictor of uncanniness and a positive predictor of likability; (3) the gaze directed to the robot in the joint task was a negative predictor of task performance; and (4) the percentage of gaze directed to the screen was a positive predictor of task performance. Altogether, this suggests that robots perceived as less likable might be more suitable for joint tasks, as they attract less attention and hence help the player stay focused on the activity. It would be interesting to understand whether a robot's likability could be leveraged to find a trade-off between engagement and task performance in joint activities.
The Development of Mutual Gaze Over Time (RQ3). We found that participants' mutual gaze in the face-to-face conversation with the robot changed over time. It decreased between the pre- and post-game social chats in sessions 1 and 2, but not in session 3. The descriptive statistics reveal that mutual gaze in the third pre- and post-game chats stabilized close to the values of the pre-game conversations of sessions 1 and 2. In line with the questionnaire results on the perception of the robot, which showed that perceived threat and discomfort were the last perceptual dimensions to stabilize over time, we found that mutual gaze, a negative predictor of perceived threat and discomfort, stabilized only in the third interaction session. Consistent with participants' self-reports, which showed that uncanniness decreased over time, we found that mutual gaze, a negative predictor of uncanniness, increased across sessions. These results suggest that mutual gaze can be used to monitor the development of uncanny feelings towards a robot over time.
The decrease in mutual gaze between the pre- and post-game social chats of sessions 1 and 2 might be related to the robot's novelty. Indeed, it suggests that participants look more at a robot when meeting it for the first time and after a period of zero exposure, and that they gaze progressively less at the robot the more familiar they become with it. However, in this study, the reduction in mutual gaze within interaction sessions was accompanied by an increase in mutual gaze across interaction sessions. This makes it challenging to draw conclusions on the role of the robot's novelty in mutual gaze. Future research should specifically investigate how the robot's novelty affects the amount of mutual gaze it attracts, and how the interplay of novelty and uncanniness influences mutual gaze within and between interaction sessions.
The Development of Gaze Patterns in a Joint Task Over Time (RQ4). In contrast to previous work that found a decrease in gaze directed towards the robot over multiple sessions of a joint task [50], in our study the percentage of gaze directed to the robot during the map game was stable. This is in line with the questionnaire results (i.e., involvement with the robot). This result is positive, as it shows that the map game is interesting enough to sustain participants' engagement with the robot over time. As gaze towards the robot during a joint task seems to be a significant predictor of poor task performance, preventing an increase in the attention the robot attracts across repeated interactions is crucial to ensure that the educational game fulfills its pedagogical objectives.
In contrast to gaze towards the robot, the percentage of gaze directed to the screen and to the tablet changed over time, with the former decreasing and the latter increasing over the last two sessions. This result is unexpected: the progressive improvement in task performance across sessions, the positive relationship between gaze towards the screen and task performance, and the negative relationship between gaze towards the tablet and task performance would have suggested the opposite development of gaze patterns over time. We therefore suppose that this change in gaze patterns might capture the slight decrease in involvement with the game shown by the questionnaires, which did not reach significance. However, it might also indicate that, as participants grew more confident with the game and settled on a strategy to score points in the last sessions, they felt more comfortable abandoning the main support tool offered by the game (i.e., the shared screen). Future research should investigate more thoroughly how gaze allocation to the objects of attention in a joint task changes over time, especially as a consequence of participants' progressively increasing task expertise.
The Effect of Humanlikeness on Mutual Gaze in a Social Chat (RQ5a). The three facial textures that we applied to the robot varied in terms of positive perceptions but not in uncanniness. As mutual gaze predicted only two (i.e., likability and warmth) of the four positive perceptual dimensions that varied, the lack of a main effect of humanlikeness on mutual gaze is not surprising. We assume that a less subtle manipulation of humanlikeness would be more likely to influence gaze allocation towards the robot in a social chat, and we strongly advise future research to move in this direction. At the same time, we also recommend keeping the embodiment features of the robot as consistent as possible across conditions to limit the influence of other confounding factors.
The Effect of Humanlikeness on Gaze Patterns in a Joint Task (RQ5b).In contrast with the questionnaire results (i.e., involvement with the robot), we did not find a difference between the humanlike and the morph robot in terms of gaze allocation.This result is particularly interesting as it corroborates our hypothesis that, in joint tasks, the involvement with the robot might be felt at a subjective/experiential level rather than expressed at a behavioral level with gaze.This might be especially true for games with time constraints.Indeed, in this context, the time pressure set by the game and the pace that derives from it might leave little room for participants to focus their gaze on the robot.Future work should further investigate this line of thought by exposing participants to joint tasks differing in time constraint and investigating at which level of time pressure the engagement with the robot ceases to be expressed behaviorally.
Limitations.While we highlight the contribution of the present paper on the usage of gaze as an implicit measure of robot perception and task performance, we also acknowledge a number of limitations.For instance, the manipulation of the robot's humanlikeness in our experiment did not work as expected.Indeed, participants did not perceive the mechanical and the morph robot as differing in anthropomorphism, and Furhat's facial textures did not vary in perceived uncanniness.To overcome this drawback, we plan to add more anthropomorphic features to the morph texture in the future.Another potential limitation of the study lays in the remote-controlled nature of the robot's interactive capabilities.While participants were not aware of the robot being controlled by a human until they were fully debriefed, this might have set wrong (i.e., unrealistically high) expectations on the robot's abilities.We are currently working on a fully autonomous version of the map game, which we plan to deploy in future studies to confirm our findings.Third, although we found a large effect size for all significant analyses, future work would benefit from a larger and more heterogeneous group of participants, both in terms of background and gender.As most of the participants in this study identified themselves as male and came from a computer science background, our results might report the perspective of a limited group of users and thus need replication.Moreover, the study we performed was set in a lab environment, a context that grants a lot of control over confounding variables.Further research should focus on replicating this study in real-life scenarios where the collection of gaze data is more complex and environmental factors, such as light conditions, might intrude first-person object-recognition and hence the automatic annotation of gaze.Finally, albeit the participants involved in the pilot did not perceive the Tobii eye-tracking glasses as intrusive, some might have felt 
uncomfortable wearing them.Further research should hence explore the feasibility of stationary eye-trackers in similar scenarios and compare their accuracy in detecting gaze direction.

Conclusion
In this paper, participants took part in three interaction sessions with a robot varying in humanlikeness. In each session, they played a collaborative game with the robot and engaged in a brief social chat before and after the game. We measured their gaze direction in both types of interaction and used regression analyses to relate it to measures of perception and engagement. Results suggest that mutual gaze towards a robot in a social chat is related to perceptions of uncanniness, and that the gaze directed to the robot in a joint task is a predictor of poor task performance. Moreover, they show that mutual gaze in a social chat changes across repeated interaction sessions, as do participants' gaze patterns in a joint task. These findings are crucial for the field of HRI, as they highlight that gaze can be used as an implicit measure of people's perceptions of robots in face-to-face interactions, and of engagement and task performance in a collaborative game.

Figure 2: Schematics of the experimental setup during the interaction session, including the operator interface (top left), the Tutor's screen on the iPad (bottom left), the eye-tracker recording from the participant's point of view with indicated center of attention and detected objects (top right), and recording from one of the RGB cameras (bottom right).

Figure 4: Participant wearing the Tobii glasses during the interaction session.

Figure 5: Development of involvement with the game, involvement with the robot, and task performance (participants' score at the game) over repeated sessions. S1 = Session 1; S2 = Session 2; S3 = Session 3.

Figure 6: Development of mutual gaze in the pre- and post-game social chats over repeated sessions. On the left, the development of mutual gaze in the pre- and post-game social chats for each level of the robot's humanlikeness. On the right, the overall change. S1 = Session 1; S2 = Session 2; S3 = Session 3.