Toward an Attentive Robotic Architecture: Learning-Based Mutual Gaze Estimation in Human–Robot Interaction

Social robotics is an emerging field that is expected to grow rapidly in the near future. Indeed, robots increasingly operate in close proximity to humans or even collaborate with them on joint tasks. In this context, how to endow a humanoid robot with the social behavioral skills typical of human–human interactions is still an open problem. Among the many social cues needed to establish natural social attunement, this article reports our research toward a mechanism for estimating gaze direction, focusing in particular on mutual gaze as a fundamental social cue in face-to-face interactions. We propose a learning-based framework to automatically detect eye contact events in online interactions with human partners. The proposed solution achieved high performance both in silico and in experimental scenarios. We expect this work to be the first step toward an attentive architecture able to support scenarios in which robots are perceived as social partners.


PRINCIPAL COMPONENT ANALYSIS
Because the OpenPose keypoints used as the feature vector appeared to be redundant, PCA was performed to reduce their dimensionality. Figure S1(a) shows the cumulative plot of the explained variance of the collected dataset. As the plot shows, only 6 principal components are required to explain more than 95% of the total variance in the data. In Figure S1(b-c), the transformed data are plotted using the first two and the first three principal components, respectively (the eye contact class is shown in green, the no eye contact class in blue). The two learning classes (eye contact, no eye contact) form well-defined clusters. Specifically, the no eye contact class extends mainly along the third principal component, whereas the eye contact class extends along the first two components.
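For illustration, a minimal sketch of this dimensionality-reduction step with scikit-learn is given below. It assumes the OpenPose keypoints have already been flattened into one feature row per frame; the file name and the standardization step are our own choices, not necessarily those of the original pipeline.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical file holding the (n_samples, n_features) matrix of
# flattened OpenPose keypoints, one row per frame.
X = np.load("openpose_features.npy")

# Standardize so that every keypoint coordinate contributes equally (our choice).
X_std = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print("components kept:", pca.n_components_)               # 6 on our dataset
print("cumulative variance:", np.cumsum(pca.explained_variance_ratio_))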

COMPARISON BETWEEN THE SVM AND THE RANDOM FOREST ALGORITHM
In this Section, we compare the support vector machine (SVM) and the random forest algorithms, using as training samples either the raw keypoints from OpenPose or the components extracted with PCA. Both algorithms were trained and tested on the same train and test sets (train set: 19 of the 24 participants; test set: the remaining 5 participants). The train set was augmented by geometrically rotating the input features to the left and to the right, as described in the main article. The performance, measured in terms of accuracy, precision, recall, and F1 score, is reported in Table S1. As in Section 5 of the main article, we report the mean and standard deviation over k = 5 random splits of the dataset. The SVM classifier performs better than the random forest, reaching values around 90% for all metrics. Furthermore, the SVM without PCA yields mean values with narrower standard deviations than those obtained after the PCA.
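A minimal sketch of this comparison with scikit-learn is given below, assuming the participant-wise split has already been prepared; the hyperparameters shown are illustrative and not necessarily those used in the article.

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def compare_classifiers(X_train, y_train, X_test, y_test):
    """Train an SVM and a random forest on the same participant-wise split
    and return the four metrics reported in Table S1 for each classifier."""
    results = {}
    for name, clf in [("svm", SVC(kernel="rbf")),
                      ("random_forest", RandomForestClassifier(n_estimators=100))]:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred),
        }
    return results

# Usage: X_* are either raw OpenPose keypoints or their PCA components,
# y_* are binary labels (1 = eye contact, 0 = no eye contact).
# metrics = compare_classifiers(X_train, y_train, X_test, y_test)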

COMPARISON BETWEEN THE ICUB'S CAMERA AND THE INTEL REALSENSE CAMERA
To determine the impact of a higher level of detail in the input image, we compared the results obtained using the datasets collected with the iCub's camera (Dragonfly 2) and with the higher-quality Intel RealSense D435i camera. Results are reported in Table S2; they were evaluated considering the mean and standard deviation of the metrics over k = 5 random splits of the dataset. The SVM was trained on each dataset individually and tested on the test set collected with the iCub's camera. This choice was motivated by the fact that, during deployment, the algorithm must take its input only from the Dragonfly 2, to avoid external hardware that may influence human behaviour during the interaction with the robot.
Despite the higher quality of the images coming from the Intel RealSense, the classifier trained on the dataset collected with the Dragonfly 2 performs better, especially on the precision and F1-score metrics, where it also reports narrow standard deviations. For the sake of completeness, the algorithm was also tested on the test set collected with the RealSense, both when trained on the RealSense train set and when trained on the iCub one. The performance is reported in Table S3. We observe that the performance obtained using images from the RealSense camera (both for training and testing) is slightly better. That is reasonable given the quality of the camera compared with the one embedded in the iCub's eye. Nevertheless, the improvement is relatively minor and does not justify the need for additional hardware.
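A minimal sketch of this cross-camera evaluation protocol is given below, assuming the k = 5 random participant splits have already been prepared as (train, test) pairs; the function name and the choice of the F1 score as the single reported metric are ours for brevity.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def cross_camera_f1(splits):
    """Train on one camera's data and test on the other's for each random split,
    returning the mean and standard deviation of the F1 score (as in Tables S2-S3).
    `splits` is a list of (X_train, y_train, X_test, y_test) tuples, e.g. with
    X_train taken from the RealSense dataset and X_test from the Dragonfly 2."""
    scores = []
    for X_tr, y_tr, X_te, y_te in splits:
        clf = SVC().fit(X_tr, y_tr)
        scores.append(f1_score(y_te, clf.predict(X_te)))
    return float(np.mean(scores)), float(np.std(scores))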

FURTHER DETAILS ON THE BENCHMARK
As discussed in the main article, we compared the presented approach with the state-of-the-art method proposed in Chong et al. (2020). In this Section, we report further details on how the two algorithms were compared. Briefly, for the baseline, the bounding box of the human face in the picture is first detected by means of dlib, an open-source C++ toolkit containing machine learning algorithms, and the cropped image is then sent to a convolutional neural network that estimates eye contact events. The bounding box detection based on dlib failed in 33% of the cases, probably due to the presence of face masks in the dataset.
To make a fair comparison, we replaced dlib with an algorithm that detects the bounding box of the face based on the OpenPose keypoints. For each frame, the baseline outputs a score s ranging from 0 (no eye contact) to 1 (eye contact). When evaluating the metrics, we considered the baseline's prediction as eye contact if s ≥ 0.5 (and as no eye contact otherwise). The baseline output is depicted in Figure S2.
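A minimal sketch of these two steps is given below: deriving a face bounding box from the OpenPose facial keypoints and binarizing the baseline score. The margin and confidence threshold are illustrative values, not necessarily those used in the article.

import numpy as np

def face_bbox_from_openpose(face_keypoints, margin=0.2, min_conf=0.1):
    """Derive a face bounding box from OpenPose facial keypoints.
    face_keypoints: (N, 3) array of (x, y, confidence); points below
    min_conf are discarded, and margin enlarges the box by a fraction
    of its width/height (both values are assumptions)."""
    valid = face_keypoints[face_keypoints[:, 2] > min_conf]
    x_min, y_min = valid[:, :2].min(axis=0)
    x_max, y_max = valid[:, :2].max(axis=0)
    dx, dy = margin * (x_max - x_min), margin * (y_max - y_min)
    return (int(x_min - dx), int(y_min - dy), int(x_max + dx), int(y_max + dy))

def baseline_label(score, threshold=0.5):
    """Binarize the baseline score s in [0, 1]: 1 (eye contact) iff s >= 0.5."""
    return int(score >= threshold)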

DISCUSSION ON THE CONFIDENCE LEVEL REPORTED BY THE MUTUAL GAZE CLASSIFIER
In this Section, we analyse the confidence level provided by the mutual gaze classifier when used in a natural scenario. To this aim, we recorded the confidence scores for the video attached to the article as supplementary material. In Figure S3 we plot the confidence level frame-by-frame (from second 5 to second 61 of the video), marking the eye contact prediction with a solid blue line and the no eye contact prediction with a solid red line. We notice sharp drops in the confidence level during the switches between the eye contact and no eye contact conditions. Specifically:
• the first switch (head oriented from frontal to right) occurs from frame 22 to frame 31 (seconds 7−9 of the video);
• the second switch (head oriented from down to frontal) involves frames 115 to 122 (seconds 18−20 of the video);
• from frame 145 to frame 220 the confidence level shows an oscillating behaviour (seconds 22−32). This corresponds to the sequence of frames in which the participant keeps eye contact with the iCub while moving their torso to the left and right. The red segments in this interval occur when the participant has their torso completely rotated to the right or to the left. We can observe that, even if some frames were wrongly classified as no eye contact (frames 160−176 and 203−206), the classifier reported a lower confidence level compared with the no eye contact condition, where it assumed values close to 1 (from frame 30 to frame 115);
• from frame 225 to frame 395 the confidence level keeps the oscillating behaviour (seconds 30−54). This second oscillation corresponds to the sequence of frames in which the participant rotates their torso and then moves their head in order to establish eye contact with the iCub. Also in this case, the red segments occur when the participant has their torso completely rotated to the right or to the left;
• the last switch occurs from frame 385 to frame 398 (seconds 53−55), when the participant returns to a frontal position.
As a general comment, the confidence scores during the transitions tend to be lower (< 0.9) than those at steady state (> 0.9). Example transition sequences are shown in Figure S4.
Figure S3. Confidence level (y-axis) registered for the video attached as supplementary material, reported frame-by-frame (x-axis). The solid blue line refers to the eye contact condition, whereas the solid red line refers to the no eye contact condition.
Figure S4. Transition scenarios. Three sequences of frames representing different transition scenarios: from frontal to right, from down to frontal, and from frontal to left while keeping eye contact with the iCub. The prediction with its confidence score is shown in blue on each frame.
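For reference, a frame-by-frame confidence trace like the one in Figure S3 can be produced with a short script such as the sketch below, assuming the per-frame predictions and confidence scores are logged to NumPy arrays; the file names are hypothetical placeholders for the actual logs.

import numpy as np
import matplotlib.pyplot as plt

# Per-frame logs produced during the interaction (hypothetical file names).
conf = np.load("confidence_log.npy")   # confidence in [0, 1] for each frame
pred = np.load("prediction_log.npy")   # 1 = eye contact, 0 = no eye contact

frames = np.arange(len(conf))
# Blue segments for eye contact, red for no eye contact, as in Figure S3.
plt.plot(frames, np.where(pred == 1, conf, np.nan), "b-", label="eye contact")
plt.plot(frames, np.where(pred == 0, conf, np.nan), "r-", label="no eye contact")
plt.xlabel("frame")
plt.ylabel("confidence level")
plt.legend()
plt.show()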