Pose Generation for Social Robots in Conversational Group Formations

We study two approaches for predicting an appropriate pose for a robot to take part in group formations typical of social human conversations subject to the physical layout of the surrounding environment. One method is model-based and explicitly encodes key geometric aspects of conversational formations. The other method is data-driven. It implicitly models key properties of spatial arrangements using graph neural networks and an adversarial training regimen. We evaluate the proposed approaches through quantitative metrics designed for this problem domain and via a human experiment. Our results suggest that the proposed methods are effective at reasoning about the environment layout and conversational group formations. They can also be used repeatedly to simulate conversational spatial arrangements despite being designed to output a single pose at a time. However, the methods showed different strengths. For example, the geometric approach was more successful at avoiding poses generated in nonfree areas of the environment, but the data-driven method was better at capturing the variability of conversational spatial formations. We discuss ways to address open challenges for the pose generation problem and other interesting avenues for future work.

4. Choose a random location for the group members along the circular formation such that interactants would not be too close to one another.
5. Decide if the group's placement is valid by checking that a number of relevant locations for the group do not fall on occupied spaces of the map. The relevant locations included the midpoints between any pair of group members (so that group members could potentially see each other), midpoints between any person and the center of their circular formation (so that all group members had access to the F-Formation o-space), and locations within a meter around any person in the group (to avoid placing interactants too close to objects).
6. If the group passed the above check, orient the members towards the center of their circular formation and output their poses along with the section of the environment map that surrounds them; otherwise, repeat the above steps until a successful group is created or a maximum number of attempts is reached.
Although the above approach could have been optimized in many ways, it was chosen for its simplicity given that simulated data only needed to be generated once. Example groups generated through this approach can be seen in Figure S1b (first column).
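The placement check in steps 4-5 can be sketched as follows. This is a minimal illustration with NumPy, not the paper's actual implementation; the occupancy-grid representation, map resolution, and helper names are assumptions.

```python
import numpy as np

def is_free(occupancy, xy, resolution=0.1):
    """Return True if world point xy falls on a free cell of the occupancy grid."""
    r, c = int(xy[1] / resolution), int(xy[0] / resolution)
    if 0 <= r < occupancy.shape[0] and 0 <= c < occupancy.shape[1]:
        return occupancy[r, c] == 0  # 0 = free space (assumed convention)
    return False

def group_is_valid(occupancy, members, center, clearance=1.0, n_ring=8):
    """Check the relevant locations from step 5.

    members: (N, 2) array of member positions; center: o-space center.
    """
    pts = []
    for i in range(len(members)):
        # Midpoints between every pair of members (mutual visibility).
        for j in range(i + 1, len(members)):
            pts.append((members[i] + members[j]) / 2)
        # Midpoint between each member and the o-space center.
        pts.append((members[i] + center) / 2)
        # Sample points on a ring of radius `clearance` around each member.
        for a in np.linspace(0, 2 * np.pi, n_ring, endpoint=False):
            pts.append(members[i] + clearance * np.array([np.cos(a), np.sin(a)]))
    return all(is_free(occupancy, p) for p in pts)
```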

Simulated iGibson Dataset
We initially generated 34,405 simulated groups for training on the 15 iGibson environments. Group sizes were distributed as follows: 8445 groups were dyads, 7240 groups were triads, 6611 groups had 4 members, 6063 groups had 5 members, and 6046 groups had 6 members. Because these groups were perfect circular arrangements, we decided to slightly stretch them (horizontally or vertically) and rotate them (along with the environment) to add more variability to the simulated dataset. In particular, we transformed groups with 3 or more members, resulting in 25,960 additional training examples. Figure S1b (second column) shows example transformations applied to simulated groups. The final distribution of simulated groups (including those that were stretched and rotated) by group size is shown in Figure S1c.

Figure S1. (a) The 15 iGibson environments from which we generated simulated groups. (b) Original simulated groups (first column), transformed groups via stretching and rotation (second column), and the same original groups after adding angular noise to the context (third column). (c) Final distribution of simulated groups by group size after data augmentation (stretching and rotations).
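The stretch-and-rotate augmentation can be sketched as below. This is an illustrative NumPy version operating on member positions only; the specific stretch factors and angles are assumptions, and in the actual pipeline the environment map is transformed consistently and members are re-oriented toward the new formation center.

```python
import numpy as np

def stretch_and_rotate(positions, sx=1.2, sy=1.0, angle=np.pi / 6):
    """Stretch a circular group horizontally/vertically, then rotate it
    about the group centroid. positions: (N, 2) array of member locations."""
    c = positions.mean(axis=0)
    p = (positions - c) * np.array([sx, sy])           # anisotropic stretch
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return p @ rot.T + c                               # rotate, restore centroid
```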
As an additional type of data augmentation, we implemented a transformation for the iGibson data which added angular noise to the orientation of the context poses during training of the WGAN. The noise was sampled from a normal distribution with zero mean and a standard deviation corresponding to 20 degrees. Example results from this transformation can be seen in Figure S1b (third column). This transformation was not applied to Cocktail Party data during training because the latter data was already diverse in comparison to the perfect circular arrangements generated on the iGibson environments.
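The angular-noise augmentation amounts to perturbing each context orientation with zero-mean Gaussian noise. A minimal sketch (the wrapping convention is an assumption):

```python
import numpy as np

def add_angular_noise(thetas, std_deg=20.0, rng=None):
    """Perturb context orientations with zero-mean Gaussian noise
    (standard deviation of 20 degrees, as in the paper)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, np.deg2rad(std_deg), size=np.shape(thetas))
    # Wrap back to [-pi, pi) so downstream cos/sin features stay consistent.
    return np.mod(np.asarray(thetas) + noise + np.pi, 2 * np.pi) - np.pi
```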

WGAN ARCHITECTURE
This section details the neural network architectures used for the generator G and discriminator (or critic) D of the proposed WGAN model. Both networks received as input the poses of the people in the context C and a cropped map of the environment around the context. The locations in the context poses were given relative to a coordinate frame whose origin was the average location of the context poses, corresponding to the center of the cropped map. This made the data translation invariant and facilitated training. Also, the generator received as input a latent variable z, and the critic received an additional pose (from the true data distribution or from the generator). The location of the pose input to the critic was in the same coordinate frame as the context locations.
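The translation-invariant encoding described above can be sketched as follows (a minimal illustration; the pose layout `(x, y, theta)` and function names are assumptions):

```python
import numpy as np

def to_local_frame(context_poses, query_pose=None):
    """Express poses relative to the centroid of the context locations.

    context_poses: (N, 3) array of (x, y, theta); the origin of the local
    frame is the mean (x, y), i.e., the center of the cropped map.
    """
    poses = np.asarray(context_poses, dtype=float)
    origin = poses[:, :2].mean(axis=0)
    local = poses.copy()
    local[:, :2] -= origin          # translate; orientations are unchanged
    if query_pose is not None:      # e.g., the extra pose fed to the critic
        q = np.asarray(query_pose, dtype=float).copy()
        q[:2] -= origin
        return local, q
    return local
```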

Generator Network
The generator network first processes its input graph with two GNNs, one in charge of reasoning about spatial-orientational information in the group's context and another one in charge of reasoning about proxemics information. The output of these two GNNs is then processed by a final multi-layered perceptron, as explained in Section 4.3 of the main paper. The sections below provide more implementation details for the generator network.
Spatial-Orientational GNN. The update function φ_v^1(·) described in the main paper is a multi-layer perceptron (MLP) and is implemented as outlined in Table S1 (left). The aggregate function ρ_{v→u}^1(·) is the element-wise maximum.
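The element-wise maximum aggregation can be sketched as below; the point is that the result is invariant to the ordering (and number) of context members, so the network handles variable group sizes. Array shapes are illustrative assumptions.

```python
import numpy as np

def aggregate_max(node_features):
    """Aggregate updated node features with an element-wise maximum.

    node_features: (N, D) array of per-node embeddings produced by the
    node update MLP; returns a single (D,) vector, permutation invariant."""
    return np.max(node_features, axis=0)
```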
Proxemics GNN. The node update function of the Proxemics GNN outputs a 2D tensor with a Gaussian blob at the interactant's location. The blob is generated using a normal distribution N(·; μ, σI) with μ = [x_i, y_i]^T and σ = 0.21 (as used for the personal space loss of the geometric approach). Then, the updated node features are aggregated into a feature v̄ using element-wise summation. Finally, the global attribute of the input graph is updated using u′ = φ_u^2(v̄, u). The function φ_u^2(·) is implemented as a convolutional neural network (CNN) with zero padding, as detailed in Table S2, and with a final flatten layer.

Final Multi-Layer Perceptron. The final multi-layer perceptron of the generator is composed of three fully connected layers, as detailed in Table S3 (left). A hyperbolic tangent transformation is applied to the last two elements of the MLP's 4D output to constrain them to (−1, 1) because they represent the cos(θ) and sin(θ) of the output pose.
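Rendering the Gaussian blob can be sketched as follows (an illustrative NumPy version; the grid indexing and map resolution are assumptions, while σ = 0.21 follows the paper):

```python
import numpy as np

def gaussian_blob(grid_shape, center_xy, sigma=0.21, resolution=0.1):
    """Render a 2D tensor with an isotropic Gaussian blob at a world location.

    Approximates N(.; mu, sigma I) with mu = [x_i, y_i]^T and sigma = 0.21,
    evaluated (unnormalized) at each cell center of the grid.
    """
    h, w = grid_shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs * resolution                     # world x coordinate per cell
    ys = ys * resolution                     # world y coordinate per cell
    d2 = (xs - center_xy[0]) ** 2 + (ys - center_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))  # peak value 1 at the interactant
```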

Critic Network
The critic network is similar to the generator network described previously, except that its input graph has as global attribute the environment map only (without information about a latent variable z) and the critic receives an additional input: a pose from the true data distribution or output by the generator, which is processed in a third parallel stream to the GNNs. The sections below provide more implementation details for each component of the critic network.
Spatial-Orientational GNN. The critic's Spatial-Orientational GNN is the same as for the generator, except that its node update function does not use batch normalization (BN) because BN can make it harder for the critic to converge, as discussed in (Gulrajani et al., 2017). Table S1 (right) details the parameters of the critic's node update function.
Proxemics GNN. The critic's Proxemics GNN is also the same as for the generator, except that the global attribute update function, which is implemented as a CNN, does not use batch norm.
Pose Multi-Layer Perceptron. The pose input to the critic is transformed with a series of fully connected layers, as detailed in Table S4.

ADDITIONAL QUANTITATIVE RESULTS FOR THE DATA-DRIVEN METHOD
In addition to the results presented in the main paper for the WGAN (Section 5), we also studied the performance of other variations of the data-driven model using the proposed quantitative metrics. These variations are described below.

WGAN with increased distribution size. We chose 36 samples for the models that computed distributions in the main paper because this number of samples reasonably covered the area around the context for the geometric approach. However, we were curious about whether more samples could benefit the WGAN and, thus, we evaluated it when running the generator 576 times. The results are presented in Table S5. In comparison to Table 1 in the main paper, the increased distribution size had minimal effect on performance.

Table S5. Results on the Cocktail Party test set with a distribution of 576 samples from which we chose the biggest mode as final output. Each row shows µ ± σ for the metrics described in the main paper (lower is better). "(iG)" models were trained on simulated data using iGibson environment maps, "(CP)" indicates training with Cocktail Party train data, and "(iG,CP)" corresponds to pretraining with simulated data and then finetuning on Cocktail Party train data.

WGAN with combined map for tall and short obstacles. Because the geometric approach only has information about free and occupied space, we tested training the WGAN with a similar configuration. That is, we merged the two channels of the map input to the WGAN, which represented occupancy by tall and short objects, into a single map with occupied and free space information. The results for this test are presented in Table S6. In general, the performance was similar to that of the WGAN that used a two-channel map, as described in the main paper. Thus, we primarily evaluated the WGAN with two-channel maps in this work, as they more explicitly describe obstacles in the environment. Worth noting, though, in some cases the model trained with combined maps and only on simulated groups generated poses outside the input map, resulting in a higher Circ. Fit metric than the results in Table 1.

Table S6. Results on the Cocktail Party test set with combined environment channels. Each row shows µ ± σ for the metrics described in the main paper (lower is better). Models without * output a single pose, whereas those with * output a distribution of 36 poses from which we chose the biggest mode as final output. "(iG‡)" models were trained on simulated data using iGibson environment maps (without data augmentation), "(CP)" indicates training with Cocktail Party train data, and "(iG‡,CP)" corresponds to pretraining with simulated data and then finetuning on Cocktail Party train data.

WGAN with personal space loss. In initial experiments, we also considered a modified version of the WGAN in which the generator was trained with an additional loss component that penalized output poses violating personal space. This component was implemented in the same manner as ℓ_p in eq. (4) in the main paper, so the generator loss for the WGAN was L_G = L_WGAN + λ ℓ_p. We set λ = 0.1 based on validation performance, and obtained the results shown in Table S7 using the original iGibson simulated groups (without data augmentation in the form of stretching, rotations, or angle noise). We found that adding the personal space loss to the generator in some cases reduced violations of intimate space in comparison to not adding the loss and training the model on the iGibson data without augmentation.

Table S7. Results on the Cocktail Party test set. Each row shows µ ± σ for the metrics described in the main paper (lower is better). Models without * output a single pose, whereas those with * output a distribution of 36 poses. The +ℓ_p marker indicates that the WGAN generator was trained with a penalty for violating personal space (i.e., with the personal space loss). "(iG‡)" models were trained on simulated data using iGibson environment maps (without data augmentation), "(CP)" indicates training with Cocktail Party train data, and "(iG‡,CP)" corresponds to pretraining with simulated data and then finetuning on Cocktail Party train data.
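The combined generator objective can be sketched as below. The exact form of ℓ_p is defined in eq. (4) of the main paper; the Gaussian-of-distance penalty here is an illustrative stand-in, and the function names are assumptions.

```python
import numpy as np

def personal_space_penalty(pred_xy, context_xy, sigma=0.21):
    """Illustrative penalty that grows as the generated position approaches
    any context member (a Gaussian of the squared distance; the paper's
    actual personal space loss is given by eq. (4) in the main text)."""
    d2 = np.sum((np.asarray(context_xy) - np.asarray(pred_xy)) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2.0 * sigma ** 2))))

def generator_loss(wgan_loss, pred_xy, context_xy, lam=0.1):
    """L_G = L_WGAN + lambda * l_p, with lambda = 0.1 chosen on validation."""
    return wgan_loss + lam * personal_space_penalty(pred_xy, context_xy)
```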

SURVEY USED FOR THE HUMAN EVALUATION
The human evaluation was carried out using the Qualtrics online survey software. We organized the survey into 4 main sections:

1. Demographics section, e.g., with questions about age, gender, "how often do you play video games?", and "how often do you interact or work with a robot?".

2. Practice section, which showed a robot in two scenes to familiarize the participants with the task of providing In Group ratings. First, the robot was shown using a ground truth pose from the Cocktail Party dataset. Second, it was shown having a bad orientation, as described in the main paper. Figure S2 shows the top-down renderings used for this section of the study. The presentation of the practice scenes within the survey was the same as for the evaluation scenes that followed.

3. Evaluation section, where the participants were asked to rate the pose of the robot in twenty scenes. Half of the scenes had the robot positioned as directed by the model-based approach; the other half used poses output by the data-driven method. The participants did not know which method was used in each rendering. Also, the order of the 20 scenes was randomized per participant to avoid potential ordering effects. An example page of this section of the survey is shown in Figure S3. All the top-down view renderings used in the evaluation are shown in Figures S4, S5, S6, S7 and S8.

Figure S3. Example evaluation page from the survey, showing photos of the Pepper robot, the rendered scenes (the participants could click on the images to display them in the full browser window), and the In Group measure questions (the order of the questions was randomized in each scene).

4. Final feedback section, which asked the participants to answer the question: "If you thought that the survey was difficult to complete for any particular reason, please explain below in detail what kind of difficulties you encountered with the survey." This question helped us refine the presentation of the instructions during pilots.

DETAILED STATISTICS FOR IN GROUP RATINGS
For every scene in the survey used for the human evaluation, the participants provided their agreement with the four statements shown in Figure S3. The statements were: (1) Pepper is too far from the human(s) in the scene to engage naturally in a group conversation with them; (2) Pepper is in a location that makes it look like it is in a group conversation with everybody else in the scene; (3) Pepper is positioned to socially engage with the human(s) in the scene; and (4) Pepper is orienting in an unusual way to be having a conversation with everybody else in the scene. These statements composed the In Group measure described in the main paper. Their means, standard deviations, and correlations are shown in Table S8.

Figure S5. Top-down renderings for a Group Size of 3. The renderings were used in our human evaluation.

Figure S6. Top-down renderings for a Group Size of 4. They were used in our human evaluation, except for the Context #2 prediction by the Geometric* approach (which placed the robot outside of the room).

Figure S7. Top-down renderings for a Group Size of 5. The renderings were used in our human evaluation.

Figure S8. Top-down renderings for a Group Size of 6. The renderings were used in our human evaluation.