O2A: One-shot Observational learning with Action vectors

We present O2A, a novel method for learning to perform robotic manipulation tasks from a single (one-shot) third-person demonstration video. To our knowledge, this is the first time such learning has been achieved from a single demonstration. The key novelty lies in pre-training a feature extractor that produces a perceptual representation of actions that we call 'action vectors'. The action vectors are extracted using a 3D-CNN model pre-trained as an action classifier on a generic action dataset. The distance between the action vectors from the observed third-person demonstration and from trial robot executions is used as a reward for reinforcement learning of the demonstrated task. We report on experiments in simulation and on a real robot, with changes in the viewpoint of observation, the properties of the objects involved, the scene background and the morphology of the manipulator between the demonstration and the learning domains. O2A outperforms baseline approaches under different domain shifts and performs comparably to an oracle that uses an ideal reward function.


Introduction
Learning new manipulation tasks has always been challenging for robotic systems, whether a simple mobile manipulator or a complex humanoid robot. Programming the robot manually, step by step [1], is one of the earliest solutions to this problem, but it is labour intensive, requires specialist expertise and lacks autonomy, making it unsuitable for consumer robots and fully autonomous systems. Learning from Demonstrations (LfD) [2] is a potential solution: it requires only demonstrations of the task for the robot to learn from. Even though LfD has been studied widely, most previous works have stayed within the 'Imitation Learning' [3][4][5][6] paradigm, where demonstrations are made from an egocentric perspective, either visually or kinaesthetically. This requires the inconvenience of kinaesthetic guidance or teleoperation, and the rich source of third-person demonstrations available on the internet cannot be used. Therefore, in this paper we study the problem of LfD under the 'Observational Learning' [7][8][9][10] paradigm, where the demonstrations are observed from a third-person perspective. This introduces the key challenge in observational learning: the shift between the demonstration and the learning domains. The domain shift can arise from changes in the viewpoint of observation, the properties of the objects used, the scene background or the morphology of the manipulator performing the task.

Figure 1: Overview of the O2A method. A 3D-CNN action vector extractor is used to extract action vectors $X_D$ and $X_R$ from video clips of the demonstration and a robot trial execution respectively. A reward function compares $X_D$ and $X_R$ in the action vector space, generating a reward signal $r$ based on their closeness. A reinforcement learning algorithm then iteratively learns an optimal control policy by maximizing this reward signal, thus enabling observational learning.
In this paper we present O2A (One-shot Observational learning with Action vectors), a method for one-shot observational learning of robotic manipulation tasks under different domain shifts. One-shot learning here means that only a single demonstration of the new task is required for learning (note that this does not refer to the number of trial-and-error executions by the robot while learning from that single demonstration). We use an abstract perceptual representation, the 'action vector', which is a task-aware and domain-invariant representation of the action in a video. The action vector is extracted using a 3D-CNN [11], pre-trained as an action classifier on a generic action dataset (we use UCF101 [12] as the pre-training dataset for our experiments). Through our evaluation on a new 'Leeds Manipulation Dataset' (LMD), we show that the pre-trained action vector extractor can generalise to unseen manipulation tasks. The action vectors from the demonstration and robot trial execution video clips are then compared to generate a reward for a reinforcement learning algorithm, which learns an optimal control policy that performs the demonstrated task. Our experiments in simulation (with reaching and pushing tasks) and on a real robot (with pushing, hammering, sweeping and striking tasks) show that O2A performs well under different domain shifts. Our contributions can be summarised as follows:
• We implement, for the first time, a method for observational learning of robotic manipulation tasks from a single demonstration.
• O2A can handle shifts between the demonstration and the learning domains caused by changes in the viewpoint of observation, object properties, the morphology of the manipulator and the scene background.
• We pre-train the action vector extractor on a generic action dataset instead of task-specific manipulation videos. The extractor generalises to unseen manipulation tasks by learning the shared underlying visual dynamics.
The remainder of the paper is organised as follows: Section 2 discusses related work, Section 3 formulates the problem and describes the proposed method, Sections 4 and 5 report on the experiments conducted, and Section 6 presents conclusions.

Related Work
Observational learning: The origins of observational learning of robotic manipulation tasks can be traced back to works from the 1990s [13][14][15]. Most early methods required assistance in observing the demonstrations, provided by motion capture systems [16][17][18], visual detectors [19][20][21], trackers [22][23][24] or a combination of these [13]. However, the entities to be tracked or detected must be known beforehand, and only demonstrations using these entities can be learned.
With the advent of deep learning [25,26], it became possible to learn visual features characterising the task directly from raw RGB videos. These features are extracted using a variety of methods: deep metric learning [27], generative adversarial learning [28], domain translation [29][30][31], transfer learning [32,33], action primitives [34], predictive modelling [35], video-to-text translation [36] and meta-learning [37,38]. A detailed comparison of these methods is given in the supplementary material, Section A.
These methods have two main limitations. (1) A large number of demonstrations is required to learn a new task: the feature extractors are trained separately for each new task, so demonstration videos must be collected in substantial numbers per task. In contrast, our method requires only a single demonstration (hence one-shot) to learn a new task, since pre-trained feature extractors are used. (2) Constrained domain shifts: existing approaches make assumptions about the shift between the learning and demonstration domains; for example, the viewpoint of observation is fixed [33] or manipulators with similar morphologies [29] are used. O2A does not make any such assumptions and can learn under unconstrained domain shifts.
Pre-training with large generic datasets: Pre-training on large generic datasets has become common in computer vision and natural language processing. Models are first pre-trained on a large generic dataset, in a supervised or unsupervised manner, and then used to solve downstream tasks with minimal or no fine-tuning. Generic language models such as ELMo [39], GPT [40][41][42] and BERT [43] have shown success on several downstream language processing tasks. Similarly, ImageNet models [44], Image-GPT [45] and BiT models [46] have demonstrated that this approach applies to computer vision problems as well. We introduce a similar concept into visual robotic manipulation: action vector extractors are pre-trained on a large generic action dataset and then generalised to manipulation tasks for observational learning.
Mirror Neurons: Neuroscience studies [47][48][49][50] show the presence of 'mirror neurons' in humans, which produce task-aware and domain-invariant representations of observed actions. They are used both for action recognition (perception) and for observational learning (action execution). Inspired by this dual role of mirror neurons, we pre-train the action vector extractor as an action classifier: pre-training for action classification (recognition) teaches the model to extract task-aware and domain-invariant representations of actions from input videos.

Action vectors
Action vectors are the core of the O2A method. An action vector is an abstract, task-aware and domain-invariant perceptual representation of the action being carried out in a video. In O2A, action vector extraction is based on the following two assumptions: (1) The spatio-temporal features generated by the final layers of an action classifier pre-trained on a generic action dataset are domain-invariant and task-aware; features from videos depicting similar actions should be similar irrespective of the domain in which they are recorded. This assumption is reasonable since the action classifier uses these same layer outputs to identify actions independently of camera angle, scene background, illumination conditions, actors/manipulators, object appearance, interactions, pose and scale.
(2) An action vector extraction model pre-trained on a generic dataset can generalise to the unseen manipulation tasks used in robotic observational learning. The intuition is that the underlying visual dynamics of generic action datasets and manipulation tasks are the same: the same physical laws govern object interactions, whether for a cricket shot or for a robot striking cubes.
Section 4 shows results that validate these critical assumptions.

Network architecture and dataset
Our 3D-CNN model consists of eight 3D convolutional layers, five 3D max-pooling layers and three fully connected layers. The ReLU [51] activation function is used for all convolutional and fully connected layers except the final layer, where a linear activation followed by a Softmax is used. The layer-wise network architecture, along with the kernel sizes and input and output dimensions, is given in the supplementary material, Section B. We use the UCF101 action dataset as the generic dataset for our experiments. It consists of 13,320 real-world action videos from YouTube, each lasting around 7 seconds on average, classified into 101 action categories. The dataset is highly diverse, both in the variety of actions and in the domain settings within each class.
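For concreteness, the following is a minimal PyTorch sketch consistent with this description, in the style of C3D [11]; the channel widths and pooling configuration follow the standard C3D layout and are our assumptions, since the exact dimensions are deferred to the supplementary material.

```python
import torch
import torch.nn as nn

class C3D(nn.Module):
    """3D-CNN action classifier: 8 conv, 5 max-pooling and 3 fully connected layers."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                      # pool1: keep time, halve space
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                              # pool2
            nn.Conv3d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv3d(256, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                              # pool3
            nn.Conv3d(256, 512, 3, padding=1), nn.ReLU(),
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                              # pool4
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(),
            nn.Conv3d(512, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2, padding=(0, 1, 1)),           # pool5 -> 512 x 1 x 4 x 4 = 8192
        )
        self.fc6 = nn.Linear(8192, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, num_classes)           # linear; Softmax applied at inference

    def forward(self, x):                                 # x: (batch, 3, 16, 112, 112)
        h = self.features(x).flatten(1)                   # pool5 features, size 8192
        h = torch.relu(self.fc6(h))                       # fc6 features, size 4096
        h = torch.relu(self.fc7(h))
        return self.fc8(h)
```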

Pre-training action vector extractors
For pre-training, we first uniformly downsample the UCF101 videos in time to 16 frames, providing a fixed-length representation for each video clip, and resize them to 112 × 112 pixels to standardise the size. The same pre-processing steps are applied to the videos of demonstrations and robot trial executions during observational learning. The downsampled and resized videos are then used to train the model for action classification from scratch; training details are given in the supplementary material, Section B. The trained model is referred to as 'NN:UCF101' hereafter.
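A minimal sketch of this pre-processing, assuming OpenCV for video decoding; channel ordering and any pixel normalisation are details we leave as assumptions:

```python
import cv2
import numpy as np

def preprocess_clip(video_path, num_frames=16, size=112):
    """Uniformly downsample a video in time to num_frames and resize to size x size."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (size, size)))   # OpenCV decodes frames as BGR
    cap.release()
    # Indices spread uniformly over the whole video (fixed-length representation).
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    clip = np.stack([frames[i] for i in idx]).astype(np.float32)
    return clip.transpose(3, 0, 1, 2)                    # (channels, time, height, width)
```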
After training, we use the features from one of the final layers of NN:UCF101 as the action vector. Our experiment (reported in the supplementary material, Section C) shows that the features from the pool5 (size: 8192) and fc6 (size: 4096) layers are best suited for use as the action vector. In this paper we report results using features from both the pool5 and fc6 layers.
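Building on the two sketches above, an action vector can then be read off the pool5 or fc6 layer of the pre-trained model. The helper below is illustrative rather than the authors' code:

```python
import torch

def action_vector(model, clip, layer="pool5"):
    """Extract an action vector (pool5 or fc6 features) from a preprocessed clip."""
    x = torch.from_numpy(clip).unsqueeze(0)              # add a batch dimension
    with torch.no_grad():
        pool5 = model.features(x).flatten(1)             # size 8192
        if layer == "pool5":
            return pool5.squeeze(0).numpy()
        return torch.relu(model.fc6(pool5)).squeeze(0).numpy()   # fc6, size 4096
```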

One-Shot observational learning
An overview of O2A is shown in Fig. 1. The robot views both the demonstration and its own trial executions from a camera mounted in a fixed position above the manipulator. With reference to Fig. 1, let D be the single demonstration video clip of a task to be learned. We extract the n-dimensional action vectors $X_D$ and $X_R$ from the demonstration video D and from the video clip of a robot trial execution respectively. The reward $r$ for reinforcement learning is then calculated as the negative of the Euclidean distance between the action vectors $X_D$ and $X_R$:

$r = -\lVert X_D - X_R \rVert_2$

Thus the reward directly measures the closeness between the action in the demonstration and that in the robot trial execution.
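A minimal sketch of this reward computation:

```python
import numpy as np

def reward(x_d, x_r):
    """Negative Euclidean distance between demonstration and trial action vectors."""
    return -float(np.linalg.norm(x_d - x_r))
```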
Reinforcement learning then maximises this reward to learn an optimal control policy, which enables the robotic manipulator to perform the demonstrated task.
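To illustrate how this episodic reward plugs into trial-and-error learning, the sketch below rolls out one trial, records its video and scores it against the demonstration. The gym-style environment interface and the preprocess_frames helper (an in-memory counterpart of the earlier pre-processing function) are hypothetical:

```python
def evaluate_trial(env, policy, model, x_demo, num_steps):
    """Roll out one robot trial and score it against the demonstration.

    Hypothetical gym-style interface: the perceptual reward is computed
    once per episode, from the video of the whole trial execution,
    rather than at every time step.
    """
    frames = []
    obs = env.reset()
    for _ in range(num_steps):
        obs, _, done, _ = env.step(policy(obs))
        frames.append(env.render(mode="rgb_array"))
        if done:
            break
    clip = preprocess_frames(frames)        # hypothetical in-memory variant of preprocess_clip
    x_trial = action_vector(model, clip)    # pool5 (or fc6) features, as above
    return reward(x_demo, x_trial)
```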

Reinforcement learning of the task
Any reinforcement learning algorithm can be used with our method. In the simulation experiment, we use Deep Deterministic Policy Gradient (DDPG) [52] to estimate the optimal control policy. The states used by the control policy are instantaneous visual observations of the environment (as observed by the robotic system). We use a VGGNet pre-trained on ImageNet [53] to convert raw RGB images into visual state features: the 4608-dimensional feature from the last convolutional layer of the VGG-16 network is used as the instantaneous state representation. Reinforcement learning on real robots is an active area of research and remains challenging, so for the real robot experiment we use a manipulation planning algorithm, Stochastic Trajectory Optimisation (STO) [54,55]. STO generates an optimal control sequence by iteratively improving on the previous sequence, guided by our reward function. The cost function to be minimised is $C = r^2$.
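A sketch of the state featuriser and the STO cost, assuming observations are resized to 112 × 112 before the VGG-16 convolutional stack (which is what yields a 512 × 3 × 3 = 4608-dimensional feature; the input resolution is our assumption):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(pretrained=True).features.eval()   # convolutional layers only
to_input = T.Compose([T.ToPILImage(), T.Resize((112, 112)), T.ToTensor()])

def state_features(rgb_image):
    """Map a raw RGB frame to a 4608-dim state vector (512 x 3 x 3 conv output)."""
    with torch.no_grad():
        x = to_input(rgb_image).unsqueeze(0)          # (1, 3, 112, 112)
        return vgg(x).flatten(1).squeeze(0).numpy()   # (4608,)

def sto_cost(r):
    """Cost minimised by STO: C = r^2 (the squared distance, since r <= 0)."""
    return r ** 2
```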

Action vector analysis
In this section, we validate the assumptions behind the proposed action vector extraction method explained in Section 3.1. First, we collect a manipulation task dataset, the 'Leeds Manipulation Dataset' (LMD). Note that this dataset is used only for evaluation, not for training the action vector extractor. LMD consists of videos of three different manipulation tasks: reach, push and reach-push, examples of which are shown in Figure 2. The task videos are collected both with a human hand directly and with tools resembling robotic manipulators/end effectors. Each class consists of 17 videos, with variations in viewpoint, object properties, scene background and manipulator morphology within each class. Note that similar-looking task classes were carefully selected, and the same set of objects and manipulators was used across tasks when collecting the videos. These choices are deliberate, to make task differentiation more challenging: under these circumstances, only an effective action vector extractor can produce task-aware and domain-invariant action vectors for the different task classes in LMD.

Class similarity scores
Here we calculate the inter-class and intra-class similarity scores for the different classes of LMD in the action vector space. We extract action vectors from the pool5 and fc6 layers of the NN:UCF101 model for all 51 videos in LMD. Baseline-R is obtained using features from the pool5 layer of the same NN:UCF101 architecture, but initialised with random weights. The similarity score between a pair of action vectors is computed as the cosine of the angle between them; scores are bounded by [−1, 1], with −1 indicating diametrically opposite vectors and 1 indicating coinciding vectors.
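A minimal sketch of how these per-class average similarities can be computed, assuming the action vectors are stacked row-wise in a NumPy array:

```python
import numpy as np

def similarity_matrix(vectors, labels, classes):
    """Average cosine similarity between action vectors, for every pair of classes."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                                 # pairwise cosine similarities
    labels = np.asarray(labels)
    out = np.zeros((len(classes), len(classes)))
    for i, a in enumerate(classes):
        for j, b in enumerate(classes):
            mask = np.outer(labels == a, labels == b)
            if i == j:
                np.fill_diagonal(mask, False)            # exclude self-similarity
            out[i, j] = sims[mask].mean()
    return out
```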
The results are tabulated in Table 1. For each chosen feature layer, the diagonal values are the average similarity scores between pairs of action vectors from the same class, and the off-diagonal values are the average similarity scores between pairs of action vectors from different classes. The diagonal values are greater than the rest, indicating adequate task-awareness and domain-invariance of the extracted action vectors. The only exception is for layer fc6, where the inter-class similarity score between the reach and push classes is greater than the intra-class similarity score for the reach class. Given that the two tasks are extremely similar, these results are promising.
We also visualise the action vectors from LMD projected into 2D using PCA, shown in Figure 3. The clustering of action vectors from the same classes, compared to Baseline-R, is evident, further indicating the domain-invariance and task-awareness of our action vectors. Note that this visualisation collapses vectors of much greater dimension into a 2D space, which may cause some 'artificial' overlaps.
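A sketch of this projection, assuming scikit-learn and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_action_vectors_2d(vectors, labels):
    """Project action vectors to 2D with PCA and scatter-plot them by class."""
    points = PCA(n_components=2).fit_transform(vectors)
    for cls in sorted(set(labels)):
        mask = np.asarray(labels) == cls
        plt.scatter(points[mask, 0], points[mask, 1], label=cls)
    plt.legend()
    plt.show()
```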
The class similarity scores and the visualisation show that our pre-trained action vector extractor can generalise to unseen manipulation tasks. In the next sections we show how the action vector is used for observational learning and how well O2A performs under different domain shifts.

Robotic experiments
To explore the resilience of our method to shifts between the demonstration and learning domains, we conducted experiments with six different domain shifts, defined in Table 2. The tasks are reaching and pushing in simulation, and pushing, hammering, sweeping and striking in the real robot experiment. Task definitions and completion measures are given in Table 3. Note that the task completion measures are used only for evaluating the performance of O2A, not during learning.

Simulation experiment
We set up the simulation learning domain with a 3-DOF robotic manipulator for reaching and pushing, using OpenAI Gym [56] and the MuJoCo physics engine [57]. In each setup (characterising a domain shift), we collect a single demonstration in the real world and run the DDPG algorithm 10 times. Each run has 20 episodes, with 60 and 160 steps per episode for reaching and pushing respectively. The network architectures and hyperparameters used by DDPG are given in the supplementary material, Section D. Each run returns the control policy corresponding to the maximum reward obtained. After training, we pick the top-2 [58] control policies with the highest rewards and calculate their task completion measures; the top policies are selected to prevent poorly performing runs from affecting the overall performance. The control policy outputs three robotic controls, one for each joint; these could be torques, joint positions or velocities of the manipulator, and in our experiment we use joint positions. We perform the experiment with action vectors extracted from both the pool5 and fc6 layers of the NN:UCF101 model. Figure 4 shows snapshots of the demonstration and of the execution of the corresponding learned policy for selected setups. Videos of the simulation experiment results, including the demonstrations, are available on the project website.
We compare our method with an oracle and two baseline approaches. The oracle is trained using the corresponding task completion measure from Table 3 as the reward, in place of a reward derived from action vectors; it represents an upper bound on performance. The two baselines represent a video clip by averaging a 'static' representation of each frame, in contrast to the spatio-temporal representation used in O2A, and rewards are then generated from these representations. Baseline-1 uses features from the output of the last convolutional layer of the ImageNet-pre-trained [53] VGG-16 network, and Baseline-2 uses HOG [59] features. The averages of the task completion measures for the top two control policies for the oracle, O2A and the baselines are plotted in Figure 5. The policies learned by O2A successfully performed the demonstrated tasks under different domain shifts with good task completion measures. O2A also significantly outperforms both baselines and performs comparably to the oracle. Additionally, we analysed the quality of the rewards generated by O2A, the baselines and the oracle using the correlation of rewards; the results are reported in the supplementary material, Section D.

Trajectory Maps
Here we plot the trajectories followed by the robotic manipulator in each episode during reinforcement learning of the task. This visualisation helps us understand whether high rewards are obtained for desirable trajectories while learning the demonstrated task. The trajectories are coloured on a scale corresponding to the normalised reward values obtained during task learning.
We also show results when the O2A action vector extractors are pre-trained on a manipulation task dataset, the Multiple Interactions Made Easy (MIME) dataset [32]. MIME consists of 8260 videos of 20 common robotic manipulation tasks, executed by a human as well as by a Baxter robot. This model is referred to as 'NN:MIME'. The aim is to study how well O2A performs when pre-trained on a task-specific manipulation dataset compared to a generic dataset. Relevant results are shown in Figure 6; the rest can be found in the supplementary material, Section D.
The results indicate that, when NN:UCF101 is used, high rewards are generated for the desired trajectories under all domain shifts. However, NN:MIME performs poorly under changes in viewpoint and in the manipulator used. One insight is that, even though the MIME dataset contains a large number of manipulation task examples, its variation in viewpoints and manipulators is limited, whereas UCF101 contains examples with an extensive range of variations in domain settings such as viewpoint and manipulator morphology.

Real robot experiment
For the real robot experiment, we use a 6-DOF UR5 robotic arm with different end-effectors suitable for each task. All six domain shifts (see Table 2) are used for the pushing and hammering tasks, whereas only three domain shifts (I, V and M) are used for the sweeping and striking tasks, since the others are not meaningful for these tasks. We use only features from the pool5 layer of the NN:UCF101 model as the action vector, due to the high cost of running the real robot experiment. Implementation details of the STO algorithm used to generate the optimal sequence of controls are given in the supplementary material, Section E.
Each experiment is run twice, with 10 iterations each. Figure 7 shows snapshots of the executions of the optimal control sequences obtained for selected setups. The average task completion measures for the optimal control sequences are shown in Figure 8. Our method achieves good task completion measures under different domain shifts, demonstrating the effectiveness of O2A in learning tasks on a real robot. Videos of all the real robot experiment results, including the demonstrations, are available on our project website.

Conclusion
We have presented O2A, a method for observational learning of robotic manipulation tasks from a single (one-shot) demonstration. The method works by extracting a perceptual representation, the action vector, from videos using a pre-trained action vector extractor. Our analysis shows that the pre-trained action vector extractor can generalise to unseen robotic manipulation tasks, and experiments in simulation and with a real robot show that O2A performs well under different domain shifts and outperforms baseline approaches.
A limitation of our work is the number of trial executions required to learn a task. It would be interesting to see whether the action vector from the demonstration can be mapped directly to a near-optimal initial solution. Another future direction is to use additional sensing modalities, such as touch or audio, for situations where the demonstration is not visually observable (e.g. due to occlusion). It would also be interesting to study pre-training on generic action datasets for other robotic manipulation problems; such pre-training could potentially address the lack of large, ImageNet-like [53] datasets of robotic manipulation task videos. Finally, it would be exciting to extend O2A to multi-step manipulation tasks. One approach could be to decompose these tasks into single-step tasks learnt using the current method, within a curriculum learning framework.

A. Comparison with existing observational learning methods

Table 4: Demonstration data required by existing observational learning methods (references as in Section 2). O2A requires only a single demonstration to learn new tasks, does not use any robot data for training the action vector extractor, and also works well under all the different domain shifts.

[27]  ∼40 min of human demonstrations + ∼20 min of random robot manipulation data
[28]  An expert policy is used instead of direct demonstrations
[29]  ∼60-3000 human demonstrations using additional tools
[30]  ∼20-30 human demonstrations + ∼300-500 random human and robot images
[31]  ∼230 human demonstrations + corresponding robotic joint angle data
[32]  ∼200-400 human demonstrations + corresponding robotic joint angle data
[33]  ∼12 human demonstrations
[34]  ∼50-100 human demonstrations
[35]  Uses both human and robot task demonstrations (exact numbers unknown)
[36]  ∼2990 human demonstrations
[37]  1 (but uses closely related supplementary task demonstrations: requires ∼600-1200 robot and ∼600-1200 human demonstrations per task)
[38]  1 (but requires a large number of action primitive demonstrations: ∼600-1200 robot and ∼600-1200 human demonstrations per action primitive)
O2A   Only 1 demonstration (human demonstration, with or without additional tools)

B. Action vector extractor: Model architecture and pre-training details
The architecture of the 3D-CNN model used is given in Table 5, where NC and BS denote the number of classes and the batch size respectively. Details of pre-training the model for action classification are given in Table 6.

C. Clustering analysis
We conduct an experiment to identify which of the final layers of NN:UCF101 provides the best action vector for manipulation tasks. We use the quality of the clusters in the action vector space as a measure of how task-aware and domain-invariant the action vectors from the different layers of the NN:UCF101 model are: the more domain-invariant and task-aware the action vectors, the better the action vectors from the same class will cluster.
To analyse the quality of the clusters, we use a standard clustering evaluation measure, the ARI score [61]. The ARI score measures the extent to which a predicted clustering corresponds to the ground-truth clusters, by counting pairs of samples assigned to the same or different clusters. ARI values are bounded by [−1, 1], where −1 is the lowest score, 0 indicates random clustering and 1 indicates that the predicted clustering corresponds to the ground truth perfectly. For the experiment, we extract action vectors from the pool5, fc6, fc7 and fc8 layers of the NN:UCF101 model for all 51 videos in LMD. Baseline-R is obtained using features from the pool5 layer of the same NN:UCF101 architecture, but initialised with random weights. The features extracted from each layer are then clustered using the K-means algorithm with K=3, corresponding to the number of task classes. After clustering, the predicted cluster labels are evaluated against the ground-truth labels and ARI scores are calculated. The results are tabulated in Table 7. The ARI value for Baseline-R is close to zero, as expected, and gives us a baseline to compare against. The ARI score increases from the pool5 to the fc6 layer, but drops for the final fc7 and fc8 layers. The results indicate that the features from the pool5 and fc6 layers of the NN:UCF101 model are the most suitable for use as the action vector.
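A minimal sketch of this evaluation, assuming scikit-learn:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def ari_for_layer(features, true_labels, k=3):
    """Cluster one layer's features with K-means and score against ground truth."""
    predicted = KMeans(n_clusters=k, n_init=10).fit_predict(features)
    return adjusted_rand_score(true_labels, predicted)
```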

D. Simulation experiment

D.1. DDPG algorithm
Details of the DDPG algorithm used in the simulation experiment are given here. We use actor and critic network architectures identical to those in [52]. The hyperparameters used are given in Table 8.

D.2. Correlation of rewards
We further analysed the quality of the rewards generated by O2A, the baseline approaches and the Oracle. To compare them, we calculate the Pearson correlation coefficient [62] between the episodic perceptual rewards (O2A, baselines) and the Oracle rewards for the top two runs. A high positive correlation (typically > 0.5 [63]) indicates that the perceptual rewards are as good as the Oracle rewards. The results are tabulated in Table 9. The correlation coefficients are greater than 0.5 in all cases for O2A, indicating that our rewards are as accurate as the Oracle rewards. The correlations are also higher and positive compared to the baselines across the range of domain shifts, showing the superior performance of our method.

Table 9: Pearson correlation coefficients between the rewards from the Oracle and the rewards from O2A and the two baselines. The coefficients are generally highest and positive for the O2A rewards compared to the baseline approaches.

D.3. Trajectory maps
The trajectory maps for the remaining setups (Obj, Obj+V and BG) for the reaching task are given in Figure 9.

Figure 9: Trajectory maps while learning the reaching task, when the O2A action vector extractors are pre-trained on the UCF101 dataset (NN:UCF101 (pool5, fc6)) and on the MIME dataset (NN:MIME (pool5, fc6)). O2A pre-trained on UCF101 yields high rewards for the desired trajectories under all of these domain shifts (Obj, Obj+V, BG), whereas O2A pre-trained on the MIME dataset fails when the viewpoint of observation changes (Obj+V).

E. STO algorithm implementation
Briefly, we begin with an initial candidate control sequence and execute it on the manipulator to obtain an initial cost. At each iteration, we then create 8 random control sequences by adding Gaussian noise to the candidate sequence from the previous iteration and execute them on the real robot. At the end of each iteration, we pick the control sequence with the minimum cost and set it as the new candidate sequence, thereby iteratively reducing the cost. The initial control sequence is initialised by providing a near-solution path, following common practice in the literature [33]; a more sophisticated algorithm could be used to obtain the optimal control sequence without this.
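A minimal sketch of this optimisation loop; the noise scale, and keeping the previous candidate when no perturbation improves on it, are our assumptions:

```python
import numpy as np

def sto(initial_controls, execute, iterations=10, num_samples=8, noise_std=0.05):
    """Stochastic trajectory optimisation: perturb, execute, keep the cheapest.

    initial_controls: control sequence array (the near-solution path);
    execute: runs a control sequence on the robot and returns its cost C = r^2.
    """
    best = np.asarray(initial_controls, dtype=float)
    best_cost = execute(best)
    for _ in range(iterations):
        # 8 random sequences: Gaussian noise around the current candidate.
        candidates = [best + np.random.normal(0.0, noise_std, best.shape)
                      for _ in range(num_samples)]
        costs = [execute(c) for c in candidates]
        i = int(np.argmin(costs))
        if costs[i] < best_cost:            # adopt the cheapest sequence found so far
            best, best_cost = candidates[i], costs[i]
    return best
```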