# MACHINE LEARNING METHODS FOR HIGH-LEVEL COGNITIVE CAPABILITIES IN ROBOTICS

EDITED BY: Emre Ugur, Tetsuya Ogata, Yiannis Demiris, Tadahiro Taniguchi and Takayuki Nagai

PUBLISHED IN: Frontiers in Neurorobotics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-261-9 DOI 10.3389/978-2-88963-261-9

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

#### What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journal Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# MACHINE LEARNING METHODS FOR HIGH-LEVEL COGNITIVE CAPABILITIES IN ROBOTICS

Topic Editors:

Emre Ugur, Boğaziçi University, Turkey
Tetsuya Ogata, Waseda University, Japan
Yiannis Demiris, Imperial College London, United Kingdom
Tadahiro Taniguchi, Ritsumeikan University, Japan
Takayuki Nagai, Osaka University, Japan

Citation: Ugur, E., Ogata, T., Demiris, Y., Taniguchi, T., Nagai, T., eds. (2019). Machine Learning Methods for High-Level Cognitive Capabilities in Robotics. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-261-9

# Table of Contents

*04 Editorial: Machine Learning Methods for High-Level Cognitive Capabilities in Robotics*

Tadahiro Taniguchi, Emre Ugur, Tetsuya Ogata, Takayuki Nagai and Yiannis Demiris

*07 Cross-Situational Learning With Bayesian Generative Models for Multimodal Category and Word Learning in Robots*

Akira Taniguchi, Tadahiro Taniguchi and Angelo Cangelosi

*26 Segmenting Continuous Motions With Hidden Semi-Markov Models and Gaussian Processes*

Tomoaki Nakamura, Takayuki Nagai, Daichi Mochihashi, Ichiro Kobayashi, Hideki Asoh and Masahide Kaneko

*37 Representation Learning of Logic Words by an RNN: From Word Sequences to Robot Actions*

Tatsuro Yamada, Shingo Murata, Hiroaki Arie and Tetsuya Ogata

*55 Hierarchical Spatial Concept Formation Based on Multimodal Information for Human Support Robots*

Yoshinobu Hagiwara, Masakazu Inoue, Hiroyoshi Kobayashi and Tadahiro Taniguchi

*71 Multimodal Hierarchical Dirichlet Process-Based Active Perception by a Robot*

Tadahiro Taniguchi, Ryo Yoshino and Toshiaki Takano


Tomoaki Nakamura, Takayuki Nagai and Tadahiro Taniguchi


Zhijun Zhang, Qiongyi Zhou and Weisen Fan

# Editorial: Machine Learning Methods for High-Level Cognitive Capabilities in Robotics

Tadahiro Taniguchi <sup>1</sup> \*, Emre Ugur <sup>2</sup> , Tetsuya Ogata<sup>3</sup> , Takayuki Nagai <sup>4</sup> and Yiannis Demiris <sup>5</sup>

<sup>1</sup> Department of Information Science and Engineering, Ritsumeikan University, Kyoto, Japan, <sup>2</sup> Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey, <sup>3</sup> Department of Intermedia Art and Science, School of Fundamental Science and Engineering, Waseda University, Tokyo, Japan, <sup>4</sup> Department of Systems Innovation, Graduate School of Engineering Science, Osaka University, Osaka, Japan, <sup>5</sup> Department of Electrical and Electronic Engineering, Imperial College London, London, United Kingdom

Keywords: machine learning, cognitive robotics, language acquisition, neural networks, cognitive architecture, probabilistic models, robot learning

**Editorial on the Research Topic**

**Machine Learning Methods for High-Level Cognitive Capabilities in Robotics**

# 1. INTRODUCTION

Adaptive learning and the emergence of integrative cognitive systems that involve not only low-level but also high-level cognitive capabilities are crucially important in robotics (Cangelosi et al., 2010; Cangelosi and Schlesinger, 2015; Ugur and Piater, 2015; Tani, 2016; Taniguchi et al., 2016, 2018). Recent advances in machine learning methods, e.g., deep learning and hierarchical Bayesian modeling, enable us to develop cognitive systems that integrate multi-level sensory-motor and cognitive capabilities. Low-level cognitive capabilities include sensory perception, physical control, and behavioral motion generation, while high-level cognitive capabilities include logical inference, planning, and language acquisition. To create robots that can deal with uncertainty in our daily environment, developing machine learning methods that can integrate low-level and high-level capabilities is essential. Following the successful Workshop on Machine Learning Methods for High-Level Cognitive Capabilities in Robotics 2016, held at IEEE-IROS 2016<sup>1</sup>, we organized this Research Topic. We aimed to publish original papers on state-of-the-art machine learning methods that contribute to modeling sensory-motor and cognitive capabilities in robotics.

#### Approved by:

Florian Röhrbein, Technical University of Munich, Germany

\*Correspondence:

Tadahiro Taniguchi taniguchi@ci.ritsumei.ac.jp

Received: 19 August 2019 Accepted: 25 September 2019 Published: 22 October 2019

#### Citation:

Taniguchi T, Ugur E, Ogata T, Nagai T and Demiris Y (2019) Editorial: Machine Learning Methods for High-Level Cognitive Capabilities in Robotics. Front. Neurorobot. 13:83. doi: 10.3389/fnbot.2019.00083

# 2. ABOUT THE RESEARCH TOPIC

We are pleased to present nine research articles related to motor and behavior learning, concept formation, language acquisition, and cognitive architecture. In this section, we briefly introduce each paper.

First, three papers focused on action and behavior learning. Imitation learning is an important topic related to the integration of high-level and low-level cognitive capabilities because it enables a robot to acquire behavioral primitives from social interaction, including the observation of human behaviors. Nakajo et al. proposed a machine learning method for viewpoint transformation

<sup>1</sup>The Workshop on Machine Learning Methods for High-Level Cognitive Capabilities in Robotics 2016: http://mlhlcr2016. tanichu.com/

and action mapping using a neural network with an encoder-decoder, i.e., sequence-to-sequence, architecture. In imitation learning, the demonstrator and the imitator have different perspectives; the method deals with this problem and produced successful results. Nakamura et al. proposed a new machine learning method called the Gaussian process-hidden semi-Markov model (GP-HSMM). GP-HSMM can segment continuous motion trajectories without defining a parametric model for each primitive. It combines a Gaussian process, a Bayesian non-parametric regression method, with a hidden semi-Markov model. This method enables a robot to find motion primitives in complex human motion in an imitation learning scenario. Manipulation using the left and right arms is an essential capability for a cognitive robot. Zhang et al. proposed a neural-dynamic-based synchronous-optimization scheme for dual manipulators. It was demonstrated that the method enables a robot to track complex paths.

Second, two papers focused on the relationship between actions and object concepts. Andries et al. propose a formalism for defining and identifying affordance equivalence. The concept of affordance can be regarded as a relationship between an actor, an action performed by this actor, an object on which the action is performed, and the resulting effect. Learning affordances, i.e., the inter-dependency between actions and object concepts, is an important topic in this field. Taniguchi et al. proposed a new active perception method based on the multimodal hierarchical Dirichlet process, a hierarchical Bayesian model for multimodal object concept formation. An important aspect of the approach is that the policy for active perception is derived from the result of unsupervised learning, without any manually designed label data or reward signals.

Third, three papers are related to language acquisition and concept formation. Hagiwara et al. proposed a hierarchical spatial concept formation method based on hierarchical multimodal latent Dirichlet allocation (hMLDA). They demonstrated that a robot could form concepts for places with a hierarchical structure, e.g., "around a table" being part of a "dining room," using hMLDA, and became able to understand utterances indicating places in a domestic environment given by a human user. Yamada et al. described a representation learning method that enables a robot to understand not only action-related words but also logical words, e.g., "or," "and," and "not." They introduced a neural network with an encoder-decoder architecture and obtained successful and suggestive results. Taniguchi et al. proposed a new multimodal cross-situational learning method for language acquisition. A robot became able to estimate the meaning of each word together with the modality via which each word is grounded.

The final paper presents a framework for cognitive architecture based on hierarchical Bayesian models. Nakamura et al. proposed the Symbol Emergence in Robotics tool KIT (SERKET), which can effectively integrate many cognitive modules developed using hierarchical Bayesian models, i.e., probabilistic generative models, without re-implementation of each module. Integrating low-level and high-level cognitive capabilities into a single cognitive system requires researchers and developers to construct very complex software modules, which is expected to cause practical problems. SERKET can be regarded as a practical solution to this problem and is expected to push the research field forward.

# 3. NEXT STEP

With the tremendous success of this Research Topic, we organized follow-up workshops<sup>2</sup> and a new Research Topic<sup>3</sup>. Two survey papers related to the series of workshops have already been published (Taniguchi et al., 2018, 2019). We will also organize a workshop with a special emphasis on deep probabilistic generative models<sup>4</sup>. We believe that in order to create an artificial cognitive system, i.e., a robot, it is important to integrate low-level and high-level cognitive capabilities using machine learning methods. We hope that this special issue will contribute to accelerating robotics and machine learning studies that aim to create human-like cognitive systems that can behave in our real-world environment in collaboration with people.

# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

# ACKNOWLEDGMENTS

The authors gratefully acknowledge the contributions of the participants in this special issue.

<sup>2</sup>The 2nd Workshop on Machine Learning Methods for High-Level Cognitive Capabilities in Robotics 2017: http://mlhlcr2017.tanichu.com/. The Workshop on Language and Robotics: http://iros2018.emergent-symbol.systems/.

<sup>3</sup>Research Topic Language and Robotics: https://www.frontiersin.org/researchtopics/8861/language-and-robotics.

<sup>4</sup>The Workshop on Deep Probabilistic Generative Models for Cognitive Architecture in Robotics 2019: https://sites.google.com/site/dpgmcar2019/.

## REFERENCES

Cangelosi, A., Metta, G., Sagerer, G., Nolfi, S., Nehaniv, C., Fischer, K., et al. (2010). Integration of action and language knowledge: a roadmap for developmental robotics. IEEE Trans. Auton. Ment. Dev. 2, 167–195. doi: 10.1109/TAMD.2010.2053034

Cangelosi, A., and Schlesinger, M. (2015). Developmental Robotics: From Babies to Robots. Cambridge, MA: MIT Press.

Taniguchi, T., Mochihashi, D., Nagai, T., Uchida, S., Inoue, N., Kobayashi, I., et al. (2019). Survey on frontiers of language and robotics. Adv. Robot. 33, 700–730. doi: 10.1080/01691864.2019.1632223


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Taniguchi, Ugur, Ogata, Nagai and Demiris. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# **Cross-Situational Learning with Bayesian Generative Models for Multimodal Category and Word Learning in Robots**

#### *Akira Taniguchi <sup>1</sup> \*, Tadahiro Taniguchi <sup>1</sup> and Angelo Cangelosi <sup>2</sup>*

*<sup>1</sup>Emergent Systems Laboratory, Ritsumeikan University, Kusatsu, Japan <sup>2</sup> The Centre for Robotics and Neural Systems, Plymouth University, Plymouth, United Kingdom*

In this paper, we propose a Bayesian generative model that can form multiple categories based on each sensory-channel and can associate words with any of the four sensory-channels (action, position, object, and color). This paper focuses on cross-situational learning that uses the co-occurrence between words and information from sensory-channels in complex situations, rather than the simple situations of conventional cross-situational learning. We conducted a learning scenario using a simulator and a real humanoid iCub robot. In the scenario, a human tutor provided the robot with a sentence describing an object of visual attention and an accompanying action. The scenario was set as follows: the number of words per sensory-channel was three or four, and the number of trials for learning was 20 and 40 for the simulator and 25 and 40 for the real robot. The experimental results showed that the proposed method was able to estimate the multiple categorizations and to learn the relationships between multiple sensory-channels and words accurately. In addition, we conducted an action generation task and an action description task based on word meanings learned in the cross-situational learning scenario. The experimental results showed that the robot could successfully use the word meanings learned with the proposed method.

#### *Edited by:*

*Frank Van Der Velde, University of Twente, Netherlands*

#### *Reviewed by:*

*Maxime Petit, Imperial College London, United Kingdom Yulia Sandamirskaya, University of Zurich, Switzerland*

#### *\*Correspondence:*

*Akira Taniguchi a.taniguchi@em.ci.ritsumei.ac.jp*

*Received: 17 July 2017 Accepted: 21 November 2017 Published: 19 December 2017*

#### *Citation:*

*Taniguchi A, Taniguchi T and Cangelosi A (2017) Cross-Situational Learning with Bayesian Generative Models for Multimodal Category and Word Learning in Robots. Front. Neurorobot. 11:66. doi: 10.3389/fnbot.2017.00066*

**Keywords: Bayesian model, cross-situational learning, lexical acquisition, multimodal categorization, symbol grounding, word meaning**

# **1. INTRODUCTION**

This paper addresses robotic learning of word meanings, inspired by the process of human language acquisition. We developed an unsupervised machine learning method to enable linguistic interaction between humans and robots. Human infants can acquire word meanings by estimating the relationships between multimodal information and words in a variety of situations. For example, suppose an infant grasps a green cup and the parent describes the infant's action with a sentence such as "grasp green front cup." In this case, the infant does not know the relationship between the words and the situation because it has not yet acquired the meanings of the words. In other words, the infant cannot determine whether the word "green" indicates an action, an object, or a color. However, it is believed that the infant can learn that the word "green" represents the green color by observing the co-occurrence of the word "green" with green objects across various situations. This is known as cross-situational learning (CSL), which has been both studied in children (Smith et al., 2011) and modeled in simulated agents and robots (Fontanari et al., 2009). CSL is related to the symbol grounding problem (Harnad, 1990), which is a challenging and significant issue in robotics.
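The counting intuition behind CSL can be sketched as a toy model. Each trial pairs an utterance (a set of words) with a set of perceived referents; no single trial is labeled, but accumulating co-occurrence counts across trials disambiguates each word. The trials, words, and referent labels below are invented for illustration and are not the model proposed in this paper:

```python
from collections import defaultdict

# Toy cross-situational learner: accumulate word-referent co-occurrence
# counts over trials; no individual trial is labeled.
trials = [
    ({"grasp", "green", "cup"}, {"GRASP", "GREEN", "CUP"}),
    ({"push", "green", "ball"}, {"PUSH", "GREEN", "BALL"}),
    ({"grasp", "red", "ball"},  {"GRASP", "RED", "BALL"}),
    ({"push", "red", "cup"},    {"PUSH", "RED", "CUP"}),
]

counts = defaultdict(lambda: defaultdict(int))
word_totals = defaultdict(int)
for words, referents in trials:
    for w in words:
        word_totals[w] += 1
        for r in referents:
            counts[w][r] += 1

def best_referent(word):
    # Pick the referent maximizing the conditional co-occurrence
    # frequency count(word, referent) / count(word).
    return max(counts[word], key=lambda r: counts[word][r] / word_totals[word])

print(best_referent("green"))  # -> "GREEN": it co-occurs in every "green" trial
```

Within any single trial, "green" is ambiguous among three referents; only across trials does the referent GREEN dominate, which is precisely the cross-situational effect the paper builds on.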

Generalization ability and robustness to observation noise, which allow an agent to process situations it has never experienced, are important in cognitive robotics. Studies of language acquisition by infants led to the taxonomic-bias hypothesis (Markman and Hutchinson, 1984): infants tend to understand a word as the name of a category to which the target object belongs, rather than as a proper noun. This hypothesis can also be considered to play an important role in CSL. In this study, we assume that words are associated with categories based on the taxonomic bias. By associating words with categories, a human can generalize words and therefore use them for communication in new situations. To develop this ability, the robot needs to form categories from observed information autonomously. We realize this through categorization based on a Bayesian generative model. Another hypothesis regarding lexical acquisition by infants is the mutual exclusivity bias (constraint) (Markman and Wachtel, 1988). In studies on lexical acquisition, this hypothesis has been considered particularly important for CSL (Twomey et al., 2016). The mutual exclusivity bias assumes that the infant considers the name of an object to correspond to one particular category only; in other words, multiple categories do not correspond to the same word simultaneously. Imai and Mazuka (2007) suggested that once an infant decides whether a word refers to the name of an object or of a substance, the same word is not applied across ontological distinctions such as objects and substances. In this study, we extend the mutual exclusivity constraint to the CSL problem in complex situations. We aim to develop a novel method that can acquire knowledge of multiple categories and word meanings simultaneously.
In addition, we verify the effect of the mutual exclusivity bias on lexical acquisition by constructing models that assume different constraints.

In addition, humans can perform instructed actions using acquired knowledge. For example, a parent places some objects in front of an infant and says "grasp green right ball." The infant can use the acquired word meanings to select the green ball on the right among the objects and perform the grasping action. Furthermore, humans can describe their own actions in sentences using acquired knowledge. For example, if an infant that knows the word meanings grasps a blue box in front of it, the infant can say "grasp blue front box" to another person. Understanding instructions and describing situations are crucial capabilities that are also required to build a cognitive robot.

In this paper, the goal is to develop an unsupervised machine-learning method for learning the relationships between words and the four sensory-channels (action, object, color, and position) from the robot's experience of observed sentences describing object manipulation scenes. In the above example, sentences containing four words for four sensory-channels are shown; however, in the scenario described in this study, sentences of fewer than four words are allowed. In addition, the position sensory-channel corresponds to the original position of the object; in other words, we assume that the environment is static. We assume that the robot can recognize spoken words without errors, as this work focuses specifically on (1) the categorization for each sensory-channel, (2) the learning of relationships between words and sensory-channels, and (3) the grounding of words in multiple categories. In addition, we demonstrate whether the robot can carry out actions and describe its actions in sentences by conducting experiments using the CSL results. The main contributions of this paper are as follows:


The remainder of this paper is organized as follows. In Section 2, we discuss previous studies on lexical acquisition by a robot and CSL that are relevant to our study. In Section 3, we present a proposed Bayesian generative model for CSL. In Sections 4 and 5, we discuss the effectiveness of the proposed method in terms of three tasks, i.e., cross-situational learning, action generation, and an action description task, in a simulation and a real environment, respectively. Section 6 concludes the paper.

# **2. RELATED WORK**

## **2.1. Lexical Acquisition by Robot**

Studies of language acquisition also constitute a constructive approach to the human developmental process (Cangelosi and Schlesinger, 2015), language grounding (Steels and Hild, 2012), and symbol emergence (Taniguchi et al., 2016c). One approach to studying language acquisition focuses on the estimation of phonemes and words from speech signals (Goldwater et al., 2009; Heymann et al., 2014; Taniguchi et al., 2016d). However, these studies used only continuous speech signals, without exploiting co-occurrence with other sensor information, e.g., visual, tactile, and proprioceptive information. Therefore, the robot was not required to understand the meaning of words. Yet, it is important for a robot to understand word meanings, i.e., to ground meanings to words, for human–robot interaction (HRI).

Roy and Pentland (2002) proposed a computational model by which a robot could learn the names of objects from images of the object and natural infant-directed speech. Their model could perform speech segmentation, lexical acquisition, and visual categorization. Hörnstein et al. (2010) proposed a method based on pattern recognition and hierarchical clustering that mimics a human infant to enable a humanoid robot to acquire language. Their method allowed the robot to acquire phonemes and words from visual and auditory information through interaction with the human. Nakamura et al. (2011a,b) proposed multimodal latent Dirichlet allocation (MLDA) and a multimodal hierarchical Dirichlet process (MHDP) that enables the categorization of objects from multimodal information, i.e., visual, auditory, haptic, and word information. Their methods enabled more accurate object categorization by using multimodal information. Taniguchi et al. (2016a) proposed a method for simultaneous estimation of self-positions and words from noisy sensory information and an uttered word. Their method integrated ambiguous speech recognition results with the self-localization method for learning spatial concepts. However, Taniguchi et al. (2016a) assumed that the name of a place would be learned from an uttered word. Taniguchi et al. (2016b) proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) based on place categorization and unsupervised word segmentation. SpCoA could acquire the names of places from spoken sentences including multiple words. In the above studies, the robot was taught to focus on one target, e.g., an object or a place, by a tutor using one word or one sentence. However, considering a more realistic problem, the robot needs to know which event in a complicated situation is associated with which word in the sentence. 
CSL, which extends the aforementioned studies on lexical acquisition, is a more difficult and important problem in robotics. Our research concerns the CSL problem because of its importance to lexical acquisition by robots.

## **2.2. Cross-Situational Learning**

### 2.2.1. Conventional Cross-Situational Learning Studies

Frank et al. (2007, 2009) proposed a Bayesian model that unifies statistical and intentional approaches to cross-situational word learning. They conducted basic CSL experiments aimed at teaching object names. In addition, they discussed the effectiveness of mutual exclusivity for CSL in probabilistic models. Fontanari et al. (2009) performed object-word mapping from the co-occurrence between objects and words using a method based on neural modeling fields (NMF). In "modi" experiments using iCub, their findings were similar to those reported by Smith and Samuelson (2010). The abovementioned studies are CSL studies inspired by experiments with human infants. These studies assumed simple situations, such as learning the relationship between objects and words, as the early stage of CSL. However, the real environment is varied and more complex. In this study, we focus on the problem of CSL with utterances including multiple words and observations from multiple sensory-channels.

### 2.2.2. Probabilistic Models

Qu and Chai (2008, 2010) proposed a learning method that automatically acquires novel words for an interactive system. They focused on the co-occurrence between word-sequences and entity-sequences tracked by eye-gaze in lexical acquisition. Qu and Chai's method, which is based on the IBM-translation model (Brown et al., 1993), estimates the word-entity association probability. However, their studies did not result in perfect unsupervised lexical acquisition because they used domain knowledge based on WordNet. Matuszek et al. (2012) presented a joint model of language and perception for grounded attribute learning. This model enables the identification of which novel words correspond to color, shape, or no attribute at all. Celikkanat et al. (2014) proposed an unsupervised learning method based on latent Dirichlet allocation (LDA) that allows many-to-many relationships between objects and contexts. Their method was able to predict the context from the observation information and plan the action using learned contexts. Chen et al. (2016) proposed an active learning method for cross-situational learning of object-word association. In experiments, they showed that LDA was more effective than non-negative matrix factorization (NMF). However, they did not perform any HRI experiment using the learned language. In our study, we perform experiments that use word meanings learned in CSL to generate an action and explain a current situation.

### 2.2.3. Neural Network Models

Yamada et al. (2015, 2016) proposed a learning method based on a stochastic continuous-time recurrent neural network (CTRNN) and a multiple time-scales recurrent neural network (MTRNN). They showed that the learned network formed an attractor structure representing both the relationships between words and actions and the temporal pattern of the task. Stramandinoli et al. (2017) proposed partially recurrent neural networks (P-RNNs) for learning the relationships between motor primitives and objects. Zhong et al. (2017) proposed multiple time-scales gated recurrent units (MTGRU), inspired by MTRNN and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). They showed that the MTGRU could learn long-term dependencies in large-dimensional multimodal datasets through multimodal interaction experiments using iCub. The learning results of the above studies using neural networks (NNs) are difficult to interpret because the time-series data are mapped to a continuous latent space. These studies implicitly associate words with objects and actions. Moreover, NN methods generally require a massive amount of learning data. In contrast, learning results are easier to interpret when Bayesian methods are used, and Bayesian methods require less data to learn efficiently. We propose a Bayesian generative model that can perform CSL, including action learning.

### 2.2.4. Robot-to-Robot Interaction

Spranger (2015) and Spranger and Steels (2015) proposed a method for the co-acquisition of semantics and syntax in the spatial language. The experimental results showed that the robot could acquire spatial grammar and categories related to spatial direction. Heath et al. (2016) implemented mobile robots (Lingodroids) capable of learning a lexicon through robot-to-robot interaction. They used two robots equipped with different sensors and simultaneous localization and mapping (SLAM) algorithms. These studies reported that the robots created their lexicons in relation to places and the distance in terms of time. However, these studies did not consider lexical acquisition by HRI. We consider HRI to be necessary to enable a robot to learn human language.

#### 2.2.5. Multimodal Categorization and Word Learning

Attamimi et al. (2016) proposed multilayered MLDA (mMLDA), which hierarchically integrates multiple MLDAs, as an extension of Nakamura et al. (2011a). They estimated the relationships among words and multiple concepts by weighting the learned words according to their mutual information as a post-processing step. In their model, the same uttered word can be generated from three kinds of concepts, i.e., the model has three separate variables for the same word in different concepts, which we consider an unnatural assumption for a generative model of word production. In our proposed model, by contrast, the uttered words are generated from a single variable, which we consider a more natural assumption than that of Attamimi's model. In addition, their study did not use data obtained autonomously by the robot: because the action data were human motions captured by a Kinect-based motion capture system and a wearable sensor device attached to a human, the robot could not learn the relationships between self-actions and words. In our study, the robot learns the action category based on subjective self-action; therefore, the robot can perform a learned action in response to a sentence of human speech. In this paper, we focus on complicated CSL problems arising from situations with multiple objects and sentences containing words related to various sensory-channels, such as the names, positions, and colors of objects and the action carried out on an object.

# **3. MULTICHANNEL CATEGORIZATIONS AND LEARNING THE MEANING OF WORDS**

We propose a Bayesian generative model for cross-situational learning. The proposed method can estimate categories of multiple sensory-channels and the relationships between words and sensory-channels simultaneously.

# **3.1. Overview of the Scenario and Assumptions**

Here, we provide an overview of the scenario on which we focus and some of the assumptions in this study. **Figure 1** shows an overview of the scenario. The robot does not have any specific knowledge of objects, but it can recognize that objects exist on the table, i.e., the robot can segment the object and then extract the features of the segmented object. In addition, we assume that the robot can recognize the sentence uttered by the tutor without error. The training procedure consists of the following steps:


This process (steps 1–4) is carried out many times in different situations.

We assume that the robot does not know the relationships between the words and sensory-channels in advance. This study does not consider grammar, i.e., a unigram language model is assumed. The robot learns word meanings and multiple categories by using visual, tactile, and proprioceptive information, as well as words.

In this study, we consider two-level cross-situational learning (CSL-I and II). The first level (CSL-I) is the selection of an object related to a tutor utterance from multiple objects on the table.

**FIGURE 1** | Overview of the cross-situational learning scenario as the focus of this study; the robot obtains multimodal information from multiple sensory-channels in a situation and estimates the relationships between words and sensory-channels.

The second level (CSL-II) is the selection of the relationship between a specific word in the sentence and a sensory-channel in the multimodal information. In the first level, we assume joint attention. Tomasello and Farrar (1986) showed that utterances referring to the object on which the child's attention is already focused are more effective for language acquisition. The above scenario enables the tutor to identify the object of attention, i.e., the object at which the robot is gazing. Furthermore, we assume that the robot considers the tutor to be speaking a sentence about the object of attention. This assumption of joint attention avoids the problem of object selection. The second level is the main problem addressed in this study. Many previous studies on CSL-I have been reported (Frank et al., 2007, 2009; Fontanari et al., 2009; Morse et al., 2010); however, this is not the case for CSL-II. The study discussed in this paper focuses on solving the crucial problem of CSL-II.

In this study, we assume a two-level mutual exclusivity constraint (Markman and Wachtel, 1988) (MEC-I and II) regarding the selection of the sensory-channel. The first level (MEC-I) is the mutual exclusivity of sensory-channels with a word, i.e., one word is allocated to one category in one sensory-channel. The second level (MEC-II) is the mutual exclusivity between sensory-channels indicated by words, i.e., one word related to each sensory-channel is spoken at most once in a sentence. MEC-II is a stronger constraint than MEC-I. The proposed method can incorporate both levels of mutual exclusivity.
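To make MEC-II concrete: for a sentence of *n* words, the admissible frames *F<sub>d</sub>* are the ordered selections of *n* distinct sensory-channels. A minimal sketch (the channel labels and sentence lengths are our illustrative assumptions, not the authors' code):

```python
from itertools import permutations

# Sensory-channels assumed in this sketch: action, position, object, color.
CHANNELS = ("a", "p", "o", "c")

def admissible_frames(n_words):
    """Enumerate frame assignments F_d for a sentence of n_words words
    under MEC-II: each sensory-channel may be used at most once."""
    return list(permutations(CHANNELS, n_words))

frames = admissible_frames(2)
print(len(frames))  # 4 channels taken 2 at a time -> 12 orderings
```

Under MEC-I alone, by contrast, channels could repeat across words, and the candidate set would be all 4^n channel sequences rather than the 4!/(4-n)! orderings above.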

# **3.2. Generative Model and Graphical Model**

The generative model of the proposed method is defined as equations (1–10). **Figure 2** shows a graphical model representing the probabilistic dependencies between variables of the generative model. Basically, the categorization for each sensory-channel is based on the Gaussian mixture model (GMM). In this model, the probability distribution of words is represented by the categorical distribution. The categorization of words in sentences is similar to that of LDA. The latent variable of a word shares the latent variable of any one of the sensory-channels in GMMs, signifying that a word and a category in a particular sensory-channel are generated from the same latent variable.
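The generative story above (mixture categories per sensory-channel, plus word distributions indexed by the channel and category that each word refers to) can be sketched by ancestral sampling. All dimensions, category counts, and the vocabulary size below are illustrative assumptions, not the authors' settings:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, DIM = 3, 10, 2        # categories per channel, vocabulary size, feature dim (assumed)
CHANNELS = ("a", "p", "o", "c")

# Per-channel mixture weights, Gaussian means, and per-category word distributions.
pi = {ch: rng.dirichlet(np.ones(K)) for ch in CHANNELS}
mu = {ch: rng.normal(size=(K, DIM)) for ch in CHANNELS}
theta = {(ch, k): rng.dirichlet(np.ones(V)) for ch in CHANNELS for k in range(K)}

def sample_trial(n_words=2):
    """Ancestral sampling of one trial from the sketched generative model."""
    # Frame: which channel each word refers to (MEC-II: channels not repeated).
    F = rng.choice(len(CHANNELS), size=n_words, replace=False)
    z = {ch: rng.choice(K, p=pi[ch]) for ch in CHANNELS}          # category per channel
    x = {ch: rng.normal(mu[ch][z[ch]], 0.1) for ch in CHANNELS}   # observed features
    # Each word is drawn from the word distribution tied to its channel's category.
    words = [rng.choice(V, p=theta[(CHANNELS[f], z[CHANNELS[f]])]) for f in F]
    return F, z, x, words

F, z, x, words = sample_trial()
```

The key structural point matches the text: the word `w` and the observation of channel `F[n]` share the same latent category `z`, which is what lets the model couple words to sensory categories.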

Frontiers in Neurorobotics | www.frontiersin.org, December 2017 | Volume 11 | Article 66

**FIGURE 2** | Graphical model of the proposed method for learning word meanings; the action, position, color, and object categories are represented by components in Gaussian mixture models (GMMs). A word distribution is related to a category on the GMMs. Gray nodes represent observed variables. Each variable is explained in the description of the generative model in Section 3.2.

We describe the generative model as follows:

$$F_d \sim \text{Unif}(\lambda) \tag{1}$$

$$\theta_l \sim \text{Dir}(\gamma) \tag{2}$$

$$\pi \sim \text{GEM}(\alpha) \tag{3}$$

$$z_{dm} \sim \text{Cat}(\pi) \tag{4}$$

$$w_{dn} \sim \text{Cat}\left(\theta_{l=\left(F_{dn},\, z^{F_{dn}}_{dA_d}\right)}\right) \tag{5}$$

$$\phi_k \sim \text{GIW}(\beta) \tag{6}$$

$$o_{dm} \sim \text{Gauss}\left(\phi^{\text{o}}_{z^{\text{o}}_{dm}}\right) \tag{7}$$

$$c_{dm} \sim \text{Gauss}\left(\phi^{\text{c}}_{z^{\text{c}}_{dm}}\right) \tag{8}$$

$$p_{dm} \sim \text{Gauss}\left(\phi^{\text{p}}_{z^{\text{p}}_{dm}}\right) \tag{9}$$

$$a_d \sim \text{Gauss}\left(\phi^{\text{a}'}_{z^{\text{a}}_{d}}\right), \tag{10}$$

where the discrete uniform distribution is denoted as Unif(·), the categorical distribution as Cat(·), the Dirichlet distribution as Dir(·), the stick-breaking process (SBP) (Sethuraman, 1994) as GEM(·), the Gaussian-inverse-Wishart distribution as GIW(·), and the multivariate Gaussian distribution as Gauss(·). See Murphy (2012) for the specific formulas of these probability distributions. In this paper, variables omitting superscripts represent general notation, e.g., $\pi \in \{\pi\} = \{\pi^{\text{a}}, \pi^{\text{p}}, \pi^{\text{o}}, \pi^{\text{c}}\}$, and variables omitting subscripts represent collective notation, e.g., $F = \{F_1, F_2, \ldots, F_D\}$. The number of trials is $D$. The number of objects on the table in the $d$-th trial is $M_d$, and the number of words in the sentence of the $d$-th trial is $N_d$. The $n$-th word in the $d$-th trial is denoted as $w_{dn}$ and is represented as a bag-of-words (BoW). The model allows sentences containing zero to four words. The model associates the word distributions $\theta$ with categories $z_{dm}$ on four sensory-channels, namely, the action $a_d$, the position $p_{dm}$ of the object on the table, the object feature $o_{dm}$, and the object color $c_{dm}$. In this study, we define the action $a_d$ as a static action feature, i.e., proprioceptive and tactile features when the robot completes an action. The index of the object of attention selected by the robot from among the multiple objects on the table is denoted as $A_d = m$. The sequence representing the sensory-channels associated with each word in the sentence is denoted as $F_d$, e.g., $F_d = (\text{a}, \text{p}, \text{c}, \text{o})$. The number of categories for each sensory-channel is $K$, and an index of a word distribution is denoted as $l$. The set of all the word distributions is denoted as $\theta = \{\theta_{l=(F_{dn},\, z^{F_{dn}}_{dm})} \mid F_{dn} \in \{\text{o}, \text{c}, \text{p}, \text{a}\},\ z^{F_{dn}}_{dm} \in \{1, 2, \ldots, K^{F_{dn}}\}\}$. The index of the category of the sensory-channel $F_{dn}$ for the object of attention $A_d$ is denoted as $z^{F_{dn}}_{dA_d}$. The number of word distributions $L$ is therefore the sum of the numbers of categories of all the sensory-channels, i.e., $L = K^{\text{a}} + K^{\text{p}} + K^{\text{o}} + K^{\text{c}}$. The action category $\phi^{\text{a}}_k$, the position category $\phi^{\text{p}}_k$, the object category $\phi^{\text{o}}_k$, and the color category $\phi^{\text{c}}_k$ are each represented by a Gaussian distribution, whose mean vector and covariance matrix are denoted as $\phi_k = \{\mu_k, \Sigma_k\}$. We define $\phi^{\text{a}'}_{z^{\text{a}}_d}$ as the parameter of the Gaussian distribution obtained by adding the object position $p_{dA_d}$ to the element of the mean vector representing the relative coordinates between the hand position and the target position; that is, $\phi^{\text{a}'}_{z^{\text{a}}_d}$ is the parameter obtained by converting the target hand position to the absolute coordinate system based on $\phi^{\text{a}}_{z^{\text{a}}_d}$ (the parameter of the action category represented in the relative coordinate system) and $p_{dA_d}$ (the position of the object of attention). The mixture weights are $\pi^{\text{a}}$, $\pi^{\text{p}}$, $\pi^{\text{o}}$, and $\pi^{\text{c}}$. The hyperparameter $\lambda$ of the uniform distribution in equation (1) encodes the mutual exclusivity constraint that each sensory-channel is represented at most once in each sentence. The hyperparameter of the mixture weights $\pi$ is denoted as $\alpha$, the hyperparameter of the Gaussian-inverse-Wishart distribution as $\beta = \{m_0, \kappa_0, V_0, \nu_0\}$, and the hyperparameter of the Dirichlet distribution as $\gamma$. Italic notation ($a$, $p$, $o$, $c$) represents observation variables; ordinary notation used as superscripts (a, p, o, c) represents sensory-channels.

The robot needs to estimate the number of categories based on experience because it cannot have prior knowledge about the categories. The proposed method can learn an appropriate number of categories, depending on the collected data, by using a nonparametric Bayesian approach. Specifically, this method uses the SBP, a method based on the Dirichlet process; therefore, it can consider theoretically infinite numbers $K^{\text{a}}$, $K^{\text{p}}$, $K^{\text{o}}$, and $K^{\text{c}}$ of categories. In this paper, we approximate the values of the parameters representing the numbers of categories $K^{\text{a}}$, $K^{\text{p}}$, $K^{\text{o}}$, and $K^{\text{c}}$ by assigning sufficiently large values, i.e., a weak-limit approximation (Fox et al., 2011).

# **3.3. Learning Algorithm**

This model estimates the parameters representing the multiple categories, the word distributions, and the relationships between words and sensory-channels, taking as input the object features, positions, colors, robot actions, and the sentences spoken by a tutor. The model parameters and latent variables of the proposed method are estimated from the following joint posterior distribution by Gibbs sampling:

$$\Theta, Z \sim p(\Theta, Z \mid X, H), \tag{11}$$

where the set of model parameters is denoted as $\Theta = \{\{\pi\}, \{\phi\}, \theta\}$, the set of latent variables as $Z = \{\{z\}, F\}$, the set of observation variables as $X = \{a, p, o, c, w, A\}$, and the set of hyperparameters of the model as $H = \{\{\alpha\}, \{\beta\}, \lambda, \gamma\}$.
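The stick-breaking construction behind GEM(α), truncated at a sufficiently large K as in the weak-limit approximation, can be sketched as follows (α = 1.0 and K = 20 are illustrative choices, not the authors' values):

```python
import numpy as np

def truncated_stick_breaking(alpha, K, rng):
    """Draw mixture weights pi ~ GEM(alpha), truncated at K components:
    v_k ~ Beta(1, alpha); pi_k = v_k * prod_{j<k} (1 - v_j),
    with the final weight absorbing the remaining stick mass."""
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                      # truncation: last break takes the rest
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
pi = truncated_stick_breaking(alpha=1.0, K=20, rng=rng)
print(pi.sum())  # ~1.0
```

With K large relative to the number of categories the data supports, most of the resulting weights are near zero, which is how the truncated model approximates the theoretically infinite Dirichlet-process mixture.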

The learning algorithm is obtained by repeatedly sampling the conditional posterior distributions for each parameter. The Dirichlet and GIW distributions are conjugate prior distributions for the categorical and Gaussian distributions, respectively (Murphy, 2012). Therefore, the conditional posterior distributions can be determined analytically. **Algorithm 1** shows the pseudo-code for the learning procedure. The initial values of the model parameters can be set arbitrarily in accordance with a condition. The following is the conditional posterior distribution of each element used for performing Gibbs sampling.
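As a concrete instance of this conjugacy, the posterior hyperparameters of a Gaussian-inverse-Wishart prior given the data assigned to one category have a standard closed form (see Murphy, 2012). The sketch below uses our own variable names and toy data, not the authors' code:

```python
import numpy as np

def giw_posterior(m0, kappa0, V0, nu0, X):
    """Closed-form Gaussian-inverse-Wishart posterior hyperparameters
    given the data X (n x d) assigned to one category."""
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                  # scatter matrix around the sample mean
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    m_n = (kappa0 * m0 + n * xbar) / kappa_n       # precision-weighted mean
    diff = (xbar - m0).reshape(-1, 1)
    V_n = V0 + S + (kappa0 * n / kappa_n) * (diff @ diff.T)
    return m_n, kappa_n, V_n, nu_n

X = np.array([[1.0, 0.0], [3.0, 2.0]])             # two toy 2-D observations
m_n, kappa_n, V_n, nu_n = giw_posterior(np.zeros(2), 1.0, np.eye(2), 4.0, X)
```

Sampling a category's `(mu, Sigma)` then amounts to drawing `Sigma` from an inverse-Wishart with parameters `(V_n, nu_n)` and `mu` from `Gauss(m_n, Sigma / kappa_n)`.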

A parameter $\pi^{\text{o}}$ of the categorical distribution representing the mixture weight of an object category is sampled as follows:

$$\pi^{\text{o}} \sim p(\pi^{\text{o}} \mid z^{\text{o}}, \alpha^{\text{o}}) \propto \prod_{d=1}^{D} \prod_{m=1}^{M_d} \text{Cat}(z^{\text{o}}_{dm} \mid \pi^{\text{o}})\, \text{Dir}(\pi^{\text{o}} \mid \alpha^{\text{o}})$$

$$\propto \text{Dir}(\pi^{\text{o}} \mid z^{\text{o}}, \alpha^{\text{o}}), \tag{12}$$

where $z^{\text{o}}$ denotes the set of all the latent variables of the object categories. A parameter $\pi^{\text{c}}$ of the categorical distribution representing the mixture weight of a color category is sampled as follows:

$$\pi^{\text{c}} \sim p(\pi^{\text{c}} \mid z^{\text{c}}, \alpha^{\text{c}}) \propto \prod_{d=1}^{D} \prod_{m=1}^{M_d} \text{Cat}(z^{\text{c}}_{dm} \mid \pi^{\text{c}})\, \text{Dir}(\pi^{\text{c}} \mid \alpha^{\text{c}})$$

$$\propto \text{Dir}(\pi^{\text{c}} \mid z^{\text{c}}, \alpha^{\text{c}}), \tag{13}$$

where $z^{\text{c}}$ denotes the set of all the latent variables of the color categories. A parameter $\pi^{\text{p}}$ of the categorical distribution representing the mixture weight of a position category is sampled as follows:

$$\pi^{\text{p}} \sim p(\pi^{\text{p}} \mid z^{\text{p}}, \alpha^{\text{p}}) \propto \prod_{d=1}^{D} \prod_{m=1}^{M_d} \text{Cat}(z^{\text{p}}_{dm} \mid \pi^{\text{p}})\, \text{Dir}(\pi^{\text{p}} \mid \alpha^{\text{p}})$$

$$\propto \text{Dir}(\pi^{\text{p}} \mid z^{\text{p}}, \alpha^{\text{p}}), \tag{14}$$

where $z^{\text{p}}$ denotes the set of all the latent variables of the position categories. A parameter $\pi^{\text{a}}$ of the categorical distribution representing the mixture weight of an action category is sampled as follows:

$$\pi^{\text{a}} \sim p(\pi^{\text{a}} \mid z^{\text{a}}, \alpha^{\text{a}}) \propto \prod_{d=1}^{D} \text{Cat}(z^{\text{a}}_{d} \mid \pi^{\text{a}})\, \text{Dir}(\pi^{\text{a}} \mid \alpha^{\text{a}}) \propto \text{Dir}(\pi^{\text{a}} \mid z^{\text{a}}, \alpha^{\text{a}}), \tag{15}$$

where $z^{\text{a}}$ denotes the set of all the latent variables of the action categories. A parameter $\phi^{\text{o}}_k$ of the Gaussian distribution of the object category is sampled for each $k \in \{1, 2, \ldots, K^{\text{o}}\}$ as follows:

$$\phi^{\text{o}}_k \sim p(\phi^{\text{o}}_k \mid z^{\text{o}}, o, \beta^{\text{o}}) \propto \prod_{d=1}^{D} \prod_{m=1}^{M_d} \text{Gauss}(o_{dm} \mid \phi^{\text{o}}_k)\, \text{GIW}(\phi^{\text{o}}_k \mid \beta^{\text{o}})$$

$$\propto \text{GIW}(\phi^{\text{o}}_k \mid o_k, \beta^{\text{o}}), \tag{16}$$

where $o_k$ denotes the set of all the object features of the object category $z^{\text{o}}_{dm} = k$ in $m \in \{1, 2, \ldots, M_d\}$ and $d \in \{1, 2, \ldots, D\}$.

**Algorithm 1** | Learning algorithm based on Gibbs sampling.

1: **procedure** Gibbs\_Sampling($a, p, o, c, w, A$)
2: &nbsp;&nbsp;Set the hyperparameters $\{\alpha\}, \{\beta\}, \lambda, \gamma$
3: &nbsp;&nbsp;Initialize the parameters and latent variables $\{\pi\}, \{\phi\}, \theta, \{z\}, F$
4: &nbsp;&nbsp;**for** $j = 1$ to $iteration\_number$ **do**
5: &nbsp;&nbsp;&nbsp;&nbsp;$\pi^{\text{o}} \sim \text{Dir}(\pi^{\text{o}} \mid z^{\text{o}}, \alpha^{\text{o}})$ // equation (12)
6: &nbsp;&nbsp;&nbsp;&nbsp;$\pi^{\text{c}} \sim \text{Dir}(\pi^{\text{c}} \mid z^{\text{c}}, \alpha^{\text{c}})$ // equation (13)
7: &nbsp;&nbsp;&nbsp;&nbsp;$\pi^{\text{p}} \sim \text{Dir}(\pi^{\text{p}} \mid z^{\text{p}}, \alpha^{\text{p}})$ // equation (14)
8: &nbsp;&nbsp;&nbsp;&nbsp;$\pi^{\text{a}} \sim \text{Dir}(\pi^{\text{a}} \mid z^{\text{a}}, \alpha^{\text{a}})$ // equation (15)
9: &nbsp;&nbsp;&nbsp;&nbsp;**for** $k = 1$ to $K^{\text{o}}$ **do**
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\phi^{\text{o}}_k \sim \text{GIW}(\phi^{\text{o}}_k \mid o_k, \beta^{\text{o}})$ // equation (16)
11: &nbsp;&nbsp;&nbsp;&nbsp;**end for**
12: &nbsp;&nbsp;&nbsp;&nbsp;**for** $k = 1$ to $K^{\text{c}}$ **do**
13: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\phi^{\text{c}}_k \sim \text{GIW}(\phi^{\text{c}}_k \mid c_k, \beta^{\text{c}})$ // equation (17)
14: &nbsp;&nbsp;&nbsp;&nbsp;**end for**
15: &nbsp;&nbsp;&nbsp;&nbsp;**for** $k = 1$ to $K^{\text{p}}$ **do**
16: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\phi^{\text{p}}_k \sim \text{GIW}(\phi^{\text{p}}_k \mid p_k, \beta^{\text{p}})$ // equation (18)
17: &nbsp;&nbsp;&nbsp;&nbsp;**end for**
18: &nbsp;&nbsp;&nbsp;&nbsp;**for** $k = 1$ to $K^{\text{a}}$ **do**
19: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\phi^{\text{a}}_k \sim \text{GIW}(\phi^{\text{a}}_k \mid a'_k, \beta^{\text{a}})$ // equation (19)
20: &nbsp;&nbsp;&nbsp;&nbsp;**end for**
21: &nbsp;&nbsp;&nbsp;&nbsp;**for** $l = (F_{dn}, z^{F_{dn}}_{dA_d})$ in $\{(F_{dn}, z^{F_{dn}}_{dm}) \mid F_{dn} \in \{\text{o}, \text{c}, \text{p}, \text{a}\},\ z^{F_{dn}}_{dm} \in \{1, 2, \ldots, K^{F_{dn}}\}\}$ **do**
22: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\theta_l \sim \text{Dir}(\theta_l \mid w_l, \gamma)$ // equation (20)
23: &nbsp;&nbsp;&nbsp;&nbsp;**end for**
24: &nbsp;&nbsp;&nbsp;&nbsp;**for** $d = 1$ to $D$ **do**
25: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**for** $m = 1$ to $M_d$ **do**
26: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$z^{\text{o}}_{dm} \sim \prod_{n=1}^{N_d} \text{Cat}(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})})\, \text{Gauss}(o_{dm} \mid \phi^{\text{o}}_{z^{\text{o}}_{dm}})\, \text{Cat}(z^{\text{o}}_{dm} \mid \pi^{\text{o}})$ // equation (21)
27: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$z^{\text{c}}_{dm} \sim \prod_{n=1}^{N_d} \text{Cat}(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})})\, \text{Gauss}(c_{dm} \mid \phi^{\text{c}}_{z^{\text{c}}_{dm}})\, \text{Cat}(z^{\text{c}}_{dm} \mid \pi^{\text{c}})$ // equation (22)
28: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$z^{\text{p}}_{dm} \sim \prod_{n=1}^{N_d} \text{Cat}(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})})\, \text{Gauss}(p_{dm} \mid \phi^{\text{p}}_{z^{\text{p}}_{dm}})\, \text{Cat}(z^{\text{p}}_{dm} \mid \pi^{\text{p}})$ // equation (23)
29: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**end for**
30: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$z^{\text{a}}_{d} \sim \prod_{n=1}^{N_d} \text{Cat}(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})})\, \text{Gauss}(a_{d} \mid \phi^{\text{a}'}_{z^{\text{a}}_{d}})\, \text{Cat}(z^{\text{a}}_{d} \mid \pi^{\text{a}})$ // equation (24)
31: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$F_d \sim \prod_{n=1}^{N_d} \text{Cat}(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})})\, \text{Unif}(F_d \mid \lambda)$ // equation (25)
32: &nbsp;&nbsp;&nbsp;&nbsp;**end for**
33: &nbsp;&nbsp;**end for**
34: &nbsp;&nbsp;**return** $\{\pi\}, \{\phi\}, \theta, \{z\}, F$
35: **end procedure**
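For instance, the mixture-weight updates in lines 5-8 of Algorithm 1 (equations (12)-(15)) reduce to sampling from a Dirichlet whose parameters are the per-category counts of the latent assignments plus the prior. A minimal sketch with assumed values of K and α:

```python
import numpy as np

def sample_pi(z, K, alpha, rng):
    """Sample mixture weights, e.g., pi^o ~ Dir(counts(z^o) + alpha^o),
    the closed-form posterior behind equation (12)."""
    counts = np.bincount(z, minlength=K)
    return rng.dirichlet(counts + alpha)

rng = np.random.default_rng(0)
z_o = np.array([0, 0, 1, 2, 2, 2])      # toy latent object categories over all trials
pi_o = sample_pi(z_o, K=4, alpha=1.0, rng=rng)
print(pi_o.sum())  # ~1.0
```

Categories that currently claim more data receive proportionally larger Dirichlet parameters, so their sampled weights tend to grow across Gibbs iterations.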

A parameter $\phi^{\text{c}}_k$ of the Gaussian distribution of the color category is sampled for each $k \in \{1, 2, \ldots, K^{\text{c}}\}$ as follows:

$$\phi^{\text{c}}_k \sim p(\phi^{\text{c}}_k \mid z^{\text{c}}, c, \beta^{\text{c}}) \propto \prod_{d=1}^{D} \prod_{m=1}^{M_d} \text{Gauss}(c_{dm} \mid \phi^{\text{c}}_k)\, \text{GIW}(\phi^{\text{c}}_k \mid \beta^{\text{c}})$$

$$\propto \text{GIW}(\phi^{\text{c}}_k \mid c_k, \beta^{\text{c}}), \tag{17}$$

where $c_k$ denotes the set of all the color features of the color category $z^{\text{c}}_{dm} = k$ in $m \in \{1, 2, \ldots, M_d\}$ and $d \in \{1, 2, \ldots, D\}$. A parameter $\phi^{\text{p}}_k$ of the Gaussian distribution of the position category is sampled for each $k \in \{1, 2, \ldots, K^{\text{p}}\}$ as follows:

$$\phi^{\text{p}}_k \sim p(\phi^{\text{p}}_k \mid z^{\text{p}}, p, \beta^{\text{p}}) \propto \prod_{d=1}^{D} \prod_{m=1}^{M_d} \text{Gauss}(p_{dm} \mid \phi^{\text{p}}_k)\, \text{GIW}(\phi^{\text{p}}_k \mid \beta^{\text{p}})$$

$$\propto \text{GIW}(\phi^{\text{p}}_k \mid p_k, \beta^{\text{p}}), \tag{18}$$

where $p_k$ denotes the set of all the position information of the position category $z^{\text{p}}_{dm} = k$ in $m \in \{1, 2, \ldots, M_d\}$ and $d \in \{1, 2, \ldots, D\}$. A parameter $\phi^{\text{a}}_k$ of the Gaussian distribution of the action category is sampled for each $k \in \{1, 2, \ldots, K^{\text{a}}\}$ as follows:

$$\phi^{\text{a}}_k \sim p(\phi^{\text{a}}_k \mid z^{\text{a}}, a, p, A, \beta^{\text{a}}) \propto \prod_{d=1}^{D} \text{Gauss}(a'_d \mid \phi^{\text{a}}_k)\, \text{GIW}(\phi^{\text{a}}_k \mid \beta^{\text{a}})$$

$$\propto \text{GIW}(\phi^{\text{a}}_k \mid a'_k, \beta^{\text{a}}), \tag{19}$$

where $a$ denotes the set of all the action information, $p$ the set of all the position information, and $A$ the set of all the attention information. The element of $a'_d$ representing the relative coordinates of the hand is calculated from the element of $a$ representing the absolute coordinates of the hand, the object positions $p$, and the attention information $A$. The set of all the action information of the action category $z^{\text{a}}_d = k$ in $d \in \{1, 2, \ldots, D\}$ is denoted as $a'_k$. A parameter $\theta_l$ of the word probability distribution is sampled for each $l \in \{(F_{dn}, z^{F_{dn}}_{dm}) \mid F_{dn} \in \{\text{o}, \text{c}, \text{p}, \text{a}\},\ z^{F_{dn}}_{dm} \in \{1, 2, \ldots, K^{F_{dn}}\}\}$ as follows:

$$\begin{split} \theta_l &\sim p(\theta_l \mid w, z^{\text{o}}, z^{\text{c}}, z^{\text{p}}, z^{\text{a}}, F, A, \gamma) \\ &\propto \prod_{d=1}^{D} \prod_{n=1}^{N_d} \text{Cat}\left(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})}\right) \text{Dir}(\theta_l \mid \gamma) \\ &\propto \text{Dir}(\theta_l \mid w_l, \gamma), \end{split} \tag{20}$$

where $w$ denotes the set of all the words, $F$ denotes the set of the frames of all the sentences, and $w_l$ denotes the set of all the words of the word category $l = (F_{dn}, z^{F_{dn}}_{dA_d})$ in $n \in \{1, 2, \ldots, N_d\}$ and $d \in \{1, 2, \ldots, D\}$. A latent variable $z^{\text{o}}_{dm}$ of the object category is sampled for each $m \in \{1, 2, \ldots, M_d\}$ and $d \in \{1, 2, \ldots, D\}$ as follows:

$$\begin{split} z^{\text{o}}_{dm} &\sim p(z^{\text{o}}_{dm} \mid w_d, z^{\text{c}}_d, z^{\text{p}}_d, z^{\text{a}}_d, z^{\text{o}}_{-dm}, \theta, F_d, A_d, o_{dm}, \phi^{\text{o}}, \pi^{\text{o}}) \\ &\propto \prod_{n=1}^{N_d} \text{Cat}\left(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})}\right) \text{Gauss}(o_{dm} \mid \phi^{\text{o}}_{z^{\text{o}}_{dm}})\, \text{Cat}(z^{\text{o}}_{dm} \mid \pi^{\text{o}}), \end{split} \tag{21}$$

where $w_d$ is the sequence of words in the $d$-th trial and $z^{\text{o}}_{-dm}$ is the set of indices of the object categories without $z^{\text{o}}_{dm}$ in the $d$-th trial. A latent variable $z^{\text{c}}_{dm}$ of the color category is sampled for each $m \in \{1, 2, \ldots, M_d\}$ and $d \in \{1, 2, \ldots, D\}$ as follows:

$$\begin{split} z^{\text{c}}_{dm} &\sim p(z^{\text{c}}_{dm} \mid w_d, z^{\text{o}}_d, z^{\text{p}}_d, z^{\text{a}}_d, z^{\text{c}}_{-dm}, \theta, F_d, A_d, c_{dm}, \phi^{\text{c}}, \pi^{\text{c}}) \\ &\propto \prod_{n=1}^{N_d} \text{Cat}\left(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})}\right) \text{Gauss}(c_{dm} \mid \phi^{\text{c}}_{z^{\text{c}}_{dm}})\, \text{Cat}(z^{\text{c}}_{dm} \mid \pi^{\text{c}}), \end{split} \tag{22}$$

where $z^{\text{c}}_{-dm}$ is the set of indices of the color categories without $z^{\text{c}}_{dm}$ in the $d$-th trial. A latent variable $z^{\text{p}}_{dm}$ of the position category is sampled for each $m \in \{1, 2, \ldots, M_d\}$ and $d \in \{1, 2, \ldots, D\}$ as follows:

$$\begin{split} z^{\text{p}}_{dm} &\sim p(z^{\text{p}}_{dm} \mid w_d, z^{\text{o}}_d, z^{\text{c}}_d, z^{\text{a}}_d, z^{\text{p}}_{-dm}, \theta, F_d, A_d, p_{dm}, \phi^{\text{p}}, \pi^{\text{p}}) \\ &\propto \prod_{n=1}^{N_d} \text{Cat}\left(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})}\right) \text{Gauss}(p_{dm} \mid \phi^{\text{p}}_{z^{\text{p}}_{dm}})\, \text{Cat}(z^{\text{p}}_{dm} \mid \pi^{\text{p}}), \end{split} \tag{23}$$

where $z^{\text{p}}_{-dm}$ is the set of indices of the position categories without $z^{\text{p}}_{dm}$ in the $d$-th trial. A latent variable $z^{\text{a}}_{d}$ of the action category is sampled for each $d \in \{1, 2, \ldots, D\}$ as follows:

$$\begin{split} z^{\text{a}}_{d} &\sim p(z^{\text{a}}_{d} \mid w_d, z^{\text{o}}_d, z^{\text{c}}_d, z^{\text{p}}_d, \theta, F_d, A_d, a_d, p_d, \phi^{\text{a}}, \pi^{\text{a}}) \\ &\propto \prod_{n=1}^{N_d} \text{Cat}\left(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})}\right) \text{Gauss}\left(a_d \mid \phi^{\text{a}'}_{z^{\text{a}}_{d}}\right) \text{Cat}(z^{\text{a}}_{d} \mid \pi^{\text{a}}), \end{split} \tag{24}$$

where $p_d$ is the set of the position data in the $d$-th trial. A latent variable $F_d$ representing the sensory-channels of the words in a sentence is sampled for each $d \in \{1, 2, \ldots, D\}$ as follows:

$$F_d \sim p(F_d \mid w, z^{\text{o}}, z^{\text{c}}, z^{\text{p}}, z^{\text{a}}, \theta, A, \lambda)$$

$$\propto \prod_{n=1}^{N_d} \text{Cat}\left(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})}\right) \text{Unif}(F_d \mid \lambda). \tag{25}$$
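Each latent-variable update in equations (21)-(25) follows the same pattern: score every candidate value by word likelihood × observation likelihood × prior, normalize, and draw one sample. A generic sketch in log space for numerical stability (the three probability arrays are toy placeholders, not quantities from the paper):

```python
import numpy as np

def gibbs_draw(log_word_lik, log_obs_lik, log_prior, rng):
    """Draw a latent category index with probability proportional to
    word likelihood * observation likelihood * prior."""
    logp = log_word_lik + log_obs_lik + log_prior
    p = np.exp(logp - logp.max())    # subtract max to avoid underflow
    p /= p.sum()
    return rng.choice(len(p), p=p), p

rng = np.random.default_rng(0)
k, p = gibbs_draw(np.log([0.7, 0.2, 0.1]),    # word term, per candidate category
                  np.log([0.5, 0.4, 0.1]),    # observation (Gaussian) term
                  np.log([0.3, 0.3, 0.4]),    # mixture-weight prior term
                  rng)
```

The same helper pattern would serve the frame update of equation (25) by scoring each admissible frame assignment instead of each category index.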

# **3.4. Action Generation and Attention Selection**

In this section, we describe the approach for selecting an action and an object of attention from a sentence spoken by a human. A robot that learns word meanings accurately should be able to understand human instructions more accurately. In the action generation task, the robot performs an action $a_d$ based on the word meanings and multiple categories $\Theta$ given the observed information $w_d$, $o_d$, $c_d$, and $p_d$. In this case, the robot can use the set of model parameters $\Theta$ learned by Gibbs sampling in the CSL task. In the action generation task, we maximize the following equation:

$$\underset{a_d}{\text{argmax}}\ p(a_d \mid w_d, o_d, c_d, p_d, \theta, \{\phi\}, \{\pi\}, \lambda)$$

$$= \underset{a_d}{\text{argmax}} \sum_{A_d} \sum_{z^{\text{a}}_d} p(a_d \mid \phi^{\text{a}}, z^{\text{a}}_d, p_d, A_d)$$

$$\times\ p(A_d, z^{\text{a}}_d \mid w_d, o_d, c_d, p_d, \theta, \{\phi\}, \{\pi\}, \lambda). \tag{26}$$

In practice, this maximization problem is separated into two approximation processes, because it is difficult to maximize equation (26) directly.

(1) The first process is the maximization over the attention $A_d$ and the index of the action category $z^{\text{a}}_d$:

$$A^{*}_d, z^{\text{a}*}_{d} = \underset{A_d,\, z^{\text{a}}_d}{\text{argmax}}\ p(A_d, z^{\text{a}}_d \mid w_d, o_d, c_d, p_d, \theta, \{\phi\}, \{\pi\}, \lambda). \tag{27}$$

The probability distribution of equation (27) is represented by the following equation:

$$\begin{split} &p(A_d, z^{\text{a}}_d \mid w_d, o_d, c_d, p_d, \theta, \{\phi\}, \{\pi\}, \lambda) \\ &\propto p(A_d = m)\, p(z^{\text{a}}_d \mid \pi^{\text{a}}) \prod_{m=1}^{M_d} \sum_{z^{\text{o}}_{dm}} \sum_{z^{\text{c}}_{dm}} \sum_{z^{\text{p}}_{dm}} \text{Gauss}\left(o_{dm} \mid \phi^{\text{o}}_{z^{\text{o}}_{dm}}\right) \text{Cat}(z^{\text{o}}_{dm} \mid \pi^{\text{o}}) \\ &\quad \text{Gauss}\left(c_{dm} \mid \phi^{\text{c}}_{z^{\text{c}}_{dm}}\right) \text{Cat}(z^{\text{c}}_{dm} \mid \pi^{\text{c}})\ \text{Gauss}\left(p_{dm} \mid \phi^{\text{p}}_{z^{\text{p}}_{dm}}\right) \text{Cat}(z^{\text{p}}_{dm} \mid \pi^{\text{p}}) \\ &\quad \left[\sum_{F_d} \text{Unif}(F_d \mid \lambda) \prod_{n=1}^{N_d} \text{Cat}\left(w_{dn} \mid \theta_{l=(F_{dn},\, z^{F_{dn}}_{dA_d})}\right)\right]. \end{split} \tag{28}$$

Here, we assume a uniform prior over the objects, $p(A_d = m) = 1/M_d$.

(2) The second process is the maximization of the action $a_d$ using $A_d^{*}$ and $z_d^{\mathbf{a}*}$:

$$a_d^{*} = \underset{a_d}{\operatorname{argmax}}\; p(a_d \mid \boldsymbol{\phi}^{\mathbf{a}}, z_d^{\mathbf{a}*}, \mathbf{p}_d, A_d^{*})$$

$$= \underset{a_d}{\operatorname{argmax}}\; \text{Gauss}\left(a_d \mid \boldsymbol{\phi}^{\mathbf{a}'}_{z_d^{\mathbf{a}*}}\right)$$

$$= \boldsymbol{\mu}^{\mathbf{a}'}_{z_d^{\mathbf{a}*}}, \tag{29}$$

where the mean vector of the Gaussian distribution of the action category $z_d^{\mathbf{a}*}$ is denoted as $\boldsymbol{\mu}^{\mathbf{a}'}_{z_d^{\mathbf{a}*}}$.
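To make the two-step procedure of equations (27)–(29) concrete, the following Python sketch performs both maximizations on toy parameters. All numerical values, the diagonal-covariance Gaussians, and the restriction to a known two-word frame are illustrative assumptions, not the paper's learned model:

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Isotropic Gaussian density (diagonal covariance, toy setting)."""
    d = x - mean
    return np.exp(-0.5 * np.sum(d * d) / var) / (2 * np.pi * var) ** (len(x) / 2)

# Toy learned parameters (hypothetical values, not the paper's):
phi_a = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]   # action means  (mu^a')
pi_a  = np.array([0.5, 0.5])                           # action prior  (pi^a)
phi_p = [np.array([0.2, 0.2]), np.array([0.8, 0.8])]   # position means
pi_p  = np.array([0.5, 0.5])
# word distributions theta[l], with l indexing (channel, category)
theta = {("a", 0): {"reach": 0.9, "grasp": 0.1},
         ("a", 1): {"reach": 0.1, "grasp": 0.9},
         ("p", 0): {"front": 0.9, "left": 0.1},
         ("p", 1): {"front": 0.1, "left": 0.9}}

def generate_action(words, frame, positions, var=0.05):
    """Eq. (27)/(29) sketch: jointly pick the attention A_d and action
    category z^a, then return the mean of the chosen action Gaussian."""
    best, best_score = None, -np.inf
    for m, p in enumerate(positions):          # candidate objects A_d
        for za in range(len(phi_a)):           # candidate action categories
            score = np.log(pi_a[za]) + np.log(1.0 / len(positions))
            for w, ch in zip(words, frame):    # word terms for a known frame
                if ch == "a":
                    score += np.log(theta[("a", za)].get(w, 1e-6))
                elif ch == "p":
                    # marginalize the position category of the attended object
                    lik = sum(pi_p[zp] * gauss_pdf(p, phi_p[zp], var)
                              * theta[("p", zp)].get(w, 1e-6)
                              for zp in range(len(phi_p)))
                    score += np.log(lik + 1e-300)
            if score > best_score:
                best, best_score = (m, za), score
    m_star, za_star = best
    return m_star, phi_a[za_star]              # attended object, action mean

m, action = generate_action(["grasp", "left"], ["a", "p"],
                            [np.array([0.2, 0.2]), np.array([0.8, 0.8])])
print(m, action)  # object 1 (the "left" one) and the "grasp" action mean
```

Exhaustive enumeration over $(A_d, z_d^{\mathbf{a}})$ is feasible here because both sets are small; the paper's separation into the two argmax stages serves the same purpose of avoiding the intractable joint maximization of equation (26).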

# **3.5. Description of the Current Situation and Self-Action by the Robot**

In this section, we describe the approach followed in the description task, which expresses the current situation and the robot's self-action. We consider that a robot that learns word meanings accurately can describe the current situation and its own actions more accurately. In the action description task, the robot utters a sentence $\mathbf{w}_d$ regarding a self-action $a_d$ and the observed information $\mathbf{o}_d$, $\mathbf{c}_d$, and $\mathbf{p}_d$, based on the word meanings and multiple categories $\Theta$. In this case, the robot can use the set of model parameters $\Theta$ learned by Gibbs sampling in the CSL task. In the action description task, we maximize the following equation:

$$\begin{aligned}
&\underset{\mathbf{w}_d}{\operatorname{argmax}}\; p(\mathbf{w}_d \mid a_d, \mathbf{o}_d, \mathbf{c}_d, \mathbf{p}_d, \boldsymbol{\theta}, \{\boldsymbol{\phi}\}, \{\boldsymbol{\pi}\}, F_d, A_d) \\
&\propto \underset{\mathbf{w}_d}{\operatorname{argmax}} \sum_{z_d^{\mathbf{a}}} \sum_{z_{dA_d}^{\mathbf{o}}} \sum_{z_{dA_d}^{\mathbf{c}}} \sum_{z_{dA_d}^{\mathbf{p}}} \\
&\quad \text{Gauss}\left(a_d \mid \boldsymbol{\phi}^{\mathbf{a}'}_{z_d^{\mathbf{a}}}\right) \text{Cat}\left(z_d^{\mathbf{a}} \mid \boldsymbol{\pi}^{\mathbf{a}}\right) \\
&\quad \text{Gauss}\left(o_{dA_d} \mid \boldsymbol{\phi}^{\mathbf{o}}_{z_{dA_d}^{\mathbf{o}}}\right) \text{Cat}\left(z_{dA_d}^{\mathbf{o}} \mid \boldsymbol{\pi}^{\mathbf{o}}\right) \\
&\quad \text{Gauss}\left(c_{dA_d} \mid \boldsymbol{\phi}^{\mathbf{c}}_{z_{dA_d}^{\mathbf{c}}}\right) \text{Cat}\left(z_{dA_d}^{\mathbf{c}} \mid \boldsymbol{\pi}^{\mathbf{c}}\right) \\
&\quad \text{Gauss}\left(p_{dA_d} \mid \boldsymbol{\phi}^{\mathbf{p}}_{z_{dA_d}^{\mathbf{p}}}\right) \text{Cat}\left(z_{dA_d}^{\mathbf{p}} \mid \boldsymbol{\pi}^{\mathbf{p}}\right) \\
&\quad \prod_{N_d} \text{Cat}\left(w_{dn} \mid \boldsymbol{\theta}_{l=\left(F_{dn},\, z_{dA_d}^{F_{dn}}\right)}\right).
\end{aligned} \tag{30}$$

If the frame of the sentence is decided, e.g., $F_d = (\text{a}, \text{p}, \text{c}, \text{o})$, equation (30) is represented as follows:

$$\begin{split}
\text{Equation (30)} &= \prod_{N_d} \underset{w_{dn}}{\operatorname{argmax}} \sum_{z_{dA_d}^{F_{dn}}} \text{Gauss}\left(x_{dA_d}^{F_{dn}} \mid \boldsymbol{\phi}^{F_{dn}}_{z_{dA_d}^{F_{dn}}}\right) \\
&\times \text{Cat}\left(z_{dA_d}^{F_{dn}} \mid \boldsymbol{\pi}^{F_{dn}}\right) \text{Cat}\left(w_{dn} \mid \boldsymbol{\theta}_{l=\left(F_{dn},\, z_{dA_d}^{F_{dn}}\right)}\right),
\end{split} \tag{31}$$

where $x_{dA_d}^{F_{dn}}$ denotes the data of the sensory-channel $F_{dn}$ for the object number $A_d$, i.e., $a_d$, $p_{dA_d}$, $c_{dA_d}$, or $o_{dA_d}$. Therefore, equation (30) can be divided into equations that find a maximum value for each word.
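A minimal sketch of the per-word maximization in equation (31): for each slot in the frame, the category of the attended object's channel data is marginalized and the highest-scoring word is selected. The two-channel model, its Gaussian means, and the word distributions below are hypothetical toy values:

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Isotropic Gaussian density (diagonal covariance, toy setting)."""
    d = np.asarray(x) - np.asarray(mean)
    return np.exp(-0.5 * np.dot(d, d) / var) / (2 * np.pi * var) ** (len(d) / 2)

# Toy learned model for two channels (hypothetical values, not the paper's).
model = {
    "p": {"means": [np.array([0.2, 0.2]), np.array([0.8, 0.8])],
          "pi": np.array([0.5, 0.5]),
          "theta": [{"front": 0.9, "left": 0.1},
                    {"front": 0.1, "left": 0.9}]},
    "c": {"means": [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
          "pi": np.array([0.5, 0.5]),
          "theta": [{"red": 0.95, "blue": 0.05},
                    {"red": 0.05, "blue": 0.95}]},
}

def describe(observation, frame, var=0.05):
    """Eq. (31) sketch: each word slot n is maximized independently by
    marginalizing the category z of the channel data x^{F_dn}_{dA_d}."""
    sentence = []
    for ch in frame:                       # F_dn for each slot n
        x = observation[ch]                # attended object's channel data
        ch_model = model[ch]
        vocab = ch_model["theta"][0].keys()
        def word_score(w):
            return sum(ch_model["pi"][z]
                       * gauss_pdf(x, ch_model["means"][z], var)
                       * ch_model["theta"][z].get(w, 1e-6)
                       for z in range(len(ch_model["means"])))
        sentence.append(max(vocab, key=word_score))
    return sentence

obs = {"p": np.array([0.78, 0.82]), "c": np.array([0.05, 0.95])}
print(describe(obs, ["p", "c"]))  # -> ['left', 'blue']
```

The independence of the word slots is exactly what allows equation (30) to be split into the per-word maximizations of equation (31) once the frame is fixed.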

# **4. EXPERIMENT I: SIMULATION ENVIRONMENT**

We performed the experiments described in this section using the iCub simulator (Tikhanoff et al., 2008). In Section 4.1, we describe the difference in the conditions of the methods that are used for comparison purposes. In Section 4.2, we describe the CSL experiment. In Section 4.3, we describe the experiment involving the action generation task. In Section 4.4, we describe the experiment relating to the action description task.

# **4.1. Comparison Methods**

We evaluated our proposed method by comparing its performance with that of two other methods.

(A) The proposed method.

This method has mutual exclusivity constraints between the word and the sensory-channel (MEC-I and II), determining that each sensory-channel occurs at most once in each sentence. For example, if the number of words in a sentence is $N_d = 4$, $F_d$ can become a sequence such as (a, c, p, o), (a, p, c, o), or (p, c, o, a). The possible values of $F_d$ are constrained by $\lambda$ to permutations of the four sensory-channels. The number of permutations is ${}_4P_{N_d} = 4!/(4 - N_d)!$.

(B) The proposed method without the mutual exclusivity constraint (w/o MEC-II).

This method does not have the mutual exclusivity constraint (MEC-II). This means that several words in a sentence may relate to the same sensory-channel. For example, if the number of words in a sentence is $N_d = 4$, $F_d$ can become a sequence such as (a, o, c, o), (a, p, p, o), or (o, o, o, o), in addition to the examples given for (A). The possible values of $F_d$ are constrained by $\lambda$ to repeated permutations of the four sensory-channels. The number of repeated permutations is ${}_4\Pi_{N_d} = 4^{N_d}$. In this case, the robot needs to consider more candidate pairings between sensory-channels and words than in method (A).

(C) The multilayered multimodal latent Dirichlet allocation (mMLDA) (Attamimi et al., 2016).

This method is based on mMLDA. In this research, the method was modified from the original mMLDA so that it applies to our task and can be compared with the proposed method. In particular, the emission probability for each sensory-channel was changed from a categorical distribution to a Gaussian distribution. This means that the multimodal categorization is based on a Gaussian distribution for each sensory-channel and a categorical distribution for word information. This method relates all observed words in a situation to all observed sensory-channel information in that situation. This method neither has the mutual exclusivity constraints (MEC-I and II) nor selects the sensory-channel by words, i.e., $F_d$ is not estimated.
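The sizes of the candidate sets of $F_d$ for methods (A) and (B) can be checked directly; the following snippet enumerates the permutations (MEC-II) and repeated permutations (w/o MEC-II) of the four sensory-channels:

```python
from itertools import permutations, product
from math import factorial

channels = ("a", "p", "c", "o")

for N_d in range(1, 5):
    n_perm = len(list(permutations(channels, N_d)))    # method (A): 4P_Nd
    n_rep  = len(list(product(channels, repeat=N_d)))  # method (B): 4^Nd
    assert n_perm == factorial(4) // factorial(4 - N_d)
    assert n_rep == 4 ** N_d
    print(N_d, n_perm, n_rep)
```

For one- or two-word sentences the two candidate sets are close in size (4 vs. 4 and 12 vs. 16), while for four-word sentences MEC-II cuts the candidates from 256 down to 24, which matches the trend reported in the learning results below.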

# **4.2. Cross-Situational Learning**

# 4.2.1. Experimental Procedure and Conditions

We conducted an experiment to learn the categories for each sensory-channel and the words associated with each category. **Figure 3** shows the procedure for obtaining and processing data. We describe the experimental procedure for CSL as follows:


The above process is carried out many times in different situations. The robot learns multiple categories and word meanings by using multimodal data observed in many trials.

The number of trials was $D = 20$ and 40 for CSL. The number of objects $M_d$ on the table in each trial ranged from one to three. The number of words $N_d$ in the sentence ranged from zero to four. We assume that a word related to each sensory-channel is spoken at most once in each sentence. The word order in the sentences was changed. This experiment used 14 kinds of words: "reach," "touch," "grasp," "look-at," "front," "left," "right," "far," "green," "red," "blue," "box," "cup," and "ball." The upper limit on the number of categories for each sensory-channel was $K = 10$, i.e., the number of word distributions was $L = 40$. The number of iterative cycles used for Gibbs sampling was 200. The hyperparameters were $\alpha = 1.0$, $\gamma = 0.1$, $\mathbf{m}_0 = \mathbf{0}_{x_{\dim}}$, $\kappa_0 = 0.001$, $V_0 = \mathrm{diag}(0.01, 0.01)$, and $\nu_0 = x_{\dim} + 2$, where $x_{\dim}$ denotes the number of dimensions of each sensory-channel $x$ and $\mathbf{0}_{x_{\dim}}$ denotes the zero vector in $x_{\dim}$ dimensions. PCA is used to reduce the object features to 30 dimensions. The color features are quantized to 10 dimensions by k-means.
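The feature preprocessing described above can be sketched as follows. The raw feature dimensionalities, the histogram form of the quantized color feature, and the plain SVD/k-means implementations are assumptions for illustration; the paper does not specify its implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans_quantize(X, k, n_iter=20, rng=rng):
    """Plain k-means; returns a normalized k-bin histogram of cluster
    assignments, i.e., one quantized color feature per object region."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    hist = np.bincount(labels, minlength=k).astype(float)
    return hist / hist.sum()

# Hypothetical raw data: 100 object-feature vectors of dimension 200,
# and 500 RGB pixels sampled from an object region.
obj_feats = rng.normal(size=(100, 200))
pixels = rng.random((500, 3))

o_d = pca_reduce(obj_feats, 30)       # object features -> 30 dimensions
c_d = kmeans_quantize(pixels, 10)     # color features  -> 10-dim histogram
print(o_d.shape, c_d.shape)           # (100, 30) (10,)
```

In practice a library implementation (e.g., an off-the-shelf PCA or k-means routine) would replace these loops; the sketch only fixes the shapes that the GMM categorization consumes.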

We describe the criteria for the words uttered for the action category as follows: "reach" corresponds to the robot extending its right hand toward an object without its fingers making contact; "touch" corresponds to the robot touching an object with its fingers relatively open; "grasp" corresponds to the robot's hand firmly holding an object; "look-at" corresponds to the robot not moving its right hand and only focusing on the object of attention. Based on these criteria, the tutor determines an action word. In particular, "reach" and "touch" are similar; the only difference is whether the hand touches the object.

We evaluate the estimation accuracy of the learning results by using uncertain teaching sentences. Each sentence contains four words or fewer, in varying order. We compare the accuracy of the three methods as the word information is reduced and as the number of learning trials is changed. We evaluated the methods according to the following metrics.

*•* Adjusted Rand index (ARI)

We compare the matching rate between the estimated latent variables *z* for each sensory-channel and the true categorization results. The evaluation of this experiment uses the ARI (Hubert and Arabie, 1985), which is a measure of the degree of similarity between two clustering results.

*•* Estimation accuracy rate of *F<sup>d</sup>* (EAR)

The estimation results of the sensory-channels corresponding to the words are evaluated as follows:

$$\text{EAR} = 1 - \frac{\text{The number of estimation errors}}{\text{The number of all of utilized words}}.\tag{32}$$
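Both metrics can be implemented compactly. The ARI below follows the standard contingency-table formula of Hubert and Arabie (1985), and the EAR follows equation (32); encoding the estimated frames as flat lists of channel labels is an assumption for illustration:

```python
import numpy as np
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """ARI (Hubert and Arabie, 1985) computed from the contingency table."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    classes, t_idx = np.unique(t, return_inverse=True)
    clusters, p_idx = np.unique(p, return_inverse=True)
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    for i, j in zip(t_idx, p_idx):
        table[i, j] += 1
    sum_ij = sum(comb(int(n), 2) for n in table.ravel())
    sum_a = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in table.sum(axis=0))
    expected = sum_a * sum_b / comb(len(t), 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

def estimation_accuracy_rate(true_channels, est_channels):
    """EAR of eq. (32): 1 - (# channel-assignment errors / # words)."""
    errors = sum(tc != ec for tc, ec in zip(true_channels, est_channels))
    return 1.0 - errors / len(true_channels)

# Identical partitions give ARI = 1 even when the label names differ.
assert adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
print(estimation_accuracy_rate(list("apco"), list("apoo")))  # 0.75
```

Because the ARI compares partitions rather than label names, it is insensitive to the arbitrary ordering of the category indices produced by Gibbs sampling, which is why it is a suitable measure here.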

#### 4.2.2. Learning Results and Evaluation

The learning results obtained by using the proposed method are presented here. Forty trials were used, and the number of words was four in all uttered sentences. **Figure 4A** shows the word probability distributions $\theta$; higher probability values are represented by darker shades. If the relationship between the word and the sensory-channel is estimated correctly, the ranges within the thick-bordered boxes show higher probabilities. For example, the action categories show higher probabilities for the action words ("touch," "look-at," "reach," and "grasp"), and the same holds for the categories of the other sensory-channels. In the position and color categories, the estimated number of categories was equal to the number of types of words representing the sensory-channel. In the action category, the words "touch," "reach," and "grasp" were associated across several categories, and these words were confused with each other. We consider that the actions representing these words are ambiguous and similar. On the other hand, we consider that these actions were divided into several categories because the posture information changes depending on the position of the target object. **Figure 4B** shows the learning result for the position category $\phi^{\mathbf{p}}$. For example, the position category p1 is associated with the word "front" (see **Figures 4A,B**). **Figures 4C,D** show examples of the categorization results for the object and color categories. The object categorization result was not perfect. We consider that the robot found it difficult to clearly distinguish objects of different shapes because the 3D models of the objects had simple shapes. The color categorization result was perfect. In this case, $F_d$ was correctly estimated in all of the trials. The results demonstrate that the proposed method was able to accurately associate each word with its respective sensory-channel.

We performed the learning scenarios 10 times for each method. **Tables 1A,B** show the evaluation values of the experimental results for 20 and 40 trials. The rate of omitted words (ROW), expressed as a percentage, represents the uncertainty of the teaching sentences. For example, in 20 trials the total number of words is 80 when the ROW is 0%, 64 words for 20%, 48 words for 40%, and 32 words for 60%. Similarly, in 40 trials the total number of words is 160 for a ROW value of 0% and 96 words for 40%. ARI\_a, ARI\_p, ARI\_o, and ARI\_c are the ARI values of the action, position, object, and color categories, respectively. The EAR values of mMLDA were not calculated because this method does not have $F_d$. If the ROW value is 100% (no words), the three methods are equivalent; we denote this condition as ALL, i.e., a GMM for each sensory-channel. We report the ARI values of ALL only as reference values because ALL is not CSL. The EAR value obtained for the proposed method was higher than those obtained for the other methods. When the ROW decreased, i.e., the word information increased, the evaluation values tended to increase. In particular, the categorization of the position category benefited from the increase in word information. In addition, when the number of trials increased, the evaluation values tended to increase. This result suggests that the robot can learn word meanings more effectively by accumulating more experience, even in more complicated and uncertain situations. When the number of words was small (i.e., the ROW value was 40 or 60%), the difference between the EAR values of methods (A) and (B) was small (approximately 0.02) in 20 trials. However, when the number of words was large, the difference between the EAR values of methods (A) and (B) increased, and the EAR value of method (A) was larger than that of (B). Consequently, when the number of words was small, e.g., sentences including one or two words, the presence or absence of the MEC-II had almost no influence because the numbers of possible values of $F_d$ of methods (A) and (B) were close. On the other hand, when the number of words was large, e.g., sentences including four words, the MEC-II worked well because the number of possible values of $F_d$ of method (A) was properly narrowed down.

**FIGURE 4** | **(A)** Word probability distribution across the multiple categories; darker shades represent higher probability values. The pair consisting of a letter and a number on the left of the table is the index of the word distribution, which represents the sensory-channel related to the word distribution and the index of the category. Note that category indices are not shown; they are merged and not used because the number of the categories is automatically estimated by the nonparametric Bayesian method. **(B)** Learning result of the position category; for example, the index of position category p1 corresponds to the word "front." The point group of each color represents each Gaussian distribution of the position category. The crosses in the different colors represent the object positions of the learning data. Each color represents a position category. **(C)** Example of categorization results of object category; **(D)** example of categorization results of color category.

# **4.3. Action Generation Task**

## 4.3.1. Experimental Procedure and Conditions

In this experiment, the robot generates the action regarding the sentence spoken by the human tutor. The robot uses the learning results of the CSL task in Section 4.2. The robot selects the object of attention from among the objects on the table. In addition, the robot performs the action on the object of attention. In this task, the robot cannot use joint attention. Therefore, the robot needs to overcome both the problems of CSL-I and II. We describe the process of action generation as follows:


The above process is carried out many times on different sentences.

We compare the three methods by quantitative evaluation on the action generation task. We evaluate the accuracy of the selection of the object of attention. In addition, we evaluate the accuracy of an action of the robot based on questionnaire evaluation by participants. The robot generates an action from the tutor's spoken sentence in a situation. Participants check videos of the action generated by the robot and select a word representing the robot's action. We calculate the word accuracy rate (WAR) of the words selected by participants and the true words spoken by the tutor. In addition, we calculate the object accuracy rate (OAR) representing the rate at which the robot correctly selected the object instructed by the tutor.
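The WAR and OAR reduce to simple matching rates between selections and ground truth; a sketch with hypothetical trial data:

```python
def word_accuracy_rate(selected_words, true_words):
    """WAR: fraction of participant-selected words that match the words
    the tutor actually spoke."""
    hits = sum(s == t for s, t in zip(selected_words, true_words))
    return hits / len(true_words)

def object_accuracy_rate(selected_objects, instructed_objects):
    """OAR: fraction of trials in which the robot attended to the object
    the tutor instructed."""
    hits = sum(s == t for s, t in zip(selected_objects, instructed_objects))
    return hits / len(instructed_objects)

# Hypothetical data: three trials' participant choices and object indices.
war = word_accuracy_rate(["grasp", "touch", "reach"], ["grasp", "reach", "reach"])
oar = object_accuracy_rate([2, 0, 1], [2, 0, 1])
print(war, oar)
```

In the actual experiment the WAR is averaged over the eight participants' video judgments, while the OAR is computed directly from the robot's attention selections.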

We performed the action generation task for a total of 12 different test-sentences. The test-sentences included four words representing the four sensory-channels. The placement of objects on the table was not used during the learning trials. In addition, the word order of the sentences uttered during the action generation task differed from that of the sentences uttered during the CSL task. Eight participants checked 36 videos of the robot's actions.

#### 4.3.2. Results

**FIGURE 5** | Example of results of the action generation task in the iCub simulator. **(A)** Reach front blue cup. **(B)** Grasp right green ball. **(C)** Touch left red box.

**Figure 5** shows three examples of the action generation results of the proposed method. **Figure 5A** shows the result of action generation by the robot in response to the sentence "reach front blue cup." **Figure 5B** shows the result of action generation by the robot in response to the sentence "grasp right green ball." **Figure 5C** shows the result of action generation by the robot in response to the sentence "touch left red box." **Table 2** shows the results of the quantitative evaluation of the action generation task. The proposed method enabled the robot to accurately select the object: the object indicated by the tutor and the object selected by the robot coincided in all sentences. In addition, the proposed method showed the highest values for both WAR and OAR. Therefore, the robot could select an appropriate object and perform an action even in situations and for sentences not used for CSL.

# **4.4. Action Description Task**

#### 4.4.1. Experimental Procedure and Conditions

In HRI, the ability of the robot to use the acquired word meanings for a description of the current situation is important. In this experiment, the robot performs an action and speaks the sentence

**TABLE 2** | Results of evaluation values for the action generation using the results of the CSL for 40 trials (ROW is 0%).


*Bold and underscore indicate the highest evaluation values, and bold indicates the second highest evaluation values.*

**TABLE 3** | Experimental results of action description task for 20 and 40 trials.


*Bold and underscore indicate the highest evaluation values, and bold indicates the second highest evaluation values.*

**FIGURE 6** | Confusion matrix of results of the action description task using the learning result for 20 and 40 trials. **(A)** 20 trials; ROW values are (top) 0 and (bottom) 40. **(B)** 40 trials; ROW values are (top) 0 and (bottom) 40.

corresponding to this action. In other words, the robot explains self-action by using a sentence. The robot uses the learning results of the CSL task in Section 4.2. We describe the process of action description as follows:


The above process is carried out many times on different actions. We performed the action description task for a total of 12 actions. The placement of objects on the table was not used during the learning trials. The robot generates a sentence consisting of four words that cover the four sensory-channels. The word order in the sentence is fixed as $F_d = (\text{a}, \text{p}, \text{c}, \text{o})$.

We compare the three methods by quantitative evaluation of the action description task. We evaluate the F1-measure and the accuracy (ACC) between the sentence generated by the robot and the correct sentence decided by the tutor. The evaluation values are calculated by generating the confusion matrix between the predicted words and true words.
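A sketch of the confusion-matrix-based evaluation follows. The macro-averaging convention for the F1-measure is an assumption, since the paper does not state which averaging it uses:

```python
import numpy as np

def confusion_matrix(true_words, pred_words, vocab):
    """Count matrix with true words as rows and predicted words as columns."""
    idx = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)), dtype=int)
    for t, p in zip(true_words, pred_words):
        m[idx[t], idx[p]] += 1
    return m

def macro_f1_and_accuracy(cm):
    """Macro-averaged F1 and overall accuracy from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    prec = np.divide(tp, cm.sum(axis=0), out=np.zeros_like(tp),
                     where=cm.sum(axis=0) > 0)
    rec = np.divide(tp, cm.sum(axis=1), out=np.zeros_like(tp),
                    where=cm.sum(axis=1) > 0)
    denom = prec + rec
    f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(tp),
                   where=denom > 0)
    return f1.mean(), tp.sum() / cm.sum()

# Hypothetical predictions illustrating the "reach"/"touch" confusion.
vocab = ["reach", "touch", "grasp"]
true_w = ["reach", "reach", "touch", "grasp"]
pred_w = ["reach", "touch", "touch", "grasp"]
cm = confusion_matrix(true_w, pred_w, vocab)
f1, acc = macro_f1_and_accuracy(cm)
print(acc)  # 0.75
```

Off-diagonal mass in such a matrix is exactly what the confusion matrices in **Figure 6** visualize: a large entry at (true "reach", predicted "touch") indicates the two words were learned as near-synonyms.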

#### 4.4.2. Results

**Table 3** shows the F1-measure and ACC values of the action description task using the learning results under the different conditions. The proposed method showed the highest evaluation values. **Figures 6A,B** show the confusion matrices of the results of the action description task using the learning results for 20 and 40 training trials. Overall, the robot confused the words "reach" and "touch," similar to the learning result in **Figure 4A**; the robot had difficulty distinguishing between them. In other words, this result suggests that these words were learned as synonyms. When the ROW increased, the evaluation values decreased. For the ROW value of 40% obtained for 20 trials, the robot confused words related to the action and position categories. This could be explained by the robot misunderstanding the correspondence between the word and the sensory-channel because the word information was insufficient and uncertain during CSL with a ROW value of 40% and 20 trials. On the other hand, an increase in the number of learning trials resulted in an increase in the evaluation values. Even if the robot is exposed to uncertain utterances, it can explain its own actions more accurately by gaining more experience. As a result, the robot could acquire the ability to explain self-action through CSL based on the proposed method.

# **5. EXPERIMENT II: REAL iCub ENVIRONMENT**

In this section, we describe the experiment that was conducted by using the real iCub robot. The real-world environment involves more complexity than the simulation environment. We demonstrate that results similar to those of the simulator experiment can be obtained even in a more complicated real environment. We compare three methods, as in Section 4.1. In Section 5.1, we describe the experiment to assess cross-situational learning. In Section 5.2, we describe the experiment relating to the action generation task. In Section 5.3, we describe the experiment relating to the action description task.

# **5.1. Cross-Situational Learning**

#### 5.1.1. Conditions

The experimental procedure is the same as in Section 4.2.1. We use the ARI and EAR as evaluation values. **Figure 7** shows all of the objects that were used in the real environment. We used 14 different objects covering four types (car, cup, ball, and star) and four colors (red, green, blue, and yellow). In the simulation environment, objects of the same type had the same shape; in the real environment, objects of the same type include different shapes. In particular, all the car objects have different shapes, the cup objects have different sizes, and the star objects include one different shape. This experiment used 16 kinds of words: "reach," "touch," "grasp," "look-at," "front," "left," "right," "far," "green," "red," "blue," "yellow," "car," "cup," "ball," and "star." The number of trials was $D = 25$ and 40 for CSL. The number of objects $M_d$ on the table in each trial ranged from one to three. The number of words $N_d$ in the sentence ranged from zero to four. We assume that a word related to each sensory-channel is spoken at most once in each sentence. The word order in the sentences was changed. Object features are reduced to 65 dimensions by PCA. Color features are quantized to 10 dimensions by k-means. The upper limit on the number of categories for each sensory-channel was $K = 10$, i.e., the upper limit on the number of word distributions was $L = 40$. The hyperparameters were $\alpha = 1.0$, $\gamma = 0.1$, $\mathbf{m}_0 = \mathbf{0}_{x_{\dim}}$, $\kappa_0 = 0.001$, $V_0 = \mathrm{diag}(0.01, 0.01)$, and $\nu_0 = x_{\dim} + 2$. The number of iterative cycles used for Gibbs sampling was 200.

**FIGURE 7** | All of the objects used in the real experiments (14 objects including four types and four colors).


# 5.1.2. Learning Results and Evaluation

The example we describe is the learning result for 25 trials and a ROW value of 9%. In this case, the number of categories was set to $K = 5$. **Figure 8A** shows the word distributions $\theta$. In the action category, the robot confused the words "reach" and "touch," as was the case in the simulator experiment. **Figure 8B** shows the learning result of the position category on the table. **Figure 8C** shows the categorization results of the objects. Although the object categorization contained a few mistakes, the results were mostly correct. **Figure 8D** shows the categorization results obtained for the colors, which were successful. Interestingly, two categories corresponding to the word "green" were created because the robot distinguished between bright green and dark green. In addition, the robot was able to learn that both of these categories relate to the word "green."

**FIGURE 8** | **(A)** Word distributions $\theta$. **(B)** Learning result of the position category; the point groups represent the Gaussian distributions of the position category. The crosses of each color represent the object positions of the learning data. Each color represents a position category. The circle represents the area of the white circular table. **(C)** Example of categorization results of object category; **(D)** example of categorization results of color category.

**Table 4** shows the evaluation values of the experimental results for 25 and 40 trials. There was not much difference in ARI values between the methods and between different conditions of ROW values. The EAR values of the proposed method were higher than those of the other methods. An increase in the number of trials led to an increase in the evaluation values, similar to the simulation results.

# **5.2. Action Generation Task**

#### 5.2.1. Conditions

In this experiment, the robot generates the action corresponding to the sentence spoken by the human tutor. The robot uses the learning results of the CSL task in Section 5.1. The experimental procedure is the same as in Section 4.3.1. We evaluated the accuracy of object selection (the OAR values) using the CSL results for 25 trials. We performed the action generation task for a total of 12 different test-sentences, each of which comprised four words representing the four sensory-channels. The placement of objects on the table was different in each trial and differed from the placements used during the learning trials.

### 5.2.2. Results

**Figure 9** shows examples of the results of the action generation task. **Figure 9A** shows the result of action generation by the robot for the sentence "grasp front red ball." **Figure 9B** shows the result of action generation by the robot for the sentence "reach right red cup." **Figure 9C** shows the result of action generation by the robot for the sentence "look-at left yellow cup." The resulting OAR values of the proposed method (A) and the method without MEC-II (B) were 1.000, and the OAR value of mMLDA was 0.833. As a result, the robot could select an appropriate object even in situations and for sentences not used during CSL.

# **5.3. Action Description Task**

#### 5.3.1. Conditions

In this experiment, the robot performs the action and speaks the sentence regarding this action. The robot uses the learning results of the CSL task in Section 5.1. The experimental procedure is the same as in Section 4.4.1. We use the F1-measure and ACC as evaluation values. We performed the action description task for a total of 10 actions. The placement of objects on the table was different for each trial and differed from those used during the learning trials. The robot generates a sentence of four words representing the four sensory-channels. The word order in the sentence is fixed as *F<sup>d</sup>* = (a, p, c, o).

#### 5.3.2. Results

**TABLE 5** | Experimental results of the action description task for 25 and 40 trials.

*Bold and underscore indicate the highest evaluation values, and bold indicates the second highest evaluation values.*

**FIGURE 9** | Examples of results of the action generation task with the real iCub. **(A)** Grasp front red ball. **(B)** Reach right red cup. **(C)** Look-at left yellow cup.

**Table 5** shows the F1-measure and ACC values of the action description task using the learning results under the different conditions. The proposed method showed higher evaluation values than the other methods. **Figure 10** shows the confusion matrices between predicted words and true words using the learning results for 25 and 40 trials. In the action category, there was a tendency to confuse the words "reach" and "touch," similar to the simulation. The major difference in the results of the methods was found in the words of the action and object categories. Even if the accuracy of categorization is low, as in the action and object categories, and the categories include uncertainty, the robot could describe the action more correctly when the correspondence between the word and the sensory-channel was estimated more properly.

# **6. CONCLUSION**

In this paper, we have proposed a Bayesian generative model that can estimate multiple categories and the relationships between words and multiple sensory-channels. We performed cross-situational learning experiments using the simulated and real iCub robot in complex situations. The experimental results showed that a robot can learn the combination of a sensory-channel and a word from their co-occurrence in complex situations. The proposed method could learn word meanings from uncertain sentences, i.e., sentences including four or fewer words in varying order. In comparative experiments, we showed that the mutual exclusivity constraint is effective for lexical acquisition by CSL. In addition, we performed experiments on the action generation and action description tasks with the robot that had learned word meanings. The action generation task confirmed that the robot could select an object successfully and generate an action even in situations other than those it encountered during the learning scenario. The action description task confirmed that the robot was able to use the learned word meanings to explain the current situation.

The accuracy of the categorization of objects and actions tended to be lower than that of the color and position categories. In this paper, we used a GMM for the categorization of each sensory-channel. MLDA achieved highly accurate object categorization by integrating multimodal information (Nakamura et al., 2011a). The accuracy of object categorization could therefore be improved by using MLDA instead of a GMM, i.e., by increasing the number of sensory-channels for the object categories. In the action categorization, the robot confused "reach" and "touch" because these are similar actions; more diverse actions could be classified more accurately. In addition, we used static features as action information. The accuracy could be improved by segmenting the time-series data of the actions with a method based on the hidden Markov model (HMM) (Sugiura et al., 2011; Nakamura et al., 2016).

In this study, we performed the action generation task with sentences including four words corresponding to the four sensory-channels. However, action instructions may also contain uncertainty. In the future, we plan to investigate what kind of action the robot performs based on uncertain utterances, such as when the number of words is fewer than four, when identical objects exist on the table, and when the sentence contains a wrong word. If the robot can learn word meanings more accurately, it should be able to perform an action successfully even from an utterance including uncertainty. A detailed and quantitative evaluation of such advanced action generation tasks is a subject for future work.

Other factors we aim to address in future studies are grammatical information, which was not considered in the present study, and sentences containing five words or more. We showed that the robot could accurately learn word meanings without considering grammar in the scenario of this study. However, it is important to include even more complicated situations with more natural sentences such as "grasp the red box beside the green cup." More complicated sentences would require us to consider a method that takes the grammatical structure into account. We, therefore, aim to extend the proposed method to more complicated situations and natural sentences. Attamimi et al. (2016) used HMM for the estimation of transition probabilities between words based on concepts, as a post-processing step of mMLDA. However, they were unable to use grammatical information to learn the relationships between words and categories. Hinaut et al. (2014) proposed a method based on recurrent neural networks for learning grammatical constructions by interacting with humans, which is related to the study of an autobiographical memory reasoning system (Pointeau et al., 2014). Integrating such methods with the proposed method may be effective for action generation and action description using more complicated sentences.

In this paper, we focused on mutual exclusivity of words indicating categories in language acquisition. However, there are hierarchies of categories, e.g., ball and doll belong to the toy category. Griffiths et al. (2003) proposed a hierarchical LDA (hLDA), which is a hierarchical clustering method based on a Bayesian generative model, and it was applied to objects (Ando et al., 2013) and places (Hagiwara et al., 2016). We consider the possibility of applying hLDA to the proposed method for hierarchical categorization of sensory-channels.

For future work, we also plan to demonstrate the effectiveness of the proposed method by employing a long-term experiment that uses a larger number of objects. We believe that the robot can learn more categories and word meanings based on more experience.



In addition, as a further extension of the proposed method, we intend to increase the types of sensory-channels, add positional relationships between objects, and identify words that are not related to any sensory-channel. For example, Aly et al. (2017) learned object categories and spatial prepositions by using a model similar to ours. It would be possible to merge the proposed method with this model within the theoretical framework of the Bayesian generative model. This combined model is expected to enable the robot to learn many different word meanings from situations more complicated than the scenario in this study.

# **AUTHOR CONTRIBUTIONS**

AT, TT, and AC conceived, designed the research, and wrote the paper. AT performed the experiment and analyzed the data.

# **FUNDING**

This work was partially supported by JST CREST and JSPS KAKENHI Grant Number JP17J07842. This material is partially based upon work supported by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under Award No. FA9550-15-1-0025, and by the EU H2020 Marie Skłodowska-Curie European Industrial Doctorate APRIL (674868).

# **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at https://www.frontiersin.org/articles/10.3389/fnbot. 2017.00066/full#supplementary-material.

**VIDEO S1 |** Cross-situational learning scenario using a real iCub.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2017 Taniguchi, Taniguchi and Cangelosi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Segmenting Continuous Motions with Hidden Semi-markov Models and Gaussian Processes

Tomoaki Nakamura<sup>1</sup> \*, Takayuki Nagai <sup>1</sup> , Daichi Mochihashi <sup>2</sup> , Ichiro Kobayashi <sup>3</sup> , Hideki Asoh<sup>4</sup> and Masahide Kaneko<sup>1</sup>

*<sup>1</sup> Department of Mechanical Engineering and Intelligent Systems, The University of Electro-Communications, Chofu-shi, Japan, <sup>2</sup> Department of Mathematical Analysis and Statistical Inference, Institute of Statistical Mathematics, Tachikawa, Japan, <sup>3</sup> Department of Information Sciences, Faculty of Sciences, Ochanomizu University, Bunkyo-ku, Japan, <sup>4</sup> Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan*

Humans divide perceived continuous information into segments to facilitate recognition. For example, humans can segment speech waves into recognizable morphemes. Analogously, continuous motions are segmented into recognizable unit actions. People can divide continuous information into segments without using explicit segment points. This capacity for unsupervised segmentation is also useful for robots, because it enables them to flexibly learn languages, gestures, and actions. In this paper, we propose a Gaussian process-hidden semi-Markov model (GP-HSMM) that can divide continuous time series data into segments in an unsupervised manner. Our proposed method consists of a generative model based on the hidden semi-Markov model (HSMM), the emission distributions of which are Gaussian processes (GPs). Continuous time series data is generated by connecting segments generated by the GP. Segmentation can be achieved by using forward filtering-backward sampling to estimate the model's parameters, including the lengths and classes of the segments. In an experiment using the CMU motion capture dataset, we tested GP-HSMM with motion capture data containing simple exercise motions; the results of this experiment showed that the proposed GP-HSMM was comparable with other methods. We also conducted an experiment using karate motion capture data, which is more complex than exercise motion capture data; in this experiment, the segmentation accuracy of GP-HSMM was 0.92, which outperformed other methods.

#### Edited by:

*Ganesh R. Naik, Western Sydney University, Australia*

#### Reviewed by:

*Douglas Scott Blank, Bryn Mawr College, United States
Suparerk Janjarasjitt, Ubon Ratchathani University, Thailand
Marc De Kamps, University of Leeds, United Kingdom*

# \*Correspondence:

*Tomoaki Nakamura tnakamura@uec.ac.jp*

Received: *22 May 2017* Accepted: *29 November 2017* Published: *21 December 2017*

#### Citation:

*Nakamura T, Nagai T, Mochihashi D, Kobayashi I, Asoh H and Kaneko M (2017) Segmenting Continuous Motions with Hidden Semi-markov Models and Gaussian Processes. Front. Neurorobot. 11:67. doi: 10.3389/fnbot.2017.00067*

Keywords: motion segmentation, Gaussian process, hidden semi-Markov model, motion capture data

# 1. INTRODUCTION

Human beings typically divide perceived continuous information into segments to enable recognition. For example, humans can segment speech waves into recognizable morphemes. Similarly, continuous motions are segmented into recognizable unit actions. In particular, motions are divided into smaller components called motion primitives, which are used for imitation learning and motion generation (Argall et al., 2009; Lin et al., 2016). It is possible for us to divide continuous information into segments without using explicit segment points. This capacity for unsupervised segmentation is also useful for robots, because it enables them to flexibly learn languages, gestures, and actions.

However, segmentation of time series data is a difficult task. When time series data is segmented, the data points in the sequence must be classified, and each segment's start and end points must be determined. Moreover, each segment affects other segments because of the nature of time series data. Hence, segmentation of time series data requires the exploration of all possible segment lengths and classes. However, this exploration process is difficult; in many studies, the lengths are not estimated explicitly or heuristics are used to reduce computational complexity. Furthermore, in the case of motions, the sequences vary because of dynamic characteristics, even though the same movements are performed. For segmentation of actual human motions, we must address such variations.

In this paper, we propose GP-HSMM (Gaussian process-hidden semi-Markov model), a novel method that divides time series motion data into unit actions by using a stochastic model to estimate their lengths and classes. The proposed method is a hidden semi-Markov model (HSMM) with a Gaussian process (GP) emission distribution, where each state represents a unit action. **Figure 1** shows an overview of the proposed GP-HSMM. The observed time series data is generated by connecting segments generated by each class. The segment points and segment classes are estimated by learning the parameters of the model in an unsupervised manner. Forward filtering-backward sampling (Uchiumi et al., 2015) is used for the learning process, in which the segment lengths and segment classes are sampled simultaneously.

## 2. RELATED WORK

Various studies have focused on learning motion primitives from manually segmented motions (Gräve and Behnke, 2012; Manschitz et al., 2015). Manschitz et al. proposed a method to generate sequential skills by using motion primitives that are learned in a supervised manner. Gräve et al. proposed segmenting motions using motion primitives that are learned by a supervised hidden Markov model. In these studies, the motions are segmented and labeled in advance. However, we consider it difficult to segment and label all possible motion primitives in this way.

Additionally, some studies have proposed unsupervised motion segmentation; however, these rely on heuristics. For instance, Wächter et al. proposed a method to segment human manipulation motions based on contact relations between the end-effectors and objects in a scene (Wächter and Asfour, 2015); in their method, the points at which the end-effectors make contact with an object are treated as motion boundaries. We believe this method works well in limited scenes; however, there are many motions, such as gestures and dances, in which objects are not manipulated. Lioutikov et al. also proposed unsupervised segmentation; however, to reduce computational costs, this technique requires the possible boundary candidates between motion primitives to be specified in advance (Lioutikov et al., 2015). The segmentation therefore depends on those candidates, and motions cannot be segmented correctly if the correct candidates are not selected. In contrast, our proposed method does not require such candidates; all possible cutting points are considered through forward filtering-backward sampling, which uses the principles of dynamic programming. In some methods (Fod et al., 2002; Shiratori et al., 2004; Lin and Kulić, 2012), motion features (such as the zero velocity of joint angles) are used for motion segmentation; however, such features cannot be applied to all motions. Takano et al. use the error between actual and predicted movements as the criterion for specifying boundaries (Takano and Nakamura, 2016). However, the threshold must be manually tuned according to the motions to be segmented, even though their underlying HMM is a stochastic model. We consider such an assumption unnatural from the viewpoint of stochastic models; boundaries should instead be determined by the stochastic model itself. Our proposed method uses no such heuristics and assumptions, and instead formulates the segmentation entirely within a stochastic model.

Fox et al. have proposed unsupervised segmentation for discovering a set of latent, shared dynamical behaviors in multiple time series (Fox et al., 2011). They introduce a beta process, which represents the sharing of motion primitives across multiple motions, into an autoregressive HMM. They formulate the segmentation using a stochastic model, and no heuristics are used. However, in their method, runs of consecutive data points classified into the same state are extracted as segments, and segment lengths are not estimated explicitly. Because the state can switch over short time spans, overly short segments can be estimated; indeed, they reported that some true segments were split into two or more categories, and that those shorter segments had to be bridged in their experiment. In contrast, our proposed method classifies data points into states and uses an HSMM to estimate segment lengths, and can thus prevent states from switching over short time spans.

Matsubara et al. proposed an unsupervised segmentation method called AutoPlait (Matsubara et al., 2014). This method uses multiple HMMs, each of which represents a fixed pattern, and transitions between the HMMs are allowed; time series data is thus segmented at the points where the state changes to another HMM's state. However, we believe that HMMs are too simple to represent complicated sequences such as motions. **Figure 2** illustrates how an HMM represents time series data: the graph on the right shows the mean and standard deviation learned by an HMM from the data points shown in the graph on the left. Because an HMM represents time series data using only a mean and standard deviation, details of the data can be lost. We therefore use Gaussian processes, which are non-parametric models that can represent complex time series data.

The field of natural language processing has also produced literature related to sequence data segmentation. For example, unsupervised morphological analysis has been proposed for segmenting sequence data (Goldwater, 2006; Mochihashi et al., 2009; Uchiumi et al., 2015). Goldwater et al. proposed a method to divide sentences into words by estimating the parameters of a 2-gram language model based on a hierarchical Dirichlet process. The parameters are estimated in an unsupervised manner by Gibbs sampling (Goldwater, 2006). Mochihashi et al. proposed a nested Pitman-Yor language model (NPYLM) (Mochihashi et al., 2009). In this method, parameters of an n-gram language model based on the hierarchical Pitman-Yor process are estimated via the forward filtering-backward sampling algorithm. NPYLM can thus divide sentences into words more quickly and accurately than the method proposed in (Goldwater, 2006). Moreover, Uchiumi et al. extended the NPYLM to a Pitman-Yor hidden semi-Markov model (PY-HSMM) (Uchiumi et al., 2015) that can divide sentences into words and estimate the parts of speech (POS) of the words by sampling not only words, but also POS in the sampling phase of the forward filtering-backward sampling algorithm. However, these relevant studies aimed to divide symbolized sequences (such as sentences) into segments, and did not consider analogous divisions in continuous sequence data, such as that obtained by analyzing human motion.

Taniguchi et al. proposed a method to divide continuous sequences into segments by utilizing NPYLM (Taniguchi and Nagasaka, 2011). In their method, continuous sequences are discretized and converted into discrete-valued sequences using the infinite hidden Markov model (Fox et al., 2007), and the discrete-valued sequences are then divided into segments by NPYLM. With this method, motions can be recognized by the learned model, but cannot be generated directly, because they are discretized. Moreover, segmentation based on NPYLM does not work well if errors occur in the discretization step.

Therefore, we propose a method to divide a continuous sequence into segments without using discretization. This method divides continuous motions into unit actions. Our proposed method is based on an HSMM whose emission distribution is a GP, which represents continuous unit actions. To learn the model parameters, we use forward filtering-backward sampling, in which segment points and classes are sampled simultaneously. However, our proposed method also has limitations. One limitation is that it requires the number of motion classes to be specified in advance, whereas this number is estimated automatically in methods such as those of Fox et al. (2011) and Matsubara et al. (2014). Another limitation is that its computational cost is very high, owing to the numerous recursive calculations. We discuss these limitations in the experiments.

# 3. GAUSSIAN PROCESS-HIDDEN SEMI-MARKOV MODEL

**Figure 3** shows a graphical representation of the proposed GP-HSMM. In this figure, $c\_j\ (j = 1, 2, \cdots, J)$ denotes the class of the $j$-th segment, and each segment $\mathbf{x}\_j$ is generated by a Gaussian process whose parameters are denoted by $\mathbf{X}\_c$, via the following generative process:

$$c\_j \sim P(c|c\_{j-1}),\tag{1}$$

$$\mathbf{x}\_j \sim \mathcal{GP}(\mathbf{x}|\mathbf{X}\_{c\_j}),\tag{2}$$

where $\mathbf{X}\_c$ represents the set of segments classified into class $c$. Segments are generated by this generative process, and the observed time-series data $\mathbf{s}$ is generated by connecting the segments.

#### 3.1. Gaussian Process

In this study, we utilize Gaussian process regression, which learns the emission $x\_i$ at time step $i$ within a segment. This makes it possible to represent each unit action as a continuous trajectory. Given the pairs $(\mathbf{i}\_c, \mathbf{X}\_c)$ of time steps and corresponding emissions from the segments belonging to class $c$, the predictive distribution of the emission $x$ at time step $i$ is a Gaussian distribution:

$$p(x|i, \mathbf{X}\_c, \mathbf{i}\_c) = \mathcal{N}\left(\mathbf{k}^T \mathbf{C}^{-1} \mathbf{X}\_c,\; c - \mathbf{k}^T \mathbf{C}^{-1} \mathbf{k}\right),\tag{3}$$

where k(·, ·) represents the kernel function and **C** is a matrix whose elements are

$$C(i\_p, i\_q) = k(i\_p, i\_q) + \beta^{-1} \delta\_{pq}.\tag{4}$$

Here, $\beta$ is a hyperparameter that represents noise in the observation. In Equation (3), $\mathbf{k}$ is the vector whose elements are $k(i\_p, i)$, and $c$ is the scalar value $k(i, i)$. Using the kernel function, a GP can learn a time-series sequence that contains complex changes. We use the following Gaussian kernel, which is commonly used for Gaussian process regression:

$$k(i\_p, i\_q) = \theta\_0 \exp\left(-\frac{\theta\_1}{2}\|i\_p - i\_q\|^2\right) + \theta\_2 + \theta\_3 i\_p i\_q,\tag{5}$$

where $\theta\_{\ast}$ denotes the kernel parameters. **Figure 4** shows examples of Gaussian processes: in each pair of graphs, the left graph shows the training data points $(\mathbf{i}\_c, \mathbf{X}\_c)$, and the right graph shows the learned predictive distribution $p(x|i, \mathbf{X}\_c, \mathbf{i}\_c)$. One can see that the standard deviation decreases as the number of training points increases. If the emission at time step $i$ is a multidimensional vector $\mathbf{x} = (x\_0, x\_1, \cdots)$, we assume that each dimension is generated independently, and the predictive distribution $\mathcal{GP}(\mathbf{x}|\mathbf{X}\_c)$ is computed as follows:

$$\mathcal{GP}(\mathbf{x}|\mathbf{X}\_c) = p(x\_0|i, \mathbf{X}\_{c,0}, \mathbf{i}\_c) \times p(x\_1|i, \mathbf{X}\_{c,1}, \mathbf{i}\_c) \times p(x\_2|i, \mathbf{X}\_{c,2}, \mathbf{i}\_c) \times \cdots \tag{6}$$

Based on this probability, similar segments can be classified into the same class.
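As a concrete illustration of Equations (3)-(5), the predictive distribution for one emission dimension can be sketched with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation; the kernel parameters `th` and noise precision `beta` below are illustrative values:

```python
import numpy as np

def kernel(ip, iq, th):
    """Gaussian kernel of Eq. (5); th = (theta0, theta1, theta2, theta3)."""
    ip, iq = np.asarray(ip, float), np.asarray(iq, float)
    sq = (ip[:, None] - iq[None, :]) ** 2
    return th[0] * np.exp(-0.5 * th[1] * sq) + th[2] + th[3] * ip[:, None] * iq[None, :]

def gp_predict(i_train, x_train, i_test, beta, th):
    """Mean and variance of the predictive Gaussian of Eq. (3) for one
    dimension of the emission (Eq. (6) treats dimensions independently)."""
    C = kernel(i_train, i_train, th) + np.eye(len(i_train)) / beta  # Eq. (4)
    Cinv = np.linalg.inv(C)
    k = kernel(i_train, i_test, th)            # columns are k(i_p, i)
    c = np.diag(kernel(i_test, i_test, th))    # scalars k(i, i)
    mean = k.T @ Cinv @ np.asarray(x_train, float)
    var = c - np.sum(k * (Cinv @ k), axis=0)
    return mean, var

# Example: learn one 1-D "unit action" trajectory and predict an interior point.
i_tr = np.linspace(0.0, 1.0, 20)        # time steps within the segment
x_tr = np.sin(2 * np.pi * i_tr)         # emissions X_c for one class
mean, var = gp_predict(i_tr, x_tr, np.array([0.25]),
                       beta=100.0, th=(1.0, 100.0, 0.0, 0.0))
```

With dense training data the predictive mean tracks the trajectory and the predictive variance shrinks, which is the behavior Figure 4 illustrates.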

### 3.2. Learning of GP-HSMM

#### 3.2.1. Blocked Gibbs Sampler

Segments and their classes in the observed sequences are estimated based on dynamic programming and sampling. For efficient sampling, we use a blocked Gibbs sampler, which jointly samples the segments and classes of one observed sequence at a time. In the initialization phase, all observed sequences are first randomly divided into segments. The segments $\mathbf{x}\_{nj}\ (j = 1, 2, \cdots, J\_n)$ of observed sequence $\mathbf{s}\_n$ are then removed from the training data, and the Gaussian process parameters $\mathbf{X}\_c$ and the HSMM transition probabilities $P(c|c')$ are updated. The segments $\mathbf{x}\_{nj}$ and their classes $c\_{nj}\ (j = 1, 2, \cdots, J\_n)$ are then sampled as follows:

$$(\mathbf{x}\_{n1}, \cdots, \mathbf{x}\_{nJ\_n}), (c\_{n1}, \cdots, c\_{nJ\_n}) \sim P(\mathbf{X}, \mathbf{c}|\mathbf{s}\_n),\tag{7}$$

where $\mathbf{X}$ is the set of segments into which $\mathbf{s}\_n$ is divided, and $\mathbf{c}$ denotes the classes of those segments. To carry out this sampling exactly, the probability of all possible segmentations $\mathbf{X}$ and class assignments $\mathbf{c}$ must be computed; however, these probabilities are difficult to compute naively because the number of potential combinations is very large. Thus, we utilize forward filtering-backward sampling, which we explain below. After sampling $\mathbf{x}\_{nj}$ and $c\_{nj}$, the Gaussian process parameters $\mathbf{X}\_c$ and the transition probabilities $P(c|c')$ are updated by adding the sampled segments back to the training data. The segments and the parameters of the Gaussian processes are optimized alternately by iterating this procedure. Algorithm 1 shows the pseudocode of the blocked Gibbs sampler, where $N\_{c\_{nj}}$ and $N\_{c\_{nj}, c\_{n,j+1}}$ are the counts used to compute the transition probability in Equation (10).

**Algorithm 2** Forward filtering-backward sampling
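The outer loop of the blocked Gibbs sampler (Algorithm 1) can be sketched as follows. This is a hedged simplification: `resample_fn` stands in for forward filtering-backward sampling (Section 3.2.2), and the model is reduced to the pooled per-class segment data and the transition counts:

```python
import random

def blocked_gibbs(sequences, n_classes, n_iter, resample_fn, seed=0):
    """Sketch of the blocked Gibbs sampler (Algorithm 1).
    `resample_fn(seq, model, rng)` must return a list of (segment, class)
    pairs for one sequence; it stands in for forward filtering-backward
    sampling."""
    rng = random.Random(seed)
    model = {
        "X": [[] for _ in range(n_classes)],               # GP data X_c
        "N": [[0] * n_classes for _ in range(n_classes)],  # counts N_{c'c}
    }
    assign = [[] for _ in sequences]

    def apply(pairs, sign):
        # add (sign=+1) or remove (sign=-1) one sequence's segments
        # and transition counts from the model
        for j, (seg, c) in enumerate(pairs):
            if sign > 0:
                model["X"][c].append(seg)
            else:
                model["X"][c].remove(seg)
            if j + 1 < len(pairs):
                model["N"][c][pairs[j + 1][1]] += sign

    for _ in range(n_iter):
        for n, seq in enumerate(sequences):
            apply(assign[n], -1)                      # remove s_n's segments
            assign[n] = resample_fn(seq, model, rng)  # resample segmentation
            apply(assign[n], +1)                      # update X_c and N_{c'c}
    return model, assign
```

The remove-resample-add pattern is what makes the sampler "blocked": each sequence's entire segmentation is resampled jointly, conditioned on all other sequences.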

#### 3.2.2. Forward Filtering-Backward Sampling

In this study, we regard segments and their classes as latent variables that are sampled by forward filtering-backward sampling (Algorithm 2). In forward filtering, as shown in

**Figure 5**, the probability that the $k$ samples $\mathbf{s}\_{t-k:t}$ ending at time step $t$ in observed sequence $\mathbf{s}$ form a segment, and that the resulting segment belongs to class $c$, is computed as follows:

$$\alpha[t][k][c] = P(\mathbf{s}\_{t-k:t}|\mathbf{X}\_c) \times \sum\_{k'=1}^{K} \sum\_{c'=1}^{C} p(c|c')\,\alpha[t-k][k'][c'],\tag{8}$$

where $C$ and $K$ denote the number of classes and the maximum segment length, respectively. $P(\mathbf{s}\_{t-k:t}|\mathbf{X}\_c)$ represents the probability that $\mathbf{s}\_{t-k:t}$ is generated from class $c$; it is computed as follows:

$$P(\mathbf{s}\_{t-k:t}|\mathbf{X}\_c) = \mathcal{GP}(\mathbf{s}\_{t-k:t}|\mathbf{X}\_c) P\_{len}(k|\lambda),\tag{9}$$

where $P\_{len}(k|\lambda)$ represents a Poisson distribution with mean parameter $\lambda$, which serves as the distribution of segment lengths. The term $p(c|c')$ in Equation (8) represents the transition probability, computed as follows:

$$p(c|c') = \frac{N\_{c'c} + \alpha}{N\_{c'} + C\alpha},\tag{10}$$

where $N\_{c'}$ and $N\_{c'c}$ denote the number of segments of class $c'$ and the number of transitions from $c'$ to $c$, respectively, and $\alpha$ is a smoothing hyperparameter. The variables $k'$ and $c'$ in Equation (8) denote the length and class of the segment preceding $\mathbf{s}\_{t-k:t}$, and are marginalized out. Moreover, $\alpha[t][k][\ast] = 0$ if $t - k < 0$, and $\alpha[0][0][\ast] = 1.0$. All elements of $\alpha[\ast][\ast][\ast]$ in Equation (8) can be computed recursively from $\alpha[1][1][\ast]$ by dynamic programming. **Figure 6** depicts the computation of the three-dimensional array $\alpha[t][k][c]$. In this example, the probability that the two samples before time step $t$ form a segment of class two is computed: the samples at $t-1$ and $t$ form the segment, and every segment ending at $t-2$ can potentially transition to it, so $\alpha[t][2][2]$ is computed by marginalizing out these possibilities.

Finally, each segment $\mathbf{x}\_j$ and its class are determined by backward sampling the length $k$ and class $c$ of each segment, based on the forward probabilities in $\alpha$. Starting from $t = T$, length $k\_1$ and class $c\_1$ are sampled according to $k\_1, c\_1 \sim \alpha[T][k][c]$, and $\mathbf{s}\_{T-k\_1:T}$ becomes a segment of class $c\_1$. Then, the length $k\_2$ and class $c\_2$ of the next segment are sampled according to $k\_2, c\_2 \sim \alpha[T-k\_1][k][c]$. By iterating this procedure until $t = 0$, the observed sequence is divided into segments and their classes are determined.
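The recursion of Equation (8) and the backward pass can be sketched as follows. For brevity, the GP emission $\mathcal{GP}(\mathbf{s}\_{t-k:t}|\mathbf{X}\_c)$ is replaced by a caller-supplied density `seg_prob`, and the transition probabilities are passed in directly rather than computed from counts; both are illustrative stand-ins, not the paper's model:

```python
import math
import numpy as np

def forward_filter(s, seg_prob, trans, K, lam):
    """alpha[t][k][c] of Eq. (8): probability that the k samples ending at
    time t form a segment of class c. seg_prob[c](seg) stands in for
    GP(seg | X_c); the Poisson length factor implements Eq. (9)."""
    T, C = len(s), len(seg_prob)
    alpha = np.zeros((T + 1, K + 1, C))
    for t in range(1, T + 1):
        for k in range(1, min(K, t) + 1):
            seg = s[t - k:t]
            for c in range(C):
                emit = seg_prob[c](seg) * math.exp(-lam) * lam ** k / math.factorial(k)
                if t - k == 0:
                    prev = 1.0 / C  # alpha[0][0][*] base case, uniform initial class
                else:
                    prev = sum(trans[c2][c] * alpha[t - k][k2][c2]
                               for k2 in range(1, K + 1) for c2 in range(C))
                alpha[t][k][c] = emit * prev
    return alpha

def backward_sample(alpha, rng):
    """Sample (start, end, class) triples from t = T backwards, as in the text."""
    T, _, C = alpha.shape
    t, segs = T - 1, []
    while t > 0:
        w = alpha[t].ravel()
        idx = rng.choice(len(w), p=w / w.sum())
        k, c = divmod(idx, C)
        segs.append((t - k, t, c))  # segment covers s[t-k:t]
        t -= k
    return segs[::-1]
```

On a toy sequence with two well-separated emission levels, the sampled segments respect the true boundaries, because any segment straddling a level change receives negligible emission probability.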

## 4. EXPERIMENTS

We conducted experiments to confirm the validity of the proposed method. We used two types of motion capture data: (1) data from the CMU motion capture dataset (CMU, 2009), and (2) data containing karate motions.

#### 4.1. Segmentation of Exercise Motions

We first applied our proposed method to CMU motion capture data containing several exercise routines. The CMU motion capture data was captured using a Vicon motion capture system, and the positions and angles of 31 body parts are available. The dataset contains 2605 trials in six categories and 23 subcategories, and the motions in each subcategory were performed by one or a few subjects. In this experiment, three sequences from subject 14 in the general exercise and stretching category were used; they include running, jumping, squats, knee raises, reach-out stretches, side stretches, body twists, up-and-down movements, and toe touches. To reduce computational cost, we downsampled the data from 120 frames per second to 4 frames per second. **Figure 7** shows the coordinate system of the motion capture data used in this experiment; the two-dimensional frontal-view positions of the left hand $(x\_{lh}, y\_{lh})$, right hand $(x\_{rh}, y\_{rh})$, left foot $(x\_{lf}, y\_{lf})$, and right foot $(x\_{rf}, y\_{rf})$ were used. Therefore, each frame was represented by the eight-dimensional vector $(x\_{lh}, y\_{lh}, x\_{rh}, y\_{rh}, x\_{lf}, y\_{lf}, x\_{rf}, y\_{rf})$. Because GP-HSMM requires the number of classes to be specified in advance, we set it to eight.

TABLE 1 | Segmentation accuracy of CMU motion capture data.

| Hamming distance | Precision | Recall | F-measure |
|------------------|-----------|--------|-----------|
| 0.33             | 0.81      | 0.81   | 0.81      |

**Figure 8** shows the results of the segmentation. The horizontal axis represents the frame number, and the colors represent motion classes into which each segment was classified. The segments were classified into seven classes out of eight. **Table 1** shows the accuracy of the segmentation. We computed the following normalized Hamming distance between the unsupervised segmentation and the ground truth:

$$\mathrm{ND}(\mathbf{c}, \bar{\mathbf{c}}) = \frac{D(\mathbf{c}, \bar{\mathbf{c}})}{|\bar{\mathbf{c}}|},\tag{11}$$

where $\mathbf{c}$ and $\bar{\mathbf{c}}$ represent the sequences of estimated and true motion classes, $D(\mathbf{c}, \bar{\mathbf{c}})$ is the Hamming distance between the two sequences, and $|\bar{\mathbf{c}}|$ is the length of the sequence. The normalized Hamming distance therefore ranges from 0 to 1, and lower values indicate more accurate segmentation. In this experiment, the Hamming distance was 0.33, which is comparable with the BP-HMM result reported in Fox et al. (2011). However, they also reported that some true segments were split into two or more categories, and that those shorter segments were bridged. In contrast, we performed no such modifications, and **Figure 8** shows that there are no overly short segments.

TABLE 2 | Segmentation accuracy of karate motions.

We also computed the precision, recall, and F-measure of the segmentation. To compute these, each boundary is evaluated as a true positive (TP), false positive (FP), or false negative (FN). **Figure 9** shows an example of this evaluation. We considered an estimated boundary to be TP if it fell within the true boundary ± four frames, as shown in **Figure 9**(2). If a ground truth boundary has no corresponding estimated boundary, as shown in **Figure 9**(6), it is counted as FN. Conversely, if an estimated boundary has no corresponding ground truth boundary, as shown in **Figure 9**(11), it is counted as FP. From these evaluations, the precision, recall, and F-measure of the segmentation are computed as follows:

$$P = \frac{N\_{TP}}{N\_{TP} + N\_{FP}},\tag{12}$$

$$R = \frac{N\_{TP}}{N\_{TP} + N\_{FN}},\tag{13}$$

$$F = \frac{2PR}{P+R},\tag{14}$$

where $N\_{TP}$, $N\_{FP}$, and $N\_{FN}$ denote the numbers of boundaries assessed as TP, FP, and FN, respectively. The F-measure of the segmentation was 0.81, which indicates that GP-HSMM can estimate boundaries reasonably well. This is because GP-HSMM estimates the lengths of segments as well as their classes.
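Under the tolerance-window rule described above, the metrics of Equations (12)-(14) can be computed as sketched below. The greedy one-to-one matching of boundaries is our assumption, since the matching procedure is not spelled out in the text:

```python
def boundary_prf(est, true, tol=4):
    """Precision, recall, and F-measure of Eqs. (12)-(14) for boundary
    detection. Each true boundary is greedily matched to at most one
    estimated boundary within +/- tol frames."""
    est, true = sorted(est), sorted(true)
    used, n_tp = set(), 0
    for b in true:
        cand = [e for e in est if abs(e - b) <= tol and e not in used]
        if cand:
            used.add(min(cand, key=lambda e: abs(e - b)))  # closest match wins
            n_tp += 1
    n_fp = len(est) - n_tp   # estimated boundaries with no true counterpart
    n_fn = len(true) - n_tp  # true boundaries that were missed
    p = n_tp / (n_tp + n_fp) if n_tp + n_fp else 0.0
    r = n_tp / (n_tp + n_fn) if n_tp + n_fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, with estimated boundaries at frames 4, 9, and 20 and true boundaries at 5, 10, and 15, two boundaries match within the window, giving precision, recall, and F-measure of 2/3 each.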

Moreover, **Figure 8** shows that most false segmentations occur in sequence 3. This is because the "up and down" and "toe touch" motions are included only in sequence 3, and GP-HSMM was not able to extract patterns that occur infrequently. This problem is not limited to GP-HSMM, however; it is generally difficult for any learning method to extract infrequent patterns. The Hamming distance computed from sequences 1 and 2 only was 0.15, which shows that GP-HSMM can accurately estimate segments that appear multiple times in a sequence.
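The normalized Hamming distance of Equation (11) is simply the per-frame disagreement rate. Because unsupervised class IDs are arbitrary, the sketch below first aligns labels by the best permutation; this alignment step is our assumption, not a procedure described in the text:

```python
from itertools import permutations

def normalized_hamming(c_est, c_true, n_classes):
    """Eq. (11): fraction of frames whose class label disagrees with the
    ground truth, minimized over relabelings of the estimated classes
    (the permutation search is our assumption)."""
    assert len(c_est) == len(c_true)
    best = len(c_true)
    for perm in permutations(range(n_classes)):
        d = sum(perm[a] != b for a, b in zip(c_est, c_true))
        best = min(best, d)
    return best / len(c_true)
```

Exhaustive permutation search is only practical for a handful of classes; for larger class counts, a Hungarian-style assignment would typically be used instead.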

# 4.2. Segmentation of Karate Motion

We then applied our proposed method to more complex motion capture data, consisting of the basic motions of karate (called kata in Japanese)<sup>1</sup>, as shown in **Figure 10**, from the motion capture library Mocapdata.com<sup>2</sup>. Kata contains fixed motion patterns (punches or guards), so it is easy to form a ground truth for the segmentation. There might also be shorter motion patterns, which GP-HSMM might be able to find if the number of classes is set to a larger value. Moreover, GP-HSMM can potentially discover patterns that cannot be labeled by humans, and thus has the potential to analyze unlabeled time series data. In this experiment, however, we must evaluate the proposed method quantitatively, so fixed motion patterns (punches or guards) labeled by a human expert are used as ground truth. The type of kata we used is called heian 1, which is the most basic form and consists of punches, lower guards, and upper guards (tsuki, gedan-barai, and joudan-uke in Japanese). **Figure 11** shows the basic movements used in heian 1. We divided this motion sequence into four parts, which were used as four motion sequences for the blocked Gibbs sampler. Each motion sequence consisted of the following actions:


As preprocessing, the motion capture data was converted, as shown in **Figure 7**, into motions with the body facing forward and centered at (0, 0, 0). To reduce computational cost, we downsampled the motion capture data from 30 frames per

<sup>1</sup>https://mocapdata.blob.core.windows.net/freemotions/karate.zip <sup>2</sup>http://www.mocapdata.com/

second to 15 frames per second, and used the two-dimensional left-hand positions $(x\_{lh}, y\_{lh})$ and right-hand positions $(x\_{rh}, y\_{rh})$ in the frontal view, as shown in **Figure 7**. To compare our method with others, we used segmentation based on HDP-HMM (Beal et al., 2001) and segmentation based on NPYLM and HDP-HMM (Taniguchi and Nagasaka, 2011), in which NPYLM (Mochihashi et al., 2009) divides sequences discretized by HDP-HMM. In addition, we compared our method with BP-HMM (Fox et al., 2011) and AutoPlait (Matsubara et al., 2014).

**Figure 12** shows the segmentation results. The horizontal axis represents the frame number, and the colors represent the motion classes into which each segment was classified. The figure shows that HDP-HMM estimated shorter segments than the ground truth. This occurred because the emission distribution of HDP-HMM is a Gaussian distribution, which cannot represent continuous trajectories. The segmentation in which NPYLM divided the sequences discretized by HDP-HMM yielded longer segments, but NPYLM could not extract fixed patterns from the sequences: the sequences discretized by HDP-HMM included noise, and NPYLM was therefore unable to find patterns in them. It was also difficult for BP-HMM to estimate correct segments, and some shorter segments were present. Further, AutoPlait could not find any segments in the karate motion sequences. We believe this occurred because HMMs are too simple to model complex motions. In contrast, we use Gaussian processes, which make it possible to model complex sequences. **Table 2** shows the segmentation accuracy of each method. We considered an estimated boundary to be correct if it lay within ±5 frames of a true boundary. The F-measure of the proposed method was 0.92, which indicates that GP-HSMM can estimate boundaries accurately. The results show that GP-HSMM outperforms the other methods. **Figure 13** shows the learned Gaussian processes. y<sub>rh</sub> in **Figure 13A**, which represents the height of the right hand, decreases, indicating the motion in which the hand is dropped for the lower guard. In contrast, y<sub>rh</sub> in **Figure 13B** increases, indicating the motion in which the hand is raised for the upper guard. Similarly, y<sub>lh</sub> in **Figure 13C** increases for the upper guard on the other side. These results show that the characteristics of motions can be learned by Gaussian processes.

TABLE 3 | Computational time of each method.
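The boundary-matching criterion used in this evaluation (an estimated boundary counts as correct if it lies within ±5 frames of a still-unmatched true boundary) can be sketched as follows; the function is an illustrative reconstruction, not the authors' implementation:

```python
def boundary_f_measure(estimated, truth, tol=5):
    """F-measure for segment boundaries: an estimated boundary is a
    true positive if it lies within +/- tol frames of an unmatched
    ground-truth boundary (each true boundary matches at most once)."""
    unmatched = list(truth)
    tp = 0
    for b in estimated:
        hit = next((t for t in unmatched if abs(b - t) <= tol), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    precision = tp / len(estimated) if estimated else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with true boundaries at frames (12, 50, 100), the estimate (10, 52, 99) scores F = 1.0, while (10, 70) scores F = 0.4.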

Moreover, the motions were classified into seven classes, although we set the number of classes to eight. This result indicates that the number of classes can be estimated to a certain extent if a number close to the correct one is given. However, a smaller number leads to under-segmentation and misclassification, and a much larger number leads to over-segmentation. This is a limitation of the current GP-HSMM, and we believe it can be overcome by introducing a non-parametric Bayesian model.

Computational cost is another limitation of GP-HSMM. **Table 3** shows the computational time required to segment the karate motion. HMM-based methods such as HDP-HMM, BP-HMM, and AutoPlait are relatively fast. In particular, AutoPlait is the fastest because it uses the single-scan algorithm proposed by Matsubara et al. (2014) to find boundaries, and it has been demonstrated that AutoPlait can detect meaningful patterns in large datasets. In contrast, our proposed GP-HSMM is much slower than the other methods and cannot process such large datasets. This is another limitation of the proposed method.

# 5. CONCLUSION

In this paper, we proposed a method for motion segmentation based on a hidden semi-Markov model (HSMM) with a Gaussian process (GP) emission distribution. By employing HSMM, segment classes and their lengths can be estimated. Moreover,

a forward filtering-backward sampling algorithm is used to estimate the parameters of GP-HSMM; this makes it possible to efficiently search for all possible segment lengths and classes. The experimental results showed that the proposed method can accurately segment motion capture data. Although motions that occurred in the sequences a single time were difficult to segment correctly, motions that occurred a few times could be segmented with higher accuracy.

However, some issues remain in the current GP-HSMM. The most significant problem is that GP-HSMM requires the number of classes to be specified in advance. We believe this value can be estimated by utilizing a non-parametric Bayesian model. We are planning to introduce a stick-breaking process as a prior distribution of the transition matrix, and beam sampling for parameter estimation; these techniques are utilized in Beal et al. (2001). Another problem is computational cost. The computational cost of learning a Gaussian process is O(n³), where n denotes the number of data points classified into the GP. To overcome this problem, efficient computation methods have been proposed (Nguyen-Tuong et al., 2009; Okadome et al., 2014), and we will consider introducing them into GP-HSMM.
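The O(n³) scaling mentioned above comes from factorizing the n × n Gram matrix. A minimal GP regression sketch (RBF kernel; hyperparameter values illustrative, not the paper's) makes the bottleneck explicit:

```python
import numpy as np

def gp_predict(X, y, X_star, ell=1.0, sf2=1.0, noise=1e-2):
    """GP regression with an RBF kernel. The Cholesky factorization of
    the n x n Gram matrix is the O(n^3) bottleneck discussed in the text."""
    def k(A, B):
        d2 = (A[:, None] - B[None, :]) ** 2
        return sf2 * np.exp(-0.5 * d2 / ell**2)
    K = k(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                        # O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return k(X_star, X) @ alpha                      # posterior mean
```

Sparse or local approximations such as those cited above replace the full factorization with computations on much smaller subsets of the n points.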

# AUTHOR CONTRIBUTIONS

ToN, TaN, DM, IK, and HA conceived of the presented idea. ToN, TaN, and DM developed the theory and performed the computations. IK and HA verified the theory and the analytical methods. ToN wrote the manuscript with support from TaN and MK. IK and HA supervised the project. All authors discussed the results and contributed to the final manuscript.

## ACKNOWLEDGMENTS

This work was supported by JST CREST Grant Number JPMJCR15E3 and JSPS KAKENHI Grant Number JP17K12758.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Nakamura, Nagai, Mochihashi, Kobayashi, Asoh and Kaneko. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Representation Learning of Logic Words by an RNN: From Word Sequences to Robot Actions

Tatsuro Yamada<sup>1</sup> , Shingo Murata<sup>2</sup> , Hiroaki Arie<sup>2</sup> and Tetsuya Ogata<sup>1</sup> \*

<sup>1</sup> Department of Intermedia Art and Science, Waseda University, Tokyo, Japan, <sup>2</sup> Department of Modern Mechanical Engineering, Waseda University, Tokyo, Japan

An important characteristic of human language is compositionality. We can efficiently express a wide variety of real-world situations, events, and behaviors by compositionally constructing the meaning of a complex expression from a finite number of elements. Previous studies have analyzed how machine-learning models, particularly neural networks, can learn from experience to represent compositional relationships between language and robot actions, with the aim of understanding the symbol grounding structure and achieving intelligent communicative agents. Such studies have mainly dealt with words (nouns, adjectives, and verbs) that directly refer to real-world matters. In addition to these words, the current study simultaneously deals with logic words, such as "not," "and," and "or." These words do not refer directly to the real world; rather, they are logical operators that contribute to the construction of meaning in sentences. In human–robot communication, these words are likely to be used often. The current study builds a recurrent neural network model with long short-term memory units and trains it to translate sentences including logic words into robot actions. We investigate what kind of compositional representations, mediating between sentences and robot actions, emerge as the network's internal states through the learning process. Analysis after learning shows that referential words are merged with visual information and the robot's own current state, and that the logic words are represented by the model in accordance with their functions as logical operators. Words such as "true," "false," and "not" work as non-linear transformations that encode orthogonal phrases into the same area of the memory cell state space. The word "and," which required the robot to lift both hands, worked as if it were a universal quantifier. The word "or," which required action generation that appeared random, was represented as an unstable region of the network's dynamical system.

Keywords: symbol grounding, neural network, human–robot interaction, logic words, language understanding, sequence-to-sequence learning

# 1. INTRODUCTION

In recent years, the development of robots that work collaboratively in our living environment has attracted great attention. In many scenarios, these robots will be required to behave appropriately by understanding linguistic instruction from humans. Here, the meanings of instructions may change depending on the environment. Thus, robots must be able to flexibly adapt their behavior

**Edited by:** Alex Pitti, Université de Cergy-Pontoise, France

**Reviewed by:** Xavier Hinaut, Inria Bordeaux—Sud-Ouest Research Centre, France; Cornelius Weber, University of Hamburg, Germany

**\*Correspondence:** Tetsuya Ogata ogata@waseda.jp

Received: 18 August 2017; Accepted: 14 December 2017; Published: 22 December 2017

**Citation:** Yamada T, Murata S, Arie H and Ogata T (2017) Representation Learning of Logic Words by an RNN: From Word Sequences to Robot Actions. Front. Neurorobot. 11:70. doi: 10.3389/fnbot.2017.00070


in accordance with the current situation or context. In the real world, no two events are identical; thus, a model is required that can generalize in order to translate an instruction into appropriate behavior even in novel situations. Specifying rules that define the relations between language and behavior for every possible context becomes difficult, and its cost grows as task complexity increases. It is therefore especially desirable to build a learning model that enables a robot to acquire generalizable relations from experience. Flexibly linking language, which operates on discrete elements, to behavior, which operates within a continuous world, requires a solution to the symbol grounding problem (Harnad, 1990; Taniguchi et al., 2016).

One important characteristic of human language that enables us to describe even previously unseen situations is compositionality. In the field of formal semantics, the principle of compositionality (also referred to as Frege's principle) models a language system as follows: the meaning of a phrase or a sentence is given as a function of the meanings of its parts (e.g., words) (Partee, 2004). This principle means that the meaning of a complex expression is built from the meaning of its constituents and rules for combining them. Thanks to the compositionality of language and our cognitive ability to deal with it, humans can efficiently describe a wide variety of situations and dynamic events in the real world by compositionally constructing a complex expression from a finite number of elements. Investigating the compositional aspects of language deeply is important for understanding how human languages work in practice and for building intelligent communicative agents. Using the principle of compositionality as a base, formal semanticists attempt to build theoretical frameworks to explain the compositionality of natural language in a top-down manner.

In contrast with the top-down approach, there is a bottom-up approach that attempts to work from observation and analyze what kind of symbolic or compositional expressions emerge spontaneously through communicative tasks among humans, robots, and other intelligent agents (Steels and Kaplan, 1998; Steels and McIntyre, 1998; Steels, 2001; Kirby, 2002; Sasahara et al., 2007; Bleys et al., 2009; Schueller and Oudeyer, 2015; Spranger, 2015; Sukhbaatar et al., 2016; Wang et al., 2016; Havrylov and Titov, 2017; Lazaridou et al., 2017; Mordatch and Abbeel, 2017). In particular, in recent years, there have been many studies of multi-agent interaction, in which agents implemented with a deep learning model are developed in a mutually interactive manner and a compositional communication protocol emerges through the interaction. In Mordatch and Abbeel (2017), multiple agents situated within simulated 2D environments were given collaborative tasks in which agents had to symbolically communicate with each other to tell other agents their own goals. Before learning, the symbols were meaningless. Trained by reinforcement learning, the agents spontaneously gave the symbols shared meanings, which were sometimes interpretable by humans (e.g., "GO-TO," "LOOK-AT"), and they became able to communicate by combining the symbols, each of which was a token representing a subject, verb, or object. In Havrylov and Titov (2017), two long short-term memory (LSTM) networks developed their own communication protocol to express the content of images. The sender network encoded the image information as a sentence expression, and the receiver network decoded the sentence and inferred which image among alternatives was described by the sentence. The analysis showed that natural language-like coding, such as a hierarchy of categories or the importance of word order, could be developed.

In the bottom-up approach, there has also been much research that trained neural network models by supervised learning (Sugita and Tani, 2005; Ogata et al., 2007; Sugita and Tani, 2008; Arie et al., 2010; Tuci et al., 2011; Chuang et al., 2012; Stramandinoli et al., 2012; Ogata and Okuno, 2013; Heinrich and Wermter, 2014; Heinrich et al., 2015; Hinaut et al., 2014; Yamada et al., 2015, 2016; Zhong et al., 2017). In these studies, the example sets of language and corresponding behavior were designed and prepared by humans in advance. These sets were used as ground truth during training, and after that, compositional representations intermediating between language and behavior were self-organized in their models. For example, Sugita and Tani (2005) and Arie et al. (2010) trained recurrent neural network (RNN) models (Elman, 1990) to learn relations between 2- or 3-word sentences and corresponding robot behavior. After training, representations corresponding to verbs and nouns were topologically self-organized as different components in the feature space binding language with robot behavior. These were construed as plausible materialization of linguistic compositionality by a dynamical system approach. Tuci et al. (2011) also conducted robot experiments using a feedforward neural network and claimed that the compositional aspects that potentially exist in the behavior space are required for embedding robot behavior into compositional semantics via language. Heinrich et al. (2015) trained an RNN model to translate a robot's visual input into a corresponding sentence at the phoneme level. After training, the activated internal states of the RNN were more correlated with the type of word (color, shape, or position) than the phonemes. Hinoshita et al. (2011) visualized a similar kind of abstract encoding by a hierarchical RNN that was activated in accordance with the categories of words, even though they trained the RNN with linguistic sequences only. 
Investigating such representations organized in machine learning models is valuable, not only for understanding the compositionality of language but also for building interpretable intelligent systems.

The current study follows the supervised learning approach to the integration of language and behavior. In most previous studies of this type, mainly words that are directly grounded in real-world matters have been considered. For example, nouns (e.g., ball, box) or adjectives (e.g., red, tall) correspond to characteristics of objects. Verbs (e.g., hit, push) or adverbs (e.g., quickly, slowly) correspond to characteristics of motion. However, in our language, there are more abstract words (e.g., society, justice) that are not grounded in concrete physical objects or actions. To tackle the grounding of such words, Cangelosi et al. have conducted a series of language-robot experiments from the point of view of cognitive developmental robotics (Cangelosi et al., 2010; Chuang et al., 2012; Stramandinoli et al., 2012; Zhong et al., 2014; Stramandinoli et al., 2017). In these works, a robot implemented with a neural network develops its linguistic skill step by step, beginning by acquiring relations between simple basic motions and words (e.g., "push," "pull") directly grounded in them and moving on to achieving relationships between more abstract actions and words (e.g., "give," "reject") only indirectly grounded in them through connections to basic words.

However, the current study deals with another kind of abstraction. The language expressions in this study include both referential<sup>1</sup> (in other words, grounded) words and logic words, such as "not," "and," and "or." These logic words do not refer directly to the real world but act as logical operators in the construction of sentence meaning. For example, just after you have closed a door, the commands "open the door!" and "do not close the door!" can express the same behavior, OPEN-DOOR<sup>2</sup>. In another case, the appropriate behaviors in response to "bring A or B" include BRING-A and BRING-B. Such logic words have not been addressed in conventional studies of the integrative learning of language and behavior. Following the formulation of formal semantics, even these non-referential words working as logical operators can be handled in a unified way, and in actual human–robot communication it is highly likely that they will be used.

The current study investigates what kind of structure representing the compositional relations between language and robot actions is self-organized in the space of internal states of an RNN model trained through supervised learning. Here, our designed tasks include both referential words and non-referential logic words, and the meanings of sentences are constructed from both word types. We analyze how the logic words are processed and how their functions are represented by the RNN dynamics along with the referential words. More precisely, we apply the sequence-to-sequence learning method that has recently attracted great attention in the field of natural language processing (Sutskever et al., 2014; Bahdanau et al., 2015; Vinyals and Le, 2015; Wu et al., 2016) to the translation from sentences to robot actions, and analyze the representations by visualizing internal states during interactions that occur after training.

This paper is organized as follows. In section 2, we introduce the learning model. In section 3, we give the results of the learning experiment for the first task and analyze the representations acquired by the learning model in detail. In section 4, we report the results of the second task. In section 5, we discuss the results and then conclude this study.

# 2. LEARNING MODEL

# 2.1. Problem Formulation

The aim of the current study is to investigate how the compositional relations between language and robot actions are developed and represented internally by the model from direct experiences of interaction. Therefore, we define the interactive instruction–action task as a simple problem: learning to predict the robot's joint angles appropriate to the current situation. At each discrete time step t, a neural network model receives a word **w**<sub>t</sub>, visual information **v**<sub>t</sub>, and the robot's current joint angle configuration **j**<sub>t</sub>. An instruction sentence is given as a concatenation of words and thus spans several time steps. At each time step, the model generates its prediction **j**<sub>t+1</sub> based on the input history **w**<sub>0:t</sub>, **v**<sub>0:t</sub>, and **j**<sub>0:t</sub>. During the instruction phase, the appropriate prediction is simply to keep the current posture **j**<sub>t</sub>. After an instruction is given, an appropriate prediction should generate angles different from the current ones. An action corresponding to the instruction must also be generated as a sequence of joint angle configurations over several time steps. In our tasks, the appropriate action sequence after an instruction is determined by the combination of the instruction sentence, the visual information given simultaneously with the sentence, and the robot's current posture.

# 2.2. Model Architecture and Forward Dynamics

In this study, as a model that learns the aforementioned problem, we use an RNN with an LSTM layer (Hochreiter and Schmidhuber, 1997). The model is a three-layer neural network whose middle layer is the LSTM layer, as shown in **Figure 1**. All the LSTM units have peephole connections (Gers and Schmidhuber, 2000). At each time step, the model receives **w**<sub>t</sub>, **v**<sub>t</sub>, and **j**<sub>t</sub>. The LSTM layer calculates its current output **h**<sub>t</sub> from these external inputs, the memory cell state in the previous step **c**<sub>t−1</sub>, and its own output in the previous step **h**<sub>t−1</sub>:

$$\mathbf{h}\_t = \text{LSTM}(\mathbf{w}\_t, \mathbf{v}\_t, \mathbf{j}\_t, \mathbf{h}\_{t-1}, \mathbf{c}\_{t-1}; \boldsymbol{\theta}), \tag{1}$$

where θ denotes the parameters of the LSTM layer. In this process, **c**<sub>t−1</sub> is also updated to **c**<sub>t</sub>. The output layer is a fully connected layer; it receives the output of the LSTM layer and predicts the appropriate joint angles for the next time step. Denoting this prediction by **j**<sub>t+1</sub>, we have

$$\mathbf{j}\_{t+1} = \tanh(\mathbf{W}\mathbf{h}\_t + \mathbf{b}),\tag{2}$$

where **W** and **b** are a learnable weight matrix and a bias vector, respectively. The model prediction is also used as the joint angle input at the next time step. In this process, receiving an instruction and generating an action are conducted entirely within the forward-propagation algorithm. An instruction sentence, visual information, and the robot's current posture are encoded as the states of memory cells in the LSTM layer. After receiving the instruction, a corresponding action sequence is generated by decoding the integrated information.
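One forward step of Eqs (1) and (2) can be sketched numerically as follows. This is an illustrative NumPy re-implementation with peephole connections, not the authors' Chainer code; the input sizes (9 word, 3 vision, 2 joint dimensions) follow the flag task, while the hidden size and random initialization are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_in, n_hid, n_out):
    """Random parameters for one peephole-LSTM layer (gates i, f, o and
    cell candidate g) plus the linear output layer of Eq. (2)."""
    def m(r, c): return rng.normal(0.0, 0.1, (r, c))
    p = {g: {"W": m(n_hid, n_in), "U": m(n_hid, n_hid),
             "p": rng.normal(0.0, 0.1, n_hid), "b": np.zeros(n_hid)}
         for g in ("i", "f", "o", "g")}
    p["Wout"], p["bout"] = m(n_out, n_hid), np.zeros(n_out)
    return p

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(p, w_t, v_t, j_t, h, c):
    """One step: peephole LSTM (Eq. 1) then tanh output layer (Eq. 2)."""
    x = np.concatenate([w_t, v_t, j_t])
    pre = {g: p[g]["W"] @ x + p[g]["U"] @ h + p[g]["b"] for g in "ifog"}
    i = sigmoid(pre["i"] + p["i"]["p"] * c)      # input gate peeks at c_{t-1}
    f = sigmoid(pre["f"] + p["f"]["p"] * c)      # forget gate peeks at c_{t-1}
    g = np.tanh(pre["g"])
    c_new = f * c + i * g
    o = sigmoid(pre["o"] + p["o"]["p"] * c_new)  # output gate peeks at c_t
    h_new = o * np.tanh(c_new)
    j_next = np.tanh(p["Wout"] @ h_new + p["bout"])
    return j_next, h_new, c_new
```

In closed-loop operation, `j_next` is fed back as the joint input of the following step, exactly as described above.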

After training, the model's operation resembles that of the standard sequence-to-sequence models recently used in the field of natural language processing for tasks such as question answering and translation. However, the current model differs in that it has only one LSTM layer; in other words, it does not separate the encoder from the decoder.

<sup>1</sup> In this paper, we use the term "referential" instead of "grounded" for the following reason. We conduct two robot experiments in the following sections, but the first task is numerically simulated on a computer. Even though the second task uses a real robot, the visual input is still highly preprocessed. Strictly speaking, we do not deal with the symbol grounding problem in accordance with the definition by Harnad (1990). To prevent misunderstanding, we use the term "referential," and sometimes "linking" to express that a word has a referent or a corresponding feature in other sensorimotor modalities.

<sup>2</sup> In this paper, we denote specific actions or behaviors executed by agents with capital letters.

Moreover, the algorithm does not explicitly switch between the instruction and action phases. As illustrated in **Figure 3** in the next section, the relations between instructions and corresponding actions are experienced entirely within sequential data representing human–robot interaction, which consists of repeated iterations of instructions and actions. With such data, as mentioned above, the model learns only to predict the robot's joint angles appropriate for the next time step in the current situation. Because both phases are only implicitly included in the sequential data, the model has to learn to switch phases without a priori knowledge. More precisely, the contrasting functions of encoding and decoding (i.e., instruction receiving and action generation) emerge as an apparent phenomenon purely as a result of learning. The model continues to predict the joint angles even while receiving an instruction, in which case the target is to keep the current posture. Conversely, zero-filled vectors continue to be received as language inputs even while the robot is generating an action sequence. Although there are no external algorithms or explicit signals on the network I/O for phase switching, the trained model behaves as though it flexibly switched phases. For more discussion from the point of view of dynamical systems, refer to Yamada et al. (2016).

# 2.3. Training

To train the model, supervised learning is conducted by minimizing the squared error between the model's output **j**<sub>t+1</sub> and the correct joint angles at the next time step ˆ**j**<sub>t+1</sub>; that is, the model is trained to minimize

$$E = \sum\_{s} \sum\_{t} \left( \mathbf{j}\_{t+1} - \hat{\mathbf{j}}\_{t+1} \right)^2,\tag{3}$$

where s is the index of a sequence. The error at each time step is back-propagated to the initial time step without truncation using the backpropagation through time algorithm (Rumelhart et al., 1986). In our tasks, there are sometimes multiple correct actions. For example, if the instruction is "hit red or blue," both HIT-RED and HIT-BLUE are correct. In such cases, one action is chosen randomly each time and given as the correct response.
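The per-sequence loss of Eq. (3) and the random choice among multiple correct actions can be sketched as follows (function names are illustrative, not from the released code):

```python
import random

def pick_target(correct_actions):
    """When an instruction (e.g., "hit red or blue") admits several
    correct action sequences, one is sampled uniformly each time the
    instruction is presented during training, as described in the text."""
    return random.choice(correct_actions)

def sse_loss(pred_seq, target_seq):
    """Eq. (3) for one sequence: squared error between predicted and
    correct joint angles, summed over joints and time steps."""
    return sum((p - t) ** 2
               for ps, ts in zip(pred_seq, target_seq)
               for p, t in zip(ps, ts))
```

During training, `pick_target` would be called once per episode before computing the loss, so the network sees each admissible action as the target on different presentations.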

In the following sections, we describe learning experiments conducted using this model. We designed two tasks, the "flag task" and the "bell task," in which a robot is required to generate actions in response to linguistic instructions that sometimes include logic words. Although the former task is numerically simulated on a computer, from data preparation to evaluation, it is interpretable as a task for a robot. In contrast, the latter task collects motion data using a real robot and is therefore more complicated.

# 3. EXPERIMENT 1: FLAG TASK

# 3.1. Task Overview

In this section, we first report the learning results of the first task, the "flag task." Although this task is performed entirely in a computer simulation, we describe it as if it were undertaken by a robot so that it is easy to imagine intuitively. First, a human makes the robot grasp flags colored red, green, or blue, one in the left hand and another in the right, at random. After that, the human gives the robot a linguistic instruction. The sentence consists of a combination of an objective ("red," "green," "blue"), a verb ["up" (i.e., lift), "down" (i.e., lower)], and a truth value ("true," "false"). Note that the words are given in this order because this game was designed by modifying a popular children's game in Japan. Japanese is a subject-object-verb language (cf. English, which is a subject-verb-object language); therefore, a verb follows an objective word, and a truth value, which is an auxiliary verb, follows the verb. Here, the objective color word indicates the arm that is grasping a flag of the stated color.

In the objective part, two color words can be concatenated by "and" (referred to as AND-concatenated). For example, if the robot receives the instruction "red and blue up true" when it is grasping red and blue flags, the robot must lift up both arms. There are also cases in which two color words are concatenated by "or" (referred to as OR-concatenated). For example, if the robot receives the instruction "green or blue up false" when it is grasping the green and blue flags, the correct action is to lower either arm. However, if at least one arm is already in the DOWN posture, the robot must keep the current posture. The number of possible goal-oriented actions is six: L-UP, R-UP, B-UP, L-DOWN, R-DOWN, and B-DOWN, where L, R, and B mean left, right, and both, respectively. However, there are situations in which, even though the same goal-oriented action is required, the actual motion that should be generated by the robot varies according to the robot's current posture (shown as arrows in **Figure 2**). Note that there are even cases in which the robot should not move either of its arms. The number of possible situations, based on the combination of flag colors (6 patterns), instructions (24 patterns), and the robot's waiting posture (4 patterns), is 576. In this task, instructions inconsistent with the flag colors are never given. For example, if the colors of the flags held by the robot are red and blue, the instruction "green up true" is never given. Furthermore, cases in which both flags are the same color are not permitted.
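As a check on the combinatorics quoted above (6 flag patterns × 24 instructions × 4 waiting postures = 576 situations), the instruction space can be enumerated; the grammar below is reconstructed from the task description, and the variable names are illustrative:

```python
from itertools import permutations, product

COLORS = ["red", "green", "blue"]
flag_patterns = list(permutations(COLORS, 2))        # (left, right) flags: 6

def instructions(held):
    """Instructions consistent with the held flag colors: an objective
    (one color, or two colors joined by "and"/"or"), a verb, and a
    truth value, in that (Japanese-like) order."""
    a, b = held
    objectives = [[a], [b]]
    for conj in ("and", "or"):
        objectives += [[a, conj, b], [b, conj, a]]   # 6 objectives total
    return [obj + [verb, tv] for obj, verb, tv in
            product(objectives, ["up", "down"], ["true", "false"])]

postures = list(product(["UP", "DOWN"], repeat=2))   # 4 waiting postures
n_situations = sum(len(instructions(h)) for h in flag_patterns) * len(postures)
```

Enumerating in this way yields 24 instructions per flag pattern and 576 situations in total, matching the counts in the text.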

The requirements imposed on the robot in this game are analyzed as follows. (1) First, the arm indicated by the color words depends on the arm with which the robot holds the flag. In other words, referring to an external situation is required. (2) The actual motion trajectory to be generated depends on the robot's current posture. For example, suppose the robot is required to generate L-UP action. If the robot's left arm is in the DOWN posture, the robot has to lift its left arm. However, if the robot's left arm is already in the UP posture, the robot has to maintain its posture. (3) Finally, the RNN has to deal not only with referential words (e.g., verb, objective) but also logic words such as "true," "false," "and," and "or," which we focus on in the current study. Due to this task setting, in extreme cases, sentences completely orthogonal to each other can indicate the same action (e.g., "red up true" with the red flag in the left arm and "blue down false" with the blue flag in the left arm). In contrast, some OR-concatenated sentences have an ambiguity that allows the robot multiple choices even in the same situation.

# 3.2. Data Representation

We represent the execution of the flag task as a sequence of 14-dimensional vectors. The state **S**<sub>t</sub> at time step t is represented as follows:

$$\mathbf{j}\_{t} = [j\_{l}^{(t)}, j\_{r}^{(t)}],\tag{4}$$

$$\mathbf{v}\_{t} = [v\_{r}^{(t)}, v\_{g}^{(t)}, v\_{b}^{(t)}],\tag{5}$$

$$\mathbf{w}\_{t} = [w\_{0}^{(t)}, w\_{1}^{(t)}, w\_{2}^{(t)}, w\_{3}^{(t)}, w\_{4}^{(t)}, w\_{5}^{(t)}, w\_{6}^{(t)}, w\_{7}^{(t)}, w\_{8}^{(t)}], \tag{6}$$

$$\mathbf{S}\_{t} = [\mathbf{j}\_{t}; \mathbf{v}\_{t}; \mathbf{w}\_{t}].\tag{7}$$

Regarding the robot joints, only the left and right shoulder pitches (j<sub>l</sub><sup>(t)</sup>, j<sub>r</sub><sup>(t)</sup>) are used. The permissible range of each shoulder pitch is scaled to the interval [−1.0, 1.0]. The UP posture corresponds to a pitch of 0.8, and the DOWN posture to a pitch of −0.8. Posture changes from UP to DOWN or from DOWN to UP after receiving an instruction are completed over 6 time steps. Visual information is represented in 3 dimensions (v<sub>r</sub><sup>(t)</sup>, v<sub>g</sub><sup>(t)</sup>, v<sub>b</sub><sup>(t)</sup>), whose components correspond to the R, G, and B channels, respectively. If a flag of the corresponding color is grasped by the left hand, the component is set to 0.8; if it is in the right hand, −0.8; and if it is not grasped by either hand, 0.0. Nine elements are assigned to language, each corresponding to one of the words "red," "green," "blue," "up," "down," "true," "false," "and," and "or." An instruction sentence is represented as a sequence of one-hot vectors, which have the value 0.8 at one element and 0.0 at all other elements. In this study, the data representing the flag task are generated entirely on a computer without using a real robot. Example interaction data are shown in **Figure 3**. Note that we added a small amount of Gaussian noise (mean: 0.00; standard deviation: 0.02) to the joint angle values. In a preliminary experiment, we first trained the model without noise and obtained poor results; adding noise improved them. We discuss this effect in section 5.
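The encoding rules above can be sketched as a builder for one 14-dimensional state **S**<sub>t</sub> of Eqs (4)–(7); the helper name and its interface are illustrative, not the authors' code:

```python
import numpy as np

VOCAB = ["red", "green", "blue", "up", "down", "true", "false", "and", "or"]

def make_state(j_left, j_right, left_color, right_color, word=None,
               noise_std=0.02, rng=None):
    """Builds S_t = [j_t; v_t; w_t] (Eqs 4-7).
    Joints: scaled shoulder pitches plus Gaussian noise (sigma = 0.02).
    Vision: +0.8 for the color in the left hand, -0.8 for the right,
    0.0 for an absent color. Word: one-hot with value 0.8 at the
    current word; all zeros between and after sentences."""
    rng = rng if rng is not None else np.random.default_rng()
    j = np.array([j_left, j_right]) + rng.normal(0.0, noise_std, 2)
    v = np.zeros(3)
    for hand_val, color in ((0.8, left_color), (-0.8, right_color)):
        v[["red", "green", "blue"].index(color)] = hand_val
    w = np.zeros(len(VOCAB))
    if word is not None:
        w[VOCAB.index(word)] = 0.8
    return np.concatenate([j, v, w])
```

For example, a robot in the DOWN-UP posture holding red (left) and blue (right) while hearing "up" gets a state with 0.8 in the red channel, −0.8 in the blue channel, and 0.8 at the "up" element.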

# 3.3. Learning Setting and Evaluation Method

We made 2,048 training sequences, each of which includes 10 episodes. The term "episode" denotes a chunk consisting of an instruction and an action response. The situations included in each sequence were randomly ordered, and all 576 possible situations were included at least once. We built five models with 50, 70, 100, 150, and 300 LSTM units and trained each of them 10 times from randomly initialized learnable parameters. We also trained the 100-node model on data without noise applied to the joint angles. Adam, a version of stochastic gradient descent made stable by computing an individual adaptive learning rate for each parameter, is used as the optimizer (for details, refer to Kingma and Ba, 2015). The number of learning iterations is 10,000, and the learning rate is set to 0.001. We implemented our model in Python using the Chainer (https://chainer.org) framework. The source code of our model is available at https://github.com/ogata-lab/RNN_FNR2017.

After learning, we made another dataset for the evaluation. This dataset includes all the possible situations 10 times each. Although the situations were randomly ordered, the order was different from that of the training dataset. When the errors between the generated postures of both arms six steps after receiving an instruction and the correct ones are less than 0.04, we judge that the RNN has succeeded in generating an appropriate action. There are cases in which the correct action cannot be determined uniquely; in such cases, if the RNN succeeds in generating any of the correct actions, we judge that as a success. We regard the situation patterns in which the RNN succeeds in generating an appropriate action more than seven times out of 10 as "appropriately learned". Note that in the current task, the sequences are given to the robot as multiple repetitions of instructions and corresponding actions. Therefore, even if situations defined by the combination of an instruction, the vision, and the robot posture are the same, slightly different activations are gained every time, because contextual information from the previous episodes remains in the memory cell states. Thus, the generated action is not identical among trials.
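The success criterion above can be expressed compactly. The following is a sketch with hypothetical helper names (`action_success`, `appropriately_learned`); note that "more than seven times out of 10" means at least eight successes.

```python
import numpy as np

def action_success(generated, corrects, threshold=0.04):
    """True if the two shoulder pitches six steps after the instruction are
    within `threshold` of any of the admissible goal postures."""
    return any(
        np.all(np.abs(np.asarray(generated) - np.asarray(c)) < threshold)
        for c in corrects
    )

def appropriately_learned(trial_results, min_successes=8):
    """A situation pattern counts as 'appropriately learned' when more than
    seven of its 10 evaluation trials succeed (i.e., at least eight)."""
    return sum(trial_results) >= min_successes
```

When two correct actions exist (an OR-concatenated instruction with both arms away from the goal), `corrects` simply contains both goal postures.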

# 3.4. Task Performance after Training

We classify all possible situations into four types. (1) Situations in which the instruction includes only one objective word (192 situations). (2) Situations in which the instruction is AND-concatenated (192 situations). (3) Situations in which the instruction is OR-concatenated, but there is only one correct action; for example, when the instruction is "red or blue up true" and both arms are already in the UP position, the only correct action is to maintain the UP-UP posture (144 situations). (4) Situations in which the instruction is OR-concatenated, and two correct actions exist (48 situations). We evaluate performance by counting how many situation patterns each model learns appropriately with respect to each of the four types. **Figure 4** shows the result. Most situations in types (1), (2), and (3), in which the correct action is uniquely determined, were appropriately learned by all the models. However, the 100-node model trained on data without noise applied to the joints could not learn sufficiently well. For type (4), in which the correct action cannot be determined uniquely, a clear difference exists between models: the number of appropriately learned situations increased with the number of LSTM nodes. The model without noise also performed worse than the 100-node model with noise. **Figure 5** shows an actual example of interaction achieved by the 300-node model. The RNN generates an appropriate action immediately after receiving an instruction in each episode.
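The four-way classification above can be checked by enumerating the 576 situations. The sketch below is a reconstruction under two assumptions that make the counts work out: instructions use only the two grasped colors (in both word orders for concatenations), and two correct actions exist exactly when neither arm is already at the goal posture implied by the verb and the truth value.

```python
from itertools import permutations

COLORS = ["red", "green", "blue"]
ARRANGEMENTS = list(permutations(COLORS, 2))   # (left flag, right flag): 6 total
POSTURES = [(l, r) for l in ("UP", "DOWN") for r in ("UP", "DOWN")]

def goal(verb, truth):
    # XOR of verb and truth value: "up true" / "down false" -> UP, else DOWN
    return "UP" if (verb == "up") == (truth == "true") else "DOWN"

counts = {1: 0, 2: 0, 3: 0, 4: 0}
for arrangement in ARRANGEMENTS:
    for posture in POSTURES:
        for verb in ("up", "down"):
            for truth in ("true", "false"):
                g = goal(verb, truth)
                counts[1] += 2    # single objective: either grasped color
                counts[2] += 2    # AND-concatenated: two word orders
                if posture[0] != g and posture[1] != g:
                    counts[4] += 2    # OR with two correct actions
                else:
                    counts[3] += 2    # OR with a unique correct action
```

Running this yields counts of 192, 192, 144, and 48 for types (1)–(4), matching the figures quoted in the text.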

Next, we checked which arm was actually moved in situations of type (4). If the model had learned the type (4) situations just as a left-arm action or just as a right-arm action, the meaning of "or" could not be regarded as truly learned, even though the aforementioned evaluation criterion is fulfilled. Here, we investigated the results for a model with 300 LSTM units. In 45.4% of the trials, the left hand was moved; in 52.5%, the right hand was moved; and in 2.1%, neither movement could be generated successfully. Overall, the arms were quite evenly chosen in these situations. There are 48 situation patterns of type (4), and the test was conducted 10 times for each of them. In all cases, the RNN sometimes chose to move the left arm and at other times chose to move the right arm. In other words, the RNN could learn the meaning of OR-concatenated instructions appropriately as "OR". Thus, the flag task was performed sufficiently well by the trained models.

# 3.5. Analyses of Internal Representations

In the previous subsection, we confirmed that the RNN could learn to execute the flag task. In this section, to analyze how the RNN internally represents the relations between instructions and sensorimotor information, we visualized the internal states

FIGURE 4 | Experiment 1 (flag task). Action generation performance. We evaluated performance by counting how many situation patterns each model learned appropriately with respect to each of the four situation types: (1) the instruction includes one objective; (2) the instruction is AND-concatenated; (3) the instruction is OR-concatenated, but there is only one correct action; and (4) the instruction is OR-concatenated, and two correct actions exist. Note that the written values are averages of 10 trials in which learning began with different seeds. Error bars represent standard deviations.

FIGURE 6 | Top left: The states of the memory cells after the instruction "(L-flag color word) up true" or "(R-flag color word) up true" is given to the robot, projected onto the space spanned by PC1, 2, and 3. Here, the robot is always waiting in the DOWN-DOWN posture, but the situations differ with respect to the colors of the flags grasped in each hand. For example, the filled blue circle is the activation after receiving "blue up true" in the situation B-R, in which a blue flag is in the left hand and a red flag is in the right. In this task, which arm should be moved cannot be determined from the given objective word alone. However, the PC1 direction represents which arm is indicated by the objective word. The RNN learned to integrate the objective word information and the current visual information, and acquired a representation corresponding to the meaningful pair "left–right". By using these activations, the robot could choose the correct arm for each trial. Others: We also plotted the internal states after giving these instructions to the robot waiting in the other postures, together with the internal states in the DOWN-DOWN condition, projected onto the PC1–2, PC3–4, and PC5–6 spaces. Note that we carried out PCA again using the internal states from all of these conditions. Plot colors and shapes are as in the top left panel, except that the frame lines differ according to the robot's current posture. In this case, the current posture information is strongly reflected in the internal states, and thus it is encoded in the PC1–2 plane, but the representation corresponding to "left" and "right" can still be seen easily in the PC3–4 plane. The visual information was encoded in the PC5–6 space, although the hexagon shape was a little distorted.

during the execution of the task by principal component analysis (PCA)<sup>3</sup> .
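The analysis pipeline (centering followed by PCA, as noted in footnote 3) can be sketched with NumPy's SVD. `pca_project` is a hypothetical helper name, not the authors' analysis code.

```python
import numpy as np

def pca_project(states, n_components=6):
    """Center the internal-state vectors and project them onto the top
    principal components; also return the variance contribution ratios."""
    X = np.asarray(states, dtype=float)
    X = X - X.mean(axis=0)                 # centering preprocessing
    # Rows of Vt are the principal directions of the centered data
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    projected = X @ Vt[:n_components].T
    contribution = S**2 / np.sum(S**2)     # per-component contribution ratio
    return projected, contribution[:n_components]
```

The contribution ratios are what the paper reports when it states, for instance, that PC1 accounts for 97.9% of the variance in one analysis.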

## 3.5.1. Representations of Referential Color Words

First, the top left panel of **Figure 6** shows the states of the memory cells after the instruction "(L-flag color word) up true" or "(R-flag color word) up true" is given to the robot. Here, the robot is always waiting in the DOWN-DOWN posture, but the situations differ with respect to the flag colors. Therefore, the RNN has to choose which arm should be raised by integrating the visual information and the input objective word. In the PC2–PC3 plane, the current visual input is directly embedded. In the PC1 direction, however, which arm has been indicated by an objective word is represented. In other words, through the experience of generating action sequences by receiving an instruction and visual input, the RNN acquired a representation corresponding to the meaningful pair of "left" and "right". We also plotted the internal states after giving these instructions to the robot waiting in the other postures, together with the internal states in the DOWN-DOWN condition. In the other three panels of **Figure 6**, we projected them onto the PC1–2, PC3–4, and PC5–6 spaces. In this case, the current posture information is strongly reflected in the internal states, and thus it is encoded in the PC1–2 plane. But the representation corresponding to "left" and "right" can still be seen easily in the PC3–4 plane. Note that in the case of the UP-UP posture, the actual motions to be generated after receiving "(L-flag color word) up true" or "(R-flag color word) up true" are the same (keep the current posture), and, in fact, the network could keep the posture. This analysis shows that even in such situations, in which the same action was generated, the model could internally represent these instructions

<sup>3</sup>Before applying PCA, parallel translation was applied to the internal state vectors to make the mean of them the zero vector (i.e., centering preprocessing was performed).

as different meanings, "left" or "right". Incidentally, the visual information was also still encoded in a less principal component space (PC5–6) although the hexagon shape was a little distorted.

## 3.5.2. Representations of Logic Words: "True" and "False"

Next, we also analyzed the representations of logic words. We visualized memory cell activations after giving all eight possible instructions with one objective word to a robot that was grasping R-B flags and waiting in the DOWN-DOWN posture (**Figure 7**). In the directions of PC1, PC2, and PC3, activations directly corresponding to each part of speech (objective, verb, truth value) of the input sentence can be seen; that is, the "red"/"blue", "up"/"down", and "true"/"false" pairs are reflected in the PC2, PC1, and PC3 axes, respectively. Here, the problem is that the RNN has to solve an XOR problem consisting of "up"/"down" and "true"/"false" (shown in the left panel of **Figure 7**) and to link its interpretation to an UP or DOWN goal-oriented action. More precisely, if the sentence includes "up true" or "down false," the UP action must be chosen. In contrast, if the sentence includes "up false" or "down true," the DOWN action must be chosen.

Indeed, exploring the lower-rank component PC4 reveals that the activations located diagonally across the parallelogram in the PC1–PC3 space are aligned in the same direction. "Up true" and "down false," which are mutually orthogonal but have the same meaning UP, are represented in the bottom area of the right panel. In contrast, "up false" and "down true" are represented in the top area. Thanks to this non-linear embedding, the XOR problem is solved in the PC4 direction. In summary, the RNN extracted the XOR problem implicitly included in the sequential experiences and learned to link the orthogonal instructions to the same goal-oriented action by its non-linear dynamics, while retaining, in the larger principal components, the information that the input sentences are very different from each other.

## 3.5.3. Representations of Logic Words: "And" and "Or"

The left panel of **Figure 8** shows the memory cell states after giving a robot that is grasping R-B flags some instructions whose objective part is one word, AND-concatenated, or OR-concatenated. The verb and the truth value are "up" and "true," respectively. AND-concatenated instructions, which direct the robot to raise both arms, are represented away from the other instruction encodings in the PC1 direction. The pair of "red" and "blue" is represented in the PC2 direction. Here, the word "or," which directs the robot to raise either hand, is embedded in the middle of the space between these two encodings. This suggests that "or" is represented as an unstable point of the network dynamics and that, thanks to this acquired dynamics, behavior that apparently looks like randomly choosing the left or right arm has emerged.

To verify this, we conducted the following additional simulation. To a robot that had experienced 2,048 different contexts, we gave the instruction "green or blue up true." Specifically, in all 2,048 contexts, the robot is currently waiting in a DOWN-DOWN posture with G-B flags. However, in each context, the order of the preceding episodes differs randomly from that in the other contexts. As mentioned in section 3.3, even when the situation defined by the combination of an instruction, the vision, and the robot's current posture (in this simulation, "green or blue up true," the green flag in the left hand, the blue flag in the right hand, and the DOWN-DOWN posture, respectively) is the same, different activations occur every time, because the contextual information of the previous episodes still remains in the memory cell states. Therefore, we see 2,048 different activations corresponding to the 2,048 contexts. As shown in the top left panel of the right side of **Figure 8**, the memory cell states after the instruction "green or blue up true" were arranged in an arch-shaped space. Each point corresponds to one specific context. When the activation was on the left side of the arch, the robot generated the L-UP action.

FIGURE 8 | Left: The memory cell states after giving a robot that is grasping red and blue flags some instructions whose objective part is one word, AND-concatenated, or OR-concatenated. The verb and the truth value are "up" and "true," respectively. The AND-concatenated instructions are represented away from other instruction encodings in the PC1 direction. The pair of "red" and "blue" is represented in the PC2 direction. The "or" that directs the robot to raise either hand is embedded in the middle space between these two encodings. Right: To a robot waiting in the DOWN-DOWN posture with G-B flags after 2,048 different contexts, we gave the instruction "green or blue up true." The memory cell states after the instruction (t = 0) were arranged on an arch-shaped space (left top). Each point corresponds to one specific context. When the activation was on the left side of the arch, the robot generated L-UP action and the internal states converged to the fixed-point corresponding to the UP-DOWN posture. In contrast, on the right side, the robot generated R-UP action, and the internal states converged to the fixed-point corresponding to the DOWN-UP posture. When the activation was on the topmost area of the arch, a little unstable action was generated. However, even in such cases, the internal states eventually converged to one of fixed-points, as shown in the right bottom panel.

In contrast, for right-side activation, the robot generated the R-UP action. When the activation was in the topmost area of the arch, some unstable motion was generated. However, in all cases, the internal states eventually converged into one of the fixed-point attractors corresponding to the DOWN-UP posture or the UP-DOWN posture, as shown in the bottom rightmost panel of **Figure 8**. This means that, to respond to OR instructions that require the robot to behave in a random exclusive-OR-like way, the internal representation converged from an unstable region to either one of two stable points.

In this analysis, PC1 was strongly dominant (its contribution ratio is 97.9%). Given this dominant contribution ratio, one might assume that a single neuron would be enough to generate this unstable dynamics. However, the activation in the PC1 direction was actually composed of the activations of multiple units. Specifically, no single unit has a cosine similarity of more than 0.4 (or less than −0.4) to PC1; instead, seven units have cosine similarities in the range between 0.2 and 0.4 (or between −0.4 and −0.2) to PC1. In other words, this unstable dynamics was realized in a distributed way.
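The unit-wise check above has a simple form: the cosine similarity between a unit's basis vector e_i and the PC1 direction is just the i-th component of the (normalized) PC1 vector. A sketch with a hypothetical helper name:

```python
import numpy as np

def unit_alignment(pc1):
    """Cosine similarity between each memory-cell unit axis and PC1.

    For a canonical basis vector e_i, cos(e_i, pc1) = pc1[i] / ||pc1||,
    so a distributed direction shows many moderate similarities rather
    than one dominant unit.
    """
    pc1 = np.asarray(pc1, dtype=float)
    return pc1 / np.linalg.norm(pc1)
```

With a hypothetical 50-unit direction spread over seven units of weight 0.35 and the rest near zero, no similarity exceeds 0.4 while seven fall in [0.2, 0.4), matching the qualitative pattern reported in the text.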

## 3.5.4. Dynamical Representations of the Task Execution

Finally, we visualized the internal dynamics during the execution of the task. **Figure 9** shows the state transitions of the memory cells while the robot experienced four episodes and its posture moved in the order DOWN-DOWN, UP-DOWN, UP-UP, DOWN-UP, and back to DOWN-DOWN. Here, the PC1–2 space seems to roughly correspond to the robot's posture. Moreover, the transitions among different postures are represented as

transitions among different fixed-point attractors (shown as circles), each of which corresponds to a posture. On receiving an instruction, the internal state is activated in the PC3 direction and reaches the unstable point indicated by a + mark. This activation is gained as a result of the integration of the visual information and the processing of logic words, as mentioned above, although it is difficult to visualize them simultaneously in this figure. By converging to one of the fixed points again after the activation, the corresponding goal-oriented action is generated. The robot then waits for a subsequent instruction at that point. This is the case even when the correct action is to maintain the current posture: while the joint angles apparently remain stationary, the action is internally represented as convergence back to the original fixed point.

In summary, the RNN learned to encode the instructions in a form integrated with the visual inputs and the current robot posture, and to generate an appropriate robot action, through the experience of sequential interaction data. It was also revealed that the logical words "true," "false," "and," and "or" are processed along with the other referential words and encoded in a way that reflects their functions in the current task.

# 3.6. Generalization Ability

In the previous subsection, we showed the internal representations of relations between instructions and actions acquired through the experience of an imposed task. Empirically, when this kind of systematic representation can be organized, the model achieves a certain level of generalization ability (Sugita and Tani, 2005; Ogata et al., 2007; Yamada et al., 2016). Thus, we conducted learning experiments again, removing 50 or 25% of the possible situations from the training dataset. We chose the removed patterns regularly so that each word, robot posture, and flag arrangement would appear uniformly, as shown in **Table 1**. Here, we trained only three models, with 100, 150, and 300 LSTM units. The results are shown in **Figure 10**.

We first explain the performance of the models trained with only 50% of the possible situations. For types (2)–(4), the models behaved appropriately for many of the possible patterns, even for the unexperienced ones. In contrast, only about one-third of the possible patterns of type (1), single-objective instructions, could be dealt with appropriately. In fact, this performance matches chance level, at which the robot uniformly randomly chooses one of the three possible motions for a single-objective instruction (moving the left arm, moving the right arm, or keeping the current posture). To clarify why the network failed to generate appropriate motions, we checked some examples actually generated by the 100-node model (**Figure 11**). In one failure (indicated by the left rounded box), the final posture was correct but the trajectory was not stable, so it did not satisfy the criterion that the error should be within 0.04. In another failure (right rounded box), a wrong action was chosen. The latter case indicates that although the model roughly learned to generate some possible actions after an instruction input, it failed to learn the relationships between color words and visual information.

TABLE 1 | To evaluate the model's generalization ability for the flag task, we conducted learning experiments again by removing (a) 50% or (b) 25% of the possible situations from the training dataset.

(a) In the former case, only the situations indicated by ⊚ marks were included in the training data. (b) In the latter case, the situations indicated not only by ⊚ marks but also by ◦ marks were included in the training data. The situations denoted by an empty cell were included in the training data in neither case. In this table, instruction patterns are abbreviated as follows. L: left flag color; R: right flag color; A: AND-concatenated objectives; O: OR-concatenated objectives; U: up; D: down; T: true; F: false. For example, the cell referred to as DOWN-DOWN, R-G, LUF is indicated by a ⊚ mark, meaning that the robot grasping R-G flags and waiting in a DOWN-DOWN posture may receive the instruction "red up false" during training in both cases (a) and (b). As another example, the cell referred to as UP-UP, R-B, OUT is indicated by a ◦ mark, meaning that the robot grasping R-B flags and waiting in an UP-UP posture may receive the instructions "red or blue up true" and "blue or red up true" during training only in case (b). As a final example, the cell referred to as DOWN-UP, B-R, LUT is empty, meaning that the robot grasping B-R flags and waiting in a DOWN-UP posture does not receive the instruction "blue up true" during training in either case.

One possible reason for the failures on type (1) single-objective instructions is that only this type is actually linked with visual information. For example, in the case of type (2) AND-concatenated instructions, the RNN does not have to consider the visual stimuli because, when the instruction includes "and," both arms have to be moved regardless of the flag colors. In fact, when we gave the robot grasping R-B flags a contradictory instruction, "green and blue up true," it raised both arms; in other contradictory cases, the results were similar. Also for types (3) and (4), when the instruction includes "or," either arm should be moved regardless of the flag colors. In that sense, type (1) single-objective instructions are more difficult than the other types, and it is possible that experiencing only half of the possible patterns is not enough to completely generalize the task space. We then performed the learning with the dataset in which only 25% of the situations were removed. In this case, the models responded appropriately to more than 80% of the type (1) unexperienced situations in a generalized way.

In the next section, we describe another learning experiment, based on the "bell task." The bell task differs from the flag task in two ways. First, the action sequences are more complicated because we collect motion data by using a real robot. Second, all the instructions including a logic word require referring to the visual information. We investigate whether a similar kind of representation of logic words, reflecting their function, can be organized in a more realistic setting.

# 4. EXPERIMENT 2: BELL TASK

# 4.1. Task Overview

As a more realistic task, we conducted a learning experiment based on the bell task. In contrast with the first task, we collect motion data by using a real robot. First, a human places three bells colored red, green, and blue at random: one on the left, another at the center, and the other on the right in front of the robot. Then, the human gives the robot a linguistic instruction consisting of a combination of a verb ("hit," "point"), an objective ("red," "green," "blue"), and an adverb ("slowly," "quickly"). When the left or right bell is indicated, the robot must hit (point at) the bell with the closer hand. However, when the center bell is indicated, the robot can hit (point at) the bell with either hand.

Similarly to the flag task, two objective (color) words can be concatenated by "and". In such cases, the robot has to hit (point at) the two indicated bells simultaneously. If two color words are concatenated by "or," hitting (pointing at) either bell indicated is correct. In another case, the logic word "not" can be prefixed to a color word (referred to as NOT-prefixed). In this case, hitting (pointing at) the two bells that are the complementary set of the indicated color is the correct response. For example, when the instruction is "hit not red quickly," the correct action is to simultaneously hit both the green and blue bells quickly.
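The objective-part semantics of "and," "or," and "not" described above can be sketched as a small lookup. `target_bells` is a hypothetical helper reconstructed from the task description; the verb and adverb are handled elsewhere.

```python
COLORS = {"red", "green", "blue"}

def target_bells(objective_words):
    """Map the objective part of a bell-task instruction to the admissible
    sets of bells to act on. More than one entry in the returned list means
    the robot may choose (the "or" case)."""
    if objective_words[0] == "not":
        # "not red" -> act on the complementary set {green, blue}
        return [COLORS - {objective_words[1]}]
    if "and" in objective_words:
        # act on both indicated bells simultaneously
        return [{objective_words[0], objective_words[2]}]
    if "or" in objective_words:
        # acting on either indicated bell is correct
        return [{objective_words[0]}, {objective_words[2]}]
    return [{objective_words[0]}]
```

For instance, `target_bells(["not", "red"])` yields the single admissible target set {green, blue}, matching the "hit not red quickly" example in the text.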

The number of possible situations is 432: a combination of 72 possible instructions and 6 bell arrangements. In contrast to the flag task, in this task the initial posture and the end posture are the same; therefore, the motion does not depend on the robot's initial posture. However, the actual action sequences are more complicated than in the flag task, as shown in **Figure 12**.

# 4.2. Data Representation

We represent the execution of the bell task as a sequence of 26-dimensional vectors. The state $\mathbf{S}\_t$ at time step $t$ is represented as

FIGURE 12 | Overview of the bell task. A human places three bells colored red, green, and blue in random order. The human gives the robot an instruction consisting of a combination of a verb ("hit," "point"), an objective ("red," "green," "blue"), and an adverb ("slowly," "quickly"). When the left or right bell is indicated, the robot must hit (point at) the bell with the closer hand. In the case of the center bell, the robot may hit (point at) it with either arm. Two color words can be concatenated by "and"; in this case, the robot must act on both bells simultaneously (not presented in this figure). Two color words can also be concatenated by "or," in which case the robot may hit (point at) either bell. In another case, the logic word "not" can be prefixed to a color word; in this case, simultaneously hitting (pointing at) the two bells that are the complementary set of the indicated color is the correct response.

follows:

$$\mathbf{j}\_t = [j\_{l0}^{(t)}, j\_{l1}^{(t)}, j\_{l2}^{(t)}, j\_{l3}^{(t)}, j\_{l4}^{(t)}, j\_{r0}^{(t)}, j\_{r1}^{(t)}, j\_{r2}^{(t)}, j\_{r3}^{(t)}, j\_{r4}^{(t)}],\tag{8}$$

$$\mathbf{v}\_t = [v\_{l0}^{(t)}, v\_{l1}^{(t)}, v\_{c0}^{(t)}, v\_{c1}^{(t)}, v\_{r0}^{(t)}, v\_{r1}^{(t)}],\tag{9}$$

$$\mathbf{w}\_t = [w\_{0}^{(t)}, w\_{1}^{(t)}, w\_{2}^{(t)}, w\_{3}^{(t)}, w\_{4}^{(t)}, w\_{5}^{(t)}, w\_{6}^{(t)}, w\_{7}^{(t)}, w\_{8}^{(t)}, w\_{9}^{(t)}],\tag{10}$$

$$\mathbf{S}\_t = [\mathbf{j}\_t; \; \mathbf{v}\_t; \; \mathbf{w}\_t].\tag{11}$$

To represent the robot joints, 10 elements corresponding to the shoulder pitch, shoulder roll, elbow roll, elbow yaw, and wrist yaw of each arm are assigned to the vector $\mathbf{j}\_t$. Action sequences take approximately 16 steps in the case of QUICKLY actions and approximately 25 steps in the case of SLOWLY actions. Action sequences are recorded by actually controlling the robot joints along predesigned trajectories. Visual information is encoded as a six-dimensional vector $\mathbf{v}\_t$. Three pairs of elements encode the bell colors. For example, $v\_{l0}^{(t)}$ and $v\_{l1}^{(t)}$ are used to represent the left bell color. In this task, it is assumed that the hues R, G, and B correspond to 0, 120, and 240° on the hue circle, respectively. The component $v\_{l0}^{(t)}$ is the sine of the angle of the left bell color on the hue circle, and $v\_{l1}^{(t)}$ is its cosine. The pairs $(v\_{c0}^{(t)}, v\_{c1}^{(t)})$ and $(v\_{r0}^{(t)}, v\_{r1}^{(t)})$ encode the center and right bell colors, respectively, in the same way. This encoding method was used by Sugita and Tani (2005) and Yamada et al. (2016). Ten elements are assigned for language. Each element of $\mathbf{w}\_t$ corresponds to one word, out of "hit," "point," "red," "green," "blue," "slowly," "quickly," "and," "or," and "not," and an instruction sentence is represented as a sequence of one-hot vectors. Here, the instruction sentences and corresponding action sequences are concatenated on a computer, and the sequences that represent interactions are similar to those for the flag task, with multiple repetitions of instructions and corresponding actions (and waiting phases).
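The hue-circle encoding above can be sketched directly; the helper names `encode_bell` and `encode_vision` are hypothetical.

```python
import math

# Assumed hue angles for the three bell colors, per the text
HUE_DEG = {"red": 0.0, "green": 120.0, "blue": 240.0}

def encode_bell(color):
    """Encode one bell color as the (sine, cosine) of its hue-circle angle."""
    theta = math.radians(HUE_DEG[color])
    return (math.sin(theta), math.cos(theta))

def encode_vision(left, center, right):
    """Six-dimensional visual vector: (sin, cos) pairs for the left,
    center, and right bells, in that order."""
    v = []
    for color in (left, center, right):
        v.extend(encode_bell(color))
    return v
```

A red bell thus maps to (0, 1), while green and blue map to points 120° apart on the unit circle, keeping the three colors equidistant in the encoding space.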

# 4.3. Learning Setting and Evaluation Method

We made 512 sequential datasets for training, each of which includes eight episodes. All the possible situations were included at least once. We built models with 100, 300, 500, and 700 LSTM units, and trained them 10 times from randomly initialized learnable parameters. Adam is used as an optimizer. The number of learning iterations is 10,000, and the learning rate is set to 0.001.

After learning, we made another dataset for the evaluation, which includes all possible situations 10 times each. When the root-mean-squared error per time step per joint between the generated joint angles and the correct ones during action generation is less than 0.04, we judge that the RNN has succeeded in generating an appropriate action. We regard the situation patterns in which the RNN succeeds in generating an appropriate action more than seven times out of 10 as "appropriately learned," just as in the flag task.
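Unlike the flag task's final-posture criterion, this criterion compares whole trajectories. A minimal sketch with a hypothetical helper name:

```python
import numpy as np

def trajectory_success(generated, correct, threshold=0.04):
    """Root-mean-squared error per time step per joint between the generated
    and correct joint-angle trajectories, compared with the 0.04 threshold.

    Both inputs are (time steps, joints) arrays, e.g. roughly (16, 10) for a
    QUICKLY action over the 10 arm joints.
    """
    g = np.asarray(generated, dtype=float)
    c = np.asarray(correct, dtype=float)
    rmse = np.sqrt(np.mean((g - c) ** 2))
    return rmse < threshold
```

When a situation has multiple correct actions, the comparison would be repeated against each admissible trajectory, as in the flag task.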

# 4.4. Task Performance after Training

We classify all the possible situations into four types: situations in which (1) the instruction includes only one objective word (72 situations); (2) the instruction is AND-concatenated (144 situations); (3) the instruction is OR-concatenated (144 situations); and (4) the instruction is NOT-prefixed (72 situations). We evaluate the performance by counting how many situation patterns each model learns appropriately with respect to each of the four types. **Figure 13** shows the results. The task performance improved as the number of LSTM nodes increased. However, there was no significant difference between the 500- and 700-node models for any situation type.

Next, we investigated which action was chosen by the model for instructions that had multiple correct actions. Here, we analyzed the results of the model with 500 LSTM units. The situations that have multiple correct actions are divided into three types. (a) The sentence instructs the robot to act on the center bell. In this case, acting with either arm is correct; therefore, two correct actions exist. (b) The sentence instructs the robot to act on the "left or right" bell. In this case, there are also two solutions. (c) The sentence instructs the robot to act on the "left or center" bell, or the "right or center" bell. In this case, there are three correct actions: (i) acting on the center bell with the left arm, (ii) acting on the center bell with the right arm, and (iii) acting on the left (resp., right) bell with the left (resp., right) arm. The results for these three types of situation are shown in **Table 2**. As shown in **Table 2**, the model could choose each of the multiple solutions evenly. In fact, types (a), (b), and (c) have 24, 48, and 96 possible variations, respectively, and the test was conducted 10 times for each of them. In most of these ambiguous situations, the RNN chose each possible solution at least once. Just as in the flag task, the RNN could learn to behave appropriately even in such ambiguous situations.

FIGURE 13 | Experiment 2 (bell task). Action generation performance with respect to each of the four situation types: (1) the instruction includes one objective word; (2) the instruction is AND-concatenated; (3) the instruction is OR-concatenated; and (4) the instruction is NOT-prefixed. The written values are averages of 10 trials in which learning began with different seeds. Error bars represent standard deviations.

TABLE 2 | The situations that have multiple correct actions are divided into three types. (a) The sentence instructs the robot to act on the center bell. In this case, acting with either arm is correct; therefore, two correct actions exist. (b) The sentence instructs the robot to act on the "left or right" bell. In this case, there are also two solutions. (c) The sentence instructs the robot to act on the "left or center" bell, or the "right or center" bell. In this case, there are three correct answers: (i) acting on the center bell with the left arm, (ii) acting on the center bell with the right arm, and (iii) acting on the left (resp., right) bell with the left (resp., right) arm.

# 4.5. Analyses of Internal Representations

## 4.5.1. Representations of "Or"

As in the flag task, we investigated the internal representations organized after learning by using PCA. First, we visualized the states of the memory cells after giving instructions in the form "hit (objective part) slowly" that include one objective word or two OR-concatenated objective words (the left panel of **Figure 14**). This figure shows that the activations after the OR-concatenated instructions are located between the activations after the one-objective-word instructions. For example, "hit red or green slowly" and "hit green or red slowly" are embedded between the encodings of "hit red slowly" and "hit green slowly." This suggests that "or" is represented by unstable points in the network dynamics, as in the flag task. In fact, the right panel of **Figure 14** shows an arch-shaped activation space like the one in the flag task, although the shape is less clean. Note that although in the flag task the meaning of "or" is always "left or right" regardless of the flag colors, in the current task the two candidate bells depend on the input color words and visual information. Even in this kind of situation, the functional meaning of "or" can be appropriately acquired in a way that is integrated with the objective color words.

## 4.5.2. Representations of "And" and "Not"

**Figure 15** shows the memory cell states after giving instructions in the form of "hit (objective part) slowly," in which the objective part is AND-concatenated or NOT-prefixed. The bell arrangement was fixed in the order of R, G, B from left to right. In this task, "not" indicates the complementary set; therefore, for example, "not green" and "red and blue" have the same meaning. Although the objective parts of these instructions are completely orthogonal to each other, they are located close to each other in the space spanned by PC4 and PC5, and, as a result, instructions with the same meaning form clusters: R-AND-G, G-AND-B, and B-AND-R. These instructions, which include logic words, also require the RNN to consider visual information to determine the meaning of the sentence: which two bells should be hit (pointed at) depends on both the input color words and the visual information. The RNN learned to link these sentences flexibly to sensorimotor information solely from the sequential data experienced in the imposed task.

In summary, even in the bell task that requires both referring to visual information and processing of logic words simultaneously, the functional meaning of logic words could be appropriately organized in a way that was integrated with the referential words.

# 5. DISCUSSION

The current study conducted learning experiments involving translation from linguistic instructions, including both referential and logic words, into robot actions in order to investigate what kind of compositional representations emerged from the interactive experiences. In the case of referential words, objective words were merged with visual input, verbs were integrated with the robot's own posture, and, as a result, appropriate actions were generated. Simultaneously, the model could also deal with the logic words "true," "false," "not," "and," and "or." By embedding these words as internal representations that reflect their functional properties, appropriate actions were achieved. In the following, we discuss the three types of logic words separately.

# 5.1. True, False, Not

"True" and "false" in the flag task were understood as the goal-oriented action UP/DOWN by being combined with "up" and "down" in a X-OR manner. "Not" in the bell hitting task worked as an operation to choose a complementary set. For example, "not red" corresponded to "green and blue." The RNN learned to embed these completely orthogonal phrases as having the same meaning in the lower-ranking principal component space by its non-linear transformation. In the field of natural language understanding by deep learning, a similar kind of analysis has been performed. Li et al. (2016) showed that a model optimized for sentiment analysis changes its internal encoding drastically in response to the negation of an expression. Hence, for example, "not good" is encoded closer to "bad" than to "good". However, the visualization in the current study showed that even though the information that input sentences were completely different is still retained in the main component space, the combined representation corresponding to the behavioral meaning is reflected in the lower ranking principal components. In other words, not only information encoding compositionally integrated meaning but also information of compositional elements are retained in the model's memory.

This aspect seems to be important. For example, imagine that both the sentence "hit red quickly" in the case of an RGB bell arrangement and the sentence "hit blue quickly" in the case of a BGR arrangement were encoded simply as the action HIT-L-QUICKLY, with the loss of the information about the element words. In this case, it would be impossible for the model to respond appropriately to changes, such as a sudden replacement of bells during action generation, because the color word information would have been lost. By retaining the information about the compositional elements, adaptive behavior to respond to such fluctuations would be

FIGURE 14 | Left: The states of the memory cells after giving instructions in the form of "hit (objective part) slowly" that include one objective word or two OR-concatenated objective words. The activations after the OR-concatenated instructions are located between the activations for the one-objective-word instructions. For example, "hit red or green slowly" and "hit green or red slowly" are embedded between the encodings of "hit red slowly" and "hit green slowly." Right: We gave the instruction "hit red or blue slowly" to a robot waiting with bells arranged in the order of RGB from left to right, after 2,048 different contexts. The memory cell states after the instruction were arranged on an arch-shaped space, which was less defined than that for the flag task. When the activation was on the left side of the arch, the robot generated the HIT-L-SLOWLY action; for activation on the right side, it generated the HIT-R-SLOWLY action. When the activation was in the topmost area, an unstable action was generated.

possible, although it is not certain that our current model is capable of dealing with such situations, because they were not included in the training data.

# 5.2. And

In the flag task, "and" per se worked as a kind of universal quantifier without referring to the objective words. For example, when a robot grasping R-B flags was given "green and blue up true," it lifted up both arms; the results were similar in other contradictory cases. In other words, if the instruction includes "and," the color words are ignored and only the verb (and truth value) is considered. In that sense, "and" is represented as a concept one step higher. This interpretation of "and" by the neural network could not be expected before the experiment and is actually outside our common usage of "and," but it can be seen as a reasonable and rational solution within the range of the current task. In contrast, in the bell task, AND-concatenated instructions required referring to visual information, and the model appropriately integrated them with the visual information and then generated the correct both-hand actions.

In this way, "and" was represented in a different suitable manner with respect to each task. However, in general, there are more situations in which "and" is used in different ways to combine words, phrases, or sentences. The investigation of how such higher order or general types of "and" can be handled or represented is left for future work.

# 5.3. Or

In the flag task it was shown that without noise applied to the joint angles, the model learned less successfully than it did with noise. This difference did not appear in preliminary experiments that did not include OR-concatenated instructions. We think that due to the inclusion of OR-concatenated instructions that introduce ambiguity by giving as correct either of the answers

FIGURE 15 | The memory cell states after giving instructions in the form of "hit (objective part) slowly," in which the objective part is AND-concatenated or NOT-prefixed. The bell arrangement was fixed in the order of R, G, B from left to right. In this task, "not" indicates the complementary set; therefore, for example, "not green" and "red and blue" have the same meaning. Although the objective parts of these instructions are completely orthogonal, they are located close to each other in the space spanned by PC4 and PC5, and, as a result, instructions with the same meaning form clusters: R-AND-G, G-AND-B, and B-AND-R.

randomly each time, the optimization by minimization of the simple squared error became unstable. This is very similar to the popular thought experiment called Buridan's ass. In the story, an ass is given grass on both its left and right sides, located at exactly the same distance away. Faced with this dilemma, it could not choose a side and finally starved to death. Our analysis shows the possibility that the network solved this problem, which the ass faced too honestly, by using the tiny amount of noise as a clue to determine which arm moves and by organizing unstable dynamics that converge to either of two fixed-point attractors. However, a more detailed analysis of the dynamical characteristics of the model is required. For example, Tani and Fukumura (1995) showed that a deterministic RNN model can reproduce a simple symbol sequence generated in accordance with probabilistic rules by producing self-organized chaotic dynamics. Namikawa et al. (2011) also demonstrated that a temporally hierarchical RNN could learn to generate pseudo-stochastic transitions between multiple motor primitives on a robot. Our experiment showed that a similar function, generating actions as if they were produced probabilistically, can be acquired by learning an interactive instruction-action task with longer time dependencies and greater complexity. Our results also showed that the ability to deal with OR-concatenated instructions was improved by increasing the number of LSTM nodes. We think that by increasing the number of nodes, and thereby the representational capacity, the network could learn to embed the probabilistic experiences in chaotic dynamics. How this function is dynamically represented should be analyzed in future work.
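The role of the noise can be illustrated with a toy bistable system (our own illustration, not the paper's RNN): the map x ← tanh(a·x) with a = 2 has two stable fixed points near ±0.96 and an unstable one at x = 0. Starting exactly on the unstable point, a tiny noise term alone decides which attractor the state falls into, resolving the Buridan's-ass symmetry.

```python
# Toy bistable dynamics: a tiny noise term breaks the symmetric dilemma
# and the state converges to one of two fixed-point attractors.
import math
import random

def settle(seed, a=2.0, noise=1e-6, steps=200):
    rng = random.Random(seed)
    x = 0.0  # the symmetric dilemma: exactly between the attractors
    for _ in range(steps):
        x = math.tanh(a * x) + rng.gauss(0.0, noise)
    return x

outcomes = [settle(s) for s in range(20)]
lefts = sum(1 for x in outcomes if x < 0)
print(f"converged to the left attractor: {lefts} of 20 runs")
```

Which attractor wins depends only on the noise sequence, so across runs the choice looks probabilistic even though each trajectory is (almost) deterministic.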

# 5.4. Summary and Future Work

This study conducted learning experiments that translate linguistic sentences, including both referential and logic words, into robot actions in order to investigate what kind of compositional structures emerged from the experiences of interaction. Referential words were linked to the visual information and the robot's current state, and appropriate actions were then generated. The logic words were simultaneously represented by the model in accordance with their functions as logical operators. More precisely, the words "true," "false," and "not" work as non-linear transformations that embed orthogonal phrases into the same area of a lower-ranking principal component space. "And" in the flag task eliminated reference to the visual information in a rational way and worked as if it were itself a universal quantifier. "Or," which requires action generation that appears random, was represented as an unstable region of the network's dynamical system.

Future work includes the following. First, we should confirm whether both referential and logic words are simultaneously learned when the complexity of the task is further extended. Although scaling up the vocabulary size is one way to extend the task, scaling up the syntactic variety is also required because
the sentence patterns in this study were fixed in each task. In extended tasks, the logic words could be used not only between words but also between phrases or clauses. Moreover, although the visual information in the current experiments is highly preprocessed, in more realistic tasks the environment would include various kinds of meaningful information, not only color. Therefore, the relationships between language and the environment should be learned from low-level data (e.g., raw images) in a less arbitrary way. To deal with such tasks, we could extend our model by replacing the preprocessing module with another neural network model for vision, such as a convolutional neural network (CNN). In fact, some studies have combined a CNN with an RNN to learn the relationships between linguistic instructions and corresponding behavior in an end-to-end manner (Chaplot et al., 2017; Hermann et al., 2017).

Second, a more detailed analysis of the internal representations is required. This includes the analysis of more dynamical characteristics and the visualization of the activation patterns of each neuron. In particular, the latter seems to be valuable, because, although in the current study we visualized activation only in the principal component space, models that have memory cells, such as gated recurrent units or LSTM, are expected to encode different information and functions in specific nodes.

Finally, we are planning to build a bi-directional neural model to translate between linguistic and behavioral sequences. In fact, human language systems are bi-directionally translatable. To build a bi-directional model would be valuable both for understanding symbol grounding structure more deeply and for developing more flexible communicative agents.

# AUTHOR CONTRIBUTIONS

TY, SM, HA, and TO conceived and designed the research, and wrote the paper. TY performed the experiment and analyzed the data.

# FUNDING

This work was supported by a Japan Society for the Promotion of Science (JSPS) Grant-in-Aid for Young Scientists (A) (No. 16H05878), a JSPS Grant-in-Aid for JSPS Research Fellow (No. 17J1058), a JST CREST Grant (No. JPMJCR15E3), and the Program for Leading Graduate Schools, "Graduate Program for Embodiment Informatics" of the Ministry of Education, Culture, Sports, Science, and Technology.


# REFERENCES

Gers, F. A., and Schmidhuber, J. (2000). "Recurrent nets that time and count," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (Como), 189–194.

Harnad, S. (1990). The symbol grounding problem. Phys. D 42, 335–346.

Kingma, D., and Ba, J. (2015). "Adam: a method for stochastic optimization," in International Conference on Learning Representations (ICLR 2015) (San Diego, CA).

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Yamada, Murata, Arie and Ogata. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Hierarchical Spatial Concept Formation Based on Multimodal Information for Human Support Robots

#### Yoshinobu Hagiwara\*, Masakazu Inoue, Hiroyoshi Kobayashi and Tadahiro Taniguchi

*Emergent Systems Laboratory, College of Information Science and Engineering, Ritsumeikan University, Shiga, Japan*

In this paper, we propose a hierarchical spatial concept formation method based on a Bayesian generative model with multimodal information, e.g., vision, position, and word information. Since humans have the ability to select an appropriate level of abstraction according to the situation and describe their position linguistically, e.g., "I am in my home" and "I am in front of the table," a hierarchical structure of spatial concepts is necessary in order for human support robots to communicate smoothly with users. The proposed method enables a robot to form hierarchical spatial concepts by categorizing multimodal information using hierarchical multimodal latent Dirichlet allocation (hMLDA). Object recognition results obtained using a convolutional neural network (CNN), hierarchical k-means clustering results of the self-position estimated by Monte Carlo localization (MCL), and a set of location names are used as features of the vision, position, and word information, respectively. Experiments in forming hierarchical spatial concepts and evaluating how well the proposed method can predict unobserved location names and position categories are performed using a robot in the real world. The results verify that, relative to comparable baseline methods, the proposed method enables a robot to predict location names and position categories closer to predictions made by humans. As an application example of the proposed method in a home environment, a demonstration in which a human support robot moves to an instructed place based on human speech instructions is achieved using the formed hierarchical spatial concepts.

Keywords: spatial concept, hierarchy, human-robot interaction, multimodal categorization, human support robot, unsupervised learning

# 1. INTRODUCTION

Space categorization is an important function for human support robots. It is believed that humans flexibly predict unknown information by forming categories of space through their multimodal experiences. We define categories of space formed by self-organization from experience as spatial concepts. Furthermore, prediction based on the connection between concepts and words is thought to lead to a semantic understanding of words. This means that spatial concept formation is an important function of human intelligence, and having this ability is important for human support robots.

Spatial concepts form a hierarchical structure. The use of this hierarchical structure enables humans to predict unknown information using concepts in an appropriate layer. For example,

#### Edited by:

*Keum-Shik Hong, Pusan National University, South Korea*

#### Reviewed by:

*Cornelius Weber, University of Hamburg, Germany Zhong Yin, University of Shanghai for Science and Technology, China*

#### \*Correspondence:

*Yoshinobu Hagiwara yhagiwara@em.ci.ritsumei.ac.jp*

Received: *29 November 2017* Accepted: *26 February 2018* Published: *13 March 2018*

#### Citation:

*Hagiwara Y, Inoue M, Kobayashi H and Taniguchi T (2018) Hierarchical Spatial Concept Formation Based on Multimodal Information for Human Support Robots. Front. Neurorobot. 12:11. doi: 10.3389/fnbot.2018.00011*

humans can linguistically represent their own positions at an appropriate level of abstraction according to the context of communication, such as "I'm in my home" at the global level, "I'm in the living room" at the intermediate level, and "I'm in front of the TV" at the local level. In this case, the living room has the home as its higher layer and the front of the TV as its lower layer. By learning such a hierarchical structure, even if an unknown place does not have features such as the front of the TV, its characteristics can be predicted if it has features of the living room. It is expected that the robot acquires spatial concepts in the higher layer by learning the commonality of features among spatial concepts in the lower layer.

Furthermore, the hierarchical structure of spatial concepts plays an important role when a robot moves based on linguistic instructions from a user. As shown in **Figure 1**, even if multiple tables are present in a room, robots can recognize them individually by using a spatial concept at a higher layer, such as "the front of the table in the living space." Indeed, in RoboCup@Home, an international competition in which intelligent robots coexist with humans in home environments, location names are defined in two layers in the tasks of a General Purpose Service Robot<sup>1</sup>, as shown in **Table 1**. This table indicates that having a sense of spatial relations, e.g., that the living space has a center table, is important for a robot coexisting with humans. By having such hierarchical spatial concepts, it becomes possible to describe and move within a space based on linguistic communication with a user.

We assume that a computational model that considers the hierarchical structure of spatial concepts enables robots to acquire not only the spatial concepts, but also the hierarchical structure hidden among the spatial concepts, through a bottom-up approach, and to form spatial concepts similar to those perceived by humans. The goal of this study was to develop a robot that can predict unobserved location names and positions from observed information using formed hierarchical spatial concepts. The main contributions of this paper are as follows.


The rest of this paper is structured as follows. Section 2 describes related works. Section 3 presents an overview and the computational model of hierarchical spatial concept formation. Section 4 presents experimental results evaluating the effectiveness of the proposed method in space categorization. Section 5 describes application examples of using hierarchical spatial concepts in a home environment. Finally, section 6 presents conclusions.

# 2. RELATED WORKS

In order for a robot to move within a space, a metric map consisting of occupancy grids that encode whether or not an area is navigable is generally used. Simultaneous localization and mapping (SLAM) (Durrant-Whyte and Bailey, 2006) is a well-known localization method for mobile robots. However, tasks that are coordinated with a user cannot be performed using only a metric map, since semantic information is required for interaction with a user. Nielsen et al. (2004) proposed a method of expanding a metric map into a semantic map by attaching single-frame snapshots in order to share spatial information between a user and a robot. As a bridge between metric maps and human-robot interaction, research on semantic maps that attach semantic attributes (such as object recognition results) to metric maps has been performed (Pronobis et al., 2006; Ranganathan and Dellaert, 2007). Studies have also been reported on giving semantic object annotations to 3D point cloud data (Rusu et al., 2008, 2009). Moreover, in terms of studies based on multiple cues, Espinace et al. (2013) proposed a method of characterizing places according to low-level visual features associated with objects. Although these approaches could categorize spaces based on semantic information, they did not deal with linguistic information about the names that represent spaces.

In the field of navigation tasks with human-robot interaction, methods of classifying corridors and rooms using a predefined ontology based on shape and image features have been proposed (Zender et al., 2008; Pronobis and Jensfelt, 2012). In studies on semantic space categorization, Kostavelis and Gasteratos (2013) proposed a method of generating a 3D metric map that is semantically categorized by recognizing a place using bags of features and support vector machines. Granda et al. (2010) performed spatial labeling and region segmentation by applying a Gaussian model to the SLAM module of the Robot Operating System (ROS). Mozos and Burgard (2006) proposed a method of classifying metric maps into semantic classes by using AdaBoost as a supervised learning method. Galindo et al. (2008) utilized semantic maps and predefined hierarchical spatial information for robot task planning. Although these approaches were able to ground several predefined names to spaces, the learning of location names through human-robot communication in a bottom-up manner has not been achieved.

Many studies have been conducted on spatial concept formation based on multimodal information observed in individual environments (Hagiwara et al., 2016; Heath et al., 2016; Rangel et al., 2017). Spatial concepts are formed in a bottom-up manner based on multimodal observed information, and allow predictions of different modalities. This makes it possible to estimate the linguistic information representing a space from position and image information in a probabilistic way. Gu et al. (2016) proposed a method of learning relative space categories from ambiguous instructions. Taniguchi et al. (2014, 2016) proposed computational models for a mobile robot to acquire spatial concepts based on information from recognized speech and estimated self-location. Here, the spatial concept was defined as the distributions of names and positions at each place.

<sup>1</sup>GPSR Command Generator: https://github.com/kyordhel/GPSRCmdGen

The method enables a robot to predict a positional distribution from recognized human speech through formed spatial concepts. Ishibushi et al. (2015) proposed a method of learning the spatial regions at each place by stochastically integrating image recognition results and estimated self-positions. In these studies, it was possible to form a spatial concept conforming to human perception such as an entrance and a corridor by inferring the parameters of the model.

However, these studies did not focus on the hierarchical structure of spatial concepts. In particular, because the features of a higher layer, such as the living space, are subsumed in the features of its lower layers, such as the front of the television, it was difficult to form spatial concepts at the abstract layer. Furthermore, the ability to understand and describe a place linguistically in different layers is an important function for robots that provide services through linguistic communication with humans. Despite the importance of the hierarchical structure of spatial concepts, a method that enables such concept formation has not been proposed in previous studies. We propose a method that forms hierarchical spatial concepts in a bottom-up manner from multimodal information and demonstrate the effectiveness of the formed spatial concepts in predicting location names and positions.

# 3. HIERARCHICAL SPACE CONCEPT FORMATION METHOD

### 3.1. Overview

An overview of the proposed method of forming hierarchical spatial concepts is shown in **Figure 2**. First, a robot was controlled manually in an environment based on a map generated by simultaneous localization and mapping

TABLE 1 | Definition of location names with two layers in RoboCup@Home.


(SLAM) (Durrant-Whyte and Bailey, 2006) and acquires multimodal information, i.e., vision, position, and word information, from attached sensors. Vision information is acquired as a feature vector generated by a convolutional neural network (CNN) (Krizhevsky et al., 2012). Position information is acquired as coordinate values in the map estimated by Monte Carlo localization (MCL) (Dellaert et al., 1999). Word information is acquired as a set of words by word recognition; text input is used for word recognition in this study. Second, the acquired vision, position, and word information is represented as histograms, which are utilized as observations in each modality. Third, the formation of hierarchical spatial concepts is performed by applying hierarchical multimodal latent Dirichlet allocation (hMLDA) (Ando et al., 2013) to the observations. The proposed method enables a robot to form hierarchical spatial concepts in a bottom-up manner based on observed multimodal information. Therefore, it is possible to adaptively learn location names and the hierarchical structure of a space, which depend on the environment.

# 3.2. Acquisition and Feature Extraction of Multimodal Information

#### 3.2.1. Vision Information

Vision information was acquired as the object recognition results of a captured image by Caffe (Jia et al., 2014), a CNN framework (Krizhevsky et al., 2012) provided by the Berkeley Vision and Learning Center. The parameters of the CNN were trained using the dataset from the ImageNet Large Scale Visual Recognition Challenge 2013<sup>2</sup>, which comprises 1,000 object classes, e.g., television, cup, and desk. The output of Caffe is given as a probability $p(a_i)$ for an object class $a_i \in \{a_1, a_2, ..., a_I\}$, where $I$ is the number of object classes and was set to 1,000. The probabilities $p(a_i)$ were represented as a 1,000-dimensional histogram of vision information $\mathbf{w}^{(\upsilon)} = (w^{(\upsilon)}_1, w^{(\upsilon)}_2, \cdots, w^{(\upsilon)}_{1000})^T$ by the following equation:

$$w_i^{(\upsilon)} = p(a_i) \times 10^2. \tag{1}$$
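Eq. (1) can be sketched as follows (a minimal example, not the authors' code: the probability vector is a randomly generated stand-in for real Caffe output, and the rounding to integer counts is our assumption about how the scaled values are used as a histogram):

```python
# Sketch of Eq. (1): scaling the 1,000-dimensional class probabilities
# by 10^2 to obtain the vision histogram. A sparse Dirichlet draw
# stands in for a real softmax output.
import numpy as np

probs = np.random.default_rng(1).dirichlet(np.full(1000, 0.01))
assert abs(probs.sum() - 1.0) < 1e-9           # a valid distribution
w_v = np.round(probs * 10**2).astype(int)      # Eq. (1): w_i = p(a_i) * 10^2
print(w_v.shape, int(w_v.sum()))
```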

#### 3.2.2. Position Information

The position information $(x, y)$ in the map generated by SLAM was estimated by MCL (Dellaert et al., 1999). In hMLDA, it is assumed that the observed information is generated from a multinomial distribution. For this reason, observed information with continuous values is generally converted into a finite-dimensional histogram by vector quantization. Ando et al. (2013) replaced the observed information with typical patterns obtained by k-means clustering to form a finite-dimensional histogram. The proposed method converts the position information $(x, y)$ into a finite-dimensional histogram of position information $\mathbf{w}^{(p)}$ by hierarchical k-means clustering. The position information $(x, y)$ was classified hierarchically into 2, 4, 8, 16, 32, and 64 clusters over six layers by applying k-means clustering with $k = 2$ six times. If a position $(x, y)$ was classified into a cluster $c_i \in \{0, 1\}$ at the $i$th layer, the path for the position information is described as $C = \{c_1, c_2, c_3, c_4, c_5, c_6\}$. The path $C$ has the structure of a binary tree with six layers; the number of nodes at the 6th layer is $2^6 = 64$. The position information $(x, y)$ is represented as a 64-dimensional histogram $\mathbf{w}^{(p)} = (w^{(p)}_1, w^{(p)}_2, \cdots, w^{(p)}_{64})^T$ by incrementing $w^{(p)}_i$ based on the path $C$. For example, for a path $C$ of position information $(x, y)$, when $c_1 = 0$, the elements $w^{(p)}_1$ to $w^{(p)}_{32}$ corresponding to nodes at the 6th layer are incremented, and when $c_1 = 1$, $w^{(p)}_{33}$ to $w^{(p)}_{64}$ are incremented. Similarly, in each subsequent layer, the elements of $\mathbf{w}^{(p)}$ corresponding to the 6th-layer nodes below the chosen cluster are incremented.
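The path-to-histogram conversion described above can be sketched as follows (our own sketch; the example path is illustrative, not learned from real positions):

```python
# A binary path C = {c1,...,c6} from hierarchical k-means (k=2 applied
# six times) is converted into a 64-dimensional histogram by
# incrementing, at every layer, all 6th-layer leaves below the chosen
# cluster.
def path_to_histogram(path, depth=6):
    w = [0] * (2 ** depth)       # 64 leaf nodes at the 6th layer
    lo, hi = 0, 2 ** depth       # current range of leaves [lo, hi)
    for c in path:               # descend one layer per path element
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if c == 0 else (mid, hi)
        for i in range(lo, hi):  # increment every leaf under this node
            w[i] += 1
    return w

w_p = path_to_histogram([0, 1, 1, 0, 0, 1])
# 32 + 16 + 8 + 4 + 2 + 1 = 63 increments in total; the leaf on the
# full path is incremented once per layer, i.e., 6 times.
print(sum(w_p), max(w_p))  # → 63 6
```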

#### 3.2.3. Word Information

The voice information uttered by a user is converted manually into text data, which is then used as word information. In section 5, rospeex (Sugiura and Zettsu, 2015) is used to convert human speech into text data. The location names are manually extracted from the text data. The word information is described as a set of location names, i.e., a bag of words (Harris, 1954) with location names as words. The user could give not only one name but also several names to a robot at a given position. The given word information was represented as a histogram of word information $\mathbf{w}^{(w)} = (w^{(w)}_1, w^{(w)}_2, \cdots, w^{(w)}_J)^T$, where the dimension $J$ is the number of location names in a dictionary $S = \{s_1, s_2, \cdots, s_J\}$, which was obtained through the training phase. $w^{(w)}_j$ was incremented when the location name $s_j$ was taught by the user. The histograms of the vision, position, and word information were used as observations in hMLDA.
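The word histogram can be sketched as follows (the dictionary and taught names below are invented examples, not data from the experiments):

```python
# Taught location names are counted into a bag-of-words histogram over
# a dictionary S built in the training phase.
dictionary = ["living room", "kitchen", "front of the TV"]

def word_histogram(taught_names, dictionary):
    w = [0] * len(dictionary)      # J-dimensional histogram
    for name in taught_names:
        if name in dictionary:     # ignore out-of-dictionary names
            w[dictionary.index(name)] += 1
    return w

w_w = word_histogram(["living room", "front of the TV", "living room"],
                     dictionary)
print(w_w)  # → [2, 0, 1]
```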

# 3.3. Hierarchical Categorization by hMLDA

The hierarchical structure of spatial concepts is supported by the nested Chinese restaurant process (nCRP) (Blei et al., 2010) in hMLDA (Ando et al., 2013). nCRP is an extended model of the Chinese restaurant process (CRP) (Aldous, 1985), which is a Dirichlet process used to generate multinomial distributions with infinite dimensions. nCRP stochastically calculates the hierarchical structure based on the idea that there are infinitely many Chinese restaurants, each with an infinite number of tables. **Figure 3** shows an overview of nCRP. A box and a circle represent a restaurant and a customer, respectively. The customer

<sup>2</sup> ILSVRC2013: http://www.image-net.org/challenges/LSVRC/2013/

stochastically decides the restaurant to visit. In the proposed method, a box and a circle represent a spatial concept and a datum, respectively. Data are stochastically allocated to a spatial concept in each layer by the nCRP. In hMLDA, each spatial concept has a probability distribution with parameter $\beta_{l,i}$ to generate data. The proposed method forms a hierarchical spatial concept by hierarchical probabilistic categorization using nCRP. In the non-hierarchical approach, a place called "meeting space" and its partial places called "front of the table" and "front of the TV" are formed in the same layer; the meeting space is therefore learned as a place different from the places called "front of the table" and "front of the TV." The proposed method enables the robot to learn the meeting space as an upper concept encompassing the places called "front of the table" and "front of the TV" as lower concepts.
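The seating rule behind the CRP can be sketched as follows (our own hedged sketch of the standard single-layer CRP; nCRP repeats this choice at every layer of the tree, which is not simulated here):

```python
# Customer n joins existing table k with probability n_k / (n + gamma)
# and opens a new table with probability gamma / (n + gamma).
import random

def crp_assign(counts, gamma, rng):
    n = sum(counts)
    r = rng.random() * (n + gamma)
    for k, n_k in enumerate(counts):
        if r < n_k:
            return k             # join existing table k
        r -= n_k
    return len(counts)           # open a new table

rng = random.Random(0)
counts = []                      # customers per table
for _ in range(10):
    k = crp_assign(counts, gamma=1.0, rng=rng)
    if k == len(counts):
        counts.append(0)
    counts[k] += 1
print(counts)                    # occupancy after seating 10 customers
```

The "rich get richer" property of this rule is what lets frequently visited spatial concepts attract more data while still allowing new concepts to appear.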

The graphical model of hMLDA in the proposed method and the definitions of the variables are shown in **Figure 4** and **Table 2**, respectively. In **Figure 4**, $\mathbf{c}$ is a tree-structured path generated by nCRP with a parameter $\gamma$, and $z$ is a category index for a spatial concept that is generated by a stick-breaking process (Pitman, 2002) with parameters $\alpha$ and $\pi$. $\mathbf{w}^{\upsilon}, \mathbf{w}^{p}, \mathbf{w}^{w}$ are the acquired vision, position, and word information, generated by multinomial distributions with a parameter $\beta^m$ for modality $m$ ($m \in \{\upsilon, p, w\}$). $\beta^m$ is determined according to a Dirichlet prior distribution with a parameter $\eta^m$. $D$ and $L$ written on the plates are the number of acquired data and the number of categories, respectively.

The generation process of the model is described as follows:

$$
\beta\_k^m \sim \text{Dirichlet}(\eta^m) \tag{2}
$$

$$
\mathbf{c}\_d \sim \text{nCRP}(\gamma) \tag{3}
$$

$$
\theta\_d \sim \text{GEM}(\alpha, \pi) \tag{4}
$$

$$
z\_{d,n}^m \sim \text{Multi}(\theta\_d) \tag{5}
$$

$$
w\_{d,n}^m \sim \text{Multi}(\beta^m\_{\mathbf{c}\_d[z\_{d,n}^m]}), \tag{6}
$$

where:

• $w\_{d,n}^m$ is the observed information generated by a multinomial distribution with parameter $\beta^m$ from the category $z\_{d,n}^m$ on the path $\mathbf{c}\_d$.

In this study, z is equivalent to a spatial concept expressed by a location name such as "the living room" or "front of the table."
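The generative process of Formulas (2)-(6) can be sketched as follows. All helper names are illustrative assumptions: `sample_path` stands in for a draw from the nCRP (Formula 3), `theta_d` for a level distribution drawn from GEM(α, π) (Formula 4), and `beta` for per-node categorical distributions drawn from the Dirichlet prior (Formula 2).

```python
import random

def generate_datum(sample_path, theta_d, beta, n_features):
    """Sketch of the hMLDA generative process for one datum:
    draw a tree path, then for each feature draw a level on the
    path and emit a feature value from that node's distribution."""
    c_d = sample_path()  # Formula (3): c_d ~ nCRP(gamma)
    datum = []
    for _ in range(n_features):
        # Formula (5): z ~ Multi(theta_d) chooses a level on the path
        z = random.choices(range(len(theta_d)), weights=theta_d)[0]
        # Formula (6): w ~ Multi(beta[c_d[z]]) emits a feature value
        dist = beta[c_d[z]]
        w = random.choices(range(len(dist)), weights=dist)[0]
        datum.append((z, w))
    return c_d, datum
```

The key structural point is that all features of one datum share the single path `c_d`, while each feature independently picks a depth on that path; this is what ties lower concepts to their upper concepts.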

Model parameter learning was performed with a Gibbs sampler. Parameters were estimated by alternately sampling a path $\mathbf{c}\_d$ for each datum and a category $z\_{d,n}^m$ assigned to the nth feature value of modality m of datum d on that path. The category $z\_{d,n}^m$ was sampled according to the following formula:

$$\begin{split} z\_{d,n}^{m} &\sim p(z\_{d,n}^{m}|\mathbf{z}^{m}\_{-(d,n)}, \mathbf{c}, \mathbf{w}^{m}, \alpha, \pi, \eta^{m}) \\ &\propto p(z\_{d,n}^{m}, \mathbf{z}\_{-(d,n)}^{m}, \mathbf{c}, \mathbf{w}^{m}|\alpha, \pi, \eta^{m}) \\ &\propto p(z\_{d,n}^{m}|\mathbf{z}\_{d,-n}^{m}, \alpha, \pi)\, p(w\_{d,n}^{m}|\mathbf{z}, \mathbf{c}, \mathbf{w}^{m}\_{-(d,n)}, \eta^{m}), \end{split} \tag{7}$$

where −(d, n) means excluding the nth feature value of datum d. $p(z\_{d,n}^{m}|\mathbf{z}\_{d,-n}^{m}, \alpha, \pi)$ is a multinomial distribution generated by the stick-breaking process. The probability that category k is assigned to the nth feature of modality m of the dth datum was calculated by the following formula:

$$\begin{split} p(z\_{d,n}^{m} = k | \mathbf{z}\_{d,-n}^{m}, \alpha, \pi) &= E\left[V\_k \prod\_{j=1}^{k-1} (1 - V\_j) \,\middle|\, \mathbf{z}\_{d,-n}^{m}, \alpha, \pi\right] \\ &= E\left[V\_k | \mathbf{z}\_{d,-n}^{m}, \alpha, \pi\right] \prod\_{j=1}^{k-1} E\left[1 - V\_j | \mathbf{z}\_{d,-n}^{m}, \alpha, \pi\right] \\ &= \frac{(1 - \alpha)\pi + \#[\mathbf{z}\_{d,-n}^{m} = k]}{\pi + \#[\mathbf{z}\_{d,-n}^{m} \ge k]} \prod\_{j=1}^{k-1} \frac{\alpha \pi + \#[\mathbf{z}\_{d,-n}^{m} > j]}{\pi + \#[\mathbf{z}\_{d,-n}^{m} \ge j]}, \end{split} \tag{8}$$

where #[·] is the number of elements satisfying the given condition, and $V\_k$ and $V\_j$ are the stick-breaking proportions that determine how much of the remaining stick is broken off for categories k and j, respectively.
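The expectation in Formula (8) reduces to simple ratios of assignment counts. A minimal sketch, assuming 0-indexed categories and that `counts[j]` holds #[z = j] for the remaining features of the datum; `category_prob` is a hypothetical helper name:

```python
def category_prob(k, counts, alpha, pi):
    """Formula (8): probability that category k is assigned to a
    feature, given the counts of the other features' assignments."""
    def n_ge(j):
        return sum(counts[j:])      # #[z >= j]
    def n_gt(j):
        return sum(counts[j + 1:])  # #[z > j]
    # expected stick length kept at step k ...
    p = ((1.0 - alpha) * pi + counts[k]) / (pi + n_ge(k))
    # ... times the expected stick passed over at steps j < k
    for j in range(k):
        p *= (alpha * pi + n_gt(j)) / (pi + n_ge(j))
    return p
```

Intuitively, categories already holding many assignments keep a larger expected stick proportion, while π controls how strongly the prior pulls the proportions toward (1−α) and α.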

In Formula (7), $p(w\_{d,n}^{m}|\mathbf{z}, \mathbf{c}, \mathbf{w}^{m}\_{-(d,n)}, \eta^{m})$ is the probability that a feature value is generated from the path $\mathbf{c}\_d$ and the category $z\_{d,n}^{m}$. Since the parameters of the multinomial distribution that generates a feature value are assumed to be generated from a Dirichlet prior distribution, the following formula is obtained:

$$p(w\_{d,n}^{m}|\mathbf{z}, \mathbf{c}, \mathbf{w}^{m}\_{-(d,n)}, \eta^{m}) \propto \#[\mathbf{z}\_{-(d,n)}^{m} = z\_{d,n}^{m},\, \mathbf{c}\_{-(d,n)}^{m} = c\_{d, z\_{d,n}^{m}},\, \mathbf{w}\_{-(d,n)}^{m} = w\_{d,n}^{m}] + \eta^{m} \tag{9}$$

This gives the number of times that the category $z\_{d,n}^{m}$ is assigned to the feature value $w\_{d,n}^{m}$ on the path $\mathbf{c}\_d$. A path $\mathbf{c}\_d$ was sampled by the following formula:

$$\begin{split} \mathbf{c}\_{d} &\sim p(\mathbf{c}\_{d}|\mathbf{w}^{\nu}, \mathbf{w}^{p}, \mathbf{w}^{w}, \mathbf{c}\_{-d}, \mathbf{z}, \eta^{\nu}, \eta^{p}, \eta^{w}, \gamma) \\ &\propto p(\mathbf{c}\_{d}|\mathbf{c}\_{-d}, \gamma)\, p(\mathbf{w}^{\nu}\_{d}|\mathbf{c}, \mathbf{w}^{\nu}\_{-d}, \mathbf{z}^{\nu}, \eta^{\nu})\, p(\mathbf{w}^{p}\_{d}|\mathbf{c}, \mathbf{w}^{p}\_{-d}, \mathbf{z}^{p}, \eta^{p})\, p(\mathbf{w}^{w}\_{d}|\mathbf{c}, \mathbf{w}^{w}\_{-d}, \mathbf{z}^{w}, \eta^{w}), \end{split} \tag{10}$$

where $\mathbf{c}\_{-d}$ is the set of paths excluding $\mathbf{c}\_d$. Sampling based on Formulas (7) and (10) was repeated for each training datum $d \in \{d\_1, d\_2, \cdots, d\_D\}$. Through this process, the paths and categories for all observed data converge to $\hat{\mathbf{c}}$ and $\hat{\mathbf{z}}$.

# 3.4. Name Prediction and Position Category Prediction

If vision information $w\_t^{\nu}$ and position information $w\_t^{p}$ are observed at a time t, then the posterior probability of word information $w\_t^{w}$ can be calculated with the estimated parameters $\hat{\mathbf{c}}$ and $\hat{\mathbf{z}}$ by the following formula:

$$\begin{split} &p(w\_{t}^{w}|\hat{\mathbf{z}}, \hat{\mathbf{c}}, \mathbf{w}^{\nu}, \mathbf{w}^{p}, \mathbf{w}^{w}, w\_{t}^{\nu}, w\_{t}^{p}, \alpha, \pi, \eta^{\nu}, \eta^{p}, \eta^{w}) = \\ &\quad \sum\_{z\_{t}} p(w\_{t}^{w}|z\_{t}, \hat{\mathbf{z}}^{w}, \hat{\mathbf{c}}, \mathbf{w}^{w}, \eta^{w})\, p(z\_{t}|\hat{\mathbf{z}}^{\nu}, \hat{\mathbf{z}}^{p}, \hat{\mathbf{c}}, \mathbf{w}^{\nu}, \mathbf{w}^{p}, w\_{t}^{\nu}, w\_{t}^{p}, \alpha, \pi, \eta^{\nu}, \eta^{p}) \end{split} \tag{11}$$

The location name $\hat{n}$ can be predicted as the maximizer of the calculated posterior probability.

If word information $w\_t^{w}$ is obtained at a time t, then a category $z\_t^{w}$ can be predicted by Formula (12), and a position $\hat{p}$ is selected randomly from the dataset $D\_{z\_t^{w}}$, which is the set of position data categorized into $z\_t^{w}$. $D\_{z\_t^{w}}$ was generated automatically by the robot itself as a part of the categorization process.

$$z\_t^{w} \sim p(z\_t^{w}|\mathbf{z}\_{-t}^{w}, w\_t^{w}, \hat{\mathbf{c}}, \mathbf{w}^{\nu}, \mathbf{w}^{p}, \mathbf{w}^{w}, \eta^{\nu}, \eta^{p}, \eta^{w}, \alpha, \pi) \tag{12}$$

# 4. EXPERIMENT

#### 4.1. Purpose

We conducted experiments to verify whether the proposed method can form hierarchical spatial concepts that enable a robot to predict location names and position categories close to the predictions made by humans. In the experiment, (1) the influence of multimodal information, i.e., words, on the formation of a hierarchical spatial concept was evaluated by comparing the space categorization results of the proposed method with those of hierarchical latent Dirichlet allocation (hLDA) (Blei et al., 2010), a hierarchical categorization method with a single modality; and (2) the similarity between the hierarchical spatial concepts formed by the proposed method and those formed by humans was evaluated in terms of predicting location names and position categories.

# 4.2. Experimental Conditions

**Figure 5A** shows the experimental environment, which includes furniture (e.g., tables, chairs, and a bookshelf), used to collect training and test data. **Figure 5B** shows the mobile robot, consisting of a mobile base, a depth sensor, an image sensor, and a computer, used to generate a map and collect multimodal information in the test environment. The height of the camera attached to the robot was 117 cm, chosen in consideration of typical human eye level; this is equivalent to the average height of a 5-year-old boy in Japan. The Navigation Stack package<sup>3</sup> was used with ROS Hydro<sup>4</sup> for mapping, localization, and movement in the experiment. The robot was manually controlled to collect data from the environment, and its orientation was varied across as many different directions as possible.

**Figure 6** shows a map generated in the environment by the robot using SLAM, along with examples of the collected data. Each collected datum consisted of image, position, and word information, as shown in the samples collected at points A, B, and C. In the experiment, 900 data points were used for training and 100 for testing, out of a total of 1,000 data points collected in the area surrounded by the dotted line on the map. The robot acquired images and self-position data (x, y) simultaneously at the times of particle re-sampling for MCL. Words were given as location names by a user who was familiar with the experimental environment; the user gave one or more location names suitable for the place at each data point during training. In example A, not only a name such as "front of the door" but also a name representing a space, "entrance," and a name denoting a room, "laboratory," were given as word information. Word information was supplied for only part of the training data: five training data sets were prepared with naming rates of 1, 2, 5, 10, and 20% to evaluate robustness to the naming rate.

The similarity between the spatial concepts formed by the proposed method and those made by humans was evaluated in experiments of location name prediction and position category prediction based on the ground truth. The ground truth information was given for 100 test data points according to the agreement of three experts who were familiar with the environment. The hierarchy of the space in the experimental environment was defined as global, intermediate, and local.

<sup>3</sup>Navigation Stack: http://wiki.ros.org/navigation

<sup>4</sup>ROS Hydro: http://wiki.ros.org/hydro

TABLE 3 | List of location names and ground truth in the hierarchy.



Location names assigned to each hierarchy are shown in **Table 3**. As the ground truth for name prediction, three location names were uniformly given to each test datum considering the hierarchy to evaluate the accuracy of name prediction. As the ground truth for the position category prediction, regions corresponding to the 15 location names in **Table 3** were decided on the map. **Figure 7** shows the three regions of the "laboratory," "entrance," and "front of the table." The environment was divided into a grid of 50 units in length and 25 units in width, and the gray grids show the ground truth.

In the name prediction experiment, the accuracy of name prediction compared with the ground truth was calculated as an index of similarity. Formula (11) was used to predict names using the proposed method. The accuracy of name prediction at global, intermediate, and local levels was calculated by the following formula.

$$Accuracy = \frac{M\_l}{D},\tag{13}$$

where $M\_l$ is the number of test data whose predicted name matches the ground truth at layer l, and D is the number of test data. In the experiment, l ∈ {global, intermediate, local} and D was 100.

In the position category prediction experiment, the precision, recall, and F-measure of the predicted position categories compared with the ground truth were calculated as an index of similarity. In the proposed method, a position (x, y) was sampled multiple times for each location name by Formula (12). The precision, recall, and F-measure of position category prediction were calculated by the following formulas:

$$Precision = \frac{T\_n}{T\_n + F\_n} \tag{14}$$

$$Recall = \frac{T\_n}{G\_n} \tag{15}$$

$$F\text{-}measure = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}},\tag{16}$$

where $T\_n$ is the number of predicted positions matching the ground truth for location name n, $F\_n$ is the number of predicted positions that do not match the ground truth, and $G\_n$ is the number of grids in the ground truth. In the experiment, n ∈ {1, 2, · · · , 15}.
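Formulas (14)-(16) can be computed directly from the grid cells that the sampled positions fall into. A sketch, assuming (as the formulas state) that the same match count $T\_n$ is used for both precision and recall; `position_category_scores` is a hypothetical helper name:

```python
def position_category_scores(sampled, truth):
    """Precision, recall, and F-measure for one location name
    (Formulas 14-16). `sampled` lists the grid cells of the predicted
    positions; `truth` is the set of ground-truth grid cells."""
    t_n = sum(1 for g in sampled if g in truth)  # T_n: matches
    f_n = len(sampled) - t_n                     # F_n: non-matches
    precision = t_n / (t_n + f_n) if sampled else 0.0
    recall = t_n / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = 2.0 * recall * precision / (recall + precision)
    return precision, recall, f
```

Note that with 100 samples and ground-truth regions of 100 grids or more, recall is bounded above even by a perfect predictor, which matches the authors' observation that recall lags precision.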

In the proposed method, the hyper-parameters were set as α = 0.5, π = 100, γ = 1.0, $\eta^{\nu} = 1.0 \times 10^{-1}$, $\eta^{p} = 1.0 \times 10^{-3}$, and $\eta^{w} = 1.0 \times 10^{-2}$. The path c and category z of each datum were trained with these hyper-parameters. In the experiment, the dimensions of the information vectors $w^{\nu}$, $w^{p}$, and $w^{w}$ were 1,000, 64, and 15, respectively.

#### 4.3. Baseline Methods

The most frequent class, the nearest neighbor method, the multimodal hierarchical Dirichlet process (HDP), and the spatial concept formation model were used as baseline methods for evaluating the performance of the proposed method in the name prediction and position category prediction experiments. In the position category prediction experiment, the position for each location name was sampled 100 times.

#### 4.3.1. Most Frequent Class

The training dataset $D = \{d\_1, d\_2, \cdots, d\_I\}$ is used in this method. Each datum $d\_i$ consists of position information $p\_i = (x\_i, y\_i)$ and word information $w\_i$, which is a set of location names. The frequency $cnt\_{n\_j}$ of each location name $n\_j$ (j ∈ {1, 2, · · · , 15}) is counted in the training dataset D. The location names are classified into three clusters by k-means (k = 3) based on $cnt\_{n\_j}$. The three clusters, $C\_{global}$, $C\_{intermediate}$, and $C\_{local}$ in descending order of frequency, reflect the assumption that global location names occur more frequently than local ones. If a datum $d\_i$ includes a location name in $C\_{global}$, $C\_{intermediate}$, or $C\_{local}$, the datum is added to the global dataset $D^{g}$, the intermediate dataset $D^{i}$, or the local dataset $D^{l}$, respectively. The location name at the global, intermediate, or local level is predicted as the most frequent location name in the corresponding dataset $D^{g}$, $D^{i}$, or $D^{l}$.

In the position category prediction, positions are predicted by sampling position information $\hat{p}$ randomly from the datasets $D^{g,f}$, $D^{i,f}$, and $D^{l,f}$, which contain the data with the most frequent location name in $D^{g}$, $D^{i}$, and $D^{l}$, respectively. The position for each location name was sampled 100 times.
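This baseline can be sketched as follows. A library k-means (e.g., scikit-learn's `KMeans`) would normally be used; here a tiny 1-D k-means on name frequencies keeps the sketch self-contained. All function names are illustrative, not from the paper.

```python
from collections import Counter

def most_frequent_class(data, names):
    """Most-frequent-class baseline sketch. `data` is a list of
    (position, name_list) pairs. Names are split into three clusters
    by 1-D k-means (k=3) on their frequencies, assuming global names
    occur more often than local ones."""
    freq = Counter(n for _, ws in data for n in ws if n in names)
    # 1-D k-means on frequencies, initialized at min / median / max
    values = sorted(freq[n] for n in names)
    centers = [values[0], values[len(values) // 2], values[-1]]
    for _ in range(20):
        clusters = [[] for _ in range(3)]
        for n in names:
            j = min(range(3), key=lambda c: abs(freq[n] - centers[c]))
            clusters[j].append(n)
        centers = [sum(freq[n] for n in cl) / len(cl) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    # ascending-frequency order: cluster 0 = local, 2 = global
    local_c, mid_c, global_c = clusters
    def predict(cluster):
        # most frequent cluster name among data containing a cluster name
        ds = [ws for _, ws in data if set(ws) & set(cluster)]
        return Counter(n for ws in ds for n in ws
                       if n in cluster).most_common(1)[0][0]
    return {"global": predict(global_c),
            "intermediate": predict(mid_c),
            "local": predict(local_c)}
```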

#### 4.3.2. Nearest Neighbor (Position and Word)

The nearest neighbor method (Friedman et al., 1977) discriminates data based on Euclidean distance. A datum $d\_i$ involves position information $p\_i = (x\_i, y\_i)$ and word information $w\_i$, the set of location names obtained at position $p\_i$ during training. For example, $w\_i$ at data point B in **Figure 6** contains the following location names: "Meeting space," "Book shelf zone," and "Around the electric piano." If position information $p\_t$ is observed, then word information $\hat{w}\_t$ is calculated with the training dataset $D = \{(p\_1, w\_1), (p\_2, w\_2), \cdots, (p\_I, w\_I)\}$ by the following formulas:

$$k = \underset{1 \le i \le I}{\text{arg min}} \|p\_t - p\_i\| \tag{17}$$

$$
\hat{w}\_t = w\_k \tag{18}
$$

The location name $\hat{n}$ can be predicted by randomly selecting a location name from the location names in $\hat{w}\_t$ of the nearest data point.

If word information $w\_t$ is observed, then position information $\hat{p}\_t$ is randomly selected from the dataset $D\_{n\_t}$, which is the set of data $d\_i = (p\_i, w\_i)$ satisfying $w\_t \in w\_i$. The position for each location name was sampled 100 times.
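Both directions of this baseline can be sketched in a few lines. The helper names are hypothetical, positions are 2-D tuples, and word sets are lists of name strings.

```python
import math
import random

def nn_predict_name(train, p_t):
    """Nearest-neighbor name prediction (Formulas 17-18): find the
    training datum whose position is closest to the query position
    and return a random name from its word set."""
    _, w_k = min(train, key=lambda d: math.dist(d[0], p_t))
    return random.choice(list(w_k))

def nn_predict_position(train, w_t):
    """Position prediction: sample a position uniformly from the
    training data whose word sets contain the queried location name."""
    candidates = [p for p, ws in train if w_t in ws]
    return random.choice(candidates)
```

`train` here is the dataset D = {(p_1, w_1), ..., (p_I, w_I)} from the text; in the experiments, `nn_predict_position` would be called 100 times per location name.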

#### 4.3.3. Nearest Neighbor (Vision, Position and Word)

This method is used only in the name prediction experiment. A datum $d\_i$ includes vision information $\nu\_i$, position information $p\_i = (x\_i, y\_i)$, and word information $w\_i$. $\nu\_i$ is the value calculated by Formula (1) at position $p\_i$ during training, and $w\_i$ is the set of location names obtained at position $p\_i$ during training. If the vision information $\nu\_t$ and the position information $p\_t$ are observed, then the word information $\hat{w}\_t$ can be calculated with the training dataset $D = \{(\nu\_1, p\_1, w\_1), (\nu\_2, p\_2, w\_2), \cdots, (\nu\_I, p\_I, w\_I)\}$ by the following formulas:

$$k = \underset{1 \le i \le I}{\text{arg min}} \left( \alpha \|\nu\_t - \nu\_i\| + (1 - \alpha)\|p\_t - p\_i\| \right) \tag{19}$$

$$
\hat{w}\_t = w\_k \tag{20}
$$

where α is the weight coefficient between the vision and position information; α was empirically set to 0.3 using a validation dataset. The location name $\hat{n}$ can be predicted by randomly selecting a location name from the location names in $\hat{w}\_t$ of the nearest data point.

#### 4.3.4. Multimodal HDP

Multimodal HDP (Nakamura et al., 2011) extends HDP (Teh et al., 2005), a method of categorizing observed data based on a Bayesian generative model, to multiple modalities by replacing the topic distribution of latent Dirichlet allocation (LDA) with an HDP. The graphical model and the definitions of the variables in the multimodal HDP are shown in the Supplementary Material. Here, multimodal HDP was trained using vision, position, and word information. If vision information $w\_t^{\nu}$ and position information $w\_t^{p}$ are observed at a time t, then the posterior probability of word information $w\_t^{w}$ can be calculated by the following formula:

$$\begin{split} &p(w\_{t}^{w}|\hat{\mathbf{z}}, \mathbf{w}^{\nu}, \mathbf{w}^{p}, \mathbf{w}^{w}, w\_{t}^{\nu}, w\_{t}^{p}, \pi, \eta^{\nu}, \eta^{p}, \eta^{w}) = \\ &\quad \sum\_{z\_{t}} p(w\_{t}^{w}|z\_{t}, \hat{\mathbf{z}}^{w}, \mathbf{w}^{w}, \eta^{w})\, p(z\_{t}|\hat{\mathbf{z}}^{\nu}, \hat{\mathbf{z}}^{p}, \mathbf{w}^{\nu}, \mathbf{w}^{p}, w\_{t}^{\nu}, w\_{t}^{p}, \pi, \eta^{\nu}, \eta^{p}) \end{split} \tag{21}$$

The location name $\hat{n}$ can be predicted as the maximizer of the calculated posterior probability.

If word information $w\_t^{w}$ is obtained at a time t, then a category $z\_t^{w}$ can be predicted by Formula (22), and position information $\hat{p}$ is selected randomly from the dataset $D\_{z\_t^{w}}$, which is the set of position data categorized into $z\_t^{w}$.

$$z\_t^{w} \sim p(z\_t^{w}|\mathbf{z}\_{-t}^{w}, w\_t^{w}, \mathbf{w}^{\nu}, \mathbf{w}^{p}, \mathbf{w}^{w}, \eta^{\nu}, \eta^{p}, \eta^{w}, \pi) \tag{22}$$

The position for each location name was sampled 100 times. In the multimodal HDP, the hyper-parameters were set as π = 50, $\eta^{\nu} = 5.0 \times 10^{-1}$, $\eta^{p} = 1.0 \times 10^{-1}$, and $\eta^{w} = 1.0 \times 10^{-1}$ using the validation dataset. The category z of each datum is trained with these hyper-parameters.

#### 4.3.5. Spatial Concept Formation

Spatial concept formation (SpCoFo)<sup>5</sup> is a model that integrates name modalities into the spatial region learning model (Ishibushi et al., 2015). The model forms concepts from multimodal information and predicts unobserved information. The graphical model and the definitions of the variables in the spatial concept formation model are shown in the Supplementary Material. The posterior probability of the word information $w\_t^{n}$ after obtaining the vision information $w\_t^{\nu}$ and the position information $p\_t$ was calculated by the following formula:

$$\begin{split} p(w\_t^n|p\_t, w\_t^{\nu}) &= \sum\_{z\_t} p(w\_t^n|z\_t)\, p(z\_t|p\_t, w\_t^{\nu}) \\ &= \sum\_{z\_t} p(w\_t^n|\beta\_{z\_t}^n)\, p(p\_t|\mu\_{z\_t}, \Sigma\_{z\_t})\, p(w\_t^{\nu}|\beta\_{z\_t}^{\nu}) \end{split} \tag{23}$$

The location name nˆ can be predicted by the maximum value of the calculated posterior probability.

The prediction of the position $\hat{p}\_t$ after obtaining the word information $w\_t^{n}$ was calculated by estimating a category $z\_t$ and sampling position information $\hat{p}\_t$ using the following formulas:

$$z\_t = \underset{z\_t}{\text{arg}\max} \, p(z\_t|w\_t^n)$$

$$\hat{p}\_t \sim p(p\_t|\mu\_{z\_t}, \Sigma\_{z\_t}) \tag{24}$$

The position for each location name was sampled 100 times. In the spatial concept formation model, the hyper-parameters were set as π = 50, $\eta^{\nu} = 5.0 \times 10^{-1}$, $\eta^{w} = 1.0 \times 10^{-1}$, $\mu\_0 = (x\_{center}, y\_{center})$, $\kappa\_0 = 3.0 \times 10^{-2}$, $\psi\_0 = \text{diag}[0.05, 0.05, 0.05, 0.05]$, and $\nu\_0 = 15$ using the validation dataset. $(x\_{center}, y\_{center})$ indicates the center of the map. The category z of each datum is trained with these hyper-parameters.

# 4.4. Experimental Results

#### 4.4.1. Hierarchical Space Categorization

**Figure 8** shows some categories formed by the proposed method. The categorized training data in each category are shown by positions, images, and the three most probable words. Each category corresponds to a formed spatial concept and was classified into an appropriate layer in the hierarchy of spatial concepts. One, four, and 28 categories were classified into the 1st, 2nd, and 3rd layers, respectively. The number of categories in each layer was determined by the nCRP based on the model parameter γ, which controls the probability that a datum is allocated to a new category.

The 1st layer included only category 1, to which all 900 data were allocated. The high-probability word of category 1 was "laboratory," which referred to the entire experimental environment. Since category 1 contains all the location names, the probabilities for individual location names become relatively low. Nonetheless, the proposed method was able to learn "laboratory," which was attached to only about 10% of the training data, with high probability compared with the second candidate. In the 2nd layer, 343 data in the vicinity of the entrance of the experimental environment were allocated to category 4, whose location name with the greatest probability was "entrance." The 389 data in the region deeper than the entrance were categorized into category 5, in which "meeting space" had the greatest probability. In the 3rd layer, the data categorized into categories 4 and 5 in the 2nd layer were further, more finely categorized. In categories 26 and 16, which were formed under category 4, "front of the door" and "front of the chair storage" had the greatest probabilities, respectively; 53 and 81 data were allocated to categories 26 and 16, respectively. Position and image data corresponding to "front of the door" and "front of the chair storage" were finely allocated. These results demonstrated that the proposed method can form not only categories in a lower layer, such as "front of the chair storage" and "front of the door," but also categories at higher layers, such as "entrance" and "laboratory," and can represent their inclusion relations as a hierarchical structure.

<sup>5</sup> Spatial Concept Formation: https://github.com/EmergentSystemLabStudent/ Spatial\_Concept\_Formation

TABLE 4 | Mutual information for categorization of location names when changing the number of layers in hLDA with word information and the proposed method with vision, position, and word information.


*Mutual information was calculated by Formula 25. Underlined and bold values mean the maximum value in the experimental parameter.*

#### 4.4.2. Evaluation of Categorization

To evaluate the effectiveness of multimodal information on hierarchical space categorization, we compared the categorization results of the proposed method with those of hLDA, a hierarchical categorization method with a single modality, i.e., based only on word information. Although the number of layers in the ground truth in this experiment is 3, a robot cannot know the number of layers of the spatial concepts in advance. Therefore, in both the proposed method and hLDA, categorization was performed with the number of layers varied from 2 to 5. The accuracy of space categorization was evaluated by calculating the mutual information between the ground truth, which consisted of the location names given by humans, and the estimated name, which was the best item in the word probability of the category allocated by the proposed method or by hLDA. The mutual information I(E;G) between the estimated names E and the ground truth G in layers i and j was calculated by the following formula:

$$I(E;G) = \sum\_{\mathcal{g}\_j \in G} \sum\_{e\_i \in E} p(e\_i, \mathcal{g}\_j) \log \frac{p(e\_i, \mathcal{g}\_j)}{p(e\_i)p(\mathcal{g}\_j)}.\tag{25}$$

Higher mutual information indicates a stronger dependency between $e\_i$ and $g\_j$. Mutual information makes it possible to evaluate the accuracy of categorization even when the numbers of layers in the ground truth and the estimation result differ. **Table 4** shows the mutual information for the categorization results of hLDA with word information and the proposed method with vision, position, and word information on the training data set. The effectiveness of multimodal information in space categorization was clarified, since the proposed method had a higher level of mutual information in all layers. In addition, mutual information was maximized when using the same number of layers as in the ground truth. In the subsequent evaluations, the number of layers of the proposed method is set to 3.
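Formula (25) can be computed from the empirical joint distribution of estimated and ground-truth names over the same data points; `mutual_information` is a hypothetical helper name, and the sum runs only over observed (e, g) pairs since unobserved pairs contribute zero.

```python
import math
from collections import Counter

def mutual_information(estimated, truth):
    """Mutual information I(E;G) (Formula 25) between estimated
    names and ground-truth names over the same data points,
    using empirical probabilities and natural logarithms."""
    n = len(estimated)
    count_e = Counter(estimated)
    count_g = Counter(truth)
    count_eg = Counter(zip(estimated, truth))
    mi = 0.0
    for (e, g), c in count_eg.items():
        joint = c / n  # p(e, g)
        # joint / (p(e) * p(g)) == c * n / (count_e * count_g)
        mi += joint * math.log(c * n / (count_e[e] * count_g[g]))
    return mi
```

Identical labelings give I(E;G) equal to the entropy of the labels, and independent labelings give 0, so the score rewards consistent (not necessarily identically named) categorizations.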

#### 4.4.3. Evaluation of Name Prediction and Position Category Prediction

We conducted experiments to verify whether or not the proposed method could form hierarchical spatial concepts that enable a robot to predict location names and position categories similar to predictions made by humans. In the experiment, (1) the influence of multimodal information on the formation of a hierarchical spatial concept was evaluated by comparing the space-categorization results obtained using the proposed method and hLDA, a hierarchical categorization method with a single modality; (2) the similarity between the hierarchical spatial concepts formed by the proposed method and those of humans was evaluated in predicting location names and position categories. The evaluation experiments were performed by cross-validation with three data sets, each consisting of 900 training data and 100 test data with ground truth. The experimental results are indicated by the mean and standard deviation over the three data sets.

To verify whether the proposed method can form hierarchical spatial concepts, we evaluated the accuracy of name prediction and position category prediction using the formed spatial concepts. In the evaluation of name prediction, vision, position, and word information were given to the robot at the training data points. At the test data points, only vision and position information were given; the robot therefore had to predict the unobserved word information from the observed vision and position information. **Table 5** shows the accuracy of name prediction using the baseline methods, the proposed method, and humans. The most frequent class, nearest neighbor (position and word), nearest neighbor (vision, position, and word), multimodal HDP, and the spatial concept formation model were used as the baseline methods. The accuracy of name prediction was calculated by Formula (13) at the global, intermediate, and local layers of the ground truth. The proposed method and humans predicted location names in three layers. The results of humans consisted of the average accuracy of three subjects familiar with the experimental environment.

Compared with the accuracy obtained using the baseline methods, higher accuracies were obtained by the proposed method in the 1st, 2nd, and 3rd layers. It was assumed that weak features buried in the lower layer in the baseline methods were categorized as features of the higher layer in the proposed method. The proposed method enabled a robot to predict location names close to predictions made by humans by selecting the appropriate layer depending on the situation.

**Table 6** shows the evaluation results of position category prediction using the baseline methods, the proposed method, and humans. In the evaluation, the most frequent class, nearest neighbor (position and word), multimodal HDP, and the spatial concept formation model were used as the baseline methods. The position category prediction was evaluated in terms of precision, recall, and F-measure, which were calculated by Formulas (14)-(16).

Compared with results obtained by the baseline methods, higher values of precision and recall were obtained by the proposed method in the global and intermediate layers. In the local layer, higher values of precision and recall were obtained by the nearest neighbor method and the spatial concept formation model (SpCoFo), respectively. However, in the F-measure, which is the harmonic mean of precision and recall, the proposed method had the largest values in the global, intermediate, and local layers. The reason the recall and F-measure values were lower than the precision is that only 100 data points were predicted and plotted for regions with 100 grids or more, as shown in **Figure 7**. For the F-measure results, independent t-tests were performed on nine samples consisting of three data sets with three types of ground truth: global, intermediate, and local. For the proposed method, the p-values against the most frequent class, nearest neighbor, multimodal HDP, and SpCoFo were 0.00012, 0.00004, 0.00003, and 0.00051, respectively, and significant differences were observed (p < 0.05). The results of humans were not perfect because some errors were found at the boundaries of places. For example, the boundary between "Book shelf zone" and "front of the table" and the edge of the region called "front of the door" differed depending on the human. The center of each place is consistent, but its region is ambiguous even among humans. The experimental results show that the proposed method enabled a robot to predict position categories closer to predictions made by humans than was possible using the baseline methods.

TABLE 5 | Accuracy of name prediction using the baseline methods, the proposed method, and those made by humans; the accuracy was calculated by using Formula (13).

*The accuracy is indicated by the mean and standard deviation (s.d.). Underlined and bold values mean the maximum value in the experimental parameter.*

In the experiments on location name and position category prediction, the proposed method showed higher performance than the baseline methods. In the baseline methods, i.e., multimodal HDP and SpCoFo, the feature space is classified uniformly, so the location concepts are formed non-hierarchically. For example, an upper concept, e.g., "meeting space," is embedded in lower concepts, e.g., "front of the table" and "front of the display." Therefore, the place called "meeting space" is learned as a place different from the places called "front of the table" and "front of the display." Since the proposed method forms concepts by extracting the similarity of knowledge in the upper concept, it can form an upper concept without interfering with the formation of the lower concepts. For this reason, the proposed method achieved high performance in the experiments on name and position category prediction in the global, intermediate, and local layers.

In human-robot interactions in home environments, location names are given as word information for only a part of the training data from a user. We therefore evaluated the robustness of the proposed method with respect to the naming rate, to verify how name and position category prediction performance changes as the naming rate decreases. In this experiment, spatial concepts were formed with the proposed method using training data whose naming rate was changed successively to 1, 2, 5, 10, and 20%. A naming rate of 1 or 20% means that 9 or 180 of the 900 training data contained location names, while the remaining data contained none. **Table 7** shows the accuracy of name prediction and the F-measure of position category prediction for each naming rate. The results confirmed that learning progresses earlier in the global layer than in the intermediate and local layers. The overall prediction ability did not decrease greatly with the decreased naming rate, but degraded gradually from the lower layers. In this experiment, we performed spatial concept formation without prior knowledge in only one environment; however, learning efficiency could be increased by using model parameters estimated in other environments as prior probabilities. Such transfer learning of spatial concepts will be addressed in future work.
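The naming-rate setup described above can be sketched as follows; `apply_naming_rate` is a hypothetical helper (not from the paper) that selects which training samples keep their location name:

```python
import random

def apply_naming_rate(n_samples, rate, seed=0):
    """Return the indices of training samples that keep their location
    name, so that roughly `rate` of the data is labeled, e.g., a 20%
    naming rate on 900 samples leaves 180 labeled samples."""
    rng = random.Random(seed)
    n_named = round(n_samples * rate)
    return set(rng.sample(range(n_samples), n_named))

# 20% naming rate over the 900 training data used in the experiment.
named = apply_naming_rate(900, 0.20)
```

Samples outside `named` would have their word observation dropped before spatial concept formation, which is how the 1-20% conditions in **Table 7** could be generated.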

# 5. APPLICATION EXAMPLES FOR HUMAN SUPPORT ROBOTS

Application examples of the hierarchical spatial concept using the proposed method are demonstrated in this section. We implemented the proposed method for the Toyota human support robot (HSR)<sup>6</sup> and created application examples in which the robot moves based on human linguistic instructions and describes its self-position linguistically in an experimental field assuming a home environment.

The home environment and the robot used are shown in **Figure 9**. There were two tables, A and B, as shown in **Figure 9A**. In the environment, we verified whether the robot could move based on linguistic instructions that include the hierarchical structure of spaces, such as "front of the table in the living room" and "front of the table in the dining room." In **Figure 9B**, an

<sup>6</sup>Toyota Global Site—Partner Robot Family: http://www.toyota-global.com/innovation/partner\_robot/family\_2.html

TABLE 6 | Precision, recall, and F-measure evaluation of position category prediction using the baseline methods, the proposed method, and those made by humans in global, intermediate, and local; the precision, recall, and F-measure were calculated by using Formula (14).


*In the experiment, the modalities of the nearest neighbor were position and word. The results are indicated as mean (s.d.). Underlined and bold values indicate the maximum value for each experimental condition.*

TABLE 7 | Robustness evaluation of the proposed method with respect to naming rate: accuracy in name prediction indicates the maximum value of the three layers.


RGB-D sensor and a laser range sensor were used to capture images and to estimate self-position, respectively. The packages<sup>7</sup> hector\_slam and omni\_base were used with ROS Indigo<sup>8</sup> for mapping, localization, and movement, to navigate the robot to the predicted position.

The robot collected 715 training data consisting of images, positions, and word information, and formed a hierarchical spatial concept using the proposed method. Location names were given to 20% of the total training data. Rospeex (Sugiura and

<sup>7</sup>hector\_slam: http://wiki.ros.org/hector\_slam

Zettsu, 2015) was used to recognize human speech instructions and convert them into text information. In the experiment, the dimensions of the information vectors w<sup>v</sup>, w<sup>p</sup>, and w<sup>w</sup> were 1,000, 64, and 16, respectively.

The two places predicted by Formula (12) from the speech instructions "go to the front of the table in the living room" and "go to the front of the table in the dining room" are shown in **Figures 10A,B**, respectively. The predicted position categories, indicated by red dots, show that the "front of the table in the living room" and the "front of the table in the dining room" were recognized as different places using the spatial concept in the higher layer.

<sup>8</sup>ROS Indigo: http://wiki.ros.org/indigo

FIGURE 10 | Position category prediction using a hierarchical structure based on linguistic instructions from the user. (A) Positions for the front of the table in the living room. (B) Positions for the front of the table in the dining room.

FIGURE 12 | Linguistic description of self-position based on communication between the user and the robot using the hierarchical spatial concept.

FIGURE 11 | Movement based on speech instructions from the user through the hierarchical spatial concept.

**Figure 11** shows how the robot moved based on human speech instructions in the experiment. The robot recognized human speech instructions using rospeex and predicted position categories with Formula (12) using the hierarchical spatial concept. It moved to the instructed place by sampling randomly from the predicted positions. **Figure 12** shows an application example in which the robot described its self-position linguistically. The robot observed its self-position and an image, and predicted the name of its self-position by calculating Formula (11) using the hierarchical spatial concept. As shown on the left side of **Figure 12**, the proposed method enabled the robot to describe its self-position linguistically at different layers. We demonstrated application examples using the formed hierarchical spatial concept in a service scene in a home environment. A movie of the demonstration and the training dataset can be found at the URL<sup>9</sup>.
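The goal-selection step, sampling one navigation goal uniformly from the positions predicted by Formula (12), can be sketched as follows (the candidate map coordinates are hypothetical):

```python
import random

def sample_goal(predicted_positions, seed=None):
    """Pick one navigation goal uniformly at random from the positions
    predicted for the instructed place (the red dots in Figure 10)."""
    rng = random.Random(seed)
    return rng.choice(predicted_positions)

# Hypothetical predicted positions (x, y) in map coordinates.
candidates = [(1.2, 0.5), (1.3, 0.6), (1.1, 0.4)]
goal = sample_goal(candidates, seed=1)
```

The sampled `goal` would then be passed to the navigation stack (e.g., omni\_base) as the target pose.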

# 6. CONCLUSIONS

We assumed that a computational model that considers the hierarchical structure of space enables robots to predict the name and position of a space close to the corresponding prediction by humans. Based on this assumption, we proposed a hierarchical spatial concept formation method based on a Bayesian generative model with multimodal information, i.e., vision, position, and word information, and developed a robot that can predict unobserved location names and position categories from observed information using the formed hierarchical spatial concept. We conducted experiments in which a robot formed a hierarchical spatial concept, and evaluated its ability in name and position category prediction.

The experimental results for name and position category prediction demonstrated that, relative to baseline methods, the proposed method enabled the robot to predict location names and position categories closer to predictions made by

<sup>9</sup>Multimedia - emlab page: https://emlab.jimdo.com/multimedia/

humans. Application examples using the hierarchical spatial concept in a home environment demonstrated that a robot could move to an instructed place based on human speech instructions and describe its self-position linguistically through the formed hierarchical spatial concept. The experimental results and application examples demonstrated that the proposed method enabled the robot to form spatial concepts in abstract layers and use the concepts for human-robot communication in a home environment. This study showed that the name and position of a location can be predicted, even in a home, using generalized spatial concepts. Furthermore, by conducting additional learning in each house, a spatial concept adapted to that environment can be formed.

The limitations of this study are as follows. For feature extraction from position information, the hierarchical k-means method was used to convert the position information (x, y) into a position histogram. In the experiment, 389 and 511 data were allocated to the two clusters at the top layer c1. In the bottom layer c6, the mean and standard deviation of the number of data allocated to each of the 64 clusters were 14.1 and 12.2, respectively, so there is some bias between the clusters. Hierarchical k-means makes it possible to convert position information into a position histogram that includes hierarchical spatial features. However, nearby data points at a classification boundary, which are classified into different clusters at a high level, are regarded as very different. We are considering a method that reduces this bias while maintaining the hierarchical features of space. As for the number of location names, in the experiments in sections 4 and 5, the numbers of location names were 15 and 16, respectively. The number of location names increases with the number of teachings and users. If the robot learns location names from several users over a long term, an algorithm that removes location names with a low probability of observation will be needed to improve learning efficiency.
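A minimal sketch of the hierarchical k-means conversion discussed above, assuming a binary split at each level so that depth 6 yields the 64 bottom-layer clusters; this is an illustrative reimplementation, not the authors' code:

```python
import random

def two_means(points, iters=20, seed=0):
    """Plain Lloyd's k-means with k = 2; returns two lists of indices."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for i, (x, y) in enumerate(points):
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            groups[d.index(min(d))].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split: stop early
        centers = [
            (sum(points[i][0] for i in g) / len(g),
             sum(points[i][1] for i in g) / len(g))
            for g in groups
        ]
    return groups

def hierarchical_leaf_ids(points, depth):
    """Assign each (x, y) point a leaf id in a binary k-means tree of
    the given depth (depth 6 -> 64 leaves, mirroring layers c1..c6)."""
    ids = [0] * len(points)
    def split(indices, level, prefix):
        if level == depth or len(indices) < 2:
            for i in indices:
                ids[i] = prefix * (2 ** (depth - level))
            return
        g0, g1 = two_means([points[i] for i in indices])
        split([indices[i] for i in g0], level + 1, prefix * 2)
        split([indices[i] for i in g1], level + 1, prefix * 2 + 1)
    split(list(range(len(points))), 0, 0)
    return ids

# Two well-separated blobs; one split level yields two leaf ids.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
ids = hierarchical_leaf_ids(points, depth=1)
```

Counting leaf ids over a trajectory of positions yields the position histogram; the boundary problem raised above corresponds to two nearby points falling into different subtrees at a shallow level.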



As future work, we will generalize the spatial concepts for various environments and perform transfer learning of spatial concepts using the generalized spatial concepts as prior knowledge.

# AUTHOR CONTRIBUTIONS

YH designed the study and wrote the initial draft of the manuscript. HK and MI contributed to the analysis and interpretation of data, and assisted in the preparation of the manuscript. TT contributed to data collection and interpretation, and critically reviewed the manuscript. All authors approved the final version of the manuscript and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

# FUNDING

This work was supported by MEXT/JSPS KAKENHI Grant Numbers JP17H06383 in #4903 (Evolinguistics), JP16K16133, and JPMJCR15E3.

# ACKNOWLEDGMENTS

We would like to thank Dr. Takayuki Nagai and Dr. Tomoaki Nakamura for sharing their source code with us.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnbot.2018.00011/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Hagiwara, Inoue, Kobayashi and Taniguchi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multimodal Hierarchical Dirichlet Process-Based Active Perception by a Robot

#### Tadahiro Taniguchi <sup>1</sup> \*, Ryo Yoshino<sup>1</sup> and Toshiaki Takano<sup>2</sup>

*<sup>1</sup> Emergent Systems Laboratory, College of Information Science and Engineering, Ritsumeikan University, Kusatsu, Japan, <sup>2</sup> Adaptive Systems Laboratory, Department of Computer Science, Shizuoka Institute of Science and Technology, Fukuroi, Japan*

In this paper, we propose an active perception method for recognizing object categories based on the multimodal hierarchical Dirichlet process (MHDP). The MHDP enables a robot to form object categories using multimodal information, e.g., visual, auditory, and haptic information, which can be observed by performing actions on an object. However, performing many actions on a target object requires a long time. In a real-time scenario, i.e., when the time is limited, the robot has to determine the set of actions that is most effective for recognizing a target object. We propose an active perception method for the MHDP that uses the information gain (IG) maximization criterion and the lazy greedy algorithm. We show that the IG maximization criterion is optimal in the sense that it is equivalent to minimizing the expected Kullback–Leibler divergence between the final recognition state and the recognition state after the next set of actions. However, a straightforward calculation of IG is practically impossible. Therefore, we derive a Monte Carlo approximation method for IG by making use of a property of the MHDP. We also show that the IG has submodular and non-decreasing properties as a set function because of the structure of the graphical model of the MHDP. Therefore, the IG maximization problem reduces to a submodular maximization problem. This means that greedy and lazy greedy algorithms are effective and have a theoretical justification for their performance. We conducted an experiment using an upper-torso humanoid robot and a second one using synthetic data. The experimental results show that the method enables the robot to select a set of actions that allow it to recognize target objects quickly and accurately. The numerical experiment using the synthetic data shows that the proposed method works appropriately even when the number of actions is large and the set of target objects involves objects categorized into multiple classes. The results support our theoretical outcomes.

**Edited by:** *Minoru Asada, Osaka University, Japan*

**Reviewed by:** *Shingo Murata, National Institute of Informatics, Japan; J. Michael Herrmann, University of Edinburgh, United Kingdom; Jivko Sinapov, Tufts University, United States*

**\*Correspondence:** *Tadahiro Taniguchi, taniguchi@ci.ritsumei.ac.jp*

**Received:** *24 August 2017* · **Accepted:** *23 April 2018* · **Published:** *22 May 2018*

**Citation:** *Taniguchi T, Yoshino R and Takano T (2018) Multimodal Hierarchical Dirichlet Process-Based Active Perception by a Robot. Front. Neurorobot. 12:22. doi: 10.3389/fnbot.2018.00022*

Keywords: active perception, cognitive robotics, topic model, multimodal machine learning, submodular maximization

# 1. INTRODUCTION

Active perception is a fundamental component of our cognitive skills. Human infants autonomously and spontaneously perform actions on an object to determine its nature. The sensory information that we can obtain usually depends on the actions performed on the target object. For example, when people find a gift box placed in front of them, they cannot perceive its weight without holding the box, and they cannot determine its sound without hitting or shaking it. In other words, we can obtain sensory information about an object by selecting and executing actions to manipulate it. Adequate action selection is important for recognizing objects quickly and accurately. This example about a human also holds for a robot. An autonomous robot that moves and helps people in a living environment should also select adequate actions to recognize target objects. For example, when a person asks an autonomous robot to bring an empty plastic bottle, the robot has to examine many objects by applying several actions (**Figure 1**). This type of information is important, because our object categories are formed on the basis of multimodal information, i.e., not only visual information is used, but also auditory, haptic, and other information. Therefore, a computational model of the active perception should be consistently based on a computational model for multimodal object categorization and recognition.

In spite of the wide range of studies about active perception (e.g., Borotschnig et al., 2000; Dutta Roy et al., 2004; Eidenberger and Scharinger, 2010; Krainin et al., 2011; Ferreira et al., 2013) and multimodal categorization for robots (e.g., Nakamura et al., 2007, 2011a; Sinapov and Stoytchev, 2011; Celikkanat et al., 2014; Sinapov et al., 2014), active perception methods for a robot, i.e., action selection methods for perception for unsupervised multimodal categorization, have not been sufficiently explored (see section 2).

This paper considers the active perception problem for unsupervised multimodal object categorization under the condition that a robot has already obtained several action primitives that it uses to examine target objects. In this context, we need to study active perception for an unsupervised multimodal categorization method that is as general as possible, because unsupervised multimodal categorization is believed to be important for future language learning by robots, and the findings obtained in this study should be applicable to other unsupervised multimodal categorization models. It has been suggested that a child forms a category based on his or her sensorimotor experience, in a Bayesian manner, before learning a word for the category, and that learning the word is a matter of attaching a new label to this preexisting category (Kemp et al., 2010). The multimodal hierarchical Dirichlet process (MHDP) is a mathematically general and sophisticated nonparametric Bayesian multimodal categorization method. Therefore, we adopt the MHDP proposed by Nakamura et al. (2011b) as a representative computational model for unsupervised multimodal object categorization.

We develop an active perception method based on the MHDP in this paper. The MHDP is a sophisticated, fully Bayesian probabilistic model for multimodal object categorization (Nakamura et al., 2011b), developed by extending the hierarchical Dirichlet process (HDP) (Teh et al., 2006) with multimodal emission distributions corresponding to multiple types of sensor information<sup>1</sup>. Nakamura et al. (2011b) showed that the MHDP enables a robot to form object categories using multimodal information, i.e., visual, auditory, and haptic information, in an unsupervised manner. Owing to the nature of Bayesian nonparametrics, the MHDP can also estimate the number of object categories.

This paper describes a new MHDP-based active perception method for multimodal object recognition based on object categories formed by a robot itself. We found that an active perception method with a good theoretical nature, i.e., one in which the performance of the greedy algorithm is theoretically guaranteed (see section 4), can be derived for the MHDP. Our formulation is based on a hierarchical Bayesian model. When the cognitive system of a robot is modeled with a hierarchical Bayesian model, its recognition state is usually represented by a posterior distribution over latent variables, e.g., object categories. The purpose of active perception is then to infer an appropriate posterior distribution with a small number of actions. In our approach, we propose an action selection method that reduces the distance between the inferred and true posterior distributions.

In this study, we define the active perception problem in the context of unsupervised multimodal object categorization as follows: which set of actions should a robot take to recognize a target object as accurately as possible, under the constraint that the number of actions is restricted<sup>2</sup>? Our MHDP-based active perception method uses an information gain (IG) maximization criterion, a Monte Carlo approximation, and the lazy greedy algorithm. In this paper, we show that the MHDP provides the following three advantages for deriving an efficient active perception method.


Although the above properties follow from the theoretical characteristics of the MHDP, this has never been pointed out in previous studies.
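As a toy illustration of the lazy greedy algorithm on which the method relies (shown here on a simple coverage function, which is submodular and non-decreasing like the IG; the action names and coverage sets are hypothetical):

```python
import heapq

def lazy_greedy(actions, gain, budget):
    """Lazy greedy selection for a monotone submodular set function.

    `gain(selected, a)` returns the marginal gain of adding action `a`
    to the already-selected set; cached gains are re-evaluated only
    when an action reaches the top of the heap (Minoux's trick).
    """
    selected = []
    # Max-heap of (negated cached gain, action, round the cache was computed in).
    heap = [(-gain([], a), a, 0) for a in actions]
    heapq.heapify(heap)
    while heap and len(selected) < budget:
        neg, a, stamp = heapq.heappop(heap)
        if stamp == len(selected):      # cache is current: take it
            selected.append(a)
        else:                           # stale: recompute and push back
            heapq.heappush(heap, (-gain(selected, a), a, len(selected)))
    return selected

# Toy coverage function: each action reveals a set of items; the
# marginal gain is the number of newly covered items (submodular).
cover = {"look": {1, 2, 3}, "grasp": {3, 4}, "shake": {4, 5, 6, 7}}

def gain(sel, a):
    covered = set().union(*(cover[s] for s in sel)) if sel else set()
    return len(cover[a] - covered)

picked = lazy_greedy(list(cover), gain, budget=2)
```

With the submodularity and monotonicity properties stated above, this greedy selection carries the usual (1 − 1/e) approximation guarantee, which is the theoretical justification the paper invokes.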

The main contributions of this paper are that we


The proposed active perception method can be used for general purposes, i.e., not only for robots but also for other target

<sup>1</sup>HDP is a nonparametric Bayesian extension of latent Dirichlet allocation (LDA) (Blei et al., 2003), which has been widely used for document-word clustering. The nonparametric Bayesian extension allows HDP to estimate the number of topics, i.e., clusters, as well.

<sup>2</sup>We can consider an extension of this problem that introduces a different cost for each action, i.e., different actions require different amounts of time and energy. However, for simplicity, this paper focuses on the problem in which the cost of each action is the same.

domains to which the MHDP can be applied. In addition, the proposed method can easily be extended to other multimodal categorization methods with similar graphical models, e.g., multimodal latent Dirichlet allocation (MLDA) (Nakamura et al., 2009). However, in this paper, we focus on the MHDP and the robot active perception scenario, and explain our method on the basis of this task.

The remainder of this paper is organized as follows. Section 2 describes the background and work related to our study. Section 3 briefly introduces the MHDP, proposed by Nakamura et al. (2011b), which enables a robot to obtain object categories by fusing multimodal sensor information in an unsupervised manner. Section 4 describes our proposed action selection method. Section 5 discusses the effectiveness of the action selection method through experiments using an upper-torso humanoid robot. Section 6 describes a supplemental experiment using synthetic data. Section 7 concludes this paper.

# 2. BACKGROUND AND RELATED WORK

# 2.1. Multimodal Categorization

The human capability for object categorization is a fundamental topic in cognitive science (Barsalou, 1999). In the field of robotics, adaptive formation of object categories that considers a robot's embodiment, i.e., its sensory-motor system, is gathering attention as a way to solve the symbol grounding problem (Harnad, 1990; Taniguchi et al., 2016).

Recently, various computational models and machine learning methods for multimodal object categorization have been proposed in artificial intelligence, cognitive robotics, and related research fields (Roy and Pentland, 2002; Natale et al., 2004; Nakamura et al., 2007, 2009, 2011a,b, 2014; Iwahashi et al., 2010; Sinapov and Stoytchev, 2011; Araki et al., 2012; Griffith et al., 2012; Ando et al., 2013; Celikkanat et al., 2014; Sinapov et al., 2014). For example, Sinapov and Stoytchev (2011) proposed a graph-based multimodal categorization method that allows a robot to recognize a new object by its similarity to a set of familiar objects. They also built a robotic system that categorizes 100 objects from multimodal information in a supervised manner (Sinapov et al., 2014). Celikkanat et al. (2014) modeled the context in terms of a set of concepts that allow many-to-many relationships between objects and contexts using LDA.

The focus of this paper is not a supervised but an unsupervised learning-based multimodal categorization method, together with an active perception method for the categories formed by that method. In this line of work, a series of statistical unsupervised multimodal categorization methods for autonomous robots has been proposed by extending LDA, i.e., a topic model (Nakamura et al., 2007, 2009, 2011a,b, 2014; Araki et al., 2012; Ando et al., 2013). All these methods are Bayesian generative models, and the MHDP is a representative method of this series (Nakamura et al., 2011b). The MHDP is an extension of the HDP, which was proposed by Teh et al. (2006), and the HDP is a nonparametric Bayesian extension of LDA (Blei et al., 2003). Concretely, the generative model of the MHDP has multiple types of emissions that correspond to the sensor data obtained through the various modality inputs. In the HDP, observation data are usually represented as a bag-of-words (BoW). In contrast, the observation data in the MHDP use bag-of-features (BoF) representations for multimodal information. BoF is a histogram-based feature representation that is generated by quantizing observed feature vectors. Latent variables that are regarded as indicators of topics in the HDP correspond to object categories in the MHDP. Nakamura et al. (2011b) showed that the MHDP enables a robot to categorize a large number of objects in a home environment into categories that are similar to human categorization results.
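The BoF conversion described above, quantizing observed feature vectors against a learned codebook and counting the assignments, can be sketched as follows (the codebook and feature values are toy data):

```python
def bag_of_features(features, codebook):
    """Quantize feature vectors against a codebook and return a BoF
    histogram: counts of the nearest codebook entry per feature."""
    hist = [0] * len(codebook)
    for f in features:
        dists = [sum((fi - ci) ** 2 for fi, ci in zip(f, c))
                 for c in codebook]
        hist[dists.index(min(dists))] += 1
    return hist

# Two codebook vectors; four observed features quantize to 2 + 2 counts.
codebook = [(0.0, 0.0), (1.0, 1.0)]
feats = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.0, 0.2)]
hist = bag_of_features(feats, codebook)
```

Each modality's histogram of this kind is what the MHDP treats as one observation node's data.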

To obtain multimodal information, a robot has to perform actions and interact with a target object in various ways, e.g., grasping, shaking, or rotating the object. If the number of actions and types of sensor information increase, multimodal categorization and recognition can require a longer time. When the recognition time is limited and/or if quick recognition is required, it becomes important for a robot to select a small number of actions that are effective for accurate recognition. Action selection for recognition is often called active perception. However, an active perception method for the MHDP has not been proposed. This paper aims to provide an active perception method for the MHDP.

# 2.2. Active Perception

Generally, active perception is one of the most important cognitive capabilities of humans. From an engineering viewpoint, active perception has many specific tasks, e.g., localization, mapping, navigation, object recognition, object segmentation, and self–other differentiation.

In machine learning, active learning is defined as a task in which a method interactively queries an information source to obtain the desired outputs at new data points in order to learn efficiently (Settles, 2012). Active learning algorithms select an unobserved input datum and ask a user (labeler) to provide a training signal (label) in order to reduce uncertainty as quickly as possible (Cohn et al., 1996; Muslea et al., 2006; Settles, 2012). These algorithms usually assume a supervised learning problem. This problem is related to, but fundamentally different from, the problem addressed in this paper.

Historically, active vision, i.e., active visual perception, has been studied as an important engineering problem in computer vision. Dutta Roy et al. (2004) presented a comprehensive survey of active three-dimensional object recognition. For example, Borotschnig et al. (2000) proposed an active vision method in a parametric eigenspace to improve the visual classification results. Denzler and Brown (2002) proposed an information theoretic action selection method to gather information that conveys the true state of a system through an active camera; they used mutual information (MI) as the criterion for action selection. Krainin et al. (2011) developed an active perception method in which a mobile robot manipulates an object to build a three-dimensional surface model of it. Their method uses the IG criterion to determine when and how the robot should grasp the object.

Modeling and/or recognizing a single object as well as modeling a scene and/or segmenting objects are also important tasks in the context of robotics. Eidenberger and Scharinger (2010) proposed an active perception planning method for scene modeling in a realistic environment. van Hoof et al. (2012) proposed an active scene exploration method that enables an autonomous robot to efficiently segment a scene into its constituent objects by interacting with the objects in an unstructured environment. They used IG as a criterion for action selection. InfoMax control for acoustic exploration was proposed by Rebguns et al. (2011).

Localization, mapping, and navigation are also targets of active perception. Velez et al. (2012) presented an online planning algorithm that enables a mobile robot to generate plans that maximize the expected performance of object detection. Burgard et al. (1997) proposed an active perception method for localization. Action selection is performed by maximizing the weighted sum of the expected entropy and expected costs. To reduce the computational cost, they only consider a subset of the next locations. Roy and Thrun (1999) proposed a coastal navigation method for a robot to generate trajectories for its goal by minimizing the positional uncertainty at the goal. Stachniss et al. (2005) proposed an information-gain-based exploration method for mapping and localization. Correa and Soto (2009) proposed an active perception method for a mobile robot with a visual sensor mounted on a pan-tilt mechanism to reduce localization uncertainty. They used the IG criterion, which was estimated using a particle filter.

In addition, various studies on active perception by a robot have been conducted (Natale et al., 2004; Ji and Carin, 2006; Schneider et al., 2009; Tuci et al., 2010; Saegusa et al., 2011; Fishel and Loeb, 2012; Pape et al., 2012; Sushkov and Sammut, 2012; Gouko et al., 2013; Hogman et al., 2013; Ivaldi et al., 2014; Zhang et al., 2017). In spite of a large number of contributions about active perception, few theories of active perception for multimodal object category recognition have been proposed. In particular, an MHDP-based active perception method has not yet been proposed, although the MHDP-based categorization method and its series have obtained many successful results and extensions.

# 2.3. Active Perception for Multimodal Categorization

Sinapov et al. (2014) investigated multimodal categorization and active perception by making a robot perform 10 different behaviors; obtain visual, auditory, and haptic information; explore 100 different objects, and classify them into 20 object categories. In addition, they proposed an active behavior selection method based on confusion matrices. They reported that the method was able to reduce the exploration time by half by dynamically selecting the next exploratory behavior. However, their multimodal categorization is performed in a supervised manner, and the theory of active perception is still heuristic. The method does not have theoretical guarantees of performance.

IG-based active perception is popular, as shown above, but the theoretical justification for using IG in each task is often missing in many robotics papers. Moreover, in many cases in robotics studies, IG cannot be evaluated directly, reliably, or accurately. When one takes an IG criterion-based approach, how to estimate the IG is an important problem. In this study, we focus on MHDP-based active perception and develop an efficient near-optimal method based on firm theoretical justification.
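To make the IG criterion concrete, the following sketch computes IG(action) = H(prior) − E_o[H(posterior | o)] for a toy discrete model by exact enumeration over outcomes; the paper's Monte Carlo approximation instead samples the outcomes. The likelihood tables are hypothetical:

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def information_gain(prior, likelihood, n_outcomes):
    """IG of an action: prior entropy minus the expected posterior
    entropy over observation outcomes. `likelihood[k][o]` is
    p(observation o | category k)."""
    ig = entropy(prior)
    for o in range(n_outcomes):
        p_o = sum(prior[k] * likelihood[k][o] for k in range(len(prior)))
        if p_o == 0:
            continue
        post = [prior[k] * likelihood[k][o] / p_o for k in range(len(prior))]
        ig -= p_o * entropy(post)
    return ig

# Two categories; one action separates them, the other reveals nothing.
prior = [0.5, 0.5]
informative = [[0.9, 0.1], [0.1, 0.9]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
ig_info = information_gain(prior, informative, 2)
ig_none = information_gain(prior, uninformative, 2)
```

The informative action has strictly positive IG while the uninformative one has zero IG, which is exactly the ordering an IG-maximizing action selector exploits.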

# 3. MULTIMODAL HIERARCHICAL DIRICHLET PROCESS FOR STATISTICAL MULTIMODAL CATEGORIZATION

We assume that a robot forms object categories using the MHDP from multimodal sensory data. In this section, we briefly introduce the MHDP on which our proposed active perception method is based (Nakamura et al., 2011b). The MHDP assumes

that an observation node in its graphical model corresponds to an action and its corresponding modality. Nakamura et al. (2011b) employed three observation nodes in their graphical model, i.e., haptic, visual, and auditory information nodes. Three actions, i.e., grasping, looking around, and shaking, correspond to these modalities, respectively. However, the MHDP can easily be extended to a model with additional types of sensory input, and autonomous robots will undoubtedly gain more types of action for perception. To model more general cases, an MHDP with M actions is described in this paper. A graphical model of the MHDP is illustrated in **Figure 2**. In this section, we describe the MHDP briefly; for more details, please refer to Nakamura et al. (2011b).

The index $m \in \mathbf{M}$ ($\#(\mathbf{M}) = M$) in **Figure 2** represents the type of information that corresponds to an action for perception, e.g., hitting an object to obtain its sound, grasping an object to test its shape and hardness, or looking at all of an object by rotating it. We assume that a robot has action primitives and can execute one of the actions by selecting the index of the action primitive. The observation $x^m_{jn} \in X^m$ is the $m$-th modality's $n$-th feature for the $j$-th target object, and $X^m$ represents the set of observations of the $m$-th modality. The observation $x^m_{jn}$ is assumed to be drawn from a categorical distribution whose parameter is $\theta^m_k$, where $k$ is the index of a latent topic. Each index $k$ is drawn from a categorical distribution whose parameter $\beta$ is drawn from a Dirichlet distribution parametrized by $\gamma$. Parameter $\theta^m_k$ is assumed to be drawn from a Dirichlet prior distribution whose parameter is $\alpha^m_0$. The MHDP assumes that a robot obtains each modality's sensory information as a BoF representation. Each latent variable $t^m_{jn}$ is drawn from a topic proportion, i.e., a parameter of a multinomial distribution, of the $j$-th object $\pi_j$, whose prior is a Dirichlet distribution parametrized by $\lambda$.

Similarly to the generative process of the original HDP (Teh et al., 2006), the generative process of the MHDP can be described as a Chinese restaurant franchise, a special type of probabilistic process in Bayesian nonparametrics (Teh et al., 2005). The learning and recognition algorithms are both derived using Gibbs sampling. In its learning process, the MHDP estimates a latent variable $t^m_{jn}$ for each feature of the $j$-th object and a topic index $k_{jt}$ for each latent variable $t$. The combination of latent variable and topic index corresponds to a topic in LDA (Blei et al., 2003). Using the estimated latent variables, the categorical distribution parameter $\theta^m_k$ and the topic proportion of the $j$-th object $\pi_j$ are drawn from the posterior distribution.

The selection procedure for the latent variable $t^m_{jn}$ is as follows. The prior probability that $x^m_{jn}$ selects $t$ is

$$P(t^m_{jn} = t \,|\, \lambda) = \begin{cases} \dfrac{\sum_m w^m N^m_{jt}}{\lambda + \sum_m w^m N^m_j - 1}, & (t = 1, \cdots, T_j),\\[2mm] \dfrac{\lambda}{\lambda + \sum_m w^m N^m_j - 1}, & (t = T_j + 1), \end{cases}$$

where $w^m$ is a weight for the $m$-th modality; to balance the influence of different modalities, the $w^m$ are set as hyperparameters. The weight $w^m$ increases the influence of modality $m$ on multimodal category formation. $N^m_{jt}$ is the number of $m$-th modality observations allocated to $t$ in the $j$-th object, and $\lambda$ is a hyperparameter. In the Chinese restaurant process, as the weighted number of observed features $N_{jt} = \sum_m w^m N^m_{jt}$ allocated to $t$ increases, the probability that a new observation is allocated to the latent variable $t$ increases. Using this prior distribution, the posterior probability that observation $x^m_{jn}$ is allocated to the latent variable $t$ becomes

$$\begin{split}P(t^m_{jn}=t\,|\,X^m,\lambda) &= \frac{P(x^m_{jn}\,|\,X^m_{k=k_{jt}})\,P(t^m_{jn}=t\,|\,\lambda)}{P(x^m_{jn}\,|\,X^m\setminus\{x^m_{jn}\},\lambda)}\\ &\propto \begin{cases} P(x^m_{jn}\,|\,X^m_{k=k_{jt}})\dfrac{\sum_m w^m N^m_{jt}}{\lambda+\sum_m w^m N^m_j-1}, & (t=1,\cdots,T_j),\\[2mm] P(x^m_{jn}\,|\,X^m_{k=k_{jt}})\dfrac{\lambda}{\lambda+\sum_m w^m N^m_j-1}, & (t=T_j+1),\end{cases}\end{split}$$
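As a concrete illustration, the following Python sketch computes this normalized posterior over tables for a single observation. It is a toy under assumed inputs, not the authors' implementation: `likelihoods` stands in for the likelihood terms (with its last entry the prior predictive for a new table), and `weighted_counts[t]` for the weighted counts $\sum_m w^m N^m_{jt}$ of the existing tables.

```python
def table_assignment_probs(likelihoods, weighted_counts, lam):
    """Posterior over existing tables t = 1..T_j plus one new table.

    likelihoods: T_j + 1 values; likelihoods[t] plays the role of
      P(x | table t's topic), and likelihoods[-1] the prior predictive
      for a newly created table (hypothetical inputs for this sketch).
    weighted_counts: T_j values, the weighted feature counts per table.
    lam: the concentration hyperparameter lambda.
    """
    total = sum(weighted_counts)            # approx. sum_m w^m N_j^m
    denom = lam + total - 1.0
    probs = [likelihoods[t] * weighted_counts[t] / denom
             for t in range(len(weighted_counts))]
    probs.append(likelihoods[-1] * lam / denom)  # open a new table
    z = sum(probs)
    return [p / z for p in probs]
```

Drawing $t$ from the returned distribution corresponds to one Gibbs step for a single feature.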

where $N^m_j$ is the number of the $m$-th modality's observations for the $j$-th object. The set of observations that correspond to the $m$-th modality and have the $k$-th topic in any object is represented by $X^m_k$.

In the Gibbs sampling procedure, a latent variable for each observation is drawn from this posterior probability distribution. If $t = T_j + 1$, the observation is allocated to a new latent variable. The dish selection procedure is as follows. The prior probability that the $k$-th topic is allocated to the $t$-th latent variable is

$$P(k_{jt} = k \,|\, \gamma) = \begin{cases} \dfrac{M_k}{\gamma + M - 1}, & (k = 1, \dots, K), \\[2mm] \dfrac{\gamma}{\gamma + M - 1}, & (k = K + 1), \end{cases}$$

where $K$ is the number of topic types, and $M_k$ is the number of latent variables on which the $k$-th topic is placed. Therefore, the posterior probability that the $k$-th topic is allocated to the $t$-th latent variable becomes

$$\begin{aligned} P(k_{jt} = k \,|\, X, \gamma) &\propto P(X_{jt}\,|\,X_k)\, P(k_{jt} = k\,|\,\gamma) \\ &= \begin{cases} P(X_{jt}\,|\,X_k)\dfrac{M_k}{\gamma + M - 1}, & (k = 1, \cdots, K), \\[2mm] P(X_{jt}\,|\,X_k)\dfrac{\gamma}{\gamma + M - 1}, & (k = K + 1), \end{cases} \end{aligned}$$

where $X = \cup_m X^m$, $X_k = \cup_m X^m_k$, and $X_{jt}$ is the set of the $j$-th object's observations allocated to the $t$-th latent variable. A topic index for the latent variable $t$ of the $j$-th object is drawn using this posterior probability, where $\gamma$ is a hyperparameter. If $k = K + 1$, a new topic is placed on the latent variable.

By sampling $t^m_{jn}$ and $k_{jt}$, the Gibbs sampler performs probabilistic object clustering:

$$t^m_{jn} \sim P(t^m_{jn}\,|\,X^{-mjn}, \lambda),\tag{1}$$

$$k_{jt} \sim P(k_{jt} \,|\, X^{-jt}, \gamma),\tag{2}$$

where $X^{-mjn} = X \setminus \{x^m_{jn}\}$ and $X^{-jt} = X \setminus X_{jt}$. By sampling $t^m_{jn}$ for each observation in every object using (1) and sampling $k_{jt}$ for each latent variable $t$ in every object using (2), all of the latent variables in the MHDP can be inferred.

If $t^m_{jn}$ and $k_{jt}$ are given, the probability that the $j$-th object is included in the $k$-th category becomes

$$P(k\,|\,X_j) = \frac{\sum_{t=1}^{T_j} \delta_k(k_{jt}) \sum_m w^m N^m_{jt}}{\sum_m w^m N^m_j},\tag{3}$$

where $X_j = \cup_m X^m_j$, $w^m$ is the weight for the $m$-th modality, and $\delta_a(x)$ is a delta function that equals 1 if $x = a$ and 0 otherwise.

When a robot attempts to recognize a new object after the learning phase, the probability that feature $x^m_{jn}$ is generated from the $k$-th topic becomes

$$P(x^m_{jn} \,|\, X^m_k) = \frac{w^m N^m_{k x^m_{jn}} + \alpha^m_0}{w^m N^m_k + d^m \alpha^m_0},$$

where $d^m$ denotes the dimension of the $m$-th modality's input, and $N^m_{k x^m_{jn}}$ represents the number of features equal to $x^m_{jn}$ that are allocated to topic $k$. The topic $k_t$ allocated to $t$ for a new object is sampled from
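As a quick numeric illustration of this Dirichlet-smoothed predictive probability, the toy helper below computes it for every feature index of one topic. The inputs (`counts`, `alpha0`, `w`) are hypothetical placeholders, not values from the paper:

```python
def topic_predictive(counts, alpha0, w=1.0):
    """Posterior predictive probability of each feature index under a
    topic with a symmetric Dirichlet(alpha0) prior; counts[i] is the
    number of features with index i already allocated to the topic, and
    w is a modality weight (all hypothetical inputs for this sketch)."""
    d = len(counts)                    # dimension of the modality input
    total = w * sum(counts)
    return [(w * c + alpha0) / (total + d * alpha0) for c in counts]

# With no observations the predictive falls back to the uniform prior.
print(topic_predictive([0, 0, 0, 0], alpha0=1.0))  # → [0.25, 0.25, 0.25, 0.25]
```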

$$k_t \sim P(k_{jt} = k \,|\, X, \gamma) \propto P(X_{jt}\,|\,X_k)\frac{\gamma}{\gamma + M - 1}.$$

These sampling procedures play an important role in the Monte Carlo approximation of our proposed method (see section 4.3).

For a more detailed explanation of the MHDP, please refer to Nakamura et al. (2011b). Using the multimodal categorization procedure described above, a robot can autonomously learn object categories and recognize new objects. The performance and effectiveness of the method were evaluated in that paper.

## 4. ACTIVE PERCEPTION METHOD

#### 4.1. Basic Formulation

A robot will usually already have conducted several actions and obtained information from several modalities when it attempts to select the next action set for recognizing a target object. For example, visual information can usually be obtained by looking at the front face of the $j$-th object from a distance, before interacting with the object physically. We assume that a robot has already obtained information corresponding to a subset of modalities $\mathbf{m}_{oj} \subset \mathbf{M}$, where the subscript $o$ means "originally" obtained modality information. When a robot faces a new object and has not obtained any information, $\mathbf{m}_{oj} = \emptyset$.

The purpose of object recognition in multimodal categorization differs from that of conventional supervised learning-based pattern recognition. In supervised learning, a recognition result is evaluated by checking whether the output matches the ground-truth label. In unsupervised learning, however, there are essentially no ground-truth labels. Therefore, the performance of active perception should be measured in a different manner.

The action set the robot selects is described as $\mathbf{A} = \{a_1, a_2, \dots, a_{N_\mathbf{A}}\} \in 2^{\mathbf{M}\setminus\mathbf{m}_{oj}}$, where $2^{\mathbf{M}\setminus\mathbf{m}_{oj}}$ is the family of subsets of $\mathbf{M} \setminus \mathbf{m}_{oj}$, i.e., $\mathbf{A} \subset \mathbf{M} \setminus \mathbf{m}_{oj}$ and $a_i \in \mathbf{M} \setminus \mathbf{m}_{oj}$, and $N_\mathbf{A}$ represents the number of selected actions. We consider an effective action set for active perception to be one that largely reduces the distance between the final recognition state, after the information from all modalities $\mathbf{M}$ has been obtained, and the recognition state after the robot executes the selected action set $\mathbf{A}$. The recognition state is represented by the posterior distribution $P(\mathbf{z}_j \,|\, X^{\mathbf{m}_{oj}\cup\mathbf{A}}_j)$. Here, $\mathbf{z}_j = \{\{k_{jt}\}_{1\le t\le T_j}, \{t^m_{jn}\}_{m\in\mathbf{M}, 1\le n\le N^m_j}\}$ is a latent variable representing the $j$-th object's topic information, where $X^{\mathbf{A}}_j = \cup_{m\in\mathbf{A}} X^m_j$ and $X^m_j = \{x^m_{j1}, \dots, x^m_{jn}, \dots, x^m_{jN^m_j}\}$. The probability $P(\mathbf{z}_j \,|\, X^{\mathbf{m}_{oj}\cup\mathbf{A}}_j)$ represents the posterior distribution related to the object category after taking actions $\mathbf{m}_{oj}$ and $\mathbf{A}$.

The final recognition state, i.e., the posterior distribution over latent variables after obtaining the information from all modalities $\mathbf{M}$, is $P(\mathbf{z}_j \,|\, X^{\mathbf{M}}_j)$. The purpose of active perception is to select a set of actions that estimates this posterior distribution as accurately as possible. When $L$ actions can be executed, if we employ the KL divergence as the metric of the difference between the two probability distributions,

$$\underset{\mathbf{A}\in\mathbb{F}^{\mathbf{m}_{oj}}_{L}}{\text{minimize}}\;\text{KL}\left(P(\mathbf{z}_{j}\,|\,X_{j}^{\mathbf{M}}),\, P(\mathbf{z}_{j}\,|\,X_{j}^{\mathbf{m}_{oj}\cup\mathbf{A}})\right)\tag{4}$$

is a reasonable evaluation criterion for realizing effective active perception, where $\mathbb{F}^{\mathbf{m}_{oj}}_L = \{\mathbf{A} \mid \mathbf{A} \subset \mathbf{M}\setminus\mathbf{m}_{oj},\, N_\mathbf{A} \le L\}$ is the feasible set of action sets.

However, neither the true $X^{\mathbf{M}}_j$ nor $X^{\mathbf{m}_{oj}\cup\mathbf{A}}_j$ can be observed before taking $\mathbf{A}$ on the $j$-th target object, and hence neither can be used at the moment of action selection. Therefore, a rational alternative evaluation criterion is the expected value of the KL divergence at the moment of action selection:

$$\underset{\mathbf{A}\in\mathbb{F}^{\mathbf{m}_{oj}}_{L}}{\text{minimize}}\;\mathbb{E}_{X_{j}^{\mathbf{M}\setminus\mathbf{m}_{oj}}|X_{j}^{\mathbf{m}_{oj}}}\left[\text{KL}\left(P(\mathbf{z}_{j}\,|\,X_{j}^{\mathbf{M}}),\,P(\mathbf{z}_{j}\,|\,X_{j}^{\mathbf{m}_{oj}\cup\mathbf{A}})\right)\right].\tag{5}$$

Here, we propose to use the IG maximization criterion to select the next action set for active perception:

$$\mathbf{A}_{j}^{*} = \underset{\mathbf{A} \in \mathbb{F}^{\mathbf{m}_{oj}}_{L}}{\operatorname{argmax}} \, \operatorname{IG}(\mathbf{z}_{j}; X_{j}^{\mathbf{A}} \,|\, X_{j}^{\mathbf{m}_{oj}}) \tag{6}$$

$$=\underset{\mathbf{A}\in\mathbb{F}^{\mathbf{m}_{oj}}_{L}}{\operatorname{argmax}}\;\mathbb{E}_{X_{j}^{\mathbf{A}}|X_{j}^{\mathbf{m}_{oj}}}\left[\operatorname{KL}\left(P(\mathbf{z}_{j}\,|\,X_{j}^{\mathbf{m}_{oj}\cup\mathbf{A}}),\, P(\mathbf{z}_{j}\,|\,X_{j}^{\mathbf{m}_{oj}})\right)\right],\tag{7}$$

where $\operatorname{IG}(X; Y|Z)$ is the IG about $X$ provided by $Y$, calculated on the basis of the probability distributions commonly conditioned on $Z$ as follows:

$$\text{IG}(X; \ Y|Z) = \text{KL}\left(P(X, Y|Z), P(X|Z)P(Y|Z)\right).$$

By definition, the expected KL divergence is the same as IG(X; Y). The definition of IG and its relation to KL divergence are as follows.

$$\begin{aligned} \text{IG}(X;Y) &= H(X) - H(X|Y) \\ &= \text{KL}\left(P(X,Y), P(X)P(Y)\right) \\ &= \mathbb{E}\_Y[\text{KL}\left(P(X|Y), P(X)\right)]. \end{aligned}$$
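These identities can be checked numerically on a small discrete joint distribution. The sketch below (toy numbers, not from the paper) verifies that the KL form and the expected-KL form of $\operatorname{IG}(X;Y)$ agree:

```python
from math import log

# Toy joint distribution P(X, Y) over X in {0, 1}, Y in {0, 1}.
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
Px = {x: sum(p for (xx, y), p in P.items() if xx == x) for x in (0, 1)}
Py = {y: sum(p for (x, yy), p in P.items() if yy == y) for y in (0, 1)}

# IG(X; Y) as KL(P(X, Y) || P(X)P(Y)).
kl_joint = sum(p * log(p / (Px[x] * Py[y])) for (x, y), p in P.items())

# IG(X; Y) as E_Y[ KL(P(X|Y) || P(X)) ].
exp_kl = sum(
    Py[y] * sum((P[(x, y)] / Py[y]) * log((P[(x, y)] / Py[y]) / Px[x])
                for x in (0, 1))
    for y in (0, 1))

assert abs(kl_joint - exp_kl) < 1e-12  # the two forms coincide
```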

The optimality of the proposed criterion (6) is supported by Theorem 1.

**Theorem 1.** The set of next actions $\mathbf{A} \in \mathbb{F}^{\mathbf{m}_{oj}}_L$ that maximizes $\operatorname{IG}(\mathbf{z}_j; X^{\mathbf{A}}_j \,|\, X^{\mathbf{m}_{oj}}_j)$ minimizes the expected KL divergence between the posterior distributions over $\mathbf{z}_j$ after all modality information has been observed and after $\mathbf{A}$ has been executed:

$$\begin{split} &\underset{\mathbf{A}\in\mathbb{F}^{\mathbf{m}_{oj}}_{L}}{\operatorname{argmin}} \; \mathbb{E}_{X_{j}^{\mathbf{M}\setminus\mathbf{m}_{oj}}|X_{j}^{\mathbf{m}_{oj}}}\left[\operatorname{KL}\left(P(\mathbf{z}_{j}\,|\,X_{j}^{\mathbf{M}}),\, P(\mathbf{z}_{j}\,|\,X_{j}^{\mathbf{m}_{oj}\cup\mathbf{A}})\right)\right] \\ &\qquad= \underset{\mathbf{A}\in\mathbb{F}^{\mathbf{m}_{oj}}_{L}}{\operatorname{argmax}} \,\operatorname{IG}(\mathbf{z}_{j};X_{j}^{\mathbf{A}}\,|\,X_{j}^{\mathbf{m}_{oj}}) \end{split}$$

Proof: See Appendix A.

This theorem is essentially the result of well-known characteristics of IG (see MacKay, 2003; Russo and Van Roy, 2016 for example). This means that maximizing IG is the optimal policy for active perception in an MHDP-based multimodal object category recognition task. As a special case, when only a single action is permitted, the following corollary is satisfied.

**Corollary 1.1.** The next action $m \in \mathbf{M} \setminus \mathbf{m}_{oj}$ that maximizes $\operatorname{IG}(\mathbf{z}_j; X^m_j \,|\, X^{\mathbf{m}_{oj}}_j)$ minimizes the expected KL divergence between the posterior distributions over $\mathbf{z}_j$ after all modality information has been observed and after the action has been executed:

$$\underset{m\in\mathbf{M}\setminus\mathbf{m}_{oj}}{\operatorname{argmin}}\;\mathbb{E}_{X_{j}^{\mathbf{M}\setminus\mathbf{m}_{oj}}|X_{j}^{\mathbf{m}_{oj}}}\left[\operatorname{KL}\left(P(\mathbf{z}_{j}\,|\,X_{j}^{\mathbf{M}}),\,P(\mathbf{z}_{j}\,|\,X_{j}^{\{m\}\cup\mathbf{m}_{oj}})\right)\right]$$

$$=\underset{m\in\mathbf{M}\setminus\mathbf{m}_{oj}}{\operatorname{argmax}}\,\operatorname{IG}(\mathbf{z}_{j};X_{j}^{m}\,|\,X_{j}^{\mathbf{m}_{oj}}).\tag{8}$$

Proof: Substituting $\{m\}$ for $\mathbf{A}$ in Theorem 1 yields the corollary.

Using IG, the active perception strategy for the next single action is simply described as follows:

$$m_{j}^{*} = \underset{m \in \mathbf{M} \setminus \mathbf{m}_{oj}}{\operatorname{argmax}} \, \operatorname{IG}(\mathbf{z}_{j}; X_{j}^{m} \,|\, X_{j}^{\mathbf{m}_{oj}}). \tag{9}$$

This means that the robot should select the action $m^*_j$ that obtains the $X^{m^*_j}_j$ maximizing the IG for the recognition result $\mathbf{z}_j$, under the condition that the robot has already observed $X^{\mathbf{m}_{oj}}_j$.

However, two problems remain: (i) selecting the optimal set of $L$ actions is a combinatorial optimization problem whose search space grows rapidly with the number of available actions, and (ii) the IG itself cannot be evaluated analytically in a practical manner.

Based on some properties of the MHDP, we can obtain reasonable solutions for these two problems.

#### 4.2. Sequential Decision Making as a Submodular Maximization

If a robot wants to select $L$ actions $\mathbf{A}_j = \{a_1, a_2, \dots, a_L\}$ ($a_i \in \mathbf{M} \setminus \mathbf{m}_{oj}$), it has to solve (6), a combinatorial optimization problem. The number of combinations of $L$ actions is $\binom{\#(\mathbf{M}\setminus\mathbf{m}_{oj})}{L}$, which increases dramatically as the number of possible actions $\#(\mathbf{M} \setminus \mathbf{m}_{oj})$ and $L$ increase. For example, Sinapov et al. (2014) gave a robot 10 different behaviors in their experiment on robotic multimodal categorization. Future autonomous robots will have more actions available for interacting with a target object and will be able to obtain additional types of modality information through these interactions. Hence, it is important to develop an efficient solution to this combinatorial optimization problem.
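For instance, with the 10 behaviors of Sinapov et al. (2014) and a hypothetical budget of L = 5 actions, exhaustive evaluation and greedy selection differ as follows (a simple count using Python's standard library):

```python
from math import comb

M, L = 10, 5  # e.g., 10 exploratory behaviors, budget of 5 actions

# Brute force: one IG evaluation per candidate subset of size L.
brute_force = comb(M, L)
# Greedy: at step l, evaluate IG for each of the M - l remaining actions.
greedy = sum(M - l for l in range(L))

print(brute_force, greedy)  # → 252 40
```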

Here, the MHDP has advantages for solving this problem.

**Theorem 2.** The evaluation criterion for multimodal active perception $\operatorname{IG}(\mathbf{z}_j; X^{\mathbf{A}}_j \,|\, X^{\mathbf{m}_{oj}}_j)$ is a submodular and non-decreasing function with regard to $\mathbf{A}$.

Proof: As shown in the graphical model of the MHDP in **Figure 2**, the observations for each modality $X^m_j$ are conditionally independent given the set of latent variables $\mathbf{z}_j = \{\{k_{jt}\}_{1\le t\le T_j}, \{t^m_{jn}\}_{m\in\mathbf{M}, 1\le n\le N^m_j}\}$. This satisfies the conditions of the theorem by Krause and Guestrin (2005). Therefore, $\operatorname{IG}(\mathbf{z}_j; X^{\mathbf{A}}_j \,|\, X^{\mathbf{m}_{oj}}_j)$ is a submodular and non-decreasing function with regard to $\mathbf{A}$.

Submodularity is a property analogous to the convexity of a real-valued function in a vector space. A set function $F: 2^V \to \mathbb{R}$ on a finite set $V$ is called submodular if

$$F(A \cup \{x\}) - F(A) \ge F(A' \cup \{x\}) - F(A')$$

for all $A \subseteq A' \subseteq V$ and all $x \in V \setminus A'$, i.e., adding an element to a smaller set yields at least as large a gain as adding it to a larger one (diminishing returns).
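The diminishing-returns inequality can be verified exhaustively for a small example. The sketch below uses a toy coverage function, a standard submodular, non-decreasing set function; the sets are arbitrary placeholders:

```python
from itertools import combinations

# Toy coverage function F(A) = size of the union of the subsets indexed
# by A; coverage is a classic submodular, non-decreasing set function.
SETS = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}}

def F(A):
    return len(set().union(*(SETS[i] for i in A))) if A else 0

V = set(SETS)
# Check F(A ∪ {x}) − F(A) ≥ F(A' ∪ {x}) − F(A') for all A ⊆ A' ⊆ V, x ∉ A'.
for r in range(len(V) + 1):
    for Ap in combinations(sorted(V), r):
        for s in range(r + 1):
            for A in combinations(Ap, s):
                for x in V - set(Ap):
                    gain_small = F(set(A) | {x}) - F(set(A))
                    gain_large = F(set(Ap) | {x}) - F(set(Ap))
                    assert gain_small >= gain_large
print("diminishing returns holds")  # → diminishing returns holds
```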

The function IG is not always submodular. However, Krause and Guestrin (2005) proved that IG(U; A) is submodular and non-decreasing with regard to $A \subseteq S$ if all of the elements of $S$ are conditionally independent given $U$. With this theorem, they solved a sensor allocation problem efficiently. Theorem 2 means that problem (6) reduces to a submodular maximization problem.

It is known that the greedy algorithm is an efficient strategy for submodular maximization. Nemhauser et al. (1978) proved that the greedy algorithm selects a subset whose value is at worst a factor of $(1 - 1/e)$ of the optimal value, if the evaluation function $F(\cdot)$ is a submodular, non-decreasing set function with $F(\emptyset) = 0$. If the evaluation function is such a set function, a greedy algorithm is thus practically sufficient for selecting subsets in many cases; in short, it gives a near-optimal solution. However, the greedy algorithm is still inefficient because it requires an evaluation of all remaining choices at each step of the sequential decision-making process.

Minoux (1978) proposed the lazy greedy algorithm to make the greedy algorithm more efficient for submodular evaluation functions. The lazy greedy algorithm reduces the number of evaluations by exploiting the diminishing-returns property: a marginal gain computed in an earlier iteration is an upper bound on the current one, so many candidates need not be re-evaluated.
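A compact way to implement this bookkeeping is a max-heap of possibly stale marginal gains. The sketch below is a generic lazy greedy routine for any submodular, non-decreasing set function; it is illustrated with a toy coverage function, not the IG of the paper:

```python
import heapq

def lazy_greedy(V, F, L):
    """Lazy greedy maximization of a submodular, non-decreasing set
    function F over ground set V, selecting up to L elements. Each heap
    entry is tagged with the size of the selected set at evaluation
    time; stale entries are re-evaluated only when they reach the top."""
    selected = []
    base = F(selected)
    heap = [(-(F([v]) - base), v, 0) for v in V]
    heapq.heapify(heap)
    while len(selected) < L and heap:
        neg_gain, v, tag = heapq.heappop(heap)
        if tag == len(selected):       # gain computed against current set
            selected.append(v)
            base = F(selected)
        else:                           # stale: re-evaluate and push back
            gain = F(selected + [v]) - base
            heapq.heappush(heap, (-gain, v, len(selected)))
    return selected

# Toy submodular objective: coverage of placeholder sets.
SETS = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {5, 6}}
def cover(A):
    return len(set().union(*(SETS[i] for i in A))) if A else 0

print(lazy_greedy(list(SETS), cover, 2))  # → [2, 0]
```

By submodularity, a re-evaluated gain that still tops the heap is guaranteed to be the true maximum, which is why stale entries below it can be skipped.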

#### 4.3. Monte Carlo Approximation of IG

Equations (6) and (9) provide a robot with an appropriate criterion for selecting actions to efficiently recognize a target object. However, at first glance, the IG looks difficult to calculate. First, the expectation $\mathbb{E}_{X^{\mathbf{A}}_j | X^{\mathbf{m}_{oj}}_j}[\cdot]$ requires a sum over all possible $X^{\mathbf{A}}_j$, whose number increases exponentially with the number of elements in the BoF. Second, the calculation of $P(\mathbf{z}_j \,|\, X^{\mathbf{A}\cup\mathbf{m}_{oj}}_j)$ for each possible observation $X^{\mathbf{A}}_j$ requires the same computational cost as recognition in the multimodal categorization itself. Therefore, straightforwardly solving (9) is computationally infeasible in practice.

However, by exploiting a characteristic property of the MHDP, a Monte Carlo approximation can be derived. First, we describe IG as the expectation of a logarithm term.

$$\operatorname{IG}(\mathbf{z}_{j}; X_{j}^{m} \,|\, X_{j}^{\mathbf{m}_{oj}}) = \sum_{\mathbf{z}_{j}, X_{j}^{m}} P(\mathbf{z}_{j}, X_{j}^{m} \,|\, X_{j}^{\mathbf{m}_{oj}}) \log \frac{P(\mathbf{z}_{j}, X_{j}^{m} \,|\, X_{j}^{\mathbf{m}_{oj}})}{P(\mathbf{z}_{j} \,|\, X_{j}^{\mathbf{m}_{oj}})\, P(X_{j}^{m} \,|\, X_{j}^{\mathbf{m}_{oj}})}$$

$$= \mathbb{E}_{\mathbf{z}_{j}, X_{j}^{m} | X_{j}^{\mathbf{m}_{oj}}} \left[\log \frac{P(\mathbf{z}_{j}, X_{j}^{m} \,|\, X_{j}^{\mathbf{m}_{oj}})}{P(\mathbf{z}_{j} \,|\, X_{j}^{\mathbf{m}_{oj}})\, P(X_{j}^{m} \,|\, X_{j}^{\mathbf{m}_{oj}})}\right]. \tag{10}$$

An analytic evaluation of (10) is also practically impossible. Therefore, we adopt a Monte Carlo method. Equation (10) suggests that an efficient Monte Carlo approximation can be performed as shown below if we can sample

$$(\mathbf{z}^{[k]}_j, X^{m[k]}_j) \sim P(\mathbf{z}_j, X^m_j \,|\, X^{\mathbf{m}_{oj}}_j), \quad (k \in \{1, \dots, K\}).$$

Fortunately, the original MHDP paper (Nakamura et al., 2011b) provides sampling procedures for $\mathbf{z}^{[k]}_j \sim P(\mathbf{z}_j \,|\, X^{\mathbf{m}_{oj}}_j)$ and $X^{m[k]}_j \sim P(X^m_j \,|\, \mathbf{z}^{[k]}_j)$. In the context of multimodal categorization by a robot, $X^{m[k]}_j \sim P(X^m_j \,|\, \mathbf{z}^{[k]}_j)$ is a prediction of an unobserved modality's sensation from the observed modalities' sensations, i.e., cross-modal inference. The sampling process of $(\mathbf{z}^{[k]}_j, X^{m[k]}_j)$ can be regarded as a mental simulation in which the robot predicts the unobserved modality's sensation and derives a categorization result from the predicted sensation and the observed information.

$$\begin{split} \text{(10)} &\approx \frac{1}{K} \sum_{k} \log \frac{P(\mathbf{z}_{j}^{[k]}, X_{j}^{m[k]} \,|\, X_{j}^{\mathbf{m}_{oj}})}{P(\mathbf{z}_{j}^{[k]} \,|\, X_{j}^{\mathbf{m}_{oj}})\, P(X_{j}^{m[k]} \,|\, X_{j}^{\mathbf{m}_{oj}})} \\ &= \frac{1}{K} \sum_{k} \log \frac{P(X_{j}^{m[k]} \,|\, \mathbf{z}_{j}^{[k]}, X_{j}^{\mathbf{m}_{oj}})}{\underbrace{P(X_{j}^{m[k]} \,|\, X_{j}^{\mathbf{m}_{oj}})}_{*}}. \end{split} \tag{11}$$

In (11), the numerator $P(X^{m[k]}_j \,|\, \mathbf{z}^{[k]}_j, X^{\mathbf{m}_{oj}}_j)$ can be calculated easily because all the parent nodes of $X^{m[k]}_j$ are given in the graphical model shown in **Figure 2**. However, the denominator $P(X^{m[k]}_j \,|\, X^{\mathbf{m}_{oj}}_j)$ cannot be evaluated in a straightforward way. Again, a Monte Carlo method can be adopted, as follows:

$$\begin{split} (*) = P(X_j^{m[k]} \,|\, X_j^{\mathbf{m}_{oj}}) &= \sum_{\mathbf{z}_j} P(X_j^{m[k]} \,|\, \mathbf{z}_j, X_j^{\mathbf{m}_{oj}})\, P(\mathbf{z}_j \,|\, X_j^{\mathbf{m}_{oj}}) \\ &= \mathbb{E}_{\mathbf{z}_j | X_j^{\mathbf{m}_{oj}}}\left[P(X_j^{m[k]} \,|\, \mathbf{z}_j, X_j^{\mathbf{m}_{oj}})\right] \\ &\approx \frac{1}{K'} \sum_{k'} P(X_j^{m[k]} \,|\, \mathbf{z}_j^{[k']}, X_j^{\mathbf{m}_{oj}}), \end{split} \tag{12}$$

where $K'$ is the number of samples for the second Monte Carlo approximation. Fortunately, in this approximation (12), we can efficiently reuse the samples drawn in the first Monte Carlo approximation, i.e., set $K' = K$. Substituting (12) into (11), we finally obtain the approximate IG used as the criterion of active perception, i.e., our proposed method:

$$\operatorname{IG}(\mathbf{z}_{j}; X_{j}^{m} \,|\, X_{j}^{\mathbf{m}_{oj}}) \approx \frac{1}{K} \sum_{k} \log \frac{P(X_{j}^{m[k]} \,|\, \mathbf{z}_{j}^{[k]}, X_{j}^{\mathbf{m}_{oj}})}{\frac{1}{K} \sum_{k'} P(X_{j}^{m[k]} \,|\, \mathbf{z}_{j}^{[k']}, X_{j}^{\mathbf{m}_{oj}})}.$$

Note that the computational cost of evaluating the IG becomes $O(K^2)$. In summary, a robot can approximately estimate the IG for unobserved modality information by generating virtual observations based on observed data and evaluating their likelihood.
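The double Monte Carlo estimate can be exercised on a toy discrete surrogate, where the exact IG is computable in closed form. The distributions below are placeholders, not quantities from the MHDP; the point is the sample-reuse trick with $K' = K$:

```python
import random
from math import log

random.seed(0)

# Toy surrogate: latent z with posterior Pz given the observed
# modalities, and a predicted observation x with likelihood Px_z[z].
Pz = [0.6, 0.4]
Px_z = [[0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6]]

def sample(dist):
    r, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1

# Draw K joint samples (z^[k], x^[k]), as in the mental simulation.
K = 20000
zs = [sample(Pz) for _ in range(K)]
xs = [sample(Px_z[z]) for z in zs]

# Denominator P(x | observed) ~ (1/K) sum_k' P(x | z^[k']); with a
# discrete z this reduces to a mixture with empirical weights.
freq = [zs.count(0) / K, zs.count(1) / K]

def px_hat(x):
    return sum(f * Px_z[z][x] for z, f in enumerate(freq))

# IG ~ (1/K) sum_k log[ P(x^[k] | z^[k]) / P_hat(x^[k]) ]
ig_mc = sum(log(Px_z[z][x] / px_hat(x)) for z, x in zip(zs, xs)) / K
```

With these toy numbers the exact IG is about 0.22 nats, and the estimate converges to it as K grows.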

#### 4.4. MHDP-Based Active Perception Methods

We propose the use of the greedy and lazy greedy algorithms for selecting L actions to recognize a target object on the basis of the submodular property of IG. The final greedy and lazy greedy algorithms for MHDP-based active perception, i.e., our proposed methods, are shown in Algorithms 1 and 2, respectively.

The main contribution of the lazy greedy algorithm is to reduce the computational cost of active perception. The majority of the computational cost originates from the number of times a robot evaluates $\operatorname{IG}_m$ when determining action sequences. When a robot has to choose $L$ actions, the brute-force algorithm that directly evaluates all alternatives $\mathbf{A} \in \mathbb{F}^{\mathbf{m}_{oj}}_L$ using (6) requires $\binom{\#(\mathbf{M}\setminus\mathbf{m}_{oj})}{L}$ evaluations of $\operatorname{IG}(\mathbf{z}_j; X^{\mathbf{A}}_j \,|\, X^{\mathbf{m}_{oj}}_j)$. In contrast, the greedy algorithm requires $\{\#(\mathbf{M}\setminus\mathbf{m}_{oj}) + (\#(\mathbf{M}\setminus\mathbf{m}_{oj}) - 1) + \cdots + (\#(\mathbf{M}\setminus\mathbf{m}_{oj}) - L + 1)\}$ evaluations of $\operatorname{IG}(\mathbf{z}_j; X^m_j \,|\, X^{\mathbf{m}_{oj}}_j)$, i.e., $O(ML)$. The lazy greedy algorithm incurs the same computational cost as the greedy algorithm only in the worst case. In practice, however, the number of re-evaluations in the lazy greedy algorithm is quite small; its computational cost therefore increases almost in proportion to $L$, i.e., almost linearly. The memory requirement of the proposed method is also quite small: both the greedy and lazy greedy algorithms only require memory for $\operatorname{IG}_m$ for each modality and for the $K$ samples of the Monte Carlo approximation. These requirements are negligibly small compared with those of the MHDP itself.

**Algorithm 1** Greedy algorithm.

**Require:** The MHDP is trained using a training data set. The $j$-th object is found. $\mathbf{m}_{oj}$ is initialized, and $X^{\mathbf{m}_{oj}}_j$ is observed.
**for** $l = 1$ to $L$ **do**
  **for all** $m \in \mathbf{M} \setminus \mathbf{m}_{oj}$ **do**
    **for** $k = 1$ to $K$ **do**
      Draw $(\mathbf{z}^{[k]}_j, X^{m[k]}_j) \sim P(\mathbf{z}_j, X^m_j \,|\, X^{\mathbf{m}_{oj}}_j)$
    **end for**
    $\operatorname{IG}_m \leftarrow \frac{1}{K}\sum_k \log\frac{P(X^{m[k]}_j \,|\, \mathbf{z}^{[k]}_j, X^{\mathbf{m}_{oj}}_j)}{\frac{1}{K}\sum_{k'} P(X^{m[k]}_j \,|\, \mathbf{z}^{[k']}_j, X^{\mathbf{m}_{oj}}_j)}$
  **end for**
  $m^* \leftarrow \operatorname{argmax}_m \operatorname{IG}_m$
  Execute the $m^*$-th action on the $j$-th target object and obtain $X^{m^*}_j$.
  $\mathbf{m}_{oj} \leftarrow \mathbf{m}_{oj} \cup \{m^*\}$
**end for**

Note that $\operatorname{IG}_m$ is not the exact IG but an approximation. Therefore, the differences between IG and $\operatorname{IG}_m$ may harm the performance of the greedy and lazy greedy algorithms to a certain extent. In practice, however, the algorithms are expected to work well; we evaluated them through experiments.

## 5. EXPERIMENT 1: HUMANOID ROBOT

#### 5.1. Conditions

An experiment using an upper-torso humanoid robot was conducted to verify the proposed active perception method in a real-world environment. In this experiment, RIC-Torso, developed by the RT Corporation, was used (see **Figure 3**). RIC-Torso is an upper-torso humanoid robot that has two robot hands. We prepared an experimental environment similar to the one in the original MHDP paper (Nakamura et al., 2011b). The robot has four available actions and four corresponding types of modality information. The set of modalities was $\mathbf{M} = \{m_v, m_{as}, m_{ah}, m_h\}$, which represent visual information, auditory information obtained by shaking an object, auditory information obtained by hitting an object, and haptic information, respectively.

**Algorithm 2** Lazy greedy algorithm.

**Require:** The MHDP is trained using a training data set. The $j$-th object is found. $\mathbf{m}_{oj}$ is initialized, and $X^{\mathbf{m}_{oj}}_j$ is observed.
**for all** $m \in \mathbf{M} \setminus \mathbf{m}_{oj}$ **do**
  **for** $k = 1$ to $K$ **do**
    Draw $(\mathbf{z}^{[k]}_j, X^{m[k]}_j) \sim P(\mathbf{z}_j, X^m_j \,|\, X^{\mathbf{m}_{oj}}_j)$
  **end for**
  $\operatorname{IG}_m \leftarrow \frac{1}{K}\sum_k \log\frac{P(X^{m[k]}_j \,|\, \mathbf{z}^{[k]}_j, X^{\mathbf{m}_{oj}}_j)}{\frac{1}{K}\sum_{k'} P(X^{m[k]}_j \,|\, \mathbf{z}^{[k']}_j, X^{\mathbf{m}_{oj}}_j)}$
**end for**
$m^* \leftarrow \operatorname{argmax}_m \operatorname{IG}_m$
Execute the $m^*$-th action on the $j$-th target object and obtain $X^{m^*}_j$.
$\mathbf{m}_{oj} \leftarrow \mathbf{m}_{oj} \cup \{m^*\}$
Prepare a stack $S$ for the modality indices and initialize it.
**for all** $m \in \mathbf{M} \setminus \mathbf{m}_{oj}$ **do** push($S$, $(m, \operatorname{IG}_m)$) **end for**
**for** $l = 1$ to $L - 1$ **do**
  **repeat**
    $S \leftarrow$ descending_sort($S$) // w.r.t. $\operatorname{IG}_m$
    $(m^1, \operatorname{IG}_{m^1}) \leftarrow$ pop($S$), $(m^2, \operatorname{IG}_{m^2}) \leftarrow$ pop($S$)
    // Re-evaluate $\operatorname{IG}_{m^1}$ as follows.
    **for** $k = 1$ to $K$ **do**
      Draw $(\mathbf{z}^{[k]}_j, X^{m^1[k]}_j) \sim P(\mathbf{z}_j, X^{m^1}_j \,|\, X^{\mathbf{m}_{oj}}_j)$
    **end for**
    $\operatorname{IG}_{m^1} \leftarrow \frac{1}{K}\sum_k \log\frac{P(X^{m^1[k]}_j \,|\, \mathbf{z}^{[k]}_j, X^{\mathbf{m}_{oj}}_j)}{\frac{1}{K}\sum_{k'} P(X^{m^1[k]}_j \,|\, \mathbf{z}^{[k']}_j, X^{\mathbf{m}_{oj}}_j)}$
    push($S$, $(m^2, \operatorname{IG}_{m^2})$), push($S$, $(m^1, \operatorname{IG}_{m^1})$)
  **until** $\operatorname{IG}_{m^1} \ge \operatorname{IG}_{m^2}$
  $m^* \leftarrow m^1$
  Execute the $m^*$-th action on the $j$-th target object and obtain $X^{m^*}_j$.
  $\mathbf{m}_{oj} \leftarrow \mathbf{m}_{oj} \cup \{m^*\}$
**end for**

#### 5.1.1. Visual Information ($m_v$)

Visual information was obtained from an Xtion PRO LIVE sensor mounted on the head of the robot; the camera was regarded as the eyes of the robot. The robot captured 74 images of a target object while the object rotated on a turntable (see **Figure 3**), and each image was resized to 320 × 240 pixels. Scale-invariant feature transform (SIFT) feature vectors were extracted from each captured image (Lowe, 2004), yielding a number of 128-dimensional feature vectors per image. Note that the SIFT features do not consider hue information. All of the obtained feature vectors were transformed into BoF representations using k-means clustering with k = 25. The number of clusters k was determined empirically, considering prior works (Nakamura et al., 2011b; Araki et al., 2012). The k-means clustering was performed using data from all objects in a training set, and the centroids of the clusters were determined. The BoF representations were used as observation data for the visual modality of the MHDP. The index of this modality was defined as $m_v$.
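The quantization step from descriptors to a BoF histogram can be sketched as follows; the descriptors and codebook here are random placeholders (real SIFT extraction and the k-means training are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for SIFT descriptors: each image yields a variable number of
# 128-dimensional feature vectors (values here are random placeholders).
descriptors = rng.random((300, 128))

# Codebook learned beforehand with k-means (k = 25 in the experiment);
# here we fake the centroids instead of running the clustering.
k = 25
centroids = rng.random((k, 128))

# Vector quantization: assign each descriptor to its nearest centroid
# and count the assignments to obtain the bag-of-features histogram.
dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)
bof = np.bincount(assignments, minlength=k)

assert bof.sum() == len(descriptors) and bof.shape == (k,)
```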

#### 5.1.2. Auditory Information ($m_{as}$ and $m_{ah}$)

Auditory information was obtained from a multi-powered shotgun microphone (NTG-2 by RODE), which was regarded as the ear of the robot. In this experiment, two types of auditory information were acquired: one generated by hitting the object and the other generated by shaking it. The two sounds were regarded as different auditory information and hence as different modality observations in the MHDP model. The two actions, i.e., hitting and shaking, were manually programmed for the robot, each implemented as a fixed trajectory. When the robot began to execute an action, it also started recording the object's sound (see **Figure 3**); recording continued until two seconds after the robot finished the action. The recorded auditory data were temporally divided into frames, and each frame was transformed into 13-dimensional Mel-frequency cepstral coefficients (MFCCs). The MFCC feature vectors were transformed into BoF representations using k-means clustering with k = 25, in the same way as the visual information. The indices of these modalities were defined as $m_{as}$ and $m_{ah}$ for "shake" and "hit," respectively.

FIGURE 3 | A humanoid robot used in the experiment.

#### 5.1.3. Haptic Information (m<sup>h</sup>)

Haptic information was obtained by grasping a target object with the robot's hand. When the robot attempted to obtain haptic information from an object placed in front of it, it moved its hand to the object and gradually closed the hand until a certain amount of counterforce was detected (see **Figure 3**). The joint angle of the hand was measured when the hand first touched the target object and when the hand stopped. These two angles and the difference between them were used as a three-dimensional feature vector. When obtaining haptic information, the robot grasped the target object 10 times and thus obtained 10 feature vectors. The feature vectors were transformed into BoF representations using k-means clustering with k = 5, in the same way as for the other information types. The index of the haptic modality was defined as m<sup>h</sup>.
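The three-dimensional haptic feature described above can be sketched as follows; the function and variable names are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of the haptic feature: the hand joint angle at first
# contact, the angle where the hand stopped closing, and their difference.

def haptic_feature(angle_at_contact, angle_at_stop):
    """Build the 3-D feature vector for a single grasp."""
    return [angle_at_contact, angle_at_stop, angle_at_stop - angle_at_contact]

def haptic_features_for_object(grasp_readings):
    """Ten grasps per object yield ten feature vectors (before BoF quantization)."""
    return [haptic_feature(contact, stop) for contact, stop in grasp_readings]
```

The ten resulting vectors per object would then be quantized against the k = 5 haptic codebook in the same way as the visual and auditory features.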

#### 5.1.4. Multimodal Information as BoF Representations

In summary, the robot could obtain multimodal information from four modalities for perception. The dimensions of the BoFs were set to 25, 25, 25, and 5 for m<sup>v</sup>, m<sup>as</sup>, m<sup>ah</sup>, and m<sup>h</sup>, respectively; the dimension of each BoF corresponds to the number of clusters used for k-means clustering. The numbers of clusters, i.e., the sizes of the dictionaries, were empirically determined on the basis of a preliminary experiment on multimodal categorization. All of the training datasets were used to train the dictionaries. The histograms of the feature vectors, i.e., the BoFs, were resampled to make their counts N<sub>j</sub><sup>mv</sup> = 100, N<sub>j</sub><sup>mas</sup> = 80, N<sub>j</sub><sup>mah</sup> = 130, and N<sub>j</sub><sup>mh</sup> = 30. The weight of each modality w<sup>m</sup> was set to 1. The formation of multimodal object categories itself is out of the scope of this paper; therefore, these constants were empirically determined so that the robot could form object categories similar to those formed by human participants. The number of samples K in the Monte Carlo approximation for estimating IG was set to K = 5,000. The constant K was determined empirically, and the effect of K is examined in the experiments as well (see **Figure 11**).
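The resampling of a BoF histogram to a fixed total count (e.g., 100 for the visual modality) can be sketched as multinomial resampling proportional to the original histogram. This is a hedged sketch, not the authors' implementation; the seeded RNG is an assumption made for reproducibility.

```python
import random

def resample_bof(hist, target_count, seed=0):
    """Resample a BoF histogram so that its entries sum to target_count."""
    rng = random.Random(seed)
    total = sum(hist)
    # Cumulative distribution of the normalized histogram.
    cumulative, acc = [], 0.0
    for h in hist:
        acc += h / total
        cumulative.append(acc)
    resampled = [0] * len(hist)
    for _ in range(target_count):
        u = rng.random()
        for idx, c in enumerate(cumulative):
            if u <= c:
                resampled[idx] += 1
                break
        else:
            resampled[-1] += 1  # guard against floating-point shortfall
    return resampled
```

Whatever the original number of descriptors per image or sound clip, the resampled histogram always carries the fixed count N<sup>m</sup> for its modality.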

#### 5.1.5. Target Objects

As target objects, 17 types of commodities, shown in **Figure 4**, were prepared for the experiment. For each object type, one object was provided for obtaining training data, i.e., data for object categorization, and another object was provided for obtaining test data, i.e., data for active perception. Each index on the right-hand side of the figure indicates the index of each object. The hardness of the balls, the striking sounds of the cups, and the sounds made while shaking the bottles differed depending on the object categories; therefore, ground-truth categorization could not be achieved using visual information alone.

# 5.2. Procedure

The experimental procedure was as follows. First, the robot formed object categories through multimodal categorization in an unsupervised manner. An experimenter placed each object in front of the robot one by one. In this training phase, two objects of each object type were provided. The robot looked at an object to obtain visual features, grasped it to obtain haptic features, shook it to obtain auditory shaking features, and hit it to obtain auditory striking features. After the multimodal information of the objects had been obtained as a training dataset, the MHDP was trained using a Gibbs sampler. The results of multimodal categorization are shown in **Figure 4**, where the category that has the highest posterior probability for each object is shown in white. These results show that the robot can form multimodal object categories using the MHDP, as described in Nakamura et al. (2011b). After the robot had formed the object categories, we fixed the latent variables for the training dataset<sup>3</sup>.

Second, the experimental procedure for active perception was conducted. An experimenter placed an object in front of the robot. The robot observed the object using its camera, obtained visual information, and set **mo**<sub>j</sub> = {m<sup>v</sup>}. One object of each type shown in **Figure 4** was provided to the robot one by one; therefore, 17 objects were used to evaluate each active perception strategy. The sequential action selection and object recognition were performed once per object. At each step of the sequential action selection, the Gibbs sampler for the MHDP was run to update the latent variables, i.e., the recognition state, of the MHDP. The robot then determined its next set of actions for recognizing the target object using the active perception strategies shown in Algorithms 1 and 2.

# 5.3. Results

#### 5.3.1. Selecting the Next Action

First, we describe the results of the first single action selection after obtaining visual information. In this experiment, the robot had three choices for its next action, i.e., m<sup>as</sup>, m<sup>ah</sup>, and m<sup>h</sup>. To evaluate the results of active perception, we used KL(P(k | X<sub>j</sub><sup>**M**</sup>), P(k | X<sub>j</sub><sup>**A**∪**mo**<sub>j</sub></sup>)), i.e., the distance between the posterior distribution over the object categories k in the final recognition state and that in the next recognition state, as an evaluation criterion on behalf of KL(P(**z**<sub>j</sub> | X<sub>j</sub><sup>**M**</sup>), P(**z**<sub>j</sub> | X<sub>j</sub><sup>**A**∪**mo**<sub>j</sub></sup>)), the original evaluation criterion in (4). The computational cost of numerically evaluating KL(P(**z**<sub>j</sub> | X<sub>j</sub><sup>**M**</sup>), P(**z**<sub>j</sub> | X<sub>j</sub><sup>**A**∪**mo**<sub>j</sub></sup>)) using a Monte Carlo method is too high, because **z**<sub>j</sub> = {{k<sub>jt</sub>}<sub>1≤t≤T<sub>j</sub></sub>, {t<sup>m</sup><sub>jn</sub>}<sub>m∈**M**, 1≤n≤N<sup>m</sup><sub>j</sub></sub>} has many variables and the posterior distribution over **z**<sub>j</sub> is very complex.
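The category-level evaluation criterion above reduces to a discrete KL divergence between two posteriors over categories. A minimal sketch (not the authors' implementation), with distributions given as plain probability lists and a small epsilon assumed for numerical safety:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same categories."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

Here `p` would be the posterior over categories in the final recognition state and `q` the posterior in the next recognition state; a smaller value means the selected action brought the recognition state closer to the final state.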

**Figure 5** (Top) shows samples of the KL divergence between the posterior probabilities of the category after obtaining the information from all modalities and after obtaining only visual information.

For some objects, e.g., objects 6 and 7, the figure suggests that visual information alone is sufficient for the robot to recognize the objects, compared with the other objects<sup>4</sup>. For many objects, however, visual information alone could not bring the recognition state to the final state, which could be reached using the information of all modalities. **Figure 5** (Middle) shows samples of IG<sup>m</sup> calculated using the visual information for each action. **Figure 5** (Bottom) shows the KL divergence between the final recognition state and the posterior probability estimated after obtaining visual information and the information of each selected action. We observe that an action with a higher value of IG<sup>m</sup> tended to further reduce the KL divergence, as Theorem 1 suggests.

<sup>3</sup>The collected datasets for this experiment can be found on GitHub: https://github.com/tanichu/data-active-perception-hmdp

<sup>4</sup>Note that we currently do not have a good KL-divergence criterion for determining whether further actions are necessary.

**Figure 6** shows the average KL divergence for the final recognition state after executing an action selected by the IG<sup>m</sup> criterion. Actions IG.min, IG.mid, and IG.max denote the actions that have the minimum, middle, and maximum values of IG<sup>m</sup>, respectively. These results show that IG.max clearly reduced the uncertainty about the target objects.
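The single-step selection rule can be sketched as follows: among the candidate actions, the one with the maximum estimated IG<sup>m</sup> is executed. The action names and IG values below are purely illustrative.

```python
def select_next_action(ig_estimates):
    """Pick the action whose estimated information gain IG^m is maximal.

    ig_estimates: dict mapping action name -> estimated IG^m.
    """
    return max(ig_estimates, key=ig_estimates.get)
```

In the experiment the IG estimates come from the Monte Carlo procedure with K samples; this rule corresponds to the IG.max condition in **Figure 6**.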

The precision of category recognition after an action execution is summarized in **Table 1**. In the MHDP, a category recognition result is obtained as the posterior distribution (3); for illustrative purposes, the category with the highest posterior probability is regarded as the recognition result in **Table 1**. Obtaining information by executing IG.max almost always improved recognition performance.

Examples of changes in the posterior distribution are shown in **Figure 7** (Left, Right) for objects 8 ("metal cup") and 12 ("empty plastic bottle"), respectively. The robot could not clearly recognize the category of object 8 after obtaining visual information. The IG<sup>m</sup> values in **Figure 5** show that m<sup>ah</sup> was IG.max for the 8th object. **Figure 7** (Left) shows that m<sup>ah</sup> reduced the uncertainty and allowed the robot to correctly recognize the object as category 6, a metal cup. This means that the robot noticed that the target object was a metal cup by hitting it and listening to its metallic sound. The metal cup did not make a sound when the robot shook it; therefore, the IG for m<sup>as</sup> was small. As **Figure 7** (Right) shows, the robot first recognized the 12th object as a plastic bottle containing bells with high probability and as an empty plastic bottle with low probability. **Figure 5** shows that the IG<sup>m</sup> criterion suggested m<sup>ah</sup> as the first alternative and m<sup>as</sup> as the second alternative. **Figure 7** (Right) shows that m<sup>as</sup> and m<sup>ah</sup> could determine that the target object was an empty plastic bottle, but m<sup>h</sup> could not.

As humans, we would expect to differentiate an empty bottle from a bottle containing bells by shaking or hitting the bottle, and to differentiate a metal cup from a plastic cup by hitting it. The proposed active perception method constructively reproduced this behavior in a robotic system using an unsupervised multimodal machine learning approach.

TABLE 1 | Number of successfully recognized objects.

#### 5.3.2. Selecting the Next Set of Multiple Actions

We evaluated the greedy and lazy greedy algorithms for sequential decision making in active perception. The KL divergence from the final state, averaged over all target objects at each step, is shown in **Figure 8**. For each condition, the KL divergence gradually decreased and reached almost zero; however, the rate of decrease notably differed. As the theory of submodular optimization suggests, the greedy algorithm was shown to be a good solution: better than the average and only slightly worse than the best case (Nemhauser et al., 1978). The best and worst cases were selected after all types of sequential actions had been performed, and the "average" is the average KL divergence over all possible types of sequential actions. The results of the lazy greedy algorithm were almost the same as those of the greedy algorithm, as Minoux (1978) suggested.
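A minimal sketch of lazy greedy selection, assuming a placeholder `estimate_ig(action, selected)` that stands in for the Monte Carlo IG estimator: because IG is submodular, a gain computed for an earlier (smaller) selection is an upper bound on the current gain, so a cached value only needs re-evaluation when it reaches the top of the priority queue. All names here are illustrative.

```python
import heapq

def lazy_greedy(actions, estimate_ig, budget):
    """Select up to `budget` actions by lazy greedy submodular maximization."""
    selected = []
    # Max-heap via negated gains; each entry records the size of the selected
    # set at the time its gain was evaluated.
    heap = [(-estimate_ig(a, selected), a, 0) for a in actions]
    heapq.heapify(heap)
    while len(selected) < budget and heap:
        neg_gain, action, eval_size = heapq.heappop(heap)
        if eval_size == len(selected):
            # Gain is up to date: by submodularity, no stale entry can beat it.
            selected.append(action)
        else:
            # Stale upper bound: re-evaluate against the current selection.
            heapq.heappush(heap, (-estimate_ig(action, selected), action, len(selected)))
    return selected
```

The plain greedy algorithm would instead re-evaluate every remaining action at every step; laziness skips most of those re-evaluations while returning the same sequence when the estimates are exact.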

The sequential behaviors of IG<sup>m</sup> were observed to determine whether they were consistent with our theories. For example, the changes in IG<sup>m</sup> at each step, as the robot sequentially selected actions to perform on object 10 using the greedy algorithm, are shown in **Figure 9**. Theorem 2 shows that the IG is a submodular function, which predicts that IG<sup>m</sup> decreases monotonically as new actions are executed in active perception. When the robot had obtained only visual information (v only in **Figure 9**), all values of IG<sup>m</sup> were still large. After m<sup>ah</sup> was executed on the basis of the greedy algorithm, IG<sup>mah</sup> became zero; at the same time, IG<sup>mas</sup> and IG<sup>mh</sup> decreased. In the same way, all values of IG<sup>m</sup> decreased monotonically.

**Figure 10** shows the time series of the posterior probability of the category for object 10 during sequential active perception. Using only visual information, the robot misclassified the target object as a plastic bottle containing bells (category 3). With the action sequence in reverse order, the robot could not recognize the object as a steel can at its first step and instead changed its recognition state to an empty plastic bottle; only after the second action, i.e., grasping (m<sup>h</sup>), did it recognize the object as a steel can. In contrast, the greedy algorithm could determine that the target object was in category 4, i.e., steel can, with its first action.

The effect of the number of samples K in the Monte Carlo approximation was also examined. **Figure 11** shows the relation between K and the standard deviation of the estimated IG<sup>m</sup> for the 15th object for each action after obtaining a visual image. The figure shows that the estimation error gradually decreases as K increases. Roughly speaking, K ≥ 1,000 seems to be required for an appropriate estimate of IG<sup>m</sup> in our experimental setting. Evaluating IG<sup>m</sup> required less than 1 s, which is far shorter than the time required for action execution by a robot. This means that our method can be used in real time.
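The Monte Carlo approximation discussed above can be sketched as follows: draw K hypothetical next observations from the predictive distribution, and average the KL divergences between the updated and current category posteriors. `sample_observation` and `posterior_after` are placeholders for the MHDP machinery; only the averaging structure follows the text.

```python
import math

def monte_carlo_ig(current_posterior, sample_observation, posterior_after, K=5000):
    """Estimate IG^m by averaging K sampled KL divergences."""
    def kl(p, q, eps=1e-12):
        return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

    total = 0.0
    for _ in range(K):
        x = sample_observation()                  # simulate one next observation
        updated = posterior_after(x)              # posterior given that observation
        total += kl(updated, current_posterior)   # expected KL equals the IG
    return total / K
```

Because the estimate is an average of K i.i.d. samples, its standard deviation shrinks roughly as 1/√K, which matches the trend reported for **Figure 11**.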

These empirical results show that the proposed active perception method allowed a robot to sequentially select appropriate actions for recognizing an object in a real-world environment and in real time. The theoretical results were thus supported, even in a real-world setting.

# 6. EXPERIMENT 2: SYNTHETIC DATA

In experiment 1, the numbers of classes, actions, and modalities, as well as the size of the dataset, were limited. In addition, it was difficult to control the robotic experimental settings so as to check some interesting theoretical properties of our proposed method. Therefore, we performed a supplemental experiment, experiment 2, using synthetic data comprising 21 object types, 63 objects, and 20 actions, i.e., modalities.

First, we checked the validity of our active perception method when the number of types of actions increases. Second, we checked how the method worked when two classes were assigned to the same object. Although the MHDP can categorize an object into two or more categories in a probabilistic manner, each object was classified into a single category in the previous experiment.

# 6.1. Conditions

A synthetic dataset was generated using the generative model that the MHDP assumes (see **Figure 2**). We prepared 21 virtual object classes, and three objects were generated from each class, i.e., 63 objects in total. Among the object classes, 14 were "pure" and seven were "mixed." For each pure object class, a multinomial distribution was drawn from the Dirichlet distribution corresponding to each modality. We set the number of modalities to M = 20. The hyperparameters of the Dirichlet distributions of the modalities were set to α<sub>0</sub><sup>m</sup> = 0.4(m − 1) for m > 1; for m = 1, we set α<sub>0</sub><sup>1</sup> = 10. For each mixed object class, a multinomial distribution for each modality was prepared by mixing the distributions of two pure object classes. Specifically, the multinomial distribution for the i-th mixed object class was obtained by averaging those of the (2i − 1)-th and 2i-th pure object classes. The observations for each modality of each object were drawn from the multinomial distributions corresponding to the object's class. The count of the BoFs for each modality was set to 20. Finally, 42 pure virtual objects and 21 mixed virtual objects were generated.

FIGURE 7 | (Left) Posterior probability of the category for object 8 after executing each action. These results show that the action with the highest information gain, i.e., m<sup>ah</sup>, allowed the robot to efficiently estimate that the true object category was "metal cup". (Right) Posterior probability of the category for object 12 after executing each action. These results show that the actions with the highest and second highest information gain, i.e., m<sup>ah</sup> and m<sup>as</sup>, allowed the robot to efficiently estimate that the true object category was "empty plastic bottle".
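The Dirichlet-multinomial generative structure described above can be sketched as follows. The dictionary size `DIM` is an illustrative assumption (the text does not specify it); the count of 20 observations per modality matches the text, and all helper names are our own.

```python
import random

rng = random.Random(0)
DIM = 10      # assumed BoF dictionary size per modality (illustrative)
COUNT = 20    # observations per modality, as stated in the text

def sample_dirichlet(alpha, dim=DIM):
    """Draw multinomial parameters from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def mix(p, q):
    """Multinomial parameters of a mixed class: the average of two pure classes."""
    return [(a + b) / 2 for a, b in zip(p, q)]

def sample_counts(theta, n=COUNT):
    """Draw a BoF count vector of total n from multinomial(theta)."""
    counts = [0] * len(theta)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for idx, t in enumerate(theta):
            acc += t
            if u <= acc:
                counts[idx] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point shortfall
    return counts
```

A pure object is generated by drawing one `sample_dirichlet` parameter vector per modality and then sampling counts; a mixed object uses `mix` over two pure classes' parameter vectors before sampling.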

The experiment was performed almost in the same way as experiment 1. First, multimodal categorization was performed for the 63 virtual objects, and 14 categories were successfully formed in an unsupervised manner. The posterior distributions over the object categories are shown in **Figure 12**. Generally speaking, mixed objects were categorized into two or more classes. After categorization, a virtual robot was asked to recognize all of the target objects using the proposed active perception method.

# 6.2. Results

We compared the greedy, lazy greedy, and random algorithms for the active perception sequential decision-making process. The random algorithm is a baseline method that randomly selects the next action from the remaining actions that have not yet been taken. In other words, the random algorithm corresponds to a robot that does not employ any active perception algorithm.

The KL divergence from the final state, averaged over all target objects at each step, is shown in **Figure 13**. For each condition, the KL divergence gradually decreased and reached almost zero, although the rate of decrease differed. The greedy and lazy greedy algorithms were clearly better solutions on average than the random algorithm. In contrast with experiment 1, the best and worst cases could not practically be computed because of the prohibitive computational cost. Interestingly, the lazy greedy algorithm had almost the same performance as the greedy algorithm, as the theory suggests, while its laziness reduced the computational cost in practice.

FIGURE 12 | Categorization results for the posterior probability distributions for each object.

The number of times the robot evaluated IG<sup>m</sup> to determine the action sequences for all executable counts of actions L = 1, 2, . . . , M was summarized for each method. The lazy greedy algorithm required 71.7 evaluations per target object on average (SD = 5.2), whereas the greedy algorithm required 190. Theoretically, the greedy and lazy greedy algorithms require O(M<sup>2</sup>) evaluations, and in practice the number of re-evaluations needed by the lazy greedy algorithm is quite small. In contrast, a brute-force algorithm requires O(2<sup>M</sup>) evaluations, i.e., far more evaluations of IG.

Next, a case in which two classes were assigned to the same object was investigated. The target dataset contained "mixed" objects. The results also imply that our method works well even when two classes are assigned to the same object. This is because our theory is completely derived on the basis of the probabilistic generative model, i.e., the MHDP. We show a typical result. **Figure 14** shows the time series of the posterior probability of the category for object 51, i.e., one of the mixed objects, during sequential active perception. This shows that the greedy and lazy greedy algorithms quickly categorized the target object into two categories "correctly." Our formulation assumes the categorization result to be a posterior distribution. Therefore, this type of probabilistic case can be treated naturally.

# 7. CONCLUSION AND DISCUSSION

In this paper, we described an MHDP-based active perception method for robotic multimodal object category recognition. We formulated a new active perception method on the basis of the MHDP (Nakamura et al., 2011b).

First, we proposed an action selection method based on the IG criterion and showed that IG is an optimal criterion for active perception from the viewpoint of reducing the expected KL divergence between the final and current recognition states. Second, we proved that the IG has a submodular property and reduced the sequential active perception problem to a submodular maximization problem. Third, we derived a Monte Carlo approximation method for evaluating IG efficiently, making the action selection method executable. Given these theoretical results, we proposed using the greedy and lazy greedy algorithms for selecting a set of actions for active perception. It is important to note that all three theoretical contributions mentioned above were naturally derived from the characteristics of the MHDP; they are clearly a result of its theoretical soundness. In this sense, our theorems reveal a new advantage of the MHDP that several other heuristic multimodal object categorization methods do not have.

To evaluate the proposed methods empirically, we conducted experiments using an upper-torso humanoid robot and a synthetic dataset. Our results showed that the method enables the robot to actively select actions and recognize target objects quickly and accurately.

One of the most interesting points of this paper is that not only object categories but also an action selection strategy for object recognition can be formed in an unsupervised manner. From the viewpoint of cognitive developmental robotics, providing an unsupervised learning model that bridges the development of perceptual and action systems is meaningful for shedding new light on the computational understanding of cognitive development (Asada et al., 2009; Cangelosi and Schlesinger, 2015). The coupling of action and perception is believed to be important for an embodied cognitive system (Pfeifer and Scheier, 2001).

The advantage of this paper compared with the related works in robotics is that our action selection method for multimodal category recognition has a clear theoretical basis and is tightly connected to the computational model for multimodal object categorization, i.e., MHDP. The theoretical basis gives the method preferable characteristics, i.e., theoretical guarantee.

However, note that the theoretical guarantee holds only when IG is estimated correctly. We assumed that the outcome of each action is deterministic and fully observable when applying the theory of submodular optimization to active perception in multimodal categorization. In practice, however, the observations X<sup>m</sup> and IG are measured probabilistically because of real-world uncertainty and the Monte Carlo approximation; for example, IG is only approximately estimated at each step of the greedy and lazy greedy algorithms. Theoretically, given this approximation in evaluating the objective being maximized, the (1 − 1/e) bound no longer holds. Streeter and Golovin (2009) proposed introducing an additional penalty based on the approximation error of the objective function. Golovin and Krause (2011) extended submodularity to adaptive submodularity to account for such stochasticity. Although we discussed the proposed method from the viewpoint of submodular optimization, the algorithm can also be regarded as a version of sequential information maximization (Chen et al., 2015). Extending our method by drawing on adaptive submodularity and/or sequential information maximization is a future challenge.

We also assumed that each action incurs the same cost and tried to reduce the number of actions in active perception, i.e., to maximize the performance of perception with a fixed number of actions. In practice, however, each action, e.g., shaking, hitting, and looking, requires a different duration and a different amount of energy. Therefore, the practical cost is not always the number of actions but the total cost of the actions. Zhang et al. (2017) attempted to deal with this problem in the context of multimodal object identification. This problem leads to a knapsack-like formulation, and this type of submodular optimization has been studied by many researchers (Streeter and Golovin, 2009; Zhou et al., 2013). Our method could be extended in a similar way.

In addition to active perception, active "learning/exploration" for multimodal categorization is also an important research topic. It takes far longer for a robot to gather multimodal information to form multimodal object categories from a massive number of daily objects than to recognize a new object. If a robot can notice that "the object is obviously a sample of a learned category," the robot need not obtain knowledge about object categories from that object. In contrast, if a target object appears to be completely new to the robot, the robot should carefully interact with it to obtain multimodal information. Such a scenario could be achieved by developing an active "learning/exploration" method for multimodal categorization, which could likely be obtained by extending our proposed active perception method.

Considering more complex categorization scenarios is another future challenge. For example, Schenck et al. (2014) deals with a more complex categorization scenario, i.e., 36 plastic containers with identical shapes, 3 colors, 4 types of contents, and 3 different amounts of those contents. In this paper, we used the MHDP, which assumes that an object is classified into a single object category, and infers the posterior distribution over categories. When we consider human cognition, we find that object categories have more complex characteristics: object categories have a hierarchical structure, an object can be categorized into several classes, and categories differ in their modality dependency depending on their types. Unsupervised machine learning methods for such complex categorization problems have been proposed by several researchers on the basis of hierarchical Bayesian models (Griffiths and Ghahramani, 2006; Ando et al., 2013; Nakamura et al., 2015). Theoretically, the main assumptions we used were that the MHDP is a hierarchical Bayesian model and that action selection corresponds to obtaining an observation, which is a probabilistic variable on a leaf node of its graphical model. Therefore, by applying the same idea to more complex categorization methods, we should be able to extend our theory to more complex categorization problems. This is one of our future works.

Another challenge lies in the feature representation for multimodal categorization. The MHDP assumes that observations are given as bag-of-features representations. However, there are many kinds of feature representations for visual, auditory, and haptic information; in particular, the feature extraction capability of deep neural networks has recently been gathering attention. Theoretically, our main theorems do not depend on the type of emission distribution, i.e., on bag-of-features representations. It is likely that the same approach can be used even when a multimodal categorization method uses different feature representations, e.g., the features in the last hidden layer of a pre-trained deep neural network. This extension is also a part of our future challenges.

In addition, the MHDP model treated in this paper assumes that an action for perception is related to only one modality, e.g., grasping corresponds only to m<sup>h</sup>. In reality, however, when we interact with an object through a specific action, e.g., grasping, shaking, or hitting, we obtain rich information related to various modalities. For example, when we shake a box to obtain auditory information, we also unwittingly obtain haptic information and information about its weight. The tight linkage between modality information and an action is a type of approximation adopted in this research. Extending our model and the MHDP to treat actions that are related to multiple modalities is also a task for future work.

# REFERENCES


# AUTHOR CONTRIBUTIONS

The main theory was developed by TaT. The experiments were conceived by RY. The data were analyzed by RY and ToT with help of TaT. The manuscript was written by TaT.

# FUNDING

This research was partially supported by Tateishi Science and Technology Foundation, and JST, CREST. This was also partially supported by a Grant-in-Aid for Scientific Research on Innovative Areas (16H06569) and a Grant-in-Aid for Young Scientists (B) (24700233) funded by the Ministry of Education, Culture, Sports, Science, and Technology.

# ACKNOWLEDGMENTS

The authors would like to thank undergraduate student Takuya Takeshita and graduate student Hajime Fukuda of Ritsumeikan University, who helped us develop the experimental instruments for obtaining our preliminary results.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Taniguchi, Yoshino and Takano. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# APPENDIX A: PROOF OF THE OPTIMALITY OF THE PROPOSED ACTIVE PERCEPTION STRATEGY

In this appendix, we show that the proposed active perception strategy, which maximizes the expected KL divergence between the current state and the posterior distribution of **z**<sup>j</sup> after a selected set of actions, minimizes the expected KL divergence between the next and final states.

$$\begin{split} \mathbf{A}_{j}^{*} &= \operatorname*{argmin}_{\mathbf{A} \in \mathbf{F}_{L}} \mathbb{E}_{X_{j}^{\mathbf{M} \cup \mathbf{m}_{\mathcal{O}_{j}}} \mid X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}} \left[ \operatorname{KL} \left( P(\mathbf{z}_{j} | X_{j}^{\mathbf{M}}), P(\mathbf{z}_{j} | X_{j}^{\mathbf{A} \cup \mathbf{m}_{\mathcal{O}_{j}}}) \right) \right] \\ &= \operatorname*{argmin}_{\mathbf{A} \in \mathbf{F}_{L}} \sum_{X_{j}^{\mathbf{M} \cup \mathbf{m}_{\mathcal{O}_{j}}}} \sum_{\mathbf{z}_{j}} \left[ P(X_{j}^{\mathbf{M} \cup \mathbf{m}_{\mathcal{O}_{j}}} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) P(\mathbf{z}_{j} | X_{j}^{\mathbf{M}}) \log \frac{P(\mathbf{z}_{j} | X_{j}^{\mathbf{M}})}{P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}, X_{j}^{\mathbf{A}})} \right] \end{split} \tag{A1}$$

The numerator inside the log function does not depend on **A**; therefore, the term related to the numerator can be dropped. In addition, by negating the remaining term, we obtain

$$\begin{split} \text{(A1)} &= \operatorname*{argmax}_{\mathbf{A} \in \mathbf{F}_{L}^{\mathbf{m}_{\mathcal{O}_{j}}}} \sum_{X_{j}^{\mathbf{M} \cup \mathbf{m}_{\mathcal{O}_{j}}}} \sum_{\mathbf{z}_{j}} P(X_{j}^{\mathbf{M} \cup \mathbf{m}_{\mathcal{O}_{j}}} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) P(\mathbf{z}_{j} | X_{j}^{\mathbf{M}}) \log P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}, X_{j}^{\mathbf{A}}) \\ &= \operatorname*{argmax}_{\mathbf{A} \in \mathbf{F}_{L}^{\mathbf{m}_{\mathcal{O}_{j}}}} \sum_{X_{j}^{\mathbf{M} \cup \mathbf{m}_{\mathcal{O}_{j}}}} \sum_{\mathbf{z}_{j}} P(\mathbf{z}_{j}, X_{j}^{\mathbf{M} \cup \mathbf{m}_{\mathcal{O}_{j}}} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) \log P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}, X_{j}^{\mathbf{A}}). \end{split} \tag{A2}$$

By marginalizing $X_{j}^{\mathbf{M} \setminus (\mathbf{m}_{\mathcal{O}_{j}} \cup \mathbf{A})}$ out of (A2), we obtain

$$\begin{split} \text{(A2)} &= \operatorname*{argmax}_{\mathbf{A} \in \mathbf{F}_{L}^{\mathbf{m}_{\mathcal{O}_{j}}}} \sum_{X_{j}^{\mathbf{A}}} \sum_{\mathbf{z}_{j}} P(\mathbf{z}_{j}, X_{j}^{\mathbf{A}} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) \log P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}, X_{j}^{\mathbf{A}}) \\ &= \operatorname*{argmax}_{\mathbf{A} \in \mathbf{F}_{L}^{\mathbf{m}_{\mathcal{O}_{j}}}} \Bigg[ \sum_{X_{j}^{\mathbf{A}}} \sum_{\mathbf{z}_{j}} P(\mathbf{z}_{j}, X_{j}^{\mathbf{A}} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) \log P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}, X_{j}^{\mathbf{A}}) - \underbrace{\sum_{\mathbf{z}_{j}} P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) \log P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}})}_{\text{constant w.r.t. } \mathbf{A}} \Bigg] \\ &= \operatorname*{argmax}_{\mathbf{A} \in \mathbf{F}_{L}^{\mathbf{m}_{\mathcal{O}_{j}}}} \Bigg[ \sum_{X_{j}^{\mathbf{A}}} \sum_{\mathbf{z}_{j}} P(X_{j}^{\mathbf{A}} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}, X_{j}^{\mathbf{A}}) \log P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}, X_{j}^{\mathbf{A}}) \\ &\qquad - \sum_{X_{j}^{\mathbf{A}}} \sum_{\mathbf{z}_{j}} P(X_{j}^{\mathbf{A}} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}, X_{j}^{\mathbf{A}}) \log P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) \Bigg] \\ &= \operatorname*{argmax}_{\mathbf{A} \in \mathbf{F}_{L}^{\mathbf{m}_{\mathcal{O}_{j}}}} \sum_{X_{j}^{\mathbf{A}}} P(X_{j}^{\mathbf{A}} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) \operatorname{KL} \left( P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}, X_{j}^{\mathbf{A}}), P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) \right) \\ &= \operatorname*{argmax}_{\mathbf{A} \in \mathbf{F}_{L}^{\mathbf{m}_{\mathcal{O}_{j}}}} \mathbb{E}_{X_{j}^{\mathbf{A}} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}} \left[ \operatorname{KL} \left( P(\mathbf{z}_{j} | X_{j}^{\mathbf{A} \cup \mathbf{m}_{\mathcal{O}_{j}}}), P(\mathbf{z}_{j} | X_{j}^{\mathbf{m}_{\mathcal{O}_{j}}}) \right) \right]. \end{split}$$

# Affordance Equivalences in Robotics: A Formalism

Mihai Andries<sup>1\*†</sup>, Ricardo Omar Chavez-Garcia<sup>2†</sup>, Raja Chatila<sup>3</sup>, Alessandro Giusti<sup>2</sup> and Luca Maria Gambardella<sup>2</sup>

1 Institute for Systems and Robotics (ISR-Lisboa), Instituto Superior Técnico, Lisbon, Portugal, <sup>2</sup> Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, USI-SUPSI, Lugano, Switzerland, <sup>3</sup> Institut des Systèmes Intelligents et de Robotique, Sorbonne Université, Centre National de la Recherche Scientifique, Paris, France

Automatic knowledge grounding is still an open problem in cognitive robotics. Recent research in developmental robotics suggests that a robot's interaction with its environment is a valuable source for collecting such knowledge about the effects of the robot's actions. A useful concept for this process is that of an affordance, defined as a relationship between an actor, an action performed by this actor, an object on which the action is performed, and the resulting effect. This paper proposes a formalism for defining and identifying affordance equivalence. By comparing the elements of two affordances, we can identify equivalences between affordances, and thus acquire grounded knowledge for the robot. This is useful when changes occur in the set of actions or objects available to the robot, allowing it to find alternative paths to reach its goals. In the experimental validation phase, we verify whether the recorded interaction data are coherent with the identified affordance equivalences. This is done by querying a Bayesian Network that serves as a container for the collected interaction data, and verifying that affordances considered equivalent yield the same effect with high probability.

#### Edited by:

Tadahiro Taniguchi, Ritsumeikan University, Japan

#### Reviewed by:

Ashley Kleinhans, Ford Motor Company (United States), United States

Lola Cañamero, University of Hertfordshire, United Kingdom

#### \*Correspondence:

Mihai Andries mandries@isr.tecnico.ulisboa.pt

†These authors have contributed equally to this work.

Received: 30 September 2017 Accepted: 16 May 2018 Published: 08 June 2018

#### Citation:

Andries M, Chavez-Garcia RO, Chatila R, Giusti A and Gambardella LM (2018) Affordance Equivalences in Robotics: A Formalism. Front. Neurorobot. 12:26. doi: 10.3389/fnbot.2018.00026

Keywords: affordance, learning, cognitive robotics, symbol grounding, affordance equivalence

# 1. INTRODUCTION

Symbolic grounding of robot knowledge consists in creating relationships between the symbolic concepts used by algorithms controlling the robot and the physical concepts to which they correspond (Harnad, 1990). An affordance is a concept that allows collection of grounded knowledge. The notion of affordance was introduced by Gibson (1977), and refers to the action opportunities provided by the environment. In the context of robotics, an affordance is a relationship between an actor (i.e., robot), an action performed by the actor, an object on which this action is performed, and the observed effect.

A robot able to discover and learn the affordances of an environment can autonomously adapt to it. Moreover, a robot that can detect equivalences between affordances can quickly compute alternative plans for reaching a desired goal, which is useful when some actions or objects suddenly become unavailable.

In this paper, we introduce a method for identifying affordances that generate equivalent effects (see examples in **Figures 1**, **2**). We define a (comparison) operator that allows robots to identify equivalence relationships between affordances by analysing their constituent elements (i.e., actors, objects, actions).

# 1.1. Affordance Discovery and Learning

All methods proposed in the literature for affordance learning are similar in viewing an interaction as being composed of three components: an action, a target object, and a resulting effect. Different methods were proposed to infer the expected effect, given knowledge about the action and target object.

Several papers approached affordance learning as learning to predict object motion after interaction. For this purpose, Krüger et al. (2011) employed a feedforward neural network with backpropagation which learned so-called object-action complexes; Hermans et al. (2013) used Support Vector Machines (SVMs) with kernels; while Kopicki et al. (2017) employed Locally Weighted Projection Regression (LWPR) with Kernel Density Estimation and a mixture of experts. Ridge et al. (2009) first used a Self-Organising Map and clustering in the effect space to classify objects by their effect, and then trained an SVM which identified the cluster to which an object belongs using its feature-vector description.

Other papers addressed affordance learning from the perspective of object grasping and manipulation. Stoytchev (2005) employed detection of invariants to learn object grasping affordances. Ugur et al. (2012) used SVMs to study the traversability affordance of a robot. Katz et al. (2014) used linear SVMs to learn to perceive object affordances for autonomous pile manipulation. More details on the use of affordances for object manipulation can be found in the dissertation of Hermans (2014).

Some works followed a supervised training approach, providing hand-labeled datasets which mapped object images (2D or RGB-D) to their affordances. Myers et al. (2015) learned affordances from local shape and geometry primitives using Superpixel-based Hierarchical Matching Pursuit (S-HMP) and Structured Random Forests (SRF). Image regions (from RGB-D frames) with pre-selected properties were tagged with specific affordance labels; for instance, a surface region with high convexity was labeled as containable (or a variation of it). Varadarajan and Vincze (2012) proposed an Affordance Network providing affordance knowledge ontologies for common household articles, intended to be used for object recognition and manipulation. An overview of machine learning approaches for detecting affordances of tools in 3D visual data is available in the thesis of Ciocodeica (2016).

Another approach for learning affordances uses Bayesian Networks. Montesano et al. (2008) and Moldovan et al. (2012) employed a graphical model approach for learning affordances, using a Bayesian Network which represents objects/actions/effects as random variables, and which encodes the relations between them as dependency links. The structure of this network is learned from the data of the robot's interaction with the world and from a priori information about the dependency of some variables. Once learned, affordances encoded in this way can (1) predict the effect of an action applied to a given object, (2) infer which action on a given object generated an observed effect, and (3) identify which object generates the desired effect for a given action.

Yet another popular method for supervised affordance learning uses Deep Learning techniques. For instance, Nguyen et al. (2016) trained a convolutional neural network to identify object affordances in RGB-D images, employing a dataset of object images labeled pixelwise with their corresponding affordances. A similar approach using a deep convolutional neural network was taken by Srikantha and Gall (2016).

Recent comprehensive overviews of affordance learning techniques are available in the dissertation of Moldovan (2015), and in reviews by Jamone et al. (2016), Min et al. (2016), and Zech et al. (2017).

We argue that once affordances are learned, we can find relations between affordances by considering the effects they generate. One of these relations is equivalence, i.e., when two different affordances specify corresponding actions on objects that generate the same effect.

# 1.2. Affordance Equivalence

Affordance equivalence was studied by Şahin et al. (2007), who considered relationships between single elements of an affordance. Thus, it was possible to identify objects or actions that are equivalent with respect to an affordance when they generate the same effect. Griffith et al. (2012) employed clustering to identify classes of objects that have similar functional properties. Montesano et al. (2008) and Jain and Inamura (2013) treated affordance equivalence from a probabilistic point of view where, in the context of imitation learning, the robot searches for the combination of action and effect that maximises their similarity to the demonstrated action on an object. Boularias et al. (2015) discovered through reinforcement learning the graspability affordance over objects with different shapes, and indirectly showed the equivalence of the grasp action.

Developing this line of thought, we propose a probabilistic method to identify which combinations of affordance elements generate equivalent effects. We first present in section 2 the affordance formalization employed, and based on that we then list in section 2.4 all the possible types of affordance equivalences.

Since the purpose of this study is to identify equivalences between affordances that were already recorded by the robot, we are not seeking to explain how to record these affordances. In this paper we employed the graphical model approach for learning affordances proposed by Montesano et al. (2008). In addition, we rely purely on perception-interaction data, without using a priori information (Chavez-Garcia et al., 2016b). To facilitate the experimental setup, we used pre-defined sensorial and motor capabilities for our robots.

The remainder of this paper is organized as follows. In section 2, we introduce our formalization of affordance elements, and define the equivalence relationship in section 2.4. A series of experiments on the discovery of equivalences between affordances is detailed in section 3, together with the obtained results. We conclude and present opportunities for future work in section 4.

# 2. METHODOLOGY: AFFORDANCE FORMALIZATION

In this section, we present the affordance formalism employed throughout the paper. We follow the definition proposed by Ugur et al. (2011), that we enrich by including the actor performing the action into the affordance tuple (object, action, effect). The inclusion of the actor into the affordance allows robots to record affordances specific to their body morphologies. Although we will not focus on this aspect in this paper, it is possible to generalize this knowledge through a change of affordance perspective from robot joint space to object task space (more about this in section 2.1.2).

We define an affordance as follows. Let G be the set of actors in the environment, O the set of objects, A the set of actions, and E the set of observable effects. Hence, when an actor applies an action on an object, generating an effect, the corresponding affordance is defined as a tuple:

$$\alpha = (\textit{actor}, \textit{object}, \textit{action}, \textit{effect}), \text{ for } \textit{actor} \in \mathcal{G}, \textit{object} \in \mathcal{O}, \textit{action} \in \mathcal{A}, \text{ and } \textit{effect} \in \mathcal{E}, \tag{1}$$

and can be graphically represented as shown in **Figure 3**. From the actor's perspective, the actor interacts with the environment (the object) and discovers affordances. From the object's perspective, affordances are properties of objects which can be perceived by actors, and which are available to actors with specific capabilities. We can also consider observers, who learn by perceiving other actors' affordance acquisition process.
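As a concrete illustration, the affordance tuple of Equation (1) can be encoded as a small data structure. The element types below (strings for actors and actions, feature tuples for objects, detector readings for effects) are our own assumption for the sketch, not part of the formalism:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Affordance:
    """Sketch of the tuple in Equation (1): (actor, object, action, effect)."""
    actor: str                  # element of the actor set G
    obj: Tuple[str, ...]        # element of O, described by perceived features
    action: str                 # element of the action set A
    effect: Tuple[float, ...]   # element of E, effect-detector readings

# A hypothetical interaction: the Baxter robot pushes a blue box 15 cm.
alpha = Affordance(actor="Baxter", obj=("blue", "box"),
                   action="push", effect=(0.15,))
```

Making the tuple immutable (`frozen=True`) lets recorded interactions be stored in sets and compared element-wise, which is what the equivalence operator of section 2.4 requires.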

The way in which affordance elements are defined influences the operations that can be performed with affordances. Since we aim to establish equivalence relationships between affordances, we will analyse the definitions of the following affordance elements: actions (from actor and object perspectives), objects (as perceived by robot's feature detectors), and effects (seen as a description of the environment).

# 2.1. How Are Actions Defined?

Actions can be defined (1) relative to actors, by describing the body control sequence during the execution of an action in joint space; or (2) relative to objects, by describing the consequences of actions on the objects in operational space. We refer to the object perspective when actions are defined in the operational/task space, making their definition independent of the actor executing them. We refer to the actor perspective when actions are defined in the joint space of the actor, making them dependent on the actor executing them.

These two perspectives follow directly from the affordance definition in Equation (1).

#### 2.1.1. Actions Described Relative to Actors

Actions are here described relative to actors and their morphology. They are defined with respect to their control variables in joint space (i.e., joint positions, velocities, and accelerations), indexed by time τ:

$$action: \{Q, \dot{Q}, \ddot{Q}\}\_{\tau} \tag{2}$$

As the action is described with respect to the actor morphology and capabilities, comparing two actions requires comparing both the actors performing the actions, and the actions themselves. When the actors are identical, the action comparison is straightforward. However, when there is a difference between actors' morphologies (and their motor capabilities), the straightforward comparison of actions is not possible and a common frame of reference for such comparison is needed.

#### 2.1.2. Actions Described Relative to Objects

When actions are described relative to objects, they represent an action generalisation from the joint space of a particular actor (where actions are defined on the actor) to the operational space of any actor (where actions are defined on the object).

Thus, when actions are described relative to objects, the actor can be omitted from the affordance tuple, to indicate that any actor which has the required motor capabilities is able to generate the action which causes this effect. In addition, the action employed in this representation is defined in operational space (and not in joint space as before). Hence, dropping the actor from the equation, we can rewrite Equation (1) as:

$$\alpha = (\textit{object}, \textit{action}, \textit{effect}), \text{ for } \textit{object} \in \mathcal{O}, \textit{action} \in \mathcal{A}\_{o}, \text{ and } \textit{effect} \in \mathcal{E}, \tag{3}$$

where $\mathcal{A}\_{o}$ is the set of all actions in operational space applicable to object o.

While affordances defined from the actor perspective (in joint space, e.g., joint forces to apply) allow learning using the robot's motor and perceptual capabilities, affordances defined from the object perspective (in task space, e.g., forces applied on the object) allow generalising this knowledge.

# 2.2. How Are Objects Defined?

If an actor has the feature detectors p<sub>1</sub>, …, p<sub>n</sub> corresponding to its perception capacities (such as hue, shape, size), then an object is defined as:

$$object = \{p\_1, \ldots, p\_n\},\tag{4}$$

where each feature detector can be seen as function on a perceptual unit (e.g., a salient segment from a visual perception process).

# 2.3. How Are Effects Defined?

We suppose that an actor g has a set ξ of q effect detectors that are able to detect changes in the world after an action a ∈ A<sup>g</sup> is applied. For example, when an actor executes the action push on an object, the object-displacement-effect detector would be a function that computes the difference between two measurements of the object position, taken before and after the interaction. Another effect can be the difference in the feedback force measured at the end effector before and after the interaction. Formally, effects are a set of q salient changes in the world ω (i.e., in the target object, the actor, or the environment), detected by the robot's effect detectors ξ:

$$\text{effect} = \{\xi\_1(\alpha), \dots, \xi\_q(\alpha)\}\tag{5}$$
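A minimal sketch of Equation (5) in code, assuming the world state before and after an interaction is available as a dictionary of measurements; the detector names and state keys are illustrative, not from the paper:

```python
def displacement_detector(before, after):
    """Object-displacement effect: position difference across the interaction."""
    return tuple(a - b for a, b in zip(after["obj_pos"], before["obj_pos"]))

def force_detector(before, after):
    """Change in the feedback force measured at the end effector."""
    return after["ee_force"] - before["ee_force"]

def observe_effect(before, after, detectors):
    """effect = {xi_1(.), ..., xi_q(.)}: the outputs of all q effect detectors."""
    return {name: d(before, after) for name, d in detectors.items()}

# One hypothetical push interaction: the object moves 15 cm along x.
before = {"obj_pos": (0.50, 0.20), "ee_force": 1.0}
after  = {"obj_pos": (0.65, 0.20), "ee_force": 1.4}
effect = observe_effect(before, after,
                        {"displacement": displacement_detector,
                         "force_change": force_detector})
```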

# 2.4. Affordance Equivalence Operator

In this section, we introduce the concept of affordance equivalence, based on the formalization presented earlier in section 2. We provide truth tables for two different affordance comparison operators: one for the case where actions are defined in actor joint space, and one for the case where actions are defined in object task space. For each case, we explore the possible types of affordance equivalence.

We have defined an affordance as a tuple of type (actor, object, actionjoint\_space, effect) when the action is defined relative to the actor, or as a tuple of type (object, actionoperational\_space, effect) when the action is defined relative to the object. Let us now define the truth table for an operator for comparing affordances (one for the actor perspective, and one for the object perspective) and identifying equivalence relationships between them.

We consider equivalent two affordances that generate equivalent effects. To know when two effects are equivalent, an effect-comparison function is required. We define an equivalence function f(e<sub>a</sub>, e<sub>b</sub>) that yields true if two effect values e<sub>a</sub> and e<sub>b</sub> are similar in a common frame (e.g., distances for position values, similarity in color models, vector distances for force values). We detect affordance equivalence by (1) feeding the continuous (non-discretised) data on the measured effects to the Bayesian Network (BN) structure learning algorithm, and then (2) querying the BN over an observed effect to obtain the empirical decision on effect equivalence. Whenever two affordances generate equivalent effects, it is possible to find which affordance elements cause this equivalence. We distinguish several cases of affordance equivalence, depending on the elements which differ between two equivalent affordances; these are detailed below.
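One possible instantiation of the effect-comparison function f(e<sub>a</sub>, e<sub>b</sub>) is a vector-distance test. The Euclidean metric and the tolerance value are our assumptions; the paper only requires similarity in a common frame:

```python
import math

def effects_equivalent(e_a, e_b, tol=0.05):
    """f(e_a, e_b): true if two effect vectors are similar in a common frame."""
    return math.dist(e_a, e_b) < tol

# Two measured displacements that differ by 1 cm count as the same effect;
# a 15 cm difference does not.
assert effects_equivalent((0.15, 0.00), (0.14, 0.00))
assert not effects_equivalent((0.15, 0.00), (0.00, 0.00))
```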

#### 2.4.1. Equivalence Between Affordance With Actions Defined Relative to Actors

The comparison cases for affordances with actions described relative to actors are shown in **Table 1**. The 2<sup>4</sup> cases of comparison between the elements of two affordances stem from all the possible (binary) equivalence combinations between the elements. In each case we compare the four components and establish if the elements of affordances are equivalent.

Since actions are defined here relative to the actors, actors with different morphologies cannot perform the same action defined in joint space, because their joint spaces are different. This renders inconsistent the cases in which different actors perform the same action: lines (3), (4), (7), and (8) in **Table 1**. This leaves us with five cases of equivalence in **Table 1**, where:

• the same actor performing a different action on the same object yields (action) equivalence;

• the same actor performing the same action on a different object yields (object) equivalence;

• the same actor performing a different action on a different object yields (object, action) equivalence;

• a different actor performing a different action on the same object yields (actor, action) equivalence;

• a different actor performing a different action on a different object yields (actor, object, action) equivalence.
We assume that the environment is a deterministic system: each time the same actor applies the same action on the same object, it will generate an equivalent effect. Therefore, generating a different effect with the same actor, action, and object is impossible, due to determinism.
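The two constraints above (incompatible joint spaces and determinism) can be checked mechanically. The sketch below enumerates the 2<sup>4</sup> comparison cases behind **Table 1** as truth assignments and filters out the inconsistent rows; the encoding is our own, and the row ordering of **Table 1** is not reproduced:

```python
from itertools import product

def consistent(same_actor, same_object, same_action, same_effect):
    # Different actors cannot perform the same joint-space action.
    if not same_actor and same_action:
        return False
    # Determinism: same actor, object, and action must yield the same effect.
    if same_actor and same_object and same_action and not same_effect:
        return False
    return True

cases = list(product([True, False], repeat=4))   # the 2**4 rows of Table 1
valid = [c for c in cases if consistent(*c)]
equal_effect = [c for c in valid if c[3]]        # rows with equivalent effects
# Excluding the trivial row where every element matches leaves the five
# equivalence cases discussed in the text.
equivalences = [c for c in equal_effect if not all(c)]
```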

Both the effect equivalence and non-equivalence cases provide information about the relationship between two affordances. The affordance equivalence concept is empirically validated in section 3.

#### 2.4.2. Equivalence Between Affordances With Actions Defined Relative to Objects

The comparison cases for affordances with actions described relative to objects are shown in **Table 2**. There are 2<sup>3</sup> cases of comparison, corresponding to the total number of possible (binary) equivalence cases between the elements of a pair of affordances. In this case, three types of equivalence exist:

• If the same action on different objects generates the same effect, then it is (object) equivalence;

• If different actions on the same object generate the same effect, then it is (action) equivalence;

• If different actions on different objects generate the same effect, then it is (object, action) equivalence.


# 3. EXPERIMENTS AND RESULTS: AFFORDANCE EQUIVALENCE

We designed experiments to confirm the capability of our affordance representation to detect equivalences and non-equivalences between learned affordances. We employed the Bayesian Network structure-learning approach presented by Chavez-Garcia et al. (2016a) to describe and learn affordances as relations between random variables (affordance elements). We then analyse how the learned affordances relate to each case of equivalence presented in **Table 2**.

# 3.1. Pre-defined Actions

We assume that an agent is equipped from its conception with motor and perceptual capabilities that we call pre-defined. However, we do not limit the agent's capabilities to the pre-defined set, as the agent may acquire new capabilities through learning. In our scenario, we employed three robotic actors of different morphologies, each with its pre-defined actions:

	- Push (moving with constant velocity without closing the gripper)
	- Pull (closing the gripper and moving with constant velocity)
	- Wipe (closing the gripper and pressing downwards while moving)
	- Move aside (closing the gripper and moving aside)
	- Poke (moving forwards with constant acceleration)
	- Side push (moving aside with constant velocity)

The actors and their pre-defined sets of actions (motor capabilities) are shown in **Figure 4**.

# 3.2. Pre-defined Perceptual Capabilities

Our visual perception process takes raw RGB-D data of an observed scene and oversegments the point cloud into a supervoxel representation. This 3D oversegmentation technique is based on flow-constrained local iterative clustering, which uses color and geometric features from the point cloud (Papon et al., 2013). Strict partial connectivity between voxels guarantees that supervoxels cannot flow across disjoint boundaries in 3D space. Supervoxels are then grouped into object clusters that are used for feature extraction and manipulation. **Figure 5** illustrates the visual perception process. The objects employed were objects of daily use: toys that can be assembled, markers, and dusters. The objects were selected so as to be large enough to allow easy segmentation and manipulation.

TABLE 1 | Comparison of two affordances, when actions are described with respect to actors.

Equivalence cases between affordances are presented in even rows. Inconsistencies are underlined in red. The types of affordance equivalence are shown in bold letters.

TABLE 2 | Comparison of two affordances, when actions are described with respect to objects.

Equivalence cases between affordances are presented in even rows. The types of affordance equivalence are shown in bold letters.

# 3.3. Pre-defined Effect Detectors

We used custom hand-written effect detectors for the experimental use-cases, although our experimental architecture allows for automatic effect detectors. An effect detector quantifies the change, if present, in one property of the environment or the actor. For this series of experiments, we developed the following effect detectors: color change in a 2D image (HSV hue) for an object or a region of interest; change in an object's position (translation only); and change in the end-effector position. **Figure 6** illustrates the detected effects when the wipe action is performed. In our previous work, we covered changes in joint torques, the distance between gripper fingers, and object speed.

# 3.4. Affordance Learning

Affordance elements E (effects), O (objects), and A (actions) are represented as random variables of a Bayesian Network (BN) B. First, in each actor interaction, we record the (discretized) values of the random variables representing the objects (section 3.2), actions (section 3.1), and effects (section 3.3). The problem of discovering the relations between E, O, and A can then be translated into finding dependencies between the variables in B, i.e., learning the structure of the corresponding network B from data D, P(B|D). Thus, affordances are described by the conditional dependencies between variables in B.

We implemented an information-compression score to estimate how well a Bayesian Network structure describes data D (Chavez-Garcia et al., 2016b). Our score is based on the Minimum Description Length (MDL) score:

$$MDL(\mathcal{B}|\mathcal{D}) = LL(\mathcal{B}|\mathcal{D}) - |\mathcal{B}|\frac{\log N}{2},\tag{6}$$

where the first term measures (by applying a log-likelihood score; Suzuki, 2017) how many bits are needed to describe data D based on the probability distribution P(B). The second term counts the number of bits needed to encode B, where log(N)/2 bits are used for each parameter in the BN. We consider log(N)/2 a factor that penalizes structures with a larger number of parameters. For a BN structure B, its score is then defined as the posterior probability given the data D.
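As a numeric sketch of Equation (6); the log-likelihood values and parameter counts below are invented for illustration:

```python
import math

def mdl_score(log_likelihood, n_params, n_samples):
    """MDL(B|D) = LL(B|D) - |B| * log(N) / 2  (higher is better)."""
    return log_likelihood - n_params * math.log(n_samples) / 2

# A denser structure must buy its extra parameters with a better fit:
sparse = mdl_score(log_likelihood=-120.0, n_params=4,  n_samples=100)
dense  = mdl_score(log_likelihood=-118.0, n_params=12, n_samples=100)
# Here the small gain in fit does not pay for eight extra parameters,
# so the sparser structure scores higher.
```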

FIGURE 5 | An example of the visual perception process output. From left to right: (A) reference image, (B) RGB point cloud of the scene, (C) supervoxel extraction, (D) clustering of supervoxels. For visual perception we use a Microsoft Kinect sensor that captures RGB-D data.

FIGURE 6 | Example of captured effects when performing the action wipe on the object duster. The left figure shows the spatial (pose) and perceptual (color) state of the duster and the surface. After the wipe action is performed, the effects on position and hue are detected: the duster has changed position but not color, while the surface has changed color but not position. Although for this experiment we do not use the force in the joints, we also capture these changes.

We implemented a search-based structure learning algorithm based on the hill-climbing technique, as we did in our previous work. As inputs, this algorithm takes values for the variables in E, O, and A obtained from the robot's interaction. This procedure estimates the parameters of the local probability density functions (pdfs) given a Bayesian Network structure. Typically, this is a maximum-likelihood estimation of the probability entries from the data set, which, for multinomial local pdfs, consists of counting the number of tuples that fall into each entry of each multinomial probability table in the BN. The algorithm's main loop consists of attempting every possible single-edge addition, removal, or reversal, making the network that increases the score the most the current candidate, and iterating. The process stops when there is no single-edge change that increases the score. There is no guarantee that this algorithm will settle at a global maximum, but there are techniques to increase its chances of reaching one (we use simulated annealing).
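The search loop just described can be sketched as follows. The graph encoding as a set of directed edges and the toy score are our simplifications; acyclicity checking and the simulated-annealing escape mechanism are omitted:

```python
def neighbours(edges, variables):
    """All single-edge additions, removals, and reversals of the current graph."""
    out = []
    for a in variables:
        for b in variables:
            if a == b:
                continue
            if (a, b) in edges:
                out.append(edges - {(a, b)})             # removal
                out.append(edges - {(a, b)} | {(b, a)})  # reversal
            else:
                out.append(edges | {(a, b)})             # addition
    return out

def hill_climb(variables, score, edges=frozenset()):
    """Greedily apply the single-edge change that improves the score the most."""
    current, best = frozenset(edges), score(edges)
    while True:
        candidate = max(neighbours(current, variables), key=score)
        if score(candidate) <= best:
            return current, best                         # local maximum reached
        current, best = frozenset(candidate), score(candidate)

# Toy score that rewards exactly the edges of a known target structure:
target = {("action", "effect"), ("object", "effect")}
def toy_score(edges):
    return len(set(edges) & target) - len(set(edges) - target)

structure, score_value = hill_climb(["actor", "object", "action", "effect"],
                                    toy_score)
```

With this toy score, the search recovers the target edge set; in the actual system the score would be the MDL score of Equation (6) evaluated on the interaction data.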

By using the BN framework, we are capable of displaying relationships between affordance elements. The directed nature of its structure allows us to approximate cause-effect relationships. It also handles uncertainty through established probability theory. In addition to direct dependencies, we can represent indirect causation.

#### 3.4.1. Detection of Affordance Equivalence

Equivalence between two affordances can be identified by comparing their ability to consistently reproduce the same effect e, judging by the accumulated experimental evidence. The precise type of equivalence between two affordances, which tells which affordance elements' values are equivalent, can be identified by probabilistic inference on the learned BN. Inference allows us to identify which (actor, object, action) configurations are more likely to generate the same effect. In practice, this inference is performed by executing queries on the Bayesian Network, which compute the probability of an event (in our case, the probability of an effect having a value between some given bounds) given the provided evidence.

Queries have the form P(proposition|evidence), where proposition represents the query on some variable x, and evidence represents the available information on the affordance elements, e.g., the identity of the actor, the description of the action, and the description of the object. In the example of the robot pushing an object, the following query computes the probability of the object displacement falling between certain bounds:

$$P\big((\textit{position} > \textit{lower bound}) \text{ and } (\textit{position} < \textit{upper bound}) \mid \textit{actor} = \text{Baxter}, \textit{action} = \text{push}, \textit{object} = \text{block}\big) \tag{7}$$

After querying the learned BN with the corresponding elements from **Tables 1**, **2** as evidence, if two (actor, object, action) configurations have probabilities of generating an effect that are higher than an arbitrary threshold, then we consider both affordances equivalent:

$$\begin{split} &\textbf{if } P(e \mid actor\_1, object\_1, action\_1) > \theta \text{ \textbf{and} } P(e \mid actor\_2, object\_2, action\_2) > \theta \\ &\textbf{then } (actor\_1, object\_1, action\_1) \equiv (actor\_2, object\_2, action\_2) \end{split} \tag{8}$$

For our experiments, we empirically established the equivalence threshold θ = 0.85. The aforementioned querying process connects the learning and reasoning steps, and allows the threshold to be selected empirically or adapted according to the current goal of the actor.
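With the probabilities supplied by BN queries, the decision rule of Equation (8) reduces to a simple threshold test. The helper below is our sketch; the example probabilities are the ones reported in section 3.5.1:

```python
THETA = 0.85  # empirically chosen equivalence threshold

def affordances_equivalent(p_effect_1, p_effect_2, theta=THETA):
    """Equation (8): two (actor, object, action) configurations are
    equivalent for an effect e if both reproduce e with P > theta."""
    return p_effect_1 > theta and p_effect_2 > theta

# BN query results for revealing the red mark (section 3.5.1):
p_move_aside, p_poke = 0.98, 0.97
assert affordances_equivalent(p_move_aside, p_poke)
assert not affordances_equivalent(p_move_aside, 0.60)
```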

# 3.5. Experimental Results

As shown in **Table 1**, affordances composed of 4 elements (actor, object, action, effect), which have their actions defined from the actor perspective, have five cases of equivalence (see **Figure 7** for some illustrated examples). We have selected three of them to demonstrate the use of the affordance equivalence operator: (object) equivalence, (action) equivalence, and (actor, action) equivalence. In **Figure 7** they correspond to the settings (a), (b), and (c). These experiments are detailed below. For a video demonstration of these experiments, please see the Supplementary Material section at the end of this document.

#### 3.5.1. The (Actor, Action) Equivalence

This experiment consisted in discovering the equivalence between (actor, action) tuples. The goal was to identify configurations that are equivalent in their ability to uncover a region of interest (a red mark on the table) by moving the object occluding it from the robot's camera view (in the case of the Baxter, a toy with the features color: blue and shape: box; in the case of the Katana actor, a box with the same perceptual features). In our representation, two objects with the same perceptual features are considered the same. Actor Baxter<sub>gripper</sub> is equipped with a gripper and can perform the action move\_aside. Actor Baxter<sub>nogripper</sub> does not have a gripper and can only perform the action poke. Actor Katana does not have a gripper and can only perform the side\_push action.

The Bayesian Network structure was learned using data from 15 interactions with each (actor, action) tuple (**Figure 8**). Variables object\_shape and object\_color represent the object features, and variable color\_mark captures the presence or absence of a colored mark. Queries performed on the BN suggested that the effect of revealing the red mark is consistently recreated when moving the object toy, with a probability of 0.98 for the action move\_aside done by the hand with a gripper, 0.97 for the action poke done by the hand with no gripper, and 0.94 for the action side\_push done by the Katana arm on the box object. The probabilities are based on the total number of trials verifying these relationships. Since these affordances consistently recreate equivalent effects while having some equal elements (the same toy object for Baxter<sub>gripper</sub> and Baxter<sub>nogripper</sub>, and a similar object for Katana), this indicates that the affordance elements that differ between configurations are in fact equivalent in their ability to generate the effect of revealing the red mark, i.e., the tuples (Baxter<sub>nogripper</sub>, poke), (Baxter<sub>gripper</sub>, move\_aside) and (Katana, side\_push). Source code of the experimental setup for the Katana actor is available at https://romarcg@bitbucket.org/romarcg/katana\_docker.git.

#### 3.5.2. The (Object) Equivalence

The experiment consisted in determining the equivalence between two visually different whiteboard dusters: duster<sub>blue</sub> and duster<sub>orange</sub>. Actor Baxter<sub>gripper</sub> applies the same action wipe to remove a red marker trace from a blue colored surface, as shown in **Figures 4**, **6**. To distinguish the clean blue colored surface from the surface dirtied with the red marker, the robot's pre-defined effect detector measured the effect on the hue extracted from an HSV histogram.

The robot performed 25 trials of the wipe action with each duster, and the obtained data were subsequently used to learn the Bayesian Network structure (see **Figure 8B**). Objects are represented in the same way as in section 3.5.1. The effect capturing the change in the wiped area is described by the variable color\_effect. Queries revealed that the wipe action cleans the red marker trace from the blue colored surface with a probability of 0.95 in both cases. Since the observed effects were equivalent, and the actor and action were the same, the objects dusterblue and dusterorange are considered equivalent in their ability to achieve this effect.
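Such a predefined effect detector could, for instance, threshold the fraction of red-hued pixels in the observed surface patch. The following Python sketch is an illustrative assumption (the pixel values, thresholds, and the `red_trace_present` helper are ours, not the authors' detector), using the standard-library `colorsys` conversion:

```python
import colorsys

# Hypothetical effect detector: decide whether a surface patch still shows
# a red marker trace by inspecting the hue of its pixels (RGB in [0, 1]).
def red_trace_present(pixels, red_fraction_threshold=0.1):
    """Return True if a sufficient fraction of pixels have a 'red' hue."""
    red = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        # Red hues sit near 0 (or wrap around near 1) on the hue circle.
        if s > 0.3 and (h < 0.05 or h > 0.95):
            red += 1
    return red / len(pixels) >= red_fraction_threshold

dirty = [(0.9, 0.1, 0.1)] * 30 + [(0.1, 0.2, 0.8)] * 70   # marker trace on blue surface
clean = [(0.1, 0.2, 0.8)] * 100                           # wiped surface

print(red_trace_present(dirty), red_trace_present(clean))  # True False
```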

### 3.5.3. The (Action) Equivalence

In this experiment, we analysed the equivalence between the actions of an actor. The experiment consisted of placing the same object toy into a desired location using two different actions, push and pull, of the actor Baxtergripper. The robot performed 30 trials with each of the push and pull actions. **Figure 8C** shows the learned BN for (action) equivalence. The arrival of the object (represented as in the previous experiments) at the desired position is described by the effect variable x\_end (only the x component of the 3D position was measured). The target location to which we aim to push/pull the object is at x coordinate 0.72 ± 0.02 m. Variable object\_x\_start is an object feature representing the initial position of the object. According to the BN learned from the obtained data, there was a 0.97 probability of pulling the object to the desired location, and a 0.89 probability of doing so by pushing it. With everything else being equal (the actor, object, and effect are the same), and since both actions have a high probability of generating the given effect, the push and pull actions can be considered equivalent for placing the object toy at a desired location.
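Across the three experiments, the equivalence decision reduces to checking that both affordances reach the effect with high probability and noting which element differs. A hedged sketch follows (the probabilities are those reported above; the `equivalence` helper and its threshold are illustrative assumptions, not the authors' operator):

```python
# Summary of the three experiments: each affordance maps a
# (actor, action, object) tuple to the measured probability of the effect.
affordances = {
    ("Baxter_gripper", "move_aside", "toy"): 0.98,
    ("Baxter_nogripper", "poke", "toy"): 0.97,
    ("Katana", "side_push", "box"): 0.94,
    ("Baxter_gripper", "wipe", "duster_blue"): 0.95,
    ("Baxter_gripper", "wipe", "duster_orange"): 0.95,
    ("Baxter_gripper", "pull", "toy"): 0.97,
    ("Baxter_gripper", "push", "toy"): 0.89,
}

def equivalence(aff_a, aff_b, threshold=0.85):
    """Affordance equivalence for a shared effect: both tuples must reach
    the effect reliably; the elements in which they differ (0 = actor,
    1 = action, 2 = object) are then interchangeable for that effect."""
    p_a, p_b = affordances[aff_a], affordances[aff_b]
    differing = [i for i in range(3) if aff_a[i] != aff_b[i]]
    return (p_a >= threshold and p_b >= threshold), differing

ok, diff = equivalence(("Baxter_gripper", "pull", "toy"),
                       ("Baxter_gripper", "push", "toy"))
print(ok, diff)  # True [1] -> the two actions are equivalent for this effect
```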

# 4. CONCLUSIONS AND FUTURE WORK

We have presented a formalization for affordances with respect to their elements, and the equivalence operator for comparing two affordances from the actor and object perspective. We performed Bayesian Network structure learning to capture affordances as sensorimotor representations based on the observed experimental data. We analysed and validated experimentally the affordance equivalence operator, demonstrating how to extract information on the tuples of actors, actions and objects by comparing two affordances and determining if such tuples are equivalent.

In practice, the learned affordance equivalences can be interchangeably used when some objects or actions become unavailable. In a multi-robot setting, these equivalences can allow an ambient intelligence (an Artificial Intelligence system controlling an environment) to select the appropriate robot for using an affordance to reach a desired effect.

## 4.1. Future Work

Our future work will focus on the domain of transfer learning. We plan to implement a transformation between the affordances learned by specific robots (in their own joint space) to affordances applicable to objects and defined in their operational space. This will generalise the affordances learned and perceivable by a robot with a specific body schema, making them perceivable (and potentially available) to robots with any type of body schema (morphology).

We are already working on an automatic method for generating 3D object descriptors. This would allow us to remove human bias from the way in which the robot observes and analyses its environment. An auto-encoder (a type of artificial neural network) trained on appropriate datasets could automatically adapt to changes in the objects that the robot interacts with.

Work is also underway on representing robot actions in a continuous space (e.g., using a vector representation of torque forces, or Dynamic Movement Primitives), which would be an improvement from today's discrete representation of actions (e.g., move, push, pull).

Ultimately, we intend to define an algebra of affordances detailing all the operations that are possible on affordances, which would encompass operators such as affordance equivalence, affordance chaining (Ugur et al., 2011), and other operators still to be explored.

# AUTHOR CONTRIBUTIONS

Literature review by MA and RC-G. Methodology and theoretical developments by MA, RC-G, and RC. Experiment design and implementation by MA and RC-G. Analysis of the experimental results by MA, RC-G, RC, AG, and LG. Document writing and illustrations by MA, RC-G, RC, AG, and LG.

# ACKNOWLEDGMENTS

This work was funded by ANR RoboErgoSum project under reference ANR-12-CORD-0030, and by Laboratory of Excellence SMART (ANR-11-LABX-65) supported by French State funds managed by the ANR - Investissements d'Avenir programme ANR-11-IDEX-0004-02.

We kindly thank Hugo Simão for his help with the 3D renderings used for illustrating this work. Credit for the 3D models of the Baxter robot used in **Figures 1**, **2**, **6** goes to Rethink Robotics. Credit for the robot model used in the bottom part of **Figure 2** goes to Dushyant Chourasia (https://grabcad.com/library/robot-242).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnbot.2018.00026/full#supplementary-material

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Andries, Chavez-Garcia, Chatila, Giusti and Gambardella. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# SERKET: An Architecture for Connecting Stochastic Models to Realize a Large-Scale Cognitive Model

#### Tomoaki Nakamura<sup>1</sup>\*, Takayuki Nagai<sup>1</sup> and Tadahiro Taniguchi<sup>2</sup>

*<sup>1</sup> Department of Mechanical Engineering and Intelligent Systems, University of Electro-Communications, Tokyo, Japan, <sup>2</sup> Department of Information Science and Engineering, Ritsumeikan University, Shiga, Japan*

To realize human-like robot intelligence, a large-scale cognitive architecture is required for robots to understand their environment through the variety of sensors with which they are equipped. In this paper, we propose a novel framework named Serket that enables the easy construction of a large-scale generative model and its inference by connecting sub-modules, allowing robots to acquire various capabilities through interaction with their environment and others. We consider that large-scale cognitive models can be constructed by connecting smaller fundamental models hierarchically while maintaining their programmatic independence. Moreover, the connected modules depend on each other, and their parameters must be optimized as a whole. Conventionally, the equations for parameter estimation have to be derived and implemented for each model, and doing so becomes harder as models grow. Thus, in this paper, we propose a parameter estimation method that communicates only the minimal parameters between modules while maintaining their programmatic independence. Therefore, Serket makes it easy to construct large-scale models and estimate their parameters via the connection of modules. Experimental results demonstrated that models can be constructed by connecting modules, that the parameters can be optimized as a whole, and that the composed models are comparable with the original models that we have proposed.

Keywords: cognitive models, probabilistic generative models, symbol emergence in robotics, concept formation, unsupervised learning

Edited by: *Quan Zou, UnitedHealth Group, United States*

Reviewed by: *Eric Chen, Thomas Jefferson University, United States; Yanan Sun, Booz Allen Hamilton, United States*

\*Correspondence: *Tomoaki Nakamura, tnakamura@uec.ac.jp*

Received: *30 November 2017* Accepted: *14 May 2018* Published: *26 June 2018*

#### Citation:

*Nakamura T, Nagai T and Taniguchi T (2018) SERKET: An Architecture for Connecting Stochastic Models to Realize a Large-Scale Cognitive Model. Front. Neurorobot. 12:25. doi: 10.3389/fnbot.2018.00025*

# 1. INTRODUCTION

To realize human-like robot intelligence, a large-scale cognitive architecture is required for robots to understand their environment through a variety of sensors with which they are equipped. In this paper, we propose a novel framework that enables the easy construction of a large-scale generative model and its inference by connecting sub-modules, in order for robots to acquire various capabilities through interactions with their environment and others. We consider it important for robots to understand the real world by learning from their environment and others, and have proposed a method that enables robots to acquire concepts and language (Nakamura et al., 2014; Attamimi et al., 2016; Nishihara et al., 2017; Taniguchi et al., 2017) based on the clustering of the multimodal information that they obtain. These proposed models are based on Bayesian models with complex structures, and we derived and implemented their parameter estimation equations. To realize a model that enables robots to learn more complicated capabilities, we would have to construct an even more complicated model and derive and implement its equations for parameter estimation. However, it is difficult to construct higher-level cognitive models with this approach. Alternatively, these models can be interpreted as compositions of more fundamental Bayesian models. In this paper, we develop a large-scale cognitive model by connecting such Bayesian models, and propose an architecture named Serket (Symbol Emergence in Robotics tool KIT<sup>1</sup>), which enables the easier construction of such models.

In the field of cognitive science, cognitive architectures (Laird, 2008; Anderson, 2009) have been proposed to implement human cognitive mechanisms by describing human perception, judgment, and decision-making. However, complex machine learning algorithms have not yet been introduced, which makes it difficult to implement our proposed models. Serket makes it possible to implement more complex models by connecting modules.

One approach to developing a large-scale cognitive model is the use of probabilistic programming languages (PPLs), which make it easy to construct Bayesian models (Patil et al., 2010; Goodman et al., 2012; Wood et al., 2014; Carpenter et al., 2016; Tran et al., 2016). PPLs can construct Bayesian models by defining the dependencies between random variables, and the parameters are estimated automatically without having to derive the equations for them. Using PPLs, it is easy to construct relatively small-scale models, such as a Gaussian mixture model and latent Dirichlet allocation, but it is still difficult to model multimodal sensory information, such as the images and speech obtained by robots. Because of this, we implemented our models for concept and language acquisition, which are relatively large-scale models, as standalone programs without PPLs. However, we consider that the approach of implementing an entire model by itself has limitations as models become large-scale.

Large-scale cognitive models can be constructed by connecting smaller fundamental models hierarchically; in fact, our proposed models have such a structure. In the proposed novel architecture Serket, large-scale models are constructed by hierarchically connecting smaller-scale Bayesian models (hereafter, each one is referred to as a module) while maintaining their programmatic independence. The connected modules depend on each other, and their parameters must be optimized as a whole. When models are implemented by themselves, the parameter estimation equations have to be derived and implemented for each model. In this paper, however, we propose a method for parameter estimation that communicates only the minimal parameters between modules while maintaining their programmatic independence. Therefore, Serket makes it easy to construct large-scale models and estimate their parameters by connecting modules.

In this paper, we propose the Serket framework and implement models that we previously proposed by leveraging this framework. Experimental results demonstrated that the models can be constructed by connecting modules, that the parameters can be optimized as a whole, and that the results are comparable with the original models that we have proposed.

# 2. BACKGROUND

## 2.1. Symbol Emergence in Robotics

Recently, it has been said that artificial intelligence is superior to human intelligence in the area of supervised learning, as typified by deep learning, at least for certain specific tasks (He et al., 2015; Silver et al., 2017). However, we believe that it is difficult to realize human-like intelligence only via supervised learning, because supervised labels cannot be obtained for all the sensory information of robots. To this end, we believe that it is also important for robots to understand the real environment by structuring their own sensory information in an unsupervised manner. We consider such a learning process as a symbol emergence system (Taniguchi et al., 2016a).

The symbol emergence system is based on the genetic epistemology proposed by Piaget (Piaget and Duckworth, 1970). In genetic epistemology, humans organize symbol systems in a bottom-up manner through interaction with the environment. **Figure 1** presents an overview of the symbol emergence system. Symbols are self-organized from sensory information obtained through interactions with the environment. However, it can be difficult for robots to communicate with others using symbols learned only in a bottom-up manner, because sensory information cannot be shared directly with others and the meaning of symbols differs between individuals. To communicate with others, the meanings of symbols must be transformed into meanings common among individuals through their interactions. This can be considered a top-down effect from the symbols onto how individuals organize them. Thus, in the symbol emergence system, symbols emerge through loops of top-down and bottom-up effects. In symbol emergence in robotics, symbols include not only linguistic symbols but also various types of knowledge self-organized by robots. Therefore, symbol emergence in robotics covers a wide range of research topics, such as concept formation (Nakamura et al., 2007), language acquisition (Taniguchi et al., 2016b, 2017; Nishihara et al., 2017), learning of interactions (Taniguchi et al., 2010), learning of body schemes (Mimura et al., 2017), and learning of motor skills and segmentation of time-series data (Taniguchi et al., 2011; Nakamura et al., 2016).

We have proposed models that enable robots to acquire concepts and language by considering the learning process as a symbol emergence system. The robots form concepts in a bottom-up manner, and acquire word meanings by connecting words and concepts. Simultaneously, words are shared with others, and their meanings change through communication. Such words therefore affect concept formation in a top-down manner, and the concepts change accordingly. Thus, we have considered that robots can acquire concepts and word meanings through loops of bottom-up and top-down effects.

<sup>1</sup> Symbol emergence in robotics focuses on the real and noisy environment, and the e in Serket represents a false recognition obtained through learning in such an environment.

## 2.2. Existing Cognitive Architecture

There have been many attempts to develop intelligent systems. In the field of cognitive science, cognitive architectures (Laird, 2008; Anderson, 2009) have been proposed to implement human cognitive mechanisms by describing human perception, judgment, and decision-making. As mentioned earlier, it is important to consider how to model the multimodal sensory information obtained by robots. However, this is still difficult to achieve with these cognitive architectures. To construct more complex models, some frameworks have been proposed in the field of machine learning.

Frameworks for deep neural networks (DNNs), such as TensorFlow (Abadi et al., 2016), Keras (Chollet, 2015), and Chainer (Tokui et al., 2015), have been developed. These frameworks make it possible to construct DNN models and estimate their parameters easily, and are one of the reasons why DNNs have been so widely used in recent years.

Alternatively, PPLs that make it easy to construct Bayesian models have also been proposed (Patil et al., 2010; Goodman et al., 2012; Wood et al., 2014; Carpenter et al., 2016; Tran et al., 2016). The advantages of PPLs are that they can construct Bayesian models by defining the dependencies between random variables, and the parameters are automatically estimated without deriving equations for them. By using PPLs, relatively small-scale models, such as the Gaussian mixture model and latent Dirichlet allocation (LDA), can be constructed easily. However, it is still difficult to model multimodal sensory information, such as images and speech obtained by the robots. We believe that a framework by which a large-scale probabilistic generative model can be more easily constructed is required to model the multimodal information of the robot.

# 2.3. Cognitive Architecture Based on Probabilistic Generative Model

We believe that cognitive models make it possible to predict an output Y from an input X. For example, as shown in **Figure 2**, an object label Y is predicted from a sensor input X via object recognition, and understanding word meanings amounts to predicting the semantic content Y from a speech signal X. In other words, the problem can be defined as how to model P(Y|X), where the prediction is realized by argmax<sub>Y</sub> P(Y|X). DNNs model the relationship between an input X and output Y directly in an end-to-end manner (**Figure 2B**). Alternatively, we consider developing these cognitive models by leveraging Bayesian models, where X and Y are treated as random variables, and the relationship between them is represented by a latent variable Z (**Figure 2A**). Therefore, in Bayesian models, the prediction of the output Y from the input X is computed as follows:

$$P(Y|X) \propto P(Y,X) \tag{1}$$

$$=\int\_{Z} P(Y|Z)P(X|Z)P(Z)dZ.\tag{2}$$

This is multimodal latent Dirichlet allocation (MLDA) (Blei and Jordan, 2003; Nakamura et al., 2009; Putthividhy et al., 2010), the details of which are described in the Appendix. However, MLDA is based on the important assumption that the observed variables X and Y are conditionally independent given the latent variable Z. Here, we consider models that make this assumption for multiple observations without distinguishing between input and output. **Figure 3A** displays the generalized model, where the right side of Equation (1) corresponds to the following equation, and a part of the observations can be predicted from the other observations.

$$P(\mathbf{o}\_1, \mathbf{o}\_2, \dots) = \int\_z P(z) \Pi\_n P(\mathbf{o}\_n | z) dz. \tag{3}$$

As mentioned earlier, it is assumed that all observations **o**<sub>1</sub>, **o**<sub>2</sub>, · · · are conditionally independent given z. This assumption is often used to deal with multimodal data (Blei and Jordan, 2003; Wang et al., 2009; Putthividhy et al., 2010; Françoise et al., 2013) because modeling all dependencies makes parameter estimation difficult.
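For a discrete latent variable, Equation (3) can be computed directly by summation. The following toy Python sketch (all distribution values are invented for illustration, not taken from any experiment) predicts one binary observation from another by marginalizing over z:

```python
# Discrete instance of Equation (3): two binary observations o1, o2 assumed
# conditionally independent given a latent variable z with K = 2 values.
P_z = [0.6, 0.4]                 # P(z)
P_o1_given_z = [[0.9, 0.1],      # P(o1 | z = 0)
                [0.2, 0.8]]      # P(o1 | z = 1)
P_o2_given_z = [[0.7, 0.3],      # P(o2 | z = 0)
                [0.1, 0.9]]      # P(o2 | z = 1)

def predict_o2(o1):
    """P(o2 | o1) obtained by normalizing sum_z P(z) P(o1 | z) P(o2 | z)."""
    joint = [sum(P_z[z] * P_o1_given_z[z][o1] * P_o2_given_z[z][o2]
                 for z in range(2))
             for o2 in range(2)]
    total = sum(joint)
    return [p / total for p in joint]

print(predict_o2(0))
```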

When modeling various sensor data as observations **o**<sub>1</sub>, **o**<sub>2</sub>, · · · , it is not always true that all the observations satisfy the conditional independence assumption. In general, the information surrounding us has a hierarchical structure; hence, a hierarchical model can be used to avoid this difficulty (Attamimi et al., 2016). Furthermore, latent variables such as concepts are generally related to each other, and such relationships can be represented by hierarchical models. **Figure 3B** represents a hierarchical version of **Figure 3A** and can be thought of as a generalization of the cognitive architecture based on a probabilistic generative model. It should be noted that the structure can be designed manually (Attamimi et al., 2016) and/or found autonomously by using a structure learning method (Margaritis, 2003), which is beyond the scope of this paper. In this hierarchical model, o<sub>∗,∗</sub> are observations and z<sub>∗,∗</sub> are latent variables, and the right side of Equation (1) corresponds to the following equation:

$$P(\mathbf{O}|z\_{M,1}, z\_{M,2}, \dots) = \prod\_{m}^{M} \prod\_{n}^{\bar{N}\_m} \int\_{z\_{m,n}} P(z\_{m,n}) \prod\_{i}^{N\_m} P(\mathbf{o}\_{m,n,i}|z\_{m,n}) \prod\_{n'}^{\bar{N}\_{m-1}} P(z\_{m-1,n'}|z\_{m,n}) \, dz\_{m,n} \tag{4}$$

where **O** is the set of all observations, M is the number of levels in the hierarchy, and N<sub>m</sub> and N̄<sub>m</sub> denote the number of observations and latent variables at the m-th level, respectively. In this model, it is not difficult to derive the parameter estimation equations analytically if the number of levels is small. However, deriving them becomes harder as the number of levels increases. To estimate the parameters of the hierarchical model, we propose Serket, an architecture that makes it possible to estimate the parameters by dividing the model into modules at each level.

From the viewpoint of hierarchical models, many studies have proposed models that capture the hierarchical nature of the data (Li and McCallum, 2006; Blei et al., 2010; Ghahramani et al., 2010; Ando et al., 2013; Nguyen et al., 2014). On the other hand, Serket models the hierarchical structure of modalities. For such hierarchical models, methods based on LDA (Li et al., 2011; Yang et al., 2014) have been proposed, and we have also proposed multilayered MLDA (Attamimi et al., 2016). These models are the simplest examples constructed by Serket. In this paper, we construct these models by dividing them into smaller modules.

## 2.4. Cognitive Models

Studies on how to model the relationships between multimodal information have been conducted in the past (Roy and Pentland, 2002; Wermter et al., 2004; Ridge et al., 2010; Ogata et al., 2010; Lallee and Dominey, 2013; Zhang et al., 2017). These studies used neural networks, which made inference from observed information possible by learning multimodal information such as words, visual information, and a robot's motions. As mentioned earlier, these are examples of the cognitive models that we defined.

There are also studies in which manifold learning was used for modeling a robot's multimodal information (Mangin and Oudeyer, 2013; Yuruten et al., 2013; Mangin et al., 2015; Chen and Filliat, 2015). These studies used methods such as non-negative matrix factorization, in which multimodal information is represented by low-dimensional hidden parameters. We consider this another approach to constructing cognitive models, in which information is inferred through hidden parameters.

Recently, DNNs have made notable advances in many areas, such as object recognition (He et al., 2015), object detection (Redmon et al., 2016), speech recognition (Amodei et al., 2016), sentence generation (Vinyals et al., 2015), machine translation (Sutskever et al., 2014), and visual question answering (Wu et al., 2016). In these studies, end-to-end learning was used, which made it possible to infer some information from other information. Therefore, these are also considered part of the cognitive models defined in this paper. However, as mentioned in section 2.1, we believe that it is important for robots to understand the real environment by structuring their own sensory information in an unsupervised manner.

To develop a cognitive model where robots learn autonomously, our group proposed several models for concept formation (Nakamura et al., 2007), language acquisition (Taniguchi et al., 2016b, 2017; Nishihara et al., 2017), learning of interactions (Taniguchi et al., 2010), learning of body schemes (Mimura et al., 2017), and learning of motor skills and segmentation of time-series data (Taniguchi et al., 2011; Nakamura et al., 2016). Although all of these are targets of Serket, we focus on concept formation in this paper. We define concepts as categories into which the sensory information is classified, and have proposed various concept models. These are implementations of the aforementioned hierarchical model. **Figure 4A** displays one of our proposed models. This is the simplest form of the hierarchical model, where z<sup>O</sup> and z<sup>M</sup> denote an object and a motion concept, respectively, and their relationship is represented by z (Attamimi et al., 2016). Therefore, in this model, z represents objects and the possible motions against them, which are considered as their usage, and the observations become conditionally independent by introducing the latent variables z<sup>O</sup> and z<sup>M</sup>.

In these Bayesian models, the latent variables shown as the white nodes z, z<sup>O</sup>, and z<sup>M</sup> in **Figure 4A** can be learned from the observations shown as gray nodes in an unsupervised manner. Moreover, these latent variables are not determined independently but optimized as a whole through their dependence on each other. Although this model appears to have a complex structure that makes it difficult to estimate the parameters and determine the latent variables, it can be divided into smaller components, each of which is an MLDA model. The models shown in **Figures 4B,C** can also be divided into smaller components despite their complex structure. Similar to these models, it is possible to develop larger models by combining smaller models as modules. In this paper, we propose the novel architecture Serket to develop larger models by combining modules.

In the proposed architecture, the parameters of each module are not learned independently but based on their dependence on each other. To implement such learning, it is important to share latent variables between modules. For example, z<sup>O</sup> and z<sup>M</sup> are each shared between two MLDAs in the model shown in **Figure 4A**. The shared latent variables are not determined independently but depend on each other. Serket makes it possible for each module to maintain its independence as a program while being learned as part of the whole through the shared latent variables.

# 3. SERKET

## 3.1. Composing Cognitive Sub-modules

**Figure 3C** displays the generalized form of the module assumed in Serket. In this figure, we omit the detailed parameters for generality because we assume that any type of model can be a module of Serket. Each module has multiple shared latent variables z<sub>m−1,∗</sub> and observations **o**<sub>m,n,∗</sub>, which are assumed to be generated from the latent variable z<sub>m,n</sub> of a higher level. Modules with no shared latent variables or observations are also included in the generalized model. Moreover, the modules can have any internal structure as long as they have shared latent variables, observations, and higher-level latent variables. Based on this module, a larger model can be constructed by connecting the latent variables of module(m − 1, 1), module(m − 1, 2), · · · recursively. In the Serket architecture, each module must satisfy the following requirements:

1. In each module with shared latent variables, the probability that latent variables are generated can be computed as

$$P(z\_{m-1,i}|z\_{m,n}, \mathbf{o}\_{m,n,1}, \mathbf{o}\_{m,n,2}, \dots, \mathbf{z}\_{m-1}).\tag{5}$$

2. The module can send the following probability by leveraging one of the methods explained in the next section:

$$P(z\_{m-1,i}|z\_{m,n}, \mathbf{o}\_{m,n,1}, \mathbf{o}\_{m,n,2}, \dots, \mathbf{z}\_{m-1}).\tag{6}$$

3. The module can determine z<sub>m,n</sub> by using the following probability sent from module (m + 1, j) by one of the methods explained in the next section:

$$P(z\_{m,n}|z\_{m+1,j},\mathbf{o}\_{m+1,j,1},\mathbf{o}\_{m+1,j,2},\cdots,\mathbf{z}\_{m}).\tag{7}$$

4. Terminal modules have no shared latent variables and only have observations.

In Serket, the modules affect each other, and the shared latent variables are determined through their communication. The methods used to determine the latent variables are classified into two types depending on their nature: one for latent variables that are discrete and finite, and another for those that are continuous or infinite.

## 3.2. Inference of Composed Models

In this section, we explain the parameter inference methods used for the composed models. We focus on a batch algorithm for parameter inference, which makes it easy to implement each module; real-time application is therefore beyond the scope of this paper, although we would like to realize it in the future. One family of inference methods for estimating the parameters of complex models is based on variational Bayesian (VB) approximation (Minka and Lafferty, 2002; Blei et al., 2003; Kim et al., 2013). However, a VB-based approach requires derivations involving the latent variables, and it is difficult to implement such derivations across independent modules. To this end, we employed sampling-based methods because of their simpler implementation.

We utilize three approaches, chosen according to the nature of the latent variables.

### 3.2.1. Message Passing Approach

First, we consider the case in which the latent variables are discrete and finite. For example, in the model shown in **Figure 4A**, the shared latent variable z<sup>O</sup> is generated from a multinomial distribution, which is represented by finite-dimensional parameters. Here, we consider the estimation of the latent variables in the simplified model shown in **Figure 5A**. In module 2, the shared latent variable z<sub>1</sub> is generated from z<sub>2</sub>; in module 1, the observation **o** is generated from z<sub>1</sub>. The latent variable z<sub>1</sub> is shared by modules 1 and 2, and is determined by the effects of these two modules as follows:

$$z\_1 \sim P(z\_1|\mathbf{o}, z\_2) \tag{8}$$

$$\propto P(z\_1|\mathbf{o})P(z\_1|z\_2). \tag{9}$$

In this equation, P(z<sub>1</sub>|**o**) and P(z<sub>1</sub>|z<sub>2</sub>) can be computed in modules 1 and 2, respectively. We assumed that the latent variable is discrete and finite, so P(z<sub>1</sub>|z<sub>2</sub>) is a multinomial distribution that can be represented by a finite-dimensional parameter whose dimension equals the number of possible values of z<sub>1</sub>. Therefore, P(z<sub>1</sub>|z<sub>2</sub>) can be sent from module 2 to module 1. Moreover, P(z<sub>1</sub>|z<sub>2</sub>) can be learned in module 2 by using P(z<sub>1</sub>|**o**) sent from module 1, which is also a multinomial distribution. The parameters of these distributions can easily be sent and received, and the shared latent variable can be determined by the following procedure:


Thus, when the latent variable is discrete and finite, the modules are learned by sending and receiving the parameters of a multinomial distribution over z<sub>1</sub>. We call this the message passing (MP) approach because the model parameters are optimized by communicating these messages.
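The MP step for a discrete, finite latent variable amounts to sampling from the normalized product of the two messages (Equation 9). A minimal Python sketch with toy multinomial messages follows (the numbers and the `sample_shared_z1` helper are illustrative assumptions, not Serket's actual interface):

```python
import random

random.seed(0)

# Toy messages over a shared discrete latent variable z1 with 3 values:
# module 1 sends P(z1 | o); module 2 sends P(z1 | z2).
P_z1_given_o = [0.70, 0.20, 0.10]
P_z1_given_z2 = [0.50, 0.10, 0.40]

def sample_shared_z1(msg1, msg2):
    """Draw z1 from the normalized element-wise product of the two messages."""
    product = [a * b for a, b in zip(msg1, msg2)]
    total = sum(product)
    probs = [p / total for p in product]
    return random.choices(range(len(probs)), weights=probs)[0], probs

z1, probs = sample_shared_z1(P_z1_given_o, P_z1_given_z2)
print(z1, [round(p, 3) for p in probs])
```

Each module only ever sees a finite-dimensional multinomial parameter vector, which is what keeps the modules programmatically independent.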

### 3.2.2. Sampling Importance Resampling Approach

In the previous section, the latent variable was determined by communicating the parameters of multinomial distributions, which is possible when the latent variables are discrete and finite. Otherwise, it can be difficult to communicate the parameters. For example, the number of parameters becomes infinite if the latent variable can take infinitely many values, and a complex probability distribution is difficult to represent with a small number of parameters. In such cases, the model parameters are learned by approximation using sampling importance resampling (SIR). We again consider parameter estimation using the simplified model shown in **Figure 5B**. Here, the latent variable z<sub>1</sub> is shared, and its possible values are either infinitely many or continuous. As in the previous section, the latent variable can be determined if the following equation can be computed:

$$z\_1 \sim P(z\_1|\mathbf{o}, z\_2) \tag{10}$$

$$\propto P(z\_1|\mathbf{o})P(z\_1|z\_2). \tag{11}$$

However, when z1 can take infinitely many values or is continuous, module 2 cannot send P(z1|z2) to module 1. Therefore, P(z1|**o**) is first approximated by L samples $\{z\_1^{(l)} : l = 1, \cdots, L\}$:

$$z\_1^{(l)} \sim P(z\_1|\mathbf{o}).\tag{12}$$

This approximation is equivalent to approximating P(z1|**o**) by the following $\tilde{P}(z\_1|\mathbf{o})$:

$$P(z\_1|\mathbf{o}) \approx \tilde{P}(z\_1|\mathbf{o}) = \frac{1}{L} \sum\_{l=1}^{L} \delta(z\_1, z\_1^{(l)}),\tag{13}$$

where δ(a, b) represents a delta function, which is 1 if a = b, and 0 otherwise. The generated samples are sent from module 1 to module 2, and a latent variable is selected among them based on P(z1|z2):

$$z\_1 \sim P(z\_1 \in \{z\_1^{(1)}, \dots, z\_1^{(L)}\} | z\_2). \tag{14}$$

This procedure is equivalent to sampling from the following distribution, which is an approximation of Equation (11):

$$z\_1 \sim P(z\_1|z\_2)\tilde{P}(z\_1|\mathbf{o}).\tag{15}$$

Thus, the parameters of each module can be updated by the determined latent variables.
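The SIR exchange can be sketched as follows (a toy sketch with Gaussian stand-ins for both distributions; the densities and the sample count L are illustrative assumptions, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 1000  # number of samples approximating P(z1 | o)

# Module 1: draw L samples z1^(l) ~ P(z1 | o) (a Gaussian stand-in),
# i.e., represent P(z1 | o) by equally weighted particles (Equation 13).
samples = rng.normal(loc=0.0, scale=1.0, size=L)

def p_z1_given_z2(z1):
    # Unnormalized stand-in for P(z1 | z2) held by module 2.
    return np.exp(-0.5 * (z1 - 1.0) ** 2)

# Module 2: weight the received particles by P(z1 | z2) and resample,
# which approximates sampling from P(z1 | z2) P~(z1 | o) (Equation 15).
weights = p_z1_given_z2(samples)
weights /= weights.sum()
z1 = rng.choice(samples, p=weights)
```

Only the L sample values travel between the modules, so neither module needs a closed-form expression of the other's distribution.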

#### 3.2.3. Other Approaches

The two methods presented above are not the only ones applicable to parameter estimation. For example, the Metropolis-Hastings (MH) approach can also be used. In the MH approach, samples are generated from a proposal distribution Q(z|z*), where z and z* represent the current value and the proposed value of the latent variables, respectively. They are then accepted according to the acceptance probability A(z, z*):

$$A(z, z^\*) = \min\left(1, \alpha\right) \tag{16}$$

$$\alpha = \frac{P(z^\*)Q(z|z^\*)}{P(z)Q(z^\*|z)},\tag{17}$$

where P(z) represents the target distribution from which the samples are generated.

The model parameters in **Figure 5** can be estimated by considering P(z1|**o**) and P(z1|z2, **o**) as the proposal distribution and target distribution, respectively. P(z1|z2, **o**) can be transformed into

$$P(z\_1|z\_2, \mathbf{o}) \propto P(z\_1|\mathbf{o})P(z\_1|z\_2)P(z\_2). \tag{18}$$

Therefore, α in Equation (16) becomes

$$\alpha = \frac{P(z^\*)Q(z|z^\*)}{P(z)Q(z^\*|z)} = \frac{P(z\_1^\*|z\_2, \mathbf{o})}{P(z\_1|z\_2, \mathbf{o})} \cdot \frac{P(z\_1|\mathbf{o})}{P(z\_1^\*|\mathbf{o})} \tag{19}$$

$$= \frac{P(z\_1^\*|\mathbf{o})P(z\_1^\*|z\_2)P(z\_2)}{P(z\_1|\mathbf{o})P(z\_1|z\_2)P(z\_2)} \cdot \frac{P(z\_1|\mathbf{o})}{P(z\_1^\*|\mathbf{o})} = \frac{P(z\_1^\*|z\_2)}{P(z\_1|z\_2)}.\tag{20}$$

Hence, the proposal distribution P(z1|**o**) can be computed in module 1, and the acceptance probability can be computed in module 2. With this approach, the parameters can be estimated while maintaining programmatic independence: the proposed value is sent to module 2, module 2 determines whether it is accepted, and the parameters are then updated according to the accepted values.
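A distributed MH step of this kind can be sketched as follows (a toy sketch with random multinomial stand-ins; in the real setting module 1 would propose from its posterior $P(z\_1|\mathbf{o})$ and module 2 would evaluate $P(z\_1|z\_2)$):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
p_z1_given_o = rng.dirichlet(np.ones(K))   # module 1's proposal P(z1 | o)
p_z1_given_z2 = rng.dirichlet(np.ones(K))  # module 2's conditional P(z1 | z2)

z1 = 0  # current value of the shared latent variable
for _ in range(100):
    # Module 1 proposes a new value from P(z1 | o) and sends it to module 2.
    z1_star = rng.choice(K, p=p_z1_given_o)
    # Module 2 accepts with probability min(1, P(z1* | z2) / P(z1 | z2)),
    # the ratio derived in Equation (20); drawing u < alpha realizes min(1, ·).
    alpha = p_z1_given_z2[z1_star] / p_z1_given_z2[z1]
    if rng.random() < alpha:
        z1 = z1_star
```

Note that module 2 never needs to see module 1's distribution; it only receives the proposed value and returns an accept/reject decision.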

Thus, various approaches can be utilized for parameter estimation, and which method is most suitable remains to be discussed. We leave this for future work owing to limited space.

#### 4. EXAMPLE 1: MULTILAYERED MLDA

First, we show that a more complex model, the mMLDA, can be constructed by combining simpler models based on Serket. Using the mMLDA, object categories, motion categories, and integrated categories representing the relationships between them were formed from the visual, auditory, haptic, and motion information obtained by the robot. The information obtained by the robot is detailed in Appendix 2. We compared this implementation with the original mMLDA and with an independent model in which the object and motion categories were learned independently. The original mMLDA gives an upper bound on performance because no approximation is used in it. Therefore, the purpose of this experiment is to show that the Serket implementation achieves performance comparable to that of the original mMLDA.

#### 4.1. Implementation Based on Serket

The mMLDA shown in **Figure 4A** can be constructed using the MP approach. This model can be divided into three MLDAs. In the lower-level MLDAs, object categories $z^O$ can be formed from the multimodal information $\mathbf{w}^v$, $\mathbf{w}^a$, and $\mathbf{w}^h$ obtained from the objects, and motion categories $z^M$ can be formed from the joint angles obtained by observing a human's motion. Details of this information are explained in the Appendix. Moreover, in the higher-level MLDA, integrated categories $z$ that represent the relationships between objects and motions can be formed by treating $z^O$ and $z^M$ as observations. In this model, the latent variables $z^O$ and $z^M$ are shared; therefore, the parameters of the whole model are optimized in a mutually affecting manner. **Figure 6** shows the mMLDA represented by three MLDAs.

First, in the two MLDAs shown in **Figures 6A,B**, the probabilities $P(z\_j^O|\mathbf{w}\_j^v, \mathbf{w}\_j^a, \mathbf{w}\_j^h)$ and $P(z\_j^M|\mathbf{w}\_j^p)$ that the object and motion categories of the multimodal information in the $j$-th data become $z\_j^O$ and $z\_j^M$, respectively, can be computed using Gibbs sampling. These probabilities are represented by finite and discrete parameters, which can be sent to the integrated concept model shown in **Figure 6C**, where $\hat{z}\_j^O$ and $\hat{z}\_j^M$ can be treated as observed variables sampled using these probabilities:

$$\hat{z}\_{jn}^{O} \sim P(z\_j^{O} | \mathbf{w}\_j^{\nu}, \mathbf{w}\_j^{a}, \mathbf{w}\_j^{h}),\tag{21}$$

$$\hat{z}\_{jn}^{M} \sim P(z\_j^{M} | \mathbf{w}\_j^{p}),\tag{22}$$

where $\mathbf{w}\_j^v$, $\mathbf{w}\_j^a$, $\mathbf{w}\_j^h$, and $\mathbf{w}\_j^p$ represent the visual information, auditory information, haptic information, and joint angles of the human's motion, respectively, which are included in the $j$-th data.

Thus, in the integrated concept model, category z can be formed in an unsupervised manner. Next, the values of the shared latent variables are inferred stochastically using a learned integrated concept model:

$$P(z^O|\hat{\mathbf{z}}\_j^M, \hat{\mathbf{z}}\_j^O) = \sum\_z P(z^O|z)P(z|\hat{\mathbf{z}}\_j^M, \hat{\mathbf{z}}\_j^O),\tag{23}$$

$$P(z^M|\hat{\mathbf{z}}\_j^M, \hat{\mathbf{z}}\_j^O) = \sum\_z P(z^M|z)P(z|\hat{\mathbf{z}}\_j^M, \hat{\mathbf{z}}\_j^O).\tag{24}$$

These probabilities are also represented by finite and discrete parameters, which can be communicated using the MP approach. These parameters are sent to the object concept model and the motion concept model, respectively, where the latent variables assigned to the modality information $m \in \{v, a, h, p\}$ of concept $C \in \{O, M\}$ are determined using Gibbs sampling:

$$z\_{jmn}^C \sim P(z^C | \mathbf{W}^m, \mathbf{Z}\_{-jmn}) P(z^C | \hat{\mathbf{z}}\_j^M, \hat{\mathbf{z}}\_j^O),\tag{25}$$

where $\mathbf{W}^m$ represents all the information of modality $m$, and $\mathbf{Z}\_{-jmn}$ represents the set of latent variables excluding the one assigned to the information of modality $m$ of the $j$-th observation. Whereas the latent variables are sampled from $P(z^C|\mathbf{W}^m, \mathbf{Z}\_{-jmn})$ in the normal MLDA, here they are also sampled using $P(z^C|\hat{\mathbf{z}}\_j^M, \hat{\mathbf{z}}\_j^O)$. Therefore, all the latent variables are learned in a complementary manner. From the sampled variables, the parameters of $P(z\_j^O|\mathbf{w}\_j^v, \mathbf{w}\_j^a, \mathbf{w}\_j^h)$ and $P(z\_j^M|\mathbf{w}\_j^p)$ are updated, and Equations (21–25) are iterated until convergence.

**Figure 7** shows the pseudocode of the mMLDA and the corresponding graphical model. The model on the left in **Figure 7** can be constructed by connecting the latent variables based on Serket. Although only the part framed by the red rectangle was implemented in the experiment, it can easily be extended to the full model shown in this figure.

#### 4.2. Experimental Results

**Figure 8A** shows a confusion matrix of classification by the independent model, where the object and motion categories were learned independently; the vertical and horizontal axes represent the correct category index and the category index to which each object was classified, respectively. The accuracies were 98% for motions and 72% for objects. One can see that the motion categories were formed almost correctly by the independent model. However, the object categories were not formed as correctly as the motion categories. On the other hand, **Figure 8B** shows the results of the mMLDA implemented based on Serket, where the categories were learned in a complementary manner. The classification accuracies were 100% for motions and 94% for objects. The motion that could not be classified correctly by the independent model was classified correctly. Moreover, the object classification accuracy improved by 22% owing to the effects of the motion categories. In the independent model, objects in category 5 (shampoos) were classified into category 7 because of their visual similarity. In the mMLDA based on Serket, they were instead misclassified into category 3 (dressings) because the same motion (pouring) was performed with these objects. Also, the rattles (category 10) were misclassified because the rattles and the soft toys (category 9) had a similar appearance and the same motion (throwing) was performed with them. However, the other objects were classified correctly, which indicates that mutual learning was realized by Serket.

Furthermore, we conducted an experiment to investigate the efficiency of the original mMLDA, which was not divided into modules. The results in **Figure 8C** show that the classification accuracies for objects and motions were 100 and 94%, respectively, although the misclassified objects differed from those of the Serket implementation of mMLDA because of sampling. One can see that the mMLDA implementation based on Serket is comparable with the original mMLDA.

**Table 1** shows the computation time of the mMLDA implemented by each method. The independent model was fastest because the parameters of the two MLDAs were learned independently. The Serket implementation was slower than the independent model but faster than the original mMLDA. In the original mMLDA, all the observations were used for parameter estimation of the integrated concept model. In the Serket implementation, on the other hand, this was approximated, and only the parameters sent from the lower-level MLDAs in Equations (21, 22) were used for parameter estimation of the integrated concept model. Thus, the Serket implementation is faster than the original mMLDA.

# 4.3. Deeper Model

In the original mMLDA, the structure of the model was fixed; we derived the equations to estimate its parameters and then implemented them. However, by using Serket, we can flexibly change the structure of the model without deriving the equations for parameter estimation. As one example, we changed the structure of the mMLDA and constructed a deeper model, as shown in **Figure 9**. To confirm that the parameters can be learned using Serket, we generated training data using the following generative process:

$$z\_{5,1} \sim P(z|\theta\_5) \tag{26}$$

$$\mathbf{o}\_5 \sim P(\mathbf{o}|\phi\_{z\_{5,1}}) \tag{27}$$

$$\text{for } m = 4 \text{ down to } 1\text{:}$$

$$z\_{m,1} \sim P(z | z\_{m+1,1}, \theta\_m) \tag{28}$$

$$\mathbf{o}\_m \sim P(\mathbf{o}|\boldsymbol{\phi}\_{z\_{m,1}}) \tag{29}$$

where $m$ denotes the index of the hierarchies, and the number of categories of all modules was 10. $\theta\_m$ and $\phi\_z$ were randomly generated, and we used a uniform distribution as $P(z|\theta\_5)$. This generative process was repeated 50 times, yielding 250 observations. The parameters were estimated by classifying these 250 observations with a Serket implementation and with the independent model. **Table 2** shows the classification accuracies in each hierarchy. We can see that the Serket implementation outperformed the independent model because the parameters were optimized as a whole by using the MP approach. Usually, the equations for parameter estimation must be derived for each model individually, and deriving them for a more complicated model is difficult. However, Serket makes it possible to construct a complicated model flexibly and to estimate its parameters easily.

TABLE 1 | Computational time of mMLDA.
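The generative process of Equations (26)–(29) can be reproduced with a short script (a sketch under stated assumptions: the observation dimensionality D and the sharing of one emission table across hierarchies are our simplifications; the text only fixes 10 categories per module and 50 repetitions):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of categories in every module
D = 20  # dimensionality of each observation (assumption)
M = 5   # number of hierarchies

# Randomly generated parameters: theta[m] holds K multinomials over the
# child category given the parent category; phi[z] is the emission
# distribution of category z (shared across hierarchies for brevity).
theta = {m: rng.dirichlet(np.ones(K), size=K) for m in range(1, M)}
phi = rng.dirichlet(np.ones(D), size=K)

data = []
for _ in range(50):                         # process repeated 50 times
    z = rng.integers(K)                     # z_{5,1} ~ uniform P(z | theta_5)
    data.append(rng.multinomial(1, phi[z])) # o_5 ~ P(o | phi_{z_{5,1}})
    for m in range(4, 0, -1):               # for m = 4 down to 1
        z = rng.choice(K, p=theta[m][z])    # z_{m,1} ~ P(z | z_{m+1,1}, theta_m)
        data.append(rng.multinomial(1, phi[z]))  # o_m ~ P(o | phi_{z_{m,1}})
```

This yields the 250 observations (50 repetitions × 5 hierarchies) used for training in the experiment.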

# 5. EXAMPLE 2: MUTUAL LEARNING OF CONCEPT MODEL AND LANGUAGE MODEL

In Nakamura et al. (2014) and Nishihara et al. (2017), we proposed a model for the mutual learning of concepts and a language model, shown in **Figure 4B**; its parameters were estimated by dividing the model into smaller parts. In this section, we show that this model can be constructed with Serket. To learn the model, the visual, auditory, and haptic information obtained by the robot and the teaching utterances given by a human user were used. The details are explained in Appendix 2. As in the previous experiment, the original model gives an upper bound on performance. Therefore, the purpose of this experiment is also to show that the Serket implementation has performance comparable to that of the original model.

TABLE 2 | Classification accuracies of mMLDA having five hierarchies.


# 5.1. Implementation Based on Serket

Here, we reconsider the mutual learning model based on Serket. The model shown in **Figure 4B** is one in which the speech recognition part and the MLDA that represents the object concepts are connected; it can be divided as shown in **Figure 10**. The MLDA makes it possible to form object categories by classifying the visual, auditory, and haptic information obtained as described in Appendix 2. In addition, the words in the recognized strings of a user's utterances that teach object features are also classified in the model shown in **Figure 10**. Through this categorization of multimodal information and teaching utterances, the words and the multimodal information are connected stochastically, which enables the robot to infer the sensory information represented by the words. However, the robot cannot obtain the recognized strings directly; it can only obtain continuous speech. Therefore, in the model shown in **Figure 10**, the words **s** in the recognized strings are treated as latent variables and connected to the model for speech recognition. The parameter $\mathcal{L}$ of the language model is also a latent variable, and it is learned from the recognized strings of continuous speech **o** using the nested Pitman–Yor language model (NPYLM) (Mochihashi et al., 2009). Furthermore, an important point of this model is that the MLDA and the speech recognition model are connected through the words **s**, which makes it possible to learn them in a complementary manner. That is, the speech is not recognized based only on the similarity of **o** but is recognized more accurately by utilizing the words **s** inferred from the multimodal information perceived by the robot.

First, as the initial parameter of $\mathcal{L}$, we used a language model in which all phonemes are generated with equal probability. The MP approach could be used if all teaching utterances **O** were recognized using a language model with parameter $\mathcal{L}$ and the probability $P(\mathbf{S}|\mathbf{O}, A, \mathcal{L})$ that the word sequences **S** are generated could be computed. However, it is actually difficult to compute the probabilities for all possible word segmentation patterns of all possible recognized strings. Therefore, we approximated this probability distribution using the SIR approach. The L-best speech recognition results were utilized as samples because it is difficult to compute the probabilities for all possible recognized strings. Here, $\mathbf{s}\_j^{(l)}$ represents the $l$-th recognized string of a teaching utterance given for the $j$-th object. By applying the NPYLM and segmenting these strings into words, the word sequences $\mathbf{S} = \{\mathbf{s}\_j^{(l)} \,|\, 1 \le l \le L, 1 \le j \le J\}$ can be obtained:

$$\mathbf{S} \sim P(\mathbf{S}|\mathbf{S}', \mathcal{L}).\tag{30}$$

These generated samples are sent to the MLDA module, and the samples that are likely to represent the multimodal information are resampled based on the MLDA, whose current parameter is $\boldsymbol{\Theta}$:

$$
\hat{\mathbf{s}}\_{j} \sim P(\mathbf{s}\_{j}^{(l)} | \mathbf{w}\_{j}^{\boldsymbol{\nu}}, \mathbf{w}\_{j}^{\boldsymbol{a}}, \mathbf{w}\_{j}^{\boldsymbol{t}}, \boldsymbol{\Theta}).\tag{31}
$$

The selected samples $\hat{\mathbf{s}}\_j$ are considered to be words that can represent the multimodal information. Then, the MLDA parameters are updated by Gibbs sampling, using the set of these words $\hat{\mathbf{S}} = \{\hat{\mathbf{s}}\_j \,|\, 1 \le j \le J\}$ and the multimodal information $\mathbf{W}^v$, $\mathbf{W}^a$, $\mathbf{W}^t$:

$$\boldsymbol{\Theta} = \operatorname\*{argmax}\_{\boldsymbol{\Theta}} P(\hat{\mathbf{S}}, \mathbf{W}^{v}, \mathbf{W}^{a}, \mathbf{W}^{t} | \boldsymbol{\Theta}).\tag{32}$$

Moreover, $\hat{\mathbf{S}}$ is sent to the speech recognition model, and the parameter $\mathcal{L}$ of the language model is updated:

$$\mathcal{L} = \operatorname\*{argmax}\_{\mathcal{L}} P(\hat{\mathbf{S}} | \hat{\mathbf{S}}', \mathcal{L}), \tag{33}$$

where $\hat{\mathbf{S}}'$ denotes the strings obtained by connecting the words in $\hat{\mathbf{S}}$. The parameters of the whole model can be optimized by iterating the following process: sampling words using Equation (30), resampling words using Equation (31), and updating the parameters using Equations (32, 33).
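This alternating update can be summarized by the following control-flow sketch (all module functions, names, and example utterances are hypothetical stubs standing in for the speech recognizer, the NPYLM, and the MLDA; only the loop structure mirrors Equations (30)–(33)):

```python
import random

random.seed(0)
N_BEST = 3  # number of recognition hypotheses kept per utterance (assumption)

def recognize(utterance, language_model):
    """Stub for N-best speech recognition under the current language model."""
    return [f"{utterance}-hyp{l}" for l in range(N_BEST)]

def segment(strings):
    """Stub for NPYLM word segmentation of recognized strings (Equation 30)."""
    return [s.split("-") for s in strings]

def select_by_mlda(candidates, theta):
    """Stub for Equation (31): choose the candidate word sequence that best
    explains the multimodal information under the current MLDA parameters."""
    return random.choice(candidates)

utterances = ["kore-wa-koppu", "kore-wa-booru"]  # hypothetical teaching speech
theta, language_model = {}, {}
for _ in range(5):  # iterate until convergence in the real model
    selected = []
    for o in utterances:
        candidates = segment(recognize(o, language_model))   # Equation (30)
        selected.append(select_by_mlda(candidates, theta))   # Equation (31)
    theta["words"] = selected           # Equation (32): update MLDA (stubbed)
    language_model["words"] = selected  # Equation (33): update LM (stubbed)
```

In the actual model, the two update steps run in separate modules, and only the selected word sequences $\hat{\mathbf{S}}$ are exchanged between them.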

**Figure 11** displays the pseudocode and the corresponding graphical model. In this model, one of the modules is an MLDA with three observations and one shared latent variable connected to the speech recognition module. $o\_1$, $o\_2$, and $o\_3$ represent the multimodal information obtained by the sensors on the robot, and $o\_4$, an observation of the speech recognition model, represents the utterances given by the human user. Although the parameter estimation of the original model proposed in Nakamura et al. (2014) and Nishihara et al. (2017) is very complicated, the model can be described concisely by connecting the modules based on Serket.

#### 5.2. Experimental Results

We conducted an experiment where the concepts were formed using the aforementioned model to demonstrate the validity of Serket. We compared the following three methods.

(a) A method where the speech recognition results $\mathbf{S}\_0'$ of the teaching utterances with maximum likelihood are segmented into words by applying the NPYLM, and the obtained words are used for concept formation.

(b) A method where the concepts and language model are learned by a mutual learning model implemented based on Serket. (Proposed method)

(c) A method where the concepts and language model are learned by a mutual learning model implemented without Serket, as proposed in Nakamura et al. (2014). (Original method)

In method (a), the following equation was used instead of Equation (30), and the parameter L of the language model was not updated:

$$\mathbf{S}\_0 \sim P(\mathbf{S}|\mathbf{S}\_0', \mathcal{L}).\tag{34}$$

In contrast, method (b) was implemented with Serket, and the concepts and language model were learned mutually through the shared latent variable **s**.

**Table 3i** shows the speech recognition accuracies of each method. In method (a), the language model was not updated; therefore, the accuracy equals that of phoneme recognition. In contrast, the accuracy of method (b) is higher than that of method (a) because the language model is updated from the words sampled by the MLDA.

**Table 3ii** shows the accuracies of word segmentation. Segmentation points were evaluated, as shown in **Table 4**, by applying dynamic-programming matching to find the correspondence between the correct and estimated segmentations. This table shows a case where the correct segmentation of the correctly recognized string "ABCD" is "A/BC/D," and the recognized string "AACD" is segmented into "A/A/CD" ("/" represents a cut point between words). The points that were correctly estimated as cut points (**Table 4b**) were evaluated as true positives (TP), and those that were incorrectly estimated (**Table 4d**) were evaluated as false positives (FP). Similarly, the points that were erroneously estimated as not being cut points (**Table 4f**) were evaluated as false negatives (FN). From the evaluation of the cut points, the precision, recall, and F-measure are computed as follows:

$$P = \frac{N\_{TP}}{N\_{TP} + N\_{FP}},\tag{35}$$

$$R = \frac{N\_{TP}}{N\_{TP} + N\_{FN}},\tag{36}$$

$$F = \frac{2RP}{R+P},\tag{37}$$

where $N\_{TP}$, $N\_{FP}$, and $N\_{FN}$ denote the number of points evaluated as TP, FP, and FN, respectively. Comparing the precision of methods (a) and (b) in **Table 3ii**, one can see that it increases with Serket. This is because more correct words can be selected among the samples generated by the speech recognition module. On the other hand, the recall of method (b) decreases because some function words (e.g., "is" and "of") are connected with other words, as in "bottleof." However, the precision of method (b) is higher, and its F-measure is higher by 0.11. Therefore, method (b), which was implemented based on Serket, outperformed method (a). **Table 3iii** displays the object classification accuracy. One can observe that the accuracy of method (b) is higher because the speech can be recognized more correctly. Moreover, the Serket implementation [method (b)] was comparable to the original implementation [method (c)]. Thus, the learning of the object concepts and language model presented in Nakamura et al. (2014) and Nishihara et al. (2017) was realized by Serket.

TABLE 3 | Accuracies of speech recognition, segmentation, and object classification.

TABLE 4 | Evaluation of segmentation.

TABLE 5 | Computation time of mutual learning model.
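Given sets of correct and estimated cut-point positions (the dynamic-programming alignment step is omitted here), the scores of Equations (35)–(37) can be computed as follows; the toy values reproduce the "A/BC/D" vs. "A/A/CD" example, whose cuts fall after characters 1 and 3 versus 1 and 2:

```python
def segmentation_scores(correct_cuts, estimated_cuts):
    """Precision, recall, and F-measure over cut-point positions,
    following Equations (35)-(37)."""
    n_tp = len(correct_cuts & estimated_cuts)  # correctly estimated cuts
    n_fp = len(estimated_cuts - correct_cuts)  # spurious cuts
    n_fn = len(correct_cuts - estimated_cuts)  # missed cuts
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

# "A/BC/D" has cuts after characters 1 and 3; "A/A/CD" after 1 and 2.
p, r, f = segmentation_scores({1, 3}, {1, 2})  # each score is 0.5 here
```

With one shared cut (TP), one spurious cut (FP), and one missed cut (FN), precision, recall, and F-measure all equal 0.5 in this example.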

**Table 5** shows the computation times of the mutual learning models. From this table, the model without mutual learning is fastest because the parameters of the MLDA and the language model are each learned independently only once. On the other hand, the Serket implementation is slower and comparable with the original model. This is because the parameters of the MLDA and the language model in the Serket implementation were updated iteratively by communicating the parameters with the MP approach, so the computational cost did not differ much from that of the original model.

# 6. CONCLUSION

In this paper, we proposed a novel architecture in which a cognitive model can be constructed by connecting modules, each of which maintains programmatic independence. Two approaches were used to connect these modules. One is the MP approach, in which the finite-dimensional parameters of a distribution are communicated between the modules. If the parameters of the distribution are infinite-dimensional or have a complex structure, the SIR approach is utilized to approximate them. In the experiments, we demonstrated two implementations based on Serket and their efficiency. The experimental results demonstrated that the implementations are comparable with the original models.

However, there is an issue regarding the convergence of the parameters. If a large number of samples can be obtained, each latent variable can converge toward its global optimum because the MP, SIR, and MH approaches are based on existing Markov chain Monte Carlo methods. However, when various types of models are connected, it is not clear whether all latent parameters converge to the global optimum as a whole. We confirmed that the parameters converged in the models used in the experiments. Nonetheless, this remains a difficult and important issue, which will be examined in future work.

We believe that the models that can be connected by Serket are not limited to generative probabilistic models, although we focused on connecting generative probabilistic models in this paper. Neural networks and other methods can also serve as Serket modules, and we are planning to connect them. Furthermore, we believe that large-scale cognitive models can be constructed by connecting various types of modules, each of which represents a particular brain function. In so doing, we aim to realize artificial general intelligence. Serket can also contribute to developmental robotics (Asada et al., 2009; Cangelosi et al., 2015), in which the human developmental mechanism is investigated via a constructive approach. We believe that robots can learn capabilities ranging from motor skills to language, and that these can be developed using Serket, which in turn makes it possible to better understand humans.

# AUTHOR CONTRIBUTIONS

ToN, TaN and TT conceived of the presented idea. ToN developed the theory and performed the computations. ToN wrote the manuscript with support from TaN and TT. TaN and TT supervised the project. All authors discussed the results and contributed to the final manuscript.

# REFERENCES

Chollet, F. (2015). Keras. Available online at: https://github.com/fchollet/keras

Françoise, J., Schnell, N., and Bevilacqua, F. (2013). "A multimodal probabilistic model for gesture-based control of sound synthesis," in 21st ACM International Conference on Multimedia (MM'13) (Barcelona), 705–708.

# ACKNOWLEDGMENTS

This work was supported by JST CREST Grant Number JPMJCR15E3.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnbot. 2018.00025/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Nakamura, Nagai and Taniguchi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Acquisition of Viewpoint Transformation and Action Mappings via Sequence to Sequence Imitative Learning by Deep Neural Networks

Ryoichi Nakajo<sup>1</sup> , Shingo Murata<sup>2</sup> , Hiroaki Arie<sup>2</sup> and Tetsuya Ogata<sup>1</sup> \*

*<sup>1</sup> Department of Intermedia Art and Science, Waseda University, Tokyo, Japan, <sup>2</sup> Department of Modern Mechanical Engineering, Waseda University, Tokyo, Japan*

We propose an imitative learning model that allows a robot to acquire positional relations between the demonstrator and the robot, and to transform observed actions into robotic actions. Providing robots with imitative capabilities allows us to teach them novel actions without resorting to trial-and-error approaches. Existing methods for imitative robotic learning require mathematical formulations or conversion modules to translate positional relations between demonstrators and robots. The proposed model uses two neural networks: a convolutional autoencoder (CAE) and a multiple timescale recurrent neural network (MTRNN). The CAE is trained to extract visual features from raw images captured by a camera. The MTRNN is trained to integrate sensory-motor information and to predict next states. We implemented this model on a robot and conducted sequence to sequence learning that allows the robot to transform demonstrator actions into robot actions. Through training of the proposed model, representations of actions, manipulated objects, and positional relations are formed in the hierarchical structure of the MTRNN. After training, we confirmed the capability of generating unlearned imitative patterns.

#### Edited by:

*Ganesh R. Naik, Western Sydney University, Australia*

#### Reviewed by:

*Bolin Liao, Jishou University, China Eiji Uchibe, Advanced Telecommunications Research Institute International (ATR), Japan*

#### \*Correspondence:

*Tetsuya Ogata ogata@waseda.jp*

Received: *30 November 2017* Accepted: *03 July 2018* Published: *24 July 2018*

#### Citation:

*Nakajo R, Murata S, Arie H and Ogata T (2018) Acquisition of Viewpoint Transformation and Action Mappings via Sequence to Sequence Imitative Learning by Deep Neural Networks. Front. Neurorobot. 12:46. doi: 10.3389/fnbot.2018.00046*

Keywords: imitative learning, human-robot interaction, recurrent neural networks, deep neural networks, sequence to sequence learning

# 1. INTRODUCTION

Today there is increased interest in robots capable of working in human living environments. Robot motions are generally preprogrammed by engineers, but it is crucial for robots to learn new actions in the context of their work environment if they are to work with humans. One way for robots to learn new actions is imitation, the behavioral capability of generating equivalent actions after observing a demonstrator's actions. Imitation is a powerful learning method that humans apply to acquire new actions without resorting to trial-and-error attempts. Hence, robot acquisition of imitative abilities will realize programming by demonstration (PbD) (Billard et al., 2008), in which new action skills are acquired from demonstrators without any prior design.

Early studies of imitation learning are related to computational neuroscience, focusing on task-level imitation such as assembly (Kuniyoshi et al., 1994), kendama manipulation (Miyamoto et al., 1996), and tennis serves (Miyamoto and Kawato, 1998). To date, the main approaches to imitative learning have been probabilistic models, reinforcement learning, and neural networks. Among probabilistic models, hidden Markov models realize behavior recognition and generation through imitative learning (Inamura et al., 2004) and imitation of object manipulation (Sugiura et al., 2010). Gaussian mixture models allow robots to imitate human gestures (Calinon et al., 2010). Reinforcement learning has been used for robot acquisition of motor primitives (Kober and Peters, 2010) and applied to task-level learning (Schaal, 1997). By combining reinforcement learning with a Gaussian mixture model, Guenter et al. (2007) achieved robot imitation of reaching movements. Neural network approaches mainly use recurrent neural networks, which allow robots to imitate human gesture patterns (Ito and Tani, 2004) and object manipulations (Ogata et al., 2009; Arie et al., 2012).

From another perspective, cognitive developmental robotics (Asada et al., 2009; Cangelosi et al., 2010) has tried to understand the development of human cognitive abilities through robot experiments based on constructive approaches. In studies focusing on imitative learning, robots were trained to learn imitative tasks by Hebbian learning (Nagai et al., 2011; Kawai et al., 2012) and neural networks (Ogata et al., 2009; Arie et al., 2012; Nakajo et al., 2015). Through training, experimenters observe behavior changes in robots and in the internal states of the learning models, and then consider the developmental processes of imitation. The Hebbian learning approach reveals changes in granularity during visual development, allowing the robot to recognize self–other correspondences (Nagai et al., 2011; Kawai et al., 2012). Our previous studies used recurrent neural networks to demonstrate how robots can translate others' actions into their own (Ogata et al., 2009), imitative ability for the composition of behaviors (Arie et al., 2012), and recognition of positional relations between self and other (Nakajo et al., 2015).

For robots working in human living environments, imitation of demonstrator behaviors roughly comprises two processes: (1) observing the behavior and (2) transforming the observed behavior into an action. During observations, robots are expected to extract information about the imitated behavior. In the transformation process, robots must extract necessary information from the observations, and match them with their own actions. Robots cannot always observe behaviors from the same position, but are expected to recognize and reproduce behaviors regardless of the position from which they were observed. However, few previous studies have focused on positional relations between robots and demonstrators or considered correspondences between actions provided from various positions.

If robots are to observe demonstrated actions and transform them into their own actions, they must process raw images and extract from them the information necessary for imitation. However, the huge dimensionality of raw data makes direct processing too difficult. Deep learning techniques are regarded as a solution to this problem (LeCun et al., 2015), because deep learning can process raw data and allows machines to automatically extract the information necessary for requested tasks. For instance, deep learning techniques have outperformed previous methods for image recognition (Krizhevsky et al., 2012). Over the past several years, deep learning has been applied to action learning by robots, and many studies have investigated imitative learning through deep learning (Liu et al., 2017; Sermanet et al., 2017; Stadie et al., 2017). Stadie et al. applied deep learning methods to transformation of demonstrator views into robot control features. Sermanet et al. and Liu et al. trained learning models to relate demonstrator views from various positions with the robot view. After training learning models to transform demonstrator views, reinforcement learning (Liu et al., 2017; Stadie et al., 2017) or supervised learning methods (Sermanet et al., 2017) are applied to allow robots to imitate behaviors. Although these learning methods are suited to allowing robots to acquire imitative skills regardless of positional relations, demonstrators cannot provide their views to robots in actual environments; robots must instead capture demonstrator behaviors via cameras and relate the observed behaviors to their own situation.

Various training methods have also been researched in the field of deep learning. One common method applied to robot action learning is end-to-end learning, in which the learning model receives images and robot motor commands, and directly plans the robot's actions. Another technique often applied to natural language translation is sequence to sequence learning (Sutskever et al., 2014), which allows translation of a multi-dimensional time series into another time series. Utilizing this characteristic, Yamada et al. (2016) allowed a robot to perform tasks based on language instructions. This characteristic can also be applied to imitative learning, because robots must translate observations of demonstrator actions into their own actions. We thus consider the application of sequence to sequence learning to imitative learning.

The main contribution of this paper is a demonstration of how a robot can acquire the following two abilities: (1) automatic visual-feature extraction, and (2) transformation from human demonstration into robotic action when positional differences are present. This paper proposes an imitative learning model that simultaneously enables a robot to acquire positional relations between a demonstrator and the robot, and transforms observed actions into the robot's own actions. In the learning process, the robot observes demonstrator actions using a mounted camera, and no pre-training is provided. To achieve imitative abilities, we combined two deep neural network models. An autoencoder extracts visual features from raw camera images, and a dynamic neural network model called a multiple timescale recurrent neural network (MTRNN) (Yamashita and Tani, 2008) is trained to learn imitative tasks. The MTRNN learns positional relations between a demonstrator and a robot. To allow the robot to learn how to translate observed actions into its own actions, the MTRNN is trained based on a sequence to sequence approach (Sutskever et al., 2014). In experiments, we imposed object manipulation tasks on a robot and conducted predictive learning to train the proposed learning model. After training, we confirmed that the robot could translate observed actions into its own actions. By inspecting the internal states of the MTRNN, we show how the robot recognizes positional relations between the demonstrator and the robot during tasks. We also considered what information the robot extracts through observation and translates into actions.

# 2. METHODS

# 2.1. Sequence to Sequence Learning of Imitative Interaction

We first describe the method by which robots use our proposed learning model to learn imitative interactions. We apply sequence to sequence learning (Sutskever et al., 2014) to map observed demonstrator actions to robot actions. Sequence to sequence learning is a learning method for RNNs that is mainly used in the machine translation field. By inputting series of sentences in the source and target languages to an RNN, sequence to sequence learning allows forward propagation in the RNN both to recognize the meaning of the source sentence and to generate a sentence in the target language from the internal states acquired through encoding the source sentence. We use sequence to sequence learning to encode demonstrator actions and to generate robot actions. As **Figure 1** shows, by concatenating demonstrator and robot actions and inputting the concatenated sequences to an RNN, the network is expected to learn how to map the demonstrator actions to robot actions.
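A minimal sketch of this concatenation step follows; the function and array names are our own illustration, not the paper's code:

```python
import numpy as np

def build_seq2seq_sequence(demo_obs, robot_act):
    """Concatenate the observation half (the demonstrator's actions, as
    captured by the robot) and the generation half (the robot's own
    actions) along the time axis, giving one training sequence for
    sequence to sequence learning. Both arguments are (time_steps, dims)
    sensory-motor arrays with identical feature dimensionality."""
    assert demo_obs.shape[1] == robot_act.shape[1]
    return np.concatenate([demo_obs, robot_act], axis=0)
```

During training the RNN is driven through the whole concatenated sequence, so the first half plays the role of the source sentence and the second half the target sentence in the machine-translation analogy.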

# 2.2. Overview of Proposed Learning Model

Robot imitation of demonstrator actions requires observation of demonstrator actions and transformation of observed actions into robot actions. The robot must process visual information to extract information related to demonstrator actions. Captured camera images have too many dimensions to process directly. The robot thus requires functions for automatically compressing and extracting visual information. To map extracted visual information from demonstrator actions to robot actions, visual features and robot motor information must be integrated into a single learning scheme. Doing so requires another learning model for integrating this information, separate from visual feature compression.

Our proposed learning model satisfies these conditions by including two neural networks. The first is a deep neural network called a convolutional autoencoder (CAE), which is applied to extraction of visual features from camera images. The second is a multiple timescale recurrent neural network (MTRNN), which we use to integrate time series of extracted visual features with robot motor information. **Figure 2** shows an overview of the proposed learning model. In the following subsections, we explain the CAE method for extracting visual features and the MTRNN method for integrating them with motor information.

# 2.3. Visual Feature Extraction via Convolutional Autoencoder

An autoencoder is a neural network with bottleneck layers, comprising an encoder that dimensionally compresses input images and a decoder that restores dimensionality in output images (Hinton and Salakhutdinov, 2006). Updating the learnable parameters of the autoencoder so that it outputs its input image identically allows the network to acquire, at the narrowest layer, lower-dimensional features representing the input images. By compressing input images in this way, the robot can extract visual features of camera images with little information loss.

In this study, we applied a convolutional autoencoder (CAE), which is an autoencoder including convolution layers (Masci et al., 2011). Convolution is an arithmetic process inspired by the mammalian visual cortex, and is expected to extract visual features by focusing on spatial localities in the images. We combined a conventional CAE with fully connected layers. Camera images are taken as input, and the CAE is trained to minimize the mean squared error between input and reconstructed images. The mean squared error $E\_{\rm AE}$ is computed as

$$E\_{\rm AE} = \frac{1}{N} \sum\_{n}^{N} E\_{\rm AE}^{(n)},\tag{1}$$

FIGURE 1 | Sequence to sequence learning scheme of the RNN. In the first half of the time sequence, the robot moves only its head and captures images of the action being demonstrated. From the captured images, the RNN is expected to recognize and encode the demonstrator actions. In the second half of the time sequence, the RNN receives the encoded internal states, plans robot actions, and issues robot motor commands.

$$E\_{\rm AE}^{(n)} = \frac{1}{HWC} ||\hat{\mathbf{X}}^{(n)} - \mathbf{X}^{(n)}||\_2^2,\tag{2}$$

where N is the mini-batch size; $\hat{\mathbf{X}}^{(n)}$ is the nth input image; $\mathbf{X}^{(n)}$ is the nth reconstructed image; and H, W, and C indicate the height, width, and number of channels, respectively, of the images. To avoid drastic changes in the extracted visual features between consecutive time steps, we furthermore applied the following slow penalty introduced in Finn et al. (2016):

$$g(\mathbf{f}\_t) = \eta \cdot \left| \left| (\mathbf{f}\_{t+2} - \mathbf{f}\_{t+1}) - (\mathbf{f}\_{t+1} - \mathbf{f}\_t) \right| \right|^2 \qquad (t \ge 1), \tag{3}$$

where $\mathbf{f}\_t$ indicates the visual features extracted from an image at time step t, and η is a hyper-parameter controlling the strength of the penalty.
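The penalty of Eq. (3) is a squared second difference of the feature trajectory; a minimal sketch (the function name and the batched form are our own illustration):

```python
import numpy as np

def slow_penalty(f, eta=1.0e-5):
    """Slowness penalty of Eq. (3): the squared norm of the second
    difference (f[t+2] - f[t+1]) - (f[t+1] - f[t]) of the visual-feature
    trajectory. `f` is a (T, D) feature array with T >= 3; returns one
    penalty value per valid time step t."""
    second_diff = f[2:] - 2.0 * f[1:-1] + f[:-2]
    return eta * np.sum(second_diff ** 2, axis=1)
```

Features that change at a constant rate incur no penalty, so the CAE is pushed toward smoothly varying, rather than abruptly jumping, features.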

# 2.4. Sensory-Motor Integration by Multiple Timescale Recurrent Neural Network

Generating imitative actions from observation of demonstrator actions requires a function that integrates the visual features extracted by the CAE with robot motor information. In this work, we use a dynamic neural network model called a multiple timescale recurrent neural network (MTRNN) (Yamashita and Tani, 2008). An MTRNN has different time constants in its hierarchically organized context layers. The layer connected to the input–output layers ["fast context" (FC) in **Figure 2B**] is a group of neurons with a smaller time constant, and so responds quickly to current external inputs. The other layer, connected only to neurons in the context layers ["slow context" (SC) in **Figure 2B**], has a larger time constant, and so responds more slowly. Yamashita and Tani (2008) demonstrated that stacking layers with different timescales allows the robot to acquire action primitives in the FC layer, while the order of sequential combinations of primitives is described in the SC layer.

In MTRNN forward propagation, the internal state $u\_{t,i}^{(s)}$ of the ith neural unit (FC, SC, or output) at time step t of the sth sequence is calculated as

$$u\_{t,i}^{(s)} = \begin{cases} \left(1 - \frac{1}{\tau\_i}\right) u\_{t-1,i}^{(s)} + \frac{1}{\tau\_i} \left(\sum\_{j \in I\_{\rm I}} w\_{ij} x\_{t,j}^{(s)} + \sum\_{j \in I\_{\rm FC} \cup I\_{\rm SC}} w\_{ij} c\_{t-1,j}^{(s)} + b\_i\right) & (t \ge 1,\ i \in I\_{\rm FC}), \\\\ \left(1 - \frac{1}{\tau\_i}\right) u\_{t-1,i}^{(s)} + \frac{1}{\tau\_i} \left(\sum\_{j \in I\_{\rm FC} \cup I\_{\rm SC}} w\_{ij} c\_{t-1,j}^{(s)} + b\_i\right) & (t \ge 1,\ i \in I\_{\rm SC}), \\\\ \sum\_{j \in I\_{\rm FC}} w\_{ij} c\_{t,j}^{(s)} + b\_i & (t \ge 1,\ i \in I\_{\rm O}), \end{cases} \tag{4}$$

where $I\_{\rm I}$, $I\_{\rm FC}$, $I\_{\rm SC}$, and $I\_{\rm O}$ are the index sets of the input, fast-context, slow-context, and output neural units, respectively; $\tau\_i$ is the time constant of the ith neuron; $w\_{ij}$ is the connection weight from the jth to the ith neural unit; $x\_{t,j}^{(s)}$ is the external input of the jth neural unit at time step t of the sth sequence; $c\_{t,j}^{(s)}$ is the activation value of the jth context neuron at time step t of the sth sequence; and $b\_i$ is the bias of the ith neural unit. We use tanh as the activation function for the context neural units $c\_{t,i}^{(s)}$ and output units $y\_{t,i}^{(s)}$.
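A minimal numerical sketch of one forward step of Eq. (4) follows; the parameter dictionary, its keys, and the weight shapes are our own naming, not the paper's implementation:

```python
import numpy as np

def mtrnn_step(u_fc, u_sc, x_t, p):
    """One forward step of an MTRNN, following Eq. (4).
    u_fc, u_sc : internal states of fast/slow context units at t-1
    x_t        : external input at time step t
    p          : dict of weights W_*, biases b_*, time constants tau_*"""
    c = np.tanh(np.concatenate([u_fc, u_sc]))  # context activations c_{t-1}
    # fast context: leaky integration of external input plus all context units
    u_fc = (1 - 1 / p["tau_fc"]) * u_fc + (1 / p["tau_fc"]) * (
        p["W_xf"] @ x_t + p["W_cf"] @ c + p["b_f"])
    # slow context: leaky integration of context units only (no external input)
    u_sc = (1 - 1 / p["tau_sc"]) * u_sc + (1 / p["tau_sc"]) * (
        p["W_cs"] @ c + p["b_s"])
    # output: static tanh readout of the new fast-context activations
    y_t = np.tanh(p["W_fo"] @ np.tanh(u_fc) + p["b_o"])
    return u_fc, u_sc, y_t
```

With tau_fc = 2.0 and tau_sc = 64.0 (the values used in Section 3.3.2), the slow context retains most of its previous state at each step, while the fast context tracks the current input closely.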

We trained the MTRNN by minimizing the mean squared error with the gradient descent method. The mean squared error ERNN is described as

$$E\_{\rm RNN} = \frac{1}{S} \sum\_{s}^{S} \frac{1}{T^{(s)}} \sum\_{t}^{T^{(s)}} E\_{{\rm RNN},t}^{(s)},\tag{5}$$

$$E\_{{\rm RNN},t}^{(s)} = \frac{1}{Y} \left|\left|\hat{\mathbf{y}}\_t^{(s)} - \mathbf{y}\_t^{(s)}\right|\right|\_2^2,\tag{6}$$

where S is the number of sequential data, $T^{(s)}$ is the number of time steps of the sth sequential data item, Y is the number of neural units in the output layer, $\hat{\mathbf{y}}\_t^{(s)}$ is the target sensory-motor values at time step t of the sth sequence, and $\mathbf{y}\_t^{(s)}$ is the predicted sensory-motor values at time step t of the sth sequence. The learnable parameters of the MTRNN comprise the connection weights $\mathbf{w}$, biases $\mathbf{b}$, and initial internal states of the context layers $\mathbf{u}\_0^{(s)}$. The gradients of these learnable parameters follow the conventional back propagation through time method (Rumelhart et al., 1986).

# 3. EXPERIMENT

# 3.1. Task Design

This section describes an experimental task given to a humanoid robot (NAO; Aldebaran Robotics). The task in this experiment is imitative interaction for object manipulation, as shown in **Figure 3A**. Imitative interaction cycles comprised four processes: (i) the demonstrator shows the object manipulation action to the robot, then (ii) passes the manipulated object to the robot. Next, (iii) the robot mimics the observed manipulation, and (iv) the demonstrator receives the object from the robot. Furthermore, actions, manipulated objects, and positional relationships between the robot and the demonstrator were varied between cycles. The manipulated objects were two toys (a chick and a watering can), shown in **Figure 3B**. Objects were manipulated in two ways (move-side and move-up), as shown in **Figure 3C**. The positional relationship between the robot and the demonstrator varied according to where the demonstrator presented the action. We define 180◦ as the position where the action is presented directly in front of the robot. Accordingly, positions of 120, 150, 180, 210, and 240◦, labeled counterclockwise in the positive direction, are used as the positional relationships between the demonstrator and the robot. **Figure 3D** shows a schematic diagram of these positional relations. Under these conditions, a single cycle can take 20 patterns, from two objects, two movements, and five positional relations.

# 3.2. Training Data

This subsection describes the method for creating sequential training data. In this experiment, the training data consisted of time series of the robot joint angles and 120 × 160 RGB images captured by a front-facing camera mounted in the robot's mouth. The CAE extracts visual features from the captured images. The controlled joints had four degrees of freedom (DoF) (ShoulderPitch, ShoulderRoll, ElbowYaw, and ElbowRoll) in each arm and two DoF (HeadPitch and HeadYaw) in the neck.

To prepare the training data, the robot was controlled and the actual joint angles and images were recorded. A control method for both arms was predesigned, and the arms tracked the planned trajectories with noise: Gaussian noise with variance 0.0001 was added to the planned trajectories to augment the training data. The neck joint angles were operated by proportional–integral–derivative control so that the centroid of the manipulated object remained centered in the camera images during interaction. While recording training data, joint angles and camera images were sampled every 400 ms. Because the recorded joint angles and camera images had different value ranges, the information was normalized before input to the neural networks: joint angles were scaled to [−1.0, 1.0] according to the angle limits, and image pixel values were normalized from [0, 255] to [−1.0, 1.0].
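The two normalizations can be sketched as follows; the per-joint limits `lo`/`hi` are placeholders for NAO's actual joint ranges:

```python
import numpy as np

def normalize_joints(angles, lo, hi):
    """Scale joint angles to [-1.0, 1.0] according to per-joint limits."""
    return 2.0 * (angles - lo) / (hi - lo) - 1.0

def normalize_image(img):
    """Scale pixel values from [0, 255] to [-1.0, 1.0]."""
    return img.astype(np.float32) / 127.5 - 1.0
```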

This experiment separately recorded the processes of the imitative interaction tasks, namely the demonstrator actions, the robot actions, and the object passing. After recording, the processes were combined to generate an imitative interaction cycle. There were 160 time steps each for the demonstrator and robot actions and 60 for each passing of the object between the demonstrator and the robot, for a total of 440 time steps. Each sequence of the 20 combinations was generated five times, for a total of 100 instances of recorded data.
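The assembly of one 440-step cycle from the separately recorded phases can be sketched as below; the function and array names are illustrative, and we assume both passing phases are recorded segments of their own:

```python
import numpy as np

def build_cycle(demo, hand_over, robot, hand_back):
    """Combine the four separately recorded phases into one interaction
    cycle: demonstration (160 steps) + passing (60) + robot action (160)
    + passing (60) = 440 time steps in total."""
    cycle = np.concatenate([demo, hand_over, robot, hand_back], axis=0)
    assert cycle.shape[0] == 440
    return cycle
```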

# 3.3. Training of CAE and MTRNN

The robot was trained with imitative interaction tasks through predictive learning of recorded time series including joint angles and camera images.

### 3.3.1. Visual Feature Learning via CAE

We first trained the CAE with camera images to extract the visual features that are input to the MTRNN together with the robot joint angles. The input 120 × 160 RGB images have 57,600 dimensions. The CAE was trained to minimize errors between the original inputs and the reconstructed images, and to extract 10 visual features from its middle layer. **Table 1** presents the detailed CAE structure used in this learning experiment. For CAE training, we conducted mini-batch training with an Adam optimizer (Kingma and Ba, 2015), setting the Adam hyperparameters as α = 0.01, β₁ = 0.9, and β₂ = 0.99, the mini-batch size as 200, and the slow penalty strength as η = 1.0 × 10⁻⁵. The learnable CAE parameters were updated 7,500 times.

### 3.3.2. Sensory-Motor Integration Learning via MTRNN

#### TABLE 1 | The structure of the CAE.

*In the "Processing" column, conv, deconv, and linear respectively indicate convolutional encoding, deconvolutional decoding, and fully connected transformation. The input dimensions of convolutional and deconvolutional layers are shown as* (*height*, *width*, *channel*)*, and those of fully connected layers as d.*

After extracting visual features by the trained CAE, time series of sensory-motor information were generated by concatenating robot joint angles and extracted visual features. To allow the robot to carry out imitative interactions, training sequences for input to the MTRNN were created by connecting several combinations of imitative tasks. In this case, a training sequence comprised four randomly selected imitative tasks, with repetition allowed. An interval of 5–30 time steps was inserted between the connected time series, during which the robot retained the same pose. Under these conditions, 100 sequences were generated as MTRNN training data.
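A sketch of this sequence construction; the function name, the random generator, and the helper logic are our own illustration:

```python
import numpy as np

def make_training_sequence(cycles, rng, n_tasks=4):
    """Chain n_tasks randomly chosen imitative cycles (repeats allowed),
    inserting a hold of 5-30 time steps between them during which the
    last pose is simply repeated."""
    chosen = [cycles[i] for i in rng.integers(0, len(cycles), size=n_tasks)]
    parts = [chosen[0]]
    for cyc in chosen[1:]:
        gap = int(rng.integers(5, 31))                        # 5-30 steps
        parts.append(np.repeat(parts[-1][-1:], gap, axis=0))  # hold last pose
        parts.append(cyc)
    return np.concatenate(parts, axis=0)
```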

While there were 20 combinations of imitative tasks, we trained the MTRNN with 10 combinations to evaluate generalizability to unlearned combinations. **Table 2** shows the 10 combinations used for MTRNN training to predict the next state of joint angles and visual features.

#### TABLE 2 | MTRNN training sequences.

*Rows show actions, and columns show positional relationships. In each cell, characters C and W indicate the manipulated object (*chick *or* watering can*). The time sequence indicated in each cell is used for MTRNN training.*

There were 10 joint angles and 10 extracted visual features, for a total of 20 dimensions input to the MTRNN. We set the numbers of neural units in the FC and SC layers as 180 and 20, and their time constants as 2.0 and 64.0, respectively. For training, we used the Adam optimizer with hyperparameters α = 0.01, β₁ = 0.9, and β₂ = 0.99. The learnable parameters were updated 10,000 times with these settings.

# 4. TRAINING RESULTS

# 4.1. Reconstructed Images by CAE

After CAE training, the mean squared error between trained images and their reconstructed output was at most 0.0141. The worst mean squared error between untrained images and their reconstructions was 0.0150. **Figure 4** shows a selection of untrained images and their reconstructions. The reconstructed image in **Figure 4** suggests that the trained CAE could regenerate the original input images. We applied principal component analysis (PCA) to the visual features extracted by the CAE at the beginnings of the demonstrations and the robotic actions. As shown in **Figure 5A**, the positional relationships between the demonstrator and the robot were separated in the visual features at the beginning of the demonstrations. **Figure 5B** shows that the manipulated objects were separated in the visual features at the beginning of the robotic actions. Since the CAE could extract visual features from images, we used time series of the extracted visual features for training of the MTRNN. An example of a time series of the extracted visual features is shown in **Figure 5C**.
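The PCA used here and in Section 4.3 can be sketched with a plain SVD, as a stand-in for any PCA library routine (the function name is ours):

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project row vectors (e.g., extracted visual features or context-unit
    internal states) onto their first principal components via SVD of the
    mean-centered data matrix."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

The projected coordinates are what the scatter plots in Figures 5, 7–10 visualize: each point is one sequence's feature or state vector reduced to the leading components.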

# 4.2. Robot Action Generation

After MTRNN training, we evaluated the mean squared error between trained target sequences and predicted output, which was 0.00140 at worst. We then input new sequences generated from combinations that included untrained series, and evaluated the mean squared error, which was 0.00164 at worst. **Figure 6** shows the MTRNN-predicted output for the untrained input [move-side, chick] as observed from position 150◦. By using the predicted output of the MTRNN for untrained input, the robot could imitate demonstrator actions.

# 4.3. Internal States in MTRNN

Principal component analysis was performed on the internal MTRNN state to grasp the internal structure the MTRNN acquired through predictive learning of robot sensory-motor information. We conducted PCA on internal states in the FC and SC layers at the time when the demonstrator ended the actions. **Figure 7** shows the difference in the positional relationship between the demonstrator and the robot in the FC layer, and **Figure 8** shows the difference between imitative actions and manipulated objects. As shown in **Figure 7**, the FC layer in the MTRNN separated positional relationships between the robot and the demonstrator when demonstrator actions were complete. At the same time, differences in imitative actions are clustered in the plane described by PC1 and PC2 of the internal states in the SC layer (see the upper graph in **Figure 8**). In contrast, in the plane described by PC3 and PC4 the differences between manipulated objects are separated by the dashed line in the lower graph in **Figure 8**.

FIGURE 5 | (A) Principal components of the visual features at the beginning of the demonstrations, (B) principal components of the visual features at the beginning of robotic actions (PC1–PC2), and (C) an example of a time series of the visual features for [*move-side*, *watering-can*, 150◦].

FIGURE 6 | The predicted output of an untrained [*move-side*, *chick*] sequence observed from the 150◦ position. This figure shows only the prediction for both arms. The horizontal axis indicates time steps, and the vertical axis represents predicted output of the joint angles. The solid and dotted lines show output by the MTRNN and target sequences, respectively.

We next extracted internal states in the SC layer at the time when the robot starts its action, and plotted the PCA results in **Figure 9**. As that figure shows, combinations of imitative actions and manipulated objects were clustered in the SC layer. Since the actions were distinguished at the beginning of robot imitation, the robot could map observed actions to corresponding imitative actions in advance. Similarly, the robot could acquire an ability to carry out imitative actions while retaining information about manipulated objects in the internal MTRNN states. Furthermore, the unlearned patterns indicated in **Figure 9** were recognized, so the MTRNN could acquire the ability to generalize over combinations of actions and manipulated objects.

One time step during the robot action was chosen, and the internal states were analyzed at that time. Since the robot motions comprised 160 steps, we chose the middle (80th) time step and visualized the internal states by PCA. **Figure 10** shows the internal states of the FC layer at that time, and confirms that the robot distinguished between different combinations of actions and manipulated objects while performing imitative actions. In contrast, the principal components in the FC layer do not show positional relations between the demonstrator and the robot. Therefore, the robot could transform observations into actions regardless of the positional relation. Finally, to confirm how the internal MTRNN states transition during imitative interaction, we plotted the time development of the neural units in the SC layer during interaction in a plane. **Figure 11** shows the transitions of neural activities in the SC layer during imitative interactions. The positional relationship between the demonstrator and the robot is fixed as 120◦, and the combinations of actions and manipulated objects are shown separately. The figure shows that the internal states for all patterns start from the beginning of demonstrator actions ( ), transition to robot actions (△), and finally reach the same point, where the manipulated objects are passed from the robot to the demonstrator (). Since the internal states always reach the same point, the robot could continue to recognize the actions, manipulated objects, and positional relations after a single imitative interaction. The other positional relations yielded results similar to those in **Figure 11**.

# 5. DISCUSSION

We proposed a possible imitative model that allows a robot to acquire the ability to recognize positional relations between the demonstrator and the robot, and to transform observed actions into robot actions. The imitative model had two neural networks: (1) a CAE trained to extract visual features from captured raw images, and (2) an MTRNN that integrated and predicted sensory-motor information. Through training of image reconstruction by the CAE, the robot could extract visual features from raw images captured by its camera. Through sensory-motor integration by predictive learning with the MTRNN, the robot could recognize information related to the imitative interactions, such as the positional relations between the demonstrator and the robot. In the rest of this section, we compare earlier studies with our current work, and clarify the distinction between them.

FIGURE 8 | Results of PCA of internal states in the SC layer when demonstrator actions are finished. Numbers in parentheses indicate contribution ratios of each principal component. Filled points are trained imitative patterns, and others are unlearned patterns. In the upper figure (PC1–PC2), differences of actions are separated in the PC1 direction. In the lower figure (PC3–PC4), differences of manipulated objects are classified by the dashed line.

From the viewpoint of acquiring positional relations between the demonstrator and robot, our proposed model allows the robot to recognize positional relations via predictive learning of sensory-motor sequences. Because differences in positions between the demonstrator and the robot are included in the training data, the proposed learning model might be forced to account for these differences during predictive learning. Thanks to the hierarchical structure of the MTRNN and the sequence to sequence learning method, the robot might come to process positional differences in the FC layer (shown in **Figure 7**) and to hold information required for robot actions, such as the kinds of actions and manipulated objects, in the SC layer (see **Figures 8**, **9**). In this work, the sequence to sequence learning method was used to encode the demonstrator's actions into the plan of robotic actions. Thus, the information necessary for the robotic actions may be encoded in the SC layer, while the information necessary for the current prediction may appear in the FC layer. In the current experiment, the robotic actions do not require any positional relationships between the demonstrator and the robot. Therefore, positional relationships may remain only in the FC layer. Furthermore, from **Figure 10**, conducting sequence to sequence learning that translates demonstrator actions into robot actions might allow the robot to properly transform observed actions into the same actions. In previous works, positional relations between demonstrator and robot were represented by coordinate transformations described as mathematical formulations (Billard et al., 2004; Lopes et al., 2010). Our proposed model requires no designed transformation to acquire positional relations between the demonstrator and robot. In this experiment, the robotic head moved throughout the imitative interaction, and its joint angles differed for each positional relationship during the demonstration phase. These differences in the robotic head depended on the positional relationships between the demonstrator and the robot. Thus, the proposed learning model might be required to optimize for these differences during predictive learning. Through predictive learning of sensory-motor sequences, including positional differences between the demonstrator and robot, the robot could automatically recognize the differences and transform demonstrator actions into robot actions. Our previous work (Nakajo et al., 2015) allowed robots to acquire information about actions and positional relations by labeling this information and providing constraints that make the activities of neural units representing the same information close. In contrast, the current work eliminates labeling of actions and positional relations by conducting sequence to sequence learning.

FIGURE 9 | Internal states in the SC layer at the beginning of robot actions (PC1–PC2). Filled points indicate trained imitative patterns, and outlined marks are unlearned. Combinations of imitative actions and manipulated objects can be clustered by the two dotted lines.

FIGURE 10 | Internal states in the FC layer while conducting robot actions (PC1–PC2–PC3). Filled points indicate trained imitative patterns, and outlined marks are unlearned patterns. Actions and objects are distinguished in this 3D space, but positional differences between the demonstrator and the robot are ignored.

From the perspective of action translations, sequence to sequence learning methods might contribute to learning how to translate demonstrator actions into robot actions. As **Figures 7**, **8** show, the robot recognized positional relations, actions, and manipulated objects in the demonstration phase. From **Figure 10**, after a demonstration, the robot could perform observed actions regardless of positional relation. Thanks to the characteristics of sequence to sequence learning, which can translate one multidimensional sequence into another sequence, the robot acquired the ability to choose information necessary for conducting actions. In addition, we conducted a validation trial in which the demonstrations from untrained positional relationships (135, 165, 195, and 225◦ ) were given to the MTRNN. The demonstrations observed from all untrained positions could be translated into the proper robotic actions by the MTRNN. On the other hand, although the MTRNN could map the untrained positional relationships into the points between the trained positional relationships, sometimes mapping failed and these relationships appeared at different points in the PCA space of **Figure 7**. These failures might come from visual features extracted by the CAE. In the current experiment, differences in the positional relationships were present in the visual images and the joint angles of the robotic head. However, the CAE did not learn to extract visual features from the untrained positional relationships. Thus, it may be difficult to extract these visual features with the CAE, which could affect predictions by the MTRNN. Previous studies applied separate modules to transform positional differences (Ogata et al., 2009; Liu et al., 2017; Sermanet et al., 2017). Ogata et al. (2009) used a mixture-of-experts algorithm, where each expert module translated demonstrator actions provided from a different position. 
Positional relations that the robot could recognize were thus limited by the number of experts, although the robot could imitate observed actions from various positions. In this paper, every positional relationship is acquired within the internal structure of a single RNN, so the robot can process various positional relations. Sermanet et al. (2017) and Liu et al. (2017) used deep neural networks that associated demonstrator views with robot views. These methods were very powerful, because no previous knowledge was required to associate the views. However, third-person views synchronized with robot views were needed to translate actions. In this paper, the robot requires only its own view, so a robot-mounted camera is sufficient in an actual environment. Furthermore, from the viewpoint of transforming actions, previous works used separate modules to extract invariances included in the views, and additional training was required to learn robot actions. Our proposed model allowed the robot to simultaneously learn recognition of positional relations and action transformation, so no pre-training was needed to integrate sensory-motor information.

When we trained the CAE to extract visual features from the robot's vision, we input visual frames independently of one another. However, for sensory-motor integration aimed at sequential tasks, visual feature learning in which the learning model sequentially predicts images may be required. In the experiment described in this paper, robot actions were determined at the end of the demonstration, and only the passing of objects occurred between the end of the demonstrator actions and the beginning of the robot actions. Thus, the internal representations in the SC layer at both points might be expected to be similar. However, discrimination of manipulated objects was not acquired at the end of the demonstrator actions, as shown in **Figure 8**; it was instead achieved at the beginning of the robot actions, as shown in **Figure 9**. This difference in representations might come from the prediction error arising from visual information. For the CAE, the difficulty of reconstructing an object depends on the size of the object region: reconstructing smaller objects is more difficult than reconstructing larger ones. In this paper, the regions of manipulated objects during the demonstration were smaller than those during the robot actions, so it seems more difficult for the CAE to reconstruct manipulated objects in the demonstration phase. This difficulty of reconstruction might affect sensory-motor integration, as seen in the internal representations in the SC layer. Video prediction, in which the learning model is trained to sequentially predict images, would contribute to overcoming this problem, because sequential prediction lets the learning model apply histories of past predictions to the current prediction. Moreover, we trained the CAE and the MTRNN separately. Therefore, during training of sensory-motor integration with the MTRNN, no feedback was sent to the visual processing of the CAE.
However, to allow the robot to more properly process sensory-motor sequences, the prediction error should affect all processing in the learning model. A previous work by Hwang and Tani (2017) prepared a neural network that processes visual sequences, and another that controls the robot. By combining two neural networks through another subnetwork, they realized end-to-end training of sensory-motor integration. Our learning model has a structure similar to the model proposed by Hwang and Tani (2017), so combining two neural networks through another subnetwork might also be applicable to the proposed method.

We conducted sequence to sequence learning to allow the robot to transform each demonstrator action into a robot action. By instead giving the learning model pairs in which the robot action differs from the demonstrated one, sequence to sequence learning could also realize translation of demonstrator actions into robot actions that differ from the demonstration. In this work, we gave only one-to-one pairs of demonstrator and robot actions as training data. The robot can thus imitate demonstrated actions in only a single way, and cannot acquire the ability to imitate a demonstrated action toward an equivalent goal by different means, such as using both hands vs. using only one hand. Such an imitative ability is important for robots, but has not yet been realized by current methods using sequence to sequence learning. To realize it, future studies should enrich the training data so that the demonstrator and robot conduct equivalent actions by various means; through training on such pairs, the robot might come to imitate demonstrated actions in various ways. As has been found in the field of neural machine translation (Cho et al., 2014; Johnson et al., 2016), RNNs with an encoder–decoder architecture trained by sequence to sequence learning methods can acquire both syntactic and semantic structures. Thus, by applying sequence to sequence learning to action learning by robots, RNNs might allow robots to capture the underlying structures of demonstrated actions.
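The encoder–decoder idea behind such sequence to sequence models can be sketched in plain numpy. The sketch below is illustrative only: the dimensions, weight scales, and sequence lengths are assumptions, the weights are untrained, and a vanilla RNN stands in for the CAE+MTRNN architecture actually used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(W_in, W_rec, x, h):
    """One vanilla-RNN step: new hidden state from input x and state h."""
    return np.tanh(W_in @ x + W_rec @ h)

# Illustrative dimensions: 10-D demonstrator features, 7-D robot joints, 32-D state.
D_IN, D_OUT, D_H = 10, 7, 32
enc_in  = rng.normal(0, 0.1, (D_H, D_IN))
enc_rec = rng.normal(0, 0.1, (D_H, D_H))
dec_in  = rng.normal(0, 0.1, (D_H, D_OUT))
dec_rec = rng.normal(0, 0.1, (D_H, D_H))
readout = rng.normal(0, 0.1, (D_OUT, D_H))

def translate(demo_seq, out_len):
    """Encode a demonstrator sequence, then decode a robot joint sequence.

    The encoder compresses the whole observation into its final hidden
    state; the decoder unrolls from that state, feeding back its own
    predictions, as in sequence to sequence learning."""
    h = np.zeros(D_H)
    for x in demo_seq:                # encoding (demonstration phase)
        h = rnn_step(enc_in, enc_rec, x, h)
    y = np.zeros(D_OUT)
    outputs = []
    for _ in range(out_len):          # decoding (robot action phase)
        h = rnn_step(dec_in, dec_rec, y, h)
        y = readout @ h
        outputs.append(y)
    return np.array(outputs)

demo = rng.normal(size=(50, D_IN))    # a 50-step observed demonstration
robot_traj = translate(demo, out_len=80)
print(robot_traj.shape)               # (80, 7)
```

Note that the output length need not match the input length, which is what lets one multidimensional sequence be translated into another of a different form.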

In this paper, imitative learning using a sequence to sequence learning method required an RNN that can deal with long sequences. RNNs other than the MTRNN could therefore be used to learn the sensory-motor sequences. For example, we tried a continuous-time recurrent neural network (CTRNN) in the current experiment. Although the CTRNN generated the trained imitative patterns after predictive learning, it sometimes failed to generate untrained imitative patterns. As another example, it is well known that long short-term memory (LSTM) can process long sequences because of its gating mechanisms, so replacing the MTRNN with an LSTM would likely yield similar results. Although an RNN other than the MTRNN could have been used, we adopted the MTRNN because of its simpler representation of the internal state.

Moreover, future studies from the viewpoint of imitative learning should discuss mirror neurons (Rizzolatti et al., 1996), which show common activation in primates both when performing an action and when perceiving the same action performed by others. This mirror neuron system has also been discussed from the viewpoint of cognitive developmental robotics, because humans develop their understanding of others' behaviors through such mechanisms (Nagai et al., 2011; Arie et al., 2012; Kawai et al., 2012). In a previous study (Nakajo et al., 2015), we realized robot acquisition of common neuronal transitions between the robot's own and others' behaviors by constraining neurons representing labeled information, but in that work the internal states of all neurons were separated according to the robot's own actions. Therefore, as a future method for realizing neuron activity simulating mirror neurons, an imitation experiment using a group of slowly responding neurons in the context layer of the RNN is conceivable.

# AUTHOR CONTRIBUTIONS

RN, SM, HA, and TO conceived and designed the research and wrote the paper. RN performed the experiment and analyzed the data.

# FUNDING

This work was supported by the JST, CREST Grant Number JPMJCR15E3, and JSPS KAKENHI Grant Numbers 15H01710 and 16H05878.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Nakajo, Murata, Arie and Ogata. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Neural-Dynamic Based Synchronous-Optimization Scheme of Dual Redundant Robot Manipulators

#### Zhijun Zhang\*, Qiongyi Zhou and Weisen Fan

*School of Automation Science and Engineering, South China University of Technology, Guangzhou, China*

In order to track complex-path tasks in three-dimensional space without joint drifts, a neural-dynamic based synchronous-optimization (NDSO) scheme of dual redundant robot manipulators is proposed and developed. To do so, an acceleration-level repetitive motion planning optimization criterion is derived by applying the neural-dynamic method twice. Position and velocity feedbacks are taken into account to decrease the errors. Considering the joint-angle, joint-velocity, and joint-acceleration limits, the redundancy resolution problems of the left and right arms are formulated as two quadratic programming problems subject to equality constraints and three bound constraints. The two quadratic programming schemes of the left and right arms are then integrated into a standard quadratic programming problem constrained by an equality constraint and a bound constraint. As a real-time solver, a linear variational inequalities-based primal-dual neural network (LVI-PDNN) is used to solve the quadratic programming problem. Finally, the simulation section presents the execution of three complex tasks including a coupled task, a comparison with the pseudo-inverse method, and a robustness verification. Simulation results verify the efficacy and accuracy of the proposed NDSO scheme.

#### Edited by:

*Hong Qiao, University of Chinese Academy of Sciences (UCAS), China*

#### Reviewed by:

*Bolin Liao, Jishou University, China Ning Sun, Nankai University, China*

#### \*Correspondence:

*Zhijun Zhang drzhangzhijun@gmail.com*

Received: *21 November 2017* Accepted: *22 October 2018* Published: *08 November 2018*

#### Citation:

*Zhang Z, Zhou Q and Fan W (2018) Neural-Dynamic Based Synchronous-Optimization Scheme of Dual Redundant Robot Manipulators. Front. Neurorobot. 12:73. doi: 10.3389/fnbot.2018.00073*

Keywords: dual-redundant-manipulators, redundant robot, complex tasks, motion planning, acceleration-level, neural dynamic method

# 1. INTRODUCTION

The redundancy resolution problem is an important issue in the control of redundant robot manipulators. The redundancy of robot manipulators endows us with extra degrees-of-freedom to finish some subtasks in addition to the main end-effector task (Jin and Li, 2016; Reynoso-Mora et al., 2016; Guo et al., 2017; Huang et al., 2017). Control of dual-redundant-manipulators is more complex because they have twice as many degrees-of-freedom as a single redundant manipulator. With these additional degrees-of-freedom, dual-redundant-manipulators can not only complete the main task of the end-effectors, but also finish various subtasks, such as joint-limitation avoidance, obstacle avoidance, singularity avoidance, and dual-arm cooperation (Zhang et al., 2014; Liu et al., 2015; Jin et al., 2017; Chikhaoui et al., 2018).

For each manipulator of the dual-redundant-robot-manipulators, since the number n of degrees-of-freedom of the joints is greater than the dimension m of the end-effector's position and posture, the inverse kinematic problem of each manipulator, as for dual-manipulators, has infinitely many solutions (i.e., the multiple-solution problem). In order to solve such a multiple-solution problem, a number of methods have been proposed (Chevallereau and Khalil, 1988; Jin and Zhang, 2014; Toshani and Farrokhi, 2014; Luo et al., 2017). The conventional method is the pseudo-inverse formulation $\dot{\theta} = J^{+}\dot{r} + (I - J^{+}J)z\_v$ or $\ddot{\theta} = J^{+}(\ddot{r} - \dot{J}\dot{\theta}) + (I - J^{+}J)z\_a$, which contains a specific minimum-norm solution plus a homogeneous solution (Lin and Zhang, 2013). The pseudo-inverse method has a simple form and has been applied to dual-redundant-manipulators (Zheng and Luh, 1986), but it has to compute the matrix inverse, which may incur a high computational cost (Ho et al., 2005) and algorithm singularities, and it has difficulty incorporating $z\_v, z\_a \in \mathbb{R}^n$ into inequality form. That is to say, it cannot solve inequality-constrained problems (Cheng et al., 1994). What is worse, determining the magnitudes of $z\_v$ and $z\_a$ is based on a trial-and-error approach and is over-dependent on subjective judgement and experience (Zhang et al., 2004). Although some improved pseudo-inverse methods have been developed in recent years, such as joint torque optimization (Flacco and De Luca, 2015; Wang et al., 2015; Xiao et al., 2016), they still cannot solve inequality-constrained problems.
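The velocity-level pseudo-inverse formulation above can be sketched numerically. The Jacobian, desired velocity, and secondary-task vector below are illustrative random stand-ins; the sketch only verifies that the homogeneous term $(I - J^{+}J)z\_v$ changes the joint motion without disturbing the end-effector motion.

```python
import numpy as np

rng = np.random.default_rng(1)

# A redundant arm: n = 7 joints, m = 3 task-space coordinates.
n, m = 7, 3
J = rng.normal(size=(m, n))            # an illustrative Jacobian at some posture
r_dot = np.array([0.10, -0.05, 0.02])  # desired end-effector velocity

J_pinv = np.linalg.pinv(J)
N = np.eye(n) - J_pinv @ J             # null-space projector (I - J^+ J)

z_v = rng.normal(size=n)               # arbitrary secondary-task joint velocity
theta_dot = J_pinv @ r_dot + N @ z_v   # minimum-norm + homogeneous solution

# The null-space term changes joint motion but not end-effector motion:
print(np.allclose(J @ theta_dot, r_dot))        # True
print(np.allclose(J @ (N @ z_v), np.zeros(m)))  # True
```

This also illustrates the drawback discussed above: nothing in the formulation bounds `theta_dot` itself, so joint limits can only be respected by hand-tuning `z_v`.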

A repetitive motion is a basic requirement of redundant-robot-manipulators in practical applications if they are expected to execute cyclic tasks. A repetitive motion means that when the end-effector tracks a closed path in Cartesian space, all the joint trajectories should also be closed. That is to say, the final states of the joints must coincide with the initial ones when the end-effector completes a closed path. If this issue is not considered in the motion planning scheme of dual-redundant-manipulators, the joint-drift phenomenon will occur. In order to realize repetitive motions, an additional self-motion strategy is then necessary to readjust the joints of the dual-manipulators to their initial states at the end of each cycle. Evidently, this is highly inefficient and is not acceptable on a factory automation assembly line. Klein first studied this problem on a single redundant-robot-manipulator, and his research showed that the joint-drift that occurs in the pseudo-inverse control scheme is not unpredictable (Klein and Kee, 1989). In the last two decades, in order to solve the joint-drift problem, many quadratic-programming-based repetitive motion planning schemes have been proposed and solved by neural networks, but most of them concern a single redundant robot manipulator (Zhang et al., 2008, 2018; Zhang and Zhang, 2012, 2013b). A control methodology for dual-redundant manipulators is imperative, as end-effector tasks become more and more complex, such as unscrewing caps (Felip and Morales, 2015) or grasping and moving an object (Shin and Kim, 2015; Dong et al., 2017). These tasks cannot be completed by a single manipulator and need dual-robot-manipulators. In recent years, some researchers have proposed impedance and admittance control methods for dual-arm coordination. For example, Lee et al. (2014) and Jr and Roberts (2015) proposed a novel relative impedance control based on the relative Jacobian expression.
These works focus more on dual-arm cooperation and task allocation through force/torque, so force/torque sensors are necessary. In fact, some tasks, such as moving a heavy box, only need the dual-manipulators to work synchronously and cooperatively. To finish such tasks, some researchers exploited quadratic-programming-based repetitive motion planning schemes for dual-redundant-manipulators and then used a neural network as the quadratic programming solver. In our previous work, a neural-dynamic-method-based repetitive motion planning scheme was proposed for humanoid robot arms (Zhang et al., 2015), but it operates at the velocity level and cannot consider the joint-acceleration limits. In addition, a velocity-level repetitive motion planning scheme cannot be directly applied to acceleration-controlled robots. Jin and Zhang proposed a repetitive motion planning scheme at the acceleration level (Jin and Zhang, 2014). However, that scheme was only performed on dual-manipulators with simple planar three-link arms, and the end-effector tasks were very simple. It is worth pointing out that very few acceleration-level repetitive motion planning schemes take position-error feedback into consideration to make the position error convergent as time evolves.

The motivations of this paper can be summarized as follows: 1) A repetitive motion is a basic requirement of redundant-robot-manipulators in practical applications. 2) Most research on repetitive motion planning is based on a single manipulator with fewer degrees-of-freedom, and very few studies have considered a synchronous-optimization scheme for dual redundant robot manipulators. 3) The traditional resolution scheme at the velocity level cannot consider acceleration-limit avoidance, which may lead to the acceleration limits being exceeded. In order to resolve the redundancy problem of dual-redundant-robot-manipulators with 14 degrees-of-freedom, a neural-dynamic based synchronous-optimization scheme of dual redundant robot manipulators (NDSO) is proposed in this paper. Different from the existing work (Jin and Zhang, 2014), the proposed NDSO scheme can be performed on dual-redundant-manipulators with 14 degrees-of-freedom working in three-dimensional space. In addition, the dual-redundant-manipulators can track complex paths (such as geometric curves and numbers) and complete a coupled tracking task. Furthermore, the NDSO scheme has excellent robustness under the perturbation of systematic error.

The remainder of the paper is organized into four sections. In section 2, the neural-dynamic based synchronous-optimization subschemes (Sub-NDSO) of the left and right manipulators are formulated. In section 3, the Sub-NDSO of the left and right manipulators are unified into a standard quadratic programming problem, which is equivalent to a piecewise-linear projection equation and is then solved by a linear variational inequalities-based primal-dual neural network (LVI-PDNN). Section 4 shows simulation results in which the NDSO scheme is performed on dual-redundant-manipulators to track three complex end-effector tasks in three-dimensional space. Comparison experiments and a robustness verification experiment with a perturbed LVI-PDNN are also conducted, and the related results are shown in this section. Section 5 concludes this paper with final remarks.

The main contributions of the paper are as follows.

(1) A neural-dynamic based synchronous-optimization scheme of dual redundant robot manipulators (NDSO) is proposed to solve the joint-drift phenomenon at the joint-acceleration level. The advantage of the NDSO scheme is that it can complete not only traditional end-effector tasks but also coupled tasks. In addition, the physical limit constraints allow the scheme to apply to actual situations because they guarantee that the robot joints do not exceed their physical limits. Moreover, such a scheme is easier to conduct on an acceleration/torque-controlled manipulator than a velocity-level scheme.


Before ending this section, the system structure of the scheme can be seen in **Figure 1**. First of all, the performance indices of the left and right arms are obtained by using the neural-dynamic method twice. Next, considering the position and velocity errors and the joint-angle, joint-velocity, and joint-acceleration limits, the repetitive motion planning subschemes of the left and right arms are constructed. Furthermore, by combining the repetitive motion planning subschemes of the left and right arms, the NDSO scheme is obtained, which is further unified into a standard quadratic programming problem. The quadratic programming problem (i.e., QP in the figure) is equivalent to a linear variational inequalities problem (i.e., LVI in the figure), which is in turn equivalent to a piecewise-linear projection equation (i.e., PLPE in the figure). Finally, the piecewise-linear projection equation is solved by a linear variational inequalities-based primal-dual neural network (LVI-PDNN).
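As a rough sketch of the final step, a projection-type network of the LVI-PDNN family iterates toward the piecewise-linear projection equation $P\_\Omega(x - (Mx + q)) = x$, where $P\_\Omega$ clips its argument to the constraint set. The Euler-discretized dynamics, gain, and the small bound-constrained QP below are illustrative assumptions, not the exact solver configuration of the paper.

```python
import numpy as np

def lvi_pdnn(M, q, lo, hi, gamma=0.1, steps=5000):
    """Euler-discretized projection network for the piecewise-linear
    projection equation P(x - (Mx + q)) = x, with P clipping to [lo, hi].
    The assumed state update follows dx/dt = gamma*(I + M.T)(P(x-(Mx+q)) - x)."""
    x = np.zeros_like(q, dtype=float)
    I = np.eye(len(q))
    for _ in range(steps):
        e = np.clip(x - (M @ x + q), lo, hi) - x
        x = x + gamma * (I + M.T) @ e
    return x

# Illustrative bound-constrained QP: minimize 0.5*x'Wx + c'x with x in [0,1]^2.
# Its unconstrained minimizer (1, 3) violates the box, so the solver must
# saturate the second coordinate at its upper bound: x* = (1, 1).
W = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -6.0])
x_star = lvi_pdnn(W, c, lo=0.0, hi=1.0)
print(np.round(x_star, 3))   # ≈ [1. 1.]
```

In the paper's scheme, the equality constraint additionally enters through dual variables inside $M$ and $q$; only the bound-constrained core is sketched here.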

# 2. PROBLEM FORMULATION

In this section, a forward kinematic equation is first presented. Next, an acceleration-level feedback is designed. Third, an acceleration-level repetitive motion criterion is deduced by applying the neural-dynamic method twice.

# 2.1. Preliminaries

For simplicity, we use the subscript L/R to represent the left and right redundant manipulators. The kinematic equations of the left or right arm of the dual-redundant-manipulators at position level, velocity level and acceleration level are formulated

FIGURE 1 | System structure of the neural-dynamic based synchronous-optimization scheme of dual redundant robot manipulators (NDSO). It visualizes the logical structure of the paper starting from background analysis, then the problem formulation and finally the simulation.

respectively as

$$f\_{\rm L/R}(\theta\_{\rm L/R}) = r\_{\rm L/R}(t) \tag{1}$$

$$J\_{\rm L/R}(\theta\_{\rm L/R})\dot{\theta}\_{\rm L/R}(t) = \dot{r}\_{\rm L/R}(t) \tag{2}$$

$$J\_{\rm L/R}(\theta\_{\rm L/R})\ddot{\theta}\_{\rm L/R}(t) = \ddot{r}\_{\rm aL/R}(t) = \ddot{r}\_{\rm L/R}(t) - \dot{J}\_{\rm L/R}(\theta\_{\rm L/R})\dot{\theta}\_{\rm L/R}(t) \tag{3}$$

where $r\_{\rm L/R}(t)$, $\dot{r}\_{\rm L/R}(t)$, and $\ddot{r}\_{\rm L/R}(t) \in \mathbb{R}^m$ denote the position-and-orientation vector, velocity vector, and acceleration vector of an end-effector; $\theta\_{\rm L/R}(t)$, $\dot{\theta}\_{\rm L/R}(t)$, and $\ddot{\theta}\_{\rm L/R}(t) \in \mathbb{R}^n$ denote the joint angle, joint velocity, and joint acceleration of the left or right manipulator; the Jacobian matrix is $J\_{\rm L/R}(\theta\_{\rm L/R}) = \partial f\_{\rm L/R}(\theta\_{\rm L/R})/\partial\theta\_{\rm L/R}$, and $\dot{J}\_{\rm L/R}(\theta\_{\rm L/R})$ is the first-order derivative of the Jacobian matrix $J\_{\rm L/R}(\theta\_{\rm L/R})$ with respect to time t. In this paper, since one manipulator has seven degrees-of-freedom and the task is performed in three-dimensional space, n = 7 and m = 3. In Equation (1), $\theta\_{\rm L/R}(t)$ and $r\_{\rm L/R}(t)$ are related via a nonlinear function $f\_{\rm L/R}(\cdot)$. If $\theta\_{\rm L/R}(t)$ is known, it is easy to compute $r\_{\rm L/R}(t)$, since $f\_{\rm L/R}(\cdot)$ can be uniquely determined for a given redundant robot manipulator. This process is called a forward kinematic resolution. On the contrary, it is very difficult to compute $\theta\_{\rm L/R}(t)$ from a given $r\_{\rm L/R}(t)$, i.e., the inverse kinematic resolution.
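Forward kinematic resolution is straightforward to compute and to differentiate. A minimal sketch for an illustrative planar three-link arm (link lengths assumed, not those of the manipulator in this paper) evaluates $f(\theta)$ and checks the analytic Jacobian $J = \partial f/\partial\theta$ against finite differences.

```python
import numpy as np

L = np.array([0.30, 0.25, 0.15])   # illustrative link lengths [m]

def f(theta):
    """Forward kinematics f(theta) -> end-effector position r (planar 3-link)."""
    phi = np.cumsum(theta)          # absolute link angles
    return np.array([np.sum(L * np.cos(phi)), np.sum(L * np.sin(phi))])

def jacobian(theta):
    """Analytic Jacobian J = df/dtheta of the planar arm."""
    phi = np.cumsum(theta)
    J = np.zeros((2, 3))
    for j in range(3):
        # joint j rotates every link from j onward
        J[0, j] = -np.sum(L[j:] * np.sin(phi[j:]))
        J[1, j] =  np.sum(L[j:] * np.cos(phi[j:]))
    return J

theta = np.array([0.3, -0.5, 0.8])
J = jacobian(theta)

# Forward resolution is unique and cheap, so the Jacobian can be checked
# against finite differences of f:
eps = 1e-6
J_num = np.column_stack([(f(theta + eps * np.eye(3)[j]) - f(theta)) / eps
                         for j in range(3)])
print(np.allclose(J, J_num, atol=1e-4))   # True
```

Inverting the same map, i.e., recovering `theta` from a desired `r`, has no closed form here (n = 3 > m = 2), which is exactly the multiple-solution problem the paper addresses.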

**Remark:** In practical systems, the control inputs are sometimes subject to saturation problems and uncertainties. Many methods have been proposed to address these issues (Tran et al., 2015; Eremin and Shelenok, 2017; Sun et al., 2017, 2018). Since we focus only on the redundancy resolution problem and assume that the control inputs satisfy their conditions, the saturation problem and uncertainties are out of the scope of this research and are ignored here.

# 2.2. Acceleration-Level Forward Equation With Feedback

In practical applications, error feedback should be considered in Equation (3). With the following theorem, the acceleration-level forward equation with feedback is obtained, i.e.,

**Theorem 1.** Considering an end-effector motion of a robot manipulator, for any scalar parameters $\rho\_V > 0$ and $\rho\_P > 0$, the error-feedback-included acceleration-level forward kinematic equation is

$$J(\theta)\ddot{\theta}(t) = \ddot{r}\_d(t) - \dot{J}(\theta)\dot{\theta}(t) + \rho\_V(\dot{r}\_d(t) - J(\theta)\dot{\theta}(t)) + \rho\_P(r\_d(t) - f(\theta)), \tag{4}$$

where $r\_d$, $\dot{r}\_d$, and $\ddot{r}\_d$ denote the desired end-effector path, desired end-effector velocity, and desired end-effector acceleration, respectively; $\theta$, $\dot{\theta}$, and $\ddot{\theta}$ denote the joint-angle, joint-velocity, and joint-acceleration variables; $f(\theta)$ is a continuous nonlinear mapping function with known parameters for a given robot; $J(\theta)$ and $\dot{J}(\theta)$ are the Jacobian matrix and the first-order derivative of the Jacobian matrix; parameters $\rho\_V > 0$ and $\rho\_P > 0$ are the feedback coefficients of the velocity and position errors, respectively. With these error feedbacks, the end-effector position error converges exponentially to zero.

**Proof 1**: Consider the following state-equations of a two-dimensional linear system

$$
\dot{\chi}(t) = A\chi(t), \tag{5}
$$

$$
y(t) = Q\chi(t), \tag{6}
$$

where $\chi(t) = [\chi\_1(t), \chi\_2(t)]^T$ is the state vector consisting of two state variables as its elements; $\dot{\chi}(t) = [\dot{\chi}\_1(t), \dot{\chi}\_2(t)]^T$ is the time derivative of the state vector $\chi(t)$; $y(t) = [y\_1(t)]$ is the output vector; and A and Q are the coefficient matrices.

In order to make the position error converge to zero at the end of each cycle, an error function E<sup>f</sup> (t) is defined as

$$E\_f(t) = r\_{\rm dL/R}(t) - f(\theta\_{\rm L/R}) \tag{7}$$

where $r\_{\rm dL/R}(t)$ denotes the desired end-effector path. Its first-order and second-order derivatives with respect to time t (i.e., the velocity error $\dot{E}\_f$ and the acceleration error $\ddot{E}\_f$) are

$$\dot{E}\_f(t) = \dot{r}\_{\rm dL/R}(t) - J\_{\rm L/R}(\theta\_{\rm L/R})\dot{\theta}\_{\rm L/R}(t), \tag{8}$$

$$\ddot{E}\_f(t) = \ddot{r}\_{\rm dL/R}(t) - \dot{J}\_{\rm L/R}(\theta\_{\rm L/R})\dot{\theta}\_{\rm L/R}(t) - J\_{\rm L/R}(\theta\_{\rm L/R})\ddot{\theta}\_{\rm L/R}(t) \tag{9}$$

respectively.

For the convenience of analysis, the state variables $\chi\_1$ and $\chi\_2$ are set as $E\_f$ and $\dot{E}\_f$, respectively, i.e.,

$$
\chi = \begin{bmatrix} E\_f \\ \dot{E}\_f \end{bmatrix}, \quad \dot{\chi} = \begin{bmatrix} \dot{E}\_f \\ \ddot{E}\_f \end{bmatrix}. \tag{10}
$$

In addition, by defining

$$A = \begin{bmatrix} 0 & 1 \\ -\rho\_P & -\rho\_V \end{bmatrix} \text{ and } Q = \begin{bmatrix} 1 & 0 \end{bmatrix},$$

with $\rho\_V > 0$ and $\rho\_P > 0$, the state-equations (5) and (6) are equivalent to the following second-order differential equation

$$
\ddot{E}\_f = -\rho\_V \dot{E}\_f - \rho\_P E\_f \tag{11}
$$

where $\rho\_V > 0$ and $\rho\_P > 0$ are the feedback coefficients of the velocity and position errors, respectively. **Figure 2** shows the simulation diagram of the position and velocity feedback based on Equation (11). Substituting (7)–(9) into (11), we obtain

$$\begin{aligned} J\_{\rm L/R}(\theta\_{\rm L/R})\ddot{\theta}\_{\rm L/R}(t) = \ddot{r}\_{\rm aL/R}(t) &= \ddot{r}\_{\rm dL/R}(t) - \dot{J}\_{\rm L/R}(\theta\_{\rm L/R})\dot{\theta}\_{\rm L/R}(t) \\ &\quad + \rho\_V(\dot{r}\_{\rm dL/R}(t) - J\_{\rm L/R}(\theta\_{\rm L/R})\dot{\theta}\_{\rm L/R}(t)) + \rho\_P(r\_{\rm dL/R}(t) - f\_{\rm L/R}(\theta\_{\rm L/R})). \end{aligned} \tag{12}$$

Equation (4) is thus proved.

Next, we prove the exponential convergence of the position error $E\_f(t)$. According to modern control theory (Tewari, 2002), the characteristic roots $\varrho\_1$ and $\varrho\_2$ of the system matrix A can be obtained by solving the following characteristic equation

$$\left| \varrho I - A \right| = \begin{vmatrix} \varrho & -1 \\ \rho\_P & \varrho + \rho\_V \end{vmatrix} = \varrho^2 + \rho\_V \varrho + \rho\_P = 0, \tag{13}$$

where I is an identity matrix, $|\cdot|$ is the determinant notation, and $\varrho$ is a characteristic root of Equation (13), determined by the coefficients $\rho\_P$ and $\rho\_V$ of the characteristic Equation (13).
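The equivalence between the roots of the characteristic Equation (13) and the eigenvalues of A can be confirmed numerically; the gains below are illustrative and chosen so that $\rho\_V^2 > 4\rho\_P$.

```python
import numpy as np

rho_P, rho_V = 6.0, 5.0   # illustrative positive feedback gains

# System matrix of the error dynamics (Equations 5 and 10).
A = np.array([[0.0, 1.0], [-rho_P, -rho_V]])

# Roots of the characteristic equation: rho^2 + rho_V*rho + rho_P = 0 ...
roots = np.roots([1.0, rho_V, rho_P])
# ... coincide with the eigenvalues of A, and both lie in the left half-plane.
eigvals = np.linalg.eigvals(A)
print(np.allclose(np.sort(roots.real), np.sort(eigvals.real)))  # True
print(bool(np.all(eigvals.real < 0)))                           # True
```

Negative real parts for any positive $\rho\_P$, $\rho\_V$ are what the three-case analysis below establishes in general.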

Since the position error $E\_f(t)$ and the velocity error $\dot{E}\_f(t)$ are the elements of the state vector $\chi(t)$, discussing the time-domain response of the state vector $\chi(t)$ is equivalent to discussing the errors. Based on modern control theory (Tewari, 2002), if the initial state $\chi(0) = \chi\_0$ is determined, the unique solution of the state-equation (5) can be represented as

$$
\chi(t) = \Phi(t)\chi(0),\tag{14}
$$

where $\Phi(t) = e^{At}$.

Considering $A = [0, 1; -\rho\_P, -\rho\_V]$ and based on the characteristic Equation (13), the time-domain response of the state vector $\chi(t)$ (i.e., Equation 14) can be discussed according to the following three situations.

From the quadratic formula, the characteristic roots of Equation (13) are

$$\varrho\_1 = \frac{-\rho\_V + \sqrt{\rho\_V^2 - 4\rho\_P}}{2}, \quad \varrho\_2 = \frac{-\rho\_V - \sqrt{\rho\_V^2 - 4\rho\_P}}{2}. \tag{15}$$

(i) When $\rho\_V^2 > 4\rho\_P$, from Equation (15) we have $\rho\_V > \sqrt{\rho\_V^2 - 4\rho\_P} > 0$; thus the real characteristic roots satisfy $\varrho\_1 < 0$ and $\varrho\_2 < 0$. Based on modern control theory (Tewari, 2002), there exists a nonsingular matrix T satisfying

$$\Phi(t) = T \begin{bmatrix} e^{\varrho\_1 t} & 0 \\ 0 & e^{\varrho\_2 t} \end{bmatrix} T^{-1}. \tag{16}$$

Substituting (16) into (14), we obtain that

$$\|\chi(t)\|\_2 = \|\Phi(t)\chi(0)\|\_2 \le \|\Phi(t)\|\_F \|\chi(0)\|\_2 \le \|T\|\_F \|T^{-1}\|\_F \sqrt{e^{2\varrho\_1 t} + e^{2\varrho\_2 t}}\, \|\chi(0)\|\_2$$

is globally exponentially convergent to zero since $\|T\|\_F$ and $\|T^{-1}\|\_F$ are bounded. Therefore, the first element of $\chi(t)$, i.e., the position error $E\_f(t)$, is globally exponentially convergent to zero.

(ii) When $\rho\_V^2 = 4\rho\_P$, from Equation (15) we have equal real characteristic roots $\varrho\_1 = \varrho\_2 = \varrho\_e = -\rho\_V/2 < 0$. Based on modern control theory (Tewari, 2002), there exists a nonsingular matrix T satisfying

$$\Phi(t) = T \begin{bmatrix} e^{\varrho\_e t} & te^{\varrho\_e t} \\ 0 & e^{\varrho\_e t} \end{bmatrix} T^{-1}. \tag{17}$$

Substituting (17) into (14), we obtain that

$$\|\chi(t)\|\_2 = \|\Phi(t)\chi(0)\|\_2 \le \|\Phi(t)\|\_F \|\chi(0)\|\_2 \le \|T\|\_F \|T^{-1}\|\_F \sqrt{t^2 + 2}\, e^{\varrho\_e t} \|\chi(0)\|\_2$$

is globally exponentially convergent to zero. Therefore, the first element of $\chi(t)$, i.e., the position error $E\_f(t)$, is globally exponentially convergent to zero.

(iii) When $\rho\_V^2 < 4\rho\_P$, from Equation (15) we have two complex characteristic roots, set as $\varrho\_1 = \sigma + j\omega$ and $\varrho\_2 = \sigma - j\omega$ with real part $\sigma < 0$. Based on modern control theory (Tewari, 2002),

$$\Phi(t) = \begin{bmatrix} \cos\omega t & \sin\omega t \\ -\sin\omega t & \cos\omega t \end{bmatrix} e^{\sigma t}. \tag{18}$$

Substituting (18) into (14), we obtain that

$$\|\chi(t)\|\_2 = \|\Phi(t)\chi(0)\|\_2 \le \|\Phi(t)\|\_F \|\chi(0)\|\_2 = \sqrt{2} e^{\sigma t} \|\chi(0)\|\_2$$

is globally exponentially convergent to zero. Therefore, the first element of $\chi(t)$, i.e., the position error $E\_f(t)$, is globally exponentially convergent to zero.

In conclusion, it is proved that the position error $E\_f(t)$ is globally exponentially convergent to zero with the error feedback of Equation (11), where $\rho\_V$ and $\rho\_P$ are both set positive.
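The three-case convergence result can be checked numerically by integrating the error dynamics of Equation (11) with one gain pair per case; the gains, initial condition, horizon, and step size below are illustrative.

```python
import numpy as np

def simulate(rho_V, rho_P, E0=1.0, dE0=0.0, dt=1e-3, T=10.0):
    """Explicit-Euler integration of the error dynamics of Equation (11):
    E'' = -rho_V*E' - rho_P*E, starting from E(0) = E0, E'(0) = dE0.
    Returns the position error E at time T."""
    E, dE = E0, dE0
    for _ in range(int(T / dt)):
        ddE = -rho_V * dE - rho_P * E
        E, dE = E + dt * dE, dE + dt * ddE
    return E

# One gain pair per case: (i) rho_V^2 > 4*rho_P, (ii) equal, (iii) less.
for rho_V, rho_P in [(5.0, 4.0), (4.0, 4.0), (2.0, 4.0)]:
    print(abs(simulate(rho_V, rho_P)) < 1e-3)   # True in all three cases
```

In each case the error starting at 1 decays below 1e-3 within 10 s, matching the overdamped, critically damped, and underdamped behaviors derived above.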

## 2.3. NDSO Subscheme of Left/Right Arm

In order to remedy the joint-angle drift problem, a neural-dynamic based synchronous-optimization subscheme (Sub-NDSO) of the left/right arm (i.e., the following theorem) is proposed.

**Theorem 2.** For the left or right arm of the dual-redundant-manipulators, given a closed end-effector path, i.e., $r\_{\rm L/R}(T) = r\_{\rm L/R}(0)$ where T denotes the task execution period, if Equations (19)–(23) are satisfied, the arm achieves repetitive motion, and the joint-drift $\theta\_{\rm L/R}(t) - \theta\_{\rm L/R}(0)$ converges exponentially to zero. In addition, all the joint angles, joint velocities, and joint accelerations are constrained within their limits, i.e.,

$$\text{minimize} \quad \frac{1}{2} \| \ddot{\theta}\_{\rm L/R}(t) + b\_{\rm L/R}(t) \|\_{2}^{2} \tag{19}$$

$$\text{subject to} \quad J\_{\rm L/R}(\theta\_{\rm L/R}) \ddot{\theta}\_{\rm L/R}(t) = \ddot{r}\_{\rm aL/R}(t) \tag{20}$$

$$\theta\_{\rm L/R}^{-} \le \theta\_{\rm L/R}(t) \le \theta\_{\rm L/R}^{+} \tag{21}$$

$$\dot{\theta}\_{\rm L/R}^{-} \le \dot{\theta}\_{\rm L/R}(t) \le \dot{\theta}\_{\rm L/R}^{+} \tag{22}$$

$$\ddot{\theta}\_{\rm L/R}^{-} \le \ddot{\theta}\_{\rm L/R}(t) \le \ddot{\theta}\_{\rm L/R}^{+} \tag{23}$$

$$\begin{aligned} \text{with} \quad b\_{\rm L/R}(t) &= (\alpha + \beta)\dot{\theta}\_{\rm L/R}(t) + \alpha\beta(\theta\_{\rm L/R}(t) - \theta\_{\rm L/R}(0)), \\ \ddot{r}\_{\rm aL/R}(t) &= \ddot{r}\_{\rm dL/R}(t) - \dot{J}\_{\rm L/R}(\theta\_{\rm L/R})\dot{\theta}\_{\rm L/R}(t) \\ &\quad + \rho\_V(\dot{r}\_{\rm dL/R}(t) - J\_{\rm L/R}(\theta\_{\rm L/R})\dot{\theta}\_{\rm L/R}(t)) + \rho\_P(r\_{\rm dL/R}(t) - f\_{\rm L/R}(\theta\_{\rm L/R})) \end{aligned}$$

where $\|\cdot\|\_2$ denotes the two-norm of a vector; $\theta\_{\rm L/R}$, $\dot{\theta}\_{\rm L/R}$, and $\ddot{\theta}\_{\rm L/R}$ denote the joint angle, joint velocity, and joint acceleration of the left or right arm of the dual-redundant-manipulators; $r\_{\rm dL/R}$, $\dot{r}\_{\rm dL/R}$, and $\ddot{r}\_{\rm dL/R}$ denote the desired end-effector position, velocity, and acceleration of the left or right arm; $J(\theta)$ and $\dot{J}(\theta)$ are the Jacobian matrix and its first-order derivative; $\alpha > 0$ and $\beta > 0$ are used to scale the joint displacement; $\theta^{\pm}\_{\rm L/R}$, $\dot{\theta}^{\pm}\_{\rm L/R}$, and $\ddot{\theta}^{\pm}\_{\rm L/R}$ denote the upper and lower limits of the joint angles, joint velocities, and joint accelerations of the left/right manipulator, respectively.
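Ignoring the bound constraints (21)–(23) for brevity (the paper handles them through the LVI-PDNN solver), the equality-constrained core of (19)–(20) reduces to a linear KKT system. The Jacobian and vectors below are illustrative random stand-ins, not data from the manipulator in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 7, 3                           # 7-DOF arm, 3-D task space

J = rng.normal(size=(m, n))           # illustrative Jacobian J(theta)
b = rng.normal(size=n)                # stand-in for b(t) of Equation (19)
r_acc = rng.normal(size=m)            # stand-in for the feedback-corrected
                                      # task acceleration of Equation (20)

# KKT system of:  minimize 0.5*||u + b||^2  subject to  J u = r_acc,
# where u stands for the joint acceleration:
#   u + J' * lam = -b   (stationarity)
#   J u          = r_acc (primal feasibility)
K = np.block([[np.eye(n), J.T],
              [J, np.zeros((m, m))]])
rhs = np.concatenate([-b, r_acc])
sol = np.linalg.solve(K, rhs)
u = sol[:n]

print(np.allclose(J @ u, r_acc))      # equality constraint (20) holds: True
```

The full scheme must solve this together with the box constraints in real time, which is why the paper turns to the LVI-PDNN rather than a direct linear solve.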

**Proof 2**: First, a vector-valued error function, i.e., the deviation between the instant joint angle $\theta\_{\rm L/R}(t)$ and the initial joint angle $\theta\_{\rm L/R}(0)$ of the left/right manipulator, is defined as

$$e\_{\rm L/R}(t) = \theta\_{\rm L/R}(t) - \theta\_{\rm L/R}(0). \tag{24}$$

The joint-angle drift is zero if and only if the error function $e\_{1L/R}(t) = 0$. In order to reduce and eventually eliminate the joint displacement, by the neural-dynamic method we can obtain

$$\dot{e}\_{\rm 1L/R}(t) = -\alpha e\_{\rm 1L/R}(t) = -\alpha[\theta\_{\rm L/R}(t) - \theta\_{\rm L/R}(0)],\tag{25}$$

where the design parameter $\alpha$ is used to adjust the convergence rate of $e\_{1L/R}(t)$ to zero. Taking the derivative of Equation (24) with respect to time $t$ yields $\dot{e}\_{1L/R}(t) = \dot{\theta}\_{L/R}(t)$. Substituting this into Equation (25), the following equation is obtained, i.e.,

$$
\dot{\theta}\_{\text{L/R}}(t) + \alpha (\theta\_{\text{L/R}}(t) - \theta\_{\text{L/R}}(0)) = 0. \tag{26}
$$

Second, in order to obtain the acceleration-level repetitive-motion criterion, the joint acceleration should be included in the criterion. That is to say, there should be an equation equivalent to (26) that includes the joint acceleration. To do so, the neural-dynamic method is applied to Equation (26) again. Similarly, a vector-valued joint-displacement function is defined as

$$e\_{\rm 2L/R}(t) = \dot{\theta}\_{\rm L/R}(t) + \alpha(\theta\_{\rm L/R}(t) - \theta\_{\rm L/R}(0)).\tag{27}$$

According to the neural-dynamic design method (Cai and Zhang, 2012), i.e.,

$$
\dot{e}\_{2\text{L/R}}(t) = -\beta e\_{2\text{L/R}}(t) \tag{28}
$$

where design parameter β > 0, we can get

$$\ddot{\theta}\_{\text{L/R}}(t) + \alpha \dot{\theta}\_{\text{L/R}}(t) = -\beta(\dot{\theta}\_{\text{L/R}}(t) + \alpha(\theta\_{\text{L/R}}(t) - \theta\_{\text{L/R}}(0))). \tag{29}$$

Equation (29) is rewritten as

$$
\ddot{\theta}\_{\text{L/R}}(t) + (\alpha + \beta)\dot{\theta}\_{\text{L/R}}(t) + \alpha\beta(\theta\_{\text{L/R}}(t) - \theta\_{\text{L/R}}(0)) = 0. \tag{30}
$$

Considering the motion of the robot manipulator, it is better to minimize the performance index $\|\ddot{\theta}\_{L/R}(t) + (\alpha+\beta)\dot{\theta}\_{L/R}(t) + \alpha\beta(\theta\_{L/R}(t) - \theta\_{L/R}(0))\|\_{2}^{2}/2$ rather than use (30) directly, i.e.,

$$\text{minimize} \quad \frac{1}{2} \|\ddot{\theta}\_{\text{L/R}}(t) + b\_{\text{L/R}}(t)\|\_{2}^{2},\tag{31}$$

where $b\_{L/R}(t) = (\alpha + \beta)\dot{\theta}\_{L/R}(t) + \alpha\beta(\theta\_{L/R}(t) - \theta\_{L/R}(0))$, and $\|\cdot\|\_2$ denotes the two-norm of a vector. If Equation (31) is used as the optimization criterion, the joint angle $\theta\_{L/R}(t)$ tends to converge to $\theta\_{L/R}(0)$. At the end of the task execution period, $\theta\_{L/R}(T) = \theta\_{L/R}(0)$. Equation (19) is thus proved.
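The convergence behind criterion (31) can be checked numerically. The sketch below is a toy single-joint simulation of the dynamics (30) with an Euler integrator; the values $\alpha = \beta = 4$ match the later simulation section, while the initial state, step size, and horizon are illustrative assumptions:

```python
# Toy check of the repetitive-motion dynamics (30):
# theta_dd + (alpha + beta)*theta_d + alpha*beta*(theta - theta0) = 0.
alpha, beta = 4.0, 4.0
theta0 = 0.5                   # initial joint angle theta(0)
theta, theta_d = theta0, 1.0   # start with a nonzero joint velocity
dt, steps = 1e-3, 5000         # Euler integration over 5 s (assumed values)

for _ in range(steps):
    theta_dd = -(alpha + beta) * theta_d - alpha * beta * (theta - theta0)
    theta += dt * theta_d
    theta_d += dt * theta_dd

# The joint drift theta(t) - theta(0) decays exponentially to zero,
# so the joint returns to its initial angle with zero velocity.
print(abs(theta - theta0) < 1e-4, abs(theta_d) < 1e-4)  # -> True True
```

The joint first drifts away (it starts with velocity 1 rad/s) and then is pulled back to $\theta(0)$, which is exactly the repetitive-motion property the criterion encodes.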

In practical applications, the joint physical limits, i.e., joint-angle limits, joint-velocity limits, and joint-acceleration limits, should be incorporated into the scheme, and thus an NDSO subscheme (termed Sub-NDSO) is obtained as

$$\text{minimize} \quad \frac{1}{2} \| \ddot{\theta}\_{\text{L/R}}(t) + b\_{\text{L/R}}(t) \|\_{2}^{2} \tag{32}$$

$$\text{subject to} \quad J\_{\text{L/R}}(\theta\_{\text{L/R}})\ddot{\theta}\_{\text{L/R}}(t) = \ddot{r}\_{\text{aL/R}}(t) \tag{33}$$

$$
\theta\_{\text{L/R}}^{-} \leqslant \theta\_{\text{L/R}}(t) \leqslant \theta\_{\text{L/R}}^{+} \tag{34}
$$

$$
\dot{\theta}\_{\text{L/R}}^{-} \leqslant \dot{\theta}\_{\text{L/R}}(t) \leqslant \dot{\theta}\_{\text{L/R}}^{+} \tag{35}
$$

$$
\ddot{\theta}\_{\text{L/R}}^{-} \leqslant \ddot{\theta}\_{\text{L/R}}(t) \leqslant \ddot{\theta}\_{\text{L/R}}^{+} \tag{36}
$$

$$\begin{aligned} \text{with} \quad & b\_{\text{L/R}}(t) = (\alpha + \beta)\dot{\theta}\_{\text{L/R}}(t) + \alpha\beta(\theta\_{\text{L/R}}(t) - \theta\_{\text{L/R}}(0)), \\ & \ddot{r}\_{\text{aL/R}}(t) = \ddot{r}\_{\text{dL/R}}(t) - \dot{J}\_{\text{L/R}}(\theta\_{\text{L/R}})\dot{\theta}\_{\text{L/R}}(t) \end{aligned}$$

where α > 0 and β > 0 are used to scale the joint displacement.

According to the acceleration-level feedback error design method in Theorem 1, $\ddot{r}\_{aL/R}$ in Equation (33) can be replaced by $\ddot{r}\_{afL/R}(t) = \ddot{r}\_{dL/R}(t) - \dot{J}\_{L/R}(\theta\_{L/R})\dot{\theta}\_{L/R}(t) + \rho\_v(\dot{r}\_{dL/R}(t) - J\_{L/R}(\theta\_{L/R})\dot{\theta}\_{L/R}(t)) + \rho\_p(r\_{dL/R}(t) - f(\theta\_{L/R}))$. Equations (19)–(23) are thus proved. That is to say, with Equations (19)–(23), the left or right arm of the dual-redundant-manipulators can achieve repetitive motion while avoiding the joint physical limits during the execution of the task.
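The feedback-corrected acceleration can be sketched numerically. The example below uses a hypothetical 2-DOF planar arm with unit link lengths (the state values, the zero $\dot{J}$, and the gains are illustrative stand-ins): when the arm tracks the desired trajectory exactly, both feedback terms vanish and the reference reduces to $\ddot{r}\_d - \dot{J}\dot{\theta}$.

```python
import numpy as np

def r_af(r_dd, r_d_dot, r_d, J, J_dot, f_theta, theta_dot, rho_v, rho_p):
    """Feedback-corrected end-effector acceleration (per the replacement above)."""
    return (r_dd - J_dot @ theta_dot
            + rho_v * (r_d_dot - J @ theta_dot)
            + rho_p * (r_d - f_theta))

# Hypothetical 2-DOF planar arm with unit links.
theta = np.array([0.3, 0.4])
theta_dot = np.array([0.1, -0.2])
f_theta = np.array([np.cos(theta[0]) + np.cos(theta.sum()),
                    np.sin(theta[0]) + np.sin(theta.sum())])
J = np.array([[-np.sin(theta[0]) - np.sin(theta.sum()), -np.sin(theta.sum())],
              [ np.cos(theta[0]) + np.cos(theta.sum()),  np.cos(theta.sum())]])
J_dot = np.zeros((2, 2))  # illustrative value
r_dd = np.array([0.0, 0.1])

# Perfect tracking: desired position/velocity equal the actual ones,
# so both feedback terms vanish and r_af = r_dd - J_dot @ theta_dot.
out = r_af(r_dd, J @ theta_dot, f_theta, J, J_dot, f_theta, theta_dot, 200.0, 1.0)
print(np.allclose(out, r_dd - J_dot @ theta_dot))  # -> True
```

With tracking errors present, the $\rho\_v$ and $\rho\_p$ terms act as velocity and position feedback that pull the end-effector back onto the desired path.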

Next, the exponential convergence of the joint drift $\theta\_{L/R}(t) - \theta\_{L/R}(0)$ will be proved. In light of differential equation theory (Hartman and Philip, 1982), the $i$th element of $e\_{2L/R}(t)$ in Equation (28) is

$$e\_{2\text{L/Ri}}(t) = e\_{2\text{L/Ri}}(0)e^{-\beta t}.\tag{37}$$

As $t$ approaches infinity, each element approaches zero exponentially, i.e.,

$$\lim\_{t \to \infty} e\_{\text{2L/Ri}}(t) = \lim\_{t \to \infty} e\_{\text{2L/Ri}}(0)e^{-\beta t} = 0. \tag{38}$$

The proof of Theorem 2 is completed.

# 2.4. NDSO Scheme

In this section, based on the neural-dynamic based synchronous-optimization subschemes (Sub-NDSO) of the left and right arms proposed in Theorem 2, a neural-dynamic based synchronous-optimization scheme of dual redundant robot manipulators (NDSO) is proposed.

**Theorem 3.** For a dual-redundant-manipulators system, including a left manipulator and a right manipulator, given a closed end-effector path, i.e., r(T) = r(0) where T denotes a task execution period, if Equations (39)–(43) are satisfied, the dual-redundant-manipulators will achieve repetitive motion, and the joint drift θ(t) − θ(0) will converge exponentially to zero. In addition, all the joint angles, joint velocities, and joint accelerations of the dual-redundant-manipulators are constrained within their limits, i.e.,

$$\text{minimize} \quad \frac{1}{2}\ddot{\theta}^T(t)\ddot{\theta}(t) + b^T(t)\ddot{\theta}(t) \tag{39}$$

$$\text{subject to} \quad J(\theta)\ddot{\theta}(t) = \ddot{r}\_{af}(t) \tag{40}$$

$$
\theta^- \leqslant \theta(t) \leqslant \theta^+ \tag{41}
$$

$$
\dot{\theta}^- \leqslant \dot{\theta}(t) \leqslant \dot{\theta}^+ \tag{42}
$$

$$
\ddot{\theta}^- \leqslant \ddot{\theta}(t) \leqslant \ddot{\theta}^+ \tag{43}
$$

$$\begin{aligned} \text{with} \quad b(t) &= (\alpha + \beta)\dot{\theta}(t) + \alpha\beta(\theta(t) - \theta(0)), \\ \ddot{r}\_{af}(t) &= \ddot{r}\_d(t) - \dot{J}(\theta)\dot{\theta}(t) + \rho\_v(\dot{r}\_d(t) - J(\theta)\dot{\theta}(t)) \\ &\quad + \rho\_p(r\_d(t) - f(\theta)) \end{aligned}$$

where $\theta(t) = [\theta\_L(t), \theta\_R(t)]^T$, $\dot{\theta}(t) = [\dot{\theta}\_L(t), \dot{\theta}\_R(t)]^T$, and $\ddot{\theta}(t) = [\ddot{\theta}\_L(t), \ddot{\theta}\_R(t)]^T$ denote the joint angle, joint velocity, and joint acceleration of the dual-redundant-manipulators; $r\_d(t) = [r\_{dL}(t), r\_{dR}(t)]^T$, $\dot{r}\_d(t) = [\dot{r}\_{dL}(t), \dot{r}\_{dR}(t)]^T$, and $\ddot{r}\_d(t) = [\ddot{r}\_{dL}(t), \ddot{r}\_{dR}(t)]^T$ denote the position, velocity, and acceleration vectors of the end-effectors of the dual-redundant-manipulators; scalar parameters $\alpha > 0$ and $\beta > 0$ are used to scale the joint displacements; $\theta^{\pm} = [\theta^{\pm}\_L, \theta^{\pm}\_R]^T$, $\dot{\theta}^{\pm} = [\dot{\theta}^{\pm}\_L, \dot{\theta}^{\pm}\_R]^T$, and $\ddot{\theta}^{\pm} = [\ddot{\theta}^{\pm}\_L, \ddot{\theta}^{\pm}\_R]^T$ denote the upper and lower limits of the joint angles, joint velocities, and joint accelerations of the dual-redundant-manipulators, respectively. The combined Jacobian matrix and its first-order time derivative are

$$J(\theta) = \begin{bmatrix} J\_L(\theta\_L) & \mathbf{0} \\ \mathbf{0} & J\_R(\theta\_R) \end{bmatrix}, \quad \dot{J}(\theta) = \begin{bmatrix} \dot{J}\_L(\theta\_L) & \mathbf{0} \\ \mathbf{0} & \dot{J}\_R(\theta\_R) \end{bmatrix}. \tag{44}$$
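Equation (44) simply stacks the two arms into one block-diagonal system, which can be sketched with NumPy. The matrix sizes below are illustrative (a spatial PA10 task would give $3\times 7$ or $6\times 7$ per-arm Jacobians), and random matrices stand in for the real kinematics:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 7                        # task-space and joint-space dimensions per arm
J_L, J_R = rng.standard_normal((m, n)), rng.standard_normal((m, n))
th_dd_L, th_dd_R = rng.standard_normal(n), rng.standard_normal(n)

# Combined block-diagonal Jacobian of the dual-arm system, Equation (44).
J = np.block([[J_L, np.zeros((m, n))],
              [np.zeros((m, n)), J_R]])
th_dd = np.concatenate([th_dd_L, th_dd_R])

# The combined forward kinematics decouples into the per-arm equations.
lhs = J @ th_dd
rhs = np.concatenate([J_L @ th_dd_L, J_R @ th_dd_R])
print(np.allclose(lhs, rhs))  # -> True
```

The off-diagonal zero blocks are what make the two arms kinematically independent at this level; they are coupled only through the shared optimization criterion and task definition.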

**Proof 3**: Firstly, the optimization criterion (32) can be simplified as

$$\begin{aligned} &\frac{1}{2}\|\ddot{\theta}\_{\mathsf{L}/\mathsf{R}}(t) + b\_{\mathsf{L}/\mathsf{R}}(t)\|\_{2}^{2} \\ &= \frac{1}{2} \Big(\ddot{\theta}\_{\mathsf{L}/\mathsf{R}}(t) + b\_{\mathsf{L}/\mathsf{R}}(t)\Big)^{\mathrm{T}} \Big(\ddot{\theta}\_{\mathsf{L}/\mathsf{R}}(t) + b\_{\mathsf{L}/\mathsf{R}}(t)\Big) \\ &= \frac{1}{2} \Big(\ddot{\theta}\_{\mathsf{L}/\mathsf{R}}^{\mathrm{T}}(t)\ddot{\theta}\_{\mathsf{L}/\mathsf{R}}(t) + \ddot{\theta}\_{\mathsf{L}/\mathsf{R}}^{\mathrm{T}}(t)b\_{\mathsf{L}/\mathsf{R}}(t) + b\_{\mathsf{L}/\mathsf{R}}^{\mathrm{T}}(t)\ddot{\theta}\_{\mathsf{L}/\mathsf{R}}(t) + b\_{\mathsf{L}/\mathsf{R}}^{\mathrm{T}}(t)b\_{\mathsf{L}/\mathsf{R}}(t)\Big) \\ &= \frac{1}{2}\ddot{\theta}\_{\mathsf{L}/\mathsf{R}}^{\mathrm{T}}(t)\ddot{\theta}\_{\mathsf{L}/\mathsf{R}}(t) + b\_{\mathsf{L}/\mathsf{R}}^{\mathrm{T}}(t)\ddot{\theta}\_{\mathsf{L}/\mathsf{R}}(t) + \frac{1}{2}b\_{\mathsf{L}/\mathsf{R}}^{\mathrm{T}}(t)b\_{\mathsf{L}/\mathsf{R}}(t). \tag{45} \end{aligned}$$

Since the redundancy-resolution problem is solved at the joint-acceleration level and $\ddot{\theta}\_{L/R}$ is the decision variable, $b\_{L/R}^{\mathrm{T}}(t)b\_{L/R}(t)/2$ in Equation (45) can be regarded as a constant. Therefore, minimizing $\|\ddot{\theta}\_{L/R}(t) + b\_{L/R}(t)\|\_2^2/2 = \ddot{\theta}\_{L/R}^{\mathrm{T}}(t)\ddot{\theta}\_{L/R}(t)/2 + b\_{L/R}^{\mathrm{T}}(t)\ddot{\theta}\_{L/R}(t) + b\_{L/R}^{\mathrm{T}}(t)b\_{L/R}(t)/2$ is equivalent to minimizing $\ddot{\theta}\_{L/R}^{\mathrm{T}}(t)\ddot{\theta}\_{L/R}(t)/2 + b\_{L/R}^{\mathrm{T}}(t)\ddot{\theta}\_{L/R}(t)$. Combining the


TABLE 2 | Four sets of equations used in the three groups of contrast experiments.


joint variables of left and right manipulators into one combined vector, the optimization criterion can be written as

$$\text{minimize} \quad \ddot{\theta}^{\text{T}}(t)\ddot{\theta}(t)/2 + b^{\text{T}}(t)\ddot{\theta}(t) \tag{46}$$

where $\ddot{\theta}(t) = [\ddot{\theta}\_L(t), \ddot{\theta}\_R(t)]^T$ and $b(t) = [b\_L(t), b\_R(t)]^T$.

Secondly, the acceleration-level forward-kinematics Equation (20) of the left and right manipulators can be written as one combined forward-kinematics equation

$$
\begin{bmatrix} J\_{\rm L}(\theta\_{\rm L}) & \mathbf{0} \\ \mathbf{0} & J\_{\rm R}(\theta\_{\rm R}) \end{bmatrix} \cdot \begin{bmatrix} \ddot{\theta}\_{\rm L}(t) \\ \ddot{\theta}\_{\rm R}(t) \end{bmatrix} = \begin{bmatrix} \ddot{r}\_{\rm afL}(t) \\ \ddot{r}\_{\rm afR}(t) \end{bmatrix} \tag{47}
$$

where

$$\begin{aligned} \ddot{r}\_{\text{afL}}(t) &= \ddot{r}\_{\text{dL}}(t) - \dot{J}\_{\text{L}}(\theta\_{\text{L}}) \dot{\theta}\_{\text{L}}(t) \\ &\quad + \rho\_{\text{v}}(\dot{r}\_{\text{dL}}(t) - J\_{\text{L}}(\theta\_{\text{L}}) \dot{\theta}\_{\text{L}}(t)) + \rho\_{\text{p}}(r\_{\text{dL}}(t) - f(\theta\_{\text{L}})), \end{aligned} \tag{48}$$

$$\begin{aligned} \ddot{r}\_{\text{afR}}(t) &= \ddot{r}\_{\text{dR}}(t) - \dot{J}\_{\text{R}}(\theta\_{\text{R}}) \dot{\theta}\_{\text{R}}(t) \\ &\quad + \rho\_{\text{v}}(\dot{r}\_{\text{dR}}(t) - J\_{\text{R}}(\theta\_{\text{R}})\dot{\theta}\_{\text{R}}(t)) + \rho\_{\text{p}}(r\_{\text{dR}}(t) - f(\theta\_{\text{R}})). \end{aligned} \tag{49}$$

Combining the upper and lower joint limits of the left and right arms of the dual-redundant-manipulators, we can obtain the combined joint-angle, joint-velocity, and joint-acceleration limits, respectively, as

$$\boldsymbol{\theta}^{\pm}(t) = \left[\boldsymbol{\theta}\_{\rm L}^{\pm}(t), \boldsymbol{\theta}\_{\rm R}^{\pm}(t)\right]^{\rm T},\tag{50}$$

$$\dot{\theta}^{\pm}(t) = \left[\dot{\theta}\_{\text{L}}^{\pm}(t), \dot{\theta}\_{\text{R}}^{\pm}(t)\right]^{\text{T}},\tag{51}$$

$$\ddot{\theta}^{\pm}(t) = \left[\ddot{\theta}\_{\text{L}}^{\pm}(t), \ddot{\theta}\_{\text{R}}^{\pm}(t)\right]^{\text{T}}.\tag{52}$$

Taking into consideration the optimization criterion (46), the feedback-considered acceleration-level kinematic Equation (47), and the joint limits (50)–(52), the NDSO scheme (39)–(43) is obtained. The proof of Theorem 3 is completed.

# 3. QUADRATIC PROGRAMMING UNIFICATION & SOLVER

In this section, the proposed NDSO scheme (39)–(43) is unified into a standard quadratic programming problem, which is equivalent to linear variational inequality problem and is further equivalent to a piecewise linear projection equation. Finally,

FIGURE 3 | Comparisons between the scheme without considering repetitive motion and the NDSO scheme when tracking a pentagram-path. (A) Final states do not coincide with the initial states when using the scheme without considering repetitive motion. (B) Final states coincide with initial states when using NDSO scheme considering repetitive motion.

Figure caption fragment: (C) Joint velocity of left arm $\dot{\theta}\_{\rm L}$. (D) Joint velocity of right arm $\dot{\theta}\_{\rm R}$. (E) Joint acceleration of left arm $\ddot{\theta}\_{\rm L}$. (F) Joint acceleration of right arm $\ddot{\theta}\_{\rm R}$.

the piecewise linear projection equation is solved by a linear variational inequalities-based primal-dual neural network (LVI-PDNN).

# 3.1. Joint-Limits Conversion

In order to resolve the redundancy problem at the acceleration level and satisfy the format requirement of standard quadratic programming, the physical limits (41)–(43) at different levels should be converted into one bound constraint on the joint acceleration $\ddot{\theta}(t)$. Specifically, the $i$th elements of the bounds $\xi^-$ and $\xi^+$ are defined respectively as

$$\begin{aligned} \xi\_i^-(t) &= \max \{ \ddot{\theta}\_i^-(t), \lambda\_{\rm v} (\dot{\theta}\_i^- - \dot{\theta}\_i(t)), \lambda\_{\rm p} ((\theta\_i^- + \vartheta\_i) - \theta\_i(t)) \}, \\ \xi\_i^+(t) &= \min \{ \ddot{\theta}\_i^+(t), \lambda\_{\rm v} (\dot{\theta}\_i^+ - \dot{\theta}\_i(t)), \lambda\_{\rm p} ((\theta\_i^+ - \vartheta\_i) - \theta\_i(t)) \}. \end{aligned}$$

In practice, the mechanical inertia of the dual-redundant-manipulators causes residual motion during the deceleration period. Critical areas for the joint-position variables are therefore built into the representation of the physical limits, so that deceleration begins as soon as a joint enters such an area, before the joint-position limit is actually reached. Here $\vartheta\_i > 0$ is a small constant used to define the critical areas $[\theta\_i^-, \theta\_i^- + \vartheta\_i]$ and $[\theta\_i^+ - \vartheta\_i, \theta\_i^+]$. In the simulation section of this paper, $\vartheta\_i$ is set to 0.01. The coefficients $\lambda\_{\rm v} > 0$ and $\lambda\_{\rm p} > 0$ denote the decreasing amplitude (Zhang et al., 2008).
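As a sketch, the limit conversion above can be written as a small per-joint helper. The limit values, gains, and joint states below are illustrative assumptions, not the PA10 parameters:

```python
def accel_bounds(th, th_d, th_min, th_max, thd_lim, thdd_lim,
                 lam_v, lam_p, crit=0.01):
    """Convert angle/velocity/acceleration limits of one joint into a single
    acceleration-level bound pair (xi_minus, xi_plus), with a critical area
    of width `crit` near each joint-angle limit."""
    xi_minus = max(-thdd_lim,
                   lam_v * (-thd_lim - th_d),
                   lam_p * ((th_min + crit) - th))
    xi_plus = min(thdd_lim,
                  lam_v * (thd_lim - th_d),
                  lam_p * ((th_max - crit) - th))
    return xi_minus, xi_plus

# Joint far from every limit: the raw acceleration limits dominate.
lo, hi = accel_bounds(th=0.0, th_d=0.0, th_min=-3.0, th_max=3.0,
                      thd_lim=1.0, thdd_lim=6.0, lam_v=20.0, lam_p=20.0)
print(lo, hi)  # -> -6.0 6.0

# Joint inside the critical area near its upper angle limit: the upper
# bound is pulled down, forcing deceleration before the limit is hit.
lo2, hi2 = accel_bounds(th=2.995, th_d=0.5, th_min=-3.0, th_max=3.0,
                        thd_lim=1.0, thdd_lim=6.0, lam_v=20.0, lam_p=20.0)
print(hi2 < 6.0)  # -> True
```

The `max`/`min` structure guarantees that whichever of the three limit types is closest to being violated is the one that binds at each instant.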

TABLE 3 | Joint drifts when the dual-redundant-manipulators track a pentagram-path synthesized by the NDSO scheme considering repetitive motion, joint limits, and feedback.


Therefore, constraints (39)–(43) can be rewritten as

$$\text{minimize} \quad \frac{1}{2}\ddot{\theta}^{\text{T}}(t)W\ddot{\theta}(t) + b^{\text{T}}(t)\ddot{\theta}(t) \tag{53}$$

$$\text{subject to} \quad J(\theta)\ddot{\theta}(t) = \ddot{r}\_{\text{af}}(t)\tag{54}$$

$$\xi^{-}(t) \leqslant \ddot{\theta}(t) \leqslant \xi^{+}(t) \tag{55}$$

The scheme (53)–(55) can be further unified into the following standard quadratic programming

$$\text{minimize} \quad \frac{1}{2} x^{\mathrm{T}} G x + d^{\mathrm{T}} x \tag{56}$$

$$\text{subject to} \quad Cx = h \tag{57}$$

$$
x^- \leqslant x \leqslant x^+ \tag{58}
$$

where

$$\begin{aligned} x &= \ddot{\theta}(t) = \begin{bmatrix} \ddot{\theta}\_{\text{L}}(t) \\ \ddot{\theta}\_{\text{R}}(t) \end{bmatrix} \in \boldsymbol{R}^{2n}, \; G = W = \begin{bmatrix} \mathbf{1} & \mathbf{0} \\ \mathbf{0} & \mathbf{1} \end{bmatrix} \in \boldsymbol{R}^{2n \times 2n}, \\ d &= b(t) = \begin{bmatrix} b\_{\text{L}}(t) \\ b\_{\text{R}}(t) \end{bmatrix} \in \boldsymbol{R}^{2n}, \; h = \ddot{r}\_{\text{af}}(t) = \begin{bmatrix} \ddot{r}\_{\text{afL}}(t) \\ \ddot{r}\_{\text{afR}}(t) \end{bmatrix} \in \boldsymbol{R}^{2m}, \\ C &= J = \begin{bmatrix} J\_{\text{L}}(\theta\_{\text{L}}) & \mathbf{0} \\ \mathbf{0} & J\_{\text{R}}(\theta\_{\text{R}}) \end{bmatrix} \in \boldsymbol{R}^{2m \times 2n}, \; x^{\pm} = \xi^{\pm}(t) \in \boldsymbol{R}^{2n}. \end{aligned}$$

## 3.2. Quadratic Programming Solver

According to Zhang et al. (2008), finding the solutions to the quadratic programming problem (56)–(58) is equivalent to finding a primal-dual equilibrium vector $u^{*} = [x^{*\mathrm{T}}, \eta^{*\mathrm{T}}]^{\mathrm{T}} \in \Omega := \{u = [x^{\mathrm{T}}, \eta^{\mathrm{T}}]^{\mathrm{T}} \in \mathbb{R}^{2n+2m} \mid u^- \leqslant u \leqslant u^+\}$ of the following linear variational inequality

$$(u - u^{\*})^{\mathrm{T}}(Mu^{\*} + q) \geqslant 0, \quad \forall u \in \Omega,\tag{59}$$

where the augmented primal-dual decision variable $u \in \mathbb{R}^{2n+2m}$ and its bounds $u^{\pm} \in \mathbb{R}^{2n+2m}$ are respectively defined as

$$
u = \begin{bmatrix} x \\ \eta \end{bmatrix}, \ u^+ = \begin{bmatrix} x^+ \\ 1\_{\nu}\varpi \end{bmatrix}, \ u^- = \begin{bmatrix} x^- \\ -1\_{\nu}\varpi \end{bmatrix},
$$

with $\eta \in \mathbb{R}^{2m}$ being the corresponding dual decision vector of Equation (57), $1\_{\nu} = [1, \cdots, 1]^{\mathrm{T}}$ denoting an appropriately-dimensioned vector composed of ones, and $\varpi = 10^{10}$ replacing $+\infty$ for simulation and implementation purposes. The matrix $M \in \mathbb{R}^{(2n+2m)\times(2n+2m)}$ and the vector $q \in \mathbb{R}^{2n+2m}$ are defined respectively as

$$M = \begin{bmatrix} G & -C^{\mathrm{T}} \\ C & \mathbf{0} \end{bmatrix}, \quad q = \begin{bmatrix} d \\ -h \end{bmatrix}.$$

The above inequality problem (59) can be solved by the following piecewise-linear projection equation (Zhang and Zhang, 2013a) as

$$P\_{\Omega}(u - (Mu + q)) - u = 0 \tag{60}$$

TABLE 4 | Joint drifts during the period of number "47" writing synthesized by the proposed NDSO scheme (39)–(43) which considers repetitive motion planning, joint limits, and feedback.


where $P\_{\Omega}(\cdot): \mathbb{R}^{2n+2m} \to \Omega \subset \mathbb{R}^{2n+2m}$ is a projection operator from $\mathbb{R}^{2n+2m}$ onto $\Omega$, and the $i$th element of $P\_{\Omega}(u)$ is

$$\begin{cases} u\_{i}^{-}, & \text{if } u\_{i} < u\_{i}^{-} \\ u\_{i}, & \text{if } u\_{i}^{-} \leqslant u\_{i} \leqslant u\_{i}^{+} \\ u\_{i}^{+}, & \text{if } u\_{i} > u\_{i}^{+} \end{cases} \quad \forall i \in \{1, 2, \dots, 2n+2m\}.$$

According to previous design experience with recurrent neural networks (Zhang and Zhang, 2013a), a linear-variational-inequality-based primal-dual neural network (abbreviated as LVI-PDNN) is employed to solve the piecewise-linear projection Equation (60), as well as the quadratic programming problem (56)–(58), i.e.,

$$
\dot{u} = \gamma (I + M^{\mathrm{T}}) (P\_{\Omega}(u - (Mu + q)) - u), \tag{61}
$$

where $I$ is an identity matrix, and $\gamma \in \mathbb{R}$ is a positive design parameter used to scale the convergence rate of the neural network. From Zhang and Zhang (2013a), the state vector $u(t)$ of the primal-dual neural network in Equation (61) is globally convergent to an equilibrium point $u^*$. Furthermore, the first $2n$ elements of $u^*$ constitute the solution to the original quadratic programming problem (56)–(58).
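The LVI-PDNN dynamics (61) can be sketched on a toy equality-constrained QP. The problem data below are illustrative assumptions, not the manipulator matrices: minimize $\frac{1}{2}x^{\mathrm{T}}x + x\_1$ subject to $x\_1 + x\_2 = 1$, whose optimum is $x^* = (0, 1)$. The neural dynamics are integrated with a plain Euler rule:

```python
import numpy as np

# Toy QP data: G = I, d = [1, 0], C = [1 1], h = [1]; optimum x* = (0, 1).
G = np.eye(2)
d = np.array([1.0, 0.0])
C = np.array([[1.0, 1.0]])
h = np.array([1.0])

# Augmented LVI matrix M and vector q.
M = np.block([[G, -C.T], [C, np.zeros((1, 1))]])
q = np.concatenate([d, -h])

# Bounds on u = [x; eta]; a large constant stands in for +infinity.
u_min = np.array([-10.0, -10.0, -1e10])
u_max = np.array([10.0, 10.0, 1e10])
proj = lambda v: np.clip(v, u_min, u_max)   # projection operator P_Omega

# Euler integration of u_dot = gamma (I + M^T)(P_Omega(u - (Mu + q)) - u).
u = np.zeros(3)
gamma, dt = 1.0, 0.01               # illustrative design values
for _ in range(20000):
    u = u + dt * gamma * (np.eye(3) + M.T) @ (proj(u - (M @ u + q)) - u)

x = u[:2]   # the first 2n elements solve the QP
print(np.round(x, 3))  # close to [0, 1]
```

The state converges to the primal-dual equilibrium, and reading off the first block recovers the QP minimizer; for the manipulator problem the same recursion runs with the time-varying $M$, $q$, and bounds assembled from (53)–(55).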

Considering the systematic error generally including the differentiation error and the implementation error, the perturbed LVI-PDNN is formulated as

$$\dot{u} = \gamma (I + M^{\mathrm{T}} + \Delta D)(P\_{\Omega}(u - (Mu + q)) - u) + \Delta S, \tag{62}$$

where $\Delta D \in \mathbb{R}^{(2n+2m)\times(2n+2m)}$ and $\Delta S \in \mathbb{R}^{2n+2m}$ denote the differentiation-error matrix and the implementation-error vector, respectively. Equation (62) will be used in the experiment on robustness verification.

# 4. COMPUTER SIMULATIONS

In this section, the dual PA10 robot manipulators synthesized by the presented NDSO scheme are expected to track three closed complex trajectories, i.e., a pentagram, number "47" writing, and an end-effector-coupled pentagram. Each manipulator has 7 degrees of freedom, and the dual manipulators have 14 degrees of freedom in total. All joint physical limits are shown in **Table 1**. The design parameters $\alpha$ and $\beta$ are both set to 4, and the design parameter $\gamma = 10^5$ in the ensuing simulations.

# 4.1. Pentagram Path-Tracking

In this section, the dual PA10 robot manipulators are expected to cooperatively track a pentagram-path. Initial joint angles of the left arm are θL(0) = [0; −π/4; 0; π/2; 0; −π/6; 0] rad, and initial joint angles of the right arm are θR(0) = [−π; −π/4; 0; π/2; 0; −π/6; 0] rad. The task execution period is 4 s. For comparison, four sets of equations, in which the variables $d$, $x^-$, $x^+$, $\rho\_p$, $\rho\_v$ in Equations (56)–(58) are set to different values, are shown in **Table 2**. These four sets of equations make up three groups of contrast experiments, which are performed to prove the efficiency of the repetitive-motion criterion, the physical-limits criterion, and the feedback criterion. Firstly, comparison results between the scheme considering the physical-limits and feedback criteria but not the repetitive-motion criterion (experiment 1) and the NDSO scheme considering the repetitive-motion, physical-limits, and feedback criteria (experiment 4), both performed on the dual PA10 robot manipulators, are illustrated in **Figures 3A,B**, respectively. **Figure 3A** shows that the final states of the end-effectors of the left and right arms of the dual-redundant-manipulators do not coincide with the initial states, which means that the end-effectors cannot return to the initial states when the task is completed. That is to say, the joint-drift phenomenon has happened. Note that this phenomenon is not desirable in practical applications, because extra self-motion must be added to readjust the manipulator's configuration at the end of each task execution period in cyclic motions. Evidently, this approach is inefficient. To remedy this joint-drift problem, the repetitive-motion planning criterion is developed, and the corresponding result is shown in **Figure 3B**. Evidently, the final states of the dual-redundant-manipulators coincide well with their initial states.
Comparing **Figures 3A,B**, we can see that the NDSO scheme nearly eliminates the joint-drift phenomenon since it considers the repetitive-motion criterion; the efficiency of the repetitive-motion criterion is thus verified.

Secondly, comparisons between the scheme considering the repetitive-motion planning and feedback criteria but not the limits (experiment 2) and the NDSO scheme considering the limits criterion (experiment 4) are illustrated in **Figures 4**, **5**, respectively. The joint angles are shown in **Figures 4A,B**, **5A,B**. We can see that the final states of the joints coincide with the initial ones, which again illustrates the efficiency of the repetitive-motion planning criterion. The velocities are shown in **Figures 4C,D**, **5C,D**. It can be seen from the figures that the velocities start from zero and end at zero, which fits the actual situations. However, **Figures 4E,F** show that $\ddot{\theta}\_{L3}$ and $\ddot{\theta}\_{R3}$ exceed their upper or lower acceleration limits during 0–4 s. This may damage the dual-redundant-manipulators and is undesirable for practical applications. By comparison, the joint accelerations $\ddot{\theta}\_{L3}$ and $\ddot{\theta}\_{R3}$ in **Figures 5E,F** reach but never exceed their acceleration limits. This comparison result verifies that the physical-limits criterion is very useful in applications.

Thirdly, comparisons between the NDSO scheme without feedback (experiment 3) and the NDSO scheme proposed in this paper with feedback (experiment 4) are illustrated in **Figures 6**, **7**, respectively. In the NDSO scheme, the feedback parameters $\rho\_P$ and $\rho\_V$ are set as 1 and 200, respectively. It can be seen from **Figure 6** that the end-effector position errors of the left and right arms are less than $6.0 \times 10^{-4}$ m. However, the position errors grow as the task executes, i.e., the trend of the position errors is diverging. This would lead to larger accumulated errors if the scheme were applied to perform cyclic tasks. In contrast, the position errors in **Figures 7A,B** are very small and keep decreasing because the proposed NDSO scheme is applied.

Last but not least, the joint drifts are measured when the position, velocity, and acceleration feedback are taken into consideration in the NDSO scheme. **Table 3** lists the small joint drifts, which are all less than $6.2 \times 10^{-3}$ rad when

FIGURE 10 | Joint accelerations and position errors of the left arm during the period of end-effector-coupled pentagram-path tracking synthesized by the pseudo-inverse scheme (63) and the NDSO scheme (53)–(55). (A) Left-arm joint acceleration $\ddot{\theta}\_L$ of the pseudo-inverse scheme. (B) Position error of the left arm $\epsilon\_L$ of the pseudo-inverse scheme. (C) Left-arm joint acceleration $\ddot{\theta}\_L$ of the NDSO scheme. (D) Position error of the left arm $\epsilon\_L$ of the NDSO scheme.

the dual-redundant-manipulators track a pentagram-path synthesized by the NDSO scheme.

In summary, the above three comparison experiments on tracking a pentagram-path illustrate well the effectiveness, safety, and accuracy of the proposed NDSO scheme (39)–(43) and the LVI-PDNN in solving the joint-drift problem.

#### 4.2. Number Writing

In order to further verify the effectiveness, accuracy, and generalization of the proposed NDSO scheme (39)–(43), another end-effector task, i.e., number "47" writing, is expected to be finished by the same dual PA10 robot manipulators synthesized by the NDSO scheme. In the simulations, $\rho\_P$ and $\rho\_V$ in Equations (48) and (49) are set as 1 and 100, respectively. Initial joint angles of the left arm are θL(0) = [0; −π/4; 0; π/2; 0; −π/6; 0] rad, and initial joint angles of the right arm are θR(0) = [−π; −π/4; 0; π/2; 0; −π/6; 0] rad. The task execution period is 2 s.

The tracking trajectories, joint angles, joint velocities, joint accelerations and end-effector position errors are shown in **Figure 8**, and the joint drifts between the final state and the initial states of the left and right arms are listed in **Table 4**. As



TABLE 6 | Joint drifts when the dual-redundant-manipulators track a pentagram-path synthesized by the NDSO scheme considering differentiation errors and implementation errors.


can be seen from **Figure 8A**, the end-effector task, i.e., number "47" writing, is finished very well by the dual-redundant-manipulators synthesized by the NDSO scheme (39)–(43). In addition, as shown in **Figures 8B–E**, all joint angles and joint velocities are within their joint limits, and the initial and final joint velocities and joint accelerations are both zero. From **Figures 8F,G**, we can see that the joint accelerations $\ddot{\theta}\_{L2}$ during 0.3–0.5 s, $\ddot{\theta}\_{L3}$ during 1.6–2 s, and $\ddot{\theta}\_{R3}$ and $\ddot{\theta}\_{R5}$ during 0.3–1.3 s increase sharply and are constrained by the upper and lower acceleration limits. This means that all the joint variables stay in safe motion ranges. End-effector position errors $\epsilon$ of the dual-redundant-manipulators, shown in **Figures 8H,I**, are very small ($\leqslant 3 \times 10^{-4}$ m). It is worth pointing out that the end-effector position errors tend to converge as the task executes, owing to the position and velocity feedback considered in the NDSO scheme. **Table 4** shows that the small joint displacements of the NDSO scheme are all less than $2.4 \times 10^{-3}$ rad.

These number-writing simulations further verify the effectiveness of the proposed NDSO scheme.

# 4.3. Coupled Task Tracking Example

In order to further verify the well-coordinated performance between the dual-redundant-manipulators under the proposed NDSO scheme, a coupled end-effector task is considered, in which the end-effector accelerations of the two arms are coupled as

$$\begin{cases} \ddot{r}\_{\text{RX}} = \ddot{r}\_{\text{LX}} \\ \ddot{r}\_{\text{RY}} = 0.5 \times \ddot{r}\_{\text{LY}} \\ \ddot{r}\_{\text{RZ}} = 0.5 \times \ddot{r}\_{\text{LZ}} \end{cases} \quad \forall t \in [0, T].$$

In the simulations, $\rho\_P$ and $\rho\_V$ in Equations (48) and (49) are set as 1 and 200, respectively. The task execution period is 4 s. The tracking trajectories, joint angles, joint velocities, joint accelerations, and end-effector position errors are shown in **Figure 9**. From **Figure 9A**, we can see that the coupled end-effector task is completed by the dual-redundant-manipulators synthesized by the NDSO scheme. What's more, the final states perfectly coincide with the initial states. In addition, in **Figures 9B–E**, all joint angles and joint velocities are within their joint limits, and the initial and final joint velocities and joint accelerations are both zero. From **Figures 9F,G**, we can see that the joint accelerations $\ddot{\theta}\_{L2}$ and $\ddot{\theta}\_{L6}$ change sharply during 1–3 s but are both constrained by their acceleration limits. This means that all the joint variables stay in safe motion ranges. The end-effector position errors $\epsilon$ shown in **Figures 9H,I** are very small ($\leqslant 6 \times 10^{-4}$ m) and convergent.

In summary, the above three end-effector tasks and comparisons, i.e., pentagram-path tracking, number "47" writing, and the coupled task tracking example, demonstrate that complex end-effector tasks can be performed well by the presented NDSO scheme (39)–(43). From the simulations, it is known that the NDSO scheme can achieve repetitive motion effectively and accurately. In addition, the position errors of the end-effectors converge to nearly zero at the end of each cycle because feedback is taken into consideration.

# 4.4. Compared With Pseudo-Inverse Method

In order to further illustrate the advantages of the proposed NDSO scheme, both the traditional pseudo-inverse method and the proposed NDSO scheme are used on the dual-redundant-manipulators to track the previous coupled pentagram paths. The initial joint angles of the left and right arms are set to the same values as before. The formulation of the pseudo-inverse method is

$$\ddot{\theta} = J^{+} \ddot{r}\_{\text{af}}(t) - [I - J^{+} J] b(t) \tag{63}$$

where $\ddot{\theta}$, $\ddot{r}\_{\text{af}}(t)$, $J$, and $b(t)$ have the same definitions as in the NDSO scheme, $J^{+}$ denotes the pseudo-inverse of the Jacobian matrix $J$, and $I$ is an identity matrix of appropriate dimensions.
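A single resolution step of Equation (63) can be sketched with NumPy. The matrices below are random stand-ins for the dual-arm quantities (illustrative sizes, not the PA10 kinematics); the point is that the null-space term $[I - J^{+}J]b$ does not disturb the task, so $J\ddot{\theta} = \ddot{r}\_{\text{af}}$ still holds:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 14                      # combined task and joint dimensions (dual arm)
J = rng.standard_normal((m, n))   # stand-in for the combined Jacobian
r_af = rng.standard_normal(m)     # feedback-corrected task acceleration
b = rng.standard_normal(n)        # repetitive-motion bias b(t)

J_pinv = np.linalg.pinv(J)
# Pseudo-inverse resolution, Equation (63): particular solution plus a
# null-space component that steers the joints back toward theta(0).
theta_dd = J_pinv @ r_af - (np.eye(n) - J_pinv @ J) @ b

# The task constraint is satisfied exactly (J has full row rank here).
print(np.allclose(J @ theta_dd, r_af))  # -> True
```

What the pseudo-inverse method cannot do, and what motivates the QP formulation, is enforce the inequality bounds (58): nothing in Equation (63) keeps $\ddot{\theta}$ inside its limits, which is exactly the failure seen in the comparison below.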

The comparative simulations are shown in **Figure 10**. Due to space limitations, only the joint accelerations and position errors of the left manipulator under the proposed NDSO scheme and the pseudo-inverse method are shown here. Specifically,

**Figures 10A,B** show the simulation results of the pseudo-inverse method, and **Figures 10C,D** show the simulation results of the proposed NDSO method. From **Figure 10A**, we can see that the joint acceleration $\ddot{\theta}\_{L2}$ exceeds its limits at about 1.3 s and 2.6 s, and the end-effector position errors $\epsilon\_{XL}$ of the left arm shown in **Figure 10B** diverge as time goes on. That is to say, the end-effector of the dual-redundant-manipulators synthesized by the pseudo-inverse method can track the desired path, but may exceed the limits, and the positioning errors will accumulate.

This comparison result further illustrates the efficiency and advantages of the proposed NDSO scheme.

# 4.5. Robustness Verification

In this subsection, systematic errors are taken into consideration, and the perturbed LVI-PDNN in Equation (62) is used to solve the path-tracking problem of the dual redundant manipulators. The pentagram path-tracking task in Section 4.1 is adopted, so that the joint displacements can be compared with the unperturbed results in **Table 3**. During the simulations, the error matrix $\Delta D$ and error vector $\Delta S$ are generated randomly. Each element $\Delta\_i$ of them is formulated as

$$
\Delta\_i = 0.1 \times \nu\_a(\nu\_c \sin(\nu\_b t) + (1 - \nu\_c) \cos(\nu\_b t))\tag{64}
$$

where $\nu\_a$ is a random integer in $[-5, 5]$, $\nu\_b$ is a random integer in $[1, 5]$, and $\nu\_c$ is a random integer in $[0, 1]$, all distributed evenly. $\nu\_a$ and $\nu\_b$ control the amplitude and frequency of the element, respectively, while $\nu\_c$ selects the form of the perturbation function: a sine function ($\nu\_c = 1$) or a cosine function ($\nu\_c = 0$). The initial joint angles of the dual arms are set the same as in Section 4.1. The parameters $d$, $x^-$, $x^+$, $\rho\_p$, $\rho\_v$ are set according to the 4th set of equations in **Table 2**. Inspired by Zhang and Zhang (2013b), we consider the joint-velocity-limit margins $\iota$ shown in **Figure 11** in our experiments. The updated $\dot{\theta}^{\pm}\_L(t)$ and $\dot{\theta}^{\pm}\_R(t)$ in (51) are shown in **Table 5**, where the margin-considered joint-velocity limits are highlighted in bold.
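A sketch of the perturbation generator in Equation (64) follows (the RNG seed is an illustrative choice, and here the random integers are redrawn per call rather than fixed per element): since $|\nu\_a| \leqslant 5$ and the sinusoid has unit amplitude, every element is bounded by $0.1 \times 5 = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(42)

def perturbation(t):
    """One random element Delta_i(t) per Equation (64)."""
    nu_a = rng.integers(-5, 6)    # amplitude factor in [-5, 5]
    nu_b = rng.integers(1, 6)     # frequency factor in [1, 5]
    nu_c = rng.integers(0, 2)     # 1 -> sine form, 0 -> cosine form
    return 0.1 * nu_a * (nu_c * np.sin(nu_b * t) + (1 - nu_c) * np.cos(nu_b * t))

samples = [perturbation(t) for t in np.linspace(0.0, 4.0, 200)]
print(all(abs(s) <= 0.5 for s in samples))  # -> True
```

This bounded, smooth perturbation is what the robustness experiment injects into the neural-network dynamics through $\Delta D$ and $\Delta S$.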

The joint drifts of the dual arms are shown in **Table 6**, which shows that the joint displacement of every joint is almost unchanged compared with the results in **Table 3**. The joint accelerations and position errors during the pentagram-path-tracking task are recorded in **Figures 12A–D**. The curves in **Figures 12A–D** show that the joint accelerations are all constrained within the limits (i.e., ±6 rad/s²). Besides, the position errors are kept within a very small range, lower than 1 × 10⁻³ m. Although a time-varying systematic perturbation exists, the position errors still converge by the end of the task execution. In summary, the proposed NDSO method performs well under perturbation and has strong robustness.

# 5. CONCLUSION

In this paper, a neural-dynamic-based synchronous-optimization (NDSO) scheme of dual redundant manipulators for tracking complex paths has been proposed to solve the joint-drift problem. The scheme not only accomplishes end-effector tasks collaboratively with the dual redundant manipulators, but also achieves repetitive motion, physical-limit avoidance, and position-error convergence. First, the left- and right-manipulator subschemes are formulated and then combined into one quadratic-program scheme, i.e., the NDSO scheme. Next, the scheme is unified into a standard quadratic programming problem. Finally, the quadratic programming problem is solved by a linear-variational-inequality primal-dual neural network (LVI-PDNN). Three complex end-effector tasks and comparisons, i.e., pentagram-path tracking, number writing, and coupled tasks, have verified the effectiveness, accuracy, repeatability, safety, generality, and robustness of the proposed NDSO scheme. To the best of the authors' knowledge, this is the first NDSO scheme with so many optimization criteria that can solve joint-drift problems in the three-dimensional workspace. Future work is to exploit more efficient solving algorithms to further improve the performance of the scheme, and to consider the control-input saturation problem and uncertainties.
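The QP-then-LVI-PDNN pipeline summarized above can be illustrated on a toy problem. The sketch below is not the paper's Equation (62); it applies the standard LVI-PDNN dynamics u̇ = γ(I + Mᵀ)(P_Ω(u − (Mu + q)) − u) (in the style of Zhang's work cited in the text) to a made-up two-variable QP with one equality constraint and box bounds, with forward-Euler integration; all matrices and gains here are illustrative assumptions.

```python
import numpy as np

# Toy QP: minimize 0.5 x^T W x + p^T x  subject to  J x = d,  lo <= x <= hi.
W = np.eye(2)
p = np.array([1.0, 1.0])
J = np.array([[1.0, 1.0]])
d = np.array([1.0])
# Box for the augmented variable u = [x; lam]; the dual variable lam of the
# equality constraint is given a very wide box to approximate "unbounded".
lo = np.array([-10.0, -10.0, -1e6])
hi = np.array([10.0, 10.0, 1e6])

# Primal-dual LVI form: M = [[W, -J^T], [J, 0]], q = [p; -d].
M = np.block([[W, -J.T], [J, np.zeros((1, 1))]])
q = np.concatenate([p, -d])

def P(u):
    """Projection onto the box Omega."""
    return np.clip(u, lo, hi)

# Forward-Euler integration of u_dot = gamma (I + M^T)(P(u - (M u + q)) - u).
u = np.zeros(3)
gamma, dt = 10.0, 1e-3
for _ in range(20000):
    u = u + dt * gamma * (np.eye(3) + M.T) @ (P(u - (M @ u + q)) - u)

x, lam = u[:2], u[2]
```

For this toy instance the KKT conditions give x = (0.5, 0.5) and lam = 1.5, which the neural dynamics converge to; in the paper, the same mechanism operates on the acceleration-level QP of the dual arms.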

# AUTHOR CONTRIBUTIONS

ZZ proposed the idea of solving the optimization problem of dual redundant manipulators at the acceleration level. ZZ and QZ drafted and revised the paper together. QZ and WF designed and implemented the experiments in coordination.

# FUNDING

This work was supported in part by the National Natural Science Foundation under Grants 61603142 and 61633010, the Guangdong Foundation for Distinguished Young Scholars under Grant 2017A030306009, the Guangdong Youth Talent Support Program of Scientific and Technological Innovation under Grant 2017TQ04X475, the Science and Technology Program of Guangzhou under Grant 201707010225, the Fundamental Research Funds for Central Universities under Grant x2zdD2182410, the Scientific Research Starting Foundation of South China University of Technology, the National Key R&D Program of China under Grant 2017YFB1002505, the National Key Basic Research Program of China (973 Program) under Grant 2015CB351703, and the Guangdong Natural Science Foundation under Grant 2014A030312005.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Zhang, Zhou and Fan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.