Multimodal Hierarchical Dirichlet Process-Based Active Perception by a Robot

In this paper, we propose an active perception method for recognizing object categories based on the multimodal hierarchical Dirichlet process (MHDP). The MHDP enables a robot to form object categories using multimodal information, e.g., visual, auditory, and haptic information, which can be observed by performing actions on an object. However, performing many actions on a target object takes a long time. In a real-time scenario, i.e., when the time is limited, the robot has to determine the set of actions that is most effective for recognizing a target object. We propose an MHDP-based active perception method that uses the information gain (IG) maximization criterion and the lazy greedy algorithm. We show that the IG maximization criterion is optimal in the sense that the criterion is equivalent to a minimization of the expected Kullback–Leibler divergence between a final recognition state and the recognition state after the next set of actions. However, a straightforward calculation of IG is practically impossible. Therefore, we derive a Monte Carlo approximation method for IG by making use of a property of the MHDP. We also show that the IG has submodular and non-decreasing properties as a set function because of the structure of the graphical model of the MHDP. Therefore, the IG maximization problem is reduced to a submodular maximization problem. This means that greedy and lazy greedy algorithms are effective and have a theoretical justification for their performance. We conducted an experiment using an upper-torso humanoid robot and a second one using synthetic data. The experimental results show that the method enables the robot to select a set of actions that allow it to recognize target objects quickly and accurately. The numerical experiment using the synthetic data shows that the proposed method can work appropriately even when the number of actions is large and a set of target objects involves objects categorized into multiple classes. The results support our theoretical outcomes.


Introduction
Active perception is a fundamental component of our cognitive skills. Human infants autonomously and spontaneously perform actions on an object to determine its nature. The sensory information that we can obtain usually depends on the actions performed on the target object. For example, when a person finds a box placed in front of him/her, he/she cannot perceive its weight without holding the box, and he/she cannot determine its sound without hitting or shaking it. In other words, we can obtain sensory information about an object by selecting and executing actions to manipulate it. Adequate action selection is important for recognizing objects quickly and accurately. This example about a human also holds for a robot. An autonomous robot that moves and helps people in a living environment should also select adequate actions to recognize target objects. For example, when a person asks an autonomous robot to bring an empty plastic bottle, the robot has to examine many objects by applying several actions (Fig. 1). This type of active perception is important because our object categories are formed on the basis of multimodal information, i.e., not only visual information, but also auditory, haptic, and other information. Therefore, a computational model of active perception should be consistently based on a computational model for multimodal object categorization and recognition.
This paper considers the active perception problem for multimodal object recognition. Specifically, we adopt the multimodal hierarchical Dirichlet process (MHDP) proposed by Nakamura et al. (2011b) as a representative computational model for multimodal object categorization. We develop an active perception method based on the MHDP. The MHDP is a sophisticated, fully Bayesian probabilistic model for multimodal object categorization. It is a multimodal extension of the hierarchical Dirichlet process (HDP) Teh et al. (2006), which is a nonparametric Bayesian extension of latent Dirichlet allocation (LDA) Blei et al. (2003), which in turn was originally proposed for document-word clustering. Nakamura et al. (2011b) showed that the MHDP enables a robot to form object categories using multimodal information, i.e., visual, auditory, and haptic information, in an unsupervised manner. Because of the nature of Bayesian nonparametrics, the MHDP can estimate the number of object categories as well.
In spite of the wide range of studies about active perception and multimodal categorization for robots, active perception methods for multimodal categorization, i.e., methods that select actions for perception, have not been sufficiently explored from a theoretical viewpoint (see Section 2). This paper describes a new MHDP-based active perception method for multimodal object recognition based on object categories formed by the robot itself. We found that an active perception method with good theoretical properties can be derived by taking the MHDP as a robot's multimodal categorization method.
In this study, we define the active perception problem in the context of unsupervised multimodal object categorization as follows.
• Which set of actions should a robot take to recognize a target object as accurately as possible under the constraint that the number of actions is restricted?
Our MHDP-based active perception method uses an information gain (IG) maximization criterion, Monte Carlo approximation, and the lazy greedy algorithm. In this paper, we show that the MHDP provides the following three advantages for deriving an efficient active perception method.
1. The IG maximization criterion is optimal in the sense that a selected set of actions minimizes the expected Kullback-Leibler (KL) divergence between the final posterior distribution estimated using the information regarding all modalities and the posterior distribution of the category estimated using the selected set of actions.
2. An efficient Monte Carlo approximation method for IG can be derived.
3. The IG has a submodular and non-decreasing property as a set function. Therefore, for performance, the greedy and lazy greedy algorithms are guaranteed to be near-optimal strategies.
Although the above desirable properties are due to the theoretical characteristics of the MHDP, this has never been pointed out in previous studies.
The main contributions of this paper are that we present the above three properties of the MHDP clearly, develop an MHDP-based active perception method, and show its effectiveness through experiments using an upper-torso humanoid robot and synthetic data.
The proposed active perception method can be used for general purposes, i.e., not only for robots but also for other target domains to which the MHDP can be applied. In addition, the proposed method can be easily extended for multimodal latent Dirichlet allocation (MLDA), which is a multimodal extension of latent Dirichlet allocation (LDA) Nakamura et al. (2009); Blei et al. (2003), and other multimodal categorization methods with similar graphical models. However, in this paper, we focus on the MHDP and the robot active perception scenario, and explain our method on the basis of this task.
The remainder of this paper is organized as follows. Section 2 describes the background and work related to our study. Section 3 briefly introduces the MHDP, proposed by Nakamura et al. (2011b), which enables a robot to obtain an object category by fusing multimodal sensor information in an unsupervised manner. Section 4 describes our proposed action selection method. Section 5 discusses the effectiveness of the action selection method through experiments using an upper-torso humanoid robot. Section 6 describes a supplemental experiment using synthetic data. Section 7 concludes this paper.

Background and Related Work
In this section, we describe background and related work of this paper.

Multimodal Categorization
The human capability for object categorization is a fundamental topic in cognitive science Barsalou (1999). In the field of robotics, adaptive formation of object categories that considers a robot's embodiment, i.e., its sensory-motor system, is gathering attention as a way to solve the symbol grounding problem Harnad (1990); Taniguchi et al. (2015).
Recently, various computational models and machine learning methods for multimodal object categorization have been proposed in artificial intelligence, cognitive robotics, and related research fields Celikkanat et al. (2014); Sinapov and Stoytchev (2011); Natale et al. (2004); Araki et al. (2012); Ando et al. (2013); Nakamura et al. (2007, 2009, 2011a,b, 2014); Griffith et al. (2012); Iwahashi et al. (2010); Roy and Pentland (2002); Sinapov et al. (2014). For example, Sinapov and Stoytchev (2011) proposed a graph-based multimodal categorization method that allows a robot to recognize a new object by its similarity to a set of familiar objects. They also built a robotic system that categorizes 100 objects from multimodal information in a supervised manner Sinapov et al. (2014). Celikkanat et al. (2014) modeled the context in terms of a set of concepts that allow many-to-many relationships between objects and contexts using latent Dirichlet allocation.
Of these, a series of statistical multimodal categorization methods for autonomous robots have been proposed by extending LDA, i.e., a topic model Araki et al. (2012); Ando et al. (2013); Nakamura et al. (2007, 2009, 2011a,b, 2014). All these methods are Bayesian generative models, and the MHDP is a representative method of this series Nakamura et al. (2011a). The MHDP is an extension of the HDP, which was proposed by Teh et al. (2006), and the HDP is a nonparametric Bayesian extension of LDA Blei et al. (2003). A graphical model of the HDP is shown in Fig. 2(a). Concretely, the graphical model of the MHDP has multiple types of emissions that correspond to various sensor data obtained through various modality inputs, as shown in Fig. 2(b). In the HDP, observation data are usually represented as a bag-of-words (BoW). In contrast, the observation data in the MHDP use bag-of-features (BoF) representations for multimodal information. Latent variables $t_{jn}$ are regarded as indicators of topics in the HDP, which correspond to object categories in the MHDP. Nakamura et al. (2011b) showed that the MHDP enables a robot to categorize a large number of objects in a home environment into categories that are similar to human categorization results.
To obtain multimodal information, a robot has to perform actions and interact with a target object in various ways, e.g., grasping, shaking, or rotating the object. If the number of actions and types of sensor information increase, multimodal categorization and recognition can require a longer time. In most practical cases, the execution of an action by a robot takes longer than it does for a human for mechanical and security reasons. In many cases, one action can take longer than 30 seconds, although that depends on each particular robotic system. When the recognition time is a constraint and/or if quick recognition is required, it becomes important for a robot to select a small number of actions that are effective for accurate recognition. Action selection for recognition is often called active perception. However, an active perception method for the MHDP has not been proposed. This paper aims to provide an active perception method for the MHDP.

Active Perception
Generally, active perception is one of the most important cognitive capabilities of humans. From an engineering viewpoint, active perception has many specific tasks, e.g., localization, mapping, navigation, object recognition, object segmentation, and self-other differentiation.
Historically, active vision, i.e., active visual perception, has been studied as an important engineering problem in computer vision. Roy et al. (2004) presented a comprehensive survey of active three-dimensional object recognition. For example, Borotschnig et al. (2000) proposed an active vision method in a parametric eigenspace to improve the visual classification results. Denzler et al. (2002) proposed an information theoretic action selection method to gather information that conveys the true state of a system through an active camera. They used the mutual information (MI) as a criterion for action selection. Krainin et al. (2011) developed an active perception method in which a mobile robot manipulates an object to build a three-dimensional surface model of it. Their method uses the IG criterion to determine when and how the robot should grasp the object.
Modeling and/or recognizing a single object as well as modeling a scene and/or segmenting objects are also important tasks in the context of robotics. Eidenberger et al. (2010) proposed an active perception planning method for scene modeling in a realistic environment. Hoof et al. (2012) proposed an active scene exploration method that enables an autonomous robot to efficiently segment a scene into its constituent objects by interacting with the objects in an unstructured environment. They used IG as a criterion for action selection. InfoMax control for acoustic exploration was proposed by Rebguns et al. (2011).
Localization, mapping, and navigation are also targets of active perception. Velez et al. (2012) presented an online planning algorithm that enables a mobile robot to generate plans that maximize the expected performance of object detection. Burgard et al. (1997) proposed an active perception method for localization. Action selection is performed by maximizing the weighted sum of the expected entropy and expected costs. To reduce the computational cost, they only consider a subset of the next locations. Roy et al. (1999) proposed a coastal navigation method for a robot to generate trajectories for its goal by minimizing the positional uncertainty at the goal. Stachniss et al. (2005) proposed an information-gain-based exploration method for mapping and localization. Correa et al. proposed an active perception method for a mobile robot with a visual sensor mounted on a pan-tilt mechanism to reduce localization uncertainty. They used the IG criterion, which was estimated using a particle filter.
In addition, various studies on active perception by a robot have been conducted Gouko et al.; Pape et al. (2012). In spite of a large number of contributions about active perception, few theories of active perception for multimodal object category recognition have been proposed.
In particular, an MHDP-based active perception method has not yet been proposed, although the MHDP-based categorization method and its series have obtained many successful results and extensions.
In machine learning, active learning is a well-defined term. Active learning algorithms select an unobserved input datum and ask a user (labeler) to provide a training signal (label) in order to reduce uncertainty as quickly as possible Cohn et al. (1996); Settles (2012); Muslea et al. (2006). These algorithms usually assume a supervised learning problem. This problem is related to the problem in this paper, but is fundamentally different. Sinapov et al. (2014) investigated multimodal categorization and active perception by making a robot perform 10 different behaviors, obtain visual, auditory, and haptic information, explore 100 different objects, and classify them into 20 object categories. In addition, they proposed an active behavior selection method based on confusion matrices. They reported that the method was able to reduce the exploration time by half by dynamically selecting the next exploratory behavior. However, their multimodal categorization is performed in a supervised manner, and their active perception method remains heuristic, i.e., it does not have theoretical guarantees of performance.

Active Perception for Multimodal Categorization
IG-based active perception is popular, as shown above, but the theoretical justification for using IG in each task is often missing in many robotics papers. Moreover, in many cases, IG cannot be evaluated directly, reliably, or accurately. When one takes an IG criterion-based approach, how to estimate the IG is an important problem. In this study, we focus on MHDP-based active perception and develop an efficient near-optimal method based on firm theoretical justification.

Multimodal Hierarchical Dirichlet Process for Statistical Multimodal Categorization
We assume that a robot forms object categories using the MHDP from multimodal sensory data.
In this section, we briefly introduce the MHDP on which our proposed active perception method is based Nakamura et al. (2011a). The MHDP assumes that an observation node in its graphical model corresponds to an action and its corresponding modality. Nakamura et al. (2011b) employed three observation nodes in their graphical model, i.e., haptic, visual, and auditory information nodes. Three actions, i.e., grasping, looking around, and shaking, correspond to these modalities, respectively. However, the MHDP can be easily extended to a model with additional types of sensory inputs. Autonomous robots will undoubtedly gain more types of action for perception. For modeling more general cases, an MHDP with $M$ actions is described in this paper. A more general graphical model of the MHDP than in Fig. 2 is illustrated in Fig. 3. The index $m \in \mathbf{M}$ ($\#(\mathbf{M}) = M$) in Fig. 3 represents the type of information that corresponds to an action-modality perception pair, e.g., hitting an object to obtain its sound, grasping an object to test its shape and hardness, or looking at all of an object by rotating it. The observation $x^m_{jn} \in X^m$ is the $m$-th modality's $n$-th feature for the $j$-th target object. The observation $x^m_{jn}$ is assumed to be drawn from a categorical distribution whose parameter is $\theta^m_k$, where $k$ is an index of a latent topic. Parameter $\theta^m_k$ is assumed to be drawn from the Dirichlet prior distribution whose parameter is $\alpha^m_0$. The MHDP assumes that a robot obtains each modality's sensory information as a BoF representation. Similarly to the generative process of the original HDP Teh et al. (2006), the generative process of the MHDP can be described as a Chinese restaurant franchise (CRF). The learning and recognition algorithms are both derived using Gibbs sampling. In its learning process, the MHDP estimates a latent variable $t^m_{jn}$ for each feature of the $j$-th object and a topic index $k_{jt}$ for each latent variable $t$. The combination of latent variable and topic index corresponds to a topic in LDA Blei et al. (2003). Using the estimated latent variables, the categorical distribution parameter $\theta^m_k$ and topic proportion of the $j$-th object $\pi_j$ are drawn from the posterior distribution.
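The selection procedure for latent variable $t^m_{jn}$ is as follows. The prior probability that $x^m_{jn}$ selects $t$ is

$$P(t^m_{jn} = t \mid \lambda) = \begin{cases} \dfrac{\sum_m w^m N^m_{jt}}{\lambda + \sum_m w^m N^m_j - 1}, & t = 1, \ldots, T_j, \\[2ex] \dfrac{\lambda}{\lambda + \sum_m w^m N^m_j - 1}, & t = T_j + 1, \end{cases}$$

where $w^m$ is a weight for the $m$-th modality, $N^m_{jt}$ is the number of $m$-th modality observations that are allocated to $t$ in the $j$-th object, $N^m_j$ is the number of the $m$-th modality's observations about the $j$-th object, $T_j$ is the number of latent variables currently used by the $j$-th object, and $\lambda$ is a hyperparameter. In the Chinese restaurant process, if the number of observed features $N_{jt} = \sum_m w^m N^m_{jt}$ that are allocated to $t$ increases, the probability at which a new observation is allocated to the latent variable $t$ increases. Using the prior distribution, the posterior probability that observation $x^m_{jn}$ is allocated to the latent variable $t$ becomes

$$P(t^m_{jn} = t \mid X^m, \lambda) \propto P(x^m_{jn} \mid X^m_{k = k_{jt}})\, P(t^m_{jn} = t \mid \lambda).$$

The observations that correspond to the $m$-th modality and have the $k$-th topic in any object are represented by $X^m_k$. In the Gibbs sampling procedure, a latent variable for each observation is drawn from the posterior probability distribution. If $t = T_j + 1$, a new observation is allocated to a new latent variable.

The dish selection procedure is as follows. The prior probability that the $k$-th topic is allocated on the $t$-th latent variable becomes

$$P(k_{jt} = k \mid \gamma) = \begin{cases} \dfrac{M_k}{\gamma + \sum_k M_k - 1}, & k = 1, \ldots, K, \\[2ex] \dfrac{\gamma}{\gamma + \sum_k M_k - 1}, & k = K + 1, \end{cases}$$

where $K$ is the number of topic types, $M_k$ is the number of latent variables on which the $k$-th topic is placed, and $\gamma$ is a hyperparameter. Therefore, the posterior probability that the $k$-th topic is allocated on the $t$-th latent variable becomes

$$P(k_{jt} = k \mid X, \gamma) \propto P(X_{jt} \mid X_k)\, P(k_{jt} = k \mid \gamma),$$

where $X_{jt}$ denotes the observations allocated to $t$ in the $j$-th object. A topic index for the latent variable $t$ for the $j$-th object is drawn using the posterior probability. If $k = K + 1$, a new topic is placed on the latent variable. By sampling $t^m_{jn}$ and $k_{jt}$, the Gibbs sampler performs probabilistic object clustering:

$$t^m_{jn} \sim P(t^m_{jn} = t \mid X^{-mjn}, \lambda), \tag{1}$$

$$k_{jt} \sim P(k_{jt} = k \mid X^{-jt}, \gamma), \tag{2}$$

where $X^{-mjn} = X \setminus \{x^m_{jn}\}$ and $X^{-jt} = X \setminus X_{jt}$. By sampling $t^m_{jn}$ for each observation in every object using (1) and sampling $k_{jt}$ for each latent variable $t$ in every object using (2), all of the latent variables in the MHDP can be inferred.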
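If $t^m_{jn}$ and $k_{jt}$ are given, the probability that the $j$-th object is included in the $k$-th category becomes

$$P(z_j = k \mid X_j) = \frac{\sum_{t=1}^{T_j} \delta_k(k_{jt}) \sum_m w^m N^m_{jt}}{\sum_m w^m N^m_j},$$

where $X_j = \cup_m X^m_j$, $w^m$ is the weight for the $m$-th modality, and $\delta_a(x)$ is a delta function. When a robot attempts to recognize a new object after the learning phase, the probability that feature $x^m_{jn}$ is generated from the $k$-th topic becomes

$$P(x^m_{jn} \mid X^m_k) = \frac{w^m N^m_{k x^m_{jn}} + \alpha^m_0}{w^m N^m_k + d^m \alpha^m_0},$$

where $N^m_{kx}$ is the number of $m$-th modality features $x$ allocated to the $k$-th topic, $N^m_k = \sum_x N^m_{kx}$, and $d^m$ denotes the dimension of the $m$-th modality input. Topic $k_t$ allocated to $t$ for a new object is sampled from

$$k_t \sim P(k_t = k \mid X, \gamma) \propto P(X_{jt} \mid X_k)\, P(k_t = k \mid \gamma).$$

These sampling procedures play an important role in the Monte Carlo approximation of our proposed method (see Section 4.2). For a more detailed explanation of the MHDP, please refer to Nakamura et al. (2011b). Basically, a robot can autonomously learn object categories and recognize new objects using the multimodal categorization procedure described above. The performance and effectiveness of the method were evaluated in that paper.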
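The quantities used in this section can be illustrated with a small numerical sketch. The code below is not the authors' implementation: the function names are hypothetical, the table prior is written in the standard Chinese restaurant form under the assumption that the counts exclude the feature being resampled, and the feature probability is the usual smoothed (Dirichlet-categorical) estimate.

```python
def table_prior(weighted_counts, lam):
    """Chinese-restaurant prior over existing tables plus one new table.

    weighted_counts[t] stands for sum_m w^m N^m_jt (assumed here to exclude
    the feature being resampled); lam is the concentration hyperparameter.
    """
    total = lam + sum(weighted_counts)
    probs = [count / total for count in weighted_counts]
    probs.append(lam / total)  # probability of opening a new table
    return probs


def feature_predictive(count_kx, count_k, alpha0, dim, weight=1.0):
    """Smoothed probability that topic k emits feature x in one modality."""
    return (weight * count_kx + alpha0) / (weight * count_k + dim * alpha0)


def category_probability(table_topics, weighted_counts, n_topics):
    """Share of an object's weighted features assigned to each topic."""
    total = sum(weighted_counts)
    probs = [0.0] * n_topics
    for topic, count in zip(table_topics, weighted_counts):
        probs[topic] += count / total
    return probs


# Toy object: three tables holding 2, 1, and 1 weighted features,
# serving topics 0, 1, and 0, respectively.
prior = table_prior([2.0, 1.0, 1.0], lam=1.0)
categories = category_probability([0, 1, 0], [2.0, 1.0, 1.0], n_topics=2)
```

Tables that already hold many features attract new features more strongly, and the category probability simply aggregates table sizes by topic.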

Active Perception Method
In this section, we describe the proposed active perception method based on the MHDP.

Basic Formulation
A robot should have already conducted several actions and obtained information from several modalities when it attempts to select the next action set for recognizing a target object. For example, visual information can usually be obtained by looking at the front face of the $j$-th object from a distance before interacting with the object physically. We assume that a robot has already obtained information corresponding to a subset of modalities $m_{oj} \subset \mathbf{M}$. When a robot faces a new object and has not obtained any information, $m_{oj} = \emptyset$.
The purpose of object recognition in multimodal categorization is different from conventional supervised learning-based pattern recognition problems. In supervised learning, the recognition result is evaluated by checking whether the output is the same as the truth label. However, in unsupervised learning, there are basically no truth labels. Therefore, the performance of active perception should be measured in a different manner.
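The action set the robot selects is described as $A \subseteq \mathbf{M} \setminus m_{oj}$. We consider an effective action set for active perception to be one that largely reduces the distance between the final recognition state after the information from all modalities $\mathbf{M}$ is obtained and the recognition state after the robot executes the selected action set $A$. The recognition state is represented by the posterior distribution $P(z_j \mid X^{m_{oj} \cup A}_j)$, where $X^{B}_j = \cup_{m \in B} X^m_j$ denotes the observations of the $j$-th object for a set of modalities $B$. This distribution represents the posterior distribution related to the object category after taking actions $m_{oj}$ and $A$. The final recognition state, i.e., posterior distribution over latent variables after obtaining the information from all modalities $\mathbf{M}$, becomes $P(z_j \mid X^{\mathbf{M}}_j)$. The purpose of active perception is to select a set of actions that can estimate the posterior distribution most accurately. When $L$ actions can be executed, if we employ KL divergence as the metric of the difference between the two probability distributions,

$$A^*_j = \operatorname*{argmin}_{A \in \mathcal{F}^{m_{oj}}_L} \mathrm{KL}\left(P(z_j \mid X^{\mathbf{M}}_j),\; P(z_j \mid X^{m_{oj} \cup A}_j)\right)$$

is a reasonable evaluation criterion for realizing effective active perception, where $\mathcal{F}^{m_{oj}}_L$ denotes the family of feasible sets of $L$ actions chosen from $\mathbf{M} \setminus m_{oj}$. However, neither the true $X^{\mathbf{M}}_j$ nor $X^{m_{oj} \cup A}_j$ can be observed before taking $A$ on the $j$-th target object, and hence this criterion cannot be used at the moment of action selection. Therefore, a rational alternative for the evaluation criterion is the expected value of the KL divergence at the moment of action selection:

$$A^*_j = \operatorname*{argmin}_{A \in \mathcal{F}^{m_{oj}}_L} \mathbb{E}_{X^{\mathbf{M} \setminus m_{oj}}_j \mid X^{m_{oj}}_j}\left[\mathrm{KL}\left(P(z_j \mid X^{\mathbf{M}}_j),\; P(z_j \mid X^{m_{oj} \cup A}_j)\right)\right].$$

Here, we propose to use the IG maximization criterion to select the next action set for active perception:

$$A^*_j = \operatorname*{argmax}_{A \in \mathcal{F}^{m_{oj}}_L} \mathrm{IG}(z_j; X^A_j \mid X^{m_{oj}}_j), \tag{6}$$

where $\mathrm{IG}(X; Y \mid Z)$ is the IG of $Y$ for $X$, which is calculated on the basis of the probability distribution commonly conditioned by $Z$ as follows:

$$\mathrm{IG}(X; Y \mid Z) = \mathrm{KL}\left(P(X, Y \mid Z),\; P(X \mid Z) P(Y \mid Z)\right).$$

By definition, the expected KL divergence is the same as $\mathrm{IG}(X; Y)$. The definition of IG and its relation to KL divergence are as follows.

$$\mathrm{IG}(X; Y) = H(X) - H(X \mid Y) = \mathbb{E}_{Y}\left[\mathrm{KL}\left(P(X \mid Y),\; P(X)\right)\right].$$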
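The identity between IG and the expected KL divergence can be checked numerically on a small discrete example. The sketch below is purely illustrative: the joint distribution is made up, and the function names are hypothetical.

```python
import math


def marginals(joint):
    """Row marginal P(X) and column marginal P(Y) of a joint table P(X, Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def information_gain(joint):
    """IG(X; Y) computed as KL(P(X, Y), P(X) P(Y))."""
    px, py = marginals(joint)
    return sum(
        p * math.log(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def expected_kl(joint):
    """E_Y[ KL(P(X | Y), P(X)) ], the expected posterior-versus-prior divergence."""
    px, py = marginals(joint)
    total = 0.0
    for j, pyj in enumerate(py):
        if pyj == 0:
            continue
        for i, row in enumerate(joint):
            posterior = row[j] / pyj
            if posterior > 0:
                total += pyj * posterior * math.log(posterior / px[i])
    return total


# Hypothetical joint distribution over a binary category X and a binary observation Y.
toy_joint = [[0.3, 0.1], [0.1, 0.5]]
ig_value = information_gain(toy_joint)
kl_value = expected_kl(toy_joint)
```

Both functions return the same number up to floating-point error, which is the property that Theorem 1 relies on.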
The optimality of the proposed criterion (6) is supported by Theorem 1.

Theorem 1 The set of next actions $A \in \mathcal{F}^{m_{oj}}_L$ that maximizes $\mathrm{IG}(z_j; X^A_j \mid X^{m_{oj}}_j)$ minimizes the expected KL divergence between the posterior distribution over $z_j$ after all modality information has been observed and after $A$ has been executed.
This theorem is essentially the result of well-known characteristics of IG (see Russo and Roy (2015); MacKay (2003) for example). This means that maximizing IG is the optimal policy for active perception in an MHDP-based multimodal object category recognition task. As a special case, when only a single action is permitted, the following corollary is satisfied.
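Corollary 2 The next action $m \in \mathbf{M} \setminus m_{oj}$ that maximizes $\mathrm{IG}(z_j; X^m_j \mid X^{m_{oj}}_j)$ minimizes the expected KL divergence between the posterior distribution over $z_j$ after all modality information has been observed and after the action has been executed.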
Proof By substituting {m} into A in Theorem 1, we can obtain the corollary.
Using IG, the active perception strategy for the next single action is simply described as follows:

$$m^*_j = \operatorname*{argmax}_{m \in \mathbf{M} \setminus m_{oj}} \mathrm{IG}(z_j; X^m_j \mid X^{m_{oj}}_j). \tag{9}$$

This means that the robot should select the action $m^*_j$ that can obtain the $X^{m^*_j}_j$ that maximizes the IG for the recognition result $z_j$ under the condition that the robot has already observed $X^{m_{oj}}_j$. However, we still have two problems, as follows.
1. The calculation of $\mathrm{IG}(z_j; X^A_j \mid X^{m_{oj}}_j)$ cannot be performed in a straightforward manner.
2. The argmax operation in (6) is a combinatorial optimization problem and incurs heavy computational cost when $\#(\mathbf{M} \setminus m_{oj})$ and $L$ become large.
Based on some properties of the MHDP, we can obtain reasonable solutions for these two problems.

Monte Carlo Approximation of IG
Equations (6) and (9) provide a robot with an appropriate criterion for selecting an action to efficiently recognize a target object. However, at first glance, it looks difficult to calculate the IG. First, the calculation of the expectation procedure $\mathbb{E}_{X^A_j \mid X^{m_{oj}}_j}[\cdot]$ requires a sum operation over all possible $X^A_j$. The number of possible $X^A_j$ exponentially increases when the number of elements in the BoF increases. Second, the calculation of $P(z_j \mid X^{A \cup m_{oj}}_j)$ for each possible observation $X^A_j$ requires the same computational cost as recognition in the multimodal categorization itself. Therefore, the straightforward calculation for solving (9) is computationally infeasible in practice.
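However, by exploiting a characteristic property of the MHDP, an efficient Monte Carlo approximation can be derived. First, we describe IG as the expectation of a logarithm term.

$$\mathrm{IG}(z_j; X^A_j \mid X^{m_{oj}}_j) = \sum_{z_j, X^A_j} P(z_j, X^A_j \mid X^{m_{oj}}_j) \log \frac{P(z_j, X^A_j \mid X^{m_{oj}}_j)}{P(z_j \mid X^{m_{oj}}_j)\, P(X^A_j \mid X^{m_{oj}}_j)} = \mathbb{E}_{z_j, X^A_j \mid X^{m_{oj}}_j}\left[\log \frac{P(X^A_j \mid z_j, X^{m_{oj}}_j)}{P(X^A_j \mid X^{m_{oj}}_j)}\right]. \tag{10}$$

An analytic evaluation of (10) is also practically impossible. Therefore, we adopt a Monte Carlo method. Equation (10) suggests that an efficient Monte Carlo approximation can be performed as shown below if we can sample $(z^{[k]}_j, X^{A[k]}_j) \sim P(z_j, X^A_j \mid X^{m_{oj}}_j)$ for $k \in \{1, \ldots, K\}$:

$$\mathrm{IG}(z_j; X^A_j \mid X^{m_{oj}}_j) \approx \frac{1}{K} \sum_{k=1}^{K} \log \frac{P(X^{A[k]}_j \mid z^{[k]}_j, X^{m_{oj}}_j)}{P(X^{A[k]}_j \mid X^{m_{oj}}_j)}. \tag{11}$$

Fortunately, the MHDP provides a sampling procedure for $z^{[k]}_j \sim P(z_j \mid X^{m_{oj}}_j)$ and $X^{A[k]}_j \sim P(X^A_j \mid z^{[k]}_j)$ in its original paper Nakamura et al. (2011a). In the context of multimodal categorization by a robot, $X^{A[k]}_j \sim P(X^A_j \mid X^{m_{oj}}_j)$ is a prediction of an unobserved modality's sensation using observed modalities' sensations, i.e., cross-modal inference. The sampling process of $(z^{[k]}_j, X^{A[k]}_j)$ can be regarded as a mental simulation by a robot that predicts the unobserved modality's sensation leading to a categorization result based on the predicted sensation and observed information. In (11), $P(X^{A[k]}_j \mid X^{m_{oj}}_j)$ in the denominator cannot be evaluated in a straightforward way. Again, a Monte Carlo method can be adopted, as follows:

$$P(X^A_j \mid X^{m_{oj}}_j) = \mathbb{E}_{z_j \mid X^{m_{oj}}_j}\left[P(X^A_j \mid z_j, X^{m_{oj}}_j)\right] \approx \frac{1}{K'} \sum_{k'=1}^{K'} P(X^A_j \mid z^{[k']}_j, X^{m_{oj}}_j), \tag{12}$$

where $K'$ is the number of samples for the second Monte Carlo approximation. Fortunately, in this Monte Carlo approximation (12), we can reuse the samples drawn in the previous Monte Carlo approximation efficiently, i.e., $K' = K$. By substituting (12) into (11), we finally obtain the approximate IG for the criterion of active perception, i.e., our proposed method, as follows:

$$\mathrm{IG}(z_j; X^A_j \mid X^{m_{oj}}_j) \approx \frac{1}{K} \sum_{k=1}^{K} \log \frac{P(X^{A[k]}_j \mid z^{[k]}_j, X^{m_{oj}}_j)}{\frac{1}{K} \sum_{k'=1}^{K} P(X^{A[k]}_j \mid z^{[k']}_j, X^{m_{oj}}_j)}.$$

Note that the computational cost for evaluating IG becomes $O(K^2)$. In summary, a robot can approximately estimate the IG for unobserved modality information by generating virtual observations based on observed data and evaluating their likelihood.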
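The nested Monte Carlo estimate with sample reuse can be sketched on a toy model. The code below is an illustration under strong simplifying assumptions, not the MHDP sampler itself: the latent state is reduced to a single category index with a known posterior, each category emits bag-of-features counts from a known categorical distribution with strictly positive entries, and all names (`estimate_ig`, `sample_bof`, `log_likelihood`) are hypothetical.

```python
import math
import random


def log_likelihood(counts, theta):
    """Log-likelihood of BoF counts under a categorical parameter theta.

    The multinomial coefficient is omitted because it cancels in the ratio.
    Assumes every entry of theta is strictly positive.
    """
    return sum(c * math.log(t) for c, t in zip(counts, theta))


def sample_bof(theta, n_features, rng):
    """Draw a virtual bag-of-features observation from one category."""
    counts = [0] * len(theta)
    for idx in rng.choices(range(len(theta)), weights=theta, k=n_features):
        counts[idx] += 1
    return counts


def estimate_ig(posterior, thetas, n_features, n_samples, rng):
    """Monte Carlo IG estimate that reuses the K latent samples in the denominator."""
    zs = rng.choices(range(len(posterior)), weights=posterior, k=n_samples)
    xs = [sample_bof(thetas[z], n_features, rng) for z in zs]
    total = 0.0
    for z, x in zip(zs, xs):
        numerator = log_likelihood(x, thetas[z])
        # Denominator: average likelihood of x over all sampled latent states,
        # computed with the log-sum-exp trick for numerical stability.
        logs = [log_likelihood(x, thetas[z_other]) for z_other in zs]
        peak = max(logs)
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logs) / n_samples
        )
        total += numerator - log_denominator
    return total / n_samples


rng = random.Random(0)
current_posterior = [0.5, 0.5]
informative = [[0.9, 0.1], [0.1, 0.9]]  # the two categories emit different features
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # both categories emit the same features
ig_informative = estimate_ig(current_posterior, informative, 10, 200, rng)
ig_uninformative = estimate_ig(current_posterior, uninformative, 10, 200, rng)
```

The double loop over the K samples is where the $O(K^2)$ cost comes from. An action whose predicted observations do not depend on the category receives an estimate of zero, whereas an action that separates the categories receives a clearly positive estimate.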

Sequential Decision Making as a Submodular Maximization
If a robot wants to select $L$ actions $A_j = \{a_1, a_2, \ldots, a_L\}$ ($a_i \in \mathbf{M} \setminus m_{oj}$), it has to solve (6), i.e., a combinatorial optimization problem. The number of combinations of $L$ actions is $\binom{\#(\mathbf{M} \setminus m_{oj})}{L}$, which increases dramatically when the number of possible actions $\#(\mathbf{M} \setminus m_{oj})$ and $L$ increase. For example, Sinapov et al. (2014) gave a robot 10 different behaviors in their experiment on robotic multimodal categorization. Future autonomous robots will have more available actions for interacting with a target object and be able to obtain additional types of modality information through these interactions. Hence, it is important to develop an efficient solution for the combinatorial optimization problem.
Here again, the MHDP has advantages for solving this problem.
Theorem 3 The evaluation criterion for multimodal active perception $\mathrm{IG}(z_j; X^A_j \mid X^{m_{oj}}_j)$ is a submodular and non-decreasing function with regard to $A$.
Proof As shown in the graphical model of the MHDP in Fig. 3, the observations for each modality $X^m_j$ are conditionally independent under the condition that a set of latent variables $z_j = \{\{k_{jt}\}_{1 \le t \le T_j}, \{t^m_{jn}\}_{m \in \mathbf{M}, 1 \le n \le N^m_j}\}$ is given. This satisfies the conditions of the theorem by Krause et al. (2005). Therefore, $\mathrm{IG}(z_j; X^m_j \mid X^{m_{oj}}_j)$ is a submodular and non-decreasing function with regard to $X^m_j$.
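Submodularity is a property similar to the convexity of a real-valued function in a vector space. If a set function $F : 2^V \to \mathbb{R}$ satisfies

$$F(A \cup \{x\}) - F(A) \ge F(A' \cup \{x\}) - F(A'),$$

where $V$ is a finite set, for all $A \subseteq A' \subseteq V$ and $x \notin A'$, the set function $F$ has submodularity and is called a submodular function. In other words, adding an element to a smaller set helps at least as much as adding it to a larger set.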
Function IG is not always a submodular function. However, Krause et al. proved that IG(U ; A) is submodular and non-decreasing with regard to A ⊆ S if all of the elements of S are conditionally independent under the condition that U is given. With this theorem, Krause et al. (2005) solved the sensor allocation problem efficiently. Theorem 3 means that the problem (6) is reduced to a submodular maximization problem.
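The diminishing-returns condition can be verified by brute force on small ground sets. The following sketch uses a hypothetical coverage function as the submodular example and a squared-cardinality function as a counterexample; neither is taken from the paper.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_submodular(f, ground, tol=1e-12):
    """Check F(A + x) - F(A) >= F(B + x) - F(B) for all A <= B and x outside B."""
    ground = frozenset(ground)
    for big in subsets(ground):
        for small in subsets(big):
            for x in ground - big:
                gain_small = f(small | {x}) - f(small)
                gain_big = f(big | {x}) - f(big)
                if gain_small < gain_big - tol:
                    return False
    return True


# Hypothetical example: each action reveals a set of object properties.
REVEALS = {"grasp": {1, 2, 3}, "shake": {3, 4}, "look": {1, 2}, "hit": {5}}


def coverage(actions):
    """Number of distinct properties revealed by a set of actions."""
    return len(set().union(*(REVEALS[a] for a in actions)))


def squared_size(actions):
    """Marginal gains grow with set size, so this is not submodular."""
    return len(actions) ** 2


coverage_ok = is_submodular(coverage, REVEALS)
squared_ok = is_submodular(squared_size, REVEALS)
```

The check enumerates all nested pairs of subsets, so it is only practical for a handful of elements, but it makes the definition concrete.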
It is known that the greedy algorithm is an efficient strategy for the submodular maximization problem. Nemhauser et al. (1978) proved that the greedy algorithm selects a subset whose value is at least a constant fraction $(1 - 1/e)$ of the optimal value, if the evaluation function $F(A)$ is submodular, non-decreasing, and $F(\emptyset) = 0$, where $F(\cdot)$ is a set function, and $A$ is a set. If the evaluation function is a submodular set function, a greedy algorithm is practically sufficient for selecting subsets in many cases. In sum, a greedy algorithm gives a near-optimal solution. However, the greedy algorithm is still inefficient because it requires an evaluation of all choices at each step of a sequential decision making process. Minoux (1978) proposed a lazy greedy algorithm to make the greedy algorithm more efficient for the submodular evaluation function. The lazy greedy algorithm can reduce the number of evaluations by using the characteristics of a submodular function.
In this paper, we propose the use of the lazy greedy algorithm for selecting L actions to recognize a target object on the basis of the submodular property of IG. The final greedy and lazy greedy algorithms for MHDP-based active perception, i.e., our proposed methods, are shown in Algorithms 1 and 2, respectively.
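A minimal sketch of lazy greedy selection is given below. It is a generic heap-based variant rather than a transcription of Algorithm 2, and it assumes a user-supplied `marginal_gain` function that plays the role of the (Monte Carlo) IG estimate; the toy coverage gain and all names are hypothetical.

```python
import heapq


def lazy_greedy(candidates, marginal_gain, budget):
    """Select up to `budget` items, re-evaluating gains only when needed.

    Each heap entry is (-gain, item, stamp), where stamp is the number of
    selected items at the time the gain was computed. For a submodular
    objective, a stale gain is an upper bound on the current gain.
    """
    selected = []
    heap = [(-marginal_gain(item, selected), item, 0) for item in candidates]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(selected) < budget:
        _, item, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # The gain is up to date and no other upper bound exceeds it.
            selected.append(item)
        else:
            fresh = marginal_gain(item, selected)
            evaluations += 1
            heapq.heappush(heap, (-fresh, item, len(selected)))
    return selected, evaluations


# Hypothetical stand-in for IG: the number of new properties an action reveals.
REVEALED_BY = {"grasp": {1, 2, 3, 4}, "shake": {5, 6, 7}, "look": {1, 2}, "hit": {3}}


def coverage_gain(action, selected):
    """Marginal coverage of `action` given the already selected actions."""
    covered = set().union(*(REVEALED_BY[a] for a in selected))
    return len(REVEALED_BY[action] - covered)


chosen, n_evaluations = lazy_greedy(list(REVEALED_BY), coverage_gain, budget=2)
```

In this toy run the second action is found after re-evaluating a single candidate, whereas a plain greedy step would have re-evaluated all three remaining actions.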
The main contribution of the lazy greedy algorithm is to reduce the computational cost of active perception. The majority of the computational cost originates from the number of times a robot evaluates $\mathrm{IG}_m$ for determining action sequences. When a robot has to choose $L$ actions, the brute-force algorithm that directly evaluates all alternatives $A \in \mathcal{F}^{m_{oj}}_L$ requires $\binom{\#(\mathbf{M} \setminus m_{oj})}{L}$ evaluations of IG. In contrast, the greedy algorithm evaluates every remaining action at each of the $L$ steps, i.e., $\sum_{l=0}^{L-1} \left(\#(\mathbf{M} \setminus m_{oj}) - l\right)$ evaluations, which is $O(ML)$. The lazy greedy algorithm incurs the same computational cost as the greedy algorithm only in the worst case. However, practically, the number of re-evaluations in the lazy greedy algorithm is quite small. Therefore, the computational cost of the lazy greedy algorithm increases almost in proportion to $L$, i.e., almost linearly. The memory requirement of the proposed method is also quite small.
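The gap between exhaustive search and greedy selection can be made concrete by counting IG evaluations. The helper names below are hypothetical; the counts follow directly from the number of $L$-element subsets and from evaluating every remaining action at each greedy step.

```python
from math import comb


def brute_force_evaluations(n_actions, budget):
    """Number of candidate action sets of size `budget` among `n_actions`."""
    return comb(n_actions, budget)


def greedy_evaluations(n_actions, budget):
    """Evaluations when every remaining action is scored at each greedy step."""
    return sum(n_actions - step for step in range(budget))


# With 10 available behaviors, as in the setting of Sinapov et al. (2014),
# and a budget of 5 actions:
exhaustive = brute_force_evaluations(10, 5)
greedy = greedy_evaluations(10, 5)
```

The exhaustive count grows combinatorially with the number of available actions, while the greedy count grows only as the product of the number of actions and the budget.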

Algorithm 1 Greedy algorithm.
Require: The MHDP is trained using a training data set. The j-th object is found. $m_{o_j}$ is initialized, and $X^{m_{o_j}}_j$ is observed.
for l = 1 to L do
    for all $m \in M \setminus m_{o_j}$ do
        Estimate $IG_m$ by the Monte Carlo approximation.
    end for
    $m^* \leftarrow \arg\max_m IG_m$
    Execute the $m^*$-th action to the j-th target object and obtain $X^{m^*}_j$.
    $m_{o_j} \leftarrow m_{o_j} \cup \{m^*\}$
end for

Both the greedy and lazy greedy algorithms only require memory for $IG_m$ for each modality and K samples for the Monte Carlo approximation. These requirements are negligibly small compared with the MHDP itself.

Experiment 1: Humanoid Robot
An experiment using an upper-torso humanoid robot was conducted to verify the proposed active perception method in the real-world environment.

Conditions
In this experiment, RIC-Torso, developed by the RT Corporation, was used (see Fig. 4). RIC-Torso is an upper-torso humanoid robot that has two robot hands. We prepared an experimental environment that is similar to the one in the original MHDP paper Nakamura et al. (2011a).

VISUAL INFORMATION (m v )
Visual information was obtained from the Xtion PRO LIVE set on the head of the robot. The camera was regarded as the eyes of the robot. The robot captured 74 images of a target object while it rotated on a turntable (see Fig. 4). Each image was resized to 320 × 240. Scale-invariant feature transform (SIFT) feature vectors were extracted from each captured image Lowe (2004). A certain number of 128-dimensional feature vectors were obtained from each image. Note that the SIFT feature did not consider hue information. All of the obtained feature vectors were transformed into BoF representations using k-means clustering. BoF representations were used as observation data for the visual modality of the MHDP. The index for this modality was defined as $m_v$.

Algorithm 2 Lazy greedy algorithm.
Require: The MHDP is trained using a training data set. The j-th object is found. $m_{o_j}$ is initialized, and $X^{m_{o_j}}_j$ is observed.
for all $m \in M \setminus m_{o_j}$ do
    Estimate $IG_m$ by the Monte Carlo approximation.
end for
$m^* \leftarrow \arg\max_m IG_m$
Execute the $m^*$-th action to the j-th target object and obtain $X^{m^*}_j$.
$m_{o_j} \leftarrow m_{o_j} \cup \{m^*\}$
Prepare a stack S for the modality indices and initialize it.
for all $m \in M \setminus m_{o_j}$ do
    push(S, $(m, IG_m)$)
end for
for l = 1 to L − 1 do
    repeat
        S ← descending sort(S) // w.r.t. $IG_m$
        $(m_1, IG_{m_1})$ ← pop(S), $(m_2, IG_{m_2})$ ← pop(S)
        // Re-evaluate $IG_{m_1}$ as follows.
        Estimate $IG_{m_1}$ by the Monte Carlo approximation given the current $X^{m_{o_j}}_j$.
        push(S, $(m_2, IG_{m_2})$)
        if $IG_{m_1} < IG_{m_2}$ then push(S, $(m_1, IG_{m_1})$)
    until $IG_{m_1} \ge IG_{m_2}$
    $m^* \leftarrow m_1$
    Execute the $m^*$-th action to the j-th target object and obtain $X^{m^*}_j$.
    $m_{o_j} \leftarrow m_{o_j} \cup \{m^*\}$
end for

AUDITORY INFORMATION (m as AND m ah )
Auditory information was obtained from a multipowered shotgun microphone NTG-2 by RODE Microphone. The microphone was regarded as the ear of the robot. In this experiment, two types of auditory information were acquired. One was generated by hitting the object, and the other was generated by shaking it. The two sounds were regarded as different auditory information and hence different modality observations in the MHDP model. The two actions, i.e., hitting and shaking, were manually programmed for the robot. When the robot began to execute an action, it also started recording the object's sound (see Fig. 4). The sound was recorded until two seconds after the robot finished the action. The recorded auditory data were temporally divided into frames, and each frame was transformed into 13-dimensional Mel-frequency cepstral coefficients (MFCCs). The MFCC feature vectors were transformed into BoF representations using k-means clustering in the same way as the visual information. The indices of these modalities were defined as m as and m ah , respectively, for "shake" and "hit."

HAPTIC INFORMATION (m h )
Haptic information was obtained by grasping a target object using the robot's hand. When the robot attempted to obtain haptic information from an object placed in front of it, it moved its hand to the object and gradually closed its hand until a certain amount of counterforce was detected (see Fig. 4). The joint angle of the hand was measured when the hand touched the target object and when the hand stopped. These two angles and the difference between them were used as a three-dimensional feature vector. When obtaining haptic information, the robot grasped the target object 10 times and obtained 10 feature vectors. The feature vectors were transformed into BoF representations using k-means clustering in the same way as for the other information types. The index of the haptic modality was defined as m h .
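The conversion from raw feature vectors to a BoF representation can be sketched as follows for the haptic case. This is a minimal illustration under stated assumptions: the codebook is taken as given (in the experiment it would come from k-means clustering on training data), the sign convention of the angle difference is a guess, and all numeric values and function names are hypothetical.

```python
import math


def haptic_feature(angle_at_touch, angle_at_stop):
    """Three-dimensional haptic feature: both joint angles and their difference."""
    # Assumption: the difference is taken as stop angle minus touch angle.
    return (angle_at_touch, angle_at_stop, angle_at_stop - angle_at_touch)


def nearest_codeword(vector, codebook):
    """Index of the codebook centroid closest to the vector (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda c: math.dist(vector, codebook[c]))


def bag_of_features(vectors, codebook):
    """Histogram of codeword assignments, i.e., a BoF count vector."""
    histogram = [0] * len(codebook)
    for vector in vectors:
        histogram[nearest_codeword(vector, codebook)] += 1
    return histogram


# Hypothetical codebook with two centroids and ten hypothetical grasps.
CODEBOOK = [(10.0, 30.0, 20.0), (10.0, 60.0, 50.0)]
GRASPS = [haptic_feature(10.0, 28.0 + 4.0 * g) for g in range(10)]

if __name__ == "__main__":
    print(bag_of_features(GRASPS, CODEBOOK))
```

The same quantization step applies to the SIFT and MFCC feature vectors; only the dimensionality of the vectors and the size of the codebook differ.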

MULTIMODAL INFORMATION AS BOF REPRESENTATIONS
In summary, a robot could obtain multimodal information from four modalities for perception. [...] and $N^{m_h}_j = 30$. The weight of each modality $w^m$ was set to 1. The formation of multimodal object categories itself is out of the scope of this paper. Therefore, the constants were empirically determined so that the robot could form object categories similar to those formed by human participants. The number of samples K in the Monte Carlo approximation for estimating IG was set to K = 5000.

TARGET OBJECTS
For the target objects, 17 types of commodities were prepared for the experiment, as shown in Fig. 5. Each index on the right-hand side of the figure indicates the index of each object. The hardness of the balls, the striking sounds of the cups, and the sounds made while shaking the bottles differed depending on the object category. Therefore, ground-truth categorization could not be achieved using visual information alone.

Procedure
The experimental procedure was as follows. First, the robot formed object categories through multimodal categorization in an unsupervised manner. An experimenter placed each object in front of the robot one by one. The robot looked at the object to obtain visual features, grasped it to obtain haptic features, shook it to obtain auditory shaking features, and hit it to obtain the auditory striking features. After obtaining the multimodal information of the objects as a training data set, the MHDP was trained using a Gibbs sampler. The results of multimodal categorization are shown in Fig. 5. The category that has the highest posterior probability for each object is shown in white. These results show that the robot can form multimodal object categories using MHDP, as described in Nakamura et al. (2011a). After the robot had formed object categories, we fixed the latent variables for the training data set.
Second, an experimental procedure for active perception was conducted. An experimenter placed an object in front of the robot. The robot observed the object using its camera, obtained visual information, and set m oj = {m v }. The robot then determined its next set of actions for recognizing the target object using its active perception strategy.

SELECTING THE NEXT ACTION
First, we describe results for the first single action selection after obtaining visual information. In this experiment, the robot had three choices for its next action, i.e., $m_{as}$, $m_{ah}$, and $m_h$. To evaluate the results of active perception, we used $\mathrm{KL}\big(P(k \mid X^{M}_j),\, P(k \mid X^{A \cup m_{o_j}}_j)\big)$, i.e., the divergence between the posterior distribution over the object categories k in the final recognition state and that in the next recognition state, as an evaluation criterion in place of $\mathrm{KL}\big(P(z_j \mid X^{M}_j),\, P(z_j \mid X^{A \cup m_{o_j}}_j)\big)$. The latter is the original evaluation criterion in (4), but its computational cost is too high to evaluate.

Figure 6: (Top) KL divergence between the final recognition state and the posterior probability estimated after obtaining only visual information, (middle) estimated $IG_m$ for each object based on visual information, and (bottom) KL divergence between the final recognition state and the posterior probability estimated after obtaining visual information and executing each selected action (v+as, v+ah, v+h). The horizontal axis is the object ID. Our theory of multimodal active perception suggests that the action with the highest information gain (shown in the middle) tends to lead its initial recognition state (whose KL divergence from the final recognition state is shown at the top) to a recognition state whose KL divergence from the final recognition state (shown at the bottom) is the smallest. These figures suggest the probabilistic relationships were satisfied as a whole.

Fig. 6 (top) shows the KL divergence between the posterior probabilities of the category after obtaining the information from all modalities and after obtaining only visual information. With regard to some objects, e.g., objects 6 and 7, the figure shows that visual information is sufficient for the robot to recognize the objects. For many objects, however, visual information alone could not bring the recognition state to the final state, which was reached only by using the information of all modalities. Fig. 6 (middle) shows $IG_m$ calculated using the visual information for each action. Fig. 6 (bottom) shows the KL divergence between the final recognition state and the posterior probability estimated after obtaining visual information and the information of each selected action. We observe that an action with a higher value of $IG_m$ tended to further reduce the KL divergence, as Theorem 1 suggests. Fig. 7 shows the average KL divergence from the final recognition state after executing an action selected by the $IG_m$ criterion. Actions IG .min, IG .mid, and IG .max denote actions that have the minimum, middle, and maximum values of $IG_m$, respectively. These results show that IG .max clearly reduced the uncertainty of the target objects.
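The evaluation criterion used here, the KL divergence between two posterior distributions over categories, can be computed as in the following minimal sketch. The posterior values are hypothetical and are not taken from the experiment; the small constant `eps` is an assumption added to avoid division by zero.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same categories."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def best_action_by_kl(final_posterior, posteriors_after_action):
    """Action whose resulting posterior is closest to the final recognition state."""
    return min(
        posteriors_after_action,
        key=lambda a: kl_divergence(final_posterior, posteriors_after_action[a]),
    )


# Hypothetical posteriors over three categories.
FINAL = [0.05, 0.90, 0.05]
AFTER_ACTION = {
    "as": [0.30, 0.40, 0.30],
    "ah": [0.10, 0.80, 0.10],
    "h": [0.45, 0.10, 0.45],
}

if __name__ == "__main__":
    for action, posterior in AFTER_ACTION.items():
        print(action, kl_divergence(FINAL, posterior))
    print(best_action_by_kl(FINAL, AFTER_ACTION))
```

The first argument is the final recognition state, matching the order of the arguments in the criterion above.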
The precision of category recognition after an action execution is summarized in Table 1. Basically, a category recognition result is obtained as the posterior distribution (3) in the MHDP. The category with the highest posterior probability is considered to be the recognition result for illustrative purposes in Table 1. Obtaining information by executing IG .max almost always increased recognition performance.
Examples of changes in the posterior distribution are shown in Figs. 8 and 9 for objects 8 ("metal cup") and 12 ("plastic bottle containing bells"), respectively. The robot could not clearly recognize the category of object 8 after obtaining visual information. The IG m values in Fig. 6 (middle) show that m ah was IG .max for the 8th object. Fig. 8 shows that m ah reduced the uncertainty and allowed the robot to correctly recognize the object as category 6, a metal cup. This means that the robot noticed that the target object was a metal cup by hitting it and listening to its metallic sound. The metal cup did not make a sound when the robot shook it. Therefore, the IG for m as was small. As Fig. 9 shows, the robot first recognized the 12th object as a plastic bottle containing bells with high probability and as an empty plastic bottle with a low probability. Fig. 6 shows that the IG m criterion suggested m ah as the first alternative and m as as the second alternative. Fig. 9 shows that m as and m ah could determine that the target object was a plastic bottle containing bells, but m h could not.
As humans, we would expect to differentiate an empty bottle from a bottle containing bells by shaking or hitting the bottle, and differentiate a metal cup from a plastic cup by hitting it. The proposed active perception method constructively reproduced this behavior in a robotic system using an unsupervised multimodal machine learning approach.

SELECTING THE NEXT SET OF MULTIPLE ACTIONS
We evaluated the greedy and lazy greedy algorithms for active perception sequential decision making. The KL divergence from the final state for all target objects is averaged at each step and shown in Fig. 10. For each condition, the KL divergence gradually decreased and reached almost zero. However, the rate of decrease notably differed. As the theory of submodular optimization suggests, the greedy algorithm was shown to be better than the average and only slightly worse than the best case Nemhauser et al. (1978). The best and worst cases were selected after all types of sequential actions had been performed. The "average" is the average of the KL divergence obtained by all possible types of sequential actions. The results for the lazy greedy algorithm were almost the same as those of the greedy algorithm, as Minoux (1978) suggested.

Figure 9: Posterior probability of the category for object 12 after executing each action. These results show that the actions with the highest and second highest information gain, i.e., ah and as, allowed the robot to efficiently estimate that the true object category was "plastic bottle containing bells."

Figure 10: KL divergence from the final state at each step for each sequential action selection procedure (worst case, average, lazy greedy, greedy, and best case). Note that the line of the lazy greedy algorithm is overlapped by that of the greedy algorithm.
The sequential behaviors of IG m were observed to determine whether they were consistent with our theories. For example, the changes in IG m at each step as the robot sequentially selected its action to perform on object 10 using the greedy algorithm are shown in Fig. 11. Theorem 3 shows that the IG is a submodular function. This predicts that IG m decreases monotonically when a new action is executed in active perception. When the robot obtained only visual information (v only in Fig. 11), all values of IG m were still large. After m ah was executed on the basis of the greedy algorithm, IG m ah became zero. At the same time, IG m as and IG m h decreased. In the same way, all values of IG m decreased monotonically. Fig. 12 shows the time series of the posterior probability of the category for object 10 during sequential active perception. Using only visual information, the robot misclassified the target object as a plastic bottle containing bells (category 3). The action sequence in reverse order did not allow the robot to recognize the object as a steel can at its first step and instead changed its recognition state to an empty plastic bottle (category 4). After the second action, i.e., grasping (m h ), the robot recognized the object as a steel can. In contrast, the greedy algorithm could determine that the target object was in category 4, i.e., steel can, with its first action.
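The monotone decrease predicted above follows from the diminishing-returns form of submodularity. As a sketch in simplified notation, where $F(A)$ stands for the IG obtained from a set of actions $A$ (a shorthand introduced here for illustration):

```latex
F(A \cup \{m\}) - F(A) \;\ge\; F(B \cup \{m\}) - F(B)
\qquad \text{for all } A \subseteq B \subseteq M,\; m \in M \setminus B .
```

The left-hand side is the marginal gain of action $m$ when fewer actions have been executed, and the right-hand side is its marginal gain after more actions have been executed. Read this way, the gain attributed to any remaining action cannot increase as the set of executed actions grows, which is the behavior reported for Fig. 11.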
The effect of the number of samples K for the Monte Carlo approximation was observed. Fig. 13 shows the relation between K and the standard deviation of the estimated IG m for the 15th object for each action after obtaining a visual image. This figure shows that estimation error gradually decreases when K increases. Roughly speaking, K ≥ 1000 seems to be required for an appropriate estimate of IG m in our experimental setting. Evaluation of IG m required less than 1 second, which is far shorter than the time required for action execution by a robot. This means that our method can be used in a real-time manner. These empirical results show that the proposed method for active perception allowed a robot to select appropriate actions sequentially to recognize an object in the real-world environment and in a real-time manner. It was shown that the theoretical results were supported, even in the real-world environment.
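The dependence of the estimation error on K can be illustrated with a toy model. The sketch below does not implement the MHDP-specific estimator; it assumes a single discrete observation x generated from a category z, estimates the mutual information between them by sampling, and measures the spread of the estimate over repeated runs. The prior, the likelihood table, and all function names are hypothetical.

```python
import math
import random
import statistics


def marginal_of(prior, likelihood):
    """P(x) = sum_z P(z) P(x | z) for a small discrete model."""
    num_x = len(likelihood[0])
    return [sum(p * row[x] for p, row in zip(prior, likelihood)) for x in range(num_x)]


def exact_ig(prior, likelihood):
    """Exact mutual information I(z; x) in nats."""
    marginal = marginal_of(prior, likelihood)
    return sum(
        p * row[x] * math.log(row[x] / marginal[x])
        for p, row in zip(prior, likelihood)
        for x in range(len(marginal))
        if row[x] > 0.0
    )


def estimate_ig(prior, likelihood, num_samples, rng):
    """Monte Carlo estimate: average log-ratio over K sampled pairs (z, x)."""
    marginal = marginal_of(prior, likelihood)
    total = 0.0
    for _ in range(num_samples):
        z = rng.choices(range(len(prior)), weights=prior)[0]
        x = rng.choices(range(len(marginal)), weights=likelihood[z])[0]
        total += math.log(likelihood[z][x] / marginal[x])
    return total / num_samples


def estimator_std(prior, likelihood, num_samples, repeats, seed=0):
    """Standard deviation of the estimate over independent repetitions."""
    rng = random.Random(seed)
    estimates = [estimate_ig(prior, likelihood, num_samples, rng) for _ in range(repeats)]
    return statistics.stdev(estimates)


# Hypothetical model: three categories, three codewords.
PRIOR = [0.5, 0.3, 0.2]
LIKELIHOOD = [[0.7, 0.2, 0.1], [0.1, 0.7, 0.2], [0.2, 0.2, 0.6]]

if __name__ == "__main__":
    print("exact:", exact_ig(PRIOR, LIKELIHOOD))
    for k in (20, 200, 2000):
        print(k, estimator_std(PRIOR, LIKELIHOOD, k, repeats=30))
```

For independent samples the standard deviation of such an average shrinks roughly in proportion to 1/√K, which is consistent with the gradual decrease described above.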

Experiment 2: Synthetic Data
In experiment 1, the numbers of classes, actions, and modalities as well as the size of the dataset were limited. In addition, it was difficult to control the experimental settings so as to check some interesting theoretical properties of our proposed method. Therefore, we performed a supplemental experiment, Experiment 2, using synthetic data comprising 21 object types, 63 objects, and 20 actions, i.e., modalities.
First, we checked the validity of our active perception method when the number of types of actions increases. Second, we checked how the method worked when two classes were assigned to the same object. Although the MHDP can categorize an object into two or more categories in a probabilistic manner, each object was classified into a single category in the previous experiment.

Conditions
A synthetic dataset was generated using the generative model that the MHDP assumes (see Fig. 3). We prepared 21 virtual object classes, and three objects were generated from each object class, i.e., we obtained 63 objects in total. Among the object classes, 14 object classes are "pure," and seven object classes are "mixed." For each pure object class, a multinomial distribution was drawn from the Dirichlet distribution corresponding to each modality. We set the number of modalities M = 20. The hyperparameters of the Dirichlet distributions of the modalities were set to α m 0 = 0.4(m−1) for m > 1. For m = 1, we set α 1 0 = 10. For each mixed object class, a multinomial distribution for each modality was prepared by mixing the distributions of the two pure object classes. Specifically, the multinomial distribution for the i-th mixed object was obtained by averaging those of the (2i − 1)-th and the 2i-th object classes. The observations for each modality of each object were drawn from the multinomial distributions corresponding to the object's class. The count of the BoFs for each modality was set to 20. Finally, 42 pure virtual objects and 21 mixed virtual objects were generated.
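The generation procedure can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the BoF dimensionality `dim` and the single Dirichlet concentration `alpha` are hypothetical placeholders (the experiment uses a modality-specific hyperparameter), and the Dirichlet draw is implemented with normalized gamma variates from the standard library.

```python
import random


def sample_dirichlet(alpha, dim, rng):
    """Draw from a symmetric Dirichlet distribution via normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_counts(probs, n, rng):
    """Draw n features from a multinomial distribution and return the count vector."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts


def generate_dataset(num_pure=14, objects_per_class=3, num_modalities=20,
                     dim=10, alpha=0.5, features_per_modality=20, seed=0):
    """Return (class distributions, objects) for pure and mixed classes."""
    rng = random.Random(seed)
    pure = [[sample_dirichlet(alpha, dim, rng) for _ in range(num_modalities)]
            for _ in range(num_pure)]
    mixed = []
    for i in range(num_pure // 2):
        # Zero-based indices 2i and 2i + 1 are the (2i - 1)-th and 2i-th classes in one-based terms.
        first, second = pure[2 * i], pure[2 * i + 1]
        mixed.append([[(a + b) / 2.0 for a, b in zip(first[m], second[m])]
                      for m in range(num_modalities)])
    classes = pure + mixed
    objects = []
    for class_index, distributions in enumerate(classes):
        for _ in range(objects_per_class):
            observation = [sample_counts(d, features_per_modality, rng) for d in distributions]
            objects.append((class_index, observation))
    return classes, objects


if __name__ == "__main__":
    classes, objects = generate_dataset()
    print(len(classes), len(objects))
```

With the default arguments the sketch yields 21 classes and 63 objects, each with 20 count vectors that sum to 20, matching the sizes stated above.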
The experiment was performed almost in the same way as experiment 1. First, multimodal categorization was performed for the 63 virtual objects, and 14 categories were successfully formed in an unsupervised manner. The posterior distributions over the object categories are shown in Fig. 14. Generally speaking, mixed objects were categorized into two or more classes. After categorization, a virtual robot was asked to recognize all of the target objects using the proposed active perception method.

Figure 15: KL divergence from the final state at each step for each sequential action selection procedure (greedy, lazy greedy, and random).

Results
We compared the greedy, lazy greedy, and random algorithms for the active perception sequential decision making process. The random algorithm is a baseline method that determines the next action randomly from the remaining actions that have not been taken. In other words, the random algorithm is the case in which a robot does not employ any active perception algorithms.
The KL divergence from the final state for all target objects is averaged at each step and shown in Fig. 15. For each condition, the KL divergence gradually decreased and reached almost zero. However, the rate of decrease was different. The greedy and lazy greedy algorithms were clearly shown to be better solutions on average than the random algorithm. In contrast with experiment 1, the best and worst cases could not practically be calculated because of the prohibitive computational cost. Interestingly, the lazy greedy algorithm has almost the same performance as the greedy algorithm, as the theory suggests, although the laziness reduced the computational cost in reality.
The number of times the robot evaluated IG m to determine the action sequences for all executable counts of actions L = 1, 2, . . . , M is summarized for each method. The lazy greedy algorithm required 71.7 evaluations (SD = 5.2) on average for each target object, whereas the greedy algorithm required 190. Theoretically, both the greedy and lazy greedy algorithms require O(M^2) evaluations in the worst case. In practice, however, the number of re-evaluations needed by the lazy greedy algorithm is quite small. In contrast, the brute-force algorithm requires O(2^M) evaluations, i.e., far more evaluations of IG.
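The figure of 190 can be checked with simple arithmetic. The sketch below assumes that 19 candidate actions remain when selection starts, i.e., that one of the M = 20 modalities is treated as already observed; this is an inference made here because it reproduces the reported count, not a detail stated in the text.

```python
def greedy_evaluations(num_candidates):
    """IG evaluations when greedy ranks all remaining candidates at every step."""
    return sum(range(1, num_candidates + 1))


def brute_force_alternatives(num_candidates):
    """Number of candidate action subsets, including the empty set."""
    return 2 ** num_candidates


if __name__ == "__main__":
    num_candidates = 19  # assumption: M = 20 with one modality already observed
    print(greedy_evaluations(num_candidates))        # 190
    print(brute_force_alternatives(num_candidates))  # 524288
```

Under the same assumption, the lazy greedy average of 71.7 corresponds to a little over a third of the greedy count.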
Next, a case in which two classes were assigned to the same object was investigated. The target dataset contained "mixed" objects. The results also imply that our method works well even when two classes are assigned to the same object. This is because our theory is completely derived on the basis of the probabilistic generative model, i.e., the MHDP. We show a typical result. Fig. 16 shows the time series of the posterior probability of the category for object 51, i.e., one of the mixed objects, during sequential active perception. This shows that the greedy and lazy greedy algorithms quickly categorized the target object into two categories "correctly." Our formulation assumes the categorization result to be a posterior distribution. Therefore, this type of probabilistic case can be treated naturally.

Conclusion
In this paper, we described an MHDP-based active perception method for robotic multimodal object category recognition. We formulated a new active perception method on the basis of the MHDP Nakamura et al. (2011a).
First, we proposed an action selection method based on the IG criterion and proved that IG is an optimal criterion for active perception from the viewpoint of reducing the expected KL divergence between the final and current recognition states. Second, we derived a Monte Carlo approximation method for evaluating IG efficiently and made the action selection method executable. Third, we proved that the IG has a submodular property and reduced the sequential active perception problem to a submodular maximization problem. Given the theoretical results, we proposed to use the lazy greedy algorithm for selecting a set of actions for active perception. It is important to note that all three theoretical contributions mentioned above were naturally derived from the characteristics of the MHDP. These contributions are clearly a result of the theoretical soundness of the MHDP. In this sense, our theorems reveal a new advantage of the MHDP that several other heuristic multimodal object categorization methods do not have.
To evaluate the proposed methods empirically, we conducted experiments using an upper-torso humanoid robot and a synthetic dataset. Our results showed that the method enables the robot to actively select actions and recognize target objects quickly and accurately.
One of the most interesting points of this paper is that not only object categories but also an action selection policy for object recognition can be formed in an unsupervised manner. From the viewpoint of cognitive developmental robotics, providing an unsupervised learning model for bridging the development between perceptual and action systems is meaningful for shedding a new light on the computational understanding of cognitive development Cangelosi and Schlesinger (2015); Asada et al. (2009). It is believed that the coupling of action and perception is important for an embodied cognitive system Pfeifer and Scheier (2001).
The advantage of this paper compared with the related works is that our action selection method for multimodal category recognition has a clear theoretical basis and is tightly connected to the computational model for multimodal object categorization, i.e., MHDP. This fact gives our active perception method a theoretical guarantee of its performance.
Our directions for future research are as follows. In addition to active perception, active "learning" for multimodal categorization is also an important research topic. It takes a longer time for a robot to gather multimodal information to form multimodal object categories from a massive number of daily objects than it does to recognize a new object. If a robot can notice that "the object is obviously a sample of a learned category," the robot need not obtain knowledge about object categories from such an object. In contrast, if a target object appears to be completely new to the robot, the robot should carefully interact with the object to obtain multimodal information from the object. Such a scenario will be achieved by developing an active "learning" method for multimodal categorization. It is likely that such a method can be obtained by extending our proposed active perception method.
In addition, the MHDP model treated in this paper assumed that an action for perception is related to only one modality, e.g., grasping only corresponds to m h . However, in reality, when we interact with an object with a specific action, e.g., grasping, shaking, or hitting, we obtain rich information related to various modalities. For example, when we shake a box to obtain auditory information, we also unwittingly obtain haptic information and information about its weight. The tight linkage between the modality information and an action is a type of approximation taken in this research. An extension of our model and the MHDP to a model that can treat actions that are related to various modalities is also a task for our future work.

Appendix A. Proof of the Optimality of the Proposed Active Perception Strategy
In this appendix, we show that the proposed active perception strategy, which maximizes the expected KL divergence between the current state and the posterior distribution of z j after a selected set of actions, minimizes the expected KL divergence between the next and final states.
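The chain of equalities behind this claim can be sketched as follows. This is a reconstruction from the definitions in the statement, written with sums because the BoF observations are discrete; it is not a verbatim copy of the authors' displays, and equation numbers are omitted.

```latex
\begin{align*}
&\operatorname*{argmin}_{A}\;
  \mathbb{E}_{X^{M \setminus m_{o_j}}_j \mid X^{m_{o_j}}_j}
  \Big[ \mathrm{KL}\big( P(z_j \mid X^{M}_j),\; P(z_j \mid X^{m_{o_j} \cup A}_j) \big) \Big] \\
&\quad= \operatorname*{argmin}_{A}
  \sum_{z_j,\, X^{M \setminus m_{o_j}}_j}
  P(z_j, X^{M \setminus m_{o_j}}_j \mid X^{m_{o_j}}_j)\,
  \log \frac{P(z_j \mid X^{M}_j)}{P(z_j \mid X^{m_{o_j} \cup A}_j)} \\
&\quad= \operatorname*{argmax}_{A}
  \sum_{z_j,\, X^{M \setminus m_{o_j}}_j}
  P(z_j, X^{M \setminus m_{o_j}}_j \mid X^{m_{o_j}}_j)\,
  \log P(z_j \mid X^{m_{o_j} \cup A}_j) \\
&\quad= \operatorname*{argmax}_{A}
  \sum_{z_j,\, X^{A}_j}
  P(z_j, X^{A}_j \mid X^{m_{o_j}}_j)\,
  \log P(z_j \mid X^{m_{o_j} \cup A}_j) \\
&\quad= \operatorname*{argmax}_{A}
  \sum_{z_j,\, X^{A}_j}
  P(z_j, X^{A}_j \mid X^{m_{o_j}}_j)\,
  \log \frac{P(z_j \mid X^{m_{o_j} \cup A}_j)}{P(z_j \mid X^{m_{o_j}}_j)} \\
&\quad= \operatorname*{argmax}_{A}\;
  \mathbb{E}_{X^{A}_j \mid X^{m_{o_j}}_j}
  \Big[ \mathrm{KL}\big( P(z_j \mid X^{m_{o_j} \cup A}_j),\; P(z_j \mid X^{m_{o_j}}_j) \big) \Big].
\end{align*}
```

The second line uses $P(X^{M \setminus m_{o_j}}_j \mid X^{m_{o_j}}_j)\, P(z_j \mid X^{M}_j) = P(z_j, X^{M \setminus m_{o_j}}_j \mid X^{m_{o_j}}_j)$, and the fifth line adds the term $-\log P(z_j \mid X^{m_{o_j}}_j)$, whose expectation does not depend on $A$.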
The numerator inside of the log function does not depend on A. Therefore, the term related to the numerator can be deleted. In addition, by negating the remaining term, we obtain
By marginalizing $X^{M \setminus (m_{o_j} \cup A)}_j$ from (14), we obtain