Hierarchical Spatial Concept Formation Based on Multimodal Information for Human Support Robots

In this paper, we propose a hierarchical spatial concept formation method based on the Bayesian generative model with multimodal information e.g., vision, position and word information. Since humans have the ability to select an appropriate level of abstraction according to the situation and describe their position linguistically, e.g., “I am in my home” and “I am in front of the table,” a hierarchical structure of spatial concepts is necessary in order for human support robots to communicate smoothly with users. The proposed method enables a robot to form hierarchical spatial concepts by categorizing multimodal information using hierarchical multimodal latent Dirichlet allocation (hMLDA). Object recognition results using convolutional neural network (CNN), hierarchical k-means clustering result of self-position estimated by Monte Carlo localization (MCL), and a set of location names are used, respectively, as features in vision, position, and word information. Experiments in forming hierarchical spatial concepts and evaluating how the proposed method can predict unobserved location names and position categories are performed using a robot in the real world. Results verify that, relative to comparable baseline methods, the proposed method enables a robot to predict location names and position categories closer to predictions made by humans. As an application example of the proposed method in a home environment, a demonstration in which a human support robot moves to an instructed place based on human speech instructions is achieved based on the formed hierarchical spatial concept.


INTRODUCTION
Space categorization is an important function for human support robots. It is believed that humans predict unknown information flexibly by forming categories of space through their multimodal experiences. We define categories of spaces formed by self-organization from experience as spatial concepts. Furthermore, prediction based on the connection between concepts and words is thought to lead to a semantic understanding of words. It means that spatial concept formation is an important function of human intelligence, and having this ability is important for human support robots.
Spatial concepts form a hierarchical structure. The use of this hierarchical structure enables humans to predict unknown information using concepts in an appropriate layer. For example, humans can linguistically represent their own positions at an appropriate level of abstraction according to the context of communication, such as "I'm in my home" at the global level, "I'm in the living room" at the intermediate level, and "I'm in front of the TV" at the local level. In this case, the living room has the home in the higher layer and front of the TV in the lower layer. By learning such a hierarchical structure, even if the unknown place does not have features such as front of the TV, its characteristics can be predicted if it has features of the living room. It is expected that the robot acquires spatial concepts in a higher layer by learning the commonality of features in spatial concepts at the lower layer.
Furthermore, the hierarchical structure of spatial concepts plays an important role when a robot moves based on linguistic instructions from a user. As shown in Figure 1, even if multiple tables are present in a room, robots can recognize them individually by using a spatial concept at a higher layer, such as "the front of the table in the living space." Indeed, in RoboCup@Home, an international competition in which intelligent robots coexist with humans in home environments, location names are defined as two layers in the tasks of a General Purpose Service Robot 1 as shown in Table 1. This table indicates that having sense of space relations is important for a robot coexisting with humans, e.g., that the living space has a center table. By having such hierarchical spatial concepts, it becomes possible to describe and move within a space based on linguistic communication with a user.
We assume that a computational model, which considers the hierarchical structure of spatial concepts, enables robots to acquire not only the spatial concepts, but also the hierarchical structure hiding among the spatial concepts through a bottomup approach and form spatial concepts similar to those perceived by humans. The goal of this study was to develop a robot that can predict unobserved location names and positions from observed information using formed hierarchical spatial concepts. The main contributions of this paper are as follows.
• We propose a method of forming hierarchical spatial concepts using a Bayesian generative model based on multimodal information, e.g., vision, position, and word information. • We show that spatial concepts formed by the proposed method enable a robot to predict location names and positions similar to prediction made by humans. • We demonstrate application examples in which a robot moves within and describes a space based on linguistic communication with a user through the hierarchical spatial concept formed by the proposed method.
The rest of this paper is structured as follows. Section 2 describes related works. Section 3 presents an overview and the computational model of hierarchical spatial concept formation. Section 4 presents experimental results evaluating the effectiveness of the proposed method in space categorization. Section 5 describes application examples of using hierarchical spatial concepts in a home environment. Finally, section 6 presents conclusions.

RELATED WORKS
In order for a robot to move within a space, a metric map consisting of occupancy grids that encode whether or not an area is navigable is generally used. The simultaneous localization and mapping (SLAM) (Durrant-Whyte and Bailey, 2006) is a famous localization method for mobile robots. However, the tasks that are coordinated with a user cannot be performed using only a metric map, since semantic information is required for interaction with a user. Nielsen et al. (2004) proposed a method of expanding a metric map into a semantic map by attaching a single-frame snapshot in order to share spatial information between a user and a robot. As a bridge between a metric map and human-robot interaction, research on semantic maps that provide semantic attributes (such as object recognition results) to metric maps has been performed (Pronobis et al., 2006;Ranganathan and Dellaert, 2007). Studies have also been reported on giving semantic object annotations to 3D point cloud data (Rusu et al., 2008(Rusu et al., , 2009. Moreover, in terms of studies based on multiple cues, Espinace et al. (2013) proposed a method of characterizing places according to low-level visual features associated to objects. Although these approaches could categorize spaces based on semantic information, they did not deal with linguistic information about the names that represent spaces.
In the field of navigation tasks with human-robot interaction, methods of classifying corridors and rooms using a predefined ontology based on shape and image features have been proposed (Zender et al., 2008;Pronobis and Jensfelt, 2012). In studies on semantic space categorization, Kostavelis and Gasteratos (2013) proposed a method of generating a 3D metric map that is semantically categorized by recognizing a place using bag of features and support vector machines. Granda et al. (2010) performed spatial labeling and region segmentation by applying a Gaussian model to the SLAM module of a robot operating system (ROS). Mozos and Burgard (2006) proposed a method of classifying metric maps into semantic classes by using adaboost as a supervised learning method. Galindo et al. (2008) utilized semantic maps and predefined hierarchical spatial information for robot task planning. Although these approaches were able to ground several predefined names to spaces, the learning of location names through human-robot communication in a bottom-up manner has not been achieved.
Many studies have been conducted on spatial concept formation based on multimodal information observed in individual environments (Hagiwara et al., 2016;Heath et al., 2016;Rangel et al., 2017). Spatial concepts are formed in a bottom-up manner based on multimodal observed information, and allow predictions of different modalities. This makes it possible to estimate the linguistic information representing a space from position and image information in a probabilistic way. Gu et al. (2016) proposed a method of learning relative space categories from ambiguous instructions. Taniguchi et al. (2014Taniguchi et al. ( , 2016 proposed computational models for a mobile robot to acquire spatial concepts based on information from recognized speech and estimated self-location. Here, the spatial concept was defined as the distributions of names and positions at each place. The method enables a robot to predict a positional distribution from recognized human speech through formed spatial concepts. Ishibushi et al. (2015) proposed a method of learning the spatial regions at each place by stochastically integrating image recognition results and estimated self-positions. In these studies, it was possible to form a spatial concept conforming to human perception such as an entrance and a corridor by inferring the parameters of the model. However, these studies did not focus on the hierarchical structure of spatial concepts. In particular, the features of the higher layer, such as the living space, are included in the features of the lower layer, such as the front of the television, and it was difficult to form the spatial concept in the abstract layer. Furthermore, the ability to understand and describe a place linguistically in different layers is an important function in robots that provide services through linguistic communication with humans. Despite the importance of the hierarchical structure of spatial concepts, a method that enables such concept formation has not been proposed in previous studies. We propose a method that forms a hierarchical spatial concept in a bottom-up manner from multimodal information and demonstrate the effectiveness of the formed spatial concepts in predicting location names and positions.

Overview
An overview of the proposed method of forming hierarchical spatial concepts is shown in Figure 2. First, a robot was controlled manually in an environment based on a map generated by simultaneous localization and mapping  (Krizhevsky et al., 2012). Position information is acquired as coordinate values in the map estimated by Monte Carlo localization (MCL) (Dellaert et al., 1999). Word information is acquired as set of words by word recognition. Text input is used for word recognition in this study. Second, acquired vision, position, and word information is represented as histograms. The histograms are utilized as observations in each modality. Third, the formation of hierarchical spatial concepts is performed by using hierarchical multimodal latent Dirichlet allocation (hMLDA) (Ando et al., 2013) on the observations. The proposed method enables a robot to form hierarchical spatial concepts in a bottom-up manner based on observed multimodal information. Therefore, it is possible to adaptively learn location names and the hierarchical structure of a space, which depend on the environment.

Vision Information
Vision information was acquired as the object recognition results of a captured image by Caffe (Jia et al., 2014), which is a framework of CNN (Krizhevsky et al., 2012) provided by Berkeley Vision and Learning Center. The parameters of CNN were trained by using the dataset from the ImageNet Large Scale Visual Recognition Challenge 2013 2 , which comprises 1,000 object classes, e.g., television, cup, and desk. The output of Caffe is given as a probability p(a i ) at an object class a i ∈ {a 1 , a 2 , ..., a I } where I is the number of object classes and was set to 1,000. The probability p(a i ) was represented as a 1,000-dimensional histogram of vision information w (υ) = (w (υ) 1 , w (υ) 2 ,···,w (υ) 1,000 ) T by the following equation:

Position Information
The position information (x, y) in the map generated by SLAM was estimated by MCL (Dellaert et al., 1999). It is assumed that the observed information is generated from a multinomial distribution in hMLDA. For this reason, the observed information with a continuous value is generally converted into a finite dimensional histogram by vector quantization. Ando et al. (2013) replaced the observed information with typical patterns by k-means clustering to form a finite dimensional histogram. The proposed method converts a position information (x, y) into a finite dimensional histogram of position information w p by hierarchical k-means clustering. The positional information (x, y) was classified hierarchically into 2, 4, 8, 16, 32, and 64 clusters with six layers by applying k-means clustering with k = 2 six times. If a position (x, y) was classified into a cluster c i ∈ {0, 1} at the ith layer, a path for the position information was described as C = {c 1 , c 2 , c 3 , c 4 , c 5 , c 6 }. The path C has the structure of a binary tree with six layers. The number of nodes at the 6th layer is 2 6 = 64. The position information (x, y)  Similarly, w (p) corresponding to nodes at the 6th layer below it are incremented in each layer.

Word Information
The voice information uttered by a user is converted manually into text data, which is then used as word information. In section 5, rospeex (Sugiura and Zettsu, 2015) is used to convert human speech into text data. The location names are manually extracted from the text data. The word information is described as a set of location names, which is a Bag of Words (Harris, 1954) with a location name as a word. The user could give not only one name but also several names to a robot at a given position. The given word information was represented as a histogram of word information w (w) = (w (w) 1 , w (w) 2 ,···,w (w) J ) T . J is the dimension of w (w) , and depends on the number of location names in a dictionary S = {s 1 , s 2 ,···,s J }, which was obtained through the training phase. w (w) j was incremented when a location name s j was taught from user. J is the number of location names. Histograms of vision, position, and word information were used as observations in hMLDA.

Hierarchical Categorization by hMLDA
The hierarchical structure of spatial concepts is supported by nested Chinese restaurant process (nCRP) (Blei et al., 2010) in hMLDA (Ando et al., 2013). nCRP is an extended model of the Chinese restaurant process (CRP) (Aldous, 1985), which is a Dirichlet process used to generate multinomial distribution with infinite dimensions. nCRP stochastically calculates the hierarchical structure based on the idea that there are infinite Chinese restaurants with infinite number of tables. Figure 3 shows the overview of nCRP. A box and a circle represent a restaurant and a customer, respectively. The customer  stochastically decides the restaurant to visit. In the proposed method, a box and a circle mean a spatial concept and data, respectively. Data is stochastically allocated to a spatial concept in each layer by the nCRP. In hMLDA, each spatial concept has a probability distribution with parameter β l,i to generate data. The proposed method forms a hierarchical spatial concept by hierarchical probabilistic categorization using nCRP. In the nonhierarchical approach, a place called "meeting space" and its partial places called "front of the table" and "front of the TV" are formed in the same layer. Therefore, the meeting space is learned as a place different from places called "front of the table" and "front of the TV." The proposed method enables the robot to learn the meeting space as a upper concept encompassing places called "front of the table" and "front of the TV" as lower concepts.
The graphical model of hMLDA in the proposed method and the definition of the variables are shown in Figure 4 and Table 2, respectively. In Figure 4, c is a tree-structured path generated by nCRP with a parameter γ and z is a category index for a spatial concept that is generated by a stick-breaking process (Pitman, 2002) with parameters α and π. w υ , w p , w w are acquired vision, position, and word information generated by multinomial distributions with a parameter β m at a modality m (m ∈ υ, p, w). β m was determined according to a Dirichlet prior distribution with a parameter η m . D and L written on plates are the number of acquired data and the number of categories, respectively.
The generation process of the model is described as follows: where: • The parameter β m k of a multinomial distribution is generated by a Dirichlet prior distribution with a parameter η m in a table k(k ∈ T), e.g., β 1,1 and β 2,1 in Figure 3.
• The path c d in a tree structure for the data d (d ∈ 1, 2, ..., D) is decided by nCRP with a parameter γ . c d is represented by the sequence of numbers assigned to each node in the path, e.g., {(1, 1), (2, 1), (3, 2)} at data 2 in Figure 3. • The parameter θ d of a multinomial distribution is generated by the stick-breaking process based on a GEM distribution which forms θ d from a Beta(απ, (1 − α)π) distribution with the parameters α(0 ≤ α ≤ 1) and π(π > 0) (Pitman, 2006). θ d represents the selection probability of a layer in a path c d and corresponds to the generation probability of a category index z in a path c d . • z m d,n , which is a category index at the nth feature of the observed information w m d , is generated by a multinomial distribution with a parameter θ d . • w m d is the observed information generated by a multinomial distribution with a parameter β from a category z m d,n at a path c d .
In this study, z is equivalent to a spatial concept expressed by the location name such as "the living room" or "front of the table." Model parameter learning was performed by a Gibbs sampler. Parameters were calculated by alternately sampling a path c d for each datum and a category z m d,n assigned to the nth feature value of a modality m of the data d in the path. Category z m d,n was sampled according to the following formula.
where −(d, n) means excluding the nth feature value of the data d. p(z m d,n |z m d,−n , α, π) is a multinomial distribution generated by the stick-breaking process. The probability, that k is assigned to a category of the n-th feature of modality m of the d-th data, was calculated by the following formula.
where #[·] is a number that satisfies a given condition and V k and V j are values that determine the rate of folding a branch in categories k and j by the stick-breaking process, respectively. In Formula (7), p(w m d,n |z, c, w m −(d,n) , η m ) is the probability that a feature value is generated from a path c d and a category z m d,n . Since it is assumed that the parameters of the multinomial distribution that generates a feature value are generated from a Dirichlet prior distribution, the following formula is obtained.
This gives the number of times that a category z m d,n is assigned to a feature value w m d,n in a path c d . A path c d was sampled by the following formula.
where c −d is a set of paths excluding c from c d . Sampling based on Formulas (9) and (10) was repeated for each training datum d ∈ {d 1 , d 2 , · · · , d D }. In this process, paths and categories for all observed data converge toĉ andẑ.

Name Prediction and Position Category Prediction
If vision information w v t and position information w p t are observed at a time t, then the posterior probability of word information w w t can be calculated with estimated parametersĉ andẑ by the following formula.
The location namen can be predicted by the maximum value of the calculated posterior probability. If word information w w t is obtained at a time t, then a category z w t can be predicted by Formula (12) and selecting position p randomly from dataset D z w t , which is a set of position data categorized into z w t . D z w t was automatically generated by the robot itself as a part of the categorization process.

Purpose
We conducted experiments to verify whether the proposed method can form hierarchical spatial concepts, which enable a robot to predict location names and position categories close to predictions made by humans. In the experiment, (1) the influence of multimodal information, i.e., words, on the formation of a hierarchical spatial concept was evaluated by comparing the space categorization results of using the proposed method and those of hierarchical latent Dirichlet allocation (hLDA) (Blei et al., 2010), which is a hierarchical categorization method with single modality; (2) the similarity between the hierarchical spatial concepts formed by the proposed method and those made by humans was evaluated in terms of predicting location names and position categories.   Figure 5A shows an experimental environment which includes furniture, e.g., tables, chairs, and a book shelf, in order to collect training and test data. Figure 5B shows a mobile robot, which consists of a mobile base, a depth sensor, an image sensor, and a computer, used to generate a map and collect multimodal information in the test environment. The height of the camera attached to the robot was 117 cm in consideration of the typical eye level in the human body. This is equivalent to the average height of a 5-year-old boy in Japan. The Navigation Stack package 3 was used with ROS Hydro 4 for mapping, localization, and moving in the experiment. The robot was manually controlled to collect data from the environment. The orientation of the robot was controlled in as many different orientations as possible. Figure 6 shows a map generated in the environment by the robot using SLAM and examples of the collected data. Collected data consisted of image, position, and word information as shown in the samples of collected data at A, B, and C. In the experiment, 900 data points were used for training and 100 data points were used for testing from a total of 1,000 data points collected in the area surrounded by a dotted line in the map. The robot simultaneously acquired images and self-position data (x, y) at times of particle re-sampling for MCL. Words were given as location names by a user who was familiar with the experimental environment. The user gave one or more location names suitable for the place at a data point during the training. In example A, not only a name such as "front of the door" but also a name representing a space such as "entrance" and a name meaning a room such as "laboratory" were given as word information. Word information was partially supplied as training data. Five training data sets were prepared to evaluate robustness of the naming rate in training data as 1, 2, 5, 10, and 20%.

Experimental Conditions
The similarity between the spatial concepts formed by the proposed method and those made by humans was evaluated in experiments of location name prediction and position category prediction based on the ground truth. The ground truth information was given for 100 test data points according to the agreement of three experts who were familiar with the environment. The hierarchy of the space in the experimental environment was defined as global, intermediate, and local. Location names assigned to each hierarchy are shown in Table 3. As the ground truth for name prediction, three location names were uniformly given to each test datum considering the hierarchy to evaluate the accuracy of name prediction. As the ground truth for the position category prediction, regions corresponding to the 15 location names in Table 3 were decided on the map. Figure 7 shows the three regions of the "laboratory, " "entrance, " and "front of the table." The environment was divided into a grid of 50 units in length and 25 units in width, and the gray grids show the ground truth.
In the name prediction experiment, the accuracy of name prediction compared with the ground truth was calculated as an index of similarity. Formula (11) was used to predict names using the proposed method. The accuracy of name prediction at global, intermediate, and local levels was calculated by the following formula.
where M l is a number matching the predicted names with the ground truth at layer l in the test dataset and D is the number of test data. In the experiment, l was set as (l ∈ {global, intermediate, local}) and D was 100.
In the position category prediction experiment, the precision, recall, and F-measure of the predicted position categories compared with the ground truth were calculated as an index of similarity. In the proposed method, a position (x, y) sampled multiple times for each location name by Formula (12). The precision, recall, and F-measure of position category prediction were calculated by the following formulas.
where T n is a number matching the position with the ground truth for location name n, F n is a number that does not match the position with the ground truth, and G n is the number of grids for the ground truth. In the experiment, n was set as (n ∈ {1, 2, · · · , 15}).

Baseline Methods
The most frequent class, nearest neighbor method, multimodal hierarchical Dirichlet process (HDP), and spatial concept formation model were used as baseline methods for evaluating the performance of the proposed method in the name prediction and position category prediction experiments. In the latter, the sampling of position for each location name was performed 100 times.

Most Frequent Class
The training dataset D = {d 1 , d 2 , · · ·, d I } is used in this method. The datum d i consists of the position information p i = (x i , y i ) and the word information w i , which is a set of location names. The frequency cnt n j of each location name n j (j ∈ {1, 2, · · ·, 15}) is counted in the training dataset D. The location name n j is classified into three clusters by k-means (k = 3) based on cnt n j . The three clusters of location names are C global , C intermediate , and C local in descending order of the frequency of the location name based on the assumption that global location names are more frequent than local location names. In the training dataset D, if a datum d i includes a location name in C global , C intermediate , and C local , the datum d i is set as a global dataset D g , an intermediate dataset D i , and a local dataset D l . The location names in the global, intermediate, and local levels are predicted as the most frequent location name in each dataset D g , D i , and D l , respectively.
In the position category prediction, the positions are predicted by sampling the position informationp randomly from the datasets D g,f , D i,f , and D l,f , which have the most frequent location names in each dataset D g , D i , and D l , respectively. The sampling of position information for each location name was performed 100 times.

Nearest Neighbor (Position and Word)
The nearest neighbor method (Friedman et al., 1977) discriminates data based on Euclidean distance. A datum d i involves position information p i = (x i , y i ) and word information w i . w i consists of a set of location names that obtained at a position p i in the training. For example, w i at data point B in Figure 6 contains the following location names: "Meeting space, " "Book shelf zone, " and "Around the electric piano." If position information p t is observed, then word informationŵ t is calculated with the training dataset D = {(p 1 , w 1 ), (p 2 , w 2 ),···,(p I , w I )} by the following formulas.  The location namen can be predicted by randomly selecting a location name from location names inŵ t of the nearest data point. If word information w t is observed, then position information p t is randomly selected from dataset D n t , which is a set of data d i = (p i , w i ) satisfying the formula w i ∈ w t . The sampling of position information for each location name was performed 100 times.

Nearest Neighbor (Vision, Position and Word)
This method is used only in the name prediction experiment. A datum d i includes vision information v i , position information p i = (x i , y i ) and word information w i . υ i is a value calculated by Formula (1) at a position p i during training. w i consists of a set of location names that are obtained at a position p i during the training. If the vision information v t and the position information p t are observed, then the word informationŵ t can be calculated with the training dataset D = {(υ 1 , p 1 , w 1 ), (υ 2 , p 2 , w 2 ),···,(υ I , p I , w I )} by using the following formulas.
where α is the weight coefficient between vision and position information. α was set as 0.3 in the validation dataset empirically. The location namen can be predicted by randomly selecting a location name from the location names inŵ t of the nearest data point.

Multimodal HDP
Multimodal HDP (Nakamura et al., 2011) enables the multimodal handling of HDP (Teh et al., 2005), which is a method of categorizing observed data based on a Bayes generative model, in the topic distribution of latent Dirichlet allocation (LDA) as HDP. The graphical model and definition of variables in the multimodal HDP are shown in the Supplementary Material. Here, multimodal HDP was trained using vision, position, and word information. If vision information w v t and position information w p t are observed at a time t, then the posterior probability of word information w w t can be calculated by the following formula: The location namen can be predicted by the maximum value of the calculated posterior probability. If word information w w t is obtained at a time t, then a category z w t can be predicted by Formula (22) and selecting position informationp randomly from dataset D z w t , which is a set of position data categorized into z w t .

Spatial Concept Formation
Spatial concept formation (SpCoFo) 5 is a model that integrates name modalities into the spatial region learning model (Ishibushi et al., 2015). The model forms concepts from multimodal information and predicts unobserved information. The graphical model and definition of variables in the spatial concept formation model are shown in the Supplementary Material. The posterior probability of word information w n t after obtaining vision information w υ t and position information p t was calculated by the following formula: The location namen can be predicted by the maximum value of the calculated posterior probability. 5 Spatial Concept Formation: https://github.com/EmergentSystemLabStudent/ Spatial_Concept_Formation The prediction of positionp t after obtaining word information w n t was calculated by estimating a category z t and sampling position informationp using the following formulas.
The sampling of position information for each location name was performed 100 times. In the spatial concept formation, the hyper-parameters π, η, µ 0 , κ 0 , ψ 0 , and ν 0 were set as π = 50, η υ = 5.0 × 10 −1 , η w = 1.0 × 10 −1 , µ 0 = (x center , y center ),  Categorized training data at each category are shown by positions, images, and the best three examples from the word probability. The category corresponds to the formed spatial concept. Each category was classified into an appropriate layer in the hierarchy of spatial concepts. One, four, and 28 categories were classified into the 1st, 2nd, and 3rd layers, respectively. The number of categories in each layer was determined by the nCRP based on the model parameter γ , which controls the probability that the data is allocated to a new category. The 1st layer included only category 1, into which 900 data were allocated. The high-probability word of category 1 was "laboratory, " which referred to the entire experimental environment. Since category 1 contains all the location names, the probabilities for location names becomes relatively low. Nonetheless, the proposed method was able to learn "laboratory, " which was given only about 10% to the training dataset, with high probability compared to the second candidate. In the 2nd layer, 343 data in the vicinity of the entrance in the experimental environment were allocated into category 4. The location name of category 4 with the greatest probability was "entrance." The 389 data in the region deeper than the entrance in the experimental environment were categorized into category 5, in which "meeting space" had the greatest probability. In the 3rd layer, the data categorized into categories 4 and 5 in the second layer were further, more finely categorized. In categories 26 and 16, which were formed under category 4, "front of the door" and "front of the chair storage" had the greatest probabilities, respectively. 53 and 81 data were allocated into categories 26 and 16, respectively. Position and image data corresponding to "front of the door" and "front of the chair storage" were finely allocated. These results demonstrated that the proposed method can form not only categories in a lower layer such as "front of the chair storage" and "front of the door" but also categories at higher layers such as "entrance" and "laboratory, " and can form its inclusion relations as a hierarchical structure.

Evaluation of Categorization
To evaluate the effectiveness of multimodal information on hierarchical space categorization, we compared the categorization results of using the proposed method and those obtained using hLDA, which is a hierarchical categorization method with single modality, i.e., based only on word information. Although the number of layers in ground truth in this experiment is 3, robots can not know the number of hierarchies of the spatial concepts in advance. Therefore, in the proposed method and hLDA, categorization was performed with the number of layers changed from 2 to 5. The accuracy of space categorization was evaluated by calculating mutual information between the ground truth, which consisted of a location name given by humans, and the estimated name, which was the best item in the word probability at a category allocated by the proposed method or by hLDA. Mutual information I(E; G) between estimated name E and ground truth G in each layer i and j was calculated by the following formula: When the mutual information become high, the dependency of e i and g j can be regarded as high. By using mutual information, accuracy of categorization can be evaluated when the number of layers on ground truth and estimation result is different. Table 4 shows the mutual information for categorization results between hLDA with word information and the proposed method with vision, position, and word information in the training data set. The effectiveness of multimodal information in space categorization was clarified, since the proposed method had a high level of mutual information in all layers. In addition, mutual information was maximized when using the same hierarchical number as in the ground truth. In the subsequent evaluations, the number of layers of the proposed method is set to 3.

Evaluation of Name Prediction and Position Category Prediction
We conducted experiments to verify whether or not the proposed method could form hierarchical spatial concepts, which enable a robot to predict location names and position categories similar to predictions made by humans. In the experiment, (1) the influence of multimodal information on the formation of a hierarchical spatial concept was evaluated by comparing the space-categorization results obtained using the proposed method and using hLDA, which is a hierarchical categorization method with single modality; (2) the similarity between the hierarchical spatial concepts formed by the proposed method and those of humans was evaluated in predicting location names and position categories. The evaluation experiments were performed by cross verification with three data sets that consist of 900 training data and 100 test data with ground truth. The experimental results are indicated by the mean and standard deviation in the three data sets.
To verify whether or not the proposed method can form hierarchical spatial concepts, accuracy evaluation of name prediction and position category prediction through spatial concept use was performed. In the evaluation of name prediction, vision, position, and word information were given to the robot at the training data points. In the test data points, only vision and position information were given. Therefore, the robot has to predict the unobserved word information from the observed vision and position information. Table 5 shows the accuracy of name prediction using the baseline methods, the proposed method, and those made by humans. The most frequent class, nearest neighbor (position and word), nearest neighbor (vision, position, and word), multimodal HDP, and spatial concept formation model were used as the baseline methods. The accuracy of name prediction was calculated by Formula (13) at global, intermediate, and local layers in ground truth. The proposed method and humans predicted location names in three layers. The results of humans consisted of the average accuracy of three subjects familiar with the experimental environment.
Compared with the accuracy obtained using the baseline methods, higher accuracies were obtained by the proposed method in the 1st, 2nd, and 3rd layers. It was assumed that weak features buried in the lower layer in the baseline methods were categorized as features of the higher layer in the proposed method. The proposed method enabled a robot to predict location names close to predictions made by humans by selecting the appropriate layer depending on the situation. Table 6 shows the evaluation results of position category prediction using the baseline methods, the proposed method, and those made by humans. In the evaluation, the most frequent class, nearest neighbor (position and word), multimodal HDP, and spatial concept formation model were used as the baseline methods. The position category prediction was evaluated in terms of precision, recall, and F-measure, which were calculated by Formula (14).
Compared with results obtained by the baseline methods, higher values of precision and recall were obtained by the proposed method in the global and intermediate layers. In the local layer, higher values of precision and recall were obtained by Nearest neighbor and Spatial Concept Formation (SpCoFo), respectively. However, in the F-measure, which is a harmonic mean between precision and recall, the proposed method has the largest values in the global, intermediate, and local layers. The reason why the recall and F-measure values were lower than the precision is that only 100 data points were predicted and plotted for regions with 100 grids or more, as shown in Figure 7. In the result of F-measure, independent ttests were performed in nine samples consisting of three data sets with three types of ground truth: global, intermediate, and local. In the proposed method, the p-values of the Most frequent class, Nearest neighbor, multimodal HDP, and SpCoFo were 0.00012, 0.00004, 0.00003, and 0.00051, respectively, and significant differences were observed with (p < 0.05). As the reason why the result of humans were not perfect, some errors were found in the boundary of the place. For example, the boundary between "Book shelf zone" and "front of the table, " and the edge of the region called "front of the door" were different depending on the human. The centricity of the place is consistent, but the region includes ambiguity even among humans. The experimental results show that the proposed method enabled a robot to predict position categories closer to predictions made by humans than possible using the baseline methods.
In the experiments for location name and position category prediction, the proposed method showed higher performance than the baseline methods. In the baseline methods, i.e., multimodal HDP and SpCoFo, since the feature space is classified uniformly, the location concepts are formed non-hierarchically. For example, an upper concept, e.g., meeting space, is embedded in the lower concepts, e.g., front of the table and front of the display. Therefore, the place called "Meeting space" is learned as a place different from the places called "front of the table" and "front of the display." Since the proposed method forms concepts by extracting the similarity of knowledge in the upper concept, it is possible to form an upper concept without interfering with the formation of the lower concept. For this reason, the proposed method was able to show high performance in the experiments of name and position category prediction with global, intermediate, and local.
In human-robot interactions in home environments, location names as word information are given to only a part of the training data from a user. We evaluated the robustness of the proposed method in terms of the naming rate in order to verify how name and position category prediction performance changes with decreasing naming rate. In this experiment, the formation of spatial concepts using the proposed method was performed using the training data with the naming rate changed to 1, 2, 5, 10, and 20% successively. The naming rates of 1 or 20% mean that 9 or 180 of the 900 training data contained location names, while the remaining data did not contain any location name. Table 7 shows the accuracy of name prediction and the F-measure of position category prediction for each naming rate. In the results of name prediction and position category prediction, it was confirmed that learning progresses in the global layer earlier than in the intermediate and local layers. It was clarified that overall prediction ability did not decrease greatly owing to the decreased naming rate, but gradually decreased from the lower layer. In this experiment, we performed spatial concept formation without prior knowledge in only one environment, but it is possible to increase learning efficiency by giving parameters of models estimated in other environments as prior probabilities. The transfer learning of spatial concepts will be performed in the future.

APPLICATION EXAMPLES FOR HUMAN SUPPORT ROBOTS
Application examples of the hierarchical spatial concept using the proposed method are demonstrated in this section. We implemented the proposed method for the Toyota human support robot (HSR) 6 and created application examples in which the robot moves based on human linguistic instructions and describes its self-position linguistically in an experimental field assuming a home environment.
The home environment and the robot used are shown in Figure 9. There were two tables as shown in Figure 9A, A and B. In the environment, whether the robot could move based on linguistic instructions including the hierarchical structure of spaces such as "front of the table in the living room" and "front of the table in the dining room" was verified. In Figure 9B, an   RGB-D sensor and a laser range sensor were used to capture images and to estimate self-position, respectively. The packages 7 : hector_slam and omni_base for mapping, localization, and moving were used with ROS Indigo 8 to navigate the robot to the predicated position. The robot collected 715 training data consisting of images, positions, and word information and formed a hierarchical spatial concept using the proposed method. Location names were given to 20% of total training data. Rospeex (Sugiura and 7 hector_slam: http://wiki.ros.org/hector_slam 8 ROS Indigo: http://wiki.ros.org/indigo Zettsu, 2015) was used to recognize human speech instructions and convert them into text information. In the experiment, the dimensions of the information vectors w υ , w p , and w w were 1,000, 64, and 16, respectively.
The two places predicted by Formula (12) based on the speech instructions, i.e., "go to the front of the table in the living room" and "go to the front of the table in the dining room" are shown in Figures 10A,B, respectively. Predicted position categories indicated by red dots show that the "front of the table in the living room" and the "front of the table in the dining room" were recognized as different places using the space concept in the higher layer.

Figure 11
shows how the robot moved based on human speech instructions in the experiment. The robot recognized human speech instructions using rospeex and predicted position categories with the Formula (12) using a hierarchical spatial concept. It moved to the instructed place by sampling randomly from the predicted positions. Figure 12 shows an application example in which the robot described its selfposition linguistically. The robot observed its self-position and image and predicted the name of its self-position by calculating Formula 11 using the hierarchical spatial concept. As shown in the left side of Figure 12, the proposed method enabled the robot to describe its self-position linguistically with different layers. We demonstrated application examples using the formed hierarchical spatial concept in the service scene in a home environment. The movie of the demonstration and training dataset can be found at the URL 9 . 9 Multimedia -emlab page: https://emlab.jimdo.com/multimedia/

CONCLUSIONS
We assumed that a computational model that considers the hierarchical structure of space enables robots to predict the name and position of a space close to the corresponding prediction by humans. In our assumptions, we proposed a hierarchical spatial concept formation method based on a Bayesian generative model with multimodal information, i.e., vision, position, and word information, and developed a robot that can predict unobserved location names and position categories based on observed information using the formed hierarchical spatial concept. We conducted experiments to form a hierarchical spatial concept using a robot and evaluated its ability in name prediction and position category prediction.
The experimental results for name and position category prediction demonstrated that, relative to baseline methods, the proposed method enabled the robot to predict location names and position categories closer to predictions made by humans. Application examples using the hierarchical spatial concept in a home environment demonstrated that a robot could move to an instructed place based on human speech instructions and describe its self-position linguistically through the formed hierarchical spatial concept. The experimental results and application example demonstrated that the proposed method enabled the robot to form spatial concepts in abstract layers and use the concepts for human-robot communications in a home environment. This study showed that it the name and position of a location could be predicted, even in a home, using generalized spatial concepts. Furthermore, by conducting additional learning in each house, a spatial concept adapted to the environment can be formed.
The limitation of this study is as follows. In the feature extraction of the position information, hierarchical k-means method was utilized to convert the position information (x, y) into the position histogram. In the experiment, 389 and 511 data were allocated to two clusters at the top layer c 1 . In the bottom layer c 6 , the number and standard deviation of the data allocated to each of the 64 clusters were 14.1 and 12.2, respectively. There is some bias between the clusters. The hierarchical k-means makes it possible to convert the position information into the position histogram including hierarchical spatial features. However, nearby data points at a classification boundary, which are classified into different clusters on a high level, are regarded as very different. We are considering a method to reduce bias in space while maintaining hierarchical features of space. As for the number of location names, at section 4 and 5 in the experiments, the numbers of location names were 15 and 16, respectively. The number of location names increases with increase in the numbers of teachings and users. If the robot learns the location names from several users over a long term, an algorithm to remove location names with low probability of observation is needed in order to improve the learning efficiency.
As future work, we will generalize the spatial concepts for various environments and perform transition learning of spatial concepts with the generalized spatial concepts as prior knowledge.

AUTHOR CONTRIBUTIONS
YH designed the study, and wrote the initial draft of the manuscript. HK and MI contributed to analysis and interpretation of data, and assisted in the preparation of the manuscript. TT has contributed to data collection and interpretation, and critically reviewed the manuscript. All authors approved the final version of the manuscript, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.