Semantic Mapping Based on Spatial Concepts for Grounding Words Related to Places in Daily Environments

An autonomous robot performing tasks in a human environment needs to recognize semantic information about places. Semantic mapping is a task in which suitable semantic information is assigned to an environmental map so that a robot can communicate with people and appropriately perform tasks requested by its users. We propose a novel statistical semantic mapping method called SpCoMapping, which integrates probabilistic spatial concept acquisition based on multimodal sensor information and a Markov random field applied for learning the arbitrary shape of a place on a map. SpCoMapping can connect multiple words to a place in a semantic mapping process using user utterances without pre-setting the list of place names. We also develop a nonparametric Bayesian extension of SpCoMapping that can automatically estimate an adequate number of categories. In experiments in simulation environments, we showed that the proposed method generated better semantic maps than previous semantic mapping methods; our semantic maps have categories and shapes similar to the ground truth provided by the user. In addition, we showed that SpCoMapping could generate appropriate semantic maps in a real-world environment.


INTRODUCTION
An autonomous robot performing tasks in our daily environment needs to recognize semantic information regarding places. For example, when an autonomous vacuum cleaner robot tries to understand a command given by its user, e.g., "clean Joseph's room," the robot needs to be able to locate "Joseph's room" on its map of the environment in order to clean that place. In addition, the places estimated by the robot need to be regions that follow the shape of the environment. Semantic mapping is the task through which suitable semantic information is assigned to a robot's map so that it can communicate with people and appropriately perform tasks requested by its users (Kostavelis and Gasteratos, 2015).
Vocabulary used in daily human life depends on the environment a person is in, such as their home or office; a robot is unable to completely understand this, including words used to describe it, because the symbol system itself is a dynamic one (Taniguchi et al., 2016c). Many previous studies on semantic mapping (Kostavelis and Gasteratos, 2015; Goeddel and Olson, 2016; Sünderhauf et al., 2016; Himstedt and Maehle, 2017; Brucker et al., 2018; Posada et al., 2018; Rangel et al., 2019) have been conducted based on the assumption that a list of labels such as place names can be used as pre-existing knowledge; thus, they have been unable to estimate the place meant by a user when the robot is given a command including an unknown place name like "Joseph's room." To deal with various environments by adapting semantically to them, and to collaborate with people, a semantic mapping method that can deal with unknown words uttered by users is crucial for service robots used in daily life. Therefore, this study proposes a novel statistical semantic mapping method called spatial concept formation-based semantic mapping (SpCoMapping) to address these issues. An overview of SpCoMapping is shown in Figure 1.
Semantic mapping has been studied as a method to expand maps obtained using simultaneous localization and mapping (SLAM) into those including words. However, previous semantic mapping methods have three disadvantages.
FIGURE 1 | Overview of SpCoMapping. The robot moves around in the environment to obtain RGB data, words, and self-position data. It then learns spatial concepts by integrating multimodal information with a Markov random field and generates a semantic map.
1. The first is the overwrite problem, which is caused when the method overwrites the labels painted in previous cycles. For example, the image recognition results obtained when entering a room from the corridor and when leaving the room are different, even though the position is the same, because the visual images obtained by the robot are different in the two scenarios. Therefore, some methods overwrite the labels of the cells on the map that were generated on entering the room with new ones generated when the room is exited. However, this information should not merely be overwritten but should be stored statistically. Our proposed method solves this problem by modeling the semantic information of each cell on the map as a probabilistic variable.
2. Second, semantic maps generated by many previous methods are based solely on a single source of information, for example, depth or visuals. However, it is unlikely that people distinguish regions of a house semantically based on a single source of information. The regions and types of semantic categories on a semantic map formed in an environment should not be influenced by only one type of information; they should also respond to multimodal information such as visual and location data, user utterances, and even other modalities such as sounds and smells. Our proposed method solves this problem by using a multimodal categorization method as part of the probabilistic generative model.
3. Third, many previous methods needed a list of place names to be set. However, we cannot expect all place names and features in our daily environment to be stored in the training dataset. For example, we cannot expect a training dataset generated for a typical house to include information on a particular person's room, e.g., "Joseph's room." In contrast, our unsupervised learning method is based on a hierarchical Bayesian model that can acquire words related to places from sources such as user utterances. Therefore, it can obtain a vocabulary of words corresponding to a place along with their probability distributions.
A typical previous method is semantic mapping based on a convolutional neural network (CNN) (Sünderhauf et al., 2016). Sünderhauf et al. (2016) proposed a method of semantic mapping with a CNN that could convert RGB visual data into semantic labels. Image recognition results were used as semantic labels for mapping, and a robot painted the map generated by SLAM with the labels obtained by the CNN. This simple visual recognition-based approach also has the problems described above. Spatial concept formation methods have been developed to enable robots to acquire place-related words as well as estimate categories and regions (Ishibushi et al., 2015; Taniguchi et al., 2016b, 2017, 2018). These methods can estimate the number of categories using the Dirichlet process (Teh et al., 2005). Taniguchi et al. (2016b) proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA). However, although spatial concept formation methods can acquire unknown words and deal with multimodal information, including the image features typically extracted using CNNs, they cannot semantically segment a map appropriately because the position distributions corresponding to semantic categories are modeled by Gaussian distributions. These methods cannot model the various shapes of regions on semantic maps. In our method, we adopt a Markov random field (MRF) to deal with various shapes on the semantic maps.
SpCoMapping integrates probabilistic spatial concept acquisition (Ishibushi et al., 2015; Taniguchi et al., 2016b) and SLAM via an MRF to generate a map of semantic information. It solves the overwrite problem by assigning each cell of the semantic map a probabilistic variable. It also deals with multimodal information using a multimodal categorization method as part of the probabilistic generative model. In addition, it does not need place names to be set because we employ an unsupervised learning method that can acquire words related to places. Here, an unknown word implies that the word is not yet grounded in the map. In other words, the robot does not know the words related to a specific place on the map beforehand. Unknown word discovery from spoken sentences was performed in SpCoA (Taniguchi et al., 2016b); therefore, we obtained word information from sentences or words given in the experiments of this paper.
SpCoMapping has the following characteristics to solve the problems discussed above.
1. SpCoMapping can solve the overwrite problem.
2. Each region on the semantic map can have an arbitrary shape.
3. The semantic map is generated on the basis of multimodal information, i.e., word information obtained through human-robot interactions as well as visual information.
4. SpCoMapping can relate multiple words to one place, without pre-setting the list of place names, using the semantic mapping process.
5. SpCoMapping can estimate the number of semantic categories using the Dirichlet process (Teh et al., 2005) as the prior distribution for semantic categories.
SpCoMapping was tested in two experiments, in simulation and in the real-world environment. The remainder of this paper is organized as follows. Section 2 introduces existing semantic mapping and spatial concept acquisition methods. Section 3 describes our proposed method. Section 4 shows the results of the experiment conducted in a simulation environment. Section 5 shows the results of the experiment conducted by placing a robot in a daily human environment. Finally, section 6 concludes this paper.

Semantic Mapping
The task of semantic mapping includes map segmentation and place recognition. Map segmentation is a task that categorizes places by hypothesizing that regions can be found by looking at the layout of free space (Fermin-Leon et al., 2017; Mielle et al., 2018; Tian et al., 2018). Mielle et al. (2018) proposed a method for segmenting maps from different modalities, which can be used for robot-built maps and hand-drawn sketch maps. Place recognition is a challenging task for a robot; however, with CNNs and a large scene dataset, robots can understand places using image information (Guo et al., 2017; Xie et al., 2017; Xinhang et al., 2017; Wang et al., 2018b). Wang et al. (2018b) applied CNNs to omni-directional images for place recognition, and the result was used to allow robots to navigate.
Some studies have also addressed semantic mapping using two-dimensional (2D) maps such as topological maps (Garg et al., 2017; Liao et al., 2017; Pronobis and Rao, 2017; Luperto and Amigoni, 2018; Wang et al., 2018a) and occupancy grid maps (Goeddel and Olson, 2016; Sünderhauf et al., 2016; Himstedt and Maehle, 2017; Brucker et al., 2018; Posada et al., 2018; Rangel et al., 2019). Wang et al. (2018a) proposed a method that constructed a topological semantic map to guide object search. In addition, some studies attempted to provide methods that could correct topological semantic maps by mitigating the effects of noise or incorrect place recognition. Zheng et al. (2017, 2018) proposed a method that used graph-structured sum-product networks. They showed that this technique can generate a semantic map even when the place recognition results include incorrect nodes. However, managing tasks like cleaning a room, which needs a place region, is difficult when a robot uses topological maps. Sünderhauf et al. (2016) employed a CNN to recognize place categories using visual information (i.e., RGB data) and laser-range data to build maps on which place categorization results are shown. They used the Places205 dataset (Zhou et al., 2014) to train the CNN, so that it did not require environment-specific training. Unfortunately, this method could not deal with semantic information that was not included in the pre-existing training dataset.
In addition, three-dimensional (3D) semantic mapping in an indoor environment has been studied in terms of performing tasks such as grasping or detecting an object simultaneously (Antonello et al., 2018; Li et al., 2018; Sun et al., 2018). Antonello et al. (2018) proposed a method that constructed a 3D semantic map online using the result of semantic segmentation. In addition, some studies assume that place categories are generated by objects in that environment and build semantic maps with object features (Stückler et al., 2015; Sünderhauf et al., 2017). Sünderhauf et al. (2017) proposed a method that built environment maps that included object-level entities and geometrical representations. They employed a single-shot multi-box detector (Liu et al., 2016) to detect objects and 3D SLAM to generate environment maps.
In this paper, we propose a method that generates 2D semantic maps. This is because 2D semantic mapping is challenging and a 2D semantic map can be applied to the autonomous vacuum cleaner robot.

Spatial Concept Formation
Taniguchi et al. (2016a) proposed a method that estimated words related to places and performed self-localization by Monte Carlo localization (MCL) simultaneously. Ishibushi et al. (2015) proposed a self-localization method that integrated semantic information obtained from image recognition performed by a CNN, following an idea proposed by Taniguchi et al. (2016a). Taniguchi et al. proposed SpCoA and an extension (Taniguchi et al., 2016b) that integrated a generative model for self-localization and unsupervised word segmentation in uttered sentences via the latent variables related to the spatial concept. However, all spatial concept acquisition methods assumed that the position information of each spatial region followed a Gaussian distribution, i.e., that each semantic region had an ellipse-like shape. Therefore, with these methods, the regions estimated based on the Gaussian distribution sometimes exceeded the area of a room. In contrast, SpCoMapping allows for arbitrarily shaped regions by adopting an MRF that takes the shape of the environment into account.
FIGURE 2 | Flow diagram of SpCoMapping. The robot gets histograms of word features from bag-of-words information, histograms of image features from a CNN trained using the Places205 dataset (Zhou et al., 2014), and its position from the result of self-localization. We adopt word and image features and robot position for a multimodal categorization method and visualize indices of spatial concepts as a semantic map.
Frontiers in Robotics and AI | www.frontiersin.org
In addition, Taniguchi et al. (2017) proposed an online spatial concept acquisition and simultaneous localization and mapping (SpCoSLAM) method that integrates visual, position, and speech information and performs SLAM and lexical acquisition simultaneously. The complete learning process was performed online. Hagiwara et al. (2018) extended the spatial concepts to hierarchical categorizations. They showed that this method could acquire spatial concepts hierarchically using vision, position, and word information. Our proposed method can also be extended to online and hierarchical learning, as in these methods.

Overview
The flow diagram of the process for the learning and semantic mapping is shown in Figure 2. The robot can create a map of the environment in advance using SLAM. The robot first moves around in an environment by self-localization using MCL and obtains RGB data. As it moves around, the user can talk to it by uttering the names of each place. In addition, SpCoMapping employs a pre-trained CNN, similar to that in Sünderhauf et al. (2016), to obtain a probability distribution of the place labels to use as a feature vector of the proposed probabilistic generative model. The speech signals uttered by the user are recognized by a speech recognition system, and the results are provided to SpCoMapping as word information. We adopted bag-of-words (BoW) as word information because the count of the words uttered in each place represents a word feature. The robot next learns the spatial concepts by integrating multimodal data and generates a semantic map using the probabilistic generative model shown in Figure 3.
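As a minimal sketch of the bag-of-words word feature mentioned above, the following Python snippet counts the recognized words of one utterance against a fixed vocabulary. The helper name `bow_histogram` and the example vocabulary are illustrative assumptions and are not taken from the original implementation.

```python
from collections import Counter

def bow_histogram(words, vocabulary):
    """Return word counts ordered by `vocabulary` (hypothetical helper)."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

# Illustrative vocabulary and one recognized utterance at time t.
vocabulary = ["kitchen", "living", "room"]
s_t = bow_histogram(["living", "room"], vocabulary)
print(s_t)
```

Words outside the vocabulary are simply ignored in this sketch; how the original system handles them is not stated in the text.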
The pseudo-code of SpCoMapping is shown in Algorithm 1. The procedure of the method is described as follows:
1. The robot initializes C_{i,j}, described by the MRF, from the occupancy grid map. I and J represent the width and height of the map, respectively (line 2 in Algorithm 1).
2. If a pixel on the occupancy grid map does not correspond to free space, then there are no spatial concepts in this model, and the robot retains the C_{i,j} value (lines 6-7).
3. The robot converts x_t, which denotes the coordinates obtained by MCL, to (i, j), which denotes the pixel on the occupancy grid map. The equation is shown in (11) (line 9).
4. For every free-space pixel on the occupancy grid map, the robot obtains an index of the spatial concepts by sampling (lines 10-12).
5. The robot uses sampling to obtain the multinomial distribution of the index of spatial concepts (line 16).
6. For each spatial concept category, the robot uses sampling to obtain the multinomial distribution of image features (line 18).
7. For each spatial concept category, the robot uses sampling to obtain the multinomial distribution of vocabulary features (line 19).
Algorithm 1 | Semantic mapping based on spatial concepts
1: initialization π, θ, w
2: get C_{(1:I, 1:J)} from map
3: for h = 1 to iteration do
...
21: end for
22: save C, π, θ_l, w_l
The details of the sampling process are described in section 3.3.

Definition of Generative Model and Graphical Model
Figure 3 shows the graphical model of SpCoMapping, and Table 1 shows the list of variables in the graphical model. In the probabilistic generative process represented by the graphical model, ∂(i, j) represents the neighborhood pixels of the (i, j) pixel, C_{∂(i,j)} represents the neighborhood nodes of C_{i,j}, Dir represents the Dirichlet distribution, and Mult represents the multinomial distribution.
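One plausible way to write the generative process is sketched below. It is reconstructed only from the symbols used in this section and from the hyperparameters α, β, χ, and γ reported in the experiments, so the exact form is an assumption. In particular, the way the multinomial prior π and the MRF term are combined for C_{i,j}, and the pairing of β with θ_l and of χ with w_l, are guesses rather than the authors' stated equations.

```latex
\begin{align}
\pi &\sim \mathrm{DP}(\alpha) \\
C_{i,j} &\sim \mathrm{Mult}(\pi)\,\mathrm{MRF}\bigl(C_{\partial(i,j)}\bigr) \\
\theta_l &\sim \mathrm{Dir}(\beta) \\
w_l &\sim \mathrm{Dir}(\chi) \\
f_t &\sim \mathrm{Mult}\bigl(\theta_{C_{i,j}}\bigr) \\
s_t &\sim \mathrm{Mult}\bigl(w_{C_{i,j}}\bigr)
\end{align}
```

Under this reading, f_t is the image feature histogram and s_t is the word histogram observed at time t, both generated from the category of the pixel (i, j) that the robot occupies.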
In (1), DP represents the Dirichlet process (DP). DP is a probabilistic process that can generate the parameters of an infinite-dimensional multinomial distribution. A nonparametric Bayesian clustering method that uses DP can automatically estimate the number of clusters. We adopted the weak-limit approximation for calculating DP (Fox et al., 2011), in which L is the upper limit of the number of spatial concepts.
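The weak-limit approximation is commonly written as a finite symmetric Dirichlet distribution, DP(α) ≈ Dir(α/L, ..., α/L). The sketch below draws π in that way using only the Python standard library. The function names and the example values of α and L are illustrative assumptions.

```python
import random

def sample_dirichlet(alphas, rng):
    """Sample Dir(alphas) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_pi_weak_limit(alpha, L, rng):
    """Weak-limit stand-in for DP(alpha): a symmetric Dir(alpha/L, ..., alpha/L)."""
    return sample_dirichlet([alpha / L] * L, rng)

rng = random.Random(0)
pi = sample_pi_weak_limit(alpha=10.0, L=20, rng=rng)
print(len(pi), round(sum(pi), 6))
```

With a small α/L, most of the probability mass tends to concentrate on a few of the L components, which is how the number of categories in use can end up smaller than the upper limit.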
In (2), MRF represents the Markov random field, described in the same way as in Chatzis and Tsechpenakis (2010). In its definition, γ represents the temperature parameter, and C_{∂(i,j)|m} means that C_{∂(i,j)} is generated from the occupancy grid map m.
In (5) and (6), x_t is the self-position of the robot and (i, j) represents the 2D index of pixels on the occupancy grid map. f_t and s_t are sampled only if self-position x_t corresponds to the (i, j) pixel. x_t is converted to (i, j) by (11), in which ⌊(p_x, p_y)⌋ represents the floor function, i.e., the largest integer coordinates that are not greater than the real number p_x on the x-axis and the real number p_y on the y-axis. X represents the origin pose of the map, and k represents the size of one pixel on the map in a real environment. The origin pose refers to the coordinates of the (1, 1) pixel. We obtained both using the occupancy grid message of the robot operating system (ROS) (Quigley et al., 2009).
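A minimal sketch of this conversion is given below, assuming that the pixel index is obtained by subtracting the origin, dividing by the pixel size, and taking the floor. The function name `convert` follows the notation used in the sampling procedure, and the +1 offset for 1-based indexing is an assumption. In ROS, the origin and pixel size would typically come from the `info.origin` and `info.resolution` fields of a `nav_msgs/OccupancyGrid` message.

```python
import math

def convert(x_t, origin, k):
    """Map a metric position x_t = (x, y) to a 1-based pixel index (i, j).

    Assumes `origin` is the metric coordinate of pixel (1, 1) and `k` is the
    pixel size in meters; the +1 offset is an assumption about indexing.
    """
    i = math.floor((x_t[0] - origin[0]) / k) + 1
    j = math.floor((x_t[1] - origin[1]) / k) + 1
    return i, j

# Illustrative values: 5 cm pixels and a map origin at (-1 m, -1 m).
print(convert((0.12, -0.26), origin=(-1.0, -1.0), k=0.05))
```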

Details of the Sampling Procedure
SpCoMapping estimates C_{i,j}, π, θ_l, and w_l using Gibbs sampling. The procedure for each sampling is shown below.
• Sampling C_{i,j}: If m_{x_t} is not free space, then C_{i,j} has no spatial concept. If m_{x_t} is free space, then C_{i,j} is sampled from its conditional distribution, in which t′ represents an element of the set of times when C_{i,j} = l in the converted self-positions (i, j) = convert(x_t) (t ∈ (1 : T)), and T is the number of training data items.
• Sampling π: When the number of spatial concepts is unknown, we adopt DP. The sampling equation is shown in (14). When the number of spatial concepts is known, we adopt the Dirichlet distribution.
• Sampling θ_l and w_l: θ_l and w_l are sampled for each spatial concept category l.
Finally, SpCoMapping can infer the semantic category of each pixel on the occupancy grid map using Gibbs sampling. The semantic mapping is achieved by this inference.
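The sampling steps above can be sketched as a small Gibbs sampler on a toy grid. This is an illustration under stated assumptions, not the authors' implementation: the Potts-style MRF term exp(number of equal neighbours / γ), the use of per-cell counts in the posterior of π, and all function and variable names are hypothetical.

```python
import math
import random

def sample_dirichlet(alphas, rng):
    """Sample Dir(alphas) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def safe_log(p):
    """Logarithm with a floor so that underflowed probabilities do not raise."""
    return math.log(max(p, 1e-300))

def sample_index(log_p, rng):
    """Draw an index with probability proportional to exp(log_p[index])."""
    m = max(log_p)
    weights = [math.exp(v - m) for v in log_p]
    r = rng.random() * sum(weights)
    acc = 0.0
    for idx, wgt in enumerate(weights):
        acc += wgt
        if r < acc:
            return idx
    return len(weights) - 1

def gibbs_sweep(C, free, obs, pi, theta, w, gamma, rng):
    """Resample C[i][j] for every free cell; obs maps (i, j) -> [(f, s), ...]."""
    H, W = len(C), len(C[0])
    for i in range(H):
        for j in range(W):
            if not free[i][j]:
                continue  # non-free cells carry no spatial concept
            neigh = [C[a][b] for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= a < H and 0 <= b < W and free[a][b]]
            log_p = []
            for l in range(len(pi)):
                # Assumed Potts-style MRF term plus the multinomial prior.
                lp = safe_log(pi[l]) + sum(1 for c in neigh if c == l) / gamma
                for f, s in obs.get((i, j), []):
                    lp += sum(n * safe_log(theta[l][d]) for d, n in enumerate(f))
                    lp += sum(n * safe_log(w[l][d]) for d, n in enumerate(s))
                log_p.append(lp)
            C[i][j] = sample_index(log_p, rng)
    return C

def resample_parameters(C, free, obs, L, alpha, beta, chi, dim_f, dim_s, rng):
    """Draw pi, theta, w from Dirichlet posteriors given the current C."""
    cells = [0] * L
    f_counts = [[0] * dim_f for _ in range(L)]
    s_counts = [[0] * dim_s for _ in range(L)]
    for i, row in enumerate(C):
        for j, l in enumerate(row):
            if not free[i][j]:
                continue
            cells[l] += 1
            for f, s in obs.get((i, j), []):
                for d, n in enumerate(f):
                    f_counts[l][d] += n
                for d, n in enumerate(s):
                    s_counts[l][d] += n
    pi = sample_dirichlet([alpha / L + n for n in cells], rng)
    theta = [sample_dirichlet([beta + n for n in fc], rng) for fc in f_counts]
    w = [sample_dirichlet([chi + n for n in sc], rng) for sc in s_counts]
    return pi, theta, w

rng = random.Random(0)
H, W, L = 4, 6, 3
free = [[True] * W for _ in range(H)]
free[0][0] = False  # one occupied cell, marked with -1 in C
C = [[rng.randrange(L) if free[i][j] else -1 for j in range(W)] for i in range(H)]
obs = {(1, 1): [([3, 0], [1, 0])], (2, 4): [([0, 3], [0, 1])]}  # toy (f_t, s_t) data
pi, theta, w = resample_parameters(C, free, obs, L, 1.0, 0.6, 0.1, 2, 2, rng)
for _ in range(20):
    C = gibbs_sweep(C, free, obs, pi, theta, w, gamma=4.0, rng=rng)
    pi, theta, w = resample_parameters(C, free, obs, L, 1.0, 0.6, 0.1, 2, 2, rng)
```

Cells without observations are driven only by the prior and their neighbours, which is what lets labels spread from the visited positions to the rest of the free space in this sketch.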

EXPERIMENT 1: SIMULATION ENVIRONMENT
We conducted an experiment to evaluate the semantic mapping ability of SpCoMapping and compare it with that of existing methods. For the quantitative evaluation, we performed experiments in the simulation environment SIGVerse, which emulated the daily living environment. We have provided the source code for SpCoMapping and the test dataset used in this experiment for public access on GitHub. Figure 4 shows the environment used in our experiment in SIGVerse. Table 2 presents information on the rooms in the simulation environment.

Conditions
We employed Caffe (Jia et al., 2014), a deep learning framework, to implement the CNN. To train the CNN, we used the Places205 dataset (Zhou et al., 2014) and used AlexNet as the particular network architecture for the CNN (Krizhevsky et al., 2012). To give word information to the robot, we provided it with textual place name data directly, without using speech recognition, to keep the focus on evaluating the semantic mapping. We compared five methods, denoted (A) to (E). For method (A), SpCoMapping (DP), we set the upper limit for spatial concepts to 120. For method (B), SpCoMapping (Dir), we set the number of spatial concepts to be the same as the number of categories in the ground truth. The method of Ishibushi et al. (2015) only employed image features and self-localization. However, as method (C) in our experiment, we compared against a version of this model designed to handle word information. Method (D) does not categorize all pixels on the occupancy grid maps; therefore, we adopted the nearest neighbor for this method to fill up all the pixels for the comparisons. Method (E), the nearest-neighbor method, is one of the easiest: it retrieves a word label from the sample nearest to its position.
We prepared the ground truth labels by asking a participant to draw a semantic map for each map by referring to the information from the 3D simulator and the 2D maps. In the simulation, we gave the robot words that were assumed to be used in each environment. The vocabulary list is summarized in Table 3; underlined words in the list are not labels of the Places205 dataset. In the result tables, bold numbers are the best in each environment and underlined numbers mark the two best.
FIGURE 5 | Example of the changes in the ARI by iteration (Room2ldk4).

Clustering Accuracy
We calculated the adjusted Rand index (ARI) (Hubert and Arabie, 1985), which is a measure of similarity between two clusterings, for each method. If two clusterings are the same, the ARI is one; if the clusters are allocated randomly, the expected ARI is zero. Semantic mapping can be regarded as a task in which pixels are clustered on a map. We compared the performance of the methods from this viewpoint.
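For reference, the ARI can be computed from the contingency counts of two labelings with the standard pair-counting formula. The sketch below uses only the Python standard library; the function name and the toy labels are illustrative, and at least two items are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI (Hubert and Arabie, 1985) between two labelings of the same items."""
    n = len(labels_a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_ij - expected) / (max_index - expected)

# Pixel labels flattened to lists; category indices need not match by name.
ground_truth = [0, 0, 1, 1, 2, 2]
estimated = [2, 2, 0, 0, 1, 1]
print(adjusted_rand_index(ground_truth, estimated))
```

Because the index depends only on which pixels share a category, a semantic map whose category indices are permuted relative to the ground truth still scores one.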
The results are shown in Table 4a. The column titled "Average" denotes the average ARI of the five rooms. SpCoMapping has a higher average, showing a higher performance on each map, compared to the other methods. This result suggests that SpCoMapping can solve the problems introduced in section 1, including the overwrite and shape problems; in other words, the categories of the semantic maps it generates are closer to the categories of semantic maps generated by a person than those of the other semantic mapping methods. In this result, SpCoMapping with the DP prior is better than SpCoMapping with the Dirichlet prior, in which the number of categories is given. As in Nakamura et al. (2015), Gibbs sampling with a fixed number of categories is sometimes harder than with a changing number of categories. Figure 5 shows an example of the change in the ARI by iteration for Room2ldk4. The ARI increased as the number of iterations increased. Figure 6 shows an example of the change in the categories of Room1ldk4 caused by iterations of SpCoMapping (DP). This result shows that SpCoMapping (DP) gradually estimated the number of semantic categories. However, the relationship between the iteration and the number of categories depends on the size of the map and the complexity of the environment. Therefore, in future work, we need to improve the ability to automatically determine the number of iterations. Figure 7 shows the semantic maps generated by each method. The regions on the maps generated by SpCoMapping are separated at the walls, and no single region extends across a wall. These maps show that the places estimated by SpCoMapping have regions that follow the shape of the environment.

Estimation of Place by a Word Input
When a robot is required to perform a task that requires communication with the user, e.g., navigation, cleaning the room, or searching for an object, the robot needs to estimate the place indicated by the user from a word input. Therefore, we compared the matching rate of the places estimated by each method. In the calculation of the matching rate, V represents the vocabulary, i.e., the set of words, in the ground truth; M_condition represents the number of spaces that meet the condition; m_C is the set of pixels on the semantic map or ground truth given category C; and L_s represents the category of the ground truth given word s. An index of spatial concepts is estimated from the word input s. The results are shown in Table 4b. Method (B) achieved the best "average" score among the methods. This result shows that SpCoMapping can estimate place regions better than previous methods when the robot is given a command by its user that includes the place name. Method (A) performed the best in Room1dk5 and Room1dk4. However, it did not have good results for the large environments (Room1dk6 and Room2ldk4). The poor performance of SpCoMapping (DP) in the two large environments is attributed to the creation of many clusters for a large region and a wrong estimate of the number of categories in these environments. The nonparametric Bayesian estimation of semantic categories is unstable, as shown in the result. SpCoMapping (Dir) is more stable than SpCoMapping (DP).
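One simple way to go from a word input to a region on the semantic map is sketched below. It assumes that the category is chosen as the one whose word distribution w_l gives the highest probability to the input word; this argmax rule, the helper names, and the toy values are assumptions, since the exact estimation equation is not reproduced here.

```python
def estimate_category(word, vocabulary, w):
    """Return the index l maximizing w[l][word]; the argmax rule is an assumption."""
    s = vocabulary.index(word)
    return max(range(len(w)), key=lambda l: w[l][s])

def pixels_of(category, C):
    """Collect the (i, j) cells of the semantic map C assigned to `category`."""
    return {(i, j) for i, row in enumerate(C) for j, c in enumerate(row) if c == category}

# Toy word distributions for two categories and a 2 x 3 semantic map.
vocabulary = ["kitchen", "bedroom"]
w = [[0.9, 0.1], [0.2, 0.8]]
C = [[0, 0, 1], [0, 1, 1]]
target = estimate_category("bedroom", vocabulary, w)
print(target, sorted(pixels_of(target, C)))
```

The resulting pixel set is what would then be compared against the ground-truth region of the same word when computing a matching rate.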

Conditions
We conducted an experiment to generate a semantic map in a real-world environment. The robot and environment we employed are shown in Figure 8. The laboratory room serves as the living environment for experiments and as a study space for the researcher. In this experiment, we obtained word information from given sentences to demonstrate that SpCoMapping can acquire vocabulary as place names without setting them in advance. We used sentences as word features to show how multiple words could be connected to a place without pre-setting the place names. The sentence list is shown in Table 5. We provided these 20 sentences, which contain a vocabulary of 50 words, five times each. We provided the RGB data 407 times.
For this experiment, we adopted SpCoMapping (DP), which is used when the number of spatial concepts is unknown. We set the upper limit of the number of spatial concepts to 120. We set the hyperparameters as follows: α = 1.0 × 10^6, β = 0.6, χ = 100.0, and γ = 4.0.
We set the weight for vocabulary features using the tf-idf scheme (Salton and Mcgill, 1986) as mutual information between words and sentences. In the equation used to calculate the weight for word i in sentence j, n_{i,j} is the number of occurrences of word i in sentence j, D is the number of sentences, and D_i represents the number of sentences including word i. By setting the weight for words using tf-idf, the importance of words included in many sentences, for example, "is," "here," and "you," is lowered. This process helps in the acquisition of place-related words.
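A minimal sketch of a tf-idf weighting built from the quantities named above is shown below. It assumes the common form in which the term frequency n_{i,j} is normalized by the sentence length and multiplied by log(D / D_i); the normalization, the function name, and the example sentences are assumptions.

```python
import math

def tfidf_weights(sentences):
    """Return {(word, sentence_index): weight} using tf * log(D / D_i)."""
    D = len(sentences)
    doc_freq = {}
    for sent in sentences:
        for word in set(sent):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = {}
    for j, sent in enumerate(sentences):
        for word in set(sent):
            tf = sent.count(word) / len(sent)  # assumed length normalization
            weights[(word, j)] = tf * math.log(D / doc_freq[word])
    return weights

sentences = [["here", "is", "the", "kitchen"], ["here", "is", "the", "bedroom"]]
weights = tfidf_weights(sentences)
print(weights[("here", 0)], weights[("kitchen", 0)])
```

In this toy example, words that occur in every sentence receive zero weight, while the place names keep a positive weight.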
In order to ensure the stability of the proposed method, we sampled w_l 100 times and used the average as w_l.
In addition, we employed pre-learning by spatial concept formation using word information (Ishibushi et al., 2015). We set the hyperparameters to the same values and ran 1,000 iterations. In line 1 of Algorithm 1, we initialized π, θ, and w using the result of pre-learning. In line 2 of Algorithm 1, we initialized C_{i,j} as follows: If m_{x_t} is not free space, then C_{i,j} has no spatial concept. If m_{x_t} is free space, then C_{i,j} is sampled using the position distributions of the pre-learning categories, where N is the multivariate Gaussian (normal) distribution, µ^pre_l is the mean vector of the position distribution of the l-th pre-learning category, and Σ^pre_l is the covariance matrix of the position distribution of the l-th pre-learning category. When pre-learning is employed, vocabulary acquisition features are more stable and fewer iterations are required for learning. The occupancy grid map of the personal living environment and the result of pre-learning are shown in Figure 9. This occupancy grid map has 19,255 pixels as free space.
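The initialization from pre-learned position distributions can be sketched as follows. The snippet assumes that each free pixel draws its category with probability proportional to π_l · N(x | µ^pre_l, Σ^pre_l); the weighting by π_l, the helper names, and the toy parameters are assumptions rather than the authors' stated equation.

```python
import math
import random

def gaussian_pdf_2d(x, mu, cov):
    """Density of a 2D normal N(x | mu, cov) with cov given as [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form using the closed-form inverse of a 2 x 2 matrix.
    quad = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def init_category_probs(x, pi, mus, covs):
    """Normalized weights pi_l * N(x | mu_l, cov_l) over pre-learned categories."""
    weights = [p * gaussian_pdf_2d(x, m, c) for p, m, c in zip(pi, mus, covs)]
    total = sum(weights)
    return [wgt / total for wgt in weights]

# Two toy pre-learned categories centered 4 m apart with unit covariance.
pi = [0.5, 0.5]
mus = [(0.0, 0.0), (4.0, 0.0)]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
probs = init_category_probs((0.0, 0.0), pi, mus, covs)
rng = random.Random(0)
c_init = rng.choices(range(len(pi)), weights=probs)[0]
print(probs, c_init)
```

A pixel far from every pre-learned mean would give weights that underflow to zero, so a practical version would need to work in log space; this sketch omits that for brevity.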

Results
The generated semantic maps at each iteration are shown in Figures 10A-E. The semantic map of the pre-learning by spatial concept formation, shown in Figure 9B, does not follow the environment shape but can divide the map into some categories. It helps to calculate the parameter of the multinomial distribution π for the MRF in early iterations. There are many categories in the center of the environment in the semantic maps of iterations 1 and 100, and some categories do not follow the shape of the environment. However, in the semantic map of iteration 10,000, some categories are combined, and each region has an arbitrary shape related to the shape of the environment. Therefore, it is shown, qualitatively, that SpCoMapping can gradually estimate semantic maps even in a real-world environment. In addition, since SpCoMapping uses word information as multimodal data, it can obtain words as features of space without the place name being set by the user. The semantic map of the final iteration, with the three best words obtained for the representative spatial concepts, is shown in Figure 10F. The robot acquired some vocabulary for place names along with the probability of their occurrence. For example, this result shows that if the robot catches the words "meeting" and "start," it moves to the front of the whiteboard. However, since the robot used weights by mutual information for vocabulary, meaningless words such as "on," "in," and "too" have a high probability in the categories of each result. This problem can be resolved by assigning more vocabulary features for the robot to learn.

CONCLUSIONS
This paper proposed a novel semantic mapping method called SpCoMapping, which extends a spatial concept acquisition method using an MRF. Experiments showed that SpCoMapping could deal with the problems faced by existing semantic mapping methods. The semantic maps that SpCoMapping generated in a simulation environment matched those generated by a participant more accurately than those of the existing methods. Furthermore, the semantic maps generated by SpCoMapping are better than those generated by the existing methods from the viewpoint of estimating place from word input, i.e., from the viewpoint of human-robot communication. Finally, an experiment in a real-world daily environment showed that SpCoMapping could generate a semantic map in a real environment as well as a simulated one. SpCoMapping can generate semantic maps that follow the shape of the environment, and the robot can perform tasks that include place names. SpCoMapping with the Dirichlet prior is more stable than SpCoMapping with the DP prior. Therefore, we will use SpCoMapping with the Dirichlet prior when the number of semantic categories is known. However, the number of categories is rarely known in a complex real-world daily environment; in that case, we will use SpCoMapping with the DP prior because it estimates the number of semantic categories simultaneously.
In future work, we will apply the proposed method to tasks such as those executed by autonomous vacuum cleaner robots, e.g., "please clean my room," that require communication with humans. Improving the stability of SpCoMapping (DP), as mentioned in the simulation experiment, is also a future challenge. In addition, SpCoMapping employs batch learning; therefore, we will also investigate the development of an online learning algorithm for SpCoMapping and its integration with SLAM to work in new environments. In this paper, we proposed a 2D semantic mapping method. However, when a robot grasps or detects an object, it will need 3D semantic maps. Therefore, we will also extend this method to 3D. SpCoMapping can deal with the shape of an environment using the MRF; however, the environment shape must have features, for example, that the corridor is narrow or that the entrance connects two spaces. Therefore, we will use a generative adversarial network or a variational autoencoder in order to generate map features.
Although several challenges remain, our proposed method significantly improved the performance of unsupervised learning-based semantic mapping, enabling a robot to make use of users' utterances in a daily environment for semantic mapping. We believe this method will contribute to learning-based human-robot semantic communication in daily environments in the near future.

AUTHOR CONTRIBUTIONS
YK designed the study, and wrote the initial draft of the manuscript. All other authors contributed to analysis and interpretation of data, and assisted in the preparation of the manuscript. All authors approved the final version of the manuscript, and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.