Symbol Emergence as an Interpersonal Multimodal Categorization

This study focuses on category formation for individual agents and the dynamics of symbol emergence in a multi-agent system through semiotic communication. In this study, the semiotic communication refers to exchanging signs composed of the signifier (i.e., words) and the signified (i.e., categories). We define the generation and interpretation of signs associated with the categories formed through the agent's own sensory experience or by exchanging signs with other agents as basic functions of the semiotic communication. From the viewpoint of language evolution and symbol emergence, organization of a symbol system in a multi-agent system (i.e., agent society) is considered as a bottom-up and dynamic process, where individual agents share the meaning of signs and categorize sensory experience. A constructive computational model can explain the mutual dependency of the two processes and has mathematical support that guarantees a symbol system's emergence and sharing within the multi-agent system. In this paper, we describe a new computational model that represents symbol emergence in a two-agent system based on a probabilistic generative model for multimodal categorization. It models semiotic communication via a probabilistic rejection based on the receiver's own belief. We have found that the dynamics by which cognitively independent agents create a symbol system through their semiotic communication can be regarded as the inference process of a hidden variable in an interpersonal multimodal categorizer, i.e., the complete system can be regarded as a single agent performing multimodal categorization using the sensors of all agents, if we define the rejection probability based on the Metropolis-Hastings algorithm. The validity of the proposed model and algorithm for symbol emergence, i.e., forming and sharing signs and categories, is also verified in an experiment with two agents observing daily objects in the real-world environment. 
In the experiment, we compared three communication algorithms: no communication, no rejection, and the proposed algorithm. The experimental results demonstrate that our model reproduces the phenomena of symbol emergence, which does not require a teacher who would know a pre-existing symbol system. Instead, the multi-agent system can form and use a symbol system without having pre-existing categories.


INTRODUCTION
Language plays a crucial role in sharing information between people through semiotic communication (i.e., exchanging signs). However, how the symbol system of a language has emerged through semiotic communication remains an open problem in language evolution. Semiotic communication in this study is defined as the generation and interpretation of signs associated with the categories formed through the agent's own sensory experience or by exchanging signs with other agents. Although sharing and forming the symbols that represent sensory experience is only a part of the function of language, obtaining a computational explanation of the emergence of a symbol system in the real-world environment (i.e., only from individual sensory-motor experience and exchanging signs) is important and challenging.
In a fundamental study on language evolution, Chomsky (1975) focused on thoughts and concept formation within individuals, and advocated the concept of generative grammar that explains the syntactic ability required for languages. Tomasello (1999) focused on human cooperative communication and proposed a model to determine the communication ability for sharing intentions with others. As a hypothesis that handles concept formation and intention sharing in an integrated manner, Okanoya and Merker (2007) advocated a mutual segmentation hypothesis of sound strings and situations based on co-creative communication. Our study takes a constructive approach inspired by the hypothesis of Okanoya and Merker, and it deals with concept formation in individuals and intention sharing in multi-agent systems simultaneously. Constructive approaches are highly advantageous for understanding complex systems and are also useful for studying evolutionary linguistics (Steels, 1997; Hashimoto, 1999). Nakamura et al. (2014) and Taniguchi et al. (2018) developed integrative probabilistic generative models for multimodal categorization and word discovery by a robot (using multimodal sensorimotor information obtained by the robot, and speech signals). However, their models only explain individual learning, and they implicitly presume that a teacher has a fixed and static symbol system. The goal of this paper is to provide a computational model that not only categorizes sensory information but also shares the meaning of signs within the multi-agent system. This model provides a clear view of symbol emergence as an interpersonal multimodal categorization.
In studies on language emergence in multi-agent systems, Kirby (1999) showed in a simulation model that the language exchanged between agents gradually becomes structured through repeated alternation of generations. Morita et al. (2012) showed in simulation experiments that semiotic communication systems emerge from interactions that solve collaborative tasks. Lazaridou et al. (2016) proposed a framework for language learning that relies on multi-agent communication for developing interactive machines (e.g., conversational agents). Lee et al. (2017) proposed a communication game in which two agents, native speakers of their own respective languages, jointly learn to solve a visual referential task. Graesser et al. (2019) proposed a computational framework in which agents equipped with communication capabilities simultaneously play a series of referential games, where agents are trained by deep reinforcement learning. These studies achieved language emergence in multi-agent systems by using a computational model. However, interaction with the real-world environment through one's own sensory information without pre-existing categories (i.e., internal representations) was not discussed in these studies.
In studies on language evolution based on a constructive approach using robots, Steels performed experiments on a self-organizing spatial vocabulary (Steels, 1995) and on perceptually grounded categories through language for color (Steels and Belpaeme, 2005). This line of research also considered how people talk about the location of objects and places (Steels and Loetzsch, 2008). His series of studies using robots is summarized as the talking heads experiment (Steels, 2015).
In studies beyond the talking heads experiment, Steels and Kaplan (2000) performed experiments with robots in "AIBO's first words: the social learning of language and meaning". De Beule et al. (2006) proposed a cross-situational learning algorithm for damping homonymy in the guessing game. Spranger proposed a perceptual system for language game experiments (Spranger et al., 2012) and studied the evolution of grounded spatial language (Spranger, 2011, 2015). Bleys (2015) proposed language strategies for the domain of color. Matuszek (2018) proposed grounded language learning, where robotics and natural language processing meet. These studies focused on symbol grounding in the language game and built the foundation of constructive studies on language evolution. However, the modeling of bottom-up internal representation learning from sensory-motor information within individuals has not been discussed.
Taniguchi et al. (2016b) introduced the concept of a symbol emergence system, which is a multi-agent system that dynamically organizes a symbol system, for example, by physical interaction with the environment and semiotic communication with other agents. A symbol emergence system can be regarded as an emergent system: a complex system that has an emergent property. Internal representations (e.g., object categories) are formed in individual agents' cognitive systems under the influence of a symbol system shared among a multi-agent system. A symbol system shared in a society is organized under the influence of internal representations formed by individual agents. This interdependency is crucial for symbol emergence in a multi-agent system. However, so far, there is no computational model that describes the mutual dependency and has been evaluated in experiments in the real-world environment.
In studies of bottom-up concept formation and word grounding based on the sensory-motor information of a robot, Nakamura et al. (2009) proposed a model for grounding word meanings in multimodal concepts, and Ando et al. (2013) proposed a model of hierarchical object concept formation based on multimodal information. Several methods that enable robots to acquire words and spatial concepts based on sensory-motor information in an unsupervised manner have been proposed (Isobe et al., 2017; Taniguchi et al., 2017). Hagiwara proposed a Bayesian generative model to acquire the hierarchical structure of spatial concepts based on the sensory-motor information of a robot (Hagiwara et al., 2016, 2018). These studies focused on language acquisition by a robot from a person who gives speech signals to the robot, and they enable robots to discover words and categories based on their embodied meanings from raw sensory-motor information (e.g., visual, haptic, auditory, and acoustic speech information). They presume that a person has knowledge about categories and signs representing the categories, i.e., a symbol system shared in the society. Therefore, these computational models cannot be considered as constructive models of symbol emergence systems. These studies have not dealt with the dynamics of emerging symbols while agents form categories based on sensory-motor information.
A concept of symbol emergence in cognitive systems is surveyed in (Tadahiro Taniguchi, 2018). A symbol emergence system is socially self-organized through both semiotic communications with autonomous cognitive developmental agents and physical interactions with the environment, as shown in Figure 1. Note that a symbol system cannot be controlled by anyone, but all individuals are constrained by an emergent and shared symbol system. In addition, all of them contribute to creating the socially shared symbol system. To understand this phenomenon, the coupled dynamics of both a symbol system shared between the agents and the internal representation systems of individuals have to be modeled with a constructive and computational approach. Computational models for category formation and lexical acquisition by a robot have been studied and proposed. Studies on language evolution and symbol emergence face different challenges, in addition to concept formation and lexical acquisition. They have to deal with the organization of a symbol system in the society itself (i.e., the symbol system is unstable), in contrast with studies on category formation and lexical acquisition, where the symbol system can be considered as static and stable.
This study focuses on symbol emergence in a multi-agent system and category formation in individual agents through semiotic communication, which is the generation and interpretation of symbols associated with the categories formed using the agent's sensory information. The main contributions of this paper are as follows.
• We propose a constructive computational model that represents the dynamics of a symbol emergence system by using probabilistic models for multimodal categorization and message passing based on the Metropolis-Hastings (M-H) algorithm. The model represents the mutual dependency of individual categorization and the formation of a symbol system in a multi-agent system.
• We show mathematically that our model representing a multi-agent system and symbol emergence among agents can be regarded as a single agent and a single multimodal categorizer, i.e., an interpersonal categorizer. We prove that the symbol emergence in the model is guaranteed to converge.
• We evaluate the proposed model of symbol emergence and category formation in an experiment using two agents that can obtain visual information and exchange signs in the real-world environment. The results show the validity of our proposed model.
The rest of this paper is structured as follows. Section 2 describes the proposed model and inference algorithm for representing the dynamics of symbol emergence and category formation in multi-agent systems. Section 3 presents experimental results, verifying the validity of the proposed model and inference algorithm on object categorization and symbol emergence. Finally, Section 4 presents conclusions.

Expansion of a multimodal categorizer from personal to interpersonal
The computational model that we propose in this paper is based on a key finding that a probabilistic generative model of multimodal categorization can be divided into several sub-modules of probabilistic generative models for categorization and message passing between the sub-modules. This idea of dividing a large probabilistic generative model for developing cognitive agents and re-integrating the parts was first introduced as the SERKET framework (Nakamura et al., 2018). However, the idea was only applied to creating a single agent. We found that the idea can be used for modeling multi-agent systems and is very suitable for modeling the dynamics of a symbol emergence system. We modeled the symbol emergence in a multi-agent system and the category formation in individual agents as a generative model by expanding a personal multimodal categorizer (see Figure 2 (a)) to an interpersonal multimodal categorizer (see Figure 2 (b)). First, (a) shows a personal multimodal categorizer, which is a generative model with an integrated category $c$ as a latent variable and sensor information from haptics and vision as observations $o^h$ and $o^v$. The model is a simple version of the multimodal latent Dirichlet allocation used as an object categorizer in previous studies (Nakamura et al., 2009; Ando et al., 2013). Next, (b) shows an interpersonal multimodal categorizer in which two agents are modeled as a collective intelligence, with word $w$ as a latent variable, and sensor information from agents A and B as observations $o^A$ and $o^B$. As shown in Figure 2, the model that generates observations through categories on each sensor from an integrated concept in an agent can be extended to a model that generates observations through categories on each agent from a word in a multi-agent system.
Figure 2 (a) represents a graphical model of a probabilistic generative model for multimodal categorization, e.g., Nakamura et al. (2014). It can integrate multimodal information, e.g., haptic and visual information, and form categories. The index of a category is represented by $c$ in this figure. Following the SERKET framework (Nakamura et al., 2018), we can divide the model into two modules and a communication protocol for a shared node. Here, $c$ is shared by the two modules and the node is renamed $w$. We regard the index $w$ as the index of a word. In this case, if we regard the two separated modules as two individual agents (i.e., agent A and agent B), the communication between the two nodes can be considered as an exchange of signs (i.e., words). As we see later, we found that, if we employ the Metropolis-Hastings algorithm, which is one of the communication protocols that the original SERKET paper proposed, the communication protocol between the nodes can be considered as semiotic communication between two agents. Roughly speaking, the communication is described as follows. Agent A recognizes an object and generates a word for agent B. If the word is consistent with the belief of agent B, then agent B accepts the naming with a certain probability; otherwise, agent B rejects the information, i.e., does not believe the meaning. If the rejection and acceptance probabilities of the communication are the same as those of the M-H algorithm, the posterior distribution over $w$, i.e., symbol emergence among the agents, is theoretically the same as the posterior distribution over $c$, i.e., interpersonal categorization.

Generative process on the interpersonal multimodal categorizer
This subsection describes the generative process of the interpersonal multimodal categorizer. The graphical model shown in Figure 3 is a single graphical model. However, following the SERKET framework (see Figure 2), it can be owned by two different agents separately. The right and left parts indicated with a dashed line in Figure 3 show the parts owned by agents A and B, respectively. Figure 3 and Table 1 show the graphical model and the parameters of the proposed interpersonal multimodal categorizer, respectively. The generative process of the interpersonal multimodal categorizer is described as follows.
The parameters $\phi^A_l$ and $\phi^B_l$ of the multinomial distributions for each category ($l \in L$) and the parameters $\theta^A_k$ and $\theta^B_k$ of the multinomial distributions for each word ($k \in K$) are defined first. They are followed by the operations numbered (5) to (8), which include generating the category indices $c^A_d$ and $c^B_d$ from the word $w_d$.
Theoretically, the generative model is a type of pre-existing model for multimodal categorization (Nakamura et al., 2014, 2018) for an individual agent. In this paper, we assume that the graphical model represents the symbol emergence in a multi-agent system.
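As a compact restatement, the generative process can be sketched as below. This is only a sketch under stated assumptions: the Dirichlet priors with hyperparameters $\alpha$ and $\beta$ and the uniform prior over words are borrowed from standard LDA-style models and are not taken from the equations of this paper, and the lines are not meant to match the numbering of Equations (1) to (8). Only the multinomial links from word to category and from category to observation follow directly from the text.

```latex
\begin{align*}
\phi^{A}_{l},\ \phi^{B}_{l} &\sim \mathrm{Dir}(\beta) && (l \in L) && \text{(assumed prior)} \\
\theta^{A}_{k},\ \theta^{B}_{k} &\sim \mathrm{Dir}(\alpha) && (k \in K) && \text{(assumed prior)} \\
w_{d} &\sim \mathrm{Unif}(K) && && \text{(assumed prior)} \\
c^{A}_{d} \sim \mathrm{Multi}(\theta^{A}_{w_{d}}),\quad
c^{B}_{d} &\sim \mathrm{Multi}(\theta^{B}_{w_{d}}) && && \text{(category from word)} \\
o^{A}_{d} \sim \mathrm{Multi}(\phi^{A}_{c^{A}_{d}}),\quad
o^{B}_{d} &\sim \mathrm{Multi}(\phi^{B}_{c^{B}_{d}}) && && \text{(observation from category)}
\end{align*}
```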

Communication protocol as an inference algorithm on the interpersonal multimodal categorizer
This subsection describes the protocol of semiotic communication between two agents and the cognitive dynamics of categorization in individual agents. As a whole, the total process can be regarded as a model of symbol emergence in the multi-agent system. Additionally, the total process can be regarded as an inference process of the probabilistic generative model integrating the two agents' cognitive systems (Figure 3).

Gibbs sampling
First, to introduce our proposed model, we describe an ordinary Gibbs sampling algorithm for the integrative probabilistic generative model in Figure 3. The Gibbs sampling algorithm is widely used for multimodal categorization and language learning in robotics. Gibbs sampling (Liu, 1994) is known as a type of Markov chain Monte Carlo (MCMC) algorithm for inferring latent variables in probabilistic generative models.
Algorithm 1 shows the inference algorithm on the model of Figure 3 using Gibbs sampling. In Algorithm 1, $i$ denotes the iteration index; $O^A$ and $O^B$ denote the sets of all observations of agents A and B, respectively; $C^A$ and $C^B$ denote the sets of all categories of agents A and B, respectively; and $W$ denotes the set of all words. In line 14 of Algorithm 1, word $w_d$ is sampled from the product of the probability distributions $P(c^A_d \mid \theta^A_k)$ and $P(c^B_d \mid \theta^B_k)$ of agents A and B.
If an agent could observe both $\theta^A_k$ and $\theta^B_k$, which are internal representations of each agent, this algorithm would work. However, agent A cannot observe $\theta^B_k$. Therefore, no agent can perform Gibbs sampling in this multi-agent system. In this sense, this is not a valid cognitive model for representing the symbol emergence between two agents.
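The word-sampling step of the Gibbs sampler can be illustrated with a minimal sketch. It assumes a uniform prior over words, so that the conditional of $w_d$ is proportional to the product of the two agents' category likelihoods; the function names and the toy parameter values are hypothetical and only serve the illustration. Note that the function needs both `theta_a` and `theta_b`, which is exactly the information no single agent has.

```python
import random

def gibbs_word_posterior(theta_a, theta_b, c_a, c_b):
    """Return P(w = k | c_a, c_b), proportional to theta_a[k][c_a] * theta_b[k][c_b].

    theta_x[k][l] is agent x's probability of category l given word k.
    A uniform prior over words is assumed (not stated in the source).
    """
    weights = [row_a[c_a] * row_b[c_b] for row_a, row_b in zip(theta_a, theta_b)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_word(probs, rng=random):
    """Draw one word index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical toy parameters: K = 3 words, L = 2 categories per agent.
theta_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
theta_b = [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]

# Both agents currently assign the object to their category 0.
posterior = gibbs_word_posterior(theta_a, theta_b, c_a=0, c_b=0)
word = sample_word(posterior, random.Random(0))
```

With these toy values, word 0 dominates the posterior because both agents associate it strongly with their category 0.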
Algorithm 1 Gibbs sampling algorithm: initialize all parameters, then repeat the sampling steps for i = 1 to I.

Computational model of the symbol emergence based on an inference procedure using the Metropolis-Hastings algorithm
A communication protocol based on the M-H algorithm proposed in SERKET enables us to develop a valid cognitive model, i.e., updating parameters by an agent does not require the agent to use cognitively unobservable information (Nakamura et al., 2018; Hastings, 1970). The M-H algorithm is a Markov chain Monte Carlo algorithm, and Gibbs sampling is a special case of it. It is known that both algorithms can sample latent variables from the posterior distribution. This means that, theoretically, both algorithms can converge to the same stationary distribution.
Algorithm 2 shows the proposed inference algorithm based on the M-H algorithm. It can also be regarded as semiotic communication between two agents, together with an individual object categorization process under the influence of words that are given by the other agent.
The set of all words $W^{[i]}$ at the $i$-th iteration is calculated by two steps of the M-H algorithm, as shown in lines 3 and 5. Basically, the M-H algorithm requires only the information that an agent can observe within the dotted line in Figure 3 and $w_d$. In this model of symbol emergence, word $w_d$ is generated, i.e., uttered, by a speaker agent, either A or B. A listener agent judges whether the word properly represents the object the agent looks at. The criterion for the judgment should rely on the information the listener knows, i.e., the probability variables inside the dotted line in Figure 3.
Algorithm 3 shows the M-H algorithm, where Sp and Li are the speaker and listener, respectively. Generation of word $w^{Sp}_d$ from the speaker's observation $o^{Sp}_d$ and category $c^{Sp}_d$, which the speaker regards as the target object, is modeled as a sampling process using $P(w^{Sp}_d \mid c^{Sp}_d, \theta^{Sp}_k)$. This sampling can be performed by using the information that is available to the speaker agent. In line 3, the sampling and judgment of words $W$ are performed with agent A as the speaker and agent B as the listener. In line 5, the sampling and judgment of words $W$ are performed with agent B as the speaker and agent A as the listener. In the M-H algorithm, the judgment uses only the information that is available to the listener agent, i.e., the variables with superscript Li and $w_d$. Used together, Algorithms 2 and 3 perform a probabilistic inference of the probabilistic generative model shown in Figure 3. Importantly, the M-H algorithm samples words from the same posterior distribution as Gibbs sampling, which would require all the information that the individual agents hold in a distributed manner. This gives mathematical support to the dynamics of symbol emergence.
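Algorithm 2 Proposed interactive learning process based on the M-H algorithm: initialize all parameters; then, for i = 1 to I, run the M-H algorithm with agent A as the speaker and agent B as the listener (line 3), followed by agent B as the speaker and agent A as the listener (line 5).
Algorithm 3 M-H algorithm: the speaker samples a word (line 12); the listener computes the acceptance rate (line 13) and accepts or rejects the word (lines 14 to 19); the listener's internal parameters are then updated, starting with a loop for l = 1 to L (lines 21 to 29).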

Dynamics of symbol emergence and category formation
This subsection describes how the proposed inference algorithm explains the dynamics of symbol emergence and category formation through semiotic communication. Figure 4 conceptually shows the relationship between the dynamics of symbol emergence and concept formation between the agents, and the inference process for a word in the proposed model. The proposed model consists of the categorization part, where the agents form categories individually, and the communication part, in which the agents exchange words between them. The categorization part is modeled based on latent Dirichlet allocation (LDA). The communication part connects the categorization parts of agents A and B. We modeled the communication part as the inference process of the hidden variable $w_d$.
In line 12 of Algorithm 3, word $w^{Sp}_d$ is sampled from a proposal distribution that uses the parameters of the speaker only (i.e., $c^{Sp}_d$ and $\theta^{Sp}_k$). The process can be regarded as a word utterance from agent A in observation $o^A_d$ based on its internal parameters, as shown in Figure 4 (a). This is a part of semiotic communication.
In line 13, the sampled word $w^{Sp}_d$ is judged by the listener by using the acceptance rate $z$. The acceptance rate $z$ of the sampled word $w^{Sp}_d$ can be calculated from the parameters of the listener only (i.e., $c^{Li}_d$, $\theta^{Li}_k$, $w^{Li}_d$). Therefore, this is plausible from a cognitive perspective.
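The reason the acceptance rate depends on the listener alone can be seen from the standard M-H ratio. The following is a sketch of that argument, assuming a uniform prior over words (an assumption of this sketch) and writing $\pi$ for the target posterior over the word, $q$ for the speaker's proposal, and $\theta^{x}_{w}$ for agent $x$'s parameter for word $w$; these symbols are introduced here for illustration only.

```latex
\begin{align*}
\pi(w) &\propto P(c^{Sp}_d \mid \theta^{Sp}_{w})\, P(c^{Li}_d \mid \theta^{Li}_{w}),
\qquad
q(w) \propto P(c^{Sp}_d \mid \theta^{Sp}_{w}), \\[4pt]
z &= \min\!\left(1,\ \frac{\pi(w^{Sp}_d)\, q(w^{Li}_d)}{\pi(w^{Li}_d)\, q(w^{Sp}_d)}\right)
   = \min\!\left(1,\ \frac{P(c^{Li}_d \mid \theta^{Li}_{w^{Sp}_d})}{P(c^{Li}_d \mid \theta^{Li}_{w^{Li}_d})}\right).
\end{align*}
```

The speaker's likelihood terms appear in both the target and the proposal and therefore cancel, which leaves a ratio that contains only the listener's category, the listener's parameters, and the two candidate words.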
In lines 14 to 19, the sampled word $w^{Sp}_d$ is probabilistically accepted or rejected by the listener using the acceptance rate $z$ and a uniform random number $u \sim \mathrm{Unif}(0, 1)$, where $\mathrm{Unif}(\cdot)$ denotes the continuous uniform distribution: the listener adopts $w^{Sp}_d$ if $u \le z$, and otherwise keeps the word $w^{Li}_d$ that it held at the previous iteration. Roughly speaking, if the listener agent considers that the current word is likely to be the word that represents the object that the listener also looks at, the listener agent accepts the word and updates its internal representations with a high probability. The process can be explained as a judgment as to whether agent B accepts or rejects an utterance of agent A based on self-knowledge, as shown in Figure 4 (b).
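The utterance, judgment, and accept/reject steps can be put together in a small sketch. It is an illustration under assumptions rather than the authors' implementation: the parameters $\theta$ are held fixed, a uniform prior over words is assumed, the acceptance rate is the listener-only likelihood ratio, and all function names and toy values are hypothetical. The short loop at the end alternates the speaker role and compares the visit frequencies of the word with the product distribution that Gibbs sampling would use.

```python
import random

def propose_word(theta_sp, c_sp, rng):
    """Speaker utters a word with probability proportional to theta_sp[k][c_sp]."""
    weights = [row[c_sp] for row in theta_sp]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def acceptance_rate(theta_li, c_li, w_new, w_old):
    """Listener-side rate z = min(1, P(c_li | w_new) / P(c_li | w_old)).

    Assumes strictly positive entries in theta_li (true for the toy values).
    """
    return min(1.0, theta_li[w_new][c_li] / theta_li[w_old][c_li])

def mh_exchange(theta_sp, c_sp, theta_li, c_li, w_old, rng):
    """One utterance: return the word the listener holds afterwards."""
    w_new = propose_word(theta_sp, c_sp, rng)
    z = acceptance_rate(theta_li, c_li, w_new, w_old)
    u = rng.random()  # u ~ Unif(0, 1)
    return w_new if u <= z else w_old

# Hypothetical toy parameters: K = 3 words, L = 2 categories per agent.
theta_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
theta_b = [[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]]
c_a, c_b = 0, 0  # categories each agent assigns to the shared object

rng = random.Random(0)
w = 0
counts = [0, 0, 0]
for _ in range(20000):
    w = mh_exchange(theta_a, c_a, theta_b, c_b, w, rng)  # A speaks, B judges
    w = mh_exchange(theta_b, c_b, theta_a, c_a, w, rng)  # B speaks, A judges
    counts[w] += 1
empirical = [n / sum(counts) for n in counts]

# Product distribution that Gibbs sampling would need both thetas to compute.
weights = [theta_a[k][c_a] * theta_b[k][c_b] for k in range(3)]
target = [x / sum(weights) for x in weights]
```

In this toy setting `empirical` should come close to `target`, even though each judgment only ever reads the listener's own parameters.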
In lines 21 to 29, the internal parameters of the listener are updated based on the judged words $W$. The process can be explained as the update of self-knowledge based on partial acceptance of the other agent's utterance. Because both the utterance and the acceptance of words use only self-knowledge, these processes are reasonable from the viewpoint of each agent.
As shown in Algorithm 2, words $W$, categories $C$, and parameters $\phi_l$ and $\theta_k$ are inferred by repeating this process for $I$ iterations while agents A and B exchange the roles of speaker and listener. This inference process is not only justified as a model of symbol emergence and category formation through semiotic communication between the agents, but also gives a mathematical guarantee on the inference of the model parameters.

Procedure
We performed an experiment to verify the validity of the proposed model and algorithm for modeling the dynamics of symbol emergence and concept formation. Specifically, we used an experiment of object categorization in the real world. We also discuss the functions required for semiotic communication in category formation and symbol emergence in multi-agent systems based on a comparison of three communication algorithms on the proposed model. Figure 5 shows an overview of the experiment.
The experiment was performed by the following procedure:
• Step 1: Capture and memorize N objects with M images for each object on agents A and B from different angles.
• Step 2: Convert the memorized images to observations.
• Step 4: Repeat step 3 with I iterations.
• Step 5: Evaluate the coincidence of words and categories between agents A and B for each object.
In the experiment, the number of objects N, the number of images M, and the number of iterations I were set to 10, 10, and 300, respectively. We performed steps 1 to 5 in 10 trials for a statistical evaluation.

Capturing and memorizing images (Step 1)
Figure 6 shows the experimental environment. Two cameras on agents A and B captured images of each object from different angles. Captured images were memorized on a computer. The resolution of a captured image was 640 pixels in width and 360 pixels in height. Target objects were a book, can, mouse, camera, bottle, cup, pen, tissue box, stapler, and scissors, as shown in the right side of Figure 6.

Converting memorized images to observations (Step 2)
An object's image captured by a camera is converted to a visual feature as an observation by Caffe (Jia et al., 2014), which is a framework for convolutional neural networks (CNN) (Krizhevsky et al., 2012) provided by the Berkeley Vision and Learning Center. The parameters of the CNN were trained by using the dataset from the ImageNet Large Scale Visual Recognition Challenge 2012. A visual feature $o_i \in \{o_1, o_2, \cdots, o_I\}$ was calculated for each image.
The kappa coefficient was used as an evaluation criterion indicating the coincidence of words between agents A and B. The kappa coefficient $\kappa$ was calculated by the following equation:
$$\kappa = \frac{C_o - C_e}{1 - C_e},$$
where $C_o$ is the coincidence rate of words between agents, and $C_e$ is the coincidence rate of words between agents by random chance. The kappa coefficient is judged as follows: Excellent ($\kappa > 0.8$), Good ($\kappa > 0.6$), Moderate ($\kappa > 0.4$), Poor ($\kappa > 0.0$).
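The kappa coefficient can be computed in a few lines. The sketch below assumes Cohen's definition of chance agreement, in which $C_e$ is estimated from each agent's word frequencies; the source does not state how $C_e$ was obtained, so this choice, the function name, and the example data are assumptions.

```python
from collections import Counter

def cohen_kappa(words_a, words_b):
    """Kappa = (C_o - C_e) / (1 - C_e) for two equally long word sequences.

    C_o: observed rate at which both agents use the same word for an object.
    C_e: chance agreement estimated from each agent's word frequencies
         (Cohen's kappa; an assumption of this sketch).
    Undefined (division by zero) when C_e equals 1.
    """
    n = len(words_a)
    c_o = sum(a == b for a, b in zip(words_a, words_b)) / n
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    c_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (c_o - c_e) / (1.0 - c_e)

# Hypothetical words assigned by agents A and B to four objects.
kappa = cohen_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"])
```

In this example the agents agree on three of four objects, but half of that agreement is expected by chance, so the coefficient falls in the moderate range of the scale above.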
The adjusted Rand index (ARI) was used as an evaluation criterion indicating the coincidence of categories between agents A and B. The ARI was calculated by the following equation:
$$\mathrm{ARI} = \frac{\mathrm{RI} - \text{Expected RI}}{\text{Max RI} - \text{Expected RI}},$$
where RI is the Rand index.
Welch's t-test was used for statistical hypothesis testing between the proposed algorithm and the two baseline algorithms, i.e., no communication and no rejection.
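For reference, the ARI between two category assignments can be computed from pair counts as in the following sketch. It uses the standard contingency-table form of the adjusted Rand index; the function name and the example labels are hypothetical, and at least two items are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items (n >= 2)."""
    n = len(labels_a)
    # Number of item pairs that share a cell of the contingency table.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    # Number of item pairs that share a category within each labeling.
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g., every item in its own category
    return (sum_cells - expected) / (max_index - expected)

# Category indices need not match between agents: only the grouping matters.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_diff = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index compares groupings rather than label values, it suits the comparison of categories formed independently by agents A and B, whose category indices are arbitrary.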

Experimental results
Table 2 shows the experimental results: the kappa coefficient and ARI for the proposed algorithm and two baseline algorithms, i.e. no communication and no rejection.
For the kappa coefficient on words between agents A and B, the proposed algorithm obtained a higher value than the baseline algorithms, and there were significant differences between the proposed algorithm and baseline algorithms in the t-test.The result implies that the agents used the same words for observations with a very high coincidence (of 0.8 or more) in the proposed algorithm.
The ARI for the proposed algorithm was higher than for the baseline algorithms, and there were significant differences between the proposed algorithm and no rejection. In case of no rejection, the word has a negative effect on categorization between the agents, compared with the result of no communication. On the other hand, in the proposed algorithm, which stochastically accepts the other agent's word based on self-knowledge, the word acts positively on the categorization between agents. This result suggests that a rejection strategy in semiotic communication works as an important function in language evolution. Our result also suggests that such a strategy is both biologically and mathematically feasible.
Figure 7 shows transitions in the kappa coefficients of words between agents A and B for the proposed algorithm, no rejection, and no communication in ten trials. In case of no communication, because all words of the other agent were rejected, the coincidence of words was at the level of random chance. In case of no rejection, although the coincidence of words increases in the initial iterations, it drifts and stagnates at approximately 0.55, which is a moderate value. In case of the proposed algorithm, the kappa coefficient is higher than for the baseline algorithms, and it was confirmed that words become increasingly shared as the number of iterations increases.
Figure 8 shows transitions for the ARIs of categories between agents A and B for the proposed algorithm, no rejection, and no communication in ten trials. Compared with the baseline algorithms, it was confirmed that the proposed algorithm gradually and accurately forms and shares the categories between the agents. Compared with no communication, the proposed algorithm promoted the sharing of categories from approximately 80 iterations, and obtained the highest ARI after approximately 150 iterations. The result suggests the dynamics of symbol emergence and concept formation, where symbol communication slowly affects the category formation in an agent and promotes sharing of the categories between the agents. It is a cognitively natural result: repetition of semiotic communication in the same environment gradually causes the sharing of categories between the agents.
For qualitative evaluation, we show the words assigned to each of three objects: a bottle, a can, and a book. Figure 9 shows examples of the object images observed from the viewpoints of agents A and B. Table 3 shows the words sampled by agents A and B for the three example objects under three algorithms: the proposed algorithm and two baseline algorithms (no rejection and no communication). The sampled words are given as the three most frequent words out of ten sampled words for each object, and the rate of each word among the ten sampled words is shown in parentheses. In case of no communication, a word representing an object was not shared between the agents. In case of no rejection, words representing an object such as "i," "b," and "c" (for the bottle) were shared, but the probabilities of the words are not high. In case of the proposed algorithm, it was confirmed that a word representing an object was shared between the agents with a high probability.
Table 3. Sampling results of words for three example objects by three communication algorithms, i.e., no communication, no rejection, and the proposed algorithm. The sampling results are described as the 1st, 2nd, and 3rd words in ten sampled words for each object. The rate of a word among the ten sampled words is given in parentheses.
To evaluate the accuracy of categorization of actual objects, the ARI between the object labels and the categories formed by the proposed algorithm is shown in Figure 10. At the 300th iteration, the proposed algorithm shows high accuracy for unsupervised categorization of real-world objects, despite being influenced by the words of the other agent.
Figure 11 shows, as confusion matrices, how each agent learned the correspondence between words and objects. As the number of iterations increased, the correspondence between words and object labels changed from random to one-to-one. After 300 iterations, each word was allocated to describe one object.
Figure 12 shows the results of principal component analysis (PCA) applied to the confusion matrices of agents A and B in Figure 11. The results are shown from 30 to 300 iterations at intervals of 30 iterations, in two and three dimensions. As the number of iterations increases, the PCA results for the confusion matrices of the two agents move closer together. This can be interpreted as a process in which the agents' systems for interpreting words and objects converge through repeated semiotic communication.
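One way such a comparison could be set up is sketched below. It assumes that each confusion matrix is flattened into a vector and that all checkpoints of both agents are projected together onto the leading principal components; the text does not state the exact preprocessing, so this is only an assumption. The function `pca_project` and the synthetic matrices are hypothetical and are not the experimental data.

```python
import numpy as np


def pca_project(confusion_matrices, n_components=2):
    """Project flattened confusion matrices onto their leading principal components."""
    flat = np.stack([np.asarray(m, dtype=float).ravel() for m in confusion_matrices])
    centered = flat - flat.mean(axis=0)
    # Rows of `components` are the principal axes, ordered by explained variance.
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:n_components].T


# Synthetic stand-in (NOT experimental data): two agents whose word-object
# matrices move from random toward the same one-to-one mapping.
rng = np.random.default_rng(0)
target = np.eye(4)
progress = [0.0, 0.5, 1.0]
agent_a = [(1 - t) * rng.random((4, 4)) + t * target for t in progress]
agent_b = [(1 - t) * rng.random((4, 4)) + t * target for t in progress]

points = pca_project(agent_a + agent_b)
# Distance between the two agents' projected points at each checkpoint.
gaps = [float(np.linalg.norm(points[i] - points[i + len(progress)]))
        for i in range(len(progress))]
print(gaps)
```

Projecting both agents' matrices in a single PCA, rather than fitting one PCA per agent, keeps the points in a common coordinate system so that distances between agents are comparable across checkpoints.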

Discussion
We evaluated the validity of the proposed model and algorithm as a model of the dynamics of symbol emergence and category formation through an experiment using daily objects in a real-world environment. In the experiment, we compared the processes of symbol emergence and category formation between the agents under three communication algorithms: the proposed algorithm, no rejection, and no communication. The experimental results demonstrated the following three findings.
• In the case of no communication, in which the agent rejects all of the other agent's utterances, the categories of the two agents coincided to a high degree, but words were not shared between the agents. This result can be understood as follows: similar categories are formed when two agents with similar sensors individually observe the same object.
• In the case of no rejection, in which the agent unconditionally accepts the other agent's utterances and updates its internal parameters, the coincidence of words drifted and stagnated, and the coincidence of categories decreased compared with no communication. This result can be understood as follows: utterances of the other agent that use different symbols act as noise that interferes with categorization within the individual agent.
• In the proposed algorithm, which probabilistically accepts the other agent's utterances based on the agent's internal parameters, the coincidence of words was very high, and the coincidence of categories was also high compared with the other algorithms. This result is plausible as a mechanism of symbol emergence and category formation based on human semiotic communication.
Furthermore, the results suggest that, in the dynamics of symbol emergence and category formation between agents, semiotic communication needs a function for rejecting the other agent's utterances based on one's own knowledge. Our result can therefore be interpreted as both biologically and mathematically feasible.
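The three acceptance rules compared above can be sketched as follows. The always-reject and always-accept rules follow directly from the descriptions of no communication and no rejection. For the proposed rule, the sketch assumes a Metropolis-Hastings style ratio in which the listener compares the probability of its own category under the proposed word with that under its current word; this exact form of the ratio, the table `theta[word][category]`, and all function names are assumptions made for illustration.

```python
import random


def acceptance_probability(theta, category, proposed_word, current_word):
    """M-H style acceptance probability from the listener's own parameters.

    theta[word][category] is assumed to hold the listener's probability of
    `category` given `word`.
    """
    p_proposed = theta[proposed_word][category]
    p_current = theta[current_word][category]
    if p_current <= 0.0:
        return 1.0  # any proposal is at least as good as an impossible state
    return min(1.0, p_proposed / p_current)


def accepts(mode, theta, category, proposed_word, current_word, rng=random):
    """Decide whether the listener adopts the speaker's proposed word."""
    if mode == "no_communication":
        return False  # every utterance of the other agent is rejected
    if mode == "no_rejection":
        return True   # every utterance is accepted unconditionally
    if mode == "proposed":
        prob = acceptance_probability(theta, category, proposed_word, current_word)
        return rng.random() < prob
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical listener parameters for two words and two categories.
theta_listener = {"a": [0.4, 0.6], "b": [0.1, 0.9]}
print(acceptance_probability(theta_listener, 0, "b", "a"))  # proposal fits worse
print(acceptance_probability(theta_listener, 1, "b", "a"))  # proposal fits better
```

Under this rule, a proposed word that explains the listener's own category at least as well as the current word is always accepted, while a worse-fitting word is accepted only occasionally, so the other agent's utterances influence the listener without overriding its own categorization.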

CONCLUSIONS
This study focused on symbol emergence in a multi-agent system and category formation in individual agents through semiotic communication, that is, the generation and interpretation of symbols associated with categories formed from the agent's perception. We proposed a model and an inference algorithm that represent the dynamics of symbol emergence and category formation through semiotic communication between agents as an interpersonal multimodal categorizer. We showed the validity of the proposed model and inference algorithm for the dynamics of symbol emergence and concept formation in a multi-agent system through a mathematical explanation and an experiment on object categorization and symbol emergence in a real environment. The experimental results on object categorization under three communication styles, i.e., no communication, no rejection, and the proposed algorithm based on the proposed model, suggested that semiotic communication needs a function for rejecting the other agent's utterances based on one's own knowledge in the dynamics of symbol emergence and category formation between agents.
This study did not model the emergence of grammar. However, the proposed model and algorithm succeeded in giving a mathematical explanation for the dynamics of symbol emergence in a multi-agent system and category formation in individual agents through semiotic communication. This means that our study indicates a direction for treating a multi-agent system logically in symbol emergence and category formation.
As future work, we are extending the proposed model according to the mutual segmentation hypothesis of sound strings and situations based on co-creative communication (Okanoya and Merker, 2007). The extension will proceed through the following research steps.
• The extension to a mutual segmentation model of sound strings and situations based on multimodal information will build on a multimodal LDA with a nested Pitman-Yor language model (Nakamura et al., 2014) and a spatial concept acquisition model that integrates self-localization and unsupervised word discovery from spoken sentences (Taniguchi et al., 2016a).
• To reduce the development and calculation costs associated with the large-scale model, "Serket: An Architecture for Connecting Stochastic Models to Realize a Large-Scale Cognitive Model" (Nakamura et al., 2018) will be used.
• An experiment with N agents will be performed on symbol emergence and concept formation by expanding the proposed model. The experiment can be designed around a communication structure based on human conversation, because human conversation is usually conducted between two people. In a related study, Oshikawa et al. (2018) proposed a Gaussian process hidden semi-Markov model, which enables robots to learn rules of interaction between persons by observing them in an unsupervised manner.
• The experimental results have shown the importance of a rejection strategy, but evidence that the human brain uses such a strategy has not been shown. We are planning to conduct psychological experiments.
• As an exploratory argument, mapping category c to observation o is theoretically possible for a neural network. A future study can develop a deep generative model, which integrates deep learning and generative models, by applying multimodal learning with deep generative models (Suzuki et al., 2016).

Figure 1. Overview of the symbol emergence system (Taniguchi et al., 2016b).

Figure 2. The expansion of a multimodal categorizer from personal to interpersonal: (a) shows a generative model of a personal multimodal categorizer between haptics and vision, and (b) shows a generative model of an interpersonal multimodal categorizer between the agents. Dashed lines in (b) show communication between agents. The parameters of these models are simplified.

Figure 3. Graphical model of the proposed interpersonal multimodal categorizer. Index $w_d$ (of word $w$) connects agents A and B as a hidden variable and generates the index of a category $c_d$ from the parameter $\theta_k$ of a multinomial distribution in each agent. $o^A_d$ and $o^B_d$ are the observations for data point $d$ obtained from the sensors attached to agents A and B, respectively. $c^A_d$ and $c^B_d$ are the indices of the categories allocated to observations $o^A_d$ and $o^B_d$, respectively. $\phi^A_l$ and $\phi^B_l$ are the parameters of the multinomial distributions that generate observations $o^A_d$ and $o^B_d$ based on categories $c^A_d$ and $c^B_d$. $\alpha$ and $\beta$ are the hyperparameters of $\theta$ and $\phi$. $K$ is the number of words in the word dictionary that a robot has. $L$ is the number of categories. $D$ is the number of observed data points. The multinomial distribution is denoted as Multi(•), and the Dirichlet distribution is denoted as Dir(•).
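The generative structure described in the Figure 3 caption can be sketched as forward sampling. The sketch assumes that θ is drawn from a Dirichlet distribution with hyperparameter α and φ from one with hyperparameter β, that the shared word index is drawn uniformly (the caption does not give its prior), and that each observation is a count vector over a hypothetical vocabulary of V visual features. The function names and the parameters `vocab_size` and `n_features` are illustrative, not part of the original model description.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet via normalized gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(n_words, n_categories, vocab_size, n_data,
             alpha=1.0, beta=1.0, n_features=20, seed=0):
    """Forward-sample words, categories, and observations for agents A and B."""
    rng = random.Random(seed)
    agents = {
        name: {
            # theta[k]: distribution over categories for word k (assumed Dir(alpha)).
            "theta": [sample_dirichlet(alpha, n_categories, rng) for _ in range(n_words)],
            # phi[l]: distribution over features for category l (assumed Dir(beta)).
            "phi": [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_categories)],
        }
        for name in ("A", "B")
    }
    data = []
    for _ in range(n_data):
        word = rng.randrange(n_words)  # assumption: uniform prior over the shared word
        point = {"w": word}
        for name, params in agents.items():
            category = sample_index(params["theta"][word], rng)
            counts = [0] * vocab_size
            for _ in range(n_features):
                counts[sample_index(params["phi"][category], rng)] += 1
            point[name] = {"c": category, "o": counts}
        data.append(point)
    return data


samples = generate(n_words=5, n_categories=3, vocab_size=8, n_data=4)
print(samples[0])
```

The only quantity shared by the two agents in this sketch is the word index; each agent keeps its own θ and φ, which mirrors the caption's description of the word as the hidden variable connecting agents A and B.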
) are repeated for each data point ($d \in \{1, 2, \ldots, D\}$):
• Observations $o^A_d$ and $o^B_d$ generated from categories $c^A_d$ and $c^B_d$ are shown as follows:

Figure 4. Dynamics of symbol emergence and category formation through semiotic communication between the agents in the proposed method.

Figure 5. Overview of the experiment: agents A and B observed N objects placed in front of them. An agent captures images and suggests words M times for each object.

• Step 2: Convert a memorized image to a visual feature as observations $o^A_d$, $o^B_d$ for agents A and B.
• Step 3: Sample $w_d$ and update the model parameters from observations $o^A_d$, $o^B_d$ by the M-H algorithm. This step corresponds to semiotic communication between agents A and B based on the opponent's utterances and self-organized categories.

Figure 7. Transition of the kappa coefficient of words between agents: each line shows the average value, and the top and bottom of each colored band show the maximum and minimum values over ten trials.

Figure 10. ARI between object labels and categories formed by the proposed algorithm in ten trials with agents A and B. The horizontal axis and vertical axis show the iteration and ARI, respectively.

Figure 11. Confusion matrix between words and object labels in each agent. The horizontal axis and vertical axis show the index of the object label and of the word, respectively. The words were sorted according to their frequency for each object in agent A at 300 iterations.

Table 2. Kappa coefficient on words and ARI on categories between agents A and B: the result is described with mean, standard deviation (SD), p-value, and t-test for three algorithms: no communication, no rejection, and the proposed algorithm. In the t-test, **: (p < 0.01), *: (p < 0.05), n.s.: not significant.
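For reference, a kappa coefficient on words can be computed as in the following minimal sketch. It assumes Cohen's kappa over the words that the two agents assign to the same data points; the text does not specify the variant, so this choice, the function name `cohen_kappa`, and the example word lists are illustrative assumptions.

```python
from collections import Counter


def cohen_kappa(words_a, words_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("both agents must label the same non-empty set of points")
    n = len(words_a)
    observed = sum(a == b for a, b in zip(words_a, words_b)) / n
    freq_a, freq_b = Counter(words_a), Counter(words_b)
    # Chance agreement if each agent chose words independently at its own rates.
    expected = sum(freq_a[w] * freq_b[w] for w in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both agents always use the same single word
    return (observed - expected) / (1.0 - expected)


# Hypothetical word assignments for four data points (illustrative only).
print(cohen_kappa(["a", "b", "c", "a"], ["a", "b", "c", "a"]))  # full agreement
print(cohen_kappa(["a", "a", "b", "b"], ["a", "b", "a", "b"]))  # chance level
```

Unlike the ARI, kappa compares the labels themselves, so it is high only when the agents use the same word for the same data point, which is what sharing a sign requires.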