Emergent communication of multimodal deep generative models based on Metropolis-Hastings naming game

Deep generative models (DGM) are increasingly employed in emergent communication systems. However, their application in multimodal data contexts remains limited. This study proposes a novel model that combines a multimodal DGM with the Metropolis-Hastings (MH) naming game, enabling two agents to attend jointly to a shared subject and develop common vocabularies. The model is shown to handle multimodal data, even when modalities are missing. Integrating the MH naming game with multimodal variational autoencoders (VAE) allows agents to form perceptual categories and exchange signs in multimodal contexts. Moreover, tuning the weight ratio to favor a modality that the model could learn and categorize more readily improved communication. Our evaluation of three multimodal approaches (mixture-of-experts (MoE), product-of-experts (PoE), and mixture-of-products-of-experts (MoPoE)) suggests that the choice of approach shapes the latent spaces, i.e., the agents' internal representations. Our results from experiments with the MNIST + SVHN and Multimodal165 datasets indicate that combining the Gaussian mixture model (GMM), the PoE multimodal VAE, and the MH naming game substantially improved information sharing, knowledge formation, and data reconstruction.

• Adjusted Rand Index (ARI): Measures the agreement between the ground-truth clustering and the predicted clustering, corrected for chance:

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right] \big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right] \big/ \binom{n}{2}}$$

where $n_{ij}$ is the number of data points common between cluster $i$ in the ground truth and cluster $j$ in the predicted clustering, $a_i = \sum_j n_{ij}$, $b_j = \sum_i n_{ij}$, and $n$ is the total number of points. Higher values are better, with ARI = 1 indicating perfect agreement, while ARI = 0 suggests a random clustering.
• Davies-Bouldin Score (DBS) [Davies and Bouldin, 1979]: Evaluates the quality of cluster assignments by relating each cluster's average similarity measure to that of its most similar cluster. Lower values indicate better clustering:

$$\mathrm{DBS} = \frac{1}{k}\sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{d(i, j)}$$

where $S_i$ is the average distance between each point of cluster $i$ and the centroid of that cluster, and $d(i, j)$ is the distance between the centroids of clusters $i$ and $j$. The DBS metric is minimized for optimal clustering.
• Fréchet Inception Distance (FID) [Heusel et al., 2017]: Measures the distance between the distributions of real and generated images, offering a more robust metric than direct pixel-wise comparisons:

$$\mathrm{FID} = \lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\!\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_1 \Sigma_2\right)^{1/2}\right)$$

where $\mu_1$ and $\mu_2$ are the sample means of the real and generated images, respectively, and $\Sigma_1$ and $\Sigma_2$ are their covariances. Lower values are better, indicating that the generated images are more similar to the real images.
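The three metrics above can be computed with standard tooling. As a minimal sketch (not the paper's own evaluation code), ARI and DBS are available in scikit-learn, and the FID formula can be implemented directly from the means and covariances; the toy data below is illustrative only:

```python
import numpy as np
from scipy import linalg
from sklearn.metrics import adjusted_rand_score, davies_bouldin_score

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        # sqrtm can return tiny imaginary parts due to numerical error
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# ARI: label names do not matter, only the partition; a relabeled
# copy of the ground truth still scores ARI = 1.
truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]
ari = adjusted_rand_score(truth, pred)

# DBS: needs the data points and their cluster assignments; two tight,
# well-separated clusters give a score close to 0 (lower is better).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
dbs = davies_bouldin_score(X, [0, 0, 1, 1])

# FID: identical distributions (same mean and covariance) give 0.
mu, sigma = np.zeros(2), np.eye(2)
fid_same = frechet_distance(mu, sigma, mu, sigma)
```

In practice FID is computed on Inception-network feature statistics of the two image sets rather than on raw pixels, but the Gaussian distance itself is exactly the formula above.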
Figure 2 presents the VAE network architecture used in experiment 1, designed for two distinct modalities: the top VAE is configured for the MNIST dataset, while the bottom VAE is designed for the SVHN dataset. Figure 3 illustrates the VAE network structure implemented in experiment 2, which incorporates three VAEs for three separate modalities: the first VAE handles the image (visual) modality, the second is configured for the auditory modality, and the third is dedicated to the haptic modality.
The terms "Conv," "ConvTrans," "Linear," "ReLU," and "Sigmoid" refer to convolutional layers, transposed convolutional layers, fully connected layers, the Rectified Linear Unit activation function, and the Sigmoid activation function, respectively. The notation "a*b*c" represents the data dimensions as "channels*width*height."

Figure 2: The network architecture of the VAE used in experiment 1. The first is for MNIST, while the second is for SVHN.
Figure 3: The network architecture of the VAE used in experiment 2. The first is for the vision modality, the second for the audio modality, and the last for the haptic modality.

The results are shown in Figures 4, 5, and 6, while the t-SNE visualization of latent spaces is shown in Figure 7.

Figure 1: Visualization of t-SNE clustering for the Inter-GMM+VAE applied to the MNIST dataset at different vocabulary sizes: 10 (top row), 20 (second row), 50 (third row), and 100 (bottom row). The first column belongs to agent A, while the second belongs to agent B.

Figure 7: The t-SNE visualization of latent spaces of Inter-GMM+weighted-β-MVAE with MoE (left column), PoE (middle column), and MoPoE (right column) for experiment 1, illustrating the clustering of data across ten classes corresponding to the 10 digits.

Table 1: Performance evaluation of Inter-GMM+VAE on the MNIST dataset across various vocabulary sizes using the Kappa, DBS, and FID metrics. The results show how the model handles increasing vocabulary sizes.