Crossmodal Language Grounding in an Embodied Neurocognitive Model

Heinrich, Stefan; Yao, Yuan; Hinz, Tobias; Liu, Zhiyuan; Hummel, Thomas; Kerzel, Matthias; Weber, Cornelius; Wermter, Stefan

doi:10.3389/fnbot.2020.00052

ORIGINAL RESEARCH article

Front. Neurorobot., 14 October 2020

Volume 14 - 2020 | https://doi.org/10.3389/fnbot.2020.00052

This article is part of the Research TopicCross-Modal Learning: Adaptivity, Prediction and InteractionView all 23 articles

Crossmodal Language Grounding in an Embodied Neurocognitive Model

Stefan Heinrich^1,2^*

Thomas Hummel¹

¹Knowledge Technology Group, Department of Informatics, Universität Hamburg, Hamburg, Germany
²International Research Center for Neurointelligence, The University of Tokyo, Tokyo, Japan
³Natural Language Processing Lab, Department of Computer Science and Technology, Tsinghua University, Beijing, China

Human infants are able to acquire natural language seemingly easily at an early age. Their language learning seems to occur simultaneously with learning other cognitive functions as well as with playful interactions with the environment and caregivers. From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired by means of crossmodal integration. However, characterizing the underlying mechanisms in the brain is difficult and explaining the grounding of language in crossmodal perception and action remains challenging. In this paper, we present a neurocognitive model for language grounding which reflects bio-inspired mechanisms such as an implicit adaptation of timescales as well as end-to-end multimodal abstraction. It addresses developmental robotic interaction and extends its learning capabilities using larger-scale knowledge-based data. In our scenario, we utilize the humanoid robot NICO in obtaining the EMIL data collection, in which the cognitive robot interacts with objects in a children's playground environment while receiving linguistic labels from a caregiver. The model analysis shows that crossmodally integrated representations are sufficient for acquiring language merely from sensory input through interaction with objects in an environment. The representations self-organize hierarchically and embed temporal and spatial information through composition and decomposition. This model can also provide the basis for further crossmodal integration of perceptually grounded cognitive representations.

1. Introduction

While research in natural language processing has advanced in specific disciplines such as parsing and classifying large amounts of text, human-computer communication is still a major challenge, due to multiple aspects: speech recognition is limited to good signal-to-noise conditions or well-adapted models, dialogue systems depend on a well-defined context, and language elements are difficult to reconcile with the environmental situation. Consequently, interactive robots that match human communication performance are not yet available. One reason for this is the fact that the crossmodal binding between language, actions, and visual events is not yet fully understood and was thus not realized in technical systems that have to interact with humans (Hagoort, 2017).

Imaging techniques such as Functional Magnetic Resonance Imaging (fMRI) have provided a better understanding of which areas in the cortex are involved in natural language processing and that these areas include somatosensory regions. Language studies have shown that there is a tight involvement of crossmodal sensation and action in speech processing and production as well as in language comprehension (Friederici and Singer, 2015). Thus, there is increasing evidence that human language is embodied. This means that it is grounded in most sensory and sensorimotor modalities and that the human brain architecture favors the acquisition of language by means of crossmodal integration (Pulvermüller, 2018).

As a consequence, research on cognitive modeling and developmental robotics is working toward developing models for natural language processing that reflect our understanding of distributed processing and embodied grounding of language in the brain. This way, the overall goal of studying the problem of language grounding in crossmodal perception and action can get approached. A particularly important aim is to develop a model for language grounding which reflects bio-inspired mechanisms and minimized difficult assumptions for the computational mechanisms.

In this paper, we present an embodied neurocognitive model for crossmodal language grounding that is trained in an end-to-end fashion. Additionally, we explore the concepts of varying multiple timescales in processing as well as distributed cell assemblies in representation learning. Based on the proposed model, we aim to investigate the characteristics of the learned crossmodally integrated representations.

1.1. Related Work

In order to bridge the gap between formal linguistics and bio-inspired systems, several valuable computational models have been developed that bring together language and an agent's multimodal perception and action. In their seminal Cross-channel Early Lexical Learning (CELL) model, Roy and Pentland (2002) demonstrate word learning from real sound and vision input. Each of these inputs is processed into a fixed-length vector, then lexical items arise by associations between vectors that represent the corresponding speech and an object's shape. Roy (2005) also highlights the importance of combining physical actions and speech in order to interpret words and basic speech acts in terms of schemas, which are grounded through a causal-predictive cycle of action and perception. Several works use self-organizing maps (SOMs), e.g., to form joint neural representations of simulated robot actions and abstract language input to encode the corresponding sensory-motor schemata (Wermter et al., 2005). This model addresses mirror neurons found in the motor cortical region F5, which link actor and observer by activating when performing a corresponding action or even just seeing or hearing it performed by someone else (Rizzolatti and Arbib, 1998). Vavrečka and Farkaš (2014) use a RecSOM (Voegtlin, 2002) which has a recurrent architecture with recursive updates to handle sequential input. Using a RecSOM and multiple SOMs, arranged in parallel for linguistic and visual input, and hierarchically for the integration of modalities, the model grounds spatial phrases within the corresponding image information.

Recent works often make reference to biological findings that support grounded language processing. Friederici and Singer (2015) provide evidence that linguistic and other cognitive functions are based on similar neuronal mechanisms, for example, single neurons react similarly to seeing a picture of a person's face or reading the person's name. More generally, Pulvermüller et al. (2014) propose a cognitive theory of distributed neuronal assemblies or thought circuits, integrating brain mechanisms of perception, action, language, attention, memory, decision, and conceptual thought. Rather than by SOMs, these neuroscience findings are better accounted for by distributed neural firing models. For example, in a multi-area model of cortical processing (Garagnani and Pulvermüller, 2016), some neurons become category-general while others are in category-specific semantic areas.

Among recurrent neural models, the multiple timescale recurrent neural network (MTRNN) (Yamashita and Tani, 2008) allows the emergence of a functional hierarchy with reusable sequence primitives. Heinrich and Wermter (2018) ground the generation of language in visual and motor proprioceptive signals, showing that an MTRNN can self-organize latent representations that feature hierarchical concept abstraction and concept decomposition. Zhong et al. (2019) address the generalization ability of MTRNNs by making use of semantic compositionality of simple verb-object sentences. They train an iCub robot to produce action sequences following a simple verb-object sentence comprising a selection of 9 verbs and 9 objects, where the network generalizes to all combinations despite being trained only on a subset. Yamada et al. (2017) investigate the handling of logic words in sentences from which an Long Short-Term Memory (LSTM) network generates corresponding robot actions. They show that, for example, the word “and” works like a universal quantifier, while the word “or” creates an unstable space in the LSTM dynamics.

While these models are used unidirectionally, bidirectional models have been proposed that can map both perceived language commands to actions and perceived actions to language descriptions. For this task, Yamada et al. (2018) train two paired recurrent autoencoders, one encoding the textual description sequence, the other encoding the action sequence. The autoencoders are paired by a joint loss function term that drives the two autoencoders' center-layer representations, which both have the same dimensionality, to be similar. As a result, a textual description leads to a representation that is suitable to generate an action sequence, and vice versa. For interactivity, the action sequence autoencoder receives additional image input in both encoder and decoder, while producing only the joint angle sequences as output. In each autoencoder, the direction of information flow between layers is fixed from input toward the output. In contrast, Antunes et al. (2018) implement a model of truly bidirectional information flow between three recurrent MTRNN layers of fast, medium, and slow timescale units. A subset of the fast units acts as input (or output) to a robot action sequence, and a subset of the slow layer's units acts as output (or input) to the language description. However, it needs to be investigated whether neural groups emerge that are solely devoted to information transmission into one of the directions, or, rather, whether shared bidirectional functionality emerges.

Another line of recent works shows that enriching linguistic data with other modalities can lead to better-performing systems. For example, continuous word representations like word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) have become popular, since they span some semantically meaningful low-dimensional space leading to robustness and to the possibility to track relations between words. Additionally, the original words can be recovered from the representations even when they are corrupted or altered by noise. These embeddings can become even more powerful when involving multiple modalities. Hill and Korhonen (2014) train a word2vec-like model on the ESPGame dataset, which annotates images with a list of lexical concepts, and on the CSLB Property Norms dataset which contains concepts for which human annotators produced several semantic properties. Lazaridou et al. (2015) train a similar model on text from Wikipedia and add visual information from the ImageNet database to a subset of the words, which is processed into an abstract vector by a pre-trained Convolutional Neural Network (CNN). Wang et al. (2018b) use GloVe vectors pre-trained on the Common Crawl dataset together with CNN-based visual vectors pre-trained on ImageNet. Auditory features extracted from a CNN network pre-trained on Google's AudioSet data are included in Wang et al. (2018a). The results of these models show that multimodal embeddings outperform unimodal embeddings. Furthermore, suitable images can be generated not only for concrete words but also for some abstract words by selecting the nearest neighbor image for a generated image vector (Wang et al., 2018a). For reinforcement learning interactive game agents, it was shown that augmenting environmental information with language descriptions (Narasimhan et al., 2018) or instructions (Chaplot et al., 2018) leads to better generalization and transfer capabilities.

There is also a recent focus on tasks like image captioning, Visual Question Answering (VQA), and phrase grounding in images. In these tasks, sequentially processed language refers to elements of images and the availability of corresponding large datasets for supervised learning has driven model development. VQA research, for example, led to neural architectures that facilitate reasoning steps, e.g. by affine transformations within the visual processing stream based on conditioning information from the question (Perez et al., 2018), by novel recurrent Memory, Attention, and Composition (MAC) cells (Hudson and Manning, 2018), or by more explicitly using graphs for reasoning (Hudson and Manning, 2019). Yet, these models do not cover the production of language, since VQA tasks are cast as classification problems where the network produces only the label to the correct answer among a given set of answers. Instead, they are tailored toward reasoning, but often fail in generalization, if their architecture is not primed for the task (Santoro et al., 2017). A potential reason for the lack of generalization can be in the poor integration of language and image representations by these models, since they are not embodied in interactive agents, which Burgard et al. (2017) suggest.

Overall this shows the need for an embodied neurocognitive model that can help to explain language processing in the brain and at the same time proves to be effective in generalization. To this end, we need to more closely look into components of both temporal decomposition and composition and at the same time realize an inherent multimodal abstraction on both sensory as well as conceptual level. It seems crucial that temporal decomposition and composition directly emerges in a model based on the context or the data, while multimodal abstraction needs to take place on sensory up to an overall contextual level.

1.2. Contribution

In this paper, we develop a neurocognitive model that grounds language production into embodied crossmodal perception. In particular, our model aims to map the auditory, sensorimotor, and visual perceptions onto the production of verbal utterances during the interaction of a learner with objects in its environment.

As a core characteristic, the model allows for the implicit adaptation of timescales based on the temporal characteristics of both perception and language production. Furthermore, the model tests multimodal abstraction in an end-to-end fashion with limited constraints on the preprocessing of the sensory input. The model is analyzed in depth based on a developmental robotics data recording that mimics natural interactions of an infant with said objects. This Embodied Multi-modal Interaction in Language learning (EMIL) data collection challenges the model by introducing a wider range of variability of the temporally dynamic sensory features, in order to exhibit effects on language learning and latent representation formation concerning findings for the human brain.

Therefore, the contribution of this paper is three-fold¹:

• We present a neurocognitive model for language grounding which reflects bio-inspired mechanisms such as an implicit adaptation of timescales as well as end-to-end multimodal abstraction. It addresses developmental robotic interaction and extends its learning capabilities using larger-scale knowledge-based data.

• We demonstrate the effectiveness of our model on the novel EMIL data collection, in which the cognitive robot interacts with objects in a children's playground environment while receiving linguistic labels from a caregiver.

• We conduct an in-depth analysis of the model on the real-world multimodal data and draw several important conclusions. For example, crossmodally integrated representations are sufficient for acquiring language merely from sensory input through interaction with objects in an environment.

2. Embodied Neurocognitive Model

In order to add insight to related computational models, we aim to develop a model that satisfies a number of constraints. First, we seek to minimize difficult assumptions for computational mechanisms. In particular, we avoid building on top of mechanisms that are appealing for machine learning but not yet proven or not plausible for the processing in the brain such as neural gating, dropout regularization, or residual connections. In fact, we aim at building on top of the most simple computational architecture that still allows studying our proposed mechanisms. Second, we work with a minimal level of assumptions regarding language grounding. Here, we avoid using an oversimplified language such as modeling on word-level only. Additionally, we do not use natural speech but rather a simpler phonetic representation as the desired output. We will build our computational model with a distinct focus on the following biological mechanisms.

2.1. Biological Inspiration

It has been suggested that the human cognition is particularly strong because the human brain is good at both information composition and decomposition (Murray et al., 2014). Furthermore, it seems that many processes in the brain are reused in or coupled to a range of cognitive functions. In the brain, the decomposition and composition are governed by neural oscillations, multiple timescales in hierarchical processing streams, and a complex interplay of neural populations and local integration by mode coupling (Buzsáki and Draguhn, 2004; Badre et al., 2010; Engel et al., 2013). Additional evidence suggests that in higher stages of the spatial or temporal hierarchy neurons are organized in cell assemblies (Damasio, 1989; Palm, 1990; Levelt, 2001). These sparsely connected webs of neurons are distributed over different cortical areas and both hemispheres and form consistently during development for concepts on higher or lower levels.

In language grounding, both multiple timescales and cell assemblies seem to be reused. Multiple timescales in processing have been reported across the brain from lower auditory processing up to higher processing of perception (Ulanovsky et al., 2004; Smith and Kohn, 2008; Himberger et al., 2018) and cell assemblies are suggested to activate for both word processing as well as the overall thought processes (van der Velde, 2015; Tomasello et al., 2019). As a consequence, in our computational model, we further study the mechanisms of multiple timescales in information processing as well as crossmodal fusion by and sequence activation from cell assemblies.

2.2. Computational Model

We base our computational model on the Continuous Time Recurrent Neural Networks (CTRNN) architecture because of its universality in modeling sequential signals. Although we can derive the CTRNN from the leaky integrate-and-fire model and thus from a simplification of the Hodgkin-Huxley model from 1952, the network architecture was suggested independently by Hopfield and Tank (1986) as a nonlinear graded-response neural network and by Doya and Yoshizawa (1989) as an adaptive neural oscillator. Specifically, the CTRNN can be understood as a generalization of the Hopfield Network (Hopfield, 1982) with continuous firing rates and arbitrary leakage in terms of time constants. Compared to the Simple Recurrent Network (SRN, or Elman Network), the timescale τ is an additional hyperparameter of asymptotically not leaking, thus, a neuron can maintain part of its information for a longer period of time.

The activation y of CTRNN units is defined as follows:

\begin{array}{l} y_{t} = f (z_{t}), & (1) \end{array}

\begin{array}{l} z_{t} = (1 - \frac{Δ t}{τ}) z_{t - Δ t} + \frac{Δ t}{τ} (W x + V y_{t - Δ t} + b), & (2) \end{array}

for inputs x, previous internal states z_t−Δt, input weights W, recurrent weights V, bias b, and an activation function f. The timescale can be a pre-determined common parameter τ for all neurons or a vector τ of individual constants. In tasks with discrete numbers of time steps, the CTRNN can be employed as a discrete model, e.g., by setting Δt = 1.

With respect to modeling multiple timescales in information processing, the timescale parameter τ provides an interesting mechanism to capture sequential aspects on different timescales or periodicities and is particularly crucial for the hierarchical abstraction capability of the Multiple Timescale Recurrent Neural Network (MTRNN, compare Yamashita and Tani, 2008). Our model, therefore, integrates this predefined hierarchical abstraction. In particular, a fixed number of layers is defined a priori, e.g., having three adjacent layers called Input-Output (IO, τ = 2), Context-fast (Cf, τ = 5), and Context-slow (Cs, τ = 70), in order to force the architecture to hierarchically compose or decompose information.

In order to achieve decomposition and composition in the MTRNN, the overall context of a sequence is learned by or stored into some of the units in the slowest layers, called Context-controlling (Csc) units. Consequently, such an MTRNN can be defined in two forms, providing a decoder and an encoder component.

• MTRNN with Context Bias: the Csc units operate as a parametric bias during production and thus the Csc values are learned backwards during gradient descent training (compare Awano et al., 2010). Since the network weights are trained in parallel to the Csc units, the MTRNN with context bias learns to decompose a temporally dynamic sequence from a static initial bias.

• MTRNN with Context Abstraction: the Csc units operate as abstracting a static output during sensory processing similar to one-point classification (compare Heinrich and Wermter, 2018). Due to the increasingly larger timescales in the layers, the network learns to compose a static overall context from a temporally dynamic sequence.

When an MTRNN with context bias is coupled with an MTRNN with context abstraction in an end-to-end architecture, the Csc values of both networks are updated iteratively and form latent representations similar to a sparse auto-encoder on sequences.

In the MTRNNs, however, the τ needs to be carefully chosen as a hyperparameter, based on a priori known temporal characteristics of the data. This is usually done in coarse approximation on layer or module level. In contrast, time constants in the brain are subject to change during development and are hypothesized to be directly related to temporal structures (He, 2014). In previous work we developed a mechanism to obtain an adaptive timescale τ^A for each unit (Heinrich et al., 2018a). The timescales are governed by learnable weights U that work like a bias on the timescale instead of on the activation:

\begin{array}{l} τ_{t} = τ_{t}^{A} = 1 + e^{U + τ_{0}}, & (3) \end{array}

where the exponential function ensures timescales in (1, ∞), and the vector τ₀ can be initialized with sensible values for the timescales while the weights U get initialized to values close to zero. As a rule of thumb, we can initialize τ₀ either at random between 1 and a reasonably large number, i.e., to the length of the expected longest sequence (or a logarithm thereof) (Heinrich et al., 2015), or with timescales that are known to work well for MTRNNs in similar tasks.

In our computational model we, therefore, utilize adaptive MTRNNs with context abstraction for sensory inputs from multiple modalities and an adaptive MTRNN with context bias for verbalizing the observed sensation in natural language. Through this, the architecture provides a composition of a sensation into an overall meaning for that sensation as well as a decomposition of a meaning into a verbal description. The Csc units of all MTRNNs are coupled in cell assemblies from which, supposedly, a sparse latent representation for the meaning can emerge through iterative learning. Specifically, we integrate up to three MTRNNs for the abstraction of temporal dynamic auditory (au), sensorimotor (sm), and visual (vi) perception as well as an MTRNN which uses this context for language production in terms of verbal utterances describing the perception. The overall architecture is illustrated in Figure 1, further details on the scenario are provided in section 3.

FIGURE 1

Figure 1. Computational model: Adaptive MTRNNs with context abstraction for each input modality are coupled with an adaptive MTRNN with context bias via cell assemblies. Example timescales visualize the logarithmic leakage of information in the neurons.

2.3. Developmental Robot Scenario for Language Grounding

To investigate language grounding, we couple multi-modal sensations and a verbal description in order to train our model in an end-to-end fashion. Although supervised, this is related to models that investigate language grounding by mapping perception and action through Hebbian learning and studying the emergence and consolidation of connection patterns (e.g., Garagnani and Pulvermüller, 2016). Our aim is to further scale to a temporally dynamic scenario from real-word observations with the aim of studying both the emergence of timescales as well as connection patterns in terms of cell assemblies.

For this, our set-up is borrowed from a developmental robot scenario, where a humanoid robot, such as the Neuro-Inspired COmpanion (NICO, Kerzel et al., 2017), represents an infant learner who explores the environment by interacting with objects on a table and perceives verbal descriptions from a caregiver for particular object manipulations (see Figure 2). We conducted a data collection of the EMIL data set² (Heinrich et al., 2018b), that includes parallel multi-modal recordings from the robot's body-rational view as well as visual observations from a teacher perspective. The robot performs an action from a set of four predefined motions on a set of 30 distinct objects which exhibit different shape, color, texture, weight, friction, and sound characteristics when moved. The interaction is captured by microphones in the robot's ears for 48 kHz auditory sensation, by proprioception in the arm (motor position and current from eight motors, with 30 read-outs per second) for sensorimotor perception, and by a 90 degree field-of-view and 30 fps camera for visual perception. In addition, a textual description was recorded that describes the interaction with the object.

FIGURE 2

Figure 2. Developmental robot scenario of the EMIL data collection (Heinrich et al., 2018b): NICO is interacting with objects and perceives the interaction on auditory, sensorimotor, and visual modalities. A teacher provides a description for the interaction. (A) Scenario, (B) Example descriptions, (C) Teacher perspective.

To study the model on this scenario, we prepared two data sets from the EMIL version 1 collection:

• EMILv1 Data: 240 sensation-description pairs with up to 740 time steps for the perception streams and a simple holo-phrase with up to four words for the description. The descriptions were created from a vocabulary of 68 words and 4 symbols for punctuation, where a word is represented with one to nine phonemes.

• EMILv1 + Teacher Data: in order to mimic the situation of a caregiver providing additional descriptions to foster the infant's learning, we extended the data with additional teacher input. In particular, we appended data points where we replaced the nouns and verbs with synonyms and added slight Gaussian noise to the perception (σ = 0.01) in order to obtain 2,880 unique pairs. This is motivated by infants learning language better through scaffolding and guidance from their parents (Tomasello, 2003). The process can also be viewed as data augmentation from linguistic knowledge, which results in increased diversity and scale of crossmodal data for language learning, and is shown to lead to better generalization ability of neural models (Zhang et al., 2015). In order to ensure the quality of the teacher data, synonyms are obtained from WordNet (Miller, 1995), a high-quality lexical knowledge base according to the sense of the replaced word.

The EMILv1 data exhibits a couple of interesting characteristics. On the one hand, with the particularly long and noisy sequences (especially in the sensorimotor modality) the training is challenging for RNNs. On the other hand, in most sequences, the visual modality is most informative for the presented action + object pair. Compared to previous developmental robotic data sets, e.g. in Heinrich and Wermter (2018) the data does not imply a necessity for superadditivity (i.e., that more information is gained from multiple modalities only) but rather selectivity (meaning that one modality might be strongly favored in certain situations).

2.4. Representation and Training

For the verbal descriptions we prepared two different language representations:

• Phonetic: we transformed the utterances into phonetic sequences based on the ARPAbet and dictionary provided by CMU³ and represented these sequences as simple one-hot vectors. This is different from previous related research (Hinoshita et al., 2011; Heinrich and Wermter, 2018) where a single phoneme was stretched backwards and forward in time and thus learned much easier by using teacher forcing.

• Word embedding: in order to study the model on both fine-grained phonetic-level and coarse-grained word-level we utilize the GloVe-6B embeddings provided by the Stanford NLP group (Pennington et al., 2014).

We expect that the phonetic representation is more challenging and provides the necessity for the emergence of temporal composition in the MTRNN for verbal descriptions. The word embeddings, on the other hand, are more informative for studying the multi-modal fusion since the word embeddings already reflect semantic relatedness.

For the multi-modal sensation, we perform some simple preprocessing in order to provide input streams of comparable dimensions and low-level feature abstraction. For the auditory input, we transform the signals using Mel-Frequency Cepstral Coefficients (MFCC) analysis into 13 dimensions with a frame size of 33 ms and input window 60 ms. This is acceptable in terms of biological inspiration as the cochlea is doing a Fourier transformation of auditory signals that are roughly similar. The sensorimotor input was taken as is, but normalized, to result in 16 dimensions. The visual input in terms of a video stream was processed by a VGG16 neural network (Simonyan and Zisserman, 2015) (we took the output of the first dense layer after the convolution and pooling layers) and further condensed to 19 dimensions by Principal Component Analysis (PCA) in order to provide visual features. The VGG architecture was chosen since it is a powerful CNN architecture that was developed based on biological inspiration but does not yet incorporate implausible mechanisms such as arbitrary residual connections (Krüger et al., 2013; Hu et al., 2019). In our model, we used VGG layers that were pre-trained on ImageNet and thus provide reasonable features for objects. The reduction with PCA is not supposed to mimic any specific cortical processing but is an easy step in systematically reducing complexity in the model, which alternatively could be realized by neural unsupervised learning as well.

Since all network parameters are fully differentiable (Heinrich et al., 2018a), the architecture can be trained end-to-end using gradient descent. Although for the brain theories are in favor of Hebbian learning during development instead of backpropagation, we argue that for our research aim of studying the emergence of multiple timescales and the emergence of crossmodally fused representations for language grounding a supervised error signal is feasible (Dayan and Abbott, 2005; Lillicrap and Santoro, 2019).

3. Evaluation and Analysis

In order to analyse our model for how compositional language is grounded in multimodal sensations and how multimodal abstraction emerges through learning, we trained different variants of our model on different variants on the EMIL data sets.

For all experiments, we optimized the hyperparameters, i.e., the architecture size, optimization algorithm, learning rate, and batch size. We started with the model architecture from baseline CTRNNs, which are configured with equal timescales τ = 1 for all neurons. Once good hyperparameters were found, we used the same hyperparameters for all MTRNNs while separately optimizing their timescales. These timescales, in turn, are used as initial timescale values of the adaptive MTRNNs (AMTRNNs). All models were trained for at most 5, 000 epochs and a validation set was used for early stopping. We performed a 10-random sub-sampling validation, i.e., we repeated each run ten times with a different and independent split of training, test, and validation data (75, 12.5, 12.5%) as well as different and independent weights-initialization, based on a different random seed. The best results were found with RMSprop (Tieleman and Hinton, 2012), a learning rate of 0.01, and a batch size of 30. The exact architectural parameters are noted in Figure 1. In the following, for the argmax on the output, we report the mean accuracy over the cross-validation for each setup.

3.1. Generalization on Developmental Interaction Data

As a first step, we are interested in how well the architecture can actually learn verbal descriptions for the different sequential inputs. In order to inspect the generalization, we compare the accuracy on the test sets for both data sets, both verbal utterance representations, and three different model variants. In particular, we compare the baseline CTRNNs with the optimized MTRNNs and AMTRNNs.

The accuracy results (including standard errors) are presented in Table 1. We observe that the generalization is difficult for all models and that utterances which were described entirely correct are rare. For the phonetic representation, the model produces descriptions with a range of small errors such as pauses that are too long or producing incorrect phonemes at the end of words (rare) or of the utterance (more common). In many of those cases, the model shows tendencies to produce wrong descriptions from the first incorrect phoneme onward. For the word embedding representation, the descriptions are overall better, but in some cases, words are mixed up that are not necessarily semantically related.

TABLE 1

Table 1. Test accuracy (%) for different CTRNN architectures on phonetic vs. word representation.

Nevertheless, we observe strong differences between the models with different timescale characteristics on both the EMILv1 data and the data extended with additional teacher input (significant different performance between baseline CTRNNs and both other models, with p < 0.05). The baseline CTRNN model is not able to derive any description completely correct for the phonetic representation. In fact, we found that the CTRNN fails after the first few phonemes and afterwards just produces the phoneme that is most common in the data (usually the pause symbol SIL). For the word embedding, the performance is better, indicating that the CTRNN can handle the short utterances describing the sequence (only up to five words, compared to up to 25 phonemes in the phonetic representation). This also means that the CTRNN is able to capture the meaning of the input sequences (with up to 740 time steps) in terms of the presented action + object. The model based on an MTRNN with optimized timescales shows a large improvement on the phonetic representation. The model using adaptive MTRNNs performs even better (but not significant, with p > 0.05). Here, the errors in production are distributed over the utterance and a mostly incorrect description is characterized by the production of semantically wrong words, although the words were spelt correctly. Both the MTRNN- and AMTRNN-based models show improvements on the word embedding representation but notably differ in their mistakes. The incorrect words for the CTRNN seem arbitrary, especially if the words are at the end of the utterance. For the MTRNN and AMTRNN, we notice that incorrectly produced words were in many cases semantically related, e.g., mixing up “light” with “hard” or “red” and “pink.”

Overall it seems that the correct description is strongly dependent on whether the latent distributed representation (the cell assemblies) in the Csc units is able to abstract the sensory input and, thus, if the composition in the sensory CTRNN/MTRNN/AMTRNN correctly captures the temporally distributed information. In the following, we will, therefore, analyse the temporal aspect as well as the latent representations.

3.2. The Role of Adaptive Timescales

In order to inspect how the individual timescales contribute to sensory abstraction and utterance production, we compare the developed timescales as well as the activations within the AMTRNNs during processing the data. In Figure 3, we show a representative example for an interaction labeled “scoot heavy green car.” This sample is not producing the description (entirely) correct but shows characteristics that we found regularly in many cases. In Figures 3A–C, we compare the neural activation in all neurons with the raw input data, for auditory input shown as a spectrogram in the frequency domain, for sensorimotor as the plain measurements, and for visual as selected frames during the interaction (Figure 3F).

FIGURE 3

Figure 3. Impact of adaptive timescales in processing crossmodal input and phonetic output sequences on a representative example: “scoot heavy green car.” Hidden activations of all AMTRNN layers (stacked for each modality and sorted by timescale value) are shown together with the respective input or production. For the visual input, six frames are shown for selected time steps. (A) Auditory adaptive MTRNN and input. (B) Sensorimotor adaptive MTRNN and input. (C) Visual adaptive MTRNN. (D) Phonetic production adaptive MTRNN. (E) Timescale development during training. (F) Visual input (exemplary frames).

For both sensorimotor and visual activation we observe an increasing activity in the neurons with the highest timescales (in the graphs around a timescale of 660), showing that information is accumulated for the neurons that are part of the cell assemblies. For the auditory activation, this occurs on a much weaker level. We can also see that in the sensorimotor activation, neurons activate after some remarkable events, such as the spikes in the motor current around the first and second third of the sequence. This shows that, across the spectrum of timescales, neurons begin to reverberate when the current input seems different from sensory input in other interactions. Interestingly, in both sensorimotor and visual activations, neurons on timescales between 4 and 25 maintain their activation until the end of the sequence once positively or negatively activated. For the auditory activations, we can not easily spot a similar behavior but rather observe strong fluctuations for the neurons with small timescales until 80% of the sequence. Semantically plausible reverberations are rare, thus it seems the auditory information is much noisier and less decisive compared to the other modalities.

In the production of verbal utterances (Figure 3D) we spot patterns that are typical for MTRNNs: some neurons on lower timescale fluctuate according to specific phonetic output and neurons around timescales 4−6 activate and maintain their activation for some time spans. In notable cases, these activations coincide with the production of words representing semantically meaningful phoneme chains. The neurons with lower timescales of around 42, however, keep their activations over time with some leakage. These timescales correspond to the IO, Cf, and Cs layers and indicate a hierarchical decomposition. Notable is that the correspondence of activity in the Cf layer, with a produced word, is less pronounced than expected, while the activations of specific phonemes fade quickly. Correct phonemes are still produced, but at some point only SILs are activated. This clearly shows that this model has not ideally learned the production of the utterance, although the network structure induces the mentioned decomposition.

Regarding the learning of individual timescales, we see in Figure 3E that all AMTRNNs tend toward more fine-grained timescales in all layers. For the sensory input AMTRNNs, these changes are most notable for the neurons in the Cs layers, as they tend to result in smaller timescales (around 650) instead of the layer-wise optimized value of 700 of the MTRNN model. For the production AMTRNN, individual timescales also result in smaller values in some cases and a strong differentiation of the neurons in all layers. This indicates that, in addition to the predefined hierarchical structure, the AMTRNNs further adapted to the specific scales of relevant events in the sequences.

Overall it is notable that the timescale mechanism, w.r.t. the leakage of information, has its limit for covering events that occur on different timescales but are not particularly regular. In many cases, the multi-sensory perception is abstracted in terms of neurons accumulating information relatively independent of the timescales. The input data from the EMIL data set does not consist of chains of events that need to be composed, but they do show key events, such as grasping the objects or perceiving a difference in inertia through different current values in cases of rapidly moving an object. These key events seem to be captured, but neurons activate as a memory rather than a shortly active detector of features on a mid-level timescale. The production of verbal utterances, in many cases, illustrates shortcomings toward the end of the utterances, with the tendency of producing the overall most frequent phoneme (SIL).

3.3. Latent Representations in Cell Assemblies

Finally, we are interested in how cell assemblies form, based on the sensory input and description output. Specifically, we aim to inspect whether latent representations in the Csc spaces reflect the meaning of the utterances. We hypothesize that in cases of “good” models, the semantic components (action and object characteristics) that are exactly identical (e.g., the same action) or similar (e.g., a rectangular toy shape and a rectangular tissue shape) are represented similarly as well.

To analyse this, we compare setups where we trained AMTRNNs with all three modalities (auditory, sensorimotor, and visual), combinations of two modalities, or only on a single modality as input. The overview of the performance (accuracy results and standard errors) for these setups is presented in Table 2. For the trained networks we obtained the neural activations of the Csc units for the respective input AMTRNN and verbal description output AMTRNN and reduced the dimensionality of the representation to two Principal Components (PC) using PCA. For typical results and selected combinations of modalities, the reduced representations are plotted in Figure 4. Since the Csc from the sensory inputs map to the Csc for the verbal description we would expect that the plots for the verbal utterances show similarities most clearly. Note, however, that although two PCs usually explain >60% of the variability, they are only one perspective on the representation among others. Nevertheless, we selected cases that are representative for our observations across the results and avoided using t-Distributed Stochastic Neighbor Embedding (t-SNE) instead of PCA in order to not introduce additional biases.

TABLE 2

Table 2. Test accuracy (%) for training on restricted sensory input.

FIGURE 4

Figure 4. Learned representations in the cell assemblies for training on different modalities (reduced by PCA to first the two principal components PC1 and PC2). (A) Perception via auditory, sensorimotor, and visual modalities. (B) Perception via auditory and sensorimotor modalities. (C) Perception via auditory and visual modalities. (D) Perception via auditory modality. (E) Perception via visual modality.

Surprisingly, the results indicate that the setup that only uses visual input data performs best, compared to setups that process multimodal input data (notable but not significant, with p > 0.05). Overall, the setups that have access to the visual modality perform better (significant for all combinations, with p < 0.05), whereas the auditory modality leads to worse results (significant for combinations with an auditory input vs a visual input, with p < 0.05). When inspecting representations of the cell assemblies we can identify an explanation in the emerging representations. The semantic components are best distributed in the visual modality, indicating clusters for most of the characteristics, e.g., the object shape and action. To see this, compare all panels for the visual modality in Figure 4. Even though we do not visualize this here, we found similar clusterings for the color semantic component. In the sensorimotor modality the clusters are particularly obvious for action but strongly overlap for the shape component (not shown: it also overlaps for color components as well as weight and softness). In the auditory modality, all semantic components overlap for the case of full multimodal input (Figure 4A) and unimodal input (Figure 4D). However, in case of the auditory representation being presented together with sensorimotor or visual information only, we found a slight tendency of clustering toward the clusters that emerged within the other input modality (compare Figure 4B for auditory and sensorimotor and Figure 4C for auditory and visual). In most cases, the representation in the Csc of the verbal utterance production showed a mixture of the representations in the input Csc.

Overall it seems that (a) the characteristics of the raw data have a large influence, and (b) the end-to-end learning slightly favors a merging of the input modalities that is not directly beneficial. For (a), inspecting the raw data confirms our observation and expectation. In our raw data, we observe that the input streams are usually both quite noisy but also distinctive for some aspects. For example, the proprioception information from the motors (motor current) shows large deviations but for the human inspector it is easy to discriminate the different actions, while distinguishing between heavy and light objects (stronger vs. lower current) or hard and soft objects (stronger squishing and thus different finger motions) is extremely hard. In the auditory recordings, it is not possible to discriminate most object characteristics except for different friction sounds of heavy and light objects. However, distinguishing the actions by the motor sound is sometimes possible. For (b), we hypothesize that the amount of data in the EMILv1 data set is insufficient w.r.t. the complexity of the architecture, whereas the larger number of examples in the EMILv1 + Teacher set leads to a slightly different convergence. When comparing both data sets in Table 2 we find a tendency of modality selection for the smaller data set and a tendency of superadditivity for the larger one.

4. Discussion

In this paper, we investigated an embodied neurocognitive model to better understand the effects of adaptive multiple timescales as well as multi-sensory fusion mechanisms in grounding a temporal dynamic verbal description into temporal dynamic perceptions. For the model, we adopt that the human brain is reusing composition and decomposition as well as multiple sensory modalities in grounding natural language (compare section 2.1). Furthermore, in the model, we realize the merging of senses in a higher stage and inherently assume that the multiple timescales are in fact necessary (compare section 2.2). In our results, we found that adaptive timescales help in abstracting the information from temporally long and complex perceptions. Preparing the layers in these AMTRNNs with context abstractions toward an implicit hierarchy of multiple timescales forces a composition of an overall meaning from the crossmodal perception.

However, the concept of leakage in the AMTRNN specifically and in the MTRNN generally seems to reach its limit here. In previous studies, sequences were usually limited to < 50 time steps and, as a consequence, easily learned. In our experiments, perception inputs have ≈ 700 time steps for which MTRNNs hardly converge, even if a large hierarchy of carefully optimized timescales is tested. Consequently, meaningful abstractions emerge to some extent but compared to other mechanisms in machine learning, like gating or time-windowed CNNs, the resulting representations and performance are limited (Chang et al., 2017). Thus, although the decomposition through neural processes, which operate on different timescales, seems to contribute to the human abilities of language grounding, it does not explain how we cope with the complexity of our daily sensory input.

We also found that using end-to-end learning cell assemblies, i.e., pairs of temporally static abstracted modal information and production biases, show a tendency to organize w.r.t. similarities of the semantic components (i.e., an action, object shape, object softness, and so on). This is in line with previous studies and general observations on gradient descent machine learning. However, for our more natural and noisy interaction data, it shows that a choice between superadditivity and modality-specificity does not necessarily simply emerge but might involve additional cognitive processes.

In the past, language acquisition and grounding models were usually tested on synthetic toy examples or very constrained and carefully designed scenarios (Cangelosi and Schlesinger, 2015). Crucially, aspects of language were omitted or robotic interactions were designed particularly systematic. In contrast, our current study uses the EMIL data collection which challenges the model by introducing a wide range of variability in terms of sensory noise, object characteristics, and skewed distributions thereof. It seems, however, that by reducing these constraints and capturing truly multimodal and natural interaction scenarios we can reveal novel, potentially incompatible, effects.

5. Conclusions

Overall, our embodied neurocognitive model shows that in an end-to-end learning architecture with hierarchical concept abstraction and concept decomposition, language grounding can emerge and generalize. Adaptive multiple timescales and multi-sensory fusion on concept level are, among others, effective components. Of similar importance are the scenario characteristics of our more complex and natural EMIL data collection, which introduces a larger range of variability and noise. Through using more complex data we observe novel effects such as limits in temporal abstraction and contradicting observations concerning superadditivity vs. modality-specificity.

For future research, when aiming to explain complex cognitive functions, we need to take into account the full complexity of the environmental context as well as of the computational conditions. For language acquisition and grounding it seems particularly crucial to capture the full details of the language learning events, such as learners' prior body of experiences, the sensory richness of the context, and the input and thus influence of caregivers that teach the language. In addition, future research could further investigate the timescale mechanism with respect to hierarchically organized multiple timescales on mathematically more defined tasks, like predicting temporally noisy Lissajous curves with probabilistic transitions (compare Murata et al., 2014) and consider time dilation or time gating, instead of leakage (Chang et al., 2017). Increased understanding and better control of temporal hierarchical composition in neural models, as well as the development of more naturalistic training data and schedules, are promising paths toward models of more human-like language acquisition and learning.

Data Availability Statement

The datasets generated for this study are available freely via the link provided in the Appendix.

Author Contributions

SH developed, implemented, and evaluated the model. SH, YY, THu, and MK collected the data. SH and THi analyzed the model. ZL, CW, and SW helped in writing and revising the paper. All authors contributed to the article and approved the submitted version.

Funding

The authors gratefully acknowledge partial support from the German Research Foundation (DFG) and the National Science Foundation of China (NSFC) under project Crossmodal Learning (TRR-169).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors thank Shreyans Bhansali and Erik Strahl for support in the data collection as well as the NVIDIA Corporation for support with a GPU grant.

Footnotes

1. ^The source code of the model and experiment details can be found on https://github.com/heinrichst/adaptive-mtrnn-grounding.git.

2. ^More details on the collection are provided in the Appendix. We plan to obtain several versions of the EMIL data set with increasing scenario complexity and amount of data. The version 1 is publically available via https://www.inf.uni-hamburg.de/en/inst/ab/wtm/research/corpora.html.

3. ^ARPAbet is an American English phonetic transcription set, transcribed in ASCII symbols, http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

References

Antunes, A., Laflaquiére, A., and Cangelosi, A. (2018). “Solving bidirectional tasks using MTRNN,” in 8th Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EpiRob) (Tokyo), 19–25. doi: 10.1109/DEVLRN.2018.8761012

Crossmodal Language Grounding in an Embodied Neurocognitive Model

1. Introduction

1.1. Related Work

1.2. Contribution

2. Embodied Neurocognitive Model

2.1. Biological Inspiration

2.2. Computational Model

2.3. Developmental Robot Scenario for Language Grounding

2.4. Representation and Training

3. Evaluation and Analysis

3.1. Generalization on Developmental Interaction Data

3.2. The Role of Adaptive Timescales

3.3. Latent Representations in Cell Assemblies

4. Discussion

5. Conclusions

Data Availability Statement

Author Contributions

Funding

Conflict of Interest

Acknowledgments

Footnotes

References

Appendix: EMIL Collection

Related Data Sets

Dataset Characteristics

Developmental Robot Setup

Interactive Robot NICO

Recording

Preprocessing

Labeling for Object Tracking

Labeling for Language Learning

Impact and Research Opportunities

Active Perception

Imitation Learning

Cross-Modal Representation Learning

Developmental Language Acquisition

Lifelong Learning