Edited by: James L. McClelland, University of Pennsylvania, United States
Reviewed by: Jon A. Willits, University of California, Riverside, United States; Darrell A. Worthy, Texas A&M University, College Station, United States
*Correspondence: Shamima Najnin; Bonny Banerjee
This article was submitted to Cognitive Science, a section of the journal Frontiers in Psychology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Cross-situational learning and social pragmatic theories are prominent mechanisms for learning word meanings (i.e., word-object pairs). In this paper, the role of reinforcement is investigated for early word-learning by an artificial agent. When exposed to a group of speakers, the agent comes to understand an initial set of vocabulary items belonging to the language used by the group. Both cross-situational learning and social pragmatic theory are taken into account. As social cues, joint attention and prosodic cues in the caregiver's speech are considered. During agent-caregiver interaction, the agent selects a word from the caregiver's utterance and learns the relations between that word and the objects in its visual environment. The “novel words to novel objects” language-specific constraint is assumed for computing rewards. The models are learned by maximizing the expected reward using reinforcement learning algorithms [i.e., table-based algorithms: Q-learning, SARSA, SARSA-λ, and neural network-based algorithms: Q-learning for neural network (Q-NN), neural-fitted Q-network (NFQ), and deep Q-network (DQN)]. Neural network-based reinforcement learning models are chosen over table-based models for better generalization and quicker convergence. Simulations are carried out using mother-infant interaction data from the CHILDES dataset for learning word-object pairings. Reinforcement is modeled in two cross-situational learning cases: (1) with joint attention (Attentional models), and (2) with joint attention and prosodic cues (Attentional-prosodic models). The Attentional-prosodic models outperform the Attentional ones on the task of word-learning, and the Attentional-prosodic DQN outperforms existing word-learning models for the same task.
Infants face many complex learning problems, one of the most challenging of which is learning a language. How quickly and effortlessly they do so is nothing short of a scientific miracle. The process of language acquisition is multisensory, involving hearing utterances, seeing objects in the environment, and touching and pointing toward them. The ability to map words onto concepts/referents/objects is at the core of language acquisition. The mapping between words and their referents is fundamentally ambiguous, as illustrated by Quine using the “Gavagai” problem (Quine et al.,
One of the prominent solutions to this problem is cross-situational learning (Gleitman,
Computational modeling of word-learning has been a powerful tool for unraveling the underlying factors and mechanisms of word-learning in infants. It helps to examine psycholinguistic theories of word-learning. In developmental robotics, it plays an important role in the design of a robot's behavioral and cognitive capabilities (Lungarella et al.,
A taxonomy tree of existing word-learning models, including the models proposed in this paper.
One of the first models for learning word meanings is rule-based (Siskind,
In Yu and Ballard (
In Frank et al. (
An incremental, associative model that learns words from context is proposed in Kachergis et al. (
In Kievit-Kylar et al. (
In Lazaridou et al. (
In Fontanari et al. (
The continuous nature of speech and visual data is taken into account in Roy and Pentland (
In Mangin and Oudeyer (
The model in Frank et al. (
Social cues and cross-situational learning play a crucial role in learning word meanings (Gleitman,
The strong connection between reinforcement and the brain, particularly memory, is widely known (Schultz,
The effort of an agent for obtaining a reward can be measured as a value using a value system (Schultz,
In this paper, we extend the reinforcement learning framework for word-learning. Both table-based algorithms (Q-learning, SARSA, SARSA-λ) and neural network-based algorithms (Q-NN, NFQ, DQN) (Sutton and Barto,
Our experimental results show that reinforcement learning models are well-suited for word-learning. In particular, we show that:
Word-object pairs can be learned using reinforcement. Attentional Q-NN, NFQ, DQN and their attentional-prosodic counterparts can select referent objects with high accuracy for given target words. Attentional-prosodic Q-NN, NFQ, and DQN outperform their attentional counterparts in terms of F-score, precision and recall. Attentional-prosodic DQN outperforms some of the prominent existing models in terms of F-score.
The rest of this paper is organized as follows. Section 2 covers how reinforcement learning can be extended for word-learning. The dataset and its complexity are described in Section 3. Section 4 details the experimental results. Finally, the paper ends with concluding remarks.
The proposed solution to the computational problem of learning word meanings rests on recent advances in reinforcement learning algorithms. An agent interacts with its environment via perception and action, as shown in Figure
Agent-environment interaction in reinforcement learning. Here,
Let
For learning word meanings, the proposed agent is assumed to be embedded in an environment consisting of audio and visual stimuli. In any situation, the agent hears the caregiver's utterance (consisting of words) and sees the objects present in its immediate environment. In this paper, each state consists of the spoken utterance and the attended object in a situation. At the current situation (
Computation of the internal reward is a non-trivial problem. The intuition behind this reward function is borrowed from language-specific constraints to restrict large hypothesis spaces; this facilitates cross-situational learning. These language-specific constraints help the infant to learn word-object mappings during early language development (Markman,
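As a concrete illustration of this reward principle, the following is a minimal Python sketch; the helper sets `known_words` and `known_objects` and the exact novelty test are illustrative assumptions, since only the constraint itself and the reward values in the listings below are recoverable from the text.

```python
def compute_reward(word, obj, known_words, known_objects):
    """Reward under the "novel words to novel objects" constraint (illustrative).

    One plausible reading: a pairing is rewarded when a novel word is mapped
    to a novel object (or a familiar word to a familiar object), and
    penalized when novelty statuses are mismatched.
    """
    word_is_novel = word not in known_words
    obj_is_novel = obj not in known_objects
    if word_is_novel == obj_is_novel:  # novelty statuses agree
        return 100                     # reward value used in the algorithms below
    return -1                          # penalty for violating the constraint
```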
Based on the above principle, the proposed agent perceives the next situation (
The agent's objective is to compute a policy, π, that maps from state (
The above formulation reduces the problem to a finite Markov decision process (MDP) where each situation is a distinct state. Hence, standard reinforcement learning methods for MDPs can be deployed by assuming
For the task of learning word-object pairs, both table-based and neural network-based reinforcement learning methods will be investigated in this paper. Table-based methods include Q-learning, SARSA, and SARSA-λ. The neural network-based methods, namely Q-NN, NFQ, and DQN, will be extended for the word-learning task. Our formulation of these methods for the task is described in the following sections.
A number of reinforcement learning methods estimate the action-value function using the Bellman equation (Sutton and Barto,

$$Q^{\pi}(s, a) = \mathbb{E}\left[\, r + \gamma\, Q^{\pi}(s', a') \mid s, a \,\right]$$

where $s'$ and $a'$ are the next state and action, $r$ is the immediate reward, and $\gamma$ is the discount factor.

The optimal action-value function obeys the Bellman equation. If the optimal value $Q^{*}(s', a')$ at the next situation is known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ that maximizes the expected value of $r + \gamma\, Q^{*}(s', a')$.
Such an action-value function can be estimated iteratively, as in Watkins and Dayan (

Convergence of such value iteration algorithms leads to the optimal action-value function. The iterative update takes the form

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$

where α is a learning rate that decreases with iterations for convergence and γ is a discounting factor. It can be shown that multiple updates of every state-action pair lead Q-learning to converge for finite state-action spaces (Sutton and Barto,
Q-learning Algorithm for Word-Learning

1: Input: current state s, current action a, next state s′, set of words, set of objects
2: Initialize action-value function, Q(s, a), arbitrarily
3: Initialize Q-matrix for word-object pairs
4: for each episode do
5:   Observe current state, s
6:   for each situation do
7:     Choose action, a, for s using the ε-greedy policy derived from Q
8:     Observe next state, s′
9:     if the selected word-object pairing satisfies the “novel words to novel objects” constraint then
10:      r = 100
11:    else
12:      r = −1
13:    end if
14:    Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)]
15:    Update the Q-matrix entry for the selected word-object pair
16:    s ← s′
17:  end for
18: end for
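As a concrete companion to the listing above, here is a minimal Python sketch of tabular Q-learning for this task; the environment interface (`env.reset`, `env.step`) and the ε-decay schedule are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=10000,
               alpha=0.001, gamma=0.99, epsilon=0.99):
    """Tabular Q-learning for word-object pairing (illustrative sketch)."""
    Q = np.random.rand(n_states, n_actions)   # random initialization, as in section 4
    for _ in range(episodes):
        s = env.reset()                       # assumed: returns an integer state id
        done = False
        while not done:
            # epsilon-greedy: an action picks a candidate referent object
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)     # r is 100 or -1 per the reward rule
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
        epsilon = max(1e-4, epsilon * 0.999)  # assumed decay toward greedy behavior
    return Q
```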
State-Action-Reward-State-Action (SARSA) is an on-policy algorithm for temporal difference learning and is more realistic than Q-learning. According to Russell and Norvig (
SARSA Algorithm for Word-Learning

1: Input: states, actions, and reward function
2: Initialize action-value function, Q(s, a), arbitrarily
3: Initialize Q-matrix for word-object pairs
4: for each episode do
5:   Observe current state, s
6:   Choose action, a, for s using the ε-greedy policy derived from Q
7:   for each situation do
8:     Take action a; observe next state, s′
9:     if the selected word-object pairing satisfies the “novel words to novel objects” constraint then
10:      r = 100
11:    else
12:      r = −1
13:    end if
14:    Choose next action, a′, for s′ using the ε-greedy policy derived from Q
15:    Q(s, a) ← Q(s, a) + α[r + γ Q(s′, a′) − Q(s, a)]
16:    Update the Q-matrix entry for the selected word-object pair
17:    s ← s′; a ← a′
18:  end for
19: end for
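The on-policy difference is confined to the update target: SARSA bootstraps from the action actually chosen in the next state rather than from the greedy maximum. A minimal sketch of a single update step, reusing the conventions of the Q-learning sketch above:

```python
def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.001, gamma=0.99):
    """One SARSA update: bootstrap from the action actually taken next.

    Compare with Q-learning, which would use Q[s_next].max() instead of
    Q[s_next, a_next]; both a and a_next come from the epsilon-greedy policy.
    """
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```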
We experimented with the SARSA-λ algorithm to observe the effect of memory in word-learning. Adding eligibility traces to SARSA yields the SARSA-λ algorithm (Loch and Singh,
SARSA-λ Algorithm for Word-Learning

1: Input: states, actions, reward function, and trace-decay parameter λ
2: Initialize action-value function, Q(s, a), arbitrarily; set eligibility traces e(s, a) = 0 for all s, a
3: Initialize Q-matrix for word-object pairs
4: for each episode do
5:   Observe current state, s; choose action, a, using the ε-greedy policy derived from Q
6:   for each situation do
7:     Take action a; observe next state, s′
8:     if the selected word-object pairing satisfies the “novel words to novel objects” constraint then
9:       r = 100
10:    else
11:      r = −1
12:    end if
13:    Choose next action, a′, for s′ using the ε-greedy policy derived from Q
14:    δ = r + γ Q(s′, a′) − Q(s, a)
15:    e(s, a) ← e(s, a) + 1
16:    for all states s and actions a do
17:      Q(s, a) ← Q(s, a) + α δ e(s, a)
18:      e(s, a) ← γ λ e(s, a)
19:    end for
20:    Update the Q-matrix entry for the selected word-object pair
21:    s ← s′; a ← a′
22:  end for
23: end for
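Eligibility traces let a single reward update every recently visited state-action pair, weighted by recency, which is how the algorithm models memory. A minimal sketch of one trace-based update, assuming a trace matrix `E` of the same shape as `Q`:

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next,
                      alpha=0.001, gamma=0.99, lam=0.9):
    """One SARSA-lambda update with accumulating eligibility traces."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error
    E[s, a] += 1.0                                   # accumulate trace for (s, a)
    Q += alpha * delta * E                           # update all traced pairs at once
    E *= gamma * lam                                 # decay all traces
```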
In this paper, the state of the environment in the word-learning case is defined as a situation consisting of utterances and objects presented at that time. In the real world, children below three years of age hear at least 240 utterances per hour (Grabe and Stoller,
In the neural network-based methods, an error function is introduced that measures the difference between the current
Q-learning with Neural Network for Word-Learning

1: Initialize Q-matrix for word-object pairs
2: Initialize action-value function, Q, with random weights θ
3: Learning rate, α = 0.001; discount rate, γ = 0.99
4: Initialize ϵ = 0.99 for the ϵ-greedy policy
5: for each episode do
6:   Observe current state, s
7:   for each situation do
8:     Calculate Q(s, a; θ) for all actions a
9:     Choose action, a, using the ϵ-greedy policy
10:    Observe next state, s′
11:    Calculate reward, r
12:    Set the target y = r + γ max_a′ Q(s′, a′; θ)
13:    Perform gradient descent step on (y − Q(s, a; θ))² with respect to θ
14:    Calculate Q(s′, a′; θ) and update the Q-matrix for the selected word-object pair
15:    Set s ← s′
16:  end for
17:  Decrease ϵ linearly.
18: end for
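A minimal PyTorch sketch of the four-layer Q-network described in section 4, together with one online update step, follows; the hidden-layer width and the use of ReLU are assumptions, since those details are not specified in the surviving text.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Four layers: input, two hidden, and one output unit per candidate object."""
    def __init__(self, n_inputs, n_actions, hidden=64):  # hidden width assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def online_td_step(qnet, opt, s, a, r, s_next, gamma=0.99):
    """One online Q-NN update on a single transition (s, a, r, s')."""
    with torch.no_grad():
        target = r + gamma * qnet(s_next).max()  # bootstrap target, held fixed
    loss = (target - qnet(s)[a]) ** 2            # squared TD error
    opt.zero_grad()
    loss.backward()
    opt.step()

# usage sketch: opt = torch.optim.SGD(qnet.parameters(), lr=0.001)
```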
Q-learning for a neural network is an online algorithm in which the Q-network parameters (θ) are updated after each new situation. The problem with the online update rule is that a large number of iterations is typically required to compute an optimal or near-optimal policy (Riedmiller,
NFQ Algorithm for Word-Learning

1: Initialize Q-matrix for word-object pairs
2: Initialize a set of transition samples, D
3: Initialize action-value function, Q, with random weights θ
4: Learning rate, α = 0.001; discount rate, γ = 0.99
5: Initialize ϵ = 0.99 for the ϵ-greedy policy
6: for each episode do
7:   Observe current state, s
8:   for each situation do
9:     Calculate Q(s, a; θ) for all actions a
10:    Choose action, a, using the ϵ-greedy policy
11:    Observe next state, s′
12:    Calculate reward, r
13:    Store transition (s, a, r, s′) in D
14:    Set the target Q-value for the current state as: y = r + γ max_a′ Q(s′, a′; θ)
15:    Perform gradient descent step on (y − Q(s, a; θ))² over the samples in D
16:    Calculate Q(s′, a′; θ) and update the Q-matrix for the selected word-object pair
17:    Set s ← s′
18:  end for
19:  Decrease ϵ linearly.
20: end for
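The batch character of NFQ can be summarized in a few lines: transitions are accumulated and the network is re-fitted over the whole sample set. The sketch below reuses the `QNetwork` class above and substitutes plain stochastic gradient descent, as reported in section 4; the original NFQ formulation uses Rprop, and the number of fitting passes is an assumption.

```python
import torch

def nfq_fit(qnet, opt, transitions, gamma=0.99, epochs=10):
    """Re-fit the Q-network on all stored (s, a, r, s') transitions (sketch)."""
    for _ in range(epochs):                              # fitting passes assumed
        for s, a, r, s_next in transitions:
            with torch.no_grad():
                target = r + gamma * qnet(s_next).max()  # recompute target per pass
            loss = (target - qnet(s)[a]) ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()
```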
In deep Q-learning, an agent's experience at each situation is stored in a replay memory, termed
Double Deep Q-learning for Word-Learning

1: Initialize experience replay memory, D, to capacity N
2: Initialize action-value function, Q, with random weights θ
3: Initialize target action-value function, Q̂, with weights θ⁻ = θ
4: Learning rate, α = 0.001; discount rate, γ = 0.99
5: Initialize ϵ = 0.99 for the ϵ-greedy policy
6: for each episode do
7:   Observe current state, s
8:   for each situation do
9:     Calculate Q(s, a; θ) for all actions a
10:    With probability ϵ select a random action, a,
11:    otherwise select a = argmax_a Q(s, a; θ)
12:    Observe next state, s′
13:    Calculate reward, r
14:    Store transition (s, a, r, s′) in D
15:    Sample a minibatch of transitions (s_j, a_j, r_j, s′_j) from D
16:    Set y_j = r_j + γ Q̂(s′_j, argmax_a′ Q(s′_j, a′; θ); θ⁻)
17:    Perform gradient descent step on (y_j − Q(s_j, a_j; θ))² with respect to θ
18:    After every C steps, set θ⁻ = θ
19:    Calculate Q(s′, a′; θ) and update the Q-matrix for the selected word-object pair
20:    Set s ← s′
21:  end for
22:  Decrease ϵ linearly.
23: end for
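The ingredients recoverable from the listing are the replay memory, minibatch sampling, and a target network refreshed every C steps, with the double-DQN target in which the online network selects the next action and the target network scores it. A minimal sketch, assuming the `QNetwork` class above and a replay memory stored as a list of transitions:

```python
import random
import torch

def double_dqn_step(qnet, target_net, opt, replay, batch_size=32, gamma=0.99):
    """One double-DQN update on a minibatch sampled from replay memory."""
    batch = random.sample(replay, batch_size)
    loss = torch.zeros(())
    for s, a, r, s_next, done in batch:
        with torch.no_grad():
            a_star = int(torch.argmax(qnet(s_next)))  # online network selects
            target = r + (0.0 if done
                          else gamma * float(target_net(s_next)[a_star]))
        loss = loss + (target - qnet(s)[a]) ** 2
    loss = loss / batch_size
    opt.zero_grad()
    loss.backward()
    opt.step()

# every C situations, refresh the frozen target network:
# target_net.load_state_dict(qnet.state_dict())
```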
Two transcribed video clips (me03 and di06) of mother-infant interaction from the CHILDES dataset (MacWhinney,
Several aspects make the CHILDES dataset challenging for learning word-object relations. In this type of bidirectional, natural, multimodal interaction, the vocabulary is large while the number of object labels is very small in comparison to the most frequent words. In these videos, only 2.4 and 2.7% of co-occurring word-object pairs are relevant while the rest are irrelevant. Moreover, this dataset contains a large amount of referential uncertainty: for each utterance, up to seven objects are present in the scene. The mother explicitly uttered the name of an attended object in only 23% of utterances. Putting these facts together, there are many frequent but irrelevant pairs consisting of function words and a small set of referent objects. Thus, learning word-to-referent-object mappings by computing co-occurrence frequencies leads to incorrect associations. Considering the smoothness and efficiency of word-learning, it is highly probable that infants use a smarter and more effective strategy to learn relevant word-referent object pairs.
Each of the models discussed in section 2 is trained using a subset of the CHILDES dataset (MacWhinney, ). The audio stream is represented as a symbolic channel containing the words that occurred in a sentence, as in Yu and Ballard ( ). The caregiver's speech is segmented into utterances based on speech silence. Each utterance is aligned with transcriptions using the Sphinx speech recognition system (Lee et al.,
In this paper, two sets of experiments are conducted using social cues: (1) joint attention-based reinforcement learning, where joint attention on objects is considered as the social cue and the audio stream is symbolically represented; (2) joint attention and prosodic cue-based reinforcement learning, where both the jointly attended object and prosody in the caregiver's speech are considered as social cues. In the latter case, the audio stream is represented as a prosodic feature vector. For both experiments, the video stream is represented as a list of object labels. The dataset consists of 624 utterances with 2,533 words in total. The vocabulary size is 419. The numbers of objects and referent objects are 22 and 17, respectively. The dataset includes a gold-standard lexicon consisting of 37 words paired with 17 objects (Frank et al.,
To find the object jointly attended by both caregiver and infant, we follow the same methodology as in Yu and Ballard (
In Snedeker and Trueswell (
The performance of the proposed models for the task of word-referent object pair learning is evaluated using the following criteria:
Word-referent Matrix: If all 37 words are assigned an object, the proportion of pairings matching the gold standard can be calculated from the confusion matrix representing the Word-referent Matrix (Kievit-Kylar et al., ).

Quality of the learned word-referent lexicon (Yu and Ballard, ): precision, recall, and F-score, defined as

$$\text{precision} = \frac{|\text{learned pairs} \cap \text{gold pairs}|}{|\text{learned pairs}|}, \qquad \text{recall} = \frac{|\text{learned pairs} \cap \text{gold pairs}|}{|\text{gold pairs}|}, \qquad F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
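Since the lexicon-quality criterion reduces to set comparison against the 37-pair gold standard, it can be computed directly; a minimal sketch, assuming both lexicons are represented as sets of (word, object) tuples:

```python
def lexicon_scores(learned, gold):
    """Precision, recall, and F-score of a learned word-object lexicon."""
    true_pos = len(learned & gold)                 # correctly learned pairs
    precision = true_pos / len(learned) if learned else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```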
Performance of the proposed models is compared to that of the existing models in Frank et al. (
We investigated the word-learning performance in a virtual agent using models based on six reinforcement learning algorithms: Q-learning, SARSA, SARSA-λ, Q-NN, NFQ, and DQN. For all cases, the learning rate (α), discount rate (γ), and ϵ are chosen empirically as 0.001, 0.99, and 0.99, respectively. For Q-learning, SARSA, and SARSA-λ, Q-tables are initialized randomly. For SARSA-λ, λ is chosen empirically as 0.9.
For each neural network model for word-learning, the Q-network has four layers: the first is the input layer, the second and third are hidden layers, and the fourth is the output layer representing the
The neural network weights are initialized randomly in [10^{-2}, 10^{2}]. In this paper, we have used the online gradient descent algorithm for learning Q-NN and stochastic gradient descent for learning NFQ and DQN. For all three neural network-based algorithms, the behavior policy during training was ϵ-greedy, with ϵ annealed linearly from 0.99 to 0.0001 and fixed at 0.0001 thereafter. The capacity
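The linear annealing schedule for ϵ can be written directly; the number of annealing steps is an assumption, as only the endpoints (0.99 and 0.0001) are given in the text.

```python
def epsilon_at(step, eps_start=0.99, eps_end=0.0001, anneal_steps=10000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it fixed."""
    frac = min(step / anneal_steps, 1.0)            # fraction of annealing completed
    return eps_start + frac * (eps_end - eps_start)
```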
The input to each model is the auditory stream represented as a binary vector. The reward for each situation is computed based on jointly attended object transition. Each table-based model (Q-learning, SARSA, SARSA-λ) is run for 10,000 iterations. Rewards computed in each iteration are shown in Figure
For each iteration, reward is computed and plotted for Q-learning, SARSA, and SARSA-λ, respectively.
For Q-NN, NFQ, and DQN:
The result of word-referent pairs learning can be visualized using a color confusion matrix. The confusion matrix shows the similarity between word-object pairings as gradients from yellow to blue color filling each grid cell where yellow stands for higher
Confusion Matrix for
Gold-standard words incorrectly associated with gold-objects using Q-learning, SARSA, SARSA-λ, Q-NN, NFQ, and DQN.
Q-learning | SARSA | SARSA-λ | Q-NN | NFQ | DQN
“bunny” | “bigbirds” | “moocows” | “bird” | “bird” | “bird”
“cows” | “bunnyrabbit” | “duck” | | |
“moocows” | “cows” | “duckie” | | |
“duck” | “moocows” | “kittycats” | | |
“duckie” | “duckie” | “lambie” | | |
“kitty” | “kitty” | “bird” | | |
“mirror” | “kittycats” | “hiphop” | | |
“piggies” | “lamb” | | | |
“ring” | “lambie” | | | |
“bunnies” | “bunnies” | | | |
“bird” | “bird” | | | |
“hiphop” | “oink” | | | |
“meow” | | | | |
“oink” | | | | |
It is important to note which word/object pairs are more or less likely to be discovered by the model. A lexicon is created from the
ROC for Q-learning, SARSA, SARSA-λ, Q-NN, NFQ, and DQN, respectively.
F-score, precision, and recall values using the learned lexicon from Attentional reinforcement learning models.
Model | F-score | Precision | Recall
Attentional Q-learning | 0.5667 | 0.7391 | 0.4595
Attentional SARSA | 0.5205 | 0.5278 | 0.5135
Attentional SARSA-λ | 0.5763 | 0.7727 | 0.4595
Attentional Q-NN | 0.6667 | 0.5135 |
Attentional NFQ | 0.7097 | 0.88 | 0.5946
Attentional DQN | 0.7213 | 0.9167 | 0.5946
As neural network-based reinforcement learning models outperform table-based models, simulations were run only on the former for word-referent learning, considering both audio and visual social cues: prosody and joint attention. The input to each model is the audio stream represented by a prosodic feature vector instead of a binary vector. The reward for each situation is computed based on jointly attended object transition, as described in section 2. The same optimal structure of the Q-network from the previous section is retained.
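The surviving text does not specify how the prosodic feature vector is computed; as one plausible instantiation, the sketch below extracts common prosodic correlates of infant-directed speech (fundamental frequency, energy, and duration) per utterance using the librosa toolkit.

```python
import numpy as np
import librosa

def prosodic_features(wav_path, sr=16000):
    """Illustrative per-utterance prosodic vector: F0 and energy statistics."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]                          # keep voiced frames only
    rms = librosa.feature.rms(y=y)[0]               # frame-level energy
    return np.array([
        f0.mean() if f0.size else 0.0,              # mean pitch
        f0.std() if f0.size else 0.0,               # pitch variability
        rms.mean(), rms.std(),                      # energy statistics
        len(y) / sr,                                # utterance duration (s)
    ])
```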
In Figures
Confusion Matrix for
To compare the performance of the proposed models with existing models on the task of learning word meanings, we focus on the precision, recall, and F-score computed for each model. We have analyzed the effect of integrating prosodic cues with cross-situational learning in the proposed models and then compared them with the state-of-the-art models. To investigate the effect of adding prosodic features, a lexicon is created from the
ROC for Attentional Q-NN, NFQ, DQN, Attentional-prosodic Q-NN, NFQ, DQN, existing Beagle, Beagle+PMI model, COOC, and Bayesian CSL.
The best F-score, precision and recall for Attentional Q-NN, NFQ, DQN, Attentional-prosodic Q-NN, NFQ, DQN, and existing Bayesian CSL, BEAGLE, hybrid (BEAGLE+PMI), MSG, and baseline COOC models are tabulated in Table
Comparison of proposed attentional and Attentional-prosodic reinforcement learning models with existing models.
Model | F-score | Precision | Recall
Attentional Q-NN | 0.6667 | 0.5135 |
Attentional NFQ | 0.7097 | 0.88 | 0.5946
Attentional DQN | 0.7213 | 0.9167 | 0.5946
Attentional-prosodic Q-NN | 0.77419 | 0.6486 |
Attentional-prosodic NFQ | 0.78378 | 0.7838 | 0.7838
Attentional-prosodic DQN | 0.8421 | |
COOC | 0.53 | 0.7578 | 0.4012 |
Bayesian CSL model | 0.54 | 0.64 | 0.47 |
Beagle model | 0.55 | 0.58 | 0.525 |
Beagle+prosodic cue model | 0.6629 | 0.71 | 0.525 |
Beagle+PMI model | 0.83 | 0.86 | 0.81 |
MSG model | 0.64 | NA | NA |
Attentive MSG | 0.7 | NA | NA |
AttentiveSocial MSG | 0.73 | NA | NA |
When compared to the existing models, Attentional DQN exhibits a higher F-score than the COOC, Bayesian CSL, BEAGLE, MSG, and AttentiveMSG models. Note that the existing models mentioned here ignore prosodic cues. The AttentiveMSG and AttentiveSocial MSG models integrate social cues with cross-situational learning, where infants attend to objects held by them instead of following eye gaze. The state-of-the-art BEAGLE+PMI model also ignores prosodic cues in infant-directed speech. For a fair comparison, we have tested the BEAGLE model with prosodic vectors instead of random vectors. The BEAGLE model yields an F-score of 0.55, which increases to 0.6629 when it is integrated with prosodic cues. Integration of the PMI model with prosodic cues is yet to be researched. No experiment with an Attentional-prosodic BEAGLE+PMI model is performed in this paper due to the limited information available regarding the exact procedure. Since the F-score of BEAGLE+PMI is very close to that of Attentional-prosodic DQN, it is unclear how the performance of the former would compare to the reinforcement learning models. However, joint attention alone could not make DQN's performance better than that of AttentiveSocial MSG. When prosody is combined with joint attention, DQN produces a higher F-score than AttentiveSocial MSG and BEAGLE integrated with prosodic cues. It is noteworthy that Attentional-prosodic DQN and BEAGLE+PMI would have been more comparable if the latter incorporated prosodic information as the former does. The best lexicon learned by Attentional-prosodic DQN is shown in Table
Learned best lexicon (word-object pairs) using Attentional-prosodic DQN.
“ahhah” | eyes | “bunnies” | bunny | “hiphop” | bunny | “pig” | pig |
“ahhah” | rattle | “bunny” | bunny | “david” | mirror | “piggie” | pig |
“baby” | baby | “bunnyrabbit” | bunny | “kitty” | kitty | “piggies” | pig |
“bear” | bear | “cow” | cow | “kittycat” | kitty | “rattle” | rattle |
“big” | bunny | “duck” | duck | “kittycats” | kitty | “ring” | ring |
“bigbird” | bird | “duckie” | duck | “lamb” | lamb | “rings” | ring |
“bird” | bird | “eyes” | eyes | “lambie” | lamb | “sheep” | sheep |
“birdie” | duck | “hand” | hand | “meow” | kitty | “through” | bunny |
“book” | book | “hat” | hat | “mirror” | mirror | ||
“books” | book | “he” | duck | “moocow” | cow |
In this paper, an agent is developed that can learn word-object pairings from ambiguous environments using reinforcement. Joint attention and prosodic cues in the caregiver's speech are integrated with cross-situational learning. Prosodic cues are extracted from the audio stream, while joint attention is utilized to compute the reward for the agent. Among the proposed Q-NN, NFQ, and DQN algorithms for word-learning, Q-NN is online whereas the other two use batch processing. According to the behavioral studies in Vouloumanos (
Some research (e.g., Yu,
It is found through behavioral studies in Estes and Bowen (
From a machine learning perspective, the proposed Attentional-prosodic DQN model outperforms some of the existing models on word-learning tasks in terms of F-score, precision, and recall. However, our models fail to discover the association when one word is assigned to multiple objects, a consequence of assuming the “novel words to novel objects” language-specific constraint for computing rewards. A future goal is to model reinforcement by relaxing this language constraint. It is believed that the word-learning process in children is incremental. Though the proposed Attentional-prosodic Q-NN model-based agent learns incrementally, our best-performing agent is based on the Attentional-prosodic DQN model, which learns in mini-batches. In the proposed approach, joint attention was manually selected from the video. Recently, a number of neural network-based reinforcement learning models have manifested strong performance in computing visual joint attention (Doniec et al.,
SN: Conducted the research on the algorithms, implemented and evaluated them, and wrote the paper; BB: Played an advisory role, helped in framing the problem and writing the paper, and acquired funding for conducting this research.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.