Coffee With a Hint of Data: Towards Using Data-Driven Approaches in Personalised Long-Term Interactions

While earlier research in human-robot interaction predominantly uses rule-based architectures for natural language interaction, these approaches are not flexible enough for long-term interactions in the real world due to the large variation in user utterances. In contrast, data-driven approaches map the user input directly to the agent output and hence handle these variations more flexibly, without requiring any set of rules. However, data-driven approaches are generally applied to single dialogue exchanges with a user and do not build up a memory over long-term conversations with different users, whereas long-term interactions require remembering users and their preferences incrementally and continuously, and recalling previous interactions with users to adapt and personalise the interactions, known as the lifelong learning problem. In addition, it is desirable to learn user preferences from a few samples of interactions (i.e., few-shot learning). These are known to be challenging problems in machine learning, while they are trivial for rule-based approaches, creating a trade-off between flexibility and robustness. Correspondingly, in this work, we present the text-based Barista Datasets, generated to evaluate the potential of data-driven approaches in generic and personalised long-term human-robot interactions with simulated real-world problems, such as recognition errors, incorrect recalls and changes to the user preferences. Based on these datasets, we explore the performance and the underlying inaccuracies of state-of-the-art data-driven dialogue models that are strong baselines in other domains of personalisation in single interactions, namely Supervised Embeddings, Sequence-to-Sequence, End-to-End Memory Network, Key-Value Memory Network, and Generative Profile Memory Network.
The experiments show that while data-driven approaches are suitable for generic task-oriented dialogue and real-time interactions, no model performs sufficiently well to be deployed in personalised long-term interactions in the real world, because of their inability to learn and use new identities, and their poor performance in recalling user-related data.


DATA-DRIVEN ARCHITECTURES
This section describes the data-driven dialogue models and their performance in the previous literature in detail and presents the hyperparameters used in the experiments for the Barista Datasets.

Supervised Embeddings
Word embedding models are strong baselines for predicting the response given the previous conversation in both open-domain and task-oriented dialogue (Dodge et al., 2016; Bordes et al., 2017; Al-Rfou et al., 2016; Li et al., 2021). One common approach in the literature (Bai et al., 2009; Dodge et al., 2016; Bordes et al., 2017; Joshi et al., 2017) scores the summed bags-of-embeddings of the candidate responses against the summed bags-of-embeddings of the previous conversation, referred to as Supervised Embeddings. This approach corresponds to a Memory Network with no attention over memory (Dodge et al., 2016) and to a classical information retrieval model where the matching function is learnt (Bordes et al., 2017).
This work uses the implementation from Joshi et al. (2017) 1 . Due to the structure of this method (i.e., binary bag-of-embeddings of unique words), the order of the words within the input, such as the user utterance, bot response or conversation context, is not preserved as the output is an embedding and not a sentence. Moreover, repeating words would also be lost in the embedding. Thus, this model may not be suitable for dialogue with this implementation.
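The bag-of-embeddings scoring described above can be illustrated with a minimal numpy sketch. The toy vocabulary, embedding size and random (untrained) embedding table are illustrative assumptions; the learnt matching function of the real model is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"hello": 0, "coffee": 1, "please": 2, "here": 3, "is": 4, "your": 5}
d = 8                                   # toy embedding dimension
E = rng.normal(size=(len(vocab), d))    # untrained embedding table (assumption)

def bag_of_embeddings(words):
    # binary bag: each unique in-vocabulary word contributes its embedding once
    ids = sorted({vocab[w] for w in words if w in vocab})
    return E[ids].sum(axis=0)

def score(context_words, candidate_words):
    # dot product between the two summed bags; training would learn E
    return float(bag_of_embeddings(context_words) @ bag_of_embeddings(candidate_words))

candidates = [["here", "is", "your", "coffee"], ["hello"]]
best = max(candidates, key=lambda c: score(["coffee", "please"], c))
```

Because each unique word contributes its embedding exactly once, word order and repetitions are lost in the representation, which is the limitation noted above.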

Sequence-to-Sequence
The Sequence-to-Sequence (Seq2Seq) model (Sutskever et al., 2014) is a generative model that uses a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997; Graves, 2013) to read the input sequence (i.e., as the encoder) into a fixed-dimensional vector representation, and another LSTM to generate the output sequence from that vector (i.e., as the decoder). The model was found to be a strong baseline in task-oriented and open-domain dialogue (Vinyals and Le, 2015; Sordoni et al., 2015; Li et al., 2016a,b; Zhang et al., 2018). This work uses the implementation from ParlAI 2 that was used in the ConvAI2 3 challenge (Dinan et al., 2020) with the Persona-Chat dataset.
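The encoder-decoder structure can be sketched as follows. As a simplifying assumption, a plain tanh RNN cell stands in for the LSTM, the parameters are random and untrained, and decoding is greedy:

```python
import numpy as np

rng = np.random.default_rng(1)
d, v = 6, 5  # hidden size and toy vocabulary size (assumptions)

# randomly initialised toy parameters; training is omitted here
E = rng.normal(size=(v, d))                                     # word embeddings
Wx_e, Wh_e = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # encoder cell
Wx_d, Wh_d = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # decoder cell
Wo = rng.normal(size=(d, v))                                    # output projection

def rnn_step(h, token, Wx, Wh):
    # plain tanh RNN cell standing in for the LSTM of the original model
    return np.tanh(E[token] @ Wx + h @ Wh)

def encode(token_ids):
    # encoder: fold the whole input sequence into one fixed-size vector
    h = np.zeros(d)
    for t in token_ids:
        h = rnn_step(h, t, Wx_e, Wh_e)
    return h

def decode(h, steps, bos=0):
    # greedy decoder: feed back the previous prediction at each step
    out, prev = [], bos
    for _ in range(steps):
        h = rnn_step(h, prev, Wx_d, Wh_d)
        prev = int(np.argmax(h @ Wo))
        out.append(prev)
    return out

reply = decode(encode([1, 2, 3]), steps=4)
```

The key design point is the fixed-dimensional bottleneck between the two networks: the decoder sees the input only through the final encoder state.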

End-to-End Memory Network
Humans tend to focus on salient parts of information for recalling the key aspects of a memory, or for efficiently accomplishing tasks. Similarly, attention mechanisms in deep learning focus on particular elements of a task, e.g., to respond to queries, based on a non-uniform weighting of the input to optimise the learning and recall processes. Such mechanisms can allow efficient memory handling and recall in personalisation for long-term human-robot interaction, given the expanding volume of data over time. The End-to-End Memory Network (Weston et al., 2015; Sukhbaatar et al., 2015) (MemN2N) is an attention-based model with a long-term memory, where the input (e.g., user query) is weighted with a memory component to find the most relevant previous information, referred to as the supporting facts (e.g., the previous user or bot utterance in dialogue history), for producing an output (e.g., response). Multiple hops (i.e., iterating an output with the initial input in multiple layers) force the network to increase its attention on the correct supporting facts. The dialogue example from the recognition error task of the Personalised Barista with Preferences Information (PBPI2, Table S16 in SM 6) shows the attention weights on the conversation context (i.e., dialogue history) in varying hops.

1 https://github.com/chaitjo/personalized-dialog
2 https://github.com/facebookresearch/ParlAI/tree/master/projects/convai2/baselines/seq2seq
3 http://convai.io
A Memory Network can discover simple linguistic patterns based on verbal forms, such as (X, took, Y) for "Alice took the teacup.", and can hence generalise to previously unseen instantiations of these patterns (Weston et al., 2015), which is an essential quality, e.g., for learning new names in personalised long-term interactions. Moreover, the relative time of the events is encoded into the memory to retain the temporal relations between events.
In task-oriented dialogue (Bordes et al., 2017), MemN2N outperformed Supervised Embeddings and information retrieval approaches, such as the Term Frequency-Inverse Document Frequency (TF-IDF) and the Nearest Neighbor. Moreover, the vanilla model outperformed Supervised Embeddings on the Personalized bAbI dialog dataset (Joshi et al., 2017).
Similar to Supervised Embeddings, this work uses the implementation 1 from Joshi et al. (2017), which is a retrieval-based implementation of the vanilla MemN2N.
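The attention-and-hops mechanism described above can be sketched in numpy. The memory contents, dimensions and number of hops are toy assumptions, and the per-hop embedding matrices of the full model are omitted:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_hop(q, mem_in, mem_out):
    # attention: match the query against every memory entry, then take a
    # weighted sum of the output-side memories as the retrieved information
    p = softmax(mem_in @ q)          # attention weights over memory slots
    return q + p @ mem_out, p        # residual update carries q to the next hop

rng = np.random.default_rng(2)
d, n = 8, 5                          # embedding size, number of memory slots
mem_in = rng.normal(size=(n, d))     # input-side memory embeddings (toy)
mem_out = rng.normal(size=(n, d))    # output-side memory embeddings (toy)
q = rng.normal(size=d)               # embedded user query (toy)

for _ in range(3):                   # multiple hops refine the attention
    q, p = memn2n_hop(q, mem_in, mem_out)
```

Each hop re-queries the memory with the updated representation, which is what lets the network shift its attention toward the correct supporting facts.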

Split Memory Network
The Split Memory (Joshi et al., 2017) architecture combines a MemN2N for conversation context with another MemN2N for the user profile attributes (i.e., gender, age, favourite food and dietary preference) to enforce attention on the user's profile, which is important for personalising a dialogue (e.g., for recommending a restaurant the user may like based on their favourite food). The outputs from both MemN2Ns are summed element-wise to get the final response of the bot for each conversation turn. For multiple hops, each MemN2N separately processes the output in multiple layers, and then the resulting outputs are summed.
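The element-wise combination of the two networks can be sketched as below, reusing a toy single-embedding MemN2N; the memory sizes and candidate set are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n(q, memory, hops=2):
    # toy single-embedding MemN2N: attend, read, and update the query per hop
    for _ in range(hops):
        q = q + softmax(memory @ q) @ memory
    return q

rng = np.random.default_rng(3)
d = 8
context_mem = rng.normal(size=(6, d))  # embedded dialogue-history entries (toy)
profile_mem = rng.normal(size=(4, d))  # embedded profile-attribute entries (toy)
q = rng.normal(size=d)                 # embedded user utterance (toy)

# Split Memory: element-wise sum of the two networks' outputs
o = memn2n(q, context_mem) + memn2n(q, profile_mem)
candidates = rng.normal(size=(10, d))  # embedded candidate responses (toy)
pred = int(np.argmax(candidates @ o))  # retrieval: pick the highest-scoring one
```

Keeping the profile in its own memory guarantees the profile attributes receive attention regardless of how long the conversation context grows.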
Split Memory outperformed Supervised Embeddings on the Personalized bAbI dialog dataset in all tasks, and outperformed MemN2N in recommending the correct restaurant and conducting a full dialogue. On the other hand, Split Memory performed worse in responding to user queries and when the user requested changes, such as requesting a different type of cuisine than the previously requested one, suggesting that a simpler MemN2N model is more suitable for tasks that do not require compositional reasoning over various entries in the memory. In contrast to MemN2N, using multiple hops may degrade performance when there is more than one aspect of the user's profile to focus on, such as favourite food and dietary preference, or more than one memory event (Joshi et al., 2017), as evident in a dialogue example from the recognition error task of the Personalised Barista with Preferences Information (PBPI2) presented in Table S17 in SM 6. This work uses the implementation 1 from (Joshi et al., 2017), which is a retrieval-based model.

Key-Value Profile Memory Network
Key-Value Memory Network (Miller et al., 2016) is an extension of retrieval-based MemN2N by storing facts in key-value structured memory slots. The keys are used to lookup relevant memories to the input (e.g., sharing at least one word with the input), and the corresponding values are read by taking their weighted sum using the assigned probabilities. Using hops (through repeated key addressing and value reading) enables focusing on and retrieving more pertinent information in subsequent accesses. If the key and value are set to be the same for all memories, the model becomes equivalent to the vanilla MemN2N.
Key-Value Memory Network was applied to open-domain dialogue (Zhang et al., 2018) by using the training set dialogue contexts as the keys and the next dialogue utterances (e.g., user responses) as the values. Correspondingly, the model had a memory of past dialogues that can be used to predict responses to the current conversation. User profiles (personas) consisting of multiple sentences of textual description (e.g., "I have a computer science degree") were used as follows: the profile lines relevant to the input were found via cosine similarity, their weighted sum was added to the input query, and the resulting representation was used to rank the candidate responses to determine the most suitable bot response. Thus, the model was called Key-Value Profile Memory Network, here referred to as Key-Value for brevity. For multiple hops, the summed value is used to attend over the keys and output a weighted sum of values as before, which is then used to rank the candidate set to predict the next utterance.
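The key-addressing and value-reading steps can be sketched as a one-hop numpy example; the memory contents, dimensions and candidate set are toy assumptions. Note that setting the values equal to the keys recovers the vanilla MemN2N read, as stated above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d, n = 8, 6
keys = rng.normal(size=(n, d))    # e.g. embedded past dialogue contexts (toy)
values = rng.normal(size=(n, d))  # e.g. embedded next utterances (toy)
q = rng.normal(size=d)            # embedded current input (toy)

# one hop: key addressing followed by value reading
p = softmax(keys @ q)             # probabilities from matching the keys
o = q + p @ values                # weighted sum of the corresponding values

candidates = rng.normal(size=(12, d))       # toy candidate responses
pred = int(np.argmax(candidates @ o))       # rank candidates against the output
```

Separating keys from values lets the model address memories by one representation (the context) while reading out another (the response), which is what the vanilla MemN2N cannot do.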
Key-Value outperformed Seq2Seq and Generative Profile Memory Network in both the automated metric (i.e., the accuracy of the next dialogue utterance when choosing between the correct response and distractor responses, known as hits@1) and human evaluation (in terms of fluency, engagingness and consistency) (Zhang et al., 2018).
This work uses the implementation 4 that was used at the ConvAI2 challenge with the Persona-Chat dataset.

Generative Profile Memory Network
Generative Profile Memory Network (Zhang et al., 2018), here referred to as Profile Memory for brevity, extends the Seq2Seq model by encoding the profile entries as individual memory representations in a Memory Network. The decoder attends over both the encoded profile entries and the conversation context.
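The decoder-side attention over the profile memory can be sketched as follows; the profile entries, dimensions and the combination rule (a tanh of the summed vectors) are toy assumptions standing in for the full generative decoder:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(5)
d = 8
profile = rng.normal(size=(3, d))    # one memory slot per encoded profile entry (toy)
h = rng.normal(size=d)               # current decoder hidden state (toy)

# at each generation step, the decoder attends over the profile memory
a = softmax(profile @ h)             # attention weights over profile entries
combined = np.tanh(h + a @ profile)  # state used to predict the next word
```

Because the attention is recomputed at every decoding step, different words of the generated response can draw on different profile entries.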
Profile Memory was shown to outperform Seq2Seq on automated metrics (hits@1 and perplexity) in the Persona-Chat dataset; however, it performed considerably worse than Key-Value in hits@1, achieving a score of 0.125 compared to 0.511 by Key-Value.
This work uses the implementation from ParlAI 5 that was used in (Zhang et al., 2018).

Hyperparameters
The hyperparameters for the data-driven architectures used in this work correspond to the hyperparameters from the original implementations unless otherwise noted in the text. In contrast to the original work, we used 100 epochs for training each baseline to ensure an equal comparison between models, except for Key-Value and Supervised Embeddings, which were trained for 25 and 15 epochs, respectively, due to the vast amount of time required to train them. However, the number of epochs or the training time in the original work is either less than or equal to ours. For instance, Key-Value Memory Network was trained for 20 hours (in equivalence to our computational power) on the Persona-Chat dataset (Zhang et al., 2018), whereas its training lasted between 17 and 40 days per task on the 10,000 dialogues datasets, despite each task being 20% of the size of Persona-Chat. In addition, the batch size of the Supervised Embeddings was increased to 128 (from 32 in the original implementation) to decrease the training time, and a batch size of 1 was used for Seq2Seq and Generative Profile Memory Network on the test and OOV sets due to out-of-memory errors. Key-Value Memory Network, Generative Profile Memory Network and the Seq2Seq model do not contain out-of-vocabulary words, similar to the original work. Previous user-bot labels refers to all the previous exchanges within the dialogue before the query (the current user response).
Similar to (Joshi et al., 2017), the user profile is treated as a turn in the dialogue for MemN2N. The embeddings and memory have a fixed size, hence, the beginning of the conversation context may be cut off. The answer array is returned as a one-hot encoding. The Adam optimiser is used for minimising the cross entropy loss.
Split Memory has the same embedding and memory structures as MemN2N. The Adam optimiser is used for minimising the cross entropy loss. The profile attributes are added as separate entries in the memory before the start of the dialogue. In the Personalised Barista Datasets, the user profile can be updated during the conversation, such as for registering a new user or due to a recognition error, in contrast to the Personalized bAbI dialog dataset, where the user profile does not change during the conversation. Thus, we overwrite the profile memory (i.e., the MemN2N containing the information for the user profile attributes) with the new profile information when a change of identity information occurs in the dialogue. In contrast to the implementation, which uses only the most recent utterance in the context memory (differently from the results reported in (Joshi et al., 2017)), the full dialogue history is used in this work to improve the dialogue accuracy (e.g., 66.95% accuracy with the full context versus 64.78% without it in PB8 for 1,000 dialogues).
Similar to the Split Memory, the user profile attributes are provided separately for the Key-Value Memory Network, and are overwritten if updated during the conversation. Similar to (Zhang et al., 2018), only the 1-hop model is evaluated due to the vast amount of time required to train the model, arising from the large number of (key, value) pairs. In contrast to (Zhang et al., 2018), the model is trained on the Barista Datasets instead of using the weights from another model. Similar to (Zhang et al., 2018), but in contrast to the other methods, only the last bot utterance and the corresponding last user response are used instead of the full conversation context. This was reported to perform better in the original implementation, and the results on the Barista Datasets are in line with this finding, providing up to a 20% increase in accuracy (e.g., 19.2% using the full context versus 40.09% with only the utterance pair in PB8). Note that the other methods were not evaluated in this way, either because the structure was not implemented or a difference was not reported in the original work. End-to-end training is enabled with standard backpropagation through stochastic gradient descent (SGD).
Similar to the work in (Zhang et al., 2018), GloVe embeddings (Pennington et al., 2014) are used for the Generative Profile Memory Network, and the model is trained with the Adam optimiser. However, in contrast to the original implementation, the conversation context (i.e., the history of the user and bot utterances) is used for the model, because this was found to improve the accuracy in the majority of the tasks in the Barista Datasets (e.g., for the 1,000 dialogues datasets, 57.82% accuracy with context versus 55.22% without context for PB8). While pairs of user responses and correct bot responses are used as the conversation context in training, pairs of user responses and the model's predictions are used in the validation, test and OOV sets, which also performed better than using the correct responses on the Barista Datasets (e.g., 57.82% accuracy with the model's response in the conversation context versus 51.93% with the correct bot response for PB8). Hence, this method was used, which also allowed for a fair comparison of this baseline to its performance in the Persona-Chat dataset. Similar to Split Memory, the profile attributes are separate and overwritten if updated during the conversation.
In contrast to Zhang et al. (2018), randomly initialised embeddings are used for the Seq2Seq model in this work instead of GloVe (Pennington et al., 2014) word embeddings, as they achieved a higher per-response accuracy (e.g., 60.58% accuracy compared to 41.75% in task 8 for 1,000 dialogues in the Personalised Barista Dataset). Moreover, the conversation context (i.e., all previous user-bot labels) is used for Seq2Seq, in contrast to (Zhang et al., 2018), due to the higher accuracy, similar to the Generative Profile Memory Network. Similar to (Zhang et al., 2018), the model is trained with the negative log-likelihood loss and the user profile is prepended to the input sequence (i.e., concatenated to the beginning of the input).
The Supervised Embeddings model is trained with SGD using a margin ranking loss to ensure that the correct targets are ranked higher than any other targets (i.e., negative candidates). Similar to (Joshi et al., 2017), the user profile is treated as a turn in the dialogue.
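The margin ranking objective can be sketched as follows; the margin value of 0.1 and the summation over negatives are illustrative assumptions, not the exact values of the implementation:

```python
import numpy as np

def margin_ranking_loss(pos_score, neg_scores, margin=0.1):
    # hinge per negative candidate: a negative contributes loss whenever it
    # is not scored at least `margin` below the correct (positive) target
    return float(np.maximum(0.0, margin - pos_score + np.asarray(neg_scores)).sum())
```

For example, the loss is zero when every negative candidate scores at least the margin below the correct target, and grows linearly as negatives approach or exceed it.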
A GeForce GTX 1080 Ti or a Tesla V100-SXM3-32GB was used as the graphics processing unit (GPU), depending on the availability on the server.

Table S3. Hyperparameters of the models used in the experiments for the Barista Datasets. These correspond to the parameters from the original implementations (Joshi et al., 2017; Zhang et al., 2018), unless otherwise noted in the text.

Table S13. Percentage of errors in dialogue state tracking (DST), personal(ised), order details, other and Barista Task 7 (B7) phrase types for the Second Interaction test sets. The best performing methods (or methods within 0.1%) are given in bold for the error in the per-response accuracy metric, and the error percentages within the phrase types are given in parentheses. [Table body omitted; rows include MemN2N and Split Memory.]

Customer Input: Okay.

Correct Response: I thought you looked familiar, Anne! Would you like a small mocha and a blueberry muffin again?

Predicted Response: I thought you looked familiar, Anne! Would you like a small mocha and a blueberry muffin again?

Table S17. A dialogue example from the recognition error task (2) of the Personalised Barista with Preferences Information Dataset (PBPI2) shows the attention weights in the Split Memory model for varying hops. Split Memory allows focusing attention separately on the user profile (i.e., the customer's identity and most preferred order), in addition to the last bot response (containing the customer's name to be used in the response), which reinforces dialogue state tracking and predicting the correct response. Preferences information helps choose the correct items in the suggestion, which decreases the risk of mixing up customers (and preferences). Hops facilitate focusing attention on relevant inputs; however, they can decrease the performance when there are multiple target items (for preference suggestion or order confirmation), as evident in Hop 3. A zero attention weight signifies a very small value (< 10⁻⁵).

Figure S5. Key-Value can use out-of-vocabulary words, i.e., new customer names and order items (Brand, short, raspberry lemonade). However, the customer preference was incorrectly recalled. Russell is a first name in the training set. UNK is the special token used in the ParlAI framework to represent words that are not in the vocabulary. Despite the special tokens in the profile and the conversation context, Key-Value is able to learn and use these new words.