A Financial Service Chatbot based on Deep Bidirectional Transformers

We develop a chatbot using Deep Bidirectional Transformer models (BERT) to handle client questions in financial investment customer service. The bot can recognize 381 intents, decides when to say "I don't know", and escalates irrelevant or uncertain questions to human operators. Our main novel contribution is a discussion of uncertainty measures for BERT, in which three different approaches are systematically compared on real problems. We investigate two uncertainty metrics, information entropy and variance of dropout sampling in BERT, followed by mixed-integer programming to optimize decision thresholds. Another novel contribution is the use of BERT as a language model in automatic spelling correction. Inputs with accidental spelling errors can significantly decrease intent classification performance. The proposed approach combines probabilities from a masked language model with word edit distances to find the best corrections for misspelled words. The chatbot and the entire conversational AI system are developed using open-source tools and deployed within our company's intranet. The proposed approach can be useful for industries seeking similar in-house solutions in their specific business domains. We share all our code and a sample chatbot built on a public dataset on GitHub.


Introduction
Since their first appearances decades ago (Weizenbaum, 1966) (Colby et al., 1971), chatbots have been at the forefront of major AI developments, including human-computer interaction, knowledge engineering, expert systems, Natural Language Processing, Natural Language Understanding, Deep Learning, and many others. Open-domain chatbots, also known as chitchat bots, can mimic human conversations on topics of almost any kind, and are therefore widely used for socialization, entertainment, emotional companionship, and marketing. Earlier generations of open-domain bots, such as Mitsuku (Worswick, 2019) and ELIZA (Weizenbaum, 1966), relied heavily on handcrafted rules and recursive symbolic evaluations to capture the key elements of human-like conversation. Newer advances in this field are mostly data-driven, end-to-end systems based on statistical models and neural conversational models, which aim to achieve human-like conversations through a more scalable and adaptable learning process on large, free-form data sets. Examples include MILABOT (Serban et al., 2017), XiaoIce (Zhou et al., 2018), Replika (Fedorenko et al., 2017), Zo (Microsoft, 2019), and Meena (Adiwardana et al., 2020).
Unlike open-domain bots, closed-domain chatbots are designed to transform existing processes that rely on human agents. Their goal is to help users accomplish specific tasks, with typical examples ranging from order placement to customer support; they are therefore also known as task-oriented bots. Many businesses are excited about the prospect of using closed-domain chatbots to interact directly with their customer base, which brings benefits such as cost reduction, zero downtime, and freedom from prejudice. However, there will always be instances where a bot needs a human's input for new scenarios. These include a customer presenting a problem the bot has never encountered (Larson et al., 2019), an attempt to respond to an inappropriate input, or even something as simple as incorrect spelling. Under these scenarios, expected responses from open-domain and closed-domain chatbots can be very different: a successful open-domain bot should be "knowledgeable, humourous and addictive", whereas a closed-domain chatbot ought to be "accurate, reliable and efficient". One main difference is the way unknown questions are handled. A chitchat bot would respond with an adversarial question such as "Why do you ask this?", keep the conversation going, and steer back to the topics under its coverage (Sethi, 2019). A user may find that the chatbot is outsmarting them, but is not very helpful in solving problems. In contrast, a task-oriented bot is scoped to a specific domain of intents, and should terminate out-of-scope conversations promptly and escalate them to human agents. This paper presents AVA (A Vanguard Assistant), a task-oriented chatbot supporting phone call agents when they interact with clients on live calls. Traditionally, when phone agents need help, they put client calls on hold and consult experts in a support group.
Figure 1. End-to-end conceptual diagram of AVA.
arXiv:2003.04987v1 [cs.CL] 17 Feb 2020
With a chatbot, our goal is to transform the consultation process between phone agents and experts into an end-to-end conversational AI system. Our focus is to significantly reduce operating costs by reducing call holding time and the need for experts, while transforming our client experience in a way that eventually promotes client self-provisioning in a controlled environment. Understanding intents correctly and escalating irrelevant intents promptly are key to its success. Recently, the NLP community has made many breakthroughs in context-dependent embeddings and bidirectional language models such as ELMo, OpenAI GPT, BERT, RoBERTa, DistilBERT, XLM, and XLNet (Dai & Le, 2015;Peters et al., 2017;Devlin et al., 2019;Peters et al., 2018a;Lample & Conneau, 2019;Peters et al., 2018b;Howard & Ruder, 2018;Yang et al., 2019;Liu et al., 2019;Tang et al., 2019). In particular, the BERT model (Devlin et al., 2019) has become a new baseline for NLP tasks including sentence classification, question answering, named-entity recognition, and many others. To our knowledge, there are few measures that address prediction uncertainties in these sophisticated deep learning structures, or that explain how to reach optimal decisions based on observed uncertainty measures. The off-the-shelf softmax outputs of these models are predictive probabilities, and they are not a valid measure of confidence in a network's predictions (Gal & Ghahramani, 2016;Maddox et al., 2019;Pearce et al., 2018;Shridhar et al., 2019), which is an important concern in real-world applications (Larson et al., 2019).
Our main contribution in this paper is applying advances in Bayesian Deep Learning to quantify uncertainties in BERT intent predictions. Formal methods such as Stochastic Gradient (SG)-MCMC (Li et al., 2016;Rao & Frtunikj, 2018;Welling & Teh, 2011;Park et al., 2018;Maddox et al., 2019;Seedat & Kanan, 2019) and variational inference (Blundell et al., 2015;Gal & Ghahramani, 2016;Graves, 2011;Hernández-Lobato & Adams, 2015), both extensively discussed in the literature, may require modifying the network. Re-implementing the entire BERT model for Bayesian inference is a non-trivial task, so we take the Monte Carlo Dropout (MCD) approach (Gal & Ghahramani, 2016) to approximate variational inference, whereby dropout is performed at both training and test time using multiple dropout masks. Our dropout experiments are compared with two other approaches (Entropy and Dummy-class), and the final implementation is chosen based on the trade-off between accuracy and efficiency.
We also investigate the use of BERT as a language model to decipher spelling errors. Most vendor-based chatbot solutions embed an additional layer of service, where device-dependent error models and N-gram language models (Lin et al., 2012) are used for spell checking and language interpretation. At the representation layer, the WordPiece model (Schuster & Nakajima, 2012) and the Byte-Pair-Encoding (BPE) model (Gage, 1994;Sennrich et al., 2016) are common techniques for segmenting words into smaller units, so that similarities at the sub-word level can be captured by NLP models and generalized to out-of-vocabulary (OOV) words. Our approach combines both: words corrected by the proposed language model are further tokenized by the WordPiece model to match the pre-trained embeddings used in BERT learning.
Despite all the advances in chatbots, industries like finance and healthcare are concerned about cyber-security because of the large amount of sensitive information entered during chatbot sessions. Task-oriented bots often require access to critical internal systems and confidential data to finish specific tasks. Therefore, 100% on-premise solutions that enable full customization, monitoring, and smooth integration are preferable to cloud solutions. In this paper, the proposed chatbot is designed using the open-source version of RASA and deployed within our enterprise intranet. Using RASA's conversational design, we hybridize RASA's chitchat module with the proposed task-oriented conversational systems developed in Python, TensorFlow, and PyTorch. We believe our approach can provide useful guidance for industries contemplating chatbot solutions in their business domains.
Principled uncertainty estimation in regression (V. Kuleshov & Ermon, 2018), reinforcement learning (et al., 2016) and classification (et al., 2017) are active areas of research with a large volume of work. The theory of Bayesian neural networks (Neal, 1995;MacKay, 1992) provides the tools and techniques to understand model uncertainty, but these techniques come with significant computational costs, as they double the number of parameters to be trained. Gal and Ghahramani (Gal & Ghahramani, 2016) showed that a neural network with dropout turned on at test time is equivalent to a deep Gaussian process, and that model uncertainty estimates can be obtained from such a network by sampling its predictions multiple times at test time. Non-Bayesian approaches have also been shown to produce reliable uncertainty estimates (B. Lakshminarayanan, 2017); our focus in this paper is on Bayesian approaches. In classification tasks, the uncertainty obtained from multiple sampling at test time is an estimate of the confidence in the predictions, similar to the entropy of the predictions. In this paper, we compare setting the threshold for escalating a query to a human operator using model uncertainty obtained from the dropout-based chatbot against setting the threshold using the entropy of the predictions. We choose dropout-based Bayesian approximation because, compared with other Bayesian approaches, it does not require changes to the model architecture, does not add parameters to train, and does not change the training process. We minimize noise in the data by applying spelling correction models before classifying the input. Further, the labels for the user queries are human-curated with minimal error. Hence, our focus is on quantifying epistemic uncertainty in AVA rather than aleatoric uncertainty (Kendall & Gal, 2017).
We use mixed-integer optimization to find a threshold for human escalation of a user query based on the mean prediction and the uncertainty of the prediction. This optimization step, once again, does not require modifications to the network architecture and can be implemented separately from model training. In other contexts, it might be fruitful to have an integrated escalation option in the neural network (Geifman, 2019), and we leave the trade-offs of integrated reject option and non-Bayesian approaches for future work.
Similar approaches to spelling correction, besides those mentioned in Section 1, are reported in Deep Text Corrector (Atpaino, 2017), which applies a seq2seq model to automatically correct small grammatical errors in conversational written English. Optimal decision threshold learning under uncertainty is studied in (Lepora, 2016) using reinforcement learning and iterative Bayesian optimization formulations.

System Overview and Data Sets
Overview of the System
Figure 1 illustrates the system overview of AVA. The proposed conversational AI will gradually replace the traditional human-human interactions between phone agents and internal experts, and will eventually allow clients to interact directly with the AI system through self-provisioning. Currently, phone agents interact with the AVA chatbot deployed on Microsoft Teams in our company intranet, and their questions are preprocessed by a Sentence Completion Model (introduced in Section 6) to correct misspellings. Inputs are then classified by an intent classification model (Sections 4 and 5), where relevant questions are assigned predicted intent labels, and downstream information retrieval and question answering modules are triggered to extract answers from a document repository. Irrelevant questions are escalated to human experts following the decision thresholds optimized using methods introduced in Section 5. This paper only discusses the Intent Classification model and the Sentence Completion model.

Data for Intent Classification Model
Training data for AVA's intent classification model was collected, curated, and generated by a dedicated business team from interaction logs between phone agents and the expert team. The whole process took about one year. In total, 22,630 questions were selected and classified into 381 intents, which compose the relevant question set for the intent classification model. Additionally, 17,395 questions were manually synthesized as irrelevant questions, none of which belongs to any of the aforementioned 381 intents. Each relevant question is hierarchically assigned three labels, from Tier 1 to Tier 3. In this hierarchy, there are 5 unique Tier-1 labels, 107 Tier-2 labels, and 381 Tier-3 labels. Our intent classification model is designed to classify
• Some questions are relevant to business intents but unsuitable to be processed by conversational AI. For example, in Table 1, the question "How can we get into an account with only one security question?" is related to Call Authentication in Account Permission, but its response needs further human diagnosis to collect more information. These types of questions should be escalated to human experts.
• Out of scope questions. For example, questions like "What is the best place to learn about Vanguard's investment philosophy?" or "What is a hippopotamus?" are totally outside the scope of our training data, but they may still occur in real world interactions.

Textual Data for Pretrained Embeddings and Sentence Completion Model
Inspired by progress in computer vision, transfer learning has been very successful in the NLP community and has become common practice. Initializing a deep neural network with pre-trained embeddings and fine-tuning the model on task-specific data is a proven method in multi-task NLP learning. In our approach, besides applying off-the-shelf embeddings from Google BERT and XLNet, we also pre-train BERT embeddings using our company's proprietary text to capture the special semantic meanings of words in the financial domain. Three types of textual datasets are used for embeddings training:
• Sharepoint text: About 3.2 GB of corpora scraped from our company's internal Sharepoint websites, including web pages, Word documents, PowerPoint slides, PDF documents, and notes from internal CRM systems.
• Emails: About 8 GB of customer service emails are extracted.
• Phone call transcriptions: We apply AWS to transcribe 500K client service phone calls, and the transcription text is used for training.
All embeddings are trained in case-insensitive settings. Attention and hidden layer dropout probabilities are set to 0.1, the hidden size is 768, the numbers of attention heads and hidden layers are both set to 12, and the vocabulary size is 32000 using the SentencePiece tokenizer. On an AWS P3.2xlarge instance, each set of embeddings is trained for 1 million iterations and takes about one week of CPU time to finish. More details about parameter selection for pre-training are available in the GitHub code. The same pre-trained embeddings are used to initialize BERT model training in intent classification, and are also used as language models in sentence completion.
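The hyperparameters listed above can be collected into a configuration dictionary. This is a minimal sketch: the key names follow the layout of the public `bert_config.json` file, only values quoted in the text are filled in, and the helper `head_dim` is a hypothetical convenience added here for illustration.

```python
import json

# Values quoted in the text; key names follow the public bert_config.json
# layout. Settings not stated in the text are deliberately left out.
bert_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "vocab_size": 32000,
}


def head_dim(config):
    """Per-head dimension (hypothetical helper for a sanity check).

    The hidden size has to split evenly across the attention heads.
    """
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden // heads


print(json.dumps(bert_config, indent=2))
```

A check of this kind is useful before launching a week-long pre-training run, because a mismatch between hidden size and head count only surfaces when the model graph is built.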

Intent Classification Performance on Relevant Questions
Using only relevant questions, we compare various popular model architectures to find the one with the best performance under 5-fold validation. Not surprisingly, BERT models generally perform much better than other models. Large BERT (24-layer, 1024-hidden, 16-heads) shows a slight improvement over small BERT (12-layer, 768-hidden, 12-heads), but is less preferred because of its expensive computation. To our surprise, XLNet, a model reported to outperform BERT in multi-task NLP, performs 2 percent lower on our data.
BERT models initialized with proprietary embeddings converge faster than those initialized with off-the-shelf embeddings (Figure 2.a). Embeddings trained on our company's Sharepoint text perform better than those built on emails and phone-call transcriptions (Figure 2.b). Using a larger batch size (32) enables models to converge faster and leads to better performance.

Intent Classification Performance including Irrelevant Questions
We have shown that the BERT model outperforms other models on real datasets that contain only relevant questions. The capability to handle 381 intents simultaneously at 94.5% accuracy makes it an ideal intent classifier candidate for a chatbot. This section describes how we quantify uncertainties in BERT predictions and enable the bot to detect irrelevant questions. Three approaches are compared:
• Predictive-entropy: We measure the uncertainty of predictions using Shannon entropy $H = -\sum_{k=1}^{K} p_{ik} \log p_{ik}$, where $p_{ik}$ is the predicted probability that the i-th sample belongs to the k-th class. Here, $p_{ik}$ is the softmax output of the BERT network (B. Lakshminarayanan, 2017). A higher predictive entropy corresponds to a greater degree of uncertainty. An optimally chosen cut-off threshold applied to the entropies should then be able to separate the majority of in-sample questions from irrelevant questions.
• Drop-out: We apply Monte Carlo (MC) dropout by drawing 100 Monte Carlo samples. At each inference iteration, a certain percentage of the units is randomly dropped out. This generates random predictions, which are interpreted as samples from a probabilistic distribution (Gal & Ghahramani, 2016). Since we do not employ regularization in our network, $\tau^{-1}$ in Eq. 7 of Gal and Ghahramani (Gal & Ghahramani, 2016) is effectively zero, and the predictive variance is equal to the sample variance from the stochastic passes. We can then investigate the distributions and interpret model uncertainty through mean probabilities and variances.
• Dummy-class: We simply treat escalation questions as a dummy class to distinguish them from original questions. Unlike entropy and dropout, this approach requires retraining of BERT models on the expanded data set including dummy class questions.
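The predictive-entropy measure from the first approach can be sketched in a few lines of NumPy. This is an illustrative example rather than the paper's code: the function names and the toy logits are assumptions, and the small `eps` clip is added only to avoid taking the logarithm of zero.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def predictive_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of an (N, K) probability matrix."""
    p = np.clip(probs, eps, 1.0)  # guard against log(0)
    return -(p * np.log(p)).sum(axis=1)


# Toy logits for two questions over four intents (made-up values):
# the first is confidently classified, the second is maximally uncertain.
logits = np.array([[8.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
H = predictive_entropy(softmax(logits))
```

The second row has uniform probabilities, so its entropy equals log(4), the largest value possible for four classes; a threshold on H would escalate it before the first row.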

Experimental Setup
All results mentioned in this section are obtained using BERT small + sharepoint embeddings (batch size 16

Optimizing Entropy Decision Threshold
To find the optimal threshold cutoff $b$, we consider a Quadratic Mixed-Integer programming problem (1) that minimizes the quadratic loss between the predictive assignments $x_{ik}$ and true labels $l_{ik}$. In (1), $i$ is the sample index and $k$ is the class (intent) index. $x_{ik}$ is an $N \times (K + 1)$ binary matrix, and $l_{ik}$ is also $N \times (K + 1)$, where the first $K$ columns are binary values and the last column is a uniform vector $\delta$, which represents the cost of escalating questions. Normally $\delta$ is a constant smaller than 1, which encourages the bot to escalate questions rather than make mistaken predictions. The first and second constraints of (1) force an escalation label when the entropy $E_i \geq b$. The third and fourth constraints restrict $x_{ik}$ to binary values and ensure that the sum for each sample is 1. Experimental results (Figure 3) indicate that (1) needs more than 5000 escalation questions to learn a stable $b$. The value of the escalation cost $\delta$ has a significant impact on the optimal $b$ value, and is set to 0.5 in our implementation.
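The role of the escalation cost can be illustrated with a simplified stand-in for problem (1). The sketch below is not the mixed-integer formulation: it replaces the solver with an exhaustive one-dimensional search, and it assumes a simple cost model (escalation costs `delta`, a wrong answer costs 1, a correct answer costs 0). All function names and the toy data are hypothetical.

```python
import numpy as np


def escalation_cost(entropy, correct, b, delta=0.5):
    """Total cost when every sample with entropy >= b is escalated.

    Assumed cost model (illustrative only): escalating costs `delta`;
    an answered question costs 0 if the predicted intent is right and
    1 otherwise. Irrelevant questions should be passed as correct=False.
    """
    escalate = entropy >= b
    answered_cost = np.where(correct, 0.0, 1.0)
    return float(np.where(escalate, delta, answered_cost).sum())


def best_entropy_threshold(entropy, correct, delta=0.5):
    """Exhaustive search over the observed entropies as candidate cut-offs."""
    # np.inf is included so that "never escalate" is also a candidate.
    candidates = np.append(np.unique(entropy), np.inf)
    costs = [escalation_cost(entropy, correct, b, delta) for b in candidates]
    return float(candidates[int(np.argmin(costs))])


# Toy data: three confident, correct predictions and three uncertain ones.
entropy = np.array([0.05, 0.10, 0.20, 1.50, 2.00, 2.50])
correct = np.array([True, True, True, False, False, False])
b = best_entropy_threshold(entropy, correct)
```

With `delta` below 1, escalating an uncertain question is cheaper than answering it wrongly, so the search settles on the smallest entropy among the mistaken predictions. Raising `delta` towards 1 removes that incentive, which is consistent with the sensitivity to $\delta$ noted above.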

Monte Carlo Drop-out
In the BERT model, dropout ratios can be customized at the encoding, decoding, attention, and output layers. A combinatorial search for optimal dropout ratios is computationally challenging. Results reported in this paper are obtained through a simplification in which the same dropout ratio is assigned and varied on all layers. Our MC dropout experiments are conducted as follows:
According to the experimental results illustrated in Figure 4, we draw three conclusions:
(1) Epistemic uncertainty estimated by MCD reflects question relevance: when inputs are similar to the training data, uncertainty is low, whereas data that differs from the original training data should have higher epistemic uncertainty.
(2) Converged models (more training epochs) should have similar uncertainty and accuracy no matter what dropout ratio is used.
(3) The number of epochs and dropout ratios are important hyper-parameters that have substantial impacts on uncertainty measure and predictive accuracy and should be cross-validated in real applications.
We use mean probabilities and standard deviations obtained from models with dropout ratios set to 10% after 30 epochs of training to learn optimal decision thresholds. Our goal is to optimize a lower bound $c$ and an upper bound $d$, and to designate a question as relevant only when the mean predictive probability $P_{ik}$ is larger than $c$ and the standard deviation $V_{ik}$ is lower than $d$. Optimizing $c$ and $d$ on a 381-class problem is much more computationally challenging than learning the entropy threshold, because the number of constraints is proportional to the number of classes. As shown in (2), we introduce two variables $\alpha$ and $\beta$ to indicate the status of the mean probability and deviation conditions, and the final assignment variable $x$ is the logical AND of $\alpha$ and $\beta$. Solving (2) with more than 10k samples is very slow (shown in Appendix), so we use 1500 original relevant questions and increase the number of irrelevant questions from 100 to 3000. For performance testing, the optimized $c$ and $d$ are applied as decision thresholds to samples of BERT predictions on test data. Dropout performance is presented in Table 3 and the Appendix. Our results show that the decision thresholds optimized from (2) with 2000 irrelevant questions give the best F1 score (0.754); we validated this using grid search and confirmed its optimality (shown in Appendix).
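The decision rule on $c$ and $d$, together with the grid-search check mentioned above, can be sketched as follows. This is a hedged illustration and not the optimization in (2): it assumes the rule is applied to the class with the highest mean probability, and the function names and toy samples are hypothetical.

```python
import numpy as np


def mc_dropout_stats(samples):
    """Mean and standard deviation over T stochastic passes.

    samples has shape (T, N, K): passes, questions, classes.
    """
    return samples.mean(axis=0), samples.std(axis=0)


def is_relevant(mean, std, c, d):
    """Relevant iff the top class has mean probability > c and std < d."""
    top = mean.argmax(axis=1)
    rows = np.arange(mean.shape[0])
    return (mean[rows, top] > c) & (std[rows, top] < d)


def f1_score(pred, truth):
    """F1 for boolean arrays, treating 'relevant' as the positive class."""
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def grid_search(mean, std, truth, c_grid, d_grid):
    """Return (best F1, c, d) over all pairs in the two grids."""
    best = (-1.0, None, None)
    for c in c_grid:
        for d in d_grid:
            score = f1_score(is_relevant(mean, std, c, d), truth)
            if score > best[0]:
                best = (score, c, d)
    return best


# Three dropout passes over two questions and three intents (toy values):
# question 0 is stable across passes, question 1 fluctuates strongly.
samples = np.array([
    [[0.90, 0.05, 0.05], [0.6, 0.3, 0.1]],
    [[0.92, 0.04, 0.04], [0.2, 0.5, 0.3]],
    [[0.88, 0.06, 0.06], [0.4, 0.1, 0.5]],
])
mean, std = mc_dropout_stats(samples)
flags = is_relevant(mean, std, c=0.5, d=0.1)
best = grid_search(mean, std, np.array([True, False]), [0.3, 0.5], [0.1, 0.3])
```

A grid search of this kind scales with the grid resolution rather than with the number of classes, which is why it is a convenient way to cross-check thresholds returned by a solver.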

Dummy-class Classification
Our third approach is to train a binary classifier using both relevant questions and irrelevant questions in BERT. We use a dummy class to represent those 17,395 irrelevant questions, and split the entire data sets, including relevant and irrelevant, into five folds for training and test.
The performance of the Dummy-class approach is compared with the Entropy and Dropout approaches (Table 3). Deciding the optimal number of irrelevant questions to involve in threshold learning is non-trivial, especially for the Entropy and Dummy-class approaches. Dropout does not need as many irrelevant questions as Entropy does to learn an optimal threshold, mainly because the number of constraints in (2) is proportional to the number of classes (381), so the number of constraints is large enough to learn a suitable threshold on small samples (to support this conclusion, we present extensive studies in the Appendix on a 5-class classifier using Tier 1 intents). The Dummy-class approach obtains the best performance, but its success assumes that the learned decision boundary generalizes well to any new irrelevant questions, which is often not valid in real applications. In contrast, the Entropy and Dropout approaches only need to treat a binary problem in the optimization and leave the intent classification model intact. The optimization problem for the Entropy approach can be solved much more efficiently, and it is selected as the solution for our final implementation.
It is certainly possible to combine the Dropout and Entropy approaches, for example, by optimizing thresholds on entropy calculated from the mean of the MC dropout predictions. Furthermore, the problem defined in (2) may be simplified by proper reformulation and solved more efficiently, which we will explore in future work.

Sentence Completion using Language Model
Algorithm
We assume misspelled words are all OOV words, so we can replace them with [MASK] tokens and use bidirectional language models to predict them. Predicting masked words within sentences is an inherent objective of a pre-trained bidirectional model, and we use the Masked Language Model API in the Transformers package (Hugging-Face, 2017) to generate the ranked list of candidate words for each [MASK] position. The sentence completion algorithm is illustrated in Algorithm 1.
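One simple way to combine a ranked candidate list with word edit distance is sketched below. It is an assumption-laden illustration, not Algorithm 1: the candidate list is hard-coded instead of coming from a masked language model, and the selection rule (smallest edit distance, ties broken by higher LM probability) is one plausible choice among several.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def pick_correction(misspelled, candidates):
    """Choose a replacement for an OOV word.

    candidates is a list of (word, lm_probability) pairs, standing in for
    the top-M predictions at a [MASK] position. The assumed rule prefers
    the smallest edit distance and breaks ties by higher LM probability.
    """
    ranked = min(
        candidates,
        key=lambda wp: (edit_distance(misspelled, wp[0]), -wp[1]),
    )
    return ranked[0]


# Hypothetical masked-LM output for "how do I close my [MASK]".
candidates = [("account", 0.30), ("amount", 0.25), ("fund", 0.20)]
best = pick_correction("acount", candidates)
```

Here "account" and "amount" are both one edit away from "acount", so the language model probability decides between them; this is the situation in which edit distance alone would be ambiguous.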

Experimental Setup
For each question, we randomly permutate two characters in the longest word, the next longest word, and so on. In this way, we generate one to three synthetic misspellings in each question. We investigate intent classification accuracy changes on these questions, and how our sentence completion model can prevent performance changes. All models are trained using relevant data (80%) without misspellings and validated on synthetic misspelled test data. Five settings are compared: (1) No correction: classification performance without applying any auto-correction; (2) No LM: Auto-corrections made only by word edit distance without using Masked Language model; (3) BERT Sharepoint: Auto-corrections made by Masked LM using pre-trained sharepoint embeddings together with word edit distance; (4) BERT Email: Auto-corrections using pretrained email embeddings together with word edit distance; (5) BERT Google: Auto-corrections using pretrained Google Small uncased embedding data together with word edit distance.
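The synthetic misspelling procedure can be sketched as follows. The function names, the seeded random generator, and the example question are assumptions made for illustration; the paper's exact sampling details are not given in the text.

```python
import random


def swap_two_chars(word, rng):
    """Swap two randomly chosen positions of a word."""
    if len(word) < 2:
        return word
    i, j = rng.sample(range(len(word)), 2)
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    # Note: if both positions hold the same letter, the word is unchanged.
    return "".join(chars)


def add_misspellings(question, n_errors, seed=0):
    """Corrupt the n_errors longest words of a question, longest first."""
    rng = random.Random(seed)
    words = question.split()
    # Word indices ordered by decreasing length; ties keep original order.
    order = sorted(range(len(words)), key=lambda k: -len(words[k]))
    for k in order[:n_errors]:
        words[k] = swap_two_chars(words[k], rng)
    return " ".join(words)


noisy = add_misspellings("how do I transfer my retirement account", 2)
```

Targeting the longest words first tends to corrupt content words rather than function words, which makes the resulting drop in intent accuracy a reasonably demanding test for the correction model.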
We also need to decide what counts as an OOV word or, equivalently, what should be included in our vocabulary. After experiments, we define our vocabulary as words from four categories: (1) all words in the pre-trained embeddings; (2) all words that appear in training questions; (3) words that are fully capitalized, because they are likely to be proper nouns, fund tickers, or service products; (4) all words that start with numbers, because they can be tax forms or specific products (e.g., 1099b, 401k, etc.). The purpose of including (3) and (4) is to avoid auto-correcting keywords that may represent significant intents. Any word that falls outside these four groups is considered an OOV. In production, we continuously monitor the OOV rate, defined as the ratio of OOV occurrences to total word counts over the most recent 24 hours. When it is higher than 1%, we manually intervene to check the chatbot log data.
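The four vocabulary categories and the OOV rate can be expressed as a small membership test. This is a sketch under stated assumptions: lookups are lowercased because the embeddings are case-insensitive, the two vocabularies are toy sets, and the function names are hypothetical.

```python
def is_oov(word, embedding_vocab, training_vocab):
    """Apply the four vocabulary categories described above."""
    if word.isupper():       # category 3: fully capitalized words
        return False
    if word[:1].isdigit():   # category 4: words starting with a number
        return False
    lowered = word.lower()   # assumed: case-insensitive lookup
    # Categories 1 and 2: pre-trained embeddings and training questions.
    return lowered not in embedding_vocab and lowered not in training_vocab


def oov_rate(words, embedding_vocab, training_vocab):
    """Ratio of OOV occurrences to total word count."""
    if not words:
        return 0.0
    hits = sum(is_oov(w, embedding_vocab, training_vocab) for w in words)
    return hits / len(words)


# Toy vocabularies (illustrative only).
embedding_vocab = {"how", "do", "i", "open", "a", "account"}
training_vocab = {"roth"}
rate = oov_rate(["open", "a", "roth", "acount"], embedding_vocab, training_vocab)
```

In a monitoring job, the same `oov_rate` function would be fed the words logged during the last 24 hours and compared against the 1% alert level mentioned above.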
Figure 5. (a) Accuracy: Single OOV; (b) Accuracy: Two OOVs; (c) Accuracy: Three OOVs; (d) Accuracy per beam size. As expected, misspelled words can significantly decrease intent classification performance. The same BERT model that achieved 94% on clean data dropped to 83.5% when a single OOV occurred in each question. It further dropped to 68% and 52%, respectively, when two and three OOVs occurred. In all experiments, LM models proved useful in correcting words and reducing the performance drop, while domain-specific embeddings trained on Vanguard Sharepoint and Email text outperform off-the-shelf Google embeddings. The beam size B (M) was benchmarked as shown in subfigure (d), and was set to 4000 to generate the results in subfigures (a) to (c).
We also need to determine two additional parameters: $M$, the number of candidate tokens prioritized by the masked language model, and $B$, the beam size in our sentence completion model. In our approach, we set $M$ and $B$ to the same value, which is benchmarked from 1 to 10k by test sample accuracy. Note that when $M$ and $B$ are large and there are more than two OOVs, Beam Search in Algorithm 1 becomes very inefficient. To simplify this, instead of finding the optimal combination of candidate tokens that maximizes the joint probability $\arg\max \prod_{i=1}^{d} p_i$, we assume the candidates are independent and apply a simplified algorithm (shown in Appendix) to each OOV separately. An improved version of the sentence completion algorithm that maximizes the joint probability is left for future research. We have not considered situations in which misspellings are not OOV. Detecting improper words in a sentence may require evaluating metrics such as Perplexity or Sensibleness and Specificity Average (SSA) (Adiwardana et al., 2020), and is a goal of our future work.
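The effect of the independence assumption on the joint maximization can be written out explicitly. In the sketch below, $w_i$ denotes the candidate chosen for the i-th OOV position and $p_i(w_i)$ its masked language model probability; this notation is introduced here for illustration and assumes each $p_i$ depends only on its own candidate.

```latex
\[
\arg\max_{w_1,\dots,w_d} \prod_{i=1}^{d} p_i(w_i)
\;=\;
\Bigl(\arg\max_{w_1} p_1(w_1),\; \dots,\; \arg\max_{w_d} p_d(w_d)\Bigr)
\]
```

With $M$ candidates per position, this replaces a search over up to $M^d$ combinations with $d$ separate searches over $M$ candidates each, which is why correcting the OOVs one by one is so much cheaper. The price is that a correction at one position cannot influence the candidates proposed at another.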

Results
According to the experimental results illustrated in Figure 5, pre-trained embeddings are useful to increase the robustness of intent prediction on noisy inputs. Domain-specific embeddings contain much richer context-dependent semantics that helps OOVs get properly corrected, and leads to better task-oriented intent classification performance. Benchmark shows B≥4000 leads to the best performance for our problem. Based on this, we apply sharepoint embeddings as the language model in our sentence completion module.

Implementation
The chatbot has been implemented fully inside our company network using open-source tools including RASA (Bocklisch et al., 2017), TensorFlow, and PyTorch in a Python environment. All backend models (Sentence Completion model, Intent Classification model, and others) are deployed as RESTful APIs in AWS SageMaker. The front end of the chatbot is launched on Microsoft Teams, powered by Microsoft Bot Framework and Microsoft Azure directory, and connected to backend APIs in the AWS environment. All our BERT model training, including embeddings pre-training, is based on BERT TensorFlow running on an AWS P3.2xlarge instance. The optimization procedure uses Gurobi 8.1 running on an AWS C5.18xlarge instance. The BERT language model API in the sentence completion model is developed using the Transformers 2.1.1 package on PyTorch 1.2 and TensorFlow 2.0.
During our implementation, we further explore how the intent classification model API can be served in real applications under a budget. We gradually reduce the numbers of attention layers and hidden layers in the original BERT Small model (12 hidden layers, 12 attention heads) and create several smaller models. By halving the number of hidden layers and attention layers, we see a remarkable 100% increase in performance (double the throughput, half the latency) at the cost of only a 1.6% drop in intent classification performance.
Table 4. Benchmark of intent classification API performance across different models in real application. Each model is tested using 10 threads, simulating 10 concurrent users, for a duration of 10 minutes. In this test, models are not served with Monte Carlo sampling, so inference is done only once. All models are hosted on identical AWS m5.4xlarge CPU instances. As seen, the simplest model (6A-6H, 6 attention layers and 6 hidden layers) has double the throughput and half the latency of the original BERT small model, while accuracy drops only 1.6%. Performance is evaluated using JMeter at the client side, and APIs are served using Domino Lab 3.6.17 Model API. Throughput indicates how many API responses are made per second. Latency is measured as the time elapsed between sending a request and receiving the response at the client side.

Conclusions
Our results demonstrate that optimized uncertainty thresholds applied on BERT model predictions are promising to escalate irrelevant questions in task-oriented chatbot implementation, meanwhile the state-of-the-art deep learning architecture provides high accuracy on classifying into a large number of intents. Another feature we contribute is the application of BERT embeddings as language model to automatically correct small spelling errors in noisy inputs, and we show its effectiveness in reducing intent classification errors. The entire end-to-end conversational AI system, including two machine learning models presented in this paper, is developed using open source tools and deployed as in-house solution. We believe those discussions provide useful guidance to companies who are motivated to reduce dependency on vendors by leveraging state-of-the-art open source AI solutions in their business.
We will continue our explorations in this direction, with particular focus on the following issues:
(1) Fine-tuning and decision threshold learning are currently two separate steps, and we will explore the possibility of combining them as a new cost function in BERT model optimization.
(2) The dropout methodology applied in our paper belongs to approximate inference methods, and is a crude approximation to exact posterior learning in parameter space. We are interested in a Bayesian version of BERT, which requires a new architecture based on variational inference using tools like TensorFlow Probability (TFP).
(3) Maintaining a chatbot production system requires a complex pipeline to continuously transfer and integrate features from deployed models into new versions for new business needs, which is uncharted territory for all of us.
(4) Hybridizing "chitchat" bots, using state-of-the-art progress in deep neural models, with task-oriented machine learning models is important for our preparation of a client self-provisioning service.

A. Appendix
All extended materials and source code related to this paper are available at https://github.com/cyberyu/ava. Our repo is composed of two parts: (1) extended materials related to the main paper, and (2)
When multiple OOVs occur in a sentence, we avoid the computational burden of using a large beamsize to find the optimal joint probabilities by assuming that the candidate words for all OOVs are independent, and we apply Algorithm 2 one by one to correct the OOVs.
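The independence assumption can be illustrated with a minimal sketch in which each OOV is corrected on its own, without searching over joint combinations. This is not a reproduction of Algorithm 2: the helper names (`edit_distance`, `correct_one`, `correct_independently`), the rule of filtering candidates by edit distance and then taking the most probable one, the `max_distance` cutoff, and the candidate probabilities are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two words, by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct_one(oov, candidates, max_distance=2):
    """Pick a correction for a single OOV word.

    candidates: dict mapping a candidate word to a (hypothetical)
                masked language model probability.
    """
    close = {w: p for w, p in candidates.items()
             if edit_distance(oov, w) <= max_distance}
    if not close:
        return oov  # no plausible correction, keep the original word
    return max(close, key=close.get)


def correct_independently(tokens, oov_candidates):
    """Correct each OOV position separately (no joint beam search)."""
    return [correct_one(tok, oov_candidates[i]) if i in oov_candidates else tok
            for i, tok in enumerate(tokens)]


# Made-up candidate probabilities for two misspelled words.
tokens = ["does", "it", "requre", "signatre"]
oov_candidates = {
    2: {"require": 0.6, "request": 0.3, "need": 0.1},
    3: {"signature": 0.7, "signal": 0.2},
}
corrected = correct_independently(tokens, oov_candidates)
```

Because each position is handled separately, the cost grows linearly with the number of OOVs instead of with the product of their candidate lists.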

A.2.1. BERT EMBEDDINGS MODEL PRETRAINING
The Jupyter notebook for pretraining embeddings is at https://github.com/cyberyu/ava/blob/master/scripts/notebooks/BERT_PRETRAIN_Ava.ipynb. Our script is adapted from Denis Antyukhov's blog post "Pre-training BERT from scratch with cloud TPU". We set VOC_SIZE to 32000 and use the SentencePiece tokenizer as an approximation of Google's WordPiece. The learning rate is set to 2e-5, the training batch size to 16, the number of training steps to 1 million, MAX_SEQ_LENGTH to 128, and MASKED_LM_PROB to 0.15.
To ensure the embeddings are trained with the right architecture, please make sure the bert_config.json file referred to in the script has the right numbers of hidden and attention layers.
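A quick sanity check of the configuration file before launching pretraining might look like the sketch below. The helper `check_bert_config` is hypothetical. The keys `num_hidden_layers`, `num_attention_heads`, and `vocab_size` are the standard fields of a BERT configuration file; mapping the "attention layers" of the 6A-6H model to `num_attention_heads` is our assumption here.

```python
import json
import os
import tempfile


def check_bert_config(path, hidden_layers, attention_heads):
    """Return True if the config file matches the intended architecture."""
    with open(path) as f:
        config = json.load(f)
    return (config.get("num_hidden_layers") == hidden_layers
            and config.get("num_attention_heads") == attention_heads)


# Demonstration on a temporary file standing in for the real bert_config.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bert_config.json")
    with open(path, "w") as f:
        json.dump({"num_hidden_layers": 6,
                   "num_attention_heads": 6,
                   "vocab_size": 32000}, f)
    config_ok = check_bert_config(path, hidden_layers=6, attention_heads=6)
    config_mismatch = check_bert_config(path, hidden_layers=12, attention_heads=12)
```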
The main script, run_classifier_inmem.py, is adapted from the default BERT script run_classifier.py, with a new function serving_input_fn() added. To export the model in the same command once training is finished, the '-do_export=true' flag needs to be set, and the trained model will be exported to the directory specified by the '-export_dir' flag.

A.2.3. MODEL SERVING API SCRIPT
We create a Jupyter notebook to demonstrate how the exported model can be served as an in-memory classifier for intent classification, located at https://github.com/cyberyu/ava/scripts/notebooks/inmemory_intent.ipynb. The script loads the entire BERT graph from the exported directory, keeps it in memory, and provides inference results on new questions. Please note that in the getSess() function, users need to specify the correct exported directory and the correct embeddings vocabulary path.
We use transformers-cli (https://huggingface.co/transformers/converting_tensorflow_models.html) to convert our early pretrained embeddings to PyTorch format. The input parameters for the API are:
• Input sentence. There are three use cases:
  - The input sentence can be noisy (containing misspelled words) and require auto-correction. As shown in the example, the input sentence has some misspelled words.
  - Alternatively, it can be a masked sentence, in the form of "Does it require [MASK] signature for IRA signup", where [MASK] indicates a word that needs to be predicted. In this case, the predicted words will not be matched back to input words. Every masked word has a separate output of the top M predicted words, but the main output is still a single completed sentence (because masks can be combined with misspelled words and cause a large search).
  - Alternatively, the sentence can be a complete sentence that only needs to be evaluated for its perplexity score. Note that the score is for the entire sentence. The lower the score, the more usual the sentence is.
• Beamsize: This determines how many alternative choices the model needs to explore to complete the sentence. We have three versions of the function: predict_oov_v1, predict_oov_v2, and predict_oov_v3. When there are multiple [MASK] signs in a sentence and beamsize is larger than 100, the v3 function is used for independent correction of multiple OOVs. If beamsize is smaller than 100, v2 is used for joint-probability-based correction. If a sentence has only one [MASK] sign, v1 (Algorithm 2 in Appendix) is used.
• Customized vocabulary: The default vocabulary is the encoding vocabulary used when the bidirectional language model was trained. Any word in the sentence that does not occur in the vocabulary is treated as an OOV, and will be predicted and matched. To avoid predicting unwanted words, include them in the customized vocabulary. For multiple words, combine them with "-", and the algorithm will split them into a list. The customized vocabulary can be turned off at runtime by passing None as the parameter.
• Ignore rule: Sometimes we expect the model to ignore a range of words matching specific patterns, for example, all words that are capitalized or all words that start with numbers. These can be specified as ignore rules using regular expressions, so that they are not processed as OOV words. For example, the expression "[A-Z]+" tells the model to ignore all uppercase words, so it will not treat 'IRA' as an OOV even though it is not in the embeddings vocabulary (because the embeddings are lowercased). To turn this function off, use None as the parameter.
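The interaction of the vocabulary, customized vocabulary, ignore rule, and beamsize parameters can be sketched as follows. Both functions are hypothetical illustrations rather than the code in our repo: `find_oovs` assumes a case-sensitive lookup and a full-word regular expression match, and `choose_version` assigns the unspecified case of beamsize exactly 100 to v2.

```python
import re


def find_oovs(tokens, vocab, custom_vocab=None, ignore_rule=None):
    """Return the positions of tokens that would be treated as OOV.

    custom_vocab and ignore_rule may be None, which turns them off.
    """
    known = set(vocab) | set(custom_vocab or [])
    pattern = re.compile(ignore_rule) if ignore_rule else None
    oovs = []
    for i, tok in enumerate(tokens):
        if pattern and pattern.fullmatch(tok):
            continue  # skipped by the ignore rule, never treated as OOV
        if tok not in known:
            oovs.append(i)
    return oovs


def choose_version(num_masks, beamsize):
    """Select which correction function handles the sentence."""
    if num_masks <= 1:
        return "predict_oov_v1"
    # Assumption: beamsize == 100 falls into the joint-probability branch.
    return "predict_oov_v3" if beamsize > 100 else "predict_oov_v2"


tokens = "does it requre signature for IRA signup".split()
vocab = {"does", "it", "require", "signature", "for"}  # toy lowercase vocabulary

with_rule = find_oovs(tokens, vocab, custom_vocab=["signup"], ignore_rule="[A-Z]+")
without_rule = find_oovs(tokens, vocab, custom_vocab=["signup"], ignore_rule=None)
```

In this toy example, the ignore rule keeps 'IRA' out of the OOV list, while the misspelled word is still detected.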
The model returns two values: the completed sentence, and its perplexity score.
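For reference, a perplexity score of this kind is conventionally the exponential of the average negative log-probability of the tokens in the sentence. The sketch below uses that standard definition with made-up token probabilities; how the per-token probabilities are obtained from the masked language model (for example, by masking each token in turn) is an assumption and is not specified here.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sentence given the probability of each of its tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Hypothetical per-token probabilities for two sentences.
usual_sentence = [0.5, 0.5, 0.5, 0.5]
unusual_sentence = [0.1, 0.1, 0.1, 0.1]

usual_score = perplexity(usual_sentence)
unusual_score = perplexity(unusual_sentence)
```

A sentence whose tokens are all well predicted gets a lower score than one whose tokens are unlikely, which matches the interpretation that a lower score indicates a more usual sentence.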

A.5. RASA Server Source Code
The proposed chatbot utilizes RASA's open framework to integrate RASA's "chitchat" capability with our customized task-oriented models. To achieve this, we set up an additional action endpoint server to handle dialogues that trigger customized actions (sentence completion + intent classification), which are specified in the actions.py file. Dialogue management is handled by RASA's Core dialogue management models, whose training data is specified in the stories.md file. Accordingly, in the run_core function of the dialogue_model.py file, the agent loads two components: the NLU interpreter and the action endpoint.
The entire RASA project for the chatbot is shared at https://github.com/cyberyu/ava/bot. Please follow the guidance in the README file to set up the backend process.

A.6. Microsoft Teams Setup
Our chatbot uses Microsoft Teams as the front end to connect to the RASA backend. We realize that setting up MS Teams smoothly is a non-trivial task, especially in an enterprise-controlled environment, so we share detailed steps in the GitHub repo.

A.7. Connect MS Teams to RASA
On the RASA side, the main change needed to allow the MS Teams connection is in the dialogue_model.py file. The BotFrameworkInput library needs to be imported, and the correct app id and app password specified in the MS Teams setup should be used to initialize the RASA InputChannel.