Development and testing of a multi-lingual Natural Language Processing-based deep learning system in 10 languages for COVID-19 pandemic crisis: A multi-center study

Purpose: The COVID-19 pandemic has drastically disrupted global healthcare systems. With the higher demand for healthcare and the spread of misinformation related to COVID-19, there is a need to explore alternative models to improve communication. Artificial Intelligence (AI) and Natural Language Processing (NLP) have emerged as promising solutions to improve healthcare delivery. Chatbots could fill a pivotal role in the dissemination and easy accessibility of accurate information in a pandemic. In this study, we developed a multi-lingual NLP-based AI chatbot, DR-COVID, which responds accurately to open-ended, COVID-19 related questions. This was used to facilitate pandemic education and healthcare delivery.
Methods: First, we developed DR-COVID with an ensemble NLP model on the Telegram platform (https://t.me/drcovid_nlp_chatbot). Second, we evaluated various performance metrics. Third, we evaluated multi-lingual text-to-text translation to Chinese, Malay, Tamil, Filipino, Thai, Japanese, French, Spanish, and Portuguese. We utilized 2,728 training questions and 821 test questions in English. Primary outcome measurements were (A) overall and top 3 accuracies; (B) Area Under the Curve (AUC), precision, recall, and F1 score. Overall accuracy referred to a correct response for the top answer, whereas top 3 accuracy referred to an appropriate response for any one answer amongst the top 3 answers. AUC and its relevant metrics were obtained from the Receiver Operating Characteristic (ROC) curve. Secondary outcomes were (A) multi-lingual accuracy; (B) comparison to enterprise-grade chatbot systems. The sharing of training and testing datasets on an open-source platform will also contribute to existing data.
Results: Our NLP model, utilizing the ensemble architecture, achieved overall and top 3 accuracies of 0.838 [95% confidence interval (CI): 0.826–0.851] and 0.922 [95% CI: 0.913–0.932] respectively. For overall and top 3 results, AUC scores of 0.917 [95% CI: 0.911–0.925] and 0.960 [95% CI: 0.955–0.964] were achieved respectively. We achieved multi-lingual support for nine non-English languages, with Portuguese performing the best overall at 0.900. Lastly, DR-COVID generated answers more accurately and quickly than other chatbots, within 1.12–2.15 s across three devices tested.
Conclusion: DR-COVID is a clinically effective NLP-based conversational AI chatbot, and a promising solution for healthcare delivery in the pandemic era.

. Introduction
The COVID-19 pandemic has profoundly changed our lives and disrupted global healthcare systems. The demand for medical services is increasing despite persistent movement and social contact limitations. This is further complicated by misinformation related to COVID-19 on the internet and social media (1,2), which may thwart the implementation of public health measures. Healthcare institutions are therefore exploring alternative models to improve communication, diagnostics, and treatment (3), including the use of digital technology and big data to facilitate healthcare delivery and pandemic control (4). As such, platforms such as telemedicine, Artificial Intelligence (AI) and Natural Language Processing (NLP) chatbots have gained significant prominence (5).
Natural language remains a fundamental way information is communicated in the healthcare setting. NLP is a range of computational techniques used to automatically analyze and represent human language (6). It has multiple utilities including conversational chatbots, automated translation, smart assistants, and predictive text writing (7)(8)(9). With the capacity for "complex dialogue management and conversational flexibility," AI applied in healthcare communication has the potential to benefit humans significantly (10). Chatbots could therefore fill a pivotal role in the dissemination and easy accessibility of accurate information in a pandemic, in an interactive manner akin to conventional patient-physician communication. Voice chatbots are capable of automated acute care triaging, remote monitoring, and chronic disease management (11). NLP chatbots have also been useful in education, including radiation safety training for clinicians (12). Furthermore, chatbots have applications in oncology, including patient support, process efficiency, and health promotion (13).
COVID-19 related applications of NLP include computerized tomography report analysis (14), as well as chatbots. Current medical COVID-19 chatbots face several limitations. First, most of these chatbots are created with English as the intended medium, thus limiting their utility for non-native English speakers (18). In comparison, there is a lack of viable multi-lingual chatbots (19). Next, achieving high accuracy may prove difficult due to nuances in communication. Inputs that are ambiguous or irrelevant to how the chatbot was trained can lead to a lack of meaningful responses by the chatbot (20). Our study aims to address these limitations by developing a multi-lingual chatbot able to respond accurately and quickly to general COVID-19 related questions from patients and the public.
. Materials and methods

. . Chatbot overview
In this multi-center prospective study, we developed a multi-lingual conversational chatbot, DR-COVID, hosted on the Telegram platform to answer COVID-19 related queries (Figure 1). The Telegram chatbot may be accessed at: https://t.me/drcovid_nlp_chatbot. English was used as the base language to train the chatbot. A training dataset of unique question-answer pairings was first created to train the NLP architecture, covering a range of common COVID-19 topics in the English language. Each question-answer pairing was expanded into sub-questions to increase the variety and scope of the training dataset. Subsequently, an independent testing dataset comprising question-answer pairings in English was created to assess English accuracy. Collaborators were also involved in providing testing questions in nine widely-spoken non-English languages. These multi-lingual questions were translated to English questions using the Google Translate Application Programming Interface (API). The questions in English were used as the input for our NLP ensemble model. Similarity calculation was used to retrieve the output, that is, the top 1 and top 3 closest matching answers. If the original question was not in English, the output was translated back into the language of the question using the Google Translate API, and subsequently displayed on the Telegram application. Our study did not involve patients, and was hence exempt from requirements of approval by the Singhealth Institutional Review Board.
Frontiers in Public Health, frontiersin.org

. . Training and testing dataset
Inclusion criteria for input data were topics relating to COVID-19, whereas exclusion criteria were those not relevant to COVID-19. The training and testing question-answer pairs were developed sequentially in English, consolidating data from publicly available sources (Supplementary Table 1). The training dataset was developed by creating main unique questions, each paired with its respective answer (MQA), and grouped based on WHO categories: general information, contact tracing, symptoms and treatment, personal protection, public health travel advisory, safe distancing, and vaccines (21). Poor-quality or unavailable data that could not be further improved upon using the available sources were excluded.
Two categories of MQA were created: Singapore-centric, and global. Singapore-centric questions were defined as those localized to Singapore geographically, specific to Singapore's population, policies, rules, and regulations. Global questions included those pertaining to global statistics, general information on COVID-19, and policies with impact on a global scale.
Each MQA was expanded into 5 to 15 unique sub-questions, and each sub-question was grouped and identified for answer retrieval based on the corresponding MQA. Next, the testing dataset was independently created with at least three questions per MQA. A total of 218 MQA pairings were developed over the period from 1st Jan 2021 to 1st Jan 2022. Data were vetted twice for repetition and grammar, and the finalized content was vetted again.
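The grouping of sub-questions under their parent MQA can be pictured with a small data structure. The following is a minimal sketch only: the class name, field names, and the paraphrased sub-questions are hypothetical illustrations and are not taken from the study's dataset (only the main question text appears in this article).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MQA:
    """One main question-answer pair plus its paraphrased sub-questions (hypothetical schema)."""
    mqa_id: int
    category: str   # e.g. a WHO category such as "vaccines"
    scope: str      # "singapore" or "global"
    question: str
    answer: str
    sub_questions: List[str] = field(default_factory=list)


def build_training_index(mqas: List[MQA]) -> Dict[str, int]:
    """Map every training question (main or sub-question) to the id of its MQA."""
    index: Dict[str, int] = {}
    for mqa in mqas:
        for text in [mqa.question, *mqa.sub_questions]:
            index[text.lower().strip()] = mqa.mqa_id
    return index


# Illustrative entry; the sub-questions and answer are invented placeholders.
example = MQA(
    mqa_id=1,
    category="vaccines",
    scope="global",
    question="What are the available vaccines?",
    answer="(placeholder answer text)",
    sub_questions=[
        "Which COVID-19 vaccines can I get?",
        "What vaccines exist for COVID-19?",
        "Which vaccines are currently offered?",
        "What are my vaccine options?",
        "Name the vaccines available for COVID-19.",
    ],
)
index = build_training_index([example])
```

Every sub-question resolves to the same MQA identifier, so a match on any paraphrase retrieves the single vetted answer.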

. . Natural Language Processing chatbot architecture
Two separate large corpus-trained NLP transformer models, one from the spaCy library (22) and one a locally-modified Bidirectional Encoder Representations from Transformers (BERT) model (23), were combined in a 0.2:0.8 weighting to develop the chatbot architecture. The resultant vector was used for the similarity calculation required for question-answer matching (Figure 2A). Few-shot learning, which involved task sampling instead of direct training using the input dataset, was utilized in a low-resource setting; this enabled our customized BERT model to be better trained when a limited number of MQAs was available in the training dataset (Figure 2B).
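One way to picture the weighted combination is sketched below. It is an illustration under stated assumptions, not the study's implementation: it assumes the 0.2 weight applies to the spaCy vector and the 0.8 weight to the BERT vector, that the two vectors are L2-normalized and concatenated after scaling by the square root of their weights, and that cosine similarity is the similarity measure. With that construction, the cosine similarity of two combined vectors equals the weighted average of the per-model cosine similarities. All function names are hypothetical, and the toy vectors stand in for real embeddings.

```python
import math
from typing import List, Sequence

SPACY_WEIGHT = 0.2  # assumption: weight applied to the spaCy vector
BERT_WEIGHT = 0.8   # assumption: weight applied to the BERT vector


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def ensemble_vector(spacy_vec: Sequence[float], bert_vec: Sequence[float],
                    w_spacy: float = SPACY_WEIGHT,
                    w_bert: float = BERT_WEIGHT) -> List[float]:
    """Concatenate the two unit vectors, each scaled by the square root of its weight."""
    a, b = math.sqrt(w_spacy), math.sqrt(w_bert)
    return ([a * x for x in l2_normalize(spacy_vec)]
            + [b * x for x in l2_normalize(bert_vec)])


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors: the spaCy parts are orthogonal (cosine 0), the BERT parts identical (cosine 1).
query = ensemble_vector([1.0, 0.0], [0.0, 1.0, 0.0])
candidate = ensemble_vector([0.0, 1.0], [0.0, 1.0, 0.0])
similarity = cosine_similarity(query, candidate)  # about 0.8 = 0.2 * 0 + 0.8 * 1
```

Concatenation is used here because the two models would generally produce vectors of different dimensions; a weighted sum would only be possible if the dimensions matched.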
Rule-based question-answer retrieval was performed using feature extraction and representation of the input test questions. Subsequently, a similarity score was generated for each MQA, with the answer of the highest-scoring MQA being retrieved as the output. If the similarity score fell below the pre-set threshold of 0.85 in our study, the top 3 closest matching MQAs were retrieved as the output instead.
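The threshold rule described above can be sketched as follows. The 0.85 threshold is the value reported in the study; the function name, the dictionary of scores, and the tie-handling are illustrative assumptions.

```python
from typing import Dict, List, Tuple

SIMILARITY_THRESHOLD = 0.85  # pre-set threshold reported in the study


def retrieve(scores: Dict[str, float],
             threshold: float = SIMILARITY_THRESHOLD,
             k: int = 3) -> List[Tuple[str, float]]:
    """Return the single best MQA if its score reaches the threshold,
    otherwise fall back to the k closest matches."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    if ranked[0][1] >= threshold:
        return ranked[:1]
    return ranked[:k]


# Hypothetical similarity scores keyed by MQA identifier.
confident = retrieve({"mqa_12": 0.91, "mqa_40": 0.55})
uncertain = retrieve({"mqa_12": 0.80, "mqa_40": 0.50, "mqa_07": 0.70, "mqa_99": 0.10})
```

In the first call a single answer is returned; in the second, no score reaches 0.85, so the three closest MQAs are offered instead.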

. . Performance assessment
The ensemble model underwent three iterations of improvement before being used for eventual assessment. Chatbot performance was assessed based on the accuracy, AUC, precision, recall, and F1 score for the overall and top 3 answers generated. A positive response was recorded for the top 3 answers if any one answer was appropriate. Grading was performed by two investigators independently. In the event of disparate grading, a discussion was held to reach a consensus, failing which a third investigator would provide the final decision. Subsequently, we invited ten collaborators to each contribute 20 English questions in an open-ended format, and thereafter assessed the accuracy of the generated responses.
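For reference, the standard definitions of these metrics can be sketched as below. This is a generic illustration, not the study's evaluation code: it uses an exact match against a gold MQA identifier as a stand-in for the investigators' manual grading of appropriateness, and the function names and toy data are hypothetical.

```python
from typing import Sequence, Tuple


def top_k_accuracy(ranked_predictions: Sequence[Sequence[int]],
                   gold: Sequence[int], k: int) -> float:
    """Fraction of questions whose gold MQA id appears among the first k predictions."""
    if len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for preds, target in zip(ranked_predictions, gold)
               if target in preds[:k])
    return hits / len(gold)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy example: three questions, each with three ranked MQA ids.
ranked = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
gold = [1, 5, 0]
overall_accuracy = top_k_accuracy(ranked, gold, k=1)  # only the first question is right at rank 1
top3_accuracy = top_k_accuracy(ranked, gold, k=3)     # the second question is rescued at rank 2
```

Top 3 accuracy can never be lower than overall accuracy, since any question answered correctly at rank 1 also counts within the top 3.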

. . Multi-lingual text translation
The NLP models were trained on an English corpus. The backend software used the Google Translate API to translate the target-language question into English as an intermediary language, followed by analysis and question-answer retrieval, and lastly re-translation by the Google Translate API back to the target language. The selected target languages included Chinese, Malay, Tamil, Filipino, Thai, Japanese, French, Spanish, and Portuguese.
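The translate, retrieve, and re-translate flow can be sketched as a small pipeline. To avoid implying any particular client library, the translator is passed in as a plain function; the stub translator and retriever below are hypothetical placeholders and make no real API calls.

```python
from typing import Callable

# (text, source_language, target_language) -> translated text
Translator = Callable[[str, str, str], str]
# English question -> English answer
Retriever = Callable[[str], str]


def answer_in_language(question: str, lang: str,
                       translate: Translator,
                       retrieve_answer: Retriever) -> str:
    """Answer a question in its own language, using English as the pivot language."""
    if lang == "en":
        return retrieve_answer(question)
    english_question = translate(question, lang, "en")
    english_answer = retrieve_answer(english_question)
    return translate(english_answer, "en", lang)


# Stubs for illustration only: a real deployment would call a translation
# service and the question-answer retrieval model instead.
def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_retrieve(english_question: str) -> str:
    return "(placeholder answer)"


french_reply = answer_in_language("Quels vaccins sont disponibles ?", "fr",
                                  fake_translate, fake_retrieve)
english_reply = answer_in_language("What are the available vaccines?", "en",
                                   fake_translate, fake_retrieve)
```

Because two translation steps surround the retrieval step, translation errors can affect both the question that the model sees and the answer that the user reads.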
We invited collaborators to assess the multi-lingual aspect of DR-COVID, with each contributing 20 questions in an open-ended format to assess the accuracy of the generated response. Ten collaborators each were invited to assess the chatbot in Chinese and in Malay; two in Spanish; and one each for the remaining languages: Tamil, Filipino, Thai, Japanese, French, and Portuguese.
AUC, area under the receiver operating characteristic curve; CI, confidence interval. Parameters were as follows: the threshold similarity score was set at 0.85. DR-COVID was trained on two Nvidia Titan RTX (24 GB each) graphics processing units (GPUs), executed on either an Nvidia Titan RTX (24 GB) GPU machine or a central processing unit (CPU) machine (64 GB), and hosted on Telegram.

. . User interface assessment
Twenty questions with no overlap of MQA were selected for User Interface (UI) assessment (Supplementary Table 2), which involved timing the interval between question input and answer generation for DR-COVID, WHO Messenger (16) and National Health Service (NHS) Inform (https://ask.nhsinform.scot/) (24). This was performed on 1st Feb 2022 by a single investigator, using a stopwatch on three digital devices: a laptop, a tablet, and a smartphone. Specifications are described in Table 3.

. . GPU vs CPU assessment
Deployments of the DR-COVID chatbot application were benchmarked to highlight the differences in throughput performance between Graphics Processing Units (GPU) and Central Processing Units (CPU). An NVIDIA TITAN Xp GPU and an Intel(R) Xeon(R) W-2145 CPU were used during the evaluation. Data regarding time and memory usage were obtained with a sequential time profiler and a memory profiler, using 100 users and 3 questions.

. Results
A total of 2,728 questions in English, comprising 1,290 Singapore-centric and 1,438 global questions, were developed for the training dataset. Eight hundred twenty-one new questions in English were created as the testing dataset for assessment of accuracy, consisting of 335 Singapore-centric and 486 global questions (Supplementary Table 3).

. . Multi-lingual performance
In terms of secondary outcomes of interest, nine non-English languages were assessed for accuracy, using a total of 560 questions contributed by the collaborators (Supplementary Table 5).
Supplementary Figure 1 and Supplementary Video 1 demonstrate the chatbot interface and response to an example question, "what are the available vaccines?", in the selected languages. Portuguese performed the best overall at 0.900, followed by Spanish at 0.725, then Thai at 0.600 (Table 2).

. . User interface assessment
In terms of other secondary outcome measures, DR-COVID achieved the highest overall accuracy of 0.800 when tested on the selected global questions, followed by WHO Messenger at 0.600, and finally NHS Inform at 0.500. The average time taken for DR-COVID to generate an answer was 2.15 ± 0.31 s on a laptop, 1.26 ± 0.49 s on a tablet, and 1.12 ± 0.44 s on a phone, which was significantly faster than NHS Inform and WHO Messenger across all digital devices (p < 0.05). Amongst the other question-answer chatbots tested for comparison, the average time taken for NHS Inform ranged between 2.20 and 2.51 s, whereas that of WHO Messenger ranged between 4.04 and 4.85 s (Table 3).

. . GPU vs CPU assessment
The tests were conducted by running the DR-COVID chatbot application on a workstation with the Ubuntu 16.04.7 LTS operating system, an NVIDIA TITAN Xp GPU (12 GB GPU RAM), and an Intel(R) Xeon(R) W-2145 CPU @ 3.70 GHz with 64 GB DDR4 RAM. The test condition was 100 users with three questions each, and the sequential time and memory usage recorded were the averages over 20 test runs. Running the DR-COVID chatbot on CPU required 3.628 GB of memory, whereas running it on GPU required 2.936 GB of GPU memory plus 600 MB of CPU memory. The average time taken per question was 23.52 s for CPU, whereas the GPU achieved a 66.9% reduction in time, at 7.79 s.
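The reported reduction follows directly from the two average timings. The short calculation below reproduces it and also expresses the same result as a speed-up factor; the variable and function names are illustrative.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


cpu_seconds = 23.52  # average time per question on CPU (reported)
gpu_seconds = 7.79   # average time per question on GPU (reported)

reduction = relative_reduction(cpu_seconds, gpu_seconds)
speedup = cpu_seconds / gpu_seconds

print(f"Time reduction: {reduction:.1%}")  # prints "Time reduction: 66.9%"
print(f"Speed-up: {speedup:.2f}x")         # prints "Speed-up: 3.02x"
```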

. Discussion
In this multi-center study, we created a general question-answer chatbot with a training dataset of 2,728 questions to address COVID-related enquiries, incorporated multi-lingual text-to-text translation, and assessed chatbot performance by evaluating accuracy through external open-ended assessments and by comparing technical performance against enterprise-grade chatbot systems. We used an ensemble model to overcome the technical challenges associated with a single architecture, which can give lower accuracy and higher variance, noise, and bias (25). The ensemble method reduces model error while still preserving generalization (26). In addition, accuracy can be improved by optimizing the performance of each base estimator, and diversity by incorporating a range of estimators. Ensemble models have typically outperformed single classifiers in terms of AUC, accuracy, and recall (27, 28). Considerations for implementation should include the risk of overfitting the training data, as well as the costs and complexity of training and deployment.
Existing literature regarding NLP-based chatbots in the COVID-19 pandemic has been largely experimental or descriptive in nature (29, 30). Nonetheless, studies thus far have demonstrated accuracies ranging between 0.54 and 0.92 (31)(32)(33). A Canadian chatbot, Chloe, developed to address pandemic misinformation, has demonstrated accuracies of 0.818 and 0.713 for the English and French languages respectively, using a BERT-based NLP architecture (31). Whilst we demonstrated a better overall accuracy of 0.838 in the English language, potentially owing to our use of an ensemble rather than a single classifier model, our accuracy of 0.350 in the French language fell short of expectations. Several factors could account for this discrepancy. First, Chloe was developed in the context of a bilingual English and French-speaking populace. Questions in the French language were able to undergo direct question-answer retrieval, without the use of translation software. In contrast, DR-COVID required the use of Google Translate as an intermediary step, both before question-answer retrieval and before providing the output in the French language. Google Translate is not capable of transcreation, that is, the correct interpretation of context, intent, cultural and language nuances (34). As a result, non-native translation, such as in DR-COVID, is ultimately less ideal than native translation, due to contextual specificities and transcreation difficulties. It may also be of utility for other chatbots to share the questions they were tested on, in order to draw a reasonable comparison. Potential solutions would include collaborating with international partners and native translators to fine-tune the multi-lingual datasets and align them with the locale, with the understanding that this would entail cost, retraining, and turnaround time. In particular, Singapore is intrinsically a multi-racial and multi-lingual society, with a significant international populace. As such, investing these resources would be worthwhile, and the ability to produce such a chatbot would be a strength.
Next, there were several questions in French of a highly specific nature, which were not within the scope of our existing MQAs, including "can I get infected through aerated steam?" (Supplementary Video 2). Furthermore, answers to questions such as "can I get delta after being in remission from alpha?" were not included in our original dataset, as data regarding reinfection with new variants were not available at the time of development. That said, while highly specific or technical questions may be difficult to answer accurately on the first attempt, more common layman queries that appear in every language should at minimum be answered. In this study, multi-lingual analysis was limited by the small number of testing questions, with N = 20 on average per language, except for Chinese and Malay. Nevertheless, an analysis is underway to gather and assess more questions for multi-lingual accuracy, as well as to evaluate whether the differential accuracy may be attributed to technical or general questions.
A Tunisian chatbot, Smart Ubiquitous Chatbot, based on Long Short-Term Memory (LSTM) networks, was developed for education and stress management during the pandemic. It reported an accuracy of 0.92, precision of 0.866, recall of 0.757, and F1 score of 0.808 (32). DR-COVID achieved a precision of 0.864, comparable to that of Smart Ubiquitous Chatbot, but demonstrated a higher recall of 0.835, that is, it retrieved a larger share of the correct answers amongst all the correct answers. We also achieved a higher F1 score of 0.829, meaning that, taking precision and recall in tandem, our chatbot demonstrated better overall performance. Extrinsic differences in linguistics, local policies and populations, as well as intrinsic technicalities of the algorithms, likely play a role in these differential results. We were, however, unable to compare top 3 accuracy, recall, and precision with other chatbots that lacked this function. There was also difficulty benchmarking our AUC against other COVID-19 chatbots, as there has been a paucity of research evaluating this metric thus far.
We demonstrated that when tested on new questions in English provided by collaborators, DR-COVID fared less optimally, with a drop in accuracy from 0.838 to 0.550, compared to using our own testing dataset. These errors are perhaps explainable. Firstly, this variance may illustrate the differential perspectives between the medical community and general public. The training and testing datasets, developed by the internal team comprising medical practitioners and data scientists, tend to be more medical in nature, including "will the use of immunomodulators be able to treat COVID-19?", and "what is the mechanism of action of rapid COVID-19 tests?"; there was potentially selection bias to some degree. On the other hand, the external questions were contributed by collaborators of both medical and non-medical backgrounds; these relate more to effects on daily life, and coping mechanisms. For example, "is the hospital safe to visit?". This further illustrates the limitations in our training dataset in covering everyday layman concerns relating to COVID-19 as discussed previously, and therefore potential areas for expansion. That said, we do observe common topics of overlap, such as general information, symptoms, and treatment pertaining to COVID-19. Secondly, despite having undergone several cycles of retraining, our model might not have the most up-to-date information on certain questions. Chatbots require a tedious training and retraining process. Information and policies are constantly changing in a pandemic setting, on both a local and global scale, which necessitates frequent monitoring and updating of the model, to ensure that appropriate information is conveyed. A prime example would be vaccine-related information such as booster dose requirements, newly approved vaccines, and variant-specific efficacy. Our model was not equipped with new information regarding booster vaccines, and was therefore shorthanded in addressing these questions. 
To circumvent tedious retraining, we could consider reinforcement learning in future implementation, a technique which incentivises the chatbot to learn through trial and error, by "rewarding" correct outputs and "punishing" incorrect answers (35).
Thirdly, insofar as our knowledge regarding COVID-19 is constantly evolving, there remain uncertainties for which it is challenging to give definite answers. Questions such as "when will the COVID-19 pandemic end?" are difficult to answer, may yield seemingly unsatisfactory responses, and therefore affect the accuracy of the chatbot. Ultimately, this difference demonstrates the variability which may arise, and therefore the need to test chatbots externally when implemented in a real-world setting.
This study gives hope for the potential expansion and real-world implementation of NLP-DLS chatbots, such as DR-COVID.
The use of open-source translation software, with the caveat of its drawbacks as discussed earlier, may improve scalability and multi-lingual customizability. Moreover, integration onto social media platforms, such as Telegram in our case, enables greater reach and convenience, potentially removing geographical constraints (2); the WHO's global pandemic outreach through WhatsApp is a prime example (36). These could mitigate resource limitations by improving scalability and efficiency (37). In addition, chatbots have a high handling capacity which allows simultaneous conversations with multiple users (38), and are instantly available on-demand. This provides patients with a reliable source of information, whilst helping off-load labor-intensive communication traditionally performed by healthcare workers.
Furthermore, information garnered from multiple reliable sources can be presented in a succinct manner, mitigating the dangers of online misinformation (39). Specific to the ongoing pandemic, DR-COVID and other NLP chatbots could fill a pivotal role in the dissemination and easy availability of accurate information regarding COVID-19, thereby also facilitating implementation of pandemic measures. They could potentially serve as accessible platforms to disseminate new operational workflows, news and protocols, thereby minimizing confusion faced on the ground by the general population, and even healthcare workers. This is critical to manage large-volume queries and national measures, which are often challenging and require unparalleled effort to coordinate on a large scale. This matters because misinformation could translate to vaccine hesitancy, and reluctance to comply with public health measures such as mask-wearing. On the other hand, a better understanding of COVID-19 would reduce panic amongst the public, thereby reducing unwarranted visits to the emergency department, and better optimizing resource allocation in healthcare systems. The resultant higher vaccination rates would also enhance "herd immunity," thereby reducing the transmission of COVID-19 with resultant mortality benefits.
Lastly, whilst the main purpose of DR-COVID has been to facilitate efficient and accurate information sharing, it may be of utility to explore the inclusion of other tools, including detection of misleading information, triage, risk assessment, monitoring, and general wellbeing. For example, both ensemble-and BERT  (42). These can assist in triaging patients to suitable echelons of care, and thereby potentially reduce unwarranted health-seeking behavior. That said, one has to bear in mind the caveat that AI would not grasp the nuances of clinical management, and that liability issues for triaging errors should be addressed before implementation. In terms of risk stratification, a chatbot developed by University of California, San Francisco Health, assisted the hospital in making real-time manpower decisions, based on the exposure risk of its healthcare workers (43). A Singaporean chatbot, Bot MD, has helped doctors prioritize attention to potentially unwell patients on COVID-19 home recovery (44). Finally, chatbots have also been used to monitor the psychological effects, and mitigate the implications, of isolation caused by social distancing (45). COVID-19 is likely to become endemic in time to come (46). In envisioning the eventual implementation for the current pandemic and beyond, we are also cognisant of the importance of acceptability and usability, which should be optimized for real-world implementation (47); in fact, the primary factor influencing acceptability is perception of ability, which is in turn driven by trust in the system (48). When implemented in the real world, there is therefore a need to balance between presenting facts from global authorities such as the WHO, and vocalizing local perspectives and policies. This requires collaboration amongst stakeholders. This also raises questions regarding legislative responsibility and accountability for chatbots.
Decisions regarding licensing, much like credentials for healthcare workers, would require further deliberation.
Planned future studies include expansion to more languages, and integration into web messenger and social media platforms to reach greater audiences. Conversational experience can be refined with contextual awareness to improve the relevance of answer retrieval. Future directions would also entail exploration of different but complementary domains such as text-to-speech and speech-to-speech, which may be of help in specific populations like the visually impaired, or provide more options for convenience. Other potential use cases in pandemic management include NLP-based risk stratification, contact tracing, and patient monitoring. Finally, DL-based chatbots may be utilized in various medical and surgical specialities for targeted patient education, disease monitoring, and encouraging treatment compliance, amongst others.

. Conclusion
Chatbots utilizing NLP, a type of conversational AI, have emerged as promising solutions to improve healthcare delivery in the pandemic era. In this study, we developed a multi-lingual NLP-based AI chatbot, DR-COVID, to facilitate healthcare delivery and disease control. Our NLP model, utilizing the ensemble architecture, achieved overall and top 3 accuracies of 0.838 [95% CI: 0.826–0.851] and 0.922 [95% CI: 0.913–0.932] respectively. The sharing of training and testing datasets on an open-source platform will also contribute to existing data. Whilst AI-based NLP chatbots can enable healthcare systems to reap public health and resource benefits, clinicians and policymakers should work in tandem to deliver solutions to potential problems in real-world implementation.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary material; further inquiries can be directed to the corresponding author.