Verbal Communication in Robotics: A Study on Salient Terms, Research Fields and Trends in the Last Decades Based on a Computational Linguistic Analysis

Verbal communication is an expanding field in robotics, showing significant growth in both industry and research. The application of verbal communication in robotics aims to reach a natural, human-like interaction with robots. In this study, we investigated how salient terms related to verbal communication in robotics have evolved over the years, which topics recur in the related literature, and what their trends are. The study is based on a computational linguistic analysis conducted on a database of 7,435 scientific publications spanning the last two decades. This comprehensive dataset was extracted from the Scopus database using specific keywords. Our results show how relevant terms of verbal communication evolved, which are the main coherent topics, and how they have changed over the years. We highlighted positive and negative trends for the most coherent topics and the distribution over the years for the most significant ones. In particular, verbal communication proved to be highly relevant for social robotics. Potentially, achieving natural verbal communication with a robot can have a great impact on the scientific, societal, and economic role of robotics in the future.


INTRODUCTION
Robots are becoming increasingly pervasive in our everyday life, entering our homes (Abdi et al., 2018; Van Patten et al., 2020), places of work (Robla-Gómez et al., 2017), hospitals (Azeta et al., 2017), and schools (Belpaeme et al., 2018). Since humans need to communicate and cooperate with these machines, and because we are accustomed to communicating with other people, the same social norms and kinds of communication that apply to humans might also apply to robots. As Nass and colleagues demonstrated in their book "The Media Equation" (Reeves and Nass, 1996), people often respond socially to computers in ways similar to how they would interact with other people. Therefore, the need to develop robots that can behave socially has pushed researchers to incorporate in their design forms of communication similar to those humans use, such as non-verbal communication (Breazeal et al., 2005; Brooks and Arkin, 2007; Mutlu et al., 2009; Cominelli et al.) and verbal communication, which is increasingly present in commercial robots (IFR, 2018). The incorporation of speech recognition and speech generation abilities in robots has obtained encouraging results in several research fields such as educational robotics (Budiharto et al., 2017), collaborative robotics (Huang and Mutlu, 2016; Gustavsson et al., 2017), surgical robotics (Zinchenko et al., 2017), assistive robotics (Wu et al., 2014; Zhou et al., 2018), robot therapy (Barakova et al., 2015; Ramamurthy and Li, 2018), humanoid robotics (Ding and Shi, 2017), and navigation robotics (Draper et al., 2013; Schulz et al., 2015).
The increasing number of publications has made it more difficult to understand and track advances in the field (Landhuis, 2016;Altbach and de Wit, 2019). Thus, the aim of this work is to discover promising trends in the verbal communication field by performing a deep and systematic analysis of the research literature. To avoid any bias and author subjectivity, different text mining techniques were used in a bottom-up approach to retrieve research fields and keywords from scientific publications. As observed from previous work, bibliometric techniques leverage statistics to successfully extract useful information such as the identification of the fundamental "pillars" that support a research theme (Buter and Van Raan, 2013), the discovery of promising trends in the robotics field (Goeldner et al., 2015;Mejia and Kajikawa, 2017), of topics in conversational content (Yeh et al., 2016) and of relationships between social and technology issues (Ittipanuvat et al., 2014).
Investigating the emergent topics and keywords from thousands of publications belonging to verbal communication in robotics throughout the last decades can reveal important hidden topics or technology domains. This information can be extremely valuable to drive future research to applications where technology and needs intersect. In particular, we address the following questions: 1. How did the salient terms related to verbal communication in robotics evolve? 2. Are there any specific applications that involved the use of verbal communication? 3. Do they have any noteworthy trends in the last decade? 4. If they do, what can these trends reveal?
The paper is structured as follows. In Section 1, the criteria used to select publications are introduced. In Section 2, the contrastive analysis is described. In Section 3, the pre-processing step is explained. In Section 4, it is illustrated how topics are modeled. Section 5 reports the model evaluation. Section 6 shows the topics evaluation. In the last sections (7, 8), we summarize the results and discuss possible future scenarios.

SELECTION CRITERIA
Publications related to verbal communication in robotics were retrieved from Scopus using the following query: "TITLE-ABS-KEY ((voice OR speech OR verbal OR talk OR dialogue OR spoken OR conversation) AND robot) AND PUBYEAR > 1999". Since the main focus is on current trends and topics, the search was restricted to works published from the year 2000 onward. A total of 7,435 articles (titles and abstracts) were extracted on February 20, 2019. The framework of the natural language processing (NLP) tools used in this research is shown in Figure 2. The dataset can be accessed from the following GitHub repository: https://github.com/vargas95/nlp_social_robotics.

CONTRASTIVE ANALYSIS
All publications were divided into four groups based on the publication year: 2000−2004, 2005−2009, 2010−2014, and 2015−2019. In this way, specific keywords that are representative of a certain interval of time can be identified. The extraction of domain-specific terms denoting domain entities was performed with the NLP tool T2K (Sagri et al., 2019). By default, the automatically POS-tagged and lemmatized text is searched for candidate domain-specific terms, expressed either by single nominal terms or by complex nominal structures with modifiers (adjectival and prepositional). To select the terms representative of a certain interval of time, a contrastive analysis was performed: the list of extracted terms was ranked with respect to the variation of the term frequency-inverse document frequency (tf-idf) scores (Salton and Buckley, 1988) calculated for two different intervals of time. Two contrastive analyses were performed: the analyzed interval of time 1) vs. all other intervals and 2) vs. only the past intervals. The contrastive analysis was performed for each group, keeping only the top 19 terms. Results are displayed in Tables 1-4. It is worth noting that the contrast against past or all intervals is the same for the range 2015−2019 and that the contrast against the past cannot be applied to the range 2000−2004. Bold words (highlighted in the tables) are the ones that appear in both contrasts; consequently, those words might define a specific technology or device that has mainly been cited only in that group. While the 2005−2009 group contains only "fuzzy voice", the bold words in the 2010−2014 group are "wireless sensor", "multimodal language", "word correct rate" and "vocal cues". Although the contrastive analysis allows us to retrieve specific keywords that describe an interval of time, it is not informative regarding topics and trends.
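As an illustrative sketch only (the study used the T2K tool, and the exact contrast function is not reproduced here), the contrastive tf-idf ranking can be approximated in Python, treating each time interval as a single aggregated document and ranking the target interval's terms by the difference between their tf-idf score there and their best score in the contrast intervals:

```python
from collections import Counter
from math import log

def interval_tfidf(term_counts, doc_freq, n_intervals):
    """tf-idf of every term, treating each time interval as one document."""
    total = sum(term_counts.values())
    return {t: (c / total) * log(n_intervals / doc_freq[t])
            for t, c in term_counts.items()}

def contrastive_ranking(target, others, top_n=19):
    """Rank the target interval's terms by how much their tf-idf score
    exceeds the best score they reach in the contrast intervals."""
    intervals = [target] + others
    doc_freq = Counter()
    for iv in intervals:
        doc_freq.update(set(iv))           # in how many intervals each term occurs
    n = len(intervals)
    target_scores = interval_tfidf(Counter(target), doc_freq, n)
    other_scores = [interval_tfidf(Counter(iv), doc_freq, n) for iv in others]
    delta = {t: s - max((o.get(t, 0.0) for o in other_scores), default=0.0)
             for t, s in target_scores.items()}
    return sorted(delta, key=delta.get, reverse=True)[:top_n]
```

Terms that occur in every interval receive an idf of zero and sink to the bottom of the ranking, which is exactly the behavior wanted from a contrast: only interval-specific vocabulary surfaces.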
Therefore, we decided to proceed with a deeper analysis considering single years.

PRE-PROCESSING
A series of pre-processing steps was applied to convert the text into a structure suitable for the subsequent text mining analysis. First, the text was subjected to lemmatization, i.e., an algorithm that converts each word to its lemma based on its intended meaning. In particular, lemmatization is a better choice than stemming for topic modeling, as it tries to correctly identify the intended part of speech and meaning of a word in a sentence or document. After that, the raw text was converted into a corpus, on which other common transformations were applied: tokenization, conversion to lower case, and removal of numbers, punctuation, and stop words. The corpus was then converted into a document-term matrix, where each row represents one document, each column represents a term, and each cell holds the term's frequency. Lastly, the matrix was filtered using the tf-idf measure, which weights each word as follows: tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the frequency of term t in document d, N is the number of documents in the collection, and df(t) is the number of documents containing t. The weight of a term is directly proportional to the number of times the term occurs within a document, but inversely proportional to the frequency of the term across the collection. In this way, tf-idf measures the importance of a term within a collection of documents. To further constrain the number of words, a threshold equal to the median of the tf-idf scores was applied to the weights (Silge and Robinson, 2017). The difference between the non-filtered text and the text filtered with tf-idf is highlighted in Figures 3, 4, which show word clouds of the most frequent words.
It is clearly visible how words that are frequent in the non-filtered text such as "robot", "human" and "system" become of secondary importance when text is filtered with tf-idf and more informative words such as "emotion", "agent" and "dialogue" become relevant.
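The pipeline above (minus lemmatization, omitted for brevity) can be sketched in Python; the original analysis was carried out in R, so this is only an illustrative reimplementation with a toy stop-word list:

```python
from collections import Counter
from math import log
from statistics import median

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are"}  # toy list

def preprocess(text):
    """Tokenize, lower-case, and drop numbers, punctuation and stop words."""
    tokens = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    return [w for w in tokens if w.isalpha() and w not in STOP_WORDS]

def tfidf_filter(raw_docs):
    """Build per-document term counts and keep only the terms whose
    maximum tf-idf weight exceeds the median of all term weights."""
    docs = [preprocess(d) for d in raw_docs]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency
    weights = {}
    for d in docs:
        counts = Counter(d)
        total = sum(counts.values())
        for t, c in counts.items():
            w = (c / total) * log(n / df[t])        # tf-idf(t, d)
            weights[t] = max(weights.get(t, 0.0), w)
    cutoff = median(weights.values())
    kept = {t for t, w in weights.items() if w > cutoff}
    return [[t for t in d if t in kept] for d in docs]
```

On a toy corpus this reproduces the effect described above: ubiquitous words such as "robot" and "system" get low tf-idf weights and fall below the median cutoff, while document-specific words such as "emotion" and "agent" survive the filter.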

TOPIC MODELING
We retrieved topics using the Latent Dirichlet Allocation (LDA) method (Blei et al., 2003), a generative probabilistic model of a corpus. The algorithm exploits the fact that every document is a mixture of latent topics and that each topic is characterized by a distribution over words. Although LDA is a generative process, it can be inverted using Bayes' rule to estimate the model's parameters: starting from the document level, LDA can backtrack to the topics that are likely to have generated the corpus, thereby estimating the parameters' uncertainty. The method is based on three main steps:
1. Randomly assign each word in each document to one of the K topics;
2. For each document d and each word in d:
• Assume that all topic assignments, except the current one, are correct;
• Compute the probability of each topic given the document, p(topic | document);
• Compute the probability of the word given each topic, p(word | topic);
• Multiply these two probabilities and reassign the word to a new topic sampled in proportion to p(word | topic) · p(topic | document);
3. Repeat until a steady state is reached.
The algorithm used in this work is the LDA implementation in the R "topicmodels" package (Hornik and Grün, 2011). As a sampling method, we selected Gibbs sampling to infer the unknowns from the data. The results of the LDA model consist of the posterior distributions:
• the probability β that a term belongs to each topic;
• the topic distribution Θ for each document.
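For illustration only, the sampling loop described above can be written as a toy collapsed Gibbs sampler in pure Python; this is not the authors' code (the study used the R topicmodels package), and the hyper-parameters alpha and beta below are arbitrary smoothing values:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.
    Returns document-topic counts (ndk) and topic-word counts (nkw)."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    z = [[rng.randrange(k) for _ in d] for d in docs]  # topic of each word
    ndk = [[0] * k for _ in docs]                      # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]         # topic-word counts
    nk = [0] * k                                       # words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1  # drop current assignment
                # p(topic | document) * p(word | topic), up to a constant
                weights = [(ndk[d][j] + alpha) *
                           (nkw[j][w] + beta) / (nk[j] + vocab_size * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return ndk, nkw
```

Normalizing the rows of ndk and nkw (with the alpha and beta priors) yields the posterior distributions Θ and β described above.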

MODEL EVALUATION
One limitation of the model is that it requires the number of topics K a priori. In order to find a "correct" number, four different metrics representing the goodness of fit can be visualized using the "ldatuning" package (Nikita, 2016). Two categories of metrics can be distinguished: the number of topics is selected as the one that minimizes the CaoJuan2009 and Arun2010 measures and maximizes the Griffiths2004 and Deveaud2014 measures. These measures select the best number of topics using, for example, a symmetric KL-divergence of salient distributions derived from the factorization of the document-term matrix. A graphic visualization of the metrics' variation with respect to the number of topics is shown in Figure 5. It can be observed that the Deveaud2014 metric is not informative. Therefore, by analyzing the remaining three metrics, the correct number of topics might be located in the range between 60 and 150 topics. We selected 60 topics as a reasonable number, being the point where two of the three metrics approach their minimum or maximum. Moreover, selecting the lower number of topics reduces the risk of over-fitting.
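The "choose K where the curves reach their extrema" rule can be sketched as a simple heuristic; this is only an illustration (ldatuning plots the curves and leaves the final choice to the analyst), and the metric values in the usage example are hypothetical:

```python
def rescale(xs):
    """Min-max rescale a metric curve to [0, 1] so curves are comparable."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def best_num_topics(ks, to_minimize, to_maximize):
    """ks: candidate topic numbers; to_minimize / to_maximize: lists of
    metric curves (one value per candidate K). Returns the K with the
    best combined score (low minimized metrics, high maximized metrics)."""
    score = [0.0] * len(ks)
    for curve in to_minimize:
        for i, v in enumerate(rescale(curve)):
            score[i] += v
    for curve in to_maximize:
        for i, v in enumerate(rescale(curve)):
            score[i] -= v
    return ks[min(range(len(ks)), key=lambda i: score[i])]
```

For instance, with candidate values ks = [20, 60, 100, 150], a minimized curve dipping at 60 and a maximized curve peaking at 60 would jointly select K = 60, mirroring the convergence argument above.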

TOPICS EVALUATION
The LDA model is a powerful way to extract topics from a corpus of documents in an unsupervised way. However, the topics might not be clearly interpretable; therefore, topic coherence can be used as a measure of topic quality (coherence implementation in R: Denny, 2018). The topic coherence metric considers the co-occurrences of words within the documents and is defined as C(t; V^(t)) = Σ_{m=2}^{M} Σ_{l=1}^{m−1} log((D(v_m^(t), v_l^(t)) + 1) / D(v_l^(t))), where V^(t) = (v_1^(t), …, v_M^(t)) is the list of the M most probable words in topic t, D(v) is the number of documents containing the term v, and D(v, v′) is the number of documents in which v and v′ co-occur. The measure assumes negative values: the closer the value is to zero, the stronger the topic coherence.
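Assuming the measure is the document co-occurrence (UMass-style) formulation above, a minimal Python implementation could look like this (the study itself used an R implementation, Denny, 2018):

```python
from math import log

def umass_coherence(top_words, docs):
    """UMass-style topic coherence: sum over ordered word pairs of
    log((D(v_m, v_l) + 1) / D(v_l)), where D counts documents.
    Assumes every word in top_words occurs in at least one document."""
    doc_sets = [set(d) for d in docs]
    def D(*words):
        # number of documents containing all the given words
        return sum(all(w in s for w in words) for s in doc_sets)
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += log((D(top_words[m], top_words[l]) + 1) / D(top_words[l]))
    return score
```

A pair of words that always co-occur contributes log(1) = 0 (perfect coherence for that pair), while rarely co-occurring pairs contribute increasingly negative terms, which is why the score is bounded above by values near zero.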
Topic coherence was evaluated on the 60 extracted topics, considering the top 10 terms of each topic and setting a user-defined threshold of −145 on the coherence value. Consequently, 26 topics were identified as top topics; they are displayed in Figure 6, with the three top terms of each topic highlighted, i.e., the words with the largest probabilities of belonging to that topic. A qualitative analysis of the keywords allows us to partition the topics into six supra-categories: control, application, language, interaction, signal, and hardware. The trend of each topic was evaluated considering the distribution θ of topics over the documents across the years. Each distribution was fitted with a linear regression, and the slope was evaluated to extract the main trends. Table 5 displays the computed slope values, highlighting the statistically significant ones. A non-significant trend means that the distribution fluctuates over the years, and therefore no conclusions can be drawn about its future development.
To carry out a more detailed, higher-resolution analysis, we evaluated the probability of topics over the years and the normalized frequency of the significant trends, highlighting the topics that are both most frequent and significant. Results are illustrated in Figure 7. Although "child, autism and therapy" and "elderly, care and assistance" both have significant positive trends, the normalized frequency of the former is larger than that of the latter, suggesting that the former is more prominent. Consequently, the topics can be narrowed down further, focusing only on the ones with higher normalized frequency: "child, autism and therapy" and "emotion, emotional and affective" for the positive significant trends, and "sound, vocal and anthropometric" and "surgical, surgery and surgeon" for the negative significant ones.
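The slope extraction can be sketched as an ordinary least-squares fit of topic frequency against year; this illustration omits the significance test (p-value on the slope) that the analysis also requires:

```python
def trend_slope(years, freqs):
    """Ordinary least-squares slope of normalized topic frequency vs. year.
    A positive slope indicates an upward trend, a negative one a decline."""
    n = len(years)
    my, mf = sum(years) / n, sum(freqs) / n
    num = sum((y - my) * (f - mf) for y, f in zip(years, freqs))
    den = sum((y - my) ** 2 for y in years)
    return num / den
```

For example, a topic whose normalized frequency grows by 0.1 each year yields a slope of 0.1, while a symmetric decline yields the opposite sign; in the study, only slopes passing a significance test were interpreted.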

RESULTS AND DISCUSSION
The various steps of the computational linguistic analysis conducted in the presented study led to meaningful answers to the questions we posed in the introduction. In this section, we outline the results that emerged from the data analysis. Not only does the analysis provide a complete overview of the past and present of verbal communication in robotics, but it also provides an idea of what the future scenarios and most promising applications might be.
(1) How did the salient terms related to verbal communication in robotics evolve?
Results from our study show that this field seems to be highly technology-driven. In particular, the contrastive analysis highlights devices or technologies that define a specific interval of time. Starting with the 2000−2004 interval, the keywords in Table 1 mainly focus on social interaction aspects. In the following years (2005−2009, Table 2), the analysis shows that "fuzzy voice" is the most used keyword in the scientific literature. The 2010−2014 range (Table 3) is instead characterized by terms such as "wireless sensor", "multimodal language", "word correct rate" and "vocal cues". From this first analysis, it is clear, for example, that studies in the immediate future will include deep learning algorithms, likely merged with cloud computing.
(2) Are there any specific applications that involved the use of verbal communication?
A deeper analysis was performed taking advantage of the LDA model. In this way, the main coherent topics and keywords were retrieved; specifically, each topic is described by three keywords (Figure 6). A qualitative analysis allows us to interpret the topics and to further partition them into six supra-categories, i.e., control, application, language, interaction, signal, and hardware.
(3) Do they have any noteworthy trends in the last decade?
The overall trends of the topics were evaluated, distinguishing the significant upward and downward ones (Table 5). The main topics with upward trends are related to social robotics and its applications, whereas the downward topics are more related to technological aspects and the use of voice as a controller. Significant trends were also analyzed with single-year resolution, and their frequency was normalized with respect to the overall number of words per year. In this way, among the topics with a significant trend, the more frequent ones can be distinguished. Specifically, the upward significant and more frequent topics are "child, autism, therapy" and "emotion, emotional, affective"; on the other hand, the downward ones are "sound, vocal, anthropometric" and "surgical, surgery, surgeon". These data suggest that verbal communication is more successful as a channel for social interaction with a robot than as a tool or interface for controlling it.
(4) If they do, what can these trends reveal?
Two categories have been identified as promising: autism therapy and affective interaction. Looking into the literature, the use of robots in teaching procedures for children with autism spectrum disorder seems to be effective in enhancing specific social and communication behaviors which are not achieved with humans (Fachantidis et al., 2018). Moreover, children displayed more expression when interacting with a robot capable of affective interaction, i.e., of conveying emotions and adapting its behavior (Niculescu et al., 2013). While autism therapy reached a peak in 2017, affective interaction had a steady increase over the years. On the other hand, the applications shown to have a significant downward trend concern the use of anthropometric sounds and vocalization and the use of verbal communication in surgical activities. One factor that might explain the downward trend of the first topic is that human realism of a character's face and voice can evoke feelings of eeriness (Mitchell et al., 2011), especially if not accompanied by an equal level of realism in the cognitive abilities of the robot, and thus its behavior. Regarding the trend of the second topic, an issue might be the uncertainty that arises when using voice commands to control tasks that require very high precision and accuracy, such as surgical operations.

CONCLUSIONS
The presented study revealed that verbal communication is a research field that is continuously expanding in different areas of robotics. This increasing interest is driven by the desire for a natural, human-like interaction with robots. More than 7,000 scientific publications about verbal communication in robotics were analyzed by means of a contrastive analysis and a topic mining technique with a related trend analysis. One of the most notable results was the identification of different topics describing the verbal communication field. Specifically, they were partitioned into six supra-categories: control, application, language, interaction, signal, and hardware. Another main result was that verbal communication for robotics proved to be highly technology-driven, and that several technologies, associated with specific time intervals, emerged as significant for its development. Moreover, two promising research fields related to social robotics were identified: autism therapy and affective interaction. While autism therapy reached a peak in 2017, affective interaction had a steady increase over the years. On the other hand, the two most significant downward trends identified were vocal interaction and vocal control in surgical robotics. Reasons can be identified in the mismatch between human-like esthetic vs. behavioral realism, and in the uncertainty related to using voice commands for precise and accurate tasks such as surgical operations. These findings show that verbal communication is expanding in the robotics field, finding different applications that may have a future translation to the market. Potentially, achieving natural verbal communication with a robot can have a great impact on the scientific, societal, and economic role of robotics in the future.
Nonetheless, given the current technical limitations, our findings confirm that the use of voice is accepted and readily applied in robotics when used for social, affective interaction with a robot, but it is not well liked, or is even mistrusted, when it must be used for applications in which human health or security is at stake. This scenario will probably change only if new technologies are proven to be highly secure, and such technologies have yet to be found or introduced in this field.
Although we tried to avoid bias by implementing a computational pipeline that extracts topics and trends in a rigorous way, some bias might still emerge in parts of the work; for instance, the query used to retrieve the dataset inevitably reflects our own knowledge of the field. Moreover, this study presented the application of a single computational linguistic method. A more extensive analysis could be carried out by comparing different methodologies together with different metrics.

DATA AVAILABILITY STATEMENT
The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
AMV is the first author of this paper; he studied the state-of-the-art of verbal communication in robotics, did the data mining from Scopus, and performed the analysis reported in the manuscript. LC supervised the work, selecting methods and discussing results. FDO contributed as a computational linguistics expert. EPS is a full professor of bioengineering who supervised the entire work, giving a strong contribution to the organization, writing, and proofing of the presented paper.