SYSTEMATIC REVIEW article

Front. Comput. Sci., 15 February 2021

Sec. Human-Media Interaction

Volume 2 - 2020 | https://doi.org/10.3389/fcomp.2020.591164

Verbal Communication in Robotics: A Study on Salient Terms, Research Fields and Trends in the Last Decades Based on a Computational Linguistic Analysis

  • 1. Department of Information Engineering (DII), University of Pisa, Pisa, Italy

  • 2. E. Piaggio Research Center, University of Pisa, Pisa, Italy

  • 3. Istituto di Linguistica Computazionale, CNR, Pisa, Italy

Abstract

Verbal communication is an expanding field in robotics, showing a significant increase in both industrial and research settings. The application of verbal communication in robotics aims to reach a natural, human-like interaction with robots. In this study, we investigated how salient terms related to verbal communication in robotics have evolved over the years, which topics recur in the related literature, and what their trends are. The study is based on a computational linguistic analysis conducted on a database of 7,435 scientific publications over the last 2 decades. This comprehensive dataset was extracted from the Scopus database using specific keywords. Our results show how relevant terms of verbal communication evolved, which are the main coherent topics, and how they have changed over the years. We highlight positive and negative trends for the most coherent topics and the distribution over the years for the most significant ones. In particular, verbal communication proved to be highly relevant for social robotics. Potentially, achieving natural verbal communication with a robot can have a great impact on the scientific, societal, and economic role of robotics in the future.

Introduction

Robots are becoming increasingly pervasive in our everyday life, entering our homes (Abdi et al., 2018; Van Patten et al., 2020), places of work (Robla-Gómez et al., 2017), hospitals (Azeta et al., 2017), and schools (Belpaeme et al., 2018). Since humans need to communicate and cooperate with these machines, and because we are accustomed to communicating with other people, the same social norms and kinds of communication that apply among humans might also apply to robots. As Reeves and Nass demonstrated in their book “The Media Equation” (Reeves and Nass, 1996), people often respond socially to computers in ways similar to how they interact socially with other people. Therefore, the need to develop robots that can behave socially has pushed researchers to incorporate into their design forms of communication similar to those humans use, such as non-verbal communication (Breazeal et al., 2005; Brooks and Arkin, 2007; Mutlu et al., 2009; Cominelli et al., 2018), as well as verbal communication (Nakamura and Sawada, 2006; Crowelly et al., 2009; Niculescu et al., 2013).

Although the role of non-verbal behaviors (Burgoon et al., 2016) is of undeniable importance, verbal communication has a primary role in human-human interaction. Indeed, the voice is one of the most powerful tools that mankind uses to convey emotions and intentions (Cowen et al., 2019), and language allows people to convey meaningful messages encoded in written or spoken words (Krauss, 2002). Therefore, developing conversational agents that can interact using natural language, be it for entertainment, control, or action, is of great interest. Moreover, spoken natural language interaction has some advantages over non-verbal language. It makes human-robot communication natural, accurate and efficient (Liu and Zhang, 2017), allowing the robot to cooperate, to be trained by non-expert humans, and to behave efficiently in a social environment (Mavridis, 2015).

Nevertheless, it is not yet possible to communicate naturally with a robot just as we do with other humans. Several challenges need to be addressed from both a technological and a scientific point of view. For instance, robots still have difficulties in correctly capturing sound from distant speakers (Kumatani et al., 2012), dealing with environmental noise (Jensen et al., 2005), managing speech interruption from the user (i.e., the barge-in problem (Huang et al., 2001)) and identifying the talking person when multiple users are present (Gomez et al., 2014). Moreover, the age of a user can also be an issue: there is a lack of speech recognition systems for children, due to their pitch characteristics and speech disfluencies (Kennedy et al., 2017), and elderly people might have dysarthria, which can impair a regular communication flow (Kumar and Kumar, 2016).

Beyond the current technical limits of developing and improving this type of communication, there is an undeniable and significant increase of interest in verbal communication, testified by the last decade’s increase in publications related to the use of the voice in robots (Figure 1). A significant increasing trend is also observed in the industrial area. According to the International Federation of Robotics (IFR) figures, fields that are experiencing considerable growth are public relations robots and entertainment and leisure robots (IFR, 2018). The incorporation of speech recognition and speech generation abilities in robots has obtained encouraging results in several research fields such as educational robotics (Budiharto et al., 2017), collaborative robotics (Huang and Mutlu, 2016; Gustavsson et al., 2017), surgical robotics (Zinchenko et al., 2017), assistive robotics (Wu et al., 2014; Zhou et al., 2018), robot therapy (Barakova et al., 2015; Ramamurthy and Li, 2018), humanoid robotics (Ding and Shi, 2017), and navigation robotics (Draper et al., 2013; Schulz et al., 2015).

FIGURE 1

The increasing number of publications has made it more difficult to understand and track advances in the field (Landhuis, 2016; Altbach and de Wit, 2019). Thus, the aim of this work is to discover promising trends in the verbal communication field by performing a deep and systematic analysis of the research literature. To avoid any bias and author subjectivity, different text mining techniques were used in a bottom-up approach to retrieve research fields and keywords from scientific publications. As observed from previous work, bibliometric techniques leverage statistics to successfully extract useful information such as the identification of the fundamental “pillars” that support a research theme (Buter and Van Raan, 2013), the discovery of promising trends in the robotics field (Goeldner et al., 2015; Mejia and Kajikawa, 2017), of topics in conversational content (Yeh et al., 2016) and of relationships between social and technology issues (Ittipanuvat et al., 2014).

Investigating the emergent topics and keywords from thousands of publications belonging to verbal communication in robotics throughout the last decades can reveal important hidden topics or technology domains. This information can be extremely valuable to drive future research to applications where technology and needs intersect. In particular, we address the following questions:

  • 1.

    How did the salient terms related to verbal communication in robotics evolve?

  • 2.

    Are there any specific applications that involved the use of verbal communication?

  • 3.

    Do they have any noteworthy trends in the last decade?

  • 4.

    If they do, what can these trends reveal?

The paper is structured as follows. In Section 1, the criteria used to select publications are introduced. In Section 2, the contrastive analysis is described. In Section 3, the pre-processing step is explained. In Section 4, it is illustrated how topics are modeled. Section 5 reports the model evaluation. Section 6 shows the topics evaluation. In the last sections (7, 8), we summarize the results and discuss possible future scenarios.

1 Selection Criteria

Publications related to the verbal communication in robotics were retrieved from Scopus using the following query: “TITLE-ABS-KEY ((voice OR speech OR verbal OR talk OR dialogue OR spoken OR conversation) AND robot) AND PUBYEAR >1999”. Since the main focus is on current trends and topics, the research was restricted only to works published from the year 2000. A total of 7,435 articles (titles and abstracts) were extracted on February 20, 2019. The framework of the natural language processing (NLP) tools used in this research is shown in Figure 2. The dataset can be accessed from the following GitHub repository https://github.com/vargas95/nlp_social_robotics.
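For readers who prefer assembling the query programmatically, the string can be built as below. This is a hypothetical helper (the dataset was exported directly from Scopus; `build_scopus_query` and its parameters are our own names):

```python
# The seven verbal-communication terms from the query reported above.
TERMS = ["voice", "speech", "verbal", "talk", "dialogue", "spoken", "conversation"]

def build_scopus_query(terms=TERMS, min_year=1999):
    """Assemble the TITLE-ABS-KEY query string used to retrieve the dataset."""
    disjunction = " OR ".join(terms)
    return f"TITLE-ABS-KEY (({disjunction}) AND robot) AND PUBYEAR > {min_year}"
```

The resulting string can be pasted into the Scopus advanced-search box or passed to a Scopus API client.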

FIGURE 2

2 Contrast Analysis

All publications were divided into four groups based on the publication year. Intervals range from 2000 to 2004, from 2005 to 2009, from 2010 to 2014, and from 2015 to 2019. In this way, specific keywords that are representative for a certain interval of time can be identified. The extraction of domain-specific terms denoting domain entities was performed using the NLP tool T2K (Sagri et al., 2019). By default, the automatically POS-tagged and lemmatized text is searched for candidate domain-specific terms, expressed by either single nominal terms or complex nominal structures with modifiers (adjectival and prepositional modifiers). To select the terms representative for a certain interval of time, a contrastive analysis was performed: the list of extracted terms was ranked with respect to the variation of the term frequency-inverse document frequency (tf-idf) scores (Salton and Buckley, 1988) calculated for two different intervals of time. Two different contrastive analyses were performed: the analyzed interval of time 1) vs. all other intervals and 2) only vs. the past intervals. The contrastive analysis was performed for each group keeping only the first 19 main words. Results are displayed in Tables 1–4. It is worth noting that the contrast against past or all intervals is the same for the 2015–2019 range, and that the contrast against the past cannot be applied to the 2000–2004 range. Highlighted words in the tables are the ones that belong to both contrasts; consequently, those words might define a specific technology or device that has mainly been cited only in that group. While there is only “fuzzy voice” in the 2005–2009 group, the highlighted words in the 2010–2014 group are “wireless sensor”, “multimodal language”, “word correct rate” and “vocal cues”. Although the contrastive analysis allows us to retrieve specific keywords describing an interval of time, it was not informative regarding topics and trends. Therefore, we decided to proceed with a deeper analysis considering single years.
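As a rough illustration, the contrastive tf-idf ranking can be sketched as follows. This is a simplified Python sketch under our own assumptions (the study used the T2K tool; `tfidf_by_interval` and `contrast` are hypothetical helpers, and each interval's publications are pooled into a single pseudo-document):

```python
import math
from collections import Counter

def tfidf_by_interval(interval_docs):
    """interval_docs: {interval: list of token lists}. Pool each interval's
    publications into one pseudo-document and compute tf-idf over intervals."""
    pooled = {iv: Counter(tok for doc in docs for tok in doc)
              for iv, docs in interval_docs.items()}
    n = len(pooled)
    df = Counter()  # number of intervals in which each term appears
    for counts in pooled.values():
        df.update(set(counts))
    scores = {}
    for iv, counts in pooled.items():
        total = sum(counts.values())
        scores[iv] = {t: (c / total) * math.log(n / df[t])
                      for t, c in counts.items()}
    return scores

def contrast(scores, target, others):
    """Rank target-interval terms by their tf-idf gain over the mean
    tf-idf in the contrasted intervals (highest gain first)."""
    gain = {t: s - sum(scores[iv].get(t, 0.0) for iv in others) / len(others)
            for t, s in scores[target].items()}
    return sorted(gain, key=gain.get, reverse=True)
```

Passing all other intervals as `others` mimics contrast 1), while passing only the earlier intervals mimics contrast 2); the first 19 ranked terms would then populate a table row.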

TABLE 1

Against all
Japanese vowels
Attention expression
Facial colors
Human-robot mutual communication
Interactive actor
Average speech spectrum
Inexplicit utterances
Closing training
Pet-type robot
Active direction-pass filter
Desired distant signals
Human vocal movement
Vocal movement
Distant-talking speech
Certainty factors
Action model
Subtle expressivity
Consonant sounds
Joint attention mechanism

Contrastive analysis: range 2000–2004.

TABLE 2

Against all | Against past
Fuzzy voice | Computer science
Blind source | Autonomous mobile
Statistical machine | Assistive robot
Pervasive computing | PEM fuel cell
Fuzzy coach-player system | Human robot
Relational learning | Fuzzy voice
Improved linearity | Web service
User interface model | Partner robot
Interactive education software | Fuzzy linguistic information
Teaching mathematics | Speech signal
Grid application | Speech emotion
Computational grids | Hidden markov
Remote database | Multiple sound source
Simplified robot | Speech emotion recognition
IMS networks | Multimodal human-robot interaction
Fault slip | Robot platform
Ceramic tiles | Acoustic feature
Sigma methodology | Security system
Controller design | Machine learning

Contrastive analysis: range 2005–2009.

TABLE 3

Against all | Against past
Wireless sensor | Smart phone
Small force | Wireless sensor
Support vector | Robot partner
Multimodal language | Telepresence robot
Word correct rate | Small force
Encoder module | Vocal cues
Speech collisions | Structured space
Vocal cues | Support vector
Android system | Human robot
Dual-arms space-based robot | Multimodal language
Robotic scrub nurse | Open platform
Word form | Nao robot
Dialogue moods | Ego noise
Bayes tree | Social media
Grounded language | Voice coil
Wireless RF module | Artificial subtle expression
Joint space | Word correct rate
Japanese wikipedia | Encoder module
Space-based robot | Voice module

Contrastive analysis: range 2010–2014.

TABLE 4

Against past
Deep learning
Automatic speech
Deep neural network
Raspberry pi
Pupil response
Convolutional neural networks
Rubber band
Remote sender
Robot tutor
Facial landmarks
Hidden markov
Mobile service
Hug behavior
Soft robots
Privacy policies
Motion language
Uncertain terms
NL instructions
Cloud robotics

Contrastive analysis: range 2015–2019.

3 Pre-Processing

A series of pre-processing steps was applied to convert the text into a specific structure on which text mining analysis can subsequently be performed. Specifically, texts are transformed into a document-term matrix, where each row represents one document, each column represents a term, and the associated value defines the term’s frequency. Then, the text was subjected to lemmatization, i.e., an algorithm to convert each word to its lemma based on its intended meaning. In particular, lemmatization represents a better choice than stemming for topic modeling, as it tries to correctly identify the intended part of speech and meaning of a word in a sentence or in a document. After that, the raw text was converted into a corpus on which other common transformations were applied, such as tokenization, conversion to lower case, and the removal of numbers, punctuation and stop words. Lastly, once the corpus was converted into a document-term matrix, it was filtered using the tf-idf measure. In particular, this operation allows us to weight each word based on the following formulas:

$$\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{n_t}$$

where $\mathrm{tf}(t, d)$ is the frequency of term $t$ in document $d$, $N$ is the number of documents in the collection, and $n_t$ is the number of documents containing $t$. The weight of a term is directly proportional to the number of times the term occurs within a document, but inversely proportional to the frequency of the term in the collection. In this way, tf-idf measures the importance of a term within a collection of documents. To further constrain the number of words, a threshold on the weights was applied, equal to the median of the tf-idf scores (Silge and Robinson, 2017).
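A minimal sketch of the median-threshold tf-idf filtering, in Python rather than the R workflow cited above (`tfidf_filter` is a hypothetical helper, and tokenization/lemmatization are assumed to have already been applied):

```python
import math
from collections import Counter
from statistics import median

def tfidf_filter(docs):
    """docs: list of token lists. Weight every (document, term) pair by
    tf-idf and keep only the entries above the median weight."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = {}
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        for t, c in tf.items():
            weights[(i, t)] = (c / len(doc)) * math.log(n / df[t])
    cutoff = median(weights.values())
    return {k: w for k, w in weights.items() if w > cutoff}
```

Terms occurring in every document receive an idf of zero and are discarded, which is exactly the effect visible in the word clouds discussed below.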

The difference between the non-filtered text and the text filtered with tf-idf is highlighted in Figures 3 and 4, which show word clouds of the most frequent words. It is clearly visible how words that are frequent in the non-filtered text, such as “robot”, “human” and “system”, become of secondary importance when the text is filtered with tf-idf, while more informative words such as “emotion”, “agent” and “dialogue” become relevant.

FIGURE 3

FIGURE 4

4 Topic Modeling

We retrieved topics using the Latent Dirichlet Allocation (LDA) method (Blei et al., 2003), which is a generative probabilistic model of a corpus. This algorithm takes advantage of the fact that every document is a mixture of latent topics and that each topic is characterized by a distribution over words. Although LDA is a generative process, it can be inverted using the Bayes rule in order to estimate the model’s parameters. From the document level, LDA can backtrack the topics that are likely to have generated the corpus, thereby estimating the parameters’ uncertainty. The method is based on three main steps:

  • 1.

    Randomly assign to each word in each document one of the K topics;

  • 2.

    For each document d:

    • • For each word w in d, assume that all topic assignments, except the current one, are correct;

    • • Compute the probability of each topic k given that document: $p(\text{topic } k \mid \text{document } d)$;

    • • Compute the probability that the word belongs to each topic: $p(\text{word } w \mid \text{topic } k)$;

    • • Multiply these two probabilities and assign to the word a new topic sampled in proportion to that product: $p(\text{topic } k \mid \text{document } d) \cdot p(\text{word } w \mid \text{topic } k)$;

  • 3.

    Continue until a steady state is reached.

The algorithm used in this work is the LDA method implemented in the R software as part of the “topicmodels” package (Hornik and Grün, 2011). As the sampling method, we selected Gibbs sampling to infer the unknowns from the data. The results of the LDA model consist of two a posteriori distributions:

• The probability β that a term belongs to each topic;

• The topic distribution for each document.
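The three steps above (random initialization, per-token resampling, iteration to a steady state) can be sketched in a few dozen lines. This is a toy Python illustration, not the “topicmodels” implementation used in the study; `gibbs_lda` and all hyperparameter values are our own choices:

```python
import random
from collections import defaultdict

def gibbs_lda(docs, K, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (toy sketch).

    docs: list of token lists; K: number of topics.
    Returns (topics, n_dk): per-topic word distributions and
    per-document topic counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    n_dk = [[0] * K for _ in docs]               # topic counts per document
    n_kw = [defaultdict(int) for _ in range(K)]  # word counts per topic
    n_k = [0] * K                                # total words per topic
    z = []                                       # topic of each word token
    # Step 1: randomly assign each word one of the K topics.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
        z.append(zd)
    # Steps 2-3: resample each token's topic until (approximately) steady.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Treat every other assignment as correct: remove this one.
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # p(topic | document) * p(word | topic), up to a constant.
                weights = [(n_dk[d][t] + alpha)
                           * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                           for t in range(K)]
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    # Posterior estimate of beta: p(word | topic).
    topics = [{w: (n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in vocab}
              for k in range(K)]
    return topics, n_dk
```

The returned `topics` corresponds to the β distributions and `n_dk` (once normalized) to the per-document topic distributions listed above.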

5 Model Evaluation

One limitation of the model is that it needs the number of topics K a priori. In order to find a “correct” number, four different metrics, which represent the goodness of the fit, can be visualized using the “ldatuning” package (Nikita, 2016). In particular, two categories of metrics can be distinguished: metrics to be minimized (e.g., Arun et al., 2010; Cao et al., 2009) and metrics to be maximized (e.g., Deveaud et al., 2014). Then, the number of topics is selected as the one that minimizes the former and maximizes the latter. These measures select the best number of topics using a symmetric KL-divergence of salient distributions which are derived from the factorization of the document-term matrix. A graphic visualization of the metrics’ variation with respect to the number of topics is shown in Figure 5. It can be observed that the Deveaud2014 metric is not informative. Therefore, by analyzing the remaining three metrics, the correct number of topics might be located in the range between 60 and 150 topics. We selected, as a reasonable number, 60 topics, which is the point where two of the three metrics converge to their minimum or maximum point. Moreover, selecting the lower number of topics might avoid possible over-fitting.

FIGURE 5

6 Topics Evaluation

The LDA model is a powerful way to extract topics from a corpus of documents in an unsupervised way. However, topics might not be clearly explicable; therefore, topic coherence can be used as a measure to evaluate topic quality (coherence implementation in R (Denny, 2018)). The topic coherence metric considers the co-occurrences of words within the documents and is defined as:

$$C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v_m^{(t)}, v_l^{(t)}) + 1}{D(v_l^{(t)})}$$

where $V^{(t)} = (v_1^{(t)}, \ldots, v_M^{(t)})$ is the list of the M most probable words in topic t, $D(v)$ the number of documents in which the term v occurs and $D(v, v')$ represents the co-occurrence of words v and v' within documents. The measure assumes negative values: the closer the value is to zero, the stronger the topic coherence will be.
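As a rough illustration, a document co-occurrence coherence of this kind can be computed as below (a Python sketch of the standard UMass-style formulation, not the exact R implementation cited above; `topic_coherence` is a hypothetical helper):

```python
import math

def topic_coherence(top_words, documents):
    """Document co-occurrence coherence over the M most probable topic
    words. documents: list of word collections (e.g., sets of tokens).
    Assumes every top word occurs in at least one document."""
    def D(*words):
        return sum(1 for doc in documents if all(w in doc for w in words))
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((D(top_words[m], top_words[l]) + 1)
                              / D(top_words[l]))
    return score
```

Word pairs that always co-occur contribute values near zero, while pairs that rarely co-occur contribute large negative values, matching the behavior described above.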

The topic coherence was evaluated on the 60 extracted topics, considering the top 10 terms for each topic and setting a user-defined threshold on the coherence value at −145. Consequently, 26 topics were identified as top topics; they are displayed in Figure 6. The three top terms of each topic are highlighted, i.e., the words with the largest probabilities of belonging to that topic. A qualitative analysis of the keywords allows us to partition topics into five supra-categories: control, application, language, interaction, and signal and hardware.

FIGURE 6

The trend of each topic was evaluated considering the distribution of topics among each document θ over the years. Each distribution was fitted with a linear regression and the slope was evaluated to extract the main trends. Table 5 displays the slope values computed, highlighting the statistically significant ones. A non-significant trend means that the distribution fluctuates over the years, therefore no conclusions can be drawn about their future development.
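The slope extraction can be sketched as an ordinary least-squares fit of the yearly topic probabilities (a minimal sketch; the significance testing reported in Table 5 is omitted, and `trend_slope` is a hypothetical helper):

```python
def trend_slope(years, theta):
    """Ordinary least-squares slope of mean topic probability vs. year."""
    n = len(years)
    mx = sum(years) / n
    my = sum(theta) / n
    num = sum((x - mx) * (y - my) for x, y in zip(years, theta))
    den = sum((x - mx) ** 2 for x in years)
    return num / den
```

A positive slope corresponds to an upward trend of the topic over the years, a negative one to a downward trend.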

TABLE 5

Slope | p-value | Topic top terms
0.0005861* | 0.0000007 | “child”, “autism”, “therapy”
0.0003637* | 0.0003866 | “elderly”, “care”, “assistive”
0.0003431* | 0.0007799 | “android”, “smart”, “phone”
0.0003373* | 0.0002577 | “emotion”, “emotional”, “affective”
0.0002383* | 0.0000237 | “student”, “teacher”, “tutor”
0.0002042* | 0.0001786 | “cultural”, “book”, “editor”
0.0001842* | 0.0253939 | “corpus”, “sentence”, “Chinese”
0.0000972 | 0.2693160 | “fuzzy”, “traffic”, “particle”
0.0000941 | 0.2831514 | “head”, “gaze”, “eye”
0.0000753 | 0.0994647 | “color”, “label”, “segmentation”
0.0000462 | 0.4298608 | “music”, “alignment”, “dance”
0.0000219 | 0.7879786 | “speaker”, “identification”, “classifier”
−0.0000008 | 0.1229337 | “synthesis”, “character”, “expressive”
−0.0000761 | 0.9895290 | “force”, “actuator”, “rehabilitation”
−0.0000729 | 0.4462344 | “bayesian”, “probabilistic”, “topic”
−0.0000684 | 0.4191220 | “modal”, “mission”, “unman”
−0.0000619 | 0.4944658 | “noise”, “acustic”, “asr”
−0.0000484 | 0.2566593 | “reference”, “situate”, “entity”
−0.0001042 | 0.4363416 | “memory”, “business”, “educational”
−0.0001382 | 0.1836940 | “arm”, “unknown”, “dof”
−0.0001412 | 0.0692353 | “guide”, “visitor”, “museum”
−0.0001764 | 0.4647355 | “patient”, “surgery”, “cancer”
−0.0004455* | 0.0191116 | “remote”, “operator”, “telepresence”
−0.0005645* | 0.0061496 | “sound”, “localization”, “auditory”
−0.0007511* | 0.0000007 | “sound”, “vocal”, “anthropometric”
−0.0009079* | 0.0001325 | “surgical”, “surgery”, “surgeon”

Trends evaluation.

* Significant (p < 0.05).

To carry out a more detailed, high-resolution analysis, we evaluated the probability of topics over the years and the normalized frequency of the significant trends, highlighting those topics which can be defined as the most frequent and significant. Results are illustrated in Figure 7. Although “child, autism and therapy” and “elderly, care and assistive” both have significant positive trends, the normalized frequency of the former is larger than that of the latter, thereby suggesting that the former is more important. Consequently, topics can be narrowed again, focusing only on the ones that have a higher normalized frequency, such as “child, autism and therapy” and “emotion, emotional and affective” for the positive significant trends, and “sound, vocal and anthropometric” and “surgical, surgery and surgeon” for the negative significant ones.

FIGURE 7

7 Results and Discussion

The various steps of the computational linguistic analysis conducted in the presented study led to meaningful answers to the questions we posed in the introduction. In this section, we outline the results that emerged from the data analysis. Not only does the analysis provide a complete overview about the past and the present of verbal communication in robotics, but it also provides an idea of what future scenarios and most promising applications might be.

  • (1) How did the salient terms related to verbal communication in robotics evolve?

Results from our study show that this field seems to be highly technology-driven. In particular, the contrastive analysis highlights devices or technologies that define a specific interval of time. Starting with the 2000–2004 interval, the keywords in Table 1 mainly focus on the social interaction aspects. In the following years (2005–2009, Table 2), the analysis shows that “fuzzy voice” is the most used keyword in the scientific literature, while the 2010–2014 range (Table 3) focuses more on the vocal aspects, and the last few years (2015–2019) show the rise of deep learning in the verbal communication field (Table 4). Looking at the tables carefully, there are also other technologies that proved to be a driving factor for the evolution of verbal communication in robotics. For instance, technologies providing attention mechanisms started appearing in 2000–2004, emotions related to speech in 2005–2009, the use of smartphones with verbal abilities in 2010–2014, and soft robotics and cloud robotics in 2015–2019. From this first analysis, it is clear, for example, that studies in the immediate future will include deep learning algorithms, likely merged with cloud computing.

  • (2) Are there any specific applications that involved the use of verbal communication?

A deeper analysis was performed taking advantage of the LDA model. In this way, the main coherent topics and keywords were retrieved; specifically, each topic is defined by three keywords (Figure 6). A qualitative analysis allows us to interpret the topics and to further partition them into five supra-categories, i.e., control, application, language, interaction, and signal and hardware.

  • (3) Do they have any noteworthy trends in the last decade?

The overall trends of the topics were evaluated, distinguishing the significant upward and downward ones (Table 5). The main topics with upward trends are related to social robotics and its applications, whereas downward topics are more related to the technological aspects and the use of voice as a controller. Significant trends were also analyzed with single-year resolution, and their frequency was normalized with respect to the overall number of words per year. In this way, among the topics with a significant trend, the more frequent ones can be distinguished. Specifically, the upward significant and more frequent topics are “child”, “autism”, “therapy” and “emotion”, “emotional”, “affective”, while the downward ones are “sound”, “vocal”, “anthropometric” and “surgical”, “surgery”, “surgeon”. These data suggest that verbal communication appears to be more successful as a communication channel to socially interact with a robot than as a tool or interface for controlling it.

  • (4) If they do, what can these trends reveal?

Two categories have been identified as promising: autism therapy and affective interaction. Looking into the literature, the use of robots in teaching procedures for children with autism spectrum disorder seems to be effective in enhancing specific social and communication behaviors which are not achieved by humans (Fachantidis et al., 2018). Moreover, children displayed more expression when interacting with a robot capable of affective interaction, i.e., of conveying emotions and adapting its behavior (Niculescu et al., 2013). While autism therapy reached a peak in 2017, affective interaction has had a steady increase over the years. On the other hand, the applications shown to have a significant downward trend concern the use of anthropometric sounds and vocalization, and the use of verbal communication in surgical activities. One limitation that might explain the downward trend of the first topic is that human realism of a character’s face and voice can evoke feelings of eeriness (Mitchell et al., 2011), especially if not accompanied by an equal level of realism in the cognitive part of the robot and, thus, in its behavior. Regarding the trend of the second topic, an issue might be the uncertainty that can arise when using voice commands to control tasks that require very high precision and accuracy, such as surgical operations.

8 Conclusions

The presented study revealed that verbal communication is a research field that is continuously expanding in different areas of robotics. This increasing interest is driven by the desire for a natural, human-like interaction with robots. More than 7,000 scientific publications about the verbal communication field in robotics were analyzed by means of a contrastive analysis and a topic mining technique with a related trend analysis. One of the most notable results was the identification of different topics describing the verbal communication field. Specifically, they were partitioned into five supra-categories: control, application, language, interaction, and signal and hardware. Another main result was that verbal communication for robotics proved to be highly technology-driven, and that several technologies, associated with specific time intervals, emerged as significant for its development. Moreover, two promising research fields related to social robotics were identified: autism therapy and affective interaction. While autism therapy reached a peak in 2017, affective interaction has had a steady increase over the years. On the other hand, the two most significant downward trends identified were vocal interaction and vocal control in surgical robotics. Reasons can be identified in the mismatch between human-like esthetic and behavioral realism, and in the uncertainty related to voice commands for precise and accurate tasks such as surgical operations. These findings show that verbal communication is expanding in the robotics field, finding different applications that may later be translated into the market. Potentially, achieving natural verbal communication with a robot can have a great impact on the scientific, societal and economic role of robotics in the future.
Nonetheless, given the current technical limitations, the use of voice appears to be accepted and gladly applied in robotics when used for social, affective interaction with a robot, but it is not well liked, or is even mistrusted, in applications in which human health or safety is at stake. This scenario will probably change only if new technologies prove to be highly secure, and such technologies have yet to be introduced in this field.

Although we tried to avoid biases by implementing a computational pipeline that extracts topics and trends in a rigorous way, some might still emerge in parts of the work; for instance, in the query used to retrieve the dataset, which is inevitably based on our knowledge. Moreover, this study presented the application of a single computational linguistic method. A more extensive analysis could be carried out by comparing different methodologies together with different metrics.

Statements

Data availability statement

The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Author contributions

AMV is the first author of this paper, he studied the state-of-the-art of verbal communication in robotics and did the data mining from Scopus, performing then the analysis reported in the manuscript; LC supervised the work selecting methods and discussing results; FDO contributed as computational linguistic expert; EPS is a full professor of bioengineering who supervised the entire work giving a strong contribution to the organization, writing, and proofing of the presented paper.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • 1

    AbdiJ.Al-HindawiA.NgT.VizcaychipiM. P. (2018). Scoping review on the use of socially assistive robot technology in elderly care. BMJ open8, e018815. 10.1136/bmjopen-2017-018815

  • 2

    AltbachP. G.de WitH. (2019). Too much academic research is being published. Int. Higher Edu.59, 23. 10.6017/ihe.2019.96.10767

  • 3

    ArunR.SureshV.Veni MadhavanC. E.Narasimha MurthyM. N. (2010). “On finding the natural number of topics with latent dirichlet allocation: some observations,” in Pacific-Asia conference on knowledge discovery and data mining. Berlin, Heidelberg: Springer, 391402.

  • 4

    AzetaJ.BoluC.AbioyeA. A.OyawaleF. A. (2017). A review on humanoid robotics in healthcare. MATEC Web Conf.153 (5), 02004. 10.1051/matecconf/201815302004

  • 5

    BarakovaE. I.BajracharyaP.WillemsenM.LourensT.HuskensB. (2015). Long-term lego therapy with humanoid robot for children with asd. Expet Syst.32, 698709. 10.1111/exsy.12098

  • 6

    BelpaemeT.KennedyJ.RamachandranA.ScassellatiB.TanakaF. (2018). Social robots for education: a review. Sci. Robot.3, aat5954. 10.1126/scirobotics.aat5954

  • 7

    BleiD. M.NgA. Y.JordanM. I. (2003). Latent dirichlet allocation. J. Mach. Learn. Res.3, 9931022. 10.1162/jmlr.2003.3.4-5.993

  • 8

    BreazealC.KiddC. D.ThomazA. L.HoffmanG.BerlinM. (2005). “Effects of nonverbal communication on efficiency and robustness in human-robot teamwork,” in IEEE/RSJ international conference on intelligent robots and systems, Sendai, Japan, September–October 2–28, 2005 (Edmonton, AB, Canada: IEEE), 708713.

  • 9

    BrooksA. G.ArkinR. C. (2007). Behavioral overlays for non-verbal communication expression on a humanoid robot. Aut. Robots22, 5574. 10.1007/s10514-006-9005-8

  • 10

    BudihartoW.CahyaniA. D.RumondorP. C. B.SuhartonoD. (2017). Edurobot: intelligent humanoid robot with natural interaction for education and entertainment. Procedia. Comp. Sci.116, 564570. 10.1016/j.procs.2017.10.064

  • 11

    BurgoonJ. K.GuerreroL. K.ManusovV. (2016). Nonverbal communication (routledge). London, United Kingdom: Taylor and Francis, 509.

  • 12

    ButerR.Van RaanA. (2013). Identification and analysis of the highly cited knowledge base of sustainability science. Sustain. Sci.8, 253267. 10.1007/s11625-012-0185-1

  • 13

    Cao, J., Xia, T., Li, J., Zhang, Y., and Tang, S. (2009). A density-based method for adaptive LDA model selection. Neurocomputing 72, 1775–1781. 10.1016/j.neucom.2008.06.011

  • 14

    Cominelli, L., Mazzei, D., and De Rossi, D. E. (2018). SEAI: social emotional artificial intelligence based on Damasio’s theory of mind. Front. Robot. AI 5, 6. 10.3389/frobt.2018.00006

  • 15

    Cowen, A. S., Elfenbein, H. A., Laukka, P., and Keltner, D. (2019). Mapping 24 emotions conveyed by brief human vocalization. Am. Psychol. 74, 698. 10.1037/amp0000399

  • 16

    Crowelly, C. R., Villanoy, M., Scheutzz, M., and Schermerhornz, P. (2009). "Gendered voice and robot entities: perceptions and reactions of male and female subjects," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA, October 10–15, 2009 (IEEE), 3735–3741.

  • 17

    Denny, M. J. (2018). SpeedReader: high performance text analysis. R package. Version 0.9.1. https://github.com/matthewjdenny/SpeedReader

  • 18

    Deveaud, R., SanJuan, E., and Bellot, P. (2014). Accurate and effective latent concept modeling for ad hoc information retrieval. Doc. Numér. 17, 61–84. 10.3166/dn.17.1.61-84

  • 19

    Ding, I.-J., and Shi, J.-Y. (2017). Kinect microphone array-based speech and speaker recognition for the exhibition control of humanoid robots. Comput. Electr. Eng. 62, 719–729. 10.1016/j.compeleceng.2015.12.010

  • 20

    Draper, M., Miller, C. A., Benton, J., Calhoun, G. L., Ruff, H., Hamell, J., et al. (2013). "Multi-unmanned aerial vehicle systems control via flexible levels of interaction: an adaptable operator-automation interface concept demonstration," in AIAA Infotech@Aerospace (I@A) Conference, Boston, MA, USA, 4803. 10.2514/6.2013-4803

  • 21

    Fachantidis, N., Syriopoulou-Delli, C. K., and Zygopoulou, M. (2018). The effectiveness of socially assistive robotics in children with ASD. Int. J. Dev. Disabil. 66, 1–9. 10.1080/20473869.2018.1495391

  • 22

    Goeldner, M., Herstatt, C., and Tietze, F. (2015). The emergence of care robotics - a patent and publication analysis. Technol. Forecast. Soc. Change 92, 115–131. 10.1016/j.techfore.2014.09.005

  • 23

    Gomez, R., Inoue, K., Nakamura, K., Mizumoto, T., and Nakadai, K. (2014). "Speech-based human-robot interaction robust to acoustic reflections in real environment," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, September 14–18, 2014 (IEEE), 1367–1373.

  • 24

    Griffiths, T. L., and Steyvers, M. (2004). Finding scientific topics. Proc. Natl. Acad. Sci. U.S.A. 101 (Suppl. 1), 5228–5235. 10.1073/pnas.0307752101

  • 25

    Gustavsson, P., Syberfeldt, A., Brewster, R., and Wang, L. (2017). Human-robot collaboration demonstrator combining speech recognition and haptic control. Procedia CIRP 63, 396–401. 10.1016/j.procir.2017.03.126

  • 26

    Hornik, K., and Grün, B. (2011). topicmodels: an R package for fitting topic models. J. Stat. Software 40, 1–30. 10.18637/jss.v040.i13

  • 27

    Huang, C.-M., and Mutlu, B. (2016). "Anticipatory robot control for efficient human-robot collaboration," in The Eleventh ACM/IEEE International Conference on Human-Robot Interaction, Christchurch, New Zealand, March 7–10, 2016 (IEEE Press), 83–90.

  • 28

    Huang, X., Acero, A., and Hon, H.-W. (2001). Spoken language processing: a guide to theory, algorithm, and system development. 1st Edn. Upper Saddle River, NJ: Prentice Hall PTR, 1008.

  • 29

    IFR (2018). Executive summary world robotics 2018 service robots. Available at: https://ifr.org/downloads/press/Executive_Summary_WR_Service_Robots_2017_1.pdf (Accessed February 21, 2019).

  • 30

    Ittipanuvat, V., Fujita, K., Sakata, I., and Kajikawa, Y. (2014). Finding linkage between technology and social issue: a literature based discovery approach. J. Eng. Technol. Manag. 32, 160–184. 10.1016/j.jengtecman.2013.05.006

  • 31

    Jensen, B., Tomatis, N., Mayor, L., Drygajlo, A., and Siegwart, R. (2005). Robots meet humans - interaction in public spaces. IEEE Trans. Ind. Electron. 52, 1530–1546. 10.1109/tie.2005.858730

  • 32

    Kennedy, J., Lemaignan, S., Montassier, C., Lavalade, P., Irfan, B., Papadopoulos, F., et al. (2017). "Child speech recognition in human-robot interaction: evaluations and recommendations," in 2017 ACM/IEEE International Conference on Human-Robot Interaction, Vienna, Austria, March 6–9, 2017 (ACM), 82–90.

  • 33

    Krauss, R. M. (2002). "The psychology of verbal communication," in International encyclopaedia of the social and behavioral sciences. London, United Kingdom: Elsevier, 16161–16165.

  • 34

    Kumar, S. A., and Kumar, C. S. (2016). "Improving the intelligibility of dysarthric speech towards enhancing the effectiveness of speech therapy," in International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, September 21–24, 2016 (IEEE), 1000–1005.

  • 35

    Kumatani, K., Arakawa, T., Yamamoto, K., McDonough, J., Raj, B., Singh, R., et al. (2012). "Microphone array processing for distant speech recognition: towards real-world deployment," in Proceedings of the 2012 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Hollywood, CA, USA, December 3–6, 2012 (IEEE), 1–10.

  • 36

    Landhuis, E. (2016). Scientific literature: information overload. Nature 535, 457–458. 10.1038/nj7612-457a

  • 37

    Liu, K., Wu, C.-Y., and Song, K.-T. (2017). A study on speech recognition control for a surgical robot. IEEE Trans. Ind. Inf. 13, 607–615. 10.1109/tii.2016.2625818

  • 38

    Liu, R., and Zhang, X. (2017). Systems of natural-language-facilitated human-robot cooperation: a review. arXiv [Preprint]. Available at: arXiv:1701.08269 (Accessed January 28, 2017).

  • 39

    Mavridis, N. (2015). A review of verbal and non-verbal human-robot interactive communication. Robot. Autonom. Syst. 63, 22–35. 10.1016/j.robot.2014.09.031

  • 40

    Mejia, C., and Kajikawa, Y. (2017). Bibliometric analysis of social robotics research: identifying research trends and knowledgebase. Appl. Sci. 7, 1316. 10.3390/app7121316

  • 41

    Mitchell, W. J., Szerszen, K. A., Lu, A. S., Schermerhorn, P. W., Scheutz, M., and MacDorman, K. F. (2011). A mismatch in the human realism of face and voice produces an uncanny valley. i-Perception 2, 10–12. 10.1068/i0415. PMID: 23145223

  • 42

    Mutlu, B., Yamaoka, F., Kanda, T., Ishiguro, H., and Hagita, N. (2009). "Nonverbal leakage in robots: communication of intentions through seemingly unintentional behavior," in Proceedings of the 4th ACM/IEEE International Conference on Human-Robot Interaction, La Jolla, CA, March 11–13, 2009 (ACM), 69–76.

  • 43

    Nakamura, M., and Sawada, H. (2006). "Talking robot and the analysis of autonomous voice acquisition," in IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, October 9–15, 2006 (IEEE), 4684–4689.

  • 44

    Niculescu, A., van Dijk, B., Nijholt, A., Li, H., and See, S. L. (2013). Making social robots more attractive: the effects of voice pitch, humor and empathy. Int. J. Soc. Robot. 5, 171–191. 10.1007/s12369-012-0171-x

  • 45

    Nikita, M. (2016). ldatuning: tuning of the latent dirichlet allocation models parameters. R package. Version 0.2.0. https://github.com/nikita-moor/ldatuning

  • 46

    Ramamurthy, P., and Li, T. (2018). "Buddy: a speech therapy robot companion for children with cleft lip and palate (CL/P) disorder," in Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, Chicago, IL, March 5–8, 2018 (ACM), 359–360.

  • 47

    Reeves, B., and Nass, C. I. (1996). The media equation: how people treat computers, television, and new media like real people and places. Cambridge, UK: Cambridge University Press, 317.

  • 48

    Robla-Gómez, S., Becerra, V. M., Llata, J. R., Gonzalez-Sarabia, E., Torre-Ferrero, C., and Perez-Oria, J. (2017). Working together: a review on safe human-robot collaboration in industrial environments. IEEE Access 5, 26754–26773. 10.1109/access.2017.2773127

  • 49

    Sagri, M.-T., Morini, E., Venturi, G., dell’Orletta, F., and Montemagni, S. (2019). "Defining models to observe the main phenomena characterizing the Italian education system," in Proceedings of the 1st International Conference of the Journal Scuola Democratica, Cagliari, Italy, June 5–8, 2019. Vol. III, 81.

  • 50

    Salton, G., and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523. 10.1016/0306-4573(88)90021-0

  • 51

    Schulz, R., Talbot, B., Lam, O., Dayoub, F., Corke, P., Upcroft, B., et al. (2015). "Robot navigation using human cues: a robot navigation system for symbolic goal-directed exploration," in 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, May 26–30, 2015 (IEEE), 1100–1105.

  • 52

    Silge, J., and Robinson, D. (2017). Text mining with R: a tidy approach. 1st Edn. O’Reilly Media, Inc., 194.

  • 53

    Twamley, Y. H., Wrobel, J., Cornuet, M., Kerhervé, H., Damnée, S., and Rigaud, A. S. (2014). Acceptance of an assistive robot in older adults: a mixed-method study of human-robot interaction over a 1-month period in the living lab setting. Clin. Interv. Aging 9, 801. 10.2147/CIA.S56435

  • 54

    Van Patten, R., Keller, A. V., Maye, J. E., Jeste, D. V., Depp, C., Riek, L. D., et al. (2020). Home-based cognitively assistive robots: maximizing cognitive functioning and maintaining independence in older adults without dementia. Clin. Interv. Aging 15, 1129. 10.2147/CIA.S253236

  • 55

    Yeh, J.-F., Tan, Y.-S., and Lee, C.-H. (2016). Topic detection and tracking for conversational content by using conceptual dynamic latent dirichlet allocation. Neurocomputing 216, 310–318. 10.1016/j.neucom.2016.08.017

  • 56

    Zhou, B., Wu, K., Lv, P., Wang, J., Chen, G., Ji, B., et al. (2018). A new remote health-care system based on moving robot intended for the elderly at home. J. Healthc. Eng. 2018, 4949863. 10.1155/2018/4949863

Keywords

social robotics, affective computing, speech synthesis, speech generation, computational linguistic analysis, data mining, topic modeling, verbal communication

Citation

Marin Vargas A, Cominelli L, Dell’Orletta F and Scilingo EP (2021) Verbal Communication in Robotics: A Study on Salient Terms, Research Fields and Trends in the Last Decades Based on a Computational Linguistic Analysis. Front. Comput. Sci. 2:591164. doi: 10.3389/fcomp.2020.591164

Received

03 August 2020

Accepted

29 December 2020

Published

15 February 2021

Volume

2 - 2020

Edited by

Umberto Maniscalco, Institute for High Performance Computing and Networking (ICAR), Italy

Reviewed by

Barbara-Lewandowska-Tomaszczyk, State University of Applied Sciences in Konin, Poland

Raffaele Guarasci, Institute for High Performance Computing and Networking (ICAR), Italy

*Correspondence: Alessandro Marin Vargas, ; Lorenzo Cominelli,

This article was submitted to Human-Media Interaction, a section of the journal Frontiers in Computer Science

Disclaimer

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
