Vocal Interactivity in-and-between Humans, Animals, and Robots

next generation of language-enabled autonomous social agents. However, much of the research is conducted within well-defined disciplinary boundaries, and many fundamental issues remain. This paper attempts to redress the balance by presenting a comparative review of vocal interaction within-and-between humans, animals, and artificial agents (such as robots), and it identifies a rich set of open research questions that may benefit from an interdisciplinary analysis.


INTRODUCTION
Almost all living organisms make (and make use of) sounds - even plants (Appel and Cocroft, 2014) - and many animals have specialized biological apparatus that is adapted to the perception and production of sound (Hopp and Evans, 1998). For example, some fish vibrate their swim bladders, many arthropods stridulate, and the majority of birds and mammals "vocalize" (using a vocal organ known as a syrinx or a larynx, respectively). Predators may use vocal cues to detect their prey (and vice versa), and a variety of animals (such as birds, frogs, dogs, wolves, foxes, jackals, and coyotes) use vocalization to mark or defend their territory. Social animals (including human beings) also use vocalization to express emotions, to establish social relations, and to share information. Human beings, in particular, have extended this behavior to a very high level of sophistication through the evolution of speech and language - a phenomenon that appears to be unique in the animal kingdom, but which shares many characteristics with the communication systems of other animals.
Likewise, auditory perception in many animals is adapted to their acoustic environment and the vocal behavior of other animals, especially conspecifics (Talkington et al., 2012). Vocalization thus sits alongside other modes (such as vision and olfaction) as a primary means by which living beings are able to sense their environment, influence the world around them, coordinate cooperative or competitive behavior with other organisms, and communicate information.
Alongside the study of vocal behavior, recent years have seen important developments in a range of technologies relating to vocalization. For example, systems have been created to analyze and play back animal calls, to investigate how vocal signaling might evolve in communicative agents, and to interact with users of spoken language technology. Indeed, the latter has witnessed huge commercial success in the past 10-20 years, particularly since the release of Naturally Speaking (Dragon's continuous speech dictation software for a PC) in 1997 and Siri (Apple's voice-operated personal assistant and knowledge navigator for the iPhone) in 2011. Research interest in this area is now beginning to focus on voice-enabling autonomous social agents (such as robots).
Therefore, whether it is a bird raising an alarm, a whale calling to potential partners, a dog responding to human commands, a parent reading a story with a child, or a business-person accessing stock prices using an automated voice service on their mobile phone, vocalization provides a valuable communication channel through which behavior may be coordinated and controlled, and information may be distributed and acquired. Indeed, the ubiquity of vocal interaction has given rise to a wealth of research across an extremely diverse array of fields from the behavioral and language sciences to engineering, technology, and robotics. Some of these fields, such as human spoken language or vocal interactivity between animals, have a long history of scientific research. Others, such as vocal interaction between artificial agents or between artificial agents and animals, are less well studied - mainly due to the relatively recent appearance of the relevant technology. This means that there is huge potential for cross-fertilization between the different disciplines involved in the study and exploitation of vocal interactivity. For example, it might be possible to use contemporary advances in machine learning to analyze animal activity in different habitats or to use artificial agents to investigate contemporary theories of language grounding. Likewise, an understanding of animal vocal behavior might inform how vocal expressivity might be integrated into the next generation of autonomous social agents.
This paper appraises our current level of understanding about vocal interactivity within-and-between humans, animals, and artificial agents (such as robots). In particular, we present a snapshot of our understanding in six key areas of vocal interaction: animal⇔animal, human⇔human, robot⇔robot, human⇔animal, human⇔robot, and animal⇔robot (see Figure 1) through the consideration of three aspects of vocal interactivity:
1. Vocal signals in interaction. This concerns properties of the signals themselves, including their structure, grammar, and semantic content where applicable. This topic contains a large body of research on both animal and human vocalizations.
2. Vocal interaction between agents. Here, we primarily discuss the functions and different types of vocally interactive behavior between animals, between human beings and animals, and between human beings and technology.
3. Technology-based research methodologies. Lastly, this paper reviews the use of technology in studying vocal interactivity. These are of interest since they provide relatively recent and novel means to further our understanding in the field while also contributing to the development of new technology capable of vocal interaction.
Given the vastness of the topics covered, we aim for snapshots that provide a good sense of the current state-of-the-art and allow us to identify some of the most pertinent open research questions that might benefit from a cross-disciplinary approach. In particular, when reviewing research on specific aspects of human and/or animal vocal interactivity, we also highlight the questions that these raise for the design of future vocally interactive technologies.

VOCAL SIGNALS IN INTERACTION

Physiology and Morphology
A range of different neural and physical mechanisms are involved in the production, perception, and interpretation of vocal behavior in humans, animals, and artificial agents (Doupe and Kuhl, 1999; Jarvis, 2004; Ackermann et al., 2014; Andics et al., 2014). The physical apparatus for articulation and audition differs from species to species, as does the neural substrate for processing incoming signals and generating outgoing signals. In some species, it has also been hypothesized that exploiting the vocal production system in a form of analysis-by-synthesis may facilitate the understanding of vocal input (Arbib, 2005).
Human beings are mammals and, as such, the physical mechanisms for producing and perceiving vocalizations are constructed along the same lines as those possessed by all other land mammals. Airflow from the lungs excites resonances in the oral cavity (the "vocal tract") by vibrating the vocal cords to produce a rich harmonic sound structure, by creating partial closures and using the resulting turbulence to generate noisy fricative sounds, or by closing the vocal tract completely and producing explosive sounds on releasing the air pressure. The spectral characteristics of the generated sounds are modified by the shape of the vocal tract and thus continually influenced by the movement and position of the main articulators - the tongue, the lips, and the jaw. As in other animals, body size influences the characteristics of the vocalizations that human beings are capable of producing. Hence, the pitch of the voice and the "formants" (the vocal tract resonances) are considerably higher in a small child than they are in an adult. In a recent review, Pisanski et al. (2016) suggest that the control of vocal aspects, such as height of formants and pitch, to convey body size information could be an evolutionary step toward our ability to produce speech.
One difference between the human vocal tract and those of all other mammals is that it is bent into an "L" shape, primarily as a result of our upright vertical posture. This configuration gives rise to the so-called "descended larynx" in adult humans, and it has been hypothesized that this allows human beings to produce a much richer variety of sounds than other mammals (for example, a dog or a monkey) (Lieberman, 1984). This traditional view has been challenged (Fitch and Reby, 2001).
In general terms, however, much regarding the similarities/differences between the vocal systems (including brain organization) in different animals remains unknown and open to further research. Similarly, while morphology has an obvious influence on vocalization as just discussed, the precise nature of this influence and how vocal mechanisms are constrained (or indeed facilitated) by the morphology of the individual agents involved is a topic deserving further study.

Properties and Function of Animal Signals
Several works have been dedicated to studying how non-human animals adapt their vocalizations to the acoustic context and to the listeners' perception. Potash (1972) showed how ambient noise modifies the intensity, rate, and type of calls of the Japanese quail. Experiments conducted by Nonaka et al. (1997) demonstrate that the brain stems of cats hold neuronal mechanisms for evoking the Lombard reflex (Lombard, 1911) of increasing speaker effort in the presence of noise. This effect has also been observed in many avian species (Cynx et al., 1998; Manabe et al., 1998; Kobayasi and Okanoya, 2003; Leonard and Horn, 2005) and in frogs (Halfwerk et al., 2016). Recent work has focused on how other aspects of the vocalizations, such as duration or frequency, are adapted and on the role of auditory feedback in such adaptations (Osmanski and Dooling, 2009; Hage et al., 2013).
Non-human animals have been shown to adapt their vocalizations depending on the audience. For instance, female vervet monkeys produce a higher rate of alarm calls in the presence of their offspring. Likewise, male vervet monkeys make more calls in the presence of adult females than when other dominant males are near (Cheney and Seyfarth, 1985). In some cases, animals may employ vocalizations targeted at individuals of a different species. The kleptoparasitic fork-tailed drongo, when following terrestrially foraging pied babblers, will even perform false alarm calls to make the babblers fly to cover, thereby giving the drongos an opportunity to steal food items (Ridley et al., 2007). Also, vocal communication between species is not confined to animals of the same class, e.g., hornbills (a tropical bird) are known to be capable of distinguishing between different primate alarm calls (Rainey et al., 2004).
Some alarm and mobbing calls serve as an example of the capacity of non-human animals to transmit semantic information referring to specific stimuli categories, to an associated risk, or to a particular amount of danger. Seyfarth et al. (1980) showed how vervet monkeys use and recognize different alarm calls for at least three predators: leopards, eagles, and snakes. Predator- or danger-specific calls have been observed in many other species and situations (Blumstein and Armitage, 1997; Greene and Meagher, 1998; Zuberbühler, 2000, 2001; Manser, 2001; Templeton et al., 2005; Griesser, 2009; Yorzinski and Vehrencamp, 2009).
Perhaps the most interesting recent development in the field of non-human animal vocal interaction is the evidence of syntactic and combinatory rules, grammar, and learning in certain species. For example, McCowan et al. (1999) showed (using bottlenose dolphin whistle repertoires) how an information-theoretic analysis could be used to compare the structural and organizational complexity of various animal communication systems. Ouattara et al. (2009) investigated the ability of non-human primates to generate meaningful acoustic variation during call production - a behavior that is functionally equivalent to suffixation in human language when referring to specific external events. A study by Schel et al. (2010) on the alarm call sequences of colobus monkeys concluded that the monkeys attended to the compositional aspects of utterances. Clay and Zuberbühler (2011) showed the ability of bonobos to extract information about external events by attending to vocal sequences of other individuals, thus highlighting the importance of call combinations in their natural communication system. Candiotti et al. (2012) describe how some non-human primates vary the acoustic structure of their basic call type and, through combination, create complex structures that increase the effective size of their vocal repertoire. Zuberbühler (2002) shows that the semantic changes introduced by a combinatory rule in the natural communication of a particular species of primate may be comprehended by members of another species. Arnold and Zuberbühler (2008) conclude that in free-ranging putty-nosed monkeys, meaning is encoded by call sequences, not individual calls. Clarke et al. (2006) provide evidence of referential signaling in a free-ranging ape species, based on a communication system that utilizes combinatorial rules. Even though most work is focused on primates, this vocal behavior is also seen in others, e.g., Kershenbaum et al. (2012) provide evidence of complex syntactic vocalizations in a small social mammal: the rock hyrax. More recently, several quantitative approaches have been proposed to understanding the complex structure of bird songs (Sasahara et al., 2012; Weiss et al., 2014) and other non-human vocalization sequences (Kershenbaum et al., 2016).
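To make the idea of an information-theoretic comparison of repertoires concrete, the following is a minimal sketch rather than the procedure of any study cited above. It assumes that a recording has already been transcribed into a sequence of discrete call-type labels (the labels and the toy sequence are hypothetical) and computes two standard quantities: the zero-order Shannon entropy of call-type usage and the first-order conditional entropy of the next call given the current one. A large drop from the former to the latter would suggest sequential structure.

```python
import math
from collections import Counter


def zero_order_entropy(calls):
    """Shannon entropy (in bits) of the call-type frequency distribution."""
    counts = Counter(calls)
    total = len(calls)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def first_order_entropy(calls):
    """Conditional entropy (in bits) of the next call type given the current one."""
    pair_counts = Counter(zip(calls, calls[1:]))
    prev_counts = Counter(calls[:-1])
    total_pairs = len(calls) - 1
    entropy = 0.0
    for (prev, _nxt), n in pair_counts.items():
        p_pair = n / total_pairs               # joint probability of the pair
        p_next_given_prev = n / prev_counts[prev]  # transition probability
        entropy -= p_pair * math.log2(p_next_given_prev)
    return entropy


# Hypothetical toy sequence of call-type labels (not real data).
toy_sequence = list("ABABABABCABABABAB")
print("zero-order entropy:", zero_order_entropy(toy_sequence))
print("first-order entropy:", first_order_entropy(toy_sequence))
```

Estimates of this kind are sensitive to sample size and to how calls are categorized, so they are best read as relative measures across repertoires transcribed in the same way.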
ten Cate and Okanoya (2012) review a series of studies and perceptual experiments using artificial grammars that confirm the capacity of non-human animals to generalize and categorize vocal sequences based on phonetic features. Another reviewed set of experiments shows the ability of non-humans to learn simple rules, such as co-occurrence or duplication of vocal units. However, the capacity of non-human animals to detect abstract rules or rules beyond finite-state grammars remains an open question.
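The distinction between finite-state rules and rules beyond them can be illustrated with two artificial grammars that are often used as textbook examples; their use here is an assumption for illustration and is not taken from the studies reviewed above. Sequences of the form (AB)^n can be recognized by a finite-state device (here, a regular expression), whereas A^nB^n requires the numbers of A and B units to match, which no finite-state device can check for unbounded n. The symbols "A" and "B" stand for hypothetical classes of vocal units.

```python
import re

# (AB)^n with n >= 1: a finite-state (regular) pattern.
FINITE_STATE_PATTERN = re.compile(r"(?:AB)+")


def matches_ab_n(sequence):
    """Return True if the sequence is AB repeated one or more times."""
    return FINITE_STATE_PATTERN.fullmatch(sequence) is not None


def matches_a_n_b_n(sequence):
    """Return True if the sequence is n A units followed by exactly n B units.

    The check relies on counting, which goes beyond what a finite-state
    recognizer can do for arbitrarily long sequences.
    """
    n = len(sequence) // 2
    return n > 0 and sequence == "A" * n + "B" * n


for stimulus in ["ABAB", "AABB", "AAABBB", "AAB"]:
    print(stimulus, matches_ab_n(stimulus), matches_a_n_b_n(stimulus))
```

In a perceptual experiment, stimuli that one recognizer accepts and the other rejects are the informative ones, since they separate the two hypotheses about what the listener has learned.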
Overall, establishing reliable communications in a challenging environment may therefore require additional effort on the part of the interlocutors, and there is good evidence that animals and human beings alter the characteristics (such as loudness, clarity, or timing) of their vocalizations as a function of the context and perceived communicative success (Brumm and Slater, 2006; Hooper et al., 2006; Candiotti et al., 2012; Hotchkin et al., 2013). Such adaptive behavior may be conditioned on the distance between the interlocutors, the ambient noise level, or the reverberant characteristics of the environment. In general, such behavior is evidence for "negative feedback control" (Powers, 1974). What objective functions are being optimized? How are vocalizations manipulated to achieve the desired results, and is such behavior reactive or proactive? How should vocally interactive artificial agents be designed in this context? Further, advanced vocal communication systems, such as language, seem to depend on an intimate connection between low-level sensorimotor processing and high-level cognitive processing. This appears to be necessary in order for contextual knowledge, priors, and predictions to constrain the interpretation of ambiguous and uncertain sensory inputs. "Theory of Mind" (ToM), in particular, is the mechanism by which agents can infer intentions and cognitive states underlying overt behavior. However, the degree to which non-human animals have ToM (Premack and Woodruff, 1978; Bugnyar et al., 2016) or how such insights are supported in different brains (Kirsch et al., 2008) remains unclear. As discussed further below, vocal interactivity is likely often teleological and is thus conditioned on underlying intentions. Does this imply that ToM is crucial for language-based interaction? What level of ToM do animals possess, and could this be used to predict the complexity of their vocal interactivity? Similarly, do artificial agents need ToM in order to interact effectively with human beings vocally?
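As an illustration of what a negative feedback account of vocal adaptation might look like for an artificial agent, the sketch below simulates a talker that adjusts its vocal level so that a perceived signal-to-noise ratio approaches a reference value. This is a deliberately simple proportional controller under stated assumptions: the reference SNR, the gain, the starting level, and the function name `simulate_vocal_level` are all hypothetical and are not drawn from Powers (1974) or from any measurement.

```python
def simulate_vocal_level(noise_levels_db, target_snr_db=15.0, gain=0.5,
                         start_level_db=60.0):
    """Simulate a talker regulating vocal level by negative feedback.

    At each step the talker compares the perceived SNR (own level minus
    ambient noise) with a reference value and corrects a fraction `gain`
    of the error. All parameter values are illustrative assumptions.
    """
    level_db = start_level_db
    history = []
    for noise_db in noise_levels_db:
        perceived_snr_db = level_db - noise_db
        error_db = target_snr_db - perceived_snr_db  # positive: speak louder
        level_db += gain * error_db
        history.append(level_db)
    return history


# Ambient noise steps from a quiet to a noisy setting (hypothetical values).
noise_profile = [50.0] * 10 + [70.0] * 10
for step, level in enumerate(simulate_vocal_level(noise_profile)):
    print(step, round(level, 2))
```

The controlled variable here is the perceived outcome (SNR) rather than the vocal output itself, which is what distinguishes this scheme from a fixed stimulus-response mapping; whether living talkers are better described as reactive or proactive remains one of the open questions raised above.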
Finally, although we focus on vocal interactivity here, it is nonetheless worth mentioning that there are important issues arising from the relationship between vocal and non-vocal signals in various types of animal (and indeed human) modes of interaction. Indeed, vocal interaction almost always takes place in a multimodal context (Wermter et al., 2009; Liebal et al., 2013; Mavridis, 2014), and this means that vocalization may well be critically coordinated with other physical activities, such as gestures (Esposito and Esposito, 2011; Gillespie-Lynch et al., 2013; Wagner et al., 2014), gaze direction (Holler et al., 2014), and body posture (Morse et al., 2015). How are such multimodal behaviors orchestrated, especially in multi-agent situations? How is information distributed across the different modes, and what is the relationship between vocal and non-vocal (sign) language?

Structure
The main difference between human and animal vocalization lies not in the physical mechanisms per se, but in how they are used. As Miyagawa et al. (2014) have pointed out, human beings still employ their vocal apparatus as an animal-like call system (primarily to communicate affect). However, humans have also evolved a remarkable system for very high-rate information transfer that appears to be vastly superior to that enjoyed by any other animals: language. Indeed, based on the work of Dawkins (1991) and Gopnik et al. (2001), it can be reasonably claimed that "Spoken language is the most sophisticated behaviour of the most complex organism in the known universe" (Moore, 2007b).
The "special" nature of human spoken language has been much discussed, and it has been hypothesized that it is distinguished from all other forms of animal communication systems through its use of "recursion," especially in syntactic structure (Hauser et al., 2002). Although compelling, such a distinction was immediately questioned by Pinker and Jackendoff (2005). What is clear is that human language appears to be based on a "particulate" (as opposed to "blending") mechanism for combining elements in a hierarchical structure that exploits the combinatorial properties of compound systems (Abler, 1989). This means that the expressive power of human language is effectively unlimited - as von Humboldt (1836) famously said, "Language makes infinite use of finite media." Likewise, human spoken language appears to be organized as a "contrastive" communication system, which aims to minimize communicative effort (i.e., employs minimal sound distinctions) while at the same time preserving communicative effectiveness, thereby giving rise to language-dependent "phonemic" structure.
The traditional representational hierarchy for spoken language spans acoustics, phonetics, phonology, syntax, semantics, and pragmatics. There is insufficient space here to discuss all of these levels in detail. Suffice to say that each area has been the subject of intensive study for several hundred years, and these investigations have given rise to many schools of thought regarding the structure and function of the underlying mechanisms. Of particular interest are the perspectives provided by research into the phylogenetic roots and ontogenetic constraints that condition spoken language in both the species and the individual (Stark, 1980; MacNeilage, 1998; Aitchison, 2000; Fitch, 2000, 2010).

Human Language Evolution and Development
The contemporary view is that language is based on the coevolution of two key traits - ostensive-inferential communication and recursive mind-reading (Scott-Phillips, 2015) - and that meaning is grounded in sensorimotor experience. For relatively concrete concepts, this is substantiated by a number of studies that show activations in sensorimotor areas of the brain during language processing [see, e.g., Chersi et al. (2010), for a discussion]. For more abstract concepts, whether (and if so, how) they are grounded in sensorimotor experience is still a matter of debate [e.g., Thill et al. (2014)]. Metaphors have been put forward as one mechanism to achieve such grounding (Lakoff and Johnson, 1980; Feldman, 2008), but others argue that abstract concepts may (possibly in addition) build on linguistic or statistical information that is not directly grounded (Barsalou et al., 2008; Dove, 2011; Thill and Twomey, 2016).
There is also considerable interest in the developmental trajectory exhibited by young children while acquiring language (Gopnik et al., 2001), including long-term studies of word learning (Roy et al., 2015). It is well established that early babbling and vocal imitation serve to link perception and production, and that an adult addressing an infant will adapt their own speech to match that of the child (so-called "infant-directed speech") (Kuhl, 2000). There is also evidence that children are sensitive to statistical and prosodic regularities, allowing them to infer the structure and composition of continuous contextualized input (Saffran et al., 1996; Saffran, 2003; Kuhl, 2004; Smith and Yu, 2008).
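One way to picture how statistical regularities could reveal structure in continuous input is through transitional probabilities between adjacent syllables: within a word the next syllable is highly predictable, whereas across a word boundary it is less so. The sketch below is a simplified illustration of that idea and is not a model of infant learning or a reconstruction of any cited experiment. The syllables, the three two-syllable "words," and the fixed threshold are hypothetical.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Estimate P(next syllable | current syllable) from adjacent pairs."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(syllables, threshold):
    """Insert a word boundary wherever the transitional probability is low."""
    tps = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for prev, nxt in zip(syllables, syllables[1:]):
        if tps[(prev, nxt)] < threshold:
            words.append("".join(current))
            current = []
        current.append(nxt)
    words.append("".join(current))
    return words


# Hypothetical continuous stream built from three made-up words:
# "bida", "kupa", and "doti", concatenated without pauses.
stream = ["bi", "da", "ku", "pa", "do", "ti", "bi", "da",
          "do", "ti", "ku", "pa", "bi", "da", "ku", "pa"]
print(segment(stream, threshold=0.9))
```

A fixed threshold is a convenience for the toy example; with realistic input the within-word and between-word probabilities overlap, and prosodic cues of the kind mentioned above would be expected to contribute as well.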
Rather more controversial is the claim that children exhibit an acceleration in word learning around the age of 18 months - the so-called "vocabulary spurt" phenomenon (McCarthy, 1954; Goldfield and Reznick, 1990; Nazzi and Bertoncini, 2003; Ganger and Brent, 2004). However, using data from almost 1800 children, Moore and ten Bosch (2009) found that the acquisition of a receptive/productive lexicon can be quite adequately modeled as a single mathematical growth function (with an ecologically well founded and cognitively plausible interpretation) with little evidence for a vocabulary spurt.
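The point that a single smooth growth function can produce an apparent acceleration without any discrete change of regime can be illustrated numerically. The specific function used by Moore and ten Bosch (2009) is not given here, so the sketch below assumes a logistic curve purely for illustration; the capacity, rate, and midpoint values are hypothetical and are not fitted to any data.

```python
import math


def logistic_vocabulary(age_months, capacity=600.0, rate=0.35, midpoint=24.0):
    """Vocabulary size under an assumed logistic growth curve.

    The functional form and all parameter values are illustrative
    assumptions, not estimates from child language data.
    """
    return capacity / (1.0 + math.exp(-rate * (age_months - midpoint)))


def monthly_gains(start=10, end=36):
    """Words gained in each month from `start` up to (not including) `end`."""
    return [logistic_vocabulary(m + 1) - logistic_vocabulary(m)
            for m in range(start, end)]


for month, gain in zip(range(10, 36), monthly_gains()):
    print(month, round(gain, 1))
```

The monthly gains rise gradually and then fall again, so a learner following such a curve would appear to speed up at some age even though the underlying process never changes; this is the sense in which a single function can account for observations that might otherwise be read as a spurt.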

Interlocutor Abilities
These perspectives on language not only place strong emphasis on the importance of top-down pragmatic constraints (Levinson, 1983) but they are also founded on an implicit assumption that interlocutors share significant priors. Indeed, evidence suggests that some animals draw on representations of their own abilities [expressed as predictive models (Friston and Kiebel, 2009)] in order to interpret the behaviors of others (Rizzolatti and Craighero, 2004; Wilson and Knoblich, 2005). For human beings, this is thought to be a key enabler for efficient recursive mind-reading (Scott-Phillips, 2015) and hence for language (Pickering and Garrod, 2007; Garrod et al., 2013).
A significant factor in the study of (spoken) language is that its complexity and sophistication tend to be masked by the apparent ease with which it is used. As a result, theories are often dominated by a somewhat naïve perspective involving the coding and decoding of messages passing from one brain (the sender) to another (the receiver). They also place a strong emphasis on "turn-taking," and hence interaction, in spoken language dialog (Levinson, 2006, 2015). However, some researchers claim that "languaging" is better viewed as an emergent property of the dynamic coupling between cognitive unities that serves to facilitate distributed sense-making through cooperative (social) behaviors (Maturana and Varela, 1987; Bickhard, 2007; Cowley, 2011; Cummins, 2014; Fusaroli et al., 2014).
It is also important to consider the dependencies that exist between interlocutors and the effect such dependencies have on interactive behaviors. The degree to which a talker takes into account the perceived needs of the listener strongly conditions the resulting vocalizations. For example, it is well established that talkers adjust the volume and clarity of their speech in the presence of noise and interference (Lombard, 1911). This is the reason why there is a lack of so-called "invariance" in the vocal signals. Such adaptive behavior is ubiquitous, and speaker-listener coupling may be readily observed in interactions between adults and children (Fernald, 1985), between native and non-native speakers (Nguyen and Delvaux, 2015), and even between humans and machines (Moore and Morris, 1992). As observed by Lindblom (1990), such dependencies may be explained by the operation of control-feedback processes that maximize communicative effectiveness, while minimizing the energy expended in doing so.

Conveyance of Emotion
The formal study of emotion started with the observational work of Charles Darwin (Darwin, 1872) and has since grown into the field we know today as "Affective Science" [and its technical equivalent, "Affective Computing" (Picard, 1997)]. Emotion is a complex physiological, cognitive, and social phenomenon that is exhibited by both humans and animals. Plutchik (1980) hypothesized that emotions serve an adaptive role in helping organisms deal with key survival issues posed by the environment, and that, despite different forms of expression in different species, there are certain common elements, or prototype patterns, that can be identified. In particular, Plutchik (1980) claimed that there are a small number of basic, primary, or prototype emotions - conceptualized in terms of pairs of polar opposites - and that all other emotions are mixed or derivative states. Ekman (1999) subsequently proposed six "basic emotions": happiness, sadness, fear, anger, surprise, and disgust. More recent research favors a "dimensional" approach based on valence (positive vs. negative), arousal, and dominance [Mehrabian (1996), see also the circumplex model, Russell (1980)].
The expression of emotions can be of communicative value, and a number of theories exist regarding this value (Thill and Lowe, 2012). For example, it has been put forward that expressing emotions facilitates social harmony (Griffiths and Scarantino, 2005), while Camras (2011) suggests that emotion expression may even serve the need of the expressor in the sense that it can manipulate the perceiver to the benefit of the expressor's needs.
In general, emotion is thought to be just one aspect of the various "affective states" that an animal or human being can exhibit, the others being personality, mood, interpersonal stances, and attitudes - all of which have the potential to influence vocalization (Scherer, 2003; Seyfarth and Cheney, 2003; Pongrácz et al., 2006; Soltis et al., 2009; Perez et al., 2012). The research challenge, especially in respect of emotionally aware artificial agents, is to identify the degree to which affective states can be interpreted and expressed, and whether they should be treated as superficial or more deeply rooted aspects of behavior. What is the role of vocal affect in coordinating cooperative or competitive behavior? How do affective states influence communicative behavior? Interesting work in this direction includes, for example, the design of sound systems that are capable of conveying internal states of a robot through appropriate modulation of the vocal signals (Schwenk and Arras, 2014).

Comparative Analysis of Human and Animal Vocalization
One of the most important overarching sets of research questions relates to the special (or possibly unique) position of human language in relation to the signaling systems used by other living systems, and how we acquired it as a species (Fitch, 2000; Knight et al., 2000; MacNeilage, 2008; Tomasello, 2008; Berwick et al., 2013; Ravignani et al., 2016; Vernes, 2016). Likewise, it is often asked whether the patterning of birdsong is similar to speech or, perhaps, more related to music (Shannon, 2016). As discussed earlier, human spoken language appears to have evolved to be a contrastive particulate compositional communication system founded on ostensive-inferential recursive mind-reading. Some of these features are exhibited by non-human animals (Berwick et al., 2011; Arnold and Zuberbühler, 2012; ten Cate, 2014). In particular, Engesser et al. (2015) recently claimed evidence for "phonemic" structure in the song of a particular species of bird [although the value of this result was immediately questioned by Bowling and Fitch (2015)]. However, only humans appear to have evolved a system employing all these aspects, so there is considerable interest in comparative analyses of how communication systems can emerge in both living and artificial systems (Oller, 2004; Lyon et al., 2007; Nolfi and Mirolli, 2010). What, for example, is the relationship (if any) between language and the different signaling systems employed by non-human animals? To what degree is there a phonemic structure to animal communications, and how would one experimentally measure the complexity of vocal interactions (beyond information-theoretic analyses)? Bringing it all together, to what extent can different animals be said to possess language, and to what degree can human vocal interactivity be said to be signaling?
Similarly, vocal learning (especially imitation and mimicry) is thought to be a key precursor of high-order vocal communication systems, such as language (Jarvis, 2006a,b; Lipkind et al., 2013), and only a subset of species exhibits vocal learning: parrots, songbirds, hummingbirds, humans, bats, dolphins, whales, sea lions, and elephants (Reiss and McCowan, 1993; Tchernichovski et al., 2001; Poole et al., 2005; Pepperberg, 2010; King and Janik, 2013; Chen et al., 2016). More recently, Watson et al. (2015) have added chimpanzees to the list. However, the degree to which animals are capable of learning complex rules in vocal interaction remains an open question (ten Cate and Okanoya, 2012). What are the common features of vocal learning that these species share, and why is it restricted to only a few species? How does a young animal (such as a human child) solve the correspondence problem between the vocalizations that they hear and the sounds that they can produce? Who should adapt to whom in order to establish an effective channel [see, for example, Bohannon and Marquis (1977) (2015)].
Recent years have also seen an emergence of interest in "social signal processing" (Pentland, 2008; Vinciarelli et al., 2009) and even in the characteristics of speech used during speed-dating (Ranganath et al., 2013). This in itself already raises a number of interesting questions. Does the existence (or absence) of prior relationships between agents impact on subsequent vocal activity? Do the characteristics of vocalizations carry information about the social relationship connecting the interactants (for example, how is group membership or social status signaled vocally)? This goes beyond conspecifics - humans and dogs are able to manage a productive and mutually supportive relationship despite the vocal communication being somewhat one-sided. What is it about the human-dog relationship that makes this one-sidedness sufficient, and conversely, what can biases in communication balancing say about social relationships? Finally, how is vocalization used to sustain long-term social relations?
Non-human animals make multiple uses of vocalizations - from signals warning of the presence of predators to social calls strengthening social bonding between individuals. Alarm vocalizations are characterized by being high pitched, which helps prevent the predator from localizing the caller (Greene and Meagher, 1998). Alarm calls have been extensively studied in a wide variety of species (Seyfarth et al., 1980; Cheney and Seyfarth, 1985; Blumstein, 1999; Manser, 2001; Fichtel and van Schaik, 2006; Arnold and Zuberbühler, 2008; Stephan and Zuberbühler, 2008; Schel et al., 2010).
The function of alarm calls is not limited to warning conspecifics. For example, Zuberbühler et al. (1999) observed that high rates of monkey alarm calls had an effect on the predator, which gave up its hiding location faster once it was detected. Many other animals employ vocalizations to cooperatively attack or harass a predator; these are known as mobbing calls (Ficken and Popp, 1996; Hurd, 1996; Templeton and Greene, 2007; Clara et al., 2008; Griesser, 2009; Yorzinski and Vehrencamp, 2009).
Another role of vocal communication between animals is to inform individuals during the selection of mating partners. Mating or advertising calls have received much research attention in birds due to their complex vocalizations (McGregor, 1992; Searcy and Yasukawa, 1996; Vallet et al., 1998; Gil and Gahr, 2002; Mennill et al., 2003; Pfaff et al., 2007; Alonso Lopes et al., 2010; Bolund et al., 2012; Hall et al., 2013). However, many other species employ such vocal interaction during sexual selection (Brzoska, 1982; Gridi-Papp et al., 2006; Charlton et al., 2012). Some species use vocalizations to advertise their territory or to maintain territorial exclusion, and the sounds emitted will usually travel long distances. For example, wolves use howls as a means to control wolf pack spacing (Harrington and Mech, 1983). These types of vocalization are also used by frogs to advertise their willingness to defend their territory (Brzoska, 1982). Territorial calls also play an important role in sea lions during the breeding season (Peterson and Bartholomew, 1969; Schusterman, 1977). Sea lions and other pinnipeds are also commonly cited as animals that use vocalization between mothers and their offspring. Mothers employ a "pup-attraction call" that will often elicit a "mother-response call" in the pup (Trillmich, 1981; Hanggi and Schusterman, 1990; Gisiner and Schusterman, 1991; Insley, 2001). Mother-offspring calls are one of many examples of the transmission of identity information through animal vocalizations. This aspect has also been studied in the context of songbirds (Weary and Krebs, 1992; Lind et al., 1996), domestic horses (Proops et al., 2009), dolphins (Kershenbaum et al., 2013), and primates (Candiotti et al., 2013).
Overall, vocal signals are therefore arguably generated on purpose (Tomasello et al., 2005; Townsend et al., 2016) and serve to attract attention (Crockford et al., 2014) as well as to provide information (Schel et al., 2013) and support cooperation (Eskelinen et al., 2016). However, other agents can exploit unintentional vocalizations for their own purposes. Also, in living systems, the ultimate driver of behavior is thought to be a hierarchy of "needs" (with survival as the most basic) (Maslow, 1943). As a result, there is interest in the role of "intrinsic motivations," especially learning (Moulin-Frier et al., 2013). To what extent are vocal signals teleological, and is it possible to distinguish between intentional and unintentional vocalizations? Can intentional vocal activity be simulated by technological means to explore animal behavior? Does a vocalization carry information about the underlying intention, and how can the latter be inferred from the former? How do motivational factors such as "urgency" impact on vocalization? What motivational framework would be appropriate for a voice-enabled autonomous social agent?
An interesting type of vocal interaction, which often occurs between mating pairs, is duetting. This comprises a highly synchronized and temporally precise vocal display involving two individuals. Duets have been observed in several bird species (Grafe et al., 2004; Hall, 2004; Elie et al., 2010; Templeton et al., 2013; Dowling and Webster, 2016). There are a number of different hypotheses concerning the function of such behavior, e.g., territory defense, mate-guarding, and paternity-guarding (Mennill, 2006; Dowling and Webster, 2016). Duets also occur in other species and contexts, such as in the alarm calls of lemurs (Fichtel and van Schaik, 2006) and gibbons (Clarke et al., 2006). More generally, vocalizations are often carefully timed in relation to other events taking place in an environment (including other vocalizations) (Benichov et al., 2016). This may take the form of synchronized ritualistic behavior (such as rhythmic chanting, chorusing, or singing) or asynchronous turn-taking (which can be seen as a form of dialog) (Cummins, 2014; Fusaroli et al., 2014; Ravignani et al., 2014).
Of particular interest are the dynamics of such interactions in both humans and animals (Fitch, 2013; Takahashi et al., 2013; De Looze et al., 2014), especially between conspecifics (Friston and Frith, 2015). Is there a common physiological basis for such rhythmic vocal behavior, and how is vocal synchrony achieved between agents? What are the segmental and suprasegmental prosodic features that facilitate such timing relations? What are the dependencies between vocalizations and other events, and how would one characterize them? Given the crucial nature of synchrony and timing in interactivity between natural agents, to what extent does this importance carry over to human-machine dialog? How would one model the relevant dynamics (whether to study natural interactivity or to facilitate human-machine interaction)?

Vocal Interactivity between Non-Conspecifics
Vocal interaction normally takes place between conspecifics (that is, agents with similar capabilities), but what happens between mismatched entities, that is, between humans and/or animals and/or artificial agents? For example, Joslin (1967) employed both human-simulated howls and playback recordings to study wolf behavior and, surprisingly, discovered that the wolves responded more to the human-simulated howls than to the playbacks. Also, Kuhl (1981) conducted listening tests on chinchillas in order to determine their capacity for discriminating speech and to provide support for the existence of a relation between the mammalian auditory system and the evolution of different languages.
More recently, the study of domestic or domesticated animals has become a topic of interest in the field of human⇔animal vocal interaction. For example, Waiblinger et al. (2006) propose considering vocal interaction in the assessment of human-animal relationships, especially in the context of farm animals' welfare. Also, Kaminski et al. (2004) present a case study in which they demonstrate a dog's capacity to "fast map," i.e., to form quick and rough semantic hypotheses about a new word after a single presentation. Horowitz and Hecht (2016) investigated owners' vocalizations in dog-human "play" sessions and found some identifiable characteristics associated with affect.
Research in this area also extends to wild animals. For example, McComb et al. (2014) show how elephants respond differently to playbacks of human speech depending on the speaker's gender and age, aspects that can greatly affect the predator risks that humans present to elephants.
Research on vocal interactivity between non-conspecifics is particularly pertinent to the design of vocally interactive artificial agents. For example, Jones et al. (2008) found differences in individual preferences when people interacted with dog-like robots. According to Moore (2015, 2016b), understanding this situation could be critical to the success of future speech-based interaction with "intelligent" artificial agents. For example, different bodies may lead to different sensorimotor experiences in which an agent's concepts are grounded, which may impact the degree to which two agents can communicate about the same things (Thill et al., 2014).

FIGURE 2 | The evolution of spoken language technology from early military "Command and Control Systems" through current "Voice-Enabled Personal Assistants" (such as Siri) to future "Autonomous Social Agents" (e.g., robots).
What, therefore, are the limitations (if any) of vocal interaction between non-conspecifics? What can be learned from attempts to teach animals human language (and vice versa)? How do conspecifics accommodate mismatches in temporal histories (for example, interaction between differently aged agents) or cultural experience? How can insights from such questions inform the design of vocally interactive artificial agents beyond Siri? Is it possible to detect differences in how different agents ground concepts from their language use, and can artificial agents use such information in vocal interactivity with humans [as suggested by Thill et al. (2014)]?

Spoken Language Systems
On the technology front, recent years have seen significant advances in technologies that are capable of engaging in voice-based interaction with a human user. The performance of automatic speech recognition, text-to-speech synthesis, and dialog management has improved year-on-year, and this has led to a growth in the sophistication of the applications that can be supported, from the earliest military Command and Control Systems to contemporary commercial Interactive Voice Response (IVR) Systems and the latest Voice-Enabled Personal Assistants (such as Siri); see Figure 2. Progress has been driven by the emergence of a data-driven probabilistic modeling paradigm in the 1980s (Gales and Young, 2007; Bellegarda and Monz, 2015), recently supplemented by deep learning (Hinton et al., 2012), coupled with an ongoing regime of government-sponsored benchmarking. Pieraccini (2012) presents a comprehensive review of the history of spoken language technology up to the release of Siri in 2011.
At the present time, research into spoken language technology is beginning to focus on the development of voice-based interaction with Embodied Conversational Agents (ECAs) and Autonomous Social Agents (such as robots). In these futuristic scenarios, it is envisioned that spoken language will provide a "natural" conversational interface between human beings and so-called intelligent systems. However, many challenges need to be addressed in order to meet such a requirement (Baker et al., 2009a; Moore, 2013, 2015), not least how to evolve the complexity of voice-based interfaces from simple structured dialogs to more flexible conversational designs without confusing the user (Bernsen et al., 1998; McTear, 2004; Lopez Cozar Delgado and Araki, 2005; Phillips and Philips, 2006; Moore, 2016b). In particular, seminal work by Nass and Brave (2005) showed how attention needs to be paid to users' expectations [e.g., selecting the "gender" of a system's voice (Crowell et al., 2009)], and this has inspired work on "empathic" vocal robots (Breazeal, 2003; Fellous and Arbib, 2005; Haring et al., 2011; Eyssel et al., 2012; Lim and Okuno, 2014; Crumpton and Bethel, 2016). On the other hand, user interface experts, such as Balentine (2007), have argued that such agents should be clearly machines rather than emulations of human beings, particularly to avoid the "uncanny valley effect" (Mori, 1970), whereby mismatched perceptual cues can lead to feelings of repulsion (Moore, 2012). For a voice-enabled robot, this underlines the importance of matching the voice and face (Mitchell et al., 2011).
It has also been argued that the architecture of future spoken language systems needs to be more cognitively motivated if it is to engage meaningfully with human users (Moore, 2007a, 2010; Baker et al., 2009b), or that such systems should take inspiration from the way in which children acquire their communicative skills (ten Bosch et al., 2009).

TECHNOLOGY-BASED RESEARCH METHODS
The large number of disciplines concerned with vocal interactivity means that an equally wide variety of tools, techniques, and methodologies is used across the different areas of research. Many of these are relatively novel and emergent, which opens several avenues for further research, concerning both the development of the methodologies themselves and their use in future studies of vocal interactivity. For example, large-scale data collection is the norm in spoken language technology (Pieraccini, 2012), and several international agencies exist for sharing data between laboratories (for example, the Linguistic Data Consortium and the European Language Resources Association). Are there other opportunities for sharing data or for inserting technology into non-technological areas? Is it necessary to create new standards in order to facilitate more efficient sharing of research resources?
Likewise, technology for simulating vocalizations is already being used in studies of animal behavior, but different disciplines model vocal interactivity using different paradigms depending on whether they are interested in predicting the outcome of field experiments, eliciting (Benichov et al., 2016) and simulating (Webb, 1995) the behavior in the laboratory, or engineering practical solutions (Moore, 2016a). Vocal interaction may be modeled within a variety of frameworks ranging from traditional behaviorist stimulus-response approaches (for example, using stochastic modeling or deep learning and artificial neural networks) to coupled dynamical systems (using mutual feedback control). In the latter case, vocal interaction is seen as an emergent phenomenon arising from a situated and embodied enactive relationship between cognitive unities, but how can these interactive behaviors be modeled computationally? Are there any mathematical modeling principles that may be applied to all forms of vocal interactivity, and is it possible to derive a common architecture or framework for describing vocal interactivity?
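To make the coupled-dynamical-systems view concrete, the following is a minimal sketch, not a model taken from any of the cited studies. It assumes two agents whose vocal rhythm is reduced to a single phase variable each, with each agent nudging its own phase toward its partner's (a two-oscillator system of the Kuramoto type). The function name, the natural rates, and the coupling strength are all illustrative assumptions. The point is only that a stable timing relation can emerge from mutual feedback without either agent representing the other's rhythm explicitly.

```python
import math


def simulate_coupled_oscillators(omega_a=1.0, omega_b=1.3, coupling=0.5,
                                 dt=0.01, steps=5000):
    """Two phase oscillators that each adjust toward the other's phase.

    omega_a and omega_b are natural angular rates in rad/s (hypothetical
    values). Returns the final phase difference (b minus a), wrapped to
    the interval (-pi, pi].
    """
    phase_a, phase_b = 0.0, 2.0  # deliberately mismatched starting phases
    for _ in range(steps):
        # Rate of change = own natural rate + feedback from the partner.
        rate_a = omega_a + coupling * math.sin(phase_b - phase_a)
        rate_b = omega_b + coupling * math.sin(phase_a - phase_b)
        # Simple forward-Euler integration step.
        phase_a += rate_a * dt
        phase_b += rate_b * dt
    diff = phase_b - phase_a
    return math.atan2(math.sin(diff), math.cos(diff))


if __name__ == "__main__":
    locked = simulate_coupled_oscillators()
    print(f"phase difference with coupling: {locked:.4f} rad")
```

With these assumed parameters, the phase difference settles at a constant offset satisfying sin(difference) = (omega_b - omega_a) / (2 * coupling), so the faster agent leads by a fixed amount rather than drifting away. If the coupling is set too weak relative to the difference in natural rates, no such lock exists and the phases drift apart.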
In addition, technological tools offer great potential for studying vocalization in the wild. As Webb (2008) argues, because robots that act in the world, including interacting with other agents, need to solve many of the same problems that natural autonomous agents need to solve, they provide an additional means by which to study natural behaviors of interest. An oft-cited example is that of cricket mating calls: a female will be attracted to the male of her own species who produces the loudest calls. Webb (1995) built a robot capable of reproducing this behavior using a mechanism of phase cancelation and latency comparison. This is noteworthy in that the potentially complex computational problem, of not just locating sounds but also identifying the loudest source and ensuring it is the correct species, can be solved without an explicit representation of any of these factors. This is discussed further by Wilson and Golonka (2013) as an example of embodied cognition: it is the particular morphology of the cricket's ear channels and interneurons together with particular aspects of the environment (that males of different species will chirp at different frequencies) that solve this problem, foregoing the need for potentially complicated computations.
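The idea that the loudest caller can be approached without ever representing "which source is loudest" can be illustrated with a toy simulation. The sketch below is not Webb's mechanism: it substitutes a plain left/right intensity comparison for phase cancelation and latency comparison, and it ignores species-specific frequency tuning altogether. All names, coordinates, and parameter values are hypothetical.

```python
import math


def ear_intensity(ear_x, ear_y, sources):
    """Summed inverse-square intensity at one ear from all sound sources."""
    return sum(amp / ((ear_x - sx) ** 2 + (ear_y - sy) ** 2 + 1e-9)
               for sx, sy, amp in sources)


def approach_loudest(sources, start=(0.0, 0.0), heading=math.pi / 2,
                     ear_offset=0.5, turn=0.2, step=0.1,
                     capture_radius=0.5, max_steps=1000):
    """Steer toward whichever ear currently hears more sound.

    sources is a list of (x, y, amplitude) tuples. Returns (x, y, reached),
    where reached is True if the agent came within capture_radius of any
    source before max_steps ran out.
    """
    x, y = start
    for _ in range(max_steps):
        if any(math.hypot(x - sx, y - sy) < capture_radius
               for sx, sy, amp in sources):
            return x, y, True
        # Ears sit to the left and right of the current heading.
        left = ear_intensity(x - ear_offset * math.sin(heading),
                             y + ear_offset * math.cos(heading), sources)
        right = ear_intensity(x + ear_offset * math.sin(heading),
                              y - ear_offset * math.cos(heading), sources)
        # No source list is searched and no maximum is computed:
        # the agent only turns toward the louder side and steps forward.
        heading += turn if left > right else -turn
        x += step * math.cos(heading)
        y += step * math.sin(heading)
    return x, y, False


if __name__ == "__main__":
    loud = (-5.0, 10.0, 4.0)   # louder caller (assumed amplitude)
    quiet = (5.0, 10.0, 1.0)   # quieter caller at the same distance
    print(approach_loudest([loud, quiet]))
```

In this configuration both callers start at the same distance from the agent, so the only asymmetry is loudness. The agent's trajectory bends toward the louder caller as a by-product of the steering rule, which is the sense in which the "decision" is made by the coupling of body and environment rather than by an explicit comparison.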
Robots can also help to elucidate necessary precursors and mechanisms for vocal interaction. For example, computational models have been used to investigate how children are able to solve the "correspondence problem" and map between their own perceptual and vocal experiences and those of the adult speakers with whom they interact (Howard and Messum, 2014; Messum and Howard, 2015). Even physical (robotic) models of a child's vocal tract have been designed to understand how these early stages of spoken language acquisition might function (Yoshikawa et al., 2003; Ishihara et al., 2009; Miura et al., 2012).
Another prominent example of this line of research is the "symbol grounding problem" (Harnad, 1990), which, in brief, states that amodal symbols manipulated by a formal system, such as a computer program, have no meaning that is intrinsic to the system itself; whatever meaning may exist is instead attributed by an external observer. Some researchers [e.g., Stramandinoli et al. (2012)] argue that robots require such an intrinsic understanding of concepts to achieve natural vocal interaction with humans. Cangelosi (2006), in particular, distinguishes between physical and social symbol grounding: the former concerns the grounding of an individual's internal representations in sensorimotor experience, while the latter refers to the determination of symbols to be shared between individuals, including their grounded meanings (in other words, social symbol grounding is the creation of a shared vocabulary of grounded symbols).
Both forms of symbol grounding are problems that natural agents solve to a greater or lesser extent. Both forms have also been investigated in robots. Luc Steels' language games, for instance, provide a seminal example of robotic investigations into social symbol grounding (Steels, 2001). These games investigated how artificial agents would "generate and self-organise a shared lexicon as well as the perceptually grounded categorisations of the world expressed by this lexicon, all without human intervention or prior specification" (Steels, 2003, p. 310).
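The self-organization of a shared lexicon can be illustrated with a deliberately stripped-down naming game. The sketch below is an assumption-laden simplification and not Steels' implementation: there is a single object to be named, agents have no perception or categorization, and words are arbitrary labels. The function name and all parameters are hypothetical. It shows only the social half of the problem, namely how repeated pairwise interactions can lead a population to agree on one word without central control.

```python
import random


def naming_game(num_agents=10, max_games=100_000, seed=0):
    """Minimal naming game for one object.

    Returns (games_played, shared_word) once every agent holds exactly the
    same single word, or (max_games, None) if consensus is not reached.
    """
    rng = random.Random(seed)
    vocab = [set() for _ in range(num_agents)]
    next_word = 0
    for game in range(1, max_games + 1):
        speaker, hearer = rng.sample(range(num_agents), 2)
        if not vocab[speaker]:
            # A speaker with no word for the object invents one.
            vocab[speaker].add(f"w{next_word}")
            next_word += 1
        word = rng.choice(sorted(vocab[speaker]))
        if word in vocab[hearer]:
            # Success: both agents drop every competing word.
            vocab[speaker] = {word}
            vocab[hearer] = {word}
        else:
            # Failure: the hearer adopts the word as a candidate.
            vocab[hearer].add(word)
        if all(len(v) == 1 and v == vocab[0] for v in vocab):
            return game, next(iter(vocab[0]))
    return max_games, None


if __name__ == "__main__":
    games, word = naming_game()
    print(f"consensus on {word!r} after {games} games")
```

The number of games needed depends on the random seed and the population size, so no particular figure should be read into a single run.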
Physical symbol grounding, as mentioned, is the problem of grounding an individual's internal representations in sensorimotor experience. Implementations of these mechanisms are thus not always concerned with cognitive plausibility but rather with implementing a practical solution [see Coradeschi et al. (2013) for a recent review and Thill et al. (2014) for a longer discussion of how the simpler sensorimotor aspects considered in most robotics may affect the degree to which these can comment on human grounding]. Nonetheless, robots have, for example, been used to put forward theories of how abstract concepts can be grounded in a sensorimotor experience (Cangelosi and Riga, 2006). Stramandinoli et al. (2012), in particular, propose a hierarchical structure for concepts; some may be directly grounded in sensorimotor experience, whereas others are indirectly grounded via other concepts.
The previously mentioned review by Coradeschi et al. (2013) also follows Belpaeme and Cowley (2007) in highlighting that social symbol grounding has, as a prerequisite, the necessary mechanisms for the acquisition of language and meaning. Here, we want to reaffirm the overall implication: to use robots to study vocal interactivity requires the implementation of prerequisite mechanisms. It is, for instance, occasionally argued that a "mirror neuron system" is an evolutionary precursor to language abilities (Arbib, 2005). This opens the discussion to robot (and computational) models of mirror neuron systems, for which we refer to recent reviews (Oztop et al., 2006; Thill et al., 2013). It also follows from at least some theoretical positions on embodiment that the precise body of an agent may play a fundamental role in all matters of cognition, including symbol grounding (Thill and Twomey, 2016). Indeed, Thill et al. (2014) propose that robot implementations need to take this into account explicitly, suggesting that human usage of concepts, as characterized by appropriate analyses of human-produced texts, may in fact yield insights into the underlying grounding. A robot, whose own body would ground these concepts in (possibly subtly) different ways, could make use of this information in interaction with human beings. Overall, then, the take-home message is that using robots in vocal interaction requires the researcher to be explicit about all aspects of the necessary model [see Morse et al. (2011) for a similar point].
• How are vocalizations manipulated to achieve the desired results, and is such behavior reactive or proactive?
• How should vocally interactive artificial agents be designed in this context?
• Is ToM crucial for language-based interaction?
• What level of ToM do animals possess, and could this be used to predict the complexity of their vocal interactivity?
• Do artificial agents need ToM in order to interact effectively with human beings vocally?
• How are multimodal behaviors orchestrated, especially in multi-agent situations?
• How is information distributed across the different modes, and what is the relationship between vocal and non-vocal (sign) language?

Conveyance of emotion
• To what degree can affective states be interpreted and expressed, and should they be treated as superficial or more deeply rooted aspects of behavior?
• What is the role of vocal affect in coordinating cooperative or competitive behavior?
• How do affective states influence communicative behavior?

Comparative analysis of human and animal vocalization
• What is the relationship (if any) between language and the different signaling systems employed by non-human animals?
• To what degree is there a phonemic structure to animal communications, and how would one experimentally measure the complexity of vocal interactions (beyond information-theoretic analyses)?
• To what extent can different animals be said to possess language, and to what degree can human vocal interactivity be said to be signaling?
• What are the common features of vocal learning that species capable of it share, and why is it restricted to only a few species?
• How does a young animal (such as a human child) solve the correspondence problem between the vocalizations that they hear and the sounds that they can produce?
• Who should adapt to whom in order to establish an effective channel?
• How are vocal referents acquired?
• What, precisely, are the mechanisms underlying vocal learning?

TABLE 2 | Summary of research questions identified in this paper that pertain to vocal interactivity, grouped by the sections of the paper in which they are discussed.

Use of vocalization
• Does the existence (or absence) of prior relationships between agents impact on subsequent vocal activity?
• Do the characteristics of vocalizations carry information about the social relationship connecting the interactants (for example, how is group membership or social status signaled vocally)?
• What is it about the human-dog relationship that makes the one-sidedness of this relation sufficient, and conversely, what can biases in communication balancing say about social relationships?
• How is vocalization used to sustain long-term social relations?
• To what extent are vocal signals teleological, and is it possible to distinguish between intentional and unintentional vocalizations?
• Can intentional vocal activity be simulated by technological means to explore animal behavior?
• Does a vocalization carry information about the underlying intention, and how can the latter be inferred from the former?
• How do motivational factors such as "urgency" impact on vocalization?
• What motivational framework would be appropriate for a voice-enabled autonomous social agent?
• What are the segmental and supra-segmental prosodic features that facilitate precise timing relations in vocal interaction?
• What are the dependencies between vocalizations and other events, and how would one characterize them?
• Given the crucial nature of synchrony and timing in interactivity between natural agents, to what extent does this importance carry over to human-machine dialog?
• How would one model the relevant dynamics (whether to study natural interactivity or to facilitate human-machine interaction)?

Vocal interactivity between non-conspecifics
• What are the limitations (if any) of vocal interaction between non-conspecifics?
• What can be learned from attempts to teach animals human language (and vice versa)?
• How do conspecifics accommodate mismatches in temporal histories (for example, interaction between differently aged agents) or cultural experience?
• How can insights from such questions inform the design of vocally interactive artificial agents beyond Siri?
• Is it possible to detect the differences in how different agents ground concepts from their language use, and can artificial agents use such information in vocal interactivity with humans?

Spoken language systems
• How does one evolve the complexity of voice-based interfaces from simple structured dialogs to more flexible conversational designs without confusing the user?

Once an artificial agent that is capable of vocal interactivity has been created (whether it achieved this as a result of cognitively plausible modeling or not), it is interesting to ask how humans might actually interact with it. Branigan et al. (2011), for example, report on five experiments in which humans interacted either with other humans or (so they were told) a computer. The core behavior of interest was verbal alignment (in which participants, in a dialog, converge on certain linguistic behaviors). Their main insight was that such alignment appeared to depend on beliefs that humans held about their interlocutors (specifically, their communicative capacity); they were, for example, more likely to align on a disfavored term for an object if they believed the interlocutor was a computer. Vollmer et al. (2013) extended this work by replacing the computer system with a humanoid robot (an iCub) and found a similar alignment in the domain of manual actions (rather than the lexical domain). Kopp (2010) investigated the establishment of social resonance through embodied coordination involving expressive behavior during conversation between two agents. Such aspects form an important part of human conversation and may determine whether or not humans perceive the other agent as social. Kopp (2010) argued that including such mechanisms (e.g., mimicry, alignment, and synchrony) may be a significant factor in improving human-agent interaction. Recently, de Greeff and Belpaeme (2015) have demonstrated the relevance of such factors in robot learning, finding that a robot that uses appropriate social cues tends to learn faster.
Similarly, the properties of the vocal signals themselves have consequences for the overall interaction. For example, Niculescu et al. (2011) investigated the effects of voice pitch on how robots are perceived, finding that a high-pitched "exuberant" voice led to a more positive perception of the overall interaction than a low-pitched "calm" voice, highlighting the importance of appropriate voice design for the overall quality of a human-robot interaction. Walters et al. (2008), similarly, found that the voice of the robot modulates the physical approach behavior of humans to robots and the distance that is perceived as comfortable.
Finally, it is worth highlighting that such communicative systems need not always be inspired by insights from human or animal vocalization; for example, Schwenk and Arras (2014) present a flexible vocal synthesis system for HRI capable of modulating the sounds a robot makes based on both features of the ongoing interaction and internal states of the robot.

Technology-based research methods
• Are there novel opportunities for sharing data or for inserting technology into non-technological areas?
• Is it necessary to create new standards in order to facilitate more efficient sharing of research resources?
• How can vocal interactivity as an emergent phenomenon be modeled computationally?
• Are there any mathematical modeling principles that may be applied to all forms of vocal interactivity, and is it possible to derive a common architecture or framework for describing vocal interactivity?
• What tools might be needed in the future to study vocalization in the wild?

CONCLUSION
This paper satisfies two objectives. First, we have presented an appraisal of the state-of-the-art in research on vocal interactivity in-and-between humans, animals, and artificial agents (such as robots). Second, we have identified a set of open research questions, summarized again in Tables 1-3 for convenience. It is worth highlighting that many of these open research questions require an interdisciplinary approach, be it the use of artificial agents to study particular aspects of human or animal vocalization, the study of animal vocal behavior to better distinguish between signaling and language in human beings, or indeed the study of human and/or animal vocal interactivity (including between humans and animals) with a view to designing the next generation of vocally interactive technologies.
The questions we have raised thus serve a dual purpose. Not only do they highlight opportunities for future research aimed at increasing our understanding of the general principles of vocal interactivity per se, but they also have the potential to impact on practical applications and the design of new technological solutions. Consider, to give but one example, how current technology is moving toward an increasing number of artifacts that offer both cognitive capabilities and voice-enabled interfaces. How the vocal interactivity of such artifacts should be designed is not obvious, since it is not clear how users might expect to interact with such interfaces. Would they prefer natural language or a more command-style interface? What are the precise underlying mechanisms needed for the artifact to offer the desired capabilities?
Finally, let us close by emphasizing again that fully addressing many of the questions we raise requires an interdisciplinary approach that cuts across the different fields that study different types of vocal interactivity in different types of agent. We believe that the time is now ripe to tackle these challenges, and we expect interdisciplinary efforts at the intersections of the fields that together make up the study of vocal interactivity (as outlined in Figure 1) to blossom in the coming years.

FIGURE 1 | Illustration of the key types of vocal interactivity linking humans, animals, and artificial agents (such as robots).
, for a study showing that adults adapt their vocal interactivity with children based on comprehension feedback by the children]? How are vocal referents acquired? What, precisely, are the mechanisms underlying vocal learning?

3. VOCAL INTERACTIVITY

3.1. Use of Vocalization
Cooperative, competitive, and communicative behaviors are ubiquitous in the animal kingdom, and vocalization provides a means through which such activities may be coordinated and managed in communities of multiple individuals [see, e.g., work by Fang et al. (2014); King et al. (2014); Volodin et al. (2014); Ma

TABLE 1 | Summary of research questions identified in this paper that pertain to vocal signals in interaction, grouped by the sections of the paper in which they are discussed.
• What are the similarities/differences between the vocal systems (including brain organization) in different animals?
• How are vocal mechanisms constrained or facilitated by the morphology of the individual agents involved?

TABLE 3 | Summary of research questions identified in this paper that pertain to technology-based research methods.