# MODELS AND THEORIES OF SPEECH PRODUCTION

EDITED BY: Adamantios Gafos and Pascal van Lieshout

PUBLISHED IN: Frontiers in Psychology and Frontiers in Communication

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-928-1 DOI 10.3389/978-2-88963-928-1


# MODELS AND THEORIES OF SPEECH PRODUCTION

Topic Editors: Adamantios Gafos, University of Potsdam, Germany; Pascal van Lieshout, University of Toronto, Canada

Citation: Gafos, A., van Lieshout, P., eds. (2020). Models and Theories of Speech Production. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-928-1

# Table of Contents


Jean-François Patri, Julien Diard and Pascal Perrier


Matthias Heyne, Donald Derrick and Jalal Al-Tamimi


#### *A Simple 3-Parameter Model for Examining Adaptation in Speech and Voice Production*

Elaine Kearney, Alfonso Nieto-Castañón, Hasini R. Weerathunge, Riccardo Falsini, Ayoub Daliri, Defne Abur, Kirrie J. Ballard, Soo-Eun Chang, Sara-Ching Chao, Elizabeth S. Heller Murray, Terri L. Scott and Frank H. Guenther


# Editorial: Models and Theories of Speech Production

Adamantios Gafos <sup>1</sup> \* and Pascal van Lieshout <sup>2</sup> \*

*<sup>1</sup> Department of Linguistics and Excellence Area of Cognitive Sciences, University of Potsdam, Potsdam, Germany, <sup>2</sup> Department of Speech-Language Pathology, Oral Dynamics Laboratory, University of Toronto, Toronto, ON, Canada*

Keywords: speech production, motor control, dynamical models, phonology, speech disorders, timing

**Editorial on the Research Topic**

#### **Models and Theories of Speech Production**

Spoken language is conveyed via well-coordinated speech movements, which act as coherent units of control referred to as gestures. These gestures and their underlying movements show several distinctive properties in terms of lawful relations among the parameters of duration, relative timing, range of motion, target accuracy, and speed. However, no existing theory currently accounts for all properties of these movements. Even though models of speech motor control over the last 40 years have consistently taken inspiration from general movement science, some of the comparisons remain ill-informed. For example, we still know very little about whether widely known principles that apply to limb movements (e.g., the speed-accuracy trade-off known as Fitts' law) also hold for speech movements. An understanding of the principles that apply to speech movements is key to defining the somewhat elusive concept of speech motor skill and to assessing and interpreting different levels of that skill in populations with and without diagnosed speech disorders. The latter issue taps into fundamental debates about whether speech pathology assessment paradigms need to be restricted to control regimes that are specific to those underlying typical speech productions. Resolution of such debates crucially relies on our understanding of the nature of speech processes and the underlying control units.
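As a point of reference for the comparison with limb movements, Fitts' law in its classic formulation relates movement time to the difficulty of a pointing movement. The symbols below are the conventional ones and are not taken from the contributions to this Research Topic:

$$
MT = a + b \log_2\left(\frac{2D}{W}\right)
$$

Here $MT$ is movement time, $D$ is the distance to the target, $W$ is the target width, and $a$ and $b$ are empirically fitted constants. For instance, a movement with $D = 8$ and $W = 2$ (in the same units) has an index of difficulty of $\log_2(16/2) = 3$ bits. Whether a relation of this form also describes speech movements is the open question.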

Unlike movements in locomotion or oculomotor function, speech movements, when combined into gestures, are not mere physical instantiations of organs moving in space and time but also have an intrinsic symbolic function. Language-particular systems, or phonological grammars, are involved in the patterning of these gestures. Grammar constraints regulate the permissible symbolic combinations, as evidenced by eliciting judgments on whether a given sequence is well-formed in a particular language (the same sequence can be acceptable in one language but not in another). In what ways these constraints shape speech gestures, and how they fit with existing general principles of motor control, is also not clearly understood.

Furthermore, speech gestures are parts of words and thus one window into understanding the nature of the speech production<sup>1</sup> system is to observe speech movements as parts of words or larger chunks of speech such as phrases or sentences. The intention to produce a lexical item involves activating sequences of gestures that are part of the lexical item. The regulation in time of the units in such sequences raises major questions for speech motor control theories (but also for theories of cognition and sequential action in general). Major challenges are met in the inter-dependence among different time scales related to gestural planning, movement execution and coordination within and across domains of individual lexical items. How these different time scales interact and how their interaction affects the observed movement properties is for the most part still unknown.

<sup>1</sup>One of our reviewers notes that in the field of psycholinguistics the term speech production is used more broadly (than in the use of the term implied by the contributions to this Research Topic) and points out the need, aptly stated, "to bridge the gap between psycholinguistically informed phonetics and phonetically informed psycholinguistics." We fully concur and look forward to future research efforts and perhaps Research Topics devoted to such bridging. For a recent special issue on psycholinguistic approaches to speech production, see Meyer et al. (2019) and for a more focused review of the issues pertinent to "phonetic encoding" (a term in psycholinguistics roughly equivalent to our use of the term speech production in the present Research Topic) see Laganaro (2019).

Edited and reviewed by: *Niels O. Schiller, Leiden University, Netherlands*

#### \*Correspondence:

*Adamantios Gafos, gafos@uni-potsdam.de; Pascal van Lieshout, p.vanlieshout@utoronto.ca*

#### Specialty section:

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

Received: *14 April 2020* Accepted: *12 May 2020* Published: *19 June 2020*

#### Citation:

*Gafos A and van Lieshout P (2020) Editorial: Models and Theories of Speech Production. Front. Psychol. 11:1238. doi: 10.3389/fpsyg.2020.01238*

In this special issue, we present a variety of theoretical and empirical contributions which explore the nature of the dynamics of speech motor control. For practical purposes, we separate these contributions into two major themes:

1) Models and theories of speech production
2) Applications

Following is a short description of each paper as listed under these themes.

#### 1) Models and theories of speech production

The speech signal is simultaneously expressed in two information-encoding systems: articulation and acoustics. Goldstein's contribution addresses the relation between representations in these two parallel manifestations of speech while focusing not on static properties but on patterns of change over time (temporal co-modulation) in these two channels. To do so, Goldstein quantifies the relation between rates of change in the parallel acoustic and articulatory representations of the same utterance, produced by various speakers, based on x-ray microbeam data. Analysis of this relation indicates that the two representations are correlated via a pulse-like modulation structure, with local correlations being stronger than global ones. This modulation seems linked to the fundamental unit of the syllable.

It is widely assumed that acoustic parameters for vowels are normally distributed, but it is rarely demonstrated that this might be the case. Whalen and Chen quantified the distributions of F1 and F2 values of /i/ and /o/ in the English words "heed," "geek," "ode"/"owed," and "dote" produced by a single speaker on three different days. Analysis based on a high number of repetitions of these vowels in different consonantal contexts indicates that distributions are generally normal, which in turn suggests consistent vowel-specific targets across different contextual environments. The results add weight to the widely-held assumption that speech targets follow a normal distribution and the authors discuss the implications for theories of speech targets.

Turk and Shattuck-Hufnagel address the nature of timing in speech, with special attention given to movement endpoints, which, as they argue, relate to the goals of these movements. They argue that these points require dedicated control regimes, drawing on evidence from work in both speech and non-speech motor control. They also argue that, in contrast to the Articulatory Phonology/Task Dynamics view, where gestural durations are determined by intrinsic dynamics, duration must be an independently controlled variable in speech. A phonology-extrinsic component is thus proposed to be necessary. The authors call for developing and testing models of speech in which a component of abstract, symbolic phonological representations is kept apart from the way(s) in which these representations are implemented in quantitative terms, including surface duration specifications and the attendant timing mechanisms for achieving them.

Shaw and Chen investigated to what degree timing between gestures is stable across variations in the spatial positions of individual articulators, as predicted in Articulatory Phonology. Using Electromagnetic Articulography with a group of Mandarin speakers producing CV monosyllables, they found a correlation between the initial position of the tongue gesture for the vowel and C-V timing. In contrast to the original hypothesis, this indicates that inter-gestural timing is sensitive to the position of the articulators, suggesting a critical role for somatosensory feedback.

Roessig and Mücke study tonal and kinematic profiles of different degrees of prominence (unaccented, broad, narrow and contrastive focus) from 27 speakers of German. Parameters in both the tonal and kinematic dimensions are shown to vary systematically across degrees of prominence. A dynamical approach is put forward in modeling these findings. This approach embraces the multidimensionality of prosody while at the same time showing how both discrete and continuous modifications in focus marking can be expressed within one formal language. The model captures qualitatively the observed patterns in the data by tuning of an abstract control variable which shapes the attractor landscape over the parameter space of kinematic and tonal dimensions considered in this work.

Iskarous provides a computational approach to explain the nature of spatiotemporal particulation of the vocal tract, as evidenced in the production of speech gestures. Based on a set of reaction-diffusion equations with simultaneous Turing and Hopf patterns, the critical characteristics of speech gestures related to vocal tract constrictions can be replicated. This supports the notion that motor processes can be seen as the emergence of low-degree-of-freedom descriptions from high-degree-of-freedom systems.

Patri et al. address individual differences in responses to auditory or somatosensory perturbation in speech production. Two accounts are entertained. The first reduces individual differences to differences in acuity of the sensory specifications while the second leaves sensory specifications intact and, instead, modulates the sensitivity of match between motor commands and their auditory consequences. While simulation results show that both accounts lead to similar results, it is argued that maintaining intact sensory specifications is more flexible, enabling a more encompassing approach to speech variability where cognitive, attentional and other factors can modulate responses to perturbations.

One of the foundational ideas of phonology and phonetics is that produced and perceived utterances are decomposed into sequences of discrete units. However, evidence from development indicates that in child speech utterances are holistic rather than segmented. The contribution by Davis and Redford offers a theoretical demonstration along with attendant modeling that the posited units can emerge from a stage of speech where words or phrases start off as time-aligned motoric and perceptual trajectories. As words are added and repeatedly rehearsed by the learner, motoric trajectories begin to develop recurrent articulatory configurations which, when coupled with their corresponding perceptual representations, give rise to perceptual-motor units claimed to characterize mature speech production.

In their contribution, Kearney et al. present a simplified version of the DIVA model, focusing on three fitting parameters related to auditory feedback control, somatosensory feedback control, and feedforward control. The model is tested through computer simulations that identify optimal model fits to six existing sensorimotor adaptation datasets, showing excellent fits to real data across different types of perturbations and experimental paradigms.

An active area in phonological theory is the investigation of long-distance assimilation where features of a phoneme assimilate to features of another non-adjacent phoneme. Tilsen seeks to identify mechanisms for the emergence of such non-local assimilations in speech planning and production models. Two mechanisms are proposed. The first is one where a gesture is either anticipatorily selected in an earlier epoch or is not suppressed (after being selected) so that its influence extends to later epochs. The second is one where gestures which may be active in one epoch of a planning-level dynamics, even though not selected during execution, may still influence production in a different epoch. Evidence for these mechanisms is found in both speech and non-speech movement preparation paradigms. The existence of these two mechanisms is argued to account for the major dichotomy between assimilation phenomena that have been described as involving the extension of an assimilating property vs. those that cannot be so described.

Xu and Prom-on contrast two principles assumed to underlie the dynamics of movement control: economy of effort and maximum rate of information. They present data from speakers of American English who were asked to imitate recordings of repetitive syllable sequences that had been artificially accelerated and to produce meaningful sentences containing the same syllables at normal and fast speaking rates. The results show that the characteristics of the formant trajectories they analyzed are best explained by the maximum rate of information principle.

Kröger et al.'s contribution offers a demonstration that a learning model based on self-organizing maps can serve as a bridge between models of the mental lexicon and models of sensorimotor control and that such a model can learn (from semantic, auditory and somatosensory information) representational units akin to phonetic-phonological features. At a broad level, few efforts have been made to bridge theory and modeling of the lexicon and motor control. The proposed model aims at addressing that gap and makes predictions about the specificity and rate of growth of such representational features under different training conditions (auditory only vs. auditory and somatosensory training modes).

Parrell and Lammert develop a synthesis of the dynamic movement primitives model of motor control (Schaal et al., 2007; Ijspeert et al., 2013) with the task dynamics model of speech production (Saltzman and Munhall, 1989). A key element in achieving this synthesis is the incorporation of a learnable forcing term into the task dynamics' point-attractor system. The presence of such a tunable term endows task dynamics with flexibility in movement trajectories. The proposed synthesis also establishes a link to optimization approaches to motor control where the forcing term can be seen to minimize a cost function over the timespan of the movement under consideration (e.g., minimizing total energy expended during a reaching movement). The dynamics of the proposed synthesis model are explicitly described and their effects are demonstrated in the form of proof-of-concept simulations showing the consequences of perturbations on jaw movement trajectories.
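The idea of adding a learnable forcing term to a point attractor can be illustrated with a minimal one-dimensional sketch in the general style of dynamic movement primitives. This is not Parrell and Lammert's implementation: the function name `simulate_dmp`, the gain values, the basis-function widths, and the example weights are all assumptions chosen for illustration.

```python
import math

def simulate_dmp(y0, goal, weights, duration=1.0, dt=0.001,
                 alpha_z=25.0, beta_z=6.25, alpha_x=4.0):
    """Euler-integrate a 1-D point attractor with a weighted forcing term.

    All parameter values are illustrative assumptions, not fitted values.
    """
    n = len(weights)
    # Basis-function centres placed along the decaying phase variable x.
    centres = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c for c in centres]  # heuristic widths (assumption)
    x, y, z = 1.0, y0, 0.0  # phase, position, scaled velocity
    trajectory = [y]
    for _ in range(int(round(duration / dt))):
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centres)]
        total = sum(psi)
        forcing = 0.0
        if total > 0.0:
            # Normalised weighted sum, gated by the phase so it fades out.
            forcing = (sum(p * w for p, w in zip(psi, weights)) / total
                       * x * (goal - y0))
        # Critically damped spring toward the goal, plus the forcing term.
        dz = (alpha_z * (beta_z * (goal - y) - z) + forcing) / duration
        dy = z / duration
        dx = -alpha_x * x / duration
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory

# Zero weights give the plain point attractor; non-zero weights reshape the path.
unforced = simulate_dmp(0.0, 1.0, [0.0] * 5)
forced = simulate_dmp(0.0, 1.0, [100.0, -50.0, 0.0, 50.0, 0.0])
```

With all weights at zero the system reduces to the plain point attractor and settles on the goal. Non-zero weights change the shape of the trajectory on the way there, while the phase gating lets the attractor dominate near the end of the movement. In a learning setting the weights would be fitted to data or optimized against a cost function, which this sketch does not attempt.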

#### 2) Applications

Noiray et al. present a study in which they examined whether phonemic awareness correlates with coarticulation degree, commonly used as a metric for estimating the size of children's production units. A speech production task was designed to test for developmental differences in intrasyllabic coarticulation degree in 41 German children from 4 to 7 years of age, using ultrasound imaging. The results suggest that the process of developing spoken language fluency involves dynamical interactions between cognitive and speech motor domains.

Tiede et al. describe a study in which they tracked movements of the head and speech articulators during an alternating word pair production task driven by an accelerating rate metronome. The results show that as production effort increased, so did speaker head nodding, and that nodding increased abruptly following errors. The strongest entrainment between head and articulators was observed at the fastest rate under coda alternation conditions.

Namasivayam et al. present an Articulatory Phonology approach for understanding the nature of Speech Sound Disorders (SSDs) in children, aiming to reconcile the traditional phonetic-phonology dichotomy with the concept of interconnectedness between these levels. They present evidence supporting the notion of articulatory gestures at the level of speech production and how this is reflected in control processes in the brain. They add an overview of how an articulatory "gesture"-based approach can account for articulatory behaviors in typical and disordered speech production, concluding that the Articulatory Phonology approach offers a productive strategy for further research in this area.

Heyne et al. address the relation between speech and another oral motor skill, trombone playing. Using ultrasound, they recorded midsagittal tongue shapes from New Zealand English and Tongan-speaking trombone players. Tongue shapes from the two language groups were estimated via fits with generalized additive mixed models, while these speakers/players produced vowels (in their native languages) and sustained notes at different pitches and intensities. The results indicate that, while airflow production and requisite acoustics largely constrain vocal tract configuration during trombone playing, evidence for a secondary influence from speech motor configurations can be discerned in that the two groups tended to use different tongue configurations resembling distinct vocalic monophthongs in their respective languages.

The papers assembled for this Research Topic attest to the advantages of combining theoretical and empirical approaches to the study of speech production. They also attest to the value of formal modeling in addressing long-standing issues in speech development and the relationship between motor control and phonological patterns; to the importance of somatosensory and auditory feedback in planning and monitoring speech production and the importance of integrating speech production models with other aspects of cognition; and finally, to the potential of theoretical models in informing applications of speech production in disordered speech and motor skills in other oral activities such as playing musical instruments.

#### AUTHOR CONTRIBUTIONS

All authors listed have made equal contributions to the work and approved it for publication.

#### ACKNOWLEDGMENTS

AG's work has been supported by the European Research Council (AdG 249440) and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project ID 317633480 - SFB 1287, Project C04.

#### REFERENCES

Schaal, S., Mohajerian, P., and Ijspeert, A. J. (2007). Dynamics systems vs. optimal control: a unifying view. In *Progress in Brain Research*, Vol. 165, eds P. Cisek, T. Drew, and J. F. Kalaska, 425–445. doi: 10.1016/S0079-6123(06)65027-9

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Gafos and van Lieshout. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Emergence of an Action Repository as Part of a Biologically Inspired Model of Speech Processing: The Role of Somatosensory Information in Learning Phonetic-Phonological Sound Features

#### Bernd J. Kröger<sup>1</sup> \*, Tanya Bafna<sup>2</sup> and Mengxue Cao<sup>3</sup>

<sup>1</sup> Neurophonetics Group, Department of Phoniatrics, Pedaudiology, and Communication Disorders, Medical School, RWTH Aachen University, Aachen, Germany, <sup>2</sup> Medical School, RWTH Aachen University, Aachen, Germany, <sup>3</sup> School of Chinese Language and Literature, Beijing Normal University, Beijing, China

#### Edited by:

Adamantios Gafos, Universität Potsdam, Germany

#### Reviewed by:

Joana Cholin, Bielefeld University, Germany; Jason W. Bohland, Boston University, United States

#### \*Correspondence:

Bernd J. Kröger bernd.kroeger@rwth-aachen.de; bkroeger@ukaachen.de

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 09 January 2019 Accepted: 07 June 2019 Published: 10 July 2019

#### Citation:

Kröger BJ, Bafna T and Cao M (2019) Emergence of an Action Repository as Part of a Biologically Inspired Model of Speech Processing: The Role of Somatosensory Information in Learning Phonetic-Phonological Sound Features. Front. Psychol. 10:1462. doi: 10.3389/fpsyg.2019.01462

A comprehensive model of speech processing and speech learning has been established. The model comprises a mental lexicon, an action repository and an articulatory-acoustic module for executing motor plans and generating auditory and somatosensory feedback information (Kröger and Cao, 2015). In this study a "model language" based on three auditory and motor realizations of 70 monosyllabic words has been trained in order to simulate early phases of speech acquisition (babbling and imitation). We were able to show that (i) the emergence of phonetic-phonological features results from an increasing degree of ordering of syllable representations within the action repository and that (ii) this ordering or arrangement of syllables is mainly shaped by auditory information. Somatosensory information helps to increase the speed of learning. Consonantal features like place of articulation, in particular, are learned earlier if auditory information is accompanied by somatosensory information. It can be concluded that somatosensory information, as generated already during the babbling and imitation phases of speech acquisition, is very helpful, especially for learning features like place of articulation. After learning is completed, acoustic information together with semantic information is sufficient for determining the phonetic-phonological information from the speech signal. Moreover, it is possible to learn phonetic-phonological features like place of articulation from auditory and semantic information only, but not as fast as when somatosensory information is also available during the early stages of learning.

Keywords: neural model simulation, speech production and acquisition, speech perception, neural self-organization, connectionism and neural nets

# INTRODUCTION

Speaking starts with a message which the speaker wants to communicate, followed by an activation of concepts. This process is called initiation. Subsequently concepts activate words which may be inflected and ordered within a sentence with respect to their grammatical and functional role. This process is called formulation and starts with the activation of lemmas in the mental lexicon that correspond to lexical concepts within the semantic network. In a following step, the lemma's corresponding word-forms are activated (Dell et al., 1997; Levelt et al., 1999). The phonological representation is then processed syllable by syllable by activating, executing, and monitoring a sequence of syllables. This process is called articulation and is thought to involve the mental syllabary (Levelt et al., 1999; Cholin, 2008; Brendel et al., 2011) as well as lower level motor and sensory processing modules. While the mental syllabary (Levelt and Wheeldon, 1994; Levelt et al., 1999) is accessed during phonetic encoding as part of the phonetic production process and comprises phonetic motor units, it is hypothesized in our framework that an action repository is neurally connected with the mental lexicon, comprising phonological, motor, auditory as well as somatosensory representations of all frequent syllables of a language (Kröger et al., 2009, 2011a,b). It is hypothesized that a hypermodal representation of these items (cf. Feng et al., 2011; Lametti et al., 2012) is stored in the action repository in the form of a cortical neural map which indicates an ordering of syllables with respect to syllable structure as well as with respect to phonetic features of the consonants and vowels building up each syllable (phonetic feature map, see Kröger et al., 2009; Kröger and Cao, 2015). This model has been embodied as a quantitative computer model, leading to results that approximate observed behavior, but it is unclear how realistic the model is because some of its assumptions (especially the one concerning feature maps) are still not verified on the basis of neurophysiological findings.

It is still an open question how the knowledge and skill repositories mentioned above, i.e., the mental lexicon and the action repository, emerge and gather speech and language knowledge during speech acquisition, and how the two repositories are related to each other in order to allow speech processing (i.e., production as well as perception). The interaction between a mental lexicon and an action repository can be modeled if the syllabification process following the activation of phonological forms within the mental lexicon leads to syllable activation at the level of the action repository. This interface between mental lexicon and action repository does not exist at the beginning of the speech acquisition process, i.e., it is not available directly after birth. Moreover, it can be assumed that the emergence of a phonological representation even for syllables, i.e., the emergence of a language-specific speech sound representation, as well as the later emergence of phonological awareness (Castles and Coltheart, 2004), results from learning in early phases of speech acquisition, especially within the babbling and imitation phase.

Thus many models of speech production either focus on lexical linguistic processes and end with a phonological representation (e.g., Dell et al., 1997; Levelt et al., 1999; Levelt and Indefrey, 2004) or focus on the phonetic details and thus start with a phonological description of an utterance and give a detailed sensorimotor description of the speech production process (Saltzman and Munhall, 1989; Guenther et al., 2006; Guenther and Vladusich, 2012; Civier et al., 2013). In our approach we assume a phonological word-level representation as part of the mental lexicon while it is the task of the syllabification process to map these lexical phonological representations on syllabic phonological representations which are assumed to be part of the action repository (Kröger et al., 2014).

If we assume that only a sparse phonological representation exists at the beginning of speech acquisition (cf. Best et al., 2016), the emergence of the action repository as well as of the mental lexicon has to start with a sparse organization at the beginning of the acquisition process. Therefore we developed an approach comprising a direct neural association between conceptual lexical and sensorimotor syllabic representations of speech items. This approach elucidates how phonetic-phonological features and, later on, a phonological representation of the target language emerge (Kröger and Cao, 2015). While the simulation of early phases of speech acquisition using this model was based on auditory stimuli in earlier simulations (ibid.), we have now augmented the model so that it can incorporate motor and somatosensory information.

The main goal of this study is to evaluate how important the addition of somatosensory information is for learning phonetic-phonological features. For example, the feature place of articulation is encoded in the acoustic speech signal in a very complex way and is thus difficult for a listener to detect from the acoustic speech signal alone. But place of articulation of consonants is easily detectable from somatosensory data like tactile feedback information from lips, tongue, and palate. Thus it can be assumed that somatosensory information plays an important role during those phases of speech acquisition that deal with phonetic-phonological features like place of articulation.

# MATERIALS AND METHODS

# Description of the Model

The model is able to perform three working modes, i.e., learning, production, and perception. During learning, external knowledge – i.e., knowledge mainly gathered from interaction of the learner with its direct environment – is transferred to the learner (i.e., to the baby or toddler, also called "model"). This information is semantic information concerning words as well as auditory information generated by a caretaker. The neural model of the learner comprises a cognitive part and a sensorimotor part (**Figure 1**). The cognitive part consists of a growing self-organizing map (GSOM) representing words within a central neural map representing the mental lexicon. The growth process of that neural map takes place during learning. This neural map is also called semantic map or semantic feature map (S-MAP) because it is closely linked with the feature vectors representing each word, e.g., the word "mama" comprises semantic features like "is a human," "is a female," "is a part of parents," etc. These semantic feature vectors are activated within the semantic state map, shown at the right side of the S-MAP in **Figure 1**. During learning words are ordered within the S-MAP with respect to the semantic features defining each word (Kröger and Cao, 2015). Neural representations of feature vectors can be activated at the level of the semantic state map and lead to an activation of a neuron, representing that word within the S-MAP, and vice versa.

The semantic state map together with the S-MAP and the phonemic state map form the mental lexicon. The phonemic state map comprises phonemic representations of syllables and words and emerges during speech acquisition. Semantic and phonemic state maps are part of short term memory and their neural activation patterns change from word activation to activation of the next word and so on while the S-MAP is part of long term memory and its model neurons directly represent words (ibid.). In our approach the phonemic state map is not directly linked to the S-MAP because only early phases of speech acquisition are modeled here. A neural connection with the S-MAP is formed later if the phonological representation or phonological awareness is developed. This process follows the processes described in this modeling study.

The sensorimotor part comprises the action repository or speech action repository in the context of our neural model and a feedforward-feedback loop for realizing the articulatory execution of motor plans (motor actions) and later on for the self-perception of somatosensory and auditory information generated by the model. A second GSOM, called phonetic map or phonetic feature map (P-MAP) is the central map within this speech action repository. The growth process of this neural map, like the growth process of the S-MAP, takes place during learning. During that growth process of the P-MAP an ordering of syllables occurs within this P-MAP, which is based on the auditory, somatosensory, and motor information. This information is temporarily activated at the level of the motor state, auditory state and somatosensory state map for a syllable if the syllable is planned and executed. The state maps are part of the short term memory and neural activation within these maps changes from syllable to syllable during speech production. The P-MAP itself is part of long term memory and each model neuron within this neural map represents a frequent and learned syllable of the target language like each neuron within the S-MAP represents a frequent and learned word. The P-MAP can be interpreted as a hypermodal feature map because the ordering of syllables occurring in this map is based on auditory, somatosensory as well as on motor information.

After syllable activation at the P-MAP level the feedforward processing of syllabic motor plans results in articulatory movements of vocal tract model articulators (vocal tract model, see Birkholz and Kröger, 2006; Birkholz et al., 2007) and the articulatory-acoustic part of this model generates (i) an acoustic speech signal and (ii) somatosensory signals (tactile and proprioceptive signals) which are processed by the feedback processing pathway (self-perception in **Figure 1**). The neuromuscular programming and execution is modeled in our approach by introducing control variables for model articulators. The time course of these control variables can be interpreted as model articulator movement trajectories and these variables are directly generated and controlled by vocal tract actions (Kröger and Birkholz, 2007). The feedback processing of the acoustic and articulatory signals leads to auditory and somatosensory syllable representations which activate the external auditory and somatosensory state maps and which can be compared to the already learned internal auditory and somatosensory representations for that syllable, stored in the neural associations between internal state maps and P-MAP.

#### Neural Representation of Auditory and Somatosensory States

The auditory representation activated within the auditory state map can be interpreted as a neural version of a bark-scaled spectrogram (**Figure 2D**). This representation of a syllable is calculated from the acoustic signal (oscillogram, see **Figure 2C**).

Each of the 24 rows of this two dimensional neural representation codes the acoustic energy within the frequency range of one bark region and each column represents a time interval of 10 ms (Cao et al., 2014). The degree of neural excitation within a frequency-time-slot is proportional to the acoustic energy within this slot. In the case of the syllable [po] displayed in **Figure 2**, a short and low level acoustic noise occurs at the beginning of lip closure at 0.35 s. A strong noise burst from 0.44 to 0.53 s appears after release of lip closure followed by a clearly visible vowel portion from 0.53 to 0.59 s with an initial formant transition, i.e., an initial increase in the frequency of F1 and F2 from 0.53 to 0.56 s.
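As a rough illustration of such a representation, the following sketch computes a 24-row matrix with one column per 10 ms from a sampled signal. It is not the procedure of Cao et al. (2014): the band edges (the conventional critical-band edges in Hz), the Hann window, the 32 kHz sampling rate used in the example, and the normalization to the peak value are all assumptions made for this sketch, and the function name `auditory_state` is hypothetical.

```python
import numpy as np

# Conventional critical-band (bark) edges in Hz; 25 edges delimit 24 bands (assumed here).
BARK_EDGES_HZ = np.array([0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                          1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                          6400, 7700, 9500, 12000, 15500], dtype=float)

def auditory_state(signal, sample_rate, frame_s=0.010):
    """Return a (24, n_frames) matrix of band energies, scaled to the range [0, 1]."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(frame_s * sample_rate))
    n_frames = len(signal) // frame_len
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    window = np.hanning(frame_len)
    state = np.zeros((24, n_frames))
    for t in range(n_frames):
        frame = signal[t * frame_len:(t + 1) * frame_len]
        power = np.abs(np.fft.rfft(frame * window)) ** 2
        for b in range(24):
            in_band = (freqs >= BARK_EDGES_HZ[b]) & (freqs < BARK_EDGES_HZ[b + 1])
            state[b, t] = power[in_band].sum()
    peak = state.max()
    # Excitation is taken as proportional to energy; dividing by the peak is an assumption.
    return state / peak if peak > 0 else state

# Example: 100 ms of a 1 kHz tone sampled at 32 kHz gives 10 columns of 10 ms each.
sr = 32000
tone = np.sin(2 * np.pi * 1000 * np.arange(3200) / sr)
tone_state = auditory_state(tone, sr)
```

In this example the energy concentrates in the row covering 920 to 1080 Hz, which is the band that contains the tone.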

The somatosensory data (**Figure 2A**) reflect the normalized distance between articulators (e.g., lower and upper lips) or between articulator and vocal tract wall (e.g., tongue tip with alveolar ridge or tongue dorsum with hard palate) for lips, tongue tip, and tongue dorsum. A value of zero reflects contact while a value of one reflects a far distance (e.g., wide mouth opening or low tongue position). In the case of the jaw the range between value one and value zero represents the range for low to high jaw position. The neural representation of these somatosensory data (**Figure 2B**) represents these distances. A small distance (i.e., articulator contact or high articulator position) is now represented by high neural activation (black), while a far distance is represented by low neural activation (white). Thus this neural information can be interpreted as somatosensory (i.e., tactile and proprioceptive), because it reflects articulatory contact as well as the positioning of articulators.
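The inversion from distance to activation described above can be sketched as follows. The linear mapping (activation = 1 − distance) is an assumption made for illustration; the text only states that small distances yield high activation and far distances low activation. The function name and the sample trajectory are hypothetical.

```python
import numpy as np

def somatosensory_state(distances):
    """Map normalized distances (0 = contact, 1 = far) to activations (1 = contact)."""
    d = np.clip(np.asarray(distances, dtype=float), 0.0, 1.0)
    return 1.0 - d  # assumed linear inversion

# Hypothetical lip-distance trajectory, one value per time step:
# the lips close, stay in contact for two steps, and open again.
lip_distance = [0.6, 0.2, 0.0, 0.0, 0.3, 0.8]
lip_activation = somatosensory_state(lip_distance)
```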

In the case of our sample syllable [po] we can clearly identify the time interval of labial closure from 0.35 to 0.43 s, an ascending movement of the tongue dorsum toward the [o]-target during this time interval, an ascending-descending movement of the jaw during this time interval in order to support the labial closure first and then to support the increasing oral front cavity for [o]. In addition we can clearly identify a descending movement of the tongue tip for the same reason, because the front part of the tongue must descend to effect the huge oral vocalic front cavity for [o] while the middle and back part of the tongue – i.e., the tongue dorsum – is involved in forming a vocalic constriction in the velar region of the vocal tract and thus increases in height.

# The Working Modes of the Model

The three working modes of the model are (i) learning during early phases of speech acquisition (babbling and imitation), (ii) production, and (iii) perception. In this paper we focus mainly on learning but learning needs the functionality of production as well as of perception. All working modes are currently limited in our model to the processing of monosyllables. That means that all words learned by the current model are monosyllabic.

#### Production

A concept of a word is represented by a model neuron within the S-MAP (**Figure 1**). This neuron is activated from a pattern of already activated semantic features at the semantic state map using a winner-takes-all procedure (Kohonen, 2001). Due to the S-MAP to P-MAP neural association this leads to the activation of a model neuron within the P-MAP and subsequently leads to an activation of a motor plan state followed by the generation of an articulation movement pattern and by the generation of an acoustic and articulatory speech signal (**Figure 1**). These acoustic and articulatory signals lead to an activation pattern at the level of the external auditory and somatosensory state maps via the self-perception feedback channels and the activation patterns of these external state maps can be compared with the internal auditory and somatosensory syllable representations which were activated from the P-MAP associations with the internal state maps (**Figure 1**) in order to guarantee a correct production of the syllable.

#### Perception

An auditory state representation is activated by an external speaker (e.g., caretaker, **Figure 1**) leading to a most activated winner-takes-all neuron at the P-MAP level. This results from the neural associations between external auditory state map and P-MAP (arrow from external auditory state map to P-MAP in **Figure 1**). Subsequently this leads to the activation of a winner-takes-all model neuron within the S-MAP via P-MAP-to-S-MAP association (arrow from P-MAP to S-MAP in **Figure 1**) and thus leads to the selection of a target concept at the level of the mental lexicon which then is activated in the semantic state map.
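The two pathways (production from S-MAP to P-MAP, perception from P-MAP to S-MAP) can be sketched with a winner-takes-all selection followed by a lookup in an association matrix. This is a toy sketch under several assumptions: the winner is taken to be the node with the smallest Euclidean distance to the state vector, the association is a plain weight matrix, and all names and numbers are hypothetical rather than taken from the model.

```python
import numpy as np

def winner(map_weights, state):
    """Winner-takes-all: index of the node whose weight vector is closest to the state."""
    return int(np.argmin(np.linalg.norm(map_weights - state, axis=1)))

def produce(semantic_vec, smap_weights, assoc):
    """Semantic state -> S-MAP winner -> most strongly associated P-MAP node."""
    s_node = winner(smap_weights, semantic_vec)
    return s_node, int(np.argmax(assoc[s_node]))

def perceive(auditory_vec, pmap_weights, assoc):
    """Auditory state -> P-MAP winner -> most strongly associated S-MAP node."""
    p_node = winner(pmap_weights, auditory_vec)
    return p_node, int(np.argmax(assoc[:, p_node]))

# Toy maps: 2 S-MAP nodes (3 semantic features), 3 P-MAP nodes (2 auditory features).
smap_weights = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
pmap_weights = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]])
assoc = np.array([[0.9, 0.1, 0.0],   # rows: S-MAP nodes, columns: P-MAP nodes
                  [0.0, 0.2, 0.7]])

produced = produce(np.array([0.0, 1.0, 1.0]), smap_weights, assoc)
perceived = perceive(np.array([0.2, 0.8]), pmap_weights, assoc)
```

Using one matrix read along rows for production and along columns for perception mirrors the bidirectional association in a compact way; the model itself may store the two directions differently.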

#### Learning

(i) Babbling starts with the activation of proto-vocalic, proto-CV and proto-CCV motor plans at the level of the motor plan state map within the action repository part of our model (**Figure 1**). "Proto-" means that these items are not language-specific but just raw or coarse realizations of vocalic, CV, and CCV syllables. If these articulatory movement patterns are executed via the feedforward and feedback route, neural activations occur not just within the motor state map but also in the external auditory state as well as in the external somatosensory state map. These three state representations or activations for each vocalic or syllabic item now form the input to the self-organizing phonetic feature map (P-MAP) for learning. Thus the phonetic feature map (P-MAP) is exposed to a set of sensorimotor learning items, i.e., to a set of syllables including motor states, auditory states as well as somatosensory states for each training item (Kröger et al., 2009). As a result, motor, auditory and somatosensory states are associated with each other for vowels and syllables. When this neural associative learning procedure is completed, auditory stimuli can be imitated because an auditory-to-motor state association has been learned now during babbling. Thus, the model can now generate an initial motor state if an auditory state is given.

(ii) Imitation starts with an auditory input generated externally (e.g., from a caretaker during learner-caretaker interaction, **Figure 1**). This auditory input, e.g., the word "ball," leads to the activation of a winner-takes-all neuron at the P-MAP level. In parallel a winner-takes-all model neuron is activated at the S-MAP level on the basis of the same learner-caretaker interaction which is directed for example to the visible object "ball" via activation of the semantic feature vector of "ball" within the semantic state map (**Figure 1**). These parallel activations at S-MAP and P-MAP level simulate a learning situation, where a child (the learner) may draw his/her attention as well as the attention of the caretaker to an object (e.g., a ball which can be seen by both communication partners) and where the child now forces the caretaker to produce that word "ball," i.e., to produce an auditory stimulus in parallel to the semantic network stimulation. Thus the concept "ball" is activated at the level of the semantic state network within the mental lexicon and the auditory representation of the same word is activated at the level of the external auditory state network within the action repository (**Figure 1**).

The resulting imitation learning within this word perception and word production scenario is a complex two stage process. Because each state activation (semantic as well as auditory level) leads to an activation pattern within the appropriate self-organizing map (S-MAP or P-MAP), neural associations are adapted between the semantic state map and the S-MAP at the level of the mental lexicon as well as between the auditory, somatosensory or motor state map, and the P-MAP at the level of the action repository. This leads to a modification of the ordering of syllables within the P-MAP. In the case of the mental lexicon this first stage process leads to an ordering of concepts within the S-MAP with respect to different semantic categories (cf. Kröger and Cao, 2015).
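The first-stage adaptation, in which a state activation reshapes the ordering within a self-organizing map, can be illustrated with one update step of a standard fixed-size Kohonen map. This is not the growing map (GSOM) used in the model, whose growth and checking procedures are described in Cao et al. (2014); the learning rate, the Gaussian neighborhood, the 3 × 3 grid, and the six-dimensional item are assumptions of this sketch.

```python
import numpy as np

def som_step(weights, grid, item, lr=0.1, sigma=1.0):
    """One Kohonen update: move the winner and its grid neighbors toward the item."""
    bmu = int(np.argmin(np.linalg.norm(weights - item, axis=1)))
    grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
    influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))  # equals 1 at the winner
    weights += lr * influence[:, None] * (item - weights)  # in-place update
    return bmu

rng = np.random.default_rng(0)
grid = np.array([(r, c) for r in range(3) for c in range(3)], dtype=float)
weights = rng.random((9, 6))

# Hypothetical training item: motor (2), auditory (2), and somatosensory (2) values
# concatenated into one vector, mimicking the hypermodal input of the P-MAP.
item = np.concatenate([[0.2, 0.8], [0.5, 0.1], [1.0, 0.0]])

before = np.linalg.norm(weights - item, axis=1)
bmu = som_step(weights, grid, item)
after = np.linalg.norm(weights - item, axis=1)
```

Every node moves toward the item, the winner most strongly, which is what produces the similarity-based ordering after many such steps.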

The second stage of the imitation learning process leads to an association between S-MAP and P-MAP nodes which results from the temporally co-occurring S-MAP and P-MAP activation resulting from learning scenarios as exemplified above for the word "ball." Later on during speech production the activation of an S-MAP node leads to an activation of a P-MAP node and vice versa in the case of speech perception (see **Figure 3**). Or in other words, imitation training leads to an association of phonetic forms (in the case of this study: V, CV, or CCV syllables) with meaning (in the case of this study: monosyllabic words). Due to the changes occurring within S-MAP and P-MAP as a result of the first stage of the imitation learning process a further adaptation or modification occurs for the neural associations between S-MAP and P-MAP in order not to change the already established correct associations between semantic and phonological forms (Cao et al., 2014 and see **Appendix A** in this paper).

[**Figure 3** caption, beginning not recovered] … at the P-MAP level. Light blue and light violet regions indicate the neurons or nodes representing phonetic realizations of two different concepts within the P-MAP and, in addition, the neurons representing the concepts themselves within the S-MAP. Lines between neurons of the S-MAP and the P-MAP indicate examples of strong associations between neurons or nodes.
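The idea that temporally co-occurring S-MAP and P-MAP activations build up an association can be sketched with a simple Hebbian-style update. This is an illustrative stand-in, not the interconnection adaptation of Cao et al. (2014) or **Appendix A**; the update rule, the learning rate, and the co-activation episodes are assumptions.

```python
import numpy as np

def associate(assoc, s_node, p_node, lr=0.1):
    """Strengthen the link between co-active S-MAP and P-MAP nodes (bounded by 1)."""
    assoc[s_node, p_node] += lr * (1.0 - assoc[s_node, p_node])
    return assoc

assoc = np.zeros((2, 4))  # 2 S-MAP nodes x 4 P-MAP nodes, no associations yet

# Hypothetical co-activation episodes: (word node, syllable node).
episodes = [(0, 1), (0, 1), (0, 2), (1, 3), (1, 3), (1, 3)]
for s_node, p_node in episodes:
    associate(assoc, s_node, p_node)

# After training, the strongest link per word node gives the production direction,
# and the strongest link per syllable node gives the perception direction.
best_syllable_for_word0 = int(np.argmax(assoc[0]))
best_word_for_syllable3 = int(np.argmax(assoc[:, 3]))
```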

As a result of imitation learning a bidirectional S-MAP to P-MAP association is established and it can be clearly seen, via this association, whether two syllables are phonetic representations of the same word or of different words. This implies that a phonetic difference between two syllables can be interpreted as a phonological contrast if the associated words are (i.e., if the meaning of the two syllables is) different. Rare cases like words conveying two meanings (e.g., "bank" of a river or "bank" as a financial institution) are not modeled in our approach because our approach is tested on the basis of a very limited model language. But because it can be assumed that the child learns one of the two word meanings first, while it learns the second meaning later, such rare cases lead to no complications from the phonological viewpoint of separating phonetic differences, because during the early learning process of phonetic separation of words only one word meaning is activated.

It has been shown by Kröger and Cao (2015) and it will be shown in this study that syllables are ordered with respect to phonetic similarity at the P-MAP level which is a typical feature of neural self-organization (Cao et al., 2014). Therefore neighboring syllables within the P-MAP in many cases only differ with respect to one segment and for this segment often only with respect to one phonetic-phonological feature. Thus within the P-MAP space we define the space occurring between syllables representing different meanings together with differences in specific segmental features of one segment as "phoneme boundaries" which is used here as an abbreviation for "boundary indicating a difference of at least one distinctive feature."
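One way to make the notion of a "phoneme boundary" concrete is to list pairs of neighboring P-MAP nodes that are associated with different words. The sketch below assumes a rectangular grid with 4-neighborhood and a dictionary from node position to associated word; both the data structure and the toy layout are hypothetical.

```python
def phoneme_boundaries(word_at):
    """Return pairs of 4-neighboring occupied nodes that represent different words."""
    boundaries = []
    for (r, c), word in sorted(word_at.items()):
        # Looking only down and right visits each neighboring pair exactly once.
        for neighbor in ((r + 1, c), (r, c + 1)):
            other = word_at.get(neighbor)
            if other is not None and other != word:
                boundaries.append(((r, c), neighbor))
    return boundaries

# Toy map: node position -> associated word; position (1, 1) is unoccupied.
word_at = {(0, 0): "pa", (0, 1): "pa", (0, 2): "ba", (1, 0): "pa", (1, 2): "ba"}
boundaries = phoneme_boundaries(word_at)
```

In this toy layout the only boundary runs between the /pa/ and /ba/ regions, i.e., between syllables that differ in a single feature of the initial consonant.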

As an example, at the level of the P-MAP syllables may be ordered with respect to phonetic features like vowel quality, i.e., vocalic phonetic features like high-low and front-back (Kröger and Cao, 2015). Thus a direction within the P-MAP may reflect the phonetic feature transition from high to low or from front to back vowels because a phoneme boundary concerning this feature occurs here (see **Figure 3**). It should be stated here that at the current state of the model the associations between S-MAP and P-MAP nodes define the word to syllable relation. This association does not affect the ordering of syllable items at P-MAP level (at phonetic level). All implicit syllable representations occurring within one "word region" at the level of the P-MAP, i.e., all syllable representations within the P-MAP representing one concept at S-MAP level, can be interpreted as phonetic realizations of syllables belonging to the same phonemic representation (see light blue and light violet regions in P-MAP in **Figure 3**). Thus, within the P-MAP we can find an ordering of phonetic syllable relations. Moreover we can find here boundaries for the separation of syllable realizations conveying different meanings. From this ordering and from the appearance of boundaries together with an already existing (intuitive) knowledge concerning syllable structure – including subsyllabic constituents like consonants and vowels – it is possible to extract phonological knowledge like "two neighboring P-MAP items conveying different meanings just differ in the first consonant of the syllable onset" or "this first consonant differs only in place or manner of articulation" or "two neighboring P-MAP items conveying different meanings just differ in the vowel" and so on. This knowledge provides the basis to learn the phoneme repertoire, language-specific syllable structure rules, and the overall set of consonantal and vocalic distinctive features of the target language. In future versions of our model this knowledge will be saved within the phonemic state map (**Figure 1**). Thus the phonemic state map contains all target language phonological representations on syllable and segment level while the P-MAP only displays an ordering of phonetic realizations with respect to phonetic similarity from which phonological distinctions can be uncovered.

# Training Stimuli

The set of training stimuli consists of three realizations of 70 syllables, spoken by a 26 year old female speaker of Standard German (Cao et al., 2014; Kröger and Cao, 2015). These 70 syllables included five V-syllables (/i/, /e/, /a/, /o/, /u/), 5 × 9 CV-syllables combining each vowel with nine different consonants (/b/, /d/, /g/, /p/, /t/, /k/, /m/, /n/, and /l/) and 5 × 4 CCV-syllables combining each vowel with four initial consonant clusters (CC = /bl/, /gl/, /pl/, and /kl/). Thus, these 70 syllables (e.g., /na/) form a symmetrically shaped subset of syllables occurring in Standard German. This corpus was labeled as "model language," because each syllable was associated with a word (e.g., {na}), i.e., with a set of semantic features (Kröger and Cao, 2015). The total number of semantic features was 361 in case of these 70 different words. The semantic processing for semantic feature selection for each word was done manually by two native speakers of Standard German (for details see **Appendix Table A2** in Kröger and Cao, 2015). The chosen 70 words were the most frequent words occurring in a children's word data base (Kröger et al., 2011a).
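The composition of the syllable set can be checked by enumeration: 5 V-syllables, 5 × 9 = 45 CV-syllables, and 5 × 4 = 20 CCV-syllables give 70 syllables, and three realizations each give 210 training items. The sketch below uses plain letter strings as stand-ins for the phonemic transcriptions; the realization indices 1 to 3 are a labeling assumption.

```python
from itertools import product

VOWELS = ["i", "e", "a", "o", "u"]
CONSONANTS = ["b", "d", "g", "p", "t", "k", "m", "n", "l"]
CLUSTERS = ["bl", "gl", "pl", "kl"]

v_syllables = list(VOWELS)
cv_syllables = [c + v for c, v in product(CONSONANTS, VOWELS)]
ccv_syllables = [cc + v for cc, v in product(CLUSTERS, VOWELS)]
syllables = v_syllables + cv_syllables + ccv_syllables

# Three realizations per syllable, labeled 1..3 (labels assumed for illustration).
items = [(syllable, k) for syllable in syllables for k in (1, 2, 3)]
```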

Each of the three acoustic realizations per syllable (word) was resynthesized using the procedure described by Bauer et al. (2009). The articulatory resynthesis procedure allowed a detailed fitting of the timing given in the acoustic signal to articulator movement on- and offsets as well as to sound target on- and offset (e.g., begin and end of closure in case of a plosive or nasal). Thus the articulatory resynthesis copied acoustic timing errors to articulation. Places of articulation, i.e., articulatory target positions, were adapted with respect to the acoustic signal by manual fitting. In the cases of the acoustic stimuli used here places of articulation were always pronounced correctly by the speaker and thus the standard places of articulation as defined in the articulatory model for Standard German were used. This leads to a stimulus set of 210 items, each comprising a natural and a synthetic acoustic realization and a motor plan representation, stemming from the resynthesis process. The somatosensory representation was calculated from the movements of the model articulators of the vocal tract model for each of the 210 resynthesized syllable realizations. Two lip points, two tongue points and one point of the jaw were selected and tracked within the midsagittal plane of the vocal tract (**Figure 4**). These points were tracked during execution of the resynthesized syllable items in order to get the articulator point trajectory information (cf. **Figure 2A**) from which the neural somatosensory state representation can be calculated for each of the 210 items.

# Training Procedure


An initial training cycle (training cycle 0) is executed in order to establish the initial GSOMs at the lexical and at the action repository level, i.e., the S-MAP and the P-MAP, as well as to do an initial adjustment for the link weights of the bidirectional neural mapping (associative interconnection) between S-MAP and P-MAP (Cao et al., 2014). Subsequently, fifty further training cycles were executed. Within the first 10 training cycles a GSOM adaptation training for both maps (P-MAP and S-MAP) is followed by an interconnection adaptation training for adjusting the associative interconnection network between both GSOMs and is followed by a GSOM checking process which is executed during each training step (see **Appendix Table A1**). This training phase can be labeled as babbling phase because the P-MAP and S-MAP are trained here in isolation and only a very preliminary first associative interconnection network arises. Within the further 40 training cycles an interconnection checking process is performed in addition at the end of each training cycle, which helps to establish an associative interconnection network between both GSOMs. This training phase can be labeled as imitation phase. Within each training cycle each of the 210 items is activated 7 times (Cao et al., 2014), leading to 1470 training steps and thus 1470 adjustments of each link weight per training cycle. Besides the GSOM adaptation trainings and the interconnection adaptation trainings mentioned above, additional GSOM adaptation trainings as well as additional interconnection adaptation trainings occur if this is demanded by the interconnection checking process done at the end of each training cycle. Thus a lower level GSOM checking process occurs after each training step and a higher level interconnection checking process occurs after each training cycle beginning with training cycle 11 (for details see **Appendix Table A1**).
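The arithmetic and phase structure of this schedule can be summarized in a short sketch: 210 items × 7 activations give 1470 training steps per cycle, cycles 1 to 10 form the babbling phase, and cycles 11 to 50 form the imitation phase with an additional interconnection check at the end of each cycle. The function names and the dictionary layout are hypothetical; the actual adaptation and checking procedures (see **Appendix Table A1**) are not reproduced here.

```python
N_ITEMS = 210          # 70 syllables x 3 realizations
ACTIVATIONS = 7        # activations of each item per training cycle
STEPS_PER_CYCLE = N_ITEMS * ACTIVATIONS

def phase(cycle):
    """Name of the training phase for a given training cycle (0..50)."""
    if cycle == 0:
        return "initialization"
    if 1 <= cycle <= 10:
        return "babbling"
    if 11 <= cycle <= 50:
        return "imitation"
    raise ValueError("training cycle must be in 0..50")

def checks(cycle):
    """Which checking processes run in training cycles 1..50."""
    if not 1 <= cycle <= 50:
        raise ValueError("checks are only sketched for training cycles 1..50")
    return {
        "gsom_check_after_each_step": True,
        "interconnection_check_at_cycle_end": cycle >= 11,
    }

imitation_cycles = [c for c in range(1, 51) if phase(c) == "imitation"]
total_steps = STEPS_PER_CYCLE * 50  # steps over the fifty cycles after cycle 0
```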

In total thirty trainings with 50 training cycles each were simulated in order to end up with 30 instances of the trained model. Ten trainings were done using auditory information only, ten trainings were done using somatosensory information only, and ten trainings used auditory and somatosensory information as input information for the self-organization of the P-MAP. Auditory information was taken from the natural items while the somatosensory information was taken from the resynthesized items, because no natural somatosensory data were available. Thus "auditory only trainings" and "auditory plus somatosensory trainings" can be separated in our study. Auditory trainings can be interpreted as purely passive trainings only using semantic plus auditory information while auditory plus somatosensory trainings in addition use information which stems from active articulation of the model during imitation. These latter active trainings use information gathered from the resynthesized vocal tract movements (imitation movements).

# RESULTS

# Evaluation of Number of Clear, Unclear, and Occupied Nodes at P-MAP Level

In order to evaluate the increase in correct performance of speech perception and speech production as a function of increase in training cycles, three measures were taken, (i) the number of unclear nodes at P-MAP level (blue lines in **Figure 5**), (ii) the number of clear nodes with non-separated training items at P-MAP level (yellow lines in **Figure 5**), and (iii) the number of occupied nodes at P-MAP level (red lines in **Figure 5**). The terms "unclear node," "clear nodes with non-separated training items" and "occupied nodes" are defined below in this section.

An unclear node at P-MAP level (blue lines in **Figure 5**) is a node which represents at least two training items belonging to two different syllables or words. Thus, an unclear node may lead to a failure in speech processing (perception or production) for these words, because they may be confused in speech perception as well as in speech production. In the case of more than 25 training cycles we found that the number N of unclear nodes leads to about 2 × N different words which may be confused in production or perception, because after this number of training cycles the network is already differentiated and any unclear nodes do not represent more than two syllables or words.

In the case of auditory plus somatosensory training we get a mean value of N = 5 after 50 training cycles (**Figure 5**, dark lines), leading to a maximum of 10 of 70 words which could be confused in production or perception. In the case of auditory only training (**Figure 5**, light lines) we get N = 7, leading to 14 syllables or words which potentially could be confused in production or perception after 50 training cycles.

A clear node exhibiting non-separated training items at P-MAP level (yellow lines in **Figure 5**) is a node that represents at least two training items, all of which belong to the same syllable or word. In self-organizing networks it is desired that a node at P-MAP level represents a set of similar (phonetic) realizations of a syllable or word. This is called "generalization" and means that the network does not learn specific idiosyncratic differences of items representing one category (here: idiosyncratic differences of the phonetic realizations of a word) but generalizes toward the important (phonetic) features of an item in order to be able to differentiate items representing different words. Thus, the inverse of this measure (clear nodes representing more than one realization of the same syllable or word) represents the degree of overlearning. We can see that the number of nodes of this kind is low and thus the degree of overlearning is high, which may result from the fact that we train only three phonetic items per syllable or word and thus are capable of learning specific features of each item because of the small number of training items per word. Thus, both of these facts, i.e., the low number of items and the close grouping of items at P-MAP level, justify the overlearning occurring in our simulations.

But – as can be seen from **Figures 6**–**9** – in most cases the nodes representing the same syllable or word are grouped closely together within the two-dimensional P-MAP. That means that learning leads to clear phoneme regions. These phoneme regions are not shown in **Figures 6**–**9** because each of them includes at most three P-MAP nodes. The boundaries shown in **Figures 6**–**9** are defined with respect to a specific phonetic-phonological feature contrast (distinctive feature contrast), and the regions they enclose thus include more than one syllable or word. In the following these regions will be called "feature regions."

In the case of auditory plus somatosensory training the degree of overlearning is lower in comparison to auditory only training (higher number of clear nodes with non-separated training items in the case of auditory plus somatosensory training: 20 nodes vs. 15 nodes in case of auditory plus somatosensory vs. auditory only training at training cycle 50). This indicates that the diversity of auditory only items is higher than of items including auditory and somatosensory information. This may result from the fact that somatosensory information is more useful for separating different places of articulation than auditory information. The use of somatosensory plus auditory information for example clearly separates different places of articulation with respect to labial, apical, and dorsal.

The number of occupied nodes at P-MAP level (red lines in **Figure 5**) is the sum of all nodes representing one or more training items (i.e., syllables). This number should be near the total number of training items if all training items are sufficiently learned and if in addition overlearning is strong and if in addition only few P-MAP nodes are unclear nodes. This is the case for both training modes. The number of occupied nodes is about 205 in the case of the auditory only training mode and about 203 in the case of auditory and somatosensory training mode after 50 training cycles. The lower number of occupied nodes in the second case may reflect the fact of a lower degree of overlearning in the case of auditory plus somatosensory training. This effect is significant (Wilcoxon rank sum test, two sided, p < 0.05) for most training cycles (see **Appendix B**).
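The three node measures defined in this section can be computed from a mapping of training items to P-MAP nodes. The sketch below assumes that each training item is represented by exactly one node and that items are labeled as (syllable, realization) pairs; the toy mapping is hypothetical and much smaller than the 210-item training set.

```python
from collections import defaultdict

def classify_nodes(node_of_item):
    """Count occupied, unclear, and clear non-separated nodes.

    node_of_item maps (syllable, realization) -> id of the representing P-MAP node.
    """
    syllables_at = defaultdict(list)
    for (syllable, _realization), node in node_of_item.items():
        syllables_at[node].append(syllable)
    # Unclear: items of at least two different syllables share the node.
    unclear = [n for n, s in syllables_at.items() if len(set(s)) >= 2]
    # Clear, non-separated: at least two items, all of the same syllable.
    clear_non_separated = [n for n, s in syllables_at.items()
                           if len(s) >= 2 and len(set(s)) == 1]
    return {
        "occupied": len(syllables_at),
        "unclear": len(unclear),
        "clear_non_separated": len(clear_non_separated),
    }

# Toy mapping: node 0 holds two realizations of "pa", node 2 mixes "ba" and "ta".
node_of_item = {
    ("pa", 1): 0, ("pa", 2): 0, ("pa", 3): 1,
    ("ba", 1): 2, ("ta", 1): 2, ("ba", 2): 3,
}
counts = classify_nodes(node_of_item)
# With at most two syllables per unclear node, about 2 x N words may be confused.
words_at_risk = 2 * counts["unclear"]
```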

Besides the results at the end of training (training cycle 50) which we already stated above, it can be seen from **Figure 5** that training leads to a faster decrease in the number of unclear nodes in the case of auditory plus somatosensory training in comparison to auditory only training. A significantly lower number of unclear nodes in the case of auditory plus somatosensory training compared with auditory only training is found for most training cycles (Wilcoxon rank sum test, two sided, p < 0.05 and see **Appendix B**). During later training cycles the number of unclear nodes further decreases but the difference between the two training modes is no longer significant beyond training cycle 45 (Wilcoxon rank sum test, two sided, p > 0.05 and see **Appendix B**).

In the case of clear nodes representing more than one item of the same syllable (i.e., inverse degree of overlearning, yellow lines) it can be seen that overlearning decreases significantly faster as well in the case of auditory plus somatosensory training in comparison to auditory only training (Wilcoxon rank sum test, two sided, p < 0.05 and see **Appendix B**).
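For readers who want to reproduce this kind of comparison, a two-sided Wilcoxon rank sum test on two groups of ten trainings can be sketched in plain Python. The version below uses the normal approximation without tie or continuity correction, which is an assumption; the statistics software and settings used for **Appendix B** are not stated here. The two samples are made-up numbers for illustration only and are not data from this study.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, no tie correction).

    Returns (z, p) where z is the standardized rank sum of sample x.
    """
    pooled = sorted((value, group) for group, sample in enumerate((x, y))
                    for value in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p value
    return z, p

# Made-up node counts for ten trainings per condition (NOT values from the study).
sample_a = [7, 8, 6, 7, 9, 8, 7, 6, 8, 7]
sample_b = [5, 4, 6, 5, 5, 6, 4, 5, 6, 5]
z_value, p_value = rank_sum_test(sample_a, sample_b)
```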

## Evaluation of Ordering of Syllables at P-MAP Level

**Figures 6**–**9** give a visual depiction of how training items are grouped and ordered by neural self-organization within the P-MAP. Nodes of the P-MAP representing training items are marked by colored dots within the P-MAP while P-MAP nodes which do not represent a training item are indicated by light gray circles. The form and size of the map result from the training process as described in Cao et al. (2014). If new items need to be represented in the map, new nodes are generated and included in the map, thus increasing its size. New nodes are always added at the edge of the map. Thus, the map's form results from the addition of these nodes. The colors in **Figures 6**–**9** represent different phonetic feature values with respect to place of articulation (labial to velar, see **Figures 6**, **7**) and manner of articulation (plosive, nasal, lateral for CV-syllables, and plosive-lateral for the CCV syllables, see **Figures 8**, **9**). The black lines indicate the boundaries of feature regions. It can be seen that the ordering with respect to place of articulation is better in the case of auditory plus somatosensory training (**Figure 7**) in comparison to auditory only training (**Figure 6**) after training is completed (training cycle 50), because the number of feature regions, i.e., the number of regions within the P-MAP with the same "value" for a specific distinctive feature (regions edged by the black lines), is lower in the case of auditory plus somatosensory training in comparison to auditory only training. No such clear difference occurs for manner of articulation (**Figures 8**, **9**).

A further important result which can be directly deduced from a visual inspection of **Figures 6**–**9** is that training items are grouped together for any given syllable. Thus, the three training items representing three realizations of one syllable or word are grouped together within the two-dimensional plane of the P-MAP. See for example the green dots in the upper right region of **Figure 6** for the syllable or word {la} [-> (la1), (la2), (la3)] or the green dots indicating three representations of the syllable or word {na} [-> (na1), (na2), (na3)]. If a realization is missing in a figure, this realization overlaps with another realization of the same syllable or of another syllable.

This spatial grouping together of items of the same syllable or word within the space of the P-MAP indicates that different realizations of the same syllable or word are less different with respect to phonetic detail than realizations of different syllables. Moreover this result explains why overlearning can take place in our corpus and learning scenario: The P-MAP has enough nodes to represent each training item, but nevertheless a kind of generalization occurs because realizations of same syllables are grouped closely together.

Coming back to the display of feature regions, a further main result of this study is that the ordering of items with respect to place of articulation improves in the case of auditory plus somatosensory training in comparison to auditory only training, while no clear conclusion can be drawn from comparing the feature regions for manner of articulation across the two training modes. This is illustrated in **Figures 6**–**9**, which indicate that the number of feature regions within the P-MAP is higher in the case of auditory only training (**Figure 6**) vs. auditory plus somatosensory training (**Figure 7**) for place of articulation.

The number of feature regions is lower for the consonantal feature manner of articulation (**Figure 8**) in comparison to the consonantal feature place of articulation (**Figure 6**) in the case of auditory training only (see also Kröger and Cao, 2015). If we compare the number of feature regions for manner of articulation for auditory plus somatosensory training (**Figure 9**) vs. auditory only training (**Figure 8**), it can be seen that the number of regions does not differ significantly. Thus the addition of somatosensory information to auditory information helps to separate place of articulation but not to separate syllables with respect to manner of articulation at the P-MAP level.

The faster learning (faster decrease in not clearly separated syllables) in case of auditory plus somatosensory learning can be seen by analyzing not just the phonetic feature separation at the P-MAP level after training cycle 50 (as done above: **Figures 6**–**9**) but by analyzing as well this feature separation at earlier training stages. This can be done by counting the number of feature regions for place and manner of articulation after 10 and 20 training cycles in comparison to 50 training cycles (**Table 1**) at P-MAP level. **Figures 6**–**9** illustrate the term "number of feature regions". Here we can find 39 feature regions in **Figure 6**, 19 feature regions in **Figure 7**, 11 feature regions in **Figure 8** and 11 feature regions in **Figure 9**.
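Counting feature regions can be read as counting connected groups of occupied nodes that share one feature value. The sketch below makes that reading explicit under two assumptions that are not stated in the text: nodes lie on a rectangular grid, and two nodes belong to the same region only if they are direct horizontal or vertical neighbors with the same value. The function name `count_feature_regions` and the toy map are hypothetical.

```python
from collections import deque


def count_feature_regions(node_features):
    """Count connected regions of equal feature value on a grid.

    node_features maps (x, y) grid positions of occupied nodes to a
    feature value such as "labial". Connectivity is 4-neighbor (assumption).
    """
    seen = set()
    regions = 0
    for start, value in node_features.items():
        if start in seen:
            continue
        regions += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first flood fill of one region
            x, y = queue.popleft()
            for neighbor in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if (
                    neighbor in node_features
                    and neighbor not in seen
                    and node_features[neighbor] == value
                ):
                    seen.add(neighbor)
                    queue.append(neighbor)
    return regions


# Toy map (hypothetical): the labial node at (0, 2) is cut off from the
# other labial nodes, so it forms a region of its own.
toy_map = {
    (0, 0): "labial", (1, 0): "labial", (2, 0): "velar",
    (0, 1): "alveolar", (1, 1): "labial", (2, 1): "velar",
    (0, 2): "labial",
}
n_regions = count_feature_regions(toy_map)
```

A well-ordered map yields few regions per feature value, whereas a fragmented map yields many, which is the sense in which a lower count indicates better ordering.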

TABLE 1 | Number of feature regions (mean value and standard deviation) for manner and place of articulation as function of number of training cycles (10, 20, and 50) for auditory only training (a) and for auditory plus somatosensory training (a+s).


Each training mode has been executed 10 times (i.e., 10 trainings per training mode).

**Table 1** clearly indicates that already at training cycle 10 the number of feature regions is significantly lower for place of articulation in the case of auditory plus somatosensory training (Wilcoxon rank sum test, two sided, p < 0.001) in comparison to auditory only training, while no such effect is found for the feature manner of articulation (Wilcoxon rank sum test, two sided, p > 0.05 except for training cycle 50, where p = 0.011).
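For readers unfamiliar with the test used here, the following is a minimal sketch of a two-sided Wilcoxon rank sum test using the normal approximation with a tie correction. It is not the authors' analysis code: the text does not say which software or which variant (exact or approximate) was used, and the region counts below are invented for illustration, with 10 values per training mode to mirror the 10 trainings per mode.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_test(sample_a, sample_b):
    """Two-sided rank sum test (normal approximation, tie-corrected)."""
    pooled = list(sample_a) + list(sample_b)
    ranks = average_ranks(pooled)
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    w = sum(ranks[:n_a])  # rank sum of the first sample
    mean_w = n_a * (n + 1) / 2
    tie_counts = {}
    for value in pooled:
        tie_counts[value] = tie_counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values()) / (n * (n - 1))
    var_w = n_a * n_b / 12 * ((n + 1) - tie_term)
    z = (w - mean_w) / math.sqrt(var_w)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# Invented region counts, NOT the values underlying Table 1.
regions_auditory = [39, 41, 37, 40, 38, 42, 36, 43, 35, 44]
regions_auditory_somato = [19, 21, 18, 22, 20, 17, 23, 16, 24, 15]
z, p = rank_sum_test(regions_auditory, regions_auditory_somato)
```

With samples this small an exact test is often preferred; the approximation is used here only to keep the sketch short.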

# DISCUSSION

This study illustrates how the emergence of an action repository can be modeled in a neural large scale model. Two training modes were chosen here, i.e., the "auditory only" and the "auditory and somatosensory" training mode. In the first mode the model is trained by auditory and semantic data while in the second case somatosensory information is added to the auditory information. This somatosensory information stems from the reproduction of syllables by the learner, i.e., by the model itself. From an earlier study using the same training set (Kröger and Cao, 2015) but focusing on auditory only training we know that in the case of this training set including V, CV, and CCV syllables the main feature for ordering syllables within a neural phonetic map is syllable structure (V, CV, and CCV), subsequently followed by the vocalic features high-low and front-back, followed by the feature voiced-voiceless for the initial consonant and then followed by the features manner and place of articulation for the initial consonant or consonant cluster.

In this study we focused our interest on the question of how learning of the features manner and place of articulation can be improved. It can be hypothesized that syllables may be ordered and thus learned more successfully if the feature place of articulation is learned as early and as fast as the feature manner of articulation. In the auditory only training mode the feature place of articulation is learned later. In that case the ordering of the neural self-organizing map is better for manner than for place of articulation (Kröger and Cao, 2015). It can be hypothesized that place of articulation is learned earlier, and as fast as manner of articulation, if training uses not only auditory information but somatosensory information as well. This hypothesis is in line with the Articulatory Organ Hypothesis (Tyler et al., 2014; Best et al., 2016), which stresses the importance of the role of active articulators in production also for perception and thus for speech learning already in the first year of life. Indeed an earlier and faster separation of syllables with respect to place of articulation, and thus an earlier and faster learning of this feature, has been found in this study when auditory and somatosensory information are both available, compared to the case of auditory information only. Because the feature place of articulation emerges later in training based on auditory information only (ibid.), the result of the current study indicates that somatosensory information, i.e., information based on articulatory imitation of syllables, helps to identify and to learn the important feature place of articulation already in early phases of speech acquisition.

Moreover it should be stated that at the end of training a correct performance of speech production and perception resulting from a correct and functionally ordered P-MAP is established as well in the case of auditory only training. Thus it can be hypothesized that somatosensory information may help to clarify which information within the acoustic signal is important in coding place of articulation, and may help to establish the feature place of articulation early in speech acquisition, but a correct performing speech processing model is established as well in the case of auditory only training. This result reflects the fact that place of articulation is sufficiently encoded in the acoustic speech signal mainly by formant transitions (Öhman, 1966) but these transitions are not easy to decode so that somatosensory information is helpful to decode this place information more easily.

Looking at the structure of the phonetic maps (P-MAPs) trained in this study as well as in an earlier study (Kröger and Cao, 2015) it can be stated that syllables are ordered with respect to different phonetic dimensions (features) like high-low, front-back, voiced-voiceless as well as for manner and place of articulation. This finding from our simulation studies has counterparts in natural data stemming from neuroimaging studies (Obleser et al., 2004, 2006; Shestakova et al., 2004; Obleser et al., 2010) as well as from recordings of cortical activity using high-density multielectrode arrays (Mesgarani et al., 2014). The results of these studies show that a spatial separation of activation in cortical regions exists for different groups of speech items if these groups represent different phonetic feature values.

It should be kept in mind that our model on the one hand does not reveal a detailed phonetic-phonological mapping at the segment level. The implicit phonological representation introduced here is based on the associations between P-MAP and S-MAP as well as on the ordering of items within the P-MAP. On the other hand the boundaries shown in **Figures 7**–**9** clearly indicate that boundaries emerge not only between the 70 types of syllables learned in these model simulations but also for different consonantal features occurring in the onset consonant of CV. Moreover, phoneme boundaries can also be found for different vocalic features as well as for different syllable structures like CV vs. CCV. These types of phoneme boundaries are not under discussion in this paper but are already shown as results of model simulations for different vowels in V-, CV-, and CCV-syllables in Kröger and Cao (2015) as well as for different syllable structures like V vs. CV vs. CCV in Kröger et al. (2011b).

Finally it should be stated that our training is based on semantic and sensorimotor phonetic information (auditory and somatosensory information) only. No phonological information is given directly here. The sensorimotor information comprises auditory information as it is generated by the caretaker as well as auditory, motor and somatosensory information generated by the learner itself during the process of word imitation. Thus our simulation approach clearly demonstrates that the emergence of phonetic features results from the ordering of items at the level of the P-MAP and that the emergence of phonological contrast likewise results from this ordering together with information about which syllable is associated with which meaning (or word) generated at the S-MAP level. This latter information is also available at the P-MAP level if a correct neural association between P-MAP and S-MAP results from the learning.

Our model starts with a direct neural association between semantic (or conceptual) and phonetic representations. That is the S-MAP and P-MAP associative interconnection. Other models like the GODIVA model (Bohland et al., 2010) directly start with hypotheses concerning the phonological representation by assuming a phonological planning module. But like in our model Bohland et al. (2010) assume predefined sensorimotor programs or predefined motor plans in terms of our model which are activated after passing the phonological planning phase. In GODIVA a speech sound map is assumed to represent a repository of motor plans of frequently used syllables which is comparable with the information stored in our P-MAP and its neural connection with the motor plan map. Bohland et al. (2010) as well see the syllable as the key unit for speech motor output. Like our P-MAP the speech sound map in GODIVA (ibid.) forms an interface between phonological encoding system (phonological plan and choice cells, ibid.) and the phonetic-articulatory system. But our model does not include a phonological encoding system because at this preliminary state our model is still limited to the production of monosyllables. Moreover sensorimotor programs for frequent syllables can be selected from speech motor map in full (ibid., p. 1509), which is comparable to an activation of a P-MAP node, leading to an activation of a specific motor program within the motor plan state map in our approach.

The concrete GODIVA model describes the temporal succession of phonological planning and motor execution. This is beyond the scope of our approach, which is a purely connectionist model. Time is not an explicit parameter in our model, but it is implicitly part of our model because motor plans as well as auditory and somatosensory states contain the information concerning the temporal succession and temporal overlap of articulatory actions as well as temporal information concerning auditory changes within a whole syllable. Thus our model can be seen as a kind of "pre-model" describing how the knowledge for the speech sound map postulated in Bohland et al. (2010) could be acquired.

The HSFC approach (Hickok, 2012) as well as the SLAM model (Walker and Hickok, 2016) like our approach assume a direct neural connection between lexical modules (lemma level) to a syllable-auditory as well as to a phoneme-somatosensory module. These lower level modules define a hierarchy from lemma via syllable (including auditory feedback) to subsyllabic units like phoneme realizations. It is assumed in this approach that auditory feedback mainly influences syllable units while somatosensory feedback mainly influences segmental units. Like the DIVA and GODIVA model the HSFC approach does not include speech acquisition and thus does not speculate on syllabic or on segmental repositories like we do at least for the syllable level by introducing our P-MAP.

In summary, our neural model and the training scenario introduced here illustrate how a phonetic contrast can become a distinctive and thus phonological contrast during an extended training scenario if a semantic-phonetic stimulus training set is used covering the whole range of phonetic-phonological contrasts occurring in the target language under acquisition. The emergence of phonetic-phonological contrasts here results from the S-MAP to P-MAP association. But this knowledge now generated by learning needs to be generalized in order to develop the notion of different vocalic and consonantal distinctive features. This must be accompanied by already existing phonological knowledge concerning simple syllable structures (e.g., V, CV, CVC, ...) which may already exist at the beginning of babbling and imitation training. Thus, the central vehicle for locating this phonetic-phonological feature information is the neural P-MAP in our current model, which forms a part of the action repository, as well as the neural association occurring between P-MAP and S-MAP, but this information needs to be generalized and implemented in a phonological map which is not part of our current neural model. This may lead to a restructuring of the complex neural association of semantic and phonetic network levels in order to integrate a phonological representation layer.

# CONCLUSION

In this paper it has been illustrated how a neural realization of the action repository could be shaped and implemented in a computer based approach, how this action repository concretely emerges during speech acquisition and how phonetic items are ordered within this realization of an action repository. We were able to show that the occurring ordering of syllables within this realization of the action repository using GSOMs is the basis for a mental representation of phonetic features and that – due to an association between the action repository and the mental lexicon in early states of speech acquisition – first phonetic item clusters emerge which help to unfold the phonological organization of a target language.

It has been shown that a sufficient learning result is reached on the basis of auditory only training. Thus, motor representations leading to a correct imitation of syllables need not necessarily be a part of speech (perception) learning, but the inclusion of imitation and thus the inclusion of production of speech items (e.g., of syllables) may lead to a faster acquisition of important features like place of articulation (cp. Iverson, 2010) in comparison to a passive learning process based only on listening. This result may explain why children with severe speech motor dysfunctions are capable of learning to perceive and understand words like normally developing children (Zuk et al., 2018 for the case of childhood apraxia of speech), while learning correct word production is of course delayed, or perhaps never completed, due to the existing motor dysfunction.

It is now necessary to further develop this neural simulation model of speech processing (production and perception) and speech learning in order to investigate the acquisition not just of a simple model language based on V-, CV-, and CCV-syllables and monosyllabic words but of a more complex real language. Furthermore it is important to extend the model with respect to the learning scenario. In our model, learning items are defined in advance but in reality the child actively shapes learning situations and thus actively shapes the set of training stimuli and especially the number of presentations and the point in time when the child wants to learn a specific word or syllable, for example by turning the attention of the caretaker to a specific object within a communication situation. Thus, besides the caretaker, the child is also able to actively control the learning process.

# REFERENCES


Kohonen, T. (2001). Self-Organizing Maps. Berlin: Springer.

Kohonen, T. (2013). Essentials of the self-organizing map. Neural Netw. 37, 52–65. doi: 10.1016/j.neunet.2012.09.018

# AUTHOR CONTRIBUTIONS

BK, MC, and TB programmed the software code. BK and TB conducted the experimental simulation. All authors designed the study, wrote, and corrected the manuscript.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Kröger, Bafna and Cao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# APPENDIX A


Learning algorithm for the neural network (two stage process). The whole neural network can be described as an interconnected growing self-organizing map (I-GSOM) which comprises two growing self-organizing maps (GSOMs), i.e., a semantic map S-MAP and a phonetic map P-MAP, where each node of one map is linked with each node of the other map and vice versa (associative bidirectional neural linking). The step-by-step update of all neural connection weights (link weights between nodes) during learning is described in detail in Cao et al. (2014). Both GSOMs are trained and thus grow using the same neural learning and thus the same neural principles for defining the link weights between a GSOM and its associated state maps. The associated state map is the semantic state map in the case of the S-MAP and is the auditory, somatosensory, and motor state map in the case of the P-MAP (see **Figure 1**: semantic map beside S-MAP and internal auditory and somatosensory map beside P-MAP).

Learning can be defined as a series of training steps. In each training step a word-syllable stimulus pair is applied to the state maps. First, the node within each GSOM is determined which represents the stimulus best, i.e., which is most similar to the stimulus (winner node). Then the link weights of this node and of nodes in a defined neighborhood of the winner node are modified in the direction of the stimulus, i.e., the link weights between state maps and GSOM are modified in such a way that the winner node is now more similar to the training stimulus than it was before. The degree of approximating the stimulus in one training step is defined by the learning rate of the neural model. This learning is called GSOM adaptation training and is done independently for both GSOMs. It leads to a self-organization of both GSOMs: (i) The nodes representing words are ordered with respect to all semantic features within the S-MAP which are inherently included in the set of word training stimuli. (ii) Syllables are ordered with respect to all phonetic features within the P-MAP which are inherently included in the set of syllable training stimuli. The result of this learning is also called neural self-organization and the associated maps are called self-organizing feature maps (Kohonen, 1982, 2001, 2013).
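A single GSOM adaptation step as described above (find the winner node, then move it and its neighbors toward the stimulus) can be sketched as follows. This is a simplified illustration, not the update rule of Cao et al. (2014): the squared Euclidean distance, the fixed grid neighborhood, the uniform learning rate for all neighbors, and all names (`find_winner`, `adaptation_step`) are assumptions made for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def find_winner(weights, stimulus):
    """Return the grid position of the node most similar to the stimulus."""
    return min(weights, key=lambda node: squared_distance(weights[node], stimulus))


def adaptation_step(weights, stimulus, learning_rate=0.5, radius=1):
    """Move the winner and its grid neighbors toward the stimulus.

    weights maps (x, y) grid positions to weight vectors. Neighbors are
    nodes within `radius` grid steps of the winner (assumption).
    """
    winner = find_winner(weights, stimulus)
    wx, wy = winner
    for (x, y), w in weights.items():
        if abs(x - wx) + abs(y - wy) <= radius:
            weights[(x, y)] = [
                wi + learning_rate * (si - wi) for wi, si in zip(w, stimulus)
            ]
    return winner


# Toy map (hypothetical) with two adjacent nodes and one distant node.
weights = {(0, 0): [0.0, 0.0], (1, 0): [1.0, 1.0], (5, 5): [0.2, 0.1]}
winner = adaptation_step(weights, stimulus=[1.0, 0.8])
```

Here the node at (1, 0) wins, it and its neighbor at (0, 0) move halfway toward the stimulus, and the distant node at (5, 5) is left unchanged. A larger learning rate moves nodes further per step, which corresponds to the "degree of approximating the stimulus" mentioned in the text.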

In order to allow these maps to grow during this learning process, the original algorithm developed by Kohonen (ibid.) has been modified as described by Alahakoon et al. (2000). While the modification of link weights is similar in SOMs and GSOMs, a growth criterion needs to be defined in the case of a GSOM. Therefore each training stimulus is matched with each node of the already existing GSOM, and the error with the best matching neuron within the GSOM is accumulated over successive training steps until a threshold value is reached, indicating that a new node needs to be added to the GSOM in order to allow a better matching of stimuli and GSOM neurons. This growth process occurs together with self-organization of each GSOM and is part of the GSOM adaptation training.
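The growth criterion can be illustrated with a short sketch: the matching error of the best matching node is accumulated, and once it reaches a threshold a new node is added next to that node. This is a deliberately reduced picture of the procedure in Alahakoon et al. (2000). The placement of the new node at the first free grid position, its initialization with the stimulus itself, the reset of the accumulated error, and the names used are all assumptions for illustration.

```python
def growth_step(weights, errors, stimulus, threshold):
    """Accumulate the winner's error and grow the map when it gets too large.

    weights maps (x, y) positions to weight vectors; errors maps positions
    to accumulated error. Returns the position of a newly added node, or
    None if no node was added in this step.
    """
    def error(node):
        return sum((w - s) ** 2 for w, s in zip(weights[node], stimulus))

    winner = min(weights, key=error)
    errors[winner] = errors.get(winner, 0.0) + error(winner)
    if errors[winner] < threshold:
        return None

    x, y = winner
    free = [
        pos
        for pos in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
        if pos not in weights
    ]
    if not free:
        return None  # interior node: this sketch does not redistribute error
    new_node = free[0]
    weights[new_node] = list(stimulus)  # assumption: initialize at the stimulus
    errors[winner] = 0.0
    return new_node


# Toy run (hypothetical): one node that matches the stimulus poorly.
weights = {(0, 0): [0.0, 0.0]}
errors = {}
first = growth_step(weights, errors, [1.0, 0.0], threshold=1.5)
second = growth_step(weights, errors, [1.0, 0.0], threshold=1.5)
```

The first presentation accumulates an error of 1.0 and adds nothing; the second pushes the accumulated error to 2.0, which exceeds the threshold, so a node is added at the free position (1, 0).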

In the babbling phase an adaptation of the P-MAP only is done on the basis of syllable stimuli. During the imitation phase the S-MAP is adapted in parallel. For auditory only training the somatosensory training data are not applied and vice versa for somatosensory alone training no auditory training data are applied. In case of auditory plus somatosensory training the whole set of training data is applied. Because of the similarity of motor and somatosensory training data the training of the P-MAP is done by using auditory and/or somatosensory data only in case of this study.

In addition to the GSOM adaptation training, the associative mapping between both GSOMs needs to be trained, i.e., the associative neural interconnections between both GSOMs need to be developed. This training is called interconnection adaptation training. The link weights of a neural interconnection link are modified (i.e., increased) only if winner-take-all nodes occur simultaneously in both GSOMs for a given stimulus pair (i.e., a word-syllable pair). "Simultaneously" means that a combined word-syllable stimulus is applied to the I-GSOM, leading to specific simultaneous activations of all nodes. The link weights between these two winner neurons are modified in such a way that the interconnection between both winner neurons is strengthened in both directions between both GSOMs. If no winner-take-all neuron occurs for a specific stimulus in one of the GSOMs, this GSOM is not able to identify a node as a good representation for a stimulus. In this case further GSOM adaptation training steps are needed. Whether those interim GSOM adaptation trainings are needed is checked by a GSOM checking process, which is executed in combination with each potential interconnection adaptation training step (see **Appendix Table A1**).
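The interconnection adaptation step can be sketched as a simple associative update: a link is strengthened in both directions only when both maps deliver a winner for the same word-syllable pair. The saturating update rule `w + rate * (1 - w)`, the use of `None` to signal a missing winner, and the node labels are assumptions made for this example and are not taken from Cao et al. (2014).

```python
def strengthen_link(s_to_p, p_to_s, s_winner, p_winner, rate=0.1):
    """Strengthen the S-MAP/P-MAP link between two winner nodes.

    s_to_p and p_to_s map (source node, target node) pairs to link weights.
    Returns False, without changing any weight, if either map has no winner.
    """
    if s_winner is None or p_winner is None:
        return False
    for table, key in (
        (s_to_p, (s_winner, p_winner)),
        (p_to_s, (p_winner, s_winner)),
    ):
        old = table.get(key, 0.0)
        table[key] = old + rate * (1.0 - old)  # weights only ever increase
    return True


# Toy usage with hypothetical node labels.
s_to_p, p_to_s = {}, {}
updated = strengthen_link(s_to_p, p_to_s, "S3", "P17")
skipped = strengthen_link(s_to_p, p_to_s, "S3", None)
```

Because this update can only raise a weight, a wrongly established link cannot be undone by further associative learning alone.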

The GSOM checking process identifies so-called "high-density nodes," i.e., nodes which represent more than one stimulus within the P-MAP or within the S-MAP. In this case a modified GSOM adaptation training will be inserted after the GSOM checking process. The modification is that during the GSOM adaptation training only those stimuli are applied to the neural network which are not resolved thus far. This modified GSOM adaptation process thus represents a process in which the learner is aware that there are still some words and syllables which cannot be produced correctly and thus are not perceived correctly by the caretaker. This modified GSOM adaptation training is called GSOM reinforcement training (see **Appendix Table A1**). The word "reinforcement" is chosen because it is assumed that the caretaker (as well as the child) is aware of this situation and thus concentrates on learning of "difficult" words and syllables. At the end of a GSOM reinforcement training phase a GSOM reviewing training phase is included which, like the normal GSOM adaptation training for each GSOM, again includes all 210 stimulus pairs, i.e., recapitulates all items which were already learned and which are still to be learned. This GSOM reviewing training is important to guarantee that the network does not "overlearn" the difficult words or syllables trained in a GSOM reinforcement training and thus forget the other, earlier learned words or syllables.

TABLE A1 | Organization of the whole training of the I-GSOM neural network.

#### Babbling training

• P-MAP adaptation training on basis of 5 training cycles for the syllable stimulus set (5 × 7 × 210 training steps randomized)

#### Imitation training

  - If GSOM checking process is positive: interconnection adaptation training
  - If GSOM checking process is negative: GSOM reinforcement and GSOM reviewing training (adaptation of P-MAP and of S-MAP for N<sup>u</sup> "unsolved" stimuli; N<sup>u</sup> < 210)
  - If interconnection checking process is positive: return to normal P-MAP and simultaneous S-MAP adaptation training (first two main black bullets of imitation training)
  - If interconnection checking process is negative: add an interconnection link forgetting process before returning to the interconnection checking process

In each training cycle all 210 stimuli are applied 7 times in random order. GSOM adaptation training includes adaptation of link weights between a GSOM and its state maps as well as growth of the GSOM.
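The GSOM checking process and the training-step arithmetic of Appendix Table A1 can be sketched as follows. The sketch takes the description literally: a node is a high-density node if more than one stimulus has it as winner, and the stimuli at such nodes form the unresolved set for reinforcement training. The function name `gsom_check` and the toy assignment are hypothetical.

```python
from collections import defaultdict

STIMULI = 210              # training items per presentation
REPETITIONS_PER_CYCLE = 7  # each item is applied 7 times per cycle
BABBLING_CYCLES = 5

steps_per_cycle = REPETITIONS_PER_CYCLE * STIMULI
babbling_steps = BABBLING_CYCLES * steps_per_cycle  # 5 x 7 x 210


def gsom_check(stimulus_to_winner):
    """Return (check_is_positive, unresolved_stimuli).

    The check is positive when no node is the winner for more than one
    stimulus, i.e., when there are no high-density nodes.
    """
    stimuli_at_node = defaultdict(list)
    for stimulus, node in stimulus_to_winner.items():
        stimuli_at_node[node].append(stimulus)
    unresolved = sorted(
        stimulus
        for stimuli in stimuli_at_node.values()
        if len(stimuli) > 1
        for stimulus in stimuli
    )
    return len(unresolved) == 0, unresolved


# Toy assignment (hypothetical): two items collapse onto node 3.
positive, unresolved = gsom_check({"ba1": 3, "ba2": 3, "da1": 7})
```

In the toy case the check is negative and the two colliding items would be passed on to reinforcement training, followed by a reviewing pass over all stimuli.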

Moreover it may happen that a wrong link has been established within the associative neural interconnection network between both GSOMs. This may happen if a winner node is identified in one of the GSOMs for a specific word or syllable but this winner neuron later during learning turns to represent a different word or syllable. This may happen because the whole learning process is highly dynamic. Thus link weights are allowed to change with respect to learning rate and thus are quite flexible. In order to be able to cope with such situations a further higher level checking process, called interconnection checking process is included in the whole training procedure. This process starts if already 10 main training cycles have been executed in order to guarantee that a preliminary associative interconnection network is already grown between both GSOMs. Normal training is continued if the interconnection checking process allows it (see **Appendix Table A1**). Otherwise, the interconnection checking process demands a change in link weights of the identified wrong associative interconnections towards smaller values. This procedure is called interconnection link forgetting process ("link forgetting procedure" following Cao et al., 2014). This process needs to be introduced explicitly because associative learning as it is used within the interconnection adaptation training can only increase link weights. These interconnection checking processes are applied after each fully completed training cycle starting with training cycle 11 and thus occur 40 times in total in our learning scenario (**Appendix Table A1**).
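The link forgetting process and the count of interconnection checks can be illustrated in a few lines. The multiplicative decay factor is an assumption; the text only states that the weights of wrong links are changed towards smaller values. The names and link labels are hypothetical.

```python
def forget_wrong_links(link_weights, wrong_links, factor=0.5):
    """Shrink the weights of links identified as wrong (factor is assumed)."""
    for link in wrong_links:
        if link in link_weights:
            link_weights[link] *= factor
    return link_weights


# Checks run after each completed cycle from cycle 11 up to cycle 50.
checked_cycles = list(range(11, 51))

# Toy usage: the link from S3 to P40 has been flagged as wrong.
links = {("S3", "P17"): 0.8, ("S3", "P40"): 0.6}
forget_wrong_links(links, [("S3", "P40")])
```

The list of checked cycles has 40 entries, matching the 40 interconnection checks stated in the text.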

# APPENDIX B

Significance levels for difference of median values. This appendix gives the significance levels for the difference of median values of dark vs. light lines in **Figure 5**, i.e., differences between the median values in case of auditory plus somatosensory training (**Figure 5**, dark lines) and the median values in case of auditory only training (**Figure 5**, light lines) for the three measures for nodes listed in **Appendix Table B1**. No correction of p-values was performed despite testing at each of 50 points in time representing different training cycles.
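To give a sense of what the absence of a correction means when 50 tests are run at the 0.05 level, the sketch below computes two simple reference quantities: the expected number of nominally significant results if no true difference existed at any cycle, and the per-test threshold that a Bonferroni correction would impose. Bonferroni is used here only as one familiar and conservative example; it was not applied in the study.

```python
def expected_false_positives(alpha, n_tests):
    """Expected count of p < alpha results if every null hypothesis is true."""
    return alpha * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


n_cycles = 50
expected = expected_false_positives(0.05, n_cycles)
threshold = bonferroni_threshold(0.05, n_cycles)
```

With 50 tests, about 2.5 nominally significant cycles would be expected by chance alone, and a Bonferroni-corrected threshold would be 0.001 per test. The tests at successive training cycles are not independent, so these numbers are rough orientation only.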

TABLE B1 | Significance level for median values of three measures (i) the number of unclear nodes at P-MAP level (blue lines in Figure 5), (ii) the number of clear nodes with non-separated training items at P-MAP level (yellow lines in Figure 5), and (iii) the number of occupied nodes at P-MAP level (red lines in Figure 5) for the comparison of auditory plus somatosensory training (Figure 5, dark lines) with auditory only training (Figure 5, light lines) for each training cycle (1–50).


Significance levels: <sup>∗</sup><0.05, ∗∗<0.01, and ∗∗∗<0.001; n.s., both median values are not significantly different.

# Variability and Central Tendencies in Speech Production

D. H. Whalen<sup>1,2,3</sup>\* and Wei-Rong Chen<sup>2</sup>

*<sup>1</sup> Program in Speech-Language-Hearing Sciences, City University of New York, New York, NY, United States, <sup>2</sup> Haskins Laboratories, New Haven, CT, United States, <sup>3</sup> Department of Linguistics, Yale University, New Haven, CT, United States*

Speech is notoriously variable, but our understanding of this variability continues to evolve. Variability has typically been taken as an indication of failure to reach a desired target due to physical or neurological limits. However, it is likely that some variability is beneficial, an effect that has been found in other domains. Part of the effort to separate beneficial from destructive variability must be to understand the distribution of values around a speech target. One aspect that is commonly measured is the standard deviation of some objective aspect of speech. The standard deviation is most meaningful for normal distributions, and the assumption in speech research has been that values are indeed normally distributed. This has not been rigorously tested, however, as the test of normality requires a large number of samples (some studies suggest a minimum of 200) to determine whether the data is normally distributed or not. Speech research (and, indeed, most research with humans) seldom reaches such numbers for a consistent environment. Here, an initial estimate for 300 repetitions of English words by a single speaker is presented. The words were pseudo-randomized with an equal number of filler items, so that immediate repetitions (and the neural and physical fatigue repetition can cause) were avoided. One hundred trials were collected on each of 3 days. Words were chosen to have very little coarticulatory influence ("heed," "ode"/"owed") or sizable coarticulatory influence ("geek," "dote"). Measurements of vowel formants at acoustic midpoints indicated that the distributions were indeed normal. This was true even of the high coarticulatory environment, which some theories would predict would be skewed by the vowel's reaching the edge of an acceptable region. The current results indicate that vowel targets are consistent for different environments. Further, the range of the distributions was quite similar across the two types of environment, being, for example, about 100 Hz for F1. The amount of variability is fairly substantial but can be presumed to be beneficial, as all items were heard correctly. The normality of the distribution nonetheless indicates a control structure that accommodates the coarticulatory environment at the level of planning.

#### Edited by:

*Adamantios Gafos, University of Potsdam, Germany*

#### Reviewed by:

*Daniel Williams, University of Potsdam, Germany; Sam Kirkham, Lancaster University, United Kingdom*

> \*Correspondence: *D. H. Whalen whalen@haskins.yale.edu*

#### Specialty section:

*This article was submitted to Language Sciences, a section of the journal Frontiers in Communication*

Received: *03 May 2019* Accepted: *23 August 2019* Published: *10 September 2019*

#### Citation:

*Whalen DH and Chen W-R (2019) Variability and Central Tendencies in Speech Production. Front. Commun. 4:49. doi: 10.3389/fcomm.2019.00049*

Keywords: vowels, formants, variability, motor control, speech production

# INTRODUCTION

Variability is a well-known feature of speech, as it is with other biological systems. Although excessive variability can signify lack of motor control, lack of variability can itself be pathological (e.g., Dinstein et al., 2015). Variability in input has been shown to be helpful in learning (e.g., Bradlow et al., 1997; Preston et al., 2018), and variability in production can give a range of options for adapting to novel situations (e.g., Ossmy et al., 2018).

A typical assumption is that variability is normally distributed; indeed, it is typical to the point that most studies do not explicitly state that assumption. The successful analysis of results in those studies suggests that the assumption is justified to a great extent. Although many statistical tests provide replicable results even when their requirements are violated (Lix et al., 1996), there are indications that other results can be greatly affected (Cain et al., 2017). Concerns over the effects of skew and kurtosis have motivated the move to linear mixed effects models (Baayen et al., 2008; Pouplier et al., 2017), but the meaning of the distributions themselves is not addressed by such analyses.

Non-normal distributions of data can indicate that more than one process is affecting the distribution [A normal distribution can arise from multiple sources if the samples are independent and identically distributed (central limit theorem)]. If the distribution is bimodal, then the single measurement under consideration may be treating two effects as if they were one. If there is skew in the distribution, there may be influences at work that need to be addressed. For statistical purposes, skew may invalidate some tests, such as ANOVA (e.g., Harwell et al., 1992). The more interesting effect is that it may indicate an influence on the behavior under study. The most common of these, of course, is a boundary effect, when the mean of a distribution is close to a physical limit, that is, when the standard deviation simply cannot extend as far as it would without the constraint. Both of these effects can be informative rather than a hindrance to analysis if they are examined on their own. That is one purpose of the present experiment.
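The boundary effect described above can be illustrated with a small simulation. The sketch below is not part of the original study: the mean, SD, and floor values are hypothetical, and the physical limit is modeled, as an assumption, by simply redrawing any value that falls below a hard floor. It compares the moment coefficient of skewness of an unconstrained normal sample with that of a sample whose lower tail is cut off close to the mean.

```python
import random


def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def sample_with_floor(mean, sd, floor, n, rng):
    """Draw n normal values, redrawing any value below a hard floor."""
    out = []
    while len(out) < n:
        v = rng.gauss(mean, sd)
        if v >= floor:
            out.append(v)
    return out


rng = random.Random(1)
# Hypothetical values in Hz; the floor sits half an SD below the mean.
free = [rng.gauss(300.0, 20.0) for _ in range(5000)]
bounded = sample_with_floor(300.0, 20.0, floor=290.0, n=5000, rng=rng)

print("unconstrained skewness:", round(moment_skewness(free), 2))
print("floor-limited skewness:", round(moment_skewness(bounded), 2))
```

Under these assumptions, the unconstrained sample should show skewness near zero, whereas the floor-limited sample should show clearly positive skewness, because only its upper tail can extend freely.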

In this study, we examined the pattern of articulatory variability in vowel targets for English. Although direct articulatory measurements are more readily obtained now than in previous years, they are still demanding in data collection and analysis, making them challenging for large-scale studies [though see discussion below of a physiological study in Tilsen (2017)]. Here, we needed many repetitions in order to examine the distributional characteristics of the productions, and so we, like many others, relied on the acoustic output to index the articulatory activity. Not only is the acoustic output reliably shaped by the articulation (Fant, 1960; Iskarous, 2010), there is also evidence that variability in the acoustic domain is highly related to the variability in the articulatory domain (Whalen et al., 2018). The use of acoustics therefore is a reasonable first step in analyzing production variability.

The focus here is on random variability, not structured variability, so we needed to focus on single targets. There is a great deal of structured variability due to vocal tract length, coarticulation, emotion, etc. (e.g., Best, 2015), and such variation is of great importance for understanding the entirety of the phonetic system. If it were possible to code all of those structured effects on formant values, we might be able to assess distributions from large speech corpora; the residual after removing the structured effects would be the unstructured variability. However, no corpora are annotated to that extent, and it may be that none ever will be. The number of systematic sources of variability is sizable and generally expanding as more studies are completed. Relying on our accurate account of those factors is not possible at present, given, for example, the relatively inaccurate methods for vocal tract normalization (e.g., Flynn, 2011). Thus, we relied on multiple repetitions of non-contextualized words by a single speaker.

The level of variability due to motor noise and other intrinsic factors must be examined with productions that lack, to the extent possible, structured variability. "Intrinsic" factors are here conceptualized as distinct from structured ones, and they would include such variables as arousal state, location in the breath cycle, and changes in the motor program (either "intentional" or not). The boundary between intrinsic and systematic is not firm, however, and they may not really affect variability differently. We nonetheless wanted to avoid as many factors unrelated to the motor program as we could. To that end, we elicited multiple repetitions of target words so that, ideally, only variability in motor planning and execution remained. Fatigue of motor systems in sustained repetition is well-attested even if the underlying cause (central nervous system, the neuromuscular junction, or metabolic changes in the muscle fiber) is difficult to ascertain (e.g., Bigland-Ritchie, 1981). Thus, the paradigm of having a speaker produce many repetitions of a word [such as the 1,000 sequential repetitions of the word "bucket" in Kello et al. (2008)] can be expected to induce variability based on sheer physical and neural fatigue that are not relevant to understanding what speakers do when they are producing their ideal version of a word. We therefore collected our target words in lists which contained an equal number of filler items, allowing the neurons and muscles to reset between productions.

Direct instructions to eliminate variability do not appear to be successful and may even be counterproductive. In a study of multiple repetitions of target items, Tilsen (2017) provided feedback about consistency in an attempt to eliminate variability. It failed: Speakers continued to have variability, and the variability was structured across independent motor systems. For our purposes, the results indicated that providing feedback about individual productions was not effective in eliminating variability and therefore increased the cognitive load on the speaker without necessarily modifying the speaker's behavior. We therefore strove for consistency simply by asking the speaker to be consistent.

Formant measurements are known to be influenced by fundamental frequency (F0), but large datasets require automatic measurements that currently include such influences. Vowels are well-described by the formants (Fant, 1960), but it is really the resonances that are the true object of interest (Titze et al., 2015). Acoustic formant analysis tends to follow the most intense harmonic near a resonance (F0-effect) (Klatt, 1986; Shadle et al., 2016), but listeners respond to the true resonances, not the measured formants (Klatt, 1986). In the present study, we found that F0 effects were minimal due to the great consistency of F0 by our speaker, so that automatic measurements of formants were usable.

Many tokens are required to analyze the distribution of variability, but studies of speech seldom obtain the required amount. If 20 tokens are collected, we can obtain a fairly defensible estimate of the central tendency (mean) of the distribution, but a sample of only 20 tokens will almost always appear to be normally distributed, even if the true distribution is not normal. Mardia (1970) found that there was more than twice as much evidence for either atypical skewness or kurtosis when the sample size exceeded 106 (46 vs. 94%), indicating that large samples are needed for these measures. In a simulation of models with many parameters, Lerche et al. (2017) found that 200 trials provided good estimations for three- and four-parameter models (p. 522). These are not exact matches to the current experiment, but they give an indication of how many trials can be expected to give solid results. Thus, a sample size of 200 should provide good evidence of distributional properties; we oversampled by obtaining 300 repetitions.
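The dependence on sample size can be made concrete with a rough simulation. The sketch below is an illustration only, not the procedure used in this study: it draws samples from a mildly skewed gamma distribution (a hypothetical stand-in for a non-normal formant distribution, with true skewness 0.5) and flags a sample as non-normal when its skewness exceeds 1.96 times the large-sample standard error sqrt(6/n). That z-score rule is a simplifying assumption, not a formal normality test.

```python
import math
import random


def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def rejection_rate(n, n_sims, rng, z_cutoff=1.96):
    """Share of simulated skewed samples flagged as non-normal."""
    se = math.sqrt(6.0 / n)  # large-sample SE of skewness under normality
    hits = 0
    for _ in range(n_sims):
        # Gamma with shape 16 has true skewness 2 / sqrt(16) = 0.5.
        sample = [rng.gammavariate(16.0, 1.0) for _ in range(n)]
        if abs(moment_skewness(sample) / se) > z_cutoff:
            hits += 1
    return hits / n_sims


rng = random.Random(7)
rate_20 = rejection_rate(20, 400, rng)
rate_300 = rejection_rate(300, 400, rng)
print("flagged with n = 20: ", rate_20)
print("flagged with n = 300:", rate_300)
```

With 20 tokens per sample, most of the skewed samples should pass as normal, while with 300 tokens most should be flagged, which is the pattern the sample-size argument above relies on.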

Two environments were studied, allowing us to study intrinsic variability in two extrinsic changes in coarticulation. The first was an /(h)Vd/ environment, which has been shown to have small, if any, effects on vowel midpoint formant measures in comparison to isolated vowels (Stevens and House, 1963; Ohde and Sharf, 1975). The second was an environment of consonants that differed maximally from the vowel's position, that is, [g\_k] for [i<sup>j</sup>] ("geek") and [d\_t] for [ow] ("dote").

Our first analysis contrasts two hypotheses about the effect of coarticulation on the distribution of formant values. The first hypothesis, based on the "window" model of coarticulation (Keating, 1990), is that a neutral environment would have small skewness values while a coarticulated one would have larger skewness. The alternative model, labeled more generically as the non-window model, predicts non-skewed distributions for both environments. The rationale for this can be seen in **Figure 1**. The window model hypothesizes that the planning stage contains no central target for a segment, only a range (the window) of variability. The implementation is then the result of an interpolation process that finds an optimal path through connected windows with minimum articulatory effort. A window is defined as a pair of minimum and maximum values in a physical dimension that bound the observed productions (Keating, 1990, pp. 455–456). Thus, a boundary effect on the skewness of distribution should occur if the path from one window to another is most easily accomplished by moving close to an edge. The predictions of Guenther's (1995) "convex region theory" would seem to be the same as the window model's, because the region is meant to be sufficient for the production of a target. His regions are multidimensional and include somatosensory space, so the acoustic predictions are not straightforward. Nonetheless, because the theory is meant to account for such features as vowel reduction (undershoot) (Lindblom, 1963, 1983), it would seem that it would make the same prediction as the window model in this case: Vowels must enter the convex region to be successful, so there should be few productions outside the convex region. Productions that enter the region more deeply will also be successful, but less common. Formant values would therefore be expected to show a skewed distribution. A further complication is that segments have somatosensory targets as well as acoustic ones, resulting in separate error calculations for each (e.g., Terband et al., 2009). Whether this latter interaction would affect the distribution has not been tested.

**Figure 1** shows, schematically for F2 alone, the executed path of F2 (red solid lines) for /owd/ and /dowt/, necessarily the same in both the window model and the opposing non-window model. Hypothetical resultant F2 trajectories are shown from the onset of the syllable to the end of the first component of the vowel (omitting the offglide /w/ and the stop coda). The difference in the models is the control parameters, shown by the black dotted lines (the range of target). For the window model, this range defines planning parameters that are the same regardless of context. For the non-window model, this range represents a confidence interval of a normal distribution generated by, for example, a non-linear dynamical system (Saltzman and Munhall, 1989). For /owd/ (**Figure 1A**), both the window and non-window models predict that the F2 trajectory will concentrate in the middle of the target range without skewness. For /dowt/, the window model predicts that the F2 trajectory in the onset will accommodate the desired minimum effort by being toward the lower boundary of the target range, resulting in positive skew. In the middle of /o/, the skewness will be negative as the path enters the upper part of the range (**Figure 1B**). The non-window model predicts that the F2 distributions for /dowt/ should be normal throughout the whole trajectory (**Figure 1C**). Note that the two predictions have the same central tendency of the output trajectory but different predicted patterns of skewness. This is because the target range for a segment in the window model is always the same in all contexts, while in the non-window model the target range can be variable in different contexts as the result of gestural interactions. We chose /o/, despite its known diphthongal offglide (Pike, 1947), because the mid vowels have less chance of abutting a physiological limit, as we expect for /i/.

The second hypothesis is that /i/ should exhibit formant distributions that are somewhat skewed (i.e., positive skew in F1), given that the constriction for /i/ is limited as it approaches the hard palate.

Formants for vowels are rarely stable throughout the vocalic segment, whether the vowel is perceived as diphthongal or not (Hillenbrand and Nearey, 1999). Our analysis examines both the midpoint of the vocalic segment, often seen as the target of the vowel, and the trajectories as well.

# EXPERIMENT

Many repetitions of linguistic utterances are needed to address the issue of the normality of the distributions of vowel formants. This need dictated that the target words be produced in isolation so that the recording sessions would be short enough to be tolerable by the speaker. Filler items were needed to avoid excessive repetition and its concomitant shift in neural and muscular response.

# Method

#### Speaker

The speaker was a native speaker of American English. He is a trained phonetician as well as an instructor for the singing voice. He provided written informed consent as approved by the CUNY University Integrated IRB (City University of New York).

#### Materials

The target words were "heed," "geek," "owed"/"ode," and "dote." The homophones "owed" and "ode" were used as a condition for the experiment that was addressed by the filler items (not discussed here). Results for those items will be presented both separately and combined. Filler words were 25 homophones such as "air"/"ere" and "plain"/"plane."

#### Procedure

Recordings were made in a sound-attenuated booth at the Graduate Center of the City University of New York (CUNY). A free-field microphone (PCB Piezotronics 482C16) with built-in pre-amp (PCB Piezotronics 378B02) was used. An AD Instruments Power Lab (8/35-1008) data acquisition device with a Dell Optiplex 9010 computer processed signals, which were sampled at a rate of 44.1 kHz.

Recordings were made on 3 separate days, separated by 4 months in the first case and 18 months in the second. Each target word (or two words, in the case of "ode"/"owed") occurred 100 times in the randomized list for each day. Each group of 8 items contained one example of each target word (with "ode" and "owed" randomly assigned) along with 4 filler items. The 50 filler items were randomized twice, once for the first half of the session and another time for the second half.

Words were presented in standard orthography, one at a time, on a computer screen controlled by the Presentation program (https://www.neurobs.com/).

#### Measurements

The recorded audio files were downsampled to 16 kHz and force-aligned with the FAVE-align tool (Rosenfelder et al., 2014), then manually corrected when necessary. Formant frequencies were measured by the Burg method of linear predictive coding (LPC) (window size = 45 ms; step size = 2 ms; number of poles = 14; pre-emphasis from 50 Hz; Nyquist frequency = 5,000 Hz) with Viterbi tracking using Praat (Boersma and Weenink, 2019). The tracked formant frequencies were time-normalized into 11 points, representing measurements from 0 to 100% in steps of 10%.
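The time-normalization step can be sketched as follows. This is a minimal illustration that assumes linear interpolation between adjacent analysis frames; the interpolation method actually used is not stated here, and the function name and the example track are hypothetical.

```python
def time_normalize(track, n_points=11):
    """Resample a formant track to n_points equally spaced proportional
    time points (0%, 10%, ..., 100% for n_points = 11).

    Linear interpolation between neighboring frames is an assumption.
    """
    if len(track) < 2:
        raise ValueError("need at least two frames to interpolate")
    last = len(track) - 1
    out = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional frame index
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(track[lo] + frac * (track[hi] - track[lo]))
    return out


# Hypothetical F1 track in Hz, one value per analysis frame.
f1_track = [310.0, 305.0, 298.0, 296.0, 299.0, 307.0, 318.0]
normalized = time_normalize(f1_track)
midpoint = normalized[5]  # value at 50% of the vocalic segment
print(len(normalized), midpoint)
```

The sixth value of the normalized track corresponds to the 50% point, which is the measurement used for the vowel midpoint analysis.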

#### Statistics

We carried out univariate normality tests on static formant values, separately for each word produced in each day. We ran the Shapiro et al. (1968) tests of skewness and kurtosis using the "normtest" package in R, taking F1 and F2 as separate dependent variables. To control for the inflation of Type I error due to multiple hypothesis testing, p-values of normality tests were adjusted by Benjamini and Hochberg's (1995) approach of "False Discovery Rate" (FDR). To predict the dynamic formant patterns from the data, we fit a Smoothing Spline ANOVA (SSANOVA) (Gu, 2002) model by adding "Context" and "Time" factors and their interaction, with a random effect of "Day," separately for each vowel and each formant.
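The FDR adjustment can be illustrated with a short sketch of the Benjamini and Hochberg step-up procedure. The analysis reported here was run in R; the Python function below is an independent re-implementation for illustration, and the six p-values are hypothetical placeholders for one block of skewness and kurtosis tests (2 tests on each of 3 days).

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest so that a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, idx in enumerate(order):
        rank = m - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for one block.
raw = [0.004, 0.030, 0.200, 0.650, 0.012, 0.470]
adjusted = bh_adjust(raw)
reject_family = any(p < 0.05 for p in adjusted)
print([round(p, 3) for p in adjusted], reject_family)
```

Each adjusted value is the raw p-value scaled by the number of tests divided by its rank, capped so that adjusted values never decrease as raw values increase.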

# RESULTS

## Vowel Midpoint

In order to make the initial analysis tractable, a single time point was used: 50% of the duration of the vocalic segment. **Table 1** summarizes the formant values along with their standard deviations (SDs). Results are shown for the 3 recording days separately as well as for the four forms (/hid/, /gik/, /owd/, and /dowt/) across days. Values for "ode" and "owed" are combined in the form /owd/.

#### Distributions and Normality Tests

Normality of the formant distributions was tested statistically, but it can also be visually represented by kernel density estimation (KDE). **Figure 2** presents the distributions of F1 (left column) and F2 (right column) for the four forms (in each row) separately for each day (Day 1: blue solid lines; Day 2: red dotted lines; Day 3: green dashed lines). As can be seen in **Figure 2**, the distributions were quite regular for each day (i.e., 100 repetitions of each target form), but each day was somewhat different. From **Figure 2**, we can observe skewness on F2 distribution for /gik/ and /owd/ produced in Day 1 and for /hid/ in Day 2, as well as on F1 distribution for /gik/ in Day 3. **Table 2** summarizes the statistics of the moment coefficient of skewness and the excess kurtosis based on the distributions of F1 and F2 values measured at the vowel midpoint (**Figure 2**). Excess kurtosis is calculated as kurtosis (the standardized fourth moment) minus three. The expected values for both skewness and excess kurtosis are zero for a normal distribution. Positive values of skewness indicate that the distribution was higher than the mean more often than expected (longer tail in higher frequency). An absolute value of skewness >1 is considered highly skewed, and an absolute value between 0.5 and 1 indicates moderate skew. Positive excess kurtosis indicates the distribution is "skinnier" than a normal distribution with "fatter" tail presumably due to outliers, while negative excess kurtosis indicates the opposite. The indications of significance symbols were based on the p-values adjusted by Benjamini and Hochberg's (1995) FDR method for each block. For example, in the top-left block (F1 for /hid/) of **Table 2**, the six p-values (not shown) for the tests of both skewness and kurtosis in the 3 days were entered into FDR-adjustment; the (family-wise) null hypothesis is that none of the six statistics deviates from the value expected under a normal distribution; any one FDR-adjusted p-value in a block that meets the significance level suggests rejection of such null hypothesis.
The statistics in **Table 2** showed that the distribution of F1 for /gik/ produced in Day 1 and those of F2 for /gik/ and /owd/ produced in Day 1 are significantly skewed, which conformed to the shapes of distributions observed in **Figure 2**.
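The two shape statistics and the rule-of-thumb labels can be written out as a short sketch. The function names and the example values are hypothetical; population (1/n) moments are used as an assumption, and the "approximately symmetric" label for absolute skewness below 0.5 is added for completeness rather than taken from the text.

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) from central moments.

    skewness        = m3 / m2**1.5
    excess kurtosis = m4 / m2**2 - 3   (zero for a normal distribution)
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def describe_skew(skewness):
    """Label skewness with the thresholds given in the text."""
    size = abs(skewness)
    if size > 1.0:
        return "highly skewed"
    if size >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"  # label assumed, not from the text


# Hypothetical F1 midpoint values in Hz with one high outlier.
f1_values = [295.0, 300.0, 298.0, 302.0, 301.0, 297.0, 299.0, 340.0]
skew, excess = shape_statistics(f1_values)
print(round(skew, 2), round(excess, 2), describe_skew(skew))
```

A single high value is enough to produce positive skewness (a longer tail toward higher frequencies) and positive excess kurtosis in a small sample, which is one reason the statistics are only interpretable with many tokens.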

The dynamic pattern of skewness makes a systematic effect on the distributions even less likely. In **Figure 3**, skewness is calculated for each of the 11 time points of the time-normalized data. Solid circles indicate significant skewness, while empty circles indicate non-significant skewness. Significance was based on FDR-adjusted p-values across 11 points of skewness separately for each day and for each formant, with the family-wise null hypothesis that none of the skewness values measured at the 11 points deviates from a normal distribution. Because the consonant(s) at the syllable boundary should have windows of their own, the skew could be expected to change over the course of the syllable, perhaps with a midpoint differing from both ends. Such a pattern is seen for F2 of /owd/ on day 1. However, two aspects of that pattern are inconsistent with our predictions: The onset of /owd/ should not be skewed, given that the target can be achieved from the beginning of the utterance. Even if there were an explanation for the presence of the skew, there is no obvious reason that the skew would not be present throughout the vocalic segment (up until the transitions for the final stop). Days 2 and 3, as can be seen, had radically different patterns; there is no clear interpretation for the differences. In short, whatever was skewing some of the formant distributions on some days was not systematic enough to be explained by either the window model or by the non-window model (see **Figure 1**).

**Figure 4** further visualizes the distributions of formant frequencies for all time points. Each gray-scaled contour represents the KDE-estimated probability density function (like the distributions displayed in **Figure 2**) at each time point; darker color indicates higher probability. Red crosses track the means of the distributions along the time course, and blue circles the mode (estimated by measuring the peak of the probability density function) of the distributions. The difference between mean and mode is the numerator of Pearson's mode skewness [(mean–mode)/SD]: If the mean is higher than the mode, it indicates positive skewness, which is a conservative visualization of the direction of skewness. Note that mode skewness may not be perfectly consistent with the moment coefficient of skewness (as in **Table 2**). **Figure 4** is largely consistent with **Figure 3** and provides more information about the probability distributions of formant values at each time point.
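Pearson's mode skewness with a KDE-estimated mode can be sketched as below. This is an illustrative reconstruction, not the code behind **Figure 4**: the Gaussian kernel, the rule-of-thumb bandwidth (1.06 × SD × n^(-1/5)), the grid size, and the simulated "formant" values are all assumptions.

```python
import math
import random
import statistics


def kde_mode(values, grid_size=201):
    """Location of the peak of a Gaussian kernel density estimate."""
    n = len(values)
    sd = statistics.stdev(values)
    bandwidth = 1.06 * sd * n ** (-0.2)  # rule-of-thumb bandwidth (assumed)
    lo, hi = min(values), max(values)
    best_x, best_density = lo, -1.0
    for i in range(grid_size):
        x = lo + (hi - lo) * i / (grid_size - 1)
        # Unnormalized density: constant factors do not move the peak.
        density = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                      for v in values)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def mode_skewness(values):
    """Pearson's mode skewness: (mean - mode) / SD."""
    mean = statistics.mean(values)
    return (mean - kde_mode(values)) / statistics.stdev(values)


rng = random.Random(3)
# Hypothetical right-skewed values in Hz (longer tail at high frequencies).
skewed = [1500.0 + 40.0 * rng.gammavariate(2.0, 1.0) for _ in range(3000)]
skewed_result = mode_skewness(skewed)
print(round(skewed_result, 2))
```

A positive result means the mean lies above the density peak, matching the reading of the red crosses and blue circles described above.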

#### Dynamic Formant Patterns

The changes in formant location for the words across all 3 days were examined. The time-normalized values were used. A smoothing spline ANOVA (SSANOVA) was computed separately for F1 and F2 for /hid/ vs. /gik/ (**Figure 5**) and for /owd/ vs. /dowt/ (**Figure 6**). In such displays, the 95% Bayesian confidence intervals (shown in color around the mean formant values) are assumed to be statistically different when they do not overlap. The amount of divergence that is needed before the result is "significant" is debatable, but the existence of a visually distinct region suggests that the trajectories do differ in some ways. As can be seen in **Figures 5**, **6**, the first two formants were constantly changing, leaving no portion that was truly "steady-state." Indeed, inclusion of such minor variability has been shown to improve identification of synthetic versions of the target vowels (Hillenbrand and Nearey, 1999). Other predictable aspects appeared. A separate SSANOVA (not presented here) comparing the homonyms "ode" and "owed" showed that they were, indeed, virtually identical. The formants for the shared alveolar stop at the end of the /ow/ words converged (**Figure 6**). The formants for the distinct places of articulation of the final stops for the /i/ words diverged (**Figure 5**). F2 was distinguished at the final portion of the trajectory in **Figure 5** and the first half of the trajectory in **Figure 6**. What was perhaps somewhat surprising was the overall dissimilarity of F1 for the two contexts for the /ow/ words but not for the /i/ words. Still, the differences were small (45 Hz for /ow/ words, and 2 Hz for /i/ words at the midpoint).

TABLE 2 | Moment coefficients of skewness and excess kurtosis (the fourth moment minus 3) for F1 and F2 measured at the vowel midpoint.

*Positive skewness indicates longer tail in higher frequency. Positive excess kurtosis indicates fatter tail and "skinnier" distribution, and negative value the opposite. P-values were adjusted by FDR for each block (*\*\*\**p* < *0.001;* \*\**p* < *0.01;* \**p* < *0.05; † p* < *0.1). Bold face indicates the FDR-adjusted p-value is less than 0.05.*

Although "geek" was intended to have velar productions on either side of the vowel, the low F2 values at onset indicate that this speaker used a very fronted place of articulation for the initial stop. Thus, the F2 pattern was quite linear, while the F2 of "dote" (**Figure 6**) behaved as intended. The vowel of "ode"/"owed" was, as expected, rather diphthongal, with F1 changing by about 65 Hz from time points 4 to 8 (the likely limits of coarticulatory effects of the stop). The vowel of "heed," by contrast, changed by about 10 Hz over those same time points.

# DISCUSSION

Multiple repetitions of English words in a fairly isolated state were found to have formants that were only slightly different from normality. Having a sizable number of tokens is necessary for such an analysis, but the biological constraints on speakers make collection challenging. Here, we reduced the constraints as much as possible by interleaving the tokens with filler items, but that limited us to collecting 100 repetitions in any one session.

the modes (peak). Difference between mean and mode indicates the direction of skewness. (A) /hid/. (B) /gik/. (C) /owd/. (D) /dowt/.

As can be seen in **Figures 2**, **3**, the formants obtained were quite consistent within those sessions; the small differences across sessions were smaller than the likely measurement error of the LPC analysis. Although changes in articulation across different days or even time of day (Heald and Nusbaum, 2015) have been reported before, the differences here are negligible.

The window model hypothesis that coarticulation would skew the distributions was not supported, while the non-window model was consistent with the lack of skewness. The trajectories were normally distributed not only near the midpoint of the vocalic segment, but throughout the production (**Figures 3**, **4**). Such a result is inconsistent with the "window" model in which the motor plan contains only target regions and not trajectories; in execution, segments reach the edge of one target region before moving on to the next (Keating, 1990). It is consistent with the non-window model in which a motor plan takes the entire context into account from the beginning; overlapping activations for the gestures or segments then unfold in execution in such a way that variability is structured by the interactions of the overlapping control parameters of gestures or segments.

The second hypothesis, that the /i/ formants would have skewed distributions because of the boundary effect of the hard palate, was not supported. Not only were there very few individual time points with significant skew, there was no discernible pattern to the skew either. For this speaker, at least, the constraints on articulation of the high front vowel were well-accommodated, so that the distributions of formants were unaffected by the physiological limits. Standard deviations were small but non-negligible (at midpoint, for /i/, 5.8% for F1, 2.3% for F2; for /ow/, 4.6% for F1, 3.8% for F2). It would seem that there is enough variability for a skewed distribution to be evident, if it were present. Instead, the formant distributions appear to be normal through the duration of the syllable.

Future studies are desirable to explore these issues further. Only one speaker was analyzed here, and he was chosen in part for his many years of practice and instruction in broadcast speaking. The resulting consistency was useful for having manageable amounts of variability, but less skilled speakers may show different patterns. Indeed, the kinds of variability that result may differ by such factors as speech sound disorder or speaking in a second language. Other acoustic or articulatory measures could be made, although the strongest predictions in the field have been about formant values. Measuring variability across the vowel system rather than for just two vowels would be useful (Whalen et al., 2018), although the number of tokens required becomes rather large. Finding word tokens that maintain the voicing of the final consonant would also be desirable. Other statistical approaches, such as Generalized Additive Mixed Models, may provide further insight.

Overall, the results for this speaker support the use of statistics that rely on normal distributions for analyzing formant values. As such, the results also support the use of Gaussian priors in Bayesian linear mixed models (Vasishth et al., 2018). Using the results of a single speaker has intrinsic drawbacks, so the current results can only be preliminary. Further, the formant values themselves are subject to many measurement errors (Klatt, 1986; Shadle et al., 2016), but, within those limits, estimation of the central tendencies for formants is relatively good, at least for F0s <200 Hz (Chen et al., 2019). The present data did not support models that assume target regions; instead, entire trajectories were normally distributed throughout the vocalic segment. Variable productions, therefore, appear to be variable in their global shape, not just in their relationship to local targets.

## DATA AVAILABILITY

All datasets generated for this study are included in the manuscript/**Supplementary Files**.

# ETHICS STATEMENT

The study was reviewed and approved by the CUNY University Integrated IRB. Written and informed consent was obtained from the participant.

## AUTHOR CONTRIBUTIONS

DW contributed conception and design of the study and wrote the first draft of the manuscript. W-RC organized the database, performed the statistical analysis, and wrote sections of the manuscript. DW and W-RC contributed to manuscript revision, read, and approved the submitted version.

## FUNDING

Research was supported by US NIH grant DC-002717 to Haskins Laboratories.

## ACKNOWLEDGMENTS

We thank Jason Shaw, Adamantios I. Gafos and two reviewers for helpful comments, and Richard Lissemore, Vilena Livinsky and Grace Kim-Lambert for technical assistance.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcomm.2019.00049/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Whalen and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Modeling Dimensions of Prosodic Prominence

#### Simon Roessig\* and Doris Mücke

Institut für Linguistik – Phonetik, Universität zu Köln, Cologne, Germany

Detailed modifications both in the laryngeal as well as in the supra-laryngeal domain have been shown to be used by speakers of German to express prosodic prominence. This paper aims to bring the two domains together in a joint analysis and modeling account. We report results on the prosodic marking of focus types from 27 speakers that were recorded acoustically and with electromagnetic articulography. We investigate the intonational patterns (tonal onglide) as well as the articulatory movements during the vowel production (lip aperture and tongue body position). We provide further evidence for categorical and continuous modifications across and within accentuation and sketch a dynamical model that accounts for these modifications on multiple dimensions as the consequence of scaling the same parameter. In this model, the prosodic dimensions contribute differently to the complex shape of the compositional attractor landscape and respond differently to the scaling of the system. The study aims to add to our understanding of the integration of speech sounds in a two-fold manner: the integration of different channels of prosody (laryngeal and supra-laryngeal) as well as the interplay of categorical and continuous aspects of speech.

#### Edited by:

Adamantios Gafos, University of Potsdam, Germany

#### Reviewed by:

Argyro Katsika, University of California, Santa Barbara, United States

Mariapaola D'Imperio, Aix-Marseille Université, France

> \*Correspondence: Simon Roessig mail@simonroessig.de

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Communication

Received: 15 March 2019 Accepted: 31 July 2019 Published: 10 September 2019

#### Citation:

Roessig S and Mücke D (2019) Modeling Dimensions of Prosodic Prominence. Front. Commun. 4:44. doi: 10.3389/fcomm.2019.00044

Keywords: prosody, dynamical systems, articulation, intonation, speech production, attractors

# INTRODUCTION

In the last decades, a growing body of research has pointed out the dynamical nature of the mind (e.g., Kelso, 1995; van Gelder and Port, 1995; Port, 2002; Spivey, 2007). To overcome limitations imposed by symbolic approaches, researchers from many disciplines have turned to the framework of dynamical systems describing a multitude of different cognitive processes including the production and perception of speech sounds and their cognitive representations (Browman and Goldstein, 1986; Tuller et al., 1994), organization of semantic knowledge (Mirman and Magnuson, 2009) as well as movement coordination (Haken et al., 1985).

In the fields of phonetics and phonology, the dynamical perspective has the potential to shed new light on the question of how the categorical and the continuous aspects of speech are related (Browman and Goldstein, 1986; Hawkins, 1992; Tuller et al., 1994; Port, 2002; Gafos, 2006; Gafos and Benus, 2006; Lancia and Winter, 2013; Roon and Gafos, 2016; Iskarous, 2017; Mücke, 2018). Phonology and phonetics have long been conceptualized as two separate modules with a process of translation to mediate between them. While phonology comprises the categorical representations of speech sounds and computations that operate on them (rules or ranked constraints), phonetics implements speech sounds in a physical representation. Thus, the translation from phonology to phonetics must be a process of transforming a discrete symbolic representation into a continuous signal. The division between phonology and phonetics into a discrete, symbolic domain on the one hand and a continuous, physical domain on the other is based on the observation that speech is characterized by abstract mental categories and continuous signals at the same time. While this perspective of duality appears to be a plausible motivation for a clear-cut separation of phonology and phonetics at first sight, accumulating evidence shows that the categorical and the continuous sides of speech are deeply intertwined. Crucially, this evidence questions a purely categorical, abstract nature of phonological representations (Pierrehumbert et al., 2000; Port, 2006; Ladd, 2011; Pierrehumbert, 2016). The dynamical perspective of the mind does not posit a strict division between categorical and continuous aspects of speech production and perception. In this view, the mind works in a completely continuous manner; there are no pure, symbolic mental states (Spivey, 2007). While the mind is in constant flux, it gravitates toward relatively stable states, called attractors.
These attractors are the analogs to categorical representations in the symbolic computation view. Since attractors are located in a fully continuous space that is not separated into discrete areas, it is sensible to talk of quasicategories in the context of attractors. Crucially, the fact that attractors are part of the continuous state space of the system makes a translation from the categorical to the continuous superfluous. As such, speech sound categories can be represented as stable states on multiple continuous dimensions. While the notion of the attractor reflects the observation that these categories are relatively stable, the continuous nature of the system allows for fine-grained variation around the attractor induced for example by prosody or by stronger intention to achieve a communicative goal.

One of the potential strengths of the dynamical systems approach is that it can deal with variation in speech production when investigating sound patterns. In a symbolic, modular view, only the discrete end result of a phonological computation, be it by virtue of rules or ranked constraints, is passed on to the phonetic implementation. The phonetic implementation module has no access to the "history" of discrete operations performed and implements symbols into physical signals regardless of the way they were obtained by the phonological module. Incomplete neutralization in German is a classic case that questions the plausibility of this chain: The final obstruents of <Rad> /ʁad/ ("wheel") and <Rat> /ʁat/ ("advice") should be completely indistinguishable for the phonetic implementation module after the neutralization rule described for German has turned both forms into [ʁat<sup>h</sup>]. Numerous studies demonstrated that this is not the case and that there are indeed systematic acoustic differences between the two words such as voice onset time, closure duration, or the duration of the preceding vowel (see, among others, Port and O'Dell, 1985; Port and Crawford, 1989; Roettger et al., 2014; Roettger and Baer-Henney, 2018; for Dutch: Ernestus and Baayen, 2006). In the modular view, the phonetic component should not be able to produce different signals based on the two phonological representations because they are identical. Gafos (2006) and Gafos and Benus (2006) showed how a dynamical perspective can deal with the observed variation. The categories of voiceless and voiced are conceptualized as two attractors in a continuous space of voicing. At the ends of syllables, the voiceless attractor is the more stable of the two attractors. However, the exact location of the attractor basin can be modulated by lexical factors and the speaker's communicative intention, allowing for subtle differences in the acoustic realization of the voiceless obstruent.

Variation also plays an important role in the domain of intonation research. On the one hand, many studies have shown that there is a probabilistic mapping between functions and forms that are described as prosodic categories (Grabe, 2004; Röhr and Baumann, 2010; Yoon, 2010; Baumann et al., 2015; Ritter and Grice, 2015; Cangemi and Grice, 2016). On the other hand, a great deal of variation can be found in the realization of these prosodic categories. For example, the same type of nuclear pitch accent—the part of the pitch contour on and around the most prominent word in the phrase—can be used for different functions. However, the accent's realization in terms of the height of the pitch peak and the temporal alignment of the peak to the accented syllable is often systematically varied by speakers (Ladd and Morton, 1997; Kügler and Gollrad, 2015). Grice et al. (2017) investigated the distribution and realization of pitch accents in German focus marking and demonstrated how continuous and categorical variation go hand in hand. The authors compared focus constructions similar to those exemplified in (1–3) (English translations are given below). In all three cases, the word "Jana" in the answer (A) usually receives the nuclear pitch accent. Example (1) illustrates a case of broad focus, where the whole sentence is in focus and "Jana" functions as the exponent of the focus domain (Uhmann, 1991). In example (2), "Jana" is the only word in focus, a condition that is often called narrow focus (Ladd, 1980). Example (3) is quite similar to (2) but "Jana" contrasts with another word in the immediate context ("Paul" in the question Q)—this condition is called contrastive focus.

	- (1) A: Melanie will Jana treffen. Melanie wants to meet Jana.
	- (2) A: Melanie will Jana treffen. Melanie wants to meet Jana.
	- (3) A: Melanie will Jana treffen. Melanie wants to meet Jana.

As already mentioned, in all three cases, the nuclear pitch accent is usually placed on the last noun, "Jana." Grice et al. (2017) showed that the distributions and realizations of pitch accent types differ between the focus conditions. However, their results suggest, as already reported in Mücke and Grice (2014), that the mapping between focus types and pitch accent categories is not one-to-one. There are general tendencies for certain focus types to be more frequently realized with certain pitch accent types, for example broad focus with H+!H<sup>∗</sup> accents, narrow focus with H<sup>∗</sup> accents, and contrastive focus with L+H<sup>∗</sup> accents. But the focus types are also realized with different accent types; for example, there is a considerable number of rising accents in the broad focus productions of some speakers. Crucially, Grice et al. (2017) and Roessig et al. (2019) demonstrated that variation in the phonetic parameters (peak alignment, target height, tonal onglide) within each pitch accent category is used to signal focus types as well. Moreover, this variation within category boundaries seems to mimic the variation across category boundaries: Some speakers, for example, use the shallower H<sup>∗</sup> accent in narrow focus primarily and the more rising L+H<sup>∗</sup> accent in contrastive focus. Others use H<sup>∗</sup> for both functions but increase the magnitude of the rising f0 movement from narrow focus to contrastive focus.

While f0 is a strong acoustic parameter in prosody, it is important to acknowledge that speakers exploit many phonetic dimensions to express prosodic structure. This means that prosodic structure is encoded in more than one phonetic exponent, a phenomenon that has recently been discussed in the context of pleiotropy by Gafos et al. (2019). For prosodic prominence, this implies that speakers can use multiple cues in different combinations to express the same degree of prominence. There are several important strategies of the supralaryngeal system to highlight important prosodic information in the phonetic substance. The first strategy is referred to as sonority expansion (Beckman et al., 1992). Sonority expansion enhances the vowel's sonority to strengthen the syntagmatic contrasts between accented and unaccented syllables. Under accent, speakers intend to produce louder and more sonorous syllables by opening the mouth wider. A more open oral cavity allows for a greater radiation of acoustic energy from the mouth. The second strategy is referred to as localized hyperarticulation (de Jong, 1995). It is based on the H&H model developed by Lindblom (1990) and follows the observation that signatures of prominence can be identified by a more extreme articulation of the tongue body in vowel productions. The hyperarticulation strategy involves the enhancement of paradigmatic features such as the place feature for a specific vowel. The tongue body position is lower in low vowels such as /a/, while it is more fronted in front vowels such as /i/ and more retracted in back vowels such as /U/ (de Jong et al., 1993; Harrington et al., 2000; Cho and McQueen, 2005).

During the production of low vowels, sonority expansion and hyperarticulation are non-competing strategies. Lower tongue and jaw positions accompanied by a higher degree of lip opening both increase specifications of manner and place targets. In addition, low vowels are associated with a low degree of coarticulatory resistance, therefore allowing for a high amount of prosodic variation in the temporal and spatial domains. Prosodic strengthening is more complicated in high vowels. While sonority expansion triggers a more open vocal tract to produce louder vowels, localized hyperarticulation induces smaller constriction degrees to increase the vowel's place feature. In addition, high vowels are associated with a high degree of coarticulatory resistance, thus allowing for less prosodic variation at least in the spatial dimension (Mücke and Grice, 2014). However, these highlighting strategies can be combined in the coordination of different articulatory subsystems. While the lingual system is mainly involved in hyperarticulation to increase the place feature in vowels such as /i/ and /U/, the mandibular and the labial systems contribute to sonority expansion by increasing the degree of lip opening. In the acoustic output, this leads to louder and longer syllables with more peripheral formant frequencies (Australian English: see Harrington et al., 2000; American English: see de Jong et al., 1993, as well as Cho, 2005).

Examples (1–3) above illustrate different focus constructions in which the last noun in the sentence ("Jana") is in the focus domain and receives the nuclear pitch accent. In example (4), the word occurs out of focus, i.e., in the background, and as such does not receive the nuclear accent in English and German. Many studies that investigated the above mentioned strategies of prosodic prominence marking concentrated on the distinction between unaccented and accented syllables and compared words in the most divergent conditions, i.e., background to words in contrastive focus [see example (3)].

	- (4) A: Melanie will Jana treffen. Melanie wants to meet Jana.

More recently, Mücke and Grice (2014) investigated the adjustments of lip opening gestures within the group of accented words in different focus types (broad vs. narrow vs. contrastive focus) in comparison to adjustments of the lip kinematics between unaccented and accented words (background vs. {broad, narrow, contrastive}). They found the strongest modifications when comparing target words in contrastive focus to target words in the background. During the production of different vowel types, the speakers produced larger, longer, and faster lip opening movements, thus increasing sonority of vowels in prominent positions. However, when comparing background and broad focus, they found only subtle kinematic adjustments. Even though there were tendencies to increase sonority from background to broad focus, the modifications were not systematic. However, when comparing different focus structures within accentuation, i.e., broad, narrow, and contrastive focus, they found larger, longer, and partially faster lip movements from broad focus to contrastive focus, but no clear distinction between narrow focus and contrastive focus. On the basis of their results, Mücke and Grice (2014) concluded that supra-laryngeal articulation may be directly related to focal prominence and not mediated by accentuation itself. These articulatory findings are in line with recent work by Baumann and Winter (2018) who showed that listeners' judgements of prosodic prominence are influenced by a multitude of categorical (pitch accent type and placement) and continuous acoustic factors (e.g., intensity and duration).

In this paper, we investigate the prosody of focus marking in German in both the laryngeal and the supra-laryngeal domain. We analyse acoustic f0 movements in combination with articulatory movements tracked from the lingual and labial system using a 3D Electromagnetic Articulograph (EMA). In our articulatory measurements, we quantify the parameters related to the displacement of lip opening and lowest position of the tongue body in the vowel /a/, between unaccented and accented (out of focus/background vs. broad focus) and within accentuation (broad focus, narrow focus, contrastive focus).

We demonstrate that categorical and continuous adjustments are made by speakers to express focus structure by virtue of prosodic prominence. Finally, we sketch a dynamical system that accounts for the modifications with attractor landscapes that are shaped by the contribution of the different prosodic dimensions under scrutiny. This model is able to account for both categorical and continuous variation as the outcome of the process of scaling the single control parameter of the system. Crucially, we demonstrate how this scaling of the control parameter modulates all prosodic dimensions, laryngeal and supra-laryngeal, at the same time. In this way, the present work attempts to contribute to our understanding of the integration of multiple channels or tiers in speech production.

# METHODS

#### Speakers, Recording Procedure, Speech Material

Twenty-seven monolingual native speakers of German were recorded with 3D Electromagnetic Articulography (EMA) using a Carstens AG501 articulograph and acoustically using a head-mounted condenser microphone. All recordings took place at the IfL Phonetics department of the University of Cologne. To track the movements of the articulators, sensors were placed on the upper and lower lip, tongue tip, tongue blade, and tongue body. Reference sensors were placed on the bridge of the nose and behind the ears to compensate for head movements. A bite plate measure was used to rotate the occlusal plane. The kinematic data were recorded at 1,250 Hz, downsampled to 250 Hz and smoothed with a 3-step floating mean. In this study, we analyse the data from the lip sensors and the tongue body sensor (backmost tongue sensor). The acoustic recordings were carried out with an AKG C520 headset microphone into a computer via a PreSonus AudioBox 22 VSL interface at a sampling rate of 44.1 kHz and a bit depth of 16 bit. At the time of recording, the speakers were aged between 19 and 35. Seventeen of them were female and ten were male. None of the subjects had special training in phonetics, phonology or prosody, or reported any speech or hearing impairments. The participants received compensation for their participation in the study. The actual recording session after the participant had been prepared lasted about 45 min including a training session.
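The two preprocessing steps applied to the kinematic signals can be sketched as follows. This is a minimal illustration, not the authors' processing code: the function names are hypothetical, the downsampling is assumed to be plain decimation by a factor of 5 (the source does not state whether a low-pass filter was applied first), and the 3-step floating mean is assumed to be a centered window that is truncated at the signal edges.

```python
from typing import List, Sequence


def downsample(signal: Sequence[float], factor: int = 5) -> List[float]:
    """Keep every `factor`-th sample (1,250 Hz / 5 = 250 Hz).

    Assumption: plain decimation without an anti-aliasing filter; the
    source does not say how the downsampling was carried out.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return list(signal[::factor])


def floating_mean(signal: Sequence[float], width: int = 3) -> List[float]:
    """Centered moving average over `width` samples.

    Assumption: at the edges the window is truncated to the samples
    that exist, so the output has the same length as the input.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    raw = [float(i % 7) for i in range(50)]   # made-up trajectory at 1,250 Hz
    track = floating_mean(downsample(raw))    # 250 Hz, smoothed
    print(len(raw), len(track))
```

A truncated window at the edges is only one possible choice; padding or dropping the first and last sample would be equally defensible and would change only the two boundary values.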

The participants were seated in front of a screen and were involved in an interactive animated game. They were told that the game revolved around two robots working in a factory, in which one of them likes to move around the tools. The other robot, slightly older and technologically outdated, needs the participant's help to retrieve these tools. In each trial, the participant first saw one robot placing the tool on an object in the factory room and leaving the scene. In the next step, the second, older robot entered the scene. This robot did not enter the factory room but stopped in front of the closed door asking a question about the action of the first robot. After the participant's answer, the door opened, the second robot entered the room, took the tool and left the scene.

TABLE 1 | Example question-answer-pairs to elicit the focus structures.

For the robot's questions, natural productions by a male, native German speaker were used. These questions served as triggers for the focus structures of the answers and were chosen such that the target word denoting the object (where the tool is placed) could be in broad focus, narrow focus, contrastive focus, or in background (with a contrastive focus on the direct object). **Table 1** shows examples for such question-answer-pairs with square brackets and subscript F marking the focus domain. Each question was given auditorily and shown as a combination of pictures in a thought bubble above the head of the robot: the question tool on top of the question object in the case of background and contrastive focus; a simple question mark in the case of broad focus; the object and the question word "wo?" ("where?") in the case of narrow focus. The answers that the participant had to produce were always given in written form at the bottom of the screen. Many participants reported that they were able to give the answers without reading them on the screen after some trials. The participants were asked to always produce the answer with the same syntactic structure and to not add any words like "no." None of the participants had any problems with this restriction. Likewise, none of the participants reported that they found the sentences unnatural or difficult.

Twenty German-sounding disyllabic nonce words with a C1V1:C2ə structure were chosen as target words. Since it is important to control for the segmental context in EMA experiments, we used nonce words in target positions. This enabled us to also control for the frequency of the target words. The words were designed such that the word stress was on the first syllable and the consonants (C1 and C2) required movements of either the labial system or the tongue tip, to avoid influences on the tongue body measures for the vowel. The first consonant was chosen from the set of /n m b l v/, the second consonant from /n m z l v/. The first, accented vowel was either /a:/ or /o:/, the second always schwa. The consonants and vowels were combined such that each first consonant occurred twice with each first vowel and each second consonant-schwa combination occurred four times in the whole set. Special care was taken that the words did not overlap with real German words. All words were presented with the feminine determiner "die" /di:/. All participants pronounced the words as expected. The target words are given in **Table S1** in the Supplementary Materials.

Each target word was associated with a fictitious visual object. This association remained fixed through the whole experiment and across all participants. The participants were presented with all objects and target words in a preparation phase immediately before the experiment and were asked to read the words aloud with the determiner "die" ("die Nohme," "die Lahse," etc.). This phase lasted a few minutes and was included to ensure that no participant placed the stress on the second syllable. In fact, all participants placed word stress on the first syllable starting with the first production.

As described above, in each trial, a tool is placed on one of the fictitious objects. Each object was paired with a tool to occur with. The tools are given in **Table S2** in the Supplementary Materials. As there are 10 tools and 20 target words, each tool had to occur twice. Furthermore, for the background condition and the contrastive focus condition, a competitor tool or object was needed, respectively (for the direct object of the question when the target word was in the background: "Did he place X on A?" "He placed Y on A!"; and for the indirect object of the question when the target word was in contrastive focus: "Did he place X on A?" "He placed X on B!"). These combinations were fixed for each participant, yielding 20 quadruples of target object, tool, competitor object, and competitor tool. The competitor object was chosen such that the first consonant or the first vowel did not equal the first vowel or consonant of the target object. The competitor tool was selected such that it differed in the first consonant from the target sentence tool. The 20 quadruples occurred with all four focus conditions, which resulted in a total of 80 trials. Sixteen trials with different object-tool-quadruples preceded the actual experiment session.

The order of trials was randomized for each of the 27 participants. Subsequent trials were not allowed to contain the same target word or tool used in the target sentence. Furthermore, there were no three subsequent trials with the same focus condition. For two subsequent trials with identical focus condition an upper limit was set: In only 15% of the list, two adjacent trials with equal focus conditions occurred.
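The ordering constraints described above can be illustrated with a small randomized construction. This is a sketch under stated assumptions rather than the procedure actually used: the `Trial` type, the `randomize` function, and the placeholder word and tool names are hypothetical, the 15% limit is interpreted as a cap on the share of adjacent trial pairs with the same focus condition, and a greedy build with restarts stands in for whatever algorithm generated the original lists.

```python
import random
from typing import List, NamedTuple, Optional, Sequence


class Trial(NamedTuple):
    word: str   # target word (hypothetical placeholder names below)
    tool: str   # tool used in the target sentence
    focus: str  # focus condition


def allowed(seq: List[Trial], cand: Trial, repeats: int, max_repeats: int) -> bool:
    """Check whether `cand` may follow the trials already placed in `seq`."""
    if not seq:
        return True
    prev = seq[-1]
    if cand.word == prev.word or cand.tool == prev.tool:
        return False                      # no repeated word or tool
    if cand.focus == prev.focus:
        if repeats >= max_repeats:
            return False                  # cap on same-focus neighbours reached
        if len(seq) >= 2 and seq[-2].focus == cand.focus:
            return False                  # never three in a row
    return True


def randomize(trials: Sequence[Trial], max_repeat_share: float = 0.15,
              rng: Optional[random.Random] = None,
              max_restarts: int = 10000) -> List[Trial]:
    """Greedy randomized ordering; restarts whenever it reaches a dead end."""
    rng = rng or random.Random()
    # Assumption: the cap applies to the number of adjacent pairs.
    max_repeats = int(max_repeat_share * (len(trials) - 1))
    for _ in range(max_restarts):
        remaining = list(trials)
        seq: List[Trial] = []
        repeats = 0
        while remaining:
            options = [t for t in remaining
                       if allowed(seq, t, repeats, max_repeats)]
            if not options:
                break                     # dead end, start over
            choice = rng.choice(options)
            if seq and choice.focus == seq[-1].focus:
                repeats += 1
            seq.append(choice)
            remaining.remove(choice)
        else:
            return seq                    # all trials placed
    raise RuntimeError("no valid order found")


if __name__ == "__main__":
    conditions = ("background", "broad", "narrow", "contrastive")
    # 20 placeholder words, 10 placeholder tools, each tool used by two words.
    design = [Trial(f"word{i}", f"tool{i % 10}", c)
              for i in range(20) for c in conditions]
    order = randomize(design, rng=random.Random(1))
    print(len(order), order[:3])
```

A greedy build tends to use up the permitted same-focus neighbours early in the list; a procedure that samples uniformly from all valid orders would need a different design.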

The scenes, objects, tools, and robots were drawn by a professional book illustrator. The game was developed as an interactive website using HTML and JavaScript with jQuery for animation (e.g., robots' arm and mouth movement, the door opening, and closing). The experimenter, sitting behind the participant, pressed a key on the keyboard to make the robot move toward the tool and proceed to the next trial. There was a "rescue key" to repeat the trial in case something went wrong. Between trials, the scenery disappeared for 4 s and the screen transitioned through a series of light, muted colors. This was done to detach the trials from one another to make sure that the focus structure of the target sentence made reference to the current trial only. Points were counted for each complete trial in the lower right corner of the screen to make the task more game-like. **Figure S1** in the Supplementary Materials shows an example of the experiment screen, where the second robot has just asked his question and is waiting for the answer. The code of the experiment app is available for download: http://doi.org/10.5281/zenodo.2611287.

#### Measures

In this paper, only a subset of the data is reported on. Since the vowel /o/ involves lip rounding, lip aperture in syllables with /o/ cannot be compared to values of syllables with /a/. We decided to restrict our analysis to the target words with /a/ in the stressed syllable. From all 1,080 productions (27 speakers × 4 focus conditions × 10 target words), a minority of cases (3.7%) had to be excluded due to mispronunciations, strong disfluencies, or technical problems during the recording session. The resulting data set comprises 1,040 tokens and is available for download: https://osf.io/jx8cn.

One trained annotator labeled the beginning and end of the accented syllable of each target word using the waveform and the spectrogram in the emuR speech database system (Winkelmann et al., 2018). Within the boundaries of the syllable, lip aperture was evaluated as the Euclidean distance between the lips (Byrd, 2000) as given in Equation 1. An automatic procedure was used to retrieve the maximum of the trajectory within the boundaries of the labeled acoustic syllable. The maximal lip aperture represents the widest opening of the lips during the production of the vowel /a/. In addition, the lowest point of the tongue body during the production of /a/ was measured by finding the minimum of the recorded vertical trajectory within the boundaries of the labeled acoustic syllable. All values (lip aperture and tongue body position) were z-scored for each speaker. **Figure 1A** shows schematic depictions of the articulatory measures.

$$\begin{aligned}
\text{lip aperture}_{x} &= \text{upper lip}_{x} - \text{lower lip}_{x} \\
\text{lip aperture}_{y} &= \text{upper lip}_{y} - \text{lower lip}_{y} \\
\text{lip aperture} &= \sqrt{(\text{lip aperture}_{x})^2 + (\text{lip aperture}_{y})^2}
\end{aligned} \tag{1}$$
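The articulatory measures can be sketched in a few lines. The function names and the `Point` type are hypothetical, the trajectories are assumed to be already cut to the boundaries of the labeled acoustic syllable, and the z-scoring is assumed to use the sample standard deviation, which the source does not specify.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) position of one sensor in one sample


def lip_aperture(upper: Point, lower: Point) -> float:
    """Euclidean distance between the upper and lower lip sensor (Equation 1)."""
    dx = upper[0] - lower[0]
    dy = upper[1] - lower[1]
    return math.hypot(dx, dy)


def max_lip_aperture(upper_track: Sequence[Point],
                     lower_track: Sequence[Point]) -> float:
    """Widest lip opening within the syllable (tracks are sample-aligned)."""
    return max(lip_aperture(u, l) for u, l in zip(upper_track, lower_track))


def lowest_tongue_body(vertical_track: Sequence[float]) -> float:
    """Minimum of the vertical tongue body trajectory within the syllable."""
    return min(vertical_track)


def z_score(values: Sequence[float]) -> List[float]:
    """Z-score the values of ONE speaker.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


if __name__ == "__main__":
    upper = [(0.0, 3.0), (0.0, 10.0)]   # made-up sensor positions
    lower = [(4.0, 0.0), (0.0, 0.0)]
    print(max_lip_aperture(upper, lower))
```

Z-scoring has to be applied per speaker, that is, `z_score` is called once for each speaker's list of maxima (or minima), not once for the pooled data.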

To assess the differences in the f0 contours, we measured the tonal onglide of each nuclear pitch accent. **Figure 1B** provides a schematic depiction of the tonal onglide measure. Tonal onglide characterizes the portion of the f0 movement toward the main tonal target of the pitch accent (Ritter and Grice, 2015; Roessig et al., 2019). In terms of an autosegmental-metrical analysis, like GToBI (Grice et al., 2005), the L+H<sup>∗</sup> and H<sup>∗</sup> pitch accent types are described by a rising movement and result in positive onglide values. In contrast, the accent types H+L<sup>∗</sup> or H+!H<sup>∗</sup> are described by a falling movement from the initial high portion of the accent down to the L<sup>∗</sup> or !H<sup>∗</sup> on the accented syllable and result in negative onglide values. In addition to capturing the direction of the tonal movement ("is it rising or falling?"), the tonal onglide reflects the magnitude of the rise or fall in semitones ("how much does it rise or fall?"). It should be emphasized here that pitch accent categories are multi-dimensional and thus best described by multiple variables. Tonal onglide is a continuous variable that represents both the direction of the pitch movement as well as the magnitude of this movement, but it does not capture all relevant details of pitch accents (see Grice et al., 2017 for an investigation of the characteristics of pitch accents in terms of tonal onglide and its relation to other parameters). Nevertheless, it has been shown that the tonal onglide movement is a perceptually relevant parameter of pitch accents in German (Baumann and Röhr, 2015; Ritter and Grice, 2015).
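A signed onglide in semitones can be computed from the f0 values at the beginning and end of the movement. The sketch below uses the standard conversion of a frequency ratio to semitones, 12 · log2(f_end / f_start); the function names are hypothetical, and the source does not spell out the exact computation that was used.

```python
import math


def onglide_semitones(f0_start_hz: float, f0_end_hz: float) -> float:
    """Signed size of the f0 movement in semitones.

    Positive values correspond to rises, negative values to falls.
    """
    if f0_start_hz <= 0 or f0_end_hz <= 0:
        raise ValueError("f0 values must be positive")
    return 12.0 * math.log2(f0_end_hz / f0_start_hz)


def direction(onglide: float) -> str:
    """Coarse direction label derived from the sign of the onglide."""
    if onglide > 0:
        return "rise"
    if onglide < 0:
        return "fall"
    return "level"


if __name__ == "__main__":
    for start, end in [(100.0, 200.0), (200.0, 100.0), (180.0, 210.0)]:
        st = onglide_semitones(start, end)
        print(start, end, round(st, 2), direction(st))
```

Because the measure is a ratio on a logarithmic scale, the same movement in Hz yields a smaller onglide for a speaker with a higher f0, which is part of the reason semitones reduce between-speaker variation.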

Two labelers with training in prosody annotated the f0 movements with a simple labeling scheme without having access to the intended focus structures of the sentences: First, the labelers identified all utterances in which the speaker did not place the nuclear pitch accent on the object. Second, the labelers judged perceptually whether the nuclear pitch accent was falling or rising. Third, the labelers identified the beginning and the end of the onglide movement manually within a window of three syllables including the accented syllable in the center, the syllable before and the syllable after.

For rising accents, a local minimum just before the rising movement was annotated in the pre-accented syllable or the accented syllable itself as the beginning of the onglide movement. A local maximum at the end of the rise was labeled in the accented syllable or the post-accented syllable as the end of the movement. For falling accents, a relatively high point at the start of the fall was labeled in the pre-accented syllable or the accented syllable itself as the beginning of the onglide movement. Since the f0 is usually falling throughout the syllable in a falling accent and hence a tonal target is virtually impossible to determine, the midpoint of the vowel of the accented syllable was marked as the end of the accentual movement.

If the nuclear accent was not placed on the target word, it was placed on the direct object of the sentence. In this case, the part of the phrase containing the target word and the following verb is characterized by a low stretch of f0. This situation was found in almost all cases of the background condition and in a minority of cases of the other conditions. When this deaccentuation of the target word occurred, an "onglide" measure was taken with fixed time points (5 ms before the start and 50 ms before the end of the stressed syllable) since it is not possible to identify the beginning and the end of a tonal movement. We cannot speak of a real onglide here since there is no movement of a pitch accent. However, this measure makes it possible to compare and model the intonation of all utterances, with accented and unaccented target words, and to relate the intonational and articulatory modifications used to express focus structure across all experimental conditions.

Although using the semitones scale already eliminates a great deal of variation between speakers, normalization is needed to make the speakers more comparable. To do so, we divided each rising onglide value by the mean of the speaker's rising onglides, and each falling onglide value by the mean of the speaker's falling onglides. It is plausible that a rise is best interpreted in relation to other rises, while a fall is best interpreted in relation to other falls of the same speaker. For example, a raw onglide value of +6 semitones might be quite extreme for a speaker with a mean of +4 semitones for rises compared to a speaker with a mean of +6 semitones for rises. For the unaccented cases, where we cannot speak of rises and falls, we used the overall mean of the absolute onglide values for each speaker.
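The speaker normalization can be sketched as follows. The function name and the labels `"rise"`, `"fall"`, and `"unaccented"` are hypothetical. Two points are assumptions rather than statements from the source: falls are divided by the absolute value of the speaker's mean fall so that they keep their negative sign, and the "overall mean of the absolute onglide values" is taken over all of the speaker's tokens.

```python
from statistics import mean
from typing import List, Sequence


def normalize_onglides(onglides: Sequence[float],
                       kinds: Sequence[str]) -> List[float]:
    """Normalize the raw onglides (in semitones) of ONE speaker.

    kinds[i] is "rise", "fall", or "unaccented" for onglides[i].
    """
    rises = [v for v, k in zip(onglides, kinds) if k == "rise"]
    falls = [v for v, k in zip(onglides, kinds) if k == "fall"]
    rise_ref = mean(rises) if rises else None
    # Assumption: use the magnitude of the mean fall so falls stay negative.
    fall_ref = abs(mean(falls)) if falls else None
    # Assumption: reference for unaccented tokens uses all tokens.
    overall_ref = mean([abs(v) for v in onglides])

    normalized = []
    for value, kind in zip(onglides, kinds):
        if kind == "rise":
            normalized.append(value / rise_ref)
        elif kind == "fall":
            normalized.append(value / fall_ref)
        elif kind == "unaccented":
            normalized.append(value / overall_ref)
        else:
            raise ValueError(f"unknown accent kind: {kind!r}")
    return normalized


if __name__ == "__main__":
    raw = [6.0, 2.0, -3.0, -1.0, 0.5]   # made-up values for one speaker
    kinds = ["rise", "rise", "fall", "fall", "unaccented"]
    print(normalize_onglides(raw, kinds))
```

In this made-up example the speaker's mean rise is +4 semitones, so the +6 semitone rise is scaled to 1.5, whereas the same raw value would be scaled to 1.0 for a speaker whose mean rise is +6 semitones.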

# RESULTS

#### Intonation

Before presenting the quantitative results, we turn to some examples of the main intonational modifications in **Figure 2**. The informative value of these examples is of course limited since they only represent individual utterances. However, accompanying the quantitative results, they help to give a thorough insight into the data. The figure shows examples from one male speaker producing the conditions background, broad focus, narrow focus, and contrastive focus (from top to bottom). The stressed syllable of the target word is marked by the blue box; the arrows roughly illustrate the f0 movement that is captured by the onglide measure. This speaker uses a flat f0 stretch on the target word in the background condition (the target word is unaccented), a falling accent in broad focus and rising accents in narrow focus and contrastive focus. Comparing these last two conditions, a larger magnitude of the rise can be attested in contrastive focus.

**Figure 3** presents the normalized onglide values of all speakers for the four focus types in a violin plot. In the background condition, the data show a single mode located slightly below zero. For broad focus, we can observe a bimodal shape of the distribution, with almost equal numbers of falling and rising onglides. In narrow and contrastive focus, the right mode is more pronounced. Since rising accents dominate the data, we look at the means of rising accents in **Figure 4**. In addition to the increase in the number of accents with a rising onglide, the magnitude of the onglides becomes larger, as reflected in the stepwise growth of the mean from broad focus to narrow focus, and from narrow focus to contrastive focus. Note that we treat all rises as one group. Many autosegmental-metrical systems like GToBI (Grice and Baumann, 2002) posit two rather similar rising accents, H<sup>∗</sup> and L+H<sup>∗</sup>. While we do not deny the existence of the two types of pitch accents, our analysis is not intended to be an autosegmental-metrical analysis. As outlined in the methods section, the labelers did not classify each accent beyond deciding whether it is a rise or a fall.

We analyse the results using a Bayesian linear mixed model in R (R Core Team, 2018) with the package brms (Bürkner, 2018) that implements an interface to Bayesian inference with MCMC sampling in Stan (Carpenter et al., 2017). We report the estimated differences between focus conditions in terms of posterior means, 95% credible intervals, and the probability of the estimate being greater than zero. Given the data and the model, the 95% credible intervals indicate the range in which one can be certain with a probability of 0.95 that the difference between estimates can be found. To calculate the differences between focus types, we subtract the posterior samples for background from broad focus (broad–background), broad focus from narrow focus (narrow–broad), narrow focus from contrastive focus (contrastive–narrow), and broad focus from contrastive focus (contrastive–broad).
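The comparison of conditions described above amounts to subtracting posterior samples and summarizing the resulting differences. The following Python sketch shows one way to do this. It is not the authors' R code: the function name `summarize_difference` is hypothetical, the inputs stand in for posterior draws extracted from a fitted model, and an equal-tailed quantile interval is assumed for the 95% credible interval.

```python
from statistics import mean, quantiles


def summarize_difference(samples_a, samples_b):
    """Summarize the posterior of (a - b) from paired posterior draws.

    For example, samples_a could hold the draws for narrow focus and
    samples_b the draws for broad focus (narrow-broad).
    """
    diff = [a - b for a, b in zip(samples_a, samples_b)]
    # 39 cut points at 2.5% steps; the first and last bound the central 95%.
    cuts = quantiles(diff, n=40, method="inclusive")
    return {
        "mean": mean(diff),
        "ci95": (cuts[0], cuts[-1]),
        # Share of draws above zero, i.e. the probability reported as Pr(difference > 0).
        "pr_gt_zero": sum(1 for d in diff if d > 0) / len(diff),
    }
```

The probability of a difference being smaller than zero, as used later for the tongue position, is obtained in the same way by counting the draws below zero.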

The model includes normalized onglide as the dependent variable, focus type as a fixed effect, and random intercepts for speakers and target words as well as by-speaker and by-targetword slopes for the effect of focus type. Since the distribution of the dependent variable is bimodal, we use a prior for the predictor that is characterized by a mixture of two Gaussian distributions centered around −0.5 and 0.5 respectively. The model estimates the parameter theta that represents the extent to which the two Gaussian distributions are mixed. For this parameter, we use a prior centered around zero. Differences in theta indicate the differences in the proportions of the two modes in the onglide data. The model runs with four sampling chains of 5,000 iterations each, preceded by a warm-up period of 3,000 iterations.

We start with the results for the mixing parameter. Given the model and the data, the analysis yields strong evidence for differences in the posterior probabilities for the mixing parameter theta between broad focus and narrow focus ($\hat{\beta} = 1.35$, 95% CI = [0.09, 2.49], $\Pr(\hat{\beta} > 0) = 0.98$), narrow focus and contrastive focus ($\hat{\beta} = 1.74$, 95% CI = [0.28, 3.37], $\Pr(\hat{\beta} > 0) = 0.99$), as well as broad focus and contrastive focus ($\hat{\beta} = 3.09$, 95% CI = [1.24, 4.95], $\Pr(\hat{\beta} > 0) = 1$), i.e., within the group of accented target words. In all cases, the differences are positive, indicating a growth of the right mode from broad to narrow focus, and from narrow to contrastive focus. As to the difference between background and broad, the model also suggests that the mixing proportion of the two modes is different ($\hat{\beta} = -2.49$, 95% CI = [−3.82, −1.14], $\Pr(\hat{\beta} > 0) = 0$). This comes as no surprise since the distribution of background is unimodal whereas the distribution of broad is bimodal. However, the model calculates a negative difference. This is due to the fact that the model takes the right mode of the prior mixture to capture the unimodal distribution of background. The mixing parameter we report here is higher when the right mode is stronger and the left mode is weaker (note that the model can also estimate the mixing parameter that describes the exact opposite situation but the direction of differences is mirrored in the same way regardless; both parameters cannot be estimated at the same time). Thus, it makes sense, for the sake of completeness, to report the probability of the difference between background and broad focus in the mixing parameter being lower than zero: $\Pr(\hat{\beta} < 0) = 1$.

To assess the differences between the focus conditions regarding the rising distributions, we investigate the mean estimates of the right Gaussian sub-distribution. We only look at broad focus, narrow focus, and contrastive focus since we can only speak of a rising accent in these conditions. The model provides evidence for differences in the posterior probabilities between broad focus and narrow focus ($\hat{\beta} = 0.16$, 95% CI = [−0.02, 0.35], $\Pr(\hat{\beta} > 0) = 0.96$), narrow focus and contrastive focus ($\hat{\beta} = 0.23$, 95% CI = [0.10, 0.36], $\Pr(\hat{\beta} > 0) = 1$), as well as broad focus and contrastive focus ($\hat{\beta} = 0.39$, 95% CI = [0.21, 0.58], $\Pr(\hat{\beta} > 0) = 1$). In all cases, the differences are positive, indicating that the model estimates the rises to become increasingly large from broad focus to narrow focus, and from narrow focus to contrastive focus.

#### Supra-Laryngeal Articulation

We now turn to the results of the supra-laryngeal parameters. **Figure 5** gives the mean values of the maximal lip aperture for all speakers and focus types (the raw distributions are shown in **Figure S2** in the Supplementary Materials). There is a clear jump from background to broad, with larger distances between the lips for broad focus. The differences between broad focus and narrow focus, as well as between narrow focus and contrastive focus are more subtle, especially between broad and narrow focus. In sum, these results show a modification of the lip opening gesture between unaccented and accented target words as well as within the group of accented words with a ranking from broad to contrastive: background < broad focus < narrow focus < contrastive focus.

**Figure 6** presents the mean values of the lowest tongue positions for all speakers and focus types (the raw distributions are shown in **Figure S3** in the Supplementary Materials). As with lip aperture, a larger jump from background to broad focus can be found, i.e., between unaccented and accented words. But there are also differences between broad focus and narrow focus and narrow focus and contrastive focus, i.e., within the group of accented words. Overall, the same ranking as for lip aperture can be attested for the lowest tongue body position: background > broad focus > narrow focus > contrastive focus (reversed because the tongue position is lowered and the values thus decrease).

Analogously to the tonal onglide analysis in Intonation, we analyse the results using Bayesian linear mixed models in R (R Core Team, 2018) with the package brms (Bürkner, 2018). We report the estimated differences between focus conditions in terms of posterior means and 95% credible intervals. Given the data and the model, the 95% credible intervals indicate the range in which one can be certain with a probability of 0.95 that the difference between estimates can be found. To calculate the differences between focus types, we subtract the posterior samples for background from broad focus (broad–background), broad focus from narrow focus (narrow–broad), narrow focus from contrastive focus (contrastive–narrow), and broad focus from contrastive focus (contrastive–broad). In the case of the maximal lip aperture, we report the probability of the estimate being greater than zero because we are interested in whether the lip aperture increases from one focus type to another. In the case of the lowest tongue position, we report the probability of the difference being smaller than zero, because we are interested in whether the tongue position is lower, i.e., the values decrease, from one focus type to another.

The models include either the z-scored maximal lip aperture or the z-scored lowest tongue positions as the dependent variable. In both models, focus type is a fixed effect, and random intercepts for speakers and target words as well as by-speaker and by-targetword slopes for the effect of focus type are included. We use regularizing priors centered around zero. The models run with four sampling chains of 5,000 iterations each, preceded by a warm-up period of 3,000 iterations.

We start with the modeling results for the maximal lip aperture. Given the model and the data, the analysis yields clear differences in the posterior probabilities between background and broad focus ($\hat{\beta} = 0.81$, 95% CI = [0.65, 0.97], $\Pr(\hat{\beta} > 0) = 1$), narrow focus and contrastive focus ($\hat{\beta} = 0.22$, 95% CI = [0.04, 0.40], $\Pr(\hat{\beta} > 0) = 0.99$), as well as broad focus and contrastive focus ($\hat{\beta} = 0.30$, 95% CI = [0.12, 0.48], $\Pr(\hat{\beta} > 0) = 1$). For broad focus and narrow focus, the model provides evidence for a positive difference which is, however, weaker than in the other cases ($\hat{\beta} = 0.09$, 95% CI = [−0.10, 0.25], $\Pr(\hat{\beta} > 0) = 0.84$). In sum, there is a clear increase in the maximal lip aperture from background to broad focus, i.e., from unaccented to accented. Within the group of accented target words, overall, the maximal lip aperture increases. Narrow focus seems to be closer to broad focus although the model still yields evidence for a difference between the two.

We now turn to the results for the lowest tongue position. Given the model and the data, the analysis yields clear differences in the posterior probabilities between background and broad focus ($\hat{\beta} = -0.25$, 95% CI = [−0.44, −0.07], $\Pr(\hat{\beta} < 0) = 1$). This shows that when going from unaccented to accented, the tongue position for the low vowel /a/ is lowered. For the oppositions of broad focus and narrow focus ($\hat{\beta} = -0.11$, 95% CI = [−0.31, 0.10], $\Pr(\hat{\beta} < 0) = 0.85$) as well as narrow focus and contrastive focus ($\hat{\beta} = -0.07$, 95% CI = [−0.28, 0.15], $\Pr(\hat{\beta} < 0) = 0.75$), the model also provides evidence for differences, although they are not as strong as between background and broad, with probabilities of 0.85 and 0.75, respectively. When comparing broad focus and contrastive focus, however, the evidence for the difference is stronger again ($\hat{\beta} = -0.18$, 95% CI = [−0.40, 0.04], $\Pr(\hat{\beta} < 0) = 0.95$), indicating that there is a substantial decrease in the lowest tongue position within the group of accented focus types.

# DYNAMICAL MODEL

The results presented in the previous section show the following pattern: On the tonal tier, when going from background to broad focus, i.e., unaccented to accented, the distribution of flat f0 is split into a bimodal distribution. This bimodal distribution reflects that, when a pitch accent is placed, this accent can be either falling or rising. Both falling and rising accents are found in productions of broad focus, a result that is in line with Mücke and Grice (2014) and Grice et al. (2017). When going from broad focus to narrow focus, the number of rising accents increases while the number of falling accents decreases. This trend continues from narrow focus to contrastive focus. In addition, the magnitude of the rising movements increases between broad and narrow focus and between narrow focus and contrastive focus. The dominance of rising accents as well as the increase in magnitude of the tonal onglide of these rises help to make the accent more prominent.

On the articulatory tier, there is a continuous increase in the lip aperture and a lowering in the tongue body position from background to contrastive focus related to prosodic strengthening strategies during the production of the vowel in the target syllables. The increase in lip aperture can be attributed to sonority expansion, i.e., the speaker produces a louder vowel in the accented syllable (Beckman et al., 1992; Harrington et al., 2000). More energy radiates from the mouth, strengthening the syntagmatic contrast between accented and unaccented syllables in the utterance. The lowering of the tongue during the low vowel /a/ can be related to the strategy of localized hyperarticulation, i.e., the speaker intends to increase the paradigmatic contrast between the low vowel /a/ and any other vowel that could have occurred in the target syllable. The hyperarticulation of the vowel's place target [+low] is related to feature enhancement (de Jong, 1995; Cho, 2006; Mücke and Grice, 2014). Note that in this case the lowering of the tongue also contributes to sonority expansion. Both types of modifications can be seen as strategies to enhance the prominence of the target word from background to contrastive focus with intermediate steps for broad and narrow focus. In this section, we propose a dynamical system that models the tonal and articulatory modifications as the result of the scaling of one control parameter. Before turning to the actual model, we introduce some of the concepts of dynamical systems that are important for the present work.

The dynamical perspective of the mind, as explained in the introduction, views the mind not as a machine that manipulates symbols with discrete operations. Rather, it is conceptualized as a continuous system that is constantly in flux. This dynamical system follows predictable patterns of behavior in gravitating toward attractors, stable states in its space of possible states. To describe this evolution of the system through the state space over time, the language of differential equations can be employed (Iskarous, 2017). In this formal language, one way of formulating a dynamical system is by giving its potential energy function and its force function—the negative derivative of the potential energy function. The graph of the potential energy curve can give a good impression of the attractors present in the system, the attractor landscape. Consider the black lines in **Figure 7** presenting the potential energy curves of a system with two attractors (left) and another system with one attractor (right). On the x-axis, the state space is shown. This is the space of all possible states of the system, and crucially it is continuous. However, the system is moving toward local minima in the potential energy which are the attractors of the system.

The functions corresponding to the graphs are given in Equation 2 (two attractors) and 3 (one attractor). Both equations include a parameter k, called the control parameter of the system. By scaling this parameter, the system is "moved" through its possible patterns of behavior (Kelso, 2013). As a consequence, the attractor landscape can change when the parameter value is modulated. The black lines of **Figure 7** show the attractor landscape when the control parameter k is 0. The blue lines demonstrate how the system changes if the control parameter is increased to 0.5. In the case of the two-attractor landscape, the right attractor has become deeper than the left attractor and its deepest point also moved slightly to the right on the x-axis (the state space). In the case of the one-attractor landscape, the attractor also moved toward the right on the x-axis.

$$V(x) = \frac{x^4}{4} - kx - \frac{x^2}{2} \tag{2}$$

$$V(x) = \frac{(x - k)^2}{2} \tag{3}$$
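To make the relation between potential, force and attractors concrete, the following Python sketch implements Equations 2 and 3 together with their force functions and lets the noise-free system settle from a given starting point. The function names and the numerical settings (`dt`, `steps`) are illustrative assumptions, and the value k = 0.2 used in the note below is chosen only for illustration.

```python
def potential_two_attractors(x, k):
    """Equation 2: V(x) = x^4/4 - k*x - x^2/2."""
    return x**4 / 4 - k * x - x**2 / 2


def force_two_attractors(x, k):
    """Negative derivative of Equation 2: F(x) = -x^3 + x + k."""
    return -x**3 + x + k


def potential_one_attractor(x, k):
    """Equation 3: V(x) = (x - k)^2 / 2."""
    return (x - k) ** 2 / 2


def force_one_attractor(x, k):
    """Negative derivative of Equation 3: F(x) = -(x - k)."""
    return -(x - k)


def settle(force, x0, k, dt=0.01, steps=5000):
    """Follow the noise-free dynamics dx/dt = F(x) from x0 with small Euler steps."""
    x = x0
    for _ in range(steps):
        x += force(x, k) * dt
    return x
```

For k = 0, `settle(force_two_attractors, 2.0, 0.0)` ends at the right attractor (x = 1) and `settle(force_two_attractors, -2.0, 0.0)` at the left one (x = −1). For a small positive value such as k = 0.2, the right attractor lies slightly further to the right and has a lower potential than the left one. For Equation 3, the state settles at x = k.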

A useful metaphor to illustrate how noise works in a dynamical system is to imagine a ball rolling through an attractor landscape like the one in **Figure 7** (left). When the ball is put into the attractor landscape at some random point, it will roll down into one of the two attractor valleys. We can enrich this metaphor by adding wind to the system that represents the notion of noise, a very important component in dynamical systems (Haken, 1977). In this scenario, the ball is pushed away from its original trajectory from time to time. Sometimes these gusts of wind are strong and the ball is pushed far away, sometimes they are weak and it is only perturbed slightly. When the control parameter k is 0, and the two attractors of the system are symmetrical, it takes the same strength of wind gusts to push the ball out of both attractors. But if k ≠ 0, one of the attractor basins is deeper. For this deeper attractor, it will take stronger gusts of wind to push the ball out of it. Thus, this attractor is more stable than the other.

Another crucial feature of dynamical systems is that they can exhibit qualitative changes, also called bifurcations, as a control parameter is scaled continuously (Gafos and Benus, 2006; Kelso, 2013). The model of Haken et al. (1985), for example, describes the shift between anti-phase and in-phase coordination of finger movements as an abrupt change in an attractor landscape that occurs when the tempo of the movement is scaled up continuously (anti-phase: 180° relative phase; in-phase: 0° relative phase). Starting at anti-phase coordination and scaling the tempo up, the mode of coordination remains anti-phase for some time but "breaks down" and changes to in-phase at a certain upper threshold. In the lower tempo ranges, two coordination patterns are possible (in-phase and anti-phase) while beyond the critical boundary, only one coordination pattern, in-phase, is possible. To model this phenomenon, Haken et al. (1985) proposed a dynamical system with two attractors for the lower range of tempo values (one attractor for in-phase and one attractor for anti-phase). For higher tempo values, the model exhibits a simpler landscape with a sole attractor for in-phase.

Equation 4 gives another example system. In **Figure 8**, the consequences of scaling of this system's control parameter k can be observed: As long as k has a value below 0, the system is characterized by a mono-stable attractor landscape (one attractor). As the parameter k passes 0, the landscape becomes bistable (two attractors).

$$V(x) = \frac{x^4}{4} - k\frac{x^2}{2} \tag{4}$$
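The location of the bifurcation at k = 0 can be read off the stationary points of Equation 4. The short derivation below is added for illustration and uses only the potential given above.

$$\frac{dV}{dx} = x^3 - kx = x\left(x^2 - k\right) = 0 \quad\Longrightarrow\quad x = 0 \quad\text{or}\quad x = \pm\sqrt{k} \;\; (k > 0)$$

$$\frac{d^2V}{dx^2} = 3x^2 - k, \qquad \left.\frac{d^2V}{dx^2}\right|_{x = 0} = -k, \qquad \left.\frac{d^2V}{dx^2}\right|_{x = \pm\sqrt{k}} = 2k$$

For k < 0, x = 0 is the only stationary point and it is a minimum, so the landscape has one attractor. For k > 0, x = 0 becomes a maximum and two minima appear at x = ±√k, so the landscape has two attractors that move apart as k grows.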

#### Modeling the Tonal Onglide

The part of the model dealing with the intonation side of our data is based on three observations: First, the proportion of falling and rising accents changes from broad to narrow focus, and from narrow to contrastive focus such that the number of rises increases. Second, the magnitude of the rises shifts subtly toward more extreme values, i.e., the rises become increasingly large from broad to narrow focus, and from narrow to contrastive focus. Third, the shape of the distribution changes from unimodal ("flat") to bimodal ("rising" vs. "falling") when going from background to broad focus.

In the two examples of dynamical models above we have laid out the foundations of how we can incorporate these observations into our model. The presence of two modes in the tonal onglide data for broad, narrow and contrastive focus but only one mode for background requires that we use a model with a bistable attractor landscape for a certain range of control parameter values and a monostable attractor landscape for a different range of control parameter values. Within the range of bistability, a change in the control parameter should cause a tilt to the rising side of the attractor landscape. This tilt must go hand in hand with a slight shift of the location of the deepest point of the attractor toward higher values of the state space (the x axis in the graphs of the potential energy function).

One possible model is given by the potential energy function V(x) in Equation 5. **Figure 9** illustrates the consequences of changing the control parameter k: When k is smaller than zero, the system has a single attractor. As it passes zero, it becomes bistable. When k is scaled further, the system tilts to the right giving the right attractor more stability.

$$V(x) = \frac{x^4}{4} - \left(1 - e^{-k}\right) \frac{x^2}{2} - \left|k\right| \left(k - 1\right) \frac{x}{4} \tag{5}$$

We take the system expressed by Equation 5 as a model for our onglide data and use simulations to evaluate predictions of the system to assess how well it can account for the structure of our observational data. We use a simulation method inspired by the software accompanying Gafos (2006), reimplemented and modified for our purposes. The code is available for download: https://osf.io/jx8cn.

The simulation operates on the force function, the negative derivative of the potential energy function. It starts at a random initial state and estimates the solution to the corresponding stochastic differential equation (Brown et al., 2006). The method calculates the change of the system at the current state and adds it to the current state to get to the next state. For the sake of simplicity, the simulation implements a time window that always has the same length. Thus, after a fixed period of time, i.e., a fixed number of small time steps, in our case 10,000, a single simulation run stops and the current state is registered as the result. Crucially, during each step of the simulation, Gaussian noise is added to the current state. By adding noise, the simulation results are able to reflect the patterns of relative stability of the attractors: Noise pushes the system away from its current state, but the more stable an attractor, the smaller the influence of noise on this state. In other words, when the system is close to a more stable attractor, the probability is higher that it will stay in the basin of the attractor despite the noise. On the contrary, when the system is near a less stable attractor, it is more likely to be pushed away from the attractor basin eventually ending up in the vicinity of the more stable attractor. The simulation is run 10,000 times (i.e., 10,000 data points with 10,000 time steps each). We can conceive of a single simulation run as one production of an intonation contour.

We use the k values exemplified by the corresponding attractor landscapes in **Figure 9** for the four focus types. Background is modeled with k = −1, broad focus is modeled with k = 1, narrow focus is modeled with k = 1.4, contrastive focus is modeled with k = 1.7. The results of the simulations are shown in **Figure 10**. The same pattern as in the results for the tonal onglide can be observed here: the system produces a unimodal distribution slightly below zero for background. The distribution for broad focus is symmetrical. In narrow and contrastive focus, the right mode (rising) becomes increasingly strong. The mean values of the rising distributions also show essentially the same stepwise increase for the "accented" focus types (broad, narrow and contrastive focus), as presented in **Figure 11**. This shows that the attractor basin moves on the dimension of possible states toward more extreme values when the control parameter value is increased and the attractor landscape tilts to the right side.
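The simulation procedure can be sketched as follows. This is a minimal Euler-Maruyama sketch in Python and not the code published at the OSF link above. The k values are those given in the text. The noise strength, the step size, the range of initial states and the much smaller numbers of runs and steps are assumptions chosen so that the example runs quickly in plain Python.

```python
import math
import random

# k values from the text; all other settings below are illustrative assumptions.
FOCUS_K = {"background": -1.0, "broad": 1.0, "narrow": 1.4, "contrastive": 1.7}


def force(x, k):
    """Negative derivative of Equation 5: F(x) = -x^3 + (1 - e^-k) x + |k| (k - 1) / 4."""
    return -x**3 + (1.0 - math.exp(-k)) * x + abs(k) * (k - 1.0) / 4.0


def simulate(k, n_runs=200, steps=1000, dt=0.02, noise_sd=0.3, seed=0):
    """Return the final states of n_runs independent noisy trajectories."""
    rng = random.Random(seed)
    noise_scale = noise_sd * math.sqrt(dt)  # Euler-Maruyama scaling of the noise term
    finals = []
    for _ in range(n_runs):
        x = rng.uniform(-1.0, 1.0)  # random initial state (range is an assumption)
        for _ in range(steps):
            # Deterministic change from the force plus Gaussian noise at every step.
            x += force(x, k) * dt + noise_scale * rng.gauss(0.0, 1.0)
        finals.append(x)  # the state after the fixed time window is the result
    return finals


def rising_share(finals):
    """Proportion of runs that end on the rising side of the state space (x > 0)."""
    return sum(1 for x in finals if x > 0) / len(finals)


if __name__ == "__main__":
    for name, k in FOCUS_K.items():
        finals = simulate(k)
        print(name, round(rising_share(finals), 2), round(sum(finals) / len(finals), 2))
```

Each call to `simulate` corresponds to one focus condition, and each element of the returned list to one simulated production. Plotting the returned values per condition gives distributions that can be compared with the shapes described for **Figure 10**.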

#### Enriching the Model

As outlined in the results section, not only the proportion and the scaling of accents are modified by speakers to express focus types, but the lip and tongue body kinematics of the vowel /a/ are also affected. The lips are opened wider, the tongue body position is lower. We can view these modifications as the outcome of a multi-dimensional system of prosody to signal information structure. In this system, the control parameter is used to scale the attractor landscape on many dimensions to achieve the bundle of prosodic modifications. The attractors of the landscape are the result of the combination of these multiple dimensions. The way in which the dimensions shape the multi-dimensional attractor landscape will, however, be different: Some of the dimensions will contribute a rather complex shape, like the tonal onglide with its two stable states for falling and rising—a dimension of the system that can be described well with the two-attractor landscape. Other dimensions will contribute a simpler shape, like the lip and tongue body movements, that can be described with a monostable attractor landscape.

**Figure 12** attempts to give an impression of a system with more than one dimension. It combines the landscape for the tonal onglide defined in the previous section with a parabolic landscape for the Euclidean distance of the lips, which could be modeled by a potential energy function such as the one given in Equation 3 above. This results in the potential energy function given in Equation 6, which models the tonal onglide as the state of the variable x, and the lip aperture as the state of the variable y. In this function, the control parameter k affects both dimensions.

$$V(x, y) = \frac{x^4}{4} - \left(1 - e^{-k}\right)\frac{x^2}{2} - \left|k\right|\left(k - 1\right)\frac{x}{4} + \frac{\left(y - k\right)^2}{2} \tag{6}$$

As in the one-dimensional illustrations above, the potential energy of the system is drawn on the vertical axis. The left panel shows what the attractor landscape looks like when the control parameter k is 1. In the tonal onglide dimension, both falling and rising onglides are equally possible. The right panel illustrates what the attractor landscape looks like when the control parameter k is increased to 1.4. Now, on the tonal onglide dimension, the right attractor has gained more stability. This leads to more instances of this pitch accent category (i.e., rising) and larger rises. In addition, this attractor has moved toward more extreme values. On the lip aperture dimension, the deepest point of the parabolic attractor has drifted toward more extreme values, too. Although we can only visualize two dimensions here, we can imagine that more than two dimensions can shape the attractor landscape. And in fact, it seems plausible to assume that even more than the three dimensions investigated in this paper contribute to the prosodic marking of focus.

The probability density function of a non-deterministic, first-order dynamical system can be found as a stationary solution to the Fokker-Planck equation for the system (Haken, 1977; Gafos and Benus, 2006). In **Figure 13**, the graphs of probability functions are given for the system with two dimensions and the control parameter values used in the previous section to model the focus types (background: k = −1, broad focus: k = 1, narrow focus: k = 1.4, contrastive focus: k = 1.7). In **Figure 14**, the same distributions are given from a different perspective to make it easier to grasp the change on the lip aperture dimension. While the tonal onglide becomes bistable as the parameter k is scaled from −1 to 1 and then gains more and more stability on the right mode, the attractors also move on the dimension of lip aperture: first with a big step, from background to broad, and then subtly when going from broad focus to narrow focus, and from narrow focus to contrastive focus. Note that on this dimension the change is similar to what happens to the rising accents of the tonal onglide: While the probability on this dimension remains characterized by a single mode, this mode moves toward more extreme values when k is scaled.
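For a system of the form dx = −V′(x) dt + σ dW, the stationary solution of the Fokker-Planck equation is proportional to exp(−2V(x)/σ²). The Python sketch below evaluates this density numerically for the tonal onglide dimension of the model. The noise convention, the value of `noise_sd` and the grid settings are assumptions made for illustration; the source does not state which values were used for **Figure 13**.

```python
import math


def potential_x(x, k):
    """Tonal onglide part of the potential (Equation 5)."""
    return x**4 / 4 - (1 - math.exp(-k)) * x**2 / 2 - abs(k) * (k - 1) * x / 4


def stationary_density(k, noise_sd=0.3, lo=-3.0, hi=3.0, n=601):
    """Return (grid, density) with density proportional to exp(-2 V(x) / noise_sd^2)."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    weights = [math.exp(-2.0 * potential_x(x, k) / noise_sd**2) for x in grid]
    total = sum(weights) * step  # simple Riemann sum used as normalizing constant
    return grid, [w / total for w in weights]


def rising_mass(k, noise_sd=0.3):
    """Probability mass on the rising side of the state space (x > 0)."""
    grid, density = stationary_density(k, noise_sd)
    step = grid[1] - grid[0]
    return sum(d for x, d in zip(grid, density) if x > 0) * step
```

Because the lip aperture term (y − k)²/2 is added separately, the two-dimensional density factorizes under the same assumptions: the y dimension contributes a single Gaussian mode centered on k, which shifts as k is scaled, while the x dimension carries the change in the proportion of the two modes.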

The dimension of the tongue position also contributes a single attractor that is very similar to the one for the lip aperture, except that an increase in k makes it move toward lower values (the tongue body is lowered). Equation 7 represents an attempt to sketch how such a system could be described with a potential energy function of three variables.

$$\begin{split} V(x, y, z) &= \frac{x^4}{4} - \left(1 - e^{-k}\right) \frac{x^2}{2} - \left|k\right| \left(k - 1\right) \frac{x}{4} \\ &\quad + \frac{\left(y - k\right)^2}{2} + \frac{\left(z + k\right)^2}{2} \end{split} \tag{7}$$

It should be noted that none of the functions given here reproduces the measured values exactly. We have focussed on the qualitative correspondence of the experimental observations and the theoretical model (which the presented system is able to capture). The coefficients for the model are chosen for presentation purposes here. For example, the differences between the values for the focus types with regard to the lip aperture are greater compared to the tongue movement as different articulators naturally produce different magnitudes of movements. This fact is not reflected in the system. The system only provides a scheme of how we can picture the score of prosodic dimensions in a single system with one control parameter.

# DISCUSSION

In this study, we have presented data on the prosodic marking of focus in German from 27 speakers. These data contribute to the increasing evidence of the systematic use of continuous variation in speech and the deep intertwining of this continuous variation with categorical variation. Moreover, the data show how speakers use a combination of cues related to the laryngeal and supra-laryngeal tiers to enhance prosodic prominence. This combination of prosodic dimensions is taken up by our dynamical model.

With regard to the intonation results, our analysis shows that there is no one-to-one mapping between focus types and accent types. However, there are probabilistic tendencies that can be described as patterns of relative stability between the quasi-categories represented by the attractors. With regard to the articulatory results, the study adds evidence to the finding that prosodic prominence is expressed gradually: There are not only modifications in terms of prosodic strengthening between unaccented and accented, but also within the group of accented targets to make the word more prominent. The increase in lip aperture during vowel production can be viewed as sonority expansion, while the corresponding lowering of the tongue body position can be interpreted as hyperarticulation of the vowel /a/ by enhancing the vowel's place feature [+low] and an increase in sonority at the same time. Since the vowel is low, the strategies of localized hyperarticulation and sonority expansion are compatible. The speakers intend to produce louder and more peripheral vowels (de Jong, 1995; Harrington et al., 2000; Cho, 2006; Mücke and Grice, 2014). Our results are generally in line with the findings of Mücke and Grice (2014) for German. The data support the assumption that prosodic strengthening in the articulatory domain is not just a concomitant of accentuation but is directly controlled to express different degrees of prominence. However, the modifications between target words in background and broad focus reported in the present study are stronger than those reported in Mücke and Grice (2014), who did not find systematic differences between background and broad. This might be attributed to the fact that the data set of the present paper (27 speakers) is considerably larger than in the study by Mücke and Grice (5 speakers) and therefore less sensitive to speaker-specific variation.

The results of the present study underscore that it is fruitful to analyse categorical and continuous aspects jointly and that theoretical devices that treat phonology and phonetics as a single system are needed. The dynamical perspective of the mind as endorsed by many researchers within the fields of phonology and phonetics (Browman and Goldstein, 1986; Tuller et al., 1994; Port, 2002; Gafos and Benus, 2006; Nava, 2010; Mücke, 2018) and beyond (Haken et al., 1985; Thelen and Smith, 1994; Kelso, 1995; Smith and Thelen, 2003; Spivey and Dale, 2006; Spivey, 2007) is well-suited to provide a view on the sound patterns of language without the need for a translation process between categorical and continuous aspects.

With respect to this intertwining of categorical and continuous aspects of prosodic prominence, it is worthwhile to take a short look at how the current approach relates to the widespread view of prosodic prominence as a characteristic of a hierarchically organized structure. In the literature, different hierarchies of prosodic structure have been proposed (Nespor and Vogel, 1986; Pierrehumbert and Beckman, 1988; Hayes, 1989; Selkirk, 1996; Shattuck-Hufnagel and Turk, 1996). Although the proposals disagree as to the existence of some levels, they all share the assumption that utterances can be decomposed into hierarchically organized constituents. A minimal structure that most researchers in the field agree upon can be outlined as follows (Grice, 2006): An utterance consists of one or more intonational phrases which contain one or more smaller phrases (e.g., an intermediate phrase). A constituent on the smallest level of phrasing contains one or more words, a word contains one or more feet, and a foot contains one or more syllables. Regarding the results of the current study, it is interesting to look at how this prosodic hierarchy has been related to prosodic prominence. One approach is to assume that the levels in the hierarchy are headed by prominences (Beckman and Edwards, 1994; Shattuck-Hufnagel and Turk, 1996). For example, a nuclear pitch accented syllable is the head of an intermediate phrase. Applying this view to the productions of the current corpus, this theory would interpret the increase of supra-laryngeal articulatory effort in the target word's stressed vowel as a correlate of the reorganization in the prosodic prominence structure as the nucleus is placed on the target word and hence the head status is moved from the stressed syllable of the direct object (the tool) to the stressed syllable of the target word. 
In our model of the production of prosodic patterns, the attractor basin situated on the continua of the articulatory dimensions moves toward more extreme values. In the tonal domain, controlled by the laryngeal system, we model this reorganization as a bifurcation on the dimension of onglide such that the system evolves from monostability (flat f0) toward bistability to reflect that the newly assigned nuclear pitch accent can be falling or rising.

However, the findings of the current study go beyond what we can conceptualize as a reorganization of the head-assignment in the prosodic hierarchy. They contribute to an understanding of prosodic prominence that is sensitive to both categorical and more fine-grained, continuous phenomena. When we look at the productions with the nuclear pitch accent in the same position, i.e., the same assignment of the head status, we observe that the change of the focus type (broad focus → narrow focus → contrastive focus) leads to an additional increase in prominence with an increase in articulatory effort, a higher probability of rising accents, and larger tonal onglides. In the modeling approach, this is reflected by an increase in the continuous control parameter.

Support for the idea that the structure of prosodic prominence in the phrase can be modified even in cases where the nuclear pitch accent is not reassigned, i.e., the nuclear pitch accent remains on the target word, comes from work on the perceived prominence of pitch accent types by Baumann and Röhr (2015). Their study showed that, in general, rising accents are perceived as more prominent than falling accents. Beyond the level of reorganization of the prosodic hierarchy, the choice and realization of the nuclear pitch accent work on the assignment of prosodic prominence. In our view, all these processes are the result of a non-linear dynamical system that does not assume a separation of the categorical, phonological, and the continuous, phonetic level.

In the modeling section of the present work, we have sketched a system that brings together different dimensions of prosodic prominence. The dimensions contribute to the shared attractor landscape in different manners. In the most complex dimension, the tonal onglide, we can see how the continuous scaling of a control parameter can lead to qualitative changes: The landscape goes from monostable (unaccented) to bistable (accented). The bistable landscape is then able to account for the proportions of falling and rising accents (categorical variation) as well as the increase in rising onglides (continuous variation). We have demonstrated a scenario in which one control parameter can account for changes in a multidimensional space including intonation and articulation. As already mentioned, the model does not attempt to exactly reproduce the values obtained from the phonetic analyses. It is rather seen as a proof of concept to demonstrate how we can think of prosody in a dynamical systems framework. The results presented in this paper concentrate on a subset of phonetic dimensions that play an important role for prosodic prominence, and so the model outlined on the basis of these results is restricted. In fact, the state space of a full model would include all relevant parameters including dimensions related to duration and relative timing. For example, in the articulatory domain, the duration of the lip and tongue movements is expected to be longer in prominent syllables. But even with a more complex model (one that could also include more than one control parameter), the main idea persists: the same mechanism that modulates the tonal domain also leads to changes in the articulatory domain. The domains with their multiple dimensions form a bundle to be used by the speaker to express prosodic prominence. These bundles might vary between languages; the attractor landscapes are conceptualized as part of the speaker's knowledge of phonetics and phonology.
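The qualitative change summarized above, from a monostable to a bistable landscape as a single control parameter is scaled, can be illustrated with a minimal Python sketch. The quartic potential used here is a textbook normal form chosen purely for illustration; it is an assumption and not the potential function of the modeling section, and the names `potential`, `attractors`, `k`, and `tilt` are hypothetical.

```python
def potential(x, k, tilt=0.0):
    """Generic quartic potential (illustrative assumption, not the paper's equation).

    x    : state on a single dimension (read here as tonal onglide)
    k    : control parameter; k <= 0 gives one well, k > 0 gives two
    tilt : hypothetical asymmetry favoring positive (rising) values
    """
    return x**4 / 4 - k * x**2 / 2 - tilt * x


def attractors(k, tilt=0.0, lo=-3.0, hi=3.0, steps=6001):
    """Locate attractors as local minima of the potential on a regular grid."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    vs = [potential(x, k, tilt) for x in xs]
    return [
        xs[i]
        for i in range(1, steps - 1)
        if vs[i] < vs[i - 1] and vs[i] < vs[i + 1]
    ]


if __name__ == "__main__":
    for k in (-1.0, 0.5, 1.0):
        print(k, [round(a, 2) for a in attractors(k)])
```

In this toy reading, a negative `k` yields a single attractor near zero (flat f0), whereas a positive `k` yields two attractors that move apart as `k` grows. A positive `tilt` deepens the positive attractor and shifts it outward, which loosely mirrors more frequent and larger rising onglides.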

The concept of a multi-dimensional attractor landscape can in principle be extended to any number of dimensions, and is in line with the finding that phonological entities are characterized by many dimensions (Lisker, 1986; Coleman, 2003; Winter, 2014; Mücke, 2018) and that intonational categories are no exception (for Italian and German: Niebuhr et al., 2011; for German: Cangemi et al., 2015; for Italian: Cangemi and Grice, 2016; for English: Barnes et al., 2012, inter alia). Furthermore, in this work, we have conceptualized the dimensions to be orthogonal. Future research should investigate how the different dimensions interact. In addition, the model proposed in the current work takes into account the patterns of all speakers pooled together. In Roessig et al. (2019), we take a closer look at the intonation patterns of different speaker groups. We demonstrate that it is possible to conceptualize the different speaker-specific patterns as different uses, or scaling strategies, of the same system. For the unidimensional system presented in that study, it seems to be sufficient to assume that speakers use different ranges of values for the control parameter. For a more complex system, it might be necessary to assign more weight to one or more dimension in order to reflect the fact that speakers might not exploit all phonetic dimensions to the same degree.

The model presented in the current paper is a model of the production of prosodic patterns. We can, however, speculate that the perception of prosodic patterns can be modeled in a similar fashion. Attractors offer a flexible framework to model stability and variability in systems of different kinds and different environments. As such, they are also applicable to speech perception. In fact, similar models have been employed to account for phenomena in the perception of speech sound or lexical access (Tuller et al., 1994; Spivey et al., 2005). In addition, we might speculate that there is a strong connection between the attractor landscapes for production and those for perception, including a huge variety of acoustic and articulatory cues (Baumann and Winter, 2018; Gafos et al., 2019)—but this topic is beyond the scope of the current study and has to be left open for future research.

## DATA AVAILABILITY

The datasets generated for this study are available for download: https://osf.io/jx8cn/.

## ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Local Ethics Committee of the University of Cologne with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Local Ethics Committee of the University of Cologne (application 16–404).

## AUTHOR CONTRIBUTIONS

SR and DM: substantial contributions to the conception and design of the work as well as the acquisition, analysis, and interpretation of data for the work.

## FUNDING

This work was supported by the German Research Foundation (DFG) as part of the SFB1252 Prominence in Language in the project A04 Dynamic modeling of prosodic prominence at the University of Cologne.

#### ACKNOWLEDGMENTS

The authors thank Timo B. Roettger and Bastian Auris for their advice on the statistical analyses, Stefan Baumann for discussions about prosody, as well as the reviewers for their helpful comments. All errors are ours.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcomm.2019.00044/full#supplementary-material

Figure S1 | Example screen from experiment during a trial with contrastive focus condition.

Figure S2 | Distributions of the maximal lip aperture (z-scored) for all speakers.

Figure S3 | Distributions of the lowest tongue body position (z-scored) for all speakers.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Roessig and Mücke. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Emergence of Discrete Perceptual-Motor Units in a Production Model That Assumes Holistic Phonological Representations

#### Maya Davis and Melissa A. Redford\*

*Department of Linguistics, University of Oregon, Eugene, OR, United States*

#### Edited by:

*Adamantios Gafos, University of Potsdam, Germany*

#### Reviewed by:

*Marilyn Vihman, University of York, United Kingdom Ben Parrell, University of Wisconsin-Madison, United States*

> \*Correspondence: *Melissa A. Redford redford@uoregon.edu*

#### Specialty section:

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

Received: *29 April 2019* Accepted: *02 September 2019* Published: *18 September 2019*

#### Citation:

*Davis M and Redford MA (2019) The Emergence of Discrete Perceptual-Motor Units in a Production Model That Assumes Holistic Phonological Representations. Front. Psychol. 10:2121. doi: 10.3389/fpsyg.2019.02121*

Intelligible speakers achieve specific vocal tract constrictions in rapid sequence. These constrictions are associated in theory with speech motor goals. Adult-focused models of speech production assume that discrete phonological representations, sequenced into word-length plans for output, define these goals. This assumption introduces a serial order problem for speech. It is also at odds with children's speech. In particular, child phonology and timing control suggest holistic speech plans, and so the hypothesis of whole word production. This hypothesis solves the serial order problem by avoiding it. When the same solution is applied to adult speech the problem becomes how to explain the development of highly intelligible speech. This is the problem addressed here. A modeling approach is used to demonstrate how perceptual-motor units of production emerge over developmental time with the perceptual-motor integration of holistic speech plans that are also phonological representations; the specific argument is that perceptual-motor units are a product of trajectories (nearly) crossing in motor space. The model, which focuses on the integration process, defines the perceptual-motor map as a set of linked pairs of experienced perceptual and motor trajectories. The trajectories are time-based excursions through speaker-defined perceptual and motor spaces. By hypothesis, junctures appear where motor trajectories near or overlap one another in motor space when the shared (or extremely similar) articulatory configurations in these regions are exploited to combine perceptually-linked motor paths along different trajectories. Junctures form in clusters in motor space. These clusters, along with their corresponding (linked) perceptual points, represent perceptual-motor units of production, albeit at the level of speech motor control only. 
The units serve as pivots in motor space during speaking; they are points of transition from one motor trajectory to another along perceptually-linked paths that are selected to produce best approximations of whole word targets.

Keywords: speech production, speech acquisition, perceptual-motor integration, mathematical model, whole-word representations, dual lexicon model

# 1. INTRODUCTION

Speech can be experienced as a sequence of discrete sounds, at least among literate adults who have used a phonemic writing system from a young age. Linguistic theory in the west has leveraged this experience. Discrete sound units, such as phonemes, have been used by linguists to great analytic and practical advantage in work on the sound patterns of language. This is because "phonemic theory provides a basis for representing the physiological time functions of speech by discrete symbolic sequences" (Peterson and Harary, 1961, p. 140). Peterson and Harary go on to explain, in the proceedings from the 12th Symposium in Applied Mathematics, that "(an) essential part of this theory is the organization of the phone, a basic phonetic unit, into higher order sets of allophones and phonemes." They then argue that the basis for treating sounds as discrete symbols, embedded in hierarchies of sets, is the mathematical theory of types and equivalence relations. This argument helps explain why the discrete sound units of phonology have been so useful in linguistics: like mathematics, they provide a tool for rigorous description. In this paper, we take a different approach from Peterson and Harary. Rather than using the language of mathematics to motivate phonemic theory, we use it to rigorously describe a model that provides an alternative to the linguistic representation of discrete sound units, at least for understanding spoken language production. Our immediate objective is to demonstrate that the hypothesis of whole word production is compatible with adult-like speech motor control, which references speech motor goals. The larger objective is to formalize a developmentally sensitive theory of production that limits the serial order problem in spoken language to the level of phrase production.

## 1.1. The Problem

The literate adult's awareness of discrete sounds in speech has motivated psycholinguistic theory as much as linguistic theory. Phonemes, in particular, have for a long while been understood as psychologically real units of language (Baudouin de Courtenay, 1881, cited in Koerner, 1972; Chomsky and Halle, 1965; Fromkin, 1971). One implication of this idea is that phonemes are relevant to speech production. In fact, a great deal of work in speech production since the 1970s has explicitly argued as much (e.g., Fromkin, 1971; Shattuck-Hufnagel, 1979; Stemberger, 1982; Dell, 1986; Levelt, 1989; Guenther, 1995; Roelofs, 1997; Schiller, 2000; Goldrick and Rapp, 2007; Hickok and Poeppel, 2007; Hickok, 2012; Turk and Shattuck-Hufnagel, 2014). A consequence of this hypothesis is the serial order problem (Lashley, 1951); that is, the problem of how discrete units are sequenced for output<sup>1</sup>. Psycholinguistic theory has addressed this problem by proposing a speech planning phase during production (see, e.g., Shattuck-Hufnagel, 1979; Levelt, 1989; Roelofs, 1997; Schiller, 2000; Goldrick and Rapp, 2007). This phase, known as phonological/phonetic encoding, is characterized as a sequential process of word form encoding that begins with phoneme sequencing within prosodic frames and ends with the context-dependent specification of phonetic information. Elsewhere, Redford has argued against this encoding hypothesis on developmental grounds (Redford, 2015, 2019); others have noted its incompatibility with the evidence that disruptions to phonological working memory do not in fact disrupt speech production any differently than, say, disruptions to visual-spatial working memory (Gathercole and Baddeley, 1993, Ch. 4; Lee and Redford, 2015).
Relatedly, whole research programs in phonetics and phonology (e.g., Autosegmental Phonology, Articulatory Phonology) have questioned the psychological reality of the phoneme and its importance in sequential speech planning based on evidence such as the long-distance acoustic and motor dependencies between "segments" in speech (i.e., coarticulation). Yet these programs also propose discrete phonological representations; for example, autosegmental phonologists favor distinctive features and articulatory phonologists propose the gesture, which is similar in some respects to the distinctive feature. Here, we argue against the general idea that discrete linguistic representations of sound are relevant to speech planning, and for the alternative, which is that word forms are remembered and retrieved holistically for production.

The whole word production hypothesis is particularly important and long-standing in child phonology where it has been used to explain the variability in a child's repeated production of the same word, the relationship between the child's production and the adult target, and the relationship between different words in the child's productive repertoire (Vihman and Croft, 2007; see also Vihman and Keren-Portnoy, 2013, and contributions therein). A version of the hypothesis is also advanced in Articulatory Phonology where word form representations are articulatory gestalts; more specifically, they are abstract and overlapping representations of discrete linguistic gestures used to produce the word (Browman and Goldstein, 1989, 1992; Goldstein et al., 2006)<sup>2</sup>. Yet another version of the hypothesis is proposed in Redford's (2015, 2019) developmentally sensitive theory of spoken language production. In this theory, the representations that guide adult speech are imagined as identical in kind to the holistic perceptual and motor phonological forms that underlie early child language. The perceptual representations posited are whole words derived from the ambient language, as in exemplar theories of phonology (Goldinger, 1998; Pierrehumbert, 2001; Hawkins, 2003; Johnson, 2006); the motor representations are abstracted with speech practice from sensorimotor experience, as in schema theories of action control (Schmidt, 1975; Norman and Shallice, 1986; Arbib, 1992; Cooper and Shallice, 2000). The proposal in Redford (2019) is that the whole word perceptual and motor phonological forms are co-activated during production and integrated via the perceptual-motor map. We take this proposal to be a strong version of the whole word production hypothesis and defend it here.

<sup>1</sup>This statement is consistent with the modern conception of the serial order problem, but it mischaracterizes Lashley's (1951) argument. Lashley proposed that complex skilled action is effected with reference to a control structure of hierarchically-arranged discrete action units. In other words, the problem that Lashley addressed was that of complex skilled action; his solution introduced the serial order problem as it is currently conceived.

<sup>2</sup>Gestures are abstract representations of linguistically-significant vocal tract constrictions, similar to distinctive features but with intrinsic timing; for example, a labial vs. alveolar gesture gives rise to the minimal pair "bog" vs. "dog" where these gestures are co-produced (temporally overlapped) with the gesture associated with the following vowel.

The proposal that holistic phonological representations provide the plans that guide adult word production requires defense because adults produce highly differentiated speech sounds. To do so, the speaker must consistently achieve specific vocal tract constrictions in rapid sequence. These constrictions suggest speech motor goals, defined as planned outcomes that are referenced in the control of speech movement. The suggestion of goals is strongly supported by the many natural and experimental demonstrations of motor equivalence (see Perrier and Fuchs, 2015). For example, adult speakers adapt nearly immediately to unexpected perturbations of the lips and jaw to achieve bilabial closure for bilabial consonants (Folkins and Zimmermann, 1982; Kelso et al., 1984; Shaiman and Gracco, 2002; van Lieshout and Neufeld, 2014); they also make very rapid adjustments during repeated productions of the same vowel if the auditory feedback they receive does not match the formant frequencies of the vowel they intended to produce (Houde and Jordan, 1998; MacDonald et al., 2010; Katseff et al., 2012; Lametti et al., 2012). The different types of adjustments indicate the importance of different types of information in speech motor control: the nearly instantaneous adaptation to mechanical perturbations of the articulators suggests that specific vocal tract constrictions are goals (e.g., Saltzman and Munhall, 1989; Liberman and Whalen, 2000; Sorensen and Gafos, 2016); on-going adjustment to articulation in response to perturbed auditory or sensory feedback suggests perceptual goals (e.g., Katseff et al., 2012; Lametti et al., 2012). But no matter the type of goals assumed, they are linked to discrete phonological representations in current theory. When the goal is a constriction, its phonological representation is the linguistic gesture; when it is perceptual, it is associated with the phoneme.

In this paper, we seek to accommodate the evidence for goals in speech motor control absent discrete phonological representations. More specifically, we address the challenge implicit in Bohland's (Bohland et al., 2010, p. 1509) argument against holistic phonological representations; namely, that the whole word production hypothesis "is incompatible with the exquisite control of vocal performance that speakers/singers retain for even the highest frequency syllables." Our approach to this challenge is to model the integration of holistic perceptual and motor plans via the perceptual-motor map. The model we develop shares many assumptions of an information processing approach to speech motor control, especially the assumption that perception is important for speech motor control. The key difference is that our focus is not on the execution of speech, but rather on how perceptual-motor units of production emerge as motor space is reticulated with language acquisition. Another fundamental difference is that we explicitly address the relationship between phonology and speech motor control, and, in so doing, propose a motor phonological representation that is substantively different from the representations posited in current linguistic theory. Overall, the model objective is to demonstrate the in principle plausibility of the whole word hypothesis for understanding production in the context of adult-like speech articulation. Future research will address the in principle plausibility of the hypothesis for understanding production in the context of speech errors. This future work is necessary to complete the argument that the serial order problem in speech should be limited to sequencing words. We acknowledge that speech errors are a major source of evidence for a hypothetical phonological/phonetic encoding stage in speech production.

# 2. THE CORE MODEL

Perceptual-motor integration is a core assumption in neuropsychological models of speech production that assume perceptual goals (e.g., Guenther, 1995, 2006, 2016; Hickok and Poeppel, 2007; Houde and Nagarajan, 2011; Gow, 2012; Hickok, 2012). The Directions into Velocities of Articulators model (DIVA; Guenther, 1995, 2006, 2016) is perhaps the best known and most completely developed of these models. DIVA provides a framework for understanding both the neuropsychology of speech motor control and the details of this control, including motor equivalence and coarticulation. In contrast, we seek to demonstrate that the whole word production hypothesis is compatible with adult-like speech motor control. To do this, we imagine perceptual-motor integration in speech from a developmental perspective given the domain knowledge of a phonetician. The result is the Core model, which is proposed here in the context of Redford's (2015, 2019) developmentally sensitive theory of spoken language production.

The Core model is similar to DIVA in that it assumes a sound space and a motor space; it also envisions the perceptual-motor integration of speech with reference to trajectories through these spaces; however, the motor space in Core is more similar to the somatosensory space in DIVA than to its motor space. This is because Core does not address control over articulatory movements per se. The model is in fact agnostic on the question of how articulatory movements are themselves organized given a particular trajectory<sup>3</sup>. Another difference between the models is that, in Core, adult-like production relies by default on state feedback control rather than on feedforward processes (see Houde and Nagarajan, 2011). Thus, a matching and selection process on perceptual trajectories determines the path taken through motor space. Importantly, this process references holistic phonological representations that are the speech plan. By contrast, in DIVA, trajectories are defined by the sequential activation of cells in the speech sound map, that is, by a discretized plan. Below, we provide an informal overview of the Core model. This entails the introduction of a number of model-specific terms. More precise definitions of these terms are given later when the model is more rigorously described.

## 2.1. Overview

Core is designed to accommodate developmental change and the flow of activation in speech production from conceptualization to perceptual-motor integration. The proposed representations allow for change. The major components, or levels, in the model indicate flow in the production process. **Figure 1** illustrates the relationship between the representations and levels to help frame the informal narrative description of the model given in this section.

<sup>3</sup>The processes modeled in Core are nonetheless compatible in principle with a dynamical systems approach to this separate question of articulatory coordination (e.g., Saltzman and Munhall, 1989; Sorensen and Gafos, 2016).

Core assumes phonological representations that are distinct sets of holistic perceptual and motor forms associated with specific meanings: for example, with a nominal category like "dog," a social-pragmatic category like "psst," or a discourse device like "by the way." The perceptual word forms are exemplars. The acquisition of these requires that the listener segment ambient language input into meaningful units. The relevant input is speech produced by those with whom the listener interacts or to whom they otherwise attend, which is why the auditory memories are socially indexed (see Goldinger, 1998; Pierrehumbert, 2001; Hawkins, 2003; Johnson, 2006). The motor word forms are schema composites we call silhouettes. A schema is the memory trace of a motor pattern (= motor trajectory in Core) that a speaker has used to successfully communicate a specific meaning (i.e., a word). As with the more generalized schema proposed in Redford (2015), the notion of a silhouette proposed here takes inspiration from whole word approaches to child phonology (for a review see Vihman and Keren-Portnoy, 2013), information processing approaches to movement sequence learning and control (e.g., Klapp, 1975; Schmidt, 1975; Keele and Summers, 1976; Norman and Shallice, 1986; Arbib, 1992; Cooper and Shallice, 2006), and the early view of word form representations in Articulatory Phonology (see Browman and Goldstein, 1992). When one speaks, exemplars and silhouettes are integrated for execution via a perceptual-motor map. The map is not part of the linguistic system per se because it is initialized during the prelinguistic period (see also Guenther, 1995; Kuhl, 2000; MacNeilage and Davis, 2000; Hickok et al., 2003; Menn et al., 2013; Vihman, 2014). The map can therefore be accessed independently of meaning, for example, to mimic ambient noises<sup>4</sup>.
In the Core model, the perceptual-motor map is the set of links between the motor and perceptual trajectories that wend through motor and perceptual spaces, respectively. These links are established with vocal-motor exploration. For every vocalization an infant produces, the trace of the motor pattern used in production is preserved as a motor trajectory that is linked at each point in time to the auditory memory of that vocalization, which is the perceptual trajectory. The motor space is simplified as the set of articulatory configurations, or possible vocal tract states, within a multidimensional articulatory space. The perceptual space is simplified as the set of possible sounds in a multidimensional acoustic space. The articulatory and acoustic dimensions structure the motor and perceptual spaces in such a way that articulatory and perceptual distances can be defined. These notions of distance are critical to a number of processes in Core. The notion of articulatory distance also provides the basis for a critical hypothesis that is instantiated in the Core model: when motor trajectories approach one another in motor space to the point of (near) crossing, junctures are created that can then be exploited to generate a new trajectory that is the combination of existing (partially) adjacent trajectories.
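The definitions in this paragraph can be made concrete with a minimal Python sketch. It rests on strong simplifying assumptions that are not specified in the text at this point: two-dimensional toy spaces, Euclidean distance as the measure of articulatory distance, and a fixed proximity threshold. All coordinates and the names `perceptual_motor_map` and `find_junctures` are hypothetical.

```python
import math

# A trajectory is a time-ordered list of points in a space.
# All coordinates are invented toy values, not measured data.
motor_bap = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]
motor_dag = [(0.0, 2.0), (0.5, 1.1), (1.0, 2.0)]
percept_bap = [(200.0, 900.0), (700.0, 1200.0), (250.0, 800.0)]
percept_dag = [(250.0, 1800.0), (710.0, 1210.0), (300.0, 2000.0)]

# The perceptual-motor map: each motor trajectory is linked, point by
# point in time, to the perceptual trajectory it produced.
perceptual_motor_map = [
    list(zip(motor_bap, percept_bap)),
    list(zip(motor_dag, percept_dag)),
]


def find_junctures(traj_a, traj_b, threshold):
    """Return index pairs (i, j) where two motor trajectories (nearly) cross.

    Euclidean distance and a fixed threshold are assumptions made for
    this illustration only.
    """
    return [
        (i, j)
        for i, p in enumerate(traj_a)
        for j, q in enumerate(traj_b)
        if math.dist(p, q) <= threshold
    ]


junctures = find_junctures(motor_bap, motor_dag, threshold=0.2)
```

With these toy values, the only juncture lies at the middle point of each trajectory, where the two articulatory configurations are nearly identical; tightening the threshold sufficiently removes it.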

The central idea behind the critical hypothesis is exemplified in **Figure 2**, which shows how the motor trajectories associated with [bAp] and [dAg] (left) can be used to produce [bAg] (right) via the junctures created in motor space where the [A] portion of the trajectories near one another. As this example makes clear, junctures index sets of (nearly) identical articulatory configurations. And, like all articulatory configurations along motor trajectories, the configurations at junctures are linked to sounds along corresponding trajectories in perceptual space. In this way, clusters of configurations at junctures in motor space, along with their corresponding perceptual points, can be considered the perceptual-motor units of speech. In Core,

in motor space. See text for details.

<sup>4</sup>As this example suggests, the distinction we make between linguistic and nonlinguistic depends on the functional definition of language as a system of form-meaning pairs (see also Saussure, 1959; Langacker, 1987; Fillmore et al., 1988; Goldberg, 1995; Bybee, 2001; Croft and Cruse, 2004; inter alia).

these units serve as pivots—places to transition from one motor trajectory to another along perceptually-linked paths that are selected to produce best approximations of whole word targets, as described below.

During the first word stage of language acquisition, an infant approximates a conceptually-linked exemplar drawn from the ambient language in the following way: the infant chooses an existing motor trajectory that is linked to a perceptual trajectory that is most similar to the exemplar being attempted. In this way, Core instantiates Vihman's (1993; 1996) hypothesis of first word production using vocal motor schemes: an infant's first words are based on familiar patterns from, say, babbling, that best approximate (perceptually) an adult target word (e.g., "ba" for "ball"). To account for developmental change beyond first words, Core assumes exemplars that are whole word forms<sup>5</sup> . These are represented as conceptually-linked perceptual trajectories that inhabit the same space as endogenous (i.e., self-generated) forms<sup>6</sup> . Similarity estimates between exogenous and endogenous perceptual trajectories are not necessarily based on the entire form. Instead, the estimates are biased toward matching the most salient aspects of the conceptually-linked trajectory, where salience is understood as subjective within certain acoustically defined bounds. Importantly, subjective salience is hypothesized to be governed by attention. What is salient during an attempt at matching any given exemplar can therefore change with experience. This change gives rise to the variable productions of early child language and, eventually, to adult-like productions of target words. So, for example, an infant might first try to match just the acoustically robust stressed syllable of a disyllabic word exemplar (e.g., "ba" for "bottle")<sup>7</sup> . Having done so, perhaps repeatedly, the infant will likely find the less robust unstressed syllable relatively more salient and, in subsequent productions, may seek to also match its quantity and/or quality (e.g., "baba" for "bottle")<sup>8</sup> . 
In this way, the assumptions of a non-linguistic basis for first word productions, holistic perceptual word form representations, and experience dependent changes in salience interact in the Core model to capture spoken language development. Successful communication during first word production triggers schema formation; that is, communicative success serves as the positive reinforcement needed to forge an associative link between a motor trajectory and lexical concept<sup>9</sup> . When the same concept is next selected for output, the newly established schema is activated along with the perceptual trajectory of the relevant exemplar. It is at this point that word production can be conceived of as the integration of perceptual and motor forms. Although the schema now biases production in the direction of the previously used motor trajectory, attention to different aspects of the coactivated exemplar will encourage some modification to or elaboration of the original motor trajectory. So, a second or third or fifth production of a single word is very likely to be different from the first. Each different successful production gives rise to a new schema, that is, to an additional motor trajectory with a link to the same lexical concept. These schemas are compiled to create a composite motor phonological form—the silhouette. This holistic representation then serves to define a swath through motor space during the integration process. This swath is narrow for those aspects of production that remain constant across many attempts at matching the exemplar, and wide elsewhere. Exemplar-driven exploration within and around this swath reticulates the motor space further, giving rise to additional junctures in areas of (near) articulatory overlap.

Key aspects of the Core model are formalized in the sections that follow. The formalization serves both to rigorously specify the interrelated hypotheses presented above and to demonstrate how these work together to yield perceptual-motor units absent their discrete specification in the phonology. The model presentation is organized developmentally, from infancy and prelinguistic vocalizations to early childhood and the emergence of an adult-like production process. We begin, though, with definitions of the perceptual-motor map and the acoustic and

<sup>5</sup> It could equally be the case that an infant initially remembers only the most salient portions of a word (see, e.g., Vihman, 2017), and that the exemplar representation therefore changes with developmental time.

<sup>6</sup> See section 2.2.3 for details.

<sup>7</sup>See Snow (1998) for a related prominence account of weak syllable deletion.

<sup>8</sup>The idea that familiar items become less salient with repeated attention is based on the well-studied relationship between habituation and the emergence of a novelty preference in infant studies (see, e.g., Sirois and Mareschal, 2002).

<sup>9</sup>The emphasis on communicative success for schema formation is consistent with the recent revival of interest in associative learning for understanding speech and language acquisition (e.g., Howard and Messum, 2011; Ramscar et al., 2013; Warlaumont and Finnegan, 2016; see also Kapatsinski, 2018).

articulatory dimensions that structure the perceptual and motor spaces, respectively.

#### 2.2. The Perceptual-Motor Map

Holistic perceptual and motor phonological representations are integrated for execution via the non-linguistic perceptual-motor map, which is defined as the set of links between paired trajectories that exist in perceptual and motor spaces, respectively. More specifically, the map is a bijection between the perceptual trajectory set and the motor trajectory set<sup>10</sup>, and so can be thought of as the set of bidirectional arrows between the sets of trajectories as shown in **Figure 3**. The initial set of links, or bidirectional arrows, is established during the prelinguistic period, as described in section 2.3. In this section, we rigorously define the perceptual and motor spaces, including the topologies of these spaces, and what we mean by trajectories through these spaces.

#### 2.2.1. Perceptual and Motor Spaces

The perceptual space is a set of points in Core, denoted SOUNDS. Each point represents an "instantaneous" sound<sup>11</sup>, which is defined along the following 12 acoustic dimensions: the time derivative of loudness in phons, periodicity of the waveform, the first 3 Bark-transformed formants in the spectrum, the spectral center of gravity, the width of the spectral peak, and the time derivatives of each of these frequency dimensions<sup>12</sup>. It is possible that instantaneous sounds would be better represented with reference to the full speech spectrum (e.g., mel-frequency cepstrum), but the argument here does not depend on an exact representation of sound. Instead, the dimensions are illustrative and chosen with the goal of adequately and intuitively characterizing speech sounds for the phonetically informed reader. To complete this characterization, the dimensions are given the following values: periodicity is categorical and set to zero if the sound is aperiodic (e.g., voiceless fricative) and one if the sound is periodic (e.g., liquid); each of the other dimensions is set to some numerical value appropriate to the sound if the dimension is relevant for that sound, and set to zero otherwise. So, for example, when a sound is aperiodic, the Bark-transformed formant values (and their derivatives) are set to zero; when a sound is periodic, the center-of-gravity and width-of-peak values (and their derivatives) are set to zero to further distinguish sonorants from obstruents in perceptual space. Some nasals have an F<sub>3</sub> value of zero; in this case, we set the Bark-transformed value to zero. Formally, then, an instantaneous sound is a 12-tuple:

$$\left(\frac{d}{dt}(\text{LOUDNESS}),\ \text{PER},\ Z\_1,\ Z\_2,\ Z\_3,\ \text{COG},\ \text{WIDTH},\ \frac{d}{dt}(Z\_1),\ \frac{d}{dt}(Z\_2),\ \frac{d}{dt}(Z\_3),\ \frac{d}{dt}(\text{COG}),\ \frac{d}{dt}(\text{WIDTH})\right)$$

where d/dt(LOUDNESS) is equal to the time derivative of the phon value for the current sound; PER = 1 if that current sound is periodic and PER = 0 if it is aperiodic; Z<sub>1</sub>, Z<sub>2</sub>, and Z<sub>3</sub> are equal to the first three Bark-transformed formant values if the sound is periodic and are equal to zero otherwise (with Z<sub>3</sub> also being zero for certain nasals as described above); COG is equal to the spectral center of gravity if the sound is aperiodic, and is zero otherwise; WIDTH is equal to the width of the dominant spectral peak if the sound is aperiodic, and is zero otherwise; and where

<sup>10</sup>Although the perceptual-motor map may or may not be a true bijection, the insights we offer from this model are not dependent on this particular assumption. Instead they depend on the assumption that motor and perceptual trajectories are systematically linked to one another.

<sup>11</sup>Clearly sound requires time and so "instantaneous sound" should not be interpreted as psychologically real. Instead, the construct is simply used to formalize the idea of trajectories. In fact, we never treat sound as independent of the trajectory on which it lies. In this way, all sound (and for that matter, movement) is inseparable from time in the Core model.

<sup>12</sup>We use phon values and Bark-transformed values instead of the more familiar RMS pressure and Hertz values to code loudness and formant frequency information in order to underscore the point that the dimensions we seek to define are psychological, not physical. The reader should imagine that the spectral center of gravity and the width of the spectral peak are similarly transformed from the physical to the psychological.

d/dt(Z<sub>1</sub>), d/dt(Z<sub>2</sub>), d/dt(Z<sub>3</sub>), d/dt(COG), and d/dt(WIDTH) are equal to the time derivatives of the different spectral values<sup>13</sup>.

Although an instantaneous sound is mainly defined along dimensions that reference familiar acoustic measures of speech, the reference to time derivatives of acoustic properties is admittedly unusual and so requires explanation. In Core, an instantaneous sound is only ever realized as part of a trajectory. Derivatives allow us to code, at each point in time, the direction and extent of change along the intensity-related and spectral dimensions of this trajectory. This information is used to capture the amplitude and frequency modulation of the speech signal, which is critical for recovering place and manner of articulation information (e.g., Viswanathan et al., 2014). Including this as part of the representation of each point in the space ensures that if two trajectories (defined in section 2.2.3) pass through the same point in the space, they are perceptually equivalent at that moment. Note that our inclusion of dynamic information in the model assumes that infants also use such information when listening to speech. This assumption is reasonable based on the evidence that auditory temporal resolution is already adult-like by 6 months of age in typically developing infants (see Trainor et al., 2001).
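The zeroing rules for the 12-tuple can be made concrete with a short sketch. This is only an illustration under stated assumptions: the names `InstantaneousSound` and `make_sound` are hypothetical, the numerical values a caller passes in are placeholders rather than measurements, and the special case of nasals with a zero F<sub>3</sub> is left to the caller.

```python
from typing import NamedTuple, Tuple


class InstantaneousSound(NamedTuple):
    """Hypothetical container for the 12 acoustic dimensions, in tuple order."""
    d_loudness: float  # time derivative of loudness (phon scale)
    per: int           # 1 = periodic, 0 = aperiodic
    z1: float          # Bark-transformed formants (zero if aperiodic)
    z2: float
    z3: float
    cog: float         # spectral center of gravity (zero if periodic)
    width: float       # width of dominant spectral peak (zero if periodic)
    d_z1: float        # time derivatives of the five spectral dimensions
    d_z2: float
    d_z3: float
    d_cog: float
    d_width: float


Triple = Tuple[float, float, float]


def make_sound(d_loudness: float, periodic: bool,
               formants: Triple = (0.0, 0.0, 0.0),
               d_formants: Triple = (0.0, 0.0, 0.0),
               cog: float = 0.0, width: float = 0.0,
               d_cog: float = 0.0, d_width: float = 0.0) -> InstantaneousSound:
    """Build a point of SOUNDS, zeroing the dimensions that are not relevant."""
    if periodic:
        # Periodic sounds keep formant values; noise-related values are zeroed.
        return InstantaneousSound(d_loudness, 1, *formants, 0.0, 0.0,
                                  *d_formants, 0.0, 0.0)
    # Aperiodic sounds keep COG and WIDTH; formant values are zeroed.
    return InstantaneousSound(d_loudness, 0, 0.0, 0.0, 0.0, cog, width,
                              0.0, 0.0, 0.0, d_cog, d_width)
```

Passing `periodic=True` together with a nonzero `cog` still yields `cog == 0.0` in the result, which mirrors the way the definition separates sonorants from obstruents in perceptual space.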

Like the perceptual space, the motor space is a set of points in Core, denoted ARTIC. In this case, the points represent all possible articulatory configurations for the speaker. These configurations describe the overall physical state of the vocal tract at any given moment in time during a vocalization; they are not goal states. Thus, ARTIC, or the set of all possible articulatory configurations, can be used to represent continuous change in the vocal tract during production.

An articulatory configuration, and therefore the motor space, is defined along 20 dimensions: glottal width, 8 cross-sectional areas of the vocal tract, velum height, the time derivatives of each of the 8 cross-sectional areas and velum height, and the opening and closing phases of the jaw cycle. The cross-sectional areas of the vocal tract describe the result of coordinated actions, including laryngeal raising, pharyngeal constriction, and the movements of the tongue and lips with reference to the hard palate and maxilla (e.g., Fant, 1960)<sup>14</sup>. The specific choice of 8 segments is not critical to the model but is chosen here based on acoustic tube modeling work that considers consonantal articulation in addition to vowel articulation (Mrayati et al., 1988; Carré, 2004). Cross-sectional areas provide static information about jaw height given articulatory synergies between the jaw and tongue and between the jaw and lips; the opening and closing phases of the jaw cycle are included as a separate dimension in motor space in order to provide directional information, much like the time derivatives of acoustic properties in perceptual space. Such information is hypothesized to be relevant for delimiting syllable-sized articulatory timing relations (Redford, 1999; Redford et al., 2001; Redford and Miikkulainen, 2007), which will become important later. Formally, then, an articulatory configuration is the 20-tuple (g, c<sub>1</sub>, c<sub>2</sub>, . . . , c<sub>8</sub>, v, d/dt(c<sub>1</sub>), d/dt(c<sub>2</sub>), . . . , d/dt(c<sub>8</sub>), d/dt(v), j<sub>dir</sub>), where g takes values between 0 and 1 for glottal widths between fully closed (g = 0) and fully open (g = 1); c<sub>i</sub> is the normalized cross-sectional area of the ith vocal tract segment, where c<sub>i</sub> = 0 for a minimum area, and c<sub>i</sub> = 1 for a maximum area; v = 0 when the velum is lowered, v = 1 when the velum is raised, and v takes some appropriate value between 0 and 1 when the velum is between lowered and raised; and j<sub>dir</sub> takes a value between 0 and 1 during jaw opening, where j<sub>dir</sub> = 1 when opening is executed with maximum force, j<sub>dir</sub> takes a value between −1 and 0 during jaw closing, where j<sub>dir</sub> = −1 when closing is executed with maximum force, and j<sub>dir</sub> = 0 when the jaw is neither opening nor closing and so force is 0. Note that, for ease of some formal definitions, ARTIC can be thought of as being embedded in a larger set, namely the set of all 20-tuples of real numbers; however, this larger theoretical set includes impossible configurations as well as the possible ones that make up ARTIC.

We conclude this section with the following caveats. The focus in Core on sound and articulatory configurations for defining the perceptual and motor spaces is a simplifying choice. The dimensions we use to define these spaces are also simplified descriptions of acoustic and articulatory information. A more complete model would include additional dimensions and a sense of how these are weighted and normalized with respect to one another. It might also include, like DIVA, an additional layer in the map to solve the problems of articulatory coordination and timing that are not addressed here. Still, as defined, the dimensions in Core adequately describe human vocalizations, including word production. They also structure the perceptual and motor spaces in a manner that provides a formal foundation for the demonstration that perceptual-motor units of speech motor control can arise within the perceptual-motor map over developmental time absent discretized phonological input to the map.

#### 2.2.2. Perceptual and Articulatory Distance

The perceptual and motor spaces in Core are structured by the perceptual distance between instantaneous sounds and the articulatory distance between articulatory configurations. Defining the distance between every pair of points in motor space allows for the computation of distance between any two trajectories through motor space, which in turn allows for comparison of these trajectories; and similarly for perceptual space and perceptual trajectories. In Core, perceptual distance is relevant for word production and, later in development, for perceptually guided speech motor control (see Redford, 2019); that said, the argument in this paper is that the perceptual-motor units that arise with vocal exploration and spoken language acquisition are due to trajectory (near) overlap in motor space, not perceptual space. For this reason, we do not define a distance

<sup>13</sup>We assume that these variables are modeled well as piecewise continuous functions of time that are differentiable almost everywhere (i.e., on all but a set of measure zero). If at a particular point in time the derivative of one of these variables does not exist, we set it to zero to give it a well-defined value.

<sup>14</sup>This choice clearly elides the problem of articulatory movement and coordination that is central to other models of speech motor control (e.g., Saltzman and Munhall, 1989; Guenther, 1995, 2016), but is in keeping with our specific interest in the relationship between phonological representation and motor control. The choice is nonetheless plausibly motor in that the cross-sectional area of the vocal tract can presumably be recovered from somatosensory feedback. It is in this way that the motor space in Core resembles the somatosensory space/reference frame in DIVA.

metric on the set of points in perceptual space, but assume that the distance between two instantaneous sounds should rely on some combination of the following values: differences between the corresponding coordinates except for Z<sub>1</sub>, Z<sub>2</sub>, and Z<sub>3</sub>, and the differences between the respective values of Z<sub>3</sub> − Z<sub>1</sub> and the respective values of Z<sub>3</sub> − Z<sub>2</sub> (these relative values are to normalize for physiological difference between speakers)<sup>15</sup>. Further, we assume an appropriate distance metric exists that is based on these variables.

Unlike perceptual distance, articulatory distance is central to the emergence of production units in Core and therefore to the argument of this paper. A specific distance metric, d<sub>ARTIC</sub>, for articulatory distance is therefore proposed: the Euclidean distance metric on the set of articulatory configurations. Thus, for two articulatory configurations a = (g, c<sub>1</sub>, . . . , c<sub>8</sub>, v, d/dt(c<sub>1</sub>), . . . , d/dt(c<sub>8</sub>), d/dt(v), j<sub>dir</sub>) and a′ = (g′, c′<sub>1</sub>, . . . , c′<sub>8</sub>, v′, d/dt(c′<sub>1</sub>), . . . , d/dt(c′<sub>8</sub>), d/dt(v′), j′<sub>dir</sub>), the distance between the two is defined to be

$$d\_{\text{ARTIC}}(a,a') = \Big[(g-g')^{2}+(c\_{1}-c\_{1}')^{2}+\cdots+(c\_{8}-c\_{8}')^{2}+(v-v')^{2}+\big(\tfrac{d}{dt}(c\_{1}-c\_{1}')\big)^{2}+\cdots+\big(\tfrac{d}{dt}(c\_{8}-c\_{8}')\big)^{2}+\big(\tfrac{d}{dt}(v-v')\big)^{2}+(j\_{dir}-j\_{dir}')^{2}\Big]^{1/2}.$$

Note that if we were to define d<sub>ARTIC</sub> in almost the same way, but using only the variables for glottal width, cross-sectional vocal tract areas, and velum openness, the distance between two articulatory configurations would match a phonetician's intuition of articulatory distance. Differences between jaw direction values are included to capture the additional intuition that achieving a particular vocal tract configuration while opening the mouth is different from achieving the same configuration while in the process of closing the mouth (see, e.g., Fujimura, 1990). Recall that jaw direction also allows us to define syllable-sized articulatory timing relations (Redford, 1999; Redford et al., 2001; Redford and Miikkulainen, 2007).
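As an illustration of the Euclidean metric on 20-tuples, the following minimal sketch computes the articulatory distance between two configurations. The helper names `config` and `d_artic` and the toy coordinate values are assumptions made for the example; because differentiation is linear, the derivative terms reduce to differences between the stored derivative coordinates.

```python
import math
from typing import Sequence, Tuple

N_DIMS = 20  # g, c1..c8, v, d/dt(c1)..d/dt(c8), d/dt(v), j_dir


def config(g: float, areas: Sequence[float], v: float,
           d_areas: Sequence[float], d_v: float,
           j_dir: float) -> Tuple[float, ...]:
    """Assemble a 20-tuple in the order used in the text (hypothetical helper)."""
    if len(areas) != 8 or len(d_areas) != 8:
        raise ValueError("expected 8 cross-sectional areas and 8 derivatives")
    return (g, *areas, v, *d_areas, d_v, j_dir)


def d_artic(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two articulatory configurations."""
    if len(a) != N_DIMS or len(b) != N_DIMS:
        raise ValueError("an articulatory configuration is a 20-tuple")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Same static vocal tract shape, reached once while the jaw opens and once
# while it closes: only the jaw-direction coordinate differs.
opening = config(0.2, [0.5] * 8, 1.0, [0.0] * 8, 0.0, 0.5)
closing = config(0.2, [0.5] * 8, 1.0, [0.0] * 8, 0.0, -0.5)
print(d_artic(opening, closing))
```

In this toy case the two configurations are identical in every static coordinate, so the nonzero distance comes entirely from the jaw-direction term, which is the intuition the text attributes to that dimension.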

In addition to structuring the perceptual and motor spaces, the notions of perceptual and articulatory distances allow for the comparison of trajectories in these spaces. In Core, comparisons between perceptual trajectories are fundamental to the production of first words, comparisons between motor trajectories are fundamental to the evolution of motor representations, and comparisons of linked pairs of trajectories to targeted perceptual and motor representations are fundamental to the integration of these forms during production. Since two of these processes force further reticulation of motor space over developmental time, comparisons are also fundamental to the emergence of junctures. Junctures enable novel word generation in Core and the development of adult-like speech motor control.

#### 2.2.3. Perceptual and Motor Trajectories

Perceptual and motor trajectories are defined as functions from time intervals to perceptual space and motor space, respectively. A perceptual trajectory takes time as an input and gives as an output the instantaneous sound at each time; a motor trajectory also takes time as an input, and gives as an output the articulatory configuration at each time.

The mathematical structure imposed on motor space by the distance metric d<sub>ARTIC</sub> organizes articulatory configurations so that the structure is consistent with intuitive notions about continuous physical motion. More specifically, the articulatory distance metric defined in section 2.2.2 induces a topology on motor space. Assuming the standard metric-induced topology on real intervals (i.e., the domains of motor trajectories), the continuity of motor trajectories can be assessed with reference to the structured motor space. In Core, we claim that every motor trajectory is a continuous function according to these topologies. This is a critical claim for the procedures defined below and, of course, also coincides with the facts of speech: in order to go from one articulatory configuration to another, the vocal tract must go through intermediate states such that each of our variables changes continuously; for example, in order for the 5th segment of the vocal tract to go from having a cross-sectional area of 3 to 1 cm<sup>2</sup>, it must go through stages in which it attains cross-sectional areas of 2, 1.5, 1.124 cm<sup>2</sup>, and so on. Put another way, since the notion of distance defined herein aligns with the reality of articulation, the notion of continuity as rigorously defined aligns with the reality of continuous motion.

Although functions of time, trajectories code only relative time. To normalize for absolute time, we define equivalence relations. In motor space, two trajectories are equivalent if one can be uniformly temporally stretched to create the other. Specifically, two motor trajectories m : [0, T] → ARTIC and n : [0, U] → ARTIC (i.e., motor trajectories with domains [0, T] and [0, U], respectively) are equivalent if and only if m(t) = n((U/T)t) for all t in [0, T]<sup>16</sup>. This equivalence relation yields a set of equivalence classes of trajectories. In Core, every equivalence class yields a representative motor trajectory that has the domain [0, s], where s is the number of syllables for each motor trajectory within that class. The value of s is well-defined because syllable number is determined by j<sub>dir</sub> and so is the same for all motor trajectories within a single class. The representative (time normalized) motor trajectory is the one used in the production processes described below.
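The uniform temporal stretch can be sketched in a few lines. This is a minimal sketch, assuming trajectories are represented as Python callables from time to tuples; the function names `rescale` and `representative` and the one-dimensional toy trajectory are hypothetical and stand in for full 20-dimensional motor trajectories.

```python
from typing import Callable, Tuple

Config = Tuple[float, ...]
Trajectory = Callable[[float], Config]


def rescale(n: Trajectory, U: float, T: float) -> Trajectory:
    """Return m on [0, T] with m(t) = n((U / T) * t), where n lives on [0, U]."""
    if U <= 0 or T <= 0:
        raise ValueError("domain lengths must be positive")
    return lambda t: n((U / T) * t)


def representative(n: Trajectory, U: float, syllables: int) -> Trajectory:
    """Time-normalized member of n's equivalence class, with domain [0, s]."""
    return rescale(n, U, float(syllables))


# Toy example: a two-syllable vocalization lasting 0.8 time units, reduced to
# a single coordinate that ramps linearly from 0 to 1.
def raw(t: float) -> Config:
    return (t / 0.8,)


m = representative(raw, U=0.8, syllables=2)
print(m(0.0), m(1.0), m(2.0))  # start, midpoint, and end of the normalized form
```

Any two recordings of the same movement pattern that differ only in overall duration map to the same representative under this rescaling, which is the sense in which absolute time is normalized away.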

An analogous equivalence relation is imposed on the set of perceptual trajectories. Thus, if two motor trajectories are equivalent, then their perceptual counterparts will also be equivalent. In this way, the equivalence relation imposed on motor space is also a property of the perceptual-motor map. Note, however, that we are not able to as easily choose a representative of each perceptual equivalence class because syllable information, derived from j<sub>dir</sub>, is only available for perceptual trajectories that are already linked to motor trajectories (i.e., self-productions). Exemplars, which inhabit the same space as self-productions, have no associated motor trajectories and so

<sup>15</sup>The Bark Difference Metric is a vowel-intrinsic normalization method adapted from Syrdal and Gopal (1986). Perceptual distance is normalized for speaker differences based on our assumption that exogenously-derived exemplars are trajectories in the same perceptual space as the trajectories that are auditory memories of self-productions, which define the perceptual aspect of the perceptual-motor map.

<sup>16</sup>It can be checked that this is in fact an equivalence relation.

no syllable information. When syllable number is available for the perceptual trajectories, they are normalized using this information; otherwise, they are normalized using an arbitrary domain length, since the processes themselves implicitly normalize for domain length.

## 2.3. Initializing the Perceptual-Motor Map

Having defined the perceptual and motor spaces, a notion of distance in each space, trajectories through the spaces, and a procedure for time normalization, we turn now to the initialization of the perceptual-motor map.

Core embodies the familiar hypothesis that an infant's prelinguistic vocalizations give rise to the perceptual-motor map (Stark, 1986; Guenther, 1995; Kuhl, 2000; MacNeilage and Davis, 2000; Hickok et al., 2003; Menn et al., 2013). Here, an infant's prelinguistic vocalizations are specifically understood as developmentally constrained explorations of the vocal motor and acoustic perceptual spaces. We suppose that with an infant's every vocalization the parallel motor and perceptual spaces are explored and the links between them defined, giving rise to the perceptual-motor map. Specifically, each vocalization results in a motor memory trace and an auditory memory trace that are associated in time. Through this association, the transient traces become fixed and linked. These links are the set of paired motor and perceptual trajectories that constitute the perceptual-motor map. Motor and perceptual trajectories and a link between them are established with every vocalization, from infancy to adulthood.

The perceptual-motor map is initialized at birth with the infant's cries and vegetative sounds. As an infant gains voluntary control over laryngeal and other articulatory movements at around 8 weeks of age, the perceptual and motor spaces are more deliberately explored. Although the squeals, coos, raspberries, and so on that are produced during the phonatory and expansion stages grow the set of links that constitute the perceptual-motor map, we follow the lead of others and focus on babbling due to its importance in theories of speech acquisition (see, e.g., Oller, 1980; Guenther, 1995; MacNeilage and Davis, 2000). The repetitive nature of babbled utterances also makes them useful for formally introducing the Core concept of junctures, which is central to the acquisition of spoken language: as previously described, junctures give rise to perceptual-motor units; they also delimit smaller paths, or articulatory chunks, within larger trajectories that can then be combined to produce new vocalizations. The combination process becomes the focus of description in what follows below.

## 2.4. Junctures, Clusters, and Articulatory Chunks

The illustrations in **Figure 2** convey the idea that junctures are created when motor trajectories approach one another in motor space. Junctures form in clusters with spoken language acquisition. These clusters, along with their corresponding (linked) perceptual points, represent perceptual-motor units of production at the level of speech motor control.

Junctures and clusters are defined based on trajectories in motor space, even though, as stated, the perceptual-motor units themselves entail the corresponding perceptual points. When a new motor trajectory m is created out of motor trajectories k<sub>1</sub>, . . . , k<sub>ℓ</sub> as described in **Appendix A**, k<sub>1</sub>(β<sub>1</sub>), k<sub>2</sub>(α<sub>2</sub>), k<sub>2</sub>(β<sub>2</sub>), k<sub>3</sub>(α<sub>3</sub>), k<sub>3</sub>(β<sub>3</sub>), . . . , k<sub>ℓ−1</sub>(α<sub>ℓ−1</sub>), k<sub>ℓ−1</sub>(β<sub>ℓ−1</sub>), k<sub>ℓ</sub>(α<sub>ℓ</sub>) become junctures; that is, the endpoints of the small segments that connect existing trajectories to create a novel one all become junctures. Then, at any given moment in developmental time, a single-linkage hierarchical clustering process is applied to the set of junctures, where the process is stopped just before the height of the tree meets or exceeds ε, where ε is the parameter defined in **Appendix A**. As a developmental process, this clustering can be described as follows. When an articulatory configuration a becomes a juncture, there are three possibilities: (1) it could be "sufficiently close" to exactly one existing juncture point, where "sufficiently close" means being a distance of less than ε away; (2) it could be "sufficiently close" to multiple existing junctures; or (3) it could have a distance of greater than or equal to ε from all existing junctures (i.e., not sufficiently close to any existing juncture). If a is less than ε away from a single juncture point (possibility 1), then a joins the cluster that juncture point belongs to. If a is less than ε away from more than one juncture point (possibility 2), say from a<sub>1</sub>, . . . , a<sub>n</sub>, then the clusters that a<sub>1</sub>, . . . , a<sub>n</sub> belong to merge into one cluster that also now includes a; that is, they merge via their mutual connection to a. If a is not within ε of any existing juncture point (possibility 3), the set {a} becomes its own cluster.

Note that a single novel production can trigger the establishment of multiple juncture points. Regardless of the order in which these juncture points are "added," the process above yields the same clusters.
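The three possibilities described above can be sketched as an incremental procedure. The sketch below is an assumption-laden illustration, not the procedure of **Appendix A**: it uses a union-find structure (a swapped-in implementation choice), the names `JunctureClusters` and `cluster_junctures` are hypothetical, and two-dimensional toy points stand in for 20-dimensional articulatory configurations. Cutting a single-linkage tree just below height ε gives the connected components of the graph that links junctures less than ε apart, which is what the code maintains.

```python
import math
from typing import FrozenSet, Iterable, List, Set, Tuple

Point = Tuple[float, ...]


class JunctureClusters:
    """Incrementally cluster junctures that lie less than eps apart."""

    def __init__(self, eps: float) -> None:
        self.eps = eps
        self.points: List[Point] = []
        self.parent: List[int] = []  # union-find forest over point indices

    def _find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def add(self, a: Iterable[float]) -> None:
        a = tuple(a)
        new = len(self.points)
        self.points.append(a)
        self.parent.append(new)  # possibility 3: a starts as its own cluster
        for i in range(new):
            if math.dist(self.points[i], a) < self.eps:
                # possibilities 1 and 2: join, and thereby merge, every
                # cluster that contains a juncture within eps of a
                self.parent[self._find(i)] = self._find(new)

    def clusters(self) -> Set[FrozenSet[Point]]:
        groups = {}
        for i, p in enumerate(self.points):
            groups.setdefault(self._find(i), set()).add(p)
        return {frozenset(g) for g in groups.values()}


def cluster_junctures(points: Iterable[Point], eps: float) -> Set[FrozenSet[Point]]:
    jc = JunctureClusters(eps)
    for p in points:
        jc.add(p)
    return jc.clusters()
```

Because the result depends only on which pairs of junctures are within ε of one another, adding the same junctures in a different order produces the same clusters, in line with the order-independence noted in the text.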

The early language function of junctures is to index locations where the speaker can deviate from one existing motor trajectory to pursue another. Since the juncture-delimited paths along existing trajectories are available to participate in novel trajectories through combinations, they can be thought of as articulatory chunks from which new utterances (e.g., words) can be built. The articulatory chunks are large in early development and small later on when many more junctures have arisen through exploration of the motor space. To illustrate chunking, we use figures in which the space on the page is treated as analogous to motor space, and where trajectories are represented as curves with direction through this space. Note that timing is not represented in the figures. For example, **Figure 4** shows the junctures at the [A] portions of the chunks [bAbA] (left) and [dAdA] (right). Junctures effectively delimit the chunks [bA] and [AdA], and make possible the combination [bAdA]. In the remainder of this section, we formally describe the combinatorial process in Core with reference to the case of [bAdA], beginning with the assumption that the articulatory configuration at the center of the first vowel in [bAbA] is close enough in motor space to the articulatory configuration achieved at the center of the first vowel in [dAdA] for a juncture to be created on each trajectory.

Let us first formally represent the motor trajectories for [bAbA] and [dAdA]. There are many motor trajectories that could accurately be described as yielding [bAbA] and [dAdA]. We choose two specific ones, m<sub>1</sub> and m<sub>2</sub>, to build [bAdA]. Most details of m<sub>1</sub> and m<sub>2</sub> are not relevant to the process, and so will be unspecified; what is relevant is the domain of these functions and the formal analog of the "close enough" assumption noted above. More specifically, both trajectories are two syllables so both have a domain of [0, 2]. So, we have m<sub>1</sub> : [0, 2] → ARTIC as the motor trajectory for [bAbA] and m<sub>2</sub> : [0, 2] → ARTIC as the motor trajectory for [dAdA]. Let a<sub>1</sub> be the articulatory configuration achieved at the center of the first vowel in [bAbA] and a<sub>2</sub> that for [dAdA]. For the sake of specificity, let the configurations occur at relative times 0.6 and 0.7 in their respective trajectories (the particular values are not central for the argument). This means that m<sub>1</sub>(0.6) = a<sub>1</sub> and m<sub>2</sub>(0.7) = a<sub>2</sub>. Critically, our assumption is that d<sub>ARTIC</sub>(a<sub>1</sub>, a<sub>2</sub>) is "sufficiently small" for there to be a juncture created at the endpoints of the segment from m<sub>1</sub> to m<sub>2</sub> going from a<sub>1</sub> to a<sub>2</sub>; in the language of **Appendix A**, we assume that d<sub>ARTIC</sub>(a<sub>1</sub>, a<sub>2</sub>) is smaller than the parameter ε, that is, we assume that criterion (\*) is fulfilled. Then, the speaker can traverse the first part of the [bAbA] trajectory and, once they reach articulatory configuration a<sub>1</sub>, make the small shift over to articulatory configuration a<sub>2</sub> to follow along the rest of the [dAdA] trajectory. Formally, making simplifying choices for a few of the parameters in **Appendix A**, we can define m : [0, 2] → ARTIC by

$$m(t) = \begin{cases} m_1(t) & 0 \le t < 0.6\\ (1 - \lambda(t))\,a_1 + \lambda(t)\,a_2 & 0.6 \le t < 0.7\\ m_2(t) & 0.7 \le t \le 2 \end{cases}$$

where λ(t) = 10t − 6.

Even without referencing the specifics of **Appendix A**, one can see that m has been defined as the concatenation of a piece of m<sub>1</sub> (that ends at vocal tract configuration a<sub>1</sub>), a connecting segment between a<sub>1</sub> and a<sub>2</sub>, and a piece of m<sub>2</sub> (that begins at vocal tract configuration a<sub>2</sub>). This clearly aligns with the illustration of this new trajectory shown in **Figure 4**.
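The concatenation just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of Core: motor space is reduced to a toy two-dimensional space, the trajectories `m1` and `m2` are invented straight lines standing in for [bAbA] and [dAdA], the threshold `eps` is an arbitrary value, and the function name `combine` is hypothetical. As in the formula above, the connecting segment is assumed to occupy exactly the interval between the two juncture times, so that no further stretching is needed.

```python
import math
from typing import Callable, Tuple

Config = Tuple[float, float]            # toy 2-D stand-in for a point in ARTIC
Trajectory = Callable[[float], Config]


def combine(m1: Trajectory, m2: Trajectory, t1: float, t2: float, eps: float) -> Trajectory:
    """Follow m1 up to t1, shift linearly from a1 to a2, then follow m2 from t2."""
    a1, a2 = m1(t1), m2(t2)
    if math.dist(a1, a2) >= eps:        # criterion (*): a juncture needs d_ARTIC(a1, a2) < eps
        raise ValueError("configurations too far apart for a juncture")

    def m(t: float) -> Config:
        if t < t1:
            return m1(t)
        if t < t2:
            lam = (t - t1) / (t2 - t1)  # equals 10t - 6 when t1 = 0.6 and t2 = 0.7
            return tuple((1 - lam) * x + lam * y for x, y in zip(a1, a2))
        return m2(t)

    return m


# Hypothetical trajectories on the domain [0, 2]; their shapes are invented for illustration.
def m1(t: float) -> Config:
    return (t, 0.0)


def m2(t: float) -> Config:
    return (t - 0.1, 0.05)


m = combine(m1, m2, t1=0.6, t2=0.7, eps=0.1)
print(m(0.3), m(0.65), m(1.5))
```

The returned function agrees with `m1` before the first juncture time and with `m2` from the second juncture time onward, and it refuses to build a combination when the two configurations are farther apart than the threshold.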

More specifically, in reference to the variables in **Appendix A**, s = 2 (since the number of syllables in the resulting trajectory is 2), and, for simplicity of the formula above, we assume that δ<sub>1</sub> = 0.1 (this is the normalized length of time it takes to shift from m<sub>1</sub> to m<sub>2</sub>); these values together mean that u = 1 (this is a stretching parameter that ensures that the resulting trajectory, m, has the desired domain). As stated above, we assumed that these trajectories were eligible for combination in the first place by assuming that d<sub>ARTIC</sub>(a<sub>1</sub>, a<sub>2</sub>) was sufficiently small (in **Appendix A**, below the threshold value ε)<sup>17</sup>.

To summarize, the perceptual-motor exploration that occurs during the prelinguistic period initializes the perceptual-motor map with linked pairs of perceptual and motor trajectories. These can then be exploited to create new utterances via junctures at points of (near) overlap in motor space. The smaller paths delimited by junctures are articulatory chunks. The structure of these chunks is defined by the structure of prelinguistic vocalizations. For example, the repetitive nature of babbling is likely to result in chunks that are the size of syllables or demisyllables, as suggested by the case considered above. In the next section, we turn to the onset of spoken language production when an infant begins to use articulatory chunks to produce first words. Keep in mind, though, that babbling continues alongside word production until about 18 months of age (Locke, 1989; Oller, 2000; Vihman, 2014). This means that the infant will continue to explore perceptual and motor spaces and will therefore continue to lay down entirely new trajectories through motor space while also building up initial motor phonological representations.

#### 2.5. Perceptual-Motor Integration

In Redford's (2019) developmentally sensitive theory, adult speech production is imagined as the integration of holistic perceptual and motor phonological forms. Motor forms emerge from speech practice; perceptual forms are acquired. The acquisition of perceptual forms, or exemplars, depends both on the development of speech segmentation strategies and on the infant's insight that adult vocalizations convey conceptual information. Both of these conditions may be met as early as 7 months of age (Harris et al., 1995; Bergelson and Swingley, 2012). Let us assume then that it is at this point that the infant begins to acquire exemplars from the ambient language. Motor forms begin to emerge later, at around 12 months of age, with first word production. The production of first words is imagined in the theory as the moment when the infant, motivated to communicate a specific concept, selects a motor trajectory whose corresponding perceptual trajectory approximates the exemplar associated with that concept. In Core, this trajectory can be familiar (e.g., [bAbA]) or novel (e.g., [bAdA]). The important thing is that the trajectory be similar in some respects to the exemplar. This means that first word production requires a matching and selection process.

<sup>17</sup>To be even more specific in reference to the **Appendix**, in this example we have ℓ = 2 because we are only using two "paths" to build our new trajectory; k<sub>1</sub> = m<sub>1</sub> and k<sub>2</sub> = m<sub>2</sub> are the trajectories m is being built out of; α<sub>1</sub> = 0, β<sub>1</sub> = 0.6, α<sub>2</sub> = 0.7, and β<sub>2</sub> = 2, meaning the intervals on which we are using k<sub>1</sub> and k<sub>2</sub> (in this case, m<sub>1</sub> and m<sub>2</sub>) are [0, 0.6] and [0.7, 2].

If the matching and selection process is successful, the infant will have communicated the intended concept to their interlocutor and induced some kind of desired response. In this case, the motor trace of the vocalization will be remembered in association with the concept that was communicated. This is the schema. In future attempts to communicate the same concept, the schema and exemplar associated with that concept are coactivated, biasing the matching-selection process in the direction of the previously used motor pattern. Still, the exemplar-matching objective of speech production remains. Thus, future attempts at the same word are likely to result in the selection of a new motor trajectory, especially if the infant attends to different aspects of the word during production (i.e., salience shifts). Each of these new selections, when it results in successful communication, generates a new schema. In Core, these are aligned and combined with the existing schema to define the silhouette, the motor phonological representation that evolves with developmental time.

Below, we rigorously describe the critical matching and selection process used in first word production, the motor phonological representations that result from this process, and the perceptual-motor integration process that characterizes production once motor phonological forms have been acquired.

#### 2.5.1. Matching and Selection

Recall that the perceptual space is denoted by SOUNDS, and the distance metric on the perceptual space is denoted by d<sub>SOUNDS</sub>. When an infant first attempts to communicate a concept, c, they choose a corresponding exemplar, e<sub>c</sub>, which becomes the perceptual goal for production. We claim that the goal is a function e<sub>c</sub> : [0, T] → SOUNDS and is a perceptual trajectory that is not attached to a motor trajectory. Different portions of the exemplar will have different levels of salience to the infant<sup>18</sup>. Salience is described by the function SALIENCE<sub>e<sub>c</sub></sub> : [0, T] → [0, 1], which takes a time as an input, and gives as an output the salience of the sound that occurs at that time in the exemplar, where 1 indicates maximum salience and 0 indicates no salience. For example, suppose that the exemplar e<sub>c</sub> : [0, T] → SOUNDS is two syllables long and these syllables are of equal duration; suppose also that the first syllable (that is, the first temporal half of the trajectory, up to and including the midpoint) is maximally salient to the infant and the second syllable (that is, the second temporal half of the trajectory) is not at all salient to the infant; then the salience function for e<sub>c</sub> would be defined by

$$\text{SALIENCE}_{e_c}(t) = \begin{cases} 1 & \text{if } 0 \le t \le \frac{1}{2}T \\ 0 & \text{if } \frac{1}{2}T < t \le T \end{cases}$$

Once salience is taken into account, the search begins to find a pair of corresponding perceptual and motor trajectories, p :[0,s] → SOUNDS and m :[0,s] → ARTIC, that best fulfill the criteria enumerated below. Note that we do not specify what it means to "best fulfill" these criteria, or relatedly, how the process of finding these optimal trajectories is executed; a few possibilities are nonetheless mentioned in the discussion.


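1. m is an achievable motor trajectory.

2. p approximates the exemplar e<sub>c</sub>, with greater weight given to its more salient portions; that is, the expression

$$\int_{0}^{1} \text{SALIENCE}_{e_c}(Tt) \, d_{\text{SOUNDS}}(e_c(Tt), p(st)) \, dt$$

is small enough.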

This expression first stretches e<sub>c</sub> and p (by multiplying their inputs by T and s, respectively) so that the starting point of e<sub>c</sub> is aligned with the starting point of p, and these both occur at t = 0, and the ending point of e<sub>c</sub> is aligned with the ending point of p, and these both occur at t = 1. Indeed, observe that e<sub>c</sub>(T · 0) = e<sub>c</sub>(0) and p(s · 0) = p(0), which are the initial values of these trajectories; and e<sub>c</sub>(T · 1) = e<sub>c</sub>(T) and p(s · 1) = p(s), which are their final values. Once these trajectories are properly aligned, the distance between them is computed for every value of t and multiplied by the salience of the exemplar at that time (where the salience function has also been stretched for alignment). Then, the average of all of these salience-weighted distances is computed. The result is the expression above. Thus, the smaller this expression, the better an approximation p is of e<sub>c</sub>. Note that this weighted distance is very similar to the "class of metrics" described by Mermelstein (1976, p. 96) and the "time-normalized difference definition" given by Sakoe and Seibi (1978), and the approach is similar to that taken by Itakura (1975), among others.

3. m is a favored motor trajectory, where the notion of favored corresponds with frequency such that a trajectory is more favored if it has been traversed often, and a combination of motor trajectories is more favored if its constituent paths have been traversed together with one another often. Note that frequency is a relative value, and no claim is made here about the specific relationship between favoredness and frequency. The only claim is that favoredness increases as frequency increases.
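The salience-weighted comparison of an exemplar with a candidate perceptual trajectory can be approximated numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes a one-dimensional perceptual space with absolute difference as d<sub>SOUNDS</sub>, uses a midpoint Riemann sum for the integral, and all trajectories (`e_c`, `p_a`, `p_b`) and the function name `salience_weighted_distance` are invented for the example. The salience function is the two-syllable step function described above.

```python
from typing import Callable

Sound = float  # toy 1-D stand-in for a point in SOUNDS


def salience_weighted_distance(
    e_c: Callable[[float], Sound], T: float,
    p: Callable[[float], Sound], s: float,
    salience: Callable[[float], float],
    d_sounds: Callable[[Sound, Sound], float],
    n: int = 1000,
) -> float:
    """Midpoint-rule approximation of the integral over t in [0, 1]."""
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n               # normalized time in (0, 1)
        total += salience(T * t) * d_sounds(e_c(T * t), p(s * t))
    return total / n


T, s = 4.0, 2.0                         # exemplar duration and candidate length (assumed values)


def e_c(t: float) -> Sound:             # hypothetical exemplar on [0, T]
    return t / T


def salience(t: float) -> float:        # first half fully salient, second half not at all
    return 1.0 if t <= T / 2 else 0.0


def d_sounds(x: Sound, y: Sound) -> float:
    return abs(x - y)


def p_a(t: float) -> Sound:             # matches the exemplar only on the salient half
    return t / s if t <= s / 2 else 0.5


def p_b(t: float) -> Sound:             # offset from the exemplar everywhere
    return t / s + 0.2


dist_a = salience_weighted_distance(e_c, T, p_a, s, salience, d_sounds)
dist_b = salience_weighted_distance(e_c, T, p_b, s, salience, d_sounds)
print(dist_a, dist_b)
```

Under these assumptions, `p_a` scores as a better approximation than `p_b` even though it departs from the exemplar in the second syllable, because that departure falls entirely in the non-salient portion.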

Overall, then, the matching and selection process instantiated in Core is based on a perceptual approximation of the holistic goal, which is nonetheless constrained by existing paths through motor space. The approximation is further biased by the frequency with which the motor paths have been practiced together. The process therefore ensures that first words resemble patterns that have been most extensively practiced during vocal-motor exploration. In this way, Core accounts for the observation that children's favored forms in first words reflect their favored production patterns in babbling (Vihman et al., 1986; McCune and Vihman, 2001; cf. Davis et al., 2002) and the observation that children tend to favor a limited number of forms in first word productions, even while what is favored differs across individual children (Ferguson and Farwell, 1975; Macken and Ferguson, 1981; Stoel-Gammon and Cooper, 1984; inter alia).

<sup>18</sup>Recall that salience is defined by an infant's attention, which, though not determined by acoustic properties, is nonetheless influenced by them such that, for example, louder and longer sounds are expected to be more salient than quieter and shorter sounds. Of course, defined in this way, salience is likely to affect the acquisition and representation of exemplars. This means that perceptual phonological representations are as likely to change through developmental time as motor phonological representations. In Core, we gloss over this implication for the sake of simplicity. In lieu of modeling developmental changes to exemplar representations, we only model salience at the moment of production. Put another way, a more complete model would include the effect of salience on initial exemplar representation (i.e., a salience "filter" on the perceptual input) in addition to salience as internally directed attention to the perceptual goal.

#### 2.5.2. Motor Representations and Convergence

The matching and selection process described so far considers the role of holistic perceptual phonological representations (i.e., exemplars) in production. In this section, we define holistic motor phonological representations (i.e., schemas and silhouettes) and formally describe their role in the production process.

In a first attempt at communicating a concept c, the motor trajectory m is selected for output. Given the very slow development of peripheral motor control (Smith and Zelaznik, 2004), the trajectory that is executed will be close to m but not exactly the same as m; that is, it will be m plus whatever noise is introduced during implementation. Let us call this new trajectory m′. If the vocalization from which m′ is derived successfully communicates c, then m′ will be linked to c. This is a schema, which we will refer to as SCHEMA<sub>m′</sub>. Note that SCHEMA<sub>m′</sub> is the same function as m′, and only differs from m′ in that it is associated with the concept c. Once c has been successfully communicated using m′ : [0, s′] → ARTIC, and assuming all the objects used were those that were identified in the selection process described above (section 2.5.1), the next attempt to communicate c is as follows.

SCHEMA<sub>m′</sub> is activated at the same time as an exemplar associated with c. The specific exemplar may be different than before, so let us call it e′<sub>c</sub>. There is also a function SALIENCE<sub>e′<sub>c</sub></sub>. A pair of motor and perceptual trajectories, m<sub>1</sub> : [0, s<sub>1</sub>] → ARTIC and p<sub>1</sub> : [0, s<sub>1</sub>] → SOUNDS, is then chosen based on criteria 1–3 above (but with the appropriate objects substituted) and based on a fourth criterion:

4. m<sub>1</sub> is close to the motor schema SCHEMA<sub>m′</sub>. More specifically, letting k be a fixed value, there exist α, β, with 0 ≤ α ≤ β − k ≤ s<sub>1</sub> − k such that

$$\frac{1}{\beta - \alpha} \int_{\alpha}^{\beta} d_{\text{ARTIC}}\big(m_1(t), m'(h_{\alpha, \beta, s'}(t))\big) \, dt$$

is sufficiently small, where h<sub>α,β,s′</sub>(t) = (s′/(β − α))t − s′α/(β − α); this function is used to align m′ with the portion of m<sub>1</sub> going from relative time α to relative time β. Recall that SCHEMA<sub>m′</sub> is the same as m′, but with a link to a concept, so comparing something in motor space to SCHEMA<sub>m′</sub> is the same as comparing it to m′. The value of k is the minimum length of a portion of m<sub>1</sub> we are willing to have SCHEMA<sub>m′</sub> align with.

Regarding criterion 4 above, it is not required that the chosen motor trajectory and the schema be similar from start to finish; for example, the new motor trajectory might have one more syllable than the motor trajectory associated with the previous schema. So, instead of requiring all of SCHEMA<sub>m′</sub> to match m<sub>1</sub> well enough from start to finish (i.e., from times 0 to s<sub>1</sub>), the whole of SCHEMA<sub>m′</sub> is allowed to be compared to m<sub>1</sub> from times α to β, for various values of α and β, as shown in **Figure 5**. For any particular choice of α and β, m′ is temporally stretched (via the precomposition with h<sub>α,β,s′</sub>) to run from α to β with respect to m<sub>1</sub>. Then, with this alignment, the average distance between the schema and the motor trajectory is computed. The expression above gives this average.

To avoid compressing SCHEMA<sub>m′</sub> too much relative to m<sub>1</sub>, we specify that SCHEMA<sub>m′</sub> cannot be compared to a portion of m<sub>1</sub> that is less than k units of relative time, for some predetermined value of k. In other words, it must be the case that β − α ≥ k. Of course, it is also the case that α and β must be between 0 and s<sub>1</sub>, since they must be in the domain of m<sub>1</sub>. Combining these facts with the inequality β − α ≥ k, we get the chain of inequalities stated above, 0 ≤ α ≤ β − k ≤ s<sub>1</sub> − k. Additionally, to retain the relative timing of SCHEMA<sub>m′</sub>, we only allow the temporal stretching to be linear. That is, because m′ is precomposed only with a linear function, the only thing altered is the portion of m<sub>1</sub> to which m′ is compared; within that comparison, the relative timing of m′ is maintained.

FIGURE 5 | Three possible alignments of an existing schema trajectory (the lower curve in each picture) with a motor trajectory selected for output (the upper curve in each picture). In this figure, a constant velocity is assumed, which means that time is proportional to distance. In the first case (left), the existing schema trajectory is being compared with about the first 80% of the selected motor trajectory. In the other cases, it is being compared with some middle portion of the selected trajectory. Out of the three cases, the first alignment gives the smallest distance. It would be possible, for instance, that only the first alignment fulfills the criterion of the expression above being "small enough". In that case, as long as β − α ≥ *k*, the criterion would be considered fulfilled by *m*<sub>1</sub>, since there would *exist* a pair of values for α and β (i.e., an alignment) that makes this distance sufficiently small.

So, as stated above, the statement that m<sub>1</sub> is close to the motor schema SCHEMA<sub>m′</sub> just means that there are some α and β, with 0 ≤ α ≤ β − k ≤ s<sub>1</sub> − k, such that the expression above (i.e., the average distance after alignment based on α and β) is sufficiently small.
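The existence condition in criterion 4 can be illustrated with a coarse grid search over alignments. The following is a sketch under assumptions that go beyond the text: motor space is one-dimensional with absolute difference as d<sub>ARTIC</sub>, the integral is approximated by a midpoint sum, candidate values of α and β are restricted to an evenly spaced grid, and the trajectories, the threshold, and the function names are hypothetical. Core itself does not specify how suitable values of α and β are found.

```python
from typing import Callable, Optional, Tuple

Traj = Callable[[float], float]  # toy 1-D motor trajectory


def h(t: float, alpha: float, beta: float, s_prime: float) -> float:
    """Linear map sending [alpha, beta] onto the schema's domain [0, s_prime]."""
    return s_prime * (t - alpha) / (beta - alpha)


def avg_distance(m1: Traj, schema: Traj, s_prime: float,
                 alpha: float, beta: float, n: int = 400) -> float:
    """Midpoint-rule approximation of the average distance on [alpha, beta]."""
    total = 0.0
    for i in range(n):
        t = alpha + (beta - alpha) * (i + 0.5) / n
        total += abs(m1(t) - schema(h(t, alpha, beta, s_prime)))
    return total / n


def close_to_schema(m1: Traj, s1: float, schema: Traj, s_prime: float,
                    k: float, threshold: float,
                    steps: int = 30) -> Tuple[bool, Optional[Tuple[float, float, float]]]:
    """Return whether some grid alignment with beta - alpha >= k is good enough."""
    best: Optional[Tuple[float, float, float]] = None
    for i in range(steps + 1):
        alpha = s1 * i / steps
        for j in range(i + 1, steps + 1):
            beta = s1 * j / steps
            if beta - alpha < k:
                continue                # would compress the schema too much
            val = avg_distance(m1, schema, s_prime, alpha, beta)
            if best is None or val < best[0]:
                best = (val, alpha, beta)
    return (best is not None and best[0] <= threshold), best


def schema(u: float) -> float:          # hypothetical two-syllable schema on [0, 2]
    return u


def m1(t: float) -> float:              # same path, plus an extra third syllable
    return t if t <= 2 else 2 + 5 * (t - 2)


found, best = close_to_schema(m1, 3.0, schema, 2.0, k=1.0, threshold=0.05)
print(found, best)
```

In this toy case the search reports that the schema aligns with the first two syllables of the three-syllable trajectory, which mirrors the situation in the left panel of Figure 5: only one alignment needs to be good enough for the criterion to hold.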

In summary, a linked pair of perceptual and motor trajectories, p<sub>1</sub> : [0, s<sub>1</sub>] → SOUNDS and m<sub>1</sub> : [0, s<sub>1</sub>] → ARTIC, is selected for output based on some combination of how well the perceptual trajectory matches the perceptual goal (criterion 2), the extent to which the associated motor trajectory is favored (criterion 3), and the extent to which that motor trajectory matches the activated schema (criterion 4). Also, the motor trajectory must be achievable (criterion 1). When the matching and selection process references exemplars and schemas, speech production can be characterized as the perceptual-motor integration of holistic perceptual and motor phonological forms. Note however that the process in Core is not integration per se; instead, perceptual-motor integration is the convergence of a linked pair of trajectories that best approximate the perceptual goal within the constraints of past speech motor practice.

Whereas it is common to assume strong motor constraints on production in early child language (e.g., Locke, 1983; McCune and Vihman, 1987; Davis et al., 2002), it is also clear that these constraints are relaxed in adult language with the development of adult-like speech motor control. There are many sources of evidence for this assertion, including results from auditory feedback perturbation studies (e.g., MacDonald et al., 2010; Katseff et al., 2012) and phonetic imitation studies (e.g., Shockley et al., 2004; Nielsen, 2011; Babel, 2012). Altogether, the evidence strongly suggests that adult speech is perceptually guided, at least within the limits of the perceptual and motor spaces explored in one's native language [see, e.g., the limits of VOT imitation in Nielsen's (2011) study]. In Core, the transition from strong motor constraints on production to adult-like perceptually guided speech production results from the evolution of motor phonological representations through time (see also Redford, 2015, 2019). Let us consider this evolution next.

As with the first successful attempt at a word, subsequent successful attempts at the word yield new and different schemas. This is both because a child's attention to exemplar attributes changes through time (see discussion of "salience" in section 2.1) and because their immature motor systems introduce noise into the production process such that the motor space adjacent to a trajectory that has been selected for output is randomly explored. In Core, the new schemas generated with each successful new production of a word are associated with the target concept. All schemas associated with a single concept come together to form a silhouette, which we define recursively to emphasize our developmental perspective. To keep track of the silhouette's shape at any point in developmental time, we write SIL<sub>c,n</sub> to denote the silhouette that corresponds to c after the nth successful attempt to communicate c. When the moment in developmental time is not important, we will simply write SIL<sub>c</sub> to denote the silhouette corresponding to c, where the iteration is implicit. Then, to build the silhouette, the schemas are temporally aligned and the convex hull of their outputs is taken at each point in time<sup>19</sup>. The silhouette is defined to be a function that takes time as an input and gives the motorically possible subset of the convex hull corresponding to that time as an output; in other words, the silhouette encodes a time-varying region. Note that the way we define a silhouette at each point in time uses a procedure similar to Guenther's (1995) convex region theory. Critically, though, DIVA's time-varying regions contain exactly the vocalizations that are acceptable adult productions of a given speech sound.
In contrast, a silhouette highlights a swath through motor space in Core; reference to a perceptual trajectory is required to find a good motor trajectory within the swath, namely, one that will yield an acceptable adult sound/word production.

Formally, the silhouette that is associated with c after n iterations will be a function SIL<sub>c,n</sub> : [0, s<sub>n</sub>] → P(ARTIC), where P(ARTIC) is the power set of the set ARTIC (i.e., the set of all subsets of ARTIC), and s<sub>n</sub> is a number representing the number of syllables in SIL<sub>c,n</sub>, derived from the constituent schemas and how they are aligned<sup>20</sup>. Although a silhouette, in the sense of a composite motor form, only really emerges after two different attempts at a word, here we consider the first silhouette to emerge after the first attempt at a word. So, suppose that the first schema for c is SCHEMA<sub>m<sub>1</sub></sub> : [0, s<sub>1</sub>] → ARTIC; then the first silhouette, SIL<sub>c,1</sub> : [0, s<sub>1</sub>] → P(ARTIC), is defined by SIL<sub>c,1</sub>(t) = {SCHEMA<sub>m<sub>1</sub></sub>(t)}. This defines the silhouette as nearly the same function as SCHEMA<sub>m<sub>1</sub></sub>, except that at each time input, instead of giving an element of ARTIC as an output, it gives as an output the set containing that element. Now we can build the silhouette as a representation with sets, i.e., regions, as outputs.

Consider the nth iteration of a silhouette; that is, consider SIL<sub>c,n</sub> : [0, s<sub>n</sub>] → P(ARTIC). Suppose that the (n + 1)th schema associated with the same concept is SCHEMA<sub>m<sub>n+1</sub></sub> : [0, s<sub>n+1</sub>] → ARTIC. Let k take the same value as in criterion 4 above (note that k serves an analogous purpose here). Then we find α, β, with 0 ≤ α ≤ β − k ≤ s<sub>n+1</sub> − k such that

$$\frac{1}{\beta - \alpha} \int_{\alpha}^{\beta} \min_{x \in \text{SIL}_{c,n}(h_{\alpha, \beta, s_n}(t))} d_{\text{ARTIC}}(m_{n+1}(t), x) \, dt$$

equivalently,

$$\frac{1}{s_n} \int_0^{s_n} \min_{x \in \text{SIL}_{c,n}(t)} d_{\text{ARTIC}}\left( m_{n+1}\left( \frac{\beta - \alpha}{s_n} t + \alpha \right), x \right) dt$$

is minimal, where h<sub>α,β,s<sub>n</sub></sub>(t) = (s<sub>n</sub>/(β − α))t − s<sub>n</sub>α/(β − α), analogously to h<sub>α,β,s′</sub> in criterion 4; that is, we find an alignment of the schema and the silhouette so that the average distance from the schema to the closest point at each time in the silhouette is minimal. More specifically, for each pair of values α and β, this expression aligns the entire silhouette with a portion of the schema that runs temporally from α to β and computes the average distance between the two on that stretch. The smaller the average distance, the more appropriate (in some sense) it is to align the silhouette with that piece of the schema. The values of α and β that make this average distance (i.e., the expression above) minimal represent in some sense the optimal alignment of the schema and the silhouette. The success of this procedure (i.e., the minimal value of the expression being satisfactorily small) also entails that the schema necessarily has a portion of it that aligns well with the entire silhouette. This entailment rests on the assumption that words are progressively lengthened by adding on syllables or demisyllables over developmental time (e.g., the production [ˈnænA] for "banana" does not follow the production [bəˈnænA] in developmental time). Note the similarity of this expression to the expression in criterion 4. In criterion 4, the alignment is essentially required to be good enough (the average distance is required to be "small enough"); whereas, here, the alignment is required to be optimal (that is, the average distance is required to be minimal). Fulfillment of the good enough requirement is sufficient for a motor trajectory to be selected; but when this alignment is being used to build out the silhouette, as described below, it is required to be optimal.

<sup>19</sup>See **Appendix B** for the definition of a convex hull.

<sup>20</sup>Taking the power set is a necessary technical detail; see **Appendix B** for the rigorous definition of the convex hull, which requires a set as an input.

Once the best alignment of a new schema with an existing silhouette is identified, the (n + 1)th silhouette for c, SIL<sub>c,n+1</sub> : [0, s<sub>n+1</sub>] → P(ARTIC), can be defined:

$$\text{SIL}_{c,n+1}(t) = \begin{cases} \text{Conv}\big(\{\text{SCHEMA}_{m_{n+1}}(t)\} \cup \text{SIL}_{c,n}(h_{\alpha,\beta,s_n}(t))\big) \cap \text{ARTIC} & \text{if } \alpha \le t \le \beta \\ \{\text{SCHEMA}_{m_{n+1}}(t)\} & \text{otherwise,} \end{cases}$$

where Conv(A) is the convex hull of A, for any subset A of motor space. (In this case, we consider ARTIC in particular as a subset of an affine space, so the convex hull is defined; see **Appendix B**).
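The recursive construction of the silhouette can be sketched for the special case of a one-dimensional motor space, where every convex set is an interval and the convex hull of a point and an interval is simply the smallest interval containing both. This restriction, the bounds chosen for ARTIC, the example schemas, and the function names are all assumptions made for illustration; the alignment values α and β are taken as given rather than computed by the minimization described above.

```python
from typing import Callable, Tuple

Interval = Tuple[float, float]              # (low, high): a convex region in 1-D motor space
Schema = Callable[[float], float]
Silhouette = Callable[[float], Interval]

ARTIC: Interval = (0.0, 1.0)                # assumed bounds of the toy motor space


def initial_silhouette(schema: Schema) -> Silhouette:
    """SIL_{c,1}(t) = {SCHEMA(t)}: a degenerate interval holding one point."""
    return lambda t: (schema(t), schema(t))


def update_silhouette(sil: Silhouette, s_n: float, schema: Schema,
                      alpha: float, beta: float) -> Silhouette:
    """Build SIL_{c,n+1} from SIL_{c,n} and the (n+1)th schema, given an alignment."""
    def new_sil(t: float) -> Interval:
        x = schema(t)
        if alpha <= t <= beta:
            lo, hi = sil(s_n * (t - alpha) / (beta - alpha))   # SIL_{c,n}(h(t))
            lo, hi = min(lo, x), max(hi, x)                    # 1-D convex hull
            return (max(lo, ARTIC[0]), min(hi, ARTIC[1]))      # intersect with ARTIC
        return (x, x)                                          # outside the aligned stretch
    return new_sil


def schema_1(t: float) -> float:            # hypothetical first schema on [0, 2]
    return 0.2 + 0.1 * t


def schema_2(t: float) -> float:            # hypothetical second schema on [0, 3]
    return 0.3 + 0.1 * t


sil_1 = initial_silhouette(schema_1)
sil_2 = update_silhouette(sil_1, s_n=2.0, schema=schema_2, alpha=0.0, beta=2.0)
print(sil_1(1.0), sil_2(1.0), sil_2(2.5))
```

Inside the aligned stretch the region widens from a single point to an interval spanning both schemas, while on the added third syllable it is still a single point because only one schema has visited that part of the trajectory.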

#### 2.5.3. Adult-Like Production

In Core, adult-like production uses the same process as the second attempt at a word, but a silhouette, rather than a schema, biases the matching and selection process. More precisely, once a silhouette SIL<sub>c</sub> : [0, s′] → P(ARTIC) exists for a particular concept c, a motor trajectory m : [0, s] → ARTIC and corresponding perceptual trajectory p : [0, s] → SOUNDS are chosen to communicate c based on the three criteria in section 2.5.1, as well as the following criterion, which is a generalization of criterion 4, the criterion used in the second attempt at a word:

4\*. A portion of m is close to fitting into the current silhouette for c. That is, there exist α and β with 0 ≤ α ≤ β − k ≤ s − k (k the same as in the previous criterion 4) such that

$$\frac{1}{\beta - \alpha} \int_{\alpha}^{\beta} \min_{x \in \text{SIL}_{c}(h_{\alpha, \beta, s'}(t))} d_{\text{ARTIC}}(m(t), x) \, dt$$

is sufficiently small, where h<sub>α,β,s′</sub> is as defined in criterion 4.
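To make the set-valued comparison in criterion 4\* concrete, the sketch below evaluates the expression for one fixed alignment. It assumes, purely for illustration, that each silhouette region is represented by a finite list of sample configurations in a toy two-dimensional motor space, so that the minimum over the region becomes a minimum over the samples; the silhouettes, the trajectory, and the function names are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Config = Tuple[float, float]                       # toy 2-D point in ARTIC
Region = List[Config]                              # finite sample of a silhouette region


def dist_to_region(x: Config, region: Region) -> float:
    """Distance from x to the closest sampled point of the region."""
    return min(math.dist(x, y) for y in region)


def criterion_4_star(m: Callable[[float], Config],
                     sil: Callable[[float], Region], s_prime: float,
                     alpha: float, beta: float, n: int = 200) -> float:
    """Midpoint-rule approximation of the average distance for one alignment."""
    total = 0.0
    for i in range(n):
        t = alpha + (beta - alpha) * (i + 0.5) / n
        u = s_prime * (t - alpha) / (beta - alpha)  # h_{alpha,beta,s'}(t)
        total += dist_to_region(m(t), sil(u))
    return total / n


def sil_narrow(u: float) -> Region:                # early silhouette: a single path
    return [(u, 0.0)]


def sil_wide(u: float) -> Region:                  # later silhouette: a broader swath
    return [(u, 0.0), (u, 0.1), (u, 0.2)]


def m(t: float) -> Config:                         # candidate trajectory on [0, 2]
    return (t, 0.1)


narrow = criterion_4_star(m, sil_narrow, s_prime=2.0, alpha=0.0, beta=2.0)
wide = criterion_4_star(m, sil_wide, s_prime=2.0, alpha=0.0, beta=2.0)
print(narrow, wide)
```

The same candidate trajectory lies at a positive average distance from the narrow silhouette but inside the wider one, which illustrates how a silhouette that has grown through practice constrains selection less.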

Importantly, the regions that define the silhouette at each moment along its length will stay the same size with each iteration of a word or increase to include more points. The salience function introduces extensive variability in word production during early child language, which means that the region defined by a silhouette at each point in time will often expand. In addition, the well-grounded assumption that immature motor control introduces noise into execution entails an exploration of motor space adjacent to the planned (selected) motor trajectory. The new paths carved out by this exploration can be purposefully used in future productions to find closer approximations to the perceptual goal. Due to the increasing availability of better approximations, articulatory accuracy increases with developmental time, albeit not necessarily in a linear fashion. Further, we assume that failures in communication are also beneficial to the development of articulatory accuracy in that such failures also define new trajectories through motor space within and adjacent to the regions defined by the silhouettes.

In sum, silhouettes come to represent passages through motor space that are especially well-explored over developmental time. The exploration reticulates the motor space within these passages so completely that the motor phonological representation provides less and less of a constraint on the matching and selection process. Instead, the perceptual constraint can be fully optimized during each production; that is, the perceptual trajectory that is the goal can be closely approximated at each point in time using the set of endogenous perceptual trajectories that are linked to corresponding trajectories in motor space. This is adult-like speech production: a process that is perceptually guided within a silhouette-bounded motoric range.

## 3. DISCUSSION

Intelligible adult speakers achieve language-specific articulatory configurations one after another in rapid sequence. The configurations are typically conceived of as movement in service of speech motor goals. Most adult-focused models of speech production assume that these goals are perceptual or auditory in nature and linked in some manner to a limited set of discrete phonological representations, for example, to phonemes or distinctive features (e.g., Houde and Nagarajan, 2011; Tourville and Guenther, 2011; Hickok, 2012). This assumption introduces a serial order problem that psycholinguistic models of speech production are designed to solve. For at least half a century, the solution has been to posit an encoding process where segmental phonological rules are applied and then phonetic detail is specified (e.g., MacKay, 1970; Dell, 1986; Levelt, 1989; et seq.). Redford (2015, 2019) has argued that this solution is incompatible with a developmental perspective on spoken language production. In particular, the encoding process suggests an acquisition problem too complex to surmount by the time infants are producing first words at 12 months of age. Moreover, the hypothesis is at odds with the sound patterns of early child language, which suggest the whole word as both plan and goal (see, e.g., Vihman and Keren-Portnoy, 2013; Redford, 2019).

A developmental perspective leads us to embrace the alternative to a phonological-phonetic encoding hypothesis; namely, that word forms are remembered and retrieved holistically for production. This whole word production hypothesis solves the serial order problem by avoiding it, but it also raises the question: how does adult-like speech motor control develop absent discrete phonological representations? The Core model provides an answer. The ability to target linguistically significant articulatory configurations one after another in rapid sequence relies on a perceptually guided production process within a silhouette-bounded motoric range subsequent to the emergence of perceptual-motor units, which occurs over developmental time as the motor space becomes increasingly reticulated with exploration.

The central hypothesis in Core that the (near) overlap of motor trajectories yields perceptual-motor units and articulatory chunks for combination implies a production system that is superficially combinatorial; that is, a system where "parts of signals overlap (that is, occupy the same position in acoustic and perceptual space) with parts of other signals... Importantly, the overlapping parts of different signals need not necessarily also be the units of combination of the underlying linguistic representation (Zuidema and de Boer, 2009, p. 126)." Zuidema and de Boer distinguish such a system from one that is productively combinatorial; that is, a system "where the cognitive mechanisms for producing, recognizing and remembering signals make use of a limited set of units that are combined in many different ways. Productive combinatoriality is a property of the internal representations of language in the speaker (p. 126)." They argue that emergent elements in a superficial combinatorial phonology can become available for use in a productive combinatorial phonology over evolutionary time with communicative pressures. Core demonstrates, however, that the transition from a superficial combinatorial phonological system to a productive one is not necessary to account for normal speech production. Rather, Core assumes phonological representations that are sets of form-meaning pairings. In one set, the forms are holistic, perceptual, and exogenously derived; in the other set, the forms are holistic, motoric, and endogenously derived. Both types of representations are "integrated" for output using the perceptual-motor map according to a matching and selection process that produces increasingly optimal results (i.e., closer matches to the holistic perceptual goal) as the perceptual and motor spaces become increasingly reticulated with vocal-motor exploration and practice.
In Core, the matching and selection process may result in a novel motor trajectory that can be analyzed as a combination of smaller paths from multiple trajectories, but there is no sense in which the junctures that delimit these paths are independently recognized and remembered by the speaker to generate a targeted linguistic form.

Although our assertion is that normal speech production is governed by holistic representations, this is not to say that the emergent perceptual-motor units and articulatory chunks posited in Core could not be inducted into the speaker's linguistic system. In fact, we expect that speakers may identify perceptual-motor units and the articulatory chunks they delimit as structurally important linguistic elements with the development of metalinguistic awareness and the right incentives (e.g., the motivation to read and write). This identification may never be critical to the speech production process, but could be useful for creative language, including for rhyming and for creating lines that are onomatopoetic, alliterative, and so on. We suggest that both the identification of perceptual-motor units as elements of linguistic structure and the creative use of these elements in spoken or written verse rely on a speaker's intuition of sound/action equivalence, which is in turn grounded in notions of perceptual and articulatory distance. These notions are themselves based on metrics implied in the architecture of motor and perceptual spaces in Core.

Specifically, one can define a distance metric on the set of equivalence classes of motor trajectories that aligns with the structures described in Core. Let m: [0, s] → ARTIC and m′: [0, s′] → ARTIC be motor trajectories. Define the distance between them to be

$$\int_{0}^{1} d_{\text{ARTIC}}(m(st), m'(s't)) \, dt \tag{i}$$

It can be checked that this is a pseudometric on the set of motor trajectories; that is, it is nearly a metric, except for the fact that there are (in theory) trajectories that are a distance of zero from each other that are nevertheless distinct due to global timing differences. The equivalence relation defined in section 2.2.3 treats two such trajectories as equivalent. The pseudometric then induces a metric on the set of equivalence classes; that is, the metric is compatible with the structure on the set of motor trajectories that has been laid out. For example, one can easily observe the similarity between this metric and the way that the distance between a motor trajectory and a motor silhouette is measured. Consider a case where the expression in criterion 4\* is utilized to compare a motor trajectory m: [0, s] → ARTIC and a silhouette SIL<sub>c</sub>: [0, s′] → P(ARTIC), specifically with the alignment that compares the entirety of the motor trajectory to the entirety of the silhouette. That expression in this case becomes

$$\frac{1}{s} \int_0^s \min_{x \in \text{SIL}_c(s't/s)} d_{\text{ARTIC}}(m(t), x) \, dt,$$

which is equal to

$$\int_0^1 \min_{x \in \text{SIL}_c(s't)} d_{\text{ARTIC}}(m(st), x) \, dt$$

through a change of variables. Then, let m′: [0, s′] → ARTIC be a theoretical motor trajectory that is the closest possible at each point in time to m, while still being contained in the motor silhouette SIL<sub>c</sub> (i.e., m′(t) is in SIL<sub>c</sub>(t) for each t). This expression is then equivalent to

$$\int_0^1 d_{\text{ARTIC}}(m(st), m'(s't)) \, dt,$$

which is the distance between motor trajectories m and m′ as just defined in (i). In other words, using the procedure described in criterion 4\* to compare a motor trajectory to a silhouette on the entirety of both of their domains is equivalent to comparing that motor trajectory to a theoretical closest motor trajectory that is contained in the silhouette. It is in this way that these two notions are compatible. The relationship of (i) to the expression in criterion 4 (being a special case of the expression in criterion 4\*) is even more straightforward: if α and β are set to be 0 and s<sub>1</sub>, respectively, then the expression in criterion 4 is exactly the expression (i) applied to motor trajectories m<sub>1</sub> and m′.
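The compatibility just described can be checked numerically. The following is a minimal sketch, assuming for illustration that ARTIC is the real line with absolute difference as its distance and that a silhouette returns a finite set of points at each time; the function names and the toy trajectories are hypothetical and are not part of Core.

```python
import math

def trajectory_distance(m, s, m_prime, s_prime, dist, n=1000):
    """Midpoint-rule approximation of expression (i) on normalized time [0, 1]."""
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        total += dist(m(s * t), m_prime(s_prime * t))
    return total / n

def silhouette_distance(m, s, sil, s_prime, dist, n=1000):
    """Normalized criterion 4* expression: average distance to the nearest silhouette point."""
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        point = m(s * t)
        total += min(dist(point, x) for x in sil(s_prime * t))
    return total / n

# Toy one-dimensional ARTIC space (an assumption made only for this sketch).
d = lambda a, b: abs(a - b)

m = lambda u: u / 2.0        # trajectory on [0, 2]
m_slow = lambda u: u / 4.0   # same path, uniformly stretched onto [0, 4]
sil = lambda u: [0.6, 0.7, 0.8]  # constant, finite silhouette on [0, 4]

# Closest trajectory contained in the silhouette at each (rescaled) time.
closest = lambda u: min(sil(u), key=lambda x: d(m(2.0 * u / 4.0), x))

stretched = trajectory_distance(m, 2.0, m_slow, 4.0, d)
to_silhouette = silhouette_distance(m, 2.0, sil, 4.0, d)
to_closest = trajectory_distance(m, 2.0, closest, 4.0, d)
```

Under these assumptions, `stretched` is zero although the two trajectories differ in global timing, which illustrates why (i) is a pseudometric rather than a metric, and `to_silhouette` agrees with `to_closest`.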

Similarly, let p: [0, T] → SOUNDS and p′: [0, T′] → SOUNDS be two perceptual trajectories (self-productions and/or exemplars). It is reasonable to define the distance between them to be the sum, or in this case average (which can be seen as a time-normalized sum), of the distances between them at each time (Itakura, 1975, p. 69; Mermelstein, 1976; Sakoe and Seibi, 1978; inter alia). More specifically, define the distance between them to be

$$\int_{0}^{1} d_{\text{SOUNDS}}(p(Tt), p'(T't)) \, dt. \tag{ii}$$

As in the motor case, this is a pseudometric on the set of perceptual trajectories that yields a metric on the set of equivalence classes of perceptual trajectories defined in section 2.2.3. Moreover, observe that this is the same as the measure between a self-production and a perceptual trajectory as defined in criterion 2 if the salience were 1 everywhere (that is, if the whole of the exemplar were fully salient), as would likely be the case in adult speech. Thus, this metric is a good representation of the structure on the set of perceptual trajectories for an adult (for a discussion of desirable properties of perceptual distance measures, see Mermelstein, 1976).

A psychological notion of distance could emerge from the implied metrics described above. This notion could then account for the experience of two words as sounding or feeling similar. A creative language behavior, like rapping, could then be understood as the conscious exploitation of an intrinsic matching algorithm; specifically, as an attempt at minimizing the perceptual trajectory distance between two word-length perceptual trajectories, and/or minimizing the motor trajectory distance between two word-length motor trajectories; or as an attempt at keeping these distances within a certain range. For example, the impression that a line flows well in a rap might arise because the speaker has identified perceptual trajectories that are similar enough that the distance between them is below a certain threshold, but are different enough that they are not pure repetition (e.g., Eminem's "...all the stores ship us platinum" and then "...metamorphosis happen"; Mathers et al., 2002, track 12). Rhyming, on the other hand, is a particular instantiation of bounding the distance between perceptual trajectories, wherein a not-too-large, not-too-small average distance between trajectories is achieved specifically by making the distances larger at the onset, and very small in the rhyme. This additional restriction would require modulation or deliberate new constraints on the perceptual matching algorithm that is intrinsic to Core.

The distance metrics we define are fundamental to speech production and development in Core because both rely on comparisons between trajectories. Two critical comparison operations are matching to approximate a phonetically detailed perceptual representation (i.e., an exemplar) to produce words, and matching existing schemas to create an abstract motor phonological representation (i.e., a silhouette). The algorithms we instantiate to effect these and other comparison operations were sometimes motivated by specific hypotheses regarding spoken language behavior; other times they were expedient. For example, a theoretically motivated assumption underlies the choice to represent perceptual trajectories that are exemplars and perceptual trajectories that are self-generated in the same perceptual space and then match them based on patterns (e.g., the difference between Z<sub>1</sub> and Z<sub>3</sub>) rather than based on absolute values (e.g., the values of Z<sub>1</sub> and Z<sub>3</sub>). The assumption is that infants do not track the various acoustic correlates to linguistic contrasts separately; rather, they attend to how the correlates covary in time (see, e.g., Sussman, 1986). This assumption implies that the normalization problem is not actively solved during development. Instead, it is automatically solved in speech processing and production (for a contrasting view see, e.g., Plummer, 2014).

In contrast to the representation of perceptual trajectories, the choice to consider two trajectories equivalent if one can be made into the other by uniform stretching was merely expedient. A more accurate model would include a more nuanced method for the direct comparison of two perceptual or motor trajectories. In particular, applying nonlinear time warping might be preferable to the uniform stretching algorithm we used here, since it would more readily capture the disproportionate changes that vowels undergo relative to consonants with changes in speech rate (e.g., Gay, 1981). Techniques used in functional data analysis (see, e.g., Ramsay and Silverman, 2002) or dynamic time warping algorithms (see, e.g., Sakoe and Seibi, 1978; Furui, 1986) could be considered for this<sup>21</sup>; however, many, if not all, dynamic time warping algorithms do not yield perfect metrics (Casacuberta et al., 1987), which is a disadvantage for defining distance in the perceptual and motor spaces. On the other hand, it may be the case that there exist dynamic time warping methods whose outcomes are essentially metrics on the set of actual vocalizations, which is a subset of the set of theoretically possible vocalizations (ibid.).
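The trade-off between uniform stretching and nonlinear warping can be illustrated with a small sketch. It assumes one-dimensional sampled trajectories, absolute difference as the local cost, and a textbook unnormalized dynamic time warping recursion; none of these choices is taken from Core, and the function names are hypothetical.

```python
import math

def dtw(a, b):
    """Textbook unnormalized DTW cost with absolute difference as local cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def sample(seq, t):
    """Linearly interpolate a sampled trajectory at normalized time t in [0, 1]."""
    pos = t * (len(seq) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(seq) - 1)
    frac = pos - lo
    return seq[lo] * (1.0 - frac) + seq[hi] * frac

def uniform_stretch_distance(a, b, n=1000):
    """Time-normalized average distance, in the spirit of expressions (i) and (ii)."""
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        total += abs(sample(a, t) - sample(b, t))
    return total / n

# The same path, but with the initial portion held three times as long.
short = [0.0, 1.0, 2.0]
held = [0.0, 0.0, 0.0, 1.0, 2.0]
warped_cost = dtw(short, held)
stretched_cost = uniform_stretch_distance(short, held)

# Three short sequences for which this DTW variant violates the triangle inequality.
x, y, z = [0.0], [1.0, 2.0], [2.0, 2.0, 2.0]
direct = dtw(x, z)
detour = dtw(x, y) + dtw(y, z)
```

In this toy case the warped cost is zero while the uniformly stretched distance is not, which is the sense in which nonlinear warping absorbs disproportionate lengthening. The second comparison shows the cost of that flexibility for this particular variant: the direct cost from `x` to `z` exceeds the cost of the detour through `y`, so the triangle inequality fails.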

There are a number of other examples of expedient choices that we made when formalizing the model. The most notable of these are the many criteria that were left underspecified. For example, in criterion 2 and criteria 4 and 4\*, a particular measure of distance is required to be "small enough" or "sufficiently small." We also choose trajectories that "best fulfill" criteria 2, 3, and 4, but we do not specify what optimal fulfillment means. These underspecified criteria suggest avenues for future research. For example, when a quantity is "small enough," that could mean it lies below some threshold value that is either fixed or changing over developmental time. Alternatively, "small enough" could mean "smallest out of some comprehensive set of objects considered." These and other open questions could be answered in empirical research designed to test different model-based predictions.
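The alternative readings of "small enough" can be stated concretely. The sketch below is purely illustrative: the candidate labels, the distance values, the linear threshold schedule, and all function names are hypothetical assumptions, and the direction in which a threshold might change over development is not specified by the model.

```python
def select_below_threshold(candidates, distance, threshold):
    """Reading 1: keep every candidate whose distance lies below a threshold."""
    return [c for c in candidates if distance(c) < threshold]

def developmental_threshold(age_months, start=1.0, floor=0.2, rate=0.05):
    """Hypothetical schedule: the threshold tightens linearly with age, down to a floor."""
    return max(floor, start - rate * age_months)

def select_smallest(candidates, distance):
    """Reading 2: keep the single candidate with the smallest distance."""
    return min(candidates, key=distance)

# Made-up distances between three candidate trajectories and a goal.
distances = {"a": 0.9, "b": 0.4, "c": 0.15}
candidates = list(distances)
lookup = distances.get

fixed = select_below_threshold(candidates, lookup, 0.5)
early = select_below_threshold(candidates, lookup, developmental_threshold(0))
later = select_below_threshold(candidates, lookup, developmental_threshold(12))
best = select_smallest(candidates, lookup)
```

The readings make different predictions: a threshold admits several acceptable candidates (and a changing threshold admits fewer over time in this sketch), whereas the minimum always yields exactly one.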

## AUTHOR CONTRIBUTIONS

The paper is fully collaborative. Each author contributed 50% effort to the manuscript. MD's primary responsibility was to formalize the model. MR's primary responsibility was to conceptualize the model. Both authors contributed to the writing.

<sup>21</sup>See Mermelstein (1976) for a review of different distance measures.

#### FUNDING

This research was wholly supported by the Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD) under grant R01HD087452 (PI: Redford). The content is solely the authors' responsibility and does not necessarily reflect the views of NICHD.


#### ACKNOWLEDGMENTS

We thank the reviewers for their attentive comments. We are also grateful to Paul Herstedt for being a sounding board for the mathematical ideas and for notational edits.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.02121/full#supplementary-material


Guenther, F. H. (2016). Neural Control of Speech. Cambridge, MA: The MIT Press.

Harris, M., Yeeles, C., Chasin, J., and Oakley, Y. (1995). Symmetries and asymmetries in early lexical comprehension and production. J. Child Lang. 22, 1–18. doi: 10.1017/S0305000900009600


Hawkins, S. (2003). Roles and representations of systematic fine phonetic detail in speech understanding. J. Phonet. 31, 373–405. doi: 10.1016/j.wocn.2003.09.006


Emergence of Phonology: Whole-word Approaches and Cross-linguistic Evidence, eds M. M. Vihman and T. Keren-Portnoy (Cambridge: Cambridge University Press), 460–502.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Davis and Redford. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Motoric Mechanisms for the Emergence of Non-local Phonological Patterns

#### Sam Tilsen\*

Department of Linguistics, Cornell University, Ithaca, NY, United States

Non-local phonological patterns can be difficult to analyze in the context of speech production models. Some patterns – e.g., vowel harmonies, nasal harmonies – can be readily analyzed to arise from temporal extension of articulatory gestures (i.e., spreading); such patterns can be viewed as articulatorily local. However, there are other patterns – e.g., nasal consonant harmony, laryngeal feature harmony – which cannot be analyzed as spreading; instead these patterns appear to enforce agreement between features of similar segments without affecting intervening segments. Indeed, there are numerous typological differences between spreading harmonies and agreement harmonies, and this suggests that there is a mechanistic difference in the ways that spreading and agreement harmonies arise. This paper argues that in order to properly understand spreading and agreement patterns, the gestural framework of Articulatory Phonology must be enriched with respect to how targets of the vocal tract are controlled in planning and production. Specifically, it is proposed that production models should distinguish between excitatory and inhibitory articulatory gestures, and that gestures which are below a selection threshold can influence the state of the vocal tract, despite not being active. These ideas are motivated by several empirical phenomena, which include anticipatory posturing before production of a word form, and dissimilatory interactions in distractor-target response paradigms. Based on these ideas, a model is developed which provides two distinct mechanisms for the emergence of non-local phonological patterns.

#### Edited by:

Pascal van Lieshout, University of Toronto, Canada

#### Reviewed by:

Marianne Pouplier, Ludwig Maximilian University of Munich, Germany

Michael Proctor, Macquarie University, Australia

#### \*Correspondence: Sam Tilsen

tilsen@cornell.edu

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 20 March 2019 Accepted: 04 September 2019 Published: 26 September 2019

#### Citation:

Tilsen S (2019) Motoric Mechanisms for the Emergence of Non-local Phonological Patterns. Front. Psychol. 10:2143. doi: 10.3389/fpsyg.2019.02143

Keywords: Articulatory Phonology, Selection-coordination theory, locality, phonology, harmony

# INTRODUCTION

This paper addresses the topic of locality in the origins of phonological patterns. The main focus is on developing a model of speech production that is sufficient to generate non-local patterns. The conclusion is that even when non-local agreement relations between segments are observed, the mechanisms which gave rise to such relations can be understood to operate locally. This is desirable if we wish to avoid a conception of speech that allows for "spooky action at a distance," i.e., discontinuities in the motor planning processes which determine the articulatory composition of a word. It is important to note that the model developed here involves the planning and production of word forms by an individual speaker, and the articulatory patterns generated by the model are viewed as seeds of potential sound change on larger spatial and temporal scales. The starting point of the model is the gestural framework of Articulatory Phonology (Browman and Goldstein, 1989) and Task Dynamics (Saltzman and Munhall, 1989); recent extensions to this model in the Selection-coordination framework (Tilsen, 2016, 2018a,b) are also incorporated. We will develop an extension of these models in which there are two distinct ways for non-local patterns to arise; these mechanisms are shown to account for the origins of spreading and agreement harmonies, respectively.

In the gestural scores of Articulatory Phonology/Task Dynamics (henceforth AP/TD), an interval of gestural activation corresponds to a period of time in which there is force acting upon the state of the vocal tract, potentially driving it toward a new equilibrium value. Both the state parameter and the equilibrium value are typically represented by gestural labels in a score, e.g., an interval labeled as LA clo specifies the vocal tract state parameter as LA (lip aperture) and the equilibrium value as clo, i.e., a physical value corresponding to bilabial closure. Because of their inherent temporality, gestural activation intervals in the score provide a convenient proxy for mapping between a hypothesized cognitive system for control of movement and the empirical outputs of that system, i.e., changes in vocal tract states during speech. Yet there are many ways in which interpretation of the score necessitates familiarity with the underlying TD model. Indeed, there are several aspects of the system which are not shown in scores, and there are phenomena which scores are not well suited for describing.

To illustrate these points, we consider three issues in gestural representations of speech, which are relevant in different ways to our model of non-local patterns. The first issue is the role of the neutral attractor, which is hypothesized to govern the evolution of articulator states in the absence of gestural activation (Saltzman and Munhall, 1989; henceforth SM89). As we show below, there is a trade-off between the complexity of the neutral attractor and the postulation of additional gestures in the score. **Figure 1** shows several versions of gestural scores for a CV syllable, [sa]. Below the scores are a couple of the relevant tract variables and gestural targets (here and elsewhere we omit some tract variables/gestures, such as glottal aperture, for clarity of exposition). The gestural activation intervals of the score are periods of time in which the driving force on a tract variable is influenced by a gesture. For example, the segment [s] corresponds to a [TTCD nar] gesture. When [TTCD nar] becomes active, the TTCD tract variable is driven toward the associated target (i.e., a value labeled as nar, which refers to a degree of constriction that is sufficiently narrow to generate audibly turbulent airflow). What is not conventionally specified in gestural scores is the mechanism that drives a release of that constriction. In the SM89 model, the neutral attractor drives model articulators toward default positions when there are no active gestures that influence those articulators; it has a direct influence on model articulator states, but only an indirect influence on tract variables. Importantly, the neutral attractor is not a "gesture" because it does not directly specify a target in tract variable coordinate space.

Of primary interest in the example is how to model interactions between influences of the neutral attractor and influences of gestural activation. For the sake of argument, let's suppose – contrary to the SM89 model – that the effects of the neutral attractor on model articulator targets and stiffness (how quickly model articulators are driven to a target position) are blended with the effects of active gestures, and that the two neutral attractor blending strengths (i.e., stiffness blending and target blending) are correlated and constant throughout production of a word form. In this hypothetical situation, the model exhibits empirical deficiencies. Specifically, if the blending strength of the neutral attractor is relatively weak, then tract variables are slow to return to neutral positions after they have been displaced by gestural forces. For example, in **Figure 1A**, the hypothetical model exhibits an unrealistically slow release of the TTCD constriction (solid line). Simply strengthening the blended influence of the neutral attractor results in a different problem: the target of [TTCD nar] is never achieved (dashed line, **Figure 1A**). This target undershoot occurs because the relevant model articulators are driven to positions which reflect a compromise between the target of [TTCD nar] and the default positions associated with the neutral attractor. The empirical deficiencies associated with this hypothetical model are a consequence of the suppositions that stiffness and target blending strengths are related, and that the blending is constant.

The SM89 model does not presuppose that blending is constant. Instead, the SM89 model competitively gates the influence of the neutral attractor and the influences of gestures: when any active gesture influences a model articulator, the neutral attractor for that model articulator has no influence; conversely, when no active gestures influence a model articulator, the neutral attractor influences that articulator. This entails that the blending strength of the neutral attractor varies abruptly between minimal blending and maximal blending. The effect of competitive gating on tract variables is shown in **Figure 1B**. Competitive gating mitigates the problems that arise from constant blending: postgesture releases are more rapid and target undershoot is avoided.
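The contrast between constant blending and competitive gating can be made concrete with a toy simulation. This is a minimal sketch and not the SM89 implementation: it assumes a single one-dimensional variable with first-order dynamics (rather than the second-order dynamics of Task Dynamics), additive attractor forces, and arbitrary parameter values; the function and variable names are hypothetical.

```python
def simulate(neutral_weight, gated, dt=0.001, duration=1.0):
    """Euler-integrate one variable pulled by a gesture and by a neutral attractor."""
    stiffness = 30.0                      # hypothetical rate constant (1/s)
    gesture_target, neutral_target = 0.0, 1.0
    steps = int(round(duration / dt))
    onset, offset = int(round(0.2 / dt)), int(round(0.5 / dt))
    x = neutral_target
    trace = []
    for i in range(steps):
        active = 1.0 if onset <= i < offset else 0.0
        if gated:
            # Competitive gating: the neutral attractor is off whenever the gesture is on.
            weight = 0.0 if active else 1.0
        else:
            # Constant blending: the neutral attractor always acts with the same strength.
            weight = neutral_weight
        force = active * stiffness * (gesture_target - x)
        force += weight * stiffness * (neutral_target - x)
        x += force * dt
        trace.append(x)
    return trace

weak = simulate(neutral_weight=0.1, gated=False)    # weak constant neutral attractor
strong = simulate(neutral_weight=1.0, gated=False)  # strong constant neutral attractor
gated = simulate(neutral_weight=1.0, gated=True)    # competitively gated
```

Under these assumptions the weak constant attractor reaches the gestural target approximately but has not returned to the neutral position by the end of the simulation, the strong constant attractor releases quickly but settles halfway between the two targets, and the gated version both reaches the target and releases quickly.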

The neutral attractor gating mechanism (**Figure 1B**) appears to be empirically adequate, but to my knowledge there is no direct evidence that this is the correct conceptualization of the control system. Moreover, there is a subtle conceptual problem with the competitive gating mechanism: whereas the neutral attractor directly influences model articulators, active gestures only indirectly influence articulators, via their influences on tract variables. It may be somewhat worrying that a mechanism must be posited which is sensitive to gestural activation (i.e., forces on tract variables) but which affects the neutral attractor, which is not a force on tract variables. Another problem is that this mechanism may be overly powerful in its ability to abruptly shut off the neutral attractor for specific model articulators during production of a word form.

A logical alternative to competitive gating is a model in which constriction releases are accomplished via active gestures, such as [TTCD op] (which releases the TTCD constriction). This has been proposed by several researchers and is sometimes called the split gesture hypothesis (Browman, 1994; Nam, 2007a,b; Tilsen and Goldstein, 2012). As shown in the score of **Figure 1C**, a [TTCD op] gesture can be active and appropriately phased relative to [TTCD nar], so as to drive a constriction release. Alternatively, [TTCD op] may be co-active with the vocalic [PHAR [a]] gesture, and gestural blending can modulate its influence during the period of time in which [TTCD nar] is active. In either case, the release of the TTCD constriction is sufficiently rapid (dashed line in TTCD panel). The point of contrasting the analyses in **Figures 1B,C** is to show that there is a trade-off between positing additional gestures and utilizing a more powerful blending mechanism. This is highly relevant to the model we develop below, which proposes a substantial expansion of the inventory of gestures and reconceptualizes the neutral attractor.

A closely related issue is that in many uses of gestural score representations, the velum and glottis are assumed to obtain default states during speech, in the absence of active velar or glottal gestures. The theoretical implications of this assumption have not been thoroughly examined in previous literature. The model we develop below does away with the notion of default states. Thus the reader should note that when velum or glottal gestures are omitted from scores in this paper, it is out of convenience/clarity, rather than a theoretical claim.

A second issue with gestural scores is that there are movements that occur prior to production of a word form which do not appear to be prototypically gestural. In particular, several studies have found evidence that speakers anticipatorily posture the vocal tract before producing an utterance, in a manner that is contingent upon the initial articulatory content of the utterance (Rastle and Davis, 2002; Kawamoto et al., 2008; Tilsen et al., 2016; Krause and Kawamoto, 2019a,b). For example, Tilsen et al. (2016) conducted a real-time MRI investigation in which CV syllables /pa/, /ma/, /ta/, and /na/ were produced in both prepared and unprepared response conditions. In the prepared response condition, the target syllable was cued together with a ready signal, which was followed by a variable delay (1250–1750 ms) prior to a go-signal. In the unprepared response condition, the target syllable was cued with the go-signal. Between-condition comparisons of vocal tract postures in a 150 ms period preceding the go-signal showed that in the prepared condition, many speakers adjusted the postures of their vocal organs in a manner that was specific to the upcoming response. This effect is schematized in **Figure 2A**, where the velum opens prior to the production of the syllable /na/.

Several aspects of anticipatory posturing effects are important to note here. First, the effects observed are predominantly assimilatory: anticipatory posturing almost always results in postures that are closer to the articulatory targets of the upcoming response. Second, effects are observed for a variety of tract variables/articulators, including lip aperture, tongue tip constriction degree, tongue body constriction degree, velum aperture, pharyngeal aperture, and vertical position of the jaw. Third, the effects are sporadic across speakers and articulators: not all speakers exhibit statistically reliable effects, and the tract variables in which effects are observed vary across speakers. Fourth, in an independently controlled condition in which speakers are required to maintain a prolonged production of the vowel [i] during the ready phase, anticipatory posturing effects are also observed. A schematic example of anticipatory posturing for /na/ while the posture of the vocal tract is constrained is shown in **Figure 2B**.

Notably, many of the anticipatory posturing effects observed in Tilsen et al. (2016) were partial assimilations: the ready phase posture in the prepared condition was only part of the way between the posture in the unprepared condition and the posture associated with achievement of the relevant gestural target. Furthermore, although not quantified in the study, it was observed that in prepared response conditions, the anticipatory movements that occurred in the ready phase exhibited slower velocities than movements conducted during the response.

Anticipatory posturing is challenging to account for in the standard AP/TD framework. The anticipatory movements cannot be attributed solely to a neutral attractor, because of their response-specificity: the neutral attractor would have to be modified in a response-contingent manner. The phenomenon also cannot be attributed solely to early activation of gestures: gestural activation should result in achievement of canonical targets, unless an ad hoc stipulation is made that pre-response gestures have alternative targets. A reasonable account is one in which the effects of anticipatorily activated gestures are blended with those of the neutral attractor; this would explain the partially assimilatory nature of the pre-response postures. However, recall from above that blending of the neutral attractor with active gestures is precisely what the SM89 model prohibits via the competitive gating mechanism (see **Figure 1B**), and this is necessary because an overly influential neutral attractor leads to the target undershoot problems illustrated in **Figure 1A**. Thus anticipatory posturing is something of a conundrum in the standard AP/TD framework.

A third issue with gestural scores is the representation of non-local agreement relations between gestures. Many theoretical approaches to phonology distinguish between "local" and "non-local" patterns (Pierrehumbert et al., 2000; Heinz, 2010; Rose and Walker, 2011; Wagner, 2012). Consider the hypothetical examples of harmonies in **Table 1**. Some languages exhibit co-occurrence restrictions in which certain consonants which differ in some particular feature do not occur in some morphological domain, such as a root or a derived stem. For example, (1) shows a sibilant anteriority harmony: all sibilants in a word form must agree in anteriority (i.e., alveolar vs. post-alveolar place of articulation). Consequently, [s] and [ʃ] cannot co-occur. Example (2) shows a pattern in which nasality spreads from a rightmost nasal stop to all preceding segments. Example (3) shows yet another pattern, nasal consonant harmony, in which coronal consonants must agree in nasality. The reader should consult the comprehensive survey of Hansson (2001) for a catalog of many real-language examples of consonant harmonies.

There are two questions regarding these examples that are relevant here. First, how should articulatory patterns with non-local relations be represented in a gestural score, and second, what are the mechanisms which lead to their emergence on the timescale of utterances, for individual speakers? There is an ongoing debate regarding these questions. Gafos (1999) argued that many non-local patterns arise from gestural spreading, in which the activation of a gesture extends in time. Spreading of a feature or extended gestural activation is quite sensible for patterns such as the one in example (2), where intervening segments show evidence of being altered by the spreading feature, nasality in this case. The spreading analysis may also be tenable when the effect of a temporally extended gesture does not result in drastic changes in the expected acoustic and/or auditory consequences of the intervening segments. For example, in the case of the sibilant harmony in example (1), a tongue tip constriction location gesture (i.e., [TTCL +ant] or [TTCL −ant]) may be active throughout the entirety of a word form without resulting in substantial acoustic effects: the TTCL gesture may have relatively subtle effects on intervening vocalic postures and is masked by non-coronal consonantal constrictions, such as an intervening bilabial closure. There is indeed some articulatory evidence for spreading that involves lingual postures (Walker et al., 2008; Whalen et al., 2011).

However, not all cases of harmony are readily amenable to a spreading analysis. A wide variety of consonant harmonies are reported in Hansson (2001), involving features such as voicing, aspiration, ejectivity, implosivity, pharyngealization, velarity, uvularity, rhoticity, laterality, stricture, and nasality. Hansson and others (Walker, 2000; Heinz, 2010) have argued that many of these patterns cannot be readily understood as feature spreading or extended gestural activation, because the expected acoustic consequences of spreading are not observed and may be physically incompatible with articulatory postures required by intervening segments. Consider the hypothetical example of nasal consonant harmony shown in **Table 1**, example (3), variations of which are attested in many Bantu languages and in other, unrelated languages (see Hansson, 2001). An attempt to represent a pattern in which /sapan/ → /napan/ with extended activation of a [VEL op] gesture, as in **Figure 3A**, is problematic in several ways: it incorrectly predicts nasalized vowels, nasalization of the oral stop [p], and nasalized fricatives as opposed to nasalized stops. Hence the extended gestural activation in **Figure 3A** does not provide an empirically adequate analysis of nasal consonant harmony.

Instead of spreading, nasal consonant harmony would seem to require a mechanism which forces certain gestures to appear in certain places in the score, but only when other gestures are present. For example, it is possible to posit a representation such as in **Figure 3B**, where the relevant TTCD constriction gestures co-occur with a [VEL op] gesture, and where [TTCD nar] becomes [TTCD clo]. But the representation does not directly address a number of important questions, namely: what is the nature of the association between the TTCD gestures and the [VEL op] gesture, with respect to the knowledge of speakers? How do such co-occurrence restrictions arise on the scale of individual utterances? How can such patterns be productive in derived domains? The crux of the problem is that the AP/TD model offers no mechanism which can activate the [VEL op] gesture in precisely those circumstances which are consistent with the empirically observed harmony pattern.

This paper addresses the issues above and related ones by developing an extended model of articulatory control. The model incorporates two additional mechanisms of articulatory planning and substantially elaborates the standard model of Articulatory Phonology/Task Dynamics. Section "The Intentional Planning Mechanism" describes the first mechanism, intentional planning, where "intention" refers to a target state of the vocal tract. This mechanism involves the postulation of vocal tract parameter fields in which time-varying spatial distributions of activation are driven by excitatory and inhibitory input from gestures. The integration of activation in these fields determines a current target state of the vocal tract. Section "Gestural Selection and Intentional Planning" describes the second mechanism, selectional planning, in which gestures are organized into sets and the sets are organized in a hierarchy of relative excitation. Feedback-driven reorganizations of the excitation hierarchy generate an order in which sets of gestures are selected, executed, and suppressed. Crucially, selectional dissociations allow for individual gestures to be selected early or suppressed late, relative to other gestures. Neither of these mechanisms is novel: the intentional mechanism is borrowed from Dynamic Field Theory models of movement target representation (Schöner et al., 1997; Erlhagen and Schöner, 2002; Tilsen, 2007, 2009c; Roon and Gafos, 2016), and the selectional mechanism is borrowed from competitive queuing models of sequencing (Grossberg, 1987; Bullock and Rhodes, 2002; Bullock, 2004), which have been extended to model the selection of sets of gestures (Tilsen, 2016). However, the integration of these models in a gestural framework is somewhat new, having been first attempted in Tilsen (2009c) and more recently in Tilsen (2018b). The most novel contribution here is a reconceptualization of articulatory gestures that derives from integrating these frameworks. 
Specifically, we argue that it is useful to distinguish between two types of gestures: excitatory gestures and inhibitory gestures; furthermore, we claim that gestures which are non-active but nonetheless excited can influence the state of the vocal tract. Section "The Origins of Non-local Phonological Patterns" shows that with these hypotheses a new understanding of the origins of non-local phonological patterns is possible, one which is both motorically grounded and local. Crucially, our emphasis here is on the issue of origination/emergence/genesis: the mechanisms we develop create articulatory patterns in individual utterances for individual speakers, and these patterns are potential precursors of sound changes.

#### THE INTENTIONAL PLANNING MECHANISM

An intention is, colloquially, an aim, purpose, goal, target, etc. Here we use intentional planning to refer to a mechanism which determines the target state of the vocal tract. It is important to note that this new conception of target planning requires us to maintain a distinction between gestural targets and the dynamic targets of the vocal motor control system. Instead of being fixed parameters of the speech motor control system, dynamic targets are states that evolve in real-time, under the influence of gestures, whose targets are long-term memories. The dynamic target states are modeled as integrations of activation in fields, drawing inspiration from previous models (Schöner et al., 1997; Erlhagen and Schöner, 2002; Tilsen, 2007). In this section we present a basic model of intentional planning and discuss evidence for the model.

# A Dynamic Field Model of Intentional Planning

To develop intuitions for why a field model of intentional planning is sensible, we begin by elaborating a microscale conception of parameter fields, gestures, and their interactions. We imagine that there are two distinct types of populations of microscale units, tract variable (TV) populations and gestural (G) populations. For simplicity, **Figure 4** depicts only a single TV population along with a small set of G populations. The microscale units are viewed as neurons, and we envision that there are both inhibitory and excitatory neurons in both types of populations. The inhibitory neurons only project locally, within populations. Each G population projects to one TV population, and multiple G populations may project to the same TV population. Each TV population is assumed to exhibit some degree of somatotopic organization, such that the neurons can be arranged in a one-dimensional space which maps approximately linearly to target values of some vocal tract parameter. The units in the TV population are assumed to project to brainstem nuclei which ultimately control muscle fiber tension. We assume that there is some degree of homotopic spatial organization in TV-to-brainstem projections, i.e., a projective efferent field analogous to receptive afferent fields of neurons in primary sensory cortices.

The post-synaptic targets of projections from G to TV populations provide a basis for distinguishing between excitatory and inhibitory forces in the macroscale conception of intentional planning. Consider that some of the neurons in a given G population project to excitatory neurons in the relevant TV population (depicted in **Figure 4** as (+) projections), and others project to inhibitory neurons [i.e., (−) projections]. We conjecture that for a given G population there is a spatial complementarity between the distributions of these two types of projections. Thus a given G population preferentially excites the excitatory neurons in some region of the TV population and inhibits excitatory neurons in some other region (the inhibition occurs indirectly because the G population projects to inhibitory neurons, which in turn project locally to excitatory neurons within the TV population).

Given the above microscale conception, we construct a macroscale model in which the G populations are gestural systems (g-systems) and the TV populations are intentional planning fields. Furthermore, because of the distinction between (+) and (−) G-to-TV projections, we can conceptually dissociate a given gestural system into g+ and g− subsystems, i.e., subpopulations which excite and inhibit regions of an intentional field. Each g+ and g− system has a time-varying excitation value which is assumed to reflect a short-time integration of the spike-rate of the neurons in the population. The integrated effects of the projections from g-systems to the TV population are understood as forces acting on an intentional field. Microscopically the strengths of these forces are associated with the numbers of G-to-TV projections and their synaptic efficacies; on the macroscale the strengths of the forces are the product of g-system excitation and a weight parameter which represents the microscale connectivity and which is constant on the utterance timescale. The pattern of spatial activation in the intentional field is driven by these forces, and the activation centroid is hypothesized to determine a current target state for the vocal tract parameter. In other words, the dynamic target is an activation-weighted average of tract variable parameter values defined over an intentional planning field. Gestural system forces modulate the distribution of activation over intentional fields, but because the timescale of changes in G-to-TV synaptic connectivity and efficacy is relatively slow, gestural targets are best viewed as a long-term memory contribution to dynamic targets.

For concreteness, one can imagine that the relevant G population (light blue circles) in **Figure 4** is associated with a [VEL op+] gesture, which exerts an excitatory force on the region of the velum aperture field that drives an opening of the velum. In addition, one can imagine that there is a [VEL op−] gesture which exerts an inhibitory force on the region of the field associated with closing the velum. There is a large amount of explanatory power that we obtain by dissociating the excitatory and inhibitory components of gestures in this way. Note that in the example of **Figure 4**, the inhibitory force is shown to have a broader distribution than the excitatory one, but more generally the relative widths and amplitudes of force distributions might vary according to many factors. Moreover, in the general case multiple g+ and g− systems may exert forces on the same intentional field, and this allows the model to generate a range of empirical phenomena. The reader should imagine that there are many of these fields, perhaps one for each tract variable of the task dynamic model, and that the fields are relatively independent of each other, at least to a first approximation.

For a generic implementation of intentional planning, the time-evolution of the state of each parameter field u(x,t) can be modeled numerically using a normalized coordinate x which ranges from 0 to 1 in small steps. Equation 1 shows three terms that govern the evolution of the field. The first is an activation decay term, with gain α, entailing that in the absence of input, u(x) relaxes to zero and that field activation saturates with strong excitatory input. The second term is the excitatory force, where N is a Gaussian function of x with mean µ<sub>i</sub><sup>+</sup> and standard deviation σ<sub>i</sub><sup>+</sup> associated with gesture g<sub>i</sub>. The term G<sub>i</sub><sup>+</sup> represents a gestural force gating function; it is modeled as a sigmoid function of the excitation value of gesture g<sub>i</sub>, and modulates the amplitude of the Gaussian force distribution. In typical cases, the sigmoid gating function is parameterized such that it only allows gestures with excitation values greater than some threshold value to exert substantial forces on an intentional field; however, we will subsequently explore the consequences of leaky gating, in which a gesture with an excitation value below the threshold can exert a substantial force on an intentional field. The gain term β<sup>+</sup> controls the overall strength of the excitatory input. The third term is the inhibitory force, and its components mirror those of the excitation term. Note that excitatory and inhibitory force distributions may differ in their width (σ<sub>i</sub><sup>+</sup> vs. σ<sub>i</sub><sup>−</sup>), and the condition u(x,t) ≥ 0 is imposed at each time step. Equation 2 shows the calculation of the dynamic target as the average activation-weighted parameter value, i.e., the field activation centroid.

$$\begin{aligned} \text{Eq. 1} \quad \frac{du(x)}{dt} &= \underbrace{-\alpha\, u(x)}_{\text{decay}} + \underbrace{\beta^{+} \sum_{i} G_{i}^{+}\, \mathcal{N}\left(x, \mu_{i}^{+}, \sigma_{i}^{+}\right)}_{\text{excitation}} \\ &\quad + \underbrace{\beta^{-} \sum_{i} G_{i}^{-}\, \mathcal{N}\left(x, \mu_{i}^{-}, \sigma_{i}^{-}\right)}_{\text{inhibition}} \\ \text{Eq. 2} \quad T(t) &= \frac{\sum_{x} x\, u(x, t)}{\sum_{x} u(x, t)} \end{aligned}$$

The model equations above are used in all subsequent simulations and visualizations. These equations should be viewed as tools for describing phenomena on a relatively macroscopic scale, rather than constituting a definitive claim about a neural mechanism. Note that related but somewhat different equations have been presented in Tilsen (2007, 2018b).
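To make the field dynamics concrete, the following is a minimal numerical sketch of Equations 1 and 2, not the authors' implementation. Several choices are assumptions made purely for illustration: forward-Euler integration, an unnormalized Gaussian, the sigmoid `threshold` and `slope` values, all gain values, and the treatment of the inhibitory term as lowering activation (i.e., it is subtracted, with β<sup>−</sup> taken as a positive magnitude). The dictionary keys and function names are hypothetical.

```python
import math

def gaussian(x, mu, sigma):
    # Unnormalized Gaussian force profile (peak height 1); an assumption.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def gate(excitation, threshold=0.5, slope=40.0):
    # Sigmoid gating function G: near 0 below threshold, near 1 above it.
    # A small slope makes the gate "leaky".
    return 1.0 / (1.0 + math.exp(-slope * (excitation - threshold)))

def net_force(x, gestures, beta_exc, beta_inh, slope):
    # Excitatory minus inhibitory input at field location x (assumed sign).
    total = 0.0
    for g in gestures:
        amp = gate(g["excitation"], slope=slope) * gaussian(x, g["mu"], g["sigma"])
        total += beta_exc * amp if g["excitatory"] else -beta_inh * amp
    return total

def simulate_field(gestures, alpha=1.0, beta_exc=1.0, beta_inh=1.0,
                   slope=40.0, n=101, dt=0.01, steps=2000):
    # Forward-Euler integration of Eq. 1 with constant gestural excitation.
    xs = [i / (n - 1) for i in range(n)]
    force = [net_force(x, gestures, beta_exc, beta_inh, slope) for x in xs]
    u = [0.0] * n
    for _ in range(steps):
        # The condition u >= 0 is imposed at every time step.
        u = [max(0.0, ui + dt * (-alpha * ui + fi)) for ui, fi in zip(u, force)]
    return xs, u

def centroid(xs, u):
    # Eq. 2: activation-weighted average of the field coordinate.
    total = sum(u)
    return sum(x * ui for x, ui in zip(xs, u)) / total if total > 0 else None

# A single, fully excited excitatory gesture (hypothetical parameter values).
vel_open = {"mu": 0.7, "sigma": 0.1, "excitation": 1.0, "excitatory": True}
xs, u = simulate_field([vel_open])
target = centroid(xs, u)

# The same gesture with sub-threshold excitation, under strict vs. leaky gating.
weak = dict(vel_open, excitation=0.3)
peak_strict = max(simulate_field([weak], slope=40.0)[1])
peak_leaky = max(simulate_field([weak], slope=5.0)[1])
```

With a single excitatory gesture the centroid sits close to the gesture's mode. Under the steep gate a sub-threshold gesture leaves the field almost empty, whereas under the shallow gate it produces appreciable activation, which is the sense of "leaky gating" used in the text.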

## Empirical Evidence for Intentional Planning

The somatotopic organization of intentional planning fields provides a "spatial code" for movement target planning, i.e., a representation in which a spatial distribution in the nervous system encodes a target in the space of vocal tract geometry. One motivation for positing a spatial code of this sort comes from studies of manual reaching and eye movement trajectories using a distractor-target paradigm. In this paradigm, a participant is presented with a distractor stimulus and shortly thereafter a target stimulus; the participant then reaches or looks to the target. The distractor stimulus is understood to automatically induce planning of a reach/saccade to its location, and this planning is hypothesized to subsequently influence the planning and execution of the reach/saccade to the target location.

Both assimilatory and dissimilatory phenomena are observed in the distractor-target paradigm, depending on the proximity or similarity of the distractor and target. When the distractor and target stimulus are sufficiently proximal in space, or are associated with similar movements, there is an assimilatory interaction in planning: reaches and saccades to the target are observed to deviate toward the location of the distractor (Ghez et al., 1997; Van der Stigchel and Theeuwes, 2005; Van der Stigchel et al., 2006). In speech, the analogous phenomenon of distractor-target assimilation has been observed between vowels (Tilsen, 2009b): formants in productions of the vowel [a] were assimilated toward formants of a distractor stimulus which was a subcategorically shifted variant of [a]; likewise, assimilation was observed for [i] and a subcategorically shifted variant of [i].

Erlhagen and Schöner (2002) (cf. also Schöner et al., 1997) presented a dynamic field model capable of producing this assimilatory pattern (see also Tilsen, 2007, 2009a; Roon and Gafos, 2016). A simulation of the effect is shown in **Figure 5A**, where the target gesture is A+ and the distractor gesture is B+. Gesture-specific input to the field creates Gaussian distributions of excitatory forces on the parameter field. The dashed lines show the modes of the force distributions of A+ and B+. Because the targets of the gestures are similar or proximal in the field, they do not exert inhibitory forces upon one another. The activation of the intentional planning field represents a combination of these forces, and the centroid of activation (green line) is shifted from A toward B in an assimilatory fashion.
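The assimilatory shift can be illustrated with a short sketch. It assumes constant gestural input, so that a linear-decay field of the kind in Equation 1 settles at u(x) = max(0, F(x)/α), where F is the summed force; the modes, widths, and the weaker amplitude given to the distractor are hypothetical values chosen only to show the direction of the effect.

```python
import math

def gaussian(x, mu, sigma):
    # Unnormalized Gaussian force profile over the field coordinate.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def steady_state_centroid(forces, alpha=1.0, n=201):
    # forces: list of (amplitude, mode, width); amplitude > 0 is excitatory.
    # With constant input, the field settles at u(x) = max(0, F(x) / alpha).
    xs = [i / (n - 1) for i in range(n)]
    u = [max(0.0, sum(a * gaussian(x, m, s) for a, m, s in forces) / alpha)
         for x in xs]
    return sum(x * ui for x, ui in zip(xs, u)) / sum(u)

target_A = (1.0, 0.40, 0.10)       # A+: target gesture
distractor_B = (0.3, 0.50, 0.10)   # B+: weaker, nearby distractor gesture

alone = steady_state_centroid([target_A])
with_distractor = steady_state_centroid([target_A, distractor_B])
```

In this sketch `alone` lies at the mode of A+, while `with_distractor` lies between the two modes and closer to A, i.e., the dynamic target is pulled toward the distractor.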

In contrast to the assimilatory pattern, a dissimilatory pattern arises when the distractor and target are sufficiently distal in space or associated with different response categories. Eye movement trajectories and reaches are observed to deviate away from the location of the distractor in this case (Houghton and Tipper, 1994, 1996; Sheliga et al., 1994). In speech, the analogous effect was observed in Tilsen (2007, 2009b): vowel formants of productions of [a] were dissimilated from formants of [i] when an [i] distractor was planned, and vice versa. A similar dissimilation was observed in F0 measures between Mandarin tone categories in a distractor-target paradigm (Tilsen, 2013b). These dissimilatory phenomena have been explained by hypothesizing that inhibition of the region of the field activated by the distractor shifts the overall activation distribution so that its centroid is further away from the target than it would otherwise be in the absence of the inhibition (Houghton and Tipper, 1994). This can be modeled by assuming that the inhibitory force influences the region of the field which encodes the target. The effect is shown in **Figure 5B**, where [A+] is the target gesture, [C+] is the distractor, and [A−] is an inhibitory gesture which is coproduced with [A+]. The inhibitory force exerted by [A−] not only cancels the excitatory force of [C+], but also shifts the centroid of the activation distribution away from [C+], resulting in a subtle dissimilation. Note that in order for this effect to arise, the inhibitory force distribution has to be either wide enough to overlap with the excitatory one, or its center has to be sufficiently close to the center of the excitatory one. Tilsen (2013b) argued that dissimilatory effects of this sort may be pervasive and provide a motoric mechanism for the preservation of contrast. 
In this view, degrees of resistance to coarticulation (Recasens, 1985; Fowler and Brancazio, 2000; Cho et al., 2017) might be understood as manifested by gradient differences in the amplitudes and widths of inhibitory gestural forces.
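The dissimilatory case can be sketched in the same spirit. The example assumes constant input and the steady state u(x) = max(0, F(x)/α) of a linear-decay field with non-negative activation; all amplitudes, modes, and widths are hypothetical, with the inhibitory force of [A−] centered on the distractor region and made broad enough to overlap the excitatory force of [A+].

```python
import math

def gaussian(x, mu, sigma):
    # Unnormalized Gaussian force profile over the field coordinate.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def steady_state_centroid(forces, alpha=1.0, n=201):
    # forces: list of (amplitude, mode, width); negative amplitude = inhibitory.
    xs = [i / (n - 1) for i in range(n)]
    u = [max(0.0, sum(a * gaussian(x, m, s) for a, m, s in forces) / alpha)
         for x in xs]
    return sum(x * ui for x, ui in zip(xs, u)) / sum(u)

A_plus = (1.0, 0.40, 0.10)    # [A+]: target gesture
C_plus = (0.3, 0.80, 0.10)    # [C+]: distal distractor gesture
A_minus = (-0.5, 0.80, 0.20)  # [A-]: broad inhibition over the distractor region

alone = steady_state_centroid([A_plus])
no_inhibition = steady_state_centroid([A_plus, C_plus])
with_inhibition = steady_state_centroid([A_plus, C_plus, A_minus])
```

Without [A−], the distractor drags the centroid toward [C+]. With [A−], the distractor's activation is cancelled and the flank of [A+] nearest the distractor is eroded, so the centroid ends up slightly on the far side of A's mode, which is the subtle dissimilation described above.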

Another form of evidence for intentional planning is anticipatory posturing effects of the sort described in section "Introduction," **Figure 2**. There we noted that speakers exhibit vocal tract postures that are partially assimilated to the targets of gestures in an upcoming response. This phenomenon shows that some gesture-specific influences on the state of the vocal tract are present, even before a gesture becomes "active" (in the standard AP/TD sense). Discussion of how such effects are modeled in the current framework is deferred to section "Sub-selection Intentional Planning and Anticipatory Posturing," after we have presented a mechanism for organizing the selection of gestures.

#### The Inadequacy of Gestural Blending

The Articulatory Phonology/Task Dynamics (AP/TD) model cannot readily generate assimilatory or dissimilatory effects of the sort described above. A key point here is that in the distractor-target paradigm, only one of the tasks – the one associated with the target stimulus – is actually executed. This entails that only the target gesture becomes active, not the distractor. Of course, if both gestures were active, their influences on the target state of the vocal tract could be blended, resulting in an intermediate target. This blending is accomplished by making the current target of a tract variable a weighted average of active gestural targets (Saltzman and Munhall, 1989). For example, if [A] and [B] have targets of T<sub>A</sub> = 0 and T<sub>B</sub> = 1 and blending weights of w<sub>A</sub> = w<sub>B</sub> = 0.5, the blended target T = (T<sub>A</sub>w<sub>A</sub> + T<sub>B</sub>w<sub>B</sub>)/(w<sub>A</sub> + w<sub>B</sub>) = 0.5, which is an intermediate value between T<sub>A</sub> and T<sub>B</sub>. The problem is that if only the target gesture is produced, the distractor gesture never becomes active, and the weight of [B] should be 0. Hence it is necessary to incorporate a mechanism whereby gestures which are not active can influence the dynamic targets of the vocal tract. We pursue this in section "Gestural Selection and Intentional Planning."
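The weighted-average blending just described can be sketched in a few lines. The function name `blend` is hypothetical and the sketch covers only the averaging step, not the full Task Dynamic model.

```python
def blend(targets, weights):
    # Weighted average of gestural targets for one tract variable.
    total = sum(weights)
    if total == 0:
        raise ZeroDivisionError("blended target is undefined when weights sum to zero")
    return sum(t * w for t, w in zip(targets, weights)) / total

# Both gestures active with equal weights: an intermediate target.
both_active = blend([0.0, 1.0], [0.5, 0.5])

# Distractor never becomes active (weight 0): no influence on the target.
distractor_inactive = blend([0.0, 1.0], [0.5, 0.0])
```

Here `both_active` is 0.5, whereas `distractor_inactive` is 0.0: with a zero weight, blending cannot produce the assimilation observed in the distractor-target paradigm.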

With regard to dissimilatory effects, the standard view of gestural blending is even more problematic. In order for blending of simultaneously active gestures to generate dissimilation, the calculation of a tract variable target must allow for negative weights. For example, if [A] and [B] have targets T<sub>A</sub> = 0 and T<sub>B</sub> = 1, and blending weights w<sub>A</sub> = 0.5 and w<sub>B</sub> = −0.1, then T = (0 × 0.5 + 1 × −0.1)/(0.5 − 0.1) = −0.25, a target displaced away from T<sub>B</sub>. This seems somewhat problematic from a conceptual standpoint because the blending function is undefined when w<sub>A</sub> = −w<sub>B</sub>, and because it generates a hyper-assimilatory target when −w<sub>B</sub> > w<sub>A</sub>. The problem of non-contemporaneous activation mentioned above also applies: the gesture of the distractor stimulus is not actually active; thus its weight should be 0 and it should not contribute to the calculation of the target.
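The behavior of the blending function under negative weights can be worked through explicitly. The following uses the targets T<sub>A</sub> = 0 and T<sub>B</sub> = 1 from the example; the second weight pair (w<sub>B</sub> = −0.6) is a hypothetical value added here only to illustrate the case −w<sub>B</sub> > w<sub>A</sub>.

$$\begin{aligned} T &= \frac{T_A w_A + T_B w_B}{w_A + w_B} = \frac{w_B}{w_A + w_B} \qquad (T_A = 0,\ T_B = 1) \\ w_A = 0.5,\ w_B = -0.1: \quad T &= \frac{-0.1}{0.4} = -0.25 \\ w_A = 0.5,\ w_B = -0.6: \quad T &= \frac{-0.6}{-0.1} = 6 \\ w_A = -w_B: \quad T &\ \text{is undefined (zero denominator)} \end{aligned}$$

The first case yields a target on the far side of T<sub>A</sub> from T<sub>B</sub> (dissimilation), while the second overshoots T<sub>B</sub> by a wide margin, which is the hyper-assimilatory behavior noted in the text.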

As shown in section "Empirical Evidence for Intentional Planning," a model of target planning in which the inhibitory and excitatory effects of gestures are dissociated and have spatial distributions over a field can readily accommodate both assimilatory and dissimilatory patterns. This reinforces the idea that rather than thinking of a gesture as having a monolithic influence on the target state of the vocal tract, we can more usefully think of gestures as having two distinct components: an excitatory component which exerts an excitatory force on a planning field, and an inhibitory component which exerts an inhibitory force on the same planning field. The temporal dynamics of activation of these two components of "the gesture" may in typical circumstances be highly correlated, but not necessarily so. It is logically possible and useful in practice to dissociate the excitatory and inhibitory components. Thus the Articulatory Phonology conception of "a gesture" is re-envisioned here as a pair of gestures, one exerting an excitatory influence on a tract variable parameter field, the other exerting an inhibitory influence on the same field. For current purposes, we assume that the spatial distributions of the excitatory and inhibitory forces are effectively complementary, in that there is a single mode of the inhibitory distribution and this mode is distant from the mode of the excitatory distribution. More general force distributions may be possible, but are not considered here.

It is important to clarify that the intentional planning model does not supplant the Task Dynamic model equations for tract variables and model articulators. In the TD model each tract variable x is governed by a second order differential equation: (1/k)ẍ + (β/k)ẋ + x = T(t), where T(t) is a dynamic target calculated by blending gestural targets. The equation is analogous to a damped mass-spring system, where the dynamic target T(t) is a driving force, and changes in T can be conceptualized as changes in the equilibrium length of the spring. The intentional planning mechanism proposed here merely supplants the Saltzman and Munhall (1989) blending mechanism and introduces a new type of gesture – an inhibitory gesture – which can influence the dynamic target.
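As an illustration of how a dynamic target drives a tract variable, the sketch below integrates the second-order equation in the rearranged form ẍ = k(T(t) − x) − βẋ. The stiffness value, the choice of critical damping (β = 2√k), the semi-implicit Euler scheme, and the step-shaped target are all assumptions for demonstration; in the proposed model T(t) would instead come from the centroid of an intentional planning field.

```python
import math

def simulate_tract_variable(target_fn, k=100.0, beta=None, x0=0.0,
                            dt=0.001, duration=1.0):
    # Integrates (1/k) x'' + (beta/k) x' + x = T(t), i.e.
    # x'' = k * (T(t) - x) - beta * x', with semi-implicit Euler steps.
    if beta is None:
        beta = 2.0 * math.sqrt(k)  # critical damping; an assumption
    x, v = x0, 0.0
    trajectory = []
    for i in range(int(round(duration / dt))):
        t = i * dt
        a = k * (target_fn(t) - x) - beta * v
        v += dt * a
        x += dt * v
        trajectory.append(x)
    return trajectory

def step_target(t):
    # Hypothetical dynamic target: jumps from 0 to 1 at t = 0.2 s.
    return 1.0 if t >= 0.2 else 0.0

trajectory = simulate_tract_variable(step_target)
```

The tract variable stays at rest until the target changes and then approaches the new target smoothly without overshoot, much as a critically damped spring relaxes to a new equilibrium length.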

However, in order to account for how gestures which are not contemporaneously active can have effects on the target state of the vocal tract, further revision of the AP/TD model is necessary. This requires an explicit model of when gestures may or may not influence intentional fields, and is addressed in the following sections.

#### GESTURAL SELECTION AND INTENTIONAL PLANNING

The gestural scores of Articulatory Phonology/Task Dynamics do not impose any form of grouping on the gestures in a score. Indeed, there is no direct representation of syllables or moras in standard gestural scores, and this raises a number of challenges for understanding various typological and developmental phonological patterns (see Tilsen, 2016, 2018a). In order to address these challenges, the Selection-coordination model was developed in a series of publications (Tilsen, 2013a, 2014a,b, 2016, 2018b). The Selection-coordination model integrates a competitive queuing/selection mechanism (Grossberg, 1987; Bullock and Rhodes, 2002; Bullock, 2004) with the coordinative control of timing employed in the AP/TD model. Because the selection-coordination model has been presented in detail elsewhere, only a brief introduction to the model is provided below. Furthermore, discussion of the full range of phonological patterns which the model addresses is beyond the scope of the current paper, and the reader is referred to other work for more thorough exposition (Tilsen, 2016, 2018a,b). Here we present the model in sufficient detail for the reader to understand how it interacts with intentional planning, and we address the question of when gestures may or may not influence intentional fields.

#### The Organization of Gestural Excitation

The selection-coordination model employs a mechanism for competitively selecting sets of gestures. The mechanism is based on a model of action sequencing developed in Grossberg (1987) which is referred to as competitive queuing (Bullock and Rhodes, 2002; Bullock, 2004). A key aspect of the competitive queuing model is that the plans for a sequence of actions are excited in parallel prior to and during production of the sequence, an idea which was advocated by Lashley (1951) and for which a substantial body of evidence exists (e.g., Sternberg et al., 1978, 1988). A schematic illustration of competitive queuing of three sets of motor plans – m<sub>1</sub>, m<sub>2</sub>, and m<sub>3</sub> – is provided in **Figure 6**. Prior to response initiation, the plans have a stable relative excitation pattern; upon response initiation a competition process occurs in which the excitation of the plans increases until one exceeds a selection threshold. The selected plan (here m<sub>1</sub>) is executed while its competitors are temporarily gated. Feedback regarding achievement of the targets of the selected plan eventually induces suppression of that plan and degating of the competitors, at which point the competition process resumes, leading to the selection of m<sub>2</sub>. The cycle of competition, execution, and suppression iterates until all plans have been selected and suppressed.
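The cycle of competition, execution, and suppression can be sketched as a simple loop. This is a deliberately reduced illustration rather than the competitive queuing model itself: excitation grows linearly at a common rate, execution is treated as instantaneous, and suppression resets the selected plan to zero. The function name, growth rate, threshold, and initial excitation values are hypothetical.

```python
def competitive_queuing(initial, threshold=1.0, growth=0.01):
    # initial: dict mapping plan name -> initial excitation, i.e. the
    # relative excitation pattern that exists before response initiation.
    excitation = dict(initial)
    suppressed = set()
    order = []
    while len(suppressed) < len(excitation):
        competitors = [p for p in excitation if p not in suppressed]
        # Competition: all remaining plans gain excitation until one
        # reaches the selection threshold.
        while max(excitation[p] for p in competitors) < threshold:
            for p in competitors:
                excitation[p] += growth
        winner = max(competitors, key=lambda p: excitation[p])
        order.append(winner)
        # Execution: competitors are gated (their excitation is held fixed).
        # Feedback about target achievement then suppresses the winner,
        # after which the competitors are degated and competition resumes.
        excitation[winner] = 0.0
        suppressed.add(winner)
    return order

order = competitive_queuing({"m1": 0.6, "m2": 0.4, "m3": 0.2})
```

The selection order simply follows the initial excitation ranking, so `order` is `["m1", "m2", "m3"]`; a different initial pattern yields a different sequence.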

The Selection-coordination theory hypothesizes that the motor plans of the competitive queuing model in **Figure 6A** can be viewed as sets of gestures in the context of speech production. When a given set of gestures is above the selection threshold, the gestures in that set are selected. Within each selected set, the timing of gestural activation/execution is controlled by phasing mechanisms which we do not address here. Hence selection of a gesture does not entail immediate activation of that gesture: coordinative phasing mechanisms of the sort hypothesized in the coupled oscillators model are assumed to determine precisely when selected gestures become active (Tilsen, 2016, 2018b). In many cases, and in particular for adult speakers in typical contexts, it makes sense to associate the aforementioned motor plan sets with syllables. Thus the selection-coordination model partitions multisyllabic gestural scores into a sequence of competitively selected scores.

In order to facilitate conceptualization of the competitive selection mechanism, the relative excitation pattern of the gestures in a set can be viewed as organized in a step potential, which has the effect of transiently stabilizing excitation values between periods of competition/suppression. This leads to the picture in **Figure 6B**, where abrupt reorganizations (e<sub>1</sub>′–e<sub>4</sub>′) intervene between stable epochs of organization (e<sub>1</sub>–e<sub>5</sub>). These reorganizations are understood to consist of promotion and demotion operations on gestures. Promotion increases excitation to the next highest level, and demotion lowers excitation of selected gestures to the lowest level. The topmost level of the potential is called the selection level, and the set of gestures which occupy the selection level are selected. Note that in order to avoid terminological confusion, we use the term excitation to refer to a quantitative index of the states of gestural systems; the term activation is reserved to describe a state in which a gestural system exerts its maximal influence on an intentional planning field – this terminological distinction maintains some consistency with the Articulatory Phonology interpretation of gestural activation intervals in a gestural score. Importantly, gestures which are neither active nor selected can have gradient degrees of excitation which are below the selection threshold.
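The promotion and demotion operations can be made concrete with a small discrete sketch. It rests on several assumptions that are not spelled out in the text: levels are integers, the selection level equals the number of sets, each reorganization promotes every not-yet-selected set by one level, and a demoted set is not promoted again within the utterance. Function and variable names are hypothetical.

```python
def run_step_potential(sets):
    # sets: names ordered from most to least excited before initiation.
    # Returns a list of epochs, each a dict mapping name -> level.
    n = len(sets)
    selection_level = n
    levels = {name: n - 1 - i for i, name in enumerate(sets)}
    epochs = [dict(levels)]                 # first stable epoch
    done = set()
    while len(done) < n:
        # One abrupt reorganization between stable epochs.
        for name in sets:
            if levels[name] == selection_level:
                levels[name] = 0            # demotion to the lowest level
                done.add(name)
            elif name not in done:
                levels[name] += 1           # promotion to the next level
        epochs.append(dict(levels))
    return epochs

def selected_in(epoch, selection_level):
    # Names occupying the selection level in a given epoch.
    return [name for name, lvl in epoch.items() if lvl == selection_level]

epochs = run_step_potential(["m1", "m2", "m3"])
selected = [selected_in(e, 3) for e in epochs]
```

For three sets this yields five stable epochs separated by four reorganizations, with nothing selected in the first and last epochs and one set selected in each of the three epochs in between.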

We motivate the macroscopic model of **Figure 6B** from the microscopic picture in **Figure 7A**. In addition to populations of microscale units for gestural systems and tract variable parameters (not shown), we imagine a motor sequencing population. The motor sequencing and gestural populations have projections to one another, and the relevant projections are from excitatory neurons to excitatory neurons. When a word form is excited by conceptual/syntactic systems<sup>1</sup> (or "retrieved from lexical memory"), the gestures associated with the word form become excited. The mutually excitatory projections between gestural and motoric populations give rise to resonant states which augment gestural system excitation. Crucially, it is conjectured that the motoric population differentiates into subpopulations which correspond to sets of gestures, i.e., motor systems (henceforth m-systems). It is assumed that the long-term memory of a word form<sup>2</sup> includes information which determines the pattern of m-system differentiation, the pattern of resonances between g- and m-systems, and coupling relations between m-systems which are selected together. In the current example, the word form is comprised of three CV syllables and hence the motor population differentiates into three uncoupled, competitively selected m-systems (**Figure 7B**). If the excited word form were comprised of a different number of CV m-systems, the motor sequencing population would differentiate into that number. For syllables with a coda, diphthong, or long vowel, two anti-phase coupled m-systems would be organized in the same level of the potential.

FIGURE 7 | Microscale and macroscale conceptualizations of the motor sequencing population and gestural population. (A) The motor sequencing population differentiates into subpopulations which are conceptualized macroscopically as motoric systems; lexical memory determines a pattern of resonance between motoric systems and gestural systems. (B) The pattern of relative excitation of gestural systems is governed by a step potential, according to their associations with motoric systems.

The reader should note that the motor population differentiation pattern in **Figure 7A** exhibits a particular spatial arrangement, such that the initial m-system organization for a word form corresponds to the spatial pattern of differentiation in the motoric population. This spatial correspondence is not necessary for our current aim – modeling long-distance phonological patterns – but it is useful for a more comprehensive model in which the directionality of metrical-accentual patterns can be interpreted (see Tilsen, 2018a). Furthermore, it is important to emphasize that the motor population is finite and thus when a word form requires a greater number of m-system differentiations, the size of each m-system population becomes smaller and m-systems become more susceptible to interference. Thus an upper-bound on the number of simultaneously organized m-systems falls out naturally from the model, based on the idea that interference between m-systems destabilizes the organization (see Tilsen, 2018b).

One important advantage of the conceptual model is that the gestural-motoric resonance mechanism (g–m resonance) offers a way for gestures to be flexibly organized into syllable-sized or mora-sized units. Rather than resulting from direct interactions between gestures, syllabic organization arises indirectly from a pattern of resonances between g-systems and m-systems, in combination with the organization of m-systems into levels of relative excitation. In other words, g-systems interact not with each other, but instead couple with m-systems. These m-systems then couple strongly in stereotyped ways, giving rise to various syllable structures. This indirect approach to organization is desirable because direct interactions between g-systems are in conflict between word forms which organize the same gestures in different orders (e.g., pasta vs. tapas). Another advantage of the flexible organization based on g-m resonance is that it allows for developmental changes in the composition of m-systems, evidence of which is discussed in Tilsen (2016).

A final point to emphasize about the selection model is that the conception described above should be understood as a canonical model of a system state trajectory for sequencing, where "canonical" implies a standard against which other trajectories can be usefully compared. In the canonical trajectory, the relative excitation of sets of gestures is iteratively reorganized solely in response to external sensory feedback, and the reorganizations generate an order of selection which matches the initial relative excitation hierarchy. This trajectory serves as a reference for more general system state trajectories, for example ones in which reorganizations are not necessarily driven by external sensory feedback. Indeed, there is a particular form of deviation from the canonical trajectory which is highly relevant for current purposes. This deviation involves the use of internal rather than external feedback to govern reorganization; as we consider below, internal feedback allows for operations on the gestures in a set to be dissociated from each other.

# Selectional Dissociation and Local Coarticulation

An important aspect of the Selection-coordination model is that internal feedback can be used to anticipatorily select a gesture, before all of the gestures in the preceding epoch are suppressed. A great deal of evidence indicates that in addition to external sensory feedback, the nervous system employs a predictive, anticipatory form of feedback, called internal feedback (Wolpert et al., 1995; Kawato and Wolpert, 1998; Kawato, 1999; Desmurget and Grafton, 2000; Hickok et al., 2011; Parrell et al., 2018, 2019a,b). In the Selection-coordination model, if degating (i.e., promotion) and suppression (i.e., demotion) are contingent solely on external feedback, then there is necessarily a gap in time between target achievement of a preceding gesture and selection of a competitor gesture. However, if internal feedback is used to degate the competitor prior to target achievement of the preceding gesture, the gestural selection intervals can overlap. Pervasive overlap observed in spontaneous conversational speech indicates that anticipation/prediction of target achievement may be generally more influential on degating and suppression than the peripheral sensation of achievement, at least in adult speech. It might also be expected that the internal regime of control would be associated with less variability in the timing of selection than the external one, because external sensory information may be perturbed by contextual effects on movement targets or other environmental influences.

<sup>1</sup>Here an explicit model of conceptual-syntactic organization is not provided, but see Tilsen (2019) for a model which in many ways parallels the model of gestural-motoric organization developed here. Although in this paper we associate a pattern of gestural-motoric organization with "word forms," it is more accurate to associate such patterns of organization with prosodic words, which can include phonologically bound forms such as clitics.

<sup>2</sup>It is assumed that experiences from producing and perceiving words contribute to changes in microscopic state variables of the nervous system (e.g., synaptic efficacy and connectivity), which determine macroscopic properties of the production system (i.e., g–m resonances and organization). These macroscopic properties are "lexical knowledge" in the sense that they are associated with semantic concepts and derive from systems which change relatively slowly over time. It is beyond the scope of this paper to develop more detailed microscopic and macroscopic models of these long-term memories.

Internal feedback allows for dissociations of degating and suppression of gestures which are canonically selected in a given epoch. These selectional dissociation phenomena are illustrated in **Figures 8A,B**, which depict hypothesized trajectories for {VC}{NV} and {VN}{CV} word forms, respectively (V = vocalic gesture; N = velum opening gesture; C = oral constriction gesture). Specific phonological forms which instantiate these would be /eb.na/ and /en.ba/. The pattern in **Figure 8A** is an example of anticipatory degating, which we will also refer to as early promotion. The velum opening gesture ([VEL op], labeled "N" in the potentials) is associated with the second syllable, i.e., the second of two competitively selected m-systems. The oral constriction gesture associated with N is C2. In a canonical trajectory, there would be two distinct selection epochs, (e1) and (e2), and N would be promoted along with C2 in (e2), subsequent to suppression of V1 and C1. However, internal feedback anticipates target achievement of V1 and C1, and thereby allows N to be degated early and promoted. This results in a period of time (e1′) in which the [VEL op] gesture is selected along with gestures of the first syllable, yielding a phonetic realization in which the stop is partially nasalized, i.e., [eb̃na] or [eb<sup>m</sup>na].

Conversely, **Figure 8B** shows a trajectory for a {VN}{CV} word form in which the [VEL op] gesture is suppressed late relative to gestures in the first syllable. In a canonical trajectory, [VEL op] would be demoted in the reorganization from (e1) to (e2). By hypothesis, reliance on internal feedback can not only anticipate target achievement, but also fail to anticipate target achievement, thereby creating a delay in the suppression of N relative to other gestures in the syllable, including the oral constriction gesture it is associated with, C1. This results in a period of time (e2′) during which both [VEL op] and gestures associated with the second syllable are selected, which gives rise to a phonetic form with a partially nasalized stop, i.e., [enb̃a] or [en<sup>m</sup>ba].

The mechanisms of early promotion (anticipatory degating) and late demotion (delayed suppression) generate local assimilatory patterns. The early promotion in **Figure 8A** can be phonologized as the assimilation /VC.NV/→/VN.NV/ (/ebna/→/emna/), and the late demotion in **Figure 8B** as /VN.CV/→/VN.NV/ (/enba/→/enma/). Here "phonologization" entails that selection of [VEL op] in both epochs of the word form occurs because long-term (i.e., lexical) memory specifies that this is the case.

The selectional dissociation mechanism is potentially quite powerful, especially if it is unconstrained. An important question is: what prevents early promotion and late demotion from occurring pervasively and for extended periods of time? A generic answer to this question is that anticipatory degating and delayed suppression may be opposed by other mechanisms when they substantially alter the external sensory feedback associated with a word form and have adverse consequences for perceptual recoverability (see Chitoran et al., 2002; Chitoran and Goldstein, 2006; Tilsen, 2016). In particular, the degree to which the sensory alteration affects the perceptual distinctiveness of gestures should correlate with resistance to selectional dissociations. Ultimately, whether anticipatory degating and delayed suppression will be extensive enough to be phonologized as anticipatory or perseveratory assimilation must depend on a complex interplay of factors that includes the perceptual contrasts in a language along with occurrence frequencies of sets of gestures and their functional loads.

A more specific source of restriction on selectional dissociation is hypothesized as follows. Given an excitatory gesture [x+], dissociated selection of [x+] is prevented if a gesture [y−], which is antagonistic to [x+], is selected. For example, [VEL clo−] is antagonistic to [VEL op+] because [VEL clo−] exerts a strong inhibitory force on the region of the velum aperture intentional field that [VEL op+] most strongly excites. The supposition here is that the selection of a gesture which is antagonistic to another gesture prevents the anticipatory degating or delayed suppression of that gesture. **Figures 9A,B** show hypothetical examples of VCNV and VNCV, respectively. These could be instantiated specifically as forms /ebna/ and /enba/. In **Figure 9A**, selection of a [VEL clo−] gesture (shown as N− in the potential) in epoch (e1) opposes extensive anticipatory degating of [VEL op+] (N+ in the potential), and thereby prevents early promotion. Along the same lines, in **Figure 9B** selection of [VEL clo−] in (e2) prevents delayed suppression of [VEL op+] and thereby prohibits late demotion.

It is possible to hypothesize an even stronger restriction, in which an antagonistic pair of gestures can never be co-selected. In that case, an NV syllable such as [na] would correspond to a set of gestures in which [VEL op+] and [VEL clo+] are selected, but neither [VEL op−] nor [VEL clo−]. Blending of the co-selected [VEL clo+] and [VEL op+] gestures can then generate an empirically adequate pattern of velum aperture for a nasal consonant-oral vowel syllable. Interestingly, any /NV/ syllable in this account would necessarily be "underspecified" for inhibitory VEL gestures, which would make it more prone to being influenced by gestural dissociations. For current purposes, this stronger hypothesis prohibiting co-selection of antagonistic gestures is unnecessary: we only need the weaker hypothesis that selection of an inhibitory antagonist in some epoch prevents a selectional dissociation in which an excitatory gesture is selected in that same epoch.

#### Sub-Selection Intentional Planning and Anticipatory Posturing

Here we integrate the intentional planning mechanism with the model of gestural selection described above. The basic question to address is: when is gestural excitation expected to result in observable changes in the state of the vocal tract? Given the model of intentional planning presented in section "The Intentional Planning Mechanism," we can rephrase this as the question of when gestures exert forces on intentional planning fields. One answer which can be rejected is that intentional planning is only influenced by active gestures, i.e., gestures which have been selected and triggered by phasing mechanisms. Such an account would be natural in the standard AP/TD framework, but falls short empirically because it cannot straightforwardly generate anticipatory posturing effects or assimilatory/dissimilatory effects in distractor-target paradigms. Merely allowing gestural activation to vary continuously does not solve this problem because the standard model requires some mechanism to trigger a change from zero to non-zero activation.

FIGURE 8 | Dissociation of gestural promotion and demotion for intervocalic consonant-nasal sequences, VCNV and VNCV. (A) Anticipatory degating of a nasal gesture in a {VC}{NV} word form. (B) Delayed suppression of a nasal gesture in a {VN}{CV} word form. Lines from potentials indicate when in time a given pattern of activation occurs. Horizontal dashed lines are the selection threshold.

Recall from section "Introduction" that a number of studies have provided evidence that speakers exert control over vocal tract posture prior to production of a word form, and do so in a way that is specific to gestures in the word form (see **Figure 2**). Analyses of discrepancies between acoustic and articulatory measurements of verbal reaction time in delayed response paradigms have provided indirect evidence for changes in vocal tract state prior to the initiation of movement (Rastle and Davis, 2002; Kawamoto et al., 2008). Direct evidence of response-specific anticipatory posturing was observed in a real-time MRI study designed specifically to test for such effects (Tilsen et al., 2016), discussed in section "Introduction." This study showed that prior to the cued initiation of a response, speakers often adopted a vocal tract posture that was partly assimilated to upcoming gestural targets. Another recent study has shown that in a delayed word-naming task, speakers configure their lips to anticipate the initial consonantal articulatory target of a response, even when the complete gestural composition of the response is unknown (Krause and Kawamoto, 2019a).

A standard gestural activation account could, in principle, generate anticipatory posturing effects, but only with several ad hoc adjustments. First, the relevant gesture(s) would need to be allowed to become active prior to other gestures. Second, and more problematically, the anticipated gestures would need to have alternative targets, because the observed anticipatory posturing effects are partial. But in the standard AP/TD model each gesture is associated with a single target parameter; thus it is not entirely sensible to say that a single gesture is associated with two targets, one for anticipatory posturing and the other for normal production. Alternatively, the competitive gating of neutral attractor and gestural influences on model articulators (see **Figure 1B**) could be relaxed to allow for partial blending of these influences before production. Yet this would require a fairly ad hoc stipulation that only some model articulators are subject to the blending; moreover, the blending would need to be turned off (i.e., competitively gated) during production of the word form, otherwise target undershoot would be pervasive, as discussed in section "Introduction."

The selection-coordination-intention framework provides an alternative account of anticipatory posturing, based on the idea that gestural systems with excitation values below the selection threshold do in fact exert forces on intentional planning fields. **Figure 10A** illustrates this effect for velum opening in the syllable /na/, which is comprised of [TTCD clo±], [PHAR [a]±], and [VEL op±] gestures. Prior to overt production, the gestural systems are excited but below the selection threshold. Despite not being selected, the [VEL op±] gestures exert excitatory and inhibitory forces on the velum aperture intentional planning field. The excitatory force corresponds to a Gaussian distribution of activation in the field, indicated by the arrow. Note that a constant neutral attractor force on the field is also assumed to be present.

The amplitude of the gestural force distribution is modeled as a sigmoid function of the excitation value of [VEL op+] (see section "A Dynamic Field Model of Intentional Planning," Eq. 1). Two differently parameterized sigmoid functions are shown in **Figure 10B**. The strong gating function changes abruptly from 0 to 1 in the vicinity of the selection threshold, resulting in negligible forces from gestures below the threshold, and in maximal forces from gestures which are selected. The leaky gating function is parameterized so that its midpoint is lower and its slope is shallower; this results in a non-negligible force being exerted on the velum aperture field, even when [VEL op+] has below-selection-level excitation. Either parameter of the sigmoid function (i.e., its midpoint or slope) can be adjusted to achieve this effect.
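The contrast between the two gating regimes can be illustrated with a minimal Python sketch. The logistic form, the threshold value, and all parameter values below are assumptions chosen for illustration; they are not the parameterization of Eq. 1 or of **Figure 10B**.

```python
import math

def gating_amplitude(excitation, midpoint, slope):
    """Logistic gating: maps gestural excitation to a force amplitude in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (excitation - midpoint)))

# Hypothetical selection threshold, in arbitrary excitation units.
SELECTION_THRESHOLD = 1.0

# Hypothetical parameterizations: strong gating has its midpoint at the
# threshold and a steep slope; leaky gating has a lower midpoint and a
# shallower slope.
STRONG = {"midpoint": SELECTION_THRESHOLD, "slope": 50.0}
LEAKY = {"midpoint": 0.7, "slope": 8.0}

SUBTHRESHOLD_EXCITATION = 0.6  # excited, but not selected
SELECTED_EXCITATION = 1.2      # above the selection threshold

for name, params in (("strong", STRONG), ("leaky", LEAKY)):
    below = gating_amplitude(SUBTHRESHOLD_EXCITATION, **params)
    above = gating_amplitude(SELECTED_EXCITATION, **params)
    print(f"{name:>6} gating: subthreshold amplitude = {below:.3f}, "
          f"selected amplitude = {above:.3f}")
```

Under these assumed values, the strong function yields a negligible amplitude for the subthreshold gesture, whereas the leaky function yields a non-negligible one; both yield near-maximal amplitude once the gesture is selected.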

The difference between the strong and leaky gating functions is reflected in the tract variable time series shown in **Figure 10A**. With strong gating (solid line), the neutral attractor is the only substantial influence on the velum aperture field prior to gestural selection, and hence the tract variable remains in a neutral position. With leaky gating (dashed line), the [VEL op+] gesture exerts a substantial influence that drives the tract variable to an intermediate state. This pre-response anticipatory posturing effect results in only a partial assimilation because the dynamic target of the system (the weighted average of field activation) integrates both the neutral attractor influence and the influence of [VEL op+].
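The partial character of the assimilation can be sketched numerically. The following Python fragment is a simplified illustration, not the dynamic field model itself: it assumes that field activation can be approximated by summing a Gaussian neutral attractor force and a gated Gaussian gestural force, and it computes the dynamic target as the activation-weighted average of the field coordinate. The field range, target locations, widths, and force strengths are all hypothetical.

```python
import math

def gaussian(x, center, width):
    """Unnormalized Gaussian force distribution over the field coordinate."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)

def gating_amplitude(excitation, midpoint, slope):
    """Logistic gating of gestural force amplitude (assumed functional form)."""
    return 1.0 / (1.0 + math.exp(-slope * (excitation - midpoint)))

def dynamic_target(excitation, midpoint, slope,
                   neutral_position=0.0, gesture_target=1.0,
                   width=0.15, neutral_strength=0.3):
    """Activation-weighted average of a normalized velum aperture field.

    Activation is approximated as the sum of a constant neutral attractor
    force and a gated [VEL op+] force; field dynamics are omitted.
    """
    # Field coordinate: normalized velum aperture sampled from -1.0 to 2.0.
    xs = [i / 100.0 for i in range(-100, 201)]
    amplitude = gating_amplitude(excitation, midpoint, slope)
    activation = [
        neutral_strength * gaussian(x, neutral_position, width)
        + amplitude * gaussian(x, gesture_target, width)
        for x in xs
    ]
    return sum(x * a for x, a in zip(xs, activation)) / sum(activation)

# Pre-response state: [VEL op+] is excited but below the selection threshold.
pre_response_excitation = 0.6
strong_target = dynamic_target(pre_response_excitation, midpoint=1.0, slope=50.0)
leaky_target = dynamic_target(pre_response_excitation, midpoint=0.7, slope=8.0)
print(f"strong gating target: {strong_target:.3f}")
print(f"leaky gating target:  {leaky_target:.3f}")
```

With these assumed values, strong gating leaves the target at the neutral position, while leaky gating places it between the neutral position and the gestural target, which is the sense in which the posturing effect is partial.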

It is worth noting that leaky gating can generate both anticipatory and perseveratory posturing effects: subsequent to a production, a gesture with leaky gating can have a persistent influence on the state of the vocal tract, as long as the excitation of the gesture is not too low. The empirical characteristics of anticipatory posturing effects can thus be modeled fairly straightforwardly, as long as the parameters of the gating function are allowed to vary from gesture to gesture, speaker to speaker, and even from utterance to utterance. Of course, there may be a number of factors that can predict variation in the magnitude of such effects, and these are worth future investigation.

The above model suggests that a disambiguation of the phrase gestural initiation is in order. Gestures are "initiated" in two senses: gestures conceptualized as systems become excited, to some subthreshold degree, and this "initiation of excitation" may or may not result in observable effects on the state of the vocal tract, depending on the parameterization of the gating function. Subsequently, gestural systems are selected, i.e., their excitation exceeds a threshold, and when triggered by phasing mechanisms they can begin to exert their maximal influence on an intentional field, which constitutes an "initiation of activation." At the same time, it is important to keep in mind that active gestures which influence the same tract variable can be blended, as in the standard AP/TD model, and thus activation of a gesture does not necessarily entail an immediately observable effect on the vocal tract.

In the context of the selection-coordination-intention framework, there is a potential ambiguity with regard to whether a given phonological pattern arises from selectional dissociations (i.e., early promotion/late demotion) or from subthreshold gestural forces allowed by leaky gating. Anticipatory and perseveratory phenomena might logically be understood to result from internal feedback-driven changes in gestural selection, or from changes in the parameterization of gating functions, or from a combination of both mechanisms. The question of which of these analyses to apply in a given context is explored in the next section, where we apply the model to understand non-local phonological patterns.

#### THE ORIGINS OF NON-LOCAL PHONOLOGICAL PATTERNS

The selection and intention mechanisms provide two ways for the articulatory precursors of non-local phonological patterns to arise in individual utterances. It is important to emphasize that our primary aim here is a model of how non-local patterns (i.e., harmonies) originate. The issue of how such patterns are phonologized, i.e., become part of a phonological grammar, is a more general one, and treatment of this topic is beyond the scope of this paper. For current purposes, we assume an Ohalan conception of phonologization in which motoric mechanisms are bias factors that perturb articulatory realization, and in which these perturbations can be phonologized through hypocorrective mechanisms (Ohala, 1993). Hence the mechanisms presented below should be understood as operating on the timescale of a single utterance and the spatial scale of an individual speaker, but their effects may lead to change in behavior on larger temporal and spatial scales. Specifically, one can imagine that in a population of speakers there is stochastic variation in the parameters associated with various control mechanisms of the model (e.g., the leakiness of gating). Interactions between speakers may on supra-utterance time scales lead to population scale changes in such parameters, although this must be seen as a highly chaotic process which cannot be readily predicted. In any case, it is sensible to assume that our understanding of how non-local patterns are codified should depend on our understanding of the motoric genesis of such patterns. Indeed, one can argue that origination should be primary in our understanding of phonologization, because non-local patterns seem unlikely to spontaneously emerge, i.e., come into being without any sensorimotor precursors.

One obstacle in this endeavor is our incomplete knowledge of the extent to which an empirically observed non-local pattern is the product of active mechanisms which operate on long-term memories or is codified directly in lexical memory. To illustrate this distinction, consider the schematic harmony patterns in **Table 2**. Some non-local patterns, and in particular, many consonant harmonies (see Hansson, 2001), appear to be lexical co-occurrence restrictions in the domain of a lexical root (1) or derivational stem (2). In these cases, it is quite sensible to interpret the pattern as directly encoded in long-term memory: the gestures that are retrieved from memory in association with a word form already conform to the harmonic pattern, and therefore no mechanism is required to generate the harmony in utterance planning. In contrast, other non-local patterns are better understood as actively generated by the production system during the process of planning an utterance. Vowel harmonies and vowel-consonant harmonies may be more likely to be of the active variety than consonant harmonies, because in some cases, these harmonies apply in an inflectional domain (3), i.e., a morphologically complex form that includes inflectional morphology (e.g., tense, aspect, mood, agreement, number, person). It is worth mentioning that even productive harmonies involving inflectional morphology might be construed as lexical if we allow for analogical mechanisms to influence the selection of morphs from the lexicon.

TABLE 2 | Hypothetical non-local patterns which apply in different morphological domains.

An important clarification to make here is that there are several senses of locality that may be applied to describe phonological patterns. One sense is based on the conception of speech as a string of symbols, i.e., segments which are arranged in a linear order. Another is based on the idea that the articulatory manifestations of a harmony pattern are continuous in time (Gafos, 1999; Smith, 2018), which is closely related to tier-based analyses in which articulatory features on a tier can spread (Goldsmith, 1979). A third sense is based on the temporal continuity of the motoric mechanisms which give rise to a pattern. We will show in sections "Spreading Arises From Selectional Dissociation" and "Agreement Arises From Leaky Gestural Gating" that the motoric mechanisms which give rise to harmony patterns are always local, even when articulatory manifestations are not. Identifying local mechanisms for the origination of such patterns is desirable because, as some have argued (e.g., Iskarous, 2016), physical laws always specify local relationships between variables in space and time, and so there cannot be a truly "non-local" mechanism. To show how these three conceptions of locality apply, **Table 3** classifies various assimilatory phonological patterns.

Our main focus in the following sections is on the last two types of patterns listed in **Table 3**: spreading harmonies (e–g) and agreement consonant harmony (h). It is nonetheless worthwhile to briefly consider how other types of patterns arise. One of the most cross-linguistically common phonological patterns is assimilation of adjacent sounds which are associated with the same syllable (a, b). Such patterns have been thoroughly examined in the AP/TD framework and can be readily understood through a gestural blending mechanism (Browman and Goldstein, 1990; Gafos, 2002; Gafos and Goldstein, 2012). In the selection-coordination-intention framework, gestures which are associated with the same syllable are co-selected. When co-selected gestures exert forces on the same intentional planning field, the strengths of those forces are blended. When co-selected gestures exert forces on distinct intentional planning fields, overlap of gestural activation can occur without blending coming into play. In either case, the co-activation of gestures can lead to phonologization of new articulatory targets, i.e., changes in the long-term memory specification of gestural-motoric organization associated with a word form.

Assimilatory patterns between sounds associated with different syllables (c, d) must be understood differently from tautosyllabic patterns because the relevant gestures are associated with distinct competitively selected sets of gestures and therefore those gestures are canonically selected in different epochs. We have already shown in section "Selectional Dissociation and Local Coarticulation" how local coarticulation arises from the dissociation of gestural selection from canonical motoric organization. Specifically, internal feedback allows for some gesture or gestures to be promoted early or demoted late. These phenomena result in gestural overlap and constitute an active mechanism for generating assimilatory patterns. Moreover, they can be phonologized as assimilatory phonological alternations in long-term memory. As we argue below, selectional dissociation is also the mechanism via which spreading harmonies emerge.

The main proposal here is that there are two distinct mechanisms via which harmony patterns can arise: selectional dissociation and subthreshold intentional planning. The former gives rise to so-called "spreading" patterns which are not distinct, in a mechanistic sense, from assimilation of adjacent, heterosyllabic sounds. Spreading patterns are articulatorily local, in the sense described above. It is possible that all vowel and vowel-consonant harmonies are of this variety (Hansson, 2001; Nevins, 2010; Van der Hulst, 2011; Smith, 2018), and that some consonant harmonies are as well (Gafos, 1999). The other mechanism – subthreshold intentional planning – is associated with at least some consonant harmonies, which are described as "agreement" or "correspondence" patterns (Piggott and Van der Hulst, 1997; Walker, 2000; Hansson, 2001; Rose and Walker, 2011).


TABLE 3 | Locality-based classification of origins of assimilatory phonological patterns.

The crux of the empirical distinction between spreading vs. agreement amounts to whether there are articulatory manifestations of the relevant gesture during the period of time between the trigger and target segments. Let's consider a common variety of consonant harmony: coronal place harmony of sibilants. A prototypical example is one in which all sibilants in a lexical root have the same anteriority as the last sibilant in the root (see **Table 2**, example 1). Gafos (1999) argued that a tongue tip constriction location (TTCL) gesture can be active during vocalic or non-coronal consonantal gestures which intervene between the trigger and target, without inducing a substantial auditory perturbation of the sensory consequences of those gestures. In other words, the position of the tongue blade may be physically influenced during the intervening segments, regardless of whether the influence has audible consequences. Indeed, some experimental evidence of this effect was provided in Gafos (1999). In this analysis, there is an articulatory continuity with respect to activation of the relevant TTCL gesture: the pattern is articulatorily local.

However, it has not been demonstrated that all sibilant harmonies exhibit continuous articulatory manifestations of this sort, and in most cases it is impossible to determine if such patterns originated in that manner. Moreover, there are other consonant harmonies which are highly unlikely to have originated from a continuous articulatory manifestation. One example is nasal consonant harmony, in which the nasality of certain classes of consonants must agree in a root or derived stem (see **Table 2**, example 2). Walker (2000) and Hansson (2001) have pointed out that continuous velum lowering between trigger and target would result in all intervening vowels and consonants being nasalized. Yet such nasalization of intervening segments is not observed in nasal consonant harmony (recall that this issue was raised in section "Introduction," in relation to **Figure 3**). This argues against conceptualizing nasal consonant harmony as the result of a continuously active gesture: such patterns are articulatorily non-local. The reader should note that nasal consonant harmony is distinct from nasal spreading (Cohn, 1993; Hansson, 2001); in nasal spreading intervening segments are nasalized.

Another example of a pattern which is articulatorily non-local is laryngeal feature harmony (Hansson, 2001), where oral stops with different laryngeal features (e.g., aspirated vs. ejective) may not co-occur in some domain. In a gestural framework, aspiration corresponds to a glottal opening gesture and ejection to a combination of glottal closing and laryngeal elevation gestures. It is not physically possible for the glottis to be open or fully closed during intervening vowels or voiced continuant consonants, without substantially influencing the acoustic manifestations of those sounds. Thus laryngeal harmonies are another type of consonant harmony pattern which cannot be readily understood as the result of articulatory continuity/continuous gestural activation.

The impossibility of articulatory continuity in certain harmonies is one motivation for distinguishing between mechanisms for the emergence of spreading and agreement; another is that there are numerous typological differences between patterns analyzed as spreading vs. agreement. In particular, these include differences in (i) blocking and transparency of intervening segments, (ii) morphological domain sensitivity, (iii) prosodic domain sensitivity, (iv) structure preservation, (v) similarity sensitivity, and (vi) directionality biases. Section "Spreading Arises From Selectional Dissociation" shows how spreading/blocking is modeled in the selection-coordination-intention framework, section "Agreement Arises From Leaky Gestural Gating" shows how agreement is modeled, and section "Deriving the Typology of Agreement and Spreading Patterns" addresses the aforementioned typological differences.

# Spreading Arises From Selectional Dissociation

The intention and selection models developed in sections "The Intentional Planning Mechanism" and "Gestural Selection and Intentional Planning" generate spreading via the mechanism of selectional dissociation. Recall from section "Selectional Dissociation and Local Coarticulation" that a gesture which is canonically selected in a given epoch can be anticipatorily selected in an immediately preceding epoch, or the suppression of the gesture can be delayed to occur in a subsequent epoch. In other words, gestural selection can be dissociated from canonical motor set organization, such that gestures may be promoted early or demoted late. In typical circumstances, there are perceptual and contrast-related forces which may prevent anticipatory degating and delayed suppression from occurring too extensively. If the selectional dissociation compromises sensory information which is important for the perceptual recoverability of preceding gestures, it will not be too extensive. Moreover, if an inhibitory gesture [y−] is selected in some epoch, and [y−] is antagonistically related to [x+], then [x+] is unlikely to be anticipatorily promoted or belatedly suppressed in that epoch. However, early promotion or late demotion may not be perceptually or informationally disadvantageous, and may even be advantageous. Thus in the absence of the antagonistic gesture [y−], we would expect that the anticipation or perseveration of [x+] may extend throughout the relevant epoch.

Selection trajectories for perseveratory and anticipatory spreading are schematized in **Figures 11A,B**. Labels |a1|, |b2|, etc. are included to facilitate exposition. The examples involve a word form with three competitively selected sets of gestures: A, B, and C. The relevant spreading gestures are a (+)/(−) pair labeled as [x+] and [x−]. For concreteness, the reader can imagine that A, B, and C are comprised of oral consonantal constriction and vocalic gestures, and that [x+] and [x−] are excitatory and inhibitory [VEL op] gestures. For the perseveratory spreading pattern in **Figure 11A**, let's suppose that on a diachronic timescale there is an initial stage (stage 0) in which the selection trajectory is canonical; specifically, [B], [x+], and [x−] comprise a set of gestures {Bx+x−}, which is competitively selected relative to sets {A} and {C}. In the stage 0 trajectory, [x+] and [x−] are demoted in epoch (e3), when [B] is demoted (|a1|). In a subsequent stage (stage 1), the demotion of [x+] and [x−] is delayed relative to demotion of [B], and hence [x+] and [x−] remain selected during epoch (e3) in which gestures of {C} are also selected (|a2|). This diachronic stage represents an active spreading process,

section "Selectional Dissociation and Local Coarticulation" selectional dissociations are dependent upon whether there is an antagonistic gesture selected in the epoch which would potentially incorporate a dissociating gesture. This antagonistic gesture is represented as [y−] in **Figures 11C,D**. For instance, if the trigger gesture [x+] is a [VEL op+] gesture, then the antagonistic gesture [y−] would be [VEL clo−]. Spreading is blocked when it would involve co-selection of [x+] and [y−]. Hence in the anticipatory spreading example of **Figure 11C**, the gesture [x+] which is selected in (e1) can be selected in (e2) (label | c1|), but it is demoted in the reorganization to (e3) (| c2|) because this reorganization promotes the antagonistic gesture [y−]. In **Figure 11D**, anticipatory spreading can occur by early promotion of [x+] in (e2) (see |d1|), but cannot be promoted in (e1) (|d2|) because the antagonistic gesture [y−] is promoted. Thus in **Figure 11D** [x+] can be selected in (e2) but not in (e1) when [y−] is selected. Hence spreading and the blocking of spreading are understood as contingent upon whether antagonistic inhibitory gestures are promoted.

In a more detailed sense, the blocking occurs because promotion and demotion are reorganization operations that can enforce mutual exclusivity in the selection of gestures. However, the selectional dissociation mechanism allows for this mutual exclusivity to be violated when the relevant gestures are not strongly antagonistic. For current purposes it is sufficient to interpret the sensitivity of reorganization to antagonistic relations as categorical restriction on reorganizations: if [y−] is promoted, [x+] must be demoted and cannot be promoted. Thus it is only

and we conjecture that [x+] and [x−] can remain in a selected state through each subsequent epoch. The anticipatory version of spreading in **Figure 11B** is quite similar, except in this case [x+] and [x−] are promoted early in epoch (e1) (see |b1|) and persist in a selected state until gestures in the set they are canonically associated with, {Bx+x <sup>−</sup>}, are demoted (|b2|). It is worth mention that while some spreading patterns have

a clear directionality, in others directionality is unclear, or can be analyzed as bidirectional. Moreover, in both anticipatory and perseveratory cases, the spreading can be phonologized in a subsequent diachronic stage, such that [x+] and [x−] become members of each selection set that is organized upon retrieval of the word form (i.e., |a3| and |b3|). In this case, the selectional dissociation may or may not remain active. If the pattern is observed in productively derived stems or inflectional stems, it is most likely still active. Indeed, it is plausible that spreading can involve iterative phonologization of the relevant feature, such that (i) selectional dissociation perturbs articulation in a temporally adjacent epoch, (ii) the perturbation is phonologized, and then steps (i) and (ii) repeat for another pair of epochs.

An important characteristic of spreading is that it always involves epochs which are contiguous in utterance time. The reason for this is that anticipatory degating and delayed suppression can only extend the period of time in which a gesture is selected; these mechanisms do not involve additional selections or suppressions of a gesture. This restriction is important in accounting for the occurrence of blocking phenomena, which are represented in **Figures 11C,D**. As explained in

promotion of these gestures. Horizontal lines in selection trajectories represent the selection threshold. Labels | a1|, | b2|, etc. . . are referenced in the text.

when no [y−] gesture is selected that [x+] can be selected in a dissociated manner.
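The contiguity and blocking restrictions on spreading can be illustrated with a small sketch. It is a toy rendering of the categorical restriction only, not the selection dynamics of the model: the function name `spread`, the 1-based epoch indices, and the representation of epochs as integers are all assumptions made for illustration.

```python
def spread(canonical_epoch, n_epochs, antagonist_epochs, direction="both"):
    """Return the epochs (1-based) in which a gesture [x+] ends up selected.

    Toy assumption: selection can only be extended outward from the
    canonical epoch through contiguous epochs, and the extension stops
    at the first epoch in which an antagonistic gesture [y-] is selected.
    """
    selected = {canonical_epoch}
    if direction in ("perseveratory", "both"):
        # Delayed demotion: extend into later epochs.
        for epoch in range(canonical_epoch + 1, n_epochs + 1):
            if epoch in antagonist_epochs:
                break  # blocked: [x+] and [y-] cannot be co-selected
            selected.add(epoch)
    if direction in ("anticipatory", "both"):
        # Early promotion: extend into earlier epochs.
        for epoch in range(canonical_epoch - 1, 0, -1):
            if epoch in antagonist_epochs:
                break  # blocked: [x+] and [y-] cannot be co-selected
            selected.add(epoch)
    return sorted(selected)


# [x+] canonical in (e1), antagonist [y-] selected in (e3).
late_blocked = spread(1, 3, {3}, direction="perseveratory")
# [x+] canonical in (e3), antagonist [y-] selected in (e1).
early_blocked = spread(3, 3, {1}, direction="anticipatory")
# Antagonist in the adjacent epoch: spreading cannot skip over it.
no_skip = spread(1, 3, {2}, direction="perseveratory")
```

Because the loops stop at the first blocking epoch rather than skipping it, the sketch never yields a non-contiguous set of epochs, which is the property the text attributes to spreading.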

#### Agreement Arises From Leaky Gestural Gating

Whereas spreading is understood to arise from selectional dissociations, agreement patterns are modeled here as a consequence of sub-selection level gestural forces on intentional fields. Recall from section "Sub-selection Intentional Planning and Anticipatory Posturing" that when the gestural force gating function is leaky, a gesture which is not selected can exert a substantial force on an intentional field. This leaky gating mechanism was previously used to account for anticipatory posturing prior to production of a word form. There is no obvious reason why such a mechanism should not operate during epochs of production as well, and if that occurs, its effects can generate an agreement pattern. Moreover, this active agreement pattern has the potential to become phonologized via the Ohalan hypocorrective mechanism.

An example of an active agreement pattern is shown in **Figure 12A** for a word form comprised of three sets of gestures, {A}, {By+y−}, and {Cx+x−}. The gestures [x+] and [y−] are antagonistic. With leaky gestural gating, [x+], which is selected in (e3), exerts substantial forces on an intentional field in epochs (e1) and (e2). However, during epoch (e2) in which the antagonistic gesture [y−] is selected, the force that [x+] exerts on the intentional field is canceled by the inhibitory force from the antagonistic gesture [y−]. During epoch (e1), no gesture which is antagonistic to [x+] is selected, and thus the influence of [x+] on the intentional field will be manifested articulatorily. In such a situation, we see that the gestures selected in (e2) are transparent to the harmony pattern. A concrete instantiation of this example would be a phonological form /ba.sa.na/ which exhibits nasalization of the initial consonant, [masana]. We imagine that [x+] is a [VEL op+] gesture associated with the /n/, and that [y−] is an antagonistic [VEL clo−] gesture associated with /s/ and the vowel /a/. The phonetic precursors of the nonlocal agreement pattern can arise if [y−] is not selected in association with /b/.
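The transparency of the medial epoch can be sketched numerically. The gating function below is a hypothetical stand-in (full force at or above a selection threshold, a leak-scaled fraction below it), and the excitation values are invented for illustration. The sketch also simplifies by treating [y−] as absent in (e1) and (e3), and by modeling the cancellation described in the text as simple subtraction.

```python
def gated_force(excitation, threshold=1.0, leak=0.0):
    """Force a gesture exerts on an intentional field (hypothetical form).

    At or above the selection threshold the gesture exerts its full force;
    below it, only a leak-scaled fraction of its excitation gets through.
    """
    if excitation >= threshold:
        return excitation
    return leak * excitation


def net_field_force(epochs, leak):
    """Net force per epoch: excitatory [x+] minus antagonistic [y-]."""
    return [
        gated_force(x_plus, leak=leak) - gated_force(y_minus, leak=leak)
        for x_plus, y_minus in epochs
    ]


# Invented excitation values (x_plus, y_minus) for epochs (e1), (e2), (e3).
# [x+] reaches threshold only in (e3); [y-] is selected only in (e2).
epochs = [(0.6, 0.0), (0.8, 1.0), (1.0, 0.0)]

leaky = net_field_force(epochs, leak=0.5)      # leaky gating
non_leaky = net_field_force(epochs, leak=0.0)  # strict gating
```

Under these assumptions, leaky gating gives a positive net force in (e1) and a negative one in (e2), so the influence of [x+] surfaces in the first epoch while the medial epoch stays transparent. With strict gating the influence in (e1) vanishes.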

On a diachronic timescale, we can imagine that the sub-selection influence of [x+] can be phonologized, in that the composition of the set of gestures selected in (e1) is reinterpreted by speakers as including the gesture [x+] along with [x−]. This circumstance is shown in **Figure 12B**. At this point, the sub-selection influence of [x+] may or may not remain present. Ultimately, what this model holds is that the articulatory precursors of an agreement pattern can arise between segments whenever there is no antagonist (of the triggering gesture) that exerts forces on the relevant intentional field, and as long as gating of the triggering gesture is leaky. Hence antagonistic gestures block spreading harmonies, but cause transparency in agreement harmonies.

## Deriving the Typology of Agreement and Spreading Patterns

If the proposed distinction between mechanisms of spreading and agreement is useful, it should help us make sense of various typological differences between consonant harmony and vowel harmony, which a number of researchers have argued are associated with agreement and spreading, respectively (see Hansson, 2001; Rose and Walker, 2011). **Table 4** lists some of the differences between agreement and spreading. It is worth emphasizing that if there is only one mechanism whereby long-distance phonological patterns arise, then these differences are almost entirely inexplicable, and must therefore be seen as accidental. Thus a model which can account for them is highly desirable.

One of the most telling differences between agreement and spreading is that agreement is never blocked by intervening segments, while spreading is blockable (Hansson, 2001; Rose and Walker, 2011). This difference falls out straightforwardly from the models in sections "Spreading Arises From Selectional Dissociation" and "Agreement Arises From Leaky Gestural Gating." Blocking occurs when promotion of an inhibitory gesture [y−] necessitates the demotion of an antagonistically related excitatory gesture [x+]. Examples of blocking in spreading patterns were provided in section "Spreading Arises From Selectional Dissociation," **Figures 11C,D**. Blocking is observed in spreading harmonies because spreading harmonies arise from anticipatory promotion or delayed demotion of a source gesture; in other words, spreading is blockable because spreading derives from gestural selection, which is constrained by antagonistic relations between gestures. In contrast, blocking is never observed in agreement patterns because agreement patterns do not arise from gestural selection. Instead, agreement arises from leaky gating of a gesture with sub-selection level excitation; blocking does not occur in agreement patterns because the relevant gestural system need not be selected in order to influence the state of the vocal tract.

Along these same lines, intervening segments which are not targets of an agreement pattern are always "transparent" in agreement patterns, in the sense that they involve the selection of an antagonist whose influence on the relevant intentional field outweighs the influence of the triggering gesture. Intervening segments in a spreading pattern must either block the selection of the dissociated gesture or allow selection of that gesture, in which case those segments will exhibit physically observable characteristics of the relevant articulatory state. Thus the differences in blockability and transparency of agreement and spreading patterns fall out naturally from the hypothesized difference in mechanisms. For example, in nasal spreading harmony, intervening vowels which become nasalized typically lack contrastive nasalized vowel counterparts. Hence we can infer that in such cases there is no [VEL clo−] antagonist selected with the vowels which would prevent the early promotion or late demotion of [VEL op+].

TABLE 4 | Typological differences between agreement and spreading patterns.

Agreement is almost always morphologically restricted to a root or derivational morphological domain, whereas spreading often extends to inflectional morphs and even clitics (Hansson, 2001: 430). As Hansson (2001: 430) puts it, "consonant harmony is never postlexical." Because we have not developed an explicit model of the role of morphological domains in gestural-motoric organization, a detailed analysis of this typological distinction cannot be presented. Nonetheless, to explain why agreement never seems to involve inflectional domains, we might conjecture that the reorganization operations associated with inflectional forms always enforce strong gestural gating: during epochs in which an inflectional form is selected, all gating functions are non-leaky. This would account for why agreement never extends to inflectional morphs.

Agreement is never sensitive to stress or other metrical structure, and is never bounded by prosodic domains such as the foot; in contrast, such prosodic domain restrictions are common for spreading patterns, such as vowel harmonies and vowel-consonant harmonies (Hansson, 2001; Rose and Walker, 2011). This difference can be interpreted with the idea that domains such as the prosodic word are associated with the selection of accentual gestures (Tilsen, 2018a, 2019), in conjunction with the idea that selection of accentual gestures can influence the promotion and demotion of articulatory gestures. Accentual gestures specify F0 and/or intensity targets, and are associated with stress (i.e., metrical structure) as well as intonation (pitch accents). If we assume that the selection of an accentual gesture can enhance the likelihood that speakers select a gesture which is antagonistic to a spreading gesture, or at least augment the antagonism, then we can generate patterns in which spreading harmonies are restricted to a particular prosodic domain. In contrast, this hypothesized effect of selecting an accentual gesture will have no bearing on the mechanism whereby agreement patterns arise, because such patterns are not contingent on selection of the triggering gesture.

Another typological difference is that agreement is always structure-preserving, in that agreement patterns never give rise to new classes of segments (Hansson, 2001). In contrast, spreading can and often does result in an expansion of the segmental inventory. To account for this, we must interpret the difference as a consequence of the phonologization of agreement and spreading patterns. When the sub-threshold gestural influence on an intentional field is reinterpreted as selection of the triggering gesture, that reinterpretation is constrained to result in selection of a set of gestures which already exists in the inventory of such sets in a given language. What makes spreading different is that the triggering gesture which is phonologized as a member of another selection set is already selected during the epoch governed by that set. Thus any prohibitions on reinterpretations which result in new sets of gestures in the inventory are weaker.

Agreement harmonies always involve segments which are similar, while spreading patterns do not necessarily involve similar segments. For example, nasal consonant harmonies are always restricted to a subclass of consonants – e.g., coronal sonorants – such that consonants not in this class are transparent to the harmony. This is expected if featurally similar segments are more likely to lack an antagonistic gesture which would oppose the subthreshold influence of the triggering gesture. It is worth noting that similarity appears to be a factor in speech errors as well: segments which share more features are more likely to participate in substitutions and exchanges than segments with fewer features in common (Fromkin, 1971; Nooteboom, 1973; Shattuck-Hufnagel, 1979; Frisch, 1997). In contrast, the relations between triggers and targets in spreading harmonies are not expected to be constrained by featural similarity because antagonistic gestures block spreading; in the absence of this blocking any segment from an adjacent epoch can be influenced by dissociated selection.

Finally, agreement harmonies are predominantly anticipatory, and those cases which are not anticipatory can be analyzed as instances of stem-control (Hansson, 2001: 467). In contrast, spreading harmonies show a weaker bias for anticipatory directionality. This anticipatory bias in agreement patterns suggests that the subthreshold influence of a gesture may be stronger before the gesture is selected than after the gesture has been suppressed. This makes sense if we assume that suppression causes the excitation of the gesture to be lower than it was prior to selection. The force exerted on an intentional field is always a function of gestural excitation, and presumably even leaky parameterization of the gating function does not allow gestures with very low excitation to have strong influences on intentional fields. Our analysis of spreading, in contrast, does not hinge on the sub-threshold excitation of gestures, and therefore no similar bias is expected.

#### GENERAL DISCUSSION AND CONCLUSION

In this paper we presented a new model of how the target state of the vocal tract is controlled in the planning and production of speech. Specifically, we argued that for each parameter of vocal tract geometry in the Articulatory Phonology/Task Dynamics model, there is a one-dimensional field – an intentional planning field – in which a distribution of activation determines the current target value of that parameter. These intentional planning fields receive distributions of both excitatory and inhibitory input from gestural systems, and on that basis we distinguished between excitatory gestures and inhibitory gestures. In this expanded conception, we distinguished between dynamic targets, which vary continuously and are derived from integrating the distribution of activation in an intentional field, and gestural targets, which are associated with distributions of excitatory or inhibitory forces that gestures exert on the activation of intentional fields. Furthermore, the proposed model of intentional planning was integrated with the selection-coordination framework (Tilsen, 2016, 2018b), in which sequencing of syllable-sized sets of gestures is accomplished via a competitive selection mechanism. The competitive selection mechanism is conceptualized as the organization of gesture sets in a step potential, in which selection sets are iteratively promoted and demoted.
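The relation between gestural targets and a dynamic target can be made concrete with a minimal static sketch. Several choices here are assumptions rather than parts of the model as stated: the Gaussian shape of gestural input, the rectification of negative activation, the use of the activation centroid as the "integrated" dynamic target, and all numerical values. The field dynamics over time are omitted entirely.

```python
import math


def gaussian(x, center, width):
    """Unnormalized Gaussian bump used as a gestural force distribution."""
    return math.exp(-0.5 * ((x - center) / width) ** 2)


def dynamic_target(grid, excitatory, inhibitory=()):
    """Centroid of the (rectified) activation in a one-dimensional field.

    excitatory / inhibitory: sequences of (center, width, strength) tuples,
    one per gesture. Treating the centroid as the dynamic target is an
    assumption made for this sketch.
    """
    activation = []
    for x in grid:
        a = sum(s * gaussian(x, c, w) for c, w, s in excitatory)
        a -= sum(s * gaussian(x, c, w) for c, w, s in inhibitory)
        activation.append(max(a, 0.0))  # assume activation is non-negative
    total = sum(activation)
    if total == 0.0:
        raise ValueError("no positive activation in the field")
    return sum(x * a for x, a in zip(grid, activation)) / total


# Normalized parameter axis (0..1) for one vocal tract parameter.
grid = [i / 100 for i in range(101)]

target_only = dynamic_target(grid, [(0.5, 0.1, 1.0)])
# A weaker excitatory input at 0.7 pulls the dynamic target toward it.
assimilated = dynamic_target(grid, [(0.5, 0.1, 1.0), (0.7, 0.1, 0.3)])
# The same input made inhibitory pushes the dynamic target away from it.
dissimilated = dynamic_target(grid, [(0.5, 0.1, 1.0)], [(0.7, 0.1, 0.3)])
```

In this sketch the same secondary input produces an assimilatory shift when excitatory and a dissimilatory shift when inhibitory, which is the qualitative contrast the excitatory/inhibitory distinction is meant to capture.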

There are several ways in which the model presented here complicates our understanding of speech, and thus it is important to establish why such complications are warranted. In general, when two models fare equally well in describing the same empirical phenomena, we should prefer the simpler model. But if the more complicated model accounts for a wider range of empirical phenomena than the simpler one, we must weigh the advantages of broader empirical coverage against the disadvantage of greater model complexity. In the current case the expanded empirical coverage outweighs the increase in complexity and therefore justifies the model. There are also ways in which the proposed model is simpler than the standard AP/TD model, and these constitute arguments in its favor. To elaborate on these points, we review the phenomena that the selection-coordination-intention model addresses.

First, we observed in section "Introduction" that there are aspects of control over the state of the vocal tract that gestural scores do not explicitly represent. Specifically, we showed that there are two alternative ways of conceptualizing how a consonantal constriction is released. On one hand, the standard AP/TD model accomplishes releases via the influence of a neutral attractor on model articulators. Crucially, we noted that in order to avoid unwanted influence of the neutral attractor during periods of time in which gestures are active, the AP/TD model competitively gates the influence of the neutral attractor on model articulators. The competitive gating amounts to turning the neutral attractor on and off in a way that is precisely locked to the activation of gestures and contingent on the model articulators that are influenced by those gestures. Alternatively, we suggested that releases of constrictions can be driven by active gestures. Despite increasing the number of gestures that are involved in production of a word form, this alternative is simpler in that it does away with the need to competitively gate the neutral attractor in a way that is precisely timed to gestural activation. A nice consequence of this view is that we do not need to posit ad hoc constructs such as a default modal-voicing state of the glottis during speech: all movement is driven by intentional planning fields which evolve continuously in time. The competitive gating account is also somewhat unsatisfactory from a conceptual standpoint, in that it requires a mechanism which is sensitive not only to the tract variables which gestures are associated with, but also the model articulators that are used to effect changes in

force is necessary. Second, in sections "Empirical Evidence for Intentional Planning" and "The Inadequacy of Gestural Blending" we considered the empirical phenomena of assimilation and dissimilation between contemporaneously planned targets. It was argued that the standard AP/TD model cannot generate either sort of pattern, because in that model gestures only have influences on the vocal tract when they are active. In distractor-target paradigms where assimilatory and dissimilatory patterns are observed, the distractor is never produced, hence the corresponding gesture should not be active and should have no influence on production. Furthermore, in the standard model, dissimilatory patterns would require a problematic form of gestural gating in which blending negatively weights the influence of the distractor. In contrast, the intentional planning model readily accounts for both assimilatory and dissimilatory patterns, without requiring gestural activation or unusual gating. This is accomplished by hypothesizing that gestures which are not selected can exert forces on intentional planning fields, and that those forces can be excitatory and/or inhibitory. Although this account is more complex, it succeeds in generating the empirical patterns.

Third, in the section "Sub-selection Intentional Planning and Anticipatory Posturing" we considered the phenomenon of anticipatory posturing, which involves the partial assimilation of vocal tract posture to targets of an upcoming response. The standard AP/TD model cannot account for this phenomenon without fairly ad hoc stipulations, such as positing multiple targets for gestures, new gestures, or special dynamics of gestural gating. The selection-coordination-intention model generates anticipatory posturing through influences of nonactive (i.e., excited but not selected) gestures on intentional planning fields. These subthreshold influences are governed by parameterization of the gestural gating function, which determines the strengths of the forces exerted by excited gestures on intentional fields. It was shown that leaky gating allows such influences to be non-negligible, and that blending those influences with the constant influence of the neutral attractor accounts for the partially assimilatory quality of anticipatory posturing.

Fourth, in section "The Origins of Non-local Phonological Patterns," we examined two varieties of non-local phonological patterns, spreading harmony and agreement harmony. It was shown that these two varieties of harmony can be understood to originate through distinct mechanisms. Spreading harmonies were understood to arise from selectional dissociations in which anticipatory degating (i.e., early promotion) or delayed suppression (i.e., late demotion) cause a gesture to be selected in an epoch other than the one in which it is canonically selected. One prediction of this account that could be readily tested is that (non-phonologized) spreading will be less extensive when external feedback plays a greater role in gestural selection and suppression, i.e., in slower, more careful speech. In contrast, agreement harmonies were understood to arise from leaky gating of gestural forces on intentional fields. The role of leaky gating in both anticipatory posturing and the origination of agreement patterns predicts that there may be a correlation between the extent to which a speaker may exhibit an anticipatory articulatory posture in some tract variable and their ability to learn an agreement harmony involving that tract variable.

Importantly, the proposed mechanisms account for a key phenomenological difference between spreading and agreement: the possibility of blocking. Spreading harmonies can be blocked because they hinge on selection of a gesture, and the selection of a given gesture is prohibited when an antagonistic gesture is selected. Agreement harmonies are never blocked because they do not require selection of the relevant gesture; intervening segments are thus always transparent. Furthermore, we discussed how a number of typological differences between spreading and agreement could be understood in the context of the model. These involved the sensitivity of such patterns to morphological and prosodic domains, structure preservation, similarity sensitivity, and directionality biases. The standard AP/TD model does not provide two distinct mechanisms for the origins of spreading and agreement, and so there is no straightforward way to understand the typological differences between such patterns.

In sum, the selection-coordination-intention model, while more complicated than standard AP/TD, addresses a broader range of empirical phenomena: assimilation/dissimilation of contemporaneously planned targets, anticipatory posturing, and spreading/agreement harmonies. A desirable consequence of the model is that agreement harmonies can be viewed as the result of a motoric mechanism which operates locally, i.e., involves continuous influence on an intentional field. This makes it unnecessary to stipulate non-local mechanisms in the utterance-timescale genesis of phonological patterns. The model also simplifies our understanding of control over the vocal tract by eliminating the need for a special blending mechanism involving the neutral attractor. The primary downside of the selection-coordination-intention model is the need for more detailed specification of the gestures that are involved in production of a word form, including a dissociation between excitatory and inhibitory gestures. An outstanding issue is whether there are undiscovered generalizations about when both excitatory and inhibitory gestures need to be specified, and when it is possible to specify only one of these. Future work should explore this question.

# AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and has approved it for publication.

#### REFERENCES


**Conflict of Interest:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Tilsen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Bridging Dynamical Systems and Optimal Trajectory Approaches to Speech Motor Control With Dynamic Movement Primitives

Benjamin Parrell<sup>1</sup>\*† and Adam C. Lammert<sup>2,3</sup>†

*<sup>1</sup> Department of Communication Sciences & Disorders, University of Wisconsin-Madison, Madison, WI, United States, <sup>2</sup> Department of Biomedical Engineering, Worcester Polytechnic Institute, Worcester, MA, United States, <sup>3</sup> Bioengineering Systems & Technologies, MIT Lincoln Laboratory, Lexington, MA, United States*

#### Edited by:

*Adamantios Gafos, University of Potsdam, Germany*

#### Reviewed by:

*Plinio Almeida Barbosa, Campinas State University, Brazil; Elliot Saltzman, Boston University, United States*

> \*Correspondence: *Benjamin Parrell bparrell@wisc.edu*

*†These authors have contributed equally to this work*

#### Specialty section:

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

Received: *15 March 2019* Accepted: *19 September 2019* Published: *14 October 2019*

#### Citation:

*Parrell B and Lammert AC (2019) Bridging Dynamical Systems and Optimal Trajectory Approaches to Speech Motor Control With Dynamic Movement Primitives. Front. Psychol. 10:2251. doi: 10.3389/fpsyg.2019.02251*

Current models of speech motor control rely on either trajectory-based control (DIVA, GEPPETO, ACT) or a dynamical systems approach based on feedback control (Task Dynamics, FACTS). While both approaches have provided insights into the speech motor system, it is difficult to connect these findings across models given the distinct theoretical and computational bases of the two approaches. We propose a new extension of the most widely used dynamical systems approach, Task Dynamics, that incorporates many of the strengths of trajectory-based approaches, providing a way to bridge the theoretical divide between what have been two separate approaches to understanding speech motor control. The Task Dynamics (TD) model posits that speech gestures are governed by point attractor dynamics consistent with a critically damped harmonic oscillator. Kinematic trajectories associated with such gestures should therefore be consistent with a second-order dynamical system, possibly modified by blending with temporally overlapping gestures or altering oscillator parameters. This account of observed kinematics is powerful and theoretically appealing, but may be insufficient to account for deviations from predicted kinematics, i.e., changes produced in response to some external perturbations to the jaw, changes in control during acquisition and development, or effects of word/syllable frequency. Optimization, such as would be needed to minimize articulatory effort, is also incompatible with the current TD model, though the idea that the speech production system economizes effort has a long history and, importantly, also plays a critical role in current theories of domain-general human motor control.
To address these issues, we use Dynamic Movement Primitives (DMPs) to expand a dynamical systems framework for speech motor control to allow modification of kinematic trajectories by incorporating a simple, learnable forcing term into existing point attractor dynamics. We show that integration of DMPs with task-based point-attractor dynamics enhances the potential explanatory power of TD in a number of critical ways, including the ability to account for external forces in planning and optimizing both kinematic and dynamic movement costs. At the same time, this approach preserves the successes of Task Dynamics in handling multi-gesture planning and coordination.

Keywords: speech motor control, computational models, dynamical systems, optimal control, task dynamics, dynamic movement primitives

# INTRODUCTION

The speech motor system comprises many individual subsystems (respiratory, phonatory, articulatory), a larger number of individual articulators (upper lip, lower lip, jaw, tongue tip, tongue body, etc.), and an even larger number of muscles. The highly redundant structure of this system ensures that there are often many (perhaps infinite) ways for the system to move between two given configurations (Bernstein, 1967). How are speakers able to select from among the multitude of possible movement patterns, to arrive at those representing the highly accurate and precise movements that typify healthy, mature speech? Attempts to explain the speech control systems that produce such complex behavior have fallen into two opposing approaches: (1) dynamical systems theory, which conceptualizes movement patterns as emergent properties of synergistic groups or systems of speech articulators whose evolution is determined by the state of the system and current production goals, and (2) trajectory-based approaches, which solve the highly redundant control problem by pre-specifying a particular desired trajectory. A subset of this latter approach which will be particularly relevant to the current proposal are optimality-based approaches, which attempt to find a desired trajectory that minimizes some cost function (either kinematic properties of the movement, such as jerk, or dynamic properties, such as total force). While both dynamical systems and optimal control approaches have had success in replicating certain aspects of human speech behavior, they have arrived at essentially distinct understandings of the nature of speech motor control.

The dynamical systems approach suggests that control of the complex motor system can be considerably simplified by understanding the motor system as a self-organizing system of functional units of articulators, each of which corresponds to a particular behavioral task. The behavior of the component articulators, while governed by the higher-order functional unit, need not be explicitly or directly specified. These functional units thus serve to constrain the motor system in such a way that its evolution serves to perform the particular task specified by the functional unit, without the need to centrally control the activity of each degree of freedom in the system. Practically, these functional units are hypothesized to be autonomous dynamical systems whose evolution depends on the system's current and goal states. The particular parameters of each dynamical system (e.g., the goal, stiffness, damping, etc.) govern the evolution of the system from its current state toward a goal state, and the evolution of this dynamical system generates the motor activity in the lower-level subsystems needed to perform the task. Importantly, there is no specific plan or desired kinematic trajectory in this approach. Instead, the kinematic behavior of the system emerges from the dynamical regime governing the functional unit (which could alternatively be called the controller).

The most prevalent dynamical systems model of speech production is the Task Dynamic Model<sup>1</sup> (Saltzman, 1986; Saltzman and Munhall, 1989). In this theory, speech tasks are modeled as second-order, damped mass-spring systems. The evolution of such a system is given by Equation (1) (discussed in more detail in the Task Dynamics section).

$$
\ddot{z} = M^{-1} \left( -B\dot{z} - K \left( z - g \right) \right) \tag{1}
$$

where $\ddot{z}$ is the acceleration of the system state, $z$ is the current position, $\dot{z}$ is the current velocity, $g$ is the target spatial position or goal, and $M$, $B$, and $K$ are, respectively, the mass, damping, and stiffness coefficients, which are assumed to reflect critical damping. Such systems have two desirable characteristics for a motor controller. First, they exhibit equifinality, such that the system will come to rest at its target position regardless of the initial state of the system. This also ensures that the system will reach its resting position regardless of any perturbations that may occur during the movement, without the need for any re-planning or change in control. Second, such systems are time-invariant, in that the evolution of the system is governed by its current state and dynamical parameters (spatial target, mass, stiffness, damping) rather than being explicitly a function of time. This is a particularly important consideration for speech, where the duration of individual movements is affected by a wide range of parameters, including speech rate, stress, and prosodic structure.
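The equifinality property can be illustrated numerically. The following is a minimal sketch, not the Task Dynamics implementation: it integrates a one-dimensional version of Equation (1) with a simple semi-implicit Euler scheme. The function name, parameter values, step size, and initial conditions are all assumptions chosen for illustration.

```python
import math


def simulate_gesture(z0, v0, goal, k=100.0, m=1.0, dt=0.001, steps=2000):
    """Integrate a 1-D version of Equation (1) and return the final position.

    The damping coefficient is chosen for critical damping, b = 2 * sqrt(m * k).
    All numerical values here are illustrative assumptions.
    """
    b = 2.0 * math.sqrt(m * k)
    z, v = z0, v0
    for _ in range(steps):
        acc = (-b * v - k * (z - goal)) / m  # Equation (1)
        v += acc * dt                        # semi-implicit Euler update
        z += v * dt
    return z


# Three different initial states (position, velocity), one shared goal.
initial_states = [(0.0, 0.0), (3.0, -5.0), (-2.0, 10.0)]
finals = [simulate_gesture(z0, v0, goal=1.0) for z0, v0 in initial_states]
```

Under these assumed parameters, every run settles near the goal regardless of where it started, and no explicit function of time appears in the update rule.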

Dynamical approaches to movement control receive some support from research on neurobiological control systems. For example, the VITE model (Bullock and Grossberg, 1988) presents a relatively simple neural network model that is able to generate appropriate kinematic behavior in directed reaching movements. The model consists of three distinct but interacting neural populations encoding (1) the present position of the system, (2) the desired target position, and (3) the difference between the target and present positions. The relational structure between these populations is such that the behavior of the controlled systems is consistent with second-order dynamics. This suggests a plausible neural implementation of the more abstract dynamical systems in Task Dynamics (Lammert et al., 2018). Additionally, recent studies have identified dynamical patterns in the neural activity that drives motor behavior (Churchland et al., 2012; Shenoy et al., 2013). Using intracortical recordings in non-human primates, these studies have shown that oscillatory motor behavior, such as walking, is reflected at a neural level by co-occurring oscillatory dynamics at the population level in the activity of motor cortical neurons. Importantly, cortical activity during goal-directed reaching, a non-oscillatory behavior, also exhibits patterns of neural activity consistent with a truncated limit-cycle oscillator. These results have recently been extended to human speech, where similar dynamical patterns have been demonstrated in the population-level activity of primary motor cortex neurons during production of monosyllabic words (Stavisky et al., 2018). Together, these results suggest that a controller based on dynamical equations may be an appropriate model of the neural implementation of motor control.

The principal drawback of the Task Dynamics implementation of dynamical systems control is that the dynamics driving the evolution of the functional task units are limited in flexibility. The system is only able to generate oscillatory dynamics (with various degrees of damping), such that the system will evolve in a deterministic way from any given initial state toward the goal state. Though these movements can, in principle, be modified in a potentially profound way by changing the damping, stiffness and inertial coefficients, such changes would only globally affect each gesture within the system.

<sup>1</sup>Both the FACTS (Parrell et al., 2019) and the ACT (Kröger et al., 2009) models also incorporate dynamical systems control.

Extensions of Task Dynamics have attempted to address this limitation for specific cases. Prosodic gestures have been proposed that allow for temporally-specific changes in the rate and/or extent of movements (Byrd et al., 2000; Byrd and Saltzman, 2003; Saltzman et al., 2008), though these prosodic gestures act concurrently on all active gestures, rather than specifically on individual gestures. Also, multiple gestures produced with varying degrees of temporal overlap have been shown to result in movements that are truncated and forced to reverse direction prematurely, which can account for reduction phenomena, such as undershoot, flapping, and spirantization (Browman and Goldstein, 1990, 1992; Edwards et al., 1991; Beckman and Edwards, 1992; Beckman et al., 1992; Parrell, 2011; Parrell and Narayanan, 2018).

Despite these important modeling advances, the Task Dynamics implementation of dynamical systems control is still unable to produce local changes in the rate of change or reversals of direction arbitrarily, or for any single activated tract-variable (TV) gesture. While such behavior may not be critical for some aspects of speech (see the large literature on modeling speech using second order dynamics), some speech behaviors do require more complex control. For example, when speakers are exposed to a velocity-dependent force field on the jaw, they initially produce jaw trajectories that deviate, or curve away, from the relatively straight trajectories observed under unperturbed conditions (Tremblay et al., 2003, 2008; Tremblay and Ostry, 2006; Lametti et al., 2012). However, after a period of exposure, jaw trajectories return to their baseline curvature. When the force field is subsequently removed, jaw trajectories are curved in the opposite direction as under initial exposure. These results suggest that the speech motor control system can learn to account for the dynamics of the force field to generate motor commands that maintain a straight trajectory. Moreover, some have argued that speech motor control may rely on explicit trajectory representations rather than discrete attractors (Guenther, 2016) or that the speech motor system seeks to balance effort and intelligibility (Lindblom, 1990; Perrier et al., 2005; Patri et al., 2015). These types of behavior cannot be generated in Task Dynamics or any control system whose dynamics are dependent only on the system state.

In order to account for behaviors exhibited by speakers in the jaw perturbation paradigm discussed above, the controller must be sensitive to other types of information beyond the instantaneous system state. One solution to this problem is found in theories that rely on optimization to generate motor output. Such schemes, known as optimal controllers, seek to generate a movement that minimizes some cost function. Typically, this involves the generation of a pre-planned motor trajectory, such that the cost of the full movement can be calculated and minimized prior to movement onset [though see optimal feedback control, e.g., Todorov and Jordan (2002), for a variation of optimal control without pre-planned trajectories].

Optimal control has a long history in modeling discrete reaching tasks (Nelson, 1983; Flash and Hogan, 1985; Uno et al., 1989; Hoff and Arbib, 1993; Harris and Wolpert, 1998) as well as in speech (Perrier et al., 2005; Patri et al., 2015). While these models share the general concept of optimizing movements to minimize some cost, the nature of the cost function has been a matter of debate. It is often claimed that the central nervous system minimizes the total muscle activation of a movement (Harris and Wolpert, 1998; Todorov and Jordan, 2002; Todorov, 2004; Perrier et al., 2005; Patri et al., 2015), either to minimize the amount of energy expended during a movement or to minimize error. Error is minimized along with total muscle activation because noise in the motor system is signal dependent, such that the variance of force scales proportionally with the square of the force (O'Sullivan et al., 2009; Diedrichsen et al., 2010). Other proposals suggest that the kinematic characteristics of movement determine the cost function. Cost functions have been suggested to minimize jerk, which is the third derivative of position (Flash and Hogan, 1985; Hoff and Arbib, 1993), torque change (Uno et al., 1989), or path curvature (Kistemaker et al., 2010, 2014). Regardless of their specific implementation, such proposals are able to account for external as well as internal dynamics in control, and are able to produce changes in behavior in response to force field perturbations (Izawa et al., 2008).
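As a concrete illustration of a kinematic cost, the sketch below evaluates the well-known closed-form minimum-jerk profile for a one-dimensional point-to-point movement that starts and ends at rest (the fifth-order polynomial associated with Flash and Hogan, 1985) and numerically integrates its squared jerk. This is an illustrative aside rather than part of the model developed here: the function names, the unit distance and duration, and the omission of any constant scaling factor in the cost are assumptions.

```python
def min_jerk_position(x0, xf, duration, t):
    """Position along the minimum-jerk profile at time t (0 <= t <= duration)."""
    tau = t / duration
    return x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)


def min_jerk_jerk(x0, xf, duration, t):
    """Third time derivative of min_jerk_position, obtained analytically."""
    tau = t / duration
    return (xf - x0) / duration**3 * (60 - 360 * tau + 360 * tau**2)


def jerk_cost(x0, xf, duration, n=10000):
    """Integrated squared jerk, approximated with the midpoint rule."""
    dt = duration / n
    total = 0.0
    for i in range(n):
        t_mid = (i + 0.5) * dt
        total += min_jerk_jerk(x0, xf, duration, t_mid) ** 2
    return total * dt


# Unit movement over unit time (illustrative values).
cost = jerk_cost(0.0, 1.0, 1.0)
midpoint = min_jerk_position(0.0, 1.0, 1.0, 0.5)
```

For a movement of distance d and duration T, the integral evaluates analytically to 720 d²/T⁵, so the cost rises steeply as movements become shorter in duration.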

In speech, optimal control has been implemented in the GEPPETO (Perrier et al., 2005) model and its Bayesian reformulation (Patri et al., 2015). It is also, implicitly, incorporated into DIVA (Guenther, 2016). DIVA differs from many optimal control approaches in that it attempts to optimize planned motor trajectories with respect to a given reference (sensory) trajectory. Optimization serves the purpose of accurately following the reference trajectory, rather than minimizing some criterion intrinsic to the planned trajectory itself, such as effort. This is accomplished by summing, over time, corrective motor commands issued by the auditory and somatosensory feedback controllers, which can be seen as a type of iterative optimization.

Most optimal control models, including those of speech, rely on the generation of movement trajectories. This is partly because identifying specific, optimal trajectories is more computationally tractable when compared to identifying more general optimal control policies (Schaal et al., 2007). Trajectories (or, more precisely, time-varying targets) have also been suggested to be necessary for speech (Guenther, 2016). Trajectory-based control can also substantially simplify the degrees-of-freedom problem if trajectories are planned in mobility space<sup>2</sup> (as occurs in DIVA and GEPPETO), since each degree of freedom is explicitly accounted for. However, trajectories lack flexibility, and may require frequent replanning/reoptimization in the face of changing environments or task demands. Trajectory-based control is also inherently time-indexed, in that trajectories are defined as a function of time. Such time-indexing has strict consequences for the validity of trajectory-tracking control policies in changing environments, and may also be difficult to reconcile with the temporally malleable speech production system (e.g., movement durations are affected by speech rate, stress, prosodic boundaries, etc.). Moreover, trajectory-based optimal controllers make inaccurate predictions about the types of variability observed in human kinematics (Todorov and Jordan, 2002). And, perhaps most importantly, there is growing evidence that human movement does not rely on fully preplanned trajectories, at least for limb control (Sergio and Scott, 1998; Desmurget and Grafton, 2000; Nashed et al., 2012).

<sup>2</sup>The term "mobility space" comes from the robotics literature (Sciavicco and Siciliano, 2000). It is used here rather than the more common "articulatory space" to provide a neutral reference to the kinematic configuration of the vocal tract and avoid confusion over whether "articulatory" refers to a low-level description of the vocal tract geometry (e.g., a concrete description of muscle lengths, or a more abstracted version of vocal tract kinematics such as that provided by the model articulators currently used in TD or in our jaw movement example given below) or a higher-level description of task spaces (such as that provided by the gestural tract-variable space currently used in TD and in our jaw example below).

Thus, the field of speech motor control is left with a situation where neither the dynamical systems nor optimal control approaches provide fully satisfactory accounts of human motor behavior. An ideal control system would provide the flexibility, temporal flexibility, and robustness of the dynamical systems approach with the ability to account for the behavioral evidence that humans do produce motor behavior in accordance with particular dynamic and/or kinematic constraints.

A few approaches in human motor control and robotics have sought to bridge this divide. These include Optimal Feedback Control (Todorov and Jordan, 2002; Todorov, 2004), Dynamic Movement Primitives (Schaal et al., 2007; Ijspeert et al., 2013), and Embodied Task Dynamics (Simko and Cummins, 2010a,b, 2011). Optimal Feedback Control (OFC) replaces trajectory-based optimization with an optimal feedback control law. While this solves many of the issues with traditional optimal control, the derivation and calculation of this optimal feedback control law is difficult, especially for non-linear systems like speech. The approach based on Dynamic Movement Primitives (DMPs) incorporates an additional forcing function into a second-order dynamical control system that can be tuned to alter the trajectory produced by the dynamical control system. This approach is substantially easier to compute and, perhaps more importantly, retains the many benefits provided by existing dynamical control schemes. Embodied Task Dynamics is an extension of Task Dynamics that incorporates the physical masses of the speech articulators into the equations of control. This allows for the quantification of effort (sum of forces), which is then used in a cost function along with constraints on movement duration and speech intelligibility.

The current paper presents a step toward bridging the substantial theoretical gap that separates dynamical systems and optimal or trajectory-based approaches to speech motor control. We accomplish this by leveraging the tools of Dynamic Movement Primitives (Ijspeert et al., 2013) to incorporate optimization into the most well-developed dynamical-systems framework of speech motor control, Task Dynamics. In the sections below, we lay out the basics of dynamical control in Task Dynamics, DMPs, and the coordination of DMPs with second-order dynamical systems. We then demonstrate the utility of this combined model by showing how this approach can be used to generate corrections for dynamic jaw perturbations that are consistent with experimentally measured human behavior. Lastly, we show how the mechanisms developed to incorporate DMPs into second-order dynamical systems can also be used as a system of intergestural coordination (Nam and Saltzman, 2003; Saltzman et al., 2008; Goldstein et al., 2009) as well as movement initiation (Tilsen, 2013).

#### TASK DYNAMICS MODEL

Articulatory Phonology (AP) posits that constriction actions (i.e., gestures) of the vocal tract represent both the primitive units of spoken language and the controlled tasks that characterize speech motor control (Browman and Goldstein, 1992). The Task Dynamics (TD) model asserts that the controlled evolution in time of these constriction actions is governed by second-order equations of motion, consistent with a critically damped harmonic oscillator.

Speech gestures and their associated dynamics take place in a space described by a vector of N tract variables, z, where z = [z1, z2, . . . , zN], that correspond to the degree and location of vocal tract constrictions. Each specific gesture, k, is associated with its own pair of constriction degree and location tract variables and its own set of mobility variables. Additionally, each gesture is associated with a corresponding set of tract-variable dynamic parameters (spatial target, mass, damping, and stiffness, all time-invariant) and articulator weights. Articulator weights are described below in conjunction with Equation 6. Gestures themselves are governed by equations of motion consistent with a damped harmonic oscillator, as described by Saltzman and Kelso (1987) and Saltzman and Munhall (1989):

$$M\ddot{z} = -B\dot{z} - K\Delta z \tag{2}$$

where Δz = (z − g), and g is a vector containing the time-varying set of parameters representing the current set of tract-variable spatial motor goals, i.e., the target positions to which the tract variables are compelled to move and upon which they will tend to converge. M, B, K are diagonal matrices containing the mass, damping, and stiffness coefficients, respectively. All tract variable parameters, M, B, K, and g, change over time as functions of the currently active set of gestures. As noted above, the stiffness, damping and inertial gestural parameters can have a profound influence on the gesture-related movement trajectories. These parameters, from a broader perspective, may therefore be considered part of the motor goals of the system, e.g., stiffness parameters are lower for vowels than consonants to capture the fact that vowel gestures are typically slower than consonant gestures.

The TD model also defines the relationships between the tract variables and relatively lower-level mobility variables, φ. Tract variables describe the state of the vocal tract with respect to speech gestures. However, the vocal tract, like many motor systems, is typically considered to have a hierarchical structure, where motor goals are defined in a high-level task space, and motor commands are issued in a low-level mobility space. For example, in a speech context, mobility space variables might be expressed in terms of the positions of the speech articulators (e.g., upper lip, lower lip, tongue tip, etc., called the model articulators in TD), or even in terms of muscle activations<sup>3</sup>. The relevant kinematic equations that define the relationships between the task and mobility spaces are expressed as follows:

$$z = h(\phi),\tag{3}$$

$$
\dot{z} = J(\phi)\dot{\phi},\tag{4}
$$

$$
\ddot{z} = J(\phi)\ddot{\phi} + \dot{J}(\phi, \dot{\phi})\dot{\phi}, \tag{5}
$$

where h represents the direct kinematic mapping between task and mobility spaces, and J is the Jacobian matrix of first-order partial derivatives of z with respect to φ.

Using these kinematic relationships, one can express accelerations of the controlled, mobility space variables with respect to the task-space error:

$$\ddot{\phi} = J^{*} \left( M^{-1} \left[ -B J \dot{\phi} - K \Delta z \right] \right) - J^{*} \dot{J} \dot{\phi}, \tag{6}$$

where $J^{*} = W^{-1} J^{T} \left( J W^{-1} J^{T} \right)^{-1}$ is the pseudo-inverse of the Jacobian, weighted by a matrix W. The equation of motion, in Equation (6), for mobility space variables represents the full expression of the dynamical control law that characterizes TD, with integrated inverse kinematics, specifying how task-space error is equated to a preferred change in mobility space. It is worth noting that the weighted Jacobian pseudo-inverse provides a minimum norm solution that can be considered optimal in the sense that it minimizes the weighted sum of squared mobility-space accelerations selected for the solution. As evidenced by this fact, it is possible to incorporate some aspects of preferred optimality directly into a dynamical systems control algorithm.
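The minimum-norm property of the weighted pseudo-inverse can be checked on a toy case. The sketch below assumes a single tract variable driven by two mobility variables, a constant Jacobian row, and a diagonal weighting matrix. These values, and the helper names, are hypothetical and serve only to illustrate the formula for J* given above.

```python
def weighted_pinv_row(jacobian_row, weights):
    """J* = W^-1 J^T (J W^-1 J^T)^-1 for one task variable and diagonal W.

    With a single task variable, J W^-1 J^T is a scalar, so the matrix
    inverse reduces to a division.
    """
    w_inv_jt = [j / w for j, w in zip(jacobian_row, weights)]
    denom = sum(j * v for j, v in zip(jacobian_row, w_inv_jt))
    return [v / denom for v in w_inv_jt]


def weighted_cost(accels, weights):
    """Weighted sum of squared mobility-space accelerations."""
    return sum(w * a * a for w, a in zip(weights, accels))


J = [1.0, 1.0]     # hypothetical: both articulators move the tract variable equally
W = [1.0, 4.0]     # hypothetical: the second articulator is four times as costly
task_accel = 1.0   # desired task-space acceleration

j_star = weighted_pinv_row(J, W)
phi_ddot = [c * task_accel for c in j_star]
achieved = sum(j * a for j, a in zip(J, phi_ddot))
```

In this toy case the solution assigns most of the acceleration to the less heavily weighted articulator, and any other split that achieves the same task-space acceleration (for example, an equal split) has a higher weighted cost.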

In Task Dynamics, the activation of a gesture is determined by the phase of its associated planning oscillator, a second-order dynamical system with non-linear damping. Essentially, the phase of the oscillator determines the value of the "go" signal (G), which allows motion associated with a gesture to proceed. Early versions of Task Dynamics used a step function to define this relationship (e.g., G = 1 while the planning oscillator phase is between 0 and 270°). More recent versions have used a cosine-ramped activation function, which results in more realistic kinematics (Byrd and Saltzman, 1998).

Note that accelerations are potentially experienced by all mobility variables, even those that are not engaged by currently active gestures, due to the inclusion of a neutral attractor. The neutral attractor amounts to a mobility-space target position that drives mobility variables in the absence of driving influences from currently active gestures.

The "go" signal itself is incorporated into TD in the form of a gating matrix, included as part of the inverse kinematics model (Saltzman and Munhall, 1989)<sup>4</sup>, as well as a gesture-specific parameter tuning function (spatial target g as well as damping and spring coefficients; all mass coefficients have been set to 1 for simplicity) for the dynamical control law (see **Figure 1**). Note that the role of the "go" signal used in gesture tuning is similar to and consistent with other models of directed action, e.g., Bullock and Grossberg (1988).

#### DYNAMIC MOVEMENT PRIMITIVES

The general idea of Dynamic Movement Primitives (DMPs) is to augment a dynamical systems model, like that found in Equation (2), with a flexible forcing function input, f . The addition of a forcing function allows the present model to overcome certain inflexibilities inherent in the original TD model. Given a speech gesture—conceptualized in AP and TD as comprising a set of a constriction target and inertial, damping and stiffness parameters—and a set of initial conditions, the unforced patterns of movement in TD are entirely determined by Equation (2). Without some method of otherwise influencing the dynamics, a speech gesture under the same initial conditions will follow the same pattern of movement during each instance of that gesture. Conversely, if the system is subjected to some external perturbation, the changes in movement associated with that perturbation will persist indefinitely. The addition of the forcing term allows for flexible modification of the trajectories of the tract variables as they move toward the spatial motor goal, all while preserving the dynamical form of the TD model. A forcing term of this type, and for this purpose, has been suggested and developed by Ijspeert et al. (Ijspeert et al., 2002, 2013; Hoffmann et al., 2009).

We refer to the dynamic control law augmented with a flexible forcing function input as the control system. In order for this forcing function to flexibly alter the evolution of the dynamical system, it must be time-variant. However, if the forcing function is explicitly a function of time, i.e., f(t), such a formulation would remove one of the key benefits of dynamical systems control, which is time invariance. To avoid this, we replace any explicit time dependency with a dependency on the state, x, of a separate dynamical system, the planning system, so that the forcing function becomes f(x). In the sections that follow, we first describe the nature of the control system and forcing function, then discuss details of the planning system<sup>5</sup>.

<sup>3</sup> In the present work, and especially the jaw movement example given below, the mobility space is taken to be an abstracted geometric description of the speech articulator configuration. Such geometric variables are conceptually related to the model articulators described in the literature on TD and included as part of the CASY model (Rubin et al., 1996). Currently, neither TD nor our jaw example include a model of the vocal tract's musculature and dynamics. We note that future developments in TD and in our work may implement a muscular mobility space as a replacement for the current abstract geometric variables or, alternatively, use such geometric variables as an intermediate step between task space and the control of muscle activations. Either development could be easily integrated with the present modeling efforts.

<sup>4</sup>Building on the weighted Jacobian pseudo-inverse above, the gating role of the 'go' signal was implemented using a diagonal matrix G (i.e., a matrix of 'go' signals), $J^{*} = W^{-1} G J^{T} \left( G J W^{-1} G J^{T} + [I - G] \right)^{-1}$.

<sup>5</sup>The control and planning systems are called the canonical and output systems, respectively, in previous presentations of DMPs (Schaal et al., 2007; Ijspeert et al., 2013). We have chosen to rename these systems to be consistent with the task-level dynamical control law (Equation 2) and planning oscillators in the Task Dynamics model. We also believe that these names more intuitively reflect these systems' functions.

#### Control System

In the control system, the dynamical systems model in Equation (2) is augmented with a forcing function input, f , as follows:

$$M\ddot{z} = -B\dot{z} - K\Delta z + f\tag{7}$$

The forcing term is a vector of forces acting on the vocal tract dynamics, where each element is also associated with a specific tract variable and specific gesture over that tract variable.

For a specific gesture k, the forcing term $f_k$, an input to the control system, is a function of the planning system state, x, with the following form:

$$f_k(x_k) = \frac{\sum_{j=1}^{n} \Psi_j(x_k) \, w_j}{\sum_{j=1}^{n} \Psi_j(x_k)} \left(\frac{2\pi - x_k \bmod 2\pi}{2\pi}\right) (g_k - z_0) \tag{8}$$

where $z_0$ is the initial state of the tract variable associated with the gesture. Thus, the forcing term is essentially a linear combination of n fixed kernel functions $\Psi_j$, each of which is a function of the planning system state and is scaled according to a kernel-specific weight $w_j$. Because the planning system will be defined to converge to 2π, scaling this weighting by $(2\pi - x \bmod 2\pi)/2\pi$ ensures that the overall forcing function will tend toward zero as the planning system converges. This, in turn, ensures that the control system will eventually converge to zero, as the dynamics revert to those of a damped spring-mass system. The purpose of scaling by $g_k - z_0$ is to ensure certain advantageous invariance properties when scaling movements, as outlined by Ijspeert et al. (2013). We will not treat these invariance properties in depth in the current discussion.

The kernel functions have an exponential form:

$$\Psi_j(x) = \exp\left(-\frac{1}{2\sigma_j^2} \left(x - c_j\right)^2\right) \tag{9}$$

giving them a Gaussian shape, with a specific kernel center $c_j$ that situates the kernel relative to some planning system state, and a kernel width parameter $\sigma_j$. As the planning system state evolves, kernel functions that are centered on specific state values will become more highly weighted, to the point where their centers align exactly with the planning system state, and subsequently become less weighted as the planning system evolves beyond that point. As pointed out by Ijspeert et al. (2013), this has similarities with vector-coding models of neural activation.

Several aspects of the model related to the kernel functions are worth noting. First, the kernels as implemented are defined as symmetrical in the planning system domain, x, which means that they are not necessarily symmetrical in the time domain. This can be clearly seen in **Figure 2**. Second, the degree of flexibility afforded to the control system via the kernels (insofar as they are used to compose the forcing function that directly influences the control system) will depend on the number of kernels used, their spacing in x, and the width parameter σ associated with each kernel. In broad terms, more flexibility will be associated with more, narrower kernels that are more closely spaced. Increased flexibility comes, however, at the expense of parsimony of the model. The tradeoff between flexibility and parsimony is an interesting one, the solution to which will certainly be application-specific, and could even be determined as part of an optimization process. For present purposes, it is assumed that the number, spacing, and width of the kernels are fixed. Following previous presentations of DMPs (Schaal et al., 2007; Ijspeert et al., 2013), we leave the question of the optimal kernel parameterization open for future work.
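Equations (8) and (9) can be written out directly for a single tract variable. The sketch below is illustrative only: the number of kernels, their centers and widths, the weights, and the function names are assumptions, and the planning state is taken to lie in the interval from 0 up to (but not including) 2π.

```python
import math

TWO_PI = 2.0 * math.pi


def kernel(x, center, sigma):
    """Gaussian kernel of Equation (9), defined over the planning state x."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def forcing_term(x, weights, centers, sigmas, goal, z0):
    """Forcing term of Equation (8) for one gesture and one tract variable."""
    psi = [kernel(x, c, s) for c, s in zip(centers, sigmas)]
    # Normalized weighted combination of the kernels.
    weighted = sum(p * w for p, w in zip(psi, weights)) / sum(psi)
    # Decay factor: 1 at x = 0, approaching 0 as x approaches 2*pi.
    decay = (TWO_PI - x % TWO_PI) / TWO_PI
    return weighted * decay * (goal - z0)


# Illustrative (assumed) kernel layout: five evenly spaced kernels.
n = 5
centers = [TWO_PI * (j + 0.5) / n for j in range(n)]
sigmas = [0.5] * n
weights = [2.0] * n

f_start = forcing_term(0.0, weights, centers, sigmas, goal=1.0, z0=0.0)
f_end = forcing_term(TWO_PI - 1e-9, weights, centers, sigmas, goal=1.0, z0=0.0)
```

With equal weights, the normalized kernel sum equals the common weight, so the forcing term starts at w(g − z0) and decays toward zero as the planning state approaches 2π. Unequal weights would shape the forcing function differently across the movement.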

#### Planning System

To coordinate the activation of kernel functions in conjunction with a specific gesture, it is helpful to define a planning system for that gesture. Importantly, the use of a planning system also allows the control system to be abstracted away from linear time dependency. The planning system comprises a first-order dynamical system of the following form:

$$m_i \dot{x}_k = \alpha_x \left( 2\pi - x_k \bmod 2\pi \right). \tag{10}$$

The state of this system is $x_k$, the constant $\alpha_x$ determines the rate of convergence, and $m_i$ is a tract variable-specific inertial parameter, a component of M from above. For present purposes, it is assumed that this system is initiated, at the beginning of a discrete gesture, with a value of 0. The dynamics of the planning system will cause it to subsequently converge to the next multiple of 2π, completing one full cycle<sup>6</sup>.
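The behavior of Equation (10) can be seen by integrating it numerically. The following sketch uses forward Euler integration with assumed values for the rate constant, inertial parameter, and step size. It is meant only to illustrate convergence of the planning state from 0 toward 2π.

```python
import math

TWO_PI = 2.0 * math.pi


def simulate_planning(alpha=5.0, m=1.0, dt=0.001, steps=2000):
    """Integrate Equation (10) from x = 0 and return the state history.

    alpha, m, dt, and steps are illustrative assumptions.
    """
    x = 0.0
    history = [x]
    for _ in range(steps):
        x_dot = alpha * (TWO_PI - x % TWO_PI) / m  # Equation (10)
        x += x_dot * dt                            # forward Euler step
        history.append(x)
    return history


history = simulate_planning()
final_state = history[-1]
```

The state rises quickly at first and then ever more slowly as it nears 2π, which is why kernels that are symmetrical in x appear asymmetrical when plotted against time.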

The planning system serves two purposes. First, the evolution of the system's state serves as the basis for activating the primitive kernels at the appropriate time during that gesture. Second, the planning system can also be used to define the "go" signal, which allows motion associated with a gesture to proceed. For present purposes, we define the "go" signal as a rectangular step function of the planning system state:

$$G_k = \begin{cases} 1 \cdot I, & \text{if } 0 + \varepsilon < x_k \bmod 2\pi < 2\pi - \varepsilon \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

where ε ≈ 0 and k is the gesture. As shown in **Figure 3**, the "go" signal gates the inclusion of a gesture-specific target g into the vector of currently-active targets, similar to its function in the original TD model (see **Figure 1**), as well as the inclusion of other gestural parameters into the dynamical control law. In the present model, the "go" signal also gates the contribution of the forcing function f to the control system.

The "go" signal G is also modulated by an initiation signal, I, which is the result of a higher-level process monitoring an initial planning phase, during which the several (perhaps coupled) planning systems associated with an utterance are allowed to oscillate and converge to a stable temporal coordination pattern (see below for an extended example). Before convergence, the value of I is set to 0 and, after convergence, the value of I becomes 1 and remains at that value until the entire utterance is complete. This change in value has the effect of allowing the movement associated with some gestures to commence, in accordance with the coordination pattern converged upon during the planning period. A similar initiation signal must be present in the planning oscillator formulation of Task Dynamics to drive the switch from planning to action.
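The gating logic of Equation (11), including the role of the initiation signal I, can be sketched as a small function. The function name, the boolean representation of I, and the value of ε are assumptions made for illustration.

```python
import math

TWO_PI = 2.0 * math.pi


def go_signal(x, initiated, eps=1e-3):
    """Rectangular 'go' signal of Equation (11) for one gesture.

    x         : planning system state
    initiated : initiation signal I (False before planning converges, True after)
    eps       : small margin around the cycle boundaries (assumed value)
    """
    phase = x % TWO_PI
    if eps < phase < TWO_PI - eps:
        return 1.0 if initiated else 0.0
    return 0.0


mid_cycle_active = go_signal(math.pi, initiated=True)
mid_cycle_blocked = go_signal(math.pi, initiated=False)
at_start = go_signal(0.0, initiated=True)
at_end = go_signal(TWO_PI, initiated=True)
```

The gesture is active only while its planning state is strictly inside the cycle and the initiation signal has switched on. At the cycle boundaries, or before initiation, the signal is zero.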

Note that the function defined in Equation (11), above, might be most appropriately cast as another kernel function, which would be consistent with the use of kernel functions in the present framework, and which would allow for continuous rise and fall times, consistent with the gestural activations presented in the TD framework (Byrd and Saltzman, 1998; Saltzman, 1999).

#### KERNEL WEIGHT ESTIMATION AND MOVEMENT OPTIMIZATION

Having established the general form of the forcing function and planning system, we move to a discussion of how the weights of the kernels in the forcing function can be assigned. Importantly, this is where optimization is incorporated into the model. While kernel weights could, in theory, be assigned to achieve any goal, in practice we show how the weights can be assigned to minimize some movement cost, following optimal control approaches. We take an agnostic stance on what aspect of movement may be optimized: there is evidence that both kinematic (Flash and Hogan, 1985; Uno et al., 1989; Hoff and Arbib, 1993; Kistemaker et al., 2010, 2014; Mistry et al., 2013) and dynamic properties (Todorov and Jordan, 2002; Todorov, 2004; Izawa et al., 2008; Diedrichsen et al., 2010) of movement may serve this function. In the following sections, we first show how DMPs may be used to minimize a kinematic constraint (trajectory tracking or straightness) as well as a dynamic constraint (effort minimization). We then show how both approaches are able to replicate the behavior of human speakers exposed to velocity-dependent force fields applied to the jaw during speech.

#### Trajectory Tracking Optimization

One approach to assigning the kernel weights is to do so such that some reference trajectory is accurately reproduced. If a specific trajectory shape is desirable, e.g., a straight line (Kistemaker et al., 2010), it is possible to compute a set of weights that will approximate that shape to the extent possible given the number and spacing of kernel functions available. Computing the weights requires an inversion of the control system dynamics, with environmental effects taken into account, in order to find the forcing function, which is what must be approximated by the weighted kernel functions.

<sup>6</sup>The range of planning system values (i.e., 0–2π) differs from the literature on DMPs, which uses a range of 1–0. This change was made so that the planning system values would be compatible with the model of multi-gesture planning presented below. In that model, the temporal coordination of multiple gestures is accomplished through gestural coupling and entrainment during an oscillator phase preceding the initiation of action, similar to Saltzman and Byrd (2000) and later work (Goldstein et al., 2009; Nam et al., 2009). The discrete planning system described here is conceptualized as a single, final oscillation of those planning oscillators that governs movement execution.

Note that the kernels appear asymmetrical in the time domain (i.e., simulation iterations) because they are defined as symmetrical in the planning variable, *x*. The bottom figure shows the evolution of the planning system (red) as it completes one cycle, and converges to a value of 2π.

With a detailed internal model of the dynamics of both the body and the environment, the forcing function can be estimated. This can begin with Equation (7), accounting for the DMP-related forcing term (f<sub>s</sub>), as well as any additional forces (f<sub>p</sub>) due to environmental influences (e.g., perturbations). If one has a reference trajectory measured as a function of time, z<sub>ref</sub>(t), the dynamics can be directly inverted, leading to the estimate:

$$f_s(t) = \frac{M\ddot{z}_{\mathrm{ref}}(t) - K\left(z_{\mathrm{ref}}(t) - g\right) + B\dot{z}_{\mathrm{ref}}(t) + f_p(t)}{g - z_{\mathrm{ref}}(0)} \tag{12}$$

This estimate of the forcing function can be used to form an estimate of the kernel weights. Because the kernels are a function of the planning system (x) and not time (t), this first requires that the planning system be integrated, providing an estimate of the planning system as a function of time, x(t). Finally, linear regression can be used to solve for the weights, given the known shape of the kernel functions, using these time functions. This general procedure was outlined by Hoffmann et al. (2009).
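The regression step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the simulations reported here: the Gaussian kernel shape and width, the closed-form planning trajectory `x_t`, the synthetic forcing samples, and the function names `design_matrix` and `fit_kernel_weights` are all hypothetical. The scaling term (2π − x)/2π follows the form shown in Equation (13); the division by (g − z<sub>ref</sub>(0)) is assumed to have been applied already, as in Equation (12).

```python
import numpy as np

TWO_PI = 2.0 * np.pi


def design_matrix(x, n_kernels=11):
    """Rows are samples of x(t); columns are normalised kernels
    multiplied by the (2*pi - x) / (2*pi) scaling term."""
    centers = np.linspace(0.0, TWO_PI, n_kernels)  # linear spacing in planning space
    width = centers[1] - centers[0]                # assumed kernel width
    # Assumed Gaussian kernels, symmetrical in the planning variable x.
    psi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    scale = (TWO_PI - x) / TWO_PI
    return psi / psi.sum(axis=1, keepdims=True) * scale[:, None]


def fit_kernel_weights(x, f_s, n_kernels=11):
    """Ordinary least-squares estimate of the kernel weights from
    samples of the forcing function f_s(t) and the planning system x(t)."""
    weights, *_ = np.linalg.lstsq(design_matrix(x, n_kernels), f_s, rcond=None)
    return weights


# Synthetic check: generate forcing samples from known weights and refit them.
t = np.linspace(0.0, 2.0, 400)
x_t = TWO_PI * (1.0 - np.exp(-3.0 * t))  # assumed planning trajectory, converging to 2*pi
true_w = np.sin(np.arange(11.0))         # arbitrary illustrative weights
f_s = design_matrix(x_t) @ true_w        # stands in for the output of Equation (12)
est_w = fit_kernel_weights(x_t, f_s)
```

In practice, `f_s` would come from inverting the dynamics along a measured reference trajectory rather than from known weights, and the fit would then only approximate the forcing function to the extent allowed by the number and spacing of kernels.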

#### Minimum Effort Optimization

Many possible approaches exist to optimizing a function based on the system output. One approach is to optimize the accumulated effort associated with a movement by minimizing it. Minimum-effort optimization criteria have a long history in models of motor control (Nelson, 1983; Todorov and Jordan, 2002; Todorov, 2004; Perrier et al., 2005; Patri et al., 2015), and minimal-effort criteria have been suggested to play an important role in speech production (Lindblom, 1990). DMPs afford the necessary flexibility to optimize dynamical systems control in this way. We provide an example of an iterative approach to effort minimization, using a simple method of updating the kernel weights, over many instances of a movement, based on an effort calculation. While more complicated optimization algorithms could be used, this straightforward iterative approach is used here as a proof of concept.

Admitting that, due to stochastic factors, such as those associated with neural activity, no two repetitions of any action will be precisely the same, a small extension of Equation (8) can be made, as follows:

$$f_k(x_k) = \frac{\sum_{j=1}^{n} \Psi_j(x)\left[w_j + \varepsilon\,\mathcal{N}(\mu, \sigma^2)\right]}{\sum_{j=1}^{n} \Psi_j(x)} \left(\frac{2\pi - x}{2\pi}\right)(g_k - z_0), \tag{13}$$

for some small value of ε, and where N(µ, σ²) is the normal distribution with mean µ and variance σ². Deviations in the controlled forces implied by this change will likely result in deviations in the overall effort associated with an action, defined as the integral of the squared control forces τ<sub>i,j,k</sub> over the entire ith instance of gesture k, summed over all p mobility space dimensions:

$$e_i = \sum_{j=1}^{p} \int_{x=0}^{2\pi} \tau_{i,j,k}^2 \, dx \tag{14}$$

If e<sub>i</sub> is smaller than the smallest value of e observed prior to iteration i, then the value of εN(µ, σ²) from the current iteration is added to the kernel weights in Equation (13). Any instance i that does not reach the target is considered a failed trial, and is not considered further. This, or a similar constraint on target achievement, is necessary because the "optimal" movement, from this perspective, would otherwise be to remain motionless. Similar constraints have been used in existing optimal control models of speech (Perrier et al., 2005; Patri et al., 2015).

These small deviations in weight, when summed over the course of many trials, will be associated with a change in the overall energy expenditure associated with the gestures, and with the overall trajectory of the jaw in task and mobility spaces.
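The iterative scheme can be illustrated with a one-dimensional toy system. The sketch below is only a minimal example under stated assumptions and is not the jaw model used in the simulations: the unit mass, the critically damped point attractor, the first-order planning system, the Gaussian kernels, the value of ε, and the names `forcing`, `run_trial` and `minimise_effort` are all hypothetical. It keeps the three ingredients described above: Gaussian noise added to the kernel weights, effort computed as squared control force integrated over the planning variable, and rejection of trials that fail to reach the target.

```python
import math
import random

TWO_PI = 2.0 * math.pi
N_KERNELS = 11
CENTERS = [TWO_PI * j / (N_KERNELS - 1) for j in range(N_KERNELS)]
WIDTH = TWO_PI / (N_KERNELS - 1)  # assumed kernel width


def forcing(x, weights):
    """Normalised kernel mixture scaled by (2*pi - x) / (2*pi)."""
    psi = [math.exp(-0.5 * ((x - c) / WIDTH) ** 2) for c in CENTERS]
    mix = sum(p * w for p, w in zip(psi, weights)) / sum(psi)
    return mix * (TWO_PI - x) / TWO_PI


def run_trial(weights, g=1.0, z0=0.0, k=100.0, dt=0.002, n_steps=750):
    """Simulate one movement; return (effort, target_reached)."""
    b = 2.0 * math.sqrt(k)  # critical damping for a unit mass
    z, dz, x, effort = z0, 0.0, 0.0, 0.0
    for _ in range(n_steps):
        dx = 3.0 * (TWO_PI - x) * dt  # assumed planning system, converging to 2*pi
        tau = k * (g - z) - b * dz + forcing(x, weights) * (g - z0)
        effort += tau ** 2 * dx       # squared control force integrated over x
        dz += tau * dt                # unit mass: acceleration equals force
        z += dz * dt
        x += dx
    return effort, abs(z - g) < 1e-3


def minimise_effort(iterations=200, eps=2.0, seed=0):
    """Keep a noisy weight update only if it lowers effort and reaches the target."""
    rng = random.Random(seed)
    weights = [0.0] * N_KERNELS
    initial, reached = run_trial(weights)
    best = initial if reached else math.inf
    for _ in range(iterations):
        trial = [w + eps * rng.gauss(0.0, 1.0) for w in weights]
        effort, reached = run_trial(trial)
        if reached and effort < best:  # failed trials are discarded
            weights, best = trial, effort
    return weights, initial, best


weights, initial_effort, best_effort = minimise_effort()
print(f"effort before: {initial_effort:.3f}, after: {best_effort:.3f}")
```

Because an update is retained only when it lowers the effort, the stored effort can never increase across iterations; how far it decreases depends on the assumed noise scale and number of iterations.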

#### Example: Jaw Control With Perturbation Adaptation

In order to provide an illustration of these optimization concepts in the domain of speech motor control, we present an example using a greatly simplified model of the speech motor system. The example is inspired by the experiments of Tremblay et al. (2003), in which subjects were asked to speak the utterance "see-at" while a velocity-dependent force field was applied to the jaw that induced jaw protrusion. Initially, this caused increased curvature away from the relatively straight-line jaw movements produced at baseline. After a period of exposure, this curvature was reduced and the jaw movements became similar to the movements produced in the absence of the force field.

We model jaw movements as a two degree of freedom system in terms of elevation and protrusion. The dimensions of elevation and protrusion align relatively well with the biomechanical forces applied to the human jaw by orofacial musculature in the relatively restricted range of jaw movements used for speech. They therefore represent a reasonable, if simplified, definition of the mobility space. Making the assumption that the tongue is passively moving in conjunction with the jaw, in this narrow experimental situation, it is also possible to define vocal tract constrictions in the pharyngeal and the palatal regions as higher-level descriptions of the articulatory speech tasks. This conceptualization of jaw movements is shown in **Figure 4**.

In order to model the relationship between the task (speech gesture) and mobility (jaw movement) spaces, we must ascertain the kinematic relationships between the two. In our simulations, the direct kinematic relationships between the tract variables and mobility variables are:

$$z_{\mathrm{TBCD-phar}} = \sqrt{\left(\phi_{\mathrm{TC-pro}} - \mathrm{TBCL}_{\mathrm{phar,pro}}\right)^2 + \left(\phi_{\mathrm{TC-elev}} - \mathrm{TBCL}_{\mathrm{phar,elev}}\right)^2} - r_t \tag{15}$$

$$z_{\mathrm{TBCD-pal}} = \sqrt{\left(\phi_{\mathrm{TC-pro}} - \mathrm{TBCL}_{\mathrm{pal,pro}}\right)^2 + \left(\phi_{\mathrm{TC-elev}} - \mathrm{TBCL}_{\mathrm{pal,elev}}\right)^2} - r_t \tag{16}$$

where r<sub>t</sub> represents the radius of the tongue body, as can be seen in **Figure 4**. Movement of the tongue center (φ<sub>TC</sub>), in this example, is affected only by movement of the jaw, as measured at a jaw reference (φ<sub>JR</sub>) point on the mandible. In other words, the tongue is assumed to move passively with the jaw, such that φ<sub>TC−pro</sub> = φ<sub>JR−pro</sub> − δ<sub>pro</sub> and φ<sub>TC−elev</sub> = φ<sub>JR−elev</sub> − δ<sub>elev</sub>, where δ<sub>pro</sub> and δ<sub>elev</sub> represent the horizontal and vertical offsets, respectively, of φ<sub>TC</sub> from φ<sub>JR</sub>, and are constrained to be constants (see middle panel of **Figure 4**). Future examples could incorporate independent actuation of the tongue by defining δ<sub>pro</sub> and δ<sub>elev</sub> as mobility variables.

The mobility state variable φ is considered to represent the position of the tongue center and jaw reference points in head-related coordinates described by protrusion (i.e., horizontal position relative to the head) and elevation (i.e., vertical position relative to the head). The variables zTBCD−phar and zTBCD−pal are the constriction degree variables for the tongue body, closely related to the Tongue Body Constriction Degree (TBCD) tract variable described in Task Dynamics (e.g., Saltzman and Munhall, 1989).

For the purposes of illustration, the present example maintains two constriction degree variables, each with its own target value. The constriction degree values are defined with respect to corresponding tongue body constriction location (TBCL) targets in the pharyngeal (phar) and palatal (pal) regions of the vocal tract, and represent the Euclidean distance between tongue body center and the given constriction locations minus the radius of the tongue body.

The forward dynamics of the jaw's movement are modeled simply, according to the following equations:

$$\ddot{\phi}_{\mathrm{JR-pro}} = \left(\tau_{\mathrm{pro}} + f_p\right) / m_{\mathrm{jaw}} \tag{17}$$

$$\ddot{\phi}_{\mathrm{JR-elev}} = \tau_{\mathrm{elev}} / m_{\mathrm{jaw}} \tag{18}$$

where m<sub>jaw</sub> is the mass of the jaw, and τ<sub>pro</sub> and τ<sub>elev</sub> are the control forces applied to the jaw. To model the velocity-dependent force field, a force (f<sub>p</sub>) is used to perturb the jaw as it moves:

$$f_p = b\,\dot{\phi}_{\mathrm{JR-elev}}, \tag{19}$$

where b is a constant.
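The forward dynamics in Equations (17) to (19) can be sketched as a single integration step. This is an illustrative example only: the semi-implicit Euler scheme, the time step, the constant elevation force in the usage loop, and the function name `step_jaw` are assumptions, while the values b = 0.07 and m<sub>jaw</sub> = 1 are those reported for the simulations in **Figure 5**.

```python
def step_jaw(state, tau_pro, tau_elev, dt=0.001, m_jaw=1.0, b=0.07):
    """Advance the jaw reference point by one time step.

    state is (protrusion, elevation, protrusion velocity, elevation velocity).
    """
    pro, elev, v_pro, v_elev = state
    f_p = b * v_elev                 # velocity-dependent force field, Equation (19)
    a_pro = (tau_pro + f_p) / m_jaw  # Equation (17)
    a_elev = tau_elev / m_jaw        # Equation (18)
    v_pro += a_pro * dt
    v_elev += a_elev * dt
    return (pro + v_pro * dt, elev + v_elev * dt, v_pro, v_elev)


# With no protrusion control force, movement in elevation alone
# is enough to push the jaw along the protrusion dimension.
state = (0.0, 0.0, 0.0, 0.0)
for _ in range(1000):
    state = step_jaw(state, tau_pro=0.0, tau_elev=1.0)
```

Setting `b=0.0` removes the field, in which case the protrusion coordinate stays at its starting value under the same control forces.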

The Jacobian, J, is the matrix of first-order partial derivatives of z with respect to φ<sub>JR</sub>, given in Equation (20).
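The kinematic map in Equations (15) and (16) and its Jacobian can be sketched as follows. The constriction locations, tongue radius, and tongue-jaw offsets below are hypothetical placeholder values, and the function names are illustrative. Because the offsets δ<sub>pro</sub> and δ<sub>elev</sub> are constants, derivatives with respect to the jaw reference point equal derivatives with respect to the tongue center.

```python
import math

# Hypothetical geometry, in arbitrary units: (protrusion, elevation) pairs.
TBCL = {"phar": (-1.0, 0.0), "pal": (0.0, 1.0)}
R_T = 0.2           # tongue body radius
DELTA = (0.1, 0.1)  # constant offsets of the tongue center from the jaw reference


def tongue_center(phi_jr):
    """The tongue moves passively with the jaw."""
    return (phi_jr[0] - DELTA[0], phi_jr[1] - DELTA[1])


def constriction_degrees(phi_jr):
    """Return [z_TBCD-phar, z_TBCD-pal]: distance to each TBCL minus r_t."""
    tc = tongue_center(phi_jr)
    return [
        math.hypot(tc[0] - TBCL[region][0], tc[1] - TBCL[region][1]) - R_T
        for region in ("phar", "pal")
    ]


def jacobian(phi_jr):
    """2 x 2 matrix of partial derivatives of z with respect to phi_JR."""
    tc = tongue_center(phi_jr)
    rows = []
    for region in ("phar", "pal"):
        d_pro = tc[0] - TBCL[region][0]
        d_elev = tc[1] - TBCL[region][1]
        dist = math.hypot(d_pro, d_elev)
        rows.append([d_pro / dist, d_elev / dist])
    return rows


phi = (0.3, -0.2)
J = jacobian(phi)
```

Each row of `J` is the unit vector pointing from the constriction location to the tongue center, so moving the jaw along that direction changes the corresponding constriction degree at a rate of one unit per unit of jaw displacement.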

An example of adaptation to jaw perturbation via optimization of both trajectory tracking and effort minimization is shown in **Figure 5**. For these simulations, we generate trajectories from /i/ to /ae/ based on the "see-at" trajectories studied in Tremblay et al. (2003). We assume /i/ has as a target a narrow palatal constriction of 0.05 arbitrary units while /ae/ has as a target a wide pharyngeal constriction of 0.3 arbitrary units (Browman et al., 2006). The trajectories generated from both optimization approaches are similar to the trajectories produced after adaptation to the velocity-dependent force field in Tremblay et al. (2003). Both approaches result in a return to fairly straight trajectories in both task and mobility space, which are very similar to the baseline condition. Interestingly, both approaches result in a small initial over-correction for the force field. While we hesitate to read too much into this result given the highly simplified model used in these simulations, this pattern matches the results seen in arm reaching (Izawa et al., 2008), where the initial over-correction has been shown to be the optimal solution to minimize motor effort. A hint of similar patterns for jaw movements can be seen in the data shown in Tremblay et al. (2008), though this is not always seen in the example data shown in these studies (Tremblay et al., 2003; Lametti et al., 2012). Such differences could potentially be attributed to cross-speaker differences in uncertainty about the force field dynamics (Izawa et al., 2008).

## COORDINATION OF MULTIPLE GESTURES

One of the benefits of the DMP approach is the use of a separate planning system that governs the activation of the forcing function. We have shown how this same signal can also be used as a "go" cue to gate movement. In this latter sense, the planning system functions in an analogous way to the planning oscillators in Task Dynamics (Saltzman and Byrd, 2000; Saltzman et al., 2008). These planning oscillators, which are themselves dynamical systems, serve to gate the activation of gestures in that model. For example, a gesture might be activated when the phase of the planning oscillator reaches 0°, and be deactivated at a later phase (e.g., 270°). Essentially, we have replaced the planning oscillator from Task Dynamics with our planning system, which controls both the activation of a gesture as well as the evolution of the gesture's associated forcing function.

However, one of the key benefits of the planning oscillators in Task Dynamics has been their additional use in modeling the coordination between separate gestures (Browman and Goldstein, 2000; Nam and Saltzman, 2003; Goldstein et al., 2007, 2009; Saltzman et al., 2008). If we are to replace the planning oscillator with our proposed planning system, we need to ensure that the planning system can also account for this inter-gestural coordination. In order to do this, we will need to slightly amend the planning system dynamics presented earlier.

We start from the assumption that each gesture is associated with its own planning system. However, a planning system with the dynamics in Equation (11), which we have suggested controls the evolution of the control system during movement, is not a good model for planning, since it converges from its initial value of 0 to a multiple of 2π without repeating. While this behavior is useful to control the activation and time course of the control system, it is less useful for replicating the phase coordination between different gestures.

$$J(\phi) = \begin{bmatrix} \dfrac{-\left(\mathrm{TBCL}_{\mathrm{phar,pro}} - \phi_{\mathrm{TC-pro}}\right)}{\sqrt{\left(\mathrm{TBCL}_{\mathrm{phar,pro}} - \phi_{\mathrm{TC-pro}}\right)^2 + \left(\mathrm{TBCL}_{\mathrm{phar,elev}} - \phi_{\mathrm{TC-elev}}\right)^2}} & \dfrac{-\left(\mathrm{TBCL}_{\mathrm{phar,elev}} - \phi_{\mathrm{TC-elev}}\right)}{\sqrt{\left(\mathrm{TBCL}_{\mathrm{phar,pro}} - \phi_{\mathrm{TC-pro}}\right)^2 + \left(\mathrm{TBCL}_{\mathrm{phar,elev}} - \phi_{\mathrm{TC-elev}}\right)^2}} \\[2ex] \dfrac{-\left(\mathrm{TBCL}_{\mathrm{pal,pro}} - \phi_{\mathrm{TC-pro}}\right)}{\sqrt{\left(\mathrm{TBCL}_{\mathrm{pal,pro}} - \phi_{\mathrm{TC-pro}}\right)^2 + \left(\mathrm{TBCL}_{\mathrm{pal,elev}} - \phi_{\mathrm{TC-elev}}\right)^2}} & \dfrac{-\left(\mathrm{TBCL}_{\mathrm{pal,elev}} - \phi_{\mathrm{TC-elev}}\right)}{\sqrt{\left(\mathrm{TBCL}_{\mathrm{pal,pro}} - \phi_{\mathrm{TC-pro}}\right)^2 + \left(\mathrm{TBCL}_{\mathrm{pal,elev}} - \phi_{\mathrm{TC-elev}}\right)^2}} \end{bmatrix} \tag{20}$$

head-related coordinate system (i.e., protrusion and elevation), mobility variables (i.e., jaw and tongue body position) and task variables (tongue body constriction degree in the pharynx and near the palate). The tongue is assumed to move passively with the jaw. *TBCLphar* and *TBCLpal* are fixed points in task space that are used, in conjunction with corresponding constriction degree targets, to shape motion patterns in the tongue body constriction variables, *zTBCD*−*phar* and *zTBCD*−*pal*, that create the desired palatal or pharyngeal constrictions.

FIGURE 5 | Kinematics of the jaw in mobility (jaw protrusion vs. jaw elevation) and task (pharyngeal constriction degree vs. palatal constriction degree) spaces, showing three different DMP kernel weightings (11 kernels used, with linear spacing in planning space). Starting position is indicated by a black "x," and the target is indicated by a red bullseye, with the trajectory shown as a red dashed line. Unperturbed jaw motion would lead to a straight-line trajectory from the starting point to the target. With a perturbation of the type described in Equation (19) (*b* = 0.07, *mjaw* = 1), the trajectory in both mobility and task space deviates substantially from a straight line. Both optimization schemes lead to approximately straight trajectories, as shown. In the case of trajectory optimization, this is because the optimization is explicitly seeking to reproduce a straight line. In the case of effort minimization, the straight line trajectory emerges as a consequence of lowering the overall control forces applied to the jaw (200 iterations, ε = 0.001).

During the planning phase, we assume that the planning systems are, instead, rhythmic dynamical systems with constant phase velocity. It is then possible to allow phase coupling between planning systems in the following way:

$$m_i \dot{x}_k = \alpha_x + C_{kl} \tag{21}$$

$$\text{where } C_{kl} = \alpha_{kl}\sin\left([x_k - x_l] + \varphi_{kl}\right), \tag{22}$$

for two gestures k and l. The variable ϕ<sub>kl</sub> denotes the target relative phase between x<sub>l</sub> and x<sub>k</sub>, where their relative phase is defined as x<sub>k</sub> − x<sub>l</sub>. Note that this sine-based coupling term is typical of many papers in the coordination dynamics literature (Haken et al., 1985; Rand et al., 1988; Schmidt et al., 1991, 1993), but differs from the linear coupling term typical of the published literature on DMPs. A linear term may be viewed as a small-angle approximation of sine-based coupling.

Additional coupling terms can be added for additional oscillators that may also be coupled in a multi-way coupling unit. During planning, the planning systems are initiated with a random relative phase. The systems are then allowed to converge to a stable phase relationship. Convergence can be defined in several ways, but is presently defined as:

$$\sum_{i}\sum_{j} \dot{C}_{i,j}^{2} < \delta, \tag{23}$$

where Ċ<sub>i,j</sub> is the derivative, with respect to time, of the coupling term between gestures i and j, and δ is the convergence parameter.
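A minimal two-oscillator sketch of Equations (21) to (23) is given below. Several choices are assumptions made for illustration only: the left-hand scaling factor in Equation (21) is set to 1, the coupling is symmetric with ϕ<sub>lk</sub> = −ϕ<sub>kl</sub>, the coupling strength is given a negative value so that the antiphase target is attracting under the sign convention of Equation (22) as written, and the initial phases are fixed rather than random so that the run is reproducible. The function name `simulate_planning` and all numerical values are hypothetical.

```python
import math


def simulate_planning(phi_kl=math.pi, coupling=-0.5, alpha_x=2.0 * math.pi,
                      x_k=0.3, x_l=2.5, dt=0.001, delta=1e-10,
                      max_steps=200_000):
    """Euler-integrate two coupled planning oscillators until the
    convergence criterion of Equation (23) is met.

    Returns the relative phase x_k - x_l (wrapped to [0, 2*pi)) and the
    number of steps taken.
    """
    for step in range(max_steps):
        # Coupling terms, Equation (22), with phi_lk = -phi_kl (assumed).
        c_kl = coupling * math.sin((x_k - x_l) + phi_kl)
        c_lk = coupling * math.sin((x_l - x_k) - phi_kl)
        # Phase velocities, Equation (21), with unit scaling (assumed).
        dx_k = alpha_x + c_kl
        dx_l = alpha_x + c_lk
        # Time derivatives of the coupling terms, by the chain rule.
        dc_kl = coupling * math.cos((x_k - x_l) + phi_kl) * (dx_k - dx_l)
        dc_lk = coupling * math.cos((x_l - x_k) - phi_kl) * (dx_l - dx_k)
        if dc_kl ** 2 + dc_lk ** 2 < delta:  # Equation (23)
            break
        x_k += dt * dx_k
        x_l += dt * dx_l
    return (x_k - x_l) % (2.0 * math.pi), step


relative_phase, steps = simulate_planning()
```

Under these assumptions the two phases advance together at the common rate `alpha_x` once the coupling terms vanish, and the relative phase settles at π. Note that the derivative-based criterion can also be satisfied momentarily when the cosine factor passes through zero, so the starting phases here were chosen so that the relative phase approaches its final value without crossing such a point.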

After convergence, and upon initiation (triggered at x<sub>1</sub> = 0), the rhythmic planning systems switch to discrete systems, as in Equation (10). Conceptually, the discrete dynamics cause the oscillating planning system to complete a single, final cycle. This transition from rhythmic to discrete dynamics allows us to retain the benefits of both rhythmical planning systems, such as stable relative phasing, on the one hand, as well as those of a discrete system for movement control on the other. These include intuitive activation gating, where the planning system triggers movement from its initial value until it converges to its stable final value (cf. the relatively arbitrary phases for movement gating in the Task Dynamics planning oscillator model), as well as ensuring that the forcing function terminates at the end of the movement (a rhythmic system, such as an underdamped oscillator, would repeat the forcing function). Lastly, the discrete planning system effectively turns itself off when it reaches its convergence value, while planning oscillators would continue to cycle indefinitely.

This model effectively suggests that the planning and movement execution, while both governed by dynamical systems, exhibit different dynamical patterns. Interestingly, this hypothesis receives some support from intracranial recordings made in non-human primates during reaching movements (Churchland et al., 2006, 2010; Shenoy et al., 2013). These studies have shown that both movement planning and movement execution exhibit reliable patterns of neural activity consistent with the evolution of a dynamical system, but that the character of these dynamical patterns is qualitatively different between the two phases, and that this transition can be characterized as a transition between two different network dynamics (Shenoy et al., 2013).

As a proof of concept, an example of planning system oscillation, coupling and initiation can be seen in **Figure 6**. In this simulation, three gestures (C1, C2, and V) are coupled together to form a syllable with a complex onset. C1 and C2 are both coupled in phase with V (ϕ<sub>C1−V</sub> = 0, ϕ<sub>C2−V</sub> = 0) but antiphase with each other (ϕ<sub>C1−C2</sub> = −π, ϕ<sub>C2−C1</sub> = π). During planning, each planning system oscillates with a stable frequency (shown with phase unwrapped in **Figure 6**). During planning, the three gestures settle into a stable relative phase relationship due to their coupling, with C1 slightly preceding V, which in turn precedes C2. When these relationships converge to become stable, the planning systems switch from rhythmic to discrete dynamics when they next reach a phase of 0, triggering initiation of their associated gestures. During movement execution, the discrete dynamics drive each planning system asymptotically toward its final value, 2π. This example demonstrates the capability of this framework to reproduce the well-studied c-center effect (Browman and Goldstein, 2000; Goldstein et al., 2009), where initial consonants in an onset cluster begin before the onset of the syllable's vowel, which in turn precedes the onset of the final consonant in the onset cluster. For example, in the word /spa/, tongue tip movement for the [s] begins before tongue body movement for [a], which in turn begins before lip movement for [p]. The ability of the Task Dynamics planning oscillator model to derive c-center and other patterns of intergestural coordination is a strength of that model, which is maintained in the proposed approach.

# DISCUSSION

We have presented a framework for incorporating principles of optimal control into a dynamical systems framework for modeling the speech motor control system. This was accomplished through the addition of a forcing function to the second-order dynamical system previously hypothesized to regulate speech motor control in the Task Dynamics model. Specifically, this forcing function took a form consistent with the Dynamic Movement Primitives (DMPs) framework, which provides the ability to flexibly modulate the dynamics of the control system. We then showed how the integration of DMPs into the control system allows us to model the observed adaptation to velocity-dependent force fields applied to the jaw during speech production. Importantly, this framework is flexible enough to incorporate a wide variety of features to be optimized and optimization algorithms. We showed how two different approaches can result in similar behavior, by optimizing over dynamic (total force) or kinematic (straight line) criteria. Lastly, we showed how the planning system which governs the temporal unfolding of the control system can also account for the temporal gating of individual speech gestures as well as the temporal organization between separate gestures (such as the c-center effect).

The DMP approach outlined here provides a coherent way to account for speech behavior that is otherwise difficult to reconcile with the dynamical systems approach to speech motor control, while retaining many of the benefits of that approach that have been developed over a long history of research. Importantly, the DMP model may have applications outside the narrow case of jaw perturbations explored here.

First, DMPs can provide a way to model competing constraints on the speech production system, such as the balance between articulatory effort on one hand and target achievement or intelligibility on the other (Lindblom, 1990). Importantly, models based on optimal trajectory control have shown that by changing the relative costs associated with these factors, it is possible to produce speech with varying degrees of target undershoot (Perrier et al., 2005; Patri et al., 2015). These changes may plausibly underlie articulatory changes associated with different speaking conditions and contexts (Lindblom, 1990; Bradlow, 2002). Undershoot can also be modeled in a dynamical systems framework by decreasing the duration of a gesture's activation (Browman and Goldstein, 1992; Parrell, 2011; Parrell and Narayanan, 2018) or by changing the other parameters of the dynamical controller (e.g., mass, damping, and stiffness). However, there is to date no principled way to relate changes in the control system parameters to the hypothesized constraints of effort and intelligibility. DMPs provide this bridge, and could provide a principled way of modeling articulatory changes associated with different speech registers or conditions.

The combination of dynamical systems control with movement optimization via DMPs may also be useful in a number of other areas of speech motor control. For example, it is well-established that articulatory kinematics do not reach stable, adult-like patterns until at least late adolescence (Walsh and Smith, 2002). Such protracted maturation of kinematic patterns could potentially be related to the development of stable forcing functions in the dynamical controller. Additionally, we have shown that DMPs are able to incorporate tracking of explicit trajectories into a dynamical systems framework. Targets with an explicit temporal component have been suggested to be critical for speech (Guenther, 2016). DMPs could thus potentially bridge this seemingly otherwise intractable divide between trajectory-based and point-target-based theories. Indeed, it seems highly probable that the auditory target trajectories in DIVA (Guenther, 2016) could be reformulated as dynamical systems with DMPs in an auditory task space. Moreover, DMPs also provide a way to produce trajectories without an explicit time-dependency, which allows them to be more flexible.

Another possible use of our DMP model is in explaining temporo-spatial variation in production across different words. For example, words that are produced more frequently are typically more reduced than less frequent words (Munson and Solomon, 2004). This result is compatible with a stochastic optimization process driven by reinforcement from a listener. For example, the production-driven criterion for a "failed trial," discussed surrounding Equation (14) (i.e., whether the target was achieved), could be replaced by a reinforcement signal provided by a listener (i.e., whether they understand what was said). More frequent words would be more likely to be perceived correctly, and so more likely to receive a positive reward signal for any given amount of reduction. Moreover, given a stochastic optimization, more frequent words provide more opportunities for learning, which would lead to a more optimal production. Separately from word frequency, neighborhood density has also been shown to be related to reduction, such that words with more lexical neighbors exhibit less reduction than words of similar frequency with fewer lexical neighbors (Munson and Solomon, 2004). Again, such a pattern could plausibly be generated by a DMP controller, as words with more lexical competition would be more frequently confused by a listener, leading to less positive reinforcement of more reduced productions than in cases where there is little lexical competition. Importantly, such a system would implicitly adjust production based on a history of reinforcement, without the need to explicitly include an estimate of a listener's perceptual system (cf. Lindblom, 1990; Wright, 2004). Alternatively, a more complex criterion that quantifies the degree of articulatory undershoot (Simko and Cummins, 2011) would be able to drive similar behaviors.

The scheme outlined above would imply that different words may be associated with their own (set of) forcing functions. To this point, we have avoided a discussion of the scope of the DMP forcing functions. However, it seems likely that they are associated with higher-level production units, such as words. The evidence for this, aside from the potential utility of the DMPs to model differential effects of word frequency and lexical neighborhood density, comes primarily from studies that have examined the generalizability of learned alterations to speech motor behavior. For example, participants who learned to adapt to a velocity-dependent force field applied to the jaw failed to transfer this learning to untrained words, even when the patterns of jaw movement were very similar (Tremblay et al., 2008). However, other studies using auditory, rather than force-field, perturbations have shown that learning is somewhat generalizable (Rochet-Capellan et al., 2012). While force-field and sensory-perturbation learning are likely driven by different processes (Krakauer et al., 1999), this suggests that learning or optimization may be occurring at multiple levels of the production hierarchy (gestures, phonemes, syllables, words, etc.). It remains a question for future work to determine the precise nature of how and where optimization may play a role in speech production. However, the notion that words or syllables may be used as the units of speech planning, at least in some cases, is a common idea in many models (Levelt et al., 1999; Kröger et al., 2009; Guenther, 2016).

The DMP approach used here shares some conceptual similarities with the Embodied Task Dynamics model (Simko and Cummins, 2010a,b, 2011). Both approaches augment the basic Task Dynamics model in order to allow for optimization of a cost function. However, the present approach differs from the Embodied Task Dynamics model in a number of critical ways and addresses complementary phenomena. We optimize a forcing function that affects the production of individual gestures at the task level, but that does not change the timing of gestural activation. We show how optimization can be accomplished on the basis of minimizing effort/force, or through the use of an inverse model. The resulting model is shown to account for adaptation to a force field applied to the jaw. On the other hand, the Embodied Task Dynamics model optimizes the stiffness and activation timing of gestures. This optimization is done on the basis of a cost function that includes terms for articulatory effort, target achievement, and total movement duration. This model has been successful in replicating some important aspects of interarticulator timing and articulatory undershoot. A more thorough comparison of the two approaches is needed in future work.

We have demonstrated that DMPs provide a method for modeling adaptation to altered system dynamics introduced by a novel force field. Such adaptation has alternately been viewed as either the generation of an "internal model" of the system dynamics that can be used to counteract the external forces (Shadmehr and Mussa-Ivaldi, 1994; Krakauer et al., 1999) or as a process of reoptimization of movement to achieve maximum performance (Izawa et al., 2008). We have shown that DMPs are compatible with both views. Given the demonstrated flexibility of the DMP approach (Schaal et al., 2007; Ijspeert et al., 2013), it is likely that it could also be used to adapt to more complex force field dynamics, including time-varying dynamics. However, adaptation in motor performance occurs not only in the presence of novel dynamics but also when alterations are introduced to movement kinematics, such as visuomotor rotations for reaching (Cunningham, 1989; Kagerer et al., 1997; Krakauer et al., 1999; Mazzoni and Krakauer, 2006) or shifts to vowel formants or pitch for speech (Houde and Jordan, 1998; Purcell and Munhall, 2006). Importantly, dynamic and kinematic adaptation have been suggested to be separate processes in human behavior (Krakauer et al., 1999; Rabe et al., 2009; Donchin et al., 2012). Adaptation to kinematic perturbations is typically thought to occur through either changes to "forward models" that predict the sensory consequences of movement (Mazzoni and Krakauer, 2006; Tseng et al., 2007; Shadmehr and Krakauer, 2008) and/or changes to "inverse models" that associate a goal with the motor commands necessary to achieve that goal (Kawato and Gomi, 1992; Wolpert et al., 1995; Wolpert and Kawato, 1998; Kawato, 1999). In our view, it is unlikely that DMPs would provide a satisfactory model for these types of kinematic adaptation. 
From a theoretical standpoint, neither a forward model (action-sensory mapping) nor inverse model of this type (goal-action mapping) is well-captured by DMPs. Empirically, a critical characteristic of adaptation to kinematic perturbations is that the final, adapted movement remains distinct from the initial, unperturbed movement. This is reflected in a change in reach angle in visuomotor rotation or a change in formant frequencies/vocal tract shape in auditory perturbations. While DMPs are well-suited to model arbitrary trajectories, they retain the (desirable) equifinality of critically-damped, second order dynamical systems. In our view, this means they are likely unable to cause the types of changes seen in kinematic adaptation.

In sum, combining optimal control with dynamical systems in speech motor control holds promise to provide a unified account of a number of different speech behaviors. We have shown that incorporating a tunable forcing function based on Dynamic Movement Primitives provides a way to combine these two separate approaches. Future work is needed to incorporate DMPs into a more plausible model of the speech motor system beyond the simplified jaw system in the current simulations, as well as to test the limits of this approach to explain different aspects of speech behavior.


### AUTHOR CONTRIBUTIONS

BP and AL conceived the project, developed the theory, implemented the computational models, and wrote the paper.

# FUNDING

Work supported by NIH Grant R01DC017696 and by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001.



**Disclaimer:** Approved for public release. Distribution is unlimited. This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Parrell and Lammert. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Modeling Sensory Preference in Speech Motor Planning: A Bayesian Modeling Framework

Jean-François Patri<sup>1,2,3</sup>, Julien Diard<sup>2</sup> and Pascal Perrier<sup>1</sup>\*

<sup>1</sup> Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France, <sup>2</sup> Université Grenoble Alpes, CNRS, LPNC, Grenoble, France, <sup>3</sup> Cognition Motion and Neuroscience Unit, Fondazione Istituto Italiano di Tecnologia, Genova, Italy

Experimental studies of speech production involving compensations for auditory and somatosensory perturbations and adaptation after training suggest that both types of sensory information are considered to plan and monitor speech production. Interestingly, individual sensory preferences have been observed in this context: subjects who compensate less for somatosensory perturbations compensate more for auditory perturbations, and vice versa. We propose to integrate this sensory preference phenomenon in a model of speech motor planning using a probabilistic model in which speech units are characterized both in auditory and somatosensory terms. Sensory preference is implemented in the model according to two approaches. In the first approach, which is often used in motor control models accounting for sensory integration, sensory preference is attributed to the relative precision (i.e., inverse of the variance) of the sensory characterization of the speech motor goals associated with phonological units (which are phonemes in the context of this paper). In the second, "more original" variant, sensory preference is implemented by modulating the sensitivity of the comparison between the predicted sensory consequences of motor commands and the sensory characterizations of the phonemes. We present simulation results using these two variants, in the context of the adaptation to an auditory perturbation, implemented in a 2-dimensional biomechanical model of the tongue. Simulation results show that both variants lead to qualitatively similar results. Distinguishing them experimentally would require precise analyses of partial compensation patterns. However, the second proposed variant implements sensory preference without changing the sensory characterizations of the phonemes. This dissociates sensory preference and sensory characterizations of the phonemes, and makes the account of sensory preference more flexible. 
Indeed, in the second variant the sensory characterizations of the phonemes can remain stable, when sensory preference varies as a response to cognitive or attentional control. This opens new perspectives for capturing speech production variability associated with aging, disorders and speaking conditions.

Keywords: speech motor control, Bayesian modeling, sensory integration, sensory preference, speech motor goals

#### Edited by:

Pascal van Lieshout, University of Toronto, Canada

#### Reviewed by:

Bernd J. Kröger, RWTH Aachen University, Germany

Satrajit S. Ghosh, Massachusetts Institute of Technology, United States

> \*Correspondence: Pascal Perrier pascal.perrier@grenoble-inp.fr

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 30 April 2019 Accepted: 01 October 2019 Published: 25 October 2019

#### Citation:

Patri J-F, Diard J and Perrier P (2019) Modeling Sensory Preference in Speech Motor Planning: A Bayesian Modeling Framework. Front. Psychol. 10:2339. doi: 10.3389/fpsyg.2019.02339

# 1. INTRODUCTION

The recent history of research that investigates the links between phonology, production and perception of speech has been marked by vigorous exchanges between proponents of purely acoustic/auditory theories (Stevens, 1972; Stevens and Blumstein, 1978; Blumstein and Stevens, 1979; Lindblom, 1990; Sussman et al., 1991), for whom the physical correlates of phonological units would be exclusively in the acoustic domain, and proponents of theories that instead placed these correlates primarily in the articulatory/somatosensory domain (Fowler, 1986; Saltzman, 1986). These debates were all the more vigorous because they were related to important theoretical issues around phonological theories (Chomsky and Halle, 1968; Clements, 1985; Keyser and Stevens, 1994 vs. Browman and Goldstein, 1989, 1992; Goldstein and Fowler, 2003) and cognitive theories of perception (Diehl and Kluender, 1989 vs. Gibson, 1979 vs. Liberman et al., 1967).

As a consequence, models that were designed to simulate and investigate the process of articulation and sound production from the specification of phonological sequences (we will call these models Speech Production Models henceforth) were split into two main categories: models in which the goals of the speech task were specified in the articulatory domain (Coker, 1976; The Task Dynamics Model: Kelso et al., 1986; Saltzman and Munhall, 1989; The DIVA Model Version 1: Guenther, 1995; Kröger et al., 1995; The C/D model: Fujimura, 2000), and models in which the goals were specified in the acoustic domain (The DIVA Model Version 2: Guenther et al., 1998; GEPPETO: Perrier et al., 2005).

A number of experimental studies have been carried out in order to find clear support for one or the other of these theories. The majority of them relied on perturbation paradigms, in which one of the modalities, either acoustic or articulatory, was perturbed. Patterns of behavioral adaptation to perturbation of the jaw with bite-blocks (Gay et al., 1981) or of the lips with lip-tubes (Savariaux et al., 1995) were interpreted as evidence for the specification of the goal in the acoustic/auditory domain, whereas adaptation in response to a perturbation of the jaw with a velocity-dependent force field (Tremblay et al., 2003) supported the hypothesis of a goal in the articulatory/somatosensory domain. In the absence of evidence undeniably supporting one of these theories, new theories emerged assuming that phonological units could be associated with both auditory and somatosensory goals (see for example the concept of "perceptuo-motor unit" in the Perception-for-Action-Control Theory of Schwartz et al. (2012); or, for another perspective, the phonological processing of the HFSC model of Hickok (2012), distributed over an auditory-motor circuit for syllables and over a somatosensory-motor circuit for phonemes).

Today, the large majority of Speech Production Models associate both somatosensory and auditory goals with phonological units (Guenther et al., 2006; Kröger et al., 2009; Hickok, 2012; Yan et al., 2014; Parrell et al., 2018). In this context, a key question is the respective weight of each modality in the specification of the goals. Lindblom (1996) and Stevens (1996) considered that the articulatory/somatosensory correlates are not primary, but are rather the secondary consequences of the articulatory strategies that have emerged for a correct achievement of the acoustic/auditory goals. In line with these suggestions, we have assumed a hierarchical organization of the goals, with a higher priority for the achievement of the auditory goals (Perrier, 2005). In its recent versions, the DIVA model assumes that speech acquisition is based on purely auditory targets, and that the somatosensory targets are learned in a second stage during speech development as "sensations associated with the sound currently being produced" (Guenther et al., 2006, p. 286), thereby also introducing a hierarchy in the role of the modalities in the specification of the goals. In an experimental study in which speech production was perturbed both in the auditory domain (with an on-line shift of formant F1) and in the somatosensory one (with an on-line alteration of the jaw opening, which also affects F1), Feng et al. (2011) found that participants compensated for the auditory perturbation regardless of the direction of the perturbation of the jaw opening. This observation supported a dominant role of the auditory modality in the control of speech production.

However, three important experimental findings have contested the validity of the hierarchical hypothesis. The first finding is the fact that, when the auditory feedback is perturbed, the compensation for the perturbation is never complete, with a magnitude commonly amounting to at most 1/3 of the perturbation (Houde and Jordan, 2002; Purcell and Munhall, 2006; Villacorta et al., 2007; Cai et al., 2010). A convincing explanation for this phenomenon is that the strength of the specification of the somatosensory goal limits the authorized magnitude of the articulatory changes used to compensate for the auditory perturbation (Villacorta et al., 2007; Katseff et al., 2012). The second finding is that motor learning associated with a perturbation of the auditory feedback generates a shift of the perceptual boundaries between the phonemes of interest (Shiller et al., 2009; Lametti et al., 2014). Using a simplified Bayesian model of speech production, we have shown that the perceptual boundary shift was also in part due to the strength of the somatosensory goals (Patri et al., 2018). The third finding is the observation of "sensory preference" in a speech production task in which both auditory feedback and jaw movement were perturbed on-line (Lametti et al., 2012). Indeed, Lametti et al. (2012) found that, contrary to the observations of Feng et al. (2011), not all participants compensated primarily for the auditory perturbation: some compensated more for the auditory perturbation, but others compensated more for the jaw perturbation, and a significant negative correlation was found between the amounts of compensation for the perturbation in each modality. This completely changed the way of considering the crucial question of the physical domain in which the speech goals are specified in adult speakers for the production of phonological units.
The answer to this question would not be generic and dependent only on the characteristics of the language, but would be strongly subject-dependent and related to a preference of the subjects for one feedback modality or the other. From a general linguistic point of view, the debate is currently moving toward considering speaker-specific characteristics of the way to deal with the constraints of the language. Developing models of such phenomena will open doors for the elaboration of new experimental paradigms to question how speakers deal with the constraints of their language, and to investigate the consequences on speaker behaviors in terms of adaptation, coarticulation, and possibly diachronic phonetic changes.

In this work, we address the question of the "sensory preference" within a Bayesian model of speech motor planning, in which speech units are characterized both in auditory and somatosensory terms. This approach includes internal models predicting the sensory consequences of motor commands, and the definition of the sensory characterization of the motor goals, also called henceforth "sensory targets," associated with phonemes. These components are described in terms of probability distributions. We show that sensory preference can be implemented in the model in two ways.

In the first variant, sensory preference is attributed to the relative accuracy, measured as the precision (i.e., inverse of variance), of the sensory targets. This is inspired by well-acknowledged models of sensory fusion for perception (Ernst and Banks, 2002; Alais and Burr, 2004; Kersten et al., 2004) and of sensorimotor integration (Körding and Wolpert, 2004). It corresponds in particular to the approach proposed by the DIVA model (Villacorta et al., 2007; Perkell et al., 2008). In this view, sensory preference originates from the level of the stored sensory targets that are intrinsically associated with phonological units. This suggests that sensory preference would be an inflexible property of each individual. We call this modeling approach "Target-based approach."
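As a minimal illustration of the precision-weighting principle that underlies such sensory fusion models, the following sketch combines two independent one-dimensional Gaussian estimates. It is not the implementation of the model presented in this paper: the function name `fuse_gaussians` and the numerical values are hypothetical and chosen only so that the result can be checked by hand.

```python
def fuse_gaussians(mu_a, var_a, mu_s, var_s):
    """Fuse two independent 1-D Gaussian estimates by precision weighting.

    Hypothetical helper: 'a' stands for an auditory estimate and 's' for a
    somatosensory estimate; precision is the inverse of the variance.
    """
    prec_a = 1.0 / var_a
    prec_s = 1.0 / var_s
    # Relative weight of the auditory estimate in the fused mean.
    w_a = prec_a / (prec_a + prec_s)
    mu = w_a * mu_a + (1.0 - w_a) * mu_s
    # Precisions add, so the fused variance is smaller than either input.
    var = 1.0 / (prec_a + prec_s)
    return mu, var, w_a


# Illustrative values only: the auditory estimate is four times more precise.
mu, var, w_a = fuse_gaussians(mu_a=0.0, var_a=1.0, mu_s=1.0, var_s=4.0)
print(mu, var, w_a)
```

With these values the more precise (auditory) estimate receives a weight of 0.8, so the fused mean lies closer to it. In a Target-based reading, changing the variances of the stored targets is what shifts this balance.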

In the second, more original variant, sensory preference is implemented by modulating the sensitivity of the comparison between the predicted sensory consequences of motor commands and the sensory characterization of speech motor goals. This approach differs from linear weightings of the error associated with each modality in the computation of the feedback correction signal (see for example the "synaptic weights" in Guenther et al., 2006, Equation 9, p. 286), because of our probabilistic formulation. Indeed, we will see that the probabilistic formulation enables an interesting interpretation of the variation of sensory preference in terms of "clarity" or "sharpness" of the sensory pathway. Furthermore, in this second view, sensory preference is more flexible, as it can be modified without changing the stored sensory targets. Such a modification can then result from cognitive control, attentional processes or features of the task, without affecting the sensory characterization of speech motor goals associated with phonological units. We call this modeling approach "Comparison-based approach."

The main purpose of the current study is to compare these two variants, in the context of the adaptation to a long-lasting steady-state external sensory perturbation. As we recalled above, numerous experimental studies have used such a perturbation paradigm, and they have shown that perturbation leads to two kinds of compensation depending on the exposure time to the perturbation: first, to an almost immediate change of speech articulation aiming at compensating for the unpredicted newly introduced perturbation; second, after a sufficiently long period in the presence of the sustained perturbation, to a long-lasting compensation resulting from adaptation. Adaptation has been shown to induce after-effects (Houde and Jordan, 1998; Tremblay et al., 2003), which have been interpreted as evidence for long-lasting changes in the internal representations of the relations between motor commands and sensory outputs (called internal models in this paper). Thus, it is important to distinguish immediate compensation, associated with instantaneous motor control of speech movements, and compensation resulting from adaptation, associated with changes in the planning of speech movements. In this work we focus on the compensation resulting from adaptation, without considering the dynamics of the learning process underlying the transition from immediate compensation to final adaptation.

This paper is structured as follows. In section 2, we introduce all the elements of the modeling framework. We first describe the GEPPETO model, overall, and detail the Bayesian version of its motor planning layer. Then we explain how we simulate sensory perturbations and how we account for the resulting adaptations. Finally, we describe both variants of our model of sensory preference. In section 3, we simulate the two variants, highlighting their equivalence, which we then analyze formally. Finally, we discuss our results and possible extensions in section 4.

# 2. METHODS

#### 2.1. Overview of the Framework

#### 2.1.1. The GEPPETO Model

GEPPETO (see **Figure 1**) is a model of speech production organized around four main components: (i) a biomechanical model of the vocal tract simulating the activation of muscles and their influence on the postures and the movements of the main oro-facial articulators involved in the production of speech (Perrier et al., 2011); (ii) a model of muscle force generation mechanisms (the λ model, Feldman, 1986) that includes the combined effects on motoneurons' depolarization of descending information from the Central Nervous System and afferent information arising via short delay feedback loops from muscle spindles (stretch reflex) or mechano-receptors; (iii) a pure feedforward control system that specifies the temporal variation of the control variables (called λ variables) of the λ model from the specification of the target values inferred in the motor planning phase and of their timing; and (iv) a motor planning system that infers the target λ variables associated with the phonemes of the planned speech sequence.

In the implementation of GEPPETO used in this study, the biomechanical model is a 2-dimensional finite element model of the tongue in the vocal tract, which includes 6 principal tongue muscles as actuators and accounts for mechanical contacts with the vocal tract boundaries. The motor planning layer specifies the target λ variables by considering the motor goals associated with the phonemes of the speech utterance to be produced and using an optimal approach. Complete descriptions of GEPPETO, available elsewhere (Perrier et al., 2005; Winkler et al., 2011; Patri et al., 2015, 2016; Patri, 2018), also involve the specification of intended levels of effort. This makes it possible, in particular, to produce speech sequences at different speaking rates; however, for simplicity, we do not consider this aspect of the model in the current study.

A key hypothesis in GEPPETO is that speech production is planned on the basis of units having the size of the phonemes. Larger speech units are accounted for in the model via optimal planning: they correspond to the span of the phoneme sequence over which optimal planning applies (CV syllables, CVC syllables, VCV sequences, see Perrier and Ma, 2008; Ma et al., 2015). Given the limitations of the biomechanical model used in this study, which only models the tongue and assumes fixed positions for the jaw and the lips, we only consider French vowels that do not crucially involve jaw or lip movements, which are {/i/, /e/, /ɛ/, /a/, /œ/, /ɔ/}. GEPPETO further assumes that the motor goals associated with phonemes are defined as particular target regions in the sensory space. These regions are assumed to describe the usual range of variation of the sensory inputs associated with the production of the phonemes. Previous versions of GEPPETO have only considered the auditory space for the definition of these target regions. The auditory space is identified in GEPPETO with the space of the first three formants (F1, F2, F3), and target regions are defined in this space as dispersion ellipsoids of order 2, whose standard deviations have been determined from measures provided by phoneme production experiments (Calliope, 1984; Robert-Ribes, 1995; Ménard, 2002) and adapted to the acoustic maximal vowel space of the biomechanical model (Perrier et al., 2005; Winkler et al., 2011). The top left part of **Figure 1B** represents the projection of these target regions in the (F2, F1) plane.

In the present study, we consider an updated version of GEPPETO that includes both auditory and somatosensory characterizations of the phonemes. We call it "Bayesian GEPPETO," because the planning layer, which is at the core of the present study, is described with a Bayesian model. In this formulation, the somatosensory space only accounts for tongue proprioception. This account is based on the shape of the tongue contour in the mid-sagittal plane. More specifically, the somatosensory space is defined as the space of the first three Principal Components that model the covariation of the 17 nodes of the tongue contour in the Finite Element tongue mesh in the mid-sagittal plane, when the target λ variables vary over a large range of values, which covers all possible realistic tongue shapes associated with vowel productions. In line with the idea that auditory goals are primary in speech acquisition and that somatosensory goals are learned as a consequence of the achievement of the auditory goals (Lindblom, 1996; Stevens, 1996; Guenther et al., 2006), GEPPETO assumes that somatosensory target regions characterizing phonemes are dispersion ellipsoids that approximate the projections of the auditory target regions into the somatosensory space. The top right part in **Figure 1B** illustrates the somatosensory target regions in the plane of the first two principal components. Data points within increasing elliptical rings in the auditory target regions are plotted with identical colors in the auditory and somatosensory spaces, providing an intuitive idea of the geometry distortion resulting from the non-linear relation between the auditory and the somatosensory space.

For a given phoneme sequence, the goal of the motor planning layer of GEPPETO is to find the λ target variables that make it possible to reach the sensory target regions of the phonemes in the appropriate serial order. In the most recent developments of GEPPETO, this inverse problem is addressed as an inference question formulated in a Bayesian modeling framework (Patri et al., 2015, 2016). It is on this Bayesian component of GEPPETO that we focus in this work.

#### 2.1.2. Bayesian Modeling of Speech Motor Planning in GEPPETO

The Bayesian model formulates the key ingredients of the motor planning stage of GEPPETO in a probabilistic framework, where key quantities are represented as probabilistic variables and their relations are represented by probability distributions. It is mathematically based on the theoretical concepts defined in the COSMO model of speech communication (Moulin-Frier et al., 2015; Laurent et al., 2017). In previous works we have described our modeling framework in the context of coarticulation modeling, planning of sequences of phonemes (Patri et al., 2015), and the specification of effort levels for the planning of speech at different speaking rates (Patri et al., 2016). However, these previous implementations of the model only considered auditory goals for the phonemes. A novelty in the present work is the integration of both auditory and somatosensory goals in "Bayesian GEPPETO." This integration is based on modeling principles that we have recently elaborated in the context of a simplified Bayesian model of speech production (Patri et al., 2018), with the aim of studying various potential explanations for the shifts of perceptual boundaries observed after speech motor learning (Shiller et al., 2009; Lametti et al., 2014). Note that for simplicity we focus here only on the production of single phonemes. However, the extension of the present formulation to consider sequences of phonemes as in Patri et al. (2015) is straightforward.

In the case of single-phoneme planning, "Bayesian GEPPETO" includes eight probabilistic variables, described in **Figure 2** along with their dependencies. The right hand side of the diagram represents variables involved in the definition of the motor goals associated with phonemes: variable Φ represents phoneme identity, and variables A<sub>Φ</sub> and S<sub>Φ</sub> are the auditory and somatosensory variables involved in the sensory characterization of phonemes (we call them sensory-phonological variables). The left hand side of the diagram represents variables involved in sensory-motor predictions: the 6-dimensional motor control variable M represents the six λ variables that control muscle activation and then tongue movements in the biomechanical model (M = (λ<sub>1</sub>, . . . , λ<sub>6</sub>)); variables A<sub>M</sub> and S<sub>M</sub> are sensory-motor variables representing the auditory and somatosensory consequences of motor variable M.

Motor planning of a single phoneme is achieved in the model by identifying the sensory-motor predictions that match the sensory specification of the intended phoneme. This matching is imposed with two coherence variables C<sub>A</sub> and C<sub>S</sub> (Bessière et al., 2013), that act as "probabilistic switches," and can be understood as implementing a matching constraint between the predicted sensory-motor variables and the specified sensory-phonological variables.

The diagram in **Figure 2** also represents the decomposition of the joint probability distribution of all the variables in the model:

$$\begin{aligned} P(M\ \Phi\ A_M\ A_\Phi\ C_A\ S_M\ S_\Phi\ C_S) = {} & P(M)\,P(\Phi) \\ & P(A_M \mid M)\,P(A_\Phi \mid \Phi)\,P(C_A \mid A_M\ A_\Phi) \\ & P(S_M \mid M)\,P(S_\Phi \mid \Phi)\,P(C_S \mid S_M\ S_\Phi)\,. \end{aligned} \tag{1}$$

Each of the factors on the right hand side of Equation (1) corresponds to one particular piece of knowledge involved in motor planning:

**P(M)** and **P(Φ)** are prior distributions representing prior knowledge about possible values of motor variable M and of phoneme variable Φ. We assume all possible values to be equally probable (no prior knowledge) and thus define P(M) and P(Φ) as uniform distributions over their domains. The domain of variable M is a continuous 6-dimensional support defined by the allowed range of values of each parameter λ<sub>i</sub> of the biomechanical model. Φ is a discrete, categorical variable including the identity of the different phonemes considered in the model.

**P(A<sub>Φ</sub> | Φ)** and **P(S<sub>Φ</sub> | Φ)** correspond to the auditory and somatosensory characterizations of phonemes. We define them as multivariate Gaussian distributions in the auditory and somatosensory spaces:

$$P([X_\Phi = x] \mid [\Phi = \phi]) := \mathcal{N}(x\,;\,\mu_X^\phi,\,\Gamma_X^\phi), \tag{2}$$

where X refers to the sensory modality (A for "Auditory" or S for "Somatosensory"), and µ<sub>X</sub><sup>φ</sup> and Γ<sub>X</sub><sup>φ</sup> correspond to the parameters specifying the distribution associated with phoneme φ in the sensory space X (i.e., mean vector µ<sub>X</sub><sup>φ</sup> and covariance matrix Γ<sub>X</sub><sup>φ</sup>). This definition of the sensory characterizations translates into probabilistic terms the hypothesis that phonemes are characterized by the ellipsoid regions illustrated in **Figure 1B**. In particular, the mean vector and covariance matrix of each distribution are identified from these ellipsoid regions. The correspondence between these two representations is illustrated in the top and bottom plots of **Figure 1B**.

**P(A<sub>M</sub> | M)** and **P(S<sub>M</sub> | M)** correspond to the knowledge relating the motor control variable M to its predicted sensory consequences A<sub>M</sub> and S<sub>M</sub>, in the auditory and somatosensory space, respectively. We identify this knowledge with sensory-motor internal models in the brain (Kawato et al., 1990; Jordan and Rumelhart, 1992; Tian and Poeppel, 2010). In the current implementation we assume that these internal models are deterministic and we implement them as Dirac probability distributions centered on the outputs of sensory-motor maps, ρ<sub>a</sub> and ρ<sub>s</sub>:

$$P([X_M = x] \mid [M = m]) := \delta(x - \rho_x(m)), \tag{3}$$

where X<sub>M</sub> stands for A<sub>M</sub> or S<sub>M</sub>, depending on the modality, and δ denotes the Dirac distribution (i.e., P([X<sub>M</sub> = x] | [M = m]) is zero unless x = ρ<sub>x</sub>(m)). The sensory-motor maps ρ<sub>a</sub> and ρ<sub>s</sub> have been created from the results of around 50,000 simulations carried out with the biomechanical model by randomly sampling the space of the λ motor control variables. We implemented these sensory maps by learning the relation between the λ variables and the sensory variables with Radial Basis Functions (RBF; Poggio and Girosi, 1989), using a standard supervised learning approach.
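To make the idea of learning a sensory-motor map from simulated samples concrete, the sketch below fits a Gaussian RBF interpolant to a one-dimensional toy function. Everything in it is an assumption made for illustration: `toy_plant` stands in for the biomechanical model, the training points are evenly spaced rather than randomly sampled, and the kernel width and exact-interpolation scheme are not taken from the study, whose actual RBF configuration is not described here.

```python
import math


def toy_plant(m):
    """Hypothetical stand-in for the simulated motor-to-sensory relation."""
    return math.sin(m)


def gaussian_rbf(r, width):
    return math.exp(-(r / width) ** 2)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_rbf(centers, targets, width):
    """Return RBF weights that reproduce the targets at the centers."""
    gram = [[gaussian_rbf(abs(ci - cj), width) for cj in centers] for ci in centers]
    return solve(gram, targets)


def predict_rbf(m, centers, weights, width):
    """Evaluate the learned map (a toy analog of rho_a or rho_s) at m."""
    return sum(w * gaussian_rbf(abs(m - c), width) for c, w in zip(centers, weights))


WIDTH = 0.5  # illustrative kernel width
centers = [3.0 * i / 9 for i in range(10)]   # training motor values in [0, 3]
targets = [toy_plant(c) for c in centers]    # simulated sensory outputs
weights = fit_rbf(centers, targets, WIDTH)

print(predict_rbf(1.5, centers, weights, WIDTH), toy_plant(1.5))
```

Once fitted, the map can be queried at motor values that were never simulated, which is what allows the planning equations to evaluate predicted sensory consequences without running the biomechanical model each time.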

**P(C<sub>A</sub> | A<sub>M</sub> A<sub>Φ</sub>)** and **P(C<sub>S</sub> | S<sub>M</sub> S<sub>Φ</sub>)** implement the two sensory matching constraints. C<sub>A</sub> and C<sub>S</sub> are both binary variables (taking values 0 or 1) that activate the corresponding matching constraint when their values are set to 1. This is implemented with the following definition:

$$P([C_X = 1] \mid [X_M = x_m]\ [X_\Phi = x_\Phi]) := \begin{cases} 1 & \text{if } x_m = x_\Phi \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where again X<sub>M</sub> stands for A<sub>M</sub> or S<sub>M</sub>, and X<sub>Φ</sub> stands for A<sub>Φ</sub> or S<sub>Φ</sub>.

#### 2.1.3. Motor Planning in the Bayesian Model

The goal of the motor planning layer in GEPPETO is to find values of the motor control variable M that correctly make the tongue articulate the intended phoneme. The Bayesian model makes it possible to address this question as an inference question that can be formulated in three ways: (i) by activating only the auditory pathway with [C<sub>A</sub> = 1]; (ii) by activating only the somatosensory pathway with [C<sub>S</sub> = 1]; (iii) by activating both the auditory and somatosensory pathways with [C<sub>A</sub> = 1] and [C<sub>S</sub> = 1] (we call this the "fusion" planning model). These three planning processes are computed analytically, by applying probabilistic calculus to the joint probability distribution P(M A<sub>M</sub> S<sub>M</sub> A<sub>Φ</sub> S<sub>Φ</sub> Φ C<sub>A</sub> C<sub>S</sub>) specified by Equation (1). The outcome of these computations for each planning process gives:

$$P([M=m] \mid \Phi\ [C_A = 1]) \propto P([A_\Phi = \rho_a(m)] \mid \Phi), \tag{5}$$

$$P([M=m] \mid \Phi\ [C_S = 1]) \propto P([S_\Phi = \rho_s(m)] \mid \Phi), \tag{6}$$

$$P([M=m] \mid \Phi\ [C_A = 1]\ [C_S = 1]) \propto P([A_\Phi = \rho_a(m)] \mid \Phi)\,P([S_\Phi = \rho_s(m)] \mid \Phi), \tag{7}$$

where the mathematical symbol "∝" means "proportional to."
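As a rough illustration of Equations (5–7), the sketch below evaluates the three planning distributions on a grid of motor commands. It is not the GEPPETO implementation: the scalar motor command, the linear maps `rho_a` and `rho_s`, and the target means and variances are all hypothetical values chosen so that both targets are compatible in the unperturbed condition.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


# Hypothetical linear sensory-motor maps; the paper learns rho_a and rho_s with RBFs.
def rho_a(m):
    return 500.0 + 100.0 * m  # toy "F1" value in Hz


def rho_s(m):
    return m  # toy somatosensory coordinate


# Hypothetical Gaussian targets P(A_Phi | Phi) and P(S_Phi | Phi) for one phoneme.
MU_A, VAR_A = 500.0, 30.0 ** 2
MU_S, VAR_S = 0.0, 0.3 ** 2


def planning_weight(m, use_auditory, use_somatosensory):
    """Unnormalized P(M = m | Phi, constraints), following Equations (5), (6), (7)."""
    weight = 1.0
    if use_auditory:
        weight *= gaussian_pdf(rho_a(m), MU_A, VAR_A)
    if use_somatosensory:
        weight *= gaussian_pdf(rho_s(m), MU_S, VAR_S)
    return weight


def planning_distribution(grid, use_auditory, use_somatosensory):
    """Normalize the planning weights over a finite grid of motor commands."""
    weights = [planning_weight(m, use_auditory, use_somatosensory) for m in grid]
    total = sum(weights)
    return [w / total for w in weights]


grid = [i / 100.0 for i in range(-200, 201)]
auditory = planning_distribution(grid, True, False)
somatosensory = planning_distribution(grid, False, True)
fusion = planning_distribution(grid, True, True)
best_fusion_m = grid[fusion.index(max(fusion))]
```

In this toy setting the fusion distribution peaks at the same motor command as the two single-pathway distributions but is narrower, since it multiplies two compatible constraints.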

Equations (5–7) give the probability, according to each of the three planning processes, that a given value m of the motor control variable M will actually produce the intended phoneme Φ. In practice, to obtain for each planning process a reasonable set of values covering the range of variation of the motor control variable, together with their probability of correctly producing the intended phoneme, we randomly sampled the space of the motor control variable according to these probability distributions. The sampling used a standard Markov Chain Monte Carlo (MCMC) algorithm, implemented with Matlab's "mhsample" function. The MCMC algorithm performs a random walk in the control space, resulting in a distribution of random samples that converges toward the desired probability distribution. The left panels in **Figure 3** present the dispersion ellipses of order 2 in the auditory and somatosensory spaces of the result obtained from 2 × 10<sup>4</sup> random samples, taken from 20 independent sampling runs (after removal of the first 10<sup>3</sup> burn-in samples in each chain), for the production of phoneme /O/ for each of the three planning processes. It can be observed that all three planning processes correctly reach the target region in both sensory spaces.
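The random-walk sampling described above can be sketched with a minimal Metropolis-Hastings loop. This is a stand-in written in Python, not the Matlab "mhsample" call used in the study; the scalar motor command, the linear maps, the target parameters, the proposal width, and the single chain are all assumptions made for illustration.

```python
import math
import random


def log_fusion_weight(m):
    """Log of the unnormalized fusion planning density for a toy scalar model."""
    auditory_error = (500.0 + 100.0 * m) - 500.0   # rho_a(m) - mu_A (hypothetical)
    somatosensory_error = m - 0.0                  # rho_s(m) - mu_S (hypothetical)
    return (-0.5 * auditory_error ** 2 / 30.0 ** 2
            - 0.5 * somatosensory_error ** 2 / 0.3 ** 2)


def metropolis_hastings(log_weight, start, n_samples, burn_in, step, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    current = start
    current_log = log_weight(current)
    samples = []
    for iteration in range(burn_in + n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_log = log_weight(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        accept_prob = math.exp(min(0.0, proposal_log - current_log))
        if rng.random() < accept_prob:
            current, current_log = proposal, proposal_log
        if iteration >= burn_in:
            samples.append(current)  # keep only post-burn-in samples
    return samples


rng = random.Random(0)
samples = metropolis_hastings(log_fusion_weight, start=1.5, n_samples=20000,
                              burn_in=1000, step=0.3, rng=rng)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

For this toy density the samples should concentrate around m = 0 with a variance close to 0.045, the value obtained by multiplying the two Gaussian constraints analytically.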

#### 2.2. Implementation of Sensory Perturbations and Adaptation in the Model

Sensory perturbations alter the sensed consequence of motor actions such that the sensory output predicted by the internal model becomes erroneous. When the perturbation is consistently maintained, a new relation between motor control variables and sensory outputs is experienced, and the sensory-motor internal models (P(A<sub>M</sub> | M) and P(S<sub>M</sub> | M)) are updated as a result of motor learning and adaptation (Shadmehr and Mussa-Ivaldi, 1994; Houde and Jordan, 1998; Haruno et al., 1999; Tremblay et al., 2003), in order to capture the new sensory-motor relation imposed by the perturbation. We define adaptation, in the model, as the update of the parameters of the internal models.

According to Lametti et al. (2012), differences in sensory preference lead to differences across speakers in their tolerance to errors in each of the sensory modalities (auditory or somatosensory). This phenomenon has been assumed to explain the observed inter-speaker differences in the amount of compensation after adaptation. The evaluation of our two implementations of sensory preference is based on their capacity to account for these differences in compensation. Importantly, whatever the nature of the sensory perturbation (auditory or somatosensory), compensation induces changes in both the auditory and somatosensory outputs, generating errors in both domains. Hence, the amount of compensation is modulated by sensory preference even if the perturbation affects only one sensory modality. Therefore, in this paper, for the sake of simplicity, we only consider auditory perturbations (but see Patri, 2018 for results involving somatosensory perturbations).

#### 2.2.1. Implementation of Sensory Perturbations

We simulate auditory perturbations in the model by altering the spectral characteristics of the acoustic signal associated with the tongue configurations of the biomechanical model. More specifically, if a tongue configuration T produced an acoustic output a<sup>u</sup> in the unperturbed condition, then with the auditory perturbation the same tongue configuration will result in a shifted acoustic output a<sup>∗</sup> = a<sup>u</sup> + δ. The middle panel of **Figure 3** illustrates the effect of an auditory perturbation that shifts the first formant F1 down by 100 Hz (δ = −100 Hz) during the production of vowel /O/ for the three planning processes.

#### 2.2.2. Implementation of Adaptation

In the context of an auditory perturbation, only the auditory-motor internal model P(A<sub>M</sub> | M) becomes erroneous. Hence, we implement adaptation to the auditory perturbation by updating the auditory-motor map ρ<sub>a</sub> of the auditory-motor internal model P(A<sub>M</sub> | M) (see Equation 3). This update is defined in order to capture the new relation between the motor control variable and its auditory consequence. In the case of an auditory perturbation that shifts auditory values by a constant vector δ, we assume the resulting update to be complete and perfect, with parameter δ<sub>A</sub> = δ:

$$
\rho\_a^\*(m) = \rho\_a^u(m) + \delta\_A, \tag{8}
$$

where ρ<sub>a</sub><sup>∗</sup> and ρ<sub>a</sub><sup>u</sup> denote the auditory-motor maps in the perturbed and unperturbed conditions, respectively. In all simulations involving the perturbation, we shift only the first formant F1, down by 100 Hz, such that δ<sub>A</sub> = [−100, 0, 0].
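The update of Equation (8) amounts to adding a constant vector to the output of the auditory-motor map. The sketch below shows this with a hypothetical map `rho_a_unperturbed` returning three formant values; only the shift δ<sub>A</sub> = [−100, 0, 0] comes from the text, the map itself is invented for illustration.

```python
def make_adapted_map(unperturbed_map, delta):
    """Return rho_a^*(m) = rho_a^u(m) + delta_A, applied component-wise (Equation 8)."""
    def adapted_map(m):
        return [value + shift for value, shift in zip(unperturbed_map(m), delta)]
    return adapted_map


def rho_a_unperturbed(m):
    """Hypothetical auditory-motor map returning [F1, F2, F3] in Hz."""
    return [500.0 + 100.0 * m, 1000.0 - 50.0 * m, 2500.0]


DELTA_A = [-100.0, 0.0, 0.0]  # F1 shifted down by 100 Hz, F2 and F3 untouched
rho_a_adapted = make_adapted_map(rho_a_unperturbed, DELTA_A)

formants_before = rho_a_unperturbed(0.5)   # [550.0, 975.0, 2500.0]
formants_after = rho_a_adapted(0.5)        # [450.0, 975.0, 2500.0]
```

Because the update is assumed complete and perfect, the adapted map predicts exactly what the speaker would hear under the perturbation for any motor command.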

The right panel of **Figure 3** illustrates the effect of the auditory perturbation and the outcome of adaptation for each of the three planning processes. In unperturbed conditions (left panels), all three planning processes correctly reach both the auditory and the somatosensory target regions. In the middle panel, which represents the situation before adaptation occurs, the auditory perturbation induces for the three planning processes a shift in the auditory domain (top middle panel), but not in the somatosensory domain (bottom middle panel), since the perturbation only alters the auditory-motor relations. The right panels illustrate the outcome of the three planning processes after adaptation has been achieved, as implemented by Equation (8). It can be seen that the results corresponding to the somatosensory planning, P(M | Φ [C<sub>S</sub> = 1]), remain unchanged. This is because somatosensory planning does not involve the auditory-motor map ρ<sub>a</sub> (Equation 6), and is therefore unaffected by the update of the auditory-motor map induced by the adaptation. On the other hand, and as expected, after the perfect update of the auditory-motor internal model, the auditory planning P(M | Φ [C<sub>A</sub> = 1]) (Equation 5) fully compensates for the perturbation and correctly reaches the auditory target region (top right panel). However, this compensation is achieved by a change in the value of the motor control variable, which results in a tongue posture associated with a somatosensory output that is outside of the somatosensory target region (bottom right panel). Finally, the fusion planning P(M | Φ [C<sub>A</sub> = 1] [C<sub>S</sub> = 1]) (Equation 7) combines the two previous results: since the auditory and somatosensory target regions are no longer compatible after the update of the auditory-motor internal model, fusion planning cannot reach both sensory target regions at the same time, and therefore it makes a compromise between the auditory and the somatosensory constraints. As a result, fusion planning leads to auditory and somatosensory consequences that lie midway between those of a pure auditory or a pure somatosensory planning.
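The compromise made by fusion planning after adaptation can be reproduced in a one-dimensional toy model. The sketch below assumes a scalar motor command, hypothetical linear maps, and hypothetical Gaussian targets of equal precision in motor space; it finds the most probable motor command for each planning process by a simple grid search.

```python
DELTA_F1 = -100.0  # auditory shift learned during adaptation (Equation 8)


def rho_a_adapted(m):
    """Hypothetical adapted auditory-motor map (toy F1 in Hz)."""
    return 500.0 + 100.0 * m + DELTA_F1


def rho_s(m):
    """Hypothetical somatosensory-motor map, unchanged by the auditory perturbation."""
    return m


def log_gaussian(x, mean, var):
    """Log of a Gaussian density up to an additive constant."""
    return -0.5 * (x - mean) ** 2 / var


def most_probable_command(grid, use_auditory, use_somatosensory):
    """Grid search for the mode of the planning distribution (Equations 5 to 7)."""
    def score(m):
        total = 0.0
        if use_auditory:
            total += log_gaussian(rho_a_adapted(m), 500.0, 30.0 ** 2)
        if use_somatosensory:
            total += log_gaussian(rho_s(m), 0.0, 0.3 ** 2)
        return total
    return max(grid, key=score)


grid = [i / 100.0 for i in range(-100, 201)]
m_auditory = most_probable_command(grid, True, False)       # full compensation
m_somatosensory = most_probable_command(grid, False, True)  # no compensation
m_fusion = most_probable_command(grid, True, True)          # compromise
```

With these values the auditory planner moves to m = 1, the somatosensory planner stays at m = 0, and the fusion planner settles at m = 0.5. The exact midpoint is a consequence of the equal precisions assumed here, not a general property.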

In summary, we have described how the three planning processes achieve similar results in unperturbed condition but generate very different results after adaptation to the sensory perturbation. Intuitively, if we are able to modulate in the model the weight associated with each sensory modality in the fusion planning process, we would be able to achieve a continuum of compensation magnitudes after adaptation. This continuum, representing all the possible patterns of sensory preference, would go from full compensation for the auditory perturbation, when sensory preference induces a full reliance on the auditory modality, to no compensation at all when sensory preference induces a full reliance on the somatosensory modality.

For the evaluation of the two variants of our model of sensory preference, we mainly consider the "fusion" planning, as it is the planning process that combines both auditory and somatosensory pathways and thus enables an account of the sensory preference phenomenon (see Equation 7). However, we will also study the planning processes based on each sensory pathway individually, in order to use them as references when evaluating the consequences of different sensory preference patterns. The impact of sensory preference on planning will be evaluated by modulating the relative involvement of each sensory pathway in the planning process. In general terms, the involvement of a sensory pathway is related to the magnitude of the mismatch between sensory-motor predictions and the intended target: for example, increasing the magnitude of this mismatch for the auditory modality increases the involvement of the auditory pathway in the planning process.

#### 2.3. Modeling Sensory Preference

#### 2.3.1. The Target-Based Approach: Modulating the Precision of Sensory Targets

In the Target-based approach we modulate the involvement of each sensory modality at the level of the target regions associated with phonemes, as illustrated in the left panel of **Figure 4**. In our model, the target regions result from the sensory characterization of phonemes, which is represented by the terms P(A<sub>Φ</sub> | Φ) and P(S<sub>Φ</sub> | Φ). These terms are specified in Equation (2) as multivariate Gaussian probability distributions with mean vectors µ<sub>A</sub><sup>Φ</sup> and µ<sub>S</sub><sup>Φ</sup> and covariance matrices Γ<sub>A</sub><sup>Φ</sup> and Γ<sub>S</sub><sup>Φ</sup>, respectively. We implement sensory preference in the model by modulating the precision of these distributions with the introduction of two additional parameters, κ<sub>A</sub> and κ<sub>S</sub>, for the auditory and the somatosensory pathway, respectively. These parameters multiply the covariance matrices of the corresponding Gaussian distributions:

$$P(\left[X\_{\Phi} = x\right] \mid \left[\Phi = \phi\right]) = \mathcal{N}(x \; ; \; \mu\_X^{\phi}, \; \kappa\_X \Gamma\_X^{\phi}),\tag{9}$$

where X, once more, stands either for the auditory or the somatosensory modality. The left panel of **Figure 4** illustrates the effect of parameters κ<sub>X</sub> on the target distributions in a one-dimensional case: increasing κ<sub>X</sub> widens the distribution and, as suggested previously, this decreases the involvement of the corresponding sensory modality in the planning process, since wider distributions penalize less the sensory signals that depart from the center of the target region and thus allow larger errors in this sensory modality. The same reasoning applies to a decrease of κ<sub>X</sub>, which narrows the distribution and increases the involvement of the corresponding sensory modality.

Substituting the forms given by Equation (9) into Equation (7) gives a first formulation of the influence of sensory preference in the fusion planning process:

$$\begin{aligned} &P([M=m] \mid \Phi \; [C\_A = 1] \; [C\_S = 1]) \\ &\propto \mathcal{N}(\rho\_s(m) \; ; \; \mu\_S^{\Phi}, \; \kappa\_S \Gamma\_S^{\Phi}) \, \mathcal{N}(\rho\_a(m) \; ; \; \mu\_A^{\Phi}, \; \kappa\_A \Gamma\_A^{\Phi}). \end{aligned} \tag{10}$$
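To see how κ<sub>A</sub> and κ<sub>S</sub> shift the outcome of Equation (10), consider a one-dimensional sketch in which both sensory-motor maps are linear. Under that assumption (which is ours, not the paper's), each Gaussian factor is quadratic in m and the most probable motor command is a precision-weighted average. All numerical values below are hypothetical.

```python
def fusion_mode(kappa_a, kappa_s):
    """Most probable motor command under Equation (10) for a toy linear model.

    Assumptions (all hypothetical): rho_a(m) = 400 + 100 m after adaptation,
    rho_s(m) = m, auditory target N(500, 30^2), somatosensory target N(0, 0.3^2).
    """
    slope_a, offset_a, mu_a, var_a = 100.0, 400.0, 500.0, 30.0 ** 2
    slope_s, offset_s, mu_s, var_s = 1.0, 0.0, 0.0, 0.3 ** 2
    # Each Gaussian factor is quadratic in m; its precision in motor space is
    # slope^2 / (kappa * variance) and its preferred command is (mu - offset) / slope.
    precision_a = slope_a ** 2 / (kappa_a * var_a)
    precision_s = slope_s ** 2 / (kappa_s * var_s)
    preferred_a = (mu_a - offset_a) / slope_a   # 1.0: full compensation
    preferred_s = (mu_s - offset_s) / slope_s   # 0.0: no compensation
    return ((precision_a * preferred_a + precision_s * preferred_s)
            / (precision_a + precision_s))


# Widening the auditory target (larger kappa_a) reduces the compensation.
for kappa_a in (1.0, 4.0, 16.0):
    print(kappa_a, round(fusion_mode(kappa_a, 1.0), 3))
```

In this toy case the compensation equals 1 / (1 + κ<sub>A</sub>/κ<sub>S</sub>), so only the ratio of the two parameters matters.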

#### 2.3.2. The Comparison-Based Approach: Modulating the Weight of the Sensory Matching Constraints

In the Comparison-based approach we modulate the involvement of each sensory modality at the level of the comparison between sensory-motor predictions and sensory characterizations of phonemes, as illustrated in the left panel of **Figure 5**. To do so, we have to slightly modify the definition of the operator that performs the comparison, i.e., the sensory matching constraint defined in Equation (4). Until now we have defined the sensory matching constraint in an "all-or-nothing" manner, where terms are either "1" when values of the variable predicted with the sensory-motor map match exactly the sensory-phonological variables, or "0" when they differ, regardless of the magnitude of the difference (see Equation 4). This definition is very strict, as it requires extreme accuracy in the achievement of the speech motor task in the sensory domain. Intuitively, if we are able to soften this constraint, we may be able to modulate the strengths of the comparisons and hence the involvement of each sensory pathway in the planning process.

We relax the sensory-matching constraint by extending its definition given in Equation (4) as follows (Bessière et al., 2013):

$$P(\left[C\_X = 1\right] \mid \left[X\_M = x\_1\right] \left[X\_{\Phi} = x\_2\right]) = e^{-d\_X\left(x\_1, x\_2\right)}.\tag{11}$$

FIGURE 4 | (A) Illustration of the effect in the Target-based approach of parameters κ<sub>A</sub> and κ<sub>S</sub> (see text) on the auditory and somatosensory target regions associated with phonemes, P(A<sub>Φ</sub> | Φ) and P(S<sub>Φ</sub> | Φ). The greater the value of the κ parameter, the wider the target region, and the weaker the contribution of the corresponding sensory pathway to the planning process. (B) Results of the fusion planning process after adaptation to the auditory perturbation described in section 2.2.2, for different values of parameters κ<sub>A</sub> and κ<sub>S</sub>.

FIGURE 5 | (B) Results of the fusion planning process after adaptation to the auditory perturbation described in section 2.2.2, for different values of parameters η<sub>A</sub> and η<sub>S</sub>.

Here d<sub>X</sub>(x<sub>1</sub>, x<sub>2</sub>) is a distance measure between sensory values x<sub>1</sub> and x<sub>2</sub>. Since e<sup>−x</sup> is a decreasing continuous function of x, the function defined in Equation (11) gives a high probability of matching for x<sub>1</sub> and x<sub>2</sub> values that are close (small distance d<sub>X</sub>(x<sub>1</sub>, x<sub>2</sub>)) and a low probability of matching for values that are far from each other. Note that the definition given in Equation (4) can be considered a degenerate case of this new expression of the sensory-matching constraint, in which the distance measure would be zero when x<sub>1</sub> = x<sub>2</sub> and infinite otherwise. For computational reasons, we choose a distance measure that is quadratic, i.e., d<sub>X</sub>(x<sub>1</sub>, x<sub>2</sub>) = (x<sub>1</sub> − x<sub>2</sub>)<sup>2</sup>. This choice yields a closed analytic form for the derivation of the motor planning question.

With this new expression of the matching constraint, we implement sensory preference in the model by introducing two additional parameters, η<sub>A</sub> and η<sub>S</sub>, for the auditory and the somatosensory pathway, respectively. These parameters modulate the sensitivity of the distance measures d<sub>A</sub>(a<sub>1</sub>, a<sub>2</sub>) and d<sub>S</sub>(s<sub>1</sub>, s<sub>2</sub>) associated with the sensory pathways:

$$d\_X(x\_1, x\_2; \eta\_X) = \frac{(x\_1 - x\_2)^2}{2\eta\_X^2}. \tag{12}$$

With this choice of parametric quadratic measure, Equation (11) becomes:

$$P(\left[C\_X = 1\right] \mid \left[X\_M = x\_1\right] \left[X\_{\Phi} = x\_2\right]) = e^{-\frac{\left(x\_1 - x\_2\right)^2}{2\eta\_X^2}}. \tag{13}$$

**Figure 5A** illustrates the form of the matching constraint defined by Equation (13) in the Comparison-based approach for different values of parameter η<sub>X</sub>: small values of η<sub>X</sub> lead to sharper matching constraints; large values lead to flatter constraints. Note in particular that for η<sub>X</sub> → 0 the rigid constraint formulated in Equation (4) is recovered, while for η<sub>X</sub> → +∞ the constraint function becomes constant, independent of the sensory values, which in fact corresponds to an absence of constraint.
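The two limits of the relaxed matching constraint can be checked numerically. The sketch below implements Equation (13), reading (x<sub>1</sub> − x<sub>2</sub>)<sup>2</sup> as the squared Euclidean distance for vector-valued sensory variables (an assumption on our part); the formant values are hypothetical.

```python
import math


def matching_probability(x_pred, x_target, eta):
    """P([C_X = 1] | [X_M = x_pred] [X_Phi = x_target]) as in Equation (13)."""
    squared_distance = sum((p - t) ** 2 for p, t in zip(x_pred, x_target))
    return math.exp(-squared_distance / (2.0 * eta ** 2))


prediction = [450.0, 1000.0]   # hypothetical predicted formants (Hz)
target = [500.0, 1000.0]       # hypothetical target formants (Hz)

sharp = matching_probability(prediction, target, eta=1.0)      # nearly all-or-nothing
medium = matching_probability(prediction, target, eta=50.0)    # equals exp(-0.5)
flat = matching_probability(prediction, target, eta=1.0e6)     # almost no constraint
```

With a 50 Hz mismatch, a small η gives a matching probability that is numerically zero, whereas a very large η gives a value indistinguishable from 1 whatever the mismatch.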

#### 3. RESULTS

#### 3.1. Simulating Sensory Preference

#### 3.1.1. Simulation of the Target-Based Approach

We now illustrate results of simulations using the Target-based approach to model sensory preference in the context of the adaptation to the auditory perturbation described above in section 2.2.2. The colored triangles in **Figure 4** present the mean results computed for different values of parameters κ<sub>A</sub> and κ<sub>S</sub> based on 2 × 10<sup>4</sup> samples in the motor control space. For reference, colored ellipses present the results obtained with the three planning processes of the previous section [i.e., purely auditory (red color), purely somatosensory (blue color), or "fusion" planning (intermediate color)].

It can be seen that, as expected, progressively increasing parameter κ<sub>A</sub> leads to results that progressively drift toward the outcome of the pure somatosensory planning process. Similarly, progressively increasing κ<sub>S</sub> leads to results that drift toward the outcome of the pure auditory planning. Hence, parameters κ<sub>A</sub> and κ<sub>S</sub> effectively modulate the strength of each sensory pathway. This confirms the possibility of implementing sensory preference in our model in a way similar to previous approaches: modulating the relative precision of sensory target regions effectively modulates the contribution of the corresponding sensory pathway.

#### 3.1.2. Simulation of the Comparison-Based Approach

We now illustrate the Comparison-based approach to model sensory preference, and study the effect of parameters η<sub>A</sub> and η<sub>S</sub> in the model in the context of the adaptation to the auditory perturbation described above in section 2.2.2. The colored triangles in **Figure 5** present the mean results computed for different values of parameters η<sub>A</sub> and η<sub>S</sub> based on 2 × 10<sup>4</sup> samples in the motor control space. As in **Figure 4**, colored ellipses present the results obtained with the three initial planning processes, for reference.

It can be seen that progressively increasing parameter η<sub>A</sub> of the auditory matching constraint leads to results that progressively drift toward the outcome of the somatosensory planning process. Similarly, increasing parameter η<sub>S</sub> of the somatosensory matching constraint results in a drift toward the outcome of the auditory planning process. Hence, parameters η<sub>A</sub> and η<sub>S</sub> successfully modulate the strength of the constraint imposed by the corresponding sensory pathways.

#### 3.2. Equivalence of the Approaches

We have formulated two alternative approaches to implement sensory preference in Bayesian GEPPETO. Although these approaches correspond to clearly different ways of processing sensory variables, simulations with the model have shown that they lead to qualitatively similar results (right panels of **Figures 4**, **5**). Increasing parameter κ<sub>A</sub> or parameter η<sub>A</sub> decreases in a comparable manner the involvement of the auditory modality in the model and, thus, the magnitude of the changes induced by the compensation for the auditory perturbation. In the limit, for very large values of κ<sub>A</sub> or η<sub>A</sub>, the magnitude of the compensation for the auditory perturbation tends toward zero, which matches the results of the pure somatosensory planning process. Conversely, increasing parameter κ<sub>S</sub> or parameter η<sub>S</sub> decreases the involvement of the somatosensory modality and increases the magnitude of the compensation for the auditory perturbation. In the limit, for very large values of κ<sub>S</sub> or η<sub>S</sub>, the magnitude of the compensation tends toward the magnitude obtained with the pure auditory planning process.

However, a closer comparison of the results presented in the right panels of **Figures 4**, **5** reveals differences in the ways the compensation for the auditory perturbation varies when parameters κ<sub>X</sub> or η<sub>X</sub> vary. In the Target-based approach, the sequence of compensatory results follows a slightly simpler and straighter path than in the Comparison-based approach.

Despite these slight differences, the qualitative similarity of the results obtained with both approaches can be formally explained. Indeed, let us consider the outcome of the fusion planning P([M = m] | Φ [C<sub>A</sub> = 1] [C<sub>S</sub> = 1]) using the generalized sensory matching constraints given by Equation (11) in the Comparison-based approach. It yields:

$$\begin{aligned} &P([M=m] \mid \Phi \; [C\_A = 1] \; [C\_S = 1]) \\ &\propto \sum\_{a\_{\Phi}} P([A\_{\Phi} = a\_{\Phi}] \mid \Phi) \, P([C\_A = 1] \mid [A\_{\Phi} = a\_{\Phi}] \; [A\_M = \rho\_a(m)]) \\ &\quad \sum\_{s\_{\Phi}} P([S\_{\Phi} = s\_{\Phi}] \mid \Phi) \, P([C\_S = 1] \mid [S\_{\Phi} = s\_{\Phi}] \; [S\_M = \rho\_s(m)]), \end{aligned} \tag{14}$$

where we have omitted intermediate steps for the sake of brevity. Now, using the definition of sensory targets given in Equation (2) and the quadratic distance in the matching constraints as given in Equation (13), we note that all terms on the right hand side of Equation (14) are Gaussian. Hence, we can rewrite Equation (14) as:

$$\begin{aligned} &P([M=m] \mid \Phi \; [C\_A = 1] \; [C\_S = 1]) \\ &\propto \sum\_{a\_{\Phi}} \mathcal{N}(a\_{\Phi} ; \mu\_A^{\Phi}, \Gamma\_A^{\Phi}) \, \mathcal{N}(a\_{\Phi} ; \rho\_a(m), \eta\_A^2 I\_A) \\ &\quad \sum\_{s\_{\Phi}} \mathcal{N}(s\_{\Phi} ; \mu\_S^{\Phi}, \Gamma\_S^{\Phi}) \, \mathcal{N}(s\_{\Phi} ; \rho\_s(m), \eta\_S^2 I\_S), \end{aligned} \tag{15}$$

where we have denoted by I<sub>A</sub> and I<sub>S</sub> the identity matrices in the auditory and somatosensory space, respectively. With the introduction of variable y = ρ<sub>x</sub>(m) − x<sub>Φ</sub>, each of the sums in Equation (15) is in fact the convolution of two Gaussian distributions, one with mean µ<sub>X</sub><sup>Φ</sup> and covariance Γ<sub>X</sub><sup>Φ</sup>, the other with mean 0 and covariance η<sub>X</sub><sup>2</sup>I<sub>X</sub>. The convolution of two Gaussian distributions with mean vectors µ<sub>1</sub>, µ<sub>2</sub> and covariances Σ<sub>1</sub>, Σ<sub>2</sub> is known to result in another Gaussian distribution with mean vector µ<sub>1</sub> + µ<sub>2</sub> and covariance Σ<sub>1</sub> + Σ<sub>2</sub>. Hence, the planning process becomes:

$$\begin{aligned} &P([M=m] \mid \Phi \; [C\_A = 1] \; [C\_S = 1]) \\ &\propto \mathcal{N}(\rho\_s(m) \; ; \; \mu\_S^{\Phi} \; , \; \Gamma\_S^{\Phi} + \eta\_S^2 I\_S) \mathcal{N}(\rho\_a(m) \; ; \; \mu\_A^{\Phi} \; , \; \Gamma\_A^{\Phi} + \eta\_A^2 I\_A) . \end{aligned} \tag{16}$$
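The step from Equation (15) to Equation (16) relies on the Gaussian convolution identity, which can be checked numerically in the scalar case. The sketch below approximates the marginalization over the target value by a Riemann sum and compares it with the closed form N(ρ; µ, Γ + η<sup>2</sup>). The values of µ, Γ, η, and ρ are hypothetical, and the matching term is written as a normalized Gaussian, which differs from Equation (13) only by a constant factor that does not depend on m.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def marginalized_match(rho, mu, gamma, eta, half_width=12.0, step=0.001):
    """Riemann-sum approximation of the integral of N(x; mu, gamma) N(x; rho, eta^2)."""
    n_steps = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n_steps + 1):
        x = mu - half_width + i * step
        total += gaussian_pdf(x, mu, gamma) * gaussian_pdf(x, rho, eta ** 2)
    return total * step


mu, gamma, eta, rho = 0.0, 1.0, 0.5, 0.7   # hypothetical scalar values
numerical = marginalized_match(rho, mu, gamma, eta)
closed_form = gaussian_pdf(rho, mu, gamma + eta ** 2)
```

The two numbers agree to within numerical precision, which is the scalar counterpart of replacing Γ<sub>X</sub><sup>Φ</sup> by Γ<sub>X</sub><sup>Φ</sup> + η<sub>X</sub><sup>2</sup>I<sub>X</sub> in the planning distribution.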

Comparing Equation (16) with Equation (10) shows that they are almost identical, except for the form of the covariance matrices in the auditory and somatosensory spaces. The planning process in the Target-based approach (Equation 10) involves Gaussian distributions whose covariance matrices are modulated multiplicatively by the parameters κ<sub>A</sub> and κ<sub>S</sub>, whereas the planning process in the Comparison-based approach (Equation 16) involves Gaussian distributions whose covariance matrices are modulated additively by the parameters η<sub>A</sub> and η<sub>S</sub>. Hence, the effects of parameters η<sub>X</sub> and κ<sub>X</sub> are qualitatively similar, as our simulations illustrated: both increase the covariance of the sensory characterization of phonemes. Quantitatively, however, parameters κ<sub>X</sub> increase it multiplicatively, whereas parameters η<sub>X</sub> increase it additively.

We note that if the auditory and somatosensory spaces were one-dimensional, both approaches would be exactly equivalent, since any additive increase Γ + η<sup>2</sup> can be written as a multiplicative increase κΓ, with κ = 1 + η<sup>2</sup>/Γ. This no longer holds in higher dimensions, since the Target-based approach scales all coefficients of the covariance matrices, whereas the Comparison-based approach only modifies their diagonal terms. More specifically, the Target-based approach increases the size of the target regions while preserving their orientation, whereas the Comparison-based approach stretches the regions along the coordinate axes, inducing a progressive alignment of the main axes of the target regions with the coordinate axes (off-diagonal terms in the covariance matrices become negligible compared to the increased diagonal terms, and the resulting ellipsoid regions progressively lose their orientations). We assume that the slight differences observed above in the consequences on compensation of progressive variations of the κ<sub>X</sub> and η<sub>X</sub> parameters originate in these changes in target orientations.
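A small numerical sketch makes the contrast concrete. It writes the additive term as η<sup>2</sup>, as in Equation (16), and uses a hypothetical 2 × 2 target covariance. The correlation coefficient is used here as a simple indicator of how much the off-diagonal term still matters relative to the diagonal terms; this choice of indicator is ours.

```python
import math


def equivalent_kappa(gamma, eta):
    """Scalar case: kappa such that kappa * gamma == gamma + eta**2."""
    return 1.0 + eta ** 2 / gamma


def correlation(cov):
    """Correlation coefficient of a 2x2 covariance matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    return b / math.sqrt(a * c)


def scale_covariance(cov, kappa):
    """Target-based modulation: kappa * Gamma."""
    return [[kappa * value for value in row] for row in cov]


def add_isotropic(cov, eta):
    """Comparison-based modulation: Gamma + eta^2 * I."""
    return [[cov[0][0] + eta ** 2, cov[0][1]],
            [cov[1][0], cov[1][1] + eta ** 2]]


gamma_1d, eta = 4.0, 3.0
kappa = equivalent_kappa(gamma_1d, eta)          # 1 + 9/4 = 3.25

gamma_2d = [[4.0, 1.5], [1.5, 1.0]]              # hypothetical target covariance
corr_original = correlation(gamma_2d)            # 0.75
corr_scaled = correlation(scale_covariance(gamma_2d, kappa))
corr_added = correlation(add_isotropic(gamma_2d, eta))
```

In one dimension the two modulations coincide exactly for the matched κ. In two dimensions, scaling leaves the correlation at 0.75, whereas the additive term reduces it to about 0.13, reflecting the growing dominance of the diagonal terms.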

**Figure 6** gives an intuitive interpretation of the equivalence of these two approaches. On the one hand, the Target-based approach directly modulates the size of the target regions, while keeping their orientations, as illustrated on the left lens of the glasses in **Figure 6**. On the other hand, the Comparison-based approach does not change the targets, but modifies the precision of the comparison of the target with the sensory-motor predictions. This is as if the target were seen through a blurring lens that would "spread" the borders of the target, making it appear bigger. This "blurring effect" is induced by the convolution of the target with a Gaussian term that acts as noise (Equation 15). The larger the value of parameter η<sub>X</sub>, the larger the power of the noise, and the stronger the "blurring" of the target.

#### 4. DISCUSSION

The main contribution of our work is to present two different approaches implementing sensory preference in a speech production model that integrates both the auditory and the somatosensory modality. This is done in the context of our Bayesian GEPPETO model for speech motor planning and speech motor control (Perrier et al., 2005; Patri et al., 2016; Patri, 2018), which specifies both auditory and somatosensory constraints to infer motor commands for the production of a given phoneme. We have implemented sensory preference in this model by modulating the relative involvement of sensory modalities with two different approaches: (1) the Target-based approach, which modulates the precision of auditory and somatosensory target regions; (2) the Comparison-based approach, which modulates the sensory-matching constraints between predictions from internal models and sensory target regions. At the core of the evaluation of the two approaches, we have considered the phenomenon of incomplete compensation for sensory perturbations in speech production and its inter-subject variability, which has been evidenced by several experimental studies. Although they are conceptually different, we have shown in our model that these two approaches are able to account for incomplete compensation variability under the same amount of change in the internal model resulting from adaptation. Furthermore, we have demonstrated the mathematical equivalence of the two approaches in some specific cases, which explains the qualitative similarity of results obtained under both approaches.

In this context, the main outstanding question is whether the two modeling variants are distinguishable. We consider two aspects of this issue: mathematical formulation and experimental evaluation.

Let us compare the mathematical formulations of the two approaches. The Comparison-based approach is less compact and contains more degrees of freedom than the Target-based approach. We have also demonstrated that, under certain assumptions, both models behave similarly. On parsimony grounds, then, the Target-based approach wins over the Comparison-based approach. On the other hand, the additional degrees of freedom make the Comparison-based approach more flexible.

For further experimental evaluation we consider two possible directions. First, our simulation results illustrate that the particular patterns of partial compensation obtained under the two approaches differ slightly. Whether and how these differences could be assessed experimentally is an open question. The main difficulty arises from the fact that the observed differences in partial compensation depend not only on differences in the compensation mechanisms induced by each approach, but also on speaker-specific relations between motor commands and sensory variables. Taking these speaker-specific characteristics into account would be the main challenge in this experimental evaluation.

The second direction for experimental evaluation would be related to the different flexibility associated with each approach. Whereas the Target-based approach would predict fixed compensation strategies, ascribing any remaining variability to causes unrelated to sensory preferences or measurement errors, the Comparison-based approach would potentially relate sensory preference to some aspects of the structure of the observed variability. Furthermore, experimentally induced effects (e.g., asking subjects, for a given trial block, to focus especially on somatosensation; introducing a dual-task condition to induce attentional load, etc.) could help discriminate between the predictions of the two models.

Overall, the results of our study provide a new contribution to the understanding of the sensory preference phenomenon. They highlight that two factors could influence sensory preference, which mostly differ in their temporal stability. On the one hand, the Target-based approach represents sensory preference as the precision of target regions. This suggests that sensory preference is learned through language interaction and is stable over time, as the target regions would be used during everyday speech planning. On the other hand, the Comparison-based approach represents sensory preference "elsewhere" in the model, so that it can mathematically be manipulated independently of sensory target regions. Indeed, in this second approach, we have explicitly considered two independent components: (1) the sensory characterization of phonemes, which is mathematically expressed as constraints via the specification of sensory target regions; (2) matching constraints, which modulate the precision with which sensory predictions from the internal models are compared with phoneme-related sensory target regions. This allows a more general and flexible model, as compared to the Target-based approach. This flexibility suggests ways in which sensory preference would be modulated by cognitive control or attentional processes. Such an attentional model would explicitly modulate sensory preference on the fly depending on the context. This modulation could arise, for example, from changes in the access to one of the sensory modalities due to disorders, aging, or noise, or from the absence of congruence between the two sensory pathways. A proposal for such an attentional model, as an extension of the Comparison-based model presented here, is outlined in **Supplementary Material**.

Finally, we turn to possible theoretical extensions and applications of our model. So far, the Comparison-based approach to sensory preference we have described here is constrained by the specific hypotheses of the Bayesian GEPPETO model in which it is included. For instance, it only concerns sensory preference between somatosensory and acoustic descriptions of targets during serial order planning of sequences of vocalic speech sounds. Of course, the application scope could be extended, e.g., toward sensory preference during movement execution and movement correction, with a finer temporal resolution than we have considered so far. This would, for instance, allow the study of time-varying sensory preference, or of sensory preference that depends on speech sounds. Indeed, it is an open question whether consonants and vowels differ in the sensory pathway on which they rely more. We could also consider using our Comparison-based architecture to describe how low-level sensory acuity would affect the learning of the target representations, and how different sensory preferences during this learning would result in different sizes and separations of targets in each sensory pathway. Finally, such a learning mechanism with individual-specific sensory preference could contribute to the emergence of learned idiosyncrasies.

Furthermore, to put our approach in a wider theoretical context, we observe that the Comparison-based approach has a structure that could be cast into the general predictive coding framework, as popularized recently by the free-energy principle proposal (Friston and Kiebel, 2009; Feldman and Friston, 2010; Friston, 2010). Indeed, even though our model does not represent time or time-delays specifically, it nevertheless features the idea that "predictions" from internal models would be compared with sensory targets. We note that this is not exactly the same situation as for a comparison between forward predictions and sensory feedback, as would be used for instance in models of trajectory monitoring; nevertheless, the architecture is similar. In the Comparison-based approach, we have proposed a mathematically specific expression of the "comparison" operator, using probabilistic coherence variables and match measures. Whether this would be a plausible, or at least useful mathematical implementation of probabilistic comparison in predictive coding or free-energy architectures is an open question.

## DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

## AUTHOR CONTRIBUTIONS

J-FP, JD, and PP contributed conception and design of the study, and revised the manuscript. J-FP implemented the model and performed simulations, and wrote the first draft of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.

#### FUNDING

The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013 Grant Agreement no. 339152, Speech Unit(e)s, PI: Jean-Luc Schwartz), and from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 754490 (MINDED Program). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

#### ACKNOWLEDGMENTS

Authors wish to thank Jean-Luc Schwartz, Pierre Bessière, and Jacques Droulez for inspiring discussions and support.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.02339/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Patri, Diard and Perrier. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Morphogenesis of Speech Gestures: From Local Computations to Global Patterns

#### *Khalil Iskarous\**

*Department of Linguistics, University of Southern California, Los Angeles, CA, United States*

A subtle property of speech gestures is the fact that they are spatially and temporally extended, meaning that phonological contrasts are expressed using spatially extended *constrictions*, and have a finite duration. This paper shows how this spatiotemporal particulation of the vocal tract, for the purpose of linguistic signaling, comes about. It is argued that local uniform computations among topographically organized microscopic units that either constrict or relax individual points of the vocal tract yield the global spatiotemporal macroscopic structures we call constrictions, the locus of phonological contrast. The dynamical process is a morphogenetic one, based on the Turing and Hopf patterns of mathematical physics and biology. It is shown that reaction-diffusion equations, which are introduced in a tutorial mathematical style, with simultaneous Turing and Hopf patterns predict the spatiotemporal particulation, as well as concrete properties of speech gestures, namely the pivoting of constrictions, as well as the intermediate value of proportional time to peak velocity, which is well-studied and observed. The goal of the paper is to contribute to Bernstein's program of understanding motor processes as the emergence of low degree of freedom descriptions from high degree of freedom systems by actually pointing to specific, predictive, dynamics that yield speech gestures from a reaction-diffusion morphogenetic process.

#### *Edited by:*

*Pascal van Lieshout, University of Toronto, Canada*

#### *Reviewed by:*

*Sam Tilsen, Cornell University, United States Ioana Chitoran, Université Paris Diderot, France*

#### *\*Correspondence:*

*Khalil Iskarous kiskarou@usc.edu*

#### *Specialty section:*

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

*Received: 03 May 2019 Accepted: 07 October 2019 Published: 12 November 2019*

#### *Citation:*

*Iskarous K (2019) The Morphogenesis of Speech Gestures: From Local Computations to Global Patterns. Front. Psychol. 10:2395. doi: 10.3389/fpsyg.2019.02395*

Keywords: speech gestures, morphogenesis, BVAM system, Turing bifurcation, Hopf bifurcation

# INTRODUCTION

Languages vary widely in the phonological contrasts they utilize, and in the phonetic expression of these contrasts. However, two properties of the articulatory expression of phonological contrasts that occur universally are their particulation in time and space: articulatory gestures that physically express linguistic contrasts start at certain moments in time, and end at some later point in time, lasting some number of milliseconds (hence, their temporal particulation), and they are localized to some spatial *constriction* that extends for some number of millimeters (hence their spatial particulation). This is a subtle abstract property of contrastive gestures, which seems to not possibly be otherwise, as all motoric events have duration and if they are realized in space, they will be localized to a finite extent of space. It seems, therefore, that if phonological contrasts are spatiotemporally expressed, then, trivially and necessarily, they will be particulated. However, the neural system that organizes speech is known to be capable of very fast action (on the order of a few milliseconds), far faster than the durations of consonants and vowels (on the order of several to many tens or hundreds of milliseconds), and the motoneurons that control the speech articulators are spatially highly differentiated, capable of activity within a millimeter, but the constrictions of vowels and consonants are typically on the order of many millimeters or even centimeters (Zur et al., 2004; Mu and Sanders, 2010; Sanders et al., 2013). 
How and why, then, does the motor system organize gestures into spatiotemporal macroscopic units that are spatially and temporally large, compared to the fast and short-scale neuromuscular units?<sup>1</sup> The fact of particulation, I believe, is evident to anyone who knows about phonetics and phonology, once it is pointed out, but its ubiquity, this work contends, is to be explained, as it is fundamental to our understanding of the interrelation of phonetics and phonology. This is because the linguistic patterning of the motor speech system makes use of gestural locality in space and time for signaling. Indeed, the particulation of contrastive units has been discussed explicitly or implicitly by several theories concerned with how the speech production and perception systems are able to fulfill their linguistic purpose. Some of these theories focus almost entirely on the articulatory system as the mechanism of particulation, while others focus on the acoustic-perceptual functioning of the vocal tract resonator. These theories are presented in the next few paragraphs, and the reasons for the current model are advanced.

One explicit reference to the notion of particulation as fundamental to how language works through the articulatory system is the work of Studdert-Kennedy (Studdert-Kennedy, 1998; Studdert-Kennedy and Goldstein, 2003). Duality of patterning (Hockett, 1960) requires few meaningless elements to combine in many ways to create a very large number of meaningful elements (morphemes), and particulation in space and time is necessary for the generation of the large number of morphemes (Abler, 1989). That is, a large number of morphemes requires complex combinatoriality, and the latter requires particulation in space and time of the meaningless elements. This is indeed a very good reason for there to be particulation in time and space, and this paper does not contest this reason for the necessity of particulation, but this reasoning encapsulates a final cause (Aristotle, 1975), whereas the interest here is in the formal and material causes: how it is that the microscopic units of planning and execution are capable of realizing temporally and spatially macroscopic articulatory gestures? A material cause for particulation has been advanced by Studdert-Kennedy and Goldstein (2003) who attribute it to the anatomy of the vocal tract, where some articulators are separate from each other like the velum, tongue, and lips, so particulation is materially already present in the anatomy. However, the tongue cannot be claimed to be anatomically particulated into tip, dorsum, and root, since the muscles of the tongue, internal and external, interdigitate throughout the tongue (Sanders and Mu, 2013). The mystery of particulation, therefore, is not likely to be solved by positing separate organs of speech in the vocal tract, since the organ that probably contributes the most contrasts to the differentiation among speech sounds, the tongue, is not particulated anatomically. 
A formal cause of particulation has been proposed within the Task-Dynamic Model and Articulatory Phonology (Browman and Goldstein, 1989; Saltzman and Munhall, 1989; Sorenson and Gafos, 2016; Tilsen, 2019), which assume articulatory tasks to be expressed in terms of Constriction Locations and Degrees defined as finite spatial constrictions which control the vocal tract from specific initial to final points in time. So, particulation is built-in, and there is no problem to solve. The goal here will be to derive the spatial and temporal particulation, rather than stipulate it from the outset.

Another route to particulation is based on the acoustic-perceptual purpose of the vocal tract. Quantal theory (Stevens, 1972, 1989) motivated the typologically common vowels and consonants by considering the mapping between articulatory scales and the corresponding acoustic scale. If the articulatory scale is the position of a constriction in the vocal tract, and the corresponding acoustic scale is the position of the formants, then in many locations in the vocal tract, changing the position of the constriction slightly changes a few of the low formants very little, whereas in some locations, changing constriction location slightly changes a few of the low formants drastically. Stevens (1972, 1989) argued that languages choose locations in the vocal tract where constriction change has a small effect on the formants, as stability would make, for instance, coarticulatory and token-to-token variability in exact constriction positioning have less of an acoustic effect. A theory that motivates spatial particulation from Quantal Theory is distinctive regions and modes (Mrayati et al., 1988). The theory starts from the sensitivity functions showing how constrictions affect the first three formants, and shows that there are eight discrete quantal regions, where a constriction anywhere within a given region has the same distinctive qualitative effect on the three formants. In one of these eight regions, for instance, F1, F2, and F3 are all raised, while in another, F2 and F3 are raised, but F1 is lowered. The regions are distinctive in that each of them has a different formant behavior. This cause for particulation, like that presented earlier based on the duality of patterning and the need for many words for successful communication, is a final/teleological cause. The point of this paper is that there are formal and material causes that make the related final causes of linguistic combinatoriality and discrete acoustic behavior possible.

It will be shown that particulation emerges from distributed local computations between many units expressing the neural networks of the central and/or peripheral regions responsible for the control of the vocal tract articulators, as well as the interaction of the points of the hydrostatic tissues that compose the tongue, velum, lips, and vocal folds. That is, if articulation is controlled by topographically organized interactive neural networks, whose units constrict and relax the vocal tract at specific points, then we predict the spatial extent of contrastive units to be finite, and their temporal extent to be of finite duration. This particulation can then serve the purpose of combinatoriality and quantal region behavior. Moreover, the model to be presented does not just provide a high-level description, but rather makes very specific predictions about how constrictions evolve in time, which allows it to be tested against existing data.

<sup>1</sup> Particulation is not negated by many types of complexity, such as the fact that certain contrasts have multiple temporal stages or multiple articulations (Goldsmith, 1976; Catford, 1977; Sagey, 1986; Browman and Goldstein, 1989; Steriade, 1994; Shih and Inkelas, 2019), since each of the stages and articulations of multiple articulations are themselves particulated. And it is also not negated by the fact that many gestures are highly dynamic or overlap, since even dynamic events in the vocal tract are delimited in time, and their changes occur at particular finite locations in the vocal tract (Iskarous, 2005; Iskarous et al., 2010); also, overlapped gestures are themselves particulated.

The goal is not to replace the ideas of articulatory phonology, but to derive the notions of Task Constriction Location and Degree, as well as the finite temporal extent of gestures, from primitive computational principles that we know govern the neural and muscular tissue. Instead of seeing the neural and muscular unit firings as the final actuator of a planning process that assumes particulation, I show that global particulated gestures emerge from a local pattern formation process. The focus of this paper will be on the aspects of speech that the tongue contributes to, but it is believed that the contribution of other articulators can be accounted for in the future using similar techniques. Section "Turing and Hopf Particulation" describes, in a tutorial style, the Reaction-Diffusion Model of Turing and Rashevsky (Rashevsky, 1940; Turing, 1952; Rosen, 1968), in which inter-unit interaction yields particulation either in space or in time, separately. Section "Simultaneous Turing and Hopf Patterning" shows simulations of the BVAM Reaction-Diffusion Model (Barrio et al., 1999) and how the computations lead to *simultaneous* patterning or particulation in space and time, which can be interpreted as area function change in speech production. It is also shown that two signatures of natural constriction dynamics, pivoting (Iskarous, 2005) and large proportional time to peak velocity in articulatory gestures (Sorenson and Gafos, 2016), are predicted by the BVAM theory. Section "Discussion and Conclusions" includes a discussion of what has been achieved and what needs to be resolved in future work.

#### TURING AND HOPF PARTICULATION

After founding computer science and artificial intelligence, inventing the computer, and breaking the Nazi code, Alan Turing turned to what he perceived as the fundamental question of biology: how does biological structure form? He knew that biological structures are highly spatially articulated, but that they emerge from structures that are basically uniform. How does uniformity give way to nonuniformity of structure? This is what he called the problem of *morphogenesis*, which he felt was well-exemplified by the featureless symmetric blastula that then changes to a highly structured embryo. Another example is the stripes on zebra skin. The notion of *gene* was already understood, and Turing knew that some genetically initiated process in individual cells leads to local within-cell expression or lack of expression of melanin. But why do the cells within a dark stripe express melanin, while the cells in the light colored areas do not? That is, what is the origin of the global particulation of the skin? Turing's answer was that genetically controlled uniform local interactions between cells are what give rise to the global pattern formation. Turing proved that if the microscopic cells perform reactions, to be described, and if substances he called morphogens diffuse between surrounding cells, then these purely local microscopic *uniform* reactions (i.e., the same reactions and diffusions occur throughout the skin) lead to a local computation that yields highly particulated, nonuniform macroscopic structures.
Before introducing the reader to the nature of these computations, and why the genesis of form in this way is so surprising, I would like to suggest that Turing's ideas are relevant for the current discussion of the spatial particulation of phonological gestures. It will be suggested that the same dynamic that Turing used for biological pattern formation is at the basis of the particulation of the vocal tract into parts that are macroscopically constricted (in analogy with the black stripe on the zebra) and parts that are unconstricted (in analogy with the white background on the zebra skin), based on the computations of many microscopic neural and muscular units. I would also like to add that even though Turing's work was entirely mathematical, with no biochemical proof of any sort, the last two decades of work in laboratories throughout the world have *proven* biochemically that many animal biological structures form in this way: hair follicles; digits, where the five fingers and toes on each hand and foot emerge from a uniform stump through the reaction and diffusion of particular proteins in known amounts to create fingers (in analogy to black stripes) on a background of interfinger notches (white background); cortical folds in the brain; teeth; and many other systems (Cartwright, 2002; Meinhardt, 2012; Sheth et al., 2012; Cooper et al., 2018). Therefore, the reaction-diffusion computational paradigm definitely seems to be relevant to biology, and could potentially be relevant to the biological behaviors of language and speech.

Diffusion had been understood for many decades. To understand how diffusion works, imagine a sheet of units, with a variable *A* defined at each unit, where the value of *A* can change in time, and can be different at contiguous units. The dynamic of diffusion says that the value of *A*, at each step of time, should become more similar to the average value of *A* in the local surroundings of the unit. If this is the case, *diffusion* will be said to have occurred. Mathematically, this is expressed as:<sup>2</sup>

$$\frac{\partial A}{\partial t} = D \nabla^2 A \tag{1}$$

which says that the change of *A* with respect to time at each unit is directly proportional to the discrepancy between the value of *A* at the unit and the average value of *A* in the immediate surroundings (symbolized by ∇²*A*, the Laplacian of *A*)<sup>3</sup>, and to *D*, the diffusion coefficient. To get a feel for the latter factor, imagine ink diffusing in water vs. oil. Diffusion is much faster in water (*D* is high), and very slow in oil (*D* is very low). A graphical example of diffusion can be seen in **Figure 1** in one dimension. At the initial step of time, the middle unit has *A* = 1, and all the other units are set to 0. At each subsequent frame in time, diffusion leads to less *A* where *A* had been high, since the surroundings have very low *A* (therefore their average is low), and *A* increases at points where the surrounding average is greater than their value. As time evolves, contiguous units acquire values closer to the average of their surroundings, so that the initial nonuniformity of *A* = 1 in one location and *A* = 0 at all other locations is eventually replaced by an increasingly uniform distribution across the units.

<sup>2</sup> There are tutorials on dynamical systems analysis intended for linguists in Sorenson and Gafos (2016) and Iskarous (2017). The meanings of equations presented here are described following the presentation of each equation; however, the reader interested in the details can consult these papers for background on dynamics.

What is easy to see is that diffusion erases structure, which is the way it had been understood for decades, and that is why it was surprising that Turing was able to prove that it was essential for morphogenesis in biology.
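The smoothing behavior described for Equation 1 can be sketched numerically. The following is a minimal illustration only, not the simulation behind **Figure 1**: the function name `diffuse`, the rate `d = 0.2`, the 21-unit grid, and the ring (wrap-around) boundary are all assumptions made for this example. Each unit moves toward the average of its two neighbors, which is the discrete counterpart of the Laplacian term.

```python
def diffuse(a, d=0.2, steps=1):
    """Explicit finite-difference diffusion on a 1D ring of units.

    a     : list of A values, one per unit
    d     : hypothetical discrete diffusion rate (kept below 0.5 for stability)
    steps : number of time frames to advance
    """
    n = len(a)
    for _ in range(steps):
        # a[i-1] + a[i+1] - 2*a[i] is twice the gap between the
        # neighbor average and the unit's own value (discrete Laplacian).
        a = [a[i] + d * (a[i - 1] - 2 * a[i] + a[(i + 1) % n]) for i in range(n)]
    return a

# Initial condition as in the text: A = 1 at the middle unit, 0 elsewhere.
units = [0.0] * 21
units[10] = 1.0

history = [units]
for _ in range(3):
    history.append(diffuse(history[-1], steps=50))

peaks = [max(h) for h in history]    # height of the bump at each snapshot
totals = [sum(h) for h in history]   # total amount of A at each snapshot
print(peaks)
print(totals)
```

In this sketch the bump gets lower at every snapshot while the total amount of *A* stays constant: structure is erased, but nothing is created or destroyed.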

Even more surprising is that the other factor, reaction, also leads to uniformity. Imagine that at every point of a domain there are two substances, one we will call *A*, the Activator, and the other we will call *I*, the Inhibitor. The amounts of these substances are the values of the variables *A* and *I*. At each moment in time and at each location, if *A* is positive, then *A* increases with respect to time, so it is an Autoactivator. If *A* is positive, it also activates *I*, so *I* increases with respect to time. But when *I* is positive, *A* decreases. So *A* activates itself and *I*, but *I* inhibits *A*. This is therefore an Activator-Inhibitor reaction happening at each location. Mathematically, this reaction can be written as:

$$\begin{aligned} \frac{dA}{dt} &= aA - bI \\ \frac{dI}{dt} &= cA - dI \end{aligned} \tag{2}$$

This means that *A* increases with respect to time if *A* is positive, and decreases if *I* is positive. The coefficients *a* and *b* indicate how strongly *A* and *I*, respectively, contribute to the rate of change of *A*. *I*, on the other hand, increases with respect to time if *A* is positive, but decreases if it is itself positive (the last aspect is not essential). **Figure 2** shows an example of the reaction in Equation 2, with *a* = *c* = 1, and *b* = *d* = 0.5. The initial conditions are the same as for the diffusion in **Figure 1**. Time is counted in frames, and the *A* variable, here and in the rest of the paper, is in arbitrary units.

As we saw with diffusion, the activator-inhibitor reaction in this (the generic) situation leads to loss of the structure present in the initial condition. How, then, is it possible that an activator-inhibitor reaction combined with diffusion, both of which lead to uniformity, usually called *equilibrium*, can lead to the birth rather than the death of nonuniformity and structure?

Turing's ingenious realization (Turing, 1952) was that if both *A* and *I* react in a way similar to Equation 2, and both diffuse, but *A* is slower to diffuse than *I*, i.e., *I* has a higher diffusion coefficient than *A*, then structure and particulation are born. The physical intuition is as follows: (1) if *A* is positive at a location, it autocatalyzes itself, and starts to diffuse slowly, also increasing *I*, which diffuses faster than *A*; (2) the fast diffusion of *I* leads to spots, outside of the region where *A* is most highly concentrated, where *I* is significant enough to inhibit *A* from spreading; therefore, *A* is confined to a macroscopic region of concentration. This region can be a stripe, a spot, a finger, a hair, or a tooth. In one dimension a small random bump of *A* grows into a whole region where *A* is high and surrounding regions where *A* is negligible, but *I* is high, a situation we refer to as spatial particulation. An example of Turing pattern formation is shown in **Figure 3**, where the initial values are, again, all 0 except for a bump of *A* in the middle. Instead of decaying to equilibrium, the bump, after suffering some decay, develops into a self-enhancing nonuniform, particulated pattern. Peaks of the inhibitor alternate with peaks of the activator. The reaction-diffusion equations simulated in **Figure 3** will be discussed at length in the next section.

FIGURE 1 | Diffusion as loss of structure and gain of uniformity.

<sup>3</sup> The discrepancy interpretation of the Laplacian ∇²*A* is due to Maxwell (1871).

Turing pattern formation describes the birth of global macroscopic nonuniformity from uniform local interactions of microscopic units. The fact that the interactions are uniform and local is quite significant: reaction involves locality in time, since the value of each point is affected by its immediate past and the past of the other substance, and diffusion involves locality in space, since the values of *A* and *I* are affected by the average of their local neighbors. The uniformity is that of the diffusion and reaction coefficients, which are the same everywhere. However, out of these uniform interactions, macroscopic regional differentiation, or pattern, emerges.

The idea advanced in this paper is that the spatial nonuniformity that emerges from local uniform Turing computations can also explain the spatial differentiation of the vocal tract into constricted and unconstricted areas. *A* in this case is interpreted as the constrictedness of the vocal tract. The computations involved here would involve many thousands of topographically organized neural units, where each unit controls the amount of opening at one point of the vocal tract (the area function). Evidence of topographic organization of parts of the brain that control tongue and lips is plentiful in the investigation of primates, including humans, on the cortical and subcortical levels (Waters et al., 1990; Choi et al., 2012; Kuehn et al., 2017). This is all relevant only if neural computations can lead to a Turing pattern, and that this is possible was shown several decades ago by Ermentrout and Cowan (1979). In their model, the activator and inhibitor are the excitatory and inhibitory neurons well-known in neuroscience since the work of Sharrington (1932). The model assumed is depicted in **Figure 4**. The Excitatory/Activation units are in red, and the Inhibitory ones are in blue, and there could be many hundreds, if not thousands, of such units. Horizontal black arrows show the local diffusion connections for one neuron, and the vertical black connections show the local reaction connections for the same neuron. The units, as we know for units in different brain areas, are topographically organized: each unit activates/constricts one point of the vocal tract, and the next neural unit activates the next point in the vocal tract. Crucially, the vocal tract constriction is spatially macroscopic, compared to the spatial extent of the microscopic units. Therefore, a single global, macroscopic constriction may be due to cooperative activity among hundreds, if not thousands, of tiny units interacting *locally* in space and time. The interactions, moreover, are assumed to be, so far, of the reaction-diffusion dynamic generating Turing patterns.

However, what the Turing pattern predicts is spatial differentiation in which the spatial pattern persists for all future time, and that is not, of course, what we see in speech, where constrictions are also delimited in time. In the rest of this section, we present how temporal particulation arises. A way to particulate continuous time into periods of various durations was already known to Huygens in his invention of the pendulum clock. The linear nucleus of this model is what Meinhardt (1982) called an Activator-Substrate Model, but is more usually termed the Harmonic Oscillator, shown in Equation 3:

$$\begin{aligned} \frac{dA}{dt} &= bS\\ \frac{dS}{dt} &= -cA \end{aligned} \tag{3}$$

This is a reaction equation in which *A* increases if *S* is positive, but when *A* grows positively, *S* decreases (this kind of reaction can also be the nucleus of a Turing pattern, as shown by Meinhardt, 1982). **Figures 5A,B** show a simulation of the values of *A* and *S*, respectively, of Equation 3 with *b* = 4 and *c* = 1. We see that, as time increases, *A* and *S* rise and fall periodically, each cycle of growth and waning marking out one period, so that continuous time is demarcated into periods of equal length. It can also be seen that *A* and *S*, as they instantiate the constraints imposed by Equation 3, alternate in growth and decline, i.e., they are out-of-phase.

In the simulation of **Figure 5**, both *A* and *S* were started with the value of 1 at the initial time. If the initial values of *A* and *S* were smaller, such as 0.5, then *A* would oscillate up and down with a proportionally smaller amplitude, and if the initial values were larger, such as 10, then the extremes of *A* would be proportionally larger. Simulations of those two situations are in **Figures 6A,B**. We therefore say that the linear oscillator in Equation 3 is highly sensitive to the initial conditions, as its oscillatory amplitude is not stable, but varies with the initial amplitude. This can be seen clearly in **Figure 6C**, where the initial values are random. The oscillations at each point have different maximal amplitudes determined by the initial random values.
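The dependence of the oscillation amplitude on the starting values can be checked with a short numerical sketch of Equation 3, using *b* = 4 and *c* = 1 as in the text. The function name `simulate`, the fourth-order Runge-Kutta integrator, the step size, and the run length are choices made for this illustration and are not taken from the paper's simulations.

```python
def simulate(a0, s0, b=4.0, c=1.0, dt=0.001, t_end=10.0):
    """Integrate dA/dt = b*S, dS/dt = -c*A and return the largest |A| seen."""
    def deriv(a, s):
        return b * s, -c * a

    a, s = a0, s0
    peak = abs(a)
    for _ in range(int(round(t_end / dt))):
        # Classical fourth-order Runge-Kutta step.
        k1a, k1s = deriv(a, s)
        k2a, k2s = deriv(a + 0.5 * dt * k1a, s + 0.5 * dt * k1s)
        k3a, k3s = deriv(a + 0.5 * dt * k2a, s + 0.5 * dt * k2s)
        k4a, k4s = deriv(a + dt * k3a, s + dt * k3s)
        a += dt * (k1a + 2 * k2a + 2 * k3a + k4a) / 6
        s += dt * (k1s + 2 * k2s + 2 * k3s + k4s) / 6
        peak = max(peak, abs(a))
    return peak

# Same initial value for A and S, as in the text: 0.5, 1, and 10.
amplitudes = {x0: simulate(x0, x0) for x0 in (0.5, 1.0, 10.0)}
for x0, amp in amplitudes.items():
    print(x0, amp)
```

Because the system is linear, scaling the initial values by some factor scales the whole trajectory by the same factor, so the oscillator has no preferred amplitude of its own.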

Huygens knew that a real clock needs to have oscillatory properties that are insensitive to the initial conditions, and the result was a more complex nonlinear version of Equation 3, which will be discussed in the next section, that will oscillate with a certain frequency and amplitude, regardless of how it initially started. **Figure 6D** shows an instance of such a Limit Cycle, or Hopf Pattern, where the amplitude at the initial time is random. Regardless of the initial value, it can be seen that the oscillation develops to a stable value, demarcating time into equal increments, like a clock. These types of oscillations have been shown by Wiener and Rosenblueth (1946), Winfree (1980), and others to have extensive applications to biological systems. And some of the earliest applications of Hopf patterns were to the modeling of neural oscillations (Wiener, 1958). Therefore, just as with Turing pattern formation, there is evidence that neural systems are capable of generating these patterns. Hopf patterns have, of course, also been made use of for understanding timing in speech production research (Byrd and Saltzman, 1998), but that work uses these oscillators to denote planning oscillators at a much higher mental level, whereas here, the *A* variable will be interpreted literally as the amplitude of opening of the vocal tract at specific locations *x* in the vocal tract.
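Insensitivity to initial conditions can be illustrated with the textbook normal form of a Hopf oscillator. This is a stand-in chosen for brevity, not the nonlinear system used in this paper; the parameters `mu` and `omega`, the forward-Euler integrator, and the function name `final_radius` are all assumptions made for this sketch. In this normal form the oscillation radius is attracted to the square root of `mu`.

```python
import math

def final_radius(x, y, mu=1.0, omega=2.0, dt=0.001, t_end=30.0):
    """Integrate the Hopf normal form and return the final oscillation radius.

    dx/dt = mu*x - omega*y - (x^2 + y^2)*x
    dy/dt = omega*x + mu*y - (x^2 + y^2)*y
    """
    for _ in range(int(round(t_end / dt))):
        r2 = x * x + y * y
        dx = mu * x - omega * y - r2 * x
        dy = omega * x + mu * y - r2 * y
        # Forward-Euler step; dt is small enough for these start values.
        x, y = x + dt * dx, y + dt * dy
    return math.hypot(x, y)

# Very different starting amplitudes settle on nearly the same radius.
radii = [final_radius(x0, x0) for x0 in (0.05, 0.5, 5.0)]
print(radii)
```

Unlike the linear oscillator of Equation 3, small oscillations grow and large ones shrink until they reach the same cycle, which is the clock-like property described above.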

# SIMULTANEOUS TURING AND HOPF PATTERNING

Speech production is built on particulation in both space and time, or as physicists would put it, necessitates the breaking of both spatial and temporal symmetry. The Turing and Hopf patterns seen in **Figures 3, 6D** can each be generated by a multitude of different differential equations (Cross and Hohenberg, 1993). However, it is very difficult to find equations that simultaneously exhibit spatial and temporal patterning of the Turing and Hopf types,<sup>4</sup> which is the situation necessary for modeling speech production, since we need both spatial and temporal demarcation. What we need are equations that admit of what are usually called Turing-Hopf interactions (Walgraef, 1997). However, many of the cases discussed in the literature under the banner of Turing-Hopf Bifurcations yield patterns too different from those we find in speech. An extensive search in the literature has yielded one set of reaction-diffusion equations that are a useful starting point, in the opinion of the author, for studying the kind of spatiotemporal particulation we find in speech. It is not expected, by any means, that this is the only useful reaction-diffusion system that exhibits the patterning we need, but it is an interesting starting point. These equations are called Barrio-Varea-Aragon-Maini (BVAM) for their discoverers, Barrio et al. (1999). They are listed in Equation 4:

$$\begin{aligned} \frac{du}{dt} &= g\left(u + av - Cuv - uv^2\right) + D_u\nabla^2 u\\ \frac{dv}{dt} &= g\left(hu + bv + Cuv + uv^2\right) + D_v\nabla^2 v \end{aligned} \tag{4}$$

*u* and *v* are the activator and inhibitor variables, respectively, which interact in a nonlinear manner. Within the parentheses on the right-hand side, the first two terms are linear, the third term is quadratic, and the fourth is cubic. The signs of the cubic terms show that (from the first equation) when *v* is large and positive, *u* will decrease, and (from the second equation) when *u* is large and positive, *v* increases, confirming the activator-inhibitor nature of the reaction. The linear and quadratic terms in the equations, and the coefficients *a*, *b*, *C*, *g*, *h*, modulate the basic activator-inhibitor reaction. The last term in each equation is the diffusion term, and the most important condition for a Turing pattern to emerge is that *D<sub>v</sub>* > *D<sub>u</sub>*. When we set *a* = 2.513, *C* = 2, *g* = 0.199, *h* = −1, *D<sub>u</sub>* = 0.122, and *D<sub>v</sub>* = 1, the value of *b* determines whether we get a Turing pattern only (*b* = −1.95), a Hopf pattern only (*b* = −0.85), or simultaneous Turing and Hopf patterns (*b* = −0.95). The analysis showing the influence of *b* is due to Leppänen et al. (2003). Simulations of Equation 4 with the values just discussed can be seen in **Figure 7**.

<sup>4</sup> I am excluding, here, traveling wave patterns, since these are not found in speech production (Iskarous, 2005), even though they may be involved in swallowing.
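The dependence on *b* can be checked with a standard linear stability calculation about the uniform state *u* = *v* = 0. The sketch below is not the analysis of Leppänen et al. (2003) itself; it is a minimal illustration that assumes the coefficient values quoted in the text, and the names `growth_rates` and `classify`, the wavenumber grid, and the two classification criteria are illustrative choices. A perturbation with wavenumber *k* grows if an eigenvalue of the Jacobian minus *k*² times the diffusion matrix has a positive real part.

```python
import numpy as np

# Coefficient values quoted in the text; b is the control parameter.
A, G, H, DU, DV = 2.513, 0.199, -1.0, 0.122, 1.0

def growth_rates(b, k):
    """Eigenvalues of Equation 4 linearized about u = v = 0
    for a perturbation with spatial wavenumber k."""
    jacobian = G * np.array([[1.0, A], [H, b]])
    return np.linalg.eigvals(jacobian - k**2 * np.diag([DU, DV]))

def classify(b, k_max=2.0, n_k=401, tol=1e-12):
    """Return (turing, hopf) flags for a given b (illustrative criteria)."""
    lam0 = growth_rates(b, 0.0)
    # Hopf: the spatially uniform mode (k = 0) grows while oscillating.
    hopf = bool(np.any((lam0.real > tol) & (np.abs(lam0.imag) > tol)))
    turing = False
    for k in np.linspace(0.0, k_max, n_k)[1:]:
        lam = growth_rates(b, k)
        # Turing: some mode with k > 0 grows without oscillating.
        if np.any((lam.real > tol) & (np.abs(lam.imag) <= tol)):
            turing = True
            break
    return turing, hopf

for b in (-1.95, -0.85, -0.95):
    print(b, classify(b))
```

The coefficient *C* does not appear because the quadratic and cubic terms vanish when Equation 4 is linearized about the origin. Linear analysis only indicates which instabilities are present; the pattern that finally forms depends on the nonlinear terms.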

To reiterate, the simultaneous presence of both Turing and Hopf patterns is very rare (non-generic, or structurally unstable, in the mathematical sense), and is due to the exact value of *b*. If that value is changed slightly, either a Hopf-only or a Turing-only pattern is obtained. What we have in **Figure 7C** is therefore a very special situation, in which local interactions in space and time among many microscopic units yield a global pattern where space and time are demarcated into spatial and temporal intervals. However, all three of the patterns in **Figure 7** are quite stable, as can be seen from the fact that the simulations start from random initial values but still reach stable patterns. The claim here is that this is relevant to speech: it is quite possible that the planning of the motion of points in the vocal tract, if it is done *via* reaction-diffusion-type local uniform computations, could yield the type of particulation we find in speech production, provided that excited neural units seek to constrict the vocal tract and inhibitory neural units seek to open the vocal tract.
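A simulation of this kind can be sketched as follows. This is not the code used to produce **Figure 7**; it is a minimal one-dimensional illustration that assumes forward-Euler time stepping, a ring of points with periodic boundaries, and a particular grid spacing, time step, and noise amplitude, none of which are specified in the text. The function names are hypothetical.

```python
import numpy as np

def laplacian_1d(f, dx):
    """Second difference with periodic boundaries (an assumed boundary condition)."""
    return (np.roll(f, 1) + np.roll(f, -1) - 2.0 * f) / dx**2

def simulate_bvam(b, n=128, steps=2000, dt=0.05, dx=1.0, seed=0,
                  a=2.513, C=2.0, g=0.199, h=-1.0, Du=0.122, Dv=1.0):
    """Forward-Euler integration of Equation 4 on a 1-D ring of n points.

    Returns an array of shape (steps + 1, n) holding u at every time step.
    """
    rng = np.random.default_rng(seed)
    u = 0.01 * rng.standard_normal(n)   # small random initial values
    v = 0.01 * rng.standard_normal(n)
    history = np.empty((steps + 1, n))
    history[0] = u
    for i in range(steps):
        cubic = u * v**2
        du = g * (u + a * v - C * u * v - cubic) + Du * laplacian_1d(u, dx)
        dv = g * (h * u + b * v + C * u * v + cubic) + Dv * laplacian_1d(v, dx)
        u, v = u + dt * du, v + dt * dv
        history[i + 1] = u
    return history

# Rows are time steps and columns are positions along the ring.
u_history = simulate_bvam(b=-0.95, steps=200)
print(u_history.shape)
```

Displaying the returned array as an image gives a space-time diagram. Because growth from small random values is slow, many more steps than in this short example would be needed before any pattern could be compared with the figure, and the step sizes may need adjusting for long runs.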

Two pieces of evidence support the relevance of the simultaneous occurrence of Turing and Hopf patterns to speech; both indicate that the dynamics of constrictions, not just their presence, seem to be reproduced by the dynamics of Equation 4. The first is qualitative. Empirical studies of tongue motion in English and French by Iskarous (2005) showed that constrictions in speech are formed and relaxed in the same location, a pattern termed *pivoting*. In the production of [ia], for instance, the [i] constriction relaxes within the palatal region, while the [a] constriction forms in the pharyngeal region. It was shown that there is very little change in the area function elsewhere in the vocal tract, including the areas between the palatal and pharyngeal regions. It may seem that this is trivial and could not be otherwise, but we could imagine the palatal constriction moving down the vocal tract as a traveling wave, fully formed, to the pharyngeal region. Indeed, the tongue is capable of generating such motion, since during swallowing a traveling wave of muscular activation pushes the bolus down the vocal tract with a constriction-like pusher. However, investigation of hundreds of transitions between different speech segments showed that actual constriction formation (Iskarous, 2005) is more like a standing wave, where the formation and relaxation of constrictions occur in the same place. That is indeed the pattern we see in **Figure 7C**: the constriction peaks and troughs do not travel, but form and relax in the same location. Iskarous et al. (2010) have shown that the pivot dynamic plays a role in the perceptual system's judgment of the naturalness of speech.

Another well-studied aspect of speech dynamics is how constriction degree varies as time progresses. The initial hypothesis of Fowler et al. (1980) and Saltzman and Munhall (1989) was that the gestural dynamic is linear, second-order, and critically damped, but Perrier et al. (1988) and Sorensen and Gafos (2016) showed that the peak velocity in actual speech movements occurs about halfway through the interval from lowest position amplitude to target position achievement, whereas the critically damped second-order system predicts a much earlier proportional time to peak velocity (0.2). **Figure 8** shows the position and velocity of *A* predicted through simulation of Equation 4. Proportional time to peak velocity is 0.49, which is close to the value observed and predicted by a cubic nonlinear dynamic in Sorensen and Gafos (2016).
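The early velocity peak of the critically damped linear system can be illustrated with its closed-form solution. The sketch below assumes a movement that starts at rest and a target at 1, and it assumes that the movement ends when the position first reaches 95% of the target; that criterion, the function name, and the default values are illustrative choices, not taken from the cited studies.

```python
import numpy as np

def proportional_time_to_peak(omega=1.0, settle=0.95, n=200001):
    """Time of peak velocity divided by movement duration for a critically
    damped second-order system starting at rest, with the target at 1.

    Movement end is taken to be the first time the position reaches the
    fraction `settle` of the target (an assumed criterion).
    """
    t = np.linspace(0.0, 20.0 / omega, n)
    position = 1.0 - (1.0 + omega * t) * np.exp(-omega * t)
    velocity = omega**2 * t * np.exp(-omega * t)
    t_peak = t[np.argmax(velocity)]            # occurs at t = 1 / omega
    t_end = t[np.argmax(position >= settle)]   # first sample at or above settle
    return t_peak / t_end

print(round(proportional_time_to_peak(), 3))
```

Under this criterion the ratio is about 0.21 whatever the natural frequency `omega`, in line with the value of roughly 0.2 quoted above. A stricter settling criterion gives a smaller ratio, so the exact figure depends on how movement offset is defined.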

The reason that the model is able to predict the late velocity peak is that the reaction dynamics of the reaction-diffusion model in Equation 4 are nonlinear, as in the model proposed by Sorensen and Gafos (2016). What have we gained through the proposed model, then, if one can already predict the late peak velocity through that earlier model? The Sorensen and Gafos model, like the Saltzman and Munhall Task Dynamic model, is a model of point dynamics and by default is supposed to apply to all vocal tract gestures. Particulation in space in these models is postulated, due to the inherent particulation of point dynamics. The proposed model is for the entire area function, and predicts particulation in space rather than stipulating it. One can of course say that the current model stipulates particulation through the specific constants and dynamics in Equation 4. However, reaction-diffusion dynamics can lead to equilibrium solutions, so particulation is a possible feature of solutions but not a necessary one, whereas when a point dynamic is chosen, particulation in space is not only possible but necessary. Furthermore, a prediction of the BVAM model, with the chosen coefficients, is that Constriction Degree and Constriction Location have entirely different dynamics: Constriction Degree is reached gradually with a late peak velocity, as just discussed, whereas Constriction Location changes using a pivoting dynamic, shifting discretely from one location to another, as in the earlier discussion of pivoting. This prediction is not shared with earlier models, which have no reason to predict that Constriction Location and Degree differ in their dynamics. In the current model, location and degree occupy ontologically disparate parts of the mathematical architecture of the model. Constriction Location refers to a setting of the *independent variable* of position that happens to have a large amplitude due to Turing and Hopf pattern formation, whereas Constriction Degree refers to the large amplitude itself, not its location. The difference in dynamics is due to the ontological difference. We do not take this work to be a rejection of Task Dynamics or subsequent models inspired by it (e.g., Sorensen and Gafos, 2016; Tilsen, 2019), but a deepening of its predictive logic that is better able to predict major aspects of actual speech dynamics.

#### DISCUSSION AND CONCLUSIONS

This paper contributes to the literature on motor control as a dynamical phenomenon, initiated by Bernstein (1967) and Feldman (1966), and extended to speech by Fowler et al. (1980), by showing how the low-degree-of-freedom tasks of a motor control system are obtained *via* a dynamical computational process that starts out with a very large number of degrees of freedom (see also Roon and Gafos, 2016; Tilsen, 2019). The contribution is to isolate the high-degree-of-freedom system, the low-degree-of-freedom system, and the extremely specific BVAM dynamical process as a candidate dynamical system that starts with the former and ends with the latter. The evidence advanced in the last section is of both abstract and concrete types. A subtle abstract property of the phonological act of speech production is that gestures begin and end in time and are localized in space as *constrictions*. The simultaneous presence of Turing and Hopf bifurcations achieves the segmentation in space and time that we see in speech. In the Task Dynamic program, for instance, the tasks are almost all categorized in terms of constriction locations and degrees, just as most phonological feature theories are structured into place and manner features. The current theory explains why the location/place and degree/manner distinctions are so prevalent: it is due to particulation. In addition, two highly concrete properties of how constrictions actually form and relax, one qualitative (pivoting) and the other quantitative (proportional time to peak velocity of approximately 0.5), are reproduced by the simultaneous Turing-Hopf BVAM model. The only thing that had to be postulated is that the neural planning units affect the closure of the vocal tract at different points, but this is the usual assumption about what motoneurons do. The computational system presented answers the *how* question (material and formal cause), and not the *why* question (final cause), of particulation.
I believe that two of the main final causes for particulation are the ones discussed in the introduction: (1) to allow for many words built from a few basic particles (Abler, 1989; Studdert-Kennedy and Goldstein, 2003); (2) to allow the vocal tract to act as an acoustic signaling device (Stevens, 1972, 1989; Mrayati et al., 1988).

However, even though this model is hopeful in that it explains some subtle and other concrete properties of speech production, it is by itself not sufficient, and has fatal shortcomings as a comprehensive model. First, it needs to be shown that manipulation of the constants of the system can produce actual words of actual languages, which has not been done here. This is unlikely to be doable with this system, since, as can be seen in **Figure 7C**, the particulation in both space and time is too periodic to be of use in describing actual words in actual languages. This model can almost be seen as a model of a *gagagaga* stage of CV babbling, and the search needs to continue for other reaction-diffusion systems with simultaneous Turing and Hopf instabilities, or interactions between those two instabilities, that take us beyond the *gagagaga* stage of CV babbling by adding *controlled* variation in constriction location and degree. In the field of phonetics this may seem to disqualify this model entirely, but work in mathematical physics has for centuries sought to understand complex phenomena, many of which are far less complex than speech, by proposing models that first explain simple abstract properties of the phenomenon, and that is the approach taken here. Second, the model does not cease. It needs to become clear how the model can stop and produce a single word with just a few changes in constrictions in space and time. Therefore, part of the search for a refinement of the current model needs to take into consideration how the model can produce word-length actions, then stop, and start again. Third, some fundamental properties of speech having to do with prosody have not been mentioned; however, there have been other dynamical approaches to prosody (Goldsmith and Larson, 1990; Prince, 1993; Goldsmith, 1994; Iskarous and Goldstein, 2018) that we believe can be combined with this model, since equations with Turing and Hopf pattern solutions have a multiscale structure (Kuramoto, 1975; Walgraef, 1997) that is actually quite similar to syllabification and metricity in speech as modeled by Goldsmith. Fourth, the actual spatial and temporal extents of speech gestures are known from many observational experiments, whereas the current model, using arbitrary units, does not generate these observed extents; however, rescaling of the variables is likely to allow the current model, or improvements of it, to match the macroscopic scales of speech. Fifth, even though there is plenty of evidence that neural computation is capable of generating Turing and Hopf patterns separately, the BVAM architecture used has not, to my knowledge, been argued independently to be a model of neural computations. One reaction-diffusion approach to cortical organization that has extensively used Turing, Hopf, and Turing-Hopf patterns as a foundation for macroscopic brain function has been presented in a series of papers by Steyn-Ross and her colleagues (Steyn-Ross and Steyn-Ross, 2013, 2017; Wang et al., 2014). In this work, the excitatory agent is the neurotransmitter acetylcholine (ACh) and the inhibitory agent is GABA. Diffusion takes place through gap junctions, which communicate electrical signals between connected neurons.
This group has specifically been quantitatively modeling the global EEG waves that accompany the different stages of non-REM sleep. The authors show that measured EEG signals are predicted by simulations of models of neural interaction that are simplified by taking the input to each neuron to be not the specific other neurons that innervate it, but the mean of the entire network it is in. This *mean-field approximation* seems drastic, but is quite common in physics, and allows for prediction of actual solutions of a network as complex as a brain. Work by Cowan and his colleagues has tried to use more realistic approximations (Buice and Cowan, 2007). The work of Steyn-Ross specifically shows that a Turing-Hopf bifurcation of the mean-field approximation plays a major role in the brain computations indicative of sleep. Therefore, even though we have not attempted to provide a brain model that actually predicts the particulation we find in speech, there is some partial support for the possibility that Turing-Hopf patterns have a role to play in neural computation.

The nature of the microscopic units in this model is also uncertain. The conjecture we have offered so far is that the units are neural in nature. Another possibility is that the reaction-diffusion equations to be sought are actually for the motions of tense hydrostatic muscle. Stoop et al.'s (2015) work on nonlinear elasticity theory has argued that geometrically and materially nonlinear material, the kind of material we know the tongue and other speech organs to be (Wilhelms-Tricarico, 1995), can yield reaction-diffusion-type equations of the Swift-Hohenberg type, and it is known that this type of equation has Turing and Hopf bifurcations (Cross and Hohenberg, 1993; Walgraef, 1997). Therefore, there could be a second conjecture that the equations sought are physical, not neural, in nature. There could even be a third conjecture, the one we expect is most likely to correspond to reality, in which the equations are of a combined neural and muscular nature, non-dualistically connected.

#### DATA AVAILABILITY STATEMENT

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

#### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and has approved it for publication.

#### REFERENCES

Kuramoto, Y. (1975). *Chemical oscillations, waves, and turbulence*. Berlin: Springer.


wavelength of a Turing-type. *Science* 338, 1476–1480. doi: 10.1126/science.1226804


Stevens, K. N. (1989). On the quantal nature of speech. *J. Phon.* 17, 3–46.


Zur, K., Mu, L., and Sanders, I. (2004). Distribution pattern of the human lingual nerve. *Clin. Anat.* 17, 88–92. doi: 10.1002/ca.10166

**Conflict of Interest:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Iskarous. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Economy of Effort or Maximum Rate of Information? Exploring Basic Principles of Articulatory Dynamics

#### Yi Xu<sup>1</sup> \* and Santitham Prom-on<sup>2</sup>

<sup>1</sup> Department of Speech, Hearing and Phonetic Sciences, University College London, London, United Kingdom, <sup>2</sup> Department of Computer Engineering, King Mongkut's University of Technology Thonburi, Bangkok, Thailand

#### Edited by:

Pascal van Lieshout, University of Toronto, Canada

#### Reviewed by:

Volker Dellwo, University of Zurich, Switzerland; Jason A. Shaw, Yale University, United States

\*Correspondence: Yi Xu, yi.xu@ucl.ac.uk

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 13 March 2019; Accepted: 18 October 2019; Published: 22 November 2019

#### Citation:

Xu Y and Prom-on S (2019) Economy of Effort or Maximum Rate of Information? Exploring Basic Principles of Articulatory Dynamics. Front. Psychol. 10:2469. doi: 10.3389/fpsyg.2019.02469

Economy of effort, a popular notion in contemporary speech research, predicts that dynamic extremes such as the maximum speed of articulatory movement are avoided as much as possible and that approaching the dynamic extremes is necessary only when there is a need to enhance linguistic contrast, as in the case of stress or clear speech. Empirical data, however, do not always support these predictions. In the present study, we considered an alternative principle: maximum rate of information, which assumes that speech dynamics are ultimately driven by the pressure to transmit information as quickly and accurately as possible. For empirical data, we asked speakers of American English to produce repetitive syllable sequences such as wawawawawa as fast as possible by imitating recordings of the same sequences that had been artificially accelerated and to produce meaningful sentences containing the same syllables at normal and fast speaking rates. Analysis of formant trajectories shows that dynamic extremes in meaningful speech sometimes even exceeded those in the nonsense syllable sequences but that this happened more often in unstressed syllables than in stressed syllables. We then used a target approximation model based on a mass-spring system of varying orders to simulate the formant kinematics. The results show that the kind of formant kinematics found in the present study and in previous studies can only be generated by a dynamical system operating with maximal muscular force under strong time pressure and that the dynamics of this operation may hold the solution to the long-standing enigma of greater stiffness in unstressed than in stressed syllables. We conclude, therefore, that maximum rate of information can coherently explain both current and previous empirical data and could therefore be a fundamental principle of motor control in speech production.

Keywords: maximum rate of information, economy of effort, stiffness, peak velocity, target approximation

# INTRODUCTION

# Hypo- and Hyper-Articulation and Physiological Effort

To produce a speech sound, the vocal tract needs to be shaped in such a way that appropriate acoustic patterns are generated to allow listeners to identify the intended phonetic category. The shaping of the vocal tract takes time, and the quality of the sound produced may therefore depend on how much time is available for each sound. If there is too little time, the articulators may not be able to move into place, resulting in undershooting the target. This is known as the undershoot model (Lindblom, 1963), and it was based on the finding that vowel formants in a symmetrical /d\_d/ consonant context vary with the duration of the vowel. Lindblom (1963) attributes such reduction to a constraint on the speed of articulatory movement. He further shows that duration is the main determinant of the reduction, whether the duration change is due to speech rate or degree of stress; that is, stress affects vowel reduction only indirectly, through duration. This undershoot model, however, was questioned by a number of subsequent studies. Also examining formant movements of vowels surrounded by consonants, Gay (1978: 228) concludes that: "differences in vowel duration due to changes in speaking rate do not seem to have a substantial effect on the attainment of acoustic vowel targets." Based on acoustic and electromyographic data, Harris (1978: 355) concludes that the effects of changing stress and speaking rate are independent of each other and that this is in support of the "extra energy" model: "extra energy is applied to the stressed vowel, with the result that it lasts longer, and the signals to the articulators are a little larger, so that the vowel is further from a neutral vocal tract position." An important methodological feature shared by both Gay (1978) and Harris (1978), however, is that the target vowels examined are surrounded by /p/, a consonant known to conflict little with the vowel articulation as far as the tongue is concerned. As pointed out by Moon and Lindblom (1994: 41), "according to the undershoot model, no formant displacements would be expected for adjacent vowels and consonants with identical, or closely similar, formant values."

To address the conflicting data reported subsequent to Lindblom (1963), Moon and Lindblom (1994) examined the formant frequencies of front vowels in English in a /w\_l/ frame at varying durations in clear and casual speaking styles. The large articulatory distances between the vowel and the surrounding consonants indeed led to greater duration dependencies in the formant values than in previous studies, thus reaffirming the early finding of Lindblom (1963). However, they also found that the duration dependency of undershoot is reduced in clear versus normal citation speech, which led to their further conclusion that undershoot is a function of not only vowel duration and locus–target distance but also the rate of formant change, which is presumably faster in clear than in normal speech. The finding that velocity of articulatory movement is greater in clear speech has led to the theorization, known as the H&H theory, that "within limits speakers appear to have a choice whether to undershoot or not to undershoot," and that "avoiding undershoot at short segment durations entails a higher biomechanical cost" (Lindblom, 1990: 417). For this reason, energy saving is proposed as a core mechanism of undershoot in addition to time pressure.

Note, however, that energy saving and time pressure are two very different needs, each with very different implications for the explanation of undershoot. Undershoot due to energy saving would entail an effort reduction that slows down the articulatory movements. Undershoot due to time pressure, in contrast, would entail articulatory movements that are as fast as possible (driven by maximum articulatory force) before being cut short by premature termination. Of the two scenarios, the latter is much less explored than the former, not only because of the popularity of the H&H theory but probably also because the implication of the maximum speed of articulation is too extraordinary to be even worth contemplating. As asserted in Lindblom (1983: 219), "in normal speech the production system is rarely driven to its limits."

# Maximum Speed of Articulation: Is It Really Never Approached?

But there is already evidence that speech production is often driven to its extremes as far as speed of articulation is concerned. Based on a comparison between normal speech and Maximum Repetition Rate (measured in phones per second), Tiffany (1980: 907) concludes that "in some senses we normally speak about 'as fast as we possibly can,' at least in the production of full canonical utterances." Janse (2004) has compared the perceptual word processing speed of Dutch sentences sped up in two ways: (1) by asking the speaker to speak faster, and (2) by computationally time-compressing sentences originally produced at a normal rate. She finds that the perceptual reaction time to the natural-fast sentences is slower than to the computationally time-compressed normal utterances. This finding is further confirmed by Adank and Janse (2009), who show that perception of natural-fast speech has much lower recognition accuracy than does that of time-compressed speech. One likely explanation is that synthetically sped-up speech is still well within the processing speed of the human perceptual system, while naturally speeding up speech forces speakers to reach too many dynamic limits of articulation, and the resulting undershoot is serious enough to impair the quality of information transmission. If this interpretation is right, it is likely that some dynamic limits of articulation are already approached at normal speaking rate. The evidence seen in these studies is somewhat indirect, however. Attempts to more directly compare the performance space of speech and non-speech articulatory movements by using kinematic measurements have produced inconclusive results (Nelson et al., 1984; Perkell et al., 2002). More direct evidence is seen only in the case of F<sub>0</sub> production, where it is shown that the maximum speed of pitch change is indeed often approached (Xu and Sun, 2002; Kuo et al., 2007; Xu and Wang, 2009).
It is therefore necessary to establish more directly than before whether dynamic limits of segmental production are also frequently reached.

# Hyper-Articulation: Does It Overshoot the Target?

It is unlikely, of course, that dynamic limits of articulation are reached all the time in each speech utterance, as it is well known that segment and syllable durations change frequently due to various linguistic functions (Lehiste, 1972; Turk and Shattuck-Hufnagel, 2007; Xu, 2009). In many instances, e.g., at domain-final locations where lengthening regularly occurs (Lehiste, 1972; Klatt, 1975; Nakatani et al., 1981; Edwards et al., 1991; Turk and Shattuck-Hufnagel, 2007), phonetic units show durations that well exceed the amount of time needed for achieving their targets. In those cases, does target overshoot (i.e., going beyond the underlying target) happen? For example, would there be a hyperspace effect for vowels, showing an F1–F2 distribution that exceeds their canonical space? The H&H theory may suggest that this would indeed happen, as an enlarged vowel space would enhance phonological contrast (Lindblom, 1990). A hyperspace effect was reported by Johnson et al. (1993a), although they did not link it to durational changes. Whalen et al. (2004), however, failed to replicate the hyperspace effect.

Also, the notion of overshoot may not be fully compatible with the notion of phonetic target. Lindblom (1963: 1773) defines a vowel target as "an invariant attribute of the vowel" that is "specified by the asymptotic values of the first two formant frequencies of the vowel and is independent of consonantal context and duration." A similar target concept is also seen in the task dynamic model, which assumes that for each articulatory gesture, there is an equilibrium point to which the articulatory state will relax by the end of the gestural cycle (Saltzman and Kelso, 1987; Saltzman and Munhall, 1989). The equilibrium-point hypothesis (Perrier et al., 1996a,b) also assumes a target that is invariant. In none of these models can the target itself be exceeded. There are also models that assume that phonetic targets are not fixed but have variable ranges that can be described as area targets as opposed to point targets (Keating, 1990; Guenther, 1994); this may potentially allow target overshoot. However, the very reason for proposing area targets is to account for variabilities such as undershoot and possible overshoot. If a target itself is an area inclusive of all the variants, the notion of undershoot or overshoot would not make sense, as a phonetic output cannot conceptually be both inside and outside a target at the same time.

There are already some empirical data suggesting that a phonetic target behaves like an asymptote, which can be approached but not exceeded. Nelson et al. (1984) asked subjects to either silently wag their jaws repeatedly, or say "sa sa sa. . .", in both cases going from a very slow rate to as fast as possible. Toward the fast end (i.e., over 120 ms/cycle), both the wagging and the sa-sa-sa movements show a positive correlation between cycle duration and movement size. Toward the slow end, however, both tasks show a clear asymptote, i.e., leveling off at a particular displacement level as movement time is longer than 120 ms. What is remarkable is that the sa-sa-sa asymptote is much lower than the wagging asymptote, indicating that the /a/ target has a jaw opening specification much narrower than the maximal range of jaw opening. Because no formant measurements were reported by the study, however, it is not known whether the asymptote effect is also reflected in the speech signal.

# Economy of Effort: The Stress–Stiffness Enigma

The issue of target overshoot is also related to the problem of how economy of effort can be measured. It is not easy to estimate the total muscular activities involved in speech articulation, and no effort to our knowledge has been made to do so. It is possible, however, to estimate articulatory effort by analyzing the kinematics of articulatory movement. Assuming that an articulatory gesture is a movement toward a phonetic target (Lindblom, 1963; Saltzman and Munhall, 1989; Xu and Wang, 2001), articulatory displacement as a function of time should exhibit a trajectory similar to the one shown in **Figure 1A**, which consists of an initial acceleration phase and a final deceleration phase (Nelson, 1983). The time-varying velocity profile of such a movement should show a unimodal shape (Nelson, 1983; Sorensen and Gafos, 2016), as shown in **Figure 1B**. Nelson (1983) suggests that the peak of such a velocity profile is a good indicator of effort. Peak velocity has been measured in a number of studies (Kelso et al., 1985; Ostry and Munhall, 1985; Vatikiotis-Bateson and Kelso, 1993; Hertrich and Ackermann, 1997; Xu and Sun, 2002; Xu and Wang, 2009). But a common finding is that its closest correlate is movement amplitude (profile height in **Figure 1A**). In fact, the two are almost linearly related: the larger the displacement, the greater the peak velocity. An example is given in **Figure 1C**, which is taken from the present data shown in **Figure 6**. This quasi-linear relation is true whether the measurement is articulatory, i.e., lips in Hertrich and Ackermann (1997), jaw and lips in Kelso et al. (1985), tongue dorsum in Ostry and Munhall (1985), lips and jaw in Vatikiotis-Bateson and Kelso (1993), or acoustic F<sub>0</sub>, as in Xu and Sun (2002) and Xu and Wang (2009). This means that peak velocity cannot directly tell us about articulatory effort, because it is heavily confounded by the amplitude of the corresponding movement. However, the linear relationship also means that the slope of the regression line between peak velocity and movement amplitude may serve as an indicator of effort: the steeper the slope, the greater the underlying articulatory force. Indeed, this slope, measured as the vp/d ratio (peak velocity over displacement), has been referred to as an indicator of gestural stiffness (Kelso et al., 1985; Ostry and Munhall, 1985).

[Figure 1 caption, partially recovered: ... explained in the section "Interpretation Based on Modeling." (C) Peak velocity over displacement for F1, taken from Figure 6 in the present paper.]
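The link between the vp/d slope and stiffness can be made concrete with a simple system. The sketch below is not the target approximation model used in this paper; it assumes a critically damped linear second-order system with natural frequency `omega` (for unit mass, stiffness is `omega**2`), movements that start at rest, and enough time to settle. All function and parameter names are illustrative.

```python
import numpy as np

def peak_velocity_and_displacement(amplitude, omega, duration, n=20001):
    """Peak velocity and final displacement of a critically damped
    second-order movement toward a target `amplitude` away, starting at rest."""
    t = np.linspace(0.0, duration, n)
    decay = np.exp(-omega * t)
    position = amplitude * (1.0 - (1.0 + omega * t) * decay)
    velocity = amplitude * omega**2 * t * decay
    return velocity.max(), position[-1]

def vp_over_d_slope(omega, amplitudes, duration):
    """Slope of the least-squares line of peak velocity on displacement."""
    pairs = [peak_velocity_and_displacement(a, omega, duration) for a in amplitudes]
    vp, d = np.array(pairs).T
    slope, _intercept = np.polyfit(d, vp, 1)
    return slope

amplitudes = [1.0, 2.0, 3.0, 4.0]
print(vp_over_d_slope(10.0, amplitudes, duration=2.0))
print(vp_over_d_slope(20.0, amplitudes, duration=2.0))
```

For this system the peak velocity of a completed movement is `amplitude * omega / e`, so the slope depends only on `omega` and grows with stiffness, whatever the movement amplitude. Whether real articulatory movements follow this linear system is exactly what is at issue in the text.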

When using the vp/d ratio as an indicator of effort, however, a puzzle has emerged. That is, the ratio is repeatedly found to be greater in unstressed syllables than in stressed syllables (Ostry et al., 1983; Kelso et al., 1985; Ostry and Munhall, 1985; Beckman and Edwards, 1992; Vatikiotis-Bateson and Kelso, 1993), as can also be seen in **Figure 1C**. This finding is hard to reconcile with the notion that stressed and unstressed segments vary along a hyper–hypo-articulation dimension (Lindblom, 1990; de Jong et al., 1993; de Jong, 1995; Wouters and Macon, 2002; Janse et al., 2003). This dilemma has been noticed by some studies (Edwards et al., 1991; de Jong et al., 1993; de Jong, 1995), but no consistent solution has been proposed.

In a dynamical system, the fullness of target attainment is jointly affected by stiffness and movement duration. That is, given a stiffness level, the longer the movement duration, the more closely the target is approached by the end of the movement; likewise, given a movement duration, the greater the stiffness, the better the target is attained by the end of the movement. Relating this back to the issue of hyperarticulation: would it be possible that a stressed syllable is often long enough to potentially lead to target overshoot but that there is a restraint on the increase of articulatory force that prevents it? This idea, however, does not seem to be compatible with the fundamental premise of economy of effort and has not been seriously contemplated so far.
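The joint dependence of target attainment on stiffness and duration can be illustrated with a minimal sketch. It assumes a critically damped, unit-mass, second-order spring system starting at rest, which is a simplified stand-in and not the variable-order model used later in this study. The function name `attained_fraction` and the stiffness and duration values are hypothetical choices made for illustration only.

```python
import math


def attained_fraction(stiffness, duration):
    """Fraction of the distance to the target covered after `duration` seconds.

    Assumes a critically damped, unit-mass spring starting at rest:
    x(t) = target + (x0 - target) * (1 + w*t) * exp(-w*t), with w = sqrt(stiffness).
    """
    omega = math.sqrt(stiffness)
    return 1.0 - (1.0 + omega * duration) * math.exp(-omega * duration)


# Illustrative values only (not taken from the data in this study).
for stiffness in (400.0, 1600.0):
    for duration in (0.08, 0.16):
        frac = attained_fraction(stiffness, duration)
        print(f"stiffness={stiffness:7.1f}  duration={duration:.2f} s  attained={frac:.3f}")
```

Under these assumptions the attained fraction grows monotonically with both stiffness and duration and never exceeds 1, so this particular system approaches the target without overshooting it.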

# Maximum Rate of Information – An Alternative Principle

The literature review so far suggests that at least some of the assumptions behind the principle of economy of effort are open to question. It is doubtful that speakers always stay safely away from dynamic extremes such as maximum speed of articulatory movement. It is also doubtful that target overshoot is an articulatory means of achieving stress and clarity. Most critically, a solution is overdue for the enigma that stress is associated with lower rather than higher measured stiffness. As an alternative, here we would like to propose that, in speech production, there is a higher priority than the need to conserve energy: a pressure to transmit as much information as possible in a given amount of time. This can be referred to as the principle of maximum rate of information. This principle contrasts with economy of effort in a number of ways:


There is little doubt that, as a communication system, human speech is highly efficient (Hockett, 1960). A strong case is made in the seminal work of Liberman et al. (1967), which recounts the many efforts in the early 1960s to develop a coding system to convert printed text to non-speech sounds that could be used by blind people. It turned out that none of the systems developed exceeded the transmission rate of Morse code. Yet the transmission rate of Morse code was only slightly higher than 10% of that of human speech. Liberman et al. (1967) attributed the efficiency of speech to humans' remarkable ability to perceptually decode coarticulation, and this interpretation eventually led to the motor theory of speech perception (Liberman and Mattingly, 1985). But it ought to be recognized that the efficiency of coding has to be rooted first in speech production, as coarticulation is first and foremost an articulatory phenomenon. In addition, coarticulation can be only one of the reasons why speech coding is so efficient, as the speed of articulation, at the least, also needs to be fast enough. The maximization of the rate of information transmission, therefore, may be the ultimate driving force behind many phenomena in speech, including, in particular, both undershoot and effort reduction.

# The Present Study


As can be seen from the foregoing discussion, the popular notion of economy of effort predicts that dynamic limits of articulation are seldom approached, because "in normal speech the production system is rarely driven to its limits" (Lindblom, 1983: 219), in order to save energy. It further predicts that extra articulatory effort is made only in the case of stress, for the sake of enhancing phonetic contrasts. Exactly the opposite predictions are made, however, by the principle of maximum rate of information. That is, dynamic limits of articulation are frequently approached during normal speech, because articulation is often made as fast as possible, especially in the case of unstressed syllables. In the case of stressed syllables, articulation would actually slow down as the phonetic target is approached in order to prevent overshoot in the face of the extra duration assigned to the stressed syllables.

The goal of the present study is to explore which is more likely the fundamental driving force behind the articulatory dynamics of speech: economy of effort or maximum rate of information. We will try to answer three specific questions based on the competing predictions mentioned above through an examination of formant movement dynamics: (a) Is the maximum speed of segmental articulation, and hence the maximum articulatory effort, used in meaningful utterances? (b) Are stressed or unstressed syllables more likely to involve the maximum speed of articulation and the associated maximum articulatory effort? And (c) what is the likely articulatory mechanism underlying these dynamic patterns?

Our general approach consists of three parts. The first is a method taken from Xu and Sun (2002), i.e., to ask speakers to imitate resynthesized speech that has been accelerated to a rate that is unlikely to be humanly attainable. The maximum speed of articulation that the participating speakers manage to achieve is therefore treated as an estimate of their voluntary dynamic limits. In the second part, these estimated dynamic limits are compared to the speed of articulation measured in meaningful sentences produced by the same speakers to establish whether and when the maximum speed of articulation is approached in real speech. These two parts will therefore answer the first two research questions. In the third part, we will address the third research question through analysis-by-modeling based on a variable-order dynamical system. The model used will be based on previous work on computational modeling of laryngeal and supralaryngeal articulation (Ostry et al., 1983; Saltzman and Munhall, 1989; Prom-on et al., 2009; Birkholz et al., 2011).

Unlike in most other studies on articulatory dynamics (Kelso et al., 1985; Ostry and Munhall, 1985; Vatikiotis-Bateson and Kelso, 1993; Hertrich and Ackermann, 1997), the kinematic measurements obtained in the present study are those of formants, as was done in many early studies of speech dynamics, including, in particular, Lindblom (1963) and Moon and Lindblom (1994), that have led to the H&H theory. Because of the popular assumption that articulatory movements should ideally be studied by examining articulatory data only, the following justifications are given to explain why formant measurements can also provide highly relevant information about articulatory dynamics.

Given that listeners hear speech through acoustics, all the perceptually relevant articulatory movements are reflected in the acoustic output. Among the acoustic properties, formant movements have been shown to be perceptually relevant since classic works like Cooper et al. (1952) and Liberman et al. (1954). Formant synthesis systems like the Klatt synthesizers (Klatt, 1980, 1987), though low on naturalness compared to the state-of-the-art speech technology today, have achieved high intelligibility (Taylor, 2009). The widely accepted source-filter theory of speech production (Fant, 1960; Stevens, 1998) has established that the acoustic properties of speech sounds, especially those of the vowels, are determined by the shape of the entire vocal tract, which consists of not only the articulators that are easily measured (e.g., tongue tip, tongue blade, tongue dorsum, and lips), but those that are less accessible, like the tongue root, the pharynx, and even the larynx (Hoole and Kroos, 1998; Demolin et al., 2000). Thus, the movement of any particular articulator is not for its own sake but only as part of a whole movement that achieves a set of overall aerodynamic and acoustic effects. Those acoustic effects are arguably the ultimate goal of a phonetic target (Mattingly, 1990; Johnson et al., 1993b; Hanson and Stevens, 2002; Perrier and Fuchs, 2015; Whalen et al., 2018). In contrast, specific articulatory kinematic measurements can provide only a partial approximation of the goal-oriented articulatory movements as a whole (Whalen et al., 2018).

In fact, Hertrich and Ackermann (1997) and Perkell et al. (2002), after careful examination of articulatory dynamics, both suggested that the phonetically most relevant information may be found in the acoustic signal. Noiray et al. (2014) find that acoustic patterns faithfully reflect even highly idiosyncratic articulatory patterns that carry crucial information for perceptual contrast. Whalen et al. (2018) further demonstrate that cross-speaker variability in acoustics and articulation is closely related, rather than articulation being more variable than acoustics, as previously argued (Johnson et al., 1993a). Furthermore, the perturbation theory (Fant, 1980; Stevens, 1998) would predict that only the lowest formants (mostly F1–F3) are directly controllable by deliberate maneuvers of movable articulators such as the tongue and jaw, because too many nodes and antinodes are associated with higher formants to make it possible to deform the vocal tract shape at all the right locations without canceling out each other's perturbation effects. This means that most of the contrastive vowel information can only be carried by the first few formants. Therefore, formant trajectories are arguably a better indicator of articulatory dynamics, because they reflect vocal tract shapes as a whole, including those parts that are hard to measure, and they are fewer in number. This would make formant measurements such as displacement, velocity, and the vp/d ratio no less valid than those of any particular articulator.

Formant data are not without limitations, however. A well-known issue is the sometimes abrupt shift of affiliation of the second and third formants with resonance cavity as the vocal tract shape changes smoothly, e.g., between [i] and [a] (Bailly, 1993; Stevens, 1998). When this happens, the continuity of formant movements may be affected. Furthermore, formant trajectories do not capture the spectral patterns between the formants, which may also be phonetically relevant (Ito et al., 2001). For the purpose of the present study, however, the relevance of formant trajectories can be tested by examining whether their kinematics show similar patterns as those of articulatory movements. At least for fundamental frequency, highly linear relations between F<sub>0</sub> velocity and F<sub>0</sub> movement amplitude have been found (Xu and Sun, 2002; Xu and Wang, 2009), which resemble the linear relations in articulatory or limb movement (Kelso et al., 1985; Ostry and Munhall, 1985; Vatikiotis-Bateson and Kelso, 1993; Hertrich and Ackermann, 1997). This is despite the fact that F<sub>0</sub> is the output of a highly complex laryngeal system (Zemlin, 1988; Honda, 1995). Whether formant kinematics also exhibit similar linear relations and thus warrant the kinematic analyses that have been applied to limb and F<sub>0</sub> movements will therefore be an empirical question. More importantly, as a fundamental principle of any empirical investigation, the most critical requirement is to always make minimal contrast comparisons (Gelfer et al., 1989; Boyce et al., 1991) so that any potential adverse effects are applicable to both the experimental and reference conditions. This will also be the principle that guides the design of the present study.

## MATERIALS AND METHODS

#### Stimuli

The stimuli were resynthesized target syllable sequences to be imitated or printed sentences to be read aloud, as presented below. To guarantee continuous formant tracking, we used CV syllables where the consonants are glides and the vowels have maximally different vocal tract shapes from the adjacent glides. A further advantage of using glides instead of obstruent consonants is that they present the least amount of gestural overlap between C and V because glides, as semivowels, are specified for the entire shape of the vocal tract rather than mainly at the place of articulation as in obstruent consonants. The lack of gestural overlap should maximize time pressure. A similar strategy was adopted by Moon and Lindblom (1994) for the same reason. To assess whether the maximum speed of articulation is approached during speech, we asked the same group of subjects to produce meaningful sentences in which the same glide–vowel syllables are embedded.

#### Glide–Vowel Sequences

There were five CV sequences, each consisting of five identical glide–vowel syllables, as shown below. They were first spoken by author YX at a normal rate in a sound-treated booth. They were then resynthesized using the Pitch Synchronous Overlap and Add (PSOLA) algorithm implemented in Praat (Boersma, 2001) to increase the mean syllable rate to 8 syllables per second, which exceeds the fastest repetitive rate for glide–vowel syllables reported previously (Siguard, 1973; Tiffany, 1980). As an example, **Figure 2** shows the spectrograms of the original and accelerated rarararara sequence.


#### Sentences

The stimulus sentences, as shown below, contain symmetrical CVC patterns that each resemble a single cycle of a repetitive CVC sequence. These CVC patterns all appear in the first word of a two-word noun phrase. This is to guarantee that they are not subject to phrase-final lengthening (Nakatani et al., 1981). Each pattern is placed in a stressed syllable and an unstressed syllable in two different sentences. The unstressed /waw/ appears in three positions, early, middle, and late, for examining possible positional differences (not performed in the present study). All other patterns appear only in the sentence-medial position. The boldfaced syllables in the target words are stressed. These stress placements are natural to the native speakers, and subjects had no difficulty producing the intended stress patterns.


#### Subjects and Recording Procedure

Fifteen speakers of American English, 8 females and 7 males, age 18–25 years, participated as subjects. They were undergraduate students at Northwestern University or other universities in the Chicago area. All subjects signed informed consent approved by the Northwestern University Institutional Review Board and were paid for their participation.

The subject sat in front of a computer screen wearing a head-mounted condenser microphone (Countryman Isomax hypercardioid). During the recording, the stimuli were displayed on a web page controlled by a JavaScript program. The program randomized the stimulus order so that each subject read a different random list. Another program, SoundEdit, ran in the background on the same computer to digitize the acoustic signal directly onto the hard disk at a 22.05 kHz sampling rate and 16-bit resolution.

For the syllable sequences, in the slow condition, the subject read aloud each sequence at the rate of careful speech; in the other two conditions, during each trial, the subject listened to a model sequence and then immediately imitated the sequence in two ways: (1) as fast as possible five times without slurring, and (2) as exaggeratedly as possible another three times without slurring. For the sentences, the subject was instructed to say each sentence first at a normal rate three times and then at the fastest rate possible another three times without slurring. The experimenter, who was a native speaker of American English, made sure that the target words were all said with the right stress patterns.

#### Measurements


The first step in taking the measurements was to demarcate the syllables, as illustrated in **Figure 3**. The demarcation points were set at the extrema of either F1 or F2 formant tracks. The procedure was facilitated by a Praat script (a predecessor of FormantPro: Xu and Gao, 2018) that cycled through all the utterances produced by each speaker and displayed the waveform, spectrogram, and a TextGrid for inserting the demarcation points and labeling the syllables. The demarcation points were first set manually and then corrected by the script based on the LPC formant tracks, which were smoothed by a trimming algorithm that eliminated abrupt bumps and sharp edges (originally developed for trimming F<sub>0</sub> contours: Xu, 1999).

For the /wawawawawa/, /yayayayaya/, and /rarararara/ sequences, the demarcation points were set at the F1 minima, as illustrated in **Figure 3A**. For the other two sequences, because of the small F1 movements, the demarcation points were set at the F2 minima (for /wiwiwiwiwi/) or maxima (for /yoyoyoyoyo/). For the sentences, only the target syllables were demarcated, as shown in **Figure 3B**.

Based on the demarcation of the syllables, the following measurements were taken.

maxF<sub>j</sub> (st) – highest value in the jth formant in semitones in each unidirectional formant movement, where j = 1, 2, 3. The conversion from Hz to semitones was done with the equation:

$$st = 12\log_2 f_j \tag{1}$$

where f<sub>j</sub> is the formant value in Hz. Note that, here, the reference value for f<sub>j</sub> is assumed to be 1 Hz.

minF<sub>j</sub> (st) – lowest value in the jth formant in each unidirectional formant movement.

Fj-displacement (onset and offset) – formant difference (in st) between adjacent maxF<sub>j</sub> and minF<sub>j</sub>. There are two unidirectional movements in each syllable: one for the onset ramp of the formant movement toward the vowel target, and the other for the offset ramp. Thus for each syllable, two displacements were computed.

mean Fj-displacement – average of onset and offset displacements.

movement duration (onset and offset) – time interval between adjacent formant maximum and minimum.

syllable duration – sum of onset and offset movement durations in each syllable.

peak velocity (onset and offset) – positive and negative extrema in the velocity curve corresponding to the rising and falling ramps of each unidirectional formant movement. The velocity curves were computed by taking the first derivative of formant curves. Following Hertrich and Ackermann (1997), the formant curves were low-pass filtered at 20 Hz with the Smooth command in Praat, but the velocity curves themselves were not smoothed so as not to reduce the magnitude of peak velocity.

vp/d ratio (onset and offset) – ratio of peak velocity to displacement calculated as the slope of the linear regression of peak velocity over displacement across all the points in a unidirectional formant movement.
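The measurements defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Praat script used in the study: the function names (`hz_to_semitones`, `movement_kinematics`, `vpd_ratio`) are hypothetical, each formant track is assumed to be an already smoothed, evenly sampled array covering one unidirectional movement, the 20 Hz low-pass filtering step is omitted, and the vp/d ratio is computed as a regression slope over a group of movements.

```python
import numpy as np


def hz_to_semitones(f_hz):
    """Eq. (1): convert Hz to semitones with a 1 Hz reference."""
    return 12.0 * np.log2(np.asarray(f_hz, dtype=float))


def movement_kinematics(f_hz, dt):
    """Displacement (st) and peak velocity (st/s) of one unidirectional movement.

    `f_hz` is a formant track in Hz sampled every `dt` seconds (assumed to be
    already smoothed and to span a single rising or falling ramp).
    """
    st = hz_to_semitones(f_hz)
    displacement = st.max() - st.min()
    velocity = np.gradient(st, dt)          # first derivative of the formant curve
    peak_velocity = np.abs(velocity).max()  # unsmoothed velocity extremum
    return displacement, peak_velocity


def vpd_ratio(displacements, peak_velocities):
    """Slope of the linear regression of peak velocity on displacement."""
    slope, _intercept = np.polyfit(displacements, peak_velocities, 1)
    return slope


# Illustrative two-sample track: one octave (12 st) covered in 0.1 s.
d, vp = movement_kinematics([256.0, 512.0], dt=0.1)
print(d, vp)
```

In practice, `vpd_ratio` would be applied to the displacements and peak velocities pooled within each rate or stress condition.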

#### Analysis

The first goal of the analysis is to determine whether the production of meaningful utterances has approached various dynamic limits observed in nonsense syllable sequences. This is assessed in two ways. The first is to compare the sequence conditions and the sentence conditions in terms of the distribution of formant displacement as a function of movement duration. The comparison is made with the theoretical bounds defined by Nelson (1983) as a reference to see if the distributions show patterns that suggest that speakers may indeed have maximized their articulatory effort. The second is to make the comparisons in terms of peak velocity as a function of displacement: vp/d. If much similarity is found between the sequence conditions and the sentence conditions for the same articulatory movement, this would again be an indication that a dynamic limit of articulation is approached in sentence production.

The second goal of the analysis is to determine whether the dynamic limits are more likely approached during stressed or unstressed syllables in the sentence condition. This will be done with both formant displacement as a function of movement duration and peak velocity as a function of movement amplitude.

#### Displacement Over Duration

**Figures 4**, **5** display scatter plots of F1 and F2 displacement over movement duration for [wa], [ya], [ra], [wi], and [yo] in the syllable sequences (column 1) and sentences (columns 2, 3) produced by all 15 speakers (except for the slow sequence condition, for which three speakers were not recorded). In **Figure 4**, F1 is not plotted for [wi] and [yo] because the formant movements were often too small to allow reliable location of their maxima or minima. In column 1 of both figures, the points are separated into the three speaking modes for the syllable sequences: fast, exaggerated without slowing down, and slow. In column 2, the points are separated by speech rate in the sentence condition, and in column 3, they are separated by word stress in the sentence condition.

In terms of the vertical distribution of the (T, D) points, column 1 of both **Figures 4**, **5** shows a three-way split across the three conditions, with the fast rate closest to the bottom and the slow rate closest to the top, although there is much overlap between the three conditions. In column 2 of both figures, the distributions are very similar to those of the fast and exaggerated conditions in column 1, indicating that the same syllables in the sentences are spoken with a similar amount of muscle force. The plots in column 2 also show that there is no clear vertical separation between normal and fast speaking rates, which contrasts with column 3, where a better separation can be seen between stressed and unstressed syllables. Unsurprisingly, stressed syllables have larger formant displacements than unstressed syllables.

In the top-left graph of **Figure 4**, we have plotted the gray parabolic curves generated by Eq. (2), where T<sub>m</sub> is a function of U, which is a theoretical physical force (acceleration) determined by the maximum amount of muscle force that can be exerted for a movement (Nelson, 1983). The curves therefore represent theoretical minimum-time bounds given specific values of U. According to Nelson (1983: 140), given a particular time bound, all physically realizable movements have to lie to the right of that bound, and "any movement having a distance-time (D, T) point on or to the left of a particular contour would require a peak acceleration greater than the value for that contour." While the bounds can be theoretically moved left by increasing the value of U, the cost of such an increase would rise rapidly, as indicated by the closer spacing of the contours as they shift leftward. Thus, there is bound to be a physical limit that is virtually impossible to cross.

$$T_m = 2\left(D/U\right)^{1/2} \tag{2}$$
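Eq. (2) can be recovered from a simple idealization, sketched here as a reading aid rather than as a restatement of Nelson's (1983) full derivation. Assume a frictionless movement that accelerates at a constant magnitude U for the first half of the movement time and decelerates at the same magnitude for the second half. The symbol v<sub>p</sub> for peak velocity is introduced here for illustration. Each half then covers a distance of U(T<sub>m</sub>/2)<sup>2</sup>/2, so that:

$$D = 2 \cdot \frac{1}{2} U \left(\frac{T_m}{2}\right)^2 = \frac{U T_m^2}{4} \quad\Rightarrow\quad T_m = 2\left(\frac{D}{U}\right)^{1/2}, \qquad v_p = U\,\frac{T_m}{2} = (UD)^{1/2}$$

Under this idealization, a larger U shortens the minimum time needed for a given displacement, which is why the contours shift leftward as U increases.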

In Nelson (1983), the unit of U is physical distance in meters. Here, in the top-left plot of **Figure 4**, the theoretical bounds correspond to U = 5,000, 10,000, 15,000, . . . , 50,000, and are arbitrarily set to be above the bounds for most of the formant values in **Figures 4**, **5**. To assess the amount of muscle force exerted during the articulation of the target utterances, we fitted Eq. (2) to the (T, D) points in each condition for an optimal value of U with the fitModel function in the R package TIMP (Mullen and van Stokkum, 2007). The fitted curves are shown in each plot in **Figures 4**, **5**. With these fitted curves, the (T, D) distribution in different conditions can be compared for their U values.
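The fitting step can be approximated with a short sketch. Note that this is a swapped-in simplification and not the TIMP `fitModel` procedure used in the study: it rewrites Eq. (2) as T = k·D<sup>1/2</sup> with k = 2/U<sup>1/2</sup> and solves for k by ordinary least squares through the origin. The function names `min_time_bound` and `fit_U` are hypothetical, and the demonstration data are synthetic points generated from Eq. (2) itself.

```python
import numpy as np


def min_time_bound(D, U):
    """Eq. (2): minimum movement time for displacement D given acceleration bound U."""
    return 2.0 * np.sqrt(np.asarray(D, dtype=float) / U)


def fit_U(T, D):
    """Estimate U from (T, D) points by least squares on T = k * sqrt(D), k = 2 / sqrt(U).

    Assumption: errors are minimized in T; a different fitting criterion
    (such as the one in TIMP) could give different values.
    """
    T = np.asarray(T, dtype=float)
    root_D = np.sqrt(np.asarray(D, dtype=float))
    k = np.sum(T * root_D) / np.sum(root_D ** 2)  # slope of a line through the origin
    return 4.0 / k ** 2


# Synthetic check: points generated exactly on the U = 10,000 contour.
D_demo = np.array([2.0, 5.0, 10.0, 15.0])  # displacement in st
T_demo = min_time_bound(D_demo, 10000.0)   # duration in s
U_hat = fit_U(T_demo, D_demo)
print(U_hat)
```

With real (T, D) scatter, the fitted value would summarize where a condition's points sit relative to the family of minimum-time contours, with larger U indicating points closer to the left.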

For the syllable sequences, the fitting is done only for the fast and exaggerated conditions, because the slow condition shows a ceiling effect as syllable duration becomes increasingly long. As can be seen in the top-left graph of **Figure 4**, the F1 points in the slow condition do not parallel any of the time bounds but are largely horizontally distributed. This indicates that formant displacement ceases to consistently increase as movement duration goes beyond around 0.125 s (125 ms). This asymptotic distribution resembles those in Figure 5 of Nelson et al. (1984: 950), with similarity even in terms of the critical duration of around 120 ms.

for frictionless movements with constant acceleration–deceleration magnitudes based on Nelson (1983). See text for detailed explanation.

With the fitted curves, we can compare the values of U in different conditions for an initial assessment of the relative articulatory force applied by the speakers. For the syllable sequences in column 1, U is always greater in the exaggerated than in the fast syllable sequences, except for F2 in [ya] and [ra]. This seems to be consistent with the instructions given to the subjects in terms of speech mode. For the most crucial question of the current study, namely, whether syllables are spoken in sentences as fast as in sequences, as shown in both **Figures 4**, **5**, in the majority of the cases, the values of U in sentences are actually greater than those in sequences, with the exception of [ra]. In the case of [ra], for some reason, both F1 and F2 have relatively smaller ranges of displacement in sentences than in sequences. In terms of relative articulatory force in the sentence condition, however, there is no consistent pattern based on either speech rate or stress, although there is a tendency toward greater force for fast rate than for normal rate.

limits for frictionless movements with constant acceleration–deceleration magnitudes based on Nelson (1983). See text for detailed explanation.

Overall, analysis of the distribution of displacement over duration (D, T) shows that CVC syllables spoken in sentences were articulated with at least as much muscle force as meaningless syllable sequences. However, the relative articulatory force in different sentence conditions is not yet clear. For that, we will turn to the analysis of peak velocity, which has been associated more directly with articulatory force (Nelson et al., 1984; Kelso et al., 1985; Ostry and Munhall, 1985; Perkell et al., 2002).

#### Peak Velocity Over Displacement (vp/d Ratio)

**Figures 6**, **7** display scatter plots of peak formant velocity over formant displacement for [wa], [ya], [ra], [wi], and [yo] in the syllable sequences (column 1) and sentences (columns 2, 3) produced by all 15 speakers (except for the slow condition in the syllable sequences, for which there are no data for three of the speakers). Because most distributions are highly linear (as found for articulatory movements: Ostry et al., 1983; Kelso et al., 1985; Ostry and Munhall, 1985; Beckman and Edwards, 1992; Vatikiotis-Bateson and Kelso, 1993), linear regression lines are fitted for every group of data to obtain the vp/d ratio.

FIGURE 6 | Scatter plots of F1 peak velocity over displacement for [wa], [ya], and [ra] in the syllable sequences (column 1) and sentences (columns 2,3) produced by all 15 speakers. Linear regression lines are fitted to each rate or stress condition. See text for detailed explanation.

FIGURE 7 | Scatter plots of F2 peak velocity over displacement for [wa], [ya], [ra], [wi], and [yo] in the syllable sequences (column 1) and sentences (columns 2,3) produced by all 15 speakers. Linear regression lines are fitted to each rate or stress condition. See text for detailed explanation.

In column 1, the slope of the regression line is much shallower in the slow sequences than in the fast and exaggerated sequences, but the differences between the fast and exaggerated sequences are rather small. A two-way repeated measures ANOVA with vp/d ratio as the dependent variable and rate and syllable as independent variables showed significant effects of rate on F1 [F(2,22) = 105.99, p < 0.0001] and F2 [F(2,22) = 90.36, p < 0.0001] and significant effects of syllable on F1 [F(4,44) = 2.8136, p = 0.0365] and F2 [F(4,44) = 2.852, p = 0.0347] (with three speakers missing in the slow sequence condition). A Bonferroni/Dunn post hoc test showed significant differences between slow and both fast and exaggerated conditions but not between the latter two. This is true of both F1 and F2.

In column 2, the regression slopes are consistently steeper for fast rate than for normal rate, which is not surprising. What is striking is that in column 3, the regression slopes are consistently steeper for the unstressed syllables than for the stressed syllables. A two-way repeated measures ANOVA with vp/d ratio as the dependent variable and rate and stress as independent variables showed significant effects of rate on F1 [F(1,14) = 13.66, p = 0.0024] and F2 [F(1,14) = 17.42, p = 0.0009] and significant effects of stress on F1 [F(1,14) = 4.86, p = 0.0448] and F2 [F(1,14) = 70.97, p < 0.0001]. For F2, there is also a significant interaction between rate and stress due to the much larger difference between stressed and unstressed syllables at fast rate than at normal rate, as shown in **Figure 8**. From **Figure 8**, it is clear that the greatest vp/d values are from unstressed syllables at fast rate. As can be seen in **Table 1**, this is the condition where syllable duration (79.9 ms) has dropped well below the critical duration of 120 ms mentioned in the section "Displacement Over Duration."



Overall, the difference between stressed and unstressed syllables, as shown in column 2, is quite similar to that between the two fast rates shown in column 1. This can be further seen in **Tables 2**, **3**, which show the mean vp/d ratios in syllable sequences and sentences, respectively. These differences were compared by performing two-tailed paired t-tests between the syllable sequences and sentences; the results are shown in **Table 4**. Either unstressed syllables had significantly greater vp/d ratios than the sequences (F2 in all conditions), or there were no significant differences (F1 in fast sentences). For stressed syllables, there was no difference in either formant when sentences were at the fast rate. At normal rate, stressed syllables had significantly different vp/d ratios from the sequences but always with lower values. Overall, then, vp/d is no lower in sentences than in sequences unless the syllable is stressed and at the normal speech rate.

These results therefore show that syllables in meaningful sentences are spoken with vp/d ratios that are equal to or even greater than those in nonsense sequences, except when they are stressed and at normal rate. Assuming that vp/d is a reliable indicator of gestural stiffness, CVC syllables spoken in sentences are articulated with at least as much muscle force as the fastest meaningless syllable sequences. On the other hand, within the sentence condition, the finding of greater vp/d ratios in unstressed syllables than in stressed syllables has only deepened the mystery of the stress–stiffness enigma. Looking at the plots in **Figures 6**, **7** again, it is mostly those points with large displacements that are "bent down" relative to the linear regression lines, and these seem to have reduced the regression slopes. This is true in both the sequence and sentence conditions. In the next section, we will use computational modeling to explore whether this is a potential source of the stress–stiffness enigma.

TABLE 2 | Mean vp/d ratio in syllable sequences, with standard deviations in parentheses.

TABLE 3 | Mean vp/d ratio in sentences, with standard deviations in parentheses.

TABLE 4 | Two-tailed paired t-test comparisons between mean vp/d ratios in syllable sequences and in sentences. The differences (sequence−speech) between the two conditions where p < 0.05 are shown.

#### Interpretation Based on Modeling

Various models have been proposed based on either acoustic or articulatory data to account for the articulatory dynamics underlying articulatory effort. In Lindblom (1963) and Moon and Lindblom (1994), a numerical model was used to simulate undershoot by representing formant values at turning points using a decaying exponential function. The model is based on the kinematics of the movement (single displacement or velocity measurement per movement) rather than its dynamics (continuous displacement and velocity trajectories). Such a strategy, however, is suboptimal in modeling (Kelso et al., 1985), because it is developed for simulating only particular kinematic measurements and so is unable to simulate the continuous trajectories of articulatory or acoustic movements. Moon and Lindblom (1994) also proposed a dynamic model. However, it is not a target-approaching model because each movement is simulated as consisting of an onset phase in the direction of the muscle force and an offset phase in the opposite direction [also see Fujisaki et al. (2005) for a similar strategy]. Such movements are thus more complex than the unidirectional movement with a unimodal velocity profile described above (Nelson, 1983). Also, in the model, the effect of stiffness is the opposite of the more widely accepted conceptualization, namely, that higher stiffness should lead to greater displacement. We will therefore not consider those two types of models.

A more common approach is to use a dynamical system such as a linear mass-spring model to simulate simple movements with a unimodal velocity profile like the one illustrated in **Figures 1A,B**, in which displacement as a function of time exhibits a unidirectional asymptotic trajectory toward the equilibrium point of the system (Nelson, 1983; Ostry et al., 1983; Kelso et al., 1985). The equilibrium point serves as an attractor toward which the system converges over time regardless of its initial state (Kelso et al., 1986; Saltzman and Munhall, 1989). Such progressive convergence is clearly seen in the F<sub>0</sub> contours of a tone when preceded by different tones (Xu, 1997, 1999). This tonal convergence behavior has led to the Target Approximation model (Xu and Wang, 2001) and its quantitative implementation, quantitative target approximation (qTA), which is a critically damped third-order system driven by pitch targets as forcing functions (Prom-on et al., 2009). These dynamic models, however, have not yet been used to simulate kinematic patterns as was done in Lindblom (1963) and Moon and Lindblom (1994) (except in a limited way in Ostry and Munhall, 1985). In the present study, we will explore the ability of dynamic models to simulate observed kinematic measurements and, in the process, explore answers to questions about dynamic constraints in speech, as follows:


#### A Generalized Target Approximation Model

The model we are using is a generalized target approximation model extended from the qTA model (Prom-on et al., 2009). Like many other systems (Ostry et al., 1983; Kelso et al., 1985; Saltzman and Munhall, 1989), it is a mass-spring system that generates movement trajectories by sequentially approaching successive phonetic goals in an asymptotic manner. But unlike the others, it is a system with variable order to allow the simulation of different levels of complexity of the interactions among the variables. Mathematically, the target approximation movement can be represented by a general N-th order model:

$$y(t) = x(t) + e^{-\lambda t} \sum\_{k=0}^{N-1} c\_k t^k \tag{3}$$

where x(t) is the linear target function,

$$x(t) = mt + b\tag{4}$$

The target in this context is different from those in other mass-spring models where the equilibrium is a fixed displacement value (Feldman, 1986; Saltzman and Munhall, 1989; Perrier et al., 1996b). m and b represent the slope and height of the target function, respectively. This linear function is motivated by findings of dynamic tones in tone languages (Xu and Wang, 2001) and diphthongs in English (Gay, 1968). When the target is static, i.e., m = 0, as is assumed in all the calculations in the present study, the linear function in Eq. (4) is equivalent to an equilibrium point as in other mass-spring models. λ is related to stiffness (equivalent of ω<sub>n</sub> = √(k/m), where k is stiffness and m is mass in a mass-spring-dashpot system). The coefficients c<sub>k</sub> are determined from initial conditions and target parameters:

$$c\_k = \begin{cases} y(0) - b, & k = 0\\ y^{(1)}(0) + c\_0 \lambda - m, & k = 1\\ \frac{1}{k!} \left( y^{(k)}(0) - \sum\_{i=0}^{k-1} \frac{k!}{(k-i)!} c\_i ( - \lambda)^{k-i} \right), & k \ge 2 \end{cases} \tag{5}$$
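As a check on Eq. (5), the lowest-order coefficients can be written out explicitly. The expansion below is a worked illustration rather than part of the original derivation; it writes y(t) for the displacement on the left-hand side of Eq. (3) and y<sup>(k)</sup>(0) for its k-th time derivative at movement onset:

$$c\_0 = y(0) - b, \qquad c\_1 = y^{(1)}(0) + \lambda c\_0 - m, \qquad c\_2 = \tfrac{1}{2}\left( y^{(2)}(0) + 2\lambda c\_1 - \lambda^2 c\_0 \right)$$

A second-order model uses only c<sub>0</sub> and c<sub>1</sub>; adding c<sub>2</sub> gives the third-order case, the order used by qTA. For a static target (m = 0) and a second-order movement that starts at rest, c<sub>1</sub> reduces to λc<sub>0</sub>, so that Eq. (3) becomes

$$y(t) = b + \left( y(0) - b \right) (1 + \lambda t)\, e^{-\lambda t}$$

which is the familiar step response of a critically damped second-order system.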

In this general model, as in its third-order predecessor, articulatory state is assumed to be transferred across movement boundaries, i.e., from the end of the current movement to the beginning of the next movement. For example, in the case of qTA, three articulatory states are transferred across movement boundaries: displacement, velocity, and acceleration. As the order of the model increases, higher-order articulatory states are also transferred.

The cross-boundary state transfer is important not only because it guarantees the smoothness of the trajectory at the boundary but also because it fully simulates the higher-order carryover influences of one movement on the next, which has been found to sometimes even exceed that due to cross-boundary displacement transfer (Chen and Xu, 2006). This is illustrated in **Figure 9** with the second-order version of Eq. (3). In **Figure 9A**, the three adjacent movements have continuous displacement at their junctions (where the line thickness changes) but not continuous velocity (as shown in **Figure 9C**). The displacement function in **Figure 9B** is smoother than that in **Figure 9A** because its first derivative is also continuous at the junctions, as shown in **Figure 9D**. The movement amplitude in **Figure 9B** is larger than in **Figure 9A** because the high velocity at the end of the second movement has delayed the turning point into the third movement (the second movement has smaller amplitude in **Figure 9B** than in **Figure 9A** because it first has to overcome the negative velocity transferred from the end of the first movement when trying to achieve its higher target). As is apparent from **Figure 9**, whether higher-order state transfer is implemented makes a significant difference in terms of measured (as opposed to intended) movement duration, displacement, and peak velocity as well as other, derived measurements.

#### Simulation and Interpretation

A program was written in C to generate a sequence of three movements based on the generalized model (Eq. 3). In all of the simulations, the following parameter settings were kept constant:

m<sub>1</sub> = m<sub>2</sub> = m<sub>3</sub> = 0 (target slope)

y<sub>01</sub> = 85 (initial displacement)

b<sub>1</sub> = 80, b<sub>2</sub> = 100, b<sub>3</sub> = 80 (target height)

λ<sub>1</sub> = λ<sub>2</sub> = λ<sub>3</sub> (rate of target approximation)

d<sub>1</sub> = 0.2, d<sub>2</sub> = 0.1, d<sub>3</sub> = 0.3 (duration of target approximation).

The units of these parameters are arbitrary, but the values were chosen so that the output would be numerically comparable to the data shown in **Figures 6**, **7**.
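The following is a minimal Python sketch of how such a three-movement sequence could be generated from Eqs. (3) to (5). It is not the C program used in the study. The function names, the sampling step `dt`, the value λ = 40, and the assumption that the sequence starts at rest (all initial derivatives zero) are illustrative choices that do not come from the text. The derivatives needed for cross-boundary state transfer are obtained by applying the Leibniz product rule to Eq. (3).

```python
from math import comb, exp, factorial

def coefficients(state, m, b, lam):
    """Eq. (5): c_k from the initial state [y(0), y'(0), ..., y^(N-1)(0)]."""
    c = []
    for k, yk in enumerate(state):
        if k == 0:
            c.append(yk - b)
        elif k == 1:
            c.append(yk + c[0] * lam - m)
        else:
            s = sum(factorial(k) // factorial(k - i) * c[i] * (-lam) ** (k - i)
                    for i in range(k))
            c.append((yk - s) / factorial(k))
    return c

def poly_deriv(c, i, t):
    """i-th derivative of sum_k c_k t^k at time t."""
    return sum(factorial(k) // factorial(k - i) * c[k] * t ** (k - i)
               for k in range(i, len(c)))

def derivative(c, m, b, lam, j, t):
    """j-th derivative of Eq. (3) at time t (j = 0 gives displacement)."""
    target = m * t + b if j == 0 else (m if j == 1 else 0.0)
    transient = exp(-lam * t) * sum(
        comb(j, i) * (-lam) ** (j - i) * poly_deriv(c, i, t)
        for i in range(j + 1))
    return target + transient

def simulate(order, lam, y0, targets, dt=0.001):
    """Chain movements; targets is a list of (m, b, duration) tuples."""
    state = [y0] + [0.0] * (order - 1)  # assumption: sequence starts at rest
    times, ys, vs = [], [], []
    t_start = 0.0
    for m, b, d in targets:
        c = coefficients(state, m, b, lam)
        for s in range(round(d / dt)):
            t = s * dt
            times.append(t_start + t)
            ys.append(derivative(c, m, b, lam, 0, t))
            vs.append(derivative(c, m, b, lam, 1, t))
        # Transfer displacement and all N-1 derivatives to the next movement.
        state = [derivative(c, m, b, lam, j, d) for j in range(order)]
        t_start += d
    return times, ys, vs

# Parameter values from the text; lam = 40 is an assumed, illustrative value.
targets = [(0.0, 80.0, 0.2), (0.0, 100.0, 0.1), (0.0, 80.0, 0.3)]
times, ys, vs = simulate(order=2, lam=40.0, y0=85.0, targets=targets)
```

With these settings the second target (b<sub>2</sub> = 100) is not reached within its interval, and the displacement peak falls just after the boundary into the third movement, because the positive velocity at the end of the second movement is carried over.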

Three parameters were systematically varied in the simulation: k, λ, and d<sub>2</sub>, where λ = λ<sub>1</sub> = λ<sub>2</sub> = λ<sub>3</sub>.

**Figure 10** shows the displacement (top) and velocity (bottom) trajectories of three sequences, with model orders of 2nd (A), 8th (B), and 14th (C). The thick section in the middle of each trajectory corresponds to the approximation interval of the second target, whose ideal displacement is b<sub>2</sub> = 100 and duration is d<sub>2</sub> = 0.1. As can be seen, as the model order increases, the amount of delay in the target approximation in the displacement trajectory also increases, the velocity profiles become more and more symmetrical, and the velocity peak occurs increasingly later in the target interval.

The target intervals as shown in **Figure 10** are invisible in real speech data, of course. Actual measurements can therefore be based only on visible landmarks such as turning points. We therefore followed common practice and took the following measurements from the displacement and velocity trajectories, regardless of the actual target intervals used in generating the trajectories.

Displacement – difference in height between the first and second turning point in the displacement trajectory.

Movement duration – horizontal distance between the first and second turning points.

Peak velocity – peak value in the velocity trajectory.
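A minimal Python sketch of these three measurements is given below. It assumes that the trajectories are available as equally indexed lists of sampled times, displacements, and velocities, that turning points are detected as sign changes in the sample-to-sample displacement difference, and that peak velocity is taken between the two turning points. The function names are hypothetical, and exact plateaus (two identical consecutive samples) are not handled.

```python
def turning_points(ys):
    """Indices where the sampled displacement trajectory changes direction."""
    idx = []
    for i in range(1, len(ys) - 1):
        before = ys[i] - ys[i - 1]
        after = ys[i + 1] - ys[i]
        if before * after < 0:
            idx.append(i)
    return idx

def measure(times, ys, vs):
    """Displacement, duration and peak velocity between the first two turning points."""
    tp = turning_points(ys)
    if len(tp) < 2:
        raise ValueError("need at least two turning points")
    i0, i1 = tp[0], tp[1]
    displacement = abs(ys[i1] - ys[i0])
    duration = times[i1] - times[i0]
    # Assumption: peak velocity is the largest absolute velocity in the interval.
    peak_velocity = max(abs(v) for v in vs[i0:i1 + 1])
    return displacement, duration, peak_velocity
```

Because the measurements depend only on visible landmarks, the returned duration generally differs from the target interval used to generate the trajectory.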

With these measurements, we plotted peak velocity as a function of displacement, as shown in **Figure 11**.

With the plots in **Figure 11**, we can now attempt to answer the questions raised at the beginning of the modeling section. The first question is what may have given rise to the quasi-linearity of the vp/d function. The first thing to notice is that, regardless of the level of stiffness represented by λ, as displacement increases, peak velocity sooner or later reaches a plateau after the initial rising slope. This contrasts with **Figures 6**, **7**, where no obvious plateaus can be seen at the end of the slopes except a slowdown in the rise of peak velocity in some of the slow conditions in the syllable sequences (column 1 in both figures). However, a closer observation may reveal some resemblances. Looking across the plots in **Figure 11**, we can see that as the order of the model increases, the rising slopes become longer and shallower, and if we ignore the plateaus for a moment, the initial slopes become increasingly similar to the quasi-linear distributions of vp/d in **Figures 6**, **7**. Furthermore, for any given order, the greater the stiffness, the sooner a plateau is reached as displacement increases.

These two trends can be more clearly seen in **Table 5**, which lists the minimum durations at which a selection of the peak velocity trajectories in **Figure 11** nearly reach a plateau (arbitrarily defined as when the increase in velocity is <1 with each unit of increase in displacement). For each order of the model, three stiffness (λ) levels are shown. From **Table 5**, we can see that if the underlying mechanism of speech production is assumed to be a target approximation process of some kind, the following conclusions can be made:


In other words, the quasi-linear vp/d function shown in **Figures 6**, **7** could be generated by a critically damped high-order linear system operating at a stiffness level that allows only approximation but not attainment of the underlying target within the allocated duration. This stiffness level should not be interpreted as low, however. Rather, it suggests that the applied muscle force is already at the maximum of the articulatory system but is too weak relative to the meager amount of time allocated to each movement. This is consistent with previous findings that the maximum rate of articulation is often applied even in normal speech (Tiffany, 1980; Xu and Sun, 2002; Kuo et al., 2007; Adank and Janse, 2009). In fact, the target approximation-based vp/d functions shown in **Figure 11** suggest that only when articulation is operating with near-maximum stiffness can the measured vp/d show quasi-linearity as in **Figures 6**, **7**.

FIGURE 10 | Top: Displacement trajectories of three consecutive target approximation movement sequences generated with the (A) 2nd, (B) 8th, and (C) 14th order versions of Eq. (3). Shifts in line thickness are where target change occurs. Bottom: Velocity profiles of the three movement sequences in the top row. See text for parameters used to generate the trajectories.

If speech is indeed generally produced at its speed limit, the time pressure should be worse for unstressed syllables than for stressed syllables. As shown in **Table 1**, unstressed syllables, even spoken at normal rate, are shorter than stressed syllables spoken at fast rate. With such short duration, most of the target approximation movements are cut short or truncated. This means that the vp/d points measured in unstressed syllables tend to be mostly located in the lower-left portions of the vp/d function shown in **Figure 11**. The effect of this is illustrated in **Figure 12**, which replots some of the 10th order curves from **Figure 11**. In the left graph, the two curves are both for the condition where λ = 65, but they differ in their data range. The range of the circled points is d ≤ 5, while that of the crossed points is d ≤ 11. When both of them are linearly fitted, the slope of the linear function for the points with the smaller range is steeper than that for those with the larger range: 17.143 vs. 14.198. Thus the greater steepness of the slope of the linearly fitted vp/d function for unstressed syllables could be due to truncation of the associated movements under time pressure. This truncation effect can sometimes even make a movement with greater stiffness appear to have low stiffness, as illustrated in the right graph of **Figure 12**. There, the range of the function with λ = 65 is again d ≤ 5, but the range of the points with higher stiffness (λ = 85) is d ≤ 14. The linear fitting of the two functions now shows a steeper slope for the points with lower stiffness than for the points with greater stiffness. Thus, measurement of vp/d as a linear function is heavily dependent on the range of displacement values being fitted: the smaller the range, the greater the likely value of vp/d, other things being equal.
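The range dependence of the fitted slope can be reproduced with any peak-velocity curve that rises and then levels off. The Python sketch below does not use the model curves of **Figure 11**; `saturating_vp` is a hypothetical stand-in with arbitrary constants, and the fit is an ordinary least-squares line with an intercept, which is an assumption about how the linear fitting was done.

```python
from math import exp

def ls_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (line with intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def saturating_vp(d, v_max=200.0, scale=6.0):
    """Hypothetical stand-in curve: peak velocity that plateaus with displacement."""
    return v_max * (1.0 - exp(-d / scale))

def fitted_slope(d_max, step=0.5):
    """Slope of a line fitted to the curve sampled on 0 < d <= d_max."""
    ds = [step * i for i in range(1, int(d_max / step) + 1)]
    return ls_slope(ds, [saturating_vp(d) for d in ds])

short_range = fitted_slope(5.0)   # truncated range, as for unstressed syllables
long_range = fitted_slope(11.0)   # wider range, as for stressed syllables
```

Since the stand-in curve is concave, the line fitted over the shorter range is the steeper of the two, which is the same direction of effect as in the left graph of **Figure 12**.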

As for whether the left or right graph is the likely scenario in the case of the stress–stiffness enigma, **Table 6** shows maximum displacements of F1 and F2 in stressed and unstressed syllables at both speech rates from the present data. Although stressed syllables show consistently greater displacements than unstressed syllables, the differences are not extremely large. This means that the underlying stiffness may not be drastically different. This indeed seems to be the case, as shown in **Figure 13**, where the movement-specific vp/d ratio in the current data is plotted as a function of the duration for both F1 and F2. The stressed and unstressed syllables seem to share the same function of movement-specific vp/d relative to duration regardless of their differential distributions in duration. This suggests that the left graph of **Figure 12** is the more likely scenario.

To conclude the modeling section, assuming that articulatory gestures are target approximation movements that can be modeled by a mass-spring system, speakers generally produce gestures too quickly for target approximation to complete even with maximum muscle force, and the time shortage is much worse for unstressed syllables than for stressed syllables. It is the incompleteness of the target approximation movements that may have led to the quasi-linearity of the generally observed vp/d function, but the slope of the linearly fitted vp/d function is also inversely related to the range of observable displacements, which tends to be smaller in unstressed syllables than in stressed syllables. This is the likely source of the stress–stiffness enigma.

TABLE 6 | Maximum displacement in st in the sentence condition, with standard deviations in parentheses.

#### DISCUSSION AND CONCLUSION

The experimental and modeling data presented above have provided evidence in support of the principle of maximum rate of information as an alternative to the principle of economy of effort, based on a test of the competing predictions from the two principles through an examination of formant dynamics. First, in the section "Analysis," the distribution of formant displacement as a function of movement duration shows that articulatory movements in meaningful speech utterances are no slower than the equivalent movements in meaningless syllable sequences that are produced at fast rate or spoken as exaggeratedly as possible without slowing down. Second, this fast speed in articulatory movement is confirmed by vp/d, peak velocity as a function of displacement, a measurement that has been considered as an indicator of gestural stiffness. This stiffness, however, is shown to be consistently higher for unstressed syllables than for stressed syllables, similar to the findings of previous studies based on articulatory data. Third, the modeling simulation in the section "Interpretation Based on Modeling" provides evidence that (a) the widely found linearity of the peak velocity over displacement function is likely due to stiffness being too low relative to the temporal intervals allocated to individual target approximation movements, and (b) the shortage of time is more severe for unstressed than for stressed syllables, and this may have led to vp/d being consistently greater for unstressed syllables than for stressed syllables. Overall, therefore, speech seems to be generally operating at a near-ceiling level as far as stiffness is concerned. As a result, there is probably little or no room for speakers to further increase stiffness when undershoot happens.

These results, therefore, are incompatible with the principle of economy of effort, especially in the form of the H&H theory (Lindblom, 1990), which assumes that there is always room for further strengthening of articulatory effort to achieve hyperarticulation. On the contrary, the present results, together with many similar findings discussed earlier, are more consistent with Lindblom's (1963) earlier undershoot model, which recognizes shortage of time as a major source of incomplete target attainment. From the perspective of maximum rate of information, the highest priority in speech production is to transmit as much information as possible in a given amount of time. The most precious resource for speech would therefore be time rather than energy. Unstressed syllables are given less time because they are less important than stressed syllables and can therefore afford to have greater undershoot.

Shortage/abundance of time is not the only factor that determines measured stiffness in articulatory movements. Another factor is the need for articulatory precision. In motor movement research, it is well known that a more accurate movement takes a longer time to execute (Fitts, 1954; Schmidt et al., 1979; Soechting, 1984). In speech, phonetic categories require high precision to assure their perceptual recognition. The precision requirement is so high that children do not achieve an adult level of performance until their teens (Lee et al., 1999). This high precision must be associated with highly precise targets, and maintaining these targets also means not overshooting them even when there is enough time. This idea is illustrated in **Figure 14**. There, the vertical bound represents the physical limit in terms of how much time is needed to perform a movement of any particular amplitude (which may differ widely across speakers: Tiffany, 1980). For movements that are given abundant time, however, there is also a phonetic bound specified by the acoustic properties of the sound, as represented by the high plateau in **Figure 14**. This phonetic bound acts like a ceiling that prevents speakers from overshooting the target. From the perspective of an information system, fidelity of transmission is an essential property of its capacity (Shannon, 1948), and assuring precision of target attainment for stressed syllables is therefore consistent with the principle of maximum rate of information. Note, however, that sometimes a phonetic bound can lie beyond a physical bound. In the case of an alveolar stop, for example, the target of the tongue tip can be set beyond the surface of the alveolar ridge. This would guarantee an air-tight seal during closure (Löfqvist and Gracco, 1999).

The modeling analysis in the section "Interpretation Based on Modeling" has suggested a solution to the enigma that stress is associated with lower rather than higher measured stiffness (Ostry et al., 1983; Kelso et al., 1985; Ostry and Munhall, 1985; Perkell et al., 2002). As illustrated in **Figure 12**, the widely reported steeper slope of the vp/d function for unstressed syllables than for stressed syllables is likely due to a measurement bias arising from the short duration of unstressed syllables in general. This short duration results in a truncation of the target approximation movement so that, typically, only the fast-rising portion of the vp/d function is included in the data, which would have resulted in a linearly fitted vp/d indicating a greater stiffness than the underlying stiffness. On the other hand, for stressed syllables, because they are more likely to be given a longer time for target approximation, more of the final tapering off of the vp/d function is likely included. This would have resulted in a linearly fitted vp/d indicating a lower stiffness than the underlying stiffness.

Also, in light of the analysis and modeling in the present study, it becomes clear that none of the measurements we have examined here, namely, displacement, peak velocity, vp/d ratio, and movement-specific vp/d ratio, can be treated as a direct indicator of articulatory effort. Articulatory effort can be meaningfully assessed only when all the known factors are effectively controlled, and some kind of quantitative model of articulation is applied. A further caveat is that the simulation of formant dynamics done in the present study is not meant to be a simulation of full articulatory dynamics. Nor can the simulation of the dynamics of any single articulator achieve that goal. More realistic simulation can be done only through full-scale articulatory synthesis, as tested in some of our recent studies (Prom-on et al., 2013, 2014; Xu et al., 2019).

In conclusion, the findings of the present study have provided support for the principle of maximum rate of information in speech production. Under this principle, speech is generally produced at an overall maximum rate of articulation, due to which many of the syllables and segments are subject to undershoot because of lack of time, and the undershoot is much more severe in unstressed syllables than in stressed syllables. The high rate of undershoot in unstressed syllables may have led to a tendency for their measured stiffness in terms of vp/d ratio to be unduly high, as suggested by our modeling analysis. In cases where more time is given, as in the case of stressed syllables, the precision of target approximation, as required for the fidelity of information transmission, results in a reduced rate of increase in peak velocity as a function of displacement, as demonstrated by our modeling analysis. This may have led to a tendency for their measured stiffness in terms of vp/d ratio to be unduly low.

#### DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

#### AUTHOR CONTRIBUTIONS

YX conceived and designed the study, conducted the experiments, performed the computational modeling, and wrote the manuscript. SP-O wrote the modeling program, and reviewed and edited the manuscript.

#### FUNDING

This work was supported in part by the National Institutes of Health (NIH) Grant No. 1R01DC03902.

#### ACKNOWLEDGMENTS

We would like to thank Karen Liu for helping to design the experimental stimuli, conducting the recording and performing the initial data processing.






**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Xu and Prom-on. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Native Language Influence on Brass Instrument Performance: An Application of Generalized Additive Mixed Models (GAMMs) to Midsagittal Ultrasound Images of the Tongue

#### Matthias Heyne<sup>1,2</sup>\*, Donald Derrick<sup>2</sup> and Jalal Al-Tamimi<sup>3</sup>

<sup>1</sup> Speech Laboratory, Department of Speech, Language & Hearing Sciences, College of Health & Rehabilitation Sciences: Sargent College, Boston University, Boston, MA, United States, <sup>2</sup> New Zealand Institute of Language Brain and Behaviour, University of Canterbury, Christchurch, New Zealand, <sup>3</sup> Speech and Language Sciences, Newcastle University, Newcastle upon Tyne, United Kingdom

#### Edited by:

Adamantios Gafos, University of Potsdam, Germany

#### Reviewed by:

Aude Noiray, University of Potsdam, Germany Philip Hoole, Ludwig Maximilian University of Munich, Germany

> \*Correspondence: Matthias Heyne Mattes.Heyne@gmx.de

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 03 May 2019 Accepted: 01 November 2019 Published: 27 November 2019

#### Citation:

Heyne M, Derrick D and Al-Tamimi J (2019) Native Language Influence on Brass Instrument Performance: An Application of Generalized Additive Mixed Models (GAMMs) to Midsagittal Ultrasound Images of the Tongue. Front. Psychol. 10:2597. doi: 10.3389/fpsyg.2019.02597

This paper presents the findings of an ultrasound study of 10 New Zealand English and 10 Tongan-speaking trombone players, to determine whether there is an influence of native language speech production on trombone performance. Trombone players' midsagittal tongue shapes were recorded while reading wordlists and during sustained note productions, and tongue surface contours traced. After normalizing to account for differences in vocal tract shape and ultrasound transducer orientation, we used generalized additive mixed models (GAMMs) to estimate average tongue surface shapes used by the players from the two language groups when producing notes at different pitches and intensities, and during the production of the monophthongs in their native languages. The average midsagittal tongue contours predicted by our models show a statistically robust difference at the back of the tongue distinguishing the two groups, where the New Zealand English players display an overall more retracted tongue position; however, tongue shape during playing does not directly map onto vowel tongue shapes as prescribed by the pedagogical literature. While the New Zealand English-speaking participants employed a playing tongue shape approximating schwa and the vowel used in the word 'lot,' the Tongan participants used a tongue shape loosely patterning with the back vowels /o/ and /u/. We argue that these findings represent evidence for native language influence on brass instrument performance; however, this influence seems to be secondary to more basic constraints of brass playing related to airflow requirements and acoustical considerations, with the vocal tract configurations observed across both groups satisfying these conditions in different ways. Our findings furthermore provide evidence for the functional independence of various sections of the tongue and indicate that speech production, itself an acquired motor skill, can influence another skilled behavior via motor memory of vocal tract gestures forming the basis of local optimization processes to arrive at a suitable tongue shape for sustained note production.

Keywords: laboratory phonology, speech motor control, ultrasound imaging of the tongue, brass instrument performance, motor memory, acoustic to articulatory mapping, generalized additive mixed models (GAMMs), dispersion theory

# INTRODUCTION

Brass instrument performance and speech production both require fine motor control of the vocal tract. Dalla Casa (1584/1970) made the connection centuries ago, using speech syllables in his method book for the Renaissance cornetto, a finger-hole trumpet, and, more recently, brass players have also suggested an influence of native language and culture on playing style (e.g., see Fitzgerald, 1946). Anecdotal accounts of language influence on brass playing exchanged within the brass playing community, for example, include speculation that players of some nationalities are 'better' than others at certain facets of brass playing, or about why learners may face specific challenges related to their language background (Heyne, 2016).

However, despite this pedagogical connection between brass instrument playing and speech, the connection between speech articulation and note production has been largely untested. Here we use ultrasound images of the tongue from ten New Zealand English and ten Tongan-speaking trombone players, to determine whether there is an influence of native language speech production on trombone performance. We investigated midsagittal tongue shape during note production by New Zealand English and Tongan trombone players, as well as the relationship between vowel and note tongue shapes within each language, and how the latter are affected by pitch and note intensity (loudness). The specific trombone pitches produced by participants in the study were Bb2, F3, Bb3, D4 and F4 (in ascending order, specified according to the US standard system where C1 refers to the lowest C on the piano) while the recorded intensities ranged from piano (soft) via mezzopiano and mezzoforte to forte (loud).

Following the earliest published account by Dalla Casa (1584/1970), countless brass players have continued to employ speech syllables in brass teaching, recommending the use of different consonants (/t/ versus /d/ for hard versus soft attacks) and, starting in the 19th century, vowel colors (/ɑ/ versus /i/ for low versus high range notes) to illustrate what students should do with their tongue to produce favorable sounds on brass instruments (cf. Heyne, 2016, section "2.4.1.2 Pedagogical writing on brass playing published in the last 50 years"). We have not come across any brass method books recommending the use of the 'neutral' vowel schwa, although it would seem to be an obvious candidate for achieving a maximally open (and uniform) vocal tract configuration as advocated by many influential teachers, perhaps most notably Arnold Jacobs, tuba player of the famous Chicago Symphony Orchestra (see Frederiksen, 2006; Loubriel, 2011). Most likely, the explanation is the lack of a consistent representation of schwa in standard orthography, and few highly accomplished brass players would have received formal training in linguistics or phonetics to raise such awareness. Of course, many of the world's languages also do not have such a vowel quality.

Beginning in 1954, a small number of researchers started to empirically test the assumptions underlying the use of speech syllables in brass instrument pedagogy. Hall's ground-breaking study (Hall, 1954) found that different players used unique individual positions of the tongue and jaw during trumpet performance, and that they tended to be consistent in using the same basic formation in all registers, indicating that no large modifications took place when changing registers. The author also traced midsagittal images of extreme vowels ("ah" [/ɑ/], "oo" [/u:/], and "ee" [/i:/]) and reported that the most commonly used tongue shape during playing was "ah" but some players used the "oo" formation or intermediate formations falling between the extreme vowels. Subsequent work by Meidt (1967), Haynie (1969), Amstutz (1977), Frohrip (1972), and De Young (1975) largely confirmed Hall's findings, while observing a wider range of playing conditions that included changes in loudness and note articulations/attacks (cf. Heyne and Derrick, 2016b); two of these studies, Frohrip (1972) and De Young (1975), observed trombone players exclusively. Notably, Hiigel (1967) asked his participants (players of all brass instruments) to 'think' prescribed syllables printed underneath the music while performing various notes and found no evidence "that thinking a syllable during performance will tend to simulate the tongue position resulting from the enunciation of that syllable" (p. 108). Rather, he found significant differences between tongue placement during playing and enunciation of the prescribed syllables and this was true even for the players who claimed to use those specific syllables while playing. Overall, there was a tendency for the "tongue arch" to be placed higher with the tongue tip "farther forward" when comparing playing to recitation (p. 107).
Most studies, however, did not compare tongue shape during playing to speech production and the few that did used isolated vowel articulations which we now know are not representative of the patterns occurring in natural speech (Lindblom, 1963; Farnetani and Faber, 1992; DiCanio et al., 2015; Tsukanova et al., 2019).

Empirical research on vocal tract movements during brass playing stopped almost completely after the dangers of exposure to radiation from x-rays became apparent in the 1970s and until methods like ultrasound imaging and articulography became available (see Heyne and Derrick, 2016b). There exist two relatively recent Doctor of Musical Arts dissertations that investigated the influence of native language (Mounger, 2012) and dialect (Cox, 2014) on trombone performance more specifically; however, both studies only analyzed the acoustic signal produced during speech and instrument performance. Youngs (2018) presents a recent application of ultrasound tongue imaging to trumpet playing with a pedagogical focus which, however, involved comparisons of vowel and playing tongue shapes.

In comparison, speech production represents a well-researched field and it is both obvious and well-documented that speech differs across languages, dialects, and accents. Among the large number of possible speech sounds occurring in the world's languages, vowel sounds have received the most attention, not only because they occur in every language but also because they are fairly easy to measure using both acoustic (Boersma and Weenink, 2014) and articulatory methods (Tiede, 2010; cf. Noiray et al., 2014; Tiede and Whalen, 2015). While some languages distinguish as few as three vowel sounds (Maddieson, 2013), other languages have up to 24 vowels (Maddieson, 1984; Vallée, 1994) and theoretical investigations suggest an effect of vowel inventory size on the general organization of vowel systems (de Boer, 2000).

More specifically, Dispersion Theory (Liljencrants and Lindblom, 1972; Lindblom, 1986) claims that speech sound organization is ruled by an "Adaptive Dispersion" of their elements, that follows a "Sufficient Perceptual Contrast" principle whereby acoustic vowel spaces are organized in a way that keeps them sufficiently distinct on the perceptual level. According to this theory, the phonetic values of vowel phonemes in small vowel systems should be allowed to vary more than in vowel systems with a more crowded vowel space. In addition, the Quantal Theory of Speech (Stevens, 1972; Stevens and Keyser, 2010) states that there are certain regions of stability in phonetic space, corresponding to the point vowels [i], [a], and [u]. Such vowels should be situated in approximately the same location across all languages, irrespective of vowel inventory size, and should display less intra-category variability than other vowels.

Both theories have received some empirical support (Al-Tamimi and Ferragne, 2005), which is unsurprising given they are informed by different investigative frameworks, namely speech perception as indexed by speech acoustics in the case of Dispersion Theory, and speech production represented by modeled vocal tract movements, in the case of the Quantal Theory of Speech. In addition, if [i], [a], and [u] inhabit regions of stability in phonetic space, then languages with larger vowel inventories must necessarily have regions of stability that separate vowels as clearly as in languages with smaller vowel inventories, which necessarily reduces variability of each vowel in larger vowel inventory systems. However, there are counter-examples for both Quantal and Dispersion Theory that relate to language variability (Bradlow, 1995) and the fact that not all three-vowel systems are maximally dispersed (Butcher, 1994), so a proper analysis must test both the range and variability of vowels and notes.

In addition, Articulatory Phonology (AP; Browman and Goldstein, 1986, 1992; Goldstein and Fowler, 2003) provides a theoretical framework whereby phonological units can be analyzed as constrictions occurring at various locations along the vocal tract. Six distinct 'constricting devices' (lips, tongue tip, tongue body, tongue root, velum, and larynx) form a combinatoric system of 'gestures' which minimally contrast at a single constriction location, and such gestures can overlap temporally, as modeled within the theory of Task Dynamics (Saltzman, 1986, 1995). Vowels are understood to differ mainly according to their constriction degree at locations involving the tongue (and lips) and as such are subject to the influence of preceding and following (consonant) articulations expressed by a 'gestural score' that indicates the organization of individual constricting movements and their patterns of coordination.

Although AP posits that speech should be regarded through a unitary structure that captures both physical (movement) and phonological properties, the underlying constriction actions are nonetheless potentially transferable across different vocal tract activities since they are described on the basis of goals rather than the resulting acoustic signal. Both in speech (phonology) and in brass instrument playing, gestures are geared toward the goal of producing behavioral outcomes that allow perceivers to distinguish between possible intended goals. Additionally, and similarly to speech, patterns of 'coproduction' (overlap of gestures) may occur during brass playing (a consonant-like gesture employed to start a note would overlap a vowel-like gesture during its steady-state) and could be governed by similar biomechanical properties. A possible mechanism for the transfer of vocal tract gestures across different vocal tract activities is provided by the concept of motor memory. Motor memory (alternatively muscle memory) is generally defined as "the persistence of the acquired capability for performance," while the exact nature of the concept could refer to a "motor program, a reference of correctness, a schema, or an intrinsic coordination pattern" (Schmidt and Lee, 2011, pp. 461–462). Although it is as yet unknown where or how exactly such motor memory may be encoded and stored in the organs controlling human movement (see Tourville and Guenther, 2011, for suggestions regarding speech production), various researchers have suggested that the nervous system establishes muscular modules or "spatially fixed muscle synergies" (Ting et al., 2012) to reduce the excessive number of degrees of freedom observed during body motion (Bernstein, 1967; cf. Bizzi and Cheung, 2013). Furthermore, vocal tract movements seem to feature even greater muscle complexity than the rest of the human body (e.g., Sanders and Mu, 2013). 
It has been repeatedly demonstrated that speech production requires feedforward control (e.g., Neilson and Neilson, 1987; Perkell, 2012; Guenther, 2016) and operates in a multidimensional control space (Houde and Jordan, 1998; Tremblay et al., 2008; Gick and Derrick, 2009; Ghosh et al., 2010; Perkell, 2012), both of which are probably also true for brass instrument performance (see Bianco et al., 2010, for some evidence of the requirement of feedforward control when performing at maximum intensity on the trumpet).

Comparing the acoustic signal of brass instrument performance and vocalic speech production, one notices a similar pattern of steady states in sustained production, and dynamic changes in sound quality at the beginning and end of notes and vowels. There are also notable parallels in terms of how sound is generated during either activity. During brass playing, an outward-striking lip-reed mechanism – the player's 'embouchure' – excites the air column within the instrument, producing a spectrum of standing waves which are controlled by the natural frequencies of the air column and which are emitted from the bell at varying volumes (Benade, 1976; Campbell and Greated, 1987).

The embouchure thus serves as the 'source,' comparable to the larynx during speech production, while the instrument bore serves as 'filter.' Unlike during speech production, however, the player has only limited means of altering the properties of this filter. On most brass instruments, the player can only alter the length of the 'filter,' which thus effectively serves merely as an amplifier. The much greater length of tubing compared to the human vocal tract also means that the possible resonating frequencies of the tube are very much determined by the overtone series except for very high registers where the peaks of the impedance spectrum become progressively smaller (cf. Hézard et al., 2014; see Wolfe, 2019 for an excellent non-technical description of brass instrument acoustics).
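As a rough illustration of what it means for the playable pitches to be "determined by the overtone series," the sketch below lists the nominal harmonics of a B-flat tenor trombone with the slide fully closed. It assumes equal temperament with A4 = 440 Hz and treats the resonances as exact integer multiples of B♭1; the function name `harmonic_series` is hypothetical, and the resonances of a real instrument deviate from this idealization, particularly at the lowest resonance.

```python
# Assumption: equal temperament with A4 = 440 Hz, so A1 = 55 Hz and
# B-flat 1 lies one semitone above A1.
A1_HZ = 55.0
B_FLAT_1_HZ = A1_HZ * 2 ** (1 / 12)


def harmonic_series(fundamental_hz, count):
    """Return the first `count` integer multiples of the fundamental."""
    return [n * fundamental_hz for n in range(1, count + 1)]


# Idealized series for a closed slide; real bore resonances are only
# approximately harmonic.
for n, freq in enumerate(harmonic_series(B_FLAT_1_HZ, 8), start=1):
    print(f"harmonic {n}: {freq:7.2f} Hz")
```

Because the slide only changes the length of the tube, moving it shifts the whole series down rather than reshaping it, which is the sense in which the player has limited control over the 'filter.'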

Nonetheless, the shape of the player's vocal tract might influence the sound coming out of the instrument in limited ways, similar to the influence of subglottal resonances on speech production discovered only quite recently (Chi and Sonderegger, 2007; Lulich, 2010). While the pitch produced in the altissimo register of saxophones seems to be almost entirely determined by vocal tract resonances (Chen et al., 2008, 2012; Scavone et al., 2008), such resonances have a much smaller impact on brass instrument sound. Wolfe et al. (2010), observing the playing behavior of "an artificial trombone playing system," found that "raising the tongue, or the tongue tip, increases the height of peaks in the vocal tract impedance, and so more effectively couples it to the instrument resonances" and the sound generating mechanism (p. 310). Crucially, this difference was observed without changing any other parameters, suggesting that the mechanism might provide players with a method of fine pitch adjustment.

A small number of studies have addressed the influence of vocal tract shaping on brass instrument sound in human subjects by "measuring the impedance spectrum of the vocal tract by injecting a known broadband acoustic current into the mouth" (Wolfe et al., 2015, p. 11); this requires notes to be sustained for roughly a second but it is then possible to directly determine vocal tract resonances during playing. Using this method, a team of researchers at the University of New South Wales in Australia measured vocal tract influence on trumpet (Chen et al., 2012) and trombone performance (Boutin et al., 2015). Both studies yielded similar results with impedance peaks in the vocal tract usually being smaller than those measured for the trumpet or trombone bore, although vocal tract resonances were less variable in trombone players. While this suggests that there is no systematic tuning of vocal tract resonances to influence instrument pitch (or possibly timbre), Chen et al. (2012) nevertheless speculate that raising the tongue, if not for vocal tract tuning, might facilitate high note playing by changing the magnitude or phase of vocal tract resonances (p. 727). In the specific case of the trombone (Boutin et al., 2015), the first vocal tract resonance consistently stayed within a narrow range of 200–375 Hz, leading the authors to conclude that those changes were mostly driven by changes in glottis opening (but see section "Other Constraints on Tongue Shape During Brass Instrument Performance" for conflicting findings on glottal aperture during brass instrument performance); the second vocal tract resonance, however, could "presumably be modified by varying the position and shape of the tongue, as is done in speech to vary the resonances of the tract" (p. 1200). 
Additionally, the authors noted a split across study participants by proficiency level: beginning trombone players more often produced second vocal tract resonances around 900 Hz while that number was around 650 Hz for advanced players. Interpretation of these results based on the first vowel formant in speech (F1, corresponding to the second vocal tract resonance peak as measured in this study) suggests the use of a lower tongue position by more proficient players. The same research group has also mentioned and, to a limited extent, investigated the possibility of vocal tract resonances influencing the timbre of wind instruments; although not determining or noticeably affecting the frequency of the fundamental of a played note, such a "filtering effect, though smaller for most wind instruments than for voice," would admit the flow of acoustical energy into the instrument at some frequencies while inhibiting it at others (Wolfe et al., 2009, p. 7–8). The effect has been shown to determine the timbre of the didgeridoo (Wolfe et al., 2003) but it is much weaker on the trombone due to its higher impedance peaks and an additional formant introduced by the mouthpiece (cf. Wolfe et al., 2003, 2013).

It has also been suggested that vocal tract resonances could become dominant in the very high register of brass instruments. Based upon numerical simulations of simple and two-dimensional lip (embouchure) models, Fréour et al. (2015) propose a possible mechanism whereby changing the relative phase difference of oscillations within the oral cavity and instrument can lead to an optimum tuning of the system that maximizes acoustical feedback of oscillations within the instrument on the player's lips, at the same time maximizing lip motion and hence the acoustic flow into the instrument.

In general, however, brass playing requires a larger amount of airflow (460 ml/s for a low note played on the trumpet at medium intensity; cf. Frederiksen, 2006, pp. 120–121; Kruger et al., 2006; Fréour et al., 2010 for information on other brass instruments) than speech production (around 150 ml/s during reading; Lewandowski et al., 2018) which may bias tongue position and affect the biomechanics of consonant-like tongue movement used to initiate notes. Students are usually taught to begin notes by releasing the tongue from a coronal place of articulation (although multiple articulations also make use of more retracted places of articulation so that attacks can occur in quick succession). In terms of a possible overlap of vocal tract movements during both activities, and hence the possibility of language influence on brass instrument performance, there are thus two possible areas of investigation: vowel production and its influence on steady states during brass playing, and the dynamics of consonant articulations on the way players begin and end notes on brass instruments.

The above-mentioned sparsity of empirical studies on vocal tract movements during brass instrument performance points to the difficulty of collecting such data (cf. Heyne and Derrick, 2016b). Ultrasound imaging of the tongue is a technique that has experienced increased use in the area of speech production research due to having no known side-effects (Epstein, 2005) and its comparatively low cost (Gick, 2002) relative to more invasive technologies like real-time magnetic resonance imaging (MRI; e.g., Niebergall et al., 2013). Ultrasound imaging uses ultra-high frequency sound ranging from ∼3 to 16 MHz to penetrate soft tissues and calculate an image of their density by evaluating the echo returned when sound waves get reflected due to changes in tissue density; it was first applied to image the human tongue by Sonies et al. (1981). To produce ultrasound signals, ultrasound machines use piezoelectric crystals embedded in a transducer (or probe), which is held underneath the chin (submentally) when performing lingual ultrasound. Ultrasound waves "get absorbed by bone and reflect sharply off of air boundaries," meaning that the technique does not image bone or air very well (Gick et al., 2013, p. 161); its second property, however, is very useful for imaging the shape of the tongue within the oral cavity as it provides good resolution of the tongue surface as long as there is continuous tissue for the sound waves to travel through.
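To give a sense of scale for the ∼3 to 16 MHz range mentioned above, the following sketch converts frequency to wavelength using λ = c / f. The speed of sound of 1540 m/s is an assumption (a commonly quoted average for soft tissue, not a value from this study), and the function name `wavelength_mm` is hypothetical.

```python
# Assumption: average speed of sound in soft tissue (textbook value),
# not a figure reported in this study.
SPEED_OF_SOUND_TISSUE_M_PER_S = 1540.0


def wavelength_mm(frequency_mhz, speed_m_per_s=SPEED_OF_SOUND_TISSUE_M_PER_S):
    """Return the acoustic wavelength in millimetres for a frequency in MHz."""
    frequency_hz = frequency_mhz * 1e6
    return speed_m_per_s / frequency_hz * 1000.0


for f_mhz in (3.0, 4.0, 10.0, 16.0):
    print(f"{f_mhz:5.1f} MHz -> {wavelength_mm(f_mhz):.3f} mm")
```

Under this assumption the wavelengths are all well below a millimetre, which is consistent with the technique resolving the tongue surface in useful detail.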

On the basis of the considerations laid out above, we chose to conduct our study on brass players from two languages that differ significantly in the size and organization of their vowel systems. New Zealand English (NZE) is a Southern-hemisphere variety of English that features a phoneme inventory typical for English. Although many of its vowels are considerably shifted from the more well-known vowel systems of American and British English, it retains the same large number of monophthong vowel phonemes (Hay et al., 2008). Tongan, in contrast, is a typical Polynesian language with a small phoneme inventory that distinguishes only the five cardinal vowels /a, e, i, o, u/. See **Table 1** and **Figure 1** for additional detail on the phonological inventory of both languages; throughout this paper, we employ the lexical sets included in **Figure 1** to refer to the vowel phonemes of NZE (cf. Wells, 1982). We also decided to focus on the trombone rather than including players of all brass instruments since differently sized mouthpieces might affect tongue shape due to varying air flow requirements and the potential for vocal tract influences on instrument sound at different resonating frequencies. Furthermore, the trombone provides an optimal choice in terms of investigating the influence of the dynamics of consonant articulations in speech on the way players begin and end notes (although this was not investigated in this study); in contrast to valved brass instruments such as the trumpet, a trombone player has to produce all articulations by momentarily interrupting the airflow into the mouthpiece (using the tongue and/or glottis) so that the researcher can be sure of the vocal tract contributions to such articulations.

In a previous analysis of a subset of this data (Heyne, 2016), we had used smoothing splines analysis of variance (SSANOVA; Gu, 2013b; package gss: Gu, 2013a) to calculate average tongue shapes for monophthong and sustained note productions in polar coordinates on the language group level and for each individual player; see Heyne and Derrick (2015c) and Mielke (2015) for discussions on why performing these calculations in Cartesian coordinates leads to errors that are most pronounced at the tongue tip and root. However, ultrasound data of speech production pose another serious issue: Analysis of tongue contour data has proven to be quite difficult, in part because appropriate techniques for recognizing accurate between-subject variation have historically been underdeveloped. SSANOVAs make assumptions about confidence intervals that are not statistically appropriate, so we decided to instead use generalized additive mixed-effects models (GAMMs) for our analyses presented in this paper.
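The polar representation mentioned above can be illustrated with a minimal sketch. It assumes that each traced contour is a list of (x, y) points in millimetres and that a single origin (for example, an estimate of the virtual origin of the probe's scanlines) is known; both the contour values and the function names `to_polar` and `to_cartesian` are hypothetical and are not taken from the authors' analysis code.

```python
import math


def to_polar(points, origin):
    """Convert (x, y) contour points to (angle_rad, radius) about `origin`."""
    ox, oy = origin
    polar = []
    for x, y in points:
        dx, dy = x - ox, y - oy
        polar.append((math.atan2(dy, dx), math.hypot(dx, dy)))
    return polar


def to_cartesian(polar, origin):
    """Convert (angle_rad, radius) pairs back to (x, y) points."""
    ox, oy = origin
    return [(ox + r * math.cos(a), oy + r * math.sin(a)) for a, r in polar]


# Hypothetical midsagittal contour (mm), listed from one end to the other.
contour = [(-30.0, 20.0), (-15.0, 40.0), (0.0, 45.0), (15.0, 40.0), (30.0, 25.0)]
origin = (0.0, 0.0)  # assumed probe origin

polar = to_polar(contour, origin)
for angle, radius in polar:
    print(f"angle = {math.degrees(angle):6.1f} deg, radius = {radius:5.1f} mm")
```

In this form the radius can be modeled as a function of angle, which plausibly avoids the near-vertical stretches at the tongue tip and root that cause trouble when height is modeled as a function of x.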

Symbols appearing on the right sides of cells are voiced, symbols on the left are voiceless.

Generalized Additive Mixed-effects Models (GAMMs) represent a statistical technique that deals with non-linear relationships between time-varying predictors and outcome variables (Hastie and Tibshirani, 1986; Wood, 2006, 2017). The technique has received attention within the phonetics community recently with major publications featuring the use of GAMMs to quantify the dynamics of formant trajectories over time from an acoustic point of view (Sóskuthy, 2017), or tongue position changes measured over time (Wieling et al., 2016; Wieling and Tiede, 2017). GAMMs work by applying a smoothing function (henceforth simply 'smooth') to a time-series that can be adjusted to the specific variables that may influence it. It is also possible to model random effects to take into account the inherent variability between, e.g., speakers and lexical items. Tongue contours obtained using Ultrasound Tongue Imaging are dynamic in nature as the full tongue contour is traced sequentially, and hence this can mimic the behavior of time-varying outcomes. Modeling tongue contours and change over-time is also possible by using a "tensor product interaction" (ti) between two different "time-series" (see Al-Tamimi, 2018).
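For orientation, a generic GAMM of the kind described here can be written as follows. This is a schematic form under our own notational assumptions, not the specific model fitted in this study: $y_{ij}$ denotes a tongue-surface measurement for speaker $j$ at position $x_{ij}$ along the traced contour, $f$ is a reference smooth, $f_{g(j)}$ is a smooth capturing how the group $g(j)$ of speaker $j$ departs from that reference, and $b_j$ is a by-speaker random effect.

$$
y_{ij} = \beta_0 + f(x_{ij}) + f_{g(j)}(x_{ij}) + b_j + \varepsilon_{ij},
\qquad b_j \sim \mathcal{N}(0, \sigma_b^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

The smooths $f$ and $f_{g(j)}$ are estimated from the data with a penalty on wiggliness, so the model can follow the curved shape of a contour without the functional form being specified in advance.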

Having established both the scope of the study, and the tools for analysis, we here outline a reasonable set of predictions for our hypotheses:

**Hypothesis 1:** A brass player's native language will influence the vocal tract states they assume during performance on their instrument.

**Prediction 1a:** Given the longstanding tradition within brass instrument pedagogy of using speech syllables, and vowel tongue shapes, more specifically, as well as well-documented articulatory differences in vowel tongue shapes across languages, we predict that tongue shape during sustained note production will differ for the NZE players and Tongan players in our study. This difference will be apparent both when comparing the averages of all produced notes and when comparing groups of notes played at different intensities.

**Prediction 1b:** Because Tongan has a smaller vowel inventory than NZE, we hypothesize that Tongan vowels will have greater tongue position variability than English vowels. We predict this difference in variability will transfer to trombone performance so that Tongan players should also display higher variability in terms of the tongue positions used during trombone playing.

**Hypothesis 2:** NZE players will use a more centralized tongue position during trombone performance than Tongan players.

**Prediction 2:** Brass teachers and method books stress the necessity of keeping the vocal tract uniformly open to produce a good sound. An obvious candidate to produce such a vocal tract configuration is the neutral vowel schwa; NZE has such a vowel while Tongan does not. We hence predict that NZE players will use a more centralized tongue position during sustained note production on the trombone than Tongan players, who will assume a playing tongue position modeled on a different vowel in their native vowel system.

**Hypothesis 3:** Tongue position during trombone performance will vary with pitch.

**Prediction 3:** There is a century-old tradition within brass playing pedagogy of recommending the use of low vowels in the low register and high vowels in the high register. Based on this, we predict that the tongue positions employed during trombone performance will become increasingly closer (higher) with rising pitch.

# MATERIALS AND METHODS

# Ultrasound Imaging of Speech Production and Trombone Performance

Use of ultrasound in studies with long collection times requires a method of fixing the ultrasound transducer position relative to the head; due to the lack of hard oral cavity structures in the produced images, it is otherwise impossible to directly compare images across time and/or subjects. For this study, there was the additional need to allow users to play on a trombone while having their tongue imaged with ultrasound. We used a modified version of the University of Canterbury non-metallic jaw brace (Derrick et al., 2015) that was narrow enough not to contact the trombone tubing running along the left side of the player's face. The device ties probe motion to jaw motion and thus reduces motion variance. An assessment of the motion variance of the system, evaluating tongue and head movement data collected using both ultrasound and electromagnetic articulography (Derrick et al., 2015), showed that 95 percent confidence intervals of probe motion and rotation were well within acceptable parameters described in a widely cited paper that traced head and transducer motion using an optical system (Whalen et al., 2005). We are not aware of any alternative systems available at the time of the data collection that would have been compatible with trombone performance. Similarly, electromagnetic articulography (EMA) would have been unsuitable for use in this study due to long setup times (fixing the sensors in place requires anywhere from 20 to 45 min), and the danger of sensors coming loose during a long experiment (participants were recorded for around 45 min, on average) that featured possibly more forceful tongue movements as well as higher amounts of airflow than previous speech-only experiments. 
Furthermore, EMA only provides data for isolated flesh points that will be inconsistently placed across individuals and it is very difficult and often impossible to position articulography sensors at the back of the tongue due to the gag reflex, meaning that we probably would have been unable to document the differences in tongue position we found at the back of the tongue using ultrasound imaging.

#### Recording Procedure

All study data were collected using a GE Healthcare Logiq e (version 11) ultrasound machine with an 8C-RS wide-band microconvex array 4.0–10.0 MHz transducer. Midsagittal videos of tongue movements were captured on either a late 2013 15-inch 2.6 GHz MacBook Pro or a late 2012 HP Elitebook 8570p laptop with a 2.8 GHz i5 processor, both running Windows 7 (64-bit); the following USB inputs were encoded using the command line utility FFmpeg (FFmpeg, 2015): the video signal was transmitted using an Epiphan VGA2USB pro frame grabber, and a Sennheiser MKH 416 shotgun microphone connected to a Sound Devices LLC USBPre 2 microphone amplifier was used for the audio. The encoding formats for video were either the x264 (for video recorded on the MacBook Pro) or mjpeg codecs (for video recorded on the HP Elitebook), while audio was encoded as uncompressed 44.1 kHz mono.

Although the ultrasound machine acquired images within a 110 degree field of view at 155–181 Hz depending on scan depth (155 Hz for 10 cm, 167 Hz for 9 cm, and 181 Hz for 8 cm), the bandwidth limitations of the frame grabber meant that the frame rates recorded to the laptops reached only 58–60 Hz and were encoded in a progressive scan uyvy422 pixel format (combined YUV and alpha, 32 bits per pixel; 2:1 horizontal downsampling, no vertical downsampling) at 1024 × 768. This means that the potential temporal misalignment of image content grabbed from the top versus bottom of the ultrasound machine screen (via the frame grabber; the misalignment is called 'tearing') would never exceed 6.45 milliseconds.
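The 6.45 ms bound quoted above appears to correspond to one frame period of the ultrasound machine at its slowest acquisition rate (155 Hz). The sketch below reproduces that arithmetic; the interpretation that a grabbed screen image can mix content from at most two consecutive ultrasound frames is our assumption, and the names used are hypothetical.

```python
def frame_period_ms(rate_hz):
    """Return the duration of one frame in milliseconds."""
    return 1000.0 / rate_hz


# Acquisition rates reported for each scan depth (depth in cm -> rate in Hz).
ULTRASOUND_RATES_HZ = {10: 155.0, 9: 167.0, 8: 181.0}

for depth_cm, rate_hz in ULTRASOUND_RATES_HZ.items():
    print(f"{depth_cm} cm: {frame_period_ms(rate_hz):.2f} ms per ultrasound frame")

# Assumption: tearing is bounded by one ultrasound frame period, so the
# worst case occurs at the slowest acquisition rate.
worst_case_tearing_ms = max(frame_period_ms(r) for r in ULTRASOUND_RATES_HZ.values())
print(f"worst-case tearing: {worst_case_tearing_ms:.2f} ms")

# For comparison, the frame period of the 58-60 Hz video written to disk.
print(f"captured video: {frame_period_ms(60.0):.2f}-{frame_period_ms(58.0):.2f} ms per frame")
```

On this reading, the potential tearing is considerably shorter than the interval between two recorded video frames.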

All NZE-speaking and one Tongan participant were recorded in a small sound-attenuated room at the University of Canterbury in Christchurch, New Zealand. No equivalent room was available for the recordings of the other Tongan participants. As a result, recordings were completed in a small empty room on the campus of the Royal Tongan Police Band in Nuku'alofa, capital city of the Kingdom of Tonga.

#### Speech Elicitation

All NZE-speaking participants were asked to read a list of 803 real mono- and polysyllabic words off a computer screen, except for the first participant. Words were presented in blocks of three to five items using Microsoft PowerPoint, with the next slide appearing after a pre-specified, regular interval; the first participant read a list of words of similar length printed on paper and presented in lines of three to seven items, depending on orthographic length. Words were chosen to elicit all eleven monophthongs of NZE (see **Figure 1**) in stressed position plus unstressed schwa (see Heyne, 2016, pp. 252–255 for the full word list). Note that we distinguish schwa occurring in non-final and final positions in our analyses, as we were previously able to show that these sounds are acoustically and articulatorily different and display phonetic variability with speech style comparable to other vowel phonemes (Heyne and Derrick, 2016a). All words were chosen to elicit all combinations with preceding coronal (/t, d, n/) and velar (/k, g/) consonants, as well as rhotics and laterals. Although it is well-known that read speech and wordlists result in somewhat unnatural speech production (Barry and Andreeva, 2001; Zimmerer, 2009; Wagner et al., 2015), this form of elicitation was chosen to ensure that the desired phoneme combinations were reliably produced, and to facilitate automatic acoustic segmentation. While the blocks usually contained words with the same stressed consonant-vowel combination, the sequence of the blocks was randomized so participants would not be able to predict the initial sound of the first word on the following slide; all NZE participants read the list in the same order. This procedure resulted in nine blocks of speech recordings lasting roughly 2 min and 20 s each, except for the first participant who was shown the next block after completing the reading of each previous block.

The same setup was used for the Tongan speakers who read through a list of 1,154 real mono- and polysyllabic words to elicit all five vowels of Tongan, both as short and long vowels, and occurring in combination with the language's coronal and velar consonants (see **Table 1**; see Heyne, 2016, pp. 249–251 for the full word list); all Tongan participants read the list in the same order. In Tongan, 'stress' is commonly realized as a pitch accent on the penultimate mora of a word (Anderson and Otsuka, 2003, 2006; Garellek and White, 2015), although there are some intricate rules for 'stress' shift that do not apply when lexical items are elicited via a list. We only analyzed stressed vowels with stress assigned to the penultimate mora and Tongan words are often quite short, consisting minimally of a single vowel phoneme, so it did not take as long to elicit the Tongan wordlist as the numerically shorter NZE wordlist.

Additionally, speakers from both language groups were asked to read out the syllables /tatatatata/ or /dadadadada/ at the beginning and end of each recording block to elicit coronal productions used to temporally align tongue movement with the resulting rise in the audio waveform intensity (Miller and Finch, 2011).

#### Musical Passages

The musical passages performed by all study participants were designed to elicit a large number of sustained productions of different notes within the most commonly used registers of the trombone. Notes were elicited at different intensities (piano, mezzopiano, mezzoforte, and forte; we also collected some notes produced at fortissimo intensity but removed them due to insufficient token numbers across the two language groups) and with various articulations including double-tonguing, which features a back-and-forth motion of the tongue to produce coronal and velar articulations. To control as much as possible for the intonation of the produced notes, five out of a total of seven passages did not require any slide movement and participants were asked to 'lock' the slide for this part of the recordings (the slide lock on a trombone prevents extension of the slide). The difficulty of the selected musical passages was quite low to ensure that even amateurs could execute them without prior practice. Participants were asked to produce the same /tatatatata/ or /dadadadada/ syllables described above at the beginning and end of each recording block in order to allow for proper audio/video alignment.

Trombone players these days can choose to perform on instruments produced by a large number of manufacturers, built of various materials and with varying physical dimensions, both of which influence the sound produced by the instrument (Pyle, 1981; Ayers et al., 1985; Carral and Campbell, 2002; Campbell et al., 2013 among others). For this reason, we asked all participants to perform on the same plastic trombone ('pBone' - Warwick Music, Ltd., United Kingdom) and mouthpiece (6 1/2 AL by Arnold's and Son's, Wiesbaden, Germany); the first English participant performed on his own 'pBone' using his own larger mouthpiece.

#### Study Participants

Study participants were recruited through personal contacts and word-of-mouth in Christchurch and Nuku'alofa and did not receive any compensation for their participation; data collection was approved by the Human Ethics Committee at the University of Canterbury and all subjects were adults and gave written informed consent in accordance with the Declaration of Helsinki. **Table 2** lists some basic demographic and other trombone-playing-related information for participants in the two language groups; each group included one female player. Given the already quite restrictive criteria for inclusion in the study (playing a specific brass instrument, the trombone), we were unable to balance our sample in terms of playing proficiency; for the purpose of **Table 2**, playing proficiency was determined using a combination of profession (whether a player earned some (semiprofessional) or most of their income (professional) by playing music) and a qualitative rating of their skill by the first author. Note that even though the Royal Tongan Police Band is a full-time professional brass band, players also serve as police officers some of the time, hence only one out of four Police Band players was rated as 'professional.'

All NZE-speaking participants were effectively monolingual and all but two never spent significant time outside New Zealand. One participant (S30) lived in the United Kingdom for 2 years as a child and spent 6 months as a High School exchange student in Germany, while one professional participant (S25) lived in the United States for 7 years and reported elementary proficiency in German and Spanish.

All except the first (S4) of the Tongan participants resided in Tonga at the time of recording and reported elementary proficiency in English acquired as part of their Tongan high school education. S4 (recorded in Christchurch) had been living in New Zealand for 20 years but spoke English with a Tongan accent and did not produce Tongan vowels that were markedly different from those of the other participants. Additionally, one of the players recruited in Tonga (S16) had previously spent one-and-a-half years living in Brisbane, Australia, while another (S17) reported elementary proficiency in Samoan. All remaining Tongan speakers were monolingual.

#### Data Preprocessing

Audio–video misalignment resulting from recording from two different USB inputs (audio and video interfaces) was resolved by aligning the tongue movement away from the alveolar region with auditory release bursts during the production of /tatatatata/ or /dadadadada/ syllables produced at the beginning and end of every recording block (see Miller and Finch, 2011).

#### Segmentation of Audio Signals

In order to automatically segment the word list recordings, we used the HTK toolkit (Young et al., 2006) as implemented in LaBB-CAT (Fromont and Hay, 2012). Phonemes matching the orthography of the input were exported from the American English version of the CELEX2 dictionary (Baayen et al., 1995) for the NZE stimuli as we were unaware of any segmentation tool available for NZE at the time. A custom dictionary was created from a Tongan dictionary (Tu'inukuafe, 1992) for all the words contained in the Tongan wordlist. All annotations were checked and corrected as necessary by the first author, with errors occurring much more frequently in the Tongan data set since the segmentation process for this data relied on an algorithm developed for speech produced in English. Three participants' datasets recorded early on were segmented manually (two NZE and one Tongan participant).

For the musical passages, we used the Praat 'Annotate - to TextGrid (silences)' tool to perform a rough segmentation of the audio signal into different notes, manually corrected the boundaries, and finally applied a script to assign the appropriate label to each note from a predefined text file (Boersma and Weenink, 2014). Missed notes were eliminated, although for long sustained notes, we used a later part of the note if the participant recovered to produce a well-formed note.

#### Selection of Ultrasound Images for Articulatory Analysis

For both the NZE and Tongan data, only primarily stressed (or accented) vowels were selected for analysis; for the NZE data we used the stress markings from the New Zealand Oxford Dictionary (Kennedy and Deverson, 2005) entries, while we applied the penultimate stress/accent rule (Kuo and Vicenik, 2012) to the Tongan data. For all vowel articulations, we used the temporal midpoint to measure tongue shape, while we measured tongue shape at one third of note duration for sustained notes played on the trombone. Players of wind instruments often decrease note intensity following the beginning of notes and we wanted to make sure that we were measuring tongue shape during the steady-state of note production. For the first English participant, we manually selected a single ultrasound frame for each note as indicated by a stable tongue shape.
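The two measurement points described above (temporal midpoint for vowels, one third of the duration for sustained notes) can be sketched as follows. This is a minimal illustration only: the function names, the `kind` labels, and the frame rate used in the usage note are hypothetical and do not come from the study's scripts.

```python
def measurement_time(start_s, end_s, kind):
    """Return the time (in seconds) at which tongue shape is sampled.

    kind == "vowel": temporal midpoint of the labelled interval.
    kind == "note":  one third of the way into the sustained note.
    """
    if end_s <= start_s:
        raise ValueError("interval must have positive duration")
    duration = end_s - start_s
    if kind == "vowel":
        return start_s + duration / 2.0
    if kind == "note":
        return start_s + duration / 3.0
    raise ValueError("kind must be 'vowel' or 'note'")


def nearest_frame(time_s, frame_rate_hz):
    """Index of the video frame closest to time_s (frame rate is an assumption)."""
    return round(time_s * frame_rate_hz)
```

For example, a note labelled from 2.0 s to 5.0 s would be measured at 3.0 s, which at an assumed 60 frames per second corresponds to frame 180.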

#### Tongue Contour Tracing and Outlier Removal


It is important to understand that ultrasound measurements are usually exported as sequences of individual images (or videos) with almost all information contained in a grainy line that represents the change of tissue density in relation to the location of an ultrasound transducer that sweeps the fan-shaped field of view in radial fashion. Although it is possible to automatically trace such images, the tools available at the time of this data collection still required a lot of manual intervention, so we decided to focus our analysis on steady-state sounds (vowels in speech and sustained notes during brass playing).

We manually traced all midsagittal tongue contours using GetContours (Tiede and Whalen, 2015) for MATLAB (MathWorks Inc, 2015). The tool allows the import of time stamps from Praat TextGrids and automatically interpolates a minimum of three manually placed anchors to a cubic spline 100 points in length outlining the tongue shape produced in each individual ultrasound frame. Once all vowel or note tokens were traced for a certain participant, we employed various search terms to assemble all tokens for a specific stressed/accented vowel or note into a separate data set based on the information contained in the TextGrid imported from Praat; a small number of visual outliers (around 1% of tokens for speech and 1.4% for notes) were subsequently removed by plotting tokens of the same vowel or note together. Note that although Tongan distinguishes short and long vowels (often analyzed as one or two morae, respectively, see Feldman, 1978; Kuo and Vicenik, 2012), the articulatory differences between these phonemes are very small, and we thus decided to treat these tokens as a single underlying motor target. Overall, the models reported below were estimated based on 12,256 individual tongue contours of vowel tokens (7,834 for NZE, 4,422 for Tongan) and 7,428 tongue contours of sustained note production (3,715 for NZE, 3,713 for Tongan). Full token numbers are included in the R notebooks available on GitHub<sup>1</sup>.

Due to variable image quality and the unconstrained placement of GetContours anchors on each ultrasound frame, individual tokens differed greatly in length. Generalized additive mixed models (GAMMs) provide appropriate confidence intervals for noisy data, eliminating the necessity of cropping data, and are able to handle input data of different lengths by modeling individual differences similar to (linear) mixed effects models. Missing data points are replaced with an average value of existing data in the same position, taking individual variability, as well as variability inherent to the currently observed condition, into account. Nonetheless, we did remove a few tokens occurring in specific contexts (e.g., a certain note produced at fortissimo intensity, as mentioned above) where we did not have a sufficient number of tokens for each language group to estimate reliable average tongue shapes.

#### Rotating and Scaling Ultrasound Traces Across Individuals

Our research question necessitated the direct comparison of articulatory data across different vocal tract activities and individuals. Ultrasound data are particularly difficult in this regard since no anatomical landmarks are visible in the recorded images, and tongue shape during speech production can furthermore vary with individual differences in vocal tract shape and biomechanics (Simpson, 2001, 2002; Fuchs et al., 2008; Brunner et al., 2009; Rudy and Yunusova, 2013; Lammert et al., 2013a,b; Perrier and Winkler, 2015). Various methods have been developed for determining and comparing, e.g., the curvature of selected tongue shapes (Ménard et al., 2011; Stolar and Gick, 2013; Zharkova, 2013a,b; Dawson et al., 2016) or the relative articulatory height and fronting of a certain vowel tongue shape (Lawson and Mills, 2014; Noiray et al., 2014; Lawson et al., 2015), independent of anatomical landmarks. For the purposes of this study, however, we needed to compare information regarding both tongue shape and relative position, so we decided to transform all data into a common space prior to our statistical analysis. Across both language groups and the different vocal tract behaviors, the high front vowel /i:/ appeared to be most constrained by individual vocal tract morphology – and previous research has shown /i:/ to have a relatively stable production pattern across languages (Chung et al., 2012). According to Chung et al. (2012) in terms of rotational differences, cross-linguistic differences were mostly due to a more back location of /a/ and a more fronted location of /u/ produced by English and Japanese speakers relative to those of the three other languages.

For both NZE and Tongan, each subject's ultrasound contours were rotated to align the position of the (mean) average contour's highest points during the production of the high front vowel /i:/ (FLEECE in NZE). Note that for NZE participant S1, the highest point actually occurred during the averaged productions of /e/ (the NZE DRESS vowel) and we used this location instead; this articulatory reversal may either have been due to the extremely close articulation of the DRESS vowel (/e/) in modern NZE (cf. Introduction), possibly interacting with the increasing diphthongization of the FLEECE vowel (/i:/; cf. Maclagan and Hay, 2007), or the speaker could be a 'flipper' as suggested by Noiray et al. (2014; see also Ladefoged et al., 1972). Contour height was measured by calculating the distance of each point in relation to the virtual origin of the ultrasound signal. **Figure 2** illustrates the procedure used to identify the virtual origin on a sample ultrasound image. The Figure also shows green lines that can be used to convert image pixel values to real-life distances (cf. Heyne and Derrick, 2015c). (Note that all our articulatory images have the tongue tip at the right.) Identifying these two locations allowed us to calculate a two-dimensional vector connecting the two locations, which in turn was used to rotate tongue traces in polar space without affecting the underlying variability. Each set of contours was also scaled so that the furthest point of the high front vowels lined up to that of S24 NZE, who had the overall smallest vocal tract and hence served as the target space for all other data (vowel and playing contours across both language groups). **Figure 3** shows the scaling applied to six participant data sets in our study.
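The rotation and scaling step can be illustrated with a short sketch that operates on contours already expressed in polar coordinates (theta, rho) around the virtual origin. It is a simplified reading of the procedure, not the authors' implementation: it assumes a single anchor point per speaker (the highest point of the mean /i:/ contour) that is used for both rotation and scaling, and all function and parameter names are hypothetical.

```python
def highest_point(contour):
    """Return the (theta, rho) point furthest from the virtual origin.

    Height is taken as distance from the virtual origin, so the highest
    point is simply the one with the largest rho.
    """
    return max(contour, key=lambda point: point[1])


def align_contours(contours, anchor, ref_theta, ref_rho):
    """Rotate and scale one speaker's contours into the reference space.

    anchor:    (theta, rho) of the speaker's own highest /i:/ point
    ref_theta: angle of the reference speaker's highest /i:/ point
    ref_rho:   distance of the reference speaker's highest /i:/ point
    Every contour receives the same angular shift and the same scale
    factor, so token-to-token variability is preserved.
    """
    d_theta = ref_theta - anchor[0]   # rotation in polar space
    scale = ref_rho / anchor[1]       # uniform radial scaling
    return [[(theta + d_theta, rho * scale) for theta, rho in contour]
            for contour in contours]
```

Because the same shift and scale factor apply to all of a speaker's vowel and note contours, relative differences between that speaker's tokens are unchanged by the transformation.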

We also used the 'virtual origin' to correct one participant's data (S12 NZE) for whom the ultrasound transducer seemed to have moved partway through the recording session. Although the overall quality of palate shapes collected at regular intervals during the experiment by tracing tongue movement during water swallows (cf. Epstein and Stone, 2005) was insufficient for inter-subject alignment, the availability of such traces for the particular participant greatly helped in determining the required amount of rotation and translation to correct for the transducer movement; we were also able to confirm the temporal location of transducer movement by examining video of the participant's face collected throughout the recording session (cf. Heyne, 2016, pp. 144–145).

<sup>1</sup>https://jalalal-tamimi.github.io/GAMM-Trombone-2019/

#### Statistical Analyses

The x- and y-coordinates of all tongue traces along with vowel identity and phonetic context for the preceding and following speech tokens, and note identity (pitch) as well as intensity (loudness) for the five different trombone notes, were transferred to R (version 3.5.1, R Core Team, 2018) and transformed into polar coordinates using the virtual origin coordinates for the participant with the smallest vocal tract. To test each prediction, we generated Auto-Regressive Generalized Additive Mixed Models (GAMMs) using the bam function from the package mgcv (Wood, 2011, 2015).
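The conversion from Cartesian trace coordinates to polar coordinates around the virtual origin can be sketched as below. The study carried out this step in R; the Python version here is only an illustration, and the function name and the image-coordinate convention (y increasing downward, which would yield the negative theta values seen in the plots) are assumptions.

```python
import math


def to_polar(points, origin):
    """Convert (x, y) trace points to (theta, rho) around a virtual origin.

    theta is the angle in radians returned by atan2, and rho is the
    Euclidean distance of the point from the origin.
    """
    origin_x, origin_y = origin
    polar = []
    for x, y in points:
        dx, dy = x - origin_x, y - origin_y
        polar.append((math.atan2(dy, dx), math.hypot(dx, dy)))
    return polar
```

With image coordinates, a tongue point lying above the origin has a negative dy, so its theta falls between -π and 0, with the tongue front (at the right of the image) closer to 0.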

For model back-fitting, we started by visually evaluating the patterns in the data and ran various models (e.g., no random effects, random effects, multiple predictors including sex and playing proficiency, etc.). Visual exploration of the data made it apparent that the differences between speakers, and in how they produced notes at varying intensities, are captured by the optimal model. The R<sup>2</sup> value of the optimal model is more than double that of the model without random effects. Together with visual inspection, the R<sup>2</sup> values allowed us to select the optimal model.<sup>2</sup> Once the optimal model was obtained, we estimated the correlation level in the residuals and generated a new model that took the autocorrelation in the residuals into account. We also performed a well-formedness test using the gam.check function from the mgcv package to inspect the residuals themselves and determine whether the value k = 10, referring to the number of knots defining the smoothing spline basis function, was sufficient.
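One common way to quantify the correlation level in model residuals is the lag-1 autocorrelation, which can then serve as the AR(1) parameter of the refitted model (in mgcv's bam function this is supplied through the rho argument). The text does not spell out the exact estimator used, so the sketch below is an assumption; it also treats the residuals as a single ordered series, whereas in practice each tongue contour would form its own series.

```python
def lag1_autocorrelation(residuals):
    """Lag-1 sample autocorrelation of an ordered sequence of residuals."""
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    mean = sum(residuals) / n
    deviations = [r - mean for r in residuals]
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        raise ValueError("residuals have zero variance")
    # Sum of products of neighbouring deviations along the series.
    numerator = sum(deviations[i] * deviations[i + 1] for i in range(n - 1))
    return numerator / denominator
```

Values near 1 indicate that neighbouring points along a contour have strongly similar residuals, which is the situation an autoregressive error model is meant to address.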

The resulting model for Hypothesis 1 is shown in formula 1 below. This and all subsequent model formulae employ standard mgcv syntax defined as follows: s = smooth term used to estimate the curvature of tongue contours; bs = basis function of the smooth term; cr = cubic regression spline; fs = factor smooth that allows the estimation of interaction smooths for random effects; k = number of knots to control for the degree of non-linearity in the smooth; by = used to model non-linear interactions between a factor and the predictor; m = n, e.g., 1, parameter specifying how the smoothing penalty is to be applied, allowing the shrinkage toward the mean for the random effects; more details can be found in Wood (2011, 2017) and Sóskuthy (2017).

1: rho ∼ LanguageNoteIntensity + s(theta, bs = "cr", k = 10) + s(theta, k = 10, bs = "cr", by = LanguageNoteIntensity) + s(theta, subject, bs = "fs", k = 10, m = 1, by = NoteIntensity)

Where rho is the distance of the fitted tongue contour point from the virtual origin, and theta is the angle in relation to the virtual origin. The variable LanguageNoteIntensity encodes the interaction between language (Tongan, NZE), note identity (Bb2, F3, Bb3, D4, F4), and note intensity (piano, mezzopiano, mezzoforte, forte). It is used as a fixed effect and as a contour adjustment. The variable NoteIntensity encodes the interaction of note identity and note intensity. It is used as a contour adjustment for the random effect that uses subject ID, used to model the within-speaker variations with respect to note productions. All variables were ordered to allow for a meaningful interpretation of the smooths.

For Hypothesis 1, prediction 1b required the variance to be analyzed rather than position itself. For this model, the source data was summarized by grouping ultrasound theta angles into 100 bins, and computing variance of tongue position for each bin by speaker and token (both musical notes and vowels). In all other ways, the formula was derived as for predictions 1a, 2, and 3. The resulting model is shown in formula 2:

2: var(rho) ∼ LanguageNote + s(theta, bs = "cr", k = 10) + s(theta, bs = "cr", k = 10, by = LanguageNote) + s(theta, subject, bs = "fs", k = 10, m = 1, by = LanguageNote)

This formula uses var(rho) for the variation in the distance of the fitted tongue contour point from the virtual origin, and LanguageNote is the ordered interaction of language (Tongan, NZE), and notes/vowels. As with formula 1, all variables were ordered to allow for a meaningful interpretation of the smooths.
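The summarization step that precedes formula 2 (grouping theta angles into 100 bins and computing the variance of tongue position per bin) might look like the following sketch for the samples of one speaker and one token type. The function name, the explicit theta range, and the use of the sample variance (n - 1 denominator) are assumptions; grouping by speaker and token is left to the caller.

```python
from collections import defaultdict
from statistics import variance


def binned_variance(samples, theta_min, theta_max, n_bins=100):
    """Variance of rho within equal-width theta bins.

    samples: iterable of (theta, rho) pairs pooled over the contours of
             one speaker-by-token group.
    Returns a dict mapping bin index to the sample variance of rho;
    bins with fewer than two points are omitted.
    """
    width = (theta_max - theta_min) / n_bins
    bins = defaultdict(list)
    for theta, rho in samples:
        if not theta_min <= theta <= theta_max:
            continue  # ignore points outside the binned range
        # The upper edge is assigned to the last bin.
        index = min(int((theta - theta_min) / width), n_bins - 1)
        bins[index].append(rho)
    return {i: variance(v) for i, v in sorted(bins.items()) if len(v) >= 2}
```

Each resulting set of per-bin values corresponds to one variance curve, which then replaces the raw contour as the response in the model.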

For Hypothesis 2 (separate models run for NZE & Tongan), the resulting model is shown in formula 3:

3: rho ∼ Token + s(theta, bs = "cr", k = 10) + s(theta, k = 10, bs = "cr", by = Token) + s(theta, subjectToken, bs = "fs", k = 10, m = 1) + s(theta, precedingSoundToken, bs = "fs", k = 10, m = 1) + s(theta, followingSoundToken, bs = "fs", k = 10, m = 1)

<sup>2</sup> It is possible to formally evaluate the addition of the random effect adjustments to the model. However, GAMM model estimation using Maximum Likelihood estimation (ML) is quite computationally demanding and we were unable to carry out full model comparisons for the largest model reported in this paper even when using a computing node with 36 cores and 1024 GB of memory!

Similar methods were used for Hypothesis 2 and the related predictions (i.e., note-by-vowel and note-by-note differences). We used the relevant notes and vowels as fixed effects (Token in formula 3) and as tongue contour adjustment (using a 'by' specification). In addition, we specified three random effects. We created a new variable forming an interaction between subject and the note or vowel quality produced (subjectToken); this variable was used as our first random effect. The two additional random effects were the interactions of the preceding sound and of the following sound with vowel identity (or, for notes, with note intensity by note identity), i.e., precedingSoundToken and followingSoundToken. These random effects allowed us to fine-tune the analysis to account for subject and contextual differences. All variables were ordered again to allow for a meaningful interpretation of the smooths. We performed the same back-fit and well-formedness analyses as for Hypothesis 1.

For all three models, we used custom functions (Heyne, 2019) to visualize the predictions from our models in polar coordinates, using the package plotly (Sievert et al., 2017) to plot the transformed outputs of the plot\_smooth function from the package itsadug (Van Rij et al., 2015). Additionally, we used the function plot\_diff from the latter package to determine the intervals of significant differences for the whole range of given data points (in our case, the whole midsagittal tongue contour from the front to the back of the tongue) and added these as shaded intervals to our polar plots. All our analyses in the form of R notebooks are available on GitHub<sup>3</sup> .

## RESULTS

#### Prediction 1a: Tongue Position During Sustained Note Production Will Differ for NZE Players and Tongan Players

Our final model investigating overall language differences during trombone performance found a robust interval of significant difference at the back of the tongue, with NZE players utilizing a more retracted tongue position. The model also showed a difference at the front of the tongue, where the Tongan players use a more elevated position; however, this difference was not very reliable across comparisons involving different notes produced at different intensities.

<sup>3</sup>https://jalalal-tamimi.github.io/GAMM-Trombone-2019/

The full details (including a model summary) for the final GAMM investigating whether tongue shape during trombone playing differs across the two language groups included in this study are available as part of our supplementary notebooks on GitHub<sup>4</sup>. The final model includes an autocorrelation model to account for the massive amounts of autocorrelation observed in the residuals of the equivalent model fitted without it (see Sóskuthy, 2017, section 2.3; Wieling, 2018, section 4.8 for discussion). All plots included in this paper are based on models estimated using fast model estimation via fREML (fast restricted maximum likelihood estimation) in combination with the discrete = TRUE flag. Our final model had an R<sup>2</sup> value of 0.855 and used 742,800 data points (7,428 individual tongue contours).

The optimal model design prevented us from directly comparing the average smooths for all notes produced by the players from each language group. However, we were able to fit smoothing splines of the average tongue shapes used across all notes and intensities across the two language groups by getting predicted values for all ingoing data points from our GAMM model using the predict.bam function from the mgcv package and fitting smoothing splines on these data split up by native language using R's generic predict.smooth.spline function. The overall average splines (**Figure 4**) show clear differences at the back and at the front.

Additionally, we carried out pairwise comparisons for each note at the four different intensities (piano, mezzopiano, mezzoforte, forte) across the language groups, and all individual comparisons show at least one interval of significant difference (either at the back or the front of the tongue). **Figure 5** provides plots of the smooths estimated for each language group for the notes Bb2 at forte intensity, F3 at mezzoforte intensity, and Bb3 at mezzoforte intensity with areas of significant difference indicated by shading; the comparisons for F3 and Bb3 at mezzoforte intensity feature the largest token numbers in our data set (1,089/1,169 tokens for F3 and 986/1,042 tokens for Bb3 for NZE and Tongan, respectively). Note that overlap of the 95% confidence intervals is an imprecise diagnostic of significant differences between portions of two smooths. Instead, the shaded intervals, indicating regions of significant difference, have been determined using the statistical procedure implemented via the plot\_diff function from the itsadug package.

Overall, we find robust differences at the back of tongue (area from roughly -3/4π to -2/3π as shown in the plots) for all individual note comparisons except for the notes Bb2 produced at piano intensity, and D4 at mezzopiano intensity (the interval of significant differences for F3 at mezzoforte intensity barely extends past -3/4π but nonetheless seems substantial). Toward the front of the tongue, our plots also show significant differences for most comparisons, indicating a more elevated position used by the Tongan players; differences at the front of the tongue consistently occur at forte intensity but are notably absent for 2 out of 5 comparisons (notes F3 and D4) at mezzoforte intensity where we have substantial token numbers (1,089/1,169 tokens for F3 and 368/385 tokens for D4 for NZE and Tongan, respectively). However, we should not assign too much weight to any differences occurring past −1/3π at the front of the tongue and −3/4π at the back of the tongue due to the fact that we are averaging across subject data with different trace lengths that were normalized by rotation and scaling (see section "Rotating and Scaling Ultrasound Traces Across Individuals" above). Additionally, when overlaying the areas of significant differences for all individual comparisons the agreement becomes very small at the front of the tongue while 16 out of 19 comparisons (84.2%) show the substantial difference noted for the back of the tongue (see **Figure 5D**). **Table 3** provides a list of the intervals of significant differences for all individual note comparisons.

#### Prediction 1b: Tongan Vowels Will Have Greater Production Variability Than English Vowels

The full details (including a model summary) for the final GAMM describing the difference in variance for tongue position distance from the virtual origin along the tongue curvature between NZE and Tongan are again available as part of our supplementary notebooks on GitHub<sup>4</sup>. Our final model had an R<sup>2</sup> value of 0.863 and was based on 12,704 data points from 180 variance curves.

The optimal model design prevented us from directly comparing the average smooths for the variance of each language group. However, we were able to fit smoothing splines of the average tongue position variance used across vowels by participants from both language groups by getting predicted values for all ingoing data points from our GAMM model, similar to the method used when addressing Prediction 1a. The overall average splines (**Figure 6**) show clear variance differences at portions of the front, middle, and back of the tongue, indicating that the Tongan participants' vowel productions were more variable than those produced by the NZE speakers.

We also carried out the same comparison for NZE and Tongan note productions and found that Tongan trombone notes show more variability than English trombone notes for a small portion of the tongue surface between −2/3π and −7/12π radians. The corresponding model had an R<sup>2</sup> value of 0.817 and was based on 6,446 data points from 100 variance curves (see **Supplementary Figure S1** in the **Supplementary Material**).

#### Prediction 2: NZE Players Will Use a More Centralized Tongue Position During Trombone Performance Than Tongan Players

The full details for the two final GAMMs describing the relationship of note tongue contours to vowel tongue positions in the two languages are also available on GitHub<sup>4</sup>. The final model for NZE had an R<sup>2</sup> value of 0.852 and used 1,154,900 data points (11,549 individual tongue contours), while the final model for Tongan had an R<sup>2</sup> value of 0.898 and used 813,500 data points (8,135 individual tongue contours).

<sup>4</sup>https://jalalal-tamimi.github.io/GAMM-Trombone-2019/

**Figures 7**, **8** show the smooths for all vowels (A) and notes (B) produced by the participants in the two language groups. The confidence intervals plotted with the Tongan vowels in **Figure 8A**, although not as appropriate as the intervals estimated using the plot\_diff function shown in our plots addressing Hypothesis 1, indicate that even in a language with a small vowel system, the average vowel tongue shapes overlap considerably so they are not statistically different in terms of their articulation when properly accounting for variance such as subject-specific productions and preceding and following phonemes. Note that we decided not to include the 95% confidence intervals for the NZE vowels in **Figure 7A** as the crowded NZE vowel space already makes the left panel of the Figure very hard to read; for the same reasons, no confidence intervals are shown with the note smooths (**Figures 7B**, **8B**).

While inspection of all individual smooths comparisons from our models (see R notebooks on GitHub<sup>5</sup>) indicated that the tongue shapes employed by the NZE players pattern somewhat closely with up to seven different monophthongs in NZE (KIT /ɘ/, non-final schwa /ə/, FOOT /ʊ/, final schwa /ə#/, STRUT /ɐ/, START /ɐː/, and LOT /ɒ/), the closest match seems to be with both the vowel occurring in the word 'lot' (LOT /ɒ/) and the neutral vowel schwa when it occurs in final position (/ə#/); note that NZE being non-rhotic, the latter group also includes words ending in -er such as 'father.' In Tongan, in contrast, we do not find such a close match, and the vowel tongue shape most closely approximated during trombone playing seems to be that for the vowel /o/; however, this is only the case visually – the vowel /u/ actually features fewer intervals of significant differences to the tongue shapes assumed during sustained notes produced by the Tongan players. We also see some consistent patterning with the vowel /a/ at the front of the tongue. Nonetheless, all individual comparisons between these three vowels and the average note productions feature at least one interval of significant difference, indicating that the match between vowel and note tongue shapes is much closer in NZE than in Tongan. All of the closest-matching vowels identified for Tongan differ from NZE LOT and schwa (produced in both non-final and final environments) mostly in terms of tongue retraction. **Figure 9** shows plots of the vowels in both languages most closely approximated by the respective players' note productions. The left panel overlays the NZE players' note tongue contours, while the right panel does the same with the Tongan players' note contours.

<sup>5</sup>https://jalalal-tamimi.github.io/GAMM-Trombone-2019/

While the average tongue shapes during sustained trombone note production are clearly different for the two language groups as shown in **Figure 4** (cf. also comparisons of selected notes and intensities in **Figure 5**), average tongue contours for a subset of monophthongs of both languages that can be expected to feature relatively similar articulations across the two languages (based on their acoustic descriptions) match up fairly well when regarded in a controlled phonetic environment, as shown in **Figure 10**. Note that in each case, the NZE vowel articulations feature a more retracted tongue shape than the one used by the Tongan participants, in agreement with the overall differences observed at the back of the tongue during note productions. Acoustic descriptions of NZE (Gordon et al., 2004; Maclagan and Hay, 2004; Bauer et al., 2007; Bauer and Warren, 2008) indicate that NZE DRESS (/e/) is 'close' compared to a more 'cardinal' pronunciation of the /e/ vowel in Tongan; similarly, the NZE THOUGHT vowel (/oː/) is comparatively raised, possibly due to a chain shift documented for other varieties of English that motivates it to move into the space vacated by the fronted GOOSE vowel (/ʉː/) (Ferragne and Pellegrino, 2010, p. 30; cf. Scobbie et al., 2012; Stuart-Smith et al., 2015).

#### Prediction 3: The Tongue Positions Employed During Trombone Performance Will Become Increasingly Closer (Higher) With Rising Pitch

The right-hand panels (B) of **Figures 7**, **8** show the smooths for the different notes produced by the players from the two language groups. While the NZE players as a group display a more-or-less consistent pattern of using a higher tongue position for higher pitch notes (except for the note F4), this pattern does not apply to the Tongan group. Instead, we find that in the area where we might expect the narrowest vocal tract constriction, the highest tongue contour is that of the lowest included note, Bb2, while D4 represents the highest tongue contour anterior of this location. Overall, upon visual inspection of the smooths and tabulation of intervals of significant differences for all notes at different intensities (produced in the same manner as **Table 3** above), we observe the biggest differences between notes produced at mezzoforte, which may be specific to this intensity level but could also be an artifact of having larger token numbers at mezzoforte. The reader is also encouraged to view the parametric plots on GitHub<sup>6</sup>; these plots show a clear difference with respect to the overall differences in the parametric terms (fixed effects) and how variable they are in both NZE and Tongan on the one hand, and in the position of the notes on the other. Higher notes (produced at louder intensities) seem to show a higher tongue position compared to lower notes; NZE shows an overall lower tongue position compared to Tongan in lower notes and a comparable position in the higher notes.

TABLE 3 | Intervals of significant difference for all note comparisons.

Comparisons are based on variable token numbers and there were no tokens of F4 produced at mezzopiano intensity by the NZE players, hence we were unable to carry out a comparison for that note. Note also that regrettably the 'scatterpolar' plotting modality from the plotly R package requires input values scaled in degrees. We thus had to apply a transformation to get our values to show up correctly but were able to overlay a scale in radians using fractions of π; these are equivalent to the following numbers: −5/6π = −2.62; −3/4π = −2.36; −2/3π = −2.09; −1/2π = −1.57; −5/12π = −1.31; −1/3π = −1.05; −1/4π = −0.79; −1/6π = −0.52.
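The note accompanying Table 3 lists decimal radian equivalents for the fractions of π used on the plot axes and mentions that the plotting routine expects degrees. The arithmetic behind both conversions is simple and can be checked with the sketch below; the function names are hypothetical, and this is not the transformation code used for the published plots.

```python
import math
from fractions import Fraction

# Axis ticks expressed as fractions of pi, as listed in the Table 3 note.
PI_FRACTIONS = [Fraction(-5, 6), Fraction(-3, 4), Fraction(-2, 3),
                Fraction(-1, 2), Fraction(-5, 12), Fraction(-1, 3),
                Fraction(-1, 4), Fraction(-1, 6)]


def pi_fraction_to_radians(fraction):
    """Convert a fraction of pi (e.g. Fraction(-3, 4)) to radians."""
    return float(fraction) * math.pi


def pi_fraction_to_degrees(fraction):
    """Convert a fraction of pi to degrees, the unit the plot expects."""
    return math.degrees(pi_fraction_to_radians(fraction))
```

For instance, −3/4π corresponds to about −2.36 radians, or −135 degrees.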

Out of a total of 76 note comparisons (40 for Tongan, 36 for NZE due to missing tokens for the note F4 produced at mezzopiano intensity), only 11 featured significant differences at either the back or front of the tongue (none had both). For Tongan these were: Bb2 vs. F4, D4 vs. F4, and Bb3 vs. F4 at mezzopiano intensity, and Bb3 vs. F4 and F3 vs. F4 at piano intensity; note that each comparison involved the note F4 for which we have the smallest token numbers. For NZE these were: Bb2 vs. Bb3, Bb2 vs. D4, and Bb3 vs. D4 at forte intensity, Bb2 vs. Bb3 and Bb2 vs. D4 at mezzoforte intensity, and Bb2 vs. F4 at piano intensity.
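The comparison counts follow from pairing the five notes within each of the four intensity levels: ten pairs per intensity gives 40 comparisons, and a missing note at one intensity removes the four pairs that involve it. A minimal sketch of this count is given below; the function name and the representation of missing cells are illustrative assumptions.

```python
from itertools import combinations

NOTES = ["Bb2", "F3", "Bb3", "D4", "F4"]
INTENSITIES = ["piano", "mezzopiano", "mezzoforte", "forte"]


def count_note_comparisons(missing=()):
    """Count pairwise note comparisons within each intensity level.

    missing: iterable of (note, intensity) cells that have no tokens and
             therefore cannot take part in any comparison.
    """
    missing = set(missing)
    total = 0
    for intensity in INTENSITIES:
        available = [n for n in NOTES if (n, intensity) not in missing]
        total += sum(1 for _ in combinations(available, 2))
    return total


tongan = count_note_comparisons()                     # all cells present
nze = count_note_comparisons([("F4", "mezzopiano")])  # one empty cell
```

Under these assumptions the counts come to 40 for Tongan and 36 for NZE, matching the total of 76 reported above.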

## DISCUSSION

In this paper, we have presented a comprehensive analysis of midsagittal ultrasound data that has allowed us to investigate a number of questions regarding the relationship between speech production and brass instrument performance, and some longstanding assumptions propagated by teachers of brass instruments whereby the tongue shapes assumed during performance resemble those employed during speech production, especially when producing vowels. We compared average tongue shapes of vowel articulation and tongue positioning during trombone performance estimated based on large token numbers using generalized additive models, a statistical technique that properly accounts for contextual factors and unknown variability such as speaker/performer idiosyncrasies. As far as we know, this article also presents the first comprehensive articulatory descriptions of both the New Zealand English and Tongan vowel systems. In the following, we evaluate our hypotheses and specific predictions based on the results presented in the previous sections and discuss some other constraints affecting tongue shape during brass instrument performance.

# Hypothesis 1, Prediction 1a: Language Influence on Trombone Performance

Our data provide clear support for our first hypothesis, prediction 1a, whereby a brass player's language will influence the vocal tract states they assume during performance on their instrument. We observed significant differences at the back of the tongue across our two language groups of NZE and Tongan speakers, both overall and for 16 out of 19 individual note comparisons. These comparisons encompassed five different pitches performed within the standard playing range of the trombone at soft (piano) to loud (forte) intensities. All comparisons featured at least one interval of significant differences (either at the back or front of the tongue), providing strong support for our Prediction 1a, which stated: Tongue position during sustained note production will differ for NZE players and Tongan players, both overall and when comparing individual notes played at different intensities. However, many other factors also appear to influence midsagittal tongue shape during trombone performance (e.g., airflow requirements and the potential of vocal tract resonances influencing the produced sound); we will return to those later in the discussion.

#### Hypothesis 1, Prediction 1b: Language and Token Position Variability

Our data provide support for our first hypothesis, prediction 1b, whereby tongue position variability in vowel production will be related to the segmental inventory size of the language. Tongan has fewer vowels than NZE, and so it was predicted to have higher token variability. The results show higher average Tongan vowel production variability in the contour comparison for portions of the tongue front, middle, and back (**Figure 6**). Tongue position variability differences between Tongan and NZE extend along the entire surface of the tongue. Therefore, the results provide direct support for dispersion theory (Liljencrants and Lindblom, 1972; Lindblom, 1986; Al-Tamimi and Ferragne, 2005).

<sup>6</sup>https://jalalal-tamimi.github.io/GAMM-Trombone-2019/

FIGURE 7 | (A) Left: Average smooths for the NZE monophthongs produced by all NZE speakers included in this study. (B) Right: Average smooths for the five different notes produced by the NZE-speaking trombonists.

The results also suggest that this dispersion might extend to note productions on the trombone, as these were also more variable for the Tongan participants; however, this was true only for a small portion at the back of the tongue and as such, is not a strong effect (see **Supplementary Figure S1**). Moreover, significant differences are not visible in any comparison of specific notes or vowels, probably because these measures are based on single variance averages by participant, which is a very low number for GAMMs.

The note results in **Supplementary Figure S1** also show that note variability is similar to vowel variability along the full length of the imaged tongue contours. The Tongan contours for notes do not extend quite as far back as the data for the NZE participants, but this has no meaningful impact on the interpretation of **Figure 4**.

# Hypothesis 2: Use of a Schwa-Like Vowel Shape by the NZE Players

With Hypothesis 2, we were trying to determine whether an articulatorily informed interpretation of a popular recommendation among brass players, namely, to keep the vocal tract 'open' to produce a good sound, would be supported by empirical data. Various studies (see Heyne and Derrick, 2016b) have provided ambiguous evidence regarding the openness of the vocal tract, mostly presenting data for the oral cavity (but see section "Other Constraints on Tongue Shape During Brass Instrument Performance" below for some findings regarding glottis opening during brass instrument performance) and often interpreting their results in comparison to vowel tongue shapes, which we will address in more detail below. We specifically predicted that the average tongue shape assumed during trombone playing by the NZE-speaking participants in our study would approximate the vowel tongue shape for the neutral vowel schwa, while the Tongan players would assume a different shape as their native language does not contain a neutral vowel such as schwa. Indeed, we found that for the NZE players, two out of the three vowel tongue shapes most closely approximated by their playing tongue shapes were schwa when produced in non-final and final environments. However, the only NZE vowel that showed no significant intervals of difference to the NZE note tongue shapes for any comparisons was LOT (/ɒ/), hence our prediction is not fully supported. In terms of the note tongue shape assumed by the Tongan players, the data support our prediction in that they clearly use a less 'centralized' tongue shape during playing; the most salient difference, however, seems to occur at the back of the tongue and we will return to this point later on.

# Hypothesis 3: Tongue Position During Note Production and Its Relation to Pitch

Our models fit on the full data set also allowed us to probe a longstanding assumption within brass pedagogy, namely that players should raise their tongue when ascending throughout a brass instrument's register. More precisely, many brass method books published from the 19th century onward recommend the use of low vowels in the low register with a gradual change toward high vowels in the high register. Our prediction 3 represents a weaker version of such claims: we simply predicted that the tongue shapes assumed during sustained note production would become increasingly closer with rising pitch. The results presented in sections "Prediction 2: NZE Players Will Use a More Centralized Tongue Position During Trombone Performance Than Tongan Players" and "Prediction 3: The Tongue Positions Employed During Trombone Performance Will Become Increasingly Closer (Higher) With Rising Pitch" above do not provide much support for this prediction: while there is some indication of NZE players using a higher tongue position for higher notes, this pattern is much less clear for the Tongan participants. Additionally, none of the vowels typically mentioned in brass method books (e.g., /o/ to /i/) seem to match up particularly well with the note tongue shapes used by the NZE players in our study, although such vowel tongue shapes might be approximated by players whose native languages do not have a neutral/central vowel such as schwa. In addition to Tongan, investigated in this study, similar considerations apply to languages like Spanish and Japanese.

Note however, that in the first author's more recent work using real-time MRI of the vocal tract to record tongue movements during trombone performance (Iltis et al., 2019) there is clear evidence for tongue raising in the midsagittal (and coronal) planes with ascending pitch. Hence we might speculate that the lack of pronounced differences in the ultrasound data presented here may be related to the use of a jaw brace for ultrasound transducer stabilization that ties tongue motion to jaw position.

# What Is a Possible Mechanism for Language Influence on Brass Instrument Performance?

Having established that there are significant differences regarding the midsagittal tongue shape used by players from the two different language groups investigated in this study, we may now move on to speculate what a possible mechanism for such a relationship might look like. Articulatory Phonology (Browman and Goldstein, 1986, 1992; Goldstein and Fowler, 2003) posits that phonological units of speech can be analyzed as constrictions occurring at various locations along the vocal tract, and we suggested in the introduction that these gestures might take the form of motor memory when being transferred across different vocal tract activities. Since we observed a far from complete overlap of midsagittal tongue shapes during speech and trombone performance (even for the NZE players), and there may well be other differences that we cannot measure with midsagittal ultrasound images (jaw opening, coronal tongue shape), we need to explore in more detail how vowel gestures from speech production might transfer to brass playing.

It has previously been shown that the tongue can be divided into at least four independent sections (along the sagittal plane) within the oral cavity (Stone et al., 2004) and it is possible that, for example, during brass playing, tongue root retraction forms an important vocal tract constriction that affects airflow and tone color (more below). In this vein, we might think of learning to play a brass instrument as a process whereby multiple vocal tract gestures relevant to this activity have to be fine-tuned in order to achieve a good sound, as well as flexibility in being able to change, and articulate, various notes. Tongue shape during brass playing might be determined by local optimization processes applying to various parameters including vocal tract constrictions based on gestures already encoded in the system as motor memory. The latter case, of course, is where we suggest vowel tongue shapes would come in. Note that we regard this process as local, rather than global, optimization in agreement with Loeb (2012) who argues that "good-enough strategies" such as trial-and-error learning will lead to "a diversity of solutions that offers robustness for the individual organism and its evolution" (p. 757; see Ganesh et al., 2010, for empirical evidence of local optimization during motor learning).

In contrast to a theory of optimal control, a theory of local optimization is in agreement with the astonishing amount of individual variability observed in this and earlier empirical studies on brass playing (and speech production, for that matter) and offers a plausible account of how the language differences we observed may arise. Imagine that a beginning player might initially explore different local optima (different vowel tongue shapes but possibly also language-unrelated gestures such as the tongue configuration used during whistling) before settling on a more stable default tongue shape that would be locally optimized using acoustic information and effort minimization. Using a vowel tongue shape as starting point would seem to reduce both error and the required effort, at least until the player develops sufficient motor memory for the new motor action. In turn, it should also be possible to gradually 'unlearn' (cf. Heyne and Derrick, 2015b, p. 7) language-related tongue shapes by developing brass playing-specific motor memory, reducing language influence on brass playing among highly skilled performers.
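The contrast between local and global optimization can be made concrete with a toy sketch. The one-dimensional "tongue position" axis, the two-basin cost function, the starting values, and the step size below are all hypothetical choices made for illustration; none of them is estimated from the data reported here. The sketch only shows that a learner making small trial-and-error adjustments settles in whichever basin lies nearest to the starting posture.

```python
import random

def cost(x):
    """Hypothetical cost over a 1-D posture axis, with basins near x = -1 and x = 2."""
    return (x + 1) ** 2 * (x - 2) ** 2 + 0.3 * x

def local_search(start, step=0.05, n_trials=5000, seed=0):
    """Trial-and-error learning: keep a small random change only if it lowers the cost."""
    rng = random.Random(seed)
    x = start
    for _ in range(n_trials):
        candidate = x + rng.uniform(-step, step)
        if cost(candidate) < cost(x):
            x = candidate
    return x

# Two illustrative starting postures (labels are assumptions, not measurements).
schwa_like_start = -0.8
back_vowel_start = 1.7

settled_from_schwa = local_search(schwa_like_start)
settled_from_back = local_search(back_vowel_start)
print(settled_from_schwa, settled_from_back)
```

Because every accepted step must lower the cost, neither learner can cross the high-cost region between the basins, so the two starting postures lead to two different stable solutions. This is the sense in which a language-specific starting point could leave a lasting trace without either solution being globally optimal.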

## Articulatory Setting Theory

Another possible mechanism for language influence on brass instrument performance is provided by the concept of language-specific articulatory settings (cf. Wallis, 1653/1972; Vietor, 1884; Sweet, 1890; Honikman, 1964; Laver, 1978; Jenner, 2001; other terms include 'voice quality setting' and 'basis of articulation'). The validity of the concept was first experimentally verified by Gick et al. (2004) using old x-ray data; the authors found that interspeech postures (ISPs) "assumed between speech utterances: (a) are language-specific; (b) function as active targets; (c) are active during speech, corresponding with the notion of ASs [articulatory settings], and (d) exert measurable influences on speech targets, most notably including effects on the properties of neutral vowels such as schwa" (p. 231). These findings have since been replicated across languages (Wilson et al., 2007; Wilson and Gick, 2014) and dialects (Wieling and Tiede, 2017), and Ramanarayanan et al. (2013) were able to show that ISPs also differ across speech styles (read vs. spontaneous speech) using real-time MRI.

It is conceivable that brass players might (a) use their native-language-specific articulatory setting as default position during rests from playing and/or (b) develop a language- and brass playing-specific inter-playing position (IPP). A very limited comparison of only a single subject from each language group in this study in Heyne (2016) suggests that the latter indeed seems to be the case, and that the coronal place of articulation during both speech production and trombone playing heavily influences ISP and IPP. Note, however, that ISPs and IPPs are much harder to measure than vowels since both occur much less frequently; this is even more the case for IPPs due to the frequent occurrence of deep in-breaths during rests from playing, which require a very open vocal tract. Ultimately, it may not be necessary to measure AS/ISPs (and IPPs) separately, as suggested by an observation from Wieling and Tiede (2017), who compare findings on ISPs across Dutch dialects to their earlier findings on tongue movements during word pronunciation (Wieling et al., 2016) within the same data set; they found that for both vowels and ISPs, one dialect group featured a more posterior tongue position than the other (measured using EMA), concluding that "articulatory setting differences may also be observed when analyzing a sizeable amount of variable speech data (i.e., not only focusing on a single segment)" (Wieling and Tiede, 2017, p. 392).

# Other Constraints on Tongue Shape During Brass Instrument Performance

It seems self-evident that brass playing imposes constraints upon vocal tract shape that differ substantially from speech production, not least the fact that the former generally requires a greater amount of airflow than the latter. The openness of the vocal tract was already touched upon above in relation to vowel tongue shapes, specifically neutral/central schwa, which has long been viewed as effecting the least constriction in the vocal tract (Fant, 1960; Silverman, 2011; among many others). Early studies using MRI (e.g., Baer et al., 1991) have indeed shown that the vocal tract is heavily constricted in the oral cavity when producing high front vowels, but the same also applies to the pharyngeal cavity when producing low back vowels. Either extreme would thus seem ill-suited for brass playing, providing a straightforward rationale for the midsagittal tongue shapes we observed across both groups. For the Tongan players, positioning the back of the tongue in a location similar to the ones used during the articulation of the back vowels /o/ and /u/ might provide the optimal solution given the aerodynamic constraints of brass playing. Based on the assumption that the pharyngeal constriction for Tongan vowels would be at least somewhat comparable to the data for English by Baer et al. (1991), the vocal tract configurations of the Tongan vowels /i/ and /e/ might be too constrained in the oral cavity, while the low vowel /a/ might be too constrained in the pharyngeal cavity.

An alternative way of regarding the articulatory correlates of vowel tongue shape is suggested in Esling's (2005) paper "There Are No Back Vowels: The Laryngeal Articulator Model." Esling presents an attempt at re-conceptualizing the traditional vowel quadrilateral based on articulatory evidence on pharyngeal phonetics, adding the classifications "raised" and "retracted" to the traditional IPA chart (1996 version), as shown in **Figure 11**. Interestingly, the average midsagittal tongue shapes used by the musicians in our two language groups either straddle the boundaries of Esling's re-categorization<sup>7</sup> (NZE non-final /ə/ and final schwa /ə#/) or fall within the raised category (Tongan /o/ and /u/). Note that LOT (/ɒ/) in NZE is not a low vowel as shown on the IPA chart underlying Esling's re-categorization, and is generally articulated somewhat closer (cf. **Figure 1** in the introduction); our average tongue traces (**Figure 7A**) additionally suggest it is also somewhat fronted, definitely more so than the THOUGHT vowel (/oː/). By raising, Esling refers to "the positioning of the tongue when it is high (pulled upward and backward)," in contrast to retracted vowels, for which tongue position represents a "response to the sphinctering mechanism that closes the larynx" (p. 14); the former action would have consequences for the pharyngeal cavity that would seem advantageous concerning airflow and some acoustical considerations affecting vocal tract resonances during brass playing (see below). A more recent conference paper (Moisik et al., 2019) provides some empirical support for the proposal that vowels pattern as front, raised and retracted in terms of larynx height in the form of MRI data collected from two subjects.

# Possible Acoustical Consequences of the Observed Language Differences

Throughout this paper we have discussed tongue shapes during vowel production and trombone playing from an articulatory perspective, but it should be clear that we expect them to have acoustical consequences not only during speech production but also when playing the trombone. Basically, any change to vocal tract shape will alter its acoustic impedance, which will probably have an impact on instrument sound, even if the exact details of such a mechanism are as yet unknown. In a paper outlining considerations regarding vocal tract influence on different types of instruments, Wolfe et al. (2015) write that restricting the opening of the true vocal folds (or controlling their impedance) not only allows for "fine control of mouth pressure" but also affects potential vocal tract influence considerations by providing a "higher reflection coefficient for acoustic waves in the vocal tract" (p. 3). The result would be a reduced influence of subglottal resonances on upper vocal tract resonances (extending from the glottis to the lips, cf. citations listed in introduction) which interact with oscillations within the instrument (i.e., vocal tract influence), and which in turn would make it easier to adjust the vocal tract impedance peak falling within the frequency range of the trombone. That range cuts off around 700 Hz (for details see Campbell and Greated, 1987, p. 346–347) and given that vocal tract impedance peaks lie at around 4/3 times the frequency of speech formants (depending on glottis openings, cf. Hanna et al., 2012), it would seem advantageous, in terms of maximizing the potential for vocal tract influence, to assume a vowel tongue shape that produces formants below roughly 500 Hz (700 Hz divided by 4/3 is about 525 Hz). In terms of F2, this suggests utilizing the back of the vowel space, while for F1 most vowels would fall within the range of the trombone. Unfortunately, empirical findings on glottal aperture during brass instrument performance are inconclusive, with observations from x-ray imaging (Carter, 1969; cf. Nichols et al., 1971), as well as real-time MRI (Iltis et al., 2017), suggesting that glottis opening is correlated closely with loudness (smaller opening during soft playing). However, other authors have reported that glottis aperture during playing may be "self-adjusting or involuntary" (Bailey, 1989, p. 105) or differ with proficiency level (professional players of all wind instruments had smaller glottal apertures than amateur and intermediate players in Mukai's (1989) study, reported by Yoshikawa, 1998). Even though the latter finding would seem to fit well with Wolfe and colleagues' consideration mentioned above, we are unable to draw any conclusions based on it given the variety of playing proficiencies included in our sample (within and across the two language groups).
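The relationship between the trombone's cut-off and vowel formants can be worked through numerically. The sketch below takes the two figures quoted above at face value (a cut-off near 700 Hz and impedance peaks at roughly 4/3 of the formant frequency); the function names and the two example formant values are hypothetical and serve only to illustrate the arithmetic, not to report measurements.

```python
def formant_ceiling_hz(cutoff_hz=700.0, peak_to_formant_ratio=4.0 / 3.0):
    """Highest formant whose associated impedance peak still lies at or below the cut-off."""
    return cutoff_hz / peak_to_formant_ratio

def impedance_peak_hz(formant_hz, peak_to_formant_ratio=4.0 / 3.0):
    """Approximate impedance peak frequency for a given formant, assuming a fixed ratio."""
    return formant_hz * peak_to_formant_ratio

ceiling = formant_ceiling_hz()  # about 525 Hz under the stated assumptions

# Arbitrary example formants (assumed values, chosen on either side of the ceiling).
for formant in (450.0, 900.0):
    peak = impedance_peak_hz(formant)
    inside = peak <= 700.0
    print(f"formant {formant:.0f} Hz -> peak {peak:.0f} Hz, within trombone range: {inside}")
```

Since the 4/3 ratio itself depends on glottal opening, the resulting ceiling should be read as an order-of-magnitude guide rather than a sharp boundary.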

While we were unable to perform acoustical analyses of the musical passages performed by all participants in this study due to audio quality, we conducted a limited comparison of recordings by two earlier participants of this study (S5 NZE and a semi-professional Japanese trombone player) who differed in their average tongue shape during trombone performance (Heyne and Derrick, 2015a). The Japanese player, who used an /o/-like and thus more backed tongue position during playing (similar to the Tongan participants in this study), had a larger component of high frequencies in the produced sound spectrum compared to the NZE player, who used a tongue shape resembling the group average for the NZE players in this paper; this result, however, should not be over-interpreted due to the small sample size and a possible confound in the different horizontal location of the narrowest oral constriction produced by the two subjects.

<sup>7</sup>Esling specifically notes that "[t]he intersection of the three lines dividing the three regions in [the figure] should perhaps fall exactly on the location of schwa to represent the focal point of movement away from neutral toward any of the three directions," but that it was placed differently "to show the susceptibility of [ a ] to becoming either front or retracted depending on the choice of articulator movement" (Esling, 2005, p. 23).

# Reconsidering the Role of Language Influence on Brass Instrument Performance

The previous paragraphs have outlined several constraints regarding tongue shape during brass instrument performance that we will now relate back to our initial proposal, whereby motor memory from a player's native language influences the tongue shape they employ when playing their instrument. Note that we regard language influence as secondary to any of these constraints, although there are certainly also interactions between language-related and language-unrelated constraints, with the latter also affecting speech production, albeit probably to a lesser extent. Requirements of airflow favor the use of vocal tract configurations that avoid significant constrictions in the pharyngeal and/or oral cavities; high back vowels and non-low central vowels (optionally grouped as 'raised' in Esling's (2005) 'laryngeal articulator model') seem to best satisfy these requirements. Considerations regarding the potential of vocal tract influence specific to the trombone suggest that a retracted (in the classical terminology) tongue position might be advantageous by situating the second vocal tract impedance peak (F2) below the cut-off frequency of the trombone (around 700 Hz). Furthermore, language influence via motor memory from a player's native language might operate in a different, more direct manner by influencing the place of articulation used during trombone performance; our ultrasound videos include the relevant data but we have not been able to test this hypothesis yet.

# Confounds and Shortcomings of Our Study

Finally, we admit to the following shortcomings and confounds of our study. Our two language groups were quite heterogeneous not only in terms of participant age and instrumental experience but also in terms of playing proficiency; however, we placed greater emphasis on having sufficient participant numbers than on keeping groups balanced, as there were already many other factors we were unable to control for, such as how the individual players' equipment (mouthpiece and instrument) might compare to performing on the 'pBone.' The group differences in tongue shape we found might be affected by individual vocal tract shape. It is plausible that the height and doming of our participants' palates differed on the group level due to genetic factors (cf. Dediu et al., 2017, 2019; Dediu and Moisik, 2019) and this has been shown to impact speech production (see citations in introduction). All of our comparisons were based on large numbers of tokens collected at single time points during monophthong articulation (at 1/2 of vowel duration) and note productions (at 1/3 of note duration), and it should be clear that this represents a simplification as neither activity is constant over time. Another confound is the use of a jaw brace tying ultrasound transducer position to jaw opening; while the system was shown to be relatively stable during speech production (Derrick et al., 2015), the same may not apply to brass instrument performance, and we did not carry out an assessment of motion variance in this context. However, no alternative ways of transducer stabilization compatible with trombone performance requirements were available at the time of data collection, and the use of any of the available systems for correcting for jaw position, such as optical tracking systems (Mielke et al., 2005; Whalen et al., 2005; Miller and Finch, 2011; Noiray et al., 2015), would have exhausted the financial possibilities of a Ph.D. research project.

# Implications of Our Findings

Our findings show that two activities previously linked through their cognitive mechanisms, language and music, are also related more indirectly via motor memory resulting from a shared physiological system. Although both activities clearly represent forms of communication, music is inherently non-referential (if we disregard vocal music with lyrics), while language is by definition referential or semiotic (but see Bowling et al., 2010; Curtis and Bharucha, 2010 for papers challenging this traditional distinction).

Our use of GAMMs for the analysis of midsagittal ultrasound tongue contours shows that SSANOVAs may be underestimating confidence intervals and hence overestimating statistical differences between tongue shapes produced in different contexts (cf. SSANOVA average curves of the same data set in Heyne, 2016). This would seem to be especially relevant for SSANOVA average curves calculated on the basis of small numbers of articulatory traces, unless phonetic context is tightly controlled for. GAMMs allow the inclusion of random smooths to model out the variance arising from independent variables and take the variance observed in different contexts into account when estimating the average curves and confidence intervals pertaining to a specific condition. In contrast, SSANOVAs do not afford these possibilities and it is unclear how one might correct for multiple comparisons if one would like to compare, e.g., articulations produced in more than two phonetic contexts.
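The concern about underestimated confidence intervals can be illustrated with a deliberately simplified simulation. The sketch below is neither a GAMM nor an SSANOVA: it reduces each "contour" to a single number and compares a standard error that pools all tokens as if they were independent with one computed from per-speaker means. All parameter values (number of speakers, tokens per speaker, and the two standard deviations) are assumptions chosen for illustration.

```python
import random
import statistics

def simulate(n_speakers=5, n_tokens=100, speaker_sd=5.0, token_sd=1.0, seed=1):
    """Simulated measurements: each speaker has an idiosyncratic offset plus token noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_speakers):
        offset = rng.gauss(0.0, speaker_sd)
        data.append([offset + rng.gauss(0.0, token_sd) for _ in range(n_tokens)])
    return data

def pooled_se(data):
    """Standard error treating every token as an independent observation."""
    tokens = [value for speaker in data for value in speaker]
    return statistics.stdev(tokens) / len(tokens) ** 0.5

def speaker_level_se(data):
    """Standard error based on one mean per speaker, respecting the grouping."""
    means = [statistics.mean(speaker) for speaker in data]
    return statistics.stdev(means) / len(means) ** 0.5

data = simulate()
naive = pooled_se(data)
grouped = speaker_level_se(data)
print(f"pooled SE: {naive:.3f}, speaker-level SE: {grouped:.3f}")
```

When between-speaker variability dominates, the pooled standard error comes out much smaller than the speaker-level one, so intervals built on it look narrower than the data justify. Random effects in a mixed model address the same problem in a more principled way.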

# CONCLUSION

In this paper, we were able to present evidence for native language influence on brass instrument performance based on statistically robust differences determined using generalized additive mixed models (GAMMs) fit on large numbers of midsagittal ultrasound tongue contours collected during speech production and trombone playing. We argued that these differences can be related to the different vowel systems of the two language groups observed in this study, New Zealand English and Tongan, but that tongue shape during brass playing is more directly determined by constraints arising from airflow requirements and acoustical considerations. Our findings indicate that speech production, itself an acquired motor skill expressing a language's underlying phonological system, can influence another skilled behavior, brass instrument performance, via motor memory of vocal tract gestures. More specifically, such vocal tract gestures would form the basis of local optimization processes to arrive at a suitable tongue shape for sustained note production, although further research is required to determine whether such behaviors occur across a larger population of players at various proficiency levels.

## DATA AVAILABILITY STATEMENT

All datasets generated for this study are available on GitHub at: https://jalalal-tamimi.github.io/GAMM-Trombone-2019/.

#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Ph.D. and Staff Low Risk Application Guidelines by the University of Canterbury Human Ethics Committee with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the University of Canterbury Human Ethics Committee (reference HEC 2014/02/LR-PS, February 2014).

## AUTHOR CONTRIBUTIONS

MH and DD contributed to the conception and design of the ultrasound study and carried out all data collection. MH analyzed all ultrasound videos and organized all data into a combined database for statistical analysis. JA-T and DD performed the statistical analyses. MH prepared all visualizations and wrote the first draft of the manuscript. MH, DD, and JA-T wrote sections of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

## REFERENCES


# FUNDING

MH's original research was supported by a Doctoral Scholarship from the University of Canterbury and his re-analysis of the dataset was supported by funding from the NIH (research grant R01DC002852, PI Frank Guenther). JA-T prepared this work while being on sabbatical funded by the Leverhulme International Academic Fellowship (IAF-2018-016).

#### ACKNOWLEDGMENTS

MH would like to thank Jennifer Hay for substantial contributions to this project both in terms of content and academic guidance offered in her role as co-supervisor (with DD) as part of his Ph.D. research at the University of Canterbury, which forms the basis of this paper. MH would also like to thank his two external thesis examiners, Murray Schellenberg and James Scobbie, as well as Joe Wolfe and Bryan Gick, for generous advice leading to substantial improvements of his thesis and fruitful discussions on a wide range of topics related to this project. The two reviewers provided expert commentary that substantially improved the quality of this manuscript and we would like to thank the editors for selecting our paper for inclusion in this Research Topic. We also acknowledge the support of MH's current post-doctoral advisor, Frank Guenther, for letting him work on this paper as part of his current position. Finally, we are highly appreciative of all the trombone-playing participants of this study who volunteered their time without financial remuneration, and the Royal Tonga Police Band, who allowed us to include an under-documented language in this research.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.02597/full#supplementary-material

FIGURE S1 | Average smoothing splines for variance in tongue surface distance from the ultrasound virtual origin for NZE and Tongan note productions.

TABLE S1 | Ultrasound recording parameters for all participants recorded for this study.



motor behavior. J. Neurophysiol. 104, 382–390. doi: 10.1152/jn.01058.2009


Conference, eds S. Cassidy, F. Cox, R. Mannell, and S. Palethorpe, (Sydney: Australian Speech Science and Technology Association Inc), 183–188.

Maclagan, M., and Hay, J. (2007). Getting fed up with our feet: contrast maintenance and the New Zealand English "short" front vowel shift. Lang. Var. Change 19, 1–25.

Maddieson, I. (1984). Patterns of sounds. Cambridge: Cambridge University Press.




Sweet, H. (1890). A Primer of Phonetics. Oxford: Clarendon Press.



Wells, J. C. (1982). Accents of English. Cambridge: Cambridge University Press.


Wieling, M., and Tiede, M. (2017). Quantitative identification of dialect-specific articulatory settings. J. Acoust. Soc. Am. 142, 389–394. doi: 10.1121/1.4990951

Wieling, M., Tomaschek, F., Arnold, D., Tiede, M., Bröker, F., Thiele, S., et al. (2016). Investigating dialectal differences using articulography. J. Phon. 59, 122–143. doi: 10.1016/j.wocn.2016.09.004

Wilson, I., and Gick, B. (2014). Bilinguals use language-specific articulatory settings. J. Speech Lang. Hear. Res. 57, 1–13. doi: 10.1044/2013\_JSLHR-S-12-0345

Wilson, I., Horiguchi, N., and Gick, B. (2007). Japanese Articulatory Setting: the Tongue, Lips And Jaw. New York, NY: New York University.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Heyne, Derrick and Al-Tamimi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Noggin Nodding: Head Movement Correlates With Increased Effort in Accelerating Speech Production Tasks

#### Mark Tiede<sup>1</sup>\*, Christine Mooshammer<sup>1,2</sup> and Louis Goldstein<sup>1,3</sup>

<sup>1</sup> Haskins Laboratories, New Haven, CT, United States, <sup>2</sup> Institut für Deutsche Sprache und Linguistik, Humboldt-Universität zu Berlin, Berlin, Germany, <sup>3</sup> Department of Linguistics, University of Southern California, Los Angeles, CA, United States

Movements of the head and speech articulators have been observed in tandem during an alternating word pair production task driven by an accelerating rate metronome. Word pairs contrasted either onset or coda dissimilarity with same word controls. Results show that as production effort increased, so did speaker head nodding, and that nodding increased abruptly following errors. More errors occurred under faster production rates, and in coda rather than onset alternations. The greatest entrainment between head and articulators was observed at the fastest rate under coda alternation. Neither jaw coupling nor imposed prosodic stress was observed to be a primary driver of head movement. In alternating pairs, nodding frequency tracked the slower alternation rate rather than the syllable rate, interpreted as recruitment of additional degrees of freedom to stabilize the alternation pattern under increasing production rate pressure.

#### Edited by:

Pascal van Lieshout, University of Toronto, Canada

#### Reviewed by:

Anneke Slis, UMR 5216 Grenoble Images Parole Signal Automatique (GIPSA-lab), France

Marc Swerts, Tilburg University, Netherlands

> \*Correspondence: Mark Tiede tiede@haskins.yale.edu

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 29 April 2019; Accepted: 17 October 2019; Published: 27 November 2019

#### Citation:

Tiede M, Mooshammer C and Goldstein L (2019) Noggin Nodding: Head Movement Correlates With Increased Effort in Accelerating Speech Production Tasks. Front. Psychol. 10:2459. doi: 10.3389/fpsyg.2019.02459

Keywords: speech production, speech errors, head movement, EMA, articulatory entrainment

# INTRODUCTION

Movements of the head are integral to human speech. Casual observation of any conversational interaction reveals head nodding employed by the current speaker aligned with prosodic features and by listeners providing backchannel feedback. Nodding is coordinated with and complementary to other forms of gesticulation like hand and facial movements (Wagner et al., 2014), and sensitive to speech rate and affect (Birdwhistell, 1970; Giannakakis et al., 2018). Head movements are used by speakers to structure discourse (Kendon, 1972), indicate deixis (Birdwhistell, 1970; McClave, 2000), flag lexical repair (McClave, 2000), and signal a turn-taking shift (Duncan, 1972; Hadar et al., 1984a), among other functions.

However, these **communicative** uses of head movement also coexist with **motoric** consequences of speech production, like those to be discussed in this work. These include head adjustments for respiration or compensations for other body movement (e.g., talking while walking; Raffegeau et al., 2018) as well as head movement entrained through the influence of active articulation. For example, in a kinematic study of head movement during conversation, Hadar et al. (1983b, p. 40) observed that during speaking turns, "the head moved almost incessantly", with 89.9% of recorded frames showing non-zero velocity. This contrasted with relatively little movement during pauses and listening turns (12.8%). In a follow-on analysis, they found a significant positive correlation between head movement amplitude and peak speech loudness (Hadar et al., 1983a) but observed that this was driven mostly by fast, high-intensity movements and loud sounds. Similarly, a study investigating emotionally contrastive speech tasks (elicited using neutral vs. psychologically stressful interviews) found significantly increased head velocity under the stressed condition, corroborated by increases in concurrently recorded heart rate (Giannakakis et al., 2018). Congenitally blind speakers have been shown to move their heads while speaking with non-sighted speech partners, showing that speech entrains head movement despite, in this instance, lacking a communicative role that would usually be expressed through the visual channel (Sharkey and Stafford, 1990). Another relevant study by Hadar (1991) measured the head movement of aphasics and normal controls engaged in speech during interviews. He found that while head movement was positively correlated with speaking rate for both groups, it was highest for non-fluent aphasic speakers who, apart from increased effort required for speech coordination, showed no other motor impairment.

Yehia et al. (2002) used point source (Optotrak) data collected for sentence productions of two speakers to estimate F0 from head motion and vice versa. Their results showed 88 and 73% of F0 variance accounted for by head motion for the two speakers, respectively, but just 50 and 25% of head motion variance accounted for by F0 in the reverse direction. This asymmetry is consistent with the likelihood that competing demands on head position imposed by communicative intent distort estimates driven by prosodic F0 alone, but it leaves open the question of why, in the opposite direction, head movement should be so effective at predicting F0. Following Honda (2000), they suggest that strap muscles connecting the floor of the mouth through the hyoid bone and attaching to the outer edge of the cricothyroid cartilage provide an indirect biomechanical coupling, such that as the head is tilted, the straps will exert pull on the cricothyroid and thus potentially influence vocal fold tension. Although any such effect would be small, it might nonetheless serve to entrain modulation between head movement and F0.

A similar pattern of loose coupling is illustrated by a nonspeech task in which Kohno et al. (2001) asked four participants to open and close their mouths, tapping their teeth together in the closing cycle, while tracking movement of the upper and lower incisors. Jaw opening ranges were 1, 2, and 3 cm, and tapping frequency elicited by metronome varied from 1 to 3.3 Hz. Except for the smallest and slowest condition, the upper incisor was observed to move up at the same time that the lower incisor moved down at about 10% of its range. Cycle durations for both were found to be highly correlated (r = 0.94) and so were their vertical ranges of movement (r = 0.75). They propose that this coordination of movement may serve to make jaw movements smoother through offsetting postural changes of the head. While this likely occurs primarily during mastication, it suggests that rhythmic movement of the jaw during speech may also entrain head movement.

However, while it appears that motoric aspects of speech production can and often do affect head movement, such influence is neither automatic nor readily predictable. For example, Rimé et al. (1984) and Hoetjes et al. (2014) contrasted conversational speech in a baseline condition when speakers were free to move with a condition in which the head and other extremities were immobilized and reported no difference in speech fluency; this makes clear the lack of any direct biomechanical linkage between the speech articulators and the head. What then is the cause of non-communicative head movement linked to speech? One possibility is that the head participates somehow in networks of "coordinative structures" assembled as needed to achieve particular motor goals (Kugler et al., 1980) while constraining the degrees of freedom under control (Bernstein's Problem; Bernstein, 1967). Such structures, provided with appropriate input energy, dissipate it in a controlled and stable fashion, provided that the control parameters themselves are consistent; however, if these change beyond some threshold, driven say by execution errors or an increase in production rate, additional degrees of freedom are recruited as a new structure is organized (Kelso et al., 1993). Two studies from Dittmann and Llewellyn (1969) and Hadar et al. (1984b) are suggestive in this context: they report that the amplitude of head movement increases spontaneously immediately following speech dysfluencies. In this case, movement of the head appears to be recruited to serve a phase-resetting function for the interrupted articulatory plan by introducing additional energy and stability into the coordinative structures executing it (e.g., Fowler et al., 1980; Saltzman and Munhall, 1989). 
Because head movement does not contribute directly to achievement of the articulatory target, the linkage between the head and the articulatory system is a functional one, introduced by extending the coordinative structure to include the head as necessary.

The sensitivity of head movement to speech dysfluencies suggests that a useful paradigm for studying its relationship to articulation is through a task designed to elicit such errors reliably. Previous work has established that the repetition of word pairs with partial similarity (e.g., top cop) results in more production errors than either identical or entirely dissimilar words (Meyer and Gordon, 1985), and that alternating codas are slower to produce and more errorful than alternating onsets (Sevald and Dell, 1994). Kinematic studies of such sequences have confirmed this asymmetry (Mooshammer et al., 2018) and have shown that systematic alternation can lead to inappropriate suppression of the target constriction (a reduction error) or co-constriction of the non-targeted articulator (an intrusion), which in both cases may be partial or subphonemic (Pouplier, 2003; Goldstein et al., 2007). Kinematic studies of alternating sequences have also shown that more errors occur at higher production rates and that intrusion errors are more common (Goldstein et al., 2007; Slis and Van Lieshout, 2016).

An explanation for this behavior advanced in Goldstein et al. (2007) rests on the idea that during repetition, the executing task becomes a system in which each constriction gesture (lips, tongue tip, and tongue body) is driven by a non-linear oscillator, and those oscillators are coupled through synergy with the shared jaw. However, the oscillators do not all share the same frequency because of the mismatch between the syllable rate and the alternating (phrasal) rate. In top cop, for example, the alternating tongue tip and tongue dorsum constrictions occur at one half the rate of the bilabial closures, and this 1:2 frequency ratio is inherently less stable than a 1:1 relationship. It is known from studies of coupling between non-linear oscillators that their mutual phasing preferentially shifts from less stable to more stable patterns of organization, with the simplest 1:1 mode ultimately preferred (Haken et al., 1985). In addition, a series of index finger-wagging experiments has demonstrated that as rate increases, the end result, regardless of starting conditions, is in-phase symmetric motion at the 1:1 rate (e.g., Kelso et al., 1993). Speech errors of the co-constriction type can thus be viewed as incipient phase transitions, which may either be transitory, if the production system succeeds in resetting itself, or complete. The expected effect of recruiting an additional oscillator such as the head at the lower frequency (phrasal) rate would be to bias the system to remain in the 1:2 mode: the idea is that the more power is shared among the oscillating components at a given frequency, the more stable that frequency will be (Nam et al., 2009).

An alternative view arises from kinematic studies of constriction variability (interpreted as gradient production errors) in repeated word pairs with alternating onsets conducted by Slis and Van Lieshout (2013, 2016). They report higher rates of tongue dorsum intrusion in onset alternation, especially in high (constrained) vowel contexts, relative to lower intrusion rates for tongue tip and lower lip constrictions, and more intrusions than reductions overall. They attribute this to potential coproduction demands on the primary constriction articulator, which can serve to bias a shared articulator toward partial or complete co-constriction as a consequence of coupling dynamics between gestures. In this view, the fewer shared oscillatory components (articulators) utilized to achieve an articulatory target, the less susceptible it will be to such bias. Thus, because the lower lip apart from the jaw is uncoupled from the tongue, it "is better able to maintain linguistic goals and counteract pressure from coupling forces to stabilize coordination patterns" (Slis and Van Lieshout, 2016, p. 14).

Irrespective of their cause, it is clear that the alternating word paradigm reliably produces errors and has, in the context of this current work, the additional advantage of minimizing communicative gesturing of the head (given the rote nature of the task), such that observed head movement can for the most part be attributed to motoric consequences of articulating the sequence (although a possible exception, the use of head movement to emphasize phrasal stress, will be explored below). Accordingly, this work uses the alternating word paradigm to investigate relationships between head movement and speech articulation. It extends previous work in two ways. First, production of alternating word pairs is driven by an accelerating rather than fixed rate metronome. This has the advantage of contrasting an initial low stress production rate (with a constant metronome period) against the effects of subsequent rate acceleration, placing the speaker under increasing production effort, with errors increasingly likely. Second, the motion of the head is tracked in tandem with observation of the speech articulators to investigate the effects of increasing production rate and effort on the following research questions:


With the consideration that recruitment of the head, if it occurs, is expected to support 1:2 alternation, we also evaluate the following hypothesis:

**H1**: In the production of alternating word pairs, the moving head will track the slower (phrasal) rate rather than the syllable rate frequency.

The approach to addressing these questions is outlined below.

# MATERIALS AND METHODS

#### Participants

Nine native speakers of American English (five females, mean age 24.4) were recruited from the New Haven community for the experiment. None reported any neurological, speech, or hearing disorders. All provided informed consent supervised by the Yale University Institutional Review Board and were paid for their participation.

#### Recordings

Speech articulator movements with synchronous audio were recorded using electromagnetic articulometry (EMA; Carstens AG500). For each participant, EMA sensors were affixed using dental cyanoacrylate to the tongue dorsum (TD), blade (TB), and tip (TT), the upper (UL) and lower (LL) lips, and lower incisors (JAW) along the midsagittal plane. The TD sensor was placed as far back as the participant could comfortably tolerate; the TT sensor was placed approximately 1 cm posterior to the apex; and the TB sensor was centered between these. Lip sensors were attached at the vermillion border, and sensors placed on the upper incisors (UI) and JAW were attached at the gingival margin. Additional sensors placed on the left and right mastoid processes and nasion were used as references to correct for head movement. Biteplane data were collected to establish the occlusal plane for each participant. Three spatial dimensions for position were sampled for each EMA sensor at 200 Hz. Synchronized audio was recorded with a 16-kHz sampling rate using a directional microphone placed approximately 50 cm from the participant's mouth. Metronome clicks used to pace production as discussed below were presented monaurally through an earpiece in the left ear (opposite from EMA wires) and recorded separately at 8 kHz.

## Speech Tasks

The speech material discussed here consisted of repeated CVC real English word pairs that alternated in one of three **context** types. In the first context (SAME), both words were identical (e.g., top top). In the second context (ONSET), the onset consonant of each word alternated (e.g., **t**op **c**op). In the third context (CODA), the coda consonant of each word alternated (e.g., to**p** to**ck**). An additional condition in which both onset and coda were varied (e.g., pop tot) was also collected but is excluded from this analysis as it produced an excessive number of production errors that were not amenable to the split-mean analysis described below. Both vowels from each word pair were always the same. Note that this procedure, which elicits repetitions of the same word pair throughout a trial, differs from paradigms in which different word pairs are contrasted to facilitate spoonerisms (e.g., Nooteboom, 2005). The words used are listed in **Table 1**; they were presented in a total of 39 different pairings (including reverse orderings when not identical), although not all participants produced every combination. The word pair alternation trials were collected as blocks within a larger experiment probing speech errors in production, presented in Mooshammer et al. (2018).

#### Procedure

Trials were cued using a computer monitor that presented the instructions "Get ready – Breathe – GO" at 1-s intervals, together with the word pair under test. During the "Breathe" instruction, metronome clicking was initiated, delivered to the participant through an earpiece to avoid contaminating his or her audio production. Participants were instructed to time the onset of each word to a click and to avoid breathing during the trial if possible due to the phase-resetting effects of respiration (Goldstein et al., 2007). Some speakers were explicitly asked to produce trochaic stress while others were uninstructed for stress placement; however, all were consistent in stress realization. Metronome timing was computer-controlled to produce clicks over a 15-s interval, chosen to be readily achievable for participants to produce the entire alternation sequence within one breath. Clicks were exponentially decaying transients with a half-power bandwidth of 2 ms. During the first 7.5 s, the click rate was held stable at 170 clicks/min, following which the rate was increased with each click by a constant percentage of the current rate (0.12) to approximately 235 clicks/min at the final (48th) click. The advantage of this approach is that the initial stable rate provides an easy-to-maintain baseline for all speakers, with few production errors, while the subsequent rate acceleration places all speakers under increasing production effort, with errors increasingly likely.
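The two-phase click schedule described above can be sketched in plain Python. This is an illustration only: the function name `click_schedule`, its parameters, and the default per-click growth of 0.012 are assumptions introduced here and are not taken from the study's metronome software. The growth value is exposed as a parameter so that other values can be tried.

```python
def click_schedule(stable_rate=170.0, stable_secs=7.5, growth=0.012, n_clicks=48):
    """Return (times, rates) for a metronome with a stable then accelerating phase.

    times: click onset times in seconds.
    rates: click rate (clicks/min) in force after each click.
    All default values are illustrative assumptions.
    """
    times = [0.0]
    rates = [stable_rate]
    rate = stable_rate
    while len(times) < n_clicks:
        next_time = times[-1] + 60.0 / rate  # period implied by the current rate
        if next_time >= stable_secs:
            # Past the stable phase: raise the rate by a constant
            # proportion of its current value at every click.
            rate *= 1.0 + growth
        times.append(next_time)
        rates.append(rate)
    return times, rates


times, rates = click_schedule()
print(f"{len(times)} clicks, last at {times[-1]:.1f} s, "
      f"final rate {rates[-1]:.0f} clicks/min")
```

Because the rate compounds, the final rate is sensitive to the growth parameter and to the number of clicks that fall in the accelerating phase, which is why both are left adjustable.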

# Post-processing

#### EMA Data

EMA sensor trajectories were processed in MATLAB (Mathworks) using zero-phase delay low-pass filtering at 20 Hz. The smoothed reference trajectories (UI, nasion, mastoids) were then used to rotate and translate all data to a coordinate system aligned with each speaker's occlusal plane centered on UI, as determined by their reference position in the biteplane trial. A copy of the UI sensor trajectory (HEAD), filtered but without head correction, was used to characterize speaker head movement for each trial. Velar and alveolar closures were tracked using the TD and TT trajectories, respectively. For bilabial closures, a derived measure of lip aperture (LA) was computed as the Euclidean distance between the UL and LL sensors. (In one instance, where UL data were unusable, the vertical component of LL was used instead.)

Trial counts are across all participants, and for alternating pairs include the reverse orderings.

#### Defining Epochs

To distinguish the stable and accelerating phases of each trial, a functional grouping into epochs was determined procedurally as follows. First, the offset of each metronome click was identified by peak-picking RMS peaks within its audio channel. Next, the inflection point at which rate began to increase was found by differencing the click periods. The final usable click for a given trial was determined by inspection as the last click for which the speaker produced a controlled utterance timed to the metronome. The number of clicks from the inflection point to the final click was taken to be twice the epoch length for the trial (2n), such that the initial (STABLE) epoch encompassed n clicks preceding the inflection, the first accelerating epoch (ACC1) was n clicks following that, and the final accelerating epoch (ACC2) covered the remaining n clicks (see **Figure 1** for an illustration). The minimum epoch length (n) was nine clicks with mean 11.6 and s.d. 2.1. Because participants always began speaking before the beginning of the STABLE epoch, and continued production until at least the final click, this method ensured that movement during each trial could be binned systematically.
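The epoch binning can be illustrated with a short sketch. It assumes click times are already available as a list of seconds and that the final usable click index has been chosen by inspection, as in the text. The function names, the tolerance `tol`, the use of half-open click-index ranges, and the rounding down of an odd click count are all assumptions made here for illustration.

```python
def find_inflection(click_times, tol=1e-3):
    """Index of the click at which the period first shortens by more than tol s."""
    periods = [b - a for a, b in zip(click_times, click_times[1:])]
    for i in range(1, len(periods)):
        if periods[i - 1] - periods[i] > tol:
            return i
    raise ValueError("no rate increase found")


def define_epochs(click_times, final_click, tol=1e-3):
    """Split a trial into STABLE, ACC1 and ACC2 epochs of n clicks each.

    Returns half-open (start, stop) click-index ranges.
    """
    inflection = find_inflection(click_times, tol)
    n = (final_click - inflection) // 2  # 2n clicks from inflection to final click
    if n < 1 or inflection - n < 0:
        raise ValueError("not enough clicks to form three epochs")
    return {
        "STABLE": (inflection - n, inflection),
        "ACC1": (inflection, inflection + n),
        "ACC2": (inflection + n, inflection + 2 * n),
    }


# Synthetic trial: 21 constant periods of 0.35 s, then periods shrinking per click.
click_times = [0.0]
period = 0.35
for k in range(1, 48):
    if k > 21:
        period *= 0.988
    click_times.append(click_times[-1] + period)

epochs = define_epochs(click_times, final_click=45)
print(epochs)
```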

#### Identifying Errors

Production errors were identified on the EMA trajectories using the "split-mean" criterion established by Pouplier (2008). This approach relies on establishing the distributions of in-phase and anti-phase constriction events for non-errorful productions, then using the mean between them as a threshold to identify inappropriate deviations from expected behavior. For example, in the top cop sequence, the upward movement of the tongue tip during the tongue constriction we will refer to as "in-phase," while its upward movement at the time where the tongue dorsum (with which it alternates) is forming a constriction we will refer to as "anti-phase." When the vertical component of TT fails to rise above threshold for its in-phase position (i.e., its expected target constriction), a **reduction** error is identified. Conversely, when it rises above threshold at its non-target anti-phase position (i.e., coincident with the expected velar constriction), an **intrusion** error is identified. When a reduction or intrusion error in one alternating articulator co-occurs with an error of the opposite polarity in its partner, a **substitution** error is identified. Following this approach, described more fully as the "error rate" procedure in Mooshammer et al. (2018), errors of these three types were labeled using a semiautomatic interactive procedure on the TD, TT, and LA trajectories of each trial. **Figure 2** provides an example.
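The logic of the split-mean criterion can be sketched as follows. This is a simplified illustration, not the semiautomatic labeling procedure used in the study: it assumes the vertical extrema of one articulator have already been extracted for each syllable slot, and the function names are hypothetical.

```python
from statistics import mean


def split_mean_threshold(in_phase_ref, anti_phase_ref):
    """Midpoint between mean in-phase and mean anti-phase extrema
    taken from error-free reference productions."""
    return (mean(in_phase_ref) + mean(anti_phase_ref)) / 2.0


def label_slot(height, is_target_slot, threshold):
    """Classify one articulator extremum at one syllable slot."""
    if is_target_slot and height < threshold:
        return "reduction"   # target constriction not achieved
    if not is_target_slot and height > threshold:
        return "intrusion"   # constriction where none was expected
    return None


def label_pair(label_a, label_b):
    """Combine the labels of two alternating articulators at the same slot."""
    if {label_a, label_b} == {"reduction", "intrusion"}:
        return "substitution"
    return label_a or label_b


# Toy reference heights (arbitrary units) for tongue tip in a top cop trial.
threshold = split_mean_threshold([10.0, 11.0, 9.0], [2.0, 3.0, 1.0])
print(threshold, label_slot(8.0, False, threshold))
```

A threshold placed midway between the two reference distributions lets partial (gradient) errors be detected without requiring a full categorical substitution.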

#### Measurements

To investigate overall effects of increasing production rate on head movement, one set of measures was organized to contrast global effects over the three epoch phases (STABLE, ACC1, and ACC2). A separate set of measures was used to investigate local effects of errors, contrasting the immediate environment preceding and following each error (PRE, POST). Except as noted below, all measurements were computed using standard MATLAB augmented by locally developed procedures.

#### Epoch-Based Measures

Head movement was quantified over epochs on the HEAD (filtered UI) trajectory in two ways. Overall movement (MVT) was measured as the path integral distance traced by the UI sensor during each epoch, normalized by the duration of the epoch. Peak tangential velocity (VEL) was measured by first computing HEAD speed using central differencing, then computing the maximum of this signal over each click interval normalized by the duration of that interval, and finally recording the maximum of these values achieved within each epoch as the characterizing value for that epoch. In both cases, the time normalization was used to offset the effect of increasing metronome rate.
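As a rough illustration of these two epoch measures, the sketch below computes MVT and VEL from a list of 3D sensor positions in plain Python. The function names are hypothetical, and the reading of VEL as peak speed per click interval divided by that interval's duration is one interpretation of the description above, not a confirmed reproduction of the MATLAB procedures.

```python
import math


def speed(points, fs):
    """Tangential speed by central differences (one-sided at the two ends)."""
    n = len(points)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append(math.dist(points[hi], points[lo]) * fs / (hi - lo))
    return out


def mvt(points, fs):
    """Path length traced by the sensor divided by the epoch duration."""
    path = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return path / ((len(points) - 1) / fs)


def vel(points, fs, click_samples):
    """Largest duration-normalized peak speed over the click intervals."""
    spd = speed(points, fs)
    best = 0.0
    for start, stop in zip(click_samples, click_samples[1:]):
        duration = (stop - start) / fs
        best = max(best, max(spd[start:stop]) / duration)
    return best


# Toy epoch: 1 s of straight-line motion sampled at 200 Hz, clicks every 0.5 s.
fs = 200.0
points = [(i * 0.5, 0.0, 0.0) for i in range(201)]
print(mvt(points, fs), vel(points, fs, [0, 100, 200]))
```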

To investigate the relationships between movement of the head, the jaw, and the active articulators, we computed measures of average mutual information (AMI) and mutual power (MP). As these require comparing monodimensional signals, we used the first principal component of the HEAD and JAW trajectories and that of the alternating and non-alternating articulators as characterized by TD, TT, and LA (LA was used directly without principal component decomposition).

#### **Average Mutual Information**

FIGURE 1 | … with each epoch including n clicks.

FIGURE 2 | Intrusive error example, showing inappropriate co-constriction of tongue dorsum (TD) coincident with the /t/ target TT closure. The error threshold is determined as the "split" mean between the median distributions of in-phase and anti-phase articulator extrema.

Mutual Information (MI) quantifies the information dependency of two random variables, such that knowledge available for one reduces uncertainty associated with the other (e.g., Cover and Thomas, 2006). That is, MI<sub>ij</sub> is the amount communicated by a given measurement y<sub>j</sub> from Y about the value x<sub>i</sub> measured from X. When this dependence is averaged over all cells in the joint distribution between X and Y, the result is their average mutual information (AMI), expressed in bits. In contrast to correlation, which tests only linear dependency, AMI is sensitive to the entire form of the joint distribution and thus evaluates nonlinear dependency. An AMI of zero implies that two variables are statistically independent, and conversely, the higher the AMI between them, the more information each contains about the other. In the context of this work, AMI provides a useful index relating movements of the head to those of the articulators, with higher values associated with greater mutual dependency. We computed AMI on the first principal component by epoch for the pairs HEAD:ART1 (MIH1), HEAD:ART2A (MIH2A), HEAD:ART2B (MIH2B), and HEAD:JAW (MIHJ), where ART1 was the non-alternating (syllable-rate) articulator trajectory (e.g., LA in to**p** co**p**), ART2A was the first alternating (half syllable-rate) articulator of the pair (e.g., TT in **t**op cop), and ART2B was the second alternating articulator of the pair (e.g., TD in top **c**op). For non-alternating control pairs, ART1 was the coda trajectory, and both ART2A and ART2B were mapped to the onset trajectory. **Table 3** provides a glossary of these relationships.
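To make the AMI definition concrete, the following sketch estimates it from two equal-length signals using a fixed-width joint histogram. The estimator actually used in the study is not specified in the text, so the binning scheme, the default of eight bins, and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def bin_index(value, lo, hi, n_bins):
    """Map a value onto one of n_bins equal-width bins spanning [lo, hi]."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def average_mutual_information(x, y, n_bins=8):
    """Histogram estimate of the mutual information between x and y, in bits."""
    if len(x) != len(y):
        raise ValueError("signals must have equal length")
    x_lo, x_hi, y_lo, y_hi = min(x), max(x), min(y), max(y)
    bx = [bin_index(v, x_lo, x_hi, n_bins) for v in x]
    by = [bin_index(v, y_lo, y_hi, n_bins) for v in y]
    joint = Counter(zip(bx, by))       # counts per joint cell
    px, py = Counter(bx), Counter(by)  # marginal counts
    n = len(x)
    # Sum over occupied cells of p(i,j) * log2( p(i,j) / (p(i) * p(j)) ).
    return sum((c / n) * math.log2(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


# A signal shares maximal information with itself and none with an
# independent pairing that fills every joint cell equally.
ramp = list(range(64))
ami_same = average_mutual_information(ramp, ramp)
ami_indep = average_mutual_information([i % 8 for i in range(64)],
                                       [i // 8 for i in range(64)])
print(ami_same, ami_indep)
```

Histogram estimates of this kind depend on bin count and sample size, so values are best compared across conditions computed with the same settings.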

#### **Mutual Power**

TABLE 2 | Error counts by speaker and condition [error types: intrusions, reductions, substitutions; context: same, onset, coda alternation; epoch: stable, initial, and final accelerating production rates; and articulator: tongue dorsum (TD), tongue tip (TT), lip aperture (LA)].

FIGURE 3 | The left panel shows mutual power (MP) between HEAD and tongue dorsum (TD) for an exemplar cop top sequence. Increasingly darker red hues indicate higher values of MP; lighter shades to the lower left and right indicate possible wavelet edge effects. The right panel shows corresponding source PC1 trajectories for HEAD, alternating /k/ (TD) and syllable rate /p/ lip aperture (LA). Both the syllable rate and alternating rate frequency bands show significant MP, but highest values are observed at the lower alternating frequency, showing that this is the base rate for HEAD movement.

Entrainment between the head and the speech articulators can also be investigated using estimates of mutual power (MP) in the alternating and non-alternating frequency bands. It was computed here using the cross-wavelet transform (Grinsted et al., 2004), which convolves the discrete wavelet transform of one signal with the complex conjugate of the other, with MP given by the absolute value of the result converted to dB. This is a spectral representation similar to a spectrogram in which successive frames (time) encode power at different frequencies, with MP highest for those frequencies which are mutually coherent between the paired trajectories. **Figure 3** provides an example pairing HEAD and TD for a cop top sequence, showing relative MP in the alternating and non-alternating frequency bands. We used the Cross Wavelet Toolbox (Grinsted, 2014) to compute MP by epoch for the same first principal component pairs used to measure AMI. To quantify MP over each epoch, we tracked resonance amplitude peaks for the frequency band closest to both the expected syllable and alternating rates (as determined by the mean metronome click rate for the epoch) and determined their median values. For the HEAD:ART1 comparison, MPH1**1** represents the median value in the syllable (non-alternating) frequency band, and MPH1**2** represents the median value in the alternating band. Similarly, for the HEAD:JAW comparison, MPHJ1 and MPHJ2 give power in the syllable and alternating frequency bands, respectively. For the HEAD:ART2A comparison, MPH2A1 and MPH2A2 give the syllable and alternating frequency band values, and likewise for the HEAD:ART2B comparison, MPH2B1 and MPH2B2 give the syllable and alternating frequency band values. As with AMI, for non-alternating control pairs ART1 was the coda trajectory, and both ART2A and ART2B were mapped to the onset trajectory. See **Table 3** for a glossary of these relationships.
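The idea behind the band-wise MP measure can be illustrated with a hand-rolled sketch that evaluates a complex Morlet wavelet at a single frequency, combines the coefficients of two signals (one complex-conjugated), and takes the median magnitude in dB. This is not the Cross Wavelet Toolbox: the function names, the six-cycle wavelet, the truncation at three standard deviations, the 10·log10 dB conversion, and the 50 Hz toy sampling rate are all assumptions chosen for a compact example.

```python
import cmath
import math
from statistics import median


def morlet_transform(signal, fs, freq, n_cycles=6.0):
    """Complex Morlet wavelet coefficients of `signal` at one frequency."""
    sigma = n_cycles / (2.0 * math.pi * freq)  # Gaussian width in seconds
    half = int(3.0 * sigma * fs)               # truncate at +/- 3 sigma
    offsets = range(-half, half + 1)
    # Conjugated wavelet samples, so the inner sum is a correlation.
    kernel = [cmath.exp(-2j * math.pi * freq * k / fs)
              * math.exp(-0.5 * (k / fs / sigma) ** 2) for k in offsets]
    norm = sum(abs(w) for w in kernel)
    n = len(signal)
    coeffs = []
    for t in range(n):
        acc = 0j
        for k, w in zip(offsets, kernel):
            idx = t + k
            if 0 <= idx < n:
                acc += signal[idx] * w
        coeffs.append(acc / norm)
    return coeffs


def mutual_power_db(x, y, fs, freq):
    """Median cross-wavelet magnitude (dB) of x and y at one frequency."""
    wx = morlet_transform(x, fs, freq)
    wy = morlet_transform(y, fs, freq)
    eps = 1e-12  # guards against log10(0)
    return median(10.0 * math.log10(abs(a * b.conjugate()) + eps)
                  for a, b in zip(wx, wy))


# Toy signals: the "head" moves only at the alternating rate (about 1.42 Hz),
# the "articulator" at both the alternating and syllable (about 2.83 Hz) rates.
fs = 50.0
t = [k / fs for k in range(400)]
head = [math.sin(2 * math.pi * 1.42 * tk) for tk in t]
art = [math.sin(2 * math.pi * 1.42 * tk + 0.5)
       + 0.5 * math.sin(2 * math.pi * 2.83 * tk) for tk in t]
mp_alt = mutual_power_db(head, art, fs, 1.42)
mp_syl = mutual_power_db(head, art, fs, 2.83)
print(f"alternating band {mp_alt:.1f} dB, syllable band {mp_syl:.1f} dB")
```

In this toy case the shared component lies in the alternating band, so that band shows the higher mutual power, mirroring the pattern described for HEAD in Figure 3.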

#### Error-Based Measures


To investigate whether head movement is locally sensitive to error occurrence, we examined its peak velocity (EPV) immediately preceding and following each error. The PRE and POST evaluation windows for comparison were set equal to twice the length of the metronome click period containing the error; that is, for an error occurring at time t within a click period of duration p, the PRE value for that error was the peak HEAD speed achieved over the t–p range paired with the POST value over the t+p range. HEAD speed was computed as the tangential velocity of the UI sensor trajectory using central differencing.
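A minimal sketch of the PRE/POST comparison follows. It assumes a precomputed HEAD speed signal and takes each window to span one click period on either side of the error time, which is one reading of the description above. The function name and the index rounding are illustrative choices.

```python
def pre_post_peak(speed, fs, t_err, period):
    """Peak speed in the windows just before and just after an error.

    speed: HEAD speed samples; fs: sampling rate (Hz);
    t_err: error time (s); period: duration (s) of the click period
    containing the error. Window extent is an assumption.
    """
    i_err = round(t_err * fs)
    width = round(period * fs)
    pre = speed[max(i_err - width, 0):i_err]
    post = speed[i_err:i_err + width + 1]
    if not pre or not post:
        raise ValueError("error too close to the start or end of the trial")
    return max(pre), max(post)


# Toy speed signal sampled at 200 Hz that jumps at the moment of the error.
toy_speed = [1.0] * 100 + [3.0] * 100
pre_peak, post_peak = pre_post_peak(toy_speed, fs=200.0, t_err=0.5, period=0.35)
print(pre_peak, post_peak)
```

Pairing the two values per error is what allows a paired comparison of head velocity before and after each error.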

#### Analysis

Statistical analysis of the collected data was performed within the R environment (R Core Team, 2018). Effect sizes for paired t-tests were evaluated using Cohen's d statistic. Linear mixed-effects models were evaluated using the lme4 (Bates et al., 2015) and lmerTest (Kuznetsova et al., 2017) packages. Log-likelihood comparisons were used to assess whether interaction terms and random slopes by speaker and word pair were supported. Significance of model fixed effects was assessed using estimates of the regression coefficients divided by their SEs (a t-test), with degrees of freedom based on the Satterthwaite approximation. Model effect sizes were evaluated using partial R<sup>2</sup>, the proportion of variance explained by the fixed effects alone, and conditional R<sup>2</sup>, the proportion of variance explained by both fixed and random effects, using the methods of Nakagawa and Schielzeth (2013). Significant results are indicated using the convention p < 0.001 ∗∗∗, p < 0.01 ∗∗, p < 0.05 ∗, and p < 0.10 •. Full model outputs (indexed as M1, M2, ... below) are provided as **Supplementary Material**. Note that we do not consider possible lexical effects because the task used common real words of English with simple CVC structure and because we consider that the nature of the task (rote repetition) minimizes lexical influence following the first production instance.

#### RESULTS

#### Error Rates

**Table 2** summarizes error counts by speaker and conditions, and **Figure 4** shows their distribution as error rates normalized by the number of syllables produced per epoch. As shown in **Figure 4**, error rate was affected by both context (alternation task) and production rate (epoch). Gestural intrusion (co-constriction of the anti-phase articulator) was the most common type of error. Extending the results of Slis and Van Lieshout (2016) to coda alternation, most intrusive errors were produced with the TD articulator and the fewest with the lips. A model (M1) predicting error rate (combined across all types) by fixed effects of context and epoch and their interaction, with random intercepts by speaker and word pair, showed a significantly greater main effect for context ONSET (t = 2.1 ∗) and CODA (t = 3.0 ∗∗) alternation than for no alternation (SAME). While no main effect of epoch was observed, its interaction with context showed significantly higher error rates in the accelerated epoch ACC2 for alternating trials (ACC2:ONSET t = 3.8 ∗∗∗, ACC2:CODA t = 9.8 ∗∗∗). For this model, partial R<sup>2</sup> = 0.37, conditional R<sup>2</sup> = 0.49.

#### Head Movement

**Figure 5** illustrates the range of observed head movement by speaker, contrasting the STABLE:SAME condition, where least movement is expected, to the ACC2:ONSET,CODA (alternating) conditions where the most movement is expected. With two exceptions (M02 and F03, who showed head movement across all conditions), the accelerating metronome task resulted in increased mean head movement by epoch.

To adjust for a left-skewed distribution, head MVT measures were log-transformed for analysis. In addition to fixed effects of context and epoch, a derived error factor was used to distinguish between error-free epochs (ERROR = F) and epochs in which at least one labeled speech error occurred (ERROR = T). A model (M2) predicting log(MVT) from fixed effects of epoch and error, including random slopes for error by speaker and random intercepts by word pair, showed marginally more movement for errorful epochs overall (t = 2.0 •) and significantly more movement for the accelerating epochs ACC1 (t = 2.2 ∗ ) and ACC2 (t = 5.8 ∗∗∗) than the baseline stable epoch (inclusion of their interaction and an effect of alternation context were unsupported by model comparison). Partial R <sup>2</sup> = 0.04, conditional R <sup>2</sup> = 0.52.

#### Head Peak Velocity Evaluated by Epoch

Head peak velocity measures (VEL) were also left-skewed and thus log-transformed for analysis. **Figure 6** shows log(VEL) means and their SEs by epoch, context, and error grouped across speakers. Model comparison for the epoch-based measures resulted in a comparable model (M3) to that used for movement analysis, predicting log(VEL) from fixed effects of epoch and error with random slopes for error by speaker and random intercepts by word pair, with no interaction and no effect for context. The pattern of results was similar to that found for movement, showing marginally higher peak velocity for errorful epochs overall (t = 2.2 •) and significantly higher within the accelerating epochs ACC1 (t = 4.8 ∗∗∗) and ACC2 (t = 12.3 ∗∗∗). Partial R<sup>2</sup> = 0.12, conditional R<sup>2</sup> = 0.54. A post hoc test (Tukey HSD) confirmed that log(VEL) was significantly different by epoch, ordered as STABLE < ACC1 < ACC2 at the p < 0.0001 level (adjusted).

FIGURE 5 | Boxplots of head movement by speaker, contrasting the condition with least expected movement (initial stable epoch, same word context) with the most (ACC2 epoch, alternating words), sorted by magnitude of ACC2 movement.

#### Evaluated Over Local Error Neighborhood


Head peak velocity evaluated over the local PRE/POST neighborhood for each error provides twinned measurements suitable for a one-sided (H1: POST > PRE) paired t-test. The results show clearly that, in general, head peak velocity increases immediately following errors: t(1,098) = 6.5 ∗∗∗; Cohen's d = 0.2. A model (M4) predicting error-local log(VEL) with fixed effects of epoch, context, and PP (PRE/POST), with random intercepts by speaker and word pair, confirmed that POST > PRE (t = 3.3 ∗∗∗). Interactions were not supported. Main effects of epoch and context showed greater peak velocity associated with errors in the ACC2 epoch (t = 2.9 ∗∗) and with ONSET (t = 2.4 ∗) and CODA (t = 2.9 ∗∗∗) alternation. Partial R<sup>2</sup> = 0.01, conditional R<sup>2</sup> = 0.41.
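The relation between the paired t statistic and Cohen's d can be illustrated with a short sketch. It assumes the standardized mean difference of the paired differences (often written d<sub>z</sub>), which equals t divided by the square root of the number of pairs; the text does not state which variant of d was computed, but the reported t(1,098) = 6.5 and d = 0.2 are consistent with this definition, since 6.5/√1099 ≈ 0.196. The function name and the synthetic PRE/POST values are hypothetical and are not the study's data.

```python
import math
import random
import statistics


def paired_t_and_dz(pre, post):
    """Return (t, df, d_z) computed on the paired differences post - pre."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample SD (n - 1 denominator)
    t = mean_diff / (sd_diff / math.sqrt(n))   # paired t statistic
    d_z = mean_diff / sd_diff                  # identical to t / sqrt(n)
    return t, n - 1, d_z


# Synthetic paired measurements with a small positive POST - PRE shift.
rng = random.Random(0)
pre = [rng.gauss(0.0, 1.0) for _ in range(200)]
post = [x + rng.gauss(0.2, 1.0) for x in pre]

t, df, d_z = paired_t_and_dz(pre, post)
print(f"t({df}) = {t:.2f}, d_z = {d_z:.2f}")

# Consistency check on the reported test: df = 1,098 implies 1,099 pairs.
print(round(6.5 / math.sqrt(1099), 2))
```

With more than a thousand pairs, even a small standardized effect of this size yields a large t value, which is why the test is highly significant although d is modest.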

To investigate the possibility that the onset of head movement triggered by errors might be sensitive to either the type of error (i.e., reduction or intrusion) or the active articulator (TD, TT, and LA), an additional model (M5) was fit, predicting error-local log(VEL) from fixed effects of context, error type, articulator, and PP, with random slopes for context and type by speaker and random intercepts by word pair. To reduce the complexity of the analysis, the subset of data used with this model excluded substitutions and the non-alternating (SAME) context given the low and unbalanced error rate in that condition (15 reductions but just one intrusion and no substitutions; **Table 2**) and did not include EPOCH as a fixed effect on the reasoning that the comparison PRE/POST error was valid regardless of the epoch within which it occurred. Interactions between context and error type and between context and articulator were supported, but not with PP. Model results show that head peak velocity increases: immediately following errors (POST > PRE; t = 3.6 ∗∗∗); more for reductions (t = 3.0 ∗), although this is offset in coda alternation (t = −2.7 ∗); and more for TD (t = 2.7 ∗∗) and TT (t = 2.5 ∗) articulators, again offset in coda alternation (t = −2.7 ∗∗, t = −3.1 ∗∗). Post hoc tests (Tukey HSD) confirmed RED > INT and TD, TT > LA at the p < 0.05 level for onset contexts; not significant (n.s.) for coda contexts. Partial R<sup>2</sup> = 0.02, conditional R<sup>2</sup> = 0.49.

#### Average Mutual Information

Recall that AMI was computed pairwise between HEAD and the non-alternating (syllable rate) articulator ART1 (MIH1), the first alternating articulator ART2A (MIH2A), the second alternating articulator ART2B (MIH2B), and JAW (MIHJ). For non-alternating control pairs, ART1 was the coda trajectory, and both ART2A and ART2B were mapped to the onset trajectory. To assess whether more information is shared between HEAD and the alternating articulators rather than the non-alternating articulator, as a first analysis, one-sided (H1: MIH2A, B > MIH1) paired t-tests were performed on the alternating (context = ONSET, CODA) trials alone. For both ART2A (MIH2A > MIH1: t(665) = 14.5 ∗∗∗, Cohen's d = 0.6) and ART2B (MIH2B > MIH1: t(665) = 15.6 ∗∗∗, Cohen's d = 0.6), results confirm greater entrainment of HEAD with the alternating articulators, while a two-sided paired t-test found no significant difference between the first and second alternating articulators (MIH2A ≠ MIH2B: t(665) = 1.3 n.s.).
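As a reminder of what a mutual information measure captures, the sketch below gives a generic histogram-based estimate for two equally sampled signals. It is not the AMI procedure used in this study, which is defined in the Methods; the function names, bin count, and synthetic trajectories are all assumptions made for illustration. A head signal that follows an articulator shares far more information with it than an unrelated signal does.

```python
import math
import random
from collections import Counter


def _bin_index(value, lo, hi, bins):
    """Map a value in [lo, hi] onto an integer bin 0..bins-1."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)


def mutual_information_bits(x, y, bins=16):
    """Histogram estimate of the mutual information of x and y, in bits."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and equally long")
    n = len(x)
    x_lo, x_hi, y_lo, y_hi = min(x), max(x), min(y), max(y)
    xi = [_bin_index(v, x_lo, x_hi, bins) for v in x]
    yi = [_bin_index(v, y_lo, y_hi, bins) for v in y]
    joint, px, py = Counter(zip(xi, yi)), Counter(xi), Counter(yi)
    mi = 0.0
    for (i, j), count in joint.items():
        # p(i,j) * log2(p(i,j) / (p(i) * p(j))), expressed with raw counts
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi


# Synthetic trajectories (hypothetical): 20 s at 100 Hz, 2 Hz oscillation.
rng = random.Random(1)
fs, seconds, rate_hz = 100, 20, 2.0
times = [k / fs for k in range(fs * seconds)]
articulator = [math.sin(2 * math.pi * rate_hz * s) for s in times]
head_coupled = [a + rng.gauss(0.0, 0.2) for a in articulator]
head_unrelated = [rng.gauss(0.0, 1.0) for _ in times]

mi_coupled = mutual_information_bits(head_coupled, articulator)
mi_unrelated = mutual_information_bits(head_unrelated, articulator)
print(f"coupled: {mi_coupled:.2f} bits, unrelated: {mi_unrelated:.2f} bits")
```

Histogram estimates of this kind are biased slightly upward for finite samples, so the unrelated pair will typically show a small positive value rather than exactly zero.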

An additional analysis on all word pairs including the non-alternating controls was performed using a linear mixed-effects model (M6) predicting AMI from fixed effects of epoch, context, and a derived variable pair encoding the HEAD-paired articulator, with random intercepts by speaker and word pair. Model comparison supported inclusion of interaction terms for epoch:context and context:pair, but not an effect for error. **Figure 7** illustrates marginal means for this model. Results showed main effects of significantly greater AMI between HEAD and JAW than the HEAD:ART1 baseline (t = 5.8 ∗∗∗) and for the first acceleration (ACC1) epoch than the initial stable epoch (t = 3.6 ∗∗∗). AMI significantly increased in the second acceleration (ACC2) epoch only under alternation, with CODA increasing more than ONSET (ACC2:ONSET t = 1.8 •, ACC2:CODA t = 4.9 ∗∗∗). As is evident from **Figure 7**, the interaction between context and pair was driven by significantly higher AMI between HEAD and both alternating articulators in the alternating vs. non-alternating (SAME) contexts (ONSET:MIH2A t = 5.0 ∗∗∗, CODA:MIH2A t = 5.1 ∗∗∗, ONSET:MIH2B t = 5.6 ∗∗∗, CODA:MIH2B t = 5.6 ∗∗∗). For this model, partial R<sup>2</sup> = 0.07, conditional R<sup>2</sup> = 0.55. Post hoc tests (Tukey HSD) found no difference in AMI between HEAD paired with either the onset (MIH2A, MIH2B) or coda (MIH1) of non-alternating control pairs but confirmed the hierarchy MIH2A, MIH2B, MIHJ > MIH1 for both ONSET and CODA alternating contexts (p < 0.0001). In addition, in CODA contexts, MIHJ was significantly ordered between MIH2A,B and MIH1 (i.e., MIH2A, MIH2B > MIHJ > MIH1; p < 0.0001), indicating that biomechanical coupling between the head and jaw is insufficient to account for the degree of observed entrainment between HEAD and the alternating articulators.

#### Mutual Power

As with AMI, MP was computed pairwise between HEAD and the non-alternating articulator ART1 (MPH1x), the first and second alternating articulators ART2A and ART2B (MPH2Ax, MPH2Bx) and the jaw (MPHJx). For non-alternating control pairs, ART1 was the coda trajectory, and both ART2A and ART2B were mapped to the onset trajectory. MP was assessed for each pairing in the syllable rate frequency band (x = 1) and the alternating rate frequency band (x = 2); for example, MPH2A**2** gives MP between HEAD and ART2A in the alternating frequency band.
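The logic of comparing power across the two frequency bands can be sketched with a plain cross-spectrum evaluated at a single frequency. This is a simplified stand-in and not the MP computation used in this study, which is defined in the Methods; the function names, the 4 Hz syllable rate, and the synthetic trajectories are assumptions made for illustration. The alternating rate is half the syllable rate, because each alternating constriction recurs on every second syllable.

```python
import cmath
import math
import random


def dft_at(signal, freq_hz, fs):
    """Complex DFT coefficient of a mean-removed signal at one frequency."""
    n = len(signal)
    mean = sum(signal) / n
    return sum((v - mean) * cmath.exp(-2j * math.pi * freq_hz * k / fs)
               for k, v in enumerate(signal)) / n


def cross_power(x, y, freq_hz, fs):
    """Magnitude of the cross-spectrum of x and y at freq_hz."""
    return abs(dft_at(x, freq_hz, fs) * dft_at(y, freq_hz, fs).conjugate())


# Synthetic trial (hypothetical): 20 s sampled at 100 Hz.
rng = random.Random(2)
fs, seconds = 100, 20
syllable_hz = 4.0
alternating_hz = syllable_hz / 2
times = [k / fs for k in range(fs * seconds)]

art1 = [math.sin(2 * math.pi * syllable_hz * s) for s in times]      # every syllable
art2a = [math.sin(2 * math.pi * alternating_hz * s) for s in times]  # every other syllable
# A head signal that nods at the alternating rate, plus measurement noise.
head = [0.5 * math.sin(2 * math.pi * alternating_hz * s + 0.4) + rng.gauss(0.0, 0.1)
        for s in times]

mp_h2a_alt = cross_power(head, art2a, alternating_hz, fs)  # analogous to MPH2A2
mp_h1_syl = cross_power(head, art1, syllable_hz, fs)       # analogous to MPH11
print(f"alternating band: {mp_h2a_alt:.4f}, syllable band: {mp_h1_syl:.4f}")
```

Under these assumptions, a head that tracks the alternating articulator shows cross-power concentrated in the alternating band, which is the pattern tested for in the analyses that follow.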

To test whether the head moved at a frequency tracking the alternating rather than the non-alternating articulator, reflected in higher MP observed at the slower rate, a one-sided (H1: MPH2A2, MPH2B2 > MPH11) paired t-test was applied to the alternating (context = ONSET, CODA) trials alone. The results strongly support the hypothesis, showing that substantially higher power was observed in the alternating frequency band for both the HEAD:ART2A and ART2B pairings than the syllable rate HEAD:ART1 comparison (MPH2A2 > MPH11: t(665) = 19.2 ∗∗∗, Cohen's d = 0.7; MPH2B2 > MPH11: t(665) = 19.4 ∗∗∗, Cohen's d = 0.8). An additional two-sided paired t-test found no significant difference between the first and second alternating articulators (MPH2A2 ≠ MPH2B2: t(665) = 0.9 n.s.).

A confirmatory analysis (M7) was performed on the alternating word pairs to predict MP from fixed effects of epoch, context, error, and PAIR, with pairings MPH11, MPH2A2, and MPH2B2. Model comparison supported the inclusion of an interaction term between error and context, random intercepts by speaker, and random slopes for pair by word. Results showed an increase in MP for errorful trials (t = 3.0 ∗∗), although this was decreased in coda contexts (t = −2.6 <sup>∗</sup> ). The pairings of HEAD with the alternating rate articulators (MPH2A2: t = 8.5 ∗∗∗, MPH2B2: t = 8.5 ∗∗∗) showed overwhelmingly greater MP (at the alternating rate) than the baseline syllable rate articulator MPH11, confirmed by post hoc (Tukey HSD) tests at the adjusted p < 0.0001 level, which also found no significant difference between MPH2A2 and MPH2B2. Partial R <sup>2</sup> = 0.18, conditional R <sup>2</sup> = 0.47. The model also showed that MP was significantly reduced in the fastest epoch ACC2 (t = −6.7 ∗∗∗). This result may be due to a loss of systematic coherence or increased production variability as errors multiply under rate pressure, since MP amplitude is affected by any deviation from expected alternation frequencies. As observed error rate is highest in ACC2 epochs and CODA alternation contexts, the lower MP values for those conditions may reflect error-driven deviation from the alternating rate, particularly if a frequency reorganization like that shown in **Figure 9** occurs. Conversely, the higher value seen overall for MP in errorful epochs likely reflects the increase in head movement observed in the MVT and VEL results; if such movement continues to track the alternation frequency, as in the **Figure 10** example, then higher coherent MP is to be expected.

Both AMI and MP results to this point show the head coupled with movement of the alternating articulators and with highest MP at the alternating frequency (although MP has so far been evaluated on alternating contexts only). However, this coupling may arise from two as yet undifferentiated sources. One possibility is that speakers may use the head to signal prosodic stress on each pair, for example, **tóp** cop or top **cóp**. In this case (HA), MP between HEAD and either articulator in the non-alternating control pairs should be highest at the frequency of prosodic alternation driving the head; that is, strongest at the alternation frequency regardless of context. An alternative possibility is that this coupling reflects reinforcement of the executing motor plan for the less stable (1:2) alternating word pairs only, as necessitated by increasing rate pressure. In this case (HB), MP for the non-alternating controls should be highest at the syllable rate, because recruitment of the head is either unnecessary given the more stable (1:1) production pattern or, if recruited, the head tracks the 1:1 frequency.

To distinguish between these possibilities, a linear mixed-effects model (M8) that included the non-alternating controls was used to predict MP from fixed effects of epoch, context, and PAIR, with random intercepts by speaker and word pair. Pairings of HEAD with ART1, ART2A, ART2B, and JAW were included at both the syllabic and alternating frequency rates. Recall that for non-alternating control pairs ART1 was the coda trajectory and both ART2A and ART2B were mapped to the onset trajectory. Only error-free epochs were used (Ns: SAME = 229, ONSET = 155, CODA = 138) to avoid the phase-resetting disruptions of errors on the computation of MP. Model comparison supported the inclusion of interaction terms for epoch:context and context:pair. **Figure 8** illustrates the marginal means for this model. It is readily apparent from this figure that in the control (context:SAME) condition, the strongest mutual power is between the head and the ART1 (coda) trajectories at the syllabic rate (MPH11), well above MP at the alternating rate (MPH12), whereas in the alternating (ONSET, CODA) contexts, highest MP occurs at the alternating frequency rate (MPH2A2, MPH2B2), thus confirming HB. As quantified by the model, all pairings for the SAME context have lower MP than the MPH11 baseline: MPH12 t = −3.3 ∗∗∗, MPH2A1 t = −1.9 •, MPH2A2 t = −4.9 ∗∗∗ (recall that MPH2B is a copy of MPH2A in SAME contexts). Tukey HSD contrasts averaged over EPOCH for the SAME context have the ordering MPH11, MPH2A1 > MPH12, MPH2A2 > MPHJ1, MPHJ2, significant at an adjusted value of p < 0.02. In the interaction of pairing with context, however, both first and second alternating articulators show strongest MP at the alternating rate, overwhelmingly greater than the MPH11 baseline (ONSET:MPH2A2 t = 10.2 ∗∗∗, CODA:MPH2A2 t = 8.9 ∗∗∗, ONSET:MPH2B2 t = 10.3 ∗∗∗, CODA:MPH2B2 t = 9.7 ∗∗∗). Partial R<sup>2</sup> = 0.18, conditional R<sup>2</sup> = 0.42. The pairing of HEAD with JAW shows the least energy for all three contexts, in both frequency bands, demonstrating again that it is not the underlying driver of head movement.

As in the simpler model, an effect of epoch shows that MP declines as rate increases (subject to interaction with context), with the lowest values found at the fastest rate (main effect ACC2 t = −7.7 ∗∗∗). Because errors were not included in this analysis, this result is likely due to loss of coherence (thus affecting MP) as accelerating production rate leads to greater variability in the articulation of each sequence.

#### Summary of Results

Error Rates (**Figure 4**): More errors were observed in alternation conditions (CODA > ONSET > SAME) and at faster production rates (ACC2 > ACC1 > STABLE). Intrusions were most common (66%), followed by reductions (28%) and substitutions (6%). For intrusions, TD was the most common articulator (41%), followed by TT (37%) and LA (22%). For reductions, TT was most common (49%), followed by TD (36%), and LA (15%).

Head Movement (**Figure 5**): Increased by epoch with production rate.

Head Peak Velocity (**Figure 6**): By epoch, increased with production rate. By local error, uniformly increased immediately following the error (POST > PRE); in ONSET alternation contexts, reductions increased more than intrusions and TD and TT articulators more than LA (in CODA alternation, these contrasts were n.s.).

Average Mutual Information (**Figure 7**): Greatest MI observed between head and the alternating (phrasal rate) articulators (MIH2A|B), least between head and the non-alternating (syllable rate) articulator (MIH1), and intermediate MI between head and jaw (MIHJ).

Mutual Power (**Figure 8**): For alternating trials only, including errors, highest MP was observed at the alternating (phrasal) rate; this increases in errorful trials and is reduced in CODA alternation and ACC2 epochs. For all trials, including non-alternating controls and excluding errors (to test possible effects of prosodic stress), no significant MP was found at the alternating frequency for controls but significant power at that frequency for alternating trials.

#### DISCUSSION

The pattern of observed speech errors increasing by epoch demonstrates the effectiveness of the accelerating rate task for imposing pressure on production and confirms that errors occur more frequently in coda than in onset alternation. Returning to the questions raised in the Introduction, the results show clearly that head movement, as indexed by distance traveled (MVT) and peak velocity aggregated over epochs (VEL), does increase with speech production rate as driven by the increasing rate metronome. Head movement was also significantly greater within epochs in which at least one error occurred compared to error-free production. In addition, peak velocity was shown to increase significantly immediately following labeled production errors, thus confirming the previous observations of Dittmann and Llewellyn (1969) and Hadar et al. (1984b). Some effects of error type were seen: more intrusions than reductions or substitutions were obtained overall, and more intrusions occurred with the TD articulator than with TT or LA, confirming the pattern reported by Slis and Van Lieshout (2016). While both AMI and MP results show significant coupling of the head to the jaw, this was in both cases subordinate to that seen for the pairing of the head to the constriction-forming articulators, thus ruling out the jaw as a primary source driving the entrainment. (The lesser magnitude coupling that does exist between head and jaw likely arises from the jaw's synergistic role in helping form the constrictions.)

The question of whether head movement is sensitive to onset vs. coda asymmetry has a more nuanced answer. Neither MVT nor VEL supported an effect of alternation context in modeling.

However, AMI computed between HEAD and the alternating and syllable-rate articulators showed an interaction between epoch and context such that the overall effect of increased AMI in the fast rate epoch ACC2 was significantly enhanced in the CODA alternation condition. As AMI requires some minimal level of systematic head movement to predict the paired articulator movement effectively, it is unsurprising that it should be greater in the ACC2 epoch with largest observed head movement. Also, given the higher overall error rates seen in the CODA alternation context, and based on the longer production times reported for CODA alternation by Sevald and Dell (1994), it is likely that the ACC2:CODA condition was the most difficult for speakers to execute. If the head is recruited to facilitate production under increasing pressure, then this condition is also the most likely to show the greatest entrained coordination between paired articulators as reflected by AMI, explaining the observed interaction. The reason that no corresponding effect of context was observed for head movement alone may derive from a lack of sufficient sensitivity: as shown by the spontaneous increase in movement observed following errors, any epoch that includes them will show greater movement overall, swamping any effect of context.

The AMI and MP results for the alternating context conditions clearly show that when the head does move, it tracks the alternating (phrasal) rate rather than the syllable-rate frequency, reflected in the highest values seen for these measures in the pairings between HEAD and both alternating articulators. Because these measures are computed in very different ways, their converging confirmation of this behavior is especially significant. We have considered two possibilities for why the head preferentially moves at the alternating frequency rate. Under the first, head movement is reflecting an imposed phrasal stress pattern, as in trochaic "**tóp** cop, **tóp** cop." Were such to be the case however, it should also apply consistently to control sequences like "**tóp** top, **tóp** top" and result in high MP at the alternating rate for those trials as well. However, results for control trial sequences instead show highest MP at the syllable rate, undermining this explanation. The alternative, supporting hypothesis H1, is that the head is recruited for enhancing stability of the 1:2 alternation pattern as production difficulty increases, a point considered more fully below.

It is possible that additional factors, not considered in this study, may also play a role in driving head movement. For example, conscious awareness of errorful production has been shown to lead to more dynamic facial expression (Barkhuysen et al., 2005), and this may in turn be coupled with increased head movement. Speakers may also have been distracted or influenced by the presence of the experimenters observing their production and used movement of the head in a communicative mode to signal correction following self-perceived errors. Future studies should consider recording facial features and polling participants for their self-awareness of errors to address these concerns. However, because head movement was observed to increase systematically with rate pressure even without errors, self-awareness alone seems unlikely to be its primary cause.

#### TABLE 3 | Glossary of dependent variables.

PC1, first principal component; AMI, average mutual information; MP, mutual power. For non-alternating control pairs, ART1 is mapped to the coda articulator, and ART2A and ART2B to the onset articulator.

In summary then, speaker head nodding increased with production effort and increased abruptly following errors. More errors were observed under faster production rates and in coda rather than onset alternations. More intrusions were observed than reductions or substitutions, and more errors were produced with the tongue (TD, TT) than the lips (LA). Neither jaw coupling nor imposed prosodic stress was observed to be a primary driver of head movement. The greatest entrainment between head and articulators was observed at the fastest rate under coda alternation. And nodding frequency in alternating word pairs tracked the alternation rate rather than the syllable rate. But these results leave open the additional question of why the head or other extremities should be systematically related to articulatory movement.

The study by Hadar (1991) mentioned above measured the head movement of aphasics and normal controls engaged in speech during interviews, finding that while head movement was positively correlated with speaking rate for both groups, it was highest for non-fluent aphasic speakers, who apart from speech coordination difficulties showed no other motor impairment. In a different domain, Goebl and Palmer (2009) showed that pianists performing a duet with manipulated auditory feedback increased the magnitude and coherence of their head movements as this feedback was degraded. In both cases, head movement appears to be supplemental to normal patterns of movement, compensating for some kind of stress or impairment. Moreover, studies of dual-task demands imposed by walking and talking simultaneously (e.g., Kemper et al., 2003) show that when the head is unavailable for recruitment (because of its role in maintaining balance), both speech rate and fluency decline, particularly in older adults.

In the current study, the "cop top" trial shown in **Figure 10** provides a relevant example. Initially, the head is almost still, but it begins to move following a series of errors, eventually tracking the TT constriction as error-free alternation is (temporarily) restored. This illustrates a previously mentioned explanation, the recruitment of additional degrees of freedom to reinforce a (wobbly) coordinative structure in its execution of a motor pattern. As discussed above, the particular pattern arising from word pair alternation requires reinforcement because of its juxtaposition of syllabically vs. bisyllabically recurring articulation in a 1:2 frequency ratio, which is less stable than a 1:1 relationship, especially under rate pressure (Haken et al., 1985; Kelso et al., 1993; Goldstein et al., 2007). The "top cop" trial shown in **Figure 9** provides an example of what happens when production rate becomes overwhelming: an increase in head nodding magnitude at the alternating frequency following initial rate acceleration is ultimately insufficient to prevent a phase transition that leaves all articulators including the head oscillating at the 1:1 syllabic frequency.

While the trials shown in **Figures 9** and **10** represent interesting examples, in most cases recruitment of the head (and, although not recorded, the feet and hands, which were also sometimes observed to tap at the alternation frequency) served to stabilize the coordinative structure assembled to articulate the speech task under increasing rate pressure. Crucially, because the alternation frequency is less stable than the base syllable rate when words within the pair differ, that is the rate that the head was observed to support. When, as in the example shown in **Figure 10**, this recruitment follows immediately upon a production error, the reorganization of the coordinative structure to include the head appears to act to reset and restore the appropriate phase relations among articulators. As expressed by Kelso et al. (1993, p. 365):

"[A] system containing a set of active components that have been self-organized for a particular movement pattern is [...] no longer able to support that behavior in a stable fashion when a control parameter (here the frequency of motion) crosses a critical value. The new movement pattern may still be topologically equivalent to the previous one [...] but additional d.f. are required to perform the task."

In general, the recruitment of additional degrees of freedom is directly related to maintaining the executing task, as for example when both hands are needed to stabilize manipulation of a significant weight. What is interesting about head nodding, foot tapping, and other peripheral extremities recruited as in this task to maintain a rhythmic pattern under production stress is that they are at best only very loosely related biomechanically to the actual articulation of speech. The Coupling Graph model (Nam et al., 2009) predicts that the more connections that exist between the oscillators that collectively produce speech gestures, the more stable the relationships between those oscillators will be. Entrained oscillation of the head, despite contributing little or nothing directly to articulation, nonetheless serves in this view as a contributor to overall stability of the executing motor plan. Our results, particularly the abrupt increase in head movement observed following errors, provide evidence in support of coupling graph reorganization to include the head for this purpose. Thus, while under normal speaking conditions, the primary function of head movement is communicative, this work shows that head movement in speech tasks can also be driven by motoric influences, and that its recruitment can serve as a means of preserving articulatory stability under production duress.

#### DATA AVAILABILITY STATEMENT

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher upon request.

#### REFERENCES


#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of Yale Institutional Review Board with written informed consent from all subjects, who were paid for their participation. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Yale Institutional Review Board.

#### AUTHOR CONTRIBUTIONS

MT, CM, and LG designed the experiment and wrote the manuscript. MT supervised the data collection and performed the data analysis. CM supervised the error labeling.

#### FUNDING

This work was supported in part by NIH grants DC008780 and DC002717 to Haskins Laboratories.

#### ACKNOWLEDGMENTS

Stefanie Shattuck-Hufnagel and Elliot Saltzman contributed useful discussion of this work, and Argyro Katsika, Raj Dhillon, and Hansook Choi provided valuable assistance with recordings and labeling. Reviewers materially improved the paper through useful suggestions.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.02459/full#supplementary-material



**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Tiede, Mooshammer and Goldstein. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Spatially Conditioned Speech Timing: Evidence and Implications

#### Jason A. Shaw<sup>1</sup> \* and Wei-rong Chen<sup>2</sup>

<sup>1</sup> Department of Linguistics, Yale University, New Haven, CT, United States, <sup>2</sup> Haskins Laboratories, New Haven, CT, United States

Patterns of relative timing between consonants and vowels appear to be conditioned in part by phonological structure, such as syllables, a finding captured naturally by the two-level feedforward model of Articulatory Phonology (AP). In AP, phonological form (gestures and the coordination relations between them) receives an invariant description at the inter-gestural level. The inter-articulator level actuates gestures, receiving activation from the inter-gestural level and resolving competing demands on articulators. Within this architecture, the inter-gestural level is blind to the location of articulators in space. A key prediction is that inter-gestural timing is stable across variation in the spatial position of articulators. We tested this prediction by conducting an Electromagnetic Articulography (EMA) study of Mandarin speakers producing CV monosyllables, consisting of labial consonants and back vowels in isolation. Across observed variation in the spatial position of the tongue body before each syllable, we investigated whether inter-gestural timing between the lips, for the consonant, and the tongue body, for the vowel, remained stable, as is predicted by feedforward control, or whether timing varied with the spatial position of the tongue at the onset of movement. Results indicated a correlation between the initial position of the tongue gesture for the vowel and C-V timing, indicating that inter-gestural timing is sensitive to the position of the articulators, possibly relying on somatosensory feedback. Implications of these results and possible accounts within the Articulatory Phonology framework are discussed.

Edited by: Pascal van Lieshout, University of Toronto, Canada

Reviewed by: Louis Goldstein, University of Southern California, United States; Philip Hoole, Ludwig Maximilian University of Munich, Germany

#### \*Correspondence:

Jason A. Shaw jason.shaw@yale.edu

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 30 April 2019 Accepted: 18 November 2019 Published: 05 December 2019

#### Citation:

Shaw JA and Chen W-r (2019) Spatially Conditioned Speech Timing: Evidence and Implications. Front. Psychol. 10:2726. doi: 10.3389/fpsyg.2019.02726

Keywords: feedforward control, articulatory phonology, gesture coordination, CV timing, Mandarin Chinese, electromagnetic articulography, state-based feedback, neutral attractor

# INTRODUCTION

Patterns of relative timing between consonants and vowels appear to be conditioned in part by abstract phonological structure, such as syllables, but also modulated by the particular gestures being coordinated (e.g., Marin and Pouplier, 2010; Marin, 2013; Brunner et al., 2014; Shaw and Gafos, 2015; Hermes et al., 2017; Ying et al., 2017). The most rigorous attempts to formalize phonologically relevant temporal patterns have come within the Articulatory Phonology (AP) framework, which draws a distinction between the inter-gestural level of representation and the inter-articulator level (Browman and Goldstein, 1989; Saltzman and Munhall, 1989). In AP, context-independent phonological representations are given at the inter-gestural level, in the form of dynamical systems that exert task-specific forces on articulators. The form of the dynamical system for a gesture remains constant across different phonological and lexical contexts. Contextual effects on articulatory behavior, due to the starting position of the articulators or to temporal co-activation of gestures, are resolved at the inter-articulator level. The same gesture can have different net effects on articulatory behavior in different contexts owing to the way that competing demands on an articulator are resolved at the inter-articulator level. Crucially, AP is a feedforward control system. Gestures (at the inter-gestural level) exert forces on articulators but do not receive feedback from the state of the articulators in space or time. Feedback of this sort is encapsulated within the inter-articulator level.

The two-level feedforward control system of AP accounts for some language-specific phonetic patterns. It can account for target undershoot phenomena and context effects on articulation without sacrificing phonological constancy (Browman and Goldstein, 1990). Moreover, higher-level phonological structures have been linked to characteristic patterns of timing between gestures, results which receive a natural account within the inter-gestural level of AP. For example, languages that allow syllables with complex onsets, such as English, Polish and Georgian, pattern together in how word-initial consonant clusters are coordinated to the exclusion of languages that disallow complex onsets, such as Arabic and Berber (Goldstein et al., 2007; Shaw and Gafos, 2015; Hermes et al., 2017). In addition to simplex vs. complex syllable onsets, segment complexity may also have a temporal basis (Shaw et al., 2019). Shaw et al. (2019) show that in palatalized stops of Russian, e.g., /p<sup>j</sup>/, the labial and lingual gestures are timed synchronously whereas superficially similar sequences in English, e.g., /pj/ in /pju/ "pew", and unambiguous sequences in Russian, e.g., /br/, are timed sequentially. This difference between complex segments and segment sequences mirrors behavior found at the syllabic level. Language-specific temporal organization of phonology, as illustrated by cases such as these, receives a natural account within the inter-gestural level of AP.

In contrast to AP, neuro-anatomical models of speech production rely on auditory and somatosensory state feedback to control movement timing (Houde and Nagarajan, 2011; Hickok, 2014). In these models there are no context-independent dynamics comparable to the gestures of AP. Rather, articulation is controlled through the mechanism of feedback. Adjustments to articulation are made online in order to guide articulators to producing target sounds. While these models are silent on the phonological phenomena for which the inter-gestural level of AP provides a natural explanation, they provide an account for how some speakers adjust articulation online in response to perturbation of auditory feedback (e.g., Houde and Jordan, 1998). In AP, articulator position information is available only to the inter-articulator level, which is governed by the Task Dynamics model (Saltzman and Munhall, 1989). Within the inter-articulator level, Task Dynamics assumes perfect information about articulator positions, although more recent work has explored replacing this assumption with a more realistic model of feedback (Ramanarayanan et al., 2016). Crucially for our purposes, there is no mechanism for state-based feedback at the inter-articulator level to influence inter-gestural coordination. This means that while auditory/somatosensory feedback could drive articulatory adjustments to how a particular task is achieved, it cannot trigger earlier/later activation of a gesture.

Experimental evidence indicating that information from the articulator level can feed back to the inter-gestural level is available from perturbation studies. In experimental contexts in which there is a physical perturbation to articulation, gestures have been observed to "reset" (Saltzman, 1998; Saltzman et al., 1998). Phase-resetting in response to physical perturbation suggests that coordination at the inter-gestural level does not uni-directionally drive articulatory movement. Saltzman et al. (1998) argue: "intergestural and interarticulatory dynamics must be coupled bidirectionally, so that feedback information can influence the intergestural clock in a manner that is sensitive to articulatory state" (p. 422).

Some recent kinematic studies suggest possible links between the spatial position of articulators and relative timing observable outside of perturbation experiments (Brunner et al., 2014; Pastätter and Pouplier, 2017). Brunner et al. (2014) list the spatial position of the articulator as one of a number of factors that influence measures of gesture coordination, leading to consonant-specific variation in timing patterns in German. Pastätter and Pouplier (2017) investigated whether coarticulatory resistance, a measure of the degree to which an articulator resists spatial perturbation (Bladon and Al-Bamerni, 1976; Recasens and Espinosa, 2009; Chen et al., 2015), influences the relative timing of a consonant and following vowel. In line with their hypotheses, overlap between a consonant and vowel was affected by the coarticulatory resistance of the consonant. C-V overlap was greater for consonants less resistant to coarticulation. Pastätter and Pouplier also report a corresponding effect of consonant identity on the spatial position of the vowel. Vowels that showed less temporal overlap with the preceding consonant were spatially closer to the preceding consonant, converging evidence that consonants with high coarticulatory resistance delay vowel movements. In order to account for this pattern, Pastätter and Pouplier proposed to vary coupling strength at the inter-gestural level by articulator. In this way, different articulators could enter into the same basic coordination relation, e.g., in-phase or anti-phase timing, but exert differential forces on vowel timing. The theoretical account offered by Pastätter and Pouplier makes properties of articulators (but not their spatial positions) visible to inter-gestural timing. The account preserves language-specific timing at the inter-gestural level and feedforward control but does not reconcile the need for state-based feedback observed by Saltzman et al. (1998).

Our aim in this paper is to provide a direct test of whether the spatial position of the tongue influences consonant-vowel (C-V) coordination. To do so, we conducted an Electromagnetic Articulography (EMA) study of Mandarin Chinese. Mandarin is a good language to investigate C-V coordination, both because of its phonological properties and because it is relatively well-studied otherwise. Mandarin allows fairly free combination of tones with consonants and vowels to make CV monosyllabic words. Varying lexical tone while keeping the consonant and vowel sequence constant allowed us to generate a comparatively large number of phonologically distinct monosyllables to test our research question. We focused on non-low back vowels in Mandarin because past work has shown that variation in lexical tone for these vowels does not influence the spatial location of the vowel target; /i/ and /a/, in contrast, vary with tone (Shaw et al., 2016). Our stimuli were CV monosyllables, consisting of a labial consonant and a back vowel. Single-syllable words in isolation allow for considerable variability in the starting position of the articulators. Across the observed variation in the spatial position of the tongue body, we investigated whether inter-gestural coordination between the lips, for the consonant, and the tongue body, for the vowel, remained constant, as is predicted by feedforward control.

There are competing hypotheses about the feedforward control regime for Mandarin C-V syllables. Xu (2005) theorizes that consonants and vowels (as well as lexical tones) begin synchronously, at the start of the syllable. This assumption has been implemented in computational modeling of f<sup>0</sup> for tone and intonation (Xu and Wang, 2009; Xu et al., 2015). A slightly different conclusion about Mandarin CV timing was reached by Gao (2008, 2009). In an EMA experiment tracking tongue and lip movements, Gao (2009) found that there is positive C-V lag, i.e., the vowel gesture does not begin movement until after the onset of movement of the consonant. Gao attributed the positive C-V lag to competitive coordination between consonant, vowel, and tone gestures. The account incorporates pressure to start the consonant and vowel at the same time, i.e., in-phase coordination, along with other competing demands on coordination. The tone and vowel are coordinated in-phase, but the consonant (C) and tone (T) are coordinated sequentially (anti-phase). The competing demands of anti-phase C-T timing, in-phase C-V timing, and in-phase V-T timing are resolved by starting the vowel at the midpoint between the onset of consonant and tone gestures. Notably, Gao's analysis of C-V lag in Mandarin mirrors the analysis of C-V timing in languages with syllable-initial consonant clusters (Browman and Goldstein, 2000; Gafos, 2002; Goldstein et al., 2007; Marin and Pouplier, 2010; Hermes et al., 2013, 2017; Marin, 2013; Shaw and Gafos, 2015). The common thread is that the observed C-V lag in a CCV syllable is driven by competing forces on inter-gestural coordination – anti-phase coordination for the consonants and in-phase coordination between each onset consonant and the vowel. Xu et al. (2015) do not address Gao's data. However, both accounts of C-V lag in Mandarin described above, although they differ in assumptions, involve feedforward control of articulation. As such, they predict that relative timing is blind to the spatial position of the articulator. In the experiment that follows, we test this hypothesis.

# EXPERIMENT

#### Speakers

Six native speakers of Mandarin Chinese (3 male) participated. They were aged between 21 and 25 years (M = 23.7; SD = 1.5) at the time of the study. All were born in Northern China (Beijing and surrounding areas) and lived there until at least 18 years of age. The speakers all lived in Sydney, Australia, where the experiment was conducted, at the time of their participation. All participants were screened by a native speaker of Mandarin Chinese to ensure that they spoke standard Mandarin. Procedures were explained to participants in Mandarin by the second author, a speaker of Taiwanese Mandarin. Participants were compensated for their time and local travel expenses.

#### Materials

Target items were a set of CV monosyllables that crossed all four lexical tones of Mandarin (tone 1 "high", tone 2 "rise", tone 3 "low", and tone 4 "fall") with two labial consonants {/m/, /p/} and three back rounded vowels {/ou/, /u/, /uo/}, yielding 24 items. Each item was repeated 6–12 times by each speaker, producing a corpus of 949 tokens for analysis. We chose labial consonants because of the relative independence between the consonant (lips) and the vowel (tongue dorsum) gestures. We chose back vowels in particular because of past work showing that /u/ in Mandarin resists the coarticulatory effects of tone, which influence /i/ and /a/ (Shaw et al., 2016). We also report an analysis of unrounded /i/ and /a/, drawing on data from Shaw et al. (2016). The purpose of this additional analysis is to assess whether the pattern for our target items generalizes to unrounded vowels.
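The crossing of factors described above can be sketched as follows. This is only an illustration of how the 24 items arise; the variable and function names are hypothetical and not taken from the study's materials.

```python
from itertools import product

# Factor levels as described in the Materials section.
TONES = {1: "high", 2: "rise", 3: "low", 4: "fall"}
CONSONANTS = ["m", "p"]      # labial onsets (IPA)
VOWELS = ["ou", "u", "uo"]   # back rounded vowels (IPA)


def build_items():
    """Return every consonant x vowel x tone combination as a tuple."""
    # Iterating over the TONES dict yields its keys (the tone numbers).
    return [(c, v, t) for c, v, t in product(CONSONANTS, VOWELS, TONES)]


items = build_items()
print(len(items))  # 2 consonants x 3 vowels x 4 tones
```

With 6–12 repetitions per item and six speakers, a fully crossed design of this kind yields a corpus on the order of the 949 tokens reported.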

Target items were randomized with fillers and displayed one at a time on a monitor in Pinyin, a standard Romanization of Chinese. The three back vowels included in the materials have the following representation in Pinyin: "o" /uo/, "u" /u/, "ou" /ou/. Here and throughout, we use slashes to refer to IPA symbols. Orthographic representations of vowels not in slashes refer to Pinyin. Many of the items were real words and could have been displayed as Chinese characters. We chose to represent the items with Pinyin orthography because it allowed us to collect all combinations of the onset consonants, vowels and tones under study including those that do not correspond to real words. The Pinyin sequences that are not attested words were combinations of /p/ with /ou/.

#### Equipment

We used an NDI Wave Electromagnetic Articulograph system sampling at 100 Hz to capture articulatory movement. We attached sensors to the tongue tip (TT), body (TB), dorsum (TD), upper lip (UL), lower lip (LL), lower incisor (Jaw), nasion and left/right mastoids. Acoustic data were recorded simultaneously at 22 kHz with a Schoeps MK 41S supercardioid microphone (with Schoeps CMC 6 Ug power module).

#### Stimulus Display

Syllables were displayed in Pinyin on a monitor positioned outside of the NDI Wave magnetic field, 45 cm from participants. Stimulus display was controlled manually using a Visual Basic script in Excel. This allowed for online monitoring of hesitations, mispronunciations and disfluencies. These were rare, but when they occurred, participants were asked to repeat syllables.

#### Post-processing

Head movements were corrected computationally after data collection with reference to the left/right mastoid and nasion sensors. The post-processed data was rotated so that the origin of the spatial coordinates is aligned to the occlusal plane. The occlusal plane was determined by having each participant hold between their teeth a rigid object (plastic protractor) with three sensors configured in a triangle shape. Lip Aperture (LA), defined as the Euclidean distance between the upper and lower lip sensors, was also computed following rotation and translation to the occlusal plane. **Figure 1** shows the range of movement for the entire experiment for one speaker following head correction.
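As a minimal sketch of the Lip Aperture definition given above (Euclidean distance between the upper and lower lip sensors), assuming the sensor positions are already head-corrected and expressed in the occlusal-plane coordinate system. The coordinates below are hypothetical values chosen for illustration.

```python
import math


def lip_aperture(upper_lip, lower_lip):
    """Euclidean distance between the UL and LL sensor positions.

    Both arguments are coordinate tuples of equal length (e.g., x, y, z in mm);
    the result is in the same units as the input.
    """
    return math.dist(upper_lip, lower_lip)


# Hypothetical (x, y, z) sensor coordinates in mm, for illustration only.
ul = (10.0, 2.0, 5.0)
ll = (10.0, 2.0, -7.0)
print(lip_aperture(ul, ll))
```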

#### Articulatory Analysis

The articulatory data analysis focuses on the relative timing between consonant and vowel gestures, which we define in terms of temporal lag, and the position of EMA sensors at linguistically relevant spatio-temporal landmarks: the onset of articulatory movement and the achievement of the gestural target. Onset and target landmarks were determined according to thresholds of peak velocity in the movement trajectories. For the labial consonants, the Lip Aperture trajectory was used. For the back vowels, landmarks were determined with reference to the Tongue Dorsum sensor in the anterior-posterior dimension (i.e., TDx). Landmark labeling was done using the findgest algorithm in MVIEW, a program developed by Mark Tiede at Haskins Laboratories (Tiede, 2005).

**Figure 2** shows an example of how the articulatory landmarks, labeled on the Lip Aperture signal (top panel), relate to the velocity peaks (lower panel). As the lips move together for the labial consonant, the lip aperture (top panel) gradually narrows. The peak velocity in this closing phase, −10 cm/s, occurs just after 100 ms. The signal was thresholded at 20% of this velocity peak, resulting in the Onset and Target landmarks. We also explored the velocity minimum as a possible articulatory landmark for analysis but found that the threshold of peak velocity provided more reliable measurements across tokens. The cause seemed to be that some of the monophthongs in the experiment tended to have relatively long periods of low velocity around the point of maximum opening corresponding to the vowels. Although the NDI Wave system produced high spatial resolution recordings, even a small degree of measurement error (∼0.6 mm) makes picking out the true velocity minima from the wide basin of low velocity movement subject to sizeable temporal variation. Using the threshold of peak velocity mitigates the effect of measurement noise, providing a reliable vowel target landmark across tokens.
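The thresholding idea can be illustrated with a simplified sketch. This is not the findgest algorithm in MVIEW; it is a hypothetical stand-in that assumes a single dominant movement in the analysis window, uses central differences for velocity, and places the Onset and Target landmarks at the edges of the contiguous region in which speed is at least 20% of its peak. The synthetic lip aperture trajectory is invented for illustration.

```python
import math


def velocity(signal, fs):
    """Central-difference velocity (units per second) for a signal of length >= 2.

    Endpoints fall back to one-sided differences.
    """
    n = len(signal)
    v = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        v.append((signal[hi] - signal[lo]) * fs / (hi - lo))
    return v


def threshold_landmarks(signal, fs, fraction=0.2):
    """Return (onset_index, target_index) around the largest speed peak.

    Onset is the first sample, and target the last sample, of the contiguous
    run around the peak in which speed >= fraction * peak speed.
    """
    speed = [abs(x) for x in velocity(signal, fs)]
    peak = max(range(len(speed)), key=speed.__getitem__)
    thresh = fraction * speed[peak]
    onset = peak
    while onset > 0 and speed[onset - 1] >= thresh:
        onset -= 1
    target = peak
    while target < len(speed) - 1 and speed[target + 1] >= thresh:
        target += 1
    return onset, target


fs = 100  # Hz, matching the EMA sampling rate reported above
# Synthetic trajectory: 0.1 s hold at 20 mm, a raised-cosine closing
# movement to 5 mm over 0.2 s, then a 0.1 s hold.
la = ([20.0] * 10
      + [12.5 + 7.5 * math.cos(math.pi * k / 20) for k in range(21)]
      + [5.0] * 10)
onset, target = threshold_landmarks(la, fs)
print(onset / fs, target / fs)  # landmark times in seconds
```

Because the landmarks sit on the steep flanks of the velocity profile rather than in the flat region near zero velocity, small amounts of positional noise shift them by less than they would shift a velocity minimum.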

The primary dependent variable of interest in this study was the temporal lag between consonants and vowels, henceforth C-V lag. A schematic diagram of C-V lag is provided in **Figure 3**. C-V lag was determined by subtracting the timestamp of the gesture onset of the consonant, C<sup>onset</sup><sub>ts</sub>, from the timestamp of the gesture onset of the vowel, V<sup>onset</sup><sub>ts</sub>:

$$\text{CVlag} = \text{V}_{\text{ts}}^{\text{onset}} - \text{C}_{\text{ts}}^{\text{onset}}$$

The primary independent variable of interest is the distance between the tongue at movement onset for the vowel and at the achievement of target. We quantified this in a few different ways. First, we measured the spatial position of the TD sensor at the onset of movement of the vowel. Since all of the target vowels in this study were back vowels, the primary movements for the vowels involved tongue retraction, i.e., movement from a more anterior position to a more posterior position. We refer to the position of the tongue dorsum in this dimension as TDx:

TDx = coordinate of the tongue dorsum sensor in the anterior-posterior dimension

For the speaker shown in **Figure 1**, the range of TDx values is about 18 mm, i.e., from −42 to −60 mm. The negative coordinates are relative to the occlusal plane, so −60 mm indicates 60 mm behind the occlusal plane clenched in the participants' teeth. The value of TDx at movement onset for the vowel served as the key independent measure in the study. The closer the value of TDx at vowel onset was to zero, the further the tongue would have to move to achieve its target.

In addition to TDx at movement onset, we also measured more directly how far away the tongue was from its target at the onset of movement. We call this measure Tdist, for distance to target. We used inferior-superior (y) and anterior-posterior (x) dimensions for both TD and TB in the calculation. Hence, Tdist is the four-dimensional Euclidean distance between the position of lingual sensors (TB, TD) at the onset of vowel movement and at the vowel target. The vowel target for each subject was determined by averaging the position of these sensors at the target landmark across tokens of the vowel. The formula for Tdist is defined below:

$$\text{Tdist} = \sqrt{\begin{array}{l} (\text{TD}_{\text{x}}^{\text{Onset}} - \text{mean}(\text{TD}_{\text{x}}^{\text{Target}}))^2 + (\text{TD}_{\text{y}}^{\text{Onset}} - \text{mean}(\text{TD}_{\text{y}}^{\text{Target}}))^2 \\ {} + (\text{TB}_{\text{x}}^{\text{Onset}} - \text{mean}(\text{TB}_{\text{x}}^{\text{Target}}))^2 + (\text{TB}_{\text{y}}^{\text{Onset}} - \text{mean}(\text{TB}_{\text{y}}^{\text{Target}}))^2 \end{array}}$$

**Figure 4** shows a visual representation of Tdist. The left panel shows the average position of the sensors for one speaker's "o" /uo/ vowel. The right panel shows the TB and TD components of Tdist as directional vectors in 2D (x,y) space. The start of each vector is the position of the sensor at the onset of movement, represented as red circles. The ends of the vectors are the vowel targets for TB and TD. The length of the arrow from the vowel onset to the vowel target is the Euclidean distance for each sensor. Tdist is the combination of the two vectors.
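The Tdist definition can be written out as a short sketch. The function name, argument layout, and coordinate values are hypothetical; the computation follows the definition above, with the vowel target taken as the per-dimension mean of the target-landmark positions across tokens.

```python
import math
from statistics import mean


def tdist(onset, targets):
    """Four-dimensional Euclidean distance from one token's onset posture
    to the mean vowel target.

    onset   : (TDx, TDy, TBx, TBy) at vowel movement onset for one token
    targets : list of (TDx, TDy, TBx, TBy) at the target landmark,
              one tuple per token of the same vowel and speaker
    """
    mean_target = [mean(column) for column in zip(*targets)]
    return math.dist(onset, mean_target)


# Hypothetical coordinates in mm, for illustration only.
onset_posture = (-45.0, 10.0, -30.0, 12.0)
target_postures = [
    (-50.0, 14.0, -34.0, 14.0),
    (-52.0, 16.0, -36.0, 16.0),
]
print(tdist(onset_posture, target_postures))
```

In this example the mean target is (−51, 15, −35, 15), so the per-dimension differences are 6, −5, 5, and −3 mm and Tdist is the square root of 95.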

Our main analysis assesses the effect of TDx and Tdist on C-V lag. To do this, we fit a series of nested linear mixed effects models to C-V lag. All models contained a random intercept for subject. We explored a baseline model with fixed effects for VOWEL (o, u, ou), CONSONANT (b, m), and TONE (1, 2, 3, 4). We ultimately dropped TONE from the baseline model because it did not improve over a model with just VOWEL and CONSONANT as fixed effects. This was somewhat expected since we deliberately selected vowels unlikely to be influenced by tone. Both remaining fixed factors in the baseline model were treatment coded – "o" /uo/ was the reference category for VOWEL and "b" /p/ was the reference category for CONSONANT. To this baseline model, we added one of our main factors of interest: TDx or Tdist. We also investigated whether another kinematic variable, peak velocity of the vowel gesture, explained C-V lag above and beyond the variables related to TD position at the onset of movement, i.e., TDx and Tdist. The modeling results are given in the next section following some visualization and description of the main factors of interest.

# RESULTS

#### Effect of Spatial Position on C-V Lag

**Figure 5** shows the probability density functions of C-V lag in raw milliseconds (i.e., not normalized) for the three vowels, fitted by kernel density estimations. We report the distribution in milliseconds to facilitate comparison across studies. The solid black vertical line at the 0 point indicates no lag – the vowel and the consonant start at the same time. In tokens with negative lag (the left side of the figure) the vowel started movement before the consonant; in tokens with a positive lag (right side of the figure), the consonant starts movement before the vowel. The distribution of lag values is centered on a positive lag for all three vowels, indicating that, on average, vowel movement follows consonant movement. Moreover, the size of the lag is comparable to what has been reported in past studies of CV lag in Mandarin (Gao, 2009; Zhang et al., 2019) and other lexical tone languages (Karlin and Tilsen, 2015; Hu, 2016; Karlin, 2018). There is also, however, substantial variation. The main aim of this paper is to evaluate whether the variability observed in CV lag is related to variability in the spatial position of the tongue dorsum at the onset of movement.

The distribution of tongue backness values (as indicated by TDx at the onset of movement of the TD toward the vowel target) was multi-modal, due to inter-speaker variation in the size of the tongue and the placement of the TD sensor. To normalize for speaker-specific sensor location and lingual anatomy, we calculated z-scores of TDx within speaker. The normalized values are centered on 0. We also normalized the C-V lag measures by z-score. The normalized measures of C-V lag and TDx are shown in **Figure 6**. The resulting distributions for both TDx and C-V lag are roughly normal.
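A minimal sketch of within-speaker z-scoring is given below. Whether the sample or the population standard deviation was used is not stated in the text; the sketch assumes the sample standard deviation, and the function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_within_speaker(values, speakers):
    """Z-score each value against the mean and sample SD of its own speaker.

    Assumes every speaker contributes at least two values that are not all equal.
    """
    by_speaker = defaultdict(list)
    for value, speaker in zip(values, speakers):
        by_speaker[speaker].append(value)
    stats = {s: (mean(vs), stdev(vs)) for s, vs in by_speaker.items()}
    return [(v - stats[s][0]) / stats[s][1] for v, s in zip(values, speakers)]


# Hypothetical TDx values (mm) from two speakers with different sensor placements.
tdx = [-42.0, -50.0, -58.0, -30.0, -35.0, -40.0]
spk = ["S1", "S1", "S1", "S2", "S2", "S2"]
print(zscore_within_speaker(tdx, spk))
```

After normalization, the two speakers' values fall on a common scale even though their raw TDx ranges do not overlap, which is what removes the multi-modality attributable to anatomy and sensor placement.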

The main result is shown in **Figure 7**. The normalized measure of C-V lag is plotted against TDx, i.e., tongue dorsum backness at movement onset. The figure shows a significant negative correlation (r = −0.31; p < 0.001). Variation in C-V lag is correlated with variation in the spatial position of the tongue dorsum at the onset of movement. C-V lag tends to be shorter when the tongue dorsum is in a more anterior position at movement onset. When the starting position of the TD is more posterior, i.e., closer to the vowel target, C-V lag is longer. Thus, **Figure 7** shows that the vowel gesture starts earlier, relative to the consonant gesture, when it has farther to go to reach the target. To evaluate the statistical significance of the correlation in **Figure 7**, we fit linear mixed effects models to C-V lag, using the lme4 package (Bates et al., 2014) in R. The baseline model included a random intercept for speaker and fixed effects for vowel quality and onset consonant. A second model added the main fixed factor to the baseline model. To index the position of the tongue dorsum relative to the vowel target, we considered both TDx and Tdist as fixed factors. For both of these factors as well as for C-V lag, we used the z-score-normalized values in all models. The normalized values of TDx and Tdist were highly collinear (r = 0.48∗∗∗), which prevents us from including both in the same model.

FIGURE 4 | Vowel targets for /uo/ for one speaker, calculated as the average position of the TD and TB sensors across repetitions. Red circles show the spatial positions of the sensors at the onset of movement toward the vowel target. The black circles with the white "x" denote the vowel target. The arrows represent the Euclidean distance between the sensors at the onset of movement and the achievement of target.

As expected, the effects of these factors on C-V lag were quite similar. The correlation between Tdist and C-V lag was slightly weaker (r = −0.28∗∗∗) than the correlation between TDx and C-V lag. Adding TDx to the model led to a slightly larger improvement over baseline than adding Tdist. We therefore proceed by using TDx as our primary index of the starting position of the tongue dorsum.

We also considered whether the speed of the vowel movement impacts C-V lag. The peak velocity of articulator movements is known to be linearly related to gesture magnitude, i.e., the displacement of the articulator in space (Munhall et al., 1985; Ostry and Munhall, 1985). For this reason, TDx, which, as shown above, is strongly correlated with Tdist, is also highly correlated with the peak velocity of the movement (r = 0.33, p < 0.001). The natural correlation between peak velocity and displacement can be normalized by taking the ratio of peak velocity to displacement, a measure sometimes referred to as kinematic stiffness (Adams et al., 1993; Shaiman et al., 1997; Perkell et al., 2002; Van Lieshout et al., 2007). This provides a kinematic measure of speed that can be assessed across variation in TDx. We evaluated the correlation between stiffness and C-V lag and found that there was no effect (r = −0.03). This indicates that gesture velocity, once gesture magnitude is factored in, has no effect on C-V lag.
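The stiffness measure described above is simply the ratio of peak velocity to displacement. A small sketch follows, with a hypothetical function name and invented values, to show why the ratio separates speed from movement size.

```python
def kinematic_stiffness(peak_velocity, displacement):
    """Ratio of peak speed to movement displacement.

    With velocity in mm/s and displacement in mm, the result is in 1/s.
    Absolute values are used so that movement direction does not matter.
    """
    if displacement == 0:
        raise ValueError("displacement must be non-zero")
    return abs(peak_velocity) / abs(displacement)


# Hypothetical movements: the larger one is faster in absolute terms,
# but both have the same stiffness once displacement is factored out.
small = kinematic_stiffness(peak_velocity=50.0, displacement=5.0)
large = kinematic_stiffness(peak_velocity=100.0, displacement=10.0)
print(small, large)
```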

Adding TDx resulted in significant improvement to the baseline model (χ<sup>2</sup> = 125.52; p < 2.2 × 10<sup>−16</sup>). Moreover, the increased complexity of the model is justified by the variance explained. The six degrees of freedom in the baseline model increased to seven degrees of freedom in the baseline + TDx model, but the AIC and BIC scores were lower in the baseline + TDx model (AIC<sub>baseline</sub> = 2607.2, AIC<sub>baseline+TDx</sub> = 2483.7; BIC<sub>baseline</sub> = 2636.3, BIC<sub>baseline+TDx</sub> = 2517.7). This indicates that the spatial position of the tongue dorsum has a significant effect on inter-gestural timing.
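The reported fit statistics can be cross-checked against one another using the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L. The sketch below assumes that all 949 tokens entered both models (the text does not state n for the models explicitly); the helper names are hypothetical.

```python
import math


def neg2_loglik_from_aic(aic, k):
    """Recover -2 * log-likelihood from AIC = 2k - 2 ln L."""
    return aic - 2 * k


def bic(neg2_loglik, k, n):
    """BIC = k ln(n) - 2 ln L."""
    return neg2_loglik + k * math.log(n)


N_OBS = 949  # assumption: every token in the corpus was used in the models

dev_base = neg2_loglik_from_aic(2607.2, k=6)  # baseline model
dev_tdx = neg2_loglik_from_aic(2483.7, k=7)   # baseline + TDx model

# Likelihood-ratio statistic for the one added parameter.
chi2 = dev_base - dev_tdx
print(round(chi2, 1))  # 125.5

print(round(bic(dev_base, 6, N_OBS), 1))  # 2636.3
print(round(bic(dev_tdx, 7, N_OBS), 1))   # 2517.7
```

Under that assumption, the likelihood-ratio statistic and both BIC values recovered from the AIC scores agree with the reported figures to rounding.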

A summary of the fixed effects for our best model, baseline + TDx, is as follows. VOWEL had only a marginal effect on C-V lag. The effect of CONSONANT was negative (β = −0.276; t = −4.722∗∗∗), indicating that syllables that begin with [m] have shorter C-V lag than those that begin with [p], the intercept category for the consonant factor. The strongest fixed factor in the model was that of TDx (β = −0.559; t = −12.245∗∗∗). The strong negative effect indicates, as shown in **Figure 7**, that C-V lag decreases with increases in TDx. Larger TDx values indicate a more anterior position of the tongue. Since the vowel targets in the stimuli were all posterior (back vowels), the negative effect of TDx can be interpreted as shorter C-V lag values in tokens with more front starting positions for the vowel. In other words, the farther the tongue dorsum is from the (back) vowel target, the earlier the movement starts (and, thus, the shorter the C-V lag).

FIGURE 6 | Kernel density plot of normalized C-V lag (A) and TDx (B). The legend shows the Pinyin for the vowels, which correspond to: "o" /uo/, "ou" /ou/, "u" /u/.

#### Exemplification of the Main Result

The general trend in the data is that C-V lag decreases with the anteriority of the tongue. To put this another way, movement toward the vowel target (relative to the consonant) is delayed when the tongue happens to be already near the target position. This pattern is exemplified with specific tokens in **Figure 8**. The top left panel shows the mean position of the sensors at the target of /uo/ for one speaker. At the target, the average backness of the TD sensor is −50.4(3.2) mm (black circles). The panel on the upper right zooms in on the position of the TB and TD sensors for two tokens. Token 168, shown as red circles, is relatively close to the vowel target for /uo/. Token 280, in contrast, is further away (green circles). The bottom two panels compare the time course of movement for each of these tokens. The panel on the left shows token 168, which starts closer to the target. In line with the general trend in the data, movement toward the target in token 168 is somewhat late relative to the lip aperture gesture. TD movement toward the target does not start until about halfway through the closing phase of the labial gesture. The TD movement in token 280, shown on the right, starts earlier in the phase of the consonant. Consequently, the lag between the consonant gesture and the vowel gesture is shorter in token 280 (right) than in token 168 (left).

#### Extension to Unrounded Vowels

The target items in this study involved labial consonants followed by rounded vowels. As described above, we selected high back vowels since they are known to resist tonal coarticulation. However, since high back vowels in Mandarin Chinese are rounded, there is a potential for interaction between gestural control of the lips by the labial consonant and gestural control by the rounded vowel. While the particular nature of this interaction for Mandarin is not known, some possibilities include gestural blending, whereby the movement of the lips results from a compromise between temporally overlapped task goals, or gesture suppression, whereby one of the overlapping gestures takes full control of the articulator. In the Task Dynamics model, these outcomes are dictated by the blending strength parameter (Saltzman and Munhall, 1989), which is hypothesized to be language specific (Iskarous et al., 2012). In some languages, the labial and dorsal components of high back rounded vowels enter into a trading relation such that the degree of rounding for, e.g., /u/, varies with the degree of tongue dorsum retraction (Perkell et al., 1993). This raises the question – to what extent is our main result related to the presence of rounding for the vowels? To address this question, we extended our analysis to unrounded vowels, /a/ and /i/, drawing on EMA data reported in Shaw et al. (2016).

The items in Shaw et al. (2016) included multiple repetitions of /pa/ and /pi/ produced with all four Mandarin tones by the same six speakers analyzed in this study. Following the procedure outlined in section "Experiment", we calculated C-V lag and TDx position for /pa/ and /pi/ syllables. A total of 470 tokens (233 /pa/ tokens; 237 /pi/ tokens) were analyzed. Both syllables show a correlation between C-V lag and TDx that is similar in strength to what we observed for high back vowels (**Figure 7**). For /pa/, the direction of the correlation was negative (r = −0.36; p < 0.001), the same direction as for the high back vowels. When the tongue dorsum is in a more front position (farther from the /a/ target), C-V lag tends to be shorter, indicating an earlier vowel movement relative to the consonant; when the tongue dorsum is in a more back position (closer to the /a/ target), C-V lag is longer. We observed the same pattern for the low back vowel, which is unrounded, as we observed for the high back vowels, which are rounded. The correlation between C-V lag and TDx is similarly strong for /pi/ syllables (r = 0.45; p < 0.001), but the correlation is positive. The positive correlation for /pi/ makes sense given the anterior location of the vowel target. In contrast to the back vowels, a relatively front tongue dorsum position puts the tongue close to the /i/ target; in this case, C-V lag tends to be long, indicating a delayed vowel gesture onset (relative to the consonant). **Figure 9** provides a scatterplot of C-V lag and TDx for /pi/ and /pa/. The positive correlation for /pi/ is essentially the same pattern as the negative correlation observed for /pa/ and for the high back vowels that served as the main target items for the study. From this we conclude that whatever the effect of vowel rounding is on the lip gestures in Mandarin, it does not seem to have any influence on the relation between TDx position at the onset of the vowel gesture and C-V lag. We observe the same pattern across rounded and unrounded vowels.

#### DISCUSSION

Analysis of C-V lag in Mandarin monosyllables confirmed patterns reported in the literature and also revealed new effects that have theoretical implications for models of speech timing control.

First, we found that C-V lag in the Mandarin syllables in our corpus, which all have lexical tone, tends to be positive. The vowel typically starts well after the consonant. This pattern, positive C-V lag, has been reported for Mandarin before (Gao, 2008, 2009) and for other lexical tone languages (Karlin and Tilsen, 2015; Hu, 2016; Karlin, 2018). C-V lag tends to be longer for languages with lexical tone than for languages that have intonational tones or pitch accents (Mücke et al., 2009; Niemann et al., 2011; Hermes et al., 2012). In terms of millisecond duration, the C-V lag in tone languages reported in the studies above is in the range of ∼50 ms while the C-V lag for languages that lack lexical tone tends to be smaller, ∼10 ms. The C-V lag in our study was substantially longer (roughly twice as long) than in other reports of lexical tone languages (**Figure 5**). This difference in absolute duration is probably due at least in part to the nature of our stimuli. Reading monosyllables in isolation from Pinyin encourages hyperarticulation, but this task served the specific purpose in our study of allowing variation in tongue position at the onset of movement while controlling for other factors that could influence C-V timing in longer speech samples. Another possible reason for the longer absolute C-V lag in our materials could be the onset consonants. Studies of tone and intonation tend to select sonorant consonants as stimuli to facilitate continuous tracking of f0 across consonants and vowels. Our stimuli included both a nasal onset consonant, /m/, and an oral onset consonant, /p/. Although this was not expected, there was a significant effect of onset consonant identity on C-V lag. C-V lag was significantly shorter in syllables beginning with the nasal stop than in syllables beginning with the oral stop. The longer C-V lag found in our materials overall is conditioned in part by our inclusion of oral plosive onsets. As to why oral plosives condition longer C-V lag (than nasals), we currently have no explanation.

We found no effect of tone on C-V lag and only a negligible effect of vowel. Syllables with all four Mandarin tones and all three back vowels showed similarly positive C-V lag. The lack of a tone effect was expected from past work on Mandarin, including Gao (2008). We avoided /i/ and /a/ vowels in our target items because past research had shown that the target tongue position for these vowels varies across tones whereas /u/ has a stable target (Shaw et al., 2016). Conceivably, the effect of tone on C-V lag would be more complicated for other vowels, because a change in tone may also condition a change in the magnitude of tongue displacement toward the vowel target. The vowel written with Pinyin "o" after labial consonants is pronounced as a diphthong /uo/ in standard Mandarin; the first target of this diphthong is the same target as for the monophthong /u/. The third vowel in the study was /ou/, which is also in the high back space. From the standpoint of feed-forward models of timing, effects of vowel quality on C-V coordination are not expected in general. This study does not offer a particularly stringent test of this assumption, since the vowel targets were similar. Rather, the materials in this study were optimized to evaluate effects of variation at the onset of the vowel.

We found a significant effect of the main factor of interest in this study. The spatial position of the tongue dorsum at the onset of vowel movement had a significant effect on C-V lag. We also showed that this main pattern generalized to /a/ and /i/ by re-analyzing data from Shaw et al. (2016). C-V lag values showed substantial token-by-token variation (**Figure 5**); however, the variation was not random. Variation in when the vowel movement starts relative to the consonant was systematically related to the spatial position of the tongue dorsum. When the tongue dorsum was further forward – farther from the vowel target – movement started earlier than when the tongue dorsum was further back – closer to the vowel target. This type of behavior is not expected from a strictly feedforward model of relative timing control, such as the coupled oscillator model of inter-gestural timing (Goldstein and Pouplier, 2014). However, the results are not inexplicable. There is a range of possible explanations. Before moving on to discuss possible theoretical explanations for the pattern, we first address a potential limitation of the study.

Our strategy of eliciting words in isolation was successful in that we obtained variation in the starting position of the tongue dorsum. The structure of this variation played an important role in revealing the main result. Since the stimuli consisted of labial consonants followed by vowels, each trial ended with the mouth in an open position (for production of the vowel) and the next trial began with a labial gesture, requiring either narrowing of the lips (/f/ in some filler trials) or closure (/m/, /p/). This design allows for the possibility that participants take up a rest posture in between trials which involves lip closure. In labeling the gestures for further analysis, we noticed that the lips typically remained open until the onset of the labial gesture; however, a small number of tokens involved lip closures that were unusually early, possibly because the lips closed before active control associated with the target stimuli. These tokens show up as outliers to the statistical distribution for the lip aperture gesture, i.e., extra long closure duration. Since our analysis did not exclude statistical outliers, we consider here the possible impact that they could have on our main result.

To assess the role of outliers resulting from early closure, we re-ran our analysis excluding outliers using each of two well-established methods: a priori trimming and outlier removal through model critique (Baayen and Milin, 2015). The mean lip aperture duration in the data was 327 ms (SD = 117); the median was 300 ms (27 ms shorter than the mean), which, consistent with our token-by-token observations from labeling, suggests a skew toward longer duration outliers. Following the a priori trimming method, we excluded tokens from analysis that were more than three standard deviations from the mean lip aperture duration value and re-fit the nested lmer models reported above. Removing outliers in this way improved the model fit, as indicated by a lower AIC: 2382 for the trimmed data set, cf. 2483 for the full data set. The effect of TDx on C-V lag was reduced slightly following a priori trimming, as indicated by the coefficient estimate for TDx: for the trimmed data set β = −0.53 (SE = 0.043), cf. for the full data set β = −0.56 (SE = 0.046). The slight change in the coefficient is reflected as well in the Pearson's correlation between C-V lag and TDx: r = −0.30 for the trimmed data set vs. r = −0.31 for the full data set. We also removed outliers via model critique. Following the method suggested in Baayen and Milin (2015), we removed outliers to our best fitting model. Residuals to model fit greater than three standard deviations were removed and the model was refit to the trimmed data set. The resulting model showed further improvement; AIC dropped to 2297. The coefficient for TDx decreased slightly, to β = −0.52 (SE = 0.043). The Pearson's correlation between C-V lag and TDx was the same as for the a priori trimming: r = −0.30. Removing outliers based on model fit does not directly reference lip aperture duration. Nevertheless, this approach produced similar results to removing outliers with unusually long lip closure duration (a priori trimming). Removing outliers based on lip closure duration had the effect of improving model performance overall with only a negligible influence on the estimate for TDx. This suggests that the occasional long labial closure in the data introduced noise (unexplained variance) in the model but did not have a substantial influence on the observed relation between spatial position (TDx) and intergestural timing (C-V lag).
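
The a priori trimming step can be illustrated with a minimal sketch. It assumes synthetic stand-in data rather than the EMA measurements, and the names `trim_by_sd`, `pearson_r`, `tdx`, `cv_lag`, and `la_dur` are hypothetical. It does not reproduce the nested lmer models; it only shows how tokens beyond three standard deviations of the mean lip aperture duration might be excluded before recomputing a correlation.

```python
import numpy as np

def trim_by_sd(values, n_sd=3.0):
    """Boolean mask: True for values within n_sd sample SDs of the mean."""
    values = np.asarray(values, dtype=float)
    mean, sd = values.mean(), values.std(ddof=1)
    return np.abs(values - mean) <= n_sd * sd

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic stand-in data (assumption: NOT the values from the study).
rng = np.random.default_rng(0)
n = 500
tdx = rng.normal(0.0, 1.0, n)                   # tongue dorsum position (arbitrary units)
cv_lag = -0.5 * tdx + rng.normal(0.0, 1.5, n)   # lag negatively related to TDx
la_dur = rng.normal(300.0, 60.0, n)             # lip aperture duration (ms)
la_dur[:5] += 600.0                             # a few unusually long closures

keep = trim_by_sd(la_dur)                       # a priori trimming on closure duration
r_full = pearson_r(tdx, cv_lag)
r_trimmed = pearson_r(tdx[keep], cv_lag[keep])
print(f"kept {keep.sum()} of {n} tokens; r_full={r_full:.2f}, r_trimmed={r_trimmed:.2f}")
```

Because the simulated outliers are unrelated to TDx, removing them changes the correlation only slightly, which mirrors the logic of the check described above.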

We focus the remainder of this discussion on two possible explanations for the main result (sections "Downstream Targets" and "Neutral Attractors") as well as some additional theoretical implications (section "Additional Theoretical Implications").

#### Downstream Targets

One possible explanation is that gesture coordination makes use of a richer set of gestural landmarks than just gesture onsets. For example, Gafos (2002) proposes a set of five articulatory landmarks which are referenced by a grammar of gestural coordination. These landmarks include the onset of movement, the achievement of target, the midpoint of the gesture plateau (or "c-center"), the release from target and the offset of controlled movement (p. 271). Variation in gesture onsets, as we observed for the vowel movements in this study, could potentially subserve later production goals, such as the coordination of the target landmark or other landmarks that occur later in the unfolding of the gesture, i.e., after the gesture onset. To illustrate this concept, **Figure 10** shows two coordination schemas. The left panel, **Figure 10A**, shows a pattern of synchronous consonant and vowel gestures. In this schema the vowel onset is aligned

FIGURE 10 | Two schematic diagrams of timing relations. Panel (A) shows the onset of the vowel timed to the onset of the consonant; panel (B) shows the target of the vowel timed to the offset of the consonant.

lag measurement. The schema represents the C-V timing pattern under which the lag measure is zero (perfect alignment). The bottom row shows the distribution of lag values. Lag measures were computed by subtracting the consonant landmark from the vowel landmark. The average lag between the Coffset and Vtarget (C) is zero; in contrast, the average lag for the schemas in (A) and (B) is positive.

to the consonant onset – the two gestures are in-phase. This can be contrasted with **Figure 10B**, which shows a later vowel target. The target of the vowel in this case is timed to the offset of the consonant gesture. The coordination schema dictates that the vowel achieves its spatial target at the offset of controlled movement for the consonant. If the coordination relation controlling C-V timing references the vowel target (and not the vowel onset), the vowel onset would be constrained only by the requirement that the target is achieved at the end of the consonant gesture. This could dictate that the timing of the vowel onset varies as a function of its distance to the vowel target. This account suggests some degree of state-feedback from articulator position to inter-gestural timing control. If the onset of the vowel gesture is timed to achieve its target at the end of the consonant gesture, speech motor control must have access to the position of the tongue, i.e., state feedback, either through proprioception or through tactile information.

To assess the downstream target hypothesis we calculated the lag between the vowel target and two other landmarks in the consonant gesture, the consonant release and consonant offset. These two landmarks were defined according to thresholds of peak velocity in the movement away from the consonant constriction, i.e., the positive velocity peak in **Figure 2**. Accordingly, they are the release-phase equivalents of the onset and target landmarks.

**Figure 11** shows the distribution of lag values for Crelease to Vtarget (**Figure 11B**) and for Coffset to Vtarget (**Figure 11C**). These are obtained by subtracting the consonant landmark from the vowel landmark, Vtarget - Coffset. For comparison, the lag values for Conset to Vonset, first presented in **Figure 5**, are repeated as **Figure 11A**. The top panels show schemas of lag measurements and the bottom panels show kernel density plots. In each plot a vertical black line is drawn at the 0 point. For Conset to Vonset (**Figure 11A**) and Crelease to Vtarget (**Figure 11B**), the lag is

positive (on average). For Coffset to Vtarget (**Figure 11C**), the probability mass is centered on zero. Although there is substantial variability around the mean, the target of the vowel occurs, on average, at the offset of the consonant. This pattern is consistent with the downstream target hypothesis. The target of the vowel is aligned to the offset of the consonant. In order to achieve the vowel target at the offset of consonant movement, movement toward the vowel target must start during the consonant gesture. How much earlier in time the vowel gesture starts is free to vary with the spatial position of the relevant articulators.
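
The sign convention for these lag measures can be made concrete with a short sketch. The landmark times below are hypothetical values for three tokens, chosen only so that the pattern resembles the one described above; the function name `lag` is likewise an assumption.

```python
import numpy as np

def lag(v_landmark_ms, c_landmark_ms):
    """Lag = vowel landmark time minus consonant landmark time (ms).

    Positive values mean the vowel landmark follows the consonant landmark.
    """
    return np.asarray(v_landmark_ms, dtype=float) - np.asarray(c_landmark_ms, dtype=float)

# Hypothetical landmark times (ms) for three tokens; not data from the study.
c_onset = [100, 120, 90]
c_release = [330, 350, 310]
c_offset = [400, 430, 380]
v_onset = [220, 200, 230]
v_target = [410, 420, 380]

cv_lag = lag(v_onset, c_onset)               # Conset to Vonset
release_lag = lag(v_target, c_release)       # Crelease to Vtarget
offset_lag = lag(v_target, c_offset)         # Coffset to Vtarget

for name, values in [("Conset-Vonset", cv_lag),
                     ("Crelease-Vtarget", release_lag),
                     ("Coffset-Vtarget", offset_lag)]:
    print(f"{name}: mean lag = {values.mean():.1f} ms")
```

In this toy example the first two measures are positive on average, while the Coffset to Vtarget lag averages zero, which is the configuration expected if the vowel target is timed to the consonant offset.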

The alignment between Coffset and Vtarget (**Figure 11C**) has a possible alternative explanation. Since the vowels of our target items are rounded, it is possible that Coffset corresponds to an articulatory landmark associated with the labial component of the vowel instead of the consonant release phase. A hint of this possibility is apparent in the lip aperture (LA) signal in **Figure 8** (left), token 168, which shows a multi-stage time function. There is an abrupt decrease in LA velocity at around 900 ms; after this change, LA widens more slowly until around 1200 ms, when the TD achieves its target. It is possible that control of LA passes smoothly from the consonant gesture to a vowel gesture in such a way that the threshold of peak velocity applied to LA picks up on the labial component of the vowel, instead of the actual Coffset, which could occur earlier, i.e., around 900 ms in token 168. We therefore pursue another set of predictions that can differentiate the alignment schemas in **Figure 10**.

To further evaluate the alignment schemas in **Figure 10**, we conducted an analysis that leverages the temporal variability in the data. Articulatory coordination, like biological systems more generally, exhibits variation, owing to a wide range of factors. In assessing the predictions of control structures, such as the coordination schema in **Figure 10B**, we therefore look to the patterns of variability that are uniquely predicted. This approach follows past work exposing coordination relations by examining how they structure temporal variability in kinematics (Shaw et al., 2009, 2011; Gafos et al., 2014; Shaw and Gafos, 2015).

To exemplify, consider **Figure 12**. The top panels repeat the schema in **Figure 10**; the bottom panels show the same schema with longer consonant gestures. As the consonant gesture increases in length from the top panels to the bottom panels, we observe different effects on C-V lag. In the left panel, where the vowel onset is timed to the consonant onset, there is no effect of consonant duration on C-V lag. In the right panel, in contrast, C-V lag increases with consonant duration. Since the vowel is timed to the offset of the consonant, a longer consonant entails longer C-V lag (assuming that gesture duration for the vowel remains constant). This prediction can also be tested in our data. Moreover, testing this prediction does not require that we disentangle the release of the labial consonant from the labial component of the vowels. If the vowel target is timed to any landmark of the consonant following the consonant target, then an increase in consonant duration predicts an increase in C-V lag.
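
The contrast between the two schemas can be sketched as a small simulation. This is an illustration under stated assumptions, not the analysis reported in the paper: the durations, noise levels, and variable names are hypothetical, vowel gesture duration is taken to be independent of consonant duration, and consonant duration is measured from onset to offset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
c_dur = rng.normal(300.0, 60.0, n)   # consonant duration, onset to offset (ms)
v_dur = rng.normal(200.0, 20.0, n)   # vowel gesture duration, onset to target (ms)
noise = rng.normal(0.0, 15.0, n)     # timing noise (ms)

# Schema A (in-phase): the vowel onset is timed to the consonant onset,
# so C-V lag (Vonset - Conset) is just timing noise.
lag_in_phase = noise

# Schema B (downstream target): the vowel target is timed to the consonant offset.
# Vtarget = Conset + c_dur + noise and Vonset = Vtarget - v_dur,
# hence C-V lag = c_dur - v_dur + noise.
lag_target_timed = c_dur - v_dur + noise

r_a = float(np.corrcoef(c_dur, lag_in_phase)[0, 1])
r_b = float(np.corrcoef(c_dur, lag_target_timed)[0, 1])
print(f"in-phase schema:      r(C duration, C-V lag) = {r_a:.2f}")
print(f"target-timed schema:  r(C duration, C-V lag) = {r_b:.2f}")
```

Only the target-timed schema yields a positive correlation between consonant duration and C-V lag, which is the diagnostic exploited in the analysis that follows.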

To evaluate this prediction, we investigated the correlation between C-V lag and the closing phase of the consonant. The closing phase of the consonant was defined as the duration from the onset of consonant movement to the achievement of target in the lip aperture signal, defined by a threshold of

FIGURE 12 | Comparison of two C-V coordination schemas under different consonant durations. The top panels show shorter consonants and the bottom panels show longer consonants. As consonant duration increases from the top panel to the bottom panel, C-V lag is increased only for the schema on the right, where the vowel target is timed to the release of the consonant.

peak velocity (see **Figure 2**). A positive correlation between C-V lag and consonant duration is predicted by the downstream target hypothesis (**Figure 12**: right) but not by the C-V in-phase hypothesis (**Figure 12**: left). If the consonant and vowel gestures are in-phase, then C-V lag should be unaffected by consonant duration. The correlation between C-V lag and consonant duration was quite high (r = 0.61, p < 0.001), which is consistent with the downstream target prediction. A scatter plot is shown in **Figure 13**.

**Figure 13** shows that temporal variation in C-V lag is structured in a manner consistent with **Figure 12**: right. Variation in consonant duration stems from numerous factors, including individual differences that may have a neuro-muscular basis (Crystal and House, 1988; Tsao and Weismer, 1997; Tsao et al., 2006). Nevertheless, this variability is useful in exposing the underlying control structure. As consonant duration varies, C-V lag also varies in a manner predicted by downstream targets, as in **Figure 10B**, but not by in-phase timing, **Figure 10A**. The significant correlation is predicted by any alignment pattern in which the vowel target is timed to a consonant landmark later than the consonant target. Despite variation in speech rate and the absolute duration of consonantal and vocalic intervals, we observe consistency in temporal covariation predicted by a specific pattern of gesture coordination. Shaw and Gafos (2015) report a similar result for English. The pattern of temporal variation found across 96 speakers followed the predictions of a common pattern of gestural coordination, even as the absolute duration of consonant and vowel intervals varied substantially.

While our discussion has focused so far on intergestural timing, i.e., the timing of the vowel gesture relative to the consonant, the target-based timing account described above also suggests something about intra-gestural control that can be tested in the data. The vowel gesture may start earlier in time when it has farther to go to reach the target and later in time when there is less distance to travel. Stated this way, the timing of the vowel onset is relative not to the consonant (i.e., inter-gestural timing) but to the distance to the vowel target, i.e., gesture amplitude. Notably, this particular relation is one that is predicted by a nonlinear dynamical system with an anharmonic potential and not by a linear dynamical system (Sorensen and Gafos, 2016: 204).
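
The difference between linear and nonlinear gesture dynamics can be illustrated with a minimal numerical sketch. It assumes a critically damped second-order system with an optional softening cubic term, x'' = −b x' − k x + d x³, integrated by forward Euler. The cubic form, the parameter values, and the function name `time_to_target` are illustrative assumptions; see Sorensen and Gafos (2016) for the actual model.

```python
def time_to_target(x0, k=1.0, b=2.0, d=0.0, dt=0.001, t_max=60.0, criterion=0.1):
    """Time for displacement x (target at 0) to shrink to criterion * |x0|.

    Integrates x'' = -b*x' - k*x + d*x**3 with forward Euler, starting at rest.
    With d = 0 the system is linear; d > 0 adds a softening cubic term
    (an illustrative anharmonic potential, assumed here for demonstration).
    """
    x, v, t = x0, 0.0, 0.0
    threshold = criterion * abs(x0)
    while abs(x) > threshold and t < t_max:
        a = -b * v - k * x + d * x ** 3
        x, v = x + dt * v, v + dt * a
        t += dt
    return t

# Two gesture amplitudes (arbitrary units): a short and a long movement.
small, large = 0.5, 1.0

lin_small, lin_large = time_to_target(small), time_to_target(large)
nl_small, nl_large = time_to_target(small, d=0.9), time_to_target(large, d=0.9)

print(f"linear:    duration(small)={lin_small:.3f}, duration(large)={lin_large:.3f}")
print(f"nonlinear: duration(small)={nl_small:.3f}, duration(large)={nl_large:.3f}")
```

In the linear case, movement duration to a proportional criterion does not depend on amplitude, whereas with the cubic term the larger movement takes longer. A system of the latter kind would therefore need to start earlier to reach a distant target at a fixed time.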

To provide a direct test of this hypothesis about intra-gestural timing, **Figure 14** plots vowel gesture amplitude, as indexed by the displacement of TDx from vowel onset to vowel target, against the duration of the opening phase of the vowel, as indexed by

trajectory, and gesture amplitude (x-axis) as measured from the degree of TD sensor displacement in the anterior-posterior dimension (i.e., TDx).

the temporal interval from vowel onset to vowel target. There is a significant positive correlation between gesture amplitude and gesture duration (r = 0.45; p < 0.001). This result helps to sharpen the interpretation of the C-V lag results as well. It appears that the vowel gesture starts earlier when it has farther to go to reach the target, an aspect of intra-gestural control consistent with a non-linear dynamical systems model of the gesture.

We were curious as well about whether the variation in vowel gesture onset has consequences for acoustic vowel duration. Since the onset of vowel gestures typically takes place sometime during the consonant closure, variation in the gesture onset is potentially masked in the acoustics by the overlapped consonant. To investigate this, we measured the interval from the acoustic onset of the vowel, as indicated by the onset of formant structure, to the articulatory vowel target (as per **Figure 2**). This acoustic interval of the vowel was not positively correlated with the magnitude of the vowel gesture (TDx). There was a slight negative correlation (r = −0.15, n.s.). This indicates that the strong correlation between gesture magnitude and gesture duration is largely masked in the acoustic vowel interval from onset of voicing to the vowel target. The distance of the tongue to the vowel target (gesture amplitude), which is significantly correlated with vowel start times and is reflected in C-V lag, does not correlate with acoustic vowel duration.

#### Neutral Attractors

A second possible explanation for the main result is that there is a neutral attractor at work. Neutral attractors have been hypothesized to take control of articulators that are not otherwise under gesture control (Saltzman and Munhall, 1989). When a gesture achieves its target, control of the model articulator falls to the neutral gesture, which will drive the articulator toward a neutral position.

The explanation of the main result – that TD position correlates with C-V lag – in terms of a neutral attractor is as follows. Consider again two tokens that differ in the position of the TD during the pre-speech period of silence (**Figure 8**). When the TD is at an extreme position, the neutral attractor drives it toward a neutral position before the vowel gesture takes control. The momentum of the articulator movement controlled by the neutral attractor carries over to gestural control by a vowel. On this account, vowels with more extreme tongue dorsum positions may appear to start earlier in time relative to the overlapped consonant because control of the TD passes smoothly from a neutral attractor to a vowel gesture. In contrast, when the TD is already in a neutral position, movement does not start until the vowel gesture is activated. On this account, the early onset of vowel gestures that begin far from targets is an epiphenomenon of neutral attractor control.

The contrast between a token with early TD movement and one with later movement is shown in **Figure 15**. The top panel shows the token with a non-extreme TD backness position. The green box shows the vowel gesture activation interval, terminating with the achievement of target. The bottom panel illustrates the neutral attractor proposal. The yellow box shows the neutral attractor which drives the TD away from an extreme front position. Since the vowel target is back, the neutral attractor

happens to be driving the TD in the same direction as the vowel gesture, which kicks in at the same time across tokens. Typical heuristics for parsing gesture onsets from EMA trajectories based on the velocity signal, including those used in this paper, would likely be unable to differentiate between movement associated with the vowel gesture proper (top panel) and movement that is associated with a sequence of neutral attractor followed by a vowel gesture.

Notably, the neutral attractor analysis does not necessarily require the type of state-feedback discussed for the "downstream target" alternative. In this sense, the neutral attractor account of our data is consistent with the two-level feedforward model of AP. However, the need for bidirectional interaction between inter-gestural and inter-articulator levels has been argued for elsewhere (Saltzman et al., 1998) and other more recent developments in the AP framework may render neutral attractors less necessary than in earlier work. For example, Nam (2007) pursues the hypothesis that the movements toward and away from constrictions are controlled by independent gestures. On this account, the "split-gesture" hypothesis, it is less clear that a neutral attractor is needed at all to return articulators to a neutral position, as this could be accomplished by the release gesture associated with consonants. Other empirical work has identified cases of anticipatory movements in speech which at times pre-empt the linguistically specified timing pattern and cannot easily be explained by a neutral attractor (Davis et al., 2015; Tilsen et al., 2016). Using real-time MRI, Tilsen et al. (2016) observed a range of idiosyncratic (across speaker) patterns of anticipatory movement during silence. They suggested that neutral attractors, if they were to account for the data, would have to be sensitive to upcoming gestures. Other relevant anticipatory movement phenomena include those reported by Whalen (1990), who found that, when reading aloud, speakers plan coarticulation based upon available information in the visual stimulus. Similarly, Davis et al. (2015) observed anticipatory articulatory movements in response to subliminal presentation of words in a masked priming task. These findings suggest that orthographic stimuli, even when brief (<50 ms) or absent until speech initiation, condition anticipatory speech movements. Phonetically sensitive neutral attractors have been suggested elsewhere in the literature (Ramanarayanan et al., 2013) but this proposal would have to be developed significantly to encompass the broader range of articulatory phenomena. Thus, while a "standard" neutral attractor, i.e., per Saltzman and Munhall (1989), may be sufficient to account for anticipatory movement in our data, alternative mechanisms (e.g., release gestures, planning gestures, or "phonetically sensitive" attractors) are theoretical developments that could potentially subsume the neutral attractor analysis.

In closing this section, we would like to highlight that the two possible theoretical explanations that we've offered for the effect of spatial position on relative timing are not mutually exclusive. The neutral attractor could explain some of the early vowel movements, even if the downstream target hypothesis is also correct. The preceding discussion of neutral attractors notwithstanding, it's possible that both mechanisms are independently necessary. The relative variability of movement onsets in contrast to movement targets has been noted in past work (Perkell and Matties, 1992) and discussed as evidence against a system of speech timing control driven by movement onsets (Turk and Shattuck-Hufnagel, 2014). While the neutral attractor may explain some of the variability found generally for gesture onsets in this and other studies, we note that the neutral attractor hypothesis does not predict the correlation between consonant (closing phase) duration and C-V lag, which was found to be quite strong. This correlation (C-closing and C-V lag) could instead be attributable to yet another factor, such as a general slowdown (scaling) of the clock related to, e.g., speech rate, or to the interaction between general slowdown and an amplitude-gesture duration tradeoff predicted by a non-linear dynamical system. However, such a factor would also predict a positive correlation between C-V lag and vowel duration, which was not shown in our data (see section "Neutral Attractors").

#### Additional Theoretical Implications

On average, C-V lag (Vonset to Conset) is positive in our data, which may be driven by the interaction between competing forces on coordination, as per the coupled oscillator model of gesture coordination (Goldstein and Pouplier, 2014). Such positive C-V lag in tone languages has been explained by the hypothesis that the onset of the tone gesture is temporally aligned with the offset of the consonant gesture (anti-phase timing) while the vowel onset is competitively coupled to both the consonant and tone gestures (Gao, 2008). However, if the downstream target hypothesis generalizes to tone, then the positive C-V lag found generally for syllables with lexical tone may also have an alternative explanation in terms of downstream targets. Tones, just as vowels, may be timed with reference to a tonal target or to other downstream landmarks, as opposed to the tone onset. Cross-linguistically, it seems necessary for tones to have different modes of syllable-internal alignment. In Dzongka, for example, tones appear to be left-aligned within the syllable, in that the high and low tones are most distinct near the onset of voicing (Lee and Kawahara, 2018). Tones in Mandarin, in contrast, are differentiated later in the syllable (Moore and Jongman, 1997; Shaw et al., 2013). In Dinka, the timing of tones within a syllable is minimally contrastive (Remijsen, 2013). These cross-linguistic patterns suggest a richer ontology of syllable-internal timing patterns than may be possible if coordination makes reference only to gesture onsets.

#### CONCLUSION

Consonant and vowel gestures in Mandarin were generally not synchronous in our data. The vowel movement typically began after the consonant, which is consistent with past work on Mandarin and other lexical tone languages (Gao, 2009; Hu, 2016; Karlin, 2018; Zhang et al., 2019). The spatial position of the tongue influenced when the vowel movement begins relative to the consonant. This is to our knowledge the first direct evidence that the spatial position of the articulators conditions the relative timing of speech movements in unperturbed speech (cf. Saltzman et al., 1998). On the face of it, this finding seems to challenge strictly feed-forward models of timing control, adding to past experimental evidence for bidirectional interaction between the inter-gestural level and the inter-articulator level of speech movement control. We discussed two possible explanations for the effect. The first proposal involves downstream targets. Movement onsets vary with spatial position to achieve coordination of later articulatory events. In this case, it would be necessary for state-based feedback to inform relative timing. Moreover, since the onset of vowel movement often occurred before phonation (during silence), the relevant state-based feedback must be somatosensory (likely proprioceptive) in nature. The "downstream targets" proposal made some additional testable predictions that are consistent with the data. As consonant duration varies, C-V lag covaries in the manner predicted by an alignment of the vowel target to some landmark in the release phase of the consonant. We also found a correlation between gesture amplitude and the duration of the opening movement of vowels, which is predicted by a non-linear dynamical model of gestures (Sorensen and Gafos, 2016). The second proposal involves neutral attractors which drive articulators toward rest position when they are not under active control of a gesture.
This is in many ways a simpler solution in that it treats the effect of spatial position on C-V timing as an epiphenomenon of natural speech preparation. While these are both possible accounts of our data, we note that they are not mutually exclusive and that future research is needed to fully evaluate the proposals. Regardless of the proper theoretical account of this finding, future empirical work investigating the relative timing of movement onsets should factor spatial position into the analysis.

#### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/supplementary material.

#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Western Sydney University Interval Review Board with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Western Sydney University Interval Review Board.

#### AUTHOR CONTRIBUTIONS

JS and W-RC designed the experiment, collected the data, and discussed each stage of the analysis. JS conducted the statistical analysis and wrote the first draft of the manuscript. W-RC made some of the figures. JS and W-RC contributed to the manuscript revision, read, and approved the submitted version.

# FUNDING

This research was funded by a MARCS Institute grant to JS and US NIH grant DC-002717 to Haskins Laboratories.

## ACKNOWLEDGMENTS

For assistance with subject recruitment, data acquisition and processing, we would like to thank Donald Derrick, Michael Proctor, Chong Han, Jia Ying, and Elita Dakhoul. We would also like to thank Doug Whalen for comments on an earlier version of this manuscript as well as the Yale Phonology group, audiences at Haskins Laboratories, Brown University, Cornell University, the University of Southern California, and LabPhon 16, where parts of this work were presented.





**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Shaw and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Role of Temporal Modulation in Sensorimotor Interaction

#### Louis Goldstein\*

Department of Linguistics, University of Southern California, Los Angeles, CA, United States

How do we align the distinct neural patterns associated with the articulation and the acoustics of the same utterance in order to guide behaviors that demand sensorimotor interaction, such as vocal learning and the use of feedback during speech production? One hypothesis is that while the representations are distinct, their patterns of change over time (temporal modulation) are systematically related. This hypothesis is pursued in the exploratory study described here, using paired articulatory and acoustic data from the X-ray microbeam corpus. The results show that modulation in both articulatory movement and in the changing acoustics has the form of a pulse-like structure related to syllable structure. The pulses are aligned with each other in time, and the modulation functions are robustly correlated. These results encourage further investigation and testing of the hypothesis.

#### Edited by:

Adamantios Gafos, University of Potsdam, Germany

#### Reviewed by:

Mathias Scharinger, University of Marburg, Germany

Thomas Schatz, University of Maryland, College Park, United States

#### \*Correspondence:

Louis Goldstein louisgol@usc.edu

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 24 May 2019 Accepted: 04 November 2019 Published: 06 December 2019

#### Citation:

Goldstein L (2019) The Role of Temporal Modulation in Sensorimotor Interaction. Front. Psychol. 10:2608. doi: 10.3389/fpsyg.2019.02608

Keywords: speech production, temporal modulation, articulation, acoustics, syllable structure, sensorimotor interaction

# INTRODUCTION

Work over the last 20 years has revealed abundant evidence for real-time sensorimotor interaction in both speech production and speech perception. In speech production (the topic of this volume), the role of auditory feedback in guiding speech production has been demonstrated in experiments showing that talkers may produce compensatory articulatory changes in response to altered auditory feedback (Houde and Jordan, 1998). In addition, talkers can align their articulatory patterning in real time to that of a partner, in the so-called "synchronous speech" task (Cummins, 2002). While less obviously real-time, talkers have also been shown to alter the temporal profile of their articulation to match that of a partner in experiments showing phonetic convergence (e.g., Lee et al., 2018). More generally, of course, vocal learning requires the ability to use auditory information to guide changes in articulatory behavior.

The existence of such sensorimotor interactions would appear to require that speakers have some common representation of speech articulation and acoustics that affords the kind of alignment that these experimental results exhibit. At first blush, it is tempting to think that evidence for this common representation might be found in the neural activation patterns in the motor cortex like those that have been found during listening to speech (Wilson et al., 2004). Indeed, the dual-stream model (Hickok and Poeppel, 2007) hypothesizes that such neural activation subserves sensorimotor control of speech production. However, recent experiments using electrocorticography have shown that the representation of speech segments in the motor areas during listening is quite distinct from its representation in the same areas during speech production. Cheung et al. (2016) compared the activation patterns while patients produced CV syllables and while they listened to recordings of themselves producing those syllables. The activation patterns of speech segments in the anterior ventral sensorimotor cortex (vSMC or "motor cortex") during listening were found to be organized by their acoustic properties, clustering segments by manner classes, as is also found in auditory areas such as the superior temporal gyrus (Mesgarani et al., 2014). However, activation patterns during speaking were found to be organized by vocal constricting organ (labial, coronal, dorsal), consistent with other recent work that has shown that electrode activity can be predicted as a function of coordinated articulatory movement creating constriction gestures of those three types (Chartier et al., 2018). Thus, the patterns of neural activation associated with acoustics and articulation of the same utterance are distinct, even in the motor areas. So, what binds them together to afford their interaction or integration?

Like most work contrasting articulatory versus acoustic representations in speech production and perception (and in phonology), the research described in the previous paragraph focuses on the paradigmatic aspects of the neural representations, e.g., how the neural representations of distinct speech segments differ in the same context. However, this focus ignores the temporal aspects of continuous acoustic and articulatory signals, which must be lawfully related as the articulatory movements actually cause the acoustic signals. Temporal aspects of the corresponding cortical representations have been the focus of recent work by Assaneo and Poeppel (2018), who found that cortical oscillations in auditory and speech-motor areas are synchronized with one another during listening, specifically to syllable repetition rates around 4.5 Hz, and have proposed this synchronization as a possible solution to the binding problem. Their model of the synchronization involves entrainment of theta-band (4–8 Hz) oscillations in the auditory cortex to the speech envelope, as has been shown in other recent work (Doelling et al., 2014), as well as the coupling of neural oscillators in the auditory and speech-motor areas. In this listening situation, rhythmic regularities of the acoustic speech envelope in the theta band play a key role in the entrainment model, and they have also been shown to contribute to the intelligibility of the speech (Ghitza and Greenberg, 2009) and to listener sensitivity in detecting gaps in artificial stimuli with speech-like rhythmic properties (Henry et al., 2014). However, turning to speech production, it is unknown whether there are periodic components in ongoing articulatory-motor activity that could play a role like that of the speech envelope in entraining cortical oscillations and contribute to synchronization of auditory and speech-motor areas.
This may be due to the difficulties in obtaining "clean" brain responses from talking participants (both in the MRI scanner and during EEG acquisition) and provides a motivation for probing the temporal modulation of speech articulation and its relation to acoustic modulation.

The temporal dimension of the articulatory and acoustic structure of speech is the focus of the work to be described here. This work hypothesizes that there should be a systematic relation between the temporal modulation of articulation (how much it is changing at any given moment) and the corresponding temporal modulation of the acoustic signal, deliberately ignoring the way in which the signals are changing.

The cognitive significance of patterns of modulation or change over time has been addressed in a variety of domains. For example, viewers can perceive humans engaging in a variety of actions when watching dynamic point-light displays (e.g., Rosenblum et al., 1996), but there may be nothing in the static displays of the dots to suggest different human body parts or their similarity structure. Sinewave approximations to human speech (Remez et al., 1981), which were loosely modeled on point-light displays, preserve information about how frequency information in the signal changes over time, but static moments of the signal may not be so readily perceived as speech.

Measures of change over time have been incorporated into automatic speech recognition systems through use of the modulation spectrum (e.g., Kingsbury et al., 1998) or by using the derivatives of acoustic measures, such as mel-frequency cepstral coefficients (mfccs), as additional feature vectors (Furui, 1986). Derivatives have also been incorporated into some approaches to acoustic-to-articulatory inversion (Ghosh et al., 2009; Mitra et al., 2012). However, the structure of the modulation patterns in articulation and acoustics and their alignment have not been systematically or quantitatively investigated, nor has the potential relation of those modulation patterns to phonological structure. A first step at such an investigation is the goal of this paper.

The investigation takes as input temporal modulation functions of articulation and acoustics derived for utterances drawn from the X-ray Microbeam Speech Production Database (Westbury et al., 1994). Of necessity, the investigation is largely exploratory, as such modulation signals have not been explicitly investigated previously. Nonetheless, the main underlying hypothesis is that the modulation functions should be systematically correlated in some fashion. In addition, consideration of what is generally known about the structure of speech leads to some expectations, or predictions (in a loose sense), about the nature of these functions and their correlation.

We know that the speech signal does not change in a continuous way but rather is temporally structured. There are intervals of time, such as during a long, stressed vowel, during which the articulation and acoustics are not changing very rapidly, and there are other intervals, such as at the time of release of an onset consonant into a vowel or at the formation of a coda consonant, when change is rapid. Sharp acoustic change is seen in discontinuities in a spectrogram that are used as acoustic segmentation criteria for durational measurement. At the level of articulatory kinematics, several gestures are proceeding in close temporal sequence at release of an onset consonant, for example: release of the consonant constriction gesture, production of the vowel gesture, adduction of the vocal folds if the consonant is voiceless, lowering of the velum if the consonant is nasal (see Tilsen and Goldstein, 2012). This leads to two predictions: (1) Modulation functions of both articulation and acoustics should exhibit a pulse-like structure, alternating between periods of rapid change (change "pulses") and periods of little change. (2) The period of repetition of the pulses should be related to the syllable repetition rate, with one to three pulses per syllable depending on its complexity: one pulse in a simple CV syllable, somewhere between the onset consonant's release and the vowel, and additional pulses if the syllable has one or more coda consonants. Considering next the relation between the articulatory pulses and the acoustic ones, further predictions can be made: (3) Since articulatory change generally gives rise to acoustic change, there should be robust correlations between the articulatory and acoustic modulation functions, which have not been systematically evaluated in the past. One possible source of the correlations is that over the course of running speech, prosodic structure influences the velocity of articulator movements, such that velocities are slower near boundaries (Edwards et al., 1991). This slowing should be observable in the magnitudes of the modulation functions, both articulatory and acoustic. If this were the only source of correlation, it would suggest that spans of speech long enough to include prosodic phrase boundaries would be required in order for the system to solve the binding problem, which might not be realistic. It is important, therefore, to investigate the correlations in temporal windows of different length. (4) Finally, the temporal locations of articulatory and acoustic modulation maxima (pulses) should be systematically aligned. To the extent that speech has a rhythmic structure (Tilsen and Arvaniti, 2013; Lancia et al., 2019), the pulses observed in both modulation functions should have a repetitive structure, and that repetitive structure should be shared across the two functions.

# MATERIALS AND METHODS

#### Data

The study described here is a secondary analysis of publicly available, already published data from the X-ray Microbeam Speech Production Database (Westbury et al., 1994). For the analysis here, one sentence from the database was selected from one of the read paragraph tasks that the participants performed (the 'Hunter' paragraph): Once he thought he saw a bird, but it was just a large leaf that had failed to drop to the ground during the winter. Of the participants who recorded this sentence, 23 were selected (15 female and 8 male) who read the sentence with no audible hesitations and with only a single pause (after "bird"). The speakers were all students at the University of Wisconsin in the early 1990s. Their Dialect Base (described in Westbury et al., 1994, as "place of residence during linguistically formative years") included 13 from Wisconsin, 3 from Illinois, 2 from Minnesota, and one each from Indiana, Colorado, California, Massachusetts, and New Jersey. The data analyzed include markers attached midsagittally to the upper lip (UL), lower lip (LL), lower incisor (LI), four tongue markers (tip to dorsum: T1, T2, T3, T4), and simultaneous audio.

Pause durations following the word "bird" were measured manually from a wide-band spectrogram, from the release of the final /d/ in "bird" to the release of the initial /b/ in "but." The average syllable duration for each speaker's production was estimated by taking the duration of the entire sentence for a given speaker, subtracting the pause duration (following "bird"), and dividing the result by the number of syllables (n = 27).
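The syllable-duration estimate just described can be written as a short sketch. The function name and the example durations below are hypothetical and chosen only for illustration; they are not values from the corpus.

```python
def mean_syllable_duration(sentence_dur_s, pause_dur_s, n_syllables=27):
    """Average syllable duration: (sentence duration - pause) / syllable count."""
    if n_syllables <= 0:
        raise ValueError("n_syllables must be positive")
    speech_dur = sentence_dur_s - pause_dur_s
    if speech_dur <= 0:
        raise ValueError("pause must be shorter than the sentence")
    return speech_dur / n_syllables


# Hypothetical durations (seconds), for illustration only.
dur = mean_syllable_duration(sentence_dur_s=6.0, pause_dur_s=0.6)
rate_hz = 1.0 / dur  # syllable frequency is the reciprocal of the mean duration
print(round(dur, 3), round(rate_hz, 2))
```

The reciprocal of the mean duration gives a syllable frequency in Hz, which is the form in which syllable rate can be compared with pulse rates.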

#### Articulatory Modulation Functions

Articulatory change, or modulation, was defined for a given frame as the sum of the squared velocities across the 14 dimensions defined by the 7 markers × 2 dimensions (x,y), as in (1):

$$\mathrm{MBEAM}(k) = \sum_{i=1}^{7} \sum_{j=1}^{2} \left(m(i, j, k+1) - m(i, j, k)\right)^2 \tag{1}$$

where m(i,j,k) is the position of marker i (1–7: UL, LL, T1, T2, T3, T4, LI) in dimension j (1–2: x, y) at frame k. Ignoring the mass of the articulators (i.e., treating all masses as 1), MBEAM is also twice the kinetic energy of the set of articulators (KE = 0.5 mv<sup>2</sup>).
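Equation (1) can be illustrated with a minimal sketch. It assumes that each frame is stored as a flat list of channel values (here a toy two-channel signal rather than the 14 microbeam dimensions). The function name and data layout are assumptions made for illustration and are not taken from the original Matlab analysis.

```python
def modulation_function(frames):
    """Sum of squared frame-to-frame differences across all channels.

    frames: list of frames, each a flat list of channel values
            (e.g. 14 values = 7 markers x 2 dimensions).
    Returns len(frames) - 1 values, one per frame transition.
    """
    if len(frames) < 2:
        return []
    width = len(frames[0])
    if any(len(f) != width for f in frames):
        raise ValueError("all frames must have the same number of channels")
    return [
        sum((nxt - cur) ** 2 for cur, nxt in zip(frames[k], frames[k + 1]))
        for k in range(len(frames) - 1)
    ]


# Toy data: 4 frames of 2 channels (not real marker positions).
toy = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [1.0, 2.0]]
print(modulation_function(toy))  # [1.0, 4.0, 0.0]
```

Because the function only sums squared differences over channels, it applies equally to any per-frame feature vector of fixed length.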

A version of the microbeam corpus in Matlab format was employed. In this format, the data of all markers was interpolated to a fixed sampling rate of 145.6 Hz, so the duration of each frame was 6.866 ms. Because of the differencing involved in computing the MBEAM function, it is effectively high-pass filtered and can be noisy. The resulting MBEAM functions were therefore smoothed. To determine the appropriate frequency cutoff for the smoothing filter, the frequency content in the microbeam marker signals themselves was considered. Since the tongue tip marker was acquired at the shortest original (nominal) sampling period during the data acquisition (before the acquired data were interpolated by a smoothing spline to make samples all equal in duration; Westbury et al., 1994), the magnitude spectrum of the vertical movement of the marker closest to the tip of the tongue (T1) for the test sentence produced by each of the speakers was examined. The results of a typical speaker are shown in **Figure 1**. For all speakers, the amplitude of the spectrum at 10 Hz is down by 60 dB from its peak value, and changes little at higher frequencies. A cutoff frequency of 12 Hz was chosen, and the MBEAM functions were filtered in Matlab (Mathworks, Inc.) using a zero-phase, low-pass, nine-point Butterworth filter with a 12 Hz cutoff. In order to test whether the resulting filtering overly determines the correlation results, another version of the MBEAM functions was created using a 25 Hz cutoff filter, and analyses were replicated using these functions.

The temporal structure of each MBEAM function was characterized by finding the times of the successive maxima of the function (using the zero-crossings of its derivative). These maxima will be referred to as the modulation pulses. The mean inter-pulse interval and its standard deviation were calculated. An alternative would be to define pulses as the maxima of the derivative of the modulation function, i.e., acceleration maxima where the velocity is changing most rapidly and which can be thought of as points of maximum force, but this was not explored in this work. In order to test the predictions about the relation between pulses and syllable structure, the segment and word transcriptions of the 23 sentences were aligned to the audio signals using the Penn Forced Aligner (Yuan and Liberman, 2008). The segmentations were checked by hand and corrected where necessary. Almost all errors involved the low intensity fricative in "failed," which was often mistakenly characterized as a short pause. Since all but two of the words ("during" and "winter") were monosyllabic, the word-level segmentation also served as a syllable segmentation. "During" was divided into syllables between the [r] and the [I], and "winter" between the [n] and the [t]. For each syllable of the transcription, the number of pulses falling in the temporal window of that syllable was automatically tallied. For each speaker, the mean number of pulses falling on open syllables (no coda consonant), syllables with single coda consonants, and syllables with more than one coda consonant was calculated.
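The pulse-finding and tallying steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: pulses are taken to be strict local maxima of an already smoothed modulation function, syllables are given as hypothetical (start, end) intervals in seconds, and all function names are invented for the example.

```python
FRAME_DUR_S = 0.006866  # frame duration reported for the interpolated data


def find_pulses(mod, frame_dur_s=FRAME_DUR_S):
    """Times (s) of strict local maxima, i.e. where the first difference
    changes sign from positive to negative. Flat-topped peaks are not marked."""
    return [
        k * frame_dur_s
        for k in range(1, len(mod) - 1)
        if mod[k] > mod[k - 1] and mod[k] > mod[k + 1]
    ]


def mean_inter_pulse_interval(pulse_times):
    """Mean gap between successive pulses; NaN if fewer than two pulses."""
    gaps = [b - a for a, b in zip(pulse_times, pulse_times[1:])]
    return sum(gaps) / len(gaps) if gaps else float("nan")


def pulses_per_syllable(pulse_times, syllables):
    """Count pulses falling inside each (start_s, end_s) syllable window."""
    return [
        sum(1 for t in pulse_times if start <= t < end)
        for start, end in syllables
    ]


# Toy modulation function with maxima at frames 1, 4 and 7.
mod = [0.0, 1.0, 0.0, 2.0, 5.0, 2.0, 0.0, 3.0, 1.0]
pulses = find_pulses(mod)
# Hypothetical syllable windows (seconds), for illustration only.
print(pulses_per_syllable(pulses, [(0.0, 0.02), (0.02, 0.06)]))  # [1, 2]
```

Averaging the per-syllable counts within each syllable type then gives one value per speaker and type.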

#### Acoustic Modulation Functions

The signal representation chosen as the basis of the acoustic modulation functions is a set of mel-frequency cepstral coefficients (mfcc). In addition to the fact that this representation has been widely used in speech technology applications (Rabiner et al., 1993), it encodes the resonance structure of the vocal tract, but not voiced source fundamental frequency, which of course is also not captured by microbeam markers on the surfaces of the vocal tract. Mfccs have been used in work that has successfully estimated articulator point marker time functions from acoustics using deep neural nets (Chartier et al., 2018) and other techniques (Mitra et al., 2011; Afshan and Ghosh, 2015). Mfcc parameters were calculated for the audio signals paired with the microbeam data using Matlab code developed by Kamil Wojcicki and available on the Mathworks File Exchange<sup>1</sup>. The window size for the analysis was 25 ms, and the time between frames was chosen to be equal to the frame duration of the MBEAM functions, i.e., 6.866 ms. The audio signal was preemphasized using a highpass filter (coefficients [1, −0.97]) and analyzed using 20 filterbank channels over the frequency range 0–3,700 Hz, as changes in this frequency range can be expected to be well-determined by changes in the anterior articulator positions that do not produce the narrow constrictions associated for example with fricatives. The spatial representation of such narrow constrictions is expected to be poorly related to fricative acoustics due to the potential mechanical interaction of the marker with the palate. Thirteen mfcc parameters were extracted, similar to the dimensionality of the microbeam data.

As the bandwidth of the unsmoothed mfcc parameters may be considerably higher than that of the microbeam markers, each coefficient was filtered using the same (12 Hz) smoothing filter before the modulation function was calculated. The MFCC modulation function was calculated as in (2), in the same manner as the MBEAM modulation function:

$$\mathrm{MFCC}(k) = \sum_{i=1}^{13} \left(f(i, k+1) - f(i, k)\right)^2 \tag{2}$$

where f(i,k) represents the ith mfcc at frame k. Because the differencing acts as a high-pass filter, the resulting MFCC function was also smoothed using a zero-phase, low-pass, nine-point Butterworth filter with a 12 Hz cutoff. As with the MBEAM functions, another version was created using a 25 Hz filter. The mean inter-pulse interval and its standard deviation were calculated in the same way as for the MBEAM function, and the mean number of pulses per syllable type for each speaker was calculated in the same way as for the MBEAM pulses.

#### Correlation Methods

In order to test the predictions that (a) there are robust correlations between articulatory and acoustic modulation functions and that (b) there is a repetitive temporal structure shared between articulatory and acoustic modulation functions, Correlation Map Analysis (CMA) was employed (Barbosa et al., 2012; Gordon Danner et al., 2018). CMA calculates a correlation time function between two signals using a sliding window centered on each sample of the signals. The window is actually a kernel: every sample in the signals contributes to the correlation, but the contribution of samples to the correlation decreases as a function of lag from the center of the window, as determined by a weighting function. Equation (3) shows the expression for calculating a covariance function between two signals x and y, at every sample (k).

$$S_{xy}(k) = \sum_{l=-\infty}^{\infty} c\, e^{-\eta |l|}\, x(k-l)\, y(k-l) \tag{3}$$

l is the sample lag from the center of the window, and η (eta) is the parameter that determines the sharpness of the window. Greater values of η result in narrower time windows. c is a constant chosen so that the sum of the weights over all samples is 1. Correlation (ρ) at each sample is then calculated as in (4).

$$\rho(k) = \frac{S_{xy}(k)}{\sqrt{S_{xx}(k)\, S_{yy}(k)}} \tag{4}$$
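Equations (3) and (4) can be sketched directly, under a few assumptions that are not stated in the text: the infinite sum is truncated to the samples actually available, the normalizing constant c is omitted because it cancels in the ratio in (4), and any mean-centering of the signals is left to the caller. The function name is hypothetical, and this is an illustration of the equations as written rather than the CMA implementation of Barbosa et al. (2012).

```python
import math


def cma_correlation(x, y, eta):
    """Instantaneous correlation rho(k) from exponentially weighted
    covariances, following Equations (3) and (4) as written.

    Assumptions: sums run over available samples only; the constant c
    is dropped because it cancels between numerator and denominator.
    """
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    n = len(x)
    rho = []
    for k in range(n):
        sxy = sxx = syy = 0.0
        for m in range(n):  # m = k - l, the sample being weighted
            w = math.exp(-eta * abs(k - m))
            sxy += w * x[m] * y[m]
            sxx += w * x[m] * x[m]
            syy += w * y[m] * y[m]
        denom = math.sqrt(sxx * syy)
        rho.append(sxy / denom if denom > 0 else float("nan"))
    return rho


# A signal correlated with itself yields rho(k) = 1 at every sample
# (up to floating-point rounding), whatever the value of eta.
sig = [1.0, -1.0, 2.0, -2.0, 0.5]
print([round(r, 6) for r in cma_correlation(sig, sig, eta=0.8)])
```

This direct double loop is quadratic in signal length; it is written for clarity rather than efficiency.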

The choice of η determines an effective frequency cutoff of the resulting correlation time function, for which Barbosa et al. (2012) provide an approximation function. Three values of η were chosen: a narrow window (η = 0.8) that produces a frequency cutoff of 12.4 Hz (roughly equal to the cutoff frequency of the modulation functions themselves), a wide window (η = 0.08) that has a much lower frequency cutoff (1.24 Hz), and an intermediate value (η = 0.2) with a frequency cutoff of 3.1 Hz. For each value of η, the median of the correlation values across all the samples in the correlation function for a given speaker was calculated.

<sup>1</sup>https://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfccmatlabh

In order to provide a baseline with respect to which the observed correlation values can be evaluated, surrogate signal pairs were created, in which there is no systematic causal relation between the values of the two signals. To create a surrogate pair, the k samples of each MFCC modulation function were divided into two halves (first and second k/2 samples), the order of the two halves was then reversed, and the resulting signal was paired with the unchanged MBEAM function. As a result, the first half of the MBEAM function was paired with the second half of the MFCC function, and the second half of the MBEAM function was paired with the first half of the MFCC function. (Note that the same result would have been achieved by reversing halves of the MBEAM function.) Any remaining correlation between the surrogate signals reflects general properties of signals of this type (as calculated with this method), not a causal relation between the two signals. The surrogate signal pairs were analyzed using the same conditions of filtering and η as used with the original signals. For each value of η, the median of the correlation values across all the samples in the correlation function for the original signal pairs was compared with the median values obtained with the surrogate pairs.
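The surrogate construction amounts to swapping the two halves of one signal while leaving the other untouched. A minimal sketch follows; the function name is hypothetical, and the handling of odd-length signals (the extra sample goes to the second half) is an assumption, since the text does not specify it.

```python
def swap_halves(signal):
    """Return a copy of the signal with its first and second halves exchanged.

    For an odd number of samples the second half is one sample longer
    (an assumption; the source does not say how odd lengths were handled).
    """
    signal = list(signal)
    half = len(signal) // 2
    return signal[half:] + signal[:half]


# Toy modulation functions (not real data).
mbeam = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
mfcc = [1, 2, 3, 4, 5, 6]
surrogate_pair = (mbeam, swap_halves(mfcc))
print(surrogate_pair[1])  # [4, 5, 6, 1, 2, 3]
```

The surrogate keeps the local temporal structure of each signal intact while misaligning the two signals in time.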

Correlation map analysis also calculates the correlation functions between signals as they are shifted in time with respect to each other. Critically, this allows us to evaluate the hypothesis that there is a repetitive temporal structure to the modulation pulses shared between the articulatory and acoustic functions. One way of characterizing the repetitive (or periodic) structure of a single signal is to examine the autocorrelation function of the signal, which represents the signal correlated with itself at different lags. To the extent that the signal has a periodic structure, there will be a clear peak in the autocorrelation function at a non-zero lag corresponding to the fundamental period of repetition. The autocorrelation functions of the MBEAM and MFCC functions were calculated individually using CMA to compare the signals with themselves at different lags, and the period of the repetition associated with each was determined by finding the lag associated with the maximum median correlation of the correlation function (other than zero-lag, which in the case of correlating a signal with itself always yields a correlation equal to 1). To evaluate the shared repetitive structure of the MBEAM and MFCC functions, the median correlations of MBEAM and MFCC at lags from −200 to +200 ms were compared to find the lags at which the correlation is maximal. The zero-lag is predicted to be maximal, because at this lag, the acoustic change at a given frame is aligned in time with the articulatory change that caused it. The changing shape of the vocal tract causes an immediate change in its acoustic source and filter properties. If there is any delay at all, it is much shorter than the 6.866 ms frame duration. If the function relating lag to correlation has the form of an autocorrelation function, it will also be possible to find robust secondary maxima in the function.
At the lag corresponding to a secondary maximum, the articulatory change is not aligned in time with the acoustic change that it caused, but the repetitive structure of the signals is such that articulatory modulation pulses (maxima) are still aligned with acoustic modulation pulses, and frames with little articulatory modulation are aligned with frames of little acoustic modulation. This is then the period of shared repetitive structure for the pair of signals. These values will be compared against the single-signal autocorrelation functions.
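The logic of scanning lags for a primary and a secondary maximum can be illustrated with a simplified sketch. Note the substitution: plain Pearson correlation over the overlapping samples is used here in place of the median of the CMA correlation function, so this is an illustration of the lag analysis and not the method used in the study. The function names are hypothetical. With the 6.866 ms frame duration, a range of ±200 ms corresponds to roughly ±29 frames.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else float("nan")


def lag_profile(x, y, max_lag):
    """Correlation of x(k) with y(k + lag) for lag = -max_lag..max_lag frames."""
    profile = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[: len(y) + lag]
        profile[lag] = pearson(a, b)
    return profile


def secondary_peak(profile):
    """Non-zero lag with the highest correlation among local maxima."""
    lags = sorted(profile)
    peaks = [
        lag
        for i, lag in enumerate(lags[1:-1], 1)
        if lag != 0
        and profile[lag] > profile[lags[i - 1]]
        and profile[lag] > profile[lags[i + 1]]
    ]
    return max(peaks, key=lambda lag: profile[lag]) if peaks else None


# Toy pulse train with one pulse every 10 frames (not real data).
train = [1.0 if k % 10 == 0 else 0.0 for k in range(100)]
prof = lag_profile(train, train, max_lag=15)
print(abs(secondary_peak(prof)))  # 10, the repetition period in frames
```

Multiplying the lag in frames by the frame duration converts the recovered period to milliseconds.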

# RESULTS

# Characterization of Modulation Functions

**Figure 2** shows an example of the MBEAM and MFCC modulation functions (obtained with the 12 Hz filtering) along with the correlation function resulting from the CMA analysis (for η = 0.8) in the bottom panel. The first clause of the test sentence is shown (both waveform and spectrogram) for one of the 23 speakers. The pulse structure of the MBEAM function is obvious from the figure. As expected, the pulse peaks (times of maximum articulatory change; shown with vertical magenta lines) align reasonably well with points of rapid or discrete change in the spectrogram. Two peaks are found during the syllable corresponding to "once," one peak for "he," one for "thought," etc. The MFCC modulation function exhibits a similar structure, although it has more peaks than the MBEAM function. This is reasonable, as there is more information in the MFCCs than in the MBEAM and it is more fine-grained temporally: source changes and nasalization are not represented in the MBEAM data, and it is derived from measurements of the anterior tract only. But for every MBEAM peak there is an MFCC peak close in time to it. Typically, the MBEAM peak lags the MFCC peak (except in "bird"). Presumably this is due to the fact that the MFCC frames are based on 25 ms windows and so "look ahead" of the corresponding MBEAM frame. Overall, the correlations shown in the bottom panel are quite high, with the clear majority of points showing positive correlations.

FIGURE 2 | Sample of modulation functions and their correlation function for the first clause of the test sentence, for one of the speakers. Panels represent (from top to bottom): audio waveform, sound spectrogram, MBEAM modulation function filtered at 12 Hz, MFCC function filtered at 12 Hz, and the correlation function from Correlation Map Analysis for the narrow window condition, η = 0.8. Green vertical lines represent acoustic segmentation into syllables. Purple vertical lines mark the peaks of the MBEAM function.

Box plots showing the mean number of MBEAM pulses occurring during open syllables, syllables closed by a single consonant, and syllables closed by more than one consonant are shown in the top panel of **Figure 3** (again for the 12 Hz filtering condition). Each speaker contributes one mean per box plot. As predicted, the mean of the open syllables is close to 1 (0.97), while the mean of syllables closed by a single consonant is 1.51, possibly suggesting that half the syllables have two pulses while the other half have only one. The difference between these two syllable types is highly significant (sign test p < 0.001), as 22 of the 23 speakers have more pulses in the case of the coda condition. (Here and in all the sign tests performed, p-values obtained that are less than 0.001 are reported as p < 0.001.) Finally, the mean of syllables with more than one coda consonant is almost twice the mean with a single coda (2.4). The difference between one and two coda consonants is likewise highly significant (p < 0.001), as all speakers have more pulses with multiple codas. The pattern of results for MFCC pulses is very similar but with a few more pulses overall, as shown in the bottom panel of **Figure 3**. The means for the three conditions are 1.15, 1.69, and 2.91, and the differences are highly significant in a sign test.

FIGURE 3 | (Top) Box plots of the mean number of MBEAM peaks per syllable as a function of syllable type: "open" (no coda consonant), "final C" (single coda consonant), "final CC(C)" (two or more coda consonants). Each speaker contributes a single value to each box plot, which is the mean number of peaks found in syllables of that type for the speaker. (Bottom) Same plots as top panel, but for MFCC peaks.
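The sign test reported above for open syllables versus syllables with a single coda can be checked with an exact binomial calculation. The sketch below assumes a two-sided test and assumes that the one remaining speaker of the 23 patterned in the opposite direction (rather than being a tie); neither detail is stated in the text, and the function name is hypothetical.

```python
from math import comb


def sign_test_p(n_pos, n_neg):
    """Two-sided exact sign test p-value; ties are assumed already dropped."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # Probability of a split at least this extreme under a fair coin,
    # doubled for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p = sign_test_p(22, 1)  # 22 of 23 speakers in the predicted direction
print(p < 0.001)  # True
```

With 23 speakers, a 22-to-1 split gives p = 48 / 2^23, far below the 0.001 reporting threshold.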

**Figure 4** shows box plots for the mean frequency of the 23 speakers' inter-pulse intervals for the MBEAM and MFCC functions (calculated from the mean inter-peak durations) for both 12 and 25 Hz smoothing conditions. Also shown are the syllable frequencies, calculated from the mean syllable durations. Examining the 12 Hz results, the MFCC frequencies are higher than the MBEAM frequencies (not surprisingly, since there are more MFCC pulses than MBEAM pulses), and the difference is highly significant (p < 0.001) in a sign test across the 23 speakers (all but two show higher MFCC frequencies). It is also clear that both of those frequencies are higher than syllable frequency. The median syllable frequency is 4.9 Hz and the median MBEAM frequency is 7.5 Hz. Their ratio is 1.5, which is consistent with the results in **Figure 3**, showing about one pulse per open syllable, but more pulses for syllables with coda consonants. Considering the results for the 25 Hz smoothing, the MBEAM frequencies are basically unchanged from the 12 Hz condition (median frequency for the 25 Hz condition is 7.7 Hz); the difference is not significant by sign test. Thus, the 7.5 Hz inter-peak frequency value for MBEAM modulation function data appears to characterize the temporal modulation in these (relatively slowly changing) articulatory signals quite well. The MFCC inter-peak modulation frequency is obviously much higher in the 25 Hz condition than in the 12 Hz condition (16 vs. 8.5 Hz). The 12 Hz filtering has removed higher modulation frequencies that are contained in the faster-changing acoustic signals and smoothed the MFCC function to make it more comparable to the MBEAM function.

# Correlation Analysis


#### Surrogate Analysis and Window Width

For the 12 Hz filtering condition, the global ('overall') correlation of the MBEAM and MFCC functions is positive and significant for every speaker (p < 0.001). The box plot of the 23 correlation values is shown in the left plot in **Figure 5**. For the surrogate data plotted on the right, only 12 speakers show significant correlations (significance varying from p < 0.05 to p < 0.001) and of those 8 are negative and 4 are positive. Because so many of the surrogate pairs are negatively correlated, comparison of the original and surrogate data is most conservatively done with the magnitudes of original and surrogate correlations, i.e., taking the absolute values of the surrogate correlations. Box plots of the resulting values are shown in the leftmost pair of columns of **Figure 6** (top panel); original on the left, surrogate on the right. A sign test confirms that the magnitude of the correlations is higher for the original than for the surrogate data (p < 0.001). All speakers but one (S34) have higher magnitude correlations in the original data. S34's surrogate correlation is in fact negative.
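The comparison of original and surrogate magnitudes can be sketched as a paired sign test. The text does not say which implementation was used, so the exact two-sided binomial version below is an assumption, and the function name `sign_test` and the toy correlation values are hypothetical.

```python
from math import comb


def sign_test(original, surrogate):
    """Two-sided exact sign test on paired values; ties are dropped.

    Returns (n_positive, n_used, p_value), where n_positive counts the pairs
    in which the original value exceeds the surrogate value.
    """
    diffs = [o - s for o, s in zip(original, surrogate) if o != s]
    n = len(diffs)
    if n == 0:
        raise ValueError("no non-tied pairs")
    k = sum(1 for d in diffs if d > 0)
    # Probability of a split at least this lopsided under a fair-coin null.
    tail = sum(comb(n, i) for i in range(max(k, n - k), n + 1)) / 2 ** n
    return k, n, min(1.0, 2.0 * tail)


# Toy data for 23 speakers: surrogate correlations may be negative, so their
# absolute values are compared with the (positive) original correlations.
original_r = [0.5] * 23
surrogate_r = [-0.1] * 22 + [-0.6]
k, n, p = sign_test(original_r, [abs(r) for r in surrogate_r])
print(k, n, p < 0.001)
```

With 22 of 23 speakers favoring the original data, as in the pattern described above, the exact test gives a p-value well below 0.001.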

The remaining box plots in **Figure 6** (top panel) plot the results of the CMA analysis for the three values of η. For each value, the results of the original data are plotted on the left, and the surrogate data on the right. For each value of η, the median value of the correlation function from the CMA analysis for each subject was calculated for each signal lag, and the maximum positive correlation value and the maximum negative correlation of that median across the lags were determined. The lag with the higher magnitude was taken to represent the correlation for that speaker, and is plotted in the box plots. For every value of η, a sign test confirms that the magnitude of the original data correlation is higher than that for the surrogate data (p < 0.001). There are two other ways in which the original data correlations exhibit a strikingly different pattern of results from the surrogate data. First, for the original data, for every value of η and for every speaker (except for speaker S30 for η = 0.8), the maximum positive correlation was higher in magnitude than the maximum negative correlation. However, for the surrogate data, a sign test revealed that there was no tendency for the highest magnitude correlation to be either positive or negative. Second, the lags that show the maximum positive correlations for the original data are tightly clustered around 6.866 ms (or a one frame delay of the MBEAM signal)<sup>2</sup>, with very small standard deviations, as shown in **Table 1**. The lags at which the maximum correlations (positive or negative) occur for the surrogate data are much more variable; the standard deviations of these lags are an order of magnitude higher than for the original data. Thus, the original data show robust, positive correlations between MBEAM and MFCC functions when the signals are temporally aligned with close to zero lag. The correlations exhibited by the surrogate data are weaker and are variable both in sign and in the lag at which the highest magnitudes are found.

FIGURE 6 | (Top) Box plots comparing correlations between the original MBEAM and MFCC modulation functions filtered at 12 Hz and the correlations of the corresponding surrogate data functions, for four different correlation types: the overall correlation and the median values of the CMA correlation function for three different values of η. For each of the correlation types, the original data is plotted on the left and the surrogate data on the right. In all cases, the absolute values of the correlations are plotted. (Bottom) Same plots as top, for modulation functions filtered at 25 Hz.

<sup>2</sup>We might expect that zero lag would result in the highest correlations, but because of the temporal advance of MFCC function due to the size of its analysis window as discussed earlier, this is not always the case. For example, the median of the lags that show the highest correlation for the η = 0.8 condition is 6.866 ms: the MBEAM function is delayed by one frame.

TABLE 1 | Medians and standard deviations (across speakers) of the lag (in ms) at which the highest positive and negative correlations are found between MBEAM and MFCC functions.


Results shown separately for original and surrogate data as a function of eta. 12 Hz smoothing condition.
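The selection of a single representative correlation and lag per speaker, as described for the CMA results, can be sketched as follows. This is a minimal illustration under stated assumptions: the input format (a mapping from lag in frames to the correlation values observed at that lag), the function name `representative_correlation`, and the toy numbers are all hypothetical; only the 6.866 ms frame duration comes from the text.

```python
from statistics import median

FRAME_MS = 6.866  # frame duration quoted in the text


def representative_correlation(corr_by_lag):
    """Pick the lag whose median correlation has the largest magnitude.

    corr_by_lag maps a lag (in frames) to the correlation values observed
    at that lag. Returns (lag_in_ms, median_correlation).
    """
    medians = {lag: median(values) for lag, values in corr_by_lag.items()}
    lag_pos = max(medians, key=medians.get)  # maximum positive correlation
    lag_neg = min(medians, key=medians.get)  # maximum negative correlation
    if abs(medians[lag_pos]) >= abs(medians[lag_neg]):
        best = lag_pos
    else:
        best = lag_neg
    return best * FRAME_MS, medians[best]


# Toy correlation functions at four lags (in frames).
toy = {
    -1: [0.1, 0.2, 0.3],
    0: [0.5, 0.6, 0.7],
    1: [0.6, 0.7, 0.8],
    2: [-0.4, -0.3, -0.2],
}
lag_ms, r = representative_correlation(toy)
print(lag_ms, r)
```

In this toy case the positive maximum wins and falls at a one-frame delay, which mirrors the pattern reported for the original data.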

As can also be seen in **Figure 6**, the results show that the correlation is higher in narrower time windows than in wider ones. The correlation values for the original data show a regular progression as a function of window size: η = 0.8 > η = 0.2 > η = 0.08 > overall. The difference between each of the adjacent steps in the progression was tested in three sign tests, and each is significant (at least p < 0.005). However, the same trend is found with the surrogate data, and the difference between the overall correlation and η = 0.08 is significant in a sign test, as is the difference between η = 0.8 and η = 0.2. Thus, the differences between narrow and wide windows may be due to some aspect of the method, rather than being informative about the locus of the correlation between the functions. However, the results clearly demonstrate that a wide (i.e., temporally long) window is not necessary to obtain meaningful correlations.

The results for the 25 Hz filtering condition are shown in the bottom panel of **Figure 6**. The correlations are lower than those in the top panel, as expected given the increased number of MFCC pulses in this condition. Nonetheless, the overall pattern of results is the same as for the 12 Hz filtering condition. A sign test confirms that the magnitude of the correlations is higher for the original than the surrogate data for the overall correlation and for all values of η (p < 0.001, except for η = 0.08, p = 0.011). As was the case for the 12 Hz condition, all speakers showed positive overall correlations for the original data, but there was no cross-speaker tendency for the sign of the correlation in the surrogate data. In the CMA analyses, for the original data, for every value of η the maximum positive correlation was higher in magnitude than the maximum negative correlation for a significant number of subjects (p < 0.005). However, for the surrogate data, a sign test revealed no tendency for the highest magnitude correlation to be either positive or negative. Likewise, as is shown in **Table 2**, the lags that exhibit the maximum positive correlations for the original data are clustered around 6.866 ms, with relatively small standard deviations; while lags at which the maximum correlations (positive or negative) occur for the surrogate data are much more variable.

For the original data, the pattern of correlations across the width of analysis windows is the same as in the 12 Hz condition (η = 0.8 > η = 0.2 > η = 0.08 > global), with pairwise differences that are highly significant (p < 0.001) in a sign test. For the surrogate data, however, there are no significant differences between values of η, though all of the CMA conditions show significantly higher magnitude correlations than the overall.

TABLE 2 | Medians and standard deviations (across speakers) of the lag (in ms) at which the highest positive and negative correlations are found between MBEAM and MFCC functions.

Results shown separately for original and surrogate data as a function of eta. 25 Hz smoothing condition.

#### Lag Analysis

The lag analyses were conducted on the η = 0.8 condition, which exhibits the highest correlations. The top panel of **Figure 7** shows how the median of the CMA correlation function varies as a function of the lag between the MBEAM and MFCC functions for one speaker, for lags between +200 and −200 ms. Positive lags represent delay of the MBEAM signal with respect to the MFCC, and negative lags represent relative delay of the MFCC function. The lower panel shows the percentage of values in the correlation function at a given lag that are positive. The two functions of lag track each other quite closely, and the analysis will focus on the median correlation lag function. Even though the figure represents correlation of two different signals, it has the form of an autocorrelation function. Very high values are found at lag = 0, in this case 0.71 (of course, if this were an actual autocorrelation function, the value would be equal to 1 at lag = 0). As the signals are shifted in time, the correlation decreases to minimum values at lags of ±65 ms, and then increases again to maxima between 100 and 150 ms of shift (in either direction). The surrogate data did not in general exhibit this kind of structure and were not considered further in the lag analysis.
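To make the lag convention concrete, the sketch below computes a plain Pearson correlation between two equally long signals over a range of frame lags. It is not the CMA method used in the paper, which computes a time-varying correlation; it only illustrates what "delay of the MBEAM signal" means. The function names and the random toy signals are hypothetical, and the toy MFCC is constructed so that the best lag is one frame.

```python
import random
from math import sqrt

FRAME_MS = 6.866  # frame duration quoted in the text


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lagged_correlation(mbeam, mfcc, lag):
    """Correlate mfcc[t] with mbeam[t - lag]; a positive lag delays MBEAM."""
    n = len(mbeam)
    ts = range(max(0, lag), min(n, n + lag))
    return pearson([mbeam[t - lag] for t in ts], [mfcc[t] for t in ts])


def lag_function(mbeam, mfcc, max_lag):
    """Correlation at every integer frame lag from -max_lag to +max_lag."""
    return {lag: lagged_correlation(mbeam, mfcc, lag)
            for lag in range(-max_lag, max_lag + 1)}


rng = random.Random(0)
mbeam = [rng.random() for _ in range(300)]
mfcc = [0.0] + mbeam[:-1]  # toy: MFCC equals MBEAM delayed by one frame
curve = lag_function(mbeam, mfcc, max_lag=29)  # 29 frames is about 199 ms
best_lag = max(curve, key=curve.get)
print(best_lag, round(best_lag * FRAME_MS, 3))
```

Because the toy signals are random rather than quasi-periodic, this curve has a single peak and no secondary maxima; the structure described for **Figure 7** depends on the repetitive pulse pattern of real speech.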

Crucially, the fact that there are secondary maxima means that there is a repetitive period in the signals that is shared between them, just as the secondary maxima in autocorrelation can be used to determine the major periodicity of a single signal. Twenty-one of the twenty-three speakers exhibit these secondary maxima. The lag values at which the secondary maxima occur for a given speaker were determined as follows. First, the lags corresponding to correlation minima were determined by analyzing the median correlation lag function and finding the negative extrema closest to lag = 0. Then, the secondary maxima were found by finding a maximum between the time of the minima and ±170 ms. Since the function was noisy around the secondary maxima for several speakers, there were sometimes multiple nearby maxima, in which case the most extreme one was chosen. The lag values at which the secondary maxima occurred for a given speaker were referenced to the lag value exhibiting the (primary) maximum. This is lag = 0 for the speaker shown in **Figure 7**, but this varied across speakers with a median value of 6.866 ms, or a delay of MBEAM by one frame. The measured lag was subtracted from the lag at which the primary maximum occurs. The positive lag and the absolute value of the negative lag were averaged to derive a single secondary maximum lag value for each speaker.
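The procedure for locating secondary maxima can be sketched in code. This is one possible reading of the steps described above, not the author's implementation: the function name, the use of the global maximum as the primary maximum, and the synthetic damped-cosine lag function (130 ms period, 5 ms lag grid) are all assumptions made for illustration.

```python
from math import cos, exp, pi


def secondary_maximum_lag(lags, corr, search_limit=170.0):
    """Mean distance from the primary maximum to the two secondary maxima.

    lags: lag values in ms, ascending; corr: median correlation at each lag.
    """
    indices = range(len(lags))
    primary = max(indices, key=lambda i: corr[i])
    zero = min(indices, key=lambda i: abs(lags[i]))

    def first_negative_minimum(step):
        # Walk away from lag 0 until a negative local minimum is reached.
        i = zero + step
        while 0 < i < len(lags) - 1:
            if corr[i] < 0 and corr[i] <= corr[i - 1] and corr[i] <= corr[i + 1]:
                return i
            i += step
        raise ValueError("no negative minimum found on this side")

    def side_distance(step):
        i_min = first_negative_minimum(step)
        # Candidate lags lie beyond the minimum, on the same side, within the limit.
        candidates = [i for i in indices
                      if abs(lags[i_min]) < abs(lags[i]) <= search_limit
                      and (lags[i] > 0) == (step > 0)]
        i_sec = max(candidates, key=lambda i: corr[i])  # most extreme maximum
        return abs(lags[i_sec] - lags[primary])

    return (side_distance(+1) + side_distance(-1)) / 2.0


# Synthetic lag function: damped cosine with a 130 ms period.
lags = list(range(-200, 201, 5))
corr = [0.7 * cos(2 * pi * abs(l) / 130) * exp(-abs(l) / 300) for l in lags]
print(secondary_maximum_lag(lags, corr))
```

On real data the lag function is noisy, so a first local minimum found this way could be spurious; some smoothing or a minimum-depth criterion would probably be needed.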

Box plots of the lag of the secondary maxima are shown in **Figure 8**. The leftmost plot shows the lags for the MBEAM-MFCC correlation for the 12 Hz filtering condition. The median value is 127 ms, which is very close to the median duration of the MBEAM inter-pulse intervals (132 ms). The next box plot shows the secondary maxima lags of the MBEAM function with itself (autocorrelation), with a median value of 124 ms, very close to the value for the MBEAM-MFCC correlation (though the MBEAM-MFCC values are more variable across speakers). This indicates that there is a repetitive structure to the MFCC modulation function that aligns with the repetitive structure of the MBEAM function, even though the median inter-pulse interval for the MFCC function is actually shorter (116 ms), as is the median of the secondary maxima lags of the autocorrelation of the MFCC (110 ms). These differences are small in magnitude, to be sure, but the next three box plots from the 25 Hz filtering condition show the same pattern with a much larger magnitude. The MBEAM-MFCC correlation shows a median secondary maximum lag at 127 ms, similar to the median duration of the MBEAM inter-pulse intervals in this condition, 131 ms. However, the median duration of the MFCC inter-pulse intervals in this condition is 63 ms. This suggests that the MBEAM pulses are aligning with approximately every other MFCC pulse in this condition.

FIGURE 7 | Sample results of CMA lag analysis for one speaker for correlation between MBEAM and MFCC modulation functions filtered at 12 Hz with η = 0.8. (Top) Shows the median value of the correlation function for each signal lag in ms. Vertical lines indicate correlation minima and secondary maxima. (Bottom) Shows the percentage of correlation values that are positive at each lag.


# DISCUSSION


The results of the analyses provide support for the primary hypothesis that there are robust correlations between the acoustic and articulatory modulation functions, as instantiated here in the MFCC and MBEAM functions [prediction (3) in the Introduction]. On the one hand, it is not surprising that they should be correlated given their causal relationship, but there are several reasons why these particular functions might not have revealed that. Primarily, there are several articulatory dimensions of change that are not represented in the microbeam data, including information about the velum, glottis, and pharynx. The lack of such information may be part of the reason that the pulses in the MFCC function were observed to have a considerably higher frequency than those of the MBEAM function (in addition to the intrinsic smoothness of articulatory movement), particularly when the MFCC function is not lowpass filtered at the 12 Hz frequency that appears to be the highest frequency in the MBEAM function. So, the fact that significant correlations are observed even in the 25 Hz filtering condition, where the pulse frequencies are quite different, is testament to the robustness of the co-modulation effect. Another indicator of its robustness is the fact that the correlation values are so consistent across speakers. Almost all speakers show predominantly positive correlations with maximal correlations close to zero lag, and the differences across various conditions tested were generally highly significant in simple sign tests, meaning all or almost all of the speakers showed differences in the same direction. The surrogate data show highly variable correlations across speakers in both sign and lag. This is consistent with the idea that correlations in the original data are intrinsic to the physics in combination with the phonological structure and are not parameters that are set differently by individual speakers. Also, the fact that robust correlations can be found in narrow time windows indicates that the correlations do not depend on including stretches of speech long enough to contain systematic variation in articulator velocity due to prosodic boundaries.

The lag analysis revealed that pulse sequences of the articulatory and acoustic modulation functions share a repetitive structure (prediction 4), even when the MFCC function was twice the frequency of the MBEAM function. Returning to the issue raised in the introduction of how sensory and motor representations could be aligned within the nervous system, this result supports the possibility of modulation functions contributing to the solution. Rhythmic properties of articulatory modulation could entrain oscillations in speech-motor areas, and acoustic modulation could entrain oscillations in auditory areas. The correlations of the modulation functions demonstrated in the results could therefore contribute to auditory-motor synchronization. The correlations are high and are also sensitive to lag, so oscillations in motor and auditory areas, entrained respectively to articulatory and acoustic modulation functions, would tend to be in-phase and effectively synchronized. One way to quantify the sensitivity to lag is to find the threshold lag at which the percentage of positive correlations in the correlation function drops to under 50%. For the η = 0.8 condition (12 Hz filtering), the median threshold across speakers is ∼40 ms. This means that the auditory and speech motor cortical oscillations based on these respective acoustic and articulatory modulation functions would intrinsically be within 40 ms of being in phase during speech production. Coupling between activity in these brain areas, as demonstrated during listening by Assaneo and Poeppel (2018), could further strengthen the synchronization.
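The threshold-lag measure described above can be sketched as a simple scan outward from zero lag. The text does not specify how positive and negative lags were combined, so taking the smallest absolute lag is an assumption here, as are the function name and the toy percentage values.

```python
FRAME_MS = 6.866  # frame duration quoted in the text


def positive_threshold_lag(lags_ms, pct_positive, threshold=50.0):
    """Smallest |lag| at which the percentage of positive correlation values
    falls below the threshold, scanning outward from zero lag."""
    order = sorted(range(len(lags_ms)), key=lambda i: abs(lags_ms[i]))
    for i in order:
        if pct_positive[i] < threshold:
            return abs(lags_ms[i])
    return None  # never drops below the threshold in the range examined


# Toy data: the percentage of positive values falls by 8 points per frame.
frames = list(range(-8, 9))
lags_ms = [f * FRAME_MS for f in frames]
pct_positive = [95.0 - 8.0 * abs(f) for f in frames]
threshold_ms = positive_threshold_lag(lags_ms, pct_positive)
print(round(threshold_ms, 3))
```

In this toy example the percentage first falls below 50% at six frames, which is close to the roughly 40 ms median reported for the η = 0.8 condition.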

The approach used here to reveal the shared repetitive structure was somewhat indirect and limited, in that ultimately it was based on a linear correlation method. A better analysis that would avoid this limitation would be to use a larger corpus of material and possibly a technique like joint recurrence analysis (Marwan et al., 2007; Lancia et al., 2019). Another alternative method that avoids the linear correlation would be to measure mutual information (Cover and Thomas, 2006) between the modulation functions. Mutual information measures how much knowledge of one signal reduces uncertainty about the other, and does not depend on linear correlations. Other possible methods of looking at temporal co-modulation, based on work with neural oscillations, would deploy frequency coupling (e.g., between faster and slower frequencies) or cross-wavelet power (Grinsted et al., 2004) between modulations of acoustics and articulation in different frequency bands.
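As an illustration of the mutual information alternative, the sketch below estimates mutual information (in bits) between two sampled signals by binning their values. The equal-width histogram estimator, the default bin count, and the function names are assumptions chosen for simplicity; they are not taken from the paper, and histogram estimates are known to be biased for short signals.

```python
from collections import Counter
from math import log2


def _bin_indices(values, n_bins):
    """Assign each value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=8):
    """Histogram estimate of the mutual information between x and y, in bits."""
    if len(x) != len(y) or not x:
        raise ValueError("signals must be non-empty and of equal length")
    bx, by = _bin_indices(x, n_bins), _bin_indices(y, n_bins)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # Sum p(i, j) * log2(p(i, j) / (p(i) * p(j))) over the occupied bins.
    return sum((c / n) * log2(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


# Toy signals: y_same copies x, y_indep is statistically unrelated to x.
x = [0.0, 1.0] * 50
y_same = list(x)
y_indep = [0.0, 0.0, 1.0, 1.0] * 25
print(mutual_information(x, y_same), mutual_information(x, y_indep))
```

Unlike a correlation coefficient, this quantity would also be high for signals related by a nonlinear but deterministic mapping.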

The two other predictions about the structure of the modulation functions and their relation to syllable structure were supported by the analyses presented. The modulation functions have a repetitive pulse-like structure (prediction 1). The pulse structure appears to be related to syllable structure (prediction 2). On average one pulse was found for simple CV syllables, approximately 1.5 for syllables with a coda consonant, and 2.5 for syllables with multiple coda consonants. Of course, this needs to be tested on a larger and more varied corpus, particularly including syllables with multiple onset consonants. To the extent that such future analyses support the preliminary results obtained here, it may be possible to develop a new fully spatiotemporal model of syllable structure based on kinetic energy (of the articulators or the spectrum), departing from previous models that are either purely temporal (Goldstein et al., 2006; Nam et al., 2009) or purely spatial (i.e., sonority-based<sup>3</sup> , for example, Goldsmith and Larson, 1990).

While such a model of syllable structure would have several attractive features, its development would require systematic investigation of a wide variety of syllable structures and their resulting kinetic energy functions. A few speculations are nonetheless merited here. While kinetic energy is not an index of sonority per se, it could be an index of sonority change, such that a sharp sonority cline (the cross-linguistically preferred syllable onset or coda pattern) is indexed by a high magnitude of the kinetic energy pulse. Also, sequences of consonants in onset or coda that obey the sonority sequencing principle might result in single modulation pulses, while those that run counter to it could exhibit multiple pulses. Which is to say, a preference for single, high-magnitude pulses capable of entraining theta oscillations could underlie the preferred syllable structures in languages.

<sup>3</sup> "Sonority" has never been given a precise physical definition, but has been approximately described as the relative "acoustic energy" of a segment (Ladefoged, 1993). In general, relative sonority of segments within a syllable increases from onset to nucleus and decreases again from nucleus to coda.

Similar computations over sonority are the basis of Goldsmith and Larson's (1990) dynamical model of syllabification, but the values of sonority in that model are stipulated rather than representing measurable properties of speech, and temporal properties are not considered.

This modulation pulse model might also be able to provide insight into syllabification in languages in which syllables without vowels are common, such as Tashlhiyt Berber (Dell and Elmedlaoui, 1985) or Moroccan Arabic (Dell and Elmedlaoui, 2002). Data on articulatory organization of such vowel-less syllables have shown that the sequence of consonants constituting the onset and nucleus is organized such that the constriction gesture for the first consonant is fully released before the second is formed (Goldstein et al., 2006, for Tashlhiyt; Gafos et al., 2019, for Moroccan Arabic). The sequential production of the two gestures could produce a modulation pulse that might be lacking if the two gestures were coordinated in a temporally overlapping pattern. Finally, the modulation pulse model might be able to distinguish glides (like /j/) from their corresponding vowels (like /i/), even though they are phonetically very similar in terms of static articulatory and acoustic properties. In standard phonological theory, the difference emerges as a function of being 'parsed' into the onset versus nucleus. In a modulation pulse model, this difference could emerge due to different patterns of overlap of an initial consonant gesture with a following glide (/Cj/) versus an initial consonant gesture with a following vowel (/Ci/). The overlap pattern in /Ci/ would presumably produce a modulation pulse (as it does in the data analyzed here), but the overlap pattern in /Cj/ could fail to add a distinct modulation pulse.

#### CONCLUSION

While there is abundant empirical evidence for real-time sensorimotor interaction in speech production and perception, not the least of which is its requisite status in vocal learning and development, the patterns of neural activation associated with articulation and with acoustics of the same utterance are in fact distinct. This raises the question of the nature and basis of the neural binding that affords their integration. This paper presents a novel approach to this question by explicitly considering the temporal aspects of continuous acoustic and articulatory signals, which must of physical necessity be lawfully related, as the articulatory movements actually cause the acoustic signals. We hypothesize that the systematic relation between the temporal modulation of articulation and the corresponding temporal modulation of the acoustic signal offers the basis (or at least one critical basis) for the binding of production and perception, and we offer here an initial systematic and quantitative, albeit exploratory, investigation of the structure of the co-modulation patterns in articulation and acoustics. This preliminary data analysis identifies a pulse-like modulation structure related to syllable structure that is aligned systematically between oral articulatory movements and acoustic MFCCs. Temporal co-modulation of articulation and acoustics can provide a springboard for illuminating the binding of language production and perception and its cognitive significance in phonological structuring.

## DATA AVAILABILITY STATEMENT

The data analyzed in this study were obtained from the NIH-funded University of Wisconsin X-ray Microbeam project directed by John Westbury. Requests to access these datasets should be directed to John Westbury (john.westbury@wisc.edu). The results of the analyses performed here will be made available by the author, without undue reservation, to any qualified researcher upon request.

# AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and has approved it for publication.

#### FUNDING

This work was supported in part by NIH grant nos. DC003172 and DC007124 and NSF grant no. 1551695 to the University of Southern California.

#### ACKNOWLEDGMENTS

The following people contributed to this work through discussions, comments, suggestions and/or assistance of miscellaneous forms: Adriano Barbosa, Dani Byrd, Eddie Chang, Samantha Gordon Danner, John Houde, Khalil Iskarous, Nassos Katsamanis, Argyro Katsika, Jelena Krivokapić, Mark Liberman, Richard McGowan, Shri Narayanan, Elliot Saltzman, Mark Tiede, and Sam Tilsen.

#### REFERENCES

correlation map analysis. J. Acoust. Soc. Am. 131, 2162–2172. doi: 10.1121/1.3682040

Cummins, F. (2002). On synchronous speech. Acoust. Res. Lett. Online. 3, 7–11.


**Conflict of Interest:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Goldstein. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Spoken Language Development and the Challenge of Skill Integration

*Aude Noiray<sup>1,2</sup>\*, Anisia Popescu<sup>1</sup>, Helene Killmer<sup>3</sup>, Elina Rubertus<sup>1</sup>, Stella Krüger<sup>1</sup> and Lisa Hintermeier<sup>4</sup>*

*<sup>1</sup>Laboratory for Oral Language Acquisition, Linguistic Department, University of Potsdam, Potsdam, Germany, <sup>2</sup>Haskins Laboratories, New Haven, CT, United States, <sup>3</sup>Department of Linguistics, University of Oslo, Oslo, Norway, <sup>4</sup>Department of Education, Jyväskylä University, Jyväskylä, Finland*

The development of phonological awareness, the knowledge of the structural combinatoriality of a language, has been widely investigated in relation to reading (dis) ability across languages. However, the extent to which knowledge of phonemic units may interact with spoken language organization in (transparent) alphabetical languages has hardly been investigated. The present study examined whether phonemic awareness correlates with coarticulation degree, commonly used as a metric for estimating the size of children's production units. A speech production task was designed to test for developmental differences in intra-syllabic coarticulation degree in 41 German children from 4 to 7 years of age. The technique of ultrasound imaging allowed for comparing the articulatory foundations of children's coarticulatory patterns. Four behavioral tasks assessing various levels of phonological awareness from large to small units and expressive vocabulary were also administered. Generalized additive modeling revealed strong interactions between children's vocabulary and phonological awareness with coarticulatory patterns. Greater knowledge of sub-lexical units was associated with lower intra-syllabic coarticulation degree and greater differentiation of articulatory gestures for individual segments. This interaction was mostly nonlinear: an increase in children's phonological proficiency was not systematically associated with an equivalent change in coarticulation degree. Similar findings were drawn between vocabulary and coarticulatory patterns. Overall, results suggest that the process of developing spoken language fluency involves dynamical interactions between cognitive and speech motor domains. Arguments for an integrated-interactive approach to skill development are discussed.

Keywords: language acquisition, coarticulation, speech motor control, phonological awareness, vocabulary, speech production

#### *Edited by:*

*Pascal van Lieshout, University of Toronto, Canada*

#### *Reviewed by:*

*Catherine T. Best, Western Sydney University, Australia Marc F. Joanisse, University of Western Ontario, Canada*

*\*Correspondence:*

*Aude Noiray anoiray@uni-potsdam.de*

#### *Specialty section:*

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

*Received: 07 May 2019 Accepted: 25 November 2019 Published: 17 December 2019*

#### *Citation:*

*Noiray A, Popescu A, Killmer H, Rubertus E, Krüger S and Hintermeier L (2019) Spoken Language Development and the Challenge of Skill Integration. Front. Psychol. 10:2777. doi: 10.3389/fpsyg.2019.02777*

#### INTRODUCTION

In the first decade of life, most children learn to speak their native language effortlessly, without explicit instruction but with daily exposure and experiencing of their native language as a speech motor activity. With the gradual expansion of children's expressive repertoire comes the fine tuning of phonological knowledge (e.g., Ferguson and Farwell, 1975; Menn and Butterworth, 1983; Beckman and Edwards, 2000; Munson et al., 2012). While relationships between lexical and phonological developments have been well documented over the last decades (Storkel and Morrisette, 2002; Edwards et al., 2004, 2011; Stoel-Gammon, 2011; Vihman, 2017), research addressing their interaction with spoken language production has often been restricted to production accuracy or duration measures as metrics for assessing spoken language proficiency (e.g., Edwards et al., 2004; Munson et al., 2005). Likewise, speech motor control studies have provided in-depth analyses of developmental changes in articulatory variability, or movement velocity during word or sentence production (Smith and Goffman, 1998; Smith and Zelaznik, 2004; Green et al., 2010) without equivalently thorough assessments of children's phonological or lexical knowledge allowing developmental interactions to be evaluated. Despite a certain imbalance in the focus and analytical approaches of interaction studies, the findings suggest that spoken language proficiency entails dynamical interactions among a set of language-related domains including speech motor skill.

In the present research, we adopted an integrated approach to the study of spoken language development considering parallel developments of the lexical, phonological, and speech motor systems. The study more specifically investigated interactions between domains that have not yet been empirically connected: in particular *phonological awareness*, the awareness of the particulate nature of the language (e.g., Fowler, 1991; Studdert-Kennedy, 1998, 2005) that develops with literacy (reviews in Anthony and Francis, 2005; Brady et al., 2011; Goswami and Bryant, 2016; in German: Fricke et al., 2016) and *anticipatory coarticulation*, a mechanism that is deeply rooted in kinematics (e.g., Parush et al., 1983) and motor planning (e.g., Whalen, 1990; Levelt and Wheeldon, 1994; Grimme et al., 2011; Perrier, 2012; Davis and Redford, 2019) and is fundamental to speech fluency.

While phonological awareness and coarticulatory mechanisms may in principle belong to different realms, we argue that they are developmentally strongly interconnected: phonological awareness relates to the ability to consciously extract functional units of phonological organization from the continuous speech flow (e.g., syllables, segments) and combine those discrete units into new sequences of variable size and meaning (e.g., Metsala, 2011). Coarticulation embodies speakers' structural knowledge of the language, combining and (re)modeling its elementary particles into continuous articulatory movements and acoustic streams, hence contextualizing abstract representations into a decipherable "speech code" (Liberman et al., 1974; Fowler et al., 2016). In this perspective, investigating developmental changes in children's coarticulatory processes may give us an opportunity to track how a combinatorial principle is situated within the representational and production levels and to capture more broadly how motor and cognitive functions come together to develop the skill of spoken language.

While children's speech organization very early reflects their ability to combine phonetic units, the explicit awareness of the combinatorial nature of their native language, which forms larger compounds from smaller-sized units, follows a more protracted development and seems to climax around the time children acquire literacy (e.g., Gillon, 2007). During that period, a gain in phonological awareness allows children to convert the already acquired phonetic units (i.e., sounds they hear and produce by means of distinct speech gestures) into phonological units. However, whether the acquisition of phonological knowledge only relates to attaining literacy or also modifies children's spoken language organization in fundamental ways remains an empirical question. The alternative direction in which a gain in spoken language practice would stimulate the development of phonological awareness and literacy has also not yet been demonstrated. The present study provides a first step toward addressing this issue by testing whether lexical and phonological skills interact with speech motor control in development. More specifically, we examined whether children with greater knowledge of the segmental makeup of words in their native language exhibited a segmentally specified organization of their speech gestures, as reflected in their coarticulatory patterns. We focused on the period encompassing kindergarten to the end of the first primary school year, which is relevant for phonological development as well as for attaining literacy. Our motivations, drawn from empirical research, are further outlined below.

#### What Are Children's Units of Spoken Language Organization?

In the last decades, a growing number of developmental studies in the area of spoken language ability have focused on *coarticulation degree,* which characterizes the extent to which the articulatory gestures for neighboring phonemes overlap temporally (e.g., Browman and Goldstein, 1992). Looking specifically at *lingual* coarticulation, which concerns the gestural organization of the tongue, some research has found a developmental decrease in vocalic anticipatory coarticulation over previous segments, within the syllable (e.g., Nittrouer et al., 1996; Zharkova et al., 2011; Noiray et al., 2018) and beyond the syllabic span (e.g., Nijland et al., 2002; Rubertus and Noiray, 2018). On the basis of these results, Noiray et al. (2019) reasoned that spoken language fluency may entail a gradual narrowing of speech units toward smaller-sized units. In young children, vowels may represent building blocks around which children organize their speech because of their perceptual salience, long duration, and earlier acquisition compared to consonants (e.g., Polka and Werker, 1994; review in Nazzi and Cutler, 2019). Hence, children's vocalic and consonantal gestures may be activated more simultaneously than in adults, resulting in an overall larger vocalic influence on previous consonants and a greater degree of vocalic coarticulation than for adults. Adults, in contrast, have been found to organize their speech with more temporally individuated gestures (Abakarova et al., 2018; Rubertus and Noiray, 2018). The result of rather large unit size speech organization echoes the multiple findings of whole-word learning (Vihman and Velleman, 1989; Keren-Portnoy et al., 2009; Menn and Vihman, 2011), transitional probability across syllables (e.g., Jusczyk et al., 1993; Saffran et al., 1996), or lexically grounded phonological development and production accuracy (Edwards et al., 2004; Velleman and Vihman, 2007; Vihman and Keren-Portnoy, 2013).
The opposite finding of a lesser degree of coarticulation between consonants and vowel gestures in children compared to adults has also been reported (e.g., Katz et al., 1991), favoring a more segmental perspective of early spoken units.

Based on our own in-depth examinations of coarticulatory mechanisms in both adults (Abakarova et al., 2018) and children (Noiray et al., 2018; Rubertus and Noiray, 2018), we have argued that (young) speakers exhibit gradients of coarticulation degree within a continuum from a more syllabic to a more segmental organization. The degree to which segments overlap depends on the gestural demands associated with the combined segments. In adults, contextual differences in coarticulation degree are well attested (e.g., Recasens, 1985; Fowler, 1994). For instance, syllables recruiting a single organ for the consecutive production of both consonantal and vowel targets (e.g., the tongue in /du/) require from speakers a functional differentiation between the subparts of the tongue (tongue tip, tongue dorsum). This type of syllable further requires greater spatiotemporal coordination in comparison to syllables recruiting two separate primary organs (e.g., the lips and tongue dorsum in /bi/). This phenomenon, described within the theory of *coarticulatory resistance*, has been reported in adults across languages over the past decades (review in Recasens, 2018). In children, extensive kinematic investigations of coarticulatory processes have been more challenging and hence somewhat restricted in scope compared to adults (e.g., limited variety of stimuli that can be tested at a young age, age range, sample size, scarcity of methodological replications across studies). Yet, once these studies are examined together, they support the view of coarticulatory gradients as observed in adults. While children show overall greater coarticulation degree than adults, they also exhibit contextual effects on coarticulation degree, which result from the particular combination of gestural goals between individual consonants and vowels. Based on those observations, we recently suggested a gestural approach as a "unifying organizational scheme to relate adults' to children's patterns. 
How coarticulatory organization matures over time is then no longer solely a question of direction (toward a greater or lesser coarticulatory degree) or categorical change in phonological organization (e.g., into segments or syllables) but a question of how a primitive gestural scheme shares similar tools (the articulators of speech), constraints, and principles (dynamic interarticulator coordination over time) with adults to instantiate complex phonetic combinations in line with the native language's phonological grammar" (Noiray et al., 2019, p. 3037). In this context, the question of (early) units of speech production may be viewed as a part-whole interaction.

## The Development of the Lexical, Phonological, and Motor Domains

While the maturation of the speech motor system is central to spoken language fluency, lexical and phonological developments are equally crucial (e.g., Smith et al., 2010; Edwards et al., 2011), and research has suggested that they interact dynamically over time (e.g., Beckman et al., 2007; Sosa and Stoel-Gammon, 2012; Vihman, 2017). A main hypothesis motivating the present study is that adults' coarticulatory patterns do not differ from those of children on the sole basis of greater precision of control of the speech production system. Adults have also (1) built an expressive lexicon from which to harness their phonological representations, (2) gained an explicit understanding of the structure of their language, and (3) acquired an ability to manipulate this information into a quasi-infinite set of intelligible spoken forms. Hence, considering speech motor development as a goal-directed process – for example, speaking a language fluently – what distinguishes children from adults is that children have not yet built explicit correspondences between phonetic segments and their motor realizations. The rapid growth of the expressive lexicon observed during the kindergarten-to-school years may help children solve this correspondence problem and more generally develop stable relations between representational and executional levels. Vocabulary is indeed often considered the backbone of language acquisition, supporting the development of phonological representations (e.g., Ferguson and Farwell, 1975; Metsala, 1999) and production accuracy (e.g., Edwards et al., 2004; Nicholson et al., 2015). Previous research also suggests that children first develop articulatory "routines" for the syllables present in their expressive repertoire (e.g., Menn and Butterworth, 1983; Munson et al., 2005; Ziegler and Goswami, 2005; Vihman, 2017). 
This lexically based process may lay the ground for increased phonetic distinctions along the dimensions of height, fronting and rounding for vowels, place and manner of articulation for consonants, and the maturation of coarticulatory flexibility for a wider range of phonetic environments.

This knowledge is at first experience-based; before entering primary school, children have limited *explicit* knowledge about the structural organization of their native language, that is, they have limited conscious awareness that the words they hear can be segmented into smaller-sized units (and recombined into new forms; e.g., Liberman et al., 1974; Gillon, 2007). Note that while the development of phonological awareness differs as a function of orthographic transparency (e.g., Fricke et al., 2016) or the age at which children learn how to read (e.g., review in Wimmer et al., 2000; Mann and Wimmer, 2002; Schaeffer et al., 2014; Goswami and Bryant, 2016), on average, children in kindergarten show more or less equivalent proficiency in syllabic unit awareness to that of school-aged children (in English: e.g., Liberman et al., 1974; in German: Ziegler and Goswami, 2005; Schaeffer et al., 2014) but no advanced phonemic awareness before explicitly learning how to read. Taken together, young listener-speakers would progressively access smaller units allowing them to decipher a wider range of speech forms and manipulate those flexible units to craft increasingly more complex speech flows. **Figure 1** provides an illustrative conceptualization of these seemingly parallel developmental trajectories, from more holistic access and production of large units (e.g., lexemes) to more segmentally specified representations and coarticulatory organizations. Developmental overlaps (e.g., from lexeme access to rhyme access) and short-term regressions between learning phases may at times occur (e.g., Anthony et al., 2003), as noted in other domains (e.g., "phonological templates" during early word production: Vihman and Vihman, 2011; lip-jaw movement variability: Green et al., 2002; walking: Thelen and Smith, 1994). The developmental pace may also well change over time, as in other domains (e.g., speech motor control: Green et al., 2010). **Figure 1** highlights the nonlinearity of those developmental processes over time (blue descending and ascending curves). With an advanced knowledge of their native language and a mature control of their speech motor system, adults naturally exhibit more flexible, context-specific organizations with greater or lesser coarticulation degree depending on the gestural properties of the individual segments assembled with one another.

Overall, results from these separate literatures suggest that the development of lexical, phonological, and speech motor abilities is fundamental to the maturation of children's spoken language. However, to our knowledge, empirical studies examining their interactions with precision have been rare, and this gap has prevented a unifying account of spoken language development. The central hypothesis driving our current research is that the transition from the rather self-paced development of large unit phonological awareness to the more explicit knowledge of the phonemic constituents of the language initiated in primary school should correlate with a significant change in spoken language production from an experience-based holistic organization to a structurally informed, segmentally specified organization of spoken language. Because quantitative longitudinal investigations over a 2- to 3-year span are extremely difficult to conduct, we first opted for a cross-sectional examination of a sample of 41 children in the last 2 years of kindergarten (at 4.5 and 5.5 years of age) and the end of the first grade (at age 7). The latter cohort was chosen to ensure children had been exposed to explicit literacy instruction for a year. With this approach, we first tested for significant interactions between children's motor, lexical, and phonological skills. Potential implications for causal relations are laid out in the discussion.

Based on our previous research, we expected differences in intra-syllabic coarticulation degree between children and adults but not necessarily between all child cohorts (Noiray et al., 2019). We also anticipated consonantal effects on children's lingual coarticulatory patterns within each age cohort, as found in a preceding study investigating children's intra-syllabic coarticulation from the age of 3 (Noiray et al., 2018). More specifically, we expected a lower degree of lingual coproduction for consonant-vowel syllables requiring two constriction goals by spatially distinct articulatory organs than for those requiring two constriction goals by a single organ, as found in adults (e.g., Iskarous et al., 2013; Abakarova et al., 2018), albeit to a lesser extent than in adults. Importantly, expanding on previous research, we predicted that greater phonological awareness and vocabulary would coincide with lower coarticulation degree, i.e., greater segmental differentiation of consonants and vowels in syllables. We further suspected interactions between motor and cognitive domains to be nonlinear and to reflect the complex dynamics in place during the development of spoken language fluency. If this were found, it would suggest that the skill of spoken language fluency is not solely tied to production-related considerations but may instead result from and be an integral part of multiple interactions, which are fundamental to the development of each individual skill. If no correlation were to be found, it would on the contrary indicate that representational and production levels may not be tightly coupled in the sense that greater awareness of phonological discreteness does not interact with coarticulatory degree.

# MATERIALS AND METHODS

#### Participants

Forty-one monolingual German children all living in the Potsdam region (Brandenburg) were tested: ten 4-year-olds (6 females, mean age: 4;06, called K1 in subsequent analyses), thirteen 5-year-old children (7 females, mean: 5;06, called K2 hereafter) in kindergarten, and eighteen 7-year-old children at the very end of the first or very beginning of the second grade in primary school (12 females, mean: 7;02, called P1 hereafter). The discrepancy in sample size was due to greater difficulty in recruiting children in kindergarten. All children were raised in monolingual German families without any known history of hearing, language, or cognitive impairment. They were recruited *via* the child registry from the BabyLab of the University of Potsdam. Ethics approval was obtained from the Ethics Committee of the University of Potsdam prior to the study. All parents were also fully informed of the study and gave written consent for their child to participate.

## Production Task

The speech production task consisted of the repetition of trochaic pseudowords (i.e., conforming to German phonotactics) of the form consonant1-vowel-consonant2-schwa (**C**1**V**C2ə). Target phrases used as stimuli were pre-recorded by a native German female adult speaker. Three consonants varying in place of articulation (/b/, /d/, and /g/) and six tense, long vowels (/i/, /y/, /u/, /a/, /e/, and /o/) were used. Pseudowords were chosen instead of real words to combine consonants and vowels varying in lingual gestures and coarticulatory resistance. Target pseudowords were embedded in a carrier phrase with the article /aɪnə/, resulting in utterances such as /aɪnə ba:də/. Utterances were repeated six times in semi-randomized blocks. To measure lingual coarticulation, we employed the technique of ultrasound imaging (Sonosite edge, fps: 48 Hz), which permits recording the movement of participants' tongues over time while they produce various speech materials (Noiray et al., 2013). In this study, tongue imaging was integrated in a space journey narrative to stimulate children's motivation to complete the task. Children were seated in a pilot seat with seatbelts, facing the operating console from a space rocket replica. The ultrasound probe on which children positioned their chin was integrated into a customized probe-holder as part of the rocket console (for a full description of the method, see Noiray et al., 2018). The acoustic speech signal was recorded synchronously with the ultrasound tongue video *via* a microphone (Shure, sampling rate: 48 kHz).

#### Assessment of Phonological Awareness and Vocabulary

Assessments of various levels of phonological awareness (rhyme, onset segment, and individual phonemes) were conducted with the Test für Phonologische Bewusstheitsfähigkeiten (TPB; Fricke and Schäfer, 2008). Prior to testing, children were familiarized with all images used as test items. The procedure for each of the TPB tests is briefly summarized below; a complete description can be found in Fricke and Schäfer (2008). The tests were scored according to the test instructions, and raw scores were considered for subsequent analyses.

#### Rhyme Production

Children are shown a picture and are instructed to produce (non)words that rhyme with the word corresponding to the target picture (e.g., *Puppe: Muppe, Kuppe, Wuppe*). Children are instructed to provide as many rhymes as they can. However, to make the task comparable for every child, we scored children's proficiency differently from the test instructions: for each of the 12 target words, children scored 1 point if they succeeded in giving at least one correct rhyme; if not, they scored zero. This way, we could assess the stability and generalization of the rhyming skill rather than relying on raw number of rhymes produced (e.g. if a child produced six rhymes for two target words but then failed for all other target words).
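The modified scoring rule described above can be sketched as follows. This is a minimal illustration, not the scoring code used in the study: the function name `rhyme_score` and the example counts are hypothetical, and the input is assumed to be the number of correct rhymes a child produced for each of the 12 target words.

```python
def rhyme_score(correct_rhymes_per_word):
    """Binary scoring: 1 point per target word with at least one correct rhyme."""
    return sum(1 for n in correct_rhymes_per_word if n >= 1)


# Hypothetical child: six correct rhymes for each of two target words,
# none for the remaining ten target words.
child = [6, 6] + [0] * 10

print(rhyme_score(child))  # 2 points under binary scoring
print(sum(child))          # 12 if the raw number of rhymes were counted instead
```

Under this rule, a child who rhymes prolifically on a few items but fails on the rest scores low, which is the generalization property the binary scoring is meant to capture.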

#### Onset Segment Deletion

Children are shown a picture and are instructed to delete the onset segment from the word represented by the picture and utter the resulting nonword (e.g., Mond: ond; Zahn: ahn). Note that children were told precisely what to delete (e.g., "delete 'm' from Mond"). A total of 12 words is tested in each age cohort.

#### Phoneme Synthesis

Children are instructed to produce a word after hearing a pre-recorded female voice uttering its phonemes one by one (e.g., fee: [f-e:], dose: [d-o:-z-ə], salat: [z-a-l-a:-t]). As for the onset segment deletion task, the TPB assessment uses a total of 12 words for each age cohort.

#### Expressive Vocabulary

Expressive vocabulary was tested with the Patholinguistische Diagnostik bei Sprachentwicklungsstörungen (PDSS; Siegmüller and Kauschke, 2010), which is widely used to assess German children's lexical repertoire. The test consists of a 20-word picture naming task assessing nouns for the target ages (see **Table 1** for an overview). In subsequent analyses, we used a composite score for phonemic awareness (PA hereafter), which includes the two tasks tapping phoneme-size awareness: onset deletion and phoneme synthesis.

We focused on *output* phonological tasks as well as *expressive* vocabulary because we were interested in their direct relationship with children's speech production. Given that young children have a limited attention span, we could also assess children's actual proficiency with better confidence than when conducting long series of cognitively demanding assessments. All assessments were conducted in our laboratories by experimenters trained by a speech language pathologist.

## STATISTICAL ANALYSES

Consistent with previous research, intra-syllabic coarticulation degree was estimated in terms of whether the lingual gesture for a target vowel was anticipated in the previous consonant (see review on vowels' degrees of aggressiveness in the context of different consonants: Iskarous et al., 2010). We focused on the antero-posterior tongue dorsum position that is highly relevant in terms of articulatory and acoustical contrasts between vowels (e.g., Delattre, 1951). We calculated differences in tongue dorsum position between the production of consonants and following vowels. A tongue dorsum position for a consonant (e.g., /g/) that varies in the context of various vowels (e.g., /a/, /i/) indicates vocalic anticipation onto the previous consonant and hence a high coarticulation degree. On the contrary, low coarticulation degree is reflected by an absence of change in tongue dorsum position during the consonant in the context of various vowels (review in Iskarous et al., 2010).

TABLE 1 | Summary of the results from the assessments tapping phonological awareness (Rhyme, Composite PA) and expressive vocabulary (VOC) conducted in 4-year-old (K1), 5-year-old (K2), and 7-year-old children at the end of first grade (P1).

Differences in coarticulation degree were estimated for each consonantal context from the midpoint of the consonant (C1) compared to the vowel midpoint (V). A few preliminary processing steps were necessary. First, the corresponding midsagittal tongue contours for both C1 and V were extracted from the ultrasound video based on the acoustic speech signal labeling. The tongue contours were then analyzed using SOLLAR (Noiray et al., submitted), a platform created in our laboratory for the analysis of kinematic data (Matlab environment). For each target tongue contour, a 100-point spline was automatically generated, and the *x*- and *y*-coordinates for each point were extracted. In subsequent analyses, we used the horizontal *x*-coordinate for the highest *y*-coordinate point of the tongue dorsum to reflect its variation in the anterior-posterior dimension (e.g., anterior position for /i/, posterior position for /u/; see Abakarova et al., 2018). Data were normalized for each participant by setting the most anterior tongue dorsum position during the target vowel midpoints to 0 and the most posterior tongue dorsum position to 1. Tongue dorsum positions for consonant midpoints were then scaled within this range.
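The peak extraction and by-participant normalization described above can be illustrated with a short sketch. It is not the SOLLAR implementation: the function names `peak_x` and `normalize_participant` are hypothetical, contours are assumed to be lists of (x, y) pairs, and smaller x values are assumed to be more anterior (the actual orientation depends on the probe setup).

```python
def peak_x(contour):
    """Return the x-coordinate of the highest point (largest y) of a tongue contour.

    `contour` is a list of (x, y) pairs, e.g., the 100 spline points.
    """
    return max(contour, key=lambda point: point[1])[0]


def normalize_participant(vowel_xs, consonant_xs):
    """Min-max scale positions using the vowel midpoints as the reference range.

    Assumption: smaller x = more anterior, so the most anterior vowel
    position maps to 0 and the most posterior one to 1. Consonant
    positions are scaled with the same range and may fall outside [0, 1].
    """
    lo, hi = min(vowel_xs), max(vowel_xs)
    if hi == lo:
        raise ValueError("vowel positions must span a non-zero range")
    span = hi - lo
    scaled_vowels = [(x - lo) / span for x in vowel_xs]
    scaled_consonants = [(x - lo) / span for x in consonant_xs]
    return scaled_vowels, scaled_consonants


# Hypothetical raw positions (arbitrary units) for one participant.
vowels, consonants = normalize_participant([10.0, 20.0, 30.0], [15.0, 35.0])
print(vowels, consonants)
```

Scaling the consonant midpoints with the vowel-derived range, rather than with their own minimum and maximum, keeps consonant and vowel positions on a common scale so that they can be compared directly.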

To test for developmental differences in coarticulation degree, we employed linear mixed effects models (LMER), using the "lme4" package in R (version 1.1–19; Bates et al., 2015). Coarticulation degree was calculated by regressing the horizontal position of the tongue dorsum at consonant midpoint (PEAKC1\_X) on the horizontal position of the tongue dorsum at vowel midpoint (PEAKV\_X) for each age group (K1, K2, and P1). Two interaction terms were used: Coarticulation and Consonant (C1) and Coarticulation and Age. By-subject C1 and by-word random slopes for PEAKV\_X were included as random effects.
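The logic of the regression can be illustrated with a deliberately simplified sketch. It uses an ordinary least-squares slope only, so it omits the interaction terms and random effects of the mixed model fitted with lme4. The function name `ols_slope` and the data are hypothetical, and reading the slope as an index of coarticulation degree (near 1 when the consonant position tracks the vowel, near 0 when it does not) is our interpretation of the description above.

```python
def ols_slope(x, y):
    """Slope of the least-squares line predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical normalized positions: PEAKV_X (vowel) and PEAKC1_X (consonant).
peak_v = [0.0, 0.5, 1.0]
tracking_consonant = [0.1, 0.5, 0.9]    # consonant position follows the vowel
resisting_consonant = [0.4, 0.45, 0.5]  # consonant position barely changes

print(ols_slope(peak_v, tracking_consonant))
print(ols_slope(peak_v, resisting_consonant))
```

The first slope is much steeper than the second, which corresponds to a higher degree of vocalic anticipation in the consonant.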

To test for an effect of phonological awareness and vocabulary on children's coarticulation degree, we then employed *Generalized Additive Modeling (GAM),* a statistical approach allowing us to test for linear and nonlinear relationships (Winter and Wieling, 2016; Wood, 2017; for a comprehensive tutorial, see Wieling, 2018). To date, this approach has only been used in psycholinguistic research with adults (e.g., Strycharczuk and Scobbie, 2017; Wieling et al., 2017) and only recently in the developmental domain (Noiray et al., 2019). In this study, we were interested in the effect of three variables on the degree of coarticulation: RHYME, COMPOSITE\_PA (a composite computed from the sum of the scores obtained for both phonemic awareness tasks: onset segment deletion and phoneme synthesis, see section "Descriptive Statistics for Phonological Awareness and Vocabulary"), and VOC. We used the function *bam* of the *mgcv* R package (version 1.8–26) and *itsadug* (version 2.3). Our dependent variable was again PEAKC1\_X with respect to PEAKV\_X. We predicted this value on the basis of a nonlinear interaction, which is modeled by a tensor product smooth (te). A tensor product smooth can model both linear and nonlinear effects across a set of predictors and their interaction (see Wieling, 2018) here between: RHYME, COMPOSITE\_PA or VOC, and PEAKV\_X. The resulting estimated degrees of freedom (edf) indicate whether the relation is linear (value close to 1) or nonlinear (values above 1).

# RESULTS

## Testing for Developmental Differences in Coarticulation Organization

**Table 2** shows the results from the LMER testing for age-related differences in coarticulation degree across all consonants and vowels. No significant difference was noted across the three target age groups. However, differences in coarticulation degree were found across consonantal contexts, with a lower coarticulation degree in alveolar /d/ context as compared to labial /b/ context (estimate: −0.11793, *p* < 0.05). Coarticulation degree did not differ across other consonantal contexts.

#### Descriptive Statistics for Phonological Awareness and Vocabulary

Pearson product-moment correlations were computed to assess relationships between all developmental assessments. The rhyming task was completed by 40 of the 41 children because one P1 child declined to take part in it. A strong positive correlation (*r* = 0.94, *p* < 0.001) was found between scores for onset deletion and phoneme synthesis. In subsequent analyses testing the effect of phonological awareness on coarticulatory organization, we therefore computed a composite score as the sum of the scores obtained in the two tasks. This score was taken to reflect children's phonemic awareness (COMPOSITE\_PA), that is, awareness of phonemic units in comparison to the awareness of larger phonological units (rhymes).
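As an illustration of the two computations in this paragraph, the sketch below derives a Pearson correlation between two task scores and a composite score as their sum. The scores are invented for the example and are not the study's data; `pearson_r` and `composite_pa` are hypothetical helper names.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def composite_pa(onset_deletion, phoneme_synthesis):
    """Composite phonemic awareness score: sum of the two task scores per child."""
    return [a + b for a, b in zip(onset_deletion, phoneme_synthesis)]


# Invented raw scores (0-12 per task) for five children.
onset = [0, 1, 0, 10, 12]
synthesis = [0, 0, 1, 9, 12]

print(pearson_r(onset, synthesis))
print(composite_pa(onset, synthesis))
```

With two 12-item tasks, the composite ranges from 0 to 24.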

**Figure 2** provides an overview of the score distribution for each of the four developmental assessments conducted across child cohorts. Dot plots were used to highlight variations in the number of children obtaining a target score. **Table 1** provides a summary of the descriptive statistics reflecting children's phonological awareness and expressive vocabulary. Mean score and range reflect the number of correct items (raw scores). While mean scores increased with age for all language-related skills, results revealed (1) stark individual differences within the same age group and (2) overlap in scores across age groups for rhyme and expressive vocabulary. For the phonological tasks targeting the awareness of phonemic units (onset segments and individual phonemes), children in kindergarten had overall great difficulty completing the tasks (despite being familiarized with pre-test items), while children in the first grade could complete the tasks with various levels of proficiency.

TABLE 2 | Results from the linear mixed effects model testing for age comparisons in coarticulation degree between the 4-year-old group (K1), 5-year-old group (K2), and 7-year-old group (P1).

The Welch *t* test was conducted to test for developmental differences in phonological awareness and vocabulary. Performance on rhyme production for the scoring procedure we employed did not yield any significant differences among age groups (K1–K2: *t* = −0.58, df = 17.47, *p* < 0.6; K1–P1: *t* = −0.58238, df = 17.47, *p* < 0.6; K2–P1: *t* = −1.9085, df = 12.524, *p* < 0.08). With regard to the composite score computed to target the awareness of phonemic units, 5-year-old children (K2) did not differ in performance from 4-year-olds (K1) (*t* = −1, df = 12, *p* < 0.4). Only 7-year-old children (P1) showed greater proficiency than K2 (*t* = −15.572, df = 21.128, *p* < 0.0001) and K1 (*t* = −30.006, df = 14, *p* < 0.0001). Hence, a developmental increase in awareness of segmental units was found between children in kindergarten altogether and those in the first year of primary school, which yielded an overall high correlation between age and PA composite of 0.9 (*p* < 0.0001). Regarding vocabulary, similar directions were found. K1 children did not exhibit lower proficiency than K2 children (*t* = −0.95914, df = 19.728, *p* < 0.4) but did when compared to P1 children (*t* = −7.0665, df = 16.375, *p* < 0.0001). K2 children also had lower vocabulary scores than P1 children (*t* = −4.0338, df = 16.257, *p* < 0.001). However, unlike for phonemic awareness, the correlation between age and vocabulary was not significant (0.12, *p* < 0.3).
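The non-integer degrees of freedom reported above are characteristic of the Welch test, which does not assume equal variances across groups. A minimal sketch of the statistic and the Welch-Satterthwaite degrees of freedom follows; the function name `welch_t` and the scores are hypothetical, and the *p* value (which requires the *t* distribution) is left out.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two group means (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Invented scores for two groups of unequal size and variance.
younger = [1, 2, 3, 4]
older = [2, 4, 6, 8, 10]

t, df = welch_t(younger, older)
print(round(t, 3), round(df, 3))
```

A negative *t* here simply reflects the order of subtraction (first group minus second), as in the comparisons reported in the text.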

#### Interaction Between Phonological Awareness and Coarticulation Degree

Given the results from the developmental assessments, we adopted the following statistical approach: we first tested the interaction between *rhyme* proficiency as an index of intermediate unit-size awareness and coarticulation degree for all children. We then further tested for a separate interaction between phonemic awareness (COMPOSITE\_PA, named PA for short hereafter) or vocabulary (VOC) and coarticulation degree. We conducted GAM analyses to illuminate potentially nonlinear interactions.

First and foremost, an interaction between rhyme awareness and coarticulation degree was found across all three consonantal contexts (*p* < 0.0001). More specifically, greater rhyming skills were associated with lower coarticulation degree. Furthermore, the estimated degrees of freedom (edf) were all above 1, which indicates that rhyme proficiency was nonlinearly related to an increase in children's coarticulation scores. Nonlinear interactions between rhyme and coarticulation degree were found in each consonantal context (**Table 3**). The nonlinearity was the highest in the alveolar context (edf: 10.778), followed by the velar and labial contexts. This means that the pattern of interaction between rhyme and coarticulation degree was specific to the gestural organization of the consonant-vowel combinations.

**Table 4** presents an overview of the GAM model testing for an interaction between phonemic awareness (PA) and coarticulation degree. A negative correlation was found, that is, greater phonemic proficiency coincided with lower coarticulation degree. This interaction differed significantly across consonant contexts (*p* < 0.0001). The nonlinearity of the interaction was again the most prominent in the alveolar context and lowest in the labial context. **Figure 3** presents three-dimensional visualizations of the nonlinear interaction patterns obtained for each consonantal context, called terrain maps. These visualizations (also called contour plots) provide further insights into the direction of the observed interaction between PA and coarticulation degree. More specifically, they depict differences in the tongue dorsum position during the production of each stop consonant (/b, d, g/ from left to right plot) with respect to the tongue dorsum position during the production of the subsequent target vowel (*y*-axis) as a function of children's PA score (*x*-axis). In the plot, changes are expressed by means of a color scaling. The color scheme in the small upper right rectangle provides a referential color coding for various tongue dorsum positions scaled from 0 to 1. While blue shades characterize more anterior tongue dorsum positions (as expected for anterior vowels such as /i/), orange shades correspond to more posterior tongue positions (e.g., for /u/). The full-size plots themselves display the tongue position during the consonant as a function of its subsequent vowel position (*y*-axis) and PA scores obtained (value on the *x*-axis). 
If the tongue dorsum position of the consonant is highly influenced by the upcoming vowel (i.e., if coarticulation degree is high), the color distribution within the plots is expected to resemble the referential color scaling provided for the vowel tongue dorsum positions (i.e., yellow color for more posterior and blue color for more anterior tongue dorsum positions). The red contour lines are used similarly to isolines in topographic maps (e.g., for hiking) to indicate locations sharing the same (predicted, based on all trials) value. Here, the values are not altitude landmarks, but tongue dorsum positions. Hence, red contour lines characterize locations of identical consonant tongue dorsum positions across a set of PA scores (from 0 to 24) as a function of their vocalic environment. The direction and shape of the contour line indicate whether changes in tongue dorsum position are linear (straight line) or not (curved line).

TABLE 3 | Tensor smooth terms of the generalized additive model testing for an interaction between rhyme and coarticulation degree for all children per consonantal context /b/, /d/, /g/. edf: estimated degrees of freedom.

TABLE 4 | Tensor smooth terms of the generalized additive model testing for an interaction between phonemic awareness (composite\_PA) and coarticulation degree for all children per consonantal context /b/, /d/, /g/.

Let us now take a concrete example. In the labial context /b/, we can see that for a target vocalic tongue dorsum position of 0.3 (value on the y-axis), the corresponding position at the consonant midpoint is about 0.4 (value on the red contour line) for children who have obtained a PA score close to 0. From a score of 10 upward, the tongue dorsum position during the consonant becomes slightly more posterior (i.e., above the 0.4 red contour line, hence further away from the target 0.3 value for its subsequent vowel).

Moving on to the alveolar context, it can be noted that the position of the tongue dorsum during the alveolar /d/ stop remains overall in a central (green shade) to anterior position (blue shade) regardless of the upcoming vowel. This shows that the tongue dorsum position during the alveolar stop resists vocalic influences due to more immediate gestural constraints requiring a more anterior to central tongue dorsum position. However, scores starting from 10 (about half the maximal score) onward are associated with a change toward a more central tongue dorsum position as compared to children with poorer PA scores. In labial and velar contexts, the color scaling characterizes more faithfully the range of vocalic targets in the antero-posterior dimension: from blue for anterior vowels to orange for more posterior vowels. This is very clear for children with a poor PA score: the tongue dorsum position for all vowels is well anticipated in the consonant. The color patterning differs in children with higher PA scores reflecting a more central tongue dorsum position (larger green portion) and hence

column) as a function of tongue dorsum position for target vowels (*y*-axis) and composite phonological awareness scores from 0 (the minimal score obtained) to the maximal score of 25 (*x*-axis).

a lower coarticulation degree. Furthermore, in the velar context, the contour lines are flatter with central vowels (e.g., on the *y*-axis: 0.5–0.6 values) and more nonlinear in the context of posterior vowels (0.8 and above). In the labial context, the interaction between phonemic awareness and coarticulation degree is slightly nonlinear (edf value: 3). In **Figure 3**, the red contour lines look overall flat, except with anterior vowels (e.g., 0.3 value and below). Overall, **Figure 3** shows that the interaction of PA and coarticulation degree (1) approximates linearity in labial and velar contexts, contrary to the alveolar context, and (2) varies as a function of the various combinations of individual consonants and vowels. The implications of these nonlinear relationships between phonological and motor domains are discussed in section "Nonlinear Interactions Between Vocabulary, Phonological Awareness, and Coarticulatory Organization."

These visual outputs differ markedly from standard numerical reports. They are valuable for speech production research in general, and even more so for the developmental field (e.g., **Figure 3**), because the continuous color scaling used in these plots can reveal gradients in target effects or interactions between parameters and hence potentially identify nonlinear patterning. In the case of spoken language acquisition, these plots make it possible to depart from categorizing children's articulations in terms of abstract phonological targets (which they are in the process of acquiring) and instead obtain more faithful descriptions of the variety of articulatory expressions for a given target. This type of description is particularly relevant in the developmental field because, like adults (and to an even greater extent than adults), children do not produce words or segments uniformly across repetitions. Acoustic and articulatory variability are indeed ubiquitous in child speech (e.g., Heisler et al., 2010). The color scaling in the GAM contour plots hence provides a fair depiction of the variations in tongue dorsum positions within regions associated with a specific target (e.g., individual vowels) or in interaction with a phonetic environment (e.g., a specific vowel in the context of a specific consonant).
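For readers less familiar with generalized additive models, the surfaces shown in such contour plots can be read as a smooth function of two predictors. The schematic form below is an assumption made for illustration only: the symbols are ours, and the sketch omits any random effects or further terms that the fitted models may contain.

$$
C_{ij} = \beta_0 + f\left(V_{ij}, \mathrm{PA}_i\right) + \varepsilon_{ij}
$$

Here $C_{ij}$ stands for the tongue dorsum position at the consonant midpoint for child $i$ on trial $j$, $V_{ij}$ for the tongue dorsum position of the target vowel, $\mathrm{PA}_i$ for the child's composite phonological awareness score, and $f$ for a smooth two-dimensional (tensor) term. Under this reading, each contour line is the set of $(\mathrm{PA}, V)$ pairs for which the predicted value $\beta_0 + f(V, \mathrm{PA})$ is constant, and the curvature of those lines reflects how far $f$ departs from a plane.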

#### Interaction Between Expressive Vocabulary and Coarticulation Degree

Last, we tested for an interaction between children's expressive vocabulary and their pattern of coarticulation degree. A significant effect was found in all three consonantal contexts (**Table 5**, *p* < 0.0001). Overall, nonlinear patterns of interactions between domains were noted. However, those were not uniform across consonant and vowel combinations (**Figure 4**). In the

TABLE 5 | Tensor function terms of the generalized additive model testing for an interaction between expressive vocabulary and coarticulation degree for all children per consonantal context /b/, /d/, /g/.


labial context, an increase in vocabulary score coincides with a lower coarticulation degree. For example, for anterior vowels that have a 0.2 tongue dorsum position value (*y*-axis), the corresponding tongue dorsum position during the labial stop production has a value of 0.3 in children with low vocabulary but close to 0.4 in children with advanced vocabulary. Similar trends are observed in syllables including an alveolar onset, but this time the interaction between vocabulary and coarticulation degree is more nonlinear (more pronounced curved lines) and complex than in the labial context. For children with more proficient vocabulary (e.g., score 16 upward), the tongue dorsum position is slightly more central in the case of anterior vowels (e.g., 0.2). Consonantal tongue positions in the context of central vowels (e.g., 0.6) are characterized by a slightly oscillatory behavior from more to less to more central. Tongue position for the alveolar stop flanked by posterior vowels (e.g., 0.8) also shows a nonlinear pattern with an overall central tongue dorsum position. Last, in the velar context, the relation between vocabulary and coarticulation degree also translates into slightly more central tongue dorsum positions in children with higher vocabulary scores. To summarize, greater expressive vocabulary is associated with a more central tongue dorsum during the consonant and hence lesser influence from individual vowels.
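The summary above can be made concrete with a small numerical sketch. It assumes, for illustration only, that coarticulation degree is summarized as the least-squares slope of consonant tongue dorsum position on vowel tongue dorsum position (a slope near 1 meaning the consonant tracks the vowel, a slope near 0 meaning it does not). This section does not define the measure that way, and all values below are hypothetical, loosely echoing the labial example (vowel at 0.2, consonant at about 0.3 versus 0.4).

```python
from statistics import mean


def coarticulation_slope(vowel_pos, consonant_pos):
    """Least-squares slope of consonant position regressed on vowel position.

    Assumed summary measure (not taken from the study): a slope near 1 means
    the consonant follows the vowel closely; a slope near 0 means the
    consonant position is largely independent of the vowel.
    """
    v_bar, c_bar = mean(vowel_pos), mean(consonant_pos)
    sxy = sum((v - v_bar) * (c - c_bar) for v, c in zip(vowel_pos, consonant_pos))
    sxx = sum((v - v_bar) ** 2 for v in vowel_pos)
    if sxx == 0:
        raise ValueError("vowel positions must vary to estimate a slope")
    return sxy / sxx


# Hypothetical normalized tongue dorsum positions (lower = more anterior).
vowels = [0.2, 0.4, 0.6, 0.8]
low_vocab_consonants = [0.28, 0.44, 0.60, 0.76]   # stays close to each vowel
high_vocab_consonants = [0.40, 0.46, 0.54, 0.60]  # pulled toward the center

slope_low = coarticulation_slope(vowels, low_vocab_consonants)
slope_high = coarticulation_slope(vowels, high_vocab_consonants)

print(f"low vocabulary:  slope = {slope_low:.2f}")
print(f"high vocabulary: slope = {slope_high:.2f}")
```

In this toy data set the hypothetical high-vocabulary child has the shallower slope, which is the numerical counterpart of a more central consonantal tongue dorsum and a lesser influence from individual vowels.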

# DISCUSSION

In this study, we asked whether children's phonological awareness and expressive vocabulary have an impact on *anticipatory coarticulation*. Our general motivation for this research stemmed from independent findings made in speech motor control and developmental phonology suggesting an increasing access to and use of phonemic units during the kindergarten-to-primary school period. Results drawn from a cross-sectional investigation of 41 children provide the first empirical evidence that vocabulary and phonological awareness interact dynamically with coarticulation degree during the period from kindergarten to primary school. In general, greater phonemic awareness and vocabulary were associated with greater segmental differentiation of tongue gestures in children's coarticulatory organization. We expand below on the implications of those findings for the development of spoken language fluency.

#### Age-Related Versus Skill-Based Descriptions of Spoken Language Development

In the past decade, a fair amount of empirical research has reported greater vocabulary and phonological awareness in school-aged children than in children in kindergarten (in German: Kauschke, 2000; Wimmer and Mayringer, 2002; Schäfer et al., 2014; in English: Carroll et al., 2003; Ziegler and Goswami, 2005). However, results from the present study suggest that age-driven categorizations are not always the only suitable way to characterize skill development, or at least that they may underestimate its complexity. Several findings uphold this argument.

First of all, the language-related assessments conducted in this study provide a mixed validation of prior findings regarding a developmental increase in expressive vocabulary and phonological awareness. Indeed, our sample of kindergarten children was seemingly as proficient as first-grade children in expressive vocabulary, as attested by the absence of significant age differences. Likewise, they were as proficient as first-grade children in their rhyming skills, which suggests that by the age of 4.5, they have gained awareness of *intermediate* size phonological components. This may be due to rhyming practices being initiated early in age, *via* singing, counting, and rhyming games at home or in kindergarten. With respect to tasks probing phonemic units, the two youngest cohorts did not differ from each other but showed significantly lower awareness than school-aged children at age 7. Interestingly, in our study, the only 5-year-old who could actually perform the phonemic task was able to read a few words and had knowledge about some letters. Hence, success in these tasks may emerge only once children have been explicitly trained in phonemic decoding/encoding, either in primary school in the context of reading acquisition (e.g., Ziegler and Goswami, 2005; Schäfer et al., 2014) or with parents at home. We discuss this point further in section "An Integrated-Interactive Approach to Skill Development."

Second, children within the same age group did not all behave in the same way but instead exhibited substantial individual variability (**Figure 2**), a phenomenon also previously noted (e.g., review in Sosa and Stoel-Gammon, 2012; see also Wimmer and Mayringer, 2002; Schäfer et al., 2014). In the present study, this was the case in all three age groups and for all assessments, except for tasks probing phonemic awareness in kindergarteners (onset segments, phoneme synthesis), for which we noted a floor effect. Regarding first-grade children, it seems that while they have gained substantial awareness of sub-lexical units in comparison to children in kindergarten, it takes longer to be fully proficient in manipulating phonemic units (cf. the score distribution, **Figure 2**). Regarding vocabulary, wide disparities across children of the same age are well established (e.g., CDI reports within and across languages). Similar conclusions have been drawn regarding children's coarticulatory patterns (e.g., at 4 years of age in Nittrouer and Burton, 2005; Barbier et al., 2015; at 5 years of age in Zharkova, 2017; overlap between 3–4-year and 5-year olds in Noiray et al., 2019), and here again with no systematic age-related difference in coarticulatory degree across consonantal contexts.

It is not uncommon for developmental researchers to point to between-age overlaps and/or substantial within age-group differences in various abilities. The question is then why those differences are observed. A simple answer may be that children are at different individual stages in their developmental trajectory. For instance, well-attested vocabulary spurts seem to depend on pre-existing achievements (e.g. reaching the 50 words milestone) rather than be the result of biological age progression (see review of lexical development in Nazzi and Bertoncini, 2003). Other studies have underlined stronger developmental dependencies based on proficiency rather than age (e.g., between phonological development and motor ability, e.g., Smith, 2006; Goffman, 2010; between vocabulary and production accuracy, e.g., Edwards et al., 2004; Vihman and Croft, 2007). When that is the case, age-related interpretations are problematic because they may attribute evidence (e.g., a decrease in coarticulation degree) to the wrong source or hide complex relationships between factors that are individual-specific rather than age-dependent. This is not to argue that age does not matter: the development of speech motor skill along with lexical and phonological knowledge can actually be described within a maturational perspective because all skills develop in the time domain. It is hence not surprising that correlations between age and phonological awareness were found in our study – albeit not with all PA tasks and not with vocabulary. However, while age-based descriptions of language acquisition may be interpreted in the perspective of biologically-driven developments, it may instead be the effect of experience upon the learning mechanism (i.e., the exposure to and practice speaking the language) that gives maturation its transformational power (e.g., in perception: Kuhl et al., 1992; Hay, 2018). 
Uncovering how experience shapes (spoken) language acquisition independent of age has been not only a thrilling but also an enduring challenge for psycholinguists because experience unfolds over an extended time scale and results from multiple interactions in a continuously variable environment that remains difficult to replicate in the lab.

To summarize, the results reported in this study provide good incentives for future research to draw skill-based comparisons of children's linguistic ability. With this approach, we will not only account for the complex developmental relationships across domains taking place in the first decade of life, we will also better capture the complexity of (spoken) language acquisition arising from both experience-based and biologically driven processes than if our analyses are restricted to age comparisons. This leads us to the discussion of the role of skill interactions for (spoken) language development.

#### Nonlinear Interactions Between Vocabulary, Phonological Awareness, and Coarticulatory Organization

As reported in previous sections, no uniformly strong differences in coarticulation degree emerged between 4-, 5- and 7-year-old children (**Table 2**). However, children showing poor phonological awareness exhibited overall greater coarticulation degree than children with higher scores. This suggests that for children with poorer phonemic representations, lingual gestures for consecutive consonants and vowels may be activated together with substantial vocalic anticipation. Further, we noted no uniform relation between coarticulation and phonemic awareness across children's scores, by which each unit change in one domain would result in an equivalent (linear) unit change in the other domain of interest. In our sample of children, the relationship between domains was non-linear and therefore more complex: an increase in children's phonemic awareness score was at times not associated with any equivalent change in coarticulatory pattern until reaching a certain stage. Last, those non-linear interactions varied across phonetic contexts (cf. edf values). The shape of the skill interactions indeed differed as a function of the identity of the coarticulated consonants and vowels and the compatibility of their gestural goals (cf. colored terrain maps). For instance, in the case of a syllable involving two gestures from two anatomically distinct organs (the lips for the labial /b/ and the tongue for any vowel), vocalic influences remained high regardless of children's phonemic proficiency (rather flat isolines and all colors well represented; **Figure 3**). However, in the context of the alveolar /d/ stop that involves two consecutive lingual gestures within a short temporal span (tongue dorsum for both /d/ and subsequent vowels), non-linear interactions were more noticeable.
Children with advanced awareness of the smallest phonemic units (e.g., higher scores) exhibited slightly more central tongue dorsum positions than children with poorer ability (larger blue portion characterizing an anterior tongue position). This suggests a gradual functional decoupling between the anterior (tip-blade) and the posterior subparts of the tongue (dorsum-back). While the tongue remains in a rather anterior position during the alveolar stop production, the tongue dorsum seems a little more central, as if to anticipate the production of the upcoming vocalic gesture. Non-linear interactions were also visible in syllables including a velar onset. Variation in phonemic awareness coincided with variation in the palatal-to-velar constriction location as a function of the vowel (see Recasens, 2014). While lower phonemic awareness was associated with greater vocalic influences (full color scale represented, **Figure 3**), greater awareness correlated with more central tongue positions during the consonant articulation. This finding corroborates previous research reporting a lack of speech motor independence at an early age (e.g., Nittrouer et al., 1996) and provides additional evidence for an important interaction with phonemic awareness, which seems particularly relevant for the coarticulation of complex gestural goals involving a single organ.

Nonlinearities were also observed in the interaction between vocabulary and coarticulatory patterns. First, results indicated that children with greater expressive vocabulary showed lower intra-syllabic coarticulation degree independently of age (cf. 0.12 correlation) and hence greater sensitivity to the gestural demands underlying various consonant-vowel combinations, while children with poorer vocabulary showed larger coarticulatory units with greater vocalic influence over previous consonants. Given numerous findings supporting a lexically grounded development of phonological representations and its impact on production accuracy (e.g., Ferguson and Farwell, 1975; Metsala, 1999; Beckman and Edwards, 2000; Edwards et al., 2004, 2011; Munson et al., 2005; Vihman and Keren-Portnoy, 2013), our results supplement existing evidence that a rich lexical repertoire leads to greater phonological differentiation, by showing it may also support greater motor differentiation and flexibility in coarticulatory patterns depending on the gestural demands associated with consecutive segments. In the present study, the interaction between vocabulary and coarticulation degree in the alveolar context provides a compelling example that children with more proficient vocabulary show greater differentiation between the tongue dorsum and tongue tip for coarticulating consecutive consonantal and vocalic gestures recruiting the same organ. Second, the nonlinear nature of the interaction between vocabulary and coarticulation also suggests that the coupling between domains does not develop incrementally but rather that it may be when individual children reach a certain size of expressive vocabulary that the interaction with production weighs in children's coarticulatory organization.

Taken together, results support the view of a by-stage approach to skill development. Milestones and developmental stages have long been identified in various developmental domains (e.g., walking: Thelen and Smith, 1994; perception: e.g. Best, 1994; Maye et al., 2002; Werker, 2018; spoken language: e.g., Kuhl, 2011; language processing: e.g., Vilain et al., 2019) and provide researchers with referential landmarks for a better understanding of typical trajectories, as well as useful tools for the diagnosis and prediction of potential deviations from typical pathways. In the domain of spoken language development, canonical babbling stands as an undisputed milestone allowing children to move toward a more complex quality of the speech production skill (e.g., production of the first meaningful words). This study points to a similar mechanism for skill interaction. In the same way children continuously develop individual skills (e.g., spoken language, expressive vocabulary), there may be milestones and developmental stages characterizing periods for which an interaction is (more significantly) activated. The outcome of this interaction would lead children to progress toward a new developmental stage. Taking again the relation between phonemic awareness and coarticulation, an average score reaching above 10 may characterize a developmental stage by which phonemic differentiation is maturing both at the representational and speech motor levels.
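The contrast between an incremental and a stage-like coupling can be illustrated with a toy sketch. The two functions below are hypothetical and are not fitted to the data: the threshold of 10 echoes the score mentioned above, while the maximal score of 24 and the coarticulation values 0.9 and 0.5 are arbitrary assumptions chosen only to make the shapes visible.

```python
def linear_relation(pa_score, start=0.9, end=0.5, max_score=24):
    """Incremental coupling: every PA point lowers coarticulation equally."""
    return start + (end - start) * pa_score / max_score


def stage_relation(pa_score, threshold=10, before=0.9, after=0.5):
    """Stage-like coupling: no change until a threshold score is reached."""
    return before if pa_score < threshold else after


def successive_changes(relation, scores):
    """Change in predicted coarticulation between consecutive scores."""
    values = [relation(s) for s in scores]
    return [later - earlier for earlier, later in zip(values, values[1:])]


scores = list(range(0, 25, 4))  # hypothetical PA scores: 0, 4, ..., 24
linear_steps = successive_changes(linear_relation, scores)
stage_steps = successive_changes(stage_relation, scores)

print("linear:", [round(step, 3) for step in linear_steps])
print("stage: ", [round(step, 3) for step in stage_steps])
```

Under the incremental assumption every step in PA score brings the same change, whereas under the stage-like assumption the change is concentrated in the single step that crosses the threshold. The patterns reported here are more gradual than this caricature, but they resemble the second shape more than the first.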

#### An Integrated-Interactive Approach to Skill Development

In a preceding study, we argued that the question "whether children organize their speech in segments versus syllables versus phonological words or lexical items is twofold: It requires finding the phonological units guiding children's speech production and the motor units embedding those higher-level units" (Noiray et al., 2018, p. 8). The research conducted since then motivates us to endorse an integrated-interactive approach to (spoken) language acquisition. By integrated, we mean that the gradually acquired knowledge about different unit types and sizes does not constrain children to move from one organizational scheme to another (e.g., from holistic to segmental representation of speech or vice versa). Instead, this knowledge would integrate into an increasingly complex and flexible language system allowing children to gradually manipulate a greater variety of phonetic compounds and structural organizations (Noiray et al., 2019). At the production level, this integrative process is exemplified in preschool-age children using gradients of coarticulation degree to accommodate the varying gestural demands of consecutive consonants and vowels (Noiray et al., 2019). At the representational level, the way phonological awareness has been traditionally assessed directly reflects an integrative approach to phonological development: children's structural knowledge of their native language is usually tested incrementally with tasks tapping different levels of unit complexity (e.g., words, syllables, rhymes, and segments). Phonological awareness may therefore be envisioned as an integrative learning process: it is only once children have fully integrated all organizational levels and can manipulate them in various ways that they have reached adult-like phonological representations.

The process of combinatoriality is not unique to language. In their discussion of language discreteness, Studdert-Kennedy and Goldstein (2003) remarked on striking structural similarities between the way languages pattern and the way other processes in nature pattern (e.g., in biology, physics, chemistry). They argue for a "particulate principle" (Abler, 1989) under which "units that combine into a larger unit do not disappear or lose their integrity: they can re-emerge or be recovered through mechanisms of physical, chemical, or genetic interaction, or, for language, through the mechanisms of human speech perception and language understanding" (Studdert-Kennedy and Goldstein, 2003, pp. 52–53). Congruent with this theoretical position, we consider a view of (spoken) language in which various structural types of combinations (gestures, segments, syllables, and words) are not mutually exclusive but reflect complementary levels of linguistic organization that all contribute to the richness and complexity of language systems (e.g., Goffman et al., 2008; Noiray et al., 2019). From very early in development, the process of coarticulation itself binds gestures, sounds, and phonetic units together to create compounds that ultimately lend meaning to speech streams. This imparts to coarticulation a special role for (spoken) language development beyond its usual circumscription to low-level motor processes. By tracking the maturation of coarticulatory organization, we can indeed capture the gradual binding of representational and executional levels. Expanding on that view, the present findings provide evidence for subtle differences in the implementation of this relationship due to the very nature of the phonemes represented in children's minds and their motor expressions.
From our preceding studies (Noiray et al., 2013, 2018, 2019; Rubertus and Noiray, 2018) and research conducted in the domains of lexical and phonological development, it seems that holistic and segmental organizations (both in representation and production) develop together, albeit probably at different paces at different times. For instance, lexically based organizations may prevail at an early stage because they support object-word correspondences and referencing which are particularly relevant for children at an early stage of their life, while segmental representations may develop more slowly because they are more abstract and not bound to real-world objects. While variability in individual trajectories is evidently to be expected (e.g., Smith et al., 2010), overall there is converging evidence in typically developing children that these types of organization integrate with one another in the course of developing spoken language fluency (e.g., Vihman, 2015).

Furthermore, we argue for an interactive approach to (spoken) language development in which various skills develop together and are equally important to the uniqueness of human communication. While the literature abounds with studies highlighting developmental interactions between phonological awareness and various cognitive domains (e.g. literacy: Ziegler and Goswami, 2005; or with vocabulary: Charles-Luce and Luce, 1995; Muter et al., 2004; Hilden, 2016), the present study sheds light on the interaction between cognitive and speech motor skills. Results suggest that motor, lexical, and phonological developments collaborate dynamically over time by contact with the language (i.e., *via* increasingly richer exposure and practice speaking the language). This is a fairly significant finding that has various implications.

First, it may challenge models of adult speech production that have suggested a modular approach with lexical, phonological, and motor processes considered as separate components sequentially orchestrated (e.g., Levelt and Wheeldon, 1994, Figure 1; Levelt, 1999, Figure 1). It may also promote a revision of speech production models that have considered interactions across domains but with a top-down approach, whereby motor execution depends on the output of preceding cognitive or neural processes (e.g., in Levelt and Wheeldon's model: motor execution is comprised within phonological encoding but implemented as the final component, p. 245; in Guenther and Vladusich's (2012) DIVA model: between the motor, auditory, and somatosensory domains, Figure 1, review in Tourville and Guenther, 2011). If interactions between the lexical, phonological, and motor domains exist in the developing speech system of children, those should prevail in adults' speech organization, or at least residuals from such relationships may remain. Assuming a developmental continuity from children's to adults' speech production, models of speech production would benefit from taking the ontogenetic findings into account and perhaps adopting a more integrated-interactive perspective. By doing so, it may be possible to move forward in the longstanding quest for determining the nature of the units of speech production (see, for example, discussion in Pierrehumbert, 2003; Hickok, 2014).

Second, the finding of interactions across domains is relevant for the clinical field. Indeed, while predictive studies have usually tested how skill *X* at a time T1 predicts the stage of another skill *Y* at time T2 (e.g., Walley et al., 2003; Edwards et al., 2004), no study has to our knowledge ventured to examine how interactions between specific skills change over developmental time or predict the stage of another interaction at a later time. Although the present study was not designed to demonstrate a specific causal direction in the relationships observed, it is highly likely that speech motor, lexical, and phonological skills mutually influence each other over time. There is enough evidence in infant and child research supporting both directions (e.g., motor, lexical and phonological developments: Menn and Butterworth, 1983; DePaolis et al., 2013; articulatory filter hypothesis: Vihman, 1996; DePaolis et al., 2011; Majorano et al., 2014; phonological templates: Vihman and Croft, 2007; Vihman and Wauquier, 2018; role of articulatory skills for later phonemic awareness). Given that coarticulated speech is initiated years before children gain adult-like knowledge about the structural combinatoriality of their native language, an effect of coarticulatory practice on the development of phonological awareness is not an implausible scenario. In the first 4-to-5 years of life, children acquire a basic awareness of the structural combinatoriality of *sounds* (phonetic awareness) because they can form new words (real words or imaginary creations) and converse comfortably with others. This raises the question whether phonological awareness is indispensable to adult-like fluent speech or only to fluent reading. 
To elucidate whether it is only a by-product of literacy acquisition that happens to create collateral changes to children's speech organization, it will be crucial to examine whether the maturational trajectories of coarticulatory patterns in illiterate adults or children are similar to those of literate children. If they are, it may suggest that developing adult-like coarticulatory patterns does not entail any advanced awareness of the structural combinatoriality of the native language. Instead, maturation of coarticulatory patterns may relate more to children tuning their speech motor system to the phonetic regularities of their native language and therefore interact more significantly with perceptual rather than phonological development. Expanding on this hypothesis, the process of language acquisition may encompass two types of interactions: one serving oral communication and primarily involving perceptual, motor, and lexical skills; another one developing in a more protracted fashion for the purpose of literacy acquisition and involving primary interactions between motor, lexical, and phonological skills. Comparisons with preschool-aged children with advanced phonemic awareness would also provide a compelling experimental framework for assessing the role of phonological awareness with respect to speech motor control skill for developing adult-like patterns of coarticulation. In a recently funded project, we have initiated a first step in this direction, testing for interactions between various levels of phonological awareness, reading proficiency, and production fluency in typically developing school-aged children (Popescu and Noiray, 2019) in comparison to children at risk of or diagnosed with reading disorders.

#### Limitations and Perspectives for Future Research

Overall, results from the present study provide strong evidence that the process of developing spoken language fluency encompasses dynamic interactions between vocabulary, phonological awareness, and speech motor control in German children. While this represents a promising first step, further empirical work is obviously needed to understand these multidimensional interactions in greater detail. Generalized additive modeling (GAM) represents an innovative and powerful method because it can unveil nonlinear relationships between cognitive and motor domains and estimate their interrelated change over time. In the present study, it was possible to use GAMs to illuminate nonlinear patterns of interactions, which would have remained hidden if we had used linear mixed models. Note, however, that our dataset presents some weaknesses. For instance, because the examination of vocabulary was limited to nouns in this study, our assessment of children's expressive lexicon was limited, and hence, correlations should be considered with caution. As mentioned earlier, it was not possible to reliably test for the combined effect of vocabulary together with phonological awareness on coarticulation degree due to dataset requirements (e.g., recording many more children and obtaining many more scores per participant). For generalized additive modeling to provide reliable results, investigations with large samples are also necessary, which remain challenging in the developmental field due to various methodological constraints and time-consuming data processing. However, given the growing statistical expertise among developmental psycholinguists combined with greater effort to conduct synergistic data collection across laboratories, there is no doubt that future quantitative studies will succeed in teasing apart their (in)dependent effects on the development of spoken language fluency.
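The point that a linear model can hide a nonlinear pattern can be sketched with a toy comparison. The example below is not a GAM and not the analysis used in this study: it swaps in a simple Gaussian-kernel smoother as a stand-in for a flexible fit, uses noise-free hypothetical data with a step at a PA score of 10, and compares in-sample residual sums of squares only. The bandwidth and all values are assumptions.

```python
import math
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares straight line; returns a prediction function."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
             / sum((a - x_bar) ** 2 for a in x))
    intercept = y_bar - slope * x_bar
    return lambda value: intercept + slope * value


def kernel_smoother(x, y, bandwidth=1.5):
    """Gaussian-kernel weighted average (a stand-in for a flexible smooth)."""
    def predict(value):
        weights = [math.exp(-0.5 * ((value - a) / bandwidth) ** 2) for a in x]
        return sum(w * b for w, b in zip(weights, y)) / sum(weights)
    return predict


def residual_sum_of_squares(predict, x, y):
    """In-sample squared error of a prediction function."""
    return sum((b - predict(a)) ** 2 for a, b in zip(x, y))


# Hypothetical data: coarticulation degree drops once PA reaches 10.
pa_scores = list(range(25))
coarticulation = [0.9 if score < 10 else 0.5 for score in pa_scores]

rss_line = residual_sum_of_squares(
    fit_line(pa_scores, coarticulation), pa_scores, coarticulation)
rss_smooth = residual_sum_of_squares(
    kernel_smoother(pa_scores, coarticulation), pa_scores, coarticulation)

print(f"straight line:   RSS = {rss_line:.3f}")
print(f"kernel smoother: RSS = {rss_smooth:.3f}")
```

The straight line spreads the drop evenly across all scores and therefore misses where the change occurs, while the flexible fit localizes it. An actual GAM additionally penalizes wiggliness and supports formal tests of the smooth terms, which this sketch does not attempt.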

The present study is part of a longer-term project aiming to elucidate whether the expansion of vocabulary and phonological awareness contributes to increasingly more segmentally specified coarticulatory organizations from kindergarten to primary school. This question is not only important for theories of language acquisition but also for clinical practice. Assessments of deviant coarticulatory patterns have primarily tested their motor origins (e.g., apraxia of speech: Nijland et al., 2002; speech sound disorder: Maas and Mailend, 2012; phonological disorders: Gibbon, 1999; stuttering: Lenoci and Ricci, 2018). Evidence of an intricate relationship with other linguistic components of the language system would certainly affect the way diagnosis and treatment are envisioned. The opposite question whether increased practice coarticulating a wide range of phonetic combinations supports greater phonemic differentiation and the stabilization of motor correspondences would be equally exciting in terms of its implications for language-related cognitive development. In this study, we have first demonstrated that important interactions between cognitive and motor domains occur in the course of developing spoken language fluency. We believe our findings now warrant longitudinal investigations to further test whether the interactions observed are bi-directional and hence fundamental to the growth of each individual skill or unilateral.

Last, if phonological awareness is the knowledge of the discrete and coarticulation represents its continuous articulatory-acoustic make-up, it will be important in future studies to design analytical approaches that can adequately account for the development of this intricate relationship over time. Dynamical systems seem a promising avenue in that respect. In a recent discussion of speech dynamics, Iskarous emphasizes that dynamical systems "do not assume separate sets of principles to describe discrete and continuous aspects of a system. Rather, the discrete description is shown to predict the continuous one, using the concept of a differential equation" (Iskarous, 2017, p. 8). The present study provides an ontogenetic perspective illustrating how access to various levels of phonological discreteness (words, syllables, segments) interacts with the organization of the continuous: from the production of syllabic entities to the fine integration of segmentally specified gestures. In future research on this topic, we aim to combine dynamical systems theory with longitudinal data to address precisely how this dynamical relationship unfolds in the developing language system of children.

#### CONCLUSION

The present study tested whether developmental differences in coarticulation degree widely reported in the literature over the past decades were strictly related to maturational differences in speech motor abilities or also interacted with children's language-related abilities. An examination of children's coarticulatory patterns in relation to their lexical and phonological proficiency allowed us to uncover developmental differences that would remain unexplained if each skill were considered separately. Other domains, which have not been examined in the present study, are likely to play a role and should be thoroughly considered in future studies (e.g., assessment of literacy, phonological memory). The questions of which skill interactions allow children to become fluent language users and how those interactions evolve dynamically over time have become pressing issues for developmental researchers. However, for these questions to be answered, interdisciplinary collaborations between developmental biology, psychology, and linguistics will be necessary. While all domains have separately argued that multiple developments are intricately connected over time, only actual collaborations across disciplines will generate a unified account of language development.

#### DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

#### ETHICS STATEMENT

The study reported in the manuscript has been approved by the Ethics Committee of the University of Potsdam in Germany. The goals of the research, the child population recorded, the method, and the recruitment procedure were described to and reviewed by the Committee prior to its giving a positive review.

#### AUTHOR CONTRIBUTIONS

AN provided the theoretical framework of the study, obtained the funding, and designed the empirical questions resulting in the manuscript. AN and AP conceptualized and designed the statistical analyses. AN, AP, and LH organized the dataset for subsequent statistical analyses. AP performed all statistical analyses. AN, AP, HK, ER, SK, and LH contributed to ultrasound data collection and processing and/or administration and scoring of the behavioural assessments. HK trained the team in administering and scoring the developmental assessments. AN wrote the manuscript. AN and AP provided all visualizations and edited the first draft. HK, ER, and SK provided feedback on the pre-final draft. All authors read the manuscript and agreed on its submission.

#### FUNDING

This research was generously supported by the Deutsche Forschungsgemeinschaft (DFG) grant N° 255676067 and 1098 and PredictAble (Marie Skłodowska-Curie Actions, H2020- MSCA-ITN-2014, N° 641858).

#### ACKNOWLEDGMENTS

Many colleagues have contributed to the success of this study, and we are indebted to them: Martijn Wieling for his careful guidance in the statistical analyses of the present dataset and Bodo Winter for useful related advice; Jan Ries and Mark Tiede for co-developing the SOLLAR platform used in this research; the BabyLab at the University of Potsdam for recruitment assistance (in particular Barbara Höhle and Tom Fritzsche); the team at the Laboratory for Oral Language Acquisition (LOLA) involved in data recording and processing; and all participants enrolled in the study. We thank two reviewers for their thorough and insightful input. We are also grateful to Carol Fowler for stimulating discussions and for reviewing an earlier draft of this manuscript. Last, we thank the various scholars cited in this manuscript whose referential work has been a great source of inspiration. In that respect, we extend a special thought to Michael Studdert-Kennedy, who first sparked enthusiasm for this research. The publishing of this manuscript was supported by the Deutsche Forschungsgemeinschaft (DFG) and the Publishing Fund of the University of Potsdam.

#### REFERENCES

on Stoel-Gammon's 'relationships between lexical and phonological development in young children'. *J. Child Lang.* 38, 35–40. doi: 10.1017/S0305000910000450


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Noiray, Popescu, Killmer, Rubertus, Krüger and Hintermeier. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# A Simple 3-Parameter Model for Examining Adaptation in Speech and Voice Production

Elaine Kearney<sup>1</sup>\*, Alfonso Nieto-Castañón<sup>1</sup>, Hasini R. Weerathunge<sup>2</sup>, Riccardo Falsini<sup>1</sup>, Ayoub Daliri<sup>3</sup>, Defne Abur<sup>1</sup>, Kirrie J. Ballard<sup>4</sup>, Soo-Eun Chang<sup>5,6</sup>, Sara-Ching Chao<sup>3</sup>, Elizabeth S. Heller Murray<sup>1</sup>, Terri L. Scott<sup>7</sup> and Frank H. Guenther<sup>1,2,8,9</sup>

<sup>1</sup> Department of Speech, Language, and Hearing Sciences, Boston University, Boston, MA, United States, <sup>2</sup> Department of Biomedical Engineering, Boston University, Boston, MA, United States, <sup>3</sup> Department of Speech and Hearing Science, Arizona State University, Tempe, AZ, United States, <sup>4</sup> Faculty of Health Sciences, The University of Sydney, Sydney, NSW, Australia, <sup>5</sup> Department of Psychiatry, University of Michigan, Ann Arbor, MI, United States, <sup>6</sup> Cognitive Imaging Research Center, Department of Radiology, Michigan State University, East Lansing, MI, United States, <sup>7</sup> Graduate Program for Neuroscience, Boston University, Boston, MA, United States, <sup>8</sup> The Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, United States, <sup>9</sup> Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, MA, United States

#### Edited by:

Pascal van Lieshout, University of Toronto, Canada

#### Reviewed by:

Douglas M. Shiller, Université de Montréal, Canada; Ben Parrell, University of Wisconsin–Madison, United States

> \*Correspondence: Elaine Kearney ekearney@bu.edu

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 29 April 2019 Accepted: 17 December 2019 Published: 21 January 2020

#### Citation:

Kearney E, Nieto-Castañón A, Weerathunge HR, Falsini R, Daliri A, Abur D, Ballard KJ, Chang S-E, Chao S-C, Heller Murray ES, Scott TL and Guenther FH (2020) A Simple 3-Parameter Model for Examining Adaptation in Speech and Voice Production. Front. Psychol. 10:2995. doi: 10.3389/fpsyg.2019.02995

Sensorimotor adaptation experiments are commonly used to examine motor learning behavior and to uncover information about the underlying control mechanisms of many motor behaviors, including speech production. In the speech and voice domains, aspects of the acoustic signal are shifted/perturbed over time via auditory feedback manipulations. In response, speakers alter their production in the opposite direction of the shift so that their perceived production is closer to what they intended. This process relies on a combination of feedback and feedforward control mechanisms that are difficult to disentangle. The current study describes and tests a simple 3-parameter mathematical model that quantifies the relative contribution of feedback and feedforward control mechanisms to sensorimotor adaptation. The model is a simplified version of the DIVA model, an adaptive neural network model of speech motor control. The three fitting parameters of SimpleDIVA are associated with the three key subsystems involved in speech motor control, namely auditory feedback control, somatosensory feedback control, and feedforward control. The model is tested through computer simulations that identify optimal model fits to six existing sensorimotor adaptation datasets. We show its utility in (1) interpreting the results of adaptation experiments involving the first and second formant frequencies as well as fundamental frequency; (2) assessing the effects of masking noise in adaptation paradigms; (3) fitting more than one perturbation dimension simultaneously; (4) examining sensorimotor adaptation at different timepoints in the production signal; and (5) quantitatively predicting responses in one experiment using parameters derived from another experiment. The model simulations produce excellent fits to real data across different types of perturbations and experimental paradigms (mean correlation between data and model fits across all six studies = 0.95 ± 0.02). The model parameters provide a mechanistic explanation for the behavioral responses to the adaptation paradigm that is not readily available from the behavioral responses alone. Overall, SimpleDIVA offers new insights into speech and voice motor control and has the potential to inform future directions of speech rehabilitation research in disordered populations. Simulation software, including an easy-to-use graphical user interface, is publicly available to facilitate the use of the model in future studies.

Keywords: computational modeling, sensorimotor adaptation, motor control, speech production, voice, auditory feedback

#### INTRODUCTION

Sensorimotor adaptation paradigms have become an important experimental approach in studying the neural mechanisms of motor control, including speech production. These paradigms are based on the premise that small, often imperceptible, manipulations of sensory feedback result in lasting changes within the sensorimotor system (often referred to as motor learning) as participants gradually adapt their movements to compensate for the sensory perturbations. Residual compensatory behavior is evident after the manipulation is removed; such after-effects provide clear evidence of the adaptive changes within the motor system.

A typical sensorimotor adaptation paradigm consists of four phases, shown in **Figure 1**. The paradigm begins with a baseline phase<sup>1</sup> where participants produce stimuli (e.g., syllables, sustained vowels) and receive normal, unaltered auditory feedback. The second phase, referred to as a ramp phase, is characterized by a gradual addition of the auditory feedback perturbation. The perturbation is implemented in near real time (typically with a delay of 40 ms or less) using a digital signal processing system and/or personal computer-based software (e.g., Audapter; Cai et al., 2008) and is increased linearly until reaching the maximum perturbation magnitude. The maximum perturbation remains constant during the hold phase. The final phase is the after-effect phase, where auditory feedback immediately returns to normal. The number of trials per phase varies by study but is often in the range of 10 to 100 trials, with the largest number of trials usually occurring in the hold phase. In addition, the ramp may be more or less gradual (or omitted), and masking noise played on short blocks of trials during the hold phase can be used to assess adaptation in place of the after-effect phase.
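The four-phase structure described above can be written down as a per-trial perturbation schedule. The following is a minimal sketch only: the function name, argument names, and the phase lengths in the example are hypothetical and are not taken from any particular study.

```python
def perturbation_schedule(n_baseline, n_ramp, n_hold, n_after, max_shift):
    """Return the perturbation applied on each trial of a four-phase paradigm."""
    # Baseline: normal, unaltered feedback.
    schedule = [0.0] * n_baseline
    # Ramp: linear increase that reaches max_shift on the last ramp trial.
    # With n_ramp == 0 the ramp is omitted (sudden perturbation onset).
    schedule += [max_shift * (i + 1) / n_ramp for i in range(n_ramp)]
    # Hold: the maximum perturbation remains constant.
    schedule += [max_shift] * n_hold
    # After-effect: feedback immediately returns to normal.
    schedule += [0.0] * n_after
    return schedule


# Hypothetical phase lengths and a 30% shift, for illustration only.
shifts = perturbation_schedule(n_baseline=4, n_ramp=3, n_hold=5, n_after=4,
                               max_shift=0.30)
```

Expressing the schedule as a plain list makes it straightforward to pass the same perturbation sequence to both an experiment script and a model simulation.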

Originally adapted from studies of limb motor control, the sensorimotor adaptation paradigm was first applied to formant frequencies during speech by Houde and Jordan (1998). Formant frequencies are peaks in the acoustic spectrum that are related to the overall shape of the vocal tract and are important for differentiating speech sounds. Roughly speaking, the first formant (F1) is inversely related to tongue height (i.e., sounds with higher tongue positions have lower F1 values) whereas the second formant (F2) is related to the location of the tongue constriction along the vocal tract (i.e., sounds with constrictions closer to the lips have higher F2 values). In the study by Houde and Jordan (1998), participants produced CVC syllables containing the vowel /ε/ while the first two formants were shifted either toward the vowel /i/ or the vowel /a/. Compensation was observed in the opposite direction to the perturbation. During the hold phase, adaptation was assessed by randomly interspersing trials with masking noise so that auditory feedback was unavailable to participants. The masked trials also showed evidence of compensatory behavior, revealing adaptation within the speech motor system to the formant perturbations.

Since the first application of the sensorimotor adaptation paradigm to speech, a number of adaptation studies have supported the original findings for formant perturbations (e.g., Purcell and Munhall, 2006; Villacorta et al., 2007) as well as several additional acoustic manipulations, including shifting the center of spectral energy of fricatives (Shiller et al., 2007, 2009) and perturbing fundamental frequency (f<sub>o</sub>, the acoustic correlate of pitch) during sustained phonation (Jones and Munhall, 2000; Hawco and Jones, 2010). The findings have also been generalized to perturbations of pitch and formant frequencies in Mandarin, a tonal language (Jones and Munhall, 2002, 2005; Cai et al., 2010), and to sentence-level stimuli with formants of multiple vowels perturbed within an utterance (Lametti et al., 2018). Keough et al. (2013) demonstrated that the presence or absence of specific instructions to attend to the acoustic manipulations does not affect adaptation, suggesting that adaptation is under automatic rather than conscious control. Links have also been demonstrated between perceptual abilities and sensorimotor adaptation. For example, both Villacorta et al. (2007) and Martin et al. (2018) found that speakers who have better auditory acuity showed greater adaptive responses to perturbations of F1, and other researchers have shown that sensorimotor adaptation can result in changes in the speech perception of the adapted speech sound in addition to non-adapted but coarticulatory-dependent speech sounds (Shiller et al., 2009; Lametti et al., 2014; Schuerman et al., 2017).

Most studies of sensorimotor adaptation in speech have involved neurologically normal adult speakers. More recently, the sensorimotor adaptation paradigm has been used to investigate sensorimotor adaptation in children and individuals diagnosed with communication disorders. Evidence of adaptation has been shown in children as young as three; however, the magnitude of the adaptive response is not as great as in adults (Scheerer et al., 2016) and adaptation does not appear to have a reliable effect on their perceptual representations (Shiller et al., 2010). In the realm of communication disorders, the paradigm has been used to assess speech motor control of individuals with Parkinson's disease (PD; Mollaei et al., 2013; Abur et al., 2018), hyperfunctional voice disorder (Stepp et al., 2017), cerebellar degeneration (Parrell et al., 2017), apraxia of speech (Ballard et al., 2018), autism (Demopoulos et al., 2018), developmental dyslexia (van den Bunt et al., 2017), and stuttering (Daliri et al., 2018). The findings of these studies have important implications for uncovering the underlying neural mechanisms of these disorders and may shed light on future treatment strategies.

<sup>1</sup>Note that the names used to describe the phases are not always the same as those used in this article. We will use the terms defined here throughout the article to avoid confusion.

As the studies reviewed above have demonstrated, the speech sensorimotor adaptation paradigm provides an informative window into learning in the speech motor system. However, it is important to realize that speech output under perturbed auditory feedback is a combination of online sensory feedback control processes (i.e., motor corrections based on sensory errors detected within the ongoing production) and adaptive processes that affect future productions whether or not they are perturbed. This makes it difficult to determine the true level of adaptation (in the sense of trial-to-trial learning) from the experimental data since this adaptive component is "corrupted" by online, within-trial contributions from sensory feedback control.

The widely used Directions Into Velocities of Articulators (DIVA) model of speech production (Guenther, 2006, 2016) proposes that the overall motor command to the speech articulators consists of three main components: (1) an auditory feedback control component that is invoked when errors are detected in auditory feedback, (2) a somatosensory feedback component that is invoked when errors are detected in somatosensory feedback from the speech articulators, and (3) a feedforward component that utilizes stored motor programs for the sounds being produced. Furthermore, the model posits that the feedforward command for future productions is updated based on sensory errors detected in the current trial. This adaptation process has been shown to be capable of accounting for compensatory responses seen in a prior sensorimotor adaptation experiment (Villacorta et al., 2007), though the relative contributions of the three different control processes could not be uniquely determined due to the relatively high number of free parameters in the full DIVA model.

Relatively complex models, such as the full DIVA model, are important for expanding our understanding of the neural bases of speech and providing theoretical frameworks to unify findings from a wide range of experimental paradigms. However, they are limited in their usefulness as a tool for characterizing the impaired speech of individuals in the clinic. Specifically, their complexities and parameter redundancies preclude a unique, meaningful model "fit" for the individual. The purpose of the current article is to describe a simple 3-parameter model based on DIVA that can be used to dissociate the contributions of the auditory feedback-based, somatosensory feedback-based, and feedforward control processes in experimentally measured sensorimotor adaptation responses. We will refer to this model as SimpleDIVA throughout the article. The overarching goal of SimpleDIVA is to distil a complex model into its most fundamental components so it can be used to derive a meaningful characterization of function/dysfunction in each of the three main sub-controllers for speech in individuals with speech disorders. The first step in this process is to verify that the model provides adequate fits to existing group datasets. As detailed in the next section, the model's parameters characterize the gains of the auditory and somatosensory feedback control systems as well as the trial-to-trial adaptation rate of the feedforward control system. Given a sensorimotor adaptation dataset, optimal values of these parameters for fitting the data are derived (i.e., are data-driven); the resulting parameters provide an estimate of the relative roles of the three different control subsystems in the corresponding experiment. 
For the purposes of the current article, we focus on adaptation experiments involving auditory feedback perturbations, though in principle the same model can be used to analyze the results of adaptation experiments involving somatosensory perturbations applied to the speech articulators (e.g., Tremblay et al., 2003; Nasir and Ostry, 2006) as well as experiments involving perturbations to both auditory and somatosensory feedback (Feng et al., 2011; Lametti et al., 2012).

The remainder of the article is organized as follows. After a description of the SimpleDIVA model, we report a series of 10 simulations in which the model is fit to existing sensorimotor adaptation datasets. Simulations 1 and 2 examine adaptation with perturbations applied to a single auditory dimension (F1). Simulations 3 and 4 assess adaptation with perturbations applied simultaneously to multiple auditory dimensions, specifically F1 and F2. Simulation 5 evaluates adaptation when applying a perturbation to f<sub>o</sub> under two different experimental conditions, first with an upward perturbation and then with a downward perturbation. Simulations 6 and 7 model f<sub>o</sub> adaptation when the measurement of f<sub>o</sub> is captured early as compared to late in the trial. Simulations 8 and 9 model data from an F1 experiment with a gradual perturbation onset condition and fit the resulting parameters to a second experimental condition with a sudden perturbation onset. The final simulation models all included F1 data in a single simulation to derive optimal model parameters for predicting responses to future F1 adaptation studies. We then summarize the contribution of the work to the literature and suggest future directions for using the SimpleDIVA model.

#### MATERIALS AND METHODS


The following equations characterizing the SimpleDIVA model capture the key aspects of the DIVA model in a simplified form that involves only three free parameters that can be adjusted to fit a particular dataset. For the sake of readability, the equations will assume that the adaptation experiment being modeled involves F1, though the same equations apply to other auditory parameters, as illustrated in the simulations described in the next section. We will denote the target value of F1 for the experimental stimuli as F1<sub>T</sub> and define it to be equal to the mean of the F1 values produced by the participant during the baseline portion of the experiment. We assume that F1<sub>T</sub> remains constant over the course of the experiment; i.e., the participant does not change what they consider to be a correct-sounding production.

In effect, SimpleDIVA focuses on the subspace of the high-dimensional motor space that corresponds to changes in F1. This allows us to replace a high-dimensional motor command vector with a single variable corresponding to the effect of that motor command on F1. In this way, the overall motor command to the speech articulators becomes an F1 value that we will call F1<sub>produced</sub>. Equation 1 defines F1<sub>produced</sub> on a given trial or block (indexed by n) as:

$$\text{F1}_{\text{produced}}(n) = \text{F1}_{\text{FF}}(n) + \Delta \text{F1}_{\text{FB}}(n) \tag{1}$$

Simply stated, the F1 value produced on a trial is a combination of a feedforward command (F1<sub>FF</sub>) and a sensory feedback-based correction (ΔF1<sub>FB</sub>) that kicks in if/when the auditory and somatosensory feedback controllers detect production errors on the current trial. At the start of each simulation (i.e., for n = 1), F1<sub>FF</sub> is initialized to F1<sub>T</sub>, corresponding to the assumption that participants have previously learned feedforward commands that successfully produce the target value of F1 under normal feedback conditions.

In the full DIVA model, feedback control consists of two components that are summed together: an auditory-feedback-based component and a somatosensory-feedback-based component. The auditory feedback control component is formed by (i) calculating the difference (error) between a multi-dimensional auditory target and the current auditory feedback, (ii) transforming this auditory error into the motor space, and (iii) scaling the result by an auditory feedback control gain factor. Similarly, the somatosensory feedback control component is formed by calculating the difference between a multi-dimensional somatosensory target and the current somatosensory feedback, transforming this somatosensory error into the motor space, and scaling the result by a somatosensory feedback control gain factor. Again, in SimpleDIVA we focus on only the components of the multi-dimensional somatosensory and auditory spaces that correspond to changes in F1, which means that the auditory and somatosensory targets are both equal to F1<sub>T</sub>, and the feedback-based correction on a given trial is characterized by the following equation:

$$\Delta \text{F1}_{\text{FB}}(n) = \alpha_A \ast \left( \text{F1}_T - \text{F1}_{\text{AF}}(n) \right) + \alpha_S \ast \left( \text{F1}_T - \text{F1}_{\text{SF}}(n) \right) \tag{2}$$

where F1<sub>AF</sub> is the value of F1 heard by the participant (including the perturbation, when one is applied) before feedback control mechanisms kick in on that trial (i.e., F1<sub>AF</sub> = F1<sub>FF</sub> + perturbation size) and F1<sub>SF</sub> is the F1 value corresponding to the current somatosensory feedback before feedback control mechanisms kick in. Since no somatosensory feedback perturbations are being considered herein, F1<sub>SF</sub> on a given trial is simply equal to F1<sub>FF</sub> for that trial in the simulations that follow. The free parameters α<sub>A</sub> and α<sub>S</sub> are the gains of the auditory and somatosensory feedback control subsystems, respectively. When an auditory perturbation is applied, the auditory feedback controller will attempt to compensate for the perturbation. This compensation will be partially counteracted by the somatosensory feedback controller, which is attempting to keep the vocal tract in the normal somatosensory configuration for the sound. Thus, if all else is equal, increasing α<sub>A</sub> will lead to an increase in the compensatory response to an auditory perturbation commanded by the feedback controller, whereas increasing α<sub>S</sub> will lead to a decrease in the compensatory response to an auditory perturbation.
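As a worked step, the two definitions just given can be substituted into Eq. 2. Here p(n) is shorthand introduced only for this illustration (it is not notation from the model description) for the auditory perturbation size on trial n, so that the auditory feedback equals the feedforward value plus p(n) and the somatosensory feedback equals the feedforward value:

$$\Delta \text{F1}_{\text{FB}}(n) = \alpha_A \ast \left( \text{F1}_T - \text{F1}_{\text{FF}}(n) - p(n) \right) + \alpha_S \ast \left( \text{F1}_T - \text{F1}_{\text{FF}}(n) \right) = \left( \alpha_A + \alpha_S \right) \ast \left( \text{F1}_T - \text{F1}_{\text{FF}}(n) \right) - \alpha_A \ast p(n)$$

Written this way, the opposing roles of the two gains become visible. The term containing p(n) is the auditory correction, which opposes the perturbation. Once the feedforward command has drifted away from the target, the remaining term pulls production back toward the target, and the somatosensory gain contributes only to that term.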

The equation for updating the feedforward command from trial to trial is:

$$\text{F1}_{\text{FF}}(n+1) = \text{F1}_{\text{FF}}(n) + \lambda_{\text{FF}} \ast \Delta \text{F1}_{\text{FB}}(n) \tag{3}$$

where λ<sub>FF</sub> is a learning rate parameter for the feedforward command. That is, the feedforward command for the next trial is updated by adding some fraction (characterized by λ<sub>FF</sub>) of the feedback-based corrective command from the current trial, as in the full DIVA model.
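Equations 1 to 3 can be iterated trial by trial in a few lines. The following is a minimal sketch, not the authors' released software: the function name, the parameter values, and the 700 Hz target with a 100 Hz shift are hypothetical and chosen only for illustration.

```python
def simulate_simple_diva(perturbation, f1_target, alpha_a, alpha_s, lambda_ff):
    """Iterate Eqs. 1-3 over trials and return the produced F1 for each trial."""
    f1_ff = f1_target  # feedforward command is initialized to the target (n = 1)
    produced = []
    for pert in perturbation:
        f1_af = f1_ff + pert  # auditory feedback before any within-trial correction
        f1_sf = f1_ff         # no somatosensory perturbation is modeled here
        # Eq. 2: feedback-based correction on this trial
        delta_fb = alpha_a * (f1_target - f1_af) + alpha_s * (f1_target - f1_sf)
        # Eq. 1: produced value = feedforward command + feedback correction
        produced.append(f1_ff + delta_fb)
        # Eq. 3: update the feedforward command for the next trial
        f1_ff = f1_ff + lambda_ff * delta_fb
    return produced


# Hypothetical example: 5 baseline, 200 hold, and 5 after-effect trials.
pert = [0.0] * 5 + [100.0] * 200 + [0.0] * 5
f1 = simulate_simple_diva(pert, f1_target=700.0,
                          alpha_a=0.3, alpha_s=0.2, lambda_ff=0.2)
```

Under these equations, and provided the learning rate and the summed gains are nonzero, the feedforward command under a constant perturbation p settles where the feedback correction vanishes, that is, at the target minus α<sub>A</sub> p / (α<sub>A</sub> + α<sub>S</sub>). In the sketch, production stays below the target on the first after-effect trial, which corresponds to the after-effect described in the Introduction.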

To fit the SimpleDIVA model to a particular dataset, a particle swarm optimization procedure was used to find optimized values of the three free parameters of the model (α<sub>A</sub>, α<sub>S</sub>, and λ<sub>FF</sub>) to fit the mean data for each trial/block in each condition. In this procedure, the system is initialized with a population of 1000 random sets of parameter values ("particles") and iterated until convergence to obtain an optimized parameter set. In each iteration, all parameter sets are evaluated by computing the root mean square error (RMSE) of their fits to the data, and a fraction of all sets is replaced by random linear combinations of those parameter sets currently producing the best fits. The procedure stops when all 1000 parameter sets converge within a 1% range of the optimal solution or after 100 consecutive iterations without any improvement in the optimal fit to the data. When the procedure stops, the optimal parameter set among the remaining 1000 sets is selected as the solution. Parameter values were limited to the range [0,1] except where noted, in keeping with their mechanistic interpretations in the model<sup>2</sup>. For each model fit, the optimization procedure was run 10 times in order to evaluate any potential residual variability due to initial conditions or local optima. The resulting parameter estimates were highly robust to initial conditions of the swarm procedure (that is, all 10 runs typically converge on the same optimal parameter set), indicative of reaching the global minimum of the RMSE measure. The minimum-RMSE solution across all 10 repetitions was chosen as the optimized parameter set, and Pearson's r was calculated for this solution to characterize fit quality.

<sup>2</sup>For example, it does not make sense within the model for the auditory feedback gain to be less than 0 (which would exacerbate rather than correct auditory errors) or greater than 1 (which would overcompensate for auditory errors).
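The fitting procedure can be illustrated with a much-simplified population-based search written in the spirit of the description above; it is not the authors' implementation. The sketch assumes a smaller population, a fixed number of iterations in place of the stated convergence criteria, replacement of the worse half of the population, and convex combinations of two randomly chosen better-fitting sets. All function names and the synthetic data are hypothetical.

```python
import math
import random


def simulate(perturbation, target, alpha_a, alpha_s, lambda_ff):
    """Iterate Eqs. 1-3; returns produced values per trial."""
    ff, out = target, []
    for pert in perturbation:
        fb = alpha_a * (target - (ff + pert)) + alpha_s * (target - ff)  # Eq. 2
        out.append(ff + fb)                                              # Eq. 1
        ff += lambda_ff * fb                                             # Eq. 3
    return out


def rmse(params, perturbation, data, target):
    """Root mean square error between the model output and the data."""
    model = simulate(perturbation, target, *params)
    return math.sqrt(sum((m - d) ** 2 for m, d in zip(model, data)) / len(data))


def fit(perturbation, data, target, n_particles=200, n_iter=60, frac=0.5, seed=0):
    """Simplified search: keep the best sets, replace the rest with mixtures."""
    rng = random.Random(seed)
    # Random initial parameter sets, each value drawn from [0, 1).
    swarm = [tuple(rng.random() for _ in range(3)) for _ in range(n_particles)]
    n_keep = n_particles - int(frac * n_particles)
    for _ in range(n_iter):
        swarm.sort(key=lambda p: rmse(p, perturbation, data, target))
        best = swarm[:n_keep]
        children = []
        while len(best) + len(children) < n_particles:
            a, b = rng.sample(best, 2)
            w = rng.random()
            # Convex combination keeps every parameter inside [0, 1].
            children.append(tuple(w * x + (1 - w) * y for x, y in zip(a, b)))
        swarm = best + children
    winner = min(swarm, key=lambda p: rmse(p, perturbation, data, target))
    return winner, rmse(winner, perturbation, data, target)


# Synthetic, noise-free data generated from hypothetical parameter values.
pert = [0.0] * 10 + [100.0] * 30 + [0.0] * 10
data = simulate(pert, 700.0, 0.3, 0.2, 0.2)
fitted_params, fitted_err = fit(pert, data, 700.0)
```

Because the best sets are always retained, the best RMSE in the population cannot increase from one iteration to the next. Repeating the call with different `seed` values would mirror the repeated runs used to check robustness to initial conditions.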

The SimpleDIVA model can also be fit to data from multiple datasets. In these cases, RMSE is first calculated for each dataset individually (using the same parameter values for all datasets), and then the individual dataset error measures are summed to obtain the overall error used in the optimization procedure. This has the effect of weighting the datasets equally regardless of the number of trials in each dataset when determining optimal parameter values. Pearson's r is then calculated across all trials in all datasets, with this measure more heavily influenced by datasets with more trials (simulation 10 in the current article).
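The difference between the two measures in the multi-dataset case can be made concrete with a small sketch. The helper names and the toy numbers below are hypothetical; the sketch only illustrates summing per-dataset RMSE values (equal weight per dataset) versus pooling all trials for Pearson's r (more weight to longer datasets).

```python
import math


def rmse(model, data):
    """Root mean square error for one dataset."""
    return math.sqrt(sum((m - d) ** 2 for m, d in zip(model, data)) / len(data))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def combined_error(model_fits, datasets):
    """Sum of per-dataset RMSEs, so each dataset counts equally."""
    return sum(rmse(m, d) for m, d in zip(model_fits, datasets))


def pooled_r(model_fits, datasets):
    """Pearson's r over all trials of all datasets concatenated."""
    all_model = [v for m in model_fits for v in m]
    all_data = [v for d in datasets for v in d]
    return pearson_r(all_model, all_data)


# Toy values only: a short and a long dataset with their model fits.
datasets = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]]
model_fits = [[2.0, 3.0, 4.0], [12.0, 22.0, 32.0, 42.0, 52.0, 62.0]]
total_error = combined_error(model_fits, datasets)
r_all = pooled_r(model_fits, datasets)
```

In this toy case the short dataset contributes one term to the summed error, just as the long one does, whereas two thirds of the points entering the pooled correlation come from the long dataset.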

An important assumption of the model is that the measurement of F1 in a given trial occurs at a point in time when the auditory feedback controller has already had time to detect and correct for errors, ideally 150 ms or more after perturbed auditory feedback is available to the speaker. This assumption is in place because the model is implicitly expecting contributions from both feedforward and feedback control systems, and it will thus underestimate the influence of feedback control and (consequently) overestimate the amount of trial-to-trial adaptation<sup>3</sup> if the measurement occurs before feedback control has had time to contribute on the current trial (see simulations 6 and 7). The neural delays associated with sensory feedback processing are approximately 100–150 ms for auditory feedback (Burnett et al., 1997, 1998; Hain et al., 2000) and 20–75 ms for somatosensory feedback (Ludlow et al., 1992; Larson et al., 2008). **Figure 2** is a schematic illustration of a hypothetical within-trial time course of a perturbed trial (prior to any adaptation) based on the delays noted above. The auditory perturbation begins with the onset of the trial and remains on for the duration of the trial. An error is detected by the auditory feedback controllers early in the trial and the associated correction is evident starting around 100 ms. This auditory-based correction causes the articulators to change their configuration and, as a result, an error is detected by the somatosensory feedback controller, which begins to correct for the error ∼50 ms later (in the opposite direction of the auditory-based correction). In a typical sensorimotor adaptation experiment, a single measure is taken for each production, typically near the midpoint of a prolonged vowel (e.g., Mollaei et al., 2013; Daliri et al., 2018). 
If this measurement is taken at 120 ms (t1 in **Figure 2**), it is likely to underestimate the contribution of feedback control compared to using a later timepoint (e.g., 220 ms, t2). Unless otherwise noted, the studies modeled in this article all involved acoustic measurements that were made more than 150 ms after perturbed auditory feedback was provided.

Across datasets, the optimized model parameters are directly comparable when the experimental and data processing protocols are the same. However, parameters are likely to vary somewhat in response to changes in task, length of utterance, auditory dimension being perturbed, and the timepoint of the acoustic measurement. Random variation associated with recording data from different samples of participants may introduce some degree of uncertainty in the precision of the parameter estimates, but this does not preclude comparisons across datasets assuming the experimental protocols are comparable (see simulations 8 and 9).

## RESULTS

The SimpleDIVA model was used to fit experimental data collected from six prior speech sensorimotor adaptation studies, as detailed in the following subsections. Prior to model fitting, outlier data points greater than two standard deviations from the participant's mean production in each experimental phase were removed. The mean value of the measured acoustic feature (e.g., F1) across participants was then calculated for each trial in the experiment. All simulations were performed using MATLAB 2018a on a Macintosh computer (macOS Mojave, Version 10.14.3) and replicated on Windows and Linux platforms. Compiled MATLAB code for the SimpleDIVA model is available at http://sites.bu.edu/guentherlab/software/simplediva-app, including a graphical user interface that allows the user to enter new datasets to fit with the model. The graphical user interface is a freely accessible program that does not require a MATLAB license to run.

<sup>3</sup>The model will still typically identify an auditory feedback gain that is greater than zero because such a gain is needed to account for trial-to-trial changes in the acoustic parameter being perturbed (see Eq. 3).
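The preprocessing described above (outlier removal within each experimental phase, followed by averaging across participants) can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: the function names are hypothetical, the use of the sample standard deviation is an assumption, and removed points are marked as `None` so that trial positions stay aligned across participants.

```python
from statistics import mean, stdev


def mask_outliers_by_phase(values, phases, n_sd=2.0):
    """Replace values more than n_sd standard deviations from the
    participant's mean in the same phase with None (assumed convention)."""
    cleaned = list(values)
    for phase in set(phases):
        idx = [i for i, p in enumerate(phases) if p == phase]
        if len(idx) < 2:
            continue  # a standard deviation needs at least two points
        phase_vals = [values[i] for i in idx]
        m, sd = mean(phase_vals), stdev(phase_vals)
        for i in idx:
            if abs(values[i] - m) > n_sd * sd:
                cleaned[i] = None
    return cleaned


def across_participant_mean(rows):
    """Average each trial across participants, ignoring masked (None) entries."""
    result = []
    for trial_values in zip(*rows):
        kept = [v for v in trial_values if v is not None]
        result.append(mean(kept) if kept else None)
    return result


if __name__ == "__main__":
    # Hypothetical F1 values (Hz) for one participant in one phase.
    f1 = [500, 505, 495, 502, 498, 501, 499, 503, 497, 500, 700]
    print(mask_outliers_by_phase(f1, ["baseline"] * len(f1)))
```

Note that with very few trials per phase a two-standard-deviation criterion can never flag a point, so the criterion is only meaningful for reasonably long phases.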

#### Simulation 1: Upward F1 Perturbation

The first simulation involved fitting a dataset from a classic implementation of the sensorimotor adaptation paradigm, as illustrated in **Figure 1**, that involved an upward perturbation to F1 (Haenchen, 2017). In this study, a group of young healthy speakers of American English (N = 18; aged 18–29) produced 60 blocks of trials, with each block including three trials in which the participant produced the word "bed," "dead," or "head" in pseudorandom order (180 total individual word trials). For each block, the mean F1 value across the three individual word trials was calculated; these blocked data were used for the model fit. Blocking in this way reduces variability in data plots but has a minimal effect on derived optimal parameters compared to fitting all individual trials. Participants were instructed to say the words slowly and clearly, with an utterance duration between 400 and 600 ms and intensity between 72 and 88 dB SPL. A baseline of 19 blocks (57 individual word trials) was followed by a short ramp phase (1 block) where auditory feedback of F1 was incrementally shifted from 0 to 30% over three trials. The hold and after-effect phases had a further 20 blocks each. Mean F1 was extracted for 60% of the duration of the word, starting from 10% after voice onset time. On average, participants compensated for 31.6% of the perturbation (calculated as change from the baseline to hold phase as a percentage of the maximum perturbation magnitude).
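The blocking and compensation measures described above can be illustrated with a short sketch. The function names are hypothetical, and the compensation formula is an assumption: the text states only that compensation is the change from baseline to hold as a percentage of the maximum perturbation magnitude, so here the perturbation magnitude in Hz is taken to be the perturbation fraction times the baseline mean.

```python
from statistics import mean


def block_means(trials, block_size=3):
    """Average consecutive trials in blocks (e.g., three word trials per block)."""
    return [mean(trials[i:i + block_size])
            for i in range(0, len(trials), block_size)]


def percent_compensation(baseline, hold, max_pert_fraction):
    """Change from baseline to hold, as a percentage of the maximum
    perturbation magnitude. Positive values oppose the perturbation.
    Assumes the perturbation in Hz equals max_pert_fraction * baseline mean."""
    base = mean(baseline)
    pert_hz = abs(max_pert_fraction) * base
    direction = 1 if max_pert_fraction > 0 else -1
    return 100.0 * direction * (base - mean(hold)) / pert_hz


if __name__ == "__main__":
    # Hypothetical blocked F1 values (Hz): 600 Hz baseline, 546 Hz in hold.
    print(percent_compensation([600] * 19, [546] * 20, 0.30))
```

With these illustrative numbers the perturbation is 180 Hz and the production change is 54 Hz, so the function returns approximately 30.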

**Figure 3** shows the model fit to the experimental data. This figure and subsequent figures in this section follow the same format. In the left panel, the mean and standard error of the experimental data are shown in blue and model fits are shown in red. In the right panel, a Pearson's correlation coefficient (r) describes the relationship between the data and model fits, and the parameter estimates are given for α<sup>A</sup>, α<sup>s</sup>, and λFF. The parameter estimates for all 10 optimization runs (see section "Materials and Methods") are plotted here; however, they typically appear as a single point due to minimal differences between runs, suggesting unique, optimal solutions for these datasets. The reported optimized parameter values and Pearson's r are from the best fit obtained from the 10 optimization runs. The model provided an excellent fit to the data (r = 0.96), falling within the standard error of the sample mean for all but one block (the ramp block). Optimized values for the three model parameters (model interpretation given in parentheses) were α<sup>A</sup> = 0.23 (indicating an auditory feedback control gain in which 23% of the detected auditory error for a trial was corrected within that trial), α<sup>s</sup> = 0.17 (indicating a somatosensory feedback control gain in which 17% of the detected somatosensory error for a trial was corrected within that trial), and λFF = 0.09 (indicating a feedforward command learning/update rate in which 9% of the correction from one trial was added to the feedforward command for the next trial).

FIGURE 3 | Simulation 1: model fits of a dataset with upward perturbations to F1 (data from Haenchen, 2017; Scott et al., 2019). (Left) Mean and standard error of experimental data in blue; model fit in red. (Right) Fit quality and optimized parameter values (r = correlation coefficient; α<sup>A</sup> = auditory feedback control gain, α<sup>s</sup> = somatosensory feedback control gain, and λFF = learning rate).

#### Simulation 2: Upward F1 Perturbation With Noise-Masked Trials

The second dataset was from a study involving an upward perturbation to F1 in a group of young healthy speakers of Australian (N = 9) and Canadian (N = 1) English (mean age = 25.3 ± 3.74 years) (see **Supplementary Material**, Ballard et al., 2018). A key difference in this experimental paradigm was the use of masking noise to block auditory feedback on certain trials as a way to gauge adaptation in the absence of online auditory feedback-based corrections. On each trial, participants said the word "pear," "bear," "care," or "dare," pseudorandomly distributed. Productions of "paw" were also recorded but not perturbed and were therefore not included in the current simulation. Participants were instructed to say the words with a clear voice quality, minimal pitch variation, constant speaking volume, and to prolong the vowel for approximately 500 ms. An initial baseline phase consisted of 40 trials with masking noise randomly played on half of the trials, followed by an additional 33 baseline trials with normal auditory feedback. No blocking of trials was performed due to the uneven distribution of noise-masked trials over the course of the experiment. A linear increase in F1 was applied over 59 ramp trials up to the maximum perturbation of 30%. During the hold phase, the maximum perturbation was maintained for a total of 115 trials. After every 15 hold trials, masking noise was played for the following 10 trials. The after-effect phase consisted of 40 noise-masked trials. F1 trajectories were extracted and averaged over the duration of the vowel. Average compensation was 36.5% during the hold phase on trials with unmasked auditory feedback.

As shown in **Figure 4**, the model again provided an excellent fit to the data (r = 0.90), and the model fits fell within the standard error of the data on 273/287 (95.1%) of trials, including both unmasked and masked trials. The optimized parameter values were α<sup>A</sup> = 0.33 (higher than in simulation 1, indicating a higher compensatory response to the perturbation), α<sup>s</sup> = 0.48 (higher than in simulation 1, indicating more resistance to compensatory responses that moved the production away from its normal state), and λFF = 0.27 (a bit higher than in simulation 1, indicating more adaptation of the feedforward command). Further, the higher α<sup>s</sup> value compared to α<sup>A</sup> indicates that, according to the model, the somatosensory feedback controller has a larger influence than the auditory feedback controller in this experimental protocol compared to simulation 1.

An interesting aspect of this simulation is the fact that the model captures a characteristic of the masked trials during the hold phase that was somewhat unexpected. At first glance, one might expect the F1 values produced during a sequence of consecutive noise-masked trials to remain steady since no auditory perturbation is detected. Instead, as captured by the model, there is a tendency for F1 to increase gradually during such noise-masked sequences. This occurs in the model because somatosensory feedback control remains active during the noise-masked trials, and the somatosensory feedback control system attempts to move the production closer to the normal (pre-perturbation) configuration, in effect counteracting the compensatory adaptation that occurs during unmasked trials in the hold phase.
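The drift during noise-masked trials can be illustrated with a minimal simulation. Eqs. 2 and 3 are not reproduced in this section, so the update rules below are an assumed, simplified form of the model: values are normalized to the baseline target, the perturbation is treated as additive in normalized units, and the auditory error is simply set to zero on masked trials. The function name `simulate_trials` is hypothetical; the gains are the optimized values reported for simulation 2.

```python
def simulate_trials(perturbation, masked, alpha_a, alpha_s, lambda_ff, target=1.0):
    """Return the produced value on each trial (normalized units).

    Assumed update rules (a sketch, not the published implementation):
      feedback correction = alpha_a * auditory error + alpha_s * somatosensory error
      produced            = feedforward command + feedback correction
      next feedforward    = feedforward command + lambda_ff * feedback correction
    """
    ff = target
    produced = []
    for pert, is_masked in zip(perturbation, masked):
        # Masking noise removes the auditory error but not the somatosensory one.
        aud_error = 0.0 if is_masked else target - (ff + pert)
        som_error = target - ff
        correction = alpha_a * aud_error + alpha_s * som_error
        produced.append(ff + correction)
        ff += lambda_ff * correction
    return produced


if __name__ == "__main__":
    n_unmasked, n_masked = 60, 10
    pert = [0.30] * (n_unmasked + n_masked)          # hold phase, +30% shift
    mask = [False] * n_unmasked + [True] * n_masked  # masked run at the end
    out = simulate_trials(pert, mask, alpha_a=0.33, alpha_s=0.48, lambda_ff=0.27)
    for trial, value in enumerate(out[n_unmasked:], start=1):
        print(f"masked trial {trial}: {value:.4f}")
```

Under these assumptions the produced value rises steadily toward the baseline target across the masked run, because each somatosensory correction is partly absorbed into the feedforward command for the next trial.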

#### Simulations 3 and 4: F1 and F2 Perturbed Simultaneously

Simulations 3 and 4 provide fits to data from an experiment in which young healthy American English speakers (N = 14; mean age = 23.7 ± 6.92 years) underwent an adaptation paradigm in which both F1 and F2 were perturbed simultaneously (Daliri et al., 2018; data from only the adult non-stuttering group included here). The experiment involved a total of 90 trials; 18 baseline, 18 ramp, 36 hold, and 18 after-effect trials. The target words were "bed," "Ted," and "head," randomized within each block of three trials, and participants were instructed to produce word durations between 300 and 700 ms and intensities between 72 and 88 dB SPL. The ramp phase was characterized by a gradual increase in F1 to a max perturbation of 25% and a gradual decrease in F2 to a max perturbation of −12.5%. The other three phases followed the standard paradigm. F1 and F2 trajectories were extracted using a custom-written MATLAB script. Mean F1 and F2 were estimated at the center of the vowel (40–60% of the vowel duration) and blocked data (mean of every three trials) were used for model fitting. In response to the F1 perturbation, participants compensated by an average of 21.3%, whereas for the F2 perturbation, they compensated by 3.87%.

In simulation 3, parameters were optimized to fit both the F1 and F2 data simultaneously with one set of model parameters for both auditory dimensions. The model fit (**Figure 5**) had an r of 0.95, and the model fit for every block fell within the standard error of the data. The optimized parameter values were α<sup>A</sup> = 0.10, α<sup>s</sup> = 0.00, and λFF = 0.10. While λFF is within the range of simulations 1 and 2, the relatively low values of α<sup>A</sup> and α<sup>s</sup> indicate, within the SimpleDIVA interpretation, smaller sensory feedback-based corrections that, in turn, lead to lower compensation in this experiment compared to the prior experiments.

FIGURE 5 | Simulation 3: model fits of a dataset with perturbations applied to both F1 and F2; F1 and F2 data are fit simultaneously (data from Daliri et al., 2018). (Left) Mean and standard error of experimental data in blue; model fit in red. (Right) Fit quality and optimized parameter values (r = correlation coefficient; α<sup>A</sup> = auditory feedback control gain, α<sup>s</sup> = somatosensory feedback control gain, and λFF = learning rate).

In simulation 4, the formant data were first normalized by dividing by the baseline average, then projected into a single dimension corresponding to the direction in F1/F2 space produced by the perturbation. This means that only the component of compensatory changes in F1/F2 that directly counteracted the perturbation was considered; this is similar to simulations 1 and 2, which only considered changes in F1 (the perturbed dimension) and ignored any changes in F2 that may also have occurred. The results for simulation 4, illustrated in **Figure 6**, are very similar to those of simulation 3 (r = 0.94; α<sup>A</sup> = 0.07, α<sup>s</sup> = 0.00, λFF = 0.10; model fit within the standard error of the data for every block), suggesting that projection of the results into a single dimension aligned with the perturbation is unnecessary as it produces essentially the same fit as fitting both the F1 and F2 datasets with a single set of parameter values.
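The normalization and projection step used in simulation 4 can be sketched as a scalar projection onto the perturbation direction. This is an illustrative reading of the procedure, not the authors' code: the function name is hypothetical, and expressing the normalized formants as deviations from baseline (subtracting 1) before projecting is an assumption.

```python
import math


def project_onto_perturbation(f1, f2, f1_base, f2_base, pert=(0.25, -0.125)):
    """Project a normalized (F1, F2) deviation onto the perturbation direction.

    pert holds the maximum fractional shifts (+25% in F1, -12.5% in F2).
    A negative result means the production moved against the perturbation.
    """
    d1 = f1 / f1_base - 1.0  # normalized F1 deviation from baseline
    d2 = f2 / f2_base - 1.0  # normalized F2 deviation from baseline
    norm = math.hypot(pert[0], pert[1])
    u1, u2 = pert[0] / norm, pert[1] / norm  # unit vector along the perturbation
    return d1 * u1 + d2 * u2


if __name__ == "__main__":
    # Hypothetical production that opposes 20% of the perturbation in both formants.
    print(project_onto_perturbation(570.0, 1845.0, 600.0, 1800.0))
```

Any change orthogonal to the perturbation direction projects to zero, which is why this measure ignores formant changes that do not directly counteract the shift.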

#### Simulation 5: Upward and Downward Perturbations of f<sup>o</sup>

Simulation 5 involves a dataset in which all participants underwent the adaptation paradigm under two counterbalanced conditions: one involving an upward shift in f<sup>o</sup> and one involving a downward shift in f<sup>o</sup> (Abur et al., 2018; data from only the healthy controls included here). Healthy older speakers of American English (N = 19, mean age = 65.3 ± 4.6 years) were instructed to vocalize a sustained /a/ for three seconds while the stimulus appeared on a computer monitor. Both the shift-up and shift-down conditions followed a standard adaptation paradigm: 20 baseline, 60 ramp, 40 hold, and 40 after-effect trials. During the shift-up condition, f<sup>o</sup> was increased by 1.69 cents for each ramp trial, reaching a maximum perturbation of 100 cents (a cent is a logarithmic unit of measure of changes in frequency, where 100 cents = 1 semitone). During the shift-down condition, the perturbation was applied in the same manner in the opposite direction, reaching a maximum perturbation of −100 cents by the end of the ramp phase. Mean f<sup>o</sup> was calculated for the duration of each 3-s trial using an autocorrelation method in Praat software (Boersma, 2001). The mean f<sup>o</sup> across every block of three trials was estimated and the blocked data were used for model fitting. On average, participants compensated 83.8 and 86.7% in the shift-up and shift-down conditions, respectively.
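For readers less familiar with the cent scale, the conversion between frequency ratios and cents is shown below. The formula (1200 cents per octave) is the standard definition; the function names and the 200 Hz reference are illustrative only.

```python
import math


def hz_to_cents(f_hz, ref_hz):
    """Distance from ref_hz to f_hz in cents (1200 cents = 1 octave)."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def shift_by_cents(f_hz, cents):
    """Frequency obtained by shifting f_hz by the given number of cents."""
    return f_hz * 2.0 ** (cents / 1200.0)


if __name__ == "__main__":
    ref = 200.0  # hypothetical baseline fundamental frequency in Hz
    up = shift_by_cents(ref, 100.0)     # +100 cents (one semitone up)
    down = shift_by_cents(ref, -100.0)  # -100 cents (one semitone down)
    print(f"+100 cents: {up:.2f} Hz, -100 cents: {down:.2f} Hz")
    print(f"check: {hz_to_cents(up, ref):.1f} cents")
```

Because the scale is logarithmic, equal shifts in cents up and down are not symmetric in Hz: a 100-cent upward shift from 200 Hz is about 11.9 Hz, whereas a 100-cent downward shift is about 11.2 Hz.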

In simulation 5, a single set of parameters was used to fit both the shift-up and shift-down data simultaneously, as in simulation 3. The resulting fit fell within the standard error of the data in 96.2% of the experimental blocks (shown in **Figure 7**). The quality of fit and optimized parameter values were: r = 0.96, α<sup>A</sup> = 0.93, α<sup>s</sup> = 0.00, and λFF = 0.02. This simulation resulted in much higher values of α<sup>A</sup> than prior simulations. Within the SimpleDIVA interpretation, a higher α<sup>A</sup> is expected here since the long analysis window allowed for an unnaturally long amount of time for speakers' auditory feedback correction to compensate for the perturbation. However, the very low value of α<sup>s</sup> in these simulations was not expected; see section "Discussion" for further treatment. The next two simulations directly tested the effect of varying the analysis window on model parameters.

#### Simulations 6 and 7: Late Versus Early Measurements of Perturbed f<sup>o</sup>

Similar to the previous dataset, the dataset modeled in simulations 6 and 7 involved an f<sup>o</sup> perturbation experiment (Heller Murray, 2019; Heller Murray and Stepp, under review). The key feature of this dataset was that f<sup>o</sup> was measured during two time periods: early and late in vocalization. Twenty young healthy speakers of American English (mean age = 21.0 ± 2.29 years) were asked to vocalize a sustained /a/ for 3 s while the stimulus appeared on screen. They completed the task under three conditions: shift-up, shift-down, and control. The shift-up and shift-down conditions followed the standard paradigm and each included a total of 60 trials: 15 baseline, 15 ramp, 15 hold, and 15 after-effect trials. No blocking of trials was performed due to the small number of total trials in the experiment. The ramp phase was characterized by a gradual change from 0 to a maximum perturbation of 100 cents (+100 cents in the shift-up condition and −100 cents in the shift-down condition). The control condition included a total of 60 trials without any perturbation and was used to account for the natural drift that occurs in f<sup>o</sup> over time in the shift-up and shift-down conditions. Median f<sup>o</sup> was calculated using Praat software and custom MATLAB scripts, and each participant's shift conditions were divided by their control condition to normalize the values. The two analysis time periods were: (1) between 20 and 120 ms after voicing onset (early); and (2) between 200 and 1500 ms after voicing onset (late). Note that in the early time window, feedback control will have had little time to "kick in" and thus lower values of α<sup>A</sup> and α<sup>s</sup> are expected compared to the later time window. The early time window also allows examination of model behavior in the near-absence of auditory feedback control (see **Figure 2**). When measured at the early timepoint, participants showed 19.1% (up-shift) and 50.9% (down-shift) compensation. When measured at the late timepoint, participants showed 29.8% (up-shift) and 51.5% (down-shift) compensation.

In simulation 6, the model was fit to data measured at the late timepoint, which is in keeping with the model's assumption that auditory feedback control has had a chance to contribute by the time the acoustic measurement is taken (i.e., that measurements occur 150 ms or more after perturbation onset). As before, a single set of parameters was used to fit both the shift-up and shift-down data simultaneously, with the resulting fits shown in **Figure 8**. The model fit fell within the standard error of the data on 68.3% of trials across both directions, and the resulting estimates were: r = 0.93, α<sup>A</sup> = 0.36, α<sup>s</sup> = 0.45, and λFF = 0.20.

FIGURE 8 | Simulation 6: model fits of a dataset with perturbations applied to fundamental frequency in both shift-up and down directions [normalized by an unshifted control condition; data from Heller Murray (2019)]. Shift-up and down data are fit simultaneously. Measurement of fundamental frequency was taken late in the trial (200–1500 ms after voicing onset). (Left) Mean and standard error of experimental data in blue; model fit in red. (Right) Fit quality and optimized parameter values (r = correlation coefficient; α<sup>A</sup> = auditory feedback control gain, α<sup>s</sup> = somatosensory feedback control gain, and λFF = learning rate).

In simulation 7, the model was fit to data measured at the early timepoint (**Figure 9**), in violation of its implicit assumption of a measurement 150 ms or more after perturbation onset. For this simulation, we allowed the parameter λFF to go above 1 in order to achieve the optimal fit. The model still gives a reasonably good fit, though significantly poorer than in simulation 6, falling within the standard error of the data on 63.3% of trials. The overall quality of the fit and the optimized model parameters were: r = 0.81, α<sup>A</sup> = 0.08, α<sup>s</sup> = 0.13, and λFF = 1.17. Simulation 7 resulted in relatively low α values, which were expected within the SimpleDIVA interpretation due to the limited time for feedback control mechanisms to contribute to the production. This pattern likely resulted because the dataset violated the model's assumption that feedback control mechanisms have kicked in by the time f<sup>o</sup> is measured; the early time window used in simulation 7 results in unrealistically low α values and a small feedback-based correction according to Eq. 2, which in turn requires an unrealistically high value of λFF in Eq. 3 to account for trial-to-trial changes.

#### Simulations 8 and 9: Model Parameters From a Gradual Onset Perturbation Fit to a Sudden Onset Perturbation

The following simulations provide fits to data from an F1 experiment conducted under two counterbalanced conditions: one involving a gradual ramp phase (gradual) and one involving no ramp phase (sudden) (Chao and Daliri, unpublished data; see **Supplementary Material** for detailed methods). Fifteen young healthy speakers of American English (mean age: 21.7 ± 4.09) were instructed to produce the words "heck," "head," and "hep" with a word duration of 400–600 ms and loudness intensity of 72–82 dB SPL. Both conditions had a total of 180 trials with a maximum perturbation of 30% in F1. The gradual condition followed the standard paradigm, with 45 baseline, 45 ramp, 45 hold, and 45 after-effect trials. F1 was linearly increased during the ramp phase up to the maximum perturbation. The sudden condition had 45 baseline, 90 hold, and 45 after-effect trials.

The maximum perturbation was introduced on the first trial of the hold phase. F1 trajectories were extracted using Audapter (Cai et al., 2008), which tracks formants based on linear predictive coding and dynamic programming. The average F1 was estimated in a window placed on the center of the vowel (40–60% of the vowel duration). Blocked data (mean of every three trials) were used for model fitting. Average compensation was 24.2% for the gradual condition and 23.7% for the sudden condition.

For these simulations, the goal was to first fit the model to one of the experimental conditions and then to use the resulting parameters to model the second condition, thus assessing how well the model could predict responses for a given experimental variation. In simulation 8, the model was fit to data from the gradual condition. The model fit fell within the standard error of the experimental data on all trials (**Figure 10**) and the quality of fit and optimized model parameters were: r = 0.97, α<sup>A</sup> = 0.19, α<sup>s</sup> = 0.38, and λFF = 0.08. These parameter values were then used to fit the data from the sudden condition (rather than finding optimal parameters for this condition). With αA, α<sup>s</sup> , and λFF fixed, the simulation predicted the same participants' response to a variation of the adaptation paradigm (i.e., with no ramp phase). **Figure 11** shows the resulting fits to the experimental data; the model fit is within the standard error on 98.3% of trials and estimates of fit quality indicated an excellent overall fit (r = 0.96).

The opposite was also true when the model was first fit to the sudden data (r = 0.97, α<sup>A</sup> = 0.16, α<sup>s</sup> = 0.21, and λFF = 0.07) and the resulting model parameters were used to fit the gradual data (r = 0.95). Together, these simulations highlight a strong predictive ability of the model across experimental conditions employing different patterns of perturbation.

#### Simulation 10: Identifying Representative Parameter Values Across F1 Adaptation Studies

FIGURE 11 | Simulation 9: model fits of a dataset with perturbations applied to F1 with a sudden onset (no ramp phase). Model parameters were fixed using the parameters obtained in simulation 8 (data from Chao and Daliri, unpublished data). (Left) Mean and standard error of experimental data in blue; model fit in red. (Right) Fit quality (r = correlation coefficient).

In the final simulation, we fit F1 data from all of the formant studies described above (simulations 1, 2, 3, 8, 9) using a single set of parameters. **Figure 12** shows the resulting fits. The fit quality and optimized parameter values were: r = 0.86, α<sup>A</sup> = 0.18, α<sup>s</sup> = 0.29, and λFF = 0.14. These model estimates provide representative values that can be used to predict responses in future formant adaptation studies.

To assess the possibility that these representative parameter values are overfitting our particular datasets, we performed a leave-one-out cross-validation procedure in which the model was fit to four of the five datasets, with the optimized parameters then used to fit the fifth (test) dataset (repeated five times, with each dataset acting as the test set once). The average r for the test set in these five simulations was 0.91, indicating that the model's fit quality extends beyond datasets used in the optimization procedure<sup>4</sup>. The parameter ranges obtained across the five simulations were 0.17–0.20 for α<sup>A</sup>, 0.25–0.32 for α<sup>s</sup>, and 0.11–0.15 for λFF.

<sup>4</sup>The careful reader might note that this cross-validated r value is actually higher than when all five datasets are used for fitting. This is possible because the optimization procedure minimizes RMSE as described in Section "Materials and Methods" rather than maximizing r directly.

To further assess the reliability of these parameters, we utilized a percentile bootstrap estimation procedure (Efron and Tibshirani, 1993) to obtain 95% confidence intervals for each parameter. 1000 iterations were performed, with the data for each iteration formed as follows. For each of the five studies, a new dataset was formed by sampling subjects with replacement from the original dataset, and the average of these data was calculated. Then SimpleDIVA was used to simultaneously fit these five averages using a single set of parameters. This resulted in a distribution of 1000 estimates for each parameter, from which the 95% confidence interval was drawn. The resulting confidence intervals were 0.13–0.21 for αA, 0.17–0.38 for α<sup>s</sup> , and 0.06–0.38 for λFF.
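The resampling scheme described above can be sketched as follows. This is a minimal illustration of a percentile bootstrap over subjects, not the authors' implementation: `bootstrap_ci` and `percentile` are hypothetical names, the linear-interpolation percentile is an assumed convention, and the `estimator` argument stands in for the SimpleDIVA fit, which is not reproduced here. In the demo a simple summary statistic is used in its place.

```python
import random
from statistics import mean


def percentile(sorted_vals, q):
    """Percentile (q in 0..100) of a sorted list, using linear interpolation."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def bootstrap_ci(studies, estimator, n_iter=1000, level=95.0, seed=0):
    """Percentile bootstrap confidence interval for a scalar estimator.

    studies:   list of studies; each study is a list of subjects; each
               subject is a list of per-trial values.
    estimator: function mapping the list of resampled study averages to a
               single number (a stand-in for a model fit).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_iter):
        study_means = []
        for subjects in studies:
            # Sample subjects with replacement, then average trial by trial.
            sample = [rng.choice(subjects) for _ in subjects]
            study_means.append([mean(col) for col in zip(*sample)])
        estimates.append(estimator(study_means))
    estimates.sort()
    tail = (100.0 - level) / 2.0
    return percentile(estimates, tail), percentile(estimates, 100.0 - tail)


if __name__ == "__main__":
    # Two hypothetical studies, ten subjects each, three trials per subject.
    study = [[1.0, 0.9, 0.80 + 0.01 * i] for i in range(10)]
    last_trial_mean = lambda means: mean(m[-1] for m in means)
    print(bootstrap_ci([study, study], last_trial_mean))
```

To obtain intervals for model parameters, the estimator would be replaced by a function that fits the model to the resampled averages and returns one parameter, with the procedure repeated for each parameter of interest.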

## DISCUSSION

The aim of this article was to describe and test a simple 3-parameter model, SimpleDIVA, that can disentangle the roles of auditory feedback, somatosensory feedback, and feedforward control processes during sensorimotor adaptation experiments. We tested the model using six existing datasets collected in different laboratories and with numerous variations in the sensorimotor adaptation paradigm. The model provided close fits to data from these studies, which spanned experiments involving formant and pitch perturbations, trials with and without masking noise, perturbations in single and multiple auditory dimensions, measurements made in different analysis windows of the acoustic signal, and predictions of model fits from one experimental condition to another. The model simulations highlighted the effectiveness of the model in estimating the relative contribution of feedback and feedforward control systems to sensorimotor learning and providing excellent fits to the data, with a mean Pearson's r of 0.95 ± 0.02 across the studies modeled here (excluding simulation 7, which was included to illustrate the effect of analysis time window). In addition, the simulations revealed properties of the model (and of sensorimotor adaptation) that we will discuss in detail below.

#### Role of Somatosensory Feedback in the Absence of Auditory Feedback

Previous studies have used noise-masked trials as a method of assessing sensorimotor adaptation in the absence of auditory feedback (e.g., Houde and Jordan, 1998; Ballard et al., 2018). A residual compensatory effect is observed in noise-masked trials during the hold phase, indicative of adapted feedforward commands. However, prior studies typically did not consider the effects of somatosensory feedback control during noise-masked trials (but see discussion in Ballard et al., 2018). In simulation 2, the SimpleDIVA model was fit to the data from one such study and revealed an interesting and somewhat unintuitive finding: when producing speech under masking noise in the hold phase, participants show gradual de-adaptation despite the fact that there is no auditory signal available. This aspect of the data is captured by the model since masking noise does not eliminate somatosensory feedback, and thus the somatosensory feedback controller is attempting to move the vocal tract back toward its pre-perturbation configuration; the resulting corrective movements generated by the somatosensory feedback controller lead to updating of the feedforward commands, in turn resulting in the de-adaptation evident in the experimental data and model fit. Thus, the model highlights a previously ignored aspect of speech sensorimotor adaptation experiments that involved masking noise during the hold phase, while at the same time providing an explanation for this phenomenon. Notably, this effect is analogous to findings in the visuomotor literature showing de-adaptation toward baseline in the absence of visual feedback (Hay et al., 1965; Scheidt et al., 2005; Smeets et al., 2006).

#### Optimized Model Parameters Change as a Function of Experimental Protocol Variation

Although the optimized parameters were often similar across simulations, differences were observed that are likely at least partially due to differences in experimental design. For example, the model was sensitive to differences in the period of signal selected for analysis. Simulations 6 and 7 demonstrated the effect of varying the measurement window directly. In simulation 7, an early time window of 20–120 ms after voice onset was used, thus minimizing the contribution of feedback control mechanisms, which do not start affecting movement until approximately 50 ms after perturbation detection for somatosensory feedback control and over 100 ms after perturbation detection for auditory feedback control (see Burnett et al., 1998; Guenther, 2016). As expected, this resulted in much lower feedback control gains in the optimal model fit (α<sup>A</sup> = 0.09, α<sup>s</sup> = 0.15) compared to simulation 6, which used a later time window of 200–1500 ms after voice onset and obtained optimized values of α<sup>A</sup> = 0.39 and α<sup>s</sup> = 0.44.

In an f<sup>o</sup> perturbation experiment that had a very long measurement window (∼3 s), the model estimated that sensorimotor control was dominated by the auditory feedback control system, with α<sup>A</sup> = 0.93 (simulation 5). Although a high gain for α<sup>A</sup> is expected due to the measurement window extending so long beyond perturbation onset, these simulations identified no contribution of the somatosensory feedback controller (i.e., α<sup>s</sup> = 0.00) rather than a higher than normal contribution that might be expected due to the long analysis time window. This unexpected finding indicates that, unlike the formant perturbation studies involving shorter/earlier time windows simulated herein where adaptation plateaus at approximately 25–50% of the perturbation size, adaptation in the f<sup>o</sup> perturbation study of Abur et al. (2018) was nearly complete (85.3%); in terms of the model, this is because somatosensory feedback control mechanisms are not acting to limit the amount of compensation. This finding may reflect a situation in which auditory feedback control dominates due to the use of unnaturally long (3 s) steady state vowel productions, which may have allowed participants to consciously "pitch match" their production to the target pitch, thereby overcoming the natural tendency for somatosensory feedback to limit the amount of compensation. Further study is needed to verify or refute this interpretation.

Further experimental choices that could affect model parameters include the loudness of the auditory feedback signal (with a louder signal possibly resulting in more auditory error detection and within-trial correction, evidenced by a larger α<sup>A</sup>), the use of low levels of masking noise in combination with normal and perturbed auditory feedback (possibly lowering the amount of error detection and correction, evidenced by a smaller α<sup>A</sup>), or the use of anesthesia on the speech articulators (which should lead to a decrease in α<sup>s</sup> and a concomitant increase in overall compensation to an auditory perturbation). Future studies will investigate these possibilities.

#### Relationships Between Somatosensory and Auditory Feedback Control Gains

An interesting finding in the simulations is that, in general, within-trial corrections based on somatosensory feedback seemed to be associated with the magnitude of compensation. That is, lower somatosensory feedback control gains occurred with lower auditory feedback control gains, and vice versa (simulation 5 was the exception). This result is not unexpected when one considers how the time window over which acoustic measurements are made affects the model parameter estimates: put simply, later time windows show evidence of more feedback control, both auditory and somatosensory.

In the model, increasing both auditory and somatosensory gains proportionally (e.g., going from α<sup>A</sup> = 0.2, α<sup>s</sup> = 0.4 to α<sup>A</sup> = 0.3, α<sup>s</sup> = 0.6) has no effect on the maximum amount of compensation that is achieved during a sufficiently long hold phase. To see why, note that the extent to which the auditory feedback control system opposes a perturbation directly affects the extent to which the somatosensory feedback control system will detect a mismatch from the normal configuration for the sound, in turn affecting the amount the somatosensory feedback control system opposes any corrective contributions from the auditory feedback control system. Ultimately, this competition between auditory and somatosensory feedback controllers determines the maximum compensation that can occur as a percentage of the perturbation size according to the following equation:

$$\text{Max Compensation} = \alpha_A / (\alpha_A + \alpha_S) \tag{4}$$

For example, if the auditory and somatosensory feedback gains are equal, the maximum compensation achieved by the model will be 0.5, or 50% of the perturbation size. This equation also helps explain why model fits to the data from Abur et al. (2018), which showed near-complete compensation, resulted in an optimized α<sup>S</sup> of 0.00.
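As a quick numerical check of Eq. 4, the following minimal sketch evaluates the maximum compensation for a few gain settings. The function name `max_compensation` and the specific gain values are introduced here for illustration only; they are not part of the SimpleDIVA software.

```python
def max_compensation(alpha_a: float, alpha_s: float) -> float:
    """Eq. 4: maximum compensation as a fraction of the perturbation size."""
    if alpha_a < 0 or alpha_s < 0:
        raise ValueError("gains must be non-negative")
    if alpha_a + alpha_s == 0:
        raise ValueError("at least one gain must be positive")
    return alpha_a / (alpha_a + alpha_s)


# Equal gains: half of the perturbation is compensated.
equal = max_compensation(0.3, 0.3)

# Scaling both gains proportionally leaves the maximum unchanged.
small = max_compensation(0.2, 0.4)
large = max_compensation(0.3, 0.6)

# A somatosensory gain of zero permits complete compensation.
full = max_compensation(0.2, 0.0)

print(equal, small, large, full)
```

Only the ratio of the two gains enters the result, which is why the two proportionally scaled settings agree up to floating-point rounding.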

Although increasing α<sup>A</sup> and α<sup>S</sup> proportionally does not affect the maximum level of compensation, it does affect the amount of within-trial compensation seen for trials shortly after a perturbation is induced. This is because the feedback-based correction calculated in Eq. 2 will be larger if α<sup>A</sup> and α<sup>S</sup> are both larger. Furthermore, for a given value of λFF, increasing α<sup>A</sup> and α<sup>S</sup> will lead to faster adaptation of the feedforward command according to Eq. 3.

Notably, if the ratio of α<sup>A</sup> to α<sup>S</sup> changes (as opposed to both of them increasing/decreasing proportionally), then we expect more adaptation (for greater α<sup>A</sup>/α<sup>S</sup> ratios) or less adaptation (for smaller ratios) after many training trials. Indeed, it is the ratio of these parameters that determines the degree of maximal compensation that will occur in the model since it captures the essence of the competition between the auditory and somatosensory feedback controllers discussed above. Again, different experimental paradigms may lead to somewhat different α<sup>A</sup>/α<sup>S</sup> ratios, in part because the delays in the two feedback control systems are different, which in turn means the relative influence of α<sup>A</sup> compared to α<sup>S</sup> depends on the point in time the acoustic measurement for the trial is made (see **Figure 2** and associated text). Following findings of individual preferences for auditory or somatosensory feedback control reported in some prior studies (e.g., Lametti et al., 2012), it is likely that the ratio of α<sup>A</sup> to α<sup>S</sup> also differs considerably across individuals.

In sum, the relative values of α<sup>A</sup> and α<sup>S</sup> determine the maximum amount of compensation that can occur in the model, whereas the absolute values of α<sup>A</sup> and α<sup>S</sup> affect the rate at which the model converges to this maximum compensation level during the hold phase.
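The distinction between relative and absolute gain values can be illustrated with a small trial-by-trial simulation. This is a sketch under stated assumptions rather than the SimpleDIVA implementation: the update rules written in the docstring are an assumed form chosen to be consistent with Eq. 4, and the function name `simulate_hold`, the unit perturbation, and the adaptation rate of 0.5 are hypothetical choices.

```python
def simulate_hold(alpha_a, alpha_s, lambda_ff, n_trials, pert=1.0, target=0.0):
    """Return per-trial compensation (fraction of pert) under a sustained shift.

    Assumed trial-level updates (illustrative, not quoted from the article):
      correction       = alpha_a * auditory_error + alpha_s * somatosensory_error
      produced         = feedforward + correction
      next feedforward = feedforward + lambda_ff * correction
    """
    ff = target  # the feedforward command starts at the target
    compensation = []
    for _ in range(n_trials):
        aud_err = target - (ff + pert)  # the heard value includes the shift
        som_err = target - ff           # the felt value does not
        correction = alpha_a * aud_err + alpha_s * som_err
        produced = ff + correction
        compensation.append((target - produced) / pert)
        ff += lambda_ff * correction
    return compensation


# Same gain ratio (1:2), different absolute values.
low = simulate_hold(0.2, 0.4, lambda_ff=0.5, n_trials=200)
high = simulate_hold(0.3, 0.6, lambda_ff=0.5, n_trials=200)

print("trial 5:  ", low[4], high[4])
print("trial 200:", low[-1], high[-1])
```

Under these assumptions both runs approach the same asymptote, one third of the perturbation, but the run with larger gains shows more compensation on early trials and reaches the asymptote sooner.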

#### Predictive Power of SimpleDIVA

To test the predictive power of the model, we identified optimal model parameters from data in one experimental condition involving a gradual perturbation onset and applied the parameters to a second experimental condition in which the perturbation onset was abrupt (simulations 8/9). The quality of the predicted fit was excellent (correlation coefficient of 0.96) and fell within the range of the other simulations in this article. Not only can SimpleDIVA provide an insight into the mechanisms underlying sensorimotor adaptation, but the model can also predict responses for an experiment using data from a prior experiment.

In the final simulation (simulation 10), we fit the model simultaneously across five F1 datasets with variations in the experimental design. The resulting parameters provide a reference point for expected model parameters in F1 adaptation studies and may be used to predict responses in future studies. Using the model in this way supports the development of clear hypotheses that can be tested empirically to ultimately advance the field of speech motor control.

#### Limitations of the Model

In this article, we have demonstrated how SimpleDIVA can be used across a number of different adaptation paradigms. One experimental variation that is not currently supported by SimpleDIVA is the setting of individually-derived perturbation magnitudes (e.g., a 20% shift in an individual's F1/F2 space toward another vowel; Schuerman et al., 2017). In future iterations, we plan to make it possible to specify the perturbation magnitude at the level of the individual, rather than only at the group level. The model is also not yet designed to address the results of studies involving unexpected perturbations rather than the sustained perturbations used in the studies covered herein.

One important limitation of the model for fitting sensorimotor adaptation data concerns the assumption that feedback control mechanisms have started to contribute by the time that the acoustic measurement is taken, ideally at least 150 ms after perturbation onset. Most prior studies of sensorimotor adaptation encourage participants to lengthen their vowel productions in order to increase the amount of adaptation under perturbed feedback, making the data amenable to fitting by SimpleDIVA. However, the typical durations of some vowels during normally produced sentences are less than 150 ms (Jacewicz et al., 2007). For single-syllable stimuli with these vowels, it is unlikely that auditory feedback control substantially affects within-trial performance, though somatosensory feedback control mechanisms are likely contributing. The model's applicability to such cases is thus questionable.

A potential issue involving non-uniqueness of solutions can arise in the current version of the model when one of the model parameters assumes a value that is very close to zero. For example, if the optimal solution involves a value of λFF equal to zero, the value of α<sup>S</sup> no longer has an effect on the fit quality, and the model's optimization routine may find a different value of α<sup>S</sup> each time it is run despite achieving the same fit quality each time. This is not a shortcoming of the model per se; rather, it is an indication that the solution space is nonunique in these cases, with many possible solutions (typically an infinite number) providing the same optimal fit. This behavior is not likely to occur for neurologically normal participant groups (for whom the model parameters should not approach zero) but could possibly occur in certain disordered participant groups or when the individual trials of the perturbation experiment involve unusually long, drawn out perturbed utterances as described above with respect to simulation 5.
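The non-uniqueness described above can be reproduced in a toy setting. The sketch below assumes the same illustrative trial-level update rules used to motivate Eq. 4 (they are not quoted from the SimpleDIVA code), and the function name `produced_series`, the perturbation schedule, and the gain values are hypothetical.

```python
def produced_series(alpha_a, alpha_s, lambda_ff, perts, target=0.0):
    """Produced value on each trial under an assumed SimpleDIVA-like update."""
    ff = target
    out = []
    for pert in perts:
        # Auditory error sees the shift; somatosensory error does not.
        correction = alpha_a * (target - (ff + pert)) + alpha_s * (target - ff)
        out.append(ff + correction)
        ff += lambda_ff * correction  # no learning when lambda_ff is zero
    return out


# Baseline, hold, and after-effect phases with a unit perturbation.
perts = [0.0] * 5 + [1.0] * 20 + [0.0] * 5

# With lambda_ff = 0 the feedforward command never leaves the target, so the
# somatosensory error stays at zero and alpha_s cannot influence the output.
a = produced_series(0.3, 0.1, 0.0, perts)
b = produced_series(0.3, 0.9, 0.0, perts)
identical_without_learning = a == b

# With lambda_ff > 0 the same two alpha_s values give different trajectories.
c = produced_series(0.3, 0.1, 0.2, perts)
d = produced_series(0.3, 0.9, 0.2, perts)
distinct_with_learning = c != d

print(identical_without_learning, distinct_with_learning)
```

In this toy case any optimizer scoring fits by the produced trajectory would assign the same error to every value of α<sup>S</sup> once λFF is zero, which is the flat region of the solution space that the paragraph above describes.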

Another potential limitation of the model concerns an inherent assumption that the relative contributions of the auditory and somatosensory feedback controllers to adaptation of the feedforward command are the same as their relative contributions to online, within-trial corrections. This is because only a single adaptation rate parameter (λFF) is used, rather than separate rates for auditory and somatosensory feedback contributions. This assumption has not yet been experimentally verified; if it proves to be false, the model may need to be modified to include separate adaptation rates for auditory and somatosensory error-based updates of the feedforward command.

Another potential limitation of the model is the inclusion of only one form of learning: adaptation of feedforward motor programs. The model can be extended to allow other forms of learning, such as changing of the auditory and/or somatosensory targets for a speech sound. Changes to these targets are expected to occur on a much slower time scale – longer than the time scale of a single adaptation experiment – according to the model (see Guenther, 2016 for details). For example, targets may change over the course of speech development in children or over a longer period of speech therapy for those with communication disorders. Some studies have shown changes to perceptual category boundaries for speech sounds after speech motor learning (e.g., Shiller et al., 2009). Although this might be construed as evidence for changes to the production target for the speech sound over the course of an experimental session, this interpretation is tenuous since (i) the link between perceptual category boundaries and the targets for speech production remains unknown, and (ii) the production targets represent idealized versions of speech sounds, whereas adaptation effects on perception involve ambiguous stimuli at category boundaries. We performed simulations of versions of SimpleDIVA that included adaptation terms for auditory and/or somatosensory targets. If λFF is set to 0 and only sensory targets are allowed to adapt, the model's fits are poorer than for the version described here. If the sensory targets are allowed to adapt while still including λFF, model fits showed almost no improvement over the simpler version included here, and solutions were often non-unique. For these reasons sensory target adaptation was omitted from the simulations included in this article.

Finally, the simulations herein have focused on fits to group average data. The cross-validation and bootstrap confidence interval estimation analyses performed as part of simulation 10 indicate reliable ranges for each parameter when fitting F1 perturbation group datasets (N of 10 or more for each of the five studies analyzed here). They do not address questions regarding parameter stability within a single study, such as how many subjects are necessary in a group to obtain stable parameter values (a complex topic beyond the scope of the current article). Thus, significant caution is warranted when interpreting differences in parameter values across studies; the interpretations presented here are based on the model's theoretical foundations rather than direct statistical comparisons.

#### Future Directions

The current set of simulations focused on modeling data primarily from young healthy adult speakers (only simulation 5 included data from older adults). A key next step will be to expand this work to examine the contribution of feedback and feedforward control to sensorimotor learning across the lifespan and in those living with communication disorders. Model parameter values derived from multiple participant groups, for example, a neurotypical group and a group with a disorder, can be compared to illuminate the between-group differences in speech motor processing. This line of research has the potential to identify underlying mechanisms of communication disorders with a sensorimotor basis and to subsequently pave the way for the development of future treatments. An important step in the model development process for this purpose will be the creation of statistical tests of the reliability of parameter value differences between participant groups.

Another important future direction is to investigate the model's capabilities for reliably characterizing speech motor control processes in individuals. The current simulations were all fits to group average data, which does not capture individual variation in the relative use of auditory feedback, somatosensory feedback, and feedforward control processes. Previous studies of adaptive responses have shown increased variation among disordered populations (e.g., Abur et al., 2018) as well as individual preferences for one sensory modality over another (Lametti et al., 2012). Future studies examining parameters derived from individual subjects will be necessary to assess how robust the model estimates are at the level of an individual, including individuals with speech motor disorders. Specific issues of importance are whether individual subjects can be fit reliably from a single experimental session and the degree to which the fits are unique and stable (e.g., could a near-optimal fit be achieved with wildly different parameter values even though the optimal fit is unique?).

A third future direction concerns testing of model predictions in order to better verify its assumptions. Unfortunately, direct verification through physiological, structural, or behavioral measures is not possible. One reason for this is that, in terms of physical aspects of the brain, the model parameters correspond to rather large-scale and difficult (if not impossible) to measure characteristics such as number of synaptic projections between areas, strengths of these synapses, plasticity of these synapses, and neural sensitivity of the auditory and somatosensory periphery. Model predictions regarding the relationship between adaptation and online corrections can be tested, but it is noteworthy that even the within-trial, online response to an auditory perturbation depends on factors other than the auditory feedback control gain since these within-trial responses are, like adaptive responses, also dependent on feedforward and somatosensory feedback control mechanisms. For this reason, we are currently formulating a version of SimpleDIVA that is aimed at within-trial responses to unexpected perturbations. This requires the addition of parameters representing the temporal delays in the auditory and somatosensory feedback control loops, which are not considered in the current version of the model. Adding these new parameters presents challenges regarding finding unique fits that we are currently addressing. Upon completion of this version of the model, it should be possible to test the model's ability to account for within-trial time courses as well as adaptation over many trials within the same subject. However, this topic is beyond the scope of the current manuscript, which has a primary aim of demonstrating how a simple model characterizing the three main motor control processes in speech can provide excellent fits to a wide range of auditory sensorimotor adaptation data.

Finally, SimpleDIVA is not the only computational model used to examine sensorimotor adaptation. For example, state space models have been widely used in studies of limb motor adaptation (Thoroughman and Shadmehr, 2000; Smith et al., 2006; Galea et al., 2015; Huberdeau et al., 2015) and such a model was recently applied to speech (Daliri and Dittman, 2019). While the state space model provides good fits to speech sensorimotor adaptation data, it is limited by the fact that the two model parameters (an internal estimate forgetting factor and a sensory error weighting factor) cannot differentiate auditory and somatosensory feedback control processes from feedforward control processes. SimpleDIVA's third parameter (compared to only two for the state space model) gives it this ability without adversely affecting the model's ability to find a unique optimal solution. Furthermore, the adaptation process captured by SimpleDIVA is, in essence, the same process that is used in the full DIVA model to develop accurately tuned speech motor programs in the first place; no such connection exists for state space model parameters. Further treatment of the relative advantages and disadvantages of SimpleDIVA and state space modeling approaches is beyond the scope of the current article; we plan to address this important topic in a future study.

# DATA AVAILABILITY STATEMENT

The datasets generated for this study are available on request to the corresponding author.

# ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the Institutional Review Board at Arizona State University with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Institutional Review Board at Arizona State University.

# AUTHOR CONTRIBUTIONS

FG contributed conception and design of the study. FG, AN-C, RF, EK, and HW developed and programed the SimpleDIVA model. AD, DA, KB, S-EC, S-CC, EH, and TS collected and processed data. EK and HW tested model simulations. EK wrote the first draft of the manuscript. AD wrote the **Supplementary Material**. All authors contributed to manuscript revision, read, and approved the submitted version.

# FUNDING

This research was supported by NIH grants R01 DC002852 (FG, PI), R01 DC016270 (FG and C. Stepp, PIs), P50 DC015446 (R. Hillman, PI), R03 DC014045 (T. Perrachione, PI), R01 DC015570 (C. Stepp, PI), R01 DC011277 (S-EC, PI), T32 DC013017 (C. Moore, PI), T90 DA032484 (B. Shinn-Cunningham, PI), and F31 DC016197 (EH, PI), as well as an Australian Research Council Future Fellowship FT120100355 (KB, PI).

# ACKNOWLEDGMENTS

We are grateful to Tyler Perrachione and Cara Stepp for kindly sharing their data with us.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.02995/full#supplementary-material

# REFERENCES

Dyslexia: a Weaker Sensorimotor Magnet Implied in the Phonological Deficit. J. Speech Lang. Hear Res. 60, 654–667. doi: 10.1044/2016\_JSLHR-L-16-0201

Villacorta, V. M., Perkell, J. S., and Guenther, F. H. (2007). Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. J. Acoust. Soc. Am. 122, 2306–2319. doi: 10.1121/1.2773966

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Kearney, Nieto-Castañón, Weerathunge, Falsini, Daliri, Abur, Ballard, Chang, Chao, Heller Murray, Scott and Guenther. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Timing Evidence for Symbolic Phonological Representations and Phonology-Extrinsic Timing in Speech Production

#### Alice Turk<sup>1</sup> \* and Stefanie Shattuck-Hufnagel<sup>2</sup>

<sup>1</sup> Linguistics and English Language, School of Philosophy, Psychology and Language Sciences, The University of Edinburgh, Edinburgh, United Kingdom, <sup>2</sup> Massachusetts Institute of Technology, Cambridge, MA, United States

The goals of this paper are (1) to discuss the key features of existing articulatory models of speech production that govern their approaches to timing, along with advantages and disadvantages of each, and (2) to evaluate these features in terms of several pieces of evidence from both the speech and nonspeech motor control literature. This evidence includes greater timing precision at movement endpoints compared to other parts of movements, suggesting the separate control of the timing of movement endpoints compared to other parts of movement. This endpoint timing precision challenges models in which all parts of a movement trajectory are controlled by the same equation of motion, but supports models in which (a) abstract, symbolic phonological representations map onto spatial and temporal characteristics of the part(s) of movement most closely related to the goal of producing a planned set of acoustic cues to signal the phonological contrast (often the endpoint), (b) movements are coordinated primarily based on the goal-related part of movement, and (c) speakers give priority to the accurate implementation of the part(s) of movement most closely related to the phonological goals. In addition, this paper presents three types of evidence for phonology-extrinsic timing, suggesting that surface duration requirements are represented during speech production. Phonology-extrinsic timing is also supported by greater timing variability for repetitions of longer intervals, assumed to be due to noise in a general-purpose (and phonology-extrinsic) timekeeping process. The evidence appears to be incompatible with models that have a unified Phonology/Phonetics Component, that do not represent the surface timing of phonetic events, and do not represent, specify and track timing by general-purpose timekeeping mechanisms. 
Taken together, this evidence supports an alternative approach to modeling speech production that is based on symbolic phonological representations and general-purpose, phonology-extrinsic, timekeeping mechanisms, rather than on spatio-temporal phonological representations and phonology-specific timing mechanisms. Thus, the evidence suggests that models in that alternative framework should be developed, so they can be tested with the same rigor as have models based on spatio-temporal phonological representations with phonology-intrinsic timing.

#### Edited by:

Adamantios Gafos, University of Potsdam, Germany

#### Reviewed by:

Sam Tilsen, Cornell University, United States Juraj Šimko, University of Helsinki, Finland

> \*Correspondence: Alice Turk turk@ling.ed.ac.uk

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 30 April 2019 Accepted: 12 December 2019 Published: 24 January 2020

#### Citation:

Turk A and Shattuck-Hufnagel S (2020) Timing Evidence for Symbolic Phonological Representations and Phonology-Extrinsic Timing in Speech Production. Front. Psychol. 10:2952. doi: 10.3389/fpsyg.2019.02952

Keywords: speech motor control, phonology, phonetics, symbolic phonological representations, timing

# INTRODUCTION

There is growing appreciation that models of speech production need to take the process all the way to completion, i.e., to provide principled accounts of systematic patterns of timing behavior in speech, for individual movements and their coordination, and for intervals between acoustic landmarks (Stevens, 2002) that are created by these movements. The known timing characteristics of individual movements include their smooth, single-peaked velocity profiles, the strong positive relationship between peak velocity and distance (longer distance movements have higher peak velocities), and the increase in duration observed for more accurate and/or longer distance movements (in spite of higher peak velocities for longer distances), cf. Fitts' (1954) law. Patterns of coordination between movements include the coordination of movements made by several articulators involved in creating a single constriction, as well as the coordination of overlapping movements involved in making sequences of constrictions. Timing patterns of intervals between acoustic landmarks include systematic effects of interacting factors on acoustic intervals of various sizes, e.g., effects of phrasal position on word-final acoustic intervals, where the largest effects occur on an acoustic interval corresponding to the phrase-final syllable rhyme (phrase-final lengthening), and acoustic intervals corresponding to word-initial onset closures (phrase-initial lengthening); effects of prominence (word-level and phrase-level stress) on syllable-sized intervals, and compression effects of the number of syllables in units such as words (particularly when the word is in phrasally prominent position), complex effects of overall speaking rate; and the interaction of all of these effects (and more) with segment-intrinsic durational patterns. 
Factors that affect intervals between acoustic landmarks can also affect characteristics of individual and coordinated movements, but do so in different ways, e.g., durations of movements toward consonantal constrictions are affected less by prosodic position than are more "steady state" regions. See Turk and Shattuck-Hufnagel (2020) for more detail about these effects and references.

Existing models of speech production vary in how many of these effects they can account for. Articulatory Phonology in the Task Dynamics framework (AP/TD) is the model which to date provides the most comprehensive coverage, and is one of the very few models which has accounts of multiple types of effects of prosodic structure on durational patterns. However, its phonology-intrinsic approach to timing is fundamentally different from that of other models, in large part because of its use of spatiotemporal phonological representations and its lack of a phonetic planning component that is separate from the phonology. In the AP/TD approach, such a component is not required because surface timing and spatial characteristics are emergent from the phonological component. This modeling approach contrasts with other models which have symbolic phonological representations; these representations are used to express categories of phonological contrast and phonological equivalence but do not specify spatio-temporal characteristics. As a result, these models have a phonetic planning component that is separate from the phonological planning component, to provide quantitative temporal, spatial, and spectral interpretations of the phonological representations. These differences among models lead us to ask a basic question: what is the most appropriate way to model systematic timing patterns in speech production?

The goal of this paper is twofold: (1) to discuss the key features of existing articulatory models of speech production that govern their approaches to timing, along with advantages and disadvantages of each, and (2) to evaluate these features in terms of several pieces of evidence from both the speech and non-speech motor control literature. This evidence, taken together, supports an alternative approach to modeling speech production that is based on symbolic phonological representations and general-purpose, phonology-extrinsic, timekeeping mechanisms, rather than on quantitative spatiotemporal phonological representations and phonology-specific timing mechanisms. Thus, the evidence suggests that models in that alternative framework should be developed, so they can be tested with the same rigor as models based on spatiotemporal phonological representations with phonology-intrinsic timing mechanisms.

This paper is organized as follows: First, it presents key characteristics and differences among articulatory models that deal with timing issues, along with advantages and disadvantages of each. Second, it presents evidence from a wide variety of studies that bears on the appropriateness of these key characteristics, and the implications of this evidence for timing models. Third, it discusses Articulatory Phonology in the Task Dynamics framework, which to date is the most comprehensive, best-worked out model of timing, and why it is challenged by these findings. Finally, it discusses why the evidence supports 3-component models based on symbolic phonological representations and phonology-extrinsic timing, with separate components for phonological and phonetic planning, and motor-sensory implementation.

# KEY CHARACTERISTICS AND DIFFERENCES AMONG ARTICULATORY MODELS THAT DEAL WITH TIMING ISSUES, ALONG WITH ADVANTAGES AND DISADVANTAGES OF EACH

# Spatio-Temporal vs. Symbolic Phonological Representations

Probably the most fundamental difference among current models of speech production planning has to do with the nature of phonological representations, which are symbolic in some, and spatio-temporal in others. Models with symbolic representations include Keating (1990), Fujimura (1992), Guenther (1995), and Henke (1966) et seq.; models with spatio-temporal representations include Articulatory Phonology (Browman and Goldstein, 1985, 1989, 1992; Saltzman et al., 2008; Goldstein et al., 2009) and its developments (e.g., Tilsen, 2013, 2016, 2018; Sorensen and Gafos, 2016; as well as Šimko and Cummins, 2010, 2011). It is important to note that although spatio-temporal representations in Articulatory Phonology are not symbolic, they are nevertheless abstract, because there is not a one-to-one mapping between phonological representations of each gesture and surface realization<sup>1</sup>.

The choice of the nature of phonological representations has fundamental implications both for the architecture of the speech production system and for the way it deals with timing issues. The dynamic spatio-temporal phonological representations of Articulatory Phonology "underlie[s] and give[s] rise to an action's observable kinematic patterns" (Saltzman, 1995, p. 150). Therefore, although they are abstract, they include quantitative details that govern how speech articulations are produced in space and time in a given context (once gestural activation and overlap are specified in a gestural score). Thus, they make it possible to do without a separate phonetic planning component to provide these quantitative specifications. This appears advantageous, because it makes it possible for speakers (and listeners) to avoid "translating" from data structures in one component to data structures in another (Fowler et al., 1980). In addition, it makes it possible to avoid planning all of the quantitative details of speech production for each utterance: If the quantitative details (including timing) are represented in the phonological units and structures, speakers don't need to explicitly plan them afresh for each utterance, in a separate phonetic planning component. Models with spatio-temporal phonological representations therefore have a very different architecture than those with symbolic phonological representations. That is, models with spatiotemporal representations typically have two components: (1) A single integrated component for both phonology and phonetics, and (2) a motor-sensory implementation component, whereas models with symbolic phonological representations typically have three: (1) A phonological planning component, (2) a separate phonetic planning component, and (3) a motor-sensory implementation component; in such 3-component models, the quantitative details of production are planned in the phonetic planning component.

Although obviating the need for complex online planning is a substantial advantage of the spatio-temporal approach, it is a challenge for this approach to provide an account of systematic contextual variability (including systematic timing variability) that is due to a range of factors such as overall rate of speech, prosodic position, segmental context, movement distance, etc. Existing spatio-temporal-based approaches have proposed additional mechanisms, such as adjustments to gestural activation time (Byrd and Saltzman, 2003; Tilsen, 2016), and/or additional, competing, target representations (Gafos, 2006; Gafos and Beňuš, 2006; see also Flemming, 2001<sup>2</sup>) to account for this variability. However, these approaches face the challenge of explaining how quantitative, spatio-temporal phonological representations and adjustments are learned, given that they are not directly observable from surface acoustics. In contrast, phonological learning is different in approaches with symbolic representations, where the learner must learn the phonological equivalence of variants that are members of a single category, but doesn't have to infer quantitative parameter values that define the category from potentially ambiguous input<sup>3</sup>.

#### Emergent Surface Timing Characteristics vs. Explicitly Specified Surface Timing Characteristics

One of the critical implications of choosing spatio-temporal representations over symbolic representations is that models with spatio-temporal representations, together with adjustments of their activation, can yield surface temporal patterns without having to explicitly specify surface timing characteristics in units such as milliseconds. This is because surface timing in these models is emergent, rather than explicitly specified. For example, in models that use mass-spring systems to accomplish movements toward constrictions, different surface duration patterns can be achieved by changing the stiffness of mass-spring systems without explicitly specifying a surface duration. Emergent systems would be advantageous if it turned out that surface durations are not represented; however, as will be argued below, there is evidence that surface durations are in fact represented. Furthermore, not representing surface durations of speech may make it difficult to interact with external events in the world, e.g., to finish an utterance before the occurrence of an anticipated event, expected to occur at a particular time.
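The way duration can emerge from stiffness in a mass-spring system can be made concrete with a minimal sketch. It assumes a critically damped second-order system with unit mass and an arbitrary 5% criterion for having reached the target; these choices, the stiffness values, and the function name `settling_time` are illustrative assumptions, not a description of any particular published implementation.

```python
import math


def settling_time(stiffness, mass=1.0, tolerance=0.05):
    """Time for a critically damped mass-spring to cover (1 - tolerance) of the distance."""
    omega = math.sqrt(stiffness / mass)  # natural frequency

    def remaining(t):
        # Fraction of the initial distance still to go at time t when the
        # system is critically damped and starts at rest.
        return (1.0 + omega * t) * math.exp(-omega * t)

    lo, hi = 0.0, 1.0
    while remaining(hi) > tolerance:  # widen the bracket until it holds the root
        hi *= 2.0
    for _ in range(60):               # bisection on a monotonically decreasing function
        mid = 0.5 * (lo + hi)
        if remaining(mid) > tolerance:
            lo = mid
        else:
            hi = mid
    return hi


slow = settling_time(stiffness=100.0)
fast = settling_time(stiffness=400.0)
print(slow, fast, slow / fast)
```

No duration is passed to the function: the time to reach the target follows from the stiffness alone, and under these assumptions quadrupling the stiffness halves it. Because the criterion is a fraction of the distance, the result is also independent of how far the movement has to travel.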

#### Separate vs. Integral Specification of Spatial and Temporal Characteristics

Another characteristic that is implied by the choice of spatio-temporal phonological representations is that in these models, temporal and spatial characteristics are represented integrally in phonological representations. In contrast, in models with symbolic representations, which require a separate phonetic planning component, it is in principle possible to specify temporal characteristics separately from spatial characteristics. Integrated spatio-temporal representations would be advantageous if temporal patterns were predictive of spatial patterns, but would be challenged if, as is argued below, speakers are able to accomplish the same temporal pattern using different spatial paths of movement, particularly when a single speaker produces the same temporal pattern in more than one way.

<sup>1</sup>The surface articulatory trajectories controlled by a given gesture are determined by the gesture itself, as well as context, i.e., gestural starting position, overlap with other gestures, prosodic context, and speaking rate.

<sup>2</sup>Like Gafos (2006) and Gafos and Benuš (2006), Flemming's (2001) approach involves multiple, competing target specifications, but Flemming (2001) doesn't explicitly model speech articulation.

<sup>3</sup>An additional difference between the spatio-temporal vs. symbolic approaches is that in spatio-temporal approaches, a phonological representation (gesture) is defined by an equation of motion with a lexically specified and fixed gestural target coefficient. Each gesture also controls a fixed set of articulators (a coordinative structure) for the production of an articulatory constriction, although the relative contribution of each articulator in producing a gesture can vary according to context. In contrast, in symbolic approaches, the constriction targets, and even the sets of articulators, used for the production of different tokens of the same symbolic phonological category can vary. For example, the symbolic feature [+labial] can be produced with a labiodental constriction target (for [v]) or with a labial constriction target (for [b]), and in British varieties of English the phoneme /t/ can be produced with and without involvement of the tongue tip, as in aspirated [tʰ] and glottal stop [ʔ] variants (Heyward et al., 2014; see additional examples and discussion in Turk and Shattuck-Hufnagel, 2020).

# Use of General-Purpose, Phonology-Extrinsic Timekeepers and Timing Units vs. Phonology-Intrinsic Timekeepers

One might wonder whether it is in principle possible for models with spatio-temporal phonological representations to avoid the use of any type of timekeeper or timing unit. However, systematic contextual timing variability of speech (due to e.g., position-in-phrase, position-in-word, phrasal prominence, and speaking rate) appears to require timing control that specifies temporal extent. Thus to date all speech production models make use of some type of timekeeper, either a general-purpose timekeeper (in ms) or a phonology-specific timekeeper. For example, Nam et al. (2010) use a phonology-specific timekeeper (gestural planning oscillators) to specify the relative timing of gesture initiation, and Saltzman et al. (2008) use such oscillators to specify the durations of gestural activation. In contrast, models with symbolic phonological representations assume a general-purpose timekeeper that operates with solar timing units (e.g., ms). These include proposals by Fujimura (1992 et seq.), Guenther (1995, 2016), and Henke (1966). Šimko and Cummins' (2010, 2011) Embodied Task Dynamics is an example of an approach with spatio-temporal phonological representations that nevertheless assumes a general-purpose timekeeper and solar timing units. This approach provides an optimization account of systematic patterns of variability found in speech<sup>4</sup>. In this model, optimal movement parameters (including the duration of gestural activation as measured in milliseconds) are determined on the basis of several movement costs (effort, parsing, and time), where the time cost is based on utterance duration as measured in solar time units.

It would be difficult to distinguish models with phonology-specific and general-purpose timekeepers if the timing units in both types of models were linearly related. However, mechanisms for lengthening gestural activation intervals that involve slowing the phonology-specific clock (e.g., Pi and Mu<sup>T</sup> gesture adjustments, Byrd and Saltzman, 2003; Saltzman et al., 2008) warp the relationship between phonology-specific time and solar time in parts of utterances that are affected by Pi and Mu<sup>T</sup> gestures, such as boundary-adjacent intervals and stressed syllables. That is, in models that use phonology-specific "clock"-slowing to accomplish boundary- and prominence-related lengthening, the lengthened intervals do not contain more phonology-specific units, although they are longer in solar time, warping the relationship between these two kinds of representations in non-linear ways across an utterance, and in inconsistent ways between utterances. Diagnostics of speakers' representations of the durations of boundary-related and/or prominence-related intervals would provide a way of determining whether phonology-specific vs. solar timing units are more appropriate; see section "Constraints on Lengthening Due to Phrasal Prosody Suggest That Surface Timing Patterns Are Represented, and Not Emergent" for evidence that bears on this issue.

# Different Ways of Modeling the Time Course of Individual Movements

Models of speech production also differ in the mechanisms they use for achieving constriction-related movements that have appropriate movement velocity profiles. In Fujimura's (1994) model, movements toward constrictions, called "elemental gestures," are modeled as impulse response functions, parameterized for various aspects of the movement timecourse (i.e., affecting the shape of the velocity profile) as well as inherent amplitude. The values of the parameters for each elemental gesture are stored in a table. As long as the gestures are not constrained by e.g., saturation effects, the parameter values in the table are modified in a produced utterance according to a modification factor (the syllable pulse) that represents each syllable's strength in an utterance. In this model, elemental gestures for vowels change slowly over time, and faster-changing consonantal gestures are superimposed on these.

In Articulatory Phonology in the Task Dynamics framework, gestural movements are generated using a second order mass-spring system with a linear restoring force. The point attractor mass-spring dynamics of this model appropriately generates a smooth, single-peaked tangential velocity profile, i.e., with a single acceleration and a single deceleration phase. However, the velocity profiles generated by systems with linear restoring forces are asymmetrical, with velocity peaks that are earlier than observed in empirical data. To create more realistic velocity profiles, gestural activation functions, which originally were turned on and off abruptly, were instead shaped to have gradual activation interval on- and off-ramps, and these were shown to successfully generate velocity profiles with centered peaks (Byrd and Saltzman, 1998). More recently, Birkholz et al. (2011) and Sorensen and Gafos (2016) showed that other types of mass-spring systems could generate more realistic timing of the velocity peak without gradual on- and off-ramps for gestural activation. Birkholz et al. (2011) used a 10th order linear mass-spring system, and Sorensen and Gafos (2016) used a second order system with a non-linear restoring force. Sorensen and Gafos (2016) showed that their system with a non-linear restoring force had the added advantage of providing an account of the observation that longer distance movements are longer in duration than shorter distance movements, in spite of higher peak velocity (cf. Fitts' 1954 law).
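The two properties of the linear system mentioned above (an early velocity peak, and a duration that does not grow with distance) can be illustrated with a minimal sketch. It assumes a unit-mass, critically damped second order system released from rest, and again takes movement duration to be the time needed to close 95% of the distance; these choices, the function names, and the parameter values are assumptions made for illustration and do not reproduce the AP/TD implementation.

```python
import math

def position(t, start, target, stiffness):
    """Position at time t of a unit-mass, critically damped linear
    mass-spring system released at rest from `start` (assumed damping)."""
    omega = math.sqrt(stiffness)
    return target + (start - target) * (1.0 + omega * t) * math.exp(-omega * t)

def speed(t, start, target, stiffness):
    """Absolute velocity at time t (analytic derivative of `position`)."""
    omega = math.sqrt(stiffness)
    return abs(target - start) * omega ** 2 * t * math.exp(-omega * t)

def peak_and_duration(start, target, stiffness, tolerance=0.05, dt=1e-4):
    """Return (time of peak speed, movement duration), where duration is the
    first sampled time at which the remaining distance is within
    `tolerance` of the total distance."""
    distance = abs(target - start)
    t, best_t, best_v = 0.0, 0.0, 0.0
    while abs(target - position(t, start, target, stiffness)) > tolerance * distance:
        v = speed(t, start, target, stiffness)
        if v > best_v:
            best_t, best_v = t, v
        t += dt
    return best_t, t

if __name__ == "__main__":
    for target in (5.0, 20.0):  # short vs. long movement, same stiffness
        peak, duration = peak_and_duration(0.0, target, stiffness=400.0)
        print(f"distance={target:5.1f}  duration={duration * 1000:6.1f} ms  "
              f"peak at {100 * peak / duration:4.1f}% of the movement")
```

In this sketch the velocity peak falls well before the midpoint of the movement, and the short and long movements take the same time, which is the behavior that activation ramps or a non-linear restoring force are intended to correct.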

Movement trajectories (and consequently their velocity profiles) are generated in a different way in Guenther's DIVA model (2016). This model generates articulatory movement trajectories via a neural network mapping between directions in sensory space and velocities of articulators (Guenther and Micci Barreca, 1997; Guenther, 2016). In this model, articulatory movement trajectories are generated which produce acoustics that fall within a spectro-temporal target template for each speech sound. Thus, the time course of movement is determined by the way acoustic formants vary over time, and not by any explicit motor principle.

<sup>4</sup>Embodied Task Dynamics was not intended to be a theory of online speech production, but rather was developed to explain coordination patterns.

The non-speech motor control literature has proposed other ways of modeling appropriate velocity profiles. Nelson (1983), Harris and Wolpert (1998), and Tanaka et al. (2006) present Optimal Control Theory accounts. For example, Tanaka et al. (2006) propose that movements are produced with minimum durations that conform to accuracy requirements, and show that appropriate velocity profiles and movement durations can be generated for different accuracy requirements on the assumption that noise grows with the size of the neural control signal. Harris and Wolpert (1998) and Tanaka et al. (2006) successfully predict the relationships among speed, distance, and accuracy described in Fitts' (1954) law.
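For reference, Fitts' (1954) law is commonly written in the following form, where $MT$ is movement time, $D$ is the distance to the target, $W$ is the target width (the accuracy requirement), and $a$ and $b$ are empirically fitted constants. The symbol names are chosen here for illustration:

$$
MT = a + b \log_2\left(\frac{2D}{W}\right)
$$

On this formulation, movement time increases with distance and decreases as the tolerated endpoint error grows.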

Lee (1998) proposes that movement velocity profiles are governed by tau-coupling, where tau = time-to-goal-achievement at the current movement rate. Appropriate movement velocity profiles can be generated if actors keep their taus $\tau_X$ in constant proportion to the taus of a Tau Guide $\tau_G$, by making $\tau_X = K\tau_G$, where $\tau_G = \frac{1}{2}\left(t - \frac{T^2}{t}\right)$, $t$ is time, and $T$ is movement duration (the equation is based on Newton's equations of motion). The value of the coupling constant $K$ determines the skewness of the velocity profile. If $K = 1$, the movement accelerates at a constant rate; lower values of $K$ have an acceleration followed by a deceleration, with longer decelerations for lower values of $K$. Lee's model has the advantage of being computationally simpler than Optimal Control Theory accounts. It predicts that actors should be able to manipulate velocity profile skewness via the $K$ parameter. This provides a potential account of velocity profile skewness differences observed in the non-speech and speech motor control literature (e.g., Perkell et al., 2002). For example, a bird attempting to land on a twig will have an earlier velocity peak to ensure a gentle, accurate, low velocity landing, whereas a tongue approaching the roof of the mouth for a /t/ might have a later velocity peak.
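The effect of the coupling constant on the velocity profile can be sketched as follows. If the tau of a movement is the remaining gap divided by its rate of change, then coupling it to the Tau Guide with constant K and integrating gives a gap of x0 · (1 − t²/T²)^(1/K), whose closing speed peaks at t = T · sqrt(K / (2 − K)). This closed form is derived here from the tau-coupling equation and is offered as a worked illustration; the function names and parameter values are hypothetical.

```python
import math

def gap(t, x0, T, K):
    """Remaining distance to the goal at time t (0 <= t <= T) under
    tau-coupling to the tau guide with coupling constant K."""
    s = max(0.0, 1.0 - (t / T) ** 2)  # guard against rounding below zero
    return x0 * s ** (1.0 / K)

def closing_speed(t, x0, T, K):
    """Speed at which the gap closes (magnitude of the gap's derivative)."""
    s = max(0.0, 1.0 - (t / T) ** 2)
    return (2.0 * x0 * t / (K * T ** 2)) * s ** (1.0 / K - 1.0)

def peak_time(T, K):
    """Analytic time of peak speed for 0 < K <= 1."""
    return T * math.sqrt(K / (2.0 - K))

def numeric_peak_time(x0, T, K, steps=100000):
    """Grid search for the time of peak speed, as a check on `peak_time`."""
    times = [T * i / steps for i in range(steps + 1)]
    return max(times, key=lambda t: closing_speed(t, x0, T, K))

if __name__ == "__main__":
    T = 0.4  # movement duration in seconds (illustrative)
    for K in (0.2, 0.5, 1.0):
        print(f"K={K:3.1f}  velocity peak at {100 * peak_time(T, K) / T:5.1f}% of T")
```

With K = 1 the speed rises linearly until the goal is reached (constant acceleration), and smaller values of K move the peak earlier, leaving a longer deceleration phase.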

# Different Ways of Modeling Coordination

Another way in which articulatory models of speech production differ is in the ways that they model the temporal coordination of articulatory movements. Coordination can be described at different levels, including the coordination of movements that contribute to a single constriction, as well as the coordination of movements that contribute to sequences of constrictions. Models differ on the information used to determine relative timing patterns, i.e., on whether they are based on relative timing vs. spatial characteristics vs. absolute timing. For example, in Fujimura's model, where faster consonantal gestures are superimposed on slower, vocalic gestures, coordination is based on relative timing: Consonantal elemental gestures are triggered at appropriate delays or lags from the syllable pulse, where the delays are specified as ratios of the syllable duration (Fujimura, 1994; Wilhelms-Tricarico, 2015).

Nam et al.'s (2010) coupled oscillator model of coordination is also based on relative timing, that is, one constriction formation gesture is initiated when a particular planning oscillator phase of an earlier gesture is reached. On this view, if coupled planning oscillators speed up or slow down, the relative timing of gestures governed by the oscillators will be preserved. While Tilsen (2016) adopts this relative timing view for the coordination of onset consonants with syllable nuclei, he proposes a different mechanism based on spatial characteristics for the coordination of coda consonants with syllable nuclei. Tilsen (2016) proposes that coda consonant gestures are activated at the achievement of the nucleus gestural target, i.e., on the basis of spatial information. In contrast, Šimko and Cummins' model proposes that gestural coordination and overlap are governed by costs of parsing (perceptual recoverability) and absolute time. For example, among other things, a higher parsing cost will encourage the speaker to make gestures more perceptually recoverable by making them less overlapped, and a higher time cost will make utterance duration shorter through increased overlap.

Whereas models of speech production have to date focused primarily on the relative timing of movement initiation, models available in the non-speech motor control literature suggest another possibility, namely coordination based primarily on the goal-related parts of movement, where movements are initiated at a time that ensures spatial and/or temporal accuracy (Harris and Wolpert, 1998; Todorov and Jordan, 2002; Tanaka et al., 2006). Similarly, on Lee's (1998, 2009) view, movements are controlled based on tau-coupling (tau = time-to-goal-achievement at the current movement rate) to achieve their goal at a particular time. Thus, his model ensures synchronous goal achievement for all movements that are tau-coupled before the end of movement, but does not require that these movements begin synchronously. See Turk and Shattuck-Hufnagel (2020) for more discussion of Lee's General Tau Theory as applied to speech.

# Different Ways of Modeling Effects of Prosodic Structure on Timing

In spite of growing evidence that prosodic structure has a systematic influence on the durational patterns of virtually all known languages, relatively few articulatory models have explicit accounts of these and other contextual effects on timing. Here, we discuss models which have explicitly modeled prosodic effects in different ways.

Fujimura's (1992 et seq.) framework assumes that phonological representations are expressed in terms of symbolic distinctive features, as well as symbolic representations of syllables (including their sub-constituents, i.e., onsets, nuclei and codas), and assumes higher-level prosodic constituency which can influence syllable durations in the vicinity of higher-level constituent boundaries. The syllable representations are mapped onto a "syllable pulse train," i.e., a series of (usually symmetric) triangles corresponding to syllables and pauses (if they occur), whose bases are contiguous. Triangle heights represent an appropriate magnitude multiplication factor (the pulse) which controls syllable prominence and phrasal boundary effects, and triangle bases represent syllable or pause duration. As a default, the apex angle is assumed to be the same for all triangles; therefore syllable triangle height correlates with syllable duration, so that longer duration and prominence are linked. In cases where additional lengthening is required, either the apex angle can be adjusted, or additional (half) triangles can be added to the utterance (Fujimura, 2002). Although this model provides a framework for modeling the influence of prosodic structure on correlated spatial and durational characteristics, it doesn't provide a way of determining what the syllable pulse heights and apex angles (and hence the syllable durations) should be for a given context.
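The geometric link between pulse height and duration can be made concrete with a small sketch. For a symmetric triangle, the base equals twice the height times the tangent of half the apex angle, so a shared apex angle makes duration proportional to height. The function names, the default angle, and the arbitrary units below are illustrative assumptions and are not part of Fujimura's specification.

```python
import math

def base_duration(pulse_height, apex_angle_deg):
    """Base length of a symmetric triangle with the given height and apex
    angle; in the pulse-train picture this is the syllable duration
    (arbitrary units, assumed)."""
    half_angle = math.radians(apex_angle_deg) / 2.0
    return 2.0 * pulse_height * math.tan(half_angle)

def pulse_train(pulse_heights, apex_angle_deg=60.0):
    """Lay triangles end to end (contiguous bases) and return a list of
    (onset, duration) pairs, one per syllable or pause."""
    onset, train = 0.0, []
    for height in pulse_heights:
        duration = base_duration(height, apex_angle_deg)
        train.append((onset, duration))
        onset += duration
    return train

if __name__ == "__main__":
    # A more prominent second syllable (taller pulse) is also longer.
    for onset, duration in pulse_train([1.0, 2.0, 1.0]):
        print(f"onset={onset:5.2f}  duration={duration:5.2f}")
    # Extra lengthening without extra prominence: widen the apex angle.
    print(base_duration(1.0, 60.0), base_duration(1.0, 70.0))
```

The sketch takes heights and angles as given, which mirrors the limitation noted above: nothing in the geometry says what those values should be in a given context.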

Šimko's (2009), Šimko and Cummins' (2010, 2011), and Windmann's (2016) approaches are of note in this regard, because they propose a principled cost-minimization mechanism for determining durational properties of speech, based on Optimal Control Theory. Šimko and Cummins' Embodied Task Dynamics model is a development of the Task Dynamics model used in AP/TD, in which the articulators are assigned masses, and optimization is used to determine model parameter values. In this model, gestural activation onset and offset timing (specified in solar time units) and system stiffness (where system stiffness is a scaling factor for gestural and "speech-ready" stiffness<sup>5</sup>) are optimized using three costs: an effort cost, a perceptual (parsing) cost, and a time cost which is a linear function of utterance duration in milliseconds. Benuš and Šimko (2014) showed that locally decreasing the duration cost in the vicinity of a phrase boundary can be used to model boundary-related lengthening in Slovak m(#)abi and m(#)iba sequences<sup>6</sup>.
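The logic of lengthening through a locally reduced time cost can be shown with a deliberately simplified toy. It is not the Embodied Task Dynamics cost function: here the effort and parsing costs are simply assumed to fall in inverse proportion to interval duration, the time cost is linear in duration, and all names and weights are hypothetical.

```python
def total_cost(duration_ms, effort_weight, parsing_weight, time_weight):
    """Toy cost: effort and parsing costs shrink as the interval lengthens
    (both assumed proportional to 1/duration); the time cost is linear."""
    effort = effort_weight / duration_ms
    parsing = parsing_weight / duration_ms
    time = time_weight * duration_ms
    return effort + parsing + time

def optimal_duration(effort_weight, parsing_weight, time_weight,
                     candidates_ms=range(20, 1001)):
    """Grid search over candidate durations (ms) for the minimum-cost one."""
    return min(candidates_ms,
               key=lambda d: total_cost(d, effort_weight, parsing_weight, time_weight))

if __name__ == "__main__":
    medial = optimal_duration(5000.0, 5000.0, time_weight=1.0)
    final = optimal_duration(5000.0, 5000.0, time_weight=0.5)
    print(f"phrase-medial optimum: {medial} ms")
    print(f"phrase-final optimum (lower time cost): {final} ms")
```

In this toy, halving the time-cost weight moves the minimum-cost duration to a longer value, which is the qualitative pattern used to model boundary-related lengthening.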

Although Šimko and Cummins' approach is based on Articulatory Phonology, it differs from AP in the use of solar timing units, which are used for the specification of its time cost, as well as for the specification of gestural activation durations which result from their optimization procedure. In contrast, Articulatory Phonology in the Task Dynamics framework (Byrd and Saltzman, 2003; Saltzman et al., 2008) provides an approach in which solar timing units are not required, and where surface timing patterns are fully emergent from phonology-specific processes. In their approach, lengthening effects due to prosodic structure are modeled as adjustments of gestural activation durations. Gestural activation durations are not specified in milliseconds, but rather in proportions of gestural planning oscillator periods. At a default rate of speech, gestural activation duration corresponds to gestural mass-spring settling time, i.e., the time required for a gesture to approximate its target. In particular prosodic positions, such as phrase-boundary-adjacent position, or at slower speech rates, the default gestural activations are stretched (Byrd and Saltzman, 2003). This stretching is implemented in later versions of the theory (Saltzman et al., 2008) by slowing the frequency of the gestural planning oscillators. Analogously, at faster rates of speech, or in unstressed positions, the default gestural activations are shortened by speeding up the frequency of the gestural planning oscillators. This approach has been used successfully to model effects of prominence, boundary-adjacency, and poly-subconstituent shortening.
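The relationship between clock-relative activations and solar time can be sketched as follows. The example assumes that each gestural activation is specified as a proportion of a planning oscillator period and that a single oscillator frequency holds throughout each activation; the function name and the numerical values are illustrative only.

```python
def solar_durations(activation_proportions, clock_rates_hz):
    """Convert activations given as proportions of an oscillator period into
    solar-time durations (s), given the oscillator frequency (Hz) assumed
    to be in effect during each activation."""
    return [p / f for p, f in zip(activation_proportions, clock_rates_hz)]

if __name__ == "__main__":
    proportions = [0.5, 0.5, 0.5, 0.5]  # phonology-specific units
    default = solar_durations(proportions, [5.0, 5.0, 5.0, 5.0])
    # Slow the clock for the final, boundary-adjacent activation only.
    slowed = solar_durations(proportions, [5.0, 5.0, 5.0, 2.5])
    print("default (s):", default)
    print("final clock slowed (s):", slowed)
```

The final interval contains the same number of phonology-specific units in both cases but is longer in solar time when the clock is slowed, so the mapping between the two kinds of units is no longer uniform across the utterance.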

Tilsen's recent development of AP (Tilsen, 2018) provides another mechanism for prominence-related lengthening, based on feedback about target approximation. In this model, one mechanism for ending gestural activation is the suppression of gestural activation after targets are approximated. In this proposal, prominent syllables and syllables produced at a slow speech rate are proposed to result from a high degree of reliance on external feedback about target approximation.

# EVIDENCE FROM THE LITERATURE THAT RELATES TO THESE CHARACTERISTICS AND CONSTRAINS THE CHOICE OF APPROPRIATE MODEL

The previous section showed substantial differences in how existing models of speech articulation control timing patterns. Many of these differences derive from choices about the general architecture of the system and about the nature of phonological representations that encode contrast, phonological equivalence and prosodic structure. In spite of the differences, these models all generate plausible articulatory trajectories, at least in some contexts. How can they be distinguished? In this section, we discuss phenomena which bear on this question, focusing on the issues of (1) emergent vs. specified surface timing patterns, (2) spatio-temporal representations vs. the independent representation of timing information, (3) the use of phonology-specific vs. general-purpose timekeepers, (4) spatio-temporal representations vs. symbolic representations, (5) movement coordination, and (6) modeling effects of prosodic structure.

<sup>5</sup>Speech-ready stiffness is analogous to the stiffness of the neutral attractor in the AP/TD framework, except that speech-ready dynamics is always turned on, even when gestures are active, and the speech-ready stiffness of individual articulators can be manipulated according to requirements for higher precision (Šimko, 2009). The speech-ready position is assumed to be "an average constellation with regard to the entire set of mastered gestures" (Šimko et al., 2014, p. 133).

<sup>6</sup>Windmann et al. (2015) and Windmann (2016) show how this same general approach, i.e., minimizing costs of effort, (mis)-parsing and time, can be used to model durational effects of prominence (phrasal prominence, lexical prominence, and their interaction) and polysyllabic shortening, as well as interactions with speaking rate, as measured from the acoustic signal.

Evidence bearing on these issues motivates an alternative approach to modeling timing control, i.e., a phonology-extrinsic approach based on symbolic phonological representations in a Phonological Planning Component, with specifications for surface durations that are planned in a Phonetic Planning Component that is separate from the Phonological Planning Component. The first two phenomena, (1) constraints on lengthening due to phrasal prosody, and (2) different strategies for controlling rate of speech, boundary-related lengthening and quantity, suggest that surface durations are explicitly represented. As a result, they present a challenge to approaches to timing in which surface durations emerge without explicit representation; moreover, the second phenomenon suggests that surface durations can be specified independently of spatial characteristics, since the timing patterns are the same while the spatial characteristics vary. The third phenomenon, (3) more timing variability for longer duration intervals in speech and non-speech behavior, suggests the involvement of a noisy general-purpose timekeeping mechanism in the speech production process, in which longer duration intervals are associated with more timing variability due to accumulated noise. Finally, the fourth phenomenon, (4) less timing variability at movement endpoints compared to other parts of movement, challenges the concept of spatio-temporal representations, and suggests that movement coordination is based on goal-related parts of movement rather than onsets. Taken together, these four phenomena support the alternative view that speech production planning is based on symbolic phonological representations and includes separate components for Phonological and Phonetic Planning, as well as a third, Motor-Sensory Implementation component in which speech movements and acoustics are monitored and adjusted to ensure that spatial and timing goals are achieved appropriately (Houde and Nagarajan, 2011; Guenther, 2016).

# Constraints on Lengthening Due to Phrasal Prosody Suggest That Surface Timing Patterns Are Represented, and Not Emergent

In Northern Finnish and Dinka, which have a phonemic quantity contrast, the phonemically short vowels are lengthened less than the long vowels, in prosodic contexts such as phrase-final position (Remijsen and Gilley, 2008; Nakai et al., 2009, 2012). For example, as **Figure 1** shows, the magnitude of final, accentual, and combined lengthening of phonemically short vowels in Northern Finnish is restricted compared to lengthening on phonemically long vowels (Nakai et al., 2009, 2012). This suggests that speakers of this language explicitly constrain the surface durations of phonemically short vowels to maintain the duration contrast with longer vowels.

In this figure, VV refers to a phonemically long vowel, and V to a phonemically short vowel. Note that in the last syllable of CVCV(C) words, cf. the left-hand side of **Figure 1B**, the phonemically short vowel shows a greatly reduced magnitude of combined accentual + final lengthening (17%) compared to the phonemically long vowel in the same context (68%). The lengthening pattern on this so-called "half-long vowel"<sup>7</sup> is suggestive of a constraint resulting in a surface duration of phonemically short vowels of < ca. 140 ms, at least at this speaking rate, supporting the view that the (phonemically short) half-long vowels are lengthened less than the long vowels to avoid endangering the phonemic contrast between short and long vowels in this language. Two types of empirical evidence for a constraint on the surface duration of phonemically short vowels are provided in Nakai et al. (2009, 2012). First, Nakai et al. (2009) found a negative correlation between phrase-medial duration and the amount of final lengthening for V2 in CV1CV2 structures. One might initially imagine a mechanism by which speakers could learn to lengthen phonemically short vowels less to avoid confusion in their listeners, without explicitly representing a durational constraint. However, this potential solution is ruled out by the observation that speakers adjust the amount of lengthening for their phonemically short vowels in a way that maintains a surface durational distinction. That is, phonemically short vowels that are shorter are lengthened more, and phonemically short vowels that are longer are lengthened less, showing evidence of a surface duration constraint. Further support for a surface duration constraint comes from Nakai et al.'s (2012) study of final lengthening and accentual lengthening, which combine sub-additively for V2 in CV1CV2. A constraint on lengthening of this type is difficult to express in a system that does not explicitly represent surface durations.

The final lengthening patterns in Dinka, a Nilotic language, are also consistent with this type of constraint. This language has a three-level quantity system, and vowels of short and medium quantities are lengthened less than the long vowels, in phrase-final position, a prosodic context that requires duration lengthening (see **Figure 2**, reproduced from Remijsen and Gilley, 2008).

<sup>7</sup>The "half-long vowel" is a phonemically short vowel whose phrase-medial duration is intermediate between that of a non-word-final phonemically short vowel and that of the long vowel (VV).

The results reviewed here suggest that the explanation relates to surface durational information which is represented in the minds of speakers, and is involved in the maintenance of phonemic contrasts. These results are difficult to account for in models in which surface durations are the emergent output of activation interval durations + phonology-intrinsic clock-slowing adjustments, are not explicitly represented, and so cannot be invoked as constraints on lengthening.

# Different Strategies for Manipulating Durations (in e.g., Rate of Speech, Boundary-Related Lengthening, and Quantity), Suggest That Surface Timing Goals Are Explicitly Represented, and Not Emergent

The explicit representation of surface duration requirements is supported by another type of evidence, related to the implementation of overall rate of speech as well as to boundary-related lengthening, and to duration-related quantity differences. This evidence suggests that speakers specify surface interval duration requirements as goals of speech production, and meet these requirements using a variety of different strategies. The equivalence of these strategies goes unexplained in theories that cannot represent surface durations. That is, the only thing shared by all of the different strategies is their equivalent effects on surface durations.

One example of this kind of evidence is that, when asked to speak quickly, speakers achieve surface durations using a wide variety of strategies. Although it is often the case that speakers accomplish this task by reducing the number and/or durations of pauses at fast rates, other strategies have also been observed. For example, acoustic studies show that speakers may manipulate the number of pauses, but not the durations, or vice versa (Fletcher, 1987; Trouvain, 1999). Likewise, kinematic studies reveal that, although the peak velocity/distance relationship for movements is often higher at fast rates, some speakers achieve faster rates by increasing articulatory speed (peak velocity), while other speakers achieve this by reducing movement distance (Abbs, 1973; Ostry and Munhall, 1985; Engstrand, 1988; Goozée et al., 2003; see Berry, 2011 for a review). And while many speakers show increased articulatory overlap at fast rates, not all speakers do (Engstrand, 1988; Boyce et al., 1990; Shaiman et al., 1995; Byrd and Tan, 1996; Shaiman, 2001, 2002; all cited in Berry, 2011).

What these studies show is that speakers respond differently to instructions when asked to speak quickly, but in all cases achieve shorter utterance durations. We cannot see how the equivalence of their strategies can be expressed without reference to the surface duration goals of these utterances. Similar findings of different strategies for achieving similar surface duration goals have been observed for quantity differences and phrase-final lengthening (Edwards et al., 1991; Hertrich and Ackermann, 1997). Hertrich and Ackermann's (1997) findings are of particular note because they show that the same speaker can use different strategies to achieve durational differences in different contexts. For example, Hertrich and Ackermann (1997) showed that the same speaker used different strategies to achieve the phonemically short vs. long distinction for different vowels. That is, some speakers showed a longer opening movement for /A:/ compared to /A/, but a predominant pattern of a longer initial part of the closing movement for /u:/ compared to /u/. Similarly, Edwards et al. (1991) showed that the same speakers used different strategies for achieving longer surface durations in phrase-final position (compared to phrase-medial position) at different rates of speech. That is, at faster rates, they slowed articulatory speed in phrase-final position, but at a slow rate, they kept speed constant and held the articulators in quasi-steady states.

Taken together, these studies of strategies for adjusting durations for rate of speech, vowel quantity, and final lengthening suggest that surface durations are speech production goals that can be achieved in a variety of ways. This type of motor equivalence supports the view that (1) surface duration requirements can be specified as part of the speech production process, and (2) these durational requirements or goals are separately specified from how the goals are achieved. Particularly telling are cases where the same speakers show different articulatory strategies for achieving similar durational patterns in different speaking-rate contexts.

These findings support models in which (1) surface duration goals (or costs) for intervals can be explicitly represented during phonetic planning, and (2) these goals are specified separately from how the goals are achieved articulatorily. This type of model architecture would make it possible for the same goal to be achieved in a variety of ways.

# More Timing Variability for Longer Duration Intervals Suggests the Involvement of General-Purpose Rather Than Phonology-Specific Timekeeping Mechanisms

The previous sections presented evidence suggesting that speakers explicitly represent surface timing goals, and can accomplish those timing goals in many different ways. We argued that emergent timing mechanisms specific to the task of speaking cannot account for the observed behavior, raising the question of what kind of alternative mechanism could support the planning of such intervals. This section presents evidence from timing variability that suggests an answer: general purpose timing mechanisms that could be used in specifying and planning surface durations in speech.

Many types of timed behaviors show what is known as "the scalar property," a relationship between interval duration and variability that tends to follow Weber's law, resulting in an approximately constant coefficient of variation (SD/mean) over a range of intervals. Getty (1975) proposed that variability in interval durations arises from two sources of noise: (1) a duration-dependent source, thought to be the consequence of noise in a timekeeping process, and thus to increase with the duration of the interval, and (2) a source of variability due to noise in the motor system, assumed to be constant regardless of the duration of the interval. This proposal provided an account of the higher coefficient of variation (SD/mean) observed for shorter intervals (up to approximately 200 ms) as compared to longer intervals (approximately 200–1300 ms). For a review of different modeling approaches to general purpose timekeepers with accounts of timing variability, see Schöner (2002).
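A two-source account of this kind can be sketched numerically. The example assumes that timekeeper noise has a standard deviation proportional to the interval duration, that motor noise has a fixed standard deviation, and that the two sources are independent so that their variances add. The Weber fraction and the motor noise value are arbitrary illustrative numbers and are not estimates from Getty (1975).

```python
import math

def interval_sd(mean_ms, weber_fraction, motor_sd_ms):
    """SD of a produced interval: duration-dependent timekeeper noise plus
    duration-independent motor noise, assumed independent."""
    timekeeper_sd = weber_fraction * mean_ms
    return math.sqrt(timekeeper_sd ** 2 + motor_sd_ms ** 2)

def coefficient_of_variation(mean_ms, weber_fraction, motor_sd_ms):
    """SD divided by mean for an interval of the given mean duration."""
    return interval_sd(mean_ms, weber_fraction, motor_sd_ms) / mean_ms

if __name__ == "__main__":
    for mean_ms in (50, 100, 200, 400, 800, 1300):
        cv = coefficient_of_variation(mean_ms, weber_fraction=0.05, motor_sd_ms=8.0)
        sd = interval_sd(mean_ms, weber_fraction=0.05, motor_sd_ms=8.0)
        print(f"mean={mean_ms:5d} ms  SD={sd:6.1f} ms  CV={cv:.3f}")
```

Under these assumptions the standard deviation grows with interval duration, while the coefficient of variation is elevated for short intervals (where the fixed motor noise dominates) and levels off toward the Weber fraction for longer ones.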

Behaviors showing timing variability that grows with interval duration include:


consonant target-to-second consonant release). Similar findings are reported in the speech literature for intervals measured from landmarks (Stevens, 2002) in the acoustic signal. For example, phonemic quantity differences (Dinka: Remijsen and Gilley, 2008; N. Finnish: Nakai et al., 2012); Chen (2006) for focused vs. non-focused constituents in Mandarin; Nakai et al. (2012) for final and phrasally accented vs. non-final, non-accented intervals in N. Finnish; and Lefkowitz (2017) for a linear relationship between standard deviation and mean duration of vowel intervals across a very wide range of contexts in an English experiment.

Findings of greater timing variability in phrase-final and/or phrasally prominent positions are thus consistent with the view that speech makes use of a general-purpose timekeeping mechanism, with variability that is proportional to the surface duration of the timed interval (as suggested by Gallistel, 1999; Gallistel and Gibbon, 2000; Jones and Wearden, 2004; Shouval et al., 2014; and others). Weber's law applies to timing behavior in many different tasks, both speech and non-speech, and in both perception and production. Whatever mechanism accounts for this law therefore appears to be general across all of these tasks and behaviors. General-purpose timekeeping mechanisms thus provide a unified account of timing variability for all timed intervals; see below for further discussion.

# The Observation of Less Timing Variability at Movement Endpoints Than at Other Parts of Movement Challenges (Spatio-)Temporal Phonological Representations, and Supports a Model of Speech Production Based on Symbolic Phonological Representations

The sections above presented three types of evidence for the representation of surface time intervals in the planning of movements: a constraint on the surface durations of phonemically short vowels in some quantity languages; multiple articulatory strategies for attaining appropriate acoustic durational patterns, suggesting that those patterns themselves are the goals of the movement; and the increase in variability with longer intervals, suggesting that those intervals are generated using a general-purpose, phonology-extrinsic timing mechanism that operates in units of surface (solar) time, rather than phonology-intrinsic timing mechanisms operating in nonsolar time units. In this section we present an argument for symbolic (as opposed to spatio-temporal) phonological representations. The evidence for this argument comes from observations of less timing variability at particular parts of movement, namely those that are most behaviorally meaningful. This evidence supports symbolic representations because it requires a representation of the most behaviorally meaningful part of movement so it can be prioritized for timing accuracy. Symbolic representations meet this requirement because they can map onto a part of movement that relates most directly to the achievement of a phonological goal, and can therefore be prioritized. Such symbolic representations require the specification of timing and other phonetic characteristics in a separate phonetic planning component. Thus, this evidence for symbolic phonological representations provides a fourth argument for the use of phonology-extrinsic time, because there are no time specifications in the phonology.

Evidence for the representation of individual parts of movement, so that their timing can be prioritized, comes from a number of sources. In his 1998 paper, Dave Lee notes: "it is frequently not critical when a movement starts – just so long as it does not start too late. For example, an experienced driver who knows the car and road conditions can start braking safely for an obstacle a bit later than an inexperienced driver." This observation suggests that the timing of the part of movement most closely related to the goal attainment should be less variable than the timing of other parts of movement<sup>8</sup>. This section presents evidence from repeated movements elicited in controlled laboratory experiments that confirms Lee's prediction.

Many findings in the literature are consistent with the observation that the timing of movement endpoints can be less variable than for other parts of a movement, even for repeated movements that have the same movement path<sup>9</sup>. For example, Gentner et al.'s (1980) study of keypress timing in typing found lower consistency in the start times of key press movements, as compared to the end times, for two typed repetitions of the same sentence, performed by an experienced typist. The median difference in start times was 58 ms, compared to a difference of 10 ms for end times.

Additional evidence for lower timing variability at movement endpoint can be found in periodic tapping data (Billon et al., 1996; Spencer et al., 2003; Zelaznik and Rosenbaum, 2010). For example, Spencer et al. (2003) found that timing variability in repetitive tapping showed lower variance at finger touchdown as compared with the time of peak velocity. Zelaznik and Rosenbaum (2010) found similar results for tapping, in that timing variability of contact with the tapping surface was lower than that of maximum finger extension. Interestingly, however, both Spencer et al. (2003) and Zelaznik and Rosenbaum (2010) found a different pattern of results for circle drawing, that is, no evidence for differences in timing accuracy at different points in the circle cycle. For example, in Zelaznik and Rosenbaum (2010), the variability at cycle onset (0°) was no different from timing variability at a spatial location opposite to cycle onset (180°). This evidence is consistent with the emergent timing view of continuous circle drawing, that is, that timing in such tasks is primarily emergent from dynamic characteristics and has minimum involvement from a timekeeping mechanism. See Zelaznik and Rosenbaum (2010) and Studenka et al. (2013) for evidence less consistent with emergent timing for circle drawing when it creates a perceptual (auditory or tactile) event that could be thought of as the goal of the movement, consistent with the idea that when salient timing-related events can be identified, general-purpose timing mechanisms are likely to be invoked; see Repp (2008) and Repp and Steinman (2010) for more nuanced discussions.

Although speech production data on this topic are limited, the available data show timing variability patterns that are consistent with those observed for typing and periodic tapping; that is, they show less timing variability at goal-related parts of movement, such as movement endpoint, than at other parts of movement, such as the movement onset. Because it is often difficult to accurately diagnose movement onset times for a particular articulator when its movements may have been governed by multiple phonemes, Perkell and Matthies (1992) studied timing variability for upper lip protrusion movements during spoken /i\_u/ sequences, where intervening consonants were /s,t,k/ and /h/, which are not normally associated with upper lip movement. The number of intervocalic consonants was varied systematically. Furthermore, to be sure that these intervening consonants did not have upper lip movement associated with them, they carefully examined upper lip protrusion traces during /i\_i/ contexts. Because they observed that /s/ did in fact have some idiosyncratic upper lip movement associated with it during the production of /isi/, they removed data with intervening /s/ from their analysis. As an additional precaution to ensure that their measure for movement onset was under sole control of the following /u/, they identified movement onset not as a point of velocity zero, but as the point of acceleration maximum, i.e., a time point clearly associated with movement toward the /u/ target. After all of these precautions to ensure that the measured upper lip protrusion timing data were due to the production of /u/ alone, they still found more variability for the timing of acceleration maximum as compared to the timing of maximum protrusion.
As shown in **Figures 3** and **4**<sup>10</sup>, they observed lower variability in the timing of movement endpoint (maximum protrusion) relative to voicing onset for /u/, as compared to the timing of maximum acceleration, relative to voicing onset for the same vowel. This pattern suggests a tighter temporal coordination of maximum lip protrusion (movement endpoint) with voicing onset than of lip protrusion movement onset (max. acceleration) with voicing onset, and suggests that the timing of movement endpoint has higher priority than the timing of movement onset in these speech movements. In other words, having maximally protruded lips at the onset of voicing appears to be the prioritized goal.
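The kind of comparison described here, variability of a landmark's timing relative to a reference event, can be sketched as follows. This is a minimal illustration rather than the analysis used by Perkell and Matthies (1992): the function name `lag_sd` is hypothetical, and the token times are invented purely to show the computation and are not data from any study.

```python
from statistics import stdev

def lag_sd(landmark_ms, reference_ms):
    """Sample SD of the lag between a movement landmark and a reference event.

    Both arguments hold one time (in ms) per token, in the same token order.
    """
    if len(landmark_ms) != len(reference_ms):
        raise ValueError("need one landmark time and one reference time per token")
    lags = [mark - ref for mark, ref in zip(landmark_ms, reference_ms)]
    return stdev(lags)

if __name__ == "__main__":
    # Invented token times (ms), for illustration only; not measured data.
    voicing_onset = [500, 512, 495, 505]
    max_protrusion = [498, 511, 492, 504]      # movement endpoint
    max_acceleration = [380, 420, 350, 400]    # movement onset measure
    print("endpoint lag SD:", lag_sd(max_protrusion, voicing_onset))
    print("onset lag SD:   ", lag_sd(max_acceleration, voicing_onset))
```

A smaller lag SD for one landmark than for another, measured against the same reference event, is what would be read as tighter temporal coordination of that landmark with the reference.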

The view that the timing of the most behaviorally meaningful part of movement is given highest priority is supported by Gafos et al.'s (2019) evidence relating to the coordination of consonant clusters [bd, db, dg, gd, br, rb, kr, rk, kl, lk, lb, and nk] in three different positions-in-word in Moroccan Arabic. This evidence shows that movements toward C2 exhibit higher amplitude-normalized peak velocity the later they begin relative to C2 release. This finding supports the idea that speakers ensure appropriate movement velocity in order to achieve the behaviorally meaningful part of C2 (possibly its release) on time.

<sup>8</sup> It is often the case that the part of movement most closely related to the goal in speech is the movement endpoint. For example, the endpoint of lip protrusion is most closely related to the goal for /u/, but for geminate consonants followed by a vowel, the timing of the beginning of the release movement toward the following vowel may be the most relevant for signaling the geminate status of the consonant.

<sup>9</sup> In cases where movements are of different distances, additional variability in the timing of movement onsets might be expected, since Fitts' (1954) law dictates longer duration movements for longer distances (in spite of higher velocities for these movements) as long as accuracy or target width remains constant.

<sup>10</sup> Although they do not report magnitudes of variability, e.g., in standard deviations, the difference in variability is clear on visual inspection of the figure.

FIGURE 3 | Reproduced from Perkell and Matthies (1992), with the permission of AIP Publishing. Caption as in the original: Schematic illustration of data extraction. From top to bottom: (1) a segment of the acoustic signal (ACOUSTIC), (2) lip protrusion (PROTRUSION), (3) lip velocity (VELOCITY), and (4) lip acceleration (ACCELERATION) vs. time. Acoustic events in the time-expanded acoustic signal are end of the /i/ (iEnd) and beginning of the /u/ (Vbeg). Movement events are: movement beginning (mBeg), movement end (mEnd), and maximum acceleration (AccMax).

Additional evidence for the prioritization of timing accuracy at goal-related parts of movement can be found in speech-related manual gesture (Leonard and Cummins, 2011). They found less timing variability at a point of maximum hand + arm extension compared to other parts of movement, for hand + arm "beat" gestures that co-occur with speech. They recorded hand + arm movements (by recording the movement of an LED marker attached to the base of the thumb) while a speaker read two repetitions of three short fables. They found that the point of maximum extension of the hand before retraction had the least timing variability compared to other parts of movement (movement onset, peak velocity of extension, peak velocity of retraction and movement retraction end), measured relative to landmarks in the stressed syllable in each word. These findings suggest that the point of maximum extension is the part of movement which is coordinated with stressed syllables, as opposed to the onset of movement.

Taken together, these results suggest that particular part(s) of movement can be more task-relevant, or "behaviorally meaningful" than other parts of movement (cf. Shaffer, 1982; Semjen, 1992; Billon et al., 1996, for timing). They are also consistent with the view that the most task-relevant features of motor performance are prioritized for accuracy and therefore have the least variability, as proposed in Todorov and Jordan's (2002, 2003) Minimal Intervention principle (cf. Winter, 1984; Lacquaniti and Maioli, 1994; Scholz and Schöner, 1999; discussed in Scott, 2004). Semjen (1992) makes this point about the control of finger movements in typing: "When copying a text, the typist probably attempts to produce the successive keystrokes fluently and at a fast sustained rate. The typist would thus anticipate the temporal properties of a sequence of behaviorally meaningful events, rather than the characteristics of the individual movements producing them. . . . We are thus led to a notion of multi-level temporal organization in serial movements, with some level(s) being more directly related to the subject's intentions than others." (Semjen, 1992, p. 248).

Along these lines, findings of greater temporal accuracy at particular parts of movement suggest that these parts of movement are "behaviorally meaningful" and are more closely related to the speaker's goals for the utterance. For example, the various movements of the articulators must be coordinated to create particular configurations at appropriate times, or the goal of signaling the features, sounds and words of the utterance will not be met. Other, less behaviorally meaningful parts of movement are produced in service of achieving those goals.

The finding that goal-related parts of movement are more accurate/less variable than other parts of a movement requires the representation of a movement goal as separate from the way the movement is achieved, as well as a mechanism to ensure more precise timing accuracy at the goal-related part of movement. As discussed below, this challenges models with spatio-temporal representations in which there is no distinction between the goal of a movement and the way it is achieved, because without this distinction, the phonological representation (and thus the phonological goal) actually corresponds to an entire movement trajectory (apart from starting position). As a result, the most behaviorally meaningful part of movement is not separately identified and therefore can't be prioritized. In contrast, it supports models which make use of symbolic representations in the phonology. This is because symbolic representations can map onto a particular part of movement that relates most directly to the achievement of a (symbolic) phonological goal, and can therefore be prioritized for accuracy.

# The Observation of Less Timing Variability at Movement Endpoints Than at Other Parts of Movement Challenges Onset-Based Movement Coordination

Findings of greater temporal precision at endpoints compared to other parts of movement also provide evidence for the nature of movement coordination. They suggest that movement coordination is based on goal-related part(s) of movement (often the endpoint), and requires a way to ensure timing accuracy at the goal-related parts of movement. Additional evidence for goal-related, endpoint-based coordination in non-speech activity can be found in Gentner et al. (1980), Bootsma and van Wieringen (1990), Kazennikov et al. (1994), Haggard and Wing (1998), Craig et al. (2005), and Katsumata and Russell (2012); endpoint-based coordination and its implications for speech timing models are discussed at length in Turk and Shattuck-Hufnagel (2020).

# Summary of Evidence

To summarize, the above evidence suggests that (1) speakers represent and specify surface duration goals for intervals, (2) in specifying surface durations, speakers make use of general-purpose timekeeping mechanisms, and (3) speakers separately represent, and prioritize for timing coordination accuracy, the most behaviorally meaningful parts of movement. This evidence is inconsistent with approaches to speech production in which (1) surface timing characteristics are only emergent and not represented, (2) timing units do not relate straightforwardly to solar time, and (3) phonological representations define the timing of all parts of a movement trajectory. Instead, the evidence presented above motivates speech production models that (1) make use of general-purpose timekeeping mechanisms to represent and specify surface durations, and (2) have a way of representing behaviorally meaningful parts of movement separately from other parts of movement, so that they can be prioritized for timing accuracy, and be coordinated with other events. In the following sections, we discuss why AP/TD, the model with the most comprehensive account of articulatory timing behavior, is challenged by these phenomena, and why these phenomena support a model based on symbolic representations and phonology-extrinsic timing, with three components: (1) Phonological Planning, (2) Phonetic Planning, and (3) Motor-Sensory Implementation.

# DISCUSSION OF AP/TD AS THE MOST COMPREHENSIVE, BEST-WORKED OUT MODEL OF TIMING AND WHY IT IS NEVERTHELESS CHALLENGED BY THESE FINDINGS

Articulatory Phonology in the Task Dynamics framework is currently the model with the most comprehensive coverage of timing effects in speech, including smooth, single-peaked velocity profiles, durations of gestural (constriction-forming) movements, coordination, boundary- and prominence-related lengthening, poly-sub-constituent shortening, and rate of speech. Key features of its approach to timing include (1) the use of spatio-temporal phonological representations, called gestures, as units of lexical contrast and phonological equivalence, (2) phonology-intrinsic timekeeping and gestural activation adjustment mechanisms to account for systematic contextual variability, and (3) surface timing characteristics that emerge from the phonology without any representation of their durations in solar timing units. Its use of spatio-temporal phonological representations as units of lexical contrast and phonological equivalence, as well as its commitment to emergent timing, are both motivated by the desire to avoid translation from phonological data structures to different data structures in phonetics (Fowler et al., 1980). That is, it is motivated by the desire to avoid having a phonetic planning component that is separate from the phonology. A substantial challenge for this approach is how to account for systematic variability in the production of members of the same phonological category, while maintaining their phonological equivalence. Its solution to this challenge involves (1) a definition of the gesture which allows for different gestural starting positions; (2) phonology-intrinsic mechanisms for adjusting the interval of time that a gesture is active (its gestural activation interval); and (3) mechanisms for controlling patterns of gestural overlap, and articulatory activity that it governs.
Together, these mechanisms give rise to surface contextual variability without altering the defining characteristics of each gesture, i.e., its equation of motion and coefficient values (apart from starting position).

In this section, we lay out AP/TD's approach to this challenge in more detail. Gestures are modeled by a second order mass-spring equation of oscillatory motion, critically damped so that the mass approximates a target position but doesn't oscillate. Target and stiffness coefficient values vary across gestural categories (target values differ for each gesture; stiffness values differ for consonants vs. vowels), whereas damping and mass coefficients are the same for all gestures (mass is set to 1, damping has a value that ensures critical damping). Gestural starting position is determined by previous context. How long each gesture is active (the gestural activation interval) is controlled by a system of coupled limit-cycle oscillators, i.e., gestural planning + syllable, foot and phrase oscillators (Saltzman et al., 2008, but see Tilsen, 2013, 2016, 2018, for a different approach). Because the gestural activation interval is defined as a proportion of a gestural planning oscillator cycle, the oscillation rate of the gestural planning + suprasegmental oscillator ensemble determines the duration of the gestural activation interval. A default oscillation rate gives rise to activation intervals which are long enough for gestures to approximate their targets. Longer intervals at phrase edges and in prominent positions are achieved via a mechanism which slows the AP/TD "clock" in these positions without adding timing units (see Byrd and Saltzman, 2003 for an early Pi gesture proposal; Saltzman et al., 2008 for a later Mu<sup>T</sup> proposal which slows the gestural planning + suprasegmental oscillator ensemble oscillation rate); shorter intervals for, e.g., faster rates of speech, are achieved by speeding up the planning + suprasegmental oscillator "clock" (Saltzman et al., 2008), which may result in undershoot of the stored gestural target.
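The behavior of a critically damped mass-spring gesture, and the undershoot that follows from a short activation interval, can be sketched with the standard closed-form solution of the equation x'' + 2ωx' + ω²(x − target) = 0, with mass set to 1 and ω equal to the square root of the stiffness. This is a simplified sketch, not the AP/TD implementation: it assumes that the movement starts at rest and that the gesture is fully active for the whole interval (no activation ramps), and the function names and numerical values are hypothetical.

```python
import math

def gesture_position(t, start, target, stiffness):
    """Position at time t (s) of a critically damped mass-spring with mass 1.

    Assumes the movement starts at rest at `start`. Critical damping means
    the damping coefficient is 2 * sqrt(stiffness), so the position
    approaches `target` without oscillating or overshooting.
    """
    omega = math.sqrt(stiffness)
    return target + (start - target) * (1.0 + omega * t) * math.exp(-omega * t)

def undershoot(activation_s, start, target, stiffness):
    """Remaining distance to the target when the activation interval ends."""
    return abs(target - gesture_position(activation_s, start, target, stiffness))

if __name__ == "__main__":
    start, target, stiffness = 0.0, 10.0, 400.0   # illustrative values only
    for activation in (0.05, 0.10, 0.20, 0.30):
        left = undershoot(activation, start, target, stiffness)
        print(f"activation {activation:.2f} s  undershoot {left:.2f}")
```

With the stiffness and target held fixed, the remaining distance to the target shrinks as the activation interval grows, which is why speeding up the "clock" (and so shortening the interval) can leave the target unreached.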

In this model, surface timing characteristics are the emergent output of fixed gestural characteristics (e.g., the time it takes for the gestural mass-spring system to approximate its target position), as well as utterance-specific gestural activation interval specifications (determined by the oscillation frequency of the planning + suprasegmental oscillator ensemble and the shape of the activation interval on- and off-ramps). Articulatory and acoustic surface timing characteristics (as measurable in solar timing units) emerge from this system without the involvement of any phonology-extrinsic, general-purpose timekeeping mechanisms that operate with such units. Because the planning + suprasegmental oscillator "clock" frequency is changed in particular phrasal positions and for different overall speech rates, there is no straightforward correspondence between planning + suprasegmental oscillator "clock" timing units and solar timing units.

Despite the general success of this model in accounting for timing effects in speech, it is challenged by the findings presented in previous sections. We will briefly review those challenges in light of the characteristics of AP/TD described just above.

# Constraints on Lengthening Due to Phrasal Prosody

Constraints on lengthening phonemically short vowels in prosodic contexts where lengthening occurs are difficult to explain in AP/TD. On the assumption that the lexical difference between short and long vowels in AP/TD is a difference in phonological representation, e.g., of one vs. two gestures, or of a gesture associated with one vs. two moras, both phonemically short and long vowels could be lengthened by the same amount and the lexical distinction would be maintained. But, that is not what is observed. Instead, less lengthening is found on the short vowel. This can only be accomplished in AP/TD by an ad hoc imposition of a smaller amount of lengthening on the short vowel, e.g., via a Pi/Mu<sup>T</sup> gesture with a smaller height for phonemically short vowels, or by proposing a Pi/Mu<sup>T</sup> phasing solution, in which (1) a Pi/Mu<sup>T</sup> gesture is aligned to the onset of the final syllable, (2) the Pi/Mu<sup>T</sup> gesture is of fixed duration, and (3) the Pi/Mu<sup>T</sup> gesture activation increases over time. However, although AP/TD provides these possible mechanisms for implementing different degrees of lengthening on phonemically short vs. long vowels, it provides no explanation of the phenomenon. That is, AP/TD provides no explanation for why phonemically short vowels should have Pi/Mu<sup>T</sup> gestures with shorter heights associated with them, or why a Pi/Mu<sup>T</sup> gesture would need to be phased with respect to the onset of a final syllable in Finnish, but not in other languages. For example, in non-quantity languages with reduced vs. full vowels, the proportional magnitude of boundary-related lengthening is not constrained for short duration, reduced vowels (e.g., the unstressed vowel in Thomas) as compared to longer duration full-vowels (e.g., the second syllable vowel in Brookline, Turk and Shattuck-Hufnagel, 2007). 
This suggests alignment of a Pi/Mu<sup>T</sup> gesture with the end of a word (Byrd and Saltzman, 2003), rather than with the onset of the final syllable.

In contrast to AP/TD, which has difficulty explaining the constraint on final lengthening on phonemically short vowels in quantity languages, theories which allow for the representation of surface durations provide a possible explanation. That is, smaller amounts of lengthening on phonemically short vowels can be explained if there is a constraint that preserves the surface duration distinction between phonemically short and long vowels.

To put it another way, if vowels of different quantities had the same phonological representation, the constraint on prosodic lengthening for short (and medium) vowels could be expressed as a constraint on the degree of AP/TD "clock" slowing. But, in this case, where the two types of vowels had the same phonological representation (i.e., the same number and type of gesture), there would be no way to express the lexical contrast. Instead, because AP/TD differentiates phonological categories with gestures, we assume that the phonological contrast between these different types of vowels is expressed in the lexicon as one vs. two or more gestures or perhaps as a single gesture associated with one vs. two moras. As a result, the surface durations of these vowels are due to a combination of (1) the number of AP/TD timing units in their gestural activation intervals (determined by the number of gestures or the number of moras), and (2) the degree of clock slowing (determined by the Pi or Mu<sup>T</sup> gesture). In this type of system, there is no way to account for the apparent surface duration constraint on the lengthening of contrastively short vowels, because this constraint relates to the emergent result of the interaction of two different AP/TD properties: (1) the number of AP/TD "clock" timing units in the activation interval and (2) the degree of clock slowing, which together result in surface duration in solar time. AP/TD can refer to each of these quantities, but has no way of representing the fact that they both affect surface duration, that is, it has no way of relating their equivalent effects on a desired surface duration. AP/TD therefore has no explanation for different degrees of clock slowing on phonemically short vs. long vowels, because the explanation has to do with the maintenance of a surface duration distinction.

In sum, while the AP/TD phonology-intrinsic "clock" slowing Pi- and Mu<sup>T</sup>-gestures might provide a mechanism to specify different degrees of phrase-final- or phrasal-accent-related lengthening for contrastively short (or medium) vs. long vowels, AP/TD has no representation of the surface duration outcomes of such activation interval adjustments. Consequently, it does not predict that a difference in lengthening degree for phonemically short vs. long vowels should occur, and does not offer an explanation for why contrastively short vowels are lengthened less, nor for the degree of lengthening these contrastively short vowels exhibit. Furthermore, adding a representation of the surface duration outcome would not be desirable in this framework, because this would involve a "translation" of phonology-intrinsic time into (phonetic) surface durations, something that the authors of the framework (and its antecedents) have tried to avoid (Fowler et al., 1980).

# Different Strategies for Manipulating Durations in, e.g., Rate of Speech, Boundary-Related Lengthening, and Quantity

This evidence suggests the equivalence between different temporal and spatio-temporal strategies that accomplish the same surface duration goal. It is challenging to account for in AP/TD for two reasons: (1) AP/TD doesn't have a representation of surface duration goals, and (2) AP/TD doesn't make a distinction between goals and how the goals are achieved. In this model, there are several different mechanisms that result in longer surface durations, e.g., differences in gestural stiffness, slowing gestural planning oscillators for longer gestural activation intervals, and adding gestures. However, because the model cannot refer to surface durations, the explanatory fact that these mechanisms all have a similar surface duration result is not captured in the model. Furthermore, in AP/TD, spatial and temporal aspects of movement are not independent: both are determined by the same phonological plan. Thus, it is difficult to account for behavior in which a speaker obtains the same temporal result with different spatial paths. Put another way, it is difficult for this model to account for the equivalence of rates, of quantities, and of lengthening (e.g., in final position) when these are achieved in different temporal and/or spatial manners, because this model doesn't allow the specification of temporal goals as distinct from the way they are achieved. That is, in AP/TD this equivalence can only arise by chance, because achieving the same surface duration pattern result can't be specified as the goal of the speaker.

# More Timing Variability for Longer Duration Intervals

Findings of greater timing variability in phrase-final and in prominent positions are inconsistent with AP/TD's account of boundary-related and prominence-related lengthening, with its lack of surface durations, and with its lack of general-purpose timekeeping mechanisms. To see why this is so, first consider the details of how timing is adjusted in this model. In AP/TD, longer surface durations in phrase-final and prominent positions result from Pi or Mu<sup>T</sup> adjustments, which stretch gestural activation intervals in these positions. In recent versions of the model (Saltzman et al., 2008) this stretching is done by slowing the phonology-specific "clock," which is accomplished by slowing the oscillation frequency of an ensemble of gestural planning + suprasegmental oscillators. Because the duration of each gestural activation interval corresponds to a proportion of a planning oscillator period, slowing the gestural planning + suprasegmental ensemble of oscillators stretches the activation interval. Because this clock-slowing mechanism slows the phonology-specific clock without adding any extra timing units, intervals in phrase-final and prominent positions are not actually longer in phonology-specific clock time, even though they are longer in surface time.
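The arithmetic behind this point can be sketched in a few lines. The sketch assumes, as a simplification, a single planning oscillator with a constant frequency over the interval; the function names, the cycle proportion, and the frequencies are hypothetical and are not taken from Saltzman et al. (2008).

```python
def surface_duration_ms(cycle_proportion, clock_hz):
    """Surface (solar-time) duration of an activation interval.

    The interval spans a fixed proportion of one planning oscillator
    cycle, so its surface duration is that proportion of the oscillator
    period (1 / clock_hz), converted to milliseconds.
    """
    return 1000.0 * cycle_proportion / clock_hz

def clock_time_units(cycle_proportion):
    """Extent of the same interval in phonology-specific clock units.

    This depends only on the cycle proportion, not on the clock rate.
    """
    return cycle_proportion

if __name__ == "__main__":
    proportion = 0.5        # illustrative proportion of one cycle
    default_hz = 5.0        # illustrative default oscillation rate
    slowed_hz = 4.0         # illustrative slowed rate (e.g., phrase-finally)
    for label, hz in (("default", default_hz), ("slowed", slowed_hz)):
        print(label,
              f"surface = {surface_duration_ms(proportion, hz):.1f} ms,",
              f"clock units = {clock_time_units(proportion)}")
```

In this sketch, slowing the oscillator lengthens the interval in surface time while leaving its extent in clock units unchanged, which is the property the argument in this section relies on.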

These operational details and their implications are significant because they highlight the difficulty of accounting for greater timing variability for intervals that are longer in surface time but not in the number of phonology-specific timing units. That is, greater timing variability observed for longer surface duration intervals is straightforward to account for in a model where timing variability correlates with the number of timing units. AP/TD can use this type of account for the greater timing variability observed for longer duration phonemically long vowels as compared to shorter duration phonemically short vowels (cf. Figures 1 and 2 for examples of this variability pattern in N. Finnish and Dinka). This is because longer durations for phonemically long vowels correspond to greater numbers of phonology-specific timing units, e.g., phonemically long vowels are assumed to be composed of two gestures (or are potentially associated with two moras), with corresponding longer gestural activation intervals. However, AP/TD does not have an account for the greater timing variability for movements in phrase-final or phrasally prominent positions, since gestural activation intervals in these positions have the same number of AP/TD timing units as corresponding gestural activation intervals in phrase-medial, or non-prominent positions. This is because longer surface durations in these positions are due to AP/TD phonology-specific clock-slowing, rather than to a greater number of AP/TD phonology-specific clock units.

The number of phonology-specific timing units therefore is not a quantity that can be used to account for temporal variability within AP/TD. Neither is the degree of lengthening (i.e., of phonology-specific clock slowing as implemented through the height of a Pi or Mu<sup>T</sup> gesture): Adding noise in proportion to Pi or Mu<sup>T</sup> height might add timing variability of surface durations of long vowels, but would not explain the fact that phrase-medial unstressed vowels that are not accompanied by Pi or Mu<sup>T</sup> gesture lengthening also show timing variability.

The findings instead argue for the representation of surface duration as a quantity, which is absent from AP/TD. In addition, AP/TD's reliance on phonology-specific timekeeping mechanisms provides no account of the similarity in timing variability behavior between speech and non-speech activity. This finding is more consistent with the use of noisy, general-purpose timekeeping mechanisms in both domains (e.g., Schöner, 2002 and many others). That is, in AP/TD, the fact that general-purpose timekeepers governing other motor behaviors, and the proposed phonology-intrinsic timekeeper, share the characteristic of greater variability for longer intervals goes unexplained.
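As a rough illustration of what a noisy, general-purpose timekeeper would predict, the sketch below draws produced intervals whose standard deviation is a constant fraction of the intended surface duration. The constant fraction (`weber_fraction`), the Gaussian noise, and the target durations are all assumptions made for illustration, not estimates from the studies discussed here.

```python
import random
import statistics


def produce_intervals(target_s, weber_fraction=0.05, n=2000, seed=0):
    """Simulate n productions of an interval with duration-proportional noise.

    Assumption: timing noise is Gaussian with a standard deviation equal to
    weber_fraction * target_s, so it scales with surface duration.
    """
    rng = random.Random(seed)
    return [rng.gauss(target_s, weber_fraction * target_s) for _ in range(n)]


short_intervals = produce_intervals(0.100)  # e.g., a phrase-medial interval
long_intervals = produce_intervals(0.200)   # e.g., a lengthened phrase-final interval

print("SD (short):", statistics.stdev(short_intervals))
print("SD (long): ", statistics.stdev(long_intervals))
```

In this sketch variability depends only on the represented surface duration, so it applies equally to speech and non-speech intervals and does not depend on how many phonology-specific timing units an interval contains.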

# The Observation of Less Timing Variability at Goal-Related Parts of Movement

#### A Challenge to Spatio-Temporal Phonological Representations

These data are problematic for AP/TD because they suggest that actors are able to separately represent, and differentially prioritize, the timing of different components of movement, e.g., endpoints over other parts of movement, such as movement onset. These findings are difficult to explain in models such as AP/TD, where a phonological representation takes the form of equations that, together with gestural activation, define the full trajectory of a gestural movement (as well as the trajectories of the individual movements of the articulators that form the gesture), once starting position has been specified. Thus, it is not possible to represent either the spatial or temporal aspects of one part of a movement (e.g., the endpoint) separately from the other parts of the movement trajectory. As a result, it is not possible to prioritize greater timing accuracy for different parts of a movement separately.

Note that the fact that the movement target is a parameter of movement in AP/TD does not mean that the target can be singled out as a part of movement that is independent of other parts. This is because the movement target parameter value, along with values for starting position, spring stiffness, mass, and damping parameters, affect the entire trajectory of movement, defined by the mass-spring equation, its activation, and overlap with other gestures.
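For reference, a mass-spring gesture of this kind is commonly written as a damped second-order system. The notation below is assumed for illustration and gives the general form only, not the exact formulation of any particular AP/TD implementation:

$$
m\,\ddot{x}(t) + b\,\dot{x}(t) + k\,\bigl(x(t) - x_{T}\bigr) = 0, \qquad x(0) = x_{0}
$$

Here $x$ is the tract variable, $x_{T}$ the target, $x_{0}$ the starting position, and $m$, $b$, and $k$ the mass, damping, and stiffness. Because $x_{T}$ enters the equation that governs $x(t)$ at every $t$ during activation, changing the target changes the whole trajectory and not just its endpoint.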

It is important to note also, that even if a part of movement could be identified in this type of framework, there is nothing in the model that would predict different timing variability for a particular part of movement. For example, the timing of movement onset can be identified in this model as the onset of gestural activation. However, because the timing of all parts of movement is defined by the same equation of motion, and its gestural activation interval, there is no available mechanism to differentially prioritize any particular part of movement over another part for timing accuracy. And perhaps most importantly, because the entire movement trajectory (minus its starting position) represents the goal of movement, there is no principled reason for any part of movement to be prioritized for timing accuracy over any other part.

In sum, the evidence presents two challenges. The first, that individual parts of movement cannot be identified, is partially addressed in that movement onsets can be identified with the onset of gestural activation; crucially, however, movement endpoints cannot. The second challenge, that some parts of movement are more accurately timed than others, cannot be met because the equation of motion that describes the phonological representation defines the spatial and temporal properties of the entire gesture, apart from its starting position. Thus no part of it can vary independently of any other.

#### A Challenge to Onset-Based Movement Coordination

These findings also suggest that coordination patterns can be based on the part of movement most closely related to the phonological goal, often the movement endpoint, instead of the movement onset, as currently implemented in AP/TD. Whereas the movement onset corresponds to the onset of gestural activation, the movement endpoint is much more difficult to identify in this framework. This is because the time of gestural target approximation is determined primarily from properties of the gestural mass-spring, point-attractor oscillator, and only relates straightforwardly to the duration of an activation interval at a default speaking rate. That is, at a default speaking rate, gestural activation interval durations correspond to the planning oscillator phase proportion that gives each gesture enough time to approximate its target. However, when activation intervals are adjusted for different speaking rates, or for prominent, or boundary-adjacent position, the time of gestural approximation will no longer correspond to a fixed phase of a planning oscillator (e.g., the end of gestural activation). Specifically, if the gestural activation interval is longer than the time it takes for the gesture to approximate its target (because the planning oscillator system has been slowed, e.g., in boundary-adjacent position), then the end of gestural activation will not correspond to the point of target approximation, and will occur later. Put another way, the time of gestural target approximation cannot be identified as a particular phase of a gestural planning oscillator at speaking rates different from the default, and in prosodic contexts (e.g., boundary-adjacent positions, and phrasally prominent positions) where gestural activation intervals have been stretched. 
This is because in these contexts, gestural planning oscillator frequency (which determines how long gestures are active) is independent of the natural frequency of gestures themselves (which is invariant and determined by properties of the gestural point-attractor mass-spring system; Byrd and Saltzman, 2003; Saltzman et al., 2008). Because it is the natural frequency of each gesture that is primarily responsible for the timing of target approximation (Saltzman and Munhall, 1989), it is challenging to identify a movement endpoint (or the time of target approximation) in the current AP/TD framework<sup>11</sup>. We note that tying gestural movement timing more closely to gestural activation timing (so that endpoints could be identified with a particular phase of a planning oscillator cycle) would present additional problems, e.g., overly long movement durations in contexts where gestural activation must be long (e.g., in phrase-final positions or at slow speaking rates). For example, if a singer is asked to sing a single syllable /bA/ for a long period of time on a single note, s/he will typically move from a bilabial target to the vowel target relatively quickly, and then prolong the /A/ vowel by maintaining the oral tract in a quasi-"steady state" target position for the vowel. If the movement toward the vowel target were slowed down in proportion to the duration of the note, the speaker would end up producing what might sound like a continuum of vowel-like sounds between the release of [b] and the target for [A].
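The dissociation between target approximation and activation phase can be illustrated with a minimal sketch. It assumes a critically damped mass-spring released from rest, for which the fraction of the distance still to be covered at time t is (1 + ωt)e^(−ωt), and it assumes that the gesture stays active throughout the approach. The natural frequency, the 5% criterion for "approximating" the target, and the activation durations are hypothetical values chosen for illustration.

```python
import math


def remaining_distance(t, omega):
    """Fraction of the start-to-target distance left at time t.

    Closed-form solution for a critically damped mass-spring released from
    rest, with natural frequency omega in rad/s.
    """
    return (1.0 + omega * t) * math.exp(-omega * t)


def time_to_target(omega, tolerance=0.05, dt=1e-4):
    """First time (s) at which the remaining distance is at or below tolerance."""
    t = 0.0
    while remaining_distance(t, omega) > tolerance:
        t += dt
    return t


omega = 2.0 * math.pi * 8.0  # hypothetical natural frequency, fixed for the gesture
approach_s = time_to_target(omega)

# Two hypothetical activation intervals: default, and stretched by clock slowing.
for label, activation_s in [("default", 0.10), ("stretched", 0.20)]:
    phase_at_target = approach_s / activation_s
    print(f"{label}: target approximated after {approach_s:.3f} s, "
          f"at {phase_at_target:.2f} of the activation interval")
```

Because the approach time depends only on the gesture's natural frequency, stretching the activation interval moves the point of target approximation to an earlier proportion of the interval, so it no longer coincides with a fixed oscillator phase.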

<sup>11</sup>Researchers working in the AP framework, e.g., Browman and Goldstein (1989, 1992), Davidson (2006), describe some patterns of observed data in terms of coordination patterns based on parts of movement other than the onset (e.g., target and release); however, they do not provide explicit mechanisms for identifying these parts of movement so they can be coordinated.

The findings presented above thus challenge the chosen architecture of AP/TD, with its spatio-temporal representations, lack of separation between Phonological and Phonetic Planning Components, phonology-intrinsic timing, and emergent (rather than explicitly represented) surface phonetic characteristics. We suggest that providing accounts of the phenomena described in earlier sections of the paper may be difficult without sacrificing some of the core principles of this theory's current implementation. For example, accounting for less timing variability at a part of movement most closely related to a goal challenges the core principle of an integrated phonology-phonetics, in which the phonological representation both serves as the goal of movement, and provides instructions for implementing the movement. And the evidence for the representation of surface durations may be difficult to accommodate in such a system, without sacrificing spatiotemporal phonological representations and without having to translate from data structures in phonology to different data structures on the surface (in phonetics). The current system of adjusting gestural activation intervals in different contexts, while preserving the invariance of gestural representations, allowed the theory (1) to account for the phonological equivalence of the same gestures in different contexts, (2) to account for different (emergent) surface behavior of these gestures in different contexts, and (3) to do both of these things without a separate Phonetic Planning Component that would provide translation from qualitatively different phonological representations to quantitatively specified surface phonetic forms. However, the findings presented in this paper suggest the need for just such a process, i.e., they suggest that the surface duration results of the adjustment processes are represented, and require translation from the data structures in phonology to those in phonetics.

# WHY THE FINDINGS POINT TOWARD A 3-COMPONENT MODEL BASED ON SYMBOLIC PHONOLOGICAL REPRESENTATIONS AND PHONOLOGY-EXTRINSIC TIMING

In this section, we argue that the findings presented above motivate the consideration of models of speech production with three components: (1) Phonological Planning, (2) Phonetic Planning, and (3) Motor-Sensory Implementation. Those findings provide a number of lines of evidence that support an approach of this kind, which is based on phonology-extrinsic timing and symbolic phonological representations. First, several findings suggested that surface durations are represented in the minds of speakers, and furthermore that these durations are specified through the use of non-speech-specific, general-purpose mechanisms, in solar timing units. Because this evidence supports mechanisms for quantitative specification that are extrinsic to the phonology, it can easily be accommodated in a model of speech production in which quantitative specification occurs in a phonetic component that is separate from the symbol-based phonological plan (which does not contain specific spectral, spatial or temporal information) that the speaker develops for a particular utterance.

Further support for a model of speech motor control that has a separation between Phonological and Phonetic Planning Components is provided by findings of greater temporal accuracy at behaviorally meaningful parts of movement. These findings also motivate a third, Motor-Sensory Implementation Component that is separate from the two planning components, and is used for tracking and adjusting movements once they have begun. That is, the findings presented earlier can be explained if (1) a particular part of movement (e.g., the endpoint or possibly constriction release) is identified as "behaviorally meaningful," i.e., most closely related to the goals specified in the symbolic phonological plan (which is developed during the operation of a Phonological Planning Component) that the speaker is trying to signal, and (2) other aspects of the movement (specified during the operation of a Phonetic Planning Component) are organized in the service of reaching the behaviorally meaningful (and thus high-priority) part of movement at the right time, and with appropriate temporal and spatial accuracy. As a result, parts of movement that are less directly related to the goal are less likely to be corrected and adjusted during the operation of the Motor-Sensory Implementation Component, because their accuracy is less critical, as long as the goal-related part can be reached on time (cf. Todorov and Jordan, 2002, 2003, Minimal Intervention Principle). Instead, the resources for tracking and adjusting are focused on the aspects of a movement that are most closely related to the goal of producing a planned set of acoustic cues, e.g., its endpoint, or release from constriction.

To put this another way, in a three-component model that separates the phonological goal (as an abstract, symbolic, phonological element in an appropriate utterance-specific context) from the manner of carrying out the goal (as a quantitative phonetic specification that includes movement duration in solar timing units, e.g., ms)<sup>12</sup>, it is possible to relate the symbolic phonological goal to the part(s) of articulatory movement that are most closely related to achieving that phonological goal. Because those parts of movement have a separate representation from other parts, it is possible to prioritize them for temporal coordination, and for more accurate production in a motor-sensory implementation component. This is precisely what appears to be required by the distribution of timing accuracy across a movement. The identification of this part of a movement with the phonological goal of movement provides a rationale for why that particular part of movement should be given higher priority with regard to timing and/or spatial accuracy. In a three-component approach, the Motor-Sensory Implementation Component, which tracks timing and position relative to the endpoint (presumably based on prediction from an efference copy of the motor commands as well as on sensory information), is required to provide adjustments to the movements as they unfold, in order to ensure that the prioritized endpoint is reached at an appropriate time.

<sup>12</sup>Note that phonological representations in our proposal can include discrete, relational and/or e.g. binary quantity representations, but not gradient, scalar representations.

Several models in the literature are 3-component models, with separate Phonological Planning, Phonetic Planning, and Motor-Sensory Implementation Components, and make use of surface durations, specified in solar timing units, and of phonology-extrinsic general-purpose timekeeping mechanisms (e.g., Fujimura, 1992 et seq., Guenther, 2016). These models are therefore promising, because they are compatible with many of the findings detailed above. However, in spite of their use of symbolic phonological representations, in some cases these models have identified the goals of movement as entire movements (Fujimura, 1992 et seq.), or as spectro-temporal trajectories (Guenther, 2016), as opposed to identifying the goals as particular parts of movement. These modeling decisions are at odds with findings of greater timing accuracy at goal-related parts of movement. We suggest that if these models were modified to map phonological goals onto particular parts of movement, they would be compatible with the findings presented here. Lee's (1998) tau-coupling theory provides a way to account for less timing variability at goal-related parts of movement, because in that theory, movements are guaranteed to reach their goals at a particular time even if the timing of movement onset is variable. Lee's theory also provides a principled way to account for the time-course of movement (and resulting velocity profile shapes).
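The way a tau-coupled movement can reach its goal on time despite variable onsets can be sketched as follows. The sketch assumes the closed form usually given for coupling a movement gap onto an intrinsic tau-guide, x(t) = x0 (1 − t²/T²)^(1/k), where T is the movement duration and k the coupling constant. The value of k, the goal time, and the onset times are hypothetical, and setting T to the time remaining until the goal is a simplifying assumption made for illustration.

```python
def tau_guided_gap(t_since_onset, duration, k=0.4, initial_gap=1.0):
    """Gap to the goal that remains under tau-guide coupling.

    Uses x(t) = x0 * (1 - (t / T) ** 2) ** (1 / k), with T the movement
    duration and k the coupling constant (both illustrative here).
    """
    ratio = t_since_onset / duration
    return initial_gap * (1.0 - ratio ** 2) ** (1.0 / k)


goal_time_s = 0.300  # hypothetical time at which the goal must be reached

# Onset time varies, but the duration is set to the time remaining until the
# goal, so the gap closes at goal_time_s in every case.
for onset_s in (0.000, 0.040, 0.080):
    duration = goal_time_s - onset_s
    gap_at_goal = tau_guided_gap(duration, duration)
    gap_midway = tau_guided_gap(0.5 * duration, duration)
    print(f"onset {onset_s:.3f} s: gap at goal time = {gap_at_goal:.3f}, "
          f"gap midway = {gap_midway:.3f}")
```

In this sketch the timing of the endpoint is fixed by the guide while onset timing is free to vary, and the value of k shapes the time-course of the movement.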

One challenge for any model of speech production, including 3-component models that use symbolic phonological representations and phonology-extrinsic timekeeping mechanisms, is to account for the systematic influence of a wide range of factors on timing patterns in speech. Optimal Control Theory approaches are promising in this regard, because they provide a way to balance the costs of not achieving movement goals (e.g., signaling phonemic contrast, in ways that are appropriate in particular prosodic positions, using a particular style, at an appropriate rate, etc.), with movement costs. See e.g., Šimko and Cummins (2010, 2011), Windmann et al. (2015), and Windmann (2016) for examples of ways that Optimal Control Theory approaches can be used to predict systematic timing patterns in speech. However, if these approaches are to be taken as theories of speech production, they present their own challenges; for example, they require extensive computation every time an utterance is planned.

In summary, evidence from the timing literature suggests that models of speech production based on symbolic representations and phonology-extrinsic timing are worth developing as alternatives to the currently dominant AP/TD approach, in spite of their computational challenges. See Turk and Shattuck-Hufnagel (2020) for a sketch of a specific proposal for how this might be done.

### AUTHOR CONTRIBUTIONS

AT synthesized and interpreted the evidence from the timing literature. AT and SS-H wrote the manuscript.

# FUNDING

This work was supported by AHRC Grant AH/1002758/1 to AT, and by U.S. National Science Foundation Grants BCS 1023596, 1651190, and 1827598 to SS-H.

# ACKNOWLEDGMENTS

We would like to thank Elliot Saltzman and Dave Lee for many theoretical discussions and explanations, as well as the editor and the two reviewers for valuable feedback. Any errors are ours.




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Turk and Shattuck-Hufnagel. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Speech Sound Disorders in Children: An Articulatory Phonology Perspective

Aravind Kumar Namasivayam<sup>1,2</sup>\*, Deirdre Coleman<sup>1,3</sup>, Aisling O'Dwyer<sup>1,4</sup> and Pascal van Lieshout<sup>1,2,5</sup>

<sup>1</sup> Oral Dynamics Laboratory, Department of Speech-Language Pathology, University of Toronto, Toronto, ON, Canada, <sup>2</sup> Toronto Rehabilitation Institute, University Health Network, Toronto, ON, Canada, <sup>3</sup> Independent Researcher, Surrey, BC, Canada, <sup>4</sup> St. James's Hospital, Dublin, Ireland, <sup>5</sup> Rehabilitation Sciences Institute, University of Toronto, Toronto, ON, Canada

Speech Sound Disorders (SSDs) is a generic term used to describe a range of difficulties producing speech sounds in children (McLeod and Baker, 2017). The foundations of clinical assessment, classification and intervention for children with SSD have been heavily influenced by psycholinguistic theory and procedures, which largely posit a firm boundary between phonological processes and phonetics/articulation (Shriberg, 2010). Thus, in many current SSD classification systems the complex relationships between the etiology (distal), processing deficits (proximal) and the behavioral levels (speech symptoms) are under-specified (Terband et al., 2019a). It is critical to understand the complex interactions between these levels as they have implications for differential diagnosis and treatment planning (Terband et al., 2019a). There have been some theoretical attempts made towards understanding these interactions (e.g., McAllister Byun and Tessier, 2016), and characterizing speech patterns in children either solely as the product of speech motor performance limitations or purely as a consequence of phonological/grammatical competence has been challenged (Inkelas and Rose, 2007; McAllister Byun, 2012). In the present paper, we intend to reconcile the phonetic-phonology dichotomy and discuss the interconnectedness between these levels and the nature of SSDs using an alternative perspective based on the notion of an articulatory "gesture" within the broader concepts of the Articulatory Phonology model (AP; Browman and Goldstein, 1992). The articulatory "gesture" serves as a unit of phonological contrast and characterization of the resulting articulatory movements (Browman and Goldstein, 1992; van Lieshout and Goldstein, 2008). 
We present evidence supporting the notion of articulatory gestures at the level of speech production and as reflected in control processes in the brain and discuss how an articulatory "gesture"-based approach can account for articulatory behaviors in typical and disordered speech production (van Lieshout, 2004; Pouplier and van Lieshout, 2016). Specifically, we discuss how the AP model can provide an explanatory framework for understanding SSDs in children. Although other theories may be able to provide alternate explanations for some of the issues we will discuss, the AP framework in our view generates a unique scope that covers linguistic (phonology) and motor processes in a unified manner.

Keywords: speech sound disorders (SSD), Dynamical Systems Theory, Articulatory Phonology, childhood apraxia of speech (CAS), Dysarthria, articulation and phonological disorders, speech motor control, motor speech development

#### Edited by:

Niels Janssen, University of La Laguna, Spain

#### Reviewed by:

Wander M. Lowie, University of Groningen, Netherlands

Jonathan L. Preston, Syracuse University, United States

> \*Correspondence: Aravind Kumar Namasivayam a.namasivayam@utoronto.ca

#### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 28 April 2019 Accepted: 18 December 2019 Published: 28 January 2020

#### Citation:

Namasivayam AK, Coleman D, O'Dwyer A and van Lieshout P (2020) Speech Sound Disorders in Children: An Articulatory Phonology Perspective. Front. Psychol. 10:2998. doi: 10.3389/fpsyg.2019.02998

#### INTRODUCTION

In clinical speech-language pathology (S-LP), the distinction between articulation and phonology and whether a speech sound error<sup>1</sup> arises from motor-based articulation issues or language/grammar based phonological issues has been debated for decades (see Shriberg, 2010; Dodd, 2014; Terband et al., 2019a for a comprehensive overview on this topic). The theory-neutral term Speech Sound Disorders (SSDs) is currently used as a compromise to bypass the constraints associated with the articulation versus phonological disorder dichotomy (Shriberg, 2010). The present definition describes SSD as a range of difficulties producing speech sounds in children that can be due to a variety of limitations related to perceptual, speech motor, or linguistic processes (or a combination) of known (e.g., Down syndrome, cleft lip and palate) and unknown origin (Shriberg et al., 2010; McLeod and Baker, 2017).

The history of causality research for childhood SSDs encompasses several theoretically motivated epochs (Shriberg, 2010). While the first epoch (1920s-1950s) was driven by psychosocial and structuralist views aimed at uncovering distal causes, the second epoch (1960s to 1980s) was driven by psycholinguistic and sociolinguistic approaches and focused on proximal causes. The more recent third and fourth epochs reflect the utilization of advances in neurolinguistics (1990s) and human genome sequencing (post-genomic era; 2000s) and these approaches address both distal and proximal causes (Shriberg, 2010). With these advances, several different systems for the classification of SSD subtypes in children have been proposed based on their distal or proximal cause (e.g., see Waring and Knight, 2013). Some of the major SSD classification systems include the Speech Disorders Classification System (Shriberg et al., 2010), the Model of Differential Diagnosis (Dodd, 2014) and the Stackhouse and Wells (1997) Psycholinguistic Framework. However, a critical problem in these classification systems as noted by Terband et al. (2019a) is that the relationships between the different levels of causation are underspecified. For example, the links between the etiology (distal; e.g., genetics), processing deficits (proximal; e.g., psycholinguistic factors), and the behavioral levels (speech symptoms) are not clearly elucidated. In other words, even though the term SSD is theory-neutral, the poorly specified links between the output level (behavioral) speech symptoms and higher-level motor/language/lexical/grammar processes limit efficient differential diagnosis, customizing intervention and optimizing outcomes (see Terband et al., 2019a for a more detailed review on these issues). Thus, there is a critical need to understand the complex interactions between the different levels that ultimately cause the observable speech symptoms (McAllister Byun and Tessier, 2016; Terband et al., 2019a).

There have been several theoretical attempts at integrating phonetics and phonology in clinical S-LP. In this context, the characterization of speech patterns in children either solely as the product of performance limitations (i.e., challenges in meeting phonetic requirements arising from motor and anatomical differences) or purely as a consequence of phonological/grammatical competence has been challenged (Inkelas and Rose, 2007; Bernhardt et al., 2010; McAllister Byun, 2012). McAllister Byun (2011, 2012) and McAllister Byun and Tessier (2016) suggest a "phonetically grounded phonology" approach where individual-specific production experience and speech-motor development are integrated into the construction of children's phonological/grammatical representations. The authors discuss this approach using several examples related to the neutralization of speech sounds in word onset (with primary stress) positions. They argue that positional velar fronting in children (where coronal sounds are substituted for velars in these positions) results from a combination of a jaw-dominated undifferentiated tongue gesture (e.g., Gibbon and Wood, 2002; see Section "Speech Delay" for details on velar fronting and undifferentiated tongue gestures) and the child's subtle articulatory efforts (increased linguo-palatal contact into the coronal region) to replicate positional stress (Inkelas and Rose, 2007; McAllister Byun, 2012). McAllister Byun (2012) demonstrated that by encoding this difficulty with a discrete tongue movement as a violable "MOVE-AS-UNIT" constraint, positional velar fronting could be formally discussed within the Harmonic Grammar framework (Legendre et al., 1990). In such a framework the constraint inventory is dynamic and new constraints could be added on the basis of phonetic/speech motor requirements or removed over the course of neuro-motor maturation. In the case of positional velar fronting, the phonetically grounded "MOVE-AS-UNIT" constraint is eliminated from the grammar as the tongue-jaw complex matures (McAllister Byun, 2012; McAllister Byun and Tessier, 2016).

<sup>1</sup>The term "speech sound error" refers to a mismatch between what an individual intends to say and what they actually say (Harley, 2006). In children, this may entail a clinically significant impairment or a non-standard production of speech sounds of the ambient language and may be classified according to the units of processing (e.g., phoneme, syllable, word or phrase) and the mechanisms (substitutions, additions, omissions/deletions and distortions) involved (Harley, 2006; Preston et al., 2013). The word "sound" is included in the term "speech sound error" to distinguish it from other speech errors such as disfluencies, voice and language (e.g., grammatical errors) based errors (McLeod and Baker, 2017).

In the present paper, we intend to reconcile the phonetic-phonology dichotomy and discuss the interconnectedness between these levels and the nature of SSDs using an alternative perspective. This alternative perspective is based on the notion of an articulatory "gesture" that serves as a unit of phonological contrast and characterization of the resulting articulatory movements (Browman and Goldstein, 1992; van Lieshout and Goldstein, 2008). We discuss articulatory gestures within the broader concepts of the Articulatory Phonology model (AP; Browman and Goldstein, 1992). We present evidence supporting the notion of articulatory gestures at the level of speech perception, speech production and as reflected in control processes in the brain and discuss how an articulatory "gesture"-based approach can account for articulatory behaviors in typical and disordered speech production (van Lieshout, 2004; van Lieshout et al., 2007; D'Ausilio et al., 2009; Pouplier and van Lieshout, 2016; Chartier et al., 2018). Although other theoretical approaches (e.g., Inkelas and Rose, 2007; McAllister Byun, 2012; McAllister Byun and Tessier, 2016) are able to provide alternate explanations for some of the issues we will discuss, the AP framework in our view generates a unique scope that covers linguistic (phonology) and motor processes in a unified and transparent manner to generate empirically testable hypotheses. There are other speech production models, but as argued in a recent paper, the majority of those are more similar to the Task Dynamics (TD) framework (Saltzman and Munhall, 1989) in that they address specific issues related to the motor implementation stages (with or without feedback) and are less concerned with providing a principled account of phonological principles, such as that formulated in AP (Parrell et al., 2019).

# ARTICULATORY PHONOLOGY

This section on Articulatory Phonology (AP; Browman and Goldstein, 1992) lays the foundation for understanding speech sound errors in children diagnosed with SSDs from this specific perspective. The origins of the AP model date back to the late 1970s, when researchers at the Haskins Laboratories developed a unique and alternative perspective on the nature of action and representation called the Task Dynamics model (TD; Saltzman and Munhall, 1989). This model was inspired by concepts of self-organization related to functional synergies as derived from the Dynamical Systems Theory (DST; Kelso, 1995).

DST in general describes behavior as the emergent product of a "self organizing, multi-component system that evolves over time" (Perone and Simmering, 2017, p. 44). Various aspects of DST have been studied and applied in a diverse range of disciplines such as meteorology (e.g., Zeng et al., 1993), oceanography (e.g., Dijkstra, 2005), economics (e.g., Fuchs and Collier, 2007), and medical sciences (e.g., Qu et al., 2014). Recently, there has also been an uptake of DST-informed research related to different areas in cognitive and speech-language sciences, including language acquisition and change (Cooper, 1999); language processing (Elman, 1995); development of cognition and action (Thelen and Smith, 1994; Spencer et al., 2011; Wallot and van Orden, 2011); language development (van Geert, 1995, 2008); 2nd language learning and development (de Bot et al., 2007; de Bot, 2008); speech production (see van Lieshout, 2004 for a review; van Lieshout and Neufeld, 2014; van Lieshout, 2017); variability in speech production (van Lieshout and Namasivayam, 2010; Jackson et al., 2016); connection between motor and language development (Parladé and Iverson, 2011); connection between cognitive aspects of phonology and articulatory movements (Tilsen, 2009); visual word recognition (Rueckle, 2002); and visuospatial cognitive development (Perone and Simmering, 2017).

The role of DST in speech and language sciences, in particular with respect to speech disorders, is still somewhat underdeveloped, mainly because of the challenges related to applying specific DST analyses to the relatively short data series that can be collected in speech research (van Lieshout, 2004). However, we chose to focus on the AP framework, as it directly addresses issues related to phonology and articulation using DST principles. These principles concern relatively stable patterns of behavior (attractor states) that emerge when multiple components (neural, muscular, biomechanical) underlying these behaviors interact through time in a given context (self-organization), as shown in the time-varying nature of the relationship between coupled structures (synergies) that express those behaviors (Saltzman and Munhall, 1989; Browman and Goldstein, 1992). Some examples of studies using this AP/DST approach can be found in papers on child-specific neutralizations in primary stress word positions (McAllister Byun, 2011), articulation issues related to /r/ production (van Lieshout et al., 2008), apraxia of speech (van Lieshout et al., 2007), studies on motor speech processes involved in stuttering (Saltzman, 1991; van Lieshout et al., 2004; Jackson et al., 2016), phonological development (Rvachew and Bernhardt, 2010), SSDs (Gildersleeve-Neumann and Goldstein, 2015), and in children with repaired cleft-lip histories (van Lieshout et al., 2002). In the next few sections we will review the concept of synergies and the development of speech motor synergies, which are directly related to DST principles of self-organization and coupling, followed by how the AP model uses these concepts to discuss linguistic/phonological contrast.

## Speech Motor Synergies

The concept of speech motor synergy was derived from DST principles based on the notion that complex systems contain multiple (sub)components that are (functionally and/or physically) coupled (Kelso, 1995). This means that these (sub)components interact and function as a coordinated unit where patterns emerge and dissolve spontaneously based on self-organization, that is, without the need of a pre-specified motor plan (Turvey, 1990). These patterns are generated due to internal and external influences relating to inter-relationships between the (sub)components themselves, and the constraints and opportunities for action provided in the environment (Smith and Thelen, 2003). Constraints or specific boundary conditions that influence pattern emergence may relate to physical, physiological, and functional/task constraints (e.g., Diedrich and Warren, 1995; Kelso, 1995; van Lieshout and Namasivayam, 2010). Such principles of pattern formation and coupling have already been demonstrated in physical (e.g., Gunzig et al., 2000) and biological systems (e.g., Haken, 1985), including neural network dynamics (e.g., Cessac and Samuelides, 2007). Haken et al. (1985), Kelso et al. (1985), and Turvey (1990) were among the first to apply these principles to movement coordination as well. Specifically, a synergy in the context of movement is defined as a functional assembly of (sub)components (e.g., neurons, muscles, joints) that are temporarily coupled or assembled in a task-specific manner, thus constrained to act as a single coordinated unit (or a coordinative structure; Kelso, 1995; Kelso et al., 2009). In the motor control literature, coordinative structures or functional synergies are typically modeled as (non-linear) oscillatory systems (Kelso, 1995; Newell et al., 2003; Profeta and Turvey, 2018). By strengthening or weakening the coupling within and between the system's interacting (sub)components, synergies may be tuned or altered. For movement control, the synergy tuning process occurs with development and learning or may change due to task demands or constraints (e.g., Smith and Thelen, 2003; Kelso et al., 2009).

With regard to speech production, perturbation paradigms similar to the ones used in other motor control studies have demonstrated critical features of oral articulatory synergies (e.g., Folkins and Abbs, 1975; Kelso and Tuller, 1983; van Lieshout and Neufeld, 2014), which in AP terms can be referred to as gestures. Functional synergies in speech production comprise laryngeal and supra-laryngeal structures (tongue, lips, jaw) coupled to achieve a single constriction (location and degree) goal. Perturbing the movement of one structure will lead to compensatory changes in all functionally coupled structures (including the articulator that is perturbed) to achieve the synergistic goal (Kelso and Tuller, 1983). For example, when the jaw is perturbed in a downward direction during a bilabial stop closure, there is an immediate compensatory lowering of the upper lip and an increased compensatory elevation of the lower lip (Folkins and Abbs, 1975). The changes in the nature and stability of movement coordination patterns (i.e., within and between specific speech motor synergies) as they evolve through time can be captured quantitatively via order parameters such as relative phase. Relative phase values are expressed in degrees or radians, and the standard deviation of relative phase values can provide an index of the stability of the couplings (Kelso, 1995; van Lieshout, 2004). Whilst order parameters capture the relationship between the system's interacting (sub)components, changes in order parameter dynamics can be triggered by alterations in a set of control parameters. For example, changes in movement rate may destabilize an existing coordination pattern and result in a different coordination pattern, as observed during gait changes (such as switching from a walk to a trot and then a gallop) as a function of required locomotion speed (Hoyt and Taylor, 1981; Kelso, 1995). For speech, such distinct behavioral patterns as a function of rate have not been established.
However, in the coordination between lower jaw, upper and lower lip as part of a lip closing/opening synergy, typical speakers have shown a strong tendency for reduced covariance in the combined movement trajectory, despite individual variation in the actual sequence and timing of individual movements (Alfonso and van Lieshout, 1997). This can be considered a characteristic of an efficient synergy. The same study also included people who stutter and reported more instances of not showing reduced covariance in this group, in line with the notion that stuttering is related to limitations in speech motor skill (van Lieshout et al., 2004; Namasivayam and van Lieshout, 2011).
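The relative-phase stability index described above can be illustrated with a minimal sketch. It assumes two synthetic, sinusoidal articulator signals (the names `upper`, `lower_stable`, and `lower_wobbly`, the 4 Hz rate, and the 1 kHz sampling rate are hypothetical, not taken from any cited study) and estimates phase from the normalized position-velocity phase portrait. Circular statistics are used here so that phase wrapping does not inflate the standard deviation; published studies may use other phase-estimation methods.

```python
import math

def phase_angles(signal, dt):
    """Phase angle (radians) at each sample, read off the phase portrait."""
    n = len(signal)
    vel = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)  # one-sided at the edges
        vel.append((signal[hi] - signal[lo]) / ((hi - lo) * dt))

    def normalize(x):
        # Center on the midrange and scale to unit peak amplitude.
        mid = (max(x) + min(x)) / 2.0
        half = (max(x) - min(x)) / 2.0
        return [(v - mid) / half for v in x]

    pos_n, vel_n = normalize(signal), normalize(vel)
    return [math.atan2(-v, p) for p, v in zip(pos_n, vel_n)]

def relative_phase_stats(sig_a, sig_b, dt):
    """Circular mean and circular SD (degrees) of phase(a) - phase(b)."""
    pa, pb = phase_angles(sig_a, dt), phase_angles(sig_b, dt)
    diffs = [a - b for a, b in zip(pa, pb)]
    c = sum(math.cos(d) for d in diffs) / len(diffs)
    s = sum(math.sin(d) for d in diffs) / len(diffs)
    r = min(math.hypot(c, s), 1.0)  # guard against rounding above 1
    mean_deg = math.degrees(math.atan2(s, c))
    sd_deg = math.degrees(math.sqrt(-2.0 * math.log(r)))
    return mean_deg, sd_deg

# Hypothetical data: 1 s of movement at 4 Hz, sampled at 1 kHz.
DT = 0.001
t = [i * DT for i in range(1000)]
w = 2.0 * math.pi * 4.0
upper = [math.cos(w * ti) for ti in t]
# Lower lip lags by a constant 90 degrees (stable coupling).
lower_stable = [math.cos(w * ti - math.pi / 2) for ti in t]
# Same average lag, but the lag drifts over time (less stable coupling).
lower_wobbly = [math.cos(w * ti - math.pi / 2 + 0.5 * math.sin(2 * math.pi * ti))
                for ti in t]

mean_s, sd_s = relative_phase_stats(upper, lower_stable, DT)
mean_w, sd_w = relative_phase_stats(upper, lower_wobbly, DT)
print(f"stable coupling: mean {mean_s:.1f} deg, SD {sd_s:.2f} deg")
print(f"wobbly coupling: mean {mean_w:.1f} deg, SD {sd_w:.2f} deg")
```

In this toy case the two couplings have a similar mean relative phase, and only the standard deviation separates them, which is why the variability measure rather than the mean is treated as the stability index.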

Recent work has provided more insights regarding cortical networks in control of this coordination between speech articulators (Bouchard et al., 2013; Chartier et al., 2018). Chartier et al. (2018) mapped acoustic and articulatory kinematic trajectories to neural electrode sites in brains of patients, as part of their clinical treatment of epilepsy. Similar to limb control studies that discovered single motor cortical neurons that encoded complex coordinated arm and hand movements (Aflalo and Graziano, 2006; Saleh et al., 2012), coordinated movements involving articulators for specific vocal tract configurations were encoded at the single electrode level in the ventral sensorimotor cortex (vSMC). That is, activity in the vSMC reflects the synergies used in speech production rather than individual movements. Interestingly, the study found four major clusters of articulatory kinematic trajectories that encode the main vocal tract configurations (labial, coronal, dorsal, and vocalic) necessary to broadly represent the production of American English sounds. The encoded articulatory kinematic trajectories exhibited damped oscillatory dynamics as inferred from articulatory velocity and displacement relationships (phase portraits). These findings support theories that envision vocal tract gestures as articulatory units of speech production characterized by damped oscillatory dynamics [Fowler et al., 1980; Browman and Goldstein, 1989; Saltzman and Munhall, 1989; see Section Articulatory Phonology and Speech Sound Disorders (SSD) in Children].

The notion of gestures at the level of speech perception has been discussed in the Theory of Direct Perception (Fowler, 1986; Fowler and Rosenblum, 1989). This theory posits that listeners perceive attributes of vocal tract gestures, arguing that this reflects the common code shared by both the speaker and listener (Fowler, 1986, 1996, 2014; Fowler and Rosenblum, 1989). These concepts are supported by a line of research studies which propose that the minimal objects of speech perception reflect gestures realized by the action of coordinative structures as transmitted by changes to the acoustic (and visual) signal, rather than units solely defined by a limited set of specific acoustic features (Diehl and Kluender, 1989; Fowler and Rosenblum, 1989; Fowler, 1996). The Direct Perception theory thus suggests that speech perception is driven by the structural global changes in external sensory signals that allow for direct recognition of the original (gesture) source and does not require special speech modules or the need to invoke the speech motor system (Fowler and Galantucci, 2005). Having a common unit for production and perception provides a useful framework to understand the broader nature of both sensory and motor involvement in speech disorders. For example, this can inform future studies to investigate how problems in processing acoustic information, and thus in perceiving the gestures from the speaker, may interfere with the tuning of gestures for production during development. Similarly, issues related to updating the state of the vocal tract through somato-sensory feedback (a critical component in TD; Saltzman and Munhall, 1989; Parrell et al., 2019) during development may also lead to the mistuning of gestures in production, potentially leading to the type of errors in vocal tract constriction degree and/or location as discussed in Section "Articulatory Phonology and Speech Sound Disorders (SSD) in Children." However, for the current paper, the focus will be on production aspects only.

#### Development of Speech Motor Synergies

In this section, we will discuss the development and refinement of articulatory synergies and how these processes facilitate the emergence of speech sound contrasts. Observational and empirical data from several speech motor studies (as discussed below) were synthesized to create the timeline map of the development and refinement of speech motor control and articulatory synergies as illustrated in **Figure 1**. Articulatory synergies in infants have distinct developmental schedules. Speech production in infants is thought to be restricted to sounds primarily supported by the mandible (MacNeilage and Davis, 1990; Davis and MacNeilage, 1995; Green et al., 2000). Early mandibular movements (∼1 year or less) are ballistic in nature and restricted to closing and opening gestures due to the limited fine force control required for varied jaw heights (Locke, 1983; Kent, 1992; Green et al., 2000). Vowel productions in the first year are generally related to low, non-front, and non-rounded vowels; implying that the tongue barely elevates from the jaw, and there is limited facial muscle (lip) interaction (i.e., synergy) with the jaw (Buhr, 1980; Kent, 1992; Otomo and Stoel-Gammon, 1992; but see Giulivi et al., 2011; Diepstra et al., 2017).

Sound sequences that do not require complex timing and coordination within/between articulatory gestures are easier to produce and the first to emerge (Green et al., 2000; Green and Nip, 2010; **Figure 1**). For instance, young children are unable to coordinate the laryngeal voicing gesture with supralaryngeal articulation and hence master voiced consonants and syllables earlier than voiceless ones (Kewley-Port and Preston, 1974; Grigos et al., 2005). The synergistic interaction between the laryngeal and supra-laryngeal structures underlying voicing contrasts is acquired closer to 2 years of age (∼20–23 months; Grigos et al., 2005), and follows the maturation of jaw movements (around 12–15 months of age; Green et al., 2002; **Figure 1**) and/or jaw stabilization (Yu et al., 2014).

In children, up to and around 2 years of age, there is limited fine motor control of jaw height (or jaw grading) and weak jaw-lip synergies during bilabial production, but relatively stronger inter-lip spatial and temporal coupling (Green et al., 2000, 2002; Nip et al., 2009; Green and Nip, 2010). A possible consequence of these interactions is that their production of vowels is limited to that of extremes (high or low; /i/, /u/, /o/, and /ɑ/), and lip rounding/retraction is only present when the jaw is in a high position (Wellman et al., 1931; Kent, 1992; **Figure 1**). As speech-related jaw-lip synergies are emerging, it is not surprising that children are able to execute lip rounding and retraction only when degrees of freedom can be reduced (i.e., when the jaw is held in a high position). Such a reduction in degrees of freedom in emerging synergies has been observed in other non-speech systems (Bernstein, 1996). Interestingly, although the relatively strong inter-lip coordination pattern found in 2-year-olds is facilitative for bilabial productions, it needs to further differentiate to gain independent control of the functionally linked upper and lower lips prior to the emergence of labiodental fricatives (/f/ and /v/; Green et al., 2000; **Figure 1**). This process is observed to occur between the ages of 2 and 3 years (Stoel-Gammon, 1985; Green et al., 2000). Green et al. (2000, 2002) suggest that upper and lower lip movements become adult-like with increasing contribution of the lower lip toward bilabial closure between the ages of 2 and 6 years. Further control over jaw height (with the addition of /ɛ/ and /ɔ/) and lingual independence from the jaw is developed around 3 years of age (Kent, 1992). The latter is evident from the production of reliable lingual gliding movements (diphthongs: /aʊ/, /ɔɪ/, and /aɪ/) in the anterior-posterior dimension (Wellman et al., 1931; Kent, 1992; Otomo and Stoel-Gammon, 1992; Donegan, 2013).
Control of this dimension also coincides with the emergence of coronal consonants (e.g., /t/ and /d/; Smit et al., 1990; Goldman and Fristoe, 2000). By 4 years of age, all front and back vowels are within the spoken repertoire of children, suggesting a greater degree of control over jaw height and improved tongue-jaw synergies (Kent, 1992). Intriguingly, front vowels and lingual coronal consonants emerge relatively late (Wellman et al., 1931; Kent, 1992; Otomo and Stoel-Gammon, 1992). This is possibly due to the fine adjustments required by the tongue tip and blade to adapt to mandibular angles. Since velar consonants and back vowels are produced by the tongue dorsum, they are closer to the origin of rotational movement (i.e., condylar axis) and are less affected than the front vowels and coronal consonants (Kent, 1992; Mooshammer et al., 2007). With maturation and experience, finer control over tongue musculature develops, and children begin to acquire rhotacized (retroflexed or bunched tongue) vowels (/ɝ/ and /ɚ/) and tense/lax contrasts (Kent, 1992).

The later development of refined tongue movements is not surprising, since the tongue is considered a hydrostatic organ with distinct functional segments (e.g., tongue tip, tongue body; Green and Wang, 2003; Noiray et al., 2013). Gaining motor control and coordinating the tongue with neighboring articulatory gestures is difficult (Kent, 1992; Smyth, 1992; Nittrouer, 1993). Cheng et al.'s (2007) study demonstrated a lower degree of, and more variable, tongue tip to jaw temporal coupling in 6- to 7-year-old children relative to adults (**Figure 1**). This contrasts with the earlier developing lip-jaw synergy reported by Green et al. (2000), wherein by 6 years of age, children's temporal coupling of lip and jaw was similar to adults. The coordination of the tongue's subcomponents follows different maturation patterns. By 4–5 years, synergies that use the back of the tongue to assist the tongue tip during alveolar productions are adult-like (Noiray et al., 2013), while synergies relating to tongue tip release and tongue body backing are not fully mature (Nittrouer, 1993; **Figure 1**). The extent and variability of lingual vowel-on-consonant coarticulation between 6 and 9 years of age is greater than in adults, implying that children are still refining their tuning of articulatory gestures (Nittrouer, 1993; Nittrouer et al., 1996, 2005; Cheng et al., 2007; Zharkova et al., 2011).

These findings suggest that articulatory synergies have varying schedules of development: lip-jaw related synergies develop earlier than tongue-jaw or within tongue-related synergies (Cheng et al., 2007; Terband et al., 2009). Most of this work has been done on intra-gestural coordination (i.e., between individual articulators within a gesture), but it is clear that the development of both intra- and inter-gestural synergies is non-uniform and protracted (Whiteside et al., 2003; Smith and Zelaznik, 2004). Variability of intra-gestural synergies (e.g., upper- and lower-lip or lower lip–jaw) in 4- and 7-year-olds has been found to be greater than in adults but decreases with age until it plateaus between 7 and 12 years (Smith and Zelaznik, 2004). Adult-like patterns are reached at around 14 years, and likely continue to refine and stabilize even up to the age of 30 years (Smith and Zelaznik, 2004; Schötz et al., 2013; **Figure 1**). Overall, these findings suggest that the development of speech motor control is hierarchical, sequential, non-uniform, and protracted.

## Gestures, Synergies and Linguistic Contrast

As mentioned above, within the AP model, the fundamental units of speech are articulatory "gestures." Articulatory "gestures" are higher-level abstract specifications for the formation and release of task-specific, linguistically relevant vocal tract constrictions. The specific goals of each gesture are defined as Tract Variables (**Figure 2**) and relate to vocal tract constriction location (labial, dental, alveolar, postalveolar, palatal, velar, uvular, and pharyngeal) and constriction degree (closed, critical, narrow, mid, and wide; **Figure 2**). While constriction degree is akin to manner of production (e.g., fricatives /s/ and /z/ are assigned a "critical" value; stops /p/ and /b/ are given a "closed" value), constriction location allows for distinctions in place of articulation (Browman and Goldstein, 1992; Gafos, 2002).

The targets of each Tract Variable are implemented by specifying the lower-level functional synergy of individual articulators (e.g., articulator set of lip closure gesture: upper lip, lower lip, jaw) and their associated muscle ensembles (e.g., orbicularis oris, mentalis, risorius), which allows for the flexibility needed to achieve the task goal (Saltzman and Kelso, 1987; Browman and Goldstein, 1992; Alfonso and van Lieshout, 1997; Gafos, 2002; **Figure 2**). The coordinated actions of the articulators toward a particular value (target) of a Tract Variable are modeled using damped mass-spring equations (Saltzman and Munhall, 1989). The variables in the equations specify the final position, the time constant of the constriction formation (i.e., the speed at which the constriction should be formed; stiffness), and a damping factor to prevent articulators from overshooting their targets (Kelso et al., 1986a,b; Browman and Goldstein, 1989; Saltzman and Munhall, 1989). For example, if the goal is to produce constriction at the lips (bilabial closure gesture), then the distance between the upper lip and lower lip (lip aperture) is set to zero. The resulting movements of individual articulators lead to changes in vocal tract geometry, with predictable aerodynamic and acoustic consequences.
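The damped mass-spring description above can be sketched numerically. The following is a minimal illustration, not the TD implementation: it assumes a single Tract Variable (lip aperture) obeying m·x'' + b·x' + k·(x − target) = 0, and the function name `simulate_gesture`, the unit mass, the stiffness of 400, and the 10 mm starting aperture are all hypothetical choices made for the example.

```python
import math

def simulate_gesture(x_init, target, stiffness, damping_ratio=1.0,
                     mass=1.0, dt=0.001, duration=0.5):
    """Integrate m*x'' + b*x' + k*(x - target) = 0 from rest.

    damping_ratio = 1.0 gives critical damping, b = 2*sqrt(m*k).
    Returns the list of positions, one per time step.
    """
    damping = 2.0 * damping_ratio * math.sqrt(mass * stiffness)
    x, v = x_init, 0.0
    traj = [x]
    for _ in range(int(round(duration / dt))):
        acc = (-damping * v - stiffness * (x - target)) / mass
        v += acc * dt          # semi-implicit Euler: update velocity first,
        x += v * dt            # then position with the new velocity
        traj.append(x)
    return traj

# Bilabial closure: lip aperture moves from 10 mm (assumed) to the target 0 mm.
critical = simulate_gesture(x_init=10.0, target=0.0, stiffness=400.0)
# Same gesture with too little damping (ratio 0.3), for comparison.
underdamped = simulate_gesture(x_init=10.0, target=0.0, stiffness=400.0,
                               damping_ratio=0.3)

print(f"critical:    final {critical[-1]:.3f} mm, minimum {min(critical):.3f} mm")
print(f"underdamped: final {underdamped[-1]:.3f} mm, minimum {min(underdamped):.3f} mm")
```

With critical damping the aperture approaches the target without crossing it, whereas the underdamped variant passes through the target (a negative aperture has no physical meaning here and simply marks overshoot). Raising the stiffness shortens the time to reach the target, which is the sense in which stiffness sets the speed of constriction formation.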

The flexibility within the functional articulatory synergy implies that the task-level goals could be achieved with quantitatively different contributions from individual articulatory components as observed in response to articulatory perturbations or in adaptation to the linguistic context in which the gesture is produced (Saltzman and Kelso, 1987; Browman and Goldstein, 1992; Alfonso and van Lieshout, 1997; Gafos, 2002). In other words, the task-level goals are discrete, invariant or context-free, but the resulting articulatory motions are context-dependent (Browman and Goldstein, 1992). Gestures are phonological primitives that are used to achieve linguistic contrasts when combined into larger sequences (e.g., segments, words, phrases). The presence or absence of a gesture, or changes in gestural parameters like constriction location, result in phonologically contrastive units. For example, the difference between "bad" and "ban" is the presence of a velum gesture in the latter, while "bad" and "pad" are differentiated by adding a glottal gesture for the onset of "pad". Parameter differences in gestures such as the degree of vocal tract constriction yield phonological contrast by altering manner of production (e.g., "but" and "bus"; tongue tip constriction degree: complete closure for /t/ vs. a critical opening value to result in turbulence for /s/) (Browman and Goldstein, 1986, 1992; van Lieshout et al., 2008).

Gestures have an internal temporal structure characterized by landmarks (e.g., onset, target, release) which can be aligned to form segments, words, sentences and so on (Gafos, 2002). These gestures and their timing relationships are represented by a gestural score in the AP model (**Figure 2**; Browman and Goldstein, 1992). Gestural scores are estimated from articulatory kinematic data or speech acoustics by locating kinematic/acoustic landmarks to determine the timing relationships between gestures (Nam et al., 2012). The timing relationships in the gestural score are typically expressed as relative phase values (Kelso et al., 1986a,b; van Lieshout, 2004). Words may differ by altering the relative phasing between their component gestures. For example, although the gestures are identical in "pat" and "tap," the relative phasing between the gestures is different (Saltzman and Byrd, 2000; Saltzman et al., 2006; Goldstein et al., 2007). As mentioned above, the coordination between individual gestures in a sequence is referred to as inter-gestural coupling/coordination (van Lieshout and Goldstein, 2008). Inter-gestural level timing is not rigidly specified across an entire utterance but is sensitive to peripheral (articulatory) events (Saltzman et al., 1998; Namasivayam et al., 2009; Tilsen, 2009). The presence of a coupling between inter-gestural level timing oscillators and feedback signals arising from the peripheral articulators was identified in experimental work by Saltzman et al. (1998). In that study, unanticipated lip perturbation during discrete and repetitive production of the syllable /pa/ resulted in phase-shifts in the relative timing between the two independent gestures (lip closure and laryngeal closure) for the phoneme /p/ and between successive /pa/ syllables (Saltzman et al., 1998). This confirms the critical role of somato-sensory information in the TD model (Saltzman and Munhall, 1989; Parrell et al., 2019).

Dynamical systems can express different self-organizing coordination patterns, but for many systems, certain patterns of coordination seem to be preferred over others. These preferred patterns are induced by "attractors" (Kelso, 1995), which reflect stable states in the coupling dynamics of such a system<sup>2</sup>. The coupling relationships used in speech production are similar to those identified for limb control systems (Kelso, 1995; Goldstein et al., 2006) and capitalize on intrinsically stable modes of coordination (specifically, in-phase and anti-phase modes; Haken et al., 1985). These are patterns that are naturally achieved without training or learning; however, they are not equally stable (Haken et al., 1985; Nam et al., 2009). In-phase coordination patterns, for instance, are relatively more stable than anti-phase patterns (Haken et al., 1985; Kelso, 1995; Goldstein et al., 2006). Other coordination patterns are possible, but they are more variable, may require higher energy expenditure and can only be acquired with significant training (Kelso, 1984; Peper et al., 1995; Peper and Beek, 1998; Nam et al., 2009). For example, when participants are asked to oscillate two limbs or fingers, they spontaneously switch coordination patterns from the less stable anti-phase to the more stable in-phase as the required movement frequency increases, but not vice versa (Kelso, 1984; Haken et al., 1985; Peper et al., 2004). These two modes of coordination likely form the basis of syllable structure (Goldstein et al., 2006). The onset consonant (C) and vowel (V) planning oscillators (see below) are said to be coupled in-phase, while the CC onset clusters and the nucleus (V) and coda (C) gestures are coupled in anti-phase mode. As the in-phase coupling mode is more stable, this can explain the dominance of CV syllable structure during babbling and speech development as well as across languages (Goldstein et al., 2006; Nam et al., 2009; Giulivi et al., 2011).
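The asymmetry between in-phase and anti-phase stability, and the one-way switch at higher movement rates, can be illustrated with the relative-phase potential of Haken et al. (1985), V(φ) = −a·cos(φ) − b·cos(2φ). The sketch below is a hedged illustration rather than a fitted model: the function names and the parameter values are hypothetical, and the ratio b/a is assumed to shrink as movement rate increases.

```python
import math

def hkb_flow(phi, a, b):
    """d(phi)/dt = -dV/dphi for V(phi) = -a*cos(phi) - b*cos(2*phi)."""
    return -a * math.sin(phi) - 2.0 * b * math.sin(2.0 * phi)

def settle(phi0, a, b, dt=0.01, steps=5000):
    """Let relative phase relax from phi0 (forward Euler integration)."""
    phi = phi0
    for _ in range(steps):
        phi += dt * hkb_flow(phi, a, b)
    return phi

def antiphase_is_stable(a, b):
    """Anti-phase (phi = pi) is a minimum of V only if V''(pi) = 4b - a > 0."""
    return 4.0 * b > a

near_anti = math.pi - 0.1   # start slightly off anti-phase
near_in = 0.1               # start slightly off in-phase

# Slow rate (assumed b/a = 1.0): both modes are attractors.
slow_from_anti = settle(near_anti, a=1.0, b=1.0)
slow_from_in = settle(near_in, a=1.0, b=1.0)
# Fast rate (assumed b/a = 0.1): anti-phase is no longer stable.
fast_from_anti = settle(near_anti, a=1.0, b=0.1)
fast_from_in = settle(near_in, a=1.0, b=0.1)

for label, phi in [("slow, from anti-phase", slow_from_anti),
                   ("slow, from in-phase", slow_from_in),
                   ("fast, from anti-phase", fast_from_anti),
                   ("fast, from in-phase", fast_from_in)]:
    print(f"{label}: settles at {math.degrees(phi):.1f} deg")
```

In this sketch the anti-phase state turns from an attractor into an unstable state once b/a falls below 1/4, so a system started near anti-phase ends up in-phase, while a system started in-phase stays there at either rate.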

<sup>2</sup>There are also certain states that are inherently unstable, which are referred to as repellors.

Using the TD framework in the AP model (Nam and Saltzman, 2003), speech production planning processes and dynamic multi-frequency coupling between gestural and rhythmic (prosodic) systems have been explained using the notion of coupled oscillator models (Goldstein et al., 2006; Nam et al., 2009; Tilsen, 2009; Gafos and Goldstein, 2012). The coupled oscillator models for speech gestures are associated with non-linear (limit cycle) planning level oscillators which can be coordinated in relative time by specifying a phase relationship between them. During an utterance, the planning oscillators for multiple gestures generate a representation of the various (and potentially competing) coupling specifications, referred to as a coupling graph (**Figure 2**; Saltzman et al., 2006). The activation of each gesture is then triggered by its respective oscillator after they settle into a stable pattern of relative phasing during the planning process (van Lieshout and Goldstein, 2008; Nam et al., 2009). In this manner, the coupled oscillator model has been used to control the relative timing of multiple gestural activations during word or sentence production. To recap, individual gestures are modeled as critically damped mass-spring systems with a fixed-point attractor where speed, amplitude and duration are manipulated by adjustments to dynamic parameter specifications (e.g., damping and stiffness variables). In contrast, gestural planning level systems are modeled using limit cycle oscillators and their relative phases are controlled by potential functions (Tilsen, 2009; Pouplier and Goldstein, 2010).
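How a pair of planning oscillators might settle into a specified relative phase before triggering their gestures can be sketched as follows. This is a deliberately simplified illustration, not the published planning model: it assumes pure phase oscillators with a shared frequency, a single-well potential V(ψ) = −c·cos(ψ − target) on the relative phase ψ, and hypothetical values for the frequency (5 Hz), the coupling strength, and the starting phases. The function names are likewise invented for the example.

```python
import math

def wrap(angle):
    """Wrap an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def settle_pair(target, theta1=1.0, theta2=2.5, freq_hz=5.0,
                coupling=10.0, dt=0.001, duration=3.0):
    """Relax two coupled phase oscillators toward a target relative phase.

    Returns (settled relative phase in radians, gestural onset lag in seconds).
    The lag assumes each gesture is triggered at a fixed phase of its own
    oscillator, so a relative phase psi maps to a lag of |psi| / omega.
    """
    omega = 2.0 * math.pi * freq_hz
    for _ in range(int(round(duration / dt))):
        # Coupling force from V(psi) = -coupling * cos(psi - target),
        # split equally (with opposite sign) between the two oscillators.
        pull = coupling * math.sin((theta2 - theta1) - target)
        theta1 += dt * (omega + 0.5 * pull)
        theta2 += dt * (omega - 0.5 * pull)
    psi = wrap(theta2 - theta1)
    return psi, abs(psi) / omega

# Onset C and V: in-phase target (0 rad). Nucleus V and coda C: anti-phase (pi rad).
psi_cv, lag_cv = settle_pair(target=0.0)
psi_vc, lag_vc = settle_pair(target=math.pi)

print(f"C-V (in-phase target):   {math.degrees(abs(psi_cv)):.1f} deg, lag {lag_cv * 1000:.1f} ms")
print(f"V-C (anti-phase target): {math.degrees(abs(psi_vc)):.1f} deg, lag {lag_vc * 1000:.1f} ms")
```

Under these assumptions the in-phase pair ends up with simultaneous gesture onsets, and the anti-phase pair ends up half a cycle apart (100 ms at the assumed 5 Hz), regardless of the arbitrary phases the oscillators started from.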

Similar to the bidirectional relationship between inter-gestural timing and peripheral articulatory state, interactions between gestural and rhythmic level oscillators have also been noted. To explain the dynamic interactions between gestural and rhythmic (stress and prosody) systems, speech production may rely on a similar multi-frequency system of coupled oscillators as proposed for limb movements (Peper et al., 1995; Tilsen, 2009). The coupling strength and stability in such systems vary not only as a function of type of phasing (in-phase or anti-phase), but also with the complexity of coupling (ratio of intrinsic oscillator frequencies of the coupled structures), movement amplitude and the movement rate at which the coupling needs to be maintained (Peper et al., 1995; Peper and Beek, 1998; van Lieshout and Goldstein, 2008; van Lieshout, 2017). For example, rhythmic movement between the limbs has been modeled as a system of coupled oscillators that exhibit (multi)frequency locking. The most stable coupling mode is when two or more structures (oscillators) are frequency locked in a lower-order (e.g., 1:1) ratio. Multi-frequency locking for upper limbs is possible at higher order ratios of 3:5 or 5:2 (e.g., during complex drumming) but only at slower movement frequencies. As the required movement rate increases, the complex frequency coupling ratios will exhibit transitions to simpler and inherently more stable ratios (Peper et al., 1995; Haken et al., 1996). Studies on rhythmic limb coupling show that increases in movement frequency are associated with decreases in coupling strength and coordination stability. The increases in movement frequency or rate may be associated with a drop in the movement amplitude that mediates the differential loss of stability across the frequency ratios (Haken et al., 1996; Goldstein et al., 2007; van Lieshout, 2017).
However, smaller movement amplitude in itself (independent from duration and rate) can also decrease coupling strength and coordination stability (Haken et al., 1985; Peper et al., 2008; van Lieshout, 2017). Amplitude changes are presumably used to stabilize the output of a coupled neural oscillatory system. Smaller movement amplitudes may decrease feedback gain, resulting in a reduction of the neural oscillator-effector coupling strength and stability (Peper and Beek, 1998; Williamson, 1998; van Lieshout et al., 2004; van Lieshout, 2017). Larger movement amplitudes facilitate neural phase entrainment by enhancing feedback signals, but a certain minimum sensory input is required for entrainment to occur (Williamson, 1998; Ridderikhoff et al., 2005; Peper et al., 2008; Kandel, 2013; van Lieshout, 2017). Several studies have demonstrated the critical role of movement amplitude on coordination stability in different types of speech disorders such as stuttering and apraxia (van Lieshout et al., 2007; Namasivayam et al., 2009; for review see Namasivayam and van Lieshout, 2011).

Such complex couplings between multi-frequency oscillators may be found at different levels in the speech system such as between slower vowel production and faster consonantal movements (Goldstein et al., 2007), or between shorter-time scale gestures and longer-time scale rhythmic units (moras, syllables, feet and phonological phrases; Tilsen, 2009). Experimentally, the interaction between gestural and rhythmic systems has been identified by a high correlation between inter-gestural temporal variability and rhythmic variability (Tilsen, 2009), while behaviorally, such gesture-rhythm interactions are supported by observations of systematic relationships between patterns of segment and syllable deletions, and stress patterns in a language (Kehoe, 2001; for an alternative take on neutralization in strong positions using constraint-based theory and AP model see McAllister Byun, 2011). Issues in maintaining the stability of complex higher order ratios in multi-frequency couplings (especially at faster speech rates) between slower vowel production and faster consonantal movements have also been implicated in the occurrence of speech sound errors in healthy adult speakers (Goldstein et al., 2007). This aspect is discussed further in the next section.

The development of gestures is tied to organs of constriction in two ways: between-organ and within-organ differentiation (Goldstein and Fowler, 2003). There is empirical data to support that these differentiations occur over developmental timelines (Cheng et al., 2007; Terband et al., 2009; see Section Development of Speech Motor Synergies). When a gesture corresponds to different organs (e.g., bilabial closure implemented via upper and lower lip plus jaw), between-organ differentiation is observed at an earlier stage in development. For within-organ differentiation, children must learn that for a given organ, different gestures may require different variations in vocal tract constriction location and degree. For example, /d/ and /k/ are produced by the same constriction organ (tongue) but use different constriction locations (alveolar vs. velar). Within-organ differentiation is said to occur at a later stage in development via a process called attunement (Studdert-Kennedy and Goldstein, 2003). During the attunement process, initial speech gestures produced by an infant (i.e., based on between-organ contrasts) become tailored (attuned) toward the perceived finer grained differentiations in gestural patterns in the ambient language (e.g., similar to phonological attunement proposed by Shriberg et al., 2005). In sum, gestural planning, temporal organization of gestures, parameter specification of gestures, and gestural coupling (between gestures, and between gestures and other rhythmic units) result in specific behavioral phenomena including casual speech alternations (e.g., syllable deletions, assimilations), as will be discussed next.

# Describing Casual Speech Alternations

The AP model accounts for variations and errors in the speech output by demonstrating how the task-specific gestures at the macroscopic level are related to the systematic changes at the microscopic level of articulatory trajectories and resulting speech acoustics (e.g., speech variability, coarticulation, allophonic variation, and speech errors in casual connected speech; Saltzman and Munhall, 1989; Browman and Goldstein, 1992; Goldstein et al., 2007). Browman and Goldstein (1990b) argue that speech sound errors such as consonant deletions, assimilations, and schwa deletions can result from an increasing overlap between different gestures, or from reducing the size (magnitude) of articulatory gestures (see also van Lieshout and Goldstein, 2008; Hall, 2010). The amount of gestural overlap is assumed to be a function of different factors, including style (casual vs. formal speech), the organs used for making the constrictions, speech rate, and linguistic constraints (Goldstein and Fowler, 2003; van Lieshout and Goldstein, 2008).

The gestural processes surrounding consonant and schwa deletions can be explained by alterations in gestural overlap resulting from changes in relative timing or phasing in the gestural score. The gestural overlap has different consequences in the articulatory and acoustic output, depending on whether the gestures share the same Tract Variables and corresponding articulatory sets (homorganic) or whether they employ different Tract Variables and constricting organs (heterorganic). Heterorganic gestures (e.g., lip closure combined with a tongue tip closure) will result in a Tract Variable motion for each gesture that is unaffected by the other concurrent gesture; and their Tract Variables goals will be reached, regardless of the degree of overlap. However, when maximum overlap occurs, one gesture may completely obscure or hide the other gesture acoustically during release (i.e., gestural hiding; Browman and Goldstein, 1990b). In homorganic gestures, when two gestures share the same Tract Variables and articulators, as in the case of a tongue tip (TT) constriction to produce /θ/ and /n/ (e.g., during production of /tεn θimz/) they perturb each other's Tract Variable motions. The dynamical parameters of the two overlapping gestural control regimes are 'blended.' These gestural blendings are traditionally described phonologically as assimilation (e.g., /tεn θimz/ → [tε θimz]) or allophonic variations (e.g., front and back variation of /k/ in English: "key" and "caw"; Ladefoged, 1982) (Browman and Goldstein, 1990a,b).

Articulatory kinematic data collected using an X-Ray Microbeam system (e.g., Browman and Goldstein, 1990b) have provided support for the occurrence of these gestural processes (hiding and blending). Consider the following classic examples in the literature (Browman and Goldstein, 1990b). The production of the sequence "nabbed most" is usually heard by the listener as "nab most" and the spectrographic display reveals no visible presence of /d/. However, the presence of the tongue tip raising gesture for /d/ can be seen in X-ray data (Browman and Goldstein, 1990b), but it is inaudible and completely overlapped by the release of the bilabial gestures /b/ and /m/ (Hall, 2010). Similarly, in fast speech, words like "potential" sound like "ptential," wherein the first schwa between the consonants /p/ and /t/ seems to be omitted, but in fact is hidden by the acoustic release of /p/ and /t/ (Byrd and Tan, 1996; Davidson, 2006; Hall, 2010). These cases show that relevant constrictions are formed, but they are acoustically and perceptually hidden by another overlapping gesture (Browman and Goldstein, 1990b). Assimilations have also been explained by gestural overlap and gesture magnitude reduction. In the production of "seven plus seven," which often sounds like "sevem plus seven," the coronal nasal consonant /n/ appears to be replaced by the bilabial nasal /m/ in the presence of the adjacent bilabial /p/. In reality, the tongue tip /n/ gesture is reduced in magnitude and overlapped by the following bilabial gesture /p/ (Browman and Goldstein, 1990b; Hall, 2010). The AP model accounts for rate-dependent speech sound errors by gestural overlap and gestural magnitude reduction (Browman and Goldstein, 1990b; Hall, 2010). 
Auditory-perceptual based transcription procedures would describe the schwa elision and consonant deletion (or assimilation processes) in the above examples by a set of phonological rules schematically represented as d → ∅/C\_C (i.e., /d/ is deleted in the presence of two adjacent consonants in "nabbed most" → "nab most"; Hall, 2010). However, these rules do not capture the fact that movements for the /d/ or /n/ are still present. Furthermore, articulatory data indicate that such speech sound errors are often not the result of whole-segment or feature substitutions/deletions, but are due to co-production of unintended or intrusion gestures to maintain the dynamic stability in the speech production system instead (Pouplier and Goldstein, 2005; Goldstein et al., 2007; Pouplier, 2007, 2008; Slis and van Lieshout, 2016a,b).
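
The notion of gestural hiding described above can be illustrated with a minimal sketch. It assumes that each gesture is reduced to a single closure interval on a time axis and that a gesture counts as hidden when closures made by other organs cover its entire interval. The `Gesture` class, the `is_hidden` helper, and all timing values are hypothetical and chosen only to mimic the "nabbed most" example; they are not measurements from the X-ray microbeam data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gesture:
    label: str     # segment the gesture belongs to
    organ: str     # constricting organ, e.g. "lips" or "tongue_tip"
    onset: float   # start of the closure interval (ms, hypothetical)
    offset: float  # release of the closure interval (ms, hypothetical)


def is_hidden(target, others):
    """Return True if closures by other organs cover the whole target interval."""
    covered_until = target.onset
    for g in sorted(others, key=lambda g: g.onset):
        if g.organ == target.organ:
            continue  # only heterorganic gestures can hide the target
        if g.onset > covered_until:
            break  # a gap leaves part of the target acoustically exposed
        covered_until = max(covered_until, g.offset)
    return covered_until >= target.offset


# Hypothetical gestural scores for "nabbed most" (closure intervals only).
casual = {
    "b": Gesture("b", "lips", 0, 100),
    "d": Gesture("d", "tongue_tip", 80, 150),
    "m": Gesture("m", "lips", 95, 220),
}
careful = {
    "b": Gesture("b", "lips", 0, 100),
    "d": Gesture("d", "tongue_tip", 110, 170),
    "m": Gesture("m", "lips", 180, 260),
}

hidden_casual = is_hidden(casual["d"], [casual["b"], casual["m"]])
hidden_careful = is_hidden(careful["d"], [careful["b"], careful["m"]])
```

In the casual score the tongue tip closure is still present but lies entirely within the span of the two lip closures, whereas in the careful score its onset and release are exposed. A deletion rule such as d → ∅ would describe the first case as if the gesture were absent.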

The concept of intrusion gestures is illustrated with kinematic data from the Goldstein et al. (2007) study, in which participants repeated bisyllabic sequences such as "cop top" at fast and slow speech rates. Goldstein et al. (2007) noticed unique speech sound errors in that both the intended and extra/unintended (intruding) gestures were produced at the same time. True substitutions and deletions of the targets occurred rarely, even though substitution errors are the most commonly reported error type in speech sound error studies that use auditory-perceptual transcription procedures (Dell et al., 2000). Goldstein et al. (2007) explained their findings based on the DST concepts of stable rhythmic synchronization and multi-frequency locking (see Section Gestures, Synergies and Linguistic Contrast). The two words in "cop top" differ in their onset consonant but share the syllable rhyme. Thus, each production of "cop top" contains one tongue tip (/t/) gesture and one tongue dorsum (/k/) gesture, but two labial (/p/) gestures. This results in each initial consonant being in a 1:2 relationship with the coda consonant. Such multi-frequency ratios are intrinsically less stable (Haken et al., 1996), especially under fast rate conditions. As speech rate increased, they observed an extra copy of the tongue tip gesture inserted or coproduced during the /k/ production in "cop" and a tongue dorsum intrusion gesture during the /t/ production in "top." Adding an extra gesture (the intrusion) results in a more stable harmonic relationship where both the initial consonants (tongue tip and tongue dorsum gestures) are in a 2:2 (or 1:1) relationship with the coda (lip gestures) consonant (Pouplier, 2008; Slis and van Lieshout, 2016a,b). Thus, gestural intrusion errors can be described as resulting from a rhythmic synchronization process, where the more complex and less stable 1:2 frequency-locked coordination mode is dissolved and replaced by a simpler and intrinsically more stable 1:1 mode by adding gestures. Unlike what is claimed for perception-based speech sound errors (e.g., Dell et al., 2000), the addition of "extra" cycles of the tongue tip and/or tongue dorsum oscillators results in phonotactically illegal simultaneous articulation of /t/ and /k/ (Goldstein et al., 2007; Pouplier, 2008; van Lieshout and Goldstein, 2008; Slis and van Lieshout, 2016a,b). The fact that /kt/ co-production is phonotactically illegal in English makes it difficult for a listener to even detect its presence. Pouplier and Goldstein (2005) further suggest that listeners only perceive intrusions that are large in magnitude (frequently transcribed as segmental substitution errors), while smaller gestural intrusions are not heard, and targets are scored as error-free despite conflicting articulatory data (Pouplier and Goldstein, 2005; Goldstein et al., 2007; see also Mowrey and MacKay, 1990).
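
The frequency-locking argument can be made concrete by simply counting gestures per repetition of "cop top." The sketch below is an illustration only: the organ labels, the `onset_to_coda_ratio` and `add_intrusions` helpers, and the reduction of each consonant to a single organ are simplifying assumptions, not part of the AP/TD model implementation.

```python
from collections import Counter
from fractions import Fraction

# Constricting organ of each consonant in one repetition of "cop top":
# /k/ (tongue dorsum), /p/ (lips), /t/ (tongue tip), /p/ (lips).
INTENDED = ["tongue_dorsum", "lips", "tongue_tip", "lips"]


def onset_to_coda_ratio(gestures, onset_organ, coda_organ="lips"):
    """Cycles of an onset organ per cycle of the coda organ in one repetition."""
    counts = Counter(gestures)
    return Fraction(counts[onset_organ], counts[coda_organ])


def add_intrusions(gestures):
    """Co-produce the 'other' onset gesture with each onset (gestural intrusion)."""
    return gestures + ["tongue_tip", "tongue_dorsum"]


intended_ratio = onset_to_coda_ratio(INTENDED, "tongue_tip")
intruded_ratio = onset_to_coda_ratio(add_intrusions(INTENDED), "tongue_tip")
```

Under these assumptions the intended sequence yields a ratio of 1/2 for each onset organ relative to the lips, while the sequence with intrusions yields 2/2, which `Fraction` reduces to 1. This is the shift from the 1:2 mode to the 1:1 mode described in the text.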

# ARTICULATORY PHONOLOGY AND SPEECH SOUND DISORDERS (SSD) IN CHILDREN

In this section, we briefly describe the patterns of speech sound errors in children as they have been typically discussed in the S-LP literature. This is followed by an explanation of how the development, maturation, and the combinatorial dynamics of articulatory gestures (such as phasing or timing relationships, coupling strength and gestural overlap) can offer a well-substantiated explanation for several of these more atypical speech sound errors. We will provide a preliminary and arguably tentative mapping between several subtypes of SSDs in children and their potential origins as explained in the context of the AP and TD framework (**Table 1**). We see this as a starting point for further discussion and an inspiration to conduct more research in this specific area. For example, one could use the AP/TD model (TADA; Nam et al., 2004) to simulate specific problems at the different levels of the model to systematically probe the emerging symptoms in movement and acoustic characteristics and then verify those with actual data, similar to recent work on apraxia and stuttering using the DIVA framework (Civier et al., 2013; Terband et al., 2019b). Since there is no universally agreed-upon classification system in speech-language pathology, we will limit our discussion to the SSD classification system proposed by Shriberg (2010; Vick et al., 2014; see Waring and Knight, 2013 for a critical evaluation of the current childhood SSD classification systems) and phonological process errors as described in the widely used clinical assessment tool Diagnostic Evaluation of Articulation and Phonology (DEAP; Dodd et al., 2006). We will refer to these phonological error patterns as process errors/speech sound error patterns, in line with their contemporary usage as descriptive terms, without reference to phonological or phonetic theory underpinnings.

## Speech Delay

According to Shriberg et al. (2010) and Shriberg et al. (2017), children with Speech Delay (age of occurrence between 3 and 9 years) are characterized by "delayed acquisition of correct auditory–perceptual or somatosensory features of underlying representations and/or delayed development of the feedback processes required to fine tune the precision and stability of segmental and suprasegmental production to ambient adult models" (Shriberg et al., 2017, p. 7). These children present with age-inappropriate speech sound deletions and/or substitutions, including the speech sound error patterns described below:

#### Gliding and Vocalization of Liquids

Gliding is described as a substitution of a liquid with a glide (e.g., rabbit /ræbɪt/ → [wæbɪt] or [jæbɪt], please /pliz/ → [pwiz], look /lʊk/ → [wʊk]; McLeod and Baker, 2017) and vocalization of liquids refers to the substitution of a vowel sound for a liquid (e.g., apple /æpl/ → [æpʊ], bottle /bɑtl/ → [bɑtʊ]; McLeod and Baker, 2017). The /r/ sounds are acoustically characterized by a drop in the third formant (Alwan et al., 1997). In terms of movement kinematics, the /r/ sound is a complex coproduction of three vocal tract constrictions/gestures (i.e., labial, tongue tip/body, and tongue root), requires a great deal of speech motor skill, and is mastered by most typically developing children between 4 and 7 years of age (Bauman-Waengler, 2016). Ultrasound data suggest that children may find the simultaneous coordination of three gestures motorically difficult and may simplify the /r/ production by dropping one gesture from the segment (Adler-Bock et al., 2007). Moreover, syllable-final /r/ sounds are often substituted with vowels that share only a subset of vocal tract constrictions with the original /r/ sound, and this is better described as a simplification process (Adler-Bock et al., 2007). For example, the child may drop the tongue tip gesture but retain the lip rounding gesture, and the latter dominates the resulting vocal tract acoustics (Adler-Bock et al., 2007; van Lieshout et al., 2008). Kinematic data derived from electromagnetic articulography (van Lieshout et al., 2008) also point to a limited within-organ differentiation of the tongue parts and subtle issues in relative timing between different components of the tongue in /r/ production errors. These arguments also have support from longitudinal observational data on positional lateral gliding in children (/l/ is realized as [j]; Inkelas and Rose, 2007). Positional lateral gliding in children is said to occur when the greater gestural magnitude of prosodically strong onsets in English interacts with the anatomy of the child's vocal tract (Inkelas and Rose, 2007; McAllister Byun, 2011, 2012). Within the AP model, reducing the number of required gestures (simplification) and poor tongue differentiation issues would likely have their origins at the level of Tract Variables while issues in relative timing between the tongue gestures are likely to arise at the level of the Gestural Score (**Table 1**).

#### Stopping of Fricatives

Stopping of fricatives involves a substitution of a fricative consonant with a homorganic plosive (e.g., zoo /zu/ → [du], shoe /ʃu/ → [tu], see /si/ → [ti]; McLeod and Baker, 2017). Fricatives are another class of late acquired sounds that require precise control over different parts of the tongue to produce a narrow groove through which turbulent airflow passes. Within the AP model, the stopping of fricatives may arise from an inappropriate Tract Variable constriction degree specification (Constriction Degree: /d/ closed vs. /z/ critical; Goldstein et al., 2006; see **Table 1**), possibly as a simplification process secondary to limited precision of tongue tip control. Alternatively, neutralization (or stopping) of fricatives especially in prosodically strong contexts has also been explained from a constraint-based grammar perspective. For example, the tendency to overshoot is greater in initial positions where a more forceful gesture is favored for prosodic reasons. This allows the hard to produce fricative to be replaced by a ballistic tongue-jaw gesture that does not violate the MOVE-AS-UNIT constraint (Inkelas and Rose, 2007; McAllister Byun, 2011, 2012) as described in the "Introduction Section."

#### Vowel Addition and Final Consonant Deletion

Different types of vowel insertion errors have been observed in children's speech. Epenthesis typically involves a schwa vowel inserted between two consonants in a consonant cluster (e.g., please /pliz/ → [pəliz] CCVC → CVCVC; blue /blu/ → [bəlu] CCV → CVCV), while other types of vowel insertions have also been noted (e.g., bat /bæt/ → [bæta]; CVC → CVCV) (McLeod and Baker, 2017). Final consonant deletion involves the deletion of a consonant in a syllable or word final position (seat /sit/ → [si], cat /kæt/ → [kæ], look /lʊk/ → [lʊ]; McLeod and Baker, 2017). Both these phenomena could be explained by the concept of relative stability. As noted earlier, the onset consonant and the vowel (CV) are coupled in a relatively more stable in-phase mode as opposed to the anti-phase VC and CC gestures (Goldstein et al., 2006; Nam et al., 2009; Giulivi et al., 2011). Thus, the maintenance of relative stability in VC or CC coupling modes may be more difficult with increasing cognitive-linguistic (e.g., vocabulary growth) or speech motor demands (e.g., speech rate), and there may be a tendency to utilize intrusion gestures as a means to stabilize the speech motor system (i.e., by decreasing frequency locking ratios; e.g., 2:1 to 1:1; Goldstein et al., 2007). We suspect that such mechanisms underlie vowel intrusion (error) gestures in children. In CVC syllables (or word structures), greater stability in the system may be achieved by dropping or deleting the final consonant and thus retaining the more stable in-phase CV coupling (Goldstein et al., 2006). Moreover, findings from ultrasound tongue motion data during the production of repeated two- and three-word phrases with shared consonants in coda (e.g., top cop) versus no-coda positions (e.g., taa kaa, taa kaa taa) have demonstrated a gestural intrusion bias only for the shared coda consonant condition (Pouplier, 2008). These findings suggest that the presence of (shared) coda consonants is a trigger for a destabilizing influence on the speech motor system (Pouplier, 2008; Mooshammer et al., 2018). From an AP perspective, the stability induced by deleting final consonants or adding intrusion gestures (lowering frequency locking ratios) can be assigned to limitations in inter-gestural coordination and/or possible gestural selection issues at the level of Gestural Planning Oscillators (**Figure 2**). We argue that (vowel) intrusion sound errors are not a "symptom" of an underlying (phonological) disorder, but rather the result of a compensatory mechanism for a less stable speech motor system. Additionally, children with limited jaw control may omit the final consonant /b/ in /bɑb/ in a jaw close-open-close production task, due to difficulties with elevating the jaw. This would typically be associated with the Tract Variable level in the AP model or with later stages during the specification of jaw movements at the Articulatory level (see **Figure 2** and **Table 1**).

#### Cluster Reduction

Cluster reduction refers to the deletion of a (generally more marked) consonant in a cluster (e.g., please /pliz/ → [piz], blue /blu/ → [bu], spot /sp6t/ → [p6t]; McLeod and Baker, 2017). From a stability perspective, CC onset clusters are less stable (i.e., anti-phasic) and in the presence of increased demands or limitations in the speech motor system (e.g., immaturity; Fletcher, 1992), they are more likely replaced by a stable CV coupling pattern by omitting the extra consonantal gesture (Goldstein et al., 2006; van Lieshout and Goldstein, 2008; Nam et al., 2009). Alternatively, there is also the possibility that when two (heterorganic) gestures in a cluster are produced they may temporally overlap, thereby acoustically and perceptually hiding one gesture (i.e., gestural hiding; Browman and Goldstein, 1990b; Hardcastle et al., 1991; Gibbon et al., 1995). Within the AP model, cluster reductions due to stability factors and gestural hiding may be ascribed to the Gestural Score Activation level (a gesture may not be activated in a CCV syllable to maintain stable CV structure) and to relative phasing issues (increased temporal overlap) at the level of inter-gestural coordination (**Figure 2** and **Table 1**; Goldstein et al., 2006; Nam et al., 2009).

#### Weak Syllable Deletion

Weak syllable deletion refers to the deletion of an unstressed syllable (e.g., telephone /tεləfoʊn/ → [tεfoʊn], potato /pəteɪtoʊ/ → [teɪtoʊ], banana /bənænə/ → [nænə]; McLeod and Baker, 2017). Multisyllabic words pose a unique challenge in that they comprise complex couplings between multi-frequency syllable and stress level oscillators (e.g., Tilsen, 2009). Deleting an unstressed syllable in a multisyllabic word may allow reduction of complexity by frequency locking in a stable lower-order mode between syllable and stress level oscillators. Within the AP model, this process is regulated at the level of Gestural Planning Oscillators (see **Table 1**; Goldstein et al., 2007; Tilsen, 2009).

#### Velar Fronting and Coronal Backing

Fronting is defined as a substitution of a sound produced in the back of the vocal tract with a consonant articulated further toward the front (e.g., go /go/ → [do], duck /dʌk/ → [dʌt], key /ki/ → [ti]; McLeod and Baker, 2017). Backing, on the other hand, is defined as a substitution of a sound produced in the front of the vocal tract with a consonant articulated further toward the back (e.g., two /tu/ → [ku], pat /pæt/ → [pæk], tan /tæn/ → [kæn]; McLeod and Baker, 2017). While fronting is frequently observed in typically developing young children, backing is rare for English-speaking children (McLeod and Baker, 2017). Children who exhibit fronting and backing behaviors show evidence of undifferentiated lingual gestures, according to electropalatography (EPG) and electromagnetic articulography studies (Gibbon, 1999; Gibbon and Wood, 2002; Goozée et al., 2007). Undifferentiated lingual gestures lack clear differentiation between the movements of the tongue tip, tongue body, and lateral margins of the tongue. For example, tongue-palate contact is not confined to the anterior part of the palate for alveolar targets, as in normal production. Instead, tongue-palate contact extends further back into the palatal and velar regions of the vocal tract (Gibbon, 1999). It is estimated that 71% of children (aged 4-12 years) with a clinical diagnosis of articulation and phonological disorders produce undifferentiated lingual gestures. These undifferentiated lingual gestures are argued to arise from decreased oro-motor control abilities, a deviant compensatory bracing mechanism (i.e., an attempt to counteract potential disturbances in tongue tip fine motor control; Goozée et al., 2007) or may represent an immature speech motor system (Gibbon, 1999; Goozée et al., 2007). Undifferentiated lingual gestures are not a characteristic of speech in typically developing older school-age children or adults (Gibbon, 1999). In children's productions of lingual consonants, there is a decrease in tongue-palate contact on EPG with increasing age (6 through 14 years) paralleled by fine-grained articulatory adjustments (Fletcher, 1989). The tongue tip and tongue body function as two quasi-independent articulators in typical and mature speech production systems (see section Development of Synergies in Speech). However, in young children, the tongue and jaw (tongue-jaw complex) and different functional parts of the tongue may be strongly coupled in-phase (i.e., always move together), and thus lack functionally independent regions (Gibbon, 1999; Green et al., 2002). Undifferentiated lingual patterns may thus result from simultaneous (in-phase) activation of regions of the tongue and/or tongue-jaw complex in young children and persist over time (van Lieshout et al., 2008).

Standard acoustic-perceptual transcription procedures do not reliably detect undifferentiated lingual gestures (Gibbon, 1999). Undifferentiated lingual gestures are sometimes transcribed as phonetic distortions or phonological substitutions (i.e., velar fronting or coronal backing) in some contexts, but may be transcribed as correct productions in other contexts (Gibbon, 1999; Gibbon and Wood, 2002). The perception of place of articulation of an undifferentiated gesture is determined by changes in tongue-palate contact during closure (i.e., articulatory drift; Gibbon and Wood, 2002). For example, closure might be initiated in the velar region, cover the entire palate, and then be released in the coronal or anterior region (or vice versa). Undifferentiated lingual gestures could therefore yield the perception of either velar fronting or coronal backing. The perceived place of articulation is influenced by the direction of the articulatory drift and the last tongue-palate contact region (Gibbon and Wood, 2002). Children with slightly more advanced lingual control, relative to those described with widespread use of undifferentiated gestures, may still present with fine-motor control or refinement issues (e.g., palatal fronting /ʃ/ → [s]; backing of fricatives /s/ → [ʃ]; Gibbon, 1999). Velar fronting and coronal backing can be envisioned as arising from incorrect relative phasing at the level of inter-gestural coordination<sup>3</sup> (see **Table 1**). For instance, the tongue tip-tongue body or tongue-jaw complex may be in a tight synchronous in-phase coupling, but the release of constriction may not. It may also be a problem in Tract Variable constriction location specification (**Table 1**).

#### Prevocalic Voicing and Postvocalic Devoicing

Context sensitive voicing errors in children are categorized as prevocalic voicing and postvocalic devoicing. Prevocalic voicing is a process in which voiceless consonants in syllable initial positions are replaced by voiced counterparts (e.g., pea /pi/ → [bi]; pan /pæn/ → [bæn]; pencil /pεnsəl/ → [bεnsəl]) and postvocalic devoicing is when voiced consonants in syllable final position are replaced by voiceless counterparts (e.g., bag /bæg/ → [bæk], pig /pɪg/ → [pɪk]; seed /sid/ → [sit]; McLeod and Baker, 2017). Empirical evidence suggests that in multi-gestural segments, segment-internal coordination of gestures may be different in onset than in coda position (Krakow, 1993; Goldstein et al., 2006). When a multi-gestural segment is produced in a syllable onset, such as a bilabial nasal stop (e.g., [m]), the necessary gestures (bilabial closure gesture, glottal gesture and velar gesture) are synchronously produced (i.e., in-phase), creating the most stable configuration for that combination of gestures; this makes the addition of voicing in onset position easy. However, in coda position, the bilabial closure gesture, glottal gesture (for voicing) and velar gesture must be produced asynchronously (i.e., in a less stable anti-phase mode; Haken et al., 1985; Goldstein et al., 2006, 2007). It is thus less demanding to coordinate fewer gestures in the anti-phase mode across oral and laryngeal speech subsystems in a coda position. This would explain why children (with a developing speech motor system) may simply drop the glottal gesture (devoicing in coda position) to reduce complexity. Note that in some languages (e.g., Dutch), coda devoicing is standard irrespective of the original voicing characteristic of that sound. Within the AP model, prevocalic voicing and postvocalic devoicing (i.e., adding or dropping a gesture) may be ascribed to gestural selection issues at the level of Gestural Planning Oscillators (**Figure 2** and **Table 1**).
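
The difference in stability between in-phase and anti-phase coupling can be sketched with the potential function commonly associated with the Haken et al. (1985) model of relative phase, V(φ) = −a cos φ − b cos 2φ. The code below is a minimal illustration under that assumption. The parameter values and the `stable_modes` helper are hypothetical, and the sketch is not a model of any specific speech gesture.

```python
import math


def hkb_potential(phi, a=1.0, b=1.0):
    """Potential over relative phase phi (radians): V = -a cos(phi) - b cos(2 phi)."""
    return -a * math.cos(phi) - b * math.cos(2 * phi)


def curvature(phi, a=1.0, b=1.0):
    """Second derivative of V; a positive value marks a stable attractor."""
    return a * math.cos(phi) + 4 * b * math.cos(2 * phi)


def stable_modes(a, b):
    """List which of the two coordination modes are stable for given a and b."""
    modes = []
    if curvature(0.0, a, b) > 0:
        modes.append("in-phase")
    if curvature(math.pi, a, b) > 0:
        modes.append("anti-phase")
    return modes


slow_rate = stable_modes(a=1.0, b=1.0)  # b/a above 1/4
fast_rate = stable_modes(a=1.0, b=0.2)  # b/a below 1/4
```

With these values both modes are stable at the higher b/a ratio, but the in-phase well is deeper (V(0) < V(π)). Once b/a drops below 1/4, which in that model is usually linked to increasing movement rate, only the in-phase mode remains stable. This mirrors the claim that anti-phase coda coordination is the more fragile pattern.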

Recent studies also suggest a relationship between jaw control and acquisition of accurate voiced-voiceless contrasts in children. The production of a voiced-voiceless contrast requires precise timing between glottal abduction/adduction and oral closure gestures. Voicing contrast acquisition in typically developing 1- to 2-year-old children may be facilitated by increasing the jaw movement excursion, speed and stability (Grigos et al., 2005). In children with SSDs (including phonological disorder, articulation disorder and CAS) relative to typically developing children, jaw deviances/instability in the coronal plane (i.e., lateral jaw slide) have been observed (Namasivayam et al., 2013; Terband et al., 2013). Moreover, stabilization of voice onset times for /p/ production has been noted in children with SSDs undergoing motor speech intervention focused on jaw stabilization (Yu et al., 2014). These findings are not surprising given that the perioral (lip) area lacks tendon organs, joint receptors and muscle spindles (van Lieshout, 2015), and the only reliable source of information to facilitate inter-gestural coordination between oral and laryngeal gestures comes from the jaw masseter muscle spindle activity (Namasivayam et al., 2009). Increases in jaw stability and amplitude may provide consistent and reliable feedback used to stabilize the output of a coupled neural oscillatory system comprising the larynx (glottal gestures) and oral articulators (van Lieshout, 2004; Namasivayam et al., 2009; Yu et al., 2014; van Lieshout, 2017).

#### Articulation Impairment

Articulation impairment is considered a motor speech difficulty and generally reserved for speech sound errors related to rhotics and sibilants (e.g., derhotacized /r/: bird /bÇd/ → [b3d]; dentalized/lateralized sibilants: sun /s2n/ → [ì2n] or [ 2n]; McLeod and Baker, 2017). A child with an articulation impairment is assumed to have the correct phoneme selection but is imprecise in the speech motor specifications and implementation of the sound (Preston et al., 2013; McLeod and Baker, 2017). Studies using ultrasound, EPG and electromagnetic articulography data have shown several aberrant motor patterns to underlie sibilant and rhotic distortions. For rhotics, these may range from undifferentiated tongue protrusion, absent anterior tongue elevation, absent tongue root retraction and subtle issues in relative timing between different components of the tongue gestures (van Lieshout et al., 2008; Preston et al., 2017). Correct /s/ productions involve a groove in the middle of the tongue along with an elevation of the lateral tongue margins (Preston et al., 2016, 2017). Distortions in /s/ production may arise from inadequate anterior tongue control, poor lateral bracing (sides of the tongue down) and missing central groove (McAuliffe and Cornwell, 2008; Preston et al., 2016, 2017).

<sup>3</sup>For an alternative take on velar fronting using the Harmonic Grammar framework and AP model see McAllister Byun, 2011, 2012; McAllister Byun and Tessier, 2016.

Within the AP model, articulation impairments may potentially arise at three levels: Tract Variables, Gestural Scores and dynamical specification of the gestures. We discussed rhotic production issues at the Tract Variables and Gestural Score levels in the Gliding and vocalization of liquids section as a reduction in the number of required gestures (i.e., some parts of the tongue not activated during /r/), limited tongue differentiation, and/or subtle relative timing issues between the different tongue gestures/components. Errors in dynamical specifications of the gestures could also result in speech sound errors. For example, incorrect damping parameter specification for vocal tract constriction degree may result in the Tract Variables (and their associated articulators) overshooting (underdamping) or undershooting (overdamping) their rest/target value (Browman and Goldstein, 1990a; Fuchs et al., 2006).
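
The effect of the damping parameter can be illustrated with a point-attractor (damped mass-spring) system of the kind used for gestures in task dynamics (Saltzman and Munhall, 1989), where gestures are typically taken to be critically damped. The sketch below is a simplified numerical illustration under stated assumptions: the tract variable is normalized so that it starts at 0 and its target is 1, and the stiffness (`omega`), duration, and step size are hypothetical values, not parameters taken from TADA.

```python
def simulate_gesture(zeta, omega=40.0, target=1.0, duration=0.25, dt=0.0005):
    """Integrate x'' = -2*zeta*omega*x' - omega**2 * (x - target) from rest.

    zeta is the damping ratio (1.0 = critical damping). Returns the peak and
    final values of the normalized tract variable x.
    """
    x, v = 0.0, 0.0
    peak = x
    for _ in range(int(round(duration / dt))):
        accel = -2.0 * zeta * omega * v - omega ** 2 * (x - target)
        v += accel * dt  # semi-implicit Euler: update velocity first
        x += v * dt      # then position, using the new velocity
        peak = max(peak, x)
    return peak, x


under_peak, under_final = simulate_gesture(zeta=0.3)     # underdamped
critical_peak, critical_final = simulate_gesture(zeta=1.0)
over_peak, over_final = simulate_gesture(zeta=3.0)       # overdamped
```

Under these assumptions the underdamped gesture passes beyond its target before settling (overshoot), the critically damped gesture approaches the target without crossing it, and the overdamped gesture has not yet reached the target when the activation interval ends (undershoot).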

### Childhood Apraxia of Speech (CAS)

The etiology of CAS is unknown, but it is hypothesized to be a neurological sensorimotor disorder with a disruption at the level of speech motor planning and/or motor programing of speech movement sequences (American Speech-Language-Hearing Association [ASHA], 2007). A position paper by ASHA (2007) describes three important characteristics of CAS: inconsistent speech sound errors on repeated productions, lengthened and disrupted coarticulatory transitions between sounds and syllables, and inappropriate prosody that includes both lexical and phrasal stress difficulties. Within the AP and TD framework, the speech motor planning processes described in linguistic models can be ascribed to the level of inter-gestural coupling graphs, inter-gestural planning oscillators and gestural score activation; while processes pertaining to speech motor programing would typically encompass dynamic gestural specifications at the level of tract variables and articulatory synergies (Nam and Saltzman, 2003; Nam et al., 2009; Tilsen, 2009).

Traditionally, perceptual inconsistency in speech production of children with CAS has been evaluated via word-level token-to-token variability or at the fine-grained segmental-level (phonemic and phonetic variability; Iuzzini and Forrest, 2010; Iuzzini-Seigel et al., 2017). These studies provide evidence for increased variability in speech production of CAS relative to those typically developing or those with other speech impairments (e.g., articulation disorders). Data suggest that speech variability issues in CAS may arise at the level of articulatory synergies (intragestural coordination). Children with CAS demonstrate higher lip-jaw spatio-temporal variability with increasing utterance complexity (e.g., word length: mono-, bi-, and tri-syllabic) and greater lip aperture variability relative to children with speech delay (Grigos et al., 2015). Terband et al. (2011) analyzed articulatory kinematic data on functional synergies in 6- to 9-year-old children with SSD, CAS, and typically developing controls. The results indicated that the tongue tip-jaw synergy was less stable in children with CAS compared to typically developing children, but the stability of lower lip-jaw synergy did not differ (Terband et al., 2011). Interestingly, differences in movement amplitude emerged between the groups: CAS children exhibited a larger contribution of the lower lip to the oral closure compared to typically developing controls, while the children with SSD demonstrated larger amplitude of tongue tip movements relative to CAS and controls. Terband et al. (2011) suggest that children with CAS may have difficulties in the control of both lower lip and tongue tip while the children with SSD have difficulties controlling only the tongue tip. Larger movement amplitudes found in these groups may indicate an adaptive strategy to create relatively stable movement coordination (see also Namasivayam and van Lieshout, 2011; van Lieshout, 2017). 
The presence of larger movement amplitudes to increase stability in the speech motor system has been reported as a potential strategy in other speech disorders, including stuttering (Namasivayam et al., 2009); adult verbal apraxia and aphasia (van Lieshout et al., 2007); cerebral palsy (Nip, 2017; Nip et al., 2017); and Speech-Motor Delay [SMD, a SSD subtype formerly referred to as Motor Speech Disorder–Not Otherwise Specified (MSD-NOS); Vick et al., 2014; Shriberg, 2017; Shriberg et al., 2019a,b]. This fits well with the notion that movement amplitude is a factor in the stability of articulatory synergies, as predicted in a DST framework (e.g., Haken et al., 1985; Peper and Beek, 1998) and evidenced in a recent study on speech production (van Lieshout, 2017). Additional mechanisms to improve stability in movement coordination were documented in gestural intrusion error studies (Goldstein et al., 2007; Pouplier, 2007, 2008; Slis and van Lieshout, 2016a,b), as discussed in section "Describing Casual Speech Alternations," and are more prevalent in adult apraxic speakers than in healthy controls (Pouplier and Hardcastle, 2005; Hagedorn et al., 2017).
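
The DST prediction referred to above can be illustrated with the relative-phase equation of the Haken et al. (1985) model, dφ/dt = −a sin φ − 2b sin 2φ, in which the anti-phase pattern (φ = π) is stable only while b/a > 1/4. In the original derivation the ratio b/a depends on oscillation amplitude; using b/a here as a stand-in for movement amplitude is a simplifying assumption, and the function names and parameter values below are hypothetical and not taken from any of the speech studies cited. This is a minimal sketch, not a model of articulatory synergies.

```python
import math

def hkb_rate(phi, a, b):
    """Rate of change of relative phase: dphi/dt = -a*sin(phi) - 2*b*sin(2*phi)."""
    return -a * math.sin(phi) - 2.0 * b * math.sin(2.0 * phi)

def stability(phi_fixed, a, b):
    """Slope of dphi/dt at a fixed point; a negative slope means the pattern is stable."""
    return -a * math.cos(phi_fixed) - 4.0 * b * math.cos(2.0 * phi_fixed)

def settle(phi0, a, b, dt=0.01, steps=20000):
    """Euler-integrate the relative phase from phi0 and return its final value."""
    phi = phi0
    for _ in range(steps):
        phi += dt * hkb_rate(phi, a, b)
    return phi

# Assumed illustrative values: a large b/a (standing in for large amplitude)
# keeps the anti-phase pattern stable; a small b/a lets it collapse to in-phase.
print(stability(math.pi, a=1.0, b=1.0))        # negative slope: anti-phase is stable
print(stability(math.pi, a=1.0, b=0.1))        # positive slope: anti-phase is unstable
print(settle(math.pi - 0.1, a=1.0, b=0.1))     # drifts toward 0 (in-phase)
```

Under these assumptions, a coordination pattern that is lost when b/a is small is retained when b/a is large, which is the sense in which larger amplitudes could act as a stabilizing strategy.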

With regard to the lengthened and disrupted coarticulatory transitions, findings suggest that abnormal and variable anticipatory coarticulation (assumed to reflect speech motor planning) may be specific to CAS and not a general characteristic of children with SSD (Nijland et al., 2002; Maas and Mailend, 2017). The lengthened and disrupted coarticulatory transitions between sounds and syllables can be explained by possible limitations in inter-gestural overlap in children with CAS. A reduction in overlap of successive articulatory gestures (i.e., reduced coarticulation or coproduction) may result in the speech output becoming "segmentalized" (e.g., as seen in adult apraxic speakers; Liss and Weismer, 1992). Segmentalization gives the perception of successive gestures being "pulled apart" in the time domain and possibly adds to perceived stress and prosody difficulties in this population (e.g., Weismer et al., 1995). Such segmentalization may arise from delays in the activation of the following gesture and/or errors in gesture activation durations.

Inappropriate prosody (lexical and phrasal stress difficulties) in CAS is often characterized by listener perceptions of misplaced or equalized stress patterns across syllables. A potential source of this problem is that children with CAS may produce subtle and not consistently perceptible acoustic differences between stressed and unstressed syllables (Shriberg et al., 1997; Munson et al., 2003). Children with CAS, unlike typically developing children, do not shorten vowel duration in weaker stressed initial syllables as an adjustment to the metrical structure of the following syllable (Nijland et al., 2003). Furthermore, syllable omissions have been particularly noted in children with CAS who demonstrated inappropriate phrasal stress (Velleman and Shriberg, 1999). These interactions between syllable/gestural units and rhythmic (stress and prosody) systems have been discussed earlier in the context of multi-frequency systems of coupled oscillators (e.g., Tilsen, 2009). We speculate that children with CAS may have difficulty with stability in coupling (i.e., experience weak or variable coupling) between stress and syllable level oscillators.
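
To make the notion of weak coupling between stress and syllable level oscillators concrete, the sketch below simulates two noisy phase oscillators locked in a 2:1 frequency ratio and measures how variable their generalized relative phase is. It is a generic coupled-oscillator illustration, not the model of Tilsen (2009); the oscillator frequencies, noise level, coupling values, and the function name are all assumptions chosen for illustration.

```python
import cmath
import math
import random

def phase_locking_variability(coupling, noise=0.5, foot_hz=2.0, syll_hz=4.0,
                              dt=0.002, steps=50000, seed=0):
    """Circular variance (0 = perfectly locked, 1 = unlocked) of the generalized
    relative phase psi = theta_syll - 2 * theta_foot for two noisy oscillators."""
    rng = random.Random(seed)
    theta_foot, theta_syll = 0.0, 0.0
    resultant = 0j
    step_noise = noise * math.sqrt(dt)  # Euler-Maruyama noise scaling
    for _ in range(steps):
        psi = theta_syll - 2.0 * theta_foot
        pull = coupling * math.sin(psi)
        # Coupling drives psi toward 0: when psi > 0 the syllable oscillator
        # is slowed and the stress-foot oscillator is sped up.
        theta_syll += dt * (2.0 * math.pi * syll_hz - pull) + step_noise * rng.gauss(0.0, 1.0)
        theta_foot += dt * (2.0 * math.pi * foot_hz + 0.5 * pull) + step_noise * rng.gauss(0.0, 1.0)
        resultant += cmath.exp(1j * psi)
    return 1.0 - abs(resultant) / steps

# Assumed coupling strengths: "strong" vs. "weak" are arbitrary illustrative values.
strong = phase_locking_variability(coupling=5.0)
weak = phase_locking_variability(coupling=0.2)
print(strong, weak)
```

With the same noise, the weakly coupled pair shows a much more variable stress-to-syllable phase relation than the strongly coupled pair, which is one way the speculated coupling deficit could surface as inconsistent stress patterns.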

## Speech-Motor Delay

Speech-Motor Delay (formerly MSD-NOS; Vick et al., 2014; Shriberg, 2017; Shriberg and Wren, 2019; Shriberg et al., 2019a,b) is a subpopulation of children presenting with difficulties in speech motor control and coordination that are not consistent with features of CAS or Dysarthria (Shriberg, 2017; Shriberg et al., 2019a,b). Information on the nature, diagnosis, and intervention protocols for the SMD subpopulation is emerging (Vick et al., 2014; Shriberg, 2017; Namasivayam et al., 2019). Current data suggest that this group is characterized by poor motor control (e.g., higher articulatory kinematic variability of upper lip, lower lip and jaw, larger upper lip displacements). Behaviorally, they produce errors such as fewer accurate phonemes, errors in vowel and syllable duration, errors in glide production, epenthesis errors, consonantal distortions, and less accurate lexical stress (Vick et al., 2014; Shriberg, 2017; Namasivayam et al., 2019; Shriberg and Wren, 2019; Shriberg et al., 2019a,b). As many of the precision and stability deficits in speech and prosody in SMD (e.g., consonant distortions, epenthesis, vowel duration differences and decreased accuracy of lexical stress) and adaptive strategies to increase speech motor stability (e.g., larger upper lip displacements; van Lieshout et al., 2004; Namasivayam and van Lieshout, 2011) overlap with CAS and other disorders discussed earlier, we will not reiterate possible explanations for these within the context of the AP model. SMD is considered a disorder of execution: a delay in the development of neuromotor precision-stability of speech motor control. Children with SMD are at increased risk for persistent SSDs (Shriberg et al., 2011, 2019a,b; Shriberg, 2017).

## Developmental Dysarthria

Dysarthria "is a collective name for a group of speech disorders resulting from disturbances in muscular control over the speech mechanism due to damage of the central or peripheral nervous system. It designates problems in oral communication due to paralysis, weakness, or incoordination of the speech musculature" (Darley et al., 1969, p. 246). Dysarthria may be present in children with cerebral palsy (CP) and may be characterized by reduced speaking rates, prolonged syllable durations, decreased vowel distinctiveness, sound distortions, reduced strength of articulatory contacts, voice abnormalities, prosodic disturbances (e.g., equal stress), reduced respiratory support or respiratory incoordination and poor intelligibility (Pennington, 2012; Mabie and Shriberg, 2017; Nip et al., 2017). Speakers with CP consistently produce greater lip, jaw and tongue displacements in speech tasks relative to typically developing peers (Ward et al., 2013; Nip, 2017; Nip et al., 2017). These increased displacements have been argued to arise from a reduced ability to grade force control (resulting in ballistic movements); alternatively, they can be interpreted as a strategy to increase proprioceptive feedback to stabilize speech movement coordination (Namasivayam et al., 2009; Nip, 2017; Nip et al., 2017; van Lieshout, 2017). Further, children with CP demonstrate decreased spatial coupling between the upper and lower lips and reduced temporal coordination between the lips and between lower lip and jaw relative to typically developing peers (Nip, 2017). These measures of inter-articulator coordination were found to be significantly correlated with speech intelligibility (Nip, 2017).

Within the AP model, the neuromotor characteristics of dysarthria such as disturbances in gesture magnitude or scaling issues (overshooting, undershooting), imprecise articulatory contacts (resulting in sound distortions), slowness (reduced speaking rate and prolonged durations), and coordination issues could be related to inaccurate gestural specifications of dynamical parameters (e.g., damping and stiffness), inaccurate gesture activation durations, imprecise constriction location and degree, and inter-gestural and intra-gestural (i.e., articulatory synergy level) timing issues (Browman and Goldstein, 1990a; van Lieshout, 2004; Fuchs et al., 2006). Inter-gestural and intra-gestural timing issues may characterize difficulties in coordinating the subsystems required for speech production (respiration, phonation and articulation) and difficulties in controlling the many degrees of freedom in a functional articulatory synergy, respectively (Saltzman and Munhall, 1989; Browman and Goldstein, 1990b; van Lieshout, 2004). Overall, dysarthric speech characteristics would encompass the following levels in the AP/TD framework: inter-gestural coordination, and dynamic specifications at the level of Tract Variables and Articulatory Synergies (**Table 1**).
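
The role of dynamical parameters such as stiffness and damping can be sketched with a unit-mass point attractor, x'' = −k(x − target) − c·x', the kind of second-order system commonly used to describe gestures in task dynamics. The code below is a hedged illustration only: the stiffness values, the damping ratios, the explicit Euler integrator, and all function names are assumptions, and the output is not fitted to any articulatory data.

```python
import math

def simulate_gesture(stiffness, damping_ratio=1.0, start=0.0, target=1.0,
                     dt=0.001, duration=2.0):
    """Integrate x'' = -k (x - target) - c x' with unit mass (explicit Euler).

    c = 2 * damping_ratio * sqrt(k); damping_ratio = 1 is critical damping.
    Returns the list of positions sampled every dt seconds.
    """
    damping = 2.0 * damping_ratio * math.sqrt(stiffness)
    x, v = start, 0.0
    trajectory = [x]
    for _ in range(int(round(duration / dt))):
        accel = -stiffness * (x - target) - damping * v
        x, v = x + dt * v, v + dt * accel
        trajectory.append(x)
    return trajectory

def time_to_fraction(trajectory, start, target, fraction=0.9, dt=0.001):
    """First time (s) at which the movement has covered `fraction` of the distance."""
    threshold = start + fraction * (target - start)
    for i, x in enumerate(trajectory):
        if (x - threshold) * (target - start) >= 0:
            return i * dt
    return None

def overshoot(trajectory, start, target):
    """Largest excursion past the target, as a fraction of the movement distance."""
    distance = target - start
    return max(0.0, max((x - target) / distance for x in trajectory))

# Assumed illustrative settings: lower stiffness lengthens the movement,
# and a damping ratio below 1 makes the constriction overshoot its target.
reference = simulate_gesture(stiffness=100.0)
low_stiffness = simulate_gesture(stiffness=25.0)
underdamped = simulate_gesture(stiffness=100.0, damping_ratio=0.3)
print(time_to_fraction(reference, 0.0, 1.0), time_to_fraction(low_stiffness, 0.0, 1.0))
print(overshoot(reference, 0.0, 1.0), overshoot(underdamped, 0.0, 1.0))
```

In this toy system, a mis-specified stiffness alone is enough to produce slowness and prolonged durations, and a mis-specified damping alone is enough to produce overshoot, which is why the text treats these parameters as candidate sources of dysarthric speech characteristics.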

# CLINICAL RELEVANCE, LIMITATIONS AND FUTURE DIRECTIONS

In this paper, we briefly reviewed some of the key concepts from the AP model (Browman and Goldstein, 1992; Gafos and Goldstein, 2012). We explained how the development, maturation, and the combinatorial dynamics of articulatory gestures in this model can offer plausible explanations for speech sound errors found in children with SSDs. We find that many of these speech sound error patterns are in fact present in the speech of typically developing children and, more importantly, even in the speech of typical adult speakers under certain circumstances. Based on our presentation of behavioral and articulatory kinematic data, we propose that such speech sound errors in children with SSD may potentially arise as a consequence of the complex interaction between the dynamics of articulatory gestures, an immature speech motor system with limitations in speech motor skills, and specific boundary conditions related to physical, physiological, and functional constraints. In fact, many of these speech sound errors themselves may reflect compensatory strategies (e.g., decreasing speech rate, increasing movement amplitude, bracing, intrusion gestures, cluster reductions, segment/gesture/syllable deletions, increasing lag between articulators) to provide more stability in the speech motor system, as has been found in both typical and disordered speakers (Fletcher, 1992; van Lieshout et al., 2004; Namasivayam and van Lieshout, 2011).

Based on the presented evidence, we speculate that in general children with SSDs may occupy the low end of the speech motor skill continuum similar to what has been argued for stuttering (van Lieshout et al., 2004; Namasivayam and van Lieshout, 2011) and that the differences we notice in speech sound errors between the subtypes of SSD may in fact be differences in how these individuals develop strategies for coping with the challenges of being on the lower end of the speech motor skill continuum. This is a critical shift in thinking about the (distal and proximal) causes for speech sound errors in children with SSD (or in adults for that matter). Many of these children show similarities in their behavioral symptoms and perhaps the traditional notion of separating phonological from motor issues should be questioned (see also Maassen et al., 2010) and replaced with a broader understanding of how all levels involved in speech production are part of a complex system with processing stages that are highly integrated and coupled at different time scales (see also Tilsen, 2009, 2017). The AP perspective and the associated DST principles provide a suitable basis for this kind of approach given its transparency between higher and lower levels of control through the concept of gestures.

Despite the uniqueness of the AP approach in offering new insights into the underlying mechanisms of speech sound errors in children, there are some limitations to using this approach. For example, the current version of the AP model does not have an auditory feedback channel and is unable to account for any effects of auditory feedback perturbations. Further, although there are some recent attempts at describing the neural mechanisms underlying the components of the AP model (e.g., Tilsen, 2016), the model generally does not explicitly specify neural structures as some other models have done (e.g., DIVA model; Tourville and Guenther, 2011; for a detailed comparison between models of speech production see Parrell et al., 2019).

Critically, the theoretical concepts of gestures/synergies in speech production from this framework have yet to be taught widely in professional S-LP programs and related disciplines (see also van Lieshout, 2004). There are several reasons for this knowledge translation issue, the main ones being a lack of accessible reviews and tutorials on this topic, limited empirical data on the nature of SSDs in children from an AP framework, and, most importantly, the absence of convenient, reliable and published practical methods to assess the status of gestures and synergies in speech production in a clinical setting. Although some intervention approaches like the Prompts for Restructuring Oral Muscular Phonetic Targets approach (PROMPT; Hayden et al., 2010) and the Rapid Syllable Transitions Treatment program (ReST; Thomas et al., 2014) aim at addressing speech movement gestures and transitions between them, they lack empirical outcome data related to their impact at the level of gestures and articulatory synergies. It is also unclear at this point whether or not it is possible to provide tools to identify differences in timing relationships in jaw-lip or tongue tip-jaw coupling that would work well in a clinical setting. Using purely sensory (visual and auditory) means to observe speech behaviors will always be subject to errors and biases common to perception-based evaluation procedures (e.g., Kent, 1996). At the moment, there is a paucity of literature in this area, which opens up great opportunities for future research.
With technologies like real-time Magnetic Resonance Imaging finding their way into the analysis of typical and disordered speech (e.g., see Hagedorn et al., 2017) and relatively low-cost automatic video-based face-tracking systems (Bandini et al., 2017) starting to emerge for clinical purposes, we hope that speech-language pathologists will have the tools they need to support their assessment and intervention planning based on a better understanding and quantification of the dynamics of speech gestures and articulatory synergies. To this end, we hope that this paper provides an initial step in this direction as an introduction to the AP framework for clinical audiences and a motivation for a larger cohort of researchers to develop testable hypotheses regarding the contribution of gestures and articulatory synergies to sub-types of SSD in children.

# CONCLUSION

The foundations of clinical assessment, classification and intervention for children with SSD have been heavily influenced by psycholinguistics and auditory-perceptual based transcription procedures (Shriberg, 2010; see section "Articulatory Phonology and Speech Sound Disorders in Children"). A major problem, as noted earlier (in the Introduction section), is that the complex relationships between the etiology (distal), processing deficits (proximal) and the behavioral levels (speech symptoms) are underspecified in current SSD classification systems (Terband et al., 2019a). It is critical to understand the complex interactions between these levels as they have implications for differential diagnosis and treatment planning (Terband et al., 2019a). There have been some theoretical attempts made toward understanding these interactions (e.g., Inkelas and Rose, 2007; McAllister Byun, 2012; McAllister Byun and Tessier, 2016), and we hope this paper will trigger a stronger interest in the field of S-LP for an alternative "gestural" perspective and increase the contributions to the limited corpus of research literature in this area.

# AUTHOR CONTRIBUTIONS

AN: main manuscript writing, synthesis and interpretation of literature, brainstorming concepts and ideas, and creation of tables and figures. DC and AO: main manuscript writing, brainstorming concepts and ideas, references, and proofing. PL: overall supervision of manuscript, writing subsections, and original conceptualization.

# REFERENCES

intelligibility in children with speech sound disorders. J. Commun. Disord. 46, 264–280. doi: 10.1016/j.jcomdis.2013.02.003




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Namasivayam, Coleman, O'Dwyer and van Lieshout. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.