Discrete anatomical coordinates for speech production and synthesis

The sounds of all languages are described by a finite set of symbols, extracted from the continuum of sounds produced by the vocal organs. How the discrete phonemic identity is encoded in the continuous movements producing speech remains an open question in experimental phonology. In this work, this question is addressed by using Hall-effect transducers and magnets, mounted on the tongue, lips and jaw, to track the kinematics of the oral tract during the vocalization of vowel-consonant-vowel structures. Using a threshold strategy, the time traces of the transducers were converted into discrete motor coordinates unambiguously associated with the vocalized phonemes. Furthermore, the transducer signals, combined with the discretization strategy, were used to drive a low-dimensional vocal model capable of synthesizing intelligible speech. The current work not only addresses a relevant question in the biology of language, but also demonstrates the performance of the experimental technique in monitoring the displacement of the main articulators of the vocal tract during speech. This novel electronic device represents an economical and portable alternative to the standard systems used to study vocal tract movements.


INTRODUCTION
Among all species, humans are the only one capable of generating speech. This complex process, which distinguishes us from other species, emerges from the interaction between brain activity and the physical properties of the vocal system. This interaction implies precise control of a set of articulators (lips, tongue and jaw) to produce continuous changes in the shape of the upper vocal tract 1. The output of this process is the speech sound wave, which can be discretized and represented by a finite set of symbols: the phonemes.
Moreover, the phonemes across languages can be hierarchically organized in terms of articulatory features, as described by the International Phonetic Alphabet 2 (IPA). On the other side of the process, at the brain level, intracranial recordings registered during speech production showed that motor areas encode the same set of articulatory features 3. One missing piece of the puzzle is then: how does the continuous vocal tract movement generating speech encode this discrete information?
During speech production, the displacement of the articulators modifies the vocal tract configuration, allowing (i) different filters to be applied to the sound initiated by the oscillations of the vocal folds at the larynx (i.e. vowels) and (ii) a turbulent sound source to be produced by occluding (i.e. stop consonants) or constricting (i.e. fricatives) the tract 4. Previous works developed biophysical models for this process 5,6 and tested their capability to synthesize realistic voice 7. In principle, those models could have a high dimensionality, especially due to the many degrees of freedom of the tongue 5. However, their dimension ranges between 3 and 7, suggesting that a small number of measurements of the vocal tract movements should suffice to decode speech and to feed the synthesizers.
In this study, oral dynamics are monitored using sets of Hall-effect transducers and magnets mounted on the tongue, lips and jaw during the utterance of a corpus of syllables (including all the Spanish vowels and voiced stop consonants). By applying a threshold strategy to the signals recorded by three sensors, it was possible to decode the uttered phonemes well above chance. Moreover, the signals were used to drive an articulatory synthesizer producing intelligible speech. The results show that continuous measurements of the oral movements can be represented in a discrete motor-coordinate space, explicitly showing that all steps comprising the speech process can be described in terms of discrete units.
From a technical point of view, the present work represents a benchmark for the state of the art of measurement techniques in the speech production field. During the last decades, few improvements have been made to the experimental methods used to measure vocal tract movements. The most widely used technique in the field is electromagnetic articulography 8,9 (EMA). This equipment produces very accurate measurements, but it presents two main disadvantages: it is non-portable and expensive. The device described in the current work, shown to be capable of tracking the vocal tract during continuous speech, represents an alternative method without these drawbacks.

RESULTS
Following the procedure described by Assaneo et al. 10, sets of Hall-effect transducers and magnets (Figure 1a) were mounted on the upper vocal tract to record the displacement of the articulators (jaw, tongue and lips). More specifically, 3 transducers and 4 magnets were placed in the oral cavity of the participant following the configuration displayed in Figure 1b. The position of the elements was chosen so that each transducer signal is modulated by a subset of magnets (color code in Figure 1b; see Methods for more detail). The upper-teeth transducer signal represents an indirect measurement of the aperture of the jaw, the lips transducer signal represents the roundness and closure of the lips, and the palate transducer gives an indirect measure of the position of the tongue within the oral cavity. To diminish the body surface in contact with the glue, participants wore plastic molds on their upper and lower dentures (Figure 1a). Just three elements were glued directly onto the participant's skin: the ones on the tongue and lips. This strategy also diminishes the variance in the device configuration between different sessions (the elements on the mold stayed fixed).
Four native Spanish speakers were instructed to vocalize a corpus of syllables while wearing the device. The transducer signals (h_J(t), h_T(t) and h_L(t) for the jaw, tongue and lips, respectively) were recorded simultaneously with the produced speech; an example of the four signals is shown in Figure 1c.

From continuous dynamics to a discrete motor representation
A visual inspection of the data revealed that the sensor signals remain stable during the utterance of each phoneme and execute rapid excursions during the transitions to reach the next state (see Figure 1c); moreover, the signals persist in the same range of values for different vocalizations of the same phoneme.
This observation motivated the hypothesis that each phoneme could be described in a three-dimensional discrete space by adjusting thresholds over the signals. This hypothesis was mathematically formalized and tested by using a subset of the data to fix the thresholds and the rest to compute the decoding performance for the phonemic identity.

Thresholds
A previous study showed that applying one threshold to each transducer was enough to decode the 5 Spanish vowels 10. Following the same strategy, an extra threshold per signal was added in order to include the stop consonants in the description. The signals were then discretized by fitting two thresholds: a vowel threshold v, dissociating vowels from one another, and a consonant threshold c, dissociating vowels from consonants. A visual exploration of the signals (see Figure 1c for an example; the whole dataset is available in the Supplementary Materials) suggested the rules used to fit the thresholds. The thresholds were fixed independently for each transducer signal by choosing the value that best accomplishes these rules (see Methods).

Mathematical description for the discretization process
The transformation from continuous transducer signals to discrete values can be accomplished through saturating functions that go from zero to one in a small interval around x=0, whose size is inversely proportional to a steepness parameter m.
In the limit of an infinitely steep function, the output is zero for h(t)<v and one for h(t)>v. These are the conditions that define the binary coordinates for vowels: using the transducer signals h_L(t), h_J(t), h_T(t) and the threshold values v_L, v_J and v_T for the lips, jaw and tongue, respectively, each vowel is assigned one binary coordinate per articulator. Plosive consonants represent articulatory activations reaching the dark areas of Figure 1c, assigned to the value 2. In order to include them in the description, an extra saturating function was added to each coordinate, using the consonant thresholds c_J, c_T and c_L. Following the previous notation, phonemes (either vowels or consonants) are represented in the discrete space directly from the transducer signals, with each coordinate taking the value 0, 1 or 2.
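In the limit of infinitely steep saturating functions, the discretization reduces to two comparisons per signal. A minimal sketch follows; the signal values and thresholds are hypothetical, and in practice one (v, c) pair is fitted per transducer and session:

```python
import numpy as np

def discretize(h, v, c):
    """Map a transducer signal h onto discrete states: 0 below the vowel
    threshold v, 1 between v and the consonant threshold c, 2 above c."""
    return (h > v).astype(int) + (h > c).astype(int)

# hypothetical lip-transducer samples and thresholds
h_L = np.array([0.10, 0.45, 0.90])
print(discretize(h_L, v=0.30, c=0.70))  # → [0 1 2]
```

Applying the same rule to h_J(t) and h_T(t) yields the three-coordinate state used to decode the phonemes.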

Decoding performance: Intra-subject & intra-session
To perform the decoding it is necessary to define the threshold values. In this case, one set of thresholds was adjusted for each participant and session. More precisely, the first 15 VCVs of each session were used as the training set, i.e. to fix the thresholds (see Table S1 for the numerical values). The following 60 VCVs of the corresponding session were used as the test set, i.e. to calculate the decoding performance using the thresholds optimized on the training set. Figure 2a shows the confusion matrix obtained by averaging the decoding performance across participants and sessions (see Figure S1 for each participant's confusion matrix). Every phoneme is decoded with performance well above chance level. This result validates the discretization strategy and discloses a discrete encoding of the phonemic identity in the continuous vocal tract movements.

Decoding performance: Intra-subject & inter-session
The previous result leads to the question of whether thresholds can be defined for each participant independently of variations in the device mounting across sessions. To explore this, the VCV data of all sessions were pooled together for each participant. Then 10% of the data was used to adjust the thresholds and the performance was tested on the rest of the data. More specifically, a 50-fold cross-validation was performed over each subject's data set. Figure 2b shows the confusion matrix obtained by averaging the decoding performance across participants (see Figure S2 for individual participants' confusion matrices and Table S1 for the mean value and standard deviation of the 50 thresholds). The performance remained well above chance for every phoneme, with the single exception of the vowel /e/, which can be confused with /a/. As shown in Figure 1d, these two vowels are distinguished by the state of the tongue, the articulator for which the mounting of the device is hardest to standardize.
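The validation scheme used here (thresholds fitted on 10% of the pooled tokens, accuracy computed on the remaining 90%, averaged over 50 random splits) can be sketched as follows. The one-dimensional signal, the two classes and the midpoint threshold rule below are simplified, hypothetical stand-ins for the three-transducer setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_threshold(x, y):
    # toy fitting rule: midpoint between the two class means
    return 0.5 * (x[y == 0].mean() + x[y == 1].mean())

def decode(x, thr):
    return (x > thr).astype(int)

def repeated_holdout_accuracy(x, y, n_splits=50, train_frac=0.1):
    """Fit the threshold on a random 10% of the tokens, decode the
    remaining 90%, and average the accuracy over the random splits."""
    accs = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        n_train = max(2, int(train_frac * len(y)))
        train, test = idx[:n_train], idx[n_train:]
        thr = fit_threshold(x[train], y[train])
        accs.append(np.mean(decode(x[test], thr) == y[test]))
    return float(np.mean(accs))

# two well-separated synthetic "phoneme" classes
x = np.concatenate([rng.normal(0.2, 0.05, 100), rng.normal(0.8, 0.05, 100)])
y = np.repeat([0, 1], 100)
acc = repeated_holdout_accuracy(x, y)
```

For well-separated classes, as in the stable portions of the transducer traces, a small training fraction is enough to place the threshold reliably.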

Decoding performance: Inter-subject & inter-session
Next, the robustness of the configuration, regardless of anatomical differences amongst subjects, was tested.
Therefore, the VCV data from all sessions and participants were pooled together and 10% of the data, with a 50-fold cross-validation, was used to fix the thresholds (see Table S1). The confusion matrix of Figure 2c represents the average values obtained from the 50-fold cross-validation. As in the previous case, the vowel /e/ was mistaken for /a/, revealing that the mounting of the magnet on the tongue needs a more refined protocol. This result shows that the discretization strategy is robust even across different anatomies, suggesting that the encoding of the sounds of language in a low-dimensional discrete motor space represents a general property of the speech production system. The corresponding numerical values are reported in Tables S1 and S2 in the Supplementary Materials.

Occupation of the consonant's free states
As pointed out before, vowels and consonants play different roles in the discrete representation: while each vowel is represented by a vertex of the cube in Figure 1d, each consonant is compatible with many states, shown as the points on the 'walls' surrounding the cube. The occupation levels of those states were explored.
The discrete state for each consonant was computed using the intra-subject and intra-session decoding, for all participants and sessions, and just the VCVs that were correctly decoded were kept for this analysis. The occupations of the different consonantal states are shown in Figure 3a.
The /b/ is defined by the lips in state 2; the tongue and the jaw are free coordinates. State 2 was not observed in the tongue, and is presumably incompatible with the motor gesture of this consonant; however, no significant difference was found between states 0 and 1 (binomial test with equal probabilities, p=0.1). Similarly, for the jaw coordinate state 0 is underrepresented, with an occupation of 18%, below the chance level of 1/3 (binomial test, p<0.001). The /d/ is defined by the tongue in state 2; the lips and the jaw are free coordinates. The lips show a dominance of state 1 over state 0 (binomial test, p<0.001), and state 0 of the jaw is significantly less populated than the others, with an occupation of 8%, lower than the chance level of 1/3 (binomial test, p<0.001). The /g/ has free lips and tongue coordinates.
The lips show no significant difference between states 0 and 1 (binomial test with equal probabilities, p=0.52), and state 0 was preferred for the tongue (binomial test with equal probabilities, p=0.006).
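The occupation analysis relies on binomial tests against chance levels of 1/2 (two free states) or 1/3 (three free states). A self-contained sketch of an exact two-sided binomial test follows; the counts in the example are hypothetical, chosen to mirror an occupation of 18% out of 200 tokens:

```python
from math import comb

def binom_p_two_sided(k, n, p):
    """Exact two-sided binomial test: sum the probabilities of all
    outcomes that are no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    obs = probs[k]
    return min(1.0, sum(q for q in probs if q <= obs * (1 + 1e-9)))

# hypothetical counts: a free coordinate occupying state 0 in 36 of 200
# correctly decoded tokens, against a chance level of 1/3
p_low = binom_p_two_sided(36, 200, 1 / 3)   # strongly underrepresented
p_even = binom_p_two_sided(50, 100, 0.5)    # consistent with chance
```

The same routine, with p=1/2, applies to the two-state comparisons of the tongue and lip coordinates.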
A well-known effect in the experimental phonology field is coarticulation: the articulation of a consonant is modified by its neighboring vowels 11. The occupation levels of the consonants as a function of their surrounding vowels were calculated (Figure 3b) and coarticulation effects were revealed. The results show that when the surrounding vowels share one of the consonant's free states, this state is transferred to the consonant. Specifically, when the previous and following vowels share the lip state, its value is inherited by the consonants with a free lip coordinate, i.e. /d/ and /g/ (p<0.001 for the four binomial tests). Additionally, /b/ inherits the tongue state of the surrounding vowels when both share that state (p<0.001 for both binomial tests), and when both vowels share tongue state 0, it is inherited by /g/ (binomial test, p<0.001). No coarticulation is present in the jaw: for /b/ and /d/ the jaw is homogeneously occupied by states 1 and 2, regardless of the states of the surrounding vowels.

Finally, to test the intelligibility of the synthetic speech, the samples were presented to 15 participants, who were instructed to write down a VCV structure after listening to each audio file. The confusion matrices obtained from the transcriptions are shown in Figure 4 for consonants and vowels. All values are above chance level (33% for consonants and 20% for vowels).


DISCUSSION
In order to recover the vowels' identity, just one threshold per signal is needed. Thus, the vowels are represented in the discrete motor space as the corners of a cube. Curiously, the number of vertices of the vowel cube (eight) is in agreement with the number of Cardinal Vowels 12, a set of vocalic sounds used by phoneticians to approximate the whole set of cross-language vowels. This match suggests that the discrete motor states captured by this study could represent the basic motor gestures of vowels.
Moreover, the state of each articulator's transducer corresponds to an extreme value along the two-dimensional coordinate system used by the International Phonetic Alphabet to describe vowels. Interestingly, the same discrete representation for vowels can be recovered from direct measurements of human brain activity during vocalizations 13.
The consonants chosen for this study were /b d g/. They cause a complete occlusion of the vocal tract, produced by the constriction gesture of one of the three independent oral articulator sets (lips for /b/, tongue tip for /d/ and tongue body for /g/), and they have been suggested as the basic units of the articulatory gestures 14. Therefore, they appear as natural candidates for studying the presence of discrete information within the continuous movements of the oral tract. Moreover, this discrete motor representation is a feature shared with the brain activation during speech, which represents a clear benefit for brain-computer interface applications.
From a more general point of view, this implementation represents an alternative to the widespread strategy used in the bioprosthetic field: large amounts of non-specific physiological data processed by statistical algorithms to extract relevant features for vocal instructions 23,24. Instead, in the current approach a small set of recordings of the movements of the speech articulators, in conjunction with a threshold strategy, is used to control a biophysical model of the vocal system.
Although this approach shows potential benefits for bioprosthetic applications, further work is needed to optimize the system. On one hand, the mounting protocol for the tongue should be tightened to obtain stable thresholds across sessions. On the other, the protocol should be refined to include the whole consonant data set. Arguably, the manner of articulation could be integrated by including different sets of thresholds, and increasing the number of magnet-transducer sets mounted on the vocal tract could retrieve other places of articulation. Regarding the vowels, the current vocalic space is complete for the Spanish language and has the same dimension as the cardinal vowels, suggesting that it would be enough to produce intelligible speech in any language 25.
The state of the art of the techniques used to monitor articulatory movements during speech has remained stagnant over the last decades, with some exceptions employing different technologies to measure the different articulators 24. The standard method used to track the articulators' displacements during speech is EMA 8,9. This technique has been proven to provide very accurate recordings 26-28, at the expense of being non-portable and expensive. Here, a novel method is introduced and shown to be able to capture the identity of the uttered phoneme, to detect coarticulation effects and to correctly drive an articulatory speech synthesizer. This device presents two main advantages: it is portable and inexpensive. The portability of the system makes it suitable for bioprosthetic applications; and, crucially, because of the low cost of its components, it could significantly improve speech research in developing countries.

METHODS

Ethics Statements
All the participants signed a written consent to participate in the experiments, which were approved by the CEPI ethics committee of the Hospital Italiano de Buenos Aires, qualified by the ICH (FDA-USA, European Community, Japan), IRB00003580.

Participants
Four individuals (1 female), within an age range of 29±6 years and with no motor or vocal impairments, participated in the recordings of anatomical and speech sound data. They were all native Spanish speakers, graduate students working at the University of Buenos Aires. Fifteen participants (9 females), all native Spanish speakers, took part in the audio tests.

Experimental device for the anatomical recordings
Details of the configuration of the 3 magnet-transducer sets shown in Figure 1b:
Red, lips: One cylindrical magnet (3.0 mm diameter and 1.5 mm height) was glued to the dental cast between the lower central incisors. Another one (5.0 mm diameter and 1.0 mm height) was fixed with medical paper tape at the center of the upper lip. The transducer was attached at the center of the lower lip. The magnets were oriented so that their magnetic fields have opposite signs along the privileged axis of the transducer.
Green, jaw: A spherical magnet (5.0 mm diameter) and the transducer were glued to the dental casts, in the space between the canine and the first premolar of the upper and lower teeth respectively.
Blue, tongue: A cylindrical magnet (5.0 mm diameter and 1.0 mm height) was attached at a distance of about 15 mm from the tip of the tongue, using a small amount of denture adhesive. The transducer was glued to the dental plastic replica, at the hard palate, approximately 10 mm above the superior teeth (in the sagittal plane).
The transducer wires were glued to the plastic replica and routed away to allow free mouth movements.

Articulatory synthesizer
During the production of voiced sounds, the vocal folds oscillate, producing a stereotyped airflow waveform 29 that can be approximated by relaxation oscillations 30 such as those produced by a van der Pol system. The glottal airflow is given by the variable u for u>0, and is zero otherwise. The fundamental frequency of the glottal flow is f0 (Hz) and the onset of the oscillations is attained for a>-1.
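A sketch of such a glottal source, integrated at the audio rate with a fourth-order Runge-Kutta step. The parameterization is an assumption, not the paper's exact equation: a van der Pol oscillator rescaled in time so that small-amplitude oscillations occur near f0, with the anti-damping term proportional to 1+a so that oscillations start for a>-1, as stated above:

```python
import numpy as np

def glottal_rhs(state, a=0.0, f0=100.0):
    """Van der Pol relaxation oscillator: u'' - w*(1+a)*(1-u^2)*u' + w^2*u = 0."""
    u, v = state
    w = 2 * np.pi * f0
    return np.array([v, w * (1.0 + a) * (1 - u**2) * v - w**2 * u])

def rk4_step(f, y, dt):
    # one fourth-order Runge-Kutta step
    k1 = f(y)
    k2 = f(y + 0.5 * dt * k1)
    k3 = f(y + 0.5 * dt * k2)
    k4 = f(y + dt * k3)
    return y + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

fs = 44100                       # audio sampling rate (Hz)
y = np.array([0.1, 0.0])         # small perturbation; with a > -1 it grows
u = np.empty(fs // 10)           # 100 ms of oscillation
for i in range(u.size):
    y = rk4_step(glottal_rhs, y, 1.0 / fs)
    u[i] = y[0]
flow = np.maximum(u, 0.0)        # glottal airflow: u where u > 0, zero elsewhere
```

With a below -1 the anti-damping term changes sign and the perturbation decays, silencing the source.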
The pressure perturbations produced by the injection of airflow at the entrance of the tract propagate along the vocal tract. The propagation of sound waves in a pipe of variable cross-section A(x) follows a partial differential equation 31. Approximations have been proposed to replace this equation by a series of coupled ordinary differential equations, such as the wave-reflection model 32-34 and the transmission-line analog 35. Those models approximate the pipe as a concatenation of N=44 tubes of fixed cross-section A_i and length l_i. In the transmission-line analog, the sound propagation along each tube follows the same equations as the circuit shown in Figure 5, where the current plays the role of the airflow u and the voltage the role of the sound pressure p. The flows u_1, u_2 and u_3 along the meshes displayed in Figure 5 follow the corresponding mesh equations.

Discrete states to vocal tract anatomies
The shape of the vocal tract can be mathematically described by its cross-sectional area A(x) at distance x from the glottal exit to the mouth. Moreover, previous works 34,36,37 developed a representation in which the vocal tract shape A(x) for any vowel and plosive consonant can be expressed as the product of two factors. The first factor, in square brackets, represents the shape of the vocal tract for vowels: the vowel substrate. The function Ω(x) is the neutral vocal tract, and the functions φ1(x) and φ2(x) are the first empirical modes of an orthogonal decomposition calculated over a corpus of MRI anatomical data for vowels 37. This description of the anatomy of the vocal tract fits well with our discrete representation. A previous study 10 showed that a simple map connects the discrete space and the morphology of the vocal tract for vowels: an affine transformation whose numerical values were phenomenologically found to correctly map the discrete states to the vowel coefficients q1 and q2. Together, Equations 5 and 6 allow the reconstruction of the vocal tract shape of the different vowels from the transducer signals.
During plosive consonants, the vocal tract is occluded at different locations. In our description, this corresponds to having the value 2 in one or more coordinates, which means that the corresponding transducer signals cross the consonant threshold c. The saturating functions with the consonant thresholds were used to control the parameter w_c of Equation 5, which sets the constriction. The values of x_c and r_c are in units of a vocal tract segmented into 44 parts, from the vocal tract entrance (x_c=1) to the mouth (x_c=44).
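The vowel-substrate-plus-constriction structure of Equation 5 can be sketched as below. The neutral tract Ω(x) and the modes φ1(x), φ2(x) here are hypothetical smooth stand-ins for the MRI-derived functions (not reproduced in this text), and the constriction is modeled as a localized dip of depth w_c centered at segment x_c with width r_c:

```python
import numpy as np

N = 44                           # tube segments, glottis (x=1) to mouth (x=44)
x = np.arange(1, N + 1)

# hypothetical stand-ins for the neutral tract and the two empirical modes
omega = 3.0 + 1.5 * np.sin(np.pi * x / N)
phi1 = np.sin(np.pi * x / N)
phi2 = np.sin(2 * np.pi * x / N)

def tract_area(q1, q2, wc=0.0, xc=44, rc=3):
    """Vowel substrate plus a localized constriction of depth wc
    centered at segment xc with width rc (a sketch of Eq. 5)."""
    vowel = omega + q1 * phi1 + q2 * phi2
    constriction = 1.0 - wc * np.exp(-((x - xc) / rc) ** 2)
    return np.clip(vowel * constriction, 0.0, None)

# wc = 1 at the lip end fully occludes the tract, a /b/-like configuration
A = tract_area(q1=0.5, q2=-0.3, wc=1.0, xc=42, rc=2)
```

Driving w_c from 0 to 1 and back with the discretized consonant coordinate produces the occlusion-release cycle of a plosive.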
This completes the path that goes from discretized transducer signals h J (t), h T (t) and h L (t) to the shape of the vocal tract A(x,t) for vowels and plosive consonants.

Vocal tract dynamics driven by transducers' data
To produce continuous changes in a virtual vocal tract controlled by the transducers, it is necessary to replace the infinitely steep functions in Equations 6 and 7 by smooth transitions from 0 to 1. Therefore, the condition m=∞ is replaced by finite steepness values m1, m2 and m3. The values used to synthesize continuous speech were m1=300, m2=300 and m3=900 for lips, tongue and jaw, respectively. These numerical values were manually fixed with the following constraint: applying Equation 6 to the recorded signals during the stable part of the vowels, and using the obtained (q1, q2) to synthesize speech, should produce recognizable vowels. This process is explained below.
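A sketch of the finite-steepness coordinates, assuming a logistic saturating function (the functional form is an assumption; only the steepness values are quoted in the text), with hypothetical lip thresholds and m=300 as above:

```python
import numpy as np

def smooth_state(h, v, c, m):
    """Finite-steepness discretization: continuous transitions between
    the states 0, 1 and 2 instead of instantaneous jumps."""
    s = lambda z: 1.0 / (1.0 + np.exp(-np.clip(m * z, -60.0, 60.0)))
    return s(h - v) + s(h - c)

# hypothetical lip signal rising through both thresholds
t = np.linspace(0.0, 1.0, 1000)
h = 0.2 + 0.6 * t
q = smooth_state(h, v=0.4, c=0.7, m=300)   # rises smoothly from ~0 to ~2
```

Larger m values make the virtual articulator snap between states more abruptly, which is why the jaw (m3=900) transitions faster than the lips and tongue (m1=m2=300).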
First, the mean values of the transducer signals during the production of vowels for one participant were computed (left panel of Figure 6). More precisely, only the set of correctly decoded vowels for subject 1, using the inter-session thresholds, was selected. Second, different exploratory sets of (m1, m2, m3) were used to calculate the corresponding (q1, q2) by means of Equation 6. Then, the resulting vocal tract shapes (A(x) in Equation 5) were reconstructed and the vocalic sounds synthesized, from which the first two formants were extracted using Praat 38. Each set of (m1, m2, m3) produces a different map from the sensor space to the formant space (Figure 6). The first two formants of a vocalic sound define its identity 29; their variability for real vocalizations of Spanish vowels is represented by the shaded areas in the right panel of Figure 6, according to previously reported results 39. The chosen steepness values (m1=300, m2=300 and m3=900) map more than 90% of the transducer data into the experimental (F1, F2) regions.

Synthetic speech
To synthesize speech, it is necessary to solve the set of equations described in the Articulatory synthesizer section for the concatenated tubes, using a Runge-Kutta 4 algorithm 40 coded in C at a sampling rate of 44.1 kHz. The sound intensity of the files was equalized at 50 dB.
Fifteen participants listened to the synthetic speech trials in random order using headphones (Sennheiser HD202). They were instructed to write down a VCV structure after listening to each audio file. The experiment was written in Psychtoolbox 41.

Acknowledgements
This work describes research partially funded by CONICET, ANCyT, UBA.