Laryngeal Features Are Phonetically Abstract: Mismatch Negativity Evidence from Arabic, English, and Russian

Schluter, Kevin T.; Politzer-Ahles, Stephen; Al Kaabi, Meera; Almeida, Diogo

doi:10.3389/fpsyg.2017.00746

ORIGINAL RESEARCH article

Front. Psychol., 15 May 2017

Sec. Psychology of Language

Volume 8 - 2017 | https://doi.org/10.3389/fpsyg.2017.00746


Kevin T. Schluter ¹*

Stephen Politzer-Ahles ²,³,⁴

Meera Al Kaabi ³,⁵

Diogo Almeida ¹

1. Division of Science, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
2. Faculty of Linguistics, Philology, and Phonetics, University of Oxford, Oxford, UK
3. NYUAD Institute, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
4. Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
5. Department of Applied Language Sciences, United Arab Emirates University, Al-Ain, United Arab Emirates

Abstract

Many theories of phonology assume that the sound structure of language is made up of distinctive features, but there is considerable debate about how much articulatory detail distinctive features encode in long-term memory. Laryngeal features such as voicing provide a unique window into this question: while many languages have two-way contrasts that can be given a simple binary feature account [±VOICE], the precise articulatory details underlying these contrasts can vary significantly across languages. Here, we investigate a series of two-way voicing contrasts in English, Arabic, and Russian, three languages that implement their voicing contrasts very differently at the articulatory-phonetic level. In three event-related potential experiments contrasting English, Arabic, and Russian fricatives along with Russian stops, we observe a consistent pattern of asymmetric mismatch negativity (MMN) effects that is compatible with an articulatorily abstract and cross-linguistically uniform way of marking two-way voicing contrasts, as opposed to an articulatorily precise and cross-linguistically diverse way of encoding them. Regardless of whether a language is theorized to encode [VOICE] over [SPREAD GLOTTIS], the data is consistent with a universal marking of the [SPREAD GLOTTIS] feature.

Introduction

The way speech sounds are categorized and stored in long-term memory has long been a central topic of investigation in language research. This line of inquiry has drawn on insights from many different sources, including detailed analyses of the structure of sound patterns of languages (Jakobson et al., 1951; Halle, 1959; Chomsky and Halle, 1968), data pertaining to speech perception and sound categorization (Repp, 1984) and, more recently, neurophysiological evidence (Dehaene-Lambertz, 1997; Phillips et al., 2000; Eulitz and Lahiri, 2004; Mesgarani et al., 2014).

Many theoretical (phonological) models of sound structures of languages have long held that not only are speech sounds organized into discrete phonemic categories, such as the ones represented by the symbols /s/ and /z/, but also that these categories are not atomic (cf. Baković, 2014 for an overview). Instead, sub-phonemic bits of information often termed distinctive features are recognized as the elemental components of linguistic sound categories. Here, we assume these distinctive features are the long-term memory representations relevant for auditory representations of language (cf. Mesgarani et al., 2014).¹

The point of contention across different theoretical models built around the notion of distinctive features is how best to characterize their nature and their mental organization. Early theories posited that features were loosely grounded in acoustic and articulatory information and were binary in nature (Jakobson et al., 1951; Chomsky and Halle, 1968). For example, the distinction between the segments [s] and [z] was simply that the former had a negative specification for vocal fold vibration, coded as [-VOICE], while the latter had a positive specification for the same property, [+VOICE]. The same feature distinguishes English [t] and [d], despite the fact that, in English, there is often little or no vocal fold vibration associated with [d]. A more accurate representation of the English contrast, then, is with the phonemes /tʰ/ and /d̥/ rather than /t/ and /d/². More recently, phonological theory has moved away from binary features in favor of privative features (e.g., where [z] is specified for [VOICE] whereas [s] lacks a specification and thus lacks vocal fold vibration), arguing that the negative specification is not needed when writing phonological rules or constraints; this difference, however, is in principle one of notation, as any binary feature system can be recoded as a privative feature system. Phonologists have often stressed this abstractness of the connection between phonetic reality and phonological features, even when they use non-binary or privative features (Lombardi, 1991/1994).
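To make the notational point above concrete, the following sketch recodes a binary feature bundle as a set of privative features in two ways: a lossless recoding that turns each binary value into its own privative feature, and the more common recoding that simply drops negative values. The Python representation (dicts of booleans, frozensets of feature names) and the "-" prefix for negative-value features are illustrative assumptions, not a formalism taken from the phonological literature.

```python
def binary_to_privative(segment: dict[str, bool]) -> frozenset[str]:
    """Lossless recoding: [+F] becomes "F" and [-F] becomes "-F" (hypothetical naming).

    Distinct binary bundles always map to distinct privative sets.
    """
    return frozenset(name if value else f"-{name}" for name, value in segment.items())


def drop_negative_values(segment: dict[str, bool]) -> frozenset[str]:
    """Usual privative recoding: keep positive values, leave [-F] unspecified."""
    return frozenset(name for name, value in segment.items() if value)


s = {"VOICE": False, "CONTINUANT": True}
z = {"VOICE": True, "CONTINUANT": True}

print(sorted(binary_to_privative(s)))   # ['-VOICE', 'CONTINUANT']
print(sorted(drop_negative_values(s)))  # ['CONTINUANT']
print(sorted(drop_negative_values(z)))  # ['CONTINUANT', 'VOICE']
```

The second recoding still distinguishes [s] from [z] as long as only the positive value of the feature needs to be referenced, which is the situation the privative analysis assumes.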

Other theoretical models have explored variations on this basic representational schema, particularly a closer relationship between distinctive features in long-term (phonological) memory and their articulatory realizations (Lisker and Abramson, 1964; Iverson and Salmons, 1995; Honeybone, 2005). In these theories, some features may be tied to language-specific properties, such as exactly how a voiced/voiceless contrast is made. Laryngeal realism, for example, suggests that a language like German is better explained if its voiced/voiceless contrast is construed as an aspirated/unaspirated contrast (Iverson and Salmons, 1995, 1999, 2003; Honeybone, 2005).

These two kinds of theories about the connection between phonological features and their phonetic realization make divergent predictions when it comes to the laryngeal articulators. For example, many languages, like Spanish, French, Russian, English, German, Swedish, and Turkish, exhibit a two-way phonological contrast between what are traditionally described as voiced and voiceless stop consonants, like /d/ and /t/. Under early, more abstract feature models, a single binary feature, such as [+VOICE] vs. [-VOICE], would be enough to account for all these cases. However, the actual articulatory gestures that speakers of these languages use to produce these two-way distinctions are known to vary cross-linguistically. Some languages, like Spanish, French, and Russian, primarily use the timing of the onset of vocal fold vibration relative to the consonant release, that is, voice onset time (VOT), to mark the two-way distinction: they contrast pre-voiced stops with neutral or short-lag stops. Other languages, like English and German, mark a two-way distinction primarily with aspiration, a long lag between the stop release and the onset of voicing, contrasting a plain or short-lag consonant with a long-lag one (see Lisker and Abramson, 1964). These different phonetic details can be captured by a system involving an inventory of laryngeal features, such as [VOICE] (which controls the vibration of the vocal folds) and [SPREAD GLOTTIS] (which controls the amount of aspiration), each of which may have positive or negative values (under a binary feature approach) or be specified or left unmarked (under a privative feature approach). In a true voicing language like Russian (Petrova et al., 2006; Ringen and Kulikov, 2012; Nicolae and Nevins, 2015), [VOICE] would be the active feature responsible for the two-way distinction, whereas in languages like English and German, this role would be played by [SPREAD GLOTTIS].

Therefore, different feature models make different predictions about the underlying structure and representation of laryngeal articulatory features. Early theories predict a simple binary distinction that abstracts from significant articulatory detail in order to implement a simple two-way phonological contrast. More recent theories, on the other hand, propose that simple two-way phonological contrasts can be implemented by different combinations of a richer set of underlying articulatory features, and that these combinations can vary across languages.

In this paper, we turn to neurophysiological data, in the form of the Mismatch Negativity (MMN) paradigm, that has been argued to reveal at least some aspects of phonological structure (Phillips et al., 2000; Walter and Hacquard, 2004; Kazanina et al., 2006; Scharinger et al., 2010, 2012; Cornell et al., 2011, 2013; Law et al., 2013; Truckenbrodt et al., 2014; de Jonge and Boersma, 2015; Hestvik and Durvasula, 2016; Politzer-Ahles et al., 2016; Schluter et al., 2016) in order to test these different representational approaches. In three MMN experiments, we test English, Arabic, and Russian, three different languages that have a functional two-way voicing distinction at a phonological level, but which rely on different underlying articulatory mechanisms to implement these distinctions during speech production. If earlier feature models are correct and the long-term feature representation abstracts away from considerable phonetic detail, then we predict a stable cross-linguistic pattern in the results across languages (English, Arabic, and Russian) and across consonant types (fricatives and stops). If, on the other hand, the long-term representation of laryngeal features is more closely tied to their precise articulatory detail, we predict different cross-linguistic patterns, since these languages’ respective two-way voicing distinctions are implemented via the use of differently specified laryngeal articulators.

Phonetics and Phonological Representations

Given that there are multiple ways to implement a two-way contrast, we ask whether languages use one relatively phonetically abstract feature to do so, or whether phonetically distinct contrasts are encoded in different ways. Two types of obstruent consonants commonly display a voicing contrast: stops and fricatives.³ Fricatives such as [f], [v], [s], and [z] are distinguished in terms of voicing by the presence or absence of vocal fold vibration⁴. Stop consonants, however, are often described in terms of a VOT continuum in which the difference between voiced and voiceless can vary depending on where the categorical boundary lies (Lisker and Abramson, 1964; Beckman et al., 2011, 2013). Pre-voiced stops (with negative VOT, as the voicing gesture begins before the release of the consonant) may contrast with plain or short-lag consonants (with voicing beginning at or shortly after the release) or with long-lag consonants (with voicing beginning well after the release). Thus, for any given language, a two-way stop contrast may have one of three articulatory-phonetic patterns: pre-voiced vs. short-lag (Spanish, French, and Russian), short-lag vs. long-lag (English, German), or pre-voiced vs. long-lag (Swedish, Turkish). The difference between aspiration and pre-voicing languages is shown in Figures 1 (aspiration) and 2 (pre-voicing). Other languages even use a three-way contrast: pre-voiced vs. short-lag vs. long-lag (Thai).⁵ Nonetheless, in terms of long-term mental representations, phonologists tend to use the same features to represent the voiced-voiceless contrast in stops as they do for fricatives, because it is the categorical contrast that is seen as ultimately creating a coherent mental representation for the entire sound system⁶.
Therefore, there are three issues at play when capturing the complexity of a two-way contrast in phonology: (1) the number of features used, (2) the labels of those features, and (3) the values of those features.
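The three VOT categories and the three two-way stop patterns described above can be summarized in a short sketch. The category boundaries (0 ms and 30 ms) are illustrative assumptions chosen for exposition; real boundaries vary by language, place of articulation, and study, and the names below are ours.

```python
from enum import Enum


class VOTCategory(Enum):
    PRE_VOICED = "pre-voiced"  # voicing starts before the release (negative VOT)
    SHORT_LAG = "short-lag"    # voicing starts at or shortly after the release
    LONG_LAG = "long-lag"      # voicing starts well after the release (aspiration)


# Illustrative boundaries in milliseconds (assumed values, not measurements).
SHORT_LAG_MIN_MS = 0.0
LONG_LAG_MIN_MS = 30.0


def classify_vot(vot_ms: float) -> VOTCategory:
    """Map a VOT value (ms, relative to the stop release) to a category."""
    if vot_ms < SHORT_LAG_MIN_MS:
        return VOTCategory.PRE_VOICED
    if vot_ms < LONG_LAG_MIN_MS:
        return VOTCategory.SHORT_LAG
    return VOTCategory.LONG_LAG


# Two-way stop contrasts as ("voiced" member, "voiceless" member).
TWO_WAY_PATTERNS = {
    "Russian": (VOTCategory.PRE_VOICED, VOTCategory.SHORT_LAG),  # also Spanish, French
    "English": (VOTCategory.SHORT_LAG, VOTCategory.LONG_LAG),    # also German
    "Swedish": (VOTCategory.PRE_VOICED, VOTCategory.LONG_LAG),   # also Turkish
}

print(classify_vot(-80.0).value)  # pre-voiced
print(classify_vot(12.0).value)   # short-lag
print(classify_vot(65.0).value)   # long-lag

# The same phonetic category sits on opposite sides of the contrast:
print(TWO_WAY_PATTERNS["English"][0] is TWO_WAY_PATTERNS["Russian"][1])  # True
```

Short-lag stops are the "voiced" member of the English contrast but the "voiceless" member of the Russian one, which is why a single label such as [VOICE] necessarily abstracts away from the phonetic category.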

FIGURE 1

**Aspiration Contrast**. The boundary between the three major voice onset time categories separates long-lag stops from short-lag and pre-voiced stops. Release time is indicated with a vertical red bar, with aspiration or pre-voicing represented as a horizontal bracket.

FIGURE 2

**Voicing Contrast**. The boundary between the three major voice onset time categories separates pre-voiced stops from short- and long-lag stops. Release time is indicated with a vertical red bar, with aspiration or pre-voicing represented as a horizontal bracket.

The number of features speaks to how abstract the relationship between phonetics and mental representations is. In a one-feature system, the feature's presence or absence in the mental representation is enough to distinguish two sounds, but not to spell out their phonetic implementation. For example, in English one feature could capture both the abstract relationship between /tʰ/ and /d̥/ (aspiration) and the relationship between /s/ and /z/ (vocal fold vibration). Similarly, one feature could capture the Russian contrast between /t/ and /d/, where the distinction is one of pre-voicing rather than aspiration. If this is true, we expect to obtain the same results when testing stops and fricatives in a comparable way, and when testing typologically distinct languages.

The second issue is the label of the features and the label's relationship to articulation and acoustics. While one abstract feature could be labeled in any way, phonologists have long suspected that the physical implementation of language should be taken into account when labeling these features (see, e.g., Lombardi, 1991/1994, and references therein). Thus, the contrast in English might be labeled with a feature related to the vibration of the vocal folds, [VOICE], or alternatively with reference to the absence of these vibrations. Although referring to the absence of vibration may at first seem odd from a physiological point of view, preventing the vocal folds from vibrating during speech does require muscular effort to keep the vocal folds apart, and it makes distinct acoustic contributions to the speech signal (Edmondson and Esling, 2006). Thus, a feature referring to the muscular effort to keep the vocal folds from vibrating, [SPREAD GLOTTIS], could be used as the label for the same contrast. While the specific labels and machinery for these features may vary (cf. Jakobson et al., 1951; Halle, 1959, 2005; Chomsky and Halle, 1968; Avery, 1997; Avery and Idsardi, 2001; Gallagher, 2011, among numerous others), we adopt the well-known labels [VOICE] and [SPREAD GLOTTIS] (cf. Lombardi, 1991/1994).

A third issue is the valuation of these labeled features. There is considerable debate among phonologists about whether features should be coded as binary (i.e., [+VOICE] vs. [-VOICE]) or whether privative features (i.e., [VOICE] vs. [ ]) can encode the same two-way distinction. Here, we largely set this debate aside, as it is somewhat orthogonal to our research question. Whether the phonological system of a language needs to refer to both the positive and negative values of a feature is at the heart of this debate, and we note that some recent literature suggests a need to refer to both values of a binary feature, e.g., that [-VOICE] is necessary to represent phonological processes in some languages (Wetzels and Mascaró, 2001; Bennet and Rose, unpublished). More relevant for our purposes is the notion of markedness: one of the two options (i.e., [+VOICE] vs. [-VOICE], or [VOICE] vs. [ ]) is marked (i.e., specified with a feature), while the other is unmarked (i.e., left featurally unspecified). A marked feature is seen as phonologically active, while the unmarked option is phonologically inert. These notions correlate to some extent with the neurophysiological results of Eulitz and Lahiri (2004), and we adopt their logic regarding feature specification⁷. Thus, we set aside the question of what it might mean for a feature to be marked with a negative value or to be unmarked, and focus instead on marked and privative feature labels. For expository purposes, we further simplify our terminology and will simply refer to marked or unmarked features henceforth.

A tight correlation between phonetics and phonology has been argued for in the form of laryngeal realism (Iverson and Salmons, 1995, 1999, 2003; Honeybone, 2005). Laryngeal realism states that the phonetics of a voiced-voiceless contrast indicates the feature marking responsible for the contrast. An aspirating language like German or English will mark the contrast with a feature responsible for aspiration, [SPREAD GLOTTIS], while a voicing language like Spanish or Russian will mark the contrast with a [VOICE] feature. Using the terminology laid out above, this would mean that in languages like English and German, the phonemes traditionally described as voiceless (like /p/, /t/, and /k/) bear a marked laryngeal feature, [SPREAD GLOTTIS], while their counterparts traditionally described as voiced (like /b/, /d/, and /ɡ/) are left unmarked for their laryngeal gestures. In voicing languages like French or Russian, on the other hand, the situation would be reversed: the phonemes traditionally described as voiceless (like /p/, /t/, and /k/) would be left unmarked, and those traditionally described as voiced (like /b/, /d/, and /ɡ/) would be marked for [VOICE].
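The feature assignments that laryngeal realism posits for the two language types can be written as a small lookup table. This is only a sketch of the hypothesis as summarized above, using privative features; the two-way split into "aspirating" and "voicing" languages and the function name `specify_pair` are our simplifications.

```python
LARYNGEAL_REALISM = {
    # language type: (marked feature, which member of the pair bears it)
    "aspirating": ("SPREAD GLOTTIS", "voiceless"),  # e.g., English, German
    "voicing": ("VOICE", "voiced"),                 # e.g., French, Russian
}


def specify_pair(language_type: str, voiceless: str, voiced: str) -> dict[str, frozenset[str]]:
    """Return privative laryngeal specifications for a voiceless/voiced pair."""
    feature, bearer = LARYNGEAL_REALISM[language_type]
    marked = frozenset({feature})
    unmarked: frozenset[str] = frozenset()  # no laryngeal feature specified
    if bearer == "voiceless":
        return {voiceless: marked, voiced: unmarked}
    return {voiceless: unmarked, voiced: marked}


print(specify_pair("aspirating", "t", "d"))
# {'t': frozenset({'SPREAD GLOTTIS'}), 'd': frozenset()}
print(specify_pair("voicing", "t", "d"))
# {'t': frozenset(), 'd': frozenset({'VOICE'})}
```

A single-feature model, by contrast, would use the same row for every language, so the marked member of the pair would not change across English and Russian.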

Many recent phonetic studies (Helgason and Ringen, 2008; Beckman et al., 2011, 2013; Ringen and Kulikov, 2012; Ringen and van Dommelen, 2013; Nicolae and Nevins, 2015) find support for laryngeal realism, providing evidence, for instance, that rate of speech affects the pronunciation of the marked stop (i.e., pre-voicing or long-lag duration) but not the unmarked, short-lag stop. Indeed, in Swedish, this is taken as evidence for contrastive overspecification, as dialects of Swedish and Norwegian phonetically contrast pre-voiced with long-lag stops. The logic underlying these studies is that rate of speech should only cause changes to segments bearing the marked feature value because these are actual gestural commands; the neutral, short-lag stop is a sort of default without any particular articulatory gesture associated with it.

These articulatory results, which are consistent with laryngeal realism, are also in line with data from language acquisition. Kager et al. (2007) tested some of the predictions of laryngeal realism by analyzing children's speech errors in English, German, and Dutch. Assuming the phonetically grounded articulatory feature representation used by laryngeal realism, Kager et al. (2007) hypothesize that children ought to make more speech errors toward the unmarked than toward the marked segment. Contrasting a voicing language (Dutch, which putatively marks [VOICE]) with aspirating languages (English and German, which putatively mark [SPREAD GLOTTIS]), Kager et al. (2007) find that Dutch children make more speech errors toward voiceless segments, and that English and German children make more errors toward voiced ones. They argue that a mixed analysis, in which the marked feature differs from language to language, makes better predictions than one in which only one feature (e.g., [VOICE]) is used for all three languages.

Whether [VOICE] or [SPREAD GLOTTIS] is active in English, however, is controversial. Kohn et al. (1995) argue that evidence from aphasic disfluencies suggests that the voiced consonants of English, rather than the voiceless ones, are marked, whereas laryngeal realism would posit the opposite if [SPREAD GLOTTIS] is the marked feature responsible for the English two-way contrast, under the assumption that only a marked feature should be active in the phonology of the language. The aphasic patients in Kohn et al.'s (1995) study tended to erroneously substitute the homorganic [+VOICE] consonant when another [+VOICE] consonant occurred in the same word, indicating that [VOICE] is active in the phonology and therefore has a marked value. This was not true of their [-VOICE] or [SPREAD GLOTTIS] consonant errors (i.e., [fɛs] for vest was an uncommon error type, while [gælevin] for calendar was significantly more common). In a similar vein, Hwang et al. (2010) find evidence that it is the voiceless segment (e.g., English /t/) that is unmarked, because it fails to generate predictions in the perception of final consonant clusters. In a conscious categorization task, the voiced-voiceless sequence (e.g., [uds]) is responded to more slowly and less accurately than codas matching in laryngeal state (i.e., [uts], [udz]) or the voiceless-voiced sequence (i.e., [utz]). The slower and less accurate member of the quadruplet is theorized to stand out because the voiced stop induces a prediction of a following voiced fricative (assumed to be marked for [VOICE]), which is violated in the [ds] sequence. Moreover, Vaux (1998) argues that, cross-linguistically, it is the voiceless fricative that is marked, except in languages like Burmese, which contrast voiced /z/, voiceless /s/, and voiceless aspirated /sʰ/ fricatives. Recent neurophysiological evidence, however, has been argued to support the laryngeal realism hypothesis (Hestvik and Durvasula, 2016).
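The prediction-violation account attributed above to Hwang et al. (2010) can be sketched in a few lines, assuming, as that account does, that only the voiced segments carry a marked [VOICE] feature: a marked stop sets up an expectation of a [VOICE]-marked fricative, while an unmarked stop sets up no expectation at all. Encoding segments as single characters and the function name `violates_prediction` are illustrative choices.

```python
VOICED = {"d", "z"}  # segments assumed to be marked for [VOICE]


def violates_prediction(stop: str, fricative: str) -> bool:
    """True if a [VOICE]-marked stop is followed by a fricative lacking [VOICE]."""
    predicts_voiced = stop in VOICED  # only a marked segment triggers a prediction
    return predicts_voiced and fricative not in VOICED


for coda in ["ts", "tz", "dz", "ds"]:
    print(coda, violates_prediction(coda[0], coda[1]))
# ts False
# tz False
# dz False
# ds True
```

Because the voiceless /t/ is unmarked, it generates no prediction, so [utz] is not penalized; only [uds] violates an expectation, matching the pattern of slower and less accurate responses described above.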

Mismatch Negativity (MMN)

Research on the electrophysiology of language has revealed the potential sensitivity of an event-related potential called the MMN to phonological structure (Dehaene-Lambertz, 1997; Phillips et al., 2000; Eulitz and Lahiri, 2004). The MMN (and its magnetoencephalographic counterpart, the mismatch field or MMF; Näätänen, 2001; Näätänen et al., 2007) is an early ERP component that is known to be sensitive to acoustic changes in general (Näätänen et al., 1978) but that has also been shown to be sensitive to categorical changes in speech stimuli (e.g., Dehaene-Lambertz, 1997; Näätänen and Alho, 1997). The MMN is usually evoked in an oddball paradigm, in which a 'standard' sound (or set of sounds) is played repeatedly and a 'deviant' or oddball sound is occasionally played (generally at a ratio of about seven standards to one deviant). The MMN is maximal at fronto-central sites (often Fz) and is obtained by subtracting the average response to a stimulus (or category of stimuli) presented as a standard from the average response to the same stimulus (or category) presented as a deviant. The elicitation of an MMN indicates that the processing system has detected a change in a stream of stimuli. This change-detection property has been exploited in studies investigating whether the MMN can detect changes not only at an acoustic or phonetic category level, but also at a phonological level. For example, Kazanina et al. (2006) found that a robust MMN response to the voicing contrast between [d] and [t] can be observed in Russian speakers, for whom the contrast is phonemic, but that no such response is observed in Korean speakers, for whom [d] and [t] are allophones of the same underlying phonemic category. Similarly, Truckenbrodt et al. (2014) tested German nonce words in the context of word-final devoicing in a reverse oddball paradigm.
In the crucial comparison, where the deviant and standard could plausibly be related via word-final devoicing (standard /vuzǝ/ with deviant [vus]), no MMN was detected for the fricative, as the two fricatives were apparently categorized as the same segment given the context (other contexts, including standard /vus/ with deviant [vuzǝ], did show an MMN for the fricatives). While final devoicing may be linked to a morphophonological alternation, the lack of an MMN in a final-devoicing context does suggest that, in some contexts, either an MMN asymmetry or the MMN itself may be absent for voiced and voiceless speech sounds. Thus, we expect the MMN to show effects of categorical differences where warranted, and to fail to show differences when the sounds are not distinct categories, even for voicing differences.
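The oddball design and the subtraction that defines the MMN can be illustrated with a short sketch. The 7:1 standard-to-deviant ratio and the same-stimulus (deviant minus standard) subtraction follow the description above; the function names, the single-channel list representation, and the toy waveforms are assumptions for illustration, not the parameters or data of the experiments reported here.

```python
import random
from statistics import fmean


def oddball_sequence(n_trials: int, standard: str, deviant: str,
                     standards_per_deviant: int = 7, seed: int = 0) -> list[str]:
    """Return a shuffled trial list with about `standards_per_deviant` standards per deviant.

    Real designs usually also constrain where deviants may occur (e.g., never
    back to back); that constraint is not modeled in this sketch.
    """
    n_deviants = n_trials // (standards_per_deviant + 1)
    trials = [deviant] * n_deviants + [standard] * (n_trials - n_deviants)
    random.Random(seed).shuffle(trials)
    return trials


def average_erp(epochs: list[list[float]]) -> list[float]:
    """Sample-by-sample average over equal-length epochs from one channel (e.g., Fz)."""
    return [fmean(samples) for samples in zip(*epochs)]


def identity_mmn(as_deviant: list[list[float]],
                 as_standard: list[list[float]]) -> list[float]:
    """Difference wave for one stimulus: its ERP as deviant minus its ERP as standard."""
    deviant_erp = average_erp(as_deviant)
    standard_erp = average_erp(as_standard)
    return [d - s for d, s in zip(deviant_erp, standard_erp)]


# One block with [f] standards and [v] deviants; a second block would reverse the roles.
block = oddball_sequence(800, standard="f", deviant="v")
print(block.count("f"), block.count("v"))  # 700 100

# Toy three-sample epochs (microvolts) for the same stimulus in both roles.
v_as_deviant = [[0.0, -2.0, -1.0], [0.0, -4.0, -1.0]]
v_as_standard = [[0.0, -1.0, -1.0], [0.0, -1.0, -1.0]]
print(identity_mmn(v_as_deviant, v_as_standard))  # [0.0, -2.0, 0.0]
```

Subtracting the same stimulus's standard response, rather than the response to the other stimulus, keeps acoustic differences between the two sounds out of the difference wave.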

In addition to a basic sensitivity to phonological information, the MMN has been shown to reflect the markedness status of phonological features in the form of asymmetrical effects (Eulitz and Lahiri, 2004, et sqq.). Eulitz and Lahiri (2004) argue that asymmetries in the strength of the MMN arise when marked and unmarked sounds are contrasted in a reverse oddball paradigm. When a marked sound is the deviant and an unmarked sound the standard, the MMN is smaller than when the unmarked sound is the deviant and the marked sound the standard. Eulitz and Lahiri (2004) argue that this is related to the phonological representation of the sounds: the marked deviant is not inconsistent with the unmarked standard, but an unmarked deviant has a phonetic representation that clashes with the marked stored representation of the standard, amplifying the MMN (see the discussion of alternative accounts below, and Politzer-Ahles et al., 2016, for a review of other factors that can cause MMN asymmetries that may not be tied to the markedness of distinctive features). This mechanism is referred to as underspecification in the phonological literature (Archangeli, 1984, 1988; Lahiri and Reetz, 2002, 2010; Eulitz and Lahiri, 2004, among others). Applying Eulitz and Lahiri's (2004) logic to voicing and to the feature marking hypothesis of laryngeal realism, one would expect to observe, in an aspirating language like English, an asymmetry based on an aspiration or [SPREAD GLOTTIS] feature, as voicing in English is taken to be only a phonetic phenomenon. Indeed, this was recently tested with English stop consonants: Hestvik and Durvasula (2016) find a larger MMN for the unmarked voiced deviant (/d/) than for the voiceless one (/t/).
By the same token, in a voicing language, the prediction about the MMN asymmetry is the reverse: a larger MMN for the unmarked voiceless deviant (/t/) compared to the marked voiced deviant (/d/), as the voiced segment is marked for [VOICE] and the voiceless one left unmarked. However, although Hestvik and Durvasula’s MMN results are consistent with the predictions of laryngeal realism for a specific language (English), there is no current cross-linguistic evidence from MMN for laryngeal realism: this is the kind of evidence that we seek to adjudicate in this paper.
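The asymmetry logic of Eulitz and Lahiri (2004), as applied to voicing above, can be sketched as a comparison between the standard's stored (possibly underspecified) features and the deviant's features. As a simplifying assumption for illustration, each segment's surface laryngeal features are taken to be exactly its stored ones, so an unmarked deviant lacks the standard's marked feature and clashes with it, while a marked deviant never clashes with an underspecified standard. The function names are ours.

```python
from typing import Optional


def conflicts(standard_features: frozenset[str], deviant_features: frozenset[str]) -> bool:
    """True if a feature stored for the standard is absent from the deviant."""
    return not standard_features <= deviant_features


def larger_mmn_deviant(a: str, b: str, stored: dict[str, frozenset[str]]) -> Optional[str]:
    """Predict which member of a reverse-oddball pair elicits the larger MMN as deviant."""
    a_clashes = conflicts(stored[b], stored[a])  # a is the deviant, b the standard
    b_clashes = conflicts(stored[a], stored[b])  # b is the deviant, a the standard
    if a_clashes and not b_clashes:
        return a
    if b_clashes and not a_clashes:
        return b
    return None  # no markedness-based asymmetry predicted


# Specifications posited by laryngeal realism for the two language types.
english = {"t": frozenset({"SPREAD GLOTTIS"}), "d": frozenset()}
russian = {"t": frozenset(), "d": frozenset({"VOICE"})}
print(larger_mmn_deviant("t", "d", english))  # d
print(larger_mmn_deviant("t", "d", russian))  # t
```

The English case reproduces the direction reported by Hestvik and Durvasula (2016), and the Russian case yields the reversed asymmetry that laryngeal realism predicts for a voicing language.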

Here we build on previous MMN findings to test the two different kinds of models of laryngeal feature specification in long-term memory. Traditional single-feature models would predict that a single feature, such as [VOICE], is responsible for the contrast in both stops and fricatives. The laryngeal realist theory, on the other hand, predicts a different pattern of results (see Figure 3). By applying the same logic of underspecification to laryngeal states, in an aspirating language like English we should observe an MMN asymmetry based on an aspiration or [SPREAD GLOTTIS] feature, rather than on a voicing feature, if voicing in English stops is only a surface phonetic specification. The feature responsible for voicing in English fricatives, however, may differ from the [SPREAD GLOTTIS] feature used for stops. Furthermore, speakers of a voicing language should show a different pattern based on the phonetic implementation of the stop contrast: speakers of a voicing language that marks a stop contrast with pre-voicing should use a [VOICE] feature to mark the difference, not [SPREAD GLOTTIS].

FIGURE 3

**Predictions**. When two sounds differ in markedness, a marked deviant's feature is compatible with the unmarked standard's lack of a feature **(a)**, but the reverse is not true. An unmarked deviant **(b)** has some phonetic surface marking, which conflicts with the standard's phonologically marked feature. This conflict causes a larger MMN. When [VOICE] is assumed to be the marked feature **(c,d)**, we expect to see a different asymmetry than when [SPREAD GLOTTIS] is marked **(e,f)**. This logic can be used to determine which feature is active in a fricative contrast (e.g., [s] vs. [z]) as well as in different stop contrasts (e.g., [t] vs. [d] or [tʰ] vs. [d̥]).

Alternative Accounts

While we assume the underspecification mechanism of Lahiri and Reetz (2002, 2010) and Eulitz and Lahiri (2004), other factors may also play a role in the MMN and in MMN asymmetries in both language and non-language studies. The presence or absence of an additional physical change in a non-linguistic auditory or visual stimulus (relative to the standard) has been shown to produce asymmetric MMN effects (Winkler and Näätänen, 1993; Nordby et al., 1994; Sabri and Campbell, 2000; Timm et al., 2011; Bendixen et al., 2014; Czigler et al., 2014). As the N1 and MMN are temporally close to one another, differences in N1 refractoriness may modulate the responses to different stimuli differentially (see May and Tiitinen, 2010, for a review). The MMN may also be influenced by differences in prototypicality (Ikeda et al., 2002) or by general perceptual biases (Polka and Bohn, 2011).

Moreover, some accounts explicitly reject the proposal that underspecification can lead to MMN asymmetries in the first place. Bonte et al. (2005), for example, suggest that purported underspecification effects in the MMN may instead be due to uncontrolled differences in phonotactic probabilities. Tavabi et al. (2009) similarly proposed that other variables, like frequency and context, rather than underspecification, may drive MMN asymmetries. Gow (2001, 2002, 2003) and Gaskell (2003) further suggest that the notion of underspecification is unnecessary for explaining alternations such as place assimilation, and Mitterer (2011) finds no evidence for underspecified representations in an eye-tracking study.

While we cannot refute all possible objections to linking underspecification and asymmetric MMNs, we note that we specifically focus on ERPs to fricatives presented in isolation (except for the stops [te] and [de] in Experiment 3) precisely to avoid many of the top-down confounds proposed above. Furthermore, we test these predictions in English, Arabic, and Russian, which are typologically different in their patterns of voiced and voiceless segments and therefore are not necessarily acoustically similar.

Hypothesis and Predictions

Our aim is to test how closely the phonetic implementation and the long-term mental representation of distinctive features are coupled, assuming the link proposed by Lahiri and Reetz (2002, 2010) and others between the underspecification of phonological units and the elicitation of MMN asymmetries. We do this with three experiments. In Experiment 1, we use English fricatives to test whether the feature marking of these segments is the same as that of stops (as revealed by the results of Hestvik and Durvasula, 2016)⁸. We test this using an oddball paradigm with the English segments [f] (voiceless) and [v] (voiced) and compare the results to those of Hestvik and Durvasula (2016) for the English stops [tʰ] and [d̥]. If the same MMN pattern observed by Hestvik and Durvasula (2016) for [tʰ] and [d̥] emerges for the fricatives [f] and [v] (i.e., if [v] deviants in the context of [f] standards elicit a greater MMN than [f] deviants in the context of [v] standards), we can conclude that English is likely to mark both voicing contrasts in the same way (supporting a one-feature theory, but less clearly compatible with theories like laryngeal realism that posit a closer connection between phonetic and phonological representations). Alternatively, if the results for the fricatives [f] and [v] go in the opposite direction from the stop results observed by Hestvik and Durvasula (2016), we can conclude that the two-way voicing distinction in English stops is implemented differently, at a featural level, from the two-way voicing distinction in English fricatives, which may indicate the need to invoke two different features to account for the results; for example, [SPREAD GLOTTIS] is marked for stops, but [VOICE] is marked for fricatives.

In Experiment 2, we test whether the fricatives of English (an aspirating language) are marked in the same way as the fricatives of Arabic (a purportedly voicing language). We test both English and Arabic tokens at two places of articulation (dental [s] and [z] and interdental [θ] and [ð]) for both English and Arabic speakers. If Arabic is truly a voicing language and marks [VOICE] rather than [SPREAD GLOTTIS], we should find an interaction such that the MMN asymmetries are opposite in English and Arabic speakers, indicating that one’s native language influences the features used to represent the contrast. If we find the same pattern of asymmetries for Arabic and English speakers, we would suspect that typologically different languages may still use one set of features, not necessarily driven by the precise articulatory phonetic details of the language.

Finally, in Experiment 3, we examine the marking of both fricatives and stops in Russian (an uncontroversial voicing language), using the dental fricatives /s/ and /z/, a mixed set of voiced (/v/, /z/, /ʐ/) and voiceless (/f/, /s/, /ʂ/) fricatives, and stops (/te/, /de/), in order to consolidate the results for fricatives and compare them directly to stop consonants. If the pattern of results for Russian fricatives is the same as for English fricatives, this supports a theory according to which fricatives are marked in the same way in these typologically distinct languages, regardless of how the languages implement the laryngeal marking of their stop consonants. The comparison with stops is crucial for evaluating laryngeal realism for stop consonants, since this theory posits that a voicing language marks its voiced stops rather than its voiceless ones. Thus, if the feature-marking hypothesis of laryngeal realism is correct, the results for Russian stops should show the exact opposite pattern from the results of Hestvik and Durvasula (2016). If, on the other hand, the same pattern of MMN asymmetries is observed for English and Russian stop consonants, then a single feature may be responsible for the cross-linguistic results. In that case, the support that this feature (identified by Hestvik and Durvasula, 2016, as [SPREAD GLOTTIS]) lent to laryngeal realism would have been entirely coincidental, arising because English was the only language Hestvik and Durvasula (2016) investigated.
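The predictions laid out for the three experiments can be summarized in a small decision sketch. It assumes, following the underspecification account adopted here, that the larger MMN is elicited by the unmarked deviant presented among marked standards. The function names and the "uniform" label are hypothetical, and for concreteness the uniform hypothesis is given the English-like marking (voiceless member marked, as [SPREAD GLOTTIS]).

```python
VOICED, VOICELESS = "voiced", "voiceless"

# Laryngeal type of each language, as characterized in the text.
LANGUAGE_TYPE = {"English": "aspirating", "Arabic": "voicing", "Russian": "voicing"}


def marked_member(hypothesis: str, language_type: str) -> str:
    """Return the member of a two-way contrast that carries the marked feature."""
    if hypothesis == "laryngeal_realism":
        # Aspirating languages mark [SPREAD GLOTTIS] on the voiceless member;
        # voicing languages mark [VOICE] on the voiced member.
        return VOICELESS if language_type == "aspirating" else VOICED
    if hypothesis == "uniform":
        # Assumption: one marking for all languages, [SPREAD GLOTTIS] on the voiceless member.
        return VOICELESS
    raise ValueError(f"unknown hypothesis: {hypothesis!r}")


def larger_mmn_deviant(marked: str) -> str:
    """Assumed rule: the unmarked deviant among marked standards elicits the larger MMN."""
    return VOICED if marked == VOICELESS else VOICELESS


def predictions(hypothesis: str) -> dict:
    """Predicted larger-MMN deviant for each language under one hypothesis."""
    return {lang: larger_mmn_deviant(marked_member(hypothesis, kind))
            for lang, kind in LANGUAGE_TYPE.items()}


if __name__ == "__main__":
    for hyp in ("laryngeal_realism", "uniform"):
        print(hyp, predictions(hyp))
```

Under these assumptions the two hypotheses make the same prediction for English (a larger MMN for the voiced deviant) and diverge for Arabic and Russian, which is why those languages serve as the informative test cases.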

Experiment 1: English [f] vs. [v]

Methods

Participants

Twenty-nine native English-speaking participants took part in the study, with the goal of obtaining usable data from 24 subjects. Two were excluded because of technical errors and three because they had fewer than 30 artifact-free deviant trials in one of the blocks, leaving 24 subjects in the analysis (10 males, 14 females; mean age = 20.9 years, SD = 3.7; age data from one participant are not available). The participants were recruited from the New York University Abu Dhabi community. All participants reported normal hearing and cognitive function. Though all participants reported English dominance, seven reported some degree of bilingualism (Hindi, Urdu, Mandarin, German, Japanese, and French). All methods for the study were approved by the Institutional Review Board of New York University Abu Dhabi. Participants were compensated for their time.

Stimuli

Stimuli consisted of short tokens of the English fricatives [f] and [v] produced by one female native English speaker in a sound-attenuated room. Stimuli were recorded using an Electro-Voice RE20 cardioid microphone and digitized at 22050 Hz with a Marantz Portable Solid State Recorder (PMD 671). The tokens were produced in isolation, without surrounding vowels. The use of naturally produced fricatives in isolation mirrors previous studies using vowels (Cornell et al., 2011; de Jonge and Boersma, 2015), eliminates any possible effects of coarticulation or phonotactic knowledge (Bonte et al., 2005) as well as artifacts of cross-splicing (Steinberg et al., 2012), and has been used successfully in previous experiments (Schluter et al., 2016). For each fricative, six distinct tokens were selected by a trained phonetician. Tokens were shortened in Praat (Boersma and Weenink, 2013) to a duration of about 250 ms by removing material from the middle of the token at zero-crossings, and were then normalized for amplitude to 70 dB SPL (RMS). Tokens were not ramped; their natural onset and offset were retained. See Supplementary Materials for the audio stimuli used in this experiment.
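The shortening and amplitude-normalization steps can be approximated outside Praat. The sketch below is not the Praat procedure used in this study; it assumes a mono NumPy signal whose samples are in pascals (Praat's convention, with dB SPL referenced to 20 µPa), deletes a central stretch whose edges are snapped to zero crossings so that the natural onset and offset are untouched, and then rescales the whole token to a target RMS level. The function names are hypothetical.

```python
import numpy as np


def zero_crossings(x: np.ndarray) -> np.ndarray:
    """Indices i such that x[i] and x[i + 1] differ in sign."""
    neg = np.signbit(x)
    return np.flatnonzero(neg[:-1] != neg[1:])


def trim_middle(x: np.ndarray, fs: float, target_ms: float = 250.0) -> np.ndarray:
    """Shorten x to roughly target_ms by deleting a central stretch whose edges
    are snapped to zero crossings; the onset and offset are left intact."""
    n_target = int(round(fs * target_ms / 1000.0))
    n_remove = len(x) - n_target
    if n_remove <= 0:
        return x.copy()
    zc = zero_crossings(x)
    if len(zc) < 2:
        raise ValueError("need at least two zero crossings to cut at")
    # Keep about half of the target length before the cut, snapped to a zero crossing.
    start = int(zc[np.argmin(np.abs(zc - n_target // 2))]) + 1
    # Resume after the cut at the zero crossing closest to start + n_remove.
    later = zc[zc >= start]
    if len(later) == 0:
        raise ValueError("no zero crossing after the cut start")
    end = int(later[np.argmin(np.abs(later - (start + n_remove - 1)))]) + 1
    return np.concatenate([x[:start], x[end:]])


def scale_to_db_spl(x: np.ndarray, target_db: float = 70.0, p_ref: float = 2e-5) -> np.ndarray:
    """Rescale x so that its RMS equals target_db re p_ref (samples assumed in Pa)."""
    rms = np.sqrt(np.mean(x ** 2))
    if rms == 0:
        raise ValueError("cannot scale a silent signal")
    return x * (p_ref * 10 ** (target_db / 20.0) / rms)
```

Applied to a 400 ms token sampled at 22050 Hz, `scale_to_db_spl(trim_middle(x, 22050))` returns a token of about 5512 samples (250 ms) with an RMS level of 70 dB re 20 µPa.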

Experimental Procedure

The electroencephalogram (EEG) was recorded during an oddball paradigm in a single 2-hour session, concurrently with five similar experiments (not reported here). The experiment consisted of two blocks. One block contained 680 standard [v] tokens and 120 deviant [f] tokens, plus an additional 20 standards at the beginning of the block. Tokens were presented with a jittered ISI of 400–600 ms and pseudorandomized such that 2–10 standards occurred before each deviant. This design allowed us to run a large number of experiments (not reported here) on the same participants in a reasonable amount of time. A second block was identical except that [f] was the standard and [v] the deviant. The blocks were presented in random order. Subjects watched a muted film with English subtitles during the experiment and were offered a break after each block.
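One way to build a block that satisfies the stated constraints (20 lead-in standards, 680 further standards, 120 deviants, 2–10 standards before each deviant, 400–600 ms ISI) is sketched below. This is an illustration under assumptions, not the authors' randomization script: the run-length allocation scheme, the uniform ISI distribution, and the choice to end each block on a deviant (no trailing standards) are all hypothetical.

```python
import numpy as np


def oddball_sequence(n_deviants=120, n_standards=680, n_lead_in=20,
                     min_run=2, max_run=10, seed=None):
    """Build a block as a list of "S" (standard) and "D" (deviant) labels.

    The block opens with n_lead_in standards; every deviant is then preceded
    by a run of min_run..max_run standards, and the runs sum to n_standards.
    """
    if not n_deviants * min_run <= n_standards <= n_deviants * max_run:
        raise ValueError("run-length constraints cannot be satisfied")
    rng = np.random.default_rng(seed)
    runs = np.full(n_deviants, min_run)
    # Hand out the remaining standards one at a time to runs still below max_run.
    for _ in range(n_standards - n_deviants * min_run):
        open_runs = np.flatnonzero(runs < max_run)
        runs[rng.choice(open_runs)] += 1
    seq = ["S"] * n_lead_in
    for run in runs:
        seq.extend(["S"] * int(run) + ["D"])
    return seq


def jittered_isis(n, low_ms=400.0, high_ms=600.0, seed=None):
    """Draw n inter-stimulus intervals uniformly from [low_ms, high_ms)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(low_ms, high_ms, size=n)


if __name__ == "__main__":
    block = oddball_sequence(seed=1)
    print(len(block), block.count("D"))  # 820 tokens, 120 deviants
```

Assuming the ISI is measured from token offset to the next onset, one block of 820 tokens of about 250 ms each lasts roughly 820 × (0.25 + 0.5) s ≈ 615 s, or just over 10 minutes.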

EEG Acquisition and Preprocessing

EEG was continuously recorded from 34 active Ag/AgCl electrode positions (actiCAP, Brain Products) using a BrainAmp DC amplifier (Brain Products). The sampling rate was 1000 Hz, and data were filtered online from 0.1 to 1000 Hz. FCz served as the online reference and AFz as the ground. Interelectrode impedances were kept below 25 kΩ. Subjects were asked to sit still and avoid excessive eye movements.

Offline, data were re-referenced to the average of both mastoids and band-pass filtered at 0.5–30 Hz for each participant. The data were segmented into 700 ms epochs (−200 to 500 ms; 701 samples). The initial 20 standards, the first deviant in each block, and the first standard after each deviant were excluded from further analysis. Epochs were baseline-corrected using a 100 ms pre-stimulus interval. Epochs with voltages exceeding ±75 μV on any channel were removed from analysis. For each participant, at least 30 deviant trials per condition were retained. The MMN was calculated as an identity MMN, by subtracting the average ERP response to a stimulus presented as a standard from the average ERP response to the same stimulus presented as a deviant in the other block: e.g., the ERP to standard [f] from one block was subtracted from the ERP to deviant [f] from the other.
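The baseline correction, artifact rejection, and identity-MMN subtraction described above can be sketched with NumPy. The sketch assumes epochs that are already re-referenced, filtered, and stripped of the excluded trials, stored as arrays of shape (trials, channels, samples) in μV at 1000 Hz; the baseline window is taken as [−100, 0) ms, and all function names are hypothetical.

```python
import numpy as np


def baseline_correct(epochs, times_ms, start_ms=-100, end_ms=0):
    """Subtract each epoch's mean over [start_ms, end_ms) from every channel.

    epochs: array (n_trials, n_channels, n_times); times_ms: (n_times,)."""
    mask = (times_ms >= start_ms) & (times_ms < end_ms)
    return epochs - epochs[..., mask].mean(axis=-1, keepdims=True)


def reject_artifacts(epochs, threshold_uv=75.0):
    """Drop trials in which any channel exceeds +/- threshold_uv at any sample."""
    keep = np.all(np.abs(epochs) <= threshold_uv, axis=(1, 2))
    return epochs[keep]


def identity_mmn(deviant_epochs, standard_epochs, times_ms, min_trials=30):
    """Deviant-minus-standard ERP for the same stimulus taken from the two blocks."""
    dev = reject_artifacts(baseline_correct(deviant_epochs, times_ms))
    std = reject_artifacts(baseline_correct(standard_epochs, times_ms))
    if len(dev) < min_trials:
        raise ValueError(f"only {len(dev)} artifact-free deviant trials")
    return dev.mean(axis=0) - std.mean(axis=0)


# At 1000 Hz, the -200..500 ms epoch holds 701 samples (one per millisecond).
TIMES_MS = np.arange(-200, 501)
```

For the [f] identity MMN, `deviant_epochs` would hold [f] deviants from the [v]-standard block and `standard_epochs` the [f] standards from the other block.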

Statistical analysis of MMN amplitude was conducted via spatiotemporal cluster-based permutation tests (Maris and Oostenveld, 2007) over the 100–300 ms post-stimulus-onset time window (a broad window in which the MMN is expected to appear). This method identifies clusters of spatially and temporally adjacent data points that meet an arbitrary threshold of significance (p = 0.05) and then evaluates the significance of these clusters using a non-parametric permutation statistic. Although the MMN has a well-known time course and topography (Näätänen and Alho, 1997; Näätänen, 2001; Näätänen et al., 2007), this analysis reduces (but does not eliminate) researcher degrees of freedom in the choice of analysis window, as it allows main effects to be tested over a broad temporal and spatial window and makes use of all 31 channels in the analysis rather than only one.
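The logic of the cluster-based permutation test can be illustrated with a simplified, temporal-only sketch for a single channel (or a channel average). It is not the spatiotemporal analysis reported here, which also clusters across neighboring electrodes and therefore needs a channel adjacency structure (implemented, for example, in MNE-Python's `mne.stats.spatio_temporal_cluster_1samp_test`). Assumptions: the input is each subject's difference between the two MMNs, cluster mass is the sum of t values, the null distribution comes from random sign flips of subjects' difference waves, and the p value uses the common "+1" correction. For 24 subjects, the two-tailed α = 0.05 cluster-forming threshold is t(23) ≈ 2.069.

```python
import numpy as np


def one_sample_t(d):
    """t statistic across subjects (axis 0) at every time point."""
    n = d.shape[0]
    return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n))


def find_clusters(t, thresh):
    """Contiguous same-sign runs with |t| > thresh, as (start, stop, mass) tuples."""
    sign = np.where(t > thresh, 1, np.where(t < -thresh, -1, 0))
    clusters, i = [], 0
    while i < len(t):
        if sign[i] == 0:
            i += 1
            continue
        j = i
        while j < len(t) and sign[j] == sign[i]:
            j += 1
        clusters.append((i, j, float(t[i:j].sum())))  # mass = summed t values
        i = j
    return clusters


def cluster_permutation_test(d, thresh, n_perm=1000, seed=None):
    """Temporal cluster test on paired differences d (n_subjects, n_times).

    Null distribution: the largest |cluster mass| after randomly flipping the
    sign of each subject's difference wave (exchangeable under H0)."""
    rng = np.random.default_rng(seed)
    observed = find_clusters(one_sample_t(d), thresh)
    null_max = np.empty(n_perm)
    for k in range(n_perm):
        flips = rng.choice([-1.0, 1.0], size=(d.shape[0], 1))
        perm = find_clusters(one_sample_t(d * flips), thresh)
        null_max[k] = max((abs(m) for _, _, m in perm), default=0.0)
    return [(s, e, m, (np.sum(null_max >= abs(m)) + 1) / (n_perm + 1))
            for s, e, m in observed]
```

Because only the largest cluster from each permutation enters the null distribution, the test controls the family-wise error rate across the whole window without a separate correction for each time point.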

Results

Visual inspection of the data (see Figure 4) suggests that the two conditions are distinct and that deviant [v] evokes a greater MMN than deviant [f]. This asymmetry is consistent with the results of Hestvik and Durvasula (2016): our voiced deviant fricative [v] patterns with their voiced stop-vowel stimulus [d̥æ], and our voiceless [f] with their [tʰæ].

FIGURE 4

**Topographic maps and difference waves (at Fz) for deviant [f] (red) and deviant [v] (blue)**. Ribbons indicate difference-adjusted 95% Cousineau-Morey within-subjects intervals (these can be interpreted as follows: at a given time point, if neither condition’s difference-adjusted interval contains the other condition’s mean, then the difference between conditions is likely to be significant at α = 0.05, without correction for multiple comparisons). Horizontal lines on the difference waves indicate the average amplitude for the 51 ms window centered on the MMN peak.
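The intervals in Figure 4 build on the Cousineau-Morey within-subject interval, and the horizontal lines summarize a 51 ms window around the MMN peak. A minimal sketch of both quantities follows. It computes only the base Cousineau-Morey half-width and does not reproduce the additional difference adjustment applied in the figure; the 100–300 ms peak search window, the definition of the peak as the most negative point, and the function names are assumptions.

```python
import numpy as np


def cousineau_morey_halfwidth(y, t_crit):
    """Half-width of the Cousineau-Morey within-subject CI for each condition.

    y: (n_subjects, n_conditions) amplitudes at one time point.
    t_crit: critical t for the desired coverage (about 2.069 for 95%, n = 24)."""
    n, j = y.shape
    # Cousineau normalization: remove each subject's mean, restore the grand mean.
    normed = y - y.mean(axis=1, keepdims=True) + y.mean()
    se = normed.std(axis=0, ddof=1) / np.sqrt(n)
    # Morey correction for the variance lost through normalization.
    return t_crit * se * np.sqrt(j / (j - 1))


def peak_window_mean(wave, times_ms, search_ms=(100, 300), half_width=25):
    """Mean amplitude over the 51-sample window centred on the most negative point.

    Returns (mean_amplitude, peak_time_ms); assumes 1 sample per ms."""
    idx = np.flatnonzero((times_ms >= search_ms[0]) & (times_ms <= search_ms[1]))
    peak = idx[np.argmin(wave[idx])]
    lo, hi = max(peak - half_width, 0), min(peak + half_width + 1, len(wave))
    return float(wave[lo:hi].mean()), int(times_ms[peak])
```

With two conditions, the normalized scores for each condition equal half of each subject's condition difference (plus the grand mean), so the base interval reduces to t_crit × SD(difference) / (√2 × √n) for both conditions.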

The cluster-based permutation test revealed significant differences between the MMNs elicited by voiced and voiceless deviants. Voiced deviants elicited more negative MMNs than voiceless deviants (p < 0.001), based on a cluster of samples from 100 to 185 ms including 25 channels: Fp1, F3, Fz, F4, FC5, FC1, FC2, FC6, T7, C3, Cz, C4, CP5, CP1, CP2, CP6, P7, P3, Pz, P4, P8, O1, Oz, O2, and FCz. Voiced deviants also elicited a more positive later response than voiceless deviants (i.e., in the P3 wave following the MMN; p = 0.009), based on a cluster from 212 to 300 ms including 20 channels: Fp1, Fp2, F7, F3, Fz, F4, F8, FC5, FC1, FC2, FC6, C3, Cz, C4, T8, CP1, CP2, CP6, P4, and FCz.

Discussion

The results show an asymmetry between the MMN magnitudes observed for voiced [v] and voiceless [f], in line with a long-term encoding system in which the voiced segment is unmarked and the voiceless segment is marked, as indicated by the predicted asymmetric MMN patterns (cf. Eulitz and Lahiri, 2004; Scharinger et al., 2010, 2012; Cornell et al., 2011, 2013; de Jonge and Boersma, 2015; Schluter et al., 2016). Specifically, the more negative peak for the voiced deviant suggests that it is the voiceless sound that is marked for English fricatives, just as Hestvik and Durvasula (2016) found for English stops. These results may suggest that a single feature accounts for both English stops and fricatives and, following the feature-marking hypothesis of laryngeal realism, that this feature should be [SPREAD GLOTTIS]. Alternatively, contrary to the feature-marking hypothesis of laryngeal realism, the feature specification for English voicing may coincidentally be a universal marking. Cross-linguistic evidence is required to determine whether other languages use a [VOICE] feature in lieu of [SPREAD GLOTTIS]. Such evidence would be found if, in a language hypothesized to use [VOICE] rather than [SPREAD GLOTTIS] to distinguish a two-way voicing contrast, the voiced deviant elicited a smaller MMN than the voiceless one.

In the next experiment, we seek to replicate these English results at other places of articulation and to compare the asymmetry for English (an aspirating language) with that for Arabic (purportedly a voicing language). Because the functional two-way voicing contrast is realized by different articulatory means in these two languages (unmarked [VOICE] and marked [SPREAD GLOTTIS] in English; marked [VOICE] and unmarked [SPREAD GLOTTIS] in Arabic), a theory of distinctive features that posits a strong connection between articulatory detail and long-term distinctive feature representations would predict opposite patterns of MMN asymmetries in the two languages. If, however, the functional two-way contrast abstracts away from this level of phonetic detail, the patterns of MMN asymmetries are predicted to be similar across the two languages. We test these competing predictions with fricatives at two other places of articulation: dental ([s] and [z]) and interdental ([θ] and [ð]). Both places of articulation occur in Standard English and Emirati Arabic, allowing us to test whether the predicted asymmetries are robust across segments that vary in place of articulation.