Ease and Difficulty in L2 Phonology: A Mini-Review

A variety of phonological explanations have been proposed to account for why some sounds are harder to learn than others. In this mini-review, we review such theoretical constructs and models as markedness (including the markedness differential hypothesis) and frequency-based approaches (including Bayesian models). We also discuss experimental work designed to tease apart markedness versus frequency. Processing accounts are also given. In terms of phonological domains, we present examples of feature-based accounts of segmental phenomena which predict that the L1 features (not segments) will determine the ease and difficulty of acquisition. Models which look at the type of feature which needs to be acquired, and models which look at the functional load of a given feature are also presented. This leads to a presentation of the redeployment hypothesis which demonstrates how learners can take the building blocks available in the L1 and create new structures in the L2. A broader background is provided by discussing learnability approaches and the constructs of positive and negative evidence. This leads to the asymmetry hypothesis, and presentation of new work exploring the explanatory power of a contrastive hierarchy approach. The mini-review is designed to give readers a refresher course in phonological approaches to ease and difficulty in acquisition which will help to contextualize the papers presented in this collection.


INTRODUCTION
Why are some sounds harder to learn than others? A Japanese learner of English may have difficulty acquiring a novel L2 English [l]/[ɹ] contrast (Brown, 2000) but less difficulty acquiring a novel L2 Russian [l]/[r] contrast (Larson Hall, 2004; Matthews, 2000). A Brazilian Portuguese learner of English may have difficulty acquiring consonant clusters such as [sl], [sn] or [st] which are absent from the L1 (Cardoso, 2007), while a Persian learner of English who also lacks L1 [sl], [sn] and [st] may find them quite easy to acquire (Archibald and Yousefi, 2018). A Spanish learner of English may find it easier to acquire the [i]/[ɪ] contrast (which is absent from the L1) when learning Scottish English than British English (Escudero, 2002). There are also examples of so-called directionality of difficulty effects (Eckman, 2004). For example, an English learner of German might find it easier to suppress a final voicing contrast than a German learner of English would find it to learn to make a new L2 final voicing contrast. These are the types of facts researchers need to explain (the explanandum). In this short paper, I will provide an overview of some of the proposed phonological accounts (the explanans) of such cases of ease or difficulty.
We begin by asking what it means to have acquired a sound. To probe such a question from a phonological perspective means that we must tackle the question of contrast. Phonemes are used to represent lexical contrasts. Such contrasts must also be implemented phonetically in both production and perception. Given that L2 production and perception may well be non-nativelike, this raises the interesting question for the L2 phonologist of determining whether the individual is 1) producing an inaccurate representation accurately, or 2) producing an accurate representation inaccurately. A case of 1) would be an L2 learner who has the same representation for both /l/ and /r/ (i.e., does not make a phonemic liquid contrast) and who also merges the production of [l] and [r]. A case of 2) would be a learner who has a representational contrast for /b/ and /p/ (i.e., makes a phonemic VOT contrast) but does not implement the contrast in a nativelike fashion. Methodologically, this reveals that researchers (and teachers) cannot rely on inaccurate production as a diagnostic of non-nativelike representation.
This leads us to a related question concerning production vs. perception. Much work in L2 speech proceeds on the assumption that accurate perception must (logically and developmentally) precede accurate production (Flege, 1995). Thus, much of the literature focusses on assessing whether the subjects can discriminate phonetic contrasts reliably, and represent phonological contrasts accurately. However, there are certain cases where learners may be accurate in either production (Goto, 1971) or lexical discrimination (Darcy et al., 2012) tasks and yet remain inaccurate on discrimination tasks. In both cases, it may be that metalinguistic knowledge plays an important role.
Ever since the Contrastive Analysis Hypothesis (Lado, 1957), linguists have tried to predict which aspects of L2 speech would be easy or difficult to learn. Since the 50s, both the representational models of phonology and the learning theories have become more sophisticated, and this has led to a consideration of multiple factors in exploring the construct of difficulty. Such approaches stand in marked contrast to the models of cross-language speech production (Flege, 1995) and cross-language speech perception (Best and Tyler, 2007) which primarily invoke acoustic and articulatory factors to explain difficulty in acquisition.
In the field of second language acquisition (SLA), many factors have been explored to account for aspects of learner variation, including variation in nativelikeness of L2 speech. These include:
• L1 transfer (Trofimovich and Baker, 2006)
• amount of experience (Bohn and Flege, 1992)
• amount of L2 use (Guion et al., 2000)
• age of learning (Abrahamsson and Hyltenstam, 2009)
• orthography (Escudero and Wanrooij, 2010; Bassetti et al., 2015)
• frequency (Davidson, 2006)
• attention (Guion and Pederson, 2007)
• training (Wang et al., 2003)
It goes without saying that all of these factors come into play in accounting for learner behavior. What I will focus on in this mini-review are key representational issues which have informed phonological approaches to the construct of ease and difficulty.

REPRESENTATIONAL APPROACHES
This mini-review focuses on representational models of phonology. There is also a rich literature on output-based approaches (Tessier et al., 2013; Jesney, 2014), which tend to emphasize the computational system that generates the output form rather than the form of the underlying (or input) representation.

Markedness
Some have looked to the notion of markedness (Parker, 2012) as an explanation by suggesting that unmarked structures are easier to acquire than marked ones (Carlisle, 1998). For example, it could be argued that 3-consonant onsets (e.g., [str]) are more difficult to acquire than 2-consonant onsets (e.g., [tr]) because they are more marked. Even within 2-consonant sequences, work such as Broselow and Finer (1991) and Eckman and Iverson (1993) demonstrates that principles such as Sonority Sequencing instantiate markedness, with greater sonority distance between the adjacent segments being less marked (i.e., [pj] would be less marked than [fl]). Such machinery is designed to account for the observation that not all structures which are absent from the L1 are equally difficult to acquire in the L2. The developmental path would be from unmarked to marked structures.
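The sonority-distance comparison above can be made concrete with a small sketch. The numeric sonority scale below is an illustrative assumption (it is not taken from the works cited, which differ in their exact scales), and the function names are hypothetical; the only point is that a larger rise in sonority from the first to the second consonant counts as less marked.

```python
# Illustrative sonority scale (an assumption for this sketch, not from the
# cited sources); higher numbers mean more sonorous.
SONORITY = {
    "p": 1, "t": 1, "k": 1, "b": 1, "d": 1, "g": 1,  # stops
    "f": 2, "s": 2, "v": 2, "z": 2,                  # fricatives
    "m": 3, "n": 3,                                  # nasals
    "l": 4, "r": 4,                                  # liquids
    "j": 5, "w": 5,                                  # glides
}


def sonority_distance(onset):
    """Return the sonority rise across a 2-consonant onset such as "pj"."""
    first, second = onset  # unpack the two segments of the onset string
    return SONORITY[second] - SONORITY[first]


def less_marked(onset_a, onset_b):
    """Return the onset with the greater sonority distance, or None if tied."""
    dist_a = sonority_distance(onset_a)
    dist_b = sonority_distance(onset_b)
    if dist_a == dist_b:
        return None
    return onset_a if dist_a > dist_b else onset_b


if __name__ == "__main__":
    for onset in ("pj", "fl"):
        print(onset, sonority_distance(onset))
    print("less marked:", less_marked("pj", "fl"))
```

Under this toy scale, [pj] has the larger sonority rise and so comes out as the less marked onset, in line with the comparison in the text. A different scale could change the exact distances, so the numbers should not be read as empirical claims.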
Some have suggested that a markedness continuum is not enough, and that the markedness differential between the L1 and the L2 is the locus of explanation (Eckman, 1985). Under this approach, a structure which is absent from the L1 and more marked than the L1 structure would be difficult to acquire, while one which is absent from the L1 but less marked than the L1 structure would be easier to acquire.
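As a rough sketch of the logic just described, the prediction can be written as a comparison of two markedness values. The function name, the integer markedness values, and the output labels are all hypothetical conveniences for illustration; the hypothesis itself makes relative, not numeric, claims.

```python
def mdh_prediction(absent_from_l1, l2_markedness, l1_markedness):
    """Simplified reading of a markedness-differential prediction.

    Markedness is passed as integers purely for illustration
    (higher = more marked); these values are assumptions of the sketch.
    """
    if not absent_from_l1:
        return "already available from the L1"
    if l2_markedness > l1_markedness:
        return "difficult"
    if l2_markedness < l1_markedness:
        return "easier"
    return "no differential prediction"


if __name__ == "__main__":
    # Toy coding of the final voicing example from the Introduction:
    # maintaining a word-final voicing contrast is coded as more marked (1)
    # than lacking one (0).
    print("German learner of English:", mdh_prediction(True, 1, 0))
    print("English learner of German:", mdh_prediction(True, 0, 1))
```

With this toy coding, the German learner of English faces the more marked target and is predicted to have more difficulty, which matches the directionality of difficulty effect mentioned in the Introduction.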
Often, however, the unmarked forms are the most frequent (e.g., 3-consonant clusters are more marked than 2-consonant clusters, and 3-consonant clusters are also less frequent than 2-consonant clusters), so it is difficult to tease these factors apart. If learners are more accurate on 2-consonant clusters, is it because they are more frequent or less marked?

Frequency-Based Approaches
Usage-based (Wulff and Ellis, 2018) and Bayesian (Wilson and Davidson, 2009) approaches argue that targetlike production accuracy is correlated with input frequency. Thus, if there are two elements which are absent from the L1 and one is frequent in the L2 input while one is infrequent, then the frequent structure might be more easily acquired. Cardoso (2007) documents a scenario in which the most frequent structure is also the most marked, so we can tell which construct is more explanatory. In looking at the acquisition of L2 English consonant clusters by L1 speakers of Brazilian Portuguese, he focused on [st], [sn] and [sl]. Without getting into the details of the markedness facts here, [st] is both the most frequent and the most marked of the clusters. When it came to learner production, the learners were least accurate on the most marked cluster ([st]) even though it was the most frequent in their input. For production (though not perception), markedness seemed to be more explanatory than frequency.

Frequency Versus Markedness
The construct of markedness itself has its critics (Haspelmath, 2006; Zerbian, 2015). If the notion is ill-defined (is it a measure of complexity, difficulty, or abnormality?), then how can it be a valid explanans? Responding to Archibald (1998), who suggested that positing markedness as an explanation (rather than a description) only bumped the explanation problem back a generation (because what explains markedness?), Eckman (2008: 105) counter-argues that, "to reject a hypothesis because it pushes the problem of explanation back one step misses the point that all hypotheses push the problem of explanation back one step – indeed, such 'pushing back' is necessary if one is to proceed to higher level explanations."

Processing Accounts
While more work has been done on the role of the processor in morpho-syntax in SLA (O'Grady, 1996;O'Grady, 2006;Truscott and Sharwood Smith, 2004), Carroll (2001) explores the role of the phonological parser in mapping the acoustic signal onto phonological representations. Carroll (2013) addresses these questions in initial-state L2 learners empirically. There has also been some work done on L2 phonological parsing at the level of the syllable (Archibald, 2003;Archibald, 2004;Archibald, 2017) which suggests that structures which can be parsed are easier to acquire than structures which the parser cannot yet handle.
Such models intersect with the perception literature insofar as the L2 acoustic input is filtered by the L1 phonological system (Pallier et al., 2001). In turn, such perceptual shoe-horning can lead to activation of phantom lexical competitors (Broersma and Cutler, 2007) which may slow lexical activation.
The notion that only some input can be processed at any given time, thus leading to the intake to the processor being a subset of the environmental input, is well-studied in applied linguistics (Corder, 1967; Schmidt, 1990). What has proved more elusive is explaining when input becomes intake (and when it does not). Certainly one of the challenges is avoiding circularity of the following sort:
Q: Why is x produced/perceived accurately before y?
A: Because it became intake.
Q: How do you know it became intake?
A: Because it was produced/perceived accurately.
Processing accounts are not necessarily independent of abstract phonological studies, as they have also been important in documenting the viability of abstract phonological features (Lahiri and Reetz, 2010; Schluter et al., 2017). Features can be explanatory when we note classes of sounds behaving in a similar fashion, for example, only nasals being allowed in syllable codas in a given L1. Thus difficulty may arise when learners with such an L1 attempt to parse L2 stops into a coda. Note that the difficulty would affect, say, [p t k] as a class of voiceless stops.

Representational Accounts
Theories of phonological representation help us to model both synchronic and diachronic aspects of L2 phonological grammars. Özçelik (2016) addresses the general question of developmental path in L2 grammars (a fundamental concern of the field as we try to develop a transition theory). He proposes a cue-based model which clarifies which structural properties (i.e., parameters) are logical precursors to the acquisition of subsequent parameters. Özçelik and Sprouse (2016) demonstrate that interlanguage grammars are constrained by phonological universals (such as the behavior of feature spreading).
Feature-based models (Brown, 2000) can be contrasted with segment-based models (Flege, 1995). A segment-based model might say that a new segment will be difficult to acquire based on a comparison of the L1 and L2 phonetic categories. A feature-level account would argue that new L2 contrasts which are based on distinctive features that are absent from the L1 would be difficult, while new contrasts based on L1 features would be easy. Brown (2000) showed that Korean learners of English could acquire new contrasts if the contrasts were based on an existing L1 feature (e.g., [continuant]) while L2 contrasts which were not based on L1 features (e.g., [distributed]) were more difficult to acquire. LaCharité and Prévost (1999) suggest that this was too strong an approach and that some features which were absent (i.e., terminal nodes) would be acquirable while others (i.e., articulator nodes) would not, as shown in (1).
The features in boldface are the ones which are absent from the L1 French inventory. They predict that the acquisition of L2 English [h] will be more difficult than the acquisition of [θ] because [h] requires the learner to trigger a new articulator node. On a discrimination task, the learners were significantly less accurate identifying [h] than identifying [θ]; however, on a word identification task (involving lexical access) there was no significant difference between the performance on [h] vs. [θ].
Özçelik and Sprouse (2016), however, show that L2 learners are able to acquire the features of secondary articulations (e.g., palatalized consonants). Hancin-Bhatt (1994) proposed that the functional load of a particular feature in implementing a contrast in a language would determine its weighting (with features with high functional load predicted to have greater cross-linguistic influence than those with low functional load). Archibald (2005) proposed the Redeployment Hypothesis in which it would be easier to acquire new L2 structures which could be built from existing L1 building blocks (e.g., features, or moras) than to acquire new building blocks. In some ways, this approach presages Lardiere's (2009) Feature Reassembly Hypothesis which looks to account for the difficulty that L2 learners have acquiring L2 morphology.
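The feature-based prediction attributed above to Brown (2000), and the building-block logic of redeployment, can both be sketched as a simple set comparison: a new L2 contrast is predicted to be easier when every feature it relies on is already active in the L1. The L1 feature set and the function name below are hypothetical placeholders chosen for illustration; they are not an analysis of any particular language.

```python
def predict_contrast_difficulty(required_features, l1_features):
    """Return a (prediction, missing_features) pair for a new L2 contrast.

    The contrast is predicted to be "easier" if all the features it depends
    on are already present in the L1, and "harder" otherwise.
    """
    missing = sorted(set(required_features) - set(l1_features))
    prediction = "easier" if not missing else "harder"
    return prediction, missing


if __name__ == "__main__":
    # Toy L1 feature inventory (an assumption of this sketch).
    toy_l1 = {"continuant", "voice", "nasal"}

    # A contrast that can be built from an existing L1 feature.
    print(predict_contrast_difficulty({"continuant"}, toy_l1))

    # A contrast that depends on a feature absent from the toy L1.
    print(predict_contrast_difficulty({"distributed"}, toy_l1))
```

The refinement proposed by LaCharité and Prévost (1999) would need a richer representation than a flat set, since it distinguishes missing terminal nodes from missing articulator nodes; the flat set is kept here only for brevity.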
One example of redeployment is evidenced in the L2 acquisition of Japanese geminate consonants by L1 English speakers. Japanese geminate consonants have the moraic structure shown in (2).
English does not have geminate consonants, but does have a weight-sensitive stress system, shown in (3), where coda consonants project moras which attract stress to heavy syllables.
Thus, the English quantity-sensitive system can be redeployed to acquire L2 geminates. The corollary to this would be that L2 structures which could not be built from L1 components would be more difficult to acquire. Cabrelli et al. (2019), looking at Brazilian Portuguese learners of English coda consonants, also demonstrate that L2 learners can restructure their phonological grammars insofar as the L2 learners are licensing coda consonants which are not found in the L1. Carlson (2018) found similar effects in L1 Spanish. Garcia (2020) describes an interesting case where a property of the L2 (stress placement) which could be acquired on the basis of transferring an L1 property of weight-sensitivity is, in fact, difficult to acquire because another property of the L1 is able to account for the L2 data, and this property (positional bias) is more robust in the L2 input.
Darcy et al. (2012) present data which show, contra Flege (1995), that some learners who were able to lexically represent a contrast were unable to accurately discriminate it. Their model is known as DMAP, which stands for direct mapping of acoustics to phonology. The basic empirical finding which they report is a profile in which L2 learners of French (with L1 English, which lacks /y/) can distinguish lexical items which rely on a /y/-/u/ distinction while simultaneously being unreliable in discriminating [y] from [u] in an ABX task. Detection of acoustic properties can lead to phonological restructuring (according to general economy principles of phonological inventories) which will result in a lexical contrast, but the phonetic categories may not yet be targetlike. The learners rely on their current interlanguage feature hierarchy to set up contrastive lexical representations even as phonetic category formation proceeds.
This is reminiscent of the Goto (1971) study where Japanese learners were able to produce an /l/-/r/ liquid contrast even while not being able to discriminate between them in a decontextualized task. It could be that the tactile feedback received in the production of these two sounds, and the orthographic distinction between "l" and "r" were able to cue the learners' production systems. This sort of metalinguistic knowledge can affect production. Davidson and Wilson (2016) extend a body of research which documents L2 learners' sensitivity to non-contrastive phonetic properties (which might account for occurrences of prothesis vs. epenthesis in cluster repair) to look at learner behavior in the classroom. While subjects listening in a classroom (compared to a sound booth) showed some differences (e.g., less prothesis repair), by and large the performance was very similar. This suggests that laboratory research may well have quite direct implications for classroom learners.

Learnability and L2 Phonology
Learnability approaches (Wexler and Culicover, 1980; Pinker, 1989; White, 1991) argued that learning would be faster when there was positive evidence that the L1 grammar had to change, while change that was cued only by negative evidence would be acquired more slowly. Positive evidence is evidence in the linguistic environment of well-formed structures. Negative evidence is evidence given to the learner that a particular string is ungrammatical. It would be easier to move from an L1 which was a subset of the L2 (because there is positive evidence to indicate that the current grammar is incorrect) than it would be to move from an L1 which was a superset of the L2 (as this would require negative evidence). Consider the example of L1 English and L2 Hungarian as shown in Figure 1. Hungarian secondary stress (Kerek, 1971) is quantity-sensitive to the Nucleus, meaning that only branching nuclei (i.e., long vowels (CVV)) are treated as Heavy but not branching Rhymes (i.e., closed syllables (CVC)). English stress is quantity-sensitive to branching nuclei and branching rhymes.
If your L1 treated long (i.e., bimoraic) vowels (CVV) and closed syllables (CVC) as heavy (as English does) but the L2 only treated long vowels as heavy, then it might take a while for the learner to hypothesize "wait, I've never heard a secondary stress on a closed syllable!". But L1 Hungarian to L2 English would have clear positive evidence when the learner hears stress placed on a closed syllable (as in agénda). An English learner of Hungarian would have to notice that Hungarian never stresses closed syllables. Dresher and Kaye (1991) argued that when the data reveal that closed syllables and branching nuclei behave the same with respect to stress assignment, this is the universal cue for the system to be quantity-sensitive to the rhyme. See Archibald (1991) for further discussion and empirical investigation.
Young-Scholten's (1994) Asymmetry Hypothesis predicts that if an L2 phonological rule applies in a prosodic domain that is a superset of the L1 phonological domain, then the positive evidence will make it easier to acquire. However, when the target domain is smaller than the L1 domain, the lack of positive evidence will make acquisition more difficult. In English, the rule of flapping applies within a phonological utterance (e.g., Don't sit on the mat [ɾ], it's dirty.). German has a rule of final devoicing which applies within a phonological word (e.g., Ich ha[b]e ∼ Ich hab[p]). So, English learners of German are predicted to have difficulty acquiring phonological patterns which are licensed only in smaller phonological domains.
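The subset/superset reasoning behind the English and Hungarian stress example above can be sketched with sets of syllable types that each grammar treats as heavy. This is a minimal illustration under stated assumptions: the dictionary contents simply restate the description in the text, while the function name and the output labels are hypothetical.

```python
# Syllable types treated as heavy for stress, as described in the text.
HEAVY_TYPES = {
    "English": {"CVV", "CVC"},   # branching nuclei and branching rhymes
    "Hungarian": {"CVV"},        # branching nuclei only
}


def evidence_needed(l1, l2):
    """Classify the kind of evidence needed to move from the L1 to the L2 setting."""
    l1_heavy = HEAVY_TYPES[l1]
    l2_heavy = HEAVY_TYPES[l2]
    if l1_heavy == l2_heavy:
        return "none"
    if l1_heavy < l2_heavy:
        # L1 is a proper subset of the L2: the input contains forms the
        # current grammar cannot generate, so positive evidence suffices.
        return "positive"
    if l1_heavy > l2_heavy:
        # L1 is a proper superset of the L2: the learner must notice an
        # absence, which calls for negative (or indirect) evidence.
        return "negative"
    return "mixed"


if __name__ == "__main__":
    print("Hungarian -> English:", evidence_needed("Hungarian", "English"))
    print("English -> Hungarian:", evidence_needed("English", "Hungarian"))
```

The same comparison could in principle be applied to prosodic domains, as in the Asymmetry Hypothesis, by replacing the sets of heavy syllable types with nested domains; that extension is not attempted here.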
In addition to positive evidence or direct negative (i.e., correction) evidence, however, Schwartz and Goad (2017) have demonstrated that indirect positive evidence can play a role in second language learning where the L2 is a subset of the L1. In this case, L2-accented English was shown to be a source of evidence for some subjects as to the phonotactics of Brazilian Portuguese.
There is one area which is just starting to be explored in L2 phonology, and that is Dresher's (2009) contrastive hierarchy as an explanatory tool for ease and difficulty. Dresher's model suggests that L2 features which are active (i.e., involved in many phonological processes in the language) will be easier to learn than L2 features which are inactive, due to the type of evidence they present to the learner. Active features provide robust cues to the learner that a given feature must be highly ranked in a contrastive hierarchy, and are, therefore, evidence to restructure the L1 hierarchy. Archibald (2020) has explored this model in an analysis of L3 phonological systems. Such a mechanism is reminiscent of Hancin-Bhatt's (1994) notion of how functional load defines featural prominence.

CONCLUSION
What I have attempted to show in this mini-review is that there is a rich history in addressing the question of ease vs. difficulty in L2 phonology. I hope that this overview will provide useful background to the readers of this collection. Unsurprisingly, there is no easy answer to the difficult question of ease vs. difficulty.

AUTHOR CONTRIBUTION
I am the sole author of this piece.

ACKNOWLEDGMENTS
I would like to thank the reviewers for this piece. Their keen eye for clarity and accuracy has greatly improved this mini-review. But I have to say that the friendly, supportive scholarly exchange was as enjoyable as it is rare.