Length of Utterance, in Morphemes or in Words?: MLU3-w, a Reliable Measure of Language Development in Early Basque

The mean length of utterance (MLU), which was proposed by Brown (1973) as a better index for language development in children than age, has been regularly reported in case studies as well as in cross-sectional studies on early spontaneous language production. Despite the reliability of MLU as a measure of (morpho-)syntactic development having been called into question, its extensive use in language acquisition studies highlights its utility not only for intra- and inter-individual comparison in monolingual language acquisition, but also for cross-linguistic assessment and comparison of bilinguals' early language development (Müller, 1993; Yip and Matthews, 2006; Meisel, 2011). An additional issue concerns whether MLU should be measured in words (MLU-w) or morphemes (MLU-m), the latter option being the more difficult to gauge, since challenges arise regarding how to count zero morphemes and suppletive and fused morphemes. The different criteria have consequences, especially when comparing development in languages with diverging morphological complexity. A variant of MLU, the MLU3, which is calculated from the three longest sentences produced (MLU3-w and MLU3-m), is included among the subscales of expressive language development in CDI parental reports (Fenson et al., 1993, 2007). The aim of the study is to investigate the consistency and utility of MLU3-w and MLU3-m as measures of (morpho-)syntactic development in Basque, an agglutinative language. To that end, cross-sectional data were obtained using either the Basque CDI-2 instrument (16- to 30-month-olds) or the Basque CDI-3 (30- to 50-month-olds). The results of analyzing reports on over 1,200 children show three main findings. First, MLU3-w and MLU3-m can report equally well on very young children's development. 
Second, the strong correlations found between MLU3 and expressive vocabulary in the Basque CDI-2 and CDI-3 instruments, as well as between MLU3 and both nominal and verbal morphology scales, confirm the consistency not only of MLU3 but also of the two Basque CDI instruments. Finally, both MLU3-w and MLU3-m subscales appear sensitive to input after age 2, which emphasizes their utility for identifying developmental patterns in Basque bilinguals.


INTRODUCTION

Mean Length of Utterance: MLU and MLU3
How to measure language complexity is a question that has occupied linguists in a longstanding debate. Some authors maintain that since all languages are learnable by any child, they must have the same degree of complexity. In this regard, cross-linguistic differences found in complexity in each language component are believed to be the result of a compensation system, so that languages showing very high complexity in one particular domain are expected to have less complexity in other domains and vice-versa. In addition, the observation that, synchronically, many languages with low complexity in morphology have a rigid word order or a more complex phonological system than languages with complex morphology may support that assumption. However, counter-evidence has also been provided by scholars denying any theory-internal reason to predict similar degrees of complexity in all natural languages. See Newmeyer (2017) and Newmeyer and Preston (2014) for an overview of the debate.
The issue of language complexity piqued early language acquisition researchers' interest as early as the beginning of the twentieth century. Such is the case of, for example, Nice (1925), who regarded average sentence length as "the most important single criterion for judging a child's progress in the attainment of adult language" (Rice et al., 2010). In a similar vein, five decades later, Roger Brown passionately defended his Mean Length of Utterance or MLU, which proved to be one of the most commonly mentioned indexes of constructional complexity in child language by the end of the century: ". . . The MLU is an excellent simple index of grammatical development because almost every new kind of knowledge increases length: the number of semantic roles expressed in a sentence, the addition of obligatory morphemes, coding modulation of meaning [. . .] and, of course, embedding and coordinating. All alike have the common effect on the surface form of the sentence of increasing length (especially if measured in morphemes, which includes bound forms like inflections rather than words)" (Brown, 1973, pp. 53-54).
Brown considered MLU to be a more suitable index than age to compare individuals' development, since it permits identifying "on internal grounds" children who are "at the same level of constructional complexity" but who may not be "of the same chronological age" (Brown, 1973, p. 55).
In addition to the MLU calculated from the sentence sample uttered in a recording session, Brown regarded the upper bound or the longest sentence produced at a specific age as a relevant additional index to measure the attained grammar complexity of children. Thus, he established a sequence of five stages in children's earliest morphosyntactic development based on the two indexes: MLU and upper bound. Both values increased with age in the three longitudinal corpora analyzed (Eve, Adam, and Sarah). Each stage was associated with the child's productive use (at least in 90% of the contexts in which they are required) of some linguistic structures, and individual differences were observed in the age at which each child reached the various stages. For instance, Eve attained stage V at 2;2 years, whilst at that age Adam's and Sarah's MLU values around 2 indicated stage II. In Table 1 we have combined data which Brown presented separately: the target values of MLU and upper bound corresponding to each stage and the age ranges of the three children studied longitudinally at the different stages. The variability in age is evidenced by the large age ranges across stages displayed in column 4.
Despite the advantages of an index other than age to compare children's linguistic development, Brown still pointed out some limitations, starting from Stage V onwards. He argued that, at that stage, children's varied linguistic productions and their MLU begin to depend more on the nature of the interaction than on what children know (Brown, 1973, p. 54).
Brown's view of complexity is not related to any specific language component such as semantics or morphology. It is based on the assumption that the acquisition of components such as x and y alone does not immediately, or even relatively quickly, lead to the acquisition of the construction x + y that combines the two. Consequently, in his cumulative sense of complexity, "construction x + y may be regarded as more complex than x or y because it involves everything involved in either of the constructions alone, plus something more" (Brown, 1973, p. 400). This lack of precision is probably what led researchers to question MLU's appropriateness as a measure of morphosyntactic development. Bickerton (1991), for instance, suggested that qualitative aspects of syntactic development cannot be directly evaluated, since the increase in length of utterances does not necessarily imply an increase in syntactic complexity. In fact, similar or higher MLU values (1a-c) may correspond to utterances with a lower morphosyntactic complexity, which is the case with the coordinated structures in (1a) as compared to S-V agreement examples in (1b) or the embedding structures in (1c).
Frontiers in Psychology | www.frontiersin.org
increases, some sort of increase in complexity is bound to occur, but there is no a priori reason why the increase should take only the forms it does, and, in particular, that these forms should be the same for all children studied, whatever the language in question" (Brown, 1973, pp. 64-65). Additionally, issues such as how to measure children's achieved linguistic complexity and whether the same degree of complexity should be assumed at a particular stage cross-linguistically or across individuals acquiring a particular language have not received a convincing and generally accepted answer yet. However, the generalized acquisition order of 14 inflectional markers in English established by Brown, which was confirmed in later longitudinal studies, reinforces the supposition of some pattern in morphosyntactic development which goes beyond the aforementioned individual variability. Despite MLU being originally "invented for English," Brown was still aware of its utility in other languages for cross-linguistic comparison, once some adjustments were made: "Studies of highly inflected languages [. . .], all report some difficulty in adapting our rules of calculation, invented for English, which is minimally inflected, to their languages. What I have used is, in each case, the author's choice of the linguistically most reasonable value" (Brown, 1973, p. 68). Actually, many longitudinal case studies conducted in typologically distant languages have provided relevant results regarding the specific structures which arise in children's spontaneous production at each specific developmental stage. Besides, MLU has been used in cross-sectional studies comparing early bilingual children's development in their two languages (Marchman et al., 2004; Meisel, 2011; Thordardottir, 2011; Hoff et al., 2014) as well as typical vs. atypical language development (Johnston, 2001; Rice et al., 2010; Wieczorek, 2010).
In his seminal 1973 book, Brown devoted part of the introductory section to describing and discussing the set of rules for calculating MLU and upper bound in spontaneous production corpora. Here are the most relevant ones: (a) a subsample is required to calculate MLU in a longer sample gathered at some specific developmental stage. However, not every utterance can be equally reliable in the sample: 100 utterances should be taken from the fully transcribed utterances, starting at the second transcription page rather than from the first minutes of the conversation; (b) stuttering or repeated attempts to produce some words or utterances are counted once, in the most complete form used. This rule may avoid underscoring due to the selection of non-representative items of the child's (real) linguistic performance in constructional complexity; (c) fillers such as umm are not counted, in contrast to no, yeah, hi, which are included in the counting; (d) inflectional morphemes (plural, genitive, 3rd singular present -s, and so on) are counted as separate morphemes and inflected auxiliaries are counted as mono-morphemic words, as are compounds, for example, birthday. In our opinion, such counting criteria appear as an intermediate option between counting words and morphemes. However, such a counting system, together with the specific properties of English morphosyntax (a limited inventory of inflectional person and plural markers, low word complexity) and the scarcity of inflectional markers in children's early productions, may lead one to predict no great difference in measuring English child utterance length in words or in morphemes. In contrast, in languages with a certain degree of morphological complexity, like Basque, many researchers are in favor of measuring morphosyntactic development in morphemes rather than in words (Idiazabal, 1991; Barreña, 1995; Ezeizabarrena, 1996; Elosegi, 1998; Larrañaga, 2000; Larrañaga and Guijarro-Fuentes, 2012a).
Nonetheless, the high (almost perfect) intralinguistic correlations between the two ways of calculating MLU found in such typologically distant languages as Spanish (Aguado, 1995; Jackson-Maldonado and Conboy, 2007), Irish, Icelandic and Dutch (see Parker and Brorson, 2005 and references therein) indicate that MLU-m may not necessarily be a better measurement than MLU-w. In contrast to authors who have suggested the higher usefulness of MLU-w because of the ease of calculating it, Wieczorek (2010) has questioned whether MLU-w and MLU-m can be regarded as similar indicators of morphosyntactic development simply because of the high correlations attested cross-linguistically. According to this researcher, MLU-w is related to lexical development rather than to grammatical development and therefore, the opposite is expected to be the case for MLU-m, which should show a stronger relation to grammatical rather than lexical development. A third way of calculating MLU, in syllables (MLU-s), has also been explored in Irish (Hickey, 1991) and in Inuktitut (Allen and Dench, 2015). Surprisingly, MLU-s, which a priori would not be considered an index of grammatical development per se, or at least not in every language, also correlates with the previous indexes. The high correlations attested across languages between the different types of MLU may indirectly cast doubt on the "equivalence" of all of them as measures of language development, although determining exactly what the different variants of MLU measure in each language goes far beyond the aim of the current study.
Apart from the several ways of counting MLU, another objection to the use of MLU is the subjectivity present throughout the different steps preceding its calculation. To start with, MLU is sensitive to event and exchange patterns, situational variability and conversational dominance in a bilingual child, which may cause the sample collection on a particular date or conversational situation not to be the best example of the child's regular linguistic use (see Johnston, 2001 and references therein). Thus, counting all the sentences in a session or selecting the (50?, 100?, more?) utterances from the first, intermediate or final part of a two-hour recorded conversation may result in a different MLU value of a child's production at a particular age. Moreover, criteria for calculating MLU vary across studies, such as in the case of MLU vs. alternate MLU measures (Johnston, 2001), or of measuring MLU in words (MLU-w), morphemes (MLU-m) or syllables (MLU-s). Finally, subjectivity is present in the process of transcribing and coding oral speech in general, a task which "relies on the accuracy of the transcriber" (Rollins et al., 1996) and in the process of segmenting utterances. Segmenting words and especially morphemes in an utterance arises as the next complication in the process, where decisions regarding null morphemes, multimorphemic words such as portmanteaux, compounds and so on need to be made before starting with the analysis. Otherwise the variability found in children's spontaneous productions may lead to quite diverging value assignments to the same utterance. In order to regulate the subjectivity inherent in the processes mentioned above, single individuals are put in charge of the segmentation task of a whole set of recordings or of a sample collection, and further interjudge reliability rates are established on their codifications.
Despite the objections discussed earlier, MLU has still been extensively used in both intra- and inter-individual comparative studies. This is the case of, for instance, studies on language dominance which compare bilinguals' development in their two languages. On the assumption that length of utterances across languages may vary more depending on the unit in which its calculation is based, MLU-m has been proposed as a better measure for bilinguals' individual interlinguistic comparison in language pairs such as Basque-Spanish (Meisel, 1994; Ezeizabarrena, 1996; Elosegi, 1998; Larrañaga, 2000; Larrañaga and Guijarro-Fuentes, 2012a, etc.), whilst studies on French-German bilinguals (Meisel, 1991; Müller, 1993; Müller and Kupisch, 2003; Kupisch, 2008; Schmeiser et al., 2016) or English-Mandarin bilinguals (Yip and Matthews, 2006) and even some on Spanish-Basque (Larrañaga and Guijarro-Fuentes, 2012b) have opted for MLU-w. See also Hickey (1991), who considers that MLU's utility for cross-linguistic comparison cannot be generalized even intraindividually.
Despite criticisms, MLU, in its different modalities, remains a highly relevant index of morphosyntactic development in longitudinal corpora of spontaneous language production, and the inclusion of some versions of it in assessment instruments confirms this fact. Such is the case of MLU3, included in the MacArthur-Bates Communicative Development Inventories (CDI) instrument (Fenson et al., 1993, 2007), a parental questionnaire designed to obtain normative data which may allow researchers to assess both typically and atypically developing children. The MLU3 is a combination of the two indexes on which Brown's 5-stage classification was based (mean length of utterance and the upper bound). Yet MLU3 has the particularity that the mean length is calculated based on the child's three longest recently-produced sentences according to their parents, instead of on a specific sample of child utterances gauged by a researcher in a longitudinal corpus.
Studies on early bilingualism using this measurement have concluded that MLU3 values are sensitive to the amount of a child's exposure to the language. Bilinguals, who by definition have less exposure to their language(s) than monolinguals, have shown lower values than their age-matched monolingual counterparts (1;10-2;6: Hoff et al., 2012, 2014). More specifically, the results from Spanish-English bilingual groups, which were distinguished according to their higher, balanced and lower exposure to the language, revealed that the less input bilinguals had received in the language under study, the lower the scores they obtained in MLU3 values (Hoff et al., 2012).

Utterance Length in Basque
From the genetic point of view, Basque is unrelated to any other known language; that is, it is a language isolate. Typologically, Basque is a null subject, ergative language with non-rigid SOV word order, a language with very rich nominal and verbal inflection (case marking, person and number subject-, direct object- and indirect object-agreement marking in the verb), with a predominantly agglutinative morphology and affixed postpositions. As a result, most nominal and verbal words comprise two or more morphemes (2a-c), which makes utterance length diverge, depending on whether it is measured in words (1, 1, and 4 w) or morphemes (2, 4, and 8 m) in (2a), (2b) and (2c), respectively.
(2c) ... Aux.S3s (4 w, 8 m) 'Jon will come with the dolly'
However, not all morphemes are counted as productive morphology in early child productions. Following Brown's (1973) proposal of counting productive (non-rote learned) words and morphemes and taking into account both the specific morphosyntactic properties and the characteristics of earliest productions in Basque, Idiazabal (1991) established the first list of rules to calculate MLU-m in Basque, which were followed in later longitudinal case studies (Barreña, 1995; Ezeizabarrena, 1996; Elosegi, 1998; Almgren, 2000; Larrañaga, 2000). According to these rules, the diminutive suffix -txo is not counted as a morpheme in very frequent diminutive words in child and child-directed speech such as ama-txo "mumm-y" and aita-txo "dadd-y" (1 w / 1 m) but, on the other hand, -txo is counted as a morpheme in the rest of the few remaining words that include it (2a-c). Moreover, the -Ø morpheme is not counted, and the -a ending, which is translated as Det(erminer) in the (2a, 2b) glosses, is not counted as a morpheme either. 
There are several reasons for not counting this -a ending, which is suffixed to the nominal phrase rather than to the noun, as a (productive) morpheme: (a) many lexical roots having an organic -a ending do not modify their phonology when the determiner -a is suffixed (musika "music/music-Det"), (b) overtly determined roots like etxe-a "house-Det" cannot always be considered as such, since they can be used to respond to the question, "how do you say... house in Basque?", where no determined nouns are expected; and (c) in early child Basque the nominal -a ending acts as an unanalyzed word boundary, rather than as a grammatical element, as seen in examples like bestea umea instead of beste umea "other child," attested in several longitudinal samples (Barreña and Ezeizabarrena, 1999).
The Basque-speaking community of roughly one million speakers mostly comprises people who grew up in Basque-speaking families and acquired Basque as their L1 (either simultaneously or alongside Spanish or French, successively) and early L2 speakers who, growing up in almost monolingual Spanish or French families, are exposed to Basque very early (from age 2 or 3 onwards) through the educational system. Another group of late L2 speakers acquired the language through adult training courses. Sociolinguistic surveys conducted in 2006 with the population older than 15 years of age in the Basque Country described the following distribution of linguistic profiles: 15.4% passive bilinguals, 25.7% active bilinguals and 58.9% French or Spanish monolinguals. Further census surveys conducted in the Basque Autonomous Community, the region in which most of the current sample was collected, concluded that 39% of the 5- to 9-year-old population had Basque exclusively or together with Spanish as their home language (Basque Government, 2009). Consequently, most L1 Basque-speaking children are exposed to different degrees of Spanish (or French) input, and this is also the case of the participants of our study.

Aims and Predictions
The current paper investigates the reliability of the MLU3 scales, as compared to other scales of the Basque CDI, for assessing early language development in that agglutinative language. To that end, it provides data on 16- to 50-month-old children obtained using the Basque versions of the MacArthur-Bates CDI parental questionnaires.
In a language community such as the Basque-speaking one, in which being bilingual is the norm rather than the exception, the assumption that monolingual data are the best reference for "typical development" does not hold, and consequently, only instruments which are sensitive to the amount of exposure to the language(s) can accurately assess early bilingual language development. Therefore, a further study conducted with a subsample of over 1,200 18- to 48-month-olds' MLU3-w and MLU3-m scores will analyse those measurements' sensitivity to two variables, chronological age and (relative) amount of exposure to the Basque language, with the aim of checking the MLU3 subscales' utility in that particular context. Three predictions can be stated in this regard:
1. MLU3 scales will be as sensitive as the rest of the scales in the Basque CDI instrument to detect children's developmental changes as found in previous studies, and will reflect development in morphological complexity (Fenson et al., 1993, 2007).
2. Taking into account the morphosyntactic properties of an agglutinative language with rich morphology, such as Basque, MLU3 measured in morphemes will prove to be more discriminative than the MLU3 measured in words.
3. Input quantity will affect children's expressive language. Hence, differences in length of utterance are expected among bilinguals, depending on children's relative amount of exposure to Basque, as widely reported in early bilingual research (Marchman et al., 2004; Meisel, 2011; Thordardottir, 2011; Hoff et al., 2014).

Instruments
The MacArthur-Bates Communicative Development Inventory (CDI) instrument is a parental questionnaire used to gather information regarding children's language use. Different versions of the instrument have been developed, designed for different age ranges (CDI-1 for 8-15 months, CDI-2 for 16-30 months, and CDI-3 for 30-50 months) and for different purposes such as screening (short CDI-1 and CDI-2) or clinical diagnosis and research (full CDI-1 and CDI-2 questionnaires) (see Fenson et al., 2000). The CDI-1 is the only instrument which includes vocabulary comprehension in addition to expressive vocabulary and grammar. In contrast, CDI-2 and CDI-3 are oriented to expressive language use. The current study reports on data obtained with the long version of the CDI-2 and the CDI-3, for which there is only one (short) version. The Basque version of the full CDI-2 instrument (16-30 months), henceforth BCDI-2, contains different sections such as vocabulary and morphology, in which informants tick the items their child already produces, some questions about whether the child has started combining words, as well as a section for writing down the child's three longest recently-produced sentences. In addition, there is a list of multiple-choice items in which informants choose, from the different options, the one that best fits the child's current production. Filling in this questionnaire may take between 10 and 60 min, depending on the child's level of expressive use.
The Basque version of the CDI-3 instrument (30-50 months), henceforth the BCDI-3, is much shorter than the CDI-2. The BCDI-3 contains a vocabulary list, a grammar section, a section for writing down the three longest utterances, a list of multiple-choice items and a list of questions intended to assess children's knowledge of some logical and mathematical terms.
The sections and number of items analyzed in the current study are presented in Table 2. Neither the 37/29 items of the multiple-choice item section nor the 10 yes/no questions on logical concepts (included only in BCDI-3) have been included in the current analysis, since they are less homogeneous in format, across items and across the two instruments.

Participants
Table footnote a: For the current study, some postpositions included in the vocabulary section of the questionnaire were analyzed as morphological suffixes rather than as vocabulary items. Consequently, the distribution of (vocabulary/grammatical) items included in this study will vary from previous studies such as Barreña et al.'s (2008a,b), conducted with the same data sample.
The parents of over 2,000 children aged between 16 and 50 months participated in the study, filling in one of the two instruments: either the BCDI-2 (16-30 months) instrument (Barreña et al., 2008a) or the BCDI-3 (30-50 months) instrument (Garcia et al., 2014). The questionnaire is written exclusively in Basque. Consequently, all the informants in this study are bilingual parents with different levels of language use who interact in Basque and (at least) one other language on a daily basis and address their child (some exclusively, others mostly or only sometimes) in Basque. Participants gave informed consent prior to participation. The study was approved by the ethics commission of the University of the Basque Country. The data sampling lasted over a decade. The initial data collection of 2,248 questionnaires (BCDI-2 n = 1,204 / BCDI-3 n = 1,044) was filtered based on a set of exclusion criteria: children outside the age range (101 out of 15-30 months/26 older than 50 months), pre-term children born before the eighth month of pregnancy (15/7), children who had over two ear infections during the first year (20/55); questionnaires in which vocabulary and/or grammar sections were incomplete (93/0) and questionnaires where any (one, two, or three) of the three longest utterances produced (207/389) and/or input data (25/15) were missing. Thus, the data sample of 16- to 50-month-olds analyzed for the current study includes 1,337 questionnaires (BCDI-2 n = 750/BCDI-3 n = 587). As shown in Figure 1, all age groups (in months) consist of a range of 20-64 participants for the whole period studied. As for gender, girls and boys are evenly distributed across the age groups [χ²(14) = 6.27, p = 0.96 in BCDI-2 and χ²(20) = 28.18, p = 0.11 in BCDI-3]. In order to investigate the effect of input and age and the interaction between these two variables on MLU scores, the sample was limited to children aged between 1;6 and 4 years. The sub-sample of 1,202 participants was divided into age groups and input groups. 
Five groups resulted from the division in six-month age groups (18-24 months, 25-30 months, 31-36 months, 37-42 months and 43-48 months). Each age group was further divided into four different input groups based on the relative amount of exposure to Basque and Spanish: Monolingual or M (over 90% Basque input), Basque-dominant bilingual or BDB (Basque input 60-90%), Balanced Bilingual or BB (Basque input 40-60%) and Spanish-dominant bilingual or SDB (below 40% Basque input) (see Table 3). In what follows, we will use the terms input or relative input to refer to the relative amount of exposure to Basque and Spanish, following Thordardottir (2011), among others.
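The grouping just described can be sketched in a few lines of Python. This is only an illustration of the age and input bands named in the text, not the authors' actual procedure: the function names are hypothetical, and the assignment of boundary values (exactly 40%, 60% or 90% Basque input) is an assumption, since the text does not say which group they belong to.

```python
def age_group(months):
    """Return the age band (in months) used in the input analysis, or None.

    Bands follow the text: 18-24, 25-30, 31-36, 37-42 and 43-48 months.
    Ages are assumed to be whole months.
    """
    bands = [(18, 24), (25, 30), (31, 36), (37, 42), (43, 48)]
    for low, high in bands:
        if low <= months <= high:
            return f"{low}-{high}"
    return None  # outside the 1;6 to 4 year sub-sample


def input_group(basque_pct):
    """Classify a child by relative exposure to Basque (0-100 %).

    Boundary handling is an assumption: 90 % counts as BDB (M is "over 90%"),
    60 % as BDB and 40 % as BB (SDB is "below 40%").
    """
    if not 0 <= basque_pct <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if basque_pct > 90:
        return "M"    # monolingual
    if basque_pct >= 60:
        return "BDB"  # Basque-dominant bilingual
    if basque_pct >= 40:
        return "BB"   # balanced bilingual
    return "SDB"      # Spanish-dominant bilingual


print(age_group(28), input_group(75))
```

Crossing the five age bands with the four input groups yields the 20 cells that enter the two-way analysis.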

Procedure and Coding
As in the original CDI, the grammar section of the BCDI includes several items regarding nominal inflection, verbal inflection and an item in which participants are requested to report on the child's longest three sentences produced recently. The MLU3 was calculated from the three utterances reported, as displayed in (3).
(3) Idatzi zure haurrak azken aldian esan dituen hiru esaldi luzeenak. 'Please write down the longest three sentences your child has recently produced':
The examples in (3) illustrate the three longest utterances of a 28-month-old child randomly chosen from the BCDI-2 sample and the way they were measured. MLU3 in (3) was calculated as the mean length of the three utterances reported, so that MLU3-w, (3a + 3b + 3c) / 3, is (4 + 4 + 3) / 3, that is, 3.67, and MLU3-m is (6 + 6 + 5) / 3, namely, 5.67. This shows that MLU3-w and MLU3-m differ considerably in Basque.
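The calculation in (3) amounts to a plain arithmetic mean over three counts. A minimal Python sketch follows, using the word and morpheme counts given in the text; the function name `mlu3` is illustrative and not part of the CDI instrument.

```python
from statistics import mean


def mlu3(lengths):
    """Mean length of the three longest reported utterances."""
    if len(lengths) != 3:
        raise ValueError("MLU3 needs exactly three utterance lengths")
    return mean(lengths)


# Counts for utterances (3a), (3b) and (3c) as given in the text.
words = [4, 4, 3]
morphemes = [6, 6, 5]

mlu3_w = mlu3(words)      # 11 / 3
mlu3_m = mlu3(morphemes)  # 17 / 3
print(f"MLU3-w = {mlu3_w:.2f}, MLU3-m = {mlu3_m:.2f}")
```

The same three utterances thus yield a value two units higher when counted in morphemes than when counted in words.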
Only the children who had not started combining words yet (their parents responded with "not yet" to the item preceding the three longest utterance section) obtained 1 as a mean value for the two variables, MLU3-w and MLU3-m. The rest of the children obtained higher values.
The results from the MLU sections will be analyzed together with the scores obtained in three more scales: vocabulary, nominal inflection, and verb morphology. In the vocabulary and morphology sections, informants were asked to tick the items their child had started producing. The final score was calculated by summing up the total number of items ticked in each of the sections.
The maximal potential score in vocabulary was 643 items in BCDI-2 and 120 in BCDI-3. MLU3-w and MLU3-m were open scales and therefore no maximal values could be estimated a priori.
As for verbal inflection, the maximal possible score was 39 in BCDI-2 and 22 in BCDI-3, corresponding to the number of items included in the two instruments in the current study. The items in BCDI-2 are three aspectual suffixes (imperfective -tzen, future -ko, and perfective -ta) in addition to 36 frequent inflected verb forms, most of them auxiliary forms. The items included in BCDI-3 (22) are two aspectual suffixes (imperfective -tzen, and future -ko) and 20 very frequent verb forms, most of them inflected auxiliaries (naiz "am," da "is," dago "is," dizut "I have... it to you," zenuen "you had... it").

Data Analysis
One-way ANOVAs were conducted separately for the BCDI-2 and BCDI-3 instruments in order to measure the effect of age. In addition, Pearson's correlations were calculated to analyse between-scale relations, and finally, partial correlation coefficients were computed between BCDI scales with age as the covariate.
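To make the last step concrete, the first-order partial correlation between two scales x and y, controlling for age z, can be obtained from the three pairwise Pearson coefficients as r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The Python sketch below illustrates this standard formula; the data are invented for illustration only and the function names are hypothetical, not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def partial_correlation(x, y, z):
    """Correlation between x and y with the covariate z partialled out."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical scores for seven children (NOT data from the study).
age_months = [18, 20, 22, 24, 26, 28, 30]
vocabulary = [40, 95, 180, 230, 390, 410, 520]
mlu3_w = [1.0, 1.7, 2.3, 2.0, 3.3, 3.7, 4.0]

print(pearson_r(vocabulary, mlu3_w))
print(partial_correlation(vocabulary, mlu3_w, age_months))
```

Because both scales grow with age, the partial coefficient is typically lower than the raw one, which is the pattern reported for the BCDI scales.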
On the other hand, two-way ANOVAs were performed to compare the main effects of age and input in the whole sample, as well as the interaction between age and input in MLU3-w and MLU3-m scales. The effect size was calculated according to Cohen (1992) and Richardson (2011).
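The effect sizes reported alongside the two-way ANOVAs are partial eta squared values. As a rough check, partial eta squared can be recovered from an F statistic and its degrees of freedom through the standard identity η²p = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the three F values reported for MLU3-m in the Input and MLU3 section; that the authors derived their values this way is an assumption, and the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F statistics for MLU3-m as reported in the Results (age, input, interaction).
reported = {
    "age": (108.25, 4, 1182),
    "input": (45.97, 3, 1182),
    "age x input": (3.99, 12, 1182),
}

for effect, (f_value, df_effect, df_error) in reported.items():
    eta = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{effect}: partial eta squared = {eta:.2f}")
```

Rounded to two decimals, the recovered values agree with the 0.27, 0.10 and 0.04 given in the text for these three effects.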

RESULTS
A variety of structures and morphological markers are attested in the sample of utterances produced by the participants, based on their parents' reports. The examples of 24-month-olds listed in (4a-b) and of 30-month-olds in (4c-d) were collected using the BCDI-2, whereas examples from 30-month-olds (4e-f), 36-month-olds (4f-h), 42-month-olds (4i-j) and 48-month-olds (4k-l) were obtained using the BCDI-3 instrument. As expected in a language with rich case and inflectional morphology, length of utterance varies depending on whether it is measured in w(ords) or in m(orphemes), and the older the children become, the more complex the structures attested. Thus, morphologically complex structures which are rare among children younger than 30 months, such as inflected verb forms with multiple agreement markers (4d), postpositional complex phrases (4f, 4h) and embedded sentences carrying embedding particles (4g, 4k, 4l), start being reported from 2;6 and 3 years onwards or even later.

BCDI-2 (16-30 Months)
The scores on all scales of the BCDI-2 increased significantly with age, as shown in Table 4.
The ANOVA analysis revealed a significant effect of age on all the scales of the BCDI-2: vocabulary [F (14, 735) Cohen (1992) and Richardson (2011).
As shown in Table 5, correlations between vocabulary, nominal morphology, verbal morphology, MLU3-w and MLU3-m scales were strong (r range: 0.81-0.97), especially between MLU3-w and MLU3-m (r = 0.97). Some correlation coefficients decreased after controlling for age, but their values remained both significant and high (r range: 0.66-0.95). Cronbach's alpha was 0.97 for the five scales (see footnote 2).
FIGURE 2 | Mean vocabulary scores by age in BCDI-2 (643 items).
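The internal consistency coefficient mentioned above can be sketched as follows. This is a minimal illustration of Cronbach's alpha computed over k scales, each holding one score per child. The function name and toy scores are hypothetical, and whether the authors computed alpha on raw or standardized scale scores is not stated, so the sketch simply uses whatever scores it is given.

```python
from statistics import variance


def cronbach_alpha(scales):
    """Cronbach's alpha for k scales, each a list with one score per child."""
    k = len(scales)
    n_children = len(scales[0])
    # Total score per child, summed across the k scales.
    totals = [sum(scale[i] for scale in scales) for i in range(n_children)]
    sum_of_scale_variances = sum(variance(scale) for scale in scales)
    return k / (k - 1) * (1 - sum_of_scale_variances / variance(totals))


# Hypothetical toy scores for five children on three scales (NOT study data).
toy_scales = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.5, 2.0, 3.5, 3.5, 5.5],
    [0.5, 2.5, 2.5, 4.5, 4.5],
]
print(f"alpha = {cronbach_alpha(toy_scales):.2f}")
```

Alpha approaches 1 when the scales rank children in nearly the same way, which is the pattern the strong between-scale correlations suggest.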

BCDI-3 (30-50 Months)
The scores on all the BCDI-3 scales increased with age, as depicted in Figures 4 and 5, and the effect size of age was large. Mean and standard deviation values of the BCDI-3 scales are shown in Table 6.
2 Correlations between MLU and the scores obtained in the multiple choice question section in BCDI-2 were statistically significant (p < 0.001): MLU3-w (r = 0.77; r = 0.64 when controlling for age) and MLU3-m (r = 0.77; r = 0.63 when controlling for age). The analysis of the multiple choice item sections goes beyond the purpose of the current study. Nonetheless, we report these data at the request of an anonymous reviewer.
3 Correlations between MLU and the scores obtained in the multiple choice question section in BCDI-3 were also statistically significant (p < 0.001): MLU3-w (r = 0.60; r = 0.54 when controlling for age) and MLU3-m (r = 0.64; r = 0.59 when controlling for age).

Input and MLU3
Two-way ANOVA analyses were performed in order to investigate the effect of age, input (the relative amount of exposure to Basque and Spanish), and the interaction between them on the two MLU3 measures, MLU3-w and MLU3-m in the whole sample, which is depicted in Figure 6.
Similar results were also found in MLU3-m, with significant main effects of age [F (4, 1182) = 108.25, p < 0.001, η²p = 0.27] and input [F (3, 1182) = 45.97, p < 0.001, η²p = 0.10]. In addition, the interaction between age and input proved significant [F (12, 1182) = 3.99, p < 0.001, η²p = 0.04] (see Table 9).
Concerning MLU3-m (see Figure 6 and Table 9), no significant differences were observed among the four input groups in the youngest age range (18-24 months) [F (3, 320) = 1.63, p = 0.182, η²p = 0.01]. Nevertheless, from the age of 2 the effect of input in the MLU-w was revealed to be significant in all age groups: 25-30 months of age [F (3, 367) = 9.73, p < 0.001,
Post hoc analyses with a Bonferroni correction indicated no significant differences among input groups on MLU3-w and MLU3-m scores in the youngest age group (18-24 months). However, from 2 years of age, the mean scores for monolinguals and Basque-dominant bilinguals were significantly higher than those of the Spanish-dominant bilinguals (see Tables 8, 9). In contrast, monolinguals and Basque-dominant bilinguals did not differ significantly throughout the whole period studied, whilst balanced bilinguals showed intermediate scores which were closer to those of the Spanish-dominant bilingual group than to the Basque-dominant bilinguals in the age ranges before the 42nd month. Finally, in the oldest age group (43-48 months), the balanced bilinguals aligned with the Spanish-dominant bilinguals rather than with the Basque-dominant ones, as shown in Figure 6 and Tables 8, 9.
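The logic of the Bonferroni correction used in the post hoc comparisons can be sketched briefly. With four input groups there are six pairwise comparisons, so each comparison is tested against a stricter threshold, or equivalently each p value is multiplied by the number of comparisons. The function names and the example p values are hypothetical, and how the authors' statistical software applied the correction is not stated.

```python
from itertools import combinations


def bonferroni_pairs(groups, alpha=0.05):
    """All pairwise comparisons and the per-comparison significance threshold."""
    pairs = list(combinations(groups, 2))
    return pairs, alpha / len(pairs)


def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p values: each p times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


input_groups = [
    "monolinguals",
    "Basque-dominant bilinguals",
    "balanced bilinguals",
    "Spanish-dominant bilinguals",
]
pairs, per_comparison_alpha = bonferroni_pairs(input_groups)
print(f"{len(pairs)} comparisons, each tested at alpha = {per_comparison_alpha:.4f}")

# Hypothetical unadjusted p values for the six comparisons (NOT study results).
example_p = [0.001, 0.004, 0.020, 0.030, 0.200, 0.600]
print(bonferroni_adjust(example_p))
```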
Therefore, three main results can be drawn from the analyses provided above:
1) Large age effects were attested in MLU-w and MLU-m as well as in the rest of the scales of the BCDI-2 and BCDI-3 instruments, and high correlations were observed between both MLU scales and the other scales tested.
2) The two MLU scales showed almost perfect correlations.
3) Input groups behaved similarly in the 18-24-month-old group, but differences among input groups started to be significant from age 2 onwards, in such a way that monolingual and Basque-dominant bilinguals differed more

DISCUSSION
This paper is in line with previous research which used mean length of utterance in general, and MLU3 in particular, as an accurate index of language development for individual assessment (Brown, 1973; Fenson et al., 1993, 2007). The present bilingual data further indicate that an appropriate use of the measurement, one which takes into account the amount of input to which children are exposed, will favor a more accurate assessment of these children's actual language development. The current study, which reported MLU data of Basque obtained by means of parental questionnaires from 16- to 50-month-olds, challenged general objections regarding (a) the reliability of parental reports to assess children's expressive language, (b) the reliability of MLU as an index for language development, and (c) the accuracy of measuring MLU in words in an agglutinative language with complex morphology.
Subjectivity is one of the strongest criticisms made regarding the CDI instrument in general and the MLU3 measure in particular. Nevertheless, many studies have defended the ecological validity of parental reports as compared to studies based on experimental data, on the grounds that parents witness their children's language use in manifold communicative situations (Institute of Medicine, 2001; American Academy of Pediatrics, 2003; O'Neil, 2007). Moreover, many handbooks of the adaptations of the CDI instruments to English and many other languages include validity studies comparing CDI parental report data with data obtained using other methodologies such as elicitation or spontaneous interaction. These studies also reported strong correlations between MLU3 and the rest of the scales (Fenson et al., 1993; Jackson-Maldonado et al., 2003; López-Ornat et al., 2005; Barreña et al., 2008a). As for the subjectivity in coding MLU in general, and MLU3 in particular, the current study was based on data coded by two different researchers for both BCDI-2 and BCDI-3 data. The high correlation found between the two analyses confirmed the reliability of the coding used.
The Basque sample data of 1337 children between 16 and 50 months of age obtained with either the BCDI-2 or the BCDI-3 revealed a gradual increase of mean scores in the scales studied throughout the age groups, month by month, similar to the one found in the lexical and grammatical scales of the BCDI-2 and BCDI-3. The high correlations found between MLU3-w, MLU3-m and the scales of vocabulary, verbal morphology and nominal morphology, as well as with the section of multiple choice items regarding children's progress in the acquisition of some particular structures, revealed an extremely strong internal consistency throughout the two parental questionnaires. Such consistency attests, first, to the trustworthiness of parental reports on their children's language use and, second, to the reliability of the BCDI instruments.
FIGURE 4 | Mean vocabulary scores by age in BCDI-3 (120 items).
The first prediction, that MLU3 scales in BCDI would be as sensitive as the rest of the scales in this instrument in detecting toddlers' developmental changes, has been confirmed by the data analyzed. On the one hand, the large size of age effects on the BCDI scales tested confirmed the sensitivity of MLU-w and MLU-m as well as the rest of the scales in detecting developmental changes in both instruments (η²p = 0.43 in BCDI-2, η²p = 0.12-0.13 in BCDI-3). The effect size in the rest of the scales was η²p = 0.41-0.51 in BCDI-2 and lower, but still large or close to it (η²p = 0.11-0.16), in BCDI-3. The fact that the effect size of age decreased from BCDI-2 (η²p ≈ 0.40) to BCDI-3 (η²p ≈ 0.15) can be explained in two ways. First, methodological differences such as the number of items included in the two instruments (see Table 1) may be the reason, at least partially, for the difference in the effect of age: the differences in the number of items are large in vocabulary (643/120 words). However, they are not so large in morphology (17/16 in nominal morphology and 40/20 in verbal morphology) where, nevertheless, the effect size of age decreased at the same pace as for vocabulary. Moreover, MLU scales were calculated in exactly the same way in both instruments and again revealed a weaker effect of age in BCDI-3 than in BCDI-2, which calls into question the relevance of the methodological account for the differences mentioned. The second explanation, in terms of development, appears to be much more convincing: the difference attested between the two Basque instruments is compatible with stronger developmental changes taking place in the earlier developmental period covered by the BCDI-2 (16-30 months) than in the later one covered by the BCDI-3 (30-50 months). The decrease in developmental speed found in the Basque data is in line with that found by Fenson et al. (2007) with the English instruments CDI-2 (16-30 m) and CDI-3 (30-42 m), and with Brown's statement that MLU scales may not be accurate enough for measuring language complexity once the child has reached Stage V. Note that two of the children studied by Brown reached that stage at around age 4, whilst the third one had reached it almost 2 years earlier. Hence, this is compatible with the idea that the effect of this factor decreases after some age between 3 and 4 years. On the other hand, the high correlations between MLU and the rest of the scales reveal the consistency of the instrument and its validity to measure children's verbal communicative development between 16 and 50 months of age, in line with the results of many adaptations of the CDI-2 and CDI-3 instruments (Fenson et al., 1993, 2007; Jackson-Maldonado et al., 2003; López-Ornat et al., 2005). Even though the explanation is not clearly formulated yet, we can conclude, in line with Dethorne et al. (2005), that the strong correlation attested between MLU values and scales of varied instruments used across studies to measure children's development in different language components (expressive vocabulary, grammar, etc.) confirms Brown's assumption that MLU is a measure of early development in language complexity in general, rather than of a specific language component, such as semantics or morphosyntax, in particular. Its validity may be limited to the earliest stages, applying no further than Stage V. Nonetheless, this last point could be neither confirmed nor disconfirmed by the Basque data and requires further research.
Table note: Means that do not share a common alphabetical subscript differ at p < 0.05 (a > b > c) according to post hoc analyses with a Bonferroni correction.
The second hypothesis, that MLU3-m would turn out to be more discriminative than MLU3-w, has not been confirmed by the data, since the effect size of age did not differ between the two MLU scales: η²p = 0.43 in BCDI-2 and η²p ≈ 0.11 in BCDI-3. Moreover, the almost perfect correlations between the two MLU scales indicate their similar validity to measure utterance length, regardless of the specific unit (word/morpheme) adopted as baseline. Based on the high correlations found in studies comparing MLU-w and MLU-m scores in several languages (and even MLU counted in syllables), many authors consider that both MLU measures function equally well for measuring grammatical development (Hickey, 1991; Aguado, 1995; Parker and Brorson, 2005). In contrast, Wieczorek (2010) considers that each MLU scale measures development in a different language component: MLU-w being more related to lexical development, and MLU-m to morphological development. Our data support the former position. The high correlations between the two scales in both instruments (r > 0.97 and r > 0.95, when age is controlled) confirm the utility of both indexes to measure development in language complexity. Moreover, the correlations between MLU3-m and the rest of the scales are almost identical to those between MLU3-w and the same scales, whether those scales are lexical or grammatical in character, in contrast to what has been suggested by Wieczorek (2010). The relations across MLU measurements and between MLU3-w and MLU3-m and the rest of the scales may vary across languages or language types which differ in degree of morphological complexity and transparency (agglutinative, fusional, polysynthetic, etc.), but such an analysis goes far beyond the scope of the current paper.
Utterance segmentation into words is much quicker and easier, since no technical descriptions are necessary, fewer decisions are required (less subjectivity) and variability across coders decreases considerably, in line with previous studies (Hickey, 1991; Jackson-Maldonado and Conboy, 2007, among others). The redundancy of using both measures, in addition to the ease of segmenting the utterance into words as compared to morphemes, leads us to recommend MLU-w as a more parsimonious measurement for screening in clinical studies, as has been suggested for other languages (Hickey, 1991; Parker and Brorson, 2005), without denying MLU-m's utility for more specific surveys in research.
The third hypothesis, that the relative amount of input would affect children's MLU, has been partially confirmed. MLU3 scales proved sensitive to input effects. A subsample of around 1200 children aged 18-48 months was analyzed in more detail in order to assess MLU3's utility for gauging the developmental level children attain in the acquisition of a minority language in permanent contact with a socially dominant Romance language (Spanish or French). The data revealed the sensitivity of MLU3-w and MLU3-m not only to age, already tested in Basque as in many other languages, but also to the relative amount of exposure to the language. However, the effect of the (relative) amount of exposure to the language was not visible in the youngest group (18-24 months). Interestingly, the effect of input increased with age after age 2, varying from medium at age 2 (η²p = 0.07 and 0.12) to large at age 3 (η²p = 0.15 and 0.20). From age 2 onwards, the MLU3-w and MLU3-m scores of children with a large amount of exposure to Basque (M and BDB groups) were more similar to each other than to those of the group with less exposure (SDB), in line with previous studies which tested these populations' lexical and grammatical scores (Barreña et al., 2008a,b).
Despite the strong intralinguistic correlations found among the BCDI subscales, in line with CDI data of English-Spanish bilinguals (Marchman et al., 2004; Hoff et al., 2014), measuring Basque bilinguals' language use only in Basque leads us to underestimate the real language capacity of most participants in the present study. Children who are exposed to more than one language rarely have the same amount of exposure to one of the languages as age-matched monolinguals, on whom normative data are based (Ezeizabarrena et al., 2017). As has been shown very convincingly by Pearson et al. (1997), the assessment of bilinguals should ideally take place in both of their languages, and in this vein, the accurate evaluation of Basque-Spanish bilinguals' communicative skills should include assessing MLU in their two languages.

CONCLUSIONS
The analysis of cross-sectional data obtained with the BCDI-2 (16-30 months) and BCDI-3 (30-50 months) of over 1200 children revealed a strong correlation between MLU3 and expressive vocabulary in both instruments, as well as between MLU3 and morphological scales. These findings confirm the consistency of the MLU measurement, as well as that of both BCDI instruments. The results also showed that MLU3-w and MLU3-m scales can report equally well on very young children's development in the Basque language up to age 4, which leads us to recommend the easier MLU-w measurement for clinical purposes. Finally, MLU3 subscales proved sensitive to input (25-48 months), which indicates the utility of these subscales to identify developmental patterns in Basque bilinguals aged 2-4.

ETHICS STATEMENT
This study was approved by the ethics commission of the University of the Basque Country.

AUTHOR CONTRIBUTIONS
Data-analysis: IG and M-JE. Manuscript writing and editing: M-JE and IG.