Language Screening in 3-Year-Olds: Development and Validation of a Feasible and Effective Instrument for Pediatric Primary Care

Objective: The study was aimed at evaluating the validity and feasibility of SPES-3 (Sprachentwicklungsscreening), a language screening in 3-year-old children within the constraints of regular preventive medical check-ups. Methods: A four-component screening measure including parental reports on the child's expressive vocabulary and grammar based on the MacArthur Communicative Development Inventory and pediatrician-administered standardized assessments of noun plurals and sentence comprehension was used in a sample of 2,044 consecutively seen children in 30 pediatric offices. One-hundred forty-four children (70 who failed and 74 who passed the screener) comprised the validation sample and also underwent follow-up gold standard assessment. To avoid verification and spectrum bias multiple imputation of missing diagnosis for children who did not undergo gold standard assessment was used. Independent diagnoses by two experts blinded to the screening results were considered gold standard for diagnosing language disorder. Screening accuracy of each of the four subscales was analyzed using receiver operator characteristic (ROC) curves. Feasibility was assessed by use of a questionnaire completed by the pediatricians. Results: The two parental screening subscales demonstrated excellent accuracy with area under the curve (AUC) scores of 0.910 and 0.908 whereas AUC scores were significantly lower for the subscales directly administered by the pediatricians (0.816 and 0.705). A composite score based on both parental screening scales (AUC = 0.946) outperformed single subscales. A cut off of 41.69 on a T-scale resulted in about 20% positive screens and showed good sensitivity (0.878) and specificity (0.876). Practicability, acceptability and sustainability of the screening measure were mostly rated as high. Conclusion: The parent-reported subscales of the SPES-3 language screener are a promising screening tool for use in primary pediatric care settings.


INTRODUCTION
Depending on the definition used, 2-10% of pre-schoolage children experience delayed language acquisition, which makes language disorder (LD) one of the most prevalent developmental disorders (1,2). However, there is no generally accepted definition of what constitutes a LD. A recent consensus statement on terminology and criteria for language problems in children (3) has resulted in the endorsement of the term Developmental Language Disorder (DLD) for language difficulties that are associated with functional impairment and poor prognosis but have no known biomedical etiology. DLD continues to be a clinical diagnosis since functional impairment and prognosis need to be assessed by appropriately trained clinicians. In addition, the degree of language delay and the linguistic dimensions (phonology, vocabulary, morphology, syntax, pragmatics) and modalities (expressive and receptive) encompassed have not been specified in the consensus document.
In an English population study, Norbury et al. (2) found a prevalence of DLD (of unknown origin) of 7.58%, while 2.34% of LDs were associated with intellectual disability and/or a medical diagnosis (total approximately 10% of LDs from all causes). They defined DLD as scores of −1.5 standard deviations (SD) and below on at least two of five language domains. Similarly, other researchers (4-6) classified a child as "specifically language impaired" whenever language performance was below −1.25 SD in at least two language domains measured by norm-referenced tests. In the absence of a generally accepted measurable gold standard for the definition of LDs, we based our definition on the previously mentioned classifications, which are commonly used in research, with an expected prevalence rate of about 10%.
LDs can affect multiple domains of development through adolescence and adulthood. Children with delayed language development are at increased risk for poor socio-emotional, health (7), behavior and academic outcomes (8, 9) and later unemployment (10) with corresponding costs and loss of human potential.
There is growing evidence that intervention for children with LD may be effective. Direct treatments by a specialist and indirect treatments mediated by caregivers have been shown to have positive effects (11)(12)(13)(14). Hence, it is essential to identify children who require educational or therapeutic support in their language learning in order to offer timely and effective intervention.
Based on a number of methodological problems identified in their systematic review of language screening (e.g., lack of information on the effects of age, setting and administrator on screening accuracy) and insufficient evidence for long-term outcomes of language interventions, Nelson et al. (15) did not recommend universal language screening. In a more recent systematic review, Wallace et al. (16) reported on the accuracy of some screening measures for identification of children with language impairment although evidence of feasibility in primarycare settings remained inadequate. Consequently, the U.S. Preventive Taskforce did not recommend universal screening for language delay (17). Given the international systematic reviews (1,15) and following the National Health and Medical Research Council (18), the German Institute for Quality and Efficiency in Health Care (19) also considered the available evidence insufficient to recommend the implementation of language screening in Germany. In the same vein, a systematic review evaluating the effectiveness of systematic populationbased screening for specific language impairment in pre-school children in Germany (20) concluded that the accuracy of German screening measures had not yet been sufficiently examined.
Systems for general health check-ups have been established in many countries in the Western world and could be suitable opportunities for identification of language delays. However, due to a lack of accurate and feasible instruments, routine wellbaby check-ups are generally not used for systematic language screening. One exception is in the Netherlands, where a fiveminute interview/test procedure (VTO language screening) administered by a youth health care physician to 24 monthold children led to more cases with language impairment being identified than by the regular procedure (0.4-2.4%), and at a significantly younger age than in regions in which the regular detection procedures were used (21). However, as demonstrated by the low number of cases, many 2-year-old children with language impairment were not identified (low sensitivity of 24-52%). Given the high instability of language development trajectories in young children, Law et al. (1) recommended the age span of 3-5 years as the optimal period for language screening.
A combination of (i) observations by parents with extensive long-time knowledge of their children's behaviors in everyday life and (ii) standardized assessments by pediatricians is essential to avoid assessment bias (22). A comparative study of direct assessments and parent reports of language and pragmatics revealed patterns of difference that indicated a need to collect assessment data from multiple informants (23). Nevertheless, convergent validity of parent-reported tools in comparison to direct assessments of language is usually high (24,25). Mere elicitation of parental concerns has therefore also been described as a valid method for identifying an increased risk for developmental disorders (26)(27)(28).
Particularly at the age of 3 years, grammatical knowledge is a good marker of language development (29)(30)(31). Furthermore, grammatical skills can usually be reliably assessed within a shorter time than expressive or receptive vocabulary knowledge. In addition, a comprehensive measure for identifying language disorders must refer to both production and comprehension (4). Another reason for including language reception in a screening measure is its ability to predict the persistence of language difficulties (32). As parental reports of language comprehension have often been found to overestimate children's skills (33), a direct assessment of language comprehension should be considered.
Upper Austria is a federal state with a population of 1.45 million inhabitants (with 13.297 births in 2007). Well-child visits are provided free of charge by community pediatricians and general practitioners. In 2010, when the validation study was conducted, 9,125 health check-ups (68.6% of the children born in 2007) were carried out with 3-year-old children, 66% of which by pediatricians and the remainder by general practitioners.
The aim of this study was the validation of a screening tool for LDs for 3-year-old German-speaking children in a pediatric primary-care setting in Upper Austria. The new instrument, the SPES-3 (Sprachentwicklungsscreening) that has been developed in a pilot study includes both parent observations and direct assessments of the child by the primary-care pediatrician. As required for comprehensive language assessments, the screening measure takes various linguistic domains (grammar and vocabulary) and modes (receptive and expressive) into account. For use in primary-care settings, the screening tool must have high acceptability and require little time to administer while maximizing accuracy.

Construction of the Screening Measure and Pilot Testing
Both parent-administered screening scales are based on the MacArthur-Bates Communicative Development Inventories (MCDI), Level III (34), and require parents to systematically report their observations of their child's development of expressive grammar and expressive vocabulary. Inspired by the MCDI concept for grammar assessment a number of morphosyntactic structures of German that typically emerge around age 3 were selected to be presented to the child in the form of 27 pairs of correct and incorrect options. Parents are asked to select the option they are more likely to observe in their child's spontaneous use of language. For expressive vocabulary, 100 words from the MCDI-III English word list (34) translated into Austrian German (35) were chosen. Parents are asked to indicate whether their child uses the words in their expressive communication or not.
The screening subscales administered by the pediatricians were compiled from pre-existing subtests of a German standardized language test (Sprachentwicklungstest SETK 3-5; (36), The first subscale includes 20 items that assess the production of noun plurals: The pediatrician presents and names a pictured item in the singular and asks the child to produce the respective plural form supported by a picture that shows several identical items. The second subscale assesses sentence comprehension: Single sentences are read aloud (9 items), and the child is asked to point to the corresponding picture from a selection of four. For an overview of all screening scales (see Table 1).
The preliminary version of the screening procedure was first used in 2006 in a pilot study in close cooperation with the Pediatric Association and the Department of Health and Social Affairs in Upper Austria, with the aim of developing language screening instruments to be used in pediatric primary care within regular health check-ups at ages 2 and 3. The pilot study included 1,730 non-preselected 3-year-old children with German as their first language, who were consecutively assessed by a group of 24 primary care pediatricians recruited from across the state of Upper Austria (whose participation was voluntary).
Based on the pilot study, the parental report of expressive grammar scale, which initially contained 27 items, was shortened by excluding 14 items with low response rates, low difficulty and low item-scale correlation. Thus, the final screening measure used in the validation study covered 13 items for expressive grammar, 100 items for expressive vocabulary, 20 items for noun plurals and 9 items for sentence comprehension. All items were scored either as 0 (does not apply/false) or 1 (applies/correct), and for each subscale a sum over the item scores was computed. Internal consistency (Kuder and Richardson Formula 20) was excellent for the parental screening scales (ρ KR20 = 0.98 for expressive vocabulary and 0.90 for expressive grammar) and adequate for the screening scales directly administered with the children [ρ KR20 = 0.78 for noun plurals and sentence comprehension; (36)] For the questionnaires to be completed by the parents (expressive grammar and expressive vocabulary) cut-off scores for the validation study were derived from the 10th percentiles (SD 1.25) based on the final instruments, as suggested by the literature (4)(5)(6). Normative data to determine cut-off scores were available for both of the standardized assessments to be completed by pediatricians (noun plurals and sentence comprehension).

Gold Standard Assessment of Language Disorder
Given the lack of well-defined standards for the diagnosis of LDs, independent expert diagnoses by two experienced clinical linguists blinded to the language-screening results were considered the gold standard. Their diagnostic decisions on LD were based on the results of two standardized tests assessing the linguistic domains of expressive sentence grammar, noun plural production, sentence comprehension [SETK 3-5; (36)] and expressive vocabulary [AWST-R; (37)] that had been administered by other linguists and a short video sample (5-10 min) of spontaneous language of each child in a play and/or dialogic picture-book situation. For their decision on LD both clinical linguists followed international research (4-6) and classified a child as language delayed when language performance was at about the 10th percentile or lower in at least two of the four measured language domains and observations of spontaneous language production confirmed the significant language difficulties. Inter-rater reliability (kappa) between the linguists' diagnoses (+/-LD) was 0.95. Discrepancies were resolved by consensus decisions between the two raters.
For the assessment of non-verbal intelligence, the Snijders Oomen Non-verbal Intelligence Test [SON-R 2 ½-7; (38)] was administered by clinical psychologists. Pure tone audiometry was used to assess hearing.

Feasibility Measures
Feasibility was measured primarily by use of a questionnaire completed by the pediatricians who participated in the study and by the completeness of screening tests administered. Following the guidelines suggested by Bowen et al. (39), four dimensions of feasibility were investigated. All questions of the pediatric questionnaire were rated on a 4-point Likert scale (very goodgood-difficult-very difficult).

Practicality
Practicality was operationalized by the extent to which administration of the screening was considered possible within the time constraints of pediatric primary care. In addition, the pediatricians were asked about the ease of administration of each of the two screening measures (noun plurals and sentence comprehension) and to evaluate parental difficulties in completing the two parental subscales of the screening. In addition, the pediatricians ranked five pre-specified factors that might challenge the completion of the screening measure within their respective settings.

Acceptability
Acceptability refers to the children's, parents' and pediatricians' reactions to the screening measures. Child acceptance was measured by the percentage of screening subscales administered by the pediatricians that were fully completed, as this reflects child compliance with the test. In addition, the pediatricians assessed parental acceptance of the inclusion of language screening in the 3-year medical check-up. Finally, the pediatricians were asked to rate the meaningfulness of including a language screening within the regular well-baby check-ups.

Sustainability
Sustainability refers to the likelihood of language screening in this form being continued within the present system of preventive medical care in Austria. Pediatricians were asked in the questionnaire whether they intended to continue the language screening after the study ended (yes, to a limited extent, no).

Study Procedures and Recruitment
In 2009, all primary care pediatricians of the province of Upper Austria were invited to participate in the validation study of the language screening. Thirty-six out of 60 pediatricians participated in a half-day training and in the subsequent implementation of the screening procedures. The participating pediatricians served all major geographical areas of Upper Austria. Over a 1-year period (2010) 2,635 3-year-old children (19.8% of the entire 2007 Upper Austrian birth cohort) were screened by the 30 pediatricians who ultimately participated in the study.
Overall, 591 children were excluded from the study: 31 children with missing data on their date of birth, one child with missing data on the screening date, 95 children who were outside the age range (34-38 months), 349 children from families whose primary family language was other than German, 95 children with missing data on both parental screening subscales and 22 with missing data on both pediatric screening subscales. The remaining sample (n = 2,044) equaled 15.4% of the entire 2007 birth cohort and about 22% of the children born in 2007 by native-born mothers. The mean age was 36.03 months (SD = 0.994). 50.9% were boys, 4.9% were multiple births, 50.7% of the children had older siblings, and 8.6% of children were born prematurely. Compared to the 2007 birth cohort for Upper Austria, the sample was representative in terms of sex ratio and prematurity rate. Multiple births were slightly overrepresented (4.9% vs. 3.4% in the birth cohort; χ²(1) = 14.003, p < 0.001). To assess the representativeness of the sample in terms of maternal education (proportion of mothers with university entrance qualification or tertiary degree), we calculated an age-adjusted comparison value based on the educational level of all women (i.e., not only mothers) from Upper Austria. Among the participating families the percentage of mothers with either a university entrance qualification or a tertiary degree was comparable to the age-adjusted population (39.4% vs. 41.6%; χ²(1) = 4.072, p < 0.05).
The full screening was administered in the course of wellchild visits in the pediatric medical practices. Parents completed a screening package that included demographic information in addition to their language observations (expressive grammar and expressive vocabulary). All parents gave written permission for scientific use of their children's anonymized data for scientific purposes. The study was approved by the ethics committee of the Hospital of St. John of God, Linz. Pediatricians administered two screening subscales (noun plurals and sentence comprehension). All screening data were sent back to the clinic conducting the research. In accordance with Tomblin et al. (40), a screening test was considered a fail if the results of at least two of the four subtests were 1.25 standard deviations below the age norm. Using this ex-ante definition for positive screening results, 21.7% of the sample was considered a screening fail.
To validate the screening measures, a sample for full gold standard assessment was recruited in a two-step procedure. All pediatricians were instructed to refer children with positive Frontiers in Pediatrics | www.frontiersin.org screening results to a single specialized program for gold standard examination, which was performed in 70 children. Second, to evaluate specificity of the screening tool, four pediatricians from different regions were asked to invite a random subsample of children from the whole cohort, excluding those (n = 70) who had already undergone the gold standard assessment. The sample was stratified by gender, maternal education, position among siblings and single parenthood. The recruitment procedures (see Figure 1) led in total to a validation sample of n = 144 (i.e., 7% of the total sample), that did not deviate from the remaining sample in terms of demographic variables (see sample description).

Analytic Strategy
First, we report descriptive statistics for the total sample and the validation sample. Second, we present results on the screening accuracy of each subscale based on receiver operator characteristic (ROC) analyses. Areas under the curve (AUCs) ≥0.9 are regarded as excellent, AUCs ≥0.8 and <0.9 as good, AUCs ≥0.7 and <0.8 as fair, and tests with AUCs <0.7 as poor (41). DeLong's method for comparing AUCs of different tests (42) was applied to analyze differences in diagnostic accuracy between the subtests. Further, logistic regression was applied to identify subscales with significant independent contributions to the prediction of the gold standard diagnosis. The aim of this analytic step was to reduce the number of screening subtests while optimizing diagnostic accuracy. Lastly, we evaluated possible cut-off scores for the final screening composite by estimating sensitivity, specificity, positive predictive values (PPV), negative predictive values (NPV), and diagnostic likelihood ratios for positive and negative screening results (DLR+ and DLR-, respectively). DLR+ and DLR-are alternative measures of the accuracy of a diagnostic test and have the advantage, unlike predictive values, not to depend on the prevalence of the disorder under investigation [(43); for an explanation see Supplementary Material]. DLR+ is the multiplicative change in the pre-screening odds of having a LD given a positive screening result (i.e., post-screening odds = DLR+ × pre-screening odds) and DLR-is the change in the pre-screening odds of having a LD given a negative screening result (post-screening odds = DLR-× pre-screening odds). Following Jaeschke et al. (44), DLR+ values ≥ 10 and DLR-≤ 0.1 indicate large changes in prescreening odds, DLR+ ≤ 10 and >5, and DLR-> 0.1 and ≤0.2 indicate moderate changes, DLR+ ≤ 5 and >2, and DLR-> 0.2 and ≤0.5 indicate small changes. DLR+ < 2 and DLR-> 0.5 are rarely important. Logistic regression models were conducted using Mplus 8 (45), for the ROC analysis the pROC package (46) in R was employed, and cut-off values were determined with the R-OptimalCutpoints package (47).
Notably, the recruitment procedure led to oversampling of children with positive screening results, which is not uncommon for screening validation studies (48). Ideally, the validation sample should reflect the patient population, the given overrepresentation of positive screening results induces bias in measures of screening accuracy [i.e., verification bias; (49)]. In order to deal with this bias, we used multiple imputation (MI) of missing diagnosis status for children who did not undergo gold standard assessment (50). Missing values on screening subscales (ranging between 0.4 and 12.8% in the full sample) were also imputed. We used the Blimp imputation software (51,52) to generate 50 imputed data sets using a chained equation imputation procedure that takes the clustering of children within pediatricians into account. Beside the study variables (i.e., sociodemographic variables and screening results) we used various auxiliar variables (e.g., parental concern about language development) that were predictive of diagnosis status. Estimates and their standard errors were computed according to Rubin's combining rules (53). Even though recent simulation studies do not indicate that high proportions (up to 90%) of missing data might bias estimates based on MI data sets (54), we also report results for the original validation sample (n = 144) as Supplementary Material for completeness. Moreover, analyses showed that children with positive screening results (ex-ante definition) who attended the gold standard assessment showed lower scores in three subscales than children with positive screening results who did not attend the assessment (expressive grammar: d = 0.60, p < 0.001; sentence comprehension: d = 0.29, p < 0.05; noun plurals: d = 0.51; p < 0.001). Thus, the validation sample may also be subject to spectrum bias (i.e., screening positives include primarily the "sickest of the sick" and not the full spectrum of positive screens), which is associated with overestimation of sensitivity, specificity, and AUC (49). However, the used multiple imputation procedure is also suitable to counteract spectrum bias. Based on the gold standard, 27.8% of the validation sample (n = 144) had a LD. Notably, all children with LD had positive screening results (i.e., sensitivity = 1.00). After imputation 11.7% of the children are classified as having a LD.

RESULTS
As children are clustered within pediatricians, we estimated the intraclass correlation (ICC) to evaluate whether there are differences between pediatricians in the subscales. We found that pediatricians accounted for ≈14% of the differences in the pediatrician-reported subscales (ICC noun plurals = 0.144, p < 0.001; ICC sentence comprehension = 0.139, p < 0.001). If differences between pediatricians were due to population differences in the catchment areas, we would also expect comparable ICCs for the parent reports. However, ICCs for parent reports were smaller (ICC expressive vocabulary = 0.049, p = 0.005, ICC expressive grammar = 0.021, p = 0.05). Thus, these findings indicate that pediatricians differed significantly in their application of the screening tools, which calls into question the objectivity of implementation.

Diagnostic Accuracy of Subscales
Results of the ROC analysis for the screening subscales are shown in To examine independent contributions of subscales to predicting the gold-standard diagnosis logistic regression analyses were performed ( Table 3). The two parent-reported subscales independently predicted LD. After controlling for parent-reported screening tests, none of the pediatricianadministered subscales significantly predicted LD. Standardized coefficient (b) for the expressive vocabulary subscale was −0.408 (p < 0.001) and −0.388 (p < 0.001) for the expressive grammar subscale. Notably, as indicated by the overlapping confidence intervals of the standardized logistic regression coefficients, both parent reported scales had roughly the same weight in predicting LD.

Diagnostic Accuracy of the Composite Screening Score
Given the results of the logistic regression models, a composite screening score based on both significant predictors (expressive vocabulary and expressive grammar) was computed. As both parent reported scales contributed almost equally to the prediction of LD, a composite score was computed as the mean of the z-scores of expressive vocabulary and expressive grammar. The AUC for the composite score was excellent at 0.946 (DeLong 95% CI = [0.883, 1.000]). DeLong tests for paired ROC curves indicate that the composite outperformed the single parent reported scales (composite score vs. expressive vocabulary: AUC = 0.038, t-value = 2.380, p = 0.019; composite score vs. expressive grammar: AUC = 0.036, t-value = 2.102, p =0.037).

Cut-Off Estimation
In the next step, cut-off values for the screening composite were estimated ( Table 4). For ease of interpretation, the composite score was transformed into a T metric (i.e., M = 50, SD = 10). First, we estimated the cut-off by setting sensitivity equal to specificity using the "SpEqualSe" criterion in the Optimal Cutoff Package (47). A cut-off at 41.69 was most efficient. Table 5 reports the classification results for this cut-off that resulted in satisfactory accuracy statistics: sensitivity = 0.878 (95%-CI = Lower cut-offs resulted in fewer screening fails and higher PPV, DLR+, DLR-and specificity, but also in lower sensitivity (see Table 4). Thus, lower cut-offs would yield more false-negative results.

Feasibility
Completed questionnaires that addressed feasibility were returned by 23 (77%) out of the 30 pediatricians participating in the study.

Practicality
According to the pediatricians, 13.6% of the parents considered completion of the questionnaire to be "difficult". However, the remainder rated it most often as "very easy" (47.8%) or "easy" (34.8%). Practicality of the screening measures within the time limits of preventive medical care was rated mostly as good (69.6%) or very good (8.7%). In 21.7% of the cases, it was considered "difficult", but never "very difficult". Ease of administration of the sentence comprehension screening scale was assessed as significantly higher than that for noun plurals (expressive grammar). Ease of administration of the receptive measure was rated in most cases as good (60.9%) and very good (34.8%), and as difficult by only one pediatrician (4.3%).
In contrast, 47.8% rated the screening of noun plurals as difficult, 39.1% as good and only three pediatricians (13.0%) as very good. Among the factors that might complicate or even prevent administration of language screening, lack of or insufficient follow-up was ranked highest (40%), followed equally by insufficient training of pediatricians (20%) and insufficient funding (20%), and then by limited meaningfulness of language screening (13.3%). Only one respondent cited time constraints (4.3%).

Acceptability
General parental acceptance of language screening included in preventive health care was rated as "very good" by 43.5% and as "good" by all remaining pediatricians (52.2%). The percentage of children refusing their cooperation in the pediatrician's administration of the assessment of noun plurals was 12.8%, and for the assessment of sentence comprehension 5.5%, indicating satisfying acceptance of the assessment of language reception by the children, although child participation is no longer required for the final screening package that is exclusively based on parent report. A great majority of pediatricians rated the meaningfulness of language screening within the regular medical checkups as very good, another three (13.9%) as good, and only 2 expressed concerns (9.1% "difficult").

Sustainability
Most pediatricians (82.6%) reported that they would continue the SPES-3 language screening beyond the study, while the

95%-CI in brackets.
Frontiers in Pediatrics | www.frontiersin.org remainder would stop. It is noteworthy that no specific additional funding for language screening could be provided.

DISCUSSION
This study evaluated the diagnostic accuracy and feasibility of a newly developed language screening measure (SPES-3) for children aged 3 years in pediatric primary care within regular well-baby check-ups. In addition to the implementation by community pediatricians in consecutively evaluated children, the extensive sample size and the use of multiple imputation to avoid verification and spectrum bias are strengths of this study. The two parent-reported subscales (expressive grammar and vocabulary development) showed excellent accuracy (AUC.910, 0.908). Although AUC scores for pediatric subscales were significantly lower, they were good for the noun-plural subscale (0.816) and fair for the sentence comprehension subscale (0.705). Nevertheless, logistic regression analysis showed that none of the pediatrician-administered subscales significantly increased the diagnostic accuracy achieved by the parent scales. A composite based on both parent reports showed excellent accuracy (AUC = 0.946) and outperformed the single parental subscales. Our findings are consistent with those of a systematic review on the predictive validity of preschool screening tools for language by Sim et al. (55). Their results exhibited higher sensitivity, specificity and negative predictive value of parentreport screening tools as compared to direct child assessment. This finding supports the richness of parental information that is based on long-time observation of their child's language use in a variety of everyday situations. In contrast, the quality of screening tools based on direct assessments is influenced by the brevity required for use in the regular pediatric office and might be influenced by the relationship between child and examiner. The higher intraclass correlations of the screening tests administered by the pediatricians compared to the parent-reported screening tests may, at least partly, reflect differences in building rapport with the child. Since screening duration in community settings is critical for their large-scale implementation, stronger validity of parentreported screening in place of direct assessment by a pediatrician can be viewed as a positive finding in terms of feasibility of the screening tool. Practicability of the new screening measure was assessed to be high, even when direct assessments were included in the screening measure. Pediatricians also rated the acceptability by parents and children as high, and the majority of them regarded the inclusion of a language screening measure within their regular well baby check-ups was meaningful.
By use of ROC analysis a cut off of 41.69 is optimal when equal values for sensitivity and specificity are desired. Predictive values, which depend on the prevalence of the disorder, were NPV = 0.984 and PPV = 0.438. Moreover, we evaluated a broader range of cut-offs. As illustrated by Table 4, there is a trade-off between sensitivity and specificity. The more sensitive is a cut off value, the less specific it is. Due to increasing risk of persistence of language difficulties by increasing age and for reasons of practicability we suggest the cut-off of 41 for this new language screening measure for 3-year-olds to reduce the number of missed children with delayed language development. The relatively low PPV would lead to overreferrals but it must be interpreted with regard to the relatively low rate of diagnoses of LD (11.7%). As language difficulties have a dimensional rather than categorical character, children with false-positive scores in language skills have been shown to perform significantly lower on diagnostic measures than children with true-negative scores (56). Since children "overreferred" for diagnostic testing perform lower on language, they are very likely to also carry more psycho-social and cognitive risk factors associated with language delay. Therefore, diagnostic testing should not be regarded as an unnecessary inconvenience to the family and expense to society, but as an opportunity to identify children with unmet needs and the interventions required to improve their language and social and academic learning even though they are probably less severely affected than children with LD.
Even though direct assessments by pediatricians did not contribute to the predictive quality of the screening measure its administration within existing systems for general health check-ups might still be considered. Medical reports and recommendations are often highly valued by families. In case of screening fails (parent reported scales) a medical professional may still administer a brief assessment of language comprehension (sentence comprehension subscale), that was very well accepted by the children. A delay in language comprehension can be indicative of a more severe and persisting language problem (3) and may be associated with other more general developmental problems (e.g., general developmental delay, autism spectrum disorder, or hearing loss) and thus require follow-up at a multi-professional diagnostic center. Notably, among the possible barriers to universal language screening, lack of insufficient follow-up was ranked highest by the pediatricians who participated in the study, followed by insufficient training of pediatricians and lack of funding. Screening guided tiered referral pathways to local speech-language therapists and/or specialized multi-professional diagnostic centers (for the smaller number of children with complex needs) might be a costefficient approach.
The limited size of the original validation sample, the high proportion of screening fails in the validation sample and a more severe character of the language delay as compared to those without clinical followup are limitations of this study. However, multiple imputation of missing diagnosis for children without gold standard assessment was used to deal with these biases. Another limitation is the lack of baseline data on other developmental disorders. Whereas, we have provided data on the representativity of the study sample, we do not have information on differences between participating and non-participating pediatricians.

CONCLUSION
Parent-reported screening measures for expressive vocabulary and grammar (SPES-3) administered within regular well-baby check-ups in pediatric primary care have been found to be accurate in identifying LDs in 3-year-olds. Administration of screening subscales for noun-plurals and sentence comprehension by pediatricians showed lower specificity and sensitivity and, when added to the parental assessments, did not improve overall accuracy of the screening package. Feasibility within regular preventive check-ups was rated as mostly good or very good by the pediatricians. Ease of administration and acceptance by parents and pediatricians demonstrated in implementation with a large cohort of non-preselected children that the SPES-3 screening measure is valuable and can be recommended for universal language screening at age three in pediatric primary care.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because parents of the children participating in the study have not given their consent to make the data publicly available. Requests to access the datasets should be directed to daniel.holzinger@bblinz.at.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Ethikkommission Konventhospital Barmherzige Brüder Linz. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.