Edited by: Sergio Machado, Salgado de Oliveira University, Brazil
Reviewed by: Antonio Zuffiano, Liverpool Hope University, United Kingdom; Marie Arsalidou, National Research University – Higher School of Economics, Russia
*Correspondence: Deborah Denman
This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Language impairment refers to difficulties in the ability to comprehend or produce spoken language relative to age expectations (Paul and Norbury,
Language assessments are used for a range of purposes, including initial screening, diagnosis of impairment, identification of focus areas for intervention, decision-making about service delivery, outcome measurement, epidemiological research, and research investigating underlying cognitive skills or neurobiology (Tomblin et al.,
Language assessments used in clinical practice and research applications must have evidence of sound psychometric properties (Andersson,
Previous studies have identified limitations with regards to the psychometric properties of spoken language assessments for school-aged children (McCauley and Swisher,
More recently, literature has focussed on diagnostic accuracy (sensitivity and specificity). Although this information is often lacking in child language assessments, some authors have suggested that diagnostic accuracy should be a primary consideration in the selection of diagnostic language assessments, and have applied the rationale of examining diagnostic accuracy first when evaluating assessments (Friberg,
No previous review investigating the psychometric properties of language assessments for children has been systematic in identifying assessments for review or has included studies published outside of assessment manuals. This matters for two reasons: first, to ensure that all assessments are identified; and second, to ensure that all available evidence for an assessment, including evidence of psychometric properties published in peer-reviewed journals, is considered when making overall judgments. Previous reviews have also lacked a method for evaluating the methodological quality of the studies selected for review. When evaluating psychometric properties, it is important to consider not only the outcomes of studies but also their methodological quality. If the methodological quality of a study is not sound, then its outcomes cannot be viewed as providing psychometric evidence (Terwee et al.,
In the time since previous reviews of child language assessments were conducted, research has also advanced considerably in the field of psychometric evaluation (Polit,
The COSMIN taxonomy describes nine measurement properties relating to domains of reliability, validity and responsiveness. Table
COSMIN domains, psychometric properties, aspects of psychometric properties and similar terms based on Mokkink et al. (
Domain | Measurement property (definition) | Similar terms
Reliability | Internal consistency (the degree of interrelatedness between items) | Internal reliability
Reliability | Reliability (variance in measurements which is due to "true" differences among clients) | Inter-rater reliability
Reliability | Measurement error (systematic and random error of a client's score that is not due to true changes in the construct to be measured) | Standard error of measurement
Validity | Content validity (the degree to which the content of an instrument is an adequate reflection of the construct to be measured) | n/a
Validity | Construct validity (the degree to which scores are consistent with hypotheses based on the assumption that the instrument validly measures the construct to be measured) | n/a
Validity | Aspect of construct validity: structural validity (the degree to which scores reflect the dimensionality of the measured construct) | Internal structure
Validity | Aspect of construct validity: hypothesis testing (idem construct validity; i.e., as defined for construct validity) | Concurrent validity
Validity | Aspect of construct validity: cross-cultural validity (the degree to which the performance of the items on a translated or culturally adapted instrument is an adequate reflection of the performance of the items of the original version of the instrument) | n/a
Validity | Criterion validity (the degree to which scores reflect measurement from a "gold standard") | Sensitivity/specificity (when comparing an assessment with a gold standard)
Responsiveness | Responsiveness (the ability to detect change over time in the construct to be measured) | Sensitivity/specificity (when comparing two administrations of an assessment); changes over time; stability of diagnosis
n/a | Interpretability (the degree to which qualitative meaning can be assigned to quantitative scores obtained from the assessment) | n/a
The aim of this study was to systematically examine and appraise the psychometric quality of diagnostic spoken language assessments for school-aged children using the COSMIN checklist (Mokkink et al.,
Assessments selected for inclusion in the review were standardized norm-referenced spoken language assessments from any English-speaking country with normative data for use with monolingual English-speaking children aged 4–12 years. Only the most recent editions of assessments were included. Initial search results indicated 76 assessments meeting these criteria. As it was not possible to review such a large number of assessments, further exclusion criteria were applied. Assessments were excluded if they had not been published within the last 20 years. It is recognized that norm-referenced assessments should only be used with children whose demographics are represented within the normative sample (Friberg,
For diagnosis of Specific Language Impairment using standardized testing, previous research has recommended the use of composite scores that include measures of both comprehension and production of spoken language across three domains: word (semantics), sentence (morphology and syntax) and text (discourse) (Tomblin et al.,
Given the support in the literature for the use of comprehensive assessments in diagnosis and the wide use of these assessments by speech pathologists, a review of comprehensive language assessments for school-aged children was identified as having particular clinical importance. Therefore, assessments were included in this study if they were the latest edition of a language assessment with normative data for monolingual English-speaking children aged 4–12 years; were published within the last 20 years; were primarily designed as a diagnostic assessment; and were designed to assess language skills across at least two of the following three domains of spoken language: word (semantics), sentence (syntax/morphology) and text (discourse).
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were developed through consensus of an international group to support high quality reporting of the methodology of systematic reviews (Moher et al.,
Flowchart of selection process according to PRISMA.
Database searches of PubMed, CINAHL, PsycINFO, and Embase were conducted between February and March 2014. Searches were conducted with subject headings or MeSH terms to identify relevant articles up until the search date. Free-text word searches were also conducted for the year prior to the search date to identify recently published articles not yet indexed under subject headings. The search strategies are described in Table
Search Terms used in database searches.
Subject Headings | Child, preschool: 2–5 years; Child: 6–12 years | |
English language; Preschool child < 1 to 6 years>; School child < 7 to 12 years> | ||
No limitations | ||
(English[lang]) AND (“child”[MeSH Terms:noexp] OR “child, preschool”[MeSH Terms]) | ||
Free Text Words | English language; Child, preschool: 2–5 years; Child: 6–12 years; Publication date: 20130101-20141231 | |
English language; Preschool child < 1 to 6 years>; School child < 7 to 12 years>; yr = “2013-Current” | ||
English; Preschool age (2–5 years); School Age (6–12 years); Adolescence (13–17 years); Publication year: 2013–2014 | ||
English; Preschool Child: 2–5 years; Child: 6–12 years; Publication date from 2013/01/01 to 2014/02/31 | ||
Gray Literature | No limitations | |
No limitations | ||
Free Text Words |
English Language | |
English language | ||
English | ||
English | ||
English | ||
Gray literature |
English | |
Publication year of assessment to current | ||
No limitations | ||
No limitations |
Assessments were also identified from searches of websites and textbooks. Speech pathology association websites from English-speaking countries were searched, and one website, that of the American Speech-Language-Hearing Association, was identified as having an online directory of assessments. This directory was no longer available online as of 30/01/16. Publisher websites were identified by conducting Google searches with search terms related to language assessment and publishing, and by searching the publisher sites of assessments already identified. These search terms are listed in Table
Published articles relating to psychometric properties of selected assessments were identified through additional database searches conducted between December 2014 and January 2015 using PubMed, CINAHL, Embase, PsycINFO, and HaPI. Searches were conducted using full names of assessments as well as acronyms; and limited to articles written in English and published in or after the year the assessment was published. Articles were included in the psychometric evaluation if they related to one of the selected assessments, contained information on reliability and validity and included children speaking English as a first language in the study. Google Scholar, OpenGrey (
All retrieved articles were reviewed for inclusion by two reviewers independently using selection criteria, with differences in opinion settled by group discussion to reach consensus. All appropriate articles up until the search dates were included.
Across all searches, a total of 1,395 records were retrieved from databases and other sources. The abstracts for these records were reviewed and 1,145 records were excluded as they were not related to language assessment for mono-lingual English-speaking children aged 4–12 years. The full-text versions of the remaining records were then reviewed and 225 records were excluded because they did not provide information on the 15 selected assessments, did not contain information on the reliability and validity of selected assessments, did not examine the study population, or were unpublished or unable to be located. Records were also excluded if they were not an original source of information on the reliability and validity of selected assessments; for example, articles that reviewed results from an earlier study or summarized information from an assessment manual were not included unless they contained new data. A total of 22 records were identified for inclusion, comprising 15 assessment manuals and 7 articles. Figure
Studies selected for inclusion in the review were rated on methodological quality using COSMIN with the outcome from studies then rated against criteria based on Terwee et al. (
The four point COSMIN checklist (
Different methods for scoring the COSMIN 4-point checklist are employed in studies examining the methodology of psychometric studies. One suggested method is a “worst rating counts” system, where each measurement property is given the score of the item with the lowest rating (Terwee et al.,
In this current study, the scores for each item were "averaged" to give an overall rating for each measurement property. This provides information on the methodological quality in general for the studies that were rated. In the scoring process, the appropriate measurement properties were identified and rated on the relevant items. The options of "excellent," "good," "fair," and "poor" on the 4-point checklist were ranked numerically, with "excellent" being the highest score and "poor" the lowest. As the current version of the COSMIN 4-point scale was designed for a "worst rating counts" method, some items do not have options for "fair" or "poor"; this was adjusted for in the percentage calculation so that the lowest possible option for each item was assigned a score of 0. As each measurement property has a different number of items, or may have items that are not applicable to a particular study, the number of items rated may differ across measurement properties or across studies. Therefore, overall scores for each measurement property rated from each study were calculated as a percentage of points received out of the total possible points that a study could have received for that measurement property. The resulting percentages for each measurement property were then classified according to quartile, that is: "Poor" = 0–25%, "Fair" = 25.1–50%, "Good" = 50.1–75%, and "Excellent" = 75.1–100% (Cordier et al.,
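For clarity, the percentage-and-quartile scoring described above can be expressed as a short calculation. The sketch below is illustrative only: the item labels, ratings and function names are hypothetical, and it simplifies the adjustment for items lacking "fair" or "poor" options by assuming all four options are available.

```python
# Illustrative sketch of the "averaged" COSMIN scoring described above.
# Item ratings and function names are hypothetical; the checklist itself
# defines which items apply to each measurement property.

OPTION_SCORES = {"excellent": 3, "good": 2, "fair": 1, "poor": 0}

def property_percentage(item_ratings):
    """Percentage of points received out of the total points possible."""
    received = sum(OPTION_SCORES[r] for r in item_ratings)
    possible = max(OPTION_SCORES.values()) * len(item_ratings)
    return 100.0 * received / possible

def classify(percentage):
    """Classify a percentage score by quartile, as in this review."""
    if percentage <= 25:
        return "Poor"
    if percentage <= 50:
        return "Fair"
    if percentage <= 75:
        return "Good"
    return "Excellent"

# A hypothetical internal consistency study rated on four checklist items.
ratings = ["excellent", "good", "good", "fair"]
pct = property_percentage(ratings)
print(f"{pct:.1f}% -> {classify(pct)}")  # 66.7% -> Good
```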
The findings from studies with “fair” or higher COSMIN ratings were subsequently appraised using criteria based on Terwee et al. (
Criteria for measuring quality of findings for studies examining measurement properties based on Terwee et al. (
Internal consistency | + | Subtests unidimensional (determined through factor analysis with adequate sample size) and Cronbach's alpha between 0.70 and 0.95
 | ? | Dimensionality of subtests unknown (no factor analysis) or Cronbach's alpha not calculated
 | − | Subtests unidimensional (determined through factor analysis with adequate sample size) and Cronbach's alpha < 0.70 or > 0.95
 | ± | Conflicting results
 | NR | No information found on internal consistency
 | NE | Not evaluated due to "poor" methodology rating on COSMIN
Reliability | + | ICC or weighted Kappa equal to or greater than 0.70
 | ? | ICC or weighted Kappa not calculated, or doubtful design or method (e.g., time interval not appropriate)
 | − | ICC or weighted Kappa < 0.70 despite adequate methodology
 | ± | Conflicting results
 | NR | No information found on reliability
 | NE | Not evaluated due to "poor" methodology rating on COSMIN
Measurement error | + | MIC > SDC, or MIC outside the LOA
 | ? | MIC not defined, or doubtful design or method
 | − | MIC < SDC, or MIC equals or inside the LOA, despite adequate methodology
 | ± | Conflicting results
 | NR | No information found on measurement error
 | NE | Not evaluated due to "poor" methodology rating on COSMIN
Content validity | + | Good methodology (i.e., an overall rating of "Good" or above on COSMIN criteria for content validity) and experts examined all items for content and cultural bias during development of the assessment
 | ? | Questionable methodology, or experts only employed to examine one aspect (e.g., cultural bias)
 | − | No expert reviewer involvement
 | ± | Conflicting results
 | NR | No information found on content validity
 | NE | Not evaluated due to "poor" methodology rating on COSMIN
Structural validity | + | Factor analysis performed with adequate sample size and factors explain at least 50% of variance
 | ? | No factor analysis, inadequate sample size, or explained variance not mentioned
 | − | Factors explain < 50% of variance despite adequate methodology
 | ± | Conflicting results
 | NR | No information found on structural validity
 | NE | Not evaluated due to "poor" methodology rating on COSMIN
Hypothesis testing | + | Convergent validity: correlation with assessments measuring similar constructs equal to or greater than 0.50, and correlation consistent with hypothesis
 | ? | Questionable methodology (e.g., only correlated with assessments that are not deemed similar)
 | − | Discriminant validity: findings inconsistent with hypotheses (e.g., no significant difference identified from appropriate statistical analysis)
 | ± | Conflicting results
 | NR | No information found on hypothesis testing
 | NE | Not evaluated due to "poor" methodology rating on COSMIN
MIC, minimal important change; SDC, smallest detectable change; LOA, limits of agreement; ICC, intra-class correlation coefficient.
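To make the internal consistency criterion above concrete, the following sketch computes Cronbach's alpha for a small, hypothetical set of item scores; a value between 0.70 and 0.95 would satisfy the "+" criterion, provided unidimensionality of the subtest had also been supported by factor analysis. The data and function name are illustrative and are not drawn from any reviewed assessment.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_children, n_items) matrix of item scores."""
    X = np.asarray(item_scores, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1)      # variance of each item
    total_variance = X.sum(axis=1).var(ddof=1)  # variance of the summed score
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical data: six children scored on four items of one subtest.
scores = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [3, 3, 4, 3],
    [1, 2, 2, 1],
    [5, 4, 5, 5],
    [3, 4, 3, 3],
]
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```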
Overall evidence ratings for each measurement property for each assessment were then determined by considering the available evidence from all studies. These ratings were assigned based on the methodological quality of the available studies (as rated using COSMIN) and the quality of the findings from the studies (as defined in Table
Level of evidence for psychometric quality for each measurement property based on Schellingerhout et al. (
Strong evidence | +++ or −−− | Consistent findings across 2 or more studies of “good” methodological quality OR one study of “excellent” methodological quality |
Moderate evidence | ++ or −− | Consistent findings across 2 or more studies of “fair” methodological quality OR one study of “good” methodological quality |
Weak evidence | + or − | One study of “fair” methodological quality (examining convergent or discriminant validity if rating hypothesis testing) |
Conflicting evidence | ± | Conflicting findings across different studies (i.e., different studies with positive and negative findings) |
Unknown | ? | Only available studies are of “poor” methodological quality |
Not evaluated | NA | No studies examining the measurement property were available for evaluation |
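As an illustration of how the rules in the table above combine study quality and consistency of findings, the following sketch maps the COSMIN ratings of the available studies for one measurement property to a level of evidence. It is a simplified, hypothetical rendering of the table, not the exact procedure used in this review.

```python
def level_of_evidence(study_qualities, findings_consistent=True):
    """Simplified rendering of the level-of-evidence rules in the table above.

    `study_qualities` holds the COSMIN methodology ratings ("excellent",
    "good", "fair" or "poor") of all available studies for one measurement
    property of one assessment.
    """
    qualities = [q.lower() for q in study_qualities]
    if not qualities or all(q == "poor" for q in qualities):
        return "Unknown (?)"
    if not findings_consistent:
        return "Conflicting (±)"
    good_or_better = sum(q in ("good", "excellent") for q in qualities)
    fair_or_better = sum(q in ("fair", "good", "excellent") for q in qualities)
    if "excellent" in qualities or good_or_better >= 2:
        return "Strong (+++ or ---)"
    if good_or_better == 1 or fair_or_better >= 2:
        return "Moderate (++ or --)"
    return "Weak (+ or -)"

print(level_of_evidence(["good", "fair"]))   # Moderate (++ or --)
print(level_of_evidence(["excellent"]))      # Strong (+++ or ---)
print(level_of_evidence(["poor", "poor"]))   # Unknown (?)
```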
To limit the size of this review, selected assessments were not appraised on the measurement property of responsiveness. Interpretability is not considered a psychometric property and was also not reviewed. However, given the clinical importance of responsiveness and interpretability, it is recommended that these properties be a target for future research. Cross-cultural validity applies when an assessment has been translated or adapted from another language. As all the assessments reviewed in this study were originally published in English, cross-cultural validity was not rated. However, it is acknowledged that the use of English-language assessments with the different dialects and cultural groups that exist across the broad range of English-speaking countries is an area that requires future investigation. Criterion validity was also not evaluated in this study, as this measurement property refers to comparison of an assessment with a diagnostic "gold standard" (Mokkink et al.,
Diagnostic accuracy, which includes calculations of sensitivity, specificity and predictive power, is an area that does not clearly fall under a COSMIN measurement property. However, current literature identifies it as an important consideration for child language assessment (Spaulding et al.,
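Because positive and negative predictive power depend on the base rate assumed, the relationship between sensitivity, specificity and predictive power can be made explicit with a short calculation. The values below are hypothetical and do not correspond to any reviewed assessment.

```python
def predictive_values(sensitivity, specificity, base_rate):
    """Positive and negative predictive power for a given base rate
    (prevalence) of language impairment in the tested population."""
    tp = sensitivity * base_rate              # true positives
    fp = (1 - specificity) * (1 - base_rate)  # false positives
    tn = specificity * (1 - base_rate)        # true negatives
    fn = (1 - sensitivity) * base_rate        # false negatives
    return tp / (tp + fp), tn / (tn + fn)

# Hypothetical values: sensitivity 0.90, specificity 0.85.
for base_rate in (0.10, 0.20):
    ppv, npv = predictive_values(0.90, 0.85, base_rate)
    print(f"base rate {base_rate:.0%}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
# At a 10% base rate PPV is roughly 0.40; at 20% it rises to roughly 0.60,
# which is why the base rate assumed in a manual matters when interpreting
# reported predictive power.
```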
Where several studies examining one measurement property were included in a manual, one rating was provided based on information from the study with the best methodology. For example, if a manual included internal consistency studies using different populations, then a rating for internal consistency was given based on the study with the most comprehensive or largest sample. The exceptions were reliability, where test-retest and inter-rater reliability were rated separately, and hypothesis testing, where convergent validity and discriminant validity were rated separately. In most cases, these different reliability and hypothesis testing studies were conducted using different sample sizes and different statistical analyses. As manuals that include both types of study for a measurement property provide evidence across different aspects of that property, these were counted as separate studies so that this could be reflected in the final data.
Some assessments also included hypothesis testing studies examining gender, age and socio-cultural differences. Whilst this contributes important information on an assessment's usefulness, we identified convergent validity and discriminant validity as the key aspects of the measurement property of hypothesis testing and thus only included these studies in this review.
All possible items for each assessment were rated from all identified publications. Where an examination of a particular measurement property was not reported in a publication, or not reported in enough detail to be rated, this was rated as "not reported" (NR). Two raters were involved in appraising publications. To ensure consistency, both raters trained as part of a group prior to rating the publications for this study. The first rater rated all publications, with a random sample of 40% of publications also rated independently by the second rater. Inter-rater reliability between the two raters was calculated and determined to be adequate (weighted Kappa = 0.891; SEM = 0.020; 95% confidence interval = 0.851–0.931). Any differences in opinion were discussed, and the first rater then appraised the remaining 60% of articles applying rating judgments agreed upon after consensus discussions.
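For readers wishing to reproduce this kind of agreement analysis, a weighted kappa can be computed as in the sketch below. The ratings shown are hypothetical, and quadratic weighting is assumed here purely for illustration; the weighting scheme used in this review is not specified in the sketch.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical COSMIN item ratings from two independent raters,
# coded 0 = poor, 1 = fair, 2 = good, 3 = excellent.
rater_1 = [3, 2, 2, 1, 3, 0, 2, 1, 3, 2]
rater_2 = [3, 2, 1, 1, 3, 1, 2, 1, 2, 2]

# Quadratic weighting is one common choice for ordinal ratings.
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
print(f"weighted kappa = {kappa:.3f}")
```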
A total of 22 publications were identified for inclusion in this review. These included 15 assessment manuals and seven journal articles relating to a total of 15 different assessments. From the 22 publications, 129 eligible studies were identified, including three studies that provided information on more than one of the 15 selected assessments. Eight of these 129 studies reported on diagnostic accuracy and were included in the review, but were not rated using COSMIN, leaving 121 articles to be rated for methodological quality. Of the 15 selected assessments, six were designed for children younger than 8 years and included the following assessments: Assessment of Literacy and Language (ALL; nine studies), Clinical Evaluation of Language Fundamentals: Preschool-2nd Edition (CELF:P-2; 14 studies), Reynell Developmental Language Scales-4th Edition (NRDLS; six studies), Preschool Language Scales-5th Edition (PLS-5; nine studies), Test of Early Language Development-3rd Edition (TELD-3; nine studies) and Test of Language Development-Primary: 4th Edition (TOLD-P:4; nine studies). The Test of Language Development-Intermediate: 4th Edition (TOLD-I:4; nine studies) is designed for children older than 8 years. The remaining eight assessments covered most of the 4–12 primary school age range selected for this study and included the following assessments: Assessment of Comprehension and Expression (ACE 6-11; seven studies), Comprehensive Assessment of Spoken Language (CASL; 12 studies), Clinical Evaluation of Language Fundamentals-5th Edition (CELF-5; nine studies), Diagnostic Evaluation of Language Variance-Norm Referenced (DELV-NR; ten studies), Illinois Test of Psycholinguistic Abilities-3rd Edition (ITPA-3; eight studies), Listening Comprehension Test-2nd Edition (LCT-2; seven studies), Oral and Written Language Scales-2nd Edition (OWLS-2; eight studies) and Woodcock Johnson 4th Edition Oral Language (WJIVOL; six studies). These 15 selected assessments are summarized in Table
Summary of assessments included in the review.
Assessment | Age range | Language domains assessed
ACE 6-11 | 6–11 years | Spoken language including pragmatics
ALL | – | Spoken and written language skills including phonemic awareness
CASL | 3–21 years | Spoken language including pragmatics
CELF-5 | 5;0–21;11 years | Spoken language; supplemental tests for reading, writing and pragmatics
CELF:P-2 | 3;0–6;11 years | Spoken language
DELV-NR | 4–9 years | Spoken language
ITPA-3 | 5;0–12;11 years | Spoken and written language
LCT-2 | 6–11 years | Spoken language
NRDLS | 3;0–7;5 years | Spoken language
OWLS-II | 3–21 years | Spoken language
PLS-5 | Birth–7;11 years | Spoken language
TELD-3 | 3;0–7;11 years | Spoken language
TOLD-I:4 | 8;0–17 years | Spoken language
TOLD-P:4 | 4;0–8;11 years | Spoken language
WJIVOL | 2–90 years | Spoken language
During the selection process, 61 assessments were excluded as not meeting the study criteria. These assessments are summarized in Table
Summary of assessments excluded from the review.
1 | Adolescent Language Screening Test (ALST) | Morgan and Gillford (1984) | 11–17 | Pragmatics, receptive vocabulary, expressive vocabulary, sentence formulation, morphology and phonology | Not published within last 20 years |
2 | Aston Index Revised (Aston) | Newton and Thomson (1982) | 5–14 | Receptive language, written language, reading, visual perception, auditory discrimination | Not published within last 20 years |
3 | Bracken Basic Concept Test-Expressive (BBCS:E) | Bracken (2006) | 3–6;11 | Expressive: basic concepts | Not comprehensive language assessment |
4 | Bracken Basic Concept Test-3rd Edition Receptive (BBCS:3-R) | Bracken (2006) | 3–6;11 | Receptive: basic concepts | Not comprehensive language assessment |
5 | Bankson Language Test-Second Edition (BLT-2) | Bankson (1990) | 3;0–6;11 | Semantics, syntax/morphology and pragmatics | Not published within last 20 years |
6 | Boehm Test of Basic concepts-3rd Edition (Boehm-3) | Boehm (2000) | Grades K-2 (US) | Basic concepts | Not comprehensive language assessment |
7 | Boehm Test of Basic Concepts Preschool-3rd Edition (Boehm-3 Preschool) | Boehm (2001) | 3;0–5;11 | Relational concepts | Not comprehensive language assessment |
8 | British Picture Vocabulary Scale-3rd Edition (BPVS-3) | Dunn et al. (2009) | 3–16 | Receptive vocabulary | Not comprehensive language assessment
9 | Clinical Evaluation of Language Fundamentals–5th Edition Metalinguistics (CELF-5 Metalinguistic) | Wiig and Secord (2013) | 9;0–21;0 | Higher level language: making inferences, conversation skills, multiple meanings and figurative language | Not comprehensive language assessment |
10 | Clinical Evaluations of Language Fundamentals-5th Edition Screening (CELF-5 Screening) | Semel et al. (2013) | 5;0–21;11 | Receptive and expressive semantics and syntax | Screening assessment |
11 | Comprehensive Receptive and Expressive Vocabulary Test-Third Edition (CREVT-3) | Wallace and Hammill (2013) | 5–89 | Receptive and expressive vocabulary | Not comprehensive language assessment
12 | Compton Speech and Language Screening Evaluation-Revised Edition | Compton (1999) | 3–6 | Expressive and receptive language, articulation, auditory memory and oral-motor co-ordination | Screening Assessment |
13 | Executive Functions Test Elementary | Bowers and Huisingh (2014) | 7;0–12;11 | Higher level language: working memory, problem solving, inferring and making predictions | Not comprehensive language assessment |
14 | Expressive Language Test-2nd Edition (ELT-2) | Bowers Huisingh et al. (2010) | 5;0–11;0 | Expressive language: sequencing, metalinguistics, grammar and syntax | Not comprehensive language assessment |
15 | Expressive One-Word Vocabulary Test-4th Edition (EOWPVT-4) | Martin and Brownell (2011) | 2–80 | Expressive vocabulary (picture naming) | Not comprehensive language assessment |
16 | Expression, Reception and Recall of Narrative Instrument (ERRNI) | Bishop (2004) | 4–15 | Narrative skills: story comprehension and retell | Not comprehensive language assessment |
17 | Expressive Vocabulary Test-Second Edition (EVT-2) | Williams (2007) | 2;6–90+ | Expressive vocabulary and word retrieval | Not comprehensive language assessment |
18 | Fluharty Preschool Screening Test-Second Edition (FPSLST-2) | Fluharty (2000) | 3;0–6;11 | Receptive and expressive language: sentence repetition, answering questions, describing actions, sequencing events and articulation. | Screening Assessment |
19 | Fullerton Language Test for Adolescents-Second Edition (FLTA-2) | Thorum (1986) | 11-Adult | Receptive and expressive language | Not published within last 20 years
20 | Grammar and Phonology Screening Test (GAPS) | Van der Lely (2007) | 3;5–6;5 | Grammar and pre reading skills | Not Comprehensive language assessment |
21 | Kaufman Survey of Early Academic and Language Skills (K-SEALS) | Kaufman and Kaufman (1993) | 3;0–6;11 | Expressive and receptive vocabulary, numerical skills and articulation | Not published in last 20 years |
22 | Kindergarten Language Screening Test-Second Edition (KLST-2) | Gauthier and Madison (1998) | 3;6–6;11 | General language: question comprehension, following commands, sentence repetition, comparing and contrasting objects and spontaneous speech | Screening Assessment |
23 | Language Processing Test 3 Elementary (LPT-3:P) | Richard and Hanner (2005) | 5–11 | Expressive semantics: word association, categorizing words, identifying similarities between words, defining words, describing words | Not comprehensive language assessment |
24 | Montgomery Assessment of Vocabulary Acquisition (MAVA) | Montgomery (2008) | 3–12 | Receptive and expressive vocabulary | Not comprehensive language assessment |
25 | Northwestern Syntax Screening Test (NSST) | Lee (1969) | Unknown | Syntax and morphology | Not published in last 20 years
26 | Peabody Picture Vocabulary test-4th Edition (PPVT-IV) | Dunn and Dunn (2007) | 2;6–90 | Receptive vocabulary | Not comprehensive language assessment |
27 | Pragmatic Language Skills Inventory (PLSI) | Gillam and Miller (2006) | 5;0–12;11 | Pragmatics | Not comprehensive language assessment
28 | Preschool Language Assessment Instrument-Second Edition (PLAI-2) | Blank et al. (2003) | 3.0–5;11 | Discourse | Not comprehensive language assessment |
29 | Preschool Language Scales-5th Edition Screener (PLS-5 Screener) | Zimmerman (2013) | Birth-7;11 | General language | Screening assessment |
30 | Receptive One-Word Picture Vocabulary Tests-Fourth Edition (ROWPVT-4) | Martin and Brownell (2010) | 2;0–70 | Receptive vocabulary | Not comprehensive language assessment |
31 | Renfrew Action Picture Test-Revised (RAPT-Revised) | Renfrew (2010) | 3–8 | Expressive language: information content, syntax and morphology | Not comprehensive language assessment |
32 | Renfrew Bus Story-Revised edition (RBS-Revised) | Renfrew (2010) | 3–8 | Narrative retell | Not comprehensive language assessment |
33 | Rhode Island Test of Language Structure | Engen and Engen (1983) | 3–6 | Receptive syntax (designed for hearing impairment but has norms for non-hearing impairment) | Not comprehensive language assessment |
34 | Screening Kit of Language Development (SKOLD) | Bliss and Allen (1983) | 2–5 | General language | Not published within last 20 years |
35 | Screening Test for Adolescent Language (STAL) | Prather and Breecher (1980) | 11–18 | General language | Not published in last 20 years |
36 | Social Emotional Evaluation (SEE) | Wiig (2008) | 6;0–12;0 | Social skills and higher level language | Not comprehensive language assessment |
37 | Social Language Development Test Elementary (SLDT-E) | Bowers et al. (2008) | 6–11 | Language for social interaction | Not comprehensive language assessment |
38 | Structured Photographic Expressive Language Test-Third Edition (SPELT-3) | Dawson and Stout (2003) | 4;0–9;11 | Expressive syntax and morphology | Not comprehensive language assessment
39 | Structured Photographic Expressive Language Test Preschool-2nd Edition (SPELT-P:2) | Dawson et al. (2005) | 3;0–5;11 | Expressive syntax and morphology | Not comprehensive language assessment |
40 | Test for Auditory Comprehension of Language-Fourth Edition (TACL-4) | Carrow-Woolfolk (2014) | 3;0–12;11 | Receptive vocabulary, syntax and morphology | Not comprehensive language assessment |
41 | Test of Auditory Reasoning and processing skills (TARPS) | Gardner (1993) | 5–13;11 | Auditory processing: verbal reasoning, inferences, problems solving, acquiring and organizing information | Not published within last 20 years |
42 | Test for Examining Expressive Morphology (TEEM) | Shipley (1983) | 3;0–7;0 | Expressive morphology | Not published within last 20 years |
43 | Test of Early Grammatical Impairment (TEGI) | Rice and Wexler (2001) | 3;0–8;0 | Syntax and morphology | Not comprehensive language assessment
44 | Test of Early Grammatical Impairment-Screener (TEGI-Screener) | Rice and Wexler (2001) | 3–6;11 | Syntax and morphology | Screening assessment
45 | Test of Language Competence-Expanded (TLC-E) | Wiig and Secord (1989) | 5;0–18;0 | Semantics, syntax and pragmatics | Not published within last 20 years |
46 | Test of Narrative language (TNL) | Gillam and Pearson (2004) | 5;0–11;11 | Narrative retell | Not comprehensive language assessment |
47 | Test of Pragmatic Language-Second Edition (TOPL-2) | Phelps-Terasaki and Phelps-Gunn (2007) | 6;0–18;11 | Pragmatic skills | Not comprehensive language assessment
48 | Test of Problem Solving 3 Elementary (TOPS-3-Elementary) | Bowers et al. (2005) | Language-based thinking | Not comprehensive language assessment | |
49 | Test for Reception of Grammar (TROG-2) | Bishop (2003) | 4+ | Receptive grammar | Not comprehensive language assessment
50 | Test of Semantic Skills-Intermediate (TOSS-I) | Huisingh et al. (2004) | 9–13 | Receptive and expressive semantics | Not comprehensive language assessment |
51 | Test of Semantic Skills-Primary (TOSS-P) | Bowers et al. (2002) | 4–8 | Receptive and expressive semantics | Not comprehensive language assessment |
52 | Test of Word Finding-Second Edition (TWF-2) | German (2000) | 4;0–12;11 | Expressive vocabulary: word finding | Not comprehensive assessment |
53 | Test of Word Finding in Discourse (TWFD) | German (1991) | 6;6–12;11 | Word finding in discourse | Not comprehensive assessment |
54 | Test of Word Knowledge (TOWK) | Wiig and Secord (1992) | 5–17 | Receptive and expressive vocabulary | Not published within last 20 years
55 | Token Test for Children-Second Edition (TTFC-2) | McGhee et al. (2007) | 3;0–12;11 | Receptive: understanding of spoken directions | Not comprehensive language assessment
56 | Wellcomm: A speech and language toolkit for the early years (Screening tool) English norms | Sandwell Primary Care Trust | 6 months–6 years | General language | Screening Assessment |
57 | Wh—question comprehension test | Vicker (2002) | 4-Adult | Wh-question comprehension | Not comprehensive language assessment |
58 | Wiig Assessment of Basic Concepts (WABC) | Wiig (2004) | 2;6–7;11 | Receptive and expressive: basic concepts | Not comprehensive assessment |
59 | Word Finding Vocabulary Test-Revised Edition (WFVT) | Renfrew (2010) | 3–8 | Expressive vocabulary: word finding | Not comprehensive language assessment |
60 | The WORD Test 2 Elementary (WORD-2) | Bowers et al. (2004) | 6–11 | Receptive and expressive vocabulary | Not comprehensive language assessment |
61 | Utah Test of Language Development (UTLD-4) | Mecham (2003) | 3;0–9;11 | Expressive semantics, syntax and morphology | Not comprehensive language assessment |
The seven identified articles were sourced from database searches and gray literature. These included studies investigating structural and convergent validity (hypothesis testing) of the CASL (Reichow et al.,
Articles selected for review.
Eadie et al., |
CELF-P:2 (Australian) Diagnostic accuracy | Investigation of sensitivity and specificity of CELF:P-2 at age 4 years against Clinical Evaluation of Language Fundamentals-4th Edition (CELF-4) at age 5 years |
Hoffman et al., |
CASL Structural Validity Hypothesis testing | Investigation of the construct (structural) validity of the CASL using factor analysis. Investigation of convergent validity between the CASL and Test of Language Development-Primary: 3rd Edition (TOLD-P:3) |
Kaminski et al., |
CELF-P:2 Hypothesis testing | Investigation of predictive validity and convergent validity between CELF:P-2 and Preschool Early Literacy Indicators (PELI) |
McKown et al., |
CASL Internal consistency Reliability (test-retest) | Examination of the internal consistency of the Pragmatic Judgment subtest of the CASL Examination of test-retest reliability of the Pragmatic Judgment subtest of the CASL |
Pesco and O'Neill, |
CELF:P-2 DELV-NR Hypothesis testing | Investigation of whether performance on the DELV-NR and CELF:P-2 could be predicted by the Language Use Inventory (LUI)
Reichow et al., |
CASL Hypothesis testing | Examination of the convergent validity between selected subtests from the CASL with the Vineland Adaptive Behavior Scales |
Spaulding, |
TELD-3 Hypothesis testing | Investigation of consistency between severity classification on the TELD-3 and the Utah Test of Language Development-4th Edition (UTLD-4) |
None of the assessment manuals for the selected assessments were available through open sources; they could only be accessed by purchasing the assessment. Only three published articles by authors of assessments were identified. One of these contained information on the development, standardization and psychometric properties of the NRDLS (Letts et al.,
The results of the COSMIN ratings of the psychometric quality of the 15 assessments are listed in Table
Ratings of methodological quality and study outcome of reliability and validity studies for selected assessments.
ACE6-11 | ACE6-11 Manual | 77.8 |
Test-retest 75.9 Excell |
53.3 |
42.9 Fair |
25 |
Convergent 52.2 Good |
ALL | ALL Manual | 75.0 |
Test-retest 72.4 Good |
20 |
92.9 Excell |
33.3 |
Convergent 52.2 Good |
CASL | CASL Manual | 57.1 |
Test-retest 56.0 |
40 |
71.4 Good |
33.3 |
Convergent 39.1 Fair |
Hoffman et al., |
NR | NR | NR | NR | 33.3 |
Convergent 73.9 Good |
|
McKown et al., |
83.3 |
Test-retest 62.0 |
NR | NR | NR | NR | |
Reichow et al., |
NR | NR | NR | NR | NR | Convergent 52.2 Good |
|
CELF-5 | CELF-5 Manual | 71.4 |
Test-retest 72.4 Good |
40 |
71.4 Good |
58.3 Good |
Convergent 65.2 Good |
CELF:P-2 | CELF:P-2 Manual | 71.4 |
Test-retest 72.4 Good |
40 |
64.3 Good |
33.3 |
Convergent 47.8 Fair |
Kaminski et al., |
NR | NR | NR | NR | NR | Convergent 56.5 Good |
|
Pesco and O'Neill, |
NR | NR | NR | NR | NR | Convergent 47.8 Good |
|
NR | NR | NR | NR | NR | Convergent 65.2 Good |
||
NR | NR | NR | NR | NR | Convergent 69.6 Good |
||
DELV-NR | DELV-NR Manual | 66.7 |
Test-retest 69 Good |
40 |
57.1 Good |
50 |
Convergent 34.8 Fair |
NR | NR | NR | NR | NR | Convergent 47.8 Good |
||
ITPA-3 | ITPA-3 Manual | 71.4 |
Test-retest 62.1 Good |
40 |
57.1 Fair |
50 Fair |
Convergent 34.7 Fair |
LCT-2 | LCT-2 Manual | 50 |
Test-retest 34.6 Fair |
40 |
28.5 Fair |
50 |
Discriminant 29.4 |
NRDLS | NRDLS Manual | 66.7 |
Test-retest 60.0 Good |
40.0 |
57.1 Good |
NR | Convergent 52.2 Good |
OWLS-II | OWLS-II Manual | 57.1 |
Test-retest 72.4 Good |
40 |
71.4 Good |
33.4 |
Convergent 21.7 Poor NR Discriminant 47.1 Fair |
PLS-5 | PLS-5 Manual | 50 |
Test-retest 69.0 Good |
40 |
71.4 Good |
57.1 |
Convergent 56.5 Good |
TELD-3 | TELD-3 Manual | 61.1 |
Test-retest 72.4 Good |
33.4 |
71.4 Good |
41.7 |
Convergent 39.1 Fair |
Spaulding, |
NR | NR | NR | NR | NR | Convergent 47.8 Fair |
|
TOLD-I:4 | TOLD-I:4 Manual | 71.4 |
Test-retest 72.4 Good |
40 |
57.1 Fair |
33.4 |
Convergent 60.9 Good |
TOLD-P:4 | TOLD-P:4 Manual | 71.4 |
Test-retest 69.0 Good |
40 |
57.1 Fair |
50 |
Convergent 60.9 Good |
WJIVOL | WJIVOL Manual | 57.2 |
NE | 40 |
78.6 Excell |
50 |
Convergent 43.5 Fair |
Ratings for each measurement property are shown as a percentage of the total points available and classified according to the quartile in which the percentage falls: Excellent (Excell) = 75.1–100, Good = 50.1–75, Fair = 25.1–50, and Poor = 0–25. Rating measurement properties on the percentage of all items allows the overall quality of a study to be considered; however, it also means that studies could be rated "excellent" or "good" overall even when individual items were rated "poor" for methodology. The footnotes in Table
Studies with COSMIN ratings of “fair” or higher were then rated on the evidence provided in the study outcome for each measurement property using the criteria as summarized in Table
The overall rating given after considering the methodological quality and outcome of all available studies (Table
Level of evidence for each assessment based on Schellingerhout et al. (
ACE6-11 | ? | ? | ? | ? | ? | ++ |
ALL | ? | ? | ? | +++ | ? | +++ |
CASL | ? | ? | ? | ? | ? | ++ |
CELF-5 | ? | ++ | ? | ++ | ? | +++ |
CELF:P-2 | ? | ? | ? | ++ | ? | +++ |
DELV-NR | ? | ? | ? | ? | ? | ? |
ITPA-3 | ? | ? | ? | ? | ? | + |
LCT-2 | ? | ? | ? | ? | ? | + |
NRDLS | ? | ? | ? | ? | NA | ++ |
OWLS-II | ? | + | ? | ? | ? | + |
PLS-5 | ? | ? | ? | ++ | ? | +++ |
TELD-3 | ? | ? | ? | ? | ? | + |
TOLD-I:4 | ? | ? | ? | ? | ? | ++ |
TOLD-P:4 | ? | ? | ? | ? | ? | ++ |
WJIVOL | ? | NA | ? | ? | ? | + |
For seven assessments, studies examining diagnostic accuracy were identified. This information came from the respective manuals and one article. Data on sensitivity, specificity, positive predictive power and negative predictive power for these seven assessments are presented in Table
Diagnostic Accuracy data reported for each assessment.
ALL | ALL Manual | 10% base rate for population sample; |
−1 SD = 98 |
−1SD = 89 |
10% base rate: |
10% base rate: |
CELF-5 | CELF-5 Manual | 10% base rate for population sample; |
−1 SD = 100 |
−1 SD = 91 |
10% base rate: |
10% base rate: |
CELF:P-2 | CELF:P-2 Manual | 20% base rate for population sample; |
NR | NR | 20% base rate: |
20%base rate: |
Eadie et al., |
CELF-P:2 scores at 4 years against CELF-4 scores at 5 years | −1.25 SD = 64.0 |
−1.25 SD = 92.9 |
NR | NR | |
DELV-NR | DELV-NR Manual | 10% base rate for population sample; |
−1 SD = 95 |
−1 SD = 93 |
10% base rate: |
10% base rate: |
PLS-5 | PLS-5 Manual | 20% base rate for population sample; |
With standard score 85 as cut-off = 91 | With standard score 85 as cut-off = 78 | 20% base rate: |
20% base rate: |
TOLD-I:4 | TOLD-I:4 Manual | Criterion against other assessments: |
With Standard Score 90 as cut-off: |
With Standard Score 90 as cut-off: |
With Standard Score 90 as cut-off: |
NR |
TOLD-P:4 | TOLD-P:4 Manual | Criterion against other assessments: |
With Standard Score 90 as cut-off: |
With Standard Score 90 as cut-off: |
With Standard Score 90 as cut-off: |
NR |
It should be noted that whilst these results from diagnostic accuracy studies are reported without being rated for methodological quality, significant methodological concerns were noted and are reported in the discussion section of this study.
In this study, a total of 121 studies across all six measurement properties were rated for methodological quality. Of these, 5 were rated as “excellent” for overall methodological quality, 55 rated as “good,” 56 rated as “fair,” and 5 rated as “poor.” However, whilst almost half (
Overall, across all measurement properties, reporting on missing data was insufficient, with few studies providing information on the percentage of missing items or a clear description of how missing data was handled. Bias may be introduced if missing data is not determined as being random (Bennett,
A lack of clarity in reporting of statistical analysis was also noted, with a number of assessment manuals not clearly reporting the statistics used. For example, studies used terms such as “correlation” or “coefficient” without specifying the statistical procedure used in calculations. Where factor analysis or intra-class correlations were applied in structural validity or reliability studies, few studies reported details such as the rotational method or formula used. Lack of clear reporting creates difficulty for independent reviewers and clinicians to appraise and compare the quality of evidence presented in studies.
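As an example of the kind of explicit reporting being called for, an intra-class correlation could be reported together with its model, for instance ICC(2,1) (two-way random effects, absolute agreement, single measures; Shrout and Fleiss, 1979). The sketch below implements that specific formula directly so that the calculation is transparent; the data are hypothetical.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measures
    (Shrout and Fleiss, 1979). `ratings` is an (n_subjects, k_raters) array."""
    X = np.asarray(ratings, dtype=float)
    n, k = X.shape
    grand_mean = X.mean()
    ms_subjects = k * ((X.mean(axis=1) - grand_mean) ** 2).sum() / (n - 1)
    ms_raters = n * ((X.mean(axis=0) - grand_mean) ** 2).sum() / (k - 1)
    ss_error = (((X - grand_mean) ** 2).sum()
                - ms_subjects * (n - 1) - ms_raters * (k - 1))
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n)

# Hypothetical scores for five children, each rated by two raters.
data = [[12, 13], [18, 17], [9, 10], [15, 15], [21, 19]]
print(f"ICC(2,1) = {icc_2_1(data):.2f}")
```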
COSMIN ratings for
With regards to
COSMIN ratings for
Ratings for
COSMIN ratings for
Five assessment manuals (ACE6-11, DELV-NR, LCT-2, PLS-5, and TELD-3) did not report on a structural validity study using factor analysis but reported on correlations between subtests; however, this is not sufficient evidence of structural validity according to COSMIN. One assessment (NRDLS) did not provide any evidence to support structural validity through either factor analysis or an examination of correlations between subtests. Structural validity studies are important to examine the extent to which an assessment reflects the underlying constructs being measured in both the overall score and the subtests.
The majority of studies relating to
Studies on
Diagnostic accuracy studies were not rated for methodological quality; however, significant methodological flaws were noted in the reporting of information. The evaluated article (Eadie et al.,
An important finding was that all the studies examined in this review used statistical methods solely from classical test theory (CTT), as opposed to item response theory (IRT). Although some manuals made reference to the use of IRT methods in the initial development of assessment items, no studies reported details or outcomes of these methods. Whilst COSMIN does not currently indicate a preference between these two approaches, IRT methods are increasingly being utilized in the development of assessments within fields such as psychology and have numerous reported advantages over CTT-only methods (Reise et al.,
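To illustrate the distinction being drawn, the sketch below contrasts the CTT view, in which a raw total score summarizes performance, with a minimal IRT view, using the one-parameter (Rasch) model in which the probability of a correct response depends on the difference between a child's ability and an item's difficulty. The parameter values are purely illustrative.

```python
import numpy as np

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

# CTT summarizes performance with the raw total score; IRT models each item.
item_difficulties = np.array([-1.0, 0.0, 1.5])  # illustrative values
for ability in (-1.0, 0.0, 1.0):
    p = rasch_probability(ability, item_difficulties)
    print(f"ability {ability:+.1f}: item P(correct) = {np.round(p, 2)}, "
          f"expected raw score = {p.sum():.2f}")
```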
Comparisons between manuals and independent articles are limited to instances where studies with adequate methodology from both a manual and an article are available for a measurement property. These included three instances examining convergent validity of the CASL, CELF:P-2 and DELV-NR (Hoffman et al.,
The correlations reported in the CELF-P:2 manual (Wiig et al.,
The study by Hoffman et al. (
Collectively, these findings indicate that further independent studies are required to examine the validity of different comprehensive language assessments for children. Further research is also required to determine if children are categorized similarly across different assessments with regards to diagnosis and severity of language impairment (Hoffman et al.,
It is acknowledged that speech pathologists should consider a range of factors in addition to psychometric quality when selecting an assessment for use, including the clinical population with which the assessment will be used, the purpose for which the assessment will be used, and the theoretical construct of the assessment (Bishop and McDonald,
Standardized assessments are frequently used to make important diagnostic and management decisions for children with language impairment in both clinical and research contexts. For accurate diagnosis and provision of effective intervention, it is important that assessments chosen for use have evidence of good psychometric quality (Friberg,
This review also identifies areas in need of further research with regards to individual assessments and the development of the field of child language assessment in general. Where an assessment does not present with an "excellent" or "good" level of evidence for all measurement properties, further research is required to determine whether this evidence exists. In general, further information is particularly needed to provide evidence of structural validity, measurement error and diagnostic accuracy. The use of IRT methods for statistical analysis of psychometric properties is also identified as an area in need of further exploration within the field of child language assessment.
Very limited evidence of psychometric quality currently exists outside of what is reported in manuals for child language assessments, and where such evidence does exist, it does not always support the information reported in manuals. Assessment manuals are produced by developers who have a commercial interest in the assessment. Furthermore, the reporting of psychometric quality in manuals is not peer-reviewed and can only be viewed after purchasing the assessment. When assessment developers make information on psychometric properties available online or in peer-reviewed journals, transparency is achieved and clinicians and researchers are able to review psychometric properties prior to purchasing assessments. A need for independent studies is also identified, in order to supplement the data provided in assessment manuals. When information can be collated from a variety of studies, the evidence regarding the psychometric quality of assessments will become more substantial.
This review identified a number of assessments that currently present with better evidence of psychometric quality than others, although substantially more data is required to show that any assessments have “good” evidence. Until further information becomes available, it is suggested that speech pathologists favor assessments with better evidence when assessing the language abilities of school-aged children, provided that the normative sample is appropriate for the population in which the assessment is to be used. However, given that all assessments have limitations, speech pathologists should avoid relying on the results of a single assessment. Standardized assessment results should be supplemented with information from other assessment approaches (e.g., response to intervention, curriculum-based assessment, language sampling, dynamic assessment) when making judgments regarding diagnosis and intervention needs (Hoffman et al.,
Due to a need to restrict size, responsiveness was not investigated in this review. It was, however, noted that no assessment manuals reported on responsiveness studies. These studies have a longitudinal design with multiple administrations of the assessment across time to measure sensitivity to change in a person's status. Evidence of responsiveness is particularly important when assessments are to be used for measuring intervention outcomes or monitoring stability over time (Eadie et al.,
This review was confined to school-age language assessments that cover both the production and comprehension of spoken language. While this reflects current literature and clinical practice (Tomblin et al.,
There is a need for future research to examine the psychometric quality of assessments for children who are bi-lingual or speaking English as a second language (Gillam et al.,
This systematic review examines the psychometric quality of 15 currently available standardized spoken language assessments for children aged 4–12 years. Overall, limitations were noted with the methodology of studies reporting on psychometric quality, indicating a great need for improvements in the design and reporting of studies examining psychometric quality of both existing assessments and those that are developed in the future. As information on psychometric properties is primarily provided by assessment developers in manuals, further research is also recommended to provide independent evidence for psychometric quality. Whilst all assessments were identified as having notable limitations, four assessments: ALL, CELF-5, CELF:P-2, and PLS-5 were identified as currently having better evidence of reliability and validity. These four assessments are suggested for diagnostic use, provided they suit the purpose of the assessment process and are appropriate for the population being assessed. Emphasis on the psychometric quality of assessments is important for speech pathologists to make evidence-based decisions about the assessments they select when assessing the language abilities of school-aged children.
DD, RS, NM, WP, and RC all contributed to the conceptual content of the manuscript. DD and YC contributed to data collection and analysis.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
1Recent international consensus has replaced the term Specific Language Impairment with Developmental Language Disorder (Bishop et al.,