ORIGINAL RESEARCH article

Front. Educ., 19 February 2026

Sec. Assessment, Testing and Applied Measurement

Volume 11 - 2026 | https://doi.org/10.3389/feduc.2026.1671946

Evaluation of the consistency of a speech verification system with human raters in early literacy screening assessments

  • 1. Florida State University, Tallahassee, FL, United States

  • 2. Marino Institute of Education, Dublin, Ireland

  • 3. University of South Carolina, Columbia, SC, United States


Abstract

This study investigates the use of speech verification system (SVS) technology, a form of automatic speech recognition (ASR), in the assessment of children's reading skills. Despite the growing integration of ASR systems in educational assessment, significant challenges persist, particularly due to the acoustic, pronunciation, and dialectal variability inherent in children's speech. Our research evaluated the consistency between human rater (HR) and SVS scores produced by SoapBox Labs across three linguistic tasks—phoneme blending, expressive vocabulary, and word reading. Results reveal variability in agreement rates, with the SVS showing lower consistency with human raters on phonologically complex tasks such as phoneme blending than on expressive vocabulary and word reading tasks. Additionally, we address potential racial differences in SVS performance, highlighting the importance of diverse speech sample collection to ensure equitable assessments, as well as inter-item differences within a task. The study concludes with recommendations for the use of SVS in educational assessments, advocating for ongoing research and algorithmic advancements to better support educational assessment practices.

Introduction

Educational assessments are a critical component of prevention models for children with language and literacy disorders. Across the United States and other countries worldwide, universal screening for risk of reading disorders is mandated by law (e.g., Gearin et al., 2022), with similar movements underway to screen for oral language impairments (Eadie et al., 2022; Jullien, 2021). Universal screening can be beneficial; however, such processes often require classroom teachers or other educational professionals to administer and score the assessments when they already have limited instructional time (Adlof et al., 2017). Research has previously noted that even trained examiners can show considerable variability in the fidelity of screening administration and scoring (Cummings et al., 2014). More broadly, a substantial body of literature has provided evidence of human error in scoring, particularly related to judging responses from children from marginalized racial backgrounds (Beyer et al., 2015; Evans et al., 2018). To address these issues, some have suggested the use of artificial intelligence (AI) through automatic speech recognition (ASR) technology (Nese and Kamata, 2021) as a tool that could produce unbiased scoring during school-based assessments. Integrating ASR when developing educational assessments provides numerous advantages, including lower costs, faster scoring, enhanced score consistency within and across administrations, and less biased scoring (Foltz et al., 2020). The confluence of this research suggests there is an opportunity to further study the role of ASR in school-based assessment contexts, both globally and locally in the United States, to understand the potential consistency of scoring relative to human raters and differences in scoring consistency across groups of children according to race.

Considerations and challenges for automatic speech recognition in expressive-skill based reading assessments

Automatic speech recognition (ASR) can be defined as, “the process and the related technology for converting [a] speech signal into its corresponding sequence of words or other linguistic entities by means of algorithms implemented in a device, a computer, or computer clusters” (Li et al., 2015). A notable achievement in the progression of ASR technology is its application in evaluating a wide array of reading skills including phonemic awareness (i.e., the capacity to identify and manipulate individual speech sounds) and phonics (Wang et al., 2009), as well as assessing children's reading fluency in longer text passages (Bernstein et al., 2017; Nese and Kamata, 2021). The integration of ASR technology in assessing children's reading development not only improves test delivery and data collection but also sparks new approaches to evaluating children's reading skills. Traditionally, reading assessments have relied heavily on human evaluation and administration, processes that are often time-consuming and susceptible to human error. In contrast, ASR technology offers improved accuracy, adaptability, and efficiency, providing a promising alternative for identifying children who may need additional reading support and early intervention. However, integrating ASR into assessments of reading presents several challenges that require careful consideration.

Various aspects of children's speech pose challenges for ASR technology. One major issue is that children's speech exhibits greater variability in acoustic features than adult speech, including higher pitch, shifted formant frequency ranges, and slower or less stable speaking rates (Lee et al., 1999; Gerosa et al., 2007). Additionally, children's articulation and pronunciation patterns are still maturing, such that young speakers often produce sounds inconsistently or simplify certain phonemes (Holm et al., 2007). Although these acoustic and pronunciation factors are interrelated, they represent sufficiently distinct dimensions of variability that both must be addressed in ASR systems (Shivakumar and Georgiou, 2020). A further challenge is dialectal variation among English-speaking children: differences in regional or cultural dialects (e.g., African American English; AAE) can lead to recognition errors if not accounted for, underscoring the importance of testing for racial differences and accommodating dialect differences in developing ASR technology for children's speech (Koenecke et al., 2020).

Acoustic constructs in children's speech

Acoustic constructs refer to the features and characteristics of children's speech, including pitch, intonation, rhythm, and developmental aspects that differentiate it from adult speech. These acoustic cues help distinguish different phonemes and words in spoken language and are used to train ASR models to enhance their accuracy in recognizing and interpreting speech. The physiological and acoustic variability between adult and children's speech is well-documented. Children's speech differs from adults' speech (Zue et al., 2000) and exhibits high acoustic variability due to both anatomical and physiological changes that occur during development. Consequently, ASR systems primarily designed with adult voices in mind have been found to be less reliable when used with children (Wilpon and Jacobsen, 1996; Russell and D'Arcy, 2007; Potamianos and Narayanan, 2003).

Key acoustic constructs to consider in children's speech include:

  • Pitch: Children typically have higher-pitched voices compared to adults, and their pitch range is narrower due to the size and development of their vocal folds.

  • Formants: The resonant frequencies in the vocal tract, known as formants, differ in children's speech, often being higher in frequency and more closely spaced than in adult speech.

  • Voice Quality: Children's voices can exhibit a breathier or less-controlled quality due to smaller vocal folds and less developed vocal control.

Children's speech recognition performance is significantly impacted by the speaker's age because of the pronounced differences in pronunciation and language use at various stages of language and speech development. Younger children's speech is generally recognized much less accurately by ASR systems (Lee et al., 1999; Feng et al., 2021). Considering these findings, it is important that publishers who are using ASR-enabled reading assessments, especially those used in the United States for legislative-responsive screening where decisions are made about instruction, ensure that the databases used for scoring are specifically tailored for children. These databases should incorporate age-specific acoustic characteristics to enhance accuracy and boost confidence in the system's responses. By accounting for the distinctive elements of children's speech, ASR systems can more accurately recognize and interpret children's speech, leading to more reliable and effective reading assessments.

Pronunciation patterns in children's speech

Pronunciation patterns pertain to the unique ways children articulate sounds and words. Compared to adults, children often exhibit different phonetic and articulation variations, such as producing more phonetic errors and mispronunciations due to the rapidly changing developmental stages of their language (Potamianos and Narayanan, 1998; Black et al., 2007; Wang et al., 2009), and recent child-ASR work shows that such atypical segmental realizations and disfluency types are common sources of recognition difficulty (e.g., Dudy et al., 2018). Pronunciation patterns can also be influenced by variations in accent and dialect (Feng et al., 2024). The high variability in these pronunciation patterns can pose significant challenges for ASR systems, even those designed specifically for children's speech (Shivakumar and Georgiou, 2020; Shivakumar and Narayanan, 2022).

Research by Potamianos and Narayanan (1998) demonstrated that mispronunciations by younger children (8–10 years) were twice as frequent as those by older children (11–14 years). These mispronunciations can be more problematic for ASR systems than for human assessors. For instance, if a young child mispronounces the word “rain” as “wain,” an ASR system might score this as incorrect because the expected phonemes were not all accurately pronounced. In contrast, a human assessor, with a more nuanced understanding, could recognize that the child intended to say the target word but mispronounced it and, accordingly, mark the child's response as correct. The implication of such phenomena is that scoring errors can lead to imprecision in reading and language ability estimates and to misidentification in current U.S. screening systems that use ASR for the early identification of students at risk for protracted reading issues.

Adams (2011) highlighted that young children produce “spontaneous language marked by unconventional wording and syntax” (p. 13), which creates a challenge for ASR technology but not necessarily for human assessors. Even at the single word level, inconsistencies and variations in children's speech can cause reduced reliability in response judgments for ASR. In a 2007 study, Black et al. integrated ASR technology to record kindergarten, first, and second-grade children reading 55 single words. From these recordings, they identified four common disfluencies young children make when reading single words: “hesitations” (i.e., starting to pronounce the target word, pausing, and then saying the word), “sound-outs” (i.e., sounding out each phoneme in the word), “whispers” (i.e., whispering some of the phonemes in the word), and “stalling” (i.e., lengthening the first phoneme(s) or syllable in a word). Over 20% of the participating children exhibited at least one disfluency when reading single words, with kindergartners most frequently stalling and first graders most often sounding words out (Black et al., 2007).

In 2009, Wang et al. applied ASR technology to assess children's (n = 193) phoneme blending skills as part of the TBall (technology-based assessment of language and literacy) project. They evaluated the children's pronunciation accuracy and smoothness when blending a range of target words and compared both human assessor and ASR administration. They identified pronunciation disfluencies, such as partial/full repetitions and self-corrections, and smoothness disfluencies such as long pauses and elongations of phonemes (usually the first phoneme), as causing difficulties when the assessment was administered and scored by the ASR technology. They also noted accents as a challenge to ASR assessments of phoneme blending. In the design of the TBall project, Wang et al. trained the system to detect disfluencies and accents, and consequently, when readministered, the ASR version of the assessment achieved comparable scores to the human-administered assessment. Therefore, more efficient and sophisticated training of ASR models, through the targeting and detection of disfluencies in children's speech, is required to improve the quality and accuracy of the technology to attend to such challenges.

Dialectal variability and ASR

Unlike developmental speech variations, dialectal variations are not often judged accurately by human scorers and may be influenced by factors such as speaker race (Evans et al., 2018). A substantial body of research indicates that increased listener familiarity with speakers' speech and language patterns yields more efficient and accurate linguistic processing (Levi et al., 2019; Schoonmaker-Gates, 2018; Sumner and Samuel, 2009). Human ratings of response accuracy vary when the speaker and listener have different dialectal backgrounds, with listeners favoring speakers who sound similar to themselves (Beyer et al., 2015; Lippi-Green, 2012). Even trained educational professionals with expertise in language may be more likely to over-identify individuals as having disorders as compared to typical linguistic differences when presented with clinical case scenarios involving non-mainstream dialects of English (Easton and Verdon, 2021). Importantly, few educational professionals receive explicit training or support to develop the skills needed to recognize dialectal variation accurately and reliably. Teachers and other educational professionals alike have reported limited access to training about dialectal variation, most specifically focusing on lack of information about African American English, one of the most commonly spoken dialects of American English that is considered to be non-mainstream (Diehm and Hendricks, 2021; Gatlin-Nash and Terry, 2022; Robinson and Stockman, 2009). Given these findings, ASR systems may offer a potential path to reduce linguistic and racial bias in educational assessment.

Dialectal variation also requires explicit attention in the development and implementation of ASR technology (Petscher and Patton Terry, 2020). Differences in regional accents and dialects may yield full phoneme substitutions, variable deletions of phonemes, and subtle shifts in vowel productions that increase the overall variability of target word pronunciation. When ASR systems do not include sufficient representation of the dialects or accents spoken by the population of individuals for whom the systems are designed, accuracy of the ASR will inherently be low for the dialects that are underrepresented. A recent study conducted by Koenecke et al. (2020) revealed that speech recognition systems from Google, Apple, Amazon, Microsoft, and IBM performed significantly worse in transcribing the speech of Black Americans as compared to White Americans. This racial bias may be reduced through proactive design. ASR systems' acoustic models can be trained to accommodate accent and dialectal variation in speech productions by ensuring extensive, broad, inclusive speech variations are represented in the system's database.

Overall, the potential benefits of ASR to support efficient, more reliable, educational assessment indicate that further investigation of the effectiveness of ASR-based scoring is warranted. It is critical that when developing ASR-enabled educational assessments, researchers consider factors including specialized development of ASR for child speakers, typical development of children's speech, and dialectal variation in speech patterns. At present, human judgment is considered the “gold standard” for evaluating spoken responses to assessment tasks; however, there is some overlap in the limitations of human judgment as compared to ASR, particularly for evaluating responses provided by children who speak non-mainstream dialects of English. Although a small number of ASR systems are purportedly robust to recognize and interpret children's speech, it is possible that human judgment and ASR may have complementary strengths that could be leveraged to improve the overall accuracy and reliability of reading assessment for all learners.

Speech recognition systems and speech verification systems

Speech verification systems (SVS) are a sub-discipline of automatic speech recognition (ASR). ASR systems are designed to transcribe spoken language into written text by recognizing and interpreting spoken words. SVS, in contrast, focuses on content verification, determining whether a spoken input matches a predefined target, and typically provides a confidence score to indicate the strength of the match. ASR aims to convert speech to text as accurately as possible, whereas SVS is used to check whether the speech corresponds to a specific expected response. This distinction between ASR and SVS is important in prompted tasks such as educational assessments, because verification is designed to score responses even if the spoken responses are acoustically unlikely (Kelly et al., 2020).

An SVS pipeline typically includes a process of first capturing the audio signal, extracting relevant acoustic features, using a deep neural network (DNN) to produce posteriors, and performing a matching step that yields the confidence or target score for each production. Consequently, SVS systems are typically employed in scenarios requiring precise confirmation of spoken content, such as in assessments (e.g., van der Velde et al., 2025).
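The pipeline stages named above can be illustrated with a minimal, hypothetical sketch. All function names and the toy acoustic model are assumptions for illustration; a real system such as SoapBox Labs' would use a trained DNN and proprietary matching algorithms.

```python
# Illustrative SVS pipeline sketch: capture -> features -> posteriors -> matching.
# The feature extractor and "acoustic model" here are toy stand-ins, not a real DNN.

def extract_features(audio):
    # Stand-in for acoustic feature extraction (e.g., MFCCs): fixed-size frames.
    frame = 4
    return [audio[i:i + frame] for i in range(0, len(audio), frame)]

def acoustic_posteriors(frames, phone_set):
    # Stand-in for a DNN acoustic model: per frame, a distribution over phonemes.
    # phone_set maps each phoneme label to a reference "energy" value (hypothetical).
    posts = []
    for f in frames:
        energy = sum(abs(x) for x in f)
        scores = {p: 1.0 / (1.0 + abs(energy - ref)) for p, ref in phone_set.items()}
        total = sum(scores.values())
        posts.append({p: s / total for p, s in scores.items()})
    return posts

def verify(audio, target_phones, phone_set):
    # Matching step: average posterior mass assigned to the target phoneme
    # sequence, rescaled to a 0-100 confidence score as described in the text.
    posts = acoustic_posteriors(extract_features(audio), phone_set)
    span = max(1, len(posts) // len(target_phones))
    per_phone = []
    for i, phone in enumerate(target_phones):
        window = posts[i * span:(i + 1) * span] or posts[-1:]
        per_phone.append(sum(fr[phone] for fr in window) / len(window))
    return round(100 * sum(per_phone) / len(per_phone), 1)
```

A real verifier would align frames to phonemes with a forced-alignment or lattice search rather than the fixed windows used here; the sketch only conveys the flow from signal to confidence score.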

Automated scoring vs. human raters in assessments

The comparability between human scores and those produced by SVS is a significant issue in the development of SVS-enabled reading assessments, because SVS scores are intended to either replace or supplement human scores (Yan and Bridgeman, 2020). Recent studies indicate that SVS can achieve high agreement with human evaluators. van der Velde et al. (2025) evaluated the item-level consistency between SVS and human scoring on a word reading fluency task using accuracy and speed metrics. Their findings showed high sensitivity (0.93) and low specificity (0.69) of SVS scoring against human scoring on word-level accuracy outcomes, along with lower sensitivity (0.76) and moderate specificity (0.86) on passage-level accuracy.

Chen and Sun (2025) evaluated multiple automated English-speaking proficiency assessments and found that two systems' scores showed strong alignment with human rater scores, with correlations of 0.85 and 0.87. They further noted that there were no statistically significant differences between the AI Speak Master and TalkAI Language Practice software and human scoring, although both presented small, practically important differences (i.e., Cohen's d = 0.14 and −0.15, respectively). Moreover, the SmartSpeech AI software presented a large effect size (d = 3.69) when compared to human scoring.

Another recent study focusing on children's phonological decoding skills used a researcher-developed AI-driven app to score young learners' spoken responses and reported that the audio-derived scores were predictive of human scoring; however, the strength of the relation was moderated by participants' vocabulary and reading comprehension and also varied across individual items (Turner et al., 2025).

Across these recent studies, it is important to note that emerging evidence suggests the consistency between SVS and human scoring depends on the system used (e.g., Chen and Sun, 2025): discrepancies can stem from training datasets, algorithmic design, or the treatment of certain speech features, all underscoring the need for careful calibration. Even as recent research suggests that speech verification systems can serve as a valuable complement to human raters in educational assessments, the evidence for the sensitivity and specificity of SVS relative to human scoring is weaker in certain circumstances (e.g., van der Velde et al., 2025), and the strength of the relation between scoring modalities may be item dependent (e.g., Turner et al., 2025).

Current study

The present study builds on emerging research in the consistency of speech verification and human scoring differences on three commonly administered word-level measures important to literacy development (Petscher et al., 2020). Specifically, scoring judgments for the phonological awareness skill of blending (i.e., phoneme-level manipulation and recombination of sounds), expressive vocabulary (i.e., lexical retrieval), and word reading (i.e., print to speech decoding and articulation) tasks were evaluated, all of which were designed to elicit single-word target verbalizations in response to stimuli. There are differing demands for each task with associated distinct error profiles that could be operationalized and scored differently by SVS and human raters.

We further explored differences in modality scoring judgments by speakers' racial backgrounds, focusing on differences observed for students whose parents identified their race as Black or White.

To explore the potential and limitations of speech verification systems (SVS) in supporting assessments of reading, this study addresses three key research questions.

  • How consistent are the scores between human raters (HR) and speech verification systems (SVS) in tasks assessing phonological awareness, expressive vocabulary, and word reading?

  • How does manipulating the SVS target score for accuracy across various thresholds affect the consistency between HR and SVS scores?

  • What are the differences in consistency estimates between HR and SVS scores for White and Black students?

This work builds on the van der Velde et al. (2025) study by using item-level data across three different literacy tasks, applies a shared and expanded methodological approach examining item-level and aggregate consistency and classification accuracy metrics, and extends that work through disaggregation by race.

Method

Participants

Data were collected as part of a longitudinal project that investigated the psychometrics of reading and language assessments for school-based screening and diagnostic assessments in kindergarten through grade 3. The present analyses included 429 kindergarten students recruited at the beginning of kindergarten and followed through the end of the academic year. Students were recruited from 20 schools across three states in the southeastern and northeastern United States. Based on collected student demographic data, participants were 50.7% female and 49.3% male. Participants were identified as White, Non-Hispanic (46.6%), Black, Non-Hispanic (31.1%), Multiracial, Non-Hispanic (7.7%), White, Hispanic (3.7%), Multiracial, Hispanic (2.8%), Asian (2.6%), and Pacific Islander (< 1%), with 4.5% declining to provide information. The percentage of students who qualified for free or reduced-price lunch was 37.5%. Students identified in kindergarten as English learners (i.e., emergent bilingual) represented 6.4% of the sample. Students identified with an intellectual disability or autism spectrum disorder were not included. All study procedures were approved and monitored by the institutional review boards, and all ethical standards were met in the conduct of the study.

Measures

Interstellar Express Assessment System (IEAS)

IEAS (Petscher and Catts, 2024) is a set of reading and language tasks created for the normative measurement of literacy skills in kindergarten through grade 3 (KG-G3). All IEAS tasks were administered via iPad. A trained assessor provided one iPad to the child, on which all task directions and stimuli were presented. The assessor used a second iPad, synced to the child's iPad through our patent-pending technology, to score the child's response with reference to a range of allowable responses. As the assessor scored a child's oral response on the assessor's iPad, the child's IEAS experience automatically progressed to the next item or task. The child's IEAS experience was governed by existing programming that moved the child seamlessly through intra-task (i.e., directions, practice items, and assessment items) and inter-task (i.e., moving from assessment task to assessment task) experiences. The embedded speech verification software from Soapbox Labs both recorded and probabilistically scored the child's oral response against pre-programmed item-level targets, comparing what the child said with the expected target (e.g., if the target is “book,” the algorithm estimates the probability that what the child said matches “book”). Students were seated 18 inches from the iPad device for the purpose of voice capture in naturalistic assessment settings.

Phonological Awareness—Blending (BLE). Participants were administered 26 blending items (Supplementary Table S1) that measured a child's phonological awareness skills by assessing his or her ability to combine portions of a word and say the target word out loud. The child listened to a word broken down into parts and was then asked to combine the sounds and say the target word. Empirical reliability of scores from a 2-parameter logistic item response model of the sample data was 0.93.

Expressive Vocabulary (EVO). Participants were administered 17 items (Supplementary Table S2) that measured a child's expressive vocabulary and word retrieval skills by assessing his or her ability to name pictures correctly. The child viewed a picture on the iPad and then named the image shown in the picture aloud. Empirical reliability of scores from a 2-parameter logistic item response model of the sample data was 0.85.

Word Reading (WRE). Participants were administered 32 real word items (Supplementary Table S3); each item displayed a word on the screen, and the child read the word shown aloud. Marginal reliability of scores from a 2-parameter logistic item response model of the sample data was 0.88.

Procedures

Assessor fidelity and automated speech scoring

Assessor Fidelity. IEAS data were collected by research staff and graduate students. All assessors participated in a combination of large-group training led by senior project staff, self-guided training, individual practice sessions on mock assessments, and structured practice with other testers on mock assessments. Large-group review sessions were offered for assessors. Following the training regimen, assessors were required to complete a quiz on each individual subtest and demonstrate 100% proficiency on every skill on the fidelity checklist in order to be cleared for independent data collection.

Soapbox Labs Verification System. Soapbox Labs specializes in speech recognition technology designed specifically for children. It focuses on the unique acoustic and linguistic characteristics of young speakers to improve the accuracy and fairness of automated speech assessments in educational contexts. With over 100,000 samples of children's voices collected, Soapbox Labs emphasizes the importance of gathering diverse speech data to identify and mitigate potential biases across different demographic groups. Soapbox Labs incorporates a digital, automatic speech verification system (SVS) into its speech recognition platform (e.g., Kelly et al., 2020). This SVS determines whether spoken input matches expected responses and checks whether it aligns with a predefined target. For instance, in a phonological awareness blending task involving the word “butterfly,” users must provide one or more acceptable correct responses for the algorithm to assess. When a speech sample is uploaded, the verification system uses proprietary algorithms to compare the input against the target responses and generates a confidence score. This confidence score ranges from 0 to 100 and, when divided by 100, reflects the probability that what the person orally produced matches the supplied correct responses from the user of the platform. Soapbox Labs allows the user to obtain scores at the utterance, phoneme, and word level. For this study, the word-level confidence score was used for each item of the phonological awareness, vocabulary, and word reading tasks across all participants in the sample.
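As a minimal illustration of how a 0–100 confidence score maps onto a probability and a match decision against the user-supplied acceptable responses, consider the following sketch. The function name and the per-response confidence values are hypothetical; this is not the SoapBox Labs API.

```python
# Illustrative only: mapping a 0-100 confidence score to a probability and a
# best-match decision over one or more acceptable responses (hypothetical data).
def match_probability(confidences_by_response):
    """confidences_by_response: dict of acceptable response -> confidence (0-100).
    Returns the best-matching response and its probability (confidence / 100)."""
    best = max(confidences_by_response, key=confidences_by_response.get)
    return best, confidences_by_response[best] / 100

# Hypothetical confidences for the blending target "butterfly".
resp, p = match_probability({"butterfly": 87, "buttafly": 42})
# resp == "butterfly", p == 0.87
```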

Data analysis

A combination of item-level and aggregate score-level analyses was used to broadly test the consistency (or agreement) of human rater (HR) and speech verification system (SVS) scoring of children's oral production on the three tasks at the item level. As previously noted, confidence scores in SVS represent the likelihood that a person's produced speech matches the pre-specified target, such that the system requires a user-imposed threshold on the confidence score to reach a dichotomous categorization of “correct” and “incorrect.” Given the wide range of plausible thresholds that could be used to judge a likelihood as correct, from a more liberal threshold (e.g., ≥50, denoting that an item would be scored as “correct” if the oral response confidence score met or exceeded this value) to a more conservative threshold (e.g., ≥90), we opted to test five thresholds of the confidence score (i.e., ≥50, 60, 70, 80, and 90; hereafter referred to as Target 50–Target 90) for the purpose of creating SVS item-level dichotomies of correct and incorrect responses.
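The threshold-based dichotomization described above can be expressed compactly. This is an illustrative sketch with made-up confidence values; only the threshold values come from the text.

```python
# Dichotomize SVS confidence scores at each Target threshold (50-90, per the text).
THRESHOLDS = (50, 60, 70, 80, 90)

def dichotomize(confidence_scores, threshold):
    """Score an item correct (1) when the SVS confidence meets or exceeds the threshold."""
    return [1 if c >= threshold else 0 for c in confidence_scores]

# Hypothetical confidence scores for five items from one participant.
confidences = [95, 72, 60, 44, 88]
svs_scores = {t: dichotomize(confidences, t) for t in THRESHOLDS}
# e.g., svs_scores[70] -> [1, 1, 0, 0, 1]
```

Higher thresholds can only flip items from correct to incorrect, which is why the Target 90 accuracy rates reported below are the lowest.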

Frequency statistics were used to calculate the item-level mean and variance of the proportion correct according to HR and SVS threshold scoring for each task. Cohen's kappa was used to measure the inter-rater agreement between HR and each of the SVS Target 50–90 thresholds. Kappa accounts for the possibility of agreement occurring by chance with values < 0 conventionally indicating no agreement between raters, 0.01–0.20 as none to slight agreement, 0.21–0.40 as fair agreement, 0.41–0.60 as moderate agreement, 0.61–0.80 as substantial agreement, and 0.81–1.00 as near perfect agreement. The R-software irr package (Gamer, 2010) was used to estimate Cohen's kappa.
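The study computed kappa with the R irr package; an equivalent calculation can be sketched in Python for illustration. Kappa is observed agreement corrected for the chance agreement implied by the two raters' marginal distributions:

```python
# Cohen's kappa for two raters' item scores (illustrative re-implementation;
# the study itself used the R irr package).
def cohens_kappa(r1, r2):
    n = len(r1)
    # Observed proportion of agreement.
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Chance agreement from each rater's marginal proportions per category.
    cats = set(r1) | set(r2)
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Hypothetical HR and SVS dichotomous scores for four children on one item:
# agreement here is exactly at chance level, so kappa is 0.
cohens_kappa([1, 1, 0, 0], [1, 0, 0, 1])
```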

Task-level agreement between HR and SVS (i.e., ignoring the individual item id) on the full set of items within each task was evaluated using the same SVS Target 50–90 thresholds, examining agreement for the full sample of participants, for Black participants only, and for White participants only. The R-software biostatUtil package (Chiu et al., 2024) was used to obtain a bootstrapped estimate and confidence interval for kappa in each of the three sample-based analyses. Specific comparisons between the Black- and White-sample confidence intervals were conducted to detect the extent of non-overlap in the plausible value ranges at each SVS Target threshold by task, to identify statistically significant differences between groups.

The classification accuracy of SVS Target thresholds for each of the three literacy tasks was tested against HR to evaluate the sensitivity (i.e., percent of HR correct responses also captured by SVS), specificity (i.e., percent of HR incorrect responses captured by SVS), positive predictive power (i.e., percent of total SVS correct responses also captured by HR), negative predictive power (i.e., percent of total SVS incorrect responses also captured by HR), and overall correct classification between HR and SVS. All indices were calculated as a function of the implied 2 × 2 confusion matrix constructed from the true positives, true negatives, false positives, and false negatives stemming from the overlap and non-overlap between SVS and HR count data.
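The five indices follow directly from the 2 × 2 confusion matrix, with HR treated as the reference standard. The helper below is a sketch of that calculation (function name and example scores are illustrative):

```python
# Classification accuracy indices from HR (reference) and SVS dichotomous scores.
def classification_indices(hr, svs):
    tp = sum(h == 1 and s == 1 for h, s in zip(hr, svs))  # both scored correct
    tn = sum(h == 0 and s == 0 for h, s in zip(hr, svs))  # both scored incorrect
    fp = sum(h == 0 and s == 1 for h, s in zip(hr, svs))  # SVS correct, HR incorrect
    fn = sum(h == 1 and s == 0 for h, s in zip(hr, svs))  # SVS incorrect, HR correct
    n = tp + tn + fp + fn
    return {
        "sensitivity": tp / (tp + fn),  # HR-correct responses captured by SVS
        "specificity": tn / (tn + fp),  # HR-incorrect responses captured by SVS
        "ppv": tp / (tp + fp),          # SVS-correct responses confirmed by HR
        "npv": tn / (tn + fn),          # SVS-incorrect responses confirmed by HR
        "occ": (tp + tn) / n,           # overall correct classification
    }

# Hypothetical scores for six responses on one item.
indices = classification_indices([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```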

Results

Full sample agreement

Blending

The descriptive statistics (Supplementary Table S1) showed the mean HR accuracy across all items was 66.30% (SD = 19.51%), compared with SVS accuracy rates at Target 50 (M = 23.10%, SD = 7.56%), Target 60 (M = 20.70%, SD = 7.18%), Target 70 (M = 18.00%, SD = 6.83%), Target 80 (M = 14.70%, SD = 6.92%), and Target 90 (M = 9.30%, SD = 5.57%). Table 1 presents the item-level agreement between the human rater (HR) and the speech verification system (SVS) at each target threshold (i.e., 50, 60, 70, 80, 90) for each item in the Blending task. Agreement between HR and SVS was poor in the Target 50 condition across all items, with a mean κ = 0.15 (range = 0.05, 0.31). Item-level and aggregate consistency ratings were likewise poor for the HR-SVS Target 60 (M = 0.15; range = 0.05, 0.29), Target 70 (M = 0.15; range = 0.03, 0.29), Target 80 (M = 0.13; range = 0.01, 0.26), and Target 90 (M = 0.09; range = 0.00, 0.23) conditions.

Table 1

| Item | Kappa (50) | Z | p | Kappa (60) | Z | p | Kappa (70) | Z | p | Kappa (80) | Z | p | Kappa (90) | Z | p |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baby | 0.09 | 2.64 | 0.008 | 0.09 | 2.81 | 0.005 | 0.09 | 3.07 | 0.002 | 0.08 | 2.95 | 0.003 | 0.07 | 2.93 | 0.003 |
| Bike | 0.17 | 4.58 | 0.000 | 0.17 | 4.77 | 0.000 | 0.17 | 4.87 | 0.000 | 0.15 | 4.53 | 0.000 | 0.09 | 3.44 | 0.001 |
| Box | 0.20 | 3.83 | 0.000 | 0.21 | 4.19 | 0.000 | 0.22 | 4.37 | 0.000 | 0.25 | 5.36 | 0.000 | 0.23 | 5.34 | 0.000 |
| Butterfly | 0.16 | 4.57 | 0.000 | 0.16 | 4.51 | 0.000 | 0.16 | 4.76 | 0.000 | 0.15 | 4.76 | 0.000 | 0.08 | 3.34 | 0.001 |
| Coat | 0.10 | 2.37 | 0.018 | 0.14 | 3.44 | 0.001 | 0.14 | 3.80 | 0.000 | 0.15 | 4.21 | 0.000 | 0.11 | 3.46 | 0.001 |
| Cow | 0.08 | 2.05 | 0.040 | 0.10 | 2.82 | 0.005 | 0.09 | 2.64 | 0.008 | 0.09 | 2.64 | 0.008 | 0.08 | 2.62 | 0.009 |
| Cupcake | 0.06 | 1.91 | 0.056 | 0.05 | 1.87 | 0.061 | 0.05 | 2.09 | 0.037 | 0.02 | 0.89 | 0.372 | 0.00 | 0.10 | 0.923 |
| Desk | 0.25 | 4.59 | 0.000 | 0.23 | 4.47 | 0.000 | 0.27 | 5.70 | 0.000 | 0.24 | 5.38 | 0.000 | 0.20 | 4.87 | 0.000 |
| Fast | 0.17 | 2.94 | 0.003 | 0.19 | 3.62 | 0.000 | 0.18 | 3.62 | 0.000 | 0.16 | 3.69 | 0.000 | 0.16 | 3.69 | 0.000 |
| Feet | 0.14 | 3.43 | 0.001 | 0.16 | 4.27 | 0.000 | 0.18 | 4.73 | 0.000 | 0.15 | 4.36 | 0.000 | 0.11 | 3.95 | 0.000 |
| Fish | 0.16 | 5.93 | 0.000 | 0.16 | 5.87 | 0.000 | 0.15 | 5.62 | 0.000 | 0.12 | 5.18 | 0.000 | 0.07 | 4.05 | 0.000 |
| Fly | 0.20 | 5.61 | 0.000 | 0.22 | 6.17 | 0.000 | 0.22 | 6.46 | 0.000 | 0.22 | 6.81 | 0.000 | 0.20 | 6.79 | 0.000 |
| Food | 0.05 | 1.60 | 0.110 | 0.07 | 2.77 | 0.006 | 0.06 | 2.35 | 0.019 | 0.05 | 2.65 | 0.008 | 0.03 | 1.89 | 0.059 |
| Football | 0.08 | 2.27 | 0.023 | 0.08 | 2.38 | 0.017 | 0.10 | 3.45 | 0.001 | 0.06 | 2.58 | 0.010 | 0.02 | 1.86 | 0.063 |
| Fox | 0.25 | 4.72 | 0.000 | 0.24 | 4.66 | 0.000 | 0.25 | 5.13 | 0.000 | 0.26 | 5.37 | 0.000 | 0.21 | 4.99 | 0.000 |
| Hamburger | 0.12 | 4.02 | 0.000 | 0.10 | 3.70 | 0.000 | 0.07 | 3.14 | 0.002 | 0.04 | 2.23 | 0.026 | 0.02 | 1.55 | 0.121 |
| Kite | 0.09 | 2.57 | 0.010 | 0.08 | 2.58 | 0.010 | 0.10 | 3.10 | 0.002 | 0.10 | 3.45 | 0.001 | 0.08 | 3.15 | 0.002 |
| Knife | 0.30 | 5.97 | 0.000 | 0.28 | 5.71 | 0.000 | 0.27 | 5.72 | 0.000 | 0.20 | 4.97 | 0.000 | 0.12 | 3.78 | 0.000 |
| Mouse | 0.21 | 4.65 | 0.000 | 0.21 | 4.65 | 0.000 | 0.20 | 4.62 | 0.000 | 0.18 | 4.40 | 0.000 | 0.10 | 3.50 | 0.000 |
| Pancake | 0.08 | 2.41 | 0.016 | 0.08 | 2.64 | 0.008 | 0.10 | 3.61 | 0.000 | 0.04 | 2.42 | 0.015 | 0.01 | 1.25 | 0.213 |
| Paper | 0.11 | 2.66 | 0.008 | 0.11 | 2.81 | 0.005 | 0.14 | 3.81 | 0.000 | 0.14 | 4.07 | 0.000 | 0.11 | 4.02 | 0.000 |
| Rabbit | 0.09 | 3.98 | 0.000 | 0.08 | 3.89 | 0.000 | 0.08 | 4.11 | 0.000 | 0.08 | 4.42 | 0.000 | 0.04 | 3.46 | 0.001 |
| Railroad | 0.08 | 2.91 | 0.004 | 0.05 | 2.16 | 0.031 | 0.03 | 1.81 | 0.070 | 0.01 | 1.32 | 0.186 | 0.01 | 0.93 | 0.352 |
| Smoke | 0.28 | 4.41 | 0.000 | 0.29 | 4.77 | 0.000 | 0.26 | 4.56 | 0.000 | 0.17 | 3.63 | 0.000 | 0.13 | 3.59 | 0.000 |
| Snake | 0.31 | 5.61 | 0.000 | 0.28 | 5.22 | 0.000 | 0.29 | 5.56 | 0.000 | 0.26 | 5.62 | 0.000 | 0.13 | 3.90 | 0.000 |
| Sunday | 0.08 | 2.90 | 0.004 | 0.09 | 3.30 | 0.001 | 0.07 | 2.96 | 0.003 | 0.05 | 2.23 | 0.026 | 0.01 | 1.02 | 0.307 |

Cohen's kappa for assessor and speech verification agreement on blending by item and target score threshold.

Target 50–90 reflects the threshold value that was used to dichotomize the speech verification target score as correct.

The combination of low SVS accuracy and poor item-level and aggregate kappa scores is reflected in the poor specificity for the full sample in Table 2: of the responses identified as incorrect by the human rater, only 26% were also identified as incorrect by the SVS at Target 50, and this percentage shrinks as the target threshold increases. Positive predictive power was also low, peaking at 0.43, meaning that at most 43% of the responses the SVS scored as correct were also scored correct by the human rater. Conversely, high sensitivity and negative predictive power were observed across all target score thresholds, ranging from 0.95 to 0.99 and 0.89 to 0.97, respectively, indicating strong convergence between the rater methods on "correct" responses.

Table 2

| Sample | Target | BLE OCC | BLE SE | BLE SP | BLE NPP | BLE PPP | EVO OCC | EVO SE | EVO SP | EVO NPP | EVO PPP | WR OCC | WR SE | WR SP | WR NPP | WR PPP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full | 50 | 0.52 | 0.95 | 0.26 | 0.89 | 0.43 | 0.87 | 0.86 | 0.87 | 0.91 | 0.80 | 0.80 | 0.77 | 0.81 | 0.91 | 0.56 |
| | 60 | 0.51 | 0.96 | 0.24 | 0.91 | 0.43 | 0.86 | 0.91 | 0.82 | 0.94 | 0.76 | 0.78 | 0.82 | 0.77 | 0.93 | 0.53 |
| | 70 | 0.50 | 0.97 | 0.22 | 0.93 | 0.43 | 0.84 | 0.95 | 0.78 | 0.96 | 0.72 | 0.75 | 0.88 | 0.71 | 0.95 | 0.50 |
| | 80 | 0.48 | 0.98 | 0.18 | 0.95 | 0.42 | 0.82 | 0.97 | 0.73 | 0.97 | 0.68 | 0.71 | 0.92 | 0.65 | 0.96 | 0.46 |
| | 90 | 0.45 | 0.99 | 0.12 | 0.97 | 0.40 | 0.76 | 0.98 | 0.62 | 0.99 | 0.61 | 0.63 | 0.97 | 0.52 | 0.98 | 0.40 |
| White | 50 | 0.48 | 0.94 | 0.26 | 0.90 | 0.37 | 0.86 | 0.86 | 0.86 | 0.93 | 0.74 | 0.82 | 0.72 | 0.85 | 0.91 | 0.58 |
| | 60 | 0.47 | 0.95 | 0.24 | 0.92 | 0.37 | 0.85 | 0.91 | 0.82 | 0.96 | 0.69 | 0.80 | 0.80 | 0.81 | 0.93 | 0.54 |
| | 70 | 0.46 | 0.97 | 0.22 | 0.94 | 0.37 | 0.83 | 0.95 | 0.78 | 0.97 | 0.65 | 0.79 | 0.87 | 0.77 | 0.95 | 0.52 |
| | 80 | 0.44 | 0.98 | 0.18 | 0.96 | 0.36 | 0.80 | 0.96 | 0.73 | 0.98 | 0.61 | 0.75 | 0.94 | 0.69 | 0.98 | 0.47 |
| | 90 | 0.40 | 0.99 | 0.13 | 0.97 | 0.35 | 0.73 | 0.98 | 0.62 | 0.99 | 0.54 | 0.67 | 0.97 | 0.59 | 0.99 | 0.40 |
| Black | 50 | 0.60 | 0.95 | 0.28 | 0.85 | 0.54 | 0.87 | 0.86 | 0.87 | 0.88 | 0.85 | 0.87 | 0.67 | 0.88 | 0.97 | 0.34 |
| | 60 | 0.59 | 0.96 | 0.26 | 0.88 | 0.54 | 0.87 | 0.92 | 0.83 | 0.92 | 0.82 | 0.84 | 0.72 | 0.85 | 0.97 | 0.30 |
| | 70 | 0.58 | 0.97 | 0.23 | 0.91 | 0.53 | 0.86 | 0.95 | 0.79 | 0.95 | 0.80 | 0.80 | 0.83 | 0.79 | 0.98 | 0.27 |
| | 80 | 0.57 | 0.98 | 0.19 | 0.93 | 0.52 | 0.85 | 0.97 | 0.74 | 0.96 | 0.76 | 0.72 | 0.89 | 0.71 | 0.99 | 0.22 |
| | 90 | 0.40 | 0.99 | 0.12 | 0.97 | 0.35 | 0.79 | 0.99 | 0.62 | 0.98 | 0.69 | 0.60 | 1.00 | 0.56 | 1.00 | 0.17 |

Classification accuracy indicators by task, sample, and target threshold.

BLE, blending; EVO, expressive vocabulary; WR, word reading; target 50–90 reflects the threshold value that was used to dichotomize the speech verification target score as correct. OCC, overall correct classification; SE, sensitivity; SP, specificity; NPP, negative predictive power; PPP, positive predictive power.

Expressive vocabulary

Descriptive statistics for Vocabulary (Supplementary Table S2) showed more consistent accuracy levels between HR and SVS than the Blending scores. Mean HR accuracy across all items was 62.38% (SD = 20.59%) and was comparable with the SVS Target 50 (M = 60.32%, SD = 17.64%). Mean accuracy diminished with the Target 60 (M = 55.94%, SD = 18.08%), Target 70 (M = 51.92%, SD = 19.63%), Target 80 (M = 47.80%, SD = 20.35%), and Target 90 (M = 40.14%, SD = 20.95%) thresholds. Item-level agreement indices were considerably stronger in the Vocabulary task (Table 3), with a mean HR-SVS Target 50 κ = 0.70 (min = 0.16, max = 0.88); HR-SVS Target 60 κ = 0.69 (min = 0.18, max = 0.87); HR-SVS Target 70 κ = 0.66 (min = 0.17, max = 0.88); HR-SVS Target 80 κ = 0.61 (min = 0.16, max = 0.86); and HR-SVS Target 90 κ = 0.50 (min = 0.10, max = 0.84). It is worth noting that the HR-SVS κ scores were closer in magnitude using the Target 50 and 60 thresholds for dichotomous scoring of the SVS score, with a more noticeable drop-off in consistency as more conservative values were applied to the original SVS score (i.e., using ≥70, 80, 90, respectively). For certain items, a relatively consistent κ score was observed across thresholds (e.g., chimney, feather, needle, salad), whereas a sharper decline in κ scores was observed for other items (e.g., binoculars, dancing, parachute, sinking, squirrel, whisker). In other instances, an increase in κ scores was observed as the target score threshold increased (e.g., heel, paddle).

Table 3

| Item | Kappa (50) | Z | p | Kappa (60) | Z | p | Kappa (70) | Z | p | Kappa (80) | Z | p | Kappa (90) | Z | p |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Beard | 0.53 | 8.29 | 0.000 | 0.46 | 7.50 | 0.000 | 0.42 | 7.38 | 0.000 | 0.34 | 6.40 | 0.000 | 0.28 | 5.74 | 0.000 |
| Binoculars | 0.81 | 11.50 | 0.000 | 0.75 | 10.86 | 0.000 | 0.59 | 8.96 | 0.000 | 0.38 | 6.76 | 0.000 | 0.10 | 3.06 | 0.002 |
| Camel | 0.88 | 12.56 | 0.000 | 0.87 | 12.39 | 0.000 | 0.88 | 12.43 | 0.000 | 0.83 | 11.86 | 0.000 | 0.68 | 10.05 | 0.000 |
| Chimney | 0.77 | 11.01 | 0.000 | 0.77 | 11.04 | 0.000 | 0.79 | 11.44 | 0.000 | 0.79 | 11.48 | 0.000 | 0.64 | 9.65 | 0.000 |
| Dancing | 0.72 | 10.64 | 0.000 | 0.58 | 9.00 | 0.000 | 0.54 | 8.58 | 0.000 | 0.43 | 7.43 | 0.000 | 0.31 | 6.11 | 0.000 |
| Doorknob | 0.70 | 10.00 | 0.000 | 0.79 | 11.19 | 0.000 | 0.80 | 11.47 | 0.000 | 0.75 | 10.84 | 0.000 | 0.57 | 8.84 | 0.000 |
| Feather | 0.85 | 12.06 | 0.000 | 0.83 | 11.81 | 0.000 | 0.81 | 11.50 | 0.000 | 0.79 | 11.35 | 0.000 | 0.75 | 10.79 | 0.000 |
| Heel | 0.68 | 9.98 | 0.000 | 0.73 | 10.56 | 0.000 | 0.81 | 11.54 | 0.000 | 0.86 | 12.11 | 0.000 | 0.84 | 11.83 | 0.000 |
| Ladder | 0.16 | 4.38 | 0.000 | 0.18 | 5.60 | 0.000 | 0.17 | 5.29 | 0.000 | 0.16 | 5.26 | 0.000 | 0.14 | 5.41 | 0.000 |
| Needle | 0.81 | 11.53 | 0.000 | 0.80 | 11.47 | 0.000 | 0.83 | 11.93 | 0.000 | 0.83 | 11.98 | 0.000 | 0.77 | 11.32 | 0.000 |
| Paddle | 0.34 | 6.94 | 0.000 | 0.39 | 7.98 | 0.000 | 0.42 | 8.75 | 0.000 | 0.42 | 9.00 | 0.000 | 0.42 | 9.45 | 0.000 |
| Parachute | 0.87 | 12.39 | 0.000 | 0.87 | 12.40 | 0.000 | 0.81 | 11.71 | 0.000 | 0.71 | 10.59 | 0.000 | 0.51 | 8.31 | 0.000 |
| Ruler | 0.62 | 8.91 | 0.000 | 0.60 | 8.83 | 0.000 | 0.53 | 8.12 | 0.000 | 0.51 | 7.96 | 0.000 | 0.35 | 6.45 | 0.000 |
| Salad | 0.69 | 9.82 | 0.000 | 0.75 | 10.66 | 0.000 | 0.70 | 10.23 | 0.000 | 0.65 | 9.78 | 0.000 | 0.55 | 8.69 | 0.000 |
| Sinking | 0.88 | 12.46 | 0.000 | 0.84 | 11.97 | 0.000 | 0.81 | 11.66 | 0.000 | 0.77 | 11.10 | 0.000 | 0.66 | 9.91 | 0.000 |
| Squirrel | 0.70 | 9.93 | 0.000 | 0.77 | 10.94 | 0.000 | 0.70 | 9.98 | 0.000 | 0.58 | 8.70 | 0.000 | 0.42 | 7.28 | 0.000 |
| Whisker | 0.79 | 15.90 | 0.000 | 0.71 | 14.61 | 0.000 | 0.63 | 13.40 | 0.000 | 0.57 | 12.62 | 0.000 | 0.45 | 10.71 | 0.000 |

Cohen's kappa for assessor and speech verification agreement on expressive vocabulary by item and target score threshold.

Target 50–90 reflects the threshold value that was used to dichotomize the speech verification target score as correct.

Classification accuracy indices for the full sample (Table 2) demonstrated a much stronger balance between the sensitivity and specificity of scores for HR-SVS at Target 50 (SE = 0.86, SP = 0.87) and Target 60 (SE = 0.92, SP = 0.82) compared with the larger discrepancy observed in Blending. As the target score threshold increased, sensitivity increased from 0.86 to 0.98, specificity decreased from 0.87 to 0.62, negative predictive power increased from 0.91 to 0.99, and positive predictive power decreased from 0.80 to 0.61.

Word reading

Average accuracy scores for item-level word reading performance (Supplementary Table S3) were the highest among the three tasks tested. Mean HR accuracy across all items was 78.94% (SD = 16.83%) and was closely aligned with the SVS Target 50 estimates (M = 73.54%, SD = 21.48%). Mean accuracy at Target 60 was 69.59% (SD = 22.86%), compared with the Target 70 (M = 65.32%, SD = 24.88%), Target 80 (M = 58.98%, SD = 25.89%), and Target 90 (M = 48.17%, SD = 26.12%) thresholds. Item-level agreement indices for HR-SVS varied widely (Table 4), ranging from 0.00 for "light" to 1.00 for "after" and "idea." Average agreement between HR and SVS was observed as: HR-SVS Target 50 κ = 0.52 (min = 0.00, max = 1.00); HR-SVS Target 60 κ = 0.55 (min = 0.00, max = 1.00); HR-SVS Target 70 κ = 0.52 (min = 0.00, max = 1.00); HR-SVS Target 80 κ = 0.47 (min = 0.00, max = 1.00); and HR-SVS Target 90 κ = 0.34 (min = 0.00, max = 0.78). Classification accuracy in the full sample (Table 2) showed a consistent increase in sensitivity as the target score increased from 50 to 90 (i.e., SE = 0.77–0.97) with a comparable decrease in specificity (i.e., SP from 0.82 down to 0.52). Negative predictive power remained high across all target scores (NPP = 0.91 to 0.98), with positive predictive power dropping from 0.56 down to 0.40.

Table 4

| Item | Kappa (50) | Z | p | Kappa (60) | Z | p | Kappa (70) | Z | p | Kappa (80) | Z | p | Kappa (90) | Z | p |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| After | 1.00 | 5.20 | 0.000 | 1.00 | 5.20 | 0.000 | 0.65 | 3.60 | 0.000 | 0.65 | 3.60 | 0.000 | 0.36 | 2.44 | 0.015 |
| Another | 0.72 | 9.67 | 0.000 | 0.75 | 10.18 | 0.000 | 0.75 | 10.27 | 0.000 | 0.66 | 9.43 | 0.000 | 0.47 | 7.45 | 0.000 |
| Back | 0.30 | 4.22 | 0.000 | 0.32 | 4.36 | 0.000 | 0.39 | 5.22 | 0.000 | 0.45 | 6.01 | 0.000 | 0.46 | 6.75 | 0.000 |
| Before | 0.46 | 2.39 | 0.017 | 0.78 | 4.16 | 0.000 | 0.78 | 4.16 | 0.000 | 0.78 | 4.16 | 0.000 | 0.37 | 2.48 | 0.013 |
| Believe | 0.89 | 4.72 | 0.000 | 0.89 | 4.72 | 0.000 | 0.89 | 4.72 | 0.000 | 0.45 | 2.85 | 0.004 | 0.25 | 1.98 | 0.047 |
| Between | 0.79 | 4.19 | 0.000 | 0.70 | 3.82 | 0.000 | 0.56 | 3.23 | 0.001 | 0.44 | 2.77 | 0.006 | 0.31 | 2.22 | 0.027 |
| Both | 0.54 | 7.73 | 0.000 | 0.53 | 7.63 | 0.000 | 0.53 | 8.02 | 0.000 | 0.49 | 7.90 | 0.000 | 0.37 | 6.83 | 0.000 |
| Different | 0.36 | 2.44 | 0.015 | 0.51 | 2.68 | 0.007 | 0.60 | 3.15 | 0.002 | 0.46 | 2.83 | 0.005 | 0.28 | 2.09 | 0.037 |
| Does | 0.12 | 1.77 | 0.076 | 0.05 | 0.88 | 0.377 | 0.04 | 0.94 | 0.347 | 0.02 | 0.77 | 0.440 | 0.01 | 1.16 | 0.247 |
| Each | 0.47 | 2.88 | 0.004 | 0.36 | 2.44 | 0.015 | 0.36 | 2.44 | 0.015 | 0.24 | 1.91 | 0.057 | 0.17 | 1.57 | 0.116 |
| Filly | 0.64 | 8.48 | 0.000 | 0.64 | 8.52 | 0.000 | 0.53 | 7.35 | 0.000 | 0.40 | 6.30 | 0.000 | 0.35 | 6.09 | 0.000 |
| Friend | 0.65 | 9.41 | 0.000 | 0.69 | 10.08 | 0.000 | 0.49 | 7.71 | 0.000 | 0.40 | 6.88 | 0.000 | 0.20 | 4.58 | 0.000 |
| Has | 0.05 | 1.19 | 0.232 | 0.01 | 0.26 | 0.796 | 0.03 | 1.13 | 0.258 | 0.03 | 1.32 | 0.188 | 0.01 | 0.68 | 0.499 |
| Idea | 1.00 | 5.20 | 0.000 | 1.00 | 5.20 | 0.000 | 1.00 | 5.20 | 0.000 | 1.00 | 5.20 | 0.000 | 0.76 | 4.05 | 0.000 |
| Info | 0.84 | 4.43 | 0.000 | 0.85 | 4.40 | 0.000 | 0.85 | 4.40 | 0.000 | 0.70 | 3.69 | 0.000 | 0.45 | 2.80 | 0.005 |
| Inside | 0.69 | 9.39 | 0.000 | 0.67 | 9.20 | 0.000 | 0.62 | 8.74 | 0.000 | 0.48 | 7.32 | 0.000 | 0.31 | 5.77 | 0.000 |
| Last | 0.41 | 5.77 | 0.000 | 0.47 | 6.77 | 0.000 | 0.49 | 7.34 | 0.000 | 0.43 | 6.79 | 0.000 | 0.32 | 5.73 | 0.000 |
| Let | 0.59 | 7.99 | 0.000 | 0.54 | 7.38 | 0.000 | 0.48 | 6.89 | 0.000 | 0.45 | 6.84 | 0.000 | 0.23 | 4.49 | 0.000 |
| Light | 0.00 | - | - | 0.46 | 2.39 | 0.017 | 0.46 | 2.39 | 0.017 | 0.78 | 4.16 | 0.000 | 0.78 | 4.16 | 0.000 |
| Listen | 0.71 | 3.67 | 0.000 | 0.71 | 3.67 | 0.000 | 0.60 | 3.15 | 0.002 | 0.60 | 3.15 | 0.002 | 0.58 | 3.34 | 0.001 |
| Move | 0.42 | 6.07 | 0.000 | 0.41 | 6.08 | 0.000 | 0.40 | 6.23 | 0.000 | 0.34 | 5.74 | 0.000 | 0.25 | 4.71 | 0.000 |
| Number | 0.73 | 9.75 | 0.000 | 0.76 | 10.12 | 0.000 | 0.77 | 10.39 | 0.000 | 0.72 | 9.77 | 0.000 | 0.58 | 8.47 | 0.000 |
| Orange | 0.35 | 1.96 | 0.050 | 0.46 | 2.83 | 0.005 | 0.35 | 2.40 | 0.017 | 0.24 | 1.91 | 0.057 | 0.10 | 1.19 | 0.233 |
| Out | 0.59 | 8.09 | 0.000 | 0.56 | 7.90 | 0.000 | 0.60 | 8.52 | 0.000 | 0.54 | 7.91 | 0.000 | 0.41 | 6.75 | 0.000 |
| Quickly | 0.84 | 4.41 | 0.000 | 0.84 | 4.41 | 0.000 | 0.84 | 4.41 | 0.000 | 1.00 | 5.20 | 0.000 | 0.66 | 3.66 | 0.000 |
| See | 0.29 | 4.19 | 0.000 | 0.41 | 6.15 | 0.000 | 0.35 | 5.56 | 0.000 | 0.29 | 4.89 | 0.000 | 0.21 | 4.00 | 0.000 |
| Show | 0.60 | 8.77 | 0.000 | 0.62 | 8.90 | 0.000 | 0.61 | 8.79 | 0.000 | 0.53 | 7.59 | 0.000 | 0.50 | 7.78 | 0.000 |
| Six | 0.19 | 3.31 | 0.001 | 0.16 | 3.54 | 0.000 | 0.13 | 3.09 | 0.002 | 0.09 | 2.47 | 0.013 | 0.05 | 1.68 | 0.093 |
| Start | 0.58 | 7.74 | 0.000 | 0.58 | 7.92 | 0.000 | 0.52 | 7.33 | 0.000 | 0.48 | 7.25 | 0.000 | 0.38 | 6.36 | 0.000 |
| That | 0.01 | 0.37 | 0.714 | 0.00 | −0.39 | 0.694 | 0.00 | 0.36 | 0.716 | 0.00 | 0.36 | 0.716 | 0.00 | 0.00 | 1.000 |
| Through | 0.34 | 1.86 | 0.062 | 0.42 | 2.24 | 0.025 | 0.49 | 2.63 | 0.009 | 0.50 | 2.65 | 0.008 | 0.44 | 2.41 | 0.016 |

Cohen's kappa for assessor and speech verification agreement on word reading by item and target score threshold.

Target 50–90 reflects the threshold value that was used to dichotomize the speech verification target score as correct.

Agreement disaggregation by race

The race-based disaggregated frequency estimates (Table 5) provided further insight into HR-SVS differences by task. Consistent with the previous aggregated findings, mean accuracy estimates were consistently highest in the HR condition. For example, mean HR accuracy for blending with White students was 74%, compared with a range of 11–25% across the SVS conditions. Likewise, mean HR accuracy for blending with Black students was 59%, compared with a range of 9–23% across the SVS conditions. Of particular importance were the robust differences in the standardized effect size comparing accuracy between White and Black students in the HR condition. Cohen's d for the difference in accuracy between White and Black students on blending was 0.32 in the HR condition, whereas the effect sizes ranged from d = 0.05 to 0.08 across the five SVS conditions. Stated differently, even though the HR modality of scoring yielded higher accuracy overall than the SVS conditions, the standardized difference in accuracy between White and Black students was larger in the HR condition than in the SVS conditions.

Table 5

| Outcome | Rater | White M | White SD | Black M | Black SD | Cohen's d |
| --- | --- | --- | --- | --- | --- | --- |
| BLE | Assessor | 0.74 | 0.44 | 0.59 | 0.49 | 0.32 |
| | Target 50 | 0.25 | 0.43 | 0.23 | 0.42 | 0.05 |
| | Target 60 | 0.23 | 0.42 | 0.20 | 0.40 | 0.07 |
| | Target 70 | 0.20 | 0.40 | 0.17 | 0.38 | 0.08 |
| | Target 80 | 0.16 | 0.37 | 0.14 | 0.34 | 0.06 |
| | Target 90 | 0.11 | 0.31 | 0.09 | 0.28 | 0.07 |
| EVO | Assessor | 0.69 | 0.46 | 0.53 | 0.50 | 0.33 |
| | Target 50 | 0.64 | 0.48 | 0.53 | 0.50 | 0.22 |
| | Target 60 | 0.60 | 0.49 | 0.48 | 0.50 | 0.24 |
| | Target 70 | 0.55 | 0.50 | 0.44 | 0.50 | 0.22 |
| | Target 80 | 0.51 | 0.50 | 0.41 | 0.49 | 0.20 |
| | Target 90 | 0.44 | 0.50 | 0.34 | 0.47 | 0.21 |
| WR | Assessor | 0.78 | 0.42 | 0.92 | 0.28 | −0.40 |
| | Target 50 | 0.72 | 0.45 | 0.84 | 0.37 | −0.29 |
| | Target 60 | 0.67 | 0.47 | 0.80 | 0.40 | −0.30 |
| | Target 70 | 0.62 | 0.48 | 0.74 | 0.44 | −0.26 |
| | Target 80 | 0.55 | 0.50 | 0.66 | 0.48 | −0.22 |
| | Target 90 | 0.46 | 0.50 | 0.52 | 0.50 | −0.12 |

Aggregated accuracy by outcome, rater, and racial subgroup.

BLE, blending; EVO, expressive vocabulary; WR, word reading; target 50–90 reflects the threshold value that was used to dichotomize the speech verification target score as correct.

Expressive Vocabulary (EVO) showed a similar, though less discrepant, pattern of results, with mean accuracy differences between White and Black students of d = 0.33 for the HR condition compared with d = 0.20–0.24 for the SVS conditions. The Word Reading task presented the largest HR-based difference in accuracy ratings (d = −0.40), along with larger differences for the SVS conditions (d = −0.12 to −0.30).
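These group contrasts rest on Cohen's d. As a check on scale, a pooled-SD sketch (the pooled-SD variant and the equal group sizes below are assumptions chosen only to echo the reported blending means, not the study's actual samples):

```python
import math

def cohens_d(g1, g2):
    """Cohen's d with pooled standard deviation (assumed variant)."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)  # sample variances
    v2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Binary accuracy vectors with means matching the blending HR contrast
white = [1] * 74 + [0] * 26  # M = 0.74
black = [1] * 59 + [0] * 41  # M = 0.59
print(round(cohens_d(white, black), 2))  # ~0.32, consistent with Table 5
```

Because the item scores are binary, each group's SD is close to √(p(1 − p)), which is why the SDs in Table 5 cluster near 0.5 when accuracy is near 50%.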

The κ index for agreement within race (Table 6) showed statistically significantly higher levels of agreement between HR and SVS for Black students than for White students on blending at Targets 50–80 (p < 0.05) using a between-group kappa test. Likewise, significant differences were observed between Black and White students on vocabulary at target scores 70, 80, and 90, and on word reading at target scores 80 and 90. Classification accuracy by race (Table 2) showed higher overall correct classification for Black students than White students on blending scores but was comparable for both vocabulary and word reading. Sensitivity, specificity, and negative predictive power estimates also were comparable between White and Black students across target score thresholds. Positive predictive power estimates showed the greatest discrepancy between student groups, with a range of 0.35–0.37 for White students compared with 0.35–0.54 for Black students on blending. A similar observation held for vocabulary outcomes, with positive predictive power of 0.54–0.74 for White students compared with 0.69–0.85 for Black students. The reverse was true for word reading, with higher estimates for White students (0.40–0.58) than for Black students (0.17–0.34).

Table 6

| Task | Target score | Full Kappa | Full 95% LB | Full 95% UB | Black Kappa | Black 95% LB | Black 95% UB | White Kappa | White 95% LB | White 95% UB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLE | 50 | 0.16 | 0.15 | 0.18 | 0.21 | 0.18 | 0.24 | 0.14* | 0.12 | 0.15 |
| | 60 | 0.16 | 0.15 | 0.17 | 0.21 | 0.18 | 0.23 | 0.13* | 0.12 | 0.15 |
| | 70 | 0.15 | 0.14 | 0.16 | 0.19 | 0.17 | 0.22 | 0.13* | 0.12 | 0.14 |
| | 80 | 0.13 | 0.12 | 0.14 | 0.16 | 0.14 | 0.18 | 0.11* | 0.10 | 0.12 |
| | 90 | 0.08 | 0.08 | 0.09 | 0.11 | 0.09 | 0.13 | 0.08 | 0.07 | 0.09 |
| EVO | 50 | 0.72 | 0.70 | 0.74 | 0.73 | 0.69 | 0.77 | 0.69 | 0.66 | 0.72 |
| | 60 | 0.71 | 0.68 | 0.73 | 0.74 | 0.70 | 0.78 | 0.67 | 0.64 | 0.71 |
| | 70 | 0.68 | 0.66 | 0.71 | 0.73 | 0.69 | 0.77 | 0.64* | 0.61 | 0.67 |
| | 80 | 0.64 | 0.62 | 0.66 | 0.69 | 0.65 | 0.73 | 0.59* | 0.56 | 0.63 |
| | 90 | 0.54 | 0.52 | 0.56 | 0.59 | 0.55 | 0.63 | 0.49* | 0.46 | 0.52 |
| WR | 50 | 0.51 | 0.48 | 0.54 | 0.39 | 0.21 | 0.57 | 0.52 | 0.46 | 0.58 |
| | 60 | 0.50 | 0.47 | 0.53 | 0.35 | 0.20 | 0.52 | 0.52 | 0.46 | 0.57 |
| | 70 | 0.47 | 0.44 | 0.50 | 0.32 | 0.19 | 0.47 | 0.51 | 0.45 | 0.56 |
| | 80 | 0.42 | 0.40 | 0.45 | 0.25 | 0.15 | 0.37 | 0.47* | 0.42 | 0.52 |
| | 90 | 0.33 | 0.31 | 0.35 | 0.18 | 0.11 | 0.26 | 0.37* | 0.33 | 0.42 |

Aggregate Cohen's kappa index for the full sample disaggregated by race, task, and target score.

BLE, blending; EVO, expressive vocabulary; WR, word reading; target 50–90 reflects the threshold value that was used to dichotomize the speech verification target score as correct. *denotes statistically significant differences in kappa between White and Black students.

Discussion

Modern technological innovation has led to increased use and implementation of speech verification systems (SVS) for educational assessment. The purpose of this study was to build on existing science that has examined human rater (HR) and SVS consistency in other areas of educational screening, such as oral reading fluency (Bolaños et al., 2013; Nese and Kamata, 2021), and specifically to: (1) test HR-SVS consistency in word-level phonological awareness, vocabulary, and word reading tasks; (2) evaluate differences in consistency when the base-rate SVS target score for accuracy is manipulated to account for several thresholds of accuracy; and (3) evaluate differences in estimates of consistency for White and Black students in our sample.

HR-SVS consistency across tasks

The consistency between SVS and HR scores observed in this study varied significantly across different tasks, highlighting the intricate nature of accurately assessing diverse linguistic skills through SVS systems. In the Blending task, which involves combining individual syllables or phonemes (i.e., distinct units of sound in a language) to construct complete words, SVS exhibited notably lower agreement with HR. This disparity may stem from the inherent complexity of the Blending task itself, which requires strong phonemic awareness and adept phonological processing skills—both fundamental to accurately perceiving and manipulating individual speech sounds. SVS systems, reliant on automated algorithms, may struggle to fully capture the nuanced intricacies involved in such a task.

A critical challenge lies in the variability of phonetic pronunciation among speakers, influenced by factors such as accent, speech rate, and coarticulation with adjacent sounds. This variability poses difficulties for SVS systems in establishing consistent patterns for each phoneme. Furthermore, given the nature of the Blending task, children may have incorporated frequent pauses, hesitations, and self-corrections while striving to articulate individual phonemes into cohesive words, consistent with findings from Wang et al. (2009). This could pose a challenge for SVS systems, whose algorithms typically expect continuous and fluent speech patterns, making it difficult to distinguish between intentional pauses and errors. This variability in speech patterns, especially in tasks like the Blending task where phonemic accuracy is crucial, highlights the complexity SVS systems face in accurately assessing children's speech and phonological skills. Addressing these complexities necessitates advanced algorithms capable of robustly modeling phonetic variations and identifying disfluencies, thereby enhancing the overall accuracy and reliability of SVS. Initial evidence that such adjustments are possible was provided by Wang et al. (2009), who trained the system to detect disfluencies and accents, and consequently, when readministered, the SVS version of the assessment achieved more comparable scores to the human-administered assessment.

In contrast, the Expressive Vocabulary (EVO) and Word Reading (WR) tasks exhibited higher agreement rates between SVS and HR, although significant mismatches were still observed. EVO, which involves expressive language skills and word usage, showed the best agreement among the tasks evaluated. This could be due to the relatively clearer phonetic structures and less variability in word articulation compared with the Blending task, particularly at the 50, 60, and 70 target scores. WR, while demonstrating greater agreement than the Blending task, showed lower agreement than EVO. As with the Blending task, these results could be attributed to disfluencies such as hesitations and self-corrections, given that the task requires children to decode a word, which could have posed a challenge for the SVS, especially for students at this grade level.

Adjustments akin to those mentioned for the Blending task may also be necessary to enhance performance in the WR task. The observed discrepancies in both the Blending and WR tasks suggest that SVS technology may struggle with tasks requiring deeper phonological analysis. These results underscore the ongoing need to refine and validate SVS in educational assessments, particularly for tasks involving language skills that demand a detailed analysis of the phonetic structure of words.

Thresholds of accuracy

A further aim of this study was to evaluate differences in consistency when the base rate SVS target score for accuracy was manipulated to account for several thresholds of accuracy. Thresholds of accuracy in SVS are critical benchmarks used to determine the system's performance in correctly identifying and validating speech inputs. These thresholds define the confidence levels at which an SVS deems a response correct or incorrect. Adjusting these thresholds can significantly impact the balance between sensitivity (correctly accepting valid inputs) and specificity (correctly rejecting invalid inputs). Higher thresholds may reduce false positives but increase false negatives, while lower thresholds can have the opposite effect. Therefore, setting appropriate accuracy thresholds is essential for optimizing the reliability and effectiveness of SVS in various applications, including educational assessments.

In this study, we found that as the thresholds of accuracy moved from Target 50 to Target 90, agreement varied across both tasks and items within tasks, with some items showing better agreement between SVS and HR than others, and this agreement further varied across threshold levels. This finding suggests that a single target threshold for all items within a task may not be as accurate as selecting a target threshold for each specific item. For example, in Word Reading, the item "after" has a kappa of 1.00 at Target 50 and 0.36 at Target 90, whereas "light" has a kappa of 0.00 at Target 50 and 0.78 at Target 90. These items show opposite patterns between threshold and agreement with human raters; thus, selecting thresholds for each item separately may allow for better accuracy. This variability was consistent across items in all three tasks (see Tables 1, 3, 4). It is plausible that some items may require a lower threshold to accurately capture responses, while others may require a higher threshold to maintain rigor and precision. Further research is warranted to systematically evaluate the consistency and reliability of accuracy when employing item-based thresholds in SVS assessments, and to examine how best to select thresholds to provide the most accurate results.

Racial variability

According to Martin and Wright (2023), speech verification systems (SVS) are under-researched with respect to potential biases embedded within their algorithms and training datasets, which could inadvertently disadvantage specific racial or ethnic groups. The current study aimed to address this gap by examining the consistency between SVS and human raters (HR) across different racial demographics. The findings indicate varying degrees of accuracy in scores between White and Black participants across the linguistic tasks, as evidenced by Cohen's d values. For the Blending (BLE) task, small effect sizes in the SVS conditions suggest minimal differences between the two groups, indicating relative insensitivity to variation. However, it should be noted that overall agreement in the SVS-scored BLE task was poor across both groups, indicating challenges in accurately assessing phonemic blending with SVS. Interestingly, the highest effect size in the Blending task was observed in the assessor ratings (d = 0.32), suggesting some level of bias. In contrast, the Expressive Vocabulary (EVO) task showed moderate effect sizes, with White participants generally scoring higher, potentially indicating that expressive language skills are more influenced by dialectal differences and leading to biased outcomes against Black participants on this task.

The Word Reading (WR) task demonstrated negative effect sizes, indicating that Black participants scored higher than White participants. This suggests that the WR task may be more sensitive to dialectal features that favor Black participants' performance, highlighting the complex interaction between task type and dialectal variation. Overall, the findings of this study revealed significant variability in SVS performance based on the race of the students assessed, even with current SVS technology, and underscore the necessity for further investigation into potential biases in the application of SVSs and their implications.

The clinically important and differential effect size differences across tasks and conditions highlight several specific considerations and questions. The notable size of the HR-condition effect sizes relative to the SVS conditions raises the question of the role that implicit bias may play in scoring oral production. Scientific evidence points to the role that subconscious biases play in educational performance judgments pertaining to White vs. Black or Latino students (e.g., Ready and Wright, 2011; Tenenbaum and Ruck, 2007), and such biases may manifest through explicit and systematic differences in evaluating how students perform (e.g., Wood and Graham, 2010). It is important to note that it is not merely the size of the effect in the HR condition that might suggest implicit bias but rather the systematic differences in effect size between the HR and SVS conditions. For the blending outcome, the relative Cohen's d difference between the HR and SVS conditions ranged from 0.24 to 0.27, with relative differences of 0.09–0.13 in vocabulary and 0.10–0.28 in word reading. An implication of this finding is that assessment administration training should consider incorporating implicit bias awareness and scoring considerations grounded in linguistic diversity (Hendricks and Adlof, 2017; Kang et al., 2019).

The systematically lower accuracy for both racial groups according to SVS scoring raises the question of the types of training data that are used to calibrate automated scoring systems and their appropriateness in the context of screening assessments. Even where algorithms are calibrated according to a robust set of speech samples, the types of acoustic and phonetic patterns inherent to certain tasks (Dellwo et al., 2003) such as phonological awareness may not be as robustly represented. For example, trained human assessors may have a stronger contextual understanding of phonological awareness acoustic profiles that could explain the larger differences in accuracy between HR and SVS conditions. If SVS training data sets do not contain such contextual features, it could result in lower overall accuracy ratings. Further, the interaction between item-level contextual features and student-level linguistic diversity may manifest in lower accuracy for underrepresented groups (e.g., Tatman, 2017).

Limitations and future directions

The current study has several limitations that should be acknowledged. The reliance on a single automated SVS limits the generalizability of the findings. Future research should include comparisons across multiple SVS platforms to validate these results. Additionally, the study focused on kindergarten students, and the findings may not be directly applicable to older children or other educational contexts. The developmental differences in speech patterns and cognitive abilities across age groups warrant further investigation. The study also did not account for potential socio-economic factors that could influence children's speech and, consequently, the SVS's performance. Including a more comprehensive set of demographic variables in future research could provide a more holistic understanding of SVS's efficacy.

Future research should focus on refining SVS algorithms to better accommodate the variability in children's speech, including dialectal and accent variations. Developing robust acoustic models that can accurately recognize and interpret diverse speech patterns is crucial for improving SVS accuracy. Refining SVS to account for the phonetic and developmental features of diverse children's speech is essential for ensuring fair and reliable assessments in educational settings. Further consideration may also be needed in the selection of items for tasks that account for dialectal variability. One notable feature of African American English (AAE) is the omission of final consonants such as “g” and “r.” The Expressive Vocabulary (EVO) task, in which the White participants scored higher, included six items that ended with either a “g” or an “r,” specifically the words: “dancing,” “feather,” “ladder,” “ruler,” “sinking,” and “whisker.” To ensure fair assessment, the SVS might need to be programmed to recognize the omission of these final consonants as correct responses when they are consistent with the dialectal patterns of AAE speakers.

Additionally, exploring adaptive threshold settings (i.e., using varying target score thresholds to dichotomize accuracy) based on item characteristics and individual child profiles could enhance the precision of SVS-based assessments. Implementing machine learning techniques to dynamically adjust thresholds and calibrate scores in real-time might provide a more accurate and fair assessment tool. Longitudinal studies tracking the same cohort of children over several years would also be beneficial in understanding how SVS performance evolves with children's developmental changes to provide insights into optimizing SVS systems for continuous educational monitoring and support.

Conclusion

The study underscores the potential and challenges of integrating SVS into educational assessments. Although SVS technology shows promise in reducing human error and addressing racial biases, significant attention is needed by educational assessment publishers looking to integrate SVS. Educational publishers should attend to SVS by understanding where the outputs produce consistent results as human raters and the extent to which the results vary across tasks, developmental ages, student characteristics, and potentially items within tasks. Ensuring equity and fairness through testing and reporting will be key as publishers seek to adopt SVS in their assessment systems. By addressing the identified limitations and pursuing the suggested future research directions, SVS can become a valuable tool in educational settings, enhancing the accuracy and fairness of assessments across diverse student populations.

Statements

Data availability statement

The datasets presented in this article are not readily available due to consent and usage constraints. Requests to access the datasets should be directed to .

Ethics statement

The studies involving humans were approved by Florida State University Human Subjects Review Office. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants' legal guardians/next of kin.

Author contributions

YP: Methodology, Writing – review & editing, Formal analysis, Funding acquisition, Writing – original draft. JO'S: Writing – review & editing, Writing – original draft. HC: Funding acquisition, Project administration, Writing – review & editing. AE: Writing – review & editing, Formal analysis. LF: Methodology, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was funded by the Chan Zuckerberg Initiative.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. Generative AI was used to edit the manuscript for clarity, ensure that responses to reviewers were detailed, conduct a reference check, and identify areas of the manuscript where flow could be improved.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2026.1671946/full#supplementary-material

References

Adams, M. J. (2011). Technology for Developing Children's Language and Literacy: Bringing Speech Recognition to the Classroom. The Joan Ganz Cooney Center at Sesame Workshop. Available online at: https://joanganzcooneycenter.org/wp-content/uploads/2011/09/jgcc_tech_for_language_and_literacy.pdf (Accessed December 12, 2025).

Adlof, S. M., Scoggins, J., Brazendale, A., Babb, S., and Petscher, Y. (2017). Identifying children at risk for language impairment or dyslexia with group-administered measures. J. Speech Lang. Hear. Res. 60, 3507–3522. doi: 10.1044/2017_JSLHR-L-16-0473

Bernstein, J., Cheng, J., Balogh, J., and Rosenfeld, E. (2017). "Studies of a self-administered oral reading assessment," in SLaTE, 172–176. doi: 10.21437/SLaTE.2017-30

Beyer, T., Edwards, K. A., and Fuller, C. C. (2015). Misinterpretation of African American English BIN by adult speakers of Standard American English. Lang. Commun. 45, 59–69. doi: 10.1016/j.langcom.2015.09.001

Black, M., Tepperman, J., Lee, S., Price, P., and Narayanan, S. S. (2007). "Automatic detection and classification of disfluent reading miscues in young children's speech for the purpose of assessment," in Proceedings of Interspeech 2007, 206–209. doi: 10.21437/Interspeech.2007-87

Bolaños, D., Cole, R. A., Ward, W. H., Tindal, G. A., Hasbrouck, J., and Schwanenflugel, P. J. (2013). Human and automated assessment of oral reading fluency. J. Educ. Psychol. 105, 1142–1151. doi: 10.1037/a0031479

Chen, T., and Sun, S. (2025). Evaluating automated evaluation systems for spoken English proficiency: an exploratory comparative study with human raters. PLoS ONE 20:e0320811. doi: 10.1371/journal.pone.0320811

Chiu, D., Talhouk, A., and Leung, S. (2024). biostatUtil: Utility Functions Used in Biostatistics Projects. R Package Version 0.5.8. Available online at: https://talhouklab.github.io/biostatUtil/ (Accessed December 12, 2025).

Cummings, K. D., Biancarosa, G., Schaper, A., and Reed, D. K. (2014). Examiner error in curriculum-based measurement of oral reading. J. Sch. Psychol. 52, 361–375. doi: 10.1016/j.jsp.2014.05.007

Dellwo, V., Wagner, P., Solé, M. J., Recasens, D., and Romero, J. (2003). "Relations between language rhythm and speech rate," in Proceedings of the International Congress of Phonetic Sciences (International Phonetic Association), 471–474.

Diehm, E. A., and Hendricks, A. E. (2021). Teachers' content knowledge and pedagogical beliefs regarding the use of African American English. Lang. Speech Hear. Serv. Sch. 52, 100–117. doi: 10.1044/2020_LSHSS-19-00101

Dudy, S., Bedrick, S., Asgari, M., and Kain, A. (2018). Automatic analysis of pronunciations for children with speech sound disorders. Comput. Speech Lang. 50, 62–84. doi: 10.1016/j.csl.2017.12.006

Eadie, P., Levickis, P., McKean, C., Westrupp, E., Bavin, E. L., Ware, R. S., et al. (2022). Developing preschool language surveillance models—cumulative and clustering patterns of early life factors in the early language in Victoria Study cohort. Front. Pediatr. 10:826817. doi: 10.3389/fped.2022.826817

Easton, C., and Verdon, S. (2021). The influence of linguistic bias upon speech-language pathologists' attitudes toward clinical scenarios involving nonstandard dialects of English. Am. J. Speech Lang. Pathol. 30, 1973–1989. doi: 10.1044/2021_AJSLP-20-00382

Evans, K. E., Munson, B., and Edwards, J. (2018). Does speaker race affect the assessment of children's speech accuracy? A comparison of speech-language pathologists and clinically untrained listeners. Lang. Speech Hear. Serv. Sch. 49, 906–921. doi: 10.1044/2018_LSHSS-17-0120

Feng, S., Halpern, B. M., Kudina, O., and Scharenborg, O. (2024). Towards inclusive automatic speech recognition. Comput. Speech Lang. 84:101567. doi: 10.1016/j.csl.2023.101567

Feng, S., Kudina, O., Halpern, B. M., and Scharenborg, O. (2021). Quantifying bias in automatic speech recognition. arXiv [preprint] arXiv:2103.15122.

Foltz, P. W., Yan, D., and Rupp, A. A. (2020). "The past, present and future of automated scoring," in Handbook of Automated Scoring: Theory into Practice, eds. D. Yan, A. Rupp, and P. W. Foltz (CRC Press), 1–9. doi: 10.1201/9781351264808-1

Gamer, M. (2010). irr: Various Coefficients of Interrater Reliability and Agreement. Available online at: http://cran.r-project.org/web/packages/irr/irr.pdf (Accessed December 12, 2025).

Gatlin-Nash, B., and Terry, N. P. (2022). "Theory-based approaches to language instruction for primary school poor readers who speak non mainstream American English," in Handbook of Literacy in Diglossia and in Dialectal Contexts, Vol. 22: Literacy Studies, eds. E. Saiegh-Haddad, L. Laks, and C. McBride (Springer). doi: 10.1007/978-3-030-80072-7_20

Gearin, B., Petscher, Y., Stanley, C., Nelson, N. J., and Fien, H. (2022). Document analysis of state dyslexia legislation suggests likely heterogeneous effects on student and school outcomes. Learn. Disabil. Q. 45, 267–279. doi: 10.1177/0731948721991549

Gerosa, M., Giuliani, D., and Brugnara, F. (2007). Acoustic variability and automatic recognition of children's speech. Speech Commun. 49, 847–860. doi: 10.1016/j.specom.2007.01.002

Hendricks, A. E., and Adlof, S. M. (2017). Language assessment with children who speak nonmainstream dialects: examining the effects of scoring modifications in norm-referenced assessment. Lang. Speech Hear. Serv. Sch. 48, 168–182. doi: 10.1044/2017_LSHSS-16-0060

Holm, A., Crosbie, S., and Dodd, B. (2007). Differentiating normal variability from inconsistency in children's speech: normative data. Int. J. Lang. Commun. Disord. 42, 467–486. doi: 10.1080/13682820600988967

Jullien, S. (2021). Screening for language and speech delay in children under five years. BMC Pediatr. 21:362. doi: 10.1186/s12887-021-02817-7

Kang, O., Rubin, D., and Kermad, A. (2019). The effect of training and rater differences on oral proficiency assessment. Lang. Test. 36, 481–504. doi: 10.1177/0265532219849522

Kelly, A. C., Karamichali, E., Saeb, A., Veselý, K., Parslow, N., Deng, A., et al. (2020). "SoapBox Labs verification platform for child speech," in Proceedings of INTERSPEECH 2020, 486–487.

Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., et al. (2020). Racial disparities in automated speech recognition. Proc. Natl. Acad. Sci. U.S.A. 117, 7684–7689. doi: 10.1073/pnas.1915768117

Lee, S., Potamianos, A., and Narayanan, S. (1999). Acoustics of children's speech: developmental changes of temporal and spectral parameters. J. Acoust. Soc. Am. 105, 1455–1468. doi: 10.1121/1.426686

Levi, S. V., Harel, D., and Schwartz, R. G. (2019). Language ability and the familiar talker advantage: generalizing to unfamiliar talkers is what matters. J. Speech Lang. Hear. Res. 62, 1427–1436. doi: 10.1044/2019_JSLHR-L-18-0160

Li, J., Deng, L., Haeb-Umbach, R., and Gong, Y. (2015). Robust Automatic Speech Recognition: A Bridge to Practical Applications. Academic Press.

Lippi-Green, R. (2012). English with an Accent: Language, Ideology and Discrimination in the United States. Routledge. doi: 10.4324/9780203348802

Martin, J. L., and Wright, K. E. (2023). Bias in automatic speech recognition: the case of African American language. Appl. Linguist. 44, 613–630. doi: 10.1093/applin/amac066

Nese, J. F. T., and Kamata, A. (2021). Evidence for automated scoring and shorter passages of CBM-R in early elementary school. Sch. Psychol. 36, 47–59. doi: 10.1037/spq0000415

Petscher, Y., Cabell, S. Q., Catts, H. W., Compton, D. L., Foorman, B. R., Hart, S. A., et al. (2020). How the science of reading informs 21st-century education. Read. Res. Q. 55, S267–S282. doi: 10.1002/rrq.352

Petscher, Y., and Catts, H. (2024). Preliminary Psychometric Evidence for the K-3 Interstellar Express Assessment System. Tallahassee, FL: Florida State University. Available online at: https://osf.io/preprints/edarxiv/u8pn5

Petscher, Y., and Patton Terry, N. (2020). Speech Recognition in Education: The Powers and Perils. SmartBrief. Available online at: https://www.smartbrief.com/original/speech-recognition-education-powers-and-perils (Accessed December 12, 2025).

Potamianos, A., and Narayanan, S. (1998). Spoken dialog systems for children. ICASSP 1, 197–200. doi: 10.1109/ICASSP.1998.674401

Potamianos, A., and Narayanan, S. (2003). Robust recognition of children's speech. IEEE Trans. Speech Audio Process. 11, 603–616. doi: 10.1109/TSA.2003.818026

Ready, D. D., and Wright, D. L. (2011). Accuracy and inaccuracy in teachers' perceptions of young children's cognitive abilities: the role of child background and classroom context. Am. Educ. Res. J. 48, 335–360. doi: 10.3102/0002831210374874

Robinson, G. C., and Stockman, I. J. (2009). Cross-dialectal perceptual experiences of speech-language pathologists in predominantly Caucasian American school districts. Lang. Speech Hear. Serv. Sch. 40, 138–149. doi: 10.1044/0161-1461(2008/07-0063)

Russell, M., and D'Arcy, S. (2007). "Challenges for computer recognition of children's speech," in Proceedings of Speech and Language Technology in Education (SLaTE 2007), 108–111. doi: 10.21437/SLaTE.2007-26

Schoonmaker-Gates, E. (2018). Dialect comprehension and identification in L2 Spanish: familiarity and type of exposure. Stud. Hispanic Lusophone Linguist. 11, 193–214. doi: 10.1515/shll-2018-0007

Shivakumar, P. G., and Georgiou, P. (2020). Transfer learning from adult to children for speech recognition: evaluation, analysis and recommendations. Comput. Speech Lang. 63:101077. doi: 10.1016/j.csl.2020.101077

Shivakumar, P. G., and Narayanan, S. (2022). End-to-end neural systems for automatic children speech recognition: an empirical study. Comput. Speech Lang. 72:101289. doi: 10.1016/j.csl.2021.101289

Sumner, M., and Samuel, A. G. (2009). The effect of experience on the perception and representation of dialect variants. J. Mem. Lang. 60, 487–501. doi: 10.1016/j.jml.2009.01.001

Tatman, R. (2017). "Gender and dialect bias in YouTube's automatic captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. doi: 10.18653/v1/W17-1606

Tenenbaum, H. R., and Ruck, M. D. (2007). Are teachers' expectations different for racial minorities than for European American students? A meta-analysis. J. Educ. Psychol. 99, 253–273. doi: 10.1037/0022-0663.99.2.253

Turner, J., Porter, A., Graham, S. M., Ralph-Donaldson, T., Krüsemann, H., Zhang, P., et al. (2025). Evaluating the scoring system of an AI-integrated app to assess foreign language phonological decoding. Res. Methods Appl. Linguist. 4:100257. doi: 10.1016/j.rmal.2025.100257

van der Velde, M., Harmsen, W., Veldkamp, B. P., Feskens, R., Keuning, J., and Swart, N. (2025). Speech enabled reading fluency assessment: a validation study. Int. J. Artif. Intell. Educ. 35, 2569–2595. doi: 10.1007/s40593-025-00480-y

Wang, S., Price, P., Lee, Y.-H., and Alwan, A. (2009). "Measuring children's phonemic awareness through blending tasks," in Proceedings of Speech and Language Technology in Education (SLaTE 2009), 101–104. doi: 10.21437/SLaTE.2009-22

Wilpon, J. G., and Jacobsen, C. (1996). A study of speech recognition for children and the elderly. IEEE Int. Confer. Acoustics Speech Signal Process. Confer. Proc. 1, 349–352. doi: 10.1109/ICASSP.1996.541104

Wood, D., and Graham, S. (2010). "Why race matters: social context and achievement motivation in African American youth," in The Decade Ahead: Applications and Contexts of Motivation and Achievement (Advances in Motivation and Achievement, Vol. 16, Part B), eds. T. Urdan and S. Karabenick (Bingley: Emerald Group Publishing Limited), 175–209. doi: 10.1108/S0749-7423(2010)000016B009

Yan, D., and Bridgeman, B. (2020). "Validation of automated scoring systems," in Handbook of Automated Scoring: Theory into Practice, eds. D. Yan, A. Rupp, and P. W. Foltz (CRC Press), 297–318. doi: 10.1201/9781351264808-16

Zue, V., Seneff, S., Glass, J., Polifroni, J., Pao, C., Hazen, T., et al. (2000). Jupiter: a telephone-based conversational interface for weather information. IEEE Trans. Speech Audio Process. 8, 85–96. doi: 10.1109/89.817460

Summary

Keywords

artificial intelligence, dyslexia, reading, screening tools, speech verification

Citation

Petscher Y, O'Sullivan J, Catts HW, Edwards A and Fitton L (2026) Evaluation of the consistency of a speech verification system with human raters in early literacy screening assessments. Front. Educ. 11:1671946. doi: 10.3389/feduc.2026.1671946

Received

28 July 2025

Revised

07 January 2026

Accepted

13 January 2026

Published

19 February 2026

Volume

11 - 2026

Edited by

José Manuel de Amo Sánchez-Fortún, University of Almeria, Spain

Reviewed by

Daniel D. Hromada, Berlin University of the Arts, Germany

Si-ioi Ng, Arizona State University, United States

John Sahaya Rani Alex, VIT University Chennai, India

Copyright

*Correspondence: Yaacov Petscher,
