Characterizing Dysarthria Diversity for Automatic Speech Recognition: A Tutorial From the Clinical Perspective

Rowe, Hannah P.; Gutz, Sarah E.; Maffei, Marc F.; Tomanek, Katrin; Green, Jordan R.

doi:10.3389/fcomp.2022.770210

PERSPECTIVE article

Front. Comput. Sci., 12 April 2022

Sec. Human-Media Interaction

Volume 4 - 2022 | https://doi.org/10.3389/fcomp.2022.770210

Characterizing Dysarthria Diversity for Automatic Speech Recognition: A Tutorial From the Clinical Perspective

Hannah P. Rowe¹

Sarah E. Gutz²

Marc F. Maffei¹

Katrin Tomanek³

Jordan R. Green^1,2^*

¹Department of Rehabilitation Sciences, MGH Institute of Health Professions, Boston, MA, United States
²Department of Speech and Hearing Bioscience and Technology, Harvard University, Boston, MA, United States
³Google LLC, Mountain View, CA, United States

Despite significant advancements in automatic speech recognition (ASR) technology, even the best performing ASR systems are inadequate for speakers with impaired speech. This inadequacy may be, in part, due to the challenges associated with acquiring a sufficiently diverse training sample of disordered speech. Speakers with dysarthria, which refers to a group of divergent speech disorders secondary to neurologic injury, exhibit highly variable speech patterns both within and across individuals. This diversity is currently poorly characterized and, consequently, difficult to adequately represent in disordered speech ASR corpora. In this article, we consider the variable expressions of dysarthria within the context of established clinical taxonomies (e.g., Darley, Aronson, and Brown dysarthria subtypes). We also briefly consider past and recent efforts to capture this diversity quantitatively using speech analytics. Understanding dysarthria diversity from the clinical perspective and how this diversity may impact ASR performance could aid in (1) optimizing data collection strategies for minimizing bias; (2) ensuring representative ASR training sets; and (3) improving generalization of ASR for difficult-to-recognize speakers. Our overarching goal is to facilitate the development of robust ASR systems for dysarthric speech using clinical knowledge.

Introduction

Dysarthria, or impaired speech due to motoric deficits, can have a detrimental impact on functional communication, often leading to significantly reduced quality of life (Hartelius et al., 2008). For individuals with speech impairments, automatic speech recognition (ASR) systems can enhance accessibility and interpersonal communication. However, inadequate acoustic models continue to impede the widespread success of ASR for disordered speech (Gupta et al., 2016; Moore et al., 2018). The limits of disordered speech ASR may be, in part, a byproduct of the significant variety of abnormal speech patterns across individuals (Duffy, 2013) and their underrepresentation in training corpora (Gupta et al., 2016). Nevertheless, studies on ASR for dysarthria have rarely considered this diversity (Blaney and Wilson, 2000; Benzeghiba et al., 2007; Gupta et al., 2016; Keshet, 2018; Moore et al., 2018). In this perspective article, we examine how speech impairment diversity has been characterized based on clinical models and how this diversity may impact ASR performance.

ASR can be broadly classified into (1) speaker-independent (SI) systems, which are typically trained on large multispeaker datasets, and (2) personalized systems, which can be trained either by adapting an existing SI model to a target speaker (i.e., speaker-adaptive [SA]) or by solely using the target speaker's speech data (i.e., speaker-dependent [SD]). Although commercially developed SI ASR systems have demonstrated low word error rates (WER) for healthy speakers, these systems perform considerably worse with impaired speech (Moore et al., 2018). An increasing amount of work has thus investigated the use of personalized systems for speakers with speech impairments, demonstrating significantly stronger performance compared to that of SI systems (Mengistu and Rudzicz, 2011; Kim et al., 2013; Mustafa et al., 2014; Xiong et al., 2019; Takashima et al., 2020; Green et al., 2021). Green et al. (2021), for example, recently demonstrated that the recognition accuracy of short phrases using end-to-end (E2E) ASR models was 4.6% for personalized models compared to 31% for SI models.

While personalized systems have promising utility for recognizing impaired speech, they require training data from the speaker, which may be impractical for some applications and can be cumbersome for individuals with neurodegenerative diseases who are prone to fatigue. Thus, a more efficient and effective approach would be to improve the recognition accuracy for existing SI systems for dysarthric speech. For example, prior work has shown improvements in performance of SI systems when training sets include dysarthric speakers, thereby providing more variability on which to train (Mengistu and Rudzicz, 2011; Mustafa et al., 2014). However, even the highest performing SI ASR models are inadequate for impaired speakers.

Although poor performance has largely been attributed to the shortage of disordered speech training datasets, closing the performance gap is likely to require not only more data but also sufficiently diverse corpora. Indeed, solely adding speakers to the training corpora in attempts to improve ASR performance is inefficient, expensive, and possibly unachievable. Ensuring dysarthric speech diversity requires conceptual schemes for identifying salient atypical speech variables and their expressed ranges across individuals. In this article, we consider several conceptual schemes used by speech-language pathologists to clinically characterize dysarthria diversity often for the purpose of speech diagnosis. An improved understanding of the diversity inherent to dysarthria and its potential impact on ASR performance could lead to (1) optimized data collection strategies for minimizing bias; (2) sufficiently representative ASR training sets; and (3) more widespread generalization across ASR users and, in turn, stronger performance for difficult-to recognize speakers. We consider the following questions:

1. What types of diversity need to be represented in dysarthria ASR training corpora?

2. What phonemic patterns are present in dysarthric speech and therefore might impact dysarthria ASR performance?

3. What can be done to adequately represent the different sources of variability in dysarthria ASR training corpora?

Characterizing Dysarthria Diversity

What Types of Diversity Need to Be Represented in Dysarthria ASR Training Corpora?

Diversity in Speech Severity

To date, the most frequently used metric for distinguishing variation in a dysarthria research cohort is overall speech impairment severity (Duffy, 2013). Severity is a multidimensional construct that refers to the speaker's overall impairment and includes a range of components, including naturalness, intelligibility, and subsystem abnormalities (see section Diversity in Speech Subsystems Impairment) (Duffy, 2013). Severity is often indexed by trained listeners, such as speech-language pathologists, who use adjectival descriptors (e.g., mild, moderate, severe, and profound). Alternatively, severity can be assessed using human transcription intelligibility, which indicates a listener's ability to understand the speaker based on the speech signal alone (Yorkston et al., 2007). While a functional metric, intelligibility is just one component of severity and does not necessarily account for all the different fluctuations in speech that are influenced by severity (e.g., changes in voice and resonance), especially for more mild speech impairment (Rong et al., 2015).

Including the full range of speech severities in ASR training sets is essential because good recognition accuracy for mild speech is unlikely to generalize to more severely affected speech (Moore et al., 2018). Thus, sufficient representation of speakers with severe dysarthria, in addition to those with mild and moderate impairments, in the training dataset could provide a more sustainable approach for enabling models that generalize to speakers across the severity continuum. Representing diversity only with severity, however, fails to address the substantial variety of aberrant speech features that characterize clinically distinct dysarthria variants. Other sources of diversity in dysarthric speech must, therefore, be considered to develop inclusive and sufficiently representative datasets.

Diversity in Dysarthria Type

One of the most established clinical taxonomies for speech motor disorders was developed over 50 years ago by Darley, Aronson, and Brown (DAB) (Darley et al., 1969). The DAB labeling system distinguishes 38 atypical speech features that are rated on a 7-point scale and groups dysarthria types based on speech feature profiles. For the development of the DAB model, the authors stratified dysarthric speakers based on clusters of speech features associated with lesions in specific regions of the central and peripheral nervous systems. These clusters are associated with at least five subtypes of dysarthria: flaccid, spastic, ataxic, hypokinetic, and hyperkinetic (see Figure 1). In many cases, patients exhibit a combination of the five subtypes (i.e., mixed dysarthria) (Darley et al., 1969). In addition to its clinical and neurological implications, the DAB model can serve as a basic heuristic in developing comprehensive and representative ASR corpora.

FIGURE 1

Figure 1. Breakdown of dysarthria subtypes within a widely used taxonomy of speech motor disorders. ALS, amyotrophic lateral sclerosis; CP, cerebral palsy; AT, ataxia; HD, Huntington's disease; MS, multiple sclerosis; MSA, multiple systems atrophy; PD, Parkinson's disease; PSP, progressive supranuclear palsy; TD, tardive dyskinesia; ARTIC, articulation; PHON, phonation; PROS, prosody; RES, resonance; RESP, respiration.

One disadvantage of the taxonomy, however, is that it relies entirely on subjective observations, which requires expert clinical training and may be too coarse and unreliable for capturing the range of diversity in dysarthria (Kent, 1996). To address this limitation, researchers have been exploring the diagnostic utility of a wide variety of speech analytic approaches for identifying variants of disordered speech (Rusz et al., 2018; Rowe et al., 2020)—an effort referred to as quantitative or digital phenotyping.

Diversity in Speech Subsystems Impairment

Regardless of dysarthria subtype, disordered speech is the byproduct of impairments in neural control over one or more of the five speech subsystems (i.e., respiration, phonation, resonance, prosody, and articulation) (see Figure 1). Objective characterizations of dysarthria through quantitative and digital phenotyping have allowed for more precise measures of speech, which has further illuminated the diversity in subsystem functioning. Indeed, deficits in each subsystem can engender specific aberrant speech features, many of which can be detected in the acoustic signal. For example, respiratory deficits in ataxic dysarthria can lead to excessive loudness variations, quantified acoustically using amplitude modulation (MacDonald et al., 2021); similarly, phonatory deficits in flaccid dysarthria can lead to a breathy vocal quality, quantified acoustically using cepstral peak prominence (Heman-Ackah et al., 2002).

While phonatory, resonatory, respiratory, and prosodic deficits can significantly limit communicative capacity, articulatory subsystem impairments have the greatest impact on speech intelligibility (Lee et al., 2014; Rong et al., 2015). Given the strong association between intelligibility and ASR performance (McHenry and LaConte, 2010; Tu et al., 2016; Jacks et al., 2019), it is possible that (1) articulatory motor impairments may be a major contributor to degraded ASR performance and (2) representing the range of articulatory motor impairments seen in dysarthria may maximize ASR accuracy and generalizability.

Considering the potential value of articulatory features and the need for objective and reliable measures of speech function, our group conducted a scoping review of the dysarthria literature to summarize the variety of acoustic techniques used to characterize articulatory impairments in neurodegenerative diseases (Rowe et al., under review). Across the 89 articles that met our inclusion criteria, we identified 24 different articulatory impairment features. To summarize the findings, we stratified the acoustic features into five aspects of articulatory motor control: Coordination, Consistency, Speed, Precision, and Rate (Rowe and Green, 2019). The findings demonstrated variable manifestation of articulatory impairments (1) across diseases [e.g., speakers with ataxia (AT) exhibited greater impairments in features associated with Rate than did speakers with Parkinson's disease (PD)] and (2) across articulatory components within each disease [e.g., speakers with Huntington's disease (HD) demonstrated greater impairments in Consistency than in Rate] (see Figure 2) (Rowe et al., under review).

FIGURE 2

Figure 2. Meta-analysis of the mean effect size (disease group compared to healthy controls) for all acoustic features within each articulatory component. ALS, amyotrophic lateral sclerosis; AT, ataxia; HD, Huntington's disease; MS, multiple sclerosis; MSA, multiple systems atrophy; PD, Parkinson's disease; PSP, progressive supranuclear palsy.

Within-Speaker Variability Due to Motor Disease Type, Disease Progression, Fatigue, and Medication Use

Our discussion thus far has focused on between-speaker differences in severity, dysarthria type, and subsystem involvement. However, there is also a significant amount of within-speaker variability that should be considered in ASR corpora development. For example, some dysarthria types, such as ataxic dysarthria, exhibit inconsistent motor patterns of limb and speech muscles (Darley et al., 1969), which can result in significant variability even across repetitions of the same utterance. Furthermore, across all dysarthria types, changes in disease progression, fatigue, and medication use can lead to rapid and transitory fluctuations in speech. Indeed, progressive diseases, such as ALS or PD, can result in declines in speech performance over several months or even weeks. Daily fluctuations in speech patterns can occur in patient populations who are prone to fatigue (Abraham and Drory, 2012). Lastly, medication use can result in dramatic changes—both positive and negative—in speech output. For example, levodopa has been related to improvements in voice quality, pitch variation, and articulatory function in patients with PD (Wolfe et al., 1975), while antipsychotic medication has been related to excess word stress and increased timing deficits in patients with HD (Rusz et al., 2014). To mitigate the detrimental effects of within-speaker variability on ASR performance, training datasets may need to include multiple instances of the same utterances recorded at different timepoints in individuals experiencing frequent speech changes (due to motor disease type, disease progression, fatigue, and/or medication use).

What Phonemic Patterns Are Present in Dysarthric Speech and Therefore Might Impact ASR Performance?

The influence of dysarthria diversity on phonemes is complex and not fully understood, as phoneme production involves intricate interactions between speech subsystems. Nevertheless, while modern E2E ASR systems often operate at the word or subword level (Kochenderfer, 2015) and employ a strong language model, sound-level distortions may still have a substantial negative impact on the recognition accuracy. In these cases, it may be necessary to compensate for acoustic distortions by increasing their representation in the training data or adjusting the encoder. Previous research has used methods such as phoneme confusion matrices to identify phonetic error patterns and create pronunciation models. For instance, Caballero-Morales and Trujillo-Romero (2014) examined substitution errors made by an ASR system¹ for a speaker with severe dysarthria. They noted that phonemes /r/, /s/, /sh/, and /th/ and phonemes /k/, /m/, and /p/ were consistently substituted by /f/ and /t/, respectively. The authors suggested that an improved system could use these error patterns to estimate /k/, /m/, or /p/ from a recognized /t/ (Caballero-Morales and Trujillo-Romero, 2014). However, most of the dysarthria ASR literature is based on datasets that combine subtypes of dysarthria and, therefore, do not specify which phonemes are misrecognized for each subtype. We propose that a heterogeneous corpus of disordered speech based on known error patterns may improve phoneme recognition. Below, we describe a subset of such error patterns in individuals with dysarthria. A more detailed and extensive list of these patterns can be found in Duffy (2013).

Few studies, to our knowledge, have examined the association between dysarthria subtype and ASR phonemic error patterns. Shor et al. (2019) examined the WER of an ASR model fine-tuned to speakers with ALS, a neurodegenerative disease characterized by mixed flaccid-spastic dysarthria. The study found that (1) /p/, /k/, /f/, and /zh/ were among five phonemes that accounted for the highest likelihood of deletion and (2) /m/ and /n/ accounted for 17% of substitution/insertion errors in the ASR response (Shor et al., 2019). The authors' first finding is consistent with the muscular weakness and low muscle tone characteristic of flaccid dysarthria, which frequently leads to deficits in sounds that require a buildup of pressure (i.e., pressure consonants) (Darley et al., 1969). Furthermore, flaccidity may lead to hypernasality due to air escape from the nasal cavity (i.e., velopharyngeal insufficiency). As a result, speakers often incorrectly insert nasal consonants such as /m/ and /n/ during speech (Duffy, 2013), which is consistent with Shor et al. (2019)'s latter finding. Increased severity in speakers with ALS also affects phonetic features, including stop-nasal (e.g., “no” for “toe”) and glottal-null (e.g., “high” for “eye”) contrasts (Kent et al., 1989). Additionally, abnormal lingual displacement and coupling in ALS has been associated with reduced vowel distinctiveness (Rong et al., 2021).

Spastic dysarthria is characterized by muscle stiffness and rigidity (Darley et al., 1969). Prior work on speakers with cerebral palsy (CP), who often exhibit pure spastic dysarthria, demonstrated that the predominant phonemic errors occurred on fricatives (Platt et al., 1980), suggesting that spasticity impairs oral constriction. A more recent study corroborated this finding by highlighting abnormalities in the fricative /s/ in speakers with CP (Chen and Stevens, 2001). Another etiology of spastic dysarthria—traumatic brain injury—can result in phonetic contrast errors between glottal-null (e.g., “hall”/“all”), voiced-voiceless (e.g., “bit”/“pit”), alveolar-palatal (e.g., “shy”/“sigh”), and nasal-stop (e.g., “meat”/“beat”) sounds (Roy et al., 2001).

Ataxic dysarthria is characterized by muscle weakness and incoordination (Darley et al., 1969). Seminal acoustic work has described the impact of voice onset time (VOT) disturbances on voicing contrasts (Ackermann and Hertrich, 1997). Later research revealed similar findings, demonstrating that VOT abnormalities in speakers with Friedreich's ataxia (FA) resulted in voicing contrast errors (e.g., /d/ vs. /t/ or /s/ vs. /z/) (Blaney and Hewlett, 2007).

Hypokinetic dysarthria is characterized by reduced range and speed of movement (Darley et al., 1969). Acoustically, speakers with hypokinetic dysarthria secondary to PD tend to replace a stop gap with low-intensity noise due to incomplete plosive closure, a process known as spirantization, which often occurs on voiceless phonemes, such as /p/, /t/, or /k/ (Canter, 1965). Reduced range of motion characteristic of hypokinesia also leads to articulatory undershoot and is reflected in features such as reduced second formant (F2) slope and restricted vowel space (Kim et al., 2009), which can lead to vowel centralization (e.g., /uh/ for /i/).

Lastly, hyperkinetic dysarthria, which is characterized by excess movement, encompasses a diverse range of speech characteristics (Darley et al., 1969). The phonemic errors in speakers with hyperkinesia are often influenced by the associated movement disorder. For example, hyperkinesia associated with HD may lead to variability in VOT and incomplete closure of pressure consonants, which could result in voicing substitutions (e.g., /t/ for /d/) and manner substitutions (e.g., /z/ for /d/), respectively (Hertrich and Ackermann, 1994). However, hyperkinesia associated with tardive dyskinesia, an antipsychotic medication side effect, may lead to excessive formant fluctuations and distorted vowels during sustained phonation (e.g., sustained /ah/) (Gerratt et al., 1984).

Overall, understanding the phonemic patterns specific to different dysarthria types can provide insight into which words or subwords (e.g., those that include pressure consonants or nasal sounds) may need to be disproportionately represented in the training data.

Considerations for Improving ASR Corpora

What Can Be Done to Adequately Represent the Different Sources of Diversity in Dysarthria ASR Training Corpora?

Deploying E2E machine learning models may preclude the need to understand the underlying pathophysiological phenomena given a sufficiently diverse training set. However, the current ineffectiveness of ASR approaches for dysarthric speech suggests that the limits of E2E models are presently defined by the lack of training sample diversity, which in this case is the wide variety and variability of dysarthric speech patterns both between and within speakers. Attempting to capture this diversity solely by adding more speakers is a costly endeavor that would likely be insufficient. The domain knowledge that we have discussed in this article is likely to help optimize participant selection strategies in large cohort studies on speech disordered ASR.

We propose that developing diverse corpora may involve a principled method of creating datasets for highly heterogeneous data, which may best be achieved through a three-pronged approach: (1) clinical phenotyping (i.e., characterizations of speech based on perceptual features); (2) quantitative phenotyping (i.e., characterizations of speech based on objective features); and (3) data-driven clustering (unsupervised groupings of speakers). Clinical phenotyping will require domain experts, such as speech-language pathologists, to guide the inclusion criteria for ensuring adequate representation of atypical speech characteristics (e.g., speech severity, articulatory deficits, etc.). Ultimately, with large datasets and validated quantitative measures of speech, data-driven clustering of dysarthric speech characteristics may become feasible. Upon the development of larger and more diverse datasets, quantifying heterogeneity with a diversity metric may be the next step toward ensuring that the training samples are sufficiently diverse. Such a metric will only be possible with a deeper understanding of the potential variables to consider in dysarthria and their impact on speech changes.

Of course, all these approaches will need to be supported by large-scale data collection efforts that will require partnerships with speech-language pathology clinics, private foundations, and medical institutions (MacDonald et al., 2021). This effort will be greatly facilitated by the development of secure but accessible electronic medical record systems and mHealth platforms (i.e., the use of mobile technologies that improve health outcomes), which will, in turn, aid in identifying and collecting speech recordings from individuals with diverse etiologies and speech impairments (Ramanarayanan et al., 2022).

Conclusions

Improving ASR accuracy for dysarthric speech may have significant implications for communication and quality of life. This article outlined the sources of diversity inherent to speech motor disorders, their potential impact on ASR performance, and the importance of their representation in training sets. Representing dysarthric speech variability in ASR corpora may be an important step for improving disordered speech ASR and is consistent with the call to action in the artificial intelligence community to reduce bias in the training data by increasing diversity.

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary materials, further inquiries can be directed to the corresponding author/s.

Author Contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it forpublication.

Funding

This work was supported by NIH-NIDCD under Grants K24DC016312, F31DC019556, F31DC019016, and F31DC020108.

Conflict of Interest

KT was an employee of Google LLC.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. ^Note that the adapted ASR system did not employ an e2e deep learning model. Thus, because the system did not possess a strong language model, phoneme confusion could be measured in a meaningful way.

References

Abraham, A., and Drory, V. E. (2012). Fatigue in motor neuron diseases. Neuromuscul. Disord. 22, S198–S202. doi: 10.1016/j.nmd.2012.10.013