Pauses for Detection of Alzheimer’s Disease

Pauses, disfluencies, and language problems in Alzheimer's disease can be naturally modeled by fine-tuning Transformer-based pre-trained language models such as BERT and ERNIE. Using this method with pause-encoded transcripts, we achieved 89.6% accuracy on the test set of the ADReSS (Alzheimer's Dementia Recognition through Spontaneous Speech) Challenge. The best accuracy was obtained with ERNIE plus an encoding of pauses. Robustness is a challenge for large models trained on small data sets; ensembling over many runs of BERT/ERNIE fine-tuning reduced variance and improved accuracy. We also found that um was used much less frequently in Alzheimer's speech than in control speech, whereas uh was used more, and we discuss this finding from linguistic and cognitive perspectives.


INTRODUCTION
Alzheimer's disease (AD) involves a progressive, irreversible degeneration of brain cells (Mattson, 2004). One of the first signs of the disease is deterioration in language and speech production (Mueller et al., 2017), which makes language and speech natural signals for AD detection (Laske et al., 2015). In this paper, we investigate the use of pauses in speech (both unfilled pauses and filled pauses such as "uh" and "um") for this task.

Pauses
Unfilled pauses play an important role in speech. The occurrence of pauses is subject to physiological, linguistic, and cognitive constraints (Goldman-Eisler, 1961; Rochester, 1973; Butcher, 1981; Zellner, 1994; Clark, 2006; Ramanarayanan et al., 2013; Hawthorne and Gerken, 2014). How different constraints interact in pause production has been an active research subject for decades. In normal speech, the likelihood of pause occurrence and the duration of pauses are correlated with syntactic and prosodic structure (Brown and Miron, 1971; Grosjean et al., 1971; Krivokapic, 2007). For example, if a sentence has a syntactically complex subject and a syntactically complex object, speakers tend to pause at the subject-verb phrase boundary, and pause duration increases with upcoming complexity (Ferreira, 1991). It has been demonstrated that pauses in speech are used by listeners in sentence parsing (Schepman and Rodway, 2000), and that pause information can benefit automatic parsing (Tran et al., 2018).
Atypical pausing is characteristic of disordered speech, such as in Alzheimer's disease, and pauses are often used to measure language and speech problems (Ramig et al., 1995; Yuan et al., 2016; Shea and Leonard, 2019). The difference between typical and atypical pauses lies not only in their frequency and duration, but also in where they occur. In this study, we propose a method to encode pauses in transcripts in order to capture the associations between pauses and words through fine-tuning pre-trained language models such as BERT (Devlin et al., 2018) and ERNIE (Sun et al., 2019), which we describe in Section 1.2.
The use of filled pauses may also differ between AD and normal speech. English has two common filled pauses, uh and um. There is a debate in the literature as to whether uh and um are intentionally produced by speakers (Clark and Fox Tree, 2002; Corley and Stewart, 2008). From a sociolinguistic point of view, women and younger people tend to use more um relative to uh than men and older people (Tottie, 2011; Wieling et al., 2016). It has also been reported that autistic children use um less frequently than typically developing children (Gorman et al., 2016; Irvine et al., 2016), and that um occurs less frequently and is shorter during lying compared to truth-telling (Arciuli et al., 2010).

Pre-trained LMs and Self-Attention
Modern pre-trained language models such as BERT (Devlin et al., 2018) and ERNIE (Sun et al., 2019) were trained on extremely large corpora. These models appear to capture a wide range of linguistic facts, including lexical knowledge, phonology, syntax, semantics, and pragmatics, and recent literature reports considerable success with BERT and BERT-like models on a variety of benchmark tasks. We expect that the language characteristics of AD can also be captured by pre-trained language models when fine-tuned to the task of AD classification.
BERT and BERT-like models are based on the Transformer architecture (Vaswani et al., 2017). These models use self-attention to capture associations among words. Each attention head operates on the elements in a sequence (e.g., the words in a subject's transcript) and computes a new sequence of weighted sums of (transformed) input elements. Both BERT and ERNIE come in several versions: a base model with 12 layers and 12 attention heads per layer, and a large model with 24 layers and 16 attention heads per layer. Conceptually, the self-attention mechanism can naturally model many language problems in AD, including repetitions of words and phrases, use of particular words (and classes of words), as well as pauses. By inserting pauses into the word transcripts, we enable BERT-like models to learn the language problems involving pauses.
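To make the weighted-sum computation concrete, below is a minimal NumPy sketch of a single attention head (scaled dot-product attention). All names and dimensions are illustrative; this is not the actual BERT/ERNIE implementation, which adds multi-head projections, masking, dropout, and layer normalization.

```python
# A minimal sketch of one self-attention head (scaled dot-product attention);
# names and dimensions are illustrative, not BERT's actual code.
import numpy as np

def self_attention_head(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) inputs; Wq/Wk/Wv: (d_model, d_head) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # pairwise association scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # softmax over the sequence
    return w @ V                                  # weighted sum of (transformed) inputs

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 16, 8               # e.g., 6 tokens, incl. pause tokens
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention_head(X, Wq, Wk, Wv).shape)   # (6, 8)
```

Because the attention weights span the whole sequence, inserted pause tokens can participate in these weighted sums just like words, which is what allows the model to learn pause-word associations.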
Previous studies have found that when fine-tuning BERT for downstream tasks on a small data set, the model shows high variance in performance: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. Dodge et al. (2020) conducted a large-scale study of this issue. They fine-tuned BERT hundreds of times while varying only the random seeds, and found that the best-found model significantly outperformed previously reported results using the same model. In this situation, relying on a single final model for prediction is risky given the variance in performance during training. We propose an ensemble method to address this concern.

Automatic Detection of AD
There is a considerable literature on AD detection from continuous speech (Filiou et al., 2019; Pulido et al., 2020), covering a wide variety of features and machine learning techniques. Fraser et al. (2016) used 370 acoustic and linguistic features to train logistic regression models for classifying AD and normal speech. Gosztolya et al. (2019) found that acoustic and linguistic features were about equally effective for AD classification, but that the combination of the two performed better than either by itself. Neural network models such as Convolutional Neural Networks and Long Short-Term Memory (LSTM) networks have also been employed for the task (de Ipiña et al., 2017; Fritsch et al., 2019; Palo and Parde, 2019), with very promising results. However, it is difficult to compare these different approaches because of the lack of standardized training and test data sets. The ADReSS challenge of INTERSPEECH 2020 is "to define a shared task through which different approaches to AD detection, based on spontaneous speech, could be compared" (Luz et al., 2020). This paper stems from our participation in that shared task.

Data
The data consists of speech recordings and transcripts of descriptions of the Cookie Theft picture from the Boston Diagnostic Aphasia Exam (Goodglass et al., 2001). Transcripts were annotated using the CHAT coding system (MacWhinney, 2000). We used only the word transcripts; the morphological and syntactic annotations in the transcripts were not used in our experiments.
The training set contains 108 speakers and the test set contains 48 speakers; in each set, half of the speakers are people with AD and half are non-AD (healthy control subjects). Both data sets were provided by the challenge. The organizers also provided speech segments extracted from the recordings using a simple voice detection algorithm, but no transcripts were available for these segments, and we did not use them. Our experiments were based on the entire recordings and transcripts.

Processing Transcripts and Forced Alignment
The transcripts in the data sets were annotated in the CHAT format, which can be conveniently created and analyzed using CLAN (MacWhinney, 2000). Consider the example "the [x 3] bench [: stool]": here, [x 3] indicates that the word "the" was repeated three times, and [: stool] indicates that the preceding word, "bench" (which was actually produced), refers to a stool. Details of the transcription format can be found in MacWhinney (2000).
For the purposes of forced alignment and fine-tuning, we converted the transcripts into the words and tokens that were actually produced in speech. Occurrences of "w [x n]" were replaced by n repetitions of w; punctuation marks and the various comments annotated between "[ ]" were removed. Symbols such as (.), (..), (...), <, >, /, and xxx were also removed.
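As a concrete illustration, here is a hedged Python sketch of these cleanup rules; clean_chat_line is a hypothetical helper, and the actual processing (e.g., via CLAN) may differ in detail.

```python
# A sketch of the transcript cleanup described above (assumed, not the
# authors' exact script).
import re

def clean_chat_line(line):
    # Expand "w [x n]" into n repetitions of the word w.
    line = re.sub(r'(\S+)\s*\[x\s*(\d+)\]',
                  lambda m: ' '.join([m.group(1)] * int(m.group(2))), line)
    line = re.sub(r'\[[^\]]*\]', '', line)     # drop comments/codes in [ ... ]
    line = re.sub(r'\(\.{1,3}\)', '', line)    # drop pause symbols (.), (..), (...)
    line = re.sub(r'[<>/,.!?]', '', line)      # drop punctuation and <, >, /
    line = line.replace('xxx', '')             # drop unintelligible-speech marker
    return ' '.join(line.split())

print(clean_chat_line('the [x 3] bench [: stool] (..) <and then> xxx .'))
# -> 'the the the bench and then'
```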
The processed transcripts were force-aligned with the speech recordings using the Penn Phonetics Lab Forced Aligner (Yuan and Liberman, 2008), which uses a special "sp" model to identify between-word pauses. After forced alignment, the speech segments belonging to the interviewer were excluded, as were the pauses at the beginning and end of the recordings. Only the subjects' speech, including pauses in turn-taking between the interviewer and the subject, was used.

Word Frequency and Uh/Um
From the training data set, we calculated word frequencies for the Control and AD groups respectively. Words that appear 10 or more times in both groups are shown in the word clouds in Figure 1. The following words are at least twice as frequent in AD as in Control (AD/Control frequency ratio in parentheses): oh (4.33), laughs (laughter, 3.18), down (2.66), well (2.42), some (2.2), what (2.16), fall (2.15). The words that are at least twice as frequent in Control as in AD are: window (4.4), are (3.83), has (3.0), reaching (2.8), her (2.62), um (2.55), sink (2.3), be (2.21), standing (2.06).

FIGURE 1 | The word cloud on the left highlights words that are more common among control subjects than among subjects with AD; the word cloud on the right highlights words that are more common among subjects with AD than among controls.
Compared to controls, subjects with AD used relatively more laughter and semantically "empty" words such as oh, well, and some, and fewer present participles (-ing verbs), consistent with findings in the literature. Table 1 shows an interesting difference for filled pauses: the subjects with AD used more uh than the control subjects, but their use of um was much less frequent.
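For illustration, here is a minimal sketch (not the authors' script) of the per-group relative-frequency and AD/Control ratio computation; the toy transcripts and the MIN_COUNT setting stand in for the real data and the 10-token threshold.

```python
# A sketch of the word-frequency comparison described above.
from collections import Counter

def relative_freqs(transcripts):
    counts = Counter(w for t in transcripts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}, counts

ad_transcripts = ['oh well the boy fall down', 'oh oh well some cookie']    # toy data
control_transcripts = ['the boy is reaching um the window', 'um her sink']  # toy data
ad_freq, ad_counts = relative_freqs(ad_transcripts)
ctl_freq, ctl_counts = relative_freqs(control_transcripts)

# Words occurring in both groups, ranked by AD/Control relative-frequency ratio.
MIN_COUNT = 1   # 10 in the paper; 1 here so the toy example produces output
shared = [w for w in ad_counts
          if ad_counts[w] >= MIN_COUNT and ctl_counts.get(w, 0) >= MIN_COUNT]
for w in sorted(shared, key=lambda w: ad_freq[w] / ctl_freq[w], reverse=True):
    print(f'{w}: {ad_freq[w] / ctl_freq[w]:.2f}')
```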

Unfilled Pauses
Pause durations were calculated from the forced alignment. Pauses under 50 ms were excluded, as were pauses in the interviewer's speech. We binned the remaining pauses by duration as shown in Figure 2. Subjects with AD have more pauses in every duration bin, but the difference between subjects with AD and non-AD is particularly noticeable for longer pauses.

Input and Hyperparameters
Pre-trained BERT and ERNIE models were fine-tuned for the AD classification task. Each of the N = 108 training speakers is treated as one data point. The input to the model is the sequence of words from the speaker's processed transcript (as described in Section 2.2); the output is the class of the speaker, 0 for Control and 1 for AD. We also encoded pauses in the input word sequence. Pauses were grouped into three bins: short (under 0.5 s), medium (0.5-2 s), and long (over 2 s). The three bins of pauses are coded using the three punctuation marks ",", ".", and "...", respectively. Because all punctuation was removed from the processed transcripts, these inserted marks represent only pauses. The procedure is illustrated in Figure 3.
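Below is a minimal sketch of this pause-encoding step under stated assumptions: the (word, pause-after) input format and the function names are ours, not from the paper's code.

```python
# A sketch of the three-bin pause encoding: pause durations from forced
# alignment are mapped to ",", ".", "..." and interleaved with the words.
def pause_token(duration_s):
    if duration_s < 0.5:
        return ','      # short pause
    elif duration_s <= 2.0:
        return '.'      # medium pause
    return '...'        # long pause

def encode_pauses(aligned):
    """aligned: list of (word, pause_after_in_seconds or None) tuples."""
    tokens = []
    for word, pause in aligned:
        tokens.append(word)
        if pause is not None and pause >= 0.05:   # ignore pauses under 50 ms
            tokens.append(pause_token(pause))
    return ' '.join(tokens)

print(encode_pauses([('the', 0.3), ('boy', 2.5), ('is', None), ('falling', 0.8)]))
# -> 'the , boy ... is falling .'
```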
We used Bert-for-Sequence-Classification for fine-tuning. We tried both "bert-base-uncased" and "bert-large-uncased", and found slightly better performance with the larger model. The following (slightly tuned) hyperparameters were chosen: learning rate 2e-5, batch size 4, 8 epochs, and a maximum input length of 256 (sufficient to cover most cases). The standard default tokenizer was used (with an instruction not to split "..."). Two special tokens, [CLS] and [SEP], were added to the beginning and end of each input.
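For concreteness, here is a minimal fine-tuning sketch using the Hugging Face transformers library (v4-style API) with the hyperparameters reported above; the authors' exact training code may differ, and the toy texts and labels are stand-ins.

```python
# A minimal fine-tuning sketch, assuming the Hugging Face transformers library.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertForSequenceClassification.from_pretrained('bert-large-uncased',
                                                      num_labels=2)

# Toy stand-ins for the pause-encoded transcripts and labels (0 = Control, 1 = AD).
texts = ['the boy , is falling ...', 'the girl is reaching for a cookie .']
labels = [1, 0]

# The tokenizer adds [CLS] and [SEP] automatically; max_length=256 as in the paper.
enc = tokenizer(texts, max_length=256, truncation=True,
                padding='max_length', return_tensors='pt')
loader = DataLoader(TensorDataset(enc['input_ids'], enc['attention_mask'],
                                  torch.tensor(labels)),
                    batch_size=4, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(8):                       # 8 epochs, as reported above
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()                  # cross-entropy loss on the [CLS] head
        optimizer.step()
        optimizer.zero_grad()
```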
ERNIE fine-tuning started with the "ERNIE-large" pre-trained model (24 layers with 16 attention heads per layer). We used the default tokenizer and the following hyperparameters: learning rate 2e-5, batch size 8, 20 epochs, and a maximum input length of 256.
The fine-tuning process is illustrated in Figure 4.

Ensemble Reduces Variance in LOO Accuracy
When conducting LOO (leave-one-out) cross-validation on the training set, we observed large differences in accuracy across runs. We computed 50 runs of LOO cross-validation; the hyperparameter setting was the same across runs except for the random seeds. The results are shown in the last row (N = 1) of Tables 2 and 3. Over the 50 runs, LOO accuracy ranged from 0.75 to 0.86 for BERT with three pauses, from 0.78 to 0.87 for ERNIE with three pauses, and from 0.77 to 0.85 for ERNIE with no pauses. This large variance suggests that performance on unseen data is likely to be brittle. Such brittleness is to be expected given the large size of the BERT and ERNIE models and the small size of the training set (108 subjects).
To address this brittleness, we introduced the following ensemble procedure. From the results of LOO cross-validation, we calculated the majority vote over N runs for each of the 108 subjects, and used the majority vote as the single predicted label for each subject. To verify that the ensemble estimates would generalize to unseen data, we tested the method by selecting N = 5, N = 15, . . ., runs from the 50 runs of LOO cross-validation. The results are shown in Tables 2 and 3. In the tables, the first row summarizes 100 draws of N = 5 runs; the second row is similar, except with N = 15. All of the ensemble rows have better means and less variance than the last row, which summarizes the 50 individual runs of LOO cross-validation without ensembling (N = 1). Figure 5 illustrates Tables 2 and 3: the black lines represent the accuracy of individual runs, whereas the purple lines represent the ensemble accuracy with N = 35. There is wide variance across individual runs (black); the proposed ensemble method (purple) improves the mean and reduces the variance relative to estimates based on a single run.
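The ensemble step itself reduces to a per-subject majority vote over runs. A minimal sketch, assuming predictions are collected into an (N_runs, N_subjects) array of 0/1 labels:

```python
# A sketch of the majority-vote ensemble: preds holds one row of 0/1
# predictions per fine-tuning run (i.e., per random seed).
import numpy as np

def majority_vote(preds):
    """Per-subject majority label across runs (with odd N there are no ties)."""
    return (preds.mean(axis=0) >= 0.5).astype(int)

rng = np.random.default_rng(0)
preds = rng.integers(0, 2, size=(35, 108))   # e.g., N = 35 runs over 108 subjects
print(majority_vote(preds).shape)            # (108,)
```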

EVALUATION
Under the rules of the challenge, each team is allowed to submit results from five attempts for evaluation. Predictions on the test set from the following five models were submitted: BERT0p, BERT3p, BERT6p, ERNIE0p, and ERNIE3p, where 0p indicates that no pauses were encoded, and 3p and 6p indicate that three and six lengths of pauses, respectively, were encoded. For comparison with three pauses, 6p uses six bins of pauses, encoded as "," (under 0.5 s), "." (0.5-1 s), ".." (1-2 s), "..." (2-3 s), "...." (3-4 s), and "....." (over 4 s); the dots are separated by spaces, so each dot is a distinct token. Following the method proposed in Section 3.2, we made 35 runs of training for each of the five models, with 35 random seeds. The classification of each sample in the test set was based on the majority vote of the 35 predictions. Table 4 lists the evaluation scores received from the organizers.
The best accuracy was 89.6%, obtained with ERNIE and three pauses, a nearly 15-percentage-point increase over the baseline accuracy of 75.0% (Luz et al., 2020).
ERNIE outperformed BERT by 4% with both the three-pause and no-pause inputs. Encoding pauses improved the accuracy for both BERT and ERNIE, with no difference between three pauses and six pauses in terms of the improvement.

DISCUSSION
The group with AD used more uh but less um than the control group. In speech production, disfluencies such as hesitations and speech errors are correlated with cognitive functions such as cognitive load, arousal, and working memory (Daneman, 1991; Arciuli et al., 2010); hesitations and disfluencies increase with increased cognitive load and arousal as well as with impaired working memory. This may explain why the group with AD used more uh, as a filled pause and hesitation marker. More interestingly, they used less um than the control group, which indicates that, unlike uh, um is more than a hesitation marker. Previous studies have also reported that children with autism spectrum disorder produce um less frequently than typically developing children (Gorman et al., 2016; Irvine et al., 2016), and that um is used less frequently during lying than during truth-telling (Benus et al., 2006; Arciuli et al., 2010). All these results seem to suggest that um carries a lexical status and is retrieved in speech production. One possibility is that people with AD or autism have difficulty retrieving the word um, whereas people who are lying try not to use this word. More research is needed to test this hypothesis.

From our results, encoding pauses in the input was helpful for both BERT and ERNIE fine-tuning for the task of AD classification. Pauses are ubiquitous in spoken language, and they are distributed differently in fluent, normally disfluent, and abnormally disfluent speech. As we can see from Figure 2, the group with AD used more pauses, and especially more long pauses, than the control group. With pauses present in the text, the self-attention mechanism in BERT and ERNIE may learn how pauses are correlated with other words, for example, whether there is a long pause between the determiner the and the following noun, which occurs more frequently in AD speech. We think this is part of the reason why encoding pauses improved the accuracy. There was no difference between three pauses and six pauses in terms of improvement in accuracy; more studies are needed to investigate categories of pause length and to determine the optimal number of pause bins to encode for AD classification.
ERNIE was designed to learn language representations enhanced by knowledge masking strategies, including entity-level masking and phrase-level masking. Through these strategies, ERNIE "implicitly learned the information about knowledge and longer semantic dependency, such as the relationship between entities, the property of a entity and the type of a event" (Sun et al., 2019). We think this may be why ERNIE performs better at recognizing Alzheimer's speech, in which memory loss causes not only language problems but also difficulties in recognizing entities and events. Both BERT and ERNIE were pre-trained on text corpora with no pause information. Our study suggests that it may be useful to pre-train a language model on speech transcripts (either alone or combined with text corpora) that include pause information.

CONCLUSION
An accuracy of 89.6% was achieved on the test set of the ADReSS (Alzheimer's Dementia Recognition through Spontaneous Speech) Challenge with ERNIE fine-tuning plus an encoding of pauses. There is high variance in BERT and ERNIE fine-tuning on a small training set; our proposed ensemble method improves accuracy and reduces variance in model performance. Pauses are useful in BERT and ERNIE fine-tuning for AD classification. Finally, um was used much less frequently in AD speech, suggesting that it may have a lexical status.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
JY: principal investigator, corresponding author. XC: running ERNIE experiments. YB: help running BERT experiments. ZY: consultation on Alzheimer's disease, paper editing and proofreading. KC: visualization of LOO experiment results, paper editing and proofreading.