This article was submitted to Human-Media Interaction, a section of the journal Frontiers in Computer Science
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Alzheimer’s disease (AD) has a long pre-clinical period, and so there is a crucial need for early detection, including of Mild Cognitive Impairment (MCI). Computational analysis of connected speech using Natural Language Processing and machine learning has been found to indicate disease and could be utilized as a rapid, scalable test for early diagnosis. However, there has been a focus on the Cookie Theft picture description task, which has been criticized. Fifty participants were recruited – 25 healthy controls (HC), 25 mild AD or MCI (AD+MCI) – and these completed five connected speech tasks: picture description, a conversational map reading task, recall of an overlearned narrative, procedural recall and narration of a wordless picture book. A high-dimensional set of linguistic features was automatically extracted from each transcript and used to train Support Vector Machines to classify groups. Performance varied, with accuracy for HC vs. AD+MCI classification ranging from 62% using picture book narration to 78% using overlearned narrative features. This study shows that, importantly, the conditions of the speech task have an impact on the discourse produced, which influences accuracy in detection of AD beyond the length of the sample. Further, we report the features important for classification using different tasks, showing that a focus on the Cookie Theft picture description task may narrow the understanding of how early AD pathology impacts speech.
Alzheimer’s disease (AD) includes a long “pre-clinical” period, during which pathological change accumulates in a patient’s brain with no apparent effect on their behavior or performance (
There are two broad approaches to detecting pathology: brief cognitive screening tests and biological markers (biomarkers) of disease. Of the former, the Mini Mental State Examination (MMSE) and Montreal Cognitive Assessment can be administered rapidly (
Biomarkers include Magnetic Resonance Imaging (MRI), cerebrospinal fluid (CSF) analysis and Positron Emission Tomography (PET). All three approaches can distinguish AD from controls with accuracies of over 90% (
There is evidence that connected spoken or written language (discourse) begins to change early in the course of AD, possibly prior to MCI (
A common approach to obtaining a sample of discourse involves the patient describing a scene, such as that depicted in the “Cookie Theft” picture (
Another approach involves narration of a learned story (either well-known, such as Cinderella, or a novel narrative presented in pictures), a cognitively complex task that entails the integration of a story’s characters and events within a temporal framework (
For reasons related to its simplicity, standardization and task constraints, and existence of large volumes of data (particularly the DementiaBank (
There have been few formal comparisons of the sensitivities of different speech sampling approaches to early AD.
Fifty participants (see
Participant demographics.
Measure | HC median (IQR) | AD+MCI median (IQR) | Test | p |
---|---|---|---|---|
Age (yrs) | 63 (12) | 71 (13) | Mann Whitney | 0.018* |
Sex (% f) | 72% | 24% | Chi square | 0.001** |
Education (yrs) | 16 (3.8) | 12 (4) | Mann Whitney | 0.007* |
MMSE (30) | 29 (0.70) | 24 (2.99) | Mann Whitney | <0.001** |
IQR = interquartile range, MMSE = mini mental state examination, converted from total ACE-III score (
Global cognition was assessed with the Addenbrooke’s Cognitive Examination Third Edition (ACE-III) (
All tasks were administered by the same individual (NC). Only words spoken by the participant were analyzed. We refer to these different approaches as: Picture Description (PD); Conversational Speech (CS); Overlearned Narrative Recall (ONR); Procedural Recall (PR); and Novel Narrative Retelling (NNR).
PDs were elicited using a novel version of the Cookie Theft stimulus, consisting of an updated and colored adaptation of the original (
CS was generated using the Map Task (
Participants were asked to recall the story of Cinderella from memory. They were given the instruction “I’d like you to tell me, with as much detail as you can, the story of Cinderella.”
Participants were asked to recount the procedure for making a cup of tea. They were given the instruction “I’d like you to tell me, in as much detail as you can, how you would make a cup of tea.”
The wordless picture book “Frog, Where Are You?” (by Mercer Mayer) was used as a stimulus for the generation of a novel narrative. Participants looked through the book once, before describing the story based on the pictures.
The resulting sample from each connected speech task was transcribed according to conventions detailed in
Two hundred and eighty-six linguistic features, consisting of fine-grained indices reflecting a range of linguistic and para-linguistic phenomena, were extracted from each connected speech task transcript (
Linguistic domains covered by features extracted from each task transcript (number of features in brackets).
Type | Linguistic feature | Example features |
---|---|---|
Lexico-syntactic (275) | Word production and complexity (11) | e.g., mean syllables per word, repeated words |
| Parts-of-speech (POS) (18) | % of POS (e.g., nouns, verbs, coordinates) and ratios (e.g., noun:verb ratio) |
| Lexical richness (8) | e.g., type-token-ratio (TTR; types:tokens), moving average TTR with a window size of 10, 20, 30, 40, and/or 50 if the sample was of sufficient length |
| Psycholinguistics (34) | Average normative ratings for e.g., familiarity, concreteness, age-of-acquisition of words |
| Psychological processes (50) | % of words relating to individual psychological processes e.g., anger, time, work |
| Syntactic structures and complexity (32) | e.g., mean length of sentence, verb phrases per T-unit (VP/T), complex nominals per clause (CN/C) |
| Syntactic parse tree features (4) | e.g., maximum depth, mean depth |
| Grammatical constituents (111) | Grammatical constituents of syntax tree e.g., NP -> DT NN, a noun phrase composed of a determiner and a noun |
| Shannon entropy (1) | Entropy for letters in the sample ( |
| Fluency (3) | e.g., false start ratio, filler ratio |
| Non-verbal (3) | e.g., pauses, laughter |
Semantic (11) | Semantic content (3) | e.g., idea density |
| Semantic coherence (9) | e.g., mean cosine similarity between adjacent sentences utilizing Google News word2vec model ( |
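The lexical richness measures listed in the table can be illustrated with a short sketch. This is a minimal illustration in plain Python, not the extraction code used in the study: the function names, the whitespace tokenization and the example sentence are assumptions made for the example.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def moving_average_ttr(tokens, window):
    """Mean TTR over every contiguous window of `window` tokens.

    Returns None when the sample is shorter than the window, mirroring the
    table's condition that the sample be "of sufficient length".
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        return None
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical 11-word sample, tokenized naively on whitespace.
sample = "the boy took the cookie and the girl took the jar".split()
ttr = type_token_ratio(sample)           # 7 types / 11 tokens
mattr_10 = moving_average_ttr(sample, 10)
mattr_50 = moving_average_ttr(sample, 50)  # sample too short for this window
```

Because plain TTR falls as samples get longer, the windowed version is often preferred when comparing transcripts of different lengths, which is relevant here given that the five tasks elicit samples of different sizes.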
Sparse features (defined as those with > 50% zero values for either class) were removed. To render feature scales invariant, values were transformed to a scale between 0 and 1 using the MinMax method. To minimize the danger of overfitting, feature selection was applied in each training fold, using i) feature ranking on mutual information with the class, selecting the top 5, 10, 20, and 40; or ii) logistic regression combined with recursive feature elimination (RFE;
Four participant groups were considered: i) those with clinical evidence to suggest the presence of AD pathology, i.e., mild AD plus those with MCI (AD+MCI); ii) MCI alone, iii) AD alone, and iv) healthy controls (HC). Each vector of selected features was used to train a series of linear support vector machines (SVM) to output three binary classifications: HC vs. AD+MCI, HC vs. AD; and HC vs. MCI. SVM have previously been used to achieve good results with similar data (
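The preprocessing that precedes SVM training (removal of sparse features, then MinMax scaling) could be sketched as follows. This is a hedged illustration in plain Python rather than the authors' code: the function names and toy data are hypothetical, and fitting the scaling bounds on training rows only is an assumption, since the text does not state where scaling was fitted.

```python
def non_sparse_features(rows, labels, max_zero_fraction=0.5):
    """Indices of features that are not sparse in any class.

    A feature counts as sparse if more than `max_zero_fraction` of its
    values are zero within at least one class.
    """
    classes = sorted(set(labels))
    keep = []
    for j in range(len(rows[0])):
        sparse = False
        for c in classes:
            values = [row[j] for row, y in zip(rows, labels) if y == c]
            zero_fraction = sum(v == 0 for v in values) / len(values)
            if zero_fraction > max_zero_fraction:
                sparse = True
                break
        if not sparse:
            keep.append(j)
    return keep


def fit_minmax(train_rows):
    """Per-feature (min, max) bounds learned from the training rows."""
    return [(min(col), max(col)) for col in zip(*train_rows)]


def apply_minmax(rows, bounds):
    """Scale features with previously fitted bounds.

    Training rows land in [0, 1]; unseen rows may fall slightly outside.
    Constant features are mapped to 0.0 to avoid division by zero.
    """
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]


# Toy data (hypothetical): 4 transcripts x 3 features.
rows = [[0.0, 2.0, 10.0], [0.0, 4.0, 20.0], [1.0, 6.0, 30.0], [3.0, 8.0, 40.0]]
labels = ["HC", "HC", "AD+MCI", "AD+MCI"]

keep = non_sparse_features(rows, labels)   # feature 0 is all zeros for HC
filtered = [[row[j] for j in keep] for row in rows]
scaled = apply_minmax(filtered, fit_minmax(filtered))
```

Fitting the bounds inside each training fold, like the feature selection described above, keeps information from the held-out fold out of the model.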
We calculated accuracy and balanced accuracy, due to class imbalance for subgroup classifications. The latter (
We also report sensitivity [
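For a binary classification, the reported metrics can be computed from the four confusion-matrix counts. The sketch below assumes the standard definitions (sensitivity as the true-positive rate, specificity as the true-negative rate, and balanced accuracy as their mean); the function name and the toy labels are hypothetical and do not reproduce the study's data.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, balanced accuracy, sensitivity and specificity."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Toy imbalanced example: 4 patients, 16 controls.
y_true = ["AD"] * 4 + ["HC"] * 16
y_pred = ["AD", "AD", "HC", "HC"] + ["HC"] * 14 + ["AD", "AD"]
metrics = binary_metrics(y_true, y_pred, positive="AD")
```

In this toy case plain accuracy is 0.80 while balanced accuracy is about 0.69, which shows why balanced accuracy is the more informative figure when one class dominates, as in the subgroup classifications.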
To identify features important for group classification the learnt coefficients, corresponding to weights associated with each feature during training, were extracted from each fold and ranked by absolute value. Features selected in only one fold were excluded from further analysis. This method uses information from both the feature selection step and the final training step as an indication of importance and aims to find features that are most stable across the model, thus potentially more generalizable.
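The stability analysis described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the feature names and the coefficients are hypothetical, and the ranking direction (rank 1 = largest absolute weight) is an assumption, since the text does not specify it.

```python
from collections import defaultdict


def stable_features(fold_coefficients, min_folds=2):
    """Summarize feature importance across cross-validation folds.

    `fold_coefficients` holds one dict per fold, mapping each feature
    selected in that fold to its learnt linear-SVM coefficient.
    Returns (feature, n_folds, mean_rank) tuples for features selected in
    at least `min_folds` folds, ordered by number of folds (descending)
    and then by mean rank (ascending).
    """
    ranks = defaultdict(list)
    for coefs in fold_coefficients:
        ordered = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)
        for rank, name in enumerate(ordered, start=1):
            ranks[name].append(rank)
    summary = [
        (name, len(r), sum(r) / len(r))
        for name, r in ranks.items()
        if len(r) >= min_folds
    ]
    summary.sort(key=lambda item: (-item[1], item[2], item[0]))
    return summary


# Hypothetical coefficients from three folds.
folds = [
    {"entropy": -0.9, "pp_rate": 0.4, "tone": 0.1},
    {"entropy": 0.7, "pp_rate": -0.8},
    {"entropy": -0.5, "idea_density": 0.6},
]
result = stable_features(folds)
```

Features that appear in only one fold ("tone" and "idea_density" in this toy input) are dropped, matching the exclusion rule stated in the text.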
Between group analyses were conducted for important features using the non-parametric Mann Whitney
HC and AD+MCI groups were not balanced for age, sex and years in education (
HC vs. AD+MCI mean (s.d) SVM classification performance across five-fold cross validation for five connected speech tasks, ranked by accuracy.
Discourse-generating task | Accuracy | AUC | Sensitivity | Specificity |
---|---|---|---|---|
ONR | 0.78 (0.08) | 0.84 (0.05) | 0.75 (0.23) | 0.82 (0.21) |
PD | 0.76 (0.18) | 0.84 (0.11) | 0.69 (0.30) | 0.81 (0.12) |
PR | 0.74 (0.15) | 0.85 (0.19) | 0.78 (0.15) | 0.74 (0.25) |
CS | 0.66 (0.11) | 0.74 (0.10) | 0.62 (0.10) | 0.78 (0.31) |
NNR | 0.62 (0.16) | 0.62 (0.10) | 0.53 (0.21) | 0.72 (0.11) |
PD achieved the second highest accuracy (0.76), with similar specificity (0.81) and the same AUC (0.84) as ONR but a lower sensitivity (0.69 compared to 0.75). The condition with the third highest accuracy (PR) achieved the highest sensitivity of all tasks (0.78) but second lowest specificity (0.74). The lowest accuracies and AUCs were obtained using CS and NNR. The s.d. of the mean accuracy and AUC for ONR is smaller than for the remaining tasks (0.08 and 0.05, compared to 0.18 and 0.11 for the second most accurate task, PD) indicating less variability given different training and test data.
In the interests of brevity, we focused on the features important for the two most accurate tasks – ONR and PD – which both utilized multivariate feature selection.
Important features of overlearned narrative recall for classifying HC vs. AD+MCI. Ordered by number of folds and then mean rank. Mann Whitney
Feature | Linguistic domain | No. folds | Mean rank | HC median (IQR) | AD+MCI median (IQR) | p | Description |
---|---|---|---|---|---|---|---|
BNC spoken freq CW | Psycholinguistics | 5 | 6.4 | 1.32 (0.28) | 1.70 (0.51) | 0.001** | Mean frequency rating for content words based on British National Corpus. Higher values = higher frequency |
**NP -> DT** | Grammatical constituents | 5 | 3.2 | 0.00 (0.00) | 0.01 (0.01) | 0.294 | Noun phrase with a bare determiner e.g., “this,” “those” |
**Entropy** | Shannon entropy | 5 | 2.4 | 4.11 (0.04) | 4.07 (0.06) | 0.037* | Entropy calculated for letters ( |
PP type rate | Grammatical constituents | 4 | 6.8 | 0.08 (0.01) | 0.05 (0.02) | <0.001** | Rate of prepositional phrases |
False starts ratio | Fluency | 3 | 8.7 | 0.00 (0.00) | 0.01 (0.01) | 4.605 | Ratio of incomplete words |
S -> CC NP VP | Grammatical constituents | 2 | 7.5 | 0.000 (0.00) | 0.002 (0.01) | 1.173 | Sentence with a coordinating conjunction, noun phrase and a verb phrase e.g., “But Cinderella smiled.” |
Idea density | Semantic content | 2 | 7 | 0.57 (0.02) | 0.54 (0.06) | 0.064 | Mean propositional idea density per word |
Ingest | Psychological processes | 2 | 6 | 0.13 (0.37) | 0.00 (0.00) | 0.053 | % words that correspond to concept of “ingestion” e.g., hungry, dish |
DESWLsy | Word production and complexity | 2 | 5 | 1.32 (0.04) | 1.26 (0.11) | 0.043* | Mean number of syllables per word |
Health | Psychological processes | 2 | 3.5 | 0.7 (0.68) | 0.00 (0.54) | 0.031* | % words that correspond to concept of “health” e.g., clinic, flu |
Sixltr | Word production and complexity | 2 | 3.5 | 14.34 (2.09) | 11.76 (5.88) | 0.012* | % words longer than six letters |
Mean WMD | Semantic coherence | 2 | 2.5 | 0.88 (0.17) | 1.17 (0.49) | 0.001** | Mean word movers distance ( |
* =
Between-group comparisons of the values of the features selected in the HC vs. AD+MCI classification using the ONR sample are displayed in
Radar plot showing features important for HC vs. AD+MCI classification using overlearned narrative recall. HC = healthy control, AD+MCI = Alzheimer’s disease and Mild Cognitive Impairment group. Features have been scaled to between 0 and 1 using MinMax scaling and medians plotted. * =
Eleven features were selected in at least two folds using PD samples to classify HC vs. AD+MCI (
Important features of picture description for classifying HC vs. AD+MCI. Mann Whitney
Feature | Linguistic domain | No. folds | Mean rank | HC median (IQR) | AD+MCI median (IQR) | p | Description |
---|---|---|---|---|---|---|---|
**NP -> DT** | Grammatical constituents | 5 | 8.6 | 0.00 (0.01) | 0.01 (0.01) | 0.008* | See |
Tone | Psychological processes | 5 | 8.4 | 50.32 (30.84) | 32.45 (23.87) | 0.021* | Measures overall emotional tone of sample. Higher values = more positive |
S -> ADVP NP VP | Grammatical constituents | 5 | 7.2 | 0.002 (0.01) | 0.000 (0.00) | 0.042* | Sentence with an adverb phrase, noun phrase and verb phrase e.g., “Hardly anyone noticed.” |
SUBTLEXus Range FW | Psycholinguistics | 4 | 6.5 | 8,189.19 (163.97) | 8,273.81 (124.38) | 0.32 | Measures frequency of function words according to their range (i.e., across documents as opposed to within) using the SUBTL corpus of television and film subtitles |
Demonstratives | Parts-of-speech | 4 | 5 | 0.01 (0.00) | 0.01 (0.01) | 1.127 | Use of demonstratives (this, that, these, those) |
**Entropy** | Shannon entropy | 3 | 6.3 | 4.14 (0.06) | 4.12 (0.07) | 0.447 | See |
FocusPast | Psychological processes | 3 | 4 | 1.23 (1.43) | 2.14 (2.07) | 0.334 | % words that are focused on the past e.g., ago, did |
PosEmo | Psychological processes | 3 | 3 | 2.19 (1.99) | 1.19 (1.67) | 0.248 | % words that reflect positive emotion e.g., love, nice |
S -> S CC S | Grammatical constituents | 3 | 2.3 | 0.00 (0.01) | 0.01 (0.01) | 0.239 | Two sentences joined by a coordinating conjunction e.g., “She runs but he walks.” |
MRC Imageability AW | Psycholinguistics | 2 | 5.5 | 359.80 (13.58) | 343.57 (20.67) | 0.084 | Mean ease of imageability of a word according to the Medical Research Council database. Higher values = easier imagery |
MATTR_30 | Lexical richness | 2 | 3.5 | 0.77 (0.04) | 0.76 (0.05) | 0.703 | Moving average type-token-ratio with a window of 30 words |
* = p < 0.05, ** = p < 0.001. Features in bold appear important for classification using both overlearned narrative recall and picture description (see
Group comparisons showed significant differences between the values of three features: noun phrases consisting of a bare determiner, emotional tone and sentences composed of an adverbial phrase, noun phrase and verb phrase. Comparative scaled values are plotted in
Radar plot showing features important for HC vs. AD+MCI classification using picture description. HC = healthy control, AD+MCI = Alzheimer’s disease and Mild Cognitive Impairment group. Features have been scaled to between 0 and 1 using MinMax scaling and medians plotted. * =
Comparisons of the selected features between the two discourse types reveal that both classifiers learned class membership from grammatical constituents, psycholinguistics and psychological processes (
MCI and AD subgroups were explored, as important clinically distinctive groups that may differ in management and disease course.
HC vs. AD mean (s.d) SVM classification performance across five-fold cross-validation for five connected speech tasks, ranked by accuracy.
Discourse-generating task | Balanced accuracy | AUC | Sensitivity | Specificity |
---|---|---|---|---|
ONR | 0.90 (0.11) | 0.94 (0.06) | 0.83 (0.24) | 0.96 (0.09) |
CS | 0.75 (0.15) | 0.80 (0.23) | 0.62 (0.26) | 0.88 (0.12) |
NNR | 0.71 (0.18) | 0.73 (0.26) | 0.65 (0.34) | 0.76 (0.22) |
PR | 0.68 (0.24) | 0.65 (0.25) | 0.52 (0.46) | 0.84 (0.15) |
PD | 0.59 (0.30) | 0.75 (0.26) | 0.50 (0.35) | 0.68 (0.32) |
HC vs. MCI mean (s.d) SVM classification performance across five-fold cross-validation for five connected speech tasks, ranked by accuracy.
Discourse-generating task | Balanced accuracy | AUC | Sensitivity | Specificity |
---|---|---|---|---|
ONR | 0.78 (0.13) | 0.82 (0.22) | 0.67 (0.31) | 0.90 (0.10) |
CS | 0.70 (0.20) | 0.75 (0.10) | 0.58 (0.37) | 0.82 (0.19) |
PD | 0.62 (0.26) | 0.77 (0.28) | 0.40 (0.42) | 0.84 (0.15) |
PR | 0.52 (0.12) | 0.62 (0.21) | 0.43 (0.25) | 0.60 (0.19) |
NNR | 0.50 (0.23) | 0.45 (0.30) | 0.27 (0.43) | 0.73 (0.18) |
Comparing the three classifications, performance was higher in all four metrics for HC vs. AD compared to HC vs. MCI, and HC vs. AD+MCI (
Classification performance for groups and subgroups. HC = healthy control, MCI = Mild Cognitive Impairment, AD = Alzheimer’s disease, AD+MCI = Alzheimer’s disease and Mild Cognitive Impairment group. All classifications used linguistic features from the overlearned narrative recall task. Error bars + 1 sd.
A linear regression with the twelve important features from ONR (
The accuracy of linguistic features automatically extracted from five connected speech tasks for classifying mild AD and MCI was compared. Differences were observed in classification performance using SVM, which, although small for the top performing tasks, indicated differential clinical utility for classifying mild AD and MCI based on task choice.
When comparing cognitively healthy controls with those judged likely on clinical grounds to harbor AD pathology (i.e., diagnosed with either MCI or AD), the highest accuracy (78%) was achieved using data obtained using ONR. The same data also yielded the highest accuracy in the smaller, but clinically relevant, subgroup classifications: 90% for mild AD alone and 78% for MCI alone, each compared to HC. These results suggest that an overlearned narrative recall task may be the best approach to obtaining discourse samples for detecting early or pre-symptomatic cases of AD, a goal that has become central to successful clinical trial outcomes.
PD achieved the second highest accuracy (76%) supporting the role of a new, updated version of this commonly used task. Sensitivity was lower (69% compared to 75% for ONR), and the task performed poorly for classification of AD only. The accuracy of features probably increases with sample length (
Although conversational discourse elicited using a map reading task achieved only 66% accuracy to detect AD+MCI, accuracy improved in the subgroup analyses: CS gave the second highest accuracy for mild AD and MCI groups alone, suggesting that critical differences in CS may develop between the MCI and mild dementia stages.
NNR with a picture-book stimulus produced the worst performance for AD+MCI and the MCI subgroup classification. In a previous study in which retellings of the same task were scored by a linguist, only 15% of AD patients grasped the overall theme of the story (
The minimum sample length required for meaningful analysis has been subject to debate (
Although the advantage of ONR may simply be task-related (i.e., due to the involvement of memory as well as language), it is also instructive to examine features that were robustly selected and the overlap with those selected from PD samples. As in
In keeping with the findings of
Entropy was retained in five folds using ONR, and three for PD. Entropy quantifies the information content contained in a string of letters (
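As a rough illustration of the letter-level entropy feature, the sketch below computes Shannon entropy over the letter distribution of a transcript. It is a minimal example, not the study's extraction code: the function name is hypothetical, and the use of base-2 logarithms (bits) and the exclusion of non-alphabetic characters are assumptions.

```python
import math
from collections import Counter


def letter_entropy(text):
    """Shannon entropy (in bits) of the letter distribution in `text`.

    Case is folded and non-alphabetic characters are ignored.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    total = len(letters)
    counts = Counter(letters)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform_two = letter_entropy("abab")      # two equally likely letters
uniform_four = letter_entropy("abcd")     # four equally likely letters
single_letter = letter_entropy("aaaa")    # no uncertainty at all
```

Under these assumptions the value is bounded above by log2(26), roughly 4.70 bits for the English alphabet, and lower values indicate a more repetitive, more predictable letter distribution.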
The overall emotional tone (a “summary variable” calculated by LIWC2015 (
Classifications based on both ONR and PD retained in all folds the increased frequency with which participants in the AD+MCI group formed a noun phrase using a bare determiner (NP -> DT), e.g. “look at this” as opposed to “look at this jar”. Determiners can serve a deictic purpose, so speech tasks with a pictorial stimulus may be more sensitive to their use;
We make note of two remaining features: imageability (MRC Imageability AW) and word-movers distance (WMD). Although selected in fewer than five folds, median imageability measured in PD was numerically lower in the AD+MCI group. This “reverse imageability effect” has also been observed in speech of SD patients (
The mean WMD, although retained in only two folds of the ONR classifier, was significantly different between groups. Using word2vec embeddings, WMD measures the minimum cumulative distance required to travel between collections of word vectors in a high-dimensional semantic space, analogous with coherence (
Demographic variables were not balanced across groups, unfortunately a common issue (
Acoustic features were not studied as extraction was beyond the scope of the study; their inclusion may have improved performance, as seen in previous research such as
Compared to current tests, the AUC reported here for detecting MCI is higher than that of the MMSE (82% compared to 74% (
The results of the current study indicate that, in a novel dataset, linguistic analysis could be used to distinguish mild AD and MCI, both as a combined group and as separate subgroups, from healthy controls, which is an important clinical task. Computational analysis of language would offer a rapid, scalable and low-cost assessment of individuals that could be built into remote assessment, such as via a smartphone app, and would be less obtrusive and anxiety-provoking than current biomarker tests. We have shown, in a direct comparison of the same participants, that the choice of speech task impacts subsequent performance of classifiers trained to recognize mild AD and MCI based on linguistic features. Tasks that probe memory and language may be optimal. Although some features appear important for classification independent of discourse type, tasks may be sensitive to different linguistic features in early AD; due to the reliance on PD in previous studies, some features susceptible to disease may have garnered less attention. This has implications for future work seeking to characterize AD and MCI based on speech, and for clinical adoption of computational approaches. Future work could explore the use of different tasks in larger samples, and include novel features found here to be important in classifying groups, such as the WMD and analysis of emotional tone, to improve sensitivity to disease. Longitudinal assessment of healthy individuals prior to a possible later diagnosis of AD is needed, in order to identify very early linguistic changes and delineate the impact of Alzheimer pathology on language from other factors. Such studies are underway and beginning to provide important insights (
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.
The studies involving human participants were reviewed and approved by the Research Ethics Service Committee London-Dulwich. The patients/participants provided their written informed consent to participate in this study.
All authors contributed to conception and design of the study. NC collected the data, performed the analysis and wrote the first draft of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.
This research was funded by the Medical Research Council (grant number MR/N013638/1).
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at:
The feature set size of 10 was pre-determined according to the highest average accuracy when using the filter approach: taking the mean accuracy across all five tasks for each threshold of
Negative