Edited by: Michel Dojat, Institut National de la Santé et de la Recherche Médicale (INSERM), France
Reviewed by: Maria L. Bringas, University of Electronic Science and Technology of China, China; Pedro Gomez-Vilda, Polytechnic University of Madrid, Spain; Alberto Mazzoni, Sant'Anna School of Advanced Studies, Italy
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Many articles have used voice analysis to detect Parkinson's disease (PD), but few have focused on the early stages of the disease or on a possible gender effect. In this article, we adapted the latest speaker recognition system, called x-vectors, to detect PD at an early stage using voice analysis. X-vectors are embeddings extracted from Deep Neural Networks (DNNs), which provide robust speaker representations and improve speaker recognition when large amounts of training data are used. Our goal was to assess whether, in the context of early PD detection, this technique would outperform the more standard MFCC-GMM (Mel-Frequency Cepstral Coefficients and Gaussian Mixture Models) classifier and, if so, under which conditions. We recorded 221 French speakers (recently diagnosed PD subjects and healthy controls) with a high-quality microphone and via the telephone network. Men and women were analyzed separately in order to obtain more precise models and to assess a possible gender effect. Several experimental and methodological aspects were tested in order to analyze their impact on classification performance: audio segment duration, data augmentation, the type of dataset used for neural network training, the kind of speech task, and the back-end analyses. The x-vector technique provided better classification performance than MFCC-GMM for the text-independent tasks, and seemed particularly suited to the early detection of PD in women (7–15% improvement). This result was observed for both recording types (high-quality microphone and telephone).
Parkinson's disease (PD) is the second most common neurodegenerative disease after Alzheimer's disease and affects approximately seven million people worldwide. Its prevalence in industrialized countries is around 0.3% and increases with age: 1% of people over the age of 60 and up to 4% of those over 80 are affected (De Lau and Breteler,
Voice impairment is one of the first symptoms to appear. Many articles have used voice analysis to detect PD. They observed vocal disruptions, called hypokinetic dysarthria, expressed by a reduction in prosody, irregularities in phonation, and difficulties in articulation. The classification performances (accuracy rate) using voice analysis ranged from 65 to 99% for moderate to advanced stages of the disease (Guo et al.,
Different classification methodologies have been explored to detect PD using voice analysis. The first studies used global features, such as the number of pauses, the number of dysfluent words, the standard deviation (SD) of pitch and of intensity, along with averaged low-level perturbations, such as shimmer, jitter, voice onset time, signal to noise ratio, formants, or vowel space area, which are reviewed in Jeancolas et al. (
Another type of features, which has been used in the field of speaker recognition for decades, is the Mel-Frequency Cepstral Coefficients (MFCCs) (Bimbot et al.,
List of abbreviations.

| Abbreviation | Meaning |
| --- | --- |
| PD | Parkinson's disease |
| HC | Healthy control |
| SD | Standard deviation |
| MFCC | Mel-frequency cepstral coefficients |
| GMM | Gaussian mixture model |
| MDS-UPDRS | Movement Disorder Society sponsored revision of the Unified Parkinson's Disease Rating Scale |
| LLH | Log-likelihood |
| DNN | Deep neural network |
| TDNN | Time delay neural network |
| LDA | Linear discriminant analysis |
| PLDA | Probabilistic linear discriminant analysis |
| DDK | Diadochokinesia |
| EER | Equal error rate |
| DET | Detection error tradeoff |
Over the past 15 years, MFCCs have appeared in the detection of vocal pathologies, such as dysphonia (Dibazar et al.,
Several statistical analyses and classifiers can be applied on MFCC features. For instance, if MFCC dispersion is low within classes, generally due to a poor phonetic variety, one can simply consider the MFCC averages (in addition to other features). This is generally the case for sustained vowel tasks (Tsanas et al.,
If frames are acoustically very different (such as during reading or free speech tasks), additional precision is required to describe the MFCC distribution. One possible modeling technique uses vector quantization (Kapoor and Sharma,
Over the last few years, with the increase of computing power, several Deep Neural Network (DNN) techniques have emerged in PD detection. Some studies applied Convolutional Neural Networks on spectrograms (Vásquez-Correa et al.,
In the present study, we adapted a state-of-the-art text-independent (i.e., with no constraint on what the speaker says) speaker recognition methodology, introduced in Snyder et al. (
According to the authors, the advantages of x-vectors are that they capture well the characteristics of speakers that have not been seen during the DNN training, that they provide a more robust speaker representation than i-vectors (Snyder et al.,
In 2018, the same authors adapted the x-vector method to language recognition (Snyder et al.,
Recently, we proposed an adaptation of x-vectors for PD detection in Jeancolas (
A total of 221 French speakers were included in this study: 121 PD patients and 100 healthy controls (HC). All PD patients and 49 HC were recruited at the Pitié-Salpêtrière Hospital and included in the ICEBERG cohort, a longitudinal observational study conducted at the Clinical Investigation Center for Neurosciences at the Paris Brain Institute (ICM). An additional 51 HC were recruited to balance the numbers of PD and control subjects. All patients had received a diagnosis of PD, according to the United Kingdom Parkinson's Disease Society Brain Bank (UKPDSBB) criteria, <4 years prior to the study. HC were free of any neurological diseases or symptoms. Participants underwent a neurological examination, motor and cognitive tests, biological sampling, and brain MRI. PD patients were pharmacologically treated and their voices were recorded in the ON state (<12 h after their last medication intake). Data from participants with technical recording issues or language disorders not related to PD (such as stuttering), or for whom a deviation from the standardized procedure occurred, were excluded from the analysis. The ICEBERG cohort (clinicaltrials.gov, NCT02305147) was conducted according to Good Clinical Practice guidelines. All participants provided informed consent prior to any investigation. The study was sponsored by Inserm and received approval from an ethics committee (IRB Paris VI, RCB: 2014-A00725-42) in accordance with local regulations.
Among the 217 participants kept for the analysis, 206 subjects, including 115 PD (74 males, 41 females) and 91 HC (48 males, 43 females), performed speech tasks recorded with a high-quality microphone. Information about age, time since diagnosis, Hoehn and Yahr stage (Hoehn and Yahr,
High-quality microphone database information.

| Group | Sex | N | Age (years) | Time since diagnosis (years) | Hoehn & Yahr | MDS-UPDRS III | LEDD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PD | M | 74 | 63.7 ± 9.3 | 2.5 ± 1.4 | 2.0 ± 0.1 | 34.1 ± 7.0 | 415 ± 298 |
| PD | F | 41 | 63.9 ± 9.3 | 2.7 ± 1.5 | 2.0 ± 0.0 | 29.6 ± 5.8 | 352 ± 191 |
| HC | M | 48 | 58.9 ± 10.7 | – | 0.0 ± 0.0 | 4.6 ± 3.7 | – |
| HC | F | 43 | 59.3 ± 9.2 | – | 0.1 ± 0.4 | 4.9 ± 3.4 | – |
Most of the participants, namely 101 PD (63 males, 38 females) and 61 HC (36 males, 25 females), also carried out telephone recordings. Information about age, time since diagnosis, Hoehn and Yahr stage, MDS-UPDRS III score (OFF state), and LEDD are detailed in
Telephone database information.

| Group | Sex | N | Age (years) | Time since diagnosis (years) | Hoehn & Yahr | MDS-UPDRS III | LEDD |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PD | M | 63 | 63.7 ± 9.0 | 2.5 ± 1.4 | 2.0 ± 0.1 | 34.2 ± 6.9 | 403 ± 311 |
| PD | F | 38 | 63.3 ± 9.3 | 2.7 ± 1.5 | 2.0 ± 0.0 | 29.5 ± 6.1 | 359 ± 194 |
| HC | M | 36 | 63.1 ± 9.3 | – | 0.0 ± 0.0 | 4.6 ± 3.5 | – |
| HC | F | 25 | 61.8 ± 7.4 | – | 0.1 ± 0.5 | 5.3 ± 3.6 | – |
In this section we present our MFCC-GMM baseline framework. This method, based on Gaussian mixture models fitting cepstral coefficients distributions of each class, has been used for decades in speaker recognition and was recently adapted for early PD detection (Jeancolas et al.,
The first preprocessing regarding our high-quality microphone recordings was spectral subtraction (Boll,
We then extracted the log-energy and 19 MFCCs, using the Kaldi software (Povey et al.,
Once the MFCCs and their deltas were extracted, we carried out Vocal Activity Detection (VAD), based on the log-energy, in order to remove silent frames.
Finally, to complete denoising, a cepstral mean subtraction (Quatieri,
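As an illustration, the two per-utterance steps above (energy-based voice activity detection, then cepstral mean subtraction) can be sketched in a few lines of NumPy. This is a simplified stand-in, not the Kaldi implementation used in the study; the thresholding rule, the function names, and the synthetic features are illustrative assumptions.

```python
import numpy as np

def energy_vad(feats, energy_col=0, threshold_offset=-1.0):
    """Keep frames whose log-energy exceeds (utterance mean + offset).

    `feats` is a (num_frames, num_coeffs) matrix whose first column
    is the log-energy, as in Kaldi's default MFCC layout. The simple
    mean-plus-offset threshold is a stand-in for Kaldi's energy VAD.
    """
    log_energy = feats[:, energy_col]
    keep = log_energy > (log_energy.mean() + threshold_offset)
    return feats[keep]

def cepstral_mean_subtraction(feats):
    """Subtract the per-coefficient mean over the utterance,
    removing stationary convolutive (channel) effects."""
    return feats - feats.mean(axis=0, keepdims=True)

# Toy example: 100 frames of log-energy + 19 MFCCs
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 20))
feats[:30, 0] -= 5.0          # the first 30 frames are "silence" (low energy)

voiced = energy_vad(feats)                    # silent frames removed
normalized = cepstral_mean_subtraction(voiced)
```

After CMS, each cepstral coefficient of the kept frames has zero mean over the utterance, which is what compensates for a fixed recording channel.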
We split the databases into three groups per gender: one group of PD subjects and one group of controls for training, and a third group for testing, containing all the remaining PD and control participants. In the laboratory setting database, we took 36 PD and 36 HC for the male training groups and 38 PD and 12 HC for the male test group. As for women, we considered 30 PD and 30 HC for training and 11 PD and 13 HC for testing. For the telephone database, we selected 30 PD and 30 HC for the male training groups and 33 PD and 6 HC for the male test group. For females we used 20 PD and 20 HC for training and 18 PD and 5 HC for the test. In order to have accurate and generalizable classification performances, the splits were repeated 40 times with the ensemble method described below.
During the training phase, we built multidimensional GMMs, with the Kaldi software, to model the MFCC distributions of each training group (see
MFCC-GMM training phase: GMM training from MFCCs of each training group. The different colors of the GMM represent the different gaussian functions that compose it. The final GMM (gray curve) models the MFCC distribution of one training group (either male PD, female PD, male control, or female control). VAD, voice activity detection; CMS, cepstral mean subtraction; EM, expectation-maximization.
For each test subject, we calculated the log-likelihood (LLH) of their MFCCs under the two GMM models corresponding to their gender. We first computed one log-likelihood per frame (after silence removal) of the test subject's MFCCs against the two models, then took the average over all frames. Thus, the score was guaranteed to be independent of the number of frames. A sigmoid function was then applied to the difference of these means (the
MFCC-GMM test phase: the test subjects' MFCCs are tested against a PD GMM model and a HC GMM model. The sigmoid of the log-likelihood ratio provides the classification score. VAD, voice activity detection; CMS, cepstral mean subtraction; LLH, log-likelihood.
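As a sketch of this test phase, the following uses scikit-learn's `GaussianMixture` on synthetic frames in place of the Kaldi GMMs; the number of components, the diagonal covariances, and the data are illustrative assumptions rather than the study's actual settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Stand-ins for the pooled MFCC frames of each training group
pd_train = rng.normal(loc=0.5, scale=1.0, size=(2000, 20))
hc_train = rng.normal(loc=-0.5, scale=1.0, size=(2000, 20))

# One GMM per class (the number of Gaussians is a tunable hyperparameter)
gmm_pd = GaussianMixture(n_components=8, covariance_type="diag",
                         random_state=0).fit(pd_train)
gmm_hc = GaussianMixture(n_components=8, covariance_type="diag",
                         random_state=0).fit(hc_train)

def pd_score(test_frames):
    """Sigmoid of the mean per-frame log-likelihood ratio.

    Averaging over frames makes the score independent of the number
    of frames; the sigmoid maps it into (0, 1).
    """
    llr = (gmm_pd.score_samples(test_frames).mean()
           - gmm_hc.score_samples(test_frames).mean())
    return 1.0 / (1.0 + np.exp(-llr))

pd_like = pd_score(rng.normal(loc=0.5, size=(300, 20)))   # PD-like frames
hc_like = pd_score(rng.normal(loc=-0.5, size=(300, 20)))  # HC-like frames
```

A score above 0.5 means the frames are better explained by the PD model than by the HC one.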
Ensemble methods are techniques that create multiple models (in our case 40 × 2 GMMs) and then combine them to improve classification or regression. Ensemble methods usually produce more accurate solutions than a single model would (in our case one PD GMM and one HC GMM). That is why we chose to carry out the final classification with an ensemble method. More precisely, we performed a repeated random subsampling aggregation (Bühlmann and Yu,
Final classification using the repeated random subsampling aggregation ensemble method. If the subject
The choice of this ensemble method was based on several elements:
First of all, regarding the sampling technique, we chose repeated random subsampling rather than k-fold or leave-one-subject-out (which are more common) because it allowed us to have the same number of PD and HC subjects for training. This led to the same training conditions for the PD and HC GMMs, such as the same optimal number of Gaussians, hence fewer hyperparameters and a reduced risk of overfitting.
We then chose to carry out the final classification with an ensemble method because such methods are known to decrease the prediction variance, usually leading to better classification performance (Friedman et al.,
Regarding the type of aggregation, we chose to average the scores rather than use a majority vote, because averaging is the technique known to minimize the variance the most (Friedman et al.,
The error calculated on the final scores (of
In section 3.7, we compared the classification performance of the aggregated model with the performance of the single model. The real (or generalized) performance of the single model (the one we would have if we tested an infinity of other new subjects against one PD GMM and one HC GMM trained with our current database) was estimated by the performance of the repeated random subsampling cross-validation (i.e., the average of the classification performance of each run). In all other sections we used the aggregated model for the classification.
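The repeated random subsampling aggregation can be sketched as follows. A toy per-class Gaussian scorer stands in for the PD/HC GMM pair, and the subject features, split sizes, and function names are illustrative assumptions; in the study, test subjects were of course disjoint from the training groups of each run.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy per-subject feature vectors standing in for the real data
pd_subjects = rng.normal(1.2, 1.0, size=(60, 8))
hc_subjects = rng.normal(-1.2, 1.0, size=(60, 8))

def train_and_score(pd_train, hc_train, test):
    """Toy single model: log-density ratio under per-class diagonal
    Gaussians (a stand-in for the PD and HC GMM pair)."""
    def loglik(x, data):
        mu, sd = data.mean(axis=0), data.std(axis=0) + 1e-6
        return float((-0.5 * ((x - mu) / sd) ** 2 - np.log(sd)).sum())
    return loglik(test, pd_train) - loglik(test, hc_train)

def aggregated_score(test, n_runs=40, n_train=40):
    """Repeated random subsampling aggregation: train one model per
    balanced random subsample, then average the scores of all runs."""
    scores = []
    for _ in range(n_runs):
        pd_idx = rng.choice(len(pd_subjects), n_train, replace=False)
        hc_idx = rng.choice(len(hc_subjects), n_train, replace=False)
        scores.append(train_and_score(pd_subjects[pd_idx],
                                      hc_subjects[hc_idx], test))
    return float(np.mean(scores))

s_pd = aggregated_score(rng.normal(1.2, 1.0, size=8))    # a new PD-like subject
s_hc = aggregated_score(rng.normal(-1.2, 1.0, size=8))   # a new HC-like subject
```

Averaging the 40 per-run scores is the aggregation step; a positive aggregated score points to the PD class.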
In this section we present the x-vector system we adapted from the latest speaker recognition method (Snyder et al.,
Since DNN training usually requires a lot of data, we used a DNN trained on large speaker recognition databases and available online (
For the analysis of our telephone recordings, we considered the pretrained DNN SRE16 model, described in Snyder et al. (
For the analysis of our high-quality microphone recordings, we used the voxceleb model, trained on the voxceleb database (Nagrani et al.,
Finally, data augmentation, as described in section 2.2.2.3, was applied to all these DNN training datasets.
These DNNs were trained in the context of speaker identification (see
DNN diagram: training phase. The DNN was trained in the context of speaker identification. The goal was to identify speakers among the
The architecture of the DNN is detailed in
– A set of frame-level layers taking MFCCs as inputs. These layers composed a Time Delay Neural Network (TDNN) taking into account a time context coming from neighboring frames.
– A statistics pooling layer aggregating the outputs (taking the mean and SD) of the TDNN network across the audio segment. The output of this step was a large-scale (3,000 dimensions) representation of the segment.
– The last part was a simple feed forward network composed of two segment-level layers taking as input the result of the pooling layer, reducing its dimensionality to 512 (providing the so-called x-vectors), and ending with a softmax layer. The softmax layer yielded the probability of the input segment coming from each speaker in the training database.
DNN architecture.
| Layer | Total context (frames) | Input dim | Output dim |
| --- | --- | --- | --- |
| Frame-level 1 | 5 | 5 × K | 512 |
| Frame-level 2 | 9 | 1,536 | 512 |
| Frame-level 3 | 15 | 1,536 | 512 |
| Frame-level 4 | 15 | 512 | 512 |
| Frame-level 5 | 15 | 512 | 1,500 |
| Pooling | T | 1,500 × T | 3,000 |
| Segment-level 6 | T | 3,000 | 512 |
| Segment-level 7 | T | 512 | 512 |
| Softmax | T | 512 | N |
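The statistics pooling step, which makes the representation independent of segment length, can be sketched directly: the concatenated per-activation mean and standard deviation of the 1,500 frame-level outputs gives a fixed 3,000-dimensional segment vector whatever the number of frames T. The synthetic inputs below are illustrative.

```python
import numpy as np

def statistics_pooling(frame_outputs):
    """Aggregate frame-level TDNN outputs over the whole segment.

    `frame_outputs` has shape (T, 1500): one 1,500-dim activation
    vector per frame. The output concatenates the mean and the
    standard deviation of each activation, giving 3,000 dimensions
    regardless of T.
    """
    mean = frame_outputs.mean(axis=0)
    std = frame_outputs.std(axis=0)
    return np.concatenate([mean, std])

rng = np.random.default_rng(3)
short_seg = statistics_pooling(rng.normal(size=(120, 1500)))   # ~1.2 s of frames
long_seg = statistics_pooling(rng.normal(size=(900, 1500)))    # ~9 s of frames
```

Both segments, despite very different durations, map to vectors of identical size, which is what lets the segment-level layers operate on variable-length audio.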
Language mismatch between DNN training and x-vector extraction is not an issue: x-vectors have been reported to be robust to this domain mismatch in speaker recognition (Snyder et al.,
For the results presented in section 3.6, we trained a DNN with our own data (telephone recordings). The only difference in the DNN architecture was the size of the softmax layer output, which was two. Indeed, here the DNN was trained directly to discriminate PD subjects from HC (two classes) instead of discriminating between speakers (N classes).
In order to extract the x-vectors for each subject of our databases, we had to extract the MFCCs in the same way as was done for the pretrained DNN. We extracted the log-energy and 23 MFCCs every 10 ms for our telephone recordings (as for the SRE16 model) and 30 MFCCs with log-energy for our high-quality recordings (as for the voxceleb model). For the high-quality microphone recordings, we first had to downsample them to 16 kHz (from 96 kHz) in order to match the sampling frequency used for the DNN training. Moreover, for this database, as for the MFCC-GMM analysis, we carried out spectral subtraction to compensate for mismatched background noises. Voice activity detection and cepstral mean subtraction were also performed on both databases, as done for the SRE16 and voxceleb models and for our MFCC-GMM analysis.
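The 96 kHz to 16 kHz downsampling can be done, for instance, with a polyphase resampler; `scipy` is used here purely as an illustrative stand-in for whatever resampler the actual pipeline used, and the synthetic signal is an assumption.

```python
import numpy as np
from scipy.signal import resample_poly

rng = np.random.default_rng(5)
hq = rng.normal(size=96000)                # 1 s of audio recorded at 96 kHz

# 96 kHz -> 16 kHz is an exact integer ratio (down by 6); resample_poly
# applies the anti-aliasing filter internally
matched = resample_poly(hq, up=1, down=6)
```

Matching the sampling rate (and thus the analysis bandwidth) of the DNN's training data is what makes the extracted MFCCs comparable to those the network saw during training.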
X-vectors were then extracted for each subject. They were defined as the 512-dimensional vector extracted after the first segment-level layer of the DNN, just before the Rectified Linear Unit (ReLU) activation function.
Even if the audio segment tested did not belong to any speaker used to train the DNN, the x-vectors extracted could be considered as a representation of this segment and captured the speaker characteristics (Snyder et al.,
The audio segments used for the DNN training had a duration of 2–4 s (after silence removal). The DNN could be used to extract x-vectors from new unseen audio segments with durations between 25 ms and 100 s. The audio segments of our database shorter than 25 ms were removed, and the ones longer than 100 s were divided into fragments shorter than 100 s. The x-vectors corresponding to these fragments were then averaged.
We assessed the impact of matched segment durations between training and test in section 3.1. For all the other experiments we chose to divide our audio files into 1–5 s segments.
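A minimal sketch of the 1–5 s segmentation, working in 10-ms frames and applying the 25 ms minimum mentioned above; the function name, the random segment lengths, and the handling of the final remainder are assumptions made for illustration.

```python
import numpy as np

def split_frames(num_frames, min_len=100, max_len=500, seed=0):
    """Split an utterance of `num_frames` 10-ms frames into chunks of
    1-5 s (100-500 frames). The final remainder keeps whatever is
    left, and is dropped if shorter than ~25 ms (3 frames)."""
    rng = np.random.default_rng(seed)
    bounds, start = [], 0
    while start < num_frames:
        end = min(start + int(rng.integers(min_len, max_len + 1)), num_frames)
        if end - start >= 3:              # 25 ms = ~3 frames at a 10 ms step
            bounds.append((start, end))
        start = end
    return bounds

segments = split_frames(2500)             # a 25-s utterance
```

Each resulting (start, end) pair would then yield one x-vector, and the x-vectors of all chunks of an utterance can be averaged into a single representation.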
In recent studies, speaker recognition using i-vectors and x-vectors has been enhanced by augmenting the data (Snyder et al.,
– Reverberation: a reverberation was simulated by taking the convolution of our data with a Room Impulse Response (RIR) of different shapes and sizes, available online (
– Additive noise: different types of noise, extracted from the MUSAN database (
– Additive music: musical extracts (from the MUSAN database) were added as background noise.
– Babble: three to seven speakers (from the MUSAN database) were randomly selected, summed together, then added to our data.
The MUSAN and RIR NOISES databases were sampled at 16 kHz, so we downsampled them to 8 kHz for the telephone recordings analysis.
In the end, two out of the four augmented copies were randomly picked and added to our training database, tripling its size.
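The reverberation and additive-noise augmentations can be sketched as follows with synthetic signals; the RIR shape, the SNR value, and the function names are illustrative (the study used impulse responses from the RIR_NOISES corpus and noise, music, and babble from MUSAN).

```python
import numpy as np

rng = np.random.default_rng(4)

def add_reverb(speech, rir):
    """Simulated reverberation: convolve the signal with a room
    impulse response, then peak-normalize the result."""
    wet = np.convolve(speech, rir)[:len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-9)

def add_noise(speech, noise, snr_db):
    """Additive noise (or music, or summed 'babble' speakers),
    scaled to a target signal-to-noise ratio in dB."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

speech = rng.normal(size=8000)                       # 1 s at 8 kHz (telephone rate)
rir = np.exp(-np.arange(400) / 50.0) * rng.normal(size=400)  # toy decaying RIR
reverbed = add_reverb(speech, rir)
noisy = add_noise(speech, rng.normal(size=8000), snr_db=10)
```

Each augmented copy keeps the original length and label; only the acoustic conditions change, which is what lets the discriminant analyses see more channel variability without new speakers.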
Once the x-vectors were extracted for each subject, the x-vectors of the PD training group and the x-vectors of the HC training group were averaged in order to have one average x-vector representing each class, for each gender (see
Reference x-vectors: x-vectors are computed for all the training subjects using their MFCCs, then averaged within the training groups (male PD, female PD, male control, and female control) in order to have one average x-vector per group. VAD, voice activity detection; CMS, cepstral mean subtraction; DNN, deep neural network.
Classification of test subjects was done by comparing their x-vectors to the average x-vector
x-vector test phase: x-vectors are computed for each test subject from their MFCCs, then compared to the average x-vector
Several methods exist to measure similarity between vectors. We compared three methods often used with i-vectors and x-vectors: cosine similarity, cosine similarity preceded by LDA, and PLDA.
In order to reduce intra-class variability and raise inter-class variability, discriminant analyses can be added to the back-end process. We supplemented the previous cosine similarity with a two-dimensional LDA, which consists of finding the orthogonal basis onto which the projection of the x-vectors (extracted from our training groups) minimizes intra-class variability while maximizing inter-class variability. The cosine similarity was then computed within this subspace.
The columns of matrix
PLDA was preceded by an LDA in order to reduce the x-vector dimension.
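A sketch of the LDA + cosine back-end using scikit-learn on synthetic "x-vectors"; the dimensionality (40 instead of 512), the class shift, and the function names are toy assumptions. Note that scikit-learn caps LDA at n_classes - 1 components, so this toy uses a single discriminant axis for the two-class PD vs. HC problem.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(6)

# Toy "x-vectors": 40-dim for readability (real ones are 512-dim),
# with the PD class shifted on the first 10 axes
pd_xvecs = rng.normal(size=(100, 40))
pd_xvecs[:, :10] += 1.5
hc_xvecs = rng.normal(size=(100, 40))

X = np.vstack([pd_xvecs, hc_xvecs])
y = np.array([1] * 100 + [0] * 100)

# Project onto the discriminant axis learned from the training groups
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
pd_ref = lda.transform(pd_xvecs).mean(axis=0)   # average projected x-vector (PD)
hc_ref = lda.transform(hc_xvecs).mean(axis=0)   # average projected x-vector (HC)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def is_pd(xvec):
    """PD if the projected test x-vector is closer (cosine similarity)
    to the PD reference than to the HC reference."""
    z = lda.transform(xvec[None, :])[0]
    return cosine(z, pd_ref) > cosine(z, hc_ref)

test_pd = rng.normal(size=40)
test_pd[:10] += 1.5
test_hc = rng.normal(size=40)
```

The same comparison-to-reference structure applies when PLDA replaces the cosine scoring; only the similarity measure changes.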
For the final classification and the validation we kept the ensemble method used for the MFCC-GMM analysis and described in section 2.2.1.4.
In the following section we present the results of the x-vector analysis compared to the MFCC-GMM one for both genders and for both recording types (high-quality and telephone). We analyzed the effect of the audio segment durations, data augmentation, gender, type of classifier (for each speech task), dataset used for DNN training, and the choice of an ensemble method. More details about the MFCC-GMM analysis (men only) can be found in Jeancolas et al. (
In order to have enough x-vectors for the LDA and PLDA training, we segmented our training audio files into 1–5 s segments. For the test phase, we compared two conditions. In the first condition, we considered a large variety of segment durations, from 25 ms to 100 s (in order to stay within the DNN-compatible limits explained in section 2.2.2.2). The durations of these test segments were matched neither with those used for the DNN training (between 2 and 4 s) nor with those used during our classifier training phase (1 to 5 s). In the second condition, we divided all our audio files into 1–5 s segments. Test segment durations were then matched with training segment durations. Results for both duration conditions, obtained from the sentence repetition tasks of the male telephone recordings, are presented in
PD vs. HC classification EER (in %) obtained with different segment lengths for the x-vectors extraction.
x-vec + cos | 41 | |
x-vec + LDA + cos | 36 | |
x-vec + PLDA | 36 |
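The EER reported throughout these tables is the operating point where the false-negative rate (PD subjects scored below threshold) equals the false-positive rate (HC subjects scored above it). A simple threshold-sweep approximation:

```python
import numpy as np

def equal_error_rate(scores_pd, scores_hc):
    """Approximate EER: sweep every observed score as a threshold and
    return the error at the point where the false-negative rate and
    the false-positive rate are closest to each other."""
    thresholds = np.sort(np.concatenate([scores_pd, scores_hc]))
    fnr = np.array([np.mean(scores_pd < t) for t in thresholds])   # PD missed
    fpr = np.array([np.mean(scores_hc >= t) for t in thresholds])  # HC flagged
    i = int(np.argmin(np.abs(fnr - fpr)))
    return (fnr[i] + fpr[i]) / 2

# Perfectly separated scores give an EER of 0
perfect = equal_error_rate(np.array([0.9, 0.8, 0.7]),
                           np.array([0.1, 0.2, 0.3]))
```

Overlapping score distributions push the EER toward 50%, i.e., chance level for a balanced two-class problem.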
Classification of x-vectors with cosine similarity combined with LDA performed as well as PLDA, and both were globally better than cosine similarity alone, whatever the recording condition (telephone or high-quality microphone) or speech task (see
PD vs. HC classification EER (in %) obtained with different classifiers: MFCC-GMM baseline, and x-vectors combined either with cosine similarity (alone and with LDA) or with PLDA, with and without data augmentation.
MFCC-GMM | 26 | 42 | 45 | 35 | 36 | 42 | 40 | |
x-vec + cos | 32 | 35 | 51 | 41 | 39 | 33 | 49 | 43 |
x-vec + LDA + cos | 27 | 39 | 32 | 35 | 34 | |||
x-vec + augLDA + cos | 24 | 33 | 39 | |||||
x-vec + PLDA | 24 | 28 | 39 | 35 | 33 | 36 | 36 | |
x-vec + augPLDA | 25 | 37 |
PD vs. HC classification EER (in %) obtained with different databases for the DNN training: the SRE16 database and our male telephone database (DDK tasks).
MFCC-GMM | ||
x-vec + cos | 47 | |
x-vec + LDA + cos | ||
x-vec + augLDA + cos | 39 | |
x-vec + PLDA | ||
x-vec + augPLDA | 38 |
In this section we assessed the impact of augmenting the LDA and PLDA training data. Results obtained with and without data augmentation for the LDA and PLDA training are detailed in
In this section we compared the classification methodologies using x-vectors with the more classic MFCC-GMM classification.
We already showed that data augmentation for the LDA and PLDA training improved classification for the free speech task but not for the text-dependent tasks. Therefore, for the comparison between MFCC-GMM and x-vectors, we used for the latter cosine similarity combined with augmented LDA for the free speech task, and with non-augmented LDA for the sentence repetition and DDK tasks.
For both recording conditions and both genders, we observed improved classification performances with x-vectors, compared to MFCC-GMM, for the free speech task (see
This improvement with x-vectors (compared to MFCC-GMM) was more pronounced in women (7% increase with telephone and 15% with high-quality microphone, compared to 1–3% in men). Detection Error Tradeoff (DET) curves in
DET curves of female classification PD vs. HC, using the free speech task, recorded with the high-quality microphone
Finally, results from the very specific DDK task (tested with male telephone recordings) are presented in
MFCC-GMM and x-vector classifiers were trained separately for each gender, in order to study gender effect on early PD detection.
With the MFCC-GMM classification method, the female group showed poor PD detection performances: EER ranged from 40 to 45%, compared with 22–36% for men (see
Interestingly, x-vectors, when combined with a discriminant analysis (LDA or PLDA), clearly improved female classification performances, with an EER between 30 and 39%. Nevertheless, the female performances did not reach the PD detection performances obtained in males, whether with the MFCC-GMM technique or with x-vectors (the best EER reached 22% with both methods in males).
In order to make the DNN more suitable for the particular type of DDK tasks, we carried out an additional experiment, training this time the DNN with DDK tasks from our own database. The subjects used for the DNN training were the same as those used for the constitution of the average x-vector
In order to test the advantage of the ensemble method we used, we compared its performances with the results obtained with the corresponding single model. To estimate the performance of the single model, we performed a classic random subsampling cross-validation. We averaged the DET curves from each run and calculated the EER corresponding to the average DET curve. We used male telephone recordings and considered the most appropriate tasks for each classifier. The performances obtained are detailed in
PD vs. HC classification EER (in %) obtained with the aggregated model compared to the single model.
MFCC-GMM | DDK | 28 | |
x-vec + LDA + cos | Repet | 35 | |
x-vec + augLDA + cos | Monol | 35 | |
x-vec + PLDA | Repet | 35 | |
x-vec + augPLDA | Monol | 35 |
According to the literature, the latest speaker recognition system, called x-vectors, provides more robust speaker representations and better recognition when a large amount of training data is used. Our goal was to assess whether this technique could be adapted to early PD detection (from recordings made with a high-quality microphone and via the telephone network) and improve detection performances. We compared an x-vector classification method to a more classic system based on MFCCs and GMMs.
We recorded 221 French speakers (recently diagnosed PD subjects and healthy controls) with a high-quality microphone and over the telephone. Our voice analyses were based on MFCC features. The baseline consisted of modeling the PD and HC distributions with two GMMs. For the x-vector technique, MFCCs were used as inputs to a feed-forward DNN from which embeddings (called x-vectors) were extracted and then classified. Since DNN training usually requires a lot of data, we used a DNN trained on large speaker recognition databases. All the analyses were done separately for men and for women, in order to avoid additional variability due to gender, as well as to study a possible gender effect on early PD detection. We varied several experimental and methodological aspects in order to analyze their effect on the classification performances.
We observed that using short audio segments that were matched between training and test provided better results (3% improvement). The improvement may be due to the matching durations between training segments and test segments, or to the fact that the classification was performed on more test segments (because they were shorter on average). This would compensate for the fact that taken separately, long segments have been shown to be better classified than short segments in speaker and language recognition (Snyder et al.,
We compared different back-end analyses used with x-vectors. We noticed that the addition of LDA clearly improved the cosine similarity classification and performed as well as a PLDA classifier. This can be explained by the fact that discriminant analyses reduce intra-class variance and increase inter-class variance, highlighting differences due to PD. This improvement due to the addition of discriminant analyses was even more pronounced in women (up to 15% improvement), whose voices are known to contain more variability (i.e., higher intra-class variance).
We found that augmenting data for the training of LDA and PLDA led to an improved classification for the free speech task (2–3% improvement) but not for text-dependent tasks (like sentence repetition and DDK). This can be explained by the fact that data augmentation, while increasing the training audio quantity, added phonetic variability which may have damaged the specificity of the phonetic content of the text-dependent tasks (like sentence repetitions, reading or DDK tasks). Data augmentation seems to be more suited for text-independent tasks (like free speech).
The comparison with the MFCC-GMM classification showed that x-vectors performed better for the free-speech task, which is consistent with the fact that x-vectors were originally developed for text-independent speaker recognition. An overall improvement with x-vectors also appeared for the sentence repetition and reading tasks, but in a less consistent way. This may be explained by the fact that GMMs captured well the specificity of text-dependent phonetic content: the reduced inter-subject variability of the phonetic content made it easier to isolate the variability due to the disease, at least for the high-quality recordings. For telephone recordings there was no reading task, and the free speech task lasted much longer than the sentence repetitions, which may have offset the expected advantage of a constant phonetic content. Moreover, the participants carried out the telephone recordings by themselves, without an experimenter to make them redo a task when it was not well executed, so mistakes or comments occurred during the telephone sentence repetitions, slightly increasing the variability of their phonetic content. As for the x-vector classification, another aspect has to be taken into account: the DNNs were trained on public databases with a very wide variability in phonetic content, making the x-vector extractor not particularly suited to tasks with fixed phonetic content. Very specific tasks, like DDK, resulted in better performances with GMMs. The lower results with x-vectors for this task may be due to the DNN training, which was based on recordings of conversations containing a much wider variety of phonemes than DDK tasks (composed of vowels and stop consonants only). Thus, the DDK specificity was not exploited by the DNN, resulting in a loss of discriminating power when using x-vectors.
For all classifiers we noticed a marked gender effect, with better performances for male PD detection. Several reasons may explain these gender differences. First of all, previous studies have reported wider female MFCC distributions, with more variability, making MFCC-based classification more difficult in women (Fraile et al.,
In order to make the DNN more specific to DDK tasks, we carried out an additional analysis, this time training it with our database (from DDK tasks). We noticed a clear performance degradation when data augmentation was applied to the LDA and PLDA training. This is consistent with the fact that data augmentation, by adding noise, impairs the specificity of the DDK phonetic content. Results obtained with cosine similarity + LDA and with PLDA, without data augmentation, were similar to those obtained with the previous pretrained DNN. Our DNN training was certainly more specific but probably suffered from an insufficient quantity of data, which could explain why it did not outperform the pretrained DNN, confirming the importance of a large quantity of data for DNN training.
Finally, we observed a 2–3% improvement in the classification when the ensemble method was used, for both the MFCC-GMM and x-vector classifiers. This demonstrates the value of ensemble methods for voice-based PD detection.
One of the limitations of this study is that our classifications were based only on cepstral features, which cannot capture all voice impairments due to PD. Articulatory impairments due to PD, like vowel dedifferentiation (due to an amplitude reduction of tongue and lip movements) and imprecise consonant articulation (e.g., a vocal tract not completely closed during stop consonant production and poor coordination between laryngeal and supralaryngeal muscles), have an impact on the spectral envelope over time, so they are well captured by the MFCC vectors. Nevertheless, MFCCs do not describe well several phonatory disruptions due to PD (such as pitch and intensity instability and voice hoarseness), nor abnormal pauses, nor the prosodic and rhythmic disruptions encountered in PD. To quantify these, global features such as those stated in the introduction are preferable. A fusion of a classification based on these features with the x-vector approach we presented in this paper should improve PD detection performances.
It is also worth highlighting that comparing our classifier performances with the literature remains difficult. Indeed, as far as we know, our results were the first obtained in early PD detection: (i) in women, based only on voice; (ii) using recordings from the telephone channel (if we do not count our last conference paper on MFCC-GMM classification; Jeancolas et al.,
An additional limitation of our work is that x-vectors were conceived for text-independent speaker recognition, whereas some of our tasks are text-dependent. Moreover, the use of complex artificial neural networks in the feature extraction process makes the reasons for score improvements difficult to understand, and the physiopathology underlying PD speech impairments difficult to interpret, which hampers the formulation and testing of new hypotheses.
Finally, it would also be interesting to test other distance measures (such as the Euclidean distance, the Mahalanobis distance, or the Jensen-Shannon divergence) to compare the x-vectors of the test subjects with the average x-vectors of each class.
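For reference, these alternative measures can be sketched as follows. The snippet is a pure-Python illustration with invented toy vectors; the Mahalanobis distance is shown with a simplifying diagonal-covariance assumption, and the Jensen-Shannon divergence requires the embeddings to first be mapped to probability distributions (e.g., via a softmax), which is not done here:

```python
import math

def euclidean(u, v):
    # Standard L2 distance between two embeddings.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mahalanobis_diag(u, v, variances):
    # Mahalanobis distance under a diagonal covariance approximation:
    # each dimension is scaled by its (assumed) variance.
    return math.sqrt(sum((a - b) ** 2 / s
                         for a, b, s in zip(u, v, variances)))

def js_divergence(p, q):
    # Jensen-Shannon divergence between two discrete distributions.
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy embeddings and distributions (invented values).
u, v = [1.0, 0.0, 0.3], [0.8, 0.2, 0.1]
d_euc = euclidean(u, v)
d_mah = mahalanobis_diag(u, v, [0.5, 0.5, 0.5])

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
d_js = js_divergence(p, q)
```

Unlike the cosine similarity, the Euclidean and Mahalanobis distances are sensitive to the magnitude of the embeddings, and the Mahalanobis distance additionally accounts for the spread of each dimension.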
The aim of this study was to discriminate between subjects with early-stage Parkinson's disease and healthy controls using a new speech analysis technique adapted from recent advances in speaker recognition. We compared the efficacy of this method (called x-vectors) with the more classical MFCC-GMM technique and varied several experimental and methodological aspects to determine the optimal approach.
We found that the optimal x-vector methodological procedure for early PD detection consisted of using short and matched audio segments, adding discriminant analysis (LDA or PLDA) to the back-end process, augmenting the training data for the text-independent tasks, and using an ensemble method for the final classification. This resulted in better early PD detection performance with x-vectors than with the MFCC-GMM technique for the text-independent speech tasks. The improvement was observed for both genders, but the x-vector technique seems particularly suited to early PD detection in women, with a 7–15 percentage-point improvement. The improved classification results with x-vectors on text-independent tasks were obtained with both professional microphone recordings and telephone recordings. This validates the x-vector approach for PD detection with both high-quality recordings performed in a laboratory setting and low-quality recordings performed at home and transmitted through the telephone network.
In future work we will focus on other embeddings (e.g., d-vectors; Variani et al.,
The datasets presented in this article are not readily available because of compliance with the ethical consents provided by the participants. Requests to access the datasets should be directed to Marie Vidailhet,
The studies involving human participants were reviewed and approved by IRB Paris VI, RCB: 2014-A00725-42. The patients/participants provided their written informed consent to participate in this study.
LJ: experimental design, data collection, data analysis and interpretation, and manuscript draft. DP-D: experimental design, validation of the analysis and its interpretation, and manuscript revision. GM: participants' diagnosis and clinical scores. B-EB: validation of the analysis and its interpretation. J-CC, MV, and SL: design and development of the ICEBERG study, data collection, and manuscript revision. HB: validation of the analysis and its interpretation and manuscript revision. All authors contributed to the article and approved the submitted version.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The authors would like to thank the Samovar laboratory (especially Mohamed Amine Hmani and Aymen Mtibaa), CIC Neurosciences (especially Alizé Chalançon, Christelle Laganot, and Sandrine Bataille), and the sleep disorders unit and CENIR teams. The authors are also grateful to Tania Garrigoux, Obaï Bin Ka'b Ali, and Fatemeh Razavipour for the manuscript revision. Finally, the authors would like to express their sincere gratitude to all the subjects who participated in this study.
The vocal tasks performed during the laboratory recordings and analyzed in the present study were: readings, sentence repetitions, a free-speech task, and fast syllable repetitions. They were presented to the participants in random order via a graphical user interface.
– “Au loin un gosse trouve, dans la belle nuit complice, une merveilleuse et fraîche jeune campagne. Il n'a pas plus de dix ans et semble venir de très loin. Comment il en est arrivé là, ça l'histoire ne le dit pas.”
– “— Tu as eu des nouvelles de Ludivine récemment? Elle ne répond plus à mes messages depuis quelque temps. — Je l'ai aperçue par hasard au parc hier. Tu ne devineras JAMAIS ce qu'elle faisait! — Vas-y raconte! — Elle courait autour du stade à CLOCHE-PIED et avec un BANDEAU sur les yeux! — Ha la la! Comment c'est possible de ne pas avoir peur du ridicule à ce point? — À mon avis elle aime juste bien se faire remarquer.”
– “Tu as appris la nouvelle?” – “C'est pas possible!” – “Tu sais ce qu'il est devenu?” – “Il n'aurait jamais dû faire ça!”
The vocal tasks performed during the phone calls and analyzed in the present study were: sentence repetitions, a free-speech task, and fast syllable repetitions. All instructions were given orally by the interactive voice server.
– “Tu as appris la nouvelle?” – “C'est pas possible!” – “Tu sais ce qu'il est devenu?” – “Il n'aurait jamais dû faire ça!” – “Tu as bien raison!” – “Comment il s'appelle déjà?” – “Les chiens aiment courir après les ballons.” – “Un carré est un rectangle particulier.”
Transmission chain of the telephone recordings. GSM, Global System for Mobile Communications; AMR, Adaptive Multi-Rate; PSTN, Public switched telephone network; PCM, Pulse-code modulation; IP, Internet Protocol; IVM, interactive voicemail from NCH company.