<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">624488</article-id>
<article-id pub-id-type="doi">10.3389/fcomp.2020.624488</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Pauses for Detection of Alzheimer&#x2019;s Disease</article-title>
<alt-title alt-title-type="left-running-head">Yuan et al.</alt-title>
<alt-title alt-title-type="right-running-head">Pauses for Alzheimer&#x2019;s Detection</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Yuan</surname>
<given-names>Jiahong</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/934109/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Cai</surname>
<given-names>Xingyu</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bian</surname>
<given-names>Yuchen</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="http://loop.frontiersin.org/people/1164851/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ye</surname>
<given-names>Zheng</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="http://loop.frontiersin.org/people/6334/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Church</surname>
<given-names>Kenneth</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="http://loop.frontiersin.org/people/616463/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<label>
<sup>1</sup>
</label>Baidu Research, <addr-line>Sunnyvale</addr-line>, <addr-line>CA</addr-line>, <country>United States</country>
</aff>
<aff id="aff2">
<label>
<sup>2</sup>
</label>Institute of Neuroscience, Key Laboratory of Primate Neurobiology, Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, <addr-line>Shanghai</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/141969/overview">Saturnino Luz</ext-link>, University of Edinburgh, United Kingdom</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/343451/overview">Jiri Pribil</ext-link>, Slovak Academy of Sciences, Slovakia</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1139925/overview">Shane Sheehan</ext-link>, University of Edinburgh, United Kingdom</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Jiahong Yuan, <email>jiahongyuan@baidu.com</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Human-Media Interaction, a section of the journal Frontiers in Computer Science</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>29</day>
<month>01</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>2</volume>
<elocation-id>624488</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>10</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>11</day>
<month>12</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Yuan, Cai, Bian, Ye and Church.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Yuan, Cai, Bian, Ye and Church</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Pauses, disfluencies and language problems in Alzheimer&#x2019;s disease can be naturally modeled by fine-tuning Transformer-based pre-trained language models such as BERT and ERNIE. Using this method with pause-encoded transcripts, we achieved 89.6% accuracy on the test set of the ADReSS (<underline>A</underline>lzheimer&#x2019;s <underline>D</underline>ementia <underline>Re</underline>cognition through <underline>S</underline>pontaneous <underline>S</underline>peech) Challenge. The best accuracy was obtained with ERNIE, plus an encoding of pauses. Robustness is a challenge for large models and small training sets. Ensembling over many runs of BERT/ERNIE fine-tuning reduced variance and improved accuracy. We found that <italic>um</italic> was used much less frequently than <italic>uh</italic> in Alzheimer&#x2019;s speech, and we discuss this finding from linguistic and cognitive perspectives.</p>
</abstract>
<kwd-group>
<kwd>Alzheimer&#x2019;s disease</kwd>
<kwd>pause</kwd>
<kwd>BERT</kwd>
<kwd>ERNIE</kwd>
<kwd>ensemble</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Alzheimer&#x2019;s disease (AD) involves a progressive degeneration of brain cells that is irreversible (<xref ref-type="bibr" rid="B27">Mattson, 2004</xref>). One of the first signs of the disease is deterioration in language and speech production (<xref ref-type="bibr" rid="B28">Mueller et al., 2017</xref>). It is desirable to use language and speech for AD detection (<xref ref-type="bibr" rid="B24">Laske et al., 2015</xref>). In this paper, we investigate the use of pauses in speech (both unfilled and filled pauses such as &#x201c;uh&#x201d; and &#x201c;um&#x201d;) for this task.</p>
<sec id="s1-1">
<title>1.1 Pauses</title>
<p>Unfilled pauses play an important role in speech. The occurrence of pauses is subject to physiological, linguistic, and cognitive constraints (<xref ref-type="bibr" rid="B16">Goldman-Eisler, 1961</xref>; <xref ref-type="bibr" rid="B33">Rochester, 1973</xref>; <xref ref-type="bibr" rid="B4">Butcher, 1981</xref>; <xref ref-type="bibr" rid="B44">Zellner, 1994</xref>; <xref ref-type="bibr" rid="B6">Clark, 2006</xref>; <xref ref-type="bibr" rid="B31">Ramanarayanan et al., 2013</xref>; <xref ref-type="bibr" rid="B21">Hawthorne and Gerken, 2014</xref>). How different constraints interact in pause production has been an active research subject for decades. In normal speech, the likelihood of pause occurrence and the duration of pauses are correlated with syntactic and prosodic structure (<xref ref-type="bibr" rid="B3">Brown and Miron, 1971</xref>; <xref ref-type="bibr" rid="B20">Grosjean et al., 1971</xref>; <xref ref-type="bibr" rid="B23">Krivokapic, 2007</xref>). For example, if a sentence has a syntactically complex subject and a syntactically complex object, speakers tend to pause at the subject-verb phrase boundary, and pause duration increases with upcoming complexity (<xref ref-type="bibr" rid="B12">Ferreira, 1991</xref>). It has been demonstrated that pauses in speech are used by listeners in sentence parsing (<xref ref-type="bibr" rid="B34">Schepman and Rodway, 2000</xref>), and the pause information can benefit automatic parsing (<xref ref-type="bibr" rid="B38">Tran et al., 2018</xref>).</p>
<p>Atypical pausing is characteristic of disordered speech such as in Alzheimer&#x2019;s disease, and pauses are often used to measure language and speech problems (<xref ref-type="bibr" rid="B32">Ramig et al., 1995</xref>; <xref ref-type="bibr" rid="B43">Yuan et al., 2016</xref>; <xref ref-type="bibr" rid="B35">Shea and Leonard, 2019</xref>). Typical and atypical pauses differ not only in their frequency and duration, but also in where they occur. In this study, we propose a method to encode pauses in transcripts in order to capture the associations between pauses and words through fine-tuning pre-trained language models such as BERT (<xref ref-type="bibr" rid="B10">Devlin et al., 2018</xref>) and ERNIE (<xref ref-type="bibr" rid="B36">Sun et al., 2019</xref>), which we describe in <xref ref-type="sec" rid="s1-2">Section 1.2</xref>.</p>
<p>The use of filled pauses may also differ between AD and normal speech. English has two common filled pauses, <italic>uh</italic> and <italic>um</italic>. There is a debate in the literature as to whether <italic>uh</italic> and <italic>um</italic> are intentionally produced by speakers (<xref ref-type="bibr" rid="B5">Clark and Fox Tree, 2002</xref>; <xref ref-type="bibr" rid="B7">Corley and Stewart, 2008</xref>). From a sociolinguistic point of view, women and younger people tend to use more <italic>um</italic> relative to <italic>uh</italic> than men and older people do (<xref ref-type="bibr" rid="B37">Tottie, 2011</xref>; <xref ref-type="bibr" rid="B40">Wieling et al., 2016</xref>). It has also been reported that autistic children use <italic>um</italic> less frequently than typically developing children (<xref ref-type="bibr" rid="B18">Gorman et al., 2016</xref>; <xref ref-type="bibr" rid="B22">Irvine et al., 2016</xref>), and that <italic>um</italic> occurs less frequently and is shorter during lying than during truth-telling (<xref ref-type="bibr" rid="B1">Arciuli et al., 2010</xref>).</p>
</sec>
<sec id="s1-2">
<title>1.2 Pre-trained LMs and Self-Attention</title>
<p>Modern pre-trained language models such as BERT (<xref ref-type="bibr" rid="B10">Devlin et al., 2018</xref>) and ERNIE (<xref ref-type="bibr" rid="B36">Sun et al., 2019</xref>) were trained on extremely large corpora. These models appear to capture a wide range of linguistic facts, including lexical knowledge, phonology, syntax, semantics and pragmatics. The recent literature reports considerable success on a variety of benchmark tasks with BERT and BERT-like models.<xref ref-type="fn" rid="FN1">
<sup>1</sup>
</xref> We expect that the language characteristics of AD can also be captured by the pre-trained language models when fine-tuned to the task of AD classification.</p>
<p>BERT and BERT-like models are based on the Transformer architecture (<xref ref-type="bibr" rid="B39">Vaswani et al., 2017</xref>). These models use self-attention to capture associations among words. Each attention head operates on the elements in a sequence (e.g., the words in a subject&#x2019;s transcript) and computes a new sequence in which each element is a weighted sum of (transformed) input elements. BERT and ERNIE each come in several versions: a base model with 12 layers and 12 attention heads per layer, and a large model with 24 layers and 16 attention heads per layer. Conceptually, the self-attention mechanism can naturally model many language problems in AD, including repetitions of words and phrases, use of particular words (and classes of words), as well as pauses. By inserting pauses into the word transcripts, we enable BERT-like models to learn the language problems involving pauses.</p>
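As a rough illustration of this weighted-sum computation, the following NumPy sketch implements a single head of scaled dot-product self-attention on toy dimensions (the matrices and sizes are invented for illustration; this is not the BERT implementation itself):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One head of scaled dot-product self-attention over a sequence X.

    Each output row is a weighted sum of (transformed) input elements,
    with weights given by a softmax over query-key similarities.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)
```

In the full models, many such heads run in parallel in every layer, so an inserted pause token participates in these weighted sums just as a word does.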
<p>Previous studies have found that when BERT is fine-tuned for a downstream task on a small data set, its performance has high variance. Even with the same hyperparameter values, distinct random seeds can lead to substantially different results. <xref ref-type="bibr" rid="B11">Dodge et al. (2020)</xref> conducted a large-scale study of this issue: they fine-tuned BERT hundreds of times while varying only the random seeds, and found that the best-found model significantly outperformed previously reported results using the same model. In this situation, relying on a single final model for prediction is risky given the variance in performance during training. We propose an ensemble method to address this concern.</p>
</sec>
<sec id="s1-3">
<title>1.3 Automatic Detection of AD</title>
<p>There is a considerable literature on AD detection from continuous speech (<xref ref-type="bibr" rid="B13">Filiou et al., 2019</xref>; <xref ref-type="bibr" rid="B30">Pulido et al., 2020</xref>). This literature considers a wide variety of features and machine learning techniques. <xref ref-type="bibr" rid="B14">Fraser et al. (2016)</xref> used 370 acoustic and linguistic features to train logistic regression models for classifying AD and normal speech. <xref ref-type="bibr" rid="B19">Gosztolya et al. (2019)</xref> found that acoustic and linguistic features were about equally effective for AD classification, but the combination of the two performed better than either by itself. Neural network models such as Convolutional Neural Networks and Long Short-Term Memory (LSTM) have also been employed for the task (<xref ref-type="bibr" rid="B9">de Ipi&#xf1;a et al., 2017</xref>; <xref ref-type="bibr" rid="B15">Fritsch et al., 2019</xref>; <xref ref-type="bibr" rid="B29">Palo and Parde, 2019</xref>), and very promising results have been reported. However, it is difficult to compare these different approaches because of the lack of standardized training and test data sets. The ADReSS challenge of INTERSPEECH 2020 is &#x201c;to define a shared task through which different approaches to AD detection, based on spontaneous speech, could be compared&#x201d; (<xref ref-type="bibr" rid="B25">Luz et al., 2020</xref>). This paper stems from our participation in the shared task.</p>
</sec>
</sec>
<sec id="s2">
<title>2 Data and Analysis</title>
<sec id="s2-1">
<title>2.1 Data</title>
<p>The data consists of speech recordings and transcripts of descriptions of the Cookie Theft picture from the Boston Diagnostic Aphasia Exam (<xref ref-type="bibr" rid="B17">Goodglass et al., 2001</xref>). Transcripts were annotated using the CHAT coding system (<xref ref-type="bibr" rid="B26">MacWhinney, 2000</xref>). We used only the word transcripts; the morphological and syntactic annotations in the transcripts were not used in our experiments.</p>
<p>The training set contains 108 speakers, and the test set contains 48 speakers. In each data set, half of the speakers are people with AD and half are non-AD (healthy control subjects). Both data sets were provided by the challenge. The organizers also provided speech segments extracted from the recordings using a simple voice detection algorithm, but no transcripts were available for these segments, and we did not use them. Our experiments were based on the entire recordings and transcripts.</p>
</sec>
<sec id="s2-2">
<title>2.2 Processing Transcripts and Forced Alignment</title>
<p>The transcripts in the data sets were annotated in the CHAT format, which can be conveniently created and analyzed using CLAN (<xref ref-type="bibr" rid="B26">MacWhinney, 2000</xref>). For example: &#x201c;the [x 3] bench [: stool]&#x201d;. In this example, [x 3] indicates that the word &#x201c;the&#x201d; was repeated three times, and [: stool] indicates that the preceding word, &#x201c;bench&#x201d; (which was actually produced), refers to a stool. Details of the transcription format can be found in <xref ref-type="bibr" rid="B26">MacWhinney (2000)</xref>.</p>
<p>For the purpose of forced alignment and fine-tuning, we converted the transcripts into the words and tokens that were actually produced in speech. &#x201c;w [x n]&#x201d; was replaced by n repetitions of w; punctuation marks and the various comments annotated between &#x201c;[]&#x201d; were removed. Symbols such as (.), (..), (&#x2026;), <inline-formula id="inf1">
<mml:math id="mml-math1-fcomp.2020.624488">
<mml:mo>&#x3c;</mml:mo>
</mml:math>
</inline-formula>, <inline-formula id="inf2">
<mml:math id="mml-math2-fcomp.2020.624488">
<mml:mo>&#x3e;</mml:mo>
</mml:math>
</inline-formula>, / and &#x0078;&#x0078;&#x0078; were also removed.</p>
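The conversion described above can be approximated with a few regular expressions. The following is an illustrative sketch, not our exact preprocessing script; real CHAT transcripts contain many more codes, which the CLAN tools handle properly:

```python
import re

def clean_chat_line(line):
    """Approximate the transcript cleanup described in the text."""
    # Expand "w [x n]" into n repetitions of w.
    line = re.sub(r"(\S+)\s*\[x (\d+)\]",
                  lambda m: " ".join([m.group(1)] * int(m.group(2))),
                  line)
    # Drop remaining bracketed annotations such as "[: stool]".
    line = re.sub(r"\[[^\]]*\]", "", line)
    # Remove pause symbols and other markup: (.), (..), (...), <, >, /, xxx.
    line = re.sub(r"\(\.{1,3}\)|[<>/]|\bxxx\b", "", line)
    return " ".join(line.split())

print(clean_chat_line("the [x 3] bench [: stool] (.) xxx"))
```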
<p>The processed transcripts were force-aligned with the speech recordings using the Penn Phonetics Lab Forced Aligner (<xref ref-type="bibr" rid="B42">Yuan and Liberman, 2008</xref>). The aligner uses a special model, &#x201c;sp&#x201d;, to identify between-word pauses. After forced alignment, the speech segments belonging to the interviewer were excluded, as were the pauses at the beginning and the end of the recordings. Only the subjects&#x2019; speech, including pauses in turn-taking between the interviewer and the subject, was used.</p>
</sec>
<sec id="s2-3">
<title>2.3 Word Frequency and <italic>Uh/Um</italic>
</title>
<p>From the training data set, we calculated word frequencies for the Control and AD groups respectively. Words that appear 10 or more times in both groups are shown in the word clouds in <xref ref-type="fig" rid="F1">Figure 1</xref>. The following words are at least two times more frequent in AD than in Control: <italic>oh</italic> (4.33), <italic>&#x3d; laughs</italic> (laughter, 3.18), <italic>down</italic> (2.66), <italic>well</italic> (2.42), <italic>some</italic> (2.2), <italic>what</italic> (2.16), <italic>fall</italic> (2.15). The words that are at least two times more frequent in Control than in AD are: <italic>window</italic> (4.4), <italic>are</italic> (3.83), <italic>has</italic> (3.0), <italic>reaching</italic> (2.8), <italic>her</italic> (2.62), <italic>um</italic> (2.55), <italic>sink</italic> (2.3), <italic>be</italic> (2.21), <italic>standing</italic> (2.06).</p>
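This comparison amounts to a ratio of relative frequencies between the two groups. A minimal sketch, run here on hypothetical toy counts rather than the real transcripts:

```python
from collections import Counter

def frequency_ratios(control_words, ad_words, min_count=10):
    """Relative frequency of each word in AD vs. Control speech.

    Returns {word: ratio} for words meeting min_count in both groups,
    where ratio > 1 means the word is relatively more frequent in AD.
    """
    c, a = Counter(control_words), Counter(ad_words)
    n_c, n_a = sum(c.values()), sum(a.values())
    return {w: (a[w] / n_a) / (c[w] / n_c)
            for w in c.keys() & a.keys()
            if c[w] >= min_count and a[w] >= min_count}

# Toy example (hypothetical counts, not the real data):
control = ["window"] * 40 + ["oh"] * 10 + ["the"] * 50
ad = ["window"] * 10 + ["oh"] * 40 + ["the"] * 50
ratios = frequency_ratios(control, ad)
print(sorted(ratios.items(), key=lambda kv: -kv[1]))
```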
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>The word cloud on the left highlights words that are more common among control subjects than AD; the word cloud on the right highlights words that are more common among AD than control.</p>
</caption>
<graphic xlink:href="fcomp-02-624488-g001.tif"/>
</fig>
<p>Compared to controls, subjects with AD used relatively more laughter and semantically &#x201c;empty&#x201d; words such as <italic>oh</italic>, <italic>well</italic>, and <italic>some</italic>, and fewer present participles (<italic>-ing</italic> verbs). This is consistent with findings in the literature. <xref ref-type="table" rid="T1">Table 1</xref> shows an interesting difference for filled pauses. The subjects with AD used more <italic>uh</italic> than the control subjects, but their use of <italic>um</italic> was much less frequent.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Subjects with AD say <italic>uh</italic> more often, and <italic>um</italic> less often.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="center">
<italic>uh</italic>
</th>
<th align="center">
<italic>um</italic>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Control (non-AD)</td>
<td align="center">130</td>
<td align="center">51</td>
</tr>
<tr>
<td align="left">Dementia (AD)</td>
<td align="center">183</td>
<td align="center">20</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s2-4">
<title>2.4 Unfilled Pauses</title>
<p>Pause durations were calculated from the forced alignment. Pauses under 50&#xa0;ms were excluded, as were pauses in the interviewer&#x2019;s speech. We binned the remaining pauses by duration as shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. Subjects with AD have more pauses in every bin, but the difference between subjects with AD and non-AD is particularly noticeable for longer pauses.</p>
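The binning step can be sketched as follows. Apart from the 50&#xa0;ms exclusion stated above, the bin edges and labels here are illustrative assumptions, not the exact bins of our analysis:

```python
def bin_pauses(durations_ms, edges=(50, 500, 1000, 2000)):
    """Count pauses per duration bin, dropping pauses under 50 ms."""
    labels = ["50-500ms", "0.5-1s", "1-2s", ">2s"]
    counts = dict.fromkeys(labels, 0)
    for d in durations_ms:
        if d < edges[0]:
            continue  # too short to count as a pause
        if d < edges[1]:
            counts["50-500ms"] += 1
        elif d < edges[2]:
            counts["0.5-1s"] += 1
        elif d < edges[3]:
            counts["1-2s"] += 1
        else:
            counts[">2s"] += 1
    return counts

print(bin_pauses([30, 120, 700, 1500, 2500, 4000]))
```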
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Subjects with AD have more pauses (in all duration bins).</p>
</caption>
<graphic xlink:href="fcomp-02-624488-g002.tif"/>
</fig>
</sec>
</sec>
<sec id="s3">
<title>3 BERT and ERNIE Fine-Tuning</title>
<sec id="s3-1">
<title>3.1 Input and Hyperparameters</title>
<p>Pre-trained BERT and ERNIE models were fine-tuned for the AD classification task. Each of the <inline-formula id="inf3">
<mml:math id="mml-math3-fcomp.2020.624488">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>108</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> training speakers is treated as a data point. The input to the model is the sequence of words from the speaker&#x2019;s processed transcript (as described in <xref ref-type="sec" rid="s2-2">Section 2.2</xref>). The output is the class of the speaker: 0 for Control and 1 for AD.</p>
<p>We also encoded pauses in the input word sequence. We grouped pauses into three bins: short (under 0.5&#xa0;s), medium (0.5&#x2013;2&#xa0;s), and long (over 2&#xa0;s). The three bins are coded using the three punctuation marks &#x201c;,&#x201d;, &#x201c;.&#x201d;, and &#x201c;&#x2026;&#x201d;, respectively. Because all punctuation had been removed from the processed transcripts, these inserted marks unambiguously represent pauses. The procedure is illustrated in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
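A minimal sketch of this encoding, assuming each word comes paired with the duration of the pause that follows it (the input format is our assumption; the thresholds and tokens are as described above):

```python
def encode_pauses(tokens):
    """Insert pause tokens into a word sequence.

    `tokens` is a list of (word, following_pause_seconds) pairs;
    pauses are encoded as "," (<0.5 s), "." (0.5-2 s), "..." (>2 s).
    """
    out = []
    for word, pause in tokens:
        out.append(word)
        if pause >= 2.0:
            out.append("...")
        elif pause >= 0.5:
            out.append(".")
        elif pause > 0.0:
            out.append(",")
    return " ".join(out)

print(encode_pauses([("the", 0.2), ("boy", 1.0), ("falls", 2.5), ("down", 0.0)]))
```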
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Procedure for pause encoding.</p>
</caption>
<graphic xlink:href="fcomp-02-624488-g003.tif"/>
</fig>
<p>We used Bert-for-Sequence-Classification<xref ref-type="fn" rid="FN2">
<sup>2</sup>
</xref> for fine-tuning. We tried both &#x201c;bert-base-uncased&#x201d; and &#x201c;bert-large-uncased&#x201d;, and found slightly better performance with the larger model. The following hyperparameters (slightly tuned) were chosen: learning rate &#x3d; 2e-5, batch size &#x3d; 4, epochs &#x3d; 8, max input length of 256 (sufficient to cover most cases). The standard default tokenizer was used (with an instruction not to split &#x201c;&#x2026;&#x201d;). Two special tokens, [CLS] and [SEP], were added to the beginning and the end of each input.</p>
<p>ERNIE fine-tuning started with the &#x201c;ERNIE-large&#x201d; pre-trained model (24 layers with 16 attention heads per layer). We used the default tokenizer, and the following hyperparameters: learning rate &#x3d; 2e-5, batch size &#x3d; 8, epochs &#x3d; 20 and max input length of 256.</p>
<p>The fine-tuning process is illustrated in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Procedure for fine-tuning.</p>
</caption>
<graphic xlink:href="fcomp-02-624488-g004.tif"/>
</fig>
</sec>
<sec id="s3-2">
<title>3.2 Ensemble Reduces Variance in LOO Accuracy</title>
<p>When conducting LOO (<underline>l</underline>eave-<underline>o</underline>ne-<underline>o</underline>ut) cross-validation on the training set, large differences in accuracy across runs were observed. We computed 50 runs of LOO cross-validation. The hyperparameter setting was the same across runs except for random seeds. The results are shown in the last row (<inline-formula id="inf4">
<mml:math id="mml-math4-fcomp.2020.624488">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) of <xref ref-type="table" rid="T2">Tables 2</xref> and <xref ref-type="table" rid="T3">3</xref>. Over the 50 runs, LOO accuracy ranged from 0.75 to 0.86 for BERT with three pauses, from 0.78 to 0.87 for ERNIE with three pauses, and from 0.77 to 0.85 for ERNIE with no pauses. The large variance suggests that performance on unseen data is likely to be brittle. Such brittleness is to be expected given the large size of the BERT and ERNIE models and the small size of the training set (108 subjects).</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Ensemble improves LOO (leave-one-out) estimates of accuracy: better means, with less variance.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th colspan="2" align="center">BERT with three pauses</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">N</td>
<td align="center">Mean <inline-formula id="inf5">
<mml:math id="mml-math5-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> sd</td>
<td align="center">min - max</td>
</tr>
<tr>
<td align="left">5</td>
<td align="char" char=".">0.837 <inline-formula id="inf6">
<mml:math id="mml-math6-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> 0.010</td>
<td align="char" char=".">0.815&#x2013;0.861</td>
</tr>
<tr>
<td align="left">15</td>
<td align="char" char=".">0.840 <inline-formula id="inf7">
<mml:math id="mml-math7-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> 0.011</td>
<td align="char" char=".">0.815&#x2013;0.861</td>
</tr>
<tr>
<td align="left">25</td>
<td align="char" char=".">0.839 <inline-formula id="inf8">
<mml:math id="mml-math8-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> 0.011</td>
<td align="char" char=".">0.815&#x2013;0.870</td>
</tr>
<tr>
<td align="left">
<bold>35</bold>
</td>
<td align="char" char=".">
<bold>0.838</bold>
<inline-formula id="inf9">
<mml:math id="mml-math9-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula>
<bold>0.010</bold>
</td>
<td align="char" char=".">
<bold>0.824</bold>&#x2013;<bold>0.861</bold>
</td>
</tr>
<tr>
<td align="left">45</td>
<td align="char" char=".">0.839 <inline-formula id="inf10">
<mml:math id="mml-math10-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> 0.011</td>
<td align="char" char=".">0.824&#x2013;0.861</td>
</tr>
<tr>
<td align="left">
<bold>1</bold>
</td>
<td align="char" char=".">
<bold>0.819</bold>
<inline-formula id="inf11">
<mml:math id="mml-math11-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula>
<bold>0.023</bold>
</td>
<td align="char" char=".">
<bold>0.750</bold>&#x2013;<bold>0.861</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Ensemble also improves LOO for ERNIE (with and without pauses). LOO results are better with pauses than without, and better with ERNIE than BERT.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th colspan="2" align="center">ERNIE with three pauses</th>
<th colspan="2" align="center">ERNIE with No pauses</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">N</td>
<td align="center">Mean <inline-formula id="inf12">
<mml:math id="mml-math12-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> std</td>
<td align="center">Min - max</td>
<td align="center">Mean <inline-formula id="inf13">
<mml:math id="mml-math13-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> std</td>
<td align="center">Min - max</td>
</tr>
<tr>
<td align="left">5</td>
<td align="char" char=".">0.845 <inline-formula id="inf14">
<mml:math id="mml-math14-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> 0.013</td>
<td align="char" char=".">0.806&#x2013;0.880</td>
<td align="char" char=".">0.828 <inline-formula id="inf15">
<mml:math id="mml-math15-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> 0.016</td>
<td align="char" char=".">0.796&#x2013;0.870</td>
</tr>
<tr>
<td align="left">15</td>
<td align="char" char=".">0.851 <bold>&#xb1;</bold> 0.008</td>
<td align="char" char=".">0.833&#x2013;0.870</td>
<td align="char" char=".">0.831 <inline-formula id="inf16">
<mml:math id="mml-math16-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> 0.012</td>
<td align="char" char=".">0.796&#x2013;0.861</td>
</tr>
<tr>
<td align="left">25</td>
<td align="char" char=".">0.853 <bold>&#xb1;</bold> 0.007</td>
<td align="char" char=".">0.833&#x2013;0.870</td>
<td align="char" char=".">0.833 <inline-formula id="inf17">
<mml:math id="mml-math17-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> 0.010</td>
<td align="char" char=".">0.815&#x2013;0.861</td>
</tr>
<tr>
<td align="left">
<bold>35</bold>
</td>
<td align="char" char=".">
<bold>0.854 &#xb1; 0.007</bold>
</td>
<td align="char" char=".">
<bold>0.824</bold>&#x2013;<bold>0.861</bold>
</td>
<td align="char" char=".">
<bold>0.836</bold>
<inline-formula id="inf18">
<mml:math id="mml-math18-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula>
<bold>0.009</bold>
</td>
<td align="char" char=".">
<bold>0.815</bold>&#x2013;<bold>0.852</bold>
</td>
</tr>
<tr>
<td align="left">45</td>
<td align="char" char=".">0.854 <bold>&#xb1;</bold> 0.007</td>
<td align="char" char=".">0.833&#x2013;0.861</td>
<td align="char" char=".">0.834 <inline-formula id="inf19">
<mml:math id="mml-math19-fcomp.2020.624488">
<mml:mo>&#xb1;</mml:mo>
</mml:math>
</inline-formula> 0.008</td>
<td align="char" char=".">0.815&#x2013;0.861</td>
</tr>
<tr>
<td align="left">
<bold>1</bold>
</td>
<td align="char" char=".">
<bold>0.827 &#xb1; 0.020</bold>
</td>
<td align="char" char=".">
<bold>0.778</bold>&#x2013;<bold>0.870</bold>
</td>
<td align="char" char=".">
<bold>0.816 &#xb1; 0.023</bold>
</td>
<td align="char" char=".">
<bold>0.769</bold>&#x2013;<bold>0.852</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To address this brittleness, we introduced the following ensemble procedure. From the results of LOO cross-validation, we calculated the majority vote over <italic>N</italic> runs for each of the 108 subjects, and returned that vote as the single label for each subject. To make sure that the ensemble estimates would generalize to unseen data, we tested the method by selecting <inline-formula id="inf20">
<mml:math id="mml-math20-fcomp.2020.624488">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf21">
<mml:math id="mml-math21-fcomp.2020.624488">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>15</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, &#x2026;, runs from the 50 runs of LOO cross-validation. The results are shown in <xref ref-type="table" rid="T2">Tables 2</xref> and <xref ref-type="table" rid="T3">3</xref>. In the tables, the first row summarizes 100 draws of <inline-formula id="inf22">
<mml:math id="mml-math22-fcomp.2020.624488">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> runs. The second row is similar, except <inline-formula id="inf23">
<mml:math id="mml-math23-fcomp.2020.624488">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>15</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. All of the ensemble rows have better means and less variance than the last row, which summarizes the 50 individual runs of LOO cross-validation without ensemble (<inline-formula id="inf24">
<mml:math id="mml-math24-fcomp.2020.624488">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>). <xref ref-type="fig" rid="F5">Figure 5</xref> illustrates <xref ref-type="table" rid="T2">Tables 2</xref> and <xref ref-type="table" rid="T3">3</xref>. In <xref ref-type="fig" rid="F5">Figure 5</xref>, the black lines represent the accuracy of individual runs, whereas the purple lines represent ensemble accuracy with <inline-formula id="inf25">
<mml:math id="mml-math25-fcomp.2020.624488">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>35</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. The individual runs (black) vary widely, whereas the proposed ensemble method (purple) improves the mean and reduces the variance relative to estimates based on a single run.</p>
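The ensemble procedure above can be sketched in a few lines of Python. The subject count, per-run error rate, and labels in the toy data below are hypothetical, chosen only to illustrate the per-subject majority vote and the repeated random draws of <italic>N</italic> runs:

```python
import random
from collections import Counter

def majority_vote(labels):
    """Return the most frequent predicted label for one subject."""
    return Counter(labels).most_common(1)[0][0]

def ensemble_accuracy(run_predictions, true_labels, n, rng):
    """Draw n runs at random and score the per-subject majority vote.

    run_predictions: list of runs, each a list of predicted labels
    (one per subject), e.g. from 50 LOO cross-validations.
    """
    chosen = rng.sample(run_predictions, n)
    votes = [majority_vote([run[i] for run in chosen])
             for i in range(len(true_labels))]
    correct = sum(v == t for v, t in zip(votes, true_labels))
    return correct / len(true_labels)

# Toy illustration: 6 subjects, 50 runs, each run ~80% accurate
# (all numbers hypothetical, not the paper's data).
rng = random.Random(0)
truth = [1, 0, 1, 1, 0, 0]
runs = [[lab if rng.random() < 0.8 else 1 - lab for lab in truth]
        for _ in range(50)]
# 100 draws of N = 5 runs, as in the first row of the tables:
accs = [ensemble_accuracy(runs, truth, n=5, rng=rng) for _ in range(100)]
print(sum(accs) / len(accs))
```

Choosing an odd <italic>N</italic> avoids ties in the binary majority vote.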
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Individual and ensemble Leave-one-out (LOO) accuracy for BERT with pauses (top) and ERNIE with and without pauses (bottom). Black lines represent accuracy of individual runs; purple lines represent ensemble accuracy of <inline-formula id="inf26">
<mml:math id="mml-math26-fcomp.2020.624488">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>35</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</caption>
<graphic xlink:href="fcomp-02-624488-g005.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<title>4 Evaluation</title>
<p>Under the rules of the challenge, each team was allowed to submit results from five attempts for evaluation. Predictions on the test set from the following five models were submitted for evaluation: BERT0p, BERT3p, BERT6p, ERNIE0p, and ERNIE3p. Here 0p indicates that no pauses were encoded, and 3p and 6p indicate, respectively, that three and six lengths of pauses were encoded. For comparison with the three-pause scheme, 6p uses six bins of pauses, encoded as &#x201c;,&#x201d; (under 0.5&#xa0;s), &#x201c;.&#x201d; (0.5&#x2013;1&#xa0;s), &#x201c;. .&#x201d; (1&#x2013;2&#xa0;s), &#x201c;. . .&#x201d; (2&#x2013;3&#xa0;s), &#x201c;. . . .&#x201d; (3&#x2013;4&#xa0;s), and &#x201c;. . . . .&#x201d; (over 4&#xa0;s). The dots are separated from one another and treated as distinct tokens.</p>
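To make the 6p encoding concrete, here is a minimal Python sketch. The function names, the example words, and the choice to place exact boundary values (0.5, 1, 2, 3, 4&#xa0;s) in the longer bin are our illustrative assumptions, not specified by the paper:

```python
def encode_pause_6p(duration_s):
    """Map a silent-pause duration (seconds) to the 6p punctuation tokens.

    Bins follow the 6p scheme; assigning exact boundary values to the
    longer bin is our assumption. Dots are returned as separate tokens.
    """
    if duration_s < 0.5:
        return [","]
    for n_dots, upper in ((1, 1.0), (2, 2.0), (3, 3.0), (4, 4.0)):
        if duration_s < upper:
            return ["."] * n_dots
    return ["."] * 5  # over 4 s

def insert_pauses(words, pauses):
    """Interleave encoded pauses; pauses[i] follows words[i], None = no pause."""
    out = []
    for word, pause in zip(words, pauses):
        out.append(word)
        if pause is not None:
            out.extend(encode_pause_6p(pause))
    return out

# Hypothetical transcript fragment with a 2.4-s pause after "boy":
print(insert_pauses(["the", "boy", "is", "falling"], [None, 2.4, None, None]))
# → ['the', 'boy', '.', '.', '.', 'is', 'falling']
```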
<p>Following the method proposed in <xref ref-type="sec" rid="s3">Section 3.2</xref>, we performed 35 training runs for each of the five models, using 35 random seeds. The classification of each sample in the test set was based on the majority vote of the 35 predictions. <xref ref-type="table" rid="T4">Table 4</xref> lists the evaluation scores received from the organizers.</p>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Evaluation results: Best accuracy (acc) with ERNIE and three pauses (3p). Pauses are helpful: three pauses (3p) and six pauses (6p) have better accuracy than no pauses (0p).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th colspan="2" align="center">Precision</th>
<th colspan="2" align="center">Recall</th>
<th colspan="2" align="center">F1</th>
<th align="center">Acc</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left"/>
<td align="center">Non-AD</td>
<td align="center">AD</td>
<td align="center">Non-AD</td>
<td align="center">AD</td>
<td align="center">Non-AD</td>
<td align="center">AD</td>
<td align="left"/>
</tr>
<tr>
<td align="left">Baseline (<xref ref-type="bibr" rid="B25">Luz et al., 2020</xref>)</td>
<td align="char" char=".">0.700</td>
<td align="char" char=".">0.830</td>
<td align="char" char=".">0.870</td>
<td align="char" char=".">0.620</td>
<td align="char" char=".">0.780</td>
<td align="char" char=".">0.710</td>
<td align="char" char=".">0.750</td>
</tr>
<tr>
<td align="left">BERT0p</td>
<td align="char" char=".">0.742</td>
<td align="char" char=".">0.941</td>
<td align="char" char=".">0.958</td>
<td align="char" char=".">0.667</td>
<td align="char" char=".">0.836</td>
<td align="char" char=".">0.781</td>
<td align="char" char=".">0.813</td>
</tr>
<tr>
<td align="left">BERT3p</td>
<td align="char" char=".">0.793</td>
<td align="char" char=".">0.947</td>
<td align="char" char=".">0.958</td>
<td align="char" char=".">0.750</td>
<td align="char" char=".">0.868</td>
<td align="char" char=".">0.837</td>
<td align="char" char=".">0.854</td>
</tr>
<tr>
<td align="left">BERT6p</td>
<td align="char" char=".">0.793</td>
<td align="char" char=".">0.947</td>
<td align="char" char=".">0.958</td>
<td align="char" char=".">0.750</td>
<td align="char" char=".">0.868</td>
<td align="char" char=".">0.837</td>
<td align="char" char=".">0.854</td>
</tr>
<tr>
<td align="left">ERNIE0p</td>
<td align="char" char=".">0.793</td>
<td align="char" char=".">0.947</td>
<td align="char" char=".">0.958</td>
<td align="char" char=".">0.750</td>
<td align="char" char=".">0.868</td>
<td align="char" char=".">0.837</td>
<td align="char" char=".">0.854</td>
</tr>
<tr>
<td align="left">ERNIE3p</td>
<td align="char" char=".">0.852</td>
<td align="char" char=".">0.952</td>
<td align="char" char=".">0.958</td>
<td align="char" char=".">0.833</td>
<td align="char" char=".">0.902</td>
<td align="char" char=".">0.889</td>
<td align="char" char=".">
<bold>0.896</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
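The per-class scores in Table 4 follow the standard precision, recall, and F1 definitions. A minimal sketch of how such scores are computed, using hypothetical labels rather than the actual challenge test set:

```python
def classwise_scores(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels, not the challenge test set:
y_true = ["AD"] * 4 + ["non-AD"] * 4
y_pred = ["AD", "AD", "AD", "non-AD", "non-AD", "non-AD", "non-AD", "AD"]
for cls in ("non-AD", "AD"):
    print(cls, classwise_scores(y_true, y_pred, cls))
print("acc", accuracy(y_true, y_pred))
```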
<p>The best accuracy was 89.6%, obtained with ERNIE and three pauses. This is a nearly 15-percentage-point increase over the baseline of 75.0% (<xref ref-type="bibr" rid="B25">Luz et al., 2020</xref>).</p>
<p>ERNIE outperformed BERT by about 4 percentage points, both with three pauses and with no pauses in the input. Encoding pauses improved the accuracy for both BERT and ERNIE. There was no difference between three and six pauses in terms of the improvement in accuracy.</p>
</sec>
<sec id="s5">
<title>5 Discussion</title>
<p>The group with AD used more <italic>uh</italic> but less <italic>um</italic> than the control group. In speech production, disfluencies such as hesitations and speech errors are correlated with cognitive functions such as cognitive load, arousal, and working memory (<xref ref-type="bibr" rid="B8">Daneman, 1991</xref>; <xref ref-type="bibr" rid="B1">Arciuli et al., 2010</xref>). Hesitations and disfluencies increase with increased cognitive load and arousal, as well as with impaired working memory. This may explain why the group with AD used more <italic>uh</italic>, a filled pause and hesitation marker. More interestingly, they used less <italic>um</italic> than the control group, which indicates that, unlike <italic>uh</italic>, <italic>um</italic> is more than a hesitation marker. Previous studies have also reported that children with autism spectrum disorder produced <italic>um</italic> less frequently than typically developing children (<xref ref-type="bibr" rid="B18">Gorman et al., 2016</xref>; <xref ref-type="bibr" rid="B22">Irvine et al., 2016</xref>), and that <italic>um</italic> was used less frequently during lying than during truth-telling (<xref ref-type="bibr" rid="B2">Benus et al., 2006</xref>; <xref ref-type="bibr" rid="B1">Arciuli et al., 2010</xref>). Together, these results suggest that <italic>um</italic> carries a lexical status and is retrieved in speech production. One possibility is that people with AD or autism have difficulty retrieving the word <italic>um</italic>, whereas people who are lying try to avoid it. More research is needed to test this hypothesis.</p>
<p>From our results, encoding pauses in the input was helpful in both BERT and ERNIE fine-tuning for the task of AD classification. Pauses are ubiquitous in spoken language, and they are distributed differently in fluent, normally disfluent, and abnormally disfluent speech. As we can see from <xref ref-type="fig" rid="F2">Figure 2</xref>, the group with AD used more pauses, and especially more long pauses, than the control group. With pauses present in the text, the self-attention mechanism in BERT and ERNIE may learn how the pauses are correlated with other words, for example, whether there is a long pause between the determiner <italic>the</italic> and the following noun, which occurs more frequently in AD speech. We think this is part of the reason why encoding pauses improved the accuracy. There was no difference between three pauses and six pauses in terms of the improvement in accuracy. More studies are needed to investigate the categories of pause length and to determine the optimal number of pauses to encode for AD classification.</p>
<p>ERNIE was designed to learn language representation enhanced by knowledge masking strategies, including entity-level masking and phrase-level masking. Through these strategies, ERNIE &#x201c;implicitly learned the information about knowledge and longer semantic dependency, such as the relationship between entities, the property of a entity and the type of a event&#x201d; (<xref ref-type="bibr" rid="B36">Sun et al., 2019</xref>). We think this may be why ERNIE performed better at recognizing Alzheimer&#x2019;s speech, in which memory loss causes not only language problems but also difficulty recognizing entities and events.</p>
<p>Both BERT and ERNIE were pre-trained on text corpora, with no pause information. Our study suggests that it may be useful to pre-train a language model using speech transcripts (either solely or combined with text corpora) that include pause information.</p>
</sec>
<sec id="s6">
<title>6 Conclusion</title>
<p>Accuracy of 89.6% was achieved on the test set of the ADReSS (<underline>A</underline>lzheimer&#x2019;s <underline>D</underline>ementia <underline>Re</underline>cognition through <underline>S</underline>pontaneous <underline>S</underline>peech) Challenge, with ERNIE fine-tuning plus an encoding of pauses. BERT and ERNIE fine-tuning on a small training set shows high variance across runs; our proposed ensemble method improves the accuracy and reduces this variance. Pauses are useful in BERT and ERNIE fine-tuning for AD classification. Finally, <italic>um</italic> was used much less frequently by the group with AD, suggesting that it may have a lexical status.</p>
</sec>
<sec id="s7">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/supplementary Material, further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s8">
<title>Author Contributions</title>
<p>JY: principal investigator and corresponding author; XC: ran the ERNIE experiments; YB: helped run the BERT experiments; ZY: consultation on Alzheimer&#x27;s disease, paper editing, and proofreading; KC: visualization of LOO experiment results, paper editing, and proofreading.</p>
</sec>
<sec id="s9" sec-type="COI-statement">
<title>Conflict of Interest</title>
<p>Authors JY, XC, YB, and KC were employed by the company Baidu USA Inc.</p>
<p>The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<ack>
<p>We thank Julia Li and Hao Tian for their suggestions and help with ERNIE. This paper is an extended version of our paper presented at Interspeech 2020 (<xref ref-type="bibr" rid="B41">Yuan et al., 2020</xref>).</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Arciuli</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Mallard</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Villar</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>&#x201c;Um, I can tell you&#x2019;re lying&#x201d;: linguistic markers of deception versus truth-telling in speech</article-title>. <source>Appl. Psycholinguist.</source> <volume>31</volume>, <fpage>397</fpage>&#x2013;<lpage>411</lpage>. <pub-id pub-id-type="doi">10.1017/S0142716410000044</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benus</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Enos</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Hirschberg</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Shriberg</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2006</year>). &#x201c;<article-title>Pauses in deceptive speech</article-title>,&#x201d; in <conf-name>Speech prosody 2006</conf-name>, <conf-loc>Dresden, Germany</conf-loc>, <conf-date>May 2&#x2013;5, 2006</conf-date>. </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brown</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Miron</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>1971</year>). <article-title>Lexical and syntactic predictors of the distribution of pause time in reading</article-title>. <source>J. Verb. Learn. Verb. Behav.</source> <volume>10</volume>, <fpage>658</fpage>&#x2013;<lpage>667</lpage>. <pub-id pub-id-type="doi">10.1016/S0022-5371(71)80072-5</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Butcher</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>1981</year>). <source>Aspects of the speech pause: phonetic correlates and communicative functions</source>. <publisher-loc>Kiel, Germany</publisher-loc>: <publisher-name>Institut fur Phonetik der Universitat Kiel</publisher-name>.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Clark</surname>
<given-names>H. H.</given-names>
</name>
<name>
<surname>Fox Tree</surname>
<given-names>J. E.</given-names>
</name>
</person-group> (<year>2002</year>). <article-title>Using uh and um in spontaneous speaking</article-title>. <source>Cognition</source> <volume>84</volume>, <fpage>73</fpage>&#x2013;<lpage>111</lpage>. <pub-id pub-id-type="doi">10.1016/s0010-0277(02)00017-3</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Clark</surname>
<given-names>H. H.</given-names>
</name>
</person-group> (<year>2006</year>). <source>Pauses and hesitations: psycholinguistic approach</source>. <publisher-name>Encyclopedia of Language &#x26; Linguistics</publisher-name>, <fpage>244</fpage>&#x2013;<lpage>248</lpage>. </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Corley</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Stewart</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Hesitation disfluencies in spontaneous speech: the meaning of um</article-title>. <source>Language and Linguistics Compass</source> <volume>2</volume>, <fpage>589</fpage>&#x2013;<lpage>602</lpage>. <pub-id pub-id-type="doi">10.1111/j.1749-818X.2008.00068.x</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Daneman</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>1991</year>). <article-title>Working memory as a predictor of verbal fluency</article-title>. <source>J. Psycholinguist. Res.</source> <volume>20</volume>, <fpage>445</fpage>&#x2013;<lpage>464</lpage>. <pub-id pub-id-type="doi">10.1007/BF01067637</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>de Ipi&#xf1;a</surname>
<given-names>K. L.</given-names>
</name>
<name>
<surname>de Lizarduy</surname>
<given-names>U. M.</given-names>
</name>
<name>
<surname>Calvo</surname>
<given-names>P. M.</given-names>
</name>
<name>
<surname>Beitia</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Garcia-Melero</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ecay-Torres</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). &#x201c;<article-title>Analysis of disfluencies for automatic detection of mild cognitive impartment: a deep learning approach</article-title>,&#x201d; in <conf-name>International Conference and Workshop on Bioinspired Intelligence (IWOBI)</conf-name>, <volume>2017</volume>, <fpage>1</fpage>&#x2013;<lpage>4</lpage>. </citation>
</ref>
<ref id="B10">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Devlin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>M.-W.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Toutanova</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Bert: pre-training of deep bidirectional transformers for language understanding</article-title>. <comment>arXiv preprint. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1810.04805">https://arxiv.org/abs/1810.04805</ext-link>
</comment> (<comment>Accessed</comment> October 11, 2018). </citation>
</ref>
<ref id="B11">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Dodge</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ilharco</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Schwartz</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Farhadi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hajishirzi</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Fine-tuning pretrained language models: weight initializations, data orders, and early stopping</article-title>. <comment>arXiv preprint. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2002.06305">https://arxiv.org/abs/2002.06305</ext-link>
</comment> (<comment>Accessed</comment> February 15, 2020). </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ferreira</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>1991</year>). <article-title>Effects of length and syntactic complexity on initiation times for prepared utterances</article-title>. <source>J. Mem. Lang.</source> <volume>30</volume>, <fpage>210</fpage>&#x2013;<lpage>233</lpage>. </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Filiou</surname>
<given-names>R.-P.</given-names>
</name>
<name>
<surname>Bier</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Slegers</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Houz&#xe9;</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Belchior</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Brambati</surname>
<given-names>S. M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Connected speech assessment in the early detection of Alzheimer&#x2019;s disease and mild cognitive impairment: a scoping review</article-title>. <source>Aphasiology</source> <volume>34</volume>, <fpage>1</fpage>&#x2013;<lpage>33</lpage>. <pub-id pub-id-type="doi">10.1080/02687038.2019.1608502</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fraser</surname>
<given-names>K. C.</given-names>
</name>
<name>
<surname>Meltzer</surname>
<given-names>J. A.</given-names>
</name>
<name>
<surname>Rudzicz</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Linguistic features identify Alzheimer&#x27;s disease in narrative speech</article-title>. <source>J Alzheimers Dis.</source> <volume>49</volume> (<issue>2</issue>), <fpage>407</fpage>&#x2013;<lpage>422</lpage>. <pub-id pub-id-type="doi">10.3233/JAD-150520</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Fritsch</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wankerl</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>N&#xf6;th</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Automatic diagnosis of Alzheimer&#x2019;s disease using neural network language models</article-title>,&#x201d; in <conf-name>ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing</conf-name>, <conf-loc>Brighton, United Kingdom</conf-loc>, <conf-date>May 12, 2020</conf-date> (<publisher-name>ICASSP IEEE</publisher-name>), <fpage>5841</fpage>&#x2013;<lpage>5845</lpage>. </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goldman-Eisler</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>1961</year>). <article-title>The distribution of pause durations in speech</article-title>. <source>Lang. Speech</source> <volume>4</volume>, <fpage>232</fpage>&#x2013;<lpage>237</lpage>. <pub-id pub-id-type="doi">10.1177/002383096100400405</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Goodglass</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Kaplan</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Barresi</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2001</year>). <source>Boston diagnostic Aphasia examination</source>. <edition>3rd Edition</edition>. <publisher-loc>Philadelphia</publisher-loc>: <publisher-name>Lippincott Williams &#x26; Wilkins</publisher-name>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gorman</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Olson</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Hill</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lunsford</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Heeman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>van Santen</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Uh and um in children with autism spectrum disorders or language impairment</article-title>. <source>Autism Res.</source> <volume>9</volume>, <fpage>854</fpage>&#x2013;<lpage>865</lpage>. <pub-id pub-id-type="doi">10.1002/aur.1578</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gosztolya</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Vincze</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Toth</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Pakaski</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kalman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hoffmann</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Identifying mild cognitive impairment and mild Alzheimer&#x2019;s disease based on spontaneous speech using ASR and linguistic features</article-title>. <source>Comput. Speech Lang.</source> <volume>53</volume>, <fpage>181</fpage>&#x2013;<lpage>197</lpage>. <pub-id pub-id-type="doi">10.1016/j.csl.2018.07.007</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Grosjean</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Grosjean</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Lane</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>1979</year>). <article-title>The patterns of silence: performance structures in sentence production</article-title>. <source>Cognit. Psychol.</source> <volume>11</volume>, <fpage>58</fpage>&#x2013;<lpage>81</lpage>. <pub-id pub-id-type="doi">10.1016/0010-0285(79)90004-5</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hawthorne</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Gerken</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>From pauses to clauses: prosody facilitates learning of syntactic constituency</article-title>. <source>Cognition</source> <volume>133</volume>, <fpage>420</fpage>&#x2013;<lpage>428</lpage>. <pub-id pub-id-type="doi">10.1016/j.cognition.2014.07.013</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Irvine</surname>
<given-names>C. A.</given-names>
</name>
<name>
<surname>Eigsti</surname>
<given-names>I. M.</given-names>
</name>
<name>
<surname>Fein</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Uh, um, and autism: filler disfluencies as pragmatic markers in adolescents with optimal outcomes from autism spectrum disorder</article-title>. <source>J. Autism Dev. Disord.</source> <volume>46</volume>, <fpage>1061</fpage>&#x2013;<lpage>1070</lpage>. <pub-id pub-id-type="doi">10.1007/s10803-015-2651-y</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krivokapic</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>Prosodic planning: effects of phrasal length and complexity on pause duration</article-title>. <source>J. Phonetics</source> <volume>35</volume>, <fpage>162</fpage>&#x2013;<lpage>179</lpage>. <pub-id pub-id-type="doi">10.1016/j.wocn.2006.04.001</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Laske</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Sohrabi</surname>
<given-names>H. R.</given-names>
</name>
<name>
<surname>Frost</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>L&#xf3;pez-de-Ipi&#xf1;a</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Garrard</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Buscema</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>Innovative diagnostic tools for early detection of Alzheimer&#x27;s disease</article-title>. <source>Alzheimers Dement</source> <volume>11</volume>, <fpage>561</fpage>&#x2013;<lpage>578</lpage>. <pub-id pub-id-type="doi">10.1016/j.jalz.2014.06.004</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luz</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Haider</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>de la Fuente</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Fromm</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>MacWhinney</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Alzheimer&#x2019;s dementia recognition through spontaneous speech: the ADReSS Challenge</article-title>,&#x201d; in <conf-name>Proceedings of INTERSPEECH 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 25&#x2013;29, 2020</conf-date>. </citation>
</ref>
<ref id="B26">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>MacWhinney</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2000</year>). <source>The CHILDES project: tools for analyzing talk</source>. <edition>3rd Edition</edition>. <publisher-loc>Mahwah, NJ</publisher-loc>: <publisher-name>Lawrence Erlbaum Associates</publisher-name>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mattson</surname>
<given-names>M. P.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>Pathways towards and away from Alzheimer&#x27;s disease</article-title>. <source>Nature</source> <volume>430</volume>, <fpage>631</fpage>&#x2013;<lpage>639</lpage>. <pub-id pub-id-type="doi">10.1038/nature02621</pub-id> </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mueller</surname>
<given-names>K. D.</given-names>
</name>
<name>
<surname>Koscik</surname>
<given-names>R. L.</given-names>
</name>
<name>
<surname>Hermann</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>S. C.</given-names>
</name>
<name>
<surname>Turkstra</surname>
<given-names>L. S.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Declines in connected language are associated with very early mild cognitive impairment: results from the Wisconsin Registry for Alzheimer&#x27;s Prevention</article-title>. <source>Front. Aging Neurosci.</source> <volume>9</volume>, <fpage>437</fpage>. <pub-id pub-id-type="doi">10.3389/fnagi.2017.00437</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Palo</surname>
<given-names>F. D.</given-names>
</name>
<name>
<surname>Parde</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Enriching neural models with targeted features for dementia detection</article-title>,&#x201d; in <conf-name>Proceedings of the 57th annual Meeting of the Association for computational linguistics (ACL)</conf-name>, <conf-loc>Florence, Italy</conf-loc>, <conf-date>July 2019</conf-date>. </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pulido</surname>
<given-names>M. L. B.</given-names>
</name>
<name>
<surname>Hern&#xe1;ndez</surname>
<given-names>J. B. A.</given-names>
</name>
<name>
<surname>Ballester</surname>
<given-names>M. A. F.</given-names>
</name>
<name>
<surname>Gonz&#xe1;lez</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Mekyska</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Sm&#xe9;kal</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Alzheimer&#x2019;s disease and automatic speech analysis: a review</article-title>. <source>Expert Syst. Appl.</source> <volume>150</volume>, <fpage>113213</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2020.113213</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ramanarayanan</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Goldstein</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Byrd</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Narayanan</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>An investigation of articulatory setting using real-time magnetic resonance imaging</article-title>. <source>J. Acoust. Soc. Am.</source> <volume>134</volume>, <fpage>510</fpage>&#x2013;<lpage>519</lpage>. <pub-id pub-id-type="doi">10.1121/1.4807639</pub-id> </citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ramig</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Countryman</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Thompson</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Horii</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>1995</year>). <article-title>Comparison of two forms of intensive speech treatment for Parkinson disease</article-title>. <source>J. Speech Hear. Res.</source> <volume>38</volume>, <fpage>1232</fpage>&#x2013;<lpage>1251</lpage>. <pub-id pub-id-type="doi">10.1044/jshr.3806.1232</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rochester</surname>
<given-names>S. R.</given-names>
</name>
</person-group> (<year>1973</year>). <article-title>The significance of pauses in spontaneous speech</article-title>. <source>J. Psycholinguist. Res.</source> <volume>2</volume>, <fpage>51</fpage>&#x2013;<lpage>81</lpage>. <pub-id pub-id-type="doi">10.1007/BF01067111</pub-id> </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schepman</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Rodway</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2000</year>). <article-title>Prosody and parsing in coordination structures</article-title>. <source>Q. J. Exp. Psychol.</source> <volume>53</volume>, <fpage>377</fpage>&#x2013;<lpage>396</lpage>. <pub-id pub-id-type="doi">10.1080/713755895</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shea</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Leonard</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Evaluating measures of pausing for second language fluency research</article-title>. <source>Can. Mod. Lang. Rev.</source> <volume>75</volume>, <fpage>1</fpage>&#x2013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.3138/cmlr.2018-0258</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Tian</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Ernie 2.0: a continual pre-training framework for language understanding</article-title>. <comment>arXiv preprint. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1907.12412">https://arxiv.org/abs/1907.12412</ext-link>
</comment> (<comment>Accessed</comment> July 29, 2019). </citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tottie</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Uh and um as sociolinguistic markers in British English</article-title>. <source>Int. J. Corpus Linguist.</source> <volume>16</volume>, <fpage>173</fpage>&#x2013;<lpage>197</lpage>. <pub-id pub-id-type="doi">10.1075/ijcl.16.2.02tot</pub-id> </citation>
</ref>
<ref id="B38">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Tran</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Toshniwal</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Bansal</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gimpel</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Livescu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Ostendorf</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Parsing speech: a neural approach to integrating lexical and acoustic-prosodic information</article-title>. <comment>arXiv preprint. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1704.07287">https://arxiv.org/abs/1704.07287</ext-link>
</comment> (<comment>Accessed</comment> April 24, 2017). </citation>
</ref>
<ref id="B39">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Vaswani</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shazeer</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Parmar</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Uszkoreit</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Gomez</surname>
<given-names>A. N.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). &#x201c;<article-title>Attention is all you need</article-title>,&#x201d; in <source>Advances in neural information processing systems</source>, <fpage>5998</fpage>&#x2013;<lpage>6008</lpage>. </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wieling</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Grieve</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bouma</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Fruehwald</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Coleman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liberman</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Variation and change in the use of hesitation markers in Germanic languages</article-title>. <source>Lang. Dynam. Change</source> <volume>6</volume>, <fpage>199</fpage>&#x2013;<lpage>234</lpage>. <pub-id pub-id-type="doi">10.1163/22105832-00602001</pub-id> </citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yuan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bian</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Disfluencies and fine-tuning pre-trained language models for detection of Alzheimer&#x2019;s disease</article-title>,&#x201d; in <conf-name>Proceedings of INTERSPEECH 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 25&#x2013;29, 2020</conf-date>. </citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yuan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liberman</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Speaker identification on the SCOTUS corpus</article-title>. <source>J. Acoust. Soc. Am.</source> <volume>123</volume>, <fpage>3878</fpage>. <pub-id pub-id-type="doi">10.1121/1.2935783</pub-id> </citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yuan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Lai</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Liberman</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Pauses and pause fillers in Mandarin monologue speech: the effects of sex and proficiency</article-title>. <source>Proc. Speech Prosody</source> <volume>2016</volume>, <fpage>1167</fpage>&#x2013;<lpage>1170</lpage>. <pub-id pub-id-type="doi">10.21437/SpeechProsody.2016-240</pub-id> </citation>
</ref>
<ref id="B44">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zellner</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>1994</year>). &#x201c;<article-title>Pauses and the temporal structure of speech</article-title>,&#x201d; in <source>Fundamentals of speech synthesis and speech recognition</source>. Editor <person-group person-group-type="editor">
<name>
<surname>Keller</surname>
<given-names>E.</given-names>
</name>
</person-group> (<publisher-loc>Chichester</publisher-loc>: <publisher-name>John Wiley</publisher-name>), <fpage>41</fpage>&#x2013;<lpage>62</lpage>. </citation>
</ref>
</ref-list>
<fn-group>
<fn id="FN1">
<label>
<sup>1</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://gluebenchmark.com">https://gluebenchmark.com</ext-link>
</p>
</fn>
<fn id="FN2">
<label>
<sup>2</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://github.com/huggingface/transformers">https://github.com/huggingface/transformers</ext-link>
</p>
</fn>
</fn-group>
</back>
</article>
