# COGNITIVE HEARING MECHANISMS OF LANGUAGE UNDERSTANDING: SHORT- AND LONG-TERM PERSPECTIVES

EDITED BY: Rachel J. Ellis, Patrik Sörqvist, Adriana A. Zekveld and Jerker Rönnberg

PUBLISHED IN: Frontiers in Psychology and Frontiers in Neuroscience

### *Frontiers Copyright Statement*

*© Copyright 2007-2017 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

*All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88945-303-0 DOI 10.3389/978-2-88945-303-0

# **COGNITIVE HEARING MECHANISMS OF LANGUAGE UNDERSTANDING: SHORT- AND LONG-TERM PERSPECTIVES**

Topic Editors:

**Rachel J. Ellis,** Linköping University, Sweden

**Patrik Sörqvist,** Linköping University, Sweden; University of Gävle, Sweden

**Adriana A. Zekveld,** Linköping University, Sweden; VU University Medical Center, Netherlands

**Jerker Rönnberg,** Linköping University, Sweden

**Citation:** Ellis, R. J., Sörqvist, P., Zekveld, A. A., Rönnberg, J., eds. (2017). Cognitive Hearing Mechanisms of Language Understanding: Short- and Long-Term Perspectives. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-303-0

# Table of Contents


Hartmut Meister, Stefan Schreitmüller, Magdalene Ortmann, Sebastian Rählmann and Martin Walger

### **Chapter 2: Translational Research and Methodological Evaluation**

*Working Memory and Hearing Aid Processing: Literature Findings, Future Directions, and Clinical Applications*

Pamela Souza, Kathryn Arehart and Tobias Neher

*Learning and Memory Processes Following Cochlear Implantation: The Missing Piece of the Puzzle*

David B. Pisoni, William G. Kronenberger, Suyog H. Chandramouli and Christopher M. Conway

*Lexical Influences on Spoken Spondaic Word Recognition in Hearing-Impaired Patients*

Annie Moulin and Céline Richard


Jana B. Frtusova and Natalie A. Phillips

### **Chapter 3: Bilingualism, Signed-, and Native- and Non-Native Language**


Noelia Calvo, Agustín Ibáñez and Adolfo M. García

*Imitation, Sign Language Skill and the Developmental Ease of Language Understanding (D-ELU) Model*

Emil Holmer, Mikael Heimann and Mary Rudner

*Load and distinctness interact in working memory for lexical manual gestures*

Mary Rudner, Elena Toscano and Emil Holmer

### **Chapter 4: Working Memory and Cognition in Young Normally-Hearing Listeners**


Britt Hadar, Joshua E. Skrzypek, Arthur Wingfield and Boaz M. Ben-David

*Failing to get the gist of what's being said: background noise impairs higher-order cognitive processing*

John E. Marsh, Robert Ljung, Anatole Nöstl, Emma Threadgold and Tom A. Campbell


### **Chapter 5: Communication-Related Disorders**

*Children with speech sound disorder: comparing a non-linguistic auditory approach with a phonological intervention approach to improve phonological skills*

Cristina F. B. Murphy, Luciana O. Pagan-Neves, Haydée F. Wertzner and Eliane Schochat

*Differences in Speech Recognition Between Children with Attention Deficits and Typically Developed Children Disappear When Exposed to 65 dB of Auditory Noise*

Göran B. W. Söderlund and Elisabeth Nilsson Jobs

*Theory-of-mind in individuals with Alström syndrome is related to executive functions, and verbal ability*

Hans-Erik Frölander, Claes Möller, Mary Rudner, Sushmit Mishra, Jan D. Marshall, Heather Piacentini and Björn Lyxell

*Cognitive skills and reading in adults with Usher syndrome type 2*

Cecilia Henricson, Björn Lidestam, Björn Lyxell and Claes Möller

### **Chapter 6: Listening Effort**


Catherine M. McMahon, Isabelle Boisvert, Peter de Lissa, Louise Granger, Ronny Ibrahim, Chi Yhun Lo, Kelly Miles and Petra L. Graham

*Impact of Background Noise and Sentence Complexity on Processing Demands during Sentence Comprehension*

Dorothea Wendt, Torsten Dau and Jens Hjortkjær

*Autonomic Nervous System Responses During Perception of Masked Speech may Reflect Constructs other than Subjective Listening Effort*

Alexander L. Francis, Megan K. MacPherson, Bharath Chandrasekaran and Ann M. Alvar

### **Chapter 7: Neurophysiology**


John E. Marsh and Tom A. Campbell

### **Chapter 8: Additional Trends**

*Three Factors Are Critical in Order to Synthesize Intelligible Noise-Vocoded Japanese Speech*

Takuya Kishida, Yoshitaka Nakajima, Kazuo Ueda and Gerard B. Remijn

*A Deficit in Movement-Derived Sentences in German-Speaking Hearing-Impaired Children*

Esther Ruigendijk and Naama Friedmann


Sara Skoog Waller, Mårten Eriksson and Patrik Sörqvist

# Editorial: Cognitive Hearing Mechanisms of Language Understanding: Short- and Long-Term Perspectives

### Rachel J. Ellis<sup>1,2</sup>\*, Patrik Sörqvist<sup>2,3</sup>, Adriana A. Zekveld<sup>2,4</sup> and Jerker Rönnberg<sup>1,2</sup>

<sup>1</sup> Department of Behavioural Sciences and Learning, Linköping University, Linköping, Sweden, <sup>2</sup> Linnaeus Centre HEAD, Swedish Institute for Disability Research, Linköping University, Linköping, Sweden, <sup>3</sup> Department of Building, Energy and Environmental Engineering, University of Gävle, Gävle, Sweden, <sup>4</sup> Section Ear and Hearing, Department of Otolaryngology-Head and Neck Surgery and Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, Netherlands

Keywords: cognitive hearing science, working memory, speech perception, language processing, hearing impairment

### **Editorial on the Research Topic**

### **Cognitive Hearing Mechanisms of Language Understanding: Short- and Long-Term Perspectives**

Cognitive hearing science is a relatively new field, which developed in response to an increasing awareness of the critical role of cognition in communication (Arlinger et al., 2009). Cognitive hearing science emphasizes the subtle balancing act between bottom-up and top-down aspects of language processing. Recent models of language understanding under adverse or distracting conditions have emphasized the complex interactions between working memory capacity, attention, executive functions, cognitive spare capacity and episodic and semantic long-term memory (Mishra et al., 2013; Rönnberg et al., 2013). This kind of approach has promoted a more comprehensive grasp of the interplay between bottom-up and top-down processes, including both online processes and long-term changes (positive or negative) relating to hearing impairment/deafness and aging.

The goal of this research topic (Cognitive hearing mechanisms of language understanding: Short- and long-term perspectives) was to encourage submissions that could push the field forward by suggesting behavioral and neural mechanisms that are important for online language processing, and for long-term cognitive change. Each of the 34 papers included in this research topic has contributed toward meeting this goal, and to furthering our understanding of the complex interplay between cognition and language. In addition to papers reporting original research, the research topic also includes review, opinion, and theory articles, giving us not only new empirical evidence, but also novel approaches and theories drawn from existing knowledge and data.

Edited and reviewed by: Robert J. Zatorre, McGill University, Canada

\*Correspondence: Rachel J. Ellis rachel.ellis@liu.se

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 23 May 2017 Accepted: 08 June 2017 Published: 22 June 2017

### Citation:

Ellis RJ, Sörqvist P, Zekveld AA and Rönnberg J (2017) Editorial: Cognitive Hearing Mechanisms of Language Understanding: Short- and Long-Term Perspectives. Front. Psychol. 8:1060. doi: 10.3389/fpsyg.2017.01060

# AGEING, COGNITION, AND LANGUAGE

Many of the papers included in this research topic focus on the impact of aging on cognition and language. Carroll et al. showed that age-related differences in lexical access efficiency modulate successful speech recognition in noise in listeners with normal hearing, even when vocabulary size is matched between younger and older listeners. Karawani et al. investigated auditory perceptual learning in older adults with and without hearing loss. While both groups of listeners showed significant improvements on the trained conditions (compared to untrained listeners), generalization to non-trained tasks was limited. Heinrich et al. compared behavioral and self-report measures of aided speech perception in older adults with mild-to-moderate hearing loss. The results suggest that behavioral and self-report measures correlate more highly when they relate to similar speech situations, and that only behavioral speech perception measures correlate with cognition. Two articles focussed on sentence comprehension in older adults. DeCaro et al. report that once hearing acuity and working memory capacity have been accounted for, age does not significantly predict comprehension accuracy. Amichetti et al. were interested in whether aging and/or hearing loss affected the extent to which listeners relied on formal syntax vs. plausibility to successfully comprehend sentences. Like DeCaro et al., Amichetti et al. found that (in all conditions but one) age did not significantly predict comprehension once hearing acuity and working memory capacity had been accounted for, but that age and hearing acuity may affect which comprehension strategy is used. Meister et al. investigated the effects of cognitive load on speech recognition with one or two competing talkers in older listeners with and without hearing loss. The results showed that listeners with hearing loss performed particularly poorly, and demonstrated a different pattern of errors, in conditions with two (compared to one) competing target talkers. These differences are attributed to impaired object formation and an increased demand on working memory.

# TRANSLATIONAL RESEARCH AND METHODOLOGICAL EVALUATION

In their review article, Souza et al. investigate the relation between hearing aid processing and working memory, concluding that evidence for a link between memory and wide-dynamic range compression is strong, yet further research is needed to investigate the links between working memory and other hearing aid processing strategies. Pisoni et al. also highlight the need for more research, in this case into the cognitive factors predicting speech and language outcomes in cochlear implant users, suggesting that research into basic domain-general learning abilities is particularly lacking.

Moulin and Richard investigated the role of context and lexical factors in a standard clinical task of spondaic word recognition in quiet in 160 adult listeners with hearing loss. Their results indicate that the use of context decreased with hearing loss once the pure tone average exceeded 55 dB HL, and that there is a significant age effect on the relation between word recognition and word frequency. Koch et al. were also interested in the clinical and real-world utility of the speech materials used in their study of acceptable noise level test outcomes. They found that acceptable noise levels were correlated with self-reported hearing problems, and that the repeatability of the acceptable noise level test was not affected by the use of more natural speech materials. Frtusova and Phillips also focussed on the impact of methodological choices on test outcomes, showing that older adults performed better in a working memory task when stimuli were audiovisual as opposed to audio-only. Both behavioral and electrophysiological measures indicated that this benefit was more pronounced for listeners with poorer hearing than for those with better hearing.

# BILINGUALISM, SIGNED-, AND NATIVE- AND NON-NATIVE LANGUAGE

A number of papers in the research topic focussed on native and non-native language use, or bilingualism and its relation to cognition. Schneider et al. compared the short-term memory performance of older and younger native English speakers to a group of younger people for whom English was a second language. The results showed that older adults had poorer memory scores than younger adults, but that there was no difference in performance between the younger adults for whom English was a native language, and those for whom English was a second language. Schmidtke investigated the speech understanding in noise scores of monolingual English and Spanish-English bilingual young adults. The results indicated that speech understanding in noise improves with greater language exposure and that working memory does not provide additional predictive power, likely due to the large amount of variance shared between working memory and language proficiency. In their review, Calvo et al. argue that while many previous studies have observed that bilingualism does not affect working memory, certain aspects of working memory may be enhanced by bilingualism, yet methodological choices often mean that such improvements are difficult to observe.

Holmer et al. also investigated the effects of native and non-native language, but in signed rather than spoken language. The results are used as support for a developmental version of the Ease of Language Understanding model, the D-ELU, which is outlined in the article. Rudner et al. also focussed on sign language by investigating the effects of load and distinctness on performance on a sign-based memory task by young adult listeners with no previous experience of sign language. The results showed that working memory load increased when sign distinctness decreased, providing support for an amodal mechanism active even when a pre-existing semantic representation is missing.

# WORKING MEMORY AND COGNITION IN YOUNG NORMALLY-HEARING LISTENERS

Füllgrabe and Rosen caution against assuming that working memory has the same importance to speech-in-noise processing in young adults with normal hearing as it does in older adults with hearing loss. The results of their meta-analysis suggest that variations in working memory capacity account for only approximately 2% of the variance in speech-in-noise recognition scores in younger adults with normal hearing. In contrast, Hadar et al. report that working memory is of importance for speech processing in younger adults with normal hearing, using processing time as opposed to accuracy as an outcome measure in their eye-tracking study.

In relation to the interplay between communication, working memory and cognition, Marsh et al. (see also corrigendum) report results from a study showing that background noise interferes with gist processing of spoken messages. Finally, Beaman and Jones discuss mechanisms involved in forgetting in short-term memory, focussing on different forms of overwriting.

# COMMUNICATION-RELATED DISORDERS

A number of articles in this research topic report results pertaining to individuals with communication-related disorders other than hearing loss. Murphy et al. examined the effects of phonological and non-linguistic auditory training on the phonological skills of children with speech sound disorder. While neither training condition led to improvements in phonological skills, the non-linguistic auditory programme did lead to improvements in both auditory and cognitive measures. Söderlund and Jobs investigated differences in speech recognition thresholds in children with and without attention deficit hyperactivity disorder when exposed to noise. The findings indicated that children with attention deficit hyperactivity disorder had higher speech recognition thresholds than a control group, and that this difference disappeared when the children were exposed to white noise. Frölander et al. investigated theory of mind and executive function in adults with Alström syndrome (a genetic disorder associated with a variety of symptoms including vision and hearing loss), finding that the group with Alström syndrome performed significantly more poorly than a control group in both types of task. Usher syndrome is another genetic disorder associated with hearing and vision loss, and was the focus of research reported by Henricson et al. The findings showed that adults with Usher syndrome performed more poorly than a control group on tests of phonological processing, and on tests involving fast visual or phonological processing. Henricson et al. suggest that these difficulties may help to explain the high levels of fatigue often reported by individuals with Usher syndrome.

# LISTENING EFFORT

Listening effort was the subject of a number of papers in the research topic. Wagner et al. used pupillometry and response times to show that lexical competition is associated with effort, and that degraded speech affects the timing of information processing, leading to increased effort. McMahon et al. also used pupillometry, along with alpha power, to investigate listening effort. The findings showed that these two measures show similar trends when participants processed highly intelligible speech, but seemed to diverge with degraded speech. Wendt et al. were also interested in comparing different measures of listening effort, finding that pupillometry and subjective rating scales index different aspects of effort, and that the syntactic complexity of speech can affect effort even when intelligibility is high. Francis et al. also report differences in subjective and physiological measures of listening effort, finding that masking affects physiological measures to a greater extent than subjective measures of listening effort.

# NEUROPHYSIOLOGY

In Cardin's review of the effects of aging and hearing loss on cortical auditory regions, parallels are drawn between the cortical mechanisms that are engaged when young listeners with normal hearing engage in effortful listening, and those that are engaged in all listening for older listeners and those with hearing loss. Marsh and Campbell were also interested in the neurophysiological mechanisms underlying auditory processing. Their article introduces the new early filter model which suggests that complex sounds are processed early on by a subcortical filter under cholinergic top-down control.

In addition to the themes outlined above, a number of other topics were the focus of articles included in this research topic. Kishida et al. reported work on the synthesis of intelligible noise-vocoded Japanese speech. Ruigendijk and Friedmann investigated the effects of hearing impairment on the comprehension and repetition of movement-derived sentences in German-speaking children. Heald et al. introduce a new framework for perceptual plasticity relating to auditory object recognition, focussing on its context-dependent nature. Skoog Waller et al. investigated estimates of a speaker's age, finding that speakers were estimated as having a younger age when the speech rate was fast, and an older age when the speech rate was slow.

Taken together, the manuscripts included in this research topic represent important advances in the field of cognitive hearing science. Specifically, the papers demonstrate evidence of the involvement of cognition at all levels of speech perception, from basic word recognition through to comprehension. The variety of outcome measures included has enabled us to gain further insight into the impact of methodological choices on the likelihood of observing an effect of cognition, and indeed how the effect of these choices may vary depending on the participant group being tested. Further evidence is provided of the negative impact of hearing loss and aging on communication, and the role of cognition in ameliorating these negative effects. An important contribution of this topic is finding evidence for the importance of cognition for communication not just for older listeners or those with hearing loss, but also for listeners with a variety of communicative impairments, and indeed young listeners without hearing loss (albeit dependent on the outcome measures employed). We also see considerable evidence for the moderating role of linguistic factors (e.g., native/non-native language, syntactic complexity) on the relation between cognition and communication.

Together, these results provide support for five of the six predictions based on the Ease of Language Understanding (ELU) model (see, for example, Rönnberg et al., 2013) outlined by Rönnberg and Rudner (under review), namely that the effect of signal distortion, early attention mechanisms, use of phonological and semantic cues, and effort are all linked to working memory, and that working memory also affects the perception of sign language. The only prediction that is not addressed here is that there will be differing effects of hearing loss on long-term memory, but not on working memory.

We hope that this research topic will help to inspire many more studies in the field of cognitive hearing science. This area will continue to advance and contribute to both our understanding of the mechanisms underlying language understanding, and to improving assessment and treatment options for people with a communicative impairment.

# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

# FUNDING

This research was funded by the Linnaeus Centre HEAD, The Swedish Research Council (Vetenskapsrådet, grant number: 2007-8654).

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Ellis, Sörqvist, Zekveld and Rönnberg. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise

Rebecca Carroll<sup>1,2</sup>\*, Anna Warzybok<sup>1,3</sup>, Birger Kollmeier<sup>1,3</sup> and Esther Ruigendijk<sup>1,2</sup>

<sup>1</sup> Cluster of Excellence 'Hearing4all', University of Oldenburg, Oldenburg, Germany, <sup>2</sup> Institute of Dutch Studies, University of Oldenburg, Oldenburg, Germany, <sup>3</sup> Medizinische Physik, University of Oldenburg, Oldenburg, Germany

Vocabulary size has been suggested as a useful measure of "verbal abilities" that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18–35 years) and 22 older (60–78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. 
This suggests that older adults' poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access; with an average vocabulary size similar to that of younger adults, they were still slower in lexical access.

Keywords: age, speech perception in noise, mental lexicon, lexical access, vocabulary size, verbal working memory, cognitive change

### Edited by:

Jerker Rönnberg, Linnaeus Centre HEAD, Sweden

Reviewed by: Michael S. Vitevitch, University of Kansas, USA; Carine Signoret, Linnaeus Centre HEAD, Sweden

\*Correspondence: Rebecca Carroll rebecca.carroll@uni-oldenburg.de

### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 30 September 2015 Accepted: 16 June 2016 Published: 04 July 2016

### Citation:

Carroll R, Warzybok A, Kollmeier B and Ruigendijk E (2016) Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise. Front. Psychol. 7:990. doi: 10.3389/fpsyg.2016.00990

# INTRODUCTION

Speech perception in background noise is relatively difficult compared to speech perception in quiet, and it most likely depends on a combination of multiple factors (e.g., Benichov et al., 2012; Humes et al., 2012; Füllgrabe et al., 2015). Acoustic-perceptual factors such as pure-tone thresholds and acoustic setting, e.g., masker type, spatial configuration, or the signal-to-noise ratio (SNR), are among the most obvious candidates. Speech recognition is well documented to deteriorate with decreasing SNR (Plomp and Mimpen, 1979; Kollmeier and Wesselkamp, 1997; Bruce et al., 2013). But cognitive factors such as (verbal) working memory, sensitivity to interference, attention, processing speed, or adaptive learning have also been shown to contribute to speech perception in noise (Pichora-Fuller, 2003; Pichora-Fuller and Souza, 2003; Füllgrabe and Moore, 2014; Füllgrabe, 2015; Heinrich et al., 2015; Huettig and Janse, 2016). Age has been reliably found to alter speech perception on top of that, although the exact mechanisms are not completely understood (see overviews by CHABA, 1988; Wayne and Johnsrude, 2015). Linguistic factors including lexical access, inhibition of lexical competitors, and integration of phonemic and lexical information into context (e.g., Cutler and Clifton, 2000; Weber and Scharenborg, 2012) are important for speech processing in general. Although not all linguistic factors have been tested in acoustically challenging conditions or in older populations, speech perception in noise is likely to follow the mechanisms of speech perception and word recognition that are known for speech presented in quiet. We consider various measures that are applied to determine individual differences: hearing levels, age, working memory, vocabulary size (i.e., how many words a person knows), and lexical access time. The umbrella term 'individual difference measures' refers to the collection of all of these measures.
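As a brief illustration of the SNR mentioned above, the following sketch computes the ratio in decibels as 10·log10 of speech power over noise power, and rescales a noise signal so that a mixture reaches a chosen SNR. It is a minimal sketch only: the function names, the toy sine-wave signals, and the target values of 0 and −5 dB are hypothetical choices for illustration, and are not the stimuli, levels, or procedure of the study.

```python
import math


def power(samples):
    """Mean power (mean of squared amplitudes) of a sampled signal."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def scale_noise_to_snr(speech, noise, target_snr_db):
    """Return a scaled copy of `noise` so that speech vs. noise has the target SNR."""
    # Required noise power is P_speech / 10^(SNR/10); the amplitude gain is the
    # square root of (required noise power / current noise power).
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (target_snr_db / 10.0)))
    return [gain * x for x in noise]


# Hypothetical toy signals (assumptions for illustration, not data from the study).
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [0.3 * math.sin(2 * math.pi * 17 * t / 100 + 1.0) for t in range(100)]

for target in (0.0, -5.0):
    scaled = scale_noise_to_snr(speech, noise, target)
    print(f"target {target:+.1f} dB -> measured {snr_db(speech, scaled):+.2f} dB")
```

A lower (more negative) SNR means that the noise carries more power relative to the speech, which is the direction in which recognition is reported to deteriorate.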

Deterioration of speech recognition in noise is characteristic of older listeners, even in individuals with normal or almost normal pure-tone thresholds (CHABA, 1988; Dubno et al., 2002; Pichora-Fuller, 2003; Zekveld et al., 2011; Schoof and Rosen, 2014; Besser et al., 2015). Possible explanations for this deterioration vary from age-related changes in supra-threshold auditory processing (Léger et al., 2012; Schoof and Rosen, 2014; Füllgrabe et al., 2015) to age-related changes in cognitive factors such as processing speed, working memory, and susceptibility to interference (e.g., Füllgrabe et al., 2015; see also Wayne and Johnsrude, 2015; Wingfield et al., 2015 for recent reviews). Füllgrabe et al. (2015), for example, observed age effects on speech-in-noise recognition in English and explained these by age-linked reductions in sensitivity to temporal fine structure and a composite measure of cognition. Self-rated hearing ability and modulation masking release did not explain age-related differences in speech recognition.

On the cognitive level, speech recognition has been suggested to rely strongly on working memory (e.g., overview by Besser et al., 2013). The general assumption is that the larger the working memory capacity, the better the speech recognition scores (see Rudner and Signoret, 2016). The Ease of Language Understanding (ELU) model (Rönnberg et al., 2013) posits that a degraded speech input is not automatically matched against a semantic representation in long-term memory. This mismatch in the rapid automatic multimodal phonological buffer inhibits immediate lexical access, and requires an additional, explicit processing loop. Crucially, this explicit semantic processing loop is thought to depend on working memory, because the phonetic details of the input signal and the semantic content of the context have to be held in short-term memory (or the phonological buffer) while searching for a lexical match. The ELU model suggests that successful perception of degraded speech necessitates relatively more selective attention at very early stages of stream segregation and relatively more working memory capacity to match the degraded input with long-term representations in the mental lexicon. The model also suggests that the influence of cognitive factors on speech recognition increases as the speech signal deteriorates, e.g., due to decreasing SNRs (see e.g., Rudner et al., 2012). Another way of thinking about the relation between cognition and speech perception in adverse conditions is provided by the cognitive spare capacity hypothesis (Mishra et al., 2013, 2014). Assuming that working memory is limited (Baddeley and Hitch, 1974), cognitive spare capacity is defined as the resources that are still available after those cognitive capacities required for lexical access have been recruited. 
Since listening in adverse conditions, such as noise and/or hearing impairment, is assumed to require more cognitive resources to match a stimulus to the semantic long-term representation (see Rönnberg et al., 2013), comparatively less cognitive spare capacity for post-lexical speech processing and integration into discourse is expected in these situations than in acoustically less challenging situations. Mishra et al. (2013, 2014) showed that cognitive spare capacity was somewhat independent of working memory, but was related to episodic long-term memory. Despite a growing body of published evidence supporting the cognitive spare capacity hypothesis, the ELU model, and the importance of working memory in general, higher working memory capacity has not been universally found to benefit speech recognition in acoustically adverse conditions. Rudner et al. (2012), for example, found that perceived listening effort ratings correlated with speech perception in different types of noise and at different SNRs, but were independent of working memory capacity for two groups of Danish and Swedish listeners with hearing loss. Working memory did not influence speech perception per se but seemed to influence the relative rating of perceived effort with respect to different noise types. Picou et al. (2013) presented US-American hearing-impaired listeners with a dual task comprising a word recognition task and a visual reaction time (RT) task. Working memory capacity was related to a word recognition benefit from visual cues, but was not associated with changes in the auditory presentation (i.e., the addition of noise).

Speech recognition in acoustically adverse conditions has also been suggested to depend on linguistic factors, such as vocabulary knowledge. Benard et al. (2014) reported a positive correlation of phoneme restoration scores with vocabulary size as measured by the Dutch version of the Peabody Picture Vocabulary Test (PPVT; Bell et al., 2001), suggesting that the more words a listener knows, the better his/her speech recognition scores in a phoneme restoration task. McAuliffe et al. (2013) observed a correlation between the PPVT and recognition scores for English dysarthric speech, supporting the idea that a large lexicon may be beneficial for word recognition in adverse listening conditions. Benichov et al. (2012), in contrast, observed no relation between vocabulary knowledge and speech recognition in English sentences with predictable (vs. unpredictable) final words. They concluded that their participants' (aged 19–89) ability to benefit from the predictability of linguistic context was "sufficiently robust that a relatively wide range in verbal ability among native English speakers had no effect on [speech] recognition performance" (Benichov et al., 2012, p. 250). Banks et al. (2015) investigated effects of inhibition, vocabulary knowledge, and working memory on perceptual adaptation to foreign-accented English speech. Vocabulary knowledge predicted better recognition of unfamiliar accents, whereas working memory only indirectly influenced speech recognition, mediated by listeners' vocabulary scores. Research on the role of vocabulary size across the adult life span diverges even more. Despite the accumulating support for an association of vocabulary knowledge with speech recognition in adverse conditions, the reasons and associated mechanisms are far from understood, especially with respect to the standardized tests that are typically used. For example, why should knowing relatively obscure words from vocabulary tests (e.g., usurp or concordance) predict correct recognition and recall of relatively familiar words such as clouds or table that are typically used in standardized speech recognition tests? McAuliffe et al. (2013) cautiously suggested that a larger vocabulary size may require a more fine-grained or detailed lexical representation. They did not, however, test this hypothesis. To confirm the impact of vocabulary knowledge on speech recognition in noise, Kaandorp et al. (2015) compared speech-in-noise recognition scores of three groups of normal-hearing young listeners with their vocabulary knowledge and lexical access. Although vocabulary knowledge and lexical access were highly correlated, only lexical access times reliably predicted speech recognition scores: faster lexical access correlated with better speech recognition scores.

Unfortunately, there is no compelling theory that can explain all of the above observations. The ELU model (Rönnberg et al., 2013) provides a reasonable explanation to delineate the relation of working memory and disturbed lexical access due to mismatch. It does not, however, allow a direct prediction of how vocabulary size would modulate this mismatch (McAuliffe et al., 2013; cf. Benard et al., 2014). The speculation is that the more words someone knows, the more likely a lexical match is (or the less likely a mismatch that would trigger the explicit phonological or semantic processing loop). As a consequence, word recognition should be faster in people with larger lexicons. Considering the findings by Banks et al. (2015), it is possible that both vocabulary knowledge (or lexical representation) and working memory may only indirectly relate to speech recognition (in noise-distorted speech). Lexical access times may mediate the explanatory gap between lexical representations in the lexicon (either word form or semantic knowledge) and successful speech recognition, especially in acoustically adverse conditions (Kaandorp et al., 2015). The faster, or rather the more efficient, a person's lexical access, the better the corresponding speech recognition score, because efficient access leaves more spare capacity for resolving acoustic-phonetic matching difficulties of subsequent speech material, or for integrating recognized speech into discourse context (e.g., Mishra et al., 2013, 2014; Rönnberg et al., 2013). 
There are, however, at least two problems with the simplistic prediction of a large vocabulary size leading to fast lexical processing and integration into sentence context, thus resulting in successful speech-in-noise recognition: (A) There is evidence from bilingualism and aging studies suggesting that a larger lexicon may require longer search times, resulting in slower speech processing times (e.g., Vitevitch and Luce, 1998; Salthouse, 2004; Ramscar et al., 2014; Schmidtke, 2014). (B) The role of age-related differences in vocabulary knowledge, lexical access time, and working memory is somewhat obscure. Age is not only associated with changes in speech recognition in noise as pointed out above, but also with changes in cognitive and possibly linguistic factors. Several authors have posited that vocabulary knowledge increases with increasing age (e.g., MacKay and Burke, 1990; Kavé and Halamish, 2015; Keuleers et al., 2015). Others reported peak vocabulary knowledge with subsequent decline in later adulthood (e.g., Salthouse, 2004; Kavé et al., 2010; Hartshorne and Germine, 2015). Are age-related differences in vocabulary knowledge, lexical access time, and working memory comparable, and how do they relate to word recognition? Older adults have been found to have lower working memory capacities than younger adults (e.g., Pichora-Fuller et al., 1995; Desjardins and Doherty, 2013; Kidd and Humes, 2015). However, whether working memory capacity relates directly or indirectly to generally poorer speech processing or language understanding remains unclear (see e.g., Besser et al., 2013; Banks et al., 2015). Ramscar et al. (2014) argued that a larger lexicon requires more detailed representations to allow efficient lexical access. They further proposed that older adults underperform in word recognition tests, because they know more words than younger adults, which may require longer searches. 
This lexical access account for age-related differences in word recognition may be independent of or in addition to age-related declines in cognitive skills. Accordingly, people with a larger lexicon should perform worse in speech recognition tasks, unless they can compensate with better working memory (ELU model, Rönnberg et al., 2013).

Based on the findings from different languages as described above, vocabulary knowledge likely contributes to higher speech recognition scores, but it may be mediated by lexical access and possibly working memory. We contribute to the existing research by adding data from a German population, and by focusing on a theoretically motivated explanation that takes into account the intricate interplay of known factors that contribute to speech recognition. We hypothesized that only the relative efficiency of lexical access may be correlated with successful speech recognition. Relative efficiency comprises a combination of vocabulary size and lexical access times: quick lexical access relative to the vocabulary size should predict good speech recognition scores. Slow access relative to the vocabulary size should predict worse recognition. Provided that listeners have no time constraints, a slow-but-detailed approach may be acceptable but is not efficient. To perform well in speech recognition, slow access listeners would have to be able to keep the word(s) in their phonological loop for rehearsing instead. Such an explanation could thus coherently integrate all of the individual difference measures listed above. The focus of our investigation is on the role of the mental lexicon for success in a standardized speech-in-noise recognition test. We submit that the efficiency with which the lexicon is accessed may modulate speech recognition performance. We attempt to show this using a limited battery of tests that can be administered in a clinical routine. Whereas many studies reporting hearing status or hearing device strategies as the key predictor of speech recognition difficulties amalgamate young and older adults (e.g., Dirks et al., 2001; George et al., 2006; Mackersie et al., 2015), we attempt to tease apart influences of age-related and hearing-related differences for speech perception in noise.

Our specific research questions were:


# MATERIALS AND METHODS

### Participants

Two groups of native listeners of German with normal hearing participated in the experiments. The first group consisted of 22 younger listeners (YNH, 13 women and 9 men), varying in age from 18 to 35 years, with an average age and corresponding standard deviation of 25.3 ± 4.1 years. The second group included 22 older listeners (ONH, 15 women and 7 men), who ranged in age from 60 to 78 years, with an average age of 67.7 ± 4.8 years. Inclusion criteria were based on normal hearing status (see section "Hearing Status" for a detailed description) and age group (YNH: 18–35, ONH: 60–80). Education may play an indirect role in speech recognition because people with advanced education may know more words (see also Kaandorp et al., 2015), and highly educated people may also be better with respect to working memory capacity, adaptability, and lexical access time. Education level was assessed using a questionnaire: participants were categorized according to their highest level of education (doctoral, master's, bachelor, high school, or middle school degree).

### Speech Recognition Task

We tested speech-in-noise recognition with the Göttingen Sentence Test (GÖSA; Kollmeier and Wesselkamp, 1997), which consists of short, meaningful sentences from everyday speech with a structure similar to the Plomp-type sentences (Plomp and Mimpen, 1979) or the Hearing in Noise Test sentences (Nilsson et al., 1994). GÖSA sentences vary with respect to their syntactic complexity, from simple subject-verb-object sentences to more complex structures (see Uslar et al., 2011). They also vary with respect to the context-driven predictability of individual words. For example, Spiele 'games' in Er gewinnt vier Spiele nacheinander, 'He wins four games in a row' is highly predictable, while Licht 'light' in Mach doch das Licht an 'Do turn on the light' is not as predictable. The sentences with varying complexity and context are distributed equally across test lists. The test is optimized and evaluated for speech intelligibility in noise (Kollmeier and Wesselkamp, 1997). The test contains 10 statistically and phonemically balanced lists, each of 20 sentences.

### Individual Difference Measures

### Hearing Status

YNH listeners were defined as having normal hearing when their pure-tone thresholds were equal to or better than 20 dB HL across the octave frequencies from 125 Hz to 8 kHz in the better ear. This strict criterion was relaxed for the group of ONH listeners. Age-related changes in pure-tone threshold mostly affected frequencies of 4 kHz and above. The main speech-relevant frequency range between 500 Hz and 4 kHz remained, however, unaffected. ONH were therefore defined as having normal hearing when their pure-tone average across the frequencies 500, 1k, 2k, 4k Hz (PTA-4), was equal to or lower than 20 dB HL in the better ear. The average PTA-4 was 3.2 ± 3.0 dB HL for the YNH group and 8.4 ± 4.5 dB HL for the ONH group. Differences in hearing level between the better and the worse ear were 15 dB HL or less at each of the PTA frequencies, except for 3% of the data, where the interaural difference was higher in maximally one PTA frequency per person. **Figure 1** shows the mean pure tone thresholds with corresponding standard deviations for the YNH (solid dots) and ONH (circles) group at the better/measured ear. Given that most of the spectral power of the GÖSA speech material was between 100 Hz and 5 kHz, small differences in hearing level at or above 4 kHz were not expected to cause notable differences in speech recognition.
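The two normal-hearing criteria described above can be written out as a short Python sketch. It is only an illustration and not the authors' procedure: the function names, the dictionary layout, and the example audiogram are hypothetical, and the 20 dB HL limit and the frequencies are the only values taken from the text.

```python
from statistics import mean

PTA4_FREQS_HZ = (500, 1000, 2000, 4000)
OCTAVE_FREQS_HZ = (125, 250, 500, 1000, 2000, 4000, 8000)
NORMAL_LIMIT_DB_HL = 20.0


def pta4(thresholds):
    """Mean threshold (dB HL) at 500, 1000, 2000, and 4000 Hz for one ear."""
    return mean(thresholds[f] for f in PTA4_FREQS_HZ)


def is_normal_ynh(thresholds):
    """Strict criterion: every octave frequency at or better than 20 dB HL."""
    return all(thresholds[f] <= NORMAL_LIMIT_DB_HL for f in OCTAVE_FREQS_HZ)


def is_normal_onh(thresholds):
    """Relaxed criterion: PTA-4 at or below 20 dB HL."""
    return pta4(thresholds) <= NORMAL_LIMIT_DB_HL


# Hypothetical better-ear audiogram with elevated thresholds at 4 kHz and above.
example_ear = {125: 10, 250: 10, 500: 5, 1000: 10, 2000: 15, 4000: 30, 8000: 45}
example_pta4 = pta4(example_ear)  # (5 + 10 + 15 + 30) / 4
```

In this made-up example the listener fails the strict octave-frequency criterion because of the high-frequency thresholds but still meets the relaxed PTA-4 criterion, which is the situation the relaxed rule was meant to accommodate.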

### Verbal Working Memory

Verbal working memory was tested using the German version of the Reading Span Test (RST) that has been suggested for application in cognitive hearing research (Carroll et al., 2015). This test consists of 54 short sentences, half of which are semantically sensible and half of which are absurd. The participants' task was to read a sentence presented on a screen and to indicate via button press within 1.75 s whether the sentence was absurd or not. After a block of 3, 4, 5, or 6 sentences, participants were asked to repeat either the first noun or the last word of each sentence in that block. Correctly recalled items in the correct order were scored on a sheet of paper.

### Lexical Access Times

fpsyg-07-00990 June 30, 2016 Time: 17:46 # 5

The Lexical Decision Test (LDT) presented four-letter combinations on a computer screen. Forty items were monosyllabic pseudowords (i.e., non-existent words that are structurally possible but carry no meaning in German, e.g., MAND). Forty items were monosyllabic existing words, of which half (n = 20) occur frequently and half occur infrequently in the language. Frequency of occurrence was established using the Leipzig Wortschatz corpus<sup>1</sup>. The participants' task was to decide as quickly and as correctly as possible whether a given letter combination represented an existing German word. Responses were collected via button press. Presentation and logging were done using the E-Prime 2.0 professional software (Psychology Software Tools, Inc., Pittsburgh, PA, USA). RTs were calculated for correctly answered trials. Lexical access time was defined as the mean RT of all words with both high and low frequency of occurrence. Frequent words are more likely to be pre-activated than less frequent words (e.g., Marslen-Wilson, 1989; Cleland et al., 2006); RTs to frequent words are therefore bound to be much shorter. On the flipside, RTs to infrequent words may be more strongly influenced by vocabulary size than RTs to frequently used words. For our main analyses, we therefore averaged RTs to all words, log-transformed them to minimize the effects of long latencies, and z-transformed them for the statistical analyses to allow for direct comparisons across listener groups and with other tests. To check for response biases or speed-accuracy tradeoffs (SATs), we also analyzed RTs to frequent and infrequent words separately.
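The RT processing chain described above (average over correctly answered word trials, log-transform, z-transform) can be sketched in Python as follows. This is an illustrative reading of the text and not the authors' code: the trial format, the function names, the use of the natural logarithm, the use of the sample standard deviation, and all example values are assumptions.

```python
import math
from statistics import mean, stdev


def mean_correct_word_rt(trials):
    """Mean RT in ms over correctly answered existing-word trials."""
    rts = [t["rt_ms"] for t in trials if t["is_word"] and t["correct"]]
    return mean(rts)


def z_scores(values):
    """Standardize values using the sample standard deviation (assumption)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def lexical_access_z(trials_by_listener):
    """Average per listener, then log-transform, then z-transform across listeners."""
    ids = list(trials_by_listener)
    log_rts = [math.log(mean_correct_word_rt(trials_by_listener[i])) for i in ids]
    return dict(zip(ids, z_scores(log_rts)))


def trial(rt_ms, is_word, correct):
    """Build one hypothetical LDT trial record."""
    return {"rt_ms": rt_ms, "is_word": is_word, "correct": correct}


# Hypothetical data for three listeners (not taken from the study).
demo = {
    "L1": [trial(500, True, True), trial(700, True, True),
           trial(900, True, False), trial(400, False, True)],
    "L2": [trial(650, True, True), trial(750, True, True)],
    "L3": [trial(800, True, True), trial(1000, True, True)],
}
demo_z = lexical_access_z(demo)
```

Incorrect trials and pseudoword trials are dropped before averaging, so the resulting score reflects only successful access to existing words.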

### Vocabulary Size

Two standardized tests of receptive vocabulary size were administered, the Wortschatztest (WST; Schmidt and Metzler, 1992) and an updated German version of the PPVT (Buhlheller and Häcker, 2003). The use of standardized tests of vocabulary size follows other studies that found good correlations with sentence-in-noise recognition (e.g., McAuliffe et al., 2013; Benard et al., 2014; Kaandorp et al., 2015).

In the WST, participants were presented 42 lines of six words each on a sheet of paper. Five of these words per line were pseudowords and one was an existing word. The task was to identify the existing word in each row at the necessary pace. Participants were instructed to not mark anything unless they were sure they could recognize the existing word. We assume the WST to test recognition of the (orthographic) word form. Semantic knowledge is not required (albeit beneficial) for high scores.

In the PPVT, participants saw four pictures on a paper test block and heard a target word from a loudspeaker. The task was to indicate the picture that best represented the target word. Responses were not timed. The test consists of 89 trials with increasing picture-matching difficulty. To perform well on this test, individuals not only needed to be familiar with the (acoustic) word form but also have a detailed semantic representation of the target word to correctly distinguish the correct picture from its three semantically similar and/or related competitors. We therefore assume that the PPVT focuses more strongly on semantic representations (and/or world knowledge) compared to the WST. Note that guessing and competitor elimination strategies cannot be completely excluded, especially in the PPVT.

To reduce test-specific effects and to focus exclusively on vocabulary size, we combined the z-transformed scores of the WST and PPVT into a new composite variable VOCABULARY, which we used for further analyses (see Salthouse, 2010, p. 105; Schoof and Rosen, 2014 for similar procedures). In addition, relatively higher error rates for detecting infrequent existing words (LDTLF error) on the lexical decision task may also arguably reflect vocabulary size: the fewer words a person knows, the more likely he/she is to reject an infrequent word as a pseudoword in the LDT. We therefore also considered LDTLF error as a potential factor reflecting vocabulary size.
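One way such a composite could be formed is sketched below in Python. The text states only that z-transformed WST and PPVT scores were combined; averaging the two z-scores, the use of the sample standard deviation, the function names, and the example scores are all assumptions made for illustration.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values using the sample standard deviation (assumption)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def vocabulary_composite(wst_scores, ppvt_scores):
    """Per-participant mean of z-transformed WST and PPVT scores (averaging is an assumption)."""
    z_wst = z_scores(wst_scores)
    z_ppvt = z_scores(ppvt_scores)
    return [(a + b) / 2 for a, b in zip(z_wst, z_ppvt)]


# Hypothetical raw scores for three participants (not taken from the study).
wst_demo = [28, 32, 36]
ppvt_demo = [70, 78, 86]
vocabulary_demo = vocabulary_composite(wst_demo, ppvt_demo)
```

Standardizing before combining keeps the test with the larger raw-score range (here the 89-trial PPVT) from dominating the composite.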

### Procedure

All GÖSA stimuli were presented using the Oldenburg Measurement Application (HörTech gGmbH, Oldenburg, Germany<sup>2</sup>) and free-field equalized Sennheiser HDA200 headphones. They were amplified by either an Earbox 3.0 High Power (Auritec, Hamburg, Germany) or an RME Fireface UCX (RME, Haimhausen, Germany). Measurements took place in a sound-attenuated booth that fulfilled the requirements of ANSI/ASA S3.1 and S3.6 standards (ANSI, 1999). The headphones were free-field equalized (ISO, 2004) using a finite impulse response filter with 801 coefficients. The measurement setup for speech intelligibility measurements was calibrated to 65 dB SPL using Brüel and Kjær artificial ear type 4153, the microphone type 4134, preamplifier type 2669, and amplifier type 2610 (Brüel and Kjær, Nærum, Denmark). PPVT words were presented using pre-recorded soundfiles over a Genelec 8020 loudspeaker. Signals in the speech recognition measurements were presented diotically to the better ear at fixed SNRs of −4 and −6 dB, with the test-specific noise signal fixed at 65 dB SPL. SNRs were chosen to correspond to SRTs yielding 50 and 80% intelligibility for young adults with normal hearing (see Kollmeier and Wesselkamp, 1997). The noise signal was turned on 500 ms before and turned off 500 ms after presentation of each sentence. In addition, 50 ms rising and falling ramps were applied to the masker using a Hann window, to prevent abrupt signal onset and offset. The listeners' task was to repeat the words they had understood, and the test instructor marked the correct responses on a display; each word in a sentence was scored separately. Each participant listened to six test lists: one test list presented speech in noise at −4 dB SNR and one list at −6 dB SNR; the other four test lists presented sentences in other acoustic settings that are not under investigation here. The order of test list and acoustic condition was randomized. This study was approved by and carried out in accordance with recommendations from the local ethics committee at the University of Oldenburg.

<sup>1</sup>http://wortschatz.uni-leipzig.de

<sup>2</sup>www.hoertech.de
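The presentation levels and masker gating described in the Procedure can be illustrated with a small Python sketch. It is a simplified reconstruction under stated assumptions and not the measurement software: the function names and the example sample rate are hypothetical, and only the 65 dB SPL noise level, the SNRs, the 500 ms lead and lag, and the 50 ms Hann ramps come from the text.

```python
import math

NOISE_LEVEL_DB_SPL = 65.0


def speech_level_db_spl(snr_db, noise_level_db_spl=NOISE_LEVEL_DB_SPL):
    """Speech level implied by a fixed noise level and a target SNR."""
    return noise_level_db_spl + snr_db


def masker_duration_s(sentence_duration_s, lead_s=0.5, lag_s=0.5):
    """Noise starts 500 ms before and ends 500 ms after the sentence."""
    return lead_s + sentence_duration_s + lag_s


def hann_ramp(n_samples):
    """Rising half of a Hann window: gains from 0 up to 1."""
    if n_samples < 2:
        return [1.0] * n_samples
    return [0.5 * (1.0 - math.cos(math.pi * i / (n_samples - 1)))
            for i in range(n_samples)]


def apply_ramps(signal, sample_rate_hz, ramp_ms=50.0):
    """Apply rising and falling Hann ramps to the start and end of a signal."""
    n = min(int(round(sample_rate_hz * ramp_ms / 1000.0)), len(signal) // 2)
    ramp = hann_ramp(n)
    out = list(signal)
    for i in range(n):
        out[i] *= ramp[i]        # fade in
        out[-1 - i] *= ramp[i]   # fade out (mirror image)
    return out


# Hypothetical 1 s constant "noise" at a 1 kHz sample rate, for illustration only.
gated = apply_ramps([1.0] * 1000, sample_rate_hz=1000)
```

With the noise fixed at 65 dB SPL, the two test SNRs correspond to speech levels of 61 and 59 dB SPL.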

### Statistical Analyses


To address our hypotheses, we conducted several different analyses. (1) To determine age-related group differences in our individual difference measures, we performed a multivariate analysis of variance (MANOVA). Since, in theory, age effects on all predictors may be independent from speech recognition, we excluded that latter factor here. As mentioned above, we combined WST and PPVT scores to a new composite variable VOCABULARY, to reduce test-specific effects and to avoid collinearity. (2) To appropriately model our dataset statistically, and to assess the relevant factors contributing to speech recognition, we employed an overall linear mixed effects regression (lmer) model using the lme4 package (Bates et al., 2014) in R 3.1.0. This model is described below. (3) To determine whether speech recognition scores were differentially associated with cognitive-linguistic measures depending on age group and/or SNR level, we applied hierarchical forward regression models for each listener group and for each SNR level. These planned post hoc analyses were based on the lmer outcome (see also Heinrich et al., 2015 for a similar approach). Five YNH participants did not complete the SNR-4 test list and the RST because the test protocol was expanded to include a measure at higher intelligibility (∼80% correct) and a measure of working memory after some first measurements. Numbers of participants are provided for each analysis. For the lmer model, we applied an exploratory approach in determining the need for random and fixed effects. Because the individual measures used very different scales, they were z-transformed for direct comparisons. The necessity to include random intercepts for LISTENER and LIST was assessed to account for possible variability (for each listener and each GÖSA list) in the effects of certain predictors. The following fixed factors (predictors) were considered in order to determine the best-fitting model: AGEGROUP (YNH vs. ONH), SNR level (−4 vs. 
−6 dB SNR), CONDITIONORDER, AGE, PTA-4, RST, RT in the lexical decision test (LDTRT), error rate for infrequent words in the lexical decision test (LDTLF error), VOCABULARY, and EDUCATION. Testing the possible predictors AGEGROUP, SNR level, RST, LDTRT, VOCABULARY, and EDUCATION followed directly from our hypotheses. CONDITIONORDER (i.e., the order in which SNR-6, SNR-4, and the four unrelated acoustic conditions were presented) was considered as a factor in order to account for possible training or adaptation effects. We also considered different interactions. The model improvement for adding each predictor was determined by comparing the Akaike Information Criterion (AIC; Akaike, 1974) of the simpler and the more complex model. A reduction in the AIC of at least 2 indicates that the higher complexity of the model with the added predictor is warranted (see Janse, 2009; Baayen and Milin, 2010). By introducing a penalty term for the number of parameters in the model, the AIC guards against improving model likelihood merely by adding too many predictors.
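The AIC-based decision rule can be made concrete with a short Python sketch. The formula AIC = 2k − 2 ln L is the standard definition; the function names and the log-likelihood values are hypothetical and serve only to illustrate the "reduction of at least 2" criterion, not to reproduce the reported models.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def predictor_is_warranted(loglik_simple, k_simple,
                           loglik_complex, k_complex,
                           min_reduction=2.0):
    """True if the more complex model lowers the AIC by at least min_reduction."""
    reduction = aic(loglik_simple, k_simple) - aic(loglik_complex, k_complex)
    return reduction >= min_reduction


# Hypothetical log-likelihoods (not taken from the study).
# Adding one predictor raises ln(L) from -70.0 to -66.5: AIC 148 -> 143.
keep_added = predictor_is_warranted(-70.0, 4, -66.5, 5)
# Adding one predictor raises ln(L) only from -70.0 to -69.5: AIC 148 -> 149.
keep_weak = predictor_is_warranted(-70.0, 4, -69.5, 5)
```

The second case shows the penalty term at work: the likelihood improves slightly, but not enough to offset the cost of the extra parameter.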

# RESULTS

After first data inspection, we determined which individual difference measures change with age (section "How Do Individual Difference Factors Change with Age?"), then determined which of these measures actually explain variance of our speech in noise task (section "Which Factors Relate to Speech Recognition in Noise?"), and finally tested whether age-related differences in the individual difference measures relate to the age-related differences for the speech recognition task (section "Can Age-Related Differences in Lexical Access Efficiency Explain Speech Recognition Scores?"). **Table 1** summarizes the results and descriptive statistics of all variables for both listener groups.

Speech recognition scores were about 25–30% lower for ONH than for YNH listeners. Despite their clinical status of normal hearing, and despite the fact that most of the age-related hearing loss was found in higher frequencies, PTA-4 was about 5 dB higher for ONH than for YNH. Vocabulary size was larger for ONH as indicated by higher PPVT scores and lower LDT error rates for less frequently used words. Lexical access time was slower, and working memory was about two points lower, than for YNH.

**Table 2** provides an overview of the inter-correlations between the factors in **Table 1**. Age, as a continuous variable, significantly correlated with PTA, lexical access time, and LDTLF errors. As the two measures of vocabulary size, WST and PPVT were highly correlated, their combination into a common variable VOCABULARY (see Materials and Methods) was deemed justified.

# How Do Individual Difference Factors Change with Age?

Whether the individual measures listed in **Table 1** significantly differed between ONH and YNH was tested by means of a MANOVA. **Table 3** summarizes the statistical results and indicates how well the individual factors can explain the observed variances that are described in **Table 1**.

The hearing levels of ONH were significantly higher [on average 5 dB; F(1,36) = 16.06; p < 0.001; see **Table 1**] than those of YNH, despite the fact that both groups had hearing levels in the normal range (according to World Health Organization [WHO], 2016 criteria). The difference was mainly triggered by higher thresholds at 4 kHz for older adults. Verbal working memory, as measured by z-transformed RST, did not differ significantly between the groups. We therefore assume that the working memory capacity of YNH and ONH participants varied to roughly the same degree. Both groups had similar levels of education: we could not establish a significant group effect based on participants' highest degree of education.

TABLE 1 | Summary of descriptive statistics for the individual differences measures for younger (YNH, N = 22) and older (ONH, N = 22) listeners with normal hearing thresholds.

N = 44; ‡N = 37; Levene's test for homogeneity of variance; K-S = Kolmogorov–Smirnov test for normal distribution; <sup>∗</sup>p ≤ 0.05.

TABLE 2 | Inter-correlations between predictor measures.

<sup>∗</sup>p < 0.05, <sup>∗∗</sup>p ≤ 0.01, Pearson's r and two-tailed p-values are reported, #N = 39.

An AGEGROUP effect for lexical access time (LDTRT) indicates that older adults were significantly slower in their lexical access than younger adults [F(1,36) = 14.08; p < 0.001]. In addition, older adults made fewer mistakes on infrequent words compared to younger adults, as evidenced by the effect for LDTLF errors [F(1,36) = 23.76; p < 0.001]. This suggests that older adults knew more infrequent words than younger adults. To exclude the possibility that age-related differences in LDTRT and LDTLF error were merely an effect of SAT or response bias instead of an age-linked difference in vocabulary size, we correlated error rates and RT to frequent and infrequent words separately and compared them to our other vocabulary measures (see **Table 4**). A negative correlation between LDT error rates and either PPVT or WST indicates a measure of vocabulary size: the more words are known, the fewer errors should be made during lexical decision. This is independent of an SAT or response bias. As **Table 4** illustrates, Pearson's r was highly negative for both groups (YNH: r = −0.73/−0.71; p < 0.001; ONH: r = −0.52; p < 0.05/r = −0.71; p < 0.001). The smaller correlation of LDTLF error and PPVT in ONH (r = −0.52) was most likely due to a generally high performance on the PPVT (M = 78.77 ± 7.36; about 88%). The correlation was, however, still substantially negative, indicating a similar effect in both age groups. Kaandorp et al. (2015) described an SAT or response bias for their young listeners with normal hearing: relatively fast answers elevated the probability of incorrect decisions. In our dataset, an SAT would be reflected by a strong negative correlation of RT and errors, especially for low frequency words, as these were more error prone. The correlation for ONH in **Table 4** suggests the opposite: a substantial positive correlation for infrequent words (r = 0.53, p < 0.05) indicates that infrequent words were either quickly correctly recognized, or not correctly identified even after a long search time. There may, however, have been an SAT tendency for frequent words in the ONH group. 
The negative correlation of RTs to frequent words and vocabulary size (r = −0.33; p > 0.05) did not reach significance to begin with, and we argue that exclusion of incorrect trials and our use of averaged RT for all words should further reduce any impact of a potential response bias.
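The logic of the SAT check above rests on the sign of a Pearson correlation between RTs and error rates. A minimal Python sketch of that computation follows; the function name and the per-listener values are hypothetical and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-listener mean RTs (ms) and error rates (%) for infrequent words.
rt_ms = [620, 680, 710, 760, 820]
errors_pct = [5, 10, 10, 20, 25]
r_rt_errors = pearson_r(rt_ms, errors_pct)
```

A clearly negative value (fast responders making more errors) would point toward a speed-accuracy tradeoff, whereas a positive value, as in this made-up example, would mean that slower responders also made more errors.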

The composite variable VOCABULARY did not show any AGEGROUP effect [F(1,36) = 0.33; p = 0.57], suggesting that both groups knew about the same number of words tested in the WST and the PPVT. **Figure 2** illustrates the distribution of VOCABULARY size over age, which appears to be nonlinear. Whereas VOCABULARY size increased with age in the younger group, it seemed to decrease with increasing age in the ONH group. This observation was supported by a statistically significant correlation in a cubic distribution (r = 0.53; R<sup>2</sup> = 0.29; p = 0.003), which fit the data better than our initial linear fit (r = 0.12; R<sup>2</sup> < 0.001; p < 1). Because the model is missing data points between 35 and 60 years, using one model to fit all data points may not be appropriate. We circumvented this uncertainty by applying separate regression analyses per age group in addition to the overall analysis to determine a possible association of vocabulary size with speech recognition.

TABLE 3 | Results of MANOVA group effects for individual differences besides speech recognition.


N = 44; ‡N = 39.

# Which Factors Relate to Speech Recognition in Noise?

Given the complex interplay of cognitive factors as indicated by previous studies, it seemed worthwhile to include several factors that have previously been identified to relate to speech recognition processes. To this end, a linear mixed effects regression model including all data points with the z-transformed dependent variable speech recognition score was built. Based on previous research, we expected main effects of AGEGROUP, SNR level, RST, vocabulary size, and lexical access times. We also tested for random effects and possible interactions.

We established LISTENER as a random factor (intercept variance = 0.133, SE = 0.36), allowing for individual intercepts per listener to account for individual differences. **Table 5** summarizes the best-fitting model, which includes data from 39 listeners (22 ONH, 17 YNH). The restricted maximum likelihood criterion at convergence was 133.2.

As expected, AGEGROUP was a significant predictor: the younger group (YNH) performed better in the speech-in-noise recognition task than the older group (ONH; B = 2.16; t = 6.22). Similarly, the SNR level was a strong contributor to speech-in-noise recognition: the lower SNR (−6 dB) predicted lower speech recognition scores than the higher SNR (−4 dB; B = −1.16; t = −12.66). CONDITIONORDER was a strong predictor, indicating that listeners' speech recognition scores were significantly increased by the number of test lists they had heard prior to the one under investigation (B = 0.1; t = 3.33). Lexical access time, as measured by the LDTRT, also emerged as a significant predictor for our overall speech recognition scores: the positive estimate of coefficients (B = 0.67; t = 3.51) suggests that the longer our participants needed to decide whether a given letter combination was an existing word, the better their speech recognition scores. VOCABULARY size was also a significant predictor: the larger the participant's vocabulary size, the better the corresponding speech-in-noise recognition scores (B = 0.297; t = 2.71). Working memory capacity (RST), although a significant predictor by itself, fell below significance level when either LDTRT or VOCABULARY was also considered (B = 0.007; t = 0.64). We could not establish any interactions or random slopes that would have improved the regression model or that would have significantly improved the predictions for speech recognition. List number (i.e., which of the 10 GÖSA test lists) could not be established as a random factor, suggesting that recognition scores can be assumed to be equal across GÖSA test lists. Our final best-fitting model also excluded the non-significant factors education, LDTLF error, and hearing level (PTA-4).
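Written out, a random-intercept model of this kind has the general form shown below. The notation is introduced here for illustration and is an assumption: $y_{ij}$ denotes the z-transformed speech recognition score of listener $i$ in test condition $j$, $u_i$ the listener-specific intercept, and $\varepsilon_{ij}$ the residual; the coding of the categorical predictors is likewise assumed.

```latex
\begin{align*}
y_{ij} &= \beta_0
        + \beta_1\,\mathrm{AGEGROUP}_i
        + \beta_2\,\mathrm{SNR}_{ij}
        + \beta_3\,\mathrm{CONDITIONORDER}_{ij} \\
       &\quad
        + \beta_4\,\mathrm{LDTRT}_i
        + \beta_5\,\mathrm{VOCABULARY}_i
        + \beta_6\,\mathrm{RST}_i
        + u_i + \varepsilon_{ij}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{align*}
```

Under this reading, the estimates reported above correspond to $\beta_1 = 2.16$ (YNH relative to ONH), $\beta_2 = -1.16$ (−6 dB relative to −4 dB SNR), $\beta_3 = 0.1$, $\beta_4 = 0.67$, $\beta_5 = 0.297$, and $\beta_6 = 0.007$.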

# Can Age-Related Differences in Lexical Access Efficiency Explain Speech Recognition Scores?

We expected that the more words people knew, and the faster their lexical access, the better their GÖSA speech-in-noise recognition scores would be. The lmer model showed a large variability in speech recognition scores between younger and older adults and between SNR levels, but we did not detect any interactions between cognitive-linguistic tests and AGEGROUP (see previous section). Furthermore, our MANOVA (see section "How Do Individual Difference Factors Change with Age?") suggested AGEGROUP effects in working memory and lexical access. Based on the non-linear distribution of vocabulary size over age (**Figure 2**), separate linear models were considered sensible. It is likely that different mechanisms or factors depending on AGEGROUP and/or SNR level are involved in speech-in-noise recognition. **Figure 3** shows correlations of vocabulary size, lexical access time, working memory, and speech recognition scores for each AGEGROUP and SNR.

HF, high frequency; LF, low frequency; RT, reaction time; <sup>∗</sup>p ≤ 0.05; ∗∗p ≤ 0.001.

By means of four hierarchical regression models carried out for each group and SNR level, we therefore assessed how vocabulary size, lexical access time, working memory, and age or hearing level modulated speech recognition in the two noise conditions. Each hierarchical regression analysis consisted of four forward models that always followed the same order: Model 1 (M1) included VOCABULARY only, Model 2 (M2) added LDTRT as a measure of lexical access time, and Model 3 (M3) comprised VOCABULARY, LDTRT, and RST. Model 4 (M4) finally added better-ear hearing level (PTA-4) and age as possible additional factors. The overall lmer model had identified age in terms of AGEGROUP, but our findings (see **Table 4**) indicated that smaller age-related differences within the subgroups were possible. The order of inclusion was based on our expectation that efficiency of lexical access does not merely imply quick access, but rather refers to quick access that is relative to the vocabulary size (see Ramscar et al., 2014). The results of the four analyses per subset are reported in **Table 6**. Both the dependent variable speech recognition and the fixed factors used z-transformed scores. Missing values in the YNH group (SNR-4, RST) were replaced by averaged values.
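The stepwise logic of M1 to M4 can be sketched as a sequence of nested ordinary least-squares fits, each compared with its predecessor by an F-change test. The sketch below is a hedged illustration only: the data are randomly generated for 22 hypothetical listeners, the helper names (`zscore`, `solve`, `rss`, `f_change`) are ours, and the comparison of M1 against an intercept-only baseline is an assumption about how the first step is evaluated.

```python
import random
from statistics import mean, stdev

def zscore(values):
    """z-transform: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            f = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def rss(predictors, y):
    """Residual sum of squares of an OLS fit with intercept."""
    X = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def f_change(rss_small, rss_big, n_added, n_obs, k_big):
    """F statistic and (df1, df2) for adding n_added predictors."""
    df2 = n_obs - k_big - 1
    return ((rss_small - rss_big) / n_added) / (rss_big / df2), (n_added, df2)

# Hypothetical data for 22 listeners (NOT the study's data).
rng = random.Random(0)
n = 22
raw = {name: [rng.gauss(0, 1) for _ in range(n)]
       for name in ("VOCABULARY", "LDTRT", "RST", "PTA", "AGE")}
speech = [0.6 * v + 0.4 * l + rng.gauss(0, 1)
          for v, l in zip(raw["VOCABULARY"], raw["LDTRT"])]
z = {name: zscore(vals) for name, vals in raw.items()}
y = zscore(speech)

steps = [("M1", ["VOCABULARY"]),
         ("M2", ["VOCABULARY", "LDTRT"]),
         ("M3", ["VOCABULARY", "LDTRT", "RST"]),
         ("M4", ["VOCABULARY", "LDTRT", "RST", "PTA", "AGE"])]

results = []
prev_rss, prev_k = rss([], y), 0  # intercept-only baseline
for label, names in steps:
    cur_rss = rss([z[nm] for nm in names], y)
    F, df = f_change(prev_rss, cur_rss, len(names) - prev_k, n, len(names))
    results.append((label, F, df, cur_rss))
    print(f"{label}: F change{df} = {F:.2f}")
    prev_rss, prev_k = cur_rss, len(names)
```

With 22 listeners per subset, the residual degrees of freedom for M1 and M2 come out as 20 and 19, which matches the degrees of freedom of the F values reported for the subsets.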

For YNH listening to GÖSA sentences at −4 dB SNR, vocabulary size was the only factor that contributed significantly to the hierarchical linear regression model [M1; F(1,20) = 8.62; p < 0.01]. Lexical access time (M2), working memory (M3), and age or hearing level (M4) did not significantly improve the models. When listening to GÖSA sentences at −6 dB SNR, the combination of vocabulary size and lexical access time provided a significant model improvement [M2; F(1,19) = 4.37; p = 0.05]. For ONH, speech recognition at −4 dB SNR did not seem to relate to any of our factors. We did, however, observe a trend for M2: F(1,19) = 3.78; p = 0.07. At −6 dB SNR, ONH speech recognition scores were also related to the combination of vocabulary size and lexical access time [M2; F(1,19) = 15.81; p = 0.001], as seen for YNH. Inclusion of neither working memory (M3) nor PTA and age (M4) improved the models.
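Readers who wish to relate the reported F statistics to their p-values can do so without a statistics package, because an F statistic with one numerator degree of freedom equals the square of a t statistic. The following sketch relies on that identity and integrates the t density numerically with Simpson's rule; `t_pdf` and `p_from_f` are hypothetical helper names, and the subset labels simply restate the values given above.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def p_from_f(f_value, df2, steps=2000):
    """p-value of F(1, df2), using F(1, df2) = t(df2)^2.

    The mass between -t and +t is integrated with Simpson's rule
    (steps must be even); the p-value is what remains in the tails.
    """
    t = math.sqrt(f_value)
    h = t / steps
    total = t_pdf(0.0, df2) + t_pdf(t, df2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df2)
    central_mass = 2 * total * h / 3  # P(-t < T < t)
    return 1 - central_mass

# F values and residual degrees of freedom as reported in the text.
reported = [
    ("YNH, -4 dB SNR, M1", 8.62, 20),
    ("YNH, -6 dB SNR, M2", 4.37, 19),
    ("ONH, -4 dB SNR, M2", 3.78, 19),
    ("ONH, -6 dB SNR, M2", 15.81, 19),
]
for label, f_value, df2 in reported:
    print(f"{label}: F(1,{df2}) = {f_value} -> p = {p_from_f(f_value, df2):.4f}")
```

For the first entry the resulting p-value lies between 0.005 and 0.01. The identity used here holds only for a single added predictor, so the M4 step, which adds two predictors at once, would require the general F distribution.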

### DISCUSSION

Our aim was to determine whether age-related differences in the efficiency of lexical access relate to performance differences in a standardized test of German sentence recognition in noise. To this end, we addressed three questions:


# Which Individual Difference Measures Pertaining to the Mental Lexicon Change with Age?

Our cognitive-linguistic tests revealed no significant effects of age on vocabulary size, working memory, or education. This finding for working memory stands in opposition to previous studies (e.g., Pichora-Fuller et al., 1995; Desjardins and Doherty, 2013; Schoof and Rosen, 2014; Kidd and Humes, 2015). The reasons for this contradictory finding are not clear. It is possible that the similar education levels of older and younger adults diluted any effect of working memory; the two measures were significantly correlated in our data set. Our hypothesis with respect to working memory did, however, relate mainly to its interplay with speech recognition and/or other measures, not with age per se. Although not a predictor itself, working memory could indirectly affect speech perception, and this could change with age (see, e.g., Banks et al., 2015). We also found an age-group effect of average pure-tone hearing level (PTA-4), despite the fact that all listeners had PTA-4s of 20 dB or better within the normal hearing range (according to World Health Organization [WHO], 2016 criteria). The observed significant 5 dB group difference in PTA-4 was therefore not expected to have a strong influence on GÖSA speech recognition scores.

TABLE 5 | Best-fitting linear mixed-effects regression model.

Dependent variable: GÖSA speech recognition (z-transformed); random effect: LISTENER (intercept; variance = 0.09 ± 0.30); Number of observations = 77; 39 listeners; restricted maximum likelihood criterion at convergence: 133.2; RST, reading span; SNR, signal-to-noise ratio; LDTRT, log-transformed reaction times of words in the Lexical Decision Test; significance levels ∗∗∗p ≤ 0.001, ∗∗p ≤ 0.01, <sup>∗</sup>p ≤ 0.05.

The similarities in vocabulary size that we observed between younger and older listeners also contradict previous studies (e.g., MacKay and Burke, 1990; Salthouse, 2004; Kavé and Yafé, 2014; Kavé and Halamish, 2015; Keuleers et al., 2015). Performance on our vocabulary tests, especially the PPVT, was generally high. The similarities between the groups may therefore have arisen from our use of the standardized vocabulary tests, in which most listeners performed well (see Ramscar et al., 2014 for a similar reasoning). They probably also arose in part from an inappropriate linear model. Our composite VOCABULARY measure showed a clear non-linear distribution, in which vocabulary size increased linearly with increasing age for the younger group but decreased in the older group. This observation supports previous observations by Salthouse (2004), Kavé et al. (2010), and Hartshorne and Germine (2015) that vocabulary size may decrease after some peak.

TABLE 6 | Results of four hierarchical linear regression models per subset of listener group and SNR level.

LDT, log-transformed reaction times of words in the Lexical Decision Test; WM, working memory (z-transformed reading span scores); PTA, z-transformed averaged hearing level (0.5–4 kHz); N = 22 per group; missing values for YNH were replaced by mean values. Significant F changes are shaded, the trend is hatched; significance levels ∗∗∗p ≤ 0.001, ∗∗p ≤ 0.01, <sup>∗</sup>p ≤ 0.05.

Notably, we found strong age-group-related effects for lexical decision, both for errors on infrequent words and for lexical access time. Younger adults were faster than older adults, which is compatible with observations by Kavé and Yafé (2014). Age-related effects in a time-sensitive measure such as our visually presented lexical decision test could in principle result from general processing or motoric speed (see, e.g., Janse, 2009; Besser et al., 2012; Füllgrabe et al., 2015). The likelihood of simple speed as an exclusive explanation is comparatively small, however, because the relative RT difference between frequent and infrequent words was comparable in both groups. Older listeners also made fewer mistakes on infrequent words, which suggests better vocabulary knowledge. Our data are congruent with predictions made by the Transmission Deficit hypothesis (TDH) proposed by Burke et al. (1991). Although the TDH was proposed for age-linked word retrieval difficulties in speech production, it is based on MacKay's (1987) Node Structure theory, a connectionist approach with applications in both speech production and perception. Connections between nodes in a network are reinforced through frequent, persistent, and recent exposure. This could explain why older adults in our study made fewer errors in recognizing infrequent words than younger adults (see also Kavé and Halamish, 2015). The TDH also postulates that connections between nodes may weaken with age, resulting in age-related word retrieval difficulties. Support for this assumption comes from production studies (e.g., Burke et al., 1991; Kemper and Sumner, 2001; Kavé et al., 2010). ONH's slowed lexical access time supports the TDH assumption of an age-linked weakening of connections between word form and the corresponding meaning of a word in speech recognition. An efficiency reduction in lexical access can thus be explained by weakening of connections in aging, whereas better performance in recognizing infrequent words is explained by reinforced connections as a result of experience.

# Which Factors Are Relevant Predictors of Speech-in-Noise Recognition Performance?

We had expected a combination of age group, SNR level, pure-tone hearing thresholds, working memory, vocabulary size, and lexical access time to predict speech recognition scores. Our best-fitting lmer model showed that speech recognition scores for the German everyday-sentence-test were predicted by age group, SNR level, condition order, lexical access time, and vocabulary size. We discuss the implications of each individual predictor below.

Younger adults generally scored about 25–30% better on the speech recognition test than older adults at both fixed SNRs (−4 and −6 dB). This finding follows a long list of similar observations (e.g., Dubno et al., 1984; CHABA, 1988; Pichora-Fuller et al., 1995; Pichora-Fuller and Souza, 2003; Füllgrabe et al., 2015; but see Schoof and Rosen, 2014 for relativization). Wayne and Johnsrude (2015) suggested that a generic age effect per se is unlikely to cause deterioration in speech perception performance. It is more likely that age mediates other perceptual (e.g., suprathreshold processing; Füllgrabe et al., 2015) and/or cognitive measures (e.g., Schoof and Rosen, 2014; Wingfield et al., 2015). Pure-tone hearing threshold was not a significant predictor in our model, but was significantly correlated with age, indicating decreasing hearing acuity with increasing age. It is possible that speech recognition processes employed by older adults may be affected by neural changes that pertain to temporal aspects, such as processing speed or temporal fine structure in auditory perception (e.g., Füllgrabe et al., 2015). The coding of information in the auditory nerve may not be as good as in young NH listeners, due to loss of synapses or degeneration of neurons with increasing age. Poor neural representation may arguably lead to poor speech recognition in noise (e.g., Anderson et al., 2011). However, these changes cannot be quantified with the pure-tone threshold, since only a few functioning neurons are required to detect a single tone in quiet (e.g., Stone and Moore, 2014; Bharadwaj et al., 2015; Füllgrabe et al., 2015).

Not surprisingly, GÖSA speech recognition scores increased with the higher SNR level, a fact that can be explained by the masking properties of the noise: the higher the SNR, the smaller the effect of masking. This finding is predicted by a number of speech recognition models, such as the Speech Intelligibility Index (ANSI, 1997) or the Speech Transmission Index (Steeneken and Houtgast, 1980), and supports a number of previous findings (e.g., Plomp and Mimpen, 1979; Kollmeier and Wesselkamp, 1997). A result that was unexpected but not surprising was that the order of list presentation seemed to play a role in speech-in-noise recognition. This suggests an effect of training, or rather perceptual learning, despite the fact that listeners never heard the same acoustic setting, or the same sentence more than once. Neger et al. (2014) dissociated perceptual and statistical learning of understanding noise-vocoded speech in groups of older and younger adults. Both groups showed perceptual learning.

More interestingly, our best-fitting model suggested that lexical access time (LDTRT) was a very relevant predictor of speech recognition (see Kaandorp et al., 2015 for a similar observation in Dutch). The longer participants needed to determine whether a letter combination was an existing word, the better their speech recognition scores were (especially at −6 dB SNR). This observation may seem somewhat counterintuitive. Both the ELU model (Rönnberg et al., 2013) and the cognitive spare capacity hypothesis (Mishra et al., 2013, 2014) predict the opposite: quick lexical access could be construed as more automatic lexical access with fewer mismatches that would require the engagement of the explicit processing loop.

Vocabulary size also predicted speech-in-noise recognition: the more words a listener knew, the better his or her speech-in-noise recognition scores were. Our data thus follow a number of studies in other languages that found a comparable relation between word knowledge or vocabulary size and speech recognition (e.g., Pichora-Fuller et al., 1995; McAuliffe et al., 2013; Benard et al., 2014; Banks et al., 2015; Kaandorp et al., 2015).

We propose that the correlation of slower lexical access time and better speech recognition scores needs to be interpreted together with vocabulary size. Both factors contributed to speech perception independently; but both our overall lmer model and our hierarchical models per subset suggest that the combination of the two factors relates to the speech recognition data. Following the reasoning proposed by Ramscar et al. (2014), we argue that a larger lexicon may require longer search and hence access times because more competitors need to be evaluated. This may have been the case in YNH, where a slight positive relation between lexical access time for frequent words and vocabulary size suggests that larger lexicons tended toward longer searches (see **Table 4**). But lexical access time to infrequent words decreased with larger vocabulary size. In ONH, on the other hand, lexical access time decreased with increasing vocabulary size for both frequent and infrequent words. One explanation for this unexpected observation of group and frequency-related lexical access times is that a larger mental lexicon also necessitates a more detailed representation than a smaller lexicon to facilitate distinction and correct identification (see argumentation above, McAuliffe et al., 2013; Ramscar et al., 2014; Kaandorp et al., 2015). Following word recognition accounts that favor exemplar-based approaches (see Weber and Scharenborg, 2012 for an overview of different models), a larger lexicon is likely to entail more instances or variants per word. According to the TDH (Burke et al., 1991; see above), frequent or extended exposure (e.g., with age) to a word results in strengthening of the connections from word form to meaning. This would explain why our RTs to frequent words were always faster than RTs to infrequent words. A larger lexicon would therefore require longer search times for some words or some populations, but at the same time result in a higher chance of matching the input with one of the exemplars (see also Schmidtke, 2014; Kavé and Halamish, 2015 for a similar line of argumentation in bilinguals).

Working memory by itself was a significant predictor for German speech recognition but became insignificant once vocabulary size or lexical access was accounted for. Still, including working memory improved the fit of our lmer model. Our observation that working memory did not contribute strongly to speech recognition in noise does not quite follow the assumptions of the ELU model (Rönnberg et al., 2013) or the cognitive spare capacity hypothesis (Mishra et al., 2013, 2014). Our findings are consistent, however, with other studies that have also failed to identify working memory as a strong predictor, especially when other predictors were tested as well (e.g., Banks et al., 2015; Füllgrabe et al., 2015). An alternative option is more likely: the inter-correlations of working memory with lexical access times and vocabulary size, together with the fact that the latter were substantial predictors for speech recognition scores in our listeners, may arguably also suggest an indirect role of working memory (see Banks et al., 2015 for more compelling evidence for such a claim).

# Can Age-Related Differences in Lexical Access Efficiency Explain Differences in Speech Recognition Scores?

Given the age-related differences in speech recognition scores and in lexical access times, and the distribution of vocabulary size across the adult lifespan, we hypothesized that the relation between the cognitive-linguistic factors and speech recognition may be different for younger and older listeners. To test this, we calculated hierarchical regression models for each group. As noted above, the most likely speech recognition strategy should involve efficiency of lexical access, i.e., quick lexical access relative to vocabulary size. Speech recognition tests do, however, also allow a second 'offline' strategy: listeners can conceivably simply listen to the sentence and only start processing and 'matching' the acoustic signal with a lexical entry in their lexicon after the sentence has been completed. This strategy is likely to engage relatively more working memory because successful recall is only possible if the sentence can be kept in the phonological loop for rehearsal and matching (cf. Rönnberg et al., 2013).

For YNH, different SNRs seem to invoke different speech recognition mechanisms: at the better SNR (−4 dB), only vocabulary size played a role. This could be because speech intelligibility was relatively high in this condition for YNH (85.4% ± 10.6). Crucially, this condition showed the highest, albeit non-significant, correlation of working memory (RST) and speech recognition scores: listeners may have used the offline speech recognition strategy in this relatively easy condition. This observation is similar to the latent relation of working memory on speech recognition scores that was reported by Banks et al. (2015). At the lower SNR (−6 dB), the combination of vocabulary size and lexical access time was important for speech recognition. This indicates that once perception becomes more difficult—as evidenced by lower speech recognition scores—efficient lexical access becomes more relevant.

For ONH, the picture is similar to YNH, but with slight differences: at the lower SNR (−6 dB), the combination of vocabulary size and lexical access time (M2), that is, efficiency of lexical access, explains speech recognition, just as in the YNH group. At the better SNR (−4 dB), none of our models including cognitive-linguistic factors, age, or hearing level could reliably model the subset data. We did, however, observe a trend for the combination of vocabulary size and lexical access time, suggesting a similar tendency for efficient lexical access as for YNH at the lower SNR (−6 dB). Notably, ONH did not show any sign of the alternative 'offline' processing mechanism that we observed at −4 dB SNR in YNH. It thus appears that the speech recognition mechanisms used in noise change only slightly with age. Speech-in-noise recognition scores were nevertheless much lower for older compared to younger listeners. But neither working memory, as suggested by Van der Linden et al. (1994), Pichora-Fuller et al. (1995), and Benichov et al. (2012), nor pure-tone averages could account for this difference. It seems that efficiency of lexical access may be the best explanation for speech recognition in adverse conditions.

In summary, our results suggest that older adults with normal hearing apply mechanisms in speech recognition in noise that are very similar to those used by younger adults. It seems that lexical access time, possibly mediated by vocabulary size, is the most relevant correlate (or leading predictor). Although ONH were, on average, at least as good in their vocabulary size and working memory capacity as YNH, their lexical access times and speech recognition scores were worse. If there is indeed an intricate interrelation between vocabulary size, working memory, and lexical access, as suggested by Banks et al. (2015), then a significant reduction in one of the three could arguably explain the poorer speech recognition scores. The fact that speech recognition in noise (at least for the lower SNR) in both YNH and ONH was modulated by the combination of vocabulary size and lexical access time (see the models M2 in **Table 6**) suggests that not accessing speed but accessing efficiency may be a relevant predictor. Following the argumentation of Ramscar et al. (2014), we assume that people with a larger mental lexicon require relatively longer search (accessing) times compared to people with smaller lexicons. If, however, people with large vocabularies are also fast in lexical access, which means their accessing efficiency is high, then this should result in a processing benefit and possibly better speech recognition results. If vocabulary size is somewhat comparable between groups but lexical decision time is considerably slowed in older adults, efficiency of lexical access may be affected. Our findings suggest that efficiency of lexical access declines with age, and this decline results in poorer speech recognition scores for ONH.

# Study Limitations

There are some noteworthy limitations to this study. Firstly, we observed a ceiling effect for the PPVT, especially in older adults, which could potentially have led to an underestimation of the role of vocabulary knowledge. We countered this effect by using a composite VOCABULARY factor that included both PPVT and WST scores, and by accounting for the LDT error rate for infrequent words. The latter were argued to reflect vocabulary size as well. Nevertheless, the influence of the composite VOCABULARY variable may possibly change as individual differences increase (but see Schmidtke, 2014; Kaandorp et al., 2015). Secondly, our measure for lexical access time was based on the simple RTs for existing words. These include the actual access, the search, but also decision times and general processing times, including the motoric reaction of pressing a button. It is possible that the age-related differences partially pertain to general processing or motoric speed components. Future studies should therefore include a separate measure of processing speed to exclude motor or general processing speed and to allow a more "linguistic" interpretation. Thirdly, our two age groups were not completely equal in their hearing thresholds. Although preferable, a perfect match in pure-tone average was not feasible. We therefore cannot completely rule out any influence of hearing status (supra-threshold or otherwise), even though hearing level never turned up as a significant predictor for speech-in-noise recognition. Since the focus of this study was mostly on aspects of speech recognition and the lexicon, we did not investigate any aspects of temporal or spectral coding of the signal and their relation to auditory processing. These latter aspects are, however, likely to decline with age as well, and have been shown to relate to reductions in speech-in-noise recognition scores (e.g., Füllgrabe et al., 2015; see also Rönnberg et al., 2013).

### AUTHOR CONTRIBUTIONS


RC and AW conceptualized the research question and the experimental design, and conducted the measurements. RC analyzed the data and drafted the manuscript; RC and AW interpreted the results and wrote the manuscript. ER approved the experimental design, critically reviewed, and significantly contributed to the manuscript. BK provided the framework for speech recognition tests and critically reviewed the manuscript. All authors approved the final version of the manuscript for publication. All authors agree to be accountable for all aspects of the work and for ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

# FUNDING

This work was supported by the Cluster of Excellence EXC 1077/1 "Hearing4all" funded by the German Research Council (DFG).

# ACKNOWLEDGMENTS

The authors thank Jennifer Trümpler and the reviewers for very helpful comments and suggestions on earlier drafts of the manuscript.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer CS and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2016 Carroll, Warzybok, Kollmeier and Ruigendijk. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Auditory Perceptual Learning in Adults with and without Age-Related Hearing Loss

### Hanin Karawani<sup>1</sup>\*, Tali Bitan<sup>2</sup>, Joseph Attias<sup>1</sup> and Karen Banai<sup>1</sup>

<sup>1</sup> The Department of Communication Sciences and Disorders, Faculty of Social Welfare and Health Sciences, University of Haifa, Haifa, Israel, <sup>2</sup> The Department of Psychology, Faculty of Social Sciences, University of Haifa, Haifa, Israel

Introduction: Speech recognition in adverse listening conditions becomes more difficult as we age, particularly for individuals with age-related hearing loss (ARHL). Whether these difficulties can be eased with training remains debated, because it is not clear whether the outcomes are sufficiently general to be of use outside of the training context. The aim of the current study was to compare training-induced learning and generalization between normal-hearing older adults and those with ARHL.

Methods: Fifty-six listeners (60–72 y/o), 35 participants with ARHL, and 21 normal hearing adults participated in the study. The study design was a cross over design with three groups (immediate-training, delayed-training, and no-training group). Trained participants received 13 sessions of home-based auditory training over the course of 4 weeks. Three adverse listening conditions were targeted: (1) Speech-in-noise, (2) time compressed speech, and (3) competing speakers, and the outcomes of training were compared between normal and ARHL groups. Pre- and post-test sessions were completed by all participants. Outcome measures included tests on all of the trained conditions as well as on a series of untrained conditions designed to assess the transfer of learning to other speech and non-speech conditions.

Results: Significant improvements on all trained conditions were observed in both ARHL and normal-hearing groups over the course of training. Normal hearing participants learned more than participants with ARHL in the speech-in-noise condition, but showed similar patterns of learning in the other conditions. Greater pre- to post-test changes were observed in trained than in untrained listeners on all trained conditions. In addition, the ability of trained listeners from the ARHL group to discriminate minimally different pseudowords in noise also improved with training.

Conclusions: ARHL did not preclude auditory perceptual learning but there was little generalization to untrained conditions. We suggest that most training-related changes occurred at higher level task-specific cognitive processes in both groups. However, these were enhanced by high quality perceptual representations in the normal-hearing group. In contrast, some training-related changes have also occurred at the level of phonemic representations in the ARHL group, consistent with an interaction between bottom-up and top-down processes.

Keywords: presbycusis, age-related hearing loss, auditory training, speech in noise, time-compressed speech, perceptual learning

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Carine Signoret, Linnaeus Centre HEAD, Sweden Larry E. Humes, Indiana University, USA

> \*Correspondence: Hanin Karawani hanin7@gmail.com

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 30 September 2015 Accepted: 31 December 2015 Published: 03 February 2016

### Citation:

Karawani H, Bitan T, Attias J and Banai K (2016) Auditory Perceptual Learning in Adults with and without Age-Related Hearing Loss. Front. Psychol. 6:2066. doi: 10.3389/fpsyg.2015.02066

# INTRODUCTION

Speech perception and communication in noisy environments become more difficult as we age. Specifically, older adults often experience considerable difficulties when listening to speech in the presence of background noise, to competing speech signals or to rapid speech (Pichora-Fuller et al., 1995). Because these conditions are present in everyday situations, many older adults find it difficult to understand speech in everyday life. These difficulties are often exacerbated by age-related hearing loss (ARHL; Fitzgibbons and Gordon-Salant, 2010), which is one of the most prevalent chronic health conditions among the elderly (Yueh et al., 2003). ARHL is estimated to affect more than 25% of the population aged 60 or more and its incidence is expected to increase with the aging of the population (Roth et al., 2011). Although ARHL has been shown to be the major cause of these speech perception difficulties, cognitive functions such as memory and attention also contribute to them (Pichora-Fuller, 2008; Humes and Dubno, 2010).

Individuals with sensorineural hearing loss can regain some lost auditory function with the help of hearing aids (Gil and Iorio, 2010; Lavie et al., 2014, 2015); however, this is often insufficient when speech perception under non-optimal conditions is considered (Kochkin, 2000; Gordon-Salant, 2005). Therefore, attempts are being made to supplement the rehabilitation process with patient-centered education, counseling, and auditory training, which were hypothesized to help listeners compensate for degradation in the auditory signal and improve communication (Sweetow and Sabes, 2006). In this vein, a number of studies have suggested that auditory training may be beneficial for individuals with ARHL (Sweetow and Palmer, 2005; Stecker et al., 2006; Sweetow and Sabes, 2006, 2007; Sweetow and Henderson Sabes, 2010; Lavie et al., 2013). Studies with older adults have shown that even participants with normal pure-tone and speech perception thresholds often report that listening in everyday life has become effortful (Schneider et al., 2002). Thus, the current study specifically asks whether a home-based auditory training approach that mimics the challenges of real-world listening can improve speech perception in normal-hearing and in hearing-impaired older adults, and whether the patterns of learning and generalization are influenced by the presence of a hearing impairment.

### Speech Processing in Younger and Older Adults

Speech processing involves not only the perception and identification of individual speech sounds and words, but also the integration of successively heard words, phrases, and sentences to achieve a coherent and accurate representation of the meaning of the message being communicated. In this process, distinct (but interactive) neural networks process both the acoustic structure and the meaning of speech. The end result, mapping sounds to meaning, relies on matching the output from acoustic and phonetic analyses with stored lexical representations (Davis and Johnsrude, 2007; Hickok and Poeppel, 2007). Thus, accurate speech processing requires the use of voice and emotion cues, the use of silent gaps and duration cues to recognize phonemes, the use of temporal envelope patterns related to the rate of speech and spectral information, and access to and retrieval of semantic information (Price et al., 2005; Pichora-Fuller and Macdonald, 2008). Moreover, cognitive processes such as working memory, selective attention, and the speed at which information can be processed also affect speech understanding (Pichora-Fuller and Singh, 2006). The use of knowledge and semantic context (e.g., phonological and semantic knowledge of phonemes, words, and sentences) is known to enhance recall and comprehension in older and younger adults (Wingfield and Stine-Morrow, 2000; Pichora-Fuller, 2008; Tun et al., 2012).

According to several theoretical accounts, the relative contributions of lower-level sensory and perceptual processes or representations and higher-level cognitive processes (e.g., working memory, semantic processes) to speech recognition may differ between optimal and unfavorable listening conditions [e.g., the Ease of Language Understanding Model, ELU (Rönnberg et al., 2013) or the Reverse Hierarchy Theory, RHT (Ahissar et al., 2009)]. According to the ELU model, incoming speech is initially processed automatically and a phonological representation of the signal is created. Word recognition (or "lexical access") should occur if this automatically created representation matches an existing representation in long-term memory. However, when an automatically created representation does not match an existing representation in long-term memory, for example when the signal is degraded or when sensory processing of the signal is less precise due to hearing loss, an explicit and effortful working memory process is engaged in an attempt to compensate for the mismatch between the phonological representation and long-term memory, prolonging the recognition process (Rönnberg et al., 2008, 2013). Therefore, under difficult listening conditions or when hearing is impaired, listeners are more likely than otherwise to engage in top-down processes that would allow semantic or real-world knowledge to influence speech recognition through working memory or attentional processes (Rönnberg et al., 2013).

Lower-level processes are compromised to a greater extent in older adults with ARHL than in normal-hearing older adults. For example, older adults with presbycusis required more favorable signal-to-noise ratios (SNRs) to benefit from the ability to predict sentence-final words from sentence context than older adults with normal hearing, even though the magnitude of the context effect was similar in the two groups (Pichora-Fuller et al., 1995). This example also suggests that hearing impairment does not necessarily interfere with the ability to engage top-down processes to support listening. Rather, studies have shown that as supra-threshold auditory processing gradually declines over decades, the brain reorganizes so that more frontal brain areas, including those serving semantic processing and working memory, are activated to a greater extent in older compared to younger brains in conditions in which the performance of older and younger adults is matched (Wingfield and Grossman, 2006; Peelle et al., 2011). As speech becomes less intelligible, processing relies more on top-down influences from frontal areas (Pichora-Fuller et al., 1995; Zekveld et al., 2006). A similar conclusion was reached in an MRI study that found a higher correlation between the volume of frontal areas and speech-in-noise perception in older adults compared to normal-hearing young adults (Wong et al., 2010). Despite this compensatory engagement of higher-level brain areas, older adults experience disproportionate difficulties in understanding speech in ecological conditions that include suboptimal noise conditions and fast talkers. Therefore, successful auditory training in this population should foster an effective balance between bottom-up, signal-based processes, and top-down knowledge-based processes (Pichora-Fuller and Levitt, 2012).

### Auditory Training

Auditory training for the purpose of hearing rehabilitation involves active listening to auditory stimuli and aims to improve the ability of participants to comply with the demands of non-optimal listening environments (Boothroyd, 2007; Henderson Sabes and Sweetow, 2007). Home-based auditory training programs were developed to allow adults with hearing loss to engage in perceptual learning, which in turn may lead to better speech understanding and improved communication ability (Sweetow and Sabes, 2007). The consequences of training specific auditory skills are often specific to the trained stimuli (e.g., Wright et al., 1997; Cainer et al., 2008). In addition training outcomes also depend on the trained task (Amitay et al., 2006), suggesting that plasticity is also mediated by cognitive task-specific mechanisms rather than by only the sensory attributes of the trained stimuli. Other factors such as feedback (Amitay et al., 2010) and motivation (Amitay et al., 2010; Levitt et al., 2011; Ferguson and Henshaw, 2015a) likewise influence training outcomes.

Two aspects of learning were typically quantified to document the effects of training on listening skills in the context of hearing rehabilitation: "on-task" learning, defined as improvements on the trained tasks, and "generalization", defined as improvements in tasks that are not trained directly. On-task learning following auditory training in older adults with ARHL is usually robust; however, generalization of learning to untrained tasks or stimuli that were not experienced directly during training does not always occur, or is very small (see Henshaw and Ferguson, 2013 for a similar use of the terms). Robust effects of "on-task learning" were previously reported for syllables and words in older adults with hearing loss (Burk et al., 2006; Stecker et al., 2006; Burk and Humes, 2008; Humes et al., 2009; Ferguson et al., 2014). Burk and colleagues examined the effect of word-based auditory training and focused on word-recognition abilities within a background noise with varied words and talkers. Such training on perceptual distinctions assumes that by resolving lower level sensory issues through training, listening and communication should improve in a bottom-up manner. In their studies, improvements on the trained task were maintained over an extended period of time; however, generalization to untrained words did not occur (Burk et al., 2006). Although there is evidence to suggest that training using multiple talkers promotes greater word-in-noise learning that generalizes to unfamiliar speakers (Burk et al., 2006), such training yields learning that is specific to the content of the trained stimuli and does not always generalize to unfamiliar words, nor to familiar words embedded in unfamiliar sentences (Humes et al., 2009).

Other studies suggest that training in ecological tasks, with whole sentences which emphasize top-down processes (such as generating semantic expectations, requiring working memory, and selective attention), might result in wider generalization than training that emphasizes specific auditory capacities (Sweetow and Sabes, 2006; Smith et al., 2009; Anderson et al., 2013a,b). Two home-based training programs were used in previous studies: (1) Brain Fitness™ (Smith et al., 2009), which consists of modules designed to increase the speed and accuracy of auditory processing, and (2) "listening and communication enhancement", LACE™ (Sweetow and Sabes, 2006), which provides a variety of interactive and adaptive tasks in three categories: degraded speech, cognitive skills, and communication strategies. In the latter program, listeners train on speech recognition in passages on a wide variety of topics, in conditions such as competing speakers, time-compressed speech and speech-in-noise, that mimic the challenges of real-world listening. The overall goal of such ecological training approaches is to improve sensory function, and engage higher level processes that support sensory processing (Schneider and Pichora-Fuller, 2000).

Previous studies evaluated the effects of home-based ecological training on participants with ARHL (Sweetow and Sabes, 2006; Anderson et al., 2013a) and on participants with normal hearing (Anderson et al., 2013b). They found that training changed the neural processing of speech sounds and promoted cognitive and perceptual skills. In one of these studies (Anderson et al., 2013a), participants improved in both physiological (brainstem timing) and perceptual assessments (speech-in-noise perception, short-term memory and speech processing) following 40 sessions of computerized home-based auditory training. In another study, Anderson et al. (2013b) compared learning in the ARHL group to normal-hearing adults, and found significant training-induced changes in speech-in-noise perception specific to the hearing impaired trained group, with no corresponding changes in the normal-hearing group. Sweetow and Sabes (2006) tested older adult hearing-aid users on trained and untrained measures of speech-in-noise. They reported significant on-task learning effects but only small effects of generalization, and only in one of the two untrained tasks with sentence stimuli.

In the current study we trained listeners on speech perception tasks similar to the ecological training programs used in previous studies (e.g., Sweetow and Sabes, 2006; Song et al., 2012). Passages on a wide array of topics were presented in degraded form (noise or time-compression) or in parallel to a competing talker. Listeners had to answer content-related questions and the level of acoustic difficulty was adapted based on their responses. We chose this approach because evidence from normal-hearing individuals and a few auditory rehabilitation studies shows that emphasizing top-down processes (selective attention, working memory, use of linguistic and world knowledge) during training is more effective in terms of generalization than training on basic acoustic features (Borg, 2000; Sweetow and Sabes, 2006; Moore, 2007). Whole sentences are expected to provide top-down lexical feedback in the perceptual learning process (Davis et al., 2005). Thus, the listener may learn to use their stored semantic knowledge about the topic and about language, as well as visually presented verbal information, to facilitate their perception of the "interrupted" acoustic signal. Finally, training on whole sentences is expected to motivate participants and promote compliance with the training regimen.

We focused on adults with mild-to-moderate sensorineural hearing loss who were experiencing hearing difficulties, but had not yet sought intervention for their hearing loss, as well as on normal-hearing adults. To the best of our knowledge, the present study is one of the first studies to conduct home-based training research in everyday listening situations; in fact it is the first of its kind in relation to Hebrew speakers. We expect that training-induced behavioral gains will be observed. Moreover, perceptual learning studies usually ask if learning simple auditory skills can generalize to more complex ones. In the current study, generalization to untrained speech tasks was examined in normal-hearing older adults and those with ARHL. We also ask whether training on complex sounds generalizes to simple acoustic tasks, by testing participants on non-verbal auditory discrimination tasks. The aims of the current study were (1) to examine the efficacy of a home-based auditory training scheme in improving speech perception abilities among normal-hearing older adults and among hearing-impaired non-aided older adults, (2) to compare the patterns of training-induced learning between normal-hearing adults and those with ARHL, and (3) to assess learning on the trained tasks and transfer to other untrained (speech and non-speech) tasks to study generalization.

# MATERIALS AND METHODS

### Participants

Seventy-one adults (44 females) aged 60–71 years (mean age = 66.5 years ± 4 months) with no history of neurological disorders were recruited for this study. Participants were recruited from the Institute for Audiology and Clinical Neurophysiology at the Interdisciplinary Clinical Center at the University of Haifa, from the Hearing and Speech Center at the Rambam Health Care Campus and through advertisements at the University and Rambam. Recruitment criteria included age 60–72 years, normal hearing or hearing impairment with no neurologic disorders, and Hebrew as a first language. Exclusions from the study were on the basis of audiometric results of asymmetric or conductive hearing loss (n = 4), being an existing hearing aid user (n = 5), unwillingness to participate in post-test sessions (n = 4), and inability to control a computer mouse (n = 2). Participants provided informed consent and were compensated for their time. All procedures were approved by the Faculty of Social Welfare and Health Sciences, University of Haifa Review Board (approval number 197/12). Pure-tone audiometric thresholds were obtained bilaterally for air conduction at octave frequencies 250–8000 Hz and at 3000 and 6000 Hz and for bone conduction at octave frequencies 250–4000 Hz.

A total of 56 participants (35 females) met the inclusion criteria reported above and their data is included in the analyses reported in this manuscript. Based on audiometric thresholds, participants were divided into normal-hearing (NH, mean age = 64.6 years ± 4.3, n = 21) and ARHL (mean age = 67.6 years ± 3.3, n = 35) groups; no significant age difference was found between the groups [t(54) = 0.7, p = 0.59]. The normal-hearing participants had hearing thresholds ≤ 25 dBHL through 6000 Hz and ≤ 30 dBHL through 8000 Hz. Participants with ARHL had symmetrical mild to moderate hearing loss with hearing thresholds ≤ 60 dBHL through 8000 Hz, and did not use hearing aids either in the past or at the time of the study. Audiograms for both groups are shown in **Figure 1**. No significant differences between the right and left ears were found in the pure-tone average of 500, 1000, and 2000 Hz in air conduction thresholds [t(110) = 0.6, p = 0.54]; therefore, an average of both ears is shown in **Figure 1**. In addition, there were no significant differences in bone conduction thresholds between right and left ears [t(110) = 1.03, p = 0.305]. All participants received standardized cognitive tests taken from the Wechsler Abbreviated Scale of Intelligence (WASI; Similarities and Block Design) and the Digit span memory subtest from the Wechsler Intelligence Test (Wechsler, 1997) and showed age-normal cognitive function.
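The pure-tone average and the threshold criteria described above can be illustrated with a short sketch. It is not the authors' procedure: the function names, the dictionary layout, and the label for audiograms that meet neither criterion are hypothetical, and the reading of the NH criterion as "≤ 25 dB HL up to 6000 Hz and ≤ 30 dB HL at 8000 Hz" is an assumption.

```python
# Hypothetical sketch: thresholds for one ear as {frequency in Hz: dB HL}.
PTA_FREQS = (500, 1000, 2000)  # frequencies named in the text


def pure_tone_average(thresholds):
    """Mean air-conduction threshold at 500, 1000, and 2000 Hz."""
    return sum(thresholds[f] for f in PTA_FREQS) / len(PTA_FREQS)


def classify(thresholds):
    """Assumed reading of the grouping criteria; not the authors' code."""
    # NH: <= 25 dB HL up to 6000 Hz and <= 30 dB HL at 8000 Hz (assumption).
    if all(level <= (30 if freq == 8000 else 25)
           for freq, level in thresholds.items()):
        return "NH"
    # ARHL: thresholds <= 60 dB HL through 8000 Hz.
    if all(level <= 60 for level in thresholds.values()):
        return "ARHL"
    return "outside both ranges"  # hypothetical label


ear = {250: 10, 500: 15, 1000: 20, 2000: 25,
       3000: 25, 4000: 25, 6000: 25, 8000: 30}
print(pure_tone_average(ear), classify(ear))
```

The sketch ignores the symmetry requirement and the bone-conduction measurements, which were also part of the inclusion criteria.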

### Study Design

The study used a randomized, controlled, quasi-crossover design similar in concept to Ferguson et al. (2014). Participants completed three test sessions (see **Figure 2**). Subgroups of participants underwent auditory training between different test sessions such that overall, participants served as their own untrained controls. All participants (NH and ARHL) underwent a series of tests in session 1 (t1), and then were randomly assigned to either complete the auditory-based training phase immediately (immediate-training, mean age = 65 ± 4.3, n = 24; NH = 10, ARHL = 14) or to a waiting phase (delayed-training, mean age = 66 ± 3.1, n = 22; NH = 11, ARHL = 11). Another group of participants with ARHL did not train at all (no-training ARHL, mean age = 67 ± 3.4, n = 10) and participated in two testing sessions only, see **Figure 2**. Four weeks after t1 all participants underwent another session (t2). As shown in **Figure 2**, training occurred between times t1 and t2 for the immediate-training participants and between times t2 and t3 for the delayed-training participants, and the retention period occurred between times t2 and t3 for immediate-training participants. Training data was collected from both training periods (t1–t2, t2–t3); a total of 46 participants underwent the training phase (introduced in the Sections Materials and Methods and Results as trained NH, n = 21 and trained ARHL, n = 25). Data from the retention period will not be discussed in the current paper.

FIGURE 1 | Audiogram. Mean air conduction hearing thresholds across ears and participants are plotted for all Normal-Hearing (NH) and Age-Related Hearing Loss (ARHL) participants. Error bars represent standard deviations (SDs).

Details of test sessions for each group: The three testing sessions were conducted at the University of Haifa and included tests on the trained tasks to assess the training effect (on-task learning), and on a series of untrained tasks to assess generalization. As shown in **Figure 2**, the immediate-training and no-training groups were tested on the trained and untrained tasks in t1 (pre-test) and in t2 (post-test). For the immediate-training group, t3 also included tests on the untrained tasks to assess retention (which will not be discussed in the current paper). The delayed-training group was tested only on the untrained tasks in t1, and was then tested on both trained and untrained tasks in t2 and t3.

As shown in **Table 1**, demographic characteristics and indices of cognitive function (assessed at t1) were similar across all five NH and ARHL groups [F(4, 51) ≤ 1.4, p ≥ 0.25]. Likewise, demographic and cognitive characteristics were similar across the immediate-training, delayed-training and no-training groups [F(2, 53) ≤ 0.92, p ≥ 0.86].

### Training Protocol and Tasks

The trained groups completed 13 sessions of home-based auditory training, each lasting 20–30 min, spread over 4 weeks. The training program was designed to improve speech perception in three listening conditions: (A) speech-in-noise, (B) time-compressed speech, and (C) competing speaker. The training tasks were similar in principle to the training procedure introduced in Sweetow and Sabes (2006) and Song et al. (2012). Each session was devoted to one condition, which was practiced for three blocks, except for the last session, which included training on all three conditions (one block of each condition). To keep listeners engaged, recordings on a wide variety of topics were used, and in each block a different topic was presented. The auditory training materials were thematic passages of 3–6 min in Hebrew, read by five readers (four male voices and one female) from popular science articles. The passages were broken into content units of 1–2 sentences of about 10 s each, using Audacity software (Audacity, version 1.2.6). Each unit was followed by a multiple choice question related to the content of the sentences, which was presented visually. Feedback (correct/incorrect response with the correct answer) was also given visually.

During training, an adaptive 2-down/1-up staircase procedure was used to adjust the level of difficulty to each listener's individual performance. Improvement with training is reflected by a reduction in the threshold, suggesting that as training progressed listeners could maintain a good level of accuracy even with a more "difficult" (lower quality) stimulus.
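As an illustration of the staircase logic, the following is a minimal sketch of a generic 2-down/1-up track, not the training software used in the study. The class name, its attributes, and the reversal bookkeeping are hypothetical; the 1.5 dB default step is taken from the SNR step reported for the speech-in-noise condition. In general, a 2-down/1-up rule converges on roughly 70.7% correct responses.

```python
from dataclasses import dataclass, field


@dataclass
class Staircase:
    """Hypothetical 2-down/1-up adaptive track (illustrative names only)."""
    level: float          # current stimulus level, e.g., SNR in dB
    step: float = 1.5     # step size; 1.5 dB is the SNR step given in the text
    n_down: int = 2       # consecutive correct answers needed to get harder
    _streak: int = 0      # current run of correct answers
    _last_dir: int = 0    # -1 = harder, +1 = easier, 0 = no move yet
    reversals: list = field(default_factory=list)

    def update(self, correct: bool) -> float:
        """Register one response and return the level for the next trial."""
        direction = 0
        if correct:
            self._streak += 1
            if self._streak == self.n_down:
                direction = -1        # two correct in a row: lower the level
                self._streak = 0
        else:
            direction = +1            # one error: raise the level
            self._streak = 0
        if direction:
            if self._last_dir and direction != self._last_dir:
                self.reversals.append(self.level)  # level at a change of direction
            self.level += direction * self.step
            self._last_dir = direction
        return self.level


track = Staircase(level=5.0)
for answer in [True, True, True, True, False, True, True]:
    track.update(answer)
print(track.level, track.reversals)
```

Lower levels correspond to a harder stimulus here, which matches the description of improvement as a reduction in threshold. Tracking reversals is a common way to summarize a staircase, but the text does not state that thresholds were derived from reversals.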

The starting values for each day of training were based on the end values of the previous session for each listener in each condition. In the speech-in-noise condition, sentences were embedded in four-talker babble noise, which consisted of two female and two male talkers reading printed prose. The amplitude of each speech signal was maximized to a point just below peak clipping and the four recordings were mixed into a single channel. Various segments of the noise were used to avoid adaptation. The segments were applied pseudo-randomly (i.e., approximately equivalent total number of uses) across sentences to reduce possible effects of amplitude fluctuations that would be present in one noise segment. All noise segments were normalized to an overall root mean square (RMS) level of 70 dB via Level 16 (Tice and Carrell, 1997). The adaptive parameter was the signal-to-noise ratio, where the noise level changed in steps of 1.5 dB. In the time-compressed speech condition, the adaptive parameter was the compression rate. In the competing speaker condition, two sentences were presented simultaneously by male and female voices; listeners were instructed to respond to a target speaker, and the adaptive parameter was the signal-to-noise ratio of the two sentences. Mean SNR thresholds of each block were calculated for each participant in the speech-in-noise and competing speaker conditions, and the mean compression ratio threshold was calculated for each block in the time-compressed speech condition.
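The relation between the noise level and the SNR can be sketched as follows. This is only an illustration, assuming speech and noise are equal-length lists of samples; the helper names are hypothetical, and the study itself used Level 16 for level normalization rather than code of this kind.

```python
import math


def rms(samples):
    """Root mean square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise RMS ratio equals snr_db, then mix.

    Returns the mixed signal and the gain applied to the noise.
    """
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


# Toy signals (made-up values, not real recordings).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed, gain = mix_at_snr(speech, noise, snr_db=0.0)
print(gain)
```

Because the text states that the noise level was the quantity that changed, the sketch keeps the speech fixed and scales only the noise; lowering `snr_db` by 1.5 would correspond to one step toward a harder condition.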

The training program was installed by the experimenter (first author HK) on all of the trained participants' personal computers and participants practiced in their homes. Stimuli were presented in sound field via two speakers (Logitech S-0264A, provided by the researchers) placed on either side of the computer and facing the participant (around 45°). The sound level was set at a comfortable listening level, as determined by the trainee, prior to the start of each training session. After the installation of the training program, participants completed one practice block for each condition intended to familiarize them with the training program prior to the onset of independent training. At this time participants were also instructed to call the experimenter if they had any questions or if they encountered problems with the program. Subsequently, participants were called on a weekly basis to encourage their continued compliance with the training regimen. At the end of the training period, the results were uploaded by the experimenter from the personal computers.

TABLE 1 | Means and (SDs) of demographic and cognitive measures across all groups (immediate-training, delayed-training, and no-training) divided into normal-hearing (NH) and Age-related hearing loss (ARHL) groups.

FIGURE 2 | [...] represent testing on trained tasks, yellow (bottom) circles represent testing on untrained tasks.

Analysis of the training-phase data was conducted on the data of all trained participants (collected between t1 and t2 for the immediate-training group and between t2 and t3 for the delayed-training group). A series of univariate ANOVAs showed no significant differences in the pre-training and post-training results between the NH immediate- and delayed-training groups [F(1, 19) ≤ 1.61, p ≥ 0.22], or between the ARHL immediate- and delayed-training groups [F(1, 23) ≤ 1.86, p ≥ 0.19]; therefore, the immediate- and delayed-training groups were combined within each hearing group. A total of 21 NH and 25 ARHL listeners completed training and are referred to as trained listeners or trained groups throughout the Results Sections (Training-Phase Learning, Pre- to Post-Test Learning on the Trained Tasks).

### Pre- and Post-Training Assessments

Pre- and post-training assessments were conducted 4 weeks apart. Data from these sessions was used to assess learning (performance on the trained tasks but with different content) as well as generalization of learning to untrained tasks by comparing changes in performance over time between trained and untrained participants.

### Learning

Performance on the trained tasks (but with different passages) was used to document learning and determine whether training-related changes were significant in trained in comparison to untrained listeners. For this analysis, data was collected immediately before the first training session (pre-test) and immediately after the final training session (post-test), corresponding to times t1 and t2 for the immediate and no-training groups, and times t2 and t3 for the delayed-training group (see **Figure 2**, and Section Study Design for more details). Therefore, these analyses, reported in the Results section Pre- to Post-Test Learning on the Trained Tasks, include data from 21 trained NH participants, 25 trained ARHL participants, and 10 untrained ARHL participants. Data on these tasks was not collected for untrained NH listeners, because our main goal was to test the existence of learning changes in the ARHL group. Participants were tested on two blocks of each trained condition in each time point (2 × speech-in-noise, 2 × time-compressed speech and 2 × competing speaker). Differences between pre- and post-tests on the trained tasks were compared between groups.

### Generalization

Performance on untrained tasks was used to study the transfer of the potential training-induced gains to other speech and non-speech conditions (generalization). These tasks were completed by all subgroups and included (A) a speech-in-noise pseudoword discrimination task, (B) a speech-in-noise sentences task, (C) a duration discrimination task, and (D) a frequency discrimination task.

(A) In the speech-in-noise pseudoword discrimination task, participants performed a same/different discrimination task in which 60 pairs of two-syllable pseudowords were presented aurally by a native female speaker with equal numbers of "same" and "different" trials. "Different" trials were minimal pairs (e.g., "same": /damul/-/damul/, "different": /malud/-/maluk/), with equal number of pairs from each phonetic contrast and vowel template. The pseudowords were embedded in background four-talker babble noise (same as used in the training paradigm). Pseudowords were used in this test to eliminate the effect of context provided by familiar words, shown to be stronger in individuals with presbycusis compared to near-normal hearing listeners (Pichora-Fuller et al., 1995).

(B) In the speech-in-noise sentences task, listeners were required to make plausibility judgments on 45 Hebrew sentences embedded in the same four-talker babble noise used in the training paradigm. After hearing a sentence, listeners had to determine whether the sentence was semantically plausible ("true") or not ("false"). Both speech-in-noise tests (pseudowords and sentences) were administered at the most comfortable level for each participant, with a starting SNR value of +5 which was adapted based on their responses with a 2-down/1-up adaptive staircase procedure. The adaptive parameter was the SNR, where the noise level changed by steps of 1.5 dB. All sets of stimuli were RMS-amplitude normalized to 70 dB SPL using Level 16. Just noticeable differences (JNDs) served as the outcome measure for discrimination thresholds in the speech-in-noise pseudowords test, while mean SNR thresholds were used for the speech-in-noise sentences test. The two speech-in-noise tests (pseudowords and sentences) were used to study generalization to untrained speech-in-noise tasks.

(C) Duration discrimination was tested with 1000 Hz reference tones with a standard duration of 200 ms in an oddball procedure. On each trial, two identical standard tones and one target tone were presented with an 800-ms inter-stimulus interval. The duration of the odd tones was adapted based on performance with a 3-down/1-up multiplicative staircase procedure.

(D) Frequency discrimination was tested in an oddball procedure with a 500 Hz reference tone in one task and a 2000 Hz reference tone in another task, with a duration of 500 ms. The frequency difference between the odd and frequent tones was adapted based on performance.

The non-speech tasks were administered using a listener-friendly interface of 60 trials. These tests were used to determine whether generalization can be observed to untrained basic psychoacoustic non-speech tasks. Each psychoacoustic test lasted ∼7–10 min. Visual feedback was provided for both correct and incorrect responses. Stimuli were presented with an initial level of 70 dB SPL, but the tester adjusted the intensity of all speech and non-speech stimuli to a comfortable listening level using the computer's volume setting. Most stimuli were thus presented in the range of 80–83 dB SPL. The level of presentation did not exceed 90 dB SPL.

The generalization tasks were administered to all NH and ARHL participants at times t1 and t2. Thus, these tasks were administered before and after the training period for the immediate-training participants, but before and after the control period for the delayed-training and no-training participants (see **Figure 2**). Data from the ARHL delayed-training group and the ARHL no-training group was therefore combined, since ANOVA showed no significant differences between them in the pre-test (t1) and post-test (t2) results [F(1, 19) ≤ 4.33, p ≥ 0.06]. This resulted in four groups which were compared in subsequent analyses: two groups were tested before and after their training period, the immediate-training NH (n = 10) and immediate-training ARHL (n = 14) groups, and two groups were tested before and after their control period, NH (delayed-training, n = 11) and ARHL (delayed-training + no-training groups, n = 21). Data was analyzed using repeated measures ANOVA with two between-subject factors (during-training vs. during-control period and NH vs. ARHL) and one within-subject factor: time (t1 vs. t2). Shapiro-Wilk tests were used to confirm that the data was normally distributed within each group (p > 0.1). In addition, Levene tests confirmed that variances were homogeneous across groups within each analysis (p > 0.16).

# RESULTS

# Training-Phase Learning

Forty-one of the 46 trained participants, from both the NH and ARHL groups, completed all 13 sessions of the auditory training program, showing a high level of compliance with no dropouts; the remaining five participants completed 10–11 sessions. Data from all 46 trained participants were therefore included in the statistical analysis.

To determine whether participants improved during training, and whether this depended on their hearing status, linear curve estimation was performed on the performance of each group in each training condition across sessions (**Figure 3**). These analyses (see **Table 2** for details) revealed a good fit of the linear curves to the data, with significant R-squared values (R-squared > 0.43, p < 0.01), which suggests that a linear improvement across sessions accounts for a significant amount of the variance in performance.
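As a rough illustration of this kind of linear curve estimation, the sketch below fits a straight line to group-mean thresholds across 13 sessions and reports the slope, R-squared, and p-value. The thresholds are simulated and the function name is hypothetical; the statistical software used in the study is not specified here.

```python
import numpy as np
from scipy import stats


def fit_learning_curve(sessions, thresholds):
    """Least-squares line through group-mean thresholds across sessions."""
    result = stats.linregress(sessions, thresholds)
    return {
        "slope": result.slope,
        "r_squared": result.rvalue ** 2,
        "p": result.pvalue,
    }


# Simulated group means: thresholds fall (improve) by about 0.3 units per session.
rng = np.random.default_rng(0)
sessions = np.arange(1, 14)
group_means = 3.0 - 0.3 * sessions + rng.normal(0.0, 0.4, size=sessions.size)
fit = fit_learning_curve(sessions, group_means)
print(f"slope = {fit['slope']:.2f}, R-squared = {fit['r_squared']:.2f}, p = {fit['p']:.4f}")
```

A negative slope corresponds to improvement here because lower thresholds indicate better performance.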

R-squared, F-values with degrees of freedom and p-values are presented across conditions for trained normal-hearing (NH) and trained Age-related hearing loss (ARHL) groups.

condition. Regression lines and slopes of the learning curves (A) for trained NH are shown in red and for trained ARHL in green. \*\*p < 0.01.

TABLE 3 | Means and (SDs) of the individual linear learning slopes for trained normal-hearing (NH) and trained Age-related hearing loss (ARHL) groups.

t-values, p-values of the group comparison and 95% confidence interval of the difference between groups are also shown.

To compare the amount of training-induced change between groups (NH and ARHL), the linear slopes of the individual learning curves were calculated for each participant in each training condition. As shown in **Table 3**, mean slopes were significantly negative (p < 0.01) in both trained groups and across all three training conditions. In the speech-in-noise condition, learning curves were significantly steeper in the NH than in the ARHL group [t(44) = −2.05, p = 0.046]. No significant differences were found between the learning-curve slopes of NH and ARHL participants in the time-compressed speech condition [t(44) = 0.65, p = 0.52] and in the competing speaker condition [t(44) = −0.76, p = 0.45].
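The slope comparison can be sketched as follows: one slope is fitted per participant, and the two sets of slopes are compared with an independent-samples t-test. With 21 NH and 25 ARHL participants the test has 21 + 25 − 2 = 44 degrees of freedom, matching the reported t(44). The data are simulated, the function names are hypothetical, and the sketch assumes every participant has all 13 sessions, which was not the case for five participants in the study.

```python
import numpy as np
from scipy import stats


def individual_slopes(thresholds):
    """thresholds: array of shape (n_participants, n_sessions).

    Returns one least-squares slope per participant (threshold change per session).
    """
    thresholds = np.asarray(thresholds, dtype=float)
    sessions = np.arange(1, thresholds.shape[1] + 1)
    # np.polyfit accepts a 2-D y (one column per participant); row 0 holds the slopes.
    return np.polyfit(sessions, thresholds.T, deg=1)[0]


def compare_slopes(slopes_a, slopes_b):
    """Independent-samples t-test on slopes, returning (t, df, p)."""
    t, p = stats.ttest_ind(slopes_a, slopes_b)
    return t, len(slopes_a) + len(slopes_b) - 2, p


# Simulated data only: 21 NH and 25 ARHL participants, 13 sessions each.
rng = np.random.default_rng(1)
sessions = np.arange(1, 14)
nh = 2.0 - 0.40 * sessions + rng.normal(0.0, 0.5, size=(21, 13))
arhl = 4.0 - 0.30 * sessions + rng.normal(0.0, 0.5, size=(25, 13))
t, df, p = compare_slopes(individual_slopes(nh), individual_slopes(arhl))
print(f"t({df}) = {t:.2f}, p = {p:.3f}")
```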

Visual inspection of the data (see **Figure 3**) suggests that the rate of learning may have changed over the course of training, with an initially rapid learning phase followed by a slower learning phase. Therefore, two-line linear curves were also fitted to the group data, separately for sessions 1–6 and 7–13 in each condition (see Supplementary Material). These models showed a good fit in only some conditions and groups. Therefore, individual slopes were calculated and compared between groups only for conditions in which both groups showed a significant fit to the model. The results were similar to those obtained with the one-line model (see Supplementary Material for details).
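A two-line model of this kind can be sketched by fitting separate regression lines to sessions 1–6 and 7–13. The split point comes from the text; the function name and the example data are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def two_line_fit(sessions, thresholds, split=6):
    """Fit separate lines to sessions <= split and sessions > split."""
    sessions = np.asarray(sessions, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    early = sessions <= split
    early_fit = stats.linregress(sessions[early], thresholds[early])
    late_fit = stats.linregress(sessions[~early], thresholds[~early])
    return early_fit, late_fit


# Illustrative group means: rapid early improvement, slower late improvement.
sessions = np.arange(1, 14)
means = np.where(sessions <= 6, 10.0 - 1.0 * sessions, 4.0 - 0.2 * (sessions - 6))
early_fit, late_fit = two_line_fit(sessions, means)
print(f"early slope = {early_fit.slope:.2f}, late slope = {late_fit.slope:.2f}")
```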

Taken together, these data suggest that training-phase learning was observed in both the normal-hearing and the ARHL trained groups. Both trained groups showed a similar amount of learning over the course of training in the time-compressed speech and competing speaker conditions. However, in the speech-in-noise training condition, the normal-hearing group showed more improvement than the ARHL group.

### Pre- to Post-Test Learning on the Trained Tasks

To determine whether training resulted in greater pre- to post-test changes in trained than in untrained participants, and as a function of hearing status, pre- and post-test performance on each of the trained conditions was compared across the three groups (see **Figure 4**) using a repeated measures ANOVA with group (NH, ARHL, no-training ARHL) as a between-subject factor and time (pre-test, post-test) as a within-subject factor, followed by post-hoc tests. As explained in Section Learning, the trained tasks were administered immediately before the first training session (pre-test) and immediately after the final training session (post-test), corresponding to times t1 and t2 for the immediate and no-training groups, and times t2 and t3 for the delayed-training group (see **Figure 2**). Therefore, these analyses include data from 21 trained NH participants, 25 trained ARHL participants, and 10 untrained ARHL participants.

The results showed statistically significant effects of time and group. Performance on all three trained conditions was significantly influenced by both time [pre vs. post; speech-in-noise: F(1, 53) = 32.50, p < 0.0001, η<sub>p</sub><sup>2</sup> = 0.38; time-compressed speech: F(1, 53) = 47.21, p < 0.0001, η<sub>p</sub><sup>2</sup> = 0.47; competing speakers: F(1, 53) = 109.98, p < 0.0001, η<sub>p</sub><sup>2</sup> = 0.68] and group [speech-in-noise: F(2, 53) = 7.8, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.23; time-compressed speech: F(2, 53) = 6.01, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.27; competing speakers: F(2, 53) = 9.68, p < 0.0001, η<sub>p</sub><sup>2</sup> = 0.27]. The time × group interactions were also significant [speech-in-noise: F(2, 53) = 9.01, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.26; time-compressed speech: F(2, 53) = 28.77, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.52; competing speaker: F(2, 53) = 14.41, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.35] (see **Figure 4**). The significant differences between pre- and post-tests stem from greater changes in both trained groups than in the no-training group. Post-hoc Tukey HSD analysis showed significant (p < 0.001) pairwise differences between the no-training group and each trained group (NH and ARHL) for all three conditions [speech-in-noise: F(2, 53) = 10.89; time-compressed speech: F(2, 53) = 12.32; competing speaker: F(2, 53) = 12.43]. Moreover, t-test analyses showed a significant effect of time for both the NH and ARHL trained groups in the three conditions [ARHL: speech-in-noise: t(33) = −2.96, p < 0.001; time-compressed speech: t(33) = −3.87, p < 0.001; competing speaker: t(33) = −3.57, p < 0.001. NH: speech-in-noise: t(29) = −4.38, p < 0.001; time-compressed speech: t(29) = −4.22, p < 0.001; competing speaker: t(29) = −4.97, p < 0.001].
On the other hand, as can be seen in **Figure 4**, untrained listeners hardly changed between the two time points, and no significant differences between pre- and post-tests were found for the untrained group in any condition [speech-in-noise: t(9) = 1.03, p = 0.57; time-compressed speech: t(9) = −2, p = 0.95; competing speaker: t(9) = 1.8, p = 0.1]. Taken together, training-induced learning was observed for the trained tasks in both the normal-hearing and ARHL trained groups in all conditions, untrained listeners did not show any changes between pre- and post-tests, and significant differences were observed between trained and untrained listeners; together, these results confirm that trained listeners improved more than untrained listeners between the pre- and post-tests. Moreover, the normal-hearing trained group significantly outperformed the ARHL trained group in the speech-in-noise condition in the post-test session [hearing group effect: F(1, 44) = 7.97, p < 0.01, **Figure 4A**], consistent with the steeper learning curves observed in this group during training.
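For readers who want to relate the reported F-values to the partial eta-squared effect sizes, the standard identity is partial η² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to a few of the values reported above; it is a reader-side check under that identity, not a description of how the effect sizes were computed in the study, and small rounding differences from reported values are possible.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values taken from the text: effect of time and time x group interactions.
examples = {
    "time, speech-in-noise": (32.50, 1, 53),
    "time, time-compressed speech": (47.21, 1, 53),
    "time x group, time-compressed speech": (28.77, 2, 53),
    "time x group, competing speaker": (14.41, 2, 53),
}
for label, (f_value, df_effect, df_error) in examples.items():
    print(f"{label}: {partial_eta_squared(f_value, df_effect, df_error):.2f}")
```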

### Generalization

To study the transfer of learning and to determine whether training resulted in greater pre- to post-test changes during the training period than during the control period, and as a function of hearing level, pre-test (t1) and post-test (t2) performance on the untrained tasks was compared across the immediate-, delayed-, and no-training groups (see **Figure 2**). As described in Materials and Methods (Section Generalization), the participants were divided into four groups (1. NH immediate-training, 2. ARHL immediate-training, 3. NH delayed-training, 4. ARHL delayed-training + no-training; see **Figure 5**), and the data were analyzed using a repeated measures ANOVA with two between-subject factors (training and hearing groups) and one within-subject factor, time (pre vs. post). Mean group thresholds are shown in **Figure 5** for all untrained tasks as a function of the hearing and training factors, for speech tasks (speech-in-noise pseudowords and sentences, **Figures 5A,B**) and non-speech tasks (duration discrimination, **Figure 5C**, and frequency discrimination, **Figures 5D,E**).

### **Speech in noise tests**

Significant effects of hearing group were found in both speech-in-noise tests (**Figures 5A,B**), where normal-hearing participants significantly outperformed participants with ARHL [speech-in-noise pseudowords: F(1, 52) = 8.14, p = 0.006, η<sub>p</sub><sup>2</sup> = 0.14; speech-in-noise sentences: F(1, 52) = 11.13, p = 0.002, η<sub>p</sub><sup>2</sup> = 0.18]. A significant main effect of time was observed only in the speech-in-noise pseudowords task [F(1, 52) = 23.42, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.32]; no significant effect of time was found in the speech-in-noise sentences task. A significant time × training group interaction was likewise observed only in the speech-in-noise pseudowords task [F(1, 52) = 4.47, p = 0.036, η<sub>p</sub><sup>2</sup> = 0.08]. This interaction stems from a significant effect of time [F(1, 33) = 21.01, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.40] and a significant time × training group interaction [F(1, 33) = 6.24, p = 0.018, η<sub>p</sub><sup>2</sup> = 0.16] that were present only in the ARHL groups. The time × training interaction was not significant among normal-hearing participants. Therefore, transfer of learning was observed only for the speech-in-noise pseudowords task and only in the ARHL group.
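With only two time points and two groups, a time × group interaction is equivalent to comparing pre-to-post change scores between the groups: the interaction F equals the squared t from an independent-samples t-test on the changes. The sketch below illustrates this for the ARHL follow-up analysis, using the reported group sizes (14 trained, 21 control, hence 33 error degrees of freedom) but simulated thresholds; the function name is hypothetical.

```python
import numpy as np
from scipy import stats


def time_by_group_interaction(pre_a, post_a, pre_b, post_b):
    """Two-group, two-time-point interaction computed from change scores.

    Returns (F, df_error, p), where F is the squared t statistic.
    """
    change_a = np.asarray(post_a, dtype=float) - np.asarray(pre_a, dtype=float)
    change_b = np.asarray(post_b, dtype=float) - np.asarray(pre_b, dtype=float)
    t, p = stats.ttest_ind(change_a, change_b)
    return t ** 2, len(change_a) + len(change_b) - 2, p


# Simulated thresholds (dB): the trained group improves more than the control group.
rng = np.random.default_rng(3)
pre_trained = rng.normal(0.0, 2.0, 14)
post_trained = pre_trained - 1.5 + rng.normal(0.0, 1.0, 14)
pre_control = rng.normal(0.0, 2.0, 21)
post_control = pre_control - 0.3 + rng.normal(0.0, 1.0, 21)
f_value, df_error, p = time_by_group_interaction(
    pre_trained, post_trained, pre_control, post_control
)
print(f"F(1, {df_error}) = {f_value:.2f}, p = {p:.3f}")
```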

### **Duration and frequency discrimination tasks**

No significant differences were observed between any of the groups on these tasks (neither hearing differences nor training vs. control period differences, p > 0.4). There was no main effect of hearing group (p > 0.10) or training group (p > 0.12) in any of the non-speech tasks. In the 500 Hz frequency discrimination task, there was a main effect of time [F(1, 52) = 5.42, p = 0.026], but it did not interact with either hearing group (p = 0.13) or training group (p = 0.89). These results indicate that there was no transfer of learning to the duration discrimination or frequency discrimination tasks in any of the groups (**Figures 5C–E**).

# DISCUSSION

The present study tested the effect of a home-based training program in everyday listening situations, specifically focused on older adults with mild-to-moderate sensorineural hearing loss who experienced hearing difficulties but did not have hearing aids, as well as normal-hearing listeners in the same age range. The outcomes of training on speech perception were compared between normal-hearing adults and those with ARHL. In addition, generalization of training to other speech and non-speech tasks was assessed.

The major outcomes of the current study were: (i) Robust training-induced learning effects were found in both normal-hearing individuals and individuals with ARHL, and for the trained tasks these were not limited to the trained materials. (ii) The normal-hearing group showed more learning than the ARHL group in the speech-in-noise trained condition. (iii) Generalization to the perception of pseudowords in noise was observed in the ARHL group only. (iv) The perception of sentences in noise, duration discrimination and frequency discrimination did not improve in either of the trained groups. Together these findings suggest that although learning remains robust in older adults with normal hearing and in older adults with ARHL, generalization is limited.

# Learning and Generalization

### Learning on the Trained Tasks

Consistent with previous studies (Sweetow and Sabes, 2006; Humes et al., 2009), and as expected, learning was observed in the trained groups. In the current study, significant training-phase learning was observed in both the normal-hearing and ARHL groups. Participants performed significantly better at the end of the training period than on the initial blocks (**Figure 3**), indicating that participants' understanding of speech improved over the course of training in all three conditions: speech-in-noise, time-compressed speech, and competing speaker. Furthermore, between the pre- and post-tests both ARHL and normal-hearing participants improved on the trained conditions more than untrained participants (**Figure 4**). The normal-hearing and the ARHL groups showed similar patterns of learning over the course of training in the time-compressed speech and competing speaker conditions, as evidenced by their overlapping learning curves (**Figure 3**). On the other hand, in the speech-in-noise condition, the learning curves were significantly steeper in the normal-hearing than in the ARHL group (**Figure 3**), suggesting that in this condition, training had a greater influence on normal-hearing listeners than on listeners with ARHL. The current study is the first to compare the training outcomes of normal-hearing and ARHL groups. These groups are defined by a difference in lower-level sensory processes. Even though the training program was designed to emphasize higher-level top-down cognitive processes, the difference in learning between groups suggests that the poor quality of perceptual representations in ARHL reduced the benefit of this type of training. It is possible that the use of hearing aids might improve the quality of representations and therefore could enhance the benefits of training.

### Generalization

Although learning on the trained conditions was not stimulus specific (**Figure 4**), the magnitude of training-induced transfer to other speech-in-noise tasks was small (**Figure 5A**). Transfer was limited to the pseudowords task and to the ARHL group. This finding is consistent with the findings of Anderson et al. (2013b), where significant improvements in the speech-in-noise outcome measure were specific to the hearing-impaired group and generalization was not shown in the normal-hearing group. On the other hand, in the current study, training (in both the ARHL and normal-hearing groups) did not generalize to an untrained speech-in-noise sentence task. In this task, listeners had to judge the semantic plausibility of sentences embedded in noise. This task was different from the trained task, in which listeners were asked multiple-choice questions about the content of the sentence they had heard. So despite the use of the same babble noise, the change in task requirements was sufficient to preclude generalization. Moreover, no transfer was found in either group to more basic psychophysical abilities such as duration or frequency discrimination (**Figures 5C–E**). These findings suggest that the type of training used in the current study affected higher-level task-specific cognitive processes and did not enhance low-level auditory processing of duration or frequency.

The small effect of generalization observed in the ARHL group was also reported in previous training studies using a similar training program (e.g., Sweetow and Sabes, 2006). Sweetow and Sabes reported only small effects of generalization to speech outcomes in adults with ARHL, and only in one of the two untrained tasks with sentence stimuli. In contrast, normal-hearing young adults showed generalization to untrained speech tasks when trained with the same program, suggesting that training improved the neural representation of cues important for speech perception (Song et al., 2012). Altogether these results suggest that the restricted generalization in the current study, in which both groups were of older age, is associated with the degenerative changes that occur due to aging or hearing loss or both.

One potential interpretation of the discrepancy between learning and generalization in the ARHL group is that during training, although listeners focused on the content of the sentences and not on the acoustic/phonetic characteristics of the stimuli, the low quality of the signal (due to both their hearing loss and noise) drove listeners to rely on lower-level sensory representations that were not sentence specific. Although the ability to use lower-level sensory representations may have been helpful when making decisions about pseudowords, it would not have been enough when new semantic demands were imposed by the semantic judgment task (see Ahissar et al., 2009 for the detailed theoretical framework and Banai and Lavner, 2014 for a previous discussion in the context of the perceptual learning of speech). Consistent with this interpretation, it is plausible that mid-level sensory representations were used during training, as reflected in the pseudowords task. Learning did not reach as high as the level of sentence representations and it did not go as low as the acoustic parameters of frequency or duration. This may be due to the type of task and feedback used during training, or may be a more general feature of auditory training, as suggested by the small generalization effects observed in previous studies (see Henshaw and Ferguson, 2013).

An alternative hypothesis is that generalization at the perceptual level of speech in noise could be identified with other outcome measures not used in the current study (Amitay et al., 2014), such as identification of real words or identification of key words in a sentence. Moreover, changes in higher-level processes could perhaps be identified with tests of working memory and attention (Ferguson and Henshaw, 2015b). On the other hand, a variety of outcome measures have been used across previous studies, but only small effects of generalization have been reported. Therefore, auditory training may prove useful in hearing rehabilitation, but only if future studies converge on training regimens that yield greater generalization than observed with the regimens studied so far. A potential way forward is to combine the different types of training approaches in order to offer generalization benefits to real-world listening abilities, as suggested by Ferguson and Henshaw (2015b).

FIGURE 4 | Pre-to-post learning effects. Pre- and post-test performance in the trained normal-hearing (NH), trained ARHL (ARHL), and no-training ARHL groups for the three conditions: (A) speech-in-noise, (B) time-compressed speech, and (C) competing speaker. Mean signal-to-noise ratio (SNR) thresholds and SDs are shown for the speech-in-noise and competing speaker conditions, and mean compression ratio thresholds and SDs are shown for the time-compressed speech condition. \*\*\*p < 0.001; \*\*p < 0.01.

FIGURE 5 | Generalization. Means and SDs of (A) speech-in-noise pseudowords and (B) speech-in-noise sentences thresholds in dB, (C) duration discrimination thresholds in milliseconds (ms), and (D) 500 Hz and (E) 2000 Hz frequency discrimination thresholds in Hz, obtained from pre- and post-tests for the normal-hearing (NH) and age-related hearing loss (ARHL) groups, shown for the subgroups NH immediate-training, ARHL immediate-training, NH delayed-training, and ARHL delayed-training + no-training. See Materials and Methods (Section Generalization) for the subgroup division.

### Comparisons Between Normal-Hearing and ARHL Groups in the Generalization of Learning

Differences between the normal-hearing and ARHL groups emerged in the transfer tests (**Figure 5A**), where a significant, albeit small, transfer effect was observed in the pseudowords task in the ARHL group but not in the normal-hearing group. The differences between the normal-hearing and the ARHL groups concerning transfer to the speech-in-noise pseudowords test may be consistent with the processing model presented in the introduction (Section Speech Processing in Younger and Older Adults). It is plausible that for normal-hearing participants the bottom-up acoustic information was still reliable and sufficient; it therefore matched the lexical representations, and there was no need to divert attentional resources to low-level representations during training. In the ARHL group, lower-level and lexical representations did not automatically match, making it necessary to devote attentional resources to the matching process. This additional burden in the ARHL group may have increased the reliance on bottom-up perceptual processes, which generalized to pseudowords.

### Compliance and Subjective Outcomes

Our training paradigm tried to mimic the challenges of real-world listening and consisted of blocks of sentences on a wide variety of topics. In addition to enhancing reliance on top-down processes, the aim of this approach was to enhance motivation and compliance with the training program. It was previously shown that increased time on task is positively associated with gain in understanding speech in noise (Levitt et al., 2011). Thus, the training paradigm in the current study engaged participants, resulting in a high rate of compliance (90%), similar to previous reports by Stecker et al. (2006). The improvement on the trained conditions suggests that participants with ARHL can benefit from an improved SNR adjustment to compensate for the inaudibility of high frequencies; such improvements, though, are hard to accomplish in many everyday settings. However, despite the lack of evidence concerning transfer of learning in objective measures, more than 50% of the normal-hearing and 75% of the ARHL trained listeners reported that training was helpful in their communication with their grandchildren "especially those who speak really fast," and in "understanding what is being said in noisy environments," suggesting that training may result in subjective benefits.

# CONCLUSIONS

We suggest that most training-related changes in the current study occurred at a higher level of task-specific cognitive processes in both groups, as evidenced by the lack of generalization to the sentence task and to the frequency and duration discrimination tasks. Given that the difference between the normal-hearing and ARHL groups is defined based on lower-level acoustic and perceptual processing, the larger learning gains in the normal-hearing group suggest an interaction between bottom-up and top-down processes. Namely, learning-related changes in high-level task-related cognitive processes are enhanced by the high quality of perceptual representations in the normal-hearing group.

Furthermore, the finding of generalization to pseudowords, only in the ARHL group, suggests that some learning related changes have also occurred at the level of identifying phonemic representations in this group. Presumably, because perceptual and phonemic representations were of low quality in the ARHL group, the training program has affected this level of representations in ARHL more than in the normal-hearing group.

Taken together, the current study shows that the auditory training used here benefits people with mild-to-moderate hearing loss. It is left for future research to measure top-down processing strategies in order to enhance our understanding of the effects of training. There may be more effective training methods to add to the current training program; this may require more diverse training (in many more tasks), more intensive training over a very long period of time, or a change in the type of feedback used. Finally, studies into the training regimen that yields more generalization are needed.

# AUTHOR CONTRIBUTIONS

HK, TB, JA, and KB designed the study; HK collected and analyzed the data; HK, TB, and KB wrote the manuscript. All authors approved the final version of the manuscript.

# ACKNOWLEDGMENTS

This study was supported by Marie Curie International Reintegration (IRG 224763) and National Institute of Psychobiology in Israel grants to KB, by a feasibility grant from the Israel Ministry of Health to TB, JA, and KB, by Steiner's Fund for Hearing Research to JA, and by Excellence scholarships from the Council of Higher Education, Planning and Budgeting committee and the Graduate Studies Authority of the University of Haifa to HK.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2015.02066

# REFERENCES


Cognitive Training (IMPACT) study. J. Am. Geriatr. Soc. 57, 594–603. doi: 10.1111/j.1532-5415.2008.02167.x


Tice, B., and Carrell, T. (1997). Tone. Lincoln, NE: University of Nebraska.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer, Carine Signoret, and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2016 Karawani, Bitan, Attias and Banai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Only Behavioral But Not Self-Report Measures of Speech Perception Correlate with Cognitive Abilities

### Antje Heinrich<sup>1</sup> \*, Helen Henshaw<sup>2</sup> and Melanie A. Ferguson<sup>2,3</sup>

<sup>1</sup> Medical Research Council Institute of Hearing Research, Nottingham, UK, <sup>2</sup> Otology and Hearing Group, National Institute for Health Research Nottingham Hearing Biomedical Research Unit, Division of Clinical Neuroscience, School of Medicine, University of Nottingham, Nottingham, UK, <sup>3</sup> Nottingham University Hospitals NHS Trust, Nottingham, UK

Good speech perception and communication skills in everyday life are crucial for participation and well-being, and are therefore an overarching aim of auditory rehabilitation. Both behavioral and self-report measures can be used to assess these skills. However, correlations between behavioral and self-report speech perception measures are often low. One possible explanation is that there is a mismatch between the specific situations used in the assessment of these skills in each method, and a more careful matching across situations might improve consistency of results. The role that cognition plays in specific speech situations may also be important for understanding communication, as speech perception tests vary in their cognitive demands. In this study, the role of executive function, working memory (WM) and attention in behavioral and self-report measures of speech perception was investigated. Thirty existing hearing aid users with mild-to-moderate hearing loss aged between 50 and 74 years completed a behavioral test battery with speech perception tests ranging from phoneme discrimination in modulated noise (easy) to words in multi-talker babble (medium) and keyword perception in a carrier sentence against a distractor voice (difficult). In addition, a self-report measure of aided communication, residual disability from the Glasgow Hearing Aid Benefit Profile, was obtained. Correlations between speech perception tests and self-report measures were higher when specific speech situations across both were matched. Cognition correlated with behavioral speech perception test results but not with self-report. Only the most difficult speech perception test, keyword perception in a carrier sentence with a competing distractor voice, engaged executive functions in addition to WM. In conclusion, any relationship between behavioral and self-report speech perception is not mediated by a shared correlation with cognition.

Keywords: speech perception, cognition, self-report, communication, hearing aid users, mild-to-moderate hearing loss

### Edited by:

Adriana A. Zekveld, VU University Medical Center, Netherlands

### Reviewed by:

Rebecca Carroll, University of Oldenburg, Germany Jana Besser, Sonova AG, Switzerland

> \*Correspondence: Antje Heinrich antje.heinrich@ihr.mrc.ac.uk

### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 10 December 2015 Accepted: 07 April 2016 Published: 23 May 2016

### Citation:

Heinrich A, Henshaw H and Ferguson MA (2016) Only Behavioral But Not Self-Report Measures of Speech Perception Correlate with Cognitive Abilities. Front. Psychol. 7:576. doi: 10.3389/fpsyg.2016.00576

# INTRODUCTION

Good communication skills in everyday life are crucial for well-being and are therefore an overarching aim of audiological rehabilitation. Communication abilities can be measured in a variety of ways, and the measures do not necessarily assess identical or even overlapping aspects of communication. One way of measuring communication abilities is by using speech perception tests. They use behavioral indices to assess the passive perception of speech without an opportunity for interaction with other people. We use the term speech perception tests in accordance with Erber's (1982) framework that defines perception as the identification and repeatability of phrases without deeper comprehension. Note that speech perception represents but one aspect of communication. In real-life situations communication includes additional aspects such as the bi-directional transfer of information (Kiessling et al., 2003). Better suited to assess this second aspect of communication are self-report questionnaires. In contrast to behavioral speech perception tests, they often explicitly ask about how acoustic and linguistic information is used and transmitted effectively in a bi-directional process. Given this difference, it is likely that these two measures assess only partially complementary aspects of a listener's experience (see Pronk et al., 2011 for a similar argument).

# Behavioral versus Self-Report Measures to Assess Speech Perception and Communication

Correlations between behavioral and self-report measures of speech perception and communication vary substantially across studies, from hardly any correlations in some studies to consistent correlations in others. Hardly any correlations were found by Newman et al. (1990) when testing middle-aged non-hearing aid users<sup>1</sup> on the Hearing Handicap Inventory for Adults and an unspecified word recognition test. In contrast, consistently high correlations between a speech-perception-in-noise measure (SRTN) and all subscales of the Amsterdam Inventory for Auditory Disability and Handicap were found in a group of older hearing aid users and non-hearing aid users by Zekveld et al. (2013). In other studies, the correlation strength depended on the particular combination of self-report subscales and speech perception tests (Cox and Alexander, 1992; Ng et al., 2013; Heinrich et al., 2015). For instance, Cox and Alexander (1992) compared intelligibility in the Connected Speech Test for a number of simulated listening environments with the subscales of the Profile of Hearing Aid Benefit questionnaire, and found correlations only between two speech perception tasks and two questionnaire subscales. All other combinations of behavioral and self-report measures did not yield significant correlations. Similarly inconsistent results were found by Ng et al. (2013), who tested a sample of hearing aid users with an SRTN test, the International Outcome Inventory for Hearing Aids (IOI-HA) and the Speech, Spatial and Qualities of Hearing (SSQ) Scale. While they found no significant correlations between speech perception and self-report measures when either the IOI-HA or the SSQ subscales of complex speech perception (Speech in noise, Speech in speech contexts, Multiple speech-streams processing and switching) were used, they did find significant correlations with other aspects of self-reported listening (i.e., quality and spatial listening).
Finally, inconsistent correlations were also found by Heinrich et al. (2015) when testing older non-hearing aid users. They tested intelligibility in a range of speech perception situations and with different methods of setting signal-to-noise ratios (SNRs) and compared these behavioral results to a variety of self-report measures. The only instance in which they found consistent correlations between almost all self-report questionnaires and a word perception test was when the SNR was changed by adjusting the background noise level, but not when the SNR was changed by adjusting the target speech. A consequence of the former adjustment method was an increase in the overall presentation level of the speech test, whereas the latter method led to a decrease in overall presentation level. This finding suggests that only speech perception test methods that altered the background noise, as opposed to altering the speech levels, capture aspects of communication, participation restriction and tolerance to noise that are also captured by the questionnaires.
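The link between the SNR adjustment method and the overall presentation level follows from how the levels of uncorrelated signals combine. The sketch below assumes that speech and noise add as uncorrelated powers and uses hypothetical starting levels of 65 dB each; it shows that the same 5 dB reduction in SNR raises the overall level when the noise is turned up, but lowers it when the speech is turned down.

```python
import math


def overall_level_db(speech_db, noise_db):
    """Combined level of two uncorrelated signals, obtained by summing their powers."""
    return 10 * math.log10(10 ** (speech_db / 10) + 10 ** (noise_db / 10))


baseline = overall_level_db(65, 65)       # SNR = 0 dB (hypothetical starting point)
noise_raised = overall_level_db(65, 70)   # SNR = -5 dB, reached by raising the noise
speech_lowered = overall_level_db(60, 65) # SNR = -5 dB, reached by lowering the speech

print(f"baseline: {baseline:.1f} dB")
print(f"noise raised: {noise_raised:.1f} dB")
print(f"speech lowered: {speech_lowered:.1f} dB")
```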

What drives the variability in correlations remains unclear. Studies vary in a number of experimental factors, including hearing aid use, methods of identifying SNRs and details of the administration of the self-report measures. For hearing aid use, inconsistent correlations come from studies where listeners either do not (Heinrich et al., 2015) or do (Ng et al., 2013) wear hearing aids, whereas consistently high correlations were found in a study with a group of listeners with hearing loss where only half of participants used hearing aids (Zekveld et al., 2013). Hence, taking hearing loss into account does not improve the consistency of results. Secondly, the way in which the SNR of behavioral speech perception tests is adjusted (see Heinrich et al., 2015) suggests that procedural details in the measurement of speech perception can affect the correlation with self-report questionnaires. Thirdly, the administration protocol for the self-report questionnaires can affect correlations with speech perception measures, which are higher if the listening situations assessed by self-report and speech perception measures are more closely matched (Ng et al., 2013; Zekveld et al., 2013). Such matching contrasts with current practice, which typically measures self-report scores as an average across a number of listening situations. Averaging also contrasts with the measurement protocol typically used for behavioral tests, which assess only one situation.

In the present study we also depart from the standard practice of using averaged self-report scores, and instead use each individual listening situation separately for comparison with behavioral intelligibility measures. The main research question is: Does matching specific listening situations between the two different types of measures affect subsequent correlations in hearing aid users with mild-to-moderate hearing loss?

### Speech Perception and Cognition

In addition to the relationship between behavioral and self-report measures of listening, we also sought to better understand the relationship between speech perception and cognition. We have previously tested the predictive power of a number of cognitive abilities for speech perception tests of varying complexity in older listeners with mild hearing loss (Heinrich et al., 2015).

<sup>1</sup>Non-hearing aid users here and elsewhere refers to people who have not been prescribed and issued with hearing aids.

We found that cognition only explained a significant amount of variance in speech perception performance in the most complex listening situation, namely sentences presented in modulated noise. A principal component analysis (PCA) was then conducted to extract latent cognitive factors from the multiple cognitive tests used in the study. The PCA produced a two-factor solution with the factors representing working memory (WM) and attention, with only the latent factor of attention showing a predictive value for speech perception. The interpretation of the two factors in the previous paper as WM and attention was guided by Baddeley and Hitch (1974) and Baddeley (2000) who defined WM as the interplay between visuo-spatial and/or verbal information on the one hand and the central executive on the other. Note that the concept of attention is closely related to the concept of executive function (Hasher et al., 2007), and therefore the latent attentional factor in the previous paper might have been more appropriately labeled executive processes. While executive processes are a multifaceted concept and include aspects of attention and inhibition, these facets could not be further differentiated in the previous study due to the small number of cognitive tests, and therefore Baddeley's model, which united all aspects of attention and inhibition, was the appropriate theoretical framework in that study. However, a model that differentiates executive functions might be more appropriate because executive functions in general, and inhibition in particular (Sommers and Danielson, 1999; Janse, 2012; DiDonato and Surprenant, 2015; Helfer and Jesse, 2015), have been proposed to play a role in complex communication situations (e.g., when communicating in a group where executive function regulates monitoring, attention switching, updating; Ferguson et al., 2014) and in the resultant benefits from auditory training (Ferguson and Henshaw, 2015). 
Diamond's (2013) model of executive functions is such a model that articulates a more differentiated view. It also explicitly incorporates Baddeley's WM component, which was of interest to the current study as WM has been widely suggested to play a role in speech perception (Wingfield and Tun, 2007; Akeroyd, 2008; Mattys et al., 2012). Hence, a second objective of the present study, in addition to assessing the relationship between self-report and behavioral measures, was to assess the contribution of different executive functions and WM to various speech perception tests.

In our previous study (Heinrich et al., 2015), WM tests included digit span (forward and backward), and a visual letter monitoring task, while attention was assessed with single and divided attention tests [Test of Everyday Attention (TEA6 and 7) and the Matrix Reasoning Test]. In the current study, identical or similar tests were chosen to measure WM (Size Comparison Test, Letter Number Sequencing, Dual Digits in Quiet) and attention (TEA6 and 7).

Diamond (2013) distinguishes two different inhibitory control mechanisms, namely interference control and response control. These two control mechanisms differ in the processing stage at which the inhibitory control takes place. Interference control takes place when an individual manages to direct attention away from or suppress a prepotent mental representation. Response control takes place when behavior is controlled despite the urge to follow a prepotent response. The third core executive function in Diamond's model, cognitive flexibility, will not be discussed further. The selection of the remaining cognitive tests was guided by Diamond's model. How the selected cognitive tests in the current study fit within the model is displayed in **Figure 1**.

Given the large number of cognitive tests, PCA was applied as a data reduction method. PCA, a strictly atheoretical data reduction tool based solely on the amount of shared variance between the tests in the analysis, is often the method of choice (e.g., van Rooij and Plomp, 1990; Humes et al., 1994; Schoof and Rosen, 2014; Heinrich et al., 2015). The exploratory PCAs in Heinrich et al. (2015) returned a storage-focussed WM factor and a factor encompassing all other, more attention-focussed processes. When conducting the analysis with comparable tests we predicted that this solution would be replicated. However, on adding the tests selected specifically to test the aspects of attention and executive function noted above, we predicted that the previous attention factor would be split into two, reflecting the latent variables (i.e., interference and response control) underlying test selection. We also predicted that these three latent cognitive factors would be differentially predictive of speech perception in particular tests.
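
To make the data-reduction idea concrete, the sketch below extracts the two leading principal components from the correlation matrix of six simulated test scores. It is only an illustration under stated assumptions: the scores are synthetic (two hypothetical latent abilities with three noisy tests each, 30 simulated participants), the function names are invented here, and eigenvectors are found by power iteration with deflation as a dependency-free stand-in for a standard eigendecomposition. It does not reproduce the analysis of the study.

```python
import math
import random

def pearson_matrix(columns):
    """Correlation matrix of equally long score columns (one column per test)."""
    n = len(columns[0])
    z = []
    for col in columns:
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        z.append([(x - mean) / sd for x in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def leading_components(matrix, k, iterations=1000, seed=1):
    """Top-k eigenpairs of a symmetric matrix via power iteration with deflation."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]
    p = len(m)
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1, 1) for _ in range(p)]
        for _ in range(iterations):
            w = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * m[i][j] * v[j] for i in range(p) for j in range(p))
        pairs.append((eigenvalue, v))
        # Deflate: remove the component just found before searching for the next.
        for i in range(p):
            for j in range(p):
                m[i][j] -= eigenvalue * v[i] * v[j]
    return pairs

# Synthetic data: two hypothetical latent abilities, three noisy tests each.
rng = random.Random(0)
n_participants = 30
memory = [rng.gauss(0, 1) for _ in range(n_participants)]
attention = [rng.gauss(0, 1) for _ in range(n_participants)]
tests = [[f + rng.gauss(0, 0.3) for f in memory] for _ in range(3)]
tests += [[f + rng.gauss(0, 0.3) for f in attention] for _ in range(3)]

corr = pearson_matrix(tests)
components = leading_components(corr, k=2)
# Each eigenvalue divided by the number of tests is the share of variance explained.
explained = [value / len(tests) for value, _ in components]
for (value, vector), share in zip(components, explained):
    loadings = [round(x * math.sqrt(value), 2) for x in vector]
    print(f"eigenvalue={value:.2f}  variance explained={share:.0%}  loadings={loadings}")
```

Because the component structure is driven purely by shared variance, the two components recover the simulated grouping of tests without any theoretical input, which is the sense in which PCA is atheoretical.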

### Energetic versus Informational Masking

Based on previous research we know that target and background signals differ in the demands they place on cognitive and linguistic processes (Ferguson and Henshaw, 2015; Heinrich et al., 2015), and that it is particularly the complex communication situations that appear to involve executive functions (Ferguson et al., 2014). Therefore, it was important to choose speech stimuli that were sufficiently complex to invoke executive processing. In our previous study, target speech stimuli were phonemes, words and simple sentences, presented either in quiet (the phonemes), or in speech-shaped or white noise (words and sentences). This made perception relatively easy because the maskers were either not present at all or were only energetic in nature (Freyman et al., 1999; Brungart et al., 2001; Arbogast et al., 2002; Kidd et al., 2005).

In the current study, we attempted to increase listening difficulty in two ways: by making the background masker more complex and by presenting one of the speech tasks in a divided attention context. Background masker complexity was increased compared with the previous study by presenting almost all target speech in a speech masker (babble masker or concurrent talker). This extended the previous study by introducing informational masking in addition to energetic masking (Brungart et al., 2001; Arbogast et al., 2002; Kidd et al., 2005; Schneider et al., 2007). For the purpose of the current study, we follow Schneider et al.'s (2007) definition of informational masking as ". . .any aspect of the background sound that interferes with the processing of the speech signal at more central (cognitive) levels of processing." In this sense, informational masking should not be viewed as a single phenomenon but rather as resulting from actions at any of the stages of processing beyond the auditory periphery. As a result, it is intimately connected to perceptual grouping and source segregation, attention, memory, and general cognitive processing abilities. As shown by Simpson and Cooke (2005) even high-numbered talker babble can have significant effects of masking above and beyond those provided by speech-shaped noise with the same envelope as the babble. This effect according to Rosen et al. (2013) is driven by the higher similarity between the target speech and babble as opposed to noise. According to Simpson and Cooke (2005) this greater signal similarity between target and background sound for babble may lead to greater attentional demands, greater distracting effects of numerous onsets and general non-stationarity, thus leading to greater masker efficiency.

We also increased the complexity of the speech perception task by adding conditions in which the listening task was not presented in isolation but in concurrence with a memory task. This was intended to increase listening effort. The concept of listening effort is based on Kahneman's (1973) model of limited processing resources and assumes that performance on a listening task can be affected by the introduction of a second task (e.g., memory), which diverts some of the attention usually available for perception, to another task such as memory encoding. As a result, performance on one or both tasks may decline compared to when the tasks are performed alone (Mattys et al., 2014). The ultimate goal of all these changes in the speech perception tasks was to sample a wide variety of listening situations, and to generally increase the complexity of listening in order to maximize the possibilities of seeing correlations with a range of cognitive functions. Unavoidably, sampling a range of listening situations and changing the characteristics of the foreground and background signal comes at the cost of not being able to investigate systematically which changes in the listening condition cause a change in correlation with self-report and cognition.

### Hearing Loss

A final aspect of listening that was important for this study was the presence of hearing loss and its clinical management with hearing aids. Hearing aids can have tangible consequences not only for the accuracy of listening but also for the involvement of cognitive processes to achieve this (Lunner et al., 2009). Although hearing aids increase the audibility of the signal and thereby might make it easier for the listener to hear the target speech, they also introduce distortions (Edwards, 2007). It has been suggested that adjusting to the unfamiliar and distorted signal requires cognitive input (Arehart et al., 2013). While we cannot directly compare the effect of hearing status on speech perception and self-report between Heinrich et al. (2015) and the current study, as some of the speech perception tests and self-report questionnaires differed, a qualitative comparison was still possible and formed a third aim.

In summary, the current study sought to investigate the relationship between self-report and behavioral speech perception in a group of existing hearing aid users with mild-to-moderate hearing loss. The primary aim was to extend findings about relationships between listening, cognition and self-report from our earlier study in adults with mild hearing loss who did not wear hearing aids, to hearing aid users with mild-to-moderate hearing loss. Based on the previous study we hypothesized the following.

## Hypotheses

H1: As speech intelligibility was not assessed in a way that involved an increase of either background noise or overall stimulus level, we predict no correlation between the speech perception tests and the averaged scores of self-report questionnaires, thus replicating earlier results in those with mild hearing loss. However, when using self-report scores based on specific individual listening situations, we might expect correlations with speech intelligibility scores to emerge when the two listening/speech situations from the self-report and perception test mirrored each other.

H2: We predict the relationship between speech perception and cognition to be not uniform across different speech perception tests but rather to be specific to a particular test, and to become more evident as the complexity of the target speech, defined by its linguistic and other cognitive demands, and the complexity of the masker are increased.

H3: We expect to replicate a two-factor PCA solution with WM storage and attention when including only those tests that are comparable to the previous study. However, when including cognitive tests that include components of executive function, we expect to find a third factor, based on Diamond (2013), that splits the previous attention factor into an interference and a response control executive factor.

# MATERIALS AND METHODS

The data on which the analyses in this paper are based are the baseline outcome measures of an auditory training study with 50–74 year old hearing aid users with mild-to-moderate hearing loss (Ferguson and Henshaw, 2015). The training task plays no role in the current data. Instead we analyze the outcome measures of speech perception, cognition, and self-report of hearing-related activities at the baseline, pre-training session. The study was approved by the Nottingham Research Ethics Committee and Nottingham University Hospitals NHS Trust Research and Development. Informed signed consent was obtained from all participants.

# Participants

Thirty (20 males) existing hearing aid users (minimum use = 3 months, mean = 10.3 years, SD = 10.7 years) aged 50–74 (mean = 67.4 years, SD = 7.1) with mild-to-moderate symmetrical sensorineural hearing loss (mean pure-tone hearing thresholds of the better ear averaged across 0.5, 1, 2, 4 kHz = 43.6 dB HL, SD = 13.6) were recruited from the NIHR Nottingham Hearing Biomedical Research Unit research volunteer database. Overall, 56% of participants indicated that they used their hearing aids all the time, while 17% used them ≥75% of the time, and 27% used them 50–75% of the time. All participants spoke English as their first language, and were paid a nominal attendance fee and travel expenses for the visit.

## Procedure

All testing was carried out in a quiet testing room. All auditory stimuli were presented in the free field via a single speaker (Logitech LS 11) situated directly in front of the participant at a distance of 1 m, set to individuals' most comfortable loudness (MCL) level (Ventry et al., 1971), unless otherwise specified. The MCL was set for each participant at the first testing session and kept constant throughout. Participants wore their hearing aids during all testing. Visual stimuli were presented on a 21-inch screen (Genelec Inc., Natick, MA, USA) placed 50 cm in front of the participant. Auditory, cognitive and questionnaire responses were obtained in a fixed order, with audiological measures (otoscopy, tympanometry, pure-tone audiometry, MCL) first, followed by speech and cognitive measures in a mixed order that was the same for all participants.

## Outcome Measures

### Audiological

Outer and middle ear functions were checked by otoscopy and standard clinical tympanometry using a GSI Tympstar (Grason-Stadler, Eden Prairie, MN, USA). Pure-tone air conduction thresholds (0.25, 0.5, 1, 2, 3, 4, and 8 kHz) were obtained for each ear, following the procedure recommended by the British Society of Audiology (British Society of Audiology, 2011), using a Siemens (Crawley, West Sussex, UK) Unity PC audiometer, Sennheiser (Hannover, Germany) HDA-200 headphones, and B71 Radioear (New Eagle, PA, USA) transducer in a sound-attenuating booth. The better-ear-average (BEA) across octave frequencies 0.5–4 kHz was derived and is reported here.
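
The BEA derivation can be sketched as follows. The audiogram values are hypothetical, the function names are introduced here for illustration, and the "better" ear is assumed to be the one with the lower four-frequency average.

```python
def pure_tone_average(thresholds_db_hl, frequencies_khz=(0.5, 1, 2, 4)):
    """Mean threshold (dB HL) across the octave frequencies 0.5-4 kHz."""
    return sum(thresholds_db_hl[f] for f in frequencies_khz) / len(frequencies_khz)

def better_ear_average(left, right):
    """Average of whichever ear has the lower (better) mean threshold."""
    return min(pure_tone_average(left), pure_tone_average(right))

# Hypothetical audiogram: frequency in kHz -> threshold in dB HL.
left = {0.25: 25, 0.5: 30, 1: 40, 2: 50, 3: 55, 4: 60, 8: 70}
right = {0.25: 30, 0.5: 35, 1: 45, 2: 50, 3: 60, 4: 65, 8: 75}

bea = better_ear_average(left, right)
print(f"BEA = {bea:.2f} dB HL")
```

Thresholds at 0.25, 3 and 8 kHz are measured but do not enter the average.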

### Speech Perception

The Phoneme Discrimination (PD) test (Ferguson et al., 2014) performed in background noise measured the discrimination threshold for one vowel continuum (/e/-/a/) delivered through Sennheiser HD-25 headphones at a fixed level of 75 dBA, presented in 8-Hz modulated speech-shaped noise at 0 dB SNR. The vowel continuum contained 96 steps, which had been synthesized from the real voice recordings at the end points. The continuum was presented in sequential blocks, and all listeners were tested twice. A three-interval, three-alternative forced-choice, oddball paradigm was used. The participant's task was to choose the odd one out from three sequentially presented phonemes. Feedback (correct/incorrect response) was given. Initially, two (identical) vowels were selected randomly from one end of the continuum and the odd (target) vowel from the opposite end (i.e., .wav files #1 and #96). Correct detection of the target, delivered randomly in any of the three intervals, resulted on the next trial in the identical and target phonemes being chosen from a more difficult comparison (e.g., files #11 and #86; i.e., step size 10). Trials then varied adaptively over two 1-down 1-up reversals, with step sizes of 10 and 5, before changing to a 3-down 1-up paradigm using a step size of 2 and determining the 79% correct point on the psychometric function (Levitt, 1971). Performance was measured in terms of the separation between stimulus file numbers at threshold. A smaller number signifies better discrimination ability. As the particular vowel continuum here represents a type of phoneme, the resulting threshold was called phoneme discrimination threshold (%), and calculated as the average of the last two reversals over 35 trials.
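
The 79% figure follows from the logic of transformed up-down rules (Levitt, 1971): an n-down 1-up track settles where a run of n correct responses is as likely as not, that is, where p^n = 0.5. A minimal sketch of that arithmetic, with a function name chosen here for illustration:

```python
def convergence_point(n_down: int) -> float:
    """Proportion correct tracked by an n-down 1-up rule, solving p**n_down == 0.5."""
    return 0.5 ** (1 / n_down)

for n in (1, 2, 3):
    print(f"{n}-down 1-up converges on {convergence_point(n):.1%} correct")
```

The 1-down 1-up phase therefore homes in quickly on the 50% region, and the final 3-down 1-up phase estimates the roughly 79% correct point.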

The Four Alternative Auditory Feature (FAAF) test (Foster and Haggard, 1987) assessed phoneme discrimination accuracy in the context of a word in background noise. The overall output level of the stimuli was set at the participant's MCL for speech. The SNR was fixed at 0 dB. The noise was 20-talker babble. The FAAF is a closed-set test with four alternative CVC words per trial. The words vary only in a single phoneme, either the initial (9 sets) or the final (11 sets) consonant of the word. All target words were presented in the carrier sentence "Can you hear \_\_\_ clearly" and were followed by the visual presentation of four minimally paired alternatives from which participants chose their response. For instance, the target word mail might be paired with bail, nail, and dale. Following a short practice session, 20 test trials were randomly selected from a larger test base and the percentage of correctly perceived words was measured. Responses were given via touch screen and feedback on the correct response was provided.

The Dual Task of Listening and Memory required participants to listen to and repeat words while retaining digits in memory. In the speech perception part of the task they listened to lists of five AB isophonetic monosyllabic (CVC) words (Boothroyd, 1968) presented at 65 dB SPL in either quiet or a 20-talker babble background at two SNRs, 0, and −4 dB. Listeners were asked to repeat each word immediately after presentation and were instructed to prioritize both tasks equally. A total of 12 lists (four in each background condition) was presented, with presentation order of noise conditions counterbalanced across participants. A maximum score of 20 per background condition was possible. The word score (Single Words) will be reported as part of the speech perception results.

The Modified Coordinate Response Measure (MCRM) (Hazan et al., 2009) measures closed-set keyword perception in a sentence carrier. In contrast to the carrier sentence in the FAAF, of which only one version existed and which was only meant to alert the listener to the presence of the target, the carrier sentence in the MCRM varied in call sign and voice, with only one combination representing the carrier sentence of the target stimulus. The task was based on the Coordinate Response Measure (Bolia et al., 2000). Participants were presented with sentences in the form of 'Show the [animal] where the [color] [number] is'. There were six possible monosyllabic animals (cat, cow, dog, duck, pig, and sheep), six colors (black, blue, green, pink, red, and white) and eight numbers (1–9, excluding multisyllabic 7). Two sentences were presented concurrently, one by a female talker (target) and one by a male talker (distractor). Participants were asked to listen for the color and number spoken by the female talker ('dog' was always the animal target) whilst ignoring the male talker, and to respond by pressing the corresponding target color-number on a computer touchscreen. The test used an adaptive 1-down 1-up staircase method with an initial step size of 10 dB until reversal 1, reducing to 7 dB at reversal 2, and 4 dB at reversal 3 onward and continued until eight reversals were achieved. Speech reception thresholds were calculated as the SNR in dB required to achieve 50% intelligibility in the last two reversals.
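
The adaptive track described above can be sketched in code. Several details are assumptions made only for illustration: the step-size schedule is read as 10 dB before the first reversal, 7 dB after the first, and 4 dB from the second onward; the starting SNR of 20 dB is hypothetical; and the listener is a deterministic stand-in who responds correctly whenever the SNR is at or above a hypothetical threshold of −3 dB.

```python
def step_size(n_reversals: int) -> float:
    """Assumed schedule: 10 dB, then 7 dB, then 4 dB as reversals accumulate."""
    if n_reversals == 0:
        return 10.0
    if n_reversals == 1:
        return 7.0
    return 4.0

def run_staircase(respond, start_snr=20.0, n_reversals=8, max_trials=200):
    """1-down 1-up track; returns (SRT from the last two reversals, all reversals)."""
    snr = start_snr
    last_direction = None
    reversals = []
    for _ in range(max_trials):
        # Correct response -> lower the SNR (harder); incorrect -> raise it (easier).
        direction = -1 if respond(snr) else 1
        if last_direction is not None and direction != last_direction:
            reversals.append(snr)
            if len(reversals) == n_reversals:
                break
        last_direction = direction
        snr += direction * step_size(len(reversals))
    srt = sum(reversals[-2:]) / 2 if len(reversals) >= 2 else None
    return srt, reversals

# Hypothetical deterministic listener with a true threshold of -3 dB SNR.
srt, reversals = run_staircase(lambda snr: snr >= -3.0)
print("reversal SNRs:", reversals)
print("estimated SRT:", srt, "dB SNR")
```

With a real listener the responses near threshold are probabilistic, so the track wanders rather than alternating strictly; the `max_trials` guard is included here only to keep the sketch from looping indefinitely.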

### Self-Report of Hearing Difficulties

The Glasgow Hearing Aid Benefit Profile (GHABP) (Gatehouse, 1999) assesses unaided pre-intervention hearing disability (or activity limitations) and handicap (or participation restrictions) in Part 1, and benefit and satisfaction derived from hearing aid (HA), reported HA use, and residual disability (i.e., the disability that remains despite using HA) in Part 2. There are four predefined situations (Q1: Listening to the television with other family or friends when the volume is adjusted to suit other people; Q2: Having a conversation with one other person when there is no background noise; Q3: Carrying on a conversation in a busy street or shop; Q4: Having a conversation with several people in a group), using a five-point scale (residual disability: 1 = no difficulty to 5 = cannot manage at all). The score for each domain was converted to a percentage score. For residual disability, the main communication measure, both the mean overall score averaged across all four situations and the individual scores for each of the four listening situations were considered.

### Cognitive

Two subtests of the Test of Everyday Attention (TEA6 and TEA7) (Robertson et al., 1994) assessed single and divided attention. In the single attention Telephone Search (Subtest 6) participants had to identify 20 pairs of identical symbols, as quickly and accurately as possible, and ignore all other symbols while searching entries in a simulated classified telephone directory. The score was calculated as the total time taken to complete the test divided by the number of symbols detected. Lower values represent superior performance. Divided attention was measured with the Telephone Search (Subtest 7, dual task) that was identical to Subtest 6 except participants were additionally required to count and report the number of tones from a string of 1-kHz tones of varying lengths while searching for the symbols. The score was obtained separately for each task, and in combination to give a dual task decrement (DTD). For statistical analyses, the scales for both TEA subtests were reversed to harmonize the direction of scoring with the other cognitive tests where higher scores indicated a better performance.

The IMAP (IHR Multicentre study of Auditory Processing test) measures auditory and visual sustained attention by comparing reaction times (RTs) to target stimuli when cues to their presence are either present or absent (Moore et al., 2010). In the auditory modality, listeners were asked to press a button in response to a 1-kHz 200-ms tone presented at 80 dB SPL as quickly as possible. On 20/36 of the trials the target sound was preceded by a 125 ms modulated tone with a carrier frequency of 0.6–4.0 kHz and a modulation frequency of 32 Hz, which was presented at 75 dB SPL. Listeners were instructed to regard this "chirp" as a cue to the upcoming target stimulus. In the visual task, participants responded with a button press when an animated character displayed on a computer screen raised their arm. On 20/36 of the trials the arm movement was primed by a change of the character's t-shirt color. The test comprised a total of 72 trials, 36 auditory and 36 visual, 20 of which in each modality were primed. All targets were spaced 1–4 s apart, and if a cue was present it preceded the target stimulus by 500–1000 ms. In both tests, the mean response times in ms to cued and uncued trials represented the outcome variable.

The Letter Number Sequencing (LNS) task (Wechsler, 1997) is a measure of verbal WM in which participants were asked to repeat a string of pre-recorded numbers and letters (e.g., 4-S-6-A) with numbers in numerical order first, followed by letters in alphabetical order (e.g., 4-6-A-S). Sequences began with two items and had the potential to increase to a maximum of eight items. For each sequence length, three trials were presented for which a participant needed to correctly recall at least one out of the first two trials in order to advance to the third trial and the next longer sequence. When no trial of a particular sequence size was correctly recalled, the task was terminated. The overall number of correctly recalled sequences was used as outcome measure.
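
The reordering rule that defines a correct LNS response can be written down directly. The helper names below are hypothetical and the hyphen-separated string format simply mirrors the example in the text.

```python
def lns_target(sequence: str) -> str:
    """Expected response: digits in numerical order, then letters alphabetically."""
    items = sequence.split("-")
    digits = sorted((item for item in items if item.isdigit()), key=int)
    letters = sorted(item for item in items if item.isalpha())
    return "-".join(digits + letters)

def is_correct(presented: str, response: str) -> bool:
    """A trial counts as recalled only if the whole reordered sequence matches."""
    return response.upper() == lns_target(presented)

print(lns_target("4-S-6-A"))
print(is_correct("4-S-6-A", "4-6-a-s"))
```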

The Size Comparison Span (SICspan) (Sörqvist et al., 2010) measures the ability to exclude irrelevant information from WM while retaining target items for later recall, thus testing verbal WM together with aspects of response inhibition. The task consisted of two parts, a size judgment task and a memory task. The stimuli to the first task were to be ignored after the task was completed. For instance, participants were presented with the following words: "Is CAT larger than COW? CROCODILE," were expected to respond yes or no to the size comparison element (in this example no) and then encode the third word into memory (i.e., crocodile) for recall at the end of the list. The total number (out of 40) of correctly recalled memory words (SICspan Size) was the outcome measure. When a participant recalled a size comparison word instead of a target word, this was classed as a list intrusion (SICspan Intrusions). The total number of intrusions (out of a possible 80) was summed across the whole test.

The Dual Task of Listening and Memory required participants to listen to and repeat words while retaining digits in memory, and was originally designed to assess listening effort (Howard et al., 2010). Participants were asked to encode a string of five digits displayed on a computer screen during a 5 s period for later recall. After encoding, listeners completed the speech perception task as described above. After the completion of the speech perception task participants were asked to recall the encoded digits. A maximum score of 20 was possible per noise condition. In the following the digit score (Digits) will be reported as part of the cognitive results.

# RESULTS

**Table 1** displays the descriptive information for all variables of interest in the study. Note that the perception component of the Dual Task (Words) is classified as a word perception task while the memory component of the same task is classified as cognitive task (Digits), even though both elements of the task were always presented concurrently, in three different background noise conditions.

For the Dual Task of listening and memory there was a steady decline in the intelligibility of the words (Words) from the quiet condition to the 0 dB SNR to the −4 dB SNR condition (**Figure 2**). In contrast, memory for the digits in the same task (Digits) showed a decrease from the quiet to 0 dB SNR condition, but recovered at the most adverse noise level (−4 dB SNR). Two repeated-measures ANOVAs for Words and Digits with background condition (quiet, 0 dB SNR, −4 dB SNR) as the only factor confirmed these patterns (Words: F[2,58] = 238.5, MSE = 6.3, p < 0.001, Quiet > 0 dB SNR > −4 dB SNR; Digits: F[2,58] = 5.77, MSE = 13.6, p = 0.005, Quiet = −4 dB SNR > 0 dB SNR).
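
The degrees of freedom reported above follow from the design: with 30 participants and three background conditions, a one-way repeated-measures ANOVA has (3 − 1) = 2 and (3 − 1)(30 − 1) = 58 degrees of freedom. The sketch below shows the standard sum-of-squares partition on a small set of hypothetical scores; it does not use or reproduce the study data, and no sphericity correction is applied.

```python
def rm_anova_oneway(scores):
    """scores[i][j]: participant i, condition j. Returns (F, df_effect, df_error, MSE)."""
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    cond_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subj_means = [sum(row) / k for row in scores]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    # Between-participant variability is removed from the error term.
    ss_error = ss_total - ss_cond - ss_subj
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    mse = ss_error / df_error
    return (ss_cond / df_cond) / mse, df_cond, df_error, mse

# Hypothetical word scores for four participants in quiet, 0 dB and -4 dB SNR.
toy = [[10, 7, 4], [12, 8, 5], [11, 9, 3], [13, 8, 6]]
f_value, df_effect, df_error, mse = rm_anova_oneway(toy)
print(f"F[{df_effect},{df_error}] = {f_value:.3f}, MSE = {mse:.3f}")

# Thirty participants and three conditions give the df pattern reported in the text.
print((3 - 1, (3 - 1) * (30 - 1)))
```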

The IMAP attention task showed that RTs to cued stimuli were faster than to uncued stimuli, that RTs to visual stimuli were generally faster than to auditory stimuli, and that the difference between cued and uncued stimuli was greater for visual than auditory stimuli (**Table 1**). These patterns were confirmed in a 2 modality (visual, auditory) × 2 cue (uncued, cued) repeated-measures ANOVA, which showed main effects for modality (F[1,29] = 64.2, MSE = 6837.5, p < 0.001), and cue (F[1,29] = 126.9, MSE = 5086.3, p < 0.001) and an interaction between the two (F[1,29] = 28.4, MSE = 2593.2, p < 0.001).

# Correlation between Speech Perception Tests and Self-reported Communication Abilities

H1: As speech intelligibility was not assessed in a way that involved an increase of either background noise or overall stimulus level, we predict no correlation between the speech perception tests and the averaged scores of self-report questionnaires.

Pearson Product-Moment correlations between the overall score of residual disability and speech perception were as follows: Phoneme Discrimination (PD) in noise r = −0.42 (p = 0.03); FAAF r = −0.26 (ns); word perception dual task (Words) in quiet r = −0.26 (ns), 0 dB SNR r = −0.38 (p = 0.04), and −4 dB SNR r = −0.29 (ns), MCRM r = 0.29 (ns). Hence, there were two significant correlations with the overall residual disability score: with PD and with Word perception at 0 dB SNR. For Word perception this means that listeners with better intelligibility scores tended to have lower residual disability scores, as might be expected. When BEA was partialled out, the correlation disappeared (r = −0.12). The correlation between PD scores and residual disability was both unexpected and counterintuitive and was unaffected by hearing loss (r = −0.47 with BEA partialled out), and suggested that listeners with better phoneme discrimination ability (lower scores) tend to have higher disability scores. We speculate about the underlying reasons for this result in the Discussion section.
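
Partialling out BEA corresponds to a first-order partial correlation, which can be computed from the three pairwise correlations. The sketch below uses the standard formula; the input values are hypothetical and chosen only to show how a moderate raw correlation can shrink once a shared covariate is controlled, since the pairwise correlations with BEA are not given in this passage.

```python
import math

def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y with z partialled out of both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical values: x = word score, y = residual disability, z = BEA.
r_raw = -0.40
r_partial = partial_correlation(r_raw, r_xz=-0.60, r_yz=0.50)
print(f"raw r = {r_raw:.2f}, partial r = {r_partial:.2f}")
```

When both variables are related to hearing thresholds in opposite directions, much of their raw association can be attributed to BEA, which is one way a correlation can "disappear" after partialling.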

More pertinent, however, are the correlations between the speech perception tests and the residual disability score for each of four individual GHABP situations, shown in **Table 2**.

Spearman coefficients were used because of the ordinal scale on the GHABP. Except for listening to a conversation in quiet (i.e., the easiest listening situation), all the GHABP pre-defined situations were significantly correlated with performance on at least one speech perception test. Listening to a TV set to someone else's need (Q1) correlated with PD, following a conversation in a busy street or shop (Q3) correlated with performance on word perception tests in noise (i.e., FAAF and single word perception), and following a group conversation with several people (Q4) correlated with performance on both PD and the MCRM keywords in the carrier sentence.
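
A Spearman coefficient is a Pearson correlation computed on ranks, which is why it suits ordinal ratings such as the five-point GHABP scale. The following is a minimal sketch with tied values given their average rank; the ratings and word scores are hypothetical and the function names are introduced here for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: residual disability ratings (1-5) and word scores (out of 20).
ratings = [1, 2, 2, 3, 4, 5, 3, 2]
word_scores = [18, 17, 15, 12, 9, 6, 13, 16]
rho = spearman(ratings, word_scores)
print(f"Spearman rho = {rho:.2f}")
```

Ties are common on a five-point scale, so the average-rank treatment matters more here than it would for continuous scores.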


### TABLE 1 | Mean, standard deviation (SD) and range for demographic information and experimental variables.

PD, Phoneme Discrimination task; FAAF, Four Alternative Auditory Feature word perception task; Words, single word perception aspect of dual task; MCRM, Modified Coordinate Response Measure; Digits, five digit encoding and recall aspect of Dual task with word perception task presented in quiet, at 0 dB SNR, −4 dB SNR; IMAP, IHR Multicentre study of Auditory Processing test; SICspan Size, Size Comparison span, span size; GHABP, Glasgow Hearing Aid Benefit Profile.

Similar to the overall score results, correlations between word perception and residual disability were in the expected direction with better speech performance scores associated with lower residual disability scores for Q3 (conversation in a busy street). Unlike the overall scores, it was not only Word perception at 0 dB SNR that showed a significant correlation to the self-report residual disability score, but also Word perception at −4 dB SNR and the FAAF. All of these tests require listeners to perceive words in a background of noise.

# Correlation between Speech Perception and Cognition

H2: We predict that the relationship between speech perception and cognition will not be uniform across different speech perception tests but will instead be specific to a particular test, and that it will become more evident as the complexity of the target speech, defined by its linguistic and other cognitive demands, and the complexity of the masker are increased.

There were moderate, significant correlations for BEA with all the speech perception tests except PD in noise and Word perception at −4 dB SNR (see Supplemental Information). The correlation was negative for FAAF and Word perception, reflecting the fact that increased BEA thresholds were associated with decreased perceptual accuracy. The correlation was positive for MCRM because an increased BEA was associated with an increased SNR. PD did not correlate with performance on any cognitive test, whereas performance on the MCRM correlated with performance on a broad range of cognitive tests. Performance on the FAAF and Word perception tests was most strongly correlated with verbal WM (LNS), alongside a correlation with one other cognitive test each. Exact values for all correlations are reported in Supplementary Table S1. Because most speech perception tests significantly correlated with BEA, Supplementary Table S2 presents the same correlations with BEA partialled out. The correlational patterns did not change substantially when hearing sensitivity (BEA) was partialled out. These patterns are consistent with our previous study (Heinrich et al., 2015) in that there were different cognitive profiles for different speech perception tests. Self-reported residual disability correlated with cognition in only two instances, but notably these occurred for the two situations (conversation in a busy shop, conversation with a group of people) that are most likely to engage cognition (Supplementary Table S3).

## Latent-Factor Analyses (PCA)

H3: We predict that we will replicate a two-factor PCA solution with WM storage and attention when including only those tests that are comparable to the previous study. However, when including cognitive tests that include components of executive function, we expect to find a third factor, based on Diamond (2013), that splits the previous attention factor into an interference and a response control executive factor.

Cognitive tests that were broadly comparable between the current and the previous study were TEA6/7, LNS, SICspan Size, and the Digits in quiet. The two TEA tests were identical between the two studies. LNS and SICspan Size tests were similar to the Backward Digit Span (BDS) and Visual Letter Monitoring (VLM) of the previous study as all four tests are WM tasks with storage and processing components. The LNS was deemed particularly similar to BDS and VLM tasks because in all of these tasks the processing component was integral to the span task. In contrast, the SICspan Size test contained a processing component that was not integral to completing the span task. The Digit Quiet task was included because it was also similar to the Digit Span Forward task: both were pure serial recall/storage tasks without a processing component. Using these five cognitive tests in a PCA with Varimax Rotation that extracted all factors with eigenvalues > 1 led to a two-factor solution that explained 64.9% of the overall variance (KMO = 0.5, Bartlett's test of sphericity: χ<sup>2</sup>(10) = 28.1, p = 0.002). Factor loadings are displayed in **Table 3**.
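Two pieces of arithmetic underlie the reported solution. First, Bartlett's test of sphericity on p variables has p(p − 1)/2 degrees of freedom, so five tests give the 10 degrees of freedom reported above. Second, extracting "all factors with eigenvalues > 1" (the Kaiser criterion) retains only components whose eigenvalue exceeds 1, and the variance explained is the sum of the retained eigenvalues divided by the sum of all eigenvalues. The sketch below illustrates both steps. The function names are illustrative and the eigenvalues are hypothetical, not those of the study.

```python
def bartlett_df(n_variables: int) -> int:
    """Degrees of freedom of Bartlett's sphericity test for p variables."""
    return n_variables * (n_variables - 1) // 2


def kaiser_retained(eigenvalues, threshold=1.0):
    """Keep only components whose eigenvalue exceeds the threshold."""
    return [ev for ev in eigenvalues if ev > threshold]


def variance_explained(eigenvalues, retained):
    """Proportion of total variance carried by the retained components."""
    return sum(retained) / sum(eigenvalues)


# Hypothetical eigenvalues of a 5 x 5 correlation matrix (they sum to 5).
eigenvalues = [1.9, 1.3, 0.8, 0.6, 0.4]
retained = kaiser_retained(eigenvalues)

print(bartlett_df(5))                    # degrees of freedom for five tests
print(len(retained))                     # number of components kept
print(round(variance_explained(eigenvalues, retained), 2))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, a component with an eigenvalue above 1 carries more variance than any single standardized test, which is the rationale for the threshold.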

The solution replicated the factor structure found in our previous study (Heinrich et al., 2015) in that it showed an attentional factor on which the Tests of Everyday Attention (TEA) loaded and a WM factor on which the tasks with storage and processing components loaded. A second analysis included all the cognitive tests, which had been selected to specifically assess executive function: sustained attention (IMAP visual, IMAP audio), aspects of inhibitory control (SICspan Intrusions, Digits), and dual attention (Digits at 0 and −4 dB SNR). Note that for tests with measures for individual subcomponents as well as difference scores, such as the TEA and IMAP, only one or the other was included in the PCA. For the TEA, the two component tests TEA6 (single attention) and TEA7 (divided attention), but not their difference score, were included. This was done in order to preserve continuity with the previous study, which had also included the component scores in the PCA. For the auditory and visual IMAP tests, only the difference scores were included because no precedent for using the component scores existed, and using the difference scores was a more efficient way of combining information. A PCA with Varimax rotation that extracted all factors with eigenvalues > 1 resulted in a three-factor solution that explained 63.3% of overall variance (KMO = 0.6, Bartlett's test of sphericity: χ<sup>2</sup>(45) = 86.0, p < 0.001). Factor loadings are displayed in **Table 4**.

### TABLE 2 | Spearman correlation coefficients between residual disability scores for each of the four GHABP listening situations and the speech perception tests.

Acronyms as in Table 1. <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01.

Some aspects of the 3-factor model looked similar to the 2-factor model, even though factor labels have changed. TEA subtests loaded on one component, while SICspan Size and LNS loaded on another. The most notable change between the two models was that the storage factor loading of Digit Quiet was less pronounced. Instead, Digit Quiet, together with Digit 0 dB SNR (and to some extent Digit −4 dB SNR), loaded on a new factor (i.e., not WM). Factor labels reflect to some extent constructs of the Diamond (2013) model. A high score on Factor 1 combines good performance on TEA tests, a large difference between uncued and cued attentional IMAP trials and poor Digit memory at −4 dB SNR. The factor may indicate the involvement of attention (as indexed by TEA scores), but also an inability to sustain attention and inhibit extraneous distractors. As such it may be indicative of poor attentional interference control. A high score on Factor 2, which combines good Digit memory in quiet and at 0 dB SNR (and to some extent at −4 dB SNR) with many intrusion errors on the SICspan task, may indicate good memory storage that enables good memory performance despite intrusion of other information. A high score on Factor 3, which combines SICspan Size and LNS, indicates good verbal WM processing performance. The importance of the processing component for this factor might be emphasized by the fact that the IMAP difference score also has a secondary loading on this factor. Note that all principal component analyses are post hoc and exploratory as factor extraction is solely based on the amount of shared variance between measured tests. The resultant factor structure is therefore not theoretically motivated and should be interpreted with caution.

In a final analysis we investigated the effectiveness of the three latent factors for the prediction of speech perception performance and self-reported residual disability (**Table 5**). Forward stepwise regression analyses on the six speech perception tests were performed. BEA and age were always entered in a first step; all latent factors were entered together in the second step.
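The two-step entry described above amounts to comparing the variance explained by a baseline model (age and BEA) with that of a model that adds the cognitive predictors. The sketch below illustrates that logic with ordinary least squares in plain Python. It is an assumption-laden illustration rather than the analysis pipeline used in the study: all data values are invented, only one hypothetical cognitive factor (`wm`) is added in step 2, and the helper names are arbitrary.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(predictors, y):
    """R^2 of an OLS fit of y on the predictor columns plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(bk * col[i] for bk, col in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Invented data for eight hypothetical listeners (illustration only).
age = [62, 65, 68, 70, 71, 74, 77, 80]
bea = [30, 42, 35, 48, 38, 52, 45, 55]
wm = [0.8, -0.2, 1.1, -0.9, 0.3, -1.2, 0.5, -0.4]
score = [82, 70, 85, 60, 76, 55, 72, 63]

r2_step1 = r_squared([age, bea], score)      # step 1: age and BEA only
r2_step2 = r_squared([age, bea, wm], score)  # step 2: add cognitive factor
delta_r2 = r2_step2 - r2_step1               # variance added by cognition
print(round(r2_step1, 3), round(r2_step2, 3), round(delta_r2, 3))
```

The quantity of interest is the increase in R² from step 1 to step 2, which indicates how much variance the added predictor explains beyond age and hearing sensitivity.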

For PD in noise, cognition did not predict performance. For all other speech perception tests (FAAF, Words, MCRM), cognition predicted performance. For all the Word perception tests, the verbal WM component drove the predictive power of cognition. For the MCRM task, in addition to verbal WM, response control and to a lesser degree attentional interference control also contributed to explaining variance. The cognitive test that probably drove the predictive power of the verbal WM component was the LNS, which showed correlations with all word-in-noise tests (Supplementary Tables S1 and S2). Consistent with the fact that MCRM performance was predicted by a broader range of cognitive components is the finding that it was correlated with a broader range of cognitive tests (Supplementary Tables S1 and S2). Interestingly, even though both Q3 and Q4 each correlated with one cognitive measure (Supplementary Table S3), this was not reflected in the regression analysis after the measures had been combined into latent variables. For instance, there was a moderate correlation between Q3 and the IMAP auditory difference score. However, this difference score was only one of five scores that formed the attentional interference score, and indeed it only had a loading of 0.63 on the latent factor. Very likely, the correlation was not strong enough to overcome its small role on the latent factor.

# DISCUSSION

Many older adults find it challenging to communicate effectively in noisy environments. The discomfort and frustration resulting from this can prompt withdrawal or avoidance of social situations, which can in turn severely limit activities (Heffernan et al., 2016). This can result in a less active and satisfying lifestyle, and may lead to depression (Cohen-Mansfield et al., 2010; Mikkola et al., 2016). Understanding why older listeners struggle with speech perception in noisy situations is a critical first step to any rehabilitative effort to ensure successful communication, active aging and well-being. One vital question in this context is how to best measure communicative functioning. Self-report and behavioral measures are widely used. Intriguingly, these measures can provide seemingly contradictory information, as self-reported difficulties are not always captured by behavioral tests, and behavioral test results do not always reflect listener experience. A better understanding of why the results of these two types of measures are so poorly correlated may guide us to construct speech-in-noise tests that better reflect the listener's everyday experience, which would provide a first step to successful rehabilitation.

### TABLE 4 | Factor loadings for ten cognitive tests producing a three-factor solution in a Principal Component Analysis.

Acronyms as in Table 1.

### TABLE 5 | Results for forward stepwise regression models carried out for each of six speech perception tests.

In step 1 age and BEA were added. In step 2 the three latent cognitive factors (see Table 4) were entered. Acronyms as in Table 1.

Here, we approached this question from two perspectives. First, we investigated whether we could better understand the relationship between self-report and behavioral tests by being more specific about individual listening situations, both behavioral and self-report. Hence, we investigated the association between behavioral speech perception tests and specific self-report situations rather than just the averaged overall scores of questionnaires.

Second, we investigated the role of cognition in the understanding of listening difficulties. It has long been known that cognition is important for speech perception (Akeroyd, 2008), but which cognitive aspects support listening in which situation, remains to be understood. Recently, we have argued that the relationship between speech perception and cognition is specific to the particular speech test condition (Heinrich et al., 2015), and that more complex listening situations engage more and different aspects of cognition than less complex listening situations (Ferguson et al., 2014). Here, we expanded on this notion by considering a range of specific listening situations, from simple (phonemes in modulated noise) to complex (keyword perception in a carrier sentence with competing talker), and a greater range of theoretically motivated cognitive functions than previously (Heinrich et al., 2015).

In addition to the relationship between behavioral tests, self-report measures and cognition, it is also important to bear in mind that listeners' sensory auditory function declines as they age and that they have increasing difficulties with listening to speech in noise (CHABA, 1988; Pichora-Fuller, 1997). While auditory decline and speech-in-noise perceptual difficulties are related to some degree, the relationship is far from perfect (Luterman et al., 1966; Phillips et al., 2000; Schneider and Pichora-Fuller, 2001; Pichora-Fuller and Souza, 2003; Gifford et al., 2007). In our earlier paper, we investigated older adults with mild hearing loss who had not sought hearing aids. In the current paper we expanded the range of participants to older listeners with a mild-to-moderate hearing loss who wore hearing aids. We investigated whether the previously found relationships in Heinrich et al. (2015) would hold for a group of listeners who used hearing aids (the current study). One aspect that remained similar across studies was the nature of the target speech; both studies used single CVC words (digit triplet test vs. FAAF test and single word perception), and either a simple sentence or a keyword in a carrier sentence measure (Adaptive Sentence List vs. MCRM). However, our two studies were not directly comparable in quantitative terms as some test measures, particularly background maskers and cognitive tests, had changed. Lastly, because the hearing sensitivity characteristics of the listeners had changed, the most appropriate aspect of self-report assessed in the GHABP changed from initial disability (used for non-hearing aid users) to residual disability after hearing aid use. Nevertheless, both studies tested similar concepts (speech perception in noise; self-report; cognition) and thus are comparable in principle. Specific hypotheses are discussed below.

# Correlation between Speech Perception Tests and Self-Reported Communication Abilities

The first hypothesis concerned correlations between speech perception accuracy and overall scores of self-reported hearing disability. Heinrich et al. (2015) failed to find significant correlations for the vast majority of comparisons in which speech perception was assessed without raising the overall presentation level. We replicated the failure to find consistent correlations between speech perception and overall self-report scores. Only for two of the speech perception tests did the overall GHABP residual disability score correlate with speech perception. Those tests were Word perception at 0 dB SNR and Phoneme Discrimination (PD) in noise. Moreover, the correlation was in the expected direction only for the former test, where better perception scores correlated with lower perceived disability. The correlation with PD was counterintuitive; we can only speculate as to why this happened. Possibly, listeners who function well in their auditory environments and in psychometric speech perception tests employed a very different listening strategy for the PD task compared to listeners who generally function less well in auditory environments. A direct comparison with the previous study, which found no correlation at all, is made difficult by the fact that the PD task had been previously presented in quiet whereas here it was presented in noise. One potentially interesting detail is that, if one considers the sizes of the correlations between behavioral measures of speech perception and overall self-report scores in the two studies, it was only the correlation involving PD that was significantly higher in the current than in the previous study; all other correlation coefficients were roughly of a similar size.

In contrast to a relative lack of significant correlations with overall self-report scores, some consistent patterns emerged for correlations between specific GHABP situation scores and each speech perception test. For instance, there were consistent significant correlations between the situation describing a conversation in a noisy background (Q3) and all but one (Words in Quiet) word perception in noise tests. The tests with significant correlations (FAAF and Word perception at 0 and –4 dB SNR) all shared two features that distinguished them from all other tests: first they required listeners to perceive isolated words, second all words were embedded in 20-talker babble. The consistent correlations suggest that either or both of these characteristics assess an aspect of listening that is also important for following a conversation in noise (Q3). It also suggests that this aspect is not assessed by either PD in noise or by word perception in a carrier sentence masked by a single talker (MCRM).

The correlations between cognition and speech perception tests suggest that performance on word perception tests covaries mainly with verbal WM (LNS). This is in agreement with Akeroyd (2008) who found that results in most of the speech-in-noise perception studies surveyed correlated with verbal WM. Why this correlation only occurred with LNS but not SICspan Size is a matter of speculation. One possible interpretation is that a WM task only measures skills relevant to speech-in-noise perception when the task involves the manipulation of the recalled material (as is the case in the LNS task) and not when the manipulation and recall concerns separate materials (as in the case of the SICspan Size). Something in the listening task, either the separation of words from background noise or the dealing with multi-talker babble, uniquely engages verbal WM as measured by the LNS task. The same aspect of the LNS task may also provide the link to the self-report measure. While the correlation between LNS and Q3 is not significant, numerically it does provide the second highest value of correlations between Q3 and cognition, lending at least some credence to our speculation.

Finally, the MCRM task engaged more cognitive processing than solely LNS, possibly diluting any correlation with the self-report ratings on Q3. Instead, performance on MCRM sentences correlated with self-reported functioning in a more complex situation (i.e., participating in a group conversation, Q4) where more complex speech phrases and fewer background talkers are more common. While the correlation between MCRM and group conversation (Q4) makes intuitive sense, it is less intuitive to understand why self-rated ability to hold a group conversation also shares common variance with PD in noise. We speculate that this result may be an expression of the use of different listening strategies between listeners who functioned well or not so well in their auditory environments and in psychometric speech perception tests. PD in noise also correlated with self-reported residual disability concerning the TV level set to suit other people's need (Q1). In both cases, Q1 and Q4, the correlation with PD was negative indicating that better self-reported functioning in the listening situation was associated with worse performance on the PD test.

The current data set cannot differentiate between the two interpretations of whether it was the foreground speech (words as opposed to phonemes or phrases) or the background (multi-talker babble as opposed to modulated noise or single talker background) that led to the distinct correlations with cognitive processing and self-report. This question will have to be addressed in a future study that manipulates the characteristics of the background sound systematically.

Self-rated residual disability in our group of hearing aid users was largely independent of cognitive ability. Only the questions from the more challenging situations (following a conversation on a busy street (Q3), and in a group of several people (Q4)) showed a single correlation with cognition. In both cases this was either cued attention, or the difference between cued and uncued attention, presumably because for the situations described in Q3 and Q4, the ability to pay attention is crucial to successful listening. However, neither correlation was strong enough to be a significant predictor in the regression model of latent predictors. These differences in correlations with cognitive abilities between behaviorally measured speech perception and self-reported residual disability imply that the correlation shared between word perception tests and Q3, and MCRM and Q4, was not moderated by cognition alone, and must reflect some other shared dimension between these measures.

However, there is also a more conceptual difference between behavioral speech perception tests and self-report measures, which may explain some inconsistencies in correlations and which is much more difficult to tackle. In particular, it is possible that many laboratory-based behavioral speech perception tests do not capture the demands of listening in the real world. Examples of mismatches between the two types of situations are that the SNRs in laboratory-based speech perception tests are often more adverse than in real life situations (Smeds et al., 2015), that the tests often neglect reverberation, and that they are often perception or comprehension tests without the necessity for the listener to engage in two-way interactions. If listeners refer to memories of their real life situations when responding to self-report measures, then it is not surprising that correlations between behavioral speech perception tests and self-report measures are often low. In order to remedy this issue, the conceptualization of speech perception tests as a whole would need to be reassessed.

# Correlation between Speech Perception and Cognition

We found no significant correlations between cognitive performance and PD suggesting that none of the cognitive abilities tested here (attentional interference control, response control, verbal WM) played a role in this speech task. This replicates previous null findings from Heinrich et al. (2015). All other speech perception tests correlated highly with the LNS suggesting that good verbal WM abilities were important for good performance on these speech perception tests. As seen from the correlations (Supplementary Tables S1 and S2) but not the latent-variable regressions, each speech test was also associated with an additional specific cognitive ability, which varied from test to test. For the FAAF it was the SICspan Size test that measured the ability to exclude irrelevant information from WM while retaining target information for later recall. For Word perception it was either sustained attention when words were presented in quiet or good memory storage when words were presented at an adverse SNR. Presumably in the former case, good performance mostly depended on being able to keep attention on perception (while also holding digits in memory) in order to hear all the words, while in the latter, a good memory helped because it meant that fewer attentional resources were needed to retain the digits in memory and more resources could be spent on the speech perception task. The MCRM task engaged a wide variety of cognitive abilities. Taking these cognitive profiles into account when interpreting speech performance may help us understand why some listeners do better than others and how we can choose speech perception tests that either maximize or minimize cognitive differences between listeners.

In showing different cognitive profiles for different speech perception tests the study replicates a main finding from Heinrich et al. (2015), despite a different group of listeners and slightly changed speech perception and cognitive tests. This replication under changed circumstances suggests that speech perception tasks do indeed differ in cognitive profile and that the previous results were not due to the peculiarities of either the listener group or the combination between speech and cognitive tests. Different cognitive profiles for different listening situations have also been found by Helfer and Freyman (2014). The finding also extends a number of previous studies that either used one speech perception test and a number of cognitive tests (Besser et al., 2012; Zekveld et al., 2013) or a number of speech tests and only one or two cognitive tests (Desjardins and Doherty, 2013). Until now, no systematic comparison between speech perception situations and cognitive abilities exists across a range of systematically varied listening situations, and comparisons always have to be made across studies. These direct comparisons between studies are difficult as typically both fore- and background sounds as well as the assessed cognitive abilities change from study to study.

A much more systematic and theoretically driven approach to the variation of fore- and background speech as well as the assessment of cognitive function is needed. With this and the previous (Heinrich et al., 2015) study we are attempting to start laying the ground for such theoretical underpinnings by discussing selected cognitive tests within wider frameworks of cognitive functioning (see the third hypothesis). In the previous study we discussed the cognitive results within latent factors for WM storage and attention, considered within the framework of Baddeley's WM model of storage and manipulation (Baddeley and Hitch, 1974; Baddeley, 2000). We concluded that manipulation (attention) was particularly important in the most complex listening situation. Here, we expanded on the notion of attention by putting it within the framework of executive functioning as represented by Diamond (2013), a concept that has long been claimed to be important for speech perception (Sommers and Danielson, 1999; Janse, 2012; DiDonato and Surprenant, 2015; Helfer and Jesse, 2015). In using a more differentiated approach to testing executive functions that considers executive functions, not as a whole (as does Baddeley), but as distinguishable processes (cognitive and attentional interference control, response control), we were able to specify that it may have been the response control aspect of attention that predicts speech perception performance in more complex listening situations.

One difference we found between Heinrich et al. (2015) and the current study is the fact that single word perception, operationalized as triple digits in the previous study and as FAAF or single words in a dual task in the current study (Words), seemed to engage different cognitive processes. Whereas triple digits performance was not predicted from the cognitive performance of the tests assessing WM and attention, FAAF and Word perception were predictable from the WM performance as measured here. Two possible explanations for this divergence in results are considered. First, the Word perception task in the current study might have been more complex. The pool of target words consisted of more than a closed set of nine digits, and the background sound was multi-talker babble not speech-shaped noise. Maybe this difference was enough to engage WM. Alternatively, it is possible that the change in listener group caused the change in correlational pattern. While single word perception operationalized as digits in noise does not engage WM in older listeners with no hearing aids, it does so in older hearing aid wearers. Differentiating between these two interpretations requires a direct comparison between these listener groups within the same study.

The current study was explorative in nature, set up to test the influence of a large range of possible variables. As a result, test selection of both speech perception tests and cognitive tasks was not as systematic as a careful elucidation of mechanisms would have demanded. On the other hand, however, this approach allowed us to define conditions that should be satisfied in future studies in order to advance our understanding of cognitive contributions to speech perception. First, listening situations need to be more complex than perception of single words, in order to draw out executive contributions. Second, the characteristics of fore- and background signals need to be systematically and parametrically manipulated to understand which aspect of listening engages which aspect of cognition. Third, embedding cognitive test selection within general cognitive frameworks may allow us to discuss cognitive processes not only on the level of selected tests but on the level of underlying cognitive components, and may thus make it easier to compare across studies. It may also allow us to connect speech perception research more closely with the wider cognitive research community.

There were some limitations of study design and analysis, which restricted the interpretability of the results. First, both target speech and maskers were varied across conditions, which made an interpretation of the results concerning the associations with self-report measures and cognition harder and less reliable. Moreover, although in agreement with other recent studies (Corbera et al., 2013; Schoof and Rosen, 2014; Heinrich et al., 2015), the study sample was small for the number of statistical analyses. This increases the risk of false positive findings. Only by repeating the study with independent groups of participants and by using larger sample sizes will it be possible to establish which correlations are replicable.

# CONCLUSION

These exploratory results replicate and extend our previous findings by investigating the relationship between a set of speech perception tests and cognitive measures, which were more complex, in aided listeners with mild-to-moderate hearing loss. We found that the association between speech perception performance and cognition varied with the specific tests used, showed that verbal WM in particular appears to be important for the speech perception tests used, and that correlations were evident when behavioral speech perception tests and listening situations in self-report questionnaires matched in some characteristics. Finally, cognition did not correlate with self-report of communication. The next step is to test these conclusions with systematic hypothesis-driven research.

# AUTHOR CONTRIBUTIONS

MF and HH designed the study. AH analyzed and interpreted the data. AH wrote, and MF contributed to, the manuscript. AH and MF contributed to critical discussions. AH and MF revised the manuscript. All authors approved the final version of the manuscript for publication. All authors agree to be accountable for all aspects of the work and in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

# ACKNOWLEDGMENTS

This paper presents independent research funded by the National Institute for Health Research (NIHR) Biomedical Research Unit Programme. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. This work was also supported by the Medical Research Council [U135097128] and the Biotechnology and Biological Sciences Research Council (BB/K021508/1).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.00576

# REFERENCES



normative data in noise. Br. J. Audiol. 21, 165–174. doi: 10.3109/03005368709076402


and withdrawal from leisure activities in older community-dwelling adults. Aging Clin. Exp. Res. 28, 297–302. doi: 10.1007/s40520-015-0389-1


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Heinrich, Henshaw and Ferguson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Two Sides of Sensory–Cognitive Interactions: Effects of Age, Hearing Acuity, and Working Memory Span on Sentence Comprehension

Renee DeCaro<sup>1</sup> , Jonathan E. Peelle<sup>2</sup> , Murray Grossman<sup>3</sup> and Arthur Wingfield<sup>1</sup> \*

<sup>1</sup> Department of Psychology and Volen National Center for Complex Systems, Brandeis University, Waltham, MA, USA, <sup>2</sup> Department of Otolaryngology, Washington University in St. Louis, St. Louis, MO, USA, <sup>3</sup> Department of Neurology, University of Pennsylvania, Philadelphia, PA, USA

Reduced hearing acuity is among the most prevalent of chronic medical conditions among older adults. An experiment is reported in which comprehension of spoken sentences was tested for older adults with good hearing acuity or with a mild-to-moderate hearing loss, and young adults with age-normal hearing. Comprehension was measured by participants' ability to determine the agent of an action in sentences that expressed this relation with a syntactically less complex subject-relative construction or a syntactically more complex object-relative construction. Agency determination was further challenged by inserting a prepositional phrase into sentences between the person performing an action and the action being performed. As a control, prepositional phrases of equivalent length were also inserted into sentences in a non-disruptive position. Effects on sentence comprehension of age, hearing acuity, prepositional phrase placement and sound level of stimulus presentations appeared only for comprehension of sentences with the more syntactically complex object-relative structures. Working memory as tested by reading span scores accounted for a significant amount of the variance in comprehension accuracy. Once working memory capacity and hearing acuity were taken into account, chronological age among the older adults contributed no further variance to comprehension accuracy. Results are discussed in terms of the positive and negative effects of sensory–cognitive interactions in comprehension of spoken sentences and lend support to a framework in which domain-general executive resources, notably verbal working memory, play a role in both linguistic and perceptual processing.

Keywords: working memory, hearing acuity, sentence comprehension, adult aging, syntactic structure

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

L. Robert Slevc, University of Maryland, College Park, USA; Patti Adank, University College London, UK

> \*Correspondence: Arthur Wingfield wingfield@brandeis.edu

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 01 September 2015 Accepted: 05 February 2016 Published: 29 February 2016

### Citation:

DeCaro R, Peelle JE, Grossman M and Wingfield A (2016) The Two Sides of Sensory–Cognitive Interactions: Effects of Age, Hearing Acuity, and Working Memory Span on Sentence Comprehension. Front. Psychol. 7:236. doi: 10.3389/fpsyg.2016.00236

# INTRODUCTION

Unlike reading, where one can control the input rate with eye-movements, in the case of spoken language speech rate is controlled by the speaker and not by the listener. Because of the rapidity of natural speech and its inherently transient nature, comprehension operations that cannot be accomplished as the speech is being heard must be conducted on a fading trace of that speech in memory (Jarvella, 1970, 1971; Fallon et al., 2004). Added to the rapidity of natural speech, many of the words we hear in spoken discourse are significantly underarticulated, requiring a heavy demand on acoustic and linguistic context for successful recognition (Pollack and Pickett, 1963; Lindblom et al., 1992; Wingfield et al., 1994).

Adult aging brings special challenges for speech comprehension due to age-related declines in episodic memory (Wingfield and Kahana, 2002), processing speed (Salthouse, 1996), and working memory resources (Salthouse, 1994), all of which can have a negative impact on comprehension of spoken sentences (see reviews in Light, 1990; Carpenter et al., 1994; Wingfield and Lash, 2016). Of special note, however, is the effect on sentence comprehension of age-related hearing impairment. The goal of this present study is to examine the effects of hearing impairment in older adults on the comprehension of spoken sentences as the processing difficulty is manipulated by the syntactic complexity of the sentences and the sound level of the presented stimuli.

# Hearing Acuity and Sentence Comprehension

Age-related hearing loss is the third most prevalent chronic medical condition among older adults, exceeded only by arthritis and hypertension (Lethbridge-Cejku et al., 2004). This is of concern for speech comprehension as even with a relatively mild hearing loss one can miss, or mishear, words from spoken utterances. More subtle, however, is the mounting evidence that even with a relatively mild hearing loss the cognitive effort needed for successful front-end speech recognition can draw resources that would otherwise be available for storing what has been heard in memory (Rabbitt, 1991; Surprenant, 1999, 2007; Pichora-Fuller, 2003; Wingfield et al., 2005), or comprehending sentences in which the meaning is expressed with complex syntax (Wingfield et al., 2006). Critically, this effect can occur even when it can be demonstrated that the speech itself has passed a threshold of audibility.

When the consequences of this front-end perceptual effort are added to an age-related decline in working memory capacity (e.g., Salthouse, 1994), one might expect speech comprehension to be far poorer among older adults than one ordinarily observes. There is a general recognition in the literature that older adults' relative success with language comprehension is due to their ability to offset effects of reduced hearing acuity and working memory resources with the compensatory use of linguistic knowledge that is ordinarily well-preserved in healthy aging. (Reviews of evidence for the preservation of linguistic knowledge and the procedural rules for its use in healthy aging can be found in, for example, Light, 1988; Kemper, 1992; Kempler and Zelinski, 1994; Wingfield and Stine-Morrow, 2000.)

This delicate balance between the negative effects of processing deficits and the positive effects of spared linguistic knowledge in adult aging works well until the total processing burden exceeds a listener's processing capacity. When compensatory mechanisms are not able to keep up with demand, listeners' performance will suffer. Increasing the processing challenge through linguistic and acoustic manipulations is therefore a useful way to test the interaction of cognitive and perceptual factors in speech comprehension. In the following experiment, we examine spoken sentence comprehension under conditions in which this balance is maintained, and under conditions where the processing challenge disrupts this balance by increasing the processing demands needed for successful comprehension at the linguistic and perceptual levels.

# Syntactic Complexity and Working Memory

In addition to the challenge imposed on many older adults by a reduced quality of the acoustic signal, challenges also arise when the syntactic structure of a sentence departs from a simple canonical form in which the first noun in the sentence identifies an agent that performs an action, the first verb encountered is the action being performed, and the next noun encountered is the recipient of the action (e.g., "The king [agent] assisted [action] the queen [recipient of the action]"). When sentences become longer, or the sentence meaning is represented with complex syntax, the cognitive challenge becomes greater (Just and Carpenter, 1992). The literature on sentence processing offers a number of reasons why this is so.

Early models of sentence comprehension postulated that, as a listener hears a sentence, the listener is continually forming hypotheses about the structure of what they are hearing and forming predictions about what they have yet to hear. These are working hypotheses, either confirmed or modified with the arrival of subsequent words of the sentence (cf., Frazier and Fodor, 1978; Fodor and Frazier, 1980; Marslen-Wilson and Tyler, 1980; Wanner, 1980). This general principle has been instantiated more recently in probability-based models of sentence processing that postulate that syntactically complex sentences are more difficult to understand because they violate the listener's experience-based expectations of the likely structure of the sentence. This requires a re-analysis of the initially assumed structure, as, for example, that the first noun will be the agent of an action (cf., Novick et al., 2005; Levy, 2008; Padó et al., 2009; Gibson et al., 2013). In support of this view, whether a cause or consequence of the extra effort speakers and listeners must invest to produce and understand sentences with greater syntactic complexity, studies of everyday speech samples show that sentences with simpler syntactic forms occur far more frequently than sentences with more complex syntax (Goldman-Eisler, 1968; see also Goldman-Eisler and Cohen, 1970).

Consistent with the above observations, it is well known that, independent of hearing acuity, sentences with a variety of complex syntactic constructions are more difficult to comprehend and to recall than those with less complex structures, and that this is especially so for older adults (Feier and Gerstman, 1980; Emery, 1985; Kemper, 1986; Norman et al., 1991). Among the best-studied linguistic challenges in the literature are sentences that express their meaning with an object-relative syntactic structure versus sentences with a syntactically simpler subject-relative structure. Past studies have shown that not only do object-relative sentences produce more comprehension and recall errors than subject-relative sentences, but that this is differentially so for older than for younger adults (e.g., Carpenter et al., 1994; Wingfield et al., 2003). For this reason we have selected these two sentence types to form the basis for our analysis of a potential interaction between hearing acuity among older adults and the linguistic complexity of the speech materials.

The upper panel in **Table 1** shows an example of the simplest syntactic form we employed in the present study (base sentence): a six-word sentence with a subject-relative center-embedded clause structure, in which the main clause (Sisters are fortunate) is interrupted by a relative clause (that assist brothers). The more complex syntactic form we employed had an object-relative center-embedded clause structure. The first sentence in the lower set shows a sentence composed of the same six words, but now ordered such that the meaning is expressed with an object-relative construction. In this case the embedded clause (that sisters assist) not only interrupts the main clause, but the head noun phrase (Brothers) functions as both the subject of the main clause (brothers are fortunate) and the object of the relative clause (that sisters assist). Because the order of thematic roles in object-relative constructions is not canonical (the first noun is not the agent of the action), such sentences require a more extensive thematic integration than required for the more canonical structure represented by subject-relative sentences (Warren and Gibson, 2002). As a consequence, accurate comprehension of object-relative sentences has been considered to be more resource demanding than processing subject-relative sentences (e.g., Ferreira et al., 1996; Cooke et al., 2002).

More specifically, it has been suggested that to determine the thematic roles in object-relative sentences one must keep the subject of the sentence in mind for a longer period of time than in subject-relative sentences (e.g., Cooke et al., 2002), which would be expected to place a heavier demand on working memory. Consistent with this likelihood have been studies showing that young and older adults with lower scores on tests of verbal working memory show more comprehension errors for complex sentences than those with better scores (e.g., Just and Carpenter, 1992; MacDonald et al., 1992; Carpenter et al., 1994; Vos et al., 2001). This working memory account has, either directly or indirectly, been used to account for the greater number of comprehension errors typically found for object-relative than for subject-relative sentences (Just and Carpenter, 1992; Zurif et al., 1995; Cooke et al., 2002; Wingfield et al., 2006), increased patterns of neural activation in functional imaging studies (Just et al., 1996; Cooke et al., 2002; Wingfield and Grossman, 2006; Peelle et al., 2010), and slower self-pacing patterns for both written (Stine-Morrow et al., 2000) and spoken (Waters and Caplan, 2001; Fallon et al., 2006) sentences.

# Increasing the Processing Challenge by Adding Prepositional Phrases

Although the non-canonical word order of object-relative sentences violates listeners' experience-based expectancies, a major source of the above-cited difficulty with such sentences, as argued by, for example, Cooke et al. (2002) and Warren and Gibson (2002), is a word-order that impedes successful semantic integration of the lexical elements in the sentence. If this were the case, then whether a sentence has a subject-relative or an object-relative structure, a manipulation that further increases the difficulty of the semantic integration of the sentence elements would be expected to increase failures of correct comprehension of the sentence meaning. Central to our primary interest, however, is the question of whether one would see an exaggeration of any effects of this added degree of linguistic challenge on sentence comprehension in listeners with reduced hearing acuity.

To test these hypotheses, 10-word sentences were created from the six-word base sentences by inserting a four-word prepositional phrase (e.g., with short brown hair) into each of them. Moreover, the particular placement of the prepositional phrase manipulated the processing challenge by manipulating the separation between key sentence constituents. In a less syntactically disrupting case the placement of the prepositional phrase kept the person performing the action and the action being performed adjacent to, or in close proximity with, each other. These are indicated in **Table 1** as short separation sentences. The second sentence in the upper set illustrates such a prepositional phrase placement for a subject-relative sentence (short separation). The second sentence in the lower set shows this for an object-relative sentence. (In the table, we have underlined the agent performing the action and the action being performed.)

In the second type of placement the prepositional phrase was inserted in a position to produce a long separation between the person performing the action and the action being performed. This placement was designed to add difficulty to the task of determining the thematic role assignments of the two persons in the sentence, and a presumed increase in working memory demands, but without changing the formal syntactic structure of the sentence itself. Examples of such long separation sentences are shown in **Table 1** for a subject-relative sentence (upper set) and an object-relative sentence (lower set). If object-relative sentences prove more difficult, this manipulation would allow us to dissociate the challenging grammatical features of this sentence from the increased difficulty associated with the increased separation between the key sentence constituents.

### TABLE 1 | Examples of sentence types.

By using sentences in which males (e.g., brother) or females (e.g., sister) served as the agents or recipients of actions, accurate comprehension could be demonstrated by the participant correctly indicating the gender of the agent of the action. (In these examples and in the experiment itself the complementizer "that" was used instead of the more grammatically correct "who." This was done to avoid the use of the who–whom distinction that could serve as an undesired comprehension cue.)

The target groups in this experiment were older adults with good hearing acuity and an age-matched group with a mild-to-moderate hearing loss. A third group of participants consisting of young adults with normal hearing acuity was included to illustrate the maximal performance level that might be expected under ideal circumstances.

# Presentation Level

Hearing research over the years has reflected a choice among the intensity levels that might be used: whether to present speech at an intensity that approximates conversational speech levels (dB HL or SPL; Hearing Level or Sound Pressure Level) or at a presentation level relative to an individual's hearing threshold (dB SL; Sensation Level). Experimental studies typically employ either one presentation method or the other; rarely both within the same experiment. This leaves open the question of whether the two methods will be equally sensitive to the factors of interest in a particular study. For this reason in the present study, we employed both of the presentation methods (the same absolute presentation level for all participants [dB HL] and a presentation level adjusted for each individual's hearing threshold [dB SL]) using a within-participants design. Including both sound presentation levels would thus allow us to see whether both methods may reveal an influence on the factors tested in this sentence processing task equally, and to provide useful empirical information in helping to determine which approach may be more appropriate in future studies. Thus, uniquely within a single experiment, we manipulate syntactic complexity, the effect of a separation of key sentence elements by insertion of a prepositional phrase, and presentation level of the sentence stimuli within the context of adult aging and hearing acuity.

# Experimental Hypotheses

One could entertain two hypotheses in terms of sentence comprehension in older adults with good or poor hearing acuity. The first is that perceptual effort – as determined by participants' hearing acuity and presentation level – will have similar effects on sentence comprehension regardless of the cognitive load imposed by syntactic complexity, sentence length, and prepositional phrase placement. This simple additivity would be manifested in parallel comprehension performance functions for the good-hearing and hearing-impaired listeners, albeit with a potential difference in y-intercepts. A finding of additivity would be consistent with the notion of independence of cognitive and perceptual operations. (See Allport et al., 1972 and McLeod, 1977, for early arguments favoring multiprocessor models of attention.)

The alternative would be a multiplicative effect, in which perceptual effort engendered by reduced hearing acuity and/or reduced presentation amplitude, produces a differentially greater negative effect on comprehension of the more cognitively challenging sentences (object-relative sentences with a long agent-action separation) than on comprehension of the less challenging sentences (subject-relative sentences with a short agent-action separation).

This latter finding would be in keeping with the principles embodied in models that postulate limited attentional (Kahneman, 1973; Cowan, 1999; Engle, 2002) or working memory (Baddeley, 2012; Chow and Conway, 2015) resources that must be shared among concurrent or closely sequential processing operations. Applied to the present case, this would imply that the resources required for front-end perceptual operations will necessarily draw on the resources that would otherwise be available for comprehension operations at the linguistic processing level. Such an effect would thus predict that the consequences of the extra resource draw necessary for successful perceptual processing of an acoustically degraded speech input will fall more heavily on successful comprehension of the more resource-demanding long separation object relative sentences than their less syntactically challenging counterparts.

# MATERIALS AND METHODS

# Participants

Participants were 36 older adults, 18 with good hearing acuity (5 males and 13 females) and 18 older adults with a mild-to-moderate hearing loss (6 males and 12 females). Audiometric assessment was conducted using a GSI 61 clinical audiometer (Grason-Stadler, Madison, WI, USA) using standard audiometric procedures in a sound attenuating testing room.

**Figure 1** shows better-ear pure-tone thresholds from 500 to 8,000 Hz for the three participant groups plotted in the form of audiograms, with the x-axis showing the test frequencies and the y-axis showing the minimum sound level (dB HL) needed for their detection. Hearing profiles for individual listeners within each participant group are shown in light gray, with the group average drawn in black. The shaded area in each of the panels indicates thresholds less than 25 dB HL, a region commonly considered as clinically normal hearing for speech (Katz, 2002).

We summarized individuals' hearing acuity in terms of their better-ear pure tone average (PTA) across 0.5, 1, 2, and 4 kHz, a range especially important for the perception of speech. The participants in the older adult good-hearing group had a mean PTA of 17.9 dB HL (SD = 3.0). The older adult hearing-impaired group had a mean PTA of 35.8 dB HL (SD = 5.7) placing them in the mild-to-moderate hearing loss range (Katz, 2002). This degree of loss represents the single largest proportion of hearing-impaired older adults (Morrell et al., 1996). None of the participants in the present study reported regular use of hearing aids, and all were tested unaided.
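The better-ear PTA summary can be sketched in a few lines. This is a minimal illustration, assuming that "better ear" means the ear with the lower four-frequency average; the listener's thresholds and the function names are hypothetical and are not taken from the study.

```python
from statistics import mean

# Frequencies (Hz) entering the pure tone average, as stated in the text.
PTA_FREQS_HZ = (500, 1000, 2000, 4000)


def pure_tone_average(thresholds_db_hl):
    """Mean threshold (dB HL) across the four PTA frequencies for one ear."""
    return mean(thresholds_db_hl[f] for f in PTA_FREQS_HZ)


def better_ear_pta(left, right):
    """PTA of whichever ear has the lower (better) average threshold.

    Assumption: 'better ear' is decided on the PTA itself, not per frequency.
    """
    return min(pure_tone_average(left), pure_tone_average(right))


# Hypothetical listener; thresholds in dB HL keyed by frequency in Hz.
left_ear = {500: 15, 1000: 20, 2000: 25, 4000: 40, 8000: 55}
right_ear = {500: 20, 1000: 20, 2000: 30, 4000: 50, 8000: 60}

print(better_ear_pta(left_ear, right_ear))
```

The 8,000 Hz threshold is measured and plotted in the audiograms but does not enter the average.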

The good-hearing and hearing-impaired older adult groups were similar in age, with the good-hearing group ranging from 68 to 84 years (M = 74.0 years, SD = 4.6) and the hearing-impaired group ranging from 67 to 83 years (M = 74.5 years, SD = 5.2; t[34] = 0.31, n.s.). Both groups were well educated, with a mean of 16.5 years of formal education for the good-hearing group (SD = 2.2) and 17.2 years for the hearing-impaired group (SD = 2.5); t(34) = 0.84, n.s. The two groups were also similar in vocabulary knowledge as measured by a 20-item version of the Shipley vocabulary test (Zachary, 1991). This is a written multiple choice test in which the participant is required to indicate which of four listed words means the same or nearly the same as a given target word. The good-hearing older adults had a mean score of 17.3 (SD = 2.4) and the hearing-impaired older adults had a mean score of 17.4 (SD = 2.4); t(34) = 0.14, n.s.
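As a rough check on how such group comparisons relate to the reported descriptives, the age comparison can be recomputed from the summary statistics alone. The sketch below assumes a Student's independent-samples t-test with pooled variance, which is consistent with the reported 34 degrees of freedom for two groups of 18; the text does not state which test variant was used, and the function name is hypothetical.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's independent-samples t with pooled variance; returns (t, df)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Age (years): hearing-impaired vs. good-hearing older adults, values from the text.
t_age, df_age = pooled_t(74.5, 5.2, 18, 74.0, 4.6, 18)
print(f"t({df_age}) = {t_age:.2f}")
```

With the rounded means and standard deviations reported for age, this comes out at about 0.31 on 34 degrees of freedom. Values recomputed this way for other measures can differ slightly from the published ones because the published descriptives are rounded.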

For purposes of comparison we also included a group of 18 younger adults (three males, 15 females), ranging in age from 18 to 29 years (M = 20.4; SD = 2.7), all of whom had age-normal hearing acuity, with a mean PTA of 6.7 dB HL (SD = 3.1). At time of testing the young adults had completed fewer years of formal education (M = 14.6 years; SD = 1.1) than either the good-hearing, t(34) = 3.29, p < 0.01, or the hearing-impaired, t(34) = 4.02, p < 0.001, older adults. As is common in adult aging (e.g., Verhaeghen, 2003), the young adults had somewhat lower vocabulary scores (M = 13.9; SD = 2.3) than either the good-hearing, t(34) = 4.41, p < 0.001, or the hearing-impaired, t(34) = 4.47; p < 0.001, older adults.

All participants reported themselves to be in good health, with no history of stroke, Parkinson's disease, or other neuropathology that might compromise their ability to carry out the experimental task. All participants reported themselves to be monolingual native speakers of American English with no history of speech or language disorders.

### Working Memory Capacity

Although varying in emphasis, the term working memory has been typically used to refer to the retention of information in conscious awareness when this information is not present in the environment, to its manipulation, and to its use in guiding behavior (Postle, 2006; see also McCabe et al., 2010; Baddeley, 2012, for converging definitions). In accord with this definition, tests of working memory typically focus on complex span tasks in which material must be held in memory while other operations, either related or unrelated to the material in memory, must be performed (Baddeley and Hitch, 1974). A common assessment of verbal working memory that meets this definition is the reading span task introduced by Daneman and Carpenter (1980), and its variants (e.g., Baddeley et al., 1985; Waters and Caplan, 1996; Conway et al., 2005; Moradi et al., 2014).

For all participants working memory capacity was assessed using the reading span task modified from Daneman and Carpenter (1980; Stine and Hindman, 1994). In this task participants read sets of sentences and responded after each sentence whether the statement in the sentence was true or false. Once a full set of sentences had been presented participants were instructed to recall the last word of each of the sentences in the order in which the sentences had been presented. The task thus requires the participant to make a true-false decision about the statement in each sentence while simultaneously holding the final words of each of the prior sentences in memory. McCabe et al.'s (2010) stair-step presentation was used, in which participants received three trials for any given number of sentences, with a working memory score calculated as the total number of trials in which all sentence-final words were recalled correctly in the correct order. The maximum score on this test is 15.
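The scoring rule described above can be written out as a short sketch. It assumes each trial is stored as a pair of lists (the target sentence-final words and the words the participant recalled); that data layout, the example words, and the function name are hypothetical. With three trials per set size, a maximum score of 15 implies five set sizes, although the text does not list them.

```python
def reading_span_score(trials):
    """Count trials in which every sentence-final word was recalled in order.

    `trials` is a sequence of (targets, recalled) pairs; a trial earns credit
    only if the recalled list matches the target list exactly.
    """
    return sum(1 for targets, recalled in trials if list(recalled) == list(targets))


# Hypothetical participant: three trials at a set size of two sentences.
example_trials = [
    (["door", "river"], ["door", "river"]),   # all words, correct order
    (["lamp", "table"], ["table", "lamp"]),   # correct words, wrong order
    (["cloud", "pencil"], ["cloud"]),         # one word omitted
]

print(reading_span_score(example_trials))
```

Only the first trial earns credit here, because the rule is all-or-none per trial: partial recall and out-of-order recall both score zero.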

The reading span task was chosen because it draws heavily on both storage and processing components of working memory (Daneman and Carpenter, 1980), and in written form would not be confounded with hearing acuity. As illustrated in, for example, a meta-analysis of published studies reported by Daneman and Merikle (1996), reading span scores have been shown to be a good predictor of performance in a variety of language processing tasks.

**Figure 2** shows a plot of the working memory (reading span) scores for each of the young adults and each of the good-hearing and hearing-impaired older adults taking part in the experiment. The variability within groups and the overlap between groups stands out clearly. Given this variability there was no significant difference between the scores for the two older adult groups, t(34) = 0.46, n.s. Although **Figure 2** shows a tendency for the young adults' distribution to be shifted higher relative to the two older adult groups, the overall difference showed only a nonsignificant trend as compared with the good-hearing older adults, t(34) = 2.03, p = 0.051, and no significant difference relative to the hearing-impaired older adults, t(34) = 1.42, p = 0.17.

# Stimuli

Preparation of the stimuli began with construction of 144 six-word English sentences with a subject-relative structure. In each sentence a male agent (e.g., boy, uncle, king) or a female agent (e.g., girl, aunt, queen) was performing an action (e.g., pushed, helped, teased). In half of the sentences the male was the agent of the action and in half the female was the agent. For each of these sentences a counterpart sentence was then constructed using the same words but with the meaning expressed with an object-relative structure. In addition, for each of these subject-relative and object-relative sentences a plausible four-word prepositional phrase was inserted in a position that kept at most a one word separation between the person performing the action and the action being performed (short separation) or placed so as to separate the person performing the action and the action being performed by at least four intervening words (long separation). Examples of these six sentence types are illustrated in the previously described **Table 1**.

The resulting 864 sentences were recorded by a female speaker of American English to form the master stimulus set. Sentences were recorded with natural intonation at an average speaking rate of 150 words per minute onto sound files using Sound Studio v2.2.4 (Macromedia, Inc., San Francisco, CA, USA) that digitized (16-bit) at a sampling rate of 44.1 kHz. Recordings were equalized within and across sentence types for root-mean-square (RMS) intensity using MATLAB (MathWorks, Natick, MA, USA). There were also 48 filler sentences prepared that consisted of six- to nine-word active-conjoined sentences that were similar in content to the test sentences but that did not contain an embedded clause structure.

### Procedures

Each participant heard 144 test sentences, 24 in each of the six sentence types (six-word subject-relative, six-word object-relative, 10-word subject-relative short separation, 10-word subject-relative long separation, 10-word object-relative short separation, 10-word object-relative long separation) along with the 48 filler sentences. Participants were instructed to listen to each sentence as it was presented and then to indicate whether it was either the male or the female in the sentence that was performing the action. Responses were made by pressing the correct one of two keys labeled male or female.

Half of the sentences of each type (six-word subject-relative, six-word object-relative, 10-word subject-relative short separation, 10-word subject-relative long separation, 10-word object-relative short separation, 10-word object-relative long separation) were presented to participants at 65 dB HL, a level that approximates everyday conversational speech. The remaining half of each sentence type was presented at 20 dB above each participant's better-ear PTA (i.e., 20 dB SL; Jerger and Hayes, 1977). Stimuli were presented binaurally over Eartone 3A insert earphones (E-A-R Auditory Systems; Aero Company, Indianapolis, IN, USA) via a Grason Stadler GS-61 clinical audiometer (Grason-Stadler, Inc., Madison, WI, USA) in the same sound-isolated testing room in which hearing acuity was tested.
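To make the difference between the two presentation methods concrete, the sketch below converts the 20 dB SL condition into an absolute level by adding the offset to a PTA. It uses the group mean PTAs reported under Participants purely for illustration; in the experiment the level was set from each individual's own PTA, and the function and variable names are hypothetical.

```python
CONVERSATIONAL_LEVEL_DB_HL = 65  # fixed level used for every participant
SENSATION_LEVEL_DB = 20          # offset above the listener's better-ear PTA


def sensation_level_presentation(pta_db_hl, offset_db=SENSATION_LEVEL_DB):
    """Absolute presentation level (dB HL) for a threshold-relative condition."""
    return pta_db_hl + offset_db


# Group mean better-ear PTAs (dB HL) as reported in the text; illustration only.
group_mean_pta = {
    "young": 6.7,
    "older good-hearing": 17.9,
    "older hearing-impaired": 35.8,
}

for group, pta in group_mean_pta.items():
    level = sensation_level_presentation(pta)
    print(f"{group}: 20 dB SL = {level:.1f} dB HL "
          f"(fixed condition: {CONVERSATIONAL_LEVEL_DB_HL} dB HL)")
```

On these group means the 20 dB SL condition is lower in absolute level than 65 dB HL for all three groups, with the smallest gap for the hearing-impaired group.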

A within-participants design was used in which each participant received equal numbers of sentences of each type, with no base sentence (a particular combination of agent, recipient and action) heard more than once by any participant. Sentences and sound level presentation conditions were counterbalanced across participants such that, by the end of the experiment, each base sentence had been heard an equal number of times in each of its syntactic and agent-action separation versions and at 65 dB HL and 20 dB SL an equal number of times. Sound levels were blocked in presentation, with the order of sound level blocks counterbalanced across participants. Sentence types were randomized in order of presentation within the sound-level blocks. Written informed consent was obtained from all participants according to a protocol approved by the Brandeis University Institutional Review Board prior to the start of the experiment.
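The counterbalancing described above can be illustrated with a simple Latin-square rotation. This is a sketch under stated assumptions rather than the procedure actually used: the number of counterbalancing lists (12), the condition labels, and the function name `assign_conditions` are all hypothetical. Blocking by sound level and randomization within blocks are not modeled.

```python
# Six sentence types crossed with two presentation levels gives 12 cells.
SENTENCE_TYPES = [
    "SR-6", "OR-6",
    "SR-10-short", "SR-10-long",
    "OR-10-short", "OR-10-long",
]
LEVELS = ["65 dB HL", "20 dB SL"]
CONDITIONS = [(t, lvl) for t in SENTENCE_TYPES for lvl in LEVELS]


def assign_conditions(n_base_sentences, list_index):
    """Map each base sentence to one condition for a given counterbalancing list.

    Rotating the assignment by list_index means that, across all lists,
    every base sentence appears once in every condition.
    """
    n = len(CONDITIONS)
    return {
        base: CONDITIONS[(base + list_index) % n]
        for base in range(n_base_sentences)
    }


# One assignment per (assumed) counterbalancing list, 144 base sentences each.
lists = [assign_conditions(144, k) for k in range(len(CONDITIONS))]
```

With 144 base sentences and 12 cells, each list contains 12 sentences per cell, which matches the 24 sentences per type split evenly across the two presentation levels.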

### Audibility Testing


To ensure audibility of the stimuli, participants were presented with two sentences at 65 dB HL and two sentences at 20 dB SL for that individual, with one sentence at each intensity level having a subject-relative structure and one an object-relative structure. The participant's task was simply to repeat each sentence aloud as it was heard. None of these sentences was used in the main experiment. The good-hearing older adults had 100% word report accuracy at both 65 dB HL and 20 dB SL. The older adult hearing-impaired group scored a mean of 99.5% correct at 65 dB HL and 100% correct at 20 dB SL. The young adults scored 100% correct at 65 dB HL and 99.8% correct at 20 dB SL.

### RESULTS

The main results are summarized in **Figure 3**, which shows the percentage of correct comprehension responses for subject-relative and object-relative six-word, 10-word short separation and 10-word long separation sentences when heard at 65 dB HL and at 20 dB SL for the three participant groups. Consistent with expectations, comprehension of sentences with the syntactically simpler subject-relative structures was excellent for all three participant groups regardless of sound-level condition, sentence length, or prepositional phrase placement. The ceiling and near-ceiling level performance for the subject-relative sentences also confirms the basic audibility of sentences heard with both sound-level presentations. Differences in comprehension accuracy begin to appear, however, when the syntactic complexity of the sentences is increased by expressing the meaning with an object-relative structure.

An omnibus mixed-design analysis of variance (ANOVA) was conducted on the comprehension accuracy data shown in **Figure 3** that included effects of syntactic structure (2: subject-relative, object-relative), length manipulation (3: six-word sentences, 10-word short subject-action separation, 10-word long subject-action separation), participant group (3: young adults, older good-hearing, older hearing-impaired) and presentation level (2: 65 dB HL, 20 dB SL). Participant group was a between-participants variable; all others were within-participants variables. Because of ceiling effects constraining variance for the subject-relative sentences, we performed all ANOVAs and paired-comparison t-tests on rationalized arcsine transformed data (Studebaker, 1985).
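For readers unfamiliar with the transform, a minimal sketch of the rationalized arcsine transform as defined by Studebaker (1985) is given below. The function name is hypothetical, and the example of 12 items per cell is inferred from the design (24 sentences per type, half at each presentation level); the analysis code actually used in the study is not given in the text.

```python
import math


def rationalized_arcsine_units(n_correct, n_items):
    """Rationalized arcsine transform of a score of n_correct out of n_items.

    theta = asin(sqrt(X / (N + 1))) + asin(sqrt((X + 1) / (N + 1)))
    RAU   = (146 / pi) * theta - 23
    """
    theta = (
        math.asin(math.sqrt(n_correct / (n_items + 1)))
        + math.asin(math.sqrt((n_correct + 1) / (n_items + 1)))
    )
    return (146.0 / math.pi) * theta - 23.0


# Transformed values for every possible score in a 12-item cell.
rau_by_score = {x: rationalized_arcsine_units(x, 12) for x in range(13)}
```

The transform leaves mid-range scores close to their percentage values while stretching the scale near 0% and 100%, which is why it is useful when scores cluster at ceiling.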

The ANOVA confirmed a significant main effect of syntactic structure, reflecting the previously cited common finding of poorer comprehension accuracy for the more computationally demanding object-relative sentences than for the less demanding subject-relative sentences, F(1,51) = 106.09, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.68. Although this effect of complex syntax on comprehension accuracy held across all three participant groups, the relative size of the effect differed between participant groups as reflected in a significant Syntactic structure × Participant group interaction, F(2,51) = 4.00, p < 0.05, η<sub>p</sub><sup>2</sup> = 0.14.

Of greater interest, the ANOVA also confirmed a main effect of the sentence length manipulation, F(2,102) = 39.32, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.44. As can be seen from visual inspection of **Figure 3**, however, this main effect was moderated by a significant Length × Syntactic structure interaction, F(2,102) = 24.11, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.32, confirming that length had an effect only for the more syntactically complex object-relative sentences. There was also a significant main effect of participant group, F(2,51) = 3.24, p < 0.05, η<sub>p</sub><sup>2</sup> = 0.11.

Although both presentation sound levels were suprathreshold, as confirmed by the previously cited audibility check, the uniform presentation level of 65 dB HL was relatively louder than the 20 dB SL presentation level for all three participant groups. This difference resulted in a significant main effect of presentation level on comprehension accuracy, F(1,51) = 9.03, p < 0.01, η<sub>p</sub><sup>2</sup> = 0.15. Because the 20 dB SL presentation levels were based on individuals' pure tone thresholds, the size of the difference between these values and the 65 dB uniform presentation level was inversely proportional to participants' baseline hearing acuity. The relative effect of the sound level, however, did not differ by group, as seen in the lack of a significant Presentation level × Participant group interaction, F(2,51) = 1.53, p = 0.23, η<sub>p</sub><sup>2</sup> = 0.06. Presentation level, however, had a greater effect on comprehension accuracy for the object-relative sentences than for the subject-relative sentences, with comprehension accuracy for subject-relative sentences at ceiling or near ceiling for both presentation levels, resulting in a significant Presentation level × Syntactic structure interaction, F(1,51) = 12.53, p = 0.001, η<sub>p</sub><sup>2</sup> = 0.20. None of the remaining interactions was significant.
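The partial eta squared values reported for the omnibus ANOVA can be recovered from each F ratio and its degrees of freedom using the standard identity η<sub>p</sub><sup>2</sup> = (F × df<sub>effect</sub>) / (F × df<sub>effect</sub> + df<sub>error</sub>). The short check below is an illustrative sketch with a hypothetical function name; it simply recomputes the effect sizes from the statistics given in the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (effect, F, df_effect, df_error, reported partial eta squared), from the text.
reported = [
    ("Syntactic structure", 106.09, 1, 51, 0.68),
    ("Syntactic structure x Group", 4.00, 2, 51, 0.14),
    ("Length", 39.32, 2, 102, 0.44),
    ("Length x Syntactic structure", 24.11, 2, 102, 0.32),
    ("Group", 3.24, 2, 51, 0.11),
    ("Presentation level", 9.03, 1, 51, 0.15),
    ("Presentation level x Group", 1.53, 2, 51, 0.06),
    ("Presentation level x Syntactic structure", 12.53, 1, 51, 0.20),
]

recomputed = {
    label: partial_eta_squared(f, df1, df2)
    for label, f, df1, df2, _ in reported
}

for label, f, df1, df2, eta in reported:
    print(f"{label}: recomputed {recomputed[label]:.2f}, reported {eta:.2f}")
```

Each recomputed value agrees with the reported value to two decimal places.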

We conducted a series of follow-up ANOVAs and paired comparisons to explore in more detail the factors underlying this pattern of main effects and interactions. Because comprehension accuracy for subject-relative sentences was at or near ceiling for all participants and all conditions, the analyses that follow were conducted on the data for just the object-relative sentences.

For the young adults a two-way ANOVA conducted on comprehension accuracy showed a significant main effect of sentence length (p < 0.001) and of presentation level (p < 0.05), but no Length × Presentation level interaction (p = 0.65). Follow-up paired comparison testing failed to show a significant difference between the six-word sentences and the 10-word short separation sentences for either the 65 dB HL (p = 0.99) or the 20 dB SL (p = 0.15) presentation levels. That is, the significant effect of sentence length was due to the poorer comprehension for the 10-word long separation sentences relative to the six-word sentences and the 10-word short separation sentences for both presentation levels (p levels < 0.05 to < 0.01). The difference in comprehension accuracy for the two presentation levels failed to reach significance for either the six-word sentences (p = 0.12) or for the 10-word short separation sentences (p = 0.43). There was a non-significant trend toward an effect of presentation level for the 10-word long separation sentences (p = 0.053).

For the good-hearing older adults a two-way ANOVA conducted on comprehension accuracy showed a significant main effect of sentence length (p < 0.001) but neither a significant main effect of presentation level (p = 0.45), nor a Length × Presentation level interaction (p = 0.53). Similar to the data for the young adults, paired-comparison testing failed to show a significant difference between the six-word sentences and the 10-word short separation sentences for either the 65 dB HL (p = 0.15) or the 20 dB SL (p = 0.15) presentation levels. As with the young adults there was again poorer comprehension for the 10-word long separation sentences relative to the six-word sentences and 10-word short separation sentences for both presentation levels (p levels < 0.01 to < 0.001). The difference in comprehension accuracy for the two presentation levels failed to reach significance for either the six-word sentences (p = 0.94), the 10-word short separation sentences (p = 0.76), or the 10-word long separation sentences (p = 0.20).

For the hearing-impaired older adults several of the trends seen for the better-hearing groups were now more marked. A two-way ANOVA conducted on comprehension accuracy for the hearing-impaired participants showed significant main effects of sentence length (p < 0.001) and presentation level (p = 0.001). There was no Length × Presentation level interaction (p = 0.51). Although the ANOVA failed to yield a significant Length × Presentation level interaction, planned comparison tests showed no significant difference between the six-word sentences and 10-word short separation sentences at 65 dB HL (p = 0.36), but this difference did reach significance with the more challenging 20 dB SL presentation (p < 0.01). The 10-word long separation sentences showed significantly poorer comprehension accuracy than both the six-word and 10-word short separation sentences at both presentation levels (p levels < 0.01 to < 0.001). The difference in comprehension accuracy for the two presentation levels was significant for the six-word sentences (p < 0.05), the 10-word short separation sentences (p < 0.01), and the 10-word long separation sentences (p < 0.01).

A final analysis was conducted to compare the two target groups with each other: the good-hearing older adults versus the hearing-impaired older adults. The two groups' comprehension accuracy was similar for the six-word sentences at the 65 dB HL level (p = 0.60), with a trend toward a difference emerging at the 20 dB SL presentation level (p = 0.054). A developing pattern was seen for the 10-word short separation sentences, which failed to show a significant difference between the two groups at 65 dB HL (p = 0.64), but a significant difference between the two participant groups did appear for the 20 dB SL presentation level (p < 0.05). For the 10-word long separation sentences there was again no significant difference between groups at 65 dB HL (p = 0.27) but there was a small but significant difference between groups at 20 dB SL (p < 0.05), potentially constrained by the previously noted functional floor of chance level performance for the 10-word long separation sentences with a 20 dB SL presentation level.

# Effects of Working Memory, Hearing Acuity, and Age as Continuous Variables

Although the good-hearing and hearing-impaired older adults were equivalent in mean age and reading span scores, there was, as seen, within-group variability in age, reading span, and hearing acuity. The error bars seen in **Figure 3** also indicate some variability around the plotted means. To explore the factors that may have led to the variability in comprehension accuracy we carried out hierarchical multiple regressions separately for the two presentation levels, first to see what factors may have contributed to comprehension performance and second, to determine whether the pattern of relative contributions generalized across presentation levels. In these analyses we considered just the older adults rather than including the young adults to avoid the multiple differences between the young and older adult groups potentially biasing the regression outcomes.

TABLE 2 | Hierarchical regression analyses for object-relative sentences.

<sup>∗</sup>p-value reflects significance of change in R<sup>2</sup> at each step of the model.

†Standardized multiple regression coefficient.

The dependent variable in each case was comprehension accuracy for the object-relative sentences due to the ceiling and near-ceiling performance for both participant groups for comprehension of the subject-relative sentences in all three length conditions and the two sound-level conditions. Predictor variables were entered into the model in the following order: working memory span (represented by reading span score), hearing acuity (represented by the better ear PTA, averaged over 500, 1000, 2000, and 4000 Hz), and participants' chronological age in years. This order was selected to examine any contribution of hearing acuity beyond effects of working memory span, and to determine whether age contributed unique variance after accounting for working memory span and hearing acuity.

The results of the regression analyses are shown in **Table 2**. For each predictor variable in each of the two presentation level conditions we show R<sup>2</sup>, which represents the cumulative contribution of each variable along with the previously entered variables, and the change in R<sup>2</sup>, which shows the contribution of each variable at each step. The next column shows the level of significance of each variable and the final column shows the standardized regression coefficients (β). It can be seen that working memory as measured by reading span is a significant predictor of comprehension accuracy for all conditions in the experiment; for both the six- and 10-word sentences and in the latter case for the short and long agent-action separations for both the 65 dB HL and the 20 dB SL presentation levels.

When the presentation level was at the higher 65 dB HL level, hearing acuity contributed to comprehension accuracy only for the 10-word sentences with a long agent-action separation. When the perceptual task was more challenging in the 20 dB SL condition, hearing acuity contributed marginally for the six-word sentences, increasing to a significant contribution for the 10-word short and long separation sentences. That is, hearing acuity contributed significant variance only for the more challenging presentation level and even then only for the longer 10-word sentences. With the contributions of working memory span and hearing acuity taken into account, chronological age did not contribute additional variance to comprehension accuracy. (The same pattern as shown in **Table 2** also appeared when the data for the young adults were included in the regression analyses.)
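The logic of entering predictors in a fixed order and reporting cumulative R<sup>2</sup> and the change in R<sup>2</sup> at each step can be sketched as follows. This is an illustrative example only: the data are synthetic, the sample size and effect sizes are arbitrary assumptions, and the helper names `r_squared` and `hierarchical_steps` are hypothetical. It does not reproduce the analysis software or the results of the study.

```python
import numpy as np


def r_squared(predictors, outcome):
    """R^2 from an ordinary least-squares fit that includes an intercept."""
    design = np.column_stack([np.ones(len(outcome)), predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    residuals = outcome - design @ coef
    total = outcome - outcome.mean()
    return 1.0 - (residuals @ residuals) / (total @ total)


def hierarchical_steps(named_predictors, outcome):
    """Enter predictors one at a time; return (name, R^2, change in R^2) per step."""
    steps, entered, previous = [], [], 0.0
    for name, values in named_predictors:
        entered.append(values)
        current = r_squared(np.column_stack(entered), outcome)
        steps.append((name, current, current - previous))
        previous = current
    return steps


# Synthetic, standardized scores for illustration only (not the study's data).
rng = np.random.default_rng(0)
n = 36  # arbitrary sample size chosen for the example
reading_span = rng.normal(size=n)
better_ear_pta = rng.normal(size=n)
age = rng.normal(size=n)
accuracy = 0.6 * reading_span - 0.3 * better_ear_pta + rng.normal(scale=0.8, size=n)

steps = hierarchical_steps(
    [
        ("reading span", reading_span),
        ("better-ear PTA", better_ear_pta),
        ("age", age),
    ],
    accuracy,
)
for name, r2, delta in steps:
    print(f"{name}: R^2 = {r2:.3f}, change in R^2 = {delta:.3f}")
```

Because each step adds a predictor to the previous model, the change in R<sup>2</sup> at a given step reflects only the variance that predictor explains beyond those already entered, which is why the entry order matters for interpretation.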

# DISCUSSION

Although hearing loss is a common accompaniment of adult aging, it has primarily been considered as an independent issue in aging research. There is now a growing recognition, however, that successful speech comprehension reflects an adaptive interaction between sensory and cognitive operations. There are two aspects to this interaction. The first is that the poorer the acoustic quality of the stimulus, whether due to reduced hearing acuity, poorly articulated speech, or the presence of background noise, the more support is required from top-down linguistic knowledge (Lindblom et al., 1992; Wingfield et al., 1994; Pichora-Fuller, 2003; Benichov et al., 2012; Rönnberg et al., 2013). In the present experiment this successful balance was revealed in the excellent level of comprehension success for six- and 10-word meaningful sentences by both good-hearing and hearing-impaired older adults at both presentation levels so long as the sentence meanings were expressed with the syntactically less complex subject-relative construction.

It is the case that all participants, including those in the older adult hearing-impaired group, successfully scored at ceiling or near ceiling when tested for speech audibility at both sound intensity levels we employed. This should not imply, however, that all groups had access to the same quality of stimulus input. This is the other side of the sensory–cognitive interaction; namely, the previously cited position that successful perception in the face of an acoustically degraded stimulus may come at the cost of resources that would otherwise be available for higher-level cognitive or linguistic operations. This position, in its broad outlines, has sometimes been referred to as an "effortfulness hypothesis" (Rabbitt, 1968, 1991; see also Surprenant, 1999, 2007; Murphy et al., 2000; Pichora-Fuller, 2003; McCoy et al., 2005; Wingfield et al., 2005, 2006; Amichetti et al., 2013, for similar arguments).

So long as the processing demands required for sentence comprehension did not exceed an upper limit on total processing resources, as in the case of sentences with a subject-relative structure, successful comprehension was possible even under conditions of perceptual effort. According to this resource argument, this point would have been exceeded when the difficulty in determining the thematic role assignments within a sentence imposed additional processing demands beyond those required for resolution of subject-relative sentences and when greater listening effort was required. This effect was revealed in reduced accuracy for object-relative sentences and when the relational elements were separated by insertion of a prepositional phrase in the long agent-action separation condition. This latter placement would be expected to exacerbate the already greater difficulty in determining thematic roles in object-relative constructions as the relational elements would need to be held in memory for a longer period of time (see Cooke et al., 2002, for a similar argument). The pattern of contributions of working memory and hearing acuity across conditions in the regression analyses is consistent with this argument. It is interesting that, at least for these data, chronological age contributed little to the variance in comprehension accuracy once working memory and hearing acuity were taken into account.

The effortfulness hypothesis, which is consistent with extant models that postulate an upper limit on working memory or attentional resources (cf., Kahneman, 1973; Baddeley and Hitch, 1974), has some descriptive utility as an account for our central question of why reduced hearing acuity results in a differentially greater effect on comprehension of object-relative than on subject-relative sentences even though all sentences were presented at a supra-threshold level that ensured audibility of the recorded stimuli.

An additional factor that may be considered can be referred to as an expectancy-uncertainty based account. As noted previously, because object-relative and other syntactically complex forms occur less frequently in one's everyday listening experience than simpler syntactic forms (e.g., Goldman-Eisler, 1968; Goldman-Eisler and Cohen, 1970), one's expectations of encountering such forms would consequently be lower. In an early formulation Osgood (1963) focused on expectations at the form-class level; the likelihood, for example, that a noun phrase will be followed by a verb, and a verb will be followed by a noun phrase. Later formulations have combined both syntactic and semantic elements to account for the greater difficulty listeners are known to have for sentences that express their meaning with complex syntax. This is the postulate that the listener's experience-based expectation that the first noun will be the agent of an action will have to be rejected as a sentence with an object-relative construction unfolds and this expectation is disconfirmed. Elements of this postulate can be seen in a number of expectancy inclusive models of sentence comprehension (cf., Hale, 2001; Novick et al., 2005; Levy, 2008; Padó et al., 2009; Gibson et al., 2013).

It should be noted in this discussion that we do not present working memory and experience-based expectation accounts as mutually exclusive alternatives. Indeed, a study examining eye-movements in reading text has implicated contributions to sentence processing from both sources (Staub, 2010).

Although an expectancy-based account might apply to the traditional finding of greater comprehension errors for object-relative sentences, it would not, in itself, explain why reduced hearing acuity would exacerbate this effect. An expectancy-based account, however, must not only include the likelihood of encountering a particular lexical item or structural form. It must also include an element of uncertainty, sometimes referred to as response entropy (see Shannon and Weaver, 1949). Here this would be represented by the number and probability strengths of alternative perceptual interpretations of the acoustic signals representing relationally critical words in the sentences. Studies of word recognition from reduced acoustic information have shown that alternative possibilities fitting an ambiguous acoustic signal may be activated by sentential context (Lash et al., 2013) and phonological similarity with other words (Sommers, 1996). Activation of a wider array of lexical possibilities might be expected to arise when the acoustic specificity of a word is reduced, as would be the case with poor hearing acuity, compounded by the lower presentation level in the 20 dB SL condition. Support for the influence of both expectation and entropy in spoken word recognition can be seen in studies of words presented in noise or with reduced word onset information, with the uncertainty (entropy) effect stronger for older than for younger adults (cf., van Rooij and Plomp, 1991; Lash et al., 2013).
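The notion of response entropy invoked above can be made concrete with a small sketch. Entropy here is the standard Shannon measure, H = −Σ p log<sub>2</sub> p, computed over the probabilities of competing lexical candidates. The candidate probabilities below are hypothetical and chosen only to contrast a well-specified signal with a degraded one; they are not estimates from any data set.

```python
import math


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution over candidate words."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical probabilities of competing interpretations of one critical word.
clear_signal = [0.90, 0.05, 0.05]            # one dominant candidate
degraded_signal = [0.40, 0.30, 0.20, 0.10]   # more, and more evenly weighted, candidates

entropy_clear = shannon_entropy(clear_signal)
entropy_degraded = shannon_entropy(degraded_signal)
```

Entropy rises both when more candidates are active and when their probabilities are more evenly matched, which corresponds to the "number and probability strengths" of alternatives described in the text.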

### Limitations of the Present Study

First, it is important to note that the participants in this experiment represented high-functioning older adults with good verbal knowledge and working memory capacity. Indeed, as a group, the good-hearing and hearing-impaired older adults had better vocabulary scores than their younger adult counterparts and a distribution of working memory span scores that was relatively close to that of the young adults. It should also be emphasized that stimuli were presented in quiet, thus avoiding the special difficulty older adults have when hearing speech with background noise (Humes, 1996; Tun et al., 2002). With less cognitively able older adults and/or with speech heard in noise one might expect even greater effects of age, hearing acuity, and working memory capacity on comprehension accuracy. As reviewed by Mattys et al. (2012), these variables do not exhaust the potential adverse conditions that might affect speech comprehension, which also include accented speech and listening while engaging in a concurrent secondary activity.

Second, although we have made reference to listening effort, it must be acknowledged that both its definition and measurement remain a topic of debate (McGarrigle et al., 2014). It should also be acknowledged that definitions of working memory and its relation to attentional resources and executive function remain in contention (cf., Cowan, 1999, 2005; Miyake et al., 2000; Engle, 2002; Barrouillet et al., 2004; McCabe et al., 2010; Baddeley, 2012; Chow and Conway, 2015). It is possible that differences in tasks and in task demands may tap different components of a complex working memory system (cf., Akeroyd, 2008; Schoof and Rosen, 2014; Füllgrabe et al., 2015). Finally, specific to the psycholinguistics literature, there is also the question of whether language comprehension is carried by a specialized or more general working memory system (Wingfield et al., 1998; Caplan and Waters, 1999).

# CONCLUSION

Declines in sensory acuity and efficiency of cognitive function often co-occur in adult aging. Both can affect speech comprehension, with the interaction between the two revealed in the dual challenges of hearing impairment and syntactic complexity in determination of semantic relations in sentence comprehension. It should also be noted that although our focus has been on downstream effects of listening effort, deficits in recall and comprehension of written text with degraded vision have also been reported in the literature (Dickinson and Rabbitt, 1991; Gao et al., 2012). This suggests that the principles of sensory–cognitive interactions under study in this present paper have wider application to issues in adult aging even beyond hearing acuity and listening effort.

# AUTHOR CONTRIBUTIONS

All authors listed have made substantial, direct, and intellectual contributions to the work, and approved it for publication.

### ACKNOWLEDGMENTS

The research reported in this publication was supported by the National Institute on Aging of the National Institutes of Health under award numbers R01 AG019714 and R01 AG038490. We also gratefully acknowledge support from the W.M. Keck Foundation.





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 DeCaro, Peelle, Grossman and Wingfield. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multiple Solutions to the Same Problem: Utilization of Plausibility and Syntax in Sentence Comprehension by Older Adults with Impaired Hearing

### Nicole M. Amichetti, Alison G. White and Arthur Wingfield\*

Volen National Center for Complex Systems, Brandeis University, Waltham, MA, USA

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

L. Robert Slevc, University of Maryland, College Park, USA Carine Signoret, Linköping University, Sweden

> \*Correspondence: Arthur Wingfield wingfield@brandeis.edu

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 24 November 2015 Accepted: 10 May 2016 Published: 30 May 2016

### Citation:

Amichetti NM, White AG and Wingfield A (2016) Multiple Solutions to the Same Problem: Utilization of Plausibility and Syntax in Sentence Comprehension by Older Adults with Impaired Hearing. Front. Psychol. 7:789. doi: 10.3389/fpsyg.2016.00789

A fundamental question in psycholinguistic theory is whether equivalent success in sentence comprehension may come about by different underlying operations. Of special interest is whether adult aging, especially when accompanied by reduced hearing acuity, may shift the balance of reliance on formal syntax vs. plausibility in determining sentence meaning. In two experiments participants were asked to identify the thematic roles in grammatical sentences that contained either plausible or implausible semantic relations. Comprehension of sentence meanings was indexed by the ability to correctly name the agent or the recipient of an action represented in the sentence. In Experiment 1 young and older adults' comprehension was tested for plausible and implausible sentences with the meaning expressed with either an active-declarative or a passive syntactic form. In Experiment 2 comprehension performance was examined for young adults with age-normal hearing, older adults with good hearing acuity, and age-matched older adults with mild-to-moderate hearing loss for plausible or implausible sentences with meaning expressed with either a subject-relative (SR) or an object-relative (OR) syntactic structure. Experiment 1 showed that the likelihood of interpreting a sentence according to its literal meaning was reduced when that meaning expressed an implausible relationship. Experiment 2 showed that this likelihood was further decreased for OR as compared to SR sentences, and especially so for older adults whose hearing impairment added to the perceptual challenge. Experiment 2 also showed that working memory capacity as measured with a letter-number sequencing task contributed to the likelihood that listeners would base their comprehension responses on the literal syntax even when this processing scheme yielded an implausible meaning. Taken together, the results of both experiments support the postulate that listeners may use more than a single uniform processing strategy for successful sentence comprehension, with the existence of these alternative solutions only revealed when literal syntax and plausibility do not coincide.

Keywords: sentence comprehension, plausibility, adult aging, hearing impairment, working memory

# INTRODUCTION


A critical feature of spoken language is its rapidity, with everyday speech rates often exceeding 180 to 200 words per minute (Stine et al., 1990). The fact that spoken sentences can be successfully comprehended in spite of this rapid input rate raises the question of whether listeners necessarily engage in a fully exhaustive word-by-word analysis of a sentence to determine its meaning. That is, rather than building a detailed and complete representation of the utterance, listeners may under some circumstances analyze the lexical input to a level of detail that is just "good enough" to extract the sentence meaning, with this especially so when the listener is faced with sentences that express their meaning with relatively complex syntactic structures (Ferreira et al., 2002; Ferreira, 2003; Christianson et al., 2006; Ferreira and Patson, 2007). Listeners must also comprehend sentences that contain ungrammatical or underspecified structures, a common occurrence in everyday communication (Goldman-Eisler, 1968; Elsness, 1984; Thompson and Mulac, 1991). In such cases it has been argued that comprehension is accomplished based on probabilistic inferences and plausibility substituting for operations represented in formal hierarchical syntactic processing models (e.g., Ferreira, 2003; Padó et al., 2009; Frank and Bod, 2011; Gibson et al., 2013).

The process we are describing can be referred to as shallow processing, a processing strategy in which the meaning of a sentence is rapidly inferred based on word order and thematic plausibility (Ferreira, 2003). Because we live in a relatively predictable and usually plausible world, this processing heuristic will ordinarily be successful. It will fail only in those circumstances when a sentence conveys an unexpected or unlikely meaning. It has been suggested by Rönnberg et al. (2013) that when a listener is under time pressure, and willing to accept the gist of a message, a thorough analysis might not take place. Indeed, it has been argued that detailed and time-consuming lexical and syntactic analyses of an utterance may be an exception, rather than the rule (Ferreira, 2003; Ferreira and Patson, 2007).

This position argues against traditional assumptions of a single "optimal" model of sentence processing that underlies successful comprehension. Rather, it is possible that a range of processing heuristics, ranging from relatively more shallow to more exhaustive word-by-word processing, will produce similar consequences under usual, but not all, circumstances. Broadly defined, this is a position in tune with a developing recognition in modern neurobiology that is showing that a range of circuit parameters are "good enough" to yield the same output, although not all solutions may be equally robust to potential perturbations (Marder, 2011; Tang et al., 2012).

In an analogous manner, we argue that uniform success in sentence comprehension need not imply that each incidence of successful comprehension has been achieved by the same cognitive route. This can be revealed when listeners are presented with sentences that contain an implausible meaning. This circumstance appeared, for example, when Ferreira (2003) presented university students with sentences that expressed a plausible or an implausible meaning with an active-declarative syntactic form (e.g., "The dog bit the man"; "The man bit the dog") or plausible or implausible sentences with a less common passive construction (e.g., "The man was bit by the dog"; "The dog was bit by the man"). When asked to identify the thematic roles in such sentences (who did the biting; who was bit), listeners were more likely to focus on plausibility (responding that the dog bit the man) when the meaning was conveyed with the less canonical passive syntactic structure. In such cases listeners' use of a shallow processing heuristic rather than a fully exhaustive word-by-word analysis will be revealed when thematic plausibility over-rides meaning based on the literal syntax of a sentence. This issue may take on special importance in the context of adult aging, where both working memory resources and hearing acuity typically show some degree of decline.

# THE SPECIAL CHALLENGES OF ADULT AGING AND HEARING IMPAIRMENT

Although hearing loss is a common accompaniment of adult aging, it has primarily been considered as an independent issue in cognitive aging research. We now know that there are subtle but important effects of reduced hearing acuity beyond simply missing or misidentifying individual words in a spoken message. That is, when speech is degraded, either due to reduced hearing acuity or due to acoustic masking, the cognitive effort needed for successful perception can take a toll on both comprehension and memory for spoken materials (cf., Rabbitt, 1968, 1991; Surprenant, 1999, 2007; Pichora-Fuller, 2003; McCoy et al., 2005; Wingfield et al., 2006; Piquado et al., 2010, 2012). Importantly, these effects appear even when it can be demonstrated that the speech itself has passed a threshold of audibility.

It is known that older adults have more difficulty than their younger adult counterparts in understanding sentences with complex syntactic structures (Wingfield et al., 2003, 2006). This has been attributed to increased working memory demands required for comprehension of such sentences that place older adults at a special disadvantage (Carpenter et al., 1994). Combined with an age-related hearing impairment adding to the processing challenge, a shift toward a processing heuristic that is adequate for comprehension, rather than one that engages a more resource-demanding, fully exhaustive syntactic analysis, might be expected to lead older adults to the more frequent use of plausibility rather than to the literal syntactically determined meaning of an utterance. Thus, to the extent that successful speech recognition in the face of hearing loss may draw resources needed for processing the sentence meaning, shallow processing may be more likely for older adults with hearing impairment than for young adults or for older adults with good hearing acuity.

We report the results of two experiments designed to test this hypothesis. The first experiment was patterned after Ferreira (2003), although with older as well as younger adults. Following Ferreira (2003), sentences were heard with either plausible or implausible meanings expressed with either an active-declarative structure or a less canonical passive structure. Our question was whether plausibility would be more likely to over-ride the literal syntactically determined meaning for older adults as compared to young adults. This first experiment was intended to define the lower boundaries of a potential interaction between adult aging and a plausibility bias, as the syntactic contrast between active-declarative and passive structures is a relatively mild one (see data in Gibson et al., 2013) and the older adults for this experiment would be especially selected for good hearing acuity.

In Experiment 2 comprehension was assessed when the processing challenge was further increased in two ways. First, the syntactic contrast would be between sentences with a subject-relative (SR) embedded clause structure and a much more complex object-relative (OR) embedded clause structure. This syntactic contrast was chosen because the comprehension of OR sentences is known to produce significantly greater processing demands than SR sentences (Ferreira et al., 1996; Just et al., 1996; Gibson, 1998; Cooke et al., 2002; Wingfield et al., 2006; Peelle et al., 2010; Staub, 2010). Second, the experiment was conducted with two groups of older adults: one group who had good hearing acuity for their ages and another group with a bilateral mild-to-moderate hearing loss, the most common degree of loss among older adults with hearing impairment (Morrell et al., 1996). Our question was whether the combined challenge of complex syntax and perceptual effort due to a hearing impairment would increase the likelihood of a listener conducting a more shallow analysis of the speech input. Such a processing strategy would be revealed by a comprehension response to an implausible sentence (i.e., one with an unlikely meaning) that relies on plausibility rather than on its literal syntactically based meaning.

# EXPERIMENT 1

# Method

### Participants

Participants were 24 young adults (2 men, 22 women) ranging in age from 18 to 30 years (M = 20.2 years, SD = 2.4) and 24 older adults (7 men, 17 women) ranging in age from 66 to 82 years (M = 75.1 years, SD = 4.2). The young adults were university students and staff and the older participants were healthy community-dwelling volunteers. To ensure that any age decrements would not be attributable to an accidental difference in vocabulary knowledge, all participants were screened with the Shipley Vocabulary Test (Zachary, 1991). As is common for healthy older adults (Kempler and Zelinski, 1994; Verhaeghen, 2003), the older adults in this study had an advantage in terms of vocabulary knowledge [M younger = 13.3, SD = 2.0; M older = 17.0, SD = 2.3; t(46) = 5.99, p < 0.001]. All participants reported themselves to be in good health, with no self-reported history of stroke, Parkinson's disease, or other neurologic involvement that might compromise their ability to perform the research task. All participants reported themselves to be native speakers of American English.

Audiometric evaluation was carried out for each participant using a GSI 61 clinical audiometer (Grason-Stadler, Inc., Madison, WI, USA) by way of standard audiometric techniques in a sound-attenuated testing room (Harrell, 2002). The young adults had a mean better-ear pure tone threshold average (PTA) of 8.0 dB HL (SD = 4.5) averaged over 500, 1,000, 2,000, and 4,000 Hz. The older adults had a mean better-ear PTA (500, 1,000, 2,000, and 4,000 Hz) of 23.2 dB HL (SD = 6.5). Participants who demonstrated unbalanced hearing (more than a 15 dB difference between ears at one or more frequencies) were excluded from participation.
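The better-ear PTA and the exclusion rule for unbalanced hearing can be illustrated with a minimal Python sketch. The function names and the example thresholds are hypothetical (they are not study data), and the sketch assumes that the between-ear comparison is made at the same four frequencies that enter the PTA, which the text does not state explicitly.

```python
from statistics import mean

# Frequencies (Hz) that enter the pure tone average, as listed in the text.
PTA_FREQUENCIES = (500, 1000, 2000, 4000)


def pta(thresholds):
    """Mean threshold (dB HL) over the four PTA frequencies for one ear."""
    return mean(thresholds[f] for f in PTA_FREQUENCIES)


def better_ear_pta(left, right):
    """PTA of the ear with the lower (better) average threshold."""
    return min(pta(left), pta(right))


def is_unbalanced(left, right, max_diff_db=15):
    """True if the ears differ by more than max_diff_db at any frequency.

    Assumption: only the PTA frequencies are compared.
    """
    return any(abs(left[f] - right[f]) > max_diff_db for f in PTA_FREQUENCIES)


# Hypothetical listener (illustrative thresholds in dB HL, not study data).
left = {500: 15, 1000: 20, 2000: 25, 4000: 40}
right = {500: 20, 1000: 20, 2000: 30, 4000: 45}

print(better_ear_pta(left, right))  # left ear is the better ear here
print(is_unbalanced(left, right))   # largest between-ear gap is 5 dB
```

Under these assumptions the hypothetical listener would be retained, since no between-ear difference exceeds 15 dB.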

Although elevated relative to the young adults, t(46) = 9.42, p < 0.001, the older adults' thresholds fell within or close to a range typically considered to be clinically normal for speech (PTA < 25 dB HL; Katz, 2002). None of the older participants wore hearing aids on a regular basis, and all testing was conducted unaided. Written informed consent was obtained from all participants according to a protocol approved by the Brandeis University Institutional Review Board.
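As a check on the arithmetic, the group comparison of PTAs can be approximated from the reported summary statistics alone. The sketch below assumes a pooled-variance independent-samples t-test, which is consistent with the reported 46 degrees of freedom but is not named in the text; the function name is hypothetical.

```python
import math


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t statistic from group means, SDs, and sizes.

    Assumption: equal-variance (pooled) t-test, df = n1 + n2 - 2.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


# Better-ear PTA summaries reported for Experiment 1 (older vs. young adults).
t, df = pooled_t(23.2, 6.5, 24, 8.0, 4.5, 24)
print(f"t({df}) = {t:.2f}")  # t(46) = 9.42
```

Because the inputs are rounded means and standard deviations, values recomputed in this way may differ slightly from statistics calculated on the raw data.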

### Stimuli

A total of 16 active-declarative sentences, patterned after Ferreira (2003; Experiment 1), were constructed to contain an agent of an action and a recipient of that action. Active-declarative sentences represent a typical noun-verb-noun (NVN) structure, in which the first noun is the agent of the action. From each of these active-declarative sentences we constructed an additional 16 sentences with the same meaning but with this meaning expressed with a less canonical passive structure, in which the second noun is the agent of the action.

Four versions of each sentence were constructed: an active-declarative version with a plausible action (e.g., "The eagle attacked the rabbit"), an active-declarative version with the agent and recipient switched to yield a less likely (implausible) action (e.g., "The rabbit attacked the eagle"), a passive sentence structure with a plausible action (e.g., "The rabbit was attacked by the eagle"), and a passive version with an implausible action (e.g., "The eagle was attacked by the rabbit"). This resulted in 64 experimental sentences: 16 base sentences consisting of a unique set of nouns and action verbs with four versions of each.

In addition to these 64 experimental sentences (16 base sentences × 4 versions of each), 72 filler sentences were constructed to avoid a uniform pattern of plausible and implausible non-reversible sentences. Two-thirds of the fillers contained an active or passive construction in which the agent and recipient could be exchanged without affecting plausibility (e.g., "The boy thanked the girl"; "The girl thanked the boy"). The remaining fillers were non-reversible (e.g., "The man walked across the street"; "The bird was bright red"). Each participant heard 36 fillers (24 reversible fillers and 12 non-reversible fillers). These fillers did not form part of the experimental analyses.

The experimental and filler sentences were recorded onto computer sound files by a female speaker of American English at a natural speaking rate of approximately 165 words per minute (wpm) with normal prosody using Sound Studio v2.2.4 software (Macromedia, Inc., San Francisco, CA, USA) that digitized (16 bit) at a sampling rate of 44.1 kHz. Recordings were equalized within and across sentence types for root-mean-square (RMS) intensity using MATLAB (MathWorks, Natick, MA, USA).
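The idea behind RMS equalization can be sketched as follows. The original processing was carried out in MATLAB and its details are not given in the text, so this Python version is only an illustration under stated assumptions: the function names are hypothetical, waveforms are plain lists of samples, and the target level defaults to the mean RMS of the set.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a waveform."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def equalize_rms(recordings, target_rms=None):
    """Scale each recording so that all share the same RMS level.

    Assumption: if no target is given, the mean RMS of the set is used.
    A silent recording (RMS of zero) would need separate handling.
    """
    levels = [rms(r) for r in recordings]
    if target_rms is None:
        target_rms = sum(levels) / len(levels)
    return [
        [s * target_rms / level for s in r]
        for r, level in zip(recordings, levels)
    ]


# Two toy "recordings" with different levels (illustrative values only).
loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.1, -0.2, 0.1, -0.2]
equalized = equalize_rms([loud, quiet])
print([round(rms(r), 4) for r in equalized])  # the two levels now match
```

Scaling by a single gain factor per recording changes overall level while leaving the relative shape of the waveform, and hence the prosody, unchanged.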

### Procedure

Each participant heard a total of 64 experimental sentences, 16 active-plausible, 16 active-implausible, 16 passive-plausible, and 16 passive-implausible. No version of any base sentence (a particular combination of nouns and action verb) was heard more than once by any participant, with the particular base sentence heard in each of its versions counterbalanced across participants such that, by the end of the experiment, each base sentence had been heard in each of its versions an equal number of times. Stimuli were presented in a mixed-list design, with experimental sentences and filler sentences intermingled in a pseudo-random order across lists. This resulted in a total of 100 sentences heard by each participant.

Participants were told that following each sentence there would be a 250 ms pause, followed by a spoken probe question. For the experimental sentences and the reversible filler sentences participants were asked to name aloud either the agent or the recipient of the action, in the form of, "Who was the do-er?" or "Who was the receiver?" Participants were asked to give their responses aloud as accurately as possible. Sentences and probe questions were also counterbalanced, such that, by the end of the experiment, each of the experimental sentences and reversible fillers was followed an equal number of times by agent and recipient probes. Probe questions for the non-reversible filler sentences were "What was the color?" or "What was the action?" (Ferreira, 2003).

Participants were tested individually in a sound-attenuated testing room, with stimuli presented binaurally through calibrated Eartone 3A insert earphones (E-A-R Auditory Systems, Aero Company, Indianapolis, IN, USA), via a GSI-61 audiometer (Grason-Stadler, Madison, WI, USA) at 65 dB HL. Participants' responses were recorded for later accuracy scoring. The main experiment was preceded by a brief practice session to familiarize participants with the task and the sound of the speaker's voice. This session consisted of eight active and passive form sentences of similar length to the test sentences. None of these sentences were used in the main experiment.

### **Audibility check**

A pretest was conducted in order to ensure that the speech materials would be audible to both the young and older adult participants. One- and two-syllable common nouns were presented one at a time at the same intensity level as would be used for the main experiment. After the presentation of each word participants were asked to repeat the word just heard. All participants' report accuracy was above a pre-determined cutoff criterion of 90% accuracy, with the young adults having a mean accuracy of 98.7% words correct and the older adults 97.9% words correct.

# Results

The left panel of **Figure 1** shows the mean percentage of times that the young adults used the literal syntax to determine who was the agent or the recipient of the action for plausible and implausible experimental sentences in which the meaning was expressed with an active or passive syntactic structure. The right panel shows these data for the older adults. There was no significant difference in response accuracy depending on whether the agent or recipient of the action was requested. For all analyses data were thus collapsed across both types of question probes.

The data shown in **Figure 1** were analyzed with a 2 (Plausibility: plausible, implausible) × 2 (Age: young, older) × 2 (Syntactic complexity: active, passive) mixed-design analysis of variance (ANOVA), with syntax and plausibility as within-participants variables and age as a between-participants variable. As can be seen in **Figure 1**, both participant groups' responses were more likely to be consistent with the literal syntax in plausible than in implausible sentences, as confirmed by a significant main effect of plausibility, F(1,46) = 17.54, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.28. There was also a significant main effect of age, F(1,46) = 11.95, p < 0.01, η<sub>p</sub><sup>2</sup> = 0.21. There was a marginal effect of syntactic complexity, F(1,46) = 3.46, p = 0.069, η<sub>p</sub><sup>2</sup> = 0.07. None of the interactions reached significance, consistent with the general similarity in patterns for both age groups.
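The partial eta-squared values reported with each F ratio can be recovered from the F statistic and its degrees of freedom using the standard identity, partial eta squared = (F × df effect) / (F × df effect + df error). The short Python sketch below applies this identity to the three main effects above; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effects reported for Experiment 1, each tested on (1, 46) df.
effects = [
    ("plausibility", 17.54),
    ("age", 11.95),
    ("syntactic complexity", 3.46),
]
for label, f_value in effects:
    print(label, round(partial_eta_squared(f_value, 1, 46), 2))
# plausibility 0.28
# age 0.21
# syntactic complexity 0.07
```

The recomputed values agree with the reported effect sizes to two decimal places.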

To look more closely at the nature of these patterns, subsidiary 2 (Plausibility) × 2 (Syntactic complexity) repeated measures ANOVAs were conducted separately for each participant group. Although the main effect of plausibility was significant for both groups (young adults, p < 0.05; older adults, p < 0.01), the appearance in **Figure 1** of differentially fewer responses following the literal syntax in implausible passive sentences than in implausible active sentences was supported neither by a significant main effect of syntax nor by a significant Syntax × Plausibility interaction for either participant group (p's > 0.05). Planned comparisons did give some support for such an interaction. For the young adults a significant difference appeared between plausible and implausible passive sentences, t(1,23) = 3.11, p < 0.05, but not for the active sentences, t(1,23) = 1.14, p = 0.27. A similar pattern was shown for the older adults, with a significant difference appearing between plausible and implausible sentences for the passive sentences, t(1,23) = 2.57, p < 0.05, and a marginal difference between plausible and implausible sentences for the active sentences, t(1,23) = 1.86, p = 0.07. It can be noted that for plausible active sentences both age groups' accuracy was above 95% correct comprehension, with the younger and older adults differing by only 2.6 percentage points, t(46) = 2.06, p = 0.054.

Although there is a suggestion of a differential increase in reliance on plausibility for sentences with a passive structure compared to those with an active structure, it can be seen that the effect is a weak one. This is consistent with other studies (e.g., Gibson et al., 2013) that have shown a small or absent effect on comprehension responses for implausible active vs. passive sentences. In the present experiment, for example, the difference between responses based on literal syntax in implausible active vs. passive sentences amounted to only a 2.4 percentage-point difference for the young adults and a 4.7 percentage-point difference for the older adults, with neither difference approaching significance.

# Discussion

The results of Experiment 1 show that when the literal syntax of a sentence would imply an implausible meaning, older adults were less likely than their young adult counterparts to follow the meaning expressed by the literal syntax (See Obler et al., 1991, for supportive data). Although our focus is on comprehension, our findings are consistent with results from studies of verbal memory that have shown that older adults perform as well as young adults when memory probes for studied passages are plausible, but more poorly than young adults when they are implausible (e.g., Reder et al., 1986). Analogous to arguments for plausibility effects in sentence comprehension, Reder et al. (1986) suggested that older adults may employ a plausibility-based strategy because it is less resource-demanding than decisions based on specific passage details. Here we suggest that older adults tend to give heavier weight to plausibility than to the literal content in sentence comprehension when the two are in conflict as a way of conserving reduced working memory resources (See Connell and Keane, 2006, for a discussion of plausibility as a cognitive shortcut in memory retrieval).

Early syntax-based models of sentence processing (Miller and Chomsky, 1963), which largely replaced even earlier expectancy-based Markov models of language (Miller, 1952), predicted that comprehension of sentences in a passive form would be more demanding than active sentences because, for understanding, listeners would have to decompose passive sentences into their active form from which they were assumed to be derived (e.g., Miller, 1962). [In a Markov model a particular sequence of symbols (e.g., words, musical notes) is determined solely by their statistical probability based on prior events].

Although theoretical accounts of the relation between active and passive sentences have subsequently evolved (see Ratner et al., 1993, pp. 16–27, for a review of this evolution), early studies showed poorer comprehension and recall of passive sentences than active sentences (e.g., Miller, 1962; Savin and Perchonock, 1965). These studies, offered in support of a derivational theory of sentence complexity, however, were not without criticism on methodological grounds (cf., Wearing, 1970; Boakes and Lodwick, 1971).

It is the case that passive sentence structures are less likely to be encountered in everyday listening experience than active-declarative sentences. For example, an analysis of the types of sentences heard in a British sample of everyday discourse found that simple declarative sentences were by far the most commonly used grammatical forms, accounting for 70–80% of the spoken sentences in the sample. By contrast, passives were encountered in only 0.7–11% of everyday spoken discourse (Goldman-Eisler and Cohen, 1970). It may thus be the case that a listener's expectation of hearing an active-declarative sentence, in which the first noun is the agent of the action, must be rejected for successful comprehension of a passive sentence. Such an argument has been made by Yoon et al. (2015). (See Novick et al., 2005, for an analogous account of the comprehension difficulty for passive sentences encountered in patients with Broca's aphasia). We found that young adults responded to plausibility more frequently than literal syntax for passive sentences, similar to Ferreira (2003), who also tested young adults. In Experiment 1 we showed that this same effect also held for older adults, although not to a differentially greater degree than the young. It should be noted, however, that the size of the effect was small for both age groups, suggesting that both the young and older adults in our study were adept at dealing with this frequency-based violation.

Although the effect of our syntactic manipulation was small, plausibility of the utterance had an impact on performance. In the case of plausible sentences one cannot tell whether the listener is basing his or her comprehension on the literal syntax of the sentence or the plausibility, as the two coincide. The test comes with sentences where the literal syntax and semantic plausibility are in conflict. When this occurred, the incidence of sentence comprehensions that followed the literal syntax was reduced. Even for the older adults, however, comprehension responses based on the literal syntax predominated for both syntactic forms examined.

# EXPERIMENT 2

In Experiment 1 all of the older adults had good hearing acuity for their ages. This raises the question of whether the extra processing load induced by reduced hearing acuity, as is more typical of older adults (Morrell et al., 1996), might increase reliance on a resource-conserving strategy represented by shallow processing, and especially so when the sentence meaning is expressed with a more challenging syntactic manipulation than used in Experiment 1.

In Experiment 2 we thus examined effects of hearing acuity on comprehension responses to determine whether perceptual effort consequent to reduced hearing acuity would amplify the shift to a plausibility-weighted algorithm or, alternatively, induce a greater reliance on a complete syntactic analysis. As part of this question we employed a contrast between sentences with an SR or an OR structure, where we might expect the greater syntactic challenge of OR sentences to show a stronger effect of plausibility on comprehension responses than responses based on literal syntax. As before, the critical condition for separating these alternative processing strategies would be sentences in which the literal syntax and the plausibility of the utterance are in conflict.

Should comprehension responses that favor plausibility over literal syntax occur more often for implausible sentences that express their meaning with an OR structure than with an SR structure, one might expect this effect to be larger for older adults relative to young adults, and larger still for older adults with impaired hearing. This prediction would follow from findings that the comprehension of plausible OR sentences places a greater demand on working memory resources than plausible SR sentences (Just and Carpenter, 1992), with the behavioral consequences greater for older adults who begin with reduced working memory resources relative to younger adults (Carpenter et al., 1994). To test this hypothesis we also tested the working memory capacity of the participants in Experiment 2.

# Method

### Participants

The young adult participants were 24 university students and staff (5 men, 19 women) ranging in age from 18 to 27 years (M = 19.7, SD = 1.94 years), all of whom had age-normal hearing acuity, as measured by PTA averaged over 500, 1,000, 2,000, and 4,000 Hz (M = 8.5 dB HL, SD = 3.14). The group had a mean Shipley vocabulary score (Zachary, 1991) of 13.3 (SD = 2.24).

Forty-eight older adults were tested, 24 with good hearing acuity (7 men and 17 women) and 24 with a mild-to-moderate hearing loss (5 men and 19 women). We summarized individuals' hearing acuity in terms of their better-ear PTA across 0.5, 1, 2, and 4 kHz, a range especially important for the perception of speech. Clinically normal hearing is defined as a PTA of less than 25 dB HL in the better ear (Hall and Mueller, 1997). The older adult group with better hearing acuity had a mean better-ear PTA of 16.8 dB HL (SD = 5.05), placing them well within the range considered to be clinically normal for speech (PTA < 25 dB HL; Katz, 2002). The hearing-impaired group had a mean better-ear PTA of 35.8 dB HL (SD = 5.50), placing them in the mild-to-moderate hearing loss range (Katz, 2002). As indicated previously, this degree of loss represents the single largest group of hearing-impaired older adults (Morrell et al., 1996), the majority of whom do not regularly wear hearing aids (Kochkin, 1999; Fischer et al., 2011). None of the participants in the hearing-impaired group were regular users of hearing aids and all testing was conducted unaided. Potential participants who demonstrated unbalanced hearing (more than a 15 dB difference between ears at one or more frequencies) were excluded from participation.

**Figure 2** shows better-ear pure-tone thresholds from 500 to 4,000 Hz for the individual participants in the three participant groups plotted in the form of audiograms, with the x-axis showing the test frequencies and the y-axis showing the minimum sound level (dB HL) needed for their detection. Hearing profiles for individual listeners within each participant group are shown in light gray, with the group average drawn in black. The shaded area in each of the panels indicates thresholds less than 25 dB HL, a region, as indicated above, commonly considered as clinically normal hearing for speech (Katz, 2002).

The good-hearing and hearing-impaired older adults were similar in age, with the good-hearing group ranging in age from 68 to 83 years (M = 74.7 years, SD = 5.13) and the hearing-impaired group ranging in age from 69 to 81 years (M = 74.7 years, SD = 3.62). The two groups were also well-matched for verbal ability, as estimated by Shipley vocabulary scores (Zachary, 1991): older adult group with better hearing acuity, M = 16.3, SD = 2.35; hearing-impaired, M = 16.3, SD = 2.38. As is common in adult aging (Kempler and Zelinski, 1994; Verhaeghen, 2003), the older adults had somewhat better vocabulary scores than the young adults, a finding that held true for both the good-hearing, t(46) = 4.59, p < 0.001, and the hearing-impaired, t(46) = 4.44, p < 0.001, older adults. As was the case for Experiment 1, all participants reported themselves to be native speakers of American English, with no history of stroke, Parkinson's disease, or other neurological involvement that might compromise their ability to perform the research task. None of the participants in Experiment 2 had participated in Experiment 1. Written informed consent was obtained from all participants according to a protocol approved by the Brandeis University Institutional Review Board.

### **Working memory measurement**

Working memory was assessed with the Letter Number Sequencing Task (LNS; Wechsler, 1997). This is a complex span test in which participants read aloud a series of letters and numbers in sets ranging from two items to nine items, with three trials per set size. Participants are asked to repeat back the numbers first, in ascending order, followed by the letters in alphabetical order. The span measure is the total number of correct trials. This span test thus contains elements of both holding and manipulation of items in immediate memory as a measure of individual differences in working memory capacity (cf., Postle, 2006; McCabe et al., 2010).
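The scoring rule described above (numbers first in ascending order, then letters in alphabetical order, with one point per correct trial) can be expressed as a short Python sketch. The function names and example trials are hypothetical, and the sketch assumes single digits and single letters as items; it does not model any discontinuation rule from the test manual.

```python
def expected_response(items):
    """Correct LNS response: digits ascending, then letters alphabetical."""
    digits = sorted(ch for ch in items if ch.isdigit())
    letters = sorted(ch for ch in items if ch.isalpha())
    return digits + letters


def lns_score(trials):
    """Total number of trials whose response matches the expected order.

    `trials` is a list of (presented, response) pairs of strings.
    """
    return sum(
        1
        for presented, response in trials
        if list(response) == expected_response(presented)
    )


# Illustrative trials (not study data): two correct, one incorrect.
trials = [
    ("T9A3", "39AT"),   # correct
    ("L2", "2L"),       # correct
    ("M3C6", "36MC"),   # letters not in alphabetical order
]
print(lns_score(trials))  # 2
```

Because each trial requires both retention and reordering of the items, the total reflects manipulation as well as storage.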

**Figure 3** shows the scores of the working memory span test separately for young adults with age-normal hearing acuity (young adults), older adults with clinically normal hearing acuity for speech (good-hearing) and older adults with mild-to-moderate hearing loss (hearing-impaired).

Working memory scores were similar for the good-hearing (M = 10.8, SD = 2.36) and hearing-impaired (M = 10.3, SD = 3.06) older adults, t(46) = 0.53, n.s. As might be expected from the body of work on adult aging and working memory (see reviews of this literature in Salthouse, 1991, 1994; Kausler, 1994), the young adults had higher working memory scores (M = 13.2, SD = 2.86) than either the good-hearing, t(46) = 3.25, p < 0.01, or the hearing-impaired, t(46) = 3.36, p < 0.01, older adults.

### Stimuli

Preparation of the stimuli began with construction of 64 sentences, each of which contained an action, an agent of the action, and the recipient of the action. Four versions of each sentence were then constructed: an SR version with a plausible action (e.g., "The eagle that attacked the rabbit was large"), an SR sentence with the agent and recipient switched to yield an implausible action (e.g., "The rabbit that attacked the eagle was large"), the plausible version presented with an OR sentence structure (e.g., "The rabbit that the eagle attacked was large"), and an OR sentence with an implausible action (e.g., "The eagle that the rabbit attacked was large").

The SR and OR sentences in both their plausible and implausible versions contained exactly the same words, differing only in word order. In this example of a plausible SR sentence one can see that the main clause (the eagle was large) is interrupted by a relative clause (that attacked the rabbit). In the plausible OR sentences the embedded clause not only interrupts the main clause, but the head noun phrase (the rabbit) functions as both the subject of the main clause (the rabbit was large) and the object of the relative clause (that the eagle attacked). Implausible OR sentences followed the same principle in which the head noun phrase serves as both the subject of the main clause and the object of the relative clause.

There are a number of reasons why comprehension of sentences with an OR structure is more challenging than comprehension of sentences with an SR structure. For example, because the order of thematic roles in OR constructions is not canonical, such sentences require a more extensive thematic integration than required for the more canonical structure represented by SR sentences (Warren and Gibson, 2002). In addition, to determine these thematic roles, one must keep the subject of the sentence in mind for a longer period of time than in SR sentences (Cooke et al., 2002), such that OR constructions are thought to tax working memory to a greater degree than SR sentences (Ferreira et al., 1996; Cooke et al., 2002).

Although different authors may give different weight to each of these factors, it is well-established that OR sentences result in more comprehension errors than SR sentences (Just and Carpenter, 1992; Wingfield et al., 2006), that comprehension of OR sentences is accompanied by increased patterns of neural activation in functional imaging studies (Just et al., 1996; Cooke et al., 2002; Peelle et al., 2004, 2010), and that they produce slower self-pacing patterns than SR sentences for both written (Stine-Morrow et al., 2000) and spoken (Waters and Caplan, 2001; Fallon et al., 2006) sentences.

In addition to the 256 experimental sentences (64 base sentences × 4 versions of each), 72 SR and OR filler sentences were constructed to avoid a uniform pattern of plausible and implausible non-reversible sentences. For this purpose filler sentences were included in which the agent and recipient could be exchanged without affecting plausibility (e.g., "The boy that pushed the girl was mean"; "The boy that the girl pushed was mean").

The experimental and filler sentences were recorded onto computer sound files by a female speaker of American English at a natural speaking rate of approximately 165 wpm and equalized within and across sentence types for RMS intensity as described in Experiment 1.

### Procedure

Each participant heard a total of 64 experimental sentences (16 SR-plausible, 16 SR-implausible, 16 OR-plausible, 16 OR-implausible). No version of any base sentence was heard more than once by any participant, with the particular base sentence heard in each of its versions counterbalanced across participants such that, by the end of the experiment, each base sentence had been heard in each of its versions an equal number of times. Stimuli were presented in a mixed-list design, with experimental sentences and filler sentences intermingled in a pseudorandom order across lists. Along with 36 filler sentences this resulted in a total of 100 sentences heard by each participant.
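One common way to achieve counterbalancing of this kind is a Latin-square rotation of sentence versions across presentation lists. The text does not specify the scheme that was used, so the Python sketch below is an assumption offered for illustration only; the function name and the use of integer item indices are hypothetical, and fillers and pseudorandom ordering are omitted.

```python
from collections import Counter

CONDITIONS = ("SR-plausible", "SR-implausible", "OR-plausible", "OR-implausible")


def build_lists(n_items=64):
    """Latin-square rotation: item i appears in list k as version (i + k) mod 4.

    Assumption: four lists, one per rotation, each containing every base
    sentence exactly once.
    """
    n = len(CONDITIONS)
    return [
        [(item, CONDITIONS[(item + k) % n]) for item in range(n_items)]
        for k in range(n)
    ]


lists = build_lists()

# Each list holds 16 sentences in each of the four conditions.
print(Counter(condition for _, condition in lists[0]))

# Across the four lists, every base sentence occurs once in every version.
versions_of_item_0 = {condition for lst in lists for item, condition in lst if item == 0}
print(len(versions_of_item_0))  # 4
```

With 24 participants per group, assigning six participants to each of the four lists would give every base sentence an equal number of presentations in each version.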

Instructions were the same as in Experiment 1, with participants told that following each sentence there would be a 250 ms pause, followed by a spoken probe question in the form of "Who was the do-er?" or "Who was the receiver?" Responses were to be given aloud as accurately as possible and were recorded for later scoring for accuracy.

Participants were tested individually in a sound-attenuated testing room, with stimuli presented binaurally through calibrated Eartone 3A insert earphones (E-A-R Auditory Systems, Aero Company, Indianapolis, IN, USA), via a GSI-61 audiometer (Grason-Stadler, Madison, WI, USA) at 65 dB HL. The main experiment was preceded by a brief practice session to familiarize participants with the task and the sound of the speaker's voice.

### **Audibility check**

As in Experiment 1, a pretest was conducted in order to ensure that the speech materials would be audible for all participants. This again consisted of one- and two-syllable common nouns presented one at a time at the same intensity level as would be used for the main experiment. After the presentation of each word participants were asked to repeat the word. All participants in the three participant groups showed good accuracy, with a mean of 98.3% words correctly repeated for the young adults, 96.4% correct for older adults with good hearing and 95.6% correct for older adults with a hearing impairment.

# Results

The left panel of **Figure 4** shows the mean percentage of times that the young adults used the literal syntax to determine who was the agent or the recipient of the action for plausible and implausible SR and OR sentences. The middle and right panels show these data for the good-hearing and hearing-impaired older adults, respectively.

There was again no difference depending on whether the agent or the recipient of the action was requested. For all analyses data were thus collapsed across the two types of probe questions.

The data shown in the three panels of **Figure 4** were examined with a 2 (Plausibility: plausible, implausible) × 3 (Participant group: young adults, good-hearing older adults, hearing-impaired older adults) × 2 (Syntactic complexity: SR, OR) mixed-design ANOVA, with plausibility and syntax as within-participants variables and group as a between-participants variable. As implied by visual inspection of **Figure 4**, there was a significant main effect of sentence plausibility, with plausible sentences more likely than implausible sentences to produce comprehension responses consistent with their literal syntax, F(1,69) = 75.58, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.52. There was also a main effect of participant group, F(2,69) = 7.04, p < 0.01, η<sub>p</sub><sup>2</sup> = 0.17. This main effect, however, was moderated by a significant Participant group × Plausibility interaction, F(2,69) = 5.03, p < 0.01, η<sub>p</sub><sup>2</sup> = 0.13. This interaction reflects the observation in **Figure 4** that the drop in syntax-based comprehension responses for implausible relative to plausible sentences was larger for the hearing-impaired older adults than for the other two participant groups.

Unlike the active-passive contrast in Experiment 1, the more challenging contrast represented by SR vs. OR sentences in the present experiment now yielded a significant main effect of syntactic complexity, F(1,69) = 59.70, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.46. A significant Plausibility × Syntactic complexity interaction, F(1,69) = 18.52, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.21, confirmed the appearance in **Figure 4** that the effect of plausibility across the three groups was generally greater for OR sentences than for SR sentences. Neither the remaining two-way nor the three-way interactions reached significance.
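The partial eta squared effect sizes reported with these F tests can be recovered from each F value and its degrees of freedom using the standard identity η<sub>p</sub><sup>2</sup> = (F × df<sub>effect</sub>) / (F × df<sub>effect</sub> + df<sub>error</sub>). The short sketch below applies that identity to the reported statistics; the function name is ours and is used only for illustration.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Reported effects: (label, F, df_effect, df_error)
reported = [
    ("Plausibility", 75.58, 1, 69),
    ("Participant group", 7.04, 2, 69),
    ("Group x Plausibility", 5.03, 2, 69),
    ("Syntactic complexity", 59.70, 1, 69),
    ("Plausibility x Syntactic complexity", 18.52, 1, 69),
]

for label, f_value, df_effect, df_error in reported:
    print(label, round(partial_eta_squared(f_value, df_effect, df_error), 2))
```

Rounded to two decimals, the computed values agree with the effect sizes given in the text (0.52, 0.17, 0.13, 0.46, and 0.21).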

The meaning of this pattern of main effects and interactions was further explored by conducting separate 2 (Plausibility) × 2 (Syntactic complexity) repeated measures ANOVAs on the data for each of the participant groups. For each of the participant groups the reduced likelihood of comprehension responses being based on the literal syntax when the sentences were implausible rather than plausible was supported by a significant main effect of plausibility (p < 0.001 in all cases). Each of the three groups also revealed a main effect of syntax (young adults, p < 0.001; good-hearing older adults, p < 0.05; hearing-impaired older adults, p < 0.001). The tendency for comprehension responses to be less likely to correspond with the literal syntax of the sentence for implausible OR sentences than for SR sentences resulted in significant Plausibility × Syntax interactions for the young adults (p < 0.05) and hearing-impaired older adults (p < 0.01), and a marginal effect for the good-hearing older adults (p = 0.07). An ANOVA confirmed the appearance of similar performance for all groups in the SR-plausible condition, F(2,69) = 2.36, p = 0.10, indicating that the performance declines for implausible OR sentences were not due to the two older adult groups being unable to hear the stimuli as well as the young.

### Age and Hearing as Continuous Variables

The relatively greater difficulty older adults have in comprehending sentences with complex syntax as compared to young adults has been attributed by many theorists to a reduced working memory capacity depriving older adults of the resources needed to support comprehension (cf., Carpenter et al., 1994; Daneman and Merikle, 1996; Caplan et al., 2011). One might thus expect that individual differences in working memory capacity would contribute significantly to the variance observed in the comprehension data.

We conducted a linear mixed-effect model regression analysis considering syntax and plausibility as categorical factors and working memory, hearing acuity, and age as continuous variables, with subjects entered as random effects. This analysis showed working memory to account for a significant amount of the variance, t(68) = 2.59, p = 0.012, and a marginal plausibility by hearing acuity interaction, t(210) = 1.96, p = 0.051. To look more closely at these effects we conducted hierarchical multiple regressions for each of the four stimulus types (SR-plausible, SR-implausible, OR-plausible, and OR-implausible), with the percentage of responses that were consistent with the literal meaning of the sentences serving as the dependent variable in each case. Predictor variables were entered into the model in the following order: working memory span represented by Letter-number Sequencing, hearing acuity represented by better-ear PTA averaged over 500, 1000, 2000, and 4000 Hz, and participants' chronological age. This order was selected so as to determine the extent of a potential contribution of hearing acuity after statistically controlling for working memory span, and whether chronological age contributed additional variance after accounting for working memory span and hearing acuity. For each predictor variable for each of the sentence types we show R<sup>2</sup>, which represents the cumulative contribution of each variable along with the previously entered variables, and the change in R<sup>2</sup>, which shows the contribution of each variable at each step. The next column shows the level of significance of each variable and the final column shows the unstandardized regression coefficients (β).
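The logic of entering predictors in a fixed order and tracking the cumulative R<sup>2</sup> and its change at each step can be sketched as follows. This is a minimal illustration, not the analysis software used in the study: the least-squares solver is a plain normal-equations implementation, the helper names are hypothetical, and the six data points are invented toy values rather than study data.

```python
def r_squared(y, predictors):
    """R^2 of an ordinary least-squares fit of y on the predictor columns plus an intercept."""
    n = len(y)
    # Design matrix rows: intercept followed by one value per predictor.
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    aug = [
        [sum(r[a] * r[b] for r in rows) for b in range(k)]
        + [sum(r[a] * yi for r, yi in zip(rows, y))]
        for a in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda idx: abs(aug[idx][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(k):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[c])]
    coefs = [aug[i][k] / aug[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(coefs, row)) for row in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def hierarchical_steps(y, ordered_predictors):
    """Enter predictors one at a time; return (name, cumulative R^2, change in R^2) per step."""
    steps, entered, previous = [], [], 0.0
    for name, column in ordered_predictors:
        entered.append(column)
        r2 = r_squared(y, entered)
        steps.append((name, r2, r2 - previous))
        previous = r2
    return steps


# Invented toy data for six hypothetical listeners (NOT data from the study).
accuracy = [95.0, 90.0, 85.0, 70.0, 65.0, 50.0]
working_memory = [12.0, 11.0, 10.0, 9.0, 9.0, 7.0]
hearing_pta = [5.0, 10.0, 12.0, 25.0, 30.0, 40.0]
age = [20.0, 22.0, 68.0, 70.0, 72.0, 75.0]

steps = hierarchical_steps(
    accuracy,
    [("working memory", working_memory), ("hearing acuity", hearing_pta), ("age", age)],
)
```

Because each step adds a predictor to the previous model, the cumulative R<sup>2</sup> cannot decrease, and the change at each step is the additional variance attributable to the newly entered variable once the earlier ones are controlled for.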

Inspection of **Table 1** shows the prediction for working memory to be borne out: Working memory scores accounted for a significant proportion of the variance in comprehension responses across all sentence conditions, albeit at a marginal level for implausible SR sentences.

Although the pretest confirmed that stimuli were audible for all three participant groups, it is likely that this perceptual success came at the cost of greater perceptual effort for those with poorer hearing acuity. This raises the concern that perceptual success in the face of reduced hearing acuity may draw cognitive resources that might otherwise be available for downstream comprehension operations, with this effect being especially damaging for more challenging sentence conditions (Pichora-Fuller, 2003; Wingfield et al., 2006). Consistent with this argument, the regression analyses in **Table 1** show hearing acuity to have contributed significantly to comprehension responses for the implausible sentences, where the literal syntax and plausibility were in conflict, but not for the plausible sentences, in which the two were mutually supportive.

### TABLE 1 | Summary of hierarchical regressions.

Finally, it can also be seen that when the contributions of working memory span and hearing acuity were taken into account chronological age did not in most cases contribute additional variance to comprehension responses. We do not have an account for the singular exception for sentences in the plausible OR condition.

### GENERAL DISCUSSION

It is reasonable to accept the generality that successful comprehension of spoken (or written) sentences rests on determination of the semantic relationships among the words of a sentence, and that these relationships are carried by the syntactic structure of the utterance (Chomsky, 1965, 1995; Frazier and Fodor, 1978; MacDonald et al., 1994). It is our contention, and that of others (e.g., Ferreira et al., 2001; Sanford and Sturt, 2002; Ferreira, 2003; Ferreira and Patson, 2007; Padó et al., 2009; Gibson et al., 2013), however, that a full syntactic analysis of the utterance is not necessarily obligatory for accurate sentence comprehension.

Evidence for this latter contention can be found in the way individuals will "hear" the missing word "to" in the sentence, "The mother gave the candle the daughter" (Gibson et al., 2013). Such examples reflect the experience-based assumption that many of the utterances we hear will be fragmentary, will have underspecified syntax, or occasionally will have some words masked by background noise (cf., Goldman-Eisler, 1968; Elsness, 1984; Thompson and Mulac, 1991; Levy, 2008; Padó et al., 2009; Gibson et al., 2013).

Because we expect that the utterances we hear will have plausible meanings, we can conduct a resource-conserving shallow analysis of the sentence input, sampling some words, inferring others, and guiding our solution to the comprehension task by presumed plausibility. While the occurrence of shallow processing will go unnoticed when it results in correct comprehension, its consequences appear when errors are made. One notable example is the "Moses illusion," in which listeners will often answer "Two" in response to the question, "How many animals of each sort did Moses put on the ark?" (Erickson and Matteson, 1981; Van Oostendorp and De Mul, 1990; Van Oostendorp and Kok, 1990).

Error-free performance in the usual case of plausible sentences can obscure the role of plausibility in this success. As we have seen, however, the importance of plausibility can be revealed when the literal syntax of a sentence and its semantic plausibility are placed in conflict. We saw this in Experiment 1, where for both active-declarative and passive sentences fewer responses followed the literal meaning of the sentence when this meaning was implausible. Findings such as these are often interpreted as reflecting an age-related decline in comprehension ability for implausible sentences, with comprehension responses that favor plausibility taken as evidence for such a deficit (Obler et al., 1991; see also Yoon et al., 2015). By contrast, we would see such data, including our own, not as representing an incorrect response but rather as evidence of an alternative, and ordinarily adaptive, solution to the comprehension challenge.

Our finding that syntactic complexity had little effect in Experiment 1 is consistent with other studies showing a small, if any, effect of an active-passive manipulation (cf., Obler et al., 1991; Ferreira, 2003; Gibson et al., 2013). We introduced Experiment 1 to define the lower bounds of a syntactic effect. In Experiment 2, we contrasted SR vs. OR sentences, a contrast that has been reliably shown in numerous studies to yield significant differences in comprehension accuracy, and especially so for older adults (e.g., Just and Carpenter, 1992; Carpenter et al., 1994; Cooke et al., 2002; Wingfield et al., 2003; Wingfield et al., 2006; Peelle et al., 2010). This condition allowed a test of the hypothesis that listeners will more often engage in a resource-conserving shallow processing strategy when detection of thematic roles in an utterance via a full analysis of each word's contribution to the sentence meaning is made more difficult by using an OR sentence structure.

Experiment 2 yielded three key findings. First, listeners' comprehension responses were less likely to correspond to the literal meanings of the utterances if this process yielded an implausible meaning. Second, the ratio of comprehension responses based on the meaning as determined by the literal syntax relative to responses that opted for a plausible interpretation when the two were in conflict was larger with the less syntactically demanding SR sentences than the more resource-demanding OR sentences. Finally, this effect was markedly greater for the older adults with a mild-to-moderate hearing loss, all of whom passed an audibility screen for speech presented at the same sound intensity as used in the main experiment. This should not imply, however, that their success did not come at the cost of greater perceptual effort than for the young adults or the good-hearing older adults. When hearing acuity was taken as a continuous variable in the regression analysis we saw that hearing acuity did indeed add to the variance in comprehension responses for the implausible sentences, where syntax and plausibility were in conflict, but not for the plausible sentences where the two were in accord.

Two final caveats should be mentioned. In the first case, in the absence of a real-time measure of processing operations we cannot say whether syntactic parsing, determination of semantic relations within the sentence and testing against real-world plausibility are processed concurrently, or whether one conducts a syntax-first analysis followed by a plausibility check after the initial-phase processing has been completed (for a discussion see Padó et al., 2009).

Our experimental task is intended to represent effects of syntax and plausibility in sentence comprehension (e.g., Ferreira, 2003; Ferreira and Patson, 2007). It should be acknowledged, however, that plausibility could have exerted its effect at the time that the comprehension question probe was delivered. Whichever is the case, it is clear from these data that listening effort consequent to age-related hearing loss leads to greater reliance on plausibility than is seen for either age-matched older adults with good hearing acuity or, in turn, younger adults with age-normal hearing.

Second, it should be acknowledged that perceptual or cognitive effort in listening tasks is most often assessed, as was the case here, as a performance decline for degraded but audible speech vs. clearer speech (e.g., Rabbitt, 1968, 1991; Surprenant, 1999, 2007; Pichora-Fuller, 2003; Pichora-Fuller and Souza, 2003; McCoy et al., 2005; Wingfield et al., 2006). Attempts to find a measure of processing effort independent of performance on the target task itself have included reduced accuracy on a concurrent non-language secondary task while listening to and recalling clear vs. degraded speech (e.g., Larsby et al., 2005; Sarampalis et al., 2009; Tun et al., 2009; Fraser et al., 2010), an increase in pupil dilation of the eye while listening to degraded speech as an indicator of effortful processing (Zekveld et al., 2011; Kuchinsky et al., 2013) and increased patterns of neural activation revealed in functional neuroimaging (Peelle et al., 2010, 2011). It remains the case, however, that the cognitive literature has yet to reach a consensus on a formal definition of effort or effortful processing (for a discussion of attempts, see McGarrigle et al., 2014).

# CONCLUSION AND FUTURE DIRECTIONS

It has been argued that a goal of cognitive aging research should be removal of chronological age as an experimental variable (e.g., Kausler, 1994). We attempted to follow this goal in Experiment 2, with regression analyses showing that for the present task once working memory and hearing acuity were taken into account, in all but one sentence condition chronological age did not add additional variance to the nature of the comprehension response. The three factors we considered (working memory capacity, hearing acuity, and age), however, still left considerable variance unaccounted for that might be accounted for by additional variables not tested. One possible candidate may be individual differences in self-efficacy and control beliefs that can affect performance in a number of domains (cf., Lachman and Jelallian, 1984; Hastings and West, 2011; Smith et al., 2011; Agrigoroaei et al., 2013). We suggest this as a fruitful area for future research.

# AUTHOR CONTRIBUTIONS

NA and AW contributed equally to the design and conduct of the research and to the preparation of this manuscript. AGW contributed to the conduct of Experiment 1.

# ACKNOWLEDGMENTS

The authors acknowledge support from the National Institute on Aging of the National Institutes of Health under award numbers R01 AG019714 and R01 AG038490 (AW) and NIA training grant T32 AG00204 (NA). We also gratefully acknowledge support from the W. M. Keck Foundation.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer CS and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2016 Amichetti, White and Wingfield. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Effects of Hearing Loss and Cognitive Load on Speech Recognition with Competing Talkers

Hartmut Meister<sup>1</sup>\*, Stefan Schreitmüller<sup>1</sup>, Magdalene Ortmann<sup>1</sup>, Sebastian Rählmann<sup>1</sup> and Martin Walger<sup>2</sup>

<sup>1</sup> Jean-Uhrmacher-Institute for Clinical ENT-Research, University of Cologne, Cologne, Germany, <sup>2</sup> Clinic of Otorhinolaryngology, Head and Neck Surgery, University of Cologne, Cologne, Germany

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Teemu Rinne, University of Helsinki, Finland Jed A. Meltzer, University of Toronto, Canada

\*Correspondence:

Hartmut Meister hartmut.meister@uni-koeln.de

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 30 November 2015 Accepted: 16 February 2016 Published: 04 March 2016

### Citation:

Meister H, Schreitmüller S, Ortmann M, Rählmann S and Walger M (2016) Effects of Hearing Loss and Cognitive Load on Speech Recognition with Competing Talkers. Front. Psychol. 7:301. doi: 10.3389/fpsyg.2016.00301

Everyday communication frequently comprises situations with more than one talker speaking at a time. These situations are challenging since they pose high attentional and memory demands, placing cognitive load on the listener. Hearing impairment additionally exacerbates communication problems under these circumstances. We examined the effects of hearing loss and attention tasks on speech recognition with competing talkers in older adults with and without hearing impairment. We hypothesized that hearing loss would affect word identification, talker separation and word recall and that the difficulties experienced by the hearing impaired listeners would be especially pronounced in a task with high attentional and memory demands. Two listener groups closely matched for their age and neuropsychological profile but differing in hearing acuity were examined regarding their speech recognition with competing talkers in two different tasks. One task required repeating back words from one target talker (1TT) while ignoring the competing talker whereas the other required repeating back words from both talkers (2TT). The competing talkers differed with respect to their voice characteristics. Moreover, sentences either with low or high context were used in order to consider linguistic properties. Compared to their normal hearing peers, listeners with hearing loss revealed limited speech recognition in both tasks. Their difficulties were especially pronounced in the more demanding 2TT task. In order to shed light on the underlying mechanisms, different error sources, namely having misunderstood, confused, or omitted words were investigated. Misunderstanding and omitting words were more frequently observed in the hearing impaired than in the normal hearing listeners. In line with common speech perception models, it is suggested that these effects are related to impaired object formation and taxed working memory capacity (WMC). In a post-hoc analysis, the listeners were further separated with respect to their WMC. It appeared that higher capacity could be used in the sense of a compensatory mechanism with respect to the adverse effects of hearing loss, especially with low context speech.

Keywords: speech recognition, competing talkers, attention, working memory, age-related hearing loss

# INTRODUCTION

Age-related hearing loss is a common chronic condition in older persons (Zhan et al., 2010; Lin et al., 2011). It causes communication problems, especially in demanding listening situations, such as when speech is masked with noise or when competing talkers are present (Festen and Plomp, 1990; Kiessling et al., 2003; Summers and Molis, 2004). There is a growing body of evidence suggesting that cognitive factors play an important role in these situations (Akeroyd, 2008; Humes, 2013). A number of studies have shown a relationship between speech recognition and working memory capacity (WMC). Working memory refers to short-term maintenance and processing of information supporting ongoing and upcoming actions (e.g., Baddeley, 2010; Eriksson et al., 2015; Mansouri et al., 2015). It is characterized by a limited capacity system typically declining with age (e.g., Nyberg et al., 2012).

Another basic cognitive factor involved in speech understanding is attention (e.g., Bronkhorst, 2015). In a multitalker environment, attention refers to the ability to selectively focus on a target talker while inhibiting competing information, or to divide attention between or switch among different talkers (McDowd, 2007). Though frequently treated as separate concepts, working memory and attention are substantially intertwined (Barrouillet et al., 2004; Engle and Kane, 2004) and are both attributed to the concept of core executive functions (e.g., Diamond, 2013).

Both attention and working memory are reflected in common models of speech understanding in adverse listening situations. The concept of auditory scene analysis (Bregman, 1990) assumes that in a multitalker environment, at first, auditory objects are established (Griffiths and Warren, 2004; Shinn-Cunningham and Best, 2008). After object formation, the auditory objects are grouped into auditory streams. Different acoustic cues, such as the talker's fundamental frequency or other voice characteristics such as formant frequencies, are used for stream build up (Shinn-Cunningham and Best, 2008; Moore and Gockel, 2012). Following this concept, attention can then be selectively directed to the talker of interest while inhibiting irrelevant information, or it can be redirected to another auditory stream.

Sörqvist (2010) and Rönnberg et al. (2013) describe how inhibiting irrelevant information or dividing attention between different sources is associated with the individual WMC of the listener. In the framework of their "ease of language understanding" (ELU) model (Rönnberg, 2003; Rönnberg et al., 2008, 2013), they describe different memory domains associated with the processing of speech. Basically, the ELU model postulates that multimodal (i.e., auditory, visual) speech information is bound into a phonological representation in an episodic buffer based on a continuous process that feeds forward syllables in rapid succession. Entries of this buffer are matched with corresponding representations in semantic long-term memory (LTM). Under ideal circumstances, this implicit process allows rapid and automatic lexical retrieval. However, if the speech input is altered (for example, due to hearing loss, masking, or artifacts of signal processing), it might not be precise enough to match the representations in semantic LTM. The model then assumes that explicit cognitive processes come into play to compensate for the mismatch: The altered information has to be stored and further processed, engaging short-term and working memory, respectively. This process might include inference-making, semantic integration, switching of attention, storing of information, and inhibiting irrelevant information (Rönnberg et al., 2013). Following the ELU model, WMC is essential for executing these explicit processes in order to overcome the disruption of the automatic implicit process. In conjunction with this, the ELU model also considers lexical context as an important factor aiding speech recognition. The use of context relies on linguistic knowledge and narrows down the set of lexical candidates in the speech stream, accordingly supporting explicit cognitive processing (Rönnberg et al., 2013). Linguistic knowledge and the rules for its use are preserved in older age and thus might be used to counteract effects of cognitive decline and hearing impairment associated with aging (e.g., Wingfield et al., 2015).

Against the background of these model considerations, the present study attempted to examine mechanisms in older adults with respect to speech recognition when competing talkers are present. Concretely, we were interested in the effects of hearing impairment and attention tasks differing in cognitive load. Therefore, older persons with typical age-related hearing loss and a matched control group of older persons with clinically normal hearing thresholds were requested to repeat back words either from a single target talker or from two target talkers in a competing talker paradigm. Thus, tasks differed regarding their attentional and memory demands. We further examined the effects of context with these two tasks by presenting concurrent speech streams with lower and higher word predictability. Three different error sources reflecting word object formation, stream segregation and word recall were determined in order to shed light on the question of the processing stage at which problems occur for the listeners. It was hypothesized that the hearing-impaired (HI) individuals would exhibit significantly greater speech recognition problems than their normal-hearing peers at all processing stages, reflected in degraded object formation, stream segregation, and word recall. We anticipated that the difficulties of the HI listeners would be especially pronounced under higher cognitive load. We further hypothesized that both listener groups would make use of context to promote speech recognition with competing talkers.

# METHODS

# Speech Materials

Two commonly used German speech audiometric test materials were administered, namely the Oldenburg sentence test ("OLSA," Wagener et al., 1999) and the Göttingen sentence test ("GOESA," Kollmeier and Wesselkamp, 1997). The OLSA presents low context speech with a fixed five-word syntactic structure (name–verb–numeral–adjective–object, such as "Stefan kauft sieben nasse Schuhe"/"Stefan buys seven wet shoes"). These sentences are syntactically correct but semantically unpredictable. Using the j-factor model (Boothroyd and Nittrouer, 1988), calculating a measure for the predictability of the OLSA corpus yields a value of j = 4.3 (i.e., an average of 4.3 parts of the sentences are statistically independent). The GOESA presents high context speech and includes everyday sentences of three to seven words in length, with a high word predictability of j = 2.5 (Bronkhorst et al., 2002). Only the five-word sentences from GOESA (such as "Adler fliegen tausend Meter hoch"/"eagles fly thousand meters high") were used in order to match the length of the OLSA sentences. The same male speaker produced both GOESA and OLSA materials.
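To make the j-factor values more concrete: in the j-factor model the probability of recognizing a whole sentence equals the probability of recognizing one of its parts raised to the power j, so that j = log(p<sub>whole</sub>) / log(p<sub>part</sub>). The sketch below illustrates this relation; the function names and the word-recognition probability of 0.8 are illustrative assumptions, not values from the study.

```python
import math


def j_factor(p_part, p_whole):
    """Effective number of independent parts: j = log(p_whole) / log(p_part)."""
    return math.log(p_whole) / math.log(p_part)


def whole_from_part(p_part, j):
    """Whole-sentence recognition probability predicted from part probability and j."""
    return p_part ** j


# Hypothetical word recognition probability, chosen only for illustration.
p_word = 0.8
p_sentence_olsa = whole_from_part(p_word, 4.3)   # low context (OLSA, j = 4.3)
p_sentence_goesa = whole_from_part(p_word, 2.5)  # high context (GOESA, j = 2.5)
```

For the same word-level accuracy, the lower j of the high context material predicts a higher probability of getting the entire sentence right, which is the sense in which context makes the words less independent of one another.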

In order to provide distinct acoustic cues for the separation of target and masker, the sentences were modified with respect to the fundamental frequency (F0) and formant frequencies using "praat" (Boersma and Weenink, 2001). F0 of the original utterances was shifted by +80 Hz and formant frequencies were shifted by +16%, thereby yielding the characteristics of a female talker (Darwin et al., 2003). Original and modified sentences thus differed solely in these characteristics, with all other attributes (such as prosody, speaking rate, etc.) being identical. Acoustic modifications yielded naturally sounding stimuli and the participants were not aware of the female voice being an artificial adaptation of the male talker.

Stimuli were generated by superimposing two sentences, one with the male voice and one with the female voice. The corresponding sentences were identical in duration. The level of the sentences was not modified, thus yielding a mean target-masker ratio (TMR) of 0 dB across all sentence pairs. In order to consider not only acoustic characteristics of different talkers, but also linguistic properties, the superimposed sentences were either drawn from the low context speech material of the OLSA stimulus type (denoted as LC/LC, where LC stands for low context) or from both low context (OLSA) and high context sentences of the GOESA (denoted as LC/HC). In the latter case, both sentence sets were used as a target as well as a masker, depending on the given voice characteristics (speaker gender as the target cue, see Procedures). The number of OLSA and GOESA targets was balanced. Stimuli were presented at an average level of 70 dB SPL via a free-field loudspeaker placed in front of the participant's head at a distance of 1.2 m in a sound-treated booth.
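As a rough illustration of what a target-masker ratio of 0 dB means, the following sketch computes the TMR as the level difference between two signals from their root-mean-square amplitudes and superimposes them sample by sample. The sine tones are hypothetical stand-ins for the sentence recordings, and the sampling rate and helper names are assumptions made for this example.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def tmr_db(target, masker):
    """Target-masker ratio in dB: 20 * log10(rms_target / rms_masker)."""
    return 20.0 * math.log10(rms(target) / rms(masker))


def superimpose(target, masker):
    """Add two equally long signals sample by sample without changing their levels."""
    return [t + m for t, m in zip(target, masker)]


# Hypothetical stand-ins for two equally long recordings (0.1 s at 16 kHz).
SAMPLE_RATE = 16000
N_SAMPLES = 1600
male = [math.sin(2 * math.pi * 100 * i / SAMPLE_RATE) for i in range(N_SAMPLES)]
female = [math.sin(2 * math.pi * 180 * i / SAMPLE_RATE) for i in range(N_SAMPLES)]
mixture = superimpose(male, female)
```

Two signals with equal RMS amplitude give a TMR of 0 dB, whereas doubling the target amplitude would raise the TMR by about 6 dB.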

### Procedures

Procedures were based on methods described by Humes et al. (2006) and Meister et al. (2013). Speech recognition was assessed during two different attention tasks. With the "one target talker" task (1TT), the participants were requested to selectively attend to a target talker and to repeat back as many words as possible from the target sentences while ignoring the competing masker sentences. Prior to each stimulus, the target sentence was indicated by requesting the participant to listen to either the female or the male voice. This information was updated from trial to trial with a balanced proportion of male and female targets. With the more demanding "two target talkers" task (2TT) the participants were requested to repeat back as many words as possible from both talkers and to correctly assign them to the male and the female voice. Thus, both tasks differed with regard to their attentional and memory requirements whereas perceptual load was identical due to the use of identical stimuli. With both tasks the listeners were encouraged to guess in case they were uncertain about the words presented. Measurements were performed with three test lists with 14 stimuli per condition, yielding 168 presentations in total (42 stimuli [3 lists with 14 sentence pairs each] × 2 target talker tasks [1TT, 2TT] × 2 stimulus types [LC/LC, LC/HC]). To avoid order effects the order of tasks and stimulus types was randomized and the lists were randomly assigned to the different conditions.
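The trial count given above follows directly from crossing the two tasks with the two stimulus types and presenting three lists of 14 sentence pairs in each resulting condition. The sketch below enumerates such a schedule with a randomized condition order; the blocked layout, the seed, and the `build_schedule` helper are assumptions for illustration, since the text does not specify how the randomization was implemented.

```python
import itertools
import random

TASKS = ["1TT", "2TT"]
STIMULUS_TYPES = ["LC/LC", "LC/HC"]
LISTS_PER_CONDITION = 3
PAIRS_PER_LIST = 14


def build_schedule(seed=0):
    """Return (task, stimulus type, list index, pair index) tuples for one session.

    The order of the four task-by-stimulus-type conditions is shuffled;
    presenting each condition as one block is an assumption of this sketch.
    """
    rng = random.Random(seed)
    conditions = list(itertools.product(TASKS, STIMULUS_TYPES))
    rng.shuffle(conditions)
    schedule = []
    for task, stimulus_type in conditions:
        for list_index in range(LISTS_PER_CONDITION):
            for pair_index in range(PAIRS_PER_LIST):
                schedule.append((task, stimulus_type, list_index, pair_index))
    return schedule


schedule = build_schedule()
```

Each of the four conditions contributes 3 × 14 = 42 presentations, giving 4 × 42 = 168 in total.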

Prior to the measurements, the participants were intensively familiarized with the stimulus materials and the procedures. Stimuli presented during familiarization were discarded for the measurements.

### Participants

Fourteen older adults aged 58–79 years (mean 68.3 years) with good hearing (denoted as "normal hearing" [NH] listeners in the following) and 14 older adults with typical age-related hearing loss (denoted as "hearing impaired" [HI] listeners) aged 60–85 years (mean 69.6 years) participated in the study. Hearing loss was predominantly symmetrical, with between-ear differences typically less than 15 dB HL. None of the listeners was provided with hearing aids. Mean pure-tone thresholds are given in **Table 1**.

Both groups underwent cognitive screening using the DemTect inventory (Kalbe et al., 2004). All participants passed the cognitive screening (score > 12 in the DemTect). In order to match the two groups with regard to their neuropsychological profile, a test battery addressing different cognitive domains was administered. These tests tapped into attention and concentration (test d2, Brickenkamp, 1962), attention and task switching (Trailmaking test, Reitan, 1958), reasoning and fluid intelligence (Leistungsprüfsystem LPS-4, Horn, 1983), crystallized intelligence (Mehrfachwortschatztest MWT-B, Lehrl, 2005) as well as WMC (Verbaler Lern- und Merkfähigkeitstest VLMT, Helmstädter et al., 2001). The VLMT was further used for a post-hoc grouping criterion (i.e., median split) of the listeners. In the VLMT, lists of 15 words were visually presented and the participants were asked to recall as many words as possible. This procedure was repeated five times and the mean across the repetitions was calculated as the outcome value. Thus, a value of 10 corresponds to 10/15 words recalled per list on average. The test primarily addresses verbal short-term memory and learning abilities, but also captures the individual WMC of the participant (see Elger et al., 1997; Helmstädter et al., 2001; Van der Elst et al., 2005 for the English version of the VLMT). Group results of the VLMT are shown in **Table 2**. Importantly, there were no significant group differences in any of the neuropsychological measures assessed, namely the test d2, the Trailmaking test, the LPS-4, the MWT-B and the VLMT (independent samples t-tests, all p > 0.05).

TABLE 1 | Better ear hearing loss (BEHL) of the normal-hearing (NH) and hearing-impaired (HI) listeners for the frequencies 0.125–8 kHz.

Mean and standard deviation are given.

TABLE 2 | VLMT scores for the listener groups and the post-hoc median split of the VLMT.

Mean and standard deviation are given.
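The VLMT outcome value described above (the mean number of words recalled across the five repetitions) can be sketched as below. The function name and the recall counts are made up for illustration; only the list length of 15 and the five repetitions come from the text.

```python
from statistics import mean

LIST_LENGTH = 15   # words per list, from the text
N_TRIALS = 5       # repetitions, from the text


def vlmt_outcome(words_recalled_per_trial):
    """Mean number of words recalled across the five trials."""
    if len(words_recalled_per_trial) != N_TRIALS:
        raise ValueError(f"expected {N_TRIALS} trials")
    if any(not 0 <= n <= LIST_LENGTH for n in words_recalled_per_trial):
        raise ValueError(f"recall counts must lie between 0 and {LIST_LENGTH}")
    return mean(words_recalled_per_trial)


# Made-up recall counts for one hypothetical participant.
score = vlmt_outcome([7, 9, 10, 12, 12])
print(score)  # 50 words / 5 trials = 10, i.e. 10/15 words per list on average
```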

All participants provided their written informed consent prior to the experiments. The study was approved by the ethics committee of the University of Cologne.

### Analyses

Following the methods described by Meister et al. (2013), the participants' responses with the speech recognition tests were audio recorded in order to allow for a detailed analysis of errors. Three types of errors were documented, namely substitutions, confusions and omissions. Substitutions were indicated if a word repeated back did not match the word presented. These words were predominantly lexical neighbors, that is, at least one phonological element of a word was correct but other parts were misunderstood (such as taking "Dosen" for "Rosen"). Confusions were indicated if overlapping words from the target and masker talker were mixed up, that is when a word uttered by the female voice was spuriously assigned to the male voice and vice versa. Due to the regular syntactic structure of all sentences, word positions were not confused. Omissions were indicated if a word presented was not repeated back.
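The three error categories can be illustrated with a small scoring sketch. It assumes, as the text states, that word positions are never confused, so each response word is compared only with the target word and the other talker's word at the same position. The function names and the example sentences are hypothetical; the actual scoring was done from audio recordings.

```python
from collections import Counter


def classify_word(response, target_word, other_talker_word):
    """Classify one response word at one sentence position."""
    if response is None or response.strip() == "":
        return "omission"
    said = response.strip().casefold()
    if said == target_word.casefold():
        return "correct"
    if said == other_talker_word.casefold():
        return "confusion"      # word of the other talker assigned to this voice
    return "substitution"       # a word that was not presented at this position


def score_sentence(responses, target, other_talker):
    """Count outcomes over the word positions of one sentence."""
    if not (len(responses) == len(target) == len(other_talker)):
        raise ValueError("all sequences must cover the same word positions")
    return Counter(
        classify_word(r, t, o) for r, t, o in zip(responses, target, other_talker)
    )


# Made-up five-word sentences; only the "Dosen"/"Rosen" pair is from the text.
target = ["Peter", "kauft", "drei", "rote", "Rosen"]
other = ["Nina", "sieht", "acht", "alte", "Autos"]
reported = ["Peter", None, "acht", "rote", "Dosen"]
print(score_sentence(reported, target, other))
```

In this made-up trial the listener gets two words correct, omits one, takes one word from the other talker (a confusion) and reports "Dosen" for "Rosen" (a substitution).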

Mixed design ANOVAs for the number of words repeated back and the number of different errors were conducted, with task (1TT, 2TT) and stimulus type (LC/LC, LC/HC) as within-subject variables, and listener group (NH, HI) as the between-subject variable. Moreover, for a post-hoc examination of the influence of WMC on speech recognition, a median split was performed based on the VLMT scores (above median: VLMT↑, below median: VLMT↓), and used as a further between-subject variable. Log-transforms were applied since not all data were normally distributed. All statistical analyses were performed using IBM SPSS Statistics 22.
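The median split and the log-transform can be sketched as below; the ANOVA itself was run in SPSS and is not reproduced here. Several details are assumptions, because the text does not give them: scores equal to the median are placed in the lower group, the split is computed over whatever set of scores is passed in, and the transform is taken as log10(x + 1) so that zero error counts stay defined.

```python
import math
from statistics import median


def median_split(scores):
    """Map each participant id to 'above' or 'below' the median score."""
    cut = median(scores.values())
    # Assumption: scores equal to the median go to the lower group.
    return {pid: ("above" if s > cut else "below") for pid, s in scores.items()}


def log_transform(counts):
    """log10(x + 1); the +1 keeps zero counts defined (an assumption)."""
    return [math.log10(c + 1) for c in counts]


# Made-up VLMT scores for four hypothetical participants.
vlmt = {"p1": 8.0, "p2": 9.5, "p3": 10.5, "p4": 12.0}
print(median_split(vlmt))
print(log_transform([0, 9, 99]))
```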

(NH) and hearing-impaired (HI) listeners for the different tasks and stimulus types. LC, low context; HC, high context; 1 TT, one target talker, 2 TT, two target talkers. Mean across one test list and standard deviation are given.

# RESULTS

**Figure 1** shows the overall number of words repeated back, irrespective of substitution or confusion errors. The outcome is given as the average across the three test lists (14 sentence pairs each) presented for each condition (i.e., a maximum of 70 words in the 1TT and 140 words in the 2TT task). In general, the 2TT task yielded more words repeated back than the 1TT task, and the NH listeners repeated back more words than the HI listeners. Subjecting the data to a mixed design ANOVA revealed significant main effects of task [F(1, 26) = 27.28, p < 0.001] and group [F(1, 26) = 8.24, p = 0.008]. Moreover, a significant task × group interaction [F(1, 26) = 4.96, p = 0.035] was observed and evaluated further. Post-hoc independent samples t-tests revealed that the difference between the 1TT and the 2TT task was significantly greater in the NH listeners than in the HI listeners [t(1, 54) = 2.58, p = 0.012]. No other main effects or interactions were significant.

**Figure 2** shows the number of target words repeated back correctly. In general, the average number of correct target words was higher in the NH than in the HI listeners and also appeared to be higher for the LC/HC condition than for the LC/LC condition. Subjecting the data to a mixed design ANOVA revealed significant main effects of stimulus type [F(1, 26) = 38.24, p < 0.001] and group [F(1, 26) = 9.15, p = 0.006]. Moreover, a significant stimulus type × task interaction [F(1, 26) = 10.29, p = 0.004] was observed. Post-hoc independent samples t-tests showed that the 1TT and the 2TT tasks yielded similar outcomes in the LC/LC condition, but that in the LC/HC condition more target words were repeated back correctly in the 2TT than in the 1TT task [t(1, 27) = 4.25, p < 0.001]. No other main effects or interactions were significant.

Three different error types, namely substitutions, confusions and omissions, were documented. Subjecting the number of errors to a mixed model ANOVA with error type as the within-subjects variable and group as the between-subjects variable revealed a significant main effect of error type [F(2, 220) = 202.75, p < 0.001] and a significant error type × group interaction [F(2, 220) = 3.46, p = 0.033]. Post-hoc t-tests showed that omissions occurred more frequently than substitutions and confusions [t(1, 111) = 18.61, p < 0.001; t(1, 111) = 15.7, p < 0.001], and that the two listener groups showed a significant difference for substitutions [t(1, 111) = 3.05, p = 0.03], a strong trend toward significance for omissions [t(1, 111) = 1.86, p = 0.065] but no significant difference in confusions (p > 0.05). No other main effects or interactions were significant.

**Figure 3** shows the number of substitutions (i.e., misunderstood words) for the different conditions and groups. In general, substitutions appeared to be more numerous for the HI than the NH listeners and for the 2TT compared to the 1TT task. Subjecting the data to a mixed design ANOVA revealed significant main effects of task [F(1, 26) = 49.24, p < 0.001], stimulus type [F(1, 26) = 5.42, p = 0.028], and group [F(1, 26) = 6.28, p = 0.02]. Moreover, a significant task × group interaction [F(1, 26) = 9.93, p = 0.004] was observed. Post-hoc independent samples t-tests revealed that the group difference in substitutions was significant only in the 1TT condition [t(1, 54) = 5.99, p < 0.001]. No other main effects or interactions were significant.

The number of confusions (i.e., mixing up the male and the female talker) for the different conditions is shown in **Figure 4**. Apparently, confusions were higher for the 2TT task than for the 1TT task and also higher for low context speech compared to high context speech. Subjecting the data to a mixed design ANOVA revealed significant main effects of task [F(1, 26) = 71.49, p < 0.001] and stimulus type [F(1, 26) = 264.64, p < 0.001]. Furthermore, a significant interaction of task × stimulus type [F(1, 26) = 5.07, p = 0.033] was found. Post-hoc paired comparison t-tests confirmed that the difference in confusions between the 1TT and the 2TT task was significantly larger for the LC/LC condition compared to the LC/HC condition [t(1, 27) = 2.15, p = 0.04]. No other main effects or interactions were significant.

**Figure 5** shows the number of omissions (i.e., not repeating words back) in the different conditions. Omissions occurred clearly more frequently than substitutions and confusions. Subjecting the data to a mixed design ANOVA revealed significant main effects of task [F(1, 26) = 871.57, p < 0.001] and group [F(1, 26) = 13.98, p = 0.001]. Furthermore, a significant task × group interaction was observed [F(1, 26) = 9.91, p = 0.004]. Post-hoc t-tests revealed that the difference between the 1TT and the 2TT task was significantly larger in the HI than the NH listeners [t(1, 54) = 3.7, p = 0.001]. No other main effects or interactions were significant.

Following the significant group difference for omissions and the model assumptions regarding the importance of WMC for speech recognition in adverse conditions, the participants were further characterized with respect to their VLMT scores in a post-hoc analysis. **Figure 6** shows the target word recognition of the NH and HI listeners, each subdivided into groups with above-median (VLMT↑) and below-median (VLMT↓) scores. Importantly, this characterization yielded a similar hearing loss for the VLMT↑ and the VLMT↓ participants in the HI listeners. Furthermore, below-median performers in the NH and HI group did not show significantly different VLMT scores and the same held for the above-median performers (see **Table 2**). It appeared that the VLMT split had little effect on target word recognition in the NH listeners, whereas it had a greater effect on the results of the HI listeners. Subjecting the data to a mixed design ANOVA revealed significant main effects of stimulus type [F(1, 24) = 44.63, p < 0.001], group [F(1, 24) = 10.84, p = 0.003], and VLMT score [F(1, 24) = 4.84, p = 0.038]. The significant main effects of stimulus type and group reflect the outcome already presented in **Figure 2**. Additionally, a significant main effect was observed for the VLMT-based separation, with participants with higher scores showing better target word recognition. Furthermore, as with the data presented in **Figure 2**, there was a significant stimulus type × task interaction [F(1, 24) = 9.68, p = 0.05]. Additionally, a significant stimulus type × group × VLMT interaction [F(1, 24) = 4.59, p = 0.042] was observed, suggesting greater differences in target word recognition between the two VLMT groups in the HI listeners than in the NH listeners, especially for low context speech. Post-hoc t-tests revealed that only the VLMT-group difference in the HI listeners with the LC/LC condition was significant [t(1, 26) = 2.70, p = 0.012]. No other main effects or interactions were significant.

# DISCUSSION

Several studies have addressed speech recognition with competing talkers and mainly focused on differences between younger and elderly listeners. These examinations demonstrated that older listeners performed worse than younger listeners even when group differences in hearing loss were taken into account (Humes et al., 2006) and that difficulties already occur in middle-aged persons (Helfer and Freyman, 2014). The present study examined the effects of hearing impairment and two attention tasks differing in cognitive load on speech recognition with competing talkers. The theoretical framework comprised the mechanisms relevant to auditory scene analysis and the interplay of speech and memory aspects as proposed by the ELU model. Two groups of older adults with and without hearing loss were closely matched with regard to age and their neuropsychological profile. We hypothesized that the hearing impaired participants would experience difficulties at different stages of speech processing, namely object formation, stream segregation, and word recall, and that the difficulties would be especially pronounced in the task with higher cognitive load.

Analysis of the overall number of words repeated back revealed that both task and group showed a significant effect. The former can simply be explained by the fact that more words can be repeated back in the 2TT task than in the 1TT task. However, it was noticeable that the hearing impaired listeners were on average not able to repeat back more than about 70 words per list in the 2TT task, which corresponds to the maximum number of words that can be repeated back in the 1TT task. Indeed, there was a significant task × group interaction, revealing that the task difference was significantly smaller in the HI than in the NH listeners; the HI listeners showed low performance especially in the more demanding 2TT task. It should be noted that the two groups were closely matched with respect to their neuropsychological profile, including a measure of WMC. Thus, the group difference in the overall number of words repeated back cannot simply be attributed to group differences in recall abilities.

Analysis of the number of correctly repeated back target words revealed significant main effects of stimulus type and group and a significant interaction of task × stimulus type. The beneficial effect of context with the LC/HC stimulus type was significant in the more demanding 2TT task. In general, the NH listeners were able to correctly repeat back a higher number of target words than the HI listeners, but both groups benefitted from context in a similar manner. We are not aware of studies focusing on competing talkers that address the use of context information in persons with and without hearing loss. However, our finding that the normal hearing and the hearing impaired listeners benefitted similarly from context is in line with findings of Benichov et al. (2012), who assessed final-word recognition in sentences masked with noise (i.e., the so-called closure paradigm). Different levels of context information were given with the sentences, with higher context facilitating better final-word recognition. Benichov et al. examined three different groups ("good hearing", slight-mild hearing loss with averaged pure-tone hearing loss (PTA) of 16–40 dB HL, and moderate hearing loss with 41–60 dB HL). They found that the group with moderate hearing loss benefitted most from context information, whereas the listeners with good hearing and slight-mild hearing loss revealed similar benefit. In our HI listeners, PTA ranged from 22 to 49 dB HL, thus predominantly representing slight-mild hearing loss.

### Error Types

Different errors limited the ability to repeat back target words correctly. In line with the theoretical framework we specified three different error types, namely substitutions, confusions, and omissions. Analysis revealed a significant difference in the occurrence of these errors and a significant error type × group interaction.

Substitutions were due to misunderstanding words and hearing loss resulted in a significantly larger number of substitutions. This is not surprising since the hearing loss considered here typically causes misperception of high-frequency speech sounds, possibly resulting in misunderstanding words. With regard to the theoretical considerations outlined in the introduction, it might be suggested that substitution errors are associated with failures in word object formation. The process underlying word object formation is argued to be a remapping of the speech signal from one encoding acoustic attributes to one representing its phonemic components (Steinschneider et al., 2014). Hearing loss results in impaired acoustic encoding, especially with respect to temporal fine structure, which is in turn relevant for speech understanding in adverse conditions (Anderson et al., 2013). Consequently, failures in object formation were greater in the HI listeners than in the NH listeners. As already discussed in Meister et al. (2013), there were also significant main effects of stimulus type and task. The latter reflects a higher number of substitutions in the 2TT compared to the 1TT task. This can be interpreted as an effect demonstrating that cognitive load might impair the accuracy of acoustic encoding and thus auditory acuity (Rönnberg et al., 2008). Recently, Mattys and Palmer (2015) have shown that the participants' discrimination of phonemes in a divided attention task (auditory plus visual stimulation) decreased, since they tended to select more similar sounding stimuli under higher cognitive load. This is in line with the present study, with substitutions predominantly stemming from similar sounding, yet different words (such as "Dosen" vs. "Rosen"). This increase in substitutions with higher cognitive load was less pronounced in the HI group (see significant task × group interaction), who revealed more substitutions in general. 
Presumably, the stronger impact of sensory impairment might have toned down the effect of cognitive load on substitutions, as observed in the NH listeners.

Confusion errors might be associated with failures in stream segregation. They depended on the task and the stimulus type, but not on the study group. More confusions were found with the 2TT than with the 1TT task and with low context speech than with high context speech. Both voice characteristics associated with the different "gender" of the talker as well as linguistic characteristics (LC/LC vs. LC/HC) seemed to be beneficial for auditory stream segregation, and these cues largely remained useful for the participants with age-related hearing loss. There is evidence that sensorineural hearing loss is associated with worsened processing of temporal fine structure cues that might deteriorate F0 discrimination (Moore and Glasberg, 2011) and might thus affect stream segregation and speech recognition with competing talkers of different gender (Lee and Humes, 2012). On the other hand, recent physiological data suggested that models exclusively based on temporal fine structure and/or envelope cues do not fully account for the discrimination thresholds assessed in behavioral tests (Kale et al., 2014). Thus, other factors such as central processing noise might also play an important role. Whatever the exact mechanisms in F0 discrimination are, the differences in acoustic cues between the two voices obviously provided robust talker information. Given the relatively low number of confusions this largely facilitated stream segregation in both NH and HI listeners. In our stimuli, additionally formant frequencies of the utterances differed by about 16%, though this cue might be less effective, as Mackersie et al. (2011) have shown that hearing-impaired subjects are restricted in the use of formant frequency changes. Furthermore, linguistic properties also seemed to provide useful information for stream segregation, especially helpful with the more demanding 2TT task (see significant task × stimulus type interaction). 
Our sentences had regular syntax but differed with respect to semantic properties (i.e., low context vs. high context). An examination of syntactic effects on speech recognition with competing talkers was recently described by Kidd et al. (2014). Similar to our methods, they used two competing talkers differing in voice cues (and/or location). These were defined as "low-level" cues promoting segregation of speech streams. Sentences had either regular syntax (i.e., name, verb, numeral, adjective, object) or were a random variation of the five-word structure. Syntax was considered to be a "high-level" cue relying on top-down processes and a priori language knowledge. Results obtained in young normal hearing listeners revealed that both low-level and high-level cues served to select a specific talker and to maintain the focus of attention. Syntax even showed a beneficial effect on target word recall when no low-level cues were available. As with our differences in semantic properties of the sentences, it was suggested that better predictability due to regular syntax and high context aids performance in competing talker conditions.

The most frequently observed error type was omissions, and there were significant main effects of task and group. Predictably, more omissions occurred with the more demanding 2TT task, which required repeating back words from two target talkers instead of only one. Furthermore, the hearing-impaired participants showed consistently more omissions than the normal-hearing listeners, although the groups were carefully matched with regard to their WMC. It should be noted that this observation also seems not to be due to the slight-moderate hearing loss per se, since speech recognition was near perfect when the sentences were presented at 70 dB SPL without a competing masker. Thus, it is unlikely that hearing loss rendered single words completely unintelligible. It might be speculated that the increased number of omissions in the HI group additionally reflects increased cognitive load due to the sensory impairment and the corresponding mechanisms proposed by the ELU model: The hearing loss of the participants might have disrupted the rapid and automatic matching of the entries in the episodic buffer and representations in LTM. As a consequence of the implicit process disruption, a compensatory mechanism taxing WMC might be invoked, labeled "explicit processing" by Rönnberg (2003) and Rönnberg et al. (2008, 2013). Together with the finding that the representation of words in short-term memory seems to be less stable with hearing impairment than with normal hearing (Pichora-Fuller et al., 1995), this might explain the larger proportion of omissions observed in the HI listeners compared to the NH participants. The significant task × group interaction suggests that the 2TT task was especially detrimental to the HI listeners. Thus, under increased cognitive load there might be extra difficulties for hearing impaired listeners due to the combined effects of hearing loss and the higher attentional and memory requirements.

### Post-hoc Group Splitting

Following the results discussed above, a post-hoc group splitting with respect to WMC was performed. Though this additional separation resulted in a relatively low number of observations, a further significant main effect of VLMT score as well as a further significant interaction of VLMT score, listener group, and stimulus type could be shown for target word recognition. The main effect of VLMT score revealed that participants with higher WMC repeated back more target words than those with lower WMC. It could be argued that this simply reflects similarities between the VLMT paradigm and the speech recognition tests, as both require repeating back words. However, the significant interaction of VLMT score, listener group, and stimulus type revealed that the effect of the group split regarding the VLMT score held only for the HI listeners in the LC/LC condition. This suggests that WMC was especially important for the HI listeners when they could not rely on context information. In the LC/LC condition the HI listeners with better WMC approached the results of the NH listeners. This might be interpreted in the sense of a compensatory effect of cognitive function on the detrimental impact of hearing loss. However, due to the small group size in the post-hoc analysis this finding should be treated with caution and requires a more comprehensive examination. Nevertheless, it seems in line with Wingfield and Stine-Morrow (2000) and Wingfield et al. (2015), who showed that contextual cues might dilute some of the effects of limited WMC.

Taken together, the results suggest an interplay of hearing loss and cognitive load regarding speech recognition with competing talkers. We anticipated that the hearing impaired listeners would experience difficulties at different stages of speech processing. This held true for word object formation and word recall, but not for stream segregation. We also hypothesized that difficulties for the HI listeners would be more pronounced in the more demanding 2TT task than in the 1TT attention task. The results suggest extra difficulties with higher cognitive load for the HI compared to the NH listeners, as they were especially limited in repeating back words in the 2TT task, though both groups showed near perfect speech recognition in quiet and similar outcomes in their neuropsychological profile. When the groups were further separated with respect to their WMC, it appeared that the HI listeners could use working memory as a compensatory mechanism against the detrimental effects of hearing loss, especially when no supporting context cues were available. However, due to the relatively small number of participants this has to be examined further. Apart from these differences between the groups, the results revealed that context could be used by the normal hearing listeners and the listeners with typical age-related hearing loss in a similar manner.

# AUTHOR CONTRIBUTIONS

HM designed the work, analyzed the data and wrote the manuscript; SS acquired the data and revised the manuscript critically; MO, SR, and MW revised the manuscript critically.

# ACKNOWLEDGMENTS

This research was supported by the Marga und Walter Boll Foundation (Reference 210-08-11) and in part by the Koeln Fortune Program (169/2012)/Faculty of Medicine, University of Cologne. The authors declare no conflicts of interest.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Meister, Schreitmüller, Ortmann, Rählmann and Walger. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Working Memory and Hearing Aid Processing: Literature Findings, Future Directions, and Clinical Applications

### *Pamela Souza<sup>1</sup>\*, Kathryn Arehart<sup>2</sup> and Tobias Neher<sup>3</sup>*

*<sup>1</sup> Communication Sciences and Disorders, Knowles Hearing Center, Northwestern University, Evanston, IL, USA, <sup>2</sup> Speech, Language and Hearing Sciences, University of Colorado Boulder, Boulder, CO, USA, <sup>3</sup> Medizinische Physik and Cluster of Excellence Hearing4all, Carl von Ossietzky University of Oldenburg, Oldenburg, Germany*

### *Edited by:*

*Jerker Rönnberg, Linköping University, Sweden*

### *Reviewed by:*

*Larry E. Humes, Indiana University, USA; Björn Lyxell, Linköping University, Sweden*

*\*Correspondence: Pamela Souza p-souza@northwestern.edu*

### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

*Received: 09 October 2015; Accepted: 23 November 2015; Published: 16 December 2015*

### *Citation:*

*Souza P, Arehart K and Neher T (2015) Working Memory and Hearing Aid Processing: Literature Findings, Future Directions, and Clinical Applications. Front. Psychol. 6:1894. doi: 10.3389/fpsyg.2015.01894*

Working memory—the ability to process and store information—has been identified as an important aspect of speech perception in difficult listening environments. Working memory can be envisioned as a limited-capacity system which is engaged when an input signal cannot be readily matched to a stored representation or template. This "mismatch" is expected to occur more frequently when the signal is degraded. Because working memory capacity varies among individuals, those with smaller capacity are expected to demonstrate poorer speech understanding when speech is degraded, such as in background noise. However, it is less clear whether (and how) working memory should influence practical decisions, such as hearing treatment. Here, we consider the relationship between working memory capacity and response to specific hearing aid processing strategies. Three types of signal processing are considered, each of which will alter the acoustic signal: fast-acting wide-dynamic range compression, which smooths the amplitude envelope of the input signal; digital noise reduction, which may inadvertently remove speech signal components as it suppresses noise; and frequency compression, which alters the relationship between spectral peaks. For fast-acting wide-dynamic range compression, a growing body of data suggests that individuals with smaller working memory capacity may be more susceptible to such signal alterations, and may receive greater amplification benefit with "low alteration" processing. While the evidence for a relationship between wide-dynamic range compression and working memory appears robust, the effects of working memory on perceptual response to other forms of hearing aid signal processing are less clear cut. We conclude our review with a discussion of the opportunities (and challenges) in translating information on individual working memory into clinical treatment, including clinically feasible measures of working memory.

Keywords: working memory capacity, reading span, hearing aid, wide-dynamic range compression, digital noise reduction, frequency compression

# THE ROLE OF WORKING MEMORY IN SPEECH PERCEPTION

Working memory—the ability to process and store information (Daneman and Carpenter, 1980; Miyake and Shah, 1999; Baddeley, 2000, 2012)—has been identified as an important aspect of speech perception in difficult listening environments. For instance, working memory is thought to play an active role in the maintenance of task-relevant information. Storage and processing of information are simultaneously carried out during a complex cognitive task. Those processes draw upon a common set of resources which can be allocated according to the various task demands. Because working memory can be envisioned as a limited-capacity system, there will be a trade-off: if more processing is required, less information can be stored, and vice versa. When working memory capacity is reached, both processes will be impaired.

A comprehensive description of the relationship between working memory and speech understanding is contained in the Ease of Language Understanding (ELU) model developed by Rönnberg et al. (2008, 2013). Briefly, the ELU model views language input as containing phonological, syntactic, prosodic, and semantic information. When the language input can be matched unambiguously to a phonological representation stored in long-term memory, lexical retrieval proceeds in an implicit (and relatively effortless) way. However, when the language input is not readily matched to a stored phonological representation (because the incoming information is degraded in some way), working memory is explicitly deployed to reconcile a match. To reconcile a match, the listener may need to utilize semantic information, make inferences, or inhibit irrelevant information to assign meaning to the input. We can think of working memory being engaged to a greater extent when the speech signal is ambiguous or distorted, and engaged to a lesser extent when the speech signal is audible and undistorted. Following from that model, it seems reasonable to expect stronger associations between working memory capacity and speech recognition when speech is acoustically degraded and weaker associations when speech is audible and clear.

A number of empirical studies have supported this view, showing working memory capacity to be more strongly related to speech in noise than to speech in quiet (see Akeroyd, 2008; Besser et al., 2013 for reviews). This relationship has led to calls for including measures of working memory in diagnostic protocols (Weinstein, 2015), or in treatment planning (Remensnyder, 2012). Individuals who present with a range of communication difficulties will likely benefit from an understanding of the cognitive (and sensory) factors that influence their communication abilities. However, it is less clear how working memory should be applied to practical decisions, including the selection and fitting of hearing aids. The current paper seeks to address this issue.

# MEASURING WORKING MEMORY

Working memory capacity is usually measured with complex span tests which require the participant to manipulate and recall information. For example, the participant may be asked to recall a list of digits or letters in reverse serial order, to solve problems, or to make a judgment about items prompted for recall. Most relevant to the current review are tests of verbal working memory, particularly the reading span test (Daneman and Carpenter, 1980; Baddeley et al., 1985). In a typical reading span paradigm, participants read a set of sentences and make a semantic judgment about each sentence (thereby engaging processing). After a block of sentences, participants are asked to recall as many test items as possible. The participant may be asked to recall the items in the same order as they were presented (serial recall) or allowed to recall the items in any order (free recall). The number of items recalled is used as a metric of working memory capacity. However, when interpreting working memory capacity, we must remember that working memory is, essentially, a composite ability. Reading span tests draw on a number of abilities including reading speed, phonological processing, speed of lexical processing, and executive functioning (Souza and Arehart, 2015; Souza et al., 2015). Those abilities may govern the reading span test's predictive power.
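The difference between serial and free recall scoring can be made concrete with a short sketch. It is only one plausible scoring rule, since administration and scoring methods differ across studies: the function name is hypothetical, serial recall is scored position by position, and free recall assumes that the to-be-recalled items within a block are unique.

```python
def recall_score(presented, recalled, serial=False):
    """Proportion of presented items that were recalled."""
    if not presented:
        raise ValueError("presented must not be empty")
    if serial:
        # Serial recall: an item counts only in its original position.
        hits = sum(1 for p, r in zip(presented, recalled) if p == r)
    else:
        # Free recall: order is ignored; items are assumed to be unique.
        hits = len(set(presented) & set(recalled))
    return hits / len(presented)


# Made-up recall items for one block of four sentences.
presented = ["dog", "tree", "cup", "road"]
recalled = ["tree", "dog", "cup"]
print(recall_score(presented, recalled))               # 3 of 4 items -> 0.75
print(recall_score(presented, recalled, serial=True))  # only "cup" in place -> 0.25
```

The same responses yield quite different scores under the two rules, which is one reason why reading span percentages are only comparable across studies that share an administration and scoring method.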

Many studies have documented that working memory capacity varies among individuals (see Akeroyd, 2008 for review). For the majority of studies summarized below, the reading span test was used to measure working memory capacity. Where available, participants' reading span scores are provided (**Tables 1–3**). For the most common administration and scoring methods, reading span scores for older adults (*>*60 years) are typically distributed with a mean of about 35–40% and a standard deviation of about 10%. Scores for younger adults (*<*30 years) are higher, but still show considerable variability among individuals (e.g., Füllgrabe et al., 2015; Souza and Arehart, 2015).


∗*Administered with participant-controlled timing (Conway et al., 2005).*


### TABLE 2 | Summary of studies which related working memory capacity (via reading span) to digital noise reduction.

### TABLE 3 | Summary of studies which related working memory capacity (via reading span) to frequency compression.


There are several reasons why signals transduced by hearing aids might interact with working memory. Although they are designed to improve audibility (and therefore speech perception), hearing aids, by their nature, alter the input signal. In contrast to linear hearing aids that merely provided overall gain and frequency shaping and as such minimally altered the input signal, modern digital hearing aids aim to enhance speech, suppress noise, eliminate acoustic feedback, and maintain comfortable loudness. To accomplish these goals, digital filtering and manipulation are applied, which can considerably alter the input signal (Kates, 2010). The following sections consider the acoustic effects of three common hearing aid processing strategies: fast-acting wide-dynamic range compression; digital noise reduction (NR); and frequency compression (FC). Each is related to empirical data and then also considered in the context of the ELU model.

### FAST-ACTING WIDE-DYNAMIC RANGE COMPRESSION

The purpose of wide-dynamic range compression (WDRC) is to improve audibility while maintaining loudness comfort. That goal is achieved by applying gain as a function of intensity, with lower gain applied to higher input levels. At a group level, WDRC has been shown to provide equivalent speech recognition to linear amplification at conversational levels, and improved audibility and loudness comfort for low- or high-intensity speech (Larson et al., 2000; Souza, 2002). To understand the relationship between WDRC and working memory, the next section describes some details of WDRC processing.

# Fast-Acting Wide-Dynamic Range Compression: Processing Principles

In a typical WDRC implementation, excessive input levels are managed by a front-end limiter. Next, the signal is filtered according to the number of compression channels (2–30 channels, depending on the hearing aid). The input intensity within each channel is monitored and gain is adjusted for inputs above a compression threshold (typically 40–50 dB SPL). In clinical fittings, compression ratios<sup>1</sup> typically vary between 1:1 and 3:1. To avoid loudness discomfort, signals exceeding a given ("compression limiting") threshold (typically 80–100 dB SPL) are subject to more extreme gain reduction (and higher compression ratios). The compressor gain function is determined by the total input signal, including the target speech and any background noise present in the environment.
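As a rough illustration of the static gain rule in a single compression channel, the sketch below maps an input level to an output level. All parameter values (20 dB of linear gain, a 45 dB SPL compression threshold, a 2:1 ratio, a 90 dB SPL limiting threshold with a 10:1 ratio) are assumptions chosen from within the ranges mentioned above, and applying the limiting threshold to the input level is a simplification made here; it is not a description of any particular hearing aid.

```python
def wdrc_output_level(input_db, gain_db=20.0, threshold_db=45.0, ratio=2.0,
                      limit_db=90.0, limit_ratio=10.0):
    """Static input/output rule for one compression channel (levels in dB SPL)."""
    # Below the compression threshold the channel behaves linearly.
    if input_db <= threshold_db:
        return input_db + gain_db
    # Between the two thresholds, each `ratio` dB of input adds 1 dB of output.
    if input_db <= limit_db:
        return threshold_db + gain_db + (input_db - threshold_db) / ratio
    # Above the limiting threshold a much higher ratio is applied (assumed 10:1).
    output_at_limit = threshold_db + gain_db + (limit_db - threshold_db) / ratio
    return output_at_limit + (input_db - limit_db) / limit_ratio


for level in (40.0, 65.0, 100.0):
    output = wdrc_output_level(level)
    print(level, output, output - level)  # input level, output level, applied gain
```

With these assumed settings the applied gain falls from 20 dB for a 40 dB SPL input to 10 dB for a 65 dB SPL input, which is the "lower gain applied to higher input levels" behavior that defines WDRC.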

Gain is adjusted dynamically as the input level changes. An important characteristic is the speed of the compressor, indicated by the attack and release times<sup>2</sup>, which together determine the compression speed. Although compression speed varies along a continuum, compression systems are often classified as fast- or slow-acting, where release times of less than 200 ms indicate a fast-acting WDRC system. In fast-acting WDRC systems, audibility may be high, but the amplitude envelope of the signal may be substantially altered relative to its natural amplitude pattern (Jenstad and Souza, 2005; Stone and Moore, 2007). Slow-acting compression systems adhere more closely to the natural amplitude envelope, but at the expense of improved consonant audibility. Slow-acting compression is sometimes marketed as providing improved sound quality, whereas fast-acting compression is marketed as more dynamic and able to respond more aggressively to changing inputs.

<sup>1</sup>The ratio of increase in input level to increase in output level.

<sup>2</sup>The attack time is the time for the compressor to activate and stabilize (i.e., reach maximum compression) as input level increases; the release time is the time for the compressor to deactivate and stabilize (i.e., return to linear gain) as input level decreases.
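The difference between fast- and slow-acting compression can be sketched with a simple level tracker that smooths the input level with separate attack and release constants. This is a minimal sketch under stated assumptions: the tracker smooths levels directly in dB, the constants are exponential time constants rather than attack and release times as measured on a real device, and the step-shaped input (50 to 80 to 50 dB SPL) is hypothetical.

```python
import math


def track_level(levels_db, attack_ms, release_ms, step_ms=1.0):
    """Smooth a sequence of input levels (dB) with separate attack/release constants."""
    attack_coeff = math.exp(-step_ms / attack_ms)
    release_coeff = math.exp(-step_ms / release_ms)
    estimate = levels_db[0]
    tracked = []
    for level in levels_db:
        # Rising input: follow quickly (attack). Falling input: recover at release speed.
        coeff = attack_coeff if level > estimate else release_coeff
        estimate = coeff * estimate + (1.0 - coeff) * level
        tracked.append(estimate)
    return tracked


# Hypothetical input sampled every 1 ms: soft sound, a loud burst, then soft again.
levels = [50.0] * 100 + [80.0] * 200 + [50.0] * 300
fast = track_level(levels, attack_ms=5.0, release_ms=50.0)
slow = track_level(levels, attack_ms=5.0, release_ms=1000.0)

# Level estimate 100 ms after the burst ends (index 400).
print(round(fast[400], 1), round(slow[400], 1))
```

Because the gain follows the tracked level, the fast tracker restores gain for the soft sound shortly after the burst ends, whereas the slow tracker is still near the burst level. The first case favors audibility of brief, low-level segments; the second leaves the natural envelope closer to intact.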

Decades of research have failed to reach consensus as to the "optimal" compression speed for improved speech recognition. Some studies showed better (group) performance with fast-acting WDRC, others with slow-acting WDRC (Souza, 2002). A feasible explanation for the conflicting evidence is that fast WDRC is the best option only for *some* listeners, and slow WDRC is the best option for others. In other words, there may be a trade-off between improved audibility and susceptibility to distortion of the amplitude patterns of the signal.

Recall that the ELU model proposes that working memory will be explicitly engaged (and working memory capacity will play a larger role) in cases where the phonological input cannot be immediately matched to its phonological representation in long-term memory. Because fast WDRC can result in greater alteration of the signal, it has been proposed that some WDRC parameters may increase the chance of match failure between the phonological input and the phonological representation in long-term memory. If that situation occurs, we expect to find a relationship between working memory capacity and understanding of speech amplified by fast-acting WDRC. The next sections review a series of studies that evaluated this relationship.

# Studies of WDRC Speed and Working Memory Capacity: Empirical Findings

In an influential study, Gatehouse et al. (2003, 2006a,b) explored how individual abilities modified the benefits of hearing aid signal processing. The authors were interested in a variety of predictors, including pure-tone thresholds, dynamic range, temporal and spectral resolution, cognitive abilities, and the variability of sound levels in the listener's daily listening environment. Data were obtained from experienced hearing-aid wearers who undertook a double-blind trial of amplification strategies which varied in compression speed. The cognitive tests consisted of letter- and digit-monitoring tests. Although not described as working memory tests, the cognitive tests required both processing and storage. Speech recognition was measured in a closed-set speech test. Cognitive ability was related to both reported and measured intelligibility such that listeners with higher cognitive scores also had higher intelligibility scores, but only for the hearing aid processing conditions which employed fast compression. Data were interpreted to suggest that individuals with greater capacity to store and process information would benefit to a greater extent from fast compression.

Lunner and Sundewall-Thorén (2007) replicated the Gatehouse work in a group of 23 experienced hearing-aid wearers but with an adaptive sentence-in-noise test, where background noises were speech-spectrum noise or two-talker modulated noise (Dreschler et al., 2001). As in the Gatehouse work, fast- and slow-acting WDRC were implemented in wearable hearing aids and participants used the aids for a period of acclimatization prior to testing. Consistent with Gatehouse et al. (2006b), low cognitive scores on a letter-monitoring test were associated with poorer performance with fast-acting WDRC. However, that relationship also depended on the type of noise. For example, for sentences in speech-spectrum noise amplified with slow compression, pure-tone average explained nearly 30% of variance in speech scores, with cognitive ability accounting for only 5%. For sentences in modulated noise amplified with fast compression, pure-tone average explained less than 5% of variance, and cognitive ability accounted for nearly 40%. These data patterns can be interpreted to suggest that as signal complexity increases (either through the presence of noise modulation, or application of fast WDRC), the role of cognition increases and the role of audibility decreases.

Foo et al. (2007) evaluated working memory and the effect of compression speed in experienced hearing-aid wearers. All participants completed the reading span test. Two different sentence recognition tests were completed: the Hagerman sentences (Hagerman and Kinnefors, 1995) in one-talker modulated noise and in unmodulated noise; and the HINT sentences (Nilsson et al., 1994) in two-talker ICRA noise (Dreschler et al., 2001) and in unmodulated noise. The HINT sentences have higher predictability than the Hagerman sentences. Each speech-in-noise test was performed with fast and slow WDRC. For the Hagerman sentences, there was an interaction between compression speed and reading span score, such that listeners with low working memory performed more poorly with fast WDRC. For the HINT sentences, there was no interaction between compression speed and reading span score. There was an interaction between compression speed and a second cognitive test (letter monitoring). Listeners who scored more poorly on the letter monitoring test performed more poorly with slow WDRC. However, the study authors also speculated that because letter monitoring is a serial task, it may capture different (and less relevant) aspects of cognition than the dual store-and-process tasks required in the reading span test and during speech recognition.

Ohlenforst et al. (2015) delved further into this relationship, focusing on the modulation characteristics of the background noise. Working memory capacity was assessed with the reading span test. Older participants were grouped by high or low working memory according to their reading span scores. Speech intelligibility was measured for low-context sentences presented in background noise, where the noise varied in the extent of modulation (1-, 2-, and 6-talker ICRA noise). Fast- or slow-acting WDRC was created in a laboratory simulation. As in Gatehouse et al. (2006b) and in Lunner and Sundewall-Thorén (2007), Ohlenforst et al. (2015) demonstrated a relationship between cognitive ability and compression speed. Listeners with high working memory demonstrated higher speech recognition scores when fast compression was applied than when slow compression was applied. In contrast, the low working memory group performed better with slow compression compared to fast compression. The magnitude of the score difference between compression speeds depended on the number of talkers in the background noise, with the largest differences for the highly modulated noises. However, noise modulation did not interact with working memory.

In a more clinical implementation, Souza and Sirow (2014) measured working memory capacity (via the reading span test) in older adults seen for hearing evaluations in an audiology clinic. Speech recognition was measured for sentences in noise (four-talker babble) using hearing aids with a range of compression speeds. All aids were adjusted to the same prescriptive target with an omnidirectional microphone, but had different numbers of compression channels and digital NR and feedback settings. Encouragingly, the relationships between working memory and compression speed followed those shown in more controlled, laboratory-based studies. The relative influence of working memory, amount of hearing loss, and age on speech recognition depended on the speed of the compression processor. For slow-acting compression, speech recognition was affected by age and amount of hearing loss but not related to working memory capacity. For fast-acting compression, working memory capacity accounted for 30% of the variance in speech understanding.

Although most studies which examined the relationships between speech understanding and working memory did so for speech in noise, working memory capacity may also help listeners with resolving a mismatch for specific phonemes in quiet. Davies-Venn and Souza (2014) processed vowel-consonant-vowel syllables with fast-acting WDRC. A range of compression ratios and release times were used to create stimulus sets with different degrees of acoustic alteration (and, presumably, a greater or lesser chance of a missed lexical match). The participants were adults with hearing losses ranging from mild to severe. Working memory capacity was measured using the reading span test with participant-controlled timing, which resulted in a similar variance but higher overall scores. The authors also considered signal audibility and spectral resolution, hypothesizing that listeners with poor spectral resolution would be most susceptible to the smoothed amplitude contours from the WDRC processing. Working memory, signal audibility and spectral resolution were all related to the effects of WDRC processing. The predictive value of working memory was strongest for the listeners with more hearing loss. That finding is consistent with the ELU model, as those listeners would be expected to experience the greatest "mismatch" due to their more severe hearing loss.

In summary, there is growing consensus that the response to specific compression parameters may be affected by individual working memory capacity. A number of studies (**Table 1**) have shown that listeners with smaller working memory capacity have more difficulty understanding speech processed by fast WDRC than by slow WDRC. However, that conclusion must be qualified, as it may apply only to populations, materials and hearing aid fittings that have been tested. In the next sections, we consider some variables that may modify the strength of the working memory-compression speed relationship.

### Does Previous Exposure to Fast-Acting WDRC Matter?

The existing data support an association between smaller working memory capacity and poorer response to fast WDRC. In some of those studies, relationships between working memory and WDRC speed were noted as the listener was presented with a previously unfamiliar type of WDRC processing (e.g., Foo et al., 2007; Souza and Sirow, 2014; Ohlenforst et al., 2015). We might reasonably ask: will the relationship persist after "getting used to" new processing? Rudner et al. (2009) have argued that susceptibility to signal alteration (as with listeners with low working memory presented with fast WDRC) would be greatest in cases where the device processing presents a "mismatch". Because slow WDRC more closely preserves natural speech amplitude patterns, we expect fast-acting WDRC to cause the greatest mismatch. After wearing the new processing for a period of time, the listener might "relearn" the new acoustic representations and store those representations in long-term memory, diminishing the mismatch problem and dissolving the working memory-compression speed relationship. Rudner et al. (2009) presented data in support of this idea, in that working memory and sentence-in-noise understanding were more likely to be related when speech was amplified with a compression speed that was unfamiliar to the listener; and less likely to be related when speech was processed with a WDRC speed familiar to the listener. However, a requirement that the processing be unfamiliar to generate a mismatch (and hence a working memory relationship) cannot be universally true. At least, the demonstration of a working memory-compression speed relationship was maintained even after multi-week experience with the specific processing under study (Gatehouse et al., 2003, 2006a,b; Lunner and Sundewall-Thorén, 2007; Rudner et al., 2011). It is possible that longer exposure (months or years) would result in different relationships. 
Although long-term acclimatization may or may not alter the role of working memory, most authors have assumed that experience counts, and that the most empirically valid conclusions can be drawn after acclimatization to processing.

### Does the Speech Material Matter?

If we consider the strength of the working memory capacity-by-compression speed relationship in the context of the ELU model, we expect a stronger relationship when there is a greater chance of match failure. This has already been shown by the fact that working memory capacity tends to predict speech in noise performance when signal components are likely to be masked or ambiguous, but not speech understanding in quiet where signal components are audible and clear. In that theme, it may be of value to consider the relationship between working memory and compression speed relative to the acoustic properties of the speech undergoing compression. Although multichannel compression can also introduce spectral changes, a dominant feature is smoothing of the speech envelope. Presumably, for speech materials where envelope cues are relatively more important, the consequences of compression speed will be larger. Several authors have noted that envelope cues are of relatively greater importance for sentence perception compared to short-duration (syllable or bisyllabic word) perception (e.g., Van Tasell and Trine, 1996; Fogerty and Humes, 2012). Accordingly, the consequences of working memory may be more strongly demonstrated for compressed sentence- or narrative-length speech materials.

That idea could also be carried forward to the linguistic content of the speech. Highly predictable speech could be regarded as an "easy match," with fewer ambiguities and therefore a lesser role of working memory. Less predictable speech might require a heavier processing load, more storage, and more rapid evaluation for meaning. In support of this idea, Cox and Xu (2010) assessed cognitive abilities, speech recognition and user preferences for 24 experienced hearing-aid wearers. Speech recognition tests included high-predictability sentences in four-talker babble, and a closed-set monosyllabic word test in modulated and unmodulated noise (similar to the closed-set test employed by Gatehouse et al. (2006a,b)). As had been the case in previous studies, speech recognition was compared for fast-acting WDRC and slow-acting WDRC. Cox and Xu's (2010) paradigm differed somewhat from previous work, in that they deliberately selected listeners with very low or very high cognitive scores for comparison (*n* = 8 participants per group). Cognitive ability was quantified with three test scores (including the visual letter monitoring test employed by Gatehouse et al. (2006b), but not including a reading span test) collapsed into a composite score. Two interesting findings emerged. First, Cox and Xu's (2010) sentence results did not reproduce the working memory-by-compression speed effect reviewed above. In fact, when only the visual letter monitoring task was used as a predictor (as by Lunner and Sundewall-Thorén, 2007), listeners with low cognitive scores performed worse (not better) with slow WDRC compared to fast WDRC. Cox and Xu (2010) concurred with previous authors that working memory contributed to the effect of WDRC processing.
Unlike previous authors, they highlighted a potential effect of the speech materials: that for high-predictability materials, listeners with smaller working memory capacity might require fast WDRC (and its accompanying audibility improvements) for best performance. That suggestion returns to a point raised earlier in this review: that the choice of compression speed may create an audibility-by-distortion balance. In that theme, the net benefit depends on the listener (e.g., severity of hearing loss, susceptibility to signal distortion) and, perhaps, on the environment and/or the speech material. Given the small size of the comparison groups, replication of the Cox and Xu work with a larger sample would be valuable in untangling this issue.

### Summary

Among the studies described above, some general conclusions can be drawn. The effect of compression release time seems to be less important for listeners with larger working memory capacity. Those individuals perform better overall than listeners of similar age and audiometric status but with smaller working memory capacity. They also show minimal effects of varying compression release time. When release time makes a difference to individuals with larger working memory capacity, they perform better with fast compression. Similar to previous authors, we interpret these data to suggest that participants with larger working memory capacity have better abilities to store and process information simultaneously, which allows them to cope with distortion introduced by the fast compressor. One positive effect of fast compression is the potential to amplify brief speech segments in the target speech signal. Individuals with larger working memory capacity seemed to have the ability to better utilize the amplified information and, perhaps, to distinguish between helpful information and phonetic artifacts created by the compressor.

The effect of compression release time seems to be most consequential for listeners with smaller working memory capacity. Across most studies, these individuals show greater benefit from the less-distorting slow compressor. Presumably, when confronted with an acoustically altered signal, those individuals are less able to deploy cognitive resources to achieve a lexical match, preventing them from obtaining full benefit from the greater signal audibility. That pattern may also depend on the speech materials, particularly predictable vs. ambiguous syntax. The acoustic environment may also play an important role. Several studies have shown that the working memory by compressor speed interaction is largest when modulated background noise is present. Finally, it is possible that these effects will be moderated by prior long-term use of fast compression.

# DIGITAL NOISE REDUCTION

Where the input to the hearing aid is a mixed speech and noise signal, digital NR aims to identify and suppress noise components while preserving the speech components. When the background noise is other speech, digital NR is unlikely to result in improved speech perception. However, it may have other benefits, including greater sound comfort (Bentler, 2005; Bentler et al., 2008; McCreery et al., 2012). To understand the relationship between NR and working memory, the next section describes some details of digital NR.

# Digital Noise Reduction: Processing Principles

The main purpose of digital NR is to reduce the adverse effects of background noise on speech. This is achieved by means of an algorithm that estimates the presence (or absence) of speech in a noisy input signal. Once a signal segment has been classified as being noise- or speech-dominated, amplification can be applied to that segment in order to attenuate the noise and/or enhance the speech (e.g., Kates, 2008).

Various approaches have been developed for detecting speech in a noisy input signal. As a consequence, NR systems can differ widely in terms of their design principles and hence their efficacy under different acoustical conditions. In general, NR systems have several common features: they estimate the presence of speech based on one or more signal features; they perform the processing in a number of frequency bands; and they involve a trade-off between the amount of noise suppression achieved and the amount of artifacts introduced concurrently.

In the following, we will briefly describe three types of NR processing that have recently been tested in studies concerned with the influence of working memory capacity on NR outcome: (1) modulation-based NR processing, (2) binary mask-based NR processing, and (3) binaural coherence-based NR processing.

### Modulation-Based Noise Reduction Processing

A characteristic feature of human speech is that – in contrast to many noise signals – it contains strong amplitude modulations, especially in the 3–4 Hz range (e.g., Drullman et al., 1994). Therefore, one approach to the design of a NR system is to use modulation depth as a criterion for the detection of speech. Signal segments containing strong modulations are classified as speech and are preserved, whereas signal segments with little modulation are classified as noise and are attenuated (e.g., Holube et al., 1999). The overall effect of the processing varies with the time scale over which the estimation and attenuation occur, and also with the strength of the attenuation. Because speech and noise signals vary over time, performing the processing on shorter time segments allows the algorithm to better track these variations. In principle, the classifications will reflect the actual short-time properties of the input signal. Nevertheless, misclassifications may also occur, especially for shorter time scales (where the estimates will be based on fewer observations). Increasing the strength of attenuation can lead to better noise suppression for signal segments that are accurately classified as being noise-dominated. For misclassifications, however, this will result in greater attenuation and thus distortion of the wanted signal. Thus, in the parameterization of a NR algorithm a trade-off exists between noise suppression and speech distortion.
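The classification step can be sketched as follows. The modulation depth formula, the 0.5 decision threshold, the 10 dB attenuation, and the sinusoidal test envelopes are all assumptions chosen for illustration; commercial algorithms estimate modulation per frequency band and use graded rather than all-or-nothing gain rules.

```python
import math


def modulation_depth(envelope):
    """Modulation depth (max - min) / (max + min) of a non-negative envelope segment."""
    hi, lo = max(envelope), min(envelope)
    return (hi - lo) / (hi + lo) if (hi + lo) > 0 else 0.0


def nr_gain_db(envelope, depth_threshold=0.5, attenuation_db=10.0):
    """Return 0 dB for segments classified as speech, -attenuation_db for noise."""
    is_speech = modulation_depth(envelope) >= depth_threshold
    return 0.0 if is_speech else -attenuation_db


def sinusoidal_envelope(depth, rate_hz=4.0, duration_s=1.0, fs_hz=100.0):
    """Hypothetical test envelope: 1 + depth * sin(2*pi*rate*t)."""
    n = int(duration_s * fs_hz)
    return [1.0 + depth * math.sin(2.0 * math.pi * rate_hz * k / fs_hz) for k in range(n)]


speech_like = sinusoidal_envelope(depth=0.9)  # strongly modulated at 4 Hz
noise_like = sinusoidal_envelope(depth=0.1)   # nearly steady
print(nr_gain_db(speech_like), nr_gain_db(noise_like))
```

Raising `attenuation_db` suppresses correctly classified noise more strongly, but it also removes more of the wanted signal whenever a speech segment falls below `depth_threshold`, which is the trade-off described above.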

### Binary Mask-Based Noise Reduction Processing

An alternative (and more recent) approach to noise suppression is the use of so-called binary masks (e.g., Wang, 2008; Wang et al., 2009). Essentially, a binary mask is a matrix of zeros and ones that index the presence or absence of speech information in a noisy signal mixture as a function of time and frequency. Each zero or one corresponds to a given time-frequency unit. A one denotes a speech-dominated unit and a zero denotes a noise-dominated unit. Whether a given unit is assigned a zero or a one depends on the signal-to-noise ratio (SNR) of that unit (the 'local SNR'). If the SNR exceeds a certain threshold (e.g., 0 dB) the unit is assigned a one; otherwise it is assigned a zero. The resultant pattern of zeros and ones is then used as a time- and frequency-dependent gain function that is applied to the original signal mixture, attenuating the noisy time-frequency units.

A notable problem with the binary mask-based approach is how to estimate the local SNRs accurately. In earlier studies, *ideal* binary masks were used to investigate the perceptual consequences of this type of processing (e.g., Anzalone et al., 2006; Wang et al., 2009). Ideal binary masks have *a priori* knowledge of the local SNRs (i.e., they do not need to estimate them). In a wearable hearing aid with no opportunity for prior knowledge of the signal, the mask must make do with nonideal speech and noise detectors. More recently, some researchers have included a more realistic form of binary mask-based NR processing in their studies (Ng et al., 2013, 2015). With that type of processing, the local SNRs are estimated based on the output signals of two directional microphones, one facing forward in the direction of the target speech (thereby providing a relatively 'clean' speech signal) and the other one facing backward in the direction of the interfering signals (thereby providing a relatively 'clean' noise signal; cf., Boldt et al., 2008).
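A minimal sketch of the mask computation is given below. It corresponds to the ideal case, because the speech and noise powers in each time-frequency unit are passed in separately rather than estimated from the mixture. The function names, the small constant that guards against division by zero, the 20 dB attenuation, and the example power values are assumptions made for illustration.

```python
import math


def ideal_binary_mask(speech_power, noise_power, threshold_db=0.0):
    """Return a mask (rows = time frames, columns = frequency bands) of ones and zeros."""
    eps = 1e-12  # guards against division by zero in silent units
    mask = []
    for speech_row, noise_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(speech_row, noise_row):
            local_snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            # One marks a speech-dominated unit, zero a noise-dominated unit.
            row.append(1 if local_snr_db > threshold_db else 0)
        mask.append(row)
    return mask


def mask_gains_db(mask, attenuation_db=20.0):
    """Turn the mask into per-unit gains: 0 dB where retained, -attenuation_db elsewhere."""
    return [[0.0 if unit == 1 else -attenuation_db for unit in row] for row in mask]


# Hypothetical powers for two time frames and two frequency bands.
speech = [[4.0, 1.0], [0.5, 9.0]]
noise = [[1.0, 2.0], [0.5, 1.0]]
mask = ideal_binary_mask(speech, noise)
print(mask)
print(mask_gains_db(mask))
```

In a non-ideal implementation, `speech_power` and `noise_power` would be replaced by estimates, so some units would receive the wrong label and speech-dominated units would occasionally be attenuated.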

Binary mask-based NR processing is subject to the constraints concerning time scales and attenuation strengths outlined above. In addition, an SNR threshold for distinguishing between speech- and noise-dominated units has to be chosen. Binary mask-based NR processing can therefore also produce distortions that offset the benefit from the noise suppression, especially for realistic binary mask-based applications where speech and noise signals have to be estimated.

### Binaural Coherence-Based Noise Reduction Processing

A third approach to the estimation of useful and detrimental acoustic information relies on the across-ear comparison of noisy input signals. This type of algorithm exploits the interaural similarity or binaural coherence as a decision metric for distinguishing between target signals and interferers (e.g., Grimm et al., 2009). As such, it requires the exchange of information across hearing instruments (e.g., using a wireless link). An implicit assumption made in the design of this algorithm is that incoherent signal components constitute detrimental information for the user (because they typically are due to strong reflections or diffuse background noise) and can be attenuated. First, the binaural coherence of the ear input signals is estimated as a function of time and frequency. The estimates produced in this manner can take on values between 0 and 1. A value of 0 corresponds to fully incoherent (or diffuse) sound, while a value of 1 corresponds to fully coherent (or directional) sound. Because of diffraction effects around the head, the coherence is always high at low frequencies. At frequencies above about 1 kHz, the coherence is low for diffuse and reverberant signal components, but high for the direct sound from nearby directional sources (e.g., talkers). Due to the spectro-temporal fluctuations contained in speech, the ratio between incoherent and coherent signal components may vary across time and frequency. By applying appropriate time- and frequency-dependent gains to the noisy input signals, this ratio can be improved. Once again, greater noise suppression comes at the expense of greater distortion of presumably useful signal components such as speech signals from nearby talkers (cf., Neher, 2014).
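The coherence estimate and the resulting gain can be sketched for a single frequency band as follows. The inputs are assumed to be complex short-time spectral coefficients from the left and right ears over a run of frames; the gain rule (coherence raised to an exponent) and the synthetic unit-magnitude test signals are assumptions for illustration and are not taken from the cited implementations.

```python
import cmath
import math
import random


def coherence(left, right):
    """Magnitude of the normalized cross-spectrum over a run of frames (0 to 1)."""
    cross = sum(l * r.conjugate() for l, r in zip(left, right))
    power_left = sum(abs(l) ** 2 for l in left)
    power_right = sum(abs(r) ** 2 for r in right)
    if power_left == 0.0 or power_right == 0.0:
        return 0.0
    return abs(cross) / math.sqrt(power_left * power_right)


def coherence_gain(coh, exponent=1.0):
    """Assumed gain rule: a larger exponent attenuates incoherent sound more strongly."""
    return coh ** exponent


rng = random.Random(0)  # fixed seed so the example is repeatable


def random_phasors(n):
    """Unit-magnitude coefficients with random phase (hypothetical test signal)."""
    return [cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi)) for _ in range(n)]


# Directional source: the right ear receives the left-ear signal with a fixed phase shift.
directional_left = random_phasors(200)
directional_right = [x * cmath.exp(0.3j) for x in directional_left]
# Diffuse sound: the two ear signals are unrelated.
diffuse_left = random_phasors(200)
diffuse_right = random_phasors(200)

print(coherence(directional_left, directional_right))
print(coherence(diffuse_left, diffuse_right))
```

The directional case gives a coherence close to 1 and is passed nearly unchanged, whereas the diffuse case gives a low value and is attenuated. Increasing `exponent` strengthens the suppression of diffuse sound at the cost of attenuating any partly incoherent speech components as well.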

# Digital Noise Reduction and Working Memory Capacity: Empirical Findings

A number of recent studies, summarized below, have also investigated the relationship between working memory capacity (as indexed by the reading span test) and NR outcome.

### Modulation-Based Noise Reduction Processing

Desjardins and Doherty (2014) conducted a study to investigate listening effort with a modulation-based NR algorithm implemented in wearable (commercial) behind-the-ear hearing aids. Twelve mostly elderly hearing aid users participated. Amplification was prescribed in accordance with the DSL fitting rule (Scollie et al., 2005). Outcome was assessed using a dual-task paradigm combining speech understanding with a visual tracking task. A correlation analysis was conducted to explore the influence of working memory as well as performance on a measure of "processing speed" on visual tracking performance (i.e., the authors' measure of listening effort). No correlations were observed. However, there was a trend for participants with faster processing speed to perform better on the visual tracking task when NR was engaged.

### Binary Mask-Based Noise Reduction Processing

With respect to binary mask-based NR processing, Ng et al. (2013) conducted a study where they tested both ideal and non-ideal versions of this algorithm. Stimulus presentation was via insert earphones and included proprietary linear amplification. Participants were 26 mostly middle-aged hearing aid users. Outcome was assessed using a paradigm that required participants to identify the final words of a set of sentence-in-noise stimuli and then recall them afterwards. Data analyses revealed a main effect of working memory capacity on recall, with better memory being related to larger working memory capacity. Furthermore, an interaction between working memory capacity and non-ideal NR processing was observed. That is, participants with larger working memory capacity (measured using a reading span test) recalled more words from a speech recognition task than participants with smaller working memory capacity as a result of NR processing.

In a follow-up experiment based on essentially the same setup, Ng et al. (2015) tested the non-ideal algorithm further. A group of mostly older hearing aid users participated. Again, outcome was assessed in terms of sentence-final word identification and recall. Data analyses confirmed the previously observed effect of reading span on recall. Also, a three-way interaction between working memory capacity, NR processing, and serial word position was observed. That is, participants with smaller working memory capacity achieved better memory performance due to NR processing for the final word position only, whereas participants with larger working memory capacity achieved better memory performance irrespective of sentence word position.

Arehart et al. (2015) tested ideal binary mask-based NR processing as well as several non-ideal versions obtained through systematic manipulation of two algorithmic parameters (i.e., error rate and attenuation strength). Participants were mostly elderly hearing-impaired listeners, including 14 hearing aid users. Stimulus presentation was via headphones with linear amplification prescribed according to the NAL-R (Byrne and Dillon, 1986) fitting rule. Both speech understanding and speech quality were assessed. Data analysis revealed that working memory capacity was a significant predictor of overall intelligibility, but did not interact with the level of signal distortion in explaining performance.

### Binaural Coherence-Based Noise Reduction Processing

Concerning binaural coherence-based NR processing, Neher et al. (2014b) carried out a headphone experiment with a hearing aid simulator that, in addition to NR processing, provided NAL-R amplification. Participants were elderly hearing aid users. A dual-task paradigm combining speech understanding with visual response time was used to assess performance. Pairwise preference comparisons were also collected. Regarding speech understanding, data analyses revealed a main effect of working memory capacity, but no interaction with NR processing. Regarding visual response times, no influence of working memory capacity was found. Regarding overall preference, participants with smaller working memory capacity preferred stronger NR processing than participants with larger working memory capacity.

Using a similar setup and almost the same group of participants, Neher et al. (2014a) tested a number of additional binaural coherence-based NR conditions. Outcome measures included the dual-task paradigm used previously as well as subjective ratings of listening effort and overall preference. This time, working memory capacity was unrelated to speech understanding and did not interact with NR processing either.

Again using a similar setup but this time a group of completely different elderly hearing aid users, Neher (2014) assessed speech understanding and also collected pairwise preference comparisons at a number of fixed SNRs. Participants' performance on a visual measure of "executive control" (designed to tap into cognitive functions such as working memory, mental flexibility, and selective attention) was also considered. Regarding speech understanding, larger working memory capacity was once again associated with better performance. Furthermore, working memory capacity interacted with NR processing at 0 (but not −4) dB SNR. That is, while participants with larger working memory capacity showed a (statistically significant) performance decrement of a few percentage points due to (strong but not moderate) NR processing, participants with smaller working memory capacity did not. Regarding overall preference, no effects of working memory capacity were found. However, NR processing interacted with executive control at 0 and 4 (but not −4) dB SNR, i.e., participants with poorer executive control preferred stronger NR than participants with better executive control.

### Summary

Out of the seven studies on DNR and working memory summarized above (**Table 2**), five observed a general influence of working memory capacity on speech-in-noise performance (assessed in terms of speech intelligibility or memory performance), thereby confirming the positive relationship between working memory capacity and basic speech understanding abilities reported previously (e.g., Akeroyd, 2008). In contrast, only three studies observed an interaction between working memory and NR processing, two of which assessed memory performance and the other one speech intelligibility. Furthermore, across these three studies working memory capacity was inconsistently related to NR outcome. That is, while the two studies on memory performance found larger working memory capacity to be associated with (larger) benefit from (binary mask-based) NR processing, the study on speech intelligibility found larger working memory capacity to be associated with *dis*benefit from strong (but not moderate, binaural coherence-based) NR processing. Although a fourth study indicated a relation between smaller working memory capacity and preference for (stronger binaural coherence-based) NR processing, subsequent studies failed to replicate this. However, one study found a corresponding relation between performance on a measure of executive control and preference for (binaural coherence-based) NR processing.

Because of these divergent findings, it is not straightforward to reconcile them with the ELU model. As pointed out above, the ELU model postulates a larger influence of working memory capacity when the phonological input cannot be immediately matched to its phonological representation in long-term memory. If one assumes that stronger NR processing results in greater alteration of the input signal, one would expect a relationship between larger working memory capacity and better understanding (or recall) of noise-reduced speech, but this was generally not the case. A possible reason for this could be that stronger NR processing has two concurrent effects: improving audibility of the speech signal (by suppressing more noise) and introducing more distortion than less aggressive processing. Perhaps the net effect of these competing factors contributes to the weak relationship between reading span and NR. It could also be that in some studies the effects of NR processing were kept constant across participants, while in others they were not (e.g., if the effects of NR co-varied with the prescribed amplification, as may be the case in commercial devices).

In summary, although working memory capacity is generally associated with speech perception, it seems to barely interact with NR outcome. In this context, however, it should be noted that the experimental conditions (e.g., the types of algorithm or outcome measures used) differed rather widely across studies (probably much more so than across the studies on WDRC), making a direct comparison difficult.

# FREQUENCY COMPRESSION

The goal of FC is to increase the audibility of higher-frequency phonemes (where a patient typically has significant hearing loss) by shifting them to lower-frequency regions (where the patient has better thresholds). The following section describes some details of this processing.

### Frequency Compression: Processing Principles

Several different implementations of FC have been used in simulated and commercial hearing aids (see Alexander, 2013 for a review). In one approach, the input speech signal is represented as a sum of sinewaves with characteristic frequencies, amplitudes and phases. When the speech signal is compressed, the modeled sine waves in the higher-frequency portions of the signal are reproduced at lower frequencies. The shifting of the higher-frequency energy to lower-frequency regions alters the fidelity of the incoming speech signal. Frequency compression may modify the signal envelope by changing the modulation structure within auditory bands, and will also reduce frequency spacing in the regions of compression (McDermott, 2011). FC is characterized by a cutoff frequency (CF) and by a compression ratio (CR), with lower cutoff frequencies and higher compression ratios representing more aggressive processing and greater amounts of signal distortion.
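
The roles of the CF and CR can be illustrated with a minimal sketch of how sinusoidal components might be remapped. It assumes a log-domain compression rule above the cutoff, which is only one possible formulation and is not necessarily the mapping used in any of the studies reviewed here; the function name, parameter names, and component representation are hypothetical.

```python
def compress_frequency(f_hz, cutoff_hz, ratio):
    """Return the output frequency for one sinusoidal component.

    Assumed rule (illustrative only): components at or below the cutoff
    are left unchanged; above it, the distance from the cutoff is divided
    by the compression ratio on a logarithmic frequency scale.
    """
    if ratio < 1.0:
        raise ValueError("compression ratio must be >= 1")
    if f_hz <= cutoff_hz:
        return float(f_hz)
    return cutoff_hz * (f_hz / cutoff_hz) ** (1.0 / ratio)


def compress_components(components, cutoff_hz, ratio):
    """Remap (frequency, amplitude, phase) triples, keeping amplitude and phase."""
    return [
        (compress_frequency(f, cutoff_hz, ratio), amp, phase)
        for f, amp, phase in components
    ]


# Hypothetical components: one below and two above a 2000 Hz cutoff.
components = [(1000.0, 1.0, 0.0), (4000.0, 0.5, 0.0), (8000.0, 0.25, 0.0)]
remapped = compress_components(components, cutoff_hz=2000.0, ratio=2.0)
```

Under this assumed rule, lowering the cutoff or raising the ratio moves more components and packs them more closely together, which corresponds to the more aggressive processing described above.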

### Frequency Compression and Working Memory Capacity: Empirical Findings

Using a hearing-aid simulation of FC based on sinusoidal modeling, Arehart et al. (2013) considered the relationship between working memory and the combined effects of distortion caused by noise and FC in a group of older listeners with hearing loss. Results showed that age, hearing loss and working memory were all significant factors associated with degraded ability of listeners to process noisy speech processed with FC. Listeners with greater hearing loss, poorer working memory and more advanced age had the lowest intelligibility of frequency-compressed noisy speech. A follow-up study (Souza et al., 2015) found similar effects when FC was combined with WDRC. Similarly, in a neural network model of listener response to FC, Kates et al. (2013) showed that working memory was an important factor in perceptual response to FC for listeners with greater degrees of hearing loss but not for listeners with milder hearing losses.

Ellis and Munro (2015) studied the relationship between FC and working memory capacity in a small group of older adults with moderate-to-severe high-frequency hearing loss. Because participants were part of a clinical trial with wearable (commercial) hearing aids, they had time to acclimatize to the FC processing. Listeners had customized FC parameters based on their hearing loss. Greater high-frequency hearing loss was positively correlated with FC benefit, but cognitive measures were not.

### Summary

As with NR processing, the evidence on the relationship between working memory and response to FC processing is mixed (**Table 3**), which may be due to a number of different factors. For example, the specific implementation of FC differed between the studies of Arehart and colleagues and the work conducted by Ellis and Munro. The experimental approach also differed between the two research groups. The Arehart group used a simulated hearing aid so that the effects of noise and FC were controlled and all listeners with hearing loss received the same amounts of FC processing. This made it possible to consider the relationships among working memory capacity, hearing loss, and the cumulative effects of signal degradation caused by both noise and signal processing, but it did not involve wearable hearing aids in clinical fittings. In contrast, Ellis and Munro (2015) used commercially available hearing aids and customized the amount of FC based on the individual listener's audiogram. While this approach has strong clinical face validity, their listeners received different amounts of actual signal processing. Such differences may have contributed to differences across studies, and may also speak to the importance of individual customization in achieving best outcomes.

# CONCLUSION, FUTURE DIRECTIONS, AND CLINICAL APPLICATIONS

A growing body of work suggests that individuals with smaller working memory capacity may be more susceptible to an altered acoustic signal, such as might be produced by various types of hearing aid processing. The evidence is strongest for fast-acting WDRC, where nine studies have shown similar patterns. In each case, listeners with smaller working memory capacity (as measured with a reading span task) performed better with slow-acting than with fast-acting WDRC. One study (Cox and Xu, 2010) showed a relationship between working memory and compression speed, but in the opposite direction. Concerning FC, evidence for a relationship with working memory is mixed. Two studies by the same group using hearing aid simulations showed a relationship; a different study using wearable aids with customized hearing-aid parameters did not. Concerning NR processing, evidence for a relationship with working memory is weakest, with those few studies that observed a relationship producing incongruent outcomes. In this context, it should be noted that the signal processing conditions and outcome measures were rather dissimilar across the studies on NR effects.

To resolve the apparent discrepancy concerning the relationship between working memory and different types of hearing aid processing it would be useful to conduct research to assess the response to a number of hearing aid processing conditions (e.g., WDRC vs. NR) within the same group of individuals using the same outcome measures (e.g., speech understanding or memory performance). In this manner, it would be possible to find out whether the influence of working memory capacity on WDRC outcome generally translates to the domain of NR processing or not. Along those lines, it would also be useful to compare different types of NR processing (e.g., binary mask- vs. binaural coherence-based NR) within the same group of individuals. In this manner, it would be possible to assess the influence of specific NR design choices on the effects of working memory capacity. A more complete understanding of the role of working memory on speech understanding in listeners wearing hearing aids may also require consideration of how the signal alterations caused by a single type of signal processing interact with other concurrently implemented processing algorithms. In addition, in the context of the ELU model it may be important to consider how the cumulative effects of degradation caused by signal processing interact with other forms of signal degradation including the degree of hearing loss and the amount and type of noise in the environment. Irrespective of the actual research question, it would be important to characterize the effects of the signal processing conditions under consideration objectively (e.g., in terms of SNR improvement or amount of speech distortion). In this manner, it would be possible to rule out (or identify) factors (or confounds) that co-vary with working memory.

Given the aforementioned relationship between acclimatization and the strength of the association between WDRC and working memory, longitudinal investigations would be beneficial to gain a better understanding of any long-term effects. For instance, it is possible that individuals with smaller working memory capacity who are initially disadvantaged by fast-acting WDRC would in the long run benefit from the greater audibility that it provides relative to slow-acting WDRC. It would also be important to extend this line of research to the domain of FC, and perhaps also to digital NR (although in this case continuous exposure would not be possible).

The role that working memory measurements may play in clinical care is an emerging issue. In contrast to laboratory studies, many of which focused on speech recognition, hearing aid benefit is multidimensional. For example, studies to date have noted a relationship between working memory and objective speech recognition, but also between working memory and the subjective benefits of different processing in the listener's own environment. For the assessment of working memory to be feasible in the clinic, tasks are needed that can be administered within a reasonable amount of time (e.g., 5 min), that are acceptable for both the audiologist and the client, and for which scores can be quickly obtained and easily interpreted. The reading span task that has widely been used in the research studies summarized above is rather strenuous and typically takes around 15 min to complete (with 54 test items). A shorter version has been developed (e.g., Ng et al., 2013), but is not widely used. It may also be useful to consider measures of working memory capacity that bear close resemblance to the problems encountered by typical hearing aid candidates (i.e., that are more life-like); or, alternatively, components of working memory that lend themselves more readily to time-efficient testing in a clinical environment.

Clinical audiologists have shown great interest in the relationships between working memory and hearing aid benefit. Over the past few years, many clinical conferences have featured keynote speakers who work in the areas of cognition, hearing and aging. Clinicians have indicated a willingness to incorporate cognitive measures provided they offer improved hearing aid outcomes and/or better patient care. In addition to the need for time-efficient tests of working memory, the current review has identified several issues needing clarification. Given some of the uncertainties, such as the unclear role of contextual information, more controlled studies are needed to define the boundaries of the working memory-hearing aid effects, so that these relationships can be capitalized on to enhance hearing care.

### REFERENCES


# FUNDING

This research was supported by the National Institutes of Health (R01 DC012289 to authors PS and KA, and R01 DC006014 to author PS); and DFG Cluster of Excellence EXC 1077/1 "Hearing4all" (to author TN).

# ACKNOWLEDGMENT

The authors thank Thomas Lunner, Jing Shen, Tim Schoof, and Stephanie Trippel for helpful conversations regarding the topics in this review.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Souza, Arehart and Neher. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Learning and Memory Processes Following Cochlear Implantation: The Missing Piece of the Puzzle

David B. Pisoni<sup>1</sup> \*, William G. Kronenberger<sup>2</sup> , Suyog H. Chandramouli<sup>1</sup> and Christopher M. Conway<sup>3</sup>

<sup>1</sup> Speech Research Laboratory, Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN, USA, <sup>2</sup> Riley Child and Adolescent Psychiatry Clinic, Department of Psychiatry, Indiana University School of Medicine, Indianapolis, IN, USA, <sup>3</sup> NeuroLearn Lab, Department of Psychology, Georgia State University, Atlanta, GA, USA

At the present time, there is no question that cochlear implants (CIs) work and often work very well in quiet listening conditions for many profoundly deaf children and adults. The speech and language outcomes data published over the last two decades document quite extensively the clinically significant benefits of CIs. Although there now is a large body of evidence supporting the "efficacy" of CIs as a medical intervention for profound hearing loss in both children and adults, there still remain a number of challenging clinical and theoretical issues that deal with the "effectiveness" of CIs in individual patients that have not yet been successfully resolved. In this paper, we review recent findings on learning and memory, two central topics in the field of cognition that have been seriously neglected in research on CIs. Our research findings on sequence learning, memory and organization processes, and retrieval strategies used in verbal learning and memory of categorized word lists suggest that basic domain-general learning abilities may be the missing piece of the puzzle in terms of understanding the cognitive factors that underlie the enormous individual differences and variability routinely observed in speech and language outcomes following cochlear implantation.

Keywords: learning, memory, repetition, free-recall, semantic clustering

# INTRODUCTION

For a number of years, my colleagues and I have been on a mission to understand and explain the reasons for the enormous individual differences and variability in speech and language outcomes following cochlear implantation in adults and children. In numerous papers, we have argued that the individual differences routinely observed at all implant centers around the world are not mysterious, anomalous or idiopathic in nature but instead reflect differences and natural sources of variability in more basic elementary building blocks of cognition (Pisoni et al., 2008). These cognitive factors include the early registration, sensory encoding, storage, rehearsal, retrieval, and processing of phonological and lexical representations of spoken words in speech perception and spoken language processing tasks. In our search for underlying process-based explanations of individual differences, we have focused our research efforts on issues related to learning and memory, two central topics in cognition that have been neglected in the field of cochlear implantation.

Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Michel Hoen, Cochlear Implant Systems – Oticon Medical, France Björn Lyxell, Linköping University, Sweden

> \*Correspondence: David B. Pisoni pisoni@indiana.edu

### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 27 January 2016 Accepted: 21 March 2016 Published: 08 April 2016

### Citation:

Pisoni DB, Kronenberger WG, Chandramouli SH and Conway CM (2016) Learning and Memory Processes Following Cochlear Implantation: The Missing Piece of the Puzzle. Front. Psychol. 7:493. doi: 10.3389/fpsyg.2016.00493

This paper is organized into seven main sections. In the first section (The Puzzle about Outcomes following Cochlear Implantation), we introduce and discuss the longstanding problem of variability in speech and language outcomes following cochlear implantation and suggest that learning, memory, and related cognitive processes may represent the missing piece of the puzzle to understanding such variability. In the next two sections ("Explicit Sequence Memory Spans" and "Explicit Sequence Learning Spans"), we present a summary of some of our earlier research findings showing atypical explicit memory and learning of auditory and visual serial patterns in deaf children with cochlear implants (CIs; Pisoni and Cleary, 2004). We next review an area of research investigating the "Hebb repetition effect" that provides additional evidence for understanding deficits in serial memory and learning and language outcomes (The "Hebb Effect" and Sequence Repetition Learning). In the Section "Implicit Learning of Sequential Patterns," we review research demonstrating that deaf children with CIs show atypical implicit learning of sequential patterns, and this disturbance may be part of the reason for the observed language delays (Conway et al., 2011b). Finally, in the Section "Verbal Learning and Memory Processes," we describe some recent findings on verbal learning and memory in prelingually deaf long-term CI users obtained with the California Verbal Learning Test (CVLT-II; Delis et al., 2000), a well-known and widely used neuropsychological assessment instrument that provides information about the control processes and organizational strategies that individuals use in free recall of categorized word lists (Chandramouli et al., Manuscript in preparation). We discuss the theoretical and clinical implications of all of these findings in the "Theoretical and Clinical Implications" Section of this paper. Overall, we suggest that the broad domain of learning and memory may turn out to be a very important aspect of cognition that provides a principled explanation for the enormous individual differences routinely observed in outcomes following implantation.

# THE PUZZLE ABOUT OUTCOMES FOLLOWING COCHLEAR IMPLANTATION

The variability and individual differences observed following cochlear implantation in profoundly deaf adults and children are enormous and represent a significant clinical problem in the field of otology and audiology. At the present time, there is no question that CIs work and often work very well in quiet listening conditions for many profoundly deaf children and adults. The speech and language outcomes data published in the clinical and basic science journals over the last two decades document quite extensively the clinically significant benefits of CIs using numerous behaviorally based outcome measures of speech recognition, speech intelligibility, and language processing in both children and adults who have received these sensory aids as a medical treatment for their profound deafness. Without a CI, a prelingually deaf infant or a young child with a profound bilateral hearing loss would not be able to acquire receptive and expressive spoken language skills and would display significant global developmental and intellectual delays that would remain over his/her entire lifetime.

Recognizing these successes, Wilson et al. (2011, p. 117) stated recently that CIs represent "one of the great success stories of modern medicine" and that "the CI is the most successful neural prosthesis developed to date" and "exceeds by orders of magnitude the number for all other types of neural prostheses." Despite these recent broad sweeping statements about the "efficacy" of CIs as a medical intervention for profound hearing loss in both children and adults, there still remain a number of challenging unresolved clinical and theoretical issues that deal with the "effectiveness" of CIs in individual patients that have not been successfully resolved yet despite many years of basic and clinical research (Pisoni et al., 2008).

After receiving a CI, all patients require a relatively long period of sensory acclimatization and perceptual adaptation to learn how to process underspecified acoustic-phonetic information encoded in the degraded signal. This sensory and perceptual adaptation must occur before these patients are able to derive any functional benefits from their implant and display solid evidence of perceiving speech, understanding spoken language, and reliably recognizing natural environmental sounds. Despite the importance of learning and adaptation following implantation, this particular domain of cognition has received little attention compared to the voluminous literature on conventional speech perception and language outcomes.

The revolution in the field of experimental psychology in the 1960s that gave birth to the new field of cognitive psychology has had a profound and long-lasting influence on our thinking about how humans perceive, encode, store, and process information (Neisser, 1967; Haber, 1969). Armed with new experimental methods and a richer and more powerful theoretical conceptualization of how these complex cognitive processes might be carried out by humans, the field of cognition has flourished over the last 50 years and has made important contributions to many related fields including neuroscience, developmental and clinical science, and social psychology. Until recently, the cognitive approach has been very slow to have a significant impact in the field of clinical audiology and, in particular, research on hearing loss and CIs which has been heavily dominated by the medical community of otologists, audiologists, and speech-language pathologists (Arlinger et al., 2009; Jerger, J. cited in Fabry, 2011, p. 20).

While there have been several small steps made applying some of the methods and theory of information processing psychology to problems in CIs (Pisoni, 2000), one of the core foundational areas of cognition that has been neglected in almost all of the clinical research on CIs in both adults and children is learning and memory processes and the organizational strategies that CI users employ in conventional laboratory-based episodic memory and learning tasks such as free-recall, recognition, and repetition-based learning. The precise reasons for the lack of research on these central topics are unclear at this time but the current situation may simply reflect the long-standing historical biases toward the study of peripheral sensory processes in the field of hearing research and clinical audiology and the strong reluctance over the years to fully acknowledge that hearing loss in the pediatric population is primarily a "brain" issue and not an "ear" issue (Luria, 1973; Flexer, 2011). The absence of a large body of basic research on learning and memory processes in deaf children and adults with CIs represents a very significant gap in our basic knowledge and understanding of cognition, neural plasticity, and experience- and activity-dependent learning, core foundational processes which underlie all adaptive behaviors in both humans and animals.

Although our focus in this paper is on the role of cognition, specifically, learning and memory processes, we do not want to minimize in any way the important contributions of demographics and other contributing factors to speech and language outcomes following implantation. The evidence collected over the years linking outcomes to variables like age of implantation, communication mode, family and device factors as well as a host of audiological and hearing-related variables is very strong and reliable. However, these factors taken in isolation fail to account for a significant part of the variance observed in the conventional speech and language outcome measures routinely obtained from CI users at centers around the world. Additional sources of variance, we argue, come from cognitive factors such as learning, memory, attention, inhibitory control, working memory, executive function, and cognitive control processes. Our point in this paper is that demographics do not account for the whole story and in many cases may obscure more basic underlying elementary processes. Moreover, the focus on demographics and conventional endpoint measures of outcome and benefit often prevents researchers from moving beyond descriptive accounts to causal explanations framed within a broader theoretical context that emphasizes the role of basic elementary information processing operations, such as learning and memory processes.

In the remaining sections of this paper, we summarize some of the major findings from our ongoing program of research on learning and memory processes following cochlear implantation and discuss the broader clinical and theoretical implications of these findings for understanding the factors underlying individual differences and variability in speech and language outcomes.

# EXPLICIT SEQUENCE MEMORY SPANS

The traditional methods for measuring immediate memory capacity using digit spans require a subject to encode both item and order information and then verbally repeat back and reproduce the sequence of test items using an overt articulatory motor response (Dempster, 1981). Because most deaf children with CIs also have other comorbid delays in speech development and often display "atypical" articulation and speech motor control because of their early hearing loss, it is possible that any differences observed in auditory short-term memory or working memory tasks using conventional digit span tests could be due to the nature of the response organization requirements during retrieval and response output processes in addition to any possible differences in early sensory registration, encoding, storage, or retrieval processes (AuBuchon et al., 2015). To eliminate the use of an overt articulatory-verbal motor response, we developed a new experimental methodology to measure sequence memory based on Milton–Bradley's Simon ©, a well-known memory game that uses a simple reproduction task. **Figure 1** shows a display of the apparatus we used in our early studies (Cleary et al., 2001; Pisoni and Cleary, 2004). We took an off-the-shelf Simon memory game box and modified it in our shop by building a custom interface to a PC so we could directly control the stimulus presentation, record the subject's responses, and provide feedback when needed.

In our version of the Simon sequence memory task, a child hears or sees a sequence of color names or color lights presented by the computer and then simply "reproduces" the stimulus pattern by depressing a sequence of colored response panels on the four-alternative Simon response box using a manual response. Because the Simon memory game was controlled by a computer, we were able to manipulate the stimulus presentation conditions in several different ways while also holding the response format constant. In addition to measuring sequence memory spans, the Simon memory game apparatus and methodology also provided us with an opportunity to study basic learning processes, specifically, serial learning and the relations between sequence memory and serial learning using the same experimental procedures and same response demands (Cleary et al., 2001; Karpicke and Pisoni, 2004; Pisoni and Cleary, 2004).

The lights on the Simon apparatus were illuminated in temporal patterns from a vocabulary ensemble of four colors. Before the memory game began, we asked each child to identify recorded audio tokens of the four color-names by pointing to one of the four large colored buttons on the response box to make sure they could hear and recognize the four color names without any errors. Three types of sequential patterns were presented in separate blocks: auditory-only (A-only), lights-only (L-only), and auditory+lights (A+L). All of the sequences used for the memory game task were generated pseudo-randomly by a computer program from the four alternative colors, with the stipulation that no color name or color light would be repeated consecutively in any given list. Each subject started with a list length of one item. If two sequences in a row at a given list length were correctly reproduced, the next sequence that was presented was increased in length by one item that was chosen randomly from the four colors. If the list was incorrectly reproduced on any trial, the next trial presented a new list that was one item shorter in length. This up-down adaptive tracking procedure is similar to methods typically used in psychophysical testing (Levitt, 1970). Importantly, novel sequences were generated randomly on each trial in order to prevent any learning from occurring other than routine practice effects that would typically be observed in learning how to carry out a new task in an unfamiliar laboratory setting.
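The list-generation rule and the up-down tracking rule described above can be illustrated with a minimal Python sketch. This is not the original control software. The color names, the number of trials per block (`n_trials`), the floor of one item on list length, and the `respond` callback standing in for the child's reproduction are all assumptions introduced for illustration.

```python
import random

# Assumed color set; the text specifies only that four colors were used.
COLORS = ["red", "blue", "green", "yellow"]


def make_sequence(length, rng):
    """Draw a pseudo-random color list with no color repeated consecutively."""
    seq = []
    for _ in range(length):
        choices = [c for c in COLORS if not seq or c != seq[-1]]
        seq.append(rng.choice(choices))
    return seq


def run_block(respond, n_trials=20, seed=0):
    """Run one block using the two-correct-up / one-error-down rule.

    `respond(seq)` is a hypothetical callback that returns the reproduced list.
    Returns one (list_length, correct) pair per trial.
    """
    rng = random.Random(seed)
    length, streak, history = 1, 0, []
    for _ in range(n_trials):
        seq = make_sequence(length, rng)      # a novel list on every trial
        correct = respond(seq) == seq
        history.append((length, correct))
        if correct:
            streak += 1
            if streak == 2:                   # two in a row: lengthen by one
                length, streak = length + 1, 0
        else:                                 # any error: shorten by one
            length, streak = max(1, length - 1), 0
    return history
```

With a responder that always reproduces the list correctly, the list length rises by one item after every second trial; with a responder that always fails, it stays at the assumed floor of one item.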

We computed a "weighted" sequence span score for each child by finding the proportion of lists correctly reproduced at each list length and summing these proportions across all list lengths (Pisoni and Cleary, 2004). A summary of the results from the Simon sequence reproduction memory span task for two groups of 8- to 9-year-old children is shown in **Figure 2**. The weighted-span scores for the normal-hearing age-matched children (N = 31) are shown in the left panel while the scores for the deaf children with CIs (N = 31) are shown in the right panel. Within each panel, the scores for the A-only presentation condition are shown on the left, scores for L-only presentation are shown in the middle and scores for the combined A+L condition are shown on the right.
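As an illustration of the weighted span score defined above, the following Python sketch sums the proportion of correctly reproduced lists over all list lengths. The input format, a list of `(list_length, correct)` pairs, and the function name are assumptions made for this example.

```python
from collections import defaultdict


def proportion_weighted_span(trials):
    """Sum, over list lengths, the proportion of lists reproduced correctly.

    `trials` is an iterable of (list_length, correct) pairs (assumed format).
    """
    attempted = defaultdict(int)
    passed = defaultdict(int)
    for length, correct in trials:
        attempted[length] += 1
        passed[length] += int(correct)
    return sum(passed[n] / attempted[n] for n in attempted)


# Hypothetical trial record for one child, not real data.
example = [(1, True), (1, True), (2, True), (2, True), (3, False),
           (2, True), (2, True), (3, True), (3, False)]
span = proportion_weighted_span(example)  # 2/2 + 4/4 + 1/3
```

In this hypothetical record the child is perfect at lengths one and two and correct on one of three lists at length three, so the score is 1 + 1 + 1/3.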

Examination of the sequence memory span scores revealed several important differences between the two groups. Not surprisingly, the sequence memory spans for the A-only and A+L presentation conditions were consistently lower overall for the children with CIs than the normal-hearing children. However, the deaf children with CIs also displayed shorter sequence memory spans in the L-only condition than the normal-hearing children. This was an unexpected and surprising finding that provides additional converging support for the hypothesis that rapid phonological recoding and efficient verbal rehearsal processes in short-term working memory play an important and inseparable role in perception, learning, and memory in these children (Pisoni and Geers, 2000; Pisoni and Cleary, 2004). Capacity limitations of verbal short-term memory are closely tied to speed of processing information even for visual sequential patterns that can be rapidly recoded and rehearsed in verbal short-term memory using a phonological or articulatory code in sequential processing tasks (Conrad, 1960). Verbal coding strategies may be mandatory or at least commonly used by humans who are engaged in memory tasks that require immediate serial recall (ISR) of patterns that preserve item and order information (Gupta and MacWhinney, 1997). Although the visual patterns were presented using only sequences of colored lights, many of the participants, particularly the normal-hearing children, likely recoded the serial patterns using well-learned automatized verbal labels and coding strategies in order to create stable representations of the stimulus patterns in working memory for maintenance and rehearsal prior to response organization and motor output.

When compared to the group of normal-hearing controls, the deaf children with CIs may have used a different encoding strategy and less efficient verbal rehearsal processes for maintaining temporal sequences of the color name codes in working memory. Early auditory deprivation and the absence of sound stimulation following a period of prelingual profound hearing loss during the initial stages of language development may not only affect early sensory processing and perception but may also influence subsequent encoding and rehearsal processes in verbal working memory (Conrad, 1979). The deaf children with CIs in this study showed a reduced capacity to maintain serial order information in short-term memory even when the information was presented through the visual sensory modality (see Myklebust and Bruton, 1953). These findings on immediate sequence memory spans for auditory and visual patterns, obtained with the Simon memory game without any overt verbal articulatory-motor responses, replicate the earlier memory span results we obtained using the WISC digit spans, which showed large and consistent differences in memory span between deaf children with CIs and age-matched normal-hearing children (Pisoni and Geers, 2000; Pisoni and Cleary, 2004). To our knowledge, these were the first memory span data collected from prelingually deaf children with CIs demonstrating differences in immediate memory capacity and rehearsal processes without relying on any articulatory-based verbal response for output.

# EXPLICIT SEQUENCE LEARNING SPANS

The initial version of our Simon memory game used novel sequences of color names or colored lights on each trial to measure immediate memory spans. As previously mentioned, all of the test sequences were generated randomly in order to prevent any learning from occurring other than routine practice effects. The primary goal of the first phase of this project was to obtain estimates of immediate memory capacity for serial patterns that were not influenced by repetition effects or idiosyncratic verbal coding strategies that might increase memory capacity from trial to trial (Cleary et al., 2001, 2002). There was no basis for any new learning to take place and the measures of Simon sequence memory span could be used to estimate the capacity of immediate memory for serial patterns of familiar color names or color lights.

In the second phase of this study, we used the same basic Simon memory game apparatus and procedure to study learning and to investigate the effects of sequence repetition on coding and rehearsal strategies in immediate memory. To accomplish this goal and to directly compare the gains in repetition learning and the increases in working memory capacity to our earlier sequence memory span measures, we examined the effects of repetition on immediate memory span. To measure learning, during a block of trials, the same visual or auditory pattern was repeated after each correct trial. Each new test sequence then increased in length by one item until the child was no longer able to reproduce the repeated pattern correctly. This small procedural change in generating the test sequences provided an opportunity to measure sequence learning following repetition and to explore how sequence repetition affects the capacity of immediate memory (see Hebb, 1961). Everything else remained exactly the same as in the original sequence memory conditions except that the same serial pattern was repeated after each correct reproduction.

The right-hand bars in **Figure 2** display a summary of the results obtained in the Simon learning conditions that investigated the effects of stimulus repetition on sequence learning spans. The same three presentation formats used in the earlier sequence memory conditions were used again, that is, A-only, L-only, and A+L. The weighted memory span scores for the sequence memory conditions obtained earlier under random presentation in the first phase are shown by the left-hand bars of each panel in **Figure 2**; the corresponding set of sequence learning span scores obtained following repetition for the same three presentation conditions are reproduced on the right-hand side of each panel. The data for the 8- and 9-year-old normal-hearing children are shown in the left panel; the data for the 8- and 9-year-old deaf children with CIs are shown in the right panel.

Examination of the two sets of sequence span scores shown within each panel reveals several consistent findings. First, just repeating the same stimulus sequence again after a correct reproduction produced robust repetition learning effects for both groups of children. This sequence repetition effect can be seen clearly by comparing the three scores on the left-hand side of each panel to the three scores on the right-hand side. In every case, the sequence learning span scores on the right are higher than the sequence memory span scores on the left. Repetition of a serial pattern increased immediate memory span capacity although the magnitude of the improvement differed across the two groups of subjects. Although a sequence repetition effect was also obtained with the deaf children who use CIs, the size of their improvement was about half the size of the repetition effect found for the normal-hearing children shown in the left-hand panel. Second, the rank ordering of the three presentation formats in the sequence learning conditions was similar to the rank ordering observed in the sequence memory span conditions for both groups of children. The repetition effect was always largest for the A+L conditions for both groups due to redundancy gains when both modalities are combined together.

To assess the magnitude of the sequence repetition learning effects for the individual children in both groups, we computed difference scores between the learning and memory conditions by subtracting the memory span scores from the learning span scores for each subject. The difference scores for all of the individual subjects in both groups for the three presentation formats are displayed in ascending order in **Figure 3**. Scores above zero in **Figure 3** indicate the presence of a repetition benefit; scores below zero indicate no repetition learning. Inspection of these distributions reveals a wide range of individual performance for both groups of subjects. Some subjects showed relatively large gains in learning while others showed only very small gains. Although most of the subjects in both groups displayed some evidence of repetition-based sequence learning in terms of showing a positive repetition effect, there were a few subjects at the end of the distribution who either failed to show any sequence repetition learning effect at all or showed a small reversal of the predicted repetition effect.

While the number of these subjects was quite small in the group of normal-hearing control children, we found that about one-third of the deaf children (N = 11) showed no evidence of a sequence repetition learning effect at all and obtained no benefit from having the same stimulus sequence repeated on each trial. The failure of a large subset of deaf children with CIs to display any evidence of simple repetition-based sequence learning following presentation of a visual pattern in the reproduction memory task suggests the presence of a significant impairment in serial learning for both auditory and visual patterns.

As explained previously, given the nature of the stimuli used in the Simon task, it is likely that the normal-hearing children were using a verbal rehearsal strategy to label and help remember each color sequence as it occurred (e.g., "RED–BLUE–GREEN–BLUE" etc.). It is possible that the reduced sequence memory and learning spans in the deaf children with CIs are due to atypical verbal rehearsal or even a non-verbal coding strategy. In order to tease apart the extent to which the sequence memory and learning impairments were due to atypical verbal rehearsal strategies, we recently designed a new version of the Simon sequence learning and memory task using a touch-screen monitor that incorporated four different conditions to assess both verbal and non-verbal visual sequencing (Gremp et al., Manuscript in preparation). Two of the conditions used black and white visual stimuli instead of colored squares in order to make verbal rehearsal less likely. In addition, half of the tasks used sequences that were randomly generated on each trial, as was the case in the first set of studies described above to assess sequence memory, while the other half used repeating sequences on each trial to measure sequence repetition learning effects. Thus, this design provided a direct way to assess the effect of verbal coding on sequence memory and sequence learning. A group of deaf children with CIs and an age-matched group of normal-hearing children participated in the study. The findings revealed that while the deaf children with CIs showed lower performance for the verbal versions of sequence memory and sequence learning, their performance was lower overall on all versions of the task, regardless of whether verbal rehearsal was likely to have occurred. These recent findings suggest that the impairment on visual sequence memory and learning is not solely due to difficulties with verbal coding and verbal rehearsal but may reflect a more global domain-general disturbance in the learning and memory of sequential patterns.

Repetition-based learning of serial patterns like the learning and memory of highly familiar color sequences and visual-spatial patterns observed in this study is one of the earliest and most primitive forms of learning and adaptive functioning that the brain and nervous system carry out in acquiring knowledge and recording experiences about regularities in the surrounding environment. These are theoretically important findings in this clinical population because they link the present set of serial learning results to an extensive and rapidly growing literature on ISR, the Hebb repetition learning (HRL) effect and the learning of phonological word-forms and lexical development discussed in the next section.

# THE "HEBB EFFECT" AND SEQUENCE REPETITION LEARNING

The findings reported in the previous section on repetition effects in sequence memory and learning of serial patterns in deaf children with CIs using the Simon memory game methodology are closely related to a large body of research carried out on processing of serial order information and sequence learning and memory using an experimental methodology first developed by Donald Hebb more than 50 years ago (Hebb, 1961). In his seminal study on the effects of sequence repetition on serial learning and memory, Hebb gave 40 college subjects 24 randomized lists each containing nine randomized digits for ISR. An example of one of the lists he used is 591437826. The experimenter read the sequence of nine digits aloud to each subject at a rate of about one digit per second. Interspersed within these random lists of digits was one list of digits that was repeated over again after every third trial. Subjects were told to listen carefully to each list and simply repeat back the list of digits after presentation in the same order they were presented. The subjects were told that the purpose of the study was to see if memory span for sequences of random digits would improve with practice.
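The structure of the list schedule in a Hebb-type design can be sketched in Python as follows. This is a hedged illustration rather than a reconstruction of Hebb's materials: it assumes that each list is a random ordering of the digits 1 to 9 (consistent with the example list above) and that the repeated list occupies every third position in the series. The function name and parameters are hypothetical.

```python
import random


def hebb_schedule(n_lists=24, period=3, seed=0):
    """Build a series of nine-digit lists with one list recurring periodically.

    Returns (digits, is_repeated) pairs. Assumes each list is a permutation
    of 1-9 and that every `period`-th list is the repeated one.
    """
    rng = random.Random(seed)
    digits = list(range(1, 10))
    repeated = rng.sample(digits, len(digits))
    schedule = []
    for trial in range(1, n_lists + 1):
        if trial % period == 0:
            schedule.append((list(repeated), True))
        else:
            filler = rng.sample(digits, len(digits))
            while filler == repeated:         # keep filler lists distinct
                filler = rng.sample(digits, len(digits))
            schedule.append((filler, False))
    return schedule
```

Under these assumptions, a series of 24 lists contains eight presentations of the repeated list and 16 non-repeated filler lists, so recall of the two list types can be compared across the session.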

Hebb found that although his subjects showed no evidence of any improvement in reproduction of the randomized digit lists over the course of the experiment, they did show a very consistent pattern of learning and improvement for the repeated digit lists. After the experiment was over, Hebb asked his subjects if any of them noticed the repeating pattern. About half of the subjects stated that they were aware that some of the patterns repeated. Fifteen of the subjects stated that they had not observed any repetitions at all.

The improvement in performance following presentation of a repeating sequence of stimuli is known as the "Hebb Effect" in the human learning and memory literature, and this form of repetition-based sequence learning has recently taken on a special status in a series of novel studies on serial learning and phonological word-form learning (Page and Norris, 2009a). The HRL effect is a very robust finding that has been replicated and extended by Melton (1963), McKelvie (1987), and Stadler (1993) among others over the years. Hebb originally suggested that this "rather simple-minded experiment" provided strong evidence for the conclusion that a single repetition of a sequence of random digits could produce a permanent structural change in long-term memory without feedback and that this structural memory trace had fundamentally different properties from the sensory-based activity-traces that support short-term memory span (Hebb, 1961).

Although the HRL effect has received a modest amount of attention over the years since it was first reported, current interest in the HRL effect has accelerated quite rapidly over the last 10 years as several researchers, led primarily by Page and Norris (2009a,b), have recognized the potential usefulness of the experimental methodology and findings in the study of serial learning and sequence memory effects. In a series of recent papers, Page and Norris (2009a,b) have extended and elaborated Hebb's original findings on repetition-based serial learning and proposed the working hypothesis that the processing mechanisms used for encoding serial order information that underlie the Hebb Effect are closely linked to the same coding processes involved in ISR tasks such as digit span, non-word repetition, non-word paired associate learning as well as the phonological word-form learning component underlying lexical acquisition. All these processing tasks require the encoding and processing of item and serial order information and all of them rely on establishing links between the contents of short-term memory and representations of serial order information in long-term memory.

The recent literature on the Hebb Effect and its relationship to novel word form learning is extensive and growing rapidly and will not be reviewed here because of space limitations (see Mosse and Jarrold, 2008; Szmalec et al., 2009, 2012). However, there are a number of reasons why the HRL effect has become the focus of new research efforts on serial learning and memory, and it is worth mentioning them here. First, the topic of processing and encoding serial order information has been an important and central issue of research and theory in the field of cognition since the earliest days of experimental psychology and memory science going back to Ebbinghaus (1964) and the seminal observations of Lashley (1951) on the role of serial order coding in complex behavior. In addition, it has been widely assumed that the encoding and processing of serial order information is often a core foundational component of other more complex cognitive processes in other domains, especially language and motor behaviors. Second, the HRL effect is very robust and is relatively easy to obtain in different populations. Third, the effect represents the intersection of both short-term and long-term memory processes involving the transfer of information from short-term "activity-traces" in immediate memory to more permanent and stable "structural-traces" of item and order information in long-term memory which are thought to involve fundamentally different underlying neural systems. Fourth, the observed repetition effects and experimental methodology used in studying HRL encompass both implicit and explicit memory and learning processes, two areas of memory research that are typically treated as separate processing domains. Finally, and perhaps most importantly, several researchers have argued that the HRL effect can serve as a laboratory-based analog for the real-world everyday language learning activities that are involved in phonological word-form learning and lexical acquisition of novel words by children learning language (Page and Norris, 2009a).

The sequence learning and serial order encoding results observed in the HRL paradigm have broader theoretical relevance and implications for language learning and development and individual differences in several clinical populations who may have delays and/or deficits in receptive and expressive language learning. For instance, several recent studies have reported long-term serial-order learning problems in children and adults with dyslexia suggesting that they may have a fundamental impairment in the encoding and processing of serial order information that results in weaker and more fragile lexical representations of words in long-term memory (see Szmalec et al., 2011; Bogaerts et al., 2015a,b). Other recent studies on adults with dyslexia have reported increases in proactive interference in an n-back task (Bogaerts et al., 2015b), suggesting that the deficit in serial order memory may also affect automatic inhibitory control processes used in verbal working memory tasks which commonly require the encoding and representation of both item and order information and the control of active attention.

The close parallels between these diverse sets of results obtained with several different populations and somewhat different experimental methodologies are probably not just a random coincidence. Instead, they very likely reflect a common domain-general core disturbance and/or impairment in the same basic underlying serial order coding processes that are involved in encoding, storing, and retrieving item and order information in verbal sequence memory and learning tasks. Our findings on Simon sequence memory and learning with deaf children who use CIs, along with a body of other results, reflect a deficit or disturbance/delay in the operation of a common serial order cognitive mechanism that is intimately involved in binding, chunking, and recoding repeated serial patterns reflecting the same processing operations used in sequence memory and novel word-form learning. We will return to these core issues again in the "Theoretical and Clinical Implications" Section of the paper.

# IMPLICIT LEARNING OF SEQUENTIAL PATTERNS

Similar to the idea of the Hebb repetition effect, which demonstrates the learning of repeating patterns, is the notion of "statistical learning," which reflects the acquisition of statistically based regularities such as co-occurrence statistics or the probability of two stimuli occurring together in time or space. This type of statistical learning, a form of implicit learning, is currently thought to be one of the basic elementary learning mechanisms that is used in language acquisition (Cleeremans et al., 1998; Saffran et al., 2001; Altmann, 2002; Ullman, 2004). There are many studies on infants (Saffran et al., 1996), children (Meulemans et al., 1998), adults (Conway and Christiansen, 2005), and even non-humans (Petkov and Wilson, 2012) that have reported findings on implicit statistical pattern learning.

Several recent studies from our research group have explored the relations between individual differences in implicit statistical learning and spoken language processing abilities (Conway et al., 2007, 2010, 2011b; Conway and Pisoni, 2008; Shafto et al., 2012). In one of our initial studies, young NH adults carried out an implicit statistical learning task involving visual sequences and a sentence perception task that required listeners to recognize words in noise. The test sentences were taken from the Speech in Noise Test (SPIN) and varied on the predictability of the final word (Kalikow et al., 1977). We found that performance on the implicit learning task was correlated with performance on the speech perception task – specifically, for the high predictability SPIN sentences that had a highly predictable final word. This result was observed even after controlling for the variance associated with non-verbal intelligence, short-term memory, working memory, and attention and inhibition (see Conway et al., 2007, 2010).

The findings obtained with NH adults suggested that domain-general abilities related to implicit learning of sequential patterns are closely coupled with the ability to acquire and use information about the predictability of words occurring in degraded spoken sentences, knowledge that is critical for the successful acquisition of linguistic competence. The more experience that an individual has with the underlying sequential patterns of spoken language, the better one is able to use one's long-term knowledge of those patterns to perceive and understand novel spoken utterances, and to reliably predict upcoming words in sentences, especially under degraded or challenging listening conditions. While our initial studies provided some preliminary evidence for an important empirical link between implicit learning and language processing in NH adults, in order to better understand the development of implicit learning, it is necessary to investigate implicit statistical learning processes in both typically developing and atypical clinical populations such as profoundly deaf children who have been deprived of sound and the typical environmental conditions of development that are appropriate for robust language learning.

Toward this end, we investigated implicit learning in a group of deaf children with CIs and a chronologically age-matched control group of NH typically developing children to assess the effects that a period of auditory deprivation and delay in language may have on learning of complex visual sequential patterns (Conway et al., 2011a). Some evidence already suggested that a period of auditory deprivation occurring early in development may have secondary cognitive and neural sequelae in addition to the obvious first-order hearing-related sensory effects (see Myklebust and Bruton, 1953; Luria, 1973; Conrad, 1979). Specifically, because sound is a physical signal distributed in time, lack of experience with sound patterns may affect how well a child is able to encode, process, and learn sequential patterns and encode and store temporal information in memory (Rileigh and Odom, 1972; Todman and Seedhouse, 1994; Fuster, 1995, 1997, 2001; Marschark, 2006). We have suggested that exposure to sound may also serve as a kind of "auditory scaffolding" in which a child gains specific experiences and practice with learning and manipulating sequential patterns in the environment (Conway et al., 2009, 2011b). Based on our earlier implicit visual sequence learning research with NH adults, we predicted that deaf children with CIs would show disturbances in visual implicit learning of sequential patterns because of their lack of experience with auditory temporal patterns early on in development. We also predicted that their implicit learning abilities would be associated with several measures of language development.

Two groups of 5- to 10-year-old children participated in this study. One group consisted of 25 deaf children with CIs; the second group consisted of 27 age-matched typically developing, NH children. All children carried out an implicit visual sequence learning task. Several clinical measures of language outcome were also available for the CI children from our larger longitudinal study. Scores on these tests were also obtained for the NH children. Our specific hypothesis was that if some core foundational aspects of language development draw on domain-general learning abilities, then we should observe correlations between performance on the implicit visual sequence learning task and several different measures of spoken language processing. Measures of vocabulary knowledge and immediate memory span were also collected from all participants in this study in order to rule out obvious mediating variables that might be responsible for any observed correlations. The presence of correlations between implicit sequence learning and language processing even after partialing out the common sources of variance associated with these other measures would provide support for the hypothesis that implicit learning is directly associated with spoken language development, rather than being mediated by a third contributing factor.

Two artificial grammars (Grammars A and B) were used to generate the colored sequences used in the implicit learning task. These grammars specified the probability of a particular color occurring given the preceding color in sequence. Sequence presentation consisted of colored squares appearing one at a time, in one of four possible positions in a 2 × 2 matrix on a computer touchscreen in a manner that mimicked the basic design of the previous Simon memory game. The four states (1–4) of each grammar were randomly mapped onto each of the four screen locations as well as four possible colors (red, blue, yellow, green). The assignment of states in the grammar to position/color was randomly determined for each subject; however, for each subject, the mapping remained consistent across all trials. Grammar A was used to generate 16 unique sequences for the learning phase and 12 sequences for the test phase. Grammar B was used to generate 12 additional novel sequences for the test phase.
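A grammar of this kind, in which the probability of each element depends on the preceding element, can be sketched in Python as a first-order transition table. The transition probabilities below are hypothetical placeholders and are not the published Grammars A or B; the position labels and function names are likewise assumptions introduced for illustration.

```python
import random

# Hypothetical table: P(next state | current state). Not the published grammar.
GRAMMAR = {
    1: {2: 0.5, 3: 0.5},
    2: {3: 0.5, 4: 0.5},
    3: {4: 0.5, 1: 0.5},
    4: {1: 0.5, 2: 0.5},
}
COLORS = ["red", "blue", "yellow", "green"]
POSITIONS = ["top-left", "top-right", "bottom-left", "bottom-right"]  # assumed labels


def subject_mapping(rng):
    """Randomly assign each state a color and a screen position for one subject."""
    colors = rng.sample(COLORS, len(COLORS))
    positions = rng.sample(POSITIONS, len(POSITIONS))
    return {s: (colors[i], positions[i]) for i, s in enumerate(sorted(GRAMMAR))}


def generate_states(length, rng):
    """Generate a state sequence by following the grammar's transitions."""
    state = rng.choice(sorted(GRAMMAR))
    states = [state]
    while len(states) < length:
        options = GRAMMAR[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        states.append(state)
    return states


rng = random.Random(0)
mapping = subject_mapping(rng)                 # fixed for the whole session
trial = [mapping[s] for s in generate_states(4, rng)]
```

Drawing the mapping once per subject and reusing it on every trial mirrors the constraint that the state-to-color and state-to-position assignment stayed consistent across all trials for a given child.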

The children were told that they would see sequences of four colored squares displayed on the computer touch screen monitor. The squares would flash on the screen in a pattern and their job was to remember the pattern of colors on the screen and reproduce the sequence at the end of each trial by touching the square boxes on the computer monitor. The procedures for both the learning and test phases were identical and from the perspective of the subject, there was no indication of separate phases at all. The only difference between the two phases was which sequences were used. In the Learning Phase, the 16 learning sequences from Grammar A were presented first. After completing the reproduction task for all of the learning sequences, the experiment seamlessly transitioned to the Test Phase, which used the 12 novel sequences from Grammar A and the 12 novel Grammar B test sequences. The children were not told that there was an underlying grammar for any of the learning or test sequences or that there were two types of sequences in the Test Phase. The child just observed and then reproduced the visual sequences.

A sequence was scored correct if the child reproduced the entire test sequence correctly. Sequence span scores were then calculated using a weighted method in which the total number of correct test sequences at a given length was multiplied by the length and then scores for all lengths were added together (see Cleary et al., 2001). We calculated separate sequence span scores for Grammar A and Grammar B test sequences for each subject. We also calculated an implicit learning score for each subject, which was the difference in sequence span scores between the learned grammar (Grammar A) and the novel grammar (Grammar B). The implicit learning score measured generalization and reflected how well sequence memory spans improved for novel sequences that were constructed by the same grammar that subjects had previously experienced in the Learning Phase, relative to span scores for novel sequences created by Grammar B.
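The scoring rule described above can be illustrated with a short Python sketch. The input format, a list of `(sequence_length, correct)` pairs for each grammar's test sequences, and the function names are assumptions made for this example; the data shown are hypothetical.

```python
from collections import Counter


def length_weighted_span(results):
    """Multiply the number of correct sequences at each length by that length, then sum."""
    correct_by_length = Counter(length for length, ok in results if ok)
    return sum(length * count for length, count in correct_by_length.items())


def implicit_learning_score(grammar_a_results, grammar_b_results):
    """Span for learned-grammar test items minus span for novel-grammar test items."""
    return length_weighted_span(grammar_a_results) - length_weighted_span(grammar_b_results)


# Hypothetical test-phase results for one child, not real data.
grammar_a = [(3, True), (3, True), (4, True), (5, False)]
grammar_b = [(3, True), (4, False), (5, False)]
score = implicit_learning_score(grammar_a, grammar_b)  # (3*2 + 4*1) - (3*1)
```

A positive score indicates better reproduction of novel sequences from the learned grammar than of sequences from the unfamiliar grammar.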

**Figure 4** shows the average implicit learning scores for both groups of children (left). For the NH children, the average implicit learning score was 5.8%, which was significantly greater than 0, demonstrating that as a group the NH children showed better learning of test sequences with the same statistical structure as the sequences from the initial Learning Phase. On the other hand, the average implicit learning score for the children with CIs was −2.5%, a value that was not statistically different from 0. In addition, the NH group's implicit learning score was significantly greater than that of the CI group. **Figure 4** also shows the implicit learning scores for the individual children in the NH group (middle) and the CI group (right). Whereas 14 out of 26 (53.8%) of the NH children showed an implicit learning score of 0 or higher, only 8 out of 23 (34.7%) of the CI children showed a learning score above 0.

The present results demonstrate that deaf children with CIs show atypical implicit statistical learning of visual sequential patterns compared to age-matched NH children. This result is consistent with the hypothesis that a period of deafness and language delay may cause secondary disturbances in the development of sequencing skills. In addition, for the children with CIs, we computed a partial correlation between their implicit learning score and age at implantation, with chronological age partialed out. Implicit learning was negatively correlated with the age at which the child received their implant (r = −0.410, p = 0.058) and positively correlated with the duration of implant use (r = 0.410, p = 0.058). The longer the child was deprived of auditory stimulation, the lower the visual implicit learning scores; correspondingly, the longer the child had experience with sound via his/her implant, the higher the implicit learning scores. These correlations suggest that exposure to sound via a CI has secondary indirect effects on basic serial learning processes that are not directly associated with hearing, audibility, speech perception or language development; longer implant use appears to be associated with better ability to implicitly learn complex visual serial patterns and acquire knowledge about the underlying abstract grammar that generated the patterns.
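For readers unfamiliar with partial correlation, the following sketch shows the standard first-order formula, which removes the linear contribution of a third variable from the correlation between two others. The variable names in the comments map onto the analysis described above, but the data are invented for illustration and the authors' statistical software is not specified in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation between x and y with z partialed out (first order).

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented values: x and y share variance only through z, so the raw
# correlation is 0.5 but the partial correlation is (numerically) zero.
z = [1, 1, -1, -1]   # e.g., chronological age (centered, hypothetical)
x = [2, 0, 0, -2]    # e.g., implicit learning score (hypothetical)
y = [2, 0, -2, 0]    # e.g., age at implantation (hypothetical)

raw_r = pearson(x, y)
partial_r = partial_correlation(x, y, z)
```

In this constructed example the apparent association between `x` and `y` disappears once `z` is controlled, which is the kind of confound that partialing out chronological age is meant to rule out.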

Finally, in order to assess the association between implicit learning and language outcomes in the children with CIs, we computed bivariate correlations between the implicit learning score and three subtest scaled scores of the Clinical Evaluation of Language Fundamentals, fourth edition (CELF-4; Semel et al., 2003). These three subtests measure aspects of general language ability, including auditory comprehension, spoken sentence generation, and spoken sentence imitation. The implicit learning score was positively correlated with all three subtests, and for the most part this positive association remained significant even after controlling for the common variance associated with duration of implant use, forward digit spans, backward digit spans, and vocabulary scores.

In a related study, both groups of children also completed a sentence recognition task (Conway et al., 2014a,b), using the set of lexically controlled sentences developed by Eisenberg et al. (2002). The stimuli consisted of twenty lexically easy words (i.e., high word frequency, low neighborhood density) and twenty lexically hard words (i.e., low word frequency, high neighborhood density) embedded in short meaningful English sentences. The sentences were presented over a loudspeaker at 65 dB SPL. The children were instructed to listen closely to each sentence and then repeat back what they heard to the examiner even if they were only able to perceive one word of the sentence. All of the test sentences were presented in random order to each child. Responses were recorded onto digital audiotape and were later scored off-line based on number of keywords correctly repeated for each sentence. The sentences were played in the quiet without any degradation to the deaf children with CIs. For the NH children, the original sentences were spectrally degraded to simulate a CI using a four-channel sinewave vocoder to reduce performance from ceiling levels (Shannon et al., 1995).

For both groups, performance was analyzed for recognition accuracy of each of the three key words in each sentence. This allowed us to examine the extent to which the children were using sentence context to improve their perception and reproduction of the spoken words. Whereas the NH children showed robust effects of contextual facilitation as measured by improved performance for the third word in each sentence compared to the first word, the deaf children with CIs on average showed no such contextual facilitation. When taken together with our previous findings with NH adults showing that better implicit serial learning abilities result in more robust knowledge of the sequential predictability of words in sentences which leads in turn to more efficient use of sentence context to aid spoken word recognition processes (Conway et al., 2010), it is possible that the deaf children's inability to make use of sentence context is due to their observed disturbances to implicit learning.

In sum, these recent studies showed that the deaf children with CIs display atypical implicit learning abilities, possibly due to a lack of early experience with auditory patterns and/or exposure to spoken language. Implicit sequence learning abilities in turn were positively correlated with better language scores even after controlling for other general cognitive scores. Finally, we found that these children displayed an inability to use sentence context to facilitate the perception of spoken words, possibly as a consequence of their disturbances in implicit sequence learning. It appears that these children were treating sentences as "strings of unrelated words" (Eisenberg et al., 2002; Conway et al., 2014a), not having a good sense of how various words co-occur with each other in a given sentence context and being unable to use previous words and prior supporting context to help them perceive and recognize subsequent words.

# VERBAL LEARNING AND MEMORY PROCESSES

Although we are now beginning to make some significant progress in understanding how normal-hearing listeners manage to recognize and understand speech under many adverse and challenging conditions and how they carry this process out so quickly and efficiently, very little basic or clinical research has focused on investigations of the underlying processes responsible for rapid adaptation, adjustment and perceptual learning in hearing impaired listeners who use CIs. Most of the outcomes research on speech perception and sentence recognition skills in CI-users has been carried out under benign testing conditions in quiet in the audiology clinic or laboratory using conventional low-variability test materials (words and sentences) that place very few, if any, processing demands on rapid automatized processes such as perceptual adaptation, adjustment or normalization. To the best of our knowledge, no studies have investigated the elementary foundational processes related to verbal learning and memory processes in this clinical population.

Fundamental questions about the nature of rapid perceptual adaptation and perceptual normalization and issues dealing with how CI users process compromised underspecified acoustic signals have not been fully investigated in this clinical population despite many years of research on outcomes. This is not surprising because only a small handful of studies have been carried out on working memory dynamics (capacity, speed, updating, inhibition, shifting, switching, etc.) and long-term episodic, procedural and semantic memory processes that underlie robust speech recognition and spoken language processing in normal hearing listeners. The available evidence from several recent studies strongly suggests that rapid adaptation, robust perceptual adjustment and normalization to multiple sources of variability in the speech signal are critically dependent on a small set of neurocognitive factors: elementary processes related to learning and memory, attention, inhibition, executive functioning, and cognitive control processes.

Learning is fundamental to all adaptive behaviors in living organisms and is inseparable from the sensory, perceptual, and cognitive processes involved in the acquisition, storage, and retrieval of information in long-term memory. The fluency and perceptual robustness routinely observed in processing speech signals under challenging conditions by normal hearing listeners reflect the operation of the entire information processing system working together in an integrated fashion (Obleser et al., 2007). No single component taken alone in isolation from the rest of the processing system is entirely responsible for the observed robustness and perceptual integrity of the final product of the comprehension process: successfully recovering the talker's intended linguistic message. While there can be little doubt that basic elementary learning and memory processes play a fundamental role in the development of speech and language and perceptual adaptation and normalization skills in challenging listening conditions, this foundational topic in cognition has been seriously neglected in the field of hearing loss and, specifically, in the field of CIs.

To begin studying the elementary cognitive factors and information processing operations that underlie robust speech perception and spoken word recognition skills, we have significantly broadened the conventional end-point product-based approach typically used in assessing outcomes and individual differences in CI-users by directly investigating basic fundamental verbal learning and memory processes in pre-lingually deaf CI users. In a recent study, we obtained some preliminary results using a well-known norm-referenced neuropsychological test of verbal learning and memory, the CVLT-II, which has been used extensively with several different clinical populations although it has not been used with prelingually deaf long-term CI-users. Only one other study has used the CVLT with hearing-impaired listeners. Heydebrand et al. (2007) administered the CVLT-II before implantation to a group of 44 post-lingually deaf adults who were candidates for cochlear implantation in order to predict their audiological and speech recognition outcomes 6 months after surgery. They found that a composite verbal learning score based on four CVLT sub-scores was a strong predictor (r = 0.82) of CNC speech recognition scores post-CI after controlling for CNC speech recognition at baseline before implantation. Their results suggest that verbal learning may play a central foundational role in speech and language outcomes following implantation because basic learning processes share common variance with the information processing tasks routinely used to measure speech perception and spoken language understanding.

The CVLT makes use of a multi-trial free recall (MTFR) methodology to obtain measures of several foundational cognitive processes used in verbal learning and memory such as repetition-based multi-trial free recall, primacy and recency, proactive and retroactive interference, memory decay in free- and cued-recall and organizational strategies in memory retrieval such as serial, semantic, and subjective clustering that are often routinely used by subjects to make items more accessible for retrieval in free recall tasks (Delis et al., 2000). In the MTFR procedure used in the CVLT-II, subjects are read a list of 16 familiar words (List A) five times to measure repetition learning processes and free recall. The 16 words on List A were selected from four semantic categories. After each list is presented, the subject is asked to recall as many of the study items from List A as possible in any order. This free recall procedure is followed for five learning trials with List A. Each presentation involves one repetition of List A followed by free recall of the List A items. After the fifth presentation and recall of List A items, subjects are presented with a new list of 16 words, List B, to measure proactive interference. List B also contains words from four semantic categories. After recall of List B, subjects are then asked to recall List A again (short-delay free recall) to measure retroactive interference produced by List B. Following a 20 min delay period during which the subject is engaged in a distractor task, the subject is asked to recall the words from List A again (long-delay free recall) to measure memory decay after a long delay interval.

The CVLT is a "high-yield" clinical test of verbal learning and memory processes that was designed to study repetition and organizational strategies used in free recall tasks. It produces a large amount of clinically relevant data in a short assessment time. The scores obtained from the CVLT provide important diagnostic information about basic core verbal learning and memory processing skills that are related to domains of executive functioning and cognitive control such as controlled attention, fluency-speed, abstraction, self-generated retrieval organization strategies and mental control processes as well as spoken word recognition, encoding, storage and retrieval strategies.

**Figure 5** shows a global overall summary of the multi-trial free recall scores for the five repetitions of List A obtained from two groups of subjects that we tested recently (see Chandramouli et al., Manuscript in preparation for further details). The left set of bars in **Figure 5** shows average free recall scores from a group of 20 prelingually deaf long-term CI users; the right set of bars shows the scores from a group of 24 normal-hearing controls who were matched in age and non-verbal IQ to the CI users. Both groups of subjects were part of a large ongoing research project dealing with executive function and cognitive control processes in long-term prelingual CI-users (Kronenberger et al., 2013, 2014). Each bar in **Figure 5** represents the average correct recall scores over the 16 items in each presentation of List A.

Inspection of **Figure 5** shows two main findings. First, both groups of subjects display robust repetition learning effects over the five presentations of List A. Second, the group of CI users shows consistently poorer total free recall scores after each repetition of List A compared to the NH controls. Looking only at the overall average measures of free recall performance shown in **Figure 5**, however, provides an incomplete picture of the underlying organizational and processing strategies that subjects use in carrying out this MTFR task with categorized word lists. In addition to providing total recall scores summed across all serial positions following the five repetitions of List A, the CVLT-II provides several other more detailed measures of verbal learning and memory processes obtained from separate analyses of the subcomponents of the serial position curve. Below we provide a brief summary of these findings, including (1) primacy and recency effects; (2) recall patterns and retrieval processes; (3) organizational strategies and semantic clustering; and (4) correlations with speech and language outcomes.

In terms of primacy and recency effects, **Figure 6** shows a summary of the free recall scores as a function of the five List A repetitions for the three subcomponents, primacy (first four items), pre-recency (middle eight items) and recency (last four items) portions of the conventional serial position curve. These subcomponents are thought to reflect fundamentally different storage and retrieval processes in carrying out free recall tasks (Atkinson and Shiffrin, 1971). Recall scores from the primacy portion of the serial position curve are shown in the left-hand panel, scores from the pre-recency (middle) portion are shown in the center panel, and scores from the recency portion are shown in the right-hand panel of **Figure 6**. Examination of **Figure 6** shows two patterns in free recall. First, free recall consistently improves for both groups of subjects in all three subcomponents of the serial position curve following each of the five repetitions of List A. Second, the differences observed in free recall between the two groups are not comparable across all three subcomponents of the serial position curve but are confined selectively to only the pre-recency and recency portions of the serial position curve as shown in the center and right-hand panels. It should be noted that study items on the CVLT-II, List A are always presented in the same order during the MTFR phase. The absence of any differences between the two groups in the primacy portion of the serial position curve suggests that early list items were successfully encoded and retrieved equivalently by both groups of subjects. In contrast, the differences observed between the two groups in the pre-recency and recency portions suggest disturbances in the component processing operations used in verbal rehearsal and retrieval strategies possibly reflecting weaknesses in active rehearsal and transfer of incomplete or underspecified phonological and lexical representations of the list items. 
These differences may also reflect the use of different retrieval strategies.
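The partition of the serial position curve described above (first four, middle eight, and last four list positions) can be made concrete with a short sketch. The region boundaries come from the text; the function name and the example trial are hypothetical.

```python
# Serial positions (1-based) of the 16 List A items, split as described
# in the text: primacy (first 4), pre-recency (middle 8), recency (last 4).
REGIONS = {
    "primacy": range(1, 5),
    "pre_recency": range(5, 13),
    "recency": range(13, 17),
}


def region_recall(recalled_positions):
    """Proportion of items recalled within each region of the list."""
    recalled = set(recalled_positions)
    return {
        name: sum(pos in recalled for pos in positions) / len(positions)
        for name, positions in REGIONS.items()
    }


# Hypothetical trial: list positions of the words one subject recalled.
trial = [1, 2, 3, 6, 9, 15, 16]
scores = region_recall(trial)
# primacy 3/4, pre-recency 2/8, recency 2/4
```

Because List A is always presented in the same order, the same position-to-region mapping applies on every learning trial, so per-region scores can be compared directly across the five repetitions.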

To gain further insights into the recall patterns and retrieval processes, we visualized the data as shown in **Figure 7**. On any given learning trial, there are two possible states of recall for a specific test item: the item is either recalled or not recalled by the participant. Over the five repetitions of List A, there are 32 possible ways an item can be recalled. Following Batchelder et al. (1997), we called each of these possibilities a "recall-event" and computed the frequency distribution of recall-event occurrences for the two groups after collapsing over subjects and items on List A. The results of this analysis are shown in **Figure 7** where each of the 32 possible recall-events is listed on the ordinate and denoted by a 5-bit binary string in which each bit represents a correct recall or failure to recall an item on any given learning trial. Thus, a "00000" denotes that the item was never recalled on any of the five learning trials, a "00111" denotes that an item was not recalled on Trials 1 and 2 but was recalled on Trials 3, 4, and 5, and a "11111" denotes that an item was recalled on every trial, and so on for the remaining recall-events. The probability of each of the 32 recall-events is displayed on the abscissa separately for each of the two groups of subjects.
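The recall-event coding lends itself to a compact illustration. The sketch below enumerates the 2^5 = 32 possible events and tabulates their relative frequencies after collapsing over subject-item pairs; the function names and input data are hypothetical and are not the authors' analysis code.

```python
from collections import Counter
from itertools import product

# All 32 possible recall-events: "00000", "00001", ..., "11111".
ALL_EVENTS = ["".join(bits) for bits in product("01", repeat=5)]


def recall_event(trial_outcomes):
    """Encode five per-trial outcomes (truthy = recalled) as a 5-bit string."""
    return "".join("1" if recalled else "0" for recalled in trial_outcomes)


def event_distribution(outcomes):
    """Relative frequency of each recall-event over subject-item pairs."""
    counts = Counter(recall_event(o) for o in outcomes)
    total = len(outcomes)
    return {event: counts[event] / total for event in ALL_EVENTS}


def trials_before_first_recall(event):
    """Number of '0' bits preceding the first '1', or None if never recalled."""
    first = event.find("1")
    return None if first == -1 else first


# Hypothetical outcomes for four subject-item pairs.
outcomes = [(1, 1, 1, 1, 1), (0, 0, 1, 1, 1), (0, 0, 0, 0, 0), (1, 1, 1, 1, 1)]
distribution = event_distribution(outcomes)
```

Here `distribution["11111"]` is 0.5 and `distribution["00000"]` is 0.25; plotting one such distribution per group would give a display of the kind described for Figure 7.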

This recall-event analysis provides useful information about the temporal processing dynamics of item recall during repetition learning of items on List A. In **Figure 7**, we observed peaks at recall events such as "11111," "01111," "00111," and "00011" which suggests that, on the whole, once encoded and learned, items tend to be recalled in each trial. We also observe in **Figure 7** a peak at "00000" for CI subjects. This means that the CI users were much more likely than controls to have never recalled an item over the five repetition learning trials (event "00000"). CI users had a 7.5% probability of never recalling an item (a small portion of which is caused by misheard intrusions), and controls displayed only a 1.3% chance of never recalling an item. In addition, the CI users also required more trials before successfully recalling an item for the first time (i.e., more events in which zero bits precede the first occurrence of a "1" bit). Given that we are dealing with a group of subjects who may have inherent differences in audibility and resolution of the fine acoustic-phonetic details of speech inputs, it would be safe to say that many of the observed differences may come about due to differing ease of early sensory encoding or item registration between the two groups. However, the CI users were at ceiling on a separate auditory identification task using all of the test items in the two lists at the end of the CVLT test protocol. Moreover, all of the test items on both List A and List B were administered in the quiet. When we did an item analysis of the error responses, we did not observe any particular word or words that accounted for this trend.

While the previous finding about differences in encoding might not be surprising to many, what was interesting to us was the observation that, given that an item was recalled once, the CI users were more likely (25.85%) than controls (20.39%) to miss an item on the next trial. In carrying out this analysis, we considered only recall-events where an item was recalled at least once within the first four trials before calculating this percentage, and then we analyzed how likely they were to fail on future trials. Finding differences in these recall patterns in CI users compared to the controls suggests retrieval differences, especially if all-or-none models of encoding and retrieval are assumed. This analysis followed earlier efforts by Batchelder and Riefer (1986), Batchelder et al. (1997) and Riefer et al. (2002) who have used Multinomial Processing Tree (MPT) models to quantitatively estimate underlying process-measures using such recall-event patterns. We report just the qualitative results here for now using percentages and leave more accurate and quantitative estimates of encoding and retrieval probabilities by generating MPT models to future work. Chandramouli et al. (Manuscript in preparation) provide additional converging evidence that retrieval differences also exist between the groups. We suggest that these differences can be traced back to differences in the long-term developmental histories and early experiences and language processing activities of the two groups.
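One way to read the "miss on the next trial" percentage is as a conditional proportion over adjacent trial pairs: of all cases in which an item was recalled on one of Trials 1 to 4, how often was it not recalled on the following trial? The sketch below implements that reading only. It is an assumption about the computation, since the text does not give the exact procedure, and the event strings are hypothetical.

```python
def miss_after_recall_rate(events):
    """Proportion of recalled-then-missed transitions in 5-bit recall-events.

    For every adjacent trial pair (t, t + 1) in which the item was recalled
    on trial t, count whether it was missed on trial t + 1. Events with no
    recall in the first four trials contribute nothing to the denominator.
    """
    opportunities = 0
    misses = 0
    for event in events:
        for current, following in zip(event, event[1:]):
            if current == "1":
                opportunities += 1
                if following == "0":
                    misses += 1
    if opportunities == 0:
        raise ValueError("no item was recalled on any of the first four trials")
    return misses / opportunities


# Hypothetical recall-events for four subject-item pairs.
events = ["11111", "10111", "00000", "00011"]
rate = miss_after_recall_rate(events)  # 1 miss out of 8 opportunities
```

Under an all-or-none view of encoding, an item that has been recalled once is assumed to be stored, so a later miss is attributed to retrieval rather than encoding; that is why a higher rate in one group points toward retrieval differences.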

Next, we explored participants' organizational strategies and semantic clustering. In free-recall or list-learning experiments using categorized word lists, the order in which participants recall items can be used to infer the type of organizational strategies that are used during encoding and retrieval. A semantic organizational strategy is observed when there is a higher probability of recalling a sequence of items in succession that are from the same semantic category. The CVLT-II quantifies this value by using a list-based semantic clustering index (Delis et al., 2000; Stricker et al., 2002). To obtain this measure, first the number of response clusters observed in each trial is tallied: whenever there is a correct recall followed immediately by another correct recall, and both are from the same semantic category, the tally increases by one. The chance-adjusted semantic clustering indices for the two groups of subjects are plotted in **Figure 8**. The semantic clustering index displayed here is a simple difference between the number of observed clusters and the number of clusters expected by chance for the observed total recall length. A positive difference indicates that the observed semantic clustering is higher than that expected by chance. In **Figure 8**, we can clearly see that the NH controls are using semantic strategies for organizing items in their memory. Moreover, their use of semantic clustering increases with every successive trial even after the number of items recalled by the group stops increasing as they approach ceiling. In contrast, however, the use of semantic clustering strategies by the CI users remains at chance across the five learning trials. While a part of this result has to do with the higher incidence of intrusion errors that reduce the number of clusters observed for the CI-group, the results clearly show that the CI users as a group use semantic clustering much less often than the NH controls.
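The tallying rule and the chance adjustment can be sketched as follows. The observed-cluster count follows the description in the text. The chance expectation is an assumption: it uses the list-based form (r − 1)(m − 1)/(N − 1), where r is the number of correct recalls, m the number of words per category, and N the list length, which should be checked against Stricker et al. (2002) before any real use. The word list below is invented and is not the CVLT-II stimulus set.

```python
def observed_clusters(recalled, category_of):
    """Count adjacent pairs of correct recalls from the same category.

    Words missing from `category_of` are treated as intrusions; they never
    form a cluster and they break up any cluster they interrupt.
    """
    count = 0
    for first, second in zip(recalled, recalled[1:]):
        both_correct = first in category_of and second in category_of
        if both_correct and category_of[first] == category_of[second]:
            count += 1
    return count


def semantic_clustering_index(recalled, category_of, words_per_category=4):
    """Observed clusters minus the number expected by chance (assumed form)."""
    list_length = len(category_of)
    n_correct = sum(word in category_of for word in recalled)
    expected = max(n_correct - 1, 0) * (words_per_category - 1) / (list_length - 1)
    return observed_clusters(recalled, category_of) - expected


# Invented 16-word list: four categories with four words each.
category_of = {
    f"{category}{i}": category
    for category in ("animal", "tool", "fruit", "vehicle")
    for i in range(1, 5)
}

# Hypothetical recall protocol; "chair" is an intrusion.
recalled = ["animal1", "animal2", "tool1", "chair", "tool2",
            "fruit1", "fruit2", "fruit3"]

clusters = observed_clusters(recalled, category_of)      # 3 clusters
index = semantic_clustering_index(recalled, category_of)  # 3 - 6*3/15
```

In this example the intrusion separates `tool1` from `tool2` and removes a cluster that would otherwise have been counted, which illustrates how a higher intrusion rate can lower the observed tally.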

Finally, we investigated correlations with speech and language outcomes. In addition to the MTFR measures of verbal learning and memory using the CVLT-II, as part of a large project on individual differences in outcomes following long-term CI use, we also administered a battery of speech and language and executive function measures to assess the strengths, weaknesses, and milestones in these two groups of subjects (see Kronenberger et al., 2014). Measures of speech and language included conventional tests of receptive vocabulary, open-set spoken word recognition, sentence perception, non-word repetition as well as several indexical processing tasks such as regional dialect categorization and non-native speaker ratings. Measures of executive functioning included neuropsychological tests such as digit span, Stroop color-word naming, number–letter switching, retrieval fluency, coding copy, visual matching, and concept formation. To investigate the relations between a subset of measures obtained from the CVLT-II (total words correctly recalled, learning slope over the five List A repetition trials and the average semantic clustering index) and the speech and language and executive function scores, we carried out a series of simple bivariate correlations. CVLT total words recalled correlated significantly (p < 0.05) with DKEFS number–letter switching (r = −0.50), Stroop color-word naming (r = 0.46) and WISC coding (r = 0.50). CVLT learning slope was correlated with fragmented visual sentence recognition (r = 0.61) and non-word repetition of syllables (r = 0.45). CVLT average semantic clustering was correlated with Stroop color-word naming (r = 0.38), non-word repetition (r = 0.42), and recognition of keywords in foreign-accented PRESTO sentences (r = 0.37). These initial findings provide converging evidence of associations between measures of verbal learning and memory obtained from the CVLT-II and measures of executive functioning and speech perception in long-term CI users, suggesting that the same elementary information processing operations are shared by all these sets of measures.
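Two of the CVLT-II summary measures entered into these correlations, total words recalled and learning slope, can be illustrated with a short sketch. The slope here is an ordinary least-squares slope of recall on trial number; treating that as the CVLT-II's slope definition is an assumption, and the recall counts are hypothetical.

```python
def total_recalled(trial_totals):
    """Total words correctly recalled summed over the learning trials."""
    return sum(trial_totals)


def learning_slope(trial_totals):
    """Least-squares slope of words recalled against trial number (1..n)."""
    n = len(trial_totals)
    trials = range(1, n + 1)
    mean_trial = sum(trials) / n
    mean_recall = sum(trial_totals) / n
    numerator = sum(
        (t - mean_trial) * (y - mean_recall) for t, y in zip(trials, trial_totals)
    )
    denominator = sum((t - mean_trial) ** 2 for t in trials)
    return numerator / denominator


# Hypothetical words recalled on the five List A trials for one subject.
subject_trials = [5, 7, 9, 11, 13]
total = total_recalled(subject_trials)  # 45 words
slope = learning_slope(subject_trials)  # gains 2 words per trial
```

One such pair of values per subject, together with the average semantic clustering index, would then be correlated with each speech, language, and executive function score across the group.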

# THEORETICAL AND CLINICAL IMPLICATIONS

Some profoundly deaf children with CIs do extremely well on traditional clinically based speech and language outcome measures while other children have much more difficulty after they receive their CIs. The enormous variability in outcome and benefit following cochlear implantation is recognized as a significant clinical problem in the fields of pediatric hearing loss, otology, and clinical audiology, although it has not received adequate attention by clinicians or research scientists working on CI outcomes in the past. Until we are able to obtain a much better understanding of the underlying early sensory and cognitive basis of individual differences in outcomes, we will continue to face significant challenges in developing new approaches for diagnosis, treatment, and especially the early identification of deaf children who may be at high risk for poor speech and language outcomes after implantation. New fundamental knowledge about the underlying elementary sensory and cognitive processes that contribute to the observed variability in speech and language outcomes will also play an important role in developing novel robust interventions following implantation in terms of selecting specific methods for habilitation and treatment that are specifically targeted for an individual child based on his or her strengths, weaknesses and milestones. We have now identified the locus of two areas of weakness in neurocognitive functioning that may underlie variability in speech and language outcomes: (1) basic domain-general learning abilities, specifically, explicit and implicit serial learning; and (2) the organizational processes and retrieval strategies used in verbal learning and memory in free recall of categorized lists of spoken words. The new findings on organizational processes in free recall of categorized word lists obtained from the CVLT-II suggest that semantic clustering strategies are significantly compromised in long-term CI users who show little evidence of making efficient use of semantic similarity relations among words to facilitate retrieval of items from long-term memory.

Many deaf children with CIs may have other comorbidities and/or disturbances in basic neurocognitive processes that serve as the foundation for the information processing systems used in spoken language processing. These comorbidities and disturbances appear to be, at least in part, secondary to their profound hearing loss and delay in language development (Conrad, 1979). A period of auditory deprivation during critical developmental periods before implantation affects sensory and cognitive development in a variety of ways (Luria, 1973). Differences resulting from both deafness and subsequent neural reorganization and plasticity of multiple brain systems may therefore be responsible for the enormous variability observed in speech and language outcome measures following implantation. Without knowing what specific underlying neurobiological and neurocognitive factors are responsible for the individual differences in speech and language outcomes, it is difficult to recommend an appropriate and efficacious approach to habilitation and speech-language therapy after a child receives a CI. More importantly, the deaf children who are performing poorly with their CIs are not a homogeneous group and may differ in numerous ways from one another, reflecting dysfunction of multiple brain systems associated with congenital deafness and profound hearing loss. From a clinical perspective, it seems very unlikely that an individual child will be able to achieve optimal speech and language benefits from his/her CI without knowing why the child is having speech and language problems and which particular neurocognitive domains underlie these problems.

In addition to the earlier findings reported in this paper on explicit and implicit sequence learning and memory processes using the Simon memory game and the new more recent results obtained on verbal learning and memory using the CVLT-II to study multi-trial free recall strategies, we have also carried out a number of other studies over the past 15 years on the ISR skills of deaf children with CIs using traditional measures of digit span as well as novel measures of non-word repetition, talker discrimination, and regional dialect categorization (Tamati and Pisoni, 2015). Although all of these behavioral tasks use quite different experimental procedures and methodologies and measure somewhat different information processing skills when looked at superficially, there are several elementary components in common across these tasks that provide some important new insights into the underlying processing architecture and mechanisms that appear to be responsible for the delays and deficits observed in speech and language and executive functioning in this clinical population. When all of our findings are considered together, a consistent pattern begins to emerge suggesting a process-based explanation for the differences observed between deaf children with CIs and age-matched NH children and for the individual differences and variability observed in outcomes. This processing-based account is mechanistic in nature involving the rapid encoding of item and order information in speech and episodic context of the encoding conditions.

One of the most important and critical components underlying speech and language processing is the early encoding, storage, and use of item and order information and episodic contexts in representations and processing of spoken language (Page and Norris, 2009a). Regardless of whether we are considering word recognition, sentence perception or comprehension of sequences of meaningful sentences, sequencing and the episodic encoding of item and order information is central to all aspects of spoken language processing. We propose that the initial registration and processing of item and order information and encoding of episodic context is significantly compromised in this clinical population (Conway et al., 2009) and that this domain-general impairment in basic sequential processing skills creates cascading effects on later higher-order speech and language processing operations used in rapid phonological coding, word recognition, lexical access, verbal retrieval, syntactic parsing, and comprehension. Deficits in registration and early encoding of the episodic context and fine acoustic-phonetic details of speech are observed across the board in a wide range of different language processing tasks including open-set word recognition, sentence recognition in quiet and noise and nonword repetition as well as indexical processing tasks such as talker discrimination and recognition, regional dialect classification and judgments of speech quality and speech intelligibility. All of these processes rely on the registration, early encoding, storage, retrieval and processing of highly detailed memory representations that preserve item and order information in sequential patterns.
Although we only have some intuitions and tentative hypotheses at this time, we believe it is very likely that the core deficits in all of these ISR tasks may reflect more basic elementary impairments and deficits in the fine episodic encoding of context and environmental conditions at the time of acquisition which attenuate and often prevent the efficient registration of highly detailed phonetic and sublexical representations of spoken words in isolation and in sentences.

Much of the clinical research carried out on CIs since they became widely adopted as the standard of care for profoundly deaf children has been intellectually isolated from the mainstream of current research and theory in the fields of neuroscience, cognitive psychology and developmental neuropsychology. As a consequence, the major clinical research issues have been very narrowly focused on speech and language outcomes and efficacy of cochlear implantation as a medical treatment for profound hearing loss. Relatively little basic or clinical research has investigated the elementary information processing operations and components—the building blocks of cognition that underlie the enormous individual differences and variability routinely observed in measures of the effectiveness of CIs. Moreover, very few studies have attempted to identify early neurocognitive predictors of outcome and benefit or to systematically assess the effectiveness of specific neurocognitive interventions or habilitation strategies after implantation. As discussed earlier, although variables like age of implantation, communication mode, family and device factors, and various audiological and hearing-related variables clearly play an important role in understanding the nature of variation in speech and language outcomes, we believe that these factors alone are only part of the story. Additional sources of variance, such as those arising from basic processes of learning, memory, and cognition, are needed to fully understand the underlying mechanisms that contribute to successful speech and language outcomes following cochlear implantation.

We believe these are important new directions for clinical research on CIs in the future, directions that draw heavily on basic research, theory, and methodology in the fields of cognition and cognitive science that represent the intersection of several closely related scientific disciplines that are all concerned with brain plasticity, neural development, learning and memory, attention, executive function and cognitive control. As Carol Flexer observed a few years ago, "Hearing loss is primarily a brain issue, not an ear issue" (Flexer, 2011). Until we begin to recognize the important downstream contributions of central auditory and cognitive factors and the role of the entire information processing system working together, researchers working on CIs will continue to carry out the same kind of conventional outcome studies while expecting different results, and those studies will never lead to new advances in evidence-based interventions for deaf children who are doing poorly with their CIs.

# AUTHOR CONTRIBUTIONS

DP wrote the first draft; WK edited and rewrote several sections; SC carried out the research reported in the "Explicit Sequence Memory Spans" Section; CC edited and rewrote several sections and was the lead researcher for the studies reported in the "The Puzzle about Outcomes following Cochlear Implantation" Section.

# FUNDING

This research was supported by grants from the National Institutes of Health–National Institute on Deafness and Other Communication Disorders: R01 DC000111, R01 DC009581, and R01 DC012037.

## REFERENCES

Luria, A. R. (1973). The Working Brain. New York, NY: Basic Books.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Pisoni, Kronenberger, Chandramouli and Conway. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Lexical Influences on Spoken Spondaic Word Recognition in Hearing-Impaired Patients

Annie Moulin<sup>1, 2, 3</sup>\* and Céline Richard<sup>4, 5</sup>

<sup>1</sup> INSERM, U1028, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France, <sup>2</sup> CNRS, UMR5292, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France, <sup>3</sup> University of Lyon, Lyon, France, <sup>4</sup> Otorhinolaryngology Department, Vaudois University Hospital Center and University of Lausanne, Lausanne, Switzerland, <sup>5</sup> The Laboratory for Investigative Neurophysiology, Department of Radiology and Department of Clinical Neurosciences, Vaudois University Hospital Center and University of Lausanne, Lausanne, Switzerland

Top-down contextual influences play a major part in speech understanding, especially in hearing-impaired patients with deteriorated auditory input. Those influences are most obvious in difficult listening situations, such as listening to sentences in noise but can also be observed at the word level under more favorable conditions, as in one of the most commonly used tasks in audiology, i.e., repeating isolated words in silence. This study aimed to explore the role of top-down contextual influences and their dependence on lexical factors and patient-specific factors using standard clinical linguistic material. Spondaic word perception was tested in 160 hearing-impaired patients aged 23–88 years with a four-frequency average pure-tone threshold ranging from 21 to 88 dB HL. Sixty spondaic words were randomly presented at a level adjusted to correspond to a speech perception score ranging between 40 and 70% of the performance intensity function obtained using monosyllabic words. Phoneme and whole-word recognition scores were used to calculate two context-influence indices (the j factor and the ratio of word scores to phonemic scores) and were correlated with linguistic factors, such as the phonological neighborhood density and several indices of word occurrence frequencies. Contextual influence was greater for spondaic words than in similar studies using monosyllabic words, with an overall j factor of 2.07 (SD = 0.5). For both indices, context use decreased with increasing hearing loss once the average hearing loss exceeded 55 dB HL. In right-handed patients, significantly greater context influence was observed for words presented in the right ears than for words presented in the left, especially in patients with many years of education. 
The correlations between raw word scores (and context influence indices) and word occurrence frequencies showed a significant age-dependent effect, with a stronger correlation between perception scores and word occurrence frequencies when the occurrence frequencies were based on the years corresponding to the patients' youth, showing a "historic" word frequency effect. This effect was still observed for patients with few years of formal education, but recent occurrence frequencies based on current word exposure had a stronger influence for those patients, especially for younger ones.

Keywords: speech perception, lexical influences, word occurrence frequency, hearing loss, aging, spoken word recognition, spondaic words, laterality

### *Edited by:*

Adriana A. Zekveld, VU University Medical Center, Netherlands

### *Reviewed by:*

Samira Anderson, University of Maryland, USA; Josée Lagacé, University of Ottawa, Canada

> *\*Correspondence:* Annie Moulin annie.moulin@cnrs.fr

### *Specialty section:*

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

*Received:* 29 September 2015 *Accepted:* 26 November 2015 *Published:* 23 December 2015

### *Citation:*

Moulin A and Richard C (2015) Lexical Influences on Spoken Spondaic Word Recognition in Hearing-Impaired Patients. Front. Neurosci. 9:476. doi: 10.3389/fnins.2015.00476

# INTRODUCTION

Speech perception in hearing-impaired patients involves not only the audibility of the speech material but also the entire process of reconstructing meaningful words from partial or deteriorated acoustic input resulting from hearing damage (Miller et al., 1951). This process is dependent on the patient's lexical knowledge (Wingfield et al., 1991; Pichora-Fuller, 2008; Krull et al., 2013), general cognitive ability (Benichov et al., 2012) and on the type of linguistic material used for speech tests (Boothroyd and Nittrouer, 1988; Olsen et al., 1997). Two types of linguistic influence can be distinguished: (1) The type of linguistic material used (e.g., syllables, monosyllabic words, multisyllabic words, or sentences), which has a stronger influence on speech perception as the stimulus becomes more complex (Miller et al., 1951; Boothroyd and Nittrouer, 1988; Olsen et al., 1997) and (2) lexical factors that are well-known to influence speech perception (Goldinger, 1996), such as word occurrence frequency, familiarity, phonological similarity, or age of word acquisition.

Numerous studies have indicated the benefit of contextual information for speech perception in noise (Wingfield et al., 1991; Pichora-Fuller et al., 1995; Pichora-Fuller, 2008). In high-context conditions (high-predictability sentences), contextual information can even compensate almost entirely for moderate hearing loss (Miller et al., 1951; Benichov et al., 2012) and is suggested to be even more beneficial to elderly listeners (Benichov et al., 2012). Most of these studies examined differences between the perception of isolated words and words embedded in sentences with different degrees of predictability, i.e., providing a much higher degree of contextual compensation than isolated words. Sublexical compensation (i.e., compensation at the word level) has also been shown for isolated monosyllabic words, for which older adults can compensate for a loss of word identification in noise by better use of lexical constraints (Boothroyd and Nittrouer, 1988; Nittrouer and Boothroyd, 1990) or in noise-vocoded speech conditions by providing greater exposure to the stimulus to increase word familiarity (Sheldon et al., 2008). Using the contextual influence indices devised by Boothroyd and Nittrouer (1988), the present study explores the main sources of variability in the perception of spondaic words measured in silence, a condition that is much easier for older hearing-impaired patients than the speech-in-noise tasks used in the literature and that is commonly used in audiology to evaluate hearing-impaired patients' speech perception. Speech perception scores can be influenced independently from hearing loss by differences in patients' use of contextual information and by lexical factors (primarily word occurrence frequency and phonological similarity) and the interaction between those lexical factors and patient characteristics.

Indeed, the word frequency effect is one of the strongest and most extensively demonstrated effects in written and, to a lesser degree, spoken word recognition. This effect has been examined using lexical decision tasks (Brysbaert et al., 2011a,b) in young, normal-hearing university students whose characteristics are far from those of the majority of patients usually encountered in audiology clinics, as Benichov et al. (2012) and others have noted. It is likely that in an elderly, hearing-impaired population, linguistic factors have stronger and more heterogeneous effects on word perception scores, especially when accounting for the patients' lexical knowledge and general cognitive abilities. In a task consisting of spoken word repetition in hearing-impaired subjects, the occurrence frequency of the acoustic and phonologic forms of the words, i.e., the spoken word occurrence frequency, is likely to better reflect the subject's relevant spoken word exposure. The greater predictive value of the spoken word occurrence frequency compared with the written occurrence frequency has been noted in written word recognition (Brysbaert and New, 2009). This difference has been attributed to a better match between the type of language material that participants usually read in psycholinguistics experiments and the language of television series and films rather than the more formal language and non-fiction texts represented in the written corpora (Brysbaert et al., 2011a,b). Indeed, the repetition of the spoken form of a word via its frequent availability in real-life situations is likely to aid in its recognition, especially among hearing-impaired older subjects with patchy neurosensory peripheral auditory information. According to this hypothesis, the greatest variability in spoken word recognition would depend on the words' occurrence frequencies at the time of the experiment, independently of the subject's age.
However, it could also be argued that older occurrence frequencies might be more relevant to older subjects because age of word acquisition is a predictor of word recognition, albeit a much weaker one than word frequency (Brysbaert et al., 2011a). The spondaic word lists commonly used to assess hearing-impaired patients' speech perception in France date back to the 1950s (Fournier, 1951; HAS, 2007; Legent et al., 2011). Indeed, principles that are still used in speech perception tests in audiology today (for a review: Wilson and McArdle, 2005) were developed in the 1930s (Fletcher and Steinberg, 1929), and Hudgins and Hawkins (1947) developed the first English spondaic word lists in the 1940s. The most important criterion for selecting words, according to Hudgins and Hawkins (1947), was homogeneity with regard to basic audibility, i.e., the words should yield equal perception scores when spoken at a constant level by a normal speaker. Hudgins and Hawkins (1947) suggested that a steeper slope of the performance intensity function reflected greater homogeneity among the words and better precision in graphically obtaining the 50% threshold. The first 42 spondaic word lists were later reduced to the 36 most familiar words by Hirsh et al. (1952). Those principles led Fournier (1951) to select French disyllabic words composed of two equally accented syllables and ending with a vowel sound for his disyllabic words corpus. For greater homogeneity and equivalent difficulty levels among lists, he chose only masculine nouns ending with vowels that were familiar in the spoken day-to-day vocabulary at the time, but he strongly emphasized his regrets about not having a French lexicon database of spoken occurrence frequencies (Fournier, 1951). Thus, because of the natural evolution of language over time, some words that were very frequent in the 1950s are less frequently used today (Michel et al., 2011).
This change in language over time provides the opportunity to investigate the hypothesis of a potential historical word frequency effect by testing the influence of several indices of word occurrence frequency (spoken and written frequencies from different periods, from 1900 to today) on speech perception scores in hearing-impaired patients.

The aim of this study was to explore context influence in spondaic word recognition scores obtained in silence using standard clinical linguistic material in a clinical population. The dependence of contextual influence on characteristics of the linguistic material (mainly word occurrence frequencies) and on patient characteristics such as age, ear tested (left vs. right), years of education and hearing loss was examined.

# MATERIALS AND METHODS

### Patients

One hundred sixty patients (75 women and 85 men, aged from 23 to 88 years, mean = 62.1) who were native French speakers and who presented for routine clinical ENT examinations were involved in this study. The patients underwent routine clinical examinations, including otoscopy, tympanometry, pure-tone audiometry at octave intervals from 250 to 8000 Hz and speech audiometry. The patients' number of years of education (YE) ranged from 7 to 17 (mean = 10.6 years) and was obtained from the highest diploma/degree reported by the patients. All of the patients experienced hearing loss after language acquisition, and none presented articulation problems or neurological problems. Most of the patients had noise-induced hearing loss and/or presbycusis (62%), 21% of them presented mixed conductive and sensorineural hearing loss, and 18% presented with sensorineural hearing loss of other origins. Hearing impairments were classified as mild (21–40 dB HL, mean = 31.6, n = 72), moderate (41–70 dB HL, mean = 53.0, n = 76) or severe (71–90 dB HL, mean = 77.8, n = 12), according to the International Bureau for Audiophonology guidelines<sup>1</sup>.

All of the data were anonymously collected, and the study was conducted in compliance with the Helsinki declaration pertaining to human research and the Good Clinical Practice Guidelines. The participants provided written informed consent and the protocol was approved by the French Ethical Committee for Participant Protection (CPP Sud-Est IV).

### Audiological Testing

### Pure-Tone Audiometry and Tympanometry

After a clinical otoscopy examination, the patients underwent pure-tone audiometry using an Interacoustics® AC 40 clinical audiometer in a soundproof booth. Air and bone conduction hearing thresholds, in decibel hearing levels (dB HL), were obtained at octave intervals from 250 to 8000 Hz. For each ear, a four-frequency (500 Hz, 1, 2, and 4 kHz) average pure-tone threshold (PTA) was obtained. Tympanometric measurements were taken using an air pressure from −600 to +300 daPa (Interacoustics® AA222).

### Spoken Spondaic Word Recognition

Triphonemic monosyllabic word lists currently used in French ENT practices (Lafon, 1964) were presented at several intensities to the patient at minimum steps of 5 dB to obtain the stimulus level corresponding to a phonemic score between 40 and 70% for each patient. Because we wanted to use the exact same presentation level for the monosyllabic words and the disyllabic words, we could not use a stimulus level associated with the 50% threshold; this threshold was quite difficult to obtain using a 5 dB step and presented challenges in terms of time and patients' fatigue. Indeed, because monosyllabic word slopes range from 5%/dB (in normal-hearing subjects) to 3%/dB in severely hearing-impaired patients (reviewed for several languages in Han et al., 2009), a minimum of 10–25% variability is expected around the 50% point for a 5 dB variation in stimulus level. Therefore, we chose to accept scores ranging from 40 to 70% (median = 57%, interquartile range at 10%) to obtain a disyllabic whole word score as far as possible from 0 to 100% to avoid floor and ceiling effects.

At this stimulus level, which was kept constant, 60 spondees taken from the Fournier disyllabic word corpus [common clinical material used in France (Collège National d'Audioprothèse, 1999; Legent et al., 2011; Richard et al., 2012) and recommended by the health authorities (HAS, 2007)] were presented in random order to the patients. More than 80% of the variance in the stimulus level was explained by the patients' PTA, thus showing a good adaptation of the stimulus level to the patients' hearing ability. The presentation level for the words averaged 5.6 dB HL (SD = 7.6) greater than the patient's average PTA. The pre-recorded words were presented monaurally to the subjects, who were seated in a soundproof booth, and sound levels were monitored using an Interacoustics AC40 audiometer. One ear (left or right) was chosen at random for each patient. The examinees responded verbally after each word presentation, and an experienced audiologist identified the patients' correct and incorrect responses.

### Linguistic Analysis of the Disyllabic Words

The 60 spondees used were extracted from the corpus of 400 spondaic words established by Fournier (1951). The linguistic characteristics of each word were obtained from the Lexique 3.8 database of more than 142,000 words in the French language, which was updated online in October 2012 (http://www.lexique.org) (New et al., 2004, 2007). Most of the words (55/60) were 4 or 5 phonemes long (with an average of 4.5).

### Occurrence Frequency Measures

Because not all occurrence frequency estimates are equally predictive (Brysbaert and New, 2009; Ferrand et al., 2010), the occurrence frequency of each spondee has been determined using different metrics, as available in the Lexique® database: the written frequency, based on written texts, and the spoken frequency, based on film subtitles (New et al., 2007). Because we were examining auditory word recognition, we considered the occurrence frequency of a spoken word to be the sum of the occurrence frequencies of each orthographic variant of the same phonological form, and we calculated the cumulative occurrence frequency of all the homophones of each word (for example, most plural forms in French are pronounced the same way as the singular forms; thus, the occurrence frequency of /dragon/ [the same word in English] would be the sum of the occurrence frequencies of /dragon/ and /dragons/). The highest occurrence frequency of all the homophones of each word was also obtained. The frequencies were log transformed, and frequencies lower than 0.01 word per million words (noted as 0.00 in the database) were given a log value of -2.5 (as in Ferrand et al., 2010).

<sup>1</sup> International Bureau for Audiophonology. Audiometric Classification of Hearing Impairments. In BIAP Recommendation n° 02/1 bis. Available online at: http://www.biap.org.
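The frequency measures described in this subsection can be illustrated with a minimal sketch. It assumes a base-10 logarithm (the text only says "log transformed"), and the per-million frequencies, the dictionary layout, and the function names are hypothetical, chosen only for illustration; they are not taken from the Lexique database.

```python
import math

FLOOR_LOG = -2.5   # log value assigned in the text to frequencies below 0.01 per million
MIN_FREQ = 0.01    # frequencies below this threshold are listed as 0.00 in the database


def log_frequency(freq_per_million):
    """Return log10 of a frequency (assumed base), or the floor value when it is below 0.01."""
    if freq_per_million < MIN_FREQ:
        return FLOOR_LOG
    return math.log10(freq_per_million)


def homophone_frequencies(variant_freqs):
    """Return (log cumulative frequency, log highest frequency) for one phonological form.

    variant_freqs maps each orthographic variant (homophone) to its frequency per million.
    """
    cumulative = sum(variant_freqs.values())
    highest = max(variant_freqs.values())
    return log_frequency(cumulative), log_frequency(highest)


# Hypothetical frequencies for the two spellings that share one spoken form.
dragon = {"dragon": 8.0, "dragons": 2.0}
cum_log, max_log = homophone_frequencies(dragon)
print(cum_log, max_log)
```

The cumulative measure treats every spelling of a spoken form as one exposure source, whereas the highest-frequency measure keeps only the most common spelling.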

Additionally, we used the word frequencies derived from the Google Books N\_gram database (Michel et al., 2011), which provides the frequency of a word's occurrence within published books according to publication year. The word frequencies corresponding to the sum of the frequencies of all of the homophones of each of the 60 spondees were extracted from the Google Books N\_gram Viewer, and a smoothing factor of 5 was used to obtain the word frequencies for the years 1900–2005 in 5-year steps.

For each word, the modification rate of the occurrence frequency for that period was calculated. According to this rate, the group of 60 words was split into two groups of 30: the first group ("older words") had a decreasing occurrence frequency over time (i.e., these words occurred frequently 50 years ago but are much less common now), whereas the second group ("newer words") comprised words with more stable or increasing occurrence frequencies over time. For each patient, both "older words" and "newer words" scores were obtained.
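The split into "older" and "newer" words can be sketched as follows. The text does not define the modification rate precisely, so this example assumes it is the least-squares slope of frequency against year, divided by the mean frequency; the four words and their frequency series are invented for illustration.

```python
def modification_rate(years, freqs):
    """Least-squares slope of frequency over time, relative to the mean frequency (assumed definition)."""
    n = len(years)
    mean_y = sum(years) / n
    mean_f = sum(freqs) / n
    cov = sum((y - mean_y) * (f - mean_f) for y, f in zip(years, freqs))
    var = sum((y - mean_y) ** 2 for y in years)
    return (cov / var) / mean_f


def split_older_newer(rates):
    """Rank words by rate and split them into two halves: lowest rates are 'older' words."""
    ranked = sorted(rates, key=rates.get)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]


years = list(range(1900, 2006, 5))  # 1900-2005 in 5-year steps, as in the text

# Hypothetical frequency series (per million words) with simple linear trends.
series = {
    "word_a": [10.0 - 0.05 * (y - 1900) for y in years],  # strongly decreasing
    "word_b": [5.0 for _ in years],                       # stable
    "word_c": [1.0 + 0.02 * (y - 1900) for y in years],   # increasing
    "word_d": [4.0 - 0.01 * (y - 1900) for y in years],   # mildly decreasing
}
rates = {word: modification_rate(years, freqs) for word, freqs in series.items()}
older, newer = split_older_newer(rates)
print(older, newer)
```

With 60 words the same ranking yields the two groups of 30 described above; dividing by the mean frequency keeps very frequent words from dominating the ranking.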

The word occurrence frequency measures for each word are detailed in the Supplemental Table 1.

### Phonological Similarity Measures

To measure the phonological similarities between the stimulus word used and different words in the French language, we used the Lexique® database to calculate the phonological neighborhood of each word, which consisted of the phonological neighbors obtained by substituting a phoneme and the neighbors obtained by deleting or adding a phoneme (Marian et al., 2012). The occurrence frequency of each neighbor was obtained in the same manner that was described above for the stimulus words. Several measurements were calculated using lab-created scripts to characterize each stimulus word: the phonological neighborhood density (the number of phonological neighbors per word) (Luce and Pisoni, 1998), and the high-frequency phonological neighborhood density, defined as the number of phonological neighbors with a higher occurrence frequency than the stimulus word.
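The authors' lab-created scripts are not shown, so the following is only a minimal sketch of how the two neighborhood measures could be computed. It assumes each word is coded as a string with one character per phoneme; the toy lexicon and its frequencies are hypothetical.

```python
def is_neighbor(a, b):
    """True if phoneme strings a and b differ by exactly one substitution, deletion, or addition."""
    if a == b:
        return False
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    if la == lb:
        # Same length: exactly one substituted phoneme.
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: removing one phoneme from the longer form must give the shorter.
    short, long_ = (a, b) if la < lb else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighborhood(word, lexicon):
    """Return (density, high-frequency density); lexicon maps phoneme string -> frequency."""
    neighbors = [w for w in lexicon if is_neighbor(word, w)]
    density = len(neighbors)
    high_freq = sum(1 for w in neighbors if lexicon[w] > lexicon[word])
    return density, high_freq


# Hypothetical pseudo-phonemic lexicon with frequencies per million words.
lexicon = {"bato": 20.0, "gato": 35.0, "rato": 1.5, "bat": 40.0, "balto": 0.5, "kado": 60.0}
density, high_freq = neighborhood("bato", lexicon)
print(density, high_freq)
```

In this toy lexicon, "kado" differs from "bato" by two phonemes and is therefore not counted as a neighbor.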

### Acoustic Analysis of Disyllabic Words

To rule out a potential confounding factor, i.e., the possibility of an interaction between a word's acoustic spectrum and its linguistic parameters, the spectral acoustic pattern of each spondee was obtained from the recorded versions of the words used, and the root mean square (RMS) amplitude was calculated from 125 Hz to 8 kHz per octave frequency for each of the 60 words used. No statistically significant correlation was obtained between the words' acoustic spectra and the linguistic factors.

### Data Analysis

### Word Score Measurements

All 160 of the patients tested were included in the analysis. Three of the patients were tested at a level corresponding to a monosyllabic score greater than 70% (88% for one patient). Because the disyllabic word scores of these three patients were considerably less than 100%, we decided to keep them in the analysis. The monosyllabic word scores were used only to determine the stimulation level and were not part of the statistical analysis. For the disyllabic words, the phoneme scores were based on 268 items, the syllable scores on 120 items and the whole-word scores on 60 items.

Because percentage-type variables violated several parametric assumptions (Studebaker, 1985), all of the percentage recognition scores were transformed into rationalized arcsine-transformed scores (or rau scores), which were specifically designed for speech recognition scores (Studebaker, 1985; Sherbecoe and Studebaker, 2004) so that a score of 50 raus corresponds to a percentage score of 50%, and both rau and percentage scores are very close to each other when percentage scores are between 15 and 85%. Although the stimulus intensities were chosen to obtain word percentage scores for each ear that are as close as possible to the middle range (25–75%), thus avoiding floor and ceiling effects, the scores for individual words (calculated across several patients) could exceed 90%. Because most rau units were very close to the percentage scores, only the rau scores are mentioned in the remainder of this manuscript, and percentage scores are only mentioned when they are particularly relevant.
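As an illustration of the transform, the sketch below uses the commonly cited form of the rationalized arcsine transform from Studebaker (1985), with X correct items out of N. The function name and the example counts are ours; the paper's own computation is not shown.

```python
import math


def rau(correct, total):
    """Rationalized arcsine units for `correct` items out of `total`.

    theta = asin(sqrt(X / (N + 1))) + asin(sqrt((X + 1) / (N + 1)))
    rau   = (146 / pi) * theta - 23
    """
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.0 / math.pi) * theta - 23.0


# Hypothetical whole-word counts out of 60 items.
for n_correct in (0, 30, 45, 60):
    print(n_correct, round(rau(n_correct, 60), 1))
```

A score of 30 out of 60 maps to 50 rau, and the transform stretches the scale near 0 and 100%, where the variance of percentage scores shrinks.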

### Context Effect Measurements

To evaluate the effects of context on word recognition, we used both the word-to-phoneme score ratios (W/Pho) and the "j factor", which was defined by Boothroyd and Nittrouer (1988) and is described in Equation (1):

$$P_{w} = P_{p}^{\,j} \tag{1}$$

with P<sub>w</sub> representing the probability of whole-word recognition and P<sub>p</sub> representing the probability of recognition of a part of the word (in this case, the phonemes). The j factor varies between 1 (recognition of a single part is sufficient for whole-word recognition) and n (recognition of all the different parts [here, phonemes] is necessary for whole-word recognition). Hence, the j factor can be interpreted as the number of independently perceived constituents of a word, with j approaching 1 as the contextual influence increases. Taking the logarithm of both sides of Equation (1) gives:

$$j = \frac{\log(\text{word score})}{\log(\text{phoneme score})} \tag{2}$$

Similar to the method described by Boothroyd and Nittrouer (1988), the j factor was calculated according to Equation (2) only for percent scores between 5 and 95% to avoid extreme values, i.e., for 154 of 160 patients. Because the j factor distribution was not Gaussian, statistical tests were performed on 1/j, which followed a Gaussian distribution according to the Shapiro-Wilk test (Shapiro and Wilk, 1965).
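Solving Equation (1) for j gives j = log(P<sub>w</sub>)/log(P<sub>p</sub>), which can be sketched as below. Scores are expressed as proportions, the 5–95% restriction follows the text, and the function name and example scores are hypothetical.

```python
import math


def j_factor(word_score, phoneme_score, low=0.05, high=0.95):
    """Return j = log(Pw) / log(Pp), derived from Pw = Pp ** j.

    Scores are proportions. Returns None when either score lies outside
    [low, high], mirroring the 5-95% restriction described in the text.
    """
    for p in (word_score, phoneme_score):
        if not (low <= p <= high):
            return None
    return math.log(word_score) / math.log(phoneme_score)


# Hypothetical patient: 80% of phonemes and 64% of whole words repeated correctly.
j = j_factor(0.64, 0.80)
print(j, None if j is None else 1.0 / j)
```

Because 0.80 squared is 0.64, this hypothetical patient behaves as if each word contained two independently perceived units, even though the words are 4 or 5 phonemes long. The reciprocal 1/j is the quantity on which the statistical tests were run.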

To avoid the caveats linked to the tendency for the ratio W/Pho [Equation (3)] to fall as scores approach 100%, both word and phoneme scores were first converted into rau scores:

$$\text{W/Pho} = \frac{\text{word score (in rau)}}{\text{phoneme score (in rau)}} \tag{3}$$

Because W/Pho did not follow a Gaussian distribution, an arcsine transformation was used to meet the Gaussian distribution requirement for statistical analysis. W/Pho increases (up to 100%) as contextual influence increases: indeed, the more we rely on contextual information, the more we tend to complete patchy sensory information and to increase the number of whole words repeated rather than constituent parts alone (i.e., syllables and phonemes).

### Statistical Analysis

The Gaussian distribution of the data was assessed for each variable using the Shapiro-Wilk test (Shapiro and Wilk, 1965). Pearson's correlations and multiregression analysis were performed for the rau scores obtained for each word across several groups of patients and the linguistic and acoustic characteristics of each word. Correlation coefficients were compared using Fisher's Z-transformed scores (Steiger, 1980). Analysis of variance (ANOVA) and analysis of variance for repeated measures (ANOVA-R) were performed, and the results are presented as the mean and standard deviation (SD). The effect size was measured using Cohen's d statistic, η<sup>2</sup> for ANOVAs (Levine and Hullett, 2002) and correlation coefficients (Cohen, 1992). Following recent statistical guidelines (see for example Asendorpf et al., 2013; Glickman et al., 2014), we used a false discovery rate approach to the problem of multicomparison (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001), with a p corrected value of 0.02 (ns for non-significant), to avoid the inflated type II error rate resulting from more classical multicomparison adjustments such as the Bonferroni correction. For the correlational analysis involving hundreds of correlations, random permutation tests (Sherman and Funder, 2009) were used to determine whether the sets of significant correlations observed were due to chance. All of the statistical analyses were performed using R® software, version 2.13.1 (R Development Core Team, 2008), and Statistica® software (StatSoft®).
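The comparison of correlation coefficients can be illustrated for the simplest case, two correlations from independent groups. This is a sketch under that assumption only; it does not cover the dependent-correlation procedures also treated by Steiger (1980). The function name is ours, and the inputs are the W/Pho versus PTA correlations reported in the Results for the high-YE and low-YE groups.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher z test for the difference between two independent Pearson correlations.

    Returns (z, two-sided p-value under the standard normal distribution).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)           # Fisher r-to-z transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))   # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))            # two-sided normal tail probability
    return z, p


# High-YE group: r = -0.35, n = 83; low-YE group: r = -0.02, n = 77.
z, p = compare_independent_correlations(-0.35, 83, -0.02, 77)
print(round(abs(z), 2), round(p, 3))
```

The absolute z value obtained this way is close to the z = 2.1 reported in the Results, which suggests that the independent-groups form is at least compatible with that comparison.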

The analysis involved 2 different approaches. In the first approach, for each word, scores were calculated across the entire population or across different subgroups of patients. In the second approach, scores were calculated for each patient across the 60 words (or across the two groups of 30 words termed "newer words" and "older words"). Several types of patient subgroups were defined within the total population according to hearing loss and/or age and/or years of education (YE) and/or ear tested (right or left) and/or gender. Due to the evolution of education possibilities within the last 80 years in the country, the number of YE showed a decrease as age increased, especially as the same diploma requires more years of education nowadays than 60 years ago. Hence, YE was treated as a dichotomous variable, with a high YE group and a low YE group. There was no interaction between YE group and ear tested, or PTA, or gender. No statistically significant interactions were detected between age groups, PTA groups, gender or ear tested (χ<sup>2</sup> tests are provided in the Supplemental Table 2).

# RESULTS

### Contextual Influence on Patient Scores

As expected, the ANOVA-R and pairwise comparisons between the different scores (monosyllabic word scores and disyllabic word scores calculated in phonemes, syllables and whole words) showed significant differences: F(3, 477) > 370, p < 0.0001, with the highest scores for disyllabic phonemic scores (mean = 79 rau) and the lowest scores for monosyllabic phonemic scores (mean = 57 rau).

The mean j factor was 2.07 (SD = 0.5), and the mean W/Pho = 77.6% (SD = 8.9), with a significant correlation between the two (r = 0.57, p < 0.001). No statistically significant correlations were obtained between j (or W/Pho) and the patients' ages or PTA.

Disyllabic word scores decreased significantly with increasing PTA (r = −0.27, p < 0.001). However, in our population, no significant relationship was observed between word scores and age, and only a weak relationship was obtained between age and PTA (r = 0.23, p < 0.005).

When the population was divided into two groups according to YE, W/Pho decreased significantly as PTA increased (r = −0.35, p < 0.001, n = 83) in the high-YE group, with a significant difference compared to the low-YE group (r = −0.02, p = ns, n = 77, z = 2.1, p < 0.05). For all of the word scores and for W/Pho, there was a significant main effect of PTA and a significant interaction between YE and PTA groups. W/Pho was significantly greater in low-YE patients with mild hearing loss and significantly lower for patients with severe hearing loss (**Figure 1**). The results of the different ANOVAs are summarized in **Table 1**. There were no statistically significant influences of YE or PTA on the j factor, although a tendency toward greater contextual influence in the high-YE patient group vs. the low-YE group could be identified [F(1, 146) = 2.7, p = 0.10].

FIGURE 1 | Mean contextual influence index (W/Pho in arcsine units) as a function of PTA groups (with hearing loss levels specified in dB HL) for the high-YE group (green triangles) and the low-YE group (blue dots). The arrows with + and − show how the contextual influence varies. Only significant differences between the YE groups according to the PTA group are shown, with \*p < 0.05. The results obtained for the worst-PTA group and the high-YE group differed significantly (p < 0.01) from those of the three other PTA groups. The number of patients in each group is shown in blue (low-YE) and in green (high-YE).

TABLE 1 | The results of the three ANOVAs performed for the phonemic scores, word scores, and the W/Pho index as a function of the number of years of education (YE, two groups) and the four PTA groups, with Df indicating the number of degrees of freedom, *F* the *F*-values, MSE the mean square error (in gray), *p*-values and the η<sup>2</sup> measure indicating effect size, and ns indicating non-statistically significant values.
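The comparison between the two YE-group correlations reported above (r = −0.35, n = 83 vs. r = −0.02, n = 77) can be reproduced approximately with a Fisher r-to-z test for independent samples. The text does not state which test was used, so the choice of this test is an assumption; the function name is illustrative.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two correlations from independent samples.

    Returns the z statistic and the two-sided p-value.
    """
    z1 = math.atanh(r1)  # Fisher transform of the first correlation
    z2 = math.atanh(r2)  # Fisher transform of the second correlation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# Values reported in the text for the high-YE and low-YE groups.
z, p = compare_independent_correlations(-0.35, 83, -0.02, 77)
print(f"|z| = {abs(z):.2f}, two-sided p = {p:.3f}")
```

Under this assumption, the statistic comes out close to the reported z = 2.1 with p < 0.05.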

The influence of the ear tested was analyzed in the subgroup of 150 right-handed subjects and was not statistically significant for any word score [F(3,146) = 0.8, p = ns] or for PTA or age. However, there was a statistically significant interaction between YE and the ear tested, with several context indices: F(1, 142) = 7.74 (p = 0.006, η <sup>2</sup> = 0.05) for the j factor and F(1, 146) = 5.0 (p < 0.03, η <sup>2</sup> = 0.04) for W/Pho, with a significantly greater contextual influence for right ears than for left ears in the high-YE group (Fisher's t = 2.7, p < 0.007, Cohen's d = 0.65 for j) and a greater contextual influence for the high-YE group than for the low-YE group for the right ears (Fisher's t = 3.16, p < 0.002, Cohen's d = 0.78 for j; **Figure 2**).

When only right ears were selected, a significant difference was obtained between the high-YE and low-YE groups, with the high-YE group having greater contextual influence indices than the low-YE group: one-way ANOVA: F(1, 66) = 10.7, p = 0.002, η <sup>2</sup> = 0.14 for the j factor and F(1, 69) = 6.8, p = 0.01, η <sup>2</sup> = 0.09 for W/Pho. No significant differences based on the YE or PTA group were obtained for word scores.

# Word Linguistic and Acoustic Characteristics' Influences on Patients' Scores

The percentage score, for each word, calculated across the 160 patients, varied from 17.5 to 92% (word score) and between 49.8 and 95% (phonemic score). Correlations between the word scores and several linguistic factors, such as occurrence frequency (oral and written), phonological neighborhood density, and number of high-frequency phonological neighbors, indicated that occurrence frequency was a major influence and that no other linguistic factors had significant correlations (**Table 2**). As expected, the correlation between occurrence frequency and word scores was significantly stronger than the correlation between occurrence frequency and phonemic scores (z = 2.4, p < 0.02), regardless of the occurrence frequency used.

The correlations between word scores and occurrence frequency tended to be stronger (but not significantly so) for cumulative oral frequencies than for the maximum occurrence frequency of the phonological form, or the written frequency (r = 0.37, p < 0.01). Correlations between word scores and occurrence frequencies obtained from the Google Books N\_gram French database, calculated in 5-year units from 1900 to 2005, showed the strongest correlations for occurrence frequencies from 1950 (**Table 2**). However, the differences between the correlation coefficients obtained for the 1950 and 2005 occurrence frequencies did not reach statistical significance.

TABLE 2 | Pearson correlation coefficients obtained between word scores, syllabic scores, and phonemic scores (in raus) and word acoustic and linguistic factors.

Significant correlations at the 0.01 level are shown in bold on a gray background, whereas significant correlations at the 0.05 level are shown in bold. Significant differences in correlation coefficients between whole-word scores and phonemic scores are shown in the last column.

Significant correlations were obtained between word scores and word amplitude, calculated in RMS per octave, with the strongest correlations for 0.5 and 1 kHz and with significantly stronger correlations for phonemic scores than for word scores (**Table 2**). No significant correlations were obtained for the frequency band amplitudes centered at 0.25, 2, 4, or 8 kHz. To ascertain the statistical significance and reliability of our correlation results, 50,000 random permutation tests (Sherman and Funder, 2009) were performed on the set of 144 observed correlations to form a distribution of significant findings expected by chance; on average, 9.15 of the 144 observed correlations could have been significant by chance, with an average r of 0.10 (SE = 0.04). This value is significantly (p < 0.0001) below the average r observed in the data (0.34) and lower than the number of significant correlations observed (60), showing that the pattern of correlations observed here cannot be attributed to chance.
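The logic of such a randomization check can be sketched as follows. This is a minimal illustration of the general idea (shuffle the pairing between scores and predictors, then count how many correlations pass the significance threshold by chance), not the exact procedure of Sherman and Funder (2009). The data are synthetic, the function names are hypothetical, and the critical value of 0.254 is an approximation of the two-sided 5% threshold for 60 items.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chance_level(scores, predictors, r_crit, n_perm=200, seed=0):
    """Mean number of |r| >= r_crit, and mean |r|, under random re-pairing."""
    rng = random.Random(seed)
    n_items = len(scores[0])
    n_corr = len(scores) * len(predictors)
    n_sig = 0
    sum_abs_r = 0.0
    for _ in range(n_perm):
        order = list(range(n_items))
        rng.shuffle(order)  # one shuffle per permutation, shared by all scores
        for s in scores:
            shuffled = [s[i] for i in order]
            for p in predictors:
                r = pearson_r(shuffled, p)
                sum_abs_r += abs(r)
                n_sig += abs(r) >= r_crit
    return n_sig / n_perm, sum_abs_r / (n_perm * n_corr)


# Synthetic, purely illustrative data: 60 items, 3 score types, 4 predictors.
data_rng = random.Random(42)
scores = [[data_rng.gauss(0, 1) for _ in range(60)] for _ in range(3)]
predictors = [[data_rng.gauss(0, 1) for _ in range(60)] for _ in range(4)]

# 0.254 approximates the two-sided 5% critical |r| for n = 60 (assumption).
mean_sig, mean_abs_r = chance_level(scores, predictors, r_crit=0.254)
print(f"significant by chance: {mean_sig:.2f} of 12; mean |r| = {mean_abs_r:.3f}")
```

Applying the same shuffle to every score vector within a permutation preserves the correlations among the score types, so only the link between scores and predictors is broken.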

Stepwise regression analysis starting with five potential explanatory variables representing acoustic factors (0.5 and 1 kHz amplitude in dB) and linguistic factors (occurrence frequency and phonological neighborhood density) yielded statistically significant models that could explain 40% of the variance of word scores and 54% of the variance of phonemic scores (**Table 3**) using only two explanatory variables, 1 kHz amplitude and occurrence frequencies. Occurrence frequency had a greater influence on word scores (beta = 0.58), and 1 kHz amplitude had a greater influence on phonemic scores than occurrence frequency did (beta = 0.68 vs. 0.25).
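For a final model with two explanatory variables, the standardized coefficients (beta) and the variance explained follow directly from the three pairwise correlations. The sketch below shows that closed-form relationship; it does not implement the stepwise selection itself, and the amplitude, frequency and score values are hypothetical placeholders rather than study data.

```python
import math


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def two_predictor_betas(y, x1, x2):
    """Standardized betas and R^2 for the model y ~ x1 + x2."""
    r_y1 = pearson_r(y, x1)
    r_y2 = pearson_r(y, x2)
    r_12 = pearson_r(x1, x2)
    denom = 1.0 - r_12 ** 2  # zero only if the predictors are collinear
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared


# Hypothetical per-word values (not from the study).
amp_1khz_db = [52.0, 55.0, 58.0, 60.0, 63.0, 66.0, 70.0, 72.0]
log_frequency = [1.2, 0.4, 2.1, 0.9, 1.8, 0.2, 2.5, 1.1]
word_score = [48.0, 35.0, 66.0, 50.0, 68.0, 45.0, 85.0, 62.0]

b_amp, b_freq, r2 = two_predictor_betas(word_score, amp_1khz_db, log_frequency)
print(f"beta(amplitude) = {b_amp:.2f}, beta(frequency) = {b_freq:.2f}, R^2 = {r2:.2f}")
```

Because the betas are expressed in standard deviation units, they can be compared across predictors measured on different scales, which is how the relative influence of amplitude and occurrence frequency is read from **Table 3**.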

# Influence of the Interactions between Patients and Word Characteristics on Speech Perception Scores

TABLE 3 | Stepwise multiregression analysis of the word scores (top table) and phonemic scores (bottom table) as a function of several explanatory variables, representing acoustic factors (0.5- and 1-kHz amplitudes) and linguistic factors (lexical spoken occurrence frequency and phonological neighborhood density).

Only the significant predictors are noted, and only those models that are significantly different from the previous one are presented (variance analysis between models with p < 0.01). The model coefficients are specified as B (with standard error in SE) and standardized coefficients are noted as Beta (in bold). Each model is specified by the percentage of variance explained (r<sup>2</sup>, in bold) and the corresponding degree of statistical significance (p), with ns for non-significant. The degree of significance of each predictor at each step is noted with t and p.

The percentage score for each word was calculated for different groups of patients organized by age and/or number of years of education. A strong effect of age was observed, with significantly stronger correlations between word scores and word frequencies for the youngest group (under 50 years of age; r = 0.53, p < 0.0005) than for all groups older than 60 years of age (r = 0.33, p < 0.02; z = 2.6, p < 0.01; **Figure 3A**). The same analysis, performed by grouping the patients by age and YE, revealed a significantly stronger relationship between word scores and word frequencies for the low-YE group than for the high-YE group (with significant differences for all three groups under 70 years of age). The low-YE group exhibited systematically higher correlation coefficients than the high-YE group. A similar result was obtained for W/Pho, which showed a decreasing correlation as age increased, especially for the low-YE group.

The effects of age on the dependency of patient responses to current spoken language occurrence frequencies might be related to the fact that those words were relatively more common in the 1950s than today. To check this hypothesis, we calculated "older words" and "newer words" scores for each patient (**Figure 4**). A mixed-ANOVA (1 within-subjects factor: word group, and 2 between-subjects factors: YE and age groups) showed no significant difference according to each variable (YE, age or word group), but there was a significant interaction between word group and age: F(3, 152) = 4.3, p < 0.006, η<sub>p</sub><sup>2</sup> = 0.08. The interaction between word group, age and YE was not statistically significant [F(3, 152) = 2.4, p = 0.07]. For the youngest and low-YE patients, the "older words" scores were significantly lower than (1) the scores of the older patients and high-YE patients and (2) the "newer words" scores (**Figure 4**). In addition, only the low-YE patient group showed a statistically significant correlation between age and the difference between the "older words" and "newer words" scores (r = −0.39, p < 0.0005 for the low-YE group and r = −0.07, p = ns for the high-YE group), with a decreasing difference as age increased that was mostly related to an increase in the "older words" score as age increased. This could explain why the younger patients, especially those with low YE values, were more sensitive to the current spoken word occurrence frequencies from the Lexique database. Both the age and YE effects disappeared when the 1950s occurrence frequencies were used (**Figure 3B**).

To investigate which occurrence frequency explains the greatest variability in the word scores, correlations among word scores, phonemic scores and the occurrence frequencies obtained from the N-gram database in 5-year units were analyzed. For the entire group of patients, the best correlations were observed for the occurrence frequencies from 1950 to 1960: **Figure 5A** depicts the percentage of variance in word, syllabic and phonemic scores explained by occurrence frequency as a function of the year in which the books were published. To compare evolution as a function of the year, the maximal percentage of variance across the years was set at zero so that the other percentages showed the amount of decrease in the percentage of variance explained by the different occurrence frequencies (**Figure 5B**). Hence, **Figure 5B** shows that the word scores appeared more dependent on the occurrence frequency year than syllabic or phonemic scores did. The decrease in the dependence of scores on occurrence frequency for recent years showed a greater slope for word scores than for phonemic scores. When the total population was split by YE, the pattern was very similar (**Figure 5C**). However, a clear age effect occurred when we grouped the patients by age: younger patients were more sensitive to more recent occurrence frequencies, whereas older patients were more sensitive to "older" occurrence frequencies (**Figure 5D**). An interaction between YE and age was again observed, with a greater difference between the young and old patients in the low-YE group (**Figure 5F**) than in the high-YE group (**Figure 5E**).
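The normalization used for the curves can be illustrated with a short sketch: the peak percentage of variance explained is subtracted from every value, so that the best year sits at zero and all other years show a negative drop. The dictionary below is illustrative; the 1950 and 2005 values follow the percentages quoted in the Discussion for the N-gram frequencies (24 and 19%), whereas the 1900 value and the function names are hypothetical.

```python
def normalize_to_peak(variance_by_year):
    """Shift percentages so that the maximum becomes zero."""
    peak = max(variance_by_year.values())
    return {year: value - peak for year, value in variance_by_year.items()}


def peak_year(variance_by_year):
    """Year whose occurrence frequencies explain the most variance."""
    return max(variance_by_year, key=variance_by_year.get)


# Percentage of variance in word scores explained, per publication year.
# The 1900 entry is a hypothetical placeholder.
variance_explained = {1900: 10.0, 1950: 24.0, 2005: 19.0}

normalized = normalize_to_peak(variance_explained)
print(peak_year(variance_explained), normalized)
```

Expressing each curve relative to its own peak makes the shapes comparable across score types and patient groups, even when their absolute levels of variance explained differ.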

# DISCUSSION

# Contextual Influence Measures of Disyllabic Words

The phoneme scores for disyllabic words were 17 raus greater on average than the word scores, which allowed the calculation of different contextual influence indices. Both the j factor and the average ratio of word scores to phonemic scores (W/Pho) across our population showed greater contextual influence than the data reported by Olsen et al. (1997) (with j-values ranging from 2.3 to 2.8) and by Boothroyd and Nittrouer (1988) for young, normally hearing university students using monosyllabic consonant-vowel-consonant (CVC) words presented in noise with a 0-dB SNR (78 vs. 72.3% for W/Pho and 2.07 vs. 2.46 for j factor). Because the j factor varies between 1 (word perceived as a whole) and the maximum number of parts used as measurement units (in this case, phonemes), its range depends on the type of linguistic material used. Whereas both Olsen et al. (1997) and Boothroyd and Nittrouer (1988) reported j factors that ranged between 1 and 3 (because of their use of CVC words), our j factors could theoretically have ranged between 1 and 4.5 (the average number of phonemes in our disyllabic words). Therefore, a j factor of 2.07 denoted a substantially greater contextual influence on the word scores, which was confirmed by our higher W/Pho. Although contextual influence tends to increase with YE, the greater contextual influence in our population compared with the reports in the literature cannot be attributed to a higher YE because our population was heterogeneous regarding YE and had an average YE much lower than that of the subjects in most studies. The most likely explanation is the greater redundancy of 4- to 5-phoneme disyllabic words than triphonemic monosyllabic words. 
Additionally, because the j factor can be interpreted as the number of chunks of information independently perceived by the listener, our 2.07 j factor can speculatively be interpreted as agreeing with the disyllabic structure of the words used: the patients tended to perceive the words as two individual syllables rather than as a string of phonemes, in agreement with the greater syllabic structure of the French language compared with English (Ferrand et al., 1996, 2010).

# Differences in Contextual Influence Depending on Patients' Characteristics

### Years of Education

The contextual influence on spondee recognition was significantly greater in the high-YE than in low-YE patients, especially those with milder hearing loss (<32 dB HL), which underlines the importance of YE to spoken word recognition scores in an elderly, hearing-impaired population in an audiological clinic. YE can be considered a very crude reflection of lexical knowledge and cognitive ability. Indeed, in a meta-analysis, Verhaeghen (2003) reported a significant correlation between YE and two vocabulary tests (the WAIS-R vocabulary subtest and the Shipley scale). Although contextual influence tended to be greater in the high-YE patient group, major differences between both groups appeared in combination with hearing loss. The high-YE group with an average PTA lower than 55 dB HL was better able to repeat complete words (with significantly greater contextual influence, as shown by their higher ratio of word scores to phonemic scores) than the low-YE patients. This finding suggests that these patients exhibited better compensation for the partial phonological information they receive using top-down lexical information, at least in cases of mild to moderate hearing loss. Such compensation was not observed for patients with severe hearing loss (PTA > 55 dB HL). A likely explanation resides in the overly degraded auditory information available to patients with more severe sensorineural hearing loss, who experience greater distortions, widened cochlear filters, frequency selectivity alteration and loss of temporal resolution (Moore, 2007) that cannot be compensated for by a simple increase in the absolute stimulus level. This heavily degraded auditory information would not be sufficient to properly fuel the lexical restoration process. This finding is consistent with the results of Başkent et al. (2010), who demonstrated that perceptual phonemic restoration could be identified in normal-hearing subjects and those with mild hearing impairments, but not in patients with moderate hearing loss (PTA > 40 dB HL in patients over 60 years of age in their study). Similarly, Benichov et al. (2012) observed a decrease in contextual benefit for their patients with moderate hearing loss (PTA > 45 dB HL) compared with patients with mild hearing loss. The results of the present study indicate that the degree of compensation for the degraded bottom-up information, using top-down lexical processes, varied greatly from patient to patient, even for the recognition of isolated words in silence, with a heavy emphasis on general vocabulary knowledge reflected by years of formal education.

FIGURE 5 | Percentage of variance in the word scores (y-axis) explained by word occurrence frequencies measured in different years (in 5-year bins, from 1900 to 2005, x-axis). For (B–F), the maximum percentage of variance was normalized to zero so that the other values show the decreases in the percentage of variance explained by word occurrence frequencies during periods other than the optimal period. (A) Percentage of variance in the whole-word scores (W scores, yellow dots), the syllabic scores (Syll. scores, red squares) and the phonemic scores (Phonemic scores, dark red triangles), explained by word occurrence frequencies per 5 years. (B) Shows the same data; however, the maximum percentage of variance was normalized to zero, allowing the comparison of the different patterns of variance as a function of year. (C) Normalized percentage of variance in the whole-word scores for the group of patients with a high number of years of formal education (high YE, green triangles) and those with a low number of years of formal education (low YE, blue dots). (D) The normalized percentage of variance in the word scores of different age groups, with dark purple triangles for the eldest patients, pink squares for those 55–72 years old and orange dots for the youngest patients (under 55 years of age). The patients' birth dates are represented by the symbols on horizontal lines parallel to the x-axis. (E,F) The normalized percentage of variance in the whole-word scores for the two different age groups, with dark triangles for the oldest patients (over 60 years of age) and light dots for the youngest patients (under 60 years of age). The birth dates are represented by symbols on horizontal lines parallel to the x-axis. (E) Shows the subgroup of patients with the highest number of years of education (high YE, in green), whereas (F) depicts the subgroup of patients with the lowest number of years of education (low YE, in blue).

### An Age Effect?

The lack of an age effect on contextual influence indices that was observed in this study appears to contrast with the results of Krull et al. (2013), who observed that the top-down restoration process declined with age in an identification task involving isolated monosyllabic words in speech-shaped noise. However, this discrepancy could be attributed to several factors. First, the population described in our study consisted of a majority of older patients (50% of the patients were over 65), so that our younger patients were actually substantially older (i.e., the 10th percentile of our population was 42 years old) than the young group of Krull et al. (which was between 18 and 32 years of age). Second, the task used in our study consisted of the auditory recognition of disyllabic words in silence and used words with different degrees of linguistic difficulty, whereas Krull et al. (2013) used monosyllabic words presented in noise and showed an age-related decrease in the ability to exploit temporal and spectral glimpses embedded in words presented in speech-shaped noise. The use of disyllabic words, which offers greater information redundancy than monosyllabic words, is likely to have favored a "lexical restoration" process in our high-YE patients without specific noise-induced perception difficulty that would have negated the benefits of the restoration process. Saija et al. (2014) showed that although normal-hearing older participants (average age: 66 years) exhibited poorer speech intelligibility in interrupted noise than a younger group (average age: 22 years), the older patients maintained phonemic restoration even better than the young group. The authors hypothesized that the process of speech perception degradation in noise with age could be counteracted by top-down processes dependent on the increased general knowledge and vocabulary observed in elderly subjects compared with younger participants (Park et al., 2002; Keuleers et al., 2015). 
Additionally, using full sentences as stimuli, Saija et al. (2014) provided their participants a broad range of mostly linguistic cues, including syntactic and semantic contexts in addition to lexical cues, that could be used for speech restoration. This better use of contextual information by older adults has been observed in several studies (Wingfield et al., 1991), especially when adding semantic clues (Boothroyd and Nittrouer, 1988; Nittrouer and Boothroyd, 1990; Sommers and Danielson, 1999; Pichora-Fuller, 2008). This is consistent with the hypothesis of compensation for the decrease in fluid intelligence with age by the maintenance of, improved use of, or an increase in general knowledge, including linguistic and verbal knowledge, as encompassed by the crystallized intelligence concept (Cattell, 1963; Horn and Cattell, 1966). More recently, Rogers et al. (2012) presented a more pessimistic view: they showed that the greater use of contextual information by older adults is more likely to lead to false hearing in incongruent semantic conditions than among younger subjects. Hence, the greater benefit (or compensation) from contextual information would be related to the older adults' tendency to respond in a manner consistent with the context and not necessarily to better use of contextual information because the former leads to more errors when the context is misleading.

### Left/Right Ear

The present results suggest that the degree of compensation from sublexical influence was not only educational level-specific, hearing loss-dependent and, to some degree, age-dependent (in terms of occurrence frequency), but also ear-dependent, with a significant interaction with YE: there was a significantly stronger contextual influence for words presented in the right ears than in left ears in the high-YE patient group among the subgroup of 150 right-handers, with no significant differences in hearing loss, age or word scores between the right and left ears. Moreover, when only the right ears were considered, the high-YE group exhibited significantly greater contextual influence than the low-YE group on the two context influence indices. The so-called right ear advantage linked to hemispheric functional asymmetry for language processing is usually observed behaviorally when both ears are competing, i.e., words presented in a subject's right ear are more likely to be repeated than words presented concomitantly in the left ear at a comfortable loudness level (for a review, see Lazard et al., 2012). Here, the situation was very different: the task was monaural, and its difficulty stemmed from the low sound level used, which was adjusted to obtain a word score of approximately 50% in a hearing-impaired population. The absence of an ear difference in the raw word scores, together with the presence of a right-ear advantage for contextual influence, argues in favor of the involvement of higher-level processing and not a peripheral effect. Among the many studies examining speech perception in noise and speech restoration, very few have specifically investigated the difference between right and left ears. Pisoni et al. (1970) obtained more efficient recall of sentences with semantic constraints presented in a noise masker in the right ears vs. the left ears of right-handed subjects, suggesting a right-ear advantage in contextual influence that is linked to cerebral dominance. 
In speech perception evaluations, hearing-impaired patients are tasked with building a meaningful auditory word from patchy phonological information, i.e., a task that is very close to phonemic restoration (Warren, 1970) and to the Ganong effect (Ganong, 1980), in which ambiguous speech sounds are properly categorized when presented in a word context, showing evidence of reciprocal interaction between phonetic and lexical processing. The neural correlates of phonological-lexical interactions have been preferentially shown in the left hemisphere with the involvement of the left supramarginal gyrus and left middle temporal gyrus (Prabhakaran et al., 2006; Myers and Blumstein, 2008). Using the Ganong effect, Gow et al. (2008) reported an increase in phonetic activation in the left posterior superior temporal gyrus within the time frame associated with a lexical effect, providing evidence in favor of a top-down feedback model and allowing for a direct influence of the lexical context on phonemic perception rather than only a post-perceptual decision process. Using prior knowledge of the speech content to enhance the clarity of degraded speech, Sohoglu et al. (2012, 2014) argued further in favor of an early influence of linguistic knowledge on the top-down modulation of acoustic processing. The greater contextual effect observed in our patients' right ears vs. left ears could be attributed to the left hemispheric preference for phonological-lexical interaction processing, with a preference for left hemisphere-right ear top-down interaction. However, ear preferences for the monaural presentation of auditory stimuli would be best investigated in an intra-subject paradigm, which would imply the use of word stimuli carefully balanced between both ears.

# The Word Frequency Effect Viewed Through the Looking-Glass of the Speech Perception Scores of Hearing-Impaired Elderly Patients

Because the words used in the present study had a broad range of occurrence frequencies and included some older words whose frequency of use has greatly diminished over the years, the influence of the patients' age on the relationship between word occurrence frequency and word score seemed particularly relevant. Indeed, we observed that the dependence of spoken word recognition on the spoken word occurrence frequency decreased as age increased. This finding appears to contradict several results showing a greater dependence on word frequency in spoken word recognition as age increases (Revill and Spieler, 2012), which is consistent with most studies of visual word recognition. In those studies, a stronger predictive value of written word frequency has been observed in older subjects than in younger subjects (matched for vocabulary size and YE) (Spieler and Balota, 2000; Balota et al., 2004). However, the population observed here differed in several major respects from populations in other studies: our younger subject group was far older than the typical young subjects in those studies (university students), and our population had an average educational level that was lower than that of the university graduates who are usually included as study participants. Indeed, when the data were analyzed according to YE group, the dependency on word occurrence frequency was significantly greater for the low-YE group than for the high-YE group, and the age effect, i.e., the greater occurrence frequency dependence for younger groups, disappeared for the high-YE group. This outcome could be attributed to an increased learning advantage and the greater and longer exposure to words experienced by older adults than younger adults; the older adults had a larger vocabulary and greater familiarity with words that were frequent in the 1950s but that are rare today. 
The statistically significant differences between the low-YE and high-YE groups among the younger subjects reinforced this hypothesis, with word frequency showing significantly greater predictive power in low-YE groups, who had lower scores for rare words.

In addition, not all occurrence frequency estimates are equally predictive (Brysbaert and New, 2009). Our results showed that the spoken word frequencies, which were obtained from a film subtitle database, explained 16% of the variance vs. 12.2% for the written book frequencies, confirming the superiority of the film subtitle database over written frequencies. However, word occurrence frequency is not the only parameter that influences word recognition (Goldinger, 1996), and its influence is difficult to separate from those of age of acquisition and word familiarity (or subjective frequency). Because most of our patients were over 50 years of age, and the word lists used here came from a corpus designed in the 1950s (Fournier, 1951), a number of the words that appeared to be unfamiliar to younger subjects (in their twenties to forties) were more familiar to an elderly population because the elderly people had encountered those words in their younger years. Thus, perhaps historical word occurrence frequencies, dating back to the youths of these elderly patients, could better explain their scores than current word occurrence frequencies. The potential influence of the "occurrence frequency year" was suggested by Brysbaert and New (2009); however, Brysbaert et al. (2011b) reported no decrease in predictability among older subjects in a lexical decision task with the use of the most recent occurrence frequencies, which were taken from the Google Books N\_gram database (Michel et al., 2011).

The discrepancy between our results and the lack of an influence of the occurrence frequency year reported by Brysbaert et al. (2011b) can be explained by at least two factors: (a) the populations studied were very different: the data reported by Brysbaert et al. (2011a,b) involved two groups of patients, including older adults, both with high YE (data from Spieler and Balota, 2000), whereas the present study revealed a "historical occurrence frequency" effect that was more important for the low-YE group, and (b) the present study examined auditory word recognition in hearing-impaired patients, vs. visual word recognition with a lexical decision task which was used in Brysbaert et al. (2011b). Auditory word recognition may be more sensitive to the historical word frequency effect than visual word recognition. Indeed, by reanalyzing correlations between Luce and Pisoni (1998) auditory perceptual data and the more recent occurrence frequency databases, Yap and Brysbaert<sup>2</sup> showed that auditory word recognition tended to be more sensitive to the age of acquisition than visual word recognition was. Thus, the "historical occurrence frequency effect" observed in the present study might be attributable in part to the stronger effect of age of acquisition on auditory word recognition than on visual word recognition. We observed that the 1950s occurrence frequencies tended to be better predictors of the word scores than the spoken occurrence frequencies obtained from the Lexique 3.8◦ database (24 vs. 16% of variance explained), and they were better predictors than the more recent N-gram frequencies (2005), which explained 19% of variance. Additionally, the 1950s occurrence frequencies explained a significantly greater percentage of the variance for both contextual influence indices that we used: W/Pho (28 vs. 22%) and j. When the 1950s occurrence frequencies were used, the age effect on the relationship between word scores and occurrence frequency disappeared. 
When we correlated word scores with the historical occurrence frequencies from 1900 to 2005, the greatest variance explained was obtained for the 1950s; this variance had a similar shape regardless of whether the group was divided into high- or low-YE groups. However, when the scores were grouped by the patients' ages, the peak of the explained variance shifted toward more recent years (1970) for the younger patients. This effect was observed for both the low-YE and high-YE groups. For the low-YE patients under 60 years of age, the maximal percentage of variance appeared for the most recent occurrence frequencies (2005); for the older group, the maximal percentage of variance occurred for the occurrence frequencies from the 1950s.

This result suggests that exposure to a word at a younger age seems to have greater impact than current exposure does, perhaps because of a stronger and more stable mental representation. Additionally, because most of our patients suffered from presbycusis with gradually worsening hearing loss over their lifetime, it is possible that exposure to a word's phonological form at a younger age was more relevant because it corresponds to an exposure to a less-degraded stimulus, i.e., exposure occurred at a time when the hearing loss was milder or even non-existent, thus contributing to building a stronger mental representation. This historical word frequency effect may be emphasized in hearing-impaired patients compared with non-hearing impaired subjects, which would explain why it was not observed systematically in Brysbaert et al. (2011b).

# Potential Implications for Audiology Practice

This study extends the main results of psycholinguistic research concerning the influence of linguistic context on spoken word recognition to the speech scores obtained from a heterogeneous hearing-impaired population similar to the population encountered in clinical practice, with potential consequences for speech scores. Indeed, the task most commonly used to evaluate speech perception in audiology, i.e., the repetition of a heard word with no time constraints, differs from the tasks usually used in word recognition research (i.e., reaction times/scores in lexical decision and naming tasks), and the population tested here (i.e., a hearing-impaired, older population with great variability in linguistic and general knowledge) differs from the typical university student cohorts used in psycholinguistics studies. Indeed, even for isolated words presented in silence, contextual influences can add substantial variability to speech scores. Top-down lexical compensation (or the lack thereof) for partial phonological information can greatly increase inter-subject variability depending on the patient's YE, age, hearing loss and ear tested. The influence of the ear tested was only visible for contextual influence indices and not for raw scores; thus, it is probably negligible in practice compared with other factors. The historical word occurrence frequency effect, which was of variable importance depending on the patients' age and number of years of education, suggests a strong interaction between linguistic factors and patient-specific factors. This interaction emphasizes the need to consider linguistic factors carefully (including the "history" of these factors) when developing speech recognition material (and to avoid focusing only on acoustic factors) (Meyer and Pisoni, 1999). 
Although achieving perfect item equivalence in speech perception linguistic material across several variables for a heterogeneous patient population could be considered wishful thinking, the current availability of large lexical databases encompassing several languages and types of occurrence frequencies is allowing substantial improvements in the current material used.

### CONCLUSION

Substantial inter-subject variability related to contextual influences can be identified in the speech perception scores for spondaic words in audiological clinic populations. These influences vary according to patient-specific factors, such as hearing loss characteristics, age, ear tested (right/left ear), and years of formal education. These patient-specific factors interact differently with linguistic material-specific factors, such as the occurrence frequency and phonological similarities of words. This phenomenon is illustrated by the historical occurrence frequency effect observed here, in which spondaic word recognition scores showed a stronger correlation with the word occurrence frequencies corresponding to the patient's youth than with current word occurrence frequencies; the older hearing-impaired patients were more likely to repeat a word that is rarely heard now but was common in their youth than a word that occurs frequently in daily communications (i.e., a word to which they are strongly exposed) but was rare in their youth. This finding was especially true for patients with more years of education. Even at the isolated word level, when words are presented in silence, lexical influence can partially compensate for bottom-up loss of phonological information in mild to moderate hearing loss and can improve spondaic recognition scores, but it depends strongly on general and linguistic knowledge.

<sup>2</sup>Yap, M. J., and Brysbaert, M. (2009). Auditory Word Recognition of Monosyllabic Words: Assessing the Weights of Different Factors in Lexical Decision Performance. Available online at: http://crr.ugent.be/papers/Yap\_Brysbaert\_auditory\_lexical\_decision\_regression\_final.pdf.

### ACKNOWLEDGMENTS

This work was supported in part by the "Fondation de l'Avenir", research program ET2-652, the LABEX CELYA (ANR-11-LABX-0060) and the LABEX CORTEX (ANR-11-LABX-0042) of Université de Lyon, within the program "Investissements d'Avenir" (ANR-11-IDEX-0007) operated by the French National Research Agency (ANR).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2015.00476




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Moulin and Richard. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Type of Speech Material Affects Acceptable Noise Level Test Outcome

Xaver Koch<sup>1,2</sup>\*, Gertjan Dingemanse<sup>3</sup>, André Goedegebure<sup>3</sup> and Esther Janse<sup>1,4,5</sup>

<sup>1</sup> Center for Language Studies, Radboud University, Nijmegen, Netherlands, <sup>2</sup> International Max-Planck Research School for Language Sciences, Nijmegen, Netherlands, <sup>3</sup> Department of ENT, Erasmus Medical Center, Rotterdam, Netherlands, <sup>4</sup> Max-Planck Institute for Psycholinguistics, Nijmegen, Netherlands, <sup>5</sup> Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands

The acceptable noise level (ANL) test, in which individuals indicate what level of noise they are willing to put up with while following speech, has been used to guide hearing aid fitting decisions and has been found to relate to prospective hearing aid use. Unlike objective measures of speech perception ability, ANL outcome is not related to individual hearing loss or age, but rather reflects an individual's inherent acceptance of competing noise while listening to speech. As such, the measure may predict aspects of hearing aid success. Crucially, however, recent studies have questioned its repeatability (test–retest reliability). The first question for this study was whether the inconsistent results regarding the repeatability of the ANL test may be due to differences in speech material types used in previous studies. Second, it is unclear whether meaningfulness and semantic coherence of the speech modify ANL outcome. To investigate these questions, we compared ANLs obtained with three types of materials: the International Speech Test Signal (ISTS), which is non-meaningful and semantically non-coherent by definition, passages consisting of concatenated meaningful standard audiology sentences, and longer fragments taken from conversational speech. We included conversational speech as this type of speech material is most representative of everyday listening. Additionally, we investigated whether ANL outcomes, obtained with these three different speech materials, were associated with self-reported limitations due to hearing problems and listening effort in everyday life, as assessed by a questionnaire. ANL data were collected for 57 relatively good-hearing adult participants with an age range representative for hearing aid users. Results showed that meaningfulness, but not semantic coherence of the speech material affected ANL. Less noise was accepted for the non-meaningful ISTS signal than for the meaningful speech materials. 
ANL repeatability was comparable across the speech materials. Furthermore, ANL was found to be associated with the outcome of a hearing-related questionnaire. This suggests that ANL may predict activity limitations for listening to speech-in-noise in everyday situations. In conclusion, more natural speech materials can be used in a clinical setting as their repeatability is not reduced compared to more standard materials.

Keywords: acceptable noise level, speech material type, hearing, working memory, self-control capabilities

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Sven Mattys, University of York, UK

Jana Besser, Sonova AG, Switzerland

> \*Correspondence: Xaver Koch x.koch@let.ru.nl

### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 30 November 2015 Accepted: 31 January 2016 Published: 26 February 2016

### Citation:

Koch X, Dingemanse G, Goedegebure A and Janse E (2016) Type of Speech Material Affects Acceptable Noise Level Test Outcome. Front. Psychol. 7:186. doi: 10.3389/fpsyg.2016.00186

# INTRODUCTION


One of the most frequent complaints of adult hearing aid users is that comprehending speech is challenging in noisy environments (Cord et al., 2004; Killion et al., 2004; Nábělek et al., 2006). Indeed, insufficient benefit of hearing aids in noisy situations seems to be an important reason for people fitted with a hearing aid not to use it. Hearing rehabilitation could be better attuned to the needs of hearing-impaired individuals if audiologists were able to identify those hearing-impaired individuals who will have problems with accepting higher noise levels in everyday communication situations. Individualized counseling may help hearing-impaired individuals to set realistic expectations of hearing-aid benefit in noise. Furthermore, assistive listening devices could then be provided early on for individuals who can be expected to be unsatisfied with hearing devices in noisy environments in order to ultimately minimize disappointment with the device, activity limitations and participation restrictions related to hearing disabilities (cf. Nábělek et al., 2006; Kim et al., 2015).

This raises the question of how to identify future hearing aid users who may be discouraged from using hearing aids because of difficulty listening in noise. One obvious approach would be to measure the individual's objective ability to understand speech in noise (e.g., the standard speech-reception threshold measure). However, such objective performance measures are not predictive of hearing aid benefit or success (Bender et al., 1993; Humes et al., 1996; Nábělek et al., 2006). In contrast, one subjective measure called "acceptable noise level" or "tolerated SNR" (henceforth, ANL) seems to be predictive of hearing aid and cochlear implant success (Nábělek et al., 1991, 2006; Bender et al., 1993; Humes et al., 1996; Plyler et al., 2008; but cf. Olsen and Brännström, 2014). The ANL procedure involves the following two steps: listeners are first asked to indicate the loudness level they find most comfortable [henceforth, Most Comfortable Loudness Level (MCL), cf. Hochberg, 1975] for listening to a continuous speech signal. In a second step, listeners adjust the background noise level [henceforth, Background Noise Level (BNL)] to the maximum level they are willing to put up with while following the running speech presented at their individual MCL level. Subtracting the BNL value from the MCL value yields the ANL measure which typically ranges between −15 and 40 dB with a mean of around 5 to 12 dB (Nábělek et al., 1991, 2006; von Hapsburg and Bahng, 2006; cf. Eddins et al., 2013; Walravens et al., 2014). The lower the ANL value, the more noise the participant accepts while listening to speech. The ANL measure quantifies the individual's "willingness to listen to speech in background noise" (cf. Nábělek et al., 2006, p. 626). As such, it may be a better indicator of successful hearing aid uptake than the individual's objective ability to understand speech in noise as it is more telling about the individual's wishes, motivation, and intentions.

Speech perception is generally considered to involve an interaction between the processing of acoustic information (bottom–up processing) and linguistic and cognitive processing (top–down processing). An important question is how ANL outcome relates to this interaction, as participants are explicitly instructed to 'follow the speech' during the ANL task. Even though listeners may engage in setting up linguistic hypotheses about upcoming content when the signal is clear, top–down contextual support may be particularly helpful in reconstructing the message when the signal is presented in noise. It is unclear whether type of speech material affects ANL. The original ANL publications (e.g., Nábělek et al., 1991, 2006) used a standard stretch of read speech, making up a coherent story (the Arizona Travelog passage). In contrast, Olsen and Brännström (2014) used the International Speech Test Signal (ISTS; Holube et al., 2010), which is non-meaningful by definition as the signal consists of roughly syllable-sized units from six different languages and speakers, concatenated into a continuous speech stream. Olsen and Brännström (2014) argue that the ISTS can be used to compare ANL values across languages. However, the use of the ISTS precludes top–down processing. In that sense, the question whether type of speech material affects ANL outcome is a question about the nature of the ANL task in the broader context of models of speech processing. Regarding the question of whether meaningfulness affects ANL outcome, ANLs obtained with unintelligible speech (i.e., reversed or unfamiliar speech) have been found to be higher (i.e., indicative of lower noise tolerance) than those obtained with intelligible speech (Gordon-Hickey and Moore, 2008). In contrast, Brännström et al. (2012a) showed that ANLs were lower for the ISTS in comparison with meaningful speech stimuli. 
We investigate whether ANL depends on meaningfulness and coherence by using three different stimulus types that differ in meaningfulness (ISTS vs. concatenated sentences and fragments of conversational speech) and coherence (concatenated sentences vs. coherent conversational speech). If meaningfulness of the test material does not affect ANL outcome, listeners' acceptance of noise while following speech may mainly rely on bottom–up processing. Consequently, following speech in noise as captured by the ANL task would deviate from speech perception and comprehension. In line with Gordon-Hickey and Moore (2008), we expect to find increased ANL values for the non-meaningful ISTS material compared to the meaningful materials. Our hypothesis regarding the direction of a semantic coherence effect is that participants will accept more noise (i.e., show lower ANLs) for the conversational stimulus type in comparison with the passage of concatenated sentences as redundant information is available on the discourse level, which facilitates speech comprehension. Alternatively, however, the faster speech rate and less careful articulation observed in conversational speech may make listening harder than in the sentence materials and may yield lower noise acceptance.

In order for ANL to be a clinically useful tool in hearing rehabilitation, it is important to establish its repeatability (i.e., consistency over repeated measures or test–retest reliability with the exact same materials). Olsen and Brännström (2014) questioned the repeatability of the existing ANL procedures using the ISTS material. In the present study we investigate whether speech material type affects ANL outcomes and repeatability. Relatedly, repetition of the exact same materials may lead to substantial priming effects, especially for the meaningful materials. Consequently, participants would accept more noise upon repeated exposure, yielding a lower repeatability. We investigate whether the use of meaningful materials yields differential repeatability compared to non-semantic ISTS material.

Nábělek et al. (2006) suggest that future hearing aid use can be predicted on the basis of ANL outcome for a majority of hearing aid candidates. Olsen and Brännström (2014), however, challenge the predictive value of ANL outcome for hearing-aid use, and report that results regarding the association between ANL and self-reported hearing-aid outcome measures have been mixed. These inconsistent findings may be caused by the multitude of variables that are possibly related to hearing-aid use, hearing-aid satisfaction and hearing-aid success, as reviewed by Knudsen et al. (2010) and McCormack and Fortnum (2013). Note, however, that self-reported hearing problems have been shown to be consistently associated with hearing-aid outcome measures obtained throughout the process of getting a hearing aid (help seeking, hearing-aid uptake, use, and satisfaction). We investigate whether ANL is associated with (specific components of) the Speech, Spatial, and Qualities of Hearing self-report questionnaire (SSQ; Gatehouse and Noble, 2004) and whether this relation depends on ANL test material type. We expect to find differential correlations between the questionnaire outcome and ANL for the three speech stimulus types, with stronger associations for the more ecologically valid materials.

The central concept of the ANL measure is 'listening comfort.' Thus, individual ANLs are not necessarily linked to the listener's objective ability to comprehend speech in noise, as shown in a number of studies (cf. Nábělek et al., 2004; Mueller et al., 2006; von Hapsburg and Bahng, 2006; Plyler et al., 2008, but cf. Gordon-Hickey and Morlas, 2015). Whether and how the concept of comfort in noisy listening situations relates to listening effort is unclear. The clinical meaning of the concept of listening effort has recently been discussed in several papers (McGarrigle et al., 2014; Rennies et al., 2014; Francis and Füllgrabe, 2015; Schulte et al., 2015). One way to quantify listening effort is to ask participants to fill in effort-related subscales of self-report questionnaires (cf. McGarrigle et al., 2014). We therefore investigate whether listening effort, as measured with specific questions of the SSQ (Akeroyd et al., 2014), is associated with ANL. We hypothesize that ANL is associated with a listening effort-related subscale of the SSQ, with more subjective listening effort related to lower noise acceptance (i.e., higher ANLs).

Listeners need cognitive capacity to map a noisy signal onto stored representations (McGarrigle et al., 2014), as laid out in the Ease of Language Understanding model (Rönnberg et al., 2008, 2013). Multiple studies have shown that hearing aid users' objective speech understanding in adverse conditions (such as background noise) is related to their working memory capacity, verbal working memory in particular (Akeroyd, 2008; Rudner et al., 2011; Ng et al., 2013, 2014). Given the relatively large amount of unexplained variance for individual ANLs, ANLs may also be associated with working memory. Brännström et al. (2012b) found a significant correlation between working memory capacity and ANL for a sample of normal-hearing participants, with lower noise acceptance (i.e., higher ANLs) relating to poorer working-memory capacity. We investigate whether ANL outcomes obtained with the different types of speech materials relate to listeners' working memory capacity, where we expect to replicate the results of Brännström et al. (2012b).

As ANL specifically asks listeners about their willingness to accept noise, ANL may be related to personality traits. Indeed, self-control abilities (i.e., the capability to control thoughts, feelings, impulses and performance; Baumeister et al., 1994), have been found to predict ANL outcomes (Nichols and Gordon-Hickey, 2012). We revisit the question to what extent ANL outcome relates to personality characteristics in this study. We expect to replicate effects of self-control on ANL with better self-control related to lower ANLs (cf. Nichols and Gordon-Hickey, 2012). Furthermore, even though earlier studies have not found a link between ANL and age (Nábělek et al., 1991; Moore et al., 2011), nor between ANL and pure-tone hearing thresholds (Nábělek et al., 1991; Freyaldenhoven et al., 2007; Plyler et al., 2007), or between ANL and speech perception accuracy in noise (Nábělek et al., 2004), we investigate whether our data replicate this pattern of results.

This study investigates whether speech material type affects ANL outcomes and repeatability for a reference sample of normal-hearing middle-aged and older participants. As addressing these questions on speech material and repeatability involves relatively long testing sessions with repeated ANL measurements, we tested a non-clinical population first so as not to burden a patient population. Future testing is then required to see whether material type effects generalize to a patient population and whether ANLs based on conversational materials better predict hearing aid success than ANL values obtained with more standard audiology materials (e.g., ISTS).

The present study was set up to address the following four research questions:


# MATERIALS AND METHODS

### Participants

Seventy-one adults were recruited, all native speakers of Dutch, above 30 years of age (39 female, 33 male). From the initial sample, we excluded 10 participants whose hearing loss in one or both ears exceeded the Dutch health insurance criterion for partial reimbursement of hearing aids (i.e., pure-tone average over 1000, 2000, and 4000 Hz ≥ 35 dB HL in either ear). We also excluded two participants who suffered from tinnitus and one participant who showed significant binaural low-frequency hearing loss. One participant was excluded because she did not manage to perform the ANL task in the training phase. The 57 remaining participants (34 female, 23 male) ranged in age from 30 to 77 years with an overall mean of 60.7 years (SD = 11.0). All participants indicated that they had no hearing impairment and did not use hearing aids. None of the participants had a history of a neurological disease. We followed the protocols of the Radboud University Ethics Assessment Committee for the Humanities. All participants provided written informed consent and were informed that they could withdraw from the study at any time.

# Speech Stimuli

Three types of speech materials were used for ANL testing that differed in meaningfulness and semantic coherence: the unintelligible speech-like ISTS (Holube et al., 2010), a concatenated passage of meaningful Dutch sentences taken from speech material developed by Versfeld et al. (2000; henceforth, SENT), and conversational speech (henceforth, CONV) extracted from the Dutch conversational IFADV corpus (van Son et al., 2008). The 60 s long ISTS signal is made up of units that are roughly syllable sized, originating from six female speakers each reading a short standard passage in their native language (being Mandarin, Spanish, English, German, French, and Arabic). The ISTS signal had been developed on the basis of an automatic procedure to cut, concatenate and reassemble the roughly syllable sized segments from the original six recordings to create a smooth 60 s long speech-like signal including pauses at regular intervals (all pause durations being smaller than 600 ms). The resulting speech rate is approximately 4 syllables per second (Holube et al., 2010). Furthermore, the ISTS signal has been shaped to spectrally match the female international long-term-average speech spectrum (ILTASS, Byrne et al., 1994).

To create the second type of material (SENT), we concatenated fifty sentences from the female speaker of the materials of Versfeld et al. (2000) with intervals of 500 ms silence between sentences (total duration of the passage was 120 s). These sentences are all between five and eight words long and are semantically coherent. A translated example sentence is: "I hope to be able to catch the train." The speech rate of the sentences ranges between 3.5 and 5.7 syllables per second (Mean = 4.6 syllables/s, SD = 0.6). In order to match the spectral properties of the SENT materials to the ISTS materials, the concatenated SENT material was filtered to the ILTASS (combination of male and female signal) using a finite impulse response (FIR) filter between 100 and 16000 Hz.

The third type of speech material was created by extracting two male and two female recordings from the conversational IFADV corpus (van Son et al., 2008). The Dutch open-source IFADV corpus consists of annotated high-quality recordings of dialogs on daily topics such as problems in public transport, leisure time activities or vacations. As we wanted to spectrally shape these materials, we selected four longer stretches of speech [CONV1 (female speaker), CONV2 (male speaker), CONV3 (male speaker), CONV4 (female speaker)] where only one speaker was speaking, without being interrupted by the dialog partner. These stretches were based on the available corpus annotations. In a few instances we cut out verbal backchannelling (e.g., "yes," "hmm") of the interlocutor, which did not overlap with the target speech. All pauses longer than 500 ms were shortened to 500 ms. The four resulting speech files ranged in duration between 63 and 75 s. Speech rate calculated over the breath groups (sequence of words between inhalations) ranged between 2.6 and 7.5 syllables per second (Mean = 5.7 syllables/s, SD = 1.2; CONV1: 6.10 syllables/s, CONV2: 5.10 syllables/s, CONV3: 5.79 syllables/s, CONV4: 5.89 syllables/s). In order to match the spectral contents of the conversational materials to the other types of materials, the four conversational fragments were also filtered to the ILTASS (combination of male and female signal) using a FIR filter between 100 and 16000 Hz.
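The pause-shortening step described above can be sketched as follows. This is only an illustration under stated assumptions: the `(kind, duration)` segment representation, the function names `cap_pauses` and `total_duration`, and the example durations are hypothetical and are not taken from the IFADV annotations or from the authors' processing scripts.

```python
def cap_pauses(segments, max_pause=0.5):
    """Shorten every pause longer than max_pause (in seconds) to max_pause.

    segments is a list of (kind, duration) tuples, where kind is
    "speech" or "pause". Speech segments are left untouched.
    """
    capped = []
    for kind, duration in segments:
        if kind == "pause" and duration > max_pause:
            duration = max_pause
        capped.append((kind, duration))
    return capped


def total_duration(segments):
    """Total duration in seconds of a list of (kind, duration) segments."""
    return sum(duration for _, duration in segments)


# Hypothetical fragment: the durations are invented for illustration.
fragment = [
    ("speech", 3.2),
    ("pause", 0.9),   # longer than 500 ms, will be shortened
    ("speech", 4.1),
    ("pause", 0.3),   # already shorter than 500 ms, kept as is
    ("speech", 2.5),
]
shortened = cap_pauses(fragment)
```

Only pauses above the 500 ms limit are changed, so shorter pauses keep their original duration.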

# Noise Material

The noise stimulus used throughout the ANL test procedure was a non-stationary eight-speaker babble noise (BAB8, Scharenborg et al., 2014) filtered to the ILTASS (combination of male and female spectrum) using a FIR filter between 100 and 16000 Hz. In line with the aim of approximating realistic listening conditions, we used a multi-talker babble noise since it is a typical background sound encountered in daily life.

# Experimental Procedure

### Test Set-Up

All ANL test materials were presented in a sound-attenuated booth using an Alesis multimix 4USBFX device and Behringer MS16 loudspeakers in front of the listener (0° azimuth) at a distance of 1 m. Stimuli were presented in a custom application (cf. Dingemanse and Goedegebure, 2015) running in Matlab (v7.10.0) on a MacBook Pro (type 9,1). Participants adjusted the sound level of the speech stimuli or the noise file using the up and down keys of a customized keyboard. The starting intensity for the MCL was 45 dB (SPL). The intensity of the speech file for the BNL task was set to the mean of the three measurements in the preceding MCL task. The step size for the intensity adjustment for both tasks was fixed at 2 dB per button press.

All speech and noise materials were scaled to have the same overall level in dB (RMS). Sound level calibration was done using a 2250 Brüel and Kjær real time sound analyzer and a 1000 Hz warble test tone with the same RMS-value as the ANL materials.
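As a rough illustration of what scaling materials to the same overall RMS level involves, the sketch below equalizes two short sample lists. It is not the authors' Matlab procedure: the function names, the target RMS value, the toy sample values, and the choice of full scale (amplitude 1.0) as the dB reference are all assumptions made for the example.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def rms_db(samples):
    """RMS level in dB relative to full scale (amplitude 1.0), an assumed reference."""
    return 20.0 * math.log10(rms(samples))


def scale_to_rms(samples, target_rms):
    """Apply one gain factor so that the samples reach the target RMS."""
    gain = target_rms / rms(samples)
    return [x * gain for x in samples]


# Toy signals standing in for a speech file and the babble noise.
speech = [0.5, -0.5, 0.25, -0.25]
noise = [0.1, -0.2, 0.1, -0.2]

target = 0.1  # hypothetical common RMS value
speech_scaled = scale_to_rms(speech, target)
noise_scaled = scale_to_rms(noise, target)
```

After scaling, both signals have the same RMS level, so any level difference at presentation comes only from the adjustments made by the participant.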

### ANL Instructions

Participants were instructed to first adjust the level of the speech until it was too loud (i.e., up to the first deviation point), then to reduce the intensity until the speech became very soft (being the second deviation point) and lastly find the MCL. Then the participant's task was to select the maximum BNL they were willing to accept while following the speech at their MCL. They were instructed to use the same pattern of adjustments as described for MCL: turn up the volume of the noise until it was too loud to comfortably listen to the speech (i.e., the first deviation point), then to reduce the noise intensity until the speech became very clear (i.e., the second deviation point) and lastly to find the maximal background noise level they were willing to put up with while following the speech signal (BNL).

### Familiarization Phase


In order to familiarize participants with the ANL procedure prior to actual testing, each participant was presented with a phonetically balanced Dutch training fragment. A 2-min-long recording of a female Dutch speaker reading a standard text passage (Dappere fietsers – 'Brave cyclists') served as training material. The noise stimulus (BAB8) used throughout the actual ANL test (BNL part) also served as background noise during the training session. Participants first received written instructions on the experimental task (which was a Dutch translation of the instruction provided in Nábělek et al., 2006, p. 639). The experimenter then demonstrated the task, using scripted instructions, which again followed the translation of Nábělek et al. (2006). A visual display was available during the familiarization phase that enabled the participant, as well as the experimenter, to see the course of the presentation level during the MCL and the BNL tasks. Each participant had to demonstrate the expected intensity pattern (up-down-final adjustments, cf. deviation points above) three times in a row for both MCL and BNL components before they could proceed with the test phase.

### Test Phase

Unlike during the familiarization phase, visual output was available only to the experimenter during the ANL test sessions. Participants had to perform the MCL and BNL tasks for each of the six ANL test stimuli, and each of the two tasks was repeated three times in a row to decrease measurement error (cf. Brännström et al., 2014b; Walravens et al., 2014). The ANL for each fragment and for each participant was calculated by subtracting the mean BNL from the averaged MCL. Note that stimulus presentation was looped such that if participants had not provided their response before the end of the stimulus, the stimulus was automatically repeated. All participants managed to set the MCL and BNL levels within the stimulus duration in the test phase (minimal duration: 60 s for the ISTS).
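The ANL calculation described above (mean MCL minus mean BNL over the three repetitions) can be written out as a short sketch. The function name `anl` and the example levels are hypothetical; the levels are chosen only to be reachable in 2 dB steps and do not come from the study's data.

```python
from statistics import mean


def anl(mcl_runs, bnl_runs):
    """ANL in dB: mean MCL minus mean BNL over the repeated adjustments."""
    return mean(mcl_runs) - mean(bnl_runs)


# Hypothetical levels in dB SPL for one participant and one fragment.
mcl_runs = [57, 59, 61]   # three most-comfortable-level settings
bnl_runs = [51, 53, 49]   # three maximum accepted noise levels

result = anl(mcl_runs, bnl_runs)  # 59 - 51 = 8 dB
```

A lower value means that the listener accepted a noise level closer to the speech level, that is, more noise.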

### Test Repetition

In order to test the repeatability of the ANL measures across the different materials, we asked the participants to do the ANL task twice for each stimulus type (ISTS, SENT, CONV) with exactly the same material. Note that we took into account that the repetition of the exact same materials across sessions could lead to substantial priming effects, especially for the meaningful materials, by including a control variable in our models to capture changes in ANL over test sessions. Participants first performed the ANL test with the different materials at the beginning of the test session, and again (approximately 1 h later) toward the end of the session. Participant characteristics data were collected in between these two ANL test sessions. During the first ANL session (session I), six different fragments were presented: ISTS, SENT, CONV1, CONV2, CONV3, and CONV4. To restrict testing time, we only presented one fragment for each of the three material types in the test repetition (session II): ISTS, SENT, and CONV4. We selected the CONV4 stimulus from the four conversational test fragments because it featured a female speaker (as was the case for the ISTS and the SENT material) and because

### Randomization

We used a block-wise randomization procedure to minimize presentation order effects for the material types. Each participant was pseudorandomly assigned to one out of six possible block orders for the speech material types (ISTS, SENT, CONV). The order of the presented speech material types for the second test session (session II) matched the order of session I.

The order in which the four conversational materials appeared in the first ANL test session was also randomized. Each participant was randomly assigned one out of 24 possible presentation orders for the conversational speech stimuli.
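The counts of six block orders and 24 conversational orders follow from the number of permutations of three and four items. The sketch below enumerates them; the assignment rule (cycling through block orders and drawing a conversational order at random) is an assumption made for illustration, as the text does not specify how the pseudorandom assignment was implemented.

```python
import itertools
import random

MATERIALS = ("ISTS", "SENT", "CONV")
CONV_FRAGMENTS = ("CONV1", "CONV2", "CONV3", "CONV4")

# All possible orders: 3! = 6 block orders and 4! = 24 conversational orders.
BLOCK_ORDERS = list(itertools.permutations(MATERIALS))
CONV_ORDERS = list(itertools.permutations(CONV_FRAGMENTS))


def assign_orders(participant_index, rng):
    """Return (block order, conversational order) for one participant.

    Assumption: block orders are cycled so that each is used about equally
    often; the conversational order is drawn at random.
    """
    block_order = BLOCK_ORDERS[participant_index % len(BLOCK_ORDERS)]
    conv_order = rng.choice(CONV_ORDERS)
    return block_order, conv_order


rng = random.Random(1)  # fixed seed so the illustration is reproducible
for i in range(3):
    print(i, *assign_orders(i, rng))
```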

# Tests of Participant Characteristics

### Hearing (Pure-Tone Average)

Hearing status was screened with air conduction pure-tone audiometry using the modified Hughson-Westlake technique for octave frequencies between 250 and 8000 Hz, including two half-octave frequencies of 3000 and 6000 Hz (see **Figure 1**). Averaged audiometric thresholds were calculated for the better ear because auditory presentation of the ANL test was binaural. Seven participants showed an asymmetric hearing loss, defined as an interaural difference of more than 10 dB averaged over 500, 1000, 2000, and 4000 Hz (Noble and Gatehouse, 2004). In addition to the pure-tone average (PTA) over 1000, 2000, and 4000 Hz, we calculated a high-frequency pure-tone average (PTA<sub>HF</sub>) as the mean threshold over 3000, 4000, 6000, and 8000 Hz. **Table 1** displays descriptives for the two PTA measures. Higher values indicate poorer hearing.
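The threshold averages and the asymmetry criterion can be illustrated with a short sketch. The audiograms below are hypothetical, and treating the ear with the lower average as the better ear is an assumption; only the frequency sets and the 10 dB criterion come from the text.

```python
from statistics import mean

PTA_FREQS = (1000, 2000, 4000)           # pure-tone average
PTA_HF_FREQS = (3000, 4000, 6000, 8000)  # high-frequency pure-tone average
ASYMMETRY_FREQS = (500, 1000, 2000, 4000)


def average_threshold(audiogram, freqs):
    """Mean threshold (dB HL) over the given frequencies (Hz)."""
    return mean(audiogram[f] for f in freqs)


def better_ear_average(left, right, freqs):
    """Average for the ear with the lower (better) mean threshold (assumption)."""
    return min(average_threshold(left, freqs), average_threshold(right, freqs))


def is_asymmetric(left, right, criterion_db=10):
    """True if the interaural difference exceeds the criterion."""
    diff = average_threshold(left, ASYMMETRY_FREQS) - average_threshold(right, ASYMMETRY_FREQS)
    return abs(diff) > criterion_db


# Hypothetical audiograms: frequency (Hz) -> threshold (dB HL).
left = {250: 10, 500: 10, 1000: 15, 2000: 20, 3000: 25, 4000: 25, 6000: 30, 8000: 40}
right = {250: 20, 500: 25, 1000: 30, 2000: 30, 3000: 35, 4000: 35, 6000: 45, 8000: 55}

print(better_ear_average(left, right, PTA_FREQS))     # 20
print(better_ear_average(left, right, PTA_HF_FREQS))  # 30
print(is_asymmetric(left, right))                     # True
```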

### Speech Perception in Noise

Speech perception in noise was tested using a standard Dutch speech audiometry test, the CVC word material from Bosman and Smoorenburg (1992, 1995), which is common in clinical practice in the Netherlands. The test allows presenting the materials at SNRs which are reasonably representative of noise levels during everyday communication (Smeds et al., 2015). This test material consists of meaningful monosyllables (e.g., kaas, 'cheese') produced by a female speaker and arranged in lists of 12 words. The material was presented in a sound-attenuated booth using Behringer MS16 loudspeakers placed in front of the listener (0° azimuth) at a distance of one meter. The CVC words were presented at an intensity level of 65 dB (SPL) mixed with a masking noise of the same intensity (long-term average spectrum of the recorded speaker). The test score was based on the number of correctly reproduced phonemes (max. three per test item), discarding the first item of each list (which is considered a practice item). Based on Bosman and Smoorenburg's standardization results, we expected a mean phoneme accuracy score of about 80–85% for normal-hearing adult participants at an SNR of 0 dB (more favorable signal-to-noise ratios may thus lead to ceiling effects in performance). All participants were presented with five consecutive lists (lists 31–35), which resulted in a maximum accuracy score of 165 phonemes correct (5 lists × 11 items × 3 phonemes). The speech perception in noise score reported here was quantified as the percentage of correct phonemes produced. **Table 1** provides the descriptives for the perception in noise score. Higher values indicate better speech perception in noise.
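The scoring rule described above (12 words per list, first word unscored, three phonemes per word) can be sketched as follows. The function name and the response data are hypothetical; the sketch only reproduces the arithmetic behind the maximum of 165 phonemes.

```python
WORDS_PER_LIST = 12
PHONEMES_PER_WORD = 3


def phoneme_score(lists):
    """Percentage of phonemes correct.

    `lists` holds one sequence per word list; each entry is the number of
    correctly reproduced phonemes (0-3) for one word. The first word of each
    list is treated as a practice item and is not scored.
    """
    scored = [n for word_list in lists for n in word_list[1:]]
    max_phonemes = len(scored) * PHONEMES_PER_WORD
    return 100.0 * sum(scored) / max_phonemes


# Hypothetical perfect performance on five lists.
perfect = [[3] * WORDS_PER_LIST for _ in range(5)]
max_phonemes = sum(len(word_list) - 1 for word_list in perfect) * PHONEMES_PER_WORD

print(max_phonemes)            # 165
print(phoneme_score(perfect))  # 100.0
```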

### Reading Span

We used a Dutch version of the well-established reading span test to index working memory (cf. Daneman and Merikle, 1996; Besser et al., 2013; Besser, 2015, Unpublished). The Dutch test consists of 54 grammatically correct sentences, each consisting of a noun phrase plus verb phrase. The 54 sentences are divided into 12 sets of three, four, five, or six consecutive sentences. Half of the 54 sentences make sense (e.g., The student sang a song); the other half are absurd (e.g., The daughter climbed the past). The sentences were presented orthographically in chunks: first the subject noun phrase was presented (determiner-noun, e.g., The student), followed by the verb (e.g., sang), followed by the object noun phrase (determiner-noun, e.g., a song; cf. Besser, 2015, p. 173). We used E-prime (2.0, Psychology Software Tools) to present the chunks of the respective test sentences (Subject, Verb, and Object) consecutively on a computer screen (display time of each chunk: 800 ms, blank inter-chunk interval: 75 ms). Font size was 36 pt (Verdana). The primary unspeeded task was to repeat back either the first or the last nouns of the respective test set, which ranged in length from three to six consecutive sentences. Thus, participants were visually prompted to (orally) recall either the subject noun phrases (first nouns) or the object noun phrases (last nouns) of the 12 test sets. The order in which participants recalled the first or last words was not taken into consideration for the scoring (cf. Besser et al., 2013). Additionally, participants were asked to perform a speeded plausibility judgment after each sentence as a secondary task. This task ensured that participants read and comprehended the sentences. Response time was restricted by imposing a time-out of 1.75 s after a visual prompt appeared that initiated the plausibility judgment task. Participants gave their plausibility judgment by pressing either a red (i.e., absurd) or a green button (i.e., makes sense) on a customized standard keyboard. Participants received written task instructions and completed a training test set before the actual test started. Reading span score was quantified as the percentage of correctly recalled nouns across the 12 sets. **Table 1** displays the descriptives for the Reading Span test. Higher values indicate better working memory capabilities.

### Self Control

Participants filled in a Dutch translation of the Brief Self-Control Scale, a 13-item questionnaire using a five-point Likert scale (Tangney et al., 2004; cf. Kuijer et al., 2008). Individual test scores were quantified as the percentage of points out of the maximum of 65 points. **Table 1** displays the descriptives for the self-control predictor variable. Higher values indicate better self-control abilities.

### SSQ Questionnaire

Prior to the ANL testing session, participants filled in an online (Dutch) version of the Speech, Spatial and Qualities of Hearing Scale (SSQ, Gatehouse and Noble, 2004). The SSQ self-report scale, which consists of 49 items, is subdivided into three parts: Part 1: 'Speech hearing' (14 questions), Part 2: 'Spatial hearing' (17 questions), and Part 3: 'Qualities of hearing' (18 questions). Following Akeroyd et al. (2014), we extracted a factor related to listening effort covering question numbers 15 and 18 of the SSQ subscale 'Qualities of hearing' ('Do you have to put in a lot of effort to hear what is being said in conversation with others?'; 'Can you easily ignore other sounds when trying to listen to something?'). Hence, we calculated the SSQ 'effort and concentration' subscale by averaging scores over these two questions. We also calculated the average score for the first and for the third SSQ part, as these two were deemed most relevant. **Table 1** presents the descriptive values for averaged SSQ 'Speech hearing' and 'Qualities of hearing' scores, as well as for the factor related to listening effort (SSQ 'effort and concentration'). Higher values on the SSQ scale indicate fewer limitations in self-reported activity due to hearing problems. **Table 2** provides a correlation matrix of all the participant-related characteristics.

# Analyses

### RQ1

Two separate statistical regression models were run to investigate the effects of meaningfulness and coherence (RQ1) of the test material on ANL, using linear mixed-effect models with participants as random variable. The program R was used with the lme4 package (Bates et al., 2013) and restricted maximum likelihood estimation. p-values were calculated using the Anova function of the car package, which calculates type II Wald χ² values. The categorical within-subject variable meaningfulness included two levels: not meaningful (ISTS material) vs. meaningful (CONV and SENT material). The within-subject variable coherence featured two categories: coherent on sentence level (SENT material) vs. coherent on discourse level (CONV material). Block order (order a–f) was included as an additional control variable in all models. For the model on meaningfulness (model 1A), we allowed for the possibility that the effect of meaningfulness differed across participants by including a random participant slope for meaningfulness. Similarly, we allowed for the possibility that the effect of semantic coherence differed across participants by including a random participant slope for coherence in the 'coherence' analysis (model 1B). Note that we also included the interaction between session number and meaningfulness (in model 1A) or between session number and coherence (in model 1B), to allow for the possibility that ANLs may systematically change with session number due to semantic priming. Consequently, we also allowed for the possibility that the effect of session number differed across participants by including a random participant slope for session number in both models (model 1A, model 1B).
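One way to write down the structure of model 1A is given below. This is a sketch based only on the verbal description; the symbols, the coding of the predictors, and the omission of any covariance structure for the random effects are assumptions, not details reported by the authors. Here $i$ indexes participants and $j$ indexes observations, $M_{ij}$ codes meaningfulness, $S_{ij}$ codes session number, and $B_{ik}$ are five indicator variables for the six block orders.

```latex
\begin{aligned}
\mathrm{ANL}_{ij} ={}& \beta_0 + \beta_1 M_{ij} + \beta_2 S_{ij} + \beta_3 M_{ij} S_{ij}
  + \sum_{k=1}^{5} \gamma_k B_{ik} \\
 & + u_{0i} + u_{1i} M_{ij} + u_{2i} S_{ij} + \varepsilon_{ij}
\end{aligned}
```

The terms $u_{0i}$, $u_{1i}$, and $u_{2i}$ stand for the random participant intercept and the random participant slopes for meaningfulness and session number. Under the same assumptions, model 1B would have the same form with a coherence predictor in place of $M_{ij}$.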

### RQ2

We first ran a linear mixed-effect model (with random intercepts for participants) with ANL differences between test sessions as dependent variable. The question was whether ANL values obtained for the three types of speech materials differed in their repeatability across test sessions. One outlier was excluded from repeatability analysis of the ISTS material as the ANL difference between sessions I and II of this participant exceeded a threshold of the sample mean plus three standard deviations.
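The exclusion rule (sample mean plus three standard deviations) can be sketched as follows. The difference scores are hypothetical, and the use of the sample standard deviation computed over all participants, including the candidate outlier, is an assumption.

```python
from statistics import mean, stdev


def exclude_high_outliers(differences, n_sd=3.0):
    """Drop values above mean + n_sd * SD of the full sample."""
    cutoff = mean(differences) + n_sd * stdev(differences)
    return [d for d in differences if d <= cutoff]


# Hypothetical session I minus session II ANL differences (dB).
differences = [1.0, -1.0] * 10 + [20.0]

kept = exclude_high_outliers(differences)
print(len(differences), len(kept))  # 21 20
```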

Apart from the mixed-effect analysis described above, we followed the procedures described by Brännström et al. (2014b) to assess the repeatability of the three speech materials. Hence, we inspected the Bland–Altman plots (Bland and Altman, 1986; Vaz et al., 2013) as well as the coefficient of repeatability (henceforth, CR) for each of the three test materials for which two test sessions had been run. The CR measure is a repeatability (test–retest reliability) measure. It indicates the size of the measurement error in its original measured unit (i.e., dB). In our case, it represents the size of the difference between one measurement (session) and another measurement using the exact same material (with 95% confidence level). The Bland–Altman plots show for each of the three speech materials (ISTS, SENT, CONV4) each participant's mean ANL over the two sessions on the x-axis against the difference between the two sessions on the y-axis. The CR was calculated for each material by multiplying the standard deviation of the differences between ANLs (averaged over repetitions) for the two sessions by 1.96. Additionally, we calculated the coefficients of repeatability for all test materials (i.e., incl. CONV1, CONV2, and CONV3) over their three repetitions within test sessions (repetition 1 vs. repetition 2; repetition 2 vs. repetition 3). This enabled us to analyze whether repeatability changed within and across test sessions.
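The CR and the coordinates of a Bland–Altman plot can be computed as in the sketch below. The ANL values are hypothetical, and the use of the sample standard deviation and of session I minus session II as the difference are assumptions.

```python
from statistics import stdev


def bland_altman_points(session1, session2):
    """Return (mean of the two sessions, difference) per participant."""
    return [((a + b) / 2, a - b) for a, b in zip(session1, session2)]


def coefficient_of_repeatability(session1, session2):
    """CR in dB: 1.96 times the SD of the between-session differences."""
    differences = [a - b for a, b in zip(session1, session2)]
    return 1.96 * stdev(differences)


# Hypothetical ANLs (dB) for four participants in two sessions.
session1 = [5.0, 7.0, 3.0, 9.0]
session2 = [4.0, 9.0, 3.0, 6.0]

print(bland_altman_points(session1, session2)[0])                  # (4.5, 1.0)
print(round(coefficient_of_repeatability(session1, session2), 2))  # 4.08
```

Because the CR is expressed in dB, it can be compared directly with the differences in ANL that matter for interpretation.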

### RQ3

To assess whether self-reported hearing-related activity limitations and listening effort differentially predict ANL outcomes for the three different speech materials (RQ3), we set up four linear mixed-effect models that included a categorical speech material variable (ISTS, SENT, CONV) in interaction with one of three variables derived from the SSQ scale (SSQ Part 1, SSQ Part 3, SSQ 'effort and concentration'). Session number was added as a categorical covariate to capture repetition effects due to semantic priming. Again, we allowed for the possibility that the effects of session number and speech material differed across participants and therefore added random slopes for the variables speech material and session number to the model.

### RQ4

To investigate the effects of participant characteristics (age, hearing thresholds, speech perception in noise accuracy, working memory, and self-control abilities) on ANL for the three speech materials (RQ4), we performed 15 correlation analyses (Pearson's r) and applied a Bonferroni correction for multiple comparisons. ANL values were pooled across the two test sessions.
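The sketch below illustrates the Bonferroni-adjusted significance level for 15 tests together with a plain implementation of Pearson's r. Reading the 15 analyses as five characteristics crossed with three materials is an inference from the text, the data are hypothetical, and the computation of p-values is omitted.

```python
import math
from statistics import mean

N_CHARACTERISTICS = 5  # assumed: one variable per listed characteristic
N_MATERIALS = 3        # ISTS, SENT, CONV
N_TESTS = N_CHARACTERISTICS * N_MATERIALS
ALPHA = 0.05
ALPHA_BONFERRONI = ALPHA / N_TESTS


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


print(N_TESTS)                     # 15
print(round(ALPHA_BONFERRONI, 4))  # 0.0033
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```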

### RESULTS

**Table 3** shows the ANL test results per speech material per test session for the three unrepeated conversational materials (CONV1-3) and the three repeated materials (CONV4, SENT, ISTS). Mean ANLs are higher for the ISTS material than for the meaningful materials. **Figure 2** gives an overview of the ANL test results per test session including the conversational materials that were only presented in test session I (i.e., CONV1, CONV2, and CONV3).

# Research Question 1A: Does ANL Outcome Depend on the Meaningfulness of the Speech Material?

The results of the statistical model (cf. **Table 4**) showed that ANLs for the meaningful materials (SENT, CONV) were significantly different from those for the non-meaningful ISTS material [χ²(1, N = 341) = 17.98, p < 0.001]. Participants showed 1.46 dB higher ANLs and thus less noise acceptance for the ISTS signal in comparison with the meaningful materials. The observed effect direction matched our a priori hypothesis that participants would accept less noise for the non-semantic ISTS material than for the meaningful materials. Block order of presentation did not influence ANL, nor did session number. These control variables also did not interact with the meaningfulness of the test material. The absence of a significant effect of session number on ANL suggests that ANL was stable over sessions and that no semantic priming occurred between sessions. This absence of priming held across material types, as the meaningfulness × session number interaction was not significant. Block order did not affect the ANL outcome, which suggests that our randomization procedure was adequate. For reasons of brevity, block order is left out in the model presented below [the variable having six levels; χ²(5, N = 341) = 2.13, p > 0.1].

We also investigated the effect of meaningfulness including all conversational materials (this implies that it can only be assessed for session I). To that end, we averaged ANLs per participant over the conversational materials (CONV1–CONV4). In line with the results presented in **Table 4**, this analysis showed an effect of meaningfulness on ANL with less noise acceptance for the non-meaningful ISTS material compared to the two types of meaningful materials [χ²(1, N = 170) = 18.47, p < 0.001].

# Research Question 1B: Does ANL Outcome Depend on the Semantic Coherence of the Speech Material?

A significant effect of coherence was observed, with higher ANLs for the material with coherence on discourse level, i.e., the conversational material, than for the concatenated sentences [χ²(1, N = 227) = 6.04, p < 0.05; cf. **Table 5**]. Thus, for the conversational test material participants accepted less background noise. The size of the effect was 1.05 dB. The observed direction of the effect matched the hypothesis that participants would accept less noise for the conversational material, which was coherent at the discourse level, but may have been more difficult in terms of speech rate and speaking style than the concatenated sentences. Again, neither simple nor interaction effects (with the variable of interest, i.e., coherence) were found for the predictors session number and block order, suggesting that the randomization procedures were appropriate and that there was no semantic priming from the first to the second session. The control variable block order is not included in the model below for reasons of brevity [χ²(5, N = 227) = 2.62, p > 0.1].

We also investigated whether the coherence effect can be generalized to different conversational speech fragments by replacing the conversational ANL values in the analysis above (CONV4) by the average ANL over the four conversational speech materials (CONV1–CONV4) per participant (for the first session only). The results of this alternative analysis did not replicate the previous finding of a coherence effect on ANL [χ²(1, N = 113) = 1.41, p > 0.1]. Thus, there is no clear evidence for a coherence effect on ANL in our data. We raised the possibility that speech rate may affect ANL outcomes and that the difference between the conversational and concatenated sentences material is not just about discourse coherence, but also about speech rate. To follow up on that, we tested whether speech rate differences between the four conversational fragments affected ANL outcome by setting up a linear mixed-effect model with speech rate as a continuous predictor of ANL (first session measurements only, only conversational fragments). Speech rate turned out not to be a significant predictor of ANL in this subset analysis [χ²(1, N = 228) = 0.33, p > 0.1].

### TABLE 5 | Model testing for the effect of semantic coherence on ANL.

Significance level notation: <sup>∗</sup>p < 0.05; <sup>ns</sup>p > 0.1.

## Research Question 2: Does ANL Repeatability Differ Across Speech Material Types?

The mixed-model analysis did not show a significant speech material effect on repeatability of the ANL, quantified as the difference between the ANLs per participant for the two test sessions [χ²(2, N = 169) = 0.57, p > 0.1]. In an additional analysis on repeatability across material types we used the statistical approach of the coefficient of repeatability (CR). **Figure 3** displays the Bland–Altman plots for the three materials for which two test sessions had been run.

The highest coefficient of repeatability and thus the lowest repeatability was found for the ISTS material (CR = ±6.65 dB). Both the concatenated sentences material (SENT) and the conversational material showed lower coefficients of repeatability and thus numerically slightly better repeatability. For the concatenated sentences material (SENT) the CR was ±6.40 dB. The best repeatability (numerically) was found for the conversational test material with a CR of ±6.14 dB. The combination of these two analyses suggests comparable repeatability across the speech materials.

In an additional step we calculated the coefficients of repeatability for all test materials over subsequent repetitions within test sessions. **Table 6** shows that ANL repeatability increased numerically (i.e., CRs decreased) within test session I for all test materials except for CONV3. The same pattern of improved repeatability is seen for the CRs within test session II except for the SENT material. Overall, the repeatability in test session II does not seem to be numerically different from the repeatability in test session I. Note that repeatability seems to be most stable for the CONV4 material both within and across test sessions.

# Research Question 3: Are ANLs Differentially Associated with Self-Report Measures of Listening Effort and of Hearing-Related Activity Limitations for the Different Speech Materials?

We first tested whether the first subscale of the SSQ self-report questionnaire ('Speech hearing') would be associated with ANL outcomes. The model showed significant material effects [χ²(2, N = 341) = 21.39, p < 0.001] with highest ANLs found for the ISTS material and lowest ANLs for the sentence material (SENT). Importantly, this model showed a significant effect of the subjective questionnaire predictor SSQ (subscale 'Speech hearing') on ANL [χ²(1, N = 341) = 4.62, p < 0.05, see **Table 7**]. Higher scores on the SSQ subscale (i.e., fewer self-reported limitations due to hearing problems) were associated with more noise acceptance and thus lower ANLs. For an increase of 1 point on the SSQ 'Speech hearing' subscale the model predicted an ANL decrease of approximately 1 dB, which corresponds to an overall effect size of 4.4 dB across the observed range of the SSQ 'Speech hearing' subscale (4.86 to 9.36, i.e., a span of 4.5 points). However, the model did not show differential SSQ subscale effects on ANL for the three materials [χ²(2, N = 341) = 0.74, p > 0.1].

TABLE 6 | Coefficients of repeatability (in dB) for ANL for the six speech materials and the two test sessions contrasting subsequent repetitions.

TABLE 7 | Model testing for differential associations between SSQ subscale scores and ANLs for three speech materials (CONV, SENT, ISTS).

Significance level notation: <sup>∗</sup>p < 0.05; <sup>ns</sup>p > 0.1.

We also investigated the association between the third subscale of the SSQ self-report questionnaire ('Qualities of hearing') and ANL. The model showed significant material effects with lowest ANLs for the sentence material [χ²(2, N = 341) = 21.31, p < 0.001]. However, we did not find an association between ANL and the third subscale of the SSQ self-report [χ²(1, N = 341) = 0.43, p > 0.1], nor differential SSQ 'Qualities of hearing' effects on ANL for the three materials [χ²(2, N = 341) = 1.56, p > 0.1].

In a third step we analyzed the association between the factor 'Effort and concentration' (questions number 15 and 18 of the 'Qualities of hearing' subscale of the SSQ) and ANL. As for the analyses above, the model showed significant material effects with lowest ANLs for the sentence material [χ²(2, N = 341) = 21.32, p < 0.001]. Yet, neither an association of ANL with the factor 'Effort and concentration' [χ²(1, N = 341) = 1.80, p > 0.1] nor differential 'Effort and concentration' effects on ANL for the three materials were found [χ²(2, N = 341) = 1.30, p > 0.1].

Additionally, we explored the strength of the association between the SSQ self-report measures (subscale 'Speech hearing') and the ANLs (pooled over sessions) separately for the three materials by running correlation analyses. A marginally significant correlation was found only for the conversational material (CONV; Pearson's r = −0.23, p = 0.082).

# Research Question 4: Do Participant Characteristics such as Working Memory (4A), and Age, Hearing Thresholds, Speech Perception in Noise, and Self-control Abilities Predict ANL (4B)?

Again, ANLs were pooled over the two test sessions for each of the three materials. Working memory was not correlated with ANL (p > 0.1). Likewise, none of the other correlations (N = 15) were statistically significant at an alpha level of 0.05 (i.e., not even before application of any correction required for multiple testing). Similarly, adding participant characteristics as continuous variables to either of the linear mixed-effect models discussed above (for research questions 1A and 1B) did not yield any significant effects of these participant-related variables.

# DISCUSSION

The clinical purpose of the ANL test is to predict self-reported hearing problems and future hearing aid success as reliably as possible. Therefore, it is crucial to know whether and how its clinical applicability depends on what speech material listeners are presented with and how the test is administered. Material effects on the outcome of the ANL test have been addressed in numerous studies (von Hapsburg and Bahng, 2006; Gordon-Hickey and Moore, 2008; Olsen et al., 2012a,b; Ho et al., 2013; Olsen and Brännström, 2014). In a number of recent publications (Brännström et al., 2012a, 2014a,b; Olsen et al., 2012a,b), the ISTS (Holube et al., 2010) has been used, which is non-meaningful by definition. However, the original ANL test fragment used by Nábělek et al. (2006), in which ANL outcome was shown to be predictive of hearing aid uptake, was a meaningful and coherent read story, and thus linguistically different from the ISTS material. With the present study we investigated material effects on ANL to find out whether meaningfulness and coherence affect ANL (RQ1). In addition, we evaluated the repeatability of the ANL test across a range of test materials to check whether ecologically more valid materials yield repeatability comparable to that of more standard audiology materials and the ISTS signal (RQ2). Further, we analyzed the association between ANLs and the outcome of a questionnaire that measures activity limitations due to hearing problems to elaborate on the connection between listening effort and ANLs (RQ3). We also reexamined the association of working memory and self-control abilities with ANLs (RQ4) found in previous studies (Brännström et al., 2012b; Nichols and Gordon-Hickey, 2012).

As expected, ANLs were higher for the ISTS material in comparison with the meaningful materials. Our interpretation of this effect is that the available redundancy for the meaningful materials facilitated speech processing (via top–down processing) and thus led participants to choose higher levels of acceptable noise (i.e., lower ANLs) than for the non-meaningful material. The unintelligible ISTS signal might have led participants to still want to hear as much as possible (i.e., relying more heavily on bottom–up processing). Furthermore, contrasting conversational ANL test materials with a passage of concatenated standard audiology sentences, we did not find convincing evidence for a semantic coherence effect on ANL. Possibly, the faster and more casual speaking style in the conversational material made listening more difficult, but this speaking style effect may have been offset by greater semantic coherence in the conversation, providing a form of discourse redundancy. The data did not provide clear evidence for priming effects across test sessions (but note that **Table 6** shows that coefficients of repeatability were largest between the first and second measurement within test session I). All in all, these results provide some evidence that top–down processing plays a role in ANL performance.

An important question was whether repeatability differs across the three speech materials. Neither the statistical modeling approach nor the analysis of the coefficient of repeatability (CR) showed statistically differential repeatability. Rather, repeatability was comparable for the three speech material types with CR values ranging between ±6.14 dB for the conversational material and ±6.65 dB for the ISTS material. Crucially, a coefficient of repeatability lower than or equal to ±6 dB ensures that measurement error is lower than the distance between the two thresholds used to categorize hearing aid users as either successful or unsuccessful (≤7 and >13 dB, cf. Nábělek et al., 2006). Across test sessions, all three speech material types yielded CRs just above the critical ±6 dB threshold. With respect to ANL repeatability within test sessions, the conversational material (CONV4) yielded the most stable CRs with values below ±6 dB. Our interpretation of the relatively high CR values across sessions is that listeners' internal criteria for MCL and BNL may be somewhat variable over time, particularly if they are engaged in other activities in between test and retest measurements. As suggested by Brännström et al. (2014b), noise acceptance while following speech may best be considered a range (Acceptable Noise Range), rather than a specific level (ANL). The relatively poor repeatability of ANL may raise concerns about the clinical value of the ANL as an indicator for hearing aid use and success. However, if the ANL is used to compare two hearing aid conditions within one session, within-session reliability seems to be sufficient. For example, the ANL has been used successfully to show the effect of a noise reduction algorithm (Mueller et al., 2006; Peeters et al., 2009; Dingemanse and Goedegebure, 2015). Further research would be required to investigate whether Acceptable Noise Range may be a more reliable predictor of hearing problems and future hearing aid success than ANL.

Our analysis of the association between ANLs and the outcome of a subjective hearing-related questionnaire (RQ3) relates to recent discussion about the clinical meaning of concepts such as listening effort and fatigue in hearing-impaired individuals (McGarrigle et al., 2014). Our data showed a significant effect of participants' score on the subscale 'Speech hearing' of the Speech, Spatial, and Qualities of Hearing self-report (SSQ, Gatehouse and Noble, 2004) on ANL, particularly when listening to conversational speech. Participants who reported fewer listening problems also tolerated more noise while listening to speech (i.e., lower ANLs). Most questions of the 'Speech hearing' subscale are about conversation in noise. Both measurements (SSQ and ANL) are subjective judgments, whereas SRT measurements are not. This makes an association between ANL and SSQ more likely than an association between SRT and SSQ. The subscale 'Qualities of Hearing' was not significantly correlated with ANL. The between-participant differences of the 'quality of sound rating' were relatively small in this group of nearly normal-hearing participants. Possibly, perceived sound quality and ANL may be associated among hearing-impaired participants. No association was found between ANL and the subscale 'Effort and Concentration.' This suggests that noise tolerance (as one aspect of listening comfort) is a different concept from the listening effort concept as formulated in these specific questionnaire questions. Further research should clarify differences and commonalities of both concepts.

The association between self-reported listening difficulties in noise and noise acceptance (i.e., ANL) only becomes evident when such an ANL test relates to everyday experiences. We think this result clearly makes a case for the use of ecologically valid conversational materials in clinical testing. Audiologists and speech researchers should think about how representative the type of noise and noise levels are of everyday listening, but they should also care about differences between read aloud speech and spontaneous conversation.

Further, the attempt to replicate working memory effects on ANL was unsuccessful. This suggests that noise tolerance, as one aspect of listening comfort, is not related to individual working memory capacity. Importantly, in line with previous studies (cf. Akeroyd, 2008), working memory was considerably correlated with speech perception in noise (cf. **Table 2**), with higher working memory relating to better speech perception. The failure to replicate working memory effects on ANL in our study can be accounted for in two ways. First, it may be due to the use of different test materials and test procedures to quantify working memory. The test that Brännström et al. (2012b) used to quantify working memory was an auditory version of the reading span task in which the examiner presented the sentences orally, which may have increased the contribution of hearing. Alternatively, the lack of a correlation between ANL and working memory can be taken to underline that ANL and speech perception in noise are different in nature. The latter account ties in with our observation that ANLs did not relate to age, hearing thresholds, and speech-in-noise perception abilities. This held in the relatively good-hearing adult sample as tested here, but was also found by Nábělek et al. (1991, 2004), Freyaldenhoven et al. (2007), Plyler et al. (2007), and Moore et al. (2011) for both normal-hearing and hearing-impaired participants. Moreover, we have not found evidence for an association between ANL and self-control abilities reported in Nichols and Gordon-Hickey (2012). However, the latter study used a self-control scale containing 36 items in contrast to the Brief Self-Control Scale with 13 items that we asked our participants to fill in.

The combined pattern of results converges on material effects being present for the ANL test with better noise tolerance and slightly better and more stable repeatability, at least numerically, for meaningful stimuli. We have also shown that activity limitations due to hearing problems and ANLs are related, especially if conversational materials are used as ANL test material. More natural speech materials can thus be used in a clinical setting as repeatability is not reduced compared to more standard materials. We aim to conduct follow-up research to investigate whether ecologically valid test materials – such as the conversational speech material used in this study – can be used to improve the predictive power of the ANL test for hearing aid success, relative to more standardized speech materials.

# AUTHOR CONTRIBUTIONS

XK, GD, and EJ designed the experiment; XK and GD prepared the speech materials; XK conducted the experiment and analyzed the data; XK wrote the paper supported by input from GD, AG, and EJ; GD developed the computerized test procedure.

# FUNDING

This research was supported by the Netherlands Organization for Research (NWO) under Project No. 276-75-009 (grant awarded to EJ).

# REFERENCES



with normal and impaired hearing. J. Speech Lang. Hear. Res. 50, 878–885. doi: 10.1044/1092-4388(2007/062)


normal hearing. Int. J. Audiol. 51, 353–359. doi: 10.3109/14992027.2011.645074


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Koch, Dingemanse, Goedegebure and Janse. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Auditory-Visual Speech Benefit on Working Memory in Older Adults with Hearing Impairment

Jana B. Frtusova and Natalie A. Phillips \*

Cognition, Aging, and Psychophysiology Lab, Department of Psychology, Concordia University, Montreal, QC, Canada

This study examined the effect of auditory-visual (AV) speech stimuli on working memory in older adults with poorer-hearing (PH) in comparison to age- and education-matched older adults with better hearing (BH). Participants completed a working memory n-back task (0- to 2-back) in which sequences of digits were presented in visual-only (i.e., speech-reading), auditory-only (A-only), and AV conditions. Auditory event-related potentials (ERP) were collected to assess the relationship between perceptual and working memory processing. The behavioral results showed that both groups were faster in the AV condition in comparison to the unisensory conditions. The ERP data showed perceptual facilitation in the AV condition, in the form of reduced amplitudes and latencies of the auditory N1 and/or P1 components, in the PH group. Furthermore, a working memory ERP component, the P3, peaked earlier for both groups in the AV condition compared to the A-only condition. In general, the PH group showed a more robust AV benefit; however, the BH group showed a dose-response relationship between perceptual facilitation and working memory improvement, especially for facilitation of processing speed. Two measures, reaction time and P3 amplitude, suggested that the presence of visual speech cues may have helped the PH group to counteract the demanding auditory processing, to the level that no group differences were evident during the AV modality despite lower performance during the A-only condition. Overall, this study provides support for the theory of an integrated perceptual-cognitive system. The practical significance of these findings is also discussed.

Edited by: Jerker Rönnberg,

Linköping University, Sweden

# Reviewed by:

Frederick Jerome Gallun, VA Portland Health Care System, USA

Mitchell Sommers, Washington University in St. Louis, USA

\*Correspondence:

Natalie A. Phillips natalie.phillips@concordia.ca

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 19 December 2015 Accepted: 21 March 2016 Published: 12 April 2016

### Citation:

Frtusova JB and Phillips NA (2016) The Auditory-Visual Speech Benefit on Working Memory in Older Adults with Hearing Impairment. Front. Psychol. 7:490. doi: 10.3389/fpsyg.2016.00490

Keywords: aging, hearing impairment, speech perception, multisensory interaction, working memory, event-related potentials

# INTRODUCTION

Aging is associated with various physical and cognitive changes, including both structural and functional changes in the auditory system resulting in hearing difficulty. Hearing impairment is the third most common chronic condition in older adults, ranking just after arthritis and hypertension (Zhang et al., 2013), and it has a significant impact on older adults' quality of life (e.g., Strawbridge et al., 2000; Dalton et al., 2003). The most common cause of hearing impairment in older adults results from various structural and functional age-related changes in the cochlea (Schneider, 1997). In addition to elevated hearing thresholds, these changes affect the processing of temporal and spectral cues, which are important for speech perception (e.g., Baer and Moore, 1994; Schneider, 1997; Schneider and Pichora-Fuller, 2001; Pichora-Fuller et al., 2007). Research also indicates that older adults need to engage broader cortical networks to process speech compared to younger adults (Wong et al., 2009). Thus, age-related changes in the auditory system can have a negative effect on speech perception, making it more effortful and resource demanding.

In addition to hearing difficulty, one of the most common complaints of older adults is difficulty with remembering information. According to a model proposed by Schneider and Pichora-Fuller (2000) there is a direct link between perceptual and higher-order cognitive functioning, such as memory. More specifically, they have proposed that perceptual and cognitive functions share a common pool of processing resources. Under this theory, having to devote too many processing resources toward perception may result in insufficient residual resources for subsequent higher-order processing, such as encoding and storing of the information in memory. Thus, for older adults with hearing impairment, memory difficulty may be a secondary effect of having to devote too many processing resources to speech perception. This has been demonstrated by several studies, which have shown that hearing impairment as well as presentation of auditory information in background noise interferes with memory performance (e.g., Rabbitt, 1968, 1991; Pichora-Fuller et al., 1995; McCoy et al., 2005).

In contrast to a negative effect of hearing impairment, there is strong evidence indicating that auditory-visual (AV) speech, in which both auditory and visual speech cues (i.e., lip, tongue, and face movements) are available, enhances speech recognition (e.g., Sumby and Pollack, 1954; Klucharev et al., 2003; Bernstein and Grant, 2009; Ma et al., 2009; Tanaka et al., 2009; Fraser et al., 2010; Winneke and Phillips, 2011). Importantly, AV speech is not only associated with behavioral improvements of speech perception, but also with more efficient brain processing. This effect is indicated by studies using event-related potential (ERP) methodology, which measures electrical brain activity associated with different stages of stimulus-related processing (Luck, 2005). ERP components relevant to speech perception include the P1, which refers to a positive-going waveform peaking approximately 50 ms after the onset of the stimulus and that is proposed to originate from the primary auditory cortex (Liegeois-Chauvel et al., 1994), and the N1, which is a negative-going waveform that peaks approximately 100 ms after the onset of a sound and is proposed to originate from the secondary auditory cortex (Liegeois-Chauvel et al., 1994; Pantev et al., 1995).

The data from ERP research suggest that the brain elicits earlier and smaller responses during AV speech in comparison to auditory-only (A-only) speech modality. More specifically, both amplitude (van Wassenhove et al., 2005; Stekelenburg and Vroomen, 2007; Frtusova et al., 2013) and latency (van Wassenhove et al., 2005; Stekelenburg and Vroomen, 2007; Pilling, 2009; Winneke and Phillips, 2011; Frtusova et al., 2013) of the auditory N1 and/or P1 component are reduced during processing of AV compared to A-only speech. Overall, these results indicate that the brain is able to process auditory information more efficiently and produce better behavioral outcomes when visual speech cues are available.

According to the theory of an integrated perceptual-cognitive system proposed by Schneider and Pichora-Fuller (2000), the observed perceptual benefit of AV speech should lead to more resources being available for higher-order cognitive processes, such as encoding of information in memory, and thus improved behavioral performance. This has been confirmed by Pichora-Fuller (1996) who demonstrated that visual speech cues help to counteract the negative effect of noise on working memory (WM) performance. We have previously examined the effect of AV speech on WM using an n-back task while also measuring ERP responses (Frtusova et al., 2013). The n-back task has been found to be sensitive to age-related changes (e.g., Verhaeghen and Basak, 2005; Van Gerven et al., 2007, 2008; Vaughan et al., 2008; Vermeij et al., 2012) and it has been examined by previous ERP research. It has been found that P3 amplitude decreases with increased WM load (i.e., higher n-back condition; Segalowitz et al., 2001; Watter et al., 2001), while P3 latency seems independent of n-back manipulation (Watter et al., 2001; Gaspar et al., 2011). These results were interpreted as suggesting that P3 amplitude reflects demands related to updating of WM, with greater demands resulting in a lower P3 amplitude, while P3 latency reflects processing related to the comparison of the current stimulus with the one presented n-trials before (Watter et al., 2001).

During the n-back task used in our previous experiment (Frtusova et al., 2013) with normal-hearing younger and older adults, spoken digits were presented in either the visual-only (V-only), A-only, or AV modality. The results showed that participants were faster across all memory loads, and more accurate in the most demanding WM conditions (2- and 3-back) when stimuli were presented in the AV modality compared to in the A-only and the V-only modality. Furthermore, the AV modality was associated with facilitated perceptual processing as evidenced by an earlier-peaking auditory N1 component in both age groups, and a smaller auditory N1 amplitude in older adults in the AV condition compared to the A-only condition.

The aforementioned findings come mostly from studies of younger and older adults with normal hearing. There is evidence to suggest that individuals with hearing impairment also benefit from having speech presented in the AV modality in terms of improved speech recognition in noisy environments (Grant et al., 1998; Tye-Murray et al., 2007; Bernstein and Grant, 2009). Furthermore, Grant et al. (2007) found that, despite a lower performance in an A-only condition during a syllable recognition task, participants with hearing impairment performed similarly to normal-hearing individuals in an AV condition. Thus, there is an indication that visual speech cues can help older adults with hearing impairment to counteract the hearing difficulty experienced during A-only conditions.

There is a scarcity of ERP research examining AV speech perception in the hearing impaired population. In one study, Musacchia et al. (2009) measured auditory ERPs in a group of older adults with normal hearing and those with mild to moderate hearing loss during A-only, V-only, and AV speech perception. Participants were asked to watch and/or listen to a repeated presentation of a "bi" syllable. The results showed that the AV modality did not result in the same level of modulation of ERP components for the hearing impaired group as it did for the normal-hearing controls. Musacchia et al. (2009) interpreted these results as an indication that AV integration abilities are diminished in individuals with hearing impairment. Thus, the results of this study seem contradictory to the observed AV speech benefit reported in behavioral studies and more ERP studies are needed to clarify this issue.

Importantly, there is preliminary behavioral evidence that older adults with hearing impairment may derive a WM benefit from AV speech. Brault et al. (2010) asked older adults with normal hearing and those with mild/moderate hearing loss to recall the last three words from word lists of unpredictable lengths. The word lists were presented in either the AV or the A-only modality. The results showed that when the stimuli were not perceptually degraded by white noise, older adults with hearing impairment and good lip-reading ability benefited from AV speech in comparison to A-only speech. On the other hand, when the stimuli were presented in background noise, the AV speech benefit in comparison to the A-only condition was evident independent of hearing impairment status or lip-reading proficiency. However, Brault et al. (2010) thought that these improvements were more related to perception rather than WM.

Overall, AV speech seems to improve speech recognition in individuals with hearing impairment, and there is preliminary evidence that it may also lead to better WM performance. However, more studies that include a combination of behavioral and electrophysiological measures are needed to provide information about the AV interaction effect in individuals with hearing impairment. ERP methodology, in particular, can help to clarify the timing and nature of the AV interaction in this population in comparison to normal-hearing controls. In addition, this methodology can also help to clarify to what extent the behavioral WM improvements are in fact related to perceptual facilitation of auditory processing during AV speech.

The present study examined the effect of AV speech on WM in older adults with hearing impairment in comparison to age- and education-matched controls. WM was tested using an n-back task with 0-, 1-, and 2-back conditions, and with A-only, V-only, and AV stimuli. During the task, ERP responses were collected together with behavioral accuracy and reaction time (RT) measures.

Similar to our previous work (Frtusova et al., 2013), it was expected that participants would have higher accuracy and faster RT in the AV condition compared to the A-only and V-only conditions. In addition, both perceptual and WM facilitation was expected to be evident on ERP measures in the AV condition compared to the A-only condition. More specifically, participants were expected to have earlier-peaking and smaller amplitude auditory P1 and N1 components during the AV condition compared to the A-only condition, indicating perceptual facilitation. Furthermore, they were expected to have an earlier-peaking and greater amplitude P3 component during the AV condition compared to the A-only condition, indicating WM facilitation. Note that perceptual facilitation is indicated by smaller P1 and N1 amplitudes as this suggests that fewer resources are required for auditory processing whereas WM facilitation is indicated by greater P3 amplitude as this suggests that more resources are available for WM processing.

Based on the hypothesis that strenuous perceptual processing caused by hearing impairment affords fewer available cognitive resources for higher-order functions, and the expectation that this effect can be counteracted by AV speech cues, we predicted a greater AV benefit for the hearing impaired population.

Furthermore, we examined whether a direct relationship between perceptual facilitation and improvement on WM could be found. A greater facilitation of N1 amplitude, indicating more efficient perceptual processing, was expected to be associated with higher accuracy. Additionally, a greater facilitation of N1 latency, indicating faster perceptual processing, was expected to be associated with faster RT.

## MATERIALS AND METHODS

### Participants

The sample in this study consisted of 16 older adults with poorer hearing (PH) and 16 older adults with better hearing (BH). Participants were recruited through the community, mostly from an existing laboratory database, or through local advertisements and word of mouth by previous participants. Two PH participants were recruited through the Deaf and Hard-of-Hearing Program at the MAB-Mackay Rehabilitation Centre in Montreal and several were recruited through the Communicaid for Hearing Impaired Persons organization in Montreal. The data from 10 participants in the BH group came from a previous study (Frtusova et al., 2013) that used a nearly identical procedure (with the exception of eliminating the 3-back condition in this study). The analyses of behavioral data from the new participants compared to those from the previous study did not show any significant group differences (Mdiff = 1.45, p = 0.39 for accuracy and Mdiff = 21.22, p = 0.72 for RT). Thus, we chose to include all participants to increase statistical power.

All participants in this study were reasonably healthy, with no self-reported history of disease significantly affecting cognitive ability (e.g., stroke, dementia, Parkinson's disease, or epilepsy). All were completely fluent in English and were right handed (one participant in the PH group reported mixed handedness). Potential participants for the PH group were included if they reported hearing difficulty and either wore a hearing aid or were eligible for hearing aids according to their self-report. In this way we tried to limit our sample to participants with sensorineural hearing loss.

All participants completed a hearing screening that measured hearing thresholds for 250, 500, 1000, 2000, and 4000 Hz (Welch Allyn, AM 232 Manual Audiometer). From these, we computed pure tone average (PTA) values for each ear by averaging across the thresholds obtained for 500, 1000, and 2000 Hz. Control participants had to have a PTA equal to or below 25 dB (Katz, 1985). The individuals in the PH group had to have sufficient hearing to be able to correctly identify the stimuli in the A-only condition without a hearing aid. All participants completed a vision screening that measured contrast sensitivity using the Mars Contrast Sensitivity Test (by MARS Perceptrix; Arditi, 2005). In this test, participants were asked to read a series of large print letters that were degraded in terms of background contrast. Contrast sensitivity, measured as logMAR scores, was obtained for each eye separately as well as binocularly. Lastly, cognitive screening was completed using the Montreal Cognitive Assessment (MoCA; Nasreddine et al., 2005). The groups were matched on age, education, gender, vision, and general cognitive skills. The demographic characteristics of the two samples are presented in **Table 1**. The protocol was approved by the University Human Research Ethics Committee (UHREC) of Concordia University as well as by the Review Ethics Board of CRIR Institutions.

### TABLE 1 | Demographic characteristics.

<sup>a</sup>Montreal Cognitive Assessment (MoCA; Nasreddine et al., 2005).

<sup>b</sup>Contrast sensitivity scores on Mars Contrast Sensitivity Test (Arditi, 2005).

<sup>c</sup>The pure tone average (PTA) represents the average of hearing thresholds for 500, 1000, and 2000 Hz.
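The PTA screening rule described above can be illustrated with a short Python sketch. The threshold values and the function names are hypothetical, and applying the 25 dB criterion to each ear separately is an assumption, since the text does not state how the two ears were combined.

```python
from statistics import mean

PTA_FREQUENCIES_HZ = (500, 1000, 2000)  # frequencies averaged in the text
BH_CUTOFF_DB = 25                       # control-group criterion from the text


def pure_tone_average(thresholds_db):
    """Average the thresholds (dB) at 500, 1000, and 2000 Hz for one ear."""
    return mean(thresholds_db[f] for f in PTA_FREQUENCIES_HZ)


def meets_control_criterion(left_ear, right_ear):
    # Assumption: both ears must satisfy the cutoff (not specified in the text).
    return all(pure_tone_average(ear) <= BH_CUTOFF_DB
               for ear in (left_ear, right_ear))


# Hypothetical audiogram for one participant (frequency in Hz -> threshold in dB).
left = {250: 15, 500: 20, 1000: 20, 2000: 30, 4000: 45}
right = {250: 10, 500: 15, 1000: 25, 2000: 35, 4000: 50}

left_pta = pure_tone_average(left)
right_pta = pure_tone_average(right)
is_control = meets_control_criterion(left, right)
```

Note that the 250 and 4000 Hz thresholds are measured but do not enter the PTA.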

### Stimuli

The stimuli consisted of short videos of a female speaking the digits 1, 2, 3, 4, 5, 6, 8, 9, and 10 with a neutral facial expression. The digit 7 was omitted because it is bi-syllabic and thus more easily distinguishable from the other digits. The stimuli were recorded in a recording studio at the Department of Journalism, Concordia University, and subsequently edited using Adobe Premiere (Video codec, Windows Media Video 9; frame size, 500 px × 388 px; frame rate, 29.97 fps; Audio codec, Windows Media Audio; sample rate and size, 44,100 Hz, 16-bit). The videos showed the full face and shoulders of the speaker against a green background. The videos were edited such that the first obvious lip movement occurred nine frames after the onset of the video and the last lip movement happened approximately nine frames before the video ended. Imperceptible triggers were inserted at the time of the first lip movement (i.e., visual trigger) and at the onset of the sound (i.e., auditory trigger), in order to signal these events to the recording electroencephalogram amplifier, which was important for subsequent ERP analyses (as described later). The lag between the onset of the video and the onset of the sound was approximately 395.3 ms (SD = 103.24). The average length of the video was 2010 ms (SD = 160 ms), with an inter-trial interval of 2400 ms. The sound was presented binaurally using insert earphones (EARLINK tube ear inserts; Neuroscan, El Paso, Texas).

The AV stimuli included both video and audio channels, meaning that the participants could both see and hear the speaker. For the A-only stimuli, the video channel was deleted and only a white fixation point was presented on a black background to maintain eye fixation. For the V-only stimuli, the auditory channel was deleted and the participants needed to identify the digits based on the visual speech cues. Overall, the stimuli in the three modalities were identical with the exception of the presence of either both of the modalities or only one of the modalities. The stimuli were presented on a black screen 15-in. CRT monitor, using Inquisit (version 2.0; Millisecond Software, 2008). Participants were seated in a comfortable chair approximately 60 cm from the screen.

### Procedure

Participants completed the n-back task in three modalities: V-only (where they could see the speaker presenting the digits but could not hear her voice); A-only (where they could hear the speaker presenting digits but could not see her face); and AV (where they could both hear and see the speaker presenting digits). There were three different levels of task difficulty ranging from 0-back to 2-back load in a blocked design. In the 0-back condition, participants had to decide whether the currently presented digit matched a target digit assigned at the beginning of the block. In the 1-back condition, participants had to decide whether the currently presented digit matched the one presented one trial before, and in the 2-back condition, participants had to decide whether the currently presented digit matched the one presented two trials before.

The sequences of digits were semi-random, each containing 40 "Match" trials and 60 "Non-Match" trials. In Match trials, the currently presented digit matched the one assigned at the beginning of the block (0-back) or the one presented one or two trials before (1- and 2-back, respectively). Participants completed the 0-back condition in each modality, followed by the 1-back condition in each modality and finished with the 2-back condition in each modality. The order of the modality presentations was varied across participants. Participants were presented with different sequences of digits in different modalities, but modality-sequence combinations were also varied across participants.
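The Match rule for the three load levels can be sketched in Python as follows. The function name and the example sequence are hypothetical and serve only to illustrate the rule; this is not the software used to generate or score the sequences in the study.

```python
def label_nback_trials(digits, n, target=None):
    """Return one boolean per trial: True where the trial is a Match.

    n == 0: a trial is a Match when the digit equals the block's target digit.
    n >= 1: a trial is a Match when the digit equals the one presented
            n trials earlier (the first n trials can never be a Match).
    """
    if n == 0:
        if target is None:
            raise ValueError("the 0-back condition needs a target digit")
        return [d == target for d in digits]
    return [i >= n and d == digits[i - n] for i, d in enumerate(digits)]


# Hypothetical short sequence drawn from the digits used in the study.
sequence = [3, 5, 3, 3, 8, 3, 8]

zero_back = label_nback_trials(sequence, 0, target=3)
one_back = label_nback_trials(sequence, 1)
two_back = label_nback_trials(sequence, 2)
```

The same sequence yields different Match trials at each load, which is why separate semi-random sequences are needed to hold the number of Match trials constant across conditions.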

Participants practiced speech-reading and responding with the computer mouse before the experiment began. To practice speech-reading, participants had to identify the digits used in the experiment based only on seeing the speaker utter these digits (similar to the V-only condition). Digits were first presented in numerical and then random order. This procedure was repeated if the participant made mistakes in the random practice condition. In general, participants had to identify all the digits correctly in the practice session before proceeding with the experiment. To practice responding with the computer mouse, participants were asked to hold the mouse in both of their hands and press the left or right button using their thumbs to indicate Match or Non-Match responses. The assignment of Match response to the left or right button was counterbalanced across participants. For all conditions, participants were instructed to respond as fast and as accurately as they could. To practice responding, they completed 10 trials that were identical to the AV 0-back condition. After this, the experimental tasks began. In order to ensure that each participant understood the task, they completed 10 practice trials before each new n-back block (i.e., before beginning the 0-back, 1-back, and 2-back tasks). During these trials, feedback was provided by presenting a short low-frequency beep whenever participants made a mistake. The practice blocks were repeated if participants made more than a few mistakes or it appeared that they did not understand the task. For many participants, this was mostly necessary in the 2-back condition. Lastly, in order to give participants a chance to adjust to each new condition, five "Warm-Up" trials were included at the beginning of each sequence. These trials were not counted in the analyses.

Two behavioral measures were collected: the accuracy, defined as the percentage of correct Match responses, and RT, defined as the amount of time between the onset of the auditory trigger and the participant's button response for correct Match trials. Trials were excluded if the response occurred less than 200 ms after the first cue about the identity of the digit (i.e., the onset of the lip movement in the V-only and AV conditions or the onset of the sound in the A-only condition). This was done because such early responses were unlikely to represent a valid response.
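As a minimal sketch of how these two measures and the exclusion rule fit together, consider the Python example below. The trial fields (`is_match`, `response`, `rt_ms`, `cue_lead_ms`) are hypothetical names, and the example assumes that excluded trials are dropped before both accuracy and RT are computed, which the text does not state explicitly.

```python
from statistics import mean

MIN_MS_AFTER_FIRST_CUE = 200  # exclusion threshold stated in the text


def score_block(trials):
    """Return (accuracy in %, mean RT in ms) for one block of trials.

    Each trial is a dict with hypothetical keys:
      is_match     True for Match trials
      response     "match" or "non-match"
      rt_ms        response time measured from the auditory trigger
      cue_lead_ms  how long the first identity cue precedes the auditory
                   trigger (0 in A-only; the lip-movement lead otherwise)
    """
    # Drop responses given sooner than 200 ms after the first identity cue.
    valid = [t for t in trials
             if t["rt_ms"] + t["cue_lead_ms"] >= MIN_MS_AFTER_FIRST_CUE]
    match_trials = [t for t in valid if t["is_match"]]
    hits = [t for t in match_trials if t["response"] == "match"]
    if not hits:
        raise ValueError("no correct Match responses in this block")
    accuracy = 100.0 * len(hits) / len(match_trials)  # % correct Match responses
    mean_rt = mean(t["rt_ms"] for t in hits)          # correct Match trials only
    return accuracy, mean_rt


# Hypothetical A-only trials (the first cue coincides with the sound onset).
trials = [
    {"is_match": True, "response": "match", "rt_ms": 600, "cue_lead_ms": 0},
    {"is_match": True, "response": "non-match", "rt_ms": 700, "cue_lead_ms": 0},
    {"is_match": True, "response": "match", "rt_ms": 150, "cue_lead_ms": 0},
    {"is_match": False, "response": "non-match", "rt_ms": 650, "cue_lead_ms": 0},
    {"is_match": True, "response": "match", "rt_ms": 800, "cue_lead_ms": 0},
]
accuracy, mean_rt = score_block(trials)
```

In the V-only and AV conditions the lip movement precedes the sound, so a response can be valid even when its RT relative to the auditory trigger is short; the `cue_lead_ms` field captures that offset.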

### Electroencephalography Data Acquisition and Processing

The electroencephalography (EEG) data were collected during the task using a Biosemi ActiveTwo system with 72 channels. Sixty-four electrodes were arranged on the head according to the extended International 10–20 system (Jasper, 1958). Electro-oculograms (EOG) were used to monitor eye movements: one electrode was placed above and one below the left eye to monitor vertical eye movements and one was placed beside the outer canthus of each eye to monitor horizontal eye movements. The sampling rate during the recording was 2048 Hz but the files were down-sampled offline to 512 Hz.

After down-sampling, the recorded data were converted to Neuroscan continuous data format using Polygraphic Recording Data Exchange (PolyRex; Kayser, 2003). The data were rereferenced to a linked left and right ear lobe reference and subsequently processed using Scan software (version 4.5; Compumedics Neuroscan, 2009). Vertical ocular artifacts were corrected using a spatial filtering technique (Method 1; NeuroScan Edit 4.5 manual, 2009). Next, the frequencies outside the range of 1–45 Hz were filtered using a bandpass filter. Continuous recordings were divided into separate epochs going from −100 to 1000 ms around the onset of auditory stimuli (i.e., auditory triggers) and baseline corrected based on the 100 ms prestimulus period (i.e., −100 to 0 ms before the auditory trigger). Epochs with excessive artifacts (i.e., activity larger than ±75 µV in the active electrodes at and around the midline or EOG activity exceeding ±60 µV) were excluded by the software program. The accepted epochs were subsequently inspected manually by the examiner to ensure that there was no excessive noise in the epochs that were to be used in the analyses. The mean number of accepted trials was 31.5 out of 40 for the Match condition (SD = 5.64). The epochs were then sorted by the software based on the condition, and individual averages (i.e., average waveforms for each individual) for each condition were computed. In order to examine the AV interaction, the waveforms for A-only and V-only were added to create A+V

In this study, we were interested in three ERP components, namely the P1, N1, and P3. These components were first detected by a semiautomatic procedure in Scan software (NeuroScan Edit 4.5 manual). For this purpose, the P1 was defined as the highest positive point occurring between 20 and 110 ms after the onset of the stimulus; the N1 was defined as the lowest negative point occurring between 60 and 170 ms after the onset of the stimulus; and the P3 was defined as the most positive point occurring between 300 and 700 ms after the onset of the stimulus. Subsequently, the detected peaks were inspected and manually adjusted, when necessary, by a trained examiner who was blinded to the modality and group factors.
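The baseline correction, the amplitude-based epoch rejection, and the peak search windows described in this section can be sketched in plain Python. This is an illustration only: the actual processing was done in Scan software, the function names are hypothetical, the toy waveform uses a coarse 10 ms grid rather than the 512 Hz sampling rate, and the search simply takes the window extremum, whereas the study's semiautomatic detection was followed by manual adjustment.

```python
# (window start in ms, window end in ms, polarity) for each component
PEAK_WINDOWS_MS = {"P1": (20, 110, +1), "N1": (60, 170, -1), "P3": (300, 700, +1)}


def baseline_correct(samples, times_ms, start=-100, end=0):
    """Subtract the mean of the prestimulus interval [start, end) from all samples."""
    base = [v for v, t in zip(samples, times_ms) if start <= t < end]
    offset = sum(base) / len(base)
    return [v - offset for v in samples]


def exceeds_threshold(samples, limit_uv=75.0):
    """True if any sample is larger in magnitude than the rejection limit (in µV)."""
    return any(abs(v) > limit_uv for v in samples)


def find_peak(samples, times_ms, component):
    """Return (latency in ms, amplitude in µV) of the extremum in the window."""
    lo, hi, polarity = PEAK_WINDOWS_MS[component]
    window = [(polarity * v, t, v)
              for v, t in zip(samples, times_ms) if lo <= t <= hi]
    _, latency, amplitude = max(window)
    return latency, amplitude


# Toy epoch from -100 to 1000 ms with a constant 1 µV offset and three deflections.
times = list(range(-100, 1001, 10))
raw = [1.0] * len(times)
raw[times.index(50)] = 3.0    # P1-like positivity
raw[times.index(100)] = -3.0  # N1-like negativity
raw[times.index(400)] = 7.0   # P3-like positivity

epoch = baseline_correct(raw, times)
keep = not exceeds_threshold(epoch)
p1 = find_peak(epoch, times, "P1")
n1 = find_peak(epoch, times, "N1")
p3 = find_peak(epoch, times, "P3")
```

Because the P1 and N1 windows overlap (60 to 110 ms), a purely automatic search can pick an edge value rather than a true peak in noisy data, which is one reason for the manual inspection step.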

# RESULTS

The data were analyzed by repeated measures ANOVAs using SPSS (version 22; IBM). Predicted interaction effects were decomposed with simple effects analyses. The reported results are significant at α ≤ 0.05 unless otherwise specified. For the main analyses, the Greenhouse-Geisser non-sphericity correction was used for interpreting results for within-subject factors with more than two levels. Based on the convention suggested by Jennings (1987), Greenhouse-Geisser epsilon (ε) values and uncorrected degrees of freedom are reported together with adjusted p-values and mean square error (MSE) values. Participants had to reach an accuracy of at least 60% during a particular condition in order to be included in the analyses; otherwise the value for that condition was replaced by the group mean. This criterion was imposed in order to ensure that participants were sufficiently engaged in the task so that the observed values indicated a valid representation of task-related performance. Eight values (out of 144) needed to be replaced in the BH group and nine values (out of 141) needed to be replaced in the PH group. In addition, one participant from the PH group discontinued the 2-back condition because she found it too difficult and thus the missing values were replaced by group means.
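The inclusion criterion and group-mean replacement can be illustrated with a short Python sketch. The function name and values are hypothetical, and the example assumes that the group mean is computed from the participants who met the criterion in that condition, which the text does not specify.

```python
from statistics import mean

MIN_ACCURACY = 60.0  # inclusion criterion (% correct) stated in the text


def replace_with_group_mean(values, accuracies):
    """Replace values from participants below criterion with the group mean.

    values      one dependent measure per participant for a single condition
                (None if the participant did not complete the condition)
    accuracies  accuracy (%) per participant in that condition (None if missing)
    """
    def ok(acc):
        return acc is not None and acc >= MIN_ACCURACY

    # Assumption: the mean is taken over participants who met the criterion.
    group_mean = mean(v for v, a in zip(values, accuracies) if ok(a))
    return [v if ok(a) else group_mean for v, a in zip(values, accuracies)]


# Hypothetical RTs (ms) for four participants in one condition of one group.
rts = [500, 600, 900, None]
accs = [90.0, 75.0, 40.0, None]  # third is below criterion, fourth discontinued
cleaned = replace_with_group_mean(rts, accs)
```

Mean replacement keeps the design balanced for the repeated measures ANOVA, at the cost of slightly reducing the variance in the affected cells.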

### Behavioral Results

Behavioral data were analyzed by repeated measures ANOVAs with modality (V-only, A-only, AV) and n-back load (0-, 1-, and 2-back) entered as within-subject variables and group (BH and PH) entered as a between-subject variable.

### Accuracy

The accuracy data are shown in **Figure 1**. The analysis revealed a significant main effect of modality [F(2, 60) = 9.7; MSE = 48.22; p < 0.001; ε = 0.86; η<sub>p</sub><sup>2</sup> = 0.25], indicating that participants were more accurate in the A-only and AV conditions compared to the V-only condition but performance in the A-only and the AV condition did not differ. There was also a main effect of load [F(2, 60) = 162.2; MSE = 53.23; p < 0.001; ε = 0.89; η<sub>p</sub><sup>2</sup> = 0.84], showing that accuracy decreased as n-back load increased. Neither the main effect of group [F(1, 30) = 0.6; MSE = 127.95; p = 0.43; η<sub>p</sub><sup>2</sup> = 0.02] nor the Modality × Group interaction [F(2, 60) = 0.4; MSE = 48.22; p = 0.66; ε = 0.86; η<sub>p</sub><sup>2</sup> = 0.01] were significant.

### Reaction time

The RT data are shown in **Figure 2**. The analysis revealed a significant main effect of modality [F(2, 60) = 42.6; MSE = 9946.64; p < 0.001; ε = 0.80; η<sub>p</sub><sup>2</sup> = 0.59], load [F(2, 60) = 29.7; MSE = 18372.55; p < 0.001; ε = 0.91; η<sub>p</sub><sup>2</sup> = 0.50], as well as a significant Modality × Load interaction [F(4, 120) = 7.8; MSE = 9575.33; p < 0.001; ε = 0.66; η<sub>p</sub><sup>2</sup> = 0.21]. Pairwise comparisons showed that participants were faster during the AV condition compared to the V-only and A-only conditions at all n-back loads, but they were faster in the A-only condition compared to the V-only condition only during the 0-back condition.

Furthermore, there was a statistical trend for the effect of group [F(1, 30) = 3.6; MSE = 105309.68; p = 0.07; η<sub>p</sub><sup>2</sup> = 0.11], indicating that the BH group was faster than the PH group. This effect was qualified by a Modality × Group interaction [F(2, 60) = 4.6; MSE = 9946.64; p = 0.02; ε = 0.80; η<sub>p</sub><sup>2</sup> = 0.13], which showed that the PH group performed similarly to the BH group in the V-only [F(1, 30) = 0.8; p = 0.37; η<sub>p</sub><sup>2</sup> = 0.03] and the AV [F(1, 30) = 2.0; p = 0.17; η<sub>p</sub><sup>2</sup> = 0.06] conditions but was significantly slower in the A-only condition [F(1, 30) = 12.6; p = 0.001; η<sub>p</sub><sup>2</sup> = 0.30].
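For readers who wish to check the reported effect sizes, partial eta squared can be recovered from an F ratio and its degrees of freedom, since it equals SS<sub>effect</sub> / (SS<sub>effect</sub> + SS<sub>error</sub>) = F · df<sub>effect</sub> / (F · df<sub>effect</sub> + df<sub>error</sub>). The Python sketch below applies this identity to the rounded F values reported above; the function name is hypothetical, and small discrepancies from the published values are expected because the F ratios are rounded to one decimal place.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values taken from the behavioral results reported above.
rt_modality = partial_eta_squared(42.6, 2, 60)      # reported as 0.59
rt_load = partial_eta_squared(29.7, 2, 60)          # reported as 0.50
rt_interaction = partial_eta_squared(7.8, 4, 120)   # reported as 0.21
rt_group_a_only = partial_eta_squared(12.6, 1, 30)  # reported as 0.30
```

The Greenhouse-Geisser correction multiplies both degrees of freedom by ε, so the ratio is the same whether corrected or uncorrected degrees of freedom are used.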

### Electrophysiological Results: Perceptual Processing

For the electrophysiological results, the V-only condition was not included in the analyses because our analyses focused on the auditory evoked potentials. More specifically, we were interested in the comparison of auditory processing with and without the presence of visual speech cues. N1 amplitude was defined as an absolute voltage difference between the trough of the P1 and the peak of the N1, thus we refer to this component complex as P1-N1 when describing the amplitude data. In order to explore the possibility of multisensory effects occurring before the N1 component, we also analyzed the data from the P1 component separately. P1 amplitude was measured relative to the 0 µV baseline. The P1 and N1 latencies were measured at the components' peaks relative to the onset of the auditory trigger. The data from the CZ electrode were used for the analyses as these components reach their maximum in mid-central electrodes (Näätänen and Picton, 1987) and no hemispheric differences were identified in a previous work in our laboratory (Winneke and Phillips, 2011).
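The measurement definitions above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the search windows, the 10 ms sampling step, the function names, and the toy waveform are all assumptions that the source does not specify. The sketch picks the most positive sample in a P1 window and the most negative sample in an N1 window, then reports the baseline-referenced P1 amplitude, the peak-to-peak P1-N1 amplitude, and both peak latencies.

```python
def peak_in_window(times_ms, volts_uv, start_ms, end_ms, polarity):
    """Return (latency_ms, amplitude_uv) of the most positive (polarity=+1)
    or most negative (polarity=-1) sample inside [start_ms, end_ms]."""
    candidates = [(t, v) for t, v in zip(times_ms, volts_uv)
                  if start_ms <= t <= end_ms]
    if not candidates:
        raise ValueError("window contains no samples")
    return max(candidates, key=lambda tv: polarity * tv[1])


def p1_n1_measures(times_ms, volts_uv, p1_window=(30, 90), n1_window=(80, 160)):
    """Windows are hypothetical defaults, not values from the study."""
    p1_lat, p1_amp = peak_in_window(times_ms, volts_uv, *p1_window, polarity=+1)
    n1_lat, n1_amp = peak_in_window(times_ms, volts_uv, *n1_window, polarity=-1)
    return {
        "p1_amplitude_uv": p1_amp,                    # relative to 0 µV baseline
        "p1_n1_amplitude_uv": abs(p1_amp - n1_amp),   # peak-to-peak complex
        "p1_latency_ms": p1_lat,                      # relative to trigger onset
        "n1_latency_ms": n1_lat,
    }


# Toy averaged waveform at one electrode, sampled every 10 ms from trigger onset.
times = list(range(0, 200, 10))
wave = [0.0, 0.2, 0.6, 1.1, 1.5, 1.2, 0.4, -0.8, -2.0, -2.9,
        -3.2, -2.6, -1.5, -0.6, 0.0, 0.3, 0.4, 0.3, 0.1, 0.0]

measures = p1_n1_measures(times, wave)
print(measures)
```

Measuring N1 against the preceding P1 rather than against the baseline makes the amplitude less sensitive to slow drifts, which is one common motivation for peak-to-peak scoring.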

To explore multisensory processing, the AV and the A-only conditions were compared to the A+V measure. This waveform was obtained by the summation of electrophysiological activity in the A-only and the V-only conditions locked to the onset of the auditory stimuli. For this purpose, we embedded imperceptible triggers into the V-only files at the time points where the onset of the sound would have occurred, if it had been presented (i.e., at the identical time point as in the A-only and the AV stimuli). This way we were able to assess whether the AV condition represented a multisensory interaction or merely the simultaneous processing of two independent modality channels (A-only and V-only). Planned comparisons consisted of the contrast of A-only vs. AV waveforms and A-only vs. A+V waveforms. The values for the P1 and P1-N1 amplitudes and for the P1 and N1 latencies were analyzed by repeated measures ANOVAs with modality (AV, A-only, A+V) and n-back load (0-, 1-, and 2-back) conditions entered as within-subject variables and group (BH and PH) entered as a between-subject variable.
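The additive A+V measure described above can be illustrated with a short Python sketch. It assumes that the A-only, V-only, and AV averages are already time-locked to the same (real or embedded) auditory trigger and sampled at identical time points; the function names and the toy voltages are hypothetical. A residual of zero at every sample would be what purely independent processing of the two channels predicts, whereas a nonzero residual points to a multisensory interaction.

```python
def sum_waveforms(a_only_uv, v_only_uv):
    """Sample-by-sample sum of the unisensory averages (the A+V measure)."""
    if len(a_only_uv) != len(v_only_uv):
        raise ValueError("waveforms must share the same time base")
    return [a + v for a, v in zip(a_only_uv, v_only_uv)]


def interaction_residual(av_uv, a_only_uv, v_only_uv):
    """AV minus (A+V) at each sample; departures from zero are not
    explained by independent auditory and visual processing."""
    a_plus_v = sum_waveforms(a_only_uv, v_only_uv)
    if len(av_uv) != len(a_plus_v):
        raise ValueError("waveforms must share the same time base")
    return [av - s for av, s in zip(av_uv, a_plus_v)]


# Toy averages (µV) at four time points; values are illustrative only.
a_only = [0.0, 1.0, -3.0, 0.5]
v_only = [0.0, 0.25, -0.5, 0.25]
av = [0.0, 0.75, -2.5, 0.5]

a_plus_v = sum_waveforms(a_only, v_only)
residual = interaction_residual(av, a_only, v_only)
print(a_plus_v, residual)
```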

### P1-N1 Amplitude

The grand averages illustrating different modalities for the P1-N1 wave are presented in **Figure 3**. The mean values and standard deviations are also presented in **Table 2**. The ANOVA showed a main effect of modality [F(2, 60) = 12.5; MSE = 4.74; p < 0.001; ε = 0.74; η<sub>p</sub><sup>2</sup> = 0.29], such that the amplitude of the P1-N1 was smaller in the AV condition compared to both the A-only condition and the A+V measure, and smaller in the A-only condition compared to the A+V measure. Thus, the data provided evidence for a multisensory interaction in the AV condition. There was also a main effect of group [F(1, 30) = 10.5; MSE = 38.88; p = 0.003; η<sub>p</sub><sup>2</sup> = 0.26] with the PH group having a smaller P1-N1 amplitude than the BH group.

To test our main hypothesis, we conducted the planned simple effects analyses followed by pairwise comparisons. These indicated that there was a significant decrease in P1-N1 amplitude in the AV condition compared to the A-only condition and the A+V measure, and in the A-only condition compared to the A+V measure for the PH group [F(2, 29) = 7.4; p = 0.003; η<sub>p</sub><sup>2</sup> = 0.34]. However, while a similar pattern of results was suggested in the BH group, the mean differences did not reach the level of significance [F(2, 29) = 1.8; p = 0.19; η<sub>p</sub><sup>2</sup> = 0.11; see **Table 2**].

n-back conditions. Note the smaller amplitude of P1 and N1, and earlier P1 in the AV in comparison to the A-only condition for participants with poorer hearing.

TABLE 2 | The mean amplitudes (µV) and standard deviations (in parenthesis) of the P1-N1 component for better-hearing (BH) participants and participants with poorer hearing (PH) at the CZ electrode.

### P1 Amplitude

The grand averages illustrating different modalities for the P1 wave are presented in **Figure 3**. The mean values and standard deviations are also presented in **Table 3**. The ANOVA showed a main effect of modality [F(2, 60) = 10.2; MSE = 4.62; p = 0.001; ε = 0.76; η<sub>p</sub><sup>2</sup> = 0.25]; the amplitude of P1 was smaller in the AV condition compared to the A-only condition and the A+V measure, while the A-only condition and the A+V measure did not significantly differ. These results indicate that the multisensory interaction effect is evident early in the information processing stream and modulation observed in the AV condition compared to the A-only condition cannot be explained by simultaneous but independent processing of visual and auditory speech information.

There was also a main effect of group [F(1, 30) = 4.4; MSE = 9.35; p = 0.04; η<sub>p</sub><sup>2</sup> = 0.13], with the PH group having a smaller P1 amplitude than the BH group. Furthermore, there was a significant Modality × Group interaction [F(2, 60) = 4.2; MSE = 4.62; p = 0.03; ε = 0.76; η<sub>p</sub><sup>2</sup> = 0.12], indicating that for the PH group [F(2, 29) = 9.0; p = 0.001; η<sub>p</sub><sup>2</sup> = 0.38], P1 amplitude was smaller in the AV condition compared to the A-only condition and the A+V measure, and there was a statistical trend (p = 0.06) for the P1 to be smaller in the A-only condition compared to the A+V measure. However, no modality effect was indicated in the BH group [F(2, 29) = 0.5; p = 0.61; η<sub>p</sub><sup>2</sup> = 0.03; see **Table 3**]. The simple effects conducted on the interaction also revealed that the PH group had a smaller P1 amplitude in the AV condition compared to the BH group [F(1, 30) = 11.9; p = 0.002; η<sub>p</sub><sup>2</sup> = 0.28], while the two groups had similar P1 amplitudes in the A-only condition [F(1, 30) = 2.1; p = 0.16; η<sub>p</sub><sup>2</sup> = 0.07] and the A+V measure [F(1, 30) = 0.02; p = 0.90; η<sub>p</sub><sup>2</sup> = 0.00]. Lastly, there was a main effect of load [F(2, 60) = 3.5; MSE = 4.86; p = 0.05; ε = 0.84; η<sub>p</sub><sup>2</sup> = 0.10] with P1 amplitude being smaller in the 1-back than the 0-back condition. No other differences were evident across different WM loads.

TABLE 3 | The mean amplitudes (µV) and standard deviations (in parenthesis) of the P1 component for better-hearing (BH) participants and participants with poorer hearing (PH) at the CZ electrode.

TABLE 4 | The mean latencies (ms) and standard deviations (in parenthesis) of the P1 component for better-hearing (BH) participants and participants with poorer hearing (PH) at the CZ electrode.

### P1 Latency

The grand averages illustrating different modalities for the P1 wave are presented in **Figure 3**. The mean values and standard deviations are also presented in **Table 4**. The data showed a main effect of modality [F(2, 60) = 6.0; MSE = 373.14; p = 0.01; ε = 0.86; η<sub>p</sub><sup>2</sup> = 0.17]; the P1 peaked earlier in the AV condition compared to the A-only condition and the A+V measure, while the A-only condition and the A+V measure did not significantly differ. The main effect of group was not significant [F(1, 30) = 2.9; MSE = 854.68; p = 0.10; η<sub>p</sub><sup>2</sup> = 0.09] but there was a statistical trend toward a Modality × Group interaction [F(2, 60) = 3.2; MSE = 373.14; p = 0.06; ε = 0.86; η<sub>p</sub><sup>2</sup> = 0.10], indicating that the P1 peaked earlier in the AV condition compared to the A-only condition and the A+V measure for the PH group [F(2, 29) = 7.1; p = 0.003; η<sub>p</sub><sup>2</sup> = 0.33] but the differences in the BH group did not reach statistical significance [F(2, 29) = 0.3; p = 0.78; η<sub>p</sub><sup>2</sup> = 0.02].

TABLE 5 | The mean latencies (ms) and standard deviations (in parenthesis) of the N1 component for better-hearing (BH) participants and participants with poorer hearing (PH) at the CZ electrode.

### N1 Latency

The grand averages illustrating different modalities for the N1 wave are presented in **Figure 3**. The mean values and standard deviations are presented in **Table 5**. The main effect of modality did not reach statistical significance [F(2, 60) = 3.0; MSE = 600.15; p = 0.09; ε = 0.61; η<sub>p</sub><sup>2</sup> = 0.09]. There was a statistical trend toward the main effect of group [F(1, 30) = 3.8; MSE = 1311.89; p = 0.06; η<sub>p</sub><sup>2</sup> = 0.11], with the N1 peaking later in the PH group than the BH group. There was a main effect of load [F(2, 60) = 5.6; MSE = 482.81; p = 0.01; ε = 0.95; η<sub>p</sub><sup>2</sup> = 0.16], which was qualified by a Load × Group interaction [F(2, 60) = 4.3; MSE = 482.81; p = 0.02; ε = 0.95; η<sub>p</sub><sup>2</sup> = 0.13] and further by a Modality × Load × Group interaction [F(4, 120) = 2.8; MSE = 376.14; p = 0.05; ε = 0.69; η<sub>p</sub><sup>2</sup> = 0.09]. The simple effects and pairwise comparisons indicated that there were no statistical differences in the BH group (all Fs < 1.9; all ps > 0.16). For the PH group, no differences across different modalities were observed in the 0-back [F(2, 29) = 0.2; p = 0.81; η<sub>p</sub><sup>2</sup> = 0.02] condition, but the N1 peaked earlier in the AV condition compared to the A-only condition and the A+V measure during the 1-back load [F(2, 29) = 6.5; p = 0.01; η<sub>p</sub><sup>2</sup> = 0.31], and earlier in the A-only condition compared to the A+V measure during the 2-back load [F(2, 29) = 3.1; p = 0.06; η<sub>p</sub><sup>2</sup> = 0.18].

### Electrophysiological Results: Working Memory Processing

P3 amplitude was measured relative to the 0 µV baseline and P3 latency was measured at the component's peak relative to the onset of the auditory trigger. The data from the PZ electrode were used for the analyses as this component reaches its maximum in mid-posterior sites (Watter et al., 2001; Frtusova et al., 2013). The P3 is considered to reflect WM processes (i.e., higher-order) rather than perceptual processing and thus for this condition we only compared the AV and A-only modalities. The values from the P3 components were analyzed by repeated measures ANOVAs with the modality (A-only and AV) and n-back load (0-, 1-, and 2-back) conditions entered as within-subject variables and group (BH and PH) entered as a between-subject variable.

component for better-hearing older adults (left panel) and older adults with poorer hearing (right panel). The data are collapsed across different n-back conditions. Note the smaller P3 amplitude in participants with poorer hearing for the A-only condition but similar P3 amplitudes in both groups for the AV condition. Also note the earlier peaking P3 in the AV in comparison to the A-only condition in both groups and later peaking P3 in both modalities for participants with poorer hearing.

TABLE 6 | The mean amplitudes (µV) and standard deviations (in parenthesis) of the P3 component for better-hearing (BH) participants and participants with poorer hearing (PH) at the PZ electrode.

TABLE 7 | The mean latencies (ms) and standard deviations (in parenthesis) of the P3 component for better-hearing (BH) participants and participants with poorer hearing (PH) at the PZ electrode.

### P3 Amplitude

The grand averages illustrating different modalities for the P3 wave are presented in **Figure 4**. The mean values and standard deviations are also presented in **Table 6**. The ANOVA showed that neither the main effect of modality [F(1, 30) = 0.5; MSE = 2.69; p = 0.50; η<sub>p</sub><sup>2</sup> = 0.02] nor the main effect of group [F(1, 30) = 3.0; MSE = 15.58; p = 0.10; η<sub>p</sub><sup>2</sup> = 0.09] was significant. However, there was a significant Modality × Group interaction [F(1, 30) = 4.1; MSE = 2.69; p = 0.05; η<sub>p</sub><sup>2</sup> = 0.12]. The two groups had similar P3 amplitudes in the AV condition [F(1, 30) = 0.6; p = 0.43; η<sub>p</sub><sup>2</sup> = 0.02] but the PH group had significantly smaller P3 amplitude in the A-only condition compared to the BH group [F(1, 30) = 5.8; p = 0.02; η<sub>p</sub><sup>2</sup> = 0.16]. As expected, there was a main effect of load [F(2, 60) = 11.3; MSE = 5.04; p < 0.001; ε = 0.79; η<sub>p</sub><sup>2</sup> = 0.27], with P3 amplitude being greater in the 0-back condition compared to the 1-back and 2-back conditions, while the 1-back and 2-back conditions did not significantly differ.

### P3 Latency

The grand averages illustrating the different modalities for the P3 wave are presented in **Figure 4**. The mean values and standard deviations are presented in **Table 7**. The data showed a main effect of modality [F(1, 30) = 11.3; MSE = 5319.22; p = 0.002; η<sub>p</sub><sup>2</sup> = 0.27], with the P3 peaking earlier in the AV condition compared to the A-only condition. There was also a main effect of group [F(1, 30) = 14.2; MSE = 13022.67; p = 0.001; η<sub>p</sub><sup>2</sup> = 0.32], with the P3 peaking later in the PH group compared to the BH group. The Modality × Group interaction was not significant [F(1, 30) = 0.02; MSE = 5319.22; p = 0.88; η<sub>p</sub><sup>2</sup> = 0.00].

TABLE 8 | Zero-order correlations between the facilitation of P1-N1 amplitude and improvement in accuracy (on the left) and facilitation of N1 latency and improvement in reaction time (RT; on the right) during the AV condition in comparison to A-only condition.


\*significant at α ≤ 0.05 one-tailed.

# Correlation between Facilitation of Perceptual Processing and Improvement in Working Memory Performance

We examined whether there is a relationship between the amount of perceptual facilitation (i.e., a decrease in the amplitude of the P1-N1 and the latency of the auditory N1) and the level of behavioral improvement on the WM task in the AV condition compared to the A-only condition. Firstly, we examined whether there is a positive relationship between the facilitation of the auditory P1-N1 amplitude (A-only–AV) and higher accuracy (AV–A-only). Secondly, we examined whether there is a positive relationship between facilitation of the auditory N1 latency (A-only–AV) and faster RT (A-only–AV). We reasoned that participants with greater perceptual facilitation should have greater behavioral improvement. The results are presented in **Table 8** (note that positive correlations always reflect a relationship in the expected direction).
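The difference-score logic described above can be sketched as follows. This is a hedged illustration under stated assumptions: the per-participant values are invented for demonstration, the function names are hypothetical, and the one-tailed significance test reported in Table 8 is not reproduced here. Both difference scores are oriented as A-only minus AV so that a positive Pearson correlation reflects the expected relationship (larger N1 latency facilitation going with larger RT improvement).

```python
from math import sqrt


def difference_scores(a_only, av):
    """Per-participant facilitation, computed as A-only minus AV."""
    if len(a_only) != len(av):
        raise ValueError("each participant needs both conditions")
    return [a - v for a, v in zip(a_only, av)]


def pearson_r(x, y):
    """Zero-order Pearson correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


# Hypothetical data for five participants (ms); not taken from the study.
n1_latency_a_only = [120, 118, 125, 130, 122]
n1_latency_av = [115, 117, 118, 121, 120]
rt_a_only = [900, 850, 940, 1000, 880]
rt_av = [840, 835, 850, 890, 850]

n1_facilitation = difference_scores(n1_latency_a_only, n1_latency_av)
rt_improvement = difference_scores(rt_a_only, rt_av)
r = pearson_r(n1_facilitation, rt_improvement)
print(n1_facilitation, rt_improvement, round(r, 3))
```

For accuracy, the behavioral score would instead be oriented as AV minus A-only, as in the text, so that higher values again mean greater improvement.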

# DISCUSSION

This study examined the effect of AV speech on WM in older adults with hearing impairment compared to better-hearing older adults. The results showed that both groups were faster in the AV condition compared to the unisensory conditions even though the accuracy was comparable between the AV and A-only conditions. Participants with hearing impairment were slower compared to controls during the A-only condition but the two groups performed similarly in the AV and the V-only conditions. These results suggest that group differences in the A-only condition are due to more demanding perceptual processing for the PH group rather than differences in WM, and that visual speech cues can help to counteract this more demanding auditory processing.

The electrophysiological results revealed facilitation of perceptual processing in the PH group, indicated by smaller and faster perceptual ERP responses during the AV condition compared to the A-only condition. Furthermore, the ERP data showed facilitation of WM processing, indicated by earlier P3 components in both groups. For P3 amplitude, the PH group had smaller P3 amplitude than the BH group in the A-only condition but no group differences were observed in the AV condition, supporting the suggestion that visual speech cues can help to counteract the negative effect of more demanding perceptual processing on WM.

# Auditory-Visual Speech Interaction in Older Adults with Hearing Impairment

The results of the current study indicate that older adults with hearing impairment show a more robust multisensory interaction effect compared to older adults with age-normal hearing. More specifically, the amplitudes of the auditory P1 and the P1-N1 were significantly reduced in the AV condition compared to the A-only condition and the A+V measure for participants with hearing impairment but these effects did not reach statistical significance in participants with age-normal hearing. Similarly, there was a reduction in the auditory P1 latency during the AV condition, compared to the A-only condition and the A+V measure, evident in hearing impaired participants while in those with age-normal hearing these differences were not statistically significant. Lastly, for the auditory N1 latency, a reduction in the AV condition compared to the A-only condition and the A+V measure was observed in the 1-back load for the hearing impaired group while no significant differences were seen in controls. Overall, our results suggest intact AV multisensory interaction in older adults with hearing impairment. These effects were observed early in the processing stream (i.e., at the level of the P1 component), suggesting that the multisensory interaction is occurring as early as the level of the primary auditory cortex (Liegeois-Chauvel et al., 1994).

These results stand in contrast to those by Musacchia et al. (2009) who found that older adults with hearing impairment may not be able to integrate auditory and visual speech information to the same extent as older adults with age-normal hearing. There are several methodological differences between the current study and that conducted by Musacchia et al. (2009) that may have contributed to the differences in the results. For example, Musacchia et al. (2009) assessed speech perception by repetition of the same syllable; their participants were not actively involved in the task, which may have affected their attention to the stimuli; and lastly, they equalized the auditory input across the groups by adjusting the intensity level of the stimuli. Our results confirmed the observation of improved perceptual functioning during AV speech reported by behavioral studies examining speech recognition in older adults with hearing impairment (e.g., Grant et al., 1998; Tye-Murray et al., 2007; Bernstein and Grant, 2009).

# The Effect of Auditory-Visual Speech on Working Memory

The behavioral results showed faster RT during the AV condition compared to the unisensory conditions in both groups, suggesting facilitation of WM processing. Furthermore, while the WM performance of individuals with hearing impairment was slower in comparison to better-hearing individuals during the A-only condition, no group differences were observed during the AV condition. Thus, it appears that visual speech cues may help to counteract the slowing of information processing caused by hearing impairment.

Surprisingly, no difference between the AV and the A-only condition was evident in the accuracy data, suggesting that despite the facilitation of processing speed, the AV speech did not seem to influence overall WM capacity. There was also no effect of group on accuracy. Overall, these results indicate that both older adults with hearing impairment and those with age-normal hearing are able to achieve similar levels of accuracy during A-only and AV speech; however, they are able to achieve these levels of accuracy at faster RTs when visual speech cues are available.

On electrophysiological correlates of WM, facilitation of processing speed (indicated by P3 latency) was observed in both groups and facilitation of WM resources (indicated by P3 amplitude) was observed in the individuals with hearing impairment. More specifically, both older adults with hearing impairment and those with age-normal hearing showed earlier P3 latency in the AV condition compared to the A-only condition, further validating the finding of improved processing speed during AV speech observed in the behavioral RT data. Overall, there seems to be a disproportionate gain on WM processing speed when perceptual processing speed is facilitated. That is, the average facilitation of P1 latency was 5.0 ms (SD = 11.33) and N1 latency was 3.8 ms (SD = 18.28) whereas the average facilitation was 35.5 ms (SD = 53.40) for P3 latency and 82.6 ms (SD = 53.35) for RT. In addition, we observed that P3 amplitude was smaller during the A-only condition in hearing-impaired participants compared to controls but no group differences were evident in the AV condition. Thus, similar to the RT data, it appears that visual speech cues may help to counteract the negative effect of more demanding perceptual processing caused by hearing impairment.
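The "disproportionate gain" described above can be made concrete with simple ratios of the reported mean facilitation values. The short Python sketch below is only an arithmetic illustration; the dictionary and function names are hypothetical, and ratios of group means say nothing about variability, which was large (see the reported SDs).

```python
# Mean facilitation (A-only minus AV, in ms) as reported in the text above.
facilitation_ms = {
    "P1 latency": 5.0,
    "N1 latency": 3.8,
    "P3 latency": 35.5,
    "RT": 82.6,
}


def gain_ratio(later_measure, earlier_measure):
    """How many times larger the later-stage facilitation is."""
    return facilitation_ms[later_measure] / facilitation_ms[earlier_measure]


for later in ("P3 latency", "RT"):
    for earlier in ("P1 latency", "N1 latency"):
        ratio = gain_ratio(later, earlier)
        print(f"{later} vs. {earlier}: {ratio:.1f}x")
```

On these means, the P3 latency facilitation is roughly seven to nine times the perceptual latency facilitation, and the RT facilitation is larger still.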

# Do Older Adults with Hearing Impairment Show a Greater Auditory-Visual Speech Benefit?

The results of this study have confirmed that perceptual processing was more demanding for older adults with hearing impairment. This was suggested by a significantly smaller amplitude of the auditory P1-N1 component in older adults with hearing impairment compared to better-hearing controls. N1 amplitude is known to be affected by stimuli characteristics, such as intensity and tonal frequency (Näätänen and Picton, 1987). Thus, it appears that physically similar stimuli become "tuned down" and less perceptible in the context of hearing impairment. Furthermore, there was a statistical trend for a delayed auditory N1 latency in the older adults with hearing impairment in comparison to the better-hearing controls, suggesting prolonged perceptual processing time. These results agree with the finding of Oates et al. (2002) who found an increased latency and a decreased N1 amplitude with increasing hearing loss during a syllable discrimination task. In contrast, studies using more ambiguous stimuli during speech discrimination tasks, found increased N1 amplitudes in individuals with hearing impairment (Tremblay et al., 2003; Harkrider et al., 2006). In the current study, the effects of hearing impairment were also evident on WM measures. Older adults with hearing impairment had smaller P3 amplitude and longer RT during the A-only condition compared to the control group. In addition, the group with hearing impairment had generally greater P3 latency, regardless of modality.

When comparing the overall results between better-hearing older adults and those with hearing impairment, the pattern suggests that older adults with hearing impairment are deriving a greater AV speech benefit than better-hearing older adults. Firstly, older adults with hearing impairment showed greater facilitation of perceptual processing, as evidenced by the greater reduction in P1 and N1 latency and P1 and P1-N1 amplitudes in the AV condition compared to the A-only condition. Furthermore, both behavioral RT data and electrophysiological P3 amplitude data suggest greater facilitation of WM processing in older adults with hearing impairment. More specifically, the group differences were observed in the baseline (i.e., A-only) condition but not during the AV condition, indicating that visual speech cues helped older adults with hearing impairment to compensate for the difficulty that they experienced during the more demanding A-only condition. The observed findings are in agreement with previous behavioral research reporting improved speech recognition under AV conditions in individuals with hearing impairment (Grant et al., 1998; Tye-Murray et al., 2007; Bernstein and Grant, 2009). Furthermore, these results support the indication of greater AV benefit in older adults with hearing impairment compared to those with better hearing observed in a syllable recognition paradigm by Grant et al. (2007) as well as in a behavioral WM paradigm by Brault et al. (2010). Overall, the greater AV benefit in older adults with hearing impairment supports the inverse-effectiveness hypothesis, which proposes that the benefit from multisensory interaction increases as the functioning of unisensory channels decreases (Stein and Meredith, 1993).

When examining the direct relationship with correlation analyses between perceptual facilitation (i.e., facilitation of P1-N1 amplitude and N1 latency during the AV condition in comparison to A-only condition) and behavioral improvement (i.e., higher accuracy and faster RT in the AV in comparison to A-only condition), we found that better-hearing older adults showed a reliable dose-response relationship between these variables, especially for facilitation of processing speed. A reliable relationship was found between greater facilitation of N1 latency in the AV condition compared to the A-only condition and greater improvement in RT during the 2-back condition. Similar trends were observed across other conditions. Interestingly, the BH group did not show a reliable AV benefit for either N1 latency or P1-N1 amplitude in the group ANOVA analyses. Taken together these results suggest that even though older adults with better hearing may have shown more inconsistent perceptual facilitation as a group (i.e., in the ANOVAs), those who derived a perceptual benefit from the AV speech were also able to benefit at the WM level, especially in terms of facilitation of processing speed (as demonstrated by the correlation analyses). One might question why a reliable relationship was only demonstrated between the N1 latency and 2-back RT performance. We would argue that this finding shows a relationship between two logically similar measures of processing speed in the experimental condition that was most demanding of WM resources. One might not expect reliable relationships between more conceptually dissimilar measures (e.g., ERP amplitude vs. RT; or at levels of non-demanding working memory load).
Moreover, the behavioral measures represent the output of a number of preceding processes, including sensory and perceptual processing, working memory operations, response biases, and decision making thresholds, while the ERPs can be taken to be more discrete and temporally specific. Nevertheless, we should be cautious in our interpretation of these correlational findings and we encourage replication with a larger independent sample of participants.

On the other hand, participants with hearing impairment showed a more robust perceptual AV benefit, as indicated by facilitation of N1 latency in the 1-back condition and overall facilitation of P1-N1 amplitude evident in the group ANOVAs, but were found to have a less clear dose-response relationship between perceptual facilitation and WM performance (i.e., the correlation analyses). This may be related to the fact that perceptual facilitation helps individuals to reach their WM capacity but not necessarily to expand its limits. Thus, participants may gain a variable level of perceptual facilitation but, regardless of this variability, may achieve similar improvement on behavioral measures. This hypothesis is supported by the observation that no behavioral differences were observed between the groups in the AV condition. For reaction time specifically, individuals with hearing impairment were slower in comparison to controls in the A-only condition but not in the AV condition. Thus, visual speech cues appeared to improve their WM capacity to the point that their performance no longer differed from those with better hearing.

### Practical Implications

The statistics clearly highlight the high prevalence of social and psychological difficulties in the hearing impaired population (e.g., Strawbridge et al., 2000; Dalton et al., 2003). AV speech represents one possibility for facilitation of information processing and thus improved communication abilities for older adults with hearing impairment. Furthermore, numerous speech comprehension training programs have been developed over the years (see Pichora-Fuller and Levitt, 2012) and previous research has found that speech-reading training can improve speech perception of individuals with hearing impairment (e.g., Walden et al., 1981; Richie and Kewley-Port, 2008). The results of the current study indicate that such training may be beneficial not only for enhancement of perceptual but also for higher-order functioning.

In addition to speech comprehension training, the current results have implications for technology adaptation and future development. For example, despite their increased popularity in commercial companies and government institutions, research has shown that older adults find it very challenging to use interactive voice response (IVR) services (Miller et al., 2013). Capitalization on AV speech may provide one method for making future technology user-friendlier for older adults, especially those with hearing impairment.

# Limitations

Several methodological and statistical limitations of the current study need to be acknowledged. Firstly, a larger sample size would decrease error variance and provide greater statistical power. In a previous study with a similar design (Frtusova et al., 2013) but a greater sample size, we found AV facilitation of both N1 latency and P1-N1 amplitude in older adults with age-normal hearing. In the current study, the modality effect on these perceptual measures did not reach statistical significance for this group even though the means pointed in the right direction (see **Tables 2, 5**). Secondly, consideration needs to be given to our sample of older adults with hearing impairment. Individuals in the hearing impaired group were quite heterogeneous in terms of their level of hearing impairment (average PTA ranging from 31.67 to 73.33 dB), and their general cognitive ability as estimated by the MoCA (overall score ranging from 21 to 30 points). However, exploratory analyses showed that these factors were not systematically associated with the level of AV benefit. On the other hand, a significant correlation between higher contrast sensitivity and a lower AV benefit on P1-N1 amplitude was observed for the 1-back (r = −0.49) and the 2-back (r = −0.48) conditions. Thus, those older adults with hearing impairment who also have poorer visual ability seem to derive the largest AV benefit. Another consideration is that we were unable to confirm for all the participants the exact nature of their hearing impairment; some participants were unsure of the cause and did not have an audiology report available. Nevertheless, all participants reported wearing or being eligible to wear hearing aids, which are most commonly prescribed for older adults with sensorineural hearing loss.
Lastly, information about the exact length of hearing aid use was not available for all participants, which may obscure heterogeneity in this group in regard to any potential disadvantage when being tested without a hearing aid.

# Conclusions

This study provides evidence that older adults derive WM benefit from AV speech. Importantly, these effects were found to be even more robust in older adults with hearing impairment compared to those with better hearing. In the context of an integrated perceptual-cognitive system, these results indicate that AV speech facilitates perceptual processing that is otherwise very demanding for older adults with hearing impairment. The perceptual facilitation results in more resources available for subsequent WM processing. The evidence of processing facilitation afforded by AV speech has important practical implications for helping to improve the quality of life for older adults with hearing impairment.

# AUTHOR CONTRIBUTIONS

JF contributed to the conceptualization of the study and study design as well as designed the stimuli and experimental paradigm. She also contributed to the acquisition, analysis and interpretation of the data and wrote the draft of the study. She approved the final version and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. She completed the study as a part of her Doctoral Dissertation. NP contributed to the conceptualization of the study and study design as well as analysis and interpretation of the data. She also revised the draft of the manuscript and critically evaluated it for important intellectual content. She approved the final version and agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

# FUNDING

This research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The study was supported by grants awarded to NP from the Canadian Institutes of Health Research (Grant MOP-97808) and the Concordia University VPRGS Seed/Accelerator Funding Program (2009). We would like to thank the Alzheimer Society of Canada for the Alzheimer Society Research Program Doctoral Award awarded to JF. We would also like to express gratitude to the Centre for Research in Human Development for the financial and technical support to the Cognition, Aging, and Psychophysiology Lab where the research was conducted. Finally, we would like to thank the members of the Cognition, Aging, and Psychophysiology Lab for their contribution to the project. Special thanks to Ms. Lianne Trigiani for her help with data processing and to Ms. Mariya Budanova for her help with testing and data processing.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Frtusova and Phillips. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# How Age and Linguistic Competence Affect Memory for Heard Information

Bruce A. Schneider<sup>1</sup>\*, Meital Avivi-Reich<sup>2</sup>, Caterina Leung<sup>1</sup> and Antje Heinrich<sup>3</sup>

<sup>1</sup> Human Communication Laboratory, Psychology, University of Toronto Mississauga, Mississauga, ON, Canada, <sup>2</sup> Interdisciplinary Center Herzliya Israel, Psychology, Herzliya, Israel, <sup>3</sup> Medical Research Council Institute of Hearing Research, Nottingham, UK

The short-term memory performance of a group of younger adults, for whom English was a second language (young EL2 listeners), was compared to that of younger and older adults for whom English was their first language (EL1 listeners). To-be-remembered words were presented in noise and in quiet. When presented in noise, the listening situation was adjusted to ensure that the likelihood of recognizing the individual words was comparable for all groups. Previous studies which used the same paradigm found memory performance of older EL1 adults on this paired-associate task to be poorer than that of their younger EL1 counterparts both in quiet and in a background of babble. The purpose of the present study was to investigate whether the less well-established semantic and linguistic skills of EL2 listeners would also lead to memory deficits even after equating for word recognition as was done for the younger and older EL1 listeners. No significant differences in memory performance were found between young EL1 and EL2 listeners after equating for word recognition, indicating that the EL2 listeners' poorer semantic and linguistic skills had little effect on their ability to memorize and recall paired associates. This result is consistent with the hypothesis that age-related declines in memory are primarily due to age-related declines in higher-order processes supporting stream segregation and episodic memory. Such declines are likely to increase the load on higher-order (possibly limited) cognitive processes supporting memory. The problems that these results pose for the comprehension of spoken language in these three groups are discussed.

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Laurianne Cabrera, University College London, UK

Stuart Rosen, University College London, UK

\*Correspondence: Bruce A. Schneider bruce.schneider@utoronto.ca

### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 23 December 2015 Accepted: 13 April 2016 Published: 09 May 2016

### Citation:

Schneider BA, Avivi-Reich M, Leung C and Heinrich A (2016) How Age and Linguistic Competence Affect Memory for Heard Information. Front. Psychol. 7:618. doi: 10.3389/fpsyg.2016.00618

Keywords: second language speakers, auditory memory, context, age, spoken word recognition, spoken language comprehension

# INTRODUCTION

A listener's ability to comprehend a lecture, or a multi-talker conversation, is usually measured by having the listener answer questions about the discourse they heard. Clearly, the ability to store the information contained in the lecture or conversation for later recall is one of many abilities that are required in order to perform well on this test of speech comprehension. Consequently, we would expect speech comprehension to be poorer in individuals who are less proficient than others in either storing or retrieving the heard information than in individuals whose memory is unimpaired. All other things being equal, those who have good memory are likely to outperform those with poorer memory. Older adults are one group that suffers from declines in memory processes (Ohta and Naveh-Benjamin, 2012; Morris and Logie, 2015). Second language listeners may be another (Olsthoorn et al., 2014), although the evidence here is less clear and may depend on the particular memory task (Schroeder and Marian, 2014).

However, memory is not the only determinant of performance in a conversational situation, especially when there are competing sound sources. Speech recognition is also vital. As a consequence of poorer perceptual skills, people who find it difficult to hear the individual words in connected discourse will most likely find it difficult to extract the information in an utterance, integrate the extracted information with past knowledge, and store it in memory for later recall. This will result in less efficient recall of what they have heard compared to those who experience fewer difficulties with word recognition. Hence, difficulties in remembering heard information could result from compromised speech perception, reduced memory ability, or both. One way to differentiate between these alternatives is to equate groups of listeners with respect to word recognition accuracy. We know that under identical listening conditions, young native English listeners have better word recognition than either: (a) older native English listeners, or (b) young adults for whom English is a second language. If recall differences among these groups primarily reflect group differences in word recognition, equating these groups for word recognition should substantially reduce group differences in recall. However, if older adults, and possibly younger adults listening in their second language, also experience genuine memory difficulties in noisy situations, group differences in recall should remain after equating all individuals for their ability to recognize individual words.

In the present study we compare the ability of three groups of listeners to remember heard material after equating for differences in word recognition: young adults listening to English words in their first or native language (young EL1 listeners), older adults listening to English words in their native language (older EL1 listeners), and young adults listening to words in their second (non-native) language (young EL2 listeners). We had the following predictions. If linguistic competence affects memory, we would expect poorer performance in the young EL2 listeners than in the young EL1 listeners even after equating for word recognition. Alternatively, if linguistic competence does not affect memory but age does, we would predict, after equating all individuals with respect to word recognition, that memory for heard words should be equivalent in the two younger groups (young EL1 and young EL2 listeners), and poorer in older EL1 listeners.

### Controlling for Word Recognition

One can control for individual differences in the ability to recognize individual words masked by a competing sound (such as a babble of voices) by adjusting the listening situation. This can be done in listening situations that offer little, if any, contextual support for word recognition of masked words using a two-stage process. First, determine the threshold for detecting the presence of the masker (in these experiments, a babble of voices). Then present the target voice at a fixed level above each individual's threshold for detecting babble. Second, find the Signal-to-Noise Ratio (SNR) at which an individual is able, 50% of the time, to repeat accurately the last word in low-context sentences such as "Jane was thinking about coffee," when such sentences are masked by noise (in this case, a babble of voices). The low-predictability sentences of the Revised Speech Perception in Noise Test (R-SPIN, Bilger et al., 1984) can be used to determine this SNR, because the context preceding the last word of the sentence provides only minimal clues as to the identity of the last word. Knowledge of each individual's babble threshold, and his or her 50% threshold for sentence final word recognition, can then be used to individually adjust the listening situation so that word recognition in the absence of contextual support (the probability of correctly identifying the word being spoken) is comparable for all individuals regardless of their hearing status, or of their age.

### Memory for Words Presented in Background Noise When Listening in One's Second Language

Some studies have shown episodic memory in a word recall task to be poorer in young EL2 than in young EL1 listeners after listening to a series of words (e.g., Fernandes et al., 2007). This could be due to a number of factors related to their linguistic ability. First, we might expect the lexicon to be less fully developed in one's second language than in one's first language (Bialystok et al., 2010; Bialystok and Luk, 2012). Second, the target speech stream might be expected to initiate activity in the individual's first-language lexicon as well as in the second-language lexicon (Schroeder and Marian, 2014). Deficiencies in one's second-language lexicon, coupled with dual activation of both the first- and second-language lexicons, could make it more difficult to encode the heard words into long-term memory. A third reason is poorer discriminability for certain phonemic contrasts, especially when noise is present (e.g., Garcia Lecumberri and Cooke, 2006). Because Fernandes et al. (2007) took no steps to equate individuals with respect to word recognition other than presenting the words in quiet, we do not know whether memory in young EL2 listeners would continue to be poorer than in young EL1 listeners once the listening situation is individually adjusted in all participants to achieve equivalent levels of word recognition for to-be-encoded words. Hence, if reduced perceptual accuracy and discriminability play a critical role for young EL2 listeners' speech perception and memory abilities, equating their perceptual accuracy to that of their first-language counterparts (young EL1) should minimize memory differences between the two groups. If equating for perceptual differences does not equate for memory differences, then differences in other linguistic abilities such as size and activation of the mental lexicon must play an important role.
If this were to be the case, young EL2 listeners' memory performance could resemble that of older EL1 listeners even after equating for word recognition (Murphy et al., 2000; Heinrich and Schneider, 2011).

If a poorer memory for EL2 listeners is found even after adjusting for word recognition, an exploration of how memory is affected by the parameters of the competing noise could help us to identify the reasons for poorer memory in one's second language than in one's first language. An examination of the similarities and differences in the patterns of memory deficits in young EL2 and older EL1 listeners could potentially help us identify the nature and comparability of the memory deficits in both groups. For these reasons, in this study we compared memory performance in a paired-associate memory paradigm for heard words identical to that previously used to obtain data from young and old EL1 listeners.

Presenting all stimuli in a background noise makes it easier to equate for perceptual differences between individuals. However, it is not only the presence or absence of noise that has previously been found to affect memory but also the temporal relationship of the background noise to the word presentation (Heinrich and Schneider, 2011). Therefore, we investigated recall in two different background noise conditions: continuous noise, and noise only present during word presentation. In addition we also collected data for a quiet baseline condition. In a previous study we found that for young EL1 listeners, only the presence of continuous noise led to a reduction in memory compared to the quiet condition whereas both kinds of noise led to memory deficits for older EL1 listeners (Heinrich and Schneider, 2011). If the same pattern as the one found in older EL1 listeners occurred for young EL2 listeners, we might conclude that the underlying processes governing recall in noise reflect similar deficiencies in the processes supporting paired-associate memory. Conversely, if after equating for word recognition, paired-associate memory for heard words was found to be equivalent in young EL1 and EL2 listeners, we could conclude that the language proficiency of listeners did not affect their memory for heard words. Such a finding would be consistent with the hypothesis that the reason why older EL1 listeners perform more poorly than young EL1 listeners, even after equating for word recognition, is primarily age-related changes in higher-order cognitive processes supporting episodic memory.

Hence, in the present experiment we compared memory for heard paired associates obtained in previous experiments for younger and older adults listening in their first language (young EL1 listeners, older EL1 listeners, Murphy et al., 2000; Heinrich and Schneider, 2011) to data collected here on young EL2 listeners. In all three groups, the average sound pressure at which the word pairs were presented and the SNR at which they were presented in the background babble were adjusted to produce equivalent levels of word recognition in the absence of contextual support. The paired associates were presented under three different masking conditions: (1) no masking (Quiet); (2) Continuous masking by a 12-talker babble of voices; and (3) Word-Only masking where the onset and offset of the masker were contemporaneous with the onset and offset of the word pair (see **Figure 1**). Four seconds after the last paired associate was presented, a warning tone was sounded. Four seconds later, the first word of one of the paired associates was presented in quiet. These three masking conditions were chosen because the pattern of results for young EL1 listeners for these three maskers differed substantially from the pattern of results on the same three maskers in older EL1 listeners. Hence we felt that an exploration of how young EL2 listeners might perform under these three masking conditions would allow us to (1) identify the ways in which memory might differ in young EL1 and young EL2 listeners; and (2) shed some light on the nature of the perceptual and/or cognitive factors that might be responsible for the memory deficits in older adults.

# METHODS

### Young EL2 Participants

FIGURE 1 | … by a warning tone, followed 4 s later by the first word pair. Subsequent word pairs were spaced 4 s apart with 100 ms separating the two words in a pair. A warning tone followed 4 s after word pair five. The first word of one of the word pairs was presented in quiet 4 s later. In the Continuous Babble condition, the babble was played continuously between the warning tones. For Word-Only Babble, the babble began and ended with the word pair.

A total of 90 EL2 undergraduate students at the University of Toronto (30 students in each of the three conditions) were paid \$10 per hour for their participation. All participants first became immersed in an English-speaking environment after the age of 7 years, and were not extensively exposed to English prior to that. Details concerning their age, gender, age of arrival in an English-speaking country, years of education, Mill Hill vocabulary scores, and Nelson-Denny reading comprehension scores are presented in **Table 1** separately for each of the three testing conditions. All participants were required to have clinically normal hearing. Pure-tone air-conduction thresholds were measured at nine frequencies (0.25–8 kHz) for both ears using an Interacoustics Model AC5 audiometer (Interacoustics, Assens, Denmark). All participants were required to have pure-tone air-conduction thresholds of 15 dB HL or lower between 0.25 and 8 kHz in both ears. Participants with a threshold of 20 dB HL at a single frequency were not excluded from the study. Participants who demonstrated unbalanced hearing (more than a 15 dB difference between ears at one or more frequencies) were excluded from participation. The average audiograms for the 90 EL2 participants are shown for the left ear only in **Figure 2** (circles). During each participant's first experimental session we measured audiometric thresholds and administered the Nelson-Denny reading comprehension test (Brown et al., 1981) and the Mill Hill test of vocabulary knowledge (Raven, 1965). The memory task, along with the babble detection thresholds and the low-context R-SPIN thresholds, was administered in the next experimental session. All experimental procedures were approved by the Research Ethics Board of the University of Toronto.
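The audiometric inclusion criteria described above can be expressed as a simple screening rule. The following is a minimal sketch, not the procedure actually used in the study: the function name and argument layout are hypothetical, and it assumes that the single tolerated 20 dB HL threshold is counted across both ears together, which the text does not specify.

```python
from typing import Sequence


def passes_hearing_screen(
    left_db_hl: Sequence[float],
    right_db_hl: Sequence[float],
    limit: float = 15.0,          # thresholds must be at or below 15 dB HL
    tolerated: float = 20.0,      # one threshold of up to 20 dB HL is tolerated
    max_asymmetry: float = 15.0,  # more than 15 dB between ears is "unbalanced"
) -> bool:
    """Return True if a participant meets the stated audiometric criteria.

    Both sequences hold pure-tone thresholds (dB HL) at the same test
    frequencies, in the same order.
    """
    if len(left_db_hl) != len(right_db_hl):
        raise ValueError("both ears must be tested at the same frequencies")

    # Thresholds that exceed the 15 dB HL limit.
    over_limit = [t for t in list(left_db_hl) + list(right_db_hl) if t > limit]

    # Assumption: at most one such threshold is allowed (pooled over ears),
    # and it may not exceed 20 dB HL.
    if len(over_limit) > 1:
        return False
    if over_limit and over_limit[0] > tolerated:
        return False

    # Exclude unbalanced hearing: an interaural difference of more than
    # 15 dB at one or more frequencies.
    if any(abs(l - r) > max_asymmetry for l, r in zip(left_db_hl, right_db_hl)):
        return False

    return True
```

Writing the criteria this way makes the ambiguity explicit: whether the single-frequency exception applies per ear or per participant changes only the `over_limit` check.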

### Younger and Older EL1 Participants

The data for the younger and older EL1 listeners in the Quiet condition were taken from Experiment 2 of Murphy et al. (2000). The data for the younger and older EL1 listeners in the Continuous Babble condition were taken from Experiment 3 of Murphy et al. (2000). Finally, the data for the younger and older EL1 listeners in the Word-Only Babble condition were taken from Heinrich and Schneider (2011). The younger adults were also University of Toronto undergraduates, and were tested under the same conditions as the young EL2 listeners in the present experiment. The older EL1 listeners were volunteers from the Mississauga community, and were tested under the same conditions as the young EL2 listeners in the present experiment. Their numbers, ages, years of education, and vocabulary scores are reproduced in **Table 1**. All EL1 listeners were immersed in an English-speaking environment before the age of 5. Reading comprehension scores were not available for these participants. The left-ear Babble and R-SPIN thresholds appear in **Table 2**.

### General Methods

The stimuli, apparatus, and testing protocols were taken from Murphy et al. (2000), Heinrich et al. (2008) and Heinrich and Schneider (2011). Hence any differences between the present results and those previously found in these studies cannot be attributed to differences in any of these factors.

### **Apparatus and stimuli**

The word pairs, which were the same as those in Murphy et al. (2000), consisted of 400 two-syllable common nouns with a frequency of more than 1 per million (Kucera and Francis, 1967). The individual words, spoken by a female speaker, were digitally recorded at a sampling rate of 20 kHz and had similar root-mean-square (RMS) values. The word pairs were delivered through a 16-bit digital-to-analog converter (TDT DD2) followed by a 10 kHz low-pass filter to the left ear only. All testing took place in a double-walled sound-attenuating chamber using headphones.

### Procedure

### **Babble threshold**

TABLE 1 | Participant parameters for the young EL2 and EL1 listeners under the three conditions tested along with those of the older EL1 listeners.

Reading comprehension scores were not available for the participants in Murphy et al. (2000) or in Heinrich and Schneider (2011). Age of immersion is not relevant for the younger and older EL1 listeners.

\*Data was taken from Murphy et al. (2000), Experiment 2.

\*\*Data was taken from Murphy et al. (2000), Experiment 3.

\*\*\*Data was taken from Heinrich and Schneider (2011), Experiment 2.

The words were presented at a level that was individually set to 50 dB above the listener's babble threshold. Adjusting presentation level individually was important because older adults' babble thresholds are considerably higher than those of younger adults. If an identical presentation level had been used for both age groups, the stimuli would have been presented too close to the older listeners' threshold, which could have had an adverse effect on word recognition. The 12-talker babble used in these experiments to individually adjust presentation level was taken from the Revised Speech Perception in Noise (R-SPIN) test (Bilger et al., 1984). Thresholds for the detection of babble when presented to the left ear only were determined for each individual allocated to one of the two noise conditions (Word-Only and Continuous Babble). We used a two-interval, two-alternative forced-choice paradigm with an adaptive three-down, one-up procedure (Levitt, 1971) to determine the babble threshold corresponding to the 79% point on the psychometric function. In this procedure, a 1.5-s babble segment was randomly presented in one of two intervals which were separated by a 1.5-s silent period. Two lights on the button box indicated the occurrence of each interval, and the listener's task was to identify the interval containing the babble segment by pressing the corresponding button. Immediate feedback was provided after each press by flashing the LED corresponding to the interval in which the babble segment occurred (for more details see Heinrich et al., 2008; Heinrich and Schneider, 2011). The starting intensity was 50 dB SPL. The intensity of the babble was reduced after three correct responses in a row and increased after a single incorrect response. The session was terminated after 12 reversals. The babble threshold was defined as the average SPL on the last eight reversals. Babble thresholds for the left ear (all stimuli were presented to the left ear only) are shown in **Table 2**.
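The three-down, one-up tracking rule used for the babble threshold can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the software used in the study: the 2 dB step size and the `respond` callback (a stand-in for the listener's forced-choice response) are hypothetical, while the starting level, the 12-reversal stopping rule, and the averaging of the last eight reversals follow the description above. The rule converges on the level at which three consecutive correct responses are as likely as not, that is, where the probability of a correct response is 0.5 ** (1 / 3), or about 0.794.

```python
from typing import Callable, List


def three_down_one_up(
    respond: Callable[[float], bool],
    start_level: float = 50.0,  # starting babble level in dB SPL (from the text)
    step: float = 2.0,          # step size in dB (assumed; not given in the text)
    n_reversals: int = 12,      # track ends after 12 reversals (from the text)
    n_average: int = 8,         # threshold = mean of last eight reversals
) -> float:
    """Run one adaptive track and return the estimated threshold in dB SPL.

    `respond(level)` must return True for a correct response at `level`.
    """
    level = start_level
    correct_in_row = 0
    direction = 0  # -1 = level last moved down, +1 = moved up, 0 = no move yet
    reversals: List[float] = []

    while len(reversals) < n_reversals:
        if respond(level):
            correct_in_row += 1
            if correct_in_row < 3:
                continue  # level changes only after three correct in a row
            correct_in_row = 0
            new_direction = -1  # three correct: make the babble softer
        else:
            correct_in_row = 0
            new_direction = +1  # one error: make the babble louder

        # A reversal is a change in the direction of the track.
        if direction != 0 and new_direction != direction:
            reversals.append(level)
        direction = new_direction
        level += new_direction * step

    return sum(reversals[-n_average:]) / n_average


# Idealized (noise-free) listener who detects the babble at 20 dB SPL or above.
estimate = three_down_one_up(lambda level: level >= 20.0)
print(estimate)  # 19.0 for this idealized listener
```

With a real (probabilistic) listener the reversal levels scatter around the 79% point, which is why the estimate averages several reversals rather than taking the last one.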

### **Individually adjusting the signal-to-noise ratio**

TABLE 2 | Babble and SPIN thresholds in the left ear for each of the conditions and groups.

\*Data was taken from Murphy et al. (2000), Experiment 2.

\*\*Data was taken from Murphy et al. (2000), Experiment 3.

\*\*\*Data was taken from Heinrich and Schneider (2011), Experiment 2.

Following Murphy et al. (2000), and Heinrich and Schneider (2011), the low-context sentences from the R-SPIN test (Bilger et al., 1984) were used to determine the SNR for each individual that resulted in 50% correct identification of the last word in these sentences. Participants were asked to immediately repeat the last word of individual sentences presented to them in a multi-talker babble background. Each participant listened to at least two R-SPIN lists played to his or her left ear, at SNRs that were chosen to bracket the 50% intelligibility point for final words in low-context sentences (e.g., "Jane was thinking about coffee"). The SNR corresponding to the 50% point was then estimated by linear interpolation and is shown in **Table 2** for all groups. The SNR used in the memory task was set at the individual SNR value corresponding to 50% correct identification minus 7 dB, which was shown by Murphy et al. (2000) to result in approximately 91% correct word identification when the words used in the memory experiments were presented in babble.

Consider the following example in which the listening situation is individually adjusted for two individuals, one younger and the other older, to produce equal word recognition in the absence of contextual support. Suppose the thresholds for detecting a babble of voices are 10 and 18 dB SPL for the younger and older adult, respectively. To equate for individual differences in babble threshold, the target sentence is presented at, say, 50 dB above each individual's babble threshold (at 60 and 68 dB SPL, for the younger and older individuals, respectively). Now suppose we want to set the nominal SNR to −7 dB. Suppose the low-predictability R-SPIN threshold for the younger individual is −1 dB, whereas it is 4 dB for the older individual, a 5 dB difference. The babble level for the younger listener would be set to 60 + 7 + 1 = 68 dB SPL, for an SNR of −8 dB. The babble level for the older individual would be set to 68 + 7 − 4 = 71 dB SPL, producing an SNR of −3 dB. Note that the SNR for the older individual is 5 dB higher than that of the younger individual, which is equal to the difference in the R-SPIN thresholds for low-predictability sentences.
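The level adjustment in the example above can be written out as a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the percent-correct values passed to the interpolation helper are invented for demonstration, not data from the study. The second function reproduces the worked example: the target level is the babble threshold plus 50 dB, the individual SNR is the R-SPIN threshold plus the nominal SNR of −7 dB, and the babble level is the target level minus that SNR.

```python
from typing import Tuple


def snr_at_50_percent(
    snr_low: float, pct_low: float, snr_high: float, pct_high: float
) -> float:
    """Linearly interpolate the SNR (dB) giving 50% correct final-word
    identification from two measured points that bracket 50%."""
    if not (pct_low <= 50.0 <= pct_high) or pct_low == pct_high:
        raise ValueError("the two points must bracket 50% correct")
    slope = (snr_high - snr_low) / (pct_high - pct_low)  # dB per percentage point
    return snr_low + (50.0 - pct_low) * slope


def presentation_levels(
    babble_threshold: float,        # individual babble detection threshold, dB SPL
    rspin_threshold: float,         # individual 50% R-SPIN threshold, dB SNR
    sensation_level: float = 50.0,  # target presented 50 dB above babble threshold
    nominal_snr: float = -7.0,      # nominal SNR from the worked example
) -> Tuple[float, float, float]:
    """Return (target level, individual SNR, babble level), in dB."""
    target_level = babble_threshold + sensation_level
    individual_snr = rspin_threshold + nominal_snr
    babble_level = target_level - individual_snr
    return target_level, individual_snr, babble_level


# Worked example from the text.
print(presentation_levels(10.0, -1.0))  # younger: (60.0, -8.0, 68.0)
print(presentation_levels(18.0, 4.0))   # older:   (68.0, -3.0, 71.0)

# Hypothetical bracketing data: 40% correct at -2 dB SNR, 60% at +2 dB SNR.
print(snr_at_50_percent(-2.0, 40.0, 2.0, 60.0))  # 0.0
```

The difference between the two individual SNRs (−3 versus −8 dB) equals the 5 dB difference between the two R-SPIN thresholds, as noted in the text.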

Previous studies have shown that the psychometric functions relating percent correct word recognition to SNR have equivalent slopes for younger and older adults (Ben-David et al., 2012), and that the slopes for younger adults do not differ substantially between EL1 and EL2 listeners (Zhang et al., 2014). Hence, once the SNRs are individually adjusted for 50% word recognition, changes away from the adjusted value should produce equivalent performance across age and language experience in the absence of contextual support. Thus, by individually adjusting the SNRs we equated for individual differences in word recognition in noise when there is no assisting context. **Table 2** presents the average SNR for 50% intelligibility of a low-context sentence under each of the two babble conditions.

### **Word recall**

As in the previous studies of this series, participants listened to words that were randomly arranged in 40 lists containing five word pairs, following the paradigm of Madigan and McCabe (1971). Four seconds after a short warning tone, the first word pair was presented with a silent period of 100 ms between the words. The inter-stimulus interval between successive word pairs was also 4 s. Another 4 s after the presentation of the last word pair of the list, another short warning tone indicated the beginning of the recall phase (for more details see Heinrich and Schneider, 2011). Participants were cued with the first word from one of the five previously presented word pairs and were asked to recall the second word which was presented as part of the same pair. Only one pair from each list was cued; no time limit was placed on recall, and participants were encouraged to guess. The serial positions refer to the order in which the word pairs were presented in each trial; the first serial position refers to the first word pair. The serial position of each word pair within the five-word-pair list was tested eight times within a session. The list order was identical for all participants, and the order in which the serial positions were tested was independently and randomly determined for each participant. No feedback was provided. Participants were instructed to take a break after the presentation of the first 20 lists.
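The design arithmetic above (40 lists, five pairs per list, one cued pair per list, each serial position cued eight times) can be checked with a small sketch. The function and constant names are hypothetical, and the simple unconstrained shuffle is an assumption; the text says only that the testing order was random and independent for each participant.

```python
import random
from typing import List

N_LISTS = 40            # lists per session
PAIRS_PER_LIST = 5      # word pairs (serial positions) per list
TESTS_PER_POSITION = 8  # times each serial position is cued per session


def cue_schedule(rng: random.Random) -> List[int]:
    """Return, for each of the 40 lists, the serial position (1-5) that is cued.

    Each serial position appears exactly eight times; the order is shuffled
    independently for each participant (one `rng` per participant).
    """
    schedule = [
        position
        for position in range(1, PAIRS_PER_LIST + 1)
        for _ in range(TESTS_PER_POSITION)
    ]
    rng.shuffle(schedule)
    return schedule


# Five positions tested eight times each account for all 40 lists, and
# 40 lists of five two-word pairs use all 400 recorded nouns.
assert PAIRS_PER_LIST * TESTS_PER_POSITION == N_LISTS
assert N_LISTS * PAIRS_PER_LIST * 2 == 400

participant_schedule = cue_schedule(random.Random(1))  # seed is arbitrary
```

Because only the cue order is shuffled, the word lists themselves stay identical across participants, as described above.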

The word pairs were presented under three different masking conditions: (1) no masking (Quiet); (2) Continuous masking by a 12-talker babble of voices; and (3) Word-Only masking where the onset and offset of the masker were contemporaneous with the onset and offset of each of the word pairs (see **Figure 1**). A separate, independent group of participants was tested in each of the three masker conditions.

# RESULTS

### Babble Detection Thresholds, R-SPIN Word Recognition Thresholds, and Mill Hill Vocabulary

Babble thresholds, R-SPIN thresholds, and Mill Hill vocabulary scores were obtained for all of the participants in both Continuous Babble and Word-Only Babble in all three groups (see **Table 2**). For babble and R-SPIN thresholds, as well as for Mill Hill Vocabulary scores, we might expect to find differences among the three groups but not among masking conditions nor any interaction between masker type and group. A Between-Subjects ANOVA with three Groups (young EL1 listeners, young EL2 listeners, and old EL1 listeners) and two Masker Types (Continuous vs. Word-Only) indicated that babble thresholds differed across groups [F(2, 117) = 32.357, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.356] but not across Masker Type [F(1, 117) < 1]. In addition the Group × Masker Type interaction was not significant [F(2, 117) < 1]. Post-hoc LSD tests found that both younger groups had lower babble thresholds than the older group (p < 0.001 in both instances) with no difference between the younger EL1 and EL2 groups (p = 0.859). A Between-Subjects ANOVA on the R-SPIN thresholds found that R-SPIN thresholds differed across groups [F(2, 117) = 33.688, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.365] with no significant difference between the two Masker Conditions [F(1, 117) = 2.127, p = 0.147] and no significant interaction between the two factors [F(2, 117) < 1]. Post-hoc LSD tests indicated that R-SPIN thresholds were ordered from lowest to highest as young EL1, old EL1, young EL2, with young EL1 listeners having significantly lower R-SPIN thresholds than the other two groups (p < 0.001 for both comparisons), and older EL1 listeners having significantly lower thresholds than young EL2 listeners (p = 0.009).

A Between-Subjects ANOVA with three Groups (young EL1 listeners, young EL2 listeners, and old EL1 listeners) and two Masker Types (Continuous vs. Word-Only) indicated that Mill Hill vocabulary scores differed across groups [F(2, 117) = 42.801, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.423] but not across Masker Type [F(1, 117) < 1]. In addition the Group × Masker Type interaction was not significant [F(2, 117) = 1.703, p = 0.187]. Post-hoc LSD tests found that Mill Hill vocabulary scores were ordered from lowest to highest as young EL2, young EL1, and old EL1, with young EL2 listeners having lower Mill Hill vocabulary scores than both of the other two groups (p < 0.001 for both comparisons), and young EL1 listeners having lower vocabulary scores than older EL1 listeners (p = 0.014). There were no main effects of Masker Type on babble thresholds, R-SPIN thresholds, or Mill Hill vocabulary scores, nor any evidence of interaction between Masker Type and Group for these three variables. This indicates that any effects of Masker Type on memory performance cannot be attributed to the use of different participants for the different types of maskers.
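The partial eta-squared values reported above are tied to the F ratios and their degrees of freedom by a standard identity, η<sub>p</sub><sup>2</sup> = (F × df<sub>effect</sub>) / (F × df<sub>effect</sub> + df<sub>error</sub>). The sketch below (with a hypothetical function name) applies that identity to the three group effects as a consistency check; it is not part of the original analysis.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom.

    Follows from eta_p^2 = SS_effect / (SS_effect + SS_error) together with
    F = (SS_effect / df_effect) / (SS_error / df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Group effects reported in the text, each with df = (2, 117).
reported = {
    "babble threshold": (32.357, 0.356),
    "R-SPIN threshold": (33.688, 0.365),
    "Mill Hill vocabulary": (42.801, 0.423),
}
for measure, (f_value, eta_reported) in reported.items():
    eta = partial_eta_squared(f_value, 2, 117)
    print(f"{measure}: computed {eta:.3f}, reported {eta_reported:.3f}")
```

In each case the computed value agrees with the reported one to three decimal places.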

### Paired-Associate Memory: Young EL2 vs. Young EL1 Participants

In these experiments, the levels at which the words were presented, and the SNR at which they were presented, were adjusted to achieve the same level of word recognition in all participants. To determine whether the linguistic status of the listener affected their ability to recall the second word in a pair when prompted with the first word of a pair, we conducted a three-factor ANOVA on the percentage of words correctly recalled in each serial position for young EL1 listeners (taken from Murphy et al., 2000) and the EL2 listeners. In this analysis the serial position of the word was a within-subject factor. The two other factors, Type of Masker (none, Continuous Babble, Word-Only Babble) and Language Status, were between-subjects factors. This analysis revealed a main effect of Serial Position [F(4, 520) = 137.543, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.514], a main effect of Masker Type [F(2, 130) = 6.419, p = 0.002, η<sub>p</sub><sup>2</sup> = 0.090], and an interaction between Serial Position and Masker Type [F(8, 520) = 3.617, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.053]. Post-hoc LSD tests indicated that performance in quiet was better than in Continuous Babble (p = 0.001), and better than in Word-Only Babble (p = 0.042), but that there was no overall difference between Continuous Babble and Word-Only Babble (p = 0.176). Neither the main effect of Language Status, nor any of the interactions involving Language Status reached statistical significance: Language Status [F(1, 130) < 1]; Serial Position × Language Status [F(4, 520) = 1.318, p = 0.262]; Serial Position × Language Status × Masker Type [F(8, 520) < 1]. Hence, there is no evidence that recall is affected by language status as long as the listening situation is adjusted to achieve equal levels of word recognition in both young EL1 and young EL2 listeners. Because we did not find any effect of language status between young EL1 and EL2 participants, we aggregated the data from both groups in subsequent analyses.

**Figure 3** plots the percentage of words correctly recalled as a function of serial position for each of the masking conditions. **Figure 3** suggests that the performance of young adults was roughly identical under all masker conditions for serial positions 4 and 5. When the word pairs were presented in Quiet or in Word-Only Babble, performance appears to be equivalent for serial positions 1, 2, and 3. However, when a Continuous Babble was used as a masker, performance appeared to be lower in positions 1 and 2 than in the other two masker conditions. To confirm that the difference in performance among the three maskers in the early serial positions was responsible for the Serial Position × Masker Type interaction, we conducted three additional ANOVAs. In each of these ANOVAs, Serial Position was a within-subject factor. In the first ANOVA the second factor (Masker Type) contained only two levels (Quiet and Continuous Babble). In the second ANOVA, the two levels of Masker Type were Quiet and Word-Only Babble. In the third ANOVA, the two levels of Masker Type were Continuous Babble and Word-Only Babble. Significance levels in these three ANOVAs were Bonferroni corrected. Not surprisingly, the main effect of serial position was highly significant in all three ANOVAs (p < 0.001, η<sub>p</sub><sup>2</sup> > 0.48). When the masker contrast was between Quiet and Word-Only Babble there was no significant main effect due to Masker Type [F(1, 89) = 3.681, p > 0.2], nor was there a significant Serial Position × Masker Type interaction [F(4, 356) < 1]. However, when the masker contrast was between Quiet and Continuous Babble, there was a significant main effect of Masker Type [F(1, 88) = 13.308, p < 0.01, η<sub>p</sub><sup>2</sup> = 0.131] and a significant Serial Position × Masker Type interaction [F(4, 352) = 3.812, p < 0.01, η<sub>p</sub><sup>2</sup> = 0.042]. Finally, when the masker contrast was between Continuous Babble and Word-Only Babble the main effect due to Masker Type was not significant [F(1, 89) = 1.788, p > 0.5] whereas the interaction between Serial Position and Masker Type was [F(4, 356) = 5.165, p < 0.01, η<sub>p</sub><sup>2</sup> = 0.055]. Hence, for young adults, performance in the Word-Only Babble appears to be equivalent to performance in Quiet, with performance in Continuous Babble being worse than in the other two masking conditions for serial positions 1 and 2.
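A Bonferroni correction of the kind applied to the three pairwise ANOVAs above can be sketched as follows: each uncorrected probability is multiplied by the number of tests in the family and capped at 1. The raw p-values in this example are hypothetical placeholders chosen for illustration; they are not the uncorrected values from the present analyses, which are not reported.

```python
def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of tests, capping the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical uncorrected p-values for the three masker contrasts
# (illustrative only; not taken from the article).
raw_p = {
    "Quiet vs. Continuous Babble": 0.0005,
    "Quiet vs. Word-Only Babble": 0.07,
    "Continuous vs. Word-Only Babble": 0.5,
}

corrected = dict(zip(raw_p, bonferroni_adjust(list(raw_p.values()))))
for contrast, p in corrected.items():
    print(f"{contrast}: corrected p = {p:.4f}")
```

With three tests in the family, an effect must reach an uncorrected p below roughly 0.0167 to remain significant at the 0.05 level after correction.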

# Paired-Associate Memory: Younger Adults vs. Older Adults

Paired-associate memory in younger adults was compared to that of older adults in a 2 Age × 3 Masker Type × 5 Serial Position ANOVA with Serial Position as a within-subject factor and Age and Masker Type as between-subjects factors. There were significant main effects of Serial Position [F(4, 708) = 154.126, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.465], Age [F(1, 177) = 45.166, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.203], and Masker Type [F(2, 177) = 12.181, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.121]. In addition, there was a significant two-way interaction between Serial Position and Masker Type [F(8, 708) = 4.899, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.052], and a significant three-way interaction between Age, Serial Position, and Masker Type [F(8, 708) = 2.294, p = 0.020, η<sub>p</sub><sup>2</sup> = 0.025]. These effects are readily visible in **Figure 4**.

**Figure 4** indicates that, on average, memory is poorer in all three conditions for the older adults. The top panel indicates the presence of a Serial Position × Age interaction when the word pairs are presented in Quiet. This interaction is absent when word pairs are presented in either Continuous or Word-Only Babble. To confirm this, we conducted separate ANOVAs for each of the three panels to test for an Age × Serial Position interaction. Probabilities for these three separate ANOVAs were Bonferroni corrected. The Serial Position × Age interaction was statistically significant in the top panel [F(4, 232) = 4.456, p < 0.01, η<sub>p</sub><sup>2</sup> = 0.071], but not in either the middle or bottom panels of **Figure 4** [F(4, 232) < 1, and F(4, 244) = 1.467, p > 0.2, respectively].

The above analysis failed to reveal any significant Age × Serial Position interaction when the paired associates were masked by either Continuous or Word-Only Babble. However, the effect of serial position appears to differ between the two types of maskers. **Figure 4** suggests that for a Continuous Babble masker, performance continues to decline from serial position 3 to serial position 1 whereas there is no apparent decline from position 3 to position 1 for a Word-Only Babble masker. To confirm that the effect of Serial Position differed between the two types of maskers, we conducted a 2 Masker Type × 5 Serial Position × 2 Age Group ANOVA with Serial Position as a within-subject factor, and Masker Type and Age Group as between-subjects factors. The ANOVA revealed significant main effects of Serial Position [F(4, 476) = 94.105, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.442] and Age Group [F(1, 119) = 44.957, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.274], but not of Masker Type [F(1, 119) < 1]. Age did not interact with any of the other factors [Serial Position × Age, F(4, 476) < 1; Masker Type × Age, F(1, 119) < 1; and Masker Type × Serial Position × Age, F(4, 476) < 1]. However, the Serial Position × Masker Type interaction was significant [F(4, 476) = 8.116, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.064]. To confirm that the Serial Position × Masker Type interaction was due to a decline from serial position 3 to position 1 when the masker was Continuous Babble, and an absence of decline when the masker was Word-Only Babble, we conducted separate ANOVAs for these two Masker Types on the first three serial positions only. When the masker was Continuous Babble, there was a significant main effect of Serial Position for the first three serial positions [F(2, 116) = 12.193, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.174], with a significant linear trend [F(1, 58) = 20.856, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.264], but not when the masker was Word-Only Babble [F(2, 122) < 1]. The Age effect was significant for both Continuous (p < 0.001, η<sub>p</sub><sup>2</sup> = 0.239) and Word-Only Babble (p < 0.001, η<sub>p</sub><sup>2</sup> = 0.196) but there was no evidence of any Age × Serial Position interaction for either masker (p > 0.5 for both types of maskers). Companion analyses on serial positions 4 and 5 failed to reveal any interactions between Masker, Serial Position, and Age Group. Hence the two-way interaction between Serial Position and Masker Type is due to a decline in performance from serial position 3 to serial position 1 when the babble is continuous, whereas there is no statistically significant decline over these three positions when the word pairs are presented in Word-Only Babble.

# Relationship of Vocabulary Knowledge and Reading Comprehension to Average Recall in EL2 Listeners

All of the EL2 listeners had their vocabulary knowledge and reading comprehension assessed using the Mill Hill vocabulary test and the Nelson-Denny Reading Comprehension test<sup>1</sup>. In addition, we asked them to report their number of years of schooling. These three measures were first centered (had the means removed) in each of the three masking conditions to remove any mean differences among these conditions. We then regressed the average centered percent correct score in each of these conditions on these three centered measures according to the equation

$$y_i = a_1 \ast \text{years}_i + a_2 \ast \text{MH}_i + a_3 \ast \text{ND}_i$$

where y<sub>i</sub> is the centered average percent correct score for subject i, years<sub>i</sub> refers to the number of years of education for subject i, MH<sub>i</sub> is subject i's Mill Hill vocabulary score, and ND<sub>i</sub> is subject i's Nelson-Denny Reading Comprehension score. In this model we were unable to reject the hypothesis that a<sub>1</sub> = a<sub>3</sub> = 0 [F(2, 87) = 1.296, p = 0.279]. Hence the model was reduced to

$$y_i = a_2 \ast \text{MH}_i$$

The correlation coefficient between centered percent correct and the centered Mill Hill vocabulary score was 0.34 which was significantly different from zero (p = 0.001).
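The centering and single-predictor regression described above can be sketched as follows. This is only an illustration under stated assumptions: the scores are invented for the example (three hypothetical listeners per masking condition), and the helper names `center_within`, `slope_through_origin`, and `pearson_r` are introduced here rather than taken from the original analysis. Because both variables are centered within each condition, the regression needs no intercept.

```python
def center_within(values, conditions):
    """Subtract each condition's mean from the scores observed in that condition."""
    totals, counts = {}, {}
    for value, condition in zip(values, conditions):
        totals[condition] = totals.get(condition, 0.0) + value
        counts[condition] = counts.get(condition, 0) + 1
    return [v - totals[c] / counts[c] for v, c in zip(values, conditions)]


def slope_through_origin(x, y):
    """Least-squares slope a for the no-intercept model y = a * x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data (not from the article): masking condition, Mill Hill
# vocabulary score, and average percent correct recall for nine listeners.
maskers = ["quiet"] * 3 + ["continuous"] * 3 + ["word_only"] * 3
mill_hill = [10, 12, 14, 9, 13, 17, 11, 12, 16]
recall = [50, 55, 63, 35, 40, 51, 45, 44, 55]

mh_centered = center_within(mill_hill, maskers)
recall_centered = center_within(recall, maskers)

a2 = slope_through_origin(mh_centered, recall_centered)
r = pearson_r(mh_centered, recall_centered)
print(f"slope a2 = {a2:.3f}, r = {r:.3f}")
```

Centering within each condition removes mean differences among the masking conditions, so the slope reflects only how individual differences in vocabulary relate to individual differences in recall within conditions.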

# Relationship of Vocabulary Knowledge and Reading Comprehension to Average Recall in Young EL1, Young EL2, and Old EL1 Listeners

Because we also had Mill Hill vocabulary scores for all but 11 of the younger and older EL1 listeners, we centered these scores, and combined them with the data from the EL2 listeners. Hence we could examine the relationship between the Mill Hill vocabulary scores (centered in each of the 3 Group × 3 Masker Condition combinations) and the average percentage correct word recall (again centered in each of the nine groups). We first fit a model in which separate slopes were fit to each of these nine sets of data. This nine-parameter model accounted for 15.2% of the variance in the data. We then compared this to a model in which a single slope was fit to all the data. Reducing the number of fitted parameters from 9 to 1 did not significantly worsen the fit [F(8, 162) < 1]. Hence, a single slope provides as good a fit to the data as a 9-slope model. **Figure 5** shows that a single slope relating centered percent correct to the centered Mill Hill vocabulary scores provides a good fit to the data for all combinations of Age Group × Masker Type Conditions.

<sup>1</sup>The Nelson-Denny reading comprehension measure was not collected in the previous studies on younger and older EL1 listeners.

**Figure 5** indicates that memory performance is positively related to vocabulary knowledge to the same extent in each of these three groups, and that this relationship is unaffected by the type of masker once word recognition ability has been equated in all three groups. Hence, those with greater vocabulary knowledge outperform those with lesser vocabulary knowledge, independent of their age or language status.
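The comparison between the separate-slopes model and the single-slope model described above is a nested-model F test. A minimal sketch follows, assuming no-intercept regressions on within-cell centered scores; the data are hypothetical (three cells rather than the nine Group × Masker cells of the study), and the function names are introduced for illustration only. With nine cells, the numerator degrees of freedom would be 9 − 1 = 8, matching the reported F(8, 162).

```python
def sse_through_origin(x, y):
    """Residual sum of squares for the least-squares fit of y = a * x."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return syy - sxy ** 2 / sxx


def nested_f(sse_reduced, sse_full, df_extra, df_error):
    """F ratio testing whether the extra parameters of the full model improve the fit."""
    return ((sse_reduced - sse_full) / df_extra) / (sse_full / df_error)


# Hypothetical centered scores per cell: (vocabulary, percent correct recall).
cells = {
    "quiet": ([-2, 0, 2], [-6, -1, 7]),
    "continuous": ([-4, 0, 4], [-7, -2, 9]),
    "word_only": ([-2, -1, 3], [-3, -4, 7]),
}

# Full model: one slope per cell. Reduced model: a single common slope.
sse_full = sum(sse_through_origin(x, y) for x, y in cells.values())
all_x = [v for x, _ in cells.values() for v in x]
all_y = [v for _, y in cells.values() for v in y]
sse_reduced = sse_through_origin(all_x, all_y)

n_obs, n_cells = len(all_x), len(cells)
df_extra = n_cells - 1        # parameters dropped by the reduced model
df_error = n_obs - n_cells    # residual df of the full model
f_value = nested_f(sse_reduced, sse_full, df_extra, df_error)
print(f"F({df_extra}, {df_error}) = {f_value:.3f}")
```

An F ratio below 1, as reported in the article, means that the additional slopes reduce the residual variance by less than would be expected by chance, so the single-slope model is retained.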

### DISCUSSION

### Perceptual and Cognitive Measures

Because the hearing levels of the young EL1 and EL2 listeners were equivalent (both groups had thresholds within the normal range; see **Figure 2**), we would expect both groups to be equally adept at detecting the presence of a babble of voices. However, we would expect older adults to have higher detection thresholds for speech than younger adults because of age-related hearing losses, which are especially prominent in the high-frequency range (see Schneider and Pichora-Fuller, 2000, for a review). Consistent with this expectation, babble detection thresholds did not differ between young EL1 and EL2 listeners, with babble thresholds in both younger groups being lower than in the older EL1 listeners.

Differences in word recognition thresholds between younger and older EL1 listeners are most likely related to age-related changes in hearing. A number of studies have shown that word recognition in noisy situations is poorer in older than in younger adults when individuals from both groups are listening to speech in their native language (Dubno et al., 1984; Humes and Christopherson, 1991; Benichov et al., 2012). Age-related changes in peripheral hearing would lead to a reduction in the salience of the acoustic cues that would facilitate lexical access (Schneider et al., 2010; Rönnberg et al., 2013). Hence under equivalently noisy listening conditions we would expect older adults to recognize fewer words than younger adults.

Word recognition has been found to be poorer in young people when they are listening to speech in their second language under noisy conditions (e.g., Bradlow and Pisoni, 1999; Ezzatian et al., 2010). Here, the reasons for needing a more favorable SNR are unlikely to be due to an impoverished acoustic signal but rather to an inadequate command of the phonology, semantics, and syntax of their second language (Gollan and Kroll, 2001; Bialystok et al., 2009; Zhang et al., 2014). Moreover, when English is a person's second language, it is possible that an English word might elicit activity in both the L2 and L1 lexicons, leading to some degree of confusion (Kroll and Steward, 1994). Hence, in the absence of sufficient context, we might expect to find large differences in word recognition thresholds between young EL1 and EL2 listeners, with the extent of the difference in word recognition being dependent on the degree of exposure to and immersion in the second language (Mayo et al., 1997; Ezzatian et al., 2010).

In previous experiments (Ezzatian et al., 2010; Avivi-Reich et al., 2014, 2015) and in the present experiments, group differences in Mill Hill vocabulary scores were also observed. Specifically, the older EL1 listeners had the highest vocabulary scores, followed by the young EL1 listeners who, in turn, had significantly higher scores than the younger EL2 listeners. The latter result is not surprising given that EL1 listeners have had considerably more experience in reading and listening to material in English than those for whom English is a second language. Note, however, that older EL1 listeners had significantly higher vocabulary scores than younger EL1 listeners, a consistent finding in studies from our laboratory over the past few decades (Ben-David et al., 2015). The greater degree of vocabulary knowledge in older than in younger EL1 listeners probably reflects a lifetime of exposure to English language materials.

# The Effects of Linguistic Competence on Memory

A somewhat surprising result is that, once all individuals were equated for word recognition, the effects of serial position and the type of masker were the same for young EL1 and EL2 listeners. As mentioned in the Introduction, previous studies have found that individuals operating in a second language tend to have a smaller vocabulary than monolinguals, appear to have more difficulty finding words (more tip-of-the-tongue states), have slower response times in naming pictures, and lower accuracy in recognizing words presented in noisy conditions (Gollan and Kroll, 2001; Bialystok et al., 2009). They also appear to have a reduced ability to discriminate fine phonemic information (Heinrich et al., 2010) and make use of linguistic cues, and may experience cross-language interference due to the activation of semantic and linguistic processes in more than one language (e.g., Kroll and Steward, 1994; Mayo et al., 1997; Bradlow and Pisoni, 1999; Meador et al., 2000; Weber and Cutler, 2004). Although the present study equated young EL1 and young EL2 listeners with respect to word recognition, it did not compensate for their poorer semantic and linguistic skills, slower lexical access, and possible cross-language interference. As Zhang et al. (2014) pointed out, the relatively poorer 50% speech recognition thresholds of EL2 listeners whose asymptotic performance in quiet is near perfect most likely reflect their lack of proficiency in the second language. Because in the current set of experiments, individually adjusting the SNRs at which the to-be-remembered words were presented produced near-asymptotic word recognition performance in all listeners, we would expect Zhang et al.'s argument to hold and the word recognition thresholds of EL2 listeners in this experiment to depend primarily on their proficiency in their second language (Gollan and Kroll, 2001; Bialystok et al., 2009; Zhang et al., 2014).
The fact that episodic memory did not appear to differ with language competence when the listening situation was adjusted to produce equivalent word recognition suggests that the primary factor that makes it difficult for young EL2 listeners to recall heard words in noisy everyday listening situations is their poorer word recognition when they are tested at the same level as EL1 listeners, and not their poorer command of L2. In other words, equivalent word recognition implies equivalent memory performance in young listeners, independent of their language status.

Of particular interest is the fact that the substantially lower vocabulary scores found in EL2 listeners as compared to EL1 listeners had no apparent effect on word recall in these experiments. A number of studies have indicated that when listening is easy, bottom-up acoustic information is likely to be sufficient for word recognition (lexical access). However, when listening becomes difficult, listeners might need to draw on their vocabulary knowledge to facilitate word access (Mattys et al., 2009, 2010; Mattys and Wiget, 2011). It is quite likely that when young EL1 and young EL2 listeners are listening in the same situations (no compensation for differences in word recognition), the young EL2 listeners are more likely to be drawing on their vocabulary knowledge to facilitate lexical access than young EL1 listeners. In the current experiments, presenting the to-be-remembered words at a higher SNR in young EL2 than in young EL1 listeners may have the effect of boosting the acoustic signal to such a degree that there is little, if any, need in either group to draw on vocabulary knowledge and/or other top-down processes to achieve word recognition. If this is so, the greater vocabulary knowledge of the EL1 listeners may not give them as great an advantage in word recognition over EL2 listeners as it would when no adjustments in SNR are made. Hence, equating these two groups for word recognition may be expected to reduce any differences in the comprehension of heard speech in these two groups.

### The Effects of Age on Memory

The age-related declines in memory found in older adults after compensating for age-related differences in word recognition (see **Figure 4**) could reflect age-related declines in phonetic, linguistic and semantic ability, or age-related declines in the ability to store, and/or retrieve, information from memory. We have seen that differences in memory performance between young EL1s and young EL2s disappear after equating for word recognition. Hence, we can safely assume that the reduced phonetic, linguistic and semantic abilities of young EL2 listeners have little, if any, effect on their ability to store word pairs for later recall once adjustments are made for word recognition. If we assume that adjustments to compensate for word recognition differences between younger and older adults also compensate for any age-related declines in phonetic, semantic, and linguistic abilities, then the remaining age-related differences in memory performance most likely reflect age-related declines in the cognitive processes subserving the storage and retrieval of information in memory. Hence these results support the notion that there are age-related losses in the ability to transfer words into long-term storage and/or to retrieve the stored information. Such difficulties would explain why younger and older adults have equivalent recall of word pairs in the 4th or 5th serial positions in quiet but not of the word pairs in the more remote serial positions (see the top panel of **Figure 4**). Presumably, the word pairs in positions 4 and 5 are still in working memory, and therefore are available for prompted recall. Word recall in the more remote serial positions is likely to depend on memory for items in long-term storage.
Recent models (Baddeley, 2000; Oberauer, 2002; Unsworth and Engle, 2007) reflect a growing consensus that working memory tasks are not solely dependent on either the long-term or short-term memory systems, but that information in memory may exist in different states of accessibility (Oberauer, 2002). Only a limited number of items may be within a state of direct access (primary memory), while recently activated information remains in a passive state of readiness within the long-term or secondary memory. When listening to word pairs in noise, the listening effort caused by the background babble might require the listener to switch attention away from maintaining items in primary memory. This might be especially challenging in a task such as the one used in the current study, as the number of items the listener has to remember exceeds four. Thus, at least some of the words must be retrieved from secondary memory (Unsworth and Engle, 2007). Age-related deficiencies in encoding or in retrieval from long-term or secondary storage could explain the age-related deficit in quiet in these positions.

The results for memory in the presence of Continuous Babble or in Word-Only Babble indicate age-related decrements in all serial positions. Age-related declines in the perceptual and attentional processes required for extracting the word pairs from a babble of voices may be responsible for the uniform deficits seen in each serial position. When the babble background is continuous, the listener may have to continuously focus attention on the acoustic signal to facilitate processing of the word pair when it is presented, drawing resources away from maintaining the words in working memory where they can be rehearsed, and transferred to long-term memory. Age-related declines in such attentional resources could lead to the pattern of results shown in the middle panel in **Figure 4**. The continued decline in performance from serial position 3 to serial position 1 is also consistent with this hypothesis. If continuous babble interferes with rehearsal and transfer into long-term storage, we might expect that the more remote the word pair is from time of testing the less likely it will be recalled correctly. Hence the need to maintain focused attention on the auditory input when the babble is continuous could be the reason for the Serial Position × Masker Type interaction that is present in both younger and older adults.

The age-related decline in performance at all serial positions when the background babble begins at the same time as the word pairs is most likely due to a greater degree of sluggishness in stream segregation in older compared to younger adults. Ben-David et al. (2012) have shown that near simultaneous onset of the babble background and the word to be recognized is more deleterious to word recognition in older than in younger adults. Recall that word recognition in the two age groups is equated for individual words presented in a continuous background babble. Hence equating word recognition in a continuous babble may not produce equal word recognition when there is near simultaneous onset of the masker and the word pairs. Poorer word recognition when the onset of the babble is simultaneous with word pair onset would be expected to produce poorer memory for all serial positions in older adults. For further discussion of the effects of age on memory for paired associates please see Heinrich and Schneider (2011).

# The Effects of Vocabulary and Reading Comprehension on Memory

Vocabulary size, but not reading comprehension or years of schooling, contributed to individual differences in episodic memory for unrelated word pairs in EL2 listeners. Moreover, this relationship between vocabulary and memory was qualitatively the same for young EL2 listeners as for EL1 listeners of both ages. This suggests that in this particular memory task, all three groups of listeners rely on vocabulary knowledge to the same extent once perceptual differences have been equated. This result is in contrast to more conversational listening situations, as will be discussed below.

# The Role of Memory in the Comprehension of Spoken Language

The present results indicate that age-related declines in episodic memory persist even when steps are taken to equate all listeners with respect to their ability to recognize words in the absence of supportive context. Moreover, the failure to find episodic memory deficits in young EL2 listeners indicates that once young listeners are equated for word recognition, their degree of linguistic competence does not appear to have a major impact on their performance in this paired-associate memory task. Because the syntactic and semantic systems are relatively well-preserved in older EL1 listeners, it is unlikely that age-related changes in linguistic abilities are the source of age-related memory declines. We have suggested that these age-related deficits are related to age-related changes in perception (e.g., sluggish stream segregation), and to age-related changes in the availability or deployment of the attentional resources that are used to support episodic memory.

That age-related losses in memory persist even when the acoustic scene is adjusted for differences in word recognition in noise poses a problem for studies investigating the ability of younger and older adults to comprehend connected discourse of the kind that occurs when listening to lectures or to multitalker conversations. Digesting the content of a lecture or following a multi-talker conversation when noise is present in the background is a complex and difficult task for any listener. For instance, in a multi-talker conversation the listener has to perceptually segregate the target speech from the background, extract the meaning of each utterance, switch attention from one talker to another, keep track of what was said by whom, store this information in memory for future use, integrate incoming information with what each conversational participant has said or done in the past, and draw on the listener's own knowledge of the conversation's topic to extract general themes and ideas (Murphy et al., 2006; Schneider et al., 2010). Higher word recognition thresholds in young EL2 and older EL1 listeners would place them at an immediate disadvantage relative to young EL1 listeners, and, indeed, their ability to answer questions about what they just heard is compromised in such a condition (e.g., Schneider et al., 2000). This raises the question of what we might expect to find in a lecture-type experiment in which listeners are required to answer questions when we equate individuals in all three groups with respect to their ability to recognize individual words in the absence of context using the same procedure that we followed in the paired-associate memory experiments described above.

Clearly, answering questions about a lecture or conversation that you have just heard has a significant memory component. The paired-associate memory experiments described above indicate that memory in younger adults appears to be independent of the language competency of the individuals as long as SNRs are adjusted to produce equivalent word recognition in all individuals. Hence one might expect comprehension differences between young EL1 and young EL2 listeners to be minimal once the listening situation is adjusted to produce equivalent word recognition. Indeed, when young EL2 and EL1 adults are asked to answer questions after listening to two- and three-person conversations, the two groups do not differ with respect to the number of questions they can answer correctly (Avivi-Reich et al., 2014, 2015) when they are equated for word recognition. But we have seen that age-related memory deficits remain after adjustments have been made to word recognition in older adults. Hence we might expect that their ability to answer questions concerning the heard material would be compromised by their poorer episodic memory even after adjusting for word recognition. The results of such experiments, however, indicate that once younger and older adults have been equated for word recognition, they can answer approximately the same number of questions correctly (Schneider et al., 2000; Murphy et al., 2006; Avivi-Reich et al., 2014, 2015). Such results indicate that older adults are able to compensate in some fashion for their poorer memory when asked to comprehend connected discourse of various kinds as long as they can hear the individual words as well as younger adults. The question then becomes how they are able to maintain good comprehension in the face of memory deficits.

There appear to be two possible explanations of how such compensation might be accomplished. The first is that there is evidence that older adults, including those with hearing loss, make better use of context when it is available. It is important to keep in mind that most episodic memory tasks are conducted with word list type material, which consists of single unrelated words. Discourse, on the other hand, contains ample context that could help in encoding and recalling information. The advantageous effect of context for older adults' memory is well known within the cognitive literature (Koutstaal and Schacter, 1997). Moreover, context not only plays an important role in memory encoding in older listeners, but also for perception. It has been previously found that older adults benefit more than younger adults from context when asked to repeat a sentence they just heard or read (Pichora-Fuller et al., 1995; Speranza et al., 2000). The SNR adjustment procedure used in the experiments where listeners were asked questions about lectures or conversations (Schneider et al., 2000; Murphy et al., 2006; Avivi-Reich et al., 2014, 2015) equated individuals with respect to their ability to recognize words in the absence of contextual support. If, after such an adjustment, older adults can make better use of context to support word recognition than can younger adults, we would expect them to actually have better word recognition than younger adults when listening to lectures or conversations. Hence the presence of context in such listening situations may compensate for older adults' poorer episodic memory for unrelated words.

Older adults are also likely to have acquired a broader world knowledge than have younger adults, which may help them to compensate for memory difficulties in conversations. World knowledge is often referred to as crystalized intelligence. Crystalized intelligence is accumulated through education and life experience, and does not appear to decline, and may even improve with age (McArdle et al., 2002). The greater one's knowledge of a culture's language history is, the more likely one is to be able to comprehend and remember discourse related to that specific culture. If older adults' crystalized intelligence is more fully developed than that of younger adults, it will be easier for them to comprehend and remember lectures and/or conversations that are embedded in that culture<sup>2</sup>. Hence, a more comprehensive knowledge in older adults of the culture from which the materials were drawn could also compensate for their age-related deficits with respect to episodic memory.

A person's vocabulary knowledge is often used as a measure of one's crystalized intelligence. It has been shown consistently that older adults' knowledge of English vocabulary exceeds that of younger adults (Ben-David et al., 2015)<sup>3</sup>. Since vocabulary knowledge is often taken as a measure of crystalized intelligence, the higher vocabulary scores of older adults give credence to the notion that their crystalized intelligence exceeds that of younger adults. Moreover, when listening to lectures and stories becomes difficult, individual differences in vocabulary scores are more predictive of individual differences in comprehension in older EL1 listeners than they are in younger EL1 or EL2 listeners (Schneider et al., 2016). It may be that under difficult listening situations older adults rely more on crystalized intelligence than do younger adults. Hence, the available evidence suggests that younger and older adults rely on different sets of abilities to achieve comparable levels of comprehension when all individuals have been equated for word recognition in the absence of context, and that older adults' generally greater degree of world knowledge, and the greater benefit they gain from context, may compensate for their poorer episodic memory for unrelated words.

# AUTHOR CONTRIBUTIONS

BS, MA, and AH designed the studies and participated in the analysis of the data. CL and MA conducted the EL2 study. BS contributed most to the writing of the manuscript with input from AH and MA. BS and AH contributed most to the revision of the manuscript and all authors approved its final version.

### ACKNOWLEDGMENTS

This work was supported by Canadian Institutes of Health Research grants (MOP-15359, TEA-1249), the Natural Sciences and Engineering Research Council of Canada (RGPIN 9952-13), and the BBSRC (BB/K021508/1). We would like to thank Jane Carey for her assistance in conducting these experiments.


<sup>2</sup>The lectures and conversations used in Schneider et al. (2000), Murphy et al. (2006), and Avivi-Reich et al. (2014, 2015) were all drawn from within a North American English-speaking context.

<sup>3</sup>A recent study (Hartshorne and Germine, 2015) suggests that vocabulary knowledge reaches its peak somewhere between 55 and 70 years of age and declines thereafter (see their **Figure 3**). However, the average vocabulary knowledge of older adults between the ages of 65 and 80 still exceeds that of 20-year-olds.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Schneider, Avivi-Reich, Leung and Heinrich. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Bilingual Disadvantage in Speech Understanding in Noise Is Likely a Frequency Effect Related to Reduced Language Exposure

Jens Schmidtke\*

Michigan State University, East Lansing, MI, USA

The present study sought to explain why bilingual speakers are disadvantaged relative to monolingual speakers when it comes to speech understanding in noise. Exemplar models of the mental lexicon hold that each encounter with a word leaves a memory trace in long-term memory. Words that we encounter frequently will be associated with richer phonetic representations in memory and therefore recognized faster and more accurately than less frequently encountered words. Because bilinguals are exposed to each of their languages less often than monolinguals by virtue of speaking two languages, they encounter all words less frequently and may therefore have poorer phonetic representations of all words compared to monolinguals. In the present study, vocabulary size was taken as an estimate for language exposure and the prediction was made that both vocabulary size and word frequency would be associated with recognition accuracy for words presented in noise. Forty-eight early Spanish–English bilingual and 53 monolingual English young adults were tested on speech understanding in noise (SUN) ability, English oral verbal ability, verbal working memory (WM), and auditory attention. Results showed that, as a group, monolinguals recognized significantly more words than bilinguals. However, this effect was attenuated by language proficiency; higher proficiency was associated with higher accuracy on the SUN test in both groups. This suggests that greater language exposure is associated with better SUN. Word frequency modulated recognition accuracy and the difference between groups was largest for low frequency words, suggesting that the bilinguals' insufficient exposure to these words hampered recognition. The effect of WM was not significant, likely because of its large shared variance with language proficiency. The effect of auditory attention was small but significant. 
These results are discussed within the Ease of Language Understanding model (Rönnberg et al., 2013), which provides a framework for explaining individual differences in SUN.

Keywords: speech understanding in noise, bilingual, working memory, frequency effect, spoken word recognition

# INTRODUCTION

Spoken language comprehension is a complex process that entails encoding an acoustic signal, matching it to the right phonological representation stored in long-term memory (LTM) out of thousands of such representations, and finally retrieving the semantic information associated with the phonological information and integrating it with the preceding information. Yet understanding spoken language under optimal listening conditions is usually a seemingly effortless process. Only when it comes to listening to speech under suboptimal conditions do we become conscious of this process and individual differences in people's ability to understand speech become obvious. This is especially true in a second language, as many second language speakers can attest and as many studies have shown (for a review see Garcia Lecumberri et al., 2010). What is surprising is that even speakers who learned their second language early in life and became dominant in that language still show poorer performance on speech understanding in noise (SUN) tests (Mayo et al., 1997; von Hapsburg et al., 2004; Rogers et al., 2006; Shi, 2010). To explain these findings, the present study tested the hypothesis that bilinguals are disadvantaged in SUN because of their reduced exposure to each of their languages relative to monolinguals. Its contribution to the current discussion on bilingual SUN is a larger sample size of early bilingual speakers compared to previous studies and the presentation of a framework to explain bilingual disadvantages in auditory language comprehension.

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Sari Ylinen, University of Helsinki, Finland; Sandra Campeanu, Lehman College – City University of New York, USA

> \*Correspondence: Jens Schmidtke schmi474@msu.edu

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

> Received: 29 January 2016 Accepted: 22 April 2016 Published: 13 May 2016

### Citation:

Schmidtke J (2016) The Bilingual Disadvantage in Speech Understanding in Noise Is Likely a Frequency Effect Related to Reduced Language Exposure. Front. Psychol. 7:678. doi: 10.3389/fpsyg.2016.00678

The ease of language understanding (ELU) model (Rönnberg et al., 2013) provides a framework for explaining the effects of a suboptimal speech signal on listening effort. The model assumes that during listening sublexical information at the level of the syllable is buffered in a temporary storage system called RAMBPHO (rapid, automatic, multi-modally bound phonological representations). These syllabic units are then compared to phonological representations in LTM. The model assumes that phonological representations consist of multiple attributes and for successful lexical access the speech signal has to activate a minimum number of attributes. If the threshold for lexical retrieval is not reached, similar sounding words may be retrieved instead. However, contextual information may often be sufficient for a lexical item to be retrieved even when the bottom-up information from the speech signal is insufficient. In such cases where information in RAMBPHO cannot be matched to a LTM representation, explicit processing that involves working memory (WM) is needed to resolve the mismatch, causing a delay in lexical access. Otherwise lexical access occurs automatically. Mismatches between the speech signal and LTM representations can occur for speaker-external (e.g., distorted speech or an unfamiliar accent) or speaker-internal reasons (imprecise phonological representations; Rönnberg et al., 2013, p. 3).

The degree of similarity between the acoustic signal and an internal phonological representation determines the amount of processing that is needed for lexical access to be successful. When the match is optimal, processing is automatic and effortless. The greater the mismatch, the greater the need for explicit processing of the signal. This explicit processing loop is dependent on WM resources. Thus, according to the model, individual differences in SUN can be attributed to two sources: individual differences in WM capacity and individual differences in the quality of speaker-internal phonological representations of words in LTM.

How can we explain differences in the quality of phonological representations? Exemplar models of the mental lexicon (Klatt, 1979; Goldinger, 1996, 1998; Pierrehumbert, 2000, 2003; Hawkins, 2003; Johnson, 2005) may be especially useful here. In contrast to models that assume that words in the mental lexicon are stored in an abstract form without any indexical information (e.g., speaker voice characteristics such as gender, age, etc.), exemplar models assume that each encounter with a word token leaves a separate episodic trace in memory. Thus pronunciation variants and reduced forms, for example, are also assumed to be stored (e.g., Pierrehumbert, 2001; Pitt et al., 2011). Phonetic categories are not understood as discrete symbols but as distributions in a multidimensional space that develop through experience. With increased experience, listeners develop selective attention (cf. Nosofsky, 1986) to those acoustic-phonetic dimensions that are relevant in a given language. In these models, the effect of word frequency arises from the assumption that words that are encountered often are represented with more exemplars on a "cognitive map" than infrequent words (Pierrehumbert, 2003). During retrieval, all exemplars with a certain degree of similarity to an acoustic signal receive activation. Thus frequently reoccurring units of speech (e.g., words) receive more activation since they are associated with more exemplars. This gives high frequency words an advantage over low frequency words in terms of speed of lexical access. Furthermore, the selection of high frequency words will be more robust when information in the acoustic signal is missing or when there is noise in the signal. However, it is not just the mere frequency with which words are encountered that determines the robustness of a representation. For example, research shows that variability in the signal as it occurs through different speakers helps infants extract the distribution of phonetic categories from the signal so that minimal pairs (e.g., buk and puk) sound less similar, presumably because variability directs infants' attention to the relevant dimension that distinguishes the minimal pairs (in this case voice onset time; Rost and McMurray, 2009). Exemplar models can also be extended to explain second language (L2) speech perception (Hardison, 2003, 2012). Because the acoustic-phonetic space is arbitrarily divided into phonetic categories that differ from language to language, listeners need to create new categories when learning a L2. Proponents of an exemplar-based mental lexicon assume that phonetic differences between a first language (L1) and a L2 can be perceived; however, at first old category labels will continue to be activated by L2 input. Again, acoustic variability in the signal may help L2 learners create new phonetic categories by directing their attention to those dimensions that may be irrelevant in the L1 but vary systematically in the L2. For example, Japanese listeners need to learn to attend to the third formant (the third resonance peak of the vocal tract) to differentiate between American English /r/ and /l/ because this dimension is not relevant in their first language (Lotto et al., 2004). Perceptual training studies of the /r/-/l/ distinction with native Japanese speakers showed superior identification ability between the two phonetic categories when training stimuli were spoken by multiple speakers compared to a condition with a single speaker (Lively et al., 1993; Hardison, 2003, 2005). Also relevant for the present discussion is a finding from a study on second language vocabulary learning. Native English speakers who learned new Spanish words spoken by six different speakers showed better retention and faster retrieval of those words compared to those who heard the novel words spoken by one speaker only (Barcroft and Sommers, 2005; also see Sommers and Barcroft, 2011). These findings suggest that the token frequency of words in the input determines the quality of mental representations of words. Multiple exemplars associated with one word will make the retrieval of that word more efficient and robust.

Within the account described above, the assumption is made that the quality of phonological representations differs within and between speakers. Within speakers they differ because high frequency words are represented with more phonetic detail than low frequency words, and between speakers because some speakers have more language experience (i.e., more exposure) than others. These assumptions are similar to the lexical quality hypothesis developed by Perfetti and Hart (2002) and Perfetti (2007) to explain individual differences in reading comprehension. A further assumption made here is that bilingual speakers have less experience with each of their languages than monolinguals have with their one language, because bilinguals speak and hear each language less often than someone who speaks only one language. This assumption is expressed in the weaker-links hypothesis developed by Gollan et al. (2002, 2005, 2008) to explain differences in lexical access between monolinguals and bilinguals (see Ivanova and Costa, 2008; Diependaele et al., 2013; Cop et al., 2015). As a result of reduced language experience, all words in a bilingual's mental lexicon will be of lower experienced frequency compared to a monolingual speaker. Frequency effects in general are pervasive in language processing (Ellis, 2002). Word frequency in particular affects lexical retrieval times (e.g., Oldfield and Wingfield, 1965; Murray and Forster, 2004) and recognition accuracy for words presented in noise (Howes, 1957). Frequency effects are logarithmic in nature, which means that changes in frequency at the low end affect lexical retrieval times and recognition accuracy more than changes at the high end (Murray and Forster, 2004). As a consequence, reduced language exposure will especially affect low frequency words. In one study, Kuperman and Van Dyke (2013) asked subjects with more and less education to rate words for their subjective, or experienced, frequency. 
When comparing the two groups, subjective ratings for words that are highly frequent in the language (based on a corpus count) were very similar but the lower the objective frequency, the more subjective frequency ratings of both groups diverged. This suggests that frequency estimates that are based on large corpora such as SUBTLEX (Brysbaert and New, 2009) may overestimate the frequency with which certain words are encountered for individuals with less language experience such as bilinguals. Thus the idea behind the weaker-links hypothesis and similar theories is that slower verbal processing in bilinguals is a frequency effect. Bilinguals encounter all words less frequently compared to monolinguals and so they process all words more slowly. Less efficient spoken word recognition has been shown for late and also early bilinguals (e.g., Weber and Cutler, 2004; Schmidtke, 2014).
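The logarithmic nature of frequency effects described above can be made concrete with a small numerical illustration. The sketch below is not part of the study; the function name and the example frequencies (1 and 100 occurrences per million, each increased by 10) are hypothetical values chosen only to show that the same absolute change in frequency produces a much larger change on a log scale for a rare word than for a common one.

```python
import math

def log_frequency_gain(freq_per_million: float, extra: float) -> float:
    """Change in log10 frequency when `extra` occurrences per million are added."""
    return math.log10(freq_per_million + extra) - math.log10(freq_per_million)

# Hypothetical example values, not taken from the study.
low = log_frequency_gain(1.0, 10.0)     # rare word: 1 -> 11 per million
high = log_frequency_gain(100.0, 10.0)  # common word: 100 -> 110 per million

print(f"low-frequency word:  {low:.3f} log10 units")
print(f"high-frequency word: {high:.3f} log10 units")
```

If recognition accuracy tracks log frequency, as the cited work suggests, the rare word in this toy example would be expected to benefit far more from the additional encounters than the common word.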

While occurrence counts in large corpora of language can give us an idea of the relative quality of representations of words in memory (the less frequent a word the less precise its representation), it is more difficult to estimate the overall language experience of individuals. Different means of data collection are possible such as asking participants to keep a diary of daily interactions for a week or similar techniques. However, these measures are based on self-report and do not capture language experience over longer periods of time. In this paper, the assumption is made that vocabulary knowledge, and more precisely productive vocabulary knowledge, closely resembles language experience and thus the quality of phonological representations. Individuals who are able to recall infrequent words must have been exposed to these words more often than someone who is not able to recall low frequency words. Someone with a weaker phonological representation of a word may be able to recall the first sound or a similar sounding word but lexical retrieval may not be successful. This phenomenon is usually referred to as a tip-of-the-tongue state in the literature (Brown and McNeill, 1966). A second reason for not knowing a word is that the participant may have never encountered the word before. This would also suggest reduced language experience because the more someone is exposed to language, the more likely they are to encounter an infrequent word (Kuperman and Van Dyke, 2013). The prediction is then that individuals with a higher score on a vocabulary test will be overall more accurate on a word-recognition-in-noise test, and the difference compared to someone with a lower vocabulary score will be most pronounced for low frequency words. 
The frequency effect might thus explain why SUN in a L2 is usually more difficult compared to one's first language and why this effect is modulated by experience in the L2 (e.g., Mayo et al., 1997; Rogers et al., 2006; Shi, 2009, 2010; Shi and Sánchez, 2010). At the same time, the frequency effect could also explain individual differences between monolingual speakers that have been shown to exist in normal hearing subjects (see Tamati et al., 2013).

As mentioned before, in the ELU model there are two sources for speaker-internal individual differences in SUN. One source is the quality of internal phonological representations, as described above. The other source is differences in WM capacity. When mismatches between the acoustic signal and phonological representations occur, speech processing relies more on explicit processes, which presumably are more susceptible to individual differences in processing resources than implicit processes. Examples of such explicit processes include "inference making, semantic integration, switching of attention, storing of information, and inhibiting irrelevant information" (Rönnberg et al., 2013, p. 3). Individuals with greater WM capacity have more resources available for such processes and are thus better able to make up for missing information from the speech signal. In support of this hypothesis, studies have established a link between the quality of sensory information and maintenance of such information in WM. For instance, hearing verbal stimuli under suboptimal listening conditions leads to reduced recall accuracy of such stimuli even when intelligibility is not impaired (Rabbitt, 1966; Pichora-Fuller et al., 1995; Amichetti et al., 2013). Other studies have used brain imaging and found that alpha power, an indication of WM load, increased as a function of speech intelligibility (Obleser et al., 2012) and degree of hearing loss of listeners (Petersen et al., 2015). Importantly, these studies established increased power in alpha oscillations during the retention phase of a memory test, suggesting that retaining degraded speech in WM is more effortful than clear speech, even when overall intelligibility is high.

Several studies have established a correlation between tests of verbal WM, typically assessed through the reading-span test (see Daneman and Carpenter, 1980), and performance on SUN tests. The problem with such studies is that no direct causation can be established as performance on both tests may be influenced by a third variable. Specifically, it has been shown that short-term memory (STM) for words is not independent of LTM representations of those words; both word frequency and phonotactic probability influence serial recall of words (Hulme et al., 1991; Hulme et al., 1997; Gathercole et al., 1999). At the same time, SUN is dependent on these factors as described above. Thus the quality of phonological representations in LTM, which is dependent on language experience, may influence both verbal WM and SUN. Therefore, studies that assess the correlation between verbal WM and SUN need to control for language experience to ensure that the correlation is not confounded by this third factor. Two recent studies found that verbal WM was no longer a significant predictor of SUN in a second language when proficiency in that language was controlled for (Kilman et al., 2014; Sörqvist et al., 2014).

In addition to WM, other executive functions may be recruited during SUN. When individuals follow a conversation in background noise, they have to selectively attend to one speaker and ignore other sounds or speakers (e.g., Mesgarani and Chang, 2012; Wild et al., 2012). In addition, during word recognition, words that are semantically and acoustically related to the target words also become active and inhibiting these competitors may require executive functions (Sommers and Danielson, 1999; Lash et al., 2013). Two recent studies assessed the relationship between individual differences in attention and SUN. Anderson et al. (2013) used structural equation modeling and found that a latent variable consisting of auditory attention, auditory STM, and auditory WM explained a large amount of variance in SUN. However, the contribution of auditory attention was small compared to the memory measures, which suggests that in this specific study the role of auditory attention was limited. The second study comes from Tamati et al. (2013), who found that individuals who scored high and those who scored low on a SUN test did not differ in their performance on a color-Stroop test. This last finding might suggest that auditory attention is more important in SUN than the more general attention that is measured by the Stroop test.

The purpose of the present study was to identify individual differences that predict SUN. Based on the ELU model, it was hypothesized that language experience (measured through vocabulary knowledge), verbal WM, and auditory attention would predict SUN. It was further hypothesized that differences between monolingual and bilingual participants would mostly be attributable to differences in language experience. To test this hypothesis, the word frequency of the to-be-recognized words was manipulated.

# MATERIALS AND METHODS

# Participants

The study included 53 monolingual and 48 bilingual participants. The inclusion criterion for monolinguals was that they had not learned a second language before the age of 10. Some monolinguals had learned a second language in foreign language classes in school but they were not fluent in their second language and had not spent more than a short vacation in a non-English speaking country. Bilinguals had to have learned Spanish from birth and English before the age of 8<sup>1</sup>. In addition, participants had to be between 18 and 35 years old. Six additional monolinguals and five additional bilinguals were tested but were not included in the final sample because they did not meet the definition of monolingual (5) or early bilingual (4), or were too old (1) or too young (1) to be included in the study. Detailed participant information can be seen in **Table 1**. The study was approved by the local Institutional Review Board and all subjects gave their written informed consent to participate.

# Experimental Design

### Background Questionnaire

Participants' background information was collected with a questionnaire created for this study, administered by the experimenter. The instrument was loosely based on Marian et al. (2007) but included additional information about parental education and use of English and bilingual participants' use of English and Spanish during their childhood and adolescence. It took about 6–10 min to administer.

### Speech Understanding in Noise

Materials for the SUN test were taken from the revised Speech Perception in Noise test (SPIN; Bilger et al., 1984), which was obtained as a digitized recording. The test consists of 200 target words and each word is recorded in a predictive and unpredictive context. For example, the word coast could be preceded by Ms. Brown might consider the coast (low predictability) or by The boat sailed along the coast (high predictability). The original SPIN recordings were obtained on CD from the Department of Speech and Hearing Science from the University of Illinois Urbana-Champaign. The sound file was edited so that each sentence was saved in a separate file. For the background babble, a short sequence from the original babble track (12-talker babble) was chosen and mixed with each sentence in Praat (Boersma and Weenink, 2014) at two different speech-to-noise ratios (SNRs; −2 dB and 3 dB). These SNRs were chosen based on a pilot experiment. The sound intensity of the sentence was held constant and so the intensity of the babble differed for the two SNRs.
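The mixing step described above, in which the sentence level is held constant and only the babble level changes with the SNR, can be sketched as follows. This is a minimal illustration and not the Praat procedure used in the study: it assumes that SNR is defined on RMS amplitude, the function names are hypothetical, and the two sine waves merely stand in for a sentence recording and a babble excerpt.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def babble_gain(speech, babble, snr_db):
    """Gain for the babble so that 20*log10(rms(speech) / rms(gain*babble)) equals snr_db."""
    return rms(speech) / (rms(babble) * 10 ** (snr_db / 20))

def mix_at_snr(speech, babble, snr_db):
    """Add scaled babble to unscaled speech; the babble must be at least as long as the speech."""
    gain = babble_gain(speech, babble, snr_db)
    return [s + gain * b for s, b in zip(speech, babble)]

# Toy signals standing in for a sentence and a babble excerpt (hypothetical data).
speech = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
babble = [0.3 * math.sin(2 * math.pi * 37 * t / 1000) for t in range(1000)]

for snr in (3, -2):
    g = babble_gain(speech, babble, snr)
    print(f"SNR {snr:+d} dB -> babble gain {g:.3f}")
```

Because the speech samples are never rescaled, the lower SNR is reached purely by raising the babble gain, which mirrors the design choice reported in the text.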

<sup>1</sup>Four bilinguals reported to have learned English later than 8 but they were included in the study because they were born in the US and attended school in the US from kindergarten. They reported that they attended a Spanish–English bilingual program but that little English was taught. However, they likely had some exposure to English. Thirty-seven (77%) bilinguals were born in the US. Of the remaining bilinguals, all but five arrived in the US before the age of 6. Four of those immigrated at the age of 7 and one at the age of 13. The latter participant was included because her mother was a native speaker of English and she had learned both English and Spanish from birth and attended a bilingual school.

In the present study, 128 sentences from the test were chosen and divided into four lists of 32 words<sup>2</sup>. Words in each list were matched on word frequency, phonotactic probability, and neighborhood density. Information about lexical variables was taken from different sources. Information about lexical frequency was taken from Brysbaert and New (2009). These norms are based on a large corpus created from subtitles of American movies and TV shows. The mean log<sub>10</sub> word frequency of the stimuli used in the present study was 2.70 (SD = 0.44) and the mean frequency per million was 15.92 (SD = 16.46). Information about phonotactic probability came from Vitevitch and Luce (2004). This database provides the summed probabilities of each phoneme in a word and the summed probability of each biphone.

<sup>2</sup>The word fun was later dropped from all analyses because its frequency per million of 235 was several SDs away from the mean of 15.9. Thus there were 127 unique items.



Notes to **Table 1**: Years of musical experience: participants were asked if they had ever played a musical instrument or sung in a choir or band and for how many years. See text for an explanation of the different language measures. W-scores are based on an arbitrary equal-interval scale. Standard scores have a population mean of 100 and a standard deviation of 15. Values in parentheses are standard deviations.

The number of neighbors of a word was calculated based on the English Lexicon Project (Balota et al., 2007). The correlation between biphone probability and log-frequency was r = 0.16 and the correlation between log-frequency and neighborhood density was r = 0.16.

Each participant heard the first half at 3 dB SNR and the second half at −2 dB. Within each SNR, half of all words were played in a predictive context and the other half in an unpredictive context in a randomized order. Across all participants, each word was administered in all four conditions in a Latin-square design. After each sentence, the participant was prompted to type the last word of the sentence. The next trial started when the participant pressed ENTER. Before the actual experiment, 10 sentences were administered at an SNR of 8 dB to ensure that participants had understood the task. Participants were also told to check the word they typed on the screen for any spelling errors before going to the next trial. This test was administered in E-Prime 2.0 (Psychology Software Tools, Sharpsburg, PA, USA).
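The Latin-square counterbalancing described above can be illustrated with a short sketch. The list labels (A to D) and the specific rotation rule are assumptions made for illustration; the text only states that four word lists were rotated through the four SNR-by-context conditions across participants.

```python
# Four conditions: two SNRs crossed with two sentence contexts.
CONDITIONS = [
    ("3 dB", "predictive"),
    ("3 dB", "unpredictive"),
    ("-2 dB", "predictive"),
    ("-2 dB", "unpredictive"),
]
LISTS = ["A", "B", "C", "D"]  # hypothetical labels for the four 32-word lists

def assignment(participant_index: int) -> dict:
    """Map each word list to a condition, rotating by participant (cyclic Latin square)."""
    shift = participant_index % len(LISTS)
    return {
        LISTS[i]: CONDITIONS[(i + shift) % len(CONDITIONS)]
        for i in range(len(LISTS))
    }

for p in range(4):
    print(p, assignment(p))
```

With this rotation, every participant receives all four conditions exactly once, and every list appears in every condition once per block of four participants.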

### Working Memory

The WM test used for this study comes from the US National Institutes of Health (NIH) Toolbox<sup>3</sup>. The NIH Toolbox is a collection of different tests in the areas of cognition, emotion, motor function, and sensation. All tests are available freely and are administered online. In the WM test, participants see pictures and their labels and hear their names. The set size ranges from two to seven pictures. Pictures are either animals or food items. After each set of pictures, participants are asked to repeat what they just saw in size order from smallest to biggest. For example, if they saw a bear, a duck, and an elephant, they would say duck, bear, and elephant. To establish the size order, participants have to pay attention to the size of the object on the screen, but in most cases the relative proportions on the screen correspond to real life. The test has two parts. In the first part, sets consist only of animals or only of food items. In the second part, sets consist of animals and food and participants are asked to repeat the food first from smallest to biggest and then the animals from smallest to biggest. Both parts start with two practice sets to ensure that participants understand the directions. If they make a mistake in either practice set, the instructions are repeated and the set is administered again. After the practice items, the test starts with a set size of two. If a participant correctly repeats all pictures, the set size of the next trial increases by one. If the participant makes an error, another set of the same size but different items is administered. Testing stops when a participant cannot correctly repeat two sets in a row or when the last set is administered. Responses were recorded on a paper sheet and a score for each participant was calculated by counting the total number of items of all correctly repeated sets. Thus the maximum score for each part is 27 (2+3+4+5+6+7) and the total possible score is 54. This test was only administered in English.
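The administration and scoring rules described above can be sketched in code. This is one reading of the procedure, not the NIH Toolbox implementation: it assumes that a participant who fails the first set of a given size but passes the second moves on to the next size, and the names `administer_part`, `score_part`, and `can_repeat` are hypothetical.

```python
def administer_part(can_repeat, min_size=2, max_size=7):
    """Run one part of the test.

    can_repeat(size, attempt) -> bool says whether the set was repeated correctly.
    Returns the administered trials as (set_size, correct) pairs.
    """
    trials = []
    size = min_size
    while size <= max_size:
        first = can_repeat(size, 1)
        trials.append((size, first))
        if not first:
            # A second set of the same size (different items) is given after an error.
            second = can_repeat(size, 2)
            trials.append((size, second))
            if not second:
                break  # two failed sets in a row end the part
        size += 1
    return trials

def score_part(trials):
    """Sum the set sizes of all correctly repeated sets."""
    return sum(size for size, correct in trials if correct)

# A participant who repeats every set correctly reaches the maximum for one part.
perfect = administer_part(lambda size, attempt: True)
print(score_part(perfect))
```

Under these assumptions a perfect run yields 2+3+4+5+6+7 = 27 points per part, which matches the maximum stated in the text.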

Recently, the reliability of the test was established (Tulsky et al., 2014). The test–retest intraclass correlation coefficient was 0.77. The test also correlated with other established WM tests (r = 0.57) and tests of executive function (r = 0.43–0.58). The correlation with a test of receptive vocabulary, on the other hand, was low (r = 0.24).

<sup>3</sup>www.nihtoolbox.org

### Verbal Ability

Verbal ability was assessed with the Woodcock-Muñoz Language Survey – Revised (WMLS-R; Woodcock et al., 2005), which is a norm-referenced, standardized test of English and Spanish. Both versions were normed on a large sample of speakers in the US and, in the case of the Spanish version, Latin America. The raw score on the test can be transformed into a standard score with a population mean of 100 and a standard deviation of 15 through software that is provided with the test (Schrank and Woodcock, 2005). In addition, scores can be expressed as W-scores, which are based on an equal-interval scale and are therefore suitable for statistical analyses and group comparisons. Unlike standard scores, W-scores are not corrected for participant age at testing.

The WMLS-R consists of seven tests, two of which were administered in the present study. The first is the Picture Vocabulary test. Participants are shown pictures in sets of six and are asked to name them one by one as the experimenter asks "What is this?" while pointing at a picture. The second is the Verbal Analogies test. Participants are asked to solve "riddles" such as In is to out as down is to . . .? Scores from both tests can be combined with the provided software into a single score, which the test developers call Oral Language Ability (henceforth verbal ability). This score correlates highly with the cluster score that is based on all tests of the WMLS-R (r = 0.9). The standard error of the mean for all tests is between 5.55 and 5.93 and the internal consistency reliability coefficients were around r<sub>11</sub> = 0.9 (Alvarado and Woodcock, 2005).

### Auditory Attention

The auditory attention test was adapted from Zhang et al. (2012). In this test, participants have to decide whether two tones were played to the same ear or to different ears. What makes this test challenging is that the frequency of the two tones is sometimes the same and sometimes different. Because participants are only supposed to respond based on the location of the tones, response conflict arises on trials in which the location is different but the frequency the same, or the location the same and the frequency different. The manipulation of frequency and location results in four conditions: same-frequency same-location (SFSL), same-frequency different-location (SFDL), different-frequency same-location (DFSL), and different-frequency different-location (DFDL). The original test also has a second part where frequency is the task-relevant dimension and location is the irrelevant dimension that has to be ignored. However, only the first part was used in the present study to reduce the time needed to administer the test.

Three different measures can be derived from the test: baseline RT, involuntary orientation, and conflict resolution. Baseline RT is the mean RT in the SFSL condition. In Zhang et al. (2012), baseline RT correlated with the RTs in a separate test that did not involve response conflict and therefore the authors suggested that this measure reflects information processing speed. Involuntary orientation can be calculated by subtracting RTs on trials with the same frequency from those on trials with a different frequency [(DFDL+DFSL)–(SFSL+SFDL)]. Conflict resolution can be calculated by subtracting the mean RTs on trials where location and frequency were both different or both the same (no response conflict) from those on trials where only one of the two differed (response conflict) [(SFDL+DFSL)–(SFSL+DFDL)]. Preliminary correlational analyses (see Supplementary Materials) of each of these three measures with overall accuracy on the SUN test showed that only baseline RT (processing speed) correlated significantly with SUN accuracy and so only this variable was used in the analyses reported below.
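The three measures described above can be sketched in a few lines of Python. This is only an illustration, not the scoring script used in the study: the function name, the dictionary layout, and the example RTs are hypothetical, and the sign of conflict resolution follows the verbal description (response-conflict trials minus no-conflict trials).

```python
from statistics import mean


def attention_measures(rts_by_condition):
    """Derive the three attention measures from per-condition RTs (ms).

    rts_by_condition is a hypothetical layout: a dict mapping each of the
    four condition labels (SFSL, SFDL, DFSL, DFDL) to a list of RTs.
    """
    m = {cond: mean(rts) for cond, rts in rts_by_condition.items()}
    return {
        # Baseline RT: mean RT when frequency and location are both the same.
        "baseline_rt": m["SFSL"],
        # Different-frequency trials minus same-frequency trials.
        "involuntary_orientation": (m["DFDL"] + m["DFSL"]) - (m["SFSL"] + m["SFDL"]),
        # Response-conflict trials minus no-conflict trials.
        "conflict_resolution": (m["SFDL"] + m["DFSL"]) - (m["SFSL"] + m["DFDL"]),
    }


# Made-up RTs for one participant, for illustration only.
example = {
    "SFSL": [400, 420],
    "SFDL": [450, 470],
    "DFSL": [460, 480],
    "DFDL": [430, 450],
}
print(attention_measures(example))
```

With these made-up values, positive scores on the two difference measures indicate slower responses on different-frequency and on response-conflict trials, respectively.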

The tones for this test were created in Praat (Boersma and Weenink, 2014) as pure tones with a length of 100 ms. The frequency ranged between 500 and 1400 Hz in 100 Hz intervals, which resulted in ten different sound files. For different-frequency trials, the second tone was randomly chosen. There were a total of 96 experimental trials, 24 trials in each condition. The experiment was programmed in E-Prime.
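The trial structure just described (ten tone frequencies, four conditions, 24 trials each) can be sketched as follows. The study itself was run in E-Prime; this Python version is a stand-in for illustration. The random choice of the first tone and of the starting ear, the shuffled trial order, and all names are assumptions that are not stated in the text.

```python
import random

FREQUENCIES_HZ = list(range(500, 1401, 100))  # 500, 600, ..., 1400 Hz (ten tones)
CONDITIONS = ["SFSL", "SFDL", "DFSL", "DFDL"]
TRIALS_PER_CONDITION = 24


def build_trials(seed=0):
    """Build a 96-trial list with 24 trials per condition (illustrative only)."""
    rng = random.Random(seed)
    trials = []
    for condition in CONDITIONS:
        same_frequency = condition.startswith("SF")
        same_location = condition.endswith("SL")
        for _ in range(TRIALS_PER_CONDITION):
            # Assumption: the first tone and the first ear are drawn at random.
            first_hz = rng.choice(FREQUENCIES_HZ)
            first_ear = rng.choice(["left", "right"])
            if same_frequency:
                second_hz = first_hz
            else:
                # The second tone is randomly chosen among the other frequencies.
                second_hz = rng.choice([f for f in FREQUENCIES_HZ if f != first_hz])
            if same_location:
                second_ear = first_ear
            else:
                second_ear = "right" if first_ear == "left" else "left"
            trials.append({
                "condition": condition,
                "first_hz": first_hz,
                "second_hz": second_hz,
                "first_ear": first_ear,
                "second_ear": second_ear,
            })
    rng.shuffle(trials)  # Assumption: conditions are intermixed.
    return trials


trials = build_trials()
print(len(trials), trials[0])
```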

### Procedure

All participants completed all tests in the following order: consent form, background questionnaire, attention test, Words-in-Noise test (Wilson et al., 2007, not reported here), SUN test, verbal ability, WM test, and a consonant perception test (not reported here). Bilingual participants then also completed the verbal ability and the Words-in-Noise test in Spanish.

### Analysis

Incorrect responses on the SUN test were manually checked for spelling mistakes. A misspelled word was counted as correct in the following cases: transposed letters (e.g., theif for thief); a wrong letter when the correct letter was adjacent to it on the keyboard and the resulting word was not a word in English (e.g., ahore for shore); a missing letter when the resulting word was not a word in English; or an answer that was a homophone of the target word, regardless of whether the typed word was a real English word (e.g., gyn or jin for gin). In total, 286 (2.2%) instances were corrected in this way, which is comparable to 2.5% in Luce and Pisoni (1998), who used a similar procedure.
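The first three correction rules lend themselves to a small sketch. In the study the responses were checked manually, so the code below is only an illustration of the rules, not the procedure that was used. The tiny lexicon, the function names, and the restriction of "adjacent" to neighboring keys in the same QWERTY row are simplifying assumptions; the homophone rule is left out because it would require a pronunciation dictionary.

```python
KEYBOARD_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each key to its left/right neighbors in the same row (simplification).
ADJACENT = {}
for row in KEYBOARD_ROWS:
    for a, b in zip(row, row[1:]):
        ADJACENT.setdefault(a, set()).add(b)
        ADJACENT.setdefault(b, set()).add(a)


def is_transposition(response, target):
    """True if the response equals the target with two neighboring letters swapped."""
    if len(response) != len(target):
        return False
    diffs = [i for i, (r, t) in enumerate(zip(response, target)) if r != t]
    return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
            and response[diffs[0]] == target[diffs[1]]
            and response[diffs[1]] == target[diffs[0]])


def is_adjacent_key_substitution(response, target):
    """True if exactly one letter differs and the typed key neighbors the correct key."""
    if len(response) != len(target):
        return False
    diffs = [(r, t) for r, t in zip(response, target) if r != t]
    return len(diffs) == 1 and diffs[0][0] in ADJACENT.get(diffs[0][1], set())


def is_single_deletion(response, target):
    """True if the response is the target with exactly one letter missing."""
    return any(target[:i] + target[i + 1:] == response for i in range(len(target)))


def counts_as_correct(response, target, lexicon):
    """Apply the transposition, adjacent-key, and missing-letter rules."""
    if response == target or is_transposition(response, target):
        return True
    if response in lexicon:  # the resulting string must not be a real word
        return False
    return (is_adjacent_key_substitution(response, target)
            or is_single_deletion(response, target))


LEXICON = {"shore", "sore", "share", "thief", "gin"}  # hypothetical mini-lexicon
for typed, target in [("theif", "thief"), ("ahore", "shore"), ("sore", "shore")]:
    print(typed, target, counts_as_correct(typed, target, LEXICON))
```

In this sketch, sore for shore is rejected because dropping the letter yields a real English word, which mirrors the non-word condition on the substitution and missing-letter rules.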

For the analysis, mixed-effects regression models were run in R (R Core Team, 2015) using the lme4 package (Bates et al., 2015). P-values were calculated using the Anova function in the car package (Fox and Weisberg, 2011) using the type II sums of squares method. Subjects and items were entered as random effects.

# RESULTS

First, a model was run with four predictor variables to analyze group-level effects: language group (bilingual/monolingual), predictability (low/high), noise level (low/high), and word frequency (see the Supplementary Materials for model specifications). The results showed that words in low noise (M = 85.5%, SD = 35.2) were recognized with higher accuracy than words in high noise [M = 67.6%, SD = 46.8; χ<sup>2</sup>(1) = 712.4, p < 0.001], and words in a predictive context (M = 88.7%, SD = 31.6) better than words in an unpredictive context [M = 64.4%, SD = 47.9; χ<sup>2</sup>(1) = 1059.3, p < 0.001]. The difference between a low and a highly predictive context was 28.2 percentage points (pp) when noise was high and 20.5 pp when noise was low, and this interaction was significant [χ<sup>2</sup>(1) = 30.7, p < 0.001]. Monolinguals recognized words more accurately (M = 80.8%, SD = 39.4) than bilinguals [M = 71.8%, SD = 45.0; χ<sup>2</sup>(1) = 76.7, p < 0.001]. When noise was low, the difference between monolinguals and bilinguals was smaller (M<sub>diff</sub> = 7.1 pp) than when noise was high (M<sub>diff</sub> = 10.9 pp) but this interaction did not reach significance [χ<sup>2</sup>(1) = 3.19, p = 0.074]. The effect of predictability was only slightly larger for monolinguals (M<sub>diff</sub> = 24.8 pp) than for bilinguals (M<sub>diff</sub> = 23.8 pp). Nevertheless, the interaction between predictability and language group was significant [χ<sup>2</sup>(1) = 47.56, p < 0.001]. As can be seen in **Figure 1**, this interaction was likely caused by the fact that monolinguals benefitted more from a predictive context than bilinguals when noise was high. In the high noise condition, the benefit was M<sub>diff</sub> = 31.06 pp for monolinguals and M<sub>diff</sub> = 24.87 pp for bilinguals. The main effect of frequency was significant [χ<sup>2</sup>(1) = 6.00, p = 0.014], showing that high frequency words were recognized with greater accuracy than low frequency words. The interaction between frequency and language group was also significant [χ<sup>2</sup>(1) = 5.65, p = 0.017]. **Figure 2** suggests that this interaction was driven by the steeper slope of the frequency effect in the bilingual group compared to the monolingual group.

The following variables were added to the analysis to investigate the effect of individual differences: verbal ability, WM, and processing speed. All continuous variables were centered around the mean. The mean values for each variable can be seen in **Table 2**. WM and verbal ability were highly correlated [r(99) = 0.527, p < 0.001] and WM and processing speed were moderately correlated [r(99) = 0.229, p = 0.021]. Processing speed and verbal ability were not correlated [r(99) = 0.034, p = 0.737; see the Supplementary Materials for a detailed correlation matrix].

A model was built with the same variables as above, that is, language group, word frequency, noise level, and predictability, plus the individual difference variables. Besides the main effects, only the significant interactions are reported here. The full model can be seen in the Supplementary Materials.

The main effects of language group, noise level, and predictability were highly significant as before (all χ<sup>2</sup> > 10, ps < 0.001). Furthermore, main effects of verbal ability [χ<sup>2</sup>(1) = 44.51, p < 0.001] and processing speed [χ<sup>2</sup>(1) = 5.87, p = 0.015] were significant, showing that higher verbal ability and faster processing speed (lower RTs) were associated with higher accuracy on the SUN test. This can be seen in **Figures 3** and **4**, respectively. The interaction between verbal ability and predictability was significant [χ<sup>2</sup>(1) = 53.10, p < 0.001]. As **Figure 3** shows, participants with higher verbal ability benefitted more from a predictive context than those with lower verbal ability. The interaction between word frequency and verbal ability was also significant [χ<sup>2</sup>(1) = 5.13, p = 0.024]. This interaction can best be interpreted using **Figure 5**. The difference in accuracy between listeners with high and low verbal ability was most pronounced for low frequency words. WM was not a significant predictor of SUN accuracy [χ<sup>2</sup>(1) < 0.01, p = 0.978], likely because of its high correlation with verbal ability (when verbal ability was taken out of the model, WM became a significant predictor; see Supplementary Materials). These analyses show that verbal ability was a powerful predictor of SUN accuracy. Expressed as an odds ratio, compared to someone with average verbal ability, someone with verbal ability 1 SD above the mean had 2.14 times the odds of recognizing a target word. Compared to verbal ability, the effect of processing speed was much smaller: compared to someone with mean processing speed, someone with an RT 1 SD below the mean (i.e., faster) had 1.09 times the odds of recognizing a target word.
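Because an odds ratio scales the odds rather than the probability itself, its effect on accuracy depends on the starting point. The sketch below converts the two reported odds ratios into recognition probabilities for an assumed baseline accuracy of 75%. That baseline is a hypothetical value chosen for illustration and is not taken from the fitted model; the function name is likewise an assumption.

```python
def apply_odds_ratio(baseline_probability, odds_ratio):
    """Return the probability obtained by multiplying the baseline odds by an odds ratio."""
    baseline_odds = baseline_probability / (1 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


BASELINE = 0.75  # hypothetical accuracy of an average listener

# Odds ratios reported in the text for a 1 SD advantage.
for label, odds_ratio in [("verbal ability", 2.14), ("processing speed", 1.09)]:
    probability = apply_odds_ratio(BASELINE, odds_ratio)
    print(f"{label}: {BASELINE:.3f} -> {probability:.3f}")
```

Under this assumed baseline, an odds ratio of 2.14 corresponds to an accuracy of roughly 0.865 and an odds ratio of 1.09 to roughly 0.766, which illustrates how much larger the effect of verbal ability was.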

To check whether the effect of verbal ability was true for both groups or was simply driven by group differences, follow-up analyses were run for each group separately. The main effect of verbal ability and the interaction with predictability were highly significant in both groups (all χ<sup>2</sup> > 15, ps < 0.001) but the interaction with frequency was no longer significant (both χ<sup>2</sup> < 1). The main effect of frequency was significant in the bilingual group [χ<sup>2</sup>(1) = 8.61, p = 0.003] but not in the monolingual group [χ<sup>2</sup>(1) = 3.27, p = 0.071]. Furthermore, the effect of processing speed did not reach significance in either group (ps = 0.058 and 0.129 for the bilingual and monolingual group, respectively). This may have been due to insufficient power in these smaller samples.

TABLE 2 | Mean values for the individual differences variables.

W-scores are based on an arbitrary equal-interval scale. See text for an explanation of working memory scores. Processing speed is the baseline mean response time on the attention test. Standard deviations are shown in parentheses.

The analyses so far suggest that verbal ability had an effect on SUN in both the monolingual and the bilingual group. Yet, even when verbal ability was controlled for, language group was still a significant predictor. To investigate further what the added difficulty for bilinguals might be, a subgroup was formed from each group by randomly selecting participants with similar vocabulary scores<sup>4</sup>, so that the two subgroups were closely matched (see **Table 3**). A t-test confirmed that the subgroups did not differ significantly in vocabulary scores [t(44) = 0.63, p = 0.534]. The mean group difference in SUN accuracy in this subsample was M<sub>diff</sub> = 5.1 pp, which is smaller than in the total sample (M<sub>diff</sub> = 9.0 pp). Yet this difference was still statistically significant [χ<sup>2</sup>(1) = 15.35, p < 0.001]. The interaction between word frequency and language group was not significant [χ<sup>2</sup>(1) = 2.02, p = 0.155] but **Figure 6** suggests that it was especially the low frequency words that were more difficult for bilinguals. The language group by predictability interaction was also still significant in this subsample [χ<sup>2</sup>(1) = 4.07, p = 0.044], suggesting that differences in language proficiency alone cannot explain this interaction.


<sup>4</sup>For matching, only the vocabulary score (i.e., Picture Vocabulary) was compared because in the bilingual group, English verbal reasoning was correlated with Spanish verbal reasoning and so vocabulary is likely a better indicator of English exposure.

TABLE 3 | A subsample from each group matched on language proficiency.


A subsample from each group was randomly chosen and matched on their picture vocabulary score. PV = picture vocabulary. VA = verbal ability. SUN = speech understanding in noise. Standard deviations are shown in parentheses.

# DISCUSSION

The results confirmed previous studies by showing that noise had a disruptive effect on speech understanding whereas a predictive context was facilitative. The effect of a predictive context was stronger when noise was high than when it was low, and monolinguals benefitted more from a predictive context than bilinguals. Word frequency had an effect on recognition accuracy: high frequency words were recognized with greater accuracy than low frequency words. However, in follow-up analyses, this effect was only marginally significant in the monolingual group, while it remained significant in the bilingual group. Next, an analysis of the effect of individual differences in verbal ability, WM, and attention was conducted. The effect of verbal ability was highly significant in both groups, as was the interaction between verbal ability and predictive context, showing that individuals with higher verbal ability recognized more words in general and also benefitted more from a predictive context. The effect of WM was not significant, likely because of its shared variance with verbal ability. The effect of processing speed was significant when both groups were analyzed together but did not reach significance when each group was analyzed separately. Finally, two subsamples (one from each group) that were matched on their vocabulary scores were compared. This analysis showed that group differences were reduced when subjects were matched on verbal ability but the differences were still statistically significant, suggesting that differences in verbal ability cannot completely explain the bilingual disadvantage in SUN.

As in previous studies (e.g., Mayo et al., 1997; Meador et al., 2000; Rogers et al., 2006; Bradlow and Alexander, 2007; Shi, 2009, 2010; Van Engen, 2010), the bilingual speakers recognized fewer words on average than the monolingual speakers. However, the effect was additive rather than multiplicative, meaning there was no interaction between noise level and group. This is in line with Rogers et al. (2006). Yet, other studies found an interaction (Mayo et al., 1997; Shi, 2010; Tabri et al., 2011). The reason may be that in the present study only two noise levels were tested. Another reason may be that the bilinguals in the present study learned English early and had grown up in an English-speaking environment. They were thus more proficient than many of the second language speakers tested in previous research.

An improvement over many previous studies that compared monolingual to early bilingual listeners (e.g., Mayo et al., 1997; Shi and Sánchez, 2010) was the inclusion of a larger sample. Thus there is more robust evidence that even early bilinguals have greater difficulties recognizing words in noise. Previous research also established that more exposure, a younger age of acquisition, and greater proficiency in the target language are positively associated with SUN (Meador et al., 2000; Shi, 2009, 2012; Rimikis et al., 2013; Kilman et al., 2014). The present study sought to contribute to the current discussion of bilingual SUN not only by showing the existence of a so-called bilingual disadvantage and which factors contribute to it but also by finding possible explanations for this disadvantage. In this respect, an improvement over previous research was that monolingual and bilingual participants were tested with the same standardized language test. A standardized test is important not only to make results comparable across studies but also to be able to compare the samples of monolinguals and bilinguals within a study. This is important to note because the present study found that verbal ability was associated with SUN in both groups. Since bilinguals often have a smaller vocabulary in each of their languages compared to monolinguals (e.g., Portocarrero et al., 2007; Bialystok and Luk, 2012; Gasquoine and Dayanira Gonzales, 2012), one reason for the bilingual disadvantage for SUN in previous studies may be that groups simply differed in verbal ability. This assumption was supported when two subsamples were compared that were matched on vocabulary size. Compared to the total sample, the difference went down from 9.0 to 5.1 pp, which is a decrease of 43%. At the same time, differences in language proficiency cannot be the only explanation because even these two subsamples matched on proficiency were still significantly different in SUN accuracy.

Word frequency may be a second, albeit related, explanation for the bilingual disadvantage in SUN. All participants recognized high frequency words with higher accuracy than low frequency words. As described in the introduction, exemplar models of speech perception assume that each encounter with a word leaves a trace in memory and that words that are encountered frequently are represented in memory with more phonetic detail. The more a word is encountered in different contexts, spoken by different speakers, the more robust its recognition will be under suboptimal listening conditions. Because on average bilingual speakers have not had as much exposure to each of their languages as a monolingual speaker, all words are encountered less often (cf. Gollan et al., 2008) and low frequency words disproportionately so (Kuperman and Van Dyke, 2013). This can explain the interaction between group and word frequency, which showed that the bilinguals as a group recognized especially low frequency words with lower accuracy than monolinguals. This explanation suggests that the bilingual disadvantage stems from their reduced exposure to each of their languages. Thus we would expect the same to be true for monolinguals who, for various reasons, have not had as much exposure to lower frequency words. For example, Tamati et al. (2013) tested a large sample of native English speakers on a SUN test and also had subjects rate their subjective familiarity with certain words. They found that those who performed well on the test reported being more familiar with low frequency words than those who performed less well. Assuming that familiarity is closely related to the frequency of encounter with a word, their study and the present one suggest that subjective word frequency is an important factor influencing individual differences in SUN.

Both explanations for individual differences in SUN, verbal ability and word frequency, are related because both depend on language experience. Someone who is exposed to language in many different contexts is more likely to learn the meaning of more words compared to someone with more limited exposure and, at the same time, they will encounter words of lower frequency more often. How, then, can we explain that the two subsamples that were matched on verbal ability still performed significantly differently on the SUN test? It may be that for the bilinguals, vocabulary knowledge overestimated their actual exposure to English. Even though they knew the meaning of a less common word, they may not have encountered that word as often as a monolingual speaker. Also, assuming that a bilingual speaker hears English in school and Spanish outside of school, they will hear each language not only less often but also from a more limited number of speakers. These may be factors that determine the quality of phonological representations (Gollan et al., 2014; Schmidtke, 2015) and thus SUN. Suggestive of this explanation is that, as in the whole sample, the largest difference between these two subsamples was in the low frequency range (see **Figure 6**), although the interaction between group and frequency did not reach significance. In this respect it is interesting that the size of the frequency effect changed as a function of proficiency. The effect was most pronounced for participants at the lower end of the proficiency range. In the matched subsamples, however, the proficiency range was smaller and this may be why the interaction was no longer significant.

While the present hypothesis for the bilingual disadvantage was based on exemplar models, the data do not necessarily contradict the predictions of models that assume an abstract level of representation of words. For example, TRACE (McClelland and Elman, 1986) assumes three levels of representation: a feature level, a phoneme level, and a word level, with each level of representation being more abstract. Frequency effects can be modeled by adjusting the resting-activation levels of words so that words with high resting levels require less activation from the speech signal, which results in earlier selection compared to words with low resting-activation levels (Dahan et al., 2001). A noisy signal could result in fewer features that receive activation so that words with a low resting-activation level do not receive sufficient activation to pass the threshold necessary for selection. Proponents of a mental lexicon with abstract representations of words can explain differences between native and non-native speech perception by assuming differences at a perceptual level. Because categorical speech perception develops very early in life (Kuhl, 2004), even an early learned second language will be perceived through the phonemic inventory of the first language (e.g., Sebastián-Gallés and Soto-Faraco, 1999), which will result in nonnative-like phonological representations in the mental lexicon (Pallier et al., 2001). However, the two models do not have to stand in opposition to each other and more recently researchers have developed hybrid models that include aspects of exemplar and abstract models to be able to explain the whole range of phenomena (e.g., Goldinger, 2007; Ernestus, 2014; Kleinschmidt and Jaeger, 2015; Pierrehumbert, 2016). This being said, exemplar models provide a more elegant solution to explain the present results. Differences in the quality of mental representations of words between and within speakers are a fundamental part of exemplar-based models and so they can readily explain individual differences in word recognition. Abstract models, on the other hand, have to assume additional mechanisms to be able to explain individual differences.

Exemplar-based models may also be useful to explain the finding that individual differences in WM capacity were not a significant predictor of SUN when controlling for language ability. A verbal WM test was included in the current study because of the ELU's prediction that individuals with a larger WM capacity would recognize words in noise with less effort and thus be more accurate. The test required individuals to remember items in different set sizes and to mentally manipulate the order of the items according to their size. Because of these storage and processing components, the test is believed to tap into WM. Individuals who can correctly recall more sets are assumed to have a larger WM capacity. The items were common animals and food items such as mouse, pig, and banana that all participants were likely very familiar with. It was therefore surprising that the test correlated highly with the language test (r = 0.5). Exemplar-based models can explain this finding because they assume that not only one representation is activated at the time of encoding but all exemplars of a word. If a word is represented by many exemplars then it is more likely that a memory trace is still active in LTM at the time of retrieval. Related to this explanation is also the finding that items stored in WM are not independent from LTM representations (e.g., Hulme et al., 1997; Acheson et al., 2011). Additionally, in individuals with larger mental lexicons the phonological representations of words may be overall more precise, which may reduce the spread of activation to similar sounding words and therefore prevent interference during rehearsal (cf. Cowan et al., 2005). However, although WM was not a significant predictor of SUN accuracy in the present sample, this does not necessarily imply that individual differences in WM are not important for SUN. The participants here were all young adults and a more diverse sample in terms of age may be needed to find an effect of verbal WM above and beyond verbal ability. For example, Parbery-Clark et al. (2011) found a correlation between auditory WM and SUN ability even when controlling for vocabulary knowledge in a sample of older listeners. But the present results may further inform the ELU in that the quality of lexical representations in LTM and capacity limits of WM are not independent constructs. This view would be more akin to the model of WM developed by Cowan et al. (2005) and Cowan (2008) than to a limited capacity system for temporary storage of items as it is currently defined in the ELU (Rönnberg et al., 2013, p. 2). The present results also have implications for future research. Researchers interested in the relationship between SUN and cognition should always include a proficiency test that measures vocabulary knowledge in their test batteries when they administer a verbal WM test. Otherwise correlations may be attributed to WM (or some other covariate) when in fact language experience is the underlying factor. However, the type of verbal ability test used may also lead to differing results, since an effect of verbal ability is not always found (e.g., Benichov et al., 2012). In the same way, in the norming study of the WM test used here the authors found a much weaker correlation between WM and receptive vocabulary (r = 0.24; Tulsky et al., 2014).

The next finding that merits discussion is the effect of a predictive context. Previous research found that bilingual and second language speakers do not benefit as much from a predictive context as monolinguals under certain circumstances (Mayo et al., 1997; Bradlow and Alexander, 2007; Shi, 2010). However, the present results suggest that individual differences in the effective use of context also exist among monolinguals and that verbal ability is the mediating factor. This would again suggest that differences between monolinguals and bilinguals might emerge because of differences in verbal ability (see above). As a result, the less effective use of context cues attributed to bilingualism is not a bilingual disadvantage per se but may be a result of reduced language experience (cf. Newman et al., 2012). But what is the relationship between verbal ability and the effective use of context cues? One explanation is that individuals with lower verbal ability generally understood fewer words and so if they missed words in the preceding context of the target words, they were not able to form any predictions. Another explanation may be the relationship between verbal ability and WM. In order to make predictions about the target word, subjects need to maintain preceding words in WM. This process might take up more resources depending on the ease with which phonological representations are retrieved and maintained in WM. A third explanation may be the association strength between words (Spence and Owens, 1990). One example sentence from the SUN test is "The ship sailed along the coast." Here, ship and sailed may be used to predict the target word coast. If individuals with larger vocabularies have more language experience overall, then they have likely heard words such as ship and coast more often in the same context and thus have a stronger association between ship and coast than an individual with less language experience (cf. Nation and Snowling, 1999).

Given the findings discussed so far, a frequency-based explanation of differences between monolinguals and bilinguals seems to be the most powerful because it can account not only for group differences but also for differences between individual participants. Furthermore, a frequency-based account can give a unified explanation of the language-related effects such as language proficiency, word frequency, predictive context, and the null effect of verbal WM. The last variable to be discussed, attention, stood out in this respect because it was not language related. The attention test was included in the study to give a more complete picture of individual differences in SUN, as recent studies have pointed to the potential role of non-linguistic factors in language comprehension and especially SUN (e.g., Anderson et al., 2013; Fedorenko, 2014).

The attention test based on Zhang et al. (2012) provided three different variables but no prediction was made as to which variable would be associated with SUN. In the analysis, only processing speed was used because it provided the most robust correlations with the SUN test of the three variables. The results showed a small but significant effect of processing speed on SUN accuracy. The reason why this effect was small might be that there was not enough variance in the data for a stronger effect to emerge. As with WM, processing speed may become more important as a factor in older populations. The general speed of information processing slows down in older adults (Salthouse, 1996), which may explain why cognitive factors are sometimes a better predictor of SUN than hearing acuity (Wingfield, 1996; Benichov et al., 2012). However, further studies are needed to confirm or disconfirm that processing speed is indeed a better predictor of SUN than the conflict resolution or involuntary attention components of the test.

One practical implication of the study for hearing testing is that word frequency needs to be taken into account. One possibility is to use only high frequency words when testing patients to avoid a possible confound. On the other hand, it may be useful to test high and low frequency words and to have norms for each set. If a patient fares especially poorly on the low frequency words then this might be an indication for the practitioner that part of the patient's hearing difficulties may stem from factors unrelated to hearing acuity.

Some limitations of the present study that qualify the results should be addressed. Inherent to the design of the study, no inferences about causation can be made. The results suggest that a larger vocabulary is associated with better SUN but the nature of this relationship requires further investigation. Here the assumption was made that exposure frequency is the mediating variable but vocabulary size could also have a direct influence on word recognition. Alternatively, though less likely, people with better SUN ability may be better able to pick up new words through listening and therefore have larger vocabularies. Another limitation is that only one WM and one attention test were used. Future studies would benefit from the use of multiple tests for each construct, which, along with a larger sample size, would allow more sophisticated statistics such as structural equation modeling. Finally, the two samples did not only differ in language status (monolingual vs. bilingual) but also in the age of acquisition (AoA) of the tested language and socioeconomic status (SES; assessed by maternal education level). In the present study, additional tests showed that neither variable was a significant predictor of SUN once language proficiency was accounted for but these results may be different in a sample where AoA and SES are not correlated with verbal ability.

# CONCLUSION

The purpose of the present study was to find factors that would explain individual differences in SUN between listeners, especially between monolingual and bilingual listeners. Previous research had established that bilinguals often performed below monolinguals on SUN tests, even when the bilinguals had learned the second language early in life. The present study confirmed these results but the general conclusion was that differences between groups could largely be explained by frequency effects, which suggests that differences between groups are less categorical than might be assumed based on previous research. Based on the ELU model (Rönnberg et al., 2013), it was hypothesized that listening difficulty arises from mismatches between the speech signal and internal phonological representations. Mismatches can occur because of a poor signal and because of poor phonological representations in LTM. In the current ELU model, the definition of what poor phonological representations are is underspecified and so the ELU was extended to exemplar models of the mental lexicon (e.g., Goldinger, 1996, 1998). These models assume that each encounter with a word leaves an episodic trace in memory. The present study showed that recognition of high frequency words was more robust to noise compared to low frequency words. Exemplar models can explain this finding in that high frequency words are represented in memory with more exemplars and more highly activated exemplars than low frequency words (Pierrehumbert, 2001). Word retrieval of high frequency words is more robust because a new exemplar will more likely be similar to an already stored exemplar when more exemplars of a word exist in memory. Following these assumptions, the premise of the study was that the bilingual disadvantage in SUN is a frequency effect (c.f. Gollan et al., 2008). Because bilinguals are exposed to each of their languages less often than monolinguals, they encounter all words less frequently. 
Consequently, bilinguals will have fewer stored exemplars in LTM for all words. This will especially affect the recognition of low frequency words as bilinguals will encounter these even more rarely than monolinguals and consequently recognition of these words under noise is expected to be more fragile. In support of this hypothesis, the present study found that differences in SUN between groups were largest for low frequency words. Another consequence of reduced exposure to each language is a smaller vocabulary. As in previous research (Portocarrero et al., 2007; Bialystok and Luk, 2012), bilinguals scored on average below monolinguals on a verbal ability test, and higher verbal ability was associated with better performance on the SUN test. Importantly, however, there was a relationship between verbal ability and SUN for both groups, suggesting that some of the group differences might be explained by the overall lower English proficiency of the bilinguals. When two subgroups that were matched on language proficiency were compared, the difference in performance on the SUN test was much smaller (5.1% compared to 9.0%). These results support the hypothesis that differences in SUN between monolinguals and bilinguals are a result of the bilinguals' reduced exposure to each of their languages as a consequence of being bilingual.
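The frequency argument above can be illustrated with a toy calculation. This is only a sketch under simplifying assumptions that are not part of the cited exemplar models: each stored exemplar is assumed to have the same, independent probability $p$ of being similar enough to a noisy input token to support recognition, and a word is assumed to have $n$ stored exemplars. Both symbols are hypothetical.

```latex
P(\text{match}) = 1 - (1 - p)^{n}
```

Under these assumptions the probability of finding at least one matching exemplar grows with $n$. For example, with a hypothetical $p = 0.05$, a word with $n = 10$ exemplars gives $1 - 0.95^{10} \approx 0.40$, whereas a word with $n = 40$ exemplars gives $1 - 0.95^{40} \approx 0.87$.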

### FUNDING

The research was supported by a Doctoral Dissertation Improvement Grant (NSF-DDIG 1349125) from the US National Science Foundation.

### ACKNOWLEDGMENTS

The research presented here was part of the author's doctoral work (Schmidtke, 2015) that he conducted while at Michigan State University. He is currently at the China University of Petroleum East, Qingdao, China. I am grateful to my dissertation committee, Aline Godfroid, Laura Dilley, Debra Hardison, and Paula Winke, and the two reviewers for their comments, which greatly improved the present article. I thank Kara Morgan-Short for providing research space at UIC and Karthik Durvasula for assistance with Praat.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.00678






**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Schmidtke. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Impact of Bilingualism on Working Memory: A Null Effect on the Whole May Not Be So on the Parts

Noelia Calvo<sup>1,2</sup>, Agustín Ibáñez<sup>3,4,5,6,7</sup> and Adolfo M. García<sup>3,4,8,9</sup>\*

<sup>1</sup> Institute of Philosophy, School of Philosophy, Humanities and Arts, National University of San Juan, San Juan, Argentina, <sup>2</sup> Faculty of Psychology, National University of Córdoba, Córdoba, Argentina, <sup>3</sup> Laboratory of Experimental Psychology and Neuroscience, Institute of Translational and Cognitive Neuroscience, INECO Foundation, Favaloro University, Buenos Aires, Argentina, <sup>4</sup> National Scientific and Technical Research Council, Buenos Aires, Argentina, <sup>5</sup> Universidad Autónoma del Caribe, Barranquilla, Colombia, <sup>6</sup> Department of Psychology, Universidad Adolfo Ibáñez, Santiago, Chile, <sup>7</sup> ARC Centre of Excellence in Cognition and its Disorders, Sydney, NSW, Australia, <sup>8</sup> UDP-INECO Foundation Core on Neuroscience, Diego Portales University, Santiago, Chile, <sup>9</sup> Faculty of Elementary and Special Education, National University of Cuyo, Mendoza, Argentina

Keywords: bilingualism, bilingual advantage, executive functions, working memory, L2 proficiency, simultaneous interpreting

Abundant research has examined the relationship between bilingualism and working memory (WM), a system that keeps information accessible while dealing with concurrent processes, distractions, or attention shifts (Baddeley and Hitch, 1974; Engle et al., 1999; Conway et al., 2002). Some studies have reported no WM differences between bilinguals and monolinguals (Bialystok et al., 2008; Feng, 2009; Bialystok, 2010; Namazi and Thordardottir, 2010; Bonifacci et al., 2011; Engel de Abreu, 2011), leading top scholars to maintain that this domain is impervious to bilingualism. For instance, Bialystok (2009) first claimed that WM is indifferent to the development of a non-native language (L2). Later, she slightly reframed her position, stating that WM is only occasionally enhanced by the bilingual experience (e.g., Bialystok et al., 2009, 2012). Likewise, in another study, Engel de Abreu (2011: p. 6) concluded that "bilingual experience does not seem to convey any advantage in working memory abilities," which aligns with recent criticism on the very notion of bilingual benefits (Duñabeitia and Carreiras, 2015; Calvo et al., 2016; Paap et al., 2016).

However, there is no shortage of evidence for enhanced WM in bilinguals. While full-blown WM advantages have been only sparsely reported, several studies yielding no overall benefits did find such effects in specific tasks or conditions. This is also true of comparisons between bilingual groups who daily exert different levels of demand on their WM systems (in particular, simultaneous interpreters vs. non-interpreting bilinguals). These findings indicate that WM is not completely unaffected by the distinctive executive demands of bilingualism. Instead, they suggest that a bilingual advantage may indeed exist in some aspects of WM, as we argue below.

The hypothesis underlying the field is that cognitive skills developed to cope with the demands of controlling two languages generalize to more efficient processing in executive domains, including WM. Relevant evidence is typically garnered as follows. First, two sociodemographically matched samples are recruited, one comprising bilinguals and the other composed of monolinguals (alternatively, these could be interpreters and non-interpreters). A set of tasks (including WM paradigms) is then administered to both groups, and their respective results are compared. Crucially, WM tasks vary widely across studies, as they involve different stimuli, procedures, and presentation modalities.

Within that literature, some studies reported concrete advantages for bilinguals. For instance, Bialystok et al. (2004) compared bilingual and monolingual adults (aged 30–80) in three different studies using a non-verbal Simon task. Overall, bilinguals outperformed monolinguals when WM demands were high, and the extent of the difference was proportional to age. Further evidence for a bilingual WM advantage was reported by Morales et al. (2013) in two experiments with children. To this end, the authors used a Simon-type task and a visual-spatial task. Their overall results showed that bilinguals surpassed monolinguals in all the conditions involving high WM and executive demands. Similarly, the bilingual children studied by Blom et al. (2014) showed better performance in visuospatial (Dot Matrix/Odd-One-Out) and verbal (Forward Digit Recall/Backward Digit Recall) WM tests when vocabulary was controlled for, especially in tasks that involved processing and not just storage.

### Edited by:

Rachel Jane Ellis, Linköping University, Sweden

### Reviewed by:

Judith F. Kroll, Pennsylvania State University, USA

> \*Correspondence: Adolfo M. García adolfomartingarcia@gmail.com

### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 16 November 2015 Accepted: 10 February 2016 Published: 25 February 2016

### Citation:

Calvo N, Ibáñez A and García AM (2016) The Impact of Bilingualism on Working Memory: A Null Effect on the Whole May Not Be So on the Parts. Front. Psychol. 7:265. doi: 10.3389/fpsyg.2016.00265

Moreover, studies often cited as disconfirmatory evidence have actually reported enhanced performance by some bilingual groups under specific conditions. Feng (2009) presented various WM tasks to monolinguals and bilinguals from two age groups: children and adults. Despite null results in most conditions, a general bilingual advantage was observed in a spatial WM task (recalling the position of randomly ordered items). A similar result was reported by Bialystok et al. (2008), who evaluated bilingual and monolingual younger and older adults. In this case, participants completed different WM, lexical retrieval, and executive control tasks. While the adult groups showed no significant WM advantages, this effect did emerge for younger bilinguals in a Corsi Block task. Also, Namazi and Thordardottir (2010) compared the performance of young bilingual and monolingual children through assessments of verbal short-term memory, verbal WM, visual WM, and visual controlled attention. Although both language groups performed similarly in most tasks, bilinguals showed positive correlations between visual WM and attentional control skills. Finally, Bonifacci et al. (2011) tested bilingual and monolingual children with a choice reaction-time task, an anticipation task, a go/no-go task, and two WM tasks (numbers and symbols). In this case, only bilingual infants were faster in a visual anticipation task calling on WM resources. In sum, even those studies which failed to find overall WM advantages did report such an effect under certain circumstances.

In this sense, most studies have explored the issue using words or digits as stimuli (e.g., Bialystok, 2010; Engel de Abreu, 2011). Given that bilinguals generally have more difficulty than monolinguals in word processing (Bialystok et al., 2009), tasks with high verbal requirements may not be well suited to test the bilingual WM advantage hypothesis. Indeed, as seen above, WM tasks employing (non-verbal) visual stimuli have yielded consistent advantages for bilinguals.

Two views may account for this pattern. On the one hand, the bilingual experience may selectively enhance a visually-specialized subcomponent within WM. This possibility is compatible with Baddeley's model (Baddeley and Hitch, 1974; Baddeley, 2000), which posits that WM comprises a visuospatial sketchpad, separate from the so-called phonological loop. Moreover, it aligns with meta-analytic data indicating that the development of specific components of WM may be differentially associated with L2 proficiency (Linck et al., 2014). On the other hand, it may be that an undivided WM interacts with several systems in long-term memory. Those systems which are inherently weakened by bilingualism—in particular, verbal processing (Bialystok, 2009)—would carry over their processing disadvantages to any task which taps into them, including WM.

Note that executive skills needed to direct visual attention to location and space may be honed by increased language processing demands. In fact, attentional control mechanisms are essential to process visual (Chun and Wolfe, 2001) and verbal (Bialystok and Cummins, 1991) information. Moreover, the attentional control processes of WM may account for individual differences in the bilingual literature (Linck et al., 2014). In this respect, modality-specific bilingual advantages in WM may be related to increased attentional skills. Recent evidence supports this conjecture. Tse and Altarriba (2014) assessed bilingual children with varied proficiency levels through the Simon task (Simon/Simon switching) and an operation-span WM task. More proficient bilinguals showed better conflict resolution and WM capacity when the tasks demanded more attentional control.

Finally, if the proposed effects stem from increased control demands during bilingual processing, they should be greater in bilinguals who daily face particularly stringent processing conditions, such as simultaneous interpreters (García, 2014). Relationships between WM and interlingual processing skills have been reported in studies which did not consider interpreters. For example, Kroll et al. (2002) compared word naming and translation performance between native English speakers with different levels of L2 competence. In addition to the main finding of the study (better performance for the more fluent group), a positive correlation was found between the participants' WM and their translation performance. Such a result fits well with meta-analytic evidence that WM is robustly associated with L2 processing/proficiency outcomes (Linck et al., 2014). In light of these findings, it is also worth considering comparisons between professional interpreters (whose language processing is repeatedly subject to high WM demands) and non-interpreter bilinguals—an empirical corpus that previous discussions have mostly neglected.

Bajo et al. (2000) assessed lexico-semantic, comprehension, and WM abilities in professional interpreters, interpreting students, non-interpreter bilinguals, and monolinguals. The interpreters showed increased WM spans for digits and words, in addition to faster categorization, reading, and lexical access skills. Interpreters also showed increased abilities in other studies tapping WM storage through visual span tasks (Christoffels et al., 2006; Yudes et al., 2011). For instance, Christoffels et al. (2006) compared language and WM skills among professional interpreters, bilingual university students, and highly proficient L2 teachers. The interpreters outperformed both other groups in WM measures, including word span and reading span—for a fuller discussion, see García (2014).

Moreover, those advantages have been repeatedly observed in tasks involving verbal stimuli. Thus, while WM enhancements led by bilingualism proper (as opposed to monolingualism) may be more pervasive in (non-verbal) visual tasks, those guided by differential processing skills between bilingual groups could possibly manifest in other domains. Indeed, the meta-analysis by Linck et al. (2014) revealed that positive correlations between L2 proficiency and WM may be more pronounced for verbal than non-verbal measures of the latter domain.

In sum, specific aspects of WM may actually be enhanced by the bilingual experience. Discrepant results seem to reflect methodological differences among the studies, especially in terms of task- and stimulus-related variables. Specifically, failure to observe WM differences between bilinguals and monolinguals in most previous studies may be explained by the use of verbal stimuli, given that bilingualism seems detrimental to vocabulary skills. Future studies should evaluate which particular components within WM functioning are sensitive to the effects of bilingualism. For instance, it would be useful to assess whether bilingualism enhances the attentional components of WM in a stimulus- and modality-independent fashion.

To conclude, WM is a complex domain both in its internal configuration and in its connections to other cognitive systems. Bilingualism may not enhance WM function at large, but it may improve certain aspects of it. Whether such selective advantages correspond to improvements in mechanisms within WM remains to be empirically determined. However, extant evidence suffices to raise a word of caution: failure to observe an effect in certain aspects of a function should not be automatically taken as evidence for a null effect in all of its components. Further research on the distinctive aspects of bilingualism might benefit from this general premise.

### AUTHOR CONTRIBUTIONS

Overall idea: NC, AG. Literature review: NC, AI, AG. Manuscript elaboration: NC, AI, AG.

### ACKNOWLEDGMENTS

This work was partially supported by grants from CONICET, CONICYT/FONDECYT Regular (1130920), FONCyT-PICT 2012-0412, FONCyT-PICT 2012-1309, and the INECO Foundation.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Calvo, Ibáñez and García. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Imitation, Sign Language Skill and the Developmental Ease of Language Understanding (D-ELU) Model

*Emil Holmer<sup>1</sup>\*, Mikael Heimann<sup>2</sup> and Mary Rudner<sup>1</sup>*

*<sup>1</sup> Linnaeus Centre HEAD, Swedish Institute for Disability Research, Department of Behavioural Sciences and Learning, Linköping University, Linköping, Sweden, <sup>2</sup> Swedish Institute for Disability Research and Division of Psychology, Department of Behavioural Sciences and Learning, Linköping University, Linköping, Sweden*

### *Edited by:*

*Patrik Sörqvist, University of Gävle, Sweden*

### *Reviewed by:*

*Andrej Kral, Hannover School of Medicine, Germany Sarah Theodoroff, Department of Veterans Affairs, USA*

> *\*Correspondence: Emil Holmer emil.holmer@liu.se*

### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

*Received: 08 October 2015 Accepted: 19 January 2016 Published: 16 February 2016*

### *Citation:*

*Holmer E, Heimann M and Rudner M (2016) Imitation, Sign Language Skill and the Developmental Ease of Language Understanding (D-ELU) Model. Front. Psychol. 7:107. doi: 10.3389/fpsyg.2016.00107*

Imitation and language processing are closely connected. According to the Ease of Language Understanding (ELU) model (Rönnberg et al., 2013) pre-existing mental representation of lexical items facilitates language understanding. Thus, imitation of manual gestures is likely to be enhanced by experience of sign language. We tested this by eliciting imitation of manual gestures from deaf and hard-of-hearing (DHH) signing and hearing non-signing children at a similar level of language and cognitive development. We predicted that the DHH signing children would be better at imitating gestures lexicalized in their own sign language (Swedish Sign Language, SSL) than unfamiliar British Sign Language (BSL) signs, and that both groups would be better at imitating lexical signs (SSL and BSL) than non-signs. We also predicted that the hearing non-signing children would perform worse than DHH signing children with all types of gestures the first time (T1) we elicited imitation, but that the performance gap between groups would be reduced when imitation was elicited a second time (T2). Finally, we predicted that imitation performance on both occasions would be associated with linguistic skills, especially in the manual modality. A split-plot repeated measures ANOVA demonstrated that DHH signers imitated manual gestures with greater precision than non-signing children when imitation was elicited the second but not the first time. Manual gestures were easier to imitate for both groups when they were lexicalized than when they were not; but there was no difference in performance between familiar and unfamiliar gestures. For both groups, language skills at T1 predicted imitation at T2. Specifically, for DHH children, word reading skills, comprehension and phonological awareness of sign language predicted imitation at T2. 
For the hearing participants, language comprehension predicted imitation at T2, even after the effects of working memory capacity and motor skills were taken into account. These results demonstrate that experience of sign language enhances the ability to imitate manual gestures once representations have been established, and suggest that the inherent motor patterns of lexical manual gestures are better suited for representation than those of non-signs. This set of findings prompts a developmental version of the ELU model, D-ELU.

Keywords: imitation, sign language, manual gesture, representation, development

# INTRODUCTION

There is a close connection between mental representation and imitation, the behavioral repetition of another person's act (Brass and Heyes, 2005). In particular, there are empirical indications of a relationship between imitation of manual gestures and both lexical representation (McEwen et al., 2007) and language comprehension (Farrant et al., 2011). For sign language users, manual gestures may bear phonological and semantic information. Indeed, it has been shown that the ability to imitate manual gestures is related to gesture-based phonological representation in deaf signing children (Mann et al., 2010). However, it is not known whether the ability to imitate manual gestures is related to existing semantic representations in this group. In the present study, we investigated whether knowledge of Swedish Sign Language (SSL) is related to the ability to imitate manual gestures that are familiar (lexical items in SSL), unfamiliar (lexical items in British Sign Language, BSL), or illegal (non-signs), in children whose language skills are still developing.

Sign languages are natural languages that are performed in the manual–visual modality and include sublexical, lexical, and syntactic structures analogous to spoken languages (for a review, see Emmorey, 2002). Whereas the sublexical structure of spoken languages is based on the patterning of speech sounds, the sublexical structure of sign languages is based on the patterning of a number of articulatory parameters including: formation and orientation of the hands; finger and/or hand movements; placement of the hand(s) in relation to the body; and nonmanual facial gestures (Brentari, 2011). Thus, for deaf and hard-of-hearing (DHH) signing children, manual gestures are sometimes linguistic and may bear semantic and phonological information. Even when a manual gesture is not part of the lexicon, its formational characteristics may be similar to those of lexicalized signs, or even qualify it as a potentially lexicalized sign. However, for hearing non-signing children, manual gestures only involve motoric information, unless they are emblematic, e.g., "thumbs up". In the present study, participants imitated signs that were lexicalized in SSL or BSL, and non-emblematic non-signs. For Swedish DHH signing participants the SSL signs bore both semantic and phonological information, while BSL signs bore phonological information only. For hearing non-signing participants, neither SSL nor BSL signs bore either semantic or phonological information. Non-signs bore no semantic information for either group and only reduced phonological information for the signing group.

The Ease of Language Understanding (ELU) model (Rönnberg, 2003; Rönnberg et al., 2008, 2013) describes how language understanding depends on pre-existing representations. The model states that language processing is rapid and automatic if input matches pre-existing phonological and semantic representations (Rönnberg et al., 2013) and it is likely that the best match is obtained when phonological and semantic representations are available simultaneously. When only matching phonological representations are available, a cohort of lexical candidates will be activated (Marslen-Wilson, 1987) that is unconstrained by meaning, and language processing will probably be less efficient. When input bears reduced phonological information, phonological constraints will be fewer and processing will probably be even less efficient (Rudner et al., 2016). These factors are likely to be of importance even in the developing language system (Mann et al., 2010; Sundström et al., 2014). Thus, in the present study, we predicted that Swedish DHH signing children would be better at imitating SSL signs with both semantic and phonological information than BSL signs with phonological information only, and better at imitating lexical signs (SSL and BSL) than non-signs with reduced phonological information. Because recent studies indicate that non-signs are more difficult to process than lexical signs, even for non-signers (Cardin et al., 2016; Rudner et al., 2016), we predicted that both groups would be better at imitating lexicalized signs (both SSL and BSL) than non-signs.

We also predicted that initially the hearing non-signing children would be worse at imitating all types of manual gestures than DHH signing children at a similar developmental level. This prediction was based on the former group's limited experience of signs with linguistic and symbolic information. However, we predicted that the act of imitation would help establish representations (Brass and Heyes, 2005) of the manual gestures and, thus, that the performance gap between groups would narrow when imitation was elicited a second time. Moreover, we predicted that imitation performance on both occasions would be associated with linguistic skills (McEwen et al., 2007), especially in the manual modality (Mann et al., 2010).

# MATERIALS AND METHODS

### Participants

### Deaf and Hard-of-Hearing Participants

All five of the Swedish state special schools for deaf and hard-of-hearing (DHH) pupils were invited to be part of this study. In these schools, pupils are taught in both SSL and spoken and/or written Swedish and admission is granted for children with hearing impairment. Two schools agreed to participate. Staff members identified seventeen potential participants who showed an interest in text and were able to read words at a level corresponding to typical readers in Grade 1. Pupils attending Swedish state primary schools for DHH children represent a heterogeneous population (Svartholm, 2010), which was also reflected in the sample. Four potential participants had an additional severe medical or developmental disability and were thus excluded. Thirteen DHH pupils (seven girls), with a mean age of 10.2 years (*SD* = 2.3) and attending grades 1–7 at the first testing occasion, were included in the present study. Eleven used technical aids: five used only hearing aid (HA) (four bilateral); five used only cochlear implant (CI) (four bilateral), and one had a CI on one ear and a HA on the other. Up-to-date audiological records were not available and since imitative ability, and its relationship with language and cognitive skills, was the focus of this study, audiological measurements were not made. Two participants had a vision deficit which was corrected. All participants used SSL: nine as their primary language (mean age of first exposure to SSL = 2.8 years, *SD* = 3.3, range 0.0–8.0; *n* = 6), four of whom had at least one deaf native signing parent; the other four used SSL in school and occasionally at home and during spare time activities (mean age of first exposure to SSL = 6.1 years, *SD* = 4.0, range 3.0–11.7). Seven participants were born abroad; age at which residence in Sweden commenced ranged from 2.2 to 10.6 years (*n* = 5).
Non-verbal intelligence (NVIQ) of participants was screened using Raven's Colored Progressive Matrices (CPM) (Raven and Raven, 1994); twelve participants scored between the 5th and 95th percentile, and one was one point below (*M* = 25.2, *SD* = 5.88). Three families omitted to provide background data in full or in part.

### Hearing Participants

Thirty-six typically developing children (20 girls) with no reported hearing impairment or knowledge of sign language attending first grade of primary school took part. In grade one, typically developing children are starting to learn to read. They were sampled from four different schools in a municipality in southeast Sweden with representative socioeconomic status. The mean age of the participants at the first occasion was 7.5 years (*SD* = 0.3). Swedish was their first language. One had corrected to normal vision. NVIQ of the participants was screened using Raven's CPM (Raven and Raven, 1994) and all scored between the 5th and 95th percentile (*M* = 25.4; *SD* = 4.35).

### Procedure

All participants were tested individually at their school on two occasions (T1 and T2) separated by 35 weeks. Hearing participants were instructed in Swedish, and DHH children were instructed in their preferred communication mode. Instructions in SSL were provided by a test leader fluent in SSL, and were based on a rephrased version of the Swedish instructions in SSL following a formal coding system (Bergman, 2012). The SSL instructions were coded by a deaf native SSL user, and checked by three of the test leaders in the study. For practical reasons, test order was individually adapted and breaks were taken when needed; however, hearing participants did the imitation task as the second task and DHH participants did it as one of the last four tasks on both occasions. This study is part of a larger project, and data relating to predictor variables in the present study were collected at T1 and reported in Holmer et al. (2016). Test leaders made sure that the participant understood each task before testing took place, and participants practiced all tasks except the imitation task before administration. The present study was approved by the regional ethical review board and all participants and their parents gave informed consent which was attested in writing by the parents.

### Imitation of Manual Gestures

Stimuli were selected from an available set of videorecorded manual gestures including signs lexicalized in SSL but not BSL (chosen to be familiar to the DHH participants but not the hearing participants), signs lexicalized in BSL but not SSL (chosen to be unfamiliar to the DHH participants but phonologically plausible) and non-signs, that is manual gestures that violate the phonological rules of both sign languages or contain combinations of phonological parameters that do not occur in either language (cf. Orfanidou et al., 2009; Cardin et al., 2016; Rudner et al., 2016). A total of nine videos of bimanual gestures were selected, three of each type (see **Figure 1**). To keep facial expressions neutral across all types of manual gestures, non-manual features of the SSL and BSL signs were not performed. Videos were of high definition quality (1080 × 720 pixels) and were presented at the center of the screen of a laptop (15.4 inches) with presentation software DMDX (version 4.1.2.0; Forster and Forster, 2003).

The order of presentation was randomized separately on the two occasions. As an introduction to the task, the participant was given the following instruction: "Now, you are going to see some videos on the computer. In each video, there is a man who will do something. I want you to watch carefully what he does." This instruction was given to make sure that the participant was focused on the screen before starting the test. Making sure that the participant is attentive to the target is an important part of imitation paradigms (Dickerson et al., 2013; Wang et al., 2015). When the first video had been played, the screen went blank and the child was told: "Now, it is your turn". This comment is commonly used as a neutral prompt to elicit a response in imitation paradigms (Dickerson et al., 2013; Wang et al., 2015). If the child did not initiate an imitative act (i.e., move their hands and arms in an attempt to imitate the target) within 30 s from the point at which the video ended, the instruction (i.e., "Now, it is your turn") was repeated once. When the child had responded, the test leader clicked a button to move on to the next video. If the child did not respond within 30 s of the second instruction, the test leader moved on to the next video. The same procedure was repeated for each of the remaining eight videos. Across all participants, the test leader moved on to the next video without a response being given by the child six times at T1 and two times at T2. All non-responses occurred in the DHH group.
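The prompting rule described above can be sketched as a small piece of decision logic. This is an illustrative sketch only, not the software used in the study: the function names `run_trial` and `count_non_responses` and the latency-based input format are hypothetical, and each latency is assumed to be measured from the corresponding prompt.

```python
TIMEOUT_S = 30.0  # response window after each prompt, as stated in the text
MAX_PROMPTS = 2   # the initial prompt plus one repetition


def run_trial(response_latencies):
    """Decide the outcome of one video trial.

    `response_latencies` lists, per prompt, the seconds the child took to
    start imitating, or None if no attempt was made (hypothetical format).
    Returns (responded, prompts_given).
    """
    attempts = response_latencies[:MAX_PROMPTS]
    for prompt_number, latency in enumerate(attempts, start=1):
        if latency is not None and latency <= TIMEOUT_S:
            return True, prompt_number
    # No response within the window after the repeated prompt: move on.
    return False, MAX_PROMPTS


def count_non_responses(session):
    """Count the trials in a session that ended without a response."""
    return sum(1 for trial in session if not run_trial(trial)[0])
```

A trial in which the child starts imitating 12 s after the repeated prompt would be written `run_trial([None, 12.0])` and counts as a response given on the second prompt.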

### Scoring

Test sessions were videorecorded and individual responses to target videos were coded at a later time. The order in which videos were coded was randomized for each rater. The coding procedure in the present study was inspired by earlier imitation paradigms (Meltzoff and Moore, 1977; Nordqvist et al., 2015), in which reliable coding typically can be achieved after a restricted amount of training. A visual analog scale (VAS, Rudner et al., 2012) was used instead of a categorical coding system (e.g., correct/incorrect) to maximize variance. The VAS was a horizontal line on a sheet of paper with fixed end points, "No correspondence" and "Perfect correspondence", but no intermediate grading. The precision of each individual response was rated by putting a corresponding cross on the VAS. The score was the proportion of correspondence, i.e., if the cross was halfway along the VAS, the score was 50%, and all non-responses were scored as 0. All responses were coded independently by two trained individuals and intraclass correlation coefficients were >0.70. The dependent measure was the average between-rater score across type of gesture.
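The scoring rule described above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the 100 mm line length, the function names, and the reading of the dependent measure as the mean of the two raters' scores averaged over the items of one gesture type are all assumptions.

```python
from statistics import mean


def vas_score(cross_mm, line_mm=100.0):
    """Convert the position of a cross on the VAS into a 0-100 score.

    cross_mm is the distance from the "No correspondence" end point;
    None marks a non-response, which the text says is scored as 0.
    The 100 mm default line length is an assumption.
    """
    if cross_mm is None:
        return 0.0
    if not 0 <= cross_mm <= line_mm:
        raise ValueError("cross must lie on the line")
    return 100.0 * cross_mm / line_mm


def gesture_type_score(rater_a_mm, rater_b_mm, line_mm=100.0):
    """Average the two raters per item, then average over the items
    belonging to one gesture type (hypothetical reading of the text)."""
    per_item = [
        mean((vas_score(a, line_mm), vas_score(b, line_mm)))
        for a, b in zip(rater_a_mm, rater_b_mm)
    ]
    return mean(per_item)


# Hypothetical ratings for the three items of one gesture type;
# the third item was a non-response.
example = gesture_type_score([50, 80, None], [60, 70, None])
```

In this sketch a non-response pulls the average down because it contributes a 0, which mirrors the rule stated in the text.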


FIGURE 1 | Stimulus material for the imitation task showing stills of the start, middle, and end positions. SSL, Swedish Sign Language; BSL, British Sign Language; NS, Non-signs.

### Predictor Variables

### Language Skills

All participants performed a phonological decision task (Cross-Modal Phonological Awareness Test) in their first language, SSL for DHH participants and Swedish for hearing participants, two Swedish word reading tasks (lexical decision and Wordchains, Jacobson, 2001), and one Swedish reading comprehension task (Woodcock Passage Reading Comprehension Test, WPRC). In addition, DHH participants performed an SSL comprehension test.

### Swedish Sign Language Comprehension

The SSL Receptive Skills Test (see Holmer et al., 2016), an adaptation of a BSL original (Herman et al., 1999), was administered to the DHH participants as a measure of SSL comprehension. Forty videos of SSL sentences were presented one at a time to the participant who had to judge which picture out of three or four alternatives best represented the meaning of each sentence. The test was administered by trained native SSL users. One point was awarded for each correct response and the dependent measure was the number of correct responses. For two of the participants, scores pertained to testing less than 12 months before T1.

### Cross-modal Phonological Awareness Test

The Cross-modal Phonological Awareness Test (C-PhAT; Holmer et al., 2016) was used to assess phonological awareness. The C-PhAT can be used to assess phonological awareness of both SSL and Swedish using the same materials (cf. Andin et al., 2014). In the present study, DHH participants performed the SSL version (C-PhAT-SSL) and hearing participants performed the Swedish version (C-PhAT-Swed). In both versions, pairs of printed characters (i.e., digits and letters) were presented on a laptop (15.4 inches screen) in presentation software DMDX (version 4.1.2.0; Forster and Forster, 2003). The participant determined if the phonological labels of the printed characters were phonologically similar or not. In the SSL version this required determining whether or not they shared a handshape in the Swedish Manual Alphabet or Manual Numeral System (C-PhAT-SSL) and in the Swedish version this involved determining whether or not they rhymed in Swedish (C-PhAT-Swed), see **Table 1**. Button-press responses were given. The number of hits was adjusted for false alarms in accordance with signal detection theory (Swets et al., 1961); thus, *d'* was the dependent measure on both versions of the task.
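The adjustment of hits for false alarms can be illustrated with the standard signal detection formula, *d'* = z(hit rate) − z(false alarm rate). The sketch below is an illustration under assumptions: the text does not say how hit or false alarm rates of exactly 0 or 1 were handled, so the log-linear correction used here (adding 0.5 to each count) is a common convention and not necessarily the one applied in the study. The function name and the trial counts are hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false alarm rate).

    Rates are computed with a log-linear correction (an assumption,
    not stated in the source) so that perfect or zero rates do not
    produce infinite z values.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts: responding at chance gives d' of 0,
# and better-than-chance responding gives d' above 0.
chance = d_prime(10, 10, 10, 10)
good = d_prime(18, 2, 3, 17)
```

A *d'* above 0 indicates better than chance discrimination, which is how the scores are interpreted in the table notes.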

### Word Reading

Two measures of word reading were administered to both groups. The first task was a lexical decision task, in which participants were presented with three-letter items in lowercase on a laptop (15.4 inches screen) with presentation software DMDX (version 4.1.2.0; Forster and Forster, 2003). Items were real words, pseudowords (i.e., items that are pronounceable and look like real Swedish words but lack meaning) and non-words (i.e., items that cannot be real words in Swedish) presented one at a time on the screen in a set order and the participant decided, for each item, if it was a real word in Swedish or not. There were 20 real words, 10 pseudowords, and 10 non-words. Responses were made by

The second task that was used to assess word reading was Wordchains (Jacobson, 2001), an established test in the Nordic countries (e.g., Asbjørnsen et al., 2010). In this task, the participant was presented with uninterrupted strings of characters that could be separated by pen strokes into three different Swedish words, e.g., hej|mat|snö (in English, hi|food|snow). In total, there were 60 different wordchains evenly distributed on 20 rows on a sheet, and the participant had 2 min to solve as many chains as possible. The participant practiced the task with three separate chains and was instructed how to correct an erroneous response before testing commenced. The dependent measure was the number of chains correctly completed within the 2 min time limit. The two tests of word reading were combined into a word reading score, by converting raw data to normal scores and then averaging the normal scores into one single variable.
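The combination of the two word reading tests can be sketched as below. The sketch assumes that "normal scores" means z-standardized scores (mean 0, SD 1 within the sample); a rank-based normal transform would give different values, and the source does not say which was used. Function names and data are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize raw scores to mean 0 and sample SD 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def word_reading_composite(wordchains_raw, lexical_decision_dprime):
    """Average the two standardized measures per participant.

    Both lists must be in the same participant order.
    """
    z_wc = z_scores(wordchains_raw)
    z_ld = z_scores(lexical_decision_dprime)
    return [mean(pair) for pair in zip(z_wc, z_ld)]


# Hypothetical scores for three participants.
composite = word_reading_composite([10, 20, 30], [1.0, 2.0, 3.0])
```

Standardizing before averaging keeps the measure with the larger raw scale (here the Wordchains count) from dominating the composite.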

### Woodcock Passage Reading Comprehension

The Swedish version of the WPRC test (Furnes and Samuelsson, 2009) was used as a measure of Swedish language comprehension. In this test, passages of text of different length in which one word is omitted were presented to the participant. Hearing participants had to say or write a word that completed the passage; DHH participants could answer by providing an appropriate sign, or by saying or writing a word. At the beginning of the test, passages consist of single three-word sentences and at the end of the test, passages include several sentences with both main and subordinate clauses. Testing was stopped after a sequence of six consecutive errors. In total there were 68 passages, and the dependent measure was the number of correct answers.

### Motor Skill

To assess motor control, a bead threading task was used (White et al., 2006). Participants threaded nine colored wooden beads of different shapes onto an 8 mm thick string with a knot in the end. The task was administered twice and the participants were asked to thread the beads onto the string as fast as possible. The fastest completion time in seconds across the two trials was the dependent measure.

TABLE 1 | Examples of pairs in the Cross-Modal Phonological Awareness Test that have similar phonological labels in Swedish (Category 1); in the Swedish manual alphabet or manual numeral systems (Category 2); and in neither (Category 3).


*SMS, Swedish Manual Alphabet and Manual Numeral System.*

### Working Memory

The Clown test (Sundqvist and Rönnberg, 2010; Birberg Thornberg, 2011), based on the Mr. Peanut task (Kemps et al., 2000), was used as a measure of visual working memory. A clown figure on a magnetic board with varying numbers of magnets placed at different locations was shown to the participant. The figure was then turned away from the participant, the magnets were removed, and the participant had to say the color of the magnets. After that, the figure was once again turned towards the participant who was given the magnets and instructed to reproduce the pattern presented earlier. The number of magnets increased from one at the first level, up to a maximum of ten. There were three trials on each level and on each trial the magnets were all of the same color (red, blue, or yellow) and placed in a pre-defined order. Two incorrect answers on one level led to discontinuation of the task. One point was awarded for each correct trial, and the dependent measure was total score.
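The scoring and discontinuation rule of the Clown test can be sketched as follows. The sketch assumes that testing stops immediately at the second incorrect trial within a level; the text does not state whether a remaining trial on that level would still be administered. The function name and the response pattern are hypothetical.

```python
def clown_total_score(levels):
    """Total score: one point per correct trial.

    levels is a list of levels, each a list of three booleans
    (True = correct trial). Testing is discontinued as soon as a
    level contains two incorrect trials (assumed to be immediate).
    """
    total = 0
    for trials in levels:
        errors = 0
        for correct in trials:
            if correct:
                total += 1
            else:
                errors += 1
                if errors == 2:
                    return total
    return total


# Hypothetical participant: all correct at level 1, one error at
# level 2, two errors at level 3, so level 4 is never administered.
score = clown_total_score([
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [True, True, True],
])
```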

# Data Analysis

First, descriptive statistics were calculated and between group differences were investigated. In a second step, a repeated measures split-plot ANOVA was conducted with two within group factors: occasion with two levels (T1, T2), and type of manual gesture with three levels (SSL, BSL, non-signs), and one between group factor with two levels (DHH, Hearing). *Post hoc* analyses and exploration of simple main effects were then performed. In the final step, correlational analysis of relations between predictor variables (SSL comprehension, NVIQ, Working memory, Bead threading, C-PhAT, Word reading, and WPRC) at T1 and imitative ability (average score across all responses) at both occasions was conducted.

Some violations of normality were detected on the predictor variables in the hearing group. Thus, parametric and nonparametric methods for between group comparisons and correlations were compared in analyses involving these measures. No differences were detected between approaches and therefore we only report results from parametric methods (i.e., *t*-tests and Pearson *r*). A two-tailed significance level of 0.05 was applied, and to obtain maximum power, despite low *n*, no correction was made for multiple comparisons. Descriptive statistics, correlations and the split-plot ANOVA, with *post hoc* tests, were conducted using IBM SPSS Statistics (Version 22.0), and simple main effects were calculated manually in Microsoft Excel (2013) following the recommendations of Kirk (1994).

### Missing Data

For one DHH participant all responses on the imitation task were missing at both occasions. In addition, a full set of responses on the same task was missing from another DHH participant at T1 and one further DHH participant at T2. One full set of imitation responses was also missing from one hearing participant at T2. All these responses were missing due to technical errors. In addition, one further hearing participant failed to perform the imitation task at T2. A number of responses were coded as missing because they were performed out of picture. This applied to three responses from one DHH participant at T1, and one response each from another DHH participant and two hearing participants at T2. Finally, one DHH participant did not do the test of SSL comprehension.

When calculating average imitation scores on the three types of manual gestures (SSL, BSL, and non-signs) and the average imitation score across all items in the task, all available data for each individual were used. In statistical analyses, the missing completely at random (MCAR) mechanism was assumed, i.e., absence of data was assumed to be entirely haphazard (Enders, 2010). Listwise (in ANOVA) or pairwise (in correlations and regression) deletion was used to handle missing data, since these procedures provide unbiased estimates under the MCAR mechanism (Enders, 2010).
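The difference between the two deletion strategies can be illustrated with a small sketch. The variable names and values are hypothetical and serve only to show which cases each strategy retains; the analyses themselves were run in SPSS.

```python
def listwise(rows):
    """Keep only cases with complete data on every variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, x, y):
    """Keep every case that has data on the two variables being
    correlated, regardless of missing values elsewhere."""
    return [(r[x], r[y]) for r in rows
            if r[x] is not None and r[y] is not None]


# Hypothetical data set with one missing value in three of four cases.
data = [
    {"imit_t1": 50, "imit_t2": 60, "wprc": 12},
    {"imit_t1": None, "imit_t2": 70, "wprc": 15},
    {"imit_t1": 40, "imit_t2": None, "wprc": 9},
    {"imit_t1": 55, "imit_t2": 65, "wprc": None},
]

complete_cases = listwise(data)                    # one case survives
t1_t2_pairs = pairwise(data, "imit_t1", "imit_t2")  # two cases survive
t2_wprc_pairs = pairwise(data, "imit_t2", "wprc")   # two cases survive
```

Pairwise deletion retains more data per analysis, at the cost that different correlations may be based on different subsets of participants.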

# RESULTS

# Descriptive Statistics

There were no differences between groups on gender distribution, χ<sup>2</sup>(1) = 0.01, *p* = 0.92, NVIQ, Working memory, Bead threading, or Word reading (see **Table 2**). DHH participants were older than hearing participants, *t*(12.2) = 4.0, *p* = 0.002, but performed worse than them on WPRC (see **Table 2**). Girls outperformed boys on Bead threading at both occasions in both groups (*p*s < 0.05). No other gender differences were revealed (*p*s > 0.05). Age and NVIQ were unrelated to performance on the imitation task in both groups (*p*s > 0.05).

# Imitation Task

Performance on the imitation task is presented in **Table 3**. In the split-plot ANOVA, the assumption of sphericity was satisfied and error variances were homogeneous on imitation of all types of gestures across groups. The main effects were statistically significant: occasion, *F*(1,42) = 45.5, η<sub>p</sub><sup>2</sup> = 0.52, *p* < 0.001; type of manual gesture, *F*(2,84) = 4.74, η<sub>p</sub><sup>2</sup> = 0.10, *p* = 0.011; and group, *F*(1,42) = 8.27, η<sub>p</sub><sup>2</sup> = 0.16, *p* = 0.006; as was the group by occasion interaction, *F*(1,42) = 10.7, η<sub>p</sub><sup>2</sup> = 0.20, *p* = 0.002 (see **Figure 2**). The group by type of manual gesture interaction was not significant, *F*(2,84) = 0.96, η<sub>p</sub><sup>2</sup> = 0.02, *p* = 0.39, contrary to our initial prediction that DHH signing participants would perform better on the SSL signs than on both the BSL signs and the non-signs. All other interactions were also non-significant (*p*s > 0.05). Removing the non-responses of DHH participants from the imitation scores did not change the results.
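The reported effect sizes can be checked against the reported *F* values using the standard identity for partial eta squared, η<sub>p</sub><sup>2</sup> = (*F* × df<sub>effect</sub>) / (*F* × df<sub>effect</sub> + df<sub>error</sub>). The sketch below is a reader's consistency check, not part of the original analysis; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its
    degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values and degrees of freedom as reported in the text.
reported = {
    "occasion": (45.5, 1, 42),
    "gesture type": (4.74, 2, 84),
    "group": (8.27, 1, 42),
    "group x occasion": (10.7, 1, 42),
    "group x gesture type": (0.96, 2, 84),
}

effect_sizes = {
    name: round(partial_eta_squared(f, df1, df2), 2)
    for name, (f, df1, df2) in reported.items()
}
```

Rounded to two decimals, the recomputed values agree with the effect sizes reported above (0.52, 0.10, 0.16, 0.20, and 0.02).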

*Post hoc* analyses of the main effects revealed that performance was better at the second occasion (T2) than at the first occasion (T1), mean difference = 10.0, and that DHH participants outperformed hearing participants, mean difference = 9.50. The mean differences across groups between imitation of SSL and of non-signs (4.24), as well as between imitation of BSL and of non-signs (5.55) were statistically significant, showing that imitation of non-signs was poorer than imitation of both SSL and BSL signs. However, there was no difference in performance between SSL and BSL (see **Figure 3**). Simple main effects of the group by occasion interaction revealed that the performance of both DHH, *F*(1,9) = 10.9, *r* = 0.72, *p* = 0.009, and hearing participants, *F*(1,33) = 4.46, *r* = 0.34, *p* = 0.042, improved over time. Further, the DHH group outperformed the hearing group at T2, *F*(1,45) = 19.0, *r* = 0.55, *p* < 0.001, but not T1, *F*(1,45) = 1.96, *r* = 0.20, *p* = 0.17. Thus, in contrast to what was predicted, the DHH group did not have an initial advantage on the task, nor did hearing participants have a steeper development between the two occasions than did DHH children. Rather, DHH children showed a stronger development than hearing children, as evident from the significant group by occasion interaction.

# Predicting Performance on the Imitation Task

The correlations between predictor variables (NVIQ, SSL comprehension, Working memory, Bead threading, Cross-modal Phonological Awareness Test, Word reading, and WPRC) at T1 and performance on the imitation task at both occasions were explored to investigate our predictions (see **Table 4**). For DHH participants, imitative precision at T1 predicted imitative precision at T2, *r*(10) = 0.65, *p* = 0.040. Partial support for our initial prediction that sign language skills should be related to imitative ability was found in the pattern of correlations. Word reading at T1 was significantly associated with imitative ability at both T1, *r*(11) = 0.70, *p* = 0.016, and T2, *r*(11) = 0.80, *p* = 0.003. Further, performance on the imitation task at T2 was predicted by SSL comprehension, *r*(11) = 0.70, *p* = 0.017, and phonological awareness, *r*(11) = 0.64, *p* = 0.035, at T1. Excluding non-responses from imitation scores did not affect the correlational pattern.

As for DHH participants, imitative precision at T1 was related to imitative precision at T2 for hearing participants, *r*(34) = 0.66, *p* < 0.001, indicating stability in performance on the imitation task over time for both samples. Further, for the hearing participants, scores on WPRC at T1 predicted performance on the imitation task at T2, *r*(34) = 0.43, *p* = 0.012. Thus, the overall pattern indicates a connection between language and imitation of manual gestures. However, connections are more broadly distributed for DHH signing than for hearing non-signing children.

To test the predictive power of language comprehension on imitative ability at T2 for hearing participants, a hierarchical regression analysis was conducted. In the first step, imitative ability at T2 was regressed on imitative ability at T1. In a second step, Bead threading and Working memory were included, to control for variance accounted for by motor skills and working memory. In the third and final step, WPRC was added as a predictor (see **Table 5**). The addition of WPRC led to a Δ*R*<sup>2</sup> of 0.09 which was significant, *F*(1,31) = 5.83, *p* = 0.022, and the final model explained 49.9% of the variance in imitative ability at T2, *F*(4,31) = 7.72, *p* < 0.001. Errors were normally distributed and inspection of the scatterplot between residuals and predicted values indicated homoscedasticity.
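The relation between the reported *R*<sup>2</sup> values and the *F* tests can be sketched with the standard formulas for the overall model test and for the *R*<sup>2</sup> change test. This is a reader's consistency check based on the rounded values given in the text, not a reproduction of the SPSS analysis; function names are hypothetical.

```python
def f_overall(r2, n_predictors, df_residual):
    """F test for the full model: explained variance per predictor
    over unexplained variance per residual degree of freedom."""
    return (r2 / n_predictors) / ((1 - r2) / df_residual)


def f_change(delta_r2, r2_full, n_added, df_residual):
    """F test for the increase in R squared when n_added predictors
    enter the model; the error term comes from the full model."""
    return (delta_r2 / n_added) / ((1 - r2_full) / df_residual)


# Rounded values as reported: R2 = 0.499 with four predictors and
# 31 residual degrees of freedom; WPRC added 0.09 in the last step.
overall = f_overall(0.499, 4, 31)
change = f_change(0.09, 0.499, 1, 31)
```

With these rounded inputs the overall *F* comes out at about 7.72, matching the reported value. The change *F* comes out at about 5.57 rather than the reported 5.83, a gap that is within what rounding Δ*R*<sup>2</sup> to two decimals allows.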

# DISCUSSION

In the present study, we elicited imitation of manual gestures from Swedish DHH signing children and hearing non-signing children at similar levels of cognitive and language development, with the aim of studying how pre-existing linguistic knowledge influences precision of imitation. We predicted that the DHH signing children would be better at imitating manual gestures lexicalized in their own sign language (SSL) than unfamiliar BSL signs, and that both groups would be better at imitating lexical signs (SSL and BSL) than non-signs. We also predicted that the hearing non-signing children would perform worse than DHH signing children with all types of gestures the first time we elicited imitation, but that the performance gap between groups would be reduced when imitation was elicited a second time. Finally, we predicted that imitation performance on both occasions would be associated with linguistic skills, especially in the manual modality.

TABLE 2 | Descriptive statistics and between group *t*-tests for predictor variables.


*DHH, Deaf and hard-of-hearing; SSLC, Swedish Sign Language comprehension (raw score); NVIQ, Non-verbal intelligence (raw score); WM, Working memory (raw score); BT, Bead threading; C-PhAT, Cross-Modal Phonological Awareness Test, SSL version (C-PhAT-SSL) for DHH participants and Swedish version (C-PhAT-Swed) for hearing participants (d' scores); WC, Wordchains (raw score); LD, Lexical decision (d' scores); WPRC, Woodcock Passage Reading Comprehension (raw scores).*

<sup>a</sup>*n = 12.* <sup>b</sup>*Data also reported in Holmer et al. (2016).*

<sup>c</sup>*d', a value* > *0 indicates better than chance performance.*


# No Effect of Familiarity

Contrary to our prediction, we did not find any evidence that pre-existing knowledge of SSL improved precision of imitation of signs lexicalized in SSL compared to signs lexicalized in another sign language (BSL) for the DHH signing participants. We derived our prediction from the ELU model, which states that language processing is rapid and automatic if input matches pre-existing phonological and semantic representations (Rönnberg et al., 2013). We reasoned that because, for the DHH signing participants, the repertoire of phonological components is similar for SSL and BSL (Rudner et al., 2016), the unfamiliar BSL signs would match existing phonological representations. However, because the cohort of lexical candidates activated by BSL signs would not be constrained by meaning (Marslen-Wilson, 1987), our assumption was that a better match would be obtained with SSL signs than with BSL signs, leading to better imitation for the DHH signing participants.

It is possible that the three specific SSL items chosen in the present study did not match the existing representations of the DHH signing participants because they had not yet been acquired. However, we deem this unlikely as the items were commonly occurring. Another possibility is that the number of participants and the number of trials were too small to detect this effect. However, this is also unlikely because the experiment was repeated on a second occasion. Thus, the present results strongly suggest that in DHH signing children who are at an early stage of their reading development, pre-existing semantic representation does not enhance imitation more than pre-existing phonological representation. There are examples relating to deaf signing adults of pre-existing semantic representation not influencing either behavior (Rudner et al., 2016) or neural processing (Petitto et al., 2000; Cardin et al., 2016), and it has been argued that this may be due to the fact that the phonology of sign language often carries semantic information (Thompson et al., 2012). One interpretation of the absence of an effect of sign familiarity in the present study is that for sign language users, semantic representation does not constrain the cohort of lexical candidates activated by phonologically plausible exemplars. Thus, sign-related semantic representation may not play the same role as speech-related semantic representation in the mechanism described by the ELU model (Rönnberg et al., 2013).

It is important to note that the target items used in the present study did not include non-manual gestures. Non-manual aspects of lexical signs may be important for achieving a match between an incoming signal of degraded quality and existing representations in the mature mental lexicon (Quer and Steinbach, 2015) and thus contribute to ease of language understanding. Such an effect is likely to be even more important in the developing language system. Thus, future work should investigate the role of non-manual components in the ability of DHH signing children to imitate signs in their own and unfamiliar sign languages.

### Effect of Lexicality


Recent studies indicate that even for non-signers, non-signs are more difficult to process than lexical signs (Cardin et al., 2016; Rudner et al., 2016). This suggests that it is more demanding to process manual gestures that break the phonological rules of signed languages, even for individuals with no previous knowledge of sign language. The implication of this is that the phonological characteristics of a language may arise as a consequence of more efficient neural processing for its perception and production. Thus, we predicted that in the present study, both groups would be better at imitating lexicalized signs (both SSL and BSL) than non-signs. This was exactly what we found.

FIGURE 2 | Overall performance on the imitation task (average score across all available items; 100 on the Y-axis represents ratings of perfect correspondence between target and response) for deaf and hard-of-hearing (DHH) and hearing participants at T1 and T2. Error bars represent ±1 SE. ∗∗∗*p* < 0.001.

Other work indicates that it is easier to imitate meaningful acts (e.g., pantomimes of object use) than novel, meaningless acts (Tessari and Rumiati, 2004), and it has also been suggested that imitation builds on understanding intent and goal-directedness of an action (Bekkering et al., 2000; Want and Gattis, 2005). Thus, more precise imitation of lexical signs than non-signs in the present study may be driven by differences in the perceived meaningfulness, intent and goal-directedness of the items as well as in inherent motor patterns. Future work should use sign-based stimuli generated by computerized avatars to separate the effects of phonologically legal motor patterning on the one hand and meaningfulness, intent and goal-directedness on the other.

*DHH, Deaf and hard-of-hearing; WR, Word reading; WPRC, Woodcock Passage Reading Comprehension; SSLC, Swedish Sign Language comprehension; C-PhAT, Cross-Modal Phonological Awareness Test, SSL version (C-PhAT-SSL) for DHH participants and Swedish version (C-PhAT-Swed) for hearing participants; WM, Working memory; BT, Bead threading; NVIQ, Non-verbal intelligence.* <sup>a</sup> *Two missing cases.*

<sup>∗</sup>*p* < *0.05, two-tailed.* ∗∗*p* < *0.01, two-tailed.* †*p* < *0.05, one-tailed.*

Surprisingly, the DHH signing children were no more precise in their imitation of lexical signs than the hearing non-signing children. The failure to find any difference between groups might in part be due to statistical issues relating to diverging variances across groups or the form of distributions on variables. However, statistical tests indicated equal variances across groups as well as normally distributed imitation scores, indicating that these factors did not influence results, although it should be noted that the power to detect such violations was restricted. Thus, we found no evidence to support the notion that pre-existing phonological representation facilitates imitation of unfamiliar but phonologically acceptable manual gestures, but we cannot rule out that this may have been due in part to methodological issues.

### Effect of Prior Imitation

Both groups were more precise in their imitation of manual gestures the second time round. We had predicted that the increment would be greater for hearing children than for the DHH signing children. This prediction was based on the notion that pre-existing representation would facilitate language processing, in line with the ELU model (Rönnberg et al., 2013). Specifically, we predicted that the DHH group would have an advantage over the hearing group at the first occasion (T1). However, we predicted that this advantage would diminish at the second occasion (T2) because the hearing children would be able to make use of the representations they had encoded into episodic long-term memory at T1. In fact, the opposite was true. While there was no difference between groups in precision of imitation at T1, the DHH group produced more precise imitations at T2 than the hearing children. This fits in with the lack of evidence that pre-existing linguistic representation facilitated imitation.

TABLE 5 | Hierarchical regression model for predicting performance of hearing participants on the imitation task at T2.


The pattern of results suggests that T1 provided an opportunity for both groups to establish representations that they could exploit at T2. The fact that the improvement in imitation over time did not interact with stimulus type strengthens the notion that pre-existing linguistic representation does not support imitation and suggests that the improvement in imitation performance at T2 was driven by the ability to form item-specific representations, irrespective of lexicality. However, the fact that the DHH group showed a greater improvement in imitation ability over time suggests that they were more successful than the hearing group in exploiting those item-specific representations.

### Correlations with Language Skills

Language skills assessed at T1 predicted precision of imitation at T2 for both groups. In particular, for the DHH group, SSL phonological awareness measured using the C-PhAT (Holmer et al., 2016), SSL proficiency, measured using an SSL comprehension test, and Swedish word reading all strongly predicted precision of imitation at T2. Imitation at T1, however, was only significantly correlated with word reading, although the correlation with SSL phonological awareness was also marginally significant. This pattern of correlations suggests that SSL skills, including phonological awareness and comprehension, are mobilized during imitation, but only when adequate representations have already been established. Further, the correlation with word reading may also suggest mobilization of sign language skills, as written words seem to be recoded to their corresponding signs in DHH signers (Leinenger, 2014). The relation between sign language skills and imitation of manual gestures should be investigated in larger samples in future studies.

For the hearing group, reading comprehension at T1, a proxy for speech-based representation, correlated significantly with precision of imitation at T2, whereas none of the language variables correlated with precision of imitation at T1. Indeed, regression analysis showed that reading comprehension at T1 explained unique variance in imitation precision at T2, above and beyond variance explained by imitation precision at T1, motor skill and working memory. This suggests that hearing non-signing children mobilize language comprehension skills during imitation of manual gestures, rather than motor skills or working memory, but only when adequate representations have already been established. Taken together, the pattern of correlations across groups provides support for our prediction of a positive relationship between imitation and linguistic skills, especially in the manual modality.

### Overall Interpretation

The specific predictions relating to the influence of pre-existing semantic and phonological representation on precision of imitation were based on the limited number of studies performed to date. In any small field, the results of any new study may be at least partly unexpected and that was the case here. The pattern of results revealed by the present study suggests that for children whose language skills are still developing, the establishment of item-specific representations of manual gestures is supported by both domain general and modality specific skills. Specifically, DHH signing children seem to be able to make use of modality specific language skills, although not pre-existing linguistic representations, to establish new representations of manual gestures, while establishment of manual representations in hearing non-signing children seems to be supported by the domain general aspect of language processing.

These modality-specific findings suggest that the ELU model (Rönnberg et al., 2013) cannot be applied directly to sign language, at least with reference to the developing language system. Hence, we suggest a modified version of the ELU model, i.e., a D-ELU model (see **Figure 4**). Like ELU, D-ELU emphasizes the importance of a good match between language input and pre-existing representations for language formation. However, whereas ELU predicts domain specific explicit processing when there is a mismatch between input and existing representations, D-ELU predicts that when there is a mismatch between input signal and stored linguistic representations in the developing language system, the explicit processing loop engages both domain general representations (e.g., semantic long-term memory) and domain specific representations (e.g., sign-specific phonology) in the analysis of the incoming language signal. This process leads to establishment of new representations or a redefinition of stored representations, a notion in line with perceptual magnet theory (Kuhl, 1991) which predicts a warping of the perceptual space around phonological representations as learning progresses. In comparison to the mature language system which is more tolerant of phonological diversity, this process is qualitatively different. Thus, an adaptation of the ELU model for the developing language system is warranted. Interestingly, changes in phonological representation are also characteristic of individuals with post-lingual hearing loss (Classon et al., 2013). Thus, one possibility is that D-ELU could also help us understand ELU towards the end of the lifespan. In order to account for the lack of interaction between phonology and semantics in sign language processing, reported both here and in earlier studies (Cardin et al., 2016; Rudner et al., 2016), a sign specific component should be reintroduced into the model (cf. Rönnberg et al., 2008).
Future work should test the generalizability of the proposed D-ELU model by investigating the role of language skills across modalities in establishment of linguistic representations.

# CONCLUSION

The act of imitation allows both DHH signing and hearing non-signing children to establish specific representations which together with language skills facilitate future imitation. This set of findings prompts an adaptation of the ELU model, D-ELU.

# AUTHOR CONTRIBUTIONS

EH, MH, and MR designed the study. EH co-ordinated data collection and coding, and performed the statistical analyses. EH prepared the first draft of the article and all authors contributed to the final version.

# FUNDING

This work was supported by grant number 2008-0846 to MR from the Swedish Research Council for Health, Working Life and Welfare.

# ACKNOWLEDGMENTS

The authors would like to thank the children who participated in this study and their parents as well as the schools for giving us access to their facilities. Thanks to Jenny Carlsson, Gunilla Turesson-Morais, Hanna Åkerblom, Elisabeth Thilén, Lisbeth Wikström, Sara Moritz, Malin Eriksson, Lina Larsson and Moa Claar for help with data collection. Thanks also to Mia-Mari Stråle, Sofia Szadlo, and Malin Jönsson for help with coding of the data; Magnus Ryttervik for translating administration instructions into SSL; Annette Sundqvist and Katarina Forssén for technical assistance.

# REFERENCES


White, S., Milne, E., Rosen, S., Hansen, P., Swettenham, J., Frith, U., et al. (2006). The role of sensorimotor impairments in dyslexia: a multiple case study of dyslexic children. *Dev. Sci.* 9, 237–269. doi: 10.1111/j.1467-7687.2006.00483.x

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2016 Holmer, Heimann and Rudner. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# **Load and distinctness interact in working memory for lexical manual gestures**

### *Mary Rudner\*, Elena Toscano and Emil Holmer*

*Linnaeus Centre HEAD, Swedish Institute for Disability Research, Department of Behavioural Sciences and Learning, Linköping University, Sweden*

The Ease of Language Understanding model (Rönnberg et al., 2013) predicts that decreasing the distinctness of language stimuli increases working memory load; in the speech domain this notion is supported by empirical evidence. Our aim was to determine whether such an over-additive interaction can be generalized to sign processing in sign-naïve individuals and whether it is modulated by experience of computer gaming. Twenty young adults with no knowledge of sign language performed an *n*-back working memory task based on manual gestures lexicalized in sign language; the visual resolution of the signs and working memory load were manipulated. Performance was poorer when load was high and resolution was low. These two effects interacted over-additively, demonstrating that reducing the resolution of signed stimuli increases working memory load when there is no pre-existing semantic representation. This suggests that load and distinctness are handled by a shared amodal mechanism which can be revealed empirically when stimuli are degraded and load is high, even without pre-existing semantic representation. There was some evidence that the mechanism is influenced by computer gaming experience. Future work should explore how the shared mechanism is influenced by pre-existing semantic representation and sensory factors together with computer gaming experience.

### *Edited by:*

*Patrik Sörqvist, University of Gävle, Sweden*

### *Reviewed by:*

*Malte Wöstmann, Max Planck Institute, Germany; Robert W. Hughes, Royal Holloway, University of London, UK*

### *\*Correspondence:*

*Mary Rudner, Department of Behavioural Sciences and Learning, Linköping University, 581 83 Linköping, Sweden mary.rudner@liu.se*

### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

> *Received: 19 May 2015 Accepted: 23 July 2015 Published: 13 August 2015*

### *Citation:*

*Rudner M, Toscano E and Holmer E (2015) Load and distinctness interact in working memory for lexical manual gestures. Front. Psychol. 6:1147. doi: 10.3389/fpsyg.2015.01147*

**Keywords: working memory, manual gestures, load, distinctness, resolution, computer games**

# **Introduction**

Working memory is the ability to keep information in mind for a limited period of time while processing it (Baddeley, 2012). There is a close connection between working memory and communication which builds on the need to maintain and process information during receptive and productive language processing (Majerus, 2013) and in many ways, the functionality of working memory seems to be adapted to communication needs (Baddeley et al., 1998). Working memory for speech-based language has been studied extensively and it is known that capacity is greater for words than non-words (Hulme et al., 1991) and influenced by the phonological structure of to-be-remembered items (Baddeley, 2012). There is some evidence that these effects generalize to sign language but the overall picture is not clear-cut (Rudner et al., under review). Beyond linguistic aspects, working memory is influenced by memory load, operationalized either as how many, or how long, items need to be maintained, as well as the distinctness of the presented items, or how difficult it is to perceive them (Barch et al., 1997). Further, computerized training can modulate the effect of increased working memory load (Dahlin et al., 2008) and videogaming can improve cognition (Bavelier and Davidson, 2013). The purpose of the present study is to further our understanding of the language modality specificity of working memory by investigating the interplay of load and distinctness in working memory for manual gestures and its association with experience of computer games. Here, load is operationalized as the number of items maintained and distinctness as visual resolution on presentation.

Everyday listening situations are often noisy which means that the quality or distinctness of the language signal may be reduced (Mattys et al., 2012). Listening to speech in noise is more cognitively demanding than listening to speech in quiet, especially for individuals with hearing loss (Rudner and Lunner, 2014), and generates greater neural activation throughout language processing regions (Scott and McGettigan, 2013). Individuals with greater working memory capacity are better at understanding speech in noise (Rudner et al., 2011; Zekveld et al., 2011) and show less activation in language processing regions, suggesting that the neural mechanisms supporting speech understanding in noise are more efficient in this group (Zekveld et al., 2012). This set of findings supports the notion expressed in the Ease of Language Understanding model (ELU; Rönnberg et al., 2013) that mismatch arises when the incoming language signal cannot be rapidly and automatically associated with the contents of long-term memory, and that limited working memory resources are engaged in deciphering the message. According to the ELU model (Rönnberg et al., 2013), mismatch increases working memory load because degraded, and thus indistinct, elements of the speech signal need to be held longer in working memory before they can be disambiguated by accessing the corresponding item in the mental lexicon. In other words, an indistinct speech signal actually causes greater working memory load by increasing the length of time individual items need to be maintained in working memory before speech understanding is achieved, reducing the portion of total resources available for processing new items entering the system and making them more vulnerable to mismatch. Thus, the ELU model specifically predicts an over-additive (and not under-additive) interaction between load and distinctness. 
The intertwining of distinctness and load during speech understanding in noise paradigms makes it hard to distinguish the underlying mechanisms.

In a set of studies from our lab (Mishra et al., 2013a,b, 2014), participants were presented with auditory 13-item lists of two-digit numbers and required to strategically select and report back two of those numbers when the list ended. Load was manipulated by requiring the participants to additionally report in half of the trials the dummy number which was always the first item in the list. Distinctness was manipulated by presenting the items with and without background noise. Because we were interested in the effects of low-level noise on cognition, the signal to noise ratio (SNR) was adapted so that items were audible. In other words, distinctness was still relatively high. Both manipulations reduced performance, but there was no interaction between load and distinctness and thus no evidence of a shared mechanism.

In a functional magnetic resonance imaging (fMRI) study, Barch et al. (1997) also investigated the interaction of distinctness and load using a task in which a sequence of letters is presented and the participant is instructed to respond to a target letter but only when it is preceded by a particular cue letter. This task loads on working memory by requiring the participant to keep the cue letter in mind until the target letter has been presented. Load was manipulated by adapting the retention interval between cue and target, and distinctness was manipulated by removing pixels from the target. Results showed that increasing load while keeping distinctness constant was associated with greater activation of the dorsolateral prefrontal cortex. On the other hand, decreasing distinctness while keeping load constant was associated with greater activation of the anterior cingulate. Thus, the results of the study by Barch et al. (1997) suggested that the neural mechanisms underpinning load and distinctness in working memory are separate. However, it is possible that the load and distinctness manipulations in the study by Barch et al. (1997) were not strong enough to trigger a shared mechanism.

In a more recent magnetoencephalography (MEG) study, Obleser et al. (2012) investigated the combined effects of distinctness and load on the neural mechanisms underpinning working memory for digits by studying neural oscillations. In particular, changes in power in low frequency oscillations in the alpha band were used as an index of working memory load. Distinctness was manipulated using noise-vocoding at 4, 8, and 16 bands. At four bands, speech is hard to understand but digits can still be identified because they belong to a small closed set. Load was manipulated by requiring the participants to retain two, four, or six items in working memory. A significant interaction was found between load and distinctness, revealing that when load was high and distinctness low there was an increase in alpha power in temporo-parietal regions. This interaction provides evidence of a shared mechanism. Taken together, evidence suggests that although load and distinctness appear to be supported by separate mechanisms there is a threshold at which a joint mechanism may be revealed empirically.

Working memory processing is supported by a load-sensitive neural network including the dorsolateral prefrontal regions identified as load-sensitive by Barch et al. (1997) and parietal regions (Ma et al., 2014) adjacent to that supporting the interaction of load and distinctness (Obleser et al., 2012). This applies across the language modalities of sign and speech (Rudner et al., 2009). Signed languages are natural languages in the visuospatial domain with vocabulary and grammatical structure that differ from those of the surrounding spoken languages (Emmorey, 2002). Working memory for sign language additionally elicits modality-specific neural activation in the parietal lobes bilaterally (Rönnberg et al., 2004; Buchsbaum et al., 2005; Rudner et al., 2007; Bavelier et al., 2008; Pa et al., 2008), possibly reflecting activation of a capacity-limited store for representation of the visual scene (Todd and Marois, 2004; Rudner, 2015). Lexicality influences the neurocognitive processing of manual gestures, even in individuals with no knowledge of sign language (Cardin et al., 2015). Further, knowledge of a signed language enhances working memory for the signs of that language, demonstrating that pre-existing lexical representation influences working memory processing of lexical signs (Rudner et al., under review). Moreover, although increasing load reduces the capacity of working memory for signs, this effect is smaller for deaf signers than for hearing signers or non-signers (Rudner et al., under review). This means that sign language allows us not only to study whether the shared mechanism supporting load and distinctness during speech processing generalizes across language modalities, but also whether it is dependent on pre-existing lexical representation. According to flexible resource models (Ma et al., 2014), the quality of input, e.g., distinctness, influences working memory processing, even when semantic representations are absent. 
This suggests that the over-additive interaction between load and distinctness predicted by the ELU model (Rönnberg et al., 2013) should not only generalize to sign language but may also be observable even when pre-existing representation is lacking. Thus, in order to isolate the interaction of load and distinctness in working memory for manual gestures in the present study, we presented to non-signers to-be-remembered items that were lexical signs. This allowed us to control for any effects of lexicality and pre-existing semantic representation and their potential interactions.

It is established practice that sign language interpreters choose dark clothes to contrast with their signing hands, and ensure good lighting and an unobstructed line of sight to those requiring signed translation. This suggests that poor contrast and masking are sources of visual noise that impact on visual communication. Although neither deafness nor sign language use seem to be associated with changes in contrast sensitivity (Finney and Dobkins, 2001), the data compression applied during digital video communication, used frequently by signers, may influence the quality of communication (Agrafiotis et al., 2003). However, little empirical work has addressed these issues. An early study (Pavel et al., 1987) found that adding digitally generated Gaussian noise to videos of individual lexical signs reduced the ability of deaf sign language users to identify them. In particular, a critical point was observed at root mean square SNR of 0.5.

Working memory for sign language has been shown to display some of the characteristics of working memory for speech-based language (Rudner et al., 2009, 2010; Andin et al., 2013), including an effect of load (Rudner et al., under review). However, it has not hitherto been investigated whether the effect of load interacts with the distinctness of manual gestures. The ELU model is a multimodal model of working memory; in other words, it predicts similar phenomena across the language modalities of sign and speech (Rönnberg, 2003; Rönnberg et al., 2013). Based on empirical findings relating to the role of working memory during speech understanding under adverse conditions, this model predicts an over-additive interaction between working memory load and reduced distinctness. As we have argued, in the speech domain this is because an indistinct input signal causes greater working memory load by increasing the length of time individual items need to be maintained in working memory before speech understanding is achieved, reducing the portion of total resources available for processing new items entering the system and making them more vulnerable to mismatch. In the case of manual gestures, with no pre-existing representation in semantic long-term memory, mismatch will prevail. According to flexible resource models (Ma et al., 2014), the quality of input, e.g., distinctness, is important not only for achieving match but also for working memory processing as such. Thus, we predict that decreasing the distinctness of manual gestures will increase working memory load, resulting in an over-additive interaction between these two factors, even for non-signers with no corresponding representations. The main aim of the present study is to test this prediction.

Dahlin et al. (2008) showed that computerized training of updating skills led to better working memory performance when load was high. However, cognitive training programs are time-consuming and complicated to administer, and improvements in trained skills seldom transfer to untrained skills (Owen et al., 2010). Meanwhile, videogaming has become a major pastime and there is increasing evidence that playing videogames is associated with robustly enhanced visuospatial (Bavelier and Davidson, 2013) and executive (Anguera et al., 2013) skills. Thus, in the present study we asked the participants to report their experience of playing computer games and investigated whether this was associated with performance on the working memory task based on manual gestures.

As in a number of recent studies from our lab (Rudner, 2015; Rudner et al., under review), we opted to use an *n*-back working memory paradigm (Cohen et al., 1994) to investigate effects of load and distinctness on working memory performance. In the *n*-back task, series of items are presented and the task of the participant is to determine whether the current item matches the item presented *n* steps back in the series and make a "yes" or "no" button-press response. For example, if *n* = 1, the current item is compared to the immediately preceding item; if *n* = 2, the current item is compared to the last item but one. This task makes the temporary maintenance and processing demands that characterize working memory (Ma et al., 2014) and working memory load is determined by *n*; the greater the magnitude of *n*, the greater the working memory load. The stimulus items were video recordings of lexical signs that were presented either at full or at reduced resolution to manipulate distinctness.
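The matching rule of the *n*-back task can be illustrated with a minimal sketch. The function name `nback_targets` and the letter codes standing in for sign videos are hypothetical and are not taken from the study materials.

```python
from typing import List, Optional


def nback_targets(items: List[str], n: int) -> List[Optional[bool]]:
    """Label each position with whether it matches the item n steps back.

    Positions with fewer than n predecessors have no valid comparison
    and are labelled None.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    labels: List[Optional[bool]] = []
    for i, item in enumerate(items):
        if i < n:
            labels.append(None)  # nothing to compare against yet
        else:
            labels.append(item == items[i - n])  # correct answer: "yes" if True
    return labels


# Hypothetical stimulus codes, not the actual sign stimuli
sequence = ["A", "B", "B", "A", "B", "A"]
one_back = nback_targets(sequence, 1)  # low load
two_back = nback_targets(sequence, 2)  # high load
```

With the same sequence, the correct responses differ between the 1-back and 2-back versions, which is why load can be varied while the stimuli are held constant.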

We predicted that reducing the resolution of the sign stimuli would reduce performance on the *n*-back task. Further, we predicted that increasing memory load by increasing *n* would reduce performance. Moreover, we predicted an over-additive interaction such that the effect of reduced resolution would be greater when working memory load was high, empirically revealing the shared mechanism proposed by the ELU model (Rönnberg et al., 2013) and showing that it is not dependent on pre-existing semantic representation. Finally, we predicted that experience of playing computer games would be associated with working memory for manual gestures, especially when load was high and resolution low.

# **Materials and Methods**

### **Participants**

Twenty hearing participants (10 females) between 19 and 25 years (*M* = 22, SD = 1.8) took part in the study. They had no knowledge of any sign language and reported no hearing impairment. They had normal or corrected to normal vision and performed within the normal range on the Block Design subtest of WAIS-IV (Wechsler, 2008). They were all international students from Europe (18) and Asia (two), fluent in English, at Linköping University, Sweden. The study was conducted in accordance with the provisions of the Swedish Act (2003:460) concerning the Ethical Review of Research Involving Humans. Informed consent was given by all participants.

### **Materials**

The stimulus material consisted of 90 video-recorded manual gestures, each with a duration of 2–3 s. Forty-five of the gestures constituted signs lexicalized in British Sign Language and the other 45 were signs lexicalized in Swedish Sign Language. They were all generated by a male, deaf native signer of German Sign Language who was unfamiliar with both languages. Thus, the stimuli were all natural signs and did not differ in the degree to which they were produced with a foreign accent. The materials were developed in connection with a larger project (see Cardin et al., 2013). The distinction between languages is unimportant for the purposes of the present study and the British Sign Language and Swedish Sign Language materials are balanced across stimulus lists. Each of the stimuli was processed to adapt the resolution. There were five different levels of resolution: R1 (720 *×* 480 pixels); R2 (180 *×* 120 pixels); R3 (90 *×* 60 pixels); R4 (24 *×* 16 pixels); R5 (12 *×* 8 pixels), see **Figure 1**.

Ten lists of 45 stimuli each were assembled, five for each of the two load levels of the *n*-back working memory task. Each list was available with each of the five different levels of resolution. Levels of resolution were held constant within lists. All stimuli were presented at the center of a computer screen with a constant video resolution of 1280 *×* 800 pixels, irrespective of the resolution of the individual stimuli.

### **Experimental Task and Design**

An *n*-back task was used in the present study (Cohen et al., 1994). N was either one (low load) or two (high load). During the *n*-back task, lists of videos were presented with a time between stimulus onsets of 4 s and the participant was instructed to determine for each video whether it was identical to the previous video (1-back) or the previous video but one (2-back). They pressed one key for a positive response and another key for a negative response. The dependent measure was *d′* (Stanislaw and Todorov, 1999). No feedback was given.
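The sensitivity measure *d′* and the response bias measure *c* can be computed from response counts as *d′* = z(H) − z(F) and *c* = −[z(H) + z(F)]/2, where H is the hit rate and F the false-alarm rate. The sketch below is only illustrative: the function name, the example counts, and the log-linear correction for extreme rates are assumptions, since the text does not state how rates of 0 or 1 were handled.

```python
from statistics import NormalDist


def dprime_and_c(hits, misses, false_alarms, correct_rejections):
    """Return (d', c) for one participant in one condition."""
    n_targets = hits + misses
    n_nontargets = false_alarms + correct_rejections
    if n_targets == 0 or n_nontargets == 0:
        raise ValueError("need at least one target and one non-target trial")
    # Assumption: log-linear correction keeps rates away from 0 and 1,
    # where the inverse normal transform would be undefined.
    hit_rate = (hits + 0.5) / (n_targets + 1)
    fa_rate = (false_alarms + 0.5) / (n_nontargets + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    c = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, c


# Hypothetical counts, not data from the study
d_sym, c_sym = dprime_and_c(12, 3, 3, 12)        # unbiased, above chance
d_chance, c_chance = dprime_and_c(5, 5, 5, 5)    # chance performance
d_cons, c_cons = dprime_and_c(6, 9, 1, 29)       # few "yes" responses
```

Under this definition a positive *c* indicates a tendency to respond "no", and a value of zero indicates no bias.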

The within-subjects experimental design was 2 *n*-back (1-back, 2-back) *×* 5 resolution (R1–R5). Each participant performed each of the two *n*-back tasks five times, once with each level of resolution, and each time with a different list. *N*-back was blocked so that 10 of the participants performed the 1-back task followed by the 2-back task while order was reversed for the other 10. The assignment of lists to resolutions was balanced and the order of resolutions within blocks was pseudorandomized.
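One way such a schedule could be generated is sketched below. The exact balancing and randomization scheme is not described in the text, so the rotation of lists across resolutions, the seeding, and all names here are assumptions made for illustration only.

```python
import random

RESOLUTIONS = ["R1", "R2", "R3", "R4", "R5"]


def build_schedule(participant: int, seed: int = 0):
    """Return the ordered (n, resolution, list_id) runs for one participant."""
    rng = random.Random(1000 * seed + participant)
    # Half of the participants start with 1-back, the other half with 2-back.
    load_order = [1, 2] if participant % 2 == 0 else [2, 1]
    schedule = []
    for n in load_order:
        # Assumed balancing: rotate which list is paired with which resolution.
        shift = (participant // 2) % len(RESOLUTIONS)
        pairs = [
            (RESOLUTIONS[(k + shift) % len(RESOLUTIONS)], f"{n}-back list {k + 1}")
            for k in range(len(RESOLUTIONS))
        ]
        rng.shuffle(pairs)  # pseudorandom order of resolutions within the block
        schedule.extend((n, res, list_id) for res, list_id in pairs)
    return schedule


schedules = [build_schedule(p) for p in range(20)]
```

Each participant receives ten runs, one per combination of load and resolution, and never sees the same list twice.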

### **Procedure**

When the participants arrived at the laboratory, they were informed about the study and gave their written consent to participation. After providing demographic information, including how many hours a day they spent playing computer games, they performed a set of tests reported elsewhere. The test of working memory for manual gestures (*n*-back experiment) was performed at a second test session 1 month later. The *n*-back experiment was run using DMDX software (version 4.3.0.1, Forster and Forster, 2003) and took approximately 15 min to complete. The participants performed one training list for the relevant task before each block.

# **Results**

### *N***-Back Experiment**

Inspection of the *d′* scores revealed that the scores of one of the participants in the low load condition (1-back) were more than two standard deviations below the mean across all five conditions. The participant performed the 2-back task first without any difficulty (all scores were within the same range as those of the other participants) but confirmed that she was tired and did not pay attention to the subsequent 1-back task. It was therefore decided to replace the 1-back scores of that participant with the group mean for the analyses. The adjusted *d′* scores are shown in **Figure 2**.

A repeated measures analysis of variance (ANOVA) was computed on *d′* scores with two within-subject factors: working memory load at two levels (low, high) and resolution at five levels (R1–R5). The ANOVA revealed a significant main effect of working memory load, *F*(1,19) = 33.63, MSE = 0.67, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.64, and a significant main effect of resolution, *F*(4,76) = 36.79, MSE = 0.32, *p* < 0.001, η<sub>p</sub><sup>2</sup> = 0.66. There was also a significant interaction between these two factors, *F*(4,76) = 3.05, MSE = 0.26, *p* = 0.02, η<sub>p</sub><sup>2</sup> = 0.14. Investigation of this interaction, using separate ANOVAs for each of the two memory load levels, revealed that the mean difference (MD) between R1 and R5 was statistically significant at both load levels, high: MD = 1.49, *p* < 0.001; low: MD = 0.92, *p* < 0.001. However, between R1 and R4, there was a statistically significant difference at high load: MD = 0.63, *p* = 0.007 but not at low load: MD = 0.15, *p* = 0.28, see **Figure 2**. It is also interesting to note that performance at R5 differed significantly from performance at all other levels of resolution at both memory loads, all *p*s < 0.002. Further, performance at R4 differed significantly from performance at R3 for both memory loads, high: MD = 0.43, *p* = 0.03; low: MD = 0.29, *p* = 0.04. However, although there was a tendency for performance at R3 to be lower than at R2 when working memory load was high, MD = 0.34, *p* = 0.07, there was no difference when load was low, MD = 0.06, *p* = 0.52. This pattern demonstrates that working memory for manual gestures is more sensitive to resolution when working memory load is high than when it is low.

**experiment.** R1 (720 *×* 480 pixels); R2 (180 *×* 120 pixels); R3 (90 *×* 60 pixels); R4 (24 *×* 16 pixels); R5 (12 *×* 8 pixels). Error bars show standard error. Brackets show significant differences, \**p* < 0.05.
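The reported partial eta squared values can be checked against the *F* ratios and degrees of freedom using the standard identity η<sub>p</sub><sup>2</sup> = (*F* × df<sub>effect</sub>) / (*F* × df<sub>effect</sub> + df<sub>error</sub>). The following sketch, with a hypothetical function name, applies that identity to the three effects from the ANOVA on *d′* scores.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F ratios and degrees of freedom as reported in the text
load_effect = partial_eta_squared(33.63, 1, 19)
resolution_effect = partial_eta_squared(36.79, 4, 76)
interaction_effect = partial_eta_squared(3.05, 4, 76)
```

Rounded to two decimals these give 0.64, 0.66, and 0.14, matching the reported effect sizes.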

Response bias was analyzed by calculating *c* (Stanislaw and Todorov, 1999). The grand mean *c*-value was 0.14 (SD = 0.07) which was significantly different from the neutral point (0), *t*(19) = 9.45, *p* < 0.001. Repeated measures ANOVA showed no main effect of *n*, *F*(1,19) = 1.69, MSE = 0.05, *p* = 0.21, a significant but small main effect of resolution, *F*(4,76) = 3.55, MSE = 0.04, *p* = 0.01, η<sub>p</sub><sup>2</sup> = 0.16, and, importantly, no significant interaction, *F*(4,76) = 0.88, MSE = 0.04, *p* = 0.48. Pairwise comparisons showed that response bias at R1 (*c* = 0.20) was significantly greater than at R5 (*c* = 0.07), *p* = 0.01, demonstrating an increasing bias toward a positive response as resolution decreased.

### **Computer Games**

Playing action video games improves performance in a range of attentional, perceptual and cognitive tasks (Bejjanki et al., 2014). Therefore, we investigated whether experience playing computer games improved performance on the *n*-back task. Only six out of the 20 participants reported that they played computer games. Among those six, two reported playing 2 h daily, one reported playing 1 h and the other three played half an hour each. To determine whether playing computer games was associated with *n*-back performance, a between group variable was entered into the ANOVA based on whether the participant reported playing computer games or not. There was no main effect of playing computer games, *F*(1,18) = 0.11, MSE = 1.27, *p* = 0.75. However, there was a tendency toward a three-way interaction with working memory load and resolution, *F*(4,72) = 2.14, MSE = 0.25, *p* = 0.09, η<sub>p</sub><sup>2</sup> = 0.11, see **Figure 3**. Visual inspection of the interaction suggests that playing computer games may improve performance when memory load is high. To investigate this, we tested MD in performance between R5 and each of the other resolution levels for each group at the high load level. This revealed that although performance at R5 differed significantly from performance at all other levels of resolution at high memory load for non-players, all *p*s < 0.02, there was no significant difference in performance between R5 and R4 for computer gamers, *p* = 0.31. This pattern suggests that computer gamers may be less sensitive to resolution when working memory load is high than non-players.

**FIGURE 3 | Interaction between experience of playing computer games and performance on the** *n***-back working memory task.** R1 (720 *×* 480 pixels); R2 (180 *×* 120 pixels); R3 (90 *×* 60 pixels); R4 (24 *×* 16 pixels); R5 (12 *×* 8 pixels). Dark bars show mean performance for participants who did not play computer games (*n* = 14) and light bars for those who did (*n* = 6). Error bars show standard error.

To ensure that experience of playing computer games was not confounded by other variables we performed two-tailed independent samples *t*-tests to test for differences between the subgroup who played computer games and the subgroup who did not. We found no statistically significant differences in age, *t*(18) = 0.98, *p* = 0.34, or Block Design, *t*(18) = 0.17, *p* = 0.86. The two Asian students stated that they did not play computer games. Two of the computer gamers were women.

# **Discussion**

In the present study, sign-naïve participants performed an *n*-back working memory task based on videos of lexical signs. The distinctness of the stimuli and working memory load were manipulated orthogonally by varying the resolution of the videos and presenting 1-back and 2-back versions of the task in a balanced within-subjects design.

In line with our prediction, poor visual resolution and high load resulted in poorer *n*-back performance. Moreover, and also in line with our prediction, we demonstrated an over-additive interaction such that the effect of reducing visual resolution was greater when load was high. This indicates that the shared mechanism supporting processing of distinctness and load previously identified for speech processing (Obleser et al., 2012; Petersen et al., 2015) can be generalized across language modality to sign processing. What is more, it indicates that this mechanism is not dependent on pre-existing semantic representation.

Due to the scarcity of previous work on the effect of reducing visual resolution on working memory for manual gestures, our choice of resolution levels was arbitrary: R1 (720 *×* 480 = 345600 pixels); R2 (180 *×* 120 = 21600 pixels); R3 (90 *×* 60 = 5400 pixels); R4 (24 *×* 16 = 384 pixels); R5 (12 *×* 8 = 96 pixels). There was no difference in performance between R1, R2, and R3 at either load level. However, there was a statistically significant difference in performance between R1 and R5, where resolution (number of pixels) was reduced by more than 99.9%, at both load levels, and a statistically significant difference in performance between R1 and R4, where resolution (number of pixels) was reduced by just under 99.9%, at high load but not at low load. Thus, a considerable reduction in resolution was required before working memory performance was affected at either memory load. This suggests that representations adequate to solve the task could be generated even at very low resolution. In the present study, we used stimuli that are lexicalized signs and the participants were non-signers. We used this approach because it has been shown that neurocognitive representation of lexical signs is different from that of non-signs, even in non-signers (Cardin et al., 2015), and that pre-existing semantic representation enhances working memory for manual gestures (Rudner et al., under review). It is likely that sign language users who have pre-existing representations of lexical signs will have more robust performance at lower resolutions than non-signers. Future work should use sign language to investigate the interaction between load, distinctness and pre-existing representation.
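The pixel counts and the percentage reduction relative to R1 follow directly from the stated frame dimensions. A minimal sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Frame dimensions (width, height) for each resolution level, as stated in the text
LEVELS = {
    "R1": (720, 480),
    "R2": (180, 120),
    "R3": (90, 60),
    "R4": (24, 16),
    "R5": (12, 8),
}

# Total number of pixels per frame at each level
pixels = {name: width * height for name, (width, height) in LEVELS.items()}

# Percentage reduction in pixel count relative to full resolution (R1)
reduction = {
    name: 100.0 * (1.0 - count / pixels["R1"]) for name, count in pixels.items()
}

for name in LEVELS:
    print(f"{name}: {pixels[name]:>6} pixels, reduced by {reduction[name]:.2f}%")
```

On these figures, R4 and R5 both retain well under 1% of the pixels of R1, whereas R2 and R3 retain roughly 6% and 1.6% respectively.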

In a recent study (Rudner et al., under review), we showed that deafness mitigates the effect of increasing working memory load manipulated using an *n*-back task based on manual gestures. In that study, *n* was manipulated at three levels (*n* = 1, 2, 3). We found that although signers were able to perform above chance when load was high at *n* = 3, the performance of non-signers was significantly lower. Thus, in the present study, we decided to use only two load levels, *n* = 1 and *n* = 2, omitting *n* = 3, because we considered that the performance of the non-signers in the present study would be too poor to reveal any further effects of stimulus degradation. However, Obleser et al. (2012) showed a potentiation of alpha power when working memory load was high and distinctness was low. Using a similar paradigm, Petersen et al. (2015) showed that hearing loss also increased alpha power, but that when load was high and distinctness was low, alpha power actually dropped for the individuals with the most severe degree of hearing loss, despite amplification. This was interpreted as indicating a breakdown in the mechanism supporting working memory at high load when stimulus distinctness is poor. Further, language modality-specific differences in working memory processing have been shown to emerge when cognitive demands are high (Rudner and Rönnberg, 2008). Thus, future studies should investigate how differing degrees of sensory acuity and long-term sensory deprivation with and without technical intervention interact with load, distinctness and pre-existing representation.

There has been considerable interest in cognitive training and its potential for increasing the performance in various domains of groups of individuals with functional impairments. Some studies have shown significant effects of cognitive training (e.g., Dahlin et al., 2008) but generally, transfer to other cognitive functions has been lacking (Owen et al., 2010). However, a body of work is now emerging that shows effects of videogaming on cognition (for an overview, see Bavelier and Davidson, 2013). In the present study, we asked the participants to report how many hours a day they spent playing computer games. We were surprised to find that only six out of the 20 participants played computer games at all. Notwithstanding, we found evidence to suggest that the individuals who stated that they played computer games were less affected by increasing levels of stimulus degradation when working memory load was high. Comparison of the two subgroups gives no grounds to suppose that these results are biased by age, gender, non-verbal intelligence or cultural background. This finding is in line with recent work showing superior attentional and oculomotor control generalizing to biologically relevant stimuli in students reporting playing action video games a minimum of three hours per week during the previous six months compared to matched non-players (Chisholm and Kingstone, 2015).

Bejjanki et al. (2014) suggested that videogaming may drive a general learning mechanism based on enhancement of perceptual templates. Such a mechanism might allow videogamers to establish better representations of degraded stimuli during a cognitive task. Further work should establish whether this mechanism does indeed allow non-signers to resist the negative effects of increasing load during working memory for manual gestures and whether such a mechanism is distinct from the mechanism that allows signers, with pre-existing lexical representations, to outperform non-signers on working memory for manual gestures (Rudner et al., under review). Future studies should investigate how the effect of videogaming on working memory for degraded manual gestures interacts with the effects of sign language experience.

# **Conclusion**

The results of the present study demonstrate that the overadditive interaction of load and distinctness predicted by the ELU model (Rönnberg et al., 2013) and empirically demonstrated for speech processing (Obleser et al., 2012; Petersen et al., 2015) can be generalized to sign processing. Moreover, we have shown that this interaction is not dependent on pre-existing semantic representation. Further, there was some evidence that the overadditive interaction was modulated by experience of playing computer games. This set of findings supports the notion of a shared working memory mechanism supporting load and distinctness and indicates that the mechanism is amodal. Future work using sign language should investigate how the shared mechanism is modulated by pre-existing semantic representation as well as sensory factors and computer gaming experience.

# **Acknowledgments**

The authors would like to thank Harald Nautsch, Department of Electrical Engineering, Linköping University, for help with stimulus preparation. This work was supported by the Linnaeus Centre HEAD grant from the Swedish Research Council.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Rudner, Toscano and Holmer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# On The (Un)importance of Working Memory in Speech-in-Noise Processing for Listeners with Normal Hearing Thresholds

Christian Füllgrabe<sup>1</sup> \* and Stuart Rosen<sup>2</sup>

<sup>1</sup> Medical Research Council Institute of Hearing Research, The University of Nottingham, Nottingham, UK, <sup>2</sup> Speech, Hearing and Phonetic Sciences, University College London, London, UK

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Mary Rudner, Linköping University, Sweden; Thomas Lunner, The Swedish Institute for Disability Research, Sweden; Kathryn Arehart, University of Colorado Boulder, USA

\*Correspondence: Christian Füllgrabe christian.fullgrabe@nottingham.ac.uk

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

> Received: 29 March 2016 Accepted: 09 August 2016 Published: 30 August 2016

### Citation:

Füllgrabe C and Rosen S (2016) On The (Un)importance of Working Memory in Speech-in-Noise Processing for Listeners with Normal Hearing Thresholds. Front. Psychol. 7:1268. doi: 10.3389/fpsyg.2016.01268

With the advent of cognitive hearing science, increased attention has been given to individual differences in cognitive functioning and their explanatory power in accounting for inter-listener variability in the processing of speech in noise (SiN). The psychological construct that has received much interest in recent years is working memory (WM). Empirical evidence indeed confirms the association between WM capacity (WMC) and SiN identification in older hearing-impaired listeners. However, some theoretical models propose that variations in WMC are an important predictor for variations in speech processing abilities in adverse perceptual conditions for all listeners, and this notion has become widely accepted within the field. To assess whether WMC also plays a role when listeners without hearing loss process speech in adverse listening conditions, we surveyed published and unpublished studies in which the Reading-Span test (a widely used measure of WMC) was administered in conjunction with a measure of SiN identification, using sentence material routinely used in audiological and hearing research. A meta-analysis revealed that, for young listeners with audiometrically normal hearing, individual variations in WMC are estimated to account for, on average, less than 2% of the variance in SiN identification scores. This result cautions against the (intuitively appealing) assumption that individual variations in WMC are predictive of SiN identification independently of the age and hearing status of the listener.

Keywords: working memory, speech perception in noise, aging, normal hearing, hearing loss, supra-threshold auditory processing, sentence identification, reading-span test

# INTRODUCTION

Over the past decades, there has been growing interest in the role of individual differences in cognitive functioning in speech processing, reflected by a noticeable increase in the number of scientific publications on this topic (see **Figure 1**). Such work reflects the emergence of the new interdisciplinary research field of Cognitive Hearing Science (e.g., Arlinger et al., 2009), focussing on understanding the interplay of auditory and cognitive processes in speech perception, primarily in adverse circumstances. Not only are key scientific issues at stake, there are also important clinical implications in trying to provide effective rehabilitation to people suffering from problems with spoken communication.

**FIGURE 1 |** … cognitive, cognition, memory, attention, inhibition or speed of processing, published between 1986 and 2015 in the following journals: Ear and Hearing, International Journal of Audiology (or before 2002: Audiology, British Journal of Audiology and Scandinavian Audiology), Journal of the Acoustical Society of America, Journal of the American Academy of Audiology, Journal of Speech, Language and Hearing Research (or before 1997: Journal of Speech and Hearing Research), Hearing Research and Journal of the Association for Research in Otolaryngology. The filled symbols denote publications featuring working memory as the second research term.

# Working Memory and Its Role in Complex Cognition

Amongst the different cognitive abilities investigated, working memory (WM) has received considerable attention in recent years (see filled symbols in **Figure 1**). WM is considered by many psychologists as a "cognitive primitive," due to its moderate-to-very-strong associations with different aspects of hot (i.e., emotion-laden; Klein and Boals, 2001) and cold cognition, such as reasoning (Barrouillet, 1996), attentional control (Das-Smaal et al., 1993), comprehension (Daneman and Merikle, 1996), and fact recall and pronoun referencing (Daneman and Carpenter, 1980). Over the years, different definitions have been given for this theoretical construct but it is generally agreed that the capacity of the WM system (WMC) can be reliably assessed by so-called complex span tasks. These require participants to perform a complex activity while concurrently trying to retain new information. For example, in one of the most widely used WM tasks, the Reading-Span (RSpan) test (Baddeley et al., 1985), visually presented sentences have to be read and their plausibility judged, while trying to remember parts of their content for recall after a variable number of sentences.

# The Role of Working Memory in Speech Perception

Given the strong and systematic link between WM and higher-order complex behavior, it is hardly surprising that performance on complex span tasks has also been used to explain individual variability in understanding speech in noise (SiN).

For example, a series of audiological research studies investigated whether individual differences in WMC, measured by the version of the RSpan test developed by Rönnberg et al. (1989), can help predict unaided (Lunner, 2003; Rudner et al., 2011) and aided (Lunner, 2003; Foo et al., 2007; Rudner et al., 2008, 2009, 2011) speech perception in hearing-impaired (HI) listeners, and explain the user-dependent success of different types of signal-processing performed by the hearing aid (e.g., dynamic range or frequency compression; Souza et al., 2015). Mainly moderate, sometimes even strong correlations between SiN identification and RSpan scores were consistently reported. Surprisingly, when referring to these findings to corroborate the role of WM in SiN perception, it is generally not mentioned that the cited studies were conducted with HI listeners who, on average, were aged over 65 years.

Furthermore, on the basis of an extensive review of behavioral studies concerned with the effects of cognitive factors on SiN perception in HI and normal-hearing (NH) listeners, Akeroyd (2008) concluded, too, that cognitive functioning is associated with SiN identification, and that WMC, especially when measured by the RSpan test, is the best cognitive predictor. However, these conclusions were based solely on the results from HI listeners (namely the relevant citations in the paragraph above), a fact generally not acknowledged when citing this reference.

A similar assumption that the same crucial cognitive processes are at work in all listeners, independently of their age and hearing status, is made in recent models of speech/language processing (e.g., Rönnberg, 2003; Heald and Nusbaum, 2014). For example, according to the latest instantiation of the Ease of Language Understanding (ELU) model (Rönnberg et al., 2013), any mismatch between the perceptual speech input and the phonological representations stored in long-term memory disrupts automatic lexical retrieval, resulting in the use of explicit, effortful processing mechanisms based on WM. The greater the mismatch, the more effortful listening becomes. Both internal distortions (i.e., related to the integrity of the auditory, linguistic and cognitive systems) and external distortions (e.g., background noise) are supposed to contribute to the mismatch. Consequently, it is assumed within this framework that WMC also plays a role when NH listeners have to process spoken language in acoustically adverse conditions. While no experimental evidence supporting this claim has actually been provided, this notion has become widely accepted within the field.

# STUDY SURVEY

To assess the claim that individual variability in WMC accounts for differences in SiN identification even in the absence of hearing loss, we surveyed studies administering the RSpan test<sup>1</sup> and a measure of SiN identification to participants with audiometrically normal hearing sensitivity.

To ensure consistency with experimental conditions in investigations of HI listeners, only studies presenting sentence material routinely used in audiological and hearing research against spatially co-located background maskers were considered. In addition, we only examined studies in which the effect of age was controlled for, in order to avoid inflated estimates of the correlation between WMC and SiN tasks caused by the tendency for performance in both kinds of tasks to worsen with age. The effect of age was controlled for either by restricting the analysis to a narrow age range, or by statistically partialling out the effect of age when using data from participants across a wider age range. Based on a request posted on the Auditory List<sup>2</sup> and a general literature search, we were able to compile data from 19 published and unpublished studies that complied with our inclusion criteria<sup>3</sup>. Since several studies measured SiN identification against different types of background maskers or for different performance levels, a total of 41 data sets was entered into the meta-analysis (see **Figure 2**). For each data set, the Pearson correlation coefficient (r; diamonds) and associated 95 and 99% confidence intervals (CIs; black and red horizontal lines, respectively) are indicated, as well as the performance level at which the participants were tested, the type of masker, the sentence material<sup>4</sup>, the age range of the sample and the sample size. Within each of the three sections of **Figure 2**, data sets are organized by decreasing performance level (i.e., increasing difficulty). For identical performance levels, data sets are ordered by masker type, representing presumed increasing masker complexity, from "simple" notionally steady noise<sup>5</sup> through sinusoidally or speech-envelope-modulated noise to speech babble.
Interestingly, some of the studies for which the data were reanalyzed on our request (indicated by an asterisk against them) did not even report the correlation between WMC and SiN identification in NH listeners in their original publication.
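Statistically partialling out age, as mentioned above, can be illustrated with the standard first-order partial correlation formula. The sketch below is only an illustration: the function name and the three input correlations are hypothetical and are not taken from any of the surveyed studies.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation between x and y, controlling for z.

    Here x could stand for RSpan scores, y for SiN identification scores,
    and z for age (hypothetical labels chosen for illustration).
    """
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denominator == 0.0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical values: both measures decline with age (r = -0.6 each),
# and the raw WMC-SiN correlation is 0.5.
r_partial = partial_correlation(0.5, -0.6, -0.6)
print(f"raw r = 0.50, age-partialled r = {r_partial:.3f}")
```

With these made-up inputs, much of the raw association disappears once the shared dependence on age is removed, which is the inflation the inclusion criterion is meant to guard against.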

Across all data sets, the observed r values varied widely from −0.29 to 0.64, with almost a quarter of the values being negative, indicating that sometimes low-WMC individuals showed better SiN identification than individuals with high WMC. CIs were rather large, suggesting that studies were underpowered (albeit not necessarily designed to assess this specific relationship), and, in most cases, the intervals included the value zero.
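To see why small samples give wide intervals that often include zero, one can compute a confidence interval for a Pearson correlation using Fisher's z transformation. This is a minimal sketch under the assumption that the intervals were obtained in this standard way (the text does not specify the method); the sample values are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_ci(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Confidence interval for a Pearson r via Fisher's z transformation."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                      # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)            # standard error on the z scale
    crit = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    # Back-transform the interval limits to the r scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Hypothetical data set: r = 0.30 observed in n = 20 listeners.
lo95, hi95 = fisher_ci(0.30, 20, level=0.95)
lo99, hi99 = fisher_ci(0.30, 20, level=0.99)
print(f"95% CI: [{lo95:.2f}, {hi95:.2f}]")
print(f"99% CI: [{lo99:.2f}, {hi99:.2f}]")
```

For a sample of this size, even a moderate observed correlation yields an interval that spans zero, in line with the pattern described above.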

Seemingly in contradiction with the ELU-model prediction of higher WM involvement for speech identification in increasingly adverse listening conditions, there was no obvious trend for more consistent or stronger correlations in more difficult listening conditions (i.e., at lower performance levels). In fact, there is some (descriptive) evidence of stronger associations between WMC and SiN identification in easier listening conditions [see results in section I for the same listeners in high- and low-performance-level conditions in Koelewijn et al. (2012) and Carroll et al. (2016)]. However, this trend was based on results for two performance levels only, and it was not observed consistently across studies (Zekveld et al., 2011; Stenbäck et al., 2016) or even within the same study (Koelewijn et al., 2012).

Moreover, comparisons across different data sets obtained for similar performance levels did not show that inter-individual variability in WMC was more consistently or strongly associated with SiN identification for more complex maskers or target speech, as has previously been speculated (e.g., Rönnberg et al., 2010; Smith and Pichora-Fuller, 2015). For example, for young NH listeners, operating at a performance level of 50%-correct, the correlation for simple relatively predictable HINT sentences presented in a steady noise was 0.58 (Moradi et al., 2014) but only: (i) 0.14 in spectro-temporally and linguistically more complex babble noise (Ellis and Rönnberg, personal communication), and (ii) −0.01 for the linguistically more complex and unpredictable IEEE sentences also presented in steady noise (Banks et al., 2015).

At the same time, the strength of the correlation varied even for studies using very similar test conditions and participant groups. For example, at a performance level of 50%-correct for IEEE sentences presented in a steady noise masker, the correlation for young NH listeners was either −0.29 (Schoof and Rosen, 2014) or −0.01 (Banks et al., 2015). This illustrates the dependence of the results on the particular sample used (and its size) and cautions against basing conclusions as to the role of individual differences in WMC in SiN identification on observations from single small-scale studies.

<sup>1</sup>Most studies used the RSpan test originally developed by Rönnberg et al. (1989) but some administered a shorter version of the test. However, there seem to be no differences in mean performance between the two test versions (Classon, 2013).

<sup>2</sup>http://www.auditory.org/

<sup>3</sup>Data from a further two studies were not included in the meta-analysis due to the failure to obtain re-analysed data and the authors' explicit wish for us not to use their data.

<sup>4</sup>Description of the different sentence lists used in the studies entered into the meta-analysis:

ASL – Adaptive Sentence List (MacLeod and Summerfield, 1990): Predictable simple four- to six-word sentences (e.g., "The boiled egg was soft.").

HINT – Hearing In Noise Test (Nilsson et al., 1994; Hällgren et al., 2006): Predictable simple three- to seven-word everyday sentences (e.g., "Strawberry jam is sweet.").

GÖSA – Göttinger sentence test (Kollmeier and Wesselkamp, 1997): High-predictability three- to seven-word (mean = 5) everyday sentences (e.g., "The dispute has ended.").

VU98 – (Versfeld et al., 2000): Eight- or nine-syllable everyday sentences (e.g., "The shop is within walking distance.").

IEEE – Institute of Electrical and Electronics Engineers Harvard sentences (Rothauser et al., 1969; Killion et al., 2004): Low-predictability five-keyword sentences (e.g., "A white silk jacket goes with any shoes.").

OLACS – Oldenburg Linguistically and Audiologically Controlled Sentences (Uslar et al., 2013): Seven-word sentences of varying linguistic complexity (e.g., "The little boy greets the nice father." "The farmer, whom the teachers catch, smiles.").

Matrix – Matrix sentences (Hagerman, 1982; Vlaming et al., 2011): Low-redundancy five-word sentences with the same syntactic structure (name-verb-number-adjective-object; e.g., "Nina wants some big beds.").

In comparison, the cited investigations involving older HI listeners used HINT and Matrix sentences.

<sup>5</sup>Background noise on which no amplitude modulation is impressed is often referred to as a "steady" or "stationary" masker. However, even such notionally steady maskers contain intrinsic random amplitude fluctuations that impede speech perception (Stone et al., 2011, 2012).

**FIGURE 2 |** … listeners after controlling for the effect of age by (I) computing partial correlations or (II) using a limited age range [younger listeners aged ≤40 years (A) vs. older listeners aged ≥60 years (B)]. Shown in the plot are Pearson correlation coefficients (diamonds with their relative sizes indicating the study's sample size) and associated 95% (black) and 99% (red) confidence intervals. Several studies contributed more than one correlation due to multiple listening conditions, varying in masker type or performance level, also indicated in the Figure (with the exception of the 2014 study by Zekveld et al. (2011) in which the target speech and masker babble were produced by speakers either of the same gender or of different genders). When necessary, the sign of the correlation was changed so that a positive correlation represents better performance on the two tasks. An average for correlations based only on young NH listeners is provided (circle). Also given in the figure are source references (<sup>∗</sup> indicates re-analyzed published data; + indicates unpublished data, personal communication), experimental conditions (performance level, PL; type of masker, Mask; type of sentence material, Mat) and participant details (age range, Age; number of participants, N). Masker: S – notionally steady noise, M<sub>X</sub> or M<sub>sp</sub> – noise modulated by an X-Hz sinusoidal amplitude modulation or a speech envelope, B<sub>X</sub> – X-talker babble. PL: X%(A) – adaptive procedure tracking the speech reception threshold corresponding to X%-correct identification, X%(F<sub>Z−Y</sub>) – constant stimuli procedure using several fixed SNRs yielding an overall average performance level of X% with average performance for each of the different SNRs ranging from Z to Y%-correct identification, X%(F) – constant stimuli procedure using a single fixed SNR, yielding an average performance level of X%.
In some cases, the modulation depth of the amplitude-modulated noises was only 10%, which is hardly above detection threshold (e.g., Füllgrabe et al., 2005). Therefore, those maskers are labeled as steady rather than modulated.

As there was a sufficiently large number of data sets from studies restricting their sample to young listeners (aged 18–40 years), a random-effects meta-analysis model was used to estimate the average correlation among these studies. This kind of analysis has the advantage not only of assuming that the true treatment effect differs from study to study, but also of accounting for the fact that multiple measures can arise from the same study (e.g., where different maskers have been used in the same listeners). The analysis was performed using the R package metafor (Viechtbauer, 2010) and a transformation of the r values to Fisher's z scale. Across all 24 data sets, the average r value was 0.12. In other words, individual variations in WMC in young people with audiometrically normal hearing are estimated to account for, on average, less than 2% of the variance in SiN identification scores.
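The pooling step and the "less than 2%" figure can be sketched as follows. This is a deliberately simplified DerSimonian–Laird random-effects estimate on Fisher's z scale, written in Python for illustration; it is not the metafor model used for the analysis and, unlike that model, it treats every data set as independent. The input correlations and sample sizes are hypothetical.

```python
import math


def random_effects_mean_r(rs, ns):
    """Pool correlations with a DerSimonian-Laird random-effects model.

    Returns the pooled correlation (back-transformed from Fisher's z)
    and the estimated between-study variance tau^2 on the z scale.
    """
    zs = [math.atanh(r) for r in rs]
    variances = [1.0 / (n - 3) for n in ns]      # sampling variance of z
    w = [1.0 / v for v in variances]             # fixed-effect weights
    sum_w = sum(w)
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum_w
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (len(rs) - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    return math.tanh(z_re), tau2


def variance_explained(r: float) -> float:
    """Proportion of variance accounted for by a correlation of size r."""
    return r ** 2


# Hypothetical correlations and sample sizes, for illustration only.
pooled_r, tau2 = random_effects_mean_r([0.58, 0.14, -0.01, -0.29], [24, 30, 91, 19])
print(f"pooled r = {pooled_r:.2f}, tau^2 = {tau2:.3f}")

# The average r of 0.12 reported in the text corresponds to r^2 = 0.0144.
print(f"variance explained by r = 0.12: {variance_explained(0.12):.2%}")
```

Squaring the reported average correlation (0.12² ≈ 0.014) gives the proportion of variance explained, which is where the "less than 2%" estimate comes from.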

Given the considerably smaller number of data sets in each of the two other categories, involving older listeners, we did not compute a summary statistic. However, it is noteworthy that in the largest study included in the survey, using listeners from a wide age range, significant correlations between WMC and SiN identification were found for unmodulated and modulated background noises (see section I of **Figure 2**), and when averaged across maskers, even after partialling out the effects of age and hearing sensitivity (r = 0.39; p ≤ 0.001; as reported in Füllgrabe and Rosen, 2016). However, separate correlational analyses for each age group in this study revealed that the strength of the association differed across age groups, with the youngest listeners (18–39 years) showing the weakest and a non-significant correlation (r = 0.18; p = 0.162) while stronger and significant correlations were observed for the middle-aged (40–59 years) to old–old (70–91 years) age groups (all r ≥ 0.44; all p ≤ 0.011). A linear regression of SiN identification scores against age, RSpan scores and their interaction showed that the slope of the linear dependence of SiN identification performance on RSpan scores indeed increased significantly with age (p ≤ 0.001). This illustrates the moderating effect of age on the relationship between WMC and SiN identification, cautioning that the statistical control of the effect of age by computing partial correlations is not necessarily appropriate.
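A regression with an age × RSpan interaction term, of the kind described above, can be sketched with ordinary least squares on synthetic data. Everything in this example is an assumption made for illustration: the variable names, the centring of predictors, the coefficient values and the generated data are not those of the study.

```python
import random


def solve_linear(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_moderation(age, rspan, sin_score):
    """Fit sin_score = b0 + b1*age_c + b2*rspan_c + b3*age_c*rspan_c.

    Predictors are mean-centred; b3 is the interaction (moderation) term.
    """
    mean_age = sum(age) / len(age)
    mean_rspan = sum(rspan) / len(rspan)
    rows = [[1.0, a - mean_age, r - mean_rspan, (a - mean_age) * (r - mean_rspan)]
            for a, r in zip(age, rspan)]
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(4)] for i in range(4)]
    xty = [sum(row[i] * y for row, y in zip(rows, sin_score)) for i in range(4)]
    return solve_linear(xtx, xty)


# Synthetic listeners: the RSpan slope grows with age (true interaction = 0.01).
rng = random.Random(0)
age, rspan, sin_score = [], [], []
for a in range(20, 90, 5):
    for r in range(10, 50, 4):
        age.append(float(a))
        rspan.append(float(r))
        sin_score.append(60.0 - 0.3 * a + 0.01 * a * r + rng.gauss(0.0, 1.0))

b0, b1, b2, b3 = fit_moderation(age, rspan, sin_score)
print(f"estimated interaction coefficient b3 = {b3:.4f}")
```

A positive interaction coefficient means the slope relating RSpan to SiN scores steepens with age. A single age-partialled correlation averages over that change, which is why it can misrepresent the relationship within any one age group.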

# DISCUSSION AND CONCLUSION

Contrary to common lore and model predictions, this meta-analysis failed to find consistent evidence that, in adverse listening conditions, WMC (as measured by the RSpan test) is a reliable and strong predictor of SiN identification in young listeners with normal hearing thresholds. Recent experimental work on the perception of interrupted speech, another form of signal degradation, is consistent with this finding (Benard et al., 2014; Nagaraj and Knapp, 2015).

It could be argued that the cognitive and speech tests used in the studies surveyed here are suboptimal or inappropriate measures of WMC and SiN processing, respectively (e.g., Besser et al., 2012; Sörqvist and Rönnberg, 2012; Keidser et al., 2015). However, both the conclusions of many empirical studies, showing a link between WMC and SiN processing, and the predictions of the ELU model are based on performance obtained on these very tests.

Another criticism could be made regarding the fact that SiN identification was predominantly assessed for performance levels close to 50% correct, obscuring the possibility that WMC and SiN identification are linked to a greater extent than reported here at other performance levels. Indeed, according to the ELU model, a greater mismatch between sensory and mental representations, and hence a higher involvement of WM-based identification processes, is predicted as speech-to-noise ratios become less favorable. However, this does not seem to be borne out by the collected results. Alternatively, it has also been argued that WM-based restorative processes in older HI (Lunner and Sundewall-Thorén, 2007; Larsby et al., 2008, 2012) and young NH listeners (Stenbäck et al., 2015) might only be effective in conditions where the acoustic signal is not "too" degraded, suggesting a non-monotonic relationship between WMC and SiN identification. While this seems an interesting proposition, the collected results do not indicate the existence of such "sweet spots" for cognitive involvement.

Hence, all things considered, the results of this meta-analysis caution against the (intuitively appealing) assumption that individual variations in WM determine SiN processing in all its forms and independently of the age and hearing status of the listener.

Despite the inconsequential degree to which WMC can predict SiN identification performance in young NH listeners, the reported results should not be interpreted as evidence against the involvement of cognition in speech and language processing in those listeners per se. First, individual differences in WMC have sometimes been shown to explain some of the variability in performance in more linguistically complex tasks, such as the comprehension of conversations (Keidser et al., 2015; but see Smith and Pichora-Fuller, 2015, for contrary results for the comprehension of narratives). Second, different cognitive measures, probing individually the hypothesized sub-processes of WM (e.g., inhibition, shifting, updating; Miyake et al., 2000) or other domain-general cognitive primitives (e.g., processing speed) might prove to be better predictors of SiN processing abilities than the RSpan test (e.g., Sörqvist et al., 2010; Rudner et al., 2011).

It is also important to emphasize that the findings reported here for young NH listeners are not incompatible with the body of evidence showing significant correlations between WMC and SiN identification in primarily older HI listeners. Our own data for NH listeners sampled from across the entire adult lifespan (Füllgrabe and Rosen, 2016) revealed that WMC becomes important for SiN identification from middle age onward, with the oldest listeners (≥70 years) showing the strongest correlation and differing significantly from the youngest age group. One possible explanation for an increasing cognitive involvement in terms of WMC with age, in addition to the loss of audibility, is the accumulation of age-related changes in supra-threshold auditory processing (e.g., sensitivity to temporal-fine-structure and temporal-envelope cues; Schneider and Pichora-Fuller, 2001; Füllgrabe et al., 2003, 2015), sometimes from as early as mid-life (Füllgrabe, 2013). Changes in the coding fidelity of single neurons or across a neural population (Henry and Heinz, 2013; Sergeyenko et al., 2013; Bharadwaj et al., 2014; Lopez-Poveda, 2014), which are not detected by a conventional audiometric assessment, have indeed been associated with degraded sensory representations of the acoustic speech signal. These internal distortions could then call for more WM-based compensatory mechanisms to enable activation of the appropriate representations in long-term memory. Why, however, such age-related internal changes in coding fidelity would result in a greater reliance on WMC for SiN identification than an increase in the amount of energetic and/or informational masking is unclear. Possibly, this discrepancy could be due to secondary changes in the precision of the phonological representations stored in long-term memory, following long-standing auditory processing deficits (e.g., Andersson, 2002; Classon et al., 2013), thus providing a top-down contribution to the mismatch between sensory and mental representations. Clearly, further reflections on the nature and source of listening adversity (see Mattys et al., 2012) are needed to generate oriented hypotheses that can be tested experimentally.

From a clinical perspective, a cognitive assessment (e.g., of WMC) may still prove helpful in improving the prediction of aided SiN identification performance for older audiological patients. Future evidence based on new large samples, independent of those repeatedly investigated in previous studies (Foo et al., 2007; Rudner et al., 2008, 2009, 2011), could further specify the role and importance of cognition in audiological practice.

In conclusion, even though the question of a general vs. specialized WM system in language comprehension is not new (Caplan and Waters, 1999) and it has been speculated that differences in tasks and their processing demands activate different sub-components of the WM system, the less-discerning general opinion is that variation in WMC (often assessed by a single measure) can explain differences in performance on a variety of speech tasks. Currently available data from independent research groups do not confirm this assumption for the frequently used task of sentence identification. However, this is not to say that the processing of SiN does not involve a range of cognitive abilities, including WM. For example, it is possible that, even when individual differences exist, the WMC of most individuals is sufficient for the purpose of SiN identification. Systematic efforts are therefore required to establish under which acoustic and linguistic conditions the different cognitive abilities come into play (e.g., Fedorenko, 2014; Smith and Pichora-Fuller, 2015; Heinrich and Knight, 2016). Finally, the results of this meta-analysis clearly highlight the need for a consistent and explicit labeling of the participant characteristics (such as age and hearing status) when reporting results and caution against the untested generalization of research findings from one participant group to another.

# AUTHOR CONTRIBUTIONS

CF collated, analyzed and plotted the data, and wrote the paper. SR analyzed the data, and revised and commented on the paper.

# FUNDING

The Medical Research Council Institute of Hearing Research is supported by the Medical Research Council (grant number U135097130). This work was also supported by the Oticon Foundation (Denmark).

# ACKNOWLEDGMENTS

Portions of this paper were presented at the 2015 International Symposium on Hearing in Groningen, NL. We are indebted to our colleagues Kathy Arehart, Briony Banks, Jana Besser, Rebecca Carroll, Rachel Ellis, Erin Ingvalson, Inga Holube, Lisa Kilman, Thomas Koelewijn, Theresa Nüsse, Tim Schoof, Pamela Souza, Victoria Stenbäck, Verena Uslar, Anna Warzybok, and Adriana Zekveld for sharing and reanalyzing their data. We also thank Tom Campbell, Alexander Francis, Gitte Keidser, Rebecca Millman, Daniel Oberfeld-Twistel and Valeriy Shafiro for stimulating discussions, and Oliver Zobay for statistical advice.






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer MR and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2016 Füllgrabe and Rosen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Working Memory Load Affects Processing Time in Spoken Word Recognition: Evidence from Eye-Movements

### Britt Hadar<sup>1,2</sup>, Joshua E. Skrzypek<sup>1</sup>, Arthur Wingfield<sup>3</sup> and Boaz M. Ben-David<sup>1,4,5,6</sup>\*

<sup>1</sup> Baruch Ivcher School of Psychology, Interdisciplinary Center Herzliya, Herzliya, Israel, <sup>2</sup> School of Psychological Sciences, Tel-Aviv University, Tel Aviv, Israel, <sup>3</sup> Volen National Center for Complex Systems, Brandeis University, Waltham, MA, USA, <sup>4</sup> Rehabilitation Sciences Institute, University of Toronto, Toronto, ON, Canada, <sup>5</sup> Department of Speech-Language Pathology, University of Toronto, Toronto, ON, Canada, <sup>6</sup> Toronto Rehabilitation Institute, University Health Networks, Toronto, ON, Canada

### Edited by:

Jerker Rönnberg, Swedish Institute for Disability Research and Linköping University, Sweden

### Reviewed by:

Stefanie E. Kuchinsky, University of Maryland, USA

Thomas Koelewijn, Vrije Universiteit (VU), University Medical Center, Netherlands

> \*Correspondence: Boaz M. Ben-David boaz.ben.david@idc.ac.il

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

Received: 13 December 2015; Accepted: 04 May 2016; Published: 19 May 2016

### Citation:

Hadar B, Skrzypek JE, Wingfield A and Ben-David BM (2016) Working Memory Load Affects Processing Time in Spoken Word Recognition: Evidence from Eye-Movements. Front. Neurosci. 10:221. doi: 10.3389/fnins.2016.00221

In daily life, speech perception is usually accompanied by other tasks that tap into working memory capacity. However, the role of working memory in speech processing is not clear. The goal of this study was to examine how working memory load affects the timeline for spoken word recognition in ideal listening conditions. We used the "visual world" eye-tracking paradigm. The task consisted of spoken instructions referring to one of four objects depicted on a computer monitor (e.g., "point at the candle"). Half of the trials presented a phonological competitor to the target word that overlapped either in the initial syllable (onset) or in the last syllable (offset). Eye movements captured listeners' ability to differentiate the target noun from its depicted phonological competitor (e.g., candy or sandal). We manipulated working memory load by using a digit pre-load task, where participants had to retain either one (low-load) or four (high-load) spoken digits for the duration of a spoken word recognition trial. The data show that the high-load condition delayed real-time target discrimination. Specifically, a four-digit load was sufficient to delay the point of discrimination between the spoken target word and its phonological competitor. Our results emphasize the important role working memory plays in speech perception, even when performed by young adults in ideal listening conditions.

### Keywords: working memory, speech perception, word recognition, eye-tracking, visual world paradigm

# INTRODUCTION

Although seemingly performed without effort, understanding speech is a complex task (Pollack and Pickett, 1963; Lindblom et al., 1992; Wingfield et al., 1994; Murphy et al., 2000). During the process of spoken-word recognition, listeners must simultaneously retain and process the context of the sentence, keep the previous spoken words activated, segregate the speech signal from noise, and inhibit the potential activation of alternatives for the spoken word (e.g., phonetic or semantic). All of these operations might draw on the same resources necessary for speech processing and, as a result, may compromise recognition. The current study presents, to the best of our knowledge, the first examination of the impact of working memory load on the online processing of a single spoken word in ideal listening conditions. For this purpose, we examined eye-movements using the visual world paradigm (Tanenhaus et al., 1995) to reveal listeners' timeline for recognition of target words.

### SPOKEN-WORD RECOGNITION

Most current models of speech perception are activation-competition models, in which auditory input activates a set of lexical candidates, which then compete for the highest level of activation. Lexical access is a product of the integration of bottom-up and top-down processes (e.g., see the Cohort model, Marslen-Wilson, 1987, 1990; TRACE model, McClelland and Elman, 1986). Bottom-up information is supplied by the acoustic-phonetic features of the sound wave, while the top-down information consists of the semantic and syntactic information related to the input (Rönnberg et al., 2013). As the acoustic signal unfolds in time, an analysis of the signal features allows the system to match phonetic cues to word forms in the mental lexicon. For example, hearing the initial sounds /kæ/ will activate the words candy, candle, cannon, camel, etc. As the utterance of the word progresses to include more phonemes, irrelevant alternatives are inhibited until the listener reaches the isolation point, that is, the point in time at which the target word is distinguished from its alternatives. The continuous uptake of speech sounds from the unfolding spoken word also activates alternatives that share offset sounds and act as phonological competitors, e.g., candle and sandal (Wayland et al., 1989; Wingfield et al., 1997; Luce and Pisoni, 1998; Sommers and Amano, 1998). These alternatives, activated at the end of the word, were also found to delay the isolation point (Allopenna et al., 1998), as lexical access takes place continuously. For example, if the onset of the word was not enough to lead to an isolation point, the additional information at the end of the word can add alternatives and thus further delay this point. This offset-overlap effect was noted more strongly in populations with reduced working memory capacity (e.g., older adults, Ben-David et al., 2011).

Studies of speech perception have primarily focused on accuracy-based assessments to provide information about the overall integrity of speech perception. Such off-line measures, however, make it difficult to determine the specific processes underlying this accuracy. To overcome this limitation, we investigated linguistic processing using the "visual world" eye-tracking paradigm (Tanenhaus et al., 1995). In this paradigm, listeners are asked to follow spoken instructions referring to objects depicted on a computer monitor (the "visual world"). For example, participants might hear the phrase, "point at the candle," and simultaneously see a display containing four pictures, each representing a word: candle (target), sandal (offset-competitor), finger, and zebra (unrelated nouns). As the listeners hear the instructions and the unfolding sound of the object's name, their eye-gaze data are time-locked with what is being heard on a moment-to-moment basis. This makes it possible to record where a person is looking on a visual display, how long their gaze dwells on a location, and the rate and order in which their gaze moves to other locations. To illustrate the method, consider our example of a listener hearing the phrase, "point at the candle," where both a candle and a sandal are depicted on the display. We track, in real time, how the listener shifts his or her focus between candle and sandal, which share the final sounds /dəl/. One can record, with millisecond accuracy, whether focus on the target, candle, is delayed due to competing activation of the offset competitor, sandal, as reflected by the listener's gaze pattern. In this way, eye movements can reveal the point at which listeners are able to isolate a target word from its competitor.

The visual world paradigm can also gauge what factors might either impede or facilitate spoken word processing, and to what extent. For example, the paradigm has been used successfully to test the impact of stream segregation of a spoken word from a noisy background (Ben-David et al., 2011) as well as from competing speech (Helfer and Staub, 2014). In the current study, we used this paradigm to investigate the role of working memory load. Listeners were asked to recognize the spoken word and touch the relevant pictogram, while retaining in memory spoken digit(s) presented at the beginning of a trial.

# SPEECH PROCESSING AND WORKING MEMORY

Working memory is a fundamental cognitive mechanism that allows active maintenance and manipulation of a limited amount of information (Luck and Vogel, 1997; Awh et al., 2007). Many complex cognitive tasks, including understanding speech, rely on working memory support (Baddeley, 1992; Luck and Vogel, 2013). Because working memory capacity is limited, any increase in demands on working memory should decrease the capacity available to actively maintain and process additional information.

In experimental settings, a dual task paradigm can reveal the toll individuals pay when resources are occupied by a concurrent task (Pashler, 1994). Participants in the dual task paradigm are asked to perform two simultaneous tasks. As the demands of the primary task increase, the available resources for the secondary task decrease (Sarampalis et al., 2009; Tun et al., 2009; Campana et al., 2011). Thus, the extent of the decrease in performance in the secondary task can point to the degree of resources demanded by the primary task (Kerr, 1973).

It has been argued that differences in working memory capacity may stem from differences in the efficiency of inhibiting irrelevant information. Vogel et al. (2005) found that individuals with low working memory capacity find it harder to inhibit irrelevant information than do high-capacity individuals (see also Lash et al., 2013). Similarly, working memory capacity predicts participants' ability to inhibit irrelevant distractors in a Flanker task (Heitz and Engle, 2007). Awh and Vogel (2008) view working memory as responsible for inhibiting irrelevant sensory information, calling it the "bouncer in the brain." Lavie et al. (2004) found that an increase in working memory load increases distractor interference in the visual domain. They suggested a working-memory-based cognitive control mechanism that decreases interference from distractions. Once this control mechanism is occupied by a task that demands working memory resources, inhibition efficiency is decreased in any other task. In speech recognition, an increase in working memory demands might be reflected by a decrease in the ability to inhibit the activation of word alternatives.

Another approach that considers working memory as an important player in the speech perception process is the Ease of Language Understanding model (ELU; Rönnberg, 2003; Rönnberg et al., 2008, 2013). According to the ELU model, the language input receives implicit processing at the episodic buffer, and is then compared to phonological information stored in long-term memory. This model suggests that this "implicit" process is completed rapidly, with little or no draw on resources. However, if a mismatch occurs between the signal and its corresponding representation in long-term memory, slower, resource-demanding "explicit" processing is required. Thus, when the competition increases between the bottom-up sound information and possible word alternatives, resources are recruited for "explicit" speech processing and successful word identification will take longer to complete.

Although in the discussions that follow we contrast implicit vs. explicit processing following Rönnberg et al. (2013), we recognize that these terms may be more accurately seen as denoting two ends of a continuum, reflecting degrees of resource demands for success (see the discussion in Wingfield et al., 2015).

# CURRENT STUDY

The goal of the current study was to examine the extent to which working memory load affects the timeline for the processing of a single spoken word. As a first step, we adapted the visual world paradigm (Tanenhaus et al., 1995) to Hebrew and validated it. Two types of sound-sharing competitors, onset overlap and offset overlap, were presented on different trials. The target words and their phonological competitors were matched on linguistic characteristics, such as frequency, familiarity, and number of syllables. The corresponding pictograms were matched for recognizability and visual saliency. Next, we tested online recognition of a spoken word using the visual world paradigm, with two levels of working memory pre-load: high vs. low load. At the beginning of each trial, either one spoken digit (low load) or four spoken digits (high load) were presented. Participants were asked to retain the digit(s) while performing the spoken word recognition task. Once they had indicated their recognition of the spoken word (by touching the correct pictogram), they were asked to verbally recall the digit(s). By using eye-tracking with high-resolution data at the millisecond level, our goal was to reveal the exact timeline of word processing and the factors that may facilitate or impede each stage of the process until recognition occurs.

Applying the ELU model to the visual world paradigm described above yields several predictions. Mainly, as the competition increases between top-down and bottom-up information, there will be a shift from an implicit to an explicit process. This shift will be evident in a delay in eye fixations on the target word. Recall that, in the visual world paradigm, the listener is given time to review the four alternatives before the word is presented, and is then asked to focus on the center of the monitor (where no picture is presented). Thus, these alternatives (top-down) can now compete for activation as the bottom-up auditory signal unfolds in time. When onset phonological competitors are presented, one can hypothesize that explicit processing will be activated. In these trials, two pictograms depicting words that share initial sounds (e.g., candle and candy) are presented on the monitor. As the spoken word unfolds in time, at least two alternatives are activated in response to input matching the pictograms. With more of the word heard, more information is accumulated and a mismatch can ensue between the bottom-up input and potential phonological alternatives, leading to explicit processing. Conversely, offset overlap competitors present less competition to the processing of the target word than onset overlap competitors (Allopenna et al., 1998). Thus, these trials should mostly lead to some degree of implicit processing. Increasing the working memory load from one to four digits might increase the competition generated by the shared final phonemes. We suggest that this increase in competition might shift speech processing from implicit to more explicit, delaying the onset of fixations on the target word. Note that explicit processing represents a slower processing of the spoken word, whereas implicit processing represents a faster one.
When working memory load is low, the fast (implicit) processing of the initial sounds will minimize the impact of the shared offset sounds, as recognition might be reached earlier. However, when working memory load is high, the slower explicit processing will increase the competition presented by the offset sound sharing alternatives, as recognition is delayed. That is, one could hypothesize that increasing the load will have a larger impact on trials presenting offset overlap competition than on trials presenting onset overlap competition.

# THE MAIN EXPERIMENT

We tested the role of working memory in the process of single spoken word recognition in ideal listening conditions. Young adults were tested in both high- and low-load conditions. We hypothesized that manipulating the load would have an impact on eye fixations, especially in offset trials, which generally show only a small degree of target-competitor competition for young, good-hearing individuals when no load is imposed.

# METHODS

# Participants

Twenty-four undergraduate students, recruited from the Interdisciplinary Center (IDC) Herzliya, participated in the study in return for course credits. Their hearing thresholds were tested via a MAICO MA-51 audiometer. Four participants were excluded from analysis due to hearing impairments (PTA > 20 dB HL). Thus, 20 participants (M age = 24.2, SD = 2.0) were included in the analyses. All included participants had pure-tone air conduction thresholds within clinically normal limits for their age range from 0.25 to 6 kHz in both ears (≤20 dB HL). Participants completed the Wechsler digit recall sub-task (WAIS IV, Wechsler, 2008), and their auditory working memory capacity was within expected values for their age range (M = 6.26, SD = 0.93). All participants were native Hebrew speakers, based on self-report, and they achieved an average score on the Wechsler vocabulary subtest (M = 39.7, SD = 8.3) corresponding to above-average vocabulary levels for native Hebrew speakers (WAIS IV, Wechsler, 2008). All participants reported normal or corrected-to-normal vision and, when necessary, wore their own corrective eyewear.

# Paradigm Construction

The current study adapted the "standard" visual world paradigm to Hebrew. Therefore, several preliminary steps were carried out to ensure that the basic paradigm yields comparable results in Hebrew.

# Visual Stimuli

The experiment consisted of 32 critical trials (that included phonological competitors), 32 filler trials (that did not include phonological competitors), and eight practice trials. On all displays, four pictograms corresponding to object names in Hebrew were presented in the four corners of a 3 × 3 grid on a computer monitor (9 × 9 cm, subtending ~8.5° visual angle at a distance of 60 cm). We used a touch screen panel (T 23" ATCO infrared 4096 × 4096) to allow a more natural response. We included only disyllabic words since in past research (Ben-David et al., 2011) disyllabic words yielded more accurate responses in a visual world paradigm. Images were not recycled in either the critical or the filler displays; therefore, 288 different images were used. The majority of images were drawn from the normed color image set of Rossion and Pourtois (2004). The remaining images were taken from commercial clip art databases and were selected to match the Rossion and Pourtois images in terms of visual style. In each critical trial, one pair of the depicted words overlapped either in the initial syllable (onset overlap) or in the final syllable (offset overlap). The critical trials summed to a total of 16 onset trials (e.g., /aʁ.gaz/ and /aʁ.nav/, box and bunny, respectively) and 16 offset trials (e.g., /xa.lon/ and /ba.lon/, window and balloon, respectively). In each critical trial, the target and its phonological competitor were presented alongside two unrelated stimuli that did not share onset or offset sounds with any of the words depicted in that trial. The relative position of pictograms within the grid (target, competitor, and two unrelated) was counterbalanced across the set of displays. An example of a critical trial is presented in **Figure 1**. Filler trials consisted of four pictograms that did not share onset- or offset-sound relations.
The filler trials were included in order to diminish participants' expectations about the task and the phonetic semblance between the target and the competitor.
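The display geometry and image count reported above can be checked with a short calculation. The following is a minimal sketch, assuming that the 9 cm extent is viewed centrally from 60 cm and that the standard visual-angle formula 2·atan(size / (2·distance)) applies; the function name `visual_angle_deg` is hypothetical and is not taken from the study materials.

```python
import math


def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (degrees) subtended by an extent viewed centrally."""
    return math.degrees(2 * math.atan((size_cm / 2) / distance_cm))


# 9 cm extent viewed from 60 cm, as reported for the display.
angle = visual_angle_deg(9, 60)

# 32 critical + 32 filler + 8 practice displays, four unique images each.
n_images = (32 + 32 + 8) * 4

print(f"{angle:.2f} deg, {n_images} images")  # 8.58 deg, 288 images
```

Under these assumptions, the result agrees with the reported value of roughly 8.5° and with the 288 images needed when no image is reused.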

# Lexical Items Selection

In order to control for word frequency effects (Magnuson et al., 2007), we counterbalanced the target words in several ways. First, frequency of appearance in the language was measured by the Hebrew blog corpus (Linzen, 2009), based on a large corpus of blogs written in colloquial Hebrew. These frequencies were compared with the word frequency database for printed Hebrew in national newspapers (Frost and Plaut, 2005). Both databases used the orthographic form of the letter clusters, and frequencies were measured as the mean occurrence per million words. According to their frequencies, target words were equally distributed across the two experimental blocks, so that each block contained an equal number of the more frequent target words (which were counterbalanced across participants). Moreover, target-competitor allocation was counterbalanced as well, such that each word served as a target for half of the participants and as a competitor for the other half.

# Image Selection

To control for potential recognizability of display objects, 18 university students, native Hebrew speakers from the same population as our main experiment, were asked to name the critical images on an online questionnaire. Each image was presented for unlimited time. Fifty-nine out of the 64 experimental pictograms were highly recognizable (at least 75% name agreement). For the remaining five images, a different procedure was used, where participants were asked to rate: "to what extent (1–10) does the pictogram represent the word \_\_\_\_\_ [the object it is depicting]." This procedure was repeated with different images, until we found five pictograms that received scores higher than eight and these were included in the final set.

# Auditory Stimuli

The stimuli consisted of the Hebrew equivalent of the sentence "point at the \_\_\_\_ [target word]" using the plural non-gender-specific form (i.e., "/hats.bee.uh/ /al/ /ha/ [target word]"). These were prerecorded by a female native Hebrew-speaking radio actress in a professional radio studio (IDC radio), using a sampling rate of 48 kHz. The root-mean-square intensity was equated across all digitally recorded sentences, and the signal was played at 79 dB SPL. The average time interval between the onset of the recorded sentence and the onset of the target word was 1114 ms (SD = 97 ms), and the average noun duration was 1078 ms (SD = 91 ms), as measured from the recordings by three native Hebrew speakers using Praat software for analysis of speech (Version 5.4, Boersma and Weenink, 2004).

# Pre-test

The paradigm was validated in a pre-test with a group of participants taken from the same population as our main experiment. In the pre-test, we wished to validate the translation and other variations in the paradigm. For example, in the original paradigm, participants were instructed to move pictured objects (e.g., "put the apple that is on the towel in the box;" Tanenhaus et al., 1995). However, more recent research has used instructions to look at the target (e.g., "look at the candle," Ben-David et al., 2011) or to click on it with a computer mouse (e.g., Allopenna et al., 1998) for selection of the objects. As the former instructions might provoke more conservative eye movements and the latter might be less direct, we used a setup that allowed us to collect responses by a touch screen. Thus, participants were simply asked to point with their finger at one of the objects on the monitor (e.g., "point at the candle"). The results of the pre-test confirmed that the baseline paradigm in Hebrew generates eye-fixation patterns similar to those of previous findings (e.g., Ben-David et al., 2011).

# Procedure

Participants were tested individually in a single-walled sound-attenuated booth (IAC). They were seated at a distance of 60 cm from the computer monitor, resting their chin on a chin rest. Eye movements were recorded via a table-mounted eye tracking system (SR Research Eyelink 1000, using the "tower mount" configuration), which sampled eye gaze position every 2 ms. Each block of trials began with a calibration procedure followed by four practice trials. Within each block, 16 critical trials (eight with onset overlap and eight with offset overlap) were pseudo-randomly interleaved with 16 filler trials, with the exception that the first four trials were always fillers. Participants completed two blocks, high- and low-load (counterbalanced). In the high-load block, four random digits were played prior to the speech perception task, at a pace of one digit per second. The digits were prerecorded by the same female actress who read the instructions. Participants were asked to retain these digits for later recall. In the low-load block, participants were presented with only one random digit for later recall. Each trial began with a visual cue (black triangle on a white background) immediately followed by the auditory presentation of the digits. Then, a 3 × 3 grid appeared on the monitor, containing the four pictograms, one at each corner of the grid. After 2000 ms, a short 1-kHz tone was played, directing participants to focus on the fixation cross which simultaneously appeared in the center of the grid.

After the system registered cumulative fixations on the central square for at least 200 ms, the fixation cross disappeared, and the recorded instruction sentence was played. Participants were instructed to point at one of the four objects on the monitor. A choice was indicated by touching the pictogram on the monitor. A feedback signal followed the participant's choice; either a green square (denoting "correct") or red ("incorrect") masked the cell. The feedback was administered in order to attain the highest degree of accuracy and attention for the whole duration of the task.

The objects then disappeared from the grid to signal the end of the trial, and a visual cue (black circle on a white background) was presented, indicating recall. Participants were instructed to report the digits verbally in the order in which they had been presented. Instructions emphasized that performance on both tasks was equally important. At the end of the procedure, participants were probed for whether they suspected a connection between the pictograms and were debriefed.

# Interest Areas

Interest areas were defined in rectangular regions around each image, following the grid. Interest areas were also defined for each of the remaining five regions of the grid as well as offscreen, but these were not included in the subsequent analysis. The samples were then grouped and binned into 20 ms time-bins, with 10 samples summed per bin. Data retained for each time-bin included the target fixation count (i.e., the number of samples per bin that contained a fixation on the target).
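The binning step described above can be illustrated with a minimal sketch. It assumes that each 2-ms sample has already been coded as 1 (gaze inside the target interest area) or 0 (elsewhere), and that incomplete trailing bins are discarded; both the coding and the function name `bin_target_fixations` are hypothetical and are not taken from the authors' analysis scripts.

```python
def bin_target_fixations(on_target, samples_per_bin=10):
    """Sum 0/1 on-target samples into consecutive bins.

    With one sample every 2 ms, 10 samples per bin gives 20 ms bins.
    Samples left over after the last complete bin are dropped.
    """
    n_bins = len(on_target) // samples_per_bin
    return [
        sum(on_target[i * samples_per_bin:(i + 1) * samples_per_bin])
        for i in range(n_bins)
    ]


# Hypothetical 60 ms of data: off target, partly on target, fully on target.
samples = [0] * 10 + [1] * 4 + [0] * 6 + [1] * 10
counts = bin_target_fixations(samples)
print(counts)  # [0, 4, 10]
```

Each entry is the target fixation count for one 20 ms bin, ranging from 0 to 10.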

# STATISTICAL ANALYSIS

# Eye-Movements Analysis: Fixations on the Phonological Competitor

We tested whether aggregated fixations on the phonological competitor (total time fixating on the competitor, see Helfer and Staub, 2014) were significantly higher than average aggregated fixations on the unrelated nouns (from 200 to 1500 ms after the onset of the word). We used a repeated-measures ANOVA, with the type of noun (phonological competitor vs. unrelated noun), type of overlap (onset vs. offset), and load (high vs. low) as within-participant factors. We found a main effect for the type of noun [F(1, 19) = 9.89, p = 0.005, η<sup>2</sup> = 0.34], indicating that, overall, phonological competitors generated more fixations than the unrelated nouns (averages of 3.5 and 2.5% of possible fixations, respectively), which demonstrates the competition during processing. No significant main effects were found for the type of overlap or for load, and none of the two- or three-way interactions was statistically significant (p ≥ 0.09, for all). This indicated that neither of these factors nor the interactions between them had an impact on fixations on items other than the target. As a consequence, fixations on the phonological competitors will not be further discussed.
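As a consistency check on the reported effect size, the following sketch recomputes it from the F statistic. It assumes that the reported η<sup>2</sup> is a partial eta squared, which for a single effect can be recovered as F·df_effect / (F·df_effect + df_error); this assumption is ours and is not stated in the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of noun type: F(1, 19) = 9.89
effect = partial_eta_squared(9.89, 1, 19)
print(round(effect, 2))  # 0.34
```

Under that assumption, the recomputed value matches the reported 0.34.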

# Modeling Eye-Movements Analysis: Fixations on the Target Word

Analyses were made on trials in which the digits were correctly retained. Once a selection was made (by pressing on the correct pictogram), we considered all the following time bins as fixations on the target (applying the same procedure as in Ben-David et al., 2011). This facilitated the comparison of different trials, independent of the amount of time taken by the participant to select the target. Note that, at the time they make a selection, participants have already reached a decision about the spoken word. Thus, we opted to use a cumulative approach, in which we report, at each time bin, the percent of trials where the participant had reached recognition of the target word.
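The carry-forward rule described above can be sketched as follows. This is an illustration only: it assumes one 0/1 on-target value per time bin and per trial, and that the index of the first bin after the touch response is known; the names `carry_forward` and `percent_on_target` are hypothetical.

```python
def carry_forward(on_target_bins, first_bin_after_selection):
    """Treat every bin from the selection onward as a target fixation."""
    return [
        1 if i >= first_bin_after_selection else value
        for i, value in enumerate(on_target_bins)
    ]


def percent_on_target(trials):
    """Percent of trials on target at each time bin (equal-length trials)."""
    n_trials = len(trials)
    return [100.0 * sum(column) / n_trials for column in zip(*trials)]


# Two hypothetical trials of five bins each, with selections at bins 3 and 4.
trial_a = carry_forward([0, 1, 0, 0, 0], 3)
trial_b = carry_forward([0, 0, 1, 1, 0], 4)
curve = percent_on_target([trial_a, trial_b])
print(curve)  # [0.0, 50.0, 50.0, 100.0, 100.0]
```

Because every trial is filled to the end of the analysis window, trials with fast and slow responses contribute the same number of bins to the curve.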

We used Mirman's Growth Curve Analysis (Mirman, 2014), which is a multilevel regression technique designed for time course analysis, and specifically for the visual world paradigm. This method was chosen as it utilizes the fine-grained data eye-tracking provides, while avoiding the power-time-resolution tradeoff<sup>1</sup>. Three orthogonal time-vectors were computed from the time data. These vectors corresponded to first, second, and third-degree time terms, to help isolate the different polynomial time effects of the model parameters. We applied a mixed-effects model containing fixed effects of the competitor overlap (onset vs. offset), the working memory load (low vs. high), and the combined effect of the two on the intercept and all three time-terms. Random effects of the participants on the intercept and each time-term were also included. The mean of the model's predicted response was then plotted for each combined level of the factors. The overall time course of target fixations (from word onset to 2980 ms after word onset) was captured with a third-order (cubic) orthogonal polynomial with fixed effects of condition (low vs. high load) on all time terms, and participant and participant-by-condition random effects on all time terms. The low-load onset competition model was treated as the reference (baseline) and relative parameters estimated for the remaining three models (onset-high load, offset-low load, and offset-high load). For the models, time bins of 20 ms were used (10 samples per time bin, and 50 time bins per second), providing 125 measurements per trial in the period of interest (for details, see Mirman, 2014). Statistical significance for individual parameter estimates was assessed using the normal approximation. Specifically, because the high-resolution time-course data provided us with relatively many measurements, we assumed the t-scores calculated for the coefficient estimators were normally distributed and approximate z-values. All analyses were carried out in R statistical software (version 3.1.3). The lme4 package (version 1.1–10) was used to fit the linear mixed-effects models. All R packages were downloaded from the CRAN package repository (R Core Team, 2016).
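To make the notion of orthogonal time-vectors concrete, the sketch below builds linear, quadratic, and cubic time terms that are mutually uncorrelated. It is written in Python for illustration and uses a Gram-Schmidt procedure on centered bin indices; it is not the authors' R code, and the function name `orthogonal_time_terms` and the bin count used in the example are assumptions.

```python
def orthogonal_time_terms(n_bins, degree=3):
    """Return `degree` orthonormal polynomial time vectors over n_bins bins."""
    center = (n_bins - 1) / 2
    t = [i - center for i in range(n_bins)]  # centered bin index
    # Constant vector followed by raw powers t, t^2, ..., t^degree.
    raw = [[1.0] * n_bins] + [[x ** d for x in t] for d in range(1, degree + 1)]
    basis = []
    for vector in raw:
        w = list(vector)
        # Remove the components along every previously accepted vector.
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis[1:]  # drop the constant (intercept) vector


# Illustrative bin count only; the number of bins is a free parameter here.
linear, quadratic, cubic = orthogonal_time_terms(150)
```

Because the three vectors are orthogonal to one another and to the intercept, the estimate for each time term can be interpreted independently of the others, which is the motivation given above for using orthogonal rather than raw polynomial terms.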

### RESULTS AND DISCUSSION

### Accuracy Analysis

(a) Target selection. The target spoken word was correctly selected (100% accuracy) in all trials in both high and low load conditions. (b) Digit recall task. The mean accuracy across all conditions for the digit recall task was very high (M = 98.3%, SD = 4.1). However, it was slightly better for the low-load (one digit) relative to the high-load (four digits) condition (99.7 vs. 96.9%). Yet this difference was not found to be significant in a repeated measures ANOVA of digit-span accuracy with type of competitor (onset or offset) and working memory load (high or low) as within-participant factors.

# Eye-Movements Analysis—Fixation on the Target Word

**Figure 2** presents the data and the model for the offset (orange line) and onset (purple line) competitor trials, in the low load (continuous line) and the high load (dashed line) conditions.

First, all coefficients of the base model (onset, low load) were found to be significant (see Appendix 1 in Supplementary Material), indicating that the model presents a good fit to the data. Second, this base model was compared to the other three models (onset-high, offset-low, offset-high). All parameters of the other three models (linear, quadratic, and cubic) were significant<sup>2</sup>.
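Significance of individual coefficients rests on the normal approximation described in the analysis section, in which each t-score is treated as a z-value. A minimal sketch of that conversion follows; the function name and the example value are hypothetical and are not taken from the appendices.

```python
import math


def two_sided_p_from_t(t_value):
    """Two-sided p value obtained by treating a t statistic as a standard normal z."""
    # For a standard normal z, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(t_value) / math.sqrt(2.0))


# Hypothetical coefficient with t = 2.5 (illustrative value only).
p_value = two_sided_p_from_t(2.5)
is_significant = p_value < 0.05
```

The approximation is reasonable only when the number of observations is large, as is the case with densely sampled time course data.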

To estimate the main effects of load (high vs. low), type of competition (onset vs. offset) and the interaction of the two on the model, Chi-square tests were conducted (see Appendix 2 in Supplementary Material). There was a significant effect of working memory load, indicating that the models for low load were different from the models representing high load conditions across onset and offset phonological competition. Specifically, as **Figure 2** indicates, the models for the high load conditions show slower accumulation of information. To illustrate this effect of load, **Table 1** presents the thresholds for 25, 50, and 75% recognition in ms (the points in time after which the chance of fixating on the target was above 25, 50, or 75%) based on the model estimation. Note that across the three thresholds, recognition in the high load conditions occurs later than in the low load conditions.
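The threshold measure can be sketched as follows: given a model-predicted fixation curve sampled in 20-ms bins, find the first bin at which the predicted proportion exceeds a given level. The curves below are hypothetical logistic functions used purely for illustration; they are not the fitted models, and the function names are assumptions.

```python
import math


def first_crossing_ms(curve, level, bin_ms=20):
    """Return the onset time (ms) of the first bin whose value exceeds `level`."""
    for i, proportion in enumerate(curve):
        if proportion > level:
            return i * bin_ms
    return None  # the curve never exceeds this level


def toy_curve(midpoint_ms, slope_ms=150.0, n_bins=125, bin_ms=20):
    """Hypothetical logistic fixation curve; NOT the fitted model from the study."""
    return [
        1.0 / (1.0 + math.exp(-(i * bin_ms - midpoint_ms) / slope_ms))
        for i in range(n_bins)
    ]


low_load = toy_curve(midpoint_ms=1000)   # made-up midpoint
high_load = toy_curve(midpoint_ms=1100)  # made-up, shifted later in time

for level in (0.25, 0.50, 0.75):
    delay = first_crossing_ms(high_load, level) - first_crossing_ms(low_load, level)
    print(f"{int(level * 100)}% threshold: high load later by {delay} ms")
```

Reporting thresholds at several levels, rather than a single point, shows whether a delay is present throughout the rise of the curve or only in part of it.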

The data also show an effect of phonological overlap, where the models for onset were different from the models for offset competition (see Appendix 1 in Supplementary Material). Finally, a significant interaction of the two main effects was found. Examining **Figure 2**, it appears that the interaction reflects a larger effect of load on offset compared to onset competition. This differential effect of load is also evident by examining the model based thresholds in **Table 1**. Consider the 50% threshold for target recognition. Load delayed the threshold by 44 ms in onset competition trials and by 106 ms in offset competition trials. In sum, a four-digit preload delayed fixations on the target, but to a larger degree when the display presented offset-overlap competition.

# GENERAL DISCUSSION

The goal of the current study was to examine the influence of working memory load on spoken word recognition. Load was manipulated by having participants retain either four spoken digits (high load) or one digit (low load). By monitoring eye-movements, we were able to reveal a delay of more than 100 ms in the activation of the spoken target word (50% threshold in offset competition). Notably, listening conditions were ideal, and accuracy rates were at ceiling, indicating participants' adherence to the instructions and the ease of the task. Not only was the speech recognition task easy, but so was the digit recall task, as participants' average working memory capacity (as tested before the study) was substantially larger than four digits. Nevertheless, even though no extreme boundaries were reached, and the additional load had no effect on accuracy, we were still able to observe a slowdown in the recognition process due to working memory load.

<sup>1</sup>Mirman (2014; Mirman et al., 2008; Britt et al., 2014) discussed three main challenges of analyzing visual world time course data using t-tests or ANOVA. (1) Trade-offs between power and resolution. Namely, as each time bin has limited data, larger time bins are needed to increase statistical power. Yet this reduces temporal resolution, and thus valuable information on the gradual change over time might be lost. (2) Possibility of experimenter bias. In the traditional analyses, the experimenter must choose the time bin size and the time boundaries for the ANOVA. These choices might introduce a bias. (3) Statistical thresholding. The time-bin by time-bin tests must treat p values that are <0.05 as fundamentally different from those above 0.05. Thus, small noisy changes in the data may lead to over- or under-estimation of discrete differences.

<sup>2</sup>Except for the intercept of onset high load, which was not significantly different from the intercept of the base model (see Appendix 1 in Supplementary Material).

TABLE 1 | Thresholds in ms for 25, 50, and 75% recognition, based on the model, as a function of the type of phonological overlap (onset vs. offset) and load (high vs. low).

# Offset vs. Onset Competitor

Examining fixations on competitors, our data are consistent with evidence from continuous mapping models (e.g., TRACE; McClelland and Elman, 1986), where both onset and offset competition play a role in spoken word recognition. Across conditions, we found that the time spent fixating on the phonological competitors was, on average, higher than the time spent fixating on the unrelated items.

Turning to target fixations, we note a main effect for load, with delayed fixations in the high load condition, and a main effect for the type of phonological overlap, with delayed fixation for onset competition. The latter result supports previous works demonstrating weaker activation for offset relative to onset competition in young adults with good hearing (Allopenna et al., 1998; Tanenhaus et al., 2000; for supportive data from gating studies on onset vs. offset competition see Wingfield et al., 1997). Moreover, the size of the digit pre-load was found to have a larger impact on target recognition with offset competition compared to onset competition. In other words, increasing the pre-load from one to four spoken digits was sufficient to produce prominent competition from alternatives sharing offset sounds, as reflected by a slowdown in the target fixation function. This may relate to a reduced ability in the high-load condition to efficiently inhibit the processing of offset alternatives, which might be easily discarded in the low load condition (Lavie et al., 2004).

Our results may also suggest that in the high-load condition, listeners were slow on the uptake of the spoken word (sluggish onset). For example, when the offset sharing pair /xa.lon/–/ba.lon/ (window-balloon) is presented, slower processing of the initial sounds (that distinguish between the two alternatives) would increase the competition generated as the shared /lon/ sound unfolds. However, theoretically, this slowed processing of initial sounds should not increase competition at the onset of the word (e.g., /ar.nav/–/ar.gaz/; for a discussion on applying information theory to the analysis of signal processing, see Ben-David and Algom, 2009).

This slowdown in the processing of the initial sounds of the word is in line with the hypothesis that when working memory demands are higher (fewer resources are available), it takes longer for the speech sound stream to form into an auditory object (Kubovy and Van Valkenburg, 2001; for a review see, Griffiths and Warren, 2004). In such a case, integrating the phonemes into a coherent object (word) might have been delayed due to the working memory load. As a result, listeners were slower to process the initial sounds of the word. Moreover, Sörqvist et al. (2012) noted that an increase in working memory load was related to a decrease in a very early auditory sensory processing stage (measured by auditory evoked brain stem responses, ABR). However, the auditory stimuli were not at the center of listeners' attention nor were they speech-like. Clearly, more research is needed to examine whether the formation of auditory objects is impacted by load when speech is presented in quiet and there is no need to segregate streams.

The sluggish onset of word processing may also relate to the working memory load task itself. The phonological loop in the Baddeley model (Baddeley and Hitch, 1974; Baddeley, 1986) is the mechanism for temporary storage for phonemic information, and when it is occupied, the processing of other auditory information is impaired (e.g., Burgess and Hitch, 1999). This might suggest that the phonological loop, being preoccupied with rehearsing the preloaded digits, is responsible for the delay in word processing. It is possible that processing of the initial sounds of the word was hampered until the digits were encoded into long-term memory (LTM). Transferring the digits to LTM "freed" the phonological loop, enabling it to process effectively what is retained of the word (for a similar notion, see Rönnberg et al., 2013).

### Relating Our Data to Aging Research

It is possible to consider the links between our results in the high working memory load condition and the data obtained in similar studies with older adults. As older adults have reduced working memory capacity (Zacks, 1989; Salthouse et al., 2003; Gazzaley et al., 2005; Small et al., 2011), one may claim that performance in the high-load task can somewhat simulate the reduced working memory capacity indicated in older adults. Comparing our data to those of Ben-David et al. (2011) shows interesting similarities between the processing of older adults and the processing of younger adults in the high load conditions. The authors found substantially larger age-related effects on processing in the offset overlap condition than in the onset overlap condition (see Figures 6A,B, p. 253, Ben-David et al., 2011). The authors explained this difficulty with offset overlap as the consequence of older adults' less synchronized matching of auditory input to the mental lexicon, potentially due to reduced working memory capacity. It is possible that the working memory load manipulation had a similar impact on our participants' speech processing, by decreasing available resources for recognition. Further research can use the same working memory manipulation with older adults and observe whether offset competitor processing deteriorates more than onset competitor processing.

### Relating Our Data to the ELU Model

The ELU model (Rönnberg et al., 2013) posits that when there is a good match between the bottom-up acoustic input and the corresponding phonological representation in LTM, speech is processed implicitly with little or no demands on working memory resources. Further, task difficulty determines the allocation of resources to explicit speech processing that may include cognitive functions such as inhibition, executive functions, and working memory (McCabe et al., 2010). When the competition between bottom-up and top-down information increases, a shift is expected from implicit to more explicit processing. In our data, this shift might be reflected by a delay in gaze fixations on the target. We suggest that explicit processing for onset overlap (where competition is greater) was already employed in the low load condition. Thus, the increase in working memory load affected to a lesser degree the processing or gaze fixations for onset overlap in high load. Offset overlap trials, on the other hand, generated relatively little competition in the low load condition, and thus mostly relied on implicit processing. In the high-load condition, the additional demands on working memory amplified the competition, triggering the engagement of explicit processing. This was reflected in a delayed 50% threshold for gaze fixations on the target word.

# FUTURE STUDIES

Future studies should further investigate how aging and background noise can impact the role of working memory in speech processing. One of the biggest difficulties older adults have to cope with is deteriorated speech comprehension, especially in noisy environments (Schneider et al., 2010) and with increased demands (see Wingfield et al., 2015). This difficulty can interfere with maintenance of health and quality of life (Ishine et al., 2007; Gopinath et al., 2012) and can potentially affect the rate of cognitive decline (Lin, 2011). A central research question in speech recognition in older adults is the extent to which difficulties stem from bottom-up, sensory declines that degrade the speech input (Schneider and Pichora-Fuller, 2000), and to what extent they stem from an age-related decline in working memory (e.g., Bopp and Verhaeghen, 2007) and related cognitive abilities (e.g., inhibition of irrelevant distractors, see Ben-David et al., 2014; Lash and Wingfield, 2014). Specifically, a recent study may suggest that an increase in task demands (shifting from noise to babble background) hampered the ability of older adults to quickly generate independent target-word and background auditory streams (Ben-David et al., 2012). We hope that by adapting the paradigm used in the current study to test an older adult population, one can learn more about the role of working memory in speech processing in older age. Finally, more work is called for in Hebrew to see whether the language and the associated culture may contribute to the discussed effects. One such factor may be changes in the rate of speech across cultures and languages (see Ben-David and Icht, 2015; Icht and Ben-David, 2015), or unique attributes of Hebrew itself (e.g., the role of consonantal roots, see Frost et al., 1997).

# AUTHOR CONTRIBUTIONS

BH is responsible for the design of the paradigm, made a prominent intellectual contribution, approved the draft, and is accountable for the data. JS is responsible for the analysis and interpretation of the data and for revising the draft, made an intellectual contribution, approved the draft, and is accountable for the data. AW made an intellectual contribution to interpreting the results, approved the draft, and is accountable for the data. BB is responsible for the design of the paradigm, the analysis, and the interpretation of the results, made a prominent intellectual contribution, approved the draft, and is accountable for the data.

### ACKNOWLEDGMENTS

The authors thank Julia G. Elmalem, Juliet Gavison and Ronen Eldan for their work on this project. This research was conducted, in part, with the help of the Bronfman Philanthropies for collaborative research initiative, awarded to AW and BB. BB's research was partially supported by a Marie Curie Career Integration Grant (FP7-PEOPLE-2012-CIG) from the European Commission. AW's research is supported by the U.S. National Institutes of Health under award number R01 AG019714. This paper was published with the support of the Marie Curie Alumni Association.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2016.00221

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Hadar, Skrzypek, Wingfield and Ben-David. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Failing to get the gist of what's being said: background noise impairs higher-order cognitive processing

John E. Marsh<sup>1,2</sup>\*, Robert Ljung<sup>1</sup>, Anatole Nöstl<sup>1</sup>, Emma Threadgold<sup>3</sup> and Tom A. Campbell<sup>4</sup>

<sup>1</sup> Department of Building, Energy, and Environmental Engineering, Faculty of Engineering and Sustainable Development, University of Gävle, Gävle, Sweden, <sup>2</sup> School of Psychology, University of Central Lancashire, Preston, Lancashire, UK, <sup>3</sup> Psychology, City University, London, UK, <sup>4</sup> Neuroscience Center, University of Helsinki, Helsinki, Finland

A dynamic interplay is known to exist between auditory processing and human cognition. For example, prior investigations of speech-in-noise have revealed there is more to learning than just listening: Even if all words within a spoken list are correctly heard in noise, later memory for those words is typically impoverished. These investigations supported a view that there is a "gap" between the intelligibility of speech and memory for that speech. Here, the notion was that this gap between speech intelligibility and memorability is a function of the extent to which the spoken message seizes limited immediate memory resources (e.g., Kjellberg et al., 2008). Accordingly, the more difficult the processing of the spoken message, the fewer resources are available for elaboration, storage, and recall of that spoken material. However, it was not previously known how increasing that difficulty affected the memory processing of semantically rich spoken material. This investigation showed that noise impairs higher levels of cognitive analysis. A variant of the Deese-Roediger-McDermott procedure that encourages semantic elaborative processes was deployed. On each trial, participants listened to a 36-item list comprising 12 words blocked by each of 3 different themes. Each of those 12 words (e.g., bed, tired, snore…) was associated with a "critical" lure theme word that was not presented (e.g., sleep). Word lists were presented either without noise or at a signal-to-noise ratio of 5 decibels (A-weighted). Noise reduced false recall of the critical words, and decreased the semantic clustering of recall. Theoretical and practical implications are discussed.

Keywords: noise, elaborative processing, false recall, semantic clustering, speech intelligibility

# Introduction

In everyday life, listeners have to recognize speech under conditions in which the speech signal is degraded, masked or even replaced by the presence of background sound. From traffic in the street to cross-talk in a restaurant, that unwanted background sound is termed "noise". The impact of such noise on hearing aid users is socially profound. The overwhelming majority of patients visiting a hearing healthcare professional have reported difficulty understanding conversation in noise (Kochkin, 2000). Indeed, a large-scale survey revealed that one quarter of consumers did not use their hearing aid because of noise (Taylor, 2003). Adaptive procedures measuring performance in response to noise-embedded sentences are progressively becoming understood as essential to the routine diagnostic battery throughout the hearing aid fitting and selection process (Nilsson et al., 1994; Taylor, 2003; Killion et al., 2004).

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Nina Kraus, Northwestern University, USA Patti Adank, University College London, UK

### \*Correspondence:

John E. Marsh, School of Psychology, University of Central Lancashire, Darwin Building, Marsh Lane, Preston, PR1 2HE Lancashire, UK jemarsh@uclan.ac.uk

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 05 December 2014 Accepted: 16 April 2015 Published: 21 May 2015

### Citation:

Marsh JE, Ljung R, Nöstl A, Threadgold E and Campbell TA (2015) Failing to get the gist of what's being said: background noise impairs higher-order cognitive processing. Front. Psychol. 6:548. doi: 10.3389/fpsyg.2015.00548

For individuals with sensorineural hearing impairment, in turn with degraded input to their auditory systems, speech intelligibility may be affected by peripheral energetic masking (Kidd et al., 1998), further degrading the information in that signal even before neural transduction of that information at the cochlea. That neuronally transduced input may be regarded as even more partial because of the peripheral presence of noise. To a certain extent, the brain has been shown to use context to repair degraded sensory information and thereby improve speech perception (Shahin and Miller, 2009; Shahin et al., 2012). Further, speech-in-noise training has been shown to improve the performance on speech-in-noise tasks containing sentences (Song et al., 2012). Cognitive factors enhancing the cues that are important to listening to speech-in-noise caused this improvement, as was measured at the level of responses of the auditory brainstem (Skoe and Kraus, 2010; Campbell et al., 2012). Such performance improvements occurred with speech-in-noise training of cochlear implant users, who receive the very partial input of vocoded speech to a few electrodes at the cochlea (Ingvalson et al., 2013). That is, this speech-in-noise training caused improvements of central origins. The current research is thus centrally focused. A form of central masking of speech intelligibility (e.g., Kidd et al., 2010) affects the perception of individual words of sentences. At this level, central masking does not degrade the signal; rather, a central interference occurs between the noise and the to-be-attended signal. Crucial to the brain repairing a degraded speech signal is top-down prediction. That prediction could be influenced by phonemic, syntactic, and semantic information.

Germane to this concept of top-down prediction has been evidence that syntactic complexity raises speech reception thresholds in fluctuating noise in a manner less apparent with stationary noise (Uslar et al., 2013). A kindred cognitive phenomenon has been observed through examining the impact of noise on working memory performance (Schlittmeier et al., 2012), whereby noise that fluctuated more proved more disruptive. Accordingly, fluctuating noise disrupted the cognitive mechanisms involved in retaining the memory for words, in turn disrupting performance on Uslar et al.'s (2013) speech-in-noise task. Such noise disrupts visually based tasks even when semantically unrelated to the task being performed and even when heard at a low-to-moderate intensity (Marsh and Jones, 2010; for reviews, see Hughes and Jones, 2001, 2003; Beaman, 2005; Campbell, 2005; Beaman et al., 2007; Szalma and Hancock, 2011). With the advent of overly populated schools and open-plan offices, concern is rising that increases in noise pollution are adversely affecting scholastic attainment (Klatte et al., 2013) and productivity at work (Mak and Lui, 2012). What is really needed to understand distraction is an account of the effects of noise on perceptual processing (e.g., hearing), cognitive (mnemonic) functioning and the interplay between the two. With the emergence of the field of cognitive hearing science, research has identified that the capacity for language understanding is affected both by, on the one hand, processes that are perceptual and bottom-up, and, on the other hand, processes that are cognitive and top-down (e.g., Rönnberg et al., 2008). Pivotal to this cognitive hearing science approach is how changes in speech understanding (e.g., intelligibility) are underpinned by perceptual and cognitive functions.

A recent investigation has brought auditory noise to the fore in cognitive hearing science. This investigation concerned the effects of noise on the perception of words and the subsequent memory for those words (Kjellberg et al., 2008). Participants, who heard words correctly within noise, recalled those words worse than when those words had been presented without noise. That is, noise, which did not impair identification of speech, impaired cognition. The current study investigated how listening to spoken words in noise takes working memory resources away from the encoding, storage, and further processing of those words (Kjellberg et al., 2008). A variant of the Deese-Roediger-McDermott paradigm (DRM; Deese, 1959; Roediger and McDermott, 1995) was employed. Within the memory and language literature DRM procedures have been used to measure semantic processing (Stadler et al., 1999; Johansson and Stenberg, 2002). The present experiment gauged whether listening in noise reduced such semantic processing despite accurate identification of each spoken word during study.

Working memory deficits amongst the elderly have been attributed to degraded linguistic input due to age-related hearing loss (Rabbitt, 1968, 1991; Cervera et al., 2009). Speech understanding in effortful listening conditions, whether due to background noise or age-related hearing loss, is considered to require the direction of processing resources toward perceptual processing. That processing is required for recognizing the speech material. As a consequence, even if the recognition of speech is successful, fewer resources are left to accomplish other tasks such as storage, manipulation (e.g., elaboration), and comprehension of the materials. This "effortful listening" hypothesis is supported by the fact that adding broadband background sound (e.g., white noise) to a list of to-be-remembered spoken words, thereby reducing the signal-to-noise ratio (SNR), can impair free recall. That is, even when those words have been correctly heard previously, such noise still produced an impairment of memory performance (Kjellberg et al., 2008; Ljung and Kjellberg, 2009; Ljung et al., 2013; see also Rabbitt, 1968; Pichora-Fuller et al., 1995; Murphy et al., 2000; Schneider et al., 2002; McCoy et al., 2005; Wingfield et al., 2005). Since these noise effects cannot be attributed to impaired identification of material at study, it has been proposed that noise makes word identification more difficult, thus leaving fewer working memory resources available for the encoding, storage, and further processing of the words (McCoy et al., 2005; Kjellberg et al., 2008). Very few studies have investigated whether listening in noise impairs semantic processing of spoken words despite the correct identification of those words in noise. The paucity of research in this area impelled us to address the claim that listening to spoken words in noise reduces the higher-order cognitive processing (e.g., semantic processing) of those words.
Here, we used a free recall task in which we presented lists of thematically-related words (e.g., Stadler et al., 1999; Johansson and Stenberg, 2002). This variant of the DRM procedure is known to elicit higher-order (e.g., gist-based, or relational) semantic processing. If noise were to reduce such processing, noise would affect aspects of free recall performance as captured by this DRM procedure.
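The SNR manipulation referred to above can be made concrete with a minimal sketch that scales a noise signal so that a speech-plus-noise mixture has a chosen SNR in decibels. The signals here are synthetic stand-ins (a sine tone and Gaussian noise), the function names are hypothetical, and levels are computed from unweighted root-mean-square amplitude; any frequency weighting applied in an actual experiment is not modeled.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from RMS amplitudes."""
    return 20.0 * math.log10(rms(signal) / rms(noise))


def scale_noise_to_snr(signal, noise, target_snr_db):
    """Return a scaled copy of `noise` that yields the target SNR against `signal`."""
    gain = rms(signal) / (rms(noise) * 10.0 ** (target_snr_db / 20.0))
    return [gain * s for s in noise]


# Synthetic stand-ins for a speech recording and a broadband noise recording.
rng = random.Random(0)
speech = [math.sin(2.0 * math.pi * 220.0 * n / 8000.0) for n in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

scaled_noise = scale_noise_to_snr(speech, noise, target_snr_db=5.0)
mixture = [s + n for s, n in zip(speech, scaled_noise)]
```

A positive SNR, as in this sketch, means the speech is more intense than the noise, which is why words can remain identifiable even though listening becomes more effortful.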

In the DRM procedure, participants are presented with a list of items (e.g., bed, tired, snore…) that are all associated with a critical, non-presented, word, or theme (e.g., sleep). Many studies show that participants tend to falsely recall this critical lure despite explicit instruction not to guess (Deese, 1959; Roediger and McDermott, 1995). According to one approach (e.g., Brainerd et al., 2008), these "associative illusions" constitute a reflection of semantic gist processing: the semantic gist of the list is used as a retrieval cue (e.g., that all the words in the list were examples of "fruit") and the critical word is "recalled" because it matches that cue. Gist-based processes are distinguished from verbatim processes, the latter of which are responsible for encoding surface details (e.g., that the word was "banana", that the word was presented in black, printed in lowercase, etc.). Similarly, relational processing, that is, thinking about the commonalities of list words, also increases the false recall of the critical word (e.g., McCabe et al., 2004). According to Kjellberg et al. (2008) and McCoy et al. (2005), noise could curtail such gist-based processing, which involves deep-encoding processes. If this view is correct, then noise is predicted to reduce the frequency of false memories.

Higher cognitive processes are not only marked by the occurrence of false recall, but also by the organization of responses on that free recall task. In the typical DRM procedure approximately 15 words are associated with a critical word item (or "theme"; e.g., sleep). These 15 words are presented for free recall, but the critical item is not presented. Semantic processing is therefore indexed by this DRM procedure and this processing is revealed by the appearance of the critical lure as a response. To further reveal the semantic organization of the list words during output, we modified the DRM procedure such that groups of words were associated with one of three different themes within the same list (e.g., consider a number of words: dream, bed, wake, top, peak, summit, **hate**, **fear**, **cross**, presented from three themes: sleep, mountain, **angry**). For free recall, participants are expected to spontaneously cluster list words by theme at a greater-than-chance level even in the absence of explicit instruction to cluster. Further, participants typically cluster their responses by theme or by category even if, during study, the words are randomized with respect to their theme or categories (Bousfield, 1953; Smith et al., 1981; Marsh et al., 2009, 2014). The advantage of using this procedure, over free recall of lists comprising associates to a critical lure, is that this modified DRM yields a measure of semantic processing at test. That is, the degree of organization of responses by semantic category serves as an index of semantic processing (e.g., Marsh et al., 2014). This semantic clustering provides an opportunity to assess the degree to which extant semantic associations guide the encoding and retrieval of episodic information (Bousfield, 1953; Tulving, 1968). Semantic clustering is typically enhanced when processing is directed toward the organizational relations among list items as a whole.
That is, if participants attempt to concentrate on what the words within the study list have in common with one another (relational processing, arguably the default strategy when to-be-remembered words from the same theme are grouped together [blocked by theme] during study), semantic clustering is enhanced (Hunt and Einstein, 1981; Hunt and McDaniel, 1993). It is also known that such blocking of words by thematic category gives rise to a greater number of false memories (e.g., Goodwin et al., 2001; McCabe et al., 2004).
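To illustrate how the organization of recall by theme can be quantified, the sketch below computes one widely used index, the adjusted ratio of clustering (ARC), which equals 1 for perfectly clustered output and is near 0 at chance. Whether this particular index was used in the present study is not stated in this passage, so its use here is an assumption; the function name is hypothetical, the word-to-theme mapping reuses the example words given above, and responses that are not list words (such as a falsely recalled critical lure) are simply ignored.

```python
from collections import Counter


def arc_score(recalled, theme_of):
    """Adjusted ratio of clustering for one recall sequence (illustrative sketch)."""
    # Keep only recalled list words; intrusions and critical lures are skipped.
    themes = [theme_of[word] for word in recalled if word in theme_of]
    n = len(themes)
    if n == 0:
        return None
    counts = Counter(themes)
    # Observed repetitions: adjacent responses that share a theme.
    observed = sum(a == b for a, b in zip(themes, themes[1:]))
    # Repetitions expected by chance, and the maximum possible number.
    expected = sum(c * c for c in counts.values()) / n - 1
    maximum = n - len(counts)
    if maximum == expected:
        return None  # index is undefined for this sequence
    return (observed - expected) / (maximum - expected)


theme_of = {
    "dream": "sleep", "bed": "sleep", "wake": "sleep",
    "top": "mountain", "peak": "mountain", "summit": "mountain",
    "hate": "angry", "fear": "angry", "cross": "angry",
}

clustered = ["dream", "bed", "wake", "top", "peak", "summit", "hate", "fear", "cross"]
interleaved = ["dream", "top", "hate", "bed", "peak", "fear", "wake", "summit", "cross"]

clustered_arc = arc_score(clustered, theme_of)
interleaved_arc = arc_score(interleaved, theme_of)
```

Adjusting for chance matters because a participant who recalls many words from only one or two themes would show many same-theme adjacencies even without any deliberate organization.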

Investigations have previously revealed side effects of poor listening conditions on mnemonic retention (e.g., Kjellberg et al., 2008; Ljung et al., 2009). The novelty of the present investigation resided in testing whether poor listening impairs higher-order semantic processing (gist-based processing or relational processing) of to-be-remembered spoken material. Prior studies have manipulated the SNR of spoken material. For example, in the speech reception threshold test (Plomp and Mimpen, 1979) participants attempt to recognize and repeat familiar words (e.g., baseball, playground) or sentences presented at decreasing intensity typically against a background of noise; low SNRs occur when the difference between signal and noise decreases. At such low SNRs the listener has to rely more on informational redundancy and contextual cues to understand the word or sentence. A typical approach to determining the speech reception threshold has been to ascertain the level at which the participant can accurately repeat 50% of the presented words. Kjellberg et al. (2008), however, used a variant of this procedure to investigate the listening-memory function for words heard in noise. Kjellberg et al. drew two phonetically balanced word lists of 50 monosyllabic words from standardized audiometric materials (Hagerman, 1982). Words within each list were semantically unrelated to one another, so as to minimize the contribution of semantic and contextual top-down cues to listening. In the noise condition, aperiodic noise accompanied the list of 50 spoken words (SNR = 5). The rate of presentation was one spoken word every 4 s. Participants attempted to repeat back each word after that word was presented to ensure correct identification. The participants were aware of the requirement to memorize the words for a later memory test. Recall was immediate, following directly after the presentation of the list. Kjellberg et al. 
showed that adding noise to the spoken words impaired their free recall despite the fact that each word had been repeated earlier to ensure correct identification. This empirical acoustic setting, according to the ease of language understanding model (ELU; Rönnberg et al., 2008) rendered insufficient the implicit and unconscious cognitive-linguistic processing of the spoken words. The processing did not suffice to support the identification and understanding of those words. As a consequence, explicit processes requiring working memory are required to match degraded input to long-term memory representations by inferring missing information or repairing misunderstandings. Accordingly, that repair occurred either in a prospective manner, predicting what is upcoming within spoken language, or retrospective manner identifying what has already been said. These explicit working memory processes were accordingly cognitively demanding and necessary for the correct identification and comprehension of the speech signal under such adverse conditions (Rönnberg et al., 2009). Associated with these cognitive demands is a reduced availability of episodic memory resources. Such functions would have otherwise supported concurrent or subsequent storage, alongside the elaboration (e.g., relational or gist-based processing) of the speech input. Hence, later memory for the words suffered. In this conceptualization, therefore, speech perception consumed mnemonic functions particularly if that speech is degraded or one has hearing difficulty. Hence individual differences in memory functions have a pronounced effect on perceptual processing and reception of speech. 
Evidence supporting such an approach has stemmed from noise disrupting memory in the context of recognition, paired associate learning, sequential recall of nonsense syllables, sentence recall, discourse comprehension, and comprehension of oral instructions (Rabbitt, 1968; Pichora-Fuller et al., 1995; Surprenant and Neath, 1996; Murphy et al., 2000; Surprenant, 2007; Klatte et al., 2010; Valente et al., 2012). In all such investigations, a memory disruption occurred even though the SNR allowed perfect, or near perfect, identification of the speech. The individual susceptibility of memory functions to interference by noise thus determined how prior context can be used to predict and, in turn, repair and retroactively interpret speech in noise.

The purpose of this investigation was thus to test the hypothesis that listening to spoken words in noise reduces the semantic processing of those words. This we term the effortful listening hypothesis. More specifically, it is postulated that noise disrupts working memory, in turn, affecting semantic processing.

# Experiment

The current investigation thus developed the paradigm of Kjellberg et al. (2008). Rather than using short (e.g., 10-item) lists of unrelated words that would not permit analyses of semantic processing, a relatively long list of words was employed. Each list of words was blocked according to three themes. Prior studies demonstrate that blocking lists by semantic associates promotes spontaneous processing of the semantic relations between items within a list (e.g., D'Agostino, 1969). The aim of blocking here was to increase the likelihood that participants would organize responses by category and to precipitate false memories (e.g., McCabe et al., 2004). With such lists containing several themes, participants, at test, are known to cluster the associatively related items together at a greater-than-chance level even in the absence of any instruction to do so (e.g., Bousfield, 1953). Both false recall and semantic clustering are thus expected to yield evidence that participants bring to bear pre-existing conceptual relationships or semantic associations to guide encoding and retrieval of episodic information. If, according to the effortful listening hypothesis, noise renders word identification difficult, thereby leaving fewer working memory resources for the further processing and semantic elaboration of the words (e.g., McCoy et al., 2005; Kjellberg et al., 2008), then it would be predicted that presenting lists of semantic associates in noise as compared to quiet (no noise) will reduce false memories. The effortful listening hypothesis also predicts that noise will reduce the degree to which the associates are thematically organized (clustered) at test.

### Method

The study was approved by the Regional Ethical Review Board at the University of Uppsala (Dnr 2011/108). As the data would be treated anonymously, and no apparent ethical research complication with participation could be identified, oral consent was deemed sufficient by the Ethical Review Board. The data collector took note of the oral consent of participants.

### Participants

Thirty-one participants (18 male, 13 female) with a mean age of 26.8 years (range = 20–42 years) from the University of Gävle took part in return for a cinema ticket. Each participant self-reported normal hearing and Swedish as their first language. Data from five participants were excluded due to equipment failure (3 participants) or occasional non-compliance with experimental instructions (2 participants).

### Apparatus and Materials

### **To-be-remembered material**

Twelve associates were chosen from each of 30 themes in the Johansson and Stenberg (2002) norms in order to construct 10 lists of 36 words, each having 3 themes (see Appendix A in Supplementary Material). Themes chosen had minimal word overlap such as to diminish the possibility of proactive interference (Shuell, 1968). The words chosen were the 12 most frequently produced instances to the non-presented, critical item.

Themes were randomly assigned to each list, with the constraint that associated themes were not presented together. Words within each list were arranged in a blocked format such that all associates from a given theme were presented together within the list. The words were digitally recorded without intonation in a female voice at an approximately even pitch and sampled with a 16-bit resolution at a sampling rate of 44.1 kHz using Audacity software. The first and second authors listened to the recordings to ensure that they met these criteria, and on the few occasions on which a recording of a spoken word failed to do so, the word was recorded again. Spoken items were presented at an equivalent sound level of 64 decibels as measured with a digital sound level meter (Mastech MS6700) on an A-weighting.

### **Noise**

Broadband noise was synthesized from the speech material, thereby producing noise with the same long-term average spectrum (LTAS) characteristics as the speech stimuli. For the noise condition, the noise at 59 decibels was mixed with the spoken items, thereby giving an SNR of 5 decibels. This SNR made listening demanding, but not impossible. The lists were presented via stereo headphones that the participants wore throughout the experiment. Participants wore headphones throughout the control condition and the background noise within the room yielded an SNR of 28 decibels. Measures in decibels were determined using an A-weighting.
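The SNR arithmetic can be made explicit with a minimal sketch. It assumes that SNR is simply the difference between the A-weighted speech and noise levels in decibels; the function names are illustrative and not part of the study's software. The implied ambient level in the control condition (64 minus 28 decibels) is an inference from the reported figures, not a value stated by the authors.

```python
def snr_db(signal_level_db: float, noise_level_db: float) -> float:
    """SNR in decibels as the difference between two sound levels in dB."""
    return signal_level_db - noise_level_db


def db_to_power_ratio(db: float) -> float:
    """Convert a level difference in decibels to a linear power ratio."""
    return 10 ** (db / 10)


speech_level = 64.0  # dB(A), reported presentation level of the spoken items
noise_level = 59.0   # dB(A), reported level of the speech-shaped noise

noise_snr = snr_db(speech_level, noise_level)
print(f"Noise condition SNR: {noise_snr:.0f} dB "
      f"(speech power is about {db_to_power_ratio(noise_snr):.2f} times the noise power)")

# Inferred, not reported: ambient level implied by the 28 dB SNR in the control condition.
implied_ambient = speech_level - 28.0
print(f"Implied ambient level in the control condition: {implied_ambient:.0f} dB(A)")
```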

### **Design**

A repeated-measures design was used with one within-participant factor, Noise, of which there were two levels: No noise, Noise. The two conditions were randomized as follows. The 10 lists were randomly split into two sets of five and interleaved with one another during presentation. For half of the participants, the first list was presented in noise and the second list presented in quiet (no noise) with trials alternating thereafter between noise and no noise, whereas this order was reversed for the other half of the participants.
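One reading of this assignment scheme can be sketched as follows. The sketch assumes that randomly splitting the lists into two interleaved sets is equivalent to shuffling the 10 lists and alternating conditions by position; `build_schedule`, its parameters, and the list identifiers are hypothetical and do not reproduce the authors' E-Prime script.

```python
import random


def build_schedule(list_ids, noise_first, seed=None):
    """Shuffle the lists, then alternate noise and no-noise trials.

    noise_first=True starts with a noise trial; False starts in quiet.
    """
    rng = random.Random(seed)
    order = list(list_ids)
    rng.shuffle(order)  # random split into two interleaved sets of five
    schedule = []
    for position, list_id in enumerate(order):
        # Even positions take the starting condition, odd positions the other.
        in_noise = (position % 2 == 0) == noise_first
        schedule.append((list_id, "noise" if in_noise else "no noise"))
    return schedule


# Hypothetical participant who starts with a list presented in noise.
for list_id, condition in build_schedule(range(1, 11), noise_first=True, seed=1):
    print(list_id, condition)
```

Calling the function with `noise_first=False` gives the reversed order used for the other half of the participants.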

### **Procedure**

Participants were seated in a cubicle. Lists of theme items were presented over stereo headphones (Sennheiser HD-200) one word at a time with an inter-stimulus interval of 4 s of quiet between each word. Retrieval was immediate at the end of the list and participants typed their answers into an E-Prime computer program that also controlled stimulus presentation.

Participants were tested individually in a small room containing an HP Compaq 6720s laptop PC. They were informed that they would be presented with ten 36-word lists and that each list would be presented one word at a time over headphones. They were asked to memorize as many words as possible and to write down the words they remembered in the order that they came to mind. Participants were not explicitly told that the lists could be categorized by theme. Participants were informed that they had unlimited time for recall, and that when they could not remember any additional items, they should click on a "continue" button to initiate the onset of the next list. Participants were instructed that the to-be-remembered items would sometimes be accompanied by noise. They were also instructed to ignore the noise and to concentrate on identifying each word. The experimental session lasted for approximately 50 min<sup>1</sup>.

### Recall Measures

Recall measures came in four forms: the total number of items correctly recalled, the mean number of items recalled by theme, the number of themes recalled (scored by recalling one word from a theme), and the number of critical items falsely recalled.

### Results

**Table 1** shows the results of the various recall measures in the two conditions. The effects of noise on each dependent variable displayed were those predicted by the effortful listening hypothesis that effortful listening in noise detracts from the elaborate, semantic processing of spoken words. Accordingly, in noise, fewer items were recalled correctly, with fewer correct items per semantic theme and fewer such themes. Fewer critical lure words, not present in the list, were recalled. This further evidence for a noise-induced decrement in semantic processing was bolstered by the clustering measure detailed in the ensuing sub-section.

Inferential statistical analyses corroborated these overall tendencies: the mean total number of items correctly recalled per list was significantly lower in the noise condition than in the control condition, t(25) = 4.24, p < 0.001, CI<sub>0.95</sub> = [0.858, 2.48]. Further, the mean number of items per theme recalled was also significantly lower in the noise condition than in the control condition, t(25) = 3.52, p = 0.002, CI<sub>0.95</sub> = [0.227, 0.868]. In addition, the mean number of themes recalled was also significantly lower in the noise condition than in the control condition, t(25) = 2.47, p = 0.021, CI<sub>0.95</sub> = [0.024, 0.268]. Finally, the mean number of critical lures recalled was also smaller in the noise condition as compared to the control condition, t(25) = 2.83, p = 0.009, CI<sub>0.95</sub> = [0.262, 1.66]. There were too few intrusions to be subject to inferential statistical analysis. This paucity of intrusions is likely due to the theme of the list acting as a top-down guide such that words phonemically similar to targets were not produced because those words did not "fit" with the semantic theme being recalled.
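As a consistency check on these statistics, the following sketch recovers the mean difference, its standard error, and the t value from each reported 95% confidence interval. It assumes that each interval is a confidence interval on the paired mean difference (no noise minus noise) with 25 degrees of freedom; the critical value 2.0595 is taken from standard t tables rather than from the article, and `t_from_ci` is an illustrative helper.

```python
# Two-tailed 95% critical value of Student's t for df = 25
# (assumed constant from standard t tables, not reported in the article).
T_CRIT_DF25 = 2.0595


def t_from_ci(lower, upper, t_crit=T_CRIT_DF25):
    """Recover (mean difference, standard error, t) from a symmetric CI."""
    mean_diff = (lower + upper) / 2
    std_error = (upper - lower) / (2 * t_crit)
    return mean_diff, std_error, mean_diff / std_error


# (CI lower bound, CI upper bound, reported t) as given in the text.
reported = {
    "items recalled per list": (0.858, 2.48, 4.24),
    "items recalled per theme": (0.227, 0.868, 3.52),
    "themes recalled": (0.024, 0.268, 2.47),
    "critical lures recalled": (0.262, 1.66, 2.83),
}

for measure, (lower, upper, t_reported) in reported.items():
    mean_diff, std_error, t_value = t_from_ci(lower, upper)
    print(f"{measure}: mean difference = {mean_diff:.3f}, "
          f"SE = {std_error:.3f}, recovered t = {t_value:.2f} (reported {t_reported})")
```

Under these assumptions the recovered t values agree with the reported ones to within rounding of the interval limits.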

TABLE 1 | Mean recall performance for the four recall measures as a function of two background conditions (no noise vs. noise) used in the study.


<sup>1</sup> An initial pilot study was undertaken using the same methodology as outlined in the foregoing, with the exception that participants merely had to shadow (repeat back) spoken words during presentation. Eight participants (6 male; mean age 31.4 years [SD = 2.7]) from the University of Gävle took part in return for a cinema ticket. Each participant self-reported normal hearing and reported Swedish as a first language. Participants were randomly assigned to one of the four orders, matching those used in the study proper. These volunteers correctly repeated back 34.35 (SE = 0.11) words per list in the no noise condition and 32.78 (SE = 0.09) words per list in the noise condition. This difference was statistically significant, t(7) = 16.42, p < 0.001, CI<sub>0.95</sub> = [1.35, 1.80]. Further analyses of the data revealed that the participants experienced difficulty only in repeating back infrequent words that were typically weak in terms of those words' backward associative strength to the critical item. Backward associative strength is the strength of normative association from a list word (e.g., thread) to the critical word (e.g., needle) as indexed by the probability of a list word eliciting the critical word in a word association task. Backward associative strength is the most important factor in determining false recall of the critical item (Roediger et al., 2001). Consequently, the identification of words high in backward associative strength to the critical item was unaffected by the noise. In turn, the SNR in the noise condition, which impaired identification only of weak associates of the critical lure, did not materially affect the gist-based, or relational, processing responsible for eliciting the critical lure. Identification of the strong associates, the processing of which primes production of the critical lure, went unaffected by the presence of noise. The SNR of 5 decibels was therefore deemed appropriate for use in the experiment proper.

### Clustering Measure

There are a number of ways of measuring semantic clustering (Murphy, 1979). Here, we use the Z score (for the mathematical assumptions and algorithms used to compute Z scores, see Frankel and Cole, 1971). Briefly, this measure of clustering is based on the number of runs of exemplars of the same category at test. Run length is defined as the number of same-category items recalled in succession. Items recalled in isolation are scored as runs of one. Therefore, the number of runs is one more than the number of times the category changes during recall. Suppose a, b, and c represent different themes and items from those themes: the recalled list aaabbccbcbb has six runs, commencing with a run of three and terminating with a run of two. On the Z score measure, clustering occurs when the number of runs observed on the output list is significantly fewer than expected by chance. Perfect clustering, e.g., aaaabbbbcccc, results in a higher Z score than imperfect clustering, e.g., abaabccbbcca. Positive Z scores indicate tendencies toward clustering. Negative Z scores are possible when less categorization occurs than expected by chance. The Z score, therefore, has an advantage over several other methods of assessing clustering because it enables one to tell whether clustering is at an above-chance level (Frankel and Cole, 1971). Z scores here, as is typical (Murphy, 1979), were computed with all repeat and intrusion errors removed. The mean Z score was lower in the noise condition than in the no noise control condition, t(25) = 3.19, p = 0.004, CI<sub>0.95</sub> = [0.169, 0.788].
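The run-counting logic can be illustrated with a short sketch. The Z score here is an assumption: it uses the normal approximation to the multi-category runs test, with the sign chosen so that fewer runs than expected gives a positive score. The exact computation of Frankel and Cole (1971) may differ in detail (for example, in any continuity correction), and the function names are illustrative.

```python
from collections import Counter
from itertools import groupby
from math import sqrt


def count_runs(recall):
    """Number of runs: maximal stretches of same-category items in succession."""
    return sum(1 for _ in groupby(recall))


def clustering_z(recall):
    """Runs-based clustering Z score (positive = fewer runs than chance)."""
    n = len(recall)
    counts = Counter(recall).values()
    if len(counts) < 2:
        raise ValueError("at least two categories are needed")
    s2 = sum(c ** 2 for c in counts)
    s3 = sum(c ** 3 for c in counts)
    # Mean and variance of the number of runs under random ordering.
    expected = (n * (n + 1) - s2) / n
    variance = (s2 * (s2 + n * (n + 1)) - 2 * n * s3 - n ** 3) / (n ** 2 * (n - 1))
    return (expected - count_runs(recall)) / sqrt(variance)


for recall in ("aaabbccbcbb", "aaaabbbbcccc", "abaabccbbcca"):
    print(recall, count_runs(recall), round(clustering_z(recall), 2))
```

Each letter stands for the theme of one recalled item, so a real output list would first be recoded into theme labels with repeats and intrusions removed.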

### Discussion

The results support the effortful listening hypothesis. That is, listening in noise is effortful and requires working memory resources that are necessary for elaborate, semantic processing of spoken words (McCoy et al., 2005; Kjellberg et al., 2008). Noise disrupts that elaborate semantic processing. When the words were presented in noise, participants not only remembered less of the word lists, but also falsely recalled fewer critical items, recalled fewer themes, and semantically clustered fewer of the associates by theme at output. The recall of critical items, themes, and semantic clustering is traditionally accepted as reflecting higher-order semantic processing (Hunt and McDaniel, 1993; Burns and Brown, 2000). The reduction in recall of themes, for example, is thought to reflect the failure to adequately establish higher-order semantic encodings during study that can be used during retrieval as a plan to enable the transition between themes during recall (Bower et al., 1969). Listening difficulty thus requires working memory resources, thereby leaving fewer of these limited resources available for encoding, storage, and further conceptual processing of the words using pre-existing semantic associations (McCoy et al., 2005; Kjellberg et al., 2008).

Consistent with the ELU model (Rönnberg et al., 2008), the interpretation offered is that listening in noise renders the implicit lexical access processes that ordinarily underpin language processing insufficient. Therefore explicit processes requiring working memory resources are required to match, via reconstruction, the degraded incoming stimuli against representations in long-term episodic memory. These processes are guided by top-down knowledge that the list words belong to semantic themes, therefore avoiding the incorrect production of a candidate item that is phonologically similar to the target. However, this resource-demanding process adversely affects other resource-requiring processes involving episodic long-term memory. There is thus a compromise in the operation of that resource-requiring relational processing or gist-based processing (Serra and Nairne, 1993; DeLosh and McDaniel, 1996). That compromise has knock-on effects impairing storage and elaboration of the spoken input. Consistent with this view, one possibility is that listening difficulty during study increases the use of verbatim processing at the expense of gist-based processing. Noise, for example, may require participants to process the verbatim, perceptual characteristics or surface forms of the spoken words to identify those words. Such characteristics include information about the phonetic constituents of those words or the linguistic style of the speaker. This increase in verbatim processing thus leads to an impoverished gist-based processing of how the words belong to a common semantic theme. In turn, there is a reduction in the elicitation of false recall (cf., Brainerd and Reyna, 2002).

Another possible explanation for the impairment to recall that listening in noise produces may be found in relation to the distinctiveness processing framework (Hunt et al., 2011). This framework incorporates two episodic processes: relational processing that encodes similarity among a set of items, and item-specific processing that encodes information that is specific to individual items. In the current study, the listening difficulty that noise causes could produce a focus toward the item-specific properties of the to-be-remembered items at the expense of relational processing. A reduction in relational processing would have the consequence of reducing semantic organization. That reduction would, in turn, cause less clustering by theme (Hunt and McDaniel, 1993) and a reduction in false recall (Hunt et al., 2011). Moreover, another possibility is that listening effort merely tilts the balance between relational and item-specific processing toward item-specific processing. A similar view has been espoused by Arndt and Reder (2003) to explain their finding that presenting each list item in a perceptually distinct font, as compared to the same font, reduced false memory for the critical lures. Arndt and Reder (2003) suggested that the unique fonts caused processing of item-specific features of the visual items. Processing of those item-specific features therefore directed processing away from the relational information. As a consequence, Arndt and Reder argue, the probability of activating the critical items is therefore reduced. With regard to the current study, our findings are consistent with the view that noise directs processing to the item thereby reducing relational processing. Indeed, this is consistent with the notion that noise induces a process that explicitly matches the incoming stimuli with representations in episodic long-term memory (Rönnberg et al., 2008).
However, as Arndt and Reder (2003) do, we suggest that the processing balance between item-specific and relational processing is merely tilted in favor of item-specific processing, rather than item-specific processing being increased. This conclusion is shaped by the finding that listening in noise reduced correct recall in the current investigation. By contrast, for manipulations and orienting tasks that emphasize item-specific processing, correct recall is typically facilitated (Mulligan, 1999) or unchanged (Hunt, 2003).

The results do not support the view that noise degrades the sensory traces of stimuli making them more difficult to discriminate from one another (Surprenant and Neath, 1996; Surprenant, 2007). According to this view the occurrence of false recall should be greater in the noise condition: Participants no longer have access to the acoustic codes that could distinguish the study items from the non-presented critical items that lack such a code (Rummer et al., 2009).

The current investigation offers further evidence for the gap between listening and mnemonic performance: Our pilot study showed a 4.5% reduction of correct identification by noise. However, there was almost a threefold drop in mnemonic performance (16%), which is substantially greater than the 7.7% reported by Kjellberg et al. (2008). We attribute this drop in mnemonic performance to the semantic nature of the task used in the current study. In the investigation by Kjellberg et al. (2008) lists comprised words that were unrelated to one another. This lack of a meaningful association presumably required a greater reliance on perceptual as compared with semantic coding of the speech signal. In the listening-in-noise condition, explicit matching processes would be required, whereby perceptually similar alternative interpretations of the speech stimuli are considered as using stored information (Rönnberg et al., 2008). Within the current experiment, lists of thematic words were used and these items were blocked by theme. The semantic priming occurring between consecutive items constrains the search set within long-term memory and diminishes any gain that may arise from generating phonetically similar, candidate items within long-term memory. Moreover, there is a possibility that our method of blocking list items by theme during study could have underestimated the disruptive effect of noise on some measures of higher-order processing: Blocked presentation methods are known to give rise to greater false recall levels than when themes are randomly interspersed throughout a list, or are presented along with unrelated filler items (Goodwin et al., 2001). However, blocking items by theme (or category) compared to random presentation increases semantic clustering (D'Agostino, 1969). 
Therefore, it is possible that much more pronounced effects of noise during study arise for semantic clustering if the associates to each theme are randomly presented throughout the list whereby the semantic connections to the themes are more difficult to process.

Working memory processes play a role when individuals with hearing loss listen in noise (Akeroyd, 2008; Rönnberg et al., 2008; for a review, see Mattys et al., 2012). More work is needed that investigates the memory functions for the semantically rich materials used in the current study for individuals who differ in working memory capacity and hearing. Semantic effects are predicted to be more pronounced for individuals with poorer speech perception capabilities in noise. Such individuals include those with hearing impairment (Rabbitt, 1991; Pichora-Fuller et al., 1995) or young children (Wightman and Kistler, 2005), who typically require an SNR 5-7 decibels higher to achieve similar levels of identification of speech and nonspeech signals (see Werner, 2007). Similarly, individuals with low working memory capacity should also experience a much greater disruptive effect due to listening in noise. Previous work has also shown that the effects of advancing age can be offset by cognitive capacity, indicating that listening per se is maintained among elderly individuals with high working memory capacity (Rönnberg, 2003). However, as we have described earlier, listening success and later mnemonic success are different functions. Therefore, there is a requirement to understand whether mnemonic performance is impaired disproportionately among younger and older adults with comparable listening ability. Further, and of particular relevance to our current study, while younger individuals do benefit from semantic encoding instructions as compared with shallow encoding instructions, elderly individuals do not. For example, older adults show less activity in regions of the brain that are associated with semantic processing than younger adults (Daselaar et al., 2003).
Elderly individuals as compared to younger adults are accordingly not only disproportionately impaired in listening to semantically rich material (Pichora-Fuller, 2008), but also in their memory for such material. The effects demonstrated here should also be more pronounced when the masking sound is fluctuating noise rather than steady noise (Leibold and Neff, 2007; Uslar et al., 2013), which is arguably more ecologically valid particularly within the built environment setting.

By contrast to the present investigation, the recent findings of Uslar et al. (2013) investigate speech reception thresholds of sentences rather than single words in noise. These thresholds in fluctuating noise are strongly correlated with cognitive abilities. That is, an individual's attention or "conflict monitoring" (Stroop) and working memory (digit span, word span) ability correlated with speech perception in that noise in a manner not shown for speech without noise. Uslar et al.'s (2013) findings thus support the view that, in noise, cognition "kicks-in" during speech understanding (Rönnberg et al., 2010). However, Uslar et al.'s (2013) data show these individual differences in cognitive factors neither influence how deviations from a canonical word order, nor how increases in syntactic complexity, affect speech recognition thresholds in fluctuating noise.

A question outstanding for cognitive hearing science is what role working memory plays in speech perception in noise if working memory does not assist the syntactic processing of sequences of words in sentences. The data of the present investigation address this question. Working memory for the semantics of prior material affects the lexical access and the elaborative processing of speech-in-noise. Accordingly, semantic processing operates predictively to determine what is heard in a top-down manner, permitting the brain to repair semantically predictable utterances obscured by noise. It is posited in the theory offered that semantic repair during lexical access in noise requires working memory resources. Not only does this requirement affect the perception of speech-in-noise but also the understanding of that speech, which is the primary objective of listening to speech.

In the longer term, a test sensitive to the identified influences of semantic processing of speech-in-noise might join the audiologist's diagnostic battery, including established approaches using sentences in noise. Such an approach could, at least, give the patient a realistic assessment of the structure of their communication problems, and how well particular hearing-assistive devices and cognitive training programmes such as working memory training (Henshaw and Ferguson, 2013) might, or might not, help. Further development of valid diagnostic measures related to semantic processing of speech-in-noise is required. Whether those measures offer a specificity that is predictive of treatment outcome remains a further open question for cognitive hearing science to address.

### Conclusions

This investigation shows that listening difficulty has a pronounced effect on later mnemonic retention of thematically organized lists of words. This result is consistent with the view that identification of speech-in-noise adversely affects the encoding, storage, and processing of the spoken information (McCoy et al., 2005; Kjellberg et al., 2008). Further, this view is consistent with noise adversely affecting semantic processes (gist-processing or relational-processing), semantic clustering, theme recall, and false recall of the critical words. All these indices of semantic processing are diminished following degraded speech presented during study. Further, the memory "gap" between intelligibility and memory is of greater magnitude than previously observed, possibly owing to the rich semantic nature of the to-be-remembered material. That gap is also greater than in previous investigations because the materials were semantically richer than the words used in those previous experiments (Kjellberg et al., 2008). Cognition, particularly in relation to semantic processes, therefore is particularly vulnerable to listening conditions during study. The results illustrate the importance of the dynamic interplay between human cognition and auditory processing. These findings are generally consistent with the assertion that adverse listening conditions recruit explicit working memory processes that, as a consequence, reduce the capacity for the efficient operation of episodic memory processes (Rönnberg et al., 2008). Finally, the results reported here are important to bear in mind when discussing acoustical norms for classrooms and other premises within which the understanding and memory for spoken information is vital: guidelines should neither relate simply to the signal being heard, nor to the memory function for lists of unrelated words, but should also concern memory for material from which meaning is to be extracted and elaborated. The development of a hearing-cognition instrument to take into account listening and memory functions is therefore a priority area.

### Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2015.00548/abstract


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Marsh, Ljung, Nöstl, Threadgold and Campbell. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Corrigendum: Failing to get the gist of what's being said: background noise impairs higher-order cognitive processing

John E. Marsh<sup>1,2</sup> \*, Robert Ljung<sup>1</sup>, Anatole Nöstl<sup>1</sup>, Emma Threadgold<sup>3</sup> and Tom A. Campbell<sup>4</sup> \*

### Edited and reviewed by:

Jerker Rönnberg, Linköping University, Sweden

### \*Correspondence:

John E. Marsh JEMarsh@uclan.ac.uk Tom A. Campbell tom.campbell@helsinki.fi

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

> Received: 31 January 2017 Accepted: 01 March 2017 Published: 28 March 2017

### Citation:

Marsh JE, Ljung R, Nöstl A, Threadgold E and Campbell TA (2017) Corrigendum: Failing to get the gist of what's being said: background noise impairs higher-order cognitive processing. Front. Psychol. 8:390. doi: 10.3389/fpsyg.2017.00390

<sup>1</sup> Department of Building, Energy, and Environmental Engineering, Faculty of Engineering and Sustainable Development, University of Gävle, Gävle, Sweden, <sup>2</sup> School of Psychology, University of Central Lancashire, Preston, Lancashire, UK, <sup>3</sup> Psychology, City University, London, UK, <sup>4</sup> Neuroscience Center, University of Helsinki, Helsinki, Finland

Keywords: noise, elaborative processing, false recall, semantic clustering, speech intelligibility

### **A corrigendum on**

### **Failing to get the gist of what's being said: background noise impairs higher-order cognitive processing**

by Marsh, J. E., Ljung, R., Nöstl, A., Threadgold, E., and Campbell, T. A. (2015). Front. Psychol. 6:548. doi: 10.3389/fpsyg.2015.00548

# ERROR IN TABLE

In the original article, there was a mistake in **Table 1** as published. Due to a tabulation error, the total number of critical lures recalled was reported incorrectly. The corrected **Table 1** appears below. The authors apologize for this error and state that this does not change the scientific conclusions of the article in any way.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Marsh, Ljung, Nöstl, Threadgold and Campbell. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

### TABLE 1 | Mean recall performance for the four recall measures as a function of two background conditions (no noise vs. noise) used in the study.


# The Item versus the Object in Memory: On the Implausibility of Overwriting As a Mechanism for Forgetting in Short-Term Memory

C. Philip Beaman<sup>1</sup> \* and Dylan M. Jones<sup>2</sup>

<sup>1</sup> Centre for Cognition Research, School of Psychology and Clinical Language Sciences, University of Reading, Reading, UK, <sup>2</sup> School of Psychology, Cardiff University, Cardiff, UK

The nature of forgetting in short-term memory remains a disputed topic, with much debate focussed upon whether decay plays a fundamental role (Berman et al., 2009; Altmann and Schunn, 2012; Barrouillet et al., 2012; Neath and Brown, 2012; Oberauer and Lewandowsky, 2013; Ricker et al., 2014) but much less focus on other plausible mechanisms. One such mechanism of long-standing in auditory memory is overwriting (e.g., Crowder and Morton, 1969) in which some aspects of a representation are "overwritten" and rendered inaccessible by the subsequent presentation of a further item. Here, we review the evidence for different forms of overwriting (at the feature and item levels) and examine the plausibility of this mechanism both as a form of auditory memory and when viewed in the context of a larger hearing, speech and language understanding system.

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Emily M. Elliott, Louisiana State University, USA Lars Nyberg, Umeå University, Sweden

### \*Correspondence:

C. Philip Beaman c.p.beaman@reading.ac.uk

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 11 December 2015 Accepted: 23 February 2016 Published: 10 March 2016

### Citation:

Beaman CP and Jones DM (2016) The Item versus the Object in Memory: On the Implausibility of Overwriting As a Mechanism for Forgetting in Short-Term Memory. Front. Psychol. 7:341. doi: 10.3389/fpsyg.2016.00341

Keywords: auditory cognition, short-term memory, memory, forgetting, auditory scene analysis

Like many cognitive capabilities, language is grounded in memory. A failure to appreciate what has just gone drastically limits the ability to comprehend the present and any capacity to anticipate the future. Both long-term memory (for semantics and other lexical and world knowledge) and short-term memory (a record of the immediate past) are implicated in this process. In the current paper, a particular focus is placed upon the relationship between memory and language reception (e.g., hearing) rather than production (e.g., speaking). Although the latter is clearly of importance – both as an aspect of language in which memory must play its part and as a means by which (via overt or sub-vocal rehearsal) information is maintained in short- or longer-term memory (Craik and Watkins, 1973; Ward et al., 2003; Ward and Tan, 2004; Taylor et al., 2015) – a focus on the nature of the auditory-perceptual input suggests constraints on how any system accepting such input must be configured.

A key feature of the classical short-term memory (STM) research program is the importance of serial order (Lashley, 1951; Conrad, 1960; Murdock, 1968, 1983; Lewandowsky and Murdock, 1989; Henson, 1998; Brown et al., 2000; Botvinick and Plaut, 2006; Burgess and Hitch, 2006). Words, letters, or digits are presented sequentially and participants are required to recall the items in the order in which they were presented. Short-term memory tasks are usually deliberately designed so that the associations to be held across multiple items (words, digits, letters) are arbitrary. Performance in such tasks is framed in terms of its proximity to verbatim recall of all the items, namely the correct item in its position at presentation. Implicitly, short-term memory theorists make the assumption that the "item" (the word, letter, or digit of interest), rather than the relationship between items, is the most meaningful unit of analysis. Moreover, identification of the item at recall is the basis for correct scoring in the memory test. Despite known problems with identifying the "item" in anything other than a logically circular way (Miller, 1956) such an approach is defensible in cases where the items are well-known and taken from a small, circumscribed set (e.g., digits) and where recollection of an item at a time collapses into a requirement to select the most likely candidate given the degraded or incomplete information available (Nairne, 1990). This contrasts starkly with the situation in most everyday language, in which structured, non-arbitrary relationships are available between individual elements represented at multiple levels (phonotactic, syntactic, semantic, pragmatic) and the identification of a single "item" is neither necessary nor sufficient to comprehend the meaning of the sequence.

By considering veridical recall of arbitrary items rather than the relationships between them, much of interest is lost with regards to later analyses. A key component of perception is in organizing as well as registering information and of interest is whether, in registering and organizing the stimuli prior to retrieval, the perceptual system represents them in a way that harmonizes with the retrieval requirements in standard short-term memory tasks. Given the emphasis on the iterative retrieval of items across the to-be-remembered sequence, does the perceptual system, for instance, cluster items at a supra-item level in such a way as to aid or to hamper efficient retrieval? In other words, does perception result in 'items' corresponding exactly to items specified in terms of the linguistic taxonomy (such as single syllables, words, or digits) on which the sequence is nominally based? In the event of supra-item organization, how are items grouped or transformed? Is there grouping of adjacent elements (as with chunking, classically) or are non-adjacent items organized into a greater whole? Within item-focussed approaches to short-term memory—ones that assume recall is a product of an aggregation of elemental actions—forgetting may be explained by the 'overwriting' of items by subsequent events. If suprasegmental organization occurs, is overwriting still a plausible mechanism?

Here, we explore how the registration of events in memory reflects auditory input and, in particular, the organizational processes that are at play. On the basis of key phenomena in auditory perception we consider potential implications for the structure of short-term memory and, in particular, the nature of forgetting.

### THE "STANDARD" MODEL OF MEMORY

The modal model of memory, informed by neuropsychological case data, has always assumed a functional and structural distinction between short-term and long-term memory, with the former fed by largely unspecified perceptual input processes, frequently depicted as a buffer storage system (Shallice and Cooper, 2010). In long-term memory, where the notion of memory as a reliable, veridical system has long since been dismissed and a reconstructive account of recall is generally accepted (Bartlett, 1932), suppression, inhibition and blocking of the memory trace have all been discussed as possible explanations of forgetting (for example in the context of the misinformation effect in eyewitness memory). In contrast, discussions of short-term and sensory memory have been less open to the idea of memory distortion as normal and recall as a reconstructive activity. In consequence, processes that highlight deterioration of the representation such as decay and overwriting (respectively) have predominated as mechanisms for forgetting and active supra-segmental organizational processes (such as grouping into objects), which may equally hamper recall when they are inconsistent with retrieval requirements, have been largely ignored.

Much has already been written both critiquing the evidence for decay (e.g., Neath and Nairne, 1995; Nairne, 2002; Lewandowsky and Oberauer, 2008, 2009, 2015; Lewandowsky et al., 2008; Oberauer and Lewandowsky, 2008, 2013; Brown and Lewandowsky, 2010; Neath and Brown, 2012) and defending the concept (Altmann and Gray, 2002; Portrat et al., 2008; Altmann, 2009; Barrouillet et al., 2011; Altmann and Schunn, 2012) so, rather than repeating now-familiar arguments about decay versus some other (often unspecified)<sup>1</sup> form of interference as the source of forgetting (see Ricker et al., 2014, for a review), here we will specifically consider interference by overwriting as it appears from the perspective of auditory perception and the organization of the auditory environment.

The introduction of overwriting or displacement as a key determinant of forgetting over the short-term can be traced back to early studies of auditory sensory memory. Classically, a restricted-capacity acoustic sensory memory trace, overwritten by subsequent auditory events (Crowder and Morton, 1969), is available to supplement end-of-sequence recall otherwise only supported by "post-categorical" short-term memory systems dedicated to verbal memory but otherwise blind to sensory modality or the perceptual origins of the memoranda. This venerable account is nonetheless still extant and has been incorporated into more recent formulations of the contribution of sensory memory to immediate recall of auditory-verbal material (e.g., Page and Norris, 1998, and Burgess and Hitch, 1999, both make reference to an auditory input buffer overwritten by subsequent data). Latterly, other formulations of short-term memory have also utilized overwriting as a means of implementing interference and hence forgetting. For example, in providing a framework for short-term memory that eschews decay as a concept, Nairne (1990, 2002), Neath and Nairne (1995), Neath (2000), Oberauer and Kliegl (2006), Oberauer and Lange (2008), Lewandowsky et al. (2009), and Oberauer (2009) explicitly replace decay with overwriting as an explanatory concept. For memory of specifically auditory origin, therefore, three claims have been made:

(1) An auditory sensory store is overwritten, an item at a time, during encoding (e.g., Crowder and Morton, 1969; Page and Norris, 1998; Burgess and Hitch, 1999; Mercer and McKeown, 2010a).

<sup>1</sup> In fact, Lewandowsky et al. (2009, box 4) postulate at least four possible alternatives to trace decay and Ricker et al. (2014, Table 1) suggest five possibilities, all of which – as with decay itself – may be implemented in different ways (e.g., Lewandowsky and Farrell, 2011).


It is interesting that the "precategorical" acoustic nature of the auditory sensory store (Crowder and Morton, 1969) arose because of prior theoretical commitments to a model of word recognition (the logogen model) which assumed a single system for recognizing both written and spoken words (Morton, 1964, 1969). Subsequent to this, changes to the logogen model (Morton, 1979) removed this theoretical constraint and introduced separate auditory and visual input logogens, so that the idea that overwriting occurred at an early processing stage was retained even though the original a priori reasons for assuming that overwriting occurred prior to word-identification had vanished. The Crowder and Morton (1969) view is, despite its commitment to a pre-categorical (presumably continuous) representational format, a classically item-based data buffer system of the first-in, first-out variety. Their approach can be contrasted with the forms of overwriting implemented in models by Nairne (1990) and Lange and Oberauer (2005).

In Nairne's (1990) feature model of immediate memory, individual items are represented as vectors of features, which may represent modality-specific or modality-independent information. The eponymous features were speculatively identified with patterns of neural firing by Beaman et al. (2008) and, although their exact nature and status have never been formally defined, it is at this level that overwriting operates within the model. Feature overwriting works by an incoming item deleting identical features already held as part of the representation of the immediately preceding item. For example, if the third feature of item n+1 of a sequence takes the same value as the third feature of item n of the same sequence, then the item-level representation of n is denuded of this feature. The representation becomes degraded as a consequence and n is henceforth less likely to be correctly recalled when cued at some point in the future.
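The overwriting rule just described can be sketched in a few lines. This is a deliberately simplified, deterministic toy and not the published model: the integer feature coding, the use of `None` for a lost feature, and the function names `overwrite_previous` and `encode_sequence` are all assumptions made for illustration.

```python
from typing import List, Optional

Feature = Optional[int]  # None marks a feature that has been overwritten (assumed coding)


def overwrite_previous(prev: List[Feature], incoming: List[Feature]) -> List[Feature]:
    """Erase every feature of `prev` that matches `incoming` at the same position."""
    return [None if p is not None and p == q else p for p, q in zip(prev, incoming)]


def encode_sequence(items: List[List[Feature]]) -> List[List[Feature]]:
    """Encode items in order; each new item degrades only its immediate predecessor."""
    stored: List[List[Feature]] = []
    for item in items:
        if stored:
            stored[-1] = overwrite_previous(stored[-1], item)
        stored.append(list(item))
    return stored


# Items n and n+1 share the value of their third feature only.
item_n = [1, 0, 1, 1]
item_n_plus_1 = [0, 1, 1, 0]
trace = encode_sequence([item_n, item_n_plus_1])
print(trace[0])  # [1, 0, None, 1]: the third feature of item n has been lost
print(trace[1])  # [0, 1, 1, 0]: the final item is intact
```

Because only `stored[-1]` is ever modified, interference in this sketch is strictly limited to the immediately preceding item, which is the property at issue in the discussion that follows.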

In contrast, in the version of overwriting put forward by Lange and Oberauer (2005) and Oberauer and Kliegl (2006), interference is not limited to the preceding item. Like the approach of Nairne (1990), the model is once again feature-based; in this instance, however, different items are represented as patterns of activation across a subset of the features ("feature units") available system-wide and representations compete for access to their constituent feature units. Where a given representation loses this competition, the feature unit is captured by the competitor and is no longer available as part of the "losing" representation. In this way a particular representation is degraded, thus impeding recall. The neural competition for features is framed in terms of synchronized firing of neurons as a mechanism of binding together the features that belong to the representation of an item (Raffone and Wolters, 2001). Feature units possessing features belonging to the same representation fire synchronously, whereas units belonging to different representations fire out of synchrony.
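A comparable toy sketch shows how competition for shared feature units differs from predecessor-only overwriting. The rule used here, that the item with the highest activation strength captures a contested unit, is an assumption chosen for simplicity; the function name `resolve_competition`, the item labels and the strength values are hypothetical and do not come from the published models.

```python
from typing import Dict, Set


def resolve_competition(
    items: Dict[str, Set[int]], strength: Dict[str, float]
) -> Dict[str, Set[int]]:
    """Give each contested feature unit to the strongest item claiming it (assumed rule)."""
    owner: Dict[int, str] = {}
    for name, units in items.items():
        for unit in units:
            current = owner.get(unit)
            # A later claimant captures the unit only if it is strictly stronger.
            if current is None or strength[name] > strength[current]:
                owner[unit] = name
    # Each item keeps only the units it still owns after the competition.
    return {name: {u for u in units if owner[u] == name} for name, units in items.items()}


# Three items presented in the order A, B, C; integers label feature units.
items = {"A": {0, 1, 2}, "B": {2, 3, 4}, "C": {0, 4, 5}}
strength = {"A": 0.5, "B": 0.7, "C": 0.9}
surviving = resolve_competition(items, strength)
print(surviving)  # {'A': {1}, 'B': {2, 3}, 'C': {0, 4, 5}}
```

In this sketch item A loses unit 2 to its successor B but also loses unit 0 to the non-adjacent item C, so degradation is not confined to neighboring items.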

The principal difficulty with overwriting as the sole, or key, determinant of failure to recall in these or any other accounts is that while many studies have reported greater interference when irrelevant information (e.g., from a secondary task; Lange and Oberauer, 2005) is related to the memoranda, or when the list items are themselves similar along a specific dimension (e.g., the phonological similarity effect; Conrad, 1964; Conrad and Hull, 1964; Baddeley, 1966) other studies have shown the opposite. Overwriting in the three accounts given above assumes that interference occurs between similar items or items with similar features – acoustic items displace earlier acoustic items in a precategorical store (Crowder and Morton, 1969) or features are overwritten if they are shared between successive items (Nairne, 1990) or if they are supported by common feature units (Lange and Oberauer, 2005). These assumptions readily account for data in which interference is observed at recall between items that are similar along one or more crucial dimensions. However, Mercer and McKeown (2010a,b) found that complex tones were more accurately identified in a same-different task when followed by distractors containing novel frequencies – those frequencies not present in the target - when compared to a condition in which the distractors shared frequencies with the target. This pattern of results is directly contrary to that which would naturally occur if similarity-based overwriting was in operation.

Interestingly, Mercer and McKeown (2010b) also concluded in favor of an overwriting account – but in their model, directly contrary to assumptions made by other theorists about overwriting, "interference is principally caused by tones that include novel features since these will be most potent in 'overwriting' the contents of the auditory spectral short-term memory buffer" (Mercer and McKeown, 2010b, p. 1258, emphasis added). In other words, this model assumes overwriting by items which are representationally distinct from the preceding input, rather than by items which share features with earlier items. Whether overwriting is assumed to occur amongst similar or dissimilar items/features is, of course, an a priori decision for any theorist attempting to construct a model (Lewandowsky and Farrell, 2011) but it is unlikely that similar items would be overwritten in some cognitive systems and dissimilar items overwritten elsewhere. To allow that closely related cognitive and perceptual subsystems work on diametrically opposed principles is, at best, un-parsimonious and contrary to Occam's Razor. If overwriting is to be accepted then a consistent set of rules should apply (Surprenant and Neath, 2009). Nor is the study by Mercer and McKeown (2010b) (which involved fairly "low-level" and non-verbal acoustic stimuli) unique in its findings. An earlier study by Nairne and Kelley (1999) showed that the phonological similarity effect observed with verbal stimuli is reversed after relatively brief periods of distraction, resulting in better performance in a serial order reconstruction test for phonologically similar lists than for phonologically dissimilar lists. If overwriting is seen as necessary to account for forgetting caused by interference effects between similar items, then reversing these similarity effects casts doubt upon the need for overwriting.

Finally, task requirements—which are unlikely to directly influence low-level processes such as overwriting/displacement of patterns of neural firing or competition for neural feature units—also play a substantial role in similarity effects for which overwriting is offered as an explanatory mechanism. Despite numerous documented similarities between immediate free and serial recall (Beaman and Jones, 1998; Bhatarah et al., 2006, 2008; Ward et al., 2010; Grenfell-Essam and Ward, 2012; Grenfell-Essam et al., 2013; Spurgeon et al., 2014) similarity effects within the to-be-recalled list —supposedly reflecting the impact of item-representations degraded by direct over-writing (Nairne, 1990) or competition for specific feature units (Oberauer and Kliegl, 2006)—depress performance on immediate serial recall tasks but enhance performance in free recall (Fournet et al., 2003). Once again it is difficult to reconcile such findings with a low-level, item-based and automatic overwriting interference process without appealing to a higher-level activity that negates, and more than negates, the negative effect of similarity-based overwriting. To account for the reversal of the phonological similarity effect, Nairne and Kelley (1999) proposed that a period of distraction allows phonological similarity to be used as a cue to select candidate items for serial reconstruction of order (e.g., any correct item must share a rime with all other items) and similar suggestions are equally applicable to free, or item, recall situations (e.g., Watkins et al., 1974; Saint-Aubin and Poirier, 1999). However, such accounts are necessarily post hoc and—if overwriting occurs—strategies such as these must be sufficiently ubiquitous and powerful enough to not only negate but reverse the similarity effects otherwise observed. 
Exigencies of space mean that the interesting issue of retrieval mode and streaming cannot here be addressed fully but we note that free recall is in part controlled by strategic retrieval factors so that we may expect effects such as those of similarity to be different dependent upon the mode of recall and, critically, scoring technique employed. A stream of similar-sounding items will necessarily lose order cues relative to a dissimilar stream up until the point that items become so dissimilar that stream coherence is lost (Jones et al., 1999). There are no such necessary consequences for retrieval of individual items so scoring criteria at test are crucial in the appearance and form of similarity effects.

# AUDITORY SCENE ANALYSIS: SOME PRELIMINARIES

If a structural account of forgetting is set aside, what remains? Perceptual organization has profound consequences not just for the coherence of our experience of the world but also for the accessibility of information contained within it. Perception itself is directly linked to memory, as, for example, the perception of loudness is determined by a temporal integration of acoustic power; the perceived loudness of a burst of white noise depends upon its duration (Scharf, 1978) demonstrating that perception is reliant upon memory in a manner which renders the simple idea that incoming stimuli "automatically" overwrite pre-existing representations problematic. There is a mass of evidence showing powerful effects of perceptual organization and, as with vision, it is useful to think of auditory perception in terms of objects. So, despite being intrinsically evanescent in a way that the visual world is often not, successive events are assembled into temporally extended objects in a way that allows several "streams" of information to co-exist. Note that this is immediately different from the situation assumed within most models of verbal STM, which concentrate upon memory for a single list, and require further work to allow simultaneous representation of multiple lists or streams within the same representational space. Generally, the rules of organization follow Gestalt principles that are based on the physical attributes of the stimuli: proximity, similarity, closure, symmetry, common fate, continuity, among others. So, auditory perception is an active process that partitions the auditory world into auditory objects or streams, a process known generically as auditory scene analysis (Bregman, 1990). It is difficult to overstate the importance that these forces of organization have on what may be retrieved from an auditory scene, even when the scene comprises a few simple stimuli.

Necessarily, stream formation involves memory. A succession of individual stimuli achieves stream quality by a process that depends on not just a single but many preceding stimuli, a process that requires storage. Streams take time to form and less compelling streams can vacillate and break down. In everyday environments, scene analysis typically results in several simultaneous streams, such as the instruments of a rock band or orchestra, or indeed a domestic scene of refrigerator noise, radio and conversation. Also, the principles by which this is achieved are embodied in musical polyphony: the rules of composition (though in a non-acoustic language) allow a composer to generate an intelligible and coherent rendition of harmonic and melodic intent.

So, the logic adopted here is that auditory memory is intimately connected to auditory perception and that in turn the study of auditory perception suggests ways in which auditory-verbal memory is organized. Furthermore, we know that this organization is not veridical, in as much as it does not faithfully represent an item-by-item sequence, free of item clusters. As we will see later, the item-clusters produced by auditory perception are very much richer and more diverse than those considered by current models of verbal STM.

It is useful to consider specific instances, using some very simple non-verbal stimuli, of how perceptual organization of sound brings about changes to perception before returning to the case of verbal memory. The first example shows how the context in which stimuli appear works to shape what we may know of them. Take the very simple case of two short tones, A and B, the same in every respect except that they are a semi-tone apart, presented in quick succession (see **Figure 1**). When faced with the task of reporting the order as being high-low or low-high, most listeners find they can make the discrimination quite easily. However, if flankers (F1 and F2), sharing almost the same pitch and tempo as A and B (see again **Figure 1**), are inserted either side of them, then we observe a dramatic reduction in the capacity to report the order of A and B. How might this come about?

One way is to think in terms of overwriting and to suppose that the second flanker (or indeed both flankers) somehow interferes with the representation of A and B, making their comparison less easy. Another way is to frame the change in context in terms of object formation. Whilst presented as a pair, A and B formed a single object and at the same time constituted its boundaries. Adding the flankers created a new object and new boundaries, with A and B now constituting its innards, so that now the order information contained in A and B becomes more difficult to address. This is a familiar situation in STM where current recall of the items and order of the first and last few items gives rise to primacy and recency effects, with recall of items in the correct order very much worse toward the center of the list.

A simple further addition to this auditory scene shows how implausible the overwriting explanation turns out to be in the case of simple tones. If we add a further two stimuli (C1 and C2 in **Figure 1**) either side and sharing both pitch and tempo with the flankers then we witness a remarkable transformation: if we now ask a listener to judge the pitch order of A and B, close to full efficiency (that is, the level of performance when A and B are presented in isolation) is restored. Clearly, according to the overwriting view (and indeed, most interference theories of forgetting) adding more stimuli should – if anything – produce more overwriting, not less. However, the outcome of adding the C stimuli is readily understood in terms of auditory scene analysis. The C stimuli act as 'captors,' that is, by virtue of their greater similarity to the F stimuli than to the AB pair, two objects are formed; the one: CCCF1F2CCC, the other: AB. The flankers are captured, releasing AB to become a separate, and therefore an independently addressable, entity thereby restoring memory for the order of A and B.

This setting shows several remarkable qualities of auditory scene analysis with a number of important implications for the way we understand memory. The first and most profound is that the future shapes the past: perception is retroactive. What follows from this has great relevance to our current discussion about the plausibility of overwriting as an explanatory construct. Critically, the perceptibility of AB is only decided when both F2 and CCC are presented, but even then both F2CCC and AB are distinguishable only with reference to F1 and CCC. The first point that follows from this is that it is important therefore to think in terms of the emergent properties of the stimulus ensemble (the object), not merely as an aggregation of the properties of individual stimuli. The second point is that items need not be temporally adjacent in order to form into objects.

A second illustration lends weight to the first while at the same time addressing the natural skepticism that such a simple setting involving the mere 'perception' of tones A and B could have more general repercussions for more complex settings that we think of as being characteristic of the study of 'memory.' Here again, the listener is asked to compare two tones but this time is asked to judge whether they are the same or different in pitch (Jones et al., 1997).

**Figure 2** shows the arrangement of stimuli used by Jones, Macken, and Harries (following, for example, Deutsch, 1972, 1978a,b; Semal and Demany, 1991, 1993; Starr and Pitt, 1997; Mathias and von Kriegstein, 2014). First, a standard stimulus (a tone) is followed either by a blank interval or a filled interval and then, some seconds later, by a comparison stimulus: another tone. The listener is asked to ignore stimuli that come between the standard and comparison tones in making their judgment.

The key variable of interest is the content of the interval and its effects on the accuracy of the comparison judgment. Having a sequence of tones in the interval similar in pitch and timbre to the standard and comparison (see **Figure 2**) has a dramatic effect of reducing the accuracy of the same-different judgment. If, instead of having tones, we have speech stimuli (say a sequence of words), comparison judgment improves considerably, to a level that is close to when there are no interpolated stimuli. This result is conventionally interpreted in an overwriting framework: memory for the standard is compromised by similar stimuli interpolated between it and the comparison (e.g., Semal and Demany, 1991, 1993; Mathias and von Kriegstein, 2014). However, another of the conditions in the study of Jones et al. (1997) makes this interpretation implausible. If the number of interpolated tones is doubled then, on any reasonable interpretation, the overwriting account suggests that performance cannot improve and should, in fact, deteriorate. If overwriting interferes only with the immediately preceding item (as with Nairne, 1990) then the level of interference remains the same, although the increase in the number of sources competing for consideration at recall could still negatively affect overall performance. If overwriting is not restricted to immediately preceding items (as with Lange and Oberauer, 2005) then performance should deteriorate, and appreciably so given the rise in number of interfering sources. In the event, the opposite turns out to be true; performance improves significantly.
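The logic of this argument can be made explicit with a deliberately crude count. The sketch assumes, hypothetically, that interference with the standard grows with the number of interpolated tones able to overwrite it: at most one tone under predecessor-only overwriting, and every tone under unrestricted overwriting. The function name, the counting rule and the tone counts used in the loop are illustrative assumptions, not values or mechanisms taken from the studies cited.

```python
def overwriting_events(n_interpolated: int, adjacent_only: bool) -> int:
    """Count interpolated tones that can overwrite the standard (toy assumption)."""
    if n_interpolated < 0:
        raise ValueError("n_interpolated must be non-negative")
    if adjacent_only:
        # Only the tone immediately following the standard can overwrite it.
        return min(n_interpolated, 1)
    # Every interpolated tone is a potential source of overwriting.
    return n_interpolated


# Arbitrary tone counts: none, some, and double that number.
for n in (0, 6, 12):
    adjacent = overwriting_events(n, adjacent_only=True)
    unrestricted = overwriting_events(n, adjacent_only=False)
    print(n, adjacent, unrestricted)
```

Under either rule the count never falls as tones are added, so neither variant of overwriting predicts the improvement that was observed when the number of interpolated tones was doubled.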

If we construe the setting in terms of auditory scene analysis, this last result is entirely intelligible. In object terms, the proximity of the standard to the interpolated tones and the similarity of their physical character (sharing tone-like qualities), along with its shared timing, increases the likelihood that it will be incorporated with them into an object, thereby reducing its identity as a separate entity. When the interpolated material is speech, of course this tendency will be much less likely. Doubling the number of interpolated tone stimuli is likely to produce an outcome similar to that seen with interpolated speech stimuli: by virtue of shared timing (in addition to shared pitch and timbre) the interpolated stimuli will in this case form an object separate from the standard. The judgment of similarity is once again based on two stimuli distinct from the interpolated stimuli: the scene comprises three objects, a standard, a distinct interpolated stream, and the comparison.

Streaming thus produces important consequences for our judgment of the plausibility of overwriting as an explanatory mechanism and for hearing and memory. The context in which stimuli appear has powerful repercussions for what we can retrieve of stimuli. As we shall go on to consider, the fact that auditory stimuli appear in chronological order does not mean that access to temporally adjacent stimuli is guaranteed. So, for instance, if we present a sequence in alternating male-female voices (M1F2M3F4M5F6), two streams are formed (M1M3M5 and F2F4F6), a situation that contrasts with a single (e.g., male) stream: M1M2M3M4M5M6. By forming two distinct streams it will become more difficult to retrieve chronologically adjacent stimuli (e.g., M1F2M3F4 will be harder to retrieve than M1M2M3M4), but easier to retrieve stream-adjacent (and chronologically non-adjacent) stimuli (e.g., M1M3M5 will be easier to retrieve in the alternating voices case) if cued to retrieve the utterances in strict temporal order of their occurrence. Notice that, as suggested earlier, the stream can contain non-adjacent elements. This contrasts with the typical interpretations of 'chunking' (and also "grouping") that invariably refer to an aggregation of temporally adjacent elements. Auditory scene analysis shows that even quite remote elements may be assembled into an organized whole. This is why scene analysis and chunking are slightly different mechanisms and why it is important to consider remote elements in any scene analysis (see Jones, 1993, 1999; Jones et al., 1996; for extended discussions). This relates to the question of overwriting because temporally remote and non-adjacent items can have a greater effect upon memory for any given target item than does the immediately subsequent item, a result which is inconsistent with at least two forms of overwriting (Crowder and Morton, 1969; Nairne, 1990).
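The alternating-voice example can be illustrated with a short sketch. Grouping by a voice label stands in for perceptual streaming here, and the count of stream crossings is a hypothetical proxy for the difficulty of retrieving items in strict chronological order; neither is a model from the literature, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Utterance = Tuple[str, int]  # (voice label, chronological position)


def form_streams(sequence: List[Utterance]) -> Dict[str, List[int]]:
    """Group chronological positions by voice, one stream per voice (assumed rule)."""
    streams: Dict[str, List[int]] = defaultdict(list)
    for voice, position in sequence:
        streams[voice].append(position)
    return dict(streams)


def stream_crossings(sequence: List[Utterance]) -> int:
    """Count switches between streams when the sequence is read in chronological order."""
    return sum(1 for (v1, _), (v2, _) in zip(sequence, sequence[1:]) if v1 != v2)


alternating = [("M", 1), ("F", 2), ("M", 3), ("F", 4), ("M", 5), ("F", 6)]
single = [("M", i) for i in range(1, 7)]

print(form_streams(alternating))      # {'M': [1, 3, 5], 'F': [2, 4, 6]}
print(stream_crossings(alternating))  # 5
print(stream_crossings(single))       # 0
```

Each stream in the alternating case holds chronologically non-adjacent items, and chronological read-out requires a switch between streams at every step, whereas the single-voice sequence requires none.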

Perhaps the simplest and most telling prediction from the overwriting hypothesis is that sequences with fewer shared features should be easier to retrieve than those with many shared features. This follows from such ideas as the relative distinctiveness principle, the suggestion that an item (or series of items) perceived to be discriminable on some dimension(s) from its fellows is easier to recall by virtue of psychological distinctiveness (a principle which is consistent with overwriting as an underlying mechanism, although other mechanisms may produce such an outcome; Surprenant and Neath, 2009; Neath, 2010). Evidence already reviewed indicates that this is not always the case, and further data indicate that streaming may be a useful concept in explaining outcomes that run contrary to this principle.

Very distinct non-speech sounds, when presented quickly in a sequence, are easy to recognize, so that listeners can judge that they are present but are less able to indicate the order in which they appeared. So, if a sequence of very different sounds—a high-pitched tone, a hiss, a low-pitched tone and a buzz—is heard in a repeating cycle, listeners are able to name each of the sounds. However, they cannot report their order correctly, even if the period of listening is extended indefinitely (Warren et al., 1969; see also Jones et al., 1999). However, if a sequence of four spoken digits—spoken in the same voice—is presented under the same conditions, the order can be readily reported. The key difference between these two settings is in the level of commonality in acoustic content: acoustically the digits form a variation on a common ground and so quickly form a stream, but for the non-voice sounds each element constitutes a separate entity and streaming is less easy to achieve.<sup>2</sup> Consider how such a situation would be addressed by Nairne's (1990) feature model, in which automatic overwriting forms a large part. The identity of the stimuli themselves would be represented in secondary memory, so the task would simply be to match the correct primary memory representation to the correct secondary memory identity in the correct order. The task would be made difficult by the fact that overwriting would degrade the primary memory representations such that the primary-secondary memory match would become more problematic and, potentially, confused. This confusion would clearly be more prevalent in the situation under which the most overwriting occurred – when the stimuli come from a common source (spoken digits) and share common acoustic and lexical features. These results pose grave difficulties for an overwriting account; distinct sequences should be subject to less overwriting, but the results are diametrically opposite. The explanation comes from stream formation: when the stimuli are perceived as originating from a common source they form a single stream within which order is preserved.

# AUDITORY SCENE ANALYSIS: THE 'SUPRASEGMENTAL' APPROACH APPLIED TO VERBAL MEMORY

In view of the problems outlined earlier, we wish to outline an alternative framework in which retrieval (in both sensory and short-term memory) is constrained by perceptual principles. The primary line of argument we wish to pursue is that the need to maintain a coherent stream of information over time constrains the processes operating within memory and hence automatic and immediate overwriting of an item representation or the features representing an item is not a tenable explanation for forgetting. The auditory scene analysis principles outlined above, however, were introduced with reference to simple auditory stimuli (e.g., tones) and in what follows these are expanded to encompass more traditional verbal memory phenomena.

Within auditory memory, overwriting was originally proposed as an explanation for the interference associated with a post-stimulus suffix (Crowder and Morton, 1969), so we turn first to this phenomenon and possible alternative accounts.

# The Failure of Overwriting: Capturing the Suffix

Classically, the existence of acoustic storage termed the "precategorical acoustic store" (PAS; Crowder and Morton, 1969) was assumed to precede "post-categorical" verbal storage (where modality of origin – spoken or written – is irrelevant, a common abstract representation is shared by all stimuli, regardless of input modality). Its existence was inferred from the auditory recency effect, in which the final item of an auditorily presented list for serial recall is recalled at near-ceiling levels compared to the much smaller recency effect obtained with visually presented lists of the same verbal items. The reason that this has been attributed to a restricted capacity acoustic store is that elsewhere along the list performance on visually and auditorily presented lists is broadly equivalent (but see Beaman, 2002; Macken et al., 2015) and is affected in a similar manner by standard verbal manipulations such as phonological similarity, word-length, and concurrent articulation. The final piece of evidence provided in support of PAS was that the presence of a post-stimulus suffix effectively eliminates this final-item advantage, leading Crowder and Morton (1969) to conclude that the stimulus suffix effect "depends upon selective displacement of information from PAS" (Crowder and Morton, 1969, p. 369).

Crowder and Morton (1969) assumed an item-by-item displacement system rather than feature-based overwriting and one reason for disputing the feature-based interference account of the stimulus suffix effect comes from data showing that stimulus suffixes which are phonemically similar to the memoranda do not necessarily show larger suffix effects (Crowder and Cheng, 1973) and—like the other similarity-based interference effects already reviewed—may also show smaller effects (Carr and Miles, 1997). Another reason for questioning feature-based overwriting comes from studies of streaming the suffix. It is well established that the stimulus suffix effect depends at least in part upon the suffix being perceived as originating from the same source, or stream, as the to-be-recalled list. So, for example, variations in the spatial location, timbre and pitch of the suffix relative to the list reduce the size of the suffix effect whereas similar manipulations varying suffix frequency, emotionality and meaning have no such effect (Morton et al., 1971). Other manipulations varying the "speech-like" qualities of the suffix similarly moderate the size of the suffix effect (Morton et al., 1981). Manipulations of the top-down interpretation of the suffix likewise show that forcing the suffix to be grouped with, or apart from, the list affects the auditory memory interference effect (Crowder, 1971; Frankish and Turner, 1984; Neath et al., 1993). So, for example, ambiguous stimuli, which can be perceived as either speech or non-speech, can be treated as a speech suffix on the basis of labeling them as such (Ayres et al., 1979; Neath et al., 1993). However, other non-speech stimuli do not show a suffix effect unless contextual effects also force them to be perceived as speech (Morton and Chambers, 1976; Ottley et al., 1982).
These results show that physically identical stimuli, which bear physically identical relationships to the memoranda, can produce different memory effects depending upon context and expectation. At best, therefore, any interference effect obtained under such circumstances can be ascribed only in part to the physical overwriting of the memory trace.

Perhaps most intriguingly, the effects of a repeated suffix have also been shown to reduce the disruption observed (Crowder, 1971, 1978; Morton, 1976). With a repeated suffix, the same suffix item is presented multiple times in quick succession and in tempo with the list sequence (as usually also happens with a single suffix). The reduced effect of the suffix when repeated in this way, even though the first presentation of the repeated suffix is physically identical to the presentation of a single suffix, is difficult to reconcile with an overwriting account based upon physical or feature similarity between successive items since the relationship between the suffix and the list items is equivalent in the two conditions. Critically, a repeated suffix only becomes a repeated suffix at the point of its re-presentation; logically, therefore, automatic overwriting occasioned by the first presentation of the suffix must already have occurred at this point. Data such as these have led to suggestions that the suffix effect might reflect the simultaneous action of overwriting, accounting for the reduced suffix effect still observed, and perceptual grouping, accounting for the difference between single and repeated suffix effects (e.g., Morton, 1976). According to these accounts, the repeated suffix forms a perceptual group apart from the to-be-remembered list whereas the single suffix is perceived as part of this list. It follows from this that the sole cause of the disruption observed in the repeated suffix condition is overwriting. A single item suffix likewise overwrites the final item but also further depresses memory performance by increasing the functional size of the memory set (the list length) by an extra item (e.g., Nairne, 1990).

<sup>2</sup>The rate of presentation in these studies is fast and this prevents verbal labeling; when the speed of presentation is slowed performance improves but only to a relatively small degree.

These data undermine the importance of overwriting as the source of the suffix memory disruption effect but do not rule out the possibility that overwriting occurs; perhaps it contributes only part of the observed disruption. Later data reported by Nicholls and Jones (2002) are, however, less equivocal. In their experiments, Nicholls and Jones (2002) interleaved a sequence of irrelevant items between the to-be-remembered list items such that the suffix, when presented, was perceptually grouped with, or "captured" by, these irrelevant items. The sequence comprised the item 'ah,' which was also used as the suffix in a traditional suffix effect condition (see **Figure 3**). When no-suffix, suffix and captured suffix conditions are compared, it is clear that in the captured suffix condition performance approximates that in the no-suffix condition.<sup>3</sup> In the captured suffix condition the recency effect was fully restored and there was no suffix effect on the final list-item when the suffix was grouped, or streamed, with the sequence of irrelevant items. In contrast, the suffix presented alone continued to produce a suffix effect. Unlike the repeated suffix manipulation which reduced but did not eliminate the suffix effect, these data cannot easily be explained by the joint operation of overwriting and grouping since—in this case—the grouping (or streaming) manipulation removed the suffix effect entirely and hence the need to assume overwriting as the basis of the suffix effect.

Thus, the proposition that auditory-sensory memory is necessarily automatically overwritten is untenable. However, the suffix effect is only a single line of evidence. Recently, doubts about overwriting have been reinforced by findings from a paradigm using alternating voices for each list item and observing the consequences for memory of streams created in this way (Hughes et al., 2009). A suffix presented in a different voice reduces the suffix effect (Morton et al., 1971), consistent with the idea that overwriting depends on similarity between the suffix and the final list item but also consistent with the idea that a different voice suffix is grouped apart from the list items. If overwriting is automatic and based solely upon such physical properties and relationships between successive items, then presenting the to-be-recalled list in alternating voices (e.g., male-female-male and so on), should limit the overwriting observed between successive items compared to the same items presented in a single voice because the feature similarity is reduced by the voice change. Hence, overall recall should be enhanced relative to single-voice presentation. Alternatively, if perceptual organization is important so that items presented in different voices are streamed as coming from distinct sources, then recalling the items in the correct serial order should be harder. As noted in early research on auditory attention, items are preferentially recalled according to the stream or channel from which they are perceived to originate (e.g., Broadbent, 1958; for an extensive discussion see Hughes et al., in press) such that if the two voices are perceived as two separate streams then to recall the items in correct serial order requires participants to shift alternately between streams in order to reconstruct the serial order of the list. 
This extra cognitive requirement imposes a behavioral cost such that a list of alternating voices is not recalled as well as the same items presented in a single voice (Hughes et al., 2009; see **Figure 4**). Again this talker-variability effect calls into question the predominance of overwriting, which would predict the opposite pattern of results.
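The extra requirement described here can be illustrated by counting how often strict serial recall would have to cross from one stream to another. This is a hedged sketch only: the switch count is a hypothetical proxy for the behavioral cost, not a measure used by Hughes et al. (2009), and it again assumes that the first character of each illustrative label encodes the voice and that each voice forms its own stream.

```python
def count_stream_switches(items):
    """Count transitions between voices when items are recalled in list order.

    Assumption: each distinct voice (first character of the label) forms
    its own stream, so every change of voice is one switch between streams.
    """
    voices = [item[0] for item in items]
    return sum(1 for prev, cur in zip(voices, voices[1:]) if prev != cur)


alternating = ["M1", "F2", "M3", "F4", "M5", "F6"]
single = ["M1", "M2", "M3", "M4", "M5", "M6"]

print(count_stream_switches(alternating))  # 5
print(count_stream_switches(single))       # 0
```

On this simple count the alternating list demands a switch at every transition while the single-voice list demands none, which runs in the direction of the talker-variability effect and against the overwriting prediction of better recall for the less similar, alternating items.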

# Time, Space and Voice-Based Grouping Effects

The talker-variability effect, together with the different-voice suffix effect, supports the assumption that lists presented in different voices are perceptually grouped apart and that this influences the appearance of memory phenomena. Such assumptions find further support from early work on auditory attention (Broadbent, 1958) together with current theories of low-level auditory perception, within which auditory stream segregation (Bregman, 1990) plays a central role. One further line of evidence, however, serves to emphasize the relationship between perceptual organization and what seem superficially to be wholly mnemonic processes (suffix and talker-variability effects).

Work on grouping within auditory memory by Frankish and Turner (1984) and Frankish (1985, 1989, 1995) directly examines the effect of perceptual grouping on subsequent recall. In a series of experiments, Frankish (1985, 1989, 1995) demonstrated that coherent groups can be formed within lists presented for immediate serial recall. These groups are defined by boundaries that exhibit the same, or similar, primacy and recency effects at recall as the longer lists of which they form a part. For example, in a control (ungrouped) list, recency occurs only at the end of the list. However, in a 9-item list which is organized into three groups of three items each—for example by a delay in presentation between items 3 and 4 and between items 6 and 7—recency is seen for the final item of group 1 (at serial position 3, which must therefore be relatively immune to the suffix effects of item 4). Grouping is effective when it employs exactly those principles of perceptual organization important for reducing the suffix effect. These principles include change of voice, delay in presentation, and change of spatial location, all of which have been confirmed as producing within-list recency effects associated with groups (Frankish, 1989). The principles of grouping in auditory-verbal memory, it appears, are readily inferred from the data showing a reduction of the suffix effect. Additionally, Frankish (1985) showed that, with visual presentation, there is little extra grouping advantage from inserting extra pauses after the third and sixth items in the nine-item list. Frankish (1985) found no obvious difference between the serial position curves produced when participants are asked to subjectively group visual lists and those produced when the presentation of the lists was grouped by half-second pauses (Experiment 1).

<sup>3</sup>Notably, the mere presence of an irrelevant sequence of repeated items has no appreciable effect on serial recall.

FIGURE 4 | (A) A sub-set of the stimuli used by Hughes et al. (2009). To-be-remembered stimuli are first shown in isolation, with lists either all in the same voice (Single) or in alternating male and female voices (Alternating). Participants are required to report the whole list in the order in which it was presented. Lists with a lead-in are then shown: in the first case both the lead-in and the to-be-remembered list are in the same voice (Single–Single), and in the second both are in alternating voices (Alternating–Alternating). (B) Performance in each of those conditions as a function of the presentation position of the stimuli within the to-be-remembered sequence.

In a further study, Frankish (1989) showed that an extra pause of only 80 ms following the third and sixth items had as much effect as an extra half-second pause. Likewise, when the middle three digits were differentiated from the others by either voice (male vs. female) or spatial channel (left vs. right ear), the effects of these manipulations were equivalent to those of the temporal change. In addition, the study demonstrated that the voice distinction alone is as effective as voice plus pause. That is, if the middle three digits are in a different voice from the first and last three, then inserting a pause of half a second after the third and sixth digits, thereby additionally isolating the middle three digits in time, has no further effect.

These effects appear to reflect the automatic segmentation of auditory lists in a manner that is more powerful than the strategic grouping that operates on visually presented lists, which produces less of an effect and is more readily disrupted (Hitch et al., 1996). Although a number of researchers (e.g., Hitch et al., 1996; Farrell, 2012) have concentrated on the role of timing—and of extended pauses—in creating groups, Frankish's results clearly show that perceptual groups can be created using cues other than elongated pauses between list items. This observation is important because it shows that factors other than consolidation and rehearsal of a recently completed group (in the pause before the next group arrives) are responsible for creating these group boundaries. It also shows that the group boundaries can be established very quickly – parsing the list into subgroups almost instantaneously as the stimuli are encountered. Thus, although providing temporal cues to grouping and allowing (or encouraging; Taylor et al., 2015) prosodic, group-based rehearsal to emerge is one means of parsing the input, it is not the only way in which within-list organization can emerge. Crucially for current purposes, the perceptual segmentation of auditory lists is one that requires the constant comparison of the current and preceding auditory input. Automatic overwriting of previous stimuli by incoming information would interfere with the allocation of the current (incoming) stimulus to the appropriate perceptual stream, which may have been established over several preceding items.

# Principles of Organization: Similarity-Based Streaming

Generally, theories of short-term memory fail to acknowledge (or at most, pay lip-service to) the idea that events might be organized—and re-organized—according to perceptual streams. Rather, current theories view short-term memory as post-categorical, item-based encoding within a single, to-be-recalled list. The item here is defined by the experimenter a priori rather than inferred from the behavior of the participant. Those characteristics of the stimuli that denote common origin, that connote streams (among them similarity of pitch, timbre, location, and proximity in time), are ignored by such accounts, which also overlook the fluidity and flexibility of systems within which items are organized, and re-organized, according to their perceived belonging to one or more sources of origin. We argue that this is a profound mistake.

In the first instance, it is logical to assume that whatever form representations take in memory is constrained by the way in which information is available perceptually. The existence of natural organizational principles, known since the advent of Gestalt psychology, implies that multiple streams of information co-exist within memory in a way that is inconsistent with strict overwriting as the mechanism for forgetting. In the second instance, treating memoranda as discrete and independent items within the experimental participants' cognitive systems because they were conceived and presented as such by the experimenter is an unwarranted assumption. The assumption arises directly from the idea that representations are, almost by definition, abstract and "post-categorical," whereas in fact very few studies have examined the extent to which memory results can be accounted for by categorical vs. continuous storage systems (Frankish, 2008; Joseph et al., 2015). Taken to the extreme, it is clear that the recall of individual items is not independent, and whilst few models make this mistake, the amount and type of information relating the experimenter-defined items to one another and to a perceived locus of origin is impoverished in current theories. The relationship between items is formally often one merely of time or position (e.g., Page and Norris, 1998; Brown et al., 2007). Commonality of perceptual characteristics rarely plays a role because all of the elements within the memoranda are automatically assigned to a single list-structure, something that presumably occurs at a pre-mnemonic processing stage. Where between-item similarity is considered (as for example, to model the phonological similarity effect) this may often be at a distinct stage from positional similarity. 
For example, in the primacy model of Page and Norris (1998), positional errors between localist representations occur naturally along the "primacy gradient," and items are then forwarded to an explicitly phonological distributed representation stage prior to output in order to implement item confusion errors (Beaman, 2000).<sup>4</sup>

Missing from all of these accounts is any measure of stream-based similarity such that elements within the memoranda are allocated to one stream or another based upon a common theme or thread running through the sequence and which serves to distinguish this stream from another. Stream-based similarity, according to this analysis, is necessary to account for the effects reviewed above – the reduction or elimination of suffix effects, the talker variability effect, the perceptual grouping effect and so on. The thread of similarity that acts to hold elements together is, however, precisely the source of interference that would consistently and continually degrade individual item representations under an overwriting account.

The availability of information about the stream to which the stimuli belong is precisely what is needed to account for moderation and abolition of suffix effects, between-talker variability effects, and within-list grouping effects as reviewed here. Discontinuities in time (i.e., elongated breaks between groups) have been used to account for within-list temporal grouping effects (Nairne, 1990; Hitch et al., 1996; Farrell, 2012). This mechanism follows naturally from the idea of overwriting, since a break is naturally interpreted as a pause in which information can be consolidated and/or within which retroactive interference (such as overwriting) will not occur. Such accounts do not properly address the effects of very short pauses between groups, which are more parsimoniously conceived of as groupings caused by discontinuities in rhythm rather than time per se, nor are they able to account for grouping effects caused by intonation, timbre or spatial location. For the same reasons, speaker-variability effects and reduced suffix effects are not predicted by such accounts because the models do not maintain the correct types of information to give rise to such effects. To do so, not only must information about physical characteristics be maintained in addition to whatever post-categorical or more abstract labels may be assumed, but also information must be held about the stream as a whole rather than individual items in isolation, and incoming information (e.g., a post-list suffix) interpreted in terms of the information held and prior expectations it elicits, as shown both by contextual suffix effects and by experiments repeating and streaming the suffix. The main conclusions point to the intimacy of perception and memory, or perhaps even to their wholesale integration. Certainly, no attribution to the action of auditory memory should be entertained until a thoroughgoing analysis of how auditory streaming could explain the same phenomena has been dismissed. Only after streaming processes have yielded the superordinate structure of the material being remembered can other approaches – such as overwriting – be entertained as explanatory constructs.

<sup>4</sup>This is a mirror image of how the feature model addresses the same situation: in the feature model, item-based confusions arise naturally from the distributed representation of items as vectors of feature values but positional errors only occur when an item independently "drifts" along the position dimension (Neath, 1999, 2000).


### AUTHOR CONTRIBUTIONS

This manuscript was co-written by CPB and DJ. Figures were provided by DJ.

### FUNDING

The research reported in this article was supported by Economic and Social Research Council (UK) grant ES/L00710X/1 awarded to Philip Beaman and Dylan M. Jones.






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Beaman and Jones. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Children with speech sound disorder: comparing a non-linguistic auditory approach with a phonological intervention approach to improve phonological skills

### *Cristina F. B. Murphy\*, Luciana O. Pagan-Neves, Haydée F. Wertzner and Eliane Schochat*

Department of Physical Therapy, Speech-Language Pathology and Occupational Therapy, Center for Teaching and Research, School of Medicine, University of São Paulo, São Paulo, Brazil

### *Edited by:*

Adriana A. Zekveld, Linköping University, Sweden

### *Reviewed by:*

Heikki Lyytinen, University of Jyväskylä, Finland Cecilia Nakeva Von Mentzer, Linköping University, Sweden

### *\*Correspondence:*

Cristina F. B. Murphy, Department of Physical Therapy, Speech-Language Pathology and Occupational Therapy, Center for Teaching and Research, School of Medicine, University of São Paulo, 51 Cipotânea, Cidade Universitária, 05360-160 São Paulo, Brazil e-mail: crist78@yahoo.com

This study aimed to compare the effects of a non-linguistic auditory intervention approach with a phonological intervention approach on the phonological skills of children with speech sound disorder (SSD). A total of 17 children, aged 7–12 years, with SSD were randomly allocated to either the non-linguistic auditory temporal intervention group (n = 10, average age 7.7 ± 1.2) or phonological intervention group (n = 7, average age 8.6 ± 1.2). The intervention outcomes included auditory-sensory measures (auditory temporal processing skills) and cognitive measures (attention, short-term memory, speech production, and phonological awareness skills). The auditory approach focused on non-linguistic auditory training (e.g., backward masking and frequency discrimination), whereas the phonological approach focused on speech sound training (e.g., phonological organization and awareness). Both interventions consisted of 12 45-min sessions delivered twice per week, for a total of 9 h. Intra-group analysis demonstrated that the auditory intervention group showed significant gains in both auditory and cognitive measures, whereas no significant gain was observed in the phonological intervention group. No significant improvement on phonological skills was observed in any of the groups. Inter-group analysis demonstrated significant differences between the improvement following training for both groups, with a more pronounced gain for the non-linguistic auditory temporal intervention in one of the visual attention measures and both auditory measures. Therefore, both analyses suggest that although the non-linguistic auditory intervention approach appeared to be the most effective intervention approach, it was not sufficient to promote the enhancement of phonological skills.

**Keywords: speech sound disorder, phonology impairment, language therapy, auditory stimulation, children**

### **INTRODUCTION**

Speech sound disorder (SSD) is defined as a developmental disorder characterized by articulatory and/or phonological difficulties that affect a child's ability to be understood by others, leading to reduced speech intelligibility, in the absence of other cognitive, sensory, motor, structural, or affective issues (Shriberg, 2003; Raitano et al., 2004; McGrath et al., 2007). It is currently well-established that, in most cases, the primary characteristics of SSD are difficulties in acquiring the phonological representations of speech sound systems in addition to deficits in speech perception and phonological tasks (Bird and Bishop, 1992; Leitao and Fletcher, 2004; Kenney et al., 2006; Fey, 2008). Despite the overlap of symptoms between SSD and language impairments, such as specific language impairment (SLI), SSD has its own characteristics (primarily increased substitution or omission of sounds from words compared to same-aged peers and speech production errors) and constitutes the largest group of speech and language impairments observed in children (Shriberg and Kwiatkowski, 1982; Shriberg et al., 1994; Broomfield and Dodd, 2004; Tkach et al., 2011). According to Shriberg et al. (1999), the prevalence of SSD is ∼2–13%, and the rate of comorbidity between SSD and SLI in 6-year-old children, for instance, is 0.51%.

Several studies have investigated the effects of different intervention approaches on phonological impairments in children with SSD. For many years, the most common treatment approach in speech language pathology was the traditional articulation approach (Van Riper, 1939), which focuses on how to articulate individual phonemes to improve speech intelligibility. Over time, several phonological intervention approaches were incorporated in speech therapy by focusing on the phonological representations of speech sound systems, including phonemic awareness, vocabulary, and/or phonological memory tasks. Williams et al. (2010) documented 23 different intervention approaches for children with SSD, with the cycles approach (Hodson and Paden, 1983, 1991) and the core vocabulary approach (Holm et al., 2005) as examples of recognized phonological therapies. The Cycles Phonological Remediation Approach (Hodson and Paden, 1983, 1991) aims to increase a child's intelligibility by facilitating the emergence of primary target patterns for beginning cycles, such as final consonants, clusters, velars, and liquids. The Core Vocabulary approach establishes consistency of production and enhances consonant and vowel accuracy. According to Crosbie et al. (2006), this approach is effective for children with an inconsistent phonological disorder.

As previously mentioned, numerous studies have demonstrated that one symptom of SSD is speech perception deficits. However, the role of this deficit in developmental phonological disorders remains unclear. Since the 1980s, research has supported the hypothesis, initially proposed by Tallal and Piercy (1973), that an auditory-sensory deficit, more specifically, an auditory temporal processing deficit, may be the underlying cause of speech perception deficits (Tallal and Piercy, 1973; Tallal, 1980; Tallal et al., 1996; Fitch et al., 1997; Habib, 2000; Ingelghem et al., 2001; Share et al., 2002; Murphy and Schochat, 2009a,b). This auditory temporal processing difficulty can be described as a limited ability to process "acoustic elements of short duration" such as consonants with rapid formant transitions. Thus, children with language impairments, including SSD, would have difficulties perceiving and distinguishing these sounds properly within the speech spectrum and subsequently developing the phonological representation of each one to produce them properly. Based on this hypothesis, a large number of studies have investigated the effects of auditory temporal training on language and phonological skills (Merzenich et al., 1996; Tallal et al., 1996; Kujala et al., 2001; Hayes et al., 2003; Cohen et al., 2005; Russo et al., 2005; Strehlow et al., 2006; Gaab et al., 2007; Lakshminarayanan and Tallal, 2007; Gillam et al., 2008; Given et al., 2008; Murphy and Schochat, 2011; Heim et al., 2013). Despite this body of research, the extent to which auditory perceptual learning is generalized to higher phonological skills remains controversial and this controversy is often discussed in terms of methodological issues.

In the research conducted by Tallal et al. (1996), for instance, the trained group was composed of children with both speech and language impairments (described by the authors as language-learning impairments). Combining children with SSD and SLI in this way might confound the observation of a relationship between pure speech perception deficits and auditory temporal processing skills. In addition, there is no consensus as to whether the changes in language skills that follow auditory training are due to specific auditory-sensory learning or to a general enhancement in cognitive skills. Numerous studies have demonstrated that auditory training can also promote improvement in cognitive skills (especially with regard to working memory and attention) in addition to the enhancement of auditory-sensory skills (Mahncke et al., 2006; Adcock et al., 2009; Murphy et al., 2011).

Although a great number of studies have addressed the effectiveness of auditory and phonological intervention approaches on the language skills of children with either SLI or dyslexia, only a few studies have investigated the effect of these intervention approaches on the speech production and phonological awareness skills of children with SSD. Lousada et al. (2012) described the presence of learning generalization in a study evaluating the effectiveness of a phonological intervention approach and an articulation intervention approach in children with SSDs. A generalization probe of the trained sound or phonological process to five non-intervention words was used. The authors demonstrated that the children in the phonological group showed greater generalization to untreated words than those who received articulation therapy. No study has investigated the efficacy of auditory training or even attempted a direct comparison of the effectiveness of auditory and phonological intervention approaches on speech production and phonological awareness skills. Baker and McLeod (2011), for example, mentioned that few studies have demonstrated that one intervention approach is more efficient than another for a specific disorder group. Moreover, most of the studies reporting efficacy used quasi-experimental or non-experimental designs, indicating the need for more controlled studies including groups of children and randomized controlled interventions (Brumbaugha and Smita, 2014).

Therefore, the aim of the present study is to compare the effect of an auditory and a phonological intervention approach on speech production and phonological awareness skills in children with SSD. Taking into account previous studies demonstrating a strong link between impaired phonological processing and SSD, as well as the hypothesis associating speech perception deficits with an auditory-sensory impairment, we will be able to explore the contribution of phonological skills and of auditory-sensory aspects to language skills by comparing both intervention approaches. We also aim to investigate the extent to which both interventions may improve other deficits present in children with SSD, including sustained attention (Murphy et al., 2014) and phonological working memory deficits (Adams and Gathercole, 1995). We hypothesized that each of the interventions would improve performance in the trained tasks (auditory and phonological skills) and result in learning transfer to associated tasks in the same or different domains (language, auditory, memory, and attention skills).

### **MATERIALS AND METHODS**

This study was conducted at the Department of Physical Therapy, Speech-Language Pathology and Occupational Therapy in the School of Medicine (FMUSP/HC) at the University of São Paulo and was approved by the Research Ethics Committee in the Analysis of Research Projects at the Hospital das Clínicas, School of Medicine, University of São Paulo, under Protocol Number 575/09. A written consent form with detailed information on the aim and protocols of the study was also approved by the same ethics committee. All parents provided written informed consent on behalf of the children involved in the study.

### **MATERIALS**

### *Apparatus*

The experiment took place in an isolated room in the Speech-Language Pathology Clinic. Auditory tests were administered binaurally in a sound-treated booth at a level of 40 dB NS using an audiometer, headphones, and compact disks. Attention and short-term memory tests were administered using the E-Prime Professional Software to display the stimuli and collect the data. The language tasks were recorded using a JVC® Everio video camera and a Zoom H2 digital recorder for audio. Auditory intervention was delivered individually using a laptop, headphones, and specific software. The stimuli were presented binaurally at a comfortable listening level, which corresponded to a sound level of 70 dB (A). In the phonological intervention approach, children were positioned face-to-face with the speech and language pathologist to provide visual support of the therapist's mouth. Target sounds were presented at approximately 50–60 dB HL at a distance of 1 m.

### *Outcome measures*

The intervention outcomes were categorized as "auditory-sensory measures" (i.e., auditory temporal processing skills) and "cognitive measures" (i.e., attention, short-term memory, speech production, and phonological awareness skills).

### *Auditory-sensory measures.*

*Frequency Pattern Test (FPT; Musiek, 1994).* The FPT consists of 20 trials with ∼6-s intervals between successive trials. Each trial consisted of three 150-ms stimuli with an inter-stimulus interval of 200 ms. The low stimulus (L) was 880 Hz, and the high stimulus (H) was 1122 Hz. There were six possible stimulus combinations: HHL, HLL, HLH, LHL, LLH, and LHH. The children were instructed to carefully listen to all three stimuli and respond by naming them in the order in which they were presented (e.g., "low, low, high"; "high, low, low"; etc.). After the test, we calculated the percentage of correct answers. This test was administered binaurally in a sound-treated booth at a level of 40 dB NS. In non-impaired Brazilian children (ages 7–11 years), the expected result varies between 47.5 and 69.4% (Schochat et al., 2000).
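As a quick check on the stimulus set described above, the six patterns are exactly the three-tone sequences of high and low tones that contain both tones. The Python sketch below enumerates them and scores a run as a percentage of correct answers. The function names and the example responses are illustrative assumptions, not part of the published test materials.

```python
from itertools import product

LOW_HZ, HIGH_HZ = 880, 1122  # tone frequencies given in the text


def fpt_patterns():
    """All three-tone H/L sequences that contain both a high and a low tone."""
    return ["".join(p) for p in product("HL", repeat=3) if len(set(p)) == 2]


def percent_correct(presented, responses):
    """Percentage of trials in which the named order matches the presented order."""
    if len(presented) != len(responses) or not presented:
        raise ValueError("need equal-length, non-empty trial lists")
    hits = sum(p == r for p, r in zip(presented, responses))
    return 100.0 * hits / len(presented)


print(fpt_patterns())

# Hypothetical four-trial run with one ordering error (not study data).
print(percent_correct(["HHL", "LHL", "LLH", "HLH"], ["HHL", "LHL", "LHL", "HLH"]))
```

Excluding the two uniform sequences (HHH and LLL) is what reduces the eight possible three-tone sequences to the six used in the test.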

*Gap in Noise Test (GIN – Musiek et al., 2005).* The GIN Test consists of stimuli with ten different gap lengths of 2–30 ms. In this test, the participants listened to segments of broadband noise that contained 0, 1, 2, or 3 silent intervals (i.e., gaps). As Musiek et al. (2005) described, the broadband noise was turned off and on instantaneously to produce gaps. Listeners were instructed to raise their hands each time they heard a gap. Gaps were separated by at least 500 ms within each trial. The test was performed in a sound-treated booth at a level of 40 dB NS. The task consisted of 35 trials presented binaurally. In non-impaired Brazilian children (ages 8–10 years), the expected result is ∼6.1 ms (Amaral and Colella-Santos, 2010).

### *Cognitive measures.*

*Auditory and Visual Attention Tests (Murphy et al., 2014).* In both tests, performance is assessed using tasks that require participants to remain prepared to respond to infrequent targets (e.g., digits, letters, or symbols) over an extended period of time. In the present research, both tests were developed using E-Prime Professional software. In the visual test, digits between 1 and 7 were presented on a screen and participants pressed a button as quickly as possible each time a 1 or 5 appeared. The auditory task was identical to the visual task except that the participants heard the digit spoken over a set of calibrated headphones. The stimuli were presented binaurally at a comfortable listening level corresponding to a sound pressure level of 70 dB (A). The duration of each test was ∼6 min and consisted of 210 trials. Three performance measures were compared across blocks: correct detection (HIT), false alarms (FAs: errors of omission and commission), and reaction time (RT). Participants were tested individually in a quiet, well-lit laboratory on campus. The testing session was composed of two parts: evaluation of auditory sustained attention and evaluation of visual sustained attention. The order was counterbalanced among participants. Before each part, the participants were given appropriate instructions and asked to perform approximately 15 practice trials.
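The three performance measures can be illustrated with a short Python sketch. It assumes a hypothetical trial record of `(digit, pressed, rt_ms)` and keeps omissions and commissions as separate counts. How the original E-Prime scripts stored trials or combined the two error types is not stated in the text, so this is one plausible reading rather than the authors' scoring procedure.

```python
TARGETS = {1, 5}  # target digits named in the text


def score_attention(trials):
    """Score a sustained-attention run.

    trials: list of (digit, pressed, rt_ms); rt_ms is None when no button press.
    """
    hits = omissions = commissions = 0
    hit_rts = []
    for digit, pressed, rt_ms in trials:
        is_target = digit in TARGETS
        if is_target and pressed:
            hits += 1
            hit_rts.append(rt_ms)
        elif is_target:
            omissions += 1      # target missed
        elif pressed:
            commissions += 1    # response to a non-target
    mean_rt = sum(hit_rts) / len(hit_rts) if hit_rts else None
    return {
        "hits": hits,
        "omissions": omissions,
        "commissions": commissions,
        "mean_rt_ms": mean_rt,
    }


# Hypothetical five-trial excerpt (not study data).
example = [(1, True, 400), (5, False, None), (3, True, 350), (2, False, None), (5, True, 500)]
print(score_attention(example))
```

Reaction time is averaged over correct detections only, which is a common convention but an assumption here.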

*Visual digit span (forward recall; Murphy et al., 2014).* This task was developed using E-Prime Professional software. The digit span task begins with a series of three digits, with 12 attempts for each series. Children verbally repeat each numerical sequence after viewing the numbers on a computer screen. If the children are correct more than 50% of the time, longer series are gradually presented. The span result is the last series for which the subject's responses were more than 50% correct.
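The stopping rule described above can be sketched in Python as follows. The input format (a dictionary from series length to a list of correct/incorrect attempts) and the return value of `None` for a child who fails the three-digit series are assumptions made for illustration. The text does not say how that case was scored.

```python
def digit_span(results_by_length, start_length=3):
    """Return the longest series length answered correctly more than 50% of the time.

    results_by_length: dict mapping series length -> list of booleans,
    one per attempt (12 attempts per series in the task described above).
    """
    span = None
    length = start_length
    while length in results_by_length:
        attempts = results_by_length[length]
        if sum(attempts) / len(attempts) > 0.5:
            span = length   # criterion met; move on to a longer series
            length += 1
        else:
            break           # criterion not met; testing stops
    return span


# Hypothetical child: 12/12 at length 3, 7/12 at length 4, 6/12 at length 5.
child = {
    3: [True] * 12,
    4: [True] * 7 + [False] * 5,
    5: [True] * 6 + [False] * 6,
}
print(digit_span(child))
```

Note that exactly 50% correct does not meet the criterion, because the text requires more than 50%.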

*Speech production.* Assessed by the picture-naming and the word imitation tasks (Wertzner, 2004), derived from the Infantile Language Test-ABFW (Andrade et al., 2004). The picture-naming task was composed of 34 pictures of objects (24 disyllabic and 10 trisyllabic words) with 90 consonants, and the word imitation task was composed of 39 words (25 disyllabic and 14 trisyllabic words) with 107 consonants. Two researchers transcribed each trial to ensure the accuracy of the data. Inter-rater reliability was ≥90%. The percentage of consonants correct – revised (PCC-R; Shriberg et al., 1997) was calculated separately for both speech production tasks by dividing the number of correct productions by the total number of consonants in the sample and multiplying by 100 to determine the production accuracy of each subject.
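The PCC-R computation described above is a simple ratio. The Python sketch below uses the consonant totals given in the text (90 for picture-naming, 107 for word imitation). The counts of correct consonants are hypothetical values chosen only to show the arithmetic.

```python
def pcc_r(correct_consonants, total_consonants):
    """Percentage of consonants correct: correct / total * 100."""
    if total_consonants <= 0:
        raise ValueError("total_consonants must be positive")
    if not 0 <= correct_consonants <= total_consonants:
        raise ValueError("correct_consonants must lie between 0 and the total")
    return 100.0 * correct_consonants / total_consonants


PICTURE_NAMING_TOTAL = 90    # consonants in the picture-naming task (from the text)
WORD_IMITATION_TOTAL = 107   # consonants in the word imitation task (from the text)

# Hypothetical child with 72 and 107 correct consonants, respectively.
print(pcc_r(72, PICTURE_NAMING_TOTAL))
print(pcc_r(107, WORD_IMITATION_TOTAL))
```

The score is computed separately for each task, as in the study, rather than pooled across the 197 consonants.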

*Phonological awareness.* Assessed by the *Lindamood Auditory Conceptualization Test* (LAC; Lindamood and Lindamood, 1979), adapted to the Brazilian Portuguese language (Rosal, 2002; Wertzner et al., 2014). The LAC test assesses phonological awareness skills without requiring verbal responses (children use colored blocks to represent their responses). This method provides superior information on phonological representations, as it prevents speech production errors from affecting the respondent's performance. The test comprised two categories: phonological awareness 1 (PA1) and phonological awareness 2 (PA2). PA1 assesses perception skills through the auditory selection of speech sounds. It comprises six complex sameness/difference sequences covering three possible variations in sequence of three gross and three fine contrasts. The subject must discriminate how many sounds he or she heard in a pattern, and in what sequential order their sameness or difference occurs. Examples of this category are the sound patterns (/b/ /b/ /z/) and (/k/ /t/ /k/). PA2 assesses comprehension skills associated with the child's ability to perceive and compare the number and order of sounds in a spoken pattern (including 12 stimuli that assess the manipulation of one phonemic change such as addition, substitution, omission, transfer, and repetition).

### *Intervention program*

Because the impact of both approaches was investigated for the group as a whole (not individually), we chose to adopt, for both interventions, more general training tasks instead of specialized training focused on specific speech difficulties or impaired auditory skills.

### **AUDITORY INTERVENTION**

The training focused on different auditory-sensory aspects, such as frequency discrimination, ordering, and backward masking. Each of the three tasks took ∼15 min to complete, resulting in 45 min of total training per session. The following software was used for the training tasks:


### **PHONOLOGICAL INTERVENTION**

As mentioned previously, because the impact of this approach was investigated for the group as a whole (not individually), for the present study, we designed a phonological stimulation program (PSP) for the stimulation of different sounds of the phonetic inventory. The PSP was formulated to expose the participants to all sounds from the Brazilian Portuguese system independent of the phonological processes observed during evaluations such that phonological acquisition could occur gradually over a short period of time (12 sessions of stimulation). Compared to more traditional phonological intervention approaches, the current approach is more closely linked to the Cycles Phonological Remediation Approach (Hodson and Paden, 1983, 1991), which also predicts that phonological acquisition in children with phonological disorders is gradual, as in typically developing children, and should be associated with kinesthetic and auditory sensations in order to acquire new patterns. Therefore, this approach intends to increase the child's intelligibility by facilitating the emergence of primary target patterns for beginning cycles such as final consonants, clusters, velars, and liquids.

During the 12-week period of the intervention, all 21 consonantal sounds (CVs) and 13 clusters (CVC) of Brazilian Portuguese were stimulated through activities involving the auditory perception of the target sound, articulatory production, phonological organization, and metalinguistic abilities. Every 2 weeks, each child was exposed to a new specific sound pattern within CV syllables, such as stops, fricatives, liquids and nasals, as well as more complex syllables such as CVC and CCV, regardless of the child's performance and the phonological processes observed in evaluations.

The sound patterns stimulated were as follows: sessions 1 and 2 – fricatives (/f/, /v/, /s/, /z/, /ʃ/, /ʒ/); sessions 3 and 4 – stops (/p/, /b/, /t/, /d/, /k/, /g/); sessions 5 and 6 – liquids (/l/, /R/, /λ/) and the velar fricative (/x/); sessions 7 and 8 – (/m/, /n/, /ñ/) and (/s/, /R/) in CVC syllables; sessions 9 and 10 – /l/ in CCV syllables; and sessions 11 and 12 – /R/ in CCV syllables. We based the target sequence of stimuli on different studies with Brazilian Portuguese-speaking children (Wertzner, 2004; Wertzner et al., 2006, 2007), which indicate that difficulties with the production of liquids, followed by devoicing of fricatives and stops, are the most common speech deficits in children with SSD. Because the liquids are complex sounds in terms of both their production and their distribution in Brazilian Portuguese, we chose to begin the PSP with the fricatives followed by the stops, which also allowed us to present the contrast between voiced and voiceless sounds. After these sounds, we presented the liquids and the velar fricative, followed by the most complex syllables (CVC and CCV) to finish the program.

A variety of tasks were used during the PSP, some of which will be highlighted here. One of the auditory perception tasks was to read three words beginning with each target sound to the child and then perform auditory recognition training for the sounds. In the articulatory tasks, the child had to pay attention to the sound and how the sound was produced by the researcher. Explanations regarding the sound's production were also given. Then, the child had to name specific objects beginning with the target sounds. In the tasks concerning phonological organization, the researcher asked the child to create a sentence including the name of a picture. Metaphonological tasks including syllable, rhyme, and alliteration activities were also performed in addition to phonological memory tasks with words beginning with the target sounds.

### **METHODS**

### *Participants*

A total of 19 children diagnosed with SSD were invited to participate in this study. The children were recruited through the Laboratory of Investigation in Phonology within the Department of Physical Therapy, Speech-Language Pathology, Audiology and Occupational Therapy at the School of Medicine at the University of São Paulo. The children were diagnosed using the phonology test (Wertzner, 2004) derived from the Infantile Language Test-ABFW (Andrade et al., 2004). Diagnosis of an SSD was based on the presence of phonological impairments, which were determined by the presence of phonological processes that were not age expected, and on the absence of impairment in the other language areas (vocabulary, pragmatics, and fluency), which are also measured using the Infantile Language Test-ABFW (Andrade et al., 2004). After diagnosis, the PCC-R (Shriberg et al., 1997) was determined based on the speech samples obtained from the picture-naming and imitation of words tasks of the phonology test (Wertzner, 2004). This quantitative measure was chosen because it is highly sensitive to differences in phonological deficits and provides information pertaining to the two primary error types: omissions and substitutions (Shriberg et al., 1997). The children were monolingual Brazilian-Portuguese speakers and were not undergoing rehabilitation.

The inclusion criteria were as follows: age between 7 and 12 years; diagnosis of an SSD using the phonological output/speech production test described above; no deficits in other language areas (vocabulary, pragmatics, and fluency); IQ > 80 (based on the WISC-IV); and no familial or personal history of diagnosed or suspected auditory, otological or neurological disorders or injuries. This specific age range was chosen because of the complexity of some of the auditory tasks included in the auditory intervention, which would not necessarily be easily comprehended by younger children. In addition, participants were required to demonstrate normal tympanometry and acoustic reflexes. Auditory sensitivity was required to be within normal limits (≤15 dB HL for octave frequencies from 250 to 8000 Hz) and symmetrical (interaural differences ≤5 dB HL at each frequency). To verify these inclusion criteria, the children were required to pass a series of inclusion tests consisting of a parent questionnaire, an audiological evaluation, language tests and a non-verbal IQ test (the Raven test of Colored Progressive Matrices with Brazilian norms (Angelini et al., 1999) and a conversion table of IQ values (Strauss et al., 2006)).
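The audiometric part of these criteria can be expressed as a small check. The Python sketch below assumes thresholds are stored per ear as a dictionary keyed by frequency in Hz. That data layout and the function name are hypothetical and chosen only for illustration.

```python
OCTAVE_FREQS_HZ = (250, 500, 1000, 2000, 4000, 8000)


def hearing_within_criteria(left, right, max_threshold=15, max_asymmetry=5):
    """Check the audiometric inclusion criteria stated in the text.

    left, right: dicts mapping frequency (Hz) -> threshold (dB HL) for each ear.
    Returns True only if every threshold is <= max_threshold and the
    interaural difference is <= max_asymmetry at every octave frequency.
    """
    for freq in OCTAVE_FREQS_HZ:
        l, r = left[freq], right[freq]
        if l > max_threshold or r > max_threshold:
            return False
        if abs(l - r) > max_asymmetry:
            return False
    return True


# Hypothetical audiogram: 10 dB HL at every frequency in both ears.
flat = {f: 10 for f in OCTAVE_FREQS_HZ}
print(hearing_within_criteria(flat, flat))
```

Tympanometry, acoustic reflexes, and the non-audiometric criteria are not covered by this check.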

The results of these tests (i.e., the IQ test and audiological evaluation) led to the exclusion of two children. Then, the selected children were randomly assigned into either the auditory intervention group (AG, *n* = 10) or the phonological intervention group (PG, *n* = 7). **Table 1** displays the characteristics of these two groups (gender, age, IQ, and language skills).

There were no significant inter-group differences with regard to age (*p* = 0.053), IQ (*p* = 0.35), short-term memory (*p* = 0.17), auditory processing (Frequency Pattern Test: *p* = 0.21, Gap in Noise Test: *p* = 0.80), and one of the language skills (picture-naming: *p* = 0.06). Differences were found only for imitation of words (*p* = 0.013). The significance threshold was set at *p* < 0.05 (**Table 1**).

### *Procedures*

After the groups were established, a series of tests concerning attention, short-term memory, language, and auditory processing were applied before and after the interventions (outcome measures). The characteristics regarding each of these tests are described in the Materials section. Each participant was allocated to one of the two intervention groups. Both of these approaches consisted of 12 45-min sessions twice per week, for a total of 9 h of training. The details regarding each program are also described in the Materials section. Both groups received approximately the same number of training sessions (AG: mean = 11 sessions, PG: mean = 11.4 sessions; *p* = 0.62). **Figure 1** demonstrates the sequence of procedures adopted from the initial invitation to participants until the number of completed training sessions for each group.

**Table 1 | Performance characteristics of the AG and PG on the screening battery.**

Speech production tasks: percent consonants correct for both picture-naming and imitation of words. AG, auditory group; PG, phonological group; M, mean; SD, standard deviation; IQ, intellectual quotient; \*significant.

### **STATISTICAL ANALYSIS**

The data were analyzed using Minitab Statistical Software version 16.1. Non-parametric statistics were used because both groups violated the assumption of normal distribution necessary for parametric analysis. Intra- and inter-group analyses were used not only to investigate the effect of each intervention approach separately (intra-group analysis) but also to compare the level of improvements following interventions in both groups (inter-group analysis).

For the first analysis, the pre- and post-intervention performances were compared separately for each group in each of the tests (intra-group analysis using the Wilcoxon test). In the second analysis, the differences between the pre- and post-intervention performances ("improvement-following training") were compared between both groups in each of the tests (inter-group analysis using the Mann–Whitney test). The significance threshold was set at *p* < 0.05.
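To make the two analyses concrete, the following Python sketch implements exact versions of both tests using only the standard library: a Wilcoxon signed-rank test on paired pre/post scores, and a Mann–Whitney test on gain scores from two independent groups. This is an illustration under stated assumptions, not a reproduction of the Minitab procedure. Minitab may use normal approximations or different tie handling, so its *p*-values can differ. All scores below are made up.

```python
from itertools import combinations, product


def midranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank(pre, post):
    """Exact two-sided test on paired scores; zero differences are dropped."""
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    ranks = midranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    extreme = total = 0
    for signs in product((0, 1), repeat=len(ranks)):  # every sign pattern
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        extreme += abs(w - center) >= observed
    return w_plus, extreme / total


def mann_whitney_u(x, y):
    """Exact two-sided test on two independent samples (e.g., gain scores)."""
    ranks = midranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    offset = n1 * (n1 + 1) / 2
    center = n1 * n2 / 2
    u_obs = sum(ranks[:n1]) - offset
    extreme = total = 0
    for idx in combinations(range(n1 + n2), n1):  # every group relabeling
        u = sum(ranks[i] for i in idx) - offset
        total += 1
        extreme += abs(u - center) >= abs(u_obs - center)
    return u_obs, extreme / total


# Made-up scores for illustration only; these are not the study's data.
pre = [40, 45, 50, 55, 60]
post = [45, 55, 65, 75, 85]
print(wilcoxon_signed_rank(pre, post))   # (W+, two-sided p)

gains_a = [10, 11, 12]
gains_b = [1, 2, 3]
print(mann_whitney_u(gains_a, gains_b))  # (U, two-sided p)
```

Exhaustive enumeration is practical only for small samples such as the group sizes in this study (10 and 7 children). For larger samples, a statistics package would normally be used instead.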

### **RESULTS**

### **INTRA-GROUP ANALYSIS**

**Table 2** displays the performances in auditory-sensory and cognitive measures for both groups (pre- and post-training).

### *Auditory group*

The Wilcoxon test demonstrated significant differences between the pre- and post-intervention performances for both auditory measures (FPT: *p* = 0.01 and GIN: *p* = 0.05), one of the visual attention measures (RT: *p* = 0.03), one of the auditory attention measures (FA: *p* = 0.03) and digit span (*p* = 0.05). No significant differences were observed for the other outcomes (picture-naming: *p* = 0.72; imitation of words: *p* = 0.10; Visual HIT: *p* = 0.31; Visual FA: *p* = 0.28; Auditory HIT: *p* = 0.13; Auditory RT: *p* = 0.18; IB: *p* = 0.20; II: *p* = 0.27).

FPT, Frequency Pattern Test; GIN, Gap in Noise; FA, false alarm; RT, reaction time; AG, auditory group; PA, phonological awareness; PCC, percentage of consonants correct; PG, phonological group; M, mean; SD, standard deviation; \*significant.

### *Phonological group*

The Wilcoxon test demonstrated no significant differences between the pre- and post-intervention performances in any of the measures [auditory (FPT: *p* = 0.95; GIN: *p* = 0.85), short-term memory (*p* = 0.78), visual attention (HIT: *p* = 0.78; FA: *p* = 0.78; RT: *p* = 0.27), auditory attention (HIT: *p* = 0.55; FA: *p* = 0.07; RT: *p* = 0.35) and language (picture-naming: *p* = 0.13; imitation of words: *p* = 0.83; IB: *p* = 0.46; II: *p* = 0.68)].

### **INTER-GROUP ANALYSIS**

With regard to the auditory-sensory measures, the Mann–Whitney test showed a significant difference between the gains of the two groups for both auditory measures (FPT: *p* = 0.01; GIN: *p* = 0.02).

With regard to the cognitive measures, the Mann–Whitney test demonstrated a significant difference between the gains of the two groups for visual RT (*p* = 0.02) and no significant differences between the gains of the two groups for language tasks (IB: *p* = 0.58; II: *p* = 0.52; picture-naming task: *p* = 0.69; imitation of words task: *p* = 0.32), the short-term memory task (*p* = 0.45) and the other auditory and visual attention measures (visual HIT: *p* = 0.72; visual FA: *p* = 0.41; auditory HIT: *p* = 0.35; auditory FA: *p* = 0.88; auditory RT: *p* = 1.0).

To summarize, intra-group analysis demonstrated that the auditory intervention group showed significant gains in both auditory and cognitive measures, whereas no significant gain was observed in the phonological intervention group. Inter-group analysis demonstrated significant differences between the two groups in improvement following training, with a more pronounced gain for the non-linguistic auditory temporal intervention in one of the visual attention measures and both auditory measures. No significant improvement in phonological skills was observed in either analysis for either group (**Table 3** and **Figure 2**).

### **DISCUSSION**

The purpose of this study was to compare the impact of a non-linguistic auditory and a phonological intervention approach on the phonological skills of children with SSD. Before discussing the present results, it is important to discuss the characteristics of the groups, specifically the age and the pre-training performance in phonological tasks. Although no significant differences were observed with regard to age, there was a difference of ∼1 year between the groups (children in the phonological intervention group having the highest mean age). Although several studies have corroborated the hypothesis regarding the existence of a critical period for learning (Knudsen, 2004), a difference of 1 year is insufficient to produce significant differences in the way that the learning process occurs, especially when comparing 7- and 8-year-olds. Murphy and Schochat (2011), for instance, observed a significant difference between the gains following auditory training only between a younger group (ages 7–10) and an older group (ages 11–14). However, the age difference in our study possibly influenced the pre-intervention performance on the phonological and short-term memory tasks. This result is expected given that, even in children with SSD, these two skills improve with development (to some extent). Specifically, for the imitation of words task, the phonological group performed significantly better than the auditory group; however, the difference between the groups in the short-term memory task was not significant. The implications of the performance of the phonological group on the phonological tests will be discussed further, with the comments concerning the improvement following training on the same tests. Regarding gender, both groups contained a higher number of boys, which corroborates previous research on the higher prevalence of SSD in boys (Shriberg et al., 1986, 1994; Wertzner and Oliveira, 2002).

**Table 3 | Comparison between gains in both groups (Inter-group analysis).**

FPT, Frequency Pattern Test; GIN, Gap in Noise; FA, false alarm; RT, reaction time; PA, phonological awareness; PCC, percentage of consonants correct; AG, auditory group; PG, phonological group; M, mean; SD, standard deviation; \*significant.

The intra-group analysis demonstrated that although no significant improvement following training was observed for the phonological group, the auditory group showed significant gains in both auditory measures, one of the visual and one of the auditory attention measures, and the digit span measure.

Regarding the auditory group, the improvements for both the FPT and GIN test were expected because the trained task in the auditory intervention approach is similar to both of these outcome measures. Thus, this improvement is likely to represent mid-transfer, that is, the learning generalization from the trained task to a different task in the same domain. Other studies, like the present research, have also demonstrated improvements following a non-linguistic auditory intervention approach in a similar trained task (Kujala et al., 2001; Murphy and Schochat, 2011). Kujala et al. (2001), for instance, used non-linguistic audiovisual computer training, with sound elements varying in pitch, duration, and intensity, in reading-impaired children. After training, improvements in a behavioral auditory frequency discrimination task were demonstrated, corroborating the results of the present research. Murphy and Schochat (2011) applied frequency discrimination training in children with dyslexia. After training, there was a significant improvement in the trained group on a similar trained task.

Despite the improvement of the auditory group on both auditory-sensory measures, no significant improvement was observed for language tasks, suggesting no generalization from non-linguistic auditory tasks to higher phonological skills. Previous research has demonstrated that this is a controversial topic. Some studies have observed improvements in verbal skills after auditory training (Kujala et al., 2001; Lakshminarayanan and Tallal, 2007; Murphy and Schochat, 2011), whereas others failed to show the same results (Halliday et al., 2012). Kujala et al. (2001), for instance, implemented an audiovisual training program including only non-linguistic stimuli for a group of 7-year-old dyslexic children (*n* = 24). The results showed that whereas before training, there were no differences in performances on reading tests between the "trained" and "untrained" groups (both composed of dyslexic children), after training, the "trained" group had better results than the "untrained" group. Electrophysiological auditory tests also showed similar results – larger amplitudes of the mismatch negativity wave were seen after training. The researchers suggested that non-linguistic auditory training, such as in the current research, can improve reading skills. In contrast, in a study conducted by Halliday et al. (2012), no learning generalization across different tasks or stimuli was found when different types of sensory training were given (auditory frequency discrimination, auditory phonetic discrimination, and visual frequency discrimination tasks). The authors concluded that learning following auditory training was specific to the task or stimulus. Most likely, these conflicting results are due to the methodological differences among studies, such as the training delivered (amount of training, type of task, and stimulus), the outcome measures (how far from the trained task the effect extends) and the population (typically developing children or those with language disorders).
Regarding the length and intensity of the training, for instance, we administered both training approaches over 12 sessions of 45 min each (one per week, totaling 12 weeks), whereas Kujala et al. (2001) administered 14 sessions of 10 min (twice per week, totaling 7 weeks) and Halliday et al. (2012) administered 12 sessions of 30 min (three times per week, totaling 4 weeks). Although Halliday et al. (2012) provided the most intensive training, no generalization was observed from the auditory stimulus or task to higher-level measures of language ability. One possible explanation was offered by Molloy et al. (2012), who claimed that optimal training regimens should have short sessions spaced by several days in early learning, as done by Kujala et al. (2001), which was the only study that demonstrated learning transfer from the non-linguistic stimuli to language skills.

Despite the lack of generalization from the trained tasks to language skills, intra-group analysis demonstrated improvements in short-term memory and attention outcome measures. This result suggests a positive benefit of training on the attention and memory skills of children with SSDs; moreover, it demonstrates the influence of an auditory-sensory intervention on top–down skills. As in the present research, previous studies also reported enhanced attention skills following auditory-sensory training in different populations (Stevens et al., 2008; Soveri et al., 2013). Stevens et al. (2008) demonstrated better selective auditory attention performance following *Fast ForWord* (FFW) training in children with SLI, suggesting that the neural mechanisms of selective attention are remediated through training. Soveri et al. (2013) also demonstrated improved auditory attention in healthy adults, suggesting that auditory training can modulate the allocation of auditory attention in the adult population. It is also important to note that in the current research, the improvement in short-term memory seemed to be insufficient for the enhancement of phonological skills. Such transfer might have been expected, given that poor phonological representations of speech sound systems are often attributed to deficits that involve memory skills (Bird and Bishop, 1992; Raitano et al., 2004; Kenney et al., 2006). Because short-term memory improvements were observed only for the intra-group analysis, additional studies are necessary to better investigate this result.

Contrary to the auditory group, the phonological group exhibited no improvement, after training, in auditory-sensory measures. This result was expected given that the tasks included in the phonological intervention approach did not have a close or even underlying relationship with these auditory-sensory measures. However, the lack of improvement in phonological tasks was not expected because the phonological training tasks were similar to the phonological outcome measures; therefore, it would be reasonable to expect a more pronounced gain for the phonological group. It is possible that this result is associated with the type of phonological intervention approach adopted in this study. As noted above, the phonological intervention approach consisted of more general tasks, with no focus on the individual's performance before the intervention (deviant or missing phonemes). Therefore, the improvement in phonological outcome measures had to be linked to learning transfer from this general stimulation to some specific deviant or missing phonological process. Previous studies have demonstrated this generalization when the phonological intervention approach was based on the child's target speech production goals. Lousada et al. (2012), for instance, described the presence of learning generalization in a study evaluating the effectiveness of a phonological intervention approach and an articulation intervention approach in children with SSDs. A generalization probe of the trained sound or phonological process to five nonintervention words was used. The authors demonstrated that the children in the phonological group showed greater generalization to untreated words than those who received articulation therapy.

The results of the inter-group analysis demonstrated no significant difference between the two groups with regard to improvement on the phonological tests following intervention. One of the issues with this comparison is that the phonological group, compared to the auditory group, had a significantly better performance on the phonological tests before training. Thus, the phonological group had less room to improve, which could have reduced the chance of observing a greater improvement in the phonological group following intervention. Therefore, this might be a reason for the lack of a more pronounced gain in the phonological group. However, in the intra-group analyses, in which both groups were analyzed separately, the phonological group had no significant improvement, even for the phonological awareness task that included manipulation, in which the score obtained prior to intervention was only 67.5%. Thus, at least for this task, there was no ceiling effect, which means that it would be reasonable to expect a significant improvement following intervention.

The initial hypothesis of this study was that each one of the interventions would improve the performance in the trained tasks (auditory and phonological skills), leading to learning transfer to associated tasks (language, memory, and attention skills). As previously mentioned, significant improvements in the trained tasks were observed only in the auditory group. We hypothesize that this improvement might be related to the greater similarity between the auditory training tasks and the auditory outcome measures compared to that between the phonological training tasks and the phonological tests. Therefore, further studies should investigate the effect of a more specific intervention approach that focuses on specific speech difficulties/phonological processes. Despite that, previous studies have also demonstrated the positive effect of more general remediation. The auditory program FFW (Tallal et al., 1996), for instance, is one example of a successful general approach, given that the program comprises varied skills such as auditory temporal, phonological awareness and reading skills and is not focused on a single aspect. In this case, research has demonstrated generalization from more perceptual trained aspects to language skills of children with language disorder (Merzenich et al., 1996; Gaab et al., 2007). Lousada et al. (2012) also described the presence of generalization from a trained phonological process to non-trained words.

The observed transfer from the auditory training to the attention and memory skills might be related to the different characteristics of the two interventions. Whereas the auditory training was administered via a computer with fixed audiovisual tasks demanding attention and time to answer, the phonological training was administered by a speech therapist with more flexible tasks and more time to answer. With regard to the transfer to phonological skills, because no significant enhancement was observed (even with auditory-sensory improvement), the results do not corroborate the initial hypothesis, which associates auditory temporal processing and phonological skills. Therefore, although the nonlinguistic auditory intervention approach appears to be the more effective of the two approaches, it was insufficient to promote the enhancement of speech production and phonological awareness skills. Further studies are necessary to ascertain the extent to which auditory-sensory processing is involved in the etiology of SSD and the process of learning generalization across bottom–up and top–down skills.

These results are based on preliminary data from 10 participants who received auditory training and seven who received phonological training. It is clear that additional data are needed to confirm and extend these findings. Further research is also required to investigate the presence of a test-retest effect through the inclusion of a control group (non-trained group).

### **ACKNOWLEDGMENTS**

We thank David R. Moore for providing the STAR program. We also thank Danira Tavares Francisco, Marina Jorge Pulga, Tatiane Faria Barrozo, and Thaís Zemlickas Silva for their evaluation and assessment of the children. This work was supported by São Paulo Research Foundation (2009/50453-6).

### **REFERENCES**


problems. *Clin. Neurophysiol.* 114, 673–684. doi: 10.1016/S1388-2457(02)00414-5


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 13 August 2014; accepted: 13 January 2015; published online: 04 February 2015.*

*Citation: Murphy CFB, Pagan-Neves LO, Wertzner HF and Schochat E (2015) Children with speech sound disorder: comparing a non-linguistic auditory approach with a phonological intervention approach to improve phonological skills. Front. Psychol. 6:64. doi: 10.3389/fpsyg.2015.00064*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology.*

*Copyright © 2015 Murphy, Pagan-Neves, Wertzner and Schochat. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Differences in Speech Recognition Between Children with Attention Deficits and Typically Developed Children Disappear When Exposed to 65 dB of Auditory Noise

### *Göran B. W. Söderlund<sup>1</sup>\*† and Elisabeth Nilsson Jobs<sup>2</sup>†*

*<sup>1</sup> Department of Teacher Education and Sports, Sogn og Fjordane University College, Sogndal, Norway, <sup>2</sup> Department of Psychology, Karolinska University Hospital, Stockholm, Sweden*

### *Edited by:*

*Patrik Sörqvist, University of Gävle, Sweden*

### *Reviewed by:*

*Andre Brechmann, Leibniz Institute for Neurobiology, Germany Staffan Hygge, University of Gävle, Sweden*

*\*Correspondence:*

*Göran B. W. Söderlund goran.soderlund@hisf.no †These authors have shared first authorship.*

### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

> *Received: 01 October 2015 Accepted: 08 January 2016 Published: 29 January 2016*

### *Citation:*

*Söderlund GBW and Jobs EN (2016) Differences in Speech Recognition Between Children with Attention Deficits and Typically Developed Children Disappear When Exposed to 65 dB of Auditory Noise. Front. Psychol. 7:34. doi: 10.3389/fpsyg.2016.00034*

The most common neuropsychiatric condition in children is attention deficit hyperactivity disorder (ADHD), affecting ∼6–9% of the population. ADHD is distinguished by inattention and hyperactive, impulsive behaviors as well as poor performance in various cognitive tasks, often leading to failures at school. Sensory and perceptual dysfunctions have also been noticed. Prior research has mainly focused on limitations in executive functioning, where differences are often explained by deficits in pre-frontal cortex activation. Less notice has been given to sensory perception and subcortical functioning in ADHD. Recent research has shown that children with an ADHD diagnosis have a deviant auditory brain stem response compared to healthy controls. The aim of the present study was to investigate if the speech recognition threshold differs between attentive children and children with ADHD symptoms in two environmental sound conditions, with and without external noise. Previous research has shown that children with attention deficits can benefit from white noise exposure during cognitive tasks, and here we investigate if a noise benefit is present during an auditory perceptual task. For this purpose we used a modified Hagerman's speech recognition test where children with and without attention deficits performed a binaural speech recognition task to assess the speech recognition threshold in no noise and noise conditions (65 dB). Results showed that the inattentive group displayed a higher speech recognition threshold than typically developed children and that the difference in speech recognition threshold disappeared when exposed to noise at supra threshold level. From this we conclude that inattention can partly be explained by sensory perceptual limitations that can possibly be ameliorated through noise exposure.

Keywords: speech recognition, ADHD, Hagerman test, speech in noise, white noise, stochastic resonance

# INTRODUCTION

Attention deficit hyperactivity disorder (ADHD) is the most common neuropsychiatric condition in children, affecting ∼6–9% of the youth population and 3–5% of adults (e.g., Froehlich et al., 2007; Dopheide and Pliszka, 2009). ADHD is more prevalent among boys, with a girl-to-boy ratio of 1:3 (Biederman and Faraone, 2004; Lindemann et al., 2012), although this difference has diminished over the years and more girls are now diagnosed (de la Barra et al., 2013). The inattentive deficit comprises difficulties in sustaining attention and following instructions, and seeming inattentive when spoken to directly, while the hyperactivity is manifested by overactivity, restlessness, and impulsivity (APA, 2013). Children with attention deficits display deficits in working memory, in particular auditory working memory (Alderson et al., 2015), often seem to have a listening problem, need auditory information to be repeated, have difficulties in dichotic listening tasks (Cacace and McFarland, 2006) and often display a sluggish cognitive tempo (McBurnett et al., 2001). ADHD is commonly associated with school failures and academic underachievement (Faraone et al., 1993; Barkley et al., 2006; Serra-Pinheiro et al., 2008). A common explanation for the symptoms of ADHD is low continuous levels of dopamine in the synaptic cleft (Volkow et al., 2009). In line with this, stimulant medication, e.g., methylphenidate, can be used to treat symptoms of ADHD, both behavioral and cognitive problems, to facilitate adaptation to school demands (Evans et al., 2001; Greenhill et al., 2002; Scheffler et al., 2009; Wigal et al., 2011). However, the best dose for optimal cognitive functioning has been found to be lower than the best dose for school behavior (Hale et al., 2011).
Of greater concern, it is not evident that stimulant medication improves learning processes (Hellwig-Brida et al., 2011; Ginsberg and Lindefors, 2012), the long-term effects of medication are not yet well known (Group, 2004) and neither are the effects on the developing brain (Anderson et al., 2002; Andersen, 2005). These uncertainties about medication make it urgent to look for alternative ways of improving attention and thus school performance for children with attention deficits.

The aim of the present study is to investigate if performance in speech recognition thresholds differs between children with ADHD symptoms and typically developed children (TDC) performing a speech recognition task in two different noise conditions, no noise and in 65 dB slightly modulated noise (that resembles white noise). The hypothesized difference between groups in speech recognition thresholds will here be further investigated. A reason for this is that prior research on ADHD has mainly focused on executive functioning where differences in performance are explained by deficits in pre-frontal cortex activation (e.g., Aaron et al., 2004; Brennan and Arnsten, 2008; Boonstra et al., 2010). Less notice has been given to sensory perception and subcortical functioning in ADHD even though there is a large overlap between central auditory processing disorder and ADHD (Riccio et al., 1994; Chermak et al., 2002).

There are somewhat contradictory findings regarding auditory perception in ADHD, indicating impairments as well as no impairments. Some studies indicate differences between ADHD and TDC in speech processing, e.g., ADD children seem to prefer lower loudness levels when listening to speech, and display inferior speech discriminating ability when exposed to noise (Geffner et al., 1996; Lucker et al., 1996) and in hearing ability (Abdo et al., 2010). In binaural speech recognition tasks younger children with ADHD perform worse than TD children but at the same level in signal detection tasks (Pillsbury et al., 1995). In dichotic listening tasks TD children outperform children with ADHD in cognitive control of auditory input (Dramsdahl et al., 2011; Oie et al., 2014). From this we can conclude that ADHD children display reduced efficiency in signal recognition or perception but not in signal detection *per se*. Noise can be detrimental for attention, but when the efferent auditory system was investigated, the ability to suppress contralateral noise was reported as equal between an ADHD group and a control group (Pereira et al., 2012). Differences in auditory brainstem responses are found in ADHD and ASD patients that might indicate a fundamental difference in auditory processing compared to TDC (Källstrand et al., 2010; Claesdotter-Hybbinette et al., 2015; Jafari et al., 2015). To sum up, the mixed results reviewed above provide good reasons to further investigate the topic of auditory perception, and in particular speech recognition, in ADHD in the different noisy environments that are common during schoolwork.

The effects of acoustic noise on learning have often been investigated in relation to hearing in difficult conditions, where noise is usually an obstacle (Ljung et al., 2009; Song et al., 2012). Even low levels of continuous or intermittent noise are found to impair the learning and reproduction of texts in healthy control subjects (Trimmel et al., 2012). In contrast to the main body of evidence, there has been an increasing number of studies reporting that loud acoustic random noise (80 dBA) can, under certain circumstances, be beneficial for performance on various cognitive tasks. This noise benefit is found in particular in individuals with an ADHD diagnosis (Söderlund et al., 2007) or with poor attention ability (Söderlund et al., 2010; Helps et al., 2014). Road traffic- and speech noise can also be beneficial for cognitive performance (Stansfeld et al., 2005; Söderlund and Sikström, 2012). This is a somewhat counterintuitive finding, since persons with attention problems are often shown to be particularly vulnerable to distraction (e.g., Geffner et al., 1996; Rickman, 2001). A recent theory of noise benefit is the moderate brain arousal model (MBA) that relies on the phenomenon of stochastic resonance (SR; Sikström and Söderlund, 2007). SR is a ubiquitous phenomenon that exists in nature in any system with noise and a signal that requires passing a threshold, as in the nervous system (McDonnell and Abbott, 2009). The simplest form of SR is threshold SR, in which a weak auditory signal is presented below the hearing threshold and becomes detectable when random noise is added to the signal, pushing it over the detection threshold (Stacey and Durand, 2001; Moss et al., 2004). In threshold SR the signal should be presented just below the hearing threshold and the noise in the same range (20–35 dB) for SR to occur. In supra threshold SR (SSR), the effect occurs when the total added noise equals the mean of the signal amplitude (Stocks, 2000; McDonnell et al., 2007).
This means that both noise and signal can be far above the hearing threshold; in the present study we focus on supra threshold SR, setting the noise level at a constant level of 65 dB SPL and modulating the speech signal from 85 dB SPL and downward. The SR effect appears highly sensitive to both the intensity of the signal and the noise level; this relationship follows an inverted U-curve function, where performance peaks at moderate noise levels. This means that a moderate level of white noise is beneficial for performance, whereas too little does not add the power required to bring the signal over the threshold and too much overpowers the signal, leading to a deterioration in attention and performance (McDonnell and Ward, 2011). The novel aspect of the MBA model is that it proposes individual differences in the SR effect and that these differences are linked to attention ability, in that inattentive or ADHD-diagnosed individuals need a higher input of noise compared to TDC to function at their full potential (Sikström and Söderlund, 2007).
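
The inverted U-curve of threshold SR described above can be illustrated with a minimal sketch. It assumes a hard-threshold detector, Gaussian noise, and a constant subthreshold signal in arbitrary units; the function names and the values 0.8 and 1.0 are hypothetical and are not taken from the MBA model or from the stimulus levels of this study. Performance is scored as hit rate minus false-alarm rate.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def detection_benefit(signal: float, threshold: float, noise_sd: float) -> float:
    """Hit rate minus false-alarm rate of a hard-threshold detector.

    A hit is a threshold crossing while the signal is present; a false
    alarm is a crossing caused by the noise alone. Assumes threshold > 0.
    """
    if noise_sd <= 0.0:
        # No noise: only a signal that is itself above threshold is detected.
        return 1.0 if signal > threshold else 0.0
    hit_rate = 1.0 - normal_cdf((threshold - signal) / noise_sd)
    false_alarm_rate = 1.0 - normal_cdf(threshold / noise_sd)
    return hit_rate - false_alarm_rate


# Hypothetical values in arbitrary units (not dB): the signal sits just below threshold.
SIGNAL, THRESHOLD = 0.8, 1.0
for noise_sd in (0.0, 0.1, 0.5, 1.0, 3.0, 10.0):
    benefit = detection_benefit(SIGNAL, THRESHOLD, noise_sd)
    print(f"noise sd {noise_sd:>4}: hit rate - false-alarm rate = {benefit:.3f}")
```

Under these assumptions the score is zero without noise, rises at moderate noise levels, and falls back toward zero when the noise dominates, which is the qualitative pattern the paragraph describes.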

In accordance with the MBA model this leads to the prediction that children with ADHD will benefit more from noise than children with normal attention, for whom noise will have a detrimental effect on performance. Accordingly, we will investigate if thresholds in speech recognition differ between children with ADHD symptoms and a typically developed control group and study how noise exposure affects the two groups. The hypothesis is that the noise during a speech recognition task will strengthen the signal and thus increase the signal-to-noise ratio, in particular for the ADHD group; this improvement will be mediated through the supra threshold SR phenomenon. Our more specific predictions are as follows: (i) in the no noise condition the inattentive group will demand a higher speech signal level than controls in order to perceive the speech signal correctly, due to a smaller signal-to-noise ratio; (ii) in the noisy condition (65 dB SPL modulated noise) these differences will disappear because noise strengthens the speech signal for the inattentive group, and they will perform on a par with the TDC group.

# MATERIALS AND METHODS

### Participants and Recruitment

Forty-nine secondary school boys between 9 and 10 years of age (*M* = 10.2) participated in the study. Girls were not included in the study since a vast majority of the clinical group were boys and gender could therefore be a confounding variable. Initial testing and parent- and teacher ratings of ADHD symptoms were performed before the speech recognition task. In the ADHD group 10 boys were recruited by the staff (nurses and psychologists) at neuro-psychiatric units within the pediatric healthcare in the Stockholm catchment area, all having a clinical diagnosis set by a pediatrician. One participant was excluded due to incomplete test data. The 39 participants recruited for the control group came from a school in a mixed demographic suburb of Stockholm. One participant was excluded due to incomplete test data and one due to low general ability (IQ *<* 80). According to the initial teacher and parent ratings of ADHD symptoms, six of the participants had significant ADHD symptoms (a mean score below 2.5) and were moved from the control group to the ADHD symptom group. Thus, in all, 15 participants were included in the ADHD symptom group and the remaining 31 participants constituted the typically developed control group. Of note is that ADHD is a behavioral diagnosis, i.e., certain behaviors make up the criteria for the diagnosis. To get a diagnosis the symptoms should not be explained by a general cognitive deficit and symptoms should be present in childhood (DSM-5, 2013). Diagnoses are mainly based on questionnaires where symptoms are rated (Martel et al., 2015). Symptoms of inattention and hyperactivity are often viewed as dimensional traits that exist to a greater or lesser degree in the population (Marcus and Barry, 2011). The ADHD symptom rating used in this study (see below) is based on the DSM-5 criteria and captures behaviors within the diagnostic realm. Note that the term *ADHD symptom group* is used when the extra six participants are included.
Group assignments were made prior to the speech recognition test. All participation followed written permission from parents and oral consent from the children. The regional ethics board in Stockholm approved the study.
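
The participant flow described above can be tallied in a few lines. This is only a bookkeeping sketch using the counts stated in the paragraph; the variable names are ours.

```python
# Counts taken from the Participants and Recruitment paragraph above.
recruited_adhd = 10
recruited_controls = 39
excluded_adhd = 1            # incomplete test data
excluded_controls = 2        # one with incomplete test data, one with IQ < 80
moved_to_symptom_group = 6   # controls rated as having significant ADHD symptoms

clinical_adhd = recruited_adhd - excluded_adhd
adhd_symptom_group = clinical_adhd + moved_to_symptom_group
control_group = recruited_controls - excluded_controls - moved_to_symptom_group
analysed_total = adhd_symptom_group + control_group

print("recruited:", recruited_adhd + recruited_controls)
print("clinically diagnosed ADHD group:", clinical_adhd)
print("ADHD symptom group:", adhd_symptom_group)
print("typically developed control group:", control_group)
print("analysed in total:", analysed_total)
```

The tally reproduces the 49 recruited boys, the nine clinically diagnosed children, and the group sizes of 15 and 31. The 46 analysed participants are also consistent with the 44 error degrees of freedom reported for the two-group comparisons (46 − 2).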

The initial teacher- and parent ratings of participants covered items about school achievements (reading, arithmetic, oral presentation, general school performance), social skills, hearing and hearing sensitivity, language spoken at home, and medication. All participants had normal hearing according to self-report, parent and/or teacher reports. To rule out peripheral hearing loss, the inclusion criterion was set to a binaural hearing threshold of 37 dB SPL (equivalent to 15 dB HL) or below according to the result in the no-noise condition. No participants were excluded for this reason. ADHD symptom ratings were based on the SWAN scales (Swanson et al., 2007), the TTI-IV interview manual (Tannock et al., 2002), and the Diagnostic and Statistical Manual of Mental Disorders criteria for ADHD (APA, 2000, 2013). The rating consisted of salutogenic items rated on a scale from 1 to 7 where 1 was much below average, 4 was average and 7 was much above average. The rating covered nine questions about attention ability and nine about activity level. Two subtests, "Similarities" and "Picture concepts" from the Wechsler Intelligence Scale for Children, WISC-IV, were used to measure cognitive ability. The "Similarities" subtest measures verbal fluid intelligence and the "Picture concepts" subtest measures non-verbal fluid intelligence (Flanagan et al., 2006). Two subtests of auditory working memory were used, "Digit span" from the Wechsler Intelligence Scale for Children, WISC-IV, and "Repetition of sentences" from the neuropsychological battery NEPSY (Korkman et al., 2001). The subtests "Score" and "Score-double task" (Score DT) from the TEA-Ch battery (Manly et al., 2001) were used to assess sustained auditory attention and auditory divided attention, respectively. See participant characteristics in **Table 1**.

# Materials and Test Battery

The signal-to-noise ratio (i.e., the relation between the signal and the noise in dB) at which it is comfortable to listen to speech is about 15 dB, i.e., when the signal is 15 dB louder than the noise. Noise levels can thus be as high as 40–50 dB SPL without affecting speech intelligibility if the signal is presented at about 65 dB SPL. A comfortable level for listening to speech in quiet or low levels of background noise is about 60–65 dB SPL, which corresponds to the level of normal conversational speech heard at 1 m (Scharine et al., 2009). The ability to detect speech in quiet improves by 9 dB between the ages of 4 and 10 years, i.e., one can hear speech on average 9 dB softer at the age of 10 years (Neumann et al., 2012).
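
Because both levels are expressed in dB, the signal-to-noise ratio used above is simply the difference between the speech level and the noise level. The following minimal sketch, with helper names of our own choosing, reproduces the arithmetic of the paragraph.

```python
def snr_db(signal_db_spl: float, noise_db_spl: float) -> float:
    """Signal-to-noise ratio in dB: signal level minus noise level."""
    return signal_db_spl - noise_db_spl


def max_noise_for_snr(signal_db_spl: float, required_snr_db: float = 15.0) -> float:
    """Highest noise level (dB SPL) that still leaves the required SNR."""
    return signal_db_spl - required_snr_db


# Speech at about 65 dB SPL tolerates noise up to 50 dB SPL at a 15 dB SNR.
print(max_noise_for_snr(65.0))
# Speech at 65 dB SPL in 50 dB SPL noise has an SNR of 15 dB.
print(snr_db(65.0, 50.0))
# Starting levels of the noise condition in this study: 85 dB SPL speech in 65 dB SPL noise.
print(snr_db(85.0, 65.0))
```

A negative result from `snr_db` means that the speech is presented below the noise level, as happens once the adaptive procedure has attenuated the speech signal below 65 dB SPL in the noise condition.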

Speech-in-noise tests, where the speech signal is embedded in background noise, are mainly used for evaluation of the benefit of hearing aids but also for assessing auditory functioning in individuals that report difficulties in perceiving speech in noisy surroundings despite good peripheral hearing (i.e., tone thresholds). In the present study we used the Hagerman sentences test, which is one of the Swedish speech recognition tests in noise used in clinical settings. The test has a fixed noise level where the speech signal is attenuated. The noise sounds like a continuous noise but is slightly modulated to resemble the temporal variations of speech (Hagerman, 1984, 2002). The slightly modulated noise resembles white noise but does not have a flat power spectrum; most of the energy is between 1 and 5 kHz, the frequencies of normal speech (Hagerman, 2002). Of note, slightly modulated noise possesses the same stochastic, random properties as white noise or pink noise. A children's version of the Hagerman sentences test has been developed (Hagerman and Richardson, 2009) using three-word sentences with the syntactic structure numeral–adjective–object (e.g., "three beautiful gloves"). In this version, the slightly modulated noise is set to 65 dB SPL to be more comfortable for children and the threshold has been set to 68% correct words, i.e., two out of three words should be correctly repeated (Hagerman and Richardson, 2009). The ambition with the present setting was to find out if there was a differential effect of noise on the groups as such and not to specify effects at different levels. Using more than two noise levels would have given us more information, but we chose to keep the Hagerman version as close to the original test as possible. Developing and using a new non-validated test without norm data was not an alternative. Moreover, it would have prolonged the test considerably and put too much strain on the participants at the cost of reliability.

### TABLE 1 | Participant characteristics: cognitive test scores and ratings.

*Group comparison Welch's F.* ∗*p < 0.05,* ∗∗*p < 0.01,* ∗∗∗*p < 0.001.*

*N* = *number of participants in each group and number of participants rated by parents and teachers.*

*ss* = *scaled scores 1–19. Rating scale: scores 1–7.*

In the present study the test was presented binaurally with headphones. The equipment for the speech recognition test consisted of a laptop, Sennheiser HDA 280 headphones, and an external Behringer UCA 222 audio card, with the software calibrated at the department of Technical Audiology at Karolinska Institute in Stockholm. Thresholds in quiet (no noise) for the minimum audible level in dB SPL were tested in order to compare this condition with the noise condition at a well audible level. In addition, both speech and noise were presented simultaneously in both ears, in order to get a more natural hearing situation. The computer-based adaptive method adjusted the speech level after each sentence depending on how many words were repeated correctly. In the first condition (noise), the sentences (i.e., signal) were presented at suprathreshold level at 85 dB SPL and then attenuated in slightly modulated noise at 65 dB SPL, in order to identify the threshold for correct recognition, using a criterion of 67% words, i.e., two out of three words correctly recalled. In the second condition (no-noise), the sentences were presented at 50 dB SPL and then attenuated until the minimum audible level using the same criterion as above. The test comprised in all 12 lists with 10 sentences in each list. From these 12 lists three randomized lists were chosen for each participant in each condition, one list for practicing and two for the actual test. Each participant was exposed to 30 sentences in each condition (i.e., three lists), in total 60 sentences. In some cases one, two and in the odd case three sentences occurred twice during a condition. Exposure to repeated sentences was very similar in the two groups, with slightly more repetitions in the ADHD-symptom group and in the no-noise condition.
For the assessment of thresholds, the training list was used to make participants familiar with the task and the test situation and to set a suitable speech signal level to start the first test list from. The two test lists were built on a computer-based adaptive method that adjusted the speech level after each sentence depending on how many words the participant recognized correctly. If *no word* was recognized, the signal was increased by 2 dB, if only *one word* was recognized the speech signal was increased by 1 dB, if *two words* were recognized the signal remained at the same level and if all *three words* were recognized the signal was decreased by 1 dB. The speech level of the last sentence in the first test list decided the level at which to begin the second test list. The program calculated a mean of the 10 sentences for the last list.
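
The adaptive rule described above can be sketched as follows. This is an illustration and not the software used in the study: the function names and the response sequence are hypothetical, and we assume that the threshold is the mean of the levels at which the 10 sentences of the last list were presented.

```python
# Step size in dB as a function of correctly repeated words (0-3), as described in the text.
STEP_DB = {0: +2.0, 1: +1.0, 2: 0.0, 3: -1.0}


def run_list(start_level_db: float, words_correct_per_sentence: list[int]) -> list[float]:
    """Return the presentation level (dB SPL) of each sentence in one list."""
    level = start_level_db
    presented = []
    for words_correct in words_correct_per_sentence:
        presented.append(level)
        level += STEP_DB[words_correct]  # adjust the level for the next sentence
    return presented


def threshold_estimate(presented_levels: list[float]) -> float:
    """Mean presentation level over the list (assumed scoring rule)."""
    return sum(presented_levels) / len(presented_levels)


# Hypothetical responses for one 10-sentence list in the no-noise condition.
responses = [3, 3, 3, 2, 3, 1, 2, 3, 2, 2]
levels = run_list(50.0, responses)
print(levels)
print(round(threshold_estimate(levels), 1))
```

Because the level is unchanged when two of three words are repeated, lowered when all three are repeated, and raised otherwise, the track settles around the level at which about two out of three words are recognized, which matches the 67% criterion.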

### Design and Test Procedure

The design was a 2 × 2 mixed design. The within group manipulation was binaural speech recognition in two conditions, no-noise vs. noise. Threshold in the no-noise condition was set in dB SPL at minimum audible level (attenuated from 50 dB SPL). Speech recognition threshold in noise was set in dB SPL at well audible level (attenuated from 85 dB SPL) in noise at 65 dB SPL. The between group variable was children with ADHD symptoms vs. controls. The dependent variable was the binaural speech recognition threshold in dB SPL in the no-noise condition vs. the noise condition.

### Test Procedure

The testing was conducted at the child's school to minimize the drop-out rate and optimize participation, and participants were tested individually in a room during the school day by the same licensed psychologist. The test session began with the modified Hagerman test for children. The two speech recognition conditions (no noise vs. noise) were given in counterbalanced order and took ∼10 min to perform in all. The test session also included tests for participant characteristics (**Table 1**). They were carried out in the same succession, administered according to the manuals, and took ∼40 min to administer. After 15 min of testing a short break with juice and fruit followed. After taking part in the testing, the boys received a movie voucher.

### RESULTS

### Speech Recognition Thresholds

A 2 × 2 mixed analysis of variance (ANOVA) was conducted to assess speech recognition threshold with one between subjects factor, group (ADHD symptoms vs. TDC), and one within subjects factor, noise condition (no noise vs. 65 dB SPL modulated noise). A criterion of 67% words correct was used, i.e., two out of three words had to be correctly recalled for a correct response. We found a main effect of noise [*F*(1,44) = 6852.6, *p <* 0.0005] and an interaction between noise condition and group [*F*(1,44) = 6.52, *p* = 0.014, η<sup>2</sup> = 0.129]. This means that 65 dB noise affected the groups differently; in the no noise condition TDC children displayed a lower speech recognition threshold as compared to the ADHD symptom group. The overall difference between groups was significant [*F*(1,44) = 7.70, *p* = 0.008, η<sup>2</sup> = 0.149]. In the noise condition both groups had similar recognition thresholds. *Post hoc* testing with an independent samples *t*-test showed that the difference between groups in the no noise condition was significant [*t*(44) = 2.36, *p* = 0.030]: the TDC group perceived correctly at 27.6 dB while the ADHD symptom group needed 29.6 dB for correct performance. In the noisy condition this difference disappeared [57.7 vs. 58.0 dB; *t*(44) = 0.97, *p* = 0.336], see **Figure 1**.

We conducted an alternate mixed ANOVA that included only the originally clinically diagnosed ADHD group of nine children and the TDC group of 31 control children. The data showed that the interaction between group and noise condition became stronger [*F*(1,38) = 11.79, *p* = 0.001, η<sup>2</sup> = 0.237], and a *t*-test showed that the mean difference was still significant [*t*(38) = 3.32, *p* = 0.002]. The ADHD children now required 30.4 dB for correct recall, and there was still no difference between groups in the noisy condition [57.7 vs. 57.9 dB; *t*(38) = 0.40, *p* = 0.691]. Only one participant had a threshold just below the level for inclusion (i.e., 37 dB SPL/15 dB HL). However, when this participant was excluded from the ADHD group, the difference between groups remained significant [*t*(37) = 2.58, *p* = 0.014].
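As a plausibility check on the effect sizes reported above, partial eta squared can be recovered from an *F* statistic and its degrees of freedom as η<sup>2</sup> = (*F* · df<sub>effect</sub>) / (*F* · df<sub>effect</sub> + df<sub>error</sub>). The sketch below assumes that the reported η<sup>2</sup> values are partial eta squared, which the text does not state explicitly; the function name is illustrative only.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values and degrees of freedom as reported in the text.
reported = {
    "interaction, full sample": (6.52, 1, 44),
    "group effect, full sample": (7.70, 1, 44),
    "interaction, diagnosed subgroup": (11.79, 1, 38),
}

for label, (f_value, df_effect, df_error) in reported.items():
    eta = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{label}: eta^2 = {eta:.3f}")
```

Under this assumption the computed values round to 0.129, 0.149, and 0.237, which agrees with the figures given in the text.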

The relationship between speech recognition threshold in silence and attention ability was further investigated using Pearson product-moment correlations, see **Figure 2**. The data showed a significant correlation between speech recognition threshold and attention ability as rated by teachers (*r*<sup>2</sup> = 0.385, *p* = 0.010) and by parents (*r*<sup>2</sup> = 0.342, *p* = 0.047). For hyperactivity, only the parents' ratings were significantly correlated with speech recognition threshold (*r*<sup>2</sup> = 0.437, *p* = 0.006), see **Table 2** for all values. However, there were no further correlations between cognitive ability and speech recognition thresholds, as measured by similarities (*r*<sup>2</sup> = 0.007, *p* = 0.963) and picture completion (*r*<sup>2</sup> = 0.102, *p* = 0.502).
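For readers unfamiliar with the statistic, the following is a minimal sketch of how a Pearson product-moment correlation between symptom ratings and thresholds can be computed. The ratings and thresholds below are invented for illustration and are not the study data; the function name is likewise hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical inattention ratings and no-noise thresholds (dB), NOT study data.
ratings = [2, 5, 3, 8, 6, 9]
thresholds_db = [27.0, 28.5, 27.5, 30.0, 29.0, 31.0]

r = pearson_r(ratings, thresholds_db)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

A positive *r* in such data would mean that higher symptom ratings go together with higher (poorer) thresholds, which is the direction of the association described in the text.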

# DISCUSSION

TABLE 2 | Bivariate correlation between speech recognition threshold, attention, and hyperactivity.

*Significant values in bold.*

The current study tested the hypothesis that children who differ in attention (ADHD symptoms vs. TDC children) will have different speech recognition thresholds, and that this difference could be diminished in noisy conditions, following the MBA model (Sikström and Söderlund, 2007). Firstly, the results corroborated the prediction of a difference in speech recognition thresholds, although the difference was small (just over 2 dB) and could possibly be within the margin of error, due to natural variation in sensitivity to sensory stimuli (Scharine et al., 2009). The correlation between hearing thresholds and the ratings of ADHD symptoms offers further arguments for the significance of the present finding. The results indicate that the difference is due to a real neurocognitive dimension rather than just a perceptual peripheral deficit or sensory fluctuation. The most important finding in the present study is the proposed link between attention ability and speech recognition. This is the first study, to the best of our knowledge, to show a link between perceptual speech thresholds and behavioral assessments of ADHD symptoms based on parent and teacher ratings. This means that different groups of individuals perceive auditory information differently and are furthermore differently affected by external noise. Of course this will need to be replicated in future studies. Secondly, and even more interesting, binaural noise exposure made these differences disappear; in the noisy condition both groups displayed almost exactly the same signal-to-noise ratio of ≈7 dB in order to achieve correct speech recognition at an audible level (≈58 dB).
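The signal-to-noise ratio mentioned above follows from the reported thresholds and the 65 dB noise level. The sketch below assumes the common convention SNR = speech level minus noise level; the source reports the value as ≈7 dB without stating a sign convention, and the function name is illustrative.

```python
NOISE_LEVEL_DB = 65.0  # noise level reported for the noise condition


def snr_db(speech_level_db, noise_level_db=NOISE_LEVEL_DB):
    """Signal-to-noise ratio in dB, taken as speech level minus noise level."""
    return speech_level_db - noise_level_db


# Group thresholds in the noise condition, as reported in the Results.
thresholds = {"TDC": 57.7, "ADHD symptoms": 58.0}

for group, level in thresholds.items():
    print(f"{group}: SNR = {snr_db(level):.1f} dB")
```

Under this convention both groups needed speech roughly 7 dB below the noise level (about −7.3 and −7.0 dB), which corresponds in magnitude to the ≈7 dB stated in the text.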

The existence of group differences in auditory perception between ADHD and TDC has been reported earlier in a small number of studies. Pillsbury et al. (1995) found no deficits in signal detection *per se* in the ADHD group, but found reduced processing efficiency for signal recognition, in particular in noisy environments. This is of great interest for the present study because it provides arguments for distinguishing between signal detection and signal recognition in ADHD when discussing results on auditory perception. The present results could also indicate deficits in the auditory pathways in ADHD; however, Central Auditory Processing Deficit (CAPD) is a complex and heterogeneous group of auditory-specific disorders, usually associated with a range of listening and learning deficits, including auditory discrimination. There is a large overlap between language processing disorders and ADHD, in particular in the inattentive subtype (ADD; Chermak et al., 2002). A clinical study estimated the overlap between CAPD and ADD to be as high as 50% (Riccio et al., 1994). This finding is supported by Oie et al. (2014), who found impairments only in the predominantly inattentive group and not in the ADHD combined group in an executive auditory control task (dichotic listening). This is in line with data from the present study, where inattention was found to be a stronger predictor of higher speech recognition thresholds than hyperactivity, as shown by the correlations in **Table 2**.

Working memory capacity can be a confounding variable when conducting auditory perception studies in ADHD samples. As shown in **Table 1**, the ADHD symptom group had poorer working memory performance than the TDC group, and this is a hallmark of ADHD in general (e.g., Alloway, 2011; Kasper et al., 2012). ADHD-related working memory deficits reflect a combination of impaired central executive and phonological storage/rehearsal processes and performance deterioration when stimulus set sizes are increased (Alderson et al., 2015). This means that a phonological processing deficit could just mirror a limited working memory capacity and not phonological storage as such. For this reason many auditory processing tests might be invalid because of difficulties in dissociating auditory processing disorder from language problems, attention problems, and working memory capacity (Katz and Tillery, 2005). With this in mind, the current study indicates that the Hagerman speech-in-noise test for children is a robust test that does not load on working memory. The difference in the no noise condition is thus likely to capture auditory processing differences rather than differences in working memory. At supra threshold level the difference between groups disappeared, despite inferior working memory performance in the inattentive children. From this it is tempting to draw the same conclusions about the threshold condition. There might be other processes that could mediate the auditory processes at this level that are not taken into account in the present study. However, the NEPSY sentence repetition test showed that the ADHD symptom group could repeat at least seven words, while the Hagerman test only requires three words to be repeated. The complicated interrelation between working memory capacity, attention, and speech recognition thresholds therefore needs to be further investigated.
Research on auditory brainstem responses nevertheless supports the view that ADHD patients are affected at the auditory processing level rather than the cognitive level. Two recent studies found that dysfunctions of the auditory brainstem pathways cause deficits in temporal encoding of both speech and non-speech stimuli that could explain speech-processing difficulties in ADHD (Claesdotter-Hybbinette et al., 2015; Jafari et al., 2015). Of note is that the study by Claesdotter-Hybbinette et al. (2015) was conducted on girls and the present study on boys, which provides an argument that the present findings may generalize to girls as well.

The effects of auditory noise can be both positive, e.g., lowering hearing thresholds (Zeng et al., 2000), and negative, but in fact mostly the latter, in particular in demanding cognitive tasks (Sörqvist, 2010). In the present study we focus on positive effects of noise, referring to the effect of SR, where noise under certain well-defined conditions can be beneficial for performance, in particular in nervous systems that are not working at their optimum (McDonnell and Ward, 2011). We found a noise benefit in ADHD and, in line with this finding, Pereira et al. (2012) found that the ability to suppress contralateral noise was equal in an ADHD group and a control group. Further support was given by Behne et al. (2005), who showed that presenting noise and signal to the same ear, in particular to the right auditory cortex, would lead to greater brain activation, thus possibly making noise an advantage instead of an obstacle in processing complex auditory signals like speech. On the other hand, contradictory results were reported by Abdo et al. (2010), where a group of normal-hearing ADHD children performed worse than controls on both a digit dichotic listening task and a speech-in-noise task. However, the kind of noise used during task performance in that study is not described, and of note is that the tasks in Abdo et al.'s (2010) study put high demands on both auditory processing and working memory, not just auditory perception as in the present study. The type of noise also plays a pivotal role, e.g., whether the noise is meaningless as in the present study (white-noise-like), or meaningful, such as speech noise. For example, Hawley et al. (2004) found that binaural meaningless noise did not interfere with performance whereas meaningful (speech babble) monaural noise did.
In a study by Söderlund and Sikström (2012), cafeteria noise, i.e., speech noise, was used, and the results showed exactly the same effect of cafeteria noise as of white noise, that is, a noise benefit for the inattentive group. Thus, comparing results from different studies is problematic when the type of noise is not properly described, although it seems to play a pivotal role for the results.

Additionally, to yield a noise benefit it seems that the noise should be presented binaurally. In auditory perception experiments, different kinds of task-irrelevant noise are frequently used and can be presented either monaurally or binaurally. For example, dichotic listening (DL, binaural) tasks show that ADHD patients have a reduced left hemisphere specialization, i.e., a larger right hemisphere contribution, which leads to impaired word processing among ADHD patients, as word processing is normally dedicated to the left hemisphere (Hale et al., 2006). Further support is given by Dramsdahl et al. (2011), where the ADHD patients failed to perceive syllables in the forced left ear condition in dichotic listening tasks, as the forced left condition depends on activity in the right hemisphere. Of note is that the ADHD and TDC groups performed equally well in the non-forced and forced right ear conditions, which are linked to the ability to just perceive the syllables and not to top-down directed cognitive control. This provides further evidence that there is a distinction between the detection of a target and the perception or recognition of targets like word stimuli. Age and developmental factors can also play a role in speech perception, with younger children displaying a larger susceptibility to noise than older children (Talarico et al., 2007). From this we conclude that for a noise benefit to occur in speech recognition tasks the noise has to be binaural, meaningless, random, and within a moderate loudness range (65–80 dB) to provide opportunity for supra threshold SR to occur (McDonnell et al., 2007). Evidence of the need to take contextual factors into account is provided by Michalek et al. (2014), who propose that factors such as diagnosis, modality, and signal-to-noise ratio all have a main effect on a person's ability to process speech in noise.
To sum up, auditory processing of speech is influenced by both internal factors (e.g., attention, age, working memory, brain stem response) and external factors (e.g., noise type, bi- or monaural presentation, visual information).

Stimulant medication in ADHD seems to have a robust effect on ADHD behaviors (Antshel et al., 2014) and on cognitive task performance (Murray et al., 2011), but its effect is less obvious when it comes to school performance (Hellwig-Brida et al., 2011; Wigal et al., 2012; Prasad et al., 2013).

Moreover, stimulant medication has also been found to reduce susceptibility to auditory noise (Tillery et al., 2000; Freyaldenhoven et al., 2005). On the other hand, no effects of medication were found on any of three central auditory processing measures (Tillery et al., 2000). This may be regarded as good news for noise, as noise benefits can be seen in domains where medication has little or no effect, indicating that the working mechanisms of white noise and stimulant medication differ.

The working mechanisms of noise benefits are not yet known, but apart from SR, auditory masking is a good candidate in speech recognition: when the masker differs from the signal, the noise can facilitate signal detection (Durlach et al., 2003). Furthermore, masking has been shown to have an effect on impulsivity (Gray et al., 2002) and also in other modalities such as vision (Dawes et al., 2009) and the tactile sense (Tan et al., 2003). In both SR and masking tasks, irrelevant or meaningless stimulation in different modalities increases the signal-to-noise ratio and thus improves performance in various sensory or cognitive tasks. Yet another explanation to consider is that, instead of inducing SR, white noise increases arousal in participants. Such an explanation is consistent with state regulation models of ADHD (Sonuga-Barke et al., 2010), derived from cognitive energetic theories (Sergeant, 2005). These models posit that children with attention problems have difficulties in modulating their levels of arousal and activity in order to adjust to changing circumstances in the environment, particularly during boring tasks like speech recognition.

In future studies a sub-threshold noise condition should be added to determine the threshold speech signal. In this setting it is possible to investigate whether the relative benefit of noise for inattentive persons is apparent in threshold SR as well. Wong et al. (2008) found that in young adults, when listening to speech in low noise (20 dB below the speech signal), crucial networks in the auditory cortex and frontal areas were activated. One hypothesis, if speech processing deficits in ADHD are evident, could be that individuals with ADHD have dysfunctional neural pathways before the superior temporal gyrus and thus display difficulties in detecting signals at minimum or low audible levels. If this holds, external noise might induce increased network activation, involving more neuronal structures, thus producing a higher level of internal noise in the brain, in line with predictions from the MBA model (Sikström and Söderlund, 2007).

### Limitations

This study should be regarded as a pilot insofar as no firm conclusions can be drawn from it because of the small number of participants; there were only 15 in the clinical ADHD symptom group. Our findings have to be corroborated in a follow-up study. Nor do we know whether these findings are valid for girls, since only boys participated.

Importantly, further studies should include testing in a soundproof setting rather than in a school setting, which involves a lot of ambient noise. Tone audiometry thresholds as well as speech recognition thresholds should be measured monaurally in the lab to further evaluate binaural speech recognition thresholds. Although all participants had hearing within normal range, the difference between the groups could be due to subtle differences in peripheral transmission, e.g., in the middle ear or in the cochlea. However, the correlation between hearing thresholds and the ratings of ADHD symptoms speaks against this, implying that the difference is due to a neurocognitive dimension rather than a perceptual peripheral deficit.

# AUTHOR CONTRIBUTIONS

The authors share first authorship. Both authors contributed to the outlining, design, and planning of the study, and both contributed significantly to the writing of the manuscript. ENJ had the main responsibility for data collection and the test battery. GS was responsible for the statistical assessment, figures, tables, and the outlining of the discussion. Both authors collaborated throughout the revision process and on the final version of the discussion.

# ACKNOWLEDGMENTS

We would like to thank Björn Hagerman and Åke Olofsson, Department of Technical Audiology at Karolinska Institute, Sweden, for developing the particular version of the speech-in-noise test that was used in our study.

### REFERENCES

Alderson, R. M., Kasper, L. J., Patros, C. H. G., Hudec, K. L., Tarle, S. J., and Lea, S. E. (2015). Working memory deficits in boys with attention deficit/hyperactivity disorder (ADHD): an examination of orthographic coding and episodic buffer processes. *Child Neuropsychol.* 21, 509–530. doi: 10.1080/09297049.2014.917618

Alloway, T. P. (2011). A comparison of working memory profiles in children with ADHD and DCD. *Child Neuropsychol.* 17, 483–494. doi: 10.1080/09297049.2011.553590


attention deficit hyperactivity disorder and in their siblings. *J. Abnorm. Psychol.* 102, 616–623. doi: 10.1037/0021-843X.102.4.616


stimuli. *J. Acoust. Soc. Am.* 114(6 Pt 1), 3295–3308. doi: 10.1121/1.1623788


methylphenidate on older children with attention-deficit/hyperactivity disorder. *J. Child Adolesc. Psychopharmacol.* 21, 121–131. doi: 10.1089/cap.2010.0047


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer Staffan Hygge and the handling editor declare their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

*Copyright © 2016 Söderlund and Jobs. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Theory-of-mind in individuals with Alström syndrome is related to executive functions, and verbal ability

*Hans-Erik Frölander<sup>1,2,3,4,5</sup>\*, Claes Möller<sup>1,2,3,4,6</sup>, Mary Rudner<sup>3,4,7</sup>, Sushmit Mishra<sup>8</sup>, Jan D. Marshall<sup>9,10</sup>, Heather Piacentini<sup>10</sup> and Björn Lyxell<sup>3,4,7</sup>*

*<sup>1</sup> Health Academy, School of Health and Medical Sciences, Örebro University, Örebro, Sweden, <sup>2</sup> Audiological Research Centre, Örebro University Hospital, Örebro, Sweden, <sup>3</sup> Swedish Institute for Disability Research, Linköping, Sweden, <sup>4</sup> Linnaeus Centre HEAD, Linköping, Sweden, <sup>5</sup> Research on Hearing and Deafness (HEAD) graduate School, Linköping, Sweden, <sup>6</sup> Department of Audiology, Örebro University Hospital, Örebro, Sweden, <sup>7</sup> Department of Behavioral Sciences and Learning, Linköping University, Linköping, Sweden, <sup>8</sup> Institute of Health Sciences, Utkal University, Bhubaneswar, India, <sup>9</sup> Jackson Laboratory, Bar Harbor, ME, USA, <sup>10</sup> Alstrom Syndrome International, Mount Desert, ME, USA*

### *Edited by:*

*Patrik Sörqvist, University of Gävle, Sweden*

### *Reviewed by:*

*Martin Meyer, University of Zurich, Switzerland; K. Jonas Brännström, Lund University, Sweden*

### *\*Correspondence:*

*Hans-Erik Frölander, Health Academy, School of Health and Medical Sciences, Örebro University, Fakultetsgatan 1, 701 12 Örebro, Sweden; hans-erik.frolander@spsm.se*

### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

*Received: 05 May 2015 Accepted: 07 September 2015 Published: 23 September 2015*

### *Citation:*

*Frölander H-E, Möller C, Rudner M, Mishra S, Marshall JD, Piacentini H and Lyxell B (2015) Theory-of-mind in individuals with Alström syndrome is related to executive functions, and verbal ability. Front. Psychol. 6:1426. doi: 10.3389/fpsyg.2015.01426*

Objective: This study focuses on cognitive prerequisites for the development of theory-of-mind (ToM), the ability to impute mental states to self and others, in young adults with Alström syndrome (AS). AS is a rare and quite recently described recessively inherited ciliopathic disorder which causes progressive sensorineural hearing loss and juvenile blindness, as well as many other organ dysfunctions. Two cognitive abilities were considered: phonological working memory (WM) and executive functions (EF), both of importance in speech development.

Methods: Ten individuals (18–37 years) diagnosed with AS, and 20 individuals with no known impairment matched for age, gender, and educational level participated. Sensory functions were measured. Information about motor functions and communicative skills was obtained from responses to a questionnaire. ToM was assessed using Happé's Strange Stories, verbal ability by a vocabulary test, phonological WM by means of an auditorily presented non-word serial recall task, and EF by tests of updating and inhibition.

Results: The AS group performed at a significantly lower level than the control group in both the ToM task and the EF tasks. A significant correlation was observed between recall of non-words and EF in the AS group. Updating, but not inhibition, correlated significantly with verbal ability, whereas both updating and inhibition were significantly related to the ability to initiate and sustain communication. Poorer performance in the ToM and EF tasks was related to language perseverance and motor mannerisms.

Conclusion: The AS group displayed a delayed ToM as well as reduced phonological WM, EF, and verbal ability. A significant association between ToM and EF suggests a compensatory role of EF. This association may reflect the importance of EF for perceiving and processing input from the social environment when social interaction is challenged by dual sensory loss. We argue that limitations in EF capacity in individuals with AS may, to some extent, be related to early blindness and progressive hearing loss, but possibly also to gene-specific abnormalities.

Keywords: Alström syndrome (AS), ciliopathy, deafblindness, theory-of-mind, verbal ability, executive functions

# Introduction

The present study focuses on cognitive prerequisites for the development of theory-of-mind (ToM) in adolescents and young adults with Alström syndrome (AS). ToM refers to the ability to impute mental states to self and to others (Premack and Woodruff, 1978), which is important for establishing social relations (Hughes and Leekam, 2004). A significant step in the development of this ability occurs during the preschool years around the age of four, when children normally understand that another person may hold a belief different from their own (Wellman et al., 2001). Cognitive skills such as the ability to process information and the ability to control one's own thoughts and actions are important for the development of ToM (Wellman and Woolley, 1990; Perner and Lang, 1999; Sabbagh et al., 2010). Deficiency in ToM is one of the core traits of autism spectrum disorders (ASD; Baron-Cohen, 1989), but has also been observed in populations with other syndromes, including: Down syndrome (Zelazo et al., 1996); Fragile X syndrome (Belmonte and Bourgeron, 2006); Williams syndrome (Cornish et al., 2005) and CHARGE syndrome (Hartshorne et al., 2005). Clinical observations suggest that individuals with AS have a varying degree of ToM, ranging from normal to levels typical of individuals with high functioning autism (Frölander et al., 2014).

Alström syndrome is an autosomal recessive syndrome within the Ciliopathy Spectrum. AS is rare, but the number of individuals identified with this syndrome is rapidly increasing (900, 2015 ISA). It is multi-systemic with a high prevalence of additional diseases (Marshall et al., 2007a, 2011). AS causes progressive dual sensory loss, i.e., deafblindness (Möller, 2007). Sensorineural hearing loss progresses slowly in the first decade, usually reaching moderate or severe loss in the following decades. Age of onset, however, varies from infancy to adulthood. The high prevalence of otitis media in this group causes additional hearing loss. Cone rod retinal dystrophy leads to Retinitis Pigmentosa (RP) and juvenile blindness. Age of onset differs, but the onset of visual dysfunction is earlier than that of hearing loss, and is typically established within weeks after birth (Marshall et al., 2007b). Vision usually deteriorates to blindness in late adolescence (Marshall et al., 2011), and the visual loss is, in contrast to hearing loss, already significant during the important stage of ToM development around the age of four. Motor milestones are often delayed in AS. Deficits in coordination, balance and fine motor skills have been observed, as well as early language delays and atypical behavior (Marshall et al., 2007a). The first and the second author have vast clinical experience of people with different deafblind syndromes, and the clinical findings in persons with AS include lack of inhibition and, in childhood, extreme stubbornness and excessive eating (Frölander and Möller, 2015, Personal communication). Early hearing loss is associated with delayed development of ToM for children in hearing families who use speech communication, irrespective of the use of technical aids including hearing aids and cochlear implants (CI; Peterson, 2004).
The importance of access to sound for the development of ToM in children who rely on speech communication has recently been shown in a study in which better ToM was demonstrated in congenitally deaf children who received CIs at an average of 18 months compared to those who received their implants at an average of 41 months (Sundqvist et al., 2014). It has thus been proposed that for children in hearing families, early hearing loss leads to impoverished social interaction, delaying the development of ToM. However, neither degree of hearing loss nor age at onset was found to be associated with ToM development in a population of individuals with AS. Access to sound would be expected to promote social interaction and thus development of ToM also in individuals with AS, but the generally slow progress of hearing loss might explain why no relationship between ToM and onset of hearing loss was observed in our previous study (Frölander et al., 2014). Studies of children with congenital blindness have demonstrated that a significant visual loss may also cause a delayed development of ToM (Minter et al., 1998; Roch-Levecq, 2006). In a previous study, age at onset of visual loss in AS was correlated with ToM, probably reflecting a loss of vision that is demonstrated within a few weeks from birth (Frölander et al., 2014). Such rapid vision loss is already evident by the sensitive age of four causing a lack of social and communicative stimuli with a negative impact on ToM development.

Performance on ToM tasks in typical individuals is related to working memory (WM). During ToM tasks information has to be kept in mind while determining states of mind. This loads on WM (Davis and Pratt, 1995; Hughes, 1998a,b; Keenan, 1998; Keenan et al., 1998). WM is an essential component in more complex cognitive activities such as communication (Baddeley, 2012). Previous results show that the AS group performs at a lower level in WM tasks compared to non-disabled controls. However, performance on WM tasks and ToM tasks was not significantly correlated in either group (Frölander et al., 2014). One reason for this might be that WM capacity beyond a critical level does not contribute to an enhanced performance (Slade and Ruffman, 2005).

Executive functions (EF) control and regulate thought and action (Espy et al., 2004; Burgess and Simons, 2005). EF include updating of new information, inhibition of irrelevant information and shifting of focus between different sources of information (Miyake et al., 2000; Lehto et al., 2003). EF is closely associated with ToM in non-disabled populations (Hughes, 1998a; Perner and Lang, 1999; Mitchell and Riggs, 2000; Sabbagh et al., 2010; Zelazo and Carlson, 2012) as well as in disabled populations, including: ASD (Ozonoff and Jensen, 1999; Joseph and Tager-Flusberg, 2004); cerebral palsy (Li et al., 2007); frontal lobe damage (Rowe et al., 2001) and amygdala damage (Fine et al., 2001). Specifically, the role of inhibitory control has been stressed in the emergence and expression of ToM (Carlson and Moses, 2001; Carlson et al., 2002; Leslie et al., 2004). No previous study has examined the role of EF in ToM development in a population of individuals with AS.

Theory-of-mind is closely related to verbal ability irrespective of level of functioning (Slade and Ruffman, 2005), and this applies to individuals with AS (Frölander et al., 2014). Receptive language development is in general delayed in individuals with AS (Marshall et al., 2007b). EF is related to verbal ability in non-disabled populations (Carlson et al., 1998) as well as in individuals with ASD (Landa and Goldberg, 2005). EF is also important for communicative skills, such as ability to respond to conversational changes, in disabled as well as in non-disabled populations (Bishop and Adams, 1989; Hughes, 1998a; Ylvisaker and DeBonis, 2000).

In the present study we focus on how phonological WM capacity, executive functioning, verbal ability and communicative skills relate to development of ToM in a population of adolescents and young adults with AS, compared to a group of individuals with typical development matched for age and educational level. In addition, we examine how characteristics of sensory loss and motor deficits are related to development of ToM in AS.

We predict that verbal ability will be of particular importance for ToM performance in AS, as this relation has been displayed in a previous study, but also that EFs such as the ability to initiate and sustain communication will relate to ToM performance. We further assume that cognitive skills predict ToM performance in individuals with AS, in contrast to individuals with normal hearing and vision, by underpinning the ability to engage in communication.

### Materials and Methods

### Participants

Ten young adults (six females) with AS, with a mean age of 28.30 years (SD = 6.08), participated in the study. It should be noted that seven of the ten individuals were the same as those reported in Frölander et al. (2014). Background information was obtained from medical records and from responses to the Alström Syndrome International (ASI) questionnaire by the participants and their families (Alström Syndrome International, 2010).

### Hearing

All subjects showed a bilateral symmetrical moderate to profound sensorineural hearing loss. The onset of hearing loss was either congenital or in early childhood. The hearing loss was progressive in all subjects but with a variable rate. At the time of testing, all subjects were fitted with bilateral hearing aids. Hearing impairment (HI) was assessed by pure tone audiometry, with calculation of the average pure tone threshold for the best ear at frequencies 0.5, 1, 2, and 4 kHz (PTA4). The audiograms were performed with standard equipment. The hearing tests used were audiometry performed in standardized settings within 6 months and/or audiometry performed at the time of testing with earphones (Kuduvawe Ltd).
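The PTA4 measure described above is a simple average over four frequencies. A minimal sketch follows, assuming thresholds are expressed in dB HL and that the "best ear" is the ear with the lower average threshold; the audiogram values are hypothetical and the function names are illustrative.

```python
PTA4_FREQUENCIES_KHZ = (0.5, 1, 2, 4)


def pta4(audiogram):
    """Mean pure tone threshold over 0.5, 1, 2, and 4 kHz."""
    total = sum(audiogram[freq] for freq in PTA4_FREQUENCIES_KHZ)
    return total / len(PTA4_FREQUENCIES_KHZ)


def best_ear_pta4(left, right):
    """PTA4 of the better ear, taken here as the ear with the lower average."""
    return min(pta4(left), pta4(right))


# Hypothetical thresholds (assumed dB HL), keyed by frequency in kHz; NOT study data.
left_ear = {0.5: 55, 1: 60, 2: 70, 4: 75}
right_ear = {0.5: 50, 1: 60, 2: 65, 4: 75}

print(best_ear_pta4(left_ear, right_ear))
```

With these made-up audiograms the right ear has the lower average, so it would be the ear used for the PTA4 value.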

### Vision

Visual acuity was measured using Snellen chart-based standard tests and visual field was measured using Goldmann perimetry, resulting in categorization into five phenotypes: 1 = normal, 2 = presence of a partial or complete ring scotoma, the latter either extending or not extending into periphery, 3 = concentric central field loss with a remaining peripheral island less than one half of the field circumference, 5 = no visual field, blindness (Grover et al., 1997). All subjects had been legally blind since childhood, and 7/10 had no residual vision at all (**Table 1**).

Data about vestibular functioning were retrieved from medical records. Information about the occurrence of repetitive mannerisms (i.e., hand or finger flapping or twisting, or complex whole-body movements) and difficulties with balance was obtained from responses to a questionnaire (Alström Syndrome International, 2010). The information was categorized in five levels according to reported frequency of occurrence, from "never" to "almost always."

A control group of 20 non-disabled individuals (eight males) with normal hearing and vision was chosen to match the AS group, with no differences in gender, age, or educational level (defined by years of schooling).

As a previous study revealed substantial variance in ToM performance in the AS group, it was planned to divide the group into two subgroups, enabling more specific comparisons with the control group regarding the number of correct mental inferences produced (Frölander et al., 2014). It was thus decided that individuals with AS producing correct mental inferences within 1 SD of the mean of the control group (*n* = 3) would be compared separately with the control group, to determine whether their performance in other domains resembled that of the control group. These individuals are referred to as the group of AS individuals displaying a better mental state understanding, equal to the control group. The remainder of the group of individuals with AS (*n* = 7) was also compared with the control group, to determine how their performance differed in other domains. This group is referred to as the group of AS individuals displaying a poorer mental state understanding than the control group. There was no difference between the two subgroups of AS individuals in gender frequency, age, or educational level. In addition, no difference in hearing, visual acuity, or visual field was found between the better and the poorer ToM-performing groups.

TABLE 1 | Participant characteristics (mean ± SD) for non-disabled individuals (*n* = 20), individuals with Alström syndrome (AS) (*n* = 10), individuals with AS and better Theory-of-mind (ToM) (*n* = 3), and individuals with AS and poorer ToM (*n* = 7).

∗∗*Difference is significant at p < 0.01.*
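The subgroup rule (AS participants whose number of correct mental inferences falls within 1 SD of the control mean form the "better ToM" group) can be sketched as follows. The scores are hypothetical and chosen only for illustration. The use of the sample standard deviation, and the reading of "within 1 SD" as an absolute distance from the control mean, are assumptions not stated in the text.

```python
from statistics import mean, stdev


def split_by_control_sd(as_scores, control_scores):
    """Split AS scores into those within 1 SD of the control mean and the rest."""
    control_mean = mean(control_scores)
    control_sd = stdev(control_scores)  # sample SD: an assumption
    better = [s for s in as_scores if abs(s - control_mean) <= control_sd]
    poorer = [s for s in as_scores if abs(s - control_mean) > control_sd]
    return better, poorer


# Hypothetical ToM scores (0-16 scale), not the study data.
control = [12, 14, 16, 15, 13, 16, 14, 12]
as_group = [13, 15, 14, 9, 7, 10, 6, 11, 8, 5]

better, poorer = split_by_control_sd(as_group, control)
print(len(better), len(poorer))  # 3 7
```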

The study was approved by the regional ethics review board of Uppsala-Örebro, Sweden, and the Institutional Review Board of The Jackson Laboratory, USA. Informed consent was obtained from all participants.

### Measures and Scoring

### Advanced ToM

Theory-of-mind was measured by a collection of stories from Happe's (1994) advanced test of ToM – the strange stories (Fletcher et al., 1995; Happe et al., 1996; Jolliffe and Baron-Cohen, 1999; White et al., 2009). The stimuli presented involved eight stories including two examples of each of double bluff, persuasion, white lies, and misunderstanding.

The stories were read aloud to the participants. The participants were offered the opportunity to read the stories in Braille or have one more verbal presentation.

A "why question" concerning the ability to understand mental states was asked in connection with each of the stories (for example: Peter thinks his aunt looks silly in her new hat, but when she asks if he likes it, he answers that it is very nice. Why?). The answer could be scored as correct or incorrect. The answer to the question could furthermore be either physical or mental. Physical state answers were associated with consequences, for example "to get something." Mental state answers referred to thoughts, feelings, desires, traits and dispositions, including terms such as "think," "know," "like," and "lie." A score of 2 was given if participants referred to exact mental states. A score of 1 was given for correct mental state answers in more general terms, or for a correct physical state answer. A score of 0 was given if the answer was unrelated. The score for the ToM stories ranged from 0 to 16 (Happe, 1994). Inter-rater reliability between one of the authors and an independent rater was 92%.
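The scoring scheme above (eight stories, 0 to 2 points each, total 0 to 16) can be illustrated with a short sketch. The category labels and the example ratings are hypothetical names introduced here for illustration; they do not come from the test materials.

```python
# Hypothetical category labels mapped to the points described in the text.
POINTS = {
    "exact_mental": 2,      # answer refers to the exact mental state
    "general_mental": 1,    # correct mental state answer in more general terms
    "correct_physical": 1,  # correct physical state answer
    "unrelated": 0,         # unrelated answer
}

NUMBER_OF_STORIES = 8


def tom_total(ratings):
    """Sum the points over the eight stories; the total ranges from 0 to 16."""
    if len(ratings) != NUMBER_OF_STORIES:
        raise ValueError("expected one rating per story")
    return sum(POINTS[r] for r in ratings)


# Hypothetical ratings for one participant.
ratings = [
    "exact_mental", "exact_mental", "general_mental", "correct_physical",
    "unrelated", "exact_mental", "general_mental", "exact_mental",
]
print(tom_total(ratings))  # 11
```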

### Phonological Working Memory

Phonological WM was assessed using Serial Recall of Non-words (Gathercole and Pickering, 2000), taken from a computer-based test battery, the Sound Information Processing System (i.e., the SIPS, Lyxell et al., 2009). Non-word series (for example: "med, tas, rog") of increasing length, 3–5 non-words, were used as stimuli, in total 6 series. The prerecorded auditory signals were presented on a laptop, from the computer platform, in order to keep the time of the presentation of stimuli constant. The built-in loudspeakers of the laptop computer were used for presentation. The participants were asked to repeat the non-words after each series of non-words. The answers were recorded as a basis for later transcription and scoring of accuracy according to three different criteria: (1) the percentage of correctly reproduced non-words; (2) the percentage of correctly reproduced consonants; (3) the percentage of non-words pronounced with the correct vowel and total number of syllables.

### Executive Functions

The EFs of updating and inhibition were measured using a shortened and adapted version of the Cognitive Spare Capacity Test (CSCT; Mishra et al., 2013b). The prerecorded auditory signals were presented from a DMDX platform (Forster and Forster, 2003), designed to precisely time the presentation of stimuli, sampled at a rate of 44100 Hz with 16-bit resolution. The ability to perceive stimuli was controlled for by a task in which participants were asked to repeat 13 two-digit numbers directly after the presentation of each number. Prerecorded lists of 13 spoken two-digit numbers were then presented. After each list, the participants were asked to report particular list items, depending on the condition, updating or inhibition. Audio responses were recorded on an external tape recorder, as a basis for later scoring.

In the inhibition condition, participants were presented with a list of items spoken by a male or a female voice and asked to report, at the end of the list, the two odd-valued items spoken by the male voice. In the updating condition, participants were asked to report the first item in the list as well as the two highest numbers in the list. There were three lists per condition and these were blocked. Before each block, the participants were given oral instructions, including whether to remember two numbers (inhibition) or three numbers (updating). In the updating condition the first item was, however, not counted, as reporting it mainly serves as an established way to increase memory load (Mishra et al., 2013a).

The EF score ranged from 0 to 2: for each of the six lists, one point was given for each correctly recalled number out of the two possible within a list.
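A minimal sketch of how a single list could be scored in each condition is given below. The list contents, voice assignment, and function names are hypothetical and serve only to illustrate the rule of one point per correctly reported target, with two targets per list. The sketch also assumes that the first list item in the updating condition is not itself one of the two highest numbers.

```python
def updating_targets(numbers):
    """Scored targets in the updating condition: the two highest numbers."""
    return set(sorted(numbers)[-2:])


def inhibition_targets(items):
    """Scored targets in the inhibition condition: odd numbers in the male voice."""
    return {n for n, voice in items if voice == "male" and n % 2 == 1}


def score_list(targets, reported):
    """One point per correctly reported target (0 to 2 per list)."""
    return len(targets & set(reported))


# Hypothetical list of 13 two-digit numbers with a hypothetical voice assignment.
numbers = [34, 71, 12, 58, 93, 27, 45, 66, 80, 19, 52, 88, 41]
voices = ["female", "male", "male", "female", "female", "male", "female",
          "male", "female", "female", "male", "female", "female"]
items = list(zip(numbers, voices))

# Updating: the first item (34) is reported but earns no point.
print(score_list(updating_targets(numbers), [34, 93, 88]))  # 2
# Inhibition: only one of the two odd male-voice items (71, 27) is reported.
print(score_list(inhibition_targets(items), [71, 93]))  # 1
```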

### Verbal Ability and Communicative Skill

Verbal ability was evaluated by the vocabulary subtest in the Wechsler Adult Intelligence Scale (Wechsler, 1997). Participants were asked to define the meaning of words. Responses to items 1–33 were scored on the basis of the standardized scoring principles. A score of 2 was given for responses that demonstrated a good understanding of the word; a score of 1 was given for less elaborate correct responses and a score of 0 was given if the response was obviously wrong. The total score was computed by combining the scores for all items. The possible score was within the range 0–66.

Information about individual ability to use language and communicate was obtained from responses to the Alström Syndrome International questionnaire (Alström Syndrome International, 2010). The information was categorized in five levels according to reported frequency of occurrence, from "never" to "almost always." Responses concerning the ability to initiate and sustain communication, the ability to pay attention to the topic, the occurrence of repetitive use of language or words, and odd rhythm of speech, referring to a disorganization of temporal structures of the verbal stream (Zellner Keller and Keller, 2001), were specifically addressed.

### Procedure

Testing took place in a quiet room with normal acoustics. Hearing aids were adjusted and all participants used their fitted hearing aids. Audibility was ensured by adjusting hearing aid amplification and/or in some cases by the use of FM systems, where the participants chose the mode of amplification. Audibility was controlled for by a task where participants were to repeat numbers directly after the presentation of each number (Mishra et al., 2013b). The presentation level was adjusted to be comfortable for each individual participant and was held constant throughout testing, but the level was not verified. The tests of ToM, phonological WM, EF, and verbal ability were presented in the order mentioned. Assessments of hearing loss, measurements of visual acuity and visual field, and collection of questionnaire data were conducted separately. All testing in the study conformed to regulatory standards.

### Data Analyses

Statistical Package for Social Sciences (SPSS) was used for statistical analyses. Independent *t*-tests were performed to examine group differences. Non-parametric Mann–Whitney tests were used to examine differences between the two AS subgroups and between the subgroups and the non-disabled control group. Spearman's rho non-parametric correlations were computed. A significance level of 0.05 was employed.
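The two non-parametric statistics named above can be sketched with the standard library alone. This illustrates the textbook definitions (Spearman's rho as the Pearson correlation of average ranks, and the Mann–Whitney U derived from rank sums); it is not a reproduction of the SPSS procedures used in the study, it computes no *p*-values, and the data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mann_whitney_u(a, b):
    """Return (U_a, U_b); the two statistics always sum to len(a) * len(b)."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


# Hypothetical data, for illustration only.
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
print(mann_whitney_u([1, 2, 3], [4, 5, 6, 7]))  # (0.0, 12.0)
```

Because the two U statistics sum to the product of the group sizes, a comparison of 20 controls with 7 AS participants has a maximum possible U of 140.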

### Results

The results are presented in three parts. First, performance in ToM, phonological WM, and EF is presented. Second, verbal ability and communicative skills are presented and related to EF and ToM. This is followed by a presentation of motor dysfunctions and sensory deficits in relation to EF and ToM.

### Theory-of-Mind Performance

A significant difference emerged between the AS group and the non-disabled control group with respect to the ability to produce correct mental inferences, where the AS group made significantly fewer correct mental state inferences than the control group (*t* = 5.062, df = 28, *p <* 0.0005).

### Phonological Working Memory Performance

A significant difference emerged between the AS group and the control group in phonological WM performance, where the AS group was outperformed by the non-disabled control group in the proportion of correctly serially recalled non-words (*t* = 9.180, df = 28, *p <* 0.0005), in the proportion of correctly recalled consonants (*t* = 11.323, df = 28, *p <* 0.0005) and in the proportion of correctly recalled vowels and suprasegmental accuracy (*t* = 14.869, df = 28, *p <* 0.0005).

A correlation was found in the AS group between audibility (ability to reproduce a two-digit number directly after the presentation) and both reproduction of non-words (ρ = 0.718, *n* = 10, *p* = 0.019) and reproduction of consonants (ρ = 0.634, *n* = 10, *p* = 0.037), but no correlation was found between audibility and reproduction of vowels with total number of syllables.

### Executive Functioning Performance

A significant difference emerged between the AS group and the control group in EF, where the AS group performed at significantly lower levels than the control group in both inhibition (*t* = 4.105, df = 28, *p <* 0.0005) and updating (*t* = 4.603, df = 28, *p <* 0.0005). Individuals with AS who displayed a mental state understanding equal to that of the control group (*n* = 3) also performed on a par with the control group in inhibition and updating, where two out of three performed within 1 SD of the control group and the third marginally below. In contrast, individuals with AS who displayed significantly poorer mental state understanding than the control group (*n* = 7) also showed a significantly poorer performance in inhibition (*U* = 132.00, *n* = 27, *p <* 0.0005) and in updating (*U* = 5.559, *n* = 27, *p* = 0.002; **Table 2**).

The degree of HI did not correlate with performance on the inhibition or updating tasks in the AS group. Furthermore, no significant difference in audibility was displayed between the AS group and the control group. Correlations were, however, found in the AS group between production of correct mental inferences and both inhibition (ρ = 0.778, *n* = 10, *p* = 0.008) and updating (ρ = 0.740, *n* = 10, *p* = 0.014). In contrast, no correlations were obtained in the control group between production of correct mental inferences and inhibition or updating. Reported difficulty in paying attention to the topic in a dialog was negatively related to the production of correct mental inferences in AS (ρ = 0.740, *n* = 10, *p* = 0.014).

### Verbal Ability and Communicative Skills

A difference in verbal ability emerged between the AS group and the control group, where the AS group performed at a significantly lower level (*t* = 3.552, df = 28, *p* = 0.001). A comparison between the group of AS individuals with a poorer mental state understanding and the control group displayed a significant difference in verbal ability, where the former group was outperformed (*U* = 123.00, *n* = 27, *p* = 0.003). In contrast, no difference in verbal ability was displayed between AS individuals with a mental state understanding equal to the control group and the control group: two out of three performed within 1 SD of the control group mean.

In the AS group, verbal ability correlated with correct mental inferences (ρ = 0.737, *n* = 10, *p* = 0.015) and updating (ρ = 0.850, *n* = 10, *p* = 0.002), but not with inhibition, whereas in the control group verbal ability correlated neither with correct mental inferences, nor with updating or inhibition. Further, verbal ability was not related to reproduction of non-words, either in the AS group or in the control group. Verbal ability was related negatively to literal interpretations (ρ = −0.840, *n* = 9, *p* = 0.005) and repetitive use of language (ρ = −0.797, *n* = 9, *p* = 0.010) in the AS group.

The ability to initiate and sustain a conversation also correlated with correct mental inferences in the AS group (ρ = 0.753, *n* = 9, *p* = 0.019), with updating (ρ = 0.804, *n* = 9, *p* = 0.009) and with inhibition (ρ = 0.743, *n* = 9, *p* = 0.022). Ability to initiate and sustain communication correlated negatively with an odd rhythm of speech (ρ = −0.990, *n* = 8, *p* = 0.0005) in the AS group.


TABLE 2 | Percentage of correct mental inferences, serial recall of non-words, correct consonants and correct vowels respectively, inhibition and updating (mean ± SD) for non-disabled individuals (*n* = 20), individuals with AS (*n* = 10), individuals with AS and better ToM (*n* = 3), and individuals with AS and poorer ToM (*n* = 7).

∗∗*Difference is significant at p < 0.01.*

### Motor Dysfunctions and Sensory Deficits

In the AS group, repetitive motor mannerisms as an infant or toddler, in contrast to any other motor dysfunction, correlated negatively with correct mental inferences (ρ = −0.784, *n* = 8, *p* = 0.021) and inhibition (ρ = −0.784, *n* = 8, *p* = 0.021). Repetitive motor mannerisms in turn related to odd rhythm of speech (ρ = 0.986, *n* = 7, *p <* 0.0005).

There was no difference in the occurrence of motor mannerisms between AS individuals with and without vestibular deficiencies, and no relation within the AS group between degree of visual impairment and motor mannerisms.

### Discussion

The purpose of this study was to explore how cognitive skills that are important for communication can account for differences in ToM performance among individuals with AS.

The AS group was outperformed by the non-disabled control group in ToM, phonological WM, EF, and verbal ability. Within the AS group, ToM was significantly predicted by EF and verbal ability but not by phonological WM. PTA4 significantly predicted phonological WM but not EF, verbal ability or ToM. The ability to infer meaning from signals incompletely received due to sensory loss was related to updating capacity. The ability to initiate and sustain communication was related to both updating and inhibition of irrelevant information. A poor ability to initiate and sustain communication was related to a repetitive use of language and the occurrence of motor mannerisms. Individuals with AS who exhibited repetitive mannerisms were accordingly characterized by a poorer inhibitory capacity.

The relation between ToM and verbal ability in AS is in agreement with earlier results for individuals with AS (Frölander et al., 2014) and has also been observed in other disabled populations (Hughes and Leekam, 2004; Fisher et al., 2005), as well as in populations of non-disabled individuals (Carlson et al., 1998; Ruffman et al., 2002). The advanced ToM required to capture the socially loaded ToM stories (Happe, 1994) is dependent on verbal understanding (Onishi et al., 2007). In addition, focused attention and cognitive effort are required in advanced ToM processing (Fonagy and Luyten, 2009). These requirements are further increased by the challenge following a dual sensory loss (Dammeyer, 2010).

The progressive loss of hearing in AS primarily affects the high frequency range during the first decades (Marshall et al., 2007a) and is likely to hinder perception of consonants (Dubno et al., 2002; Pichora-Fuller and Souza, 2003). This may explain why the reproduction of consonants, but not of vowels that are of low frequency, correlated with audibility in the young adults with AS. Speech perception in ordinary communication improves with visual cues (Erber, 1969; Lidestam and Beskow, 2006; Zekveld et al., 2011), especially in challenging situations (Bernstein and Grant, 2009; Mishra et al., 2014). A progressive and early loss of vision in individuals with AS presumably negatively affects ToM development, as the visual loss is in most cases significant in a sensitive period for development, at the age of four. Visual loss is assumed to further limit the possibilities to perceive speech (Marshall et al., 2007a). The ability to understand speech under adverse conditions is, however, also dependent on cognitive skills such as WM capacity and executive functioning (Rönnberg et al., 2010).

In the present study, inhibition in the AS group was related to the ability to reproduce consonants in non-words. The reason for this outcome might be that the ability to inhibit irrelevant associations and thoughts affected the ability to correctly reproduce non-words (Miyake et al., 2000; Letho et al., 2003). The displayed differences in inhibitory capacity in AS are, however, of significant importance, as inhibition was in addition related to the ability to initiate and sustain communication in the AS group. This relation reflects that preventing irrelevant information from gaining attentional access is required in ongoing processes such as communication (Lustig et al., 2007). Relations between inhibitory capacity and communication have been displayed in previous studies in other populations with disability, such as individuals with ASD (Ylvisaker and DeBonis, 2000; Noterdaeme et al., 2001; Mc Evoy et al., 2006) and Attention-Deficit/Hyperactivity Disorder (Ylvisaker and DeBonis, 2000). Mental references are developed through social interaction (Carpendale and Lewis, 2004), and the cognitive load of WM is, during normal conditions, relatively extensive (Slade and Ruffman, 2005). Challenges due to sensory loss would be expected to increase this load (Bernstein and Grant, 2009; Mishra et al., 2014). Thus, inhibitory capacity is important for individuals with AS in the development of ToM, as it may underpin the ability to initiate and sustain communication.

The ability to update WM with incoming information, and to compare this information with stored knowledge and infer meaning, is a process necessary for, for example, sentence comprehension (Rudner et al., 2011; Zekveld et al., 2011). In the present study, updating correlated with vocabulary in the AS group, which confirms a general relationship between updating and verbal ability (St Clair-Thompson and Gathercole, 2006). Poorer updating ability in the AS group was related to a preponderance of literal interpretations and a repetitive use of words. This suggests that vocabulary development may be dependent on updating in the AS group. In the present study, updating ability was further related to the production of correct mental inferences in the AS group. This may imply a mediating role for updating ability in vocabulary development, in turn promoting ToM.

Individuals with AS whose mental state understanding equaled that of the control group displayed a performance level similar to that of the control group in updating and in inhibition tasks. This is in line with previous results in non-disabled populations (Pernier and Lang, 1999; Carlson et al., 2004; Muller et al., 2012) as well as in other populations of individuals with disabilities, such as high functioning autism (Ozonoff et al., 1991) and cerebral palsy (Li et al., 2014). In the present study, among individuals with AS, poor ToM was related to reported problems in paying attention to the topic of a conversation. This pattern of results suggests that ToM in individuals with AS may be dependent on EF in general and inhibitory capacity in particular. Instead of the direct relation between ToM and EF observed in individuals with ASD (Zelazo and Carlson, 2012), we suggest a mediated relation (cf., Hughes and Leekam, 2004), where limitations in EF capacity impoverish communication and speech, already challenged in AS by hearing loss and additive severe visual loss, with negative consequences for ToM formation.

Odd rhythm of speech is part of a cluster of self-reported speech deficits in the AS group, related to poorer inhibitory capacity, communicative difficulties and a lower frequency of produced correct mental inferences. Speech deficits were also associated with motor mannerisms in early childhood among individuals with AS in the present study. One possible explanation is that EF deficits in AS may cause speech mannerisms as well as early motor mannerisms, as both dysfunctions are related to motor control difficulties. These results are in line with similar findings for individuals with ASD (Livesey et al., 2006). In the present study, the ability to reproduce vowels in the AS group was also poor, despite the fact that perception of vowels did not seem to be affected by hearing loss. This indicates an inability to process information, which may also be related to EF. The findings in previous studies that inhibition of attention in general is strongly correlated with inhibition of action (Friedman and Miyake, 2004) further support the notion that motor control problems in individuals with AS may be specifically related to poor inhibitory capacity.

Motor milestones are typically delayed in AS (Marshall et al., 2007a) as well as in other deafblind related syndromes such as Usher syndrome (Kimberling et al., 1989) and CHARGE syndrome (Dammeyer, 2012). Good balance is maintained by proprioception, vision and vestibular input. When vision and vestibular function are diminished, balance has to rely on proprioception, which in many daily situations results in imbalance, unsteadiness and insecurity. Vestibular dysfunction is noted in many deafblind syndromes (Kimberling et al., 1989; Möller, 2003; Dammeyer, 2012), and is frequent in AS (Marshall et al., 2007a). It causes motor delay (Marshall et al., 2007a), and might be one basis of the reported repetitive motor mannerisms in the present study. However, AS individuals with vestibular dysfunction did not exhibit more motor mannerisms than AS individuals without vestibular dysfunction, suggesting dysfunctional top-down processes instead (Diamond, 2013), defined as a lack of capacity to perform deliberate kinds of processing (Rudner et al., 2008). Repetitive motor mannerisms are frequent in ASDs and related to a lack of inhibitory control (Mahone et al., 2004; Mosconi et al., 2009), but also occur in non-spectrum disorders such as mental retardation (Bodfish et al., 2000). It is reasonable to also attribute such difficulties in AS to an inhibition deficit. Furthermore, repetitive mannerisms in AS were, in contrast to other motor deviations, related to ToM. Similar findings have been made in ASD (Stoelb et al., 2004; Chamberlain et al., 2006), and this probably reflects the commonly displayed relation between ToM capacity and inhibition (Carlson and Moses, 2001).

Inhibitory control involves the ability to control behavior (Diamond, 2013), and the prevalence of behavioral issues (i.e., obsessive compulsive disorder traits) in AS (Marshall et al., 2007a) probably reflects the poorer inhibitory capacity in many individuals. Lack of inhibitory control is associated with difficulty in regulating emotions (Carlson and Wang, 2007), exercising behavioral control, resisting temptation, and preventing impulses (Diamond, 2013) as well as the occurrence of repetitive behavior (Mosconi et al., 2009). The restrictions in inhibitory capacity that have been documented in a proportion of individuals with AS in this study could be the cause of the behavioral issues that have been reported in this syndrome (Marshall et al., 2007a). Atypical behaviors are in general cross-situational (Funder and Colvin, 1991), and biological influence specifically on behavioral inhibition has been confirmed (Matheny, 1989).

Neurocognitive impairments are frequently demonstrated in individuals with ciliary dysfunction (Lee and Gleeson, 2010). Brain abnormalities have also been established in individuals with AS, mainly in posterior regions (Citton et al., 2013; Manara et al., 2015). AS has a significant phenotypic overlap with Bardet Biedl syndrome (BBS). Motor mannerisms are displayed in both syndromes (Dyer et al., 1994; Marshall et al., 2007b). Such mannerisms have, however, also been observed in non-ciliopathy disorders, e.g., in individuals with visual impairment of varying etiology (Fazzi et al., 1999; Molloy and Rowe, 2011). However, no relationship was seen between motor mannerisms of individuals with AS in the present study and degree of visual impairment, suggesting that the main cause of mannerisms may not be sensory loss. In Joubert syndrome, another ciliopathy where sensory dysfunction is not as frequently displayed as in AS and BBS, repetitive and stereotyped motor mannerisms have, however, been reported, and related to EF deficits caused by abnormalities in the cerebellum (Steinlin, 2007). Deficits in cerebellar areas are known to relate to deficits in EF (Cardoso et al., 2014) and individuals with higher EF have been shown to have greater recruitment of cerebellar regions within a frontoparietal resting state network (Reineberg et al., 2015). Cerebellar anomalies have in fact been observed in individuals with AS (Yilmaz et al., 2006), but no consistent findings in these areas have been discovered in the AS population (Citton et al., 2013). Thus, individual cerebellar anomalies may account for some of the EF deficit found in individuals with AS in the present study. In contrast, in previous studies of, e.g., deaf and hard of hearing individuals, poorer EF performance has been recognized as a secondary consequence of sensory loss (Rhine-Kahlback, 2004). No significant relation between degree of sensory loss and EF performance was, however, displayed in the present study. Aside from brain anomalies, this suggests potential external factors of importance for development of EF capacity in AS, illustrated by training effects in both non-disabled children (Thorell et al., 2009) and the elderly (Nagamatsu et al., 2012).

Autism spectrum traits have been observed in some individuals with AS (Marshall et al., 2007a), and in the present study the AS group displayed a significantly poorer ToM than the group of non-disabled controls. A general decrease in ToM capacity is typical for individuals with ASD (Baron-Cohen, 1989). The results from the present study, however, indicate a varied performance level in ToM in the AS group that is highly dependent on verbal ability, whereas the threshold for language impact in ASD is markedly high (Hughes and Leekam, 2004).

Theory-of-mind capacity is related to EFs in disabled as well as non-disabled populations, where numerous studies have stressed the direct role of inhibition in developing an understanding that another person may hold a belief different from one's own (Carlson and Moses, 2001). The within-group variance in ToM in the present study rather indicates an indirect role of EF in ToM development in AS, mediated by communicative skills and verbal ability. A better ability to inhibit noise and to update information might compensate for the consequences of sensory loss in communication. An additional factor of importance for understanding the difficulties individuals with AS have in developing advanced ToM is the significant loss of vision already at the age of four, which is a sensitive age in the development of ToM. In a previous study, early onset of visual loss in AS was accordingly found to relate to poorer ToM capacity (Frölander et al., 2014). Early visual loss in addition to progressing hearing loss over time constitutes a challenge to communication. Variance in executive functioning could, however, explain a significant part of the variance in ToM in AS, pointing to an indirect role of EF in ToM development, in contrast to the direct relation between EF and ToM displayed in ASD.

This study of cognitive prerequisites for ToM development is complicated by the small population, reflecting the low prevalence of AS. Moreover, the present population displayed a high degree of heterogeneity in performance, and ceiling effects were recognized in the non-disabled control group. A division of the population into subgroups made between- and within-group comparisons possible, and also uncovered patterns of interaction. The influence of cognitive skills on ToM performance can, however, only be established with certainty when audibility is fully restored (Akeroyd, 2008). This was accomplished in the EF test but not in the test of phonological WM. It is possible that the differences in presentation level between participants influenced the pattern of correlation. However, there is no reason to believe that presentation level systematically covaried with either of the variables of interest. Thus, any incidental differences in presentation level are likely to have weakened rather than strengthened the association. By relating performance on the phonological WM test to individual performance on the EF test, an investigation of the compensatory role of EF when the perception of speech signals is challenged was, however, accomplished.

## Conclusion

In the present study the complication of hearing and visual loss in individuals with AS is addressed and related to development of ToM. The results confirm previous results showing that ToM is related to verbal ability in individuals with AS. A novel finding is that the results imply a compensatory role of EF in ToM development in AS. We suggest that this relation reflects the challenge of dual sensory loss and the importance of EF in developing the ability to perceive and process input from the social environment. It is likely that good updating capacity in individuals with AS enables inferences of meaning from incompletely received signals. Updating capacity contributes to verbal ability that is of importance for ToM development. Further, good inhibitory capacity enables sustained social interaction, in spite of the challenging conditions during communication. Poorer inhibitory capacity could be one cause of the observed mannerisms in individuals with AS, which may further decrease their opportunities to participate in communication.

Recommendations: The conclusions of the present study imply that rehabilitation of individuals with AS should focus on development of verbal ability and EF. Cognitive intervention that focuses on EF may be effective in order to support development of ToM as a basis for enhanced social participation and reduction of abnormal behavior. The apparent association between ToM and EFs in AS, however, needs to be further elaborated and replicated. As the mutations causing AS are actively expressed in most organs, and brain anomalies have been reported in some individuals with AS (Marshall et al., 2007a; Citton et al., 2013; Manara et al., 2015), further neurological studies need to be conducted. In addition, implementation of cognitive methods to improve verbal ability and EFs in individuals with AS would be of interest in future studies.

## References


# Acknowledgments

The authors would like to thank Alström Syndrome International for enabling and supporting data retrieval. Thanks to the participants in the study. Thanks to Malin Wass, Niklas Rönnberg and Elisabet Clason at the Linnaeus Centre HEAD, at Linköping University, for the development of test material and support in analyses of results. In addition, thanks to the staff at the Audiological Research Centre at the University Hospital in Örebro. Specific thanks to Hanna Hagsten and Jenny Hjaldal for help to retrieve and register data, and to Margareta Landin for help to find relevant literature and to organize references. Support was provided by the Linnaeus Centre HEAD and for JDM by the NIH HDO36878.


of mind" in story comprehension. *Cognition* 57, 109–128. doi: 10.1016/0010-0277(95)00692-R


Molloy, A., and Rowe, F. J. (2011). Manneristic behaviors of visually impaired children. *Strabismus* 19, 77–84. doi: 10.3109/09273972.2011.600417


load-force adaptation. *Neural Netw.* 23, 1043–1050. doi: 10.1016/j.neunet.2010.08.007


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Frölander, Möller, Rudner, Mishra, Marshall, Piacentini and Lyxell. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Cognitive skills and reading in adults with Usher syndrome type 2

*Cecilia Henricson1,2,3\*, Björn Lidestam2,3, Björn Lyxell1,2,3 and Claes Möller1,4,5*

*<sup>1</sup> Swedish Institute for Disability Research (SIDR), Linköping, Sweden, <sup>2</sup> Linnaeus Centre for Research on Hearing and Deafness (HEAD), Linköping, Sweden, <sup>3</sup> Department of Behavioral Sciences and Learning, Linköping University, Linköping, Sweden, <sup>4</sup> Audiological Research Centre, Örebro University Hospital, Örebro, Sweden, <sup>5</sup> School of Medicine and Health, Örebro University, Örebro, Sweden*

Objective: To investigate working memory (WM), phonological skills, lexical skills, and reading comprehension in adults with Usher syndrome type 2 (USH2).

Design: The participants performed tests of phonological processing, lexical access, WM, and reading comprehension. The test situation and the tests were specifically designed for use with persons with low vision in combination with hearing impairment. The performance of the group with USH2 on the different cognitive measures was compared to that of a matched control group with normal hearing and vision (NHV).

### *Edited by:*

*Patrik Sörqvist, University of Gävle, Sweden*

### *Reviewed by:*

*Max Christoph Liebau, University Hospital of Cologne, Germany Veronica Marie Whitford, McGill University, Canada*

### *\*Correspondence:*

*Cecilia Henricson, Department of Behavioral Sciences and Learning, Linköping University, S-582 35 Linköping, Sweden cecilia.henricson@liu.se*

### *Specialty section:*

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

> *Received: 16 December 2014 Accepted: 06 March 2015 Published: 25 March 2015*

### *Citation:*

*Henricson C, Lidestam B, Lyxell B and Möller C (2015) Cognitive skills and reading in adults with Usher syndrome type 2. Front. Psychol. 6:326. doi: 10.3389/fpsyg.2015.00326*

Study Sample: Thirteen participants with USH2 aged 21–60 years and a control group of 10 individuals with NHV, matched on age and level of education.

Results: The group with USH2 displayed significantly lower performance on tests of phonological processing, and on measures requiring both fast visual judgment and phonological processing. There was a larger variation in performance among the individuals with USH2 than in the matched control group.

Conclusion: The performance of the group with USH2 indicated similar problems with phonological processing skills and phonological WM as in individuals with long-term hearing loss. The group with USH2 also had significantly longer reaction times, indicating that processing of visual stimuli is difficult due to the visual impairment. These findings point toward the difficulties in accessing information that persons with USH2 experience, and could be part of the explanation of why individuals with USH2 report high levels of fatigue and feelings of stress (Wahlqvist et al., 2013).

Keywords: deafblindness, Usher syndrome, phonological skill, lexical skill, working memory, reading

## Introduction

Impairment in both hearing and vision, deafblindness, causes a major reduction in the intake of sensory information from the environment. There can be several etiologies behind deafblindness, but Usher syndrome is one of the most common causes (Pennings, 2004; Sadeghi, 2005). The clinically estimated prevalence of Usher syndrome is reported to be 2.4–6.2 individuals per 100,000 people (Sadeghi et al., 2004b). The prevalence is similar worldwide, though which type of the syndrome is most common differs locally (Sadeghi et al., 2004b). In Sweden the clinically defined types 1 and 2 of the syndrome are most common, and type 3 is unusual (Sadeghi et al., 2004b). In the present study, the focus is on adults with USH2. Individuals with USH2 have a congenital moderate-to-severe hearing loss, and the hearing loss remains relatively stable over the lifespan (Sadeghi et al., 2004a). Individuals with USH2 have a severely limited, central visual field and suffer visual deficits such as poor photo- and contrast sensitivity due to the retinal disorder retinitis pigmentosa (RP; Sadeghi et al., 2004b; van Wijk et al., 2004). The first symptoms of RP (poor contrast sensitivity and night blindness) are typically evident at ages 5–10 years, but RP is commonly diagnosed in the late teens in individuals with USH2 (Sadeghi et al., 2004b). The degenerative process in the retina typically stabilizes at about 40–50 years of age (Sadeghi et al., 2004b).

**Abbreviations:** NHV, normal hearing and vision; P.corr.c, percent correct consonants; USH2, Usher syndrome type 2; WM, working memory.

Several of the genes causing Usher syndrome have been mapped and described, and the medical aspects of Usher syndrome have also received much attention in research; however, the cognitive functioning of individuals with the syndrome has not been in focus. In the present study three basic cognitive skills were examined in adults with USH2: WM, phonological skills, and lexical skills. Several studies have demonstrated that the capacity of, and efficient interplay between, these cognitive skills are highly important to the amount of understanding achieved when decoding language, whether in speech (Rönnberg et al., 2010), or written form (Engel de Abreu et al., 2011). The decoding of information in speech relies partly on phonological skills (Rönnberg et al., 2010; Classon et al., 2013). The separate words have to be identified in the continuous speech signal, and identification is mediated by matching the phonological sequences to phonological representations stored in long term memory. In this way, the burden on the storage component of phonological WM is reduced, and more resources can be directed to processing the semantic content (Rönnberg et al., 2010). A number of studies have shown that hearing impairment (HI), both congenital and non-congenital in nature, is associated with reduced efficiency of phonological processing, especially less stable phonological representations and reduced phonological WM (Lyxell et al., 1996, 1998, 2009; Andersson, 2001; Spencer and Tomblin, 2009; Henricson et al., 2012; Lazard et al., 2012; Classon et al., 2013). Several studies using the Reading Span test have found that individuals with long term hearing loss display lower results on the test (Lyxell et al., 1996; Rönnberg et al., 2010), which also suggests a decrease in WM for verbal materials. These findings should be highly relevant also in the case of the group with USH2, but whether they apply to the same extent has not been investigated.

A better understanding of the cognitive functioning of the group with USH2 is at the foundation of developing better assistance and rehabilitation, which could increase the wellbeing for individuals. Information on the group's performance on cognitive measures could also offer insights on the influence of perceptual information from the auditory and visual senses on cognitive performance. As mentioned previously, the cognitive skills that are examined in the present study are at the basis of other complex abilities, and could be of specific importance to reading, for example. In normal hearing (NH) individuals the correlation between phonological skills and reading skills is typically most pronounced in the initial stages of learning to read (Lundberg, 2009; Schaffner and Schiefele, 2013). Children with cochlear implants (CIs) constitute a group who often display low phonological skills (Lyxell et al., 2009; Geers and Sedey, 2011; Dillon et al., 2012), and for example Geers et al. (2008) have shown that despite reading at level with children with NH at the ages 10–12 years, many children with CI display low performance on tests of reading comprehension in late adolescence. A possible explanation could be that since relying on phonological skills is effortful for many children with CI, the alternative is to use the salient visual cues, such as shapes of words and letters when decoding text. It seems probable that children with CI apply an orthographic reading strategy (Lyxell et al., 2009; Geers and Sedey, 2011), which might not be sufficient in order to reach full comprehension of complex texts. Because of the congenital hearing loss, the development of phonological and lexical skills is at risk in individuals with USH2. 
Also, the progressive visual loss could be interfering with the retrieval of information such as the mentioned salient visual cues of text, but also with the individual's ability to learn and use lip-reading (a skill which relies on an understanding of spoken language phonology, and hence could maintain and refine the phonological skills), further complicating the development of language skills. Since the development of reading skills (e.g., decoding and comprehension) depends on phonological and lexical abilities in individuals with NH, the present study aimed to examine this relationship in individuals with USH2. More specifically, the present study investigated phonological and lexical skills, WM and reading comprehension in individuals with USH2 compared to matched controls with NHV.

## Materials and Methods

### Participants

The group of participants consisted of 13 individuals (3 women, 10 men) with USH2 aged 21–60 years (*M* = 38.8, SD = 12.7 years, see **Table 1**). The participants' ages were distributed such that four individuals were between 20 and 30 years, two between 31 and 40 years, five between 41 and 50 years, and two individuals between 51 and 60 years of age. All were recruited through the Örebro Audiological Research Centre's national database on Usher syndrome, in which they had been entered after receiving their diagnosis of USH2a as a result of clinical and genetic investigation. All participants with USH2 had a symmetric, sensorineural, sloping hearing loss which was moderate to severe (Pure Tone Average over four frequencies (PTA4) left ear, *M* = 66.2 dB, SD = 11.6 dB; PTA4 right ear, *M* = 67.5 dB, SD = 13.3 dB, see **Table 1**). Speech discrimination in noise (signal/noise + 4 dB) was in all participants with USH2 between 50% and 60% correctly identified words, which was an expected level of performance given the hearing loss. Information on participants' visual field was retrieved from the Örebro Audiological Research Centre's database on Usher syndrome and is reported as the calibrated Goldmann hemispheres. The Goldmann hemispheres categorize loss of visual field into five phenotypes where 1 = normal visual field, 2 = presence of a partial or complete ring scotoma, the latter either extending or not extending into periphery, 3 = concentric central field loss with a remaining peripheral island, less than one-half of the field circumference, 4 = marked concentric loss (visual field of less than, or equal to, 10°), and 5 = no visual field (blindness; Sadeghi et al., 2004b). The classification of participants' visual fields is displayed in **Table 1**, as are data on participants' visual acuity. Visual acuity is reported in the decimal scale, where a value of 1–0.6 is considered normal vision, and 0.05 indicates functional blindness. All participants had Swedish as their primary language. All of the participants with USH2 had completed the Swedish comprehensive school of 9 years, and the Swedish upper secondary school of 3 years. Seven of the participants with USH2 had studied for one up to 5 years of university education, and six had vocational educations.

### TABLE 1 | Data on age, hearing thresholds (PTA4) and vision for the participants.

*Visual field reported according to the Goldmann standard (Sadeghi, 2005), where a classification of 1* = *normal visual field and 5* = *no visual field. Visual acuity reported in the decimal scale, where 1* = *normal acuity and 0.05* = *functional blindness. M, mean value of the group, SD, standard deviation.*

A control group of 10 persons (four women, six men) aged 23–60 years (*M* = 38.4, SD = 11.0), with NH and normal or corrected-to-normal vision, was selected to match the group with USH2 with respect to age and educational level. Audiograms were measured on all participants (PTA4 left ear, *M* = 3.7 dB, SD = 5.1 dB; PTA4 right ear, *M* = 3.3 dB, SD = 3.6 dB, see **Table 1**), and vision was reported by each participant to be normal when using corrections such as glasses or lenses. None reported using any other visual facilitation in their everyday life. All of the participants in the control group with NHV had completed the Swedish comprehensive school of 9 years, and the Swedish upper secondary school of 3 years. Six of the participants with NHV had studied for one up to 5 years of university education, and four had vocational educations.

Prior to their participation, all participants received letters of information describing the study aims and methods, and how data would be reported. All participants provided written consent.

### Cognitive Tests

The test session lasted for 2–2.5 h and included tests of WM, phonological skill, lexical access, phonological WM, and reading comprehension. The tests were given in a set order, but half of all participants were given the tests in reversed order to balance potential order effects. Six of the tests were presented visually (text), and one test was presented auditorily. The six tests which contained visual stimuli (text) were displayed on a computer screen (Dell, LCD, 22). Color settings for contrast and font sizes (16, 24, 26, 32, 36, 42, 50, 70, and 90 points) could be specified by each participant to enhance visibility and accommodate the varying degree of visual problems in each participant with USH2. None of the participants with USH2 chose a font size smaller than 24 or greater than 42 points. All participants with USH2 preferred the setting with yellow text on black background, which is the option with the highest degree of visual contrast. All participants in the control group also took the tests in this high contrast setting. The stimuli in Serial Recall of Non-words were presented auditorily. Before the test session all participants in the group with USH2 had their hearing aids checked, to ensure that the devices were functioning correctly. At the session all participants had access to further technology, such as tele-coil, loudspeakers, and FM-systems (radio communication units specifically designed for hearing aid reinforcement). Ten of the participants chose to use the FM-system at some point during, or throughout, the test session. A sample sentence, not included in the actual test, was used to set the sound level to a comfortable loudness for the participant, before starting the test with the recorded voice. Each experimenter ensured that their voice could be perceived clearly, so that instructions could be heard without problem, before starting the session.

Regarding the control group participants with NHV, the recorded voice was presented through loudspeakers (Logitech S-100), which the participant set to a comfortable level of loudness while listening to the sample sentence. The loudspeakers were positioned on either side of the computer screen, directly in front of the participants. Each experimenter made sure that listening conditions were as good as possible for the control group participants during the test session.

### Verbal Ability: Antonyms

This test has previously been used in Lyxell et al. (1996). It was presented in text on screen. The task was to identify the pair of words which were each other's antonyms in a set of five words. The participant had 5 min to complete as many items as possible. Performance was scored as number of correctly identified pairs of antonyms, of a maximum of 29 items.

### Speed of Visual Judgment: Physical Matching

This test has previously been used in Lyxell et al. (1996). It was presented in text on screen. The task was to identify whether a displayed pair of letters were identical or different. For the identical condition to be valid, both letters have to be the same, and they have to be either in upper or lower case (e.g., "e – e"). Each item was presented for 2 s with 1 s between tasks, and total number of items was 16. Performance was scored as percentage correct judgments, and mean reaction time (RT) for correct answers was recorded.
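The scoring rule described above (percentage of correct judgments, plus mean RT over correct answers only) can be sketched as follows. This is a minimal illustration rather than the authors' test software: the function names, the trial tuple layout, and the reading of "identical" as same letter in the same case are assumptions made for the example.

```python
def is_identical(pair):
    """Assumed reading of the rule: same letter AND same case (e.g. 'e' - 'e')."""
    left, right = pair
    return left == right


def score_trials(trials):
    """Score a list of (pair, answered_identical, rt_seconds) trials.

    answered_identical is True/False, or None when no response was given
    within the presentation window. Returns (percent_correct, mean_rt),
    where mean_rt is computed over correct answers only.
    """
    correct_rts = []
    for pair, answer, rt in trials:
        if answer is not None and answer == is_identical(pair):
            correct_rts.append(rt)
    percent_correct = 100.0 * len(correct_rts) / len(trials)
    mean_rt = sum(correct_rts) / len(correct_rts) if correct_rts else None
    return percent_correct, mean_rt


# Hypothetical responses, for illustration only
example_trials = [
    (("e", "e"), True, 0.6),    # correct "identical"
    (("e", "E"), False, 0.8),   # correct "different" (case differs)
    (("a", "b"), True, 1.0),    # wrong answer, RT ignored
    (("A", "A"), None, None),   # no response, counted as incorrect
]
percent_correct, mean_rt = score_trials(example_trials)
```

The same scoring shape would apply to Lexical Matching and Rhyme Judgment, which are also reported as percentage correct and mean RT for correct answers.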

### Lexical Access: Lexical Matching

This test has previously been used in Lyxell et al. (1996). Single syllable words or non-words were presented, one at a time, on screen. The task was to judge whether the displayed word was a real word or not and push a button accordingly. There were 40 items, each displayed for a maximum of 5 s with 1 s intervals between items. Performance was scored as percentage correct judgments, and mean RT for correct answers was recorded.

### Phonological Processing: Rhyme Judgment

This test has previously been used in Lyxell et al. (1996), and Classon et al. (2013). The test items were presented in text on screen, and the task was to judge whether pairs of words rhyme or not and push a button accordingly. The participant was instructed to disregard spelling and lettering of the words and focus on their sound (e.g., "MUSTASCH – pistage" makes a rhyme in Swedish). The total number of items was 32. Each item was presented for a maximum of 5 s, with 1 s interval between items. Performance was scored as percentage correct judgments, and mean RT for correct answers was recorded.

### Complex Working Memory: Reading Span

This test has previously been used in Lyxell et al. (1996), Classon et al. (2013), and Ng et al. (2013). This test was presented in text on screen. The participant was presented with sequences of sentences consisting of three words. The first sequence consisted of three sentences, with a maximum of five sentences in one sequence. There were two trials at each level. The sentences were presented word by word, and after each sentence the participant had to judge whether the content was semantically anomalous or not (e.g., "Pots jump high" or "Bikes have wheels"). After a sequence was complete, the task was to repeat either the first, or last, word of each sentence. The participant did not know in advance whether the task would be to repeat the first or last words. The total number of sentences was 24. Each word in each sentence was displayed for 0.8 s with an interval of 0.75 s between them. The interval between sentences was 2 s, during which the participant replied to whether the sentence was absurd or not. Performance was scored as percentage of correct words recalled in a free-recall criterion.
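The trial structure above (sequences of three to five sentences, two trials per level) accounts for the stated total of 24 sentences. The sketch below checks that arithmetic and shows one possible free-recall scoring function. The function names are hypothetical, and the scoring details (case-insensitive matching, order ignored) are assumptions, since the text only states that a free-recall criterion was used.

```python
def reading_span_design(min_len=3, max_len=5, trials_per_level=2):
    """Sequence lengths in presentation order: [3, 3, 4, 4, 5, 5]."""
    return [n for n in range(min_len, max_len + 1) for _ in range(trials_per_level)]


def free_recall_percent(targets, responses):
    """Percentage of target words recalled, ignoring recall order.

    Each response can match at most one target (assumed rule).
    """
    remaining = [w.lower() for w in responses]
    hits = 0
    for word in targets:
        w = word.lower()
        if w in remaining:
            remaining.remove(w)
            hits += 1
    return 100.0 * hits / len(targets)


design = reading_span_design()
total_sentences = sum(design)  # 2 * (3 + 4 + 5)

# Hypothetical sequence: first words of three sentences, two of them recalled
example_score = free_recall_percent(["Pots", "Bikes", "Dogs"], ["bikes", "pots"])
```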

### Phonological Working Memory: Serial Recall of Non-Words

Before starting this test, all participants listened to a sample of the recorded voice in order to set sound to a comfortable and audible level. The task was to repeat sequences of one syllable non-words, all with consonant-vowel-consonant structure. The sequence length started at two words, increasing with one word after three trials at each level, up to a maximum of seven words in sequence. The test was terminated if the participant failed to repeat the correct number of items in a sequence on two attempts. The total number of words was 81, with a total of 162 consonants. Performance was scored in two ways: (1) p.corr.c of recalled words, and (2) Longest recalled sequence.
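The item counts in this paragraph follow from the design: three trials at each sequence length from two to seven gives 81 non-words, and two consonants per consonant-vowel-consonant item gives 162 consonants. The sketch below verifies this and illustrates a percent-correct-consonants score. The position-by-position comparison of onset and coda consonants is an assumed scoring rule, and the non-words are invented for the example.

```python
def serial_recall_design(min_len=2, max_len=7, trials_per_level=3):
    """Sequence lengths in presentation order, three trials per level."""
    return [n for n in range(min_len, max_len + 1) for _ in range(trials_per_level)]


def percent_correct_consonants(targets, responses):
    """Percent of consonants reproduced in the correct item and position.

    targets and responses are lists of CVC strings aligned by serial
    position; missing or malformed responses score zero (assumed rule).
    """
    correct = 0
    for i, target in enumerate(targets):
        response = responses[i] if i < len(responses) else ""
        if len(response) == 3:
            correct += int(response[0] == target[0])  # onset consonant
            correct += int(response[2] == target[2])  # coda consonant
    return 100.0 * correct / (2 * len(targets))


design = serial_recall_design()
total_words = sum(design)            # 3 * (2 + 3 + 4 + 5 + 6 + 7)
total_consonants = 2 * total_words   # two consonants per CVC non-word

# Hypothetical trial: second item recalled with a wrong final consonant
example_pcc = percent_correct_consonants(["bep", "tok"], ["bep", "tog"])
```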

### Reading Comprehension: Gates MacGinitie

This test was presented in text on screen. Short passages of text on different subjects were presented. The task was to read through each passage and answer multiple choice questions about the contents, or implications, of the text. Performance was scored as number of correct answers of maximum 42 answers.

### Statistics

The data were analyzed for group differences using Mann–Whitney *U*-tests, with the significance level set to *p <* 0.05. In cases where participants were unable to perform a test, the person was excluded from analyses (i.e., for Reading Span there were two missing values, and analysis was run on eleven subjects). Effect sizes are presented as Pearson *r* values. Since there was a wide age range among participants, Spearman's correlations were also performed to examine the impact of age on performance. Spearman's correlations were also used to examine the impact of visual status and degree of hearing loss on performance.
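As an illustration of the analysis described above, the following sketch computes a Mann–Whitney *U* statistic, its normal-approximation *z* value, and an effect size *r*. It rests on assumptions not stated in the text: the conversion *r* = |*z*| / √*N* (a common convention for reporting *r* with rank tests), a normal approximation without tie or continuity correction, and hypothetical function names. Statistical software may apply corrections that shift *z* slightly.

```python
import math


def z_from_u(u, n1, n2):
    """Normal approximation for U (no tie or continuity correction)."""
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u


def mann_whitney_u(x, y):
    """Return (U for sample x, z), using average ranks for tied values."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    return u_x, z_from_u(u_x, n1, n2)


def effect_size_r(z, n_total):
    """Assumed convention: r = |z| / sqrt(N)."""
    return abs(z) / math.sqrt(n_total)


# Reported values for the Antonyms test: U = 94.50, z = 1.84, 13 + 10 participants
z_antonyms = z_from_u(94.5, 13, 10)
r_antonyms = effect_size_r(1.84, 23)
```

Under these assumptions, the reported *U* for the Antonyms test gives a *z* close to the reported 1.84, and the reported *z* gives an *r* that rounds to the reported 0.38.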

## Results

### Verbal Ability: Antonyms

There was no significant difference between the groups on this test, *U* = 94.50, *z* = 1.84, *p* = 0.07, *r* = 0.38 (USH2: *M* = 14.9 and SD = 4.4; NHV: *M* = 18.3 and SD = 3.6; see **Table 2** for details on performance in the groups). However, the variation in performance was higher in the group with USH2, with three individuals performing above the mean rank value of the control group (15), and six below.

### Speed of Visual Judgment: Physical Matching

The control group had significantly higher scores on this test, *U* = 109.00, *z* = 2.88, *p* = 0.04, *r* = 0.60 (USH2: *M* = 87.6 and SD = 12.6; NHV: *M* = 98.2 and SD = 4.0), and also had significantly shorter RTs, *U* = 18.00, *z* = 2.92, *p* = 0.04, *r* = 0.61 (USH2: *M* = 1.1 and SD = 0.4; NHV: *M* = 0.7 and SD = 0.1; see **Table 2**, and **Figure 1**, for details on performance in the groups). There were seven participants in the group with USH2 who performed between 94 and 100%, and six with performance below 94%, whereas in the control group only one participant performed below this score. Regarding RTs, 12 participants with USH2 had RTs longer than 0.7 s, compared to only three participants in the control group.

### Lexical Access: Lexical Matching (see Figure 2)

There was no significant difference in performance between groups regarding score, *U* = 82.50, *z* = 1.11, *p* = 0.27 (USH2: *M* = 92.5 and SD = 10.1; NHV: *M* = 96.4 and SD = 4.2), but there was a significant difference in RT, *U* = 32.00, *z* = 2.05, *p* = 0.04, *r* = 0.43 (USH2: *M* = 1.4 and SD = 0.7; NHV: *M* = 0.9 and SD = 0.3) on this test (see **Table 2**, and **Figure 2**, for details on performance in the groups). Two participants with USH2 and one participant in the control group performed below 90% correct. All except one participant with USH2 had an RT longer than 0.8 s, in comparison to the control group where only three had RTs longer than 0.8 s.

### Phonological Processing: Rhyme Judgment

The control group had significantly higher scores, *U* = 114.00, *z* = 3.08, *p* = 0.02, *r* = 0.64 (USH2: *M* = 74.6 and SD = 18.2; NHV: *M* = 95.7 and SD = 7.8), but the difference in RT was not significant, *U* = 34.00, *z* = 1.93, *p* = 0.05 (USH2: *M* = 2.1 and SD = 1.0; NHV: *M* = 1.3 and SD = 0.5) on this test (see **Table 2** and **Figure 3**, for details on performance in the groups). While all participants in the control group had performance above 90%, only four participants with USH2 had performance at or above this score. Regarding the RTs, ten of the participants with USH2 had an RT above 1.5 s, compared to two in the control group.

### Complex Working Memory: Reading Span

The group with NHV had significantly higher performance on this test, *U* = 87.50, *z* = 2.30, *p* = 0.02, *r* = 0.27 (USH2: *M* = 54.6 and SD = 12.8; NHV: *M* = 69.6 and SD = 14.7; see **Table 2** for details on performance in the groups). Eight of the participants with USH2 had scores below 60%, compared to two participants in the control group. Two participants with USH2 were unable to perform this test due to their visual impairment, and were excluded from the analysis of this measure.

### Phonological Working Memory: Serial Recall of Non-Words

The control group displayed both a higher percentage of correct consonants in the recalled non-words, *U* = 101.50, *z* = 2.42, *p* = 0.02, *r* = 0.50 (USH2: *M* = 42.1 and SD = 11.9; NHV: *M* = 56.7 and SD = 12.5), and a longer span length, *U* = 103.50, *z* = 2.39, *p* = 0.02, *r* = 0.50 (USH2: *M* = 3.9 and SD = 0.9; NHV: *M* = 4.9 and SD = 1.0; see **Table 2**, and **Figure 4**, for details on performance in the groups). Ten of the participants with USH2 had performance at or below 50% consonants correct, compared to two in the control group. Four of the participants with USH2 had a recalled longest sequence at or below four words, while none of the participants in the control group were below this span length.

FIGURE 3 | Displaying RT in seconds and score (% correct answers) for each individual on the test Rhyme Judgment. Individuals with USH2 are displayed as filled circles and individuals in the control group as triangles. The difference in performance among individuals with USH2 is greater than in the control group with NHV. Performance on this test was affected by degree of visual impairment, but could also be an indication of less stable phonological representations in the group with USH2.

### Reading Comprehension

There was no significant difference between groups, *U* = 60.00, *z* = 1.24, *p* = 0.22 (USH2: *M* = 39.0 and SD = 8.5; NHV: *M* = 44.4 and SD = 2.2) on this test, but the group with USH2 displayed a higher degree of variability in performance, ranging from full score on the test to less than half score (see **Table 2**, and **Figure 4**, for details on performance in the groups). All participants in the control group had a score at or above 40 points (of maximum score 48), while three participants with USH2 had results below this score. Four of the participants with USH2 were unable to perform the test: in two cases because of the visual impairment, and in two cases because the participants grew too tired during the testing and hence declined participation in the test of reading comprehension.

### Spearman's Correlations

There were no significant correlations between age and performance, in terms of score, on the cognitive tests in the group with USH2 (see **Table 3**). There was a significant, moderate correlation between age and performance on complex WM, as well as between age and score on Lexical Matching, in the group with NHV (see **Table 3**). The correlation was negative, indicating that the younger individuals with NHV had higher score on Lexical Matching. In the group with USH2 there were significant, moderate correlations between visual status and RTs on Lexical Matching, Rhyme Judgment, and Physical Matching (see **Table 3**). The correlation between visual status and performance (score) on Physical Matching was significant (see **Table 3**). The correlations between visual status and performance (in terms of proportion correct answers), and visual status and RT on the tests, were not significant (see **Table 3**).
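For readers unfamiliar with the rank correlations reported here, the sketch below computes Spearman's rho as the Pearson correlation of average ranks, which handles tied values such as repeated visual field classifications. The function names are hypothetical and the data are invented for illustration; they are not the study's data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Invented example: visual field class (1-5) against mean RT in seconds
visual_field = [2, 3, 3, 4, 5]
mean_rt = [0.8, 1.0, 1.1, 1.4, 2.0]
rho = spearman_rho(visual_field, mean_rt)
```

A positive rho in such an example would correspond to the pattern described in the text, where poorer visual status goes with longer RTs.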

### Summary of Results

There were significant between-group differences in performance (score) on speeded visual judgment (Physical Matching), phonological processing (Rhyme Judgment), phonological WM (Serial Recall of Non-words), and complex WM (Reading Span). The group with USH2 displayed poorer performance on these measures. There were also significant between-group differences regarding RT on Physical Matching and Lexical Matching, where the group with USH2 had longer RTs. There was no significant difference between groups on reading comprehension. Age and visual decline were moderately correlated in the group with USH2, where increased age was associated with poorer visual performance. Furthermore, visual decline and RT on the tests were moderately correlated, such that poorer visual performance was associated with longer RT.

## Discussion

The aim of this study was to examine WM, phonological and lexical skills, and reading comprehension in adults with USH2 in relation to a matched control group with NHV. The general findings were that the group with USH2 had lower performance on complex verbal WM, reduced phonological WM, as well as less accurate phonological processing. Reduced WM and phonological processing were indexed by significantly lower performance and longer RTs on the Reading Span, Rhyme Judgment, and Serial Recall of Non-words tests. The effect sizes were moderate to large when the groups differed significantly. However, it is important to note that lower performance was not a general finding in the group with USH2. Several of the participants with USH2 performed comparably to or slightly lower than the control group on the experimental measures. Only a few performed markedly below the control group. An interesting aspect is that performance varied in participants with USH2 across the different tests, such that individual strengths, weaknesses and degree of alertness may have had a stronger influence on performance than, for example, their degree of visual impairment. Correlational analysis also indicated that generally low performance was not specifically associated with either higher age or poor vision in the group with USH2. However, two individuals with USH2 displayed generally low performance on all tests, and these cases will be discussed further below.

### TABLE 3 | Spearman correlations between age, visual status, and cognitive variables.

<sup>∗</sup>*p* ≤ *0.05,* ∗∗*p* ≤ *0.01.*

The variation in performance in the group with USH2 is displayed in **Figures 1–4**, and from this information we can conclude that most participants with USH2 indeed had difficulties on measures of phonological processing and phonological WM; however, some did not.

A slightly unusual finding was the difference in performance on Physical Matching, a test which is generally used as a control measure for general RT. For individuals with NHV the proportion of correct responses is expected to be high. The control group performed at ceiling on this test and had short RTs. Regarding the group with USH2, the majority achieved high scores and had RTs only slightly longer than those of the control group, but four individuals with USH2 displayed low scores and long RTs. Two of these individuals declined participation in the test of complex verbal WM, as well as in the test of reading comprehension, because of their low vision. The data on their visual status confirmed both visual field and acuity to be severely limited. Hence, the low performance on Physical Matching of these two participants was likely an effect of not being able to perceive and/or evaluate the visual stimuli properly. As a group, the participants with USH2 displayed longer RTs on Lexical Matching and Rhyme Judgment. On both tests the majority of participants with USH2 displayed relatively long RTs, though in the latter case the difference in RT was not significant in the two-sided test of significant difference. A possible explanation is that the participants with USH2 experience visual input to be uncertain due to their visual impairment, and hence have adapted by allowing more time when inspecting visual elements.

Though the finding of a significant difference between groups on Physical Matching was unexpected, the differences in performance on tests relying on phonological skills and phonological WM were less so. Even when analyses were run with the two participants with poorest vision excluded from all measures, the pattern of results remained, indicating that phonological processing difficulties are likely to be an issue for persons with USH2. Previous research (e.g., Lyxell et al., 1998; Classon et al., 2013) has investigated the impact of long-term hearing loss on phonological skills and found that phonological processing skills decline over time (Rönnberg et al., 2010; Classon et al., 2013). The primary effect of a reduced ability to process phonological information, according to Rönnberg et al. (2010), is difficulty in processing speech, and hence speech comprehension can be compromised. However, whether the reduction in phonological skills in adults with long-term hearing loss also affects reading comprehension has not been investigated. Most likely this is due to the fact that even though phonological skills are correlated with reading skill in individuals with NH at the beginning stages of reading (Lundberg, 2009; Schaffner and Schiefele, 2013), as the reader becomes more skilled, this correlation becomes less prominent. In USH2, the HI is congenital, and hence could give rise to delayed or divergent development of phonological skills (Wass, 2009; Lederberg et al., 2013; Lyxell et al., 2013; Nakeva von Mentzer et al., 2013), which could have an impact on the development of reading skill. While there was no significant difference between groups in performance on the test of reading comprehension in this study, three individuals with USH2 performed 1 SD or more below both groups' means. These three participants also displayed low results on tests of phonological skill, phonological WM, and complex verbal WM. While one of these participants was in the higher end of the age span, the other two were in the middle, and neither of them was among those with poorest vision. Possibly, these participants have not been able to acquire nuanced and stable phonological skills at an early stage due to their HI, and as an effect reading skills later in life are compromised. One of these participants also reported reading to be a very tiring activity, and terminated the test of reading comprehension before the time allotted had expired.

The difficulties with phonological processing experienced by individuals with USH2 in this study could be disruptive for speech comprehension, especially when conversation takes place in noisy environments (e.g., Rudner et al., 2011; Ng et al., 2013). Studies investigating health aspects in persons with hearing loss often find higher levels of fatigue in individuals with hearing loss (Hua et al., 2013). The effort exerted by applying conscious strategies in order to retrieve the information necessary to follow conversations could be one explanation, as suggested, for example, by Rönnberg et al. (2010). In individuals with deafblindness the access to visual information is also severely limited, hence further increasing the strain on the individual to acquire the information necessary in the conversation. Possibly, the difficulties experienced in extracting information in social situations by persons with USH2 could be part of the explanation behind the findings of Wahlqvist et al. (2013), who found psycho-social health to be significantly lower in the population with USH2, with higher prevalence of headache, fatigue, and depression in comparison to a reference population. Therefore, one of the key goals of rehabilitation should be to help individuals compensate for the loss of information from vision and hearing, and the knowledge gained from studies such as the present one could be important in the design of interventions at audiological clinics.

It should be noted that there are inherent challenges in conducting research with populations with deafblindness. Due to the dual sensory loss, and individual variation in degree of loss, it is hard to design a test situation in which all participants with deafblindness would have the opportunity to display peak performance. However, none of the participants in the present study reported difficulties with hearing the instructions or test items during the test sessions. All participants were experienced hearing aid users, had their hearing aids checked before the test session, and the FM-systems used during sessions gave further benefit. Compensating for low vision in cognitive testing turned out, not surprisingly, to be a greater challenge. Even though the tests had been adapted for participants with low vision, problems with visibility remained. In particular, the two participants with most advanced RP experienced the tests where stimuli were displayed for only a short time as tiring and had difficulty finding and getting the item in focus before display time for the item expired. As stated, these two participants declined participation in some tests, since they were not able to see the material properly. The impact of the visual impairment on the tests used could be further investigated by including a group with matched visual status, but without HI. Possibly, a group with matching visual impairment would display similar difficulties with fast visual judgment, while achieving higher results on the tests of phonological processing skills.

### Conclusion

The performance of the group with USH2 indicated similar problems with phonological processing skills and phonological WM as experienced by other individuals with long-term hearing loss. On tests of phonological processing and phonological WM, performance level was significantly lower in the group with USH2 than in the control group with NHV. On the visually displayed tests of phonological processing, performance was likely also affected by the problems with visibility, even though, with the exception of two participants, the individuals in the group with USH2 did not report specific difficulties with visibility. The majority of participants with USH2 had particular difficulties when fast visual judgment was required in combination with phonological processing, such as in the Rhyme Judgment task. However, for several of the measures of phonological processing some individuals performed similarly to the control group, whereas a few performed markedly low, despite the same level of visual impairment. Information on the level of phonological processing skills could be important in the design of intervention for individuals. Individuals could benefit from extra support and specific training of phonological skills in order to ease communication, thus possibly reducing feelings of stress and/or loneliness. A recommendation for future research would be to further investigate phonological skills in the population with USH2, preferably with separate control groups matched on degree and duration of HI and of visual impairment, respectively. It would also be relevant to study communicative strategies, and to connect these aspects to health and well-being in the group.

# Funding

This project was approved by the Swedish regional ethical vetting board in Uppsala and was financed by a grant from the Swedish Research Council Forte and by the Audiological Research Centre in Örebro.

# Acknowledgments

We would like to thank all participants of the study for taking the time to participate in this research. We would also like to thank all of you who worked at and helped us create the event for collection of data, and especially Moa Wahlqvist and Jennie Hjaldahl for substantial contributions to the project.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Henricson, Lidestam, Lyxell and Möller. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# The Timing and Effort of Lexical Access in Natural and Degraded Speech

### Anita E. Wagner<sup>1,2</sup>\*, Paolo Toffanin<sup>1</sup> and Deniz Başkent<sup>1,2</sup>

<sup>1</sup> Department of Otorhinolaryngology/Head and Neck Surgery, University Medical Center Groningen, University of Groningen, Groningen, Netherlands, <sup>2</sup> Graduate School of Medical Sciences, School of Behavioral and Cognitive Neuroscience, University of Groningen, Groningen, Netherlands

Understanding speech is effortless in ideal situations, and although adverse conditions, such as caused by hearing impairment, often render it an effortful task, they do not necessarily suspend speech comprehension. A prime example of this is speech perception by cochlear implant users, whose hearing prostheses transmit speech as a significantly degraded signal. It is yet unknown how mechanisms of speech processing deal with such degraded signals, and whether they are affected by effortful processing of speech. This paper compares the automatic process of lexical competition between natural and degraded speech, and combines gaze fixations, which capture the course of lexical disambiguation, with pupillometry, which quantifies the mental effort involved in processing speech. Listeners' ocular responses were recorded during disambiguation of lexical embeddings with matching and mismatching durational cues. Durational cues were selected due to their substantial role in listeners' quick limitation of the number of lexical candidates for lexical access in natural speech. Results showed that lexical competition increased mental effort in processing natural stimuli in particular in presence of mismatching cues. Signal degradation reduced listeners' ability to quickly integrate durational cues in lexical selection, and delayed and prolonged lexical competition. The effort of processing degraded speech was increased overall, and because it had its sources at the pre-lexical level this effect can be attributed to listening to degraded speech rather than to lexical disambiguation. In sum, the course of lexical competition was largely comparable for natural and degraded speech, but showed crucial shifts in timing, and different sources of increased mental effort. 
We argue that well-timed progress of information from sensory to pre-lexical and lexical stages of processing, which is the result of perceptual adaptation during speech development, is the reason why in ideal situations speech is perceived as an undemanding task. Degradation of the signal or the receiver channel can quickly bring this well-adjusted timing out of balance and lead to increase in mental effort. Incomplete and effortful processing at the early pre-lexical stages has its consequences on lexical processing as it adds uncertainty to the forming and revising of lexical hypotheses.

Keywords: time-course of speech perception, speech perception in adverse communicative situations, cochlear implants, pupillometry, lexical processing

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Kimmo Alho, University of Helsinki, Finland

Dorothea Wendt, Technical University of Denmark, Denmark

> \*Correspondence: Anita E. Wagner a.wagner@umcg.nl

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

> Received: 07 October 2015 Accepted: 04 March 2016 Published: 30 March 2016

### Citation:

Wagner AE, Toffanin P and Başkent D (2016) The Timing and Effort of Lexical Access in Natural and Degraded Speech. Front. Psychol. 7:398. doi: 10.3389/fpsyg.2016.00398

# INTRODUCTION

Understanding speech involves the rapid translation of acoustic information into meaning. The time course in which listeners extract phonetic information and map it onto their mental representations has been extensively studied in ideal listening conditions (e.g., Allopenna et al., 1998; Dahan and Tanenhaus, 2004). It is, however, less well understood how degraded signals, such as speech transmitted via cochlear implants (CIs) – hearing prostheses that allow profoundly deaf listeners to regain access to speech perception – find their way into the mental lexicon. When noisy surroundings or hearing impairment complicate speech comprehension, effortful processing is the first noticeable consequence. This paper investigates how signal degradation affects the time course of lexical access, as reflected in listeners' gaze fixations, and how lexical processing employs mental resources, as reflected in pupil dilation.

In ideal conditions, understanding speech is a prime example of an automatic perceptual process that takes its course without our attention. We can understand speech and at the same time engage in parallel activities. What enables this efficient processing is the seamless transfer of information within a hierarchy of pre-lexical and lexical decoding stages. Models of speech perception (e.g., TRACE: McClelland and Elman, 1986; Shortlist: Norris, 1994; Shortlist B: Norris and McQueen, 2008) describe pre-lexical and lexical processing as automatic. Evidence that listeners process speech even in absence of conscious awareness (Davis et al., 2007) further supports this notion. Unlike speech perception in ideal situations, the processing of degraded speech draws more strongly on attentional resources (Wild et al., 2012), and can lead to mental fatigue (Hornsby, 2013).

Increased effort during speech perception, sometimes also referred to as mental fatigue (for a distinction of these terms see McGarrigle et al., 2014), is often reported by users of CIs (Noble et al., 2008). Compared to natural speech the signal transmitted via CIs is strongly degraded in its spectrotemporal form. Following implantation listeners need to adapt their processing of speech to this specific transmission, and despite reaching relatively successful speech understanding on average, many listeners describe speech perception to be more tiresome. The symptom of greater effort during speech perception has also been reported for hearing impaired listeners (Kramer et al., 1997), and hearing-aid users (Hornsby, 2013).

Audiological assessment methods are traditionally based on measures of intelligibility, and no standard tests exist for quantifying effort. Mental effort is first and foremost the listener's subjective impression, but it may affect automatic mechanisms underlying speech perception, and bottlenecks within these mechanisms can increase effort even further. Recently, there has been an increase in interest in pupillometry as an objective measure of mental effort in speech perception (Kramer et al., 1997; Zekveld et al., 2014; Koelewijn et al., 2015). Pupillometry has been an established method for studying the subconscious use of attentional resources in cognition since Hess and Polt (1960), and Kahneman and Beatty (1966). These classical studies established that the dilation of the human pupil does not only reflect adaptation to changes in luminance, in the timescale of 200–500 ms (Ellis, 1981), but also a slower evolving response to mental effort, in the timescale of above 900 ms (Hoeks and Levelt, 1993). Since then, pupillometry has been applied to study cognitive processes, such as those related to memory load (e.g., Hess and Polt, 1960) or attention (Hoeks and Levelt, 1993). Whereas it is accepted that increased pupil dilation reflects increased processing, the sources of pupil dilation can be attributed to mental effort (Hess and Polt, 1964), controlled attention (Hoeks and Levelt, 1993; Koelewijn et al., 2015), automatic attention (Libby et al., 1973), or engagement in a task (Kahneman and Beatty, 1966; Kang et al., 2014). There is no clear-cut distinction between effort and attention, and some models of cognitive resources see a close correlation between effortful processing and increased demands on attention (Kahneman, 1973). For speech perception, pupillometry has been applied to study listening effort under divided attention (Koelewijn et al., 2015), listening effort (Zekveld et al., 2014; Winn et al., 2015), and speech perception training (Kuchinsky et al., 2014). 
Greater pupil dilation has been found to reflect both auditory and cognitive aspects of processing speech in challenging conditions (Zekveld et al., 2014).

Effortless processing of speech in optimal conditions is based on experience with the signal, and on the consequential fine attunement of the perceptual system to the regular and common patterns in the listener's native language (Cutler, 2012). Language-specific processing of speech starts with early and subconscious perceptual organization of acoustic cues (e.g., Kuhl, 1991; Iverson et al., 2003), and continues with semantic and pragmatic integration of meaning into the context of a conversation (Kamide et al., 2003). This fine adjustment takes place during speech development (Kuhl, 1991; Best, 1995), and ensures the almost instant processing of the speech signal, as speech needs to be processed in real time. A delay in the processing of speech on pre-lexical and lexical stages will decrease the automaticity of speech perception and may increase mental effort. Gaskell and colleagues (Cleland et al., 2006, 2012) show the importance of well-timed lexical processing in a series of experiments deploying the Psychological Refractory Period. Early stages of speech processing, such as integration of cues to phonemic identification appear to take place without drawing upon central resources (Gaskell et al., 2008). Accessing the meaning of words, however, has been found to create a bottleneck, which sets a limit on the processing of subsequently presented tasks (Cleland et al., 2012). Degradation of the signal may affect the fine timing of processing even further.

The aim of the present study is to track the timing of lexical access in natural and degraded speech, and to study whether and how this processing interacts with mental effort. We hypothesize that degradation will affect the automaticity of processing speech and delay the timing of processing information at pre-lexical and lexical levels. The time course of lexical access has been studied by means of eye-tracking (e.g., Allopenna et al., 1998; Dahan et al., 2002). This paradigm is based on the finding, replicated over several decades, that listeners' gaze fixations to pictures displayed on a screen are driven by auditory speech stimuli: listeners spontaneously fixate the object that is being referred to in the signal they hear (Cooper, 1974). This paradigm thus captures the time course of lexical decision-making. Previous eye-tracking studies have documented listeners' fast integration of detailed phonetic and semantic information and how this information modulates their lexical decisions (Dahan et al., 2002; Salverda et al., 2003; Dahan and Gaskell, 2007).

The process of interest in this paper is lexical competition, which is the short-lived interval during which the heard signal matches multiple lexical entries, and the perceptual system allows multiple lexical candidates to compete for the best match to the signal. Listeners, not knowing the intended word beforehand, subconsciously and for a fraction of a second consider multiple words that have overlapping phonological forms. This includes homonyms (e.g., pair and pear), lexical embeddings (e.g., paint in painting), and words that can occur across word boundaries (e.g., can in black and blue). Models of speech perception see lexical competition as an integral part of lexical access (for a recent discussion on this debate see McMurray et al., 2009).

The present experiment adapts the design by Salverda et al. (2003), who studied the time course of disambiguation of words embedded in other words (e.g., pan in panda) in Dutch. These authors found that listeners' gaze fixations during the processing of lexical embeddings are guided by the durational differences between syllables in monosyllabic versus polysyllabic words. The lengthening of syllables in boundary position makes the monosyllabic word, e.g., pan, longer than the phonologically overlapping syllable in the polysyllabic word panda. To study the effect of the durational cues on lexical decision, Salverda et al. (2003) manipulated the duration of the first syllable by cross-splicing monosyllabic words into polysyllabic targets. This manipulation will be part of our experiment, as well as the second manipulation of signal degradation that simulates the signals transmitted via CIs. We will record the time-course of lexical disambiguation in natural and degraded stimuli. The durational manipulation is crucial in combination with the specific degradation applied because while CIs strongly degrade the signal in its spectrotemporal details they reliably transmit the durational relations in speech (Vavatzanidis et al., 2015). This means that listeners can pick up on the durational cues for both degraded and natural speech stimuli. In order to also get insight into the mental effort involved in lexical access we will record listeners' pupil dilation alongside the fixations.

Pupil dilation will give us insight into the mental effort involved in the processing of degraded versus natural speech. The measure of mental effort captured in pupil dilation combined with gaze fixations can reflect processing bottlenecks, or the accumulated effort resulting from ill-adjusted timing between processing stages. However, pupil dilation may also indicate the engagement in a task, or the recruitment of attentional resources. The manifold sources of pupil dilation have led to some ambiguity in the use of terms. In this paper we will use the term 'mental effort' to describe our results. However, we are aware that automatic attentional allocation can play a role in the regulation of cognitive processes (Posner, 1992), such as the perception of speech. Furthermore, the capacity model by Kahneman (1973) sees a close connection between attention and mental effort. In this model, tasks compete for processing resources, with automatic tasks requiring no attention and little effort. Although speech perception is often described as automatic, this automaticity is granted mainly when listening to native speech. Listening to a foreign but familiar language already demands more attention and effort in processing. There is growing evidence for the involvement of attentional resources, in particular when processing speech in adverse conditions (Mattys et al., 2009; Wild et al., 2012). Even for the perception of natural signals attention has been found to not only facilitate segregation of speakers (Kerlin et al., 2010), but also to share resources with parallel tasks, such as performing memory-related tests while suppressing irrelevant non-speech sounds (Sörqvist et al., 2012).

Three questions are the focus of the present study. (1) Does the time course of lexical disambiguation, as captured by gaze fixations, differ between the processing of natural versus degraded speech? (2) Does lexical competition involve an increase in mental effort, as captured in listeners' pupil dilation? (3) Does processing of degraded speech show a comparable course of changes in mental effort to natural speech? Based on our working hypothesis that timing between the processing stages is crucial for automatic and effortless perception, we assume that there will be differences in the time course of processing natural versus degraded speech. A hint in a similar direction has been reported by Farris-Trimble et al. (2014). Regarding question two, we expect to find a difference in pupil dilation for the processing of degraded versus natural speech. We do not expect lexical competition in natural speech to employ mental resources, since speech perception is an automatic process. Our experimental stimuli, however, contain misleading cues that will force listeners to revise their lexical hypotheses, and we expect that mental resources may then be recruited. Coming to our third question, we expect to observe more effort in processing degraded speech, as has previously been reported (Zekveld et al., 2014; Winn et al., 2015). However, it is still an open question whether processing degraded speech per se already depletes mental resources allocated to speech perception or whether an additive effect of lexical competition can also be observed. Should the course of fixation between degraded and natural speech indeed differ, as we hypothesize, then the recruitment of mental resources or the course of effort visible in pupil dilation might differ as well.

# EXPERIMENT

# Method

### Participants

Seventy-three normal hearing volunteers, aged between 20 and 31 years (mean age 24), participated in this study. None of them reported any known hearing or learning difficulties, and they all had normal or corrected-to-normal vision. Their hearing thresholds were normal, i.e., below 20 dB HL at the audiometric frequencies between 500 and 8000 Hz. Half of the volunteers were randomly assigned to participate in the task with natural speech (NS), and the other half with degraded speech (DS). Before the experiment started, the participants signed a written consent form for the study as approved by the Medical Ethical Committee of the University Medical Centre Groningen. The volunteers received either course credits or a small honorarium for their participation.

### Stimuli

The materials consisted of 26 critical items, which were borrowed from Salverda et al. (2003). These were polysyllabic Dutch words, which were paired with initially embedded, thus phonologically overlapping, monosyllabic words as competitors. In addition to the critical items, the stimulus set contained 40 filler items, some borrowed from Salverda et al. (2003) and some constructed for this study. The fillers were selected based on two criteria: their syllabic structure, and the presence of embedded words. Seven of the fillers were polysyllabic and 33 were monosyllabic words, thus allowing us to balance the distribution of short and longer words as targets throughout the experiment. Twenty of the fillers did not contain a competitor, and ten of the fillers were monosyllabic words that were paired with polysyllabic words in which they were embedded in initial position. The remaining ten filler targets were monosyllabic words paired with polysyllabic words that embedded them in final position.

For all the materials, the sentence context was neutral and revealed no semantic information about the target. A female native speaker of Dutch with no prominent regional accent recorded the sentences in blocks of paired sentences. The speaker was instructed to pronounce the sentences clearly but in a natural manner. For each pair of target and competitor items three sentences were recorded. The sentence containing the polysyllabic, thus embedding, target (e.g., bokser [boxer]) was recorded twice. Only one instance of the sentence with the monosyllabic, hence embedded, competitor (e.g., bok [goat] is embedded in bokser) was necessary to construct the materials. The initial part of both sentences was identical, and the monosyllabic (competitor) word was always followed by words that matched the phonological, prosodic, and stress pattern of the target sentence as closely as possible. For instance, for the target word 'bokser' the sentence Wij wisten wel dat de oude bokser gestopt was [We all knew that the old boxer retired] was paired with the sentence Wij wisten wel dat de oude bok suffig was [We all knew that the old goat was drowsy]. In order to accentuate the durational differences that were driving listeners' gaze fixations in the study by Salverda et al. (2003), words following the monosyllabic words were stressed on their first syllable. Due to final lengthening, words preceding a stressed position are produced as longer. This allowed us to ensure that the durational cues were audible in the degraded signals. The differences in length between the embedded syllables and the syllables in the polysyllabic words ranged between 20 and 120 ms, with a mean of 65 ms.

All materials were subjected to a splicing procedure, in analogy to Salverda et al. (2003). An example of the procedure is shown in **Table 1**. The acoustic manipulation was implemented in PRAAT (Boersma and Weenink, 2013), and consisted of combining the three sentences recorded per critical pair in order to create two experimental conditions. The sentences were divided into two parts: the initial part contained the sentence up to either the first syllable of the polysyllabic word, or the end of the monosyllabic word; the second part contained the second syllable of the polysyllabic word until the end of the sentence. In Condition 1 (target-matching cues) the first part of the polysyllabic sentence was combined with the second part of the second recording of the same polysyllabic sentence. In Condition 2 (target-mismatching cues) the first part of the monosyllabic sentence was combined with the same second part of the polysyllabic sentence as in Condition 1. This resulted in Condition 1 having the durational pattern typical for the polysyllabic word, and Condition 2 having the durational pattern typical for the monosyllabic word, where this pattern, however, was then violated when the second syllable extracted from the target word was presented.
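
The logic of the two spliced conditions can be sketched in a few lines of Python. This is an illustration only, not the PRAAT procedure used in the study: recordings are represented as toy lists of labeled chunks, and the cut indices and the names `splice` and `build_conditions` are hypothetical.

```python
def splice(first_source, cut_first, second_source, cut_second):
    """Join the start of one recording to the end of another."""
    return list(first_source[:cut_first]) + list(second_source[cut_second:])


def build_conditions(poly_rec1, poly_rec2, mono_rec, cut_poly, cut_mono):
    """Return (target-matching, target-mismatching) versions of one item.

    cut_poly: index just after the first syllable of the polysyllabic word.
    cut_mono: index just after the monosyllabic word.
    Both conditions share the same second part, taken from poly_rec2.
    """
    matching = splice(poly_rec1, cut_poly, poly_rec2, cut_poly)
    mismatching = splice(mono_rec, cut_mono, poly_rec2, cut_poly)
    return matching, mismatching


# Toy chunks standing in for audio samples (labels are illustrative).
poly_rec1 = ["carrier", "bok_short", "ser", "gestopt was"]
poly_rec2 = ["carrier", "bok_short", "ser", "gestopt was"]
mono_rec = ["carrier", "bok_long", "suffig was"]

matching, mismatching = build_conditions(poly_rec1, poly_rec2, mono_rec, 2, 2)
print(matching)
print(mismatching)
```

In this sketch the two conditions differ only in where the first part comes from, so any difference in listeners' responses can be attributed to the durational cue carried by the first syllable.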

The degradation in the form of acoustic CI simulation was performed by sinusoidal vocoding of the speech signal with eight channels, implemented in MATLAB. The decision to create vocoded stimuli with eight channels is based on the finding that increasing the number of channels improves speech perception of CI users up to seven channels and then plateaus (Friesen et al., 2001). The stimulus signal within the frequency range of 100 Hz to 10 kHz was bandpass filtered into eight frequency bands. The boundaries of these eight channels were chosen to be equally spaced along the basilar membrane according to Greenwood's mapping function (Greenwood, 1990). The amplitude envelope was extracted in each frequency band by first half-wave rectifying and then low-pass filtering (4th order Butterworth) the band-limited signal at 300 Hz. The simulated speech was obtained by summing sinusoids at the center frequency of each band, each modulated with the envelope extracted from that band. **Figure 1** displays the spectrograms of an experimental sentence in its natural (NS) and vocoded form (DS), for the stimuli with target-matching (left panel) and competitor-matching durational cues (right panel).
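
As an illustration of the channel layout, the following Python sketch computes band boundaries that are equally spaced in cochlear position using Greenwood's function, F = A(10^(ax) − k), with the commonly used human constants A = 165.4, a = 2.1 and k = 0.88 (x as a proportion of basilar-membrane length). It is not the MATLAB implementation used in the study; the function names are hypothetical, and placing each carrier at the cochlear midpoint of its band is an assumption, since the exact definition of the center frequency is not stated here.

```python
import math

# Greenwood (1990) constants for the human cochlea; x is the relative
# position along the basilar membrane (0 = apex, 1 = base).
A, ALPHA, K = 165.4, 2.1, 0.88


def greenwood_hz(x):
    """Characteristic frequency in Hz at relative cochlear position x."""
    return A * (10 ** (ALPHA * x) - K)


def greenwood_position(freq_hz):
    """Inverse mapping: relative cochlear position of a frequency in Hz."""
    return math.log10(freq_hz / A + K) / ALPHA


def band_edges(low_hz=100.0, high_hz=10000.0, n_channels=8):
    """Return n_channels + 1 boundaries equally spaced in cochlear position."""
    x_low = greenwood_position(low_hz)
    x_high = greenwood_position(high_hz)
    step = (x_high - x_low) / n_channels
    return [greenwood_hz(x_low + i * step) for i in range(n_channels + 1)]


edges = band_edges()
# Assumption: each sinusoidal carrier sits at the cochlear midpoint of its band.
centers = [
    greenwood_hz((greenwood_position(lo) + greenwood_position(hi)) / 2)
    for lo, hi in zip(edges[:-1], edges[1:])
]
for lo, hi, fc in zip(edges[:-1], edges[1:], centers):
    print(f"{lo:8.1f} - {hi:8.1f} Hz, carrier at {fc:8.1f} Hz")
```

Because the spacing is uniform in cochlear position rather than in Hz, the low-frequency bands come out narrow and the high-frequency bands wide.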

### Apparatus and Presentation

The SR Eyelink 500 eye-tracker, with a sampling rate of 250 Hz, was used. This head-mounted eye-tracker contains two small cameras, which can be aligned with the participant's pupil to track the pupil's movements as well as its size continuously during the experiment. The listeners were seated in front of a 19-inch monitor, at a distance of about 50–60 cm from the screen. The stimuli were presented via a speaker in a sound-attenuated room at a comfortable level of about 65 dB SPL. The lighting in this room was kept constant throughout the experiment.

TABLE 1 | An example of the recorded sentences, and the splicing manipulation applied to create the target-matching and target-mismatching condition.

For the display, black and white line drawings were made for the purpose of this study, and validated through consistent naming by Dutch native speakers. For the presentation of the pictures a virtual grid was created to divide the screen into three horizontal and three vertical bars. A red cross appeared centered in the middle quadrant resulting from the 3 × 3 partition of the screen, and the four pictures were centered in the four external quadrants on the grid. An example of a display with bokser (boxer) as target and bok (goat) as competitor is shown in **Figure 2**. The pictures of the 26 critical items were always presented with the respective monosyllabic competitor and two phonologically and semantically unrelated distractors (see Supplementary Material). Twenty filler items were presented with three unrelated distractors, 10 monosyllabic fillers were presented with their word-final embedding (e.g., target: bel; competitor: libelle), and 10 with the monosyllabic embedding competitor (e.g., target: mand; competitor: mandarijn).

### Procedure

Before the experiment all participants were familiarized with all the pictures to ensure that they identified them as intended. The pictures were presented to the participants, who named them and were then told the intended name in case of a mismatch between the word used in the experiment and their identification (for instance to clarify synonyms, such as couch and sofa). Participants assigned to the DS condition were familiarized with the sort of degradation used in the experiment. They were presented with at least 30 degraded sentences and were asked to click on the correct sentence among 10 sentences written on the screen. During this phase participants were allowed to listen to these sentences as often as they wanted. After that the eye-tracker was mounted and calibrated.

Before the data collection started, participants performed four practice trials during which they could always ask the experimenter for instructions. Each trial consisted of a red cross appearing on the screen for 500 ms, followed by the visual display of the four pictures and the simultaneous auditory presentation of the sentence. Participants were instructed to listen to the stimuli and to click on the object mentioned in the sentence. They were also instructed to blink only between the trials, while the word "Blink" appeared on the screen. After each of the blinking pauses participants could progress on a self-paced basis. After every five trials a recalibration screen appeared, to make sure the eye-tracker did not lose track of the pupil. The experiment lasted on average 15–20 min, and consisted of 62 trials, 26 of which were critical trials. The full session, including the initial briefing of the participant, the hearing screening, familiarization with the pictures and the degradation, and debriefing, lasted about 1 h.

### Data Analysis

Listeners correctly clicked on the target in 95% of the trials. Trials in which participants failed to identify the intended target word or with blinks longer than 300 ms were excluded from the analysis (on average two trials per participant). The SR Eyelink 500 records blinks as data points with x–y coordinates and pupil size information. Blinks shorter than 300 ms were linearly interpolated based on the median of 25 samples recorded before and after the blink.
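The blink handling described above can be sketched as follows. This is one possible reading of the procedure, not the authors' code: the function name is hypothetical, and the sketch assumes that the medians of the 25 samples before and after the blink serve as the two anchor values between which the blink samples are linearly interpolated.

```python
from statistics import median

SAMPLE_RATE_HZ = 250                                # reported in the text
MAX_BLINK_SAMPLES = 300 * SAMPLE_RATE_HZ // 1000    # 300 ms at 250 Hz

def interpolate_blink(pupil, start, end, n_ref=25):
    """Return a copy of `pupil` with samples start..end-1 replaced by a line.

    The line runs from the median of the n_ref samples before the blink
    to the median of the n_ref samples after it (assumed interpretation).
    """
    if end - start > MAX_BLINK_SAMPLES:
        raise ValueError("blink longer than 300 ms: exclude the trial instead")
    before = pupil[max(0, start - n_ref):start]
    after = pupil[end:end + n_ref]
    if not before or not after:
        raise ValueError("no reference samples around the blink")
    y0, y1 = median(before), median(after)
    n = end - start
    out = list(pupil)
    for i in range(n):
        # Evenly spaced points strictly between the two anchor values
        out[start + i] = y0 + (y1 - y0) * (i + 1) / (n + 1)
    return out

# Toy trace: stable pupil, a 5-sample blink recorded as zeros, larger pupil
trace = [10.0] * 30 + [0.0] * 5 + [16.0] * 30
clean = interpolate_blink(trace, 30, 35)
```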

The data of two participants were excluded from the analysis because trials with a misidentified target, together with trials containing blinks longer than 300 ms, amounted to 50% of their trials. In addition, the data of four other participants were discarded due to computer or calibration failures. Following this, the data set contained the recordings of 67 participants, 35 of whom took part in DS and 32 in NS.

The statistical analysis of the data is based on the interval between 200 and 2000 ms after word onset. The first 200 ms after the onset of the target are needed to plan and perform the eye movement triggered by an auditory stimulus for a display with multiple pictures (Hallett, 1986), and participants always clicked on the target within the interval of 2 s. The statistical analysis of the gaze fixation will focus on the fixations toward the competitor, since these time curves give insight into how quickly listeners use duration as cue, and how it modulates their lexical decision.

Pupil size data were recorded as pupil area alongside fixations at each sample point. However, eye movements may affect the measurement of pupil size. To ensure that such measurement artifacts do not introduce differences between the experimental conditions, we counted the number of fixations per trial. Within our analysis window of 200–2000 ms we counted on average three fixations. We found no differences between the experimental conditions, neither between filler items nor critical items. Thus if eye movements affected the measurements of pupil size, they did so equally for all conditions. Our approach of combining gaze fixation data with pupillary responses is similar to Klingner (2010), and following his report we also visually inspected the course of pupil dilation across movements for drastic changes in pupil size that would signal measurement errors due to movements. Within our analysis window we did not see such jumps in pupil size. To calculate pupil size changes related to the presentation of stimuli – Event Related Pupil Dilation (ERPD) – we time-locked the pupil size data to the presentation of the target word, corrected it to a baseline immediately preceding the target word, and then normalized the values to correct for individual differences in pupil size, according to the following Equation.

$$\%\,\mathrm{ERPD} = \frac{\mathrm{observation} - \mathrm{baseline}}{\mathrm{baseline}} \times 100$$

To address the questions of whether lexical competition leads to increased pupil dilation and whether the course of pupil dilation is comparable for DS and NS, we used two different baselines to compute two percentage changes in ERPD. Baseline 1 enables us to study the pupil size within the time window of lexical competition. To specifically observe the effect of our experimental manipulation, and to limit other sources that can lead to changes in pupil dilation, baseline 1 is the interval that immediately precedes the manipulation. Baseline 2 serves to examine whether potential effects of lexical competition on pupil dilation are comparable across groups. More effortful processing of DS (Zekveld et al., 2014; Winn et al., 2015) implies an increase in pupil size due to the higher demands when processing DS. Baseline 2 must thus be free of any differences in pupil size between groups of participants assigned either to DS or NS. Therefore, baseline 2 is the average pupil size in the interval preceding the very first sentence in the experiment, where the average pupil size was not significantly different between the groups. Specifically, baseline 1 is the average pupil size within the interval of 200 ms preceding the target word within the sentence, and is computed separately per participant and trial. This value was then inserted into the above equation. Whereas baseline 1 focuses on the processing of lexical competition, the individual normalization per trial may potentially conceal group differences (DS versus NS) in the baseline itself. Baseline 2 is the average pupil size in the interval of 200 ms at the very beginning of the experiment. This value was then inserted into the above equation. The percentage change in ERPD computed from baseline 2 encompasses all the cognitive processes that take place while solving the experimental task (processing speech), and provides a reliable baseline for the effort induced by the experiment.
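The two normalizations can be sketched with the same pair of helper functions. This is an illustrative sketch only: the function names and the toy pupil values are hypothetical, while the 250 Hz sampling rate and the 200 ms window are taken from the text.

```python
SAMPLE_RATE_HZ = 250   # sampling rate reported in the text
BASELINE_MS = 200      # baseline window length reported in the text

def window_mean(samples, end_idx, window_ms=BASELINE_MS, rate_hz=SAMPLE_RATE_HZ):
    """Mean pupil size over the window_ms immediately preceding end_idx."""
    n = window_ms * rate_hz // 1000
    window = samples[max(0, end_idx - n):end_idx]
    if not window:
        raise ValueError("empty baseline window")
    return sum(window) / len(window)

def percent_erpd(samples, baseline):
    """Percentage change of each sample relative to the baseline value."""
    return [(s - baseline) / baseline * 100.0 for s in samples]

# Toy trial: 50 samples (200 ms) before target onset, then a larger pupil
trial = [1000.0] * 50 + [1050.0] * 5
target_onset = 50

# Baseline 1: per participant and trial, 200 ms preceding the target word
baseline_1 = window_mean(trial, target_onset)
erpd_1 = percent_erpd(trial[target_onset:], baseline_1)

# Baseline 2: one value per participant from the start of the experiment
# (toy values standing in for the recording before the first sentence)
experiment_start = [900.0] * 50
baseline_2 = window_mean(experiment_start, len(experiment_start))
erpd_2 = percent_erpd(trial[target_onset:], baseline_2)
```

Because baseline 2 is fixed per participant, `erpd_2` retains any overall drift in pupil size across the session, whereas `erpd_1` removes it trial by trial.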

### Statistical Analyses

### Fixations

The probability of listeners fixating the competitor was analyzed by means of logistic growth curve analysis models (Mirman, 2014). R (R Core Team, 2013) with the lme4 package (Bates et al., 2014) was used to model the time curves of fixations as fourth order polynomials within the time window of 200–1800 ms after target word onset. The time course curves were described by four terms: the intercept, the overall slope of the curve, the width of the rise and fall around the inflection, and the steepness of the curvature at the tails. The probability of fixations along the time course was modeled as a function of Presentation (NS versus DS), Condition (target-matching duration versus target-mismatching duration) and the possible interactions between these two factors and all four terms describing the curves. As random effects we included individual variation among participants on all four terms describing the time curve. Model comparison was used to estimate the contribution of individual predictors to the fit of the model. For this, individual fixed effects were added sequentially, and the change in model fit was evaluated by means of a likelihood ratio test.
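Growth curve analysis typically describes the time course with orthogonal polynomial time terms, so that the slope and curvature terms can be interpreted independently of each other. The sketch below constructs such terms in plain Python by Gram-Schmidt orthogonalization. It only illustrates the general technique and does not reproduce the authors' R analysis: the function name, the 20 ms bin width, and the degree used in the example are assumptions.

```python
import math

def orthogonal_time_terms(times, degree):
    """Orthonormal polynomial time terms of order 1..degree.

    Each returned column is orthogonal to a constant (the intercept)
    and to every other column, and has unit length.
    """
    n = len(times)
    mean_t = sum(times) / n
    centered = [t - mean_t for t in times]
    scale = max(abs(c) for c in centered)
    scaled = [c / scale for c in centered]   # rescale for numerical stability

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    basis = [[1.0] * n]                      # constant term (intercept)
    for d in range(1, degree + 1):
        v = [s ** d for s in scaled]
        for b in basis:                      # remove lower-order components
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    return basis[1:]

# Analysis window from the text, in hypothetical 20 ms bins
time_bins = list(range(200, 1801, 20))
linear, quadratic, cubic = orthogonal_time_terms(time_bins, 3)
```

The resulting columns would enter a mixed-effects model as predictors, crossed with Presentation and Condition, with the same terms repeated as by-participant random effects.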

### Pupil dilation

The pupil size data, as captured by the ERPD, were also analyzed by means of Growth Curve Analysis, as time curves of pupil dilation. The courses of dilation were analyzed as polynomial curves of third order, since the fourth order turned out to be redundant for the description of the curve functions. The terms describing the curves are: the intercept, the slope of the function, and a coefficient for the curvature around the inflection point. The statistical models included the terms describing the curves, and the interaction of these three terms with the experimental condition (target-matching versus target-mismatching cues) and the presentation condition (NS versus DS). To account for individual variation, random effects of the terms describing the curve were also included per participant.

# Results

### Fixations

FIGURE 3 | Curves of proportions of gaze fixation over time for the target-matching and target-mismatching conditions, when presented with natural (NS) and degraded speech (DS). The green lines show the proportion of fixations averaged across participants and items and the 95% confidence intervals for target fixations, red lines show the same for competitor fixations, and the dashed black lines show fixations to the distractors.

Model: Log-odds of fixations to competitor ∼ (curve: intercept) × (curve: slope) × (curve: fall and rise around inflection peak) × (curve: decline in tails) × exp × condition + random effects (all curve parameters per participant). DS and target-matching condition are at the intercept.

**Figure 3** displays the time curves of fixations to all four pictures for both target-matching (top panels) and target-mismatching (bottom panels) conditions, split by the groups presented with NS (left panels) and DS (right panels). This figure shows proportions of fixations averaged across participants, and the 95% confidence intervals for the fixations to the target and competitor. A comparison between the top and the bottom panels gives insight into how the mismatching duration led listeners' gaze fixations to the competitor. The point at which the curve of fixations to the competitor crosses the curve for the target signals the point at which, on average, the target won the process of lexical competition. In the presentation with NS the difference in time between the two conditions is about 120 ms. This is the maximum duration needed for the disambiguating acoustic information (i.e., the second syllable of the target word) to come in. A comparison between NS and DS shows that this point of disambiguation was delayed for DS by about 40 ms on average for the target-matching stimuli, and by about 140 ms in the target-mismatching condition. For the presentation with DS, we find a greater difference in the timing of lexical disambiguation due to acoustic information: here, the difference between the target-matching (right upper panel) and target-mismatching cues (right bottom panel) is above 200 ms.
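The crossing point between target and competitor curves, which the text uses as the marker of lexical disambiguation, could be located as in the following sketch. The function name and the toy fixation proportions are hypothetical, and the text does not state how the crossing was determined; linear interpolation between adjacent time bins is an assumption.

```python
def crossing_time(times, target, competitor):
    """Time at which the target curve first overtakes the competitor curve.

    Returns None if the target never rises from below to at or above
    the competitor curve. The crossing is interpolated linearly between
    the two neighbouring time points.
    """
    diffs = [t - c for t, c in zip(target, competitor)]
    for i in range(1, len(diffs)):
        d0, d1 = diffs[i - 1], diffs[i]
        if d0 < 0 <= d1:
            frac = -d0 / (d1 - d0)       # position of the zero crossing
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None

# Toy fixation proportions in 100 ms bins (illustrative values only)
bins = [0, 100, 200, 300]
target_curve = [0.1, 0.2, 0.4, 0.6]
competitor_curve = [0.3, 0.3, 0.3, 0.2]
t_cross = crossing_time(bins, target_curve, competitor_curve)
```

The delay between two conditions is then the difference between their crossing times.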

Of particular interest for this study is the statistical significance of the interactions between condition, experiment and the terms describing the course of the curves. These interactions were significant (see **Table 2** for a summary of the model estimates). **Figure 4** displays the effects of condition on the time curves of fixations toward the competitor for NS and DS, respectively. These figures display the probability of fixations toward the competitor in the averaged data (solid lines), and in the data as fitted by the statistical model (dashed lines). The interaction between the intercept of the curve and Condition and Presentation [χ²(3) = 185.28, p < 0.001] reflects that the difference between the areas underneath the curves for the conditions with target-matching versus target-mismatching cues was greater in the experiment with NS than with DS. This indicates that DS modified listeners' ability to quickly integrate durational differences while forming their lexical hypotheses, and that listeners' gazes were directed more slowly toward the picture that best matched the acoustic information in the signal. This is also indicated by the three-way interaction with the slope of the curve [χ²(3) = 318.99, p < 0.001]: the time curves of fixations showed a steeper increase in the target-mismatching condition in NS than in DS, showing a faster reaction of listeners' gazes to the durational cues. The three-way interaction with the term describing the rise and fall of the curve around the central inflection [χ²(3) = 676.85, p < 0.001] describes the fact that the curve of fixations in the target-mismatching condition rose and fell significantly faster in NS than in DS. The three-way interaction with the cubic term [χ²(3) = 471.77, p < 0.001] reflects the difference in the decline of fixations to the competitor between the matching and mismatching conditions in NS versus DS. This decline was slower for mismatching cues in DS.

In sum, for the presentation with NS listeners' gazes are quickly governed by the acoustic information in the signal: they fixate the competitor picture more often for stimuli that contain cues appropriate for the competitor. **Figure 4** (left panel) also shows a delay in the peak location of the fixation curves for competitor between the two conditions of about 120 ms in NS. Part of this delay is explained by the fact that the stimuli in the target-mismatching condition were longer by about 65 ms on average. This figure also shows that fixations to the competitor drop rapidly after the acoustic information that clearly disambiguates the target from competitor comes in. Hence, listeners very rapidly revise their initial lexical hypothesis that was based on the cues in the speech signal.

FIGURE 5 | Pupil dilation data time curves shown for NS (left) and DS (right) for target matching (green), target-mismatching (red) and filler stimuli (black).



Model: ERPD ∼ (curve: intercept) × (curve: slope) × (curve: fall and rise around inflection peak) × presentation (natural versus degraded) × condition (target-matching versus target-mismatching versus filler items) + random effects (all curve parameters per participant).

The right panel in **Figure 4** displays the differences in DS between the two conditions. In contrast to NS, the peak of fixations to the competitor in the target-mismatching condition is not higher than the peak of fixations for the target-matching items. Furthermore, the peak location for the target-mismatching condition is delayed even further, by more than 200 ms. This implies that listeners presented with DS were less sensitive to durational cues than listeners presented with NS. Also, the integration of durational cues for lexical decision took longer, since the difference of 200 ms cannot be explained by the durational differences in the stimuli alone. The figure also visualizes the significant interactions with the third and fourth terms of the time curve: the rise and fall of the competitor fixation curve is slower for DS than for NS, making for a shallower curve, and indicating that listeners' decision on the lexical target was not as quick and not as certain in DS as in NS. The effect of uncertainty is further captured by the slower decline of competitor fixations at the tail of the curve: even following the presentation of the clearly disambiguating second syllable of the word, listeners still fixated the competitor to some degree.

### Pupil Dilation

The time curves of the ERPD for the target-matching, target-mismatching and filler items are displayed in **Figure 5**. Note that the filler items did not elicit lexical competition. A visual comparison between the left panel (NS) and the right panel (DS) shows at first glance that the difference in pupil dilation between filler items and items inducing lexical competition was greater for NS than for DS. In analogy to the gaze fixation analysis, model comparison was used to estimate the contribution of the factors and interactions. The final model compared the dilation time curves across participant groups and conditions. The estimates of the final model are listed in **Table 3**. The three-way interaction between all the terms describing the curve, the presentation modes (NS versus DS), and the condition (fillers versus target-matching cues versus target-mismatching cues) was significant. The interaction with the first term – the intercept of the curve – [χ²(5) = 2051.6, p < 0.001] captures the differences in the areas underneath the curves across the conditions between NS and DS. The interaction with the second term – the slope – [χ²(5) = 86.37, p < 0.001] reflects the difference in the course of increase of pupil size between the conditions across NS and DS. The three-way interaction with the third term – the curvature around the peak – [χ²(5) = 173.48, p < 0.001] captures the release from increase in pupil dilation.

For NS (**Figure 5**, left panel), the ERPD curves show an increase over time as a function of lexical competition. The statistical analysis revealed that the target-matching curves differed from the target-mismatching curves on all terms describing the curves [χ²(1) = 35.89, p < 0.001]. The curves for both conditions also differed significantly from the filler items in terms of slope [χ²(1) = 5.99, p < 0.001], curvature [χ²(1) = 8.65, p < 0.001], and area under the curve [χ²(1) = 16.53, p < 0.001]. This implies that pupil dilation captured the effect of lexical competition, and that dilation was significantly larger when the cues were mismatching the target. The right panel of **Figure 5** displays the pupil dilation time course for the stimuli with DS. These curves differed from each other only in terms of their intercept [χ²(1) = 18.3, p < 0.001], and both differed from the filler items only in the curvature of the function [χ²(1) = 5.84, p < 0.001]. This suggests that pupil dilation here did not capture effects of cue manipulation, and that the effect of lexical competition was only marginal.

experimental items and fillers. The functions displayed are smoothed to better represent the trend that is visible in the raw data displayed in Figure 5.

The three-way interactions are visualized in **Figure 6**. For display purposes only, in this figure the time curves of dilation for the filler items were subtracted from the curves of the two conditions (target-matching or target-mismatching cues). This was done to accentuate the effect of lexical competition, which was minimized in the fillers. Also for display purposes only, the curves are smoothed by means of a locally weighted regression function with a span of 0.5. Increased pupil dilation as a function of mismatching cues in NS is illustrated by the steeper curve for target-mismatching items (upper panel). The far smaller effect of mismatching cues on pupil dilation in DS is visible in the smaller difference between the two curves displayed (bottom panel). In DS, the increase in pupil dilation as a function of lexical competition, as well as of mismatching cues, is smaller than in NS.
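The display smoothing can be approximated by a locally weighted regression, sketched below in plain Python. This is not the authors' smoothing function: the text only specifies a span of 0.5, so the local linear fit, the tricube weights and the function name are assumptions made for illustration.

```python
import math

def loess_linear(x, y, span=0.5):
    """Locally weighted linear regression evaluated at every x value.

    For each point, the nearest span * n observations receive tricube
    weights and a weighted straight line is fitted through them.
    """
    n = len(x)
    k = max(2, math.ceil(span * n))          # neighbours per local fit
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1]              # local bandwidth
        if h == 0:
            raise ValueError("too many tied x values for this span")
        w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Toy check: a straight line is reproduced by a local linear fit
xs = [float(i) for i in range(20)]
ys = [2.0 * xi + 1.0 for xi in xs]
smoothed = loess_linear(xs, ys, span=0.5)
```

Applied to the difference between a condition curve and the filler curve, such a smoother yields display curves of the kind described for Figure 6, while the statistical models are fitted to the unsmoothed data.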

### **Baseline 2**

**Figure 7** displays ERPD curves for DS and NS for baseline 2. For display purposes only, the curves are smoothed by means of a locally weighted regression function with a span of 0.5. The curves for NS and DS differed from each other in all three terms: the first term of the curve function, describing the intercept [χ²(2) = 48.99, p < 0.001], the second term, describing the slope [χ²(2) = 85.24, p < 0.001], and the third term, describing the curvature [χ²(2) = 28.9, p < 0.001]. The intercept term for the two curves is especially important. The intercept captures the pupil size change due to participating in the experiment, regardless of whether the source of changes in pupil size was mental effort or engagement required by the task. The negative intercept value for NS represents the decrease in pupil dilation in subsequent trials due to participation in the experiment. The positive intercept for DS instead represents an increase in pupil dilation in subsequent trials due to participation in the experiment.

The ERPD curves with baseline 2 captured the fact that in NS listeners' pupil dilation increased gradually after the presentation of the target, reaching a peak only about 900 ms after the onset of the word. In the DS condition, however, pupil dilation was already increased at the onset of the target word. While the overall dilation was greater in DS, this pupil dilation curve shows a very even course over the entire analysis window. This suggests that, contrary to NS, where lexical disambiguation is the source of increased pupil dilation, in DS participation in the experiment itself causes pupil dilation. Baseline 2 does not allow singling out the individual processes at the source of pupil dilation, but we attribute the difference in ERPD calculated with baseline 2 to the demands that performing the experiment with DS posed on the participants. For NS, **Figure 7** shows that the task did not increase mental effort throughout the experiment, and lexical competition appears to be the main source of increased pupil dilation. For DS, **Figure 7** shows an increase due to performing the experiment, which explains why the analysis based on ERPD calculated with baseline 1 suggested a reduced mental effort for DS. The task itself, processing speech, has led to increased mental effort.

We investigated how signal degradation that simulates speech transmitted via CIs alters the time-course of speech perception and the mental effort drawn upon during this course. To sum up, we find a similar course of lexical disambiguation between degraded and natural signals, with a main difference in the timing of integration of durational cues, and the timing of resolution of lexical competition. Furthermore we find an increase in pupil dilation for listeners presented with NS, which is time-locked to lexical competition, and perception of target-mismatching cues.

A different pattern of mental effort was found for DS, with pupil dilation not increasing as a function of lexical processing but due to the presentation with DS throughout the experiment. Increased effort in processing DS appears to have its sources at the pre-lexical level, while increased pupil dilation in NS has its source in lexical processing.

# DISCUSSION

Our results from the joint analysis of gaze fixations and pupil dilation show different timing in the processing of DS at pre-lexical and lexical levels. At the pre-lexical level these timing differences seem to be the result of automatic versus more effortful processing of the signal. At the lexical level these timing differences appear to be the consequence of processing at the pre-lexical level, with the corollary of different constraints on the selection of lexical candidates. For DS, increased mental effort has its source at the stages of pre-lexical processing, which further complicates the lexical processing. The finding of increased pupil dilation due to mismatching acoustic cues in NS, however, points to a possibly different recruitment of mental resources for natural versus degraded speech.

# Lexical Competition in Natural Speech

For natural stimuli, the gaze fixation results replicate the study by Salverda et al. (2003): durational differences between phonologically overlapping syllables in longer versus shorter words immediately modulate listeners' lexical interpretations. Our recordings of pupil dilation show that mental resources are engaged during lexical access, and in particular when mismatching acoustic cues make listeners revise their initial lexical hypothesis. The small albeit significant increase in pupil size due to lexical competition in the cue-matching condition, together with the stronger increase in pupil size as a response to mismatching cues, shows how quickly speech perception can engage additional mental resources. With no effort perceived by the listener, this effect remains unnoticed in optimal listening conditions. Optimal conditions, i.e., conversation between native speakers and normal-hearing (NH) listeners in acoustically favorable surroundings, are rarely guaranteed in day-to-day communication. Although speech perception is commonly described as an automatic process, it likely draws on additional mental resources more often than not.

Increased processing or elevated activation of brain regions (Zekveld et al., 2014) correlates with pupil dilation, but it is not possible to make a clear-cut distinction between the processes that contribute to pupil dilation. We therefore cannot strictly tell apart the sources of the observed pupil dilation for NS. Increased pupil dilation with mismatching cues could reflect mental effort. Increased dilation due to lexical competition in the cue-matching condition can, however, also reflect the involvement of attentional processes rather than effort. There is evidence for attentional resources taking part in the automatic processing of speech (Wild et al., 2012; Wöstmann et al., 2015). Posner's (1992) framework also includes the notion that the perceptual and cognitive system is supported by autonomous attentional shifts that are automatically triggered by the stimulus, in our case the speech signal. Our results for NS support the interpretation that the incoming signal initiates speech perception automatically, but will draw on additional mental resources for the processing at pre-lexical or lexical levels, depending on the sort of degradations in the signal. According to the model of attention by Kahneman (1973), the distribution of attentional resources is closely related to mental effort, with automatic tasks requiring fewer attentional resources. Processing of native speech becomes automatic through exposure, through the fine attunement of the perceptual system during speech development, and through extensive experience with the signal. The stimuli in our experiment misled our participants to a spurious lexical hypothesis by providing them with misleading cues. At first glance, this seems a rather artificial situation, but it is not completely unfamiliar, as it may occur for instance when listening to foreign-accented speech. Processing of foreign-accented speech might affect the timing of pre-lexical and lexical processing in a similar way as our experimental condition. The less well-timed processing of speech may then recruit more mental resources.

The timing of speech processing appears to be crucial for the seamless automatic transfer of information from pre-lexical to lexical levels of analysis. The processing at early post-sensory but pre-lexical levels of speech perception is likely to be constrained by the capacity of the auditory sensory memory (Crowder, 1993), which limits the retention of acoustic details. Successful processing at this level facilitates quick lexical access, and swift resolution of lexical competition is necessary for the mapping of signal to meaning (Cleland et al., 2012). Well-timed transition of information between pre-lexical and lexical levels appears to conceal the engagement of attentional resources in speech perception. Our study thus stresses the need for a better understanding of the role of the timing window for the interaction between sensory, pre-lexical and lexical processing of speech.

In line with this interpretation are more recent findings on speech perception and attention. Winkler et al. (2005) report that pre-attentive processes of feature binding in auditory perception may require attentional processes for acoustically rich and complex stimuli when processed under time pressure. Similarly, Sörqvist et al. (2012) have shown that a non-auditory memory task diverts listeners' attention to task-irrelevant speech sounds, which then also modulates their sensory processing, as captured in the Auditory Brainstem Response. Regarding the lexical processing of speech, Gaskell et al. (2008) showed in a series of experiments that early stages of speech processing can take place without drawing upon central resources, but accessing the meaning of words creates a bottleneck, which sets a limit on the processing of subsequently presented words (Cleland et al., 2012). The magnitude of this limitation was modulated by the demands that a specific word poses on lexical competition, i.e., the similarity of the word to other words. As argued by Cleland et al. (2012), the access to the meaning of words occurs during a limited time window. In line with this, our results with NS did not show an increase in mental effort due to the task in the experiment (as captured in ERPD measured with baseline 2), since the speech materials were processed within the necessary time limitations. Still, we observed increased pupil dilation due to mismatching cues in the signal, and we interpret this as a targeted engagement of automatic attentional resources rather than mental effort. A separation of the sources that contribute to pupil dilation is, however, still a subject of research.

# Lexical Competition in Degraded Signals

For DS, our results show that lexical competition was slower, prolonged and led to a less certain lexical decision. We also observed a reduced or delayed sensitivity to the durational cue, no increase in pupil dilation due to lexical competition or mismatching cues, and increased pupil dilation due to the demands of the experiment, whose main task consisted of listening to speech. This last finding is in line with previous results (Zekveld et al., 2014; Winn et al., 2015) showing increased pupil dilation when listening to DS. We interpret our results as showing a bottom–up cascading effect of degradation. The small delay in uptake of the durational cues, which is also visible in the comparison between NS and DS in the target-matching condition, carried over to high-level lexical stages of processing, accumulating additional delay during lexical competition. This accumulated uncertainty resulting from ill-timed processing at pre-lexical and lexical levels will have to be compensated for with increased demands on listeners' working memory. Speech needs to be processed in real time, and delayed mapping of signal to meaning will not only make listeners entertain multiple interpretations of the spoken message, but will also limit their predictive processing of speech (Wagner et al., 2016).

The lack of sensitivity to durational cues can partly be explained by the nature of the degradation. The reduction of the spectrotemporal details from NS likely disrupts the binding of acoustic features into categories and reduces neuronal synchronization (Anderson et al., 2010) on the physiological level. Pre-attentive processes involved in the binding of acoustic features into auditory objects, such as phonemes or syllables, are also subject to practice and experience, as is suggested by the superior pre-attentive processing of musicians (Koelsch et al., 1999). Our participants were not experienced with the degradation prior to the experiment. Furthermore, attention to acoustic events also appears to be guided by spectrotemporal details that occur in NS (Ding et al., 2014; Wöstmann et al., 2015). Finally, the lack of naturally occurring consequences of coarticulation may smear out the acoustic features, which in the natural signal are the binding elements of phonetic categories within words. All these factors contribute to the slower integration of acoustic features in the formation of perceptual objects such as syllables or words.

The slower progress of information between pre-lexical and lexical stages is also fortified by the fact that the signal does not resemble listeners' mental representations. Lyxell et al. (1998) argue in a study with users of CIs that long-term deprivation of auditory sensory information before implantation may deteriorate the long-term representation of speech. In the present study it was the speech signal that was degraded, while the mental representation of our NH listeners was intact. A mismatch between the mental representations and the signal was present in our experiment nevertheless. Our results show that even short-term exposure to degraded signals affects its mapping to mental representations, by slowing it down. In addition, the less constrained mapping of signal to mental representations on the pre-lexical level has consequences for the processing on the lexical level.

Our results show that it is more difficult to revise built-up lexical expectations upon hearing DS signals. The delay on prelexical levels might have opened up the opportunity to build up stronger, and in this case, misleading lexical hypotheses about the word that was being processed. This explanation is supported firstly by the observed prolonged lexical competition, and secondly by the uncertainty about the lexical decision after disambiguating acoustic information was presented in DS. In line with this, Lash and Wingfield (2014) present evidence for an auditory analog to the Bruner–Potter effect. Bruner and Potter (1964) showed that recognition of an image presented in a progressive way from blurred to clear is slowed down relative to a singular presentation of a clear image. An unclear object leads participants to build up multiple hypotheses about the identity of an image, and rejecting several hypotheses requires longer than it takes for a single better-cued hypothesis to develop. In analogy to this, auditory presentation of a degraded signal compromises its immediate processing and its passing on to higher evaluation levels, causing listeners to hang on to spurious lexical hypotheses.

While we argue that the source of effort is the pre-lexical processing, there are also alternative explanations for the lack of an additive effect of lexical competition on pupil dilation for degraded signals. Firstly, it is likely that pupil dilation was not able to capture or differentiate additive effects of lexical competition and listening to DS. Secondly, the attentional resources that a listener can draw upon may be depleted by the attention directed toward the processing of degraded signals. A third explanation is that delayed reception of acoustic cues in degraded signals obscures lexical competition and alters the more targeted engagement of attentional resources found in NS. The processing effort found in natural signals would then not be comparable to the effort evoked by lexical competition for degraded signals. Though the three explanations are not mutually exclusive, we believe that the fixation data combined with the pupil dilation data provide some support for the last explanation. The gaze fixations show that lexical competition is delayed and prolonged for degraded signals, and we see increased pupil dilation due to listening to DS. Listeners' engagement in lexical competition may be gated by attentional resources, and a constant effortful processing may disengage the automatic attentional processes that are supposed to be driven by the signal, making lexical competition a less automatic process.

To our knowledge this is the first study that combined measures of time–course of speech perception, in gaze fixations, with mental effort, in pupil dilation. Even though the sources underlying pupil dilation are manifold and difficult to strictly separate, and more research is on the way to investigate these sources, we believe that our study offers a contribution to this search. Speech perception can be an effortful task, in particular for CI users, but also in every-day non-optimal interactions. Our study shows involvement of mental resources for processes that are fundamental to speech perception, and how well-adjusted timing of information processing can conceal this involvement. We attribute experience with the task, i.e., speech perception, to be at the source of well-timed flow of information between stages of speech perception. An intriguing research question for the future is whether early exposure to degraded signals will lead to similar fine adjustment of speech processing, for instance in CI users who were implanted within the first year of their life. Related to this is also the fundamental question of the role that spectrotemporal details play in the process of well-timed speech processing, and regulation of attentional resources.

# AUTHOR CONTRIBUTIONS

The author AW developed the concept of this study, acquired the data, analyzed and interpreted the results, and wrote the paper. AW gives the final approval of the version to be published, and agrees to be accountable for all aspects of the work. The author PT contributed to the data acquisition, and data analysis, and revised critically the final version of this paper for important intellectual content. PT gives the final approval of the version to be published, and agrees to be accountable for all aspects of the work. The author DB enabled the data acquisition, contributed to the interpretation of the results, and critically revised previous and the final version of this paper for important intellectual content. DB gives the final approval of the version to be published, and agrees to be accountable for all aspects of the work.


## FUNDING

This work was supported by a Marie Curie Intra-European Fellowship (FP7-PEOPLE-2012-IEF 332402). Support for the third author came from a VIDI Grant from the Netherlands Organization for Scientific Research (NWO), the Netherlands Organization for Health Research and Development (ZonMw) Grant No. 016.093.397, and by the Heinsius Houbolt Foundation.

### ACKNOWLEDGMENTS

We would like to thank Prof. Frans Cornelissen (University Medical Centre Groningen) for providing the eye-tracker for this study, and Prof. Stuart Rosen (University College London) for lending us his scripts with the vocoding functions. We are also grateful to Jop Luberti for creating the experimental pictures. The study is part of the research program of our department: Healthy Aging and Communication.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.00398



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Wagner, Toffanin and Başkent. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Monitoring Alpha Oscillations and Pupil Dilation across a Performance-Intensity Function

Catherine M. McMahon<sup>1,2</sup>\*, Isabelle Boisvert<sup>1,2</sup>, Peter de Lissa<sup>2,3</sup>, Louise Granger<sup>1,2</sup>, Ronny Ibrahim<sup>1,2</sup>, Chi Yhun Lo<sup>1,2</sup>, Kelly Miles<sup>1,2</sup> and Petra L. Graham<sup>4</sup>

<sup>1</sup> Department of Linguistics, Macquarie University, Sydney, NSW, Australia, <sup>2</sup> The HEARing CRC, Melbourne, VIC, Australia, <sup>3</sup> Department of Psychology, Macquarie University, Sydney, NSW, Australia, <sup>4</sup> Department of Statistics, Macquarie University, Sydney, NSW, Australia

Listening to degraded speech can be challenging and requires a continuous investment of cognitive resources, which is more challenging for those with hearing loss. However, while alpha power (8–12 Hz) and pupil dilation have been suggested as objective correlates of listening effort, it is not clear whether they assess the same cognitive processes involved, or other sensory and/or neurophysiological mechanisms that are associated with the task. Therefore, the aim of this study is to compare alpha power and pupil dilation during a sentence recognition task in 15 randomized levels of noise (−7 to +7 dB SNR) using highly intelligible (16 channel vocoded) and moderately intelligible (6 channel vocoded) speech. Twenty young normal-hearing adults participated in the study; however, due to extraneous noise, data from only 16 (10 females, 6 males; aged 19–28 years) were used in the Electroencephalography (EEG) analysis and 10 in the pupil analysis. Perceived effort and speech performance were assessed behaviorally at 3 fixed SNRs per participant; speech performance was comparable to sentence recognition performance assessed in the physiological test session for both 16- and 6-channel vocoded sentences. Results showed a significant interaction between SNR and channel vocoding for both the alpha power and the pupil size changes. While both measures significantly decreased with more positive SNRs for the 16-channel vocoding, this was not observed with the 6-channel vocoding. The results of this study suggest that these measures may encode different processes involved in speech perception, which show similar trends for highly intelligible speech, but diverge for more spectrally degraded speech. The results to date suggest that these objective correlates of listening effort, and the cognitive processes involved in listening effort, are not yet sufficiently well understood to be used within a clinical setting.

Keywords: alpha power, pupil dilation, listening effort, listening in noise, speech perception, perceived effort, mental exertion

### Edited by:

Adriana A. Zekveld, VU University Medical Center, Netherlands

### Reviewed by:

Stefanie E. Kuchinsky, University of Maryland, USA

Antje Strauß, Centre National de la Recherche Scientifique, France

> \*Correspondence: Catherine M. McMahon cath.mcmahon@mq.edu.au

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 11 December 2015 Accepted: 05 May 2016 Published: 24 May 2016

### Citation:

McMahon CM, Boisvert I, de Lissa P, Granger L, Ibrahim R, Lo CY, Miles K and Graham PL (2016) Monitoring Alpha Oscillations and Pupil Dilation across a Performance-Intensity Function. Front. Psychol. 7:745. doi: 10.3389/fpsyg.2016.00745

# INTRODUCTION

Listening to degraded speech, either in adverse acoustic environments or with hearing loss, is challenging (McCoy et al., 2005; Stenfelt and Rönnberg, 2009), and it is assumed that the increased cognitive load required to understand a conversation is associated with self-reported effort (Lunner et al., 2009; Rudner et al., 2012). Adults with hearing loss report listening to be greatly taxing (Kramer et al., 2006), which may cause increased stress and fatigue (Hétu et al., 1988), contribute to early retirement (Danermark and Gellerstedt, 2004), social withdrawal (Weinstein and Ventry, 1982), and negatively affect relationships (Hétu et al., 1993). Current speech perception tests, which measure performance on a word or sentence recognition task, provide only a gross indication of the activity limitations caused by hearing loss, and do not consider the top–down effects related to increased concentration and attention, as well as effort (Wingfield et al., 2005; Pichora-Fuller and Singh, 2006; Schneider et al., 2010). Therefore, concurrently measuring the cognitive load or listening effort needed to undertake a speech perception task could increase its sensitivity, enabling a more holistic understanding of the challenges faced by adults with hearing loss in communicative settings.

Listening effort, defined as "the mental exertion required to attend to, and understand, an auditory message" (McGarrigle et al., 2014), is influenced by both the clarity of the auditory signal and the cognitive resources available. As hearing loss and cognitive decline are highly associated with age (Salthouse, 2004; Lin et al., 2013), there is a recognized need to understand the contribution of cognition and effort to listening to everyday speech within a clinical environment to better direct rehabilitation strategies and/or improve device fitting, particularly for older adults. Certainly it has been shown that greater cognitive resources are required to perceive a speech signal that becomes more degraded and this is more challenging for older adults (Rabbitt, 1991; Rönnberg et al., 2010, 2013). However, importantly, several studies have also highlighted the advantages that individuals with greater cognitive resources have to understand speech in noise (Lunner, 2003), utilize fast signal processing strategies in hearing aids (Lunner and Sundewall-Thorén, 2007), and compensate when mismatches occur between what is heard and the brain's phonological representations of speech (Avivi-Reich et al., 2014).

Recently, there has been an increased interest in understanding and measuring listening effort, so that future clinical measures may ensue. Many studies have attempted to estimate listening effort, using behavioral, subjective or objective approaches (see McGarrigle et al., 2014 for a review). While subjective measures have high face-validity, they have several inherent limitations; including whether participants are indeed rating perceived effort, or rating their ability to discriminate between different signal-to-noise ratios (SNRs; Rudner et al., 2012). Additionally, subjective measures poorly correlate with other behavioral and objective measures of listening effort (Zekveld et al., 2010; Gosselin and Gagné, 2011; Hornsby, 2013), possibly because these measures relate to specific components of the goal-directed cognitive processes underpinning mental effort (Sarter et al., 2006), therefore each should be investigated. An effective and consistent objective correlate of listening effort has not yet been found (Bernarding et al., 2013), although pupil dilation and oscillations in the alpha frequency band (8–12 Hz) have independently been shown to be associated with changes in speech intelligibility (Obleser et al., 2012; Becker et al., 2013; Zekveld and Kramer, 2014; Petersen et al., 2015) and seem to be sensitive to hearing loss during a speech recognition or digit recall task in noise (Kramer et al., 1997; Zekveld et al., 2011; Petersen et al., 2015). It is, however, not yet known whether these two objective measures assess the same processes, whether sensory (e.g., phonological mapping of degraded speech), cognitive (e.g., cognitive load, inhibition of task irrelevant activity, or working memory), or neurophysiological (e.g., acute stress associated with the investment of attentional resources). 
These physiological responses may also reflect the extent of brain regions that are recruited to achieve a specific performance (e.g., to increase cognitive processing or provide inhibitory control; see Radulescu et al., 2014). Further, while there is an extensive literature on the neurophysiological mechanisms governing pupil dilation (Laeng et al., 2012), less is understood about those which underpin oscillatory cortical activity or the neuromodulators which influence it (Klimesch et al., 2007).

There appear to be general trends between task difficulty and changes in pupil dilation or in alpha power; however, these are not consistent across all studies (see Zekveld and Kramer, 2014; Wöstmann et al., 2015). This may in part depend on the type of task (i.e., listening to randomized or fixed speech tokens), the period when the physiological response is measured (during listening to degraded speech or during the retention period of a memory recall task), or the population characteristics (younger versus older adults, or normal hearing versus those with hearing loss). Alternatively, cognitive load/listening effort may be inherently non-linear and a function of the availability of processing resources coupled with the intentional motivation to allocate such resources to the task (Sarter et al., 2006). That is, when the task is too difficult and the processing demands exceed the available cognitive resources, or when the task is too easy and requires minimal cognitive resources (i.e., is automatic or passive), then effort may not be required or allocated to the task (Granholm et al., 1996; Zekveld and Kramer, 2014). As such, the greatest change in objective measures related to effort may be observable at medium levels of performance, rather than at the extreme ends of performance. Similar non-linear associations between performance and stress (Anderson, 1976) and performance and mental effort have been previously reported (Radulescu et al., 2014).

The current study aims to compare both alpha activity and pupil dilation measured simultaneously over a complete performance-intensity function while listening to sentences with high intelligibility (16-channel vocoded) or moderate intelligibility (6-channel vocoded). Specifically, it aims to identify whether these measures show similar patterns of behavior across the 15 SNRs and with the two levels of vocoding, suggesting that they may encode similar sensory, cognitive or neurophysiological processes involved in listening effort (that currently remain unclear; McGarrigle et al., 2014). A further reason to manipulate both the SNRs and the channel vocoding to degrade speech was to investigate the behavior of these measures on what could be approximated to a simulation of listening with a cochlear implant (Friesen et al., 2001). If these measures are to be applicable in clinical settings, their pattern of behavior should be predictable in a clinical population.

# MATERIALS AND METHODS

fpsyg-07-00745 May 20, 2016 Time: 12:44 # 3

## Participants

Twenty young adults were recruited to participate in this study. Amongst this group, two did not attend all testing sessions. Invalid recordings led to the exclusion of two more participants for the Electroencephalography (EEG) measures and an additional six for the pupil measures. The main reason for excluding the data related to participants looking away from the visual target or closing their eyes when listening became difficult. Participants (10 females, 6 males) were aged from 19 to 28 years (mean = 23 years, SD = 2.6). All participants were native Australian English speakers and were right-handed. Participants' hearing was screened using distortion product otoacoustic emissions. All participants had present emissions bilaterally between 1–4 kHz, which ruled out a moderate or greater hearing loss. All participants reported normal or corrected-to-normal vision. Informed consent was obtained from all participants.

# Speech Perception Material

Recorded Bamford-Kowal-Bench/Australia (BKB/A) sentences spoken by a native Australian-English female were presented as targets in the presence of four-talker babble noise. The sentences and background noise were vocoded by dividing the frequency range from 50 to 6000 Hz into 6 or 16 logarithmically spaced channels. The amplitude envelope was then extracted from each channel and used to modulate noise within the same frequency band. The bands of noise were then recombined to produce the noise vocoded sentences and background noise. See Shannon et al. (1995) for more information about speech recognition with vocoded material.
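As an illustration of the channel division only, the following minimal sketch computes band edges for 6 and 16 channels between 50 and 6000 Hz. It assumes that "logarithmically spaced" means a constant frequency ratio between adjacent edges; the exact spacing rule, the filter design, and the envelope extraction used in the study are not specified here, and the function name is hypothetical.

```python
def log_spaced_band_edges(f_low, f_high, n_channels):
    """Return n_channels + 1 edge frequencies (Hz) with a constant ratio
    between neighbours (assumed reading of 'logarithmically spaced')."""
    ratio = (f_high / f_low) ** (1.0 / n_channels)
    return [f_low * ratio ** k for k in range(n_channels + 1)]


for n in (6, 16):
    edges = log_spaced_band_edges(50.0, 6000.0, n)
    bands = [f"{lo:.0f}-{hi:.0f} Hz" for lo, hi in zip(edges[:-1], edges[1:])]
    print(n, "channels:", bands)
```

With fewer channels each band spans a wider frequency range, which is why the 6-channel material carries less spectral detail than the 16-channel material.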

## Physiological Measures

Electroencephalography activity and pupil dilation were measured simultaneously during the speech recognition task conducted in a sound-treated and magnetically shielded room. With their forehead resting on an eye-tracker support, participants were asked to maintain their gaze on a small cross presented in the middle of the computer screen. The following presentation protocol was used: 1 s of quiet, variable length of noise (>1 s), sentence in noise, 1 s of noise. Physiological testing was conducted across two sessions: session one used the 16-channel vocoded material and session two used the 6-channel vocoded material. Each session presented 240 target sentences at 65 dB with the noise randomized between 58 and 72 dB (−7 to +7 dB SNR, a total of 15 levels). Pilot data indicated these SNRs provided the full range (0–100%) of speech recognition scores (SRS). The randomization was programmed for sentences of the same BKB/A list to be presented at the same SNR to allow off-line scoring of performance as per the original lists.
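The trial structure described above can be sketched as follows. This is not the presentation software used in the study; it assumes one 16-sentence BKB/A list per SNR level (15 lists × 16 sentences = 240 trials, consistent with the list length given in the behavioral section) and a fully shuffled trial order, and all names are hypothetical.

```python
import random

TARGET_LEVEL_DB = 65               # sentence level stated in the text
SNR_LEVELS_DB = list(range(-7, 8)) # -7 ... +7 dB SNR, 15 levels
SENTENCES_PER_LIST = 16            # assumed: one 16-sentence list per SNR


def build_schedule(seed=0):
    """Return 240 trials; every sentence of a list shares that list's SNR."""
    rng = random.Random(seed)
    snrs = SNR_LEVELS_DB[:]
    rng.shuffle(snrs)              # random assignment of SNRs to lists
    trials = []
    for list_id, snr in enumerate(snrs, start=1):
        for sentence in range(1, SENTENCES_PER_LIST + 1):
            trials.append({
                "list": list_id,
                "sentence": sentence,
                "snr_db": snr,
                # noise level needed to reach the SNR at a 65 dB target
                "noise_db": TARGET_LEVEL_DB - snr,
            })
    rng.shuffle(trials)            # randomize presentation order
    return trials


schedule = build_schedule()
print(len(schedule), min(t["noise_db"] for t in schedule),
      max(t["noise_db"] for t in schedule))
```

Keeping all sentences of a list at one SNR is what allows each list to be scored off-line as a unit, even though the trials themselves are interleaved.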

After each presentation, a response period of 4 s was given, and indicated by a starting and a finishing tone. Participants were asked to repeat the sentences they heard between the two tone signals, and to guess when unsure. Oral responses were recorded using a voice recorder and video-camera setup directly in front of them, to allow more accurate marking of their responses at a later time. The sentence recognition in noise task was scored at a word level (using the standard BKB/A scoring criteria) and performance was scored for each SNR condition.

### EEG

A soft-cap was used to facilitate the spatial separation of the electrodes. EEG data were recorded from 32 Ag-AgCl sintered electrodes using the 10–20 montage with a Synamps II amplifier. The ground electrode was located between the Fz and FPz electrodes. Electrode impedances were kept below 5 kΩ. Ocular movement was recorded with bipolar electrodes placed at the outer canthi, and above and below the left eye. Data were recorded at a sampling rate of 1000 Hz, with an online band-pass filter of 0.01 to 100 Hz and a notch filter at 50 Hz.

Post-acquisition, all cortical recordings were analyzed using Fieldtrip, an analysis toolbox in MATLAB developed by Oostenveld et al. (2011). The raw EEG data were first epoched between −2 and 6 s relative to the stimulus onset at 0 s which were then re-referenced to the combined mastoids. The re-referenced epochs were then bandpass filtered with the cut-off frequencies of 0.5 to 45 Hz. Eyeblink artifacts were rejected by transforming the sensor space data into independent components space data using independent component analysis ('runica'). The eyeblink artifacts were visually inspected and rejected by transforming the components data back into sensor space by excluding the identified eyeblink component(s). Movement related artifacts and noisy trials were rejected by visual inspections. The accepted trials were bandpass filtered again with cut-off frequencies between 8 and 12 Hz to extract alpha oscillations. Alpha band activity was extracted from the parietal electrodes (P3, P4, and Pz) during the encoding period (1 s duration finishing 200 ms before the end of the sentence) and was subtracted from the baseline in noise (300–800 ms after the noise onset) on a trial by trial basis, then averaged to obtain mean alpha power for each SNR. As no significant time-frequency electrode clusters were identified across the scalp during the sentence processing time period, alpha power in the parietal region was used in the current study. A time-frequency representation of the average EEG data collapsed across all of the signal-to-noise levels (**Figure 1**) illustrated the increased activity occurring in the alpha frequency-band averaged during the sentence presentations for both 16-channel and 6-channel noise vocoded sentences.
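The single-trial baseline comparison can be illustrated with a short sketch. It is not the FieldTrip pipeline used in the study. It assumes an epoch that has already been band-pass filtered to 8–12 Hz, takes power as the mean squared amplitude, and expresses the result as a percentage change relative to the baseline in noise; the direction and scaling of the comparison, and all function names, are assumptions.

```python
FS_HZ = 1000          # EEG sampling rate stated in the text
EPOCH_START_S = -2.0  # epochs run from -2 to 6 s around stimulus onset


def window(samples, start_s, stop_s):
    """Slice an epoch between two times given in seconds."""
    i0 = round((start_s - EPOCH_START_S) * FS_HZ)
    i1 = round((stop_s - EPOCH_START_S) * FS_HZ)
    return samples[i0:i1]


def mean_power(samples):
    """Mean squared amplitude (assumed power estimate)."""
    return sum(x * x for x in samples) / len(samples)


def alpha_change_percent(alpha_epoch, noise_onset_s, sentence_end_s):
    """Percent change of alpha power in the encoding window vs. baseline."""
    # baseline: 300-800 ms after noise onset
    baseline = window(alpha_epoch, noise_onset_s + 0.3, noise_onset_s + 0.8)
    # encoding: 1 s window finishing 200 ms before the end of the sentence
    encoding = window(alpha_epoch, sentence_end_s - 1.2, sentence_end_s - 0.2)
    p_base = mean_power(baseline)
    return 100.0 * (mean_power(encoding) - p_base) / p_base
```

Per-trial values computed this way would then be averaged within each SNR level for the three parietal electrodes.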

### Pupillometry

Pupil size was measured with a monocular (right eye) Eyelink 1000 eye-tracker sampling at 1000 Hz. Single-trial pupil data were processed through Dataviewer software (version 1.11.1), and compiled into single-trial pupil-diameter waveforms (0 s baseline to 6 s) for further offline processing and analyses performed using MATLAB. Data were smoothed using a 5-point moving average.

Blinks were identified in each trial as pupil sample sizes that were smaller than three standard deviations below the mean pupil diameter. Trials where more than 15% of the trial samples were detected as in a blink (which also occurred when the participants were looking away from target) were rejected. In accepted trials, samples within blinks were interpolated from between 66 ms preceding the onset of a blink to 132 ms following the end of a blink. Accepted trials were averaged to form condition-specific pupil size waveforms to represent change of pupil dilation across the trial. For each participant a threshold of 135 or more accepted trials in both the 6- and the 16-channel blocks had to be met to not be excluded, so that a meaningful condition average may be formed. The average of accepted trials for each participant was 193, or 13 trials per SNR.
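The blink handling can be sketched as below. This is an illustrative reconstruction, not the MATLAB code used in the study. It assumes linear interpolation, and it takes the reference mean and standard deviation as inputs because the text does not state which samples they were computed over (a threshold derived from a single trial's own statistics could never flag more than 10% of that trial's samples, so the 15% rejection rule implies a wider reference). Function and parameter names are hypothetical.

```python
FS_HZ = 1000               # eye-tracker sampling rate stated in the text
PAD_BEFORE = 66            # samples (= ms at 1000 Hz) before blink onset
PAD_AFTER = 132            # samples after blink offset
MAX_BLINK_FRACTION = 0.15  # reject trials with more blink samples than this


def clean_trial(trace, ref_mean, ref_sd):
    """Return the trace with blinks linearly interpolated, or None if the
    trial is rejected. ref_mean/ref_sd: assumed reference statistics."""
    threshold = ref_mean - 3 * ref_sd
    mask = [x < threshold for x in trace]
    if sum(mask) > MAX_BLINK_FRACTION * len(trace):
        return None                        # too much blink data

    out = list(trace)
    n = len(trace)
    i = 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        j = i
        while j < n and mask[j]:
            j += 1                         # j = first sample after the blink
        left = max(i - PAD_BEFORE, 0)
        right = min(j - 1 + PAD_AFTER, n - 1)
        for k in range(left + 1, right):   # straight line between anchors
            t = (k - left) / (right - left)
            out[k] = trace[left] + t * (trace[right] - trace[left])
        i = j
    return out
```

The asymmetric padding removes the partial-occlusion samples around each blink, where the measured diameter is already distorted before the threshold is crossed.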

For each trial, the mean pupil size measured between 0 and 2 s was subtracted from the peak pupil size identified between 2 and 6 s (see **Figure 2** for an example of the pupil response during the experiment).
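The baseline-corrected peak measure described above amounts to the following short sketch, which assumes a cleaned single-trial trace sampled at 1000 Hz and starting at 0 s; the function name is hypothetical.

```python
FS_HZ = 1000  # samples per second


def peak_dilation(trace):
    """Peak pupil size in the 2-6 s window minus the mean of the 0-2 s window."""
    baseline = trace[0:2 * FS_HZ]
    response = trace[2 * FS_HZ:6 * FS_HZ]
    return max(response) - sum(baseline) / len(baseline)


# hypothetical trace (mm): flat baseline, brief peak at 3 s
example = [4.0] * 3000 + [4.5] + [4.2] * 2999
print(peak_dilation(example))
```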

### Behavioral Measures

A behavioral test session was conducted with each participant to obtain a self-reported measure of effort during the sentence recognition task, which could be later compared to the physiological measures. This measure could not readily be obtained during the physiological test session because of the randomization of SNRs at each trial. The behavioral testing was performed in an acoustically treated room, with the equipment calibrated prior to each participant's session. The speaker was positioned one meter from the participant at 0° azimuth. An adaptive procedure was chosen to obtain effort ratings at three SNRs around the mid-range of each participant's performance-intensity function. The speech-in-noise algorithm and software used were developed by the National Acoustic Laboratories to obtain speech reception thresholds (SRT, the signal to noise ratio at which 50% of words were correctly perceived; see Keidser et al., 2013 for a comprehensive review). Target sentences were presented at 65 dB and the background noise was modulated using an adaptive procedure. The participant's SRT was calculated when the standard error was less than 0.8 dB. The noise was then presented at a fixed level based on the participant's SRT with 1 list (16 sentences), to validate the accuracy of the initial SRT calculation. Finally, the noise was fixed at −3 and +3 dB relative to their SRT and two lists per condition were presented, so that performance could be measured in easier and more difficult conditions. Thus, the conditions presented were: 50%SRT, 50%SRT(−3 dB), and 50%SRT(+3 dB) in the 16- and 6-channel vocoded conditions. All presentations were counterbalanced across participants for level and vocoding. After each presentation, participants were asked to rate the perceived effort invested in each SRT condition on a Borg CR10 scale (Borg, 1998).

### Statistical Methods

Linear mixed-effects models with a random intercept for individual were used for all analyses to control for repeated measures over different levels of SNR on individuals. While random slopes were also of interest, these models failed to converge and were therefore not utilized.

Models for SRS were built by comparing a model with SNR, presentation mode and channel vocoding to a model containing SNR, presentation mode, channel vocoding and the interaction between SNR and presentation mode. The terms were fitted in the order described although no result difference was found if they were added to the model in a different order. Likelihood ratio tests were used to compare fixed effects of the simpler and more complex models after fitting the model using maximum likelihood. Where an interaction was not significant, the main effects model results were reported. All categorical variables used treatment contrasts (whereby all levels were compared with a reference level). P-values less than 0.05 were considered significant for all analyses.

Models for perceived effort, pupil size and alpha power were built by comparing a model with SNR and channel vocoding as main effects to a model with an interaction between SNR and channel vocoding. Because visual inspection of the change in pupil size and alpha power over SNRs suggested non-linear changes for one or both channels, models sequentially including a quadratic term for SNR (i.e., SNR<sup>2</sup>) and then a cubic term for SNR (i.e., SNR<sup>3</sup>) with an interaction between each term and vocoding channel were used to determine if the effects were similar for both channels. Again, likelihood ratio tests were used to compare models. These models are reported separately by channel vocoding (6 and 16) to aid interpretation. Models with a quadratic term are used to describe a simple curvilinear change while cubic terms are used to explain more complicated curvature with more than one change in the direction of the curve.
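For orientation, the most complex model described above can be written out as follows. The notation is ours rather than the authors': $y_{ij}$ is the outcome (e.g., pupil size change or alpha power change) for participant $i$ on observation $j$, $C_{ij}$ is an indicator for 16-channel vocoding (taking 6-channel as the reference level), and $u_i$ is the random intercept.

```latex
\begin{aligned}
y_{ij} ={}& \beta_0 + \beta_1\,\mathrm{SNR}_{ij} + \beta_2\,C_{ij}
          + \beta_3\,\mathrm{SNR}_{ij}\,C_{ij} \\
        & + \beta_4\,\mathrm{SNR}_{ij}^{2} + \beta_5\,\mathrm{SNR}_{ij}^{2}\,C_{ij}
          + \beta_6\,\mathrm{SNR}_{ij}^{3} + \beta_7\,\mathrm{SNR}_{ij}^{3}\,C_{ij}
          + u_i + \varepsilon_{ij}, \\
u_i \sim{}& \mathcal{N}(0,\sigma_u^{2}), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0,\sigma^{2}).
\end{aligned}
```

Setting $\beta_6=\beta_7=0$ gives the quadratic model, and additionally setting $\beta_4=\beta_5=0$ gives the linear interaction model. Two nested models fitted by maximum likelihood, with log-likelihoods $\ell_{\text{complex}}$ and $\ell_{\text{simple}}$ and differing by $\Delta p$ fixed-effect parameters, are compared with the likelihood ratio statistic

```latex
\Lambda = 2\left(\ell_{\text{complex}} - \ell_{\text{simple}}\right)
\;\overset{\text{approx.}}{\sim}\; \chi^{2}_{\Delta p}.
```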

To account for the use of repeated measures on individuals, correlations presented in the results section are the average of the correlations calculated for each individual. Analyses were performed in R version using the nlme Package. This study was conducted under the ethical oversight of the Human Research Ethics Committee at Macquarie University (Ref: 5201100426).
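The averaging of per-individual correlations mentioned above can be sketched as follows. The sketch assumes Pearson correlations and a plain arithmetic mean across participants, since no transformation before averaging is mentioned; the data and names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_within_subject_r(data):
    """data maps participant id -> (x_values, y_values)."""
    rs = [pearson_r(x, y) for x, y in data.values()]
    return sum(rs) / len(rs)


# hypothetical example: SNR (dB) vs. speech recognition score (%)
data = {
    "p01": ([-3, 0, 3], [22, 51, 78]),
    "p02": ([-3, 0, 3], [35, 60, 90]),
}
print(round(mean_within_subject_r(data), 3))
```

Computing the correlation within each participant first avoids treating repeated observations from the same person as independent data points.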

# RESULTS

# Performance-Intensity Functions and Effort Ratings

Performance-intensity functions were measured during the behavioral test session (using 3 fixed SNRs per participant) and the physiological test session (using randomized SNRs across the 15 levels of noise). As seen in **Figure 3A**, SRSs measured during the physiological test session increased with SNR (p < 0.001) for both vocoding levels [16 ch: r = 0.93 (95% CI: 0.92 to 0.94); 6 ch: r = 0.92 (95% CI: 0.91 to 0.94)]. As expected, SRSs were significantly greater with the 16-channel material compared to the 6-channel (mean difference 26.72%, 95% CI: 22.12 to 31.31%, p < 0.001, **Table 1**). **Figure 3B** displays the performance-intensity functions where the three SNR levels presented in the behavioral session (fixed presentation) were matched to the same three SNRs measured during the objective session (randomized presentation). There was no evidence for a difference in the pattern of change in SRS between the fixed and random modes of presentation across the SNR levels, after adjusting for channel vocoding (p = 0.50, **Table 1**). For the 16-channel vocoding, for every unit increase in SNR, SRS increased by 6.44% (95% CI: 5.07 to 7.82%) for the fixed versus 6.47% (95% CI: 5.12 to 7.82%) for the randomized presentation, showing that the slopes by mode of presentation overlap considerably. Similarly, for the 6-channel vocoding, for every unit increase in SNR, SRS increased by 5.47% (95% CI: 4.29 to 6.64%) for the fixed versus 6.65% (95% CI: 5.13 to 8.18%) for the randomized presentation.

**Figure 3C** shows the mean effort ratings measured after each of the fixed SNR sentence blocks. There was no interaction between SNR and channel vocoding (p = 0.26, **Table 1**) indicating no evidence of a different pattern of effort over SNR between the two channels. Excluding the interaction term, LME regression confirmed that perceived effort averaged over channels significantly decreased (p < 0.001) with increasing SNR (−0.55, 95% CI: −0.65 to −0.45). SRS with 6-channel vocoding required on average 2.10 units more effort than 16-channel vocoding (95% CI: 1.47 to 2.74; p < 0.001).

### EEG Analyses

### Effect of Vocoding on Baseline Alpha

A LME regression was used to examine the effect of vocoding (conducted during different test sessions) on alpha power during baseline. No significant difference was found between 16- and 6-channel vocoding (mean difference = 0.69 µV<sup>2</sup>, 95% CI: −1.47 to 2.85, p = 0.53). This suggests that overall, participants had similar alpha power baselines on both test sessions.

### Alpha Power Change and SNR

Alpha power was processed as a relative change from baseline in noise, for each trial. A LME regression model suggested a significant interaction effect between SNR and channel vocoding on alpha power change (p = 0.01, **Table 1**). Specifically, for the 6-channel vocoding, there was no evidence of a change in alpha power over the different SNRs (0.01%, 95% CI: −2.38 to 2.41%; p = 0.99) while for the 16-channel vocoding, for every unit increase in SNR, alpha power decreased by 4.34% (95% CI: 1.94 to 6.73% decrease; p < 0.001). Non-linear models using a quadratic or cubic term for both channel vocoding did not improve model fit compared to a linear model (log likelihood −2632.18 vs. −2632.57, p = 0.68 and −2631.87 vs. −2632.57, p = 0.84, respectively). As seen in **Figure 4**, the largest separation between 16- and 6-channel vocoding was in the most challenging (lower) SNRs.
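The likelihood ratio comparisons reported above can be checked by hand from the log-likelihoods. The sketch below assumes that the quadratic model adds two fixed-effect parameters to the linear model (SNR<sup>2</sup> and its interaction with channel) and that the cubic model adds four; these degrees of freedom are inferred from the model description rather than stated. It uses the closed-form chi-square tail probability for even degrees of freedom so that no external library is needed.

```python
from math import exp


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, even df only:
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!"""
    if df <= 0 or df % 2:
        raise ValueError("closed form implemented for positive even df only")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total


def lrt_p_value(loglik_complex, loglik_simple, extra_params):
    """p-value of a likelihood ratio test between two nested models."""
    statistic = 2.0 * (loglik_complex - loglik_simple)
    return chi2_sf_even_df(statistic, extra_params)


# alpha power: quadratic vs. linear, then cubic vs. linear (assumed df 2 and 4)
print(round(lrt_p_value(-2632.18, -2632.57, 2), 2))  # 0.68
print(round(lrt_p_value(-2631.87, -2632.57, 4), 2))  # 0.84
```

Under these assumed degrees of freedom the computed values agree with the reported p = 0.68 and p = 0.84.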

### Pupil Analyses

### Pupil Size Change from Baseline

For the pupil size, an LME model was fitted to verify the effect of vocoding (conducted during different test sessions) on baseline, while controlling for repeated measures. The pupil size during baseline was found to be significantly larger during the second session [6-channel (harder condition); mean difference = 0.56 mm, 95% CI: 0.47 to 0.64 mm, p < 0.001].

Channel 6 was the reference level for channel; Random was the reference level for presentation mode. SRS, Speech Recognition Score; SNR, Signal-to-Noise-Ratio; SNR<sup>2</sup>, quadratic model; SNR<sup>3</sup>, cubic model.

### Pupil Size Change and SNR

Looking at the pupil size change relative to baseline (**Figure 5**), an LME regression model with only a linear term in SNR indicated a significant interaction effect between vocoding and SNR (p < 0.001, **Table 1**). For every unit increase in SNR, pupil size significantly increased by 0.007 mm (95% CI: 0.001 to 0.014 mm; p = 0.02) for the 6-channel vocoding while it significantly decreased for the 16-channel vocoding (mean change −0.008 mm, 95% CI: −0.015 to −0.002 mm; p = 0.01). Visual inspection of the relationship between pupil size and SNR indicated a potential non-linear relationship. Indeed, a mixed-effects model for pupil diameter containing a cubic term for SNR (**Table 1**) had a significantly better fit than a linear model (log likelihood 97.6 versus 92.4, p = 0.04) or a quadratic model (log likelihood 97.6 versus 94.4, p = 0.04). An interaction between the cubic term and channel was significant (p = 0.03). Examination of the relationship between pupil size and SNR within each channel indicated that with 16-channel vocoding, there was no significant effect of a quadratic term (p = 0.34) or cubic term in SNR (p = 0.46), while there was strong evidence of a cubic relationship (p = 0.01) for the 6-channel vocoding.

### Individual Alpha Power versus Pupil Size Change Comparisons

At the individual level, alpha power change was not found to be significantly correlated (p > 0.05) with pupil size change for either the 16-channel (mean r = 0.05, 95% CI: −0.16 to 0.26) or the 6-channel vocoding (mean r = −0.10, 95% CI: −0.35 to 0.16).

# DISCUSSION

The results of this study suggest that, while there was a significant and expected difference in speech recognition performance and effort rating between the 6- and 16-channel vocoded material across the 15 SNRs, the mean changes observed in the physiological measures (alpha power and pupil size) were less predictable. Significant relationships were found between mean pupil dilation and SNR, and between mean alpha power and SNR, for 16-channel vocoded sentences, showing a similar trajectory of change; i.e., larger pupil responses and larger alpha power change were measured for less intelligible speech. For the pupil response only, there was also a significant non-linear relationship with SNR with the 6-channel vocoded sentences, whereby pupil dilation was larger in the hardest and easier conditions. This is perhaps consistent with the non-linear change in pupil dilation with changes in task difficulty that has been shown previously (Granholm et al., 1996; Zekveld and Kramer, 2014). Further, significant interactions between SNR and vocoding were seen in both physiological measures, although the largest difference in alpha power change was observed in the least intelligible conditions (more negative SNRs) whereas the largest difference in pupil dilation was observed in the most intelligible conditions (more positive SNRs).

The linear association between SNR and pupil dilation for the 16-channel vocoded sentences, and the comparatively larger pupil dilation for the 6-channel compared with the 16-channel vocoded sentences at more positive SNRs (≥+2 dB), is similar to that observed in previous studies, i.e., larger pupil size is observed with greater cognitive load (Kahneman and Beatty, 1966; Granholm et al., 1996; Winn et al., 2015). Larger pupil dilation relative to baseline is typically measured during more cognitively demanding speech processing tasks. Examples include poorer SNRs (Zekveld et al., 2010), greater spectral degradation with channel vocoding (Winn et al., 2015), single-talker compared with noise maskers (Koelewijn et al., 2012), randomized SNRs compared
with fixed SNRs (Zekveld and Kramer, 2014), grammatical complexity (Schluroff, 1982) or perceptual effort with hearing loss (Kramer et al., 1997). Certainly the results of the current study support an increase in pupil dilation for the most challenging SNRs with the 16-channel vocoded sentences. However, the relationship between pupil dilation and SNR for the 6-channel vocoded sentences in the current study was not simple, where the mean pupil dilation across subjects plateaued for moderately negative SNRs and showed an increase with increasing speech intelligibility. It is possible that the changes in the pupil size across the 15 SNRs for the 6-channel vocoded sentences could reflect the non-linear behavior of the pupil size that has been observed when task difficulty exceeds capacity (Peavler, 1974; Granholm et al., 1996; Zekveld and Kramer, 2014). For example, it has been demonstrated that pupil dilation systematically increases with task difficulty (such as with a digit recall task), until it reaches or exceeds the limits of available cognitive resources, whereby it either asymptotes (Peavler, 1974), declines (Granholm et al., 1996), or shows both a decline followed by an asymptote for the most challenging intelligibility conditions (Zekveld and Kramer, 2014). An alternative explanation is that the noise levels per se could have influenced pupil dilation at the more negative SNRs (noise levels reached a maximum of 72 dB), where mean pupil dilation for both 16- and 6-channel vocoded sentences was similar. While Zekveld and Kramer (2014) attempted to reduce the likelihood of noise affecting pupil dilation by controlling the overall signal level while changing the SNR, in the current study, a fixed signal level was used with modulated levels of noise. 
Pupil dilation has been shown to be modulated by acute stress (Valentino and Van Bockstaele, 2008; Laeng et al., 2012), and animal studies have demonstrated that long-term exposure to non-traumatic noise is associated with increased cortisol levels, hypertension and reduced cardiovascular function (see Gourévitch et al., 2014 for a review). A recent study looking at physiological measures of stress during listening in noise found that adults with hearing loss, who are constantly exposed to degraded speech, had higher autonomic system reactivity compared to adults with normal hearing, at similar performance levels (Mackersie et al., 2015). Therefore, while the noise exposure in the current study was short-term, it may have caused a phasic stress reaction which could have influenced pupil dilation. This hypothesis, however, is not supported by studies suggesting that the pupil dilates with negative affect (Partala and Surakka, 2003).

The change in mean alpha power, relative to baseline, showed an enhancement of alpha activity in both 16-channel and 6-channel vocoding conditions, consistent with the inhibition hypothesis, where activity that is not related to the goal-directed task is actively inhibited (Klimesch et al., 2007). Therefore, it has been suggested that alpha enhancement which occurs during a speech-in-noise task results from the enhancement of auditory attention through the active suppression of noise (Strauß et al., 2014). However, most studies assessing alpha power change with vocoded speech material (Obleser and Weisz, 2012; Becker et al., 2013; Strauß et al., 2014) or during the processing of semantic information (Klimesch et al., 1997) have shown a reduction of alpha power, which is consistent with active cognitive processing of speech information. Specifically, the results of the current study appear contradictory to those reported by Obleser and Weisz (2012) using noise-vocoded (2-, 4-, 8-, and 16-channel) mono-, bi-, and tri-syllabic words. They showed less suppression of posterior-central alpha power with decreasing intelligibility, measured between 800 and 900 ms post word onset. However, the task across the two studies was not the same. In the current study, participants were required to repeat the vocoded sentences, whereas in the Obleser and Weisz (2012) study, participants were asked to rank the comprehension of vocoded words without attending to the linguistic or acoustic aspects of the speech materials. While previous studies have shown a very high correlation between SRSs and rating scores, it is unclear whether the pattern of event-related oscillatory cortical activity measured during these different tasks is the same. Further, the types of analyses conducted across studies are not the same. For example, while Becker et al.
(2013) demonstrated that mean alpha power during the region of interest (ROI) between 480 and 620 ms is reduced as speech intelligibility is increased (using monosyllabic French words), this was an absolute measure of alpha power rather than a change relative to the baseline. Variability of whether alpha power was increased or decreased was observed within studies. For example, Becker et al. (2013) showed the mean trajectory of change in alpha power during noise-vocoded monosyllabic words and demonstrated that alpha power is enhanced in the less intelligible conditions (similar to our results) but is suppressed in the most intelligible conditions (similar to the results shown by Obleser and Weisz, 2012). Further, using an auditory lexical decision task, Strauß et al. (2014) demonstrated mean increases of alpha power occurred for clear pseudo-words but a reduction was observed for ambiguous and real-words, which parametrically changed as the clarity of the words increased. Finally, using 18 younger and 20 older healthy adults, Wöstmann et al. (2015) demonstrated that decreases in mean alpha power which occurred as speech intelligibility increased (using four syllable digits masked by a single speaker) appeared to be driven by the older adults rather than an effect across the entire population. Given the differences in the types of speech stimuli used across the different studies, the task required, as well as the ROI used to assess alpha power changes (i.e., during or after the speech tokens), and the different populations assessed (older versus younger adults), further investigation of alpha power is needed to better understand the changes observed and how this might be used as an objective measure of attentional effort and/or cognitive load for the individual.

Within the current study, while a significant interaction was found between 6- and 16-channel vocoding for both alpha power and pupil size change, the trend patterns differed. The magnitude of the difference between both vocoding levels was greater at the most challenging SNRs for alpha power, but at the least challenging SNRs for the pupil size. This could suggest that these physiological responses are driven by different neurophysiological or attentional networks (Corbetta and Shulman, 2002; Corbetta et al., 2008; Petersen and Posner, 2012). There is a vast literature on attentional effort which suggests that discrete neuroanatomical areas encode specific cognitive operations ("processors") that are involved in attention, which are modified by "controllers" depending on the type of
attentional tasks required (see Power and Petersen, 2013). While the majority of the literature in this field focuses on the visual modality, there is evidence to suggest that similar processes should be evident when listening to degraded speech, such as listening in noise (Spagna et al., 2015). The main determinants of attentional allocation would then be: the identification of the appropriate processing strategy needed to undertake the speech perception task, the maintenance of attention during the task, and the processing of errors to increase (or, at least, reduce declines in) performance. Further, these processes may work synergistically under less cognitively demanding conditions but diverge under more challenging conditions, or conditions which have different types of attentional requirements (Vossel et al., 2014). It is also possible that different processors and controllers are used by different individuals to undertake these cognitively demanding tasks, which may have led to a lack of correlation between alpha power change and pupil dilation change within individuals. Corbetta and Shulman (2002) proposed the existence of two anatomically distinct attention networks: the dorsal fronto-parietal network, which is involved in the top–down voluntary or goal-directed allocation of attention (which includes preparatory attention and orienting within memory), and the ventral fronto-parietal network, which is involved in involuntary shifts in attention. It is proposed that under normal circumstances, the ventral network is suppressed but is activated by unexpected, novel, salient, or behaviorally relevant events. Where this occurs, it is assumed that a "circuit-breaking" signal is sent to the dorsal attention network, resulting in reorienting, or shifting of attention toward this new event (Corbetta et al., 2008).
It has been proposed that the locus coeruleus-norepinephrine (LC-NE) system modulates the functional integration of the entire cortical attentional system (Corbetta et al., 2008; Sara, 2009), whereby NE released by the LC triggers the ventral network to interrupt the dorsal attention network (Bouret and Sara, 2005) and reset attention. This ensures a coordinated rapid and adaptive neurophysiological response to spontaneous or conditioned behavioral imperatives (Sara and Bouret, 2012).

Pupil dilation is under the control of the LC-NE system; therefore, it may be reasonable to assume that indirect attention tasks may be associated with the changes in pupil dilation observed in the current study. It has been proposed that pupil dilation is modulated by both staying on task and choosing between alternatives (exploration; Aston-Jones and Cohen, 2005). Therefore, a complex task, such as the perception and comprehension of a moderately intelligible (vocoded) speech signal, may result in changes in pupil dilation that reflect the interaction between different processing strategies. Alpha power changes have been associated with top–down inhibition of task-irrelevant brain regions, and it has been suggested that alpha power is under the control of the dorsal attention network (Zumer et al., 2014). Further, increases in alpha power may inhibit the ventral attention network, preventing reorienting to irrelevant stimuli during goal-directed cognitive behavior (Benedek et al., 2014). While other models of attention exist (Seeley et al., 2007; Petersen and Posner, 2012), it is clear that a simple association between a physiological measure of attentional effort and task difficulty (e.g., changes in speech intelligibility) fails to consider the multiple autonomic cognitive operations as well as the voluntary control of attention that reflects effortful cognitive control (see Sarter et al., 2006). It is recognized that there is a dynamic interplay between the bottom–up sensory information and the top–down cognitively controlled factors (which may be either under automatic or voluntary control), such as knowledge, expectations and goals, that can be modulated by motivational factors, such as payment for participation (Tomporowski and Tinsley, 1996), and genetic influences (Fan et al., 2003). Therefore, it is reasonable to assume that considerable variability in attentional allocation could exist between individuals undertaking a highly complex task.

An alternative explanation is that the within-subject variability of sustaining on-task attention toward sentences with unpredictable levels of intelligibility was greater under the more challenging noise vocoding conditions (6-channel), where the effort-reward balance was not as high compared with the 16-channel vocoded materials. Sustaining attention on a complex task is challenging (Warm et al., 2008) and requires suppression of internal tendencies of mind-wandering, a default network activation that typically occurs during low task demands (Christoff et al., 2009; Gruberger et al., 2011), with concomitant activation of the goal-directed dorsal fronto-parietal attentional network (Corbetta and Shulman, 2002). Fluctuations in sustained attention can occur with stress, distraction with competing stimuli, fatigue, or lack of motivation toward the task, and are commonly associated with a decline in performance (Hancock, 1989; Esterman et al., 2012). As stated by Esterman et al. (2012), "as the neural systems supporting task performance appear to shift with one's attentional state, failure to account for attentional fluctuations may obscure meaningful information about underlying mechanisms". Certainly, some people have a propensity for mind-wandering (Mason et al., 2007). This may be a confounder to the results of the current study comparing physiological responses to a range of SNRs, despite the ecological validity that this may have to their ability to follow conversations within multi-talker environments. That is, the variability in the physiological measures may, in fact, provide important information about the individual's processing of degraded speech that is not captured within more common behavioral measures of speech perception. For example, a recent study by Kuchinsky et al.
(2016) suggests that individual differences in the pupillary response of older adults with hearing loss during a monosyllabic word recognition task were related to task vigilance (less variability in response time) and to the extent of primary auditory cortical activity. Therefore, pupil dilation may index the magnitude of the engagement between bottom–up sensory and top–down cortical processing, which is increased with greater degradation of the speech signal (influenced by poorer SNR, reduced spectral information, or hearing loss).

Significant differences in the baseline data were also observed between the 6- and 16-channel vocoding for pupil size, but not for alpha power. These two levels of vocoding were assessed during different sessions for all subjects; therefore, this could be due either to a session effect or to a difference in the level of cognitive effort that was maintained throughout the session. Given that the results are consistent with an increase in cognitive load during the 6-channel vocoded session, it is likely that the difference in the tonic pupillary response across the two physiological measurement sessions (16- versus 6-channel vocoded-sentence tasks) resulted from differences in vigilance or the awareness of errors in performance during the more cognitively challenging task (Critchley, 2005; Ullsperger et al., 2010).

Limitations of the study include the relatively small number of participants included in the final data analysis (particularly for pupillary measures), and that only 16 sentences were presented for each SNR level (scored as 50 words across the set of 16 sentences) in each condition, reducing statistical power. Further, the test set-up restricted people from responding normally to an effortful task (i.e., a number of participants tended to close their eyes during the stimulus presentation but were instructed to keep their eyes open). Explicitly investing effort in trying to keep their eyes open despite the natural tendency to want to close them may have in itself created changes in pupil size and alpha oscillations. This may also have added an additional stressful component to the task.

### CONCLUSION

The results of this study suggest that the relationship between task difficulty and both pupil dilation and alpha power change was similar for the 16-channel vocoded sentences (high intelligibility), which might suggest that the attentional networks are operating with high concordance, or in a consistent and predictable manner across the SNRs. However, further degradations in the speech intelligibility, using the 6-channel vocoded materials, could have produced a discordant relationship between the attention networks, or different processors (such as linguistic strategies) may have been used to comprehend the speech signal. Nevertheless, given the considerable interest in assessing listening effort within clinical settings (see McGarrigle et al., 2014), it is important to ensure that we have a solid understanding of what these physiological measures are assessing, and how to interpret the responses for the individual. Certainly, the results of this study do not currently support the clinical use of these physiological techniques as sufficiently sensitive to provide complementary information about listening effort to existing measures of speech perception performance. To be clinically viable in a hearing rehabilitation setting, such objective indices of effort should be more sensitive to changes in auditory input than existing measures of speech perception performance or subjective ratings of effort. The behavior of these indices should also be predictable across a range of performances and speech degradation to be applicable to the range of hearing loss and devices available, including hearing aids and cochlear implants.

### AUTHOR CONTRIBUTIONS

Original idea: PL, CM, IB. Protocol development: PL, IB, CM, RI. Data collection: CYL, LG, PL, IB. Writing of manuscript: CM, IB, LG, KM, CYL. Data processing and analyses: RI, PL, PG, IB, CM, KM.

### FUNDING

This study was supported by the HEARing CRC, established and supported by the Cooperative Research Centres Programme – Business Australia.

### ACKNOWLEDGMENTS

The authors thank Jörg Buchholz for support with protocol development, Mike Jones for support with statistical analyses, and Gabrielle Martinez for help with data collection.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 McMahon, Boisvert, de Lissa, Granger, Ibrahim, Lo, Miles and Graham. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Impact of Background Noise and Sentence Complexity on Processing Demands during Sentence Comprehension

Dorothea Wendt<sup>1,2</sup>\*, Torsten Dau<sup>1</sup> and Jens Hjortkjær<sup>1,3</sup>

<sup>1</sup> Hearing Systems, Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Kongens Lyngby, Denmark, <sup>2</sup> Eriksholm Research Centre, Snekkersten, Denmark, <sup>3</sup> Danish Research Centre for Magnetic Resonance, Centre for Functional and Diagnostic Imaging and Research, Copenhagen University Hospital Hvidovre, Hvidovre, Denmark

Speech comprehension in adverse listening conditions can be effortful even when speech is fully intelligible. Acoustical distortions typically make speech comprehension more effortful, but effort also depends on linguistic aspects of the speech signal, such as its syntactic complexity. In the present study, pupil dilations and subjective effort ratings were recorded in 20 normal-hearing participants while they performed a sentence comprehension task. The sentences were either syntactically simple (subject-first sentence structure) or complex (object-first sentence structure) and were presented in two levels of background noise, both corresponding to high intelligibility. A digit span and a reading span test were used to assess individual differences in the participants' working memory capacity (WMC). The results showed that the subjectively rated effort was mostly affected by the noise level and less by syntactic complexity. Conversely, pupil dilations increased with syntactic complexity but only showed a small effect of the noise level. Participants with higher WMC showed increased pupil responses in the higher-level noise condition but rated sentence comprehension as being less effortful compared to participants with lower WMC. Overall, the results demonstrate that pupil dilations and subjectively rated effort represent different aspects of effort. Furthermore, the results indicate that effort can vary in situations with high speech intelligibility.

Keywords: effort, processing demands, pupillometry, syntactic complexity, background noise, working memory capacity, reading span, digit span

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Jonathan E. Peelle, Washington University in St. Louis, USA; Frederick Jerome Gallun, National Center for Rehabilitative Auditory Research, USA

> \*Correspondence: Dorothea Wendt wendt@elektro.dtu.dk

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 26 November 2015 Accepted: 24 February 2016 Published: 10 March 2016

### Citation:

Wendt D, Dau T and Hjortkjær J (2016) Impact of Background Noise and Sentence Complexity on Processing Demands during Sentence Comprehension. Front. Psychol. 7:345. doi: 10.3389/fpsyg.2016.00345

# INTRODUCTION

Speech communication provides a major basis for human interaction. Speech intelligibility has traditionally been measured in terms of the speech reception threshold (SRT), which reflects the signal-to-noise ratio (SNR) at which 50% of the words or sentences have been correctly recognized. However, these measures are typically obtained at low SNRs which do not correspond to everyday listening situations that typically take place at SNRs of +5 to +15 dB (Smeds et al., 2015). In such more realistic communication situations, despite the fact that speech intelligibility is high, people may experience considerable difficulties when listening to speech. There has recently been growing interest in identifying the factors that cause these difficulties and attempts have been made to characterize the processing demand or processing load (Johnsrude and Rodd, 2015) involved in speech comprehension (Gosselin and Gagné, 2010, 2011; McGarrigle et al., 2014).
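To make the SRT definition given in the Introduction concrete, the following Python sketch estimates the SNR at which 50% of words are recognized by linear interpolation between measured points. The data values, the function name `estimate_srt`, and the use of simple interpolation are illustrative assumptions; they are not taken from the study, and SRTs are in practice usually obtained with adaptive procedures or fitted psychometric functions.

```python
def estimate_srt(snrs_db, percent_correct, target=50.0):
    """Return the SNR (dB) at which recognition crosses `target` percent.

    Assumes `snrs_db` is sorted in ascending order and that scores rise
    with SNR; linearly interpolates between the two bracketing points.
    """
    points = list(zip(snrs_db, percent_correct))
    for (snr_lo, score_lo), (snr_hi, score_hi) in zip(points, points[1:]):
        if score_lo <= target <= score_hi and score_hi > score_lo:
            fraction = (target - score_lo) / (score_hi - score_lo)
            return snr_lo + fraction * (snr_hi - snr_lo)
    raise ValueError("target score is not bracketed by the data")

# Hypothetical word-recognition scores (percent correct) at five SNRs.
snrs = [-12.0, -9.0, -6.0, -3.0, 0.0]
scores = [5.0, 20.0, 60.0, 90.0, 98.0]
srt = estimate_srt(snrs, scores)
print(f"Estimated SRT: {srt:.2f} dB SNR")
```

With these made-up scores the 50% point falls between −9 and −6 dB SNR, well below the +5 to +15 dB range cited for everyday listening situations.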

Processing demands can be imposed by two factors: stimulus-related factors that are associated with properties of the stimulus (e.g., noise degradation or linguistic complexity), and listener-related factors that reflect the perceptual and cognitive abilities of the listener [e.g., hearing impairment or working memory capacity (WMC)]. Regarding stimulus-related factors, the degradation of the speech signal due to the presence of background noise has been demonstrated to have an impact on the processing demand (e.g., Rabbitt, 1968; Pichora-Fuller and Singh, 2006). Varying the SNR can thus be used to induce higher or lower processing demand during speech comprehension, such that a higher amount of noise imposes a higher processing demand. Linguistic aspects, such as syntactically complex sentence structures, have been shown to decrease speech comprehension (Just and Carpenter, 1992), decrease sentence intelligibility (Uslar et al., 2013) and increase the sentence processing duration (Wendt et al., 2014, 2015). Hearing impairment, as a listener-related factor, typically degrades the representation of the speech signal in the auditory system which, in turn, can affect speech recognition (e.g., Plomp and Mimpen, 1979; Wingfield et al., 2006) and the sentence processing duration (Wendt et al., 2015). Moreover, cognitive abilities, such as a person's WMC, have been related to speech recognition performance (e.g., Lunner, 2003; Akeroyd, 2008). It has been suggested that individual cognitive resources can be utilized to partly compensate for changes in the processing demand imposed by stimulus-related factors, even though the relationship between cognitive abilities and processing demand remains controversial (Ahern and Beatty, 1979; Verney et al., 2004; van der Meer et al., 2010).

The amount of cognitive resources utilized by a listener in a speech comprehension task can be defined in terms of effort (see also Johnsrude and Rodd, 2015). In other words, effort is a measure indicating the amount of resources deployed when processing speech, which depends on the interplay of the processing demand imposed by the stimulus-related factors (e.g., background noise, sentence complexity) and the listener-related cognitive abilities (such as WMC). A person's effort involved in speech comprehension has been measured using various methods and techniques (see McGarrigle et al., 2014 for a review). Subjective measures, such as perceived effort experienced during speech comprehension, have been tested using rating scales or questionnaires. Rudner et al. (2012) tested the effect of both noise level (in terms of SNR) and noise type (stationary vs. fluctuating) on subjective ratings of the perceived effort experienced in a sentence recognition task. It was found that the subjectively rated effort was affected by both the type of the background noise and the SNR. Although a fluctuating noise masker typically provides a release from masking (e.g., Festen and Plomp, 1990; Wagener et al., 2006), implying increased recognition rates compared to the condition with a stationary noise, listeners rated speech recognition in this noise condition to be more effortful. Rudner et al. (2012) also reported that rated effort increased with decreasing SNR consistent with other studies (Humes et al., 1999; Hällgren et al., 2005; Zekveld et al., 2010). Physiological correlates of processing effort include pupillary responses measured during speech tasks (see Kahneman and Beatty, 1966; Kahneman, 1973; Poock, 1973; Beatty, 1982; Granholm et al., 1996). 
More recently, there has been an increasing interest in measuring pupil dilations during speech perception in acoustically challenging situations (Kramer et al., 1997; Zekveld et al., 2010, 2011; Koelewijn et al., 2012; Kuchinsky et al., 2013). Zekveld et al. (2010, 2011) reported increased pupil dilations as an index of effort depending on speech intelligibility and type of background noise. Some studies have recorded subjective ratings of effort and pupil dilations in the same listeners (Zekveld et al., 2010, 2011; Koelewijn et al., 2012), but the relationship between the two measures has not yet been clarified. While Koelewijn et al. (2012) showed that the subjective ratings were positively correlated with pupil dilations during a speech recognition task, Zekveld et al. (2010) reported significant correlations between the rated effort and intelligibility but did not find any correlation between the subjectively rated effort and pupil dilations.

Working memory capacity has also been related to both subjective ratings and pupil dilations. Zekveld et al. (2011) reported a positive correlation between digit span test scores (as an index of WMC) and pupil dilations. Moreover, van der Meer et al. (2010) showed that listeners with higher fluid intelligence scores showed larger pupil dilations while performing a difficult task compared to individuals with lower scores. This led to the "resource hypothesis" (van der Meer et al., 2010) stating that individuals with better cognitive abilities, including higher WMC, allocate more resources, leading to a higher processing effort as reflected by larger pupil dilations. However, individuals with greater WMC have also been shown to rate listening as being less effortful (e.g., Rudner et al., 2012). This led to the "efficiency hypothesis" stating that individuals with higher cognitive resources report lower perceived effort due to more efficient processing (Ahern and Beatty, 1979; van der Meer et al., 2010). In line with this, the ease of language understanding (ELU) model suggests that it is less effortful for individuals with a high WMC to process a distorted speech signal (Rönnberg, 2003; Rönnberg et al., 2013). This seems to be in conflict with the resource hypothesis arguing that individuals with higher WMC engage more cognitive resources leading to higher effort. However, whereas the resource hypothesis is based on studies employing pupil response as a physiological correlate of effort, the ELU model refers to studies using subjective ratings as the indicator of effort. Thus, it may be that the two metrics represent different components of processing demand.

The present study attempted to distinguish between the outcomes obtained with rated effort vs. pupil dilation. Here, subjective ratings of effort, termed "perceived effort" (McGarrigle et al., 2014), were considered as an indicator of how effortful the participants experienced the process of speech comprehension to be. In contrast, pupil responses were considered as an indicator of "processing effort". Perceived effort and processing effort were measured in an audio-visual picture-matching paradigm. In this paradigm, the participant's task was to match a spoken sentence with a picture presented before the sentence. This paradigm was designed to capture several levels of speech processing involved in the comprehension of speech in background noise. This includes both lower-level perceptual processing, such as the separation of the speech signal from the background noise (Johnsrude and Rodd, 2015), and higher-level cognitive processes including linguistic and syntactic operations, such as a thematic assignment of the characters' role in the spoken sentence (see e.g., Wingfield et al., 2005). In the applied picture-matching paradigm, a mental assignment of the characters' roles (i.e., who is doing something to whom) is required to accomplish the comprehension task. By employing the paradigm, it was investigated how different levels of the SNR (at high speech intelligibility levels) and the variation of the syntactic complexity of the sentence structure affect perceived effort and processing effort. Furthermore, it was examined how individual participants' cognitive test scores were related to perceived effort and processing effort.

# MATERIALS AND METHODS

# Participants

Eleven female and nine male participants with normal hearing participated in the experiment, with an average age of 23 years (ranging from 19 to 36 years). The participants had pure tone hearing thresholds of 15 dB hearing level (HL) or better at the standard audiometric frequencies in the range from 125 to 8000 Hz. All participants performed better than 20/50 on the Snellen chart indicating normal or corrected to normal vision (according to Hetherington, 1954). All experiments were approved by the Science Ethics Committee for the Capital Region of Denmark.

# Stimuli

### Speech Material

Thirty-nine items from the German Oldenburg Linguistically and Audiologically Controlled Sentence corpus (OLACS, see Uslar et al., 2013) were translated into Danish and recorded. Each sentence describes two characters and an action being performed by one of the characters. All sentences contained a transitive full verb such as filme ("film" in **Table 1**), an auxiliary verb vil ("will"), a subject noun phrase den sure pingvin ("the angry penguin") and an object noun phrase den søde koala ("the sweet koala"). Each speech item was recorded with two different sentence structures in order to vary the complexity of sentences without changing word elements. Each sentence was realized either with a subject-verb-object structure (SVO I and II in **Table 1**) or with a syntactically complex object-verb-subject structure (OVS I and II in **Table 1**). While the SVO structure is canonical in Danish syntax and considered to be easy to process, written and spoken OVS sentences in Danish are more difficult to process (see Boeg Thomsen and Kristensen, 2014; Kristensen et al., 2014).

In both (SVO and OVS) sentence structures, the participants need to identify the semantic roles of the involved characters. The role assignment of the character that carries out the action (the agent) and the character that is affected by the action (the patient) is possible only after the auxiliary verb vil. Until the auxiliary verb, both sentence structures are ambiguous with respect to the grammatical roles of the involved characters and, thus, no thematic role assignment can be made. The auxiliary verb vil is either followed by the transitive verb filme ("film", see word 5 in **Table 1**), indicating a subject noun phrase at the beginning of the sentence, or by the article den ("the", see word 5 for the OVS I and II), informing the listener about the object role of the first noun. Since word 5 within each sentence provided the information required to perform the comprehension task, the onset of word 5 is defined as the point of target disambiguation (PTD) for all sentence structures (see **Table 1**). Care was taken in selecting actions, agents, and objects that were non-stereotypical for any of the characters (for example, baking is a typical action of a baker). This constraint was employed to make sure that the participants did not make premature role assignments based on any anticipation of an agent's characteristic action.

### Visual Material

Pictures from the OLACS picture set were used, which were created for eye-tracking purposes (see Wendt et al., 2014, 2015). Each sentence was presented with either a target or a competitor picture. The picture illustrating the situation as described in the spoken sentence was defined as target picture (left panel of **Figure 1**). The competitor picture showed the same characters

TABLE 1 | Examples of the two sentence structures that were presented in the audio-visual picture-matching task.


All sentences contained eight words and had either a subject-verb-object (SVO) or an object-verb-subject (OVS) structure. The onset of word five is defined as the point of target disambiguation (PTD).

and action but with the roles of the agent and patient interchanged (right panel of **Figure 1**). Both the competitor and the target picture were of the same size, and within each picture, the agent was always shown on the left side in order to facilitate fast comprehension of the depicted scene. There were always two sentences that potentially matched a given picture (i.e., an SVO and an OVS sentence for each picture). For instance, the left picture shown in **Figure 1** was used as target picture for sentence SVO I and OVS II in **Table 1**. All pictures were presented to the participants before they performed the audio-visual picture-matching paradigm to familiarize them with the visual stimuli. All pictures are publicly available<sup>1</sup>.

# Audio-Visual Picture-Matching Paradigm

The trial procedure for the audio-visual picture-matching paradigm is shown in **Figure 2**. After an initial silent baseline showing a fixation cross (for 1 s), the participants were shown a picture (either target or competitor) for a period of 2 s. This was followed by a 3-s long background noise baseline after which a sentence was presented in the same background noise. After the sentence offset, the background noise continued for an additional 3 s. A fixation cross was presented during the sound stimulus presentation. After the final noise offset, the participants were prompted to decide whether the sentence matched the picture or not via a button press (left or right mouse button). After the comprehension task, the participants were instructed to rate how

<sup>1</sup>http://www.aulin.uni-oldenburg.de/49349.html

FIGURE 2 | Trial structure of the audio-visual picture-matching paradigm. Participants saw a picture on screen for 2000 ms, followed by a visual fixation cross and a simultaneous acoustical presentation of a sentence in background noise. Background noise was presented 3000 ms before and ended 3000 ms after sentence offset. After the acoustic presentation, participants' task was to decide whether the picture matched with the sentence or not. Pupil dilations were measured from the picture onset until the participants' response in the comprehension task. The comprehension task was followed by a subjective rating of the experienced difficulty.

difficult it was to understand the sentence using a continuous visual analog scale (McAuliffe et al., 2012). They were asked to indicate their rating by positioning a mouse on a continuous slider marked "easy" and "difficult" at the extremes.
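The trial timing described above can be sketched as a small timeline calculation. This is only an illustration: the class and function names are hypothetical, the sentence duration is a free parameter because it is not stated here, and the sketch assumes that each phase starts immediately when the previous one ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialTimeline:
    """Event times in seconds, measured from trial start (hypothetical structure)."""
    picture_onset: float
    noise_onset: float
    sentence_onset: float
    sentence_offset: float
    noise_offset: float


def build_timeline(sentence_duration_s: float) -> TrialTimeline:
    fixation_s = 1.0    # silent baseline with fixation cross
    picture_s = 2.0     # picture (target or competitor) on screen
    noise_lead_s = 3.0  # noise alone before the sentence
    noise_tail_s = 3.0  # noise alone after the sentence

    picture_onset = fixation_s
    noise_onset = picture_onset + picture_s
    sentence_onset = noise_onset + noise_lead_s
    sentence_offset = sentence_onset + sentence_duration_s
    noise_offset = sentence_offset + noise_tail_s
    return TrialTimeline(picture_onset, noise_onset, sentence_onset,
                         sentence_offset, noise_offset)


# Example with an assumed 2.5 s sentence (illustrative value only).
timeline = build_timeline(2.5)
```

The comprehension response and the subjective rating follow `noise_offset` and are self-paced, so they are not part of the fixed timeline.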

First, the participants performed one training block, which contained 10 trials. After training, each participant listened to 159 sentences, divided into two blocks. Both SVO and OVS sentences were presented in a lower-level noise condition (+12 dB SNR) or in a higher-level noise condition (−6 dB SNR). The noise masker was a stationary speech-shaped noise with the long-term frequency spectrum of the speech. Filler trials were included where the picture either did not match the character or the action of the spoken sentences.

### Cognitive Tests

At the end of the test session, the participants performed two cognitive tests: a digit span test and a reading span task. The digit span test was conducted in a forward and a backward version. The forward version is thought to primarily assess working memory size (i.e., number of items that can be stored) whereas the backward version reflects the capacity for online manipulation of the content of working memory (e.g., Kemper et al., 1989; Cheung and Kemper, 1992). In the forward version, a chain of digits was presented aurally and the participants were then asked to repeat back the sequence. In the backward version, the participants were asked to repeat back the sequence in reversed order. To calculate the scores for the digit span test, one point was awarded for each correctly repeated sequence (according to the traditional scoring; see Tewes, 1991). The scores were presented in percentages correct, i.e., how many sequences out of the 14 sequences were repeated correctly. In addition, while the participants performed the digit span tests, pupil dilations were recorded to obtain a physiological correlate of effort.
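The digit span scoring rule (one point per correctly repeated sequence, expressed as percent correct) can be illustrated as follows. The function name and the example sequences are hypothetical. The sketch divides by the number of sequences supplied, which was 14 in the study.

```python
def digit_span_score(presented, responses, backward=False):
    """Percent of sequences repeated correctly (one point per sequence).

    presented, responses: lists of digit sequences of equal length.
    backward: if True, a response counts as correct only when it
    reproduces the presented sequence in reversed order.
    """
    correct = 0
    for sequence, response in zip(presented, responses):
        target = list(reversed(sequence)) if backward else list(sequence)
        if list(response) == target:
            correct += 1
    return 100.0 * correct / len(presented)


# Illustrative data only: two sequences instead of the 14 used in the study.
presented = [[1, 2, 3], [4, 5, 6, 7]]
forward_score = digit_span_score(presented, [[1, 2, 3], [4, 5, 7, 6]])
backward_score = digit_span_score(presented, [[3, 2, 1], [7, 6, 5, 4]],
                                  backward=True)
```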

In the reading span task, the participants were presented with sequences of sentences on the screen and instructed to determine, after each sentence, whether the sentence made sense or not (Daneman and Carpenter, 1980). After each sentence, a letter was presented on the screen and the participant was asked to remember the letter. After a set of sentences (length of the set varied between 3 and 11 sentences), the participant was prompted to recall the letters presented between sentences. The number of letters that were correctly recalled were scored regardless of the order in which they were reported. The reading span score was defined as the aggregated number of letters correctly recalled across all sentences in the test. Letters were used as targets rather than sentence words in order to make the task less reliant on reading abilities.
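A minimal sketch of the order-independent reading span scoring described above is given below. The function names are hypothetical, and the handling of repeated letters (each presented letter can be credited at most once) is an assumption that the text does not specify.

```python
from collections import Counter


def set_recall(presented_letters, recalled_letters):
    """Number of letters recalled correctly in one set, ignoring order."""
    # Multiset intersection: a recalled letter is credited only as often
    # as it was actually presented (assumption for repeated letters).
    overlap = Counter(presented_letters) & Counter(recalled_letters)
    return sum(overlap.values())


def reading_span_score(sets):
    """Aggregate number of correctly recalled letters across all sets.

    sets: iterable of (presented_letters, recalled_letters) pairs.
    """
    return sum(set_recall(p, r) for p, r in sets)


# Illustrative data only: one set of three and one set of four letters.
example_score = reading_span_score([("FKT", "TKF"), ("BQRM", "QB")])
```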

### Apparatus

The experiment was performed in a sound-proof booth. Participants were seated 60 cm from the computer screen and a chin rest was used to stabilize their head. Visual stimuli were presented on a 22-inch computer screen with a resolution of 1680 × 1050 pixels. The acoustic stimuli were delivered through two loudspeakers (ADAM, A5X), located next to the screen. An eye-tracker system (EyeLink 1000 desktop system, SR Research Ltd.) was used to record participants' pupil dilation with a sampling rate of 1000 Hz throughout the experiment. The eye-tracker was calibrated at the beginning of the experiment using a nine-point fixation stimulus. During each trial, pupil size and pupil x- and y-traces were recorded for detecting horizontal and vertical eye movements, respectively. The eye-tracker sampled only from the left eye.

# Pupil Data Analysis

The recorded data were analyzed for 20 participants in a similar way as reported in previous studies (Piquado et al., 2010; Zekveld et al., 2010, 2011)<sup>2</sup>. First, eye-blinks were identified as samples for which the pupil value was more than 3 standard deviations below the mean pupil dilation, and these samples were removed. Linear interpolation was then applied from 350 ms before to 700 ms after each detected eye-blink. Trials for which more than 20% of the data required interpolation were removed from further analysis. For one participant, more than 50% of the trials required interpolation and, therefore, this participant was excluded from further analysis (Siegle et al., 2003). The data of the de-blinked trials were smoothed by a four-point moving average filter. In order to control for individual differences in pupil range, the minimum pupil value of the entire trial time series (from the onset of the picture presentation until the comprehension task) was subtracted from each data point for each individual participant. Afterward, the pupil data were divided by the range of the pupil size within the entire trial. Finally, the pupil data were normalized by subtracting a baseline value, which was defined as the averaged pupil value across the 1 s before sentence presentation (when listening to noise alone, see **Figure 3**). The pupil responses were averaged across all participants for each condition. Averaged pupil data were analyzed within three different time epochs (see **Figure 3**). Epoch 1 describes the time from the start of the sentence until the point of disambiguation. Epoch 2 is defined as the time after the point of disambiguation until the sentence offset. Epoch 3 defines the 3 s following the sentence offset, when the participants are asked to retain sentences in memory until the comprehension question.
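The preprocessing steps for a single trial can be sketched as below. This is an illustration under stated assumptions, not the analysis code used in the study: the function name and arguments are hypothetical, the standard deviation is taken over the whole trial, the moving average uses a trailing window, and a flat trace is rejected because range normalization would be undefined.

```python
from statistics import mean, pstdev


def preprocess_trial(trace, fs, sentence_onset_s,
                     pre_ms=350, post_ms=700, max_interp=0.20):
    """Return a normalized pupil trace, or None if the trial is rejected."""
    n = len(trace)
    mu, sd = mean(trace), pstdev(trace)

    # Step 1: flag blink samples and widen each by the padding windows.
    pre, post = int(pre_ms * fs / 1000), int(post_ms * fs / 1000)
    bad = [False] * n
    for i, x in enumerate(trace):
        if x < mu - 3 * sd:
            for j in range(max(0, i - pre), min(n, i + post + 1)):
                bad[j] = True

    # Step 2: reject trials that would need too much interpolation.
    if sum(bad) / n > max_interp:
        return None

    # Step 3: linear interpolation across each flagged gap [i, j).
    clean = list(trace)
    i = 0
    while i < n:
        if not bad[i]:
            i += 1
            continue
        j = i
        while j < n and bad[j]:
            j += 1
        left = clean[i - 1] if i > 0 else clean[j]
        right = clean[j] if j < n else left
        for k in range(i, j):
            clean[k] = left + (k - i + 1) / (j - i + 1) * (right - left)
        i = j

    # Step 4: four-point moving average (trailing window, assumption).
    smooth = [mean(clean[max(0, k - 3):k + 1]) for k in range(n)]

    # Step 5: subtract the trial minimum and divide by the trial range.
    lo, hi = min(smooth), max(smooth)
    if hi == lo:
        return None  # flat trace: range normalization undefined
    scaled = [(x - lo) / (hi - lo) for x in smooth]

    # Step 6: subtract the mean of the 1 s before sentence onset.
    onset = int(sentence_onset_s * fs)
    baseline = mean(scaled[max(0, onset - int(fs)):onset])
    return [x - baseline for x in scaled]
```

One property of the 3-standard-deviation criterion becomes visible when experimenting with such a sketch: very long blinks inflate the trial standard deviation and may push the threshold below the blink samples themselves, so the criterion works best for short, sparse blinks.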

# RESULTS

## Speech Comprehension in the Audio-Visual Picture-Matching Task

### Comprehension Accuracy

**Figure 4** shows the mean response accuracy across participants in the audio-visual picture-matching paradigm. The highest accuracy was found for the SVO sentences (93.1% in the lower-level noise condition and 87.8% in the higher-level noise condition). For the OVS structure, the response accuracy was between 57.2% (in the higher-level noise condition) and 58.1% (for the lower-level noise condition). The comprehension accuracy was analyzed using two separate repeated-measures analyses of variance (ANOVA) with complexity (simple,

<sup>2</sup>All data exclusions, all manipulations, and all measures were reported in this study.

complex) and noise level (high, low) as within-subject factors. The ANOVA revealed a main effect of complexity [F(1,18) = 15.8, p = 0.001, ω = 0.53] showing that the processing of OVS sentences resulted in more comprehension errors compared to the processing of SVO sentences. No effect of the noise level on the accuracy scores was found [F(1,18) = 1.8, p = 0.2] indicating that speech intelligibility was high in both noise conditions.

### Subjective Ratings

Averaged subjective ratings across all participants were calculated for each condition. The subjective ratings were analyzed using two separate repeated-measures ANOVA with complexity and noise level as within-subject factors. The ANOVA revealed a main effect of noise level [F(1,18) = 56.3, p < 0.001, ω = 0.779] indicating that the higher-level noise condition was rated as being more difficult compared to the lower-level noise condition. In addition, a small but significant effect of complexity on rating was also found [F(1,18) = 4.6, p = 0.048, ω = 0.223].

### Time-Averaged Pupil Dilation


Averaged pupil dilations across all participants were calculated for each epoch (see **Figure 5**). The dilations were analyzed using repeated-measures ANOVAs with complexity and noise level as within-subject factors, performed separately for each epoch. In epoch 1, there was a significant effect of noise level on the time-averaged pupil dilation [F(1,18) = 12.1, p = 0.03, ω = 0.41], but no effect of complexity was found [F(1,18) = 0.93, p = 0.35]. In epochs 2 and 3, significant effects of complexity were revealed [epoch 2: F(1,18) = 10.8, p = 0.004, ω = 0.39; epoch 3: F(1,18) = 12.8, p < 0.001, ω = 0.52]. Furthermore, an interaction of complexity and noise level was found in epoch 3 [F(1,18) = 9.0, p = 0.008, ω = 0.35].
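The time-averaged pupil dilation per epoch can be illustrated with a short sketch. It follows the epoch definitions given in the pupil data analysis (sentence onset to PTD, PTD to sentence offset, and the 3 s after sentence offset). The function name, arguments, and example trace are hypothetical.

```python
def epoch_means(trace, fs, sentence_onset_s, ptd_s, sentence_offset_s,
                retention_s=3.0):
    """Mean of a preprocessed pupil trace within each of the three epochs.

    trace: samples of one trial, with time zero at trial start.
    fs: sampling rate in Hz; all other times are in seconds.
    """
    def window_mean(start_s, end_s):
        segment = trace[int(start_s * fs):int(end_s * fs)]
        return sum(segment) / len(segment)

    return {
        1: window_mean(sentence_onset_s, ptd_s),
        2: window_mean(ptd_s, sentence_offset_s),
        3: window_mean(sentence_offset_s, sentence_offset_s + retention_s),
    }


# Illustrative step-shaped trace at 10 Hz (not real data).
example_trace = [0.0] * 10 + [1.0] * 10 + [2.0] * 10 + [3.0] * 30
example_epochs = epoch_means(example_trace, fs=10, sentence_onset_s=1.0,
                             ptd_s=2.0, sentence_offset_s=3.0)
```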

# Cognitive Data

Pearson correlation coefficients between the subjective ratings and the performance in the cognitive tests [digit span forward (DF) score, digit span backward (DB) score, and reading span (RS) score in **Table 2**] were computed. A statistically significant correlation between the subjectively rated effort and the DB score was found for the SVO sentences presented at the higher noise level (p < 0.05, see **Table 2** and **Figure 6**).

In addition, Pearson correlation coefficients between the mean pupil response in epoch 3 and the performance in the cognitive tests were computed. Significant correlations were only found between the DB score and the pupil dilations in epoch 3 (p < 0.05, see **Table 2**), indicating that participants with higher DB scores had larger pupil dilations in the speech task.

Finally, correlations between the pupil dilations in the digit span test and the pupil dilations in the speech task (during epoch 3) were calculated. Pearson correlation coefficients revealed statistical significance (see **Figure 6**; p < 0.05), i.e., participants with enlarged pupil dilations in the speech task also showed higher pupil dilations in the span test.
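The correlation analyses above rest on the Pearson coefficient, which can be sketched as follows together with the t statistic commonly used to test it against zero. The function names and example values are hypothetical. Converting t to a p-value requires the t distribution with n − 2 degrees of freedom, which is left out here to keep the sketch free of external libraries.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_t(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Illustrative values only (e.g., span scores vs. mean pupil dilation).
r_example = pearson_r([1, 2, 3, 4], [1, 3, 2, 4])
t_example = correlation_t(r_example, 4)
```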

# DISCUSSION

# Effects of Stimulus-Related Factors on Effort

A small but significant increase in pupil dilation due to the increased noise level was found in epoch 1, i.e., while the participants were listening to the first part of the sentence. The changes in pupil dilation due to the noise level were similar for both sentence structures, i.e., independent of the syntactic complexity. Moreover, a clear effect of the noise level on the perceived effort was found, i.e., the listeners reported speech processing as being more effortful when the sentences

FIGURE 5 | Mean pupil dilation observed for all four conditions. Time-averaged pupil dilation was calculated for three different epochs. Epoch 1 is the time when the first part of the sentence was presented. Epoch 2 includes the time after the sentence was disambiguated until the comprehension question. The third epoch is defined as the time from sentence onset until participants' response. The error bars show the standard deviations.


TABLE 2 | Correlation coefficients between the span tests and both the mean pupil dilation in the audio-visual picture-matching paradigm and the subjective ratings for all four conditions.

DB, digit span backward; DF, digit span forward; RS, reading span; Scores, scores in the span tests; Pupil, averaged pupil response in the digit span backward tests. Bolded values indicate significant correlation coefficients.

were presented at lower SNRs. These results are in line with studies that reported changes in pupil dilation and subjective ratings to be dependent on the SNR (Rudner et al., 2012). Zekveld et al. (2010, 2014) observed that pupil dilations and subjective ratings of effort increased with decreasing SNR. However, the current findings also clearly indicate an effect of noise level on effort in listening situations even when speech intelligibility is high. As reflected by the performance in the comprehension task, the participants were able to perform the task equally well at low and high noise levels. To the authors' knowledge, this is the first study demonstrating that effort increases with decreasing SNR even when speech intelligibility is still high.

Higher processing effort due to the increased syntactic complexity was expected for the OVS sentences compared to the syntactically less complex SVO sentences. Syntactic complexity came into play in epoch 2 when the participants listened to the second part of the sentence. In epochs 2 and 3, the participants were required to process and interpret the sentence by mentally assigning the grammatical roles of agent and patient and matching the spoken sentence content with the scene depicted in the picture. A pupil enlargement was measured for the OVS sentences during epoch 2 and during the retention interval in epoch 3. These findings are consistent with other studies that showed increased effort while processing syntactically complex sentences (Piquado et al., 2010; Wendt et al., 2014). For example, Piquado et al. (2010) reported significantly larger pupil dilations during the retention of complex sentences. In the current study, the sentences were presented in noise in order to test the combined effects of sentence complexity and background noise level. The pupil data demonstrated distinct effects of noise and sentence complexity during epochs 1 and 2. Whereas a main effect of noise was observed in epoch 1, increased pupil dilations induced by the complexity of the sentence were measured in epoch 2. These results suggest that an increased processing effort due to an increased noise level occurs only if the sentence complexity is irrelevant for the task. As soon as the listeners start to process and retain syntactically complex information, the effect of the noise becomes negligible. Interestingly, an interactive effect of noise and complexity on the pupil dilation was found in epoch 3. This interaction was characterized by a steep decrease of the pupil response in the acoustically challenging listening situation (see epoch 3 in **Figure 3**). In other words, although a large pupil size induced by the noise level was detected in epoch 1, the pupil size returned to the baseline value faster in the retention period in epoch 3. This observation may suggest that listeners were able to recover faster from the high processing demand in the acoustically more challenging listening situation. However, this fast recovery occurred only for the simple sentence structures. When processing more linguistically complex sentences, this interactive effect was not found.

Whereas the pupil response indicated a clear impact of syntactic complexity, the effect of complexity on the subjective ratings was rather small. This suggests that subjective ratings and pupil dilations reflect different aspects of effort involved in speech comprehension. The pupil responses, interpreted as a physiological correlate of processing effort, were mainly sensitive to the syntactic complexity during sentence comprehension but were not indicative of the subjectively perceived effort. The perceived effort, in contrast, was more influenced by the degradation of the speech signal resulting from the increased background noise level. This is consistent with previous studies also suggesting that potentially different aspects of the effort may be measured when testing different methods and measures of effort (e.g., McGarrigle et al., 2014).

# The Influence of Listener-Related Factors on Effort

In the present study, different span tests were used to measure cognitive abilities of the participants. A moderate correlation between the digit span scores and the pupil dilations was found (**Figure 6**). Higher scores in the backward digit span test were found to correlate with higher pupil dilations in the speech comprehension task in the higher-level noise condition.

This could indicate that individuals with higher WMC allocate and engage more cognitive resources compared to individuals with smaller WMC. Previous studies have also reported higher pupil enlargement during speech processing for individuals with higher scores in cognitive tests (e.g., Zekveld et al., 2011). The results thus are consistent with the notion that individuals with higher cognitive capacities mobilize more working memory resources in acoustically challenging conditions as stated by the resource hypothesis (van der Meer et al., 2010). It is noticeable that significant correlations between WMC and pupil dilations were found specifically in epoch 3 comprising the retention period. This could suggest that pupil dilations specifically indicate the mobilization of working memory resources while storing speech information (and preparing for the upcoming comprehension task). Interestingly, significant correlations appeared only when processing sentences in the acoustically more challenging condition, suggesting that pupil response may further relate to the ability of listeners to rely on some form of working memory processing for compensating increased demands due to challenging acoustics. However, further research is needed to specifically explore these mechanisms.

Interestingly, the subjective ratings were found to be negatively correlated with WMC such that participants with a higher WMC tended to report lower perceived effort when processing SVO sentences in the higher-level noise condition. This suggests that individuals with greater WMC are able to use their resources to cope with the acoustically degraded speech signals and therefore report less effort, as argued by the efficiency hypothesis and the ELU model (Rönnberg et al., 2013). The presented data indicate that the relationship between individual WMC and effort depends on the employed measure. While listeners with a larger memory capacity may engage more resources, as indicated by increased pupil responses, this is not perceived as being effortful. Predictions made by the resource hypothesis with regard to processing effort (and its pupil response correlate) may be interpreted in terms of engagement of enhanced WMC, but not in terms of perceived effort. Predictions about effort made by the ELU model may be interpreted in terms of the subjective experience of effort.

Significant correlations between effort (both perceived effort and processing effort) and the digit span scores were only measured for the SVO sentences in the higher-level noise condition. This indicates that the WMC was only relevant when the induced demands increased due to the acoustic degradation of the speech signal. For the OVS sentences, no correlations between the digit span scores and the rated effort were found in either the higher-level noise or the lower-level noise condition. This suggests that the effort reached a plateau in situations when the cognitive resources could not compensate for the increased processing demands any longer (Johnsrude and Rodd, 2015). Thus, it may be that the available cognitive resources are exhausted when processing OVS sentences, which would further explain why no correlation between the digit span scores and effort was found in either the higher-level noise or the lower-level noise condition. No correlations were found between the reading span and the pupil response in the speech task. Note, however, that the procedure for the reading span test differed from the procedure applied in more recent studies (e.g., Lunner, 2003; Rönnberg et al., 2014; Petersen et al., 2016). A revised procedure of the reading span test was developed to include having to remember either the first or the last word of each sentence in the list (see e.g., Lyxell et al., 1996). Since the participants do not know beforehand whether it will be the first or the last word, this revised procedure is suggested to increase the task difficulty and, therefore, the reading span score is supposed to reflect a more sensitive measure of the WMC. Thus, the missing correlation between the reading span score and the speech task might be explained by the procedure applied in the current study.

In this study, WMC was considered as a listener-related factor that potentially influences effort. However, there may be other listener-related factors that have not been considered in the current study. Interestingly, positive correlations between the pupil dilations in the digit span test and the pupil dilations in the speech task were found (**Figure 6**). Listeners who allocated more resources in the speech task also tended to mobilize more resources in the digit span test. This may indicate that some listeners generally engaged more resources than others when performing a task. Other potential listener-specific factors affecting effort have been discussed in the literature. For instance, the level of motivation of individual participants could further influence the intensity of effort mobilization (Brehm and Self, 1989; Gendolla and Richter, 2010). With increasing success importance or with increasing motivation intensity, the amount of effort involved in a task can increase. It is possible that those participants who showed increased pupil dilations in both tasks were more motivated than those who exhibited smaller pupil dilations in both tasks. However, since motivation and success importance were not tested in the present study, the potential contribution of motivation to the results from this study remains to be clarified.

### Implications for Future Research

The audio-visual picture-matching paradigm presented in this study is well suited for studying speech processing in realistic communication situations. Monitoring increased effort during speech processing when intelligibility is high is crucial since it indicates challenges that constantly appear in everyday life. Moreover, in order to perform the task, listeners need to conduct a syntactic analysis of the sentence. This is in contrast to many speech intelligibility studies where the participants are typically asked to repeat back the recognized words of a sentence (Hagerman, 1984; Plomp, 1986). However, repetition does not necessarily involve any processing of the sentence structure or meaning that may constitute an important component of the challenges experienced in everyday speech comprehension.

Extensive engagement of cognitive resources in everyday speech processing may eventually lead to fatigue or tiredness. Previous research suggested that hearing-impaired listeners are particularly challenged in adverse conditions both with regard to speech perception performance and in terms of the effort required to achieve successful speech perception (Plomp, 1986; Rönnberg et al., 2013; Wendt et al., 2015). Consequences of increased effort can be, for example, a higher level of mental distress and fatigue leading to stress (Gatehouse and Gordon, 1990; Kramer et al., 2006; Edwards, 2007; Hornsby, 2013). Since traditional speech recognition tests are not sensitive enough to detect changes in effort in more realistic communication situations, there seems to be a need for new methods and measures to examine effort for hearing-impaired people. The findings of the present study suggest that pupil responses and subjective ratings are independent measures addressing different aspects of effort. Thus, when testing one measure of effort, the other measure is not necessarily reflected. This should be taken into account by researchers and clinicians when applying either one or the other method in their studies (McGarrigle et al., 2014).

# SUMMARY AND CONCLUSION

Three main observations were made in the present study. First, effects of increased demands due to background noise level and syntactic complexity were reflected in both the subjective ratings and pupil dilations. Second, the interaction between background noise level and syntactic complexity was rather small. Instead, separable effects of noise level and complex syntax on the subjective ratings and the pupil dilations were found: Increased syntactical complexity resulted in enlarged and prolonged pupil dilations, whereas a higher background noise level resulted in the task being rated as more effortful. Third, individual differences in cognitive abilities of the participants correlated differently with perceived effort and processing effort. Participants with higher scores in the backward span test (indicating higher WMC) showed increased pupil dilations but also reported the speech task to be less effortful than participants with lower scores. Overall, these findings demonstrate that pupil dilations and subjectively rated effort can vary in situations when intelligibility is at a high level and represent different aspects of effort. The methods and measures employed to investigate effort therefore need to be chosen carefully depending on the specific research question and hypothesis.

# AUTHOR CONTRIBUTIONS

The author DW developed the conception and design of the study, supervised the data acquisition, analyzed and interpreted the data, and wrote the paper. DW gives the final approval of the version to be published, and agrees to be accountable for all aspects of the work. The author TD was involved in developing the design of the study, enabled the data acquisition, and substantially contributed to the interpretation of the data. TD was involved in writing and critically revising this version of this manuscript for important intellectual content. TD gives the final approval of the version to be published, and agrees to be accountable for all aspects of the work. The author JH developed the conception and design of the study, supervised the data acquisition, and analyzed and interpreted the data. JH was involved in writing and critically revising this version of this manuscript for important intellectual content. JH gives the final approval of the version to be published, and agrees to be accountable for all aspects of the work.

# FUNDING

This research was supported by the Oticon Centre of Excellence for Hearing and Speech Sciences (CHeSS).

# ACKNOWLEDGMENTS

We thank Line Burholt Kristensen for her help with developing and recording the sentence material, the AULIN project for providing the picture material, and Wiebke Lamping for performing the measurements. In addition, thanks to Hartwig Siebner for his valuable input to the study design. JH was supported by the Oticon Foundation.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Wendt, Dau and Hjortkjær. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Autonomic Nervous System Responses During Perception of Masked Speech may Reflect Constructs other than Subjective Listening Effort

Alexander L. Francis<sup>1</sup> \*, Megan K. MacPherson<sup>2</sup> , Bharath Chandrasekaran<sup>3</sup> and Ann M. Alvar<sup>1</sup>

<sup>1</sup> Department of Speech, Language and Hearing Sciences, Purdue University, West Lafayette, IN, USA, <sup>2</sup> School of Communication Science and Disorders, Florida State University, Tallahassee, FL, USA, <sup>3</sup> Department of Communication Sciences and Disorders, University of Texas at Austin, Austin, TX, USA

### Edited by:

Rachel Jane Ellis, Linköping University, Sweden

### Reviewed by:

Nai Ding, Zhejiang University, China; Carol Mackersie, San Diego State University, USA

> \*Correspondence: Alexander L. Francis francisa@purdue.edu

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 25 September 2015; Accepted: 10 February 2016; Published: 01 March 2016

### Citation:

Francis AL, MacPherson MK, Chandrasekaran B and Alvar AM (2016) Autonomic Nervous System Responses During Perception of Masked Speech may Reflect Constructs other than Subjective Listening Effort. Front. Psychol. 7:263. doi: 10.3389/fpsyg.2016.00263

Typically, understanding speech seems effortless and automatic. However, a variety of factors may, independently or interactively, make listening more effortful. Physiological measures may help to distinguish between the application of different cognitive mechanisms whose operation is perceived as effortful. In the present study, physiological and behavioral measures associated with task demand were collected along with behavioral measures of performance while participants listened to and repeated sentences. The goal was to measure psychophysiological reactivity associated with three degraded listening conditions, each of which differed in terms of the source of the difficulty (distortion, energetic masking, and informational masking), and therefore were expected to engage different cognitive mechanisms. These conditions were chosen to be matched for overall performance (keywords correct), and were compared to listening to unmasked speech produced by a natural voice. The three degraded conditions were: (1) unmasked speech produced by a computer speech synthesizer, (2) speech produced by a natural voice and masked by speech-shaped noise, and (3) speech produced by a natural voice and masked by two-talker babble. Masked conditions were both presented at a −8 dB signal-to-noise ratio (SNR), a level shown in previous research to result in comparable levels of performance for these stimuli and maskers. Performance was measured in terms of proportion of key words identified correctly, and task demand or effort was quantified subjectively by self-report. Measures of psychophysiological reactivity included electrodermal (skin conductance) response frequency and amplitude, blood pulse amplitude and pulse rate. Results suggest that the two masked conditions evoked stronger psychophysiological reactivity than did the two unmasked conditions even when behavioral measures of listening performance and listeners' subjective perception of task demand were comparable across the three degraded conditions.

Keywords: listening effort, psychophysiology, informational masking

# INTRODUCTION

fpsyg-07-00263 March 1, 2016 Time: 17:6 # 2

In the normal case, understanding speech may seem to be effortless and automatic. However, even small changes in hearing acuity, signal quality or listening context can substantially reduce recognition performance and subsequent understanding or recall of the message (Van Engen et al., 2012) and presumably therefore increase perceived listening effort. Chronic effortful listening may, in turn, lead to long-term stress and fatigue as well as potentially serious health issues including hypertension and increased risk of stroke (Hogan et al., 2009). In the audiology clinic, listening effort is increasingly being seen as a significant factor for hearing aid users, both as it relates to intelligibility and as a potentially independent quality associated with willingness to adopt and continue using hearing aids (Picou, 2013). Listening effort is often associated with the allocation of limited supplies of cognitive "resources" such as working memory capacity or selective attention (Hicks and Tharpe, 2002), such that increased listening effort is associated with poorer performance on simultaneous or immediately subsequent cognitively demanding tasks (McCoy et al., 2005; Sarampalis et al., 2009). However, there is still a great deal of disagreement regarding the source of listening effort, or how to best characterize and quantify it (McGarrigle et al., 2014). The present article addresses these questions by quantifying psychophysiological responses to stimulus manipulations that are associated with different possible sources of increased listening effort.

According to one prominent proposal, the effortfulness hypothesis, the increase in perceived effort (and the decrease in downstream task performance) that is associated with listening in adverse conditions is linked to the acoustic phonetic degradation of the signal. Listeners confronted with a phonetically ambiguous or misleading acoustic signal must engage cognitively demanding mechanisms of repair or compensation in order to successfully decipher the intended message. Operating these mechanisms is assumed to require the commitment of cognitive resources that are in limited supply, and the consumption of these resources is typically associated with the concept of "effort." Thus, for present purposes, effortful processes may be thought of as those cognitive processes that involve the active commitment of cognitive resources such as working memory. Because such resources are in limited supply, listeners will have fewer resources remaining for subsequent processing of the linguistic information encoded in that signal (Rabbitt, 1968, 1991; Pichora-Fuller et al., 1995; McCoy et al., 2005; Wingfield et al., 2005; Pichora-Fuller and Singh, 2006; Surprenant, 2007; Lunner et al., 2009; Tun et al., 2009). However, not all sources of signal degradation have the same effect on the signal, and it is possible that different repair or compensation mechanisms may be engaged (or the same mechanisms may be engaged to differing degrees) to achieve the same level of performance under different circumstances. That is, different types of signal degradation may incur different demands on cognitive resources, or demands on different resources, and thus may differentially affect perceived effort even when performance is comparable. The goal of the present study was to investigate this possibility by quantifying physiological responses associated with task demand while listening to three similarly intelligible but differently degraded speech signals. 
If different types of degradation that result in the same performance are nevertheless associated with different patterns of psychophysiological reactivity, this would suggest that listeners are engaging different compensatory cognitive mechanisms to cope with the different sources of degradation.

Three types of degradation were chosen to represent three different ways in which a signal might be degraded. The first two involve masking, and represent examples of energetic and informational masking, respectively, while the third, computer speech synthesis, represents a complex form of signal degradation accomplished without masking.

Energetic masking is the simplest type of masking, in which one signal (the masker) physically obscures some part of the meaningful (target) signal. The source of difficulty in this case is simply the physical interaction between the two competing signals in the auditory periphery (Brungart et al., 2006). Adding speech-shaped noise to the target signal is a prototypical example of energetic masking, as the decrease in performance with respect to unmasked speech is arguably due entirely to the overlap of the excitation patterns of the target and masker signals on the basilar membrane. From a listener's perspective, the difficulty in understanding speech in noise arises mainly from the loss of information contained within those parts of the target signal that are obscured by the noise. Although listeners are likely to recognize that there are two separate sound sources in the combined signal, namely the target speech and the masking noise, they generally have little difficulty distinguishing between the two, meaning that demands on selective attention should play a minimal role in this condition (Shinn-Cunningham and Best, 2008). Similarly, the noise signal has no informational content, and therefore, in itself, is assumed to add no appreciable load to listeners' working memory (though cf. Sörqvist and Rönnberg, 2014, who suggest that attention, and hence working memory, is still involved even in simple noise-masking conditions). In principle, the effortfulness hypothesis would thus account for any increase in listening effort related to added noise as primarily due to the need to cope with the less informative (degraded) target signal itself.

Informational masking, in contrast, is often used as a catchall term covering all cases of interference that cannot be explained purely in terms of energetic masking (Cooke et al., 2008). In the present case we will consider a specific type of informational masking, namely the use of one or more to-be-ignored speech signals (speech maskers) to interfere with listeners' understanding of a target speech signal, a condition under which performance has been shown to dissociate from performance under energetic masking (Brungart, 2001; Brungart et al., 2006; Van Engen et al., 2012). In this case, in addition to the energetic masking that occurs when the masking signal(s) interfere acoustically with the target signal, there is also some interference occurring at a more linguistic or cognitive level of processing (Mattys et al., 2009). For example, speech masked by two-talker babble not only presents listeners with the challenge of dealing with a partially obscured target speech signal, it also imposes greater demands on selective attention as listeners must choose to which of the three voices to attend (Freyman et al., 2004; Brungart et al., 2006; Ihlefeld and Shinn-Cunningham, 2008; Shinn-Cunningham, 2008). In addition, demands on working memory likely increase, as listeners probably retain some of the content of the masking signal in working memory and this must subsequently be selectively inhibited at the lexical level (Tun et al., 2002; Van Engen and Bradlow, 2007; Cooke et al., 2008; Mattys et al., 2009; Dekerle et al., 2014). Neuropsychological and genetic studies further suggest that populations that are predisposed to show poorer selective attention, as indexed either by increased degree of depressive symptoms (Chandrasekaran et al., 2015) or genetic markers associated with poorer executive function (Xie et al., 2015), experience greater interference in conditions that emphasize informational masking as compared to those involving primarily energetic masking.

Finally, synthetic speech represents a different sort of signal degradation, one that has been less well-studied in the effort literature but that has been shown to introduce cognitive demands on speech perception (Pisoni et al., 1985; Francis and Nusbaum, 2009). Unmasked synthetic speech, like foreign-accented, dysarthric, and noise-vocoded speech, consists of a single signal, thus eliminating issues of selective attention at the signal level. However, synthetic speech is distorted in ways that not only represent a lack of information, but potentially introduce misleading information (Francis et al., 2007), a property shared with accented and dysarthric speech, but not necessarily vocoded speech. Thus, listening to synthetic speech, like listening in competing speech, may require the application of additional cognitive resources for the retention and eventual inhibition of a larger number of competing lexical items in working memory (Francis and Nusbaum, 2009); but, unlike competing speech conditions, in the case of synthetic speech there is no benefit to applying selective attentional processes to filter out competing signals before their content interferes.

Thus, these three types of degradation allow two contrasts: one isolating the increased cognitive demands associated with informational masking (noise-masked vs. speech-masked), and one comparing listening to a single challenging signal with selectively attending to multiple signals (synthetic speech vs. speech-masked).

In order to quantify listening effort, three general methods of assessment have been identified in the literature: subjective (self-report) measures of task demand using instruments such as the NASA Task Load Index (TLX, Hart and Staveland, 1988) and the Speech, Spatial and Qualities of Hearing Scale (SSQ, Gatehouse and Noble, 2004); measures of behavioral interference between dual tasks (Sarampalis et al., 2009; Fraser et al., 2010); and physiological assessments of central nervous system function using fMRI (Wild et al., 2012) and EEG/ERP methods (Bernarding et al., 2012) and of autonomic nervous system arousal based on measurements of a variety of systems, such as those that reflect pupillary (Zekveld et al., 2011), electrodermal, and cardiovascular function (Mackersie and Cones, 2011; Mackersie et al., 2015; Seeman and Sims, 2015).

The autonomic nervous system is a division of the nervous system controlling functions vital to survival including respiration, digestion, body temperature, blood pressure, vasoconstriction, heart rate and sweating (Hamill et al., 2012). It is divided into three major branches: the sympathetic, parasympathetic, and enteric nervous systems. The enteric nervous system primarily governs digestion and will not be further discussed here. The sympathetic nervous system (SNS) is typically associated with fight-or-flight responses such as the cool, damp palms associated with confronting a physical or emotional threat, while the parasympathetic nervous system (PNS) is typically associated with rest, relaxation, and recovery from stressors. The sympathetic and parasympathetic branches interact to preserve a homeodynamic balance within the body, maintaining a stable internal state and adjusting bodily functions to respond to internal and external stimuli (Kim and Kim, 2012).

Thus, physiological measures of autonomic nervous system reactivity were selected for the present study because such measures, especially those reflecting SNS arousal, are associated both with increased cognitive demand and with emotional stress, and may therefore constitute an important link between the momentary demands of listening to speech under adverse conditions and long-term health issues associated with hearing impairment. For example, chronic stress associated with living in a noisy environment has been linked to both higher levels of SNS arousal and increased risk of adverse health outcomes (Babisch, 2011). Similarly, measures of peripheral vasoconstriction due to SNS arousal are associated with subjective measures of annoyance by noise (Conrad, 1973) which, in turn, may be among the better predictors of compliance in hearing aid users (Nabelek et al., 2006; though cf. Olsen and Brännström, 2014). Moreover, anxiety also affects speech perception, potentially increasing demand on cognitive processing (Mattys et al., 2013). Thus, developing a better understanding of autonomic nervous system responses to different sources of listening effort will also provide insight into the possibility that chronically heightened listening effort may contribute to broader issues of health and wellbeing. In this study, four measures of autonomic nervous system reactivity were assessed: skin conductance response (SCR) rate and amplitude, fingertip pulse amplitude (PA), and pulse rate (PR).

### Skin Conductance Response

The SCR refers to a phasic increase in the conductivity of the surface of the skin, especially on the palms of the hands or the feet, reflecting increased eccrine sweat gland activity. The eccrine sweat glands are innervated solely by the SNS. Skin conductance is collected by applying a small (0.5 V) voltage between two electrodes on the surface of the skin. As eccrine sweat gland activity increases, the concentration of negative ions on the skin surface increases, increasing conductivity between the two electrodes (Boucsein, 2012). Although SCRs are not elicited in all trials (Andreassi, 2007), their frequency and amplitude have long been associated with a wide range of psychological responses. The simplest of these is the orienting response (OR), an involuntary response to any sufficiently large change in the sensory environment, reflecting stimulus novelty and degree of surprise, but also affected by stimulus significance. In this context, the SCR is also potentiated by the arousing quality of the stimulus content (irrespective of positive or negative affective valence), such that more significant or more emotionally arousing stimuli induce a stronger SCR (Bradley, 2009). For example, Mackersie and Cones (2011) showed that increasing task demands on selective attention by increasing the complexity of a dichotic digits repetition task increased the amplitude of the SCR, suggesting that as the listening task became more attentionally demanding, listeners' SNS arousal increased.

### Pulse Amplitude

Fingertip pulse amplitude (PA) is a measure of the volume of blood in the capillary bed of the fingertip at the peak of the heartbeat. Like the SCR, it is governed purely by the sympathetic branch of the autonomic nervous system, with increasing arousal leading to peripheral vasoconstriction and therefore decreased amplitude of the blood pulse volume signal (Iani et al., 2004; Andreassi, 2007). Phasic PA has been shown to decrease in response to increasing demands of cognitive tasks such as the Stroop task (Tulen et al., 1989) and mental arithmetic (Goldstein and Edelberg, 1997), and such decrease has been linked specifically to the increased investment of mental effort in a task, such that PA decreases parametrically with increase in working memory load (Iani et al., 2004).

### Pulse Rate

Changes in heart rate have been used extensively to study arousal related to sensory and cognitive processing. The period (and thus frequency or rate) of the heart beat is governed by both the sympathetic and parasympathetic nervous systems, with acceleration primarily under the influence of the sympathetic branch (Andreassi, 2007). Phasic cardiac acceleration and deceleration (momentary increase and decrease of heart rate) are each associated with different aspects of mental demand. Deceleration within the first few heart beats following presentation of a stimulus is typically characterized as part of an automatic OR, and is often interpreted as reflecting the holding of resources in reserve to prepare for stimulus encoding and processing (Lacey and Lacey, 1980; Lang, 1994) or even as an indication of a defensive response to threatening or unpleasant information in the stimulus (Bradley, 2009). Thus, listeners anticipating the need to process more complex or perceptually demanding stimuli, or who are experiencing the stimulus as threatening or aversive, might be expected to show a greater degree of cardiac deceleration during the initial OR. That is, to the extent that cardiac deceleration constitutes a component of an automatic OR, it is not, in itself, a reflection of the operation of an effortful (i.e., controlled, resource-demanding) process but it may nevertheless be expected to occur more strongly in conditions in which the stimulus is perceived to be aversive and/or is expected to be demanding to process further. On the other hand, heart rate has also been observed to increase as a mental task becomes more difficult, for example when doing increasingly complex mental arithmetic (Jennings, 1975), and this acceleration generally persists throughout the duration of the task. 
Thus, different aspects of cardiac response may reflect different ways in which a given task may be perceived as effortful: deceleration may be associated with tasks that are perceived as effortful because they involve processing stimuli that are

### Summary

The purpose of the present study was to quantify psychophysiological responses that might reflect differences in the degree or type of effortful cognitive mechanisms listeners employ to perceive speech under three conditions of increased difficulty compared to listening to unmasked, undistorted speech. The conditions differed in terms of the source of the difficulty (energetic masking, informational masking, and distortion) but were chosen to be matched for overall performance (keywords correct). We hypothesized that, although speech recognition performance should not differ significantly across conditions, listeners would exhibit greater psychophysiological reactivity in conditions involving informational masking and distortion, because these conditions, more so than simple energetic masking, increase demands on cognitive mechanisms of working memory and attention. In fact, results suggested instead that greater psychophysiological reactivity across degradation conditions was mainly associated with conditions involving masking (whether informational or energetic) as compared to either unmasked condition.

# MATERIALS AND METHODS

# Subjects

Fourteen native speakers of American English gave informed consent and participated in this study under a protocol approved by the Purdue University Human Research Protection Program. They ranged in age from 20 to 32 years (mean = 26.0). There were 11 women and 3 men and all were right-handed. All were recruited from the Purdue University community and either had at least a Bachelor's degree level of education (13) or were currently in college. No participant reported fluency in any language other than English. All were nonsmokers in good health by self-report, and none were currently taking any medications known to influence cardiovascular or electrodermal responses. All reported having minimal or no caffeine consumption on the day of testing (though cf. Barry et al., 2008). Participants were screened for anxiety and depression, which may affect or be associated with autonomic nervous system function (Dieleman et al., 2010), using scales that would be suitable for both younger and elderly individuals because this study was intended as part of a larger project including geriatric participants. All participants scored within normal limits on the Geriatric Depression Scale (GDS, Yesavage et al., 1983) and the Geriatric Anxiety Inventory (GAI, Pachana et al., 2007). All exhibited auditory thresholds within age-normal limits, passing a pure tone screening test of 20 dB SPL at 250 and 500 Hz, and 25 dB SPL at 1000, 2000, 4000, and 8000 Hz. All reported normal or corrected-to-normal vision. All participants scored within normal limits on all subscales of the Cognitive Linguistic Quick Test (CLQT, Helm-Estabrooks, 2001). Basic demographic information and test results are shown in **Table 1**.

TABLE 1 | Scores range from 1 to 20 where 1 = "very low" and 20 = "very high" for ratings of mental demand, effort, and frustration, and 1 = "perfect" and 20 = "failure" for performance.

GDS, Geriatric Depression Scale (Yesavage et al., 1983); GAI, Geriatric Anxiety Inventory (Pachana et al., 2007); PTA, Pure Tone Average [average of pure tones at 0.25, 0.50, 1, and 2 kHz, in the left (L) and right (R) ear]; CLQT Attention normal limits (NL) between 180 and 215; CLQT Memory NL between 155 and 185; CLQT Executive Function NL between 24 and 40; CLQT Language NL between 29 and 37; CLQT Visuospatial Skills NL between 82 and 105. Values are presented in the form of Mean (SD).

# Apparatus and Materials

### Testing Environment

During the speech perception task, participants were tested in a quiet room, seated comfortably approximately 1.5 m directly in front of a speaker (Hafler M5 Reference). All stimuli were played via speaker at a comfortable listening level (approximately 76 dBA measured at the location of the seated participant's head, averaged over four test sentences). Although this overall level is higher than is typical in speech audiometry, it corresponds to a signal (speech) level of 67 dBA combined with a masking noise level 8 dB louder (necessary to achieve comparable performance across the two masked conditions). Stimulus presentation was controlled by a program written in E-Prime 2.0 (Psychology Software Tools, Inc. [E-Prime 2.0], 2012). Responses were made verbally, and were scored on-line by the experimenter.
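The relationship between the reported speech level, the masker level, and the overall presentation level can be checked with a power sum of the two sources. The following is a minimal sketch, assuming the speech and masker are uncorrelated and that the masker sits exactly 8 dB above the 67 dBA speech level; the function name `combine_levels_db` is illustrative and not part of the original study.

```python
import math

def combine_levels_db(*levels_db):
    """Power-sum of uncorrelated sound sources given as dB levels."""
    total_power = sum(10 ** (level / 10) for level in levels_db)
    return 10 * math.log10(total_power)

speech_dba = 67.0               # speech level reported in the text
masker_dba = speech_dba + 8.0   # masker 8 dB above speech (SNR of -8 dB)
overall_dba = combine_levels_db(speech_dba, masker_dba)
print(round(overall_dba, 1))    # 75.6, i.e. approximately 76 dBA
```

Under these assumptions the combined level is dominated by the masker, which is why the overall level lies less than 1 dB above the masker level alone.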

### Stimuli

Stimuli were selected from a database of sentences originally developed by Van Engen et al. (2014). The subset used here consisted of 80 semantically meaningful sentences based on the Basic English Lexicon sentences (Calandruccio and Smiljanic, 2012) spoken in a conversational style by a young, female native speaker of American English. Sentences always contained four key words. For example (keywords underlined): The hungry girl ate a sandwich. Masking stimuli were derived from a set of 30 different sentences (not in the target set) produced by eight different female native speakers of American English (not the target talker). Two-talker babble was created by concatenating sentences from two of these talkers, removing silences, and adding them together using the mix paste function in Audacity 1.2.5<sup>1</sup>. The speech-shaped noise was generated by filtering white noise to match the long-term average spectrum of all of the masking sentences. Thus, the two-talker babble masker clearly sounded like the speech of two talkers, while the speech-shaped noise masker sounded like filtered white noise. Stimuli were mixed to present the target at a challenging SNR of −8 dB, and then all stimuli were normalized to the same RMS intensity level using Praat 5.3. At this SNR, prior studies with these stimuli in our labs have shown that comparable performance is typically elicited across the two listening conditions tested here.
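The mixing and normalization steps can be illustrated in a few lines. This is a sketch only, not the authors' Audacity and Praat procedure: it assumes SNR is defined on RMS amplitude, uses toy sinusoids in place of recorded sentences and maskers, and the helper names (`rms`, `scale_to_rms`, `mix_at_snr`) and the output RMS of 0.1 are hypothetical choices.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def scale_to_rms(samples, target_rms):
    """Rescale samples so that their RMS equals target_rms."""
    gain = target_rms / rms(samples)
    return [x * gain for x in samples]

def mix_at_snr(target, masker, snr_db, out_rms=0.1):
    """Mix target and masker at snr_db, then normalize the mixture's RMS."""
    # Scale the masker so that 20*log10(rms(target)/rms(masker)) == snr_db.
    masker_rms = rms(target) / (10 ** (snr_db / 20))
    scaled_masker = scale_to_rms(masker, masker_rms)
    mixture = [t + m for t, m in zip(target, scaled_masker)]
    return scale_to_rms(mixture, out_rms), scaled_masker

# Toy signals standing in for a sentence and a masker (not real stimuli).
target = [math.sin(2 * math.pi * 5 * n / 1000) for n in range(1000)]
masker = [math.sin(2 * math.pi * 37 * n / 1000 + 1.0) for n in range(1000)]

mixture, scaled_masker = mix_at_snr(target, masker, snr_db=-8.0)
achieved_snr = 20 * math.log10(rms(target) / rms(scaled_masker))
print(round(achieved_snr, 6), round(rms(mixture), 6))
```

Normalizing after mixing leaves the SNR unchanged, because the same gain is applied to the target and masker components of the mixture.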

Synthetic speech was generated using eSpeak 1.46<sup>2</sup>. eSpeak is a publicly available formant-style text-to-speech synthesizer that runs under Windows and Linux. Stimuli were generated by presenting a text file with one sentence per line to the synthesizer, producing a single sound file containing all sentences spaced at regular intervals. The default voice (male) and speaking rate were used because preliminary, informal testing suggested that these were sufficiently difficult to be comparable to the masked speech stimuli in terms of overall intelligibility, even though the synthetic sentences were noticeably shorter than the natural ones. The resulting wave file was segmented into separate files using Praat 5.3, and these were subsequently RMS amplitude normalized to the same level as the masked and unmasked stimuli generated with natural speech. Thus, there were two masked conditions (speech-shaped noise and two-talker babble) and two unmasked conditions (synthetic speech and natural speech).

### Design

Participants completed two sessions, with inter-session intervals averaging 6.4 days (SD = 5.5; ranging from later in the same day for one participant, to 16 days later for another). In the first session, participants completed the process of informed consent, provided background demographic information, and completed screening tests for hearing thresholds, anxiety, depression, and cognitive function. In the second session participants were played two sentences (not otherwise used in the experiment) in each of the four conditions, and then completed the speech perception task, which consisted of four conditions. Each condition presented one type of stimulus: unmasked natural speech, unmasked synthetic speech, natural speech masked by speech-shaped noise, or natural speech masked by two-talker babble. Conditions were presented in random order across participants.

**Figure 1** shows a schematic diagram of the experiment design. In each condition, there were three sequences of stimuli, which we will refer to here as "runs." The first run in each condition was originally intended to permit the collection of a variety of preliminary physiological data as well as to familiarize the participant with the experimental paradigm. It consisted of 2 min of silence followed by a 0.25 s tone (400 Hz), 0.75 s of silence, and 6 s of presentation of the masker (in the two masked trials) or silence in the unmasked trials. The idea was to enable the collection of data under true baseline conditions (in silence) as well as in a noise-only condition (see Parsons, 2007). This was followed by 60 s of silence and then two trials using sentences not otherwise used in the rest of the experiment. Preliminary analyses conducted after the first three participants had completed the study suggested that there was little benefit to analyzing physiological responses during the various portions of this run because some participants did not remain sufficiently still during the silent periods, so although it was included for all subsequent participants in order to maintain a consistent experimental protocol across participants, it was not further analyzed.

<sup>1</sup>www.audacity.sourceforge.net

<sup>2</sup>http://espeak.sourceforge.net/index.html

The second and third runs presented the experimental stimuli, and had identical formats. Each experimental run began with 30 s of silence, followed by eight experimental trials (sentences). Each trial began with a 0.25 s beep (400 Hz), followed by 0.5 s of silence, and then the start of the masking sound which began 0.75 s before the speech stimulus, resulting in a total duration of 1.5 s between the onset of the warning beep and the onset of the speech signal to be repeated. In the two unmasked conditions the period between the warning beep and the start of the speech stimulus was also 1.5 s, but the period following the beep was silent up to the beginning of the target sentence. The target sentence ended 0.25 s before the end of the noise, between 2.768 and 3.503 s after the sentence began (or 1.208–1.904 s for the synthetic speech). Twelve seconds after the initial warning beep, a second, identical beep was played to indicate to the listener that they should repeat the sentence they heard, or as much of it as they could remember. Eight seconds later the next trial began. Thus, each trial, from initial warning beep to beginning of the next trial, lasted 20.5 s while each run consisted of the presentation of eight sentences and lasted 3 min, 14 s. In total, each condition (three runs, including two containing eight sentences each) lasted 10 min, 16 s, and the entire session required a minimum of 41 min, 4 s (although exact durations varied somewhat because of different times spent between runs and between conditions). All participants finished the second session in under an hour.
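The reported run, condition, and session durations follow from the component durations given above. The sketch below reproduces that arithmetic; it assumes that the first (familiarization) run is simply the sum of its listed components plus two practice trials of the same 20.5 s length, which is an inference from the text rather than something stated explicitly.

```python
# Durations in seconds, taken from the text.
TRIAL_S = 20.5            # warning beep to start of next trial
RUN_LEAD_IN_S = 30.0      # silence at the start of each experimental run
TRIALS_PER_RUN = 8

experimental_run_s = RUN_LEAD_IN_S + TRIALS_PER_RUN * TRIAL_S

# Assumption: the first (familiarization) run is the sum of its listed parts:
# 2 min silence, 0.25 s tone, 0.75 s silence, 6 s masker or silence,
# 60 s silence, and two practice trials.
first_run_s = 120.0 + 0.25 + 0.75 + 6.0 + 60.0 + 2 * TRIAL_S

condition_s = first_run_s + 2 * experimental_run_s
session_s = 4 * condition_s

def as_min_sec(seconds):
    """Convert a duration in seconds to a (minutes, seconds) tuple."""
    return divmod(int(round(seconds)), 60)

print(as_min_sec(experimental_run_s))  # (3, 14)
print(as_min_sec(condition_s))         # (10, 16)
print(as_min_sec(session_s))           # (41, 4)
```

Under this assumption the totals match the stated 3 min 14 s per run, 10 min 16 s per condition, and 41 min 4 s minimum per session.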

# Behavioral Measures

During the speech perception task, the experimenter scored the number of key words repeated correctly on each sentence. Each sentence contained four keywords, so there were a total of 64 possible correct responses in each condition (four words per sentence, eight sentences per run, two runs per condition). The experimenter also administered an abbreviated version of the NASA Task Load Index (TLX; Hart and Staveland, 1988) after each block. Following Mackersie et al. (2015), the present study included only four of the six subscales from the original TLX (Mental Demand, Performance, Effort, and Frustration) and slightly revised the questions to make them more appropriate for the current listening task context. The other two dimensions, Physical Demand and Temporal Demand, were excluded in the interest of time, and because this listening task did not impose any physical or response time demand on participants. The questionnaire was administered orally, with participants rating each measure on a scale of 0–20, so that they could remain still throughout the task.

# Physiological Recordings

Immediately prior to the speech perception task period, participants washed their hands carefully with soap and water, and let them dry thoroughly. During the task, autonomic nervous system responses were collected using a Biopac MP150 Data Acquisition System, including a Biopac GSR100C (electrodermal response) amplifier and a PPG100C (pulse plethysmograph) amplifier. Acquisition and analysis were conducted using AcqKnowledge 4.3 software (Biopac Systems, Inc.) on a Dell Latitude E6430 running Windows 7.

### Electrodermal Response Measures

Self-adhesive Ag/AgCl electrodes for measuring skin conductance were affixed to the palmar surface of the medial phalanges of the first (index) and second (middle) finger on the participant's right hand. Following recommended procedures, the electrodes were left in place for at least 5 min before data collection began (Potter and Bolls, 2012). The tonic conductance, in microSiemens (µS), between the two electrodes was recorded with an initial gain of 5 µS/V and at a sampling rate of 2.5 kHz. The signal was subsequently resampled to 19.5 Hz to facilitate digital processing (see Arnold et al., 2014 for comparable methods). The resulting tonic skin conductance level (SCL) curve was then smoothed using the built-in AcqKnowledge algorithm with baseline removal (baseline estimation window width of 1 s), and phasic SCRs were automatically identified from this signal as peaks greater than 0.01 µS occurring within a window beginning 1 s after the warning tone (to avoid including responses to the tone itself) and ending 10 s later (about 1 s before the tone indicating that participants should begin speaking). Two SCR-related measures were examined: SCR frequency and SCR amplitude.
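The windowed peak criterion described above can be illustrated with a minimal sketch. This is not the AcqKnowledge algorithm, whose internals are not described here; it is a plain local-maximum detector written for illustration, and the function name, parameters, and synthetic signal are all hypothetical.

```python
def find_scr_peaks(signal, fs, tone_onset_s, threshold=0.01,
                   start_offset_s=1.0, window_s=10.0):
    """Return (time_s, amplitude) pairs for local maxima above `threshold`.

    signal: baseline-removed (phasic) skin conductance, in microsiemens
    fs: sampling rate in Hz
    tone_onset_s: time of the warning tone within `signal`, in seconds
    """
    start = int(round((tone_onset_s + start_offset_s) * fs))
    stop = int(round((tone_onset_s + start_offset_s + window_s) * fs))
    stop = min(stop, len(signal) - 1)   # keep signal[i + 1] in bounds
    peaks = []
    for i in range(max(start, 1), stop):
        is_local_max = signal[i - 1] < signal[i] >= signal[i + 1]
        if is_local_max and signal[i] > threshold:
            peaks.append((i / fs, signal[i]))
    return peaks


# Synthetic 20 s record at 19.5 Hz with three triangular bumps (made-up data).
fs = 19.5
signal = [0.0] * 390


def add_bump(center_idx, height):
    for k in range(-3, 4):
        signal[center_idx + k] += height * (1 - abs(k) / 4)


add_bump(10, 0.05)     # about 0.5 s: inside the excluded first second
add_bump(98, 0.05)     # about 5 s: inside the window, above threshold
add_bump(156, 0.005)   # 8 s: inside the window, below threshold

peaks = find_scr_peaks(signal, fs, tone_onset_s=0.0)
print(len(peaks))   # 1
```

Only the second bump is counted: the first falls in the 1 s exclusion period after the tone, and the third does not exceed 0.01 µS.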


### Blood Pulse Measures

A pulse plethysmograph transducer (TSD200) was affixed securely but comfortably using a Velcro band to the palmar surface of the distal phalange of the participant's right ring (third) finger. This transducer emits an infrared signal and calculates the amount that has been reflected by the blood volume in the capillary bed it faces (Berntson et al., 2007). Reflectance, and thus signal level, increases with increased capillary blood volume. This signal, in volts, was initially digitized at a sampling rate of 2.5 kHz, was subsequently down-sampled to 312.5 Hz to facilitate digital analysis, and was then digitally band pass filtered (Hanning) between 0.5 and 3 Hz to remove potential artifacts. The resulting signal is periodic, with a frequency (PR) corresponding to heart rate. However, because this is a measure derived from capillary volume rather than directly from the heart signal, we will refer to it as PR rather than heart rate.
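As a quick check on the signal-processing parameters above, the sketch below converts the reported sampling rates into a decimation factor and the band-pass edges into cycles per minute. The constant names are ours; only the numerical values come from the text.

```python
RAW_RATE_HZ = 2500.0        # original digitization rate
PULSE_RATE_HZ = 312.5       # rate after down-sampling
PASSBAND_HZ = (0.5, 3.0)    # band-pass filter edges

decimation_factor = RAW_RATE_HZ / PULSE_RATE_HZ
passband_bpm = tuple(edge * 60.0 for edge in PASSBAND_HZ)

print(decimation_factor)   # 8.0
print(passband_bpm)        # (30.0, 180.0)
```

The pass band therefore retains periodic components between 30 and 180 cycles per minute.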

Following a combination of methods used by Potter et al. (2008) and Wise et al. (2009), PR and PA were calculated in 1 s increments over the 10 s beginning at the first warning beep for each trial, and each was referenced to its respective baseline (pre-stimulus) value calculated over the 2 s immediately preceding the beep for that trial.<sup>3</sup> This resulted in scores centered around 1, with values greater than 1.0 indicating an acceleration in PR or an increase in PA, and values less than 1.0 indicating a deceleration or a decrease in PA. Two blood pulse measures were examined: PR and PA.
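The baseline referencing described above can be sketched as follows. The function name and the example values are hypothetical, and we assume for illustration that the 2 s baseline is the mean of two 1 s windows; the text does not specify how the baseline value was computed.

```python
def baseline_ratio(values_post, values_pre):
    """Express each post-beep value as a proportion of the pre-beep mean."""
    baseline = sum(values_pre) / len(values_pre)
    return [v / baseline for v in values_post]


# Made-up pulse rates (beats per minute) for a single trial.
pre_beep = [75.0, 75.0]                       # two 1 s windows before the beep
post_beep = [75.0, 76.5, 75.0, 73.5, 72.0]    # first five 1 s windows after the beep

ratios = baseline_ratio(post_beep, pre_beep)
print([round(r, 2) for r in ratios])   # [1.0, 1.02, 1.0, 0.98, 0.96]
```

Values above 1.0 correspond to acceleration relative to the pre-beep baseline and values below 1.0 to deceleration. The same function applies unchanged to pulse amplitude.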


# RESULTS

### Keyword Recognition (Intelligibility)

In order to meet the criteria for application of analysis of variance, proportion correct responses were transformed into rationalized arcsine units (RAU; Studebaker, 1985), shown in **Table 2**. This is simply a linear transformation of the results of a traditional arcsine transformation, with the goal of putting the transformed values into a range that is comparable to that of the original percentages over most of the range of values (i.e., between the "stretched" tails of the distribution). Results of a generalized linear model analysis of variance (ANOVA) with condition treated as a repeated measure showed a significant effect of condition, F(3,39) = 44.47, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.77, with the unmasked, natural condition (115.5 RAU) being significantly better understood (p < 0.001 in all cases by Tukey HSD post hoc analysis) than the other three (speech-shaped noise = 93.3, two-talker babble = 91.9, and synthetic speech = 98.2, all values in RAU). There was also a significant difference between two-talker babble and synthetic speech (p = 0.04). However, there was no significant difference between the speech-shaped noise and two-talker babble conditions, suggesting that, as intended, the two masked speech conditions were comparable in terms of intelligibility and both were significantly less intelligible than the unmasked speech.
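For readers unfamiliar with the transformation, the following is a minimal sketch of the rationalized arcsine formula given by Studebaker (1985). The function name is ours, and applying it to a total of 64 keywords per condition is an assumption for illustration; the text does not state at what level the scores were transformed.

```python
import math


def rau(correct, total):
    """Rationalized arcsine units for `correct` out of `total` items."""
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.0 / math.pi) * theta - 23.0


print(round(rau(32, 64), 1))   # 50.0
print(rau(64, 64) > 100)       # True
```

A score of 50% maps to about 50 RAU, while scores near ceiling are stretched beyond 100, which is why a condition mean such as 115.5 RAU is possible.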

### Subjective Task Demand (Self-Report)

Scores on the four Task Load Index questions (Mackersie et al., 2015) were relatively similar across the three difficult conditions, as shown in **Table 3**.

<sup>3</sup>A 10 s window was used because preliminary observations suggested that some listeners were (physiologically) anticipating the signal to begin speaking that occurred 12 s after the beginning of the trial. By ending the analysis window 2 s before the signal to begin speaking, it was possible to avoid including response properties that might pertain mainly to such anticipation.

### TABLE 2 | Behavioral and physiological measures obtained for each condition.

Standard deviations shown in parentheses for all measures.

### TABLE 3 | Mean scores on the NASA TLX subscales in each condition.

Because the different sub-scales of the NASA TLX address distinct theoretical constructs related to task load, separate analyses of variance were conducted to determine whether listeners' subjective ratings of mental demand, performance, effort or frustration differed across the four conditions. Results showed significant main effects of condition for all four scales: Mental Demand, F(3,39) = 28.13, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.68; Performance, F(3,39) = 10.13, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.44; Effort, F(3,39) = 22.38, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.63; Frustration, F(3,39) = 5.42, p = 0.003, η<sub>p</sub><sup>2</sup> = 0.29. The only post hoc (Tukey HSD) pairwise comparisons between conditions that were statistically significant were those that included the unmasked, natural speech condition. These were significant at p < 0.001 for all scales except Frustration, for which the comparison between unmasked natural speech and synthetic speech was significant only at the p = 0.047 level, while the comparisons of unmasked natural speech with the speech-shaped noise masking and two-talker babble masking conditions were both significant (p = 0.03 and p = 0.003, respectively). Although extremely tentative at this point, these results suggest that future research exploring task load for listening to speech masked by other speech might benefit from focusing specifically on listeners' sense of frustration in addition to broader subjective measures of overall task load. Overall, these results suggest that, at least as far as can be determined by self-report, listeners found the degraded speech conditions to be comparatively more demanding than the unmasked natural speech, but not differently demanding compared to one another. However, it must be noted that all scores were relatively low (below 10 on a 20-point scale), suggesting that the overall task was not perceived as particularly demanding.
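The reported effect sizes for the four TLX scales can be cross-checked against the F statistics using the standard identity between partial eta squared and F for a single effect. This is only a consistency check under that identity; the function name is ours, and the authors' own effect size computation for the physiological measures is described separately in footnote 4.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F(3,39) values reported for the four TLX scales.
reported_f = {
    "Mental Demand": 28.13,
    "Performance": 10.13,
    "Effort": 22.38,
    "Frustration": 5.42,
}

effect_sizes = {scale: round(partial_eta_squared(f, 3, 39), 2)
                for scale, f in reported_f.items()}
print(effect_sizes)
# {'Mental Demand': 0.68, 'Performance': 0.44, 'Effort': 0.63, 'Frustration': 0.29}
```

All four values agree with those reported in the text.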

### Physiological Measures

Results from the four physiological measures, SCR frequency, SCR amplitude, PR, and PA, calculated for all four conditions are shown in **Table 2**. There were no significant (uncorrected p < 0.05) Pearson product-moment correlations between any of the measures within each of the four conditions, nor were there any significant correlations across conditions within any of the four measures. These scores were submitted to a linear mixed model ANOVA with repeated measures (SAS 9.3 PROC MIXED; SAS Institute Inc., 2011).<sup>4</sup>

### Skin Conductance Response

A comparison of SCR frequency across the four conditions showed no significant effect of condition, F(3,39) = 2.03, p = 0.13, η<sub>p</sub><sup>2</sup> = 0.14. However, the ANOVA of SCR amplitude showed a significant effect of condition, F(3,36.2) = 3.02, p = 0.04, η<sub>p</sub><sup>2</sup> = 0.21. Note that three cells were omitted from this design because there were no SCR peaks for those subjects in those conditions. Post hoc (Tukey HSD) analyses showed that this effect was carried entirely by a significant difference between the two-talker babble and the speech-shaped noise conditions (p<sub>adj</sub> = 0.031). This suggests that listeners showed a stronger electrodermal response when presented with speech in a background of two-talker babble as compared to a background of speech-shaped noise.

### Pulse Rate

A graph of mean PR (calculated over 10 consecutive 1 s windows beginning at the warning beep prior to the stimulus and referenced as a proportion of the average PR calculated over the 2 s immediately preceding the beep) is shown in **Figure 1**.

Results of a linear mixed model analysis of variance (ANOVA) with two within-subjects measures (condition and time period) showed no significant effect of condition, F(3,39) = 1.31, p = 0.28, η<sub>p</sub><sup>2</sup> = 0.09, and no significant interaction, F(30,520) = 0.79, p = 0.79, η<sub>p</sub><sup>2</sup> = 0.04. However, the effect of Time Period was significant, F(10,520) = 4.65, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.08. Post hoc (Tukey HSD) analyses show that point T + 7 (7 s after the start of the trial, roughly 6 s after the start of the sentence, and between 2 and 3 s after the end of the sentence/stimulus) exhibited a significantly lower PR when compared to all other time points except T + 6. T + 3 and T + 6 were also significantly different, but no other pairwise comparisons were significant at the p < 0.05 level. Thus, there appears to be a slight (but non-significant) increase in PR about 2–3 s after the beginning of the trial, approximately when we might expect the beginning of a response to the onset of the stimulus, followed by a significant decline in relative PR approximately when we might expect to see a response subsequent to the end of the stimulus. Note that a change of about 4%, as seen here, reflects a change of approximately 3 beats (or cycles) per minute, given an observed grand average PR of 74.7 beats per minute across all participants and conditions. Although this amount of change may seem small, it is relatively large compared to changes in PR seen in response to auditory stimuli in previous studies, e.g., Potter et al. (2008) (mean change < 1 BPM).

<sup>4</sup>Effect sizes were calculated independently from F and p statistics by first estimating type III sums of squares within the PROC MIXED procedure, and then applying the methods described by Bakeman (2005).

Even though the lack of a significant interaction effect does not strictly license examination of post hoc test results involving pairwise differences within the interaction (i.e., time point × condition), such planned comparisons may be informative in guiding the design of future research. Indeed, comparison of the lowest PRs for the speech-shaped noise, synthetic speech, and two-talker babble conditions vs. the unmasked natural speech PR at the same time point (i.e., T + 6 for speech-shaped noise vs. T + 6 for natural speech, and T + 7 for synthetic speech and two-talker babble vs. T + 7 for natural speech) shows large differences. Testing these differences using uncorrected post hoc comparisons<sup>5</sup> and comparing the resulting p values to a threshold corrected for sequential multiple comparisons (Holm, 1979) shows that the differences for speech-shaped noise, synthetic speech and two-talker babble are all significant (p<sub>uncorrected</sub> = 0.011, 0.013, and 0.024, respectively), suggesting that degradation of speech induces a significantly greater decrease in heart rate than does unmasked speech, and evidence of this increased reactivity is found approximately 6–7 s following the beginning of the stimulus.
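The sequential (step-down) comparison against Holm-corrected thresholds can be sketched as follows. This is a generic implementation of the Holm (1979) procedure, not the Excel tool used by the authors; the function name is ours, and we assume a family of three comparisons and a familywise alpha of 0.05.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return booleans (in input order): True where the null is rejected under Holm."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        threshold = alpha / (m - rank)   # alpha/m, alpha/(m-1), ..., alpha/1
        if p_values[idx] <= threshold:
            rejected[idx] = True
        else:
            break   # step-down: stop at the first non-rejection
    return rejected


# Uncorrected p values for speech-shaped noise, synthetic speech, two-talker babble.
pr_p_values = [0.011, 0.013, 0.024]
print(holm_bonferroni(pr_p_values))   # [True, True, True]
```

With three comparisons the successive thresholds are 0.05/3, 0.05/2, and 0.05, and each reported p value falls below its threshold, consistent with the statement that all three differences are significant.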

### Pulse Amplitude

A graph of mean PA over 10 consecutive 1 s windows, referenced as a proportion of the average PA over the 2 s immediately preceding the beep in a manner comparable to that of PR in **Figure 1**, is shown in **Figure 2**.

Results of a linear mixed models ANOVA with two repeated measures (condition and time period) showed a significant main effect of condition, F(3,39) = 3.52, p = 0.02, η<sub>p</sub><sup>2</sup> = 0.21, and of time period, F(10,520) = 59.06, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.53, but no interaction, F(30,520) = 0.95, p = 0.54, η<sub>p</sub><sup>2</sup> = 0.05. Post hoc (Tukey HSD) analyses show no significant pairwise differences between conditions (p > 0.08 in all cases). However, post hoc (Tukey HSD) analyses showed significant pairwise differences between time points (p < 0.05 in all cases reported here) as follows: T vs. T + 5 and beyond; T + 1 vs. T + 4 and beyond; T + 2 vs. T + 4 and beyond; T + 3 vs. T + 5 and beyond; T + 4 vs. T + 5 and beyond; T + 5 vs. T + 6 and beyond; T + 6 vs. T + 7 and beyond.

Although the interaction between time point and condition was not significant, and none of the pairwise comparisons between conditions overall or at the same time point were significant in a corrected (Tukey HSD) analysis, as with the PR data discussed above, unlicensed examination of subsidiary effects may provide guidance for subsequent research. In this spirit, examination of the graph combined with pairwise comparisons between conditions suggests that the significant effect of condition is possibly being carried by a difference between masked and unmasked conditions. According to these analyses, there does not appear to be any meaningful difference between the two masked conditions (speech-shaped noise vs. two-talker babble, p<sub>uncorrected</sub> = 0.748) or between the two unmasked conditions (unmasked natural speech vs. synthetic speech, p<sub>uncorrected</sub> = 0.968). There are, however, visible differences between the masked and unmasked conditions that are significant by uncorrected post hoc analyses, although not when compared to a Bonferroni–Holm-corrected threshold (speech-shaped noise vs. unmasked natural speech, p<sub>uncorrected</sub> = 0.042; speech-shaped noise vs. synthetic speech, p<sub>uncorrected</sub> = 0.038; two-talker babble vs. unmasked natural speech, p<sub>uncorrected</sub> = 0.020; two-talker babble vs. synthetic speech, p<sub>uncorrected</sub> = 0.018).<sup>6</sup> Further, it appears that the preponderance of any such effects occurs in the last 5 or 6 time periods, a time at which the masked stimuli (speech-shaped noise and two-talker babble) exhibit considerably lower PA values than do the unmasked stimuli (natural speech and synthetic speech). Specifically, the greatest difference appears to be occurring around time point T + 8 or T + 9, with the divergence beginning around time T + 5 or T + 6. It may be noted that the peak PA response (at T + 9) is occurring about 2 s later than the peak PR deceleration (T + 7), though they begin at about the same time.
This may be due to differences in the speed of response of the two measures or to the cognitive phenomena to which they are related, or both (see Discussion). Although these results must be considered preliminary due to the increased probability of Type 1 error through the reliance on uncorrected post hoc statistical analyses, overall it can be said that there appears to be a difference in the magnitude of the PA response to masked as compared to unmasked speech, and this difference begins to become apparent approximately 5–6 s after stimulus onset, and peaks 2–3 s after that.

<sup>5</sup>Uncorrected comparisons were used because standard post hoc corrections take all pairwise comparisons into account, drastically increasing corrected p-values to compensate for comparisons that are irrelevant to the present analysis. Instead, we have chosen to report raw p-values along with the critical p-value as determined by Holm–Bonferroni sequential correction as implemented for Excel by Justin Gaetano (Gaetano, 2013) for the number of pairwise comparisons that are actually relevant to the present analyses.

<sup>6</sup> Similar results are obtained when examining pairwise comparisons at specific time points, e.g., T + 8 and T + 9.

### Correlations

In order to explore possible relationships between subjective measures of task demand and individual physiological responses, Pearson product-moment correlations were carried out for each of the four conditions between all four subscales of the TLX collected here (Mental Demand, Performance, Effort and Frustration) and six physiological measures: SCR Frequency, SCR Amplitude, Mean Pulse Rate, Mean Pulse Amplitude, and Pulse Rate and Pulse Amplitude at the respective minima shown in **Figures 1** and **2** (for Pulse Amplitude this was time T + 9 for all four conditions, while for Pulse Rate this was time T + 7 for all conditions except speech-shaped noise masking, for which it was T + 6). Due to the large number of comparisons, none of these tests were significant at a level corrected for multiple comparisons (p < 0.002). However, a general trend was observed suggesting that Mean Pulse Amplitude might be more likely than the other measures to correlate with TLX subscales, in that it correlated with ratings of Performance (unmasked natural speech, r = 0.66, p<sub>uncorrected</sub> = 0.01; synthetic speech, r = 0.70, p<sub>uncorrected</sub> = 0.005), Effort (two-talker babble masker, r = 0.53, p<sub>uncorrected</sub> = 0.05; synthetic speech, r = 0.56, p<sub>uncorrected</sub> = 0.04), and Frustration (unmasked natural speech, r = 0.69, p<sub>uncorrected</sub> = 0.007). The only other physiological measures correlating with a TLX subscale measure with a significance at or below p = 0.05 were Mean Pulse Rate (with Performance in the unmasked natural speech condition, r = 0.71, p<sub>uncorrected</sub> = 0.005) and Pulse Amplitude at time T + 9 (with Performance in the speech-shaped noise masking condition).
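The corrected significance level quoted above is consistent with a simple Bonferroni division, as the sketch below shows. That the correction family comprised the 24 correlations within each condition (four subscales by six physiological measures) is our inference from the reported threshold, not something the text states explicitly.

```python
ALPHA = 0.05
N_TLX_SUBSCALES = 4
N_PHYSIOLOGICAL_MEASURES = 6

# Assumed family: every subscale paired with every physiological measure, per condition.
n_tests = N_TLX_SUBSCALES * N_PHYSIOLOGICAL_MEASURES
bonferroni_threshold = ALPHA / n_tests

print(n_tests)                          # 24
print(round(bonferroni_threshold, 4))   # 0.0021
```

The smallest uncorrected p value reported (0.005) is above this threshold, in line with the statement that none of the correlations survived correction.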

### DISCUSSION

Behavioral measures of performance (proportion of key words repeated correctly) and subjective task demand showed that all degraded conditions were significantly less intelligible and imposed greater task demands than the unmasked natural speech condition. Additional findings also suggest that the synthetic speech condition may have been marginally less difficult, as reflected in performance, than the two-talker babble condition, and it may also have been somewhat less frustrating in comparison to unmasked natural speech than were the two masked conditions. These findings suggest that finer-grained assessments of subjective task load and behavioral performance might be informative in future research with stimuli like those used here.

In the present study, participants showed a significant increase in SCR to sentences presented in two-talker babble as compared to those presented in speech-shaped noise. Mackersie and Cones (2011) interpreted their finding that SCRs were elevated in more difficult dichotic digit task conditions as confirming that the SCR may index task demand, but Mackersie et al. (2015), who found no effect of changing SNR (and therefore presumably task demand), moderated these findings by suggesting that SCR may only be sensitive to task demand when performance is very good and/or effort is low. The present results, however, suggest a slightly different interpretation, namely that the SCR may be most indicative of the operation of selective attention. In the present experiment, performance and ratings of task demand were comparable across the two masked conditions, yet the SCR response was significantly stronger when the masker contained intelligible speech. This also highlights a significant difference between the conditions used by Mackersie and Cones (2011) and Mackersie et al. (2015): In the former, the task involved listening to streams of spoken digits presented simultaneously to each ear (i.e., speech in the presence of intelligible masking speech). In the latter, the masker consisted of a mixture of speech signals from 5 different talkers, two of which were time-reversed, making the mixture potentially much less intelligible, and perhaps closer in intelligibility to the current speech-shaped noise condition. Further research is necessary to investigate the possibility that an increase in skin conductance may correspond to the engagement of attentional mechanisms involved in separating acoustically similar streams of speech.

In contrast, physiological measures of blood PR and PA suggested the possibility that there might be some differences between one or more of the degraded conditions and the unmasked natural condition. With respect to PR, the appearance of a significant deceleration approximately 5–6 s after the start of the stimulus is consistent with the expectation that the stimuli in question require some degree of mental processing. Such deceleration is consistent with the appearance of an OR indicating the holding in reserve of cognitive resources in anticipation of having to encode a perceptually demanding stimulus. The lack of any apparent increase in PR during the span of the analysis window suggests that processing these stimuli, once they are encoded, does not require significant additional mental elaboration. Notably, in-depth (but speculative) examination of the main effect of condition suggested a difference between the synthetic and unmasked natural conditions, suggesting that the OR to the synthetic stimuli was stronger (perhaps indicating an anticipation that the stimuli would be perceptually more complex) than for the unmasked natural speech. Further inspection of the data suggested that the same might be true for the other two degraded conditions as well. Even more speculatively, it is possible that there is a slight deceleration within the first 1–2 heart beats after trial onset (time T + 1) followed by a small acceleration (T + 2, T + 3) prior to the large deceleration discussed here. Such a triphasic response (deceleration, acceleration, deceleration) would be consistent with results observed from studies with shorter and less meaningful auditory stimuli (Keefe and Johnson, 1970; Graham and Slaby, 1973; cited in Andreassi, 2007, p. 354).
In short, it seems likely that all three types of degraded speech required greater commitment of cognitive resources in the service of initial encoding of the signal (as indicated by a stronger OR for these stimuli), but that synthetic speech may have incurred the greatest demand. Further research is necessary to better specify the structure of the heart rate response associated with auditory stimuli of sentence length, and to better quantify factors that affect different components of this response.

Finally, there is a clear decrease in PA peaking approximately 4–5 s following the end of the stimulus. Decreased PA has long been associated with an increased demand on working memory capacity (Iani et al., 2004), so this pattern is consistent with the hypothesis that listeners were engaging working memory systems in processing the speech stimuli presented here. Other studies, however, have shown that decreased PA is a physiological response associated with the presence of noise even when task performance is unaffected (Kryter and Poza, 1980; Millar and Steels, 1990). This is then interpreted in terms of the "adaptive costs" model of physiological response to performance under stress, such that decreased PA (and other SNS responses) are considered to reflect "active coping," that is, the application of increased effort to maintain performance in the presence of an environmental stressor (see discussion by Parsons, 2007). Indeed, research by Mattys et al. (2013) suggests that exogenously induced anxiety or stress can influence the application of capacity-demanding processes to speech perception. This interpretation would be consistent with the tentative determination that there may be a difference in the decrease in PA associated with conditions containing added noise (two-talker babble and speech-shaped noise) as compared to that associated with conditions without noise (unmasked natural speech and synthetic speech). If the reliability of this distinction is borne out by future research, its appearance here may be interpreted as reflecting either a greater commitment of working memory resources to the listening task in the two masked speech conditions as compared to the unmasked conditions, or (also) a more complex response incorporating both an autonomic stress response associated with performing a task in noise as well as the greater cognitive effort required to maintain performance when listening to degraded speech. 
Further research is necessary to determine whether there is in fact a reliable distinction between the PA response to speech in noise as compared to similarly difficult unmasked speech, and, if so, to further untangle direct and indirect effects of noise on the application of working memory to speech perception in both masked and unmasked conditions.

While the determination that there is an overall increased commitment of working memory capacity to speech perception in degraded conditions would be completely consistent with the predictions of the effortfulness hypothesis, the apparent discrepancy between the conclusions drawn from the different pulse measures (rate vs. amplitude) must still be considered. That is, why does the synthetic speech condition, which was significantly more difficult to understand than the unmasked natural speech condition according to both self-reported effort ratings and performance measures, seem to incur greater demand on mental processing as indexed by PR, but not according to the measure of PA? One clue to an answer to this question lies in the observation that the peak of the PA marker seems to be occurring somewhat later during the window of analysis than did the PR response. This temporal difference likely reflects some combination of: (1) a relative delay in the responsivity of the two systems (cardiac deceleration vs. peripheral vasoconstriction), (2) differential contribution of sympathetic arousal affecting both end organs as compared with the combined effects of parasympathetic and sympathetic systems on PR, and (3) each measure reflecting a response to different stimulus processing demands.

While it is entirely likely that the two systems respond on different timescales, the fact that they show discrepant patterns of reactivity for different sorts of stimuli is also quite consistent with the idea that the two measures reflect responses to different aspects of speech processing. In this regard, it is important to note first that previous research comparing physiological responses associated with the perception of degraded (but unmasked) speech to those associated with masked speech has already suggested that these tasks may differ in terms of the degree to which cognitive processes are applied. In particular, Zekveld et al. (2014) found that noise-vocoded speech (degradation without masking) evoked a smaller pupillary response (a measure of ANS reactivity reflecting both sympathetic and parasympathetic contributions) than did noise- and speech-masked natural speech, even when performance was matched. Moreover, regional brain activity, as measured with the BOLD response, in regions associated with speech perception and selective attention (bilateral superior- and medial-temporal gyri, and dorsal anterior cingulate cortex) changed parametrically with pupil dilation, suggesting that different types of degradation result in different degrees of demand on attentional and speech processing systems specifically related to segregating target speech from competing signals. Thus, the differences between responses to masked vs. unmasked stimuli observed here in the PA measures may reflect differences in the engagement of selective attentional mechanisms associated with segregating target from masking signals. On the other hand, the response pattern observed in the PR measures may reflect overall differences in the difficulty of encoding degraded signals as such, or perhaps even differences in the degree to which masked signals are perceived as stressful, arousing or emotionally evocative (Bradley and Lang, 2000).
The fact that the one pattern (PA, related to segregation) appears later in the pulse record than the other (PR, related to orienting and preparation for stimulus encoding) even though one might arguably expect segregation to incur demand earlier in processing than encoding, may be a result of differences in the speed of response of the two systems. Further research is necessary to determine whether stimulus differences that lead to differences in PR vs. PA measures are in fact associated with differential demands on segregation vs. encoding, and, if so, whether they have similar or differing effects on downstream performance (i.e., recall or understanding of the target speech, or processing of subsequent speech), as might be predicted by the effortfulness hypothesis.

In summary, the present results suggest that listening to speech in the presence of a masking sound or sounds introduces additional, or different, processing demands beyond those associated with the simple difficulty of understanding degraded speech. From the present results it cannot be determined whether these additional demands derive from the application of additional, or different, cognitive mechanisms such as those involved in selective attention (as suggested by Zekveld et al., 2014 in explaining related findings) or whether they instead reflect aspects of an affective or emotional stress-like response to the presence of a noxious stimulus (the masker). Given that anxiety may also introduce changes in the cognitive processes applied to speech perception (Mattys et al., 2013), further research is necessary to distinguish between psychophysiological and behavioral consequences of both stress and cognitive demand on speech processing in adverse conditions.

### AUTHOR CONTRIBUTIONS

AF and MM designed the study with input from BC and AA. BC provided the stimuli. AF and AA collected the data and conducted all analyses with input from MM and BC. AF wrote the paper, and MM, BC, and AA provided comments on drafts.



# FUNDING

This research was partially supported by funding from the Department of Speech, Language and Hearing Sciences and the Office of the Provost, Purdue University, under the auspices of a Provost's Fellowship for Study in a Second Discipline granted to AF.

### ACKNOWLEDGMENTS

We are grateful to our participants for their time and interest, to Anne Smith, Bridget Walsh and Janna Berlin for assistance with equipment and study design, to Rob Potter and Annie Lang for advice on data analysis, to Bruce Craig and Rongrong Zhang for assistance with statistical analyses, and to Audrey Bengert, Allison Gearhart, Jessica Lorenz, Alyssa Nymeyer, Nikeytha Ramsey, Jim Rodeheffer, Ashlee Witty and McKenzie Wyss for assistance with running the experiment and conducting data analysis.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Francis, MacPherson, Chandrasekaran and Alvar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Effects of Aging and Adult-Onset Hearing Loss on Cortical Auditory Regions

Velia Cardin<sup>1,2</sup>\*

<sup>1</sup> Department of Experimental Psychology, Deafness, Cognition and Language Research Centre, University College London, London, UK, <sup>2</sup> Department of Behavioural Sciences and Learning, Linnaeus Centre HEAD, Swedish Institute for Disability Research, Linköping University, Linköping, Sweden

Hearing loss is a common feature in human aging. It has been argued that dysfunctions in central processing are important contributing factors to hearing loss during older age. Aging also has well documented consequences for neural structure and function, but it is not clear how these effects interact with those that arise as a consequence of hearing loss. This paper reviews the effects of aging and adult-onset hearing loss on the structure and function of cortical auditory regions. The evidence reviewed suggests that aging and hearing loss result in atrophy of cortical auditory regions and stronger engagement of networks involved in the detection of salient events, adaptive control and re-allocation of attention. These cortical mechanisms are engaged during listening in effortful conditions in normal hearing individuals. Therefore, as a consequence of aging and hearing loss, all listening becomes effortful and cognitive load is constantly high, reducing the amount of available cognitive resources. This constant effortful listening and reduced cognitive spare capacity could be what accelerates cognitive decline in older adults with hearing loss.

Keywords: hearing loss, aging, aging and cognitive function, cognitive decline, auditory cortex, humans

### Edited by:

Patrik Sörqvist, University of Gävle, Sweden

### Reviewed by:

Monita Chatterjee, Boys Town National Research Hospital, USA

Julia Jones Huyck, Kent State University, USA

> \*Correspondence: Velia Cardin velia.cardin@gmail.com

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

> Received: 01 December 2015; Accepted: 22 April 2016; Published: 11 May 2016

### Citation:

Cardin V (2016) Effects of Aging and Adult-Onset Hearing Loss on Cortical Auditory Regions. Front. Neurosci. 10:199. doi: 10.3389/fnins.2016.00199

# INTRODUCTION

Normal aging in humans is often accompanied by hearing loss (Lin et al., 2011b; Humes et al., 2012). This age-related hearing loss is known as presbycusis. In the UK alone, 6.4 million (60%) of those over 65 years of age have some hearing loss (Action-on-Hearing-Loss, 2011). Compared to 25-year-olds, 70-year-olds have average hearing thresholds that are raised 10 dB at lower frequencies (250–1000 Hz), and 20–60 dB at higher frequencies (2–8 kHz). Even though pure-tone detection is strongly associated with auditory processing deficits (Humes et al., 1994; Humes, 1996), the relationship between real-life auditory processing and pure-tone sensitivity is weak, and elderly adults with similar thresholds vary in their ability to understand speech in noise (Schneider et al., 2002; Humes, 2007; Wilson and McArdle, 2008). Other perceptual variables, such as temporal and intensity discrimination, frequency resolution, audibility and binaural processing, account for some of the differences, but they cannot explain all the observed variance (Glasberg et al., 1984; Moore and Peters, 1992; Moore et al., 1992; Sommers and Humes, 1993; Pichora-Fuller and Schneider, 1998; Schneider and Pichora-Fuller, 2000; Gordon-Salant, 2005; Grose et al., 2006; Souza and Boike, 2006; Humes et al., 2010; Gordon-Salant et al., 2011; Grose and Mamo, 2012; Tun et al., 2012). Evidence also suggests that older adults with hearing loss have poorer speech comprehension than older adults with better hearing (Stewart and Wingfield, 2009; Adank and Janse, 2010; Tun et al., 2010), and than young adults with equivalently poor hearing (Dubno et al., 1984; Fitzgibbons and Gordon-Salant, 1995; Wingfield et al., 2006b). Older adults are also more likely to have "hidden hearing loss," where there is damage in high-threshold auditory nerve fibers (usually as a consequence of noise exposure; Kujawa and Liberman, 2009; Schaette and McAlpine, 2011; Plack et al., 2014; Viana et al., 2015). This cochlear neuropathy is not reflected in conventional audiograms, but affects auditory processing at all subsequent levels, including cortical responses (Bharadwaj et al., 2015). Together, these pieces of evidence suggest a functional interaction between the effects of hearing loss and aging, exacerbating the effects that each has in isolation.

It is also known that as a consequence of aging there are changes in brain structure, including reductions in white matter integrity and gray matter volume, and thinning of the cortex (Sullivan and Pfefferbaum, 2006; Grady, 2012; Onoda et al., 2012; Betzel et al., 2014). These are accompanied by changes in connectivity between functional networks, and recruitment of additional brain regions for performance of several tasks (for reviews, see Grady, 2012; Park and McDonough, 2013; Bennett and Madden, 2014; Lockhart and DeCarli, 2014). It is still debatable whether this reflects compensation, dedifferentiation (i.e., loss of functional specialization), or less efficient use of neural resources (see Grady, 2012, for a review). These neural changes have an impact on behavior, with decline in several cognitive domains including attention, working memory and processing speed (Deary et al., 2009; Tun et al., 2012). Cognitive abilities, in particular working memory, are strong predictors of successful understanding of speech in noise, and a decline in these is likely to impact negatively on auditory and speech processing (Pichora-Fuller, 2003; Akeroyd, 2008; Arlinger et al., 2009; Anderson et al., 2013; Arehart et al., 2013; Zekveld et al., 2013). Importantly, not all brain regions are equally affected by older age, and some functions are preserved through the lifespan. For example, structural changes are more pronounced in the prefrontal cortex (Raz et al., 1991), connectivity in the default mode network seems to be particularly affected in older age (Tomasi and Volkow, 2012), and even though many cognitive functions decline, language comprehension skills are preserved in older adults (Shafto and Tyler, 2014).

Evidence suggests that hearing loss in older adults also contributes independently to cognitive decline, exacerbating the effects of physiological aging (Lin et al., 2011a, 2014; Pichora-Fuller and Levitt, 2012; Wayne and Johnsrude, 2015). Several theories have been put forward to explain the relationship between hearing loss and cognitive decline in the elderly (Baltes and Lindenberger, 1997; Pichora-Fuller, 2003; Lindenberger and Ghisletta, 2009; Sarampalis et al., 2009; Tun et al., 2009; Heinrich and Schneider, 2010; Ronnberg et al., 2011). However, it is not yet clear what the relationship between them is, and which neural mechanisms are affected.

This review provides a summary of the effects of adult-onset hearing loss and aging on the function and structure of the central auditory system in humans. The exclusion of literature investigating early onset hearing loss is because the effects of hearing loss on brain structure and function will vary with the developmental stage and biological age at which the sensory deprivation occurs. This is mainly due to the following reasons:


In short, due to the interplay that hearing loss will have with language acquisition and with sensitive periods of neural plasticity in early life, the effects of adult-onset hearing loss cannot be equated to those of onset during adolescence or infancy (Lyness et al., 2013). Therefore, the studies reviewed here will look exclusively at adult-onset hearing loss, with the aim of understanding what the effects of hearing loss are in a brain that has established sensory and cognitive systems.

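The review is divided into two main sections: the first covers brain structural changes as a consequence of hearing loss and aging, and the second covers the effects of aging and hearing loss on brain function.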


This paper concentrates on evidence obtained from the study of humans, given the relevance that language function has for human communication, but also for auditory processing. However, animal studies have been extremely informative for our understanding of hearing loss and aging, and several excellent reviews discuss these topics in detail (Frisina, 2009; Fetoni et al., 2015; Ouda et al., 2015).

# BRAIN STRUCTURAL CHANGES AS A CONSEQUENCE OF HEARING LOSS AND AGING

### Auditory Cortex in Humans

In non-human primates, primary auditory areas are grouped in a "core" region, and secondary areas are grouped in "belt" and "parabelt" regions, located concentrically around the core (see Hackett, 2011, for a review). The core regions represent the first level of cortical auditory processing, and the surrounding belt and parabelt regions support higher levels of processing. In humans, the auditory cortex is located in the superior temporal gyrus (STG), but its precise extent and borders are not clear (Hackett, 2015). The auditory core is located in Heschl's gyrus (HG), but its functional subdivision is still a matter of debate (e.g., Formisano et al., 2003; Da Costa et al., 2011; Dick et al., 2012). An alternative anatomical approach is to characterize primary auditory areas based on their microstructural properties. Postmortem cytoarchitectonic analysis has revealed three distinct areas in Heschl's gyrus (from postero-medial to antero-lateral): Te1.1, Te1.0, and Te1.2. Based on its granularity, Te1.0 is the most likely human homolog of the primary auditory cortex (Morosan et al., 2001; Tahmasebi et al., 2009; Hackett, 2011). The fact that these cytoarchitectonic definitions bypass auditory stimulation for the definition of functional areas, which can be complicated or impossible in participants with auditory deficits, has made these cytoarchitectonic maps popular in the study of hearing loss in humans.

### Morphometry

Techniques measuring gray matter volume, cortical thickness and surface area have been the most commonly used to assess structural brain changes as a consequence of hearing loss (**Table 1**). However, outcomes from these techniques have been mixed. Of those studies measuring morphometric changes in auditory cortices, two have shown a positive correlation between hearing loss and reductions in gray matter volume (Peelle et al., 2011; Eckert et al., 2012), whereas two others did not find an effect (Boyen et al., 2013; Profant et al., 2014). These conflicting findings could be explained by the specificity in the definition of the regions of interest. Eckert et al. (2012) and Peelle et al. (2011), who found that hearing loss is associated with gray matter loss, used probabilistic cytoarchitectonic maps, which are more likely to contain exclusively primary auditory regions. In contrast, Profant et al. (2014) and Boyen et al. (2013) measured structural changes in the whole of Heschl's gyrus, which contains not only primary auditory regions, but also other functionally and anatomically distinct areas, and for this reason results are likely to be more heterogeneous. These differences suggest an association between hearing loss and gray matter that may be constrained to primary auditory cortices. They also call for specificity when defining auditory regions of interest in future studies, as averaging neuroimaging measurements across a whole gyrus may obscure true effects in discrete regions.

The effects of hearing loss on the morphometry of other structures of the temporal lobe and the rest of the brain are also mixed (see **Table 1**). Boyen et al. (2013) found an increase in gray matter volume in STG and middle temporal gyrus (MTG) in hearing impaired individuals. Instead, Husain et al. (2011) found reductions in gray matter in STG in those with hearing loss, but not in those with hearing loss and tinnitus, and Yang et al. (2014) reported reduced gray matter in STG, MTG and inferior temporal gyrus in patients with unilateral hearing loss. Similar discrepancies are found when looking at results from the whole brain (**Table 1**).

Regarding the effect of aging on the morphometry of auditory regions, whole brain analyses do not always show differences between young and older adults in temporal lobes, but those that define discrete regions of interest do. Reductions in gray matter volume and cortical thickness as a consequence of aging have been found in Heschl's gyrus, planum temporale, and STG (Harris et al., 2009; Tremblay et al., 2013; Meunier et al., 2014; Profant et al., 2014). These effects are not uniformly present across the brain (Profant et al., 2014), suggesting that they are indeed specific to auditory areas, and not general senescent effects.

In trying to understand discrepancies in the observed effects of hearing loss and aging on brain morphometry, there are three salient factors: (1) Lack of specificity when defining regions of interest, as explained above (e.g., the occipital lobe has many functionally specialized areas); (2) Measuring the effects of high frequency hearing thresholds vs. average pure tone thresholds. Hearing loss tends to be greater for higher-frequency sounds, and as discussed by Eckert et al. (2012), using the high-frequency component of hearing thresholds may provide more accurate estimations of the effects of hearing loss on neural structure; and (3) All the studies mentioned above are cross-sectional (typically N ≈ 20). Cross-sectional and longitudinal studies do not always provide concordant evidence. For example, in the neurobiology of human aging, longitudinal changes are not always reflected in cross-sectional analyses, and estimated rates do not match longitudinal measurements (Raz et al., 2005; Raz and Lindenberger, 2011). In the field of hearing loss, this discrepancy between cross-sectional and longitudinal measurements becomes apparent in the study by Lin et al. (2014), in which the authors compared brain volume in hearing impaired (N = 51) and normal hearing (N = 75) older adults in a baseline scan and in a follow-up scan that took place, on average, 6.4 years later. They found no significant differences between the groups at the baseline measure, but at the follow-up scan those with hearing impairment showed an accelerated volume decline in the whole brain, and particularly in the right temporal lobe. In short, the use of more specific, hypothesis-driven definitions of regions of interest, in combination with longitudinal approaches, could shed some light on the mixed effects found when measuring gray matter changes as a consequence of hearing loss.

### Diffusion MRI Studies

Another non-invasive brain imaging technique to study brain structure is diffusion MRI (dMRI). This technique measures microstructural parameters, including fractional anisotropy (FA) and mean diffusivity (MD), which reflect properties such as the degree of density and orientation dispersion of neuronal fiber bundles (Jones, 2008; Johansen-Berg and Rushworth, 2009). It also allows tracing of anatomical connections in the living brain.
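As an illustration of how these microstructural parameters are usually derived, the sketch below computes FA, MD and radial diffusivity from the three eigenvalues of a diffusion tensor using the standard textbook definitions. The function name `diffusion_metrics` and the example eigenvalues are hypothetical and are not taken from any of the studies reviewed here, which may have used different software and preprocessing.

```python
import math


def diffusion_metrics(l1, l2, l3):
    """Return (FA, MD, RD) for diffusion tensor eigenvalues l1 >= l2 >= l3.

    Standard definitions (assumed, not taken from the reviewed studies):
      MD = mean of the three eigenvalues
      RD = mean of the two smaller eigenvalues (radial diffusivity)
      FA = sqrt(3/2) * sqrt(sum((li - MD)^2)) / sqrt(sum(li^2))
    """
    md = (l1 + l2 + l3) / 3.0
    rd = (l2 + l3) / 2.0
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    # FA is undefined when all eigenvalues are zero; return 0.0 in that case.
    fa = math.sqrt(1.5 * num / den) if den > 0 else 0.0
    return fa, md, rd


# Hypothetical eigenvalues (in mm^2/s) for a coherently oriented fiber bundle.
fa, md, rd = diffusion_metrics(1.7e-3, 0.3e-3, 0.2e-3)
print(f"FA={fa:.3f}  MD={md:.2e}  RD={rd:.2e}")
```

FA ranges from 0 (diffusion equal in all directions) to 1 (diffusion along a single axis), so a reduction in FA together with an increase in radial diffusivity indicates that diffusion perpendicular to the main fiber direction has become less restricted.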

Microstructural changes associated with hearing loss have been found in subcortical components of the auditory pathway [reduction in fractional anisotropy (FA) and increase in radial diffusivity], such as the lateral lemniscus and the inferior colliculus (Lin et al., 2008). White matter tracts underneath Heschl's gyrus also show a tendency for an effect in microstructure (increase in AvgL2L3, which the authors tentatively suggest could reflect demyelination), but differences do not achieve statistical significance (N = 12–15 per experimental group; Profant et al., 2014). In a whole-brain analysis, a reduction in FA was also found in a large cluster of the right hemisphere which comprised the corticospinal tract, inferior and superior longitudinal fasciculi, inferior fronto-occipital fasciculus, superior occipital fasciculus and anterior thalamic radiations (Husain et al., 2011).

TABLE 1 | Studies evaluating the effect of hearing loss on the structure of the human brain.

dMRI, diffusion magnetic resonance imaging; HI, hearing impaired; HI + T, hearing impaired with tinnitus; NH, normal hearing; GM, gray matter; STG, superior temporal gyrus; MTG, middle temporal gyrus; SFG, superior frontal gyrus; FA, fractional anisotropy; ACC, anterior cingulate cortex; LL, lateral lemniscus; IC, inferior colliculus; HG, Heschl's gyrus; PCG, posterior cingulate gyrus; ITG, inferior temporal gyrus; (–), Information not provided. \*Age range is not provided in this study. It is possible that adolescents have been included in the sample, as authors only specify recruiting participants older than 8 years of age.

Compared to younger adults, older adults show reductions in FA in the acoustic radiation, Heschl's gyrus and STG (Lutz et al., 2007). A non-significant trend in this direction is also observed in the white matter under Heschl's gyrus (Profant et al., 2014).

The evidence so far is scarce, but it suggests that hearing loss and aging result in microstructural changes in white matter tracts of the auditory pathway, potentially compromising cortical auditory function.

# EFFECTS OF AGING AND HEARING LOSS ON BRAIN FUNCTION

### fMRI Studies

One of the most common tools for the study of the human brain is functional Magnetic Resonance Imaging (fMRI). By detecting changes in blood flow that occur as a consequence of neural activity, this technique allows indirect measurement of brain function in a non-invasive manner and with a spatial resolution of millimeters (Ogawa et al., 1990). fMRI has been widely used in the study of aging and hearing loss, but not without challenges. Specifically, there are high levels of acoustic noise during MRI scanning, and special efforts have to be put into selecting acquisition sequences and their interaction with auditory stimuli (Peelle, 2014). Age also produces changes in the vasculature, affecting the blood-oxygenation level-dependent (BOLD) signal, and it is important to separate these vascular effects from those that arise as a consequence of differences in neural function between young and older adults (Tsvetanov et al., 2015). However, some of the most relevant issues affect all studies of auditory processing in aging and hearing loss. Stimulus selection is one of these: the use of simple acoustic stimulation (e.g., detection of tones) may hide effects that are only evident when listening in challenging conditions, whereas complex tasks may reflect problems in cognition, and not auditory processing per se. Furthermore, as explained in more detail in the following section, differences in cortical effects between groups and conditions may be due to compromised processing in peripheral and subcortical regions, which will affect the quality of the signal that arrives at the cortex. In addition, hearing loss is common in older adults, and auditory thresholds are not always measured in studies of aging. Consequently, effects that are assigned to aging may be due to concomitant hearing loss. This confound is sometimes avoided by using stimuli with the same audibility for all participants, but this can compromise frequency encoding. When reviewing the evidence below, these issues will be highlighted when it is likely that they can affect the interpretation of results.

fMRI studies of auditory processing show less activity in cortical auditory regions in older than younger adults. This has been demonstrated using a variety of paradigms, from those in which participants passively listened to words (Cliff et al., 2013), to speech in noise tasks (Hwang et al., 2007; Wong et al., 2009; Bilodeau-Mercure et al., 2015; Manan et al., 2015). This is contrary to findings using simple acoustic stimuli, where there is an increase in the level of activity observed in auditory cortex as a function of age (Profant et al., 2015).

The effects of hearing loss on the function of cortical and subcortical areas also vary with the complexity of the stimuli. Boyen et al. (2014) showed a negative correlation between hearing thresholds (mean PTA = 43 dB HL) and activation elicited by acoustic stimulation in subcortical structures of the auditory pathway (medial geniculate body, inferior colliculus and cochlear nucleus). Such a relationship was not found in the STG. Differences in cortical activation between older adults (mean 69 years) with mild (8 kHz PTA ∼30 dB) and expressed (8 kHz PTA ∼70 dB) presbycusis were also absent in the study of Profant et al. (2015). However, both these studies used basic auditory stimulation. In a study of sentence comprehension in older adults (mean = 64.9 years), Peelle et al. (2011) showed that hearing ability correlated with activity not only in subcortical regions, but also in the auditory cortex (STG encompassing also primary auditory regions, but without defining them specifically). As mentioned above, they also showed that hearing thresholds were positively correlated with gray matter loss in cytoarchitectonic regions Te1.0 and Te1.1, providing a structural and functional link, and suggesting that cortical differences may only be apparent when using fMRI with complex auditory stimulation and challenging tasks. This link between structure and function is important, because cortical functional effects can always reflect processing deficiencies or compensatory mechanisms from subcortical or peripheral stages. However, when functional effects are linked to structural damage or atrophy, it suggests cortical mechanisms are indeed compromised.

Reductions in evoked fMRI activity and in gray matter volume in auditory areas, in particular the STG, are often accompanied by differential recruitment of other cortical regions. For example, in a study of speech in noise, Wong et al. (2009) found that the reduction in activity in auditory areas of older individuals (mean age = 68 years; range = 63–75) was accompanied by stronger recruitment of parietofrontal regions, and that this additional recruitment correlated with performance. In a further study (Wong et al., 2010), they showed that the volume of the left pars triangularis and the cortical thickness of the left superior frontal gyrus were positively correlated with performance in a speech-in-noise test (mean age = 67 years; range = 62–75 years). Gray matter volume in left auditory cortices has also been found to be positively associated with word recognition skills, and negatively associated with activation in anterior cingulate cortex (ACC; age range = 19–39 and 61–79 years; Harris et al., 2009) and middle frontal gyrus (mean age = 42.1 years; range = 21–79 years; Eckert et al., 2008). Furthermore, Tyler et al. (2010) demonstrated that, during a word monitoring task, older adults (mean age = 67.4 years; range = 49–86 years) show additional recruitment of frontal right hemisphere regions. This additional recruitment was positively associated with the level of gray matter atrophy in left frontotemporal regions, including STG, and aided older adults in performing at the same level as the younger group. In summary, as a consequence of aging and hearing loss, there are morphological changes in auditory areas, which are consistent with structural damage. It is not known whether this damage is the cause of compromised auditory processing, and of the reliance on more cognitive resources to aid perception. What is clear from the evidence reviewed above is that additional recruitment of frontal regions is observed when there is damage in auditory areas, and that the amount of damage in temporal cortices and the recruitment of frontal regions predict behavioral performance.

The additional activation and recruitment of frontal regions during auditory tasks, observed both as a consequence of aging and hearing loss, is likely to reflect more widespread changes in network dynamics. The connectivity of cortical functional networks changes in older age. In younger adults, functional activity in the left STG is positively correlated with activity in the right STG; in older adults there is no significant correlation between activity in the left STG and the right STG, but activity in the left STG is significantly correlated with activity in a more widespread set of areas, including frontal regions (mean age = 67 years; range = 63–75) (Wong et al., 2009). Studies of aging have also shown reduced connectivity within a sentence-comprehension network (Peelle et al., 2011; see **Table 1**), and between the salience network and the auditory network (Onoda et al., 2012). This latter effect comes from a resting state study in which connectivity was correlated with age (n = 73; mean age = 60; range = 36–86), and it is not clear whether it is mediated by age itself or by age-related hearing loss (Onoda et al., 2012), as hearing thresholds were not measured. In a study evaluating changes in functional connectivity as a consequence of hearing loss (mean 36 dB HL at 4 kHz), Husain et al. (2014) did not find evidence of hearing loss affecting the pattern of functional connectivity between auditory regions and other cortical areas. However, hearing loss affects the pattern of connectivity in the attention and default mode networks (Husain et al., 2014), suggesting that the effects of aging could be at least partly due to concomitant hearing loss. Importantly, the level of network reorganization observed in older adults is associated with the level of gray matter loss in temporal regions rather than age itself (Meunier et al., 2014). This provides more indirect evidence to support the idea that the effects of aging on the reorganization of cortical functional networks are at least partially due to age-related hearing loss, given that hearing loss is associated with gray matter reductions in temporal areas (see above).

Dynamics of the salience network seem to be particularly influenced by aging and hearing loss. This network includes the ACC, the pre-supplementary motor area and the insula, and it is generally thought to be involved in the detection of salient events, and in deploying the appropriate behavioral responses to these events (Menon and Uddin, 2010). Functional recruitment of components of the salience network during auditory tasks changes in older age. During speech perception tasks, younger adults show stronger activations of ACC in incorrect trials compared to correct ones (Sharp et al., 2005; Harris et al., 2009), and while listening to degraded speech more than when listening to clear speech (Erb et al., 2013). In contrast, in older adults (mean age = 71; range = 61–79 years), there is higher overall activity in ACC during speech perception (Harris et al., 2009), and similar levels of activation with degraded and clear speech (Erb and Obleser, 2013). Furthermore, ACC recruitment is negatively associated with word recognition and speech comprehension in older adults (Sharp et al., 2005; Harris et al., 2009; Erb and Obleser, 2013). Erb and Obleser (2013) argue that it is the degree to which the ACC is engaged and disengaged in degraded and clear speech, respectively, that is associated with better speech comprehension, and that this dynamic range of ACC activity decreases with age, with detrimental consequences for comprehension. Importantly, the level of recruitment of ACC is correlated with gray matter volume loss in HG/STG (Eckert et al., 2008; Harris et al., 2009). Thus, additional cognitive resources are used to achieve successful auditory perception in challenging conditions. In turn, aging and hearing loss affect the successful deployment of such strategies. Future studies should aim to disentangle whether this is a direct effect on the mechanisms of cognitive control, or whether it is mediated by gray matter volume loss in auditory cortices and compromised auditory processing, as discussed above.

Additional recruitment of frontal regions, in particular in the cingulo-opercular cortex (medial frontal cortex, anterior insula and frontal operculum), aids in adaptive control during word recognition (Vaden et al., 2013, 2015), whereby performance improves in subsequent trials by learning from difficult or error trials. Vaden et al. (2013) have shown that cingulo-opercular activity is increased in trials with low intelligibility or errors, and that the magnitude of the cingulo-opercular response in these situations predicts performance in subsequent trials, with increases in activity associated with better performance. This mechanism is still engaged in older adults (mean age = 60; range = 50–81 years), but to a lesser extent (Vaden et al., 2015); therefore, their ability to adapt in subsequent presentations may be somewhat compromised.

Hearing loss further affects cortical mechanisms for cognitive control in older adults. Hearing loss does not seem to compromise cingulo-opercular activity and adaptive control (Vaden et al., 2016). However, experiments by Erb and collaborators show that hearing loss has effects on the activation of the insula while listening both in quiet and in noise (Erb and Obleser, 2013; Erb et al., 2013). In these studies, younger (mean age = 26; range = 22–31 years) and older adults (mean age = 67; range = 56–77 years) activate the anterior insula in adverse listening conditions. However, with higher degrees of hearing loss (tested range = 5–43 dB HL), higher insula activations were observed during clear than degraded speech, demonstrating that hearing loss alters the amount of cognitive resources deployed for speech understanding (Rudner et al., 2009; Stenfelt and Ronnberg, 2009). It should be highlighted that in the experiment of Erb and Obleser (2013), speech was presented at an audible level for each participant. Hearing loss in older adults also modulates the amount of neural activity in STG as a function of the grammatical complexity of the stimuli (Peelle et al., 2011). Together, these pieces of evidence suggest that sensory loss has an impact on the neural resources used for cognitive control, and not only affects the ability to process the perceptual aspects of the speech signal. These central processing effects are unlikely to be reversed by amplification, calling for more rounded cognitive and audiological interventions in those with hearing loss.

Older adults also seem to struggle in suppressing irrelevant information not only from the auditory signal, but also from other sensory systems (Kuchinsky et al., 2012; Vaden et al., 2015, 2016), which is in turn associated with less suppression of activity in sensory cortices, but also more extensive activations in prefrontal and parietal regions (Nielson et al., 2002; Gazzaley and D'Esposito, 2007; Turner and Spreng, 2012). Whereas younger adults (<40 years old) suppressed visual cortex activity when performing an auditory task, older adults (>61 years old) synchronously activated both visual and auditory cortices, failing to suppress irrelevant visual activity (Kuchinsky et al., 2012; Vaden et al., 2015, 2016). Importantly, reducing stimulus integrity had an independent but spatially similar effect to that of aging (Kuchinsky et al., 2012). This similarity in the effects of perceptual degradation and aging suggests that neural mechanisms used for challenging listening are always deployed in older adults, exacerbating the effects of noise and making all listening effortful. In addition, hearing loss further contributes to the detrimental effect of aging. Adults with hearing loss (mean age = 66; range = 45–78 years; PTA 38.4 dB HL) show less suppression of activity in occipital regions during listening than participants with less hearing loss or normal hearing (mean age = 62; range = 53–71 years; PTA 19.2 dB HL; Vaden et al., 2016). This difference in suppression of activity in the visual cortex was observed even when there were no significant differences in performance (participants were actively chosen to control for this).

This failure to suppress irrelevant sensory activity, observed in older adults and in those with hearing loss, could be the result of having to allocate more cognitive resources for listening. This in turn reduces cognitive spare capacity (i.e., the amount of available cognitive resources) (Mishra et al., 2013, 2014; Rudner and Lunner, 2014). This reduced cognitive spare capacity will not only affect visual suppression, but also higher order language processing and any other function that relies on cognitive resources (Mishra et al., 2013, 2014; Rudner and Lunner, 2014). In support of this, aging and hearing loss result in more interference in dual-task paradigms (Tun et al., 2009), and worse comprehension of syntactically complex sentences (Wingfield et al., 2006a; Stewart and Wingfield, 2009; DeCaro et al., 2016). Pupillometry studies also show that aging and hearing loss are associated with less availability of cognitive resources. In normal hearing individuals, the pupil dilates as cognitive load increases, for example when the intelligibility of the speech signal is decreased. Older age (evaluated using a range of 45–73 years of age) and hearing loss (>25 dB HL) result in maintained pupil dilation across noise levels, indicating less release from listening effort as speech becomes more intelligible (Zekveld et al., 2011). Increased pupil dilation and cognitive load are in turn associated with increased activation of cortical auditory regions, but also frontal ones (Zekveld et al., 2014), supporting the idea that in older listeners with hearing loss more resources are allocated for listening in clear conditions. It is perhaps this unavailability of resources that compromises general cognitive function, accelerating decline.

### EEG Studies

Another common method for the study of cortical function is EEG. This technique has poorer spatial resolution than fMRI, but excellent time resolution. Auditory stimulation results in an alteration of the encephalogram known as the cortical auditory evoked potential (CAEP). In adults, the most prominent peaks are N1 (∼100 ms post-stimulus onset) and P2 (∼175–200 ms). Smaller peaks P1 and N2, preceding N1 and succeeding P2 respectively, are also often described (for a review, see Wunderlich and Cone-Wesson, 2006). Changes in the amplitude and latency of these components reflect perceptual discrimination and processing (see Hyde, 1997; Kraus and Cheour, 2000; Wunderlich and Cone-Wesson, 2006; Sharma and Glick, 2016), making EEG an invaluable tool for the study of the effects of hearing loss and aging on cortical auditory function.
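As a rough illustration of how N1 and P2 amplitude and latency can be read off a CAEP, the sketch below builds a synthetic waveform and picks the most negative point in an N1 window and the most positive point in a P2 window. The waveform shape, the amplitudes, the window limits, and all function names are assumptions made for this example; they are not taken from the studies cited above, and real recordings would first need averaging, filtering, and artifact rejection.

```python
import math

def synthetic_caep(t_ms):
    """Toy CAEP in microvolts (assumed shape): negative N1 near 100 ms, positive P2 near 185 ms."""
    n1 = -4.0 * math.exp(-((t_ms - 100.0) ** 2) / (2 * 15.0 ** 2))
    p2 = 3.0 * math.exp(-((t_ms - 185.0) ** 2) / (2 * 25.0 ** 2))
    return n1 + p2

def pick_peak(times, values, window, polarity):
    """Return (latency, amplitude) of the extreme sample inside a latency window."""
    lo, hi = window
    candidates = [(v, t) for t, v in zip(times, values) if lo <= t <= hi]
    chooser = min if polarity == "negative" else max
    amplitude, latency = chooser(candidates)
    return latency, amplitude

# One sample per millisecond from 0 to 400 ms after stimulus onset.
times = list(range(0, 401))
values = [synthetic_caep(t) for t in times]

# Hypothetical search windows around the typical N1 and P2 latencies.
n1_latency, n1_amplitude = pick_peak(times, values, (70, 150), "negative")
p2_latency, p2_amplitude = pick_peak(times, values, (150, 250), "positive")

print(f"N1: {n1_amplitude:.2f} uV at {n1_latency} ms")
print(f"P2: {p2_amplitude:.2f} uV at {p2_latency} ms")
```

Group comparisons of the kind reviewed here would then be made on these per-participant latencies and amplitudes.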

EEG studies of hearing loss have shown increased amplitude in the N1 and P2 components of the CAEP in individuals with hearing loss (compared to control groups with normal hearing; Tremblay et al., 2003; Harkrider et al., 2006; Bertoli et al., 2011; Campbell and Sharma, 2013; but see Wunderlich and Cone-Wesson, 2006). The amplitude and latency of the P2 component are positively correlated with speech-in-noise thresholds, and P2 amplitude is also positively correlated with hearing thresholds at high frequencies (Campbell and Sharma, 2013). These results suggest that the greater the degree of hearing loss and the difficulty understanding speech, the larger and more sluggish the cortical response.

Aging also has effects on the CAEP, increasing N1 and P2 amplitude, and P2 latency (Pfefferbaum et al., 1980; Anderer et al., 1996; Tremblay et al., 2002, 2003; Harkrider et al., 2005, 2006; Ceponiene et al., 2008; McCullagh and Shinn, 2013). It is interesting to note that differences between younger and older adults disappear as noise increases. McCullagh and Shinn (2013) showed that older adults (mean age = 66.4; range 62–77) have higher N1 and P2 amplitudes than younger adults (mean age = 21.4; range = 19–29) in response to an auditory oddball paradigm in quiet conditions (stimuli presented at equal sensation level for all participants). As noise was introduced in the stimuli, the amplitude of the N1 and P2 was maintained in the younger group, but decreased in the group of older adults. These results can be interpreted as older adults having to deploy compensatory mechanisms while listening in quiet, as seen in the fMRI studies described above, but not being able to deploy these mechanisms while listening in challenging conditions.

The EEG evidence reviewed above shows aging and hearing loss affecting the CAEP in the same direction. It has been suggested that these effects on the CAEP reflect inefficient cortical processing in response to a degraded signal (Harkrider et al., 2005; Ross et al., 2007). To address this issue, Harkrider et al. (2006) tested whether the effects of aging and hearing loss on the CAEP disappeared when the audibility of the stimuli was increased. Behavioral differences driven by age and hearing loss disappeared, as did the effect of aging on the CAEP, but there was no change in the effect of hearing loss on the cortical response. This highlights that the effects of age and hearing loss, despite modifying the CAEP in the same direction, could be of a different nature, and thus may need different treatment strategies. These results also support the evidence obtained with MRI, which suggests that hearing loss results in cortical reorganization, demonstrating that the effects of hearing loss on cortical responses are not only a consequence of degradation of the signal or increased effort. In support of this reorganization, source localization reveals a reduction in activation in temporal cortical regions, and recruitment of frontal areas in hearing impaired individuals (∼40 dB HL at 4 kHz; Campbell and Sharma, 2013). This cortical reorganization hypothesis is also in agreement with results of a magnetoencephalography (MEG) study by Dietrich et al. (2001), who showed that the group of neurons that is usually responsive to the lost frequencies starts responding to adjacent tone frequencies when there is hearing loss.

An interesting issue to consider is how much the effects that we observe in cortical responses are due to differences or compensatory mechanisms that arise in subcortical processing stages. From studies in humans and animals, it is known that both aging and hearing loss affect the auditory brainstem response (ABR), resulting in elevated thresholds and reduced amplitudes (see Boettcher, 2002, for a review). We are just beginning to understand how these subcortical effects modulate cortical processing and how they are also regulated by cortical top-down signals. An excellent example of this interaction is the recent effort to characterize the effects of hidden hearing loss in humans, which can also contribute to explaining why older adults with normal audiograms have trouble with speech perception in noise. Animal studies have revealed that noise exposure and aging can produce cochlear neuropathy without causing hair cell loss, and without affecting an individual's ability to detect sounds, resulting in "hidden hearing loss" (Kujawa and Liberman, 2009; Schaette and McAlpine, 2011; Furman et al., 2013; Plack et al., 2014; Viana et al., 2015). This is due to damaged high-threshold, medium- and low-spontaneous rate auditory nerve fibers, which are thought to encode acoustic information at medium to high levels, and when signal to noise ratio is poor (Kujawa and Liberman, 2009; Furman et al., 2013). Post-mortem histopathological analysis has shown that this type of damage exists in human adults with no history of hearing problems and with no apparent cochlear damage, likely contributing to difficulties while listening in challenging conditions (Viana et al., 2015), but without consequences on conventional audiograms. In an elegant combination of behavioral and electrophysiological techniques, Bharadwaj et al. (2015) investigated whether potential behavioral and physiological effects of hidden hearing loss could affect cortical processing. 
Using experimental conditions that were more likely to evoke recruitment of fibers that are more vulnerable to neuropathy, including high sound levels, off-frequency maskers, and shallow modulation depths, they found a correlation between behavioral and electrophysiological measurements of temporal coding fidelity. They further showed that poor subcortical encoding was associated with poor cortical sensitivity to interaural time differences. However, none of these measurements were related to hearing thresholds. In short, they demonstrated that effects that arise at subcortical processing stages are reflected in cortical responses, even in the absence of peripheral damage. Furthermore, this effect was found in younger adults (21–39 years of age) who reported no hearing problems and had normal audiograms (<15 dB HL), so it is likely to be worse in older adults, and a contributor to the effects of aging on listening difficulties despite normal audiograms.

Not only is compromised subcortical processing reflected in cortical responses, but top-down effects from cortical areas can also modulate subcortical stages. Sörqvist et al. (2012) observed that ABRs were modulated by working memory load, suggesting dynamic interactions between top-down and bottom-up mechanisms, e.g., cortical regions that control attention allocation regulating subcortical gating. These results show that we need to consider the nervous system as a whole, and not just investigate processing in isolated areas. More studies integrating comprehensive behavioral measurements of auditory processing and cognition, combined with human neuroscience techniques recording activity at all processing stages, will give us a better picture of how aging and hearing loss affect auditory function.

# REMAINING QUESTIONS AND FUTURE DIRECTIONS


these two factors on the function and structure of specific brain regions. Yet, cortical regions work as part of functional and structural networks, and effects on one node could affect the dynamics of the whole network. In the coming years, the field needs to investigate how these factors, in isolation and combined, influence network dynamics and what treatments are available to avoid the behavioral consequences of altered functions.


# CONCLUSIONS

The evidence discussed here suggests that atrophy of cortical auditory regions is present in hearing loss and older age, potentially compromising auditory processing. In addition, due to peripheral damage, the auditory signal will be poor and degraded. In these situations, a stronger reliance on cognitive resources is necessary in order to achieve successful auditory perception, even in quiet conditions. This is supported by studies of speech perception, where there is additional recruitment of frontal cortical regions when there is damage in auditory areas, which in turn predicts behavioral performance. As a consequence, cognitive load is constantly high, rendering all listening effortful, and reducing the amount of spare cognitive capacity for other tasks (resulting in poor performance in diagnostic tests of cognitive function; Lin and Albert, 2014; Rudner and Lunner, 2014). The cortical mechanisms deployed to aid normal listening are similar to those usually engaged for listening in effortful conditions, including engagement of the saliency network, adaptive control and re-allocation of attention. The problems arise when listening conditions are challenging, and cognitive resources are no longer enough. This constant effortful listening and reduced cognitive spare capacity could be what accelerates cognitive decline in older adults with hearing loss.

In several of the studies reviewed above, hearing loss and aging have similar detrimental effects on cortical processing. However, in some situations hearing loss and aging had effects that could be dissociated (e.g., Harkrider et al., 2006; Vaden et al., 2016), suggesting more than one mechanism for impairment of cortical processing, but also more avenues for treatment. Improved understanding of the independent effects of aging and hearing loss will help us in designing successful interventions. Above all, it is important that future research evaluates whether early audiological interventions, combined with cognitive assessments, can prevent the consequences of hearing loss for brain function and structure, and reduce cognitive decline.

### REFERENCES


### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

## FUNDING

This work was funded by the Linnaeus Centre HEAD, The Swedish Research Council (grant number: 2007-8654), Action on Hearing Loss (Project 598), and the Economic and Social Research Council of Great Britain (Grant RES-620-28-0002).



**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Cardin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Processing Complex Sounds Passing through the Rostral Brainstem: The New Early Filter Model

John E. Marsh<sup>1,2</sup>\* and Tom A. Campbell<sup>3</sup>

<sup>1</sup> School of Psychology, University of Central Lancashire, Preston, UK, <sup>2</sup> Department of Building, Energy and Environmental Engineering, University of Gävle, Gävle, Sweden, <sup>3</sup> Neuroscience Center, University of Helsinki, Helsinki, Finland

The rostral brainstem receives both "bottom-up" input from the ascending auditory system and "top-down" descending corticofugal connections. Speech information passing through the inferior colliculus of elderly listeners reflects the periodicity envelope of a speech syllable. This information arguably also reflects a composite of temporal-fine-structure (TFS) information from the higher frequency vowel harmonics of that repeated syllable. The amplitude of those higher frequency harmonics, bearing even higher frequency TFS information, correlates positively with the word recognition ability of elderly listeners under reverberatory conditions. Also relevant is that working memory capacity (WMC), which is subject to age-related decline, constrains the processing of sounds at the level of the brainstem. Turning to the effects of a visually presented sensory or memory load on auditory processes, there is a load-dependent reduction of that processing, as manifest in the auditory brainstem responses (ABR) evoked by to-be-ignored clicks. Wave V decreases in amplitude with increases in the visually presented memory load. A visually presented sensory load also produces a load-dependent reduction of a slightly different sort: The sensory load of visually presented information limits the disruptive effects of background sound upon working memory performance. A new early filter model is thus advanced whereby systems within the frontal lobe (affected by sensory or memory load) cholinergically influence top-down corticofugal connections. Those corticofugal connections constrain the processing of complex sounds such as speech at the level of the brainstem. Selective attention thereby limits the distracting effects of background sound entering the higher auditory system via the inferior colliculus. Processing TFS in the brainstem relates to perception of speech under adverse conditions. 
Attentional selectivity is crucial when the signal heard is degraded or masked: e.g., speech in noise, speech in reverberatory environments. The assumptions of a new early filter model are consistent with these findings: A subcortical early filter, with a predictive selectivity based on acoustical (linguistic) context and foreknowledge, is under cholinergic top-down control. A prefrontal capacity limitation constrains this top-down control, which is guided by the cholinergic processing of contextual information in working memory.

Keywords: auditory brainstem response (ABR), complex auditory brainstem response (cABR), electroencephalography, magnetoencephalography, temporal fine structure (TFS), selective attention, new early filter model, cognitive hearing science

Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Samira Anderson, University of Maryland, USA Christian Füllgrabe, MRC Institute of Hearing Research, UK

> \*Correspondence: John E. Marsh jemarsh@uclan.ac.uk

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

> Received: 27 October 2015 Accepted: 17 March 2016 Published: 10 May 2016

### Citation:

Marsh JE and Campbell TA (2016) Processing Complex Sounds Passing through the Rostral Brainstem: The New Early Filter Model. Front. Neurosci. 10:136. doi: 10.3389/fnins.2016.00136

# INTRODUCTION

One of the most challenging tasks that most people perform on a daily basis is perceiving and understanding speech in background sound such as noise. Be that noise interfering voices in a restaurant, music, or traffic in the street, the sociopsychological impact is profound for many elderly listeners, whether or not they suffer from peripheral hearing loss. The majority of audiological patients have difficulty understanding conversation in noise (Kochkin, 2000). Noise may obscure or degrade speech information, such that only a fraction of the speech signal is available to the listener's brain. Listening and communicating under adverse conditions (Mattys et al., 2012) is known to engage compensatory brain mechanisms, particularly in elderly listeners (Wong et al., 2009).

The purpose of this article is to provide a theoretical model explaining phenomena related to the cognitive hearing science of the perception and comprehension of speech in noise. This model is intended to focus new enquiry. Having highlighted the scale of the problem motivating this objective, we first offer two necessary definitions: (i) Elevated audiometric thresholds define hearing impairment; (ii) Sensory processing is the way that the nervous system receives information from the auditory periphery and turns that information into perceptual representations. Deficits of sensory processing thus not only include losses that cause elevated audiometric thresholds and/or supra-threshold auditory processing deficits, but also include what has been termed "hidden loss" (Schaette and McAlpine, 2011; Plack et al., 2014). Considering such hidden loss, Kujawa and Liberman (2015) have revealed cochlear synaptopathy in an animal model, characterized by changes either at the level of the synapse from hair cells to auditory nerve fibers or at the level of the nerve fibers themselves. Kujawa and Liberman showed that in age-related hearing loss, synaptopathy precedes hair cell loss. This synaptopathy likely causes problems hearing in noise even before the loss of those hair cells. Accordingly, such synaptopathy is one origin of a hidden loss, which affects hearing (in noise) without elevating audiometric thresholds. Further, when the person's brain adapts to peripheral loss such as damage to hair cells, this loss can become hidden. The neural mechanisms of sensory processing between primary auditory nerve fibers and the rostral brainstem of the central auditory system thus undergo adaptive neuroplastic changes, such that the individual is audiometrically normal (Schaette and McAlpine, 2011). The evidence for hidden loss thus challenges a watertight definition of hearing impairment based on audiometric thresholds alone. 
To further specify the definition of sensory processing, deficits in sensory processing may thus reside in the auditory periphery or in the central auditory system. However, the long-term neuroplastic changes in sensory processing, which accommodate sensorineural loss, involve adaptive changes in the auditory nerve and/or the central auditory system.

Turning from defining sensory processing to applying this notion to aging, the aging of individuals with bilateral sloping hearing loss causes a decline in sensory processing. Specifically, the weaker activation of superior temporal regions reflects that decline (Wong et al., 2010). This is accompanied by an increase in the recruitment of more general cognitive brain areas of the frontal lobe (Wong et al., 2009). The development of a larger and more active left pars triangularis of the inferior frontal gyrus and left superior frontal gyrus compensates when listening under adverse conditions including speech in noise (Wong et al., 2010). Also, prefrontal activation correlated positively with improved speech-in-noise performance in older adults. These data thus support the decline-compensation hypothesis (Wong et al., 2009). This hypothesis postulates that the neurophysiological characteristics of an aging brain with respect to sensorily and cognitively demanding tasks include a reduced activation in (auditory) sensory areas, which otherwise support sensory processing, alongside an increase in general cognitive (association) areas, respectively. Long-term neuroanatomical changes, which permit compensatory prefrontal cortical activation in response to sensory decline, may be a double-edged sword. Such changes may cause maladaptive changes in cognitive abilities not related to speech-in-noise perception. In that sense, these changes would reflect a cognitive decline. Having introduced the decline-compensation hypothesis, we now turn to other extant hypotheses.

A seminal review (Schneider and Pichora-Fuller, 2000) contrasts four further hypotheses of associated declines in sensory and cognitive processing. The "sensory deprivation hypothesis" and the "information degradation hypothesis" both assume that sensory decline occurs before cognitive decline. The "sensory deprivation hypothesis" assumes that prolonged sensory decline drives a chronic cognitive change. By contrast, the "information degradation hypothesis" assumes that sensory decline immediately drives an acute cognitive decline. The "cognitive load on perception hypothesis" assumes that age-related cognitive decline occurs before sensory decline. Cognitive decline thus drives changes in perception: what we term sensory processing. The "common-cause hypothesis" assumes a common age-related factor causes a deterioration of both sensory processing and cognition. Wong et al.'s (2009, 2010) data supporting the decline-compensation hypothesis are also compatible with long-term chronic changes assumed by the sensory deprivation hypothesis. These data are not compatible with the acute changes assumed by the information degradation hypothesis and are agnostic as to whether sensory decline drives cognitive decline, or vice-versa as the cognitive load on perception hypothesis assumes. However, these data rule out the common-cause hypothesis: There was not an age-related decline in the activation during speech-in-noise perception across sensory and cognitive areas (Wong et al., 2009).

Pertinent to these findings, Lin et al. (2011) postulated that the compensatory dedication of general cognitive resources to difficult auditory perception could also cause an accelerated decline in cognitive faculties. With peripheral age-related hearing loss leading to deafferentation of the auditory nerves and, in turn, a loss of afferents within the central auditory system, the perception and understanding of speech become more difficult. Auditory perception is also difficult under environmentally adverse conditions such as noise or reverberation. A competing theory that Lin et al. evaluated is that social isolation and loneliness, caused by communication impairments (Strawbridge et al., 2000), could relate to cognitive decline and neuroanatomical indicators of Alzheimer's disease pathology (Bennett et al., 2006). The decline-compensation hypothesis (Wong et al., 2009, 2010) rather assumes that the compensatory dedication of general cognitive resources to difficult auditory perception accelerates neurocognitive decline. Of particular interest are complex span tests that assess working memory capacity (WMC; e.g., Daneman and Carpenter, 1980; Turner and Engle, 1989; for an introduction to different working memory [WM] processes, see Baddeley, 1986). These complex span tasks involve retaining a memory load during some form of concurrent mental processing, and they are more strongly affected by cognitive aging than simple verbal short-term memory span (Bopp and Verhaeghen, 2005). Forward digit span, a measure of simple verbal short-term memory span, requires the mental operation of retaining digit items in their original order. Backward digit span also requires the concurrent reordering of those items for backward report. Backward digit span and complex span tasks thus share the common requirement for concurrent mental processing during retention. 
Backward recall, sharing commonalities with both forward recall and complex span, is thus only intermediately susceptible to cognitive aging (Bopp and Verhaeghen, 2005).

Having introduced aging and working memory, it is worth considering the role of working memory in the perception of speech under acoustically adverse conditions. Perceiving and understanding speech in noise involves retaining a memory load of contextual information. Such context proactively predicts, and retroactively repairs, utterances containing degraded sensory information (Marslen-Wilson, 1975; Samuel, 1981; Shahin and Miller, 2009; Shahin et al., 2012). The retention of information occurs while the listener concurrently performs linguistic processing. This linguistic processing affects the perceptual and semantic processing of that degraded sensory information in a top-down manner. Indeed, Uslar et al. (2013) revealed that the more complex the linguistic processing required, when perceiving speech in noise, the higher the signal-to-noise ratio required to identify 80% of the presented stimuli. Uslar et al.'s findings thus cohere well with the notion that speech-in-noise perception relies on a WM function: managing the trade-off between the (more complex linguistic) processing and the retention of (semantosyntactic contextual) information. Further corroboration of this notion stems from training on a backward span task (in noise). Such training improves complex span performance (WM improvements generalizing from the backward span task) and also enhances speech-in-noise performance (Ingvalson et al., 2015).
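One way to make the "80% of the presented stimuli" criterion concrete is to assume a logistic psychometric function relating signal-to-noise ratio to the proportion of stimuli identified, and then to invert it. The sketch below does this. The logistic form, the parameter values, and the two sentence-type labels are hypothetical choices for illustration; they are not Uslar et al.'s data or method.

```python
import math

def proportion_correct(snr_db, midpoint_db, slope_per_db):
    """Assumed logistic psychometric function: proportion identified at a given SNR."""
    return 1.0 / (1.0 + math.exp(-slope_per_db * (snr_db - midpoint_db)))

def snr_for_target(target, midpoint_db, slope_per_db):
    """Invert the logistic: the SNR at which the target proportion is reached."""
    return midpoint_db + math.log(target / (1.0 - target)) / slope_per_db

# Hypothetical parameters: the linguistically complex material is assumed to
# have its 50% point at a less favorable (higher) SNR than the simple material.
simple = {"midpoint_db": -7.0, "slope_per_db": 0.8}
complex_ = {"midpoint_db": -5.5, "slope_per_db": 0.8}

snr80_simple = snr_for_target(0.8, **simple)
snr80_complex = snr_for_target(0.8, **complex_)

print(f"SNR for 80% correct, simple sentences:  {snr80_simple:.2f} dB")
print(f"SNR for 80% correct, complex sentences: {snr80_complex:.2f} dB")
```

Under these assumed parameters, the more complex material needs a higher signal-to-noise ratio to reach the same 80% criterion, which is the direction of the effect described in the text.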

Turning to a different form of adverse conditions, background noise from to-be-ignored sources is not the only form of noise affecting the processing of to-be-attended speech. Reverberation pervades the built environment and is particularly challenging for hearing-impaired listeners: The speech signal produced by the talker reverberates off hard surfaces, such as walls, reaching the listener in the form of an echo at a delay from the speech signal. Reverberation thus obscures speech perception cues of the direct signal (Nábělek, 1988). However, it has been shown that humans are capable of perceptual compensation (Watkins and Raimond, 2013): They use tacit knowledge of the room acoustics from immediate prior speech sound context to reduce the adverse effects of reverberation on speech perception. Accordingly, the listener's brain forms, and retains in memory, a mental model of the room's acoustics when listening. This model is used in a top-down manner to select and predict the perceptual representation of the current utterance to support speech perception under reverberatory adverse conditions.

A goal of the present article is thus to refocus new enquiry into the perception and comprehension of speech under adverse conditions by offering a new theoretical cognitive model of subcortical speech processing. The necessary evidence integrated thus centers on the relation of WM to the brainstem's processing of speech under adverse conditions. These conditions include noise and reverberation. A further goal is to communicate, beyond the consequences of such peripheral masking effects, how cognitive aging and plasticity of the auditory nerves and central auditory system driven by hearing loss can affect the brain's processing of speech in noise.

In the following, we will introduce the pivotal role of the rostral auditory brainstem as an anatomical and informational hub of the "bottom-up" ascending and "top-down" descending auditory systems. In turn, we will review the current state-of-the-art on the complex Auditory Brainstem Response (cABR) to speech sounds. What then ensues is a discussion of findings concerning the relation of effects of reverberation on speech intelligibility to the speech ABR representation of speech TFS. These findings concern elderly listeners. This discussion will then flow into how memory load and WMC can influence the generation of wave V of the auditory brainstem response (ABR) to clicks. In turn, the influence of memory load and sensory load on auditory distraction will be considered. The discussion will ultimately converge on a new early filter model, reviving Broadbent's (1958) influential assumption: There is a capacity limitation on how the human mind processes information. That bottleneck in processing selects information early on for further processing. The rostral brainstem is arguably crucial in the operation of that early filter, to which we now turn.

# THE ROSTRAL BRAINSTEM AS A COMPUTATIONAL HUB IN THE ASCENDING AND DESCENDING AUDITORY SYSTEMS SERVING AS AN EARLY FILTER

# Generators of the Auditory Brainstem Response

localized to cortical regions. Credits: (A) is adapted with permission from Campbell et al. (2012). Promotional and commercial use of the material in print, digital or mobile device format is prohibited without the permission from the publisher Wolters Kluwer Health. Please contact healthpermissions@wolterskluwer.com for further information. (B–D) are adapted with permission of John Wiley and Sons from Parkkonen et al. (2009). Copyright © 2009 Wiley-Liss, Inc.

A rapid volley of deflections of the click-elicited ABR, deflections of scalp-measured electrical potentials, occurs mostly within the first 10 ms after the onset of a sound (**Figure 1A**). Tone-pip-elicited ABR deflections occur slightly later (Ikeda, 2015). Assessments of the deflections of ABRs are already in routine clinical use. The audiology lecturer's E-COLI mnemonic (Hall, 2007), detailing a one-to-one peak-to-structure mapping, misidentifies the nature of ABR source generation. The mnemonic specifies E: eighth nerve action potential (wave I); C: cochlear nucleus (wave II); O: olivary complex (superior) (wave III); L: lateral lemniscus (wave IV); I: inferior colliculus (wave V). This bottom-up route does reflect some of the detail of the ascension of information through the subcortical auditory system upwards toward the medial geniculate body of the thalamus and then the auditory cortex. Yet, sophistication is warranted: Many-to-one mappings of anatomical source generator structures to each deflection are apparent (Hall, 2007). Further, vertex-negative troughs as well as vertex-positive peaks can also have source generators. Multiple sources can be concurrently active and a subset of those generators reflected in the timing and amplitude of the ABR peak (**Figures 2A,B**).

Further vindicating a sophistication concerning the mapping of source generators to deflections, a far-field magnetoencephalographic investigation (Parkkonen et al., 2009) localized wave V to regions posterior and lateral to both the lateral lemniscus and inferior colliculus (IC) of the hemisphere contralateral to the stimulation. These Equivalent Current Dipole source models of magnetic Auditory Brainstem Responses (mABR) represented the net effect of simultaneously active sources. It cannot be ruled out that concurrent activation of both lateral lemniscus and IC contributed to this wave V. However, as measured directly during surgery, fibers of the lateral lemniscus have been shown to generate the wave V peak (Møller and Jannetta, 1982; Møller et al., 1994). Those fibers enter the IC, though there may be further consequences for the activation of the IC indicated by the later longer-lasting high-amplitude SN10 negativity (Davis and Hirsh, 1979; Møller and Jannetta, 1983)<sup>1</sup>. This IC is the largest structure of the brainstem and wave V the largest wave of the ABR with commonly used filtering parameters. However, wave V is not affected by deafferentation of the IC (Møller and Burgess, 1986).

As depicted in **Figure 2A**, ABR source generators are subcortical processing stations. These stations are on the pathway of the ascending auditory system, mediated by neuronal elements originating from sensory receptors. In psychological terms that pathway may be described as bottom-up. This pathway begins with the auditory nerve fibers that input the cochlear nuclei and bifurcate from where information is then transmitted upward to other brainstem, midbrain, and thalamic stations up to the auditory cortex. These ascending connections running from cochlear to cortex are termed corticopetal connections.

### Interim Summary

There is a series of subcortical generators of the ABR within the ascending auditory system. There are many-to-one mappings from the activation of generators to the sequence of scalp-measured deflections in the ABR.

### Corticopetal-Corticofugal Loops

In addition to the ascending auditory system introduced above, there is a descending auditory system with extensive efferent top-down projections. These systems of ascending and descending connections are not independent (Bajo and King, 2013). Rather, Bajo and King theorize that the auditory system is a series of dynamic loops in which changes in activity at higher levels in the brain affect neural coding in the IC. These loops also affect other subcortical nuclei as much as signals received from lower structures of the brainstem (**Figure 2**). In control theory, such loops could permit a corrective positive feedback. Accordingly, a loop receives a top-down expectancy of neural output descending from the requirements of higher structures of the auditory system. To specify these terms, an "expectancy" is a prediction signal from higher structures to lower structures in the context of previous ascending input from a lower structure. This prediction signal is also based on what information the higher structures "require" lower structures to select. For instance, consider selective attention to behaviorally relevant targets of a certain fundamental frequency: The prediction signal coding the expectancy from higher structures may require lower structures to provide information about the behaviorally relevant fundamental frequency. The deviation of the actual neural output of an ascending connection from that expectancy then leads to an alteration in the descending connections of that loop. Those altered descending connections, in turn, affect how the ascending connections code future neural input. As **Figure 2** depicts, the auditory system is thus theoretically a collection of dynamic control loops. As each of these loops contains corticopetal and corticofugal connections, such a loop is termed a corticopetal-corticofugal loop.
Each loop is influenced by changes in higher levels and input from lower loops. Suga et al. (2000) postulate that such corticopetal-corticofugal loops perform cortically "egocentric selection." Noise information ascends affecting descending corticofugal connections. This effect on corticofugal connections leads to a transient shift, thus sharpening the lateral inhibition of ascending connections. Accordingly, subsequent noise leads to a small suppressed ascending output to noise information: a small short-lived cortical change thus occurs in response to noise stimulation. When the ascending information is a fear-conditioned signal rather than noise, that information ascends to the auditory cortex and auditory association cortex. In turn, these cortices activate the cholinergic basal forebrain via the amygdala—a cortical influence on the basal forebrain that can also be affected by an unconditioned somatosensory shock stimulus, possibly by ascending thalamic routes (Weinberger, 1998).
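The loop logic described above (an expectancy, a deviation of the ascending output from that expectancy, and a resulting change in the descending connection) can be illustrated with a deliberately simple numerical toy. This is only a sketch under stated assumptions: the scalar `gain`, the learning `rate`, and the function name `run_loop` are hypothetical and are not taken from Bajo and King (2013) or Suga et al. (2000); the sketch makes no claim about actual neural implementation.

```python
def run_loop(inputs, expectancy, gain=0.2, rate=0.5):
    """Toy corticopetal-corticofugal loop (illustrative assumption, not a neural model).

    inputs     -- ascending input strengths, one per stimulus presentation
    expectancy -- output level that the higher structure "requires" (hypothetical scalar)
    gain       -- initial ascending gain, set by the descending connection
    rate       -- how strongly a deviation alters the descending connection
    """
    deviations = []
    for x in inputs:
        output = gain * x                # ascending (corticopetal) output
        deviation = expectancy - output  # mismatch with the top-down expectancy
        gain += rate * deviation * x     # descending (corticofugal) adjustment
        deviations.append(abs(deviation))
    return gain, deviations


# Ten presentations of the same unit-strength input
final_gain, deviations = run_loop([1.0] * 10, expectancy=1.0)
print(round(final_gain, 3), [round(d, 3) for d in deviations[:4]])
```

With these assumed parameter values the deviation shrinks on every presentation, which is the sense in which altered descending connections change how later ascending input is coded.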

### Interim Summary

The auditory system is a hierarchy of corticopetal-corticofugal loops. These loops can dynamically adapt. By virtue of being hierarchically organized, such a loop can selectively filter incoming information on the basis of top-down control from higher structures.

### Cortical Cholinergic Attention System

Having introduced the notion of hierarchical control of corticopetal-corticofugal loops of the central auditory system, we turn now to how the highest of these loops could be controlled. Sarter et al. (2005) reviewed evidence for a reciprocal feedback loop between the basal forebrain and the prefrontal cortex. This feedback loop controls the cholinergic projections to the prefrontal cortex within an anterior attentional system (**Figure 2A**). This positive feedback loop also controls the cholinergic output to other brain areas including sensory areas, yet without reciprocal feedback. Such a system of cholinergic feedback provides a basis for top-down control of sensory processing. This control occurs via the basal forebrain through the release of acetylcholine by efferent top-down projections to sensory areas including the auditory cortex (Kilgard and Merzenich, 1998; **Figure 3**). Acetylcholine thus affects the auditory cortex, with top-down projections influencing sensory cortical processing. Kilgard and Merzenich revealed that such top-down reorganization occurred without either a fearful or an aversive stimulus. It is thus viable that prefrontally controlled attention to stimuli, for instance during the long-term experience of listening to a specific language, rather than fear conditioning, can cholinergically permit attention to those auditory experiences to cause long-term changes in the operation of egocentric selection by corticopetal-corticofugal loops. Also viable is that the prefrontally controlled cholinergic modulation of corticofugal connections from the auditory cortex is an attentional modulation of auditory subcortical processing.

<sup>1</sup>To obtain this subsequent SN10 (Davis and Hirsh, 1979), a wide input bandpass, such as 40–3000 Hz, is required in EEG measurements. An SN10 in the click ABRs was apparent (**Figure 1A**; Campbell et al., 2012) with a low cutoff within the recommended range for cABRs of 30–100 Hz (Skoe and Kraus, 2010). While the origins of the SN10 include the inferior colliculus in the brainstem, the data of Parkkonen et al. (2009) suggested that there is a contribution of contralateral cortical sources of the SN10 and thus warrant some words of caution. The recommended approach (Skoe and Kraus, 2010) is to record with a low cutoff of 1 Hz and then to filter digitally offline. Recording click ABRs on the same day as the cABRs is also conventional. To prevent the strong, possibly cortical, contributions of SN10 to cABRs, a recommendation for cABRs is thus a digital filter that substantially removes the SN10 to click ABRs from the same session. That filter should not remove wave V of the ABR.

Visual attentional demands can also influence such subcortical auditory processing. When a cat visually attends a mouse, subcortical auditory responses of the dorsal cochlear nucleus are reduced (Hernández-Peón et al., 1956). Further, attention to a visual discrimination task reduces responses of the auditory nerve to clicks (Oatman, 1971; Oatman and Anderson, 1977). In humans, Lukas (1980) revealed that attention to the visual modality also reduces auditory nerve responses, while Puel et al. (1988) showed that such attention reduced the otoacoustic emissions evoked by a click. Prefrontal influences of visual attention on such subcortical auditory filtering by corticofugal influences on corticopetal-corticofugal loops could also, in turn, permit visual attention to influence the cortically generated auditory supratemporal mismatch negativity (Erlbeck et al., 2014; Campbell, 2015). This convergent evidence thus points toward a very early stage of attention that influences subcortical auditory mechanisms.

### Interim Summary

We introduced the cholinergic top-down control assumption that the cholinergic cortical attentional system controls an early filter. Corticofugal modulation of corticopetal-corticofugal loops leads to an attentional selection crucially affecting the level of the rostral brainstem. The rostral brainstem is the locus of action of that filter, being integral to the confluence of ascending, descending, ipsilateral, and contralateral effective connectivity of the subcortical central auditory system.

# Attention and Auditory Brainstem Responses

In contrast to this evidence for top-down control, ABRs proved, in several early studies, to be unaffected by attention (Woldorff et al., 1987, 1993; Woldorff and Hillyard, 1991). Compellingly, although Woldorff et al.'s findings indicated no attentional effects on ABRs, the same studies showed attentional augments of auditory middle latency response (AMLR) deflections (20–50 ms), alongside attentional augments of auditory long latency responses (ALLRs). These ALLRs include N1 and P2. In Woldorff et al.'s "dichotic" listening tasks, participants attempted to attend to target deviants (D) in an oddball sequence of standards (S), SSSSDSSSSSSSSD... Attending those deviants, while ignoring unattended deviants in an oddball sequence presented in the other ear, affected the P20–P50 of the AMLR and the "Nd" of ALLRs. Contrastingly, ABRs were unaffected by such attention in these dichotic listening tasks.

Inconsistent with the findings of Woldorff et al., Ikeda et al. (2008) showed that selective attention affected tone-pip ABRs (**Figure 4**). A task requirement of perceptual discrimination between pips of a target frequency and a non-target frequency, alongside rather loud (100 dB SPL) contralateral masking noise, sufficed to cause attentional augments of ABRs. Those attentional augments occurred in the range of waves II–VI in response to attended target sounds relative to sounds that participants just ignored (while reading a book). Conversely, Ikeda et al. (2008) also revealed attentional decrements of all ABRs to attended frequent non-targets relative to acoustically identical sounds that participants just ignored. The augments and decrements of ABRs by selective attention were more apparent with a contralateral Cz-A2 bipolar channel than with the Cz-A1 channel ipsilateral to stimulation. These Cz-A2 ABRs arguably more strongly reflected right hemisphere generators that were contralateral to the left ear that received the tone pips. The extent of these selective attention effects on ABRs was also greater with louder (100 dB SPL) than with quieter (80 dB SPL) masking noise. The implication is that the mechanisms of selective attention affecting ABR generation are promoted by the binaural interaction of information from to-be-ignored masking noise; masking noise that would make the task more effortful. These mechanisms affect generators ipsilateral and contralateral to the attended ear. An assumption is that these mechanisms involve the descending corticofugal routes between subcortical processing stations.

The earliest signs of binaural interaction of the ascending auditory system in the ABR, at least in some individuals, occur during Wave III (e.g., Wong, 2002; Hu et al., 2014). This Wave III generation could implicate the superior olivary complexes (SOC) after the first bifurcation from the cochlear nucleus within the subcortical ascending auditory system. Such binaural interactions can be attentionally modulated at least for tone-pip stimuli (Ikeda, 2015). These interactions involve cells exhibiting ipsilateral excitation alongside contralateral inhibition (Ikeda, 2015). Conceivable is that binaural interactions with tone pips also engage cells exhibiting ipsilateral excitation as well as contralateral excitation (Ikeda, 2015). The findings of Ikeda et al. (2008) revealed that selective attentional effects on Wave II can be affected by contralateral noise. The descending olivocochlear projection could mediate an improved selection of the attended target at the level of the cochlear nucleus. This selection would occur prior to the first bifurcation of the ascending auditory system including an ascending projection to the SOC of the contralateral hemisphere. The top-down influence of that descending olivocochlear projection could exclusively involve covert attentional mechanisms. Such mechanisms could operate at the level of the cochlear nucleus or also involve the outer hair cells (Maison et al., 2001). Another hypothesis is that these covert attentional mechanisms even modulate the muscles affected during the middle ear acoustical reflex (Ikeda et al., 2013). There is thus evidence for a corticofugally operated top-down early selective filtering mechanism affecting processing during the first few milliseconds. This mechanism comes particularly into play under adverse conditions including noise (Maison et al., 2001). This mechanism is arguably less necessary and apparent under the experimental conditions that Woldorff et al. employed.
The early processing of that sound, affected by top-down attentional effects, thus becomes sensitive to the demands of the task and what the sound is.

### Interim Summary

The ABR is attentionally modulated in loud noise.

### Refractoriness of ABRs and ALLRs

We turn now from attentional modulations of ABRs and ALLRs to their relative susceptibility to attenuation on repeated presentation of a sound: refractoriness. In this subsection, we intend to tackle why the subcortical processing indexed by ABRs more closely reflects temporal information within the acoustical waveform than thalamocortically generated responses. The answer to this question hinges on this notion of refractoriness. The time course of auditory evoked responses (EPs), otherwise known as "auditory event-related potentials" (ERPs), is time-locked to the onset of a sound. Deflections of the ALLRs of auditory ERPs, such as the supratemporally generated auditory N1, attenuate on repeated presentation of a sound. This attenuation recovers after a period of silence (e.g., Butler, 1973; Campbell and Neuvonen, 2007), which is termed the refractory period. When stimulus-specific neuronal elements are unstimulated, those neurons are released from refractoriness (e.g., Campbell et al., 2003, 2005, 2007; see **Figure 5A**). By contrast to ALLRs, such as the auditory N1, ABRs are relatively unaffected by refractoriness: For instance, even with multiple reductions in interstimulus interval from 53 to 3 ms, all ABR deflections were unaffected except for wave V (Picton et al., 1992). Wave V showed a prolongation of peak latency at interstimulus intervals of 3 ms only. However, Valderrama et al. (2014) compared ABRs elicited with interstimulus intervals of 21–25 ms to those elicited with interstimulus intervals of 2–5 ms. Valderrama et al. thus found that shorter interstimulus intervals reduced ABR amplitudes and affected ABR morphology. On balance, ABRs are less subject to refractoriness than the auditory N1; this refractoriness occurs at briefer interstimulus intervals, at which stimuli still evoke ABRs with a clear morphology. Indeed, Valderrama et al. (2014) deconvolved overlapping ABR signals with interstimulus intervals as short as 2–5 ms.
Thus when a complex sound such as a speech stimulus /dA/ is presented, the consequence, after the ABR to the onset, is that the rostral brainstem generates an ongoing response to aspects of the ongoing /dA/ sound.

### Interim Summary

The ABR is relatively unaffected by refractoriness. Thus when complex sounds are presented, a flowing river-of-information passes through the rostral brainstem that abstracts envelope and periodicity information generating a cABR response.

FIGURE 5 | Longer refractory periods of auditory N1 than for ABRs. The grand-averaged auditory N1 to a tone in a pitch-varying sequence of tones, 9 different pitch tokens, presented at an interstimulus interval (ISI) of 328 ms, is less refracted than when presented in a 1-token repeated tone sequence (A). Stimulus-specific cortical neuronal elements sensitive to pitch become less responsive upon repeated stimulation, and recover after a period of quiescence. Such quiescence is more common with multiple different pitch tokens. The inter-token repetition interval between stimulation of stimulus-specific elements is longer with a higher token set size. Such elements contribute to N1 generation and thus N1 is refracted in the 1-token relative to 9-token sequences; Campbell et al. (2007); n = 12. ABRs are also subject to refractoriness (B), though new deconvolution techniques show sounds still elicit ABRs with ISIs of 2–5 ms. The ABRs are from a representative participant with intact hearing, Valderrama et al. (2014); n = 1. Credits: (A) is adapted with permission of John Wiley and Sons from Campbell et al. (2007). Copyright © 2007 Society for Psychophysiological Research. (B) is reprinted from Valderrama et al. (2014). Copyright © 2014, with permission from Elsevier.


# Attention, Expectancy, and Prediction Affect Both the cABR and Speech-in-Noise Perception

Being relatively unaffected by refractoriness, the cABR is thus responsive to landmarks in the acoustical waveform (Skoe and Kraus, 2010; Campbell et al., 2012). The representation of lower frequencies of that acoustical waveform predominates the cABR waveform. The cABR generator process thus seems to abstract the envelope and the fundamental frequency of the stimulus away from the acoustical waveform. The cABR does so at a time-lag of 8 to 10 ms. After the ABR response to the consonantal onset of the /dA/ stimulus, the cABR reflects that informational flow through rostral brainstem generators of the ABRs, with the contribution of a distinct Frequency Following Response or "FFR" (Chandrasekaran and Kraus, 2010; Xu and Gong, 2014; Bidelman, 2015; Xu and Ye, 2015). This FFR locks primarily to the fundamental frequency of the vowel portion that the rostral brainstem also generates, albeit in the IC. The form of FFR typically recorded when analyzing cABRs is an "envelope FFR" (Aiken and Picton, 2008) or "envelope following response" (Easwar et al., 2015; Varghese et al., 2015). This EFR follows the periodicity envelope. The envelope differs from the spectral FFR (Aiken and Picton, 2008; Easwar et al., 2015) that follows the spectral frequency of the stimulus. Though there are cochlear nucleus (CN), trapezoid body, and superior olivary complex (SOC) contributions to the FFR (Marsh et al., 1974) as well as a cortical contribution (Coffey et al., 2016), there is a dramatic reduction in a form of FFR accomplished by a subcortical cooling of the IC (Smith et al., 1975). On balance, generators in the vicinity of the rostral brainstem, encompassing the lateral lemniscus and IC, predominate both the cABR to consonantal and vowel portions of a speech sound. The flow of information through the rostral brainstem indexed by the cABR is time-lagged. This time-lag concerns how long the landmark information takes to reach the rostral brainstem. 
A series of investigations revealed that attention augments the FFR: Galbraith and colleagues (Galbraith and Arroyo, 1993; Galbraith et al., 1995, 1998, 2003) showed that whether comparing attending sounds to not attending sounds, or whether attending to a selected auditory stream of sound while ignoring another, an attentional augment of the FFR is shown and that FFR is higher in amplitude with speech sounds (for an alternative perspective, see Varghese et al., 2015). A separate series of experiments also corroborated that the familiarity of speech or music affected the time-course and dynamics of FFR via experience-dependent plasticity (Musacchia et al., 2007; Wong et al., 2007; Song et al., 2008; Chandrasekaran et al., 2012).
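As a concrete illustration of what "FFR amplitude at the fundamental frequency" can mean computationally, the following minimal sketch measures the amplitude of a recorded waveform at a chosen frequency with a single-bin discrete Fourier sum. The signal is synthetic, and the sampling rate, fundamental frequency, harmonic weights, and the helper name `amplitude_at` are illustrative assumptions rather than parameters from any of the cited studies.

```python
import math


def amplitude_at(signal, fs, freq):
    """Amplitude of `signal` at `freq` (Hz), from a single-bin discrete Fourier sum."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    return 2 * math.hypot(re, im) / n


fs = 10000    # sampling rate in Hz (assumed)
f0 = 100.0    # fundamental frequency in Hz (assumed)
lag = 0.009   # 9 ms delay, inside the 8 to 10 ms range mentioned in the text
n = 2000      # 200 ms window, i.e., 20 full cycles of f0

# Synthetic "response": f0 component (amplitude 1.0) plus a weaker second
# harmonic (amplitude 0.3), both delayed by `lag`
response = [
    1.0 * math.sin(2 * math.pi * f0 * (i / fs - lag))
    + 0.3 * math.sin(2 * math.pi * 2 * f0 * (i / fs - lag))
    for i in range(n)
]

a_f0 = amplitude_at(response, fs, f0)      # amplitude at the fundamental
a_h2 = amplitude_at(response, fs, 2 * f0)  # amplitude at the second harmonic
print(round(a_f0, 3), round(a_h2, 3))
```

Because the measure uses the magnitude of the Fourier sum, it is insensitive to the response delay; the analysis window is chosen to span whole cycles of the frequencies of interest so that the components do not leak into one another.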

Turning from these initial studies revealing influences of experience and attention on FFRs, a recent investigation of auditory attention and FFRs (Lehmann and Schönwiesner, 2014) showed that attentional selection in background speech noise can rely on both frequency and spatial cues. This selection can also rely on frequency cues alone. In Lehmann and Schönwiesner's procedure, participants attended to vowels uttered by the designated speaker while ignoring another speaker (attend the male and ignore the female, or attend the female and ignore the male). These participants were required to detect occasional attended pitch-deviant target vowels by pressing a button. In a diotic condition, audio-recordings of a male repeating /a/ and a female speaker repeating /i/ were intermixed such that the same sound mixture was presented to both ears. In a dichotic condition, the male speaker's repeated /a/ was presented to the left ear and the female speaker's repeated /i/ was presented to the contralateral ear. In both the diotic and dichotic conditions, the FFR followed the distinct fundamentals of both vowels. In the dichotic condition only, attending the male (on the left) relative to attending the female (on the right) increased the amplitude of the FFR at the fundamental frequency of male's /a/. The direction of attention thus arguably affects the FFR. Lehmann and Schönwiesner computed a neural spectral modulation index of how much attention affects the FFR. This index was higher in the dichotic than the diotic conditions. Spatial cues were thus important to attentional selection, which conceivably occurs at the level the rostral brainstem. Further, frequency cues were also sufficient for attentional selection in that the modulation index was above zero in the diotic condition. Accordingly, attentional selection does not require the segregation of attended and ignored information to different sides of the brain. 
Further, individual variability in the amplitude of these attentional FFR augments, whilst selecting one voice and ignoring another, was related to the detection of pitch-deviant targets in the attended stream (Lehmann and Schönwiesner, 2014): the stronger the attentional modulation of FFR, the lower the discriminability of the attended pitch-deviant target. Relative to individuals performing at ceiling, participants who struggled more with the task thus applied more attention to the task's stimuli affecting the brainstem representation of those stimuli. The IC, at least in part, generated this attentionally augmented FFR (Bidelman, 2015). In addition, an extensive corticofugal efferent system arguably influenced the generation of this attentional augment in a manner that is both goal-directed and behaviorally relevant. For evidence of a cortical contribution to FFR, see Coffey et al. (2016).

Having established the FFRs of cABRs are influenced by auditory attention and long-term auditory experience, it is worth emphasizing that the cABRs generated in the rostral brainstem are not the automated readout of stimulus attributes in an informational vacuum. Rather, cABR generation is affected by expectancies derived from the immediate preceding context. An investigation of neural entrainment in children revealed such effects of acoustical context on cABR (Chandrasekaran et al., 2009). The notion was that a variable sequence of acoustically distinct monosyllables containing a /dA/ syllable prevents the preceding context from predictively enhancing the neural representation of the current stimulus /dA/. "Neural entrainment" using the context of a repeated /dA/ (**Figures 6A,B**) reflected such an enhancement. This neural entrainment enhanced the cABR second harmonic amplitude during the formant transition between consonantal and steady-state vowel portions of /dA/ (**Figures 6C–E**). The cortex could process a memory of the preceding context, leading to a top-down expectancy. Subcortical corticopetal-corticofugal loops attempt to meet that expectancy when encoding the current stimulation. The stronger such neural entrainment for the second harmonic in the formant transition, the better the speech-in-noise performance. Such neural entrainment of the cABR is thus functionally relevant for speech-in-noise performance.

This neural entrainment, enhancing the second harmonic during the formant transition, predicted speech-in-noise performance (Chandrasekaran et al., 2009) as assessed by the Hearing In Noise Test or HINT (Nilsson et al., 1994). The more faithful the cABR was to the auditory signal during the transition from the consonantal to the vowel portion, the better the speech-in-noise performance. This evidence concerning speech-in-noise performance of children coheres well with that from older adults. Anderson and Kraus (2010) compared two such adults, with near-identical audiograms (≤25 dB HL for audiometric frequencies from 125 Hz to 8 kHz) to one another. The individual with poorer speech-in-noise performance exhibited a weaker representation of the fundamental frequency and second harmonic in the FFR of the cABR. Comparing two groups of older adults who showed good and poor speech-in-noise performance, respectively, Anderson et al. (2011) found no significant audiometric difference (≤25 dB HL from 125 Hz to 4 kHz), yet the difference in the FFR of the cABR was replicated. In those older adults, the presence of meaningless syntactic speech adversely affected the faithfulness of the cABR to a repeated /dA/. This influence of background speech noise was particularly strong in those showing poor speech-in-noise performance on the HINT: The higher the overall root mean square (RMS) of the cABR in quiet or noise, or the stronger the correlation of the cABR waveform to /dA/ in quiet and noise, the better the speech-in-noise performance on the HINT (Anderson et al., 2011).
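The two response measures named above, overall RMS and the correlation of the response waveform with the stimulus, can be sketched as follows. This is a minimal illustration on synthetic data and not the analysis pipeline of Anderson et al. (2011): the function names, the sampling rate, the synthetic waveforms, and the choice to search lags of 8 to 10 ms (the time-lag range given earlier in this article) are all assumptions made for the example.

```python
import math


def rms(x):
    """Root mean square amplitude of a waveform."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def pearson(a, b):
    """Pearson correlation between two equal-length waveforms."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)


def best_lag_correlation(stimulus, response, fs, lag_ms_range=(8.0, 10.0)):
    """Largest stimulus-to-response correlation over the assumed lag range (ms)."""
    lo = int(round(lag_ms_range[0] * fs / 1000))
    hi = int(round(lag_ms_range[1] * fs / 1000))
    best_r, best_lag_ms = -2.0, None
    for lag in range(lo, hi + 1):
        segment = response[lag:lag + len(stimulus)]
        if len(segment) < len(stimulus):
            break  # response too short for this lag
        r = pearson(stimulus, segment)
        if r > best_r:
            best_r, best_lag_ms = r, lag * 1000 / fs
    return best_r, best_lag_ms


fs = 10000  # sampling rate in Hz (assumed)
n = 400     # 40 ms synthetic "stimulus": a 100 Hz tone with a rising envelope
stimulus = [(i / n) * math.sin(2 * math.pi * 100 * i / fs) for i in range(n)]
# Synthetic "response": half-amplitude copy of the stimulus delayed by 9 ms
response = [0.0] * 90 + [0.5 * s for s in stimulus] + [0.0] * 20

r, lag_ms = best_lag_correlation(stimulus, response, fs)
response_rms = rms(response[90:490])
print(round(r, 3), lag_ms, round(response_rms, 4))
```

In this synthetic case the correlation peaks at the 9 ms delay that was built into the response, and the response RMS is half that of the stimulus; with real recordings both measures would be reduced by noise and by any loss of fidelity in the response.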

Kindred to the neural entrainment of the cABR that predicted HINT performance (Chandrasekaran et al., 2009), Chandrasekaran et al. (2012) revealed that using repeated rather than changing stimuli augmented the FFR and reduced the cerebral blood flow in the IC: a repetition suppression effect. The processing of sound in the IC becomes more efficient when predictable. The fidelity of the FFR and the associated repetition suppression is particularly pronounced in those who have learned to process the sound well: e.g., English-speakers who rapidly learn new vocabulary based on the recognition of lexically meaningful tones, having acquired the mapping of distinct pitch patterns of one English pseudoword onto pictures of different objects. These findings could thus relate to the second language acquisition of tonal languages, such as Mandarin Chinese.

When sequences of natural stimuli such as speech exhibit an inherent acoustical variability with time, thus not promoting neural entrainment and repetition suppression, the nature of the filtering of the auditory information at the level of the rostral brainstem is thus arguably non-absolute. The IC shows increased blood flow reflecting a less efficient processing of the stimulus and generates waveforms less faithful to the stimulus, suggesting the filter is wide open to unpredicted stimuli. Accordingly, the experience-dependent corticofugal efferent influence on the rostral brainstem typically permits a selectivity for information promoted by top-down expectancies. Not only acoustical but also semantic and linguistic factors may influence expectancies. Those factors affect the ascendency of information in the auditory system from the IC upward. The influence of these top-down expectancies on corticopetal-corticofugal loops effectively operates as an early filter (Broadbent, 1958). The neural entrainment of facets of the cABR, FFR, or repetition suppression at the IC reflect the selectivity of that early filter, for instance, by affecting the perception of speech in noise. Yet the selectivity of that filtering is only near-absolute under conditions that promote neural entrainment or repetition suppression within the IC. These conditions are atypical in natural acoustically varying to-be-attended stimulation that is often in the presence of noise. The new early filter model offered here thus proposes that the early filter is not only affected by top-down experience-dependent selective attentional factors but also by neural entrainment. This assumption that neural entrainment affects the early filter is thus not as discrepant as Broadbent's (1958) early selection model was with the evidence supporting attenuation (Treisman, 1960, 1964a,b, 1969; Treisman and Riley, 1969) and late selection models (Gray and Wedderburn, 1960; Deutsch and Deutsch, 1963).

### Interim Summary

Top-down attentional as well as experience-dependent plasticity factors influence cABR generation. In support of an assumption of predictive selection by the early filter, this generation is also affected by the neural entrainment determined by the speech context. This neural entrainment affects the attention selectivity for speech in noise.

# TFS and Age-Dependent Decline of Temporal Resolution

Having discussed how attention, expectancy, and prediction affect the sub-cortical representation of speech in the central auditory system, as well as speech-in-noise performance, we turn now to the representation of TFS. TFS is best understood by first considering how the auditory periphery analyses sound. The structure of the basilar membrane within the cochlea performs a Fourier-analysis-like function (von Békésy, 1960): The basilar membrane resolves a complex sound into component narrowband signals. In response to a sinusoidal stimulation, the basilar-membrane response takes the form of a traveling wave that shows a peak amplitude at a specific place on the basilar membrane, depending on the frequency of the stimulation. Due to the mechanical properties of the basilar membrane, the basal end responds most vigorously to high-frequency sounds and the apical end to low-frequency sounds. This tonotopically organized pattern of vibration is transduced by the inner hair cells. In the auditory nerve, each transduced component narrowband signal thus has a temporal envelope, an informational trace of the slow amplitude dynamics of the upper extremes of basilar membrane deflections of that narrowband waveform. This temporal envelope varies at lower frequency, slower than the higher frequency TFS information bounded within that envelope. This amplitude modulation envelope supplies cues to speech perception that are not only necessary but also sometimes alone sufficient for speech perception (Drullman et al., 1994a,b; Shannon et al., 1995). In quiet, slow-rate temporal-envelope cues (4–16 Hz) are especially important for speech identification (Drullman et al., 1994a) when higher frequency amplitude modulation envelope is present. Also in quiet, medium-rate amplitude modulation envelope (2–128 Hz) is important when lower frequency amplitude modulation envelope is absent (Drullman et al., 1994b). In the presence of interfering sounds, slow temporal-envelope cues (0.4–2 Hz) become important in conveying prosody (Füllgrabe et al., 2009), as do high-rate temporal-envelope cues (50–200 Hz) conveying fundamental frequency (Stone et al., 2009, 2010).

the acoustical spectrogram (B) illustrates. Chandrasekaran et al. (2009) derived cABRs to /dA/ in a variable speech or in a repeated /dA/ context (C). The cABRs revealed no significant effect of speech context during the steady-state vowel portion (D), but during the formant transition (boxed) context influenced cABRs. The amplitude spectra of the cABR (E) during the formant transition revealed a repetitive context augmented the second and fourth harmonics, as was significant (F). Correlations revealed that the higher this presumably top-down speech-context modulation of the representation of the second harmonic during the formant transition of the cABR, the better the speech-in-noise performance on the Hearing in Noise Test (not shown). Credit: Reprinted from Chandrasekaran et al. (2009). Copyright © 2009, with permission from Elsevier in respect to Chandrasekaran et al. (2009: Exp.1); n = 21.

The temporal information bounded within this temporal envelope is TFS, i.e., the fluctuations in amplitude close to the center frequency of a narrowband signal, which are higher in frequency than the amplitude modulation envelope. In tone-vocoded sound, narrow frequency bands of sound (and, in turn, the resolved narrowband signals at the basilar membrane), termed "channels," have their envelope information preserved, yet their TFS is replaced with a tone amplitude-modulated by that envelope. Hopkins and Moore (2010) explored how incrementally replacing the content of tone-vocoded channels with the original speech channels improved speech recognition in noise. Listeners were between 19 and 24 years of age and audiometrically normal in the test ear. These listeners were sensitive to TFS in a way that can be used in speech perception in noise: The speech reception thresholds for target signals containing partial TFS information improved when speech TFS information was added to the tone-vocoded sound (**Figure 7**; Hopkins and Moore, 2010). The TFS information improved thresholds in a procedure that incrementally replaced higher-and-higher frequency tone-vocoded channels with speech TFS (**Figure 7**, red line). TFS information also improved thresholds in a procedure replacing lower-and-lower frequency tone-vocoded channels of noise with speech TFS in the same bandwidth (**Figure 7**, blue line). Noteworthy is that TFS in higher frequency ranges aided speech recognition when no TFS was available in lower frequency ranges. In an analogous experiment, Hopkins and Moore (2010) also showed that speech TFS information is less useful to those with hearing impairment, albeit potentially confounded by the hearing-impaired participants being older (Moore et al., 2012; Füllgrabe, 2013; Füllgrabe et al., 2015).
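
The split of a narrowband channel into temporal envelope and TFS, and the tone-vocoding of that channel, can be illustrated numerically. The following Python sketch is a minimal illustration under stated assumptions: a Butterworth band-pass filter stands in crudely for cochlear filtering, the Hilbert analytic signal supplies the envelope, and the tone carrier sits at the geometric center of the band. The function names, filter order, and carrier choice are hypothetical and are not the procedure of Hopkins and Moore (2010).

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert


def band_envelope_and_tfs(x, fs, lo, hi, order=4):
    """Split one band-limited channel of x into envelope and TFS.

    Assumption: a zero-phase Butterworth band-pass approximates one
    "channel"; real cochlear filters differ in shape and bandwidth.
    """
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, x)           # narrowband signal
    analytic = hilbert(band)             # analytic signal of the channel
    envelope = np.abs(analytic)          # slow amplitude fluctuations
    tfs = np.cos(np.angle(analytic))     # unit-amplitude fast carrier
    return band, envelope, tfs


def tone_vocode_channel(x, fs, lo, hi):
    """Keep the channel envelope but replace its TFS with a pure tone."""
    _, envelope, _ = band_envelope_and_tfs(x, fs, lo, hi)
    fc = np.sqrt(lo * hi)                # assumed carrier: geometric center
    t = np.arange(len(x)) / fs
    return envelope * np.sin(2 * np.pi * fc * t)
```

By construction, the product of the envelope and the TFS returned by `band_envelope_and_tfs` reconstructs the narrowband signal, whereas `tone_vocode_channel` discards the original TFS and retains only the envelope.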

Another study complements Hopkins and Moore's (2010) demonstration that TFS is important for speech identification in the presence of speech background sounds. Stone et al.'s (2011) experiments investigated the dynamic range of usable TFS information by comparing the addition of TFS information to the amplitude peaks of a vocoded speech signal with the addition of that TFS information to the valleys and troughs of this vocoded signal. Whether added to amplitude peaks or to troughs, TFS information improved identification of the target speech over a background talker: Adding target and background noise TFS information to a channel containing the corresponding temporal envelope information proved useful. This TFS information was useful for channel levels (relative to the RMS sound level of that channel) from about 10 dB below to 7 dB above that RMS sound level. However, the range of channel sound levels where TFS was useful depended on the relative levels of the target sound to the background masking talker: For an experimental condition in which background noise dominated the target more, adding TFS to peaks was more useful at channel sound levels further below the RMS sound level of the channel than in an experimental condition in which the background noise did not dominate the target as much. Further, adding TFS information to peaks when the background dominated more was more useful than adding TFS information to dips. Stone et al.'s (2011) results thus show

that TFS information is not exclusively useful for listening in dips, but rather TFS also contributes to the segregation of target to-be-attended speech from the to-be-ignored background speech sound.

Having shown how processing TFS information is important to the recognition of speech in speech noise, we turn to how the subcortical processing of TFS is relevant to one of the outstanding unresolved conundrums of cognitive hearing science. This conundrum is that of isolating the age-related decline in temporal processing that is caused by effects of peripheral hearing loss on the auditory nerve and central auditory system from age-related declines that are unconnected to audiometric loss. Presbycusis, age-related sloping loss, may drive a progressive deafferentation of unstimulated neurons spreading upward in the ascending auditory system, which ultimately results in chronic cognitive change according to the sensory deprivation hypothesis. Hearing-impaired listeners can experience supra-threshold auditory processing deficits, characterized by distorted processing of audible speech cues. Peripheral damage to outer hair cells and reductions in peripheral compression and frequency selectivity contribute to these deficits, as does a reduced access to TFS information in the speech waveform, leading to this distortion (Summers et al., 2013). However, this impairment of TFS processing, which affects distortion, is not necessarily always a direct or indirect consequence of peripheral damage.

There is evidence for an independent age-related decline in temporal resolution as reflected by the action of the rostral

brainstem of the central auditory system (Marmel et al., 2013)<sup>2</sup>. Marmel et al. investigated an audiometrically heterogeneous population of adults with a wide age range (Supplementary Figure 1A). Participants with thresholds greater than 20 dB HL had a sensorineural loss. To investigate inter-individual variability in temporal resolution at the level of the rostral brainstem of the central auditory system, Marmel et al. used an FFR synchronization index. This index comprised both the cross-correlation of the FFR with the stimulus and the signal-to-noise ratio of the FFR. Such an index thus tracked how faithful the FFR was to the acoustical stimulus. This FFR synchronization index decreased with age in a manner reflecting a poorer temporal resolution at the level of the rostral brainstem, which is associated with higher frequency difference limens. These higher limens reflected poorer pitch discrimination abilities. A tendency for sloping loss to be more severe in older participants was confirmed (Supplementary Figure 1), yet at 500 Hz, hearing thresholds did not correlate significantly with age (Supplementary Figure 1C). Marmel et al. presented stimuli in this frequency range when measuring absolute auditory thresholds, frequency difference limens, and FFR synchronization. The influence of age on FFR synchronization in this frequency range, without a significant influence of age on hearing level, thus strains any assumption that a peripheral presbycusis could be the sole cause of this effect of age on the processing of sound by the central auditory system (though see Footnote 2). Further, this FFR synchronization was not associated with absolute auditory thresholds. The point is that there was an age-related decline in temporal resolution, arguably at the level of the rostral brainstem, that was associated with impairments in pitch discrimination abilities.
Pitch discrimination abilities appeared to hinge both on absolute auditory thresholds and on the FFR synchronization index. However, FFR synchronization, but not absolute auditory threshold, was affected by age. It is tenable that auditory absolute threshold could affect the place-coding of auditory information, in turn affecting pitch discrimination. Equally, absolute auditory threshold could affect the coding of auditory information that is not phase-locked. However, the firing of neurons conveying that auditory information would have to be asynchronous. The upshot of Marmel et al.'s (2013) findings is that there is an age-related, functionally relevant decline in auditory temporal resolution at the level of the rostral brainstem. This decline is arguably independent of audiometric hearing loss, which, though affecting frequency discrimination, was not affected by age within the frequency ranges investigated.
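
The two ingredients of such a synchronization index, stimulus-to-response cross-correlation and response signal-to-noise ratio, can be sketched in Python. This is a hedged illustration only: how Marmel et al. (2013) combined the two quantities into one index is not specified here, so the sketch returns them separately. The function names, the 15-ms lag search window, and the use of a pre-stimulus baseline for the noise estimate are all assumptions.

```python
import numpy as np


def peak_stimulus_response_correlation(stimulus, response, fs, max_lag_ms=15.0):
    """Largest Pearson correlation between stimulus and lagged response.

    Assumption: the neural lag lies between 0 and max_lag_ms, and the
    response array is at least as long as the stimulus plus that lag.
    Returns (correlation, lag in ms).
    """
    n = len(stimulus)
    max_lag = int(round(max_lag_ms * fs / 1000))
    best_r, best_lag = 0.0, 0
    for lag in range(max_lag + 1):
        segment = response[lag:lag + n]
        if len(segment) < n:             # ran past the end of the recording
            break
        r = np.corrcoef(stimulus, segment)[0, 1]
        if abs(r) > abs(best_r):
            best_r, best_lag = r, lag
    return best_r, best_lag / fs * 1000


def response_snr_db(response_window, baseline_window):
    """RMS of the response window relative to a noise baseline, in dB."""
    rms_signal = np.sqrt(np.mean(np.square(response_window)))
    rms_noise = np.sqrt(np.mean(np.square(baseline_window)))
    return 20 * np.log10(rms_signal / rms_noise)
```

A response that follows the stimulus faithfully yields a correlation near one at the neural lag and a high signal-to-noise ratio; a poorly synchronized response lowers both quantities.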

The question that still remains is whether aging of the auditory nerves and central auditory system alone drives this functionally relevant decline of temporal resolution, arguably at the level of the rostral brainstem, as indexed by FFR synchronization. Such aging could relate to a decline of inhibitory GABAergic (Caspary et al., 2008; Anderson et al., 2011) or cholinergic (Zubieta et al., 2001) systems of neurotransmission. Such systems involve, respectively, γ-aminobutyric acid or acetylcholine. A decline in temporal processing may limit the speed of acoustical fluctuations that the (auditory nerve and, in turn, the) central auditory system can follow. Such a decline thus renders it impossible for the central auditory system to represent high frequencies using the rate facet of a place-rate code, which affects the IC's generation of the FFR.

### Interim Summary

There is an age-related decline in supra-threshold auditory processing, which Marmel et al. (2013) revealed as independent of audiometric hearing loss. This age-related decline occurs alongside a decline in the temporal resolution of the FFR, which arguably the rostral brainstem generates. This age-related decline in temporal resolution could also impair speech recognition in noise (Hopkins and Moore, 2010). There is a comparable age-related decline in TFS sensitivity, which even occurs in audiometrically normal adults (Füllgrabe, 2013; Füllgrabe et al., 2015). However, peripheral hearing loss could also drive a decline in the processing of sounds in the auditory nerves and central auditory system. This loss is either measurable in the audiogram, or is "hidden" (Schaette and McAlpine, 2011; Plack et al., 2014).

### Neuroplastic Changes to Accommodate High-Frequency Audiometric Loss

A hypothesis is that the decline of temporal resolution of the central auditory system, indexed by the FFR, comes from long-term neuroplastic change to accommodate the loss of audiometric sensitivity, especially in the high-frequency range. Older adults with mild-to-moderate hearing impairment show FFRs of the cABR with an, at first counterintuitive, higher amplitude fundamental and lower harmonics than normal-hearing controls (**Figure 8**; Anderson et al., 2013). One explanation is that the higher amplitude of the FFR in hearing-impaired listeners might be due to a larger effective modulation depth in those listeners caused by the reduction or abolition of cochlear compression (Füllgrabe et al., 2003; Oxenham and Bacon, 2003).

However, such results are also germane to another theory (Woods and Yund, 2007) that sensorineural impairment leads to a remapping from the auditory cortex—with impoverished output to high frequency cues—to the auditory association cortex. Accordingly, that remapping, to compensate, takes the low frequency cues still available for phoneme recognition and amplifies those cues within the central auditory system (Woods and Yund, 2007). Whether occurring between the auditory cortex and auditory association cortex, or between other structures of the auditory system, this remapping has

<sup>2</sup>Declines in the processing of TFS are not necessarily in the brain or auditory nerves, but rather may be consequences of decline in the auditory periphery. TFS sensitivity does indeed decline with age even in the absence of peripheral hearing loss as measured by the audiogram (Füllgrabe et al., 2015). The loss of auditory nerve fibers or their synapses with hair cells, cochlear synaptopathy, can be age-related or noise-induced (Kujawa and Liberman, 2015). Plack et al. (2014) offer an account of how such noise-induced hearing loss can be audiometrically "hidden" (Schaette and McAlpine, 2011) yet might have consequences for temporal processing in the central auditory system, as indexed by FFR. Plack et al. present supportive preliminary data of how audiometrically normal young adults, who had a history of loud noise exposure, showed FFRs less faithful to a 3.1 kHz tone. It is as viable that age-related cochlear synaptopathy might drive a comparable age-related decline in temporal processing by the central auditory system, as reflected by Marmel et al.'s (2013) FFRs.

Anderson et al. (2013). Copyright © 2013, Acoustic Society of America.

consequences. Anderson et al.'s (2013) data relate to such a remapping. Those consequences alter the generation of the FFR in the rostral brainstem. Anderson et al.'s (2013) analyses of higher harmonics (Aiken and Picton, 2008) revealed no corresponding upregulation of high frequency cues.

Further, Anderson et al.'s analyses revealed that, whether the stimuli were unamplified or amplified using the NAL-R fitting formula (Byrne and Dillon, 1986), there was a bias in persons with mild-to-moderate hearing loss toward a stronger upregulation of lower rather than higher frequency components in noise. Indeed, this bias for upregulating lower frequency components was even stronger when amplified. If this bias were due to peripheral factors alone, such as a reduction in cochlear compression, then, if there were no long-term consequent neuroplastic changes, we would predict amplification would attenuate that bias. Anderson et al.'s analyses revealed the reverse of that prediction: amplification enhanced this bias. Accordingly, while peripheral factors such as declining cochlear compression would affect the FFR, long-term neuroplastic changes also take place that affect the FFR. Amplification with the NAL-R formula used did not remediate these changes.

Anderson et al.'s (2013) FFR findings from older adults with mild-to-moderate hearing impairment cohere well with evidence of a slightly different sort. Upon receiving an aid that amplifies high frequency cues, hearing aid users who have had unaided high frequency hearing loss for many years can hear the amplified sound as distorted (Woods and Yund, 2007; Galster et al., 2011). Reasons, which are not necessarily mutually exclusive, could include regions of dead cochlea (Vickers et al., 2001; Mackersie et al., 2004; Moore, 2004; Preminger et al., 2005; Aazh and Moore, 2007; Vinay and Moore, 2007; Zhang et al., 2014). However, other reasons could include the long-term plasticity of the auditory nerves or central auditory system attempting to make the best use of lower frequency information from a damaged periphery. The low frequency sound can also seem too loud: "hyperacusis." Here the notion is that the encoded low frequency cues swamp high frequency perceptual cues. This problem is even more apparent in the FFR under conditions of background noise (Anderson et al., 2013). As Galster et al. (2011) note, "the inability to restore audibility of high-frequency speech and the possible contraindication for the restoration of high-frequency speech are established conundrums of hearing care." This neuroplastic change is, at least in part, reversible. Training programs improve the aided perception of word-initial phonemes for those who have become accustomed to high frequency loss (Woods and Yund, 2007), people who presumably have partially functional basal cochlear regions. The neuroplastic changes, which adapt to hearing loss and seem to implicate the rostral brainstem, thus seem, on the whole, to be reversible. These neuroplastic changes are reversible even in later life and even after extensive hearing aid use.
At first glance, such a finding would cohere well with the notion of neuroplastic recovery from neuroplastic long-term changes that accommodate peripheral hearing loss. However, it is worth considering that GABA units can increase in the auditory cortex due to training (Guo et al., 2012). Accordingly, systems of neurotransmission could have aged, affecting the temporal resolution of the central auditory system to a point that is not normal. Those systems of neurotransmission could be subject to recovery due to training. While training was effective for nearly all individuals, there were factors affecting the inter-individual differences in the efficacy of training (Stecker et al., 2006). The distortion and annoyance issues associated with receiving an aid after becoming accustomed to sensorineural hearing impairment also concern signal processing techniques. These techniques map information in the high frequency components of the to-be-amplified sound onto lower frequency regions of the cochlea (Galster et al., 2011). Approaches include frequency compression (Glista et al., 2009) and frequency transposition (Füllgrabe et al., 2010).

Such a signal processing approach might be more advisable than training when the majority of high frequency (basal) regions of cochlea are dead—the relevant afferents of the eighth cranial nerve have atrophied. At first, it is hard to imagine how such individuals could benefit from a training in listening to high frequency information: If a region of cochlea is dead, there is no sound transduction at the characteristic frequencies of the inner hair cells of that region. However, if sounds are loud, a frequency component produces a broader excitation pattern across auditory nerve fibers. With loud enough frequency components, regions of live inner hair cells neighboring dead cochlear regions, would thus be able to transduce some high-frequency information: "off-frequency listening" (Westergaard, 2004). Foreseeable is that training these persons to use information from off-frequency listening might have some benefit with very high levels of amplification. For persons with extensive dead basal cochlear regions, a prediction is that such training is not as effective as the suggested signal processing approaches.

Anderson et al.'s (2013) analyses offer intriguing biomarkers to evaluate for specificity in predicting the outcomes of such treatments. These analyses were geared to investigating both lower frequency components and higher frequency components of the FFR of the cABR. As such, these analyses revealed that low frequency cues swamp higher frequency cues following neuroplastic changes that accommodate sensorineural loss. By contrast to these analyses, the representation of the stimulus classically apparent in the cABR is relatively abstracted from the TFS at the level of the rostral brainstem. Whether the FFR of the cABR was responsive to the steady-state segment of a vowel or the steady-state sound of a cello, that FFR represented the fundamental and lower harmonics more strongly than the higher harmonics: As depicted in **Figure 9**, such lower frequency components were more strongly represented even when higher harmonics were of a higher intensity, as attributable to the low-pass characteristics of brainstem phase-locking (Musacchia et al., 2007; Skoe and Kraus, 2010).

### Interim Summary

Audiometric hearing loss could drive a decline of temporal resolution in the central auditory system. This age-related decline could be a long-term adaptation to higher frequency loss at the periphery. However, Füllgrabe et al. (2015) have shown an age-related decline of temporal resolution in audiometrically normal individuals, who are audiometrically matched across age groups. This decline thus arguably occurs in the central auditory system. This finding would thus indicate that audiometric hearing loss does not drive all such decline. However, this assertion comes with a caveat that there may be hidden loss (Schaette and McAlpine, 2011; Plack et al., 2014; Kujawa and Liberman, 2015) that is age-related. Accordingly, that hidden loss does not affect the audiogram but still drives this decline thus affecting the central auditory system. The cABR can reflect neuroplastic changes in response to peripheral sensorineural loss upregulating the relative representation of

lower rather than high frequency components arguably at the rostral brainstem. This upregulation occurs in a manner exacerbated by noise and by amplification, as could relate to distortion, hyperacusis, and annoyance issues. The cABR is thus an intriguing biomarker that could have specificity informing the approach to treatment. Stimulus transduction artifact-free cABRs can now be recorded through hearing aids (Bellier et al., 2015). It remains to be determined how well such cABR attributes, including noise sensitivity and the extent of adaptation to lower frequency components, predict the outcomes of fitting. This fitting concerns signal processing, directional microphones, binaural care, and choice of noise reduction schemes. Also to be determined is how well such cABR attributes predict the benefit from behavioral interventions such as perceptual training (Woods and Yund, 2007).

# From The Limits on Phase-Locking in the Inferior Colliculus to Top-Down Neural Entrainment During Speech Perception

We have seen that prolonged hearing impairment has consequences for the generation of the FFR of the cABR, which typically reflects low frequency sound components. The temporal envelope information in a narrowband signal is, by definition, lower in frequency than the TFS information bound within that envelope. Narrowband signals with a lower center frequency are, however, more dominated by temporal envelope information than narrowband signals with a higher center frequency. Much speech TFS information is transmitted through those higher center frequency narrowband signals. The question remains for neuroscience as to how TFS is re-coded prior to the rostral brainstem. Spectral FFRs are known to represent harmonics of acoustical information as high as 1500 Hz (Aiken and Picton, 2008). Yet, a processing of TFS above 1500 Hz contributed to speech target recognition in speech background noise (Hopkins and Moore, 2010). The frequency components of that TFS over 1500 Hz are thus somehow processed by the brain. Such temporal information is available at the level of the cochlear nucleus (Palmer and Russell, 1986; Winter and Palmer, 1990). Skoe and Kraus (2010) postulate that a place code facet of a rate-place code (Rhode and Greenberg, 1994) recodes information about higher frequencies. Such a place code could be supported by a form of tonotopy within the IC (e.g., Malmierca et al., 2008). Indeed, Harris et al. (1997) support this notion of a rate-place code with multi-unit recordings from gerbil IC. Frequencies below 1000 Hz activated a broad phase-locked population. Higher frequencies induced activation of a more focal population without phase-locked firing of the constituent neuronal elements. The spectral FFR is thus more strongly affected by lower frequencies.

Turning from these evoked responses to neuronal oscillations, a recent model of how cortical theta (1–8 Hz) and gamma (25–35 Hz) oscillations process speech assumes a high-resolution spectrotemporal representation of speech in the primary auditory cortex (**Figure 10**; Giraud and Poeppel, 2012). This representation enters input layer IV upon which operations are performed to code speech into the theta- and gamma-band, albeit a representation encoded in a neuronal spike train. At first blush, the assumption of such a representation contrasts with the upper limit of phase-locking in the FFR. This limit is known to drop from 3.5 kHz in the guinea pig auditory nerve to 2–3 kHz in the cochlear nucleus (Palmer and Russell, 1986; Winter and Palmer, 1990) down to 1000 Hz in the central nucleus of the guinea pig's IC, right down to 250 Hz in auditory cortex (Wallace et al.,

2002, 2005). This cortical limit is likely an over-estimate in nonhuman primates (Steinschneider et al., 1980, 2008), perhaps even humans.

We postulate that an asynchronous recoding at the input to the auditory cortex serves as a high-resolution spectrotemporal representation of speech in a spike train within the primary auditory cortex. This cortical representation is strongly reliant on the place facet of a rate-place code, particularly for higher frequencies. Such representation is a spike-coded high-resolution representation that interacts with a gamma-band representation of spectrotemporal information in the speech bandwidth compressed into a lower frequency range (25–35 Hz). This gamma-band representation interacts with a slow stimulus-locked theta-band representation (1–8 Hz), further refining the spike coding of speech information for cortical processing of meaningful utterances (Giraud and Poeppel, 2012). Regions of auditory cortex show high measurements of GABA+; GABA+ levels correlating positively with language skill on the CELF4 (Gaetz et al., 2014), such that autistic children exhibit decreased GABA+ and auditory gamma-band (30–50 Hz) responses to auditory pure tones (Gandal et al., 2010; see also Port et al., 2015). A possibility is that gamma power may thus modulate the auditory cortical excitability to refine the spike coding of speech information for cortical processing of meaningful utterances. The corticofugal influence of such cortical modulations by neuronal oscillations is postulated here to serve as a possible basis of subcortical neural entrainment: Rather than the repeated speech context entraining the processing of the speech by the rostral brainstem, the parsing of the utterance entrains that processing.

### Interim Summary

The cABR reflects, at least in part, the phase-locked responding of a wide neuronal population within the IC tuned to the fundamental and lower harmonics of the acoustical stimulus. This phase-locking of a wide population breaks down in exchange for more focal populations that are tonotopic to higher frequencies and do not contribute strongly to the cABR. Higher frequency components of greater than 250 Hz cannot be cortically represented in a phase-locked manner, though such components are perceptually relevant to TFS perception that can improve speech-in-noise perception. Rather, a high fidelity spatiotemporal representation arguably enters the primary auditory cortex as a neuronal spike train. That spike train representation interacts with the lower frequency range compression of speech sound of the low gamma-band alongside a stimulus-locked representation of the sound in the lower theta band. These interactions of the spike trains with these neuronal oscillations in auditory cortex, we posit, not only refine the syntactic and semantic processing of the speech, but also have top-down corticofugal influences. Germane is that a memory for a repeated rather than a variable context enhances the subcortical representation of incoming stimulation in a manner that indexes speech-in-noise perception (Chandrasekaran et al., 2009). Just as that enhancement could rely on top-down prediction, these interactions of the spike train with auditory cortical neuronal oscillations could control corticopetal-corticofugal loops to promote the subcortical processing of semantically and syntactically predictable utterances. Those neuronal oscillations in the cortex could corticofugally modulate subcortical neural entrainment, such that top-down (semanto-syntactic) speech context can affect early filtering. 
Just as it is assumed that the new early filter operates by predictive selection on the basis of acoustical context, there is also scope for semanto-syntactic context to influence that predictive selectivity.

## Section Summary

The ascending auditory system contains a series of relay stations that generate the ABR to sounds. This ascending auditory system is part of multiple corticopetal-corticofugal loops that can dynamically adapt to filter information selectively on the basis of top-down control by higher structures. The manipulation and temporary storage of contextual information in the prefrontal cortices affects how the cortical cholinergic system controls those loops in a top-down manner. The connectivity of the IC serves as a hub of this early filter at the confluence of the bottom-up processing of the ascending auditory system, binaural interactions, and the top-down controlled predictions from the descending auditory system. There is thus the connectivity to support attentional modulations of ABRs, as occurs under conditions including loud noise. By contrast to cortically generated ALLRs such as the N1 (e.g., Butler, 1973; Campbell et al., 2003, 2005, 2007; Campbell and Neuvonen, 2007), such ABRs are relatively unaffected by refractoriness (Picton et al., 1992; Valderrama et al., 2014). Accordingly, populations of ABR-generating neurons are not particularly susceptible to refractoriness. Thus, an ongoing sound leads to an ongoing cABR to acoustical landmarks within that sound. Top-down contextual factors can influence the generation of that cABR. Indeed, we would argue that stimulus and linguistic context affect the subcortical representation of speech in a top-down manner.

The ability to process TFS information is not apparent in the cABR, which typically has low-pass characteristics. There is an age-related decline in temporal resolution, even when there is no audiometric evidence for sensory decline. Processing TFS information is important for speech perception in noise. The sensory deprivation and information degradation hypotheses (Schneider and Pichora-Fuller, 2000) for this decline in temporal resolution still cannot be ruled out: Hidden loss, which is immeasurable with an audiogram, might still drive that decline. Similarly, there could be a sensory processing of that TFS, which is intrinsically intertwined with a cognitive processing of TFS. A decline in this cognitive processing could cause a decline in the supra-threshold sensory processing of TFS, as postulated by Schneider and Pichora-Fuller's cognitive load on perception hypothesis. With a /dA/ stimulus, the cABR typically neglects the higher harmonics dominated by TFS, instead reflecting the brain's ongoing response to the fundamental and lower harmonics. Nevertheless, the cABR offers a promising biomarker with respect to speech-in-noise perception. Whether the addition of cABR to an audiologist's diagnostic battery would improve the specificity of treatment outcome remains undetermined. A recent investigation, to which we now turn, used an approach to cABR to glean higher harmonics that bear considerable TFS information, harmonics that could offer insights into the nature of an individual's speech perception under adverse conditions such as noise or reverberation.

# REVERBERATION AND PROCESSING OF TFS BY IC

Fujihira and Shiraishi (2015) investigated the FFR of the cABR of elderly individuals (61–73 years) with age-normal hearing. Pure-tone audiograms revealed the listeners' overall mild hearing loss to be age-normal: That loss was in no case strongly asymmetric, and latencies of a discernible click-evoked wave V were normal for each participant. Pure-tone averages (500–4000 Hz) revealed bilateral losses less than or equal to 30 dB HL, while thresholds at 8000 Hz were less than 50 dB HL. To obtain cABRs, each participant heard a series of /dA/ speech sounds in rapid succession: Instances of the original acoustical waveform were interspersed with an inverted version that was 180° out-of-phase with the original.

EEG epochs were time-locked to the onset of each acoustical waveform. An equal number of epochs free of bioelectric artifacts were selected for responses to the original and to the inverted waveform. Two separate sets of epochs were binned according to stimulus type. From these sets of epochs, Fujihira and Shiraishi derived two different kinds of FFRs of the cABR to /dA/. Each such response followed either the spectral frequency of the stimulus or the frequency of that stimulus's envelope. Fujihira and Shiraishi then used these sets of epochs in two different forms of analysis (Aiken and Picton, 2008) with the purpose of isolating: (i) the envelope FFR that phase-locks to the periodicity envelope, using what is termed the ADD method; (ii) the spectral FFR that phase-locks to resolved harmonic components of the acoustical signal, thus containing some of the TFS resolved by the auditory periphery, using what is termed the SUB method.

On the one hand, the individual ADD cABR came from EEG epoch waveforms collapsing across original and inverted /dA/ epochs, in the classical manner (Skoe and Kraus, 2010). This approach reduced the contamination in the recordings from the cochlear microphonic and from any stimulus transduction artifact (Aiken and Picton, 2008; Campbell et al., 2012). This ADD cABR (**Figure 11A**) reflected the time-lagged course of the stimulus envelope abstracted away from the TFS inclusive of higher harmonics of the acoustical waveform. This ADD cABR also represented well the fundamental frequency and lower harmonics (**Figure 11C**).

On the other hand, for the individual SUB cABR, EEG epoch waveforms in response to inverted /dA/ epochs were subtracted from original /dA/ epochs and divided by the total number of responses (for a similar approach, see also Anderson et al., 2013). In comparison to the ADD cABR phase-locked to the periodicity envelope, this low-amplitude polarity-sensitive SUB cABR (**Figure 11B**) represented well neither the fundamental frequency and lower harmonics nor the stimulus envelope. Instead, the SUB cABR reflected the TFS and higher harmonics of the acoustical waveform (**Figure 11D**). This SUB cABR mostly reflected the spectral FFR. While the cochlear microphonic could have contributed to this SUB cABR, the precedent is that the cochlear microphonic is not influential (Aiken and Picton, 2008): Only one harmonic was significant with masking that rendered a speech stimulus inaudible (Aiken and Picton, 2008: Exp.2). The resulting response arguably reflected the cochlear microphonic in the absence of either a brainstem-generated polarity-sensitive neuronal frequency following response (Chimento and Schreiner, 1990) or an auditory nerve-generated neurophonic. Aiken and Picton revealed multiple other significant harmonics not attributable to the cochlear microphonic. There were no such frequencies higher than 1500 Hz in the spectral FFR, despite there being harmonics higher than 1500 Hz in the acoustical stimulus. Further, Fujihira and Shiraishi's acoustical stimulation via tubes (Killion, 1984), with a transducer distant from the participant and EEG recordings, precluded substantial contamination of the SUB cABR by stimulus transduction artifact.

FIGURE 11 | A higher frequency TFS is represented in cABRs than previously thought. Grand-averaged "envelope FFR" and "spectral FFR" of elderly listeners with intact hearing derived respectively with ADD (A) and SUB (B). The SUB waveform (B) arguably exhibits intact TFS. The dotted line denotes the onset of /dA/, boxing the time-lagged period when the brain is responsive to the sound, time-lagged from the onset by the time stimulus information takes to reach the rostral brainstem. Fujihira and Shiraishi (2015) separately derived the corresponding amplitude spectra using ADD (C) and SUB (D) methods of Aiken and Picton (2008). Credit: Reprinted from Fujihira and Shiraishi (2015). Copyright © 2015, with permission from Elsevier; n = 30.

Rather, the polarity sensitivity of the SUB cABR arguably relates to the asymmetry of the speech signal's envelope alongside the brainstem's reflection of slight polarity differences in phase-locked activity to the periodicity envelope encoded from several regions of the cochlea. The SUB cABR thus conveys a complex sum of TFS information from multiple narrow frequency bands resolved at the cochlea. With a /dA/ stimulus, the envelope dominates the resolved low-frequency bands apparent in the ADD cABR. The influence of high-frequency TFS content on that periodicity envelope arguably dominates the resolved high-frequency bands apparent in the SUB cABR.
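The ADD and SUB derivations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis pipeline of Fujihira and Shiraishi (2015) or Aiken and Picton (2008). The function name `add_sub`, the sampling rate, and the synthetic waveforms are all hypothetical, and the simulated response is idealized: its envelope-following part is perfectly polarity-invariant, and its fine-structure-following part inverts exactly with the stimulus.

```python
import math


def add_sub(original_epochs, inverted_epochs):
    """Return (ADD, SUB) waveforms from epochs recorded to each stimulus polarity.

    ADD: sum over all epochs divided by the total number of epochs.
    SUB: (sum of original-polarity epochs minus sum of inverted-polarity epochs)
         divided by the total number of epochs.
    """
    if len(original_epochs) != len(inverted_epochs):
        raise ValueError("expected equal numbers of original and inverted epochs")
    total = len(original_epochs) + len(inverted_epochs)
    sum_orig = [sum(samples) for samples in zip(*original_epochs)]
    sum_inv = [sum(samples) for samples in zip(*inverted_epochs)]
    add = [(o + i) / total for o, i in zip(sum_orig, sum_inv)]
    sub = [(o - i) / total for o, i in zip(sum_orig, sum_inv)]
    return add, sub


# Synthetic illustration only (not data from any study): an envelope-following
# part that keeps its sign across polarities and a fine-structure-following
# part that flips sign when the stimulus is inverted.
FS = 8000  # assumed sampling rate in Hz
N = 800    # samples per epoch (0.1 s)
envelope_part = [math.sin(2 * math.pi * 100 * k / FS) for k in range(N)]
tfs_part = [0.3 * math.sin(2 * math.pi * 700 * k / FS) for k in range(N)]

original = [[e + t for e, t in zip(envelope_part, tfs_part)] for _ in range(4)]
inverted = [[e - t for e, t in zip(envelope_part, tfs_part)] for _ in range(4)]

add_wave, sub_wave = add_sub(original, inverted)
```

In this idealized case `add_wave` recovers the polarity-invariant part and `sub_wave` recovers the polarity-following part. In real recordings the separation is only approximate, for the reasons discussed in the text: the envelope response itself can show slight polarity differences, and the cochlear microphonic also follows stimulus polarity.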

Fujihira and Shiraishi tested the association of aspects of the SUB and the ADD cABR with recognition of isolated familiar words under anechoic conditions and under reverberatory conditions at multiple reverberation times (0.5, 1.0, and 1.5 s): Overall, the longer the reverberation time, the less intelligible the speech. Neither SUB nor ADD cABR responses predicted word recognition under anechoic conditions, performance being at ceiling. However, aspects of the SUB, yet not the ADD, cABR responses predicted recognition of single words in isolation under reverberatory conditions.

That is, correlation matrices of amplitudes of components of the discrete Fourier Transform of SUB and ADD cABRs with word recognition performance showed that the amplitude of the ADD cABR did not predict this performance under reverberation. By contrast, the amplitude of the high harmonics in the SUB cABR did: The amplitude of SUB cABR harmonics at around 400, 500, 800, 900, and 1000 Hz correlated positively with word recognition performance for at least one reverberation time, and the amplitude of the harmonic around 500 Hz correlated positively with word recognition under all reverberatory conditions. In other experiments, EFRs, phase-locked to the fundamental, have shown a reliable polarity-sensitivity in a subset of individuals (Aiken and Purcell, 2013; Easwar et al., 2015), albeit a polarity-sensitivity that was not significant on a group level. The contribution of such individual differences in the generation of the polarity-sensitivity of the EFR could not account for such correlations of amplitudes of higher harmonics in the SUB cABR with word recognition. Rather, Fujihira and Shiraishi thus arguably showed that the phase-locked brainstem coding of TFS is critical to word recognition under adverse conditions of reverberation. Fujihira and Shiraishi conjecture that TFS is present in the higher harmonics of the spectral FFR of the cABR. Appealing as this explanation is, however, it remains to be discerned what TFS narrowband signal components bear individually within the frequency range of 400–1000 Hz at the level of the IC. It is further worth considering that perceptual compensation (Watkins and Raimond, 2013) for effects of reverberation was possible in the word recognition task, given that reverberation time was consistent within blocks. Accordingly, the TFS arguably manifest in the spectral FFR of the SUB cABR to /dA/ sounds in quiet could index the influence of top-down expectancy of a repeated /dA/ on the rostral brainstem representation of that /dA/. This capacity for top-down expectancies to influence subcortical processing could have also affected perceptual compensation. Accordingly, this perceptual compensation operates by selecting the crucial TFS information required for speech-in-noise performance.
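The kind of analysis summarized above, correlating the amplitude of one spectral component with word recognition across listeners, can be illustrated as follows. This is a hedged sketch rather than the authors' procedure: `amplitude_at` and `pearson_r` are hypothetical helper names, the sampling rate is assumed, and every number below is made up for illustration.

```python
import math


def amplitude_at(signal, fs, freq):
    """Amplitude of the single discrete Fourier component of `signal` at `freq` Hz."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq * k / fs) for k, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * k / fs) for k, x in enumerate(signal))
    return 2 * math.hypot(re, im) / n


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values for five hypothetical listeners (not data from any study).
FS = 8000  # assumed sampling rate in Hz
N = 800    # samples per waveform (0.1 s, a whole number of 500 Hz cycles)
listener_amplitudes = [0.10, 0.15, 0.20, 0.25, 0.30]
sub_waves = [
    [a * math.sin(2 * math.pi * 500 * k / FS) for k in range(N)]
    for a in listener_amplitudes
]
word_scores = [52.0, 60.0, 61.0, 70.0, 74.0]  # hypothetical percent correct

amps_500 = [amplitude_at(w, FS, 500) for w in sub_waves]
r_500 = pearson_r(amps_500, word_scores)
```

Repeating the last two lines for each harmonic and each reverberation time would fill in a correlation matrix of the sort the text describes. With many such tests, some correction for multiple comparisons would normally be considered.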

The contralateral presence of speech-shaped noise in Fujihira and Shiraishi's tests of word recognition could promote the binaural interaction of information from to-be-ignored masking noise. This interaction arguably invoked similar descending corticofugal, effortful, mechanisms of top-down selective attention affecting processing at the level of the rostral brainstem (Maison et al., 2001; Ikeda et al., 2008, 2013). Further, preceding context could affect the spectral FFR of cABRs (Chandrasekaran et al., 2009, 2012). Use of that context in spectral FFRs could relate to an ability for perceptual compensation (Watkins and Raimond, 2013), much as the greater influence of a repeated relative to a variable syllabic context on envelope FFRs (**Figure 6**) is associated with speech-in-noise performance (Chandrasekaran et al., 2009). Thus, the sound of prior heard speech context gave rise to a mental model of the tacit knowledge of the room acoustics. Accordingly, participants' brains used that model to influence the effect of reverberation on speech perception. Using such context required the formation of a model of the stimulus or the room acoustics, a model retained in memory to influence perceptual performance.

### Section Summary

One thousand five hundred Hertz represents the limit of phase-locking at the auditory brainstem (Aiken and Picton, 2008). The attenuation of functionally relevant information in higher frequency components (>400 Hz) in contralateral noise is not as strong as the attenuation of the lower frequency components revealed by the traditional ADD technique. Fujihira and Shiraishi employed the SUB technique to derive spectral FFRs.

In assessment, those with intact hearing, and to a lesser extent those with mild-to-moderate hearing loss, make use of TFS in frequency ranges greater than 1500 Hz for speech perception: Some form of re-coding of TFS seems inevitable to give rise to modulations of gamma- and theta-band oscillations at the level of the cortex. Just as the representation of (semantosyntactic) speech context parsed at the level of the cortex could affect the subcortical representation of speech at the level of the rostral brainstem, so could the representation of the room acoustics also parsed at the level of the cortex. These findings cohere with the predictive selection assumption: It is postulated that such top-down expectancies from context corticofugally modulate the rostral brainstem processing of phase-locked speech information in corticopetal-corticofugal loops. Accordingly, those expectancies critically influence spoken word perception and in turn word recognition in a context of adverse reverberatory conditions. For such a context to influence the rostral brainstem processing requires a WM function of the kind tapped by complex span tasks. This function is the retention of that context in memory storage whilst processing the dual task of listening to or recognizing the speech. It is to the influence of memory abilities on ABR generation that we now turn, for which the findings of a recent investigation are germane.

# AUDITORY BRAINSTEM RESPONSES, WORKING MEMORY, AND SPEECH IN NOISE

Humans have the ability of perceptual compensation, i.e., using prior context to help perceive speech correctly, an ability that relies on memory. For instance, knowledge of the room acoustics from immediate prior speech sound context reduces the adverse effects of reverberation on speech perception (Watkins and Raimond, 2013). When listening, the brain thus holds a mental model of the room's acoustics in (working) memory. The brain uses that model in a top-down prediction to select the perceptual representation of the current utterance, so as to support speech perception. A hypothesis is that some (working) memory function of the brain interacts with an early stage of processing in the brainstem to support that predictive selection. This hypothesis is of interest because the extent to which the brain can use contextual information held in (working) memory for top-down predictive selection at the subcortical level of the brainstem, in turn, would influence speech-in-noise perception. The results of a study of ABRs to ignored sounds under different conditions of load, in people with different working memory capacities, allow an evaluation of this hypothesis.

To investigate the effect of concurrent visual-verbal memory load and WMC on the subcortical processing of sound, Sörqvist et al. (2012) employed an n-back task with young adults who had normal hearing: An n-back memory task was accompanied by large numbers of task-irrelevant clicks. Those clicks elicited ABRs. Visual letters appeared one at a time and participants attempted to press a button when a letter was the same as the one presented n letters earlier. n-back tasks with higher ns thus meant higher concurrent memory loads. Performance was indeed poorer with higher memory load (1-back = 2-back < 3-back), while the Wave V of the ABR decreased (1-back < 2-back = 3-back). Also identifiable in the ABR data of **Figure 12** was an SN10 negativity that also decreased (1-back = 2-back > 3-back).
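The n-back rule itself is simple to state in code. The sketch below shows how targets arise and how button presses could be scored. It is not the scoring used by Sörqvist et al. (2012): the function names, the letter sequence, and the button presses are hypothetical.

```python
def nback_targets(letters, n):
    """Indices at which the current letter matches the one presented n items earlier."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [i for i in range(n, len(letters)) if letters[i] == letters[i - n]]


def hit_and_false_alarm_rates(pressed, targets, n_trials):
    """Proportion of targets responded to, and of non-targets responded to."""
    pressed, targets = set(pressed), set(targets)
    hits = len(pressed & targets)
    false_alarms = len(pressed - targets)
    non_targets = n_trials - len(targets)
    hit_rate = hits / len(targets) if targets else 0.0
    fa_rate = false_alarms / non_targets if non_targets else 0.0
    return hit_rate, fa_rate


# Hypothetical letter stream and button presses, for illustration only.
letters = list("BKBKTKT")
targets_2back = nback_targets(letters, 2)   # positions where letter == letter two back
hit_rate, fa_rate = hit_and_false_alarm_rates([2, 3, 4], targets_2back, len(letters))
```

Because deciding on item i requires holding the previous n items in order, the number of letters that must be maintained and updated grows with n, which is the sense in which higher ns impose a higher concurrent memory load.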

This influence of simultaneous visual-verbal memory load shows that systems of WM affected those systems that influence the generators of wave V. Those generators are within or near the rostral brainstem. WM systems also affected the generators of SN10, which have cortical contributions (Parkkonen et al., 2009). Though Sörqvist et al. (2012) revealed that this SN10 more closely mirrored behavior, there was a functionally relevant influence of memory load on wave V (4–8 ms). When that influence exceeded a certain threshold, higher memory loads accordingly affected the subsequent SN10 (10.5–12 ms), possibly via the ascending auditory system. Either common processes affect the generation of SN10 and the brain mechanisms supporting n-back performance or SN10 generation and n-back share common processes. As the effect on wave V was prolonged, overlapping Waves III (4 ms) and IV (5 ms), the effect of load is not focal to the lateral lemniscus terminating in the IC that determines the peak of wave V (Møller et al., 1994). Rather, the effect could be mediated by the IC itself generating a slower longer-lasting waveform overlapping wave III and IV, possibly extending to influence the SN10. There was also further compelling evidence for the functional relation of memory load on this longer lasting aspect of wave V generation. This evidence stemmed from data on complex span tasks including the OSPAN (Turner and Engle, 1989; Beaman, 2004). All participants in the n-back also separately completed these complex span tasks to determine WMC. WMC is the maximum number of items that can be stored during the processing of the task and recalled correctly after near flawless performance of that task. A task accuracy criterion ensures that there is no trade-off between the task and memory storage. While performance on the task can be subject to momentary performance aberrations, WMC is a cognitive trait of a person: a long-term measure of that person's cognitive competence.

tasks, particularly the 3-back is consistent with the notion that load suppresses attention during the SN10 time range. Credit: Adapted from Sörqvist et al. (2012); n = 35. Reprinted by permission of MIT Press Journals.

Sörqvist et al. (2012) correlated WMC from complex span tasks with deflections of the ABR to clicks that were measured during the separate n-back task. The aim was to determine whether WMC predicted ABR generation under the different memory demands of different n-back tasks. Only with the higher memory demands of the 3-back did individuals' WMC predict Wave V amplitude. On this 3-back, yet not the 1-back and 2-back, the higher the individual's WMC, the lower the amplitude of Wave V.

Sörqvist et al. (2012) postulate that the prefrontal lobe is at the apex of an attentional network supporting WM and the top-down suppression of the processing of incoming sound stimuli, stimuli that receive preliminary processing by the brainstem. In accordance with an assumption of top-down cholinergic control, Sörqvist et al. (2012) hypothesize that prefrontal projections to the cortical cholinergic system, reliant on the neurotransmitter acetylcholine, can suppress to-be-ignored sound (Sarter et al., 2005). Corticofugal connections of the descending auditory system could mediate that suppression. Accordingly, Wave V and SN10 are so affected by WM load. This load-dependent reduction supports a limited prefrontal capacity assumption: Engaging a capacity-limited prefrontal control with a (visually presented memory) load diverts predictive selectivity away from processing the to-be-ignored clicks. The processing of clicks is thus suppressed at the level of the rostral brainstem. In accordance with an assumption of predictive selection by an early filter, this suppression may be particularly effective when the to-be-ignored sounds are highly predictable. Sörqvist et al.'s (2012) oddball sequences contained largely repetitions of the same click sound. As we already postulated, WM for such recent acoustical context may be crucial to perceptual compensation (Watkins and Raimond, 2013) that attenuates the effects of reverberation on speech perception. Sörqvist et al.'s (2012) evidence shows that there is an interplay between an individual's WMC and the corticofugal suppression of the generation of Wave V. As this interplay between an individual's WMC and this corticofugal suppression is arguably cholinergic, we term that interplay the cholinergic working memory assumption. This interplay is particularly apparent when all sound is to-be-ignored and the primary task requires a higher memory load.

Germane to the mechanisms of this effect of WM on brainstem processing is a recent investigation that demonstrated an age-related decline in WMC in audiometrically normal adults. This investigation measured WMC with a complex span task (Füllgrabe et al., 2015). There were two audiometrically matched groups of such individuals, an elder group, aged 60–79 years, and a younger group, aged 18–27 years. The decline in WMC was associated with a decline in speech-in-noise performance. The association of speech-in-noise performance with WMC at first appeared to be entirely mediated by aging: Statistically controlling for age eliminated this correlation (Füllgrabe et al., 2015). However, a larger-scale investigation of audiometrically normal individuals (Füllgrabe and Rosen, 2016) revealed that this association withstood statistical control for age. On balance, reconciling the data of Füllgrabe et al. (2015) with those of Füllgrabe and Rosen (2016), there is an influence of an age-related decline of WMC, which is associated with a decline in speech-in-noise performance. This influence is stronger in more elderly individuals. Age-related declines in WMC were not the only factor, as individual differences in WMC within a limited age range also predicted speech-in-noise performance. The association was moderate and significant separately in the elder groups aged 40–59 years, 60–69 years, and 70–91 years, yet weak and non-significant in individuals aged 18–39 years (Füllgrabe and Rosen, 2016). Füllgrabe et al. (2015) further revealed that better sensitivity to TFS information was also associated with improved speech-in-noise performance. This TFS processing was also subject to age-related decline. However, other individual differences between participants, which varied within but not between age groups, also affected that TFS processing: Performance on some cognitive tests exhibited moderate-to-strong positive correlations of better performance with improved TFS sensitivity.
These cognitive tests were forward digit span, backward digit span, as well as sub-tests of the Test of Everyday Attention (TEA): trail making, block design, map search, and elevator counting with reversal. Overall, better scores on TEA also correlated positively with improved TFS sensitivity. These correlations were moderate, but remained significant after partialling out the effect of age. By contrast, Füllgrabe et al. (2015) revealed no significant association between measures of TFS perception and WMC, cohering with the notion that cortical cholinergic mechanisms are not the only mechanisms modulating the subcortical processing of TFS. Performance on some cognitive tests correlated positively with TFS sensitivity after partialling out the effect of age. Some mechanisms for processing TFS are thus resistant to the effects of aging. We postulate that, even in those with normal hearing, there is a distinct age-related decline in the cortical cholinergic system impacting the influence of the prefrontal lobe on brainstem processing. In turn, that decline affects the representation of TFS by the rostral brainstem, thus determining speech-in-noise performance. Potential cholinergic mechanisms for such an age-related decline could include the age-related damage of postsynaptic muscarinic acetylcholine receptors. Positron Emission Tomography has revealed an age-related reduction in the binding of such receptors in the neocortex and thalamus (Zubieta et al., 2001).
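What "statistically controlling for age" or "partialling out the effect of age" means can be made concrete with a first-order partial correlation. The sketch below is a generic textbook computation, not the specific statistical procedure of Füllgrabe et al. (2015) or Füllgrabe and Rosen (2016). The function names and the synthetic scores are hypothetical, and the data are constructed so that the third variable, standing in for age, is the only thing the two measures share.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_r(xs, ys, zs):
    """Correlation between xs and ys after partialling out zs (first-order)."""
    r_xy, r_xz, r_yz = pearson_r(xs, ys), pearson_r(xs, zs), pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic illustration only: a standardized "age" variable drives both
# measures, and each measure adds its own independent variation.
age = [1, 1, 1, 1, -1, -1, -1, -1]
unique_a = [1, 1, -1, -1, 1, 1, -1, -1]
unique_b = [1, -1, 1, -1, 1, -1, 1, -1]
measure_a = [2 * z + u for z, u in zip(age, unique_a)]  # e.g., a memory score
measure_b = [2 * z + u for z, u in zip(age, unique_b)]  # e.g., a speech-in-noise score

raw_r = pearson_r(measure_a, measure_b)
r_controlling_age = partial_r(measure_a, measure_b, age)
```

Here the raw correlation is substantial while the partial correlation is essentially zero, which is the pattern the text describes as an association "entirely mediated by aging". An association that "withstood statistical control for age" would instead leave a non-zero partial correlation.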

In accordance with a cholinergic top-down control assumption, the extent of age-related decline of such cholinergic mechanisms, we postulate, determines how top-down expectancies can corticofugally modulate the subcortical representation of speech TFS information. This TFS information is crucial to the processing of speech-in-noise retained by elderly individuals with age-normal hearing (mild-to-moderate loss) and arguably represented at the level of the rostral brainstem (Fujihira and Shiraishi, 2015). Indeed, populations of neuronal elements simultaneously firing in the rostral brainstem represent such information in a phase-locked manner up to 1500 Hz (Aiken and Picton, 2008); higher frequencies arguably rely on a rate-place code (Rhode and Greenberg, 1994; Skoe and Kraus, 2010) via tonotopy without phase-locking in the IC (Harris et al., 1997). The data of Füllgrabe et al. (2015) also point toward additional mechanisms for processing TFS, which are associated with speech-in-noise performance yet are relatively resistant to the influence of age-related decline. These mechanisms are neurocognitive processes that are unaffected by aging yet contribute to performance on several cognitive tasks. By contrast, Füllgrabe et al. (2015) revealed no significant association between measures of TFS perception and WMC, cohering with the notion that cortical cholinergic mechanisms are not the only mechanisms modulating the subcortical processing of TFS. Such mechanisms for representing TFS could more critically implicate inhibitory GABA in the inferior colliculus, which could also be affected by aging (Caspary et al., 2008; Anderson et al., 2011) though without directly affecting WM. An alternative tenable hypothesis could also concern age-related declines in excitatory serotonergic neurotransmission in the ascending central auditory system (Tadros et al., 2007).

## Section Summary

Increases in concurrent cognitive memory load affect the brainstem processing of sound in a manner that attempts to shut down that processing by the auditory brainstem. That shut-down conforms with the notion that corticopetal-corticofugal loops of the early filter operate according to the assumptions of predictive selection and a prefrontal capacity-limitation. This attempt to shut down brainstem processing is thus top-down and constrained by WMC. Accordingly, under conditions of high concurrent memory load, those with higher WMC show a reduced wave V. The facet of WMC that declines with age, alongside the age-related decline in TFS processing, affects speech-in-noise performance. These influences of WMC support a cholinergic working memory hypothesis: People with better WMC for the storage and processing of acoustical context, as measured by complex span tasks, possess better prefrontal control of corticopetal-corticofugal loops via the cortical cholinergic system.

We postulate a cholinergic stance of the cognitive load on perception hypothesis, that, even in those with normal hearing, there is an age-related decline in the cortical cholinergic system. This cortical cholinergic decline impacts the influence of the prefrontal cortex on brainstem processing and, in turn, the sensory-cognitive representation of TFS determining speech-in-noise performance. There are likely other age-influenced mechanisms affecting TFS processing that are not influenced by age-related declines in WM, such as those implicating collicular GABA.

Potential cholinergic mechanisms for an age-related decline affecting WM and speech-in-noise performance include the age-related damage of post-synaptic muscarinic acetylcholine receptors. Here we have seen WMC constrains the influence of cognitive load on the subcortical processing of sound, as could be related to the processing of speech in noise. We now turn to intriguing parallels concerning the influence of sensory load on the behavioral effects of processing to-be-ignored sound in auditory distraction paradigms.

# SENSORY LOAD AND AUDITORY DISTRACTION

The disruptive effects of auditory distraction upon WM have been extensively investigated in a serial recall paradigm (Jones, 1993). To-be-remembered items are presented one at a time and a to-be-ignored sequence of sounds is presented alongside those items and/or during a distraction-filled retention interval. Hughes et al. (2013) investigated two auditory distraction effects in this paradigm. The first disruptive effect, produced by an occasional change-of-voice in the to-be-ignored sound, is termed a deviant effect. Hughes et al. termed the second disruptive effect the changing-state effect, also known as the token set size effect (Campbell et al., 2002a). That is, a repeated sound AAAAAAAAA... is less disruptive than a changing sequence of multiple sound tokens ABCDEABCDE....

Hughes et al. found that increasing the sensory load, by taking visual to-be-remembered digits (**Figure 13A**) and degrading them with Gaussian visual noise (**Figure 13B**), decreased the deviant effect (**Figures 13A,C**). Such an attenuation of the deviant effect with increases in sensory load thus resembled the attenuation of Wave V of the ABR by increases in n-back load (Sörqvist et al., 2012). Further, Hughes et al. also revealed that forewarning the participant of the presence of a deviant attenuated the deviant effect (**Figure 13D**). A viable interpretation is thus that a top-down expectancy led to a prefrontally coordinated and cholinergically mediated corticofugal influence on subcortical filtering. This interpretation thus assumes the early filter can operate according to a principle of foreknowledge: predictive selectivity that affects cholinergic top-down control of that filter. This filtering thus attenuated the disruptive influence of an expected rather than an unexpected change-in-voice. A further parallel with the ABR findings of Sörqvist et al. (2012) was compelling: The higher the OSPAN measure of WMC, the smaller the deviant effect (**Figure 13G**). This correlation has been replicated (Sörqvist, 2010) and further corroborated by meta-analysis (Sörqvist et al., 2013). Such a finding would be expected if a prefrontal cortex-coordinated WM system modulates the subcortical filtering of deviant to-be-ignored sound. The subcortical filtering would in turn attenuate that deviant's cortical processing and disruptive propensity.

By contrast to this deviant effect, the token set size effect went unmodulated by either sensory load (**Figure 13E**) or forewarning (**Figure 13F**) of the presence of a multi-token sequence. Indeed, that token set size effect went uncorrelated with WMC (**Figure 13H**). A top-down expectancy generated by a repeated token AAAA... may suffice for corticofugal influences on subcortical filtering to attenuate the cortical processing of that sound. A hypothetical top-down expectancy that attenuates the disruptive effects of a changing-state multi-token sequence varying in many attributes (rather than just the voice of speaker) seems to defy formulation. If such a top-down expectancy is formulated, any influence on distraction is swamped by other distraction-invoking mechanisms strongly influenced by token set size. For instance, the effects of increases in token set size on the supratemporal auditory cortex as indexed by releases in refractoriness of the supratemporal N1, which could be related to the form of auditory distraction termed the token set size effect (Campbell et al., 2003, 2005, 2007), would accordingly go relatively unaffected by such corticofugal influences.

= 24. A comparable influence of sensory load on the token set size effect or "changing-state effect" was not apparent (E); n = 45, nor was there any modulation by foreknowledge of changing-state multi-token stimulation that was consistently more disruptive than a repeated speech token (F); n = 31. Indeed, WMC as indexed by OSPAN correlated negatively with the deviant effect (G); n = 24, yet not the token set size effect (H); n = 31. Credit: Copyright © 2013 by the American Psychological Association. Adapted with permission from Hughes et al. (2013). The use of APA information does not imply endorsement by APA.

Similar to the deviant effect, sensory load attenuates a semantically mediated phenomenon known as the between-sequence semantic similarity effect (Marsh et al., 2015b; **Figure 14**). Marsh et al. (2015b) presented to-be-remembered words visually and concurrently with to-be-ignored heard words. To-be-ignored words drawn from the same semantic category as the to-be-recalled words disrupt recall of those to-be-remembered words: Fewer to-be-remembered words were recalled with increases in the semantic relatedness of the to-be-remembered and to-be-ignored material (**Figure 14B**). Further, more of the to-be-ignored words erroneously intruded into recall (**Figure 14C**). Marsh et al. (2015b) revealed that degrading the visual word stimuli with Gaussian noise (**Figure 14A**), thereby increasing sensory load, modulated these effects of semantic relatedness (**Figures 14B,C**). We offer an interpretation of how sensory load reduces this between-sequence semantic similarity effect, consistent with how sensory load also reduces the deviant effect: A prefrontal cortex-coordinated WM system ultimately modulates the subcortical filtering of to-be-ignored sound promoting the cortical processing of semantically relevant auditory material. When a sensory load engages that prefrontal control, accordingly the top-down semantic expectancies (expectancies of a cognitive-linguistic nature) controlling the corticopetal-corticofugal loops no longer support the processing of semantically relevant auditory material. In turn, with a sensory load, semantically relevant material is less intrusive and less disruptive of the recall of the to-be-remembered material.

FIGURE 14 | Sensory load attenuates the between-sequence semantic similarity effect. Increasing the sensory load (A) of the to-be-remembered target items (e.g., "chair, desk, wardrobe...") reduced the influence of the meaning of to-be-ignored speech on WM performance (B): Under such conditions of low sensory load, the semantically related to-be-ignored speech sound (e.g., "table, sofa, bookshelf...") disrupted recall performance more than semantically unrelated speech (nurse, secretary, carpenter). This increase in load also reduced the influence of semantic relatedness on the number of semantic intrusions into recall from the to-be-ignored speech (C). Sensory load thus affects the influence of auditory meaning on cognitive processes, Marsh et al. (2015b: Exp.1); n = 32. Credit: Copyright © 2015 by the American Psychological Association. Adapted with permission from Marsh et al. (2015b). The use of APA information does not imply endorsement by APA.

There is a sensory load of a slightly different sort, the presence of to-be-ignored background noise: speech-shaped noise accompanying to-be-recalled words (Marsh et al., 2015a). While not affecting the perception of to-be-attended items, such background noise rather can impair the semantic processing of the to-be-attended items. Theoretically, when the sensory load of this speech-shaped noise engages prefrontal control, top-down semantic expectancies, corticofugally controlling the corticopetal-corticofugal loops, no longer support the processing of the semantically relevant auditory material. Listening in noise thus recruits WM resources that would otherwise be used for elaborate semantic processing of spoken words (McCoy et al., 2005; Kjellberg et al., 2008). Noise disrupts that elaborate semantic processing. Therefore, the sensory load of background noise impairs the understanding of heard speech. Here we thus postulate that sensory load by to-be-ignored background speech sound engages prefrontal control of the cortical cholinergic system. Accordingly, semantic expectancies cannot corticofugally control the corticopetal-corticofugal loops that tune the subcortical representation of attended speech to semantically likely candidate utterances.

Marsh et al. (2015a) demonstrated that semantic processing is disrupted by noise, in carefully calibrated circumstances in which the perception of speech in noise is relatively unimpaired. However, top-down semantic expectancies have been shown to support the contextual repair of degraded sensory information thereby improving speech perception (Shahin and Miller, 2009; Shahin et al., 2012). Accordingly, the engagement of prefrontal control by to-be-ignored speech could adversely affect the influence of semantic expectancies on the perception of speech in noise.

### Section Summary

Here we draw parallels between the WMC-constrained influence of cognitive memory load on the subcortical processing of sound and the influence of sensory load on forms of auditory distraction. In both cases, increases in load had effects that were constrained by WMC. Though some forms of auditory distraction go unaffected, we postulate that (semanto-syntactic) foreknowledge affects the top-down corticofugal influences of the cortical cholinergic system that is influenced by WM. Those influences affect subcortical processing alongside the processing of auditory deviance and auditory meaning, thus influencing the perception and understanding of speech in noise. These findings from distraction and speech-in-noise research have motivated the assumptions of the new early filter model to which the discussion now turns.

# THE NEW EARLY FILTER MODEL

The new early filter model is depicted in **Figure 15**. Corticofugally controlled corticopetal-corticofugal loops serve as an early filter increasing the signal-to-noise ratio at the cortex, operating early by egocentric selection (Suga et al., 2000). This selection serves to enhance the predicted signals and suppress unattended predicted noise. For instance, as **Figure 2** illustrates, one corticopetal-corticofugal loop includes corticopetal connections ascending from the right IC up to the right auditory cortex via the right medial geniculate body. This loop also includes corticofugal connections descending from the right auditory cortex to the right IC via the right medial geniculate body. Such a loop receives not only information from loops lower in the central auditory system, but also controls those lower loops. This loop also sends information upward and is under the control of higher loops. The representation of the auditory speech signal at the level of the rostral brainstem is well-specified as phase-locked synchronous activity up to 1500 Hz. The fidelity of that representation of TFS information of the to-be-attended auditory signal, supported by a phase-locking over a broad region of inferior colliculus (Harris et al., 1997), arguably limits the processing of speech-in-noise, affecting in turn word recognition by the cortex (Fujihira and Shiraishi, 2015).

The new early filter model revives Broadbent's (1958) influential early filter assumption: There is a capacity limitation on how the human mind processes information that selects information early on for further processing. A psychological theoretical difference is that the new early filter model assumes that prior contextual information, which a working memory network stores and processes, can determine an attentional expectancy.

The prefrontal cortex is not only an aspect of that working memory network (Gisselgård et al., 2004; Campbell, 2005) but also an aspect of the anterior attentional system (Sarter et al., 2005). Attentional requirements and an attentional expectancy derived from prior context affect the prefrontal control of the cholinergic basal forebrain that in turn can cholinergically top-down control the organization of the primary auditory cortex (Kilgard and Merzenich, 1998). This we term the cholinergic top-down control assumption.

The key departure from Broadbent (1958) is that the early filter of corticofugal-corticopetal loops is by default wide open, such that, when stimulation is unpredictable, late selection may be more influential than early selection on cognitive performance. However, when (linguistic) expectancy predicts the to-be-attended stimulation, then that early filter becomes more selective. This we term the predictive selection assumption.

This predictive selectivity can improve TFS sensitivity and speech-in-noise perception. Also this predictive selectivity renders the cABR to selected information more faithful to the (linguistically) predictable stimulus: neural entrainment. We postulate predictive selectivity via corticofugal-corticopetal loops not only affects the perception of speech in noise, but also affects the comprehension of speech in noise. Prefrontal control is assumed to be capacity-limited. This we term the prefrontal capacity limitation assumption.

Accordingly, a sensory or a cognitive load on that prefrontal control diverts predictive selectivity away from other stimuli. There is thus a cognitive load-dependent reduction of the wave V evoked by to-be-ignored clicks (Sörqvist et al., 2012) due to a diversion of prefrontal control toward the control of information processing within visual and association cortices.

Combining the predictive selectivity assumption and the prefrontal capacity limitation assumption also accounts for several semantic phenomena. The sensory load of meaningless speech-shaped noise disrupts the elaborative semantic processing of the to-be-attended speech in that acoustical noise (Marsh et al., 2015a). This noise diverts prefrontal control toward processing the sensory load of acoustical noise in visual and association cortices: There is a diversion of prefrontal control away from the storage and processing required for using preceding sound to predict what the semantically likely candidate utterances are. Similarly, the sensory load of visual noise diverts prefrontal control away from the cognitive processes required for the encoding of visual items into memory (Marsh et al., 2015b). That prefrontal control is diverted toward processing the sensory load of the visual noise. This visual sensory load also diverts prefrontal control away from the semantic processing of to-be-ignored sound and to-be-remembered visual items, thus abolishing the between-sequence semantic similarity effect (Marsh et al., 2015b). In turn, that visual sensory load diverts prefrontal control away from the involuntary attentional processing of a to-be-ignored change of voice, thus decreasing the deviant effect (Hughes et al., 2013).

People with better WMC for the storage and processing of acoustical context possess better prefrontal control of corticopetal-corticofugal loops via the cortical cholinergic system. This we term the cholinergic working memory assumption. These individuals thus have enhanced load-dependent reductions of the wave V elicited by to-be-ignored clicks (Sörqvist et al., 2012). Further, combining the cholinergic working memory and prefrontal limited capacity assumptions with the predictive selectivity assumption offers explanatory value. This combination of assumptions accounts for how higher-WMC individuals show better resistance to the deviant effect (Hughes et al., 2013) and, in a different manner, better speech-in-noise performance (Füllgrabe and Rosen, 2016). We turn first to the deviant effect.

A person's WMC affects the prefrontal control of the early filter's predictive selectivity via the influence of cholinergic projections of basal forebrain on corticopetal-corticofugal loops. With higher-WMC participants, who have better prefrontal control of corticopetal-corticofugal loops, there is top-down cholinergic control that tunes predictive selectivity well. That better tuning prevents extensive processing of the deviant change of voice in the to-be-ignored sound, thus reducing the deviant effect (Hughes et al., 2013). That deviance would otherwise capture prefrontal control away from the visual and association cortices, which support the encoding of the to-be-remembered items into working memory. The notion of corticofugal influences of visual attention on auditory deviance processing agrees with data concerning the auditory mismatch negativity (Campbell, 2015). Prefrontal influences of visual attention on subcortical auditory filtering by corticopetal-corticofugal loops could also, in turn, permit visual attention to influence the cortically generated auditory supratemporal mismatch negativity (Erlbeck et al., 2014; Campbell, 2015). The deviant effect could be related, at least in part, to the auditory deviance processing that this mismatch negativity indexes. Indeed, there are stronger cholinergic influences on the auditory mismatch negativity in young individuals (Pekkonen et al., 2001) than in elder adults (Pekkonen et al., 2005). Pekkonen et al.'s findings thus arguably cohere well with the cholinergic working memory assumption: Elder participants also have reduced complex span performance (Bopp and Verhaeghen, 2005) such that the cortical cholinergic system no longer strongly influences deviance processing in those older adults (Zubieta et al., 2001; Pekkonen et al., 2005). Low-WMC participants, who arguably have less effective cortical cholinergic systems, show stronger deviant effects (Hughes et al., 2013).
Foreknowledge of an imminent deviant similarly attenuates the deviant effect. This foreknowledge provides WM with a top-down context that the prefrontal anterior attentional system uses to cholinergically improve that predictive selectivity (Hughes et al., 2013). In turn, this effect of foreknowledge on predictive selectivity excludes the processing of deviance via an early filter through the control of corticofugal-corticopetal loops. Such contextual influences of foreknowledge are assumed to play a role in how top-down (semanto-syntactic) expectancies can improve speech-in-noise performance. This we term the foreknowledge predictive selectivity assumption.

Having discussed the implications for understanding the deviant effect of combining the cholinergic working memory and prefrontal limited capacity assumptions with the predictive selection assumption, we turn now to speech-in-noise perception itself. The processing and storage of acoustical context to promote predictive selectivity is better in higher-WMC participants. These higher-WMC participants thus have better speech-in-noise perception. While this correlation was significant for participants aged 18–91 years, the association between WMC and speech-in-noise perception was driven by the listeners aged 40–91 years (see Füllgrabe and Rosen, 2016). What the cholinergic facet of the cholinergic working memory assumption contributes to this explanation is a biological mechanism. This mechanism is assumed to be that by which the age-related decline in WMC predicts declines in speech-in-noise performance. Cholinergic decline (Zubieta et al., 2001) would thus lead to a decline in the influence of the prefrontally controlled cholinergic basal forebrain. This decline would not only affect the anterior attentional system, including the prefrontal cortices that are part of a working memory network (Campbell, 2005) thus affecting WMC for visually presented material. That decline would also affect the cholinergic basal forebrain's control of the auditory cortex (Kilgard and Merzenich, 1998) in turn adversely affecting speech-in-noise perception. A cholinergic stance of Schneider and Pichora-Fuller's (2000) cognitive load on perception hypothesis would thus predict that a cognitive aging of the cortical cholinergic system drives a decline in sensory processing.

### Section Summary

The new early filter model assumes prefrontal cortex controls top-down expectancy via the cortical cholinergic system thus influencing sensory and association cortices. In turn, the cholinergic basal forebrain indirectly influences corticopetal-corticofugal loops by corticofugal descending connections, as is termed cholinergic top-down control. Those corticopetal-corticofugal loops serve as an early filter, acting upon the level of the rostral brainstem. This filter operates according to the assumption of predictive selection such that expectancies on the basis of preceding stimulus context, linguistic context, or foreknowledge affect TFS perception and speech perception in noise/reverberation. Combining the predictive selection assumption with that of a prefrontal capacity limitation has explanatory advantages. This combination explains how diversions of prefrontal control lead to load-dependent reductions of wave V, alongside several semantic phenomena. One such phenomenon is how meaningless noise disrupts the semantic elaborative processing of speech in that noise. The cholinergic working memory assumption that complex WMC affects the early filter via the cholinergic basal forebrain's control of corticopetal-corticofugal loops has further explanatory value. The addition of this assumption explains how WMC influences both load-dependent reductions in wave V and speech-in-noise performance.

# EXPLANATORY LIMITS OF THE EARLY FILTER

Having discussed the explanatory value, we turn to the explanatory limits of the new early filter model with respect to auditory distraction and speech in noise. The form of auditory distraction known as the changing-state or token set size effect that, in theory, relates to the refractoriness of the generation of the supratemporal N1 (Campbell et al., 2003, 2005, 2007) is arguably unrelated to the cortical cholinergic system. Expectancy or sensory load thus does not affect that form of distraction (Hughes et al., 2013). Though there may be cholinergic influences on the latency of auditory N1 generation, the cholinergic antagonist scopolamine does not affect the refractoriness of the generation of the M100 magnetic counterpart of the supratemporal N1 (Pekkonen et al., 2005). Further, there is support for an influence of a separate, at least partially GABAergic influence on the latency of supratemporal M100 generation (Gandal et al., 2010). The MEGAPRESS technique, which is insensitive to acetylcholine, revealed high GABA+ macromolecule measurements in an auditory region of interest extending from middle temporal regions to superior temporal gyrus (Gaetz et al., 2014). This finding arguably indicates that Gandal et al.'s modulation of M100 generation is in part GABAergic. The token set size effect that could be related to refractoriness of N1 generation (Campbell et al., 2003, 2005, 2007) and M100 generation is, however, not necessarily completely unrelated to speech-in-noise performance: Noise can produce an auditory distraction effect influencing the cortical retention of linguistic information in turn limiting the perception and understanding of speech in noise. This noise produces a stronger auditory distraction effect with fluctuating changing-state or multi-token noise than steady-state noise.

The reverberatory adverse conditions of interfering high intensity speech in a restaurant, with a vaulted non-absorptive ceiling, present a high sensory load under which to attempt to listen to the attended speech. In such circumstances, an early filter arguably attempts to top-down attenuate the ABR (Sörqvist et al., 2012). This filter attempts to close-down the processing of auditory noise at the cost of closing-down the processing of the auditory signal. However, those conditions also present a token set size effect that is resistant to such top-down effects. Alternatively, those conditions could even produce a stronger form of auditory distraction under conditions of high cognitive load (Gisselgård et al., 2003, 2004; Valtonen et al., 2003; Campbell, 2005; Petersson et al., 2006). This token set size effect can affect the perception of and memory for lipread material, when that perception benefits from the retention of contextual information (Campbell, 2000; Campbell et al., 2002b): For some individuals directional microphone(s) might sufficiently reduce sensory load for the perception and understanding of speech. Others might attain more effective communication in such adverse conditions by lip-reading, closing-down their hearing by switching-off hearing assistive devices, or even by covering one's ears.

# OPEN QUESTIONS AND CAVEATS FOR FUTURE RESEARCH

The new early filter model assumes that WMC constrains processing of sound at the rostral brainstem according to top-down expectancies. Convergent evidence supports this assumption from effects of load and WMC on ABRs, alongside different forms of auditory distraction.

The proposed mechanism for controlling this filter is a prefrontally coordinated network that supports WMC and controls the cholinergic basal forebrain. This cholinergic basal forebrain, in turn, can modulate corticopetal-corticofugal loops controlling the subcortical early filtering of auditory information. We postulate a representation of the preceding context, which a WM network, including the prefrontal cortices, maintains and manipulates. The processing of that representation permits top-down prediction that selects the perceptual representation of the current utterance supporting the auditory perception of speech. Accordingly, that WM interacts (cholinergically) with an early stage of processing in the brainstem to support that predictive selection by the early filter.

This filter is wide open when top-down expectancies defy formulation, such as during highly variable meaningless sequences of speech noise information (e.g., jus, käs, tam, nev, poi, tam, jus, käs...; Campbell et al., 2003, 2005). This notion is thus reconcilable with evidence previously marshaled in favor of attenuation or late selection models of auditory attention. Yet it is viable that top-down expectancies and cortical modulations responsive to the dynamics of meaningful speech control corticofugal connections that mediate subcortical neural entrainment. This conjecture leads to an open empirical question for cABR investigations: Is there a syntactically or semantically mediated form of subcortical neural entrainment? A caveat for cABR investigations to reveal a compelling semanto-syntactic influence on such neural entrainment is that the signal-to-noise ratio of the cABR needs to be high. To do so is a methodological challenge with ordinary EEG equipment, as it requires epoching EEG to the onsets of thousands of sounds (e.g., Campbell et al., 2012). Comparing neural entrainment using sequences of semantically or grammatically related word sounds rather than unrelated pairs of word sounds could thus be more practical than using large numbers of sentences. A further caveat is that, for that entrainment to be established as subcortical, the cABRs measured need to be unconfounded by cortical contributions of the SN10 (Parkkonen et al., 2009). It is thus necessary to digitally filter cABR recordings in a way that substantially removes the SN10 to click ABRs from the same session. This filtering should not remove Wave V of the ABR.

Open empirical questions of practical and theoretical importance arise for which the new early filter model offers a framework for making predictions. The model predicts, as already established, that (younger) high-WMC participants would be better at hearing words within noise. Yet those high-WMC participants should also show a decreased between-sequence semantic similarity effect when those words serve as the to-be-ignored speech. Open research questions also relate to treatments for hearing loss such as neuropharmacological approaches and WM training. The cholinergic stance of the cognitive load on perception hypothesis concerns an age-related decline in the cortical cholinergic system. This hypothesis would predict that, for aging individuals exhibiting damage to postsynaptic muscarinic acetylcholine receptors, use of acetylcholinesterase inhibitors could improve WM function for complex span tasks. In turn, this pharmacological treatment would also improve TFS perception alongside the perception and comprehension of speech in noise. WM training may have similar effects. Schneider and Pichora-Fuller's (2000) sensory deprivation hypothesis, assuming sensory decline drives chronic cognitive decline, should be borne in mind. Even audiometrically normal individuals could have a hidden peripheral loss. Accordingly, that loss would result in a sensory decline that may drive damage to postsynaptic muscarinic acetylcholine receptors thus producing cognitive decline. As such, in experiments testing this cholinergic stance of the cognitive load on perception hypothesis, in selecting participants of all ages, screening should not only use audiograms but also use ABR measures of hidden loss, such as the ratio of wave I to wave V (Schaette and McAlpine, 2011). We offer a caveat for the interpretation of evidence from pharmacological treatments seeming to support a cholinergic stance of the cognitive load on perception hypothesis. Those treatments may affect variables such as attributes of cABRs, TFS sensitivity, WMC, or the perception and comprehension of speech in noise. The caveat is that the individuals undergoing the intervention should exhibit neither audiometric nor hidden loss.

Open questions also concern the relation of peripheral sensorineural hearing loss to a compensatory dedication of cognitive resources to the perception and understanding of speech under adverse conditions. Further open questions concern how such a compensation relates to the age-related decline of these systems of neurotransmission alongside an accelerated decline in cognitive faculties including WM.

# AUTHOR CONTRIBUTIONS

Both JM and TC made substantial contributions to the concept and interpretation in drafting the manuscript, approved the submitted materials, and have agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

# FUNDING

The writing of this article was supported by a grant from the Swedish Research Council (2015-01116) awarded to Patrik Sörqvist and to JM.

# ACKNOWLEDGMENTS

Thanks are due to Steve J. Aiken, Steven Bell, Jessica de Boer, Kazunari Ikeda, Erin M. Ingvalson, Kristiina Kompus, Alexandre Lehmann, Brian Moore, Alan Palmer, Lauri Parkkonen, and Mark Wallace for productive discussions, comments, and suggestions.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2016.00136



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Marsh and Campbell. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Three Factors Are Critical in Order to Synthesize Intelligible Noise-Vocoded Japanese Speech

Takuya Kishida<sup>1</sup>\*, Yoshitaka Nakajima<sup>2</sup>, Kazuo Ueda<sup>2</sup> and Gerard B. Remijn<sup>2</sup>

<sup>1</sup> Human Science, Graduate School of Design, Kyushu University, Fukuoka, Japan, <sup>2</sup> Department of Human Science/Research Center for Applied Perceptual Science, Kyushu University, Fukuoka, Japan

Factor analysis (principal component analysis followed by varimax rotation) had shown that 3 common factors appear across 20 critical-band power fluctuations derived from spoken sentences of eight different languages [Ueda et al. (2010). Fechner Day 2010, Padua]. The present study investigated the contributions of such power-fluctuation factors to speech intelligibility. The method of factor analysis was modified to obtain factors suitable for resynthesizing speech sounds as 20-critical-band noise-vocoded speech. The resynthesized speech sounds were used for an intelligibility test. The modification of factor analysis ensured that the resynthesized speech sounds were not accompanied by a steady background noise caused by the data reduction procedure. Spoken sentences of British English, Japanese, and Mandarin Chinese were subjected to this modified analysis. Confirming the earlier analysis, indeed 3–4 factors were common to these languages. The number of power-fluctuation factors needed to make noise-vocoded speech intelligible was then examined. Critical-band power fluctuations of the Japanese spoken sentences were resynthesized from the obtained factors, resulting in noise-vocoded-speech stimuli, and the intelligibility of these speech stimuli was tested by 12 native Japanese speakers. Japanese mora (syllable-like phonological unit) identification performances were measured when the number of factors was 1–9. Statistically significant improvement in intelligibility was observed when the number of factors was increased stepwise up to 6. The 12 listeners identified 92.1% of the morae correctly on average in the 6-factor condition. The intelligibility improved sharply when the number of factors changed from 2 to 3. In this step, the cumulative contribution ratio of factors improved only by 10.6%, from 37.3 to 47.9%, but the average mora identification leaped from 6.9 to 69.2%. 
The results indicated that, if the number of factors is 3 or more, elementary linguistic information is preserved in such noise-vocoded speech.

Keywords: speech perception, noise-vocoded speech, factor analysis, principal component analysis, critical band

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Amy Poremba, University of Iowa, USA Klaus Mathiak, RWTH Aachen University, Germany

> \*Correspondence: Takuya Kishida kishida.takuya0119@gmail.com

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 15 December 2015 Accepted: 29 March 2016 Published: 26 April 2016

### Citation:

Kishida T, Nakajima Y, Ueda K and Remijn GB (2016) Three Factors Are Critical in Order to Synthesize Intelligible Noise-Vocoded Japanese Speech. Front. Psychol. 7:517. doi: 10.3389/fpsyg.2016.00517

# INTRODUCTION

It is important to understand what acoustic characteristics of speech sounds are essential for speech intelligibility in order to elucidate the cognitive mechanisms of speech communication. The acoustic characteristics of speech that contribute to speech perception have been investigated with many different approaches. One of the most fruitful methods is to control acoustic characteristics of speech by signal processing and then to test the intelligibility of the synthesized signals (for reviews see Diehl et al., 2004; Samuel, 2011). The temporal change of spectra is the representative acoustic characteristic in this context, and is processed by a frequency analyzer of the auditory system (Plomp, 1964; Plomp and Mimpen, 1968; Plack, 2013).

Perceptual experiments in which spectral information was systematically degraded revealed that perceptual cues embedded in speech spectra are highly redundant (Remez et al., 1981; Baer and Moore, 1993; Shannon et al., 1995; Warren et al., 1995). These studies often proceeded from the concept of auditory filters (Patterson, 1974; Moore, 2012) or critical bands (Fletcher, 1940), indicating parallel channels to process frequency components. Although the widths of the critical bands were determined from behavioral data, each of them corresponds to a distance of about 1.3 mm along the basilar membrane (Fastl and Zwicker, 2006). There are about 20 critical bands in the commonly used frequency range of speech sounds, which means that we can use the power fluctuations in these frequency bands to perceive speech. In most situations, however, we can perceive speech sounds represented by a relatively small number of power fluctuations because of the redundancy of perceptual cues in speech sounds. Shannon et al. (1995) found that four bands of amplitude-modulated noise were sufficient for nearly perfect scores (>95%) of word intelligibility. Many studies (e.g., Dorman et al., 1997; Loizou et al., 1999; Souza and Rosen, 2009; Ellermeier et al., 2015) have measured the intelligibility of noise-vocoded speech, and indicated results consistent with Shannon et al. (1995). These studies suggest that the 20 outputs of critical-band filters, for example, can be reduced to a smaller number of channels without sacrificing the speech intelligibility too much.

In the present study, the power fluctuations of speech signals in 20 critical-band filters were analyzed and resynthesized with a new method of factor analysis. This analysis method is a modification of principal component analysis followed by varimax rotation, and was developed to reduce the number of dimensions of observed variables while retaining the information conveyed by these variables as far as possible (Jolliffe, 2002).

One of the earliest studies that applied principal component analysis to speech sounds was conducted by Plomp et al. (1967). They found that 14 Dutch steady vowels were distinguishable on the first and second principal component plane; these first two principal components had a close relation with the first and the second formant of the vowels (Pols et al., 1973). Zahorian and Rothenberg (1981) performed principal component analyses of speech, and they suggested that 3–5 principal components might convey enough perceptual cues to make speech signals intelligible. In a more systematic study of Ueda et al. (2010), principal component analysis was followed by varimax rotation. They discovered that 3 common factors appeared in 20 critical-band power fluctuations derived from spoken sentences of eight different languages (American English, British English, Cantonese Chinese, French, German, Japanese, Mandarin Chinese, and Spanish). The same analysis was performed over speech samples from 15-, 20-, and 24-month-old infants, and the 3 common factors observed in adult voices were gradually formed along with language acquisition (Yamashita et al., 2013).

Thus, 3 factors seem to reflect an acoustic language universal, and these factors may play important roles in speech perception. This speculation, however, was brought about only from observations of acoustic characteristics of speech sounds, and it was not yet clear whether the extracted factors convey any perceptual cues. In the present study, we therefore examined how many factors were needed to make speech signals sufficiently intelligible. If the first 3 factors would indeed make up a basic framework of speech perception, then speech sounds resynthesized from these factors should be intelligible enough. We thus performed a perceptual experiment employing resynthesized speech stimuli.

# SPEECH ANALYSIS

The purpose of this analysis was to obtain power-fluctuation factors suitable for resynthesizing speech sounds.

## Materials

Two-hundred speech sentences each spoken by five male native speakers of British English, 200 sentences each spoken by five male native speakers of Japanese, and 78 sentences<sup>1</sup> each spoken by five male native speakers of Mandarin Chinese were used in the present analysis. These materials were selected from a commercial speech database (NTT-AT., 2002), recorded digitally (16-bit linear quantization and sampling frequency of 16000 Hz). The mean fundamental frequencies of the spoken sentences were 126 Hz (SD = 30 Hz) in British English, 136 Hz (SD = 31 Hz) in Japanese, and 164 Hz (SD = 38 Hz) in Mandarin Chinese. The three languages were chosen from the languages analyzed in the previous study of Ueda et al. (2010) as representatives of different families of languages. These three languages have different linguistic rhythms; English is a stress-timed language, Japanese is a mora-timed language, and Mandarin Chinese is a syllable-timed language (Ramus et al., 1999).

<sup>1</sup>The number of Mandarin Chinese sentences was smaller than that of British English sentences or Japanese sentences because speech files with technical problems in the recordings were excluded from the analysis.

### Procedure

Speech sentences were resampled every 1 ms with a 30 ms-long Hamming window. From the extracted short time segments, power spectra were obtained through a Fast Fourier Transformation (FFT). Following this, these power spectra were smoothed with a 5-ms shortpass lifter by cepstral analysis (for a review on cepstral analysis see Rabiner and Schafer, 1978) to remove unnecessary details of the spectra. A 5-ms shortpass lifter removed fine structure of the power spectra narrower than 200 Hz that reflected vocal folds vibrations. Smoothed power spectra were then divided into 20 critical bands, and averaged power was calculated for each band. Thus, 20 temporal power fluctuations were obtained. The 20 critical bandwidths were taken from Zwicker and Terhardt (1980). The bandwidths originally ranged from 0 to 6400 Hz, but since the range below 50 Hz is unrelated to speech, the 1st bandwidth was narrowed from [0–100 Hz] to [50–100 Hz] (**Table 1**).
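The analysis chain described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `band_power_fluctuations`, the FFT length of 512, and the band edges in `BARK_EDGES_HZ` (the commonly tabulated critical-band limits, with the lowest edge raised to 50 Hz) are all assumptions. Table 1 remains the authoritative source for the cutoff frequencies.

```python
import numpy as np

# Assumed band edges in Hz: commonly tabulated critical-band limits,
# with the lowest edge raised from 0 to 50 Hz as described in the text.
BARK_EDGES_HZ = [50, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400]

def band_power_fluctuations(signal, fs=16000, win_ms=30, hop_ms=1,
                            lifter_ms=5, nfft=512, edges=BARK_EDGES_HZ):
    """Return an array of shape (n_bands, n_frames) of smoothed band powers."""
    win_len = int(fs * win_ms / 1000)      # 480 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)          # 16 samples at 16 kHz
    n_keep = int(fs * lifter_ms / 1000)    # quefrency cutoff in samples
    window = np.hamming(win_len)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    masks = [(freqs >= lo) & (freqs < hi)
             for lo, hi in zip(edges[:-1], edges[1:])]

    n_frames = 1 + (len(signal) - win_len) // hop
    out = np.empty((len(masks), n_frames))
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + win_len] * window
        log_power = np.log(np.abs(np.fft.rfft(frame, n=nfft)) ** 2 + 1e-12)
        # Cepstral smoothing: keep only quefrencies below the lifter cutoff
        # (and their mirror images), then return to the spectral domain.
        cepstrum = np.fft.irfft(log_power, n=nfft)
        cepstrum[n_keep:nfft - n_keep + 1] = 0.0
        smoothed = np.exp(np.fft.rfft(cepstrum).real)
        # Average the smoothed power inside each critical band.
        out[:, t] = [smoothed[m].mean() for m in masks]
    return out
```

Each row of the returned array corresponds to one critical band and each column to one 1-ms step, so the rows play the role of the 20 temporal power fluctuations.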

The 20 power fluctuations were subjected to a new type of principal component analysis followed by varimax rotation. Origin-shifted principal component analysis as used in this study proceeds from the idea that calculated eigenvectors should originate not from the gravity center of the data but from the zero point<sup>2</sup>, i.e., the acoustically silent point. If the silent point is not contained in the subspace of the principal components, resynthesized sounds should generate noise even at the point corresponding to the silent point. In other words, the silent point is mapped onto a point indicating a certain acoustic power. As a result, the listener perceives a steady background noise in the resynthesized speech sounds (this kind of steady background noise probably appeared in the resynthesized speech of Zahorian and Rothenberg, 1981). An example of a steady background noise in a resynthesized speech sound is shown in **Figure 1**.
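The idea of letting the eigenvectors originate from the zero point can be illustrated with a short numerical sketch. It assumes that the analysis amounts to an eigen-decomposition of the second-moment matrix about the origin (no mean subtraction); whether the study additionally scaled the variables is not stated, and the function name `origin_shifted_pca` and the synthetic data are hypothetical. The last lines check the equivalence described in footnote 2: ordinary principal component analysis on the data joined with their sign-reversed counterparts yields the same matrix.

```python
import numpy as np

def origin_shifted_pca(Y):
    """Y has shape (n_bands, n_frames). Eigen-decompose the second-moment
    matrix taken about the origin, i.e., without subtracting the mean."""
    M = Y @ Y.T / Y.shape[1]
    eigvals, eigvecs = np.linalg.eigh(M)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return eigvals, eigvecs, cumulative

# Synthetic, non-negative "power" data for illustration only.
rng = np.random.default_rng(0)
Y = rng.random((20, 500))
eigvals, eigvecs, cumulative = origin_shifted_pca(Y)

# Joining the data with their sign-reversed copy moves the gravity center
# to the zero point, so the usual covariance equals the matrix used above.
Z = np.hstack([Y, -Y])
cov_Z = np.cov(Z, bias=True)
```

The array `cumulative` corresponds to the cumulative contributions of the principal components.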

The eigenvectors derived with origin-shifted principal component analysis were rotated by varimax rotation (Kaiser, 1958), resulting in power-fluctuation factors. The purpose of varimax rotation was to make the relation between the factors and the critical bands easier to interpret because the orthogonality of the factors was maintained. The total number of factors produced in the above procedure was varied from 1 to 9 (for example, when 3 power-fluctuation factors were obtained, the eigenvectors of the first 3 principal components were rotated).

### Results and Discussion

**Figure 2** shows the cumulative contributions of the first 1–9 principal components. Over 70% of the variance of the power fluctuations was accounted for by the first 9 principal components in all three languages (75, 76, and 71% for British English, Japanese, and Mandarin Chinese, respectively). A plausible explanation for the lower cumulative contributions for Mandarin Chinese is that the mean fundamental frequency of Mandarin Chinese speech was higher than that of the other languages, and that the cepstral liftering could not smooth the power spectra sufficiently.

**Figure 3** shows factor loadings obtained with the three languages. The patterns of the power-fluctuation factors were similar among the three languages when the number of extracted factors was up to 4 (**Figures 3A–C,E–G,I–K**). The cumulative contributions of the 4 power-fluctuation factors were 53, 55, and 48% for British English, Japanese, and Mandarin Chinese, respectively. When the number of factors was 5 or larger, it was difficult to find similar patterns of factors among these languages (**Figures 3D,H,L**). This means that about 50% of variance in the 20 power fluctuations could be mapped onto a common subspace of 4 fluctuation factors for the three languages.

In the 3- and 4-factor analyses, the factors seemed to divide speech sounds into four frequency bands (about 50–550 Hz, 550–1700 Hz, 1700–3500 Hz, and over 3500 Hz). One of the factors obtained in the 3-factor analyses had high loadings at two frequency bands, the 1st and the 3rd band. These bimodal factors were first reported by Ueda and Nakajima (2008) and Ueda et al. (2010), in which they predicted that the bimodal factor would be separated into 2 factors if they could elaborate the analysis method. The predicted factors indeed appeared in the 4-factor analysis.

Note to **Table 1**: The center frequency (not necessarily the exact mathematical center) and cutoff frequencies were adopted from Zwicker and Terhardt (1980), except for the lowest band.

# SPEECH INTELLIGIBILITY EXPERIMENT

The purpose of this experiment was to determine the number of power-fluctuation factors needed to make speech sufficiently intelligible. We chose Japanese speech sentences as the basis for sound stimuli. Japanese is convenient for scoring answers reported by participants because Japanese words can be broken up into morae, which are syllable-like phonological units. Each lexical mora is uniquely represented by a single Japanese "hiragana" letter used in writing.

### Participants

Six men and six women, ranging in age from 19 to 24 years old (mean age = 21.5 years, SD = 1.6 years), participated as volunteers. They were all native speakers of Japanese with pure-tone thresholds lower than 25 dB HL at audiometric frequencies of 125–8000 Hz for both ears. They were naive as to the purpose of the experiment. The procedure of the experiment was approved by the Ethics Committee of the Faculty of Design, Kyushu University. All participants provided written informed consent as to their participation.

<sup>2</sup>Only when the gravity center of the data is identical with the zero point, the eigenvectors in the usual sense originate from the zero point. To realize this special case, principal component analysis was performed on the power fluctuations connected with their sign-reversed counterparts.

### Equipment

The experiment was conducted in a sound-proof room, where the background noise level was below 25 dBA. The sound stimuli were generated digitally (16-bit linear quantization and sampling frequency of 16000 Hz), with a computer (Frontier KZFM71/N) equipped with an audio card (E-MU 0404). The sounds were presented binaurally (diotically) to the participant via a digital-to-analog converter (ONKYO, SE-U55GX), an active low-pass filter (NF DV-04 DV8FL, cutoff at 7000 Hz), a digital graphic equalizer (Roland, RDQ-2031), an amplifier (STAX, SRM-323S), and headphones (STAX, SR-307). The active low-pass filter was for avoiding aliasing, and the digital graphic equalizer was to equalize frequency responses of the headphones.

### Stimuli

Original speech signals for sound stimuli were digitally recorded Japanese sentences (16-bit linear quantization and sampling frequency of 16000 Hz), selected from a commercial speech database (NTT-AT., 2002). Fifty-seven sentences, each containing 17 to 19 morae (mean = 18 morae), spoken by a male speaker were used; nine sentences were used for training trials, three sentences for warm-up trials, and the remaining 45 sentences for measurement trials. These sentences were part of the 200 sentences used to determine the power-fluctuation factors in the analysis.

The original speech signals were resynthesized from factors as 20-band noise-vocoded speech. The number of factors was 1–9 resulting in nine conditions. The 45 sentences used for measurement trials were divided into nine lists, each containing five sentences of 17 to 19 morae (mean = 18 morae). Each list was assigned to a different factor-number condition, and the assignment of the sentence lists to the factor-number conditions was different among participants (**Table 2**). The nine sentences for training trials were also assigned to different factor-number conditions, but the assignment of sentences to the conditions was the same among participants. The three sentences used for warm-up trials were of 7- to 9-factor conditions.

Note to **Table 2**: The number from 1 to 9 indicates the factor number condition. Each sentence list consisted of five sentences containing 17–19 morae each.

In order to synthesize noise-vocoded speech, the reproduced 20 power fluctuations of the original speech signal were used. With the same procedure as described in the Speech Analysis section, 20 power fluctuations were extracted from the original speech signal. To obtain time series of factor scores, the score of the $t$th time frame of the $k$th factor, $X_{k,t}$, was calculated by the following equation:

$$X_{k,t} = \sum_{n=1}^{20} A_{k,n} Y_{n,t},\tag{1}$$

where $A_{k,n}$ is the $n$th component of the normalized vector indicating the $k$th factor of the $K$ (= 1, …, 9) power-fluctuation factor(s) determined in the analysis of Japanese speech in the Speech Analysis section, and $Y_{n,t}$ is the $t$th time frame of the power fluctuation in the $n$th critical band. $A_{k,n}$ was different between the factor-number conditions as is plotted in **Figure 4**. Next, 20 power fluctuations were reproduced by

$$\hat{Y}_{n,t} = \sum_{k=1}^{K} A_{k,n} \, X_{k,t},\tag{2}$$

where $\hat{Y}_{n,t}$ is the $t$th time frame of the reproduced power fluctuation in the $n$th critical band. Geometrically, the transformations can be regarded as the projections of 20 power fluctuations in a 20-dimensional Euclidean space onto the K-dimensional subspace formed by the normalized vectors indicating the obtained factors.
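Equations (1) and (2) together describe an orthogonal projection, which can be checked numerically. The sketch below uses random orthonormal vectors as stand-ins for the factor vectors and synthetic band powers; both are hypothetical and serve only to show the matrix form of the two equations.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bands, n_frames, K = 20, 200, 3

# Hypothetical orthonormal factor vectors (rows of A). In the study these
# come from the rotated origin-shifted principal components.
Q, _ = np.linalg.qr(rng.standard_normal((n_bands, K)))
A = Q.T                                   # shape (K, n_bands)

Y = rng.random((n_bands, n_frames))       # synthetic band powers
Y[:, 0] = 0.0                             # one acoustically silent frame

X = A @ Y                                 # Equation (1): factor scores
Y_hat = A.T @ X                           # Equation (2): reproduced powers

# Projecting a second time changes nothing (idempotence of a projection).
Y_hat_again = A.T @ (A @ Y_hat)
```

Because the subspace passes through the origin, the silent frame `Y[:, 0]` is reproduced as silence. Note that individual entries of `Y_hat` can be negative for other frames, since a projection does not preserve non-negativity; how such values were handled is not stated in the text.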

White noise was generated, and was passed through banks of digital filters with the same cutoff frequencies as specified in **Table 1**. Twenty power fluctuations were then computed by squaring and smoothing each bandpass-filter output. The ratio between the reproduced power of the original speech signal as in equation (2) and the power of the generated noise was calculated in each critical band at each sample point<sup>3</sup>. The 20 bandpass-filtered noises were thus modulated with that ratio to realize the 20 reproduced power fluctuations of speech sound. Finally, the modulated bandpass-filtered noises were added up to yield noise-vocoded speech.
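The synthesis steps can be sketched as follows. Several details are assumptions rather than facts from the text: the bandpass filters are replaced by simple FFT masking, the smoothing is a moving average of `smooth` samples, negative reproduced powers are clipped to zero, and the amplitude gain is taken as the square root of the power ratio. The function name `noise_vocode` and its parameters are hypothetical.

```python
import numpy as np

def noise_vocode(Y_hat, band_edges_hz, fs=16000, hop=16, smooth=160, seed=0):
    """Y_hat: (n_bands, n_frames) reproduced band powers, one column per
    1-ms frame. Returns a noise-vocoded waveform of n_frames * hop samples."""
    n_bands, n_frames = Y_hat.shape
    n = n_frames * hop
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n)                 # white noise carrier
    spectrum = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    kernel = np.ones(smooth) / smooth              # moving-average smoother
    out = np.zeros(n)
    for b in range(n_bands):
        lo, hi = band_edges_hz[b], band_edges_hz[b + 1]
        mask = (freqs >= lo) & (freqs < hi)
        # Bandpass-filtered noise (FFT masking stands in for a filter bank).
        band_noise = np.fft.irfft(spectrum * mask, n=n)
        # Power fluctuation of the noise band: squaring and smoothing.
        noise_power = np.convolve(band_noise ** 2, kernel, mode="same")
        # Repeat each frame value to match the sample rate (cf. footnote 3).
        target_power = np.repeat(np.clip(Y_hat[b], 0.0, None), hop)
        gain = np.sqrt(target_power / np.maximum(noise_power, 1e-12))
        out += gain * band_noise
    return out
```

With an all-zero `Y_hat` the output is exactly zero, which mirrors the requirement that silent parts remain silent after resynthesis.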

### Procedure

The intelligibility experiment started with one training block of nine trials, followed by three main blocks which each consisted of one warm-up trial and 15 measurement trials. The participant, who sat on a chair in front of the computer screen wearing headphones, was asked to click a "play" button on the screen for each trial. A sound stimulus was presented 0.5 s after the button was clicked. The presentation was repeated three times with 1.5 s intervals. After listening to the sound stimulus, the participant typed the morae (syllable-like phonological units) which he/she heard using hiraganas (Japanese moraic phonograms). The participant was instructed to avoid guessing parts of sentences which were not heard clearly. All stimuli in main blocks were presented in random order.
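The scoring rule is not spelled out beyond the use of morae as units. One possible way to score a typed response, shown purely as an illustration, is to count the target morae that reappear in the response in the same order. The sketch assumes one hiragana character per mora and uses Python's standard `difflib` for the alignment; the function name `mora_score` and the example strings are hypothetical and are not taken from the stimulus set.

```python
from difflib import SequenceMatcher

def mora_score(target, response):
    """Percentage of target morae matched, in order, by the response.
    Each hiragana character is treated as one mora (a simplification)."""
    matcher = SequenceMatcher(None, target, response, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(target)

# A response that omits one of five morae.
example = mora_score("こんにちは", "こんちは")
```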

### Results and Discussion

<sup>3</sup>The sample size of the 20 reproduced power fluctuations was increased by repeating values to equate with the sample size of the power fluctuations of the bandpass-filtered noises.

**Figure 5** shows the percentage of mora identification as a function of the number of factors used to reconstruct the 20 power fluctuations of the Japanese speech stimuli. Mora identification increased with the number of factors, and approached a plateau around the 4-factor condition, where the participants' performance was 83.7% (SD = 5.8%). Mora identification was subjected to arcsine transformation, and a one-way Analysis of Variance (ANOVA) with repeated measures was performed. The results showed that the main effect of number of factors was significant [F(8, 88) = 315.44, p < 0.0001]. Post hoc tests according to Scheffé showed no statistically significant differences in mora identification when the number of factors was increased beyond 6 [F(8, 99) = 3.83, p = 0.869, n.s.]. There was a significant difference [F(8, 99) = 317.36, p < 0.001] in the mora identification between the 2- and the 3- [or more] factor conditions, and between the 3- and the 4- [or more] factor conditions [F(8, 99) = 16.68, p < 0.05]. There was no significant difference between the mora identifications obtained with the 4- and the 5-factor condition [F(8, 99) = 1.37, p = 0.994, n.s.], between the 4- and the 6-factor condition [F(8, 99) = 11.09, p = 0.212, n.s.], or between the 5- and the 8-factor condition [F(8, 99) = 10.15, p = 0.269, n.s.]. A remarkable improvement of mora identification appeared when the number of factors was changed from 2 to 3; the average mora identification leaped from 7% to about 70% [exactly from 6.9% (SD = 6.7%) to 69.2% (SD = 11.7%)]. The first 3 factors turned out to be critical for speech intelligibility.
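For readers unfamiliar with the preprocessing step, the arcsine transformation and the degrees of freedom of the repeated-measures design can be written out as below. The exact variant of the transformation used in the study is not stated; the sketch assumes the common form arcsin(√p), and both function names are hypothetical.

```python
import math

def arcsine_transform(percent_correct):
    """Variance-stabilising transform of a proportion: arcsin(sqrt(p)),
    returned in radians."""
    p = percent_correct / 100.0
    return math.asin(math.sqrt(p))

def rm_anova_df(n_conditions, n_participants):
    """Degrees of freedom of a one-way repeated-measures ANOVA."""
    df_effect = n_conditions - 1
    df_error = (n_conditions - 1) * (n_participants - 1)
    return df_effect, df_error

df_effect, df_error = rm_anova_df(n_conditions=9, n_participants=12)
```

With nine factor-number conditions and twelve participants this gives 8 and 88, in agreement with the degrees of freedom reported for the main effect.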

# GENERAL DISCUSSION

We applied factor analysis to spoken sentences in three different languages, and performed an intelligibility test of Japanese sentences to investigate how power-fluctuation factors contributed to speech perception. The method of the factor analysis used in a previous study (Ueda et al., 2010) was modified in order to make it possible to resynthesize power fluctuations of speech across 20 critical bands from the obtained factors. The power-fluctuation factors extracted with this modified analysis method had very similar profiles to the ones in the previous studies (Ueda et al., 2010; Ellermeier et al., 2015). Twenty critical bands were divided into four frequency regions by the factors when the number of extracted factors was 3 or 4. These factors appeared commonly across three languages, i.e., British English, Japanese, and Mandarin Chinese. This was consistent with the results of Ueda et al. (2010). The drastic modification of the analysis method did not distort the essential features of the factors. The advantage of the present modification was that the vectors indicating the factors originated from the acoustically silent point. The silent point mapped onto the extracted subspace generated no noise: Silent parts remained silent when resynthesized.

The set of the 3 power-fluctuation factors proved to play a vital role in making speech intelligible. Although the 3 factors explained only 47.9% of the power fluctuations in the original speech sentences, 69.2% (SD = 11.7%) of the morae in the Japanese sentences were conveyed perceptually through these factors. Less than a half of the physical variance thus is very likely to be more informative than the rest. The finding that the 6-factor condition finally led to an asymptotic performance suggests that the information in the 3–4 factors forms the basis of perceptual cues, but that it is not yet sufficient to carry phonological details.

Let us compare the participants' performance in this study with that in four previous studies (Shannon et al., 1995; Dorman et al., 1997; Souza and Rosen, 2009; Ellermeier et al., 2015), which investigated the relationship between the number of vocoding channels and sentence recognition. Four-channel noise-vocoded speech induced high intelligibility in these previous studies (95% correct score in Shannon et al., 1995; Dorman et al., 1997; Ellermeier et al., 2015, and 70% correct score in Souza and Rosen, 2009). These correct scores are not too far from those obtained in the 3- and the 4-factor condition in our experiment. Very probably, four-channel noise-vocoded speech whose bandwidths are determined by the 3 or 4 factors in the present paradigm will be intelligible as well (see Ellermeier et al., 2015). The reason why high recognition performances were obtained with four-channel noise-vocoded speech in the previous studies can be explained if we assume that the fundamental nature of speech sounds consists of 3 or 4 bands of power fluctuations, and originates from constraints of the size and structure of the human articulatory organs (Yamashita et al., 2013, Figure A3). This conjecture comes from the fact that 3–4 factors commonly appeared in three different languages, and speech perception seems to be based on the perceptual cues carried by the 3–4 power-fluctuation factors.

If the power-fluctuation factors obtained in this study play an essential role in human speech communication, some correspondence will be found between the factors and articulatory movements as well as brain activities related to speech communication. The present findings might contribute to the development of technology supporting speech communication on various occasions.

### AUTHOR CONTRIBUTIONS

TK and YN designed the study. TK collected and analyzed the data, and the other authors supported him occasionally. All the authors interpreted the results together. TK wrote the first draft, and all the authors improved the paper together. YN gave the final approval of the version to be published.

### ACKNOWLEDGMENTS

This study was supported by a JSPS KAKENHI Grant-in-Aid for Scientific Research (A) (25242002), and a JSPS KAKENHI Grant-in-Aid for Exploratory Research (26540145). Asuka Ono gave us technical assistance.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Kishida, Nakajima, Ueda and Remijn. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Deficit in Movement-Derived Sentences in German-Speaking Hearing-Impaired Children

### Esther Ruigendijk<sup>1</sup> \* and Naama Friedmann<sup>2</sup>

<sup>1</sup> Department of Dutch and Cluster of Excellence "Hearing for All", University of Oldenburg, Oldenburg, Germany, <sup>2</sup> Language and Brain Lab, Tel Aviv University, Tel Aviv, Israel

Children with hearing impairment (HI) show disorders in syntax and morphology. The question is whether and how these disorders are connected to problems in the auditory domain. The aim of this paper is to examine whether moderate to severe hearing loss at a young age affects the ability of German-speaking orally trained children to understand and produce sentences. We focused on sentence structures that are derived by syntactic movement, which have been identified as a sensitive marker for syntactic impairment in other languages and in other populations with syntactic impairment. Therefore, our study tested subject and object relatives, subject and object Wh-questions, passive sentences, and topicalized sentences, as well as sentences with verb movement to second sentential position. We tested 19 HI children aged 9;5–13;6 and compared their performance with hearing children using comprehension tasks of sentence-picture matching and sentence repetition tasks. For the comprehension tasks, we included HI children who passed an auditory discrimination task; for the sentence repetition tasks, we selected children who passed a screening task of simple sentence repetition without lip-reading; this made sure that they could perceive the words in the tests, so that we could test their grammatical abilities. The results clearly showed that most of the participants with HI had considerable difficulties in the comprehension and repetition of sentences with syntactic movement: they had significant difficulties understanding object relatives, Wh-questions, and topicalized sentences, and in the repetition of object who and which questions and subject relatives, as well as in sentences with verb movement to second sentential position. Repetition of passives was only problematic for some children. Object relatives were still difficult at this age for both HI and hearing children. 
An additional important outcome of the study is that not all sentence structures are impaired: passive structures were not problematic for most of the HI children.

Keywords: syntax, hearing impaired children, German, relative clauses, Wh-questions

### Edited by:

Jerker Rönnberg, Linköping University, Sweden

### Reviewed by:

Arnaud Rey, Centre National de la Recherche Scientifique, France Francesco Vespignani, University of Trento, Italy

\*Correspondence: Esther Ruigendijk esther.ruigendijk@uni-oldenburg.de

### Specialty section:

This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology

Received: 07 January 2016 Accepted: 20 April 2017 Published: 13 June 2017

### Citation:

Ruigendijk E and Friedmann N (2017) A Deficit in Movement-Derived Sentences in German-Speaking Hearing-Impaired Children. Front. Psychol. 8:689. doi: 10.3389/fpsyg.2017.00689

# INTRODUCTION

Children with hearing impairment (HI) very often show language problems. Many studies of the language of HI children examine their vocabulary and phonology, and demonstrate difficulties in these language domains (e.g., Davis et al., 1986; Briscoe et al., 2001). In the current study, we focus on a different language domain of great difficulty in HI children: syntax. The ability to understand and produce sentences is a core language ability, but studies have shown that children with HI show great difficulty in syntax, in both the comprehension and production of syntactically complex sentences (Pressnell, 1973; Sarachan-Deily and Love, 1974; Geers and Moog, 1978; Berent, 1996; Brannon, 1966, 1968; Quigley and King, 1980; Friedmann and Szterman, 2006, 2011; Delage and Tuller, 2007; Geers et al., 2009; Friedmann and Costa, 2011; Friedmann and Haddad-Hanna, 2014; Szterman and Friedmann, 2014b).

Studies that examined which sentence structures pose difficulties to HI children, done mainly in English, Hebrew, and Arabic, pointed to several structures that are especially difficult for these children. These were mainly Wh-questions, object relatives, object topicalization sentences, and passive sentences.

**Wh-questions**, like "which girl did grandma draw?" were found to be impaired in HI children's comprehension and production (Quigley et al., 1974b; Geers and Moog, 1978; de Villiers, 1988; de Villiers et al., 1994; Berent, 1996; Friedmann et al., 2010b; Friedmann and Szterman, 2011; Szterman and Friedmann, 2015). **Relative clauses**, such as "this is the girl who grandma kissed" were also found to cause special difficulty for HI children in both comprehension and production (Quigley et al., 1974a; Berent, 1988; de Villiers, 1988; Friedmann and Szterman, 2006; Friedmann et al., 2008, 2010b; Friedmann and Haddad-Hanna, 2014; Szterman and Friedmann, 2014a, 2015; Volpato and Vernice, 2014). Similar difficulties have also been reported for **topicalization** structures, such as "this girl, the grandma loved" (Friedmann and Szterman, 2006; Szterman and Friedmann, 2014a, 2015). A further type of sentences that was reported to be difficult for HI children is the passive construction, such as "the girl was tickled by the grandma" (Power and Quigley, 1973).

Syntactically, these structures share a common property: they are all derived by syntactic movement. Syntactic movement is the operation that creates a structure by movement of an element from a basic word order (also termed the base-generated order). For instance, it is assumed that in English (and other languages) the basic word order is subject-verb-object. To derive the topicalized structure "this girl, the grandma loved" from the base-generated order "the grandma loved this girl", "this girl" is moved from a position after the verb "loved" to the first position of the sentence. It has therefore been argued that HI children may have a specific problem with structures that are derived by syntactic movement (see e.g., Friedmann and Szterman, 2006, 2011).

Within the movement-derived sentence structures, the structures in which HI children show most difficulties are the ones where the order of the participants in the sentence is not the usual one. In English, Hebrew, and Arabic, where syntax of HI children has been tested, the basic word order (see the simple sentence in 1) is subject-verb-object, or to use the thematic structure: agent-before-theme (note that this is not the same thing, see the discussion on example 10 below). Namely, the agent of the verb (and of the action described in the sentence) precedes the theme of the verb. The movement-derived sentences that are most difficult for HI children to understand, exemplified in 2–5, are the ones where the theme precedes the agent (in 2–5, the grandfather, who is the theme, precedes the boy, who is the agent).


Sentences (2)–(5) differ in structure, but in all of them the boy is the agent of the action (i.e., the tickler), and the grandfather is the theme of the action (i.e., the one being tickled). The verb tickle assigns two thematic roles: the role of agent to the noun phrase (NP) that performs the action and theme to the NP that receives the action or is affected by it. The assignment of these thematic roles is done according to the base-generated order: the verb assigns the agent role to the NP that precedes it and the theme role to the NP that follows it. Since in sentences (2–5) the object is moved to the position before the verb, the question is how this NP receives its thematic role. Within Government and Binding theory (Chomsky, 1981) it is assumed that NPs that move leave behind a trace in their original position (marked by an underlined gap in examples 2–5). The verb assigns the thematic role to the trace of the moved NP and the role is then transferred from the trace position to the moved constituent through a chain consisting of the trace and the moved NP. For (2–5) this means that the verb assigns a thematic role of theme to the trace of the NP the grandfather, which has moved. This role is then transferred to the grandfather, through a chain of movement, and hence this NP can be interpreted as the theme of the sentence. In processing terms, one may think of movement as re-activation of the NP that moved in its base-generated position: upon hearing the sentence in (2), for example, the hearer keeps the NP "which grandfather" in a syntactic memory component until she hears the verb, and then she can re-access this NP after the verb, and interpret it as the theme, in order to understand 'who did what to whom' in the sentence.

Sentences in which the theme (the object of the sentence here) moves across the agent (the subject) to a position in the beginning of the sentence are especially difficult for various populations: young children who have not yet completed the acquisition of syntax in their language (Friedmann et al., 2009, 2010a; Belletti et al., 2012; Biran and Ruigendijk, 2015), children with developmental syntactic impairment, SySLI (Friedmann and Novogrodsky, 2004, 2011; Friedmann et al., 2015), and individuals with agrammatism (Grodzinsky et al., 1999). In studies of English, Hebrew, and Palestinian Arabic, the difficulty in these structures is cast in terms of word order: the theme moves to a position before the agent, and the word order is not the canonical one; to distinguish between an object and a subject question in English, for example (Which grandfather did the boy tickle vs. Which grandfather tickles the boy), one needs to rely on word order.

The situation is different in German. German marks subjects and objects through morphology, using case-marking. Subject- and object-first sentences have the same order of NPs and verbs and only differ in the case-marking of these NPs. German NPs are marked for case, as can be seen in sentence (6), where der Junge 'the boy' has nominative case and den Opa 'the grandfather' accusative case.<sup>1</sup> Sentences (7–9) show German examples of three of the structures with Wh-movement, which have been found to be impaired in children with HI: object Wh-questions, object relatives and topicalized sentences (parallel to the English examples in 2–4).<sup>2</sup>

(6) Simple active:

Der Junge kitzelt den Opa.
theNOM<sup>3</sup> boy tickles theACC grandfather.
'The boy is tickling the grandfather.'

(7) Object Wh-question:

Welchen Opa<sub>1</sub> kitzelt der Junge t<sub>1</sub>?
whichACC grandfather tickles theNOM boy?
'Which grandfather does the boy tickle?'

(8) Object relative clause:

Das ist der Opa<sub>1</sub>, den der Junge t<sub>1</sub> kitzelt.
this is theNOM grandfather, thatACC theNOM boy tickles.
'This is the grandfather that the boy tickles.'

(9) Topicalization:

Den Opa<sub>1</sub> kitzelt der Junge t<sub>1</sub>.
theACC grandfather tickles theNOM boy.
'It is the grandfather that the boy tickles.' (The German sentence does not include embedding, but this translation keeps the gist of the use of such sentences.)

In German, case morphology gives important information as to 'who did what to whom'. In our sentences (6–9), the subject of the sentence always has nominative case der Junge 'theNOM boy', marked here on the article of the NP. For masculine NPs, the article always unambiguously distinguishes nominative (der) and accusative (den) case. This marks the subject and object, and hence provides clear information on who does what to whom. Studies on language acquisition in young German-speaking children (up until the age of 7 at least) show that, although object-first sentences are still not comprehended adult-like, such unambiguous case-marking does indeed improve comprehension (Arosio et al., 2012; Biran and Ruigendijk, 2015; Roesch and Chondrogianni, 2015) as well as sentence repetition (Biran and Ruigendijk, 2015).
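The way unambiguous case-marking identifies 'who did what to whom' independently of word order can be made concrete with a toy sketch. It is deliberately limited: it covers only masculine singular determiners in active main clauses with an agentive verb such as kitzeln, where nominative marks the agent and accusative the theme. The lookup tables and the function name `thematic_roles` are hypothetical and are not a model of German grammar or of sentence processing.

```python
# Toy lookup: masculine singular determiners that mark case unambiguously.
DETERMINER_CASE = {"der": "NOM", "den": "ACC",
                   "welcher": "NOM", "welchen": "ACC"}
# Valid only for active sentences with an agentive transitive verb.
CASE_ROLE = {"NOM": "agent", "ACC": "theme"}

def thematic_roles(sentence):
    """Assign agent/theme from case-marked determiners, ignoring word order."""
    words = sentence.strip(".?!").split()
    roles = {}
    # Look at each (determiner, following word) pair in the sentence.
    for determiner, noun in zip(words, words[1:]):
        case = DETERMINER_CASE.get(determiner.lower())
        if case is not None:
            roles[CASE_ROLE[case]] = noun
    return roles

simple_active = thematic_roles("Der Junge kitzelt den Opa.")
topicalized = thematic_roles("Den Opa kitzelt der Junge.")
```

Both calls yield the same assignment (Junge as agent, Opa as theme) although the order of the two NPs is reversed, which is the point made in the paragraph above.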

Thus, in German, correct interpretation and use of these specific structures depends on morphosyntactic information<sup>4</sup> that is perceptually not very salient: determiners and verbal inflection. However, it does not seem to be the perceptual salience of the case-bearing words that is the source of the difficulty with these syntactic structures in HI. We can see that difficulty in movement-derived sentences is apparent also in languages such as English, Italian, Hebrew and Arabic, where these syntactic structures are not marked by similarly-sounding case markers but rather by (perceptually salient) word order. In addition to morphosyntactic information, the different structures are realized with different prosody. However, difficulties in perceiving prosody cannot be the source of the difficulty either. First of all, people understand sentences with movement even when they are written, when no prosody is provided. Additionally, HI children show similar deficits in written movement-derived sentences (Quigley et al., 1974a; Szterman and Friedmann, 2014a,b), where no prosodic information is provided. This confirms the idea that prosody is not the only aspect that can distinguish these types of sentences, and that there is a special role for morphosyntactic information worth studying in HI children.

It has been shown by Hennies et al. (2012), that Germanspeaking HI children perform worse than normal-hearing children on the perception of consonants that are relevant for subject-verb agreement on syllable offset. Furthermore, Szagun (2004) showed that the article system of German-speaking children with a CI (cochlear implant) is less well-developed than that in normal hearing children, which she argues is the result of persisting perceptual problems. Steinbrink (2004), however, found for –n and vowels (which are important for case morphology – n for the distinction between the casemarked determiners den and der and dem; vowels for the distinction between the determiners die, das, and der vs. den/dem) no clear correlation between phonological problems and the production of correct inflectional morphology (as examined through spontaneous speech analysis). Similarly, in one of our own recent (eyetracking) studies, we found that CI children are aware of both case and subject–verb agreement morphology, but show a considerable delay in the effect of this morphosyntactic information on sentence interpretation (Schouwenaars et al., 2015). It is thus especially interesting to examine how Germanspeaking children with HI understand and produce structures

<sup>1</sup>Note that only for masculine NPs nominative and accusative are unambiguous (der vs den); for neuter (das), feminine (die), and plural (die) definite articles, nominative, and accusative are the same. Unless stated otherwise, we use and discuss the unambiguous case marker in this paper.

<sup>2</sup>We abstract away here from movement of the verb from its VP-internal (verb phrase-internal) position. It is argued that in German, the verb is base generated at the final position, hence canonical order would be SOV. In these examples the original position of the verb does not affect the assignment of the thematic roles.

<sup>3</sup>NOM refers to nominative case, usually used for the subject of the sentence, ACC refers to accusative case, used (among other categories) for the object of a sentence, and DAT refers to dative case, which is used (among other things) after some prepositions, like the P von 'by' in passive sentences (like 10).

<sup>4</sup> In addition to morphosyntactic information, the different structures may be argued to be realized with different prosody. Weber et al. (2006) for instance conclude that prosody can influence the interpretation of constituent order ambiguities (possible in German, see footnote 2) in that a prosodic manipulation (i.e., marked prosody with narrow focus on the first NP) eliminated the normally existing subject-first preference. Importantly though, the prosodic information did not make an object-first interpretation more preferable, showing that prosody alone is not enough for disambiguation of these structures, whereas morphosyntactic information can be. Similarly, Braun (2006) provided experimental evidence for a different prosody for topicalized sentences in a production task in German, but again this prosodic information could not be reliably used for comprehension (see also Pappert and Pechmann, 2012, or Carroll, 2013 for a discussion).

with Wh-movement in which the theme precedes the agent, and which require the processing of case markers, and this is one of the aims of our study.

We have so far discussed **Wh-movement**, a movement of a phrase to the beginning of the sentence (NP or PP to spec-CP, i.e., the specifier of the complementizer phrase, in syntactic terms), which derives Wh-questions, relative clauses, and topicalization sentences (sentences 7–9). However, Wh-movement is only one type of movement that results in a non-canonical structure. Types of syntactic movement differ by the type of element that moves, and the position to which it moves. Assessing comprehension and repetition of HI in German allows us to explore another question: are all types of movement impaired in HI? We therefore assessed two additional types of movement-derived sentences, in addition to Wh-movement: one is a type of movement that involves the movement of an NP, but to a different sentential position – a movement from object to subject position (which is called A-movement), which occurs, for example, in passive sentences such as (10); the other is the movement of the verb to the second sentential position (verb movement, or, in more syntactic terms, V-to-C movement), illustrated in (11).

(10) Passive:

Der Opa<sub>1</sub> wurde von dem Jungen t<sub>1</sub> gekitzelt.

theNOM grandfather was by theDAT boy tickled.

'The grandfather was tickled by the boy.'

(11) Verb movement:

Jetzt kitzelt<sub>1</sub> der Junge den Opa t<sub>1</sub>.

now tickles theNOM boy theACC grandfather.

'Now the boy is tickling the grandfather.'

In (10), similar to (7–9), the theme, der Opa, comes before the agent der Junge, that is, the theme has been moved from its original position to the first position of the sentence. Unlike in (7–9), however, it is now the syntactic subject of the sentence, as indicated by subject-verb agreement and as can be seen in its case-marking: nominative. The agent of the sentence is now realized in a 'by phrase': von dem Jungen 'by theDAT boyDAT', with unambiguous dative case. So, here we have a subject–object word order, but it is still non-canonical in the sense that the first NP is not the agent of the sentence. In this type of movement, the thematic role is assigned to the original position of the object, whereas nominative is assigned to the moved element.

One final type of movement-derived sentences to be tested here is shown in (11). In German, the finite verb of a sentence moves to the second position of the sentence in main clauses, as can be seen in all examples (6–10) already (see footnote 3). Importantly, when a child repeats a simple active sentence in German, with the order subject-verb, one cannot be completely sure what the underlying structure is that results in this output.<sup>5</sup> When a German sentence starts with an adverb (A), the verb moves to the second position of the sentence, to a position before the subject, creating an AVSO word order (i.e., Adverb – Verb – Subject – Object). With this sentence type, we can be sure about the underlying structure that is realized: the adverb is moved to Spec-CP, whereas the verb is moved to C. A further difference between this structure and the active sentence (in 1) is that both NPs now come after the verb. The order of the NPs is still canonical agent-theme. This type of movement is called V-to-C movement.

The ability to understand and produce sentences with syntactic movement is a crucial language ability. Our aim was to assess whether the lack of sufficient exposure to natural language from birth affects the ability of German-speaking children with HI to understand and produce (non-canonical) sentences that are derived by syntactic movement. We further asked which types of movement are impaired. Unlike other languages in which the syntax of HI children has been examined so far, like English, Hebrew, or Arabic, German enables us to study the interaction of word order phenomena with morphosyntactic case-marking. Furthermore, German allows testing of sentences that include object movement without other changes in the sentence (topicalization), and allows us to compare various types of syntactic movement: Wh-movement, A-movement as seen in passives, and verb movement (to C). So, for example, English allows examining passives, Wh-questions and relative clauses, but not V-to-C movement of main verbs, or topicalization without other interfering factors, which can be tested in German. Hebrew and Arabic allow the study of V-to-C movement of main verbs and topicalization, as well as relative clauses and Wh-questions, but passives in these languages are rarely used. Thus, examining these structures in German HI may help us better understand the effects of HI on the acquisition of sentences with syntactic movement, by examining another type of movement, beyond phrasal movement, and by examining the effect of case marking on the processing of sentences derived by Wh-movement. Furthermore, our data may help to better understand the possible psycholinguistic bases of the syntactic impairment in different populations by systematically studying the effects of HI on language acquisition using similar structures that are studied in these other populations with different etiologies such as syntactic SLI or agrammatism.

### GENERAL METHOD

We used two types of tasks to examine the HI children's syntactic abilities. In the first part of this article we describe two picture selection tasks (Experiments 1 and 2) which we used to test the participants' comprehension of subject- and object relative clauses and of passive sentences, as well as who and which subject and object questions and topicalized sentences. In the second part, we report on two sentence repetition tasks (Experiments 3 and 4) with which we examined subject relative clauses, passive sentences, and subject and object who and which questions, in comparison with simple SVO sentences (subject–verb–object), and sentences with an adverb (AVSO vs. SVOA). We chose two different types of tasks, comprehension and repetition, to

<sup>5</sup> In cases of difficulties in the CP-layer of the syntactic tree, one may leave the subject in spec-IP (inflectional phrase) and the verb in I, or even leave the subject and the verb VP-internally. An SV sentence may still sound exactly the same as an SV sentence in which the S moved to spec-CP and the verb to C and is hence not a very good way to test verb-movement.

offer converging evidence of a syntactic impairment and to allow for task independent assessment of the difficulty. The picture selection tasks allow for a controlled way of assessing participants' ability to use syntax for comprehension. Performance on this task is informative in two ways. First, we can test whether the HI children perform similar to or less well than the hearing children. Second, the task allows us to distinguish between above chance, chance and below chance performance, where above-chance performance indicates knowledge of the structure, and chance level or below chance performance suggests that the syntactic information is not acquired yet. Chance performance in the picture selection task would be manifested by random pointing to one of the two pictures, pointing to each picture around half of the time. In our sentence types chance performance suggests that the child is aware of the morphosyntactic information, but cannot yet use it for correct sentence interpretation. Below chance performance means a systematic error pattern, i.e., systematically choosing the distractor picture, which would indicate that the child is not yet aware of the morpho-syntactic information given in the sentence (such as case marking).
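One way to make the distinction between above-chance, at-chance, and below-chance performance concrete is an exact binomial test on the number of correct choices in a two-picture task. The sketch below is illustrative only: the function names, the one-sided 0.05 criterion, and the 20-item example scores are assumptions made for the illustration, not the analysis script used in this study.

```python
from math import comb


def upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact probability of k or more correct responses out of n by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def lower_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact probability of k or fewer correct responses out of n by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


def classify(correct: int, n_items: int, alpha: float = 0.05) -> str:
    """Label a score as above, at, or below chance (one-sided tests, assumed alpha)."""
    if upper_tail(correct, n_items) < alpha:
        return "above chance"
    if lower_tail(correct, n_items) < alpha:
        return "below chance"
    return "at chance"


if __name__ == "__main__":
    # Hypothetical scores on a 20-item, two-picture condition.
    for score in (17, 11, 4):
        print(score, classify(score, 20))
```

Under these assumptions, a child needs at least 15 of 20 correct responses in a condition to count as above chance, and 5 or fewer correct to count as below chance; a two-sided criterion would shift both cut-offs outward.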

Repetition tasks allow full control of the target sentence and the construction of minimal pairs of sentences – one including the tested structure and one completely parallel but without the tested structure. It is hence a relatively simple way to examine the syntactic abilities of children in various structures such as relative clauses, Wh-questions, and passives using the same task. Repeating a sentence in one's native language involves comprehension and production, and does not merely consist of a passive, phonological copy of the input sentence. Therefore, difficulties in the comprehension and production of a syntactic structure may be manifested in difficulties in repeating this structure (Friedmann and Grodzinsky, 1997; Lust et al., 1998; Potter and Lombardi, 1998; Friedmann, 2006, 2013; Friedmann and Lavi, 2006; Szterman and Friedmann, 2015). When participants repeat sentences that are similar in length and words, which differ only in the relevant syntactic feature tested, and succeed on one structure but fail on the other, this might indicate a specific difficulty with the tested structure. Thus, if a child consistently makes structural errors when she repeats a certain structure, but consistently repeats the control sentence correctly, this would indicate that she has not yet mastered this specific structure, or that she has a deficit in this structure. Also, the types of errors that the participant makes when repeating a structure are informative: repetition errors that affect the structure of the sentence indicate a difficulty that is structural in nature. Conversely, lexical errors, i.e., substituting or omitting lexical items in a way that does not affect the syntactic structure or the thematic roles in the sentence, may reflect either a lexical difficulty, or the increased difficulty of the target sentence, which might result from its structure. Each task is described in detail below.

Each child was tested individually by a native speaker of German, in 2 to 5 meetings. The children participated at will and they were told that they could stop whenever they wanted. We received informed consent from all parents. No time limit was set in any of the tasks, and the experimenter repeated every item as many times as the participant requested. We varied the type of tasks (i.e., repetition, comprehension<sup>6</sup>) in each session, so that there was enough variation for the child. In between tasks we had short breaks. Apart from that, the child could take as many breaks as s/he wanted. This study was approved by and carried out in accordance with recommendations from the local ethics committee at the University of Oldenburg.

Prior to the experiments, two screening tests were used to assess for each participant (for the HI children: with hearing aid device) whether s/he could perceive language as presented/used in our tests. One screening test was an auditory same-different task, which was designed to make sure that the participants perceived the phonological differences between case inflections, which are crucial for sentence comprehension (and hence also for repetition) in German, and that their performance was not influenced by problems in hearing these morphemes. The participant heard 22 pairs of NPs (each NP including one or two words). The test included pairs of determiners, determiners + nouns, Wh-elements, Wh-element + nouns, and possessive pronoun + N. There were 11 identical pairs and 11 pairs that differed in their case inflection (for instance, identical: den Jungen – den Jungen 'theMASC,ACC, boyACC - theMASC,ACC, boyACC'; different: der Esel – den Esel 'theMASC,NOM donkey – theMASC,ACC donkey'; MASC = masculine). The participants were asked to judge whether the NPs in each pair were the same or different. Individuals who made errors on more than three items in this screening task did not participate in the study.

The other screening test was a simple sentence repetition task, which was used to make sure that the sentence stimuli in the experiments were perceived correctly, and that the children did not have relevant production difficulties. The experimenter said 10 simple canonical SVO sentences (e.g., Das junge Mädchen zeichnet den frechen Frosch. 'theNEUT,NOM young girl draws theMASC,ACC naughtyACC frog'; NEUT = neuter) with her lips concealed, and the participants were asked to repeat each sentence aloud. In this test, omissions and substitutions of the determiners, the nouns, or the verbs were counted as incorrect. We did not count as incorrect errors that resulted from pronunciation difficulties. Individuals who made errors on more than one sentence in this task did not participate in the repetition experiments. Children who did not pass the screening repetition task, but who did well on the same-different task (i.e., less than three errors), did participate in the sentence comprehension tasks, but not in the repetition tasks.

### Participants

In total 24 German-speaking children with HI were examined. Five of them did not pass one or both of the screening tests, and hence their data were not analyzed any further. Four children did pass the same-different task, but not the repetition screening, so they only participated in the sentence comprehension tasks. The children whose data did enter the analysis were nineteen children, 9;5 to 13 years old (M = 10;7, SD = 0;11), nine girls,

<sup>6</sup>One additional task, elicited production, was tested in these sessions as well, contributing to the overall variation in tasks. Data from this task will not be reported in this paper, because even the hearing participants in the ages we tested still do not master the production of object relatives, so we could not use this task to compare the abilities of the HI to the hearing children.

ten boys. This age range was chosen (a) since it is important to understand the effects of HI on language performance of school age children, and (b) because, according to previous studies, typically developing (TD) hearing children have acquired most of the syntactic structures by this age.

They had moderate to profound hearing loss, which had been diagnosed at a very young age or relatively late (age range of diagnosis: 0;4–9;0). Fourteen of the children used binaural hearing aids, two used two cochlear implants and three children used one cochlear implant and a hearing aid. Since we were interested in the effect of HI on language impairment in general, we did not distinguish between types of HI. Fourteen of the children went to a special school for children with HI, and the rest attended regular schools. Most of the participants performed all tasks; some of them performed only part of the tasks (see below for details), for organizational reasons. Subject files included no other disabilities, and all children came from a family that spoke only German and that used no sign language. All children were trained orally. All participants constantly wore their hearing aids or their CI(s). The details of each of the participants are presented in Appendix A.

The children in the control groups for these experiments were 96 monolingual typically developing children without language impairment or hearing disorder. They were 7;0–12;5 years old (M = 9;9). For organizational reasons, not all hearing children could perform all tasks; for more details, see the description of the results below.

### Statistical Analysis

For each task, we ran two types of analyses: group-level and individual-level. The group analysis was done to establish whether HI children in general performed differently from hearing children, that is, whether in general HI causes syntactic difficulties. We were specifically interested in whether in the group some sentence types were more often affected than others. Since it is well-known that there is quite some variation in the performance of the HI children and our group of HI participants was varied in several aspects as well (hearing aids vs CI, age of diagnosis, severity of hearing loss), an individual level analysis was done to further examine the range of abilities and problems in HI children. We were interested in how many and which children performed worse than the hearing group, and whether we could distinguish characteristics in for instance background, or exposure that may explain the difference between good performers and not-so-good performers. We were also interested in whether a scale of difficulty can be detected between the various structures.

We first ran a repeated measures ANOVA with the relevant sentence factors as within-subject variables (either sentence type, or word order and question type), group as between-subject variable and single subject accuracy as dependent variable. For this we used percentage correct so that we could use data from participants for whom we did not have complete data sets<sup>7</sup>. When this resulted in significant effects of group or interactions with group, we ran pairwise comparisons per sentence type to see which sentence type resulted in lower performance in the HI group. This was followed by post hoc paired t-tests within groups to compare performance on the different sentence types whenever a main effect of sentence type or an interaction of group with sentence type was found. For the comprehension tasks, we also established whether performance differed from chance or not using the binomial test. Finally, the performance of each individual participant with HI was compared to the control group in each sentence type using Crawford and Howell's t-test for the comparison of a single participant to a group (Crawford and Howell, 1998; Crawford and Garthwaite, 2002).
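The single-case comparison can be sketched as follows. The Crawford and Howell (1998) statistic treats the control mean and standard deviation as sample estimates rather than population parameters: t = (x − M) / (s · √((n + 1)/n)), with n − 1 degrees of freedom, where x is the case's score and M, s, and n describe the control sample. The code is a minimal sketch rather than the script used in this study. The function names and example scores are hypothetical, and the p-value is obtained by numerically integrating the t density so that only the Python standard library is needed.

```python
import math
import statistics


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """P(T >= |t|), via Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def crawford_howell(case_score: float, control_scores: list[float]):
    """Return (t, df, one-tailed p) for one case against a small control sample."""
    n = len(control_scores)
    mean = statistics.mean(control_scores)
    sd = statistics.stdev(control_scores)  # sample SD (n - 1 in the denominator)
    t = (case_score - mean) / (sd * math.sqrt((n + 1) / n))
    return t, n - 1, t_upper_tail(t, n - 1)


if __name__ == "__main__":
    # Hypothetical percentage-correct scores, for illustration only.
    controls = [85.0, 90.0, 95.0, 80.0, 100.0, 90.0, 85.0, 95.0]
    t, df, p = crawford_howell(55.0, controls)
    print(f"t({df}) = {t:.2f}, one-tailed p = {p:.4f}")
```

The p-value returned here is the tail probability in the direction of the observed difference; whether a one- or two-tailed criterion was applied in the individual comparisons is not stated in the text, so that choice is left to the caller.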

# EXPERIMENTS 1 AND 2: COMPREHENSION OF RELATIVE CLAUSES, PASSIVES, TOPICALIZATION AND WH-QUESTIONS

Sentence comprehension was assessed using two picture selection tasks: one assessed passive sentences and relative clauses compared to simple sentences (Experiment 1); the other (Experiment 2) assessed Wh-questions and topicalization structures in comparison to simple sentences. We used two different tasks to create more variation (and less boredom) for the participants, both regarding method and the pictures we used.

# Material Experiment 1: Comprehension of Relative Clauses and Passives

In the first comprehension task, the participant heard a sentence read by a native speaker of German, and saw two pictures on the same page, one above the other. In one picture the roles matched the sentence; in the other picture the roles were reversed (**Figure 1**). The participant was requested to point to the picture that correctly described the sentence.

The task included a total of 80 sentences for each participant, namely 20 simple SVO sentences, 20 subject relatives, 20 object relatives, and 20 passive sentences (see examples in **Table 1**). All verbs were agentive transitive. All the sentences were semantically reversible so that comprehension of the meaning of the words alone cannot determine the meaning of the sentence (namely, we did not use irreversible sentences like 'The girl is eating a pear', only reversible ones like 'The girl is kissing the grandmother').

Sentences were randomly ordered, and presented in 2 sessions of 40 sentences each (10 sentences of each type per session). The participants saw the 40 picture pairs twice, once in each session (20 picture pairs were presented with the subject relatives and object relatives, and 20 picture pairs with the SVO and passive sentences, four pictures were used in all four conditions and hence presented twice in each session). The correct picture in each pair was randomized both within a session (in each session half of the sentences matched the upper picture, and half matched the bottom picture), and between sessions (the matching picture in each pair was sometimes the top picture, and sometimes the bottom picture).
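The session layout described above (four sentence types, 20 items each, split over two sessions of 40 randomly ordered sentences, with the matching picture on top in half of the trials of a session) can be sketched as below. This is an assumed reconstruction for illustration: the function name, the trial dictionary, and the seed are hypothetical, and neither the reuse of picture pairs across conditions nor the between-session variation of the target position is modeled.

```python
import random
from collections import Counter

SENTENCE_TYPES = ("SVO", "subject relative", "object relative", "passive")


def build_sessions(seed: int = 0, items_per_type: int = 20, n_sessions: int = 2):
    """Split items evenly over sessions, shuffle order, and balance the target picture."""
    rng = random.Random(seed)
    per_session = items_per_type // n_sessions
    sessions = []
    for s in range(n_sessions):
        # Item numbers are placeholders; the real sentence-to-picture mapping is not modeled.
        trials = [
            {"type": t, "item": s * per_session + i}
            for t in SENTENCE_TYPES
            for i in range(per_session)
        ]
        rng.shuffle(trials)
        # Half of the trials in a session have the matching picture on top, half at the bottom.
        positions = ["top", "bottom"] * (len(trials) // 2)
        rng.shuffle(positions)
        for trial, pos in zip(trials, positions):
            trial["target"] = pos
        sessions.append(trials)
    return sessions


if __name__ == "__main__":
    for number, session in enumerate(build_sessions(seed=1), start=1):
        print(number, len(session), Counter(t["type"] for t in session))
```

Fixing the seed makes a given randomization reproducible across participants; a different seed per participant would give each child a different order.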

<sup>7</sup>For some children, not all items could be tested because they were either too tired, or for organizational reasons.

For relative clauses, both NPs were masculine, in order to make them unambiguously case-marked (see above). For simple SVO and the passive sentences we used NPs of all three grammatical genders: masculine, feminine, and neuter; 13 of the

comprehension of questions (see Hamburger and Crain, 1982 for the importance of felicity in assessing Wh-movement; see Friedmann and Novogrodsky, 2011 for a discussion of the felicity of this specific type of task for assessing comprehension of Wh-questions). For example, in **Figure 2**, a boy in a green shirt is pushing a man who is pushing a boy in an orange shirt. Here too, the experimenter, a native speaker of German, read out a sentence, while the participant saw the picture. The participant then had to point to the correct figure, or alternatively reply


### TABLE 1 | Types of sentences in Experiment 1.

### TABLE 2 | Types of sentences in Experiment 2.


orally, by naming the color (e.g., in **Figure 2**: "the green one", "the boy with the green shirt").

The test consisted of 108 sentences in 6 conditions, with 18 items in each condition. The sentence types included subject and object who and which questions and topicalized OVS (object–verb–subject) sentences, as well as simple SVO sentences for comparison (see **Table 2** for examples). Again, sentences were randomly ordered, and presented in 2 sessions of 54 sentences each (9 sentences of each type per session). The participant saw 18 pictures 6 times, three times in each session. The position of the correct actor in each sentence, left or right from the middle figure, was randomized within a session and between sessions.

For all sentences, both NPs were masculine. Using feminine or neuter NPs would make the who questions structurally ambiguous between subject- and object-question interpretation.

# RESULTS: COMPREHENSION OF RELATIVE CLAUSES, PASSIVES, TOPICALIZATION AND WH-QUESTIONS

# Experiment 1: Comprehension of Relative Clauses and Passives

The results of Experiment 1 are summarized in **Figure 3**. This task was performed by 19 HI children (age 9;3–13;0, mean 10;7), and by 53 hearing children (age 9;3–12;6, mean 10;8). We analyzed the data with a repeated measures ANOVA with the variables group and sentence type. This revealed a main effect for sentence type [F(3,210) = 100.21, p < 0.001], caused by overall lower performance on object relatives. We also found a main effect of group [F(1,70) = 7.13, p = 0.009], and an interaction of group and sentence type [F(3,210) = 3.55, p = 0.02], caused by lower performance of the hearing impaired group, who performed even worse on the object relatives. Post hoc pairwise comparisons (Bonferroni corrected) revealed that SVO sentences overall were comprehended better than each of the three other conditions (p < 0.01), and passives and subject relatives were comprehended better than object relatives (p < 0.01). A comparison of the performance of the two groups per sentence type (independent t-tests) showed that the HI children performed significantly poorer than the hearing control group on subject- and object relatives (p = 0.036 and p = 0.025, respectively). The hearing children, as a group, performed above chance level on all four conditions (one sample t-test p < 0.05), whereas the HI children, as a group, performed not differently from chance level on the object relatives (one sample t-test, p = 0.56), and above chance on the three other conditions.
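As a consistency check on the reported degrees of freedom, assume a standard mixed design with N = 19 + 53 = 72 participants, g = 2 groups, and k = 4 sentence types, and no sphericity correction (none is mentioned in the text):

$$
df_{\text{group}} = (g - 1,\; N - g) = (1,\; 70), \qquad
df_{\text{sentence type}} = \big(k - 1,\; (k - 1)(N - g)\big) = (3,\; 210).
$$

These values match the reported F(1,70) and F(3,210); under the same assumptions, the group by sentence type interaction also has (3, 210) degrees of freedom.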

The hearing group was divided into two age groups: 34 nine and ten year olds (aged 9;3–10;11, including 11 nine year olds and 23 ten year olds), and 19 eleven and twelve year olds (aged 11–12;6).

As shown in **Figure 3**, the comprehension of object relatives in German still develops within the ages we tested: the average performance of hearing children below age 11 was 52% correct, and it improved to 83% in the hearing children who were 11 years old and older. We therefore compared the individual HI participants to the hearing participants by age, comparing the 14 HI participants under the age of 11 to the 9–10 year old hearing children, and the 5 HI children who were older than 11 to the 11–12 year old hearing participants. Comparisons of each of the HI children to her/his age-matched hearing group and to chance level in each sentence type are summarized in **Table 3**. As summarized in **Table 3**, object relative was the structure that showed the most impaired performance in the HI group in this task, with 7 HI participants performing significantly below the matched hearing group, and almost all HI performing not above chance level.

These results suggest that some of the participants with HI have a considerable difficulty in the comprehension of object relatives, beyond the difficulty their hearing age-peers show. However, the results bear an additional type of important information: that not all types of movement are equally difficult

TABLE 3 | Number of HI participants performing significantly below the hearing group, and number of HI participants performing not above chance (at/below chance) in the two comprehension experiments.


Comparison to the control group using Crawford and Howell's t-test, all p < 0.05. Comparison to chance level using the binomial test, all p < 0.05.

subject and object relative. A star indicates a significant difference between the (age matched) hearing and HI groups.

for HI children. Firstly, the passive construction, which involves movement other than Wh-movement, seems to be normally comprehended by most HI children. Secondly, subject relatives, which involve Wh-movement but in which the theme does not cross the agent in its movement, are also comprehended relatively well. These findings thus suggest a selective deficit affecting object Wh-dependencies in children with HI.

# Experiment 2: Comprehension of Topicalization and Wh-Questions

The results of Experiment 2 are summarized in **Figure 4**. This task was performed by 16 HI children (aged 9;3–13;0, mean 10;6), and 18 hearing children whose ages were similar to the youngest children in the HI group (aged 9;3–10;8, mean 9;10).

We ran two separate repeated measures analyses, one to compare performance on the topicalized (OVS) sentences to the simple SVO sentences, and one to compare the four different question types to each other. The analysis of SVO and topicalized sentences revealed a main effect of sentence type [F(1,32) = 70.98, p < 0.001], as well as a main effect of group, showing that the hearing group outperformed the HI group [F(1,32) = 8.67, p = 0.006]. This was especially caused by lower performance on topicalized sentences, as indicated by the interaction of sentence type and group [F(1,32) = 8.39, p = 0.007]. One sample t-tests showed that the hearing group performed above chance for both conditions (p < 0.01), whereas the HI group performed at chance for the topicalized sentences (p = 0.45).

The analysis of the four question types revealed a main effect of question type (subject vs. object), object questions being overall more difficult than subject questions [F(1,32) = 20.84, p < 0.001], a main effect of Wh-phrase (who vs. which), caused by lower performance on which than on who questions [F(1,32) = 10.89, p = 0.002], and a marginally significant effect of group, caused by the HI children performing below the hearing children [F(1,32) = 3.79, p = 0.06]. There was a significant interaction of group and Wh-phrase [F(1,32) = 8.29, p = 0.007], caused by the relatively lower performance on which questions in the HI group. Finally, the interaction of question type (subject/object question) and Wh-phrase (which/who question) was marginally significant [F(1,32) = 4.03, p = 0.05], caused by a lower performance on which, compared to who

object questions. A comparison between the two groups of the performance in each sentence type (independent t-tests) showed that the HI children performed significantly poorer than the hearing control group on subject which questions (p = 0.004). Their lower performance on object which questions differed only marginally from that of the hearing group once Bonferroni correction was applied (p = 0.029), since some of the hearing children also still had problems with this condition.
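The marginal status of the object which comparison can be read off a Bonferroni-adjusted threshold. Assuming the correction is taken over the m = 4 question types compared between groups (the text does not state m explicitly):

$$
\alpha_{\text{adj}} = \frac{\alpha}{m} = \frac{0.05}{4} = 0.0125, \qquad 0.004 < 0.0125 < 0.029 .
$$

Under that assumption, the subject which difference (p = 0.004) survives the correction, whereas the object which difference (p = 0.029) is significant only at the uncorrected level.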

We followed up on the group effect and question type effect by comparing subject who with object who and subject which with object which questions per group with paired t-tests. This confirmed the first impression that for each group object questions were indeed significantly more problematic than subject questions (hearing children: subject vs. object who questions p = 0.006, subject vs. object which questions: p = 0.049; HI children: subject vs. object who questions p = 0.01, subject vs. object which questions: p = 0.001). Furthermore, for the HI group, performance on object which questions was lower than that on object who questions (p < 0.006). The hearing group performed above chance on all four questions (p < 0.05), as indicated by one sample t-tests. The HI group, however, performed above chance on subject (who and which) questions and on object who questions (p < 0.05), but, importantly, they performed at chance level on object which questions (p = 0.40).

Finally, we compared the performance of each individual HI child to the hearing group (using Crawford and Howell's t-test), and found that 6 of the 16 HI children performed lower than the hearing controls on the topicalized sentences and 10 of the 16 HI children performed significantly below the hearing control group on at least one question type, as shown in **Table 3**.

Interestingly, each of the six participants who performed below the hearing group on the topicalized structures was also below the hearing group on at least one type of which questions. Only seven HI children performed above chance on the which object questions, and 10 performed above chance on the who object questions.

Similarly to Experiment 1, these results show that some of the HI participants have problems in the comprehension of sentences that are derived by Wh-movement. Again, sentences in which the theme precedes the agent, as in topicalized sentences and in object questions seem to be especially problematic, supporting the suggestion that children with HI have a selective deficit affecting object Wh-dependencies. Object which questions were the most impaired type of question in the HI group.

# Overall Analysis of Difficult Structures in Comprehension in the Two Tasks According to the Individual Performance

An analysis of the two comprehension tasks that looks at the individual performance of each HI participant in each condition is also very telling with respect to the structures that are most difficult for children with HI. First, when we look at the structures in which the HI participants performed not better than chance level (at chance or below chance level, according to the binomial test, p < 0.05), we see four structures in which more than 2 HI participants were no better than chance: object relatives, topicalized OVS sentences, and the two types of object questions. All these structures include Wh-movement of the theme across the agent. A second analysis, which takes into account the number of HI participants who performed below the hearing group in each condition, indicates that these were also the most difficult structures according to this measure: more than 2 participants performed below the hearing group on object relatives, topicalized OVS sentences, and the two types of object questions. In this analysis, the subject Wh-movement sentences – subject relatives and the two types of subject question – were also found difficult. The two analyses are summarized in **Table 3** (shaded cells indicate the structures for which more than 2 HI children

performed below the hearing group and/or at or below chance).
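The chance-level criterion described above can be illustrated with an exact one-sided binomial test. The following is a minimal sketch only: the number of items per condition (20) and the guessing probability (0.5) are assumptions made for illustration, not values taken from this section, and the function names are hypothetical.

```python
from math import comb


def p_above_chance(correct, n_items, p_chance=0.5):
    """One-sided exact binomial p-value: P(X >= correct) under pure guessing."""
    return sum(
        comb(n_items, k) * p_chance**k * (1 - p_chance) ** (n_items - k)
        for k in range(correct, n_items + 1)
    )


def performs_above_chance(correct, n_items, p_chance=0.5, alpha=0.05):
    """True when the score is significantly better than guessing."""
    return p_above_chance(correct, n_items, p_chance) < alpha


# Hypothetical scores out of an assumed 20 items per condition.
for score in (10, 14, 15, 18):
    print(score, performs_above_chance(score, 20))
```

Under these assumed values, a child would need at least 15 of 20 correct responses to count as performing above chance; with a different item count or chance level the cut-off shifts accordingly.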

## EXPERIMENTS 3 AND 4. REPETITION OF RELATIVE CLAUSES, PASSIVES, TOPICALIZATION, WH-QUESTIONS, AND V-TO-C MOVEMENT

After we established that some of the participants with HI had considerable difficulties in understanding sentences with Wh-movement, but not passive sentences, which are derived by A-movement, we continued to examine the various types of movement using two sentence repetition tasks.

We were mainly interested in the following comparisons: to test whether Wh-movement is impaired, we tested several types of structures that are derived by Wh-movement: relative clauses and subject- and object- who and which questions. We first tested whether these were problematic by comparing each condition to the performance on the simple SVO condition and to the performance of the hearing age-matched control group. We then compared Wh-questions that are derived by Wh-movement but keep the canonical word order of the arguments (agent before theme) and do not involve a movement of a NP across a similar NP to their non-canonical counterparts (i.e., theme before agent), that is, subject questions were compared with object questions. We further compared repetition of sentences with Wh-movement (relative clauses, Wh-questions) with sentences with A-movement (passives), with sentences in which the verb moved to second position (V-to-C movement, AVSO), and with sentences without movement (simple SVO). To test whether the existence of embedding was the source of the difficulty rather than Wh-movement, we compared sentences with Wh-movement without embedding (Wh-questions) and sentences with Wh-movement and embedding (subject relative clauses). We also compared the effect of the position of the embedded relative clause within the sentence (de Villiers et al., 1979; Correa, 1995), by comparing right-branching subject relative clauses with center-embedded subject relative clauses. Finally, we also compared long vs. short which questions (i.e., which questions with or without an extra prepositional phrase). The sentences were divided over two tasks. This way we could alternate the repetition task with the other tasks and spread it over more sessions. Furthermore, the two repetition tasks differed with respect to the sentence types that were included (more details can be found in the next sections). The two tasks will be reported separately, since the control groups that participated in the tasks were not completely the same.

### Material

### Experiment 3: Repetition of Wh-Questions, Subject Relatives, and Passives

The sentences of the first repetition task included 10 subject questions and 10 object questions (half of each were who questions and half which questions; the who questions were created with an extra PP to match their length with the which questions), 10 passive sentences with a by phrase, and 16 subject relatives (half right-branching and half center-embedded). We also included 20 simple SVO sentences ending with a prepositional phrase as control sentences, both to provide a baseline for the participants' ability to repeat sentences without syntactic complexity and to include some easier and less frustrating sentences for the participants (see **Table 4** for examples).


### Experiment 4: Repetition of Wh-Questions, and V-to-C Movement Derived Sentences

The second repetition task consisted of long subject and object who and which questions (5 each, with an extra PP for all four question types), and simple canonical sentences that either started with an adverbial phrase, and hence included the verb in second sentential position, before the subject (AVSO, 10 items), or ended with an adverbial phrase (SVOA, also 10 items).

The sentences of the various types, 132 in total for the two tasks<sup>8</sup>, were presented in random order, in smaller blocks of 20–40 sentences, sometimes with several blocks per session over at least two sessions (for some children more sessions were needed, with a maximum of five sessions in total). All sentences were semantically reversible and included a transitive verb. In the center-embedding relative clauses, the matrix verbs were intransitive and the embedded verbs were transitive. Apart from the SVOA and AVSO sentences, the two NPs were of masculine gender in all sentences, to preclude (temporary) structural ambiguity (as in German only masculine determiners distinguish between nominative and accusative case). Since structural ambiguity was less of a problem in AVSO sentences and in order to create more variation in the material, in 5 of the SVOA and 5 of the AVSO sentences one NP was feminine or neuter.
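As a quick consistency check, the item counts given for the two repetition tasks can be tallied against the stated total of 132. The sketch below assumes that the 26 items of the three unreported conditions mentioned in footnote 8 are included in that total; the dictionary labels are ours.

```python
# Item counts as stated in the Material section and in footnote 8.
experiment_3 = {
    "subject questions (who and which)": 10,
    "object questions (who and which)": 10,
    "passives with a by phrase": 10,
    "subject relatives (right-branching and center-embedded)": 16,
    "simple SVO control sentences": 20,
}
experiment_4 = {
    "long who/which subject/object questions (4 types x 5)": 4 * 5,
    "AVSO sentences": 10,
    "SVOA sentences": 10,
}
# Assumption: the unreported conditions of footnote 8 count toward the total.
unreported_items = 26

total_exp3 = sum(experiment_3.values())
total_exp4 = sum(experiment_4.values())
grand_total = total_exp3 + total_exp4 + unreported_items
print(total_exp3, total_exp4, grand_total)
```

Under this reading, the reported conditions account for 106 items and the unreported ones for the remaining 26.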

All sentences consisted of 5 to 8 words; a perfect matching with respect to number of words was not possible. However, whenever there was an unavoidable difference, we made sure that sentences we expected to be relatively easier were longer than sentences that were expected to be relatively more complex, rather than vice versa. For example, the supposedly easier right-branching subject relatives consisted of eight words (the only 8-word condition), whereas the syntactically more complex center-embedded subject relatives consisted of six words.

# Procedure Experiments 3 and 4. Sentence Repetition

The experimenter read a sentence at a relaxed pace and with a normal (neutral) intonation, meaning that she did not use a specific focus intonation for object-first sentences, for instance, but questions were consistently produced with a question intonation. The participant was requested to count to 3 out loud and then to repeat the sentence as accurately as possible.

The counting was used to prevent rehearsal in the phonological loop (Baddeley, 1997; Friedmann and Grodzinsky, 1997), and hence to preclude phonological echoing. The whole session was audio-recorded and afterward transcribed for further analysis.

# Error Analysis Experiments 3 and 4. Sentence Repetition

In the analysis of errors in repetition, structural errors were scored separately from lexical and morphological errors that did not affect the structure and the thematic roles in the sentence. Phonological errors and other errors resulting from articulatory problems in which the target words and structure were still recognizable were ignored.

An error was classified as a **structural error** (see examples in 12) when the child changed the structure of the sentence, changed the thematic roles in it, or produced an ungrammatical sentence, for instance by using the same case twice (resulting in a sentence with two nominatives or two accusatives). **Lexical errors** were errors that included substitutions of a NP with another NP that did not appear in the target sentence (a singer → a dancer), a substitution of the verb with another verb with the same argument structure (like → love), a few omissions or additions of the definite article (the elephant → elephant), or a substitution or addition of the adverbial or prepositional phrase (yesterday → today).

(12) Examples of **structural errors** for the target sentence: Welchen Puma beisst der Leopard? WhichACC puma bites theNOM leopard?

Role reversal with a structure change (object question > subject question): Welch**er** Puma beisst d**en** Leopard? WhichNOM puma bites theACC leopard?

Role reversal without structure change (noun reversal): Welchen **Leopard** beisst der **Puma**? WhichACC leopard bites theNOM puma?

Noun doubling (one of the arguments receives both roles): Welchen Puma beisst der **Puma**? WhichACC puma bites theNOM puma?

Case error (two nominatives): Welch**er** Puma beisst der Leopard? WhichNOM puma bites theNOM leopard?

As can be seen in (12), some lexical substitutions were indicative of a problem with the thematic roles of the sentence, and were hence counted as structural errors. These included substitution of one of the NPs in the sentence with the other NP, i.e., noun doubling, which yields a sentence in which one of the NPs appears in both roles (which puma does the leopard bite → which puma does the puma bite), and reversals (→ which leopard does the puma bite?).
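The four structural error types in (12) can be told apart from the case marking and the nouns alone. The sketch below is a simplified illustration of that logic, not the scoring procedure that was actually used: the `Question` representation, the function name, and the precedence among the checks are our assumptions, and it covers only two-argument Wh-questions.

```python
from typing import NamedTuple


class Question(NamedTuple):
    wh_case: str   # case marked on the Wh-phrase: "NOM" or "ACC"
    wh_noun: str   # noun inside the Wh-phrase
    np_case: str   # case marked on the second NP
    np_noun: str   # noun of the second NP


def classify(target, response):
    """Label a repetition with one of the structural error types in (12)."""
    if response == target:
        return "correct"
    # Assumed precedence: a doubled case is checked before a doubled noun.
    if response.wh_case == response.np_case:
        return "case error"
    if response.wh_noun == response.np_noun:
        return "noun doubling"
    cases_swapped = (response.wh_case, response.np_case) == (target.np_case, target.wh_case)
    nouns_swapped = (response.wh_noun, response.np_noun) == (target.np_noun, target.wh_noun)
    if cases_swapped and not nouns_swapped:
        return "role reversal with structure change"
    if nouns_swapped and not cases_swapped:
        return "role reversal without structure change"
    return "other"  # e.g., lexical substitutions, which this sketch does not model


# Target: Welchen Puma beisst der Leopard?
target = Question("ACC", "Puma", "NOM", "Leopard")
print(classify(target, Question("NOM", "Puma", "ACC", "Leopard")))
print(classify(target, Question("ACC", "Leopard", "NOM", "Puma")))
```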

Finally, **morphological errors** that did not affect the thematic grid of the sentence and did not pertain to the syntactic structure were counted separately from the structural errors and grouped together with lexical errors. These were mainly gender errors or number errors (changing a singular NP into a plural), and some instances of an accusative that was changed into a dative case (Wem beisst der Leopard → whoDAT does the leopard bite). This latter error type was the only case error that did not count as a structural error, since a confusion of accusative and dative in our task did not affect the overall structure, and crucially did not affect the assignment of either syntactic or semantic roles, since both are clearly objective case.

<sup>8</sup>Three additional conditions (topicalized sentences and two types of object relatives) with a total of 26 items were initially included in the task. These will not be reported here, because even some of the 11-year-old hearing children still made errors in repeating them.

# RESULTS: REPETITION OF SUBJECT RELATIVES, WH-QUESTIONS, PASSIVES, TOPICALIZATION, AND V-TO-C MOVEMENT DERIVED SENTENCES

# Experiment 3: Repetition of Passives, Topicalization, and Wh-Questions

The results presented in **Figures 5** and **6** and in the analysis below include only structural errors, whereas sentences that were repeated only with lexical and/or morphological errors were scored as correct repetitions for this analysis.

**Figure 5** shows the results of the first sentence repetition task. This task included simple SVO sentences (with an extra PP to match for number of words), passives (with a by phrase), right-branching and center-embedded subject relatives, as well as short subject and object who and which questions. This task was performed by 15 HI children (age 9;7–13;0, mean 10;8), and 47 age-matched hearing children (age 9;7–12;6; mean 10;10).

<sup>∗</sup>The HI group performed significantly below the hearing group, in an independent t-test using FDR correction for multiple comparisons (Benjamini and Hochberg, 1995).

To analyze these data, we first ran a repeated measures analysis with group (hearing vs. HI) and sentence type as variables. This revealed a main effect of sentence type [F(7,420) = 14.42, p < 0.001], and a main effect of group [F(1,60) = 12.59, p = 0.001]. An interaction of sentence type and group was also found [F(7,420) = 6.09, p < 0.001]. To follow this up, we compared the performance of the two groups on each sentence type. This revealed that the HI group performed significantly worse on SVO sentences, who and which object questions, and center-embedded subject relatives (t-tests, p < 0.05, see **Table 5**). No difference between groups was found for the who and which subject questions and the right-branching subject relatives.
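The FDR correction of Benjamini and Hochberg (1995) applied to these group comparisons is a step-up procedure over the sorted p-values. The following is a minimal sketch of that procedure; the p-values in the example are hypothetical and are not the ones underlying **Table 5**.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for eight per-condition comparisons.
example = [0.001, 0.205, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074]
print(benjamini_hochberg(example))
```

Because the rule looks for the largest qualifying rank, a comparison can be rejected even when its own p-value exceeds its rank-specific threshold, as long as a larger p-value further down the list meets its threshold.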

Finally, we ran repeated measures per group with post hoc pairwise comparisons (Bonferroni corrected) to see which conditions were most problematic in each group. This revealed a significant main effect of sentence type for the hearing children [F(7,322) = 3.79, p = 0.001]. Pairwise comparisons showed that passive sentences were significantly easier than object who and which questions (p < 0.05). For the HI group, we also found a main effect of sentence type [F(7,98) = 10.41, p < 0.001]. Pairwise comparisons revealed that SVO, passive sentences, and subject who questions, as well as right-branching subject relatives were repeated better than object who and which questions (all comparisons p < 0.05, Bonferroni corrected, see Appendix B).
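The degrees of freedom reported for these analyses follow from the group sizes and the number of sentence types. The sketch below reproduces them under two assumptions that are ours: that there were eight sentence types (SVO, passives, two types of subject relatives, and four types of questions, as listed in the task description) and that no sphericity correction was applied to the degrees of freedom.

```python
def within_factor_df(n_participants, n_groups, n_levels):
    """(df_effect, df_error) for a within-subject factor, uncorrected."""
    df_effect = n_levels - 1
    df_error = (n_participants - n_groups) * (n_levels - 1)
    return df_effect, df_error


n_hi, n_hearing = 15, 47
n_sentence_types = 8  # assumption, counted from the task description

# Both groups together (mixed design with two groups).
print(within_factor_df(n_hi + n_hearing, 2, n_sentence_types))
# Each group analyzed separately.
print(within_factor_df(n_hearing, 1, n_sentence_types))
print(within_factor_df(n_hi, 1, n_sentence_types))
# Error df of the between-group main effect: N minus number of groups.
print(n_hi + n_hearing - 2)
```

These values correspond to the F(7,420), F(7,322), F(7,98), and F(1,60) statistics reported for Experiment 3.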

An analysis of the performance of each individual HI participant compared with the hearing group revealed that object questions were difficult also at the individual level, and were more difficult than the parallel subject questions. As summarized in **Table 6**, the structures on which the performance of the HI children deviated most from that of the control group, namely, those on which the most HI children performed below the age-matched hearing children, were object who questions, where 7 HI children had a lower performance than the hearing children, and object which questions, where 5 HI children were below the controls (there were fewer HI children below the controls on the parallel subject who and which questions). The below-control performance of some HI children on SVO and passive sentences probably resulted from the ceiling performance of the hearing children, which made even a single error significantly below the hearing group. There was considerable overlap between the HI children who performed significantly below the controls in the various constructions: seven HI children performed below the hearing controls on at least two conditions (4 of them on 4 or more conditions), and only three showed impaired performance on only one condition (one of them was very close to the cut-off point in three additional conditions, so he was probably impaired, and one made only a single error in the SVO condition, which qualified as significantly below the controls, but he was probably unimpaired).

### Error Analysis Experiment 3

As can be seen in **Table 7**, most of the structural errors that the children made relate to syntactic/semantic role assignment. The HI children made many case errors when they tried to repeat Wh-questions. These errors resulted in an ungrammatical sentence with either two nominatives or two accusatives. Importantly, such errors occurred almost exclusively when the HI children tried to repeat an object question, and not when they tried to repeat a subject question. Other errors relating to the syntactic/semantic roles are head doublings or reversals, as well as canonization, which means that in repeating an object-first sentence, a child produces a grammatical (but non-target) subject-first sentence. Interestingly, some of the errors on the center-embedded subject relatives are changes into right-branching subject relatives (these are the word order errors in the center-embedded subject relatives in **Table 7**). The few word order errors that occurred in the canonical-order sentences (SVO, subject questions, and right-branching subject relatives), 5 errors in total in these structures, were object-first sentences. Errors in the Wh-word consisted of omission of the Wh-word and use of a full NP instead, or use of who instead of which or vice versa. Other errors consisted of omissions of one of the DPs, fragments, or, in subject relatives, omission of one of the verbs.

To summarize, HI children performed worse on the repetition task than the hearing children. Interestingly, as we saw also in the comprehension studies, not all movement-derived sentences were equally problematic. Passive sentences caused relatively few problems, and the performance of the HI group in repeating them was very similar to their repetition of simple SVO sentences, although it has to be acknowledged that there was a group difference for the SVO sentences, which can be explained by the ceiling performance of the hearing children (as we argued above). In contrast, object questions, which are derived by Wh-movement, were especially difficult. These problems seem to be caused by the fact that in object questions the theme is moved over the agent of the sentence. Subject who and which questions, which involve Wh-movement but in which the theme follows the agent, did not cause repetition problems for the HI group. Furthermore, most errors on the object questions were related to syntactic/semantic role assignment. Finally, center-embedded subject relatives, but not right-branching subject relatives, were problematic for the HI children.

TABLE 6 | Repetition Experiments 3 and 4: number of HI participants performing significantly below the hearing group.


Comparison to the control group using Crawford and Howell's t-test, all p < 0.05.

TABLE 7 | Experiment 3: structural errors in repetition, number of errors per sentence type.


# Experiment 4: Repetition of Wh-Questions and V-to-C Movement Derived Sentences

The results of the second sentence repetition task, which compared Wh-questions (long subject and object who and which questions) and AVSO sentences to simple sentences, are presented in **Figure 6**. This task was performed by 11 HI children (age 9;11–13;0, mean 11;0), and 9 hearing children in the age range of the youngest HI participants (age 9;11–10;8, mean 10;3).

We ran two separate repeated measures analyses, one to examine verb movement, by comparing the performance on SVOA and AVSO sentences, and one to examine Wh-questions, by comparing the four different question types to each other (with two variables: question type, i.e., subject or object question, and Wh-phrase, i.e., who or which). The verb-movement analysis revealed a significant effect of sentence type [F(1,18) = 13.14, p = 0.002], an interaction of group and sentence type [F(1,18) = 8.73, p = 0.008], and a marginally significant difference between the two groups, caused by overall lower performance of the HI children [F(1,18) = 4.27, p = 0.054]. The interaction was caused by a lower performance of the HI group on the AVSO sentences, but not on the SVOA sentences (as indicated by post hoc independent t-tests, SVOA: t(18) = 1.05, p = 0.31, and AVSO: t(18) = 2.54, p = 0.02, respectively). The analysis of the Wh-questions resulted in a main effect of question type, with subject questions repeated correctly significantly more often than object questions [F(1,18) = 18.77, p < 0.001]. A main effect of Wh-phrase was also found, caused by significantly more correct repetitions for who than for which questions [F(1,18) = 15.89, p < 0.001], as well as an interaction of question type and Wh-phrase, caused by relatively fewer correct repetitions for object which questions [F(1,18) = 11.62, p = 0.003]. No main effect of group and no interactions with group were found.

The comparison of the performance of each HI individual with the hearing group is summarized in **Table 6**. It indicates that 2 of the 11 HI children performed below the hearing control group on the SVOA sentences, and 5 were below the hearing group on AVSO sentences; 3 children performed below the hearing children on the which object questions, 3 on the which subject questions, 2 on the who object questions, and one on the who subject questions (all p < 0.05, Crawford and Howell's t-test). The children who performed significantly lower on many of the conditions (6 or 7) of the first repetition task also performed poorly in this task.
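Crawford and Howell's t-test compares a single case with a small control sample, treating the control mean and standard deviation as sample statistics. The sketch below implements the standard formula, t = (x − M) / (SD · √((n + 1)/n)) with n − 1 degrees of freedom, and obtains a one-tailed p-value by numerical integration of the t density. The control scores in the example are hypothetical; they were chosen to show how a near-ceiling control group can make a single error come out as significantly below the controls.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev


def crawford_howell_t(case_score, control_scores):
    """Return (t, df) for one case against a control sample.

    Undefined when all control scores are identical (standard deviation 0).
    """
    n = len(control_scores)
    t = (case_score - mean(control_scores)) / (stdev(control_scores) * sqrt((n + 1) / n))
    return t, n - 1


def t_pdf(x, df):
    """Density of Student's t distribution."""
    return gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2)) * (1 + x * x / df) ** (-(df + 1) / 2)


def lower_tail_p(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on the density from 0 to |t|."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 - area if t < 0 else 0.5 + area


# Hypothetical data: 9 controls, 8 at ceiling (10/10) and one with a single error.
controls = [10, 10, 10, 10, 10, 10, 10, 9, 10]
t, df = crawford_howell_t(9, controls)  # case child with one error
print(round(t, 3), df, lower_tail_p(t, df) < 0.05)
```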

### Error Analysis Experiment 4

The error analysis on the second repetition task (see **Table 8**) revealed that most errors on object questions can again be connected to problems with syntactic/semantic role assignment: canonization errors (changing an object question to a subject question), case errors, and noun doublings or reversals. The canonization error of the AVSO sentence involved a change into an SVO sentence. Errors with the Wh-word consist of omission of the Wh-word and use of full NP instead, or use of who instead of which or vice versa. Other errors consisted of omission of one of the arguments, the verb, or an otherwise fragmentary response.



This sentence repetition task, like the previous repetition task and similarly to the comprehension tasks, indicated that the children with HI had difficulties in structures that are derived by Wh-movement, especially when the theme precedes the agent, i.e., in object questions. Which questions were more problematic than who questions. Whereas the HI children repeated sentences that involved a movement of the verb to second position (AVSO sentences) less well than the hearing children, their performance on the AVSO sentences was still better than their performance on the object questions.

Experiment 4 showed partially different results than Experiment 3, in that the hearing children did not perform very well on the object which questions yet. This may have been caused by the fact that the hearing children in this task were overall younger (up to age 10;8) than the children who participated in Experiment 3 (up to age 12;6). This, combined with the fact that the which object questions in Experiment 4 were slightly longer (since we had added a prepositional phrase), may have caused their lower performance, which then resulted in the absence of the interaction.

Nevertheless, the findings of the repetition tasks join those of the comprehension tasks in indicating that the HI children show a selective deficit affecting object Wh-dependencies.

### INDIVIDUAL PERFORMANCE PATTERNS IN ALL FOUR TASKS

The comparison of each individual HI participant to the hearing group (as tested with the Crawford t-tests) for the two comprehension and repetition tasks revealed that almost all HI children had problems in at least some comprehension or repetition of movement-derived sentences. We classified the children, according to the comparison of each of them to the control group, into children with normal performance, almost-normal performance, mild impairment, or severe impairment.

Children with normal performance performed below the hearing group on one condition at most, and with a maximum of 2 errors on this condition, indicating that performance was still close to the normal hearing performance. Children with almost-normal performance performed below the hearing group on 2 of the movement conditions, and performed well (above chance performance) on all the other conditions. Children with a mild impairment performed clearly below the hearing group on one or two conditions, and performed at (or below) chance on this and/or on two other conditions. The severely impaired group performed significantly below the hearing group on at least six conditions (for those children who performed all tasks; or three conditions for the children who performed only two of the four tasks).

This way, our group of HI children consisted of only three HI children whose syntactic performance was within the normal range (participants 1, 8, and 9), and one HI child with near-normal performance (23). The remaining 15 HI children had a syntactic impairment: six HI children (2, 3, 4, 6, 7, and 18) who were tested on both repetition and comprehension were severely impaired in several conditions, and the four children who were only tested on comprehension (11, 12, 16, and 17) were all impaired on at least three conditions. Five additional participants had a mild syntactic impairment (5, 21, 22, 24, and 26).

Furthermore, a Guttman Scale (Guttman, 1944, 1950)<sup>9</sup> was found in the comprehension of Wh-movement and passives, suggesting a ranking of impairment of the two structures: the two children who failed to understand passives also had considerable problems in understanding object Wh-movement sentences (i.e., object relatives, topicalized sentences, and object questions), and there were children who failed only on Wh-movement derived sentences, but not on passives. That is, there were no children who failed on passives but had no problems with object Wh-movement. These results show that not all types of movement result in the same difficulty in HI children, and that a deficit in passives is more severe than a deficit in Wh-movement alone, and involves a deficit in Wh-movement as well.
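The scale property described here amounts to a simple check on each child's pass/fail pattern: once a structure ranked as easier is failed, no structure ranked as harder may be passed. The sketch below illustrates this check; the participant labels and outcomes are hypothetical and do not reproduce the data of the study.

```python
def guttman_violations(results, items_easy_to_hard):
    """Return the participants whose pass/fail pattern breaks the scale.

    A consistent pattern never passes a harder item after failing an easier one.
    """
    violators = []
    for participant, passed in results.items():
        failed_easier = False
        for item in items_easy_to_hard:
            if not passed[item]:
                failed_easier = True
            elif failed_easier:
                violators.append(participant)
                break
    return violators


# Hypothetical pass/fail outcomes; passives are ranked as easier than object Wh-movement.
scale = ["passives", "object_wh_movement"]
results = {
    "child_a": {"passives": True, "object_wh_movement": True},
    "child_b": {"passives": True, "object_wh_movement": False},
    "child_c": {"passives": False, "object_wh_movement": False},
}
print(guttman_violations(results, scale))
```

A child who failed passives but passed object Wh-movement would be returned as a violation; the absence of such children is what the reported scale expresses.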

Finally, an analysis of the background of the subgroup of 4 children who showed normal or near-normal syntax revealed that they either received hearing aids during the first year of life (participant 9) or received hearing aids after age 5 years (participants 1, 8, 23). For the latter three there is no information that their hearing was impaired earlier, and therefore they may have been hearing normally during the first year of life and lost their hearing only at a later age. This suggests a pivotal role of early language exposure in the later development of syntactic abilities. It would be very interesting to see, in future research with a larger HI group with more detailed background data, whether input during the first year of life correlates with later syntactic performance.

# DISCUSSION

The aim of this study was to examine whether lack of sufficient exposure to language from birth affects German-speaking children with HI in their comprehension and repetition of sentences that are derived by syntactic movement. Our second aim was to compare three types of syntactic movement: Wh-movement, passives (A-movement), and verb movement (as in V-to-C movement). One of the reasons that make this study in German especially interesting is that it allowed us to examine whether German-speaking HI children can use case morphology for the correct interpretation and repetition of these movement-derived non-canonical sentences. German furthermore allowed us the direct comparison of different types of syntactic movement.

Our results indicated that most of the children with HI showed considerable difficulties in both sentence repetition and comprehension, and performed significantly more poorly than hearing children. Importantly, their difficulty was selective and did not span all sentence types. The comprehension of the HI children was significantly lower than that of the hearing group in subject and object relatives, topicalized sentences (OVS), and object who and which questions. In contrast, they performed similarly to the hearing children on simple SVO sentences, passives, and on subject who and subject which questions. The structures on which the groups differed were also the problematic ones according to the individual-level analysis of the number of HI children who performed worse than the controls and the number of HI children who performed no better than chance level. This indicates that it is not any type of syntactic movement that results in comprehension deficits, but specifically Wh-movement that is the problem.

<sup>9</sup>Louis Guttman initially suggested this approach for establishing a scale on dichotomous assertions. The idea was to examine whether items where a person either endorses or does not endorse a statement form a scale. If these statements form a scale, then a ranking of statements is possible, in which a person who endorses a certain statement would also endorse all statements that are ranked lower in the scale. We use this approach to examine whether the impairment on the various syntactic structures forms a scale for the population of hearing impaired children, using whether or not a person succeeded in a certain structure as the dichotomous measure.

The sentence repetition of the HI group showed a similar selective impairment. Their repetition of object who and object which questions, center-embedded subject relatives and AVSO sentences showed considerable impairment and resulted in performance that was significantly poorer than that of the hearing children, at the group level, and for most of the HI participants also at the individual level. In contrast, performance on subject who and subject which questions, as well as on simple SVOA sentences and right-branching subject relatives did not differ from their age-matched hearing group (object relatives and topicalized sentences were not reported for the repetition tasks, due to the low performance even in the hearing group).

It has to be noted though, that overall there was much variation in the HI group, as became clear by the analyses in which we compared the performance of each HI participant to the hearing group. Only three of the 19 HI children performed just like the hearing children. Most HI children quite clearly performed (much) poorer than their age-matched peers on more than one condition that involved syntactic movement. Some HI children even performed poorer than hearing children on the syntactically less complex conditions (e.g., passives, subject questions, or even on SVO, see **Table 3**). We will discuss these results in detail below, where we start with a discussion of the comprehension and repetition of the three different types of syntactic movement: Wh-movement, A-movement, and verb movement. Then we will compare the results on these structures, and finally we will discuss possible explanations for the variation in performance.

The poor performance that our German-speaking HI participants demonstrated in the comprehension and repetition of structures that are derived by Wh-movement of the object is in line with previous studies on HI syntax in English, Hebrew, Arabic, and Italian. Object relatives, object topicalized sentences, and object questions are all sentences that are derived by Wh-movement, in which the theme moves to a position before the agent, as explained in the introduction, and various studies demonstrated that children with HI are impaired in such structures (Quigley et al., 1974a,b; Geers and Moog, 1978; Berent, 1988, 1996; de Villiers, 1988; de Villiers et al., 1994; Friedmann and Szterman, 2006, 2011; Friedmann et al., 2010b; Friedmann and Haddad-Hanna, 2014; Szterman and Friedmann, 2014b, 2015; Volpato and Vernice, 2014). The poorer performance in center-embedded relative clauses compared with right-branching<sup>10</sup> ones is also in line with previous literature (see e.g., Quigley et al., 1974a; de Villiers et al., 1979, where the performance on right-branching object relatives was better than on center-embedded object relatives for hearing children).

Thus, sentences derived by Wh-movement seem to be especially impaired, in both comprehension and repetition, especially those sentences in which the theme moved to precede the agent of the sentence. German also allowed us to examine two further types of movement: A-movement, which occurs in passive sentences, and verb movement to second position. The results indicated that the deficit of the HI children did not extend to all types of syntactic movement. Starting with passive sentences, which are derived by A-movement, a type of movement that is shorter than Wh-movement (from object position to subject position, roughly), our findings indicate that they did not show the same impairment as did the sentences derived by Wh-movement: only 2 children failed to understand the passive sentences, and in fact, the performance on the comprehension of the passive sentences was just like the performance on the simple SVO sentences. These findings indicate that different types of movement are impaired differently in HI, and that not all types of movement are impaired in HI. They moreover show that the impairment is not merely a problem in non-canonical sentences. In passives too the theme comes before the agent, yet most HI children do not have problems comprehending and repeating those structures. Note that the few HI children who do have problems in passives always also have problems with Wh-movement, which does not hold the other way around. It seems to be the case that Wh-movement is impaired, especially so when the theme has moved over the agent, whereas A-movement (as seen in passives) seems to be relatively well comprehended by most HI children.

The relatively good comprehension of passives in German is in contrast to findings in earlier studies on English (Power and Quigley, 1973; Nolen and Wilbur, 1985; Schmitt, 1968), where the comprehension of passives was reported to be impaired. One explanation for this difference could be that the children in these studies had a more severe HI than our children. At least for the Power and Quigley and the Nolen and Wilbur data this seems to be the relevant difference between their and our participants.

Can this difference between English and German be ascribed to the fact that in German case-marking can indicate the agent and the theme in the sentence? Definitely not: in passive sentences the theme is actually marked as the subject of the sentence, and hence, if anything, the case-marking is liable to confuse the children, unlike in the Wh-movement structures.

<sup>10</sup>Both right-branching and center-embedded subject relatives show a form of Wh-movement and are hence in principle equally difficult, but some factor made the participants repeat the right-branching relatives better. One possibility is that whereas the comprehension of right branching subject relatives can benefit from a strategy according to which the first DP is the agent (Grodzinsky, 1995), for a center-embedded relative this would not be enough. If the trace of movement is not identified correctly, the main verb might not be identified as such, and the identification of its argument might become difficult, leading to impaired repetition.

But in effect, the picture that emerged from the HI performance was the opposite: they succeeded in passives and failed in Wh-movement, so it cannot be case-marking that saved their interpretation and identification of agent and theme.

Interestingly, other populations have been found to show similar difficulties in the comprehension and/or production of complex syntactic structures. People with agrammatic aphasia, for instance, show a severe deficit in the production of relative clauses, Wh-questions, and embedded structures (Friedmann, 2001, 2006; Ruigendijk et al., 2004), as well as in the comprehension of object Wh-questions, object relative clauses, topicalization structures, and (for some patients) passive sentences (see, among many others, Zurif and Caramazza, 1976; Grodzinsky, 2000; Friedmann and Shapiro, 2003). Children with a Specific Language Impairment (SLI), specifically children with syntactic SLI, show a significant specific deficit in the comprehension and production of sentences with movement dependencies such as Wh-questions and relative clauses (e.g., van der Lely, 1998; Bishop et al., 2000; Friedmann and Novogrodsky, 2004, 2011; Novogrodsky and Friedmann, 2006). Whereas the deficits in the three populations seem similar, it might still be that the underlying psycholinguistic and neural bases of the syntactic impairment are different in each of these populations. Szterman and Friedmann (2014b) in fact suggested that the HI population includes (at least) two patterns of impairment, one characterized by impairment in the CP layer of the syntactic tree, similar to theories regarding agrammatic aphasia (Friedmann and Grodzinsky, 2000; Friedmann, 2006), whereas other HI children show a deficit in movement that is more similar to the one evinced in syntactic SLI. A further question is whether the syntactic problems in HI children should be characterized as a deficit, or as a delay in development. This question cannot easily be answered on the basis of our data. However, the fact that some of our HI children who performed well below the hearing group were 11 years and older may be an indication of a more persistent impairment.
In other studies (e.g., Friedmann and Haddad-Hanna, 2014), even 21-year-old HI participants demonstrated the same types of syntactic deficits, suggesting that at least in some cases the syntactic deficit is a deficit rather than a delay. An interesting approach to the term "delay" may be the following: we may think of HI individuals as having a syntax that has been "stuck" at some stage of normal development.

Importantly, case did not seem to assist the participants in their interpretation of the object relatives, object questions, or topicalized OVS sentences: in all these sentence types, the agent is marked with nominative case and the theme with accusative case. Nevertheless, these were the structures that the participants found most difficult. Therefore, we can conclude that they could not utilize the case markers to assign the thematic roles in the sentence (see Friedmann et al., 2017, for a related discussion). In fact, 10 of the HI children performed below chance level, consistently reversing the roles of the agent and the theme in at least one of the Wh-movement sentence conditions. This indicates that not only do these children not use case for interpretation; they do not take it into the computation of thematic roles at all, and ascribe roles as if case did not exist in the sentence, on the basis of the linear order of the two NPs. Importantly, their inability to use case for interpretation is not a result of their not being able to hear the case markers: we were very careful to include in the study only children who performed well in the auditory discrimination task that included phrases with determiners and Wh-elements marked for nominative and accusative case (see General Method).
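
As an illustration of what "below chance level" means in a two-picture task, the following is a minimal sketch of a one-sided exact binomial test against a 50% guessing rate. The trial counts, the 0.05 criterion, and the function names are assumptions made for the example; they are not the scores or the statistical procedure reported in this study.

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """Probability of k or fewer correct responses out of n under guessing rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def below_chance(correct, trials, alpha=0.05):
    """True if a score this low is unlikely (one-sided) under pure guessing."""
    return binom_cdf(correct, trials) < alpha

# Hypothetical scores out of 20 binary trials (not data from the study).
print(below_chance(5, 20))  # low enough to suggest systematic role reversal
print(below_chance(8, 20))  # still compatible with guessing
```

A child who guesses should hover around 50%; a score significantly below that is what one would expect from systematically assigning the agent role to the first NP in sentences where the theme comes first.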

It is possible that what makes passive sentences easier for the HI children than sentences with object Wh-movement is the passive auxiliary (wird or wurde) and the by phrase (vom), which provide a signal beyond case that the sentence is not a simple SVO sentence.

The error types in the repetition task provide further support for the specific problems in sentences that are derived by Wh-movement of the theme across the agent: most errors relate in some way to semantic/syntactic role assignment. Either the sentence is canonized, that is, an object-first structure is changed into a subject-first structure, or the NPs are reversed. Case errors also occur frequently in the object structures, for instance repeating the sentence with two nominatives or two accusatives, suggesting that the child starts out with a nominative NP, then seems to realize the final NP was a nominative, or vice versa: s/he starts repeating the first NP correctly as an accusative and then in between 'canonizes' and ends up with a second accusative NP. The use of case markers, even when they map the thematic roles incorrectly, adds support for our conclusion that even though HI children cannot use case markers for comprehension, they do hear them, store them, and know their morphological distribution.

Interestingly, object which questions were the most impaired type of question in the HI group, both in sentence comprehension and in the second repetition task<sup>11</sup>. These object which questions seemed to cause similar comprehension problems as object relatives and topicalized sentences. The difference between object which questions and object who questions has been reported before, for HI children (Friedmann and Szterman, 2011), and also for other populations such as young hearing children (De Vincenzi et al., 1999; Avrutin, 2000; Friedmann et al., 2009; Biran and Ruigendijk, 2015) and children with S-SLI (Friedmann and Novogrodsky, 2011). It has been explained by the fact that in which, but not in who questions there are two lexical NPs (which NP and the subject NP), whereas in object who questions, there is only one lexical NP (the object NP), and a who phrase. The argumentation is that when the moved element (here: the object) is similar in structure to the element it moves over (here: the subject), the structure is more problematic in child language than if the moved element is less similar. In a which question, a full NP (welchen NP, 'whichACC NP') moves over the subject to the first position of the sentence, whereas in an object who question, only a Wh-phrase moves (wen, 'whoACC'). Similarly to which questions, also in topicalized sentences and in object relatives a full lexical NP moves over the subject NP, which may explain the similar performance on these three conditions (see Friedmann et al., 2009; Belletti et al., 2012; Biran and Ruigendijk, 2015 for a more detailed account)<sup>12</sup>. Apparently, what is difficult in normal language acquisition is even more difficult in acquisition for HI children.

<sup>11</sup>Note that the hearing children had a relatively low performance on object which questions in this task as well and therefore no interaction with group was found here. Most likely this was caused by the fact that this task was slightly more difficult, since the sentences were longer (caused by an additional PP, see General Method section). Importantly, still 5 out of 15 HI children performed significantly below the hearing control group.

Furthermore, the comprehension of subject relatives, as well as subject who and which questions, was overall less problematic. That is, when comparing the two groups, sentences with Wh-movement in which the theme did not cross the agent caused fewer comprehension problems than sentences with crossing of the theme over the agent. Nevertheless, the individual results show that these structures too caused some difficulties for some HI children. The problems of these four participants with subject Wh-movement sentences were most pronounced in the repetition task.

At least for some of the children, the reason for this pattern, which shows good comprehension of sentences with Wh-movement in which the agent remains before the theme, and poorer performance on the repetition of these structures, may be related to an impairment in the syntactic tree. Szterman and Friedmann (2014b) found that whereas the syntactic deficit of many of their HI participants was a deficit in Wh-movement, there were some children whose syntactic deficit was of a different sort: they had a deficit in the highest node of the syntactic tree, CP (similar to the impairment in agrammatism, see Friedmann, 2001, 2006; and to the suggestion of de Villiers et al., 1994, for all HI children). In German, every sentence that involves Wh-movement requires lexical items to reach the CP layer<sup>13</sup>. As a result, all Wh-movement sentences should be difficult to produce for individuals with CP impairment, both those in which the agent has moved (subject relatives, subject questions), and those in which the theme moved. In a sentence-picture matching comprehension task, a strategy that ascribes the agent role to the first NP can still guide the participant to point to the correct picture in Wh-movement sentences in which the agent moved and remained before the theme. However, object Wh-movement sentences would show impairment in such tasks. The story is different in repetition: here, an agent-first strategy cannot salvage sentence repetition, so the difficulty would manifest itself also in the repetition of subject relatives and subject questions. Supporting this view is the fact that all the children who failed to repeat subject Wh-movement (and not just object Wh-movement) also failed to repeat AVSO sentences, in which the verb moves to the CP layer (one of these children was 0.01 points above the threshold for the verb-movement structures).
Those five children that performed worse on the AVSO than the hearing group were even more impaired on the object which questions, which would be in line with the idea that these children not only have a problem in Wh-movement, but also in using the CP layer.
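
The comprehension side of this argument can be made concrete with a small sketch. It encodes, for a few of the sentence types discussed here, the surface order of the agent and the theme, and checks whether a listener who ignores case and always treats the first NP as the agent would pick the correct picture. The dictionary, the labels, and the function names are illustrative assumptions, not part of the test materials.

```python
# Surface order of thematic roles per sentence type (simplified illustration).
SURFACE_ORDER = {
    "simple SVO": ("agent", "theme"),
    "subject relative": ("agent", "theme"),
    "subject which question": ("agent", "theme"),
    "object relative": ("theme", "agent"),
    "object which question": ("theme", "agent"),
    "topicalized OVS": ("theme", "agent"),
}

def agent_first_guess():
    """Role assignment by linear order only: first NP = agent, second NP = theme."""
    return ("agent", "theme")

def strategy_succeeds(sentence_type):
    """True if the agent-first guess matches the actual roles of the two NPs."""
    return agent_first_guess() == SURFACE_ORDER[sentence_type]

for sentence_type in SURFACE_ORDER:
    outcome = "correct picture" if strategy_succeeds(sentence_type) else "reversed roles"
    print(f"{sentence_type}: {outcome}")
```

Under these assumptions the strategy succeeds exactly where the agent precedes the theme, which is why it can mask a CP impairment in picture matching but offers no help in repetition, where the sentence itself has to be rebuilt.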

Finally, whereas some HI children had severe syntactic difficulties, others performed much better in both comprehension and repetition. A possible explanation may be found in the age of implantation and/or the age of hearing loss of the children who showed better syntactic abilities. Of the 4 children who showed age-appropriate or near-normal syntax in our tasks, one received hearing devices at a very young age (at or before age 1;0), which points, very cautiously, to the importance of exposure to language during the first year of life. The three other HI participants were diagnosed with a HI quite late (5;0, 6;0, and 8;0), which may indicate that the hearing loss was not present from birth, so that they were actually exposed to language normally during the first year of life.

One interesting question is what exactly it is in the early exposure to language that is needed to acquire syntactic structures derived by Wh-movement. One possibility is that the phonological properties of the structures we tested are especially difficult for a hearing-impaired child to perceive during the early critical period: e.g., the German case markers or the prosody of topicalized sentences. However, this does not seem to be the case: the specific difficulty in Wh-movement structures is also typical of young hearing TD children and of hearing children with syntactic SLI (e.g., Friedmann and Novogrodsky, 2004, 2011; Friedmann et al., 2009; Biran and Ruigendijk, 2015). Additionally, whereas the perception of the case markings on the German determiners may be difficult, difficulties in parallel sentence structures are also apparent in languages in which topicalization and relative clauses are marked by word order and not by phonologically similar case marking on determiners, such as Italian, English, and Hebrew. Therefore, it does not seem to be difficulty in hearing specific parts of the sentence in early childhood that hampers the acquisition of Wh-movement structures, but rather something more general about exposure to language in the first year. What exactly the type of language input is that is required during the critical period for Wh-movement is currently an open and especially intriguing question.

Another possible account would ascribe the syntactic difficulty of HI children in specific structures to their difficulty with respect to, for example, perceiving the different case morphemes, despite normal syntactic abilities. The results, however, are not consistent with this approach either: good syntactic abilities with poor perception would end up in repeating sentences possibly with incorrect case relative to the target sentence, but the repeated sentences are then expected to be grammatical. Such an approach cannot account for the error pattern that our participants exhibited in sentence repetition, where, for example, they produced sentences with the same case twice.

We admittedly have a very small sample of participants, and hence our data can only be taken as a possible indication for future research. Nevertheless, these results are consistent with similar reports from larger groups of HI in Hebrew (Friedmann and Szterman, 2006; Szterman and Friedmann, 2014b, 2015), and Arabic (Friedmann and Haddad-Hanna, 2014), where the HI children who succeeded in syntactic tests were the ones who received hearing aids before the age of one year. Therefore, we may suggest that although early implantation or aiding does not guarantee good syntactic performance later on, as some HI children who were aided at a very young age still had considerable syntactic problems, early exposure to language input emerges as a necessary condition for the normal development of syntactic abilities.

<sup>12</sup>There seem to be some differences between German child language and for instance Hebrew child language in this respect, as discussed by Biran and Ruigendijk (2015).

<sup>13</sup>Although simple SVO can be produced with raising only up to IP, and we would not be able to tell the difference by the phonological string.

### ETHICS STATEMENT

This study was approved by Kommission für Forschungsfolgenabschätzung und Ethik der Universität Oldenburg. We informed all parents of our study, the methodology, and procedure. Only children whose parents gave informed consent (written, with signature) took part in this study. Furthermore, only children who wanted to participate, participated, and they could stop their participation at any moment when they wanted, without further consequences. Both parents and children were explicitly informed about this.

### AUTHOR CONTRIBUTIONS

ER and NF created the research question and design together. They adapted the Hebrew tests to German together. ER ran the tests and coded the results. ER and NF analyzed and interpreted the results together, wrote the paper together, and each did part of the statistical analysis.

### ACKNOWLEDGMENTS

The research was supported by the Joint German-Israeli Research Program grant GR01791 (Friedmann) and by the Israel Science Foundation (grant no. 1066/14, Friedmann), the German Israeli Foundation (Friedmann and Ruigendijk), grant 1113/2010, and by the Cluster of Excellence EXC 1077/1 "Hearing4all" funded by the German Research Council (DFG). We wholeheartedly thank Ronit Szterman and Rama Novogrodsky with whom the original tests in Hebrew were developed.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2017.00689/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Ruigendijk and Friedmann. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Perceptual Plasticity for Auditory Object Recognition

Shannon L. M. Heald\*†, Stephen C. Van Hedger\*† and Howard C. Nusbaum

Department of Psychology, The University of Chicago, Chicago, IL, United States

### Edited by:

Rachel Jane Ellis, Linköping University, Sweden

### Reviewed by:

Cyrille Magne, Middle Tennessee State University, United States

Jonathan B. Fritz, University of Maryland, College Park, United States

### \*Correspondence:

Shannon L. M. Heald, sheald@uchicago.edu

Stephen C. Van Hedger, svanhedger@uchicago.edu

†These authors are co-first authors.

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

> Received: 03 March 2016 Accepted: 26 April 2017 Published: 23 May 2017

### Citation:

Heald SLM, Van Hedger SC and Nusbaum HC (2017) Perceptual Plasticity for Auditory Object Recognition. Front. Psychol. 8:781. doi: 10.3389/fpsyg.2017.00781

In our auditory environment, we rarely experience the exact acoustic waveform twice. This is especially true for communicative signals that have meaning for listeners. In speech and music, the acoustic signal changes as a function of the talker (or instrument), speaking (or playing) rate, and room acoustics, to name a few factors. Yet, despite this acoustic variability, we are able to recognize a sentence or melody as the same across various kinds of acoustic inputs and determine meaning based on listening goals, expectations, context, and experience. The recognition process relates acoustic signals to prior experience despite variability in signal-relevant and signal-irrelevant acoustic properties, some of which could be considered as "noise" in service of a recognition goal. However, some acoustic variability, if systematic, is lawful and can be exploited by listeners to aid in recognition. Perceivable changes in systematic variability can herald a need for listeners to reorganize perception and reorient their attention to more immediately signal-relevant cues. This view is not incorporated currently in many extant theories of auditory perception, which traditionally reduce psychological or neural representations of perceptual objects and the processes that act on them to static entities. While this reduction is likely done for the sake of empirical tractability, such a reduction may seriously distort the perceptual process to be modeled. We argue that perceptual representations, as well as the processes underlying perception, are dynamically determined by an interaction between the uncertainty of the auditory signal and constraints of context. This suggests that the process of auditory recognition is highly context-dependent in that the identity of a given auditory object may be intrinsically tied to its preceding context. To argue for the flexible neural and psychological updating of sound-to-meaning mappings across speech and music, we draw upon examples of perceptual categories that are thought to be highly stable. This framework suggests that the process of auditory recognition cannot be divorced from the short-term context in which an auditory object is presented. Implications for auditory category acquisition and extant models of auditory perception, both cognitive and neural, are discussed.

Keywords: auditory perception, speech perception, music perception, short-term plasticity, categorization, perceptual constancy, lack of invariance, dynamical systems

# INTRODUCTION


Perceptual understanding of the auditory world is not a trivial task. We generally perceive discrete auditory objects, despite highly convolved auditory scenes that occur in the real world. For example, we can effortlessly perceive a siren in the distance and the hum of a washing machine while following a dialog in a movie that is underscored by background music. In part, recognizing these sound objects is aided by the spatial separation of the waveforms (see Cherry, 1953) as well as perceptual organization (see Bregman, 1990). However, each of our two basilar membranes is vibrated by the aggregation of the separate source waveforms striking our eardrums. Moreover, each of the sound objects, beyond being mixed in with an uncertain sound stage of other sound objects, may be distorted by the room, by motion, and further may be physically different from the generator of similar objects (washing machine, siren, or talker) we have encountered in the past. Simply stated, there is an incredible amount of variability in our auditory environments.

In speech, the lack of invariance between acoustic waveforms and their intended linguistic meaning became clear when the spectrograph was used to visually represent acoustic patterns in the spectro-temporal domain. Between talkers, there is variation in vocal tract size and shape that translates into differences in the acoustic realization of phonemes (Fant, 1960; Stevens, 1998). However, even local changes over time in linguistic experience (Cooper, 1974; Iverson and Evans, 2007), affective state (Barrett and Paus, 2002), speaking rate (Gay, 1978; Miller and Baer, 1983), and fatigue (Lindblom, 1963; Moon and Lindblom, 1994) can alter the acoustic realization of a given phoneme. Understanding the various sources of variability and their consequences for speech signals is important, as different sources of variability may evoke different adaptive mechanisms for their resolution (see Nygaard et al., 1995).

Beyond sources of variability that seemingly obstruct identification, there is clear evidence that idiosyncratic articulatory differences in how individuals produce phonemes result in acoustic differences (Liberman et al., 1967). Similar sources of variability hold for higher levels of linguistic representation, such as syllabic, lexical, prosodic, and sentential levels of analysis (cf. Heald and Nusbaum, 2014). Moreover, a highly variable acoustic signal is by no means unique to speech. In music, individuals have a perception of melodic stability or preservation of a melodic "Gestalt" despite changes in tempo (Handel, 1993; Monahan, 1993), pitch height or chroma (Handel, 1989), and instrumental timbre (Zhu et al., 2011). In fact, perhaps with a few contrived exceptions (such as listening to the same audio recording with the same speakers in the same room with the same background noise from the same physical location), we are not exposed to the same acoustic pattern of a particular auditory object twice. The question then becomes – how do we perceptually process acoustic variability in order to achieve a sense of experiential stability and recognizability across variable acoustic signals?

# REGULARITIES IN OUR ENVIRONMENT SHAPE OUR PERCEPTUAL EXPERIENCE

One possibility is that perceptual stability arises from the ability to form and use categories or classes of functional equivalence. It is a longstanding assertion in cognitive psychology that categorization serves to reduce psychologically irrelevant variability, carving the world up into meaningful parts (Bruner et al., 1956). In audition, some have argued that the categorical nature of speech perception originates in the architecture of the perceptual system (Elman and McClelland, 1986; Holt and Lotto, 2010). Other theories have suggested that speech categories arise out of sensitivity to the statistical distribution of occurrences of speech tokens (for a review, see Feldman et al., 2013).

Indeed, it has been proposed that the ability to extract statistical regularities in one's environment, which could occur by an unsupervised or implicit process, shapes our perceptual categories in both speech (cf. Strange and Jenkins, 1978; Werker and Tees, 1984; Kuhl et al., 1992; Werker and Polka, 1993; Saffran et al., 1996; Kluender et al., 1998; Maye and Gerken, 2000; Maye et al., 2002) and music (cf. Lynch et al., 1990; Lynch and Eilers, 1991, 1992; Soley and Hannon, 2010; Van Hedger et al., 2016). An often-cited example in speech research is that an infant's ability to discriminate sounds in their native language increases with linguistic exposure, while the ability to discriminate sounds that are not linguistically functional in their native language decreases (Werker and Tees, 1983). Further, work in speech development by Nittrouer and Miller (1997) and Nittrouer and Lowenstein (2007) has shown that the shaping of perceptual sensitivities and acoustic-to-phonetic mappings by one's native language experience occurs throughout adolescence, indicating that individuals remain sensitive to the statistical regularities of acoustic cues and how they covary with sound meaning distinctions throughout their development. Therefore, it seems that given enough listening experience, individuals are able to learn how multiple acoustic cues work in concert to denote a particular meaning, even when no single cue is necessary or sufficient.
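
To make the idea of unsupervised distributional learning concrete, the sketch below fits two categories to unlabeled tokens on a single acoustic dimension using expectation-maximization for a two-component Gaussian mixture. This is one simple way to formalize the idea and is not the model of any study cited above; the cue values (loosely, voice onset times in ms), the category parameters, and all names are assumptions chosen for illustration.

```python
import math
import random

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_two_categories(tokens, iterations=50):
    """EM for a 1-D, two-component Gaussian mixture; returns sorted (mean, sd, weight)."""
    means = [min(tokens), max(tokens)]            # start at the extremes of the input
    spread = (max(tokens) - min(tokens)) / 4 or 1.0
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: how strongly does each token belong to each category?
        resp = []
        for x in tokens:
            likes = [weights[k] * gaussian_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(likes) or 1e-300
            resp.append([like / total for like in likes])
        # M-step: re-estimate each category from its weighted tokens.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, tokens)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, tokens)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)
            weights[k] = nk / len(tokens)
    return sorted(zip(means, sds, weights))

# Hypothetical bimodal input: unlabeled tokens drawn from two underlying categories.
random.seed(0)
tokens = [random.gauss(10, 5) for _ in range(200)] + [random.gauss(60, 8) for _ in range(200)]
learned = fit_two_categories(tokens)
for mean, sd, weight in learned:
    print(f"category mean={mean:.1f} sd={sd:.1f} weight={weight:.2f}")
```

No token is ever labeled, yet the two recovered means fall close to the centers of the generating distributions, which is the sense in which category structure can be read off the statistics of the input alone.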

# SOUNDS IN A SYSTEM OF CATEGORIES

Individuals are not only sensitive to the statistical regularities of items that give rise to functional classes or categories, but to the systematic regularities among the resulting categories themselves. This hierarchical source of information, which goes beyond any specific individual category, could aid in disambiguating a physical signal that has multiple meanings. For both speech and music this allows the categories within each system to be defined internally, through the relationships held among categories of each system. This suggests that individuals possess categories that work collectively with one another as a long-term, experientially defined context to orchestrate a cohesive perceptual world (see Bruner, 1973; Billman and Knutson, 1996; Goldstone et al., 2012). In music, the implied key of a musical piece organizes the interrelations among pitch classes in a hierarchical structure (Krumhansl and Shepard, 1979; Krumhansl and Kessler, 1982).

Importantly, these hierarchical relations become strengthened as a function of listening experience, suggesting that experience with tonal areas or keys shapes how individuals organize pitch classes (cf. Krumhansl and Keil, 1982). These hierarchical relationships are also seen in speech among various phonemic classes, initially described as a featural system (e.g., Chomsky and Halle, 1968) and the distributional constraints on phonemes and phonotactics. For a given talker, vowel categories are often discussed as occupying a vowel space that roughly corresponds to the speaker's articulatory space (Ladefoged and Broadbent, 1957). Some authors have posited that point vowels, which represent the extremes of the acoustic and articulatory space, may be used to calibrate changes in the space across individuals, as they systematically bound the rest of the vowel inventory (Joos, 1948; Gerstman, 1968; Lieberman et al., 1972). Due to the concomitant experience of visual information and acoustic information (rooted in the physical process of speech sound production), there are also systematic relations that extend between modalities. For example, an auditory /ba/ paired with a visual /ga/ often yields the perceptual experience of /da/ due to the systematic relationship of place of articulation among those functional classes (McGurk and MacDonald, 1976). Given these examples, it is clear that within both speech and music, perceptual categories are not isolated entities. Rather, listening experience over time confers systematicity that can be meaningful. Such relationships may be additionally important to ensure stability in a system that is heavily influenced by recent perceptual experience, as stability may exist through interconnections within the category system. 
Long-term learning mechanisms may remove short-term changes that are inconsistent with the system, while in other cases, allow for such changes to generalize to the rest of the system in order to achieve consistency.

# STABILITY OF PERCEPTUAL SYSTEMS?

Despite clear evidence that listeners are able to rapidly learn from the statistical distributions of their acoustic environments, both for the formation of perceptual categories and the relationships that exist among them, few auditory recognition models include such learning<sup>1</sup>. Indeed, speech perception models such as feature-detector theories (e.g., Stevens and Blumstein, 1981), ecological theories (Fowler and Galantucci, 2005), motor theories (e.g., Liberman and Mattingly, 1985), and interactive theories (TRACE: e.g., McClelland and Elman, 1986; C-CuRE: McMurray and Jongman, 2011) provide no mechanism to update perceptual representations, and as such, implicitly assume that the representations that guide the perceptual process are more stable than plastic. While C-CuRE (McMurray and Jongman, 2011) might be thought of as highly adaptive by allowing different levels of abstraction to interact during perception, this model does not make claims about how the representations that guide perception are established either in terms of the formation of auditory objects or the features that comprise them. For example, the identification of a given vowel depends on the first (F1) and second (F2) formant values, but some of these values will be ambiguous depending on the linguistic context and talker. According to C-CuRE, once the talker's vocal characteristics are known, a listener can make use of these formant values. The listener can compare the formant values of the given signal against the talker's average F1 and F2, helping to select the likely identification of the vowel. Importantly, for the C-CuRE model, feature meanings are already available to the listener. While there is some suggestion that this knowledge could be derived from linguistic input and may be amended, the model itself has remained agnostic as to how and when this information is obtained and updated by the listener.
A similar issue arises in other interactive models of speech perception (e.g., TRACE: McClelland and Elman, 1986; Hebb-Trace: Mirman et al., 2006) and models of pitch perception (e.g., Anantharaman et al., 1993; Gockel et al., 2001).
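
The talker-relative comparison described above can be sketched in a few lines. The example recodes a token's F1 and F2 as deviations from the talker's average formants and then picks the nearest vowel prototype in that relative space. It is a deliberately simplified illustration and not the published C-CuRE implementation; the prototype values, the two talkers, the vowel labels, and the function names are all hypothetical. Note that the prototypes are simply given, which mirrors the point made in the text: the sketch says nothing about how such feature meanings are acquired or updated.

```python
from statistics import mean

# Hypothetical vowel prototypes, expressed as (F1, F2) deviations in Hz
# from a talker's average formant values.
PROTOTYPES = {"ih": (-100.0, 300.0), "eh": (50.0, 250.0)}

def relative_cues(f1, f2, talker_tokens):
    """Recode a token's formants relative to the talker's mean F1 and F2."""
    mean_f1 = mean(token[0] for token in talker_tokens)
    mean_f2 = mean(token[1] for token in talker_tokens)
    return f1 - mean_f1, f2 - mean_f2

def identify_vowel(f1, f2, talker_tokens):
    """Choose the prototype closest to the talker-relative cue values."""
    r1, r2 = relative_cues(f1, f2, talker_tokens)
    return min(
        PROTOTYPES,
        key=lambda v: (r1 - PROTOTYPES[v][0]) ** 2 + (r2 - PROTOTYPES[v][1]) ** 2,
    )

# Previously heard (F1, F2) tokens from two hypothetical talkers.
TALKER_A = [(400, 2100), (600, 1400)]   # average F1 = 500 Hz, F2 = 1750 Hz
TALKER_B = [(550, 2000), (750, 1400)]   # average F1 = 650 Hz, F2 = 1700 Hz

# The same acoustic token is identified differently depending on the talker.
print(identify_vowel(550, 2000, TALKER_A))
print(identify_vowel(550, 2000, TALKER_B))
```

With these made-up numbers, an F1 of 550 Hz is relatively high for talker A but relatively low for talker B, so one and the same token lands nearer a different prototype for each talker.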

While some auditory neurobiological models demonstrate clear awareness that mechanisms for learning and adaptation should be included in models of perception and recognition (Weinberger, 2004, 2015; McLachlan and Wilson, 2010; Shamma and Fritz, 2014), this is less true for neurobiological models of speech perception, which traditionally limit their modeling to perisylvian language areas (Fitch et al., 1997; Hickok and Poeppel, 2007; Rauschecker and Scott, 2009; Friederici, 2012), ignoring brain regions that have been implicated in category learning, such as the striatum, the thalamus, and the frontoparietal attention-working memory network (McClelland et al., 1995; Ashby and Maddox, 2005). Further, the restriction of speech models to perisylvian language areas marks an extreme cortical myopia of the auditory system, as it ignores the corticofugal pathways that exist between cortical and subcortical regions such as the medial geniculate nucleus in the thalamus, the inferior colliculus in the midbrain, the superior olive and cochlear nucleus in the pons, all the way down to the cochlea in the inner ear (cf. Parvizi, 2009). Previous work has shown that higher-level cognitive functions can reorganize subcortical structures as low as the cochlea. For example, selective attention or discrimination training has been demonstrated to enhance the spectral peaks of evoked otoacoustic emissions produced in the inner ear (Giard et al., 1994; Maison et al., 2001; de Boer and Thornton, 2008). Inclusion of the corticofugal system in neurobiological models of speech would allow the system, through feedback and top-down control, to adapt to ambiguity or change in the speech signal by selectively enhancing the most diagnostic spectral cues for a given talker or expected circumstance, even before it reaches perisylvian language areas.
Including the corticofugal system can thus drastically change how extant models, which are entirely cortical, explain top-down, attention-modulated effects in speech and music. While the omission of corticofugal pathways and brain regions associated with category learning is likely not intentional but a simplification for the sake of experimental tractability, it is clear that such an omission has large-scale consequences for modeling auditory perception, speech or otherwise. Indeed, the inclusion of learning areas and adaptive corticofugal connections in models of auditory processing requires a vastly different view of perception, in that even the

<sup>1</sup>Although for exceptions, see Tuller et al. (1994), Case et al. (1995), Mirman et al. (2006), Lancia and Winter (2013), and Kleinschmidt and Jaeger (2015).

earliest moments of auditory processing are guided by higher cognitive processing via expectations and listening goals. In this sense, it is unlikely that learning and adaptability can be simply grafted on top of current cortical models of perception. The very notion that learning and adaptive connections could be omitted (even for the sake of simplicity) is, however, in essence a tacit statement that the representations that guide recognition are more stable than plastic.

The notion that our representations are more stable than plastic may also be rooted in our experience of the world as perceptually stable. In music, relative perceptual constancy can be found for a given melody despite changes in key, tempo, or instrument. Similarly, in speech, a given phoneme can be recognized despite changes in phonetic environment and talker. This is not to say that listeners are "deaf" to acoustic differences between different examples of a given melody or phoneme, but that different goals in listening can arguably shape the way we direct attention (consciously or unconsciously) to variability among auditory objects. In this sense, listening goals organize attention, such that individuals orient toward cues that reflect a given parsing, and away from cues that do not (cf. Goldstone and Hendrickson, 2010). More recent work on change deafness demonstrates that changes in listening goals alter a participant's ability to notice a change in talker over a phone conversation (Fenn et al., 2011). More specifically, the authors demonstrated that participants did not detect a surreptitious change in talker during a phone conversation, but could detect the change if told to explicitly monitor for it. This suggests that listening goals modulate how we parse or categorize signals, in that these listening goals determine how attention is directed toward the acoustic variance of a given signal.

Perceptual classification or categorization here should not be confused with categorical perception (cf. Holt and Lotto, 2010). Categorical perception, classically defined in audition, refers to the notion that a continuum of sounds that differ along a particular acoustic dimension is not heard as changing continuously, but rather as shifting abruptly from one category to another (e.g., Liberman et al., 1957). As such, categorical perception suggests that despite changes in listening goals, individuals' perceptual discrimination of any two stimuli is inextricably linked to the probability of classifying these stimuli as belonging to different categories (e.g., Studdert-Kennedy et al., 1970). Categorization, conversely, refers to a particular organization of attention, wherein cues that are indicative of between-category variability are emphasized while cues that reflect within-category variability are deemphasized (Goldstone, 1994). Indeed, even within the earliest examples of categorical perception (a phenomenon that, in theory, completely attenuates within-category variability), there appears to be some retention of within-category discriminability (e.g., Liberman et al., 1957). English listeners can reliably rate some acoustic realizations of phonetic categories (e.g., "ba") as better versions than others (e.g., Pisoni and Lazarus, 1974; Pisoni and Tash, 1974; Carney et al., 1977; Iverson and Kuhl, 1995). Additionally, a number of studies have shown that not only are individuals sensitive to within-category variability, but also that this variability affects subsequent lexical processing (Dahan et al., 2001; McMurray et al., 2002; Gow et al., 2003). In music, the perception of pitch chroma categories among absolute pitch (AP) possessors is categorical in the sense that AP possessors show sharp identification boundaries between note categories (e.g., Ward and Burns, 1999). 
However, AP possessors also show reliable within-category differentiation when providing goodness judgments within a note category (e.g., Levitin and Rogers, 2005). Graded evaluations within a category are further seen in musical intervals, where sharp category boundaries indicative of categorical perception are also generally observed, at least for musicians (Siegel and Siegel, 1977). There is also evidence that within-category discrimination can exceed what would be predicted from category identification responses (Zatorre and Halpern, 1979). Indeed, Holt et al. (2000) have suggested that the task structure typically employed in categorical perception tasks may be what drives the within-category homogeneity that is characteristic of categorical perception. Another way of stating this is that listening goals defined by the task structure modulate the way attention is directed toward acoustic variance.
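The link between identification and discrimination described above can be made concrete with a small sketch. It assumes a logistic identification function and the label-only prediction commonly used in the categorical perception literature for ABX discrimination, in which predicted accuracy is 0.5 + 0.5(p1 − p2)², where p1 and p2 are the probabilities of assigning each stimulus to the same category. The continuum steps, boundary, and slope are hypothetical values chosen for illustration and are not taken from any study cited here.

```python
import math


def identification_prob(x, boundary=0.0, slope=2.0):
    """Probability of labeling stimulus x as category A (assumed logistic form)."""
    return 1.0 / (1.0 + math.exp(slope * (x - boundary)))


def predicted_abx_accuracy(p1, p2):
    """ABX accuracy expected if listeners could compare category labels only."""
    return 0.5 + 0.5 * (p1 - p2) ** 2


# Hypothetical seven-step continuum with the category boundary at 0.
continuum = [-3, -2, -1, 0, 1, 2, 3]
probs = [identification_prob(x) for x in continuum]

# Predicted accuracy for each pair of adjacent stimuli.
predicted = [predicted_abx_accuracy(p, q) for p, q in zip(probs, probs[1:])]

for (a, b), acc in zip(zip(continuum, continuum[1:]), predicted):
    print(f"steps {a:+d} vs {b:+d}: predicted ABX accuracy = {acc:.3f}")
```

Under these assumptions the prediction peaks for pairs that straddle the boundary and sits near chance for pairs deep within a category. Observed discrimination that exceeds such a prediction is the kind of evidence for retained within-category sensitivity discussed in this section.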

While there is clear evidence that individuals possess the ability to attend to acoustic variability, even within perceptual categories, it is still unclear from the demonstrations reported thus far whether listeners are influenced by acoustic variability that is attenuated by disattention due to their listening goals. More specifically, it is unclear whether the representations that guide perception are influenced by subtle, within-category acoustic variability, even if it appears to be functionally irrelevant for current listening goals. Even though there is ample evidence that perceptual sensitivity to acoustic variability is attenuated through categorization, this variability may nevertheless be preserved and, further, may be incorporated into the representations that guide perception. In this sense, putatively irrelevant acoustic variability, even if not consciously experienced, may still affect subsequent perception. For example, Gureckis and Goldstone (2008) have argued that the preservation of variability (in our case, the acoustic trace independent of the way in which the acoustics relate to an established category structure due to a current listening goal) allows for perceptual plasticity within a system, as adaptability can only be achieved if individuals are sensitive (consciously or unconsciously) to potentially behaviorally relevant changes in within-category structure. In this sense, without the preservation of variability, listeners would fail to adapt to situations where the identity of perceptual objects rapidly changes. Indeed, there is a growing body of evidence supporting the view that the preservation of acoustic variability can be used in service of instantiating a novel category. 
In speech, adult listeners are able to amend perceptual categories as well as learn novel perceptual categories not present in their native language, even when the acoustic cues needed to learn the novel category structure are in direct conflict with a preexisting category structure. Adult native Japanese listeners, who presumably become insensitive to the acoustic differences between /r/ and /l/ categories through accrued experience listening to Japanese, are nevertheless able to learn this non-native discrimination through explicit perceptual training (Lively et al., 1994; Bradlow et al., 1997; Ingvalson et al., 2012), rapid incidental perceptual learning (Lim and Holt, 2011),

as well as through the accrual of time residing in English-speaking countries (Ingvalson et al., 2011). Further, adult English speakers are able to learn the non-native Thai pre-voicing contrast, which functionally splits their native /b/ category (Pisoni et al., 1982) and to distinguish between different Zulu clicks, which make use of completely novel acoustic cues (Best et al., 1988).

Beyond retaining an ability to form non-native perceptual categories in adulthood, there is also clear evidence that individuals are able to update and amend the representations that guide their processing of native speech. Clarke and Luce (2005) showed that within moments of listening to a new speaker, listeners modify their classification of stop consonants to reflect the new speaker's productions, suggesting that linguistic representations are plastic in that they can be adjusted online to optimize perception. This finding has been replicated in a study that further showed that participants' lexical decisions reflect recently heard acoustic probability distributions (Clayards et al., 2008).

Perceptual flexibility also can be demonstrated at a higher level, presumably due to discernible higher-order structure. Work in our lab has demonstrated that individuals are able to rapidly learn synthetic speech produced by rule that is defined by poor and often misleading acoustic cues. In this research, no words ever repeat during testing or training, so that the learning of a particular synthesizer is thought to entail the redirection of attention to the most diagnostic and behaviorally relevant acoustic cues across multiple phonemic categories in concert (see Nusbaum and Schwab, 1986; Fenn et al., 2003; Francis et al., 2007; Francis and Nusbaum, 2009) in much the same way as learning new phonetic categories (Francis and Nusbaum, 2002). Given these studies, it appears that the process of categorization in pursuit of current listening goals does not completely attenuate acoustic variability.

Beyond speech, the representations that guide music perception also appear to be remarkably flexible. Wong et al. (2009) have demonstrated that individuals are able to learn multiple musical systems through passive listening exposure. This "bimusicality" is not merely the storage of two modular systems of music (Wong et al., 2011), though it is unclear whether early exposure (i.e., within a putative critical period) is necessary to develop this knowledge. In support of the notion that even adult listeners can come to understand a novel musical system that may parse pitch space in a conflicting way compared to Western music, Loui and Wessel (2008) have demonstrated that adult listeners of Western music are able to learn a novel artificial musical grammar. In their paradigm, individuals heard melodies composed using the Bohlen–Pierce scale – a musical system that is strikingly different from Western music, as it consists of 13 equally spaced steps within a 3:1 frequency ratio (the "tritave") as opposed to 12 equally spaced steps within a 2:1 frequency ratio (the octave). Nevertheless, after mere minutes of listening to 15 Bohlen–Pierce melodies that conformed to a finite-state grammar, listeners were able to recognize these previously heard melodies as well as generalize the rules of the finite-state grammar to novel melodies.
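To illustrate how differently the two systems parse pitch space, the following sketch generates equal-step scales under the standard equal-tempered definitions: 12 steps per 2:1 ratio for Western tuning and 13 steps per 3:1 ratio for the Bohlen–Pierce scale. The base frequency of 220 Hz and the function names are arbitrary choices for illustration. They are not taken from Loui and Wessel (2008).

```python
import math


def equal_step_scale(base_hz, ratio, steps):
    """Frequencies that divide the interval `ratio` into `steps` equal log steps."""
    return [base_hz * ratio ** (k / steps) for k in range(steps + 1)]


def step_in_cents(ratio, steps):
    """Size of one scale step in cents (1200 cents = one octave)."""
    return 1200.0 * math.log2(ratio) / steps


BASE_HZ = 220.0  # hypothetical starting pitch
western = equal_step_scale(BASE_HZ, 2.0, 12)
bohlen_pierce = equal_step_scale(BASE_HZ, 3.0, 13)

print(f"Western step size:       {step_in_cents(2.0, 12):.1f} cents")
print(f"Bohlen-Pierce step size: {step_in_cents(3.0, 13):.1f} cents")
print("First five Western pitches (Hz):      ", [round(f, 1) for f in western[:5]])
print("First five Bohlen-Pierce pitches (Hz):", [round(f, 1) for f in bohlen_pierce[:5]])
```

Because the Bohlen–Pierce step is roughly one and a half Western semitones, most of its pitches fall between Western note categories, which is why learning the system requires listeners to parse pitch space in a new way.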

Even within the Western musical system, adults display plasticity for learning categories thought to be unlearnable in adulthood. A particularly salient example of adult plasticity within Western music learning comes from the phenomenon of AP – the ability to name or produce any musical note without the aid of a reference note (see Deutsch, 2013 for a review). AP has been conceptualized as a rare ability, manifesting in as few as one in every 10,000 individuals in Western cultures (Bachem, 1955), though the mechanisms of AP acquisition are still debated. While there is some research arguing for a genetic predisposition underlying AP (e.g., Baharloo et al., 1998; Theusch et al., 2009), with even some accounts claiming that AP requires little or no environmental shaping (Ross et al., 2003), most theories of AP acquisition adhere to an early-learning framework (e.g., Crozier, 1997). This framework predicts that only individuals with early note naming experience would be candidates for developing AP categories. As such, previously naive adults should not be able to learn AP. This early-learning argument of AP has been further explained as a "loss" of AP processing without early interventions, either from music or language (i.e., tonal languages), in which AP is emphasized (cf. Sergeant and Roche, 1973; Deutsch et al., 2004). In support of this explanation, infants appear to process pitch both absolutely and relatively, though they switch to relative pitch cues when AP cues become unreliable (Saffran et al., 2005).

Yet, similar to how even "irrelevant" acoustic variability within speech is not completely attenuated, there is mounting evidence that most individuals (regardless of possessing AP) retain the ability to perceive and remember AP, presumably through implicit statistical learning mechanisms. For example, non-AP possessors are able to tell when familiar music recordings have been subtly shifted in pitch (e.g., Terhardt and Seewan, 1983; Schellenberg and Trehub, 2003), even if they are not able to explicitly name the musical notes they are hearing. These results suggest that the perception of AP is not an ability that is completely lost without the knowledge of explicit musical note category labels or with more advanced development of relative pitch abilities. As such, it is possible that adult listeners might be able to learn how musical note categories map onto particular absolute pitches. In support of this idea, most studies examining the degree to which AP can be trained in an adult population find some improvement after training, even after a single training session (Van Hedger et al., 2015). A few studies have even found improvements in absolute note identification such that post-training performance rivals that of an AP population who learned note categories early in life (Brady, 1970; Rush, 1989). These findings not only support the notion that most adults retain an ability to perceive and remember AP to some degree, but also that AP categories are, to an extent, trainable into adulthood.

Despite these accounts of AP plasticity within an adult population, one might still argue that the adult learning of AP categories represents a fundamentally different phenomenon than that of early-acquired AP, even if the behavioral note classifications from trained adults are, in some extreme cases, indistinguishable from that of an AP population who acquired note categories early in life. One reason to support this kind of dissociation between adult-acquired and early-acquired AP relates to the putative lack of plasticity that exists within an AP possessor who acquired note categories early in life. Specifically, note categories within an early-acquired AP

population are thought to be highly stable once established (Ward and Burns, 1999), only being alterable in very limited circumstances, such as through physiological changes to the auditory system as a result of aging (cf. Athos et al., 2007) or pharmaceutical interventions (e.g., Kobayashi et al., 2001). However, recent empirical evidence has demonstrated that even within this early-acquired AP population, there exists a great deal of plasticity in note category representations that is tied to particular environmental experiences. Wilson et al. (2012) reported reductions in AP ability as a function of whether an individual plays a "movable do" instrument (i.e., an instrument in which a notated "C" actually belongs to a different pitch chroma category, such as "F"), suggesting that nascent AP abilities might be undone through inconsistent sound-to-category mappings. Dohn et al. (2014) reported differences in note identification accuracy among AP possessors that could be explained by whether one was actively playing a musical instrument, suggesting that AP ability might be "tuned up" by recent musical experience.

Both of these studies speak to how particular regularities in the environment may affect overall note category accuracy within an AP population, though they do not speak to whether the structure of the note categories can be altered through experience once they are acquired. Indeed, one of the hallmarks of AP is not only being able to accurately label a given pitch with its note category (e.g., C#), but also being able to provide a goodness rating of how well that pitch conforms to the category (e.g., flat, in-tune, or sharp). Presumably, this ability to label some category members as better than others stems from either a fixed note-frequency association established early in life, or the consistent environmental exposure of listening to music that is tuned to a very specific standard (e.g., in which the "A" above middle C is tuned to 440 Hz). Adopting the first explanation, plasticity of AP category structure should not be possible. Adopting the second explanation, AP category structure should be modifiable and tied to the statistical regularities of hearing particular tunings in the environment. Our previous work has provided evidence in support of this second explanation – that is, the structure of note categories for AP possessors is plastic and dependent on how music is tuned in the current listening environment (Hedger et al., 2013). In our paradigm, AP possessors assigned goodness ratings to isolated musical notes. Not surprisingly, in-tune notes (according to an A = 440 Hz standard) were rated as more "in-tune" than notes that deviated from this standard by one-third of a note category. However, after listening to a symphony that was slowly flattened by one-third of a note category, the same participants began rating similarly flattened versions of isolated notes as more "in-tune" than the notes that were in-tune based on the A = 440 Hz standard. 
These findings suggest that AP note categories are held in place by the recent listening environment, not by a fixed and immutable note-frequency association that is established early in life. Overall, then, the past decade or so of research on AP has highlighted how this ability can be modified by behaviorally relevant environmental input that extends well into adulthood.
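To give a sense of the size of this manipulation, the sketch below converts a detuning in cents into a frequency. It assumes that "one-third of a note category" corresponds to one-third of an equal-tempered semitone (about 33 cents). This is an interpretive assumption for illustration, and the function name is hypothetical.

```python
def shift_by_cents(freq_hz, cents):
    """Return freq_hz shifted by `cents` (100 cents = one equal-tempered semitone)."""
    return freq_hz * 2.0 ** (cents / 1200.0)


A4_HZ = 440.0                      # tuning standard named in the text
ONE_THIRD_SEMITONE = 100.0 / 3.0   # assumed size of the detuning, in cents

flattened_a4 = shift_by_cents(A4_HZ, -ONE_THIRD_SEMITONE)
print(f"A4 flattened by one-third of a semitone: {flattened_a4:.2f} Hz")
```

Under this assumption, the flattened "A" lies only a few hertz below the 440 Hz standard, yet the findings described above indicate that exposure to such a shift was sufficient to change which notes AP possessors rated as in-tune.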

# CROSS-DOMAIN TRANSFER BETWEEN MUSIC AND SPEECH

These accounts of plasticity in auditory perception for both speech and music suggest that both systems may be subserved by common perceptual and learning mechanisms. Recent work exploring the relationship between speech and music processing has found mounting evidence that musical training improves several aspects of speech processing, though it is debated whether these transfer effects are due to general enhancements in auditory processing (e.g., pitch perception) vs. an enhanced representation of phonological categories. Hypotheses like OPERA (Patel, 2011) posit that musical training may enhance aspects of speech processing when there is anatomical overlap between networks that process the acoustic features shared between music and speech, when the perceptual precision required of musical training exceeds that of general speech processing, when the training of music elicits positive emotions, when musical training is repetitive, and when the musical training engages attention. Indeed, the OPERA hypothesis provides a framework for understanding many of the empirical findings within the music-to-speech transfer literature. Musical training helps individuals to detect speech in noise (Parbery-Clark et al., 2009), presumably through strengthened auditory working memory, which requires directed attention. Musicians are also better able to use non-native tonal contrasts to distinguish word meanings (Wong and Perrachione, 2007), presumably because musical training has made pitch processing more precise. This explanation can further be applied to the empirical findings that musicians are better able to subcortically track the pitch of emotional speech (Strait et al., 2009).

Recent work has further demonstrated that musical training can also influence the categorical perception of speech. Bidelman et al. (2014) found that musicians showed steeper identification functions of vowels that varied along a categorical speech continuum, and moreover these results could be modeled by changes at multiple levels of the auditory pathway (both subcortical and cortical). In a similar study, Wu et al. (2015) found that Chinese musicians were better able to discriminate within-category lexical tone exemplars in a categorical perception task compared to non-musicians, though, unlike Bidelman et al. (2014), the between-category differentiation between musicians and non-musicians was comparable. Wu et al. (2015) interpret the within-category improvement among musicians in an OPERA framework, arguing that musicians have more precise representations of pitch that allow for fine-grained distinctions within a linguistic category.

Finally, there is emerging evidence that certain kinds of speech expertise may enhance musical processing, demonstrating a proof-of-concept of the bidirectionality of music-speech transfer effects. Specifically, non-musician speakers of a tonal language (Cantonese) showed auditory processing advantages in pitch acuity and music perception that non-musician speakers of English did not show (Bidelman et al., 2013). While there is less evidence supporting this direction of transfer, this is perhaps not surprising as speech expertise is ubiquitous in a way music

expertise is not. Thus, transfer effects from speech to music processing are more constrained, as one has to design a study in which (1) there exist substantial differences in speech expertise, and (2) these differences in expertise theoretically relate to some aspect of music processing (e.g., pitch perception).

How can these transfer effects between speech and music be interpreted in the larger context of auditory object plasticity? Given the evidence across speech and music that recent auditory events profoundly influence the perception of auditory objects within each system, it stands to reason that recent auditory experience from one system of knowledge (e.g., music) may influence subsequent auditory perception in the other system (e.g., speech), assuming there is overlap among particular acoustic features of both systems. Indeed, there is some empirical evidence to at least conceptually support this idea. An accumulating body of work has demonstrated that the perception of speech sounds is influenced by the long-term average spectrum (LTAS) of a preceding sound, even if that preceding sound is non-linguistic in nature (e.g., Holt et al., 2000; Holt and Lotto, 2002). This influence of non-linguistic sounds on speech perception appears to reflect a general sensitivity to spectro-temporal distributional information, as the nonlinguistic preceding context can influence speech categorization even when it is not immediately preceding the to-be-categorized speech sound (Holt, 2005). While these results do not directly demonstrate that recent experience in music can influence the way in which a speech sound is categorized, it is reasonable to predict that certain kinds of experiences in music or speech (e.g., a melody played in a particular frequency range) may alter the way in which subsequent speech sounds are perceived. As such, future work within this realm will help us understand the extent to which auditory object plasticity can be understood using a general auditory framework.
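For readers unfamiliar with the measure, the LTAS of a sound can be approximated by averaging short-time power spectra across the whole signal. The sketch below is a minimal standard-library illustration of that idea, not the analysis procedure of the studies cited above. The frame length, hop size, Hann window, sample rate, and test tone are all assumptions chosen for the example.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame, computed with a direct DFT."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        acc = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                  for i, w in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def ltas(signal, frame_len=64, hop=32):
    """Long-term average spectrum: mean power per bin over overlapping frames."""
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    starts = range(0, len(signal) - frame_len + 1, hop)
    spectra = [power_spectrum(signal[s:s + frame_len]) for s in starts]
    return [sum(column) / len(spectra) for column in zip(*spectra)]


# Hypothetical preceding context: a 1000 Hz tone sampled at 8000 Hz.
SAMPLE_RATE = 8000
FRAME_LEN = 64
signal = [math.sin(2 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(1024)]

average = ltas(signal, frame_len=FRAME_LEN, hop=FRAME_LEN // 2)
peak_bin = max(range(len(average)), key=average.__getitem__)
print(f"LTAS peak at {peak_bin * SAMPLE_RATE / FRAME_LEN:.0f} Hz")
```

In this toy case the averaged spectrum is dominated by the energy of the tone. For a melody or a sentence, the same averaging would summarize where spectral energy is concentrated over the preceding context, which is the kind of distributional information the cited studies suggest can shift subsequent speech categorization.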

# NEURAL MARKERS FOR RAPID AUDITORY PLASTICITY

What is most remarkable about the previously discussed examples of perceptual plasticity in both speech and music is that significant reorganization of perception can be achieved within a single experimental session. Indeed, there is clear neural evidence from animal models that the ability to rapidly reorganize maps in auditory cortex is maintained into adulthood (see Feldman and Brecht, 2005 for a review; Ohl and Scheich, 2005). While these maps are thought to represent long-term experience with one's auditory environment (Schreiner and Polley, 2014), they demonstrate high mutability in adults, in that cortical reorganizations may be triggered by task demands as well as the attentional state of the animal (Ahissar et al., 1992, 1998; Fritz et al., 2003, 2010; Fritz J.B. et al., 2005; Polley et al., 2006; for a review see Jääskeläinen and Ahveninen, 2014). In fact, plasticity is not observed when the stimuli are not behaviorally relevant for the organism (Ahissar et al., 1992; Polley et al., 2006; Fritz et al., 2010). Behaviorally relevant experience with a set of tones is known to lead to rapid tonotopic map expansion (Recanzone et al., 1993; Polley et al., 2006; Bieszczad and Weinberger, 2010), sharper receptive field tunings (Recanzone et al., 1993), and greater neuronal synchrony (Kilgard et al., 2007). Notably, these changes appear to have a direct effect on subsequent performance, wherein larger cortical map expansion and sharper receptive field tunings are associated with greater improvements in performance following training (Recanzone, 2003). Further, the changes in spectro-temporal receptive field selectivity and inhibition persist for hours after learning, even during subsequent passive listening (Fritz et al., 2003). More recent work by Reed et al. (2011) suggests that while cortical map expansion may be triggered by perceptual learning, these states do not need to be maintained in order to preserve perceptual performance gains. 
They argue that the function of cortical map expansions is to identify the most efficient circuitry to support a behaviorally relevant, perceptual improvement. Once efficient circuitry is established, the system is able to preserve enhancement in performance via the discovered circuitry despite any subsequent retraction in cortical map representation.

Beyond tonotopic changes, other modes of plasticity in auditory cortex have been found as a consequence of auditory training. For example, experience discriminating spectrally structured auditory gratings (often referred to as auditory spectral ripples) leads to significant changes in the spectral and spectro-temporal receptive field bandwidth of neurons in auditory cortex (Keeling et al., 2008; Yin et al., 2014). These changes, if present in humans, would provide a mechanism that supports the perceptual adaptation to complex sounds, such as phonemes or chord classification (e.g., Schreiner and Calhoun, 1994; Kowalski et al., 1995; Keeling et al., 2008). Besides changes in spectral bandwidth receptivity, auditory training in adult animals can fully correct atypical temporal processing found in auditory cortex due to long-term auditory deprivation, such that normal following capacity and spike-timing precision are found after training (Beitel et al., 2003; Zhou et al., 2012). Crucially, training also appears to induce object-based or category-level processing, in that behaviorally relevant experience engenders complex, categorical representations that go beyond acoustic feature processing (King and Nelken, 2009; Bathellier et al., 2012; Bao et al., 2013; Lu et al., 2017). More specifically, recent work by Bao et al. (2013) has shown that early training leads to neural selectivity for complex spectral features, in that trained sounds show greater population-level activation relative to untrained sounds. Further, while experienced sounds post-training show a reduction in the number of responding neurons, these elicited responses are greater in magnitude. Importantly, the mechanisms guiding plasticity appear to maintain homeostasis within individual receptive fields, in that inhibitory and excitatory synaptic modifications are coordinated such that they collectively sum to zero across a single neuron's receptive field (Froemke et al., 2013). 
Coordination between inhibitory and excitatory modifications within a receptive field is necessary, as changes in long-term potentiation or long-term depression alone would create destabilized network activity that is either hyper- or hypo-receptive (Abbott and Nelson, 2000). Importantly, the balancing of synaptic modification within individual receptive fields is predicted by cognitive theories of selective attention, which suggest that while directed attention

perceptually boosts salient or behaviorally relevant stimuli, it does so at the expense of other stimuli (for a review, see Treisman, 1969).

Neural evidence for rapid perceptual learning in adults has also been found in humans (for reviews, see Jääskeläinen and Ahveninen, 2014; Lee et al., 2014). Specifically, perceptual training of novel phonetic categories appears to lead to changes in early sensory components of scalp-recorded auditory evoked potentials (AEPs), which are thought to arise from auditory cortex (Hari et al., 1980; Wood and Wolpaw, 1982; Näätänen and Picton, 1987), suggesting that experience-contingent, perceptual reorganization similarly occurs in humans (e.g., Tremblay et al., 2001; Reinke et al., 2003; Alain et al., 2007, 2015; Ben-David et al., 2011). A recent fMRI and AEP study by de Souza et al. (2013) has shown that rapid perceptual learning is marked by a reorganization not only in sensory cortex but also in higher-level areas such as left and right superior temporal gyrus and left inferior frontal gyrus. Importantly, their findings suggest that perceptual reorganization due to training is gated by the allocation of attention, implicating behavioral relevance via listening goals as the gating agent in perceptual plasticity. Evidence for this can also be found in the work of Mesgarani and Chang (2012). Using electrocorticography (ECoG), where electrodes are placed directly on the surface of the brain to record changes in electrical activity from cortex, Mesgarani and Chang (2012) demonstrated that the cortical representations evoked to understand a signal are determined largely by listening goals, such that rapid changes in which talker participants were attending to in multi-talker speech led to immediate changes in population responses in non-primary auditory cortex known to encode critical spectral and temporal features of speech. 
Specifically, they showed that cortical responses in non-primary auditory cortex are attention-modulated, such that the representations evoked were specific to the talker to which the listener was attending, rather than the external acoustic environment (Mesgarani and Chang, 2012; see also Zion-Golumbic et al., 2013; for a review, see Zion-Golumbic and Schroeder, 2012).

As previously mentioned, rapid neural changes in sensory and higher-level areas are thought to be the product of the corticofugal system (which includes cortex and subcortical structures such as the inferior colliculus, thalamus, amygdala, hippocampus, and cerebellum), in that bottom-up processes may operate contemporaneously and interactively with top-down driven processes to actively shape signal processing (Suga and Ma, 2003; Slee and David, 2015). Rapid strengthening or diminishing of synapse efficacy can occur within minutes through mechanisms such as long-term potentiation and long-term depression (Cruikshank and Weinberger, 1996; Finnerty et al., 1999; Dinse et al., 2003). As previously mentioned, these alterations appear to be contingent on whether input is behaviorally relevant, especially in the adult animal, suggesting that neural plasticity is gated by top-down or descending systems (Crow, 1968; Kety, 1970; Ahissar et al., 1992; Ahissar et al., 1998; for similar work in adult rats, see Polley et al., 2006) such as the cholinergic and noradrenergic systems that originate from the basal forebrain, whose effects are mediated through the regulation of GABA circuits (Ahissar et al., 1996). While there appears to be receptivity in the speech and music community to modeling putatively top-down interactions operating entirely in cortex (George and Hawkins, 2009; Kiebel et al., 2009; Friston, 2010; Moran et al., 2013; Yildiz et al., 2013), very little work has been done to model corticofugal interactions in achieving behaviorally relevant signal processing, as extant neurobiological models of speech and music traditionally limit modeling solely to cortex. As such, the process of perception that extant models put forth reflects a myopic view of the neural architecture that supports auditory understanding in a world where behavioral relevance is ever-changing (cf. Parvizi, 2009).

Beyond the notion that rapid cortical changes appear to persist for hours, even after the conclusion of a given task (Fritz et al., 2003; Fritz J. et al., 2005; Fritz J.B. et al., 2005), more recent work has started to examine how such rapid changes may be made more robust through other concurrent but more long-term neurobiological mechanisms that may require off-line processing during an inactive period such as sleep (Louie and Wilson, 2001; Brawn et al., 2010). These long-term mechanisms include dendritic remodeling, changes in receptor and transmitter base levels, or axonal sprouting or pruning (Sun et al., 2005). Indeed, it is unlikely that immediate changes in cortex are a product of rapid remodeling of synaptic connections, or dendritic expansion or formation, which are likely components of more long-term mechanisms that support learning. Fritz et al. (2013) have suggested that rapid changes in behavior may be driven by changes in the gain of synaptic input onto individual dendritic spines, which may have the necessary architecture to achieve rapid changes. Recent work by Chen et al. (2011) supports this suggestion, as individual synaptic spines on dendrites of layers II to III of A1 neurons in mice are remarkably variable in their tuning frequencies, in that individual neurons possess dendritic spines that are tuned to widely different frequencies, with tunings that are both broad and narrow. As such, the arrangement and pattern of synaptic spines of A1 neurons appear to provide an ideal substrate for rapid cortical receptive field plasticity.

The notion that there are multiple learning mechanisms operating at different time scales concurrently is present in some cognitive learning models (e.g., complementary learning systems, McClelland et al., 1995; Ashby and Maddox, 2005; Ashby et al., 2007). While these models have been important in accounts of learning and memory, they have not been widely incorporated in models of speech and music perception. This omission, along with the extreme cortical myopia found within models of speech and music perception, reflects an overly simplified, perhaps misguided understanding of the neural mechanisms that underlie perception, as the addition of such mechanisms may drastically alter the processes to be modeled. More explicitly, an important consequence of viewing the perceptual process as highly adaptive is that putatively uninformative variability is no longer something for the system to overcome, but part of the information the system uses to grant perceptual constancy. In this way, it may be our ability to adapt to variable experiences that allows one to assign behaviorally relevant meaning and achieve perceptual stability.

A somewhat different approach to understanding perceptual representations and learning, however, can be found in neural dynamical system models (Laurent et al., 2001; Rabinovich et al., 2001). These models treat a given interpretation for an object as one of many paths through a multidimensional feature space in service of a given listening goal. In essence, the patterns of neural activity in these kinds of systems can form stable trajectories (reflecting different classifications) that are distinct but mutable with experience. These models do not have "stored memories" separate from the processing activity itself within neural populations, so that auditory objects would be represented by the pattern of neural activity over time within the processing network, with different spectro-temporal patterns having different stabilities. This is entirely consistent with Walter Freeman's work on brain oscillations showing that after rabbits learn a set of odor objects, learning a new odor subsequently alters oscillatory patterns associated with all previously learned odors (Freeman, 1978). These types of models do not require a separate stable "representation" for a given object such that different neurons or different network subparts are disjunctively representative of different objects, but instead dynamically create a percept from stable patterns of neural activity arising from the interaction with neural populations. Given that this marks a theoretical shift in ideas about perceptual representation from a traditional neuron doctrine (Barlow, 1972) or cell assembly idea (e.g., Hebb, 1949) in which specific neurons are identified with psychologically distinct objects to the idea that these representations emerge in the patterns of neural activity within a network (see Yuste, 2015), it is unclear how such a framework may be applied to the neural receptive field tuning data just reviewed. 
One possibility is that changes in behavioral relevance or training via exposure may shift the activity pattern in a population of neurons from one stable trajectory to another and that mechanisms such as cortical magnification may allow for the most efficient pattern to be found (see Reed et al., 2011). Models of this sort may provide a different way of conceptualizing short-term and long-term changes in tunings by unifying the impact of experience, not on the formation of representations in memory, but through the dynamic interaction of neural population responses that are sensitive to changes in attention and context.

# RELIANCE ON RECENT EXPERIENCE AND EXPECTATIONS

The evidence cited earlier that receptive fields change as a result of behaviorally relevant experience, and that such changes persist after learning, highlights that perceptual constancy may indeed arise through a categorization process that results in attenuation of goal-irrelevant acoustic variability in service of current listening goals. However, such variability may be preserved outside of the veil of perceptual constancy and be incorporated, if lawful, into the representations that guide perception (Elman and McClelland, 1986). Indeed, individuals are faced with continual changes in how phonetic categories are acoustically realized over time at both a community level (Watson et al., 2000; Labov, 2001) and at an idiosyncratic level (Bauer, 1985; Evans and Iverson, 2007). As such, neural representations must preserve aspects of variability outside of processes that produce forms of perceptual constancy.

Tuller et al. (1994) and Case et al. (1995) have put forth a non-linear dynamic model of speech perception. In their model, perception is viewed as a dynamical process that is highly context-dependent, such that perceptual constancy is achieved via attraction to "perceptual magnets" that are modified nonlinearly through experience. Crucial to their model, listeners remain sensitive to the fine-grain acoustic properties of auditory input, as recent experience can induce a shift in perception. Similar to Tuller et al. (1994), Kleinschmidt and Jaeger (2015) have proposed a highly context-dependent model of speech perception. In their model, perceptual stability in speech is achieved through recognition "strategies" that vary depending on the degree to which a signal is familiar based on past experience. This flexible strategic approach based on prior familiarity is critical for successful perception, as a system that is rigidly fixed in acoustic-to-meaning mappings would fail to recognize (perhaps by misclassification) perceptual information that was distinct from past experience, whereas a system that is too flexible might require a listener to continually start from scratch. However, from this view, perceptual constancy is not achieved through the activation of a fixed set of features, but through listening expectations based on the statistics of prior experience. In this way, perceptual constancy arising from such a system could be thought of as an emergent property that results from the comparison of prior experience to bottom-up information from (i) the signal and (ii) recent listening experience (i.e., context).

Within a window of recent experience, what kinds of cues convey to a listener that a deviation from expectations has occurred? Listeners must flexibly shift between different situations that may have different underlying statistical distributions (Qian et al., 2012; Zinszer and Weiss, 2013), using contextual cues that signal a change in an underlying statistical structure (Gebhart et al., 2009). One particularly clear and ecologically relevant contextual cue comes from a change in source information – that is, a change in talker for speech, or instrument for music. For example, when participants learn novel words from distributional probabilities of items across two unrelated artificial languages (i.e., that mark words using different distributional probabilities), they only show reliable transfer of learning across both languages when the differences between languages are contextually cued through different talkers (Weiss et al., 2009). This is presumably because without a contextual cue to index the specific language, listeners must rely on the overall accrued statistics of their past experience in relation to the sample of language drawn from the current experience, which may be too noisy to be adequately learned or deployed. More recent work has demonstrated that the kind of cueing necessary to parse incoming distributional information into multiple representations can come from temporal cues as well. Gonzales et al. (2015) found that infants could reliably differentiate statistical input from two accents if temporally separated. This suggests that even in the absence of a salient perceptual distinction between two sources of information (e.g.,
speaker), listeners can nevertheless use other kinds of cues to meaningfully use variable input to form expectations that can constrain recognition. Indeed, work by Pisoni (1993) has demonstrated that listeners track attributes of speech signals that have been traditionally thought to be unimportant to the recognition process (e.g., a speaker's speaking rate, emotional state, dialect, and gender) but may be useful in forming expectations that guide and constrain the recognition process. To be clear, these results suggest that experience with the different statistics of pattern sets, given a context cue that appropriately identifies the different sets, may subsequently shape the way listeners direct attention to stimulus properties, highlighting a possible way in which top-down interactions (via cortical or corticofugal means) may reorganize perception.

Work by Magnuson and Nusbaum (2007) has shown that attention and expectations alone may influence the way listeners tune their perception to context. Specifically, they demonstrated that the performance costs typically associated with adjusting to talker variability were modulated solely by altering the expectations of hearing one or two talkers. In their study, listeners expecting to hear a single talker did not show the performance costs in word recognition that were observed when listeners were expecting to hear two talkers, even though the acoustic tokens were identical. Related work by Magnuson et al. (1995) showed that this performance cost is still observed when shifting between two familiar talkers. This example of contextual tuning illustrates that top-down expectations, which occur outside of statistical learning, can fundamentally change how talker variability is accommodated in word recognition. This finding is conceptually similar to research by Niedzielski (1999), who demonstrated that vowel classification differed depending on whether listeners thought the vowels were produced by a speaker from Windsor, Ontario or Detroit, Michigan – cities that have different speech patterns but are close in distance. Similarly, Johnson et al. (1999) showed that the perception of "androgynous" speech was altered when presented with a male vs. female face. Linking the domains of speech and music, recent work has demonstrated that the pitch of an identical acoustic signal is processed differently depending on whether the signal is interpreted as spoken or sung (Vanden Bosch der Nederlanden et al., 2015).

Kleinschmidt and Jaeger (2015) have offered a computational approach to how such expectations may influence the perception of a signal. Specifically, they posit that until a listener has enough direct experience with a talker, a listener must supplement their observed input with their prior beliefs, which are brought online via expectations. However, this suggests that prior expectations are only necessary until enough direct experience has accrued. Another possibility, supported by Magnuson and Nusbaum (2007), is that prior expectations are able to shape the interpretation of an acoustic pattern, regardless of accrued experience, as most acoustic patterns are non-deterministic (ambiguous). More specifically, Magnuson and Nusbaum (2007) show that a many-to-many mapping between acoustic cues and their meanings requires more cognitive, active processes, such as a change in expectation that may then direct attention to resolve the recognition uncertainty (cf. Heald and Nusbaum, 2014). Taken together, this suggests that auditory perception cannot be a purely passive, bottom-up process, as expectations about the interpretation of a signal clearly alter the nature of how that signal is processed.
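The idea that prior beliefs carry most of the weight until direct experience with a talker accrues can be illustrated with a textbook normal-normal conjugate update. This is only a minimal sketch and is not the model proposed by Kleinschmidt and Jaeger (2015); the cue (a voice onset time in ms), all numerical values, and the function name `update_category_mean` are hypothetical and chosen purely for illustration.

```python
def update_category_mean(prior_mean, prior_var, observations, noise_var):
    """Return the posterior mean and variance of a category's cue mean.

    Standard normal-normal conjugate update with known observation noise.
    All inputs are hypothetical illustration values, not fitted parameters.
    """
    n = len(observations)
    if n == 0:
        # No direct experience with the talker: the prior is all there is.
        return prior_mean, prior_var
    sample_mean = sum(observations) / n
    # Precisions (inverse variances) add; each observation adds 1 / noise_var.
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    # Posterior mean is a precision-weighted blend of prior and data.
    post_mean = post_var * (prior_mean / prior_var + n * sample_mean / noise_var)
    return post_mean, post_var


# Hypothetical listener: expects a mean cue value of 60 ms,
# but the new talker consistently produces 40 ms.
after_one, _ = update_category_mean(60.0, 100.0, [40.0], 100.0)
after_ten, _ = update_category_mean(60.0, 100.0, [40.0] * 10, 100.0)
print(after_one, after_ten)
```

With a single token the estimate sits halfway between the prior and the talker's value; after ten tokens it lies much closer to the talker's value, which mirrors the claim that the prior matters less as experience accrues. The alternative view attributed to Magnuson and Nusbaum (2007) would correspond to expectations changing the prior itself, independent of how many observations have been collected.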

If top-down, attention-driven effects are vital in auditory processing, then deficits in such processing should be associated with failures in detecting signal embedded in noise (Atiani et al., 2009; Parbery-Clark et al., 2011), poorer discrimination among stimuli with subtle differences (Edeline et al., 1993), and failure in learning new perceptual categories (Garrido et al., 2009). Indeed, recent work by Perrachione et al. (2016) has argued that the neurophysiological dysfunctions found in dyslexic individuals, which include deficits in these behaviors, arise due to a diminished ability to generate robust, top-down perceptual expectations (for a similar argument see also Ahissar et al., 2006; Jaffe-Dax et al., 2015).

If recent experience and expectations shape perception, it also follows that the ability to learn signal and pattern statistics is not solely sufficient to explain the empirical accounts of rapid perceptual plasticity within auditory object recognition. Changes in expectations appear to alter the priors the observer uses and may do so by violating the local statistics (prior context), such as when a talker changes. Further, there must be some processing by which one may resolve the inherent ambiguity or uncertainty that arises from the fact that the environment can be represented by multiple associations among cues. Listeners must determine the relevant associations weighing the given context under a given listening goal in order to direct attention appropriately (cf. Heald and Nusbaum, 2014). We argue that the uncertainty in weighing potential interpretations puts a particular emphasis on recent experience, as temporally local changes in contextual cues or changes in the variance of the input can signal to a listener that the underlying statistics have changed, altering how attention is distributed among the available cues in order to appropriately interpret a given signal. Importantly, this window of recent experience may also help solidify or alter listener expectations. In this way, recent experience may act as a buffer or an anchor against which the current signal and current representations are compared to previous experience. This would allow for rapid adaptability across a wide range of putatively stable representations, such as note category representations for AP possessors (Hedger et al., 2013), linguistic representations of pitch (Dolscheid et al., 2013), and phonetic category representations (Liberman et al., 1956; Ladefoged and Broadbent, 1957; Mann, 1986; Evans and Iverson, 2004; Huang and Holt, 2012).

It is important to consider exactly how plasticity engendered by a short-term window relates to a putatively stable, long-term representation of an auditory object. Given the behavioral and neural evidence previously discussed, it does not appear to be the case that auditory representations are static entities once established. Instead, auditory representations appear to be heavily influenced by recent perceptual context. Further, these changes persist in time after learning has concluded. However, this does not imply that there is no inherent stability built into the perceptual system. As previously discussed, perceptual categories in speech and music are not freestanding entities, but rather are a part of a constellation of categories that possess meaningful relationships with one another. Stability may exist through interconnections that exist in the category systems. Long-term neural mechanisms may work to remove rapid cortical changes that are inconsistent with the system, while in other cases, allow such changes to generalize to the rest of the system in order to achieve consistency.

### CONCLUSION

The present paper has addressed the apparent paradox between experiencing perceptual constancy and dynamic perceptual flexibility in auditory object recognition. Two critical factors in this issue are the problem of acoustic variability and the reliance of listeners on recent experience. Specifically, we have argued that the process of achieving plasticity in audition necessarily entails that one must retain the ability to perceive acoustic variance independent of current listening goals. This is because a system that completely attenuates putatively "irrelevant" variance, by definition, has a single representational structure and assesses incoming perceptual information through a fixed lens. This would necessarily prevent individuals from flexibly adapting to behaviorally relevant changes in their environment. This view also suggests that learning is an important part of the recognition process, as listeners must be able to rapidly learn from and adapt to changes in the statistical distributions of their acoustic environments. A goal for future research should be to examine the degree to which perceptual learning is influenced by listening goals and expectations. More specifically, while perceptual constancy may be goal driven, we have argued that perceptual learning may occur to some extent outside of perceptual constancy. In addition to maintaining sensitivity to acoustic variance, we have argued that a reliance on recent experience is necessary for individuals to flexibly adapt to changes in their environment. Recent experience provides a window through which the given signal and current representations are compared to previous knowledge, in that it contains meaningful cues as to when one should switch to an alternate sound-to-meaning mapping. Future work should examine the neural and cognitive mechanisms that underlie this process. Further, extant models of speech and music perception should be updated to reflect the importance of variability and short-term experience in the instantiation of both perceptual flexibility and constancy.

### AUTHOR CONTRIBUTIONS

SH and SVH wrote the first draft of the manuscript. HN provided comments on the draft, and all authors revised the manuscript to its final form.

### ACKNOWLEDGMENTS

This work was supported by the Multidisciplinary University Research Initiatives (MURI) Program of the Office of Naval Research through grant DOD/ONR N00014-13-1-0205.

### REFERENCES

without learning. Psychophysiology 48, 797–807. doi: 10.1111/j.1469-8986.2010.01139.x


a speech-in-noise discrimination task. J. Neurosci. 28, 4929–4937. doi: 10.1523/JNEUROSCI.0902-08.2008






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Heald, Van Hedger and Nusbaum. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Can you hear my age? Influences of speech rate and speech spontaneity on estimation of speaker age

*Sara Skoog Waller<sup>1</sup>\*, Mårten Eriksson<sup>1</sup> and Patrik Sörqvist<sup>2</sup>*

*<sup>1</sup> Department of Social Work and Psychology, University of Gävle, Gävle, Sweden, <sup>2</sup> Department of Building, Energy and Environmental Engineering, University of Gävle, Gävle, Sweden*

Cognitive hearing science is mainly about the study of how cognitive factors contribute to speech comprehension, but cognitive factors also partake in speech processing to infer non-linguistic information from speech signals, such as the intentions of the talker and the speaker's age. Here, we report two experiments on age estimation by "naïve" listeners. The aim was to study how speech rate influences estimation of speaker age by comparing the speakers' natural speech rate with increased or decreased speech rate. In Experiment 1, listeners were presented with audio samples of read speech from three different speaker age groups (young, middle aged, and old adults). They estimated the speakers as younger when speech rate was faster than normal and as older when speech rate was slower than normal. This speech rate effect was slightly greater in magnitude for older (60–65 years) speakers in comparison with younger (20–25 years) speakers, suggesting that speech rate may gain greater importance as a perceptual age cue with increased speaker age. This pattern was more pronounced in Experiment 2, in which listeners estimated age from spontaneous speech. Faster speech rate was associated with lower age estimates, but only for older and middle aged (40–45 years) speakers. Taken together, speakers of all age groups were estimated as older when speech rate decreased, except for the youngest speakers in Experiment 2. The absence of a linear speech rate effect in estimates of younger speakers, for spontaneous speech, implies that listeners use different age estimation strategies or cues (possibly vocabulary) depending on the age of the speaker and the spontaneity of the speech. Potential implications for forensic investigations and other applied domains are discussed.

Keywords: age estimation, speech perception, speech rate, cognitive speech processing, speech spontaneity

### *Edited by:*

*Judit Gervain, Centre National de la Recherche Scientifique – Université Paris Descartes, France*

### *Reviewed by:*

*Laurianne Cabrera, Université Paris Descartes, France; Jean-Remy Hochmann, Harvard University, USA*

### *\*Correspondence:*

*Sara Skoog Waller, Department of Social Work and Psychology, University of Gävle, Kungsbäcksvägen 47, SE-801 76 Gävle, Sweden sara.wallerskoog@hig.se*

### *Specialty section:*

*This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology*

*Received: 17 March 2015; Accepted: 29 June 2015; Published: 17 July 2015*

### *Citation:*

*Skoog Waller S, Eriksson M and Sörqvist P (2015) Can you hear my age? Influences of speech rate and speech spontaneity on estimation of speaker age. Front. Psychol. 6:978. doi: 10.3389/fpsyg.2015.00978*

### Introduction

Cognitive hearing science is mainly about how cognitive factors contribute to speech comprehension (Arlinger et al., 2009), such as how working memory (Rönnberg et al., 2013) and long-term memory (Sörqvist et al., 2014) support speech comprehension in adverse listening conditions, and how the mind tries to predict upcoming information in the unfolding speech stream (Bendixen et al., 2009). However, cognitive factors can also take part in extracting non-linguistic information from speech signals. Indexical information of a person (see Harnsberger et al., 2008) such as gender, age, height, and weight can be extracted with some certainty from voice alone (Krauss et al., 2002; Hughes and Gallup, 2008). This paper investigates this relatively understudied form of cognitive speech processing. Specifically, it explores in two experiments how variations in one aspect of the speech signal, speech rate, influence age estimation. The first experiment is based on read speech whereas the second is based on spontaneous speech. Most previous research on age estimates from voice has been done on read speech (Ptacek and Sander, 1966; Ramig and Ringel, 1983; Huntley et al., 1987; Shipp et al., 1992; Braun, 1996; Braun and Cerrato, 1999; Cerrato et al., 2000; Harnsberger et al., 2008; Torre and Barlow, 2009). However, most communication occurs spontaneously, which is why age estimates from spontaneous speech are of obvious interest. The results may have implications for various applied areas such as acting (e.g., Werner, 1996), speech synthesis (e.g., Schötz, 2006), speech and hearing disorders (e.g., Harnsberger et al., 2008) and forensic investigations (e.g., Yarmey et al., 1996).

When inferring the age of a speaker from voice, a listener may rely on various cues drawn from the physical attributes of the voice as well as from the contents (linguistic attributes) of what is being said (Moyse, 2014). For example, older adults produce less fluent and less complex speech in comparison with younger adults (Kemper et al., 2003). Examples of physical speech attributes that change with age are fundamental frequency, amount of shimmer, and speech rate. The fundamental frequency of the voice changes at puberty and during the transition into adulthood (Hughes and Rhodes, 2010) and correlates with other physiological changes as people get older, and the amount of shimmer is found to increase (Ramig and Ringel, 1983; Xue and Hao, 2003). Whilst most age-related changes in the fundamental frequency take place prior to adulthood (Huber et al., 1999; Lee et al., 1999; Amir and Biron-Schental, 2004), speech rate continues to change considerably after adulthood. As people get older, speech rate decreases (Linville, 2001; Brückl and Sendlmeier, 2003; Schötz, 2006). Not all age-related changes in speech may be used in an age estimation task, but speech rate seems of greatest relevance (Harnsberger et al., 2008). People may hence incidentally learn the association between speech rate and age of speakers in their everyday interactions with others. If these associations have been learned and if speech rate is used as a cue to age estimates, manipulations of speech rate should influence age estimates of adult speakers.

The accuracy of age estimates based on voice is poor when compared to age estimates from faces (Rhodes, 2009; Moyse, 2014). Although the magnitude of correlations between age estimates and the chronological age of the speaker is typically high (Shipp and Hollien, 1969; Huntley et al., 1987; Neiman and Applegate, 1990; Braun, 1996; Cerrato et al., 2000; Brückl and Sendlmeier, 2003), the age of young speakers is systematically overestimated and the age of older speakers is systematically underestimated (Shipp and Hollien, 1969; Hollien and Tolhurst, 1978; Huntley et al., 1987; Braun, 1996; Braun and Cerrato, 1999; Cerrato et al., 2000; Brückl and Sendlmeier, 2003). The cause of this effect may simply be that, when cues to the accurate estimate are scarce, the best strategy would be to guess on an age estimate close to the middle of the possible age range to minimize error (Fahsing et al., 2004). The resulting biases are typical of research on estimation of person characteristics. In the present study, the accuracy of the age estimates is also used as a control of task difficulty. Extant research shows that age estimation of younger individuals is easier (i.e., has greater accuracy) than age estimation of older individuals (Rhodes, 2009; Vestlund et al., 2009; Moyse, 2014). We explored task difficulty in the context of accuracy estimates, because difference in task difficulty may be informative when the effects of speech rate on over- and underestimates are interpreted. Here, accuracy is defined as the absolute difference between the age estimate and the chronological age of the speaker, whereas over- and underestimates are calculated by taking the signed difference between the age estimate and the chronological age of the speaker (Vestlund et al., 2009). When averaged across estimates, these two dependent measures (accuracy versus over/underestimates) can yield quite different outcomes, and signed differences cannot alone be used as an estimate of task difficulty.
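The distinction between the two dependent measures can be made concrete with a small sketch. The ages and estimates below are invented for illustration and are not data from the study; the function names are likewise hypothetical.

```python
def mean_absolute_error(estimates, true_ages):
    """Accuracy measure: mean absolute difference between estimate and age."""
    return sum(abs(e - t) for e, t in zip(estimates, true_ages)) / len(estimates)


def mean_signed_error(estimates, true_ages):
    """Bias measure: mean signed difference (positive = overestimation)."""
    return sum(e - t for e, t in zip(estimates, true_ages)) / len(estimates)


# Hypothetical example: two young speakers are overestimated and
# two older speakers are underestimated by the same amounts.
true_ages = [25, 25, 65, 65]
estimates = [35, 30, 55, 60]

print(mean_absolute_error(estimates, true_ages))  # 7.5
print(mean_signed_error(estimates, true_ages))    # 0.0
```

In this constructed case the over- and underestimates cancel, so the mean signed error is zero even though every individual estimate is off by 5 to 10 years. This is why signed differences alone cannot serve as an index of task difficulty.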

Speech rate changes with chronological age and, therefore, one way to study the effects of speech rate on age estimation is to ask participants to make age estimates of voices from speakers who differ in chronological age. However, experimental research, in which the parameter of interest, in this case speech rate, is manipulated, constitutes much stronger causal evidence for the effects of speech rate on age estimation. Only a few studies hitherto (Schötz, 2004; Winkler, 2007; Harnsberger et al., 2008) have studied the effect of speech rate on perceived age by actually manipulating speech rate, and the study of Harnsberger et al. (2008) is most relevant as they are the only ones that study speech material longer than a few words. They reported that increased speech rate (by 20%) lowered perceived age of older speakers (74–88 years) and that decreased speech rate (by 20%) resulted in higher age estimates of middle-aged speakers, although decreased speech rate did not change the perceived age of younger (21–29 years) speakers. However, Harnsberger et al. (2008) did not study the effects of increased speech rate on perception of younger speakers, nor did they study the effects of decreased speech rate on perception of older speakers. The present study will close that gap. Moreover, a change of speech rate by 20% is quite substantial and a preliminary study indicated that a manipulation of this magnitude made some voices sound "strange" according to naïve listeners. No strangeness was noted when we manipulated speech rate by ±10% and it was therefore decided to use this smaller manipulation to see if it also had an effect on perceived age.

In sum, this study explores how subtle manipulations of the speech signal in the form of a change in speech rate affect listeners' judgment of speaker age. The effect of increased and decreased speech rate on young, middle-aged, and old voices will be analyzed. The first experiment concerns read speech while the second concerns spontaneous speech.

### Experiment 1

In Experiment 1, we investigated how a change in speech rate influenced age estimations of voices from younger, middle-aged, and older speakers. We hypothesized, extending the results from Harnsberger et al. (2008), that decreased speech rate would make all speakers sound older and increased speech rate would make all speakers sound younger, regardless of the chronological age of the speaker. Moreover, we explored whether the magnitude of this speech rate effect depends on the chronological age of the speakers.

### Method

### Participants/Listeners

Eighty-one students (67% female) at the University of Gävle participated in the listening tests in exchange for a movie ticket (value of US \$12). The mean age of the participants was 24 years (SD = 6.01, range 18–49 years). The studies reported in this paper were conducted in accordance with the Declaration of Helsinki and the ethical guidelines given by the American Psychological Association. All participants (listeners and speakers) were adults and gave informed consent. The listeners and the speakers signed an information agreement form. The experiment caused no harm to any party, the identity of the participants has been kept confidential, and no conflict of interest can be identified.

### Speech Material

Voices from 36 non-smoking native speakers of Swedish were used in the study. Twelve were 20–30 years, 12 were 40–50 years, and 12 were 60–70 years. Six speakers from each age group were female and six were male. The speakers were recorded while reading a 35-word text containing written walking directions.

The recordings were made in a silent room on a computer connected to a dynamic microphone placed 15 cm from the speaker's mouth. The recordings were edited in Audacity 1.2.6 (http://audacity.sourceforge.net). A standard feature in the program was used to compress the dynamic range of the recordings, making the loudest parts softer while keeping the volume of the soft parts the same. The threshold value was set to −12 dB and the ratio was set to 2:1. The speech samples were then normalized for intensity by setting the maximum intensity of all samples to the same value.
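As a rough illustration of what a −12 dB threshold and a 2:1 ratio mean, the sketch below implements a static hard-knee compression curve and a simple peak normalization. This is not Audacity's actual algorithm, which also involves attack and release behavior not described here. The function names and the `target_peak` parameter are hypothetical.

```python
def compress_level_db(level_db, threshold_db=-12.0, ratio=2.0):
    """Static hard-knee downward compression, in dB (illustrative only).

    Levels at or below the threshold pass through unchanged; the part of
    the level above the threshold is divided by the ratio.
    """
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio


def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak.

    target_peak is an assumed value; the text only states that all samples
    were set to the same maximum intensity.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


# A part of the signal 6 dB above the threshold ends up 3 dB above it.
print(compress_level_db(-6.0))   # -9.0
print(compress_level_db(-20.0))  # -20.0 (below threshold, unchanged)
```

With these settings only the loud parts are attenuated, which matches the description of making the loudest parts softer while leaving the soft parts unchanged.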

The manipulations of speech rate were also made in Audacity by creating two new versions of each original speech sample and decreasing the speech rate for one of them by 10% while increasing the speech rate for the other version by 10%. The pitch was kept constant for each voice across the three speech rate conditions by a standard feature in Audacity. The speech samples varied between 10 and 19 s in length after manipulation.
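To make the size of the manipulation concrete, the following sketch computes how a sample's duration would change under a given rate change. It assumes that a rate change of +10% means the tempo is multiplied by 1.10 (and −10% by 0.90), which is one plausible reading of the text; the function name is hypothetical.

```python
def duration_after_rate_change(duration_s, rate_change_pct):
    """Duration of a sample after changing its tempo by rate_change_pct.

    Assumes tempo scales by (1 + pct/100), so duration scales by the inverse.
    Pitch is assumed to be preserved separately and does not affect duration.
    """
    return duration_s / (1.0 + rate_change_pct / 100.0)


for pct in (-10, 0, 10):
    print(pct, round(duration_after_rate_change(15.0, pct), 2))
```

Under this assumption, a 15 s sample would last roughly 16.7 s when slowed by 10% and roughly 13.6 s when sped up by 10%.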

Average fundamental frequency for each speech sample was analyzed in Praat. As expected (e.g., Titze, 1994), men's voices had a lower F0 than women's voices as confirmed by a 2 (Gender: women, men) × 3 (Age group: young, middle-aged, old) analysis of variance with F0 as dependent variable, *F*(1,30) = 100.16, MSE = 518.26, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.77. However, there was no main effect of age group and no interaction between the factors. See **Table 1** for means and variation in F0 over age groups and gender. Thus, F0 was not included as a factor in subsequent analyses.

### Procedure

The listening tests were conducted in a laboratory where speech samples were presented to the participants through headphones.


The participants adjusted the volume to a comfortable level at the start of the experiment. They were instructed to estimate the age (in years) of each speaker they were going to hear and write their estimate in a form. Three test trials were used for familiarization with the task. A 10-s pause was set in between every speech sample. Backtracking was not allowed. In all, the experiment lasted 15–20 min.

Each participant estimated each speaker only at one speech rate. The participants were randomized into three listener groups that were balanced with regard to gender and age. Each listener group was presented with 36 speech samples (12 samples with increased speech rate, 12 with natural speech rate and 12 with decreased speech rate) in randomized order. Each set contained speech samples produced by all 36 speakers but at different speech rates. A randomized order was generated for each of the three sets of speech samples. This order was also reversed, resulting in two orders of presentation for each of the three listening groups.
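One way to realize such a counterbalanced design is a simple rotation (Latin square) of speech rates over speakers. The sketch below is an assumption about how the assignment could be done, not a description of the procedure actually used; the speaker indices, the `seed` parameter, and the function names are hypothetical.

```python
import random

RATES = ("increased", "natural", "decreased")


def assign_rates(n_speakers=36, n_groups=3):
    """Return {group: {speaker: rate}} by rotating rates across groups.

    Every listener group hears every speaker exactly once, and every
    speaker is heard at each rate by exactly one group.
    """
    return {
        g: {s: RATES[(s + g) % len(RATES)] for s in range(n_speakers)}
        for g in range(n_groups)
    }


def presentation_orders(speakers, seed=0):
    """Return one randomized order and its reverse for a set of samples."""
    rng = random.Random(seed)
    order = list(speakers)
    rng.shuffle(order)
    return order, order[::-1]


design = assign_rates()
forward, backward = presentation_orders(design[0])
print(list(design[0].values()).count("natural"))  # 12 samples per rate
```

Whether the rotation is also balanced within speaker age group and gender depends on how the speakers are indexed, which is not specified in the text.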

### Statistics and Design

A 3 (speaker age group: young vs. middle-aged vs. old) × 3 (speech rate: increased vs. natural vs. decreased) within-participants factorial design was used to measure differences in age estimates depending on speaker age group and speech rate. In cases of absent estimations or if listeners were acquainted with a speaker, missing values were substituted by the mean value for the particular speech sample for speaker age group, speaker gender and listener gender. This procedure was applied to 13 missing values. Two dependent measures were calculated: signed differences between age estimates and the chronological age of the target person (to investigate over- and underestimations) and the absolute/unsigned differences (to investigate accuracy), following previous studies (e.g., Vestlund et al., 2009; Voelke et al., 2012).
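The two dependent measures and the mean substitution can be sketched as follows. This is a minimal illustration with made-up numbers; the function names are hypothetical, and the caller is assumed to pass only the estimates belonging to the relevant cell (same speech sample, speaker age group, speaker gender and listener gender).

```python
from statistics import mean


def signed_error(estimate, true_age):
    """Positive values are overestimations, negative values underestimations."""
    return estimate - true_age


def absolute_error(estimate, true_age):
    """Unsigned distance from the true age; lower values mean higher accuracy."""
    return abs(estimate - true_age)


def impute_missing(estimates):
    """Replace missing estimates (None) with the mean of the observed ones."""
    observed = [e for e in estimates if e is not None]
    fill = mean(observed)
    return [fill if e is None else e for e in estimates]


# Two listeners judging a 25-year-old speaker as 20 and 30 (invented values).
estimates = [20, 30]
print(mean(signed_error(e, 25) for e in estimates))    # 0: no bias on average
print(mean(absolute_error(e, 25) for e in estimates))  # 5: but each is 5 years off
```

The example shows why both measures are needed: opposite errors cancel in the signed measure but not in the absolute one.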

### Results and Discussion

As can be seen in **Figure 1**, the age of younger speakers was overestimated (a deviation from the accurate age of the speaker above 0) and the age of older speakers was underestimated (a deviation below 0). Moreover, increased speech rate made the speaker sound younger, and decreased speech rate made the speaker sound older. This speech rate effect was most pronounced in age estimates of voices from old speakers. These conclusions were supported by a 3 (speaker age group: young vs. middle-aged vs. older) × 3 (speech rate: increased vs. natural vs. decreased) repeated measures analysis of variance. The analysis revealed a main effect of speaker age group, *F*(2,160) = 691.72, MSE = 24.26, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.90, a main effect of speech rate, *F*(2,160) = 70.69, MSE = 17.89, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.47, and a significant interaction between the two factors, *F*(4,320) = 2.48, MSE = 16.68, *p* = 0.044, η<sup>2</sup><sub>p</sub> = 0.03. Follow-up *t*-tests were conducted to tease apart the interaction. Fast speech rate was different from slow speech rate in age estimates of young, *t*(80) = 4.26, *p* < 0.001, middle-aged, *t*(80) = 6.83, *p* < 0.001, and old speakers, *t*(80) = 7.68, *p* < 0.001. The difference in age estimates of voices with slow and fast speech rate was larger for estimates of old speakers in comparison with estimates of young speakers, *t*(80) = 2.23, *p* = 0.029. A 2 (speaker gender) × 2 (participant gender) analysis of variance with age estimates collapsed across age groups and speech rates was computed to explore general effects of gender. It revealed that female voices are perceived as younger (*M* = −26.29, SD = 27.38) than male voices (*M* = −12.42, SD = −32.56), *F*(1,158) = 7.64, MSE = 896.08, *p* = 0.006, η<sup>2</sup><sub>p</sub> = 0.05, but yielded no effect of participant gender nor an interaction between speaker gender and participant gender.
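For readers who want to check the reported effect sizes, partial eta squared can be recovered from an *F* value and its degrees of freedom through the standard identity η<sup>2</sup><sub>p</sub> = (*F* · df<sub>effect</sub>) / (*F* · df<sub>effect</sub> + df<sub>error</sub>). The short sketch below applies it to the values reported above; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of speaker age group, main effect of speech rate, interaction.
for f_value, df1, df2 in [(691.72, 2, 160), (70.69, 2, 160), (2.48, 4, 320)]:
    print(round(partial_eta_squared(f_value, df1, df2), 2))
```

This reproduces the reported values of 0.90, 0.47 and 0.03.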

As a control of task difficulty, the accuracy of the estimates was also analyzed. Accuracy was highest in estimations of the youngest age group (*M* = 8.10, SD = 4.29), intermediate in the middle-aged group (*M* = 9.22, SD = 3.52) and lowest in estimations of the oldest age group (*M* = 14.53, SD = 5.50). This was confirmed by a repeated measures analysis of variance with age group of target persons as independent variable (young vs. middle-aged vs. older) and accuracy as dependent variable, *F*(2,160) = 66.99, MSE = 14.23, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.46. Estimates of young were different from middle-aged, *t*(80) = 2.07, *p* = 0.041, estimates of young were different from old, *t*(80) = 9.42, *p* < 0.001, and estimates of middle-aged were different from old, *t*(80) = 9.66, *p* < 0.001.

A further control analysis was conducted in view of a "scale" problem in age estimates: For example, an estimation error of 2 years is not much (in percent) when the speaker is 65 years old, whilst an estimation error of 2 years is quite substantial when the speaker is only 4 years old. For each age estimate, the signed difference between the age estimate and the speaker's chronological age was divided by the speaker's age. Following this procedure, error estimates, expressed as percent of speaker's chronological age, were obtained (**Figure 2**). As can be seen in **Figure 2**, which depicts percent error estimates, a speech rate effect was clearly pronounced in estimates of young speakers and old speakers, but not in middle-aged speakers, and faster speech rate was overall associated with lower age estimates. A 3 (speaker age group: young vs. middle-aged vs. older) × 3 (speech rate: increased vs. natural vs. decreased) repeated measures analysis of variance with percent error estimates as dependent variable revealed a main effect of speaker age group, *F*(2,160) = 537.83, MSE = 0.02, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.87, a main effect of speech rate, *F*(2,160) = 54.64, MSE = 0.02, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.41, and a significant interaction between the two factors, *F*(4,320) = 8.27, MSE = 0.02, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.09. In young speakers, faster speech rate made the speaker sound younger in comparison with neutral speech rate, *t*(80) = 3.50, *p* < 0.001, whilst the difference between slow speech rate and neutral speech rate did not reach significance, *t*(80) = 1.80, *p* = 0.075. In older speakers, there were clear-cut differences between all three speech rates. Slower speech rate made them sound older in comparison with neutral speech rate, *t*(80) = 7.13, *p* < 0.001, and faster speech rate made them sound younger compared to neutral speech rate, *t*(80) = 2.80, *p* = 0.006. Taken together, the key finding from these analyses is that the speech rate effect is strongest in estimates of older speakers, but also quite strong in estimates of younger speakers, and faster speech rate makes the speaker sound younger.
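The "scale" correction described above amounts to dividing the signed error by the speaker's chronological age. A minimal sketch, using the 2-year error from the example in the text (the function name is hypothetical):

```python
def percent_error(estimate, true_age):
    """Signed estimation error expressed as a percentage of the true age."""
    return 100.0 * (estimate - true_age) / true_age


# The same 2-year overestimation for a 65-year-old and a 4-year-old speaker.
print(round(percent_error(67, 65), 1))  # 3.1
print(round(percent_error(6, 4), 1))    # 50.0
```

The same absolute error is thus about 3% of the older speaker's age but 50% of the younger speaker's age.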

The findings confirm the general assumption that speech rate is a cue to speakers' age that listeners use as a basis for making age estimates. The effect was found for all three age groups and was not limited to middle-aged and old voices as in Harnsberger et al. (2008). The interaction between speech rate and the chronological age of the speaker suggests, however, that speech rate may gain greater importance as an age cue with increased speaker age. This is shown in the analysis with regular age estimates and received some further support in the analysis of percent error estimates. The assumption that cues to speaker age are more prominent or easier to perceive in voices of younger speakers accords well with the accuracy analyses, as accuracy was higher in age estimates based on voices from younger speakers in comparison with estimates of older speakers. Thus, the listener may have to rely more on different and less informative cues when making estimates of the older and more difficult age groups.

### Experiment 2

The impact on age estimates of paralinguistic speech attributes such as speech rate is likely to depend on access to other cues such as linguistic variation, and consequently on the type of speech material to be assessed. Spontaneous speech, which in contrast to read speech allows for variation in wording, should presumably yield more accurate age estimates, and age estimates of spontaneous speech should be less influenced by speech rate than age estimates of read speech. Studies investigating listeners' estimation of speaker age have almost exclusively been based on speech that is produced when reading out loud (i.e., read speech) in the form of sentences, words, or vowels. From a methodological viewpoint, read speech has the advantage of control over linguistic variation and duration. Conversely, spontaneous speech should entail more variability between speech samples. However, listeners' age estimation strategies are more likely to be based on what they have learned from their everyday interactions with others (such as the association between speech rate and the chronological age of the speaker), wherein they listen almost exclusively to spontaneous speech, not to read speech. Some evidence for this assumption has been reported in a study by Schötz (2005), who found that age estimates were more accurate when based on spontaneous speech in comparison with estimates based on read isolated words. Experiment 2 was designed to test whether speech rate is an important age cue in the context of spontaneous speech and whether it would interact with the chronological age of the speaker just as in Experiment 1. One possibility is that speech rate plays a more subordinate role as a cue to speaker age in the context of spontaneous speech, as spontaneous speech is richer in other age cues (complexity, fluency, word selection, etc.). As in Experiment 1, accuracy served as a device to infer task difficulty.

### Method

### Participants/Listeners

Eighty-six students (68% female) from the University of Gävle participated in the experiment in exchange for a movie ticket (about US \$12). The mean age of the participants was 24 years (SD = 5.14, range 18–51 years).

### Speech Material

A total of 36 original samples of spontaneous speech were used, produced by the same group of speakers as in Experiment 1. The speech samples were generated by asking each speaker to provide directions on how to navigate from an origin to a destination on a map. The map represented a route taking a number of turns through an area with simple landmarks for buildings, vegetation, and water. Some speakers primarily used right–left descriptors, whereas others gave more detailed descriptions of the environment. Segments from the recordings were edited and manipulated in the same manner as in Experiment 1 using Audacity. Three versions of each speech sample were used (natural speech rate, 10% decreased speech rate and 10% increased speech rate). The duration of the speech samples before manipulation was 9–18 s.

Average fundamental frequency for each speech sample was analyzed in Praat. See **Table 2** for means and variation in F0 over age groups and gender. As in Experiment 1, men's voices had a lower F0 than women's voices. This was confirmed by a 2 (Gender: women, men) × 3 (Age group: young, middle-aged, old) analysis of variance with F0 as dependent variable, *F*(1,30) = 218.02, MSE = 258.36, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.88. There was no main effect of age group and no interaction between gender and age group. F0 was therefore not analyzed further.

### Design and Procedure

The design and procedure were the same as in Experiment 1. The only difference was that spontaneous speech was presented instead of read speech.

### Results and Discussion

As can be seen in **Figure 3**, the result pattern was quite similar to that found in Experiment 1. Again, the speaker sounded younger when speech rate was increased, and older when the speech rate was decreased. However, it was only in age estimates of the oldest age group that there was a clear-cut negative relationship between speech rate and age estimates. A 3 (speaker age group: young vs. middle-aged vs. old) × 3 (speech rate: increased vs. neutral vs. decreased) repeated measures analysis of variance revealed a main effect of speaker age group, *F*(2,170) = 475.64, MSE = 28.49, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.85, a main effect of speech rate, *F*(2,170) = 22.65, MSE = 20.65, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.21, and a significant interaction between the two factors, *F*(4,340) = 3.94, MSE = 26.53, *p* = 0.004, η<sup>2</sup><sub>p</sub> = 0.04. This interaction reveals that the effect of speech rate is linearly related to age estimates of older speakers (faster speech rate is associated with lower age estimates, i.e., faster speech rate makes the speaker sound younger), but this is not the case in estimates of young speakers, for whom the highest age estimates were found for the natural speech rate. Follow-up *t*-tests showed, in estimates of young speakers, that there was no significant difference between fast and slow speech rate, *t*(85) = 1.68, *p* = 0.097, and no difference between slow and natural, *t*(85) = 1.27, *p* = 0.209, but there was a difference between fast and natural speech rate, *t*(85) = 3.18, *p* = 0.002. However, for both middle-aged, *t*(85) = 3.31, *p* = 0.001, and older speakers, *t*(85) = 5.05, *p* < 0.001, there was a difference between fast and slow speech rate. Taken together, the speech rate effect behaves differently for the three speaker age groups. A 2 (speaker gender) × 2 (participant gender) analysis of variance with age estimates collapsed across age groups and speech rates was computed to explore general effects of gender. It revealed that men made larger underestimation errors (*M* = −29.51, SD = −31.68) compared to women (*M* = −13.18, SD = −34.05), *F*(1,168) = 9.10, MSE = 1085.49, *p* = 0.003, η<sup>2</sup><sub>p</sub> = 0.05, but yielded no effect of speaker gender nor an interaction between speaker gender and participant gender.

As in Experiment 1, the analysis of differences in accuracy between speaker age groups gave a significant main effect of speaker age, *F*(2,170) = 19.76, MSE = 20.53, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.40, and again, accuracy was highest in estimations of the youngest age group (*M* = 6.56, SD = 3.51), lowest in estimations of the oldest age group (*M* = 11.09, SD = 5.65) and intermediate in the middle-aged group (*M* = 8.14, SD = 3.49). Estimates of young were different from middle-aged, *t*(80) = 2.07, *p* = 0.041, estimates of young were different from old, *t*(80) = 2.32, *p* = 0.023, and estimates of middle-aged were different from old, *t*(80) = 4.12, *p* < 0.001.

Also, as in Experiment 1, an analysis with estimation error in percent of speaker's chronological age was conducted. These results (**Figure 4**) were very similar to those found with regular age estimates (**Figure 3**). A 3 (speaker age group: young vs. middle-aged vs. old) × 3 (speech rate: increased vs. neutral vs. decreased) repeated measures analysis of variance revealed a main effect of speaker age group, *F*(2,170) = 464.62, MSE = 0.02, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.85, a main effect of speech rate, *F*(2,170) = 15.56, MSE = 0.02, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.16, and a significant interaction between the two factors, *F*(4,340) = 2.86, MSE = 0.02, *p* = 0.023, η<sup>2</sup><sub>p</sub> = 0.03. In estimates of young speakers, the difference between slow speech rate and fast speech rate did not reach significance, *t*(85) = 1.83, *p* = 0.071, and there was no difference between slow speech rate and neutral speech rate, *t*(85) = 1.12, *p* = 0.265, but fast speech rate made them sound younger in comparison with neutral speech rate, *t*(85) = 3.19, *p* = 0.002. In estimates of old speakers, faster speech rate made them sound younger in comparison with neutral speech rate, *t*(85) = 2.02, *p* = 0.046, and slower speech rate made them sound older, *t*(85) = 4.30, *p* < 0.001, and a substantial difference was found between slow and fast speech rate, *t*(85) = 5.69, *p* < 0.001.

[Figure 4 caption, partially recovered] … between the age estimations and chronological age of the speakers, divided by speakers' age). The estimates are made … of the recording), a faster rate (10% faster) or a slower rate (10% slower). Error bars represent SEMs.

Experiment 2 replicates the key findings from Experiment 1: listeners use speech rate as a cue to infer the age of speakers from their voices, but this cue is assigned greater weight in estimates of older speakers. When the speech is spontaneous, and hence relatively rich in age cues, the listeners seem to rely on cues other than speech rate when estimating the age of younger speakers, whilst speech rate is still an important cue in the more difficult situation of age estimates of older speakers.

### Cross-Experiment Analyses

Experiment 2 expands previous findings by showing that estimators rely less on speech rate when making age estimates of young speakers in the context of spontaneous speech compared with read speech. A cross-experiment analysis was conducted to test, within a coherent analysis, whether speech rate (slow vs. natural vs. fast) and speech material (read vs. spontaneous) interact in their effects on age estimation of younger speakers. Specifically, a visual inspection of **Figures 1** and **3** suggests that the differences between the speech rate conditions are greater for read speech than for spontaneous speech. A mixed analysis of variance with speech material as between-subject factor, speech rate as within-subject factor and over/underestimates as dependent variable was calculated to test this hypothesis. A main effect of speech rate, *F*(2,330) = 12.39, MSE = 16.64, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.07, a main effect of speech material, *F*(1,165) = 4.17, MSE = 26.92, *p* = 0.043, η<sup>2</sup><sub>p</sub> = 0.03, and a significant interaction between the two factors, *F*(2,330) = 7.49, MSE = 16.64, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.04, were found.

A cross-experiment analysis of accuracy estimates was also conducted, to test the hypothesis (of applied importance) that age estimation accuracy is higher for spontaneous speech than for read speech (**Figure 5**). A 3 (speaker age group: young vs. middle-aged vs. old) × 2 (material: read vs. spontaneous speech) repeated measures analysis of variance was performed for estimates of voices at natural speech rate from both experiments. The results supported the assumption that spontaneous speech contains more age information compared to read speech, as a main effect of speech material revealed higher accuracy in estimates based on spontaneous speech, *F*(1,165) = 19.53, MSE = 23.68, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.11. Moreover, a significant interaction between speaker age group and material, *F*(2,340) = 4.11, MSE = 26.53, *p* = 0.004, η<sup>2</sup><sub>p</sub> = 0.04, indicated that the difference in accuracy between read and spontaneous speech was greater for the oldest age group compared to the accuracy difference due to material amongst the two younger age groups. Again, making accurate estimates of older speakers seems to require more complex age information and may rely on different cues than what is needed to make accurate estimates of younger speakers.

[Figure 5 caption, beginning not recovered] … and Experiment 2 (spontaneous speech). Note that lower values represent higher accuracy, as accuracy is calculated as the average of the absolute values of the difference between the age estimations and chronological age of the speakers. The estimates are made of voices from young, middle-aged, and old speakers based on read speech and spontaneous speech played back at a neutral rate (same as the recording). Error bars represent SEMs.

### General Discussion

The experiments reported here show that speech rate is an age cue that listeners rely on when inferring the age of speakers from their voices. The current study is consistent with previous studies on speech rate (Shipp et al., 1992; Brückl and Sendlmeier, 2003; Stölten and Engstrand, 2003; Winkler, 2007; Harnsberger et al., 2008), whilst expanding those findings in several directions. Specifically, speakers are estimated as younger when they talk faster and as older when they talk slower, especially older speakers. It appears as if age estimates of younger speakers, however, are not influenced by speech rate, at least in the context of spontaneous speech wherein the speakers are free to select words as they like.

### Speech Rate as a Cue to Speaker's Age

Harnsberger et al. (2008) found the typical speech rate effect (higher age estimates for slower speech rate and lower age estimates for faster speech rate) when speech rate was manipulated by 20%. Here, we found that a more modest speech rate manipulation of 10% produces a speech rate effect with a similar pattern. Hence, even subtle changes of speech rate can influence listeners' perception of speaker age.

Listeners are able to distinguish between spontaneous speech and read speech (Blaauw, 1994) as they differ on several acoustic cues such as prosodic cues and spectral cues (Howell and Kadi-Hanifi, 1991; Nakamura et al., 2008). In particular, the boundaries between tone units differ between spontaneous speech and read speech (Blaauw, 1994), the position of the stresses differs and there are fewer pauses in read speech (Howell and Kadi-Hanifi, 1991), and spontaneous speech has a more constrained spectral space (Nakamura et al., 2008). Moreover, the semantic content (word choice) should be more variable between speech samples for spontaneous speech. These factors may explain why the interaction between speech rate and chronological age, in the present study, was slightly different in the context of spontaneous and read speech. Whilst the speech rate effect was quite different for spontaneous and read speech in age estimates of younger speakers, it was very similar in age estimates of older speakers. Under the assumption that acoustic factors (prosodic and spectral cues) vary in a roughly similar way between younger and older adult speakers, the reason why the speech rate effect is less pronounced in estimates of young adults is that the age of young speakers can more easily be identified from word choice. In other words, listeners may rely more on speech rate as a cue to age when making age estimates of older speakers, whereas word choice or other semantic aspects of the speech signal are used to identify the speaker as a young adult.

An additional reason why speech rate was less influential on age estimates of young speakers is, potentially, that the listeners, who were mostly young adults, are more familiar with the way other young adults talk. This familiarity could perhaps lead to better discriminatory abilities, making them able to identify a speaker as young even when the speech signal is distorted by manipulations of speech rate. This suggestion is consistent with studies demonstrating an own-age bias in age estimates (i.e., people tend to estimate the age of others with greater accuracy when the target person is about the same age as the one making the estimate; Rhodes, 2009). Whether there is a similar own-age bias in age estimates from voices is unclear and the present study cannot provide evidence in support of this assumption, as no older listeners were included. Moreover, there was no support for an own-gender bias.

### Potential Applied Implications

Research on earwitness testimony is sparse but of applied importance, as there are many situations in which voice is the most distinct and reliable cue to personal characteristics and identity, such as when the visual conditions are poor or when the face of a target is covered, conditions that are frequently found in criminal situations (Yarmey et al., 1996; Yarmey, 2001, 2004). In particular, when the crime is committed over a phone call or otherwise when a culprit's identity can only be revealed from speech recordings, knowledge of the reliability of earwitness testimonies is quite important. One implication from the pair of experiments reported here is that speech rate should be recognized as a factor influencing the accuracy of the age estimate of the perpetrator, but only when the speaker's age is relatively high. When the age of the speaker is relatively high, a slow speech rate would indicate that age estimates from earwitnesses are likely closer to the actual age of the culprit than when speech rate is fast. Conversely, when speech rate is fast (which arguably is the usual case in real-life earwitness situations), the age of older culprits is likely to be substantially underestimated. From an applied point of view, the higher estimation accuracy when age estimation is made on voices from spontaneous speech is also noteworthy. Estimation accuracy is underestimated when investigated in the context of read speech, a methodological aspect to consider in future studies and when drawing conclusions from extant research.

Another applied implication relates to acting. Many actors receive voice training (Werner, 1996) and may learn to use their voice to sound more male or female, for example. One implication from the present experiments is that actors may use speech rate to their advantage when attempting to sound as if they were of a different age than they really are. A faster speech rate could make them sound younger, at least if the actor is above "young adult."

A third potential (yet at present highly speculative) applied implication is that hearing impairments—and a corresponding hearing aid apparatus—that distort the temporal resolution of the speech signal may distort not only the reception of the speech signal and its comprehension but also other top–down cognitive speech processes such as inference of speaker age. As the effects of hearing impairments and of hearing aids co-vary with cognitive/top–down components of speech processing (Lunner et al., 2009), it is not far-fetched to assume that distortions to time resolution in speech reception can also influence a listener's age estimation of speakers, as even slight changes in speech rate (10%) produce quite drastic changes in the listeners' perception of the speaker's age. A target for future research is to look into the effects of hearing aids on age estimation by voice. One possibility is that hearing aids distort F0 information, which could influence age estimates, just as it influences gender perception (Massida et al., 2013).

### Conclusion

Cognitive operations partake in speech processing to extract nonlinguistic information from speech signals such as the age of the speaker who generates the voice. The purpose of the present paper has been to explore some of the characteristics of this rather special form of cognitive speech processing. We can conclude that speech rate is one source of information that listeners use to extract age information, especially when listening to older speakers. Speech rate is clearly not the only age cue, however, and when the speaker is relatively young and in a spontaneous speech context, the listener primarily relies on other sources of information (e.g., acoustic and linguistic).

### References


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Skoog Waller, Eriksson and Sörqvist. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*