This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Understanding speech is effortless in ideal situations, and although adverse conditions, such as those caused by hearing impairment, often render it an effortful task, they do not necessarily suspend speech comprehension. A prime example is speech perception by cochlear implant users, whose hearing prostheses transmit speech as a significantly degraded signal. It is as yet unknown how mechanisms of speech processing deal with such degraded signals, and whether they are affected by effortful processing of speech. This paper compares the automatic process of lexical competition between natural and degraded speech, and combines gaze fixations, which capture the course of lexical disambiguation, with pupillometry, which quantifies the mental effort involved in processing speech. Listeners’ ocular responses were recorded during disambiguation of lexical embeddings with matching and mismatching durational cues. Durational cues were selected because of their substantial role in quickly limiting the number of lexical candidates for lexical access in natural speech. Results showed that lexical competition increased mental effort in processing natural stimuli, in particular in the presence of mismatching cues. Signal degradation reduced listeners’ ability to quickly integrate durational cues in lexical selection, and delayed and prolonged lexical competition. The effort of processing degraded speech was increased overall, and because it had its sources at the pre-lexical level, this effect can be attributed to listening to degraded speech rather than to lexical disambiguation. In sum, the course of lexical competition was largely comparable for natural and degraded speech, but showed crucial shifts in timing and different sources of increased mental effort.
We argue that well-timed progress of information from sensory to pre-lexical and lexical stages of processing, which is the result of perceptual adaptation during speech development, is the reason why in ideal situations speech perception is experienced as an undemanding task. Degradation of the signal or the receiver channel can quickly bring this well-adjusted timing out of balance and lead to an increase in mental effort. Incomplete and effortful processing at the early pre-lexical stages has consequences for lexical processing, as it adds uncertainty to the forming and revising of lexical hypotheses.
Understanding speech involves the rapid translation of acoustic information into meaning. The time course in which listeners extract phonetic information and map it onto their mental representations has been extensively studied in ideal listening conditions (e.g.,
In ideal conditions, understanding speech is a prime example of an automatic perceptual process that takes its course without our attention. We can understand speech and at the same time engage in parallel activities. What enables this efficient processing is the seamless transfer of information within a hierarchy of pre-lexical and lexical decoding stages. Models of speech perception (e.g., TRACE:
Increased effort during speech perception, sometimes also referred to as mental fatigue (for a distinction of these terms see
Audiological assessment methods are traditionally based on measures of intelligibility, and no standard tests exist for quantifying effort. Mental effort is first and foremost the listener’s impression, but it may affect automatic mechanisms underlying speech perception, and bottlenecks within these mechanisms can increase effort even further. Recently, there has been an increase in interest in pupillometry as an objective measure of mental effort in speech perception (
Effortless processing of speech in optimal conditions is based on experience with the signal, and on the consequential fine attunement of the perceptual system to the regular and common patterns in the listener’s native language (
The aim of the present study is to track the timing of lexical access in natural and degraded speech, and to study whether and how this processing interacts with mental effort. We hypothesize that degradation will affect the automaticity of processing speech and delay the timing of processing information at pre-lexical and lexical levels. The time course of lexical access has been studied by means of eye-tracking (e.g.,
The process of interest in this paper is lexical competition, which is the short-lived interval during which the heard signal matches multiple lexical entries, and the perceptual system allows multiple lexical candidates to compete for the best match to the signal. Listeners, not knowing the intended word beforehand, subconsciously and for mere milliseconds consider multiple words that have overlapping phonological forms. This includes homonyms (e.g., pair and pear), lexical embeddings (e.g., paint in
The present experiment adapts the design by
Pupil dilation will give us insight into the mental effort involved in the processing of degraded versus natural speech. The measure of mental effort captured in pupil dilation combined with gaze fixations can reflect processing bottlenecks, or the accumulated effort resulting from ill-adjusted timing between processing stages. However, pupil dilation may also indicate the engagement in a task, or the recruitment of attentional resources. The manifold sources of pupil dilation have led to some ambiguity in the use of terms. In this paper we will use the term ‘mental effort’ to describe our results. However, we are aware that automatic attentional allocation can play a role in the regulation of cognitive processes (
Three questions are the focus of the present study. (1) Does the time course of lexical disambiguation, as captured by gaze fixations, differ between the processing of natural versus degraded speech? (2) Does lexical competition involve an increase in mental effort, as captured in listeners’ pupil dilation? (3) Does the processing of degraded speech show a course of changes in mental effort comparable to that of natural speech? Based on our working hypothesis that timing between the processing stages is crucial for automatic and effortless perception, we assume that there will be differences in the time course of processing natural versus degraded speech. A hint in a similar direction has been reported by
Seventy-three normal-hearing volunteers, aged between 20 and 31 years (mean age 24), participated in this study. None of them reported any known hearing or learning difficulties, and all had normal or corrected-to-normal vision. Their hearing thresholds were normal, i.e., below 20 dB HL at the audiometric frequencies between 500 and 8000 Hz. Half of the volunteers were randomly assigned to participate in the task with natural speech (NS), and the other half with degraded speech (DS). Before the experiment started, the participants signed a written consent form for the study as approved by the Medical Ethical Committee of the University Medical Centre Groningen. The volunteers received either course credits or a small honorarium for their participation.
The materials consisted of 26 critical items, which were borrowed from
For all the materials, the sentence context was neutral and revealed no semantic information about the target. A female native speaker of Dutch with no prominent regional accent recorded the sentences in blocks of paired sentences. The speaker was instructed to pronounce the sentences clearly but in a natural manner. For each pair of target and competitor items three sentences were recorded. The sentence containing the polysyllabic, thus embedding, target (e.g., bokser [boxer]) was recorded twice. Only one instance of the sentence with the monosyllabic, hence embedded, competitor (e.g., bok [goat] is embedded in bokser) was necessary to construct the materials. The initial part of both sentences was identical, and the monosyllabic (competitor) word was always followed by words that matched the phonological, prosodic, and stress pattern of the target sentence as closely as possible. For instance, for the target word ‘bokser’ the sentence
All materials were subjected to a splicing procedure, in analogy to
An example of the recorded sentences, and the splicing manipulation applied to create the target-matching and target-mismatching conditions.

| Sentence 1 | We wisten wel dat die oude BOKSER gestopt was |
| Sentence 2 | We wisten wel dat die oude BOKSER gestopt was |
| Sentence 3 | We wisten wel dat die oude BOK suffig was |
| Target-matching duration | We wisten wel dat die oude BOK⋅SER gestopt was |
| Target-mismatching duration | We wisten wel dat die oude BOK⋅SER gestopt was |
The degradation in the form of acoustic CI simulation was performed by sinusoid vocoding the speech signal with eight channels, and implemented in MATLAB. The decision to create vocoded stimuli with eight channels is based on the finding that increasing the number of channels improves speech perception of CI users up to seven channels and then plateaus (
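The paper states only that the stimuli were sinusoid-vocoded with eight channels in MATLAB; the exact parameters are not given. The Python sketch below is therefore only an illustrative reconstruction of the general technique: the signal is split into frequency bands, each band’s amplitude envelope is extracted, and the envelopes modulate sine carriers at the band centers. The logarithmic band spacing, the FFT-based band split, the ~6 ms envelope smoothing window, and the frequency range are our assumptions, not the authors’ settings.

```python
import numpy as np

def sine_vocode(signal, fs, n_channels=8, lo=100.0, hi=8000.0):
    """Illustrative sinusoid vocoder: band-split the signal, extract per-band
    envelopes, and resynthesize with sine carriers at the band centers."""
    n = len(signal)
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    # Logarithmically spaced band edges between lo and hi (assumption).
    edges = np.geomspace(lo, hi, n_channels + 1)
    t = np.arange(n) / fs
    out = np.zeros(n)
    for i in range(n_channels):
        f1, f2 = edges[i], edges[i + 1]
        # Crude FFT-domain band-pass, standing in for an analysis filterbank.
        band_spec = np.where((freqs >= f1) & (freqs < f2), spec, 0)
        band = np.fft.irfft(band_spec, n)
        # Envelope: rectified band smoothed with a ~6 ms moving average
        # (standing in for a low-pass envelope filter; cutoff is assumed).
        win = max(1, int(0.006 * fs))
        env = np.convolve(np.abs(band), np.ones(win) / win, mode='same')
        # Sine carrier at the geometric center of the band.
        out += env * np.sin(2 * np.pi * np.sqrt(f1 * f2) * t)
    # Scale the output to the RMS level of the input.
    rms_in = np.sqrt(np.mean(signal ** 2))
    rms_out = np.sqrt(np.mean(out ** 2))
    return out * rms_in / (rms_out + 1e-12)
```

This kind of processing preserves the temporal envelope per channel but discards the spectral fine structure, which is the property that makes it a common acoustic simulation of CI hearing.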
The eye-tracker SR Eyelink 500, with a sampling rate of 250 Hz, was used. This head-mounted eye-tracker contains two small cameras, which can be aligned with the participant’s pupil to track the pupil’s movements as well as its size continuously during the experiment. The listeners were seated in front of a 19-inch monitor, at a distance of about 50–60 cm from the screen. The stimuli were presented via a speaker in a sound-attenuated room at a comfortable level of about 65 dB SPL. The lighting in this room was kept constant throughout the experiment.
For the display, black and white line drawings were made for the purpose of this study, and validated through consistent naming by Dutch native speakers. For the presentation of the pictures a virtual grid was created to divide the screen into three horizontal and three vertical bars. A red cross appeared centered in the middle quadrant resulting from the 3∗3 partition of the screen, and the four pictures were centered in the four external quadrants on the grid. An example of a display with
Before the experiment all participants were familiarized with all the pictures to ensure that they identified them as intended. The pictures were presented to the participants, who named them, and were then told the intended name in case of a mismatch between the word used in the experiment and their identification (for instance to clarify synonyms, such as couch and sofa). Participants assigned to the DS condition were familiarized with the sort of degradation used in the experiment. They were presented with at least 30 degraded sentences and were asked to click on the sentence they heard from among 10 sentences written on the screen. During this phase participants were allowed to listen to these sentences as often as they wanted. After that the eye-tracker was mounted and calibrated.
Before the data collection started, participants performed four practice trials during which they could always ask the experimenter for instructions. Each trial consisted of a red cross appearing on the screen for 500 ms, followed by the visual display of the four pictures and the simultaneous auditory presentation of the sentence. Participants were instructed to listen to the stimuli and to click on the object mentioned in the sentence. They were also instructed to blink only between the trials, while the word “Blink” appeared on the screen. After each of the blinking pauses participants could progress on a self-paced basis. After every five trials a recalibration screen appeared, to make sure the eye-tracker did not lose track of the pupil. The experiment lasted on average 15–20 min and consisted of 62 trials, 26 of which were critical trials. The full session needed to realize the experimental protocol, including the initial briefing of the participant, the hearing screening, familiarization with the pictures and the degradation, and debriefing, lasted about 1 h.
Listeners correctly clicked on the target in 95% of the trials. Trials in which participants failed to identify the intended target word or with blinks longer than 300 ms were excluded from the analysis (on average two trials per participant). The SR Eyelink 500 records blinks as data points with x–y coordinates and pupil size information. Blinks shorter than 300 ms were linearly interpolated based on the median of 25 samples recorded before and after the blink.
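The blink-repair step described above can be sketched as follows. The function name, the boolean-mask representation of blinks, and the edge handling are our assumptions; the text specifies only linear interpolation anchored on the medians of 25 samples before and after each blink, with blinks over 300 ms left for trial exclusion.

```python
import numpy as np

def interpolate_blinks(pupil, blink_mask, fs=250, max_blink_ms=300, pad=25):
    """Linearly interpolate pupil-size samples flagged as blinks.
    Anchors are the medians of `pad` samples before and after each blink;
    blinks longer than `max_blink_ms` are left untouched (those trials
    are excluded from analysis). Returns a repaired copy of the trace."""
    out = pupil.astype(float).copy()
    n = len(out)
    max_len = int(max_blink_ms * fs / 1000)
    i = 0
    while i < n:
        if blink_mask[i]:
            # Find the end of this contiguous blink run.
            j = i
            while j < n and blink_mask[j]:
                j += 1
            # Repair only short blinks with enough clean samples around them.
            if (j - i) <= max_len and i - pad >= 0 and j + pad <= n:
                start = np.median(out[i - pad:i])
                end = np.median(out[j:j + pad])
                # Linear ramp between the two median anchor values.
                out[i:j] = np.linspace(start, end, j - i + 2)[1:-1]
            i = j
        else:
            i += 1
    return out
```

Using medians rather than single adjacent samples as anchors makes the interpolation robust to the partial-occlusion artifacts that typically surround a blink.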
The data of two participants were excluded from the analysis because their misidentifications of the target together with trials containing blinks longer than 300 ms summed up to 50% of the trials. In addition, the data of four other participants were discarded due to computer or calibration failures. Following this, the data set contained the recordings of 67 participants, 35 of whom took part in DS and 32 in NS.
The statistical analysis of the data is based on the interval between 200 and 2000 ms after word onset. The first 200 ms after the onset of the target are needed to plan and perform the eye movement triggered by an auditory stimulus for a display with multiple pictures (
Pupil size data were recorded as pupil area alongside fixations at each sample point. However, eye movements may affect the measurement of pupil size. To ensure that such measurement artifacts did not introduce differences between the experimental conditions, we counted the number of fixations per trial. Within our analysis window of 200–2000 ms we counted on average three fixations. We found no differences between the experimental conditions, either for filler items or for critical items. Thus, if eye movements affected the measurements of pupil size, they did so equally for all conditions. Our approach of combining gaze fixation data with pupillary responses is similar to
To address the questions of whether lexical competition leads to increased pupil dilation and whether the course of pupil dilation is comparable for degraded and NS, we used two different baselines to compute two percentage changes in event-related pupil dilation (ERPD). Baseline 1 will enable us to study the pupil size within the time window of lexical competition. To specifically observe the effect of our experimental manipulation, and to limit other sources that can lead to changes in pupil dilation, baseline 1 is the interval that immediately precedes the manipulation. Baseline 2 will examine whether potential effects of lexical competition on pupil dilation are comparable across groups. More effortful processing of DS (
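The baseline-relative percentage change described above reduces to a simple normalization; a minimal sketch is given below, assuming (as is conventional in pupillometry, though not spelled out in the text) that the baseline value is the mean pupil size within the chosen window. The function name and the (start, end) window convention are ours.

```python
import numpy as np

def erpd_percent(trace, fs, baseline_window):
    """Percentage change in pupil size relative to the mean size within
    `baseline_window` (start, end), given in seconds from trace onset.
    For baseline 1 the window would immediately precede the manipulation;
    for baseline 2 it would precede the target word (assumed placements)."""
    s, e = (int(x * fs) for x in baseline_window)
    base = np.mean(trace[s:e])
    return (trace - base) / base * 100.0
```

Because each trial is normalized by its own baseline, slow drifts in absolute pupil size (arousal, lighting, fatigue) cancel out, and only the event-related change remains.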
The probability of listeners fixating the competitor was analyzed by means of logistic growth curve analysis models (
The pupil size data, as captured by the ERPD, were also analyzed by means of Growth Curve Analysis, as time curves of pupil dilation. The courses of dilation were analyzed as polynomial curves of third order, since a fourth-order term turned out to be redundant for describing the curve functions. The terms describing the curves are: the intercept, the slope of the function, and a coefficient for the curvature around the inflection point. The statistical models included the terms describing the curves, and an interaction of these three terms with the experimental condition (target-matching versus target-mismatching cues) and the presentation condition (NS versus DS). To account for individual variation, random effects of the curve terms were also included per participant.
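The fixed-effect backbone of such a Growth Curve Analysis can be sketched as follows: time is re-expressed as orthogonal polynomial terms up to the third order, and the curve is fit on those terms. This sketch deliberately omits the random effects per participant (which would require mixed-effects software) and uses a QR decomposition to construct the orthogonal terms; both the function names and this construction are our assumptions, not the authors’ pipeline.

```python
import numpy as np

def orthogonal_time_terms(n, order=3):
    """Orthogonal polynomial time terms (intercept, linear, quadratic,
    cubic) over n equally spaced time samples, built by orthogonalizing
    a Vandermonde matrix with QR decomposition."""
    t = np.linspace(-1, 1, n)
    X = np.vander(t, order + 1, increasing=True)  # columns: 1, t, t^2, t^3
    Q, _ = np.linalg.qr(X)
    return Q  # columns are mutually orthogonal time terms

def fit_curve(y, order=3):
    """Least-squares fit of a dilation (or fixation-probability) curve on
    the orthogonal time terms; returns one coefficient per term, which is
    what the condition interactions in the model are estimated on."""
    X = orthogonal_time_terms(len(y), order)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs
```

Orthogonal (rather than raw) polynomial terms are standard in GCA because they keep the intercept, slope, and curvature coefficients statistically independent, so each can be interpreted and tested separately.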
Of particular interest for this study is the question of statistical significance of the interactions between condition and experiment and the terms describing the course of the curves. These interactions were significant (see
Summary of the estimates of the statistical model used for the analysis of gaze fixations to the competitor.
| Factor | Estimate | Standard error | Significance |
|---|---|---|---|
| Curve intercept ∗ condition ∗ experiment | 11.76 | 1.13 | <0.001 |
| Curve slope ∗ condition ∗ experiment | 20.63 | 1.18 | <0.001 |
| Curve rise and fall ∗ condition ∗ experiment | 22.81 | 0.96 | <0.001 |
| Curve decline in tails ∗ condition ∗ experiment | 9.26 | 0.60 | <0.001 |
In sum, for the presentation with NS listeners’ gazes are quickly governed by the acoustic information in the signal: they fixate the competitor picture more often for stimuli that contain cues appropriate for the competitor.
The right panel in
The time-curves of the ERPD for the target-matching, target-mismatching and filler items are displayed in
Summary of the estimates of the statistical model used for the analysis of ERPD.
| Factor | Estimate | Standard error | Significance |
|---|---|---|---|
| Curve intercept ∗ condition (fillers versus matching cues) ∗ presentation (reference: NS) | -4.41 | 0.82 | <0.001 |
| Curve slope ∗ condition (fillers versus matching cues) ∗ presentation (reference: NS) | -4.90 | 0.82 | <0.001 |
| Curve rise and fall ∗ condition (fillers versus matching cues) ∗ presentation (reference: NS) | -1.87 | 0.82 | <0.03 |
| Curve intercept ∗ condition (fillers versus mismatching cues) ∗ presentation (reference: NS) | 14.41 | 0.82 | <0.002 |
| Curve slope ∗ condition (fillers versus mismatching cues) ∗ presentation (reference: NS) | -2.56 | 0.82 | <0.001 |
| Curve rise and fall ∗ condition (fillers versus mismatching cues) ∗ presentation (reference: NS) | -3.11 | 0.82 | <0.001 |
| Curve intercept ∗ condition (mismatching versus matching cues) ∗ presentation (reference: NS) | 14.11 | 0.86 | <0.001 |
| Curve slope ∗ condition (mismatching versus matching cues) ∗ presentation (reference: NS) | -2.81 | 0.86 | <0.002 |
| Curve rise and fall ∗ condition (mismatching versus matching cues) ∗ presentation (reference: NS) | -3.22 | 0.86 | <0.001 |
For NS (
The three way interactions are visualized in
The ERPD curves with baseline 2 captured the fact that in NS listeners’ pupil dilation increased gradually after the presentation of the target, reaching a peak only about 900 ms after the onset of the word. In the DS condition, however, pupil dilation was already increased at the onset of the target word. While the overall dilation was greater in DS, this pupil dilation curve shows a very even course over the entire analysis window. This suggests that, contrary to NS, where lexical disambiguation is at the source of increased pupil dilation, in DS participation in the experiment itself causes pupil dilation. Baseline 2 does not allow singling out individual processes at the source of pupil dilation, but we attribute the difference in ERPD calculated with baseline 2 to the demands that performing the experiment with DS posed on the participants. For NS
We investigated how signal degradation that simulates speech transmitted via CIs alters the time course of speech perception and the mental effort drawn upon during this course. To sum up, we find a similar course of lexical disambiguation between degraded and natural signals, with the main difference lying in the timing of the integration of durational cues and of the resolution of lexical competition. Furthermore, we find an increase in pupil dilation for listeners presented with NS, which is time-locked to lexical competition and to the perception of target-mismatching cues. A different pattern of mental effort was found for DS, with pupil dilation not increasing as a function of lexical processing but due to the presentation of DS throughout the experiment. Increased effort in processing DS appears to have its sources at the pre-lexical level, while increased pupil dilation in NS has its source in lexical processing.
Our results from the conjunct analysis of gaze fixations with pupil dilation show different timing in the processing of DS at pre-lexical and lexical levels. At the pre-lexical level these timing differences seem to be the result of automatic versus more effortful processing of the signal. At the lexical level these timing differences appear to be the consequence of processing at the pre-lexical level, with the corollary of different constraints on the selection of lexical candidates. For DS, increased mental effort has its source at the stages of pre-lexical processing, which further complicates lexical processing. The finding of increased pupil dilation due to mismatching acoustic cues in NS, however, points to a possibly different recruitment of mental resources for natural versus degraded speech.
For natural stimuli, the gaze fixation results replicate the study by
Increased processing or elevated activation of brain regions (
The timing of speech processing appears to be crucial for the seamless automatic transfer of information from pre-lexical to lexical levels of analysis. The processing of early post-sensory but pre-lexical levels of speech perception is likely to be constrained by the capacity of the auditory sensory memory (
In line with this interpretation are more recent findings on speech perception and attention.
For DS, our results show that lexical competition was slower, prolonged, and led to a less certain lexical decision. We also observed a reduced or delayed sensitivity to the durational cue, no increase in pupil dilation due to lexical competition or mismatching cues, and increased pupil dilation due to the demands of the experiment, whose main task consisted of listening to speech. This last finding is in line with previous results (
The lack of sensitivity to durational cues can partly be explained by the nature of the degradation. The reduction of the spectrotemporal details from NS likely disrupts the binding of acoustic features into categories and reduces neuronal synchronization (
The slower progress of information between pre-lexical and lexical stages is also compounded by the fact that the signal does not resemble listeners’ mental representations.
Our results show that it is more difficult to revise built-up lexical expectations upon hearing DS signals. The delay on pre-lexical levels might have opened up the opportunity to build up stronger, and in this case, misleading lexical hypotheses about the word that was being processed. This explanation is supported firstly by the observed prolonged lexical competition, and secondly by the uncertainty about the lexical decision after disambiguating acoustic information was presented in DS. In line with this,
While we argue that the source of effort is the pre-lexical processing, there are also alternative explanations for the lack of an additive effect of lexical competition on pupil dilation for degraded signals. Firstly, it is likely that pupil dilation was not able to capture or differentiate additive effects of lexical competition and listening to DS. Secondly, the attentional resources that a listener can draw upon may be depleted by the attention directed toward the processing of degraded signals. A third explanation is that delayed reception of acoustic cues in degraded signals obscures lexical competition and alters the more targeted engagement of attentional resources found in NS. The processing effort found in natural signals would then not be comparable to the effort evoked by lexical competition for degraded signals. Though the three explanations are not mutually exclusive, we believe that the fixation data combined with the pupil dilation data provide some support for the last explanation. The gaze fixations show that lexical competition is delayed and prolonged for degraded signals, and we see increased pupil dilation due to listening to DS. Listeners’ engagement in lexical competition may be gated by attentional resources, and a constant effortful processing may disengage the automatic attentional processes that are supposed to be driven by the signal, making lexical competition a less automatic process.
To our knowledge this is the first study that combined measures of the time course of speech perception, in gaze fixations, with mental effort, in pupil dilation. Even though the sources underlying pupil dilation are manifold and difficult to strictly separate, and more research is under way to investigate these sources, we believe that our study offers a contribution to this search. Speech perception can be an effortful task, in particular for CI users, but also in everyday non-optimal interactions. Our study shows the involvement of mental resources in processes that are fundamental to speech perception, and how well-adjusted timing of information processing can conceal this involvement. We regard experience with the task, i.e., speech perception, as the source of the well-timed flow of information between the stages of speech perception. An intriguing research question for the future is whether early exposure to degraded signals will lead to a similar fine adjustment of speech processing, for instance in CI users who were implanted within the first year of their life. Related to this is also the fundamental question of the role that spectrotemporal details play in the process of well-timed speech processing and in the regulation of attentional resources.
The author AW developed the concept of this study, acquired the data, analyzed and interpreted the results, and wrote the paper. AW gives the final approval of the version to be published, and agrees to be accountable for all aspects of the work. The author PT contributed to the data acquisition, and data analysis, and revised critically the final version of this paper for important intellectual content. PT gives the final approval of the version to be published, and agrees to be accountable for all aspects of the work. The author DB enabled the data acquisition, contributed to the interpretation of the results, and critically revised previous and the final version of this paper for important intellectual content. DB gives the final approval of the version to be published, and agrees to be accountable for all aspects of the work.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank Prof. Frans Cornelissen (University Medical Centre Groningen) for providing the eye-tracker for this study, and Prof. Stuart Rosen (University College London) for lending us his scripts with the vocoding functions. We are also grateful to Jop Luberti for creating the experimental pictures. The study is part of the research program of our department: Healthy Aging and Communication.
The Supplementary Material for this article can be found online at: