Different Measures of Auditory and Visual Stroop Interference and Their Relationship to Speech Intelligibility in Noise

Knight, Sarah; Heinrich, Antje

doi:10.3389/fpsyg.2017.00230

ORIGINAL RESEARCH article

Front. Psychol., 17 March 2017

Sec. Auditory Cognitive Neuroscience

Volume 8 - 2017 | https://doi.org/10.3389/fpsyg.2017.00230

Sarah Knight*

Antje Heinrich

Medical Research Council Institute of Hearing Research, University of Nottingham, Nottingham, UK

Inhibition—the ability to suppress goal-irrelevant information—is thought to be an important cognitive skill in many situations, including speech-in-noise (SiN) perception. One way to measure inhibition is by means of Stroop tasks, in which one stimulus dimension must be named while a second, more prepotent dimension is ignored. The to-be-ignored dimension may be relevant or irrelevant to the target dimension, and the inhibition measure—Stroop interference (SI)—is calculated as the reaction time difference between the relevant and irrelevant conditions. Both SiN perception and inhibition are suggested to worsen with age, yet attempts to connect age-related declines in these two abilities have produced mixed results. We suggest that the inconsistencies between studies may be due to methodological issues surrounding the use of Stroop tasks. First, the relationship between SI and SiN perception may differ depending on the modality of the Stroop task; second, the traditional SI measure may not account for generalized slowing or sensory declines, and thus may not provide a pure interference measure. We investigated both claims in a group of 50 older adults, who performed two Stroop tasks (visual and auditory) and two SiN perception tasks. For each Stroop task, we calculated interference scores using both the traditional difference measure and methods designed to address its various problems, and compared the ability of these different scoring methods to predict SiN performance, alone and in combination with hearing sensitivity. Results from the two Stroop tasks were uncorrelated and had different relationships to SiN perception. Changing the scoring method altered the nature of the predictive relationship between Stroop scores and SiN perception, which was additionally influenced by hearing sensitivity. These findings raise questions about the extent to which different Stroop tasks and/or scoring methods measure the same aspect of cognition. 
They also highlight the importance of considering additional variables such as hearing ability when analyzing cognitive variables.

Introduction

Inhibition—the ability to suppress goal-irrelevant information (MacLeod, 1991)—is thought to be important in many situations. One of these situations is speech-in-noise (SiN) perception, in which listeners aim to focus on the foreground (target speech) and ignore the background (distractor) sound. The ability to inhibit irrelevant information has been suggested to worsen with age (Hasher and Zacks, 1988), with implications across a variety of cognitive domains including language, memory, and attention (Stoltzfus et al., 1996; Burke, 1997). This cognitive decline has potential consequences for everyday activities such as reading and text comprehension (Dywan and Murphy, 1996) and even engaging in appropriate social behavior (von Hippel, 2007). The ability to understand speech-in-noise is also observed to worsen with age, affecting the ability to hold conversations and engage in social activities (CHABA, 1988). Given the suggested importance of inhibition for SiN perception, researchers have begun to ask whether or not age-related declines in inhibition could account, at least in part, for the observed difficulties older adults have when listening in noisy environments. However, answering this question has been made difficult by the fact that it is not clear what role modality plays in the measurement of inhibition (whether or not inhibition tasks in different modalities measure the same underlying ability) and whether the standard scoring method adequately accounts for other, unconnected, age-related changes.

In the following section we introduce two types of Stroop task, a paradigm commonly used to assess inhibitory abilities and the focus of this study. We first explain the nature of Stroop tasks, and discuss the effect perceptual modality has on task outcomes. Next, we explore the effect of age-related changes on Stroop interference and consider potential underlying mechanisms. We also discuss how the most common outcome measure of Stroop interference, reaction times (RTs), may relate to strength of inhibition, and propose that trials which are responded to more slowly may not only represent inhibition more accurately than trials responded to more quickly but may also better reveal differential levels of inhibition between participants. We then turn to speech-in-noise perception, and discuss the possible role of inhibition in SiN perception. In particular, we focus on the role inhibition plays during lexical access, a key element of speech perception, and consider how changes across the lifespan in lexical access might indicate age-related changes in inhibition. Finally, we discuss the results obtained from existing studies designed to test the relationship between inhibition and SiN perception, which have been mixed, and suggest some reasons why these discrepancies might arise.

Stroop Tasks

One common means of assessing inhibition is by using variants of the Stroop task (Stroop, 1935). In the traditional visual color-word Stroop task (ibid.), participants are required to name the ink color of a string of letters, irrespective of the letters themselves. The string of letters can be either meaningless (e.g., XXXX)—the neutral condition—or can form a conflicting color word (e.g., BLUE printed in red)—the incongruent condition. Since word reading is a more prepotent response than color naming in this situation (Melara and Algom, 2003), word naming has the potential to interfere with color naming. In order to prevent this interference, participants must attempt to inhibit, or suppress, the incongruent word. The difference in reaction time (RT) between color naming in the neutral condition and color naming in the incongruent condition is taken as a measure of inhibitory ability, and termed Stroop interference (SI). Besides the traditional visual paradigm, auditory versions of the Stroop task have also been successfully used (e.g., Green and Barber, 1981; Morgan and Brandt, 1989). In auditory Stroop tasks, participants are required to respond as quickly as possible to some perceptual feature of a word (e.g., speaker gender, voice pitch, stimulus location) while ignoring the semantic information, which can be either irrelevant (e.g., “cat”) or conflicting (e.g., “man” spoken by a woman, “low” in a high-pitched voice, “right” heard in the left ear). Again, SI is typically obtained by calculating the difference in reaction time between feature naming with irrelevant semantic content and feature naming with an incongruent semantic distractor.

Stroop Tasks across Modalities

The visual and auditory versions of the Stroop task are generally assumed to tap the same underlying domain-general inhibitory ability; however, the relationship between the two measures and the extent to which this assumption is true remains unclear. On the one hand, there is evidence to suggest that carefully-matched Stroop tasks presented across different modalities do probe shared inhibitory processes, producing similar patterns of neural activation and correlated behavioral responses (Roberts and Hall, 2008). On the other hand, it has been shown that, even within the same modality, measures of inhibition that are not so closely matched do not correlate within individuals, suggesting either that there is no single inhibitory function supporting performance across different tasks and/or that task-specific demands determine individual differences more strongly than general inhibitory abilities (Shilling et al., 2002). This suggests that any two inhibition tasks, either within or across modalities, are unlikely to be comparable unless they have been deliberately matched, and in particular that an auditory Stroop task cannot automatically be assumed to be an alternative way of measuring the same ability tapped by a given visual Stroop task. In the current study we will address the question of the relationship between visual and auditory versions of the Stroop task by comparing scores from the same participants on an auditory and a visual Stroop task, both deliberately chosen to meet certain criteria.

Age-Related Declines in Stroop Performance

When calculated in the traditional way, SI (Stroop interference) on both visual and auditory tasks is generally observed to increase with age, implying a worse performance on the incongruent Stroop task compared to the neutral condition and—hence—poorer inhibition. However, it has long been recognized that no task is ever a “pure” measure of a given cognitive function, but instead includes other, additional processes—something referred to as the “impurity principle” (Surprenant and Neath, 2009). In the case of the Stroop task, it has been suggested that these age-related increases in SI could be due, at least in part, to just such additional processes; that is, that there are potential confounds with non-inhibitory factors created by the methods typically used to calculate SI (Ben-David and Schneider, 2009)—and that methods should be used which account for these factors.

One of these confounds is generalized age-related slowing. In the traditional SI measure, inhibition is represented by the absolute difference in time taken to name the background color between conditions with and without a distracting color word. A change in the speed of processing would slow performance on all tasks by the same factor (Cerella and Hale, 1994; Verhaeghen and Cerella, 2002), leading to a proportional increase of RTs in the incongruent and neutral conditions; this would result in a larger absolute difference between RTs in the two conditions, and thus a larger SI (Shilling et al., 2002; Ben-David and Schneider, 2009). Crucially, in such a case the increased SI does not necessarily represent any decline in inhibitory ability, but a change in processing speed. One way to address this issue is to use a method for calculating Stroop scores which accounts for, or factors out, changes in overall processing speed. For example, it is possible to use normalized scores, in which the RT in the incongruent condition is divided by the RT in the neutral condition, thus removing any changes in SI caused by proportional RT increases in both conditions. This is further discussed in “Calculating Visual Stroop scores” in the Materials and Methods section below.
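The argument can be written out in two lines. As a sketch, assume generalized slowing multiplies the RTs in both conditions by a common factor k > 1 (the symbol k is introduced here purely for illustration), with Ci and Cn denoting the incongruent and neutral RTs. The difference score then grows with k, whereas the ratio score is unchanged:

```latex
\[ k\,Ci - k\,Cn = k\,(Ci - Cn) \]
\[ \frac{k\,Ci}{k\,Cn} = \frac{Ci}{Cn} \]
```

Under this assumption of purely proportional slowing, any change in the ratio score must come from something other than the common slowing factor.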

While a generalized slowing of processing speed is expected to affect Stroop tasks across different modalities in similar ways, the confounding effects of sensory change will be specific to the perceptual domain of any given Stroop task. For visually presented Stroop tasks, such confounding effects may be particularly critical when they adversely affect the RT of the incongruent condition. If we accept the proposal of Melara and Algom (2003) that the Stroop interference effect arises due to a failure to inhibit the more rapidly accessed printed word until access to the incongruent color name is achieved, then changes in color vision may make access to the color name slower and/or more difficult, thereby increasing reaction times during color naming (Ben-David and Schneider, 2010). Such changes could be brought about by age-related yellowing of the lens and a loss of photoreceptors (Werner and Steele, 1988; Anstey et al., 2002). These age-related changes in color vision do not affect word reading (Salthouse and Meinz, 1995), the speed of which remains largely unchanged with age provided the words are sufficiently legible (Akutsu et al., 1991). As a result, the difference between the time taken to read incongruent words and to name ink colors will be much greater for individuals with an age-related decline in color vision than for those with better color vision (i.e., younger adults). Melara and Algom (2003) characterized this discrepancy between color naming speed and reading speed as the “Dimensional Imbalance,” or DI. Having a larger DI (that is, a greater discrepancy in processing time between reading and color naming) puts individuals at an increased risk of a failure of inhibition (as expressed in larger SIs), since participants have to suppress the irrelevant word for longer. In this case, then, increased SI scores may reflect a combination of reduced inhibitory control and an increased likelihood of inhibitory failure caused by differences in processing speed for words as opposed to colors (i.e., a large DI). One way to address this issue is to use a method for calculating Stroop scores which accounts for, or factors out, differences in DI. For example, it is possible to regress RTs in the incongruent condition on DI scores, and then use the residuals as a measure of Stroop interference. This is further discussed in “Calculating Visual Stroop scores” in the Materials and Methods section below.
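The residual approach can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used in the study: it assumes a simple one-predictor least-squares regression, takes DI to be the color naming time minus the reading time, and uses made-up times for five hypothetical participants. The function name `residualize` and all variable names are introduced here for illustration only.

```python
from statistics import fmean


def residualize(y, x):
    """Return the residuals of y after a simple least-squares regression on x."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical times in seconds for five participants (not study data).
cn = [30.0, 34.0, 28.0, 40.0, 36.0]  # color naming without a distractor word
rn = [20.0, 21.0, 19.0, 22.0, 20.0]  # word reading
ci = [45.0, 55.0, 41.0, 62.0, 60.0]  # color naming with an incongruent word

di = [c - r for c, r in zip(cn, rn)]  # assumed definition: DI = Cn - Rn
si_resid = residualize(ci, di)        # DI-adjusted interference scores
```

A positive residual marks a participant whose incongruent-condition time is longer than their DI alone would predict, which is the part of the score that this approach attributes to interference.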

In the current study we will examine the effect of general age-related slowing and age-related sensory changes by comparing alternative scoring methods that capture age-related changes in inhibitory ability to different extents.

RT Distributions in Stroop Tasks

In addition to questions of how to appropriately capture the differential age trajectories of the processes contributing to the overall effect, there is a further issue with the way in which Stroop scores are traditionally calculated, namely that they usually use an average score over all trials. If it is true (e.g., Ridderinkhof et al., 2004) that the strength of inhibition depends on the overall processing time, with the slowest responses allowing more time for inhibition to build up, then differences in inhibitory ability are likely to be most evident during those trials with the longest reaction times. That is, trials with longer reaction times will be more informative when assessing inhibitory differences than trials with shorter reaction times, since the gap between those with good inhibition and those with poor inhibition will be at its most pronounced. In averaging over all trials, the traditional SI measure may blur crucial information by mixing outcomes from some informative (slow) trials with outcomes from many uninformative (fast) trials. In the second part of the paper we will examine this hypothesis by investigating the differing extent of Stroop interference for slow and fast trials.
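One simple way to look at interference separately for fast and slow trials is sketched below. It is an illustration under stated assumptions, not the procedure used in this study: trials are split into a faster and a slower half within each condition, and the RTs are invented values for a single hypothetical participant. The function names are hypothetical.

```python
from statistics import fmean


def split_fast_slow(rts):
    """Sort RTs and split them into a faster half and a slower half."""
    ordered = sorted(rts)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


def interference_by_speed(incongruent, neutral):
    """Return (SI for fast trials, SI for slow trials) as mean RT differences."""
    inc_fast, inc_slow = split_fast_slow(incongruent)
    neu_fast, neu_slow = split_fast_slow(neutral)
    return (fmean(inc_fast) - fmean(neu_fast),
            fmean(inc_slow) - fmean(neu_slow))


# Hypothetical RTs in milliseconds for one participant (not study data).
neutral = [520, 540, 560, 600, 650, 700]
incongruent = [540, 565, 600, 660, 740, 820]

si_fast, si_slow = interference_by_speed(incongruent, neutral)
```

With these invented values the slow-trial interference is larger than the fast-trial interference, which is the pattern the hypothesis predicts; real data need not behave this way.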

Speech-in-Noise Perception and Inhibition

Research into SiN perception difficulties in older adults has revealed that only some of these difficulties can be accounted for by hearing loss, and that other abilities must play a role (Schneider and Pichora-Fuller, 2000; Wingfield and Tun, 2007). One of those abilities is cognition, which must be examined alongside hearing loss in order to better explain age-related difficulties (Akeroyd, 2008). Cognition is not a unitary construct, and has many different components. The exact number and nature of the cognitive components varies across different cognitive models; however, inhibition is generally identified as a core ability (e.g., Conway and Engle, 1994; Friedman and Miyake, 2004; Baddeley, 2012; Diamond, 2013). Two potential ways in which inhibition may affect SiN perception have been suggested. First, poor inhibition may increase susceptibility to background noise during SiN listening (Janse, 2012). This implies not only that those with poor inhibition will perform worse on SiN tasks than those with good inhibition, but also that their difficulties may increase disproportionately as the signal-to-noise ratio (SNR) becomes more adverse. Second, it is suggested that poor inhibition may make it harder for listeners to successfully select the target during lexical access (Sommers and Danielson, 1999).

Lexical Access and Inhibition

One way to conceptualize lexical access is in terms of the Neighborhood Activation Model (NAM) (Luce and Pisoni, 1998). The NAM proposes that items in the mental lexicon are organized into similarity neighborhoods, defined as all words that can be created from a target item by adding, deleting or substituting a single phoneme. Any given target word will activate both the target and, to varying degrees, its surrounding neighborhood, which may be large (dense) or small (sparse); furthermore, words which are more commonly encountered (have a high frequency of occurrence) will be activated more strongly than those less commonly encountered. Words are therefore classified as “lexically easy” if they have a high word frequency and relatively sparse neighborhoods, and as “lexically hard” if they have a low word frequency and relatively dense neighborhoods (e.g., Taler et al., 2010). It is assumed that inhibition plays a larger role in the perception of lexically hard words than easy words (Sommers and Danielson, 1999). It is therefore expected not only that listeners will be less likely to correctly identify lexically hard words than lexically easy words, but also that individual differences in inhibition will relate more closely to the perception of lexically hard words than lexically easy words. The first prediction has been borne out experimentally in studies with normal-hearing adults (Sommers and Danielson, 1999; Taler et al., 2010; Helfer and Jesse, 2015), children (Eisenberg et al., 2002), cochlear implant users (Kaiser et al., 2003; Bierer et al., 2016) and native and non-native speakers (Bradlow and Pisoni, 1999); the second prediction has also received some experimental support (Sommers and Danielson, 1999; Taler et al., 2010) and will be further tested in the current study.
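The one-phoneme definition of a similarity neighborhood can be illustrated with a short sketch. The toy lexicon and its phoneme transcriptions below are assumptions made for illustration, as are the function names; a real analysis would draw on a full pronunciation dictionary and word-frequency counts.

```python
def is_neighbor(a, b):
    """True if phoneme sequences a and b differ by exactly one
    substitution, addition or deletion."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    # Lengths differ by one: removing one phoneme from the longer
    # sequence must give the shorter one (addition or deletion).
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


# Toy lexicon with illustrative transcriptions (not study materials).
LEXICON = {
    "cat": ("k", "ae", "t"),
    "bat": ("b", "ae", "t"),
    "cut": ("k", "ah", "t"),
    "at": ("ae", "t"),
    "cast": ("k", "ae", "s", "t"),
    "dog": ("d", "ao", "g"),
}


def neighborhood(word):
    """Return the sorted list of lexicon entries that neighbor the given word."""
    target = LEXICON[word]
    return sorted(w for w, phones in LEXICON.items()
                  if is_neighbor(target, phones))
```

In this toy lexicon "cat" has a neighborhood of four words while "dog" has none, so "cat" would count as the denser of the two.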

Lexical access can also be affected by the semantic context provided by the words preceding the target: a certain semantic context can markedly increase the likelihood that a given word will occur. It is commonly found that recognition is better for words in semantically meaningful sentences than words in isolation (Miller et al., 1951; Nittrouer and Boothroyd, 1990), and for items in sentences with higher as opposed to lower semantic predictability (Bilger et al., 1984). These findings can also be explained in terms of the NAM: as semantic information builds over the course of a sentence, it increases activation levels for contextually consistent words (Sommers and Danielson, 1999).

The phenomenon of retrieval-induced forgetting has also been suggested by some researchers (e.g., Anderson et al., 1994; Aslan and Bäuml, 2011) as evidence for the role of active inhibition in lexical access [however, see e.g., MacLeod et al. (2003) and Williams and Zacks (2001) for alternative interpretations]. Retrieval-induced forgetting refers to a situation in which recall for verbal material suffers when related material (e.g., a member of the same category) has earlier been cued and correctly recalled. This suggests that inhibitory processes suppress relevant but uncued material during the initial recall phase, leading to poorer recall for that same material later.

Age-Related Changes in Inhibition and Lexical Access

The fact that effects of lexical difficulty and semantic context on word recognition vary through the lifespan has been taken as indicating age-related changes in inhibition. For example, the finding that identification of isolated lexically hard words declined with age, while performance for isolated lexically easy words was comparable for younger and older listeners, was interpreted by Sommers (1996) as reflecting an age-related decline in inhibitory control: since competing words from the target's neighborhood have to be suppressed or inhibited for successful word identification, poorer inhibition would reduce the ability to perform the required suppression of competing words and hence result in lower performance for lexically hard words. Results from the audiovisual (AV) domain have been interpreted in a similar vein: the finding that older adults were disproportionately poorer at identifying words with dense audiovisual neighborhoods was taken as indicating an age-related decline in inhibition (Dey and Sommers, 2015); this hypothesis was supported by the fact that Stroop scores predicted AV word recognition in older, but not younger, adults. Finally, Sommers and Danielson (1999) attribute Pichora-Fuller et al.'s (1995) finding that older listeners benefitted more from the addition of semantic context than younger listeners to higher activation of contextually consistent words amongst older listeners due to increased linguistic experience.

However, it is important to note that several studies have failed to show a relationship between inhibitory abilities and SiN perception (Gilbert et al., 2013; Helfer and Freyman, 2014). It is unclear why these discrepancies arose, but one possibility is that the differences were due, at least in part, to the methodological issues described above. Although all of these studies used Stroop tasks to assess inhibition, they differed in the modality of the task used (auditory vs. visual), and in the way in which Stroop interference was calculated. In particular, some used traditional SI scores, which as discussed above may be subject to confounds with generalized slowing and/or sensory decline, while others used adjusted scoring systems that may have accounted for slowing, poor color vision or both. In order to better understand the relationship between inhibition, SI scores and SiN perception, and to investigate how the predictive relationship between SI scores and SiN perception changes depending on whether or not possible confounds in the SI measures have been taken into account, we assessed the predictive value for SiN perception of SI measures derived from an auditory and a visual Stroop task using scoring methods that did or did not account for possible age-related confounds. If the power of Stroop scores to predict SiN perception is based on their ability to measure inhibition, then a purer inhibitory measure free from age-related confounds should improve prediction. However, Stroop scores may primarily measure more general age-related changes, such as generalized slowing and sensory declines. Since generalized slowing will affect performance across a range of tasks, and sensory declines are likely to be shared across the visual and auditory domains (Lindenberger and Baltes, 1994), the predictive relationship between Stroop scores and SiN perception may be based more strongly on these age-related changes than on inhibition. 
If this is the case, then the traditional, unadjusted SI measures should prove more useful in predicting SiN performance.

Hypotheses

Different Scoring Systems

H1: Scoring methods can be devised that do or do not take age-related changes in processing speed and sensory decline (i.e., poorer color vision) into account. If non-inhibitory age-related changes are independent contributors to Stroop scores alongside inhibitory ability (Melara and Algom, 2003), we would expect a low correlation between traditional scores, which do not account for these age-related changes, and the new scores, which do.

H2: Stroop scores can be calculated across all trials, or only across trials which are responded to particularly slowly or quickly. We expect the size of the Stroop effect to be larger on average for the slower trials than the faster trials, since a proportional slowing of both longer (incongruent trial) and shorter (neutral trial) RTs leads to a larger difference between the two overall when using the traditional calculation method. If it is true that differences in inhibitory ability are more in evidence when participants take longer to respond (Ridderinkhof et al., 2004), then we also expect to see greater variation in individual Stroop effects when examining slower trials as opposed to faster trials.

Visual vs. Auditory Tasks

H3: The results from the visual and auditory Stroop tasks will be broadly comparable, assuming that (a) inhibition is a modality-independent general cognitive ability, (b) inhibition influences individual performance to a greater extent than do task-specific demands, and (c) the two types of task are tapping into the same ability. If this is not the case, this raises questions about the extent to which the two tasks measure the same aspect of cognition.

Relationship to SiN Tasks

H4: Based on previous studies (Sommers and Danielson, 1999; Janse, 2012) we predict larger Stroop interference (SI) scores to be predictive of worse performance on SiN tasks—that is, a negative relationship between SI scores and SiN scores. If SI scores provide a genuine measure of inhibitory ability, then this relationship should be particularly strong when the SiN stimuli demand high levels of inhibition: at lower (less favorable) SNRs, when sentential context is lacking (i.e., when targets are isolated words), when target words have a low word frequency and/or high neighborhood density, or when semantic context does not aid inference (i.e., when targets appear in low-predictability sentences). It is possible that these effects may be particularly pronounced for those with poorer hearing sensitivity (Helfer and Jesse, 2015).

H5: If the relationship between SI scores and SiN perception is partially driven by shared sensory decline, we might expect the predictive power of Stroop interference for speech perception to decrease once sensory decline is taken into account. If, on the other hand, it is the inhibition component of the Stroop task that drives the relationship with speech perception, then a purer measure less affected by sensory change might improve the association between the two measures.

H6: Based on previous studies suggesting that differences in inhibitory ability are more in evidence when participants take longer to respond (Ridderinkhof et al., 2004), we expect Stroop scores derived from slower trials to be better predictors of SiN perception than scores derived from faster trials or averages across all trials.

Materials and Methods

Participants

Participants were 50 adults aged over 60 (mean: 69.5 years, SD: 6.4, range = 61–86) with mild hearing loss. A sample size of N = 50 allowed for the detection of a medium-sized effect (r = 0.35) at alpha (two-tailed) = 0.05 with a probability of 80%. This was deemed sufficient given that the most closely related previous studies (Sommers and Danielson, 1999; Janse, 2012) typically show medium-to-large effect size correlations. Exclusion criteria were hearing aid use and non-native English language status. This study was carried out in accordance with the recommendations of the University of Nottingham's Code of Research Conduct and Research Ethics. All participants gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the University of Nottingham's School of Psychology Ethics Committee (ref. 464).

Visual acuity was assessed using a Landolt C Chart, and color vision was tested using the card version of the City University Color Vision Test. All participants were able to successfully read a full line of optotypes on the Landolt C Chart at a logMAR value of at least 0.3, with the majority (34) able to read a full line at between −0.1 and 0.1 logMAR. Four participants failed the Color Vision Test, and these same participants also verbally reported color blindness; they were excluded from the visual Stroop task. No other participant reported any difficulty in reading the test materials for the visual Stroop task. Two participants were excluded from the auditory Stroop task due to technical failure. Additionally, all participants were screened for mild cognitive impairment (MCI) using the Montreal Cognitive Assessment (MoCA) (mean: 27.86; SD: 1.95).

The reported results are part of a larger study into cognitive contributions to speech perception in older adults. Unreported results do not relate to the topics discussed in this paper.

Auditory Measures

Pure-tone air-conduction thresholds (PTA) were collected for nine frequencies between 0.25 and 8 kHz for each ear, following the procedure recommended by the British Society of Audiology (British Society of Audiology, 2011) using an Interacoustics Audiometer AT235 (Interacoustics, Middelfart, Denmark) and TDH39P headphones (Telephonics, Farmingdale, NY, USA). Mean thresholds as a function of frequency are presented in Figure 1. As this figure shows, there was considerable variability between participants in terms of hearing sensitivity, particularly at the higher frequencies.

Figure 1. Mean PTA thresholds as a function of frequency. Bars indicate ±1 standard deviation.

Speech reception thresholds (SRT) were obtained using 30 sentences from the Adaptive Sentence List (Macleod and Summerfield, 1990). Sentences were initially presented at 60 dB SPL, with a one-down-one-up procedure and step sizes of 10 dB down, then 5 dB up for the first reversal; the remainder of the trials used a three-down-one-up procedure with a step size of 2 dB. The last two reversals were averaged to determine the 79% accuracy point (Levitt, 1971). Based on this, all auditory stimuli used throughout the study, including the auditory Stroop stimuli, were presented at 30 dB SL—that is, 30 dB above each participant's individual threshold. This procedure was used to partially control for differences in intelligibility in quiet due to the considerable range in participants' hearing sensitivity.
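The 79% figure follows from the three-down-one-up rule. As a brief sketch, assume trials are independent and let p be the probability of a correct response at the level on which the track settles. The level is lowered only after three consecutive correct responses, so the track is stable where a downward step is as likely as an upward one:

```latex
\[ p^{3} = 0.5 \quad\Longrightarrow\quad p = 0.5^{1/3} \approx 0.794 \]
```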

Stroop Tasks

In the visual Stroop task, modeled after Janse (2012), participants were presented with grids formed of 48 boxes in an 8 × 6 arrangement. There were three types of grid: (i) a reading grid, consisting of white boxes containing black color words; (ii) a control grid, consisting of colored boxes containing the string “XXXX” in black; (iii) an interference grid, consisting of colored boxes containing mismatched color words in black. The colors used were red, blue, green and brown. Using relatively large boxes of color instead of font color maximized the opportunity for older participants to clearly see the colors. The distractor words were printed in black and displayed in each box using 20 pt Calibri font. In order to ensure best possible visibility the light in the test room was always at least 880 lux and was set in such a way that each participant could optimally see colors and text without experiencing glare. For (i), the task was to read the words aloud as quickly and accurately as possible. For (ii) and (iii), the task was to name the background color of the boxes as quickly and accurately as possible. There was a short practice session for each of the 3 tasks. Participants saw two versions of each grid. The total time taken to complete each grid was timed by the experimenter using a stopwatch, and overall scores for each grid type were calculated by averaging the two times obtained. Some participants made errors on the interference grid. In these cases, no penalty was applied if they corrected their mistake. Uncorrected mistakes were penalized by calculating the participant's average time per item on the interference grid in question, then adding this duration to their total grid time once for each mistake. The times for the reading and control grids represented error-free performance for all participants.
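The error penalty and the averaging step can be expressed compactly. The sketch below follows the description above, with one assumption made explicit: the per-item time is taken from the unpenalized completion time of the grid in question. The function names and the example times are hypothetical.

```python
ITEMS_PER_GRID = 48  # 8 x 6 boxes per grid


def penalized_grid_time(total_seconds, uncorrected_errors, items=ITEMS_PER_GRID):
    """Add one average per-item time to the grid time for each uncorrected error."""
    per_item = total_seconds / items
    return total_seconds + uncorrected_errors * per_item


def grid_score(time_a, time_b):
    """Average the times from the two versions of a grid type."""
    return (time_a + time_b) / 2


# Hypothetical example: version A took 60 s with two uncorrected errors,
# version B took 58 s with none.
version_a = penalized_grid_time(60.0, uncorrected_errors=2)
version_b = penalized_grid_time(58.0, uncorrected_errors=0)
interference_score = grid_score(version_a, version_b)
```

In this example each item on version A took 1.25 s on average, so two uncorrected errors raise that grid's time from 60 s to 62.5 s before averaging.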

In the auditory Stroop task, modeled after Sommers and Danielson (1999), participants heard two male and two female speakers, and were required to respond as quickly and accurately as possible to the gender of the speaker. Any given trial consisted of one of three words: “mother,” “father” or “person.” These words could therefore be congruent with gender (e.g., female + “mother”), incongruent with gender (e.g., male + “mother”) or neutral (“person”). RTs for gender decisions were obtained via button presses. Participants always used their self-reported dominant hand to respond, and returned their hand to the rest position in front of the button box after the end of each trial. For each trial, the RT was measured from the onset of the sound file; however, the recordings had been trimmed so that, for the words “father” and “person,” voicing started at a similar point in all files (around 13 ms after onset for “father,” and around 7 ms after onset for “person”). For “mother,” voicing was considered to start early enough that the point of vowel onset was not meaningfully different between any of the four recordings. The location (left/right) of the buttons corresponding to “female” and “male” was swapped for half of the participants. Participants received a short practice session containing all three conditions before the start of the task.
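The mapping from speaker gender and spoken word to trial condition can be sketched as a small lookup. The names `WORD_GENDER` and `trial_condition` are hypothetical and serve only to illustrate the design described above.

```python
# Gender implied by each word; "person" carries no gender and is neutral.
WORD_GENDER = {"mother": "female", "father": "male", "person": None}


def trial_condition(speaker_gender, word):
    """Classify a trial as 'congruent', 'incongruent' or 'neutral'."""
    word_gender = WORD_GENDER[word]
    if word_gender is None:
        return "neutral"
    if word_gender == speaker_gender:
        return "congruent"
    return "incongruent"
```

For instance, `trial_condition("male", "mother")` gives `"incongruent"`, matching the male + “mother” example in the text.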

Calculating Visual Stroop Scores

The Stroop interference measure (SI) traditionally used in the literature (MacLeod, 1991) is calculated as follows:

\begin{array}{l} vSI_{raw} = Ci - Cn & (1) \end{array}

One problem with using the traditional SI measure as an estimate of inhibition in older adults is that there can be age-related changes in general processing speed (Ben-David and Schneider, 2009). Such slowing would be expected to lengthen completion times on incongruent (Ci) and neutral (Cn) trials by the same factor. Because Ci is larger than Cn, the same factor produces a larger absolute increase in Ci than in Cn, and therefore a larger SI value when the difference between the two conditions is calculated. A possible way to account for this age-related change and minimize its effect on interference estimates is to use a normalized measure of Stroop interference. This can be calculated as follows:

\begin{array}{l} vSI_{norm} = Ci / Cn & (2) \end{array}
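
The effect of generalized slowing on measures (1) and (2) can be shown with a small numerical sketch. The times and the slowing factor of 1.5 are hypothetical values chosen for illustration, and the function names are not from the original study.

```python
def si_raw(ci, cn):
    """Traditional Stroop interference, Equation (1): Ci - Cn."""
    return ci - cn


def si_norm(ci, cn):
    """Normalized Stroop interference, Equation (2): Ci / Cn."""
    return ci / cn


# Hypothetical baseline times in seconds: (Ci, Cn).
baseline = (40.0, 30.0)

# Generalized slowing multiplies both conditions by the same factor.
slowing = 1.5
slowed = tuple(slowing * t for t in baseline)  # (60.0, 45.0)

# The raw difference grows with slowing (10.0 -> 15.0) ...
print(si_raw(*baseline), si_raw(*slowed))
# ... whereas the ratio is unchanged, up to floating-point rounding.
print(si_norm(*baseline), si_norm(*slowed))
```

In this sketch the raw score increases by 50% although relative interference is identical in the two cases, which is the motivation for the normalized measure.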

Another problem with the visual SI measure is that the different age-related trajectories for color vision (declining) and reading speed (stable) mean that color naming RTs in the neutral condition (Cn) may slow with age relative to reading speed (Rn) (Salthouse and Meinz, 1995). The Stroop effect originates from the difference in time course between color naming in the presence vs. absence of a readable distracting color word. If color naming slows while word reading remains unchanged with age, then there will be a greater difference in processing speed between the color naming and reading dimensions, and this puts participants at greater risk of inhibition failure in the incongruent (distractor) condition: that is, if a participant's color naming speed is relatively slow compared to their reading speed, they have to suppress the irrelevant word for longer, and this increases their chances of experiencing an inhibition failure.

Melara and Algom (2003) refer to the discrepancy between access to words and color names as the Dimensional Imbalance (DI) i.e.,

\begin{array}{l} DI = Cn - Rn & (3) \end{array}

Thus, a large DI score indicates a slow color naming speed relative to reading speed. Melara and Algom (2003) found DI to be strongly positively correlated with Stroop interference (SI) as measured by (1): larger DI scores (relatively slow color naming speeds) were associated with larger Stroop effects.

If an increased dimensional imbalance indeed contributes to larger SI (inhibitory failure) in older adults, then it needs to be taken into account when calculating inhibition ability. There are two possible ways to do this. The first is to calculate a standardized Ci using the DI score, as follows:

\begin{array}{l} vSI_{standard} = Ci / DI & (4) \end{array}

This factors out the part of Ci which is determined by DI. As a result, differences in color naming speed relative to reading speed are controlled for, leaving only the portion which represents “true” inhibitory ability.
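
A minimal sketch of Equations (3) and (4) follows. The participant values are hypothetical and the function names are assumptions made for illustration. The guard against DI = 0 is also an addition: the ratio in (4) is undefined in that case, which the text does not discuss.

```python
def dimensional_imbalance(cn, rn):
    """Equation (3): DI = Cn - Rn (color naming time minus reading time)."""
    return cn - rn


def si_standard(ci, cn, rn):
    """Equation (4): Ci divided by DI."""
    di = dimensional_imbalance(cn, rn)
    if di == 0:
        # Assumption: treat a zero imbalance as an error rather than returning infinity.
        raise ValueError("DI is zero, so Ci / DI is undefined")
    return ci / di


# Two hypothetical participants with identical Ci and Cn but different reading times (s).
print(si_standard(ci=48.0, cn=32.0, rn=20.0))  # DI = 12.0 -> 4.0
print(si_standard(ci=48.0, cn=32.0, rn=24.0))  # DI = 8.0  -> 6.0
```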

An alternative approach is to use residuals. For a linear regression modeled as Ci_i = α + βDI_i + ε_i, the residuals can be calculated as:

\begin{array}{l} {vSI}_{res} = y_{Ci} - ŷ_{Ci} & (5) \end{array}

This method regresses Ci on DI, and then takes the unstandardized residual [i.e., the difference between the observed Ci value (y_Ci) and the predicted Ci value (ŷ_Ci)] for each participant. These residuals represent the difference between a participant's observed Ci score and the score that their DI would predict: a residual close to 0 indicates that the observed Ci score is very similar to what the DI score would predict, suggesting that DI explains almost all of the increase in Ci relative to Cn. A positive residual suggests that the observed Ci score is higher than would be predicted from DI, indicating “true” inhibitory failure, while a negative residual suggests that the observed Ci is lower than would be predicted from DI, and represents “true” inhibitory success. This method thus provides a measure of inhibitory control free from the effects of visual sensory decline. It also accounts for general cognitive slowing since, like (2), it is a relational measure. One issue with this method is that the residual scores depend on the performance of the sample; that is, the predictive relationship between DI and Ci is derived only from the study participants, who may not be representative of the wider population. It would be preferable to independently derive a “gold-standard” relationship between DI and Ci; however, this has not yet been done, and so for the current study we must rely on the data from our sample alone.
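
The residual approach in (5) can be sketched with an ordinary least-squares fit written out by hand. This is an illustration under stated assumptions rather than the analysis code used in the study: the four data points are hypothetical, and the function names are introduced here for convenience.

```python
def ols_fit(x, y):
    """Least-squares intercept (alpha) and slope (beta) for y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def si_residuals(ci, di):
    """Equation (5): observed Ci minus the Ci predicted from DI, per participant."""
    alpha, beta = ols_fit(di, ci)
    return [c - (alpha + beta * d) for c, d in zip(ci, di)]


# Hypothetical sample of four participants (times in seconds).
di = [8.0, 10.0, 12.0, 14.0]
ci = [40.0, 47.0, 46.0, 55.0]

residuals = si_residuals(ci, di)
for d, c, r in zip(di, ci, residuals):
    label = "slower than predicted" if r > 0 else "faster than predicted"
    print(f"DI={d:5.1f}  Ci={c:5.1f}  residual={r:+.2f}  ({label})")
```

Because the intercept and slope are estimated from the same participants, the residuals sum to zero within the sample, which is one concrete expression of the sample dependence noted above.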

Calculating Auditory Stroop Scores

The traditional Stroop interference measure (SI) for the auditory Stroop is calculated analogously to the visual Stroop:

\begin{array}{l} aSI_{raw} = aRTi - aRTn & (6) \end{array}

As explained above, the issue of generalized slowing makes the traditional Stroop (SI) measure problematic: if aRTi and aRTn increase by the same factor, SI will also increase; this means that a larger SI may reflect slowing rather than paucity of inhibition. Normalized SI was proposed as one means of addressing the issue of generalized slowing, and can be calculated for the auditory Stroop as follows:

\begin{array}{l} aSI_{norm} = aRTi / aRTn & (7) \end{array}

As discussed in the Introduction, using average measures across all trials of a Stroop task may not be the most efficient way of quantifying inhibition and its failure. We know that inhibition takes time to build up, and that its effects may therefore be strongest for each participant's slowest RTs for incongruent trials (Ridderinkhof, 2002; Ridderinkhof et al., 2004; Roelofs et al., 2011). During these trials the distractor has the greatest chance to interfere, but inhibition also has the greatest potential to be deployed by those who can successfully do so; thus individual differences in inhibitory abilities will be most in evidence, since the disparity between those able to successfully deploy inhibition and those less able to do so will be largest during these trials (Roelofs et al., 2011). To assess this, slow and fast trials must be analyzed separately. This type of differential analysis of single trials is usually done using delta plots and delta scores.

Delta scores are calculated using neutral (aRTn) and incongruent (aRTi) conditions. For each participant and each condition, the trials are sorted by RT, and then split into equally-sized quintiles. The average RT is calculated for each quintile in each condition. Mean RT per quintile is the averaged RT across aRTn and aRTi for a given quintile. Delta RT per quintile is calculated as mean aRTi minus mean aRTn for a given quintile. When averaged over all participants the grand mean RT and grand delta RT can be obtained for each quintile. It is worth noting that, since delta RT per quintile is obtained by calculating aRTi − aRTn for that quintile, it is conceptually no different to using the traditional (aSI_raw) measure (see equation (6) above). It is the same calculation, but performed using only a subset of trials.
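
The per-participant quintile calculation can be sketched as follows. The RT values are hypothetical, the function names are assumptions, and the handling of trial counts that do not divide evenly by five (bins of nearly equal size) is a choice made here for illustration; the text does not specify it.

```python
def quintile_means(rts, n_bins=5):
    """Sort RTs and return the mean RT within each of n_bins (nearly) equal bins."""
    ordered = sorted(rts)
    n = len(ordered)
    if n < n_bins:
        raise ValueError("need at least one trial per bin")
    # Bin boundaries as indices into the sorted list.
    bounds = [round(i * n / n_bins) for i in range(n_bins + 1)]
    return [sum(ordered[lo:hi]) / (hi - lo) for lo, hi in zip(bounds, bounds[1:])]


def delta_scores(rt_incongruent, rt_neutral):
    """Return (mean RT, delta RT) per quintile for one participant."""
    q_inc = quintile_means(rt_incongruent)
    q_neu = quintile_means(rt_neutral)
    mean_rt = [(i + n) / 2.0 for i, n in zip(q_inc, q_neu)]
    delta_rt = [i - n for i, n in zip(q_inc, q_neu)]  # aRTi minus aRTn, per quintile
    return mean_rt, delta_rt


# Hypothetical correct-trial RTs (s) for one participant.
neutral = [0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.2, 1.3, 1.4, 1.5]
incongruent = [1.0, 1.0, 1.1, 1.2, 1.3, 1.3, 1.4, 1.5, 1.7, 1.9]

mean_rt, delta_rt = delta_scores(incongruent, neutral)
for q, (m, d) in enumerate(zip(mean_rt, delta_rt), start=1):
    print(f"Q{q}: mean RT = {m:.3f} s, delta RT = {d:.3f} s")
```

Grand mean and grand delta RTs would then be obtained by averaging these per-quintile values across participants.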

Delta plots show grand mean RTs plotted against grand delta RTs for the five RT quintiles (Q1-Q5). Since the delta RT measure compares conditions with and without distractors, and interference from distractors increases over time, the plots typically show an overall increase in delta RTs as mean RTs increase. Individual differences in the build-up of inhibition are expressed in a delta plot by differences in this relationship between mean and delta RTs (Ridderinkhof et al., 2004). Those who are not successfully inhibiting show a monotonic increase in delta RT as mean RT increases. In contrast, those who are successfully engaging inhibition initially show a monotonic increase in delta RT, but for the slowest trials the relationship between delta RT and mean RT will become less steep, flatten out or even become negative. Delta plots allow us to focus on those trials that both allow and require the most inhibition for successful performance, thereby maximizing the chance of seeing individual differences in inhibitory ability.
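
One simple way to quantify the shape of a delta plot is to compute the slope of each segment joining adjacent quintile points; a final segment that is shallow or negative would correspond to the leveling-off described above. This is a sketch of that idea only. It is assumed here for illustration and is not necessarily the analysis applied in this study, and both example profiles are hypothetical.

```python
def delta_plot_slopes(mean_rt, delta_rt):
    """Slopes of the segments joining adjacent (mean RT, delta RT) quintile points."""
    points = list(zip(mean_rt, delta_rt))
    # Assumes mean RT increases strictly from one quintile to the next.
    return [(d2 - d1) / (m2 - m1) for (m1, d1), (m2, d2) in zip(points, points[1:])]


mean_rt = [1.0, 1.1, 1.2, 1.3, 1.4]

# Hypothetical profile without effective inhibition: delta RT keeps rising.
rising = [0.05, 0.08, 0.11, 0.14, 0.17]
# Hypothetical profile with effective inhibition: delta RT levels off, then drops.
leveling = [0.05, 0.08, 0.11, 0.12, 0.10]

slopes_rising = delta_plot_slopes(mean_rt, rising)
slopes_leveling = delta_plot_slopes(mean_rt, leveling)
print("final-segment slope, rising profile:  ", round(slopes_rising[-1], 3))
print("final-segment slope, leveling profile:", round(slopes_leveling[-1], 3))
```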

Speech-in-Noise Tasks

The SiN tasks varied in both semantic context and lexical difficulty. Semantic context was varied as part of the sentence task, where target words were the final words of low- (LP) and high-predictability (HP) sentences. Stimuli were 112 sentence pairs from a recently developed sentence pairs test (Heinrich et al., 2014). This test, based on the SPIN-R test (Bilger et al., 1984), comprises sentence pairs with identical sentence-final monosyllabic words, which are more or less predictable from the preceding context (e.g., “We'll never get there at this rate” vs. “He's always had it at this rate”). High and low predictability (HP/LP) sentence pairs were matched for duration, stress pattern, and semantic complexity. Sentences were recorded using a male Standard British English speaker. Each participant heard either the HP or the LP version of a given sentence, never both.

Lexical difficulty was assessed in the word task, where target stimuli were 200 isolated words whose lexical difficulty was varied in terms of word frequency (WF) and neighborhood density (ND). The set of words comprised the 112 final words from the sentence task and an additional 88 monosyllables. WF was measured using the BNC corpus (http://www.natcorp.ox.ac.uk/), filtered for nouns (exact form). This corpus was chosen because it both uses British English and also allows particular parts of speech to be isolated: in this case, the measure of interest was the frequency of the target words as nouns, since the sentence contexts led listeners to anticipate a noun target, and as the exact form heard in the sentence, not with potential pluralizations or any other alterations. This limitation was mirrored in the scoring of the SiN task, where only the exact form of a target was scored as correct. ND was determined using N-Watch (Davis, 2005). This tool uses the Celex database to create neighborhood measures using a letter-substitution algorithm, but cross-checks the measures with word frequency to ensure that extremely rare words are not included. This stops over-estimation of ND with respect to most people's vocabulary. It also uses British English. Based on these measures, the 200 words were divided into 4 groups, with WF and ND ranges as shown in Table 1.

TABLE 1

Table 1. Lexical information for word stimuli.

All 200 words were re-recorded using a different male Standard British English speaker.

All SiN stimuli were presented in speech-modulated noise (SMN). The SMN was created by using an inverse FFT to generate a noise signal with the same long-term average spectrum as the target speech. This noise signal was then modulated in level by dot multiplying it with the absolute value of the smoothed Hilbert transform of the target speech (smoothing was accomplished by convolving the speech envelope with a 46 ms vector of ones). Finally the SMN was scaled to match the RMS level of the target speech. This made the speech signal unintelligible while keeping the long-term average spectrum, level, and temporal envelope of the original signal intact. SiN stimuli were presented at two SNRs to create a more and a less adverse listening condition (words at +1 and −2 dB; sentences at −4 and −7 dB). SNR levels were chosen to vary the overall difficulty of the task between 20 and 80% accuracy. Each of the 112 sentence-final words was heard only once by each participant, in the context of either an HP or an LP sentence, and half the sentences of each type were heard at the high SNR and half at the low SNR. Each of the 200 words was heard only once, at either the high or the low SNR, and there were equal numbers of words in each combination of word frequency and neighborhood density categories. After hearing each sentence or word participants repeated as much as they could. Testing was self-paced, and responses were recorded for offline scoring.
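
The SMN construction steps can be sketched with NumPy as below. This is a minimal illustration under several assumptions, not the code used to create the study stimuli: the sampling rate (16 kHz), the synthetic "speech" signal, the use of the whole-signal magnitude spectrum as the long-term average spectrum, the order of rectification and smoothing, and all function names are choices made here.

```python
import numpy as np


def analytic_signal(x):
    """FFT-based analytic signal; its magnitude is the Hilbert envelope."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)


def rms(x):
    return float(np.sqrt(np.mean(np.square(x))))


def speech_modulated_noise(speech, fs, smooth_ms=46.0, seed=0):
    speech = np.asarray(speech, dtype=float)
    rng = np.random.default_rng(seed)
    # 1. Noise with the magnitude spectrum of the speech and random phases (inverse FFT).
    magnitude = np.abs(np.fft.rfft(speech))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=magnitude.shape)
    noise = np.fft.irfft(magnitude * np.exp(1j * phase), n=len(speech))
    # 2. Speech envelope, smoothed by convolution with a 46 ms vector of ones.
    envelope = np.abs(analytic_signal(speech))
    window = np.ones(max(1, int(round(smooth_ms * 1e-3 * fs))))
    envelope = np.convolve(envelope, window, mode="same")
    # 3. Sample-by-sample modulation of the noise by the envelope.
    smn = noise * envelope
    # 4. Scale so that the SMN has the same RMS level as the speech.
    return smn * (rms(speech) / rms(smn))


# Synthetic stand-in for a speech recording: a 150 Hz tone with a 4 Hz amplitude modulation.
fs = 16000  # assumed sampling rate in Hz
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 150 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))

smn = speech_modulated_noise(speech, fs)
print(len(smn) == len(speech), np.isclose(rms(smn), rms(speech)))
```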

Procedure

Testing was carried out in a double-wall sound-attenuating booth (Industrial Acoustics Company (IAC), Winchester, UK) using Sennheiser HD280 headphones. All testing was in the left ear only. The SiN and Stroop tasks formed part of a larger battery of tests, which were administered over the course of two sessions around a week apart. The two SiN tasks (words and sentences) were always tested in different sessions; the two Stroop tasks (auditory and visual) were tested in different sessions wherever possible, which was the majority of cases. The order of SiN tasks was counterbalanced across participants. There was no systematic pairing of SiN and Stroop tasks within sessions.

Modeling

In all cases, the outcome measure was speech intelligibility as measured in RAUs (Studebaker, 1985). A number of stimulus-based variables were coded as categorical predictors: semantic predictability (LP/HP) of sentence-final words; word frequency (high/low) and neighborhood density (high/low) of isolated words; speech type (sentences/words) of words and sentences; SNR (high/low). In addition, the following listener variables were coded as continuous predictors: Stroop score (on either the auditory or visual Stroop tasks, using a specified scoring system), and PTA. The PTA variable was calculated by averaging the obtained thresholds at all tested frequencies for each participant, and then centering these values.
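
For readers unfamiliar with the two transformations mentioned above, a brief sketch follows. The RAU function uses the rationalized arcsine formula as it is commonly stated for Studebaker (1985). The centering step assumes grand-mean centering across participants, and the threshold values and participant labels are hypothetical; the tested audiometric frequencies are not given in this section and are not implied by the example.

```python
import math


def rau(n_correct, n_items):
    """Rationalized arcsine units for n_correct out of n_items."""
    theta = (math.asin(math.sqrt(n_correct / (n_items + 1)))
             + math.asin(math.sqrt((n_correct + 1) / (n_items + 1))))
    return (146.0 / math.pi) * theta - 23.0


def centered_pta(thresholds_by_participant):
    """Average thresholds per participant, then subtract the grand mean (assumed centering)."""
    pta = {pid: sum(v) / len(v) for pid, v in thresholds_by_participant.items()}
    grand_mean = sum(pta.values()) / len(pta)
    return {pid: value - grand_mean for pid, value in pta.items()}


# Hypothetical thresholds in dB HL at three unspecified frequencies.
thresholds = {
    "p01": [10.0, 20.0, 30.0],
    "p02": [20.0, 30.0, 40.0],
    "p03": [30.0, 40.0, 50.0],
}

print(round(rau(50, 100), 2))    # a score of 50% maps to about 50 RAU
print(centered_pta(thresholds))  # {'p01': -10.0, 'p02': 0.0, 'p03': 10.0}
```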

The relationship between predictor and outcome variables was assessed in a series of linear mixed models (LMMs) using ML estimation, with predictor variables as fixed effects and Type 3 SS. All models included participants as random effects.

A backwards stepwise procedure was used to determine the final set of predictors for each model.¹ This procedure was implemented through manual checking and effect removal. All analyses were performed in IBM SPSS Statistics 21.

Results

Mean Results for Speech-in-Noise (SiN) Perception

Mean intelligibility values for all SiN conditions are given in Table 2.

TABLE 2

Table 2. Mean scores and standard errors in the 6 different SiN conditions.

Repeated-measures ANOVAs were conducted to investigate group differences in word and sentence intelligibility due to stimulus-based predictor variables. For intelligibility of sentence-final words, a semantic predictability (LP/HP) × SNR (low/high) within-subjects ANOVA showed significant main effects of both predictability [F_{(1, 49)} = 571.72; MSE = 91.67, p < 0.001, η² = 0.921; HP > LP] and SNR [F_{(1, 49)} = 168.54; MSE = 76.81, p < 0.001, η² = 0.775; easy > hard], but no predictability × SNR interaction. For intelligibility of isolated words, a word frequency (low/high) × neighborhood density (low/high) × SNR (low/high) within-subjects ANOVA showed significant main effects of word frequency (WF) [F_{(1, 49)} = 111.67; MSE = 37.37, p < 0.001, η² = 0.695; high > low], neighborhood density (ND) [F_{(1, 49)} = 33.89; MSE = 70.11, p < 0.001, η² = 0.409; low > high] and SNR [F_{(1, 49)} = 120.69; MSE = 66.54, p < 0.001, η² = 0.711; easy > hard]; additionally, a significant WF × ND interaction [F_{(1, 49)} = 180.40; MSE = 54.53, p < 0.001, η² = 0.786] indicated that words with both a high word frequency and a low neighborhood density were more intelligible than words in the other three conditions (Bonferroni-corrected at p = 0.05).

Visual Stroop

The mean for Cn was 31.66 s (SD = 5.41 s); the mean for Ci was 47.13 s (SD = 8.14 s); and in all cases the difference between them was positive (i.e., Ci > Cn). The mean difference between RTs in the two conditions for the current dataset was 15.5 s (SD = 4.49 s) overall, which represents a mean of 0.32 s (SD = 0.09 s) per item (word). The vSI_norm measure (Equation 2 above) gives a mean score of 1.49 (SD = 0.14).

The Relationship between Visual Stroop Scores and Speech-in-Noise (SiN) Perception

This section examines the predictive value of visual Stroop interference for SiN perception in high and low predictability sentences and for single words varying in word frequency and neighborhood density. Predictive power for SiN perception was investigated for two measures of visual Stroop interference: vSI_raw, the traditional measure of Stroop interference unadjusted for sensory decline, and vSI_res, the new measure of Stroop interference that takes general age-related slowing as well as sensory decline into account. The predictive relationships between each of the visual Stroop scores and performance on the sentence task, the word task, and the sentence and word tasks combined are presented in Tables 3–5 respectively. The analyses combining the scores from the sentence and word tasks (Table 5) were included in order to directly compare the predictive effect of Stroop scores across target stimuli of different linguistic complexity. In a second step, PTA was added to each set of analyses in order to examine how it modified the predictive effect of the Stroop scores.

TABLE 3

Table 3. Summary of LMMs assessing relationship of visual Stroop scores to sentence perception.

TABLE 4

Table 4. Summary of LMMs assessing relationship of visual Stroop scores to word perception.

TABLE 5

Table 5. Summary of LMMs assessing relationship of visual Stroop scores to all SiN perception (combined dataset).

Tables 3–5 indicate, for each combination of model type and dataset, (a) whether a predictive effect of the Stroop measure on SiN performance was present, and what the nature of the effect was; and (b) what, if any, significant interactions between the Stroop measure and stimulus-based variables or PTA were present. The effects are described as rate of change where a positive slope indicates an average increase in SiN performance with every additional increase in Stroop interference, while a negative slope indicates an average decrease in SiN performance with every additional increase in Stroop interference. Based on our hypotheses, we expect negative slopes. While PTA was always entered as a continuous predictor, we use a categorical median split when reporting and discussing its effects, because it allows for clearer descriptions, particularly of complex interactions. The tables do not list significant interactions if they do not involve the Stroop measure. The AIC value is included for each model as an indication of goodness-of-fit, with lower AIC values corresponding to a better fit.

The models reveal a complex pattern of results with the direction of the relationship between the vSI measures and SiN performance, as well as the strength of the relationship, depending on the scoring method and characteristics of the stimulus and the listener. However, in all cases, the inclusion of PTA in the model enhanced model fit (i.e., produced a lower AIC value).

We will now examine, for each dataset in turn, how the nature of the relationship between Stroop scores and SiN performance was modulated by stimulus-based variables and PTA for each Stroop scoring method.

Sentence perception

Traditional (vSI_raw) measure. There was no predictive effect of the Stroop measure overall, and stimulus-based predictors did not modulate the predictive effect of Stroop interference. There was also no modulating effect of PTA.

Adjusted (vSI_res) measure. While there was no predictive main effect of Stroop interference, an interaction of vSI_res∗Pred∗SNR indicates that the predicted negative relationship between Stroop scores and sentence perception was seen for the high predictability (HP) sentences in the harder SNR, and for the low predictability (LP) sentences in the easier SNR, but not for the HP sentences in the easier SNR or the LP sentences in the harder SNR. There was no modulating effect of PTA.

Word perception

Traditional (vSI_raw) measure. While there was no predictive main effect of Stroop interference, an interaction with neighborhood density (ND) indicates that the observed relationship between vSI_raw and word perception was more negative for words with less dense neighborhoods. This interaction was modulated by SNR and PTA in an interaction of vSI_raw∗SNR∗ND∗PTA, indicating that the relationship between Stroop scores and SiN perception changed in different ways across ND and SNR conditions for listeners with better and worse hearing. Specifically, the relationship was negative for those with poor PTA, but was more mixed for those with good PTA, being positive for high ND words in the easier SNR and approaching zero for both ND conditions in the harder SNR.

Adjusted (vSI_res) measure. There was no main effect of Stroop interference and no modulating effects of stimulus-based variables on their own. Once PTA was added to the model, an interaction of vSI_res∗ND emerged, indicating that the predictive effect of Stroop scores was strongest for low ND words. This interaction was further modulated by PTA, indicating that the relationship between Stroop scores and SiN perception changed in different ways for the two ND conditions when examining listeners with better and worse hearing. Specifically, for those with worse hearing the Stroop/SiN relationship was more negative for low ND words but less negative for high ND words when compared to those with better hearing.

Speech (combined dataset)

Traditional (vSI_raw) measure. There was no predictive main effect of Stroop interference. An interaction with Type indicates that the predictive effect of Stroop scores for SiN perception differed in direction between sentences and words, being negative for the word task and positive for the sentence task. PTA did not modulate these relationships.

Adjusted (vSI_res) measure. There was no main effect of Stroop interference and no modulating effects of stimulus-based variables or PTA.

In summary, the predictive effect of visual Stroop scores for SiN perception is similar in some respects across all three analyses and regardless of the scoring method. Both scoring systems reveal some specific influences of lexical factors [sentence predictability (Table 3) and word neighborhood density (Table 4)], and neither system shows a large effect of PTA.

Auditory Stroop (All Trials)

The auditory Stroop task resulted in three measures for each participant: average RT for neutral trials (aRTn), congruent trials (aRTc) and incongruent trials (aRTi). Initial inspection of the data revealed that not all four speakers produced Stroop interference effects for every participant. We therefore analyzed for each participant the responses to the female and male speaker who produced, for that participant, the largest overall traditional Stroop interference (aRTi − aRTn). Speakers M1 and M2 were chosen 13 and 35 times respectively, speakers F1 and F2 25 and 23 times respectively. Following Green and Barber (1981), only correct trials from the aRTi and aRTn conditions were included in any analysis.
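
The per-participant speaker selection can be sketched as follows. The data layout, function names and RT values are hypothetical; only the selection rule (the male and the female speaker with the largest aRTi − aRTn for that participant) and the speaker labels are taken from the text.

```python
def mean(values):
    return sum(values) / len(values)


def select_speakers(rts_by_speaker):
    """Return, for each gender, the speaker with the largest aRTi - aRTn."""
    best = {}
    for speaker, data in rts_by_speaker.items():
        interference = mean(data["incongruent"]) - mean(data["neutral"])
        gender = data["gender"]
        if gender not in best or interference > best[gender][1]:
            best[gender] = (speaker, interference)
    return {gender: speaker for gender, (speaker, _) in best.items()}


# Hypothetical correct-trial RTs (s) for one participant.
participant = {
    "M1": {"gender": "male", "incongruent": [1.30, 1.40], "neutral": [1.20, 1.30]},
    "M2": {"gender": "male", "incongruent": [1.50, 1.50], "neutral": [1.20, 1.30]},
    "F1": {"gender": "female", "incongruent": [1.40, 1.30], "neutral": [1.20, 1.20]},
    "F2": {"gender": "female", "incongruent": [1.25, 1.25], "neutral": [1.20, 1.20]},
}

chosen = select_speakers(participant)
print(chosen)  # {'male': 'M2', 'female': 'F1'}
```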

Congruent trials are usually included in Auditory Stroop tasks, and previous studies (Green and Barber, 1981; Jerger et al., 1988) have found a facilitation effect (i.e., faster responses to congruent than neutral trials), although this is not always the case (Sommers and Danielson, 1999). Using a 1-way repeated-measures ANOVA (Greenhouse-Geisser corrected for violations of sphericity) with aRTn, aRTc and aRTi as within-subject levels of condition, we found a main effect of condition [F_{(2, 79)} = 53.40; MSE = 0.005, p < 0.001, η² = 0.532]. Post-hoc testing showed an interference effect but no facilitation effect [aRTi > aRTc, aRTi > aRTn, aRTc = aRTn (Bonferroni-corrected at p = 0.05)].

The mean aRTi (per item) was 1.33 s (SD = 0.23 s), the mean aRTn was 1.20 s (SD = 0.21 s), and aRTi was higher than aRTn for all but 3 listeners. The mean difference between RTs in the two conditions for the current dataset was 0.13 s (SD = 0.09 s) per item (word). This difference is smaller than for the visual Stroop. The aSI_norm measure (Equation 7 above) gives a mean score of 1.11 (SD = 0.08).

The Relationship between Auditory Stroop Scores and Speech-in-Noise (SiN) Perception

This section examines the predictive value of auditory Stroop interference for SiN perception in high and low predictability sentences, and for single words varying in word frequency and neighborhood density. As before, performance in these conditions was predicted by one of two auditory Stroop interference measures: aSI_raw, the traditional measure for Stroop interference, or aSI_norm, a measure of Stroop interference that takes generalized slowing into account. The relationship between each Stroop measure and SiN perception, as characterized by a series of LMMs, is summarized in Tables 6–8. In all cases, the first part of the table presents the results when Stroop interference and stimulus-based variables are the only predictors of SiN performance. The second part of each table presents the results when PTA is considered in addition to Stroop interference and stimulus-based variables.

TABLE 6

Table 6. Summary of LMMs assessing relationship of auditory Stroop scores to sentence perception.

TABLE 7

Table 7. Summary of LMMs assessing relationship of auditory Stroop scores to word perception.

TABLE 8

Table 8. Summary of LMMs assessing relationship of auditory Stroop scores to all SiN perception (combined dataset).

For both auditory Stroop scoring systems, the overall relationship between Stroop scores and SiN perception is mostly positive. This is truer for the normalized (aSI_norm) scores than the traditional (aSI_raw) scores, since Stroop scores never reach significance as a main effect when using the aSI_raw scoring method, but are significant across all datasets when using the aSI_norm measure without PTA. As before, including PTA improved the fit of the model in all cases.