Auditory Stream Segregation Can Be Modeled by Neural Competition in Cochlear Implant Listeners

Auditory stream segregation is a perceptual process by which the human auditory system groups sounds from different sources into perceptually meaningful elements (e.g., a voice or a melody). The perceptual segregation of sounds is important, for example, for the understanding of speech in noisy scenarios, a particularly challenging task for listeners with a cochlear implant (CI). It has been suggested that some aspects of stream segregation may be explained by relatively basic neural mechanisms at a cortical level. During the past decades, a variety of models have been proposed to account for the data from stream segregation experiments in normal-hearing (NH) listeners. However, little attention has been given to corresponding findings in CI listeners. The present study investigated whether a neural model of sequential stream segregation, proposed to describe the behavioral effects observed in NH listeners, can account for behavioral data from CI listeners. The model operates on the stimulus features at the cortical level and includes a competition stage between the neuronal units encoding the different percepts. The competition arises from a combination of mutual inhibition, adaptation, and additive noise. The model was found to capture the main trends in the behavioral data from CI listeners, such as the build-up effect and the higher probability of a segregated percept with increasing feature difference between the sounds. Importantly, this was achieved without any modification to the model's competition stage, suggesting that stream segregation could be mediated by a similar mechanism in both groups of listeners.


INTRODUCTION
The cochlear implant (CI) is a neural prosthesis that allows many CI listeners to achieve high levels of speech understanding in quiet. Nevertheless, CI listeners typically have difficulty understanding a single person's voice among many, or recognizing a familiar melody in a complex musical arrangement (e.g., Nelson et al., 2003). In such scenarios, the listener needs to parse the sounds in the complex auditory scene and group them into meaningful auditory objects or streams, a process known as auditory scene analysis (Bregman, 1990). However, the mechanisms that may allow CI listeners to perceptually group multiple sound events into streams remain unclear. The present study evaluated whether a computational model, proposed to account for the main aspects of auditory scene analysis observed in normal-hearing (NH) listeners, can also account for the behavioral data from CI listeners.
A common paradigm to investigate auditory scene analysis employs sequences of repeating, sequentially presented sounds which may differ in various acoustic properties, typically the frequency content (for a review, see Carlyon, 2004; Micheyl and Oxenham, 2010; Gutschalk and Dykstra, 2014). Small differences and/or slow presentation rates promote the perceptual grouping of the sounds into a single stream (i.e., integration). Conversely, large differences and/or fast presentation rates promote the perceptual grouping of the sounds into several streams (i.e., segregation). The perception of the sequence has been described as bistable (e.g., Pressnitzer and Hupé, 2006) and it is characterized by ongoing spontaneous switches between an integrated and a segregated percept for long stimulus presentations. Nevertheless, the overall probability of experiencing a segregated percept has been reported to increase over time, typically reaching a plateau after the first couple of seconds. This phenomenon has often been referred to as the build-up of stream segregation (e.g., Bregman, 1978).
During the past decades, a variety of models have been proposed to account for the phenomena reported in the experimental studies (see recent reviews by Szabó et al., 2016; Snyder and Elhilali, 2017). Based on a conceptual model described by Fishman et al. (2001), Rankin et al. (2015, 2017) proposed a neuromechanistic model to account for a variety of behavioral effects in NH listeners, including the effects of frequency differences and presentation rate, the dynamics of the bistable perception, and the build-up effect. The model operates on the stimulus features at the cortical level and includes a competition stage between the neuronal units encoding the different percepts. The competition between the units results from a combination of mutual inhibition, adaptation, and additive noise mechanisms, suggested to contribute to perceptual bistability at cortical stages (e.g., Moreno-Bote et al., 2007; Shpiro et al., 2009; Kondo et al., 2018).
Studies investigating the perceptual organization of sounds in CI listeners suggest that the listeners may be able to segregate sequential sounds on the basis of perceptual differences elicited by manipulations of the place, the rate, or the intensity of the electrical stimulation (e.g., Cooper and Roberts, 2009; Marozeau et al., 2013; Paredes-Gallardo et al., 2018a,b,c). Furthermore, recent studies with CI listeners observed similar trends to those reported for NH listeners (albeit with a larger inter-subject variability), suggesting a common underlying mechanism in both groups of listeners (Paredes-Gallardo et al., 2018a,b,c).
The present study investigated whether neural competition at a cortical level, proposed to account for the behavioral effects of sequential stream segregation in NH listeners, can also account for the data from CI listeners. Specifically, the neuromechanistic model proposed by Rankin et al. (2015, 2017) was used here to account for the behavioral data from Paredes-Gallardo et al. (2018b,c). If the model were able to capture the main trends in the behavioral data (i.e., the build-up effect and the higher probability of a segregated percept with increasing perceptual difference between the sounds) without modifications to the competition stage, this would indicate that stream segregation can be described by a similar mechanism in both NH and CI listeners.

Behavioral Stream Segregation Data
Paredes-Gallardo et al. (2018b,c) investigated stream segregation in 7 and 9 CI listeners (respectively) using sequences of alternating A and B sounds. The sounds were encoded either via different electrodes at a constant pulse rate, or with different pulse rates from the same electrode, inducing in both cases a difference in perceived pitch (Δpitch). The listeners were asked to perform a temporal delay detection task that was easiest when the A and B sounds were perceptually segregated. Therefore, larger d' reflected a higher likelihood of a segregated percept. Overall, the d' scores increased with increasing Δpitch between the A and B sounds, as well as with increasing sequence duration. Thus, consistent with previous studies with NH listeners, the authors suggested that a larger Δpitch might facilitate the perceptual segregation of the A and B sounds and that a segregated percept builds up over time.

Stream Segregation Modeling Framework
The modeling framework used in the present study is based on the neuromechanistic model proposed by Rankin et al. (2015, 2017). A schematic representation of the framework is shown in Figure 1. The framework is divided into five parts:
[1] The input to the model represents the dynamics of the stimulus (i.e., the onset times of the A and B sounds).
[2] With this information, the model mimics the pulsatile responses and the feature dependence observed at the primary auditory cortex (Micheyl et al., 2005). A weighting function, ω(Δfeature, t), is used in this stage to control the spread of the responses to three units in the competition network (represented by I_A, I_B, and I_AB).
[3] The competition between the units is modeled through a combination of mutual inhibition, recurrent excitation, slow adaptation, and noise, which is added to I_A, I_B, and I_AB. The inhibition processes, proportional to the inhibition strength parameter β_i, are indicated with round-ended connectors in Figure 1. The recurrent excitation, slow adaptation, and additive noise are not shown in Figure 1. The model encodes an integrated percept when the activity of the AB unit is larger than the activity of the A and the B units, and a segregated percept otherwise. Thus, the output from the competition network is a binary representation of the percept over time.
[4] The build-up function is then computed by time-binned averaging across N simulations, which represents the time course of the proportion of segregation over N trials.
[5] To link the proportion of segregation with the d' scores, an ideal observer (IO) was used in the back-end of the model. The IO assumed a 100% hit rate for the segregated trials and chance level performance for the integrated trials, estimating a d' score for a given sequence duration, Δpitch, and Δt (for more details on the IO model, see Paredes-Gallardo et al., 2018b,c).
Rankin et al. (2017) proposed the neuromechanistic model to account for behavioral data from NH listeners. The acoustic stimuli consisted of repeating triplets of ABA sounds separated by a short pause (ABA_). The model parameters were defined to minimize the deviations between the model predictions and the behavioral data. In the present study, unless otherwise specified, the model equations and parameters were kept as described in Rankin et al. (2017). However, the stimulus structure was slightly modified to resemble the stimuli used by Paredes-Gallardo et al. (2018b,c). In addition, some parameters in the second part of the model, related to the input signals to the competition stage, were adjusted to account for the differences between the input signals in NH and CI listeners.
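Parts [3] to [5] of the framework can be illustrated with a minimal Python sketch. It is not the implementation of Rankin et al. (2017) or of the present study: the firing-rate equations, the all-to-all inhibition, the constant (non-pulsatile) inputs, and every parameter value are assumptions chosen only for illustration, and the function names are hypothetical. Integration is taken to mean that the AB unit is more active than both the A and the B unit. The ideal-observer mapping further assumes a two-alternative task, for which d' = √2·z(proportion correct); the actual IO is described in Paredes-Gallardo et al. (2018b,c) and may differ.

```python
import numpy as np
from statistics import NormalDist


def sigmoid(x, threshold=0.2, slope=0.1):
    """Firing-rate nonlinearity (hypothetical shape and constants)."""
    return 1.0 / (1.0 + np.exp(-(x - threshold) / slope))


def simulate_percept(inputs, beta=0.6, alpha=0.3, g=1.0, tau=0.01, tau_a=1.0,
                     noise_sd=0.05, dt=0.001, duration=4.0, seed=0):
    """Part [3]: return a 0/1 time series (1 = segregated) for one trial.

    inputs holds constant drives in the order (I_A, I_AB, I_B); the
    pulsatile input of part [2] is deliberately left out of this sketch.
    """
    rng = np.random.default_rng(seed)
    inputs = np.asarray(inputs, dtype=float)
    n_steps = int(round(duration / dt))
    r = np.zeros(3)  # unit activities
    a = np.zeros(3)  # slow adaptation variables
    segregated = np.zeros(n_steps, dtype=int)
    for k in range(n_steps):
        inhibition = beta * (r.sum() - r)          # all-to-all (assumption)
        noise = noise_sd * rng.standard_normal(3)  # added to the inputs
        drive = inputs + noise + alpha * r - inhibition - g * a
        r = r + dt / tau * (-r + sigmoid(drive))   # forward Euler step
        a = a + dt / tau_a * (-a + r)
        integrated = r[1] > r[0] and r[1] > r[2]
        segregated[k] = 0 if integrated else 1
    return segregated


def build_up_function(percepts, dt, bin_width):
    """Part [4]: proportion of segregation per time bin across trials."""
    percepts = np.asarray(percepts, dtype=float)  # shape (n_trials, n_samples)
    per_bin = max(1, int(round(bin_width / dt)))
    n_bins = percepts.shape[1] // per_bin
    trimmed = percepts[:, : n_bins * per_bin]
    binned = trimmed.reshape(percepts.shape[0], n_bins, per_bin).mean(axis=2)
    centers = (np.arange(n_bins) + 0.5) * per_bin * dt
    return centers, binned.mean(axis=0)


def ideal_observer_dprime(p_segregated, chance=0.5, max_pc=0.999):
    """Part [5]: map a proportion of segregation to d' (2AFC assumption)."""
    # Segregated trials are always correct; integrated trials are at chance.
    pc = p_segregated + (1.0 - p_segregated) * chance
    pc = min(pc, max_pc)  # keep d' finite when pc reaches 1
    return 2 ** 0.5 * NormalDist().inv_cdf(pc)
```

In this sketch, calling `simulate_percept` with N different seeds, stacking the results, and passing them to `build_up_function` gives the proportion of segregation over time; the value in the bin that matches a given sequence duration can then be passed to `ideal_observer_dprime`.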

Model Parameters and Fitting Procedure
In the original model, the A and B sounds were defined as pure tones with different frequencies. Thus, the weighting function ω(Δfeature, t) was dependent on the frequency difference between the sounds [i.e., ω(Δf, t)]. Conversely, in the studies by Paredes-Gallardo et al. (2018b,c), the A and the B sounds differed either in the place or the rate of the electrical stimulation, eliciting a difference in the perceived pitch. Thus, in the present study, the dependency of the weighting function on the frequency difference was replaced by a dependency on Δpitch, as indicated by Equation (1). The variable t represents the time vector, L the amplitude factor, and σ the lateral decay constant. Q(t) and R(t) are exponential decay functions and represent the amplitude and the pitch adaptation of the input, respectively, with a time constant of 500 ms (for more details, see Rankin et al., 2017).
In the study by Rankin et al. (2017), the lateral decay constant σ was defined in semitones. As a result of the change in the dependency of ω from frequency separation to Δpitch, σ had to be redefined in the present study. Two different model fits were considered here. For the first one, a genetic algorithm was used to find the value of σ leading to the minimum averaged mean error (AME) between the mean d' scores achieved by the listeners and the model predictions. For the second one, the genetic algorithm was allowed to adjust the value of the amplitude factor L in addition to σ in order to minimize the AME between the predictions and the data. In both cases, the fitting of the model was performed by manipulating the parameters of the weighting function, and no changes were made to the parameters or equations of the competition stage. The values of σ and the AME resulting from the fitting procedure are presented rounded to the first significant figure. The value of L is presented rounded to the second significant figure.
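The fitting step can be illustrated with a simplified evolutionary search (selection and Gaussian mutation only, without crossover), standing in for the genetic algorithm, whose exact configuration is not described here. The sketch below is an assumption-laden illustration, not the procedure used in the study: the function names, population size, number of generations, and mutation scale are hypothetical, and the error measure is taken to be the mean absolute difference between predicted and observed d' scores, which may not match the exact definition of the AME.

```python
import numpy as np


def mean_abs_error(predicted, observed):
    """Mean absolute difference between predicted and observed d' scores."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.mean(np.abs(diff)))


def fit_parameters(predict, observed, bounds, pop_size=30, n_generations=40,
                   mutation_scale=0.1, seed=0):
    """Search for the parameter vector that minimizes the error.

    predict: callable mapping a parameter vector (e.g., [sigma] or
             [sigma, L]) to predicted d' scores (hypothetical interface).
    bounds:  list of (low, high) pairs, one per parameter.
    """
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds, dtype=float).T
    population = rng.uniform(low, high, size=(pop_size, low.size))

    def errors_of(pop):
        return np.array([mean_abs_error(predict(p), observed) for p in pop])

    for _ in range(n_generations):
        errors = errors_of(population)
        # Selection: keep the better half of the population unchanged.
        elite = population[np.argsort(errors)[: pop_size // 2]]
        # Mutation: perturb randomly chosen elite members to refill the rest.
        n_children = pop_size - len(elite)
        parents = elite[rng.integers(0, len(elite), size=n_children)]
        step = rng.normal(0.0, mutation_scale * (high - low), size=parents.shape)
        population = np.vstack([elite, np.clip(parents + step, low, high)])

    errors = errors_of(population)
    best = int(np.argmin(errors))
    return population[best], float(errors[best])
```

Fitting σ alone corresponds to passing a single `(low, high)` pair in `bounds`; fitting σ and L together corresponds to passing two pairs, with `predict` wrapping the full model simulation.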
The simulations were performed on a sequence of alternating A and B sounds with a presentation rate of 5.89 Hz. Each d' estimate was computed from 1,000 simulated trials (N) at 1.24 and 3.96 s (i.e., equivalent to the short and the long sequence durations from Paredes-Gallardo et al.

RESULTS
Two different model fits were considered in the present study: one where only σ was adjusted and L was fixed to 0.6, as in Rankin et al. (2017), and another one where both σ and L were adjusted. When only σ was adjusted, the minimum AME between the predictions and the data was achieved for σ = 40 (AME = 0.5). Conversely, when both σ and L were adjusted, the minimum AME was achieved for σ = 30 and L = 0.35 (AME = 0.3). Figure 2 shows the predicted proportion of segregation as a function of time (i.e., the build-up functions) for each of the model fits (Figures 2A,B) as well as a comparison between the predicted d' values and the behavioral data for the long and the short sequence durations (Figures 2C,D). The results from the simulations where only the value of σ was adjusted are shown in blue whereas the results from the simulations where both σ and L were adjusted are shown in green.
In Figures 2A,B, lighter colors indicate smaller Δpitch and darker colors indicate larger Δpitch. As an arbitrary reference, a Δpitch of 100% represents the perceptual pitch difference between a 900 pps pulse train stimulating electrode 11 vs. electrode 22 of the array. Similar trends are observed for both model fits. The proportion of segregation increases over time and reaches a plateau after ∼2-4 s for most Δpitch conditions. The plateau of the build-up functions occurs at values below 1, suggesting the presence of perceptual switches between an integrated and a segregated percept throughout the trial. Larger Δpitch values lead to steeper slopes, reaching the plateau in a shorter time (i.e., faster build-up). Nevertheless, the effect of both Δpitch and time is more pronounced in Figure 2A than in Figure 2B, where the amplitude factor L was set to 0.35, a smaller value than the original value of 0.6 in the model for NH listeners. The comparison between the predicted d' scores from the model and the d' scores from the listeners is shown in Figure 2C (short sequence) and Figure 2D (long sequence). The solid lines represent the model predictions, and the open markers represent the mean d' scores from the listeners. The error bars indicate ±1 standard deviation. The d' scores achieved by the listeners generally increase with increasing Δpitch between the sounds and are, overall, higher for the long than for the short sequences, reflecting the build-up effect. In addition, the effect of Δpitch was larger for the long than for the short sequence. These trends are well captured by the model, both for the fit where only σ was adjusted and for the fit where σ and L were adjusted. Nevertheless, a better agreement between the data and the predictions is observed when adjusting the value of the amplitude factor L in addition to σ (solid green line).
However, whereas the d' scores achieved by the listeners saturate for large Δpitch values both for the long and the short sequences, the predictions from the model continue to increase at large Δpitch values for the short sequence. This can also be seen in the build-up functions from Figures 2A,B: the proportion of segregation saturates with increasing Δpitch in the plateau region of the build-up functions (i.e., for the long sequence) and continues to increase with increasing Δpitch in the steep region of the build-up functions (i.e., for the short sequence).

DISCUSSION
The present study evaluated whether the neuromechanistic model proposed by Rankin et al. (2015, 2017) would be able to account for the effects observed in the behavioral data from CI listeners. The model parameters and equations from the competition stage were kept as defined in Rankin et al. (2017), and only the function defining the amount of input spread to the different units of the competition network, ω(Δfeature, t), was adjusted to account for the data from the CI listeners. Specifically, the model parameters σ and L were adjusted. The adjustment of σ was a necessary step in order to change the dependency of ω from frequency separation to Δpitch. When σ was adjusted, and the remaining model parameters were kept as defined by Rankin et al. (2017) for NH listeners, the model was able to capture the main trends of the behavioral data from the CI listeners (i.e., larger d' scores with increasing Δpitch and sequence duration). Nevertheless, a better fit between the data and the model predictions was achieved when modifying the amplitude factor L in addition to σ. The optimal value for L was found to be 0.35, lower than the value of 0.6 used by Rankin et al. (2017). The lower input amplitude L reduces the effect of Δpitch on the proportion of segregation and increases the relative weight of the additive noise in the model, resulting in a more ambiguous percept (i.e., higher minima and lower maxima of the build-up functions in Figure 2B with respect to those from Figure 2A). Such ambiguity could arise from the generally weak pitch percept experienced by CI listeners (e.g., Oxenham, 2008). Thus, the findings from the present study suggest that a more ambiguous percept may better characterize the behavioral data from CI listeners, which may indicate a weaker role of obligatory processes in stream segregation in CI listeners than in NH listeners.
Even though the model successfully captures the main trends of the data, there are some discrepancies between the data and the model predictions. Specifically, whereas the d' scores achieved by the listeners saturate for large Δpitch values both for the long and the short sequences, the model only predicts a saturation effect for the long sequence (i.e., in the plateau region of the build-up function). The agreement between the model predictions and the behavioral data could be further improved by manipulating model parameters affecting the build-up process [e.g., by adjusting or redefining the exponential decays R(t) and Q(t) from Equation (1)]. However, a better behavioral characterization of the build-up functions for CI listeners would be required to support such modifications in the model. The aim of the present study was not to achieve the best fit between the model predictions and the data but to evaluate whether a model that was proposed to account for data from NH listeners can capture the main trends in the data from CI listeners. Overall, the model was able to account for the effect of perceptual pitch differences (Δpitch) elicited by changes in the electrode or the pulse rate of the electrical stimulation on stream segregation, as well as the build-up effect. Importantly, this was achieved without any changes to the parameters or equations of the competition stage; only the parameters of the weighting function were adjusted. These findings indicate that a competition network featuring mutual inhibition, adaptation, and additive noise can also account for the behavioral effects of stream segregation in CI listeners, suggesting that stream segregation may be mediated by a similar mechanism in NH and CI listeners.
Finally, the results presented in this study are consistent with findings from invasive physiological studies in animals and modeling work suggesting that many important aspects of stream segregation, such as the effect of perceptual differences between the sounds or the build-up effect, may be explained by relatively basic neural mechanisms at a cortical level (e.g., Fishman et al., 2001, 2017; Micheyl et al., 2005, 2007). Nevertheless, more experimental data from CI listeners are needed to evaluate whether the neuromechanistic model can account for a wider range of behavioral effects of stream segregation in CI listeners, such as the effects of variations in the stimulus presentation rate or the dynamics of bistable perception.

AUTHOR CONTRIBUTIONS
AP-G designed and performed the research. AP-G, TD, and JM interpreted the results and wrote the manuscript.

FUNDING
This work was supported by the Oticon Center of Excellence for Hearing and Speech Sciences (CHESS) and the Center for Applied Hearing Research (CAHR).