Stimulus Pauses and Perturbations Differentially Delay or Promote the Segregation of Auditory Objects: Psychoacoustics and Modeling

Segregating distinct sound sources is fundamental for auditory perception, as in the cocktail party problem. In a process called the build-up of stream segregation, distinct sound sources that are perceptually integrated initially can be segregated into separate streams after several seconds. Previous research concluded that abrupt changes in the incoming sounds during build-up—for example, a step change in location, loudness or timing—reset the percept to integrated. Following this reset, the multisecond build-up process begins again. Neurophysiological recordings in auditory cortex (A1) show fast (subsecond) adaptation, but unified mechanistic explanations for the bias toward integration, multisecond build-up and resets remain elusive. Combining psychoacoustics and modeling, we show that initial unadapted A1 responses bias integration, that the slowness of build-up arises naturally from competition downstream, and that recovery of adaptation can explain resets. An early bias toward integrated perceptual interpretations arising from primary cortical stages that encode low-level features and feed into competition downstream could also explain similar phenomena in vision. Further, we report a previously overlooked class of perturbations that promote segregation rather than integration. Our results challenge current understanding for perturbation effects on the emergence of sound source segregation, leading to a new hypothesis for differential processing downstream of A1. Transient perturbations can momentarily redirect A1 responses as input to downstream competition units that favor segregation.

B Outline of the model
The network structure and neural mechanisms forming the basis of our model were originally motivated in Rankin et al (2015). In this section we give a complete description of the model, specifying the exact formulation used in the present study. The firing rate variables r_k are indexed by k = {AB, A, B} for each population shown in Fig 1A, with the associated adaptation a_k and recurrent excitation e_k variables (note that the symbol "e" is used exclusively for excitation variables and associated constants, whilst "exp()" denotes the exponential function). The system of first-order differential equations is as follows: with time constants τ_r = 10 ms (cortical), τ_a = 1.4 s (spike frequency adaptation), τ_e = 70 ms (NMDA-excitation). The strength of recurrent excitation is given by β_e = 0.65, lateral inhibition β_i = 0.3 and adaptation g = 0.045. Note that the profile of inhibition used here, with non-uniform synaptic weights and independent of DF, was determined after fitting the model to behavioural data (Rankin et al 2015). Note that although within-unit inhibition is included, β_e > β_i, so there is always net within-unit excitation. The firing rate function F is given by where θ_F = 0.2 is a threshold parameter and k_F = 12 is a slope parameter. Additive noise is introduced with independent stochastic processes χ_A, χ_B and χ_AB added to the inputs of each population. Input noise is modeled as an Ornstein-Uhlenbeck process: where τ_χ = 100 ms (a standard choice (Shpiro et al 2009; Seely and Chow 2011)) is the timescale, the strength γ equals 0.0875 and ξ(t) is a white noise process with zero mean. Note that these noise terms appear inside the firing rate function F, such that the firing rates r_k remain positive and do not exceed 1.
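As a rough illustration of the two ingredients named above, the following Python sketch implements a bounded firing rate function and an Ornstein-Uhlenbeck noise source using the stated parameter values. The logistic form of F, the noise scaling (chosen so that the stationary standard deviation equals γ), the Euler-Maruyama time step and the function names are all assumptions made for this sketch; they are not taken from the model equations.

```python
import math
import random

THETA_F = 0.2    # threshold parameter (from the text)
K_F = 12.0       # slope parameter (from the text)
TAU_CHI = 0.1    # noise timescale in seconds (100 ms, from the text)
GAMMA = 0.0875   # noise strength (from the text)


def firing_rate(x, theta=THETA_F, k=K_F):
    """ASSUMED logistic form: bounded between 0 and 1, equal to 0.5 at x = theta."""
    z = k * (x - theta)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # numerically stable branch for negative z
    return ez / (1.0 + ez)


def simulate_ou(n_steps, dt=0.001, tau=TAU_CHI, gamma=GAMMA, seed=0):
    """Euler-Maruyama integration of an OU process.

    ASSUMED scaling: d(chi) = -(chi / tau) dt + gamma * sqrt(2 / tau) dW,
    which gives a stationary standard deviation of gamma.
    """
    rng = random.Random(seed)
    chi = 0.0
    trace = []
    for _ in range(n_steps):
        drift = -chi / tau * dt
        diffusion = gamma * math.sqrt(2.0 * dt / tau) * rng.gauss(0.0, 1.0)
        chi += drift + diffusion
        trace.append(chi)
    return trace


if __name__ == "__main__":
    noise = simulate_ou(5000)
    # Noise enters inside F, so the output always stays within [0, 1].
    rates = [firing_rate(0.3 + chi) for chi in noise]
    print(min(rates), max(rates))
```

Because the noise is added to the argument of the firing rate function rather than to its output, the resulting rates stay bounded whatever the noise amplitude.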

B.1 Model inputs and early adaptation
The particular form of the periodic inputs is based on recorded responses from A1 with ABA triplets (Micheyl et al 2005). We capture the basic form of these responses to tones (TR) with a pair of onset response functions, one with larger amplitude and early rise that captures the initial onset and a second with smaller amplitude and late rise that captures the plateau: with plateau amplitude fraction Λ_2 < 1 and rise times α_1 < α_2. The constant terms normalise the amplitude at t = α_{1,2} of the individual onset functions to 1. A standard Heaviside function H (step function where H(t) = 0 for t < 0 and H(t) = 1 for t > 0) ensures no response before an input tone at t = 0. Rise times of α_1 = 15 ms and α_2 = 82.5 ms and an amplitude Λ_2 = 1/6 were chosen to approximately match the rise time and relative onset-to-plateau ratio observed in Micheyl et al (2005). The spread of input is defined via the weighting function where the tonotopic decay constant is σ_p = 9.7 st, the input amplitude is I_p = 0.6, R(t) represents effective DF adaptation (increasing with time) and Q(t) represents amplitude adaptation (decreasing with time). These are the two components of the early fast adaptation in A1, sharing a common timescale τ_A1 = 500 ms. The tonotopic spread of inputs in A1 evolves with time according to where the initial DF fraction is p = 0.1 (R(t) rises from 0.1 to 1; effective DF rises from 0.1DF to DF). The input amplitude evolves according to where 1 + m (m = 2.5) is the initial input amplitude factor (Q(t) decays exponentially from 3.5 to 1; input amplitude decays from 3.5I_p to I_p).
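The description above can be made concrete with a short Python sketch. The parameter values are those quoted in the text. The functional forms are assumptions that are merely consistent with the verbal description: a t² exp(−2t/α) onset shape normalised to 1 at t = α, an exponential rise for R(t), an exponential decay for Q(t), and an exponential tonotopic decay for the weighting function. All function names are hypothetical.

```python
import math

ALPHA_1 = 0.015       # early rise time, s (15 ms)
ALPHA_2 = 0.0825      # late rise time, s (82.5 ms)
LAMBDA_2 = 1.0 / 6.0  # plateau amplitude fraction
TAU_A1 = 0.5          # early A1 adaptation timescale, s
P_INIT = 0.1          # initial DF fraction p
M = 2.5               # initial amplitude factor is 1 + m
SIGMA_P = 9.7         # tonotopic decay constant, semitones
I_P = 0.6             # input amplitude


def onset(t, alpha):
    """ASSUMED onset shape; zero before the tone, equal to 1 at t = alpha."""
    if t <= 0.0:
        return 0.0  # plays the role of the Heaviside factor H(t)
    return math.exp(2.0) / alpha ** 2 * t ** 2 * math.exp(-2.0 * t / alpha)


def tone_response(t):
    """Fast onset component plus a smaller, slower plateau component."""
    return onset(t, ALPHA_1) + LAMBDA_2 * onset(t, ALPHA_2)


def df_adaptation(t):
    """R(t): rises from p to 1 (exponential form ASSUMED)."""
    return 1.0 - (1.0 - P_INIT) * math.exp(-t / TAU_A1)


def amplitude_adaptation(t):
    """Q(t): decays exponentially from 1 + m to 1."""
    return 1.0 + M * math.exp(-t / TAU_A1)


def input_weight(df, t):
    """ASSUMED weighting: amplitude I_p * Q(t), decay over effective DF R(t) * df."""
    return I_P * amplitude_adaptation(t) * math.exp(-df_adaptation(t) * df / SIGMA_P)


if __name__ == "__main__":
    for t in (0.0, 0.5, 2.0):
        print(t, df_adaptation(t), amplitude_adaptation(t), input_weight(5.0, t))
```

Under these assumptions the weight at a given DF is large and broadly spread at sequence onset, then becomes smaller and more sharply tuned within roughly a second.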
In order to specify the amount of input received by each unit, I_AB, I_A and I_B, in (1), we first construct sequences of tone responses TR_A(t) (A A . . . ) and TR_B(t) ( B . . . ) where the tones and silences (" ") each have a duration of 100 ms. Inputs for a repeating ABA . . . sequence are given by and plotted in Fig 1B. Respectively, equations (6) and (7)
We now specify how the formulation of the model in the present study relates to the one in Rankin et al (2015). In our previous study a slow synaptic depression on the recurrent excitation was introduced, but here we assume this does not play a role in the build-up phase, i.e. we use static excitation (denoted e_fix in Rankin et al (2015)). To maintain a match to our experimental data under this assumption, g, β_e, γ and I_p were adjusted relative to the values used in Rankin et al (2015). In the present study we use global, rather than DF-dependent, inhibition (denoted i_gbl in Rankin et al (2015)); see our previous paper for further discussion on this point. The input terms in (1) given by I_AB, I_A, I_B refer to the input to the competition stage, which may differ from the A1 responses, in particular when inputs from distractor tones are gated out; see Fig 4A.

B.2 Inputs from distractor and deviant tones, simple implementation of SSA
For a distractor tone at tonotopic location d, or a deviant tone at tonotopic location D, the amplitude response in A1 can be computed in terms of the frequency difference DF_d (or DF_D) between d (or D) and the tonotopic locations A, (A+B)/2 and B. The weighting function for a distractor (similarly for a deviant) is given by where the amplitude adapts through Q(t) and the tonotopic spread is assumed to be broad, σ_d = 2.7σ_p (for example when above or below the A and B tones).
In the presence of the ABA triplets, the A location is hit by more tones and, if a distractor immediately follows at A, it will be significantly adapted due to stimulus specific adaptation (SSA) in A1 (Ulanovsky et al 2004; Taaseh et al 2011). As such, a relatively smaller response is assumed at the A location (factor 0.5 in (10)). This ad hoc, straightforward implementation of SSA is illustrated in SIFig. 2B. We provide a more general implementation of SSA below. We now let TR_d(t) (. . . d . . . ) represent an impulse (4) at the specific time of the additional distractor tone. A distractor tone, as a salient new event, is assumed boosted (I_d = 2.8I_p) when it is integrated as input to the competition stage (see Fig 3D, where the distractor tone d gives larger amplitude input to the competition stage than preceding tones). For, say, a distractor tone at tonotopic location B+2 the modified inputs would be I_AB(t) = I_AB(t) + w_d(DF/2 + 2)TR_d(t),
(Figure caption fragment: apart from the SSA model, the same assumptions are used, namely boost of inputs to the A-unit and B-unit and no input to the AB-unit; see Fig 4A.)
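The bookkeeping for a distractor tone can be sketched in Python as below. This is only an illustration under stated assumptions: the exponential form of the distractor weighting, the placement of the boosted amplitude I_d inside that weighting, the tonotopic coordinates (A at 0, B at DF), and the names `distractor_weight` and `distractor_increments` are hypothetical. The factor 0.5 at the A location, σ_d = 2.7σ_p and I_d = 2.8I_p come from the text.

```python
import math

SIGMA_P = 9.7             # tonotopic decay constant, semitones
SIGMA_D = 2.7 * SIGMA_P   # broad spread assumed for a distractor
I_P = 0.6                 # input amplitude
I_D = 2.8 * I_P           # boosted amplitude for a salient distractor
SSA_FACTOR_A = 0.5        # reduced response at the heavily adapted A location


def distractor_weight(df_d, q=1.0):
    """ASSUMED form: amplitude I_d * Q with exponential decay over |DF_d| / sigma_d."""
    return I_D * q * math.exp(-abs(df_d) / SIGMA_D)


def distractor_increments(df, d_loc, q=1.0, gate_ab=False):
    """Increments to I_A, I_AB and I_B per unit of TR_d(t).

    Coordinates are an assumption of this sketch: A sits at 0 and B at df,
    so the AB-unit sits at df / 2 and a tone at B+2 has d_loc = df + 2.
    """
    unit_locations = {"A": 0.0, "AB": df / 2.0, "B": df}
    inc = {name: distractor_weight(d_loc - loc, q)
           for name, loc in unit_locations.items()}
    inc["A"] *= SSA_FACTOR_A   # ad hoc SSA at the A location
    if gate_ab:
        inc["AB"] = 0.0        # distractor input gated out of the AB-unit
    return inc


if __name__ == "__main__":
    df = 5.0
    # Distractor at B+2: distance to the AB-unit is DF/2 + 2, as in the text.
    print(distractor_increments(df, d_loc=df + 2.0))
    print(distractor_increments(df, d_loc=df + 2.0, gate_ab=True))
```

The `gate_ab` switch corresponds to the assumption that distractor input may be gated out before reaching the AB-unit.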

B.3 General model for stimulus specific adaptation in A1
Here we provide a more general description of how neuronal responses in A1 depend on the tonotopic location of a new tone that is subject to SSA from preceding tones. Our implementation of SSA is based primarily on feedforward effects. In SSA, a location that has received a sustained input will be adapted in response to further input at the same tonotopic location, with a bandwidth of around 3-4 st in A1 (Ulanovsky et al 2004; Taaseh et al 2011). We provide a plausible, general implementation of SSA in our model that could describe A1 responses and be used to determine the inputs from distractor tones to the model's competition stage. The general schema described below, for computing the relative amplitude of responses to new tones additional to the ABA triplets, yields very similar results to the ad hoc description above; compare Fig 4E with SIFig. 2D.
The general principle is to determine how the tuning curve for a new tone might be modified, based on previous inputs from the regular triplet tones. Example tuning curves for new tones (shown unadapted in SIFig. 2A) are modified by the adaptation profiles (dashed curves in SIFig. 2B), dependent on the location of the new tone relative to preceding inputs. The adaptation profiles show the most adaptation close to the A tones (fast repetition rate), and less adaptation close to the B tones (slow repetition rate). For a new tone below A, the tuning curve (blue solid curve in SIFig. 2B) is carved out on the right-hand side. For a new tone above B, the tuning curve (green solid curve in SIFig. 2B) is carved out on the left-hand side. For a tone in between, the tuning curve is carved out on either side (yellow solid curve in SIFig. 2B). Below we give a more complete, mathematical description of how the modified tuning curves are calculated.
In this more general formulation, functions will be defined in terms of a tonotopic coordinate y, rather than in terms of a frequency difference DF_d, as used above in (9). In the absence of any prior input, an isolated tone will elicit a response in A1 that is largest at the tonotopic location of the tone and decays on either side (SIFig. 2A). In Rankin et al (2015), the tuning of these responses was assumed to have a symmetric exponential decay and, for a tone at a location N, this can be described by where σ_tc = 4σ_p is broad relative to the post-adaptation tuning width for the A and B tones in (5). In the presence of repeating ABA triplets that precede a new tone, the tuned responses will depend on the location of the new tone relative to the As or the Bs. In general, if a series of tones has been arriving at a specific tonotopic location L (either A or B) then the tuning curve of any subsequent tone will be altered. For a new tone N_+ above L the left side of its tuning curve will be reduced. For a new tone N_− below L the right side of its tuning curve will be reduced. The following equation describes the Gaussian adaptation profile AP around the L location where BW = 4 is the bandwidth of adaptation and c_L is the amplitude of adaptation, which will be larger when, for example, the preceding sequence of L tones has a higher repetition rate. Equation 12 is 1 for y ≪ L, decreases with Gaussian decay to 1 − c_L as y approaches L from below and is 1 − c_L for y ≥ L. We similarly define AP for a tone below L. In this way the modified tuning curve TC for a tone N_+ above L is given by multiplying the tuning curve with the appropriate adaptation profile, TC(y, N_+, L) = TC(y, N_+) AP_+(y, L), and for a tone N_− below L it is similarly given by TC(y, N_−, L) = TC(y, N_−) AP_−(y, L).
If a tuning curve is modulated by two sequences of tones L_1 and L_2, an additional argument in (14) or (15) can signify further modulation of the tuning curve by a second adaptation profile, e.g. TC(y, N_−, L_1, L_2) = TC(y, N_−) AP_−(y, L_1) AP_−(y, L_2). These functions can now be used to work out the tuning curves for responses to deviant tones d, relative to the locations of the A and B tones featured in the ABA triplet sequence. Assuming significantly more adaptation at the A location due to the higher repetition rate, we set the adaptation strengths associated respectively with the A and B locations to be c_A = 0.5 and c_B = 0.125. The adaptation profile for a tone below A (which is also below B) will be and is plotted dashed blue in SIFig. 2B. For a tone between A and B (above A and below B), we have plotted dashed yellow in SIFig. 2B. For a tone above B (also above A), we have plotted dashed green in SIFig. 2B. For example, the tuning curve for a new tone (e.g. distractor tone) arriving at a location A-2 (SIFig. 2B solid blue) is given by at a location (A+B)/2 (SIFig. 2B solid yellow) is given by and at a location B+2 (SIFig. 2B solid green) is given by TC_B+(y, B + 2, A, B) = TC(y, B + 2) AP_B+(y, A, B).
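The multiplicative scheme described above can be sketched in Python as follows. Values for σ_tc, BW, c_A and c_B are taken from the text. The exact Gaussian width convention, exp(−((y − L)/BW)²), is an assumption, as are the function names and the `flat_above` switch. With `flat_above=True` the profile follows the wording given for Equation 12 (equal to 1 − c_L at and above L, recovering to 1 far below L); `flat_above=False` gives the mirror image. The sketch deliberately does not decide which of the two corresponds to AP_+ or AP_−.

```python
import math

SIGMA_P = 9.7
SIGMA_TC = 4.0 * SIGMA_P  # unadapted tuning width
BW = 4.0                  # adaptation bandwidth, semitones
C_A = 0.5                 # adaptation strength at A (fast repetition rate)
C_B = 0.125               # adaptation strength at B (slow repetition rate)


def tuning_curve(y, n, sigma=SIGMA_TC):
    """Symmetric exponential tuning around a tone at location n."""
    return math.exp(-abs(y - n) / sigma)


def adaptation_profile(y, loc, c, bw=BW, flat_above=True):
    """One-sided Gaussian adaptation profile around an adapted location loc.

    ASSUMED Gaussian convention: 1 - c * exp(-((y - loc) / bw) ** 2) on the
    recovering side and a constant 1 - c on the other side.
    """
    d = (y - loc) if flat_above else (loc - y)
    if d >= 0.0:
        return 1.0 - c
    return 1.0 - c * math.exp(-(d / bw) ** 2)


def modulated_curve(y, n, adapted, flat_above=True):
    """Tuning curve for a tone at n multiplied by one profile per adapted location."""
    value = tuning_curve(y, n)
    for loc, c in adapted:
        value *= adaptation_profile(y, loc, c, flat_above=flat_above)
    return value


if __name__ == "__main__":
    a_loc, b_loc = 0.0, 5.0  # hypothetical coordinates, DF = 5 st
    for y in (-10.0, 0.0, 2.5, 5.0, 7.0):
        print(y, modulated_curve(y, b_loc + 2.0, [(a_loc, C_A), (b_loc, C_B)]))
```

Stacking profiles by multiplication means that a second adapted location only ever reduces the response further, never restores it.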
To summarise, for TC, the first argument is the tonotopic location and the second argument is the location of a new tone. The subscript A−, AB or B+ indicates whether the new tone is below, between, or above the A and B tones. The third and fourth arguments are the adapted locations for preceding tones (here A and B from the ABA triplets). Having defined the relative amplitude across tonotopy in A1, we now describe the final steps to determine the inputs to the model's competition stage. Similar to (10), the inputs for, say, a distractor tone d above B are I_AB(t) = I_AB(t) + I_ssa Q( where Q(t) describes early onset adaptation and I_ssa = 3I_p is the boosted amplitude for a salient new tone. Again, if we were to incorporate the assumption illustrated in Fig 4A, that no input from a distractor tone reaches the AB-unit, we would set I_AB(t) = I_AB(t). SIFig. 2D shows the effect on proportion segregated of distractor tones at different tonotopic locations with the general model for SSA presented here. The general model for SSA captures the same features as shown in Fig 4E, and is based on the same assumptions illustrated in Fig 4A, but with a different implementation of SSA.