Nicotinic and Cholinergic Modulation of Reward Prediction Error Computations in the Ventral Tegmental Area: A Minimal Circuit Model

Dopamine (DA) neurons in the ventral tegmental area (VTA) are thought to encode reward prediction errors (RPE) by comparing actual and expected rewards. In recent years, much work has been done to identify how the brain uses and computes this signal. While several lines of evidence suggest that the interplay of the DA neurons and the inhibitory interneurons in the VTA implements the RPE computation, it remains unclear how the DA neurons learn key quantities, for example the amplitude and the timing of primary rewards during conditioning tasks. Furthermore, exogenous nicotine and endogenous acetylcholine, acting on both VTA DA and GABA (γ-aminobutyric acid) neurons via nicotinic acetylcholine receptors (nAChRs), also likely affect these computations. To explore the potential circuit-level mechanisms of RPE computations during classical-conditioning tasks, we developed a minimal computational model of the VTA circuitry. The model was designed to account for several reward-related properties of VTA afferents and for recent findings on VTA GABA neuron dynamics during conditioning. With our minimal model, we showed that the RPE can be learned by a two-speed process computing reward timing and magnitude. Including a model of nAChR-mediated currents in the VTA DA-GABA circuit, we also showed that nicotine should reduce the action of acetylcholine on the VTA GABA neurons by receptor desensitization, and could therefore boost the DA responses to reward information. Together, our results delineate the mechanisms by which RPEs are computed in the brain, and suggest a hypothesis on nicotine-mediated effects on reward-related perception and decision-making.


INTRODUCTION
To adapt to their environment, animals constantly compare their predictions with new environmental outcomes (rewards, punishments, etc.). The difference between prediction and outcome is the prediction error. …

The temporal dynamics of the average activities of DA and GABA neurons in the VTA, taken from Graupner et al. (2013), are described by the following equations:

τ dν_D/dt = −ν_D + φ(I_D),    τ dν_G/dt = −ν_G + φ(I_G),    (1)

where ν_D and ν_G are the mean firing rates of the DA and GABAergic neuron populations, respectively.
Here φ is a sigmoidal rate function,

φ(x) = ω / (1 + exp(−β(x − γ))),

where ω = 30 is the maximum firing rate, γ = 8 is the inflexion point, and β = 0.3 is the slope. The input currents in Eq. 1 are given by weighted sums of the afferent signals, where the w_x's (with x = G, PFC, PPT-D, PPT-G, α4β2) specify the total strength of the respective inputs (Fig. 1, Table 1). The weight of the α4β2-nAChRs, w_α4β2 = 15, was chosen in order to account for the increase of …

2.2.1 Classical-conditioning task and the associated signals

We modeled a VTA neural circuit (Fig. 1) … Each behavioral trial begins with a conditioned stimulus (CS; a tone, 0.5 s), followed by an unconditioned stimulus (US; the outcome, 0.5 s), separated by an interval of 1.5 s (Fig. 2A). This type of task, which implies a delay between the CS offset and the US onset (here, 1 s), is a trace-conditioning task; it differs from a delay-conditioning task, where the CS and US overlap (Connor and Gould, 2016).
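As a rough, self-contained sketch, the population rate dynamics and the sigmoid above can be simulated with Euler integration. Only ω, γ and β come from the text; the time constant TAU_M and the constant inputs below are illustrative assumptions.

```python
import numpy as np

OMEGA, GAMMA, BETA = 30.0, 8.0, 0.3  # max rate (Hz), inflexion point, slope (from the text)
TAU_M = 0.03                         # s; assumed population time constant

def phi(x):
    """Sigmoidal rate function saturating at OMEGA, with inflexion at GAMMA."""
    return OMEGA / (1.0 + np.exp(-BETA * (x - GAMMA)))

def step(nu_D, nu_G, I_D, I_G, dt=1e-3):
    """One Euler step of tau * dnu/dt = -nu + phi(I) for both populations."""
    nu_D += dt / TAU_M * (-nu_D + phi(I_D))
    nu_G += dt / TAU_M * (-nu_G + phi(I_G))
    return nu_D, nu_G

# Relax both populations to their fixed points under constant (illustrative) inputs.
nu_D, nu_G = 0.0, 0.0
for _ in range(1000):
    nu_D, nu_G = step(nu_D, nu_G, I_D=10.0, I_G=6.0)
# nu_D -> phi(10.0) ~ 19.4 Hz, nu_G -> phi(6.0) ~ 10.6 Hz
```

With constant input, each population simply converges to φ(I); the circuit dynamics of the model arise from the time-varying afferent currents.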

As the animal learns that a fixed reward constantly follows a predictive tone at a specific timing, our … To integrate the US input into a short-term phasic component, we use the function G_τ(x(t)) (Vitay and Hamker, 2014), defined as follows: … Here, when x(t) switches from 0 to 1 at time t = 0, G_τ(x(t)) displays a localized bump of activation with a maximum at t = τ. This function is thus convenient for integrating the square signal ν_US(t) (Fig. 2A) …, where ν_PPTg is the mean activity of the PPTg neuron population, τ_PPTg = 100 ms (the short-latency response), and f(x) is a Hill function with two parameters: f_max, the saturating firing rate, and h, the …

We thus assume that the PFC integrates the CS signal and learns to maintain its activity until the reward delivery. Consistently with previous neural-circuit working-memory models (Durstewitz et al., 2000), we minimally described this mechanism by a neural population with recurrent excitation and a slower …, where τ_PFC = 100 ms (short-latency response) and a(t) describes the amount of adaptation that the neurons have …
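The phasic integrator G_τ can be sketched as a two-stage low-pass cascade whose output to a 0 → 1 step is the bump (t/τ) exp(−t/τ), peaking at t = τ. This particular implementation is an assumption that reproduces the property stated above, not necessarily the exact form used by Vitay and Hamker (2014).

```python
import numpy as np

def G_tau(x, tau=0.1, dt=1e-3):
    """Short-term phasic integrator: difference of a two-stage low-pass cascade.

    For a step input, the output is ~ (t/tau) * exp(-t/tau), a bump
    peaking at t = tau with amplitude exp(-1)."""
    y1 = y2 = 0.0
    out = np.empty(len(x), dtype=float)
    for i, xi in enumerate(x):
        y1 += dt / tau * (xi - y1)   # first low-pass stage
        y2 += dt / tau * (y1 - y2)   # second low-pass stage
        out[i] = y1 - y2             # phasic (onset-sensitive) component
    return out

# A 0.5 s square US pulse filtered with tau = tau_PPTg = 100 ms:
dt = 1e-3
t = np.arange(0.0, 1.0, dt)
us = ((t >= 0.0) & (t < 0.5)).astype(float)
bump = G_tau(us, tau=0.1, dt=dt)   # localized bump peaking near t = 0.1 s
```

This turns the sustained square US signal into a transient response, as required for the short-latency phasic PPTg drive.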

2.2.4 Learning of the US timing in the PFC

The dynamical system described above typically switches between two stable states: quasi-absence of activity or maximal activity in the PFC. The latter stable state appears in particular as J_PFC(n) increases with learning: …, where α_T = 0.2 is the timing learning rate and ∆t_DA = t_2 − t_1 measures the difference between the time at which PFC activity declines (t_1 such that ν_PFC(t_1) < γ after CS onset) and the time of DA maximal activity …, where α_V is the cortico-striatal plasticity learning rate related to reward magnitude and δ(n) is a deviation from …

As the PPTg was found to be the main source of cholinergic input to the VTA, we assume that the ACh concentration directly depends on PPTg activity, as modeled by the following equation: …, where w_ACh = 1 μM is the amplitude of the cholinergic connection that tunes the concentration of …, and where τ_y(Nic, ACh) refers to the Nic/ACh concentration-dependent time constant at which the steady state y_∞(Nic, ACh) is achieved. The maximal achievable activation or sensitization for a given Nic/ACh concentration, a_∞(Nic, ACh) and s_∞(Nic, ACh), is given by Hill equations of the form: …, where EC_50 and IC_50 are the half-maximal concentrations of nAChR activation and sensitization, respectively. The factor α > 1 accounts for the higher potency of Nic to evoke a response as compared …, where τ_max refers to the recovery time constant from desensitization in the absence of ligands and τ_0 is the …

… proportional to the laser intensity, I = 4 for 1.5 < t < 2.5 and zero otherwise. Then, we subtracted this signal from the VTA GABA neuron activity as follows: …, where s is the subtracted signal that integrates the light signal ν_light with a time constant τ_s = 300 ms, ν_G-opto is the photo-inhibited GABA neuron activity, and ν_G-control is the normal GABA neuron activity. … could not return to the same rewarding location; they had to choose between the two remaining locations.
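The nAChR steady-state activation and sensitization curves described earlier in this section can be sketched with Hill functions. The EC50, IC50, Hill coefficients, and α values below are placeholders for illustration, not the parameters of Graupner et al. (2013).

```python
def hill(c, half, n):
    """Hill function of ligand concentration c, half-maximal at c = half."""
    return c**n / (c**n + half**n) if c > 0 else 0.0

ALPHA = 3.0  # placeholder; ALPHA > 1 gives Nic higher potency than ACh

def a_inf(nic, ach, ec50=1.0, n=1.5):
    """Steady-state nAChR activation: increases with ligand concentration."""
    return hill(ALPHA * nic + ach, ec50, n)

def s_inf(nic, ach, ic50=0.1, n=0.5):
    """Steady-state sensitization: high ligand levels desensitize (s -> 0)."""
    return 1.0 - hill(ALPHA * nic + ach, ic50, n)

# Adding nicotine on top of a fixed ACh tone lowers s_inf (desensitization),
# which in the model weakens the cholinergic drive onto VTA GABA neurons.
```

Because IC50 is taken well below EC50 here, a sustained nicotine concentration desensitizes the receptors more than it activates them, which is the mechanism invoked in the Results for the boosted DA responses.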
We thus modeled decisions between two alternatives. The probability P_i of choosing option i was computed according to the softmax choice rule:

P_i = exp(b V_i) / [exp(b V_i) + exp(b V_j)],

where V_i and V_j are the values of the states i and j (the other option), respectively, and b is an inverse-temperature parameter reflecting the sensitivity of the choice to the difference between the two values. We chose b = 0.4, which corresponds to a reasonable exploration-exploitation ratio. We simulated the task over 10,000 simulations and computed the number of times the mouse chose each location, thus obtaining the average repartition of the mouse over the three locations. A similar task was simulated for mice after 5 min of Nic ingestion (see below).
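The choice rule above can be simulated directly. For two alternatives the softmax reduces to a logistic function of the value difference; the state values below are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def choice_prob(v_i, v_j, b=0.4):
    """Softmax probability of choosing option i over option j (two alternatives)."""
    return 1.0 / (1.0 + np.exp(-b * (v_i - v_j)))

# Placeholder state values for the two remaining locations.
v_i, v_j = 4.0, 1.0
p = choice_prob(v_i, v_j)          # b = 0.4, as in the text
choices = rng.random(10_000) < p   # 10,000 simulated choices of option i
repartition = choices.mean()       # fraction of visits to location i
```

Raising b sharpens the preference toward the higher-valued location (exploitation); lowering it flattens the choice distribution (exploration).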

RESULTS
We used the model developed above to understand the learning dynamics within the PFC-VTA circuitry, and the effects of nicotine exposure on DA responses to rewarding events.

We should note that most experiments simulated herein concern the learning of a CS-US association (Fig. 2). The learning procedure consists of a conditioning phase where a tone (CS) and a constant water reward (US) are presented together for 50 trials. Within each 3 s trial, the CS is presented at t = 0.5 s (Fig. 3, 5, 6, dashed grey line), followed by the US at t = 2 s (Fig. 3, 5, 6, dashed cyan line). … The model produces a phasic excitation of DA neurons at the US when the reward is unexpected (Fig. 3D, n = 1), and a small excitation in GABA neurons (Fig. 3B, n = 1). PPTg fibers also stimulate VTA neurons through ACh-mediated α4β2-nAChR activation, with a larger influence on GABA neurons (r = 0.2 in Fig. 1).
Early in the conditioning task, simulated PFC neurons respond to the tone (Fig. 3A, n = 1), and this activity builds up until it is maintained during the whole CS-US interval (Fig. 3A, n = 6, n = 50).

Thus, PFC neurons show a working-memory-like activity, now tuned to decay at the reward delivery time.

Concurrently, the phasic activity of DA neurons at the US acts as a prediction-error signal on corticostriatal synapses, increasing the glutamatergic input from the NAc onto VTA DA and GABA neurons (Fig. 3B, 3D, 4B). Note that the NAc was not modeled explicitly; we modeled the net effect of the PFC-NAc plasticity with the variable w_PFC (see next section).

Together, these results propose a simple mechanism for RPE computation in the VTA and its afferents. … by Eq. 8. This two-speed learning process qualitatively reproduces the DA dynamics found experimentally, with almost no effect outside the CS and US time windows (Fig. 3D).
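A minimal sketch of such a two-speed process is given below. The additive update forms and the α_V value are assumptions for illustration; only α_T = 0.2 and the quantities δ(n) and ∆t_DA come from the text, and the paper's exact Eqs. 7-8 may differ.

```python
ALPHA_T = 0.2   # timing learning rate (from the text)
ALPHA_V = 0.5   # value learning rate; placeholder value

def two_speed_update(w_pfc, j_pfc, delta, dt_da):
    """One trial of learning: value update on w_PFC, timing update on J_PFC."""
    w_pfc += ALPHA_V * delta     # value branch, driven by the RPE delta(n)
    j_pfc += ALPHA_T * dt_da     # timing branch, driven by dt_DA = t2 - t1
    return w_pfc, j_pfc

# Across trials, delta(n) shrinks as the US becomes predicted, and dt_DA shrinks
# as PFC activity learns to span the CS-US delay, so both weights converge.
w, j = two_speed_update(0.0, 1.0, delta=0.5, dt_da=0.1)
```

Both updates are driven by DA activity, but through different error signals operating at different rates, which is what decouples the learning of reward magnitude from the learning of reward timing.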

In particular, the graphical analysis of the PFC system enables us to understand the timing learning … 3A. After learning (Fig. 4D), the system initially shows the same dynamics, but when the CS is removed, the system is maintained at the second fixed point (30 Hz) until reward delivery (Fig. 3A, n = 50), owing to its bistability after CS presentation (cyan curve). Finally, with the adaptation dynamics, the PFC activity decays right after reward delivery (Fig. 4D, dark blue). Indeed, through this timing-learning mechanism, the strength of the recurrent connections maintains the Up-state activity of the PFC exactly until the US timing (Eq. 7). Together, these simulations show a two-speed learning process that enables VTA dopamine neurons to predict the value and the timing of the water reward from PFC plasticity mechanisms.

We next focus specifically on the local interactions among VTA neurons during the conditioning task. … We next asked whether we can identify the effects of nicotine action in the VTA during the classical-conditioning task.

For our qualitative investigations, we assume that α4β2-nAChRs are mainly expressed on VTA GABA neurons (r = 0.2), and we study the effects of nicotine-induced desensitization of these receptors.

Nic-induced desensitization may lead to several effects. First, under nicotine (Fig. 6B), the DA baseline activity slightly increases. Second, simulated exposure also raises DA responses to reward delivery when the animal is naive (Fig. 6A, 6B, n = 1), and therefore to reward-predictive cues when the animal has learnt the task (Fig. 6A, 6B, n = 50). As expected, these effects derive from the reduction of the ACh-induced GABA activation provided by the PPTg nucleus (Fig. 3C). Thus, our simulations predict that nicotine would up-regulate DA bursting activity at rewarding events.

What would happen if the animal, after having learned in the presence of nicotine, is no longer exposed to it (nicotine withdrawal)? To answer this question, we investigated the effects of nicotine withdrawal on DA activity after the animal has learnt the CS-US association under nicotine (Fig. 6C), with the same amount of reward (4 μL). In addition to a slight decrease in DA baseline activity, the DA response to the simulated water reward is reduced even below baseline (Fig. 6C, dark red). DA neurons would then signal …

To evaluate the effects of nicotine on the choice preferences among reward sizes, we simulated a … Following reinforcement-learning theory (Rescorla and Wagner, 1972; Sutton and Barto, 1998), the CS response to each reward size (computed from Fig. 6D) was attributed as the expected value of each location. We then computed the repartition of the mouse between the three locations over 10,000 simulations, in control conditions or after 5 min of nicotine ingestion.

In control conditions, the simulated mice chose according to each location's estimated value (Fig. 7B); the …

DISCUSSION
The overarching aim of this study was to determine how dopamine neurons compute key quantities such as … Finally, in our behavioral simulations of a decision-making task (Fig. 7), we report that nicotine exposure … probability. In this line, future studies could investigate the effects of chronic nicotine on VTA activity during a classical-conditioning task as presented here (Fig. 6), but also on behavioral choices according to reward size (Fig. 7).

The parameters of the model were chosen qualitatively in order to account for most of the experimental data from different studies (references) with relative accuracy. The α4β2-containing nAChR parameters were directly taken from Graupner et al. (2013), whereas the network parameters were qualitatively adapted from different studies.

When no data could be related, some parameters were fixed arbitrarily (here).

CONFLICT OF INTEREST STATEMENT
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Afferent inputs and circuitry of the ventral tegmental area (VTA). The GABA neuron population (red) locally inhibits the DA neuron population (green). This local circuit receives excitatory glutamatergic input (blue axons) from the corticostriatal pathway and the pedunculopontine tegmental nucleus (PPTg). The PPTg furthermore furnishes cholinergic projections (purple axon) to the VTA neurons (α4β2 nAChRs). r is the parameter that continuously shifts the dominant site of α4β2-nAChR action. Dopaminergic efferents (green axon) project, amongst others, to the nucleus accumbens (NAc) and the prefrontal cortex (PFC), and modulate the cortico-striatal projection weights w_PFC and the PFC recurrent-excitation weights J_PFC. The PFC integrates CS (tone) information, while the PPTg responds phasically to the water reward itself (US). Dopamine and acetylcholine outflows are represented by green and purple shaded areas, respectively. All parameters and descriptions are summarized in Table 1.

Figure 4. Phase analysis of PFC neuron activity from Eq. 6 before learning (C) and after learning (D). Different times of the task are represented: t < 0.5 s (before CS onset, light blue) and 1 s < t < 2 s (between CS offset and US onset, light blue); 0.5 s < t < 1 s (during CS presentation, medium blue); and t > 2 s (after US onset, dark blue). Fixed points are represented by green (stable) or red (unstable) dots. Dashed arrows: trajectories of the system from t = 0 to t = 3 s.

TD-error model as implemented in Schultz (1998). The TD error in DA neurons is computed from three inputs: two reward-expectation signals and one reward signal. Traces show how these terms change with time on the last trial of a conditioning task.
The DA response to a reward omission can be approximated by V(t + 1) − V(t) (gray), the derivative of the value function V(t). Adapted from Watabe-Uchida et al. (2017).