Bi-Directional Effect of Increasing Doses of Baclofen on Reinforcement Learning

In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. In mouse brain slices, GABAB-receptor agonists at low concentrations increase the firing frequency of VTA–DA neurons, whereas high concentrations reduce it. However, it remains unclear whether baclofen can modulate reinforcement learning in humans. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high dose of oral baclofen, a high-affinity GABAB-receptor agonist, in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen, on the other hand, did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels and thus facilitates reinforcement learning.

Introduction
In his paper on "The Law of Effect," Thorndike stipulated that: "of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur" (Thorndike, 1898). Since then, it has been suggested that the mesolimbic dopamine (DA) system is involved in this learning by coding for a "reward-prediction error" (Schultz et al., 1997). The mesocorticolimbic DA system originates in the ventral tegmental area (VTA), which projects to the nucleus accumbens (NAc) and the prefrontal cortex. Under physiologic conditions, mesocorticolimbic projections release DA in response to natural rewards such as food and sex, which are critical for the survival of the species. This process reflects the fact that it is important for an organism to learn the circumstances under which rewards are obtained (Balland and Lüscher, 2008). When an external reward is delivered, DA neurons elicit a strong learning signal indicating whether the value of the current state is better or worse than predicted (Schultz et al., 1997), rather than euphoria or pleasure (Balland and Lüscher, 2008). This signal therefore allows rapid acquisition of predictive cues and efficient behaviors that are successful in obtaining rewards (Bechara et al., 1998). Evidence that this system can be pharmacologically modulated by changes in DA function has been provided by Pessiglione et al. (2006). In their study, human volunteers carried out a learning task that involved money gains and losses while functional magnetic resonance images (fMRI) were collected. When mesocorticolimbic DA was boosted by L-DOPA, the participants learned faster and earned more money. Conversely, when DA signaling was inhibited by haloperidol, participants learned more slowly and earned less money compared to the control group. Interestingly, no shift of the learning curves was observed when participants were in the loss condition, which suggests that other processes are involved in aversive learning. In a separate study using the Iowa gambling task, an activation of the ventral striatum has also been shown by fMRI (Li et al., 2010).
The effect of DA on learning can be explained by a modulation of the mesocorticolimbic system of circuits involved in action planning and decision-making. In many mammals, at least two systems exist to predict the value of an action: the planning or explicit system, which takes a given situation, predicts an outcome and evaluates that outcome; and the habit or implicit system, which takes a given situation and identifies the best remembered action to take (Redish et al., 2008). The flexible planning system involves the ventral and dorsomedial striatum, the prelimbic medial prefrontal cortex and the orbitofrontal cortex, as well as the entorhinal cortex and hippocampus, with an involvement of DA inputs from the VTA. The habit system involves the dorsolateral striatum, the infralimbic medial prefrontal cortex as well as the parietal cortex, with an involvement of DA inputs from the pars compacta of the substantia nigra (SNc; Redish et al., 2008). The mesocorticolimbic system thus has a central role in evaluating the value of predicted outcomes during decision-making and planning. An over-evaluation of a predicted value by the DA system might alter the decision system, leading to addictive behaviors (Redish et al., 2008). Another mechanism leading to automatic decision-making and even addiction could be the recruitment of the habit system by the NAc via feedback loops to the dorsal striatum (Koob and Volkow, 2010). Understanding how modulation of DA can alter valuation and decision-processing therefore has profound implications for understanding motivated behaviors and addiction.
Here we propose to pharmacologically modulate DA release with the GABAB-receptor agonist baclofen and observe the effect of this modulation on an instrumental learning task.
Baclofen (p-chlorophenyl-GABA) acts as a high-affinity γ-aminobutyric acid type B (GABAB) receptor agonist. Its primary action as a spasmolytic agent is mediated by an increase in K+ conductance that results in postsynaptic inhibition (Cruz et al., 2004; Katzung, 2009). In addition, baclofen causes presynaptic inhibition by reducing Ca2+ influx and the release probability of excitatory transmitters in the brain and spinal cord (Katzung, 2009). Interestingly, baclofen may also modulate DA release in the mesocorticolimbic system by targeting VTA neurons (Lomazzi et al., 2008). A recent model proposed by Cruz et al. (2004) shows a bi-directional control of DA activity by increasing doses of baclofen. In this model, low-dose baclofen preferentially inhibits γ-aminobutyric acid (GABA) neurons, which control in part DA neuron activity, causing a disinhibition of DA neurons. Conversely, high-dose baclofen inhibits the firing of DA neurons, resulting in a decrease of transmitter release to the NAc in the ventral striatum. A possible explanation for this phenomenon is based on a different coupling efficiency between GABAB-receptors, G-proteins, regulator of G-protein signaling (RGS) proteins, and G-protein-gated inwardly rectifying potassium channels (GIRK/Kir3), forming a macromolecular signaling complex (Lomazzi et al., 2008). Indeed, it has been shown that the concentration that produces 50% of the maximal effect (EC50) of baclofen is one order of magnitude lower in GABA neurons than in DA neurons. Therefore, low doses of baclofen preferentially inhibit GABA neuron activity (Cruz et al., 2004; Labouèbe et al., 2007).
In this study, we asked whether the reward-prediction-error signal (i.e., DA neuron firing) in healthy human subjects can be modulated by increasing doses of baclofen. We predicted that low-dose baclofen would disinhibit DA neurons, eventually increase DA release, and thus make the behavioral instrumental learning process more efficient. Conversely, a high dose would inhibit DA neurons and therefore reduce the learning rate.

Materials and Methods
Experimental Procedure
The local ethics committee and Swissmedic approved the study (CER 07-074, NAC 07-029, Swissmed: 2008DR2044). The present experiments constituted a randomized, double-blind, placebo-controlled study using a low and a high dose of baclofen. Informed consent was obtained from all subjects.
A total of 36 healthy male subjects were recruited at the University of Geneva. Exclusion criteria were an age below 18 or above 35 years, weight below 60 or above 90 kg, regular consumption of drugs or medications, regular gambling (≥1 time/week, e.g., casino, lottery, poker), and a history of psychiatric or neurological disease. These 36 subjects were randomly split into three groups: 12 subjects received 20 mg of baclofen, 12 subjects received 50 mg of baclofen, and 12 subjects received a placebo. The 50-mg group took 10 mg on the first day and progressively increased the dosage by 10 mg every day to reach 50 mg on day 5. The 20-mg group took the placebo during the first 3 days, took the first 10 mg of baclofen on day 4 and 20 mg on day 5. All groups had to take the last dose 1 h prior to the instrumental learning task. All subjects had to take the same number of compounds over the 5 days. Over the 5 days, subjects were asked to report their degree of alertness using the Stanford Sleepiness Scale (SSS) in order to assess for adverse effects at different time points each day (10.00 am, 1.00 pm, 4.00 pm, 7.00 pm) and 30 min before the instrumental test (Hoddes et al., 1973). The SSS is a rating scale measuring the current level of subjective sleepiness. It consists of seven statements describing different levels of current sleepiness ranging from "feeling active and vital" (1 point) to "almost in reverie" (7 points). In addition, subjects underwent an auditory digit span test (in forward and reverse order) to assess attention and vigilance just prior to the test.
Baclofen is usually used for its spasmolytic effects at a dosage between 30 and 80 mg/day. Twenty milligrams of baclofen therefore represents a low dose (about 0.3 mg/kg p.o. for a 70-kg person). At this dose, the predicted plasma concentration is about 360 ng/ml after 0.5-1.5 h (=1.60 μM, baclofen MW = 213 g/mol; Compendium Suisse des Medicaments, 2011). In the CSF, the expected value is about 8.5 times lower than in the plasma, which corresponds to 42 ng/ml for the 20-mg dose (nearly 0.20 μM; Compendium Suisse des Medicaments, 2011). These doses theoretically correspond to the concentration activating VTA DA neurons in vitro (Cruz et al., 2004). For 50 mg baclofen, the predicted plasma concentration is about 900 ng/ml after 0.5-1.5 h (=4.2 μM, baclofen MW = 213 g/mol; Compendium Suisse des Medicaments, 2011). In the CSF, the expected value corresponds to 106 ng/ml for the 50-mg dose (nearly 0.50 μM; Compendium Suisse des Medicaments, 2011). These doses theoretically correspond to the concentration starting to inhibit the VTA DA neurons in vitro (Cruz et al., 2004). The plasma elimination half-life of baclofen is between 3 and 4 h (Compendium Suisse des Medicaments, 2011).
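The unit conversions in this paragraph can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch: the function names are hypothetical, and the inputs are the predicted plasma values, the molecular weight (213 g/mol), and the 8.5-fold plasma-to-CSF ratio quoted above. Under this arithmetic the 20-mg plasma value comes out near 1.69 μM, slightly above the 1.60 μM given in the text.

```python
BACLOFEN_MW_G_PER_MOL = 213.0  # molecular weight quoted in the text
PLASMA_TO_CSF_RATIO = 8.5      # CSF level taken as 8.5 times lower than plasma


def ng_per_ml_to_micromolar(conc_ng_per_ml, mw=BACLOFEN_MW_G_PER_MOL):
    """Convert a mass concentration (ng/ml) into a molar concentration (uM)."""
    # 1 ng/ml equals 1 ug/l, and (ug/l) / (g/mol) gives umol/l.
    return conc_ng_per_ml / mw


def csf_from_plasma(plasma_ng_per_ml, ratio=PLASMA_TO_CSF_RATIO):
    """Estimate the CSF concentration (ng/ml) from the plasma concentration."""
    return plasma_ng_per_ml / ratio


# Predicted peak plasma concentrations (ng/ml) per oral dose (mg), from the text.
predicted_plasma = {20: 360.0, 50: 900.0}

for dose_mg, plasma in predicted_plasma.items():
    csf = csf_from_plasma(plasma)
    print(
        f"{dose_mg} mg: plasma {ng_per_ml_to_micromolar(plasma):.2f} uM, "
        f"CSF {csf:.0f} ng/ml = {ng_per_ml_to_micromolar(csf):.2f} uM"
    )
```

The CSF estimates (about 0.20 and 0.50 μM) agree with the values stated above.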
The subjects first performed one practice session in order to become acquainted with the instrumental learning task, followed by three experimental sessions of the same task, adapted from Pessiglione et al. (2006). Each session used three new pairs of visual stimuli. Each pair corresponded to one of three conditions ("win" to assess the effects of baclofen on reward learning, "lose" to assess learning from punishment, and "neutral" as a control) and was associated with one pair of outcomes ("win" +1 CHF/nil, "lose" −1 CHF/nil, "neutral" nil/nil), the two stimuli of a pair corresponding to reciprocal probabilities (0.8/0.2 and 0.2/0.8). The outcome of the neutral pair was nil whichever stimulus was chosen.
The three conditions were randomly presented during each run and the relative position of the visual stimuli was counterbalanced across trials. Both items in a pair belonged to the same semantic field (e.g., animals, everyday objects, transport vehicles, etc.), in order not to influence choices by stimulus meaning. In each of the four test sessions, subjects first viewed the cues above and below a central fixation cross (4 s), then indicated their choice by pressing a button on a separate keyboard, which led to the appearance of a red frame around the chosen item (1 s), and finally viewed the outcome (1.5 s). A total of 60 trials were administered per session (20 trials per condition; trial 0 was calculated as 0.495 ± 0.036, the mean and SEM across all starting points of the subjects). Each session lasted about 8 min (Figure 1).
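The outcome structure of the task can be made concrete with a small simulation. This is a minimal sketch under stated assumptions, not the E-Prime implementation used in the study: the names `PAIRS`, `sample_outcome`, and `simulate_session` are hypothetical, and the simulated subject chooses the correct symbol with a fixed probability instead of learning across trials.

```python
import random

# For each condition: (payoff in CHF, probability of that payoff when the
# correct symbol is chosen). The neutral pair always pays nothing.
PAIRS = {
    "win": (1, 0.8),
    "lose": (-1, 0.2),
    "neutral": (0, 0.0),
}


def sample_outcome(condition, chose_correct, rng):
    """Return the payoff (CHF) of a single trial."""
    payoff, p_if_correct = PAIRS[condition]
    # The other symbol of the pair carries the reciprocal probability.
    p = p_if_correct if chose_correct else 1.0 - p_if_correct
    return payoff if rng.random() < p else 0


def simulate_session(accuracy, rng, trials_per_condition=20):
    """Total payoff of one 60-trial session for a simulated subject who picks
    the correct symbol with fixed probability `accuracy` (a simplification)."""
    trials = [c for c in PAIRS for _ in range(trials_per_condition)]
    rng.shuffle(trials)  # the three conditions are randomly interleaved
    total = 0
    for condition in trials:
        chose_correct = rng.random() < accuracy
        total += sample_outcome(condition, chose_correct, rng)
    return total


rng = random.Random(0)  # arbitrary seed, for reproducibility only
print(simulate_session(accuracy=0.8, rng=rng))
```

Because payoffs are probabilistic, repeated calls with the same accuracy return different totals, which is one reason why earnings are a noisier measure than the proportion of correct choices.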

Results
Data from the behavioral learning task were obtained from 34 individuals (11 each in the control and 20-mg baclofen groups, 12 in the 50-mg baclofen group). Out of the initial 36 participants, one was excluded from the control group because he stopped taking the substance for unknown reasons, and another was excluded from the 20-mg group because of a history of psychiatric disease and addictive behavior that had not been detected during the initial examination. All participants had reached a university level and had a mean age between 24 and 27 years. We specifically monitored for the presence of tiredness as a potential side effect. No significant differences were observed on the SSS between the three groups during the whole week and just prior to the task (30 min before the test: placebo group 1.7 ± 0.65 SD, 20-mg group 1.8 ± 0.61 SD, 50-mg group 2.1 ± 0.67 SD points on the SSS). All subjects were able to repeat a five-digit span in forward order and a four-digit span in reverse order. In the experimental gambling task, we observed a significant difference between the groups in choosing the correct stimulus, specifically for the gain condition (Kruskal-Wallis test, χ2 = 14.56, df = 2, p = 0.001) but not for the loss (Kruskal-Wallis test, χ2 = 5.38, df = 2, p = 0.68) or the neutral condition (Kruskal-Wallis test, χ2 = 1.57, df = 2, p = 0.45). Comparisons with placebo showed a significant difference in the gain condition for the 20-mg baclofen group (89.55 ± 1.39 vs. 81.07 ± 1.55%, p = 0.002 with Mann-Whitney test) but not for the 50-mg group (79.59 ± 1.63 vs. 81.07 ± 1.55%, p = 0.734 with Mann-Whitney test, Figure 2). These results indicate that subjects in the 20-mg baclofen group more often chose the correct symbol associated with the highest probability of earning money (gain pair).
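As a consistency check on the reported Kruskal-Wallis results, the p-value that corresponds to a given χ2 statistic can be recomputed. The sketch below is an illustration, not part of the original analysis; the function name is hypothetical. It relies on the fact that a chi-square distribution with 2 degrees of freedom has the closed-form upper-tail probability exp(−x/2). Under this formula the gain and neutral statistics reproduce the reported p-values, whereas χ2 = 5.38 corresponds to p ≈ 0.068 and not 0.68, which may point to a typographical slip in one of the two numbers; the text does not settle which.

```python
import math


def chi2_sf_df2(statistic):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    # With df = 2 the chi-square distribution is an exponential distribution
    # with mean 2, so the survival function is exp(-x / 2).
    return math.exp(-statistic / 2.0)


# Kruskal-Wallis statistics as reported in the text (df = 2 in each case).
reported = {"gain": 14.56, "loss": 5.38, "neutral": 1.57}

for condition, statistic in reported.items():
    print(f"{condition}: chi2 = {statistic:.2f}, p = {chi2_sf_df2(statistic):.3f}")
```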
Accordingly, the 20-mg baclofen group earned the highest amount of money overall (20.82 ± 2.67 CHF), whereas the 50-mg baclofen group received 18.08 ± 2.39 CHF and the placebo group 17.73 ± 2.08 CHF (Figure 2). However, the difference in monetary gain did not reach significance, in contrast to the proportion of correct choices [ANOVA, F(2,31) = 0.488, p = 0.618].
Learning curves for the gain, loss, and neutral pairs were obtained and plotted for each group (Figure 1). The 20-mg baclofen group showed a faster learning rate over the first 10 trials for the gain pair, with a significant difference in trial-by-trial comparisons for trials 5 (p = 0.038), 7 (p = 0.046), 8 (p = 0.034), 9 (p = 0.025), 10 (p = 0.032), and 11 (p = 0.032; Mann-Whitney test for each trial after a Kruskal-Wallis test showing a significant difference between the three groups). For the loss pair, in contrast, no significant difference was observed between the three groups (Kruskal-Wallis test for each trial, data not shown). This was also the case for the neutral pair (Kruskal-Wallis test for each trial, data not shown).
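A learning curve of this kind (proportion of correct choices per trial, with SEM across subjects) could be computed along the following lines. This is a minimal sketch with made-up toy data; the function name and data layout are assumptions, and the smoothing of Figure 1 and the trial-by-trial Mann-Whitney comparisons are not reproduced here.

```python
import math
from statistics import mean, stdev


def learning_curve(choices_by_subject):
    """Return one (mean, SEM) tuple per trial.

    `choices_by_subject` holds one sequence per subject, all of equal length,
    where each entry is the proportion of correct choices on that trial
    (e.g., 0 or 1 for a single session, or an average over sessions).
    """
    n_subjects = len(choices_by_subject)
    n_trials = len(choices_by_subject[0])
    curve = []
    for t in range(n_trials):
        values = [subject[t] for subject in choices_by_subject]
        # SEM = sample standard deviation / sqrt(number of subjects)
        sem = stdev(values) / math.sqrt(n_subjects) if n_subjects > 1 else 0.0
        curve.append((mean(values), sem))
    return curve


# Hypothetical toy data: 3 subjects, 3 trials (1 = correct choice).
toy = [
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 1],
]
for trial, (m, sem) in enumerate(learning_curve(toy), start=1):
    print(f"trial {trial}: mean = {m:.2f}, SEM = {sem:.2f}")
```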
To earn money, the subjects had to learn, by trial and error, the stimulus-outcome associations. Subjects were instructed to maximize their earnings. Each subject received the total amount earned during the three experimental sessions. The task was coded using the software e-PRIME.

Data and Statistical Analysis
All statistical data were obtained from the three experimental sessions (3 × 60 trials) and calculated for each condition (gain pair, loss pair, and neutral pair, each 3 × 20 trials) and group (control, 20 mg baclofen, 50 mg baclofen). The overall mean proportion of correct choices and money gain were calculated for each participant. Learning curves (proportion of correct choices across trials)

Figure 1 | A low dose of baclofen accelerates instrumental learning. (A) Schematics of the gambling task. Subjects selected either the upper or lower of two visual symbol stimuli presented on a computer screen and subsequently observed the outcome. The correct symbol of the gain pair was associated with an 80% probability of winning 1 CHF, the correct symbol of the loss pair was associated with a 20% probability of losing 1 CHF, and the neutral pair served as control. ms, milliseconds. (B) Observed mean of behavioral choices over three concomitant sessions of 20 trials per condition (gain and loss pair) for the 20-mg baclofen, 50-mg baclofen, and control groups. The learning curves show the proportion of correct choices for each trial (1 means choosing the correct symbol for the gain pair, 0 means choosing the correct symbol for the loss pair). Trial-by-trial comparison between the 20-mg baclofen and the control group showed statistical significance for trials 5, 7, 8, 9, 10, and 11 (*p < 0.05). Trial points were smoothed starting from trial 1 using a Gaussian algorithm. Neutral condition data not shown. Error bars, ±SEM (standard error of the mean).

Frontiers in Behavioral Neuroscience
a reward predicting stimulus and the reward. This activation varies monotonically with risk and could code for the discrepancy between predicted and actual reward (Fiorillo et al., 2003). Such data suggest that DA signals could have an important role in the gain condition of our learning task for evaluating, confirming, and finally learning the risk uncertainties associated with the different reward cues. The effect of 20 mg of baclofen could be explained by an enhancement of this process due to a larger release of DA at striatal synapses, acting as a potent learning signal, and by the involvement of glutamate-dependent forms of plasticity in VTA neurons (Ungless et al., 2001; Saal et al., 2003; Borgland et al., 2004), in the NAc (Kourrich et al., 2007), and in prefrontal cortex neurons (Sun et al., 2005). Moreover, brain structures other than the mesocorticolimbic DA system are also implicated in reward coding. Additional discriminatory information could be provided by the orbitofrontal cortex, striatum, and amygdala (Schultz, 2010). Besides enhanced learning with a DA agonist, Pessiglione et al. (2006) also reported a significant decrease in the learning curve with the DA receptor antagonist haloperidol, which is known for its strong depressant action on the VTA system (Pessiglione et al., 2006). In the same manner, we expected a decrease in the learning rate with a dosage of 50 mg of baclofen compared to 20 mg and placebo. However, we did not observe such an effect. This negative result is most probably due to the fact that the concentration of baclofen in the CSF may be too low (0.5 μM) to sufficiently inhibit DA neurons. For complete abolition of firing, a concentration of 100 μM must be reached in vitro (Cruz et al., 2004). However, this concentration would correspond to nearly 10 g of baclofen p.o., a dose that is two orders of magnitude higher than the usual maximum dose (80 mg/day).
Furthermore, these dosage-effect relationships may also be strongly influenced by each individual's pharmacokinetics. To efficiently inhibit the VTA system, doses close to the maximum dose or even higher may be necessary, which however can be confounded by the occurrence of adverse effects such as tiredness, muscle weakness, and headache.
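The estimate of roughly 10 g given above follows from a simple proportional scaling, sketched here for illustration. The function name is hypothetical, and the calculation assumes that the CSF concentration scales linearly with the oral dose (0.5 μM at 50 mg), an assumption that individual pharmacokinetics may well violate.

```python
REFERENCE_DOSE_MG = 50.0   # oral dose used in the high-dose group
REFERENCE_CSF_UM = 0.5     # predicted CSF concentration at that dose
TARGET_CSF_UM = 100.0      # in vitro concentration abolishing DA neuron firing
USUAL_MAX_DOSE_MG = 80.0   # usual maximum daily dose


def oral_dose_for_csf(target_um, ref_dose_mg=REFERENCE_DOSE_MG,
                      ref_csf_um=REFERENCE_CSF_UM):
    """Oral dose (mg) needed to reach `target_um` in the CSF,
    assuming strictly dose-proportional CSF concentrations."""
    return ref_dose_mg * target_um / ref_csf_um


required_mg = oral_dose_for_csf(TARGET_CSF_UM)
print(f"required dose: {required_mg / 1000:.0f} g")
print(f"ratio to usual maximum dose: {required_mg / USUAL_MAX_DOSE_MG:.0f}x")
```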

Discussion
In this study, inspired by a rodent model of the effects of baclofen on the VTA (Compendium Suisse des Medicaments, 2011), we demonstrated that the GABAB-receptor agonist baclofen causes a significant modulation of reward-driven learning in young, healthy male humans.

Reward Learning
Out of the two dosages used here, enhanced instrumental learning was only observed in the low-dose baclofen group. Participants who received 20 mg of baclofen chose the stimuli linked to the highest probability of earning money significantly more often than the other two groups. This effect is reflected by the greater steepness of the learning curve for this group relative to the placebo group at the first six trials, and a higher plateau thereafter. From this point onwards, all groups reached a relatively stable performance but with generally higher accuracy for the 20-mg group. In addition, participants in the 20-mg baclofen group tended to earn more money after the task completion, as compared to the other groups (although this failed to reach significance). However, the overall amount of money is not a reliable indicator of learning because subjects can also earn money by choosing the wrong symbol in the gain condition (0.2 probability of earning money).
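The argument that earnings are only loosely tied to choice accuracy can be illustrated with expected values. The sketch below is an illustration under assumptions: it treats the group accuracies reported in the Results (89.55% and 81.07% for the gain pair) as constant per-trial probabilities of choosing the correct symbol, and it considers only the 3 × 20 gain trials. The function name is hypothetical.

```python
P_WIN_CORRECT = 0.8   # probability of +1 CHF after the correct gain symbol
P_WIN_WRONG = 0.2     # probability of +1 CHF after the wrong gain symbol
GAIN_TRIALS = 3 * 20  # three experimental sessions of 20 gain trials each


def expected_gain_earnings(accuracy, n_trials=GAIN_TRIALS):
    """Expected earnings (CHF) in the gain condition for a subject who picks
    the correct symbol with probability `accuracy` on every trial."""
    per_trial = accuracy * P_WIN_CORRECT + (1.0 - accuracy) * P_WIN_WRONG
    return per_trial * n_trials


for label, accuracy in (("20 mg baclofen", 0.8955), ("placebo", 0.8107), ("chance", 0.5)):
    print(f"{label}: {expected_gain_earnings(accuracy):.2f} CHF expected")
```

Under these assumptions, an accuracy difference of about 8.5 percentage points translates into an expected difference of only about 3 CHF, and even a subject choosing at chance would expect 30 CHF from the gain trials, so between-subject noise in the probabilistic payoffs can easily mask the learning effect.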
These results are in agreement with those obtained in a recent study using nearly the same learning task, which showed improvement of learning with L-DOPA in the gain condition but not in the loss condition (Pessiglione et al., 2006). This improvement was correlated with an increase in striatal activity as measured with fMRI. A similar effect was also described in a population with Parkinson's disease with problem gambling and shopping (Voon et al., 2010). The implication of DA in reward processing is now well established. According to the "prediction error hypothesis," most DA neurons encode a "reward-prediction error" (Schultz et al., 1997), indistinctly responding to reward probability, magnitude, and the time when the predicted reward is expected (Schultz, 2007). Moreover, one third of DA neurons show a relatively slow, moderate, but significant activation that increases gradually during the interval between
As mentioned above, low-dose baclofen may have addictive properties since it preferentially disinhibits DA neurons, which increases the learning of reward signals. However, in contrast to GHB, addictive behaviors are not widely observed for baclofen, which is also a GABAB-receptor agonist (European Monitoring Centre for Drugs and Drug Addiction, 2010). This apparent contradiction can be explained by their difference in affinity for the GABAB-receptor (high affinity for baclofen, low affinity for GHB; Cruz et al., 2004). Thus, typical therapeutic doses of baclofen, particularly when given repetitively, are most likely sufficient to suppress physiological DA firing, which would explain why baclofen is normally not abused (Labouèbe et al., 2007), while concentrations obtained with typical recreational use of GHB will preferentially affect VTA GABA neurons.
In line with this interpretation, rodent studies show that baclofen reduces self-administration of a number of drugs (Brebner et al., 2002), and baclofen is considered a putative anti-craving compound in humans (Cousins et al., 2002). A double-blind controlled study with a relatively low dosage (30 mg/day) of baclofen has shown its efficacy vs. placebo on sobriety and dropouts in alcohol-dependent patients, while in most case reports up to 120 mg/day was used in order to obtain the same effects (Ameisen, 2005; Agabio et al., 2007; Bucknam, 2007). Also well documented is the reduction in cigarette consumption with a relatively high dosage of 80 mg/day (Franklin et al., 2009). However, the efficacy of these regimens remains controversial, as other studies reported only modest relief of symptoms (Garbutt et al., 2010). Patient adherence (short half-life of baclofen) and disease heterogeneity (for example, anxious vs. non-anxious populations) may limit those studies. The potential of baclofen as a putative anti-craving compound in aiding the initiation, alleviation, and maintenance of drug abstinence is therefore still a highly discussed topic and certainly needs further clinical research.

Conclusion
Our randomized, double-blind, and placebo-controlled study revealed a positive reinforcement in healthy subjects taking a single dose of 20 mg of baclofen during an instrumental learning task involving monetary reward. At this dosage, subjects were more efficient in choosing the stimulus linked to the highest probability of earning money, as compared with the placebo group. These results suggest a reinforcement of prediction error learning signals by baclofen for reward stimuli, and thus corroborate in vitro studies showing an enhanced activation of DA neurons with low-dose baclofen. However, these mechanisms must be confirmed by using fMRI or carbon-11-labeled baclofen, which could eventually correlate our findings with increased activity in the mesocorticolimbic DA system and associated areas. In contrast, learning was not affected by a higher dosage of 50 mg of baclofen. Such a finding suggests that even higher dosages are needed to efficiently inhibit the VTA reward system in vivo and to eventually serve as an anti-craving treatment.

Acknowledgments
We thank the Lüscher and Vuilleumier labs for support and discussion.

Aversion Learning
We did not observe any differences between the three groups in the loss condition, which is consistent with previous data (Pessiglione et al., 2006). Dopaminergic neurons respond mostly with depressed firing rates to aversive stimuli (Ungless et al., 2004; Schultz, 2007). Recent studies, however, identified different subpopulations of DA neurons that respond to aversive stimuli by being either excited or inhibited (Brischoux et al., 2009; Matsumoto and Hikosaka, 2009). Thus, the inhibited responding subpopulations might encode a prediction error for aversive stimuli (Matsumoto and Hikosaka, 2009). Such neurons are situated in the ventromedial SNc and VTA, projecting mainly to the ventral striatum, which is thought to process reward values as classically assumed (Matsumoto and Hikosaka, 2009). In addition, however, other structures like the lateral habenula (Matsumoto and Hikosaka, 2008) and amygdala (Paton et al., 2006) have neurons responding to both reward and aversive stimuli. These other structures might subserve learning in the loss condition without being affected by the DA manipulations used in the study of Pessiglione et al. (2006) and in ours.
The importance of DA neurons in the aversive condition needs to be considered and clarified in future studies. In humans, fMRI data from healthy subjects and patients with Parkinson's disease during a similar instrumental learning task point to the involvement of a distinct brain network, including the anterior insula, dorsal striatum, and orbitofrontal cortex, that influences learning from negative outcomes (Pessiglione et al., 2006; Voon et al., 2010). Although some DA neurons in the VTA are activated by aversive events, the largest DA activation is related to reward (Ungless et al., 2004). Alternatively and more specifically, the addictive behaviors in Parkinson's disease may be associated with a shift of the response to reinforcing cues from the ventral (impaired) to the dorsal striatum, so that the response itself becomes dominated by stimulus-response rather than action-outcome representations (Everitt and Robbins, 2005).

Implication for Addiction
The key role of the mesocorticolimbic system in the neurocircuitry of addiction is generally accepted (Koob and Volkow, 2010), and these pathways could be implicated in addictive behaviors even long after drug exposure (Lüscher and Bellone, 2008). Although addictive drugs have very distinct molecular targets, they all cause an increase in DA concentration in the mesocorticolimbic projection target structures (Lüscher and Ungless, 2006). Moreover, there is strong evidence that drugs of abuse cause a potentiation of excitatory synapses on the VTA DA neurons (Kauer and Malenka, 2007). Synaptic plasticity might therefore represent the cellular mechanism underlying instrumental learning, which is pathologic in addicts (Balland and Lüscher, 2008). Drugs that bind to G-protein-coupled receptors (GPCRs) belong to a first group of addictive drugs, which includes morphine, delta-9-tetrahydrocannabinol (THC), and the GABAB-receptor agonist γ-hydroxybutyric acid (GHB; Lüscher and Ungless, 2006). These drugs act preferentially on GABA interneurons, which normally inhibit DA neurons. Thus, inhibition of GABA neurons leads to a net activation of DA neurons and an increase of DA release, a mechanism referred to as disinhibition.