An extended reinforcement learning model of basal ganglia to understand the contributions of serotonin and dopamine in risk-based decision making, reward prediction, and punishment learning

Balasubramani, Pragathi P.; Chakravarthy, V. Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A.

doi:10.3389/fncom.2014.00047

ORIGINAL RESEARCH article

Front. Comput. Neurosci., 16 April 2014

Volume 8 - 2014 | https://doi.org/10.3389/fncom.2014.00047

This article is part of the Research Topic: Basal Ganglia XI - Proceedings of the 11th Triennial Meeting of the International Basal Ganglia Society

Pragathi P. Balasubramani¹

V. Srinivasa Chakravarthy¹*

Balaraman Ravindran²

Ahmed A. Moustafa³

¹Department of Biotechnology, Indian Institute of Technology - Madras, Chennai, India
²Department of Computer Science and Engineering, Indian Institute of Technology - Madras, Chennai, India
³Foundational Processes of Behaviour Research Concentration, Marcs Institute for Brain and Behaviour & School of Social Sciences and Psychology, University of Western Sydney, Sydney, NSW, Australia

Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk-based decision making in a modified Reinforcement Learning (RL) framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in the Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation, which accommodates both 5HT and DA, reconciles some of the diverse roles of 5HT, particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) risk-sensitive decision making, where 5HT controls risk assessment; (2) temporal reward prediction, where 5HT controls the time scale of reward prediction; and (3) reward/punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG.

Introduction

Monoamine neuromodulators such as dopamine, serotonin, norepinephrine and acetylcholine are regarded as among the most promising neural messengers for ensuring healthy adaptation to uncertain environments. Specifically, serotonin (5HT) and dopamine (DA) play important roles in various cognitive processes, including reward and punishment learning (Cools et al., 2011; Rogers, 2011). DA signaling has long been linked to reward processing in the brain (Bertler and Rosengren, 1966). Furthermore, the activity of mesencephalic DA neurons is found to closely resemble the temporal difference (TD) error in Reinforcement Learning (RL) (Schultz, 1998). This TD error represents the difference between the total reward (outcome) that the agent or subject receives at a given state and time, and the total predicted reward. The resemblance between the TD error signal and the DA signal served as the starting point of an extensive theoretical and experimental effort to apply concepts of RL to understand the functions of the Basal Ganglia (BG) (Schultz et al., 1997; Sutton and Barto, 1998; Joel et al., 2002; Chakravarthy et al., 2010). This led to the emergence of a framework for understanding BG function in which the DA signal plays a crucial role. Deficiency of this neuromodulator (DA) leads to symptoms observed in neurodegenerative disorders like Parkinson's Disease (Bertler and Rosengren, 1966; Goetz et al., 2001).

The Multiple Functions of Serotonin

It is well known that dopamine is not the only neuromodulator associated with BG function. Serotonin (5HT) projections to the BG are also known to have an important role in decision making (Rogers, 2011). 5HT is an ancient molecule that exists even in plants (Angiolillo and Vanderkooi, 1996). Through its precursor tryptophan, 5HT is linked to some of the fundamental processes of life itself. Tryptophan-based molecules in plants are crucial for capturing the light energy necessary for glucose metabolism and oxygen production (Angiolillo and Vanderkooi, 1996). Thus, by virtue of its fundamental role in energy conversion, 5HT is integral to mitosis, maturation, and apoptosis. In lower organisms, it modulates feeding behavior and social behaviors such as dominance posture and escape responses (Kravitz, 2000; Azmitia, 2001; Chao et al., 2004). Due to its extended role as a homeostatic regulator in higher animals and in mammals, 5HT is also associated with appetite suppression (Azmitia, 1999; Halford et al., 2005; Gillette, 2006). Furthermore, 5HT plays important roles in anxiety, depression, inhibition, hallucination, attention, fatigue, and mood (Tops et al., 2009; Cools et al., 2011). Increasing the 5HT level leads to decreasing punishment prediction, though recent evidence pointing to the role of DA in processing aversive stimuli makes the picture more complicated (So et al., 2009; Boureau and Dayan, 2011). The tendency to pay more attention to negative than to positive experiences or other kinds of information (negative cognitive bias) is found to occur at lower levels of 5HT (Cools et al., 2008; Robinson et al., 2012). 5HT is also known to control the time scale of reward prediction (Tanaka et al., 2007) and to play a role in risk-sensitive behavior (Long et al., 2009; Murphy et al., 2009; Rogers, 2011).
Studies found that under conditions of tryptophan depletion, which is known to reduce the brain 5HT level, risky choices are preferred to safer ones in decision making tasks (Long et al., 2009; Murphy et al., 2009; Rogers, 2011). There are also reports that the 5HT transporter gene influences risk-based decision making (He et al., 2010; Kuhnen et al., 2013). 5HT is known to influence non-linearity in risk-based decision making (Kahneman and Tversky, 1979), that is, risk aversion in the case of gains and risk seeking in the case of losses when the choices presented have equal means (Murphy et al., 2009; Zhong et al., 2009a,b). In summary, 5HT is not only important for behavioral inhibition, but is also related to time scales of reward prediction, risk, anxiety, attention, etc., and to non-cognitive functions like energy conversion, apoptosis, feeding, and fatigue.

Prior Theoretical and Computational Abstract Models of Serotonin

It would be interesting to understand and reconcile the roles of DA and 5HT in the BG. Prior abstract models addressing this question, such as that of Daw et al. (2002), argue that DA signaling plays a role that is complementary to 5HT. It has been suggested that whereas the DA signal responds to appetitive stimuli, 5HT responds to aversive or punitive stimuli (Daw et al., 2002). In contrast to computational models that argue for complementary roles of DA and 5HT, empirical studies show that both neuromodulators play cardinal roles in coding the signals associated with reward (Tops et al., 2009; Cools et al., 2011; Rogers, 2011). Genes that control neurotransmission of both molecules are known to affect processing of both rewarding and aversive stimuli (Cools et al., 2011). Complex interactions between DA and 5HT make it difficult to tease apart the relative roles of the two molecules in reward evaluation. Some subtypes of 5HT receptors facilitate DA release from the midbrain DA releasing sites, while others inhibit it (Alex and Pehek, 2007). In summary, it is clear that the relationship between DA and 5HT is not one of simple complementarity. Both synergistic and opposing interactions exist between these two molecules in the brain (Boureau and Dayan, 2011).

Efforts have been made to elucidate the function of 5HT through abstract modeling. Daw et al. (2002) developed a line of modeling that explores an opponent relationship between DA and 5HT (Daw et al., 2002; Dayan and Huys, 2008). In an attempt to embed all four key neuromodulators (DA, 5HT, norepinephrine and acetylcholine) within the framework of RL, Doya (2002) associated 5HT with the discount factor, γ, which is a measure of the time scale of reward integration (Doya, 2002; Tanaka et al., 2007). However, no single computational theory integrates and reconciles the existing computational perspectives of 5HT function within one framework.

Our Model in Brief

In this modeling study, we present a model of both 5HT and DA in the BG, simulated using a modified RL framework. Here, DA represents the TD error, as in most of the extant literature on DA signaling and RL (Schultz et al., 1997; Sutton and Barto, 1998), and 5HT controls the risk prediction error. Action selection is controlled by a utility function that is a weighted combination of the value and risk functions (Bell, 1995; Preuschoff et al., 2006; D'acremont et al., 2009). In the proposed modified formulation of the utility function, the weight of the risk function depends on the sign of the value function and a tradeoff parameter α, which we describe in detail below. Just as the value function is thought to be computed in the striatum, we propose that the utility function is also computed in the striatum.

The outline of the paper is as follows. Section Methods describes the model equations. In Section Results, we show that combining the value and risk functions for decision making explains the following experiments. The first of these pertains to risk sensitivity in bee foraging (Real, 1981). Here we demonstrate that the proposed 5HT and DA model can simulate this simple neurobiological instance of risk-based decision making. We then show the capability of the model to explain the roles of 5HT in three representative experimental conditions: risk sensitivity under tryptophan-depleted conditions (Long et al., 2009); time scale of reward prediction (Tanaka et al., 2007); and reward and punishment sensitivity (Cools et al., 2008). We discuss the model and results in Section Discussion. There, we also hypothesize that the plausible neural correlates of the risk component are the D1R and D2R co-expressing medium spiny neurons of the striatum, with serotonin selectively modulating this population of neurons.

Methods

Along the lines of the utility models described by Bell (1995) and D'acremont et al. (2009), we present here the utility function, U_t, as a tradeoff between the expected payoff and the variance of the payoff (the subscript “t” refers to time). The original utility formulation used in Bell (1995) and D'acremont et al. (2009) is given by Equation 2.1.

\begin{matrix} U_{t} (s, a) = Q_{t} (s, a) - κ \sqrt{h_{t} (s, a)} & (2.1) \end{matrix}

where Q_t is the expected cumulative reward and h_t is the risk function or reward variance, for state, s, and action, a; κ is the risk preference. Note that in Equation 2.1 we represent the state and action explicitly, as opposed to Bell (1995) and D'acremont et al. (2009).

In classical RL terms (Sutton and Barto, 1998), the action value function, Q, of a state, s, and action, a, at time t under policy π may be expressed as in Equation 2.2.

\begin{matrix} Q^{π} (s, a) = E_{π} (r_{t + 1} + γ r_{t + 2} + γ^{2} r_{t + 3} + \dots | s_{t} = s, a_{t} = a) & (2.2) \end{matrix}

where r_t is the reward obtained at time, t, and γ is the discount factor (0 < γ < 1). E_π denotes the expectation when action selection is done with policy π. The incremental update for the action value function, Q is defined as in Equation 2.3.

\begin{matrix} Q_{t + 1} (s_{t}, a_{t}) = Q_{t} (s_{t}, a_{t}) + η_{Q} δ_{t} & (2.3) \end{matrix}

where s_t is the state at time, t; a_t is the action performed at time, t, and η_Q is the learning rate of the action value function (0 < η_Q < 1). δ_t is the TD error defined by Equation 2.4,

\begin{matrix} δ_{t} = r_{t + 1} + γ Q_{t} (s_{t + 1}, a_{t + 1}) - Q_{t} (s_{t}, a_{t}) & (2.4) \end{matrix}

In the case of immediate reward problems, δ_t is defined by Equation 2.5.

\begin{matrix} δ_{t} = r_{t} - Q_{t} (s_{t}, a_{t}) & (2.5) \end{matrix}
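The value updates of Equations 2.3-2.5 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `td_error` and `update_q` and the default γ = 0.9 are assumptions made here for the example, while the learning rate of 0.1 follows the value quoted in the text.

```python
def td_error(reward, q_current, q_next=0.0, gamma=0.9):
    """TD error of Equation 2.4.

    With q_next = 0 this reduces to the immediate-reward form of
    Equation 2.5. The default gamma is an assumed, illustrative value.
    """
    return reward + gamma * q_next - q_current


def update_q(q_current, delta, eta_q=0.1):
    """Incremental action-value update of Equation 2.3."""
    return q_current + eta_q * delta


# Immediate-reward example: a constant reward of 1 drives Q toward 1.
q = 0.0
for _ in range(200):
    delta = td_error(reward=1.0, q_current=q)
    q = update_q(q, delta)
print(q)
```

Because the update moves Q a fraction η_Q of the way toward the target on each step, the estimate approaches the constant reward geometrically.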

Similar to the value function, the risk function “h_t” has an incremental update as defined by Equation 2.6.

\begin{matrix} h_{t + 1} (s_{t}, a_{t}) = h_{t} (s_{t}, a_{t}) + η_{h} ξ_{t} & (2.6) \end{matrix}

where η_h is the learning rate of the risk function (0 < η_h < 1), and ξ_t is the risk prediction error expressed by Equation 2.7,

\begin{matrix} ξ_{t} = δ_{t}^{2} - h_{t} (s_{t}, a_{t}) & (2.7) \end{matrix}
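The risk update of Equations 2.6 and 2.7 has the same incremental form as the value update, with the squared TD error as its target. The sketch below is illustrative only: `update_risk` is a hypothetical helper name, Q is held fixed at the mean reward for simplicity, the reward sequence is an assumed deterministic stand-in for a risky option, and a smaller learning rate than the 0.1 used in the paper is chosen so that the estimate fluctuates less.

```python
def update_risk(h_current, delta, eta_h=0.1):
    """Return the updated risk estimate and the risk prediction error."""
    xi = delta ** 2 - h_current          # Equation 2.7
    return h_current + eta_h * xi, xi    # Equation 2.6


# Illustrative run (assumptions): Q is held fixed at the mean reward of 1,
# and rewards cycle deterministically through 3, 0, 0.
q_fixed = 1.0
h = 0.0
for step in range(3000):
    reward = (3.0, 0.0, 0.0)[step % 3]
    delta = reward - q_fixed
    h, _ = update_risk(h, delta, eta_h=0.01)
print(h)  # stays close to 2, the variance of this reward sequence
```

When Q equals the mean reward, the expected squared TD error is the reward variance, which is why h_t can be read as a running variance estimate.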

η_h and η_Q are set to 0.1, and Q_t and h_t are set to zero at t = 0, for the simulations described below in Sections Risk Sensitivity and Rapid Tryptophan Depletion, Time Scale of Reward Prediction and Serotonin, and Reward/Punishment Prediction Learning and Serotonin.

We now present a modified form of the utility function, obtained by substituting κ = α sign[Q_t(s_t, a_t)] in Equation 2.1.

\begin{matrix} U_{t} (s_{t}, a_{t}) = Q_{t} (s_{t}, a_{t}) - α \, \mathrm{sign} (Q_{t} (s_{t}, a_{t})) \sqrt{h_{t} (s_{t}, a_{t})} & (2.8) \end{matrix}

In Equation 2.8, the risk preference term includes three components: the “α” term, the “sign(Q_t)” term, and the risk term $\sqrt{h_{t}}$. The sign(Q_t) term captures a familiar feature of human decision making, viz., risk aversion for gains and risk seeking for losses (Kahneman and Tversky, 1979). In other words, for α > 0, when sign(Q_t) is positive (negative), U_t is maximized by minimizing (maximizing) risk. Note that the expected action value Q_t is positive for gains, which earn rewards greater than a reward base (= 0), and negative for losses. We associate the 5HT level with α, a constant that controls the relative weighting between action value and risk (Equation 2.8).
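A small Python sketch may help make the effect of the sign term in Equation 2.8 concrete. The helper names `sign` and `utility` and the numerical values are assumptions chosen for illustration; they are not taken from the authors' code.

```python
import math


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def utility(q, h, alpha):
    """Modified utility of Equation 2.8 for one state-action pair."""
    return q - alpha * sign(q) * math.sqrt(h)


alpha = 1.5  # illustrative, positive risk weight

# Gains (Q > 0): the safe option (h = 0) has the higher utility.
print(utility(1.0, 0.0, alpha), utility(1.0, 2.0, alpha))

# Losses (Q < 0): the risky option (h = 2) has the higher utility.
print(utility(-1.0, 0.0, alpha), utility(-1.0, 2.0, alpha))
```

With a positive α, the same amount of risk lowers the utility of a gain and raises the utility of a loss, which is the asymmetry described above.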

In this study, action selection is performed using a softmax distribution (Sutton and Barto, 1998) generated from the utility. Note that traditionally the distribution is generated from the action value instead. The probability, P_t(a|s), of selecting an action, a, for a state, s, at time, t, is given by the softmax policy (Equation 2.9).

\begin{matrix} P_{t} (a | s) = \exp (β U_{t} (s, a)) / \sum_{i = 1}^{n} \exp (β U_{t} (s, i)) & (2.9) \end{matrix}

Here, n is the total number of actions available at state, s, and β is the inverse temperature parameter. Values of β tending to 0 make the actions almost equiprobable, and values of β tending to ∞ make the softmax action selection identical to greedy action selection.
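The two limiting cases of β can be checked with a short Python sketch of Equation 2.9. The function name `softmax_policy` is an assumption for this example; subtracting the maximum utility before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged and is not something stated in the paper.

```python
import math


def softmax_policy(utilities, beta):
    """Action probabilities of Equation 2.9 for a list of utilities."""
    u_max = max(utilities)  # subtracted for numerical stability only
    weights = [math.exp(beta * (u - u_max)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


utilities = [1.0, 0.0]  # illustrative utilities of two actions
print(softmax_policy(utilities, beta=0.0))    # equiprobable actions
print(softmax_policy(utilities, beta=100.0))  # nearly greedy selection
```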

Results

In this section, we apply the model of 5HT and DA in BG (Section Methods) to explain several risk-based decision making phenomena pertaining to BG function.

1) Measurement of risk sensitivity: Two experiments are simulated in this category:

- Risk sensitivity in Bee foraging (Real, 1981)

- Risk sensitivity and Tryptophan depletion (Long et al., 2009)

2) Representation of time scale of reward prediction (Tanaka et al., 2007) and

3) Measurement of punishment sensitivity (Cools et al., 2008).

The parameters for each experiment are optimized using a genetic algorithm (GA) (Goldberg, 1989); details of the GA option set are given in the Supplementary Material.

Risk Sensitivity in Bee Foraging

Experiment summary

In the bee foraging experiment by Real (1981), bees were allowed to choose between flowers of two colors, blue and yellow. Both types of flowers deliver the same mean amount of reward (nectar) but differ in reward variance. The experiment showed that bees prefer the less risky flowers, i.e., those with lower variance in nectar (Real, 1981).

Biogenic amines such as 5HT are found to influence foraging behavior in bees (Schulz and Robinson, 1999; Wagener-Hulme et al., 1999). In particular, the brain levels of dopamine, serotonin, and octopamine are found to be high in foraging bees (Wagener-Hulme et al., 1999). Montague et al. (1995) showed risk aversion in bee foraging using a general predictive learning framework without mentioning DA. They assume a special “subjective utility,” which is a non-linear reward function (Montague et al., 1995), to account for the risk sensitivity of the subject. In the foraging problem of Real (1981), bees choose between two flowers that have the same mean reward but differ in risk or reward variance. Therefore, the problem is ideally suited to a risk-based decision making approach. We show that the task can be modeled, without any assumptions about “subjective utility,” by using the proposed 5HT-DA model, which has an explicit representation of risk.

Simulation

We model the above phenomenon of bee foraging using the modified utility function of Section Methods. The foraging problem of Real (1981) is treated as a variation of the stochastic “two-armed bandit” problem (Sutton and Barto, 1998), possessing no state (s) and 2 actions (a). We represent the colors of the flowers (“yellow” and “blue”), which are the only predictors of nectar delivery, as the two arms (viz. the two actions, a). The initial series of experimental trials is modeled with all the blue flowers (the “no-risk” choice) delivering 1 μl of nectar (reward value, r = 1), 1/3 of the yellow flowers delivering 3 μl (r = 3), and the remaining 2/3 of the yellow flowers containing no nectar at all (r = 0) (yellow flowers being the “risky” choice). These contingencies are reversed at trial 15 and stay that way until trial 40. Since the task requires only a single decision per trial, we model it as an immediate reward problem (Equation 2.5). Hence the δ for any trial t is calculated as in Equation 3.1.2.1 and used to update the respective action value by Equation 3.1.2.2.

\begin{matrix} δ_{t} = r_{t} - Q_{t} (a_{t}), \quad a_{t} \in \{ \text{blue flower}, \text{yellow flower} \} & (3.1.2.1) \end{matrix}

\begin{matrix} Q_{t + 1} (a_{t}) = Q_{t} (a_{t}) + η_{Q} δ_{t} & (3.1.2.2) \end{matrix}

\begin{matrix} h_{t + 1} (a_{t}) = h_{t} (a_{t}) + η_{h} ξ_{t} & (3.1.2.3) \end{matrix}

\begin{matrix} ξ_{t} = δ_{t}^{2} - h_{t} (a_{t}) & (3.1.2.4) \end{matrix}

\begin{matrix} U_{t} (a_{t}) = Q_{t} (a_{t}) - α \, \mathrm{sign} (Q_{t} (a_{t})) \sqrt{h_{t} (a_{t})} & (3.1.2.5) \end{matrix}

In our simulation, the expected action value (given by Q) for both flowers converges to the same value (= 1). Our model accounts for risk through the variance component (represented by “h” for each flower; Equations 3.1.2.3 and 3.1.2.4) of the utility function (Equation 3.1.2.5), which plays a key role in action selection.
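The simulation loop described by Equations 3.1.2.1-3.1.2.5 together with the softmax policy of Equation 2.9 can be sketched in Python as follows. This is a hedged illustration under stated assumptions and not the authors' code: the function name `simulate_bees`, the 0-indexed trial counter, the per-trial sampling of the risky reward with probability 1/3, and the seeding are all choices made here. The default parameter values are those reported in the text for this experiment, and the sketch is not expected to reproduce the published figure exactly.

```python
import math
import random


def sign(x):
    return (x > 0) - (x < 0)


def softmax(utilities, beta):
    """Softmax probabilities (Equation 2.9) over a dict of utilities."""
    u_max = max(utilities.values())  # for numerical stability only
    w = {a: math.exp(beta * (u - u_max)) for a, u in utilities.items()}
    total = sum(w.values())
    return {a: x / total for a, x in w.items()}


def simulate_bees(n_trials=40, reversal_trial=15, eta_q=0.001, eta_h=0.051,
                  alpha=1.5, beta=10.0, seed=0):
    """Return the list of chosen flowers for one simulated forager."""
    rng = random.Random(seed)
    actions = ("blue", "yellow")
    q = {a: 0.0 for a in actions}  # action values, zero at t = 0
    h = {a: 0.0 for a in actions}  # risk estimates, zero at t = 0
    choices = []
    for t in range(n_trials):
        # Assumption: yellow is risky before the reversal, blue after it.
        risky = "yellow" if t < reversal_trial else "blue"
        u = {a: q[a] - alpha * sign(q[a]) * math.sqrt(h[a])
             for a in actions}                      # Equation 3.1.2.5
        p = softmax(u, beta)
        a = "blue" if rng.random() < p["blue"] else "yellow"
        if a == risky:
            r = 3.0 if rng.random() < 1.0 / 3.0 else 0.0
        else:
            r = 1.0
        delta = r - q[a]                            # Equation 3.1.2.1
        xi = delta ** 2 - h[a]                      # Equation 3.1.2.4
        q[a] += eta_q * delta                       # Equation 3.1.2.2
        h[a] += eta_h * xi                          # Equation 3.1.2.3
        choices.append(a)
    return choices


choices = simulate_bees()
print(choices.count("blue"), "blue choices out of", len(choices))
```

Averaging the blue-choice indicator over many seeds, trial by trial, gives a selection curve of the kind plotted for the simulation; longer phases and larger learning rates can be passed as arguments to examine the asymptotic preference.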

Results

In the experiment (Real, 1981), most of the bees initially visited the constant-nectar-yielding blue flowers, i.e., they chose a risk-free strategy, but later switched to the yellow flowers once these became the less risky choice. We observe the same in our simulations. Since risk-averse behavior is the optimal approach in a positive-reward scenario, the blue flowers, which deliver a steady reward of 1, initially have higher utility and are preferred over the more variable yellow flowers. The situation is reversed after trial 15, when the blue flowers suddenly become risky and the yellow ones become risk-free. Here, the utility of the yellow flowers starts increasing, as expected. Note that the expected action value for both flowers remains the same, though the utility has changed.

With η_h = 0.051, η_Q = 0.001, α = 1.5 in Equation 3.1.2.5, and β = 10 in Equation 2.9, the proposed model captures the shift in selection within fewer than 5 trials of the contingency reversal (red line in Figure 1). Since the value is always non-negative and α > 0, our model exhibits risk-averse behavior, similar to the bees in the study.
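As a rough check on these parameter values, the following worked example evaluates the utilities before the reversal at an idealized fixed point. It assumes that Q has converged to 1 for both flowers and that h equals the expected squared TD error; neither assumption is guaranteed to hold within the simulated trials, so the numbers are illustrative only.

\begin{matrix} h (\mathrm{blue}) = (1 - 1)^{2} = 0, \quad h (\mathrm{yellow}) = \frac{1}{3} (3 - 1)^{2} + \frac{2}{3} (0 - 1)^{2} = 2 \end{matrix}

\begin{matrix} U (\mathrm{blue}) = 1 - 1.5 \sqrt{0} = 1, \quad U (\mathrm{yellow}) = 1 - 1.5 \sqrt{2} \approx -1.12 \end{matrix}

\begin{matrix} P (\mathrm{blue}) = 1 / (1 + \exp (10 \, (U (\mathrm{yellow}) - U (\mathrm{blue})))) \approx 1 / (1 + \exp (-21.2)) \approx 1 \end{matrix}

Under these assumptions the risk-free flower is chosen almost exclusively, and the roles of the two flowers swap once the contingencies are reversed.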

Figure 1. Selection of the blue flowers obtained from our simulation (Sims), averaged over 1000 instances, and that adapted from the experiment of Real (1981) (Expt). The red line indicates the contingency reversal.