Synaptic Theory of Replicator-Like Melioration

According to the theory of Melioration, organisms in repeated choice settings shift their choice preference in favor of the alternative that provides the highest return. The goal of this paper is to explain how this learning behavior can emerge from microscopic changes in the efficacies of synapses, in the context of a two-alternative repeated-choice experiment. I consider a large family of synaptic plasticity rules in which changes in synaptic efficacies are driven by the covariance between reward and neural activity. I construct a general framework that predicts the learning dynamics of any decision-making neural network that implements this synaptic plasticity rule and show that melioration naturally emerges in such networks. Moreover, the resultant learning dynamics follows the Replicator equation which is commonly used to phenomenologically describe changes in behavior in operant conditioning experiments. Several examples demonstrate how the learning rate of the network is affected by its properties and by the specifics of the plasticity rule. These results help bridge the gap between cellular physiology and learning behavior.

Introduction
According to the "law of effect" formulated by Edward Thorndike a century ago, the outcome of a behavior affects the likelihood of occurrence of this behavior in the future: a positive outcome increases the likelihood whereas a negative outcome decreases it (Thorndike, 1911). One quantitative formulation of this qualitative law of behavior was proposed half a century later by Richard Herrnstein, and is known as the "matching law" (Herrnstein, 1961). The matching law states that over a long series of repeated trials, the number of times an action is chosen is proportional to the reward accumulated from choosing that action (Davison and McCarthy, 1988; Herrnstein, 1997; Gallistel et al., 2001; Sugrue et al., 2004). In other words, the average reward per choice is equal for all chosen alternatives. To explain how matching behavior actually takes place, the "theory of Melioration" argues that organisms are sensitive to rates of reinforcement and shift their choice preference in the direction of the alternative that provides the highest return (Herrnstein and Prelec, 1991; however, see also Gallistel et al., 2001). If the returns from all chosen alternatives are equal, as postulated by the matching law, then choice preference will remain unchanged. Thus, matching is a fixed point of the dynamics of melioration.

The neural basis of the law of effect has been extensively explored. It is generally believed that learning is due, at least in part, to changes in the efficacies of synapses in the brain. In particular, activity-dependent synaptic plasticity, modulated by a reward signal, is thought to underpin this form of operant conditioning (Mazzoni et al., 1991; Williams, 1992; Xie and Seung, 2004; Fiete and Seung, 2006; Baras and Meir, 2007; Farries and Fairhall, 2007; Florian, 2007; Izhikevich, 2007; Legenstein et al., 2008, 2009; Law and Gold, 2009). In a previous study we considered the large family of reward-modulated synaptic plasticity rules in which changes in synaptic efficacies are driven by the covariance between reward and neural activity. We showed that under very general conditions, the convergence of a covariance plasticity rule to a fixed point results in matching behavior (Loewenstein and Seung, 2006; Loewenstein, 2008a). This result is independent of the architecture of the decision-making network, the properties of the constituent neurons and the specifics of the covariance plasticity rule.

The universality of the relation between the fixed-point solution of the covariance synaptic plasticity rule and the matching law of behavior raises the question of whether there are aspects of the dynamics of convergence to the matching law that are also universal. In this paper I study the transient learning dynamics of a general decision-making network in which changes in synaptic efficacies are driven by the covariance between reward and neural activity. I examine the two-alternative repeated-choice schedule which is typically used in human and animal experiments. I show that the macroscopic behavioral learning dynamics that result from the microscopic synaptic covariance plasticity rule are also general and follow the well-known Replicator equation. This result is independent of the decision-making network architecture, the properties of the neurons and the specifics of the plasticity rule; these only determine the learning rate in the behavioral learning equation. By analyzing several examples, I show that the learning rate depends on the probabilities of choice: it is approximately proportional to the product of the probabilities of choice raised to a power, where the power depends on the specifics of the model. Some of the findings presented here have appeared previously in abstract form (Loewenstein, 2008b).

Melioration and the Replicator Equation
One way of formalizing the theory of melioration mathematically is by assuming that subjects make choices stochastically, as if tossing a biased coin. This assumption is supported by the weak temporal correlations between choices in repeated-choice experiments (Barraclough et al., 2004; Sugrue et al., 2004; Glimcher, 2005). The bias of the coin corresponds to choice preference, and the learning process manifests itself as a change in this bias with experience toward the more rewarding alternative. Denoting the probability of choosing alternative i at time t by p_i(t), the theory of Melioration posits that a change in p_i(t) with time is proportional to the difference between the return from alternative i, i.e., the average reward obtained in trials in which alternative i was chosen, and the overall return. Formally,

dp_i/dt = η · p_i · (R_i − R̄),  R̄ = Σ_j p_j · R_j,   (1)

where R_i is the return from alternative i and R̄ is the overall return. This equation is known as the Replicator equation (Fudenberg and Levine, 1998; Hofbauer and Sigmund, 1998) and is widely used in learning models and in evolutionary game theory. Note that the theory of Melioration does not require η to be constant in time. Melioration will be achieved as long as η > 0.
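As a concrete illustration, Eq. 1 can be integrated numerically. The sketch below uses simple Euler integration; the returns, learning rate and step size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def replicator_step(p1, R1, R2, eta, dt):
    """One Euler step of the Replicator equation (Eq. 1):
    dp1/dt = eta * p1 * (R1 - Rbar), with Rbar = p1*R1 + p2*R2."""
    p2 = 1.0 - p1
    rbar = p1 * R1 + p2 * R2
    return p1 + dt * eta * p1 * (R1 - rbar)

# Illustrative returns (average reward per choice) of the two alternatives.
R1, R2 = 0.75, 0.25
p1, eta, dt = 0.5, 1.0, 0.1

trajectory = []
for _ in range(1000):
    p1 = replicator_step(p1, R1, R2, eta, dt)
    trajectory.append(p1)

# Melioration: preference shifts toward the higher-return alternative.
print(trajectory[0], trajectory[-1])
```

When the returns are unequal, preference drifts toward the richer alternative; when they are equal, R_i = R̄ for every chosen alternative and any p_1 is stationary, consistent with matching being a fixed point of melioration.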

Synaptic Plasticity and Learning
It is generally believed that choice preference is determined by the efficacies of the synapses of the decision-making neural network. Theoretically, if we were able to determine the architecture of this decision-making network and the properties of all the constituent neurons, we could determine the probability of choosing alternative i in a trial from the efficacies of all the synapses at the time of that trial. Formally,

p_i(t) = p_i(W(t)),   (2)

where W = (W_1, W_2, …) is the vector of the efficacies of all the synapses that are involved in the decision-making process, as schematically illustrated in Figure 1A, and t is an index of the trial. Because choice probabilities are a function of the synaptic weights, changes in these weights due to synaptic plasticity (Figure 1B, left) will change the choice probabilities (Figure 1B, right), yielding the learning rule

Δp_i(t) = p_i(W(t) + ΔW(t)) − p_i(W(t)),   (3)

where ΔW(t) is the change in the synaptic efficacies in trial t. In the next section it is shown that in the context of a two-alternative repeated-choice experiment, if changes in synaptic efficacies are driven by the covariance between reward and neural activity, the average velocity approximation (Heskes and Kappen, 1993; Kempter et al., 1999; Dayan and Abbott, 2001) of the learning rule, Eq. 3, reproduces the Replicator equation, Eq. 1.
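To make Eqs 2 and 3 concrete, consider a toy instance of the map from synaptic efficacies to choice probability. The softmax form below is purely illustrative (it is not the paper's network); it only demonstrates that a small synaptic change ΔW induces a behavioral change Δp_1 that is predicted by the susceptibilities ∂p_1/∂W_k.

```python
import numpy as np

def p1(W):
    """Toy instance of Eq. 2: choice probability as a function of the
    synaptic-efficacy vector W. Here, a logistic function of the difference
    between the summed weights of two populations (illustrative form)."""
    drive1, drive2 = W[:3].sum(), W[3:].sum()
    return 1.0 / (1.0 + np.exp(-(drive1 - drive2)))

W = np.array([0.5, 0.2, 0.3, 0.4, 0.4, 0.2])
dW = np.array([0.01, 0.0, 0.0, 0.0, 0.0, 0.0])  # plasticity in one synapse

# Eq. 3: the behavioral change induced by the synaptic change.
dp = p1(W + dW) - p1(W)

# For small changes, dp is predicted by the finite-difference gradient
# dp1/dWk, i.e., the susceptibilities that appear later in Eq. 6.
eps = 1e-6
grad = np.array([(p1(W + eps * e) - p1(W)) / eps for e in np.eye(6)])
print(dp, grad @ dW)
```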

Covariance-Based Synaptic Plasticity
In statistics, the covariance between two random variables is the mean value of the product of their fluctuations. Accordingly, covariance-based synaptic plasticity arises when changes in synaptic efficacy in a trial are driven by the product of reward and neural activity, provided that at least one of these signals is measured relative to its mean value. For example, the change in the synaptic strength W in a trial, ΔW, could be expressed by

ΔW(t) = ϕ · R(t) · (N(t) − E[N]),   (4a)

where ϕ is the plasticity rate, R is the magnitude of reward delivered to the subject, N is any measure of neural activity and E[N] is the average of N. For example, N can correspond to the presynaptic activity, the postsynaptic activity or the product of presynaptic and postsynaptic activities. In the latter case, Eq. 4a can be considered Hebbian. Another example of a biologically plausible implementation of reward-modulated covariance plasticity is

ΔW(t) = ϕ · (R(t) − E[R]) · N(t),   (4b)

where E[R] is the average of the previously harvested rewards. For both of these plasticity rules, the expectation value of the right-hand side of the equation is proportional to the covariance between R and N,

E[ΔW] = ϕ · Cov(R, N)   (4c)

(Loewenstein and Seung, 2006), and for this reason it can be said that these plasticity rules are driven by the covariance of reward and neural activity. The biological implementation of Eqs 4a,b requires information, at the level of the synapse, about the average neural activity (in Eq. 4a) or the average reward (in Eq. 4b) (Loewenstein, 2008a). However, covariance-based synaptic plasticity can also arise without explicit information about the averages: the average terms in Eqs 4a,b can be replaced with any unbiased estimator of the average that is not correlated with the reward, because such a change will not affect the average velocity approximation, Eq. 4c. For example, consider a variation of Eq. 4a in which the average neural activity, E[N], is replaced by the neural activity τ trials ago:

ΔW(t) = ϕ · R(t) · (N(t) − N(t − τ)).   (4d)

If the reward delivered in trial t, R(t), is independent of the neural activity τ trials ago, N(t − τ), then the average velocity approximation of Eq. 4d yields Eq. 4c. The reward R(t) and the neural activity N(t − τ) are approximately independent if the neural activities in consecutive trials are approximately independent and if the dependence of the reward on the choice τ trials ago is weak.
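A short numerical check of this point can be made with synthetic trial-by-trial data (the activity and reward statistics below are illustrative assumptions): both Eq. 4a and the estimator variant Eq. 4d drift, on average, with ϕ·Cov(R, N).

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.1
n_trials = 200_000

# Toy per-trial signals: neural activity N and a binary reward R that is
# positively correlated with N (illustrative choices, not from the paper).
N = rng.poisson(lam=5.0, size=n_trials).astype(float)
R = (N + rng.normal(0.0, 1.0, size=n_trials) > 5.0).astype(float)

# Eq. 4a: dW = phi * R * (N - E[N]); here E[N] = 5 is known exactly.
dW_a = phi * R * (N - 5.0)

# Eq. 4d: E[N] replaced by the activity one trial ago, an unbiased
# estimator of the mean that is uncorrelated with the current reward.
dW_d = phi * R * (N - np.roll(N, 1))

# Both rules drift, on average, with the covariance of reward and activity (Eq. 4c).
target = phi * np.cov(R, N)[0, 1]
print(dW_a.mean(), dW_d.mean(), target)
```

The variant 4d pays for dropping the explicit average with a larger trial-to-trial variance of ΔW, but its mean drift, and hence the average velocity approximation, is unchanged.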

Covariance Plasticity and Replicator Dynamics
In order to relate the covariance-based plasticity rules to behavior, I use the average velocity approximation, in which I replace the stochastic difference equations, Eqs 4a,b,d, with a differential equation in which the right-hand side of the equation is replaced by its expectation value, Eq. 4c:

dW_k/dt = ϕ_k · Cov(R, N_k).   (5)

According to the average velocity approximation, if the plasticity rate is sufficiently small then, under certain stability conditions, the deviation of the stochastic realization of W from its average velocity approximation value is O(√ϕ) (Heskes and Kappen, 1993). Therefore, the smaller the plasticity rate ϕ, the better the average velocity approximation.

Differentiating Eq. 2 with respect to time yields

dp_i/dt = Σ_k (∂p_i/∂W_k) · (dW_k/dt),   (6)

where the index k sums over all synapses that participate in the decision-making. Substituting Eq. 5 in Eq. 6 yields

dp_i/dt = Σ_k ϕ_k · (∂p_i/∂W_k) · Cov(R, N_k),   (7)

where N_k and ϕ_k are the neural activity and the plasticity rate in the neuronal plasticity rule (Eq. 4) that correspond to synapse k. By definition,

Cov(R, N_k) = E[R · δN_k],  where δN_k ≡ N_k − E[N_k].   (8)

Separating the covariance term into trials in which alternative 1 was chosen (A = 1) and trials in which alternative 2 was chosen (A = 2) yields

Cov(R, N_k) = p_1 · E[R · δN_k | A = 1] + p_2 · E[R · δN_k | A = 2],   (9)

where E[δN_k | A = i] is the average of δN_k in trials in which alternative i was chosen (i ∈ {1,2}). The reward R is a function of the actions A and the actions are a function of the neural activities. Therefore, given an action, the reward and the neural activities are statistically independent and hence

E[R · δN_k | A = i] = E[R | A = i] · E[δN_k | A = i] = R_i · E[δN_k | A = i].   (10)

Thus, Eq. 9 becomes

Cov(R, N_k) = p_1 · R_1 · E[δN_k | A = 1] + p_2 · R_2 · E[δN_k | A = 2].   (11)

Next I separate E[δN_k] into trials in which alternative 1 was chosen and trials in which alternative 2 was chosen and use the fact that, by definition, E[δN_k] = 0:

p_1 · E[δN_k | A = 1] + p_2 · E[δN_k | A = 2] = 0.   (12)

Substituting Eq. 12 in Eq. 11 yields

Cov(R, N_k) = p_1 · (R_1 − R_2) · E[δN_k | A = 1].   (13)

Substituting Eq. 13 in Eq. 7 results in Eq. 1 with a learning rate η that is given by

η = (1/p_2) · Σ_k ϕ_k · (∂p_1/∂W_k) · E[δN_k | A = 1].   (14)

Thus, if synaptic changes are driven by the covariance of reward and neural activity, then according to the average velocity approximation, learning behavior follows the Replicator dynamics. This result is very general. The Replicator learning dynamics turns out to be a generic outcome of covariance-based synaptic plasticity implemented in any decision-making network, independently of the properties of the constituent neurons or the specifics of the covariance-based synaptic plasticity.

The Learning Rate η
The learning rate η in the Replicator equation is determined by the sum over all synapses of the product of three terms, normalized by the probability of choosing the other alternative (Eq. 14). The first term, ϕ_k, is the plasticity rate. The second term, ∂p_i/∂W_k, signifies the dependence of the probability of choice on the synaptic efficacies; in other words, it is a measure of the susceptibility of choice behavior to the synaptic efficacies. The third term, E[δN_k | A = i], is the average of the fluctuations in neural activity in trials in which alternative i was chosen. This term is determined both by the plasticity rule, which determines N, and by the network properties that determine the conditional average of N. In the next sections I analyze several examples to show how the properties of the decision-making network and the synaptic plasticity rule impact the effective learning rate.

The Network Architecture
An overt response in a decision-making task is believed to result from competition between populations of neurons, each population representing an alternative. In this paper I implement this competition in a general decision-making network which is commonly used to study decision-making in the cortex (Wang, 2002). The network model consists of two populations of "sensory" neurons, each containing a large number of neurons, n, representing the two alternatives, and two populations of neurons which signal the chosen alternative and therefore are referred to as "premotor" (Figure 2). I assume that the activity of neurons in the sensory populations is independent of past actions and rewards (which is why I refer to these neurons as "sensory"). Choice is determined by competition between the premotor populations. I use specific examples to analyze three general types of competition. In the first example, the decision is determined by the first population whose activity reaches a threshold; in the second example, it is the population whose activity, averaged over a particular window of time, is larger; the third example implements a dynamic competition. After the competition, the firing rate of the premotor population that corresponds to the chosen alternative is high whereas the firing rate of the other premotor population is low (Wang, 2002). More formally, denoting by M_a the firing rate of population a, I assume that M_1 = M_win, M_2 = M_los in trials in which alternative 1 is chosen and M_1 = M_los, M_2 = M_win in trials in which alternative 2 is chosen, where M_win > M_los.

Example 1: The Temporal Winner-Take-All Readout
A recent study has shown that the central nervous system can make accurate decisions about external stimuli in brief time frames by considering the identity of the neuron that fired the first spike (Shamir, 2009), a readout scheme known as temporal Winner-Take-All (tWTA). In the framework of the decision-making network shown in Figure 2, alternative 1 is chosen in trials in which the first neuron to fire a spike belongs to premotor population 1. By contrast, if the first neuron to spike belongs to population 2, alternative 2 is chosen. This readout process, which implements the Race Model for decision making in the limit of small threshold (Bogacz et al., 2006), can occur if the competition between the two populations of premotor neurons is mediated by strong and fast lateral inhibition. While it could be argued that it is unlikely for a single spike in a single neuron to determine choice (however, see Herfst and Brecht, 2008), the analytical tractability of this model provides insights into how the learning rate is affected by the properties of the network. Moreover, it can be considered as the limit of a fast decision process. Finally, it can be generalized to an arbitrary threshold (n-tWTA model, Shamir, 2009).

In this section I study the effect of covariance-based plasticity in a decision-making network characterized by a tWTA readout. I assume that during the competition, the timing of spikes of each premotor neuron in each population is determined by a Poisson process whose rate is a linear function of the input synaptic efficacy to that neuron. Formally, λ_{a,i} = C_{a,i} + α·W_{a,i}, where λ_{a,i} is the firing rate of neuron i of population a, W_{a,i} is the synaptic input to the neuron (a ∈ {1,2}, i ∈ {1,…,n}), and C_{a,i} and α > 0 are constants.

Susceptibility. Because the firing of the neurons is a Poisson process and choice is determined by the identity of the first neuron to fire, it is easy to show that the probability that the first spike belongs to population 1, and thus that alternative 1 is chosen in a trial, is

p_1 = Λ_1/(Λ_1 + Λ_2),  Λ_a ≡ Σ_i λ_{a,i}.   (15)

Differentiating Eq. 15 with respect to the synaptic efficacies yields

∂p_1/∂W_{1,i} = α·Λ_2/(Λ_1 + Λ_2)²,  ∂p_1/∂W_{2,i} = −α·Λ_1/(Λ_1 + Λ_2)².   (16)

Plasticity rule. Here I consider a synaptic plasticity rule in which the synaptic efficacies W_{a,i} change according to the product of reward with the activity of the corresponding premotor population (after the competition), assuming that this activity is measured relative to its average value and assuming that all plasticity rates are equal, ϕ_{a,i} = ϕ:

ΔW_{a,i}(t) = ϕ · R(t) · (M_a(t) − E[M_a]).   (17)

The plasticity rule of Eq. 17 is an expression of covariance because it is a product of reward and neural activity (postsynaptic activity), measured relative to its average value:

E[ΔW_{a,i}] = ϕ · Cov(R, M_a).   (18)

In order to compute the learning rate, I consider the term E[δN_k | A = i] in Eq. 14. The neural activity here corresponds to the activity of the premotor population following the competition. Therefore,

E[M_1] = p_1·M_win + p_2·M_los,  E[M_2] = p_1·M_los + p_2·M_win,   (19)

and hence

E[δM_1 | A = 1] = p_2·(M_win − M_los),  E[δM_2 | A = 1] = −p_2·(M_win − M_los).   (20)

Substituting Eqs 16, 17 and 20 in Eq. 14 yields

η = n·α·ϕ·(M_win − M_los)/(Λ_1 + Λ_2).   (21)

Note that the denominator in Eq. 21 is constant because the plasticity rule, Eq. 17, conserves the total synaptic efficacy: summing ΔW_{a,i} over all synapses yields ϕ·R·n·(δM_1 + δM_2) = 0. Thus, if ϕ > 0 then according to Eq. 21 the network model is expected to meliorate: with experience, the model will bias its choice preference in favor of the alternative that provides, on average, more reward. The rate at which this learning takes place is proportional to the product of (1) the difference between the neural activity of the premotor population in "winning" trials and "losing" trials, (2) the plasticity rate, and (3) the dependence of the firing rate on the synaptic efficacy, α. It is inversely proportional to the average firing rates of the premotor populations.

The tWTA model described above is sufficiently simple to derive the actual trial-to-trial stochastic dynamics, allowing us to better understand the resultant behavior as well as to study the quality of the average velocity approximation. Using Eqs 15 and 17, the change in the probability of choice in a trial is

Δp_1(t) = η · R(t) · (a_1(t) − p_1(t)),   (22)

where η is given by Eq. 21 and a_1 is an index variable that is equal to 1 in trials in which alternative 1 is chosen and to 0 otherwise. The resultant Eq. 22 is the linear reward-inaction algorithm proposed by economists as a phenomenological description of human learning behavior (Cross, 1973) and is commonly used in machine learning (Narendra and Thathachar, 1989). Note that the dynamics of the linear reward-inaction algorithm, Eq. 22, is stochastic for two reasons: first, choice is stochastic; second, the reward schedule may be stochastic, in which case the reward variable R is also a stochastic variable. A detailed analysis of the relation between the linear reward-inaction algorithm, Eq. 22, and its average velocity approximation, Eq. 1, appears elsewhere (Borgers and Sarin, 1997; Hofbauer and Sigmund, 1998). Here I demonstrate the relation between the stochastic dynamics and its deterministic approximation using a specific example. I simulated the stochastic dynamics, Eq. 22, in a "two-armed bandit" reward schedule in which alternatives 1 and 2 provide a binary reward with probabilities 0.75 and 0.25, respectively, and recorded the choice behavior of the model. The probability of choosing alternative 1, p_1, as a function of trial number was estimated by repeating the simulation 1,000 times and counting the fraction of trials in which alternative 1 was chosen (Figure 3A, circles). Initially, the two alternatives were chosen with equal probability. With experience, the model biased its choice preference in favor of alternative 1, which provided the reward with a higher probability, as expected from the average velocity approximation (black solid line), Eq. 1.

Example 2: The Population Readout Model
The learning behavior of the neural model analyzed in the previous section follows the linear reward-inaction algorithm, a stochastic implementation of the Replicator equation with a constant learning rate. However, this result does not necessarily generalize to other neural models. In the following sections I present several examples in which the covariance synaptic plasticity results in a learning rate which is a function of the probabilities of choice.

In the previous section I computed the learning rate in a model in which decisions were determined by the identity of the neuron that fired the first spike. However, if the inhibition that mediates the competition between the premotor populations is weaker and slower, the decision is likely to be determined by the joint activity of many neurons, similar to the well-studied population code scheme. In this section I consider such a population readout model. I assume that the total input to each premotor population is the sum of the activities of all neurons of the corresponding sensory population, each weighted by its synaptic efficacy. The chosen alternative is the one that corresponds to the larger input. Formally, denoting by I_a the synaptic input to premotor population a, alternative 1 is chosen in trials in which I_1 > I_2; otherwise alternative 2 is chosen. (In fact, the example I study in Section "Materials and Methods" is slightly more general: alternative 1 is chosen in trials in which I_1 − I_2 > z_e, where z_e is a zero-mean Gaussian noise; otherwise alternative 2 is chosen.) The mechanism underlying this competition is not explicitly modeled here. The synaptic input to the premotor populations, I_a, is the sum of the activities of the corresponding sensory neurons, weighted by the corresponding synaptic efficacies: denoting by S_{a,k} the spike count of sensory neuron k in population a in a particular temporal window, I_a = Σ_{k=1}^{n} W_{a,k}·S_{a,k}. Here I assume that the spike counts of the different neurons are independently drawn, and are independent of past actions and rewards. Using the central limit theorem, the susceptibility of the probability of choice in this model can be computed approximately (see Materials and Methods).
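The simulation of the linear reward-inaction dynamics, Eq. 22, can be sketched directly at the behavioral level, i.e., without simulating the spiking network itself. The learning rate and trial counts below are illustrative, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
eta, n_trials, n_runs = 0.05, 200, 1000

# Stochastic linear reward-inaction dynamics (Eq. 22) in the "two-armed
# bandit": reward probabilities 0.75 and 0.25 for alternatives 1 and 2.
p1 = np.full(n_runs, 0.5)
mean_traj = []
for _ in range(n_trials):
    a1 = rng.random(n_runs) < p1                    # stochastic choice
    p_rew = np.where(a1, 0.75, 0.25)
    R = (rng.random(n_runs) < p_rew).astype(float)  # binary reward
    p1 = p1 + eta * R * (a1.astype(float) - p1)     # Eq. 22
    mean_traj.append(p1.mean())

# Deterministic average velocity approximation (Eq. 1 with constant eta):
# dp1/dt = eta * p1 * p2 * (R1 - R2), with returns R1 = 0.75, R2 = 0.25.
q1, det_traj = 0.5, []
for _ in range(n_trials):
    q1 = q1 + eta * q1 * (1 - q1) * (0.75 - 0.25)
    det_traj.append(q1)

print(mean_traj[-1], det_traj[-1])
```

Averaged over many repetitions, the stochastic trajectory tracks the deterministic approximation, and the agreement improves as the plasticity (here, learning) rate is reduced.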
The effective learning rate depends on the plasticity rule used. Here I discuss three covariance plasticity rules that differ by the neural activity term in Eq. 4c: N is (1) the postsynaptic activity, (2) the presynaptic activity, or (3) Hebbian (the product of presynaptic and postsynaptic activities). In Section "Materials and Methods" I show that both the postsynaptic-activity and the Hebbian covariance rules result in a learning rate that is approximately given by

η ≈ η_0 · (p_1·p_2)^(π/4).   (24)

In contrast, if the neural activity in the covariance plasticity rule is presynaptic, and if this activity is drawn from a Gaussian distribution, the learning rate is approximately given by

η ≈ η_0 · (p_1·p_2)^(π/2 − 1).   (25)

Common to these examples, and similar to the tWTA example, the population readout model is expected to meliorate. However, in contrast to the tWTA example, the rate at which this learning takes place is not constant: it is proportional to (p_1·p_2)^α, where α = π/4 for postsynaptic or Hebbian covariance plasticity and α = π/2 − 1 for presynaptic covariance plasticity. The fact that the effective learning rate is not constant and decreases as one of the probabilities of choice approaches zero has important implications for exploratory behavior. Consider a reward schedule in which the return from one of the alternatives surpasses that of the other. According to Eq. 1, the probability of choosing the more profitable alternative will always increase. However, the decreasing learning rate allows for continued exploration of the second alternative, albeit with an ever-decreasing probability. This result is consistent with empirically observed human as well as animal behavior (Vulkan, 2000; Shanks et al., 2002; Neiman and Loewenstein, 2008).

In order to compare the stochastic dynamics to its average velocity approximation, I simulated the learning behavior of the decision-making model of Figure 2, in which each sensory population consisted of 1,000 Poisson neurons. I used the same reward schedule as in Example 1, namely, a "two-armed bandit" reward schedule in which alternatives 1 and 2 provide a binary reward with probabilities of 0.75 and 0.25. The probability of choice was estimated by repeating the simulation 1,000 times and counting the fraction of trials in which alternative 1 was chosen.

To study the consequences of a postsynaptic-activity covariance rule, I simulated the network when synaptic changes are given by ΔW_{a,k}(t) = ϕ·R(t)·(M_a(t) − M_a(t − 1)) (see Eq. 4d). The simulated probability of choice is denoted by black circles in Figure 3B. Despite the increased complexity of the network model, as well as of the synaptic plasticity rule, the stochastic dynamics is remarkably similar to its average velocity approximation, η = η_0·(p_1·p_2)^(π/4) (solid line). Similarly, I simulated the network using a Hebbian covariance plasticity rule, ΔW_{a,k}(t) = ϕ·R(t)·(S_{a,k}(t)·M_a(t) − S_{a,k}(t − 1)·M_a(t − 1)), where S_{a,k} is the number of spikes fired by the presynaptic neuron in a given window of time. The results of these simulations (Figure 3B, blue circles) are similar to those of the postsynaptic-activity dependent plasticity and are consistent with the expected average velocity approximation (solid line).

To study the consequences of a presynaptic-activity covariance rule, I simulated the network dynamics with the presynaptic-activity dependent covariance plasticity rule ΔW_{a,k}(t) = ϕ·R(t)·(S_{a,k}(t) − S_{a,k}(t − 1)). The results of these numerical simulations (Figure 3C, circles) were similar to the expected average velocity approximation, η = η_0·(p_1·p_2)^(π/2 − 1) (solid line), but not exact: the learning rate of the stochastic dynamics was slightly lower than that of the deterministic dynamics. This small deviation of the stochastic dynamics from its average velocity approximation disappears when a smaller plasticity rate is used (not shown).

A learning rate that decreases as one of the probabilities of choice approaches 1 (α > 0) has important behavioral consequences. It enables a large learning rate, and thus fast learning, when the probabilities of the two alternatives are approximately equal. In contrast, as one of the probabilities of choice approaches 1, learning becomes slow, allowing for continued exploration, i.e., the choosing of both alternatives, even after a large number of trials.

Example 3: Dynamic Competition Model
The framework used here to derive the behavioral consequences of covariance-based synaptic plasticity can also be used in more complex models, as long as the susceptibility and the conditional average of the neural fluctuations can be computed. Therefore, even if the model is too complex to solve analytically, it is possible to use a phenomenological approximation to study the effect of covariance-based synaptic plasticity on learning behavior. This is demonstrated in this section using the Soltani and Wang (2006) dynamic model for decision making. Soltani and Wang analyzed a biophysical spiking-neuron model that is based on the architecture of Figure 2. The result of their extensive numerical simulations was that the probability of choosing an alternative is, approximately, a logistic function of the difference in the overall synaptic efficacies onto the two premotor populations,

p_1 = 1/(1 + exp(−(Σ_k W_{1,k} − Σ_k W_{2,k})/T)),   (26)

where T is a parameter that determines the sensitivity of the probability of choice to the difference in the synaptic efficacies. Equation 26 can be used to compute the susceptibility of choice behavior to the synaptic efficacies, yielding

∂p_1/∂W_{1,k} = p_1·p_2/T,  ∂p_1/∂W_{2,k} = −p_1·p_2/T.   (27)

Assuming that synaptic plasticity is postsynaptic-activity dependent, Eq. 17, and substituting Eqs 27 and 20 in Eq. 14 yields

η = η_0 · p_1 · p_2,   (28)

where η_0 = 2·n·ϕ·(M_win − M_los)/T. As in the previous examples, the learning rate is proportional to the product of the probabilities of choice raised to a power, η = η_0·(p_1·p_2)^α, and in this example α = 1.
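The effect of the exponent α on the speed of learning can be sketched by integrating the Replicator equation with a state-dependent rate η = η_0·(p_1·p_2)^α for two values of α (the parameter values below are illustrative assumptions):

```python
import numpy as np

def trajectory(alpha, eta0=0.5, steps=400, p0=0.5, R1=0.75, R2=0.25):
    """Deterministic Replicator dynamics, Eq. 1, with the state-dependent
    learning rate eta = eta0 * (p1 * p2)**alpha."""
    p1, out = p0, []
    for _ in range(steps):
        eta = eta0 * (p1 * (1 - p1)) ** alpha
        p1 = p1 + eta * p1 * (1 - p1) * (R1 - R2)
        out.append(p1)
    return np.array(out)

flat = trajectory(alpha=0.0)  # constant learning rate (tWTA-like)
bent = trajectory(alpha=1.0)  # rate vanishes near p1 = 1 (dynamic competition)

# With alpha > 0 the approach to p1 = 1 is much slower, so the second
# alternative keeps being sampled for many more trials.
print(flat[-1], bent[-1])
```

Both trajectories meliorate toward the richer alternative, but with α > 0 the effective rate collapses as p_1·p_2 → 0, which is the mechanism behind the prolonged exploration discussed above.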
Whether the theory of Melioration is a good description of the process of adaptation of choice preference is subject to debate among scholars in the field. While Replicator-like dynamics provides a good phenomenological description of choice behavior in many repeated-choice experiments, it has been argued that it is inconsistent with the rapid changes in behavior following changes in reward schedule (Gallistel et al., 2001, however, see Neiman andLoewenstein, 2007). Another criticism of this theory is that it does not address the temporal credit assignment problem in more complicated behavioral experiments, generally formulated as a fully observable Markov decision process (MDP, Sutton and Barto, 1998). Importantly, it can be shown that other popular phenomenological behavioral models can be formulated in the Replicator framework. For example, consider an income-based model in which the income I of the two alternatives is estimated using an exponential filter and ratio of the probabilities of choosing the two alternatives is equal to the ratio of incomes: This model has been used to describe human learning behavior in games (Erev and Roth, 1998) and monkeys' learning behavior in a concurrent variable interval (VI) schedule (Sugrue et al., 2004). In Section "Materials and Methods" I show that Eq. 30 can be rewritten as a linear reward-inaction algorithm in which the learning rate depends on the exponentially weighted average reward. Reinforcement learning in the brain is likely to be mediated by many different algorithms, implemented in different brain modules. These algorithms probably range from high level deliberation through temporal-difference (TD) learning and Monte Carlo methods (Sutton and Barto, 1998) to simple "stateless" (Loewenstein et al., 2009) methods such as the Replicator dynamics. Compared to these methods, the computational capabilities of covariance-based synaptic plasticity are limited. 
However, the implementation of the covariance rule in the neural hardware is much simpler and much more robust: network architecture and the properties of neurons can change, but as long as the synaptic rule is covariance-based the organism will meliorate.

MateRials and Methods
This section provides the technical derivations supporting the text. The effective learning rates are computed for various decision-making models and the details of the numerical simulations are provided. Topics are presented in the order in which they appear in the text and equations are numbered to coincide with the equations in the text.
As in Example 1, this model is sufficiently simple to derive the actual trial-to-trial stochastic dynamics. Using Eqs. 17 and 26, it is easy to show that the change in probability of choice in a trial is To study the quality of the average velocity approximation, I numerically simulated the decision making model, Eq. 29, in the same "two-armed bandit" reward schedule described in Examples 1,2 and estimated the dynamics of probability of choice by averaging over 1,000 repetitions (Figure 3D, circles). The stochastic dynamics, Eq. 29, was remarkably similar to its average velocity approximation.

discussion
In this paper I constructed a framework that relates the microscopic properties of neural dynamics to the macroscopic dynamics of learning behavior in the framework of a two-alternative repeatedchoice experiment, assuming that synaptic changes follow a covariance rule. I showed that while the decision making network may be complex, if synaptic plasticity in the brain is driven by the covariance between reward and neural activity, the emergent learning behavior dynamics meliorates and follows the Replicator equation. The specifics of the network architecture, e.g., the properties of the neurons and the characteristics of the synaptic plasticity rule, only determine the learning rate. Thus, Replicator-like meliorating learning behavior dynamics is consistent with covariance-based synaptic plasticity.
The generality of this result raises the question of whether it is possible to infer the underlying neural dynamics from the observed learning behavior in the framework of covariance-based synaptic plasticity. The examples analyzed in this paper suggest that careful measurement of the learning rate may provide such information. In these examples, the effective learning rate is approximately η = η_0·(p_1 p_2)^α, where the value of α depends on the network and the plasticity rule. For example, in the tWTA model with the postsynaptic activity-dependent covariance rule, α = 0. At the other extreme, the dynamic competition model of Soltani and Wang (2006) with the same plasticity rule resulted in α = 1. The value of α in all the other models lies between these two values. Therefore, the value of α is a window, albeit a limited one, to the underlying neural dynamics. However, estimating the value of α from behavioral data is not straightforward. The main reason is that it requires the accurate estimation of the non-stationary probability of choice from the binary string of choices. Therefore, an accurate estimation of α may require a very large number of trials. Yet, despite this limitation, it is clear from previously published data on human and animal learning behavior that the learning rate decreases as the probability of choice approaches unity (Vulkan, 2000; Shanks et al., 2002; Neiman and Loewenstein, 2008). This result, which indicates that α > 0, refutes the naïve formulation of the Replicator equation (or its stochastic implementation, the linear reward-inaction algorithm) in which the learning rate was assumed constant, α = 0 (Cross, 1973; Fudenberg and Levine, 1998; Hofbauer and Sigmund, 1998). Therefore, I suggest a refinement of these models in which η = η_0·(p_1 p_2)^α. However, the question of whether even
June 2010 | Volume 4 | Article 17 | 9
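The suggested refinement can be illustrated with a minimal numerical sketch of the Replicator dynamics with the state-dependent learning rate η = η_0·(p_1 p_2)^α. The step counts, rates, and rewards below are illustrative choices, not fitted values.

```python
def replicator_p1(n_steps, eta0, alpha, rewards=(0.75, 0.25), p1=0.5):
    # Replicator dynamics for the probability of choosing alternative 1,
    # with the state-dependent learning rate eta = eta0 * (p1*p2)**alpha
    # suggested in the text. All numerical values are illustrative.
    for _ in range(n_steps):
        p2 = 1.0 - p1
        eta = eta0 * (p1 * p2) ** alpha
        p1 += eta * p1 * p2 * (rewards[0] - rewards[1])
    return p1
```

With α = 0 the learning rate is constant, whereas with α = 1 (as in the dynamic competition model) the extra factor p_1 p_2 makes learning slow markedly as the probability of choice approaches unity, consistent with the behavioral data cited above.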

Loewenstein    Synaptic theory of Replicator melioration

To compare the differential contribution of the two terms in Eq. 36, consider.
With Assumptions 1-3, the central limit theorem can be applied to Eq. 32, where Z is a zero-mean stochastic variable. To compute the effective learning rate, we need to compute the effect of a change in the synaptic efficacies on the probability of choice, ∂p_1/∂W_{a,i}, using the chain rule (Eq. 38). To find the dependence of the susceptibility on the probability of choice, I expand Eq. 35 around μ = 0. Expanding the exponent term in Eq. 40 around μ = 0 and substituting Eq. 41 yields

e^{-μ²/(2σ²)} ≈ (4 p_1 p_2)^{π/4}.   (42)

Note that the approximation of Eq. 42 is valid not only around p_1 = 0.5 but also for p_1 = 0 and p_2 = 0 (μ → ±∞). To study the quality of this approximation for all values of p_i, I numerically computed the dependence of e^{-μ²/(2σ²)} on the probability of choice and compared it to its approximation, Eq. 42. A quantitative analysis reveals that for 0.05 < p_1 < 0.95, the deviations of e^{-μ²/(2σ²)} from (4 p_1 p_2)^{π/4} do not exceed 5%. Substituting Eq. 42 in Eq. 40 results in the effective learning rate.

Footnote 5: Note that according to Eq. 40, a cumulative normal distribution is expected to fit the numerical simulations in Soltani and Wang (2006), discussed in Example 2, better than a logistic function. In fact, a careful examination of Figure 3 in that paper reveals a deviation from the fitted logistic function that is consistent with a cumulative normal distribution function.
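The quoted 5% bound can be checked numerically. The sketch below assumes the choice probability is the cumulative normal of μ/σ (consistent with the cumulative normal readout discussed around Eq. 40), so that μ/σ = Φ⁻¹(p_1).

```python
import math
from statistics import NormalDist

nd = NormalDist()  # standard normal, used to invert p1 = Phi(mu/sigma)
max_rel_dev = 0.0
for i in range(5, 96):  # p1 on a grid from 0.05 to 0.95
    p1 = i / 100.0
    p2 = 1.0 - p1
    z = nd.inv_cdf(p1)                            # z = mu / sigma
    exact = math.exp(-z * z / 2.0)                # e^(-mu^2 / (2 sigma^2))
    approx = (4.0 * p1 * p2) ** (math.pi / 4.0)   # Eq. 42
    max_rel_dev = max(max_rel_dev, abs(approx - exact) / exact)
print(max_rel_dev)  # stays below 0.05; the worst case is at p1 = 0.05, 0.95
```

Both sides equal 1 at p_1 = 0.5 (μ = 0) and vanish together as μ → ±∞, and on the stated interval the relative deviation indeed peaks just under 5% at the endpoints.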
Scaling arguments show that, under very general conditions, k_post hardly changes over the time relevant for the learning of p_1 (using Eq. 34). Substituting Eqs. 4c and 13 in Eq. 46 yields

Learning rate when synaptic plasticity is Hebbian
In this section I compute the dependence of the effective learning rate on the probability of choice, assuming the synaptic plasticity rule of Eq. 4c, where N is the product of the presynaptic and postsynaptic neural activities and ϕ_{a,i} = ϕ. I show that the dependence of the learning rate on the probability of choice is the same as that computed in the section "Learning rate when synaptic plasticity is post-synaptic activity-dependent." As before, to compute the learning rate we need to compute the value of E[δN_{a,j} | A = 1], where N_{a,j} = S_{a,j}·M_a.

The reward schedule was a two-armed bandit in which alternatives 1 and 2 yielded a binary reward with probabilities of 0.75 and 0.25, respectively. In Examples 1 and 2, the number of sensory neurons in each population was 1,000. The activity of each sensory neuron S_{a,j} in a trial was drawn from a Poisson distribution with parameter λ_{a,j}, which was constant throughout all simulations. λ_{a,j} was independently drawn from a Gaussian distribution with a mean of 10 and a standard deviation of 5, truncated at λ_{a,j} = 1 (values λ_{a,j} < 1 were replaced by λ_{a,j} = 1). M_win = 12, M_los = 2. The initial conditions in the simulations were W_{a,j} = λ_{a,j}/10. The synaptic plasticity rate was equal for all synapses, ϕ_{a,j} = ϕ. The values of the plasticity rate in all simulations were chosen such that, in the average velocity approximation, the probability of choosing alternative 1 after 200 trials would be equal to 0.75. In Figure 3A, η = 0.0110; in Figure 3B, black circles, ϕ = 2.62·10^-5, resulting in η_0 = 0.0355; in Figure 3B, blue circles, ϕ = 2.18·10^-6, resulting in η_0 = 0.0355; in Figure 3C, ϕ = 2.90·10^-3, resulting in η_0 = 0.0258; in Figure 3D, η_0 = 0.0488.
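The simulation setup just described can be sketched as follows. The winner-take-all readout is a placeholder (the actual decision stage depends on the network model in each example), and the Poisson sampler uses Knuth's algorithm to keep the sketch dependency-free; only the population parameters (1,000 neurons, truncated Gaussian rates, W_{a,j} = λ_{a,j}/10, M_win = 12, M_los = 2, 0.75/0.25 rewards) come from the text.

```python
import math
import random

N_SENSORY = 1000  # sensory neurons per population, as in Examples 1 and 2

def poisson(lam, rng):
    # Knuth's Poisson sampler; adequate for the rates used here (lam ~ 10)
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def make_population(rng):
    # lambda_{a,j} ~ Gaussian(mean 10, s.d. 5), truncated below at 1;
    # initial efficacies W_{a,j} = lambda_{a,j} / 10
    lam = [max(1.0, rng.gauss(10.0, 5.0)) for _ in range(N_SENSORY)]
    w = [l / 10.0 for l in lam]
    return lam, w

def trial(populations, rng, p_reward=(0.75, 0.25), m_win=12.0, m_los=2.0):
    # One trial: Poisson sensory activity, a weighted-sum readout with a
    # winner-take-all choice (a placeholder -- the actual decision stage
    # depends on the network model in each example), then a binary reward.
    spikes, drives = [], []
    for lam, w in populations:
        s = [poisson(l, rng) for l in lam]
        spikes.append(s)
        drives.append(sum(wi * si for wi, si in zip(w, s)))
    choice = 0 if drives[0] >= drives[1] else 1
    rewarded = rng.random() < p_reward[choice]
    m = [m_los, m_los]
    m[choice] = m_win  # winning population fires at M_win, losing at M_los
    return choice, rewarded, spikes, m
```

A full reproduction would add the covariance-based synaptic update after each trial; the sketch stops at the choice and reward to keep it model-agnostic.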
and M(t) is an exponentially weighted average of past rewards. If η ≪ 1, then η(t) ≈ η·M(t).
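In one standard form (assumed here, since the update equation for M is not reproduced in this excerpt), such an exponentially weighted average is maintained recursively after each trial:

```python
def update_reward_average(m_prev, reward, eta):
    # Exponentially weighted running average of past rewards:
    # M(t) = (1 - eta) * M(t-1) + eta * R(t); the contribution of a
    # reward received k trials ago is discounted by (1 - eta)**k.
    return (1.0 - eta) * m_prev + eta * reward
```

With small η, M(t) forgets old rewards slowly, so over the timescale of a few trials it acts as a near-constant scaling of the learning rate.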