
Edited by: Paul E. M. Phillips, University of Washington, USA

Reviewed by: Geoffrey Schoenbaum, University of Maryland School of Medicine, USA; Matthew S. Matell, Villanova University, USA; Fuat Balci, Koç University, Turkey

*Correspondence: Marshall G. Hussain Shuler, Department of Neuroscience, Johns Hopkins University, 725 N Wolfe Street, 914 WBSB, Baltimore, MD 21205, USA e-mail:

This article was submitted to the journal Frontiers in Behavioral Neuroscience.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Animals and humans make decisions based on their expected outcomes. Since relevant outcomes are often delayed, perceiving delays and choosing between earlier vs. later rewards (intertemporal decision-making) is an essential component of animal behavior. The myriad observations made in experiments studying intertemporal decision-making and time perception have not yet been rationalized within a single theory. Here we present a theory—Training-Integrated Maximized Estimation of Reinforcement Rate (TIMERR)—that explains a wide variety of behavioral observations made in intertemporal decision-making and the perception of time. Our theory postulates that animals make intertemporal choices to optimize expected reward rates over a limited temporal window which includes a past integration interval—over which experienced reward rate is estimated—as well as the expected delay to future reward. Using this theory, we derive mathematical expressions for both the subjective value of a delayed reward and the subjective representation of the delay. A unique contribution of our work is in finding that the past integration interval directly determines the steepness of temporal discounting and the non-linearity of time perception. In so doing, our theory provides a single framework to understand both intertemporal decision-making and time perception.

Survival and reproductive success depend on beneficial decision-making. Such decisions are guided by judgments regarding outcomes, which are represented as expected reinforcement amounts. As actual reinforcements are often available only after a delay, measuring delays and attributing values to reinforcements that incorporate the cost of time is an essential component of animal behavior (Stephens and Krebs,

In the past, many theories including Optimal Foraging Theory (Stephens and Krebs,

None of these theories and models can systematically explain the breadth of data on intertemporal decision-making; we argue that the inability of prior theories to rationalize behavior stems from the lack of biologically-realistic constraints on general optimization criteria (see next section). Further, while intertemporal decision-making necessarily requires perception of time, theories of intertemporal decision-making and time perception (Gibbon et al.,

Intertemporal choice behavior has been modeled using two dissimilar approaches. The first approach is to develop theories that explore ultimate (Alcock and Sherman,

The second approach, mainly undertaken by psychologists and behavioral analysts, is to understand the proximate (Alcock and Sherman,

In order to explain behavior, an ultimate theory must consider appropriate proximate constraints. The lack of appropriate constraints might explain the inability of the above theories to rationalize experimental data. By merely stating that animals maximize indefinitely-long-term reward rates, OFT ignores at least three biological constraints: (1) future reinforcement statistics are uncertain and non-stationary; (2) computing the rate-maximizing policy over all possible future choice sequences is computationally implausible (the number of such sequences grows combinatorially, e.g., 10^{100} combinations); (3) animals cannot persist for indefinitely long intervals without food in the hope of obtaining an unusually large reward in the distant future, even if the reward may provide the highest long-term reward rate (e.g., option between 11,000 units of reward in 100 days vs. 10 units of reward in 0.1 day). On the other hand, ERT, although computationally-simple, expects an animal to ignore its past reward experience while making the current choice.

To contend with uncertainties regarding the future, an animal could estimate reward rates based on an expectation of the environment derived from its past experience. In a world that presents large fluctuations in reinforcement statistics over time, estimating reinforcement rate using the immediate past has an advantage over using longer-term estimations because the correlation between the immediate past and the immediate future is likely high. Hence, our TIMERR theory proposes an algorithm for intertemporal choice that aims to maximize expected reward rate based on, and constrained by, memory of past reinforcement experience. As a consequence, it postulates that time is subjectively represented such that subjective representation of reward rate accurately reflects objective changes in reward rate (see section TIMERR Theory: Time Perception). In doing so, we are capable of explaining a wide variety of fundamental observations made in intertemporal decision-making and time perception. These include hyperbolic discounting (Stephens and Krebs,

To illustrate the motivation and reasoning behind our theory, we consider a simple behavioral task. In this task, an animal must make decisions on every trial between two randomly chosen (among a finite number of possible alternatives) known reinforcement-options. Having chosen an option on one trial, the animal is required to wait the corresponding delay to obtain the reward amount chosen. An example environment with three possible reinforcement-options is shown in Figure

Assuming a stationary reinforcement-environment in which it is not possible to directly know the pattern of future reinforcements, an animal may yet use its past reinforcement experience to instruct its current choice. Provisionally, suppose also that an animal can store its entire reinforcement-history in the task in its memory. So rather than maximizing reward rates into the future as envisioned by OFT, the animal can then maximize the total reward rate that would be achieved so far (at the end of the current trial). In other words, the animal could pick the option that when chosen, would lead to the highest global reward rate over all trials until, and including, the current trial, i.e.,
choose the option i that maximizes (R_past + r_i)/(T_past + t_i),     (1)

where R_past is the total reward accrued and T_past the total time elapsed over all previous trials, and (r_i, t_i) are the magnitude and delay of option i.
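As an illustration of Equation (1), a minimal Python sketch (not the paper's MATLAB code; the option set and reward histories are hypothetical) can track cumulative reward and elapsed time and score each option by the global rate achieved at the end of the current trial:

```python
def choose_eq1(options, past_reward, past_time):
    """Pick the option maximizing the global reward rate achieved by the
    end of the current trial: (R_past + r_i) / (T_past + t_i)."""
    return max(options, key=lambda o: (past_reward + o[0]) / (past_time + o[1]))

options = [(2.0, 1.0), (20.0, 15.0)]  # hypothetical (magnitude, delay) pairs
# With a modest history (rate 0.5 per unit time), the larger-later option wins:
print(choose_eq1(options, past_reward=50.0, past_time=100.0))
# With a richer history (rate 2 per unit time), the smaller-sooner option wins:
print(choose_eq1(options, past_reward=200.0, past_time=100.0))
```

Note how the same pair of options yields opposite choices depending on the accrued history, which is precisely the opportunity-cost sensitivity discussed below.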

Under the above conditions, this algorithm yields the highest possible reward rate achievable at the end of any given number of trials. In contrast, previous algorithms for intertemporal decision-making (hyperbolic discounting, exponential discounting, two-parameter discounting), while being successful at fitting behavioral data, fail to maximize global reward rates. For the example reinforcement-environment shown in Figure

The reason why extant alternatives fare poorly is that they do not account for opportunity cost, i.e., the cost incurred in the lost opportunity to obtain better rewards than currently available. In the example considered, two of the reinforcement-options are significantly worse than the third (Figure

The behavioral task shown in Figure

It is important to note that while the extent to which Equation (1) outperforms other models depends on the reinforcement-environment under consideration, its performance in a stationary environment will be greater than or equal to previous decision models. However, biological systems face at least three major constraints that limit the appropriateness of Equation (1): (1) their reinforcement-environments are non-stationary; (2) integrating reinforcement-history over arbitrarily long intervals is computationally implausible; and (3) indefinitely long intervals without reward cannot be sustained by an animal (while maintaining fitness) even if they were to return the highest long-term reward rate (e.g., choice between 100,000 units of food in 100 days vs. 10 units of food in 0.1 day). Hence, in order to be biologically-realistic, TIMERR theory states that the interval over which reinforcement-history is evaluated, the past-integration-interval (T_{ime}), is limited, and that decisions maximize expected reward rate over a window comprising T_{ime} and the learned expected delay to reward (t).

TIMERR theory thus posits that past reward rate is estimated (R_{est}) by the animal over a time-scale of T_{ime} [Calculation of the Estimate of Past Reward Rate (R_{est}) in Appendix]. This estimate is used to evaluate whether the expected reward rate upon picking either current option is worth the opportunity cost of waiting, which accrues at rate R_{est}. Such an algorithm automatically includes the opportunity cost of waiting in the decision.

If the estimated average reward rate over the past integration window of T_{ime} is denoted by R_{est}, the TIMERR algorithm can be written as: choose the option that maximizes the expected reward rate

(R_{est} T_{ime} + r)/(T_{ime} + t),     (2)

where r is the magnitude and t the delay of the option under consideration.
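A minimal sketch of this decision rule (Python, with hypothetical parameter values): each option is scored by the expected reward rate over the combined window of the past integration interval plus the option's delay.

```python
def timerr_rate(r, t, r_est, t_ime):
    """Expected reward rate over the window T_ime + t, combining past
    reward (R_est * T_ime) with the offered reward r."""
    return (r_est * t_ime + r) / (t_ime + t)

def choose_timerr(options, r_est, t_ime):
    return max(options, key=lambda o: timerr_rate(o[0], o[1], r_est, t_ime))

options = [(2.0, 1.0), (20.0, 15.0)]  # hypothetical (magnitude, delay) pairs
print(choose_timerr(options, r_est=1.0, t_ime=10.0))  # longer window: delayed option
print(choose_timerr(options, r_est=1.0, t_ime=1.0))   # shorter window: sooner option
```

The flip in preference with the window length anticipates the link between T_{ime} and discounting steepness developed below.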

From the TIMERR algorithm, it is possible to derive the subjective value of a delayed reward (Figure

This is calculated by asserting that the subjective value of a delayed reward is the magnitude of an immediate reward that would produce the same expected reward rate over the decision window, where R_{est} is an estimate of the average reward rate in the past over the integration window T_{ime} and the reward option is specified by a magnitude r and a delay t:

SV(r, t) = (r − R_{est} t)/(1 + t/T_{ime}).     (3)

Equation (3) presents an alternative interpretation of the algorithm: the animal is estimating the net worth of pursuing each delayed reward by subtracting the opportunity cost incurred by forfeiting potential alternative reward options during the delay to a given reward and normalizing by the explicit temporal cost of waiting. This is because the numerator in Equation (3) represents the expected reward gain minus this opportunity cost, R_{est} t, while the denominator, 1 + t/T_{ime}, normalizes by the temporal cost of waiting expressed in units of T_{ime}.

The temporal discounting function—the ratio of the subjective value of a delayed reward to the subjective value of the same reward presented immediately—is given by [based on Equation (3)]

D(t) = SV(r, t)/SV(r, 0) = (1 − (R_{est}/r) t)/(1 + t/T_{ime}),     (4)

a hyperbolic function of delay with an opportunity-cost (R_{est}) subtractive term. The effects of varying the parameters, viz. the past integration interval (T_{ime}), estimated average reward rate (R_{est}), and reward magnitude (r), are shown in Figure ( ). Discounting becomes less steep with increasing T_{ime}, the past integration interval. As opportunity costs (R_{est}) increase, delayed rewards are discounted more steeply. For losses, low-magnitude (< R_{est} × T_{ime}) losses will be preferred immediately while higher-magnitude losses will be preferred when delayed.
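These parameter effects can be verified directly (a minimal Python sketch assuming the forms of Equations (3) and (4) above; all parameter values are illustrative):

```python
def subjective_value(r, t, r_est, t_ime):
    """Equation (3): SV = (r - R_est * t) / (1 + t / T_ime)."""
    return (r - r_est * t) / (1.0 + t / t_ime)

def discount(t, r, r_est, t_ime):
    """Equation (4): SV(r, t) / SV(r, 0)."""
    return subjective_value(r, t, r_est, t_ime) / subjective_value(r, 0.0, r_est, t_ime)

# Larger T_ime -> shallower discounting (r = 10, R_est = 0):
print(discount(5.0, 10.0, 0.0, 10.0), discount(5.0, 10.0, 0.0, 100.0))
# Larger opportunity cost R_est -> steeper discounting:
print(discount(5.0, 10.0, 0.1, 100.0), discount(5.0, 10.0, 0.4, 100.0))
# A small loss (|r| < R_est * T_ime) worsens with delay: D(t) > 1
print(discount(5.0, -2.0, 0.05, 100.0))
```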

As the past integration interval (T_{ime}) increases, the discounting function becomes less steep, i.e., the subjective value for a given delayed reward becomes higher (shown for R_{est} = 0). As R_{est} increases, the opportunity cost of pursuing a delayed reward increases and hence, the discounting function becomes steeper. The dotted line indicates a subjective value of zero, below which rewards are not pursued, as is the case when the delay is too high (T_{ime} = 100). Gains of varying magnitude are discounted with T_{ime} = 100 and R_{est} = 0.05, and losses of varying magnitude likewise with T_{ime} = 100 and R_{est} = 0.05. Note that as the magnitude of a loss decreases, so does the steepness of discounting (Figure ).

(R_{est} = 0.05 and T_{ime} = 100). As the magnitude of a loss increases, the discounting function becomes steeper. However, the slope of the discounting steepness with respect to the magnitude is minimal for large magnitudes (100 and 1000; see Consequences of the Discounting Function in Appendix). At magnitudes below R_{est} T_{ime}, the discounting function becomes an increasing function of delay: below R_{est} T_{ime}, a loss becomes even more of a loss when delayed. Hence, at low magnitudes (< R_{est} T_{ime}), losses are preferred immediately. No curve crosses the dotted line at zero, showing that at all delays, losses remain punishing.

Attributing values to rewards delayed in time necessitates representations of those temporal delays. These representations of time are subjective, as it is known that time perception varies within and across individuals (Gibbon et al.,

Since TIMERR theory states that animals seek to maximize expected reward rates, we posit that time is represented subjectively such that the subjective representation of reward rate accurately reflects objective changes in reward rate (independent of R_{est}). Hence, the subjective representation of time associated with a delay t is ST(t) = t/(1 + t/T_{ime}), which is bounded within [0, T_{ime}], thereby making it possible to represent very long durations within the finite dynamic ranges of neuronal firing rates. Plots of the subjective time representation of delays between 1 and 90 s are shown in Figure ( ) for different values of T_{ime}. As mentioned previously, a lower T_{ime} corresponds to steeper discounting, characteristic of more impulsive decision-making. It can be seen that the difference in subjective time representations between 40 and 50 s is smaller for a lower T_{ime} (high impulsivity). Hence, higher impulsivity corresponds to a reduction in the ability to discriminate between long intervals (a decrease in the precision of time representation) (Figures ).
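A small sketch of the subjective time mapping (Python; the two T_{ime} values, chosen to contrast impulsive vs. patient regimes, are illustrative):

```python
def subjective_time(t, t_ime):
    """ST(t) = t / (1 + t / T_ime); saturates below T_ime."""
    return t / (1.0 + t / t_ime)

# Smaller T_ime (more impulsive) compresses long intervals more strongly,
# shrinking the subjective difference between 40 s and 50 s:
for t_ime in (20.0, 300.0):
    d = subjective_time(50.0, t_ime) - subjective_time(40.0, t_ime)
    print(t_ime, d)
```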

Subjective representation of time for different values of T_{ime}. Lower values of T_{ime} generate steeper discounting (higher impulsivity), and hence, smaller subjective values. Subjective time saturates toward T_{ime} for longer intervals. This saturation effect is more pronounced in the case of higher impulsivity, thereby leading to a reduced ability to discriminate between intervals (here, 40 and 50 s). Underproduction of intervals diminishes with increasing T_{ime}, demonstrating that as impulsivity reduces, so does underproduction.

Internal time representation has been previously modeled using accumulator models (Buhusi and Meck, ). When such an accumulator integrates our subjective representation of time in a time reproduction task, lower values of T_{ime} correspond to an underproduction of time intervals (i.e., decreased accuracy of reproduction), with the magnitude of underproduction increasing with increasing durations of the sample interval (Figure ). This underproduction diminishes with increasing T_{ime}, or equivalently, with decreasing impulsivity (Figure ).

Coefficient of variation of reproduced intervals as a function of sample duration, simulated using the accumulator model with subjective-time noise σ_{v} and T_{ime} = 300 s. An analytical approximation is expressed in Equation (8). Each data point is the result of averaging over 2000 trials.

Prior studies have observed that the error in representation of intervals increases with their durations (Gibbon et al., ). TIMERR further predicts that these errors are larger when T_{ime} is smaller, i.e., for higher impulsivity (Figures

Calculating the error in reproduced intervals by the accumulator model mentioned above cannot be done analytically. However, we present an approximate analytical solution below. Assuming that the representation of subjective time carries additive noise of constant standard deviation σ_{v}, an error in subjective time maps back onto real time in inverse proportion to the slope of ST(t), so that the coefficient of variation of reproduced intervals is approximately

CV(t) ≈ (σ_{v}/t)(1 + t/T_{ime})^{2}.     (8)

The above equation results in a U-shaped coefficient of variation (CV) as a function of interval duration: the CV falls approximately as σ_{v}/t for short intervals, reaches a minimum near t = T_{ime}, and rises again for intervals long relative to T_{ime}. Hence, though the accumulator model considered here predicts an increase in the CV at long durations, this rise is slow when T_{ime} is large. For larger values of T_{ime}, the CV curve is correspondingly flatter; with a T_{ime} of 300 s, the CV is nearly constant over the range of durations typically tested (Figure ). It must also be emphasized that the above equations only apply within an individual subject when T_{ime} can be assumed to be a constant, independent of the durations being tested. Pooling data across different subjects, as is common, would lead to averaging across different values of T_{ime}, and hence a flattening of the observed CV curve.
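The U-shape and the location of its minimum can be checked numerically (Python sketch of the approximation CV(t) = (σ_v/t)(1 + t/T_ime)^2; the σ_v and T_ime values are illustrative):

```python
def cv(t, sigma_v, t_ime):
    """Approximate coefficient of variation of reproduced intervals:
    subjective-time noise sigma_v mapped back through the slope of
    ST(t) = t / (1 + t / T_ime)."""
    return (sigma_v / t) * (1.0 + t / t_ime) ** 2

t_ime, sigma_v = 300.0, 1.0
grid = [x / 10 for x in range(1, 30001)]       # 0.1 s .. 3000 s
best = min(grid, key=lambda t: cv(t, sigma_v, t_ime))
print(best)  # the minimum of the U-shaped curve lies at t = T_ime
```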

Time perception is also studied using temporal bisection experiments (Allan and Gibbon, ), in which subjects learn a short reference duration t_{s} and a long reference duration t_{l}, and then judge whether intermediate durations are closer to the short or the long reference. The bisection point is the duration classified as "short" and "long" equally often.

The bisection point as calculated by TIMERR theory is derived below. The calculation involves transforming both the short and long intervals into subjective time representations and expressing the bisection point in subjective time (subjective bisection point) as the mean of these two subjective representations. The bisection point expressed in real time is then calculated as the inverse of the subjective bisection point.
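The three steps of this calculation can be sketched as follows (Python; assumes ST(t) = t/(1 + t/T_ime) as in the Main Text, and reference durations of 1 and 4 s chosen for illustration):

```python
def subjective_time(t, t_ime):
    """ST(t) = t / (1 + t / T_ime)."""
    return t / (1.0 + t / t_ime)

def bisection_point(t_s, t_l, t_ime):
    """Mean of the two subjective representations, mapped back to real
    time by inverting ST: t = s * T_ime / (T_ime - s)."""
    s = 0.5 * (subjective_time(t_s, t_ime) + subjective_time(t_l, t_ime))
    return s * t_ime / (t_ime - s)

t_s, t_l = 1.0, 4.0
print(bisection_point(t_s, t_l, 1e9))    # ~2.5: arithmetic mean (T_ime -> infinity)
print(bisection_point(t_s, t_l, 1e-3))   # ~1.6: harmonic mean (T_ime -> 0)
print(bisection_point(t_s, t_l, 2.0))    # ~2.0: the geometric mean for these references
```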

The resulting bisection point lies between the harmonic and arithmetic means of the reference durations, approaching these limits as T_{ime} varies between zero and infinity, respectively.

Hence, TIMERR theory predicts that when comparing bisection points across individuals, individuals with larger values of T_{ime} will show bisection points closer to the arithmetic mean whereas individuals with smaller values of T_{ime} will show lower bisection points, closer to the geometric mean. If T_{ime} were smaller still, the bisection point would be lower than the geometric mean, approaching the harmonic mean. This is in accordance with the experimental evidence mentioned above showing bisection points between the harmonic and arithmetic means (Allan and Gibbon,

All the predictions mentioned below result from Equations (3) and (6).

The discounting function will be hyperbolic in form (Frederick et al.,

The discounting steepness could be labile within and across individuals (Loewenstein and Prelec,

Temporal discounting could be steeper when average delays to expected rewards are lower (Frederick et al., ) [as average delays to reward influence the adaptive setting of T_{ime}].

“Magnitude Effect”: as reward magnitudes increase in a net positive environment, the discounting function becomes less steep (Frederick et al.,

“Sign Effect”: rewards are discounted more steeply than punishments of equal magnitudes in net positive environments (Frederick et al.,

The “Sign Effect” will be larger for smaller magnitudes (Loewenstein and Prelec,

“Magnitude Effect” for losses: as the magnitudes of losses increase, the discounting becomes steeper. This is in the reverse direction as the effect for gains (Hardisty et al.,

Punishments are treated differently depending upon their magnitudes. Higher magnitude punishments are preferred at a delay, while lower magnitude punishments are preferred immediately (Loewenstein and Prelec,

“Delay-Speedup” asymmetry: delaying a reward that has already been obtained is more punishing than speeding up the delivery of the same reward from that delay is rewarding. This is because a received reward will be included in the current estimate of past reward rate (R_{est}) and hence, will be included in the opportunity cost (Frederick et al.,

Time perception and temporal discounting are correlated (Wittmann and Paulus,

Timing errors increase with the duration of intervals (Gibbon et al.,

Timing errors increase in such a way that the coefficient of variation follows a U-shaped curve (Gibbon et al.,

Impulsivity (as characterized by abnormally steep temporal discounting) leads to abnormally large timing errors (Wittmann et al.,

Impulsivity leads to underproduction of time intervals, with the magnitude of underproduction increasing with the duration of the interval (Wittmann and Paulus,

The bisection point in temporal bisection experiments will be between the harmonic and arithmetic means of the reference durations (Allan and Gibbon,

The bisection point need not be constant within and across individuals (Baumann and Odum,

The bisection point will be lower for individuals with steeper discounting (Baumann and Odum,

The choice behavior for impulsive individuals will be more inconsistent than for normal individuals (Evenden,

Post-reward delays will not be directly included in the intertemporal decisions of animals during typical laboratory tasks (Stephens and Anderson,

Our theory provides a simple algorithm for decision-making in time. The algorithm of TIMERR theory, in its computational simplicity, could explain results on intertemporal choice observed across the animal kingdom (Stephens and Krebs,

In environments with time-dependent changes of reinforcement statistics, animals should have an appropriately sized past integration interval depending on the environment so as to appropriately estimate opportunity costs [e.g., integrating reward-history from the onset of winter would be highly maladaptive in order to evaluate the opportunity cost associated with a delay of an hour in the summer; also see Effects of Plasticity in the Past Integration Interval (T_{ime}) in Appendix]. In keeping with the expectation that animals can adapt past integration intervals to their environment, it has been shown that humans can adaptively assign different weights to previous decision outcomes based on the environment (Behrens et al., ). Adaptive changes in T_{ime} would correspondingly affect the steepness of discounting. This novel prediction has two major implications for behavior: (1) the discounting steepness of an individual need not be a constant, as has sometimes been implied in prior literature (Frederick et al., ); and (2) changes in T_{ime} would lead to corresponding changes in subjective representations of time. Hence, we predict that perceived durations may be linked to experienced reward environments, i.e., “time flies when you're having fun.”

It is important to point out that the TIMERR algorithm for decision-making only depends on the calculation of the expected reward rate, as shown in Figure ( ), and hence does not require explicit, separate neural representations of subjective value or of T_{ime}.

Reward magnitudes and delays have been shown to be represented by neuromodulatory and cortical systems (Platt and Glimcher, ). The estimate of average reward rate (R_{est}) has been proposed to be embodied by tonic dopamine levels over long time-scales (Niv et al., ). In TIMERR, the past integration interval (T_{ime}) over which average reward rates are calculated directly determines the steepness of temporal discounting.

While there have been previous models that connect time perception to temporal decision making (Staddon and Cerutti,

All simulations were run using MATLAB R2010a.

Figure

Figure

An accumulator model described by the following equation was used for simulations of a time reproduction task:

dx = (dST/dt) dt + σ_{v} dW_{t},

where W_{t} is a standard Wiener process, so that the accumulator integrates subjective time with additive Gaussian noise of standard deviation σ_{v}.

The above equation was integrated using the Euler-Maruyama method. In this method, the increment over a discrete time step Δt is computed as Δx = (dST/dt) Δt + σ_{v} √Δt η, where η is a standard normal random variable.

Every trial in the time reproduction task consisted of two phases: a time measurement phase and a time production phase. During the time measurement phase, the accumulator integrates subjective time until the expiration of the sample duration (Figure ). During the time production phase, the accumulator integrates anew until it reaches the value stored at the end of the measurement phase, and the elapsed real time is taken as the produced interval. This procedure was repeated over trials and over values of T_{ime} to calculate the median production interval as shown in Figures
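A compact sketch of these two phases (in Python rather than the MATLAB stated above; the step size dt, the noise level, and the 10× timeout guard are illustrative choices, and the subjective-time slope assumes ST(t) = t/(1 + t/T_ime)):

```python
import random

def d_st(t, t_ime):
    """Slope of subjective time ST(t) = t / (1 + t / T_ime)."""
    return 1.0 / (1.0 + t / t_ime) ** 2

def reproduce(sample, t_ime, sigma_v, dt=0.05, rng=random):
    """One trial: measure `sample` seconds in noisy subjective time,
    then produce a real-time interval by re-accumulating to that value."""
    # Measurement phase: Euler-Maruyama accumulation for `sample` seconds.
    x, t = 0.0, 0.0
    while t < sample:
        x += d_st(t, t_ime) * dt + sigma_v * rng.gauss(0, 1) * dt ** 0.5
        t += dt
    # Production phase: accumulate anew until the stored value is reached
    # (with a timeout guard in case noise keeps the accumulator low).
    y, tp = 0.0, 0.0
    while y < x and tp < 10 * sample:
        y += d_st(tp, t_ime) * dt + sigma_v * rng.gauss(0, 1) * dt ** 0.5
        tp += dt
    return tp

random.seed(0)
print(reproduce(60.0, t_ime=300.0, sigma_v=0.0))  # noiseless round trip: ~60 s
```

With σ_v > 0, repeating this over many trials and taking the median produced interval yields the underproduction described above, which grows with the sample duration and with decreasing T_ime.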

Vijay M. K. Namboodiri, Stefan Mihalas, and Marshall G. Hussain Shuler conceived of the study. Vijay M. K. Namboodiri and Stefan Mihalas developed TIMERR theory and its extensions. Vijay M. K. Namboodiri ran the simulations comparing the performance of Equation (1) with other models shown in Figures

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We thank Dr. Peter Holland, Dr. Veit Stuphorn, Dr. James Knierim, Dr. Emma Roach, Dr. Camila Zold, Josh Levy, Gerald Sun, Grant Gillary, Jeffrey Mayse, Naween Anand, Arnab Kundu, and Kyle Severson for discussions and comments on the manuscript. This work was funded by NIMH (R01 MH084911 and R01 MH093665) to Marshall G. Hussain Shuler.


While the advantage of this model is that it is computationally less expensive, its disadvantages are that (1) subjective values in memory are not generalizable, i.e., the subjective value in memory for an option will fundamentally depend on the reward environment in which it was presented; and (2) explicit representations of reward delays, which could be useful for anticipatory behaviors, are lost.

It is important to note that this equation can still be expressed in terms of subjective time as defined in the Main Text, viz. ST(t) = t/(1 + t/T_{ime}).

Generally speaking, building such risk models is difficult, especially since they are environment-specific. However, there could be statistical patterns in environments for which animals have acquired corresponding representations over evolution. Specifically, decay of rewards arising from factors like natural decay (rotting, for instance) or due to competition from other foragers could have statistical patterns. During the course of travel to a food source, competition poses the strongest cause for decay since natural decay typically happens over a longer time-scale, viz. days to months. In such an environment with competition from other foragers, a forager could estimate how much a reward will decay in the time it takes it to travel to the food source.

Suppose the forager sees a reward of magnitude

We assume that the rate of decay of a reward in competition is proportional to a power of its magnitude, implying that larger rewards are more sought-after in competition and hence, would decay at a faster rate. We denote the survival time of a typical reward by t_{sur} and consider that after time t_{sur}, the reward is entirely consumed. If, as stated above, one assumes that t_{sur} is inversely related to a power α of the magnitude of a reward at any time,

Hence, the rate of change of a reward with initial magnitude r(0) = r_{0} follows accordingly.

A forager could estimate the parameters

If the non-linearities and state-dependence of magnitude perception can be expressed by a function

While waiting for a delayed reward, an animal forgoes alternatives that would on average yield R_{est} during the wait, but there could be a different reward rate that it might, nevertheless, expect to gain. If we denote this additional expected reward rate as a fraction f of R_{est}, then we can state that the net expected loss of reward rate during the wait is (1 − f) R_{est}. This factor can also be added to expressions of subjective value calculated above in Equations (3), (A2), and (A4). Specifically, Equation (3) becomes

SV(r, t) = (r − (1 − f) R_{est} t)/(1 + t/T_{ime}).
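A sketch of this modified subjective value (Python; f denotes the fraction of R_est still expected during the wait, as in this section; parameter values are illustrative):

```python
def sv_modified(r, t, r_est, t_ime, f):
    """Subjective value when a fraction f of R_est is still expected to
    accrue during the wait, so the net loss rate is (1 - f) * R_est."""
    return (r - (1.0 - f) * r_est * t) / (1.0 + t / t_ime)

# f = 0 recovers Equation (3); f = 1 removes the opportunity cost,
# leaving a purely hyperbolic discounting of the magnitude.
print(sv_modified(10.0, 5.0, 0.5, 100.0, 0.0))  # (10 - 2.5) / 1.05
print(sv_modified(10.0, 5.0, 0.5, 100.0, 1.0))  # 10 / 1.05
```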

In TIMERR, the window over which reward rate is maximized is the past integration interval (T_{ime}) plus the time to a delayed reward. However, non-linearities in the relationship between reward rates and fitness levels [as discussed in Effects of Plasticity in the Past Integration Interval (T_{ime}) in Appendix] could lead to state-dependent consumption requirements. For example, in a state of extreme hunger, it might be appropriate for the decision rule to apply a very short time scale of discounting so as to avoid dangerously long delays to food. However, integrating past reward rates over such extremely short timescales could compromise the reliability of the estimated reward rate. Hence, as a more general version of TIMERR theory, the window over which reward rate is maximized could incorporate a scaled down value of the interval over which past reward rate is estimated, with the scaling factor governed by consumption requirements. If such a scaling factor is represented by

In an environment with positive R_{est}, the following predictions can be made:

“Magnitude Effect” for gains: as noted in the Main Text, as the reward magnitude r increases, the subtractive opportunity-cost term (R_{est}/r) t in the discounting function [Equation (4)] shrinks, and hence the discounting of gains becomes less steep.

“Magnitude Effect” for losses/punishments: if the environment is net positive (R_{est} > 0), the “magnitude” effect for punishments is in the opposite direction of the “magnitude” effect for gains.

“Sign Effect”: gains are discounted more steeply than punishments of equal magnitudes. A further prediction is that this effect will be larger for smaller reward magnitudes. This prediction has been demonstrated experimentally (Loewenstein and Prelec,

Differential treatment of losses/punishments: as the “magnitude” of the punishment decreases below R_{est} T_{ime} (|r| < R_{est} T_{ime}), the discounting function becomes a monotonically increasing function of delay. This means that the punishment would be preferred immediately when the magnitude of punishment is below this value. Above this value, a delayed punishment would be preferred to an immediate punishment. This prediction has experimental support (Frederick et al.,

A reward of magnitude r pursued after a delay longer than r/R_{est} will lead to a negative subjective value, since the numerator of Equation (3) becomes negative. Hence, given an option between pursuing or forgoing this reward, the animal would only pursue (forgo) the reward at shorter (longer) delays.

When understanding the reversal of the “Magnitude Effect” for losses, it is important to keep in mind that as |r| increases, the change in discounting steepness produced by a further change in magnitude becomes progressively smaller.

Hence, as the magnitude of a loss increases, the size of the “Magnitude Effect” becomes lower and harder to detect (Figure

In an environment with negative R_{est} (i.e., a net punishing environment), all the predictions listed above would reverse trends. Specifically,

“Magnitude Effect” for gains: as the magnitude of a gain increases, the discounting function becomes steeper.

“Magnitude Effect” for losses: as the magnitude of a punishment increases, the discounting function becomes less steep.

“Sign Effect”: punishments are discounted more steeply than gains of equal magnitudes.

Differential treatment of gains: as the magnitude of the gain decreases below |R_{est}| T_{ime} (r < |R_{est}| T_{ime}), it would be preferred at a delay. Beyond this magnitude, the gain would be preferred immediately.

A punishment of magnitude |r| delivered after a delay longer than |r|/|R_{est}| will lead to a positive subjective value, since the numerator of Equation (3) becomes positive for negative R_{est}.

TIMERR theory, however, allows for the possibility that in a variant of standard laboratory tasks in which a post-reward delay immediately precedes another reward included in the choice, animals would not ignore post-reward delays. Prior experiments evince this possibility (Stephens and Anderson,

The most important implication of the TIMERR theory is that the steepness of discounting of future rewards will depend directly on the past integration interval, i.e., the longer you integrate over the past, the more tolerant you will be to delays, and vice-versa. In the above sections, the past integration interval (T_{ime}) was treated as a constant. However, the purpose of the past integration interval is to reliably estimate the baseline reward rate expected through the delay until a future reward. Further, since T_{ime} determines the temporal discounting steepness, it will also affect the rate at which animals obtain rewards in a given environment. Hence, depending on the reinforcement statistics of the environment, it would be appropriate for animals to adaptively integrate reward history over different temporal windows so as to maximize rates of reward.

In this section, we qualitatively address the problem of optimizing T_{ime}. We consider that an optimal T_{ime} would satisfy four criteria: (1) obtain rewards at magnitudes and intervals that maximize the fitness of an animal, which is accomplished partially through (2) reliable estimation of past reward rates leading to (3) appropriate estimations of opportunity cost for typical delays faced by the animal with (4) minimal computational/memory costs.

Before considering the general optimization problem for T_{ime}, it is useful to consider an illustrative example. This example ignores the last three criteria listed above and only considers the impact of T_{ime} on the fitness of an animal. Consider a hypothetical animal that typically obtains rewards at a rate of 1 unit per hour. Suppose such an animal is presented with a choice between (a) 2 units of reward available after an hour, and (b) 20 units of reward available after 15 h. The subjective values of options “a” and “b” are calculated below for four different values of T_{ime}, as per Equation (3).

T_{ime} | SV(option “a”) | SV(option “b”) | Preference
∞ h | 1 | 5 | Option “b”
10 h | 0.91 | 2 | Option “b”
2.5 h | 0.71 | 0.71 | Both equal
1 h | 0.50 | 0.31 | Option “a”
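The entries above can be reproduced from Equation (3) (a Python sketch; R_est = 1 unit per hour as in the example):

```python
def subjective_value(r, t, r_est, t_ime):
    """Equation (3): SV = (r - R_est * t) / (1 + t / T_ime)."""
    return (r - r_est * t) / (1.0 + t / t_ime)

r_est = 1.0                      # baseline: 1 unit per hour
a, b = (2.0, 1.0), (20.0, 15.0)  # (units, hours) for options "a" and "b"
for t_ime in (float("inf"), 10.0, 2.5, 1.0):
    sv_a = subjective_value(a[0], a[1], r_est, t_ime)
    sv_b = subjective_value(b[0], b[1], r_est, t_ime)
    print(t_ime, sv_a, sv_b)
```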

As is apparent, larger T_{ime} biases the choice toward option “b,” the larger but more delayed reward, while smaller T_{ime} biases the choice toward option “a.”

Reward rate having chosen option “a” = 2 units/1 h = 2 units per hour.

Reward rate having chosen option “b” = 20 units/15 h ≈ 1.33 units per hour.

However, if we presume that this animal evolved so as to require a minimum reward of 2 units within every 10 h in order to function in good health, choosing option “b” would be dangerous, as the animal would go 15 h without any reward; hence, in this scenario, T_{ime} should be much lower than 10 h. In summary, so as to meet consumption requirements, it is inappropriate to integrate past reward rate history over very long times even if the animal has infinite computational/memory resources. Keeping in mind the above example and the four criteria listed for an optimal T_{ime}, we enumerate the following disadvantages for setting inappropriately large or inappropriately small T_{ime}.

Integrating over inappropriately large T_{ime} has at least four disadvantages to the animal: (1) a very long T_{ime} is inappropriate given consumption requirements of an animal, as illustrated above; (2) the computational/memory costs involved in this integration are high; (3) integrating over large time scales in a dynamically changing environment could make the estimate of past reward rate inappropriate for the delay to reward (e.g., integrating over the winter and spring seasons as an estimate of baseline reward rate expected over a delay of an hour in the summer might prove very costly for foragers); (4) the longer the T_{ime}, the harder it is to update R_{est} in a dynamic environment.

Integrating over inappropriately small T_{ime}, on the other hand, presents the following disadvantages: (1) the estimate of baseline reward rate would be unreliable, since integration must be carried out over a long enough time-scale so as to appreciate the stationary variability in an environment; (2) the estimate of baseline reward rate might be highly inappropriate for the future delay (e.g., integrating over the past 1 min might be very inappropriate when the delay to a future reward is a day); (3) the animal would more greatly deviate from global optimality [as is clear from Equation (3)].

In light of the above discussion, we argue that the following relationships should hold for T_{ime}. In each of these relationships, all factors other than the one considered are assumed constant.

R1. Time-dependent changes in environmental reinforcement statistics: if an environment is unstable, i.e., the reinforcement statistics of the environment are time-dependent, we predict that T_{ime} would be lower than the timescale of the dynamics of changes in environmental statistics.

R2. Variability of estimated reward rate: if an environment is stable and has very low variability in the estimated reward rate it provides to an animal, integrating over a long $T_{ime}$ would not provide a more accurate estimate of past reward rate than integrating over a short $T_{ime}$. Hence, in order to be better at adapting to potential changes in the environment and to minimize computational/memory costs, we predict that in a stable environment, $T_{ime}$ will reduce (increase) as the variability in the estimated reward rate reduces (increases).

R3. Mean of estimated reward rate: in a stable environment with higher average reward rates, the benefit of integrating over a long $T_{ime}$ will be smaller when weighed against the computational/memory cost involved. As an extreme example, when the reward rate is infinite, the benefit of integrating over long windows is infinitesimal. This is because the benefit of integrating over a longer $T_{ime}$ can be thought of as the net gain in average reward rate over that achieved when decisions are made with the lowest possible $T_{ime}$. If the increase in average reward rate is solely due to an increase in the mean (at constant standard deviation) of reward magnitudes, the proportional benefit of integrating over a large $T_{ime}$ reduces. If the increase in average reward rate is solely due to an increase in the frequency of rewards, the integration can be carried out over a shorter time while maintaining estimation accuracy. Hence, we predict that, in general, as average reward rates increase (decrease), $T_{ime}$ will decrease (increase).
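The accuracy argument underlying R2 and R3 can be illustrated with a minimal simulation (our own sketch, not part of the original text; the function name and parameter values are illustrative). For unit rewards arriving as a Poisson process, the relative error of a windowed rate estimate falls roughly as $1/\sqrt{\text{rate} \times T_{ime}}$, so a longer integration window improves the estimate, and a higher reward frequency lets a shorter window achieve the same accuracy:

```python
import random

def rate_estimate_error(rate, t_ime, n_trials=2000, seed=0):
    """Relative RMS error of a windowed reward-rate estimate.

    Unit-magnitude rewards arrive as a Poisson process with the given
    true rate; the estimate is (rewards within the window) / t_ime.
    """
    rng = random.Random(seed)
    sq_errs = []
    for _ in range(n_trials):
        # Count rewards in the window by drawing exponential
        # inter-reward intervals until the window is exceeded.
        t, count = 0.0, 0
        while True:
            t += rng.expovariate(rate)
            if t > t_ime:
                break
            count += 1
        est = count / t_ime
        sq_errs.append((est - rate) ** 2)
    return (sum(sq_errs) / n_trials) ** 0.5 / rate

# Lengthening the window improves the estimate (R2) ...
assert rate_estimate_error(1.0, 40.0) < rate_estimate_error(1.0, 10.0)
# ... and so does a higher reward frequency at a fixed window (R3):
assert rate_estimate_error(4.0, 10.0) < rate_estimate_error(1.0, 10.0)
```

With these parameters the simulated relative errors track the analytic $1/\sqrt{\text{rate} \times T_{ime}}$ prediction, which is why quadrupling either factor roughly halves the error.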

R4. Average delays to rewards: as the average delay between the moment of decision and the receipt of rewards increases (decreases), $T_{ime}$ should increase (decrease) correspondingly. This is because reward history calculated over a low $T_{ime}$ might be inappropriate as an estimate of baseline reward rate for the delays until future reward.

In human experiments, it is common to give abstract questionnaires to study preference (e.g., "which do you prefer: $100 now or $150 a month from now?"). In such tasks, setting $T_{ime}$ to be of the order of seconds or minutes would be very inappropriate for calculating a baseline expected reward rate over the month-long delay to reward; instead, $T_{ime}$ might increase so as to match the abstract delays, allowing humans to discount less steeply as these delays increase. Similarly, when the choice involves delays of the order of seconds, integrating over hours might not be appropriate, and therefore the discounting steepness would be predicted to be higher in such experiments. Thus, the observation in prior experiments that discounting becomes less steep as the delays probed become longer (Loewenstein and Prelec, 1992) can be explained by a corresponding adaptation of $T_{ime}$.

It must be noted that even though the calculation of $a_{est}$ is performed over a time-scale of $T_{ime}$, the particular form of the memory for past reward events remains unspecified. The simplest memory function is one in which rewards received within a past duration of $T_{ime}$ are recollected perfectly, while any reward received beyond this duration is completely forgotten. A more realistic memory function would instead recollect a past reward with a probability that depends on how long ago it was received, with this dependence being a continuous and monotonically decreasing function of elapsed time. For such a function, we define $T_{ime}$ as twice the average recollected duration under the probability distribution of recollection. The factor of two ensures that, in the simplest memory model above, the longest duration at which rewards are recollected (twice the average duration) equals $T_{ime}$.
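The factor-of-two definition can be checked numerically. The sketch below (our own illustration; function and variable names are not from the text) computes twice the mean recollected duration for the two memory functions just described: an all-or-none model with perfect recall up to a duration $T$, and an exponential model with time constant $T/2$. Under the definition above, both yield $T_{ime} \approx T$:

```python
import math

def t_ime_from_memory(p, horizon, dt=1e-3):
    """T_ime defined as twice the mean recollected duration, where p(s)
    is the probability of recollecting a reward received s time-units ago."""
    num = den = 0.0
    s = 0.0
    while s < horizon:
        num += s * p(s) * dt  # numerator of the mean recollected duration
        den += p(s) * dt      # normalization of the recollection distribution
        s += dt
    return 2.0 * num / den

T = 10.0
boxcar = lambda s: 1.0 if s <= T else 0.0  # perfect recall up to T, none beyond
expo = lambda s: math.exp(-2.0 * s / T)    # exponential memory, time constant T/2

# Both memory functions give T_ime equal to T under this definition.
assert abs(t_ime_from_memory(boxcar, 5 * T) - T) < 0.01
assert abs(t_ime_from_memory(expo, 20 * T) - T) < 0.01
```

For the all-or-none model the mean recollected duration is $T/2$, so twice the mean recovers the longest recollected duration $T$; for the exponential model the mean equals the time constant $T/2$, giving the same $T_{ime}$.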

If we define local updating as updating $a_{est}$ based solely on the memory of the last reward (both its magnitude and the time elapsed since its receipt), this constraint, when placed on such a general memory function, necessitates that the memory be exponential in time (with time constant $T_{ime}/2$, by the definition above). In this case, $a_{est}$ is updated as

$$a_{est} \rightarrow a_{est}\, e^{-2t_{lastreward}/T_{ime}} + \frac{2\,r_{last}}{T_{ime}}$$

where $r_{last}$ is the magnitude of the last reward and $t_{lastreward}$ is the time elapsed since its receipt.
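To make the local-updating constraint concrete, the following numerical check (our own illustration; symbol names follow the text, while the numeric values are arbitrary) assumes an exponential memory kernel with time constant $T_{ime}/2$ and verifies that a one-step update from the last reward alone exactly reproduces a full kernel-weighted recomputation over the entire reward history:

```python
import math
import random

T_IME = 20.0  # past integration interval (arbitrary units; illustrative value)

def full_estimate(history, now):
    """Recompute a_est from the whole reward history using an exponential
    kernel with time constant T_IME/2 (so twice the mean recollected
    duration equals T_IME)."""
    return sum(r * (2.0 / T_IME) * math.exp(-2.0 * (now - t) / T_IME)
               for t, r in history)

def local_update(a_est, r_last, t_lastreward):
    """Update a_est from the last reward alone: decay the old estimate
    over the elapsed time, then add the new reward's contribution."""
    return a_est * math.exp(-2.0 * t_lastreward / T_IME) + 2.0 * r_last / T_IME

rng = random.Random(1)
a_est, history, now = 0.0, [], 0.0
for _ in range(200):
    gap = rng.expovariate(0.2)  # time elapsed since the previous reward
    r = rng.uniform(0.5, 2.0)   # magnitude of the new reward
    now += gap
    a_est = local_update(a_est, r, gap)
    history.append((now, r))

# The local rule matches the full recomputation over the entire history.
assert abs(a_est - full_estimate(history, now)) < 1e-9
```

The equivalence holds because multiplying the running estimate by the decay factor over each inter-reward interval compounds into exactly the exponential kernel weight each past reward would receive in the full sum, which is what makes memory of only the last reward sufficient.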