The Cerebellum: A Neural System for the Study of Reinforcement Learning

In its strictest application, the term “reinforcement learning” refers to a computational approach to learning in which an agent (often a machine) interacts with a mutable environment to maximize reward through trial and error. The approach borrows essentials from several fields, most notably Computer Science, Behavioral Neuroscience, and Psychology. At the most basic level, a neural system capable of mediating reinforcement learning must be able to acquire sensory information about the external environment and internal milieu (either directly or through connectivities with other brain regions), must be able to select a behavior to be executed, and must be capable of providing evaluative feedback about the success of that behavior. Psychology informs us that reinforcers, both positive and negative, are stimuli or consequences that increase the probability that the immediately antecedent behavior will be repeated, and that reinforcer strength or viability is modulated by the organism's past experience with the reinforcer, its affect, and even the state of its muscles (e.g., eyes open or closed). Any neural system that supports reinforcement learning must therefore be sensitive to these same considerations. Once learning is established, such a neural system must finally be able to maintain continued response expression and prevent response drift. In this report, we examine both historical and recent evidence that the cerebellum satisfies all of these requirements. While we report evidence from a variety of learning paradigms, the majority of our discussion will focus on classical conditioning of the rabbit eye blink response as an ideal model system for the study of reinforcement and reinforcement learning.

Introduction
The cerebellum has often been likened to a neuronal machine or computer with its precise geometrical array of intrinsic cell types that allow for the integration and organization of movement-related information through both of its afferent systems. Eccles et al. (1967) posited that this system's organization makes it uniquely appropriate for the storage of information. In this paper, we will discuss the evidence that the cerebellum is integral to reinforcement learning. We will rely primarily on evidence from classical conditioning experiments, but we will also introduce an emerging clinical literature that begins to link cerebellar abnormalities with cognitive deficits, including changes in reactivity to reinforcement.

Cerebellar Anatomy and Connectivity
One prerequisite for a neural system to be considered a mediator of reinforcement is that it must be able to access information about the sensory environment as well as the internal state of the organism. This information must be directly acquired or available through connectivity with other brain regions. Traditionally, the anatomy and connectivity of the cerebellum have been discussed largely in terms of motor control and coordination. As such, it has been noted that the cerebellum coordinates motor behavior by receiving sensory input directly from the spinal cord as well as from frontal, parietal, temporal, and occipital cortices and in turn relaying sensorimotor information to the red nucleus and spinal cord (Jiang et al., 2002), as well as sending reciprocal projections back to motor cortex via the ventrolateral thalamus (Allen and Tsukahara, 1974; Percheron et al., 1996; Kelly and Strick, 2003; Strick et al., 2009). More recently, it has been noted that other cerebrocerebellar loops exist with largely non-motor regions of the cerebral cortex. For example, using herpes simplex virus type 1 to retrogradely label prefrontal cortex of monkey, Middleton and Strick (1997) found that injections in areas 46 and 9 retrogradely labeled mediodorsal thalamic nuclei and subsequently the cerebellar dentate nuclei. Additional mapping studies have indicated that the cerebellar dentate nuclei possess distinct output channels wherein dorsal portions project to premotor and motor cortices while ventral portions target non-motor regions of prefrontal, frontal, and posterior parietal cortices in both primate and rat (Leiner et al., 1986, 1987; Strick, 2000, 2001; Kelly and Strick, 2003). An additional loop also exists with striatum (Ichinohe et al., 2000), a region known to participate in reinforcement learning (Kelley et al., 1997; Dayan and Balleine, 2002; O'Doherty et al., 2004). Together, these anatomical studies indicate that the cerebellum not only receives sensory information about stimuli impinging on the organism but also information about the organism's affect. It should be noted, however, that subjective feeling need play no role in reinforcement, as classical conditioning is possible in decerebrate animals (Mauk and Thompson, 1987).
Within the cerebellum proper, the principal feature and cell type is the Purkinje cell, whose nearly two-dimensional dendritic tree is arrayed perpendicular to the long axis of the lobule. The Purkinje cell receives excitatory input from two sources: the parallel fibers and the climbing fibers. The first of these, the parallel fibers, are part of a bisynaptic pathway originating in the pons. Pontine axons are known as mossy fibers, and they ascend to the middle layers of the cerebellum where they synapse with granule cells (axon collaterals also go to the deep nuclei). Granule cell axons then ascend to the molecular layer of the cerebellar cortex where they bifurcate, forming parallel fibers. The parallel fibers then course perpendicular to the plane of the Purkinje cell dendritic tree. A given Purkinje cell receives input from numerous parallel fibers, but a given parallel fiber makes at most a couple of synapses with a given Purkinje cell dendrite and is thus considered a weak input to the cell. The other source of excitatory input, and the one that we will concern ourselves with most in this discussion, is from the climbing fibers, which originate in the inferior olive of the brainstem. These axons receive their name because they literally wrap themselves around the proximal dendrites and soma of the Purkinje cell (axon collaterals also go to the deep nuclei), making hundreds of synaptic connections. Although a given Purkinje cell receives input from only one climbing fiber, the extensive connectivity of that fiber makes it a very potent exciter of the cell. It is through this strong excitation of Purkinje cells in specific lobules that particular skeletal motor movements are selected. All other connections with the Purkinje cell are inhibitory, including those of stellate, basket, and Golgi cells. The output of the Purkinje cell is also inhibitory. The targets of this inhibitory output are the deep nuclei: the dentate, interpositus, and fastigial nuclei. The most lateral regions of the cerebellar cortex project to the dentate, the intermediate regions project to the interpositus, and the most midline regions project to the fastigial nucleus.

Contribution of Cerebellum to Reinforcement Learning
One of the first studies to explore the contribution of the cerebellum to learning was published by Brogden and Gantt (1942). This classical conditioning experiment examined which parts of the reflex arc could be eliminated and still permit conditioning to occur. In this particular experiment, dogs were implanted with electrodes in the cerebellar cortex that were connected to a coil buried beneath the scalp. Current flow was induced by passing an external field coil energized by a thyratron generator over the buried coil. This electrical stimulation produced a number of different movements depending on its exact location. The movements included eye blinks, limb flexions, and head/neck movements. The investigators then paired a bell or light conditioned stimulus (CS; 2 s duration) with faradic stimulation of the cerebellum (one-tenth of a second in duration) as the unconditioned stimulus (UCS), delivered immediately upon termination of the CS. This procedure was continued until the dogs met a criterion of 100% anticipatory responses during a training session. Briefly, their results showed that electrical stimulation of the cerebellum can produce a variety of behavioral responses and that responses generated in this manner are adequate, when paired with conditioned stimuli, for classical conditioning to occur. Furthermore, the data suggested that some motor responses are more readily conditioned than others, an idea that has also been supported by a variety of experiments across species (Gormezano et al., 1983). Importantly, Loucks (1935), also working in Gantt's laboratory, showed that behavioral responses elicited by electrical stimulation of the motor area of the cerebral cortex in dogs could not be conditioned to a CS, a result later confirmed by Wagner et al. (1967).
The Brogden and Gantt (1942) experiments were largely forgotten until the late 1970s and early 1980s when Thompson et al. (2000) began a systematic reexamination of the cerebellar role in classical conditioning. Their results (plus those of many others) led Thompson to propose a model for classical conditioning in which the cerebellum was both necessary and sufficient for the establishment of classical conditioning of discrete responses with an aversive UCS (Thompson, 1986). Briefly, the model indicates that information about the CS, including both its physical dimensions and the context in which it is generated (e.g., affective state), is relayed from the forebrain to the pontine nuclei and that pontine mossy fibers in turn carry this information to the granule cell layer of the cerebellar cortex. Granule cell axons (parallel fibers) then converge on both neurons of the deep cerebellar nuclei and cerebellar cortex (Purkinje cells). Somatosensory information about the UCS is transmitted from the forebrain to the inferior olive. Olivary climbing fiber axons then ascend to the cerebellum where they also synapse with deep nuclei and cortical Purkinje cells. The selection of a Purkinje cell (or cells) by the inferior olive reflects the response to be executed. The convergence of CS and UCS inputs within the cerebellum denotes not only the motor response to be executed but also marks the place where the memory trace is most likely to be created.

Role of the Inferior Olive
A model of classical conditioning that has been of noted heuristic value was put forward by Rescorla and Wagner (1972). In their model, they proposed that the greatest amount of information to be learned about the relationship between CS and UCS occurs in the first paired training trials. As training progresses, and presumably learning occurs, the amount of information to be learned about the CS-UCS relationship declines. This model has fostered a number of investigations that are pertinent to reinforcement learning. For example, Foy and Thompson (1986) recorded single-unit activity from Purkinje cells during Pavlovian conditioning of the rabbit eye blink response. They found that complex spike responses to the onset of the UCS were evident early in training in 61% of the 118 Purkinje cells studied but that only 27% of these neurons evidenced UCS-evoked complex spikes at the end of training. Given that complex spikes are most likely triggered by olivary climbing fiber input, such a decrease in complex spike activity would be consistent with a decline in olivary activity across training trials. A more explicit test of the hypothesis that associative strength declines across training was conducted by Sears and Steinmetz (1991). In their experiment, they recorded unit activity of the dorsal accessory olivary nucleus during classical conditioning of the rabbit eye blink response. They found that the olivary activity time-locked to the UCS was initially quite robust but declined as CR expression commenced later in training. Evoked activity was reinstated in the olive during UCS-only trials or on paired trials in which the rabbit made an incorrect or maladaptive response. The authors noted that the partial reinforcement schedule associated with olivary responses to missing or maladaptive responses may maintain continued CR expression. These data also led Sears and Steinmetz (1991) to propose that olivary activity registers the sensory aspects of the UCS and unconditioned response (UCR)/CR and through this registration also provides an error detection signal to the cerebellum that allows the organism to adjust maladaptive responses. Further studies have verified the role of climbing fiber synapses onto Purkinje cells in error detection in non-human animals and humans alike (Thompson and Gluck, 1991; Swain and Thompson, 1993; Shinkman et al., 1996; Kettner et al., 1997; Kitazawa et al., 1998; Thompson et al., 2000).
The preceding data suggest that the necessity of maintaining stimulus-evoked responses in the UCS pathway declines as the association between CS and UCS develops. The exception to this pattern is the case of a performance error, for which olivary activity is reinstated. A significant body of behavioral learning data supports this view. An early report by Kamin (1968), for example, found that no new learning occurs when a second CS is presented immediately following a CS that has already been associatively linked to a UCS. Kamin refers to this process as "blocking" and asserts that the animal ignores the novel CS because it fails to provide any new information. If neuronal activity were maintained in the UCS pathway, it might be hypothesized that the animal would attach the same predictive relationship to the second CS as it did to the first. This proposition was explicitly tested by Kim et al. (1998). In this study, they injected picrotoxin, a GABA antagonist, into the inferior olive of rabbits that had recently learned the tone CS-air puff UCS association. They found that picrotoxin prevented the diminution of olivary activity normally observed with continued training. Additional conditioning with the insertion of a second CS (light) yielded UCS-evoked complex spike activity in the cerebellar cortex to both CSs. Examination of CS-only probe trials during training indicated that the animals displayed similar conditioned responses to both CSs, suggesting that the animals had formed similar associations between both CSs and the UCS. In a different study, Swain et al. (1992) employed electrical stimulation of cerebellar white matter (and presumably climbing fibers) as the UCS. Early in training, this stimulation elicited only single responses. With continued paired training, however, multiple responses were often elicited and showed evidence of conditioning. CS-alone training produced extinction and retraining produced extremely rapid reacquisition. These results indicate that continued activity in the UCS pathway promotes response drift.
If the source of reinforcing input to the cerebellum is from the inferior olive, then damage to the olive should yield behavioral effects that are indistinguishable from those seen when the UCS is absent during conditioning trials. Such appears to be the case. Rostromedial lesions of the inferior olive prevent acquisition of the learned response in rabbits trained with conjoint tone CS and air puff UCS presentations in a standard delay classical conditioning procedure (McCormick et al., 1985). Additionally, in rabbits that have learned the task, destruction of the inferior olive gradually abolishes CRs in a manner that is reminiscent of that seen in intact animals undergoing extinction training (i.e., CS-only trials).
Substitution of the peripheral UCS with electrical stimulation of the inferior olive or its climbing fibers provides, perhaps, the strongest assessment of whether the neurons of the olive constitute the substrate for reinforcement during Pavlovian classical conditioning. Electrical brain stimulation must be able to evoke all of the behavioral phenomena normally linked to exteroceptive UCS presentations during training.
Inferior olive stimulation produces a variety of movements, depending on the location of the stimulating electrodes. These movements have been reported to include eye blinks, head nods, neck turns, facial muscle twitches, or extension/retraction of the limbs. Mauk et al. (1986) report that tone CS pairings with inferior olive electrical stimulation as the UCS produce normal rates of learning. Additionally, it was found that the range of effective interstimulus intervals (ISIs) with tone CS and electrical stimulation UCS was identical to that observed in conditioning paradigms with exteroceptive UCSs. Conditioning was optimal when the CS-UCS interval was 150-250 ms. Brief ISIs of 50 ms did not support learning.
A potential criticism that could be leveled against any experiments that employ electrical stimulation of the inferior olive as the UCS is that eye blinks or any other movements triggered in such a fashion could reflect antidromic stimulation of spinal trigeminal neurons and the spread of activation to mossy fiber collaterals that target rabbit facial maps in cerebellar lobule HVI (Moore and Blazis, 1989). To explore this question, Swain et al. (1992) conducted a parametric study in which they stimulated the white matter immediately underlying rabbit lobule HVI. They chose this site because it was distant from brainstem and spinal reflex pathways. They believed that electrical stimulation at this location would drive olivary climbing fibers and other cerebellar afferents. They observed that stimulation of lobule HVI white matter elicited a variety of movements, the most common of which were eye blinks, lip movements, or head nods and turns. When paired with a tone CS, such stimulation produced rates of learning commonly observed with conditioning using exteroceptive UCSs (Gormezano et al., 1983) or olivary stimulation (Mauk et al., 1986). When presented with CS-alone trials, the rabbits extinguished quite quickly and exhibited substantial savings upon reinstatement of paired training trials. The degree of similarity between learning in the white matter stimulation study and studies employing exteroceptive UCSs was remarkable even down to the smallest details. For example, at the end of training, one rabbit, which had displayed ipsilateral lip movement to cerebellar white matter stimulation during conditioning, was presented in its home cage with a whistle at about the same frequency as the tone CS used in the original experiment. The rabbit displayed a conditioned lip movement. When a whistle at a different frequency was presented, the rabbit did not respond. Together, these findings suggest that conditioning was specific to the tone CS rather than the context and that the CR displayed a stimulus generalization gradient to the pitch of the whistle. Additionally, among all animals in the study, there were fewer CRs on the final day of reacquisition training than there were on the final day of initial acquisition training. Even though this difference was not statistically significant, it is consonant with reports that the CS may acquire inhibitory properties during extinction training that become evident upon resumption of training.
The ability of the CS to acquire inhibitory properties in the white matter stimulation studies was also explored in control groups that were presented with either randomly or explicitly unpaired trials of tone CS and white matter stimulation as the UCS. The learning rates of rabbits that received the explicitly unpaired, but not randomly unpaired, presentations of the CS and UCS were profoundly retarded when they were later trained using paired CS-UCS trials. Rescorla (1969) has previously demonstrated that the explicitly unpaired control procedure may in fact not be learning-neutral but instead imposes inhibitory properties on the CS such that future learning using the same CS is slower.
A subsequent study by Swain et al. (1999) observed that exposure to as few as 108 UCS-alone trials was capable of producing a robust UCS pre-exposure effect. Rabbits presented with the UCS before paired CS-UCS training typically required in excess of 600 trials to master the task. Rabbits that were given no UCS pre-exposure trials acquired the task at a normal rate of 100-200 trials. When the rabbits were presented with UCSs of fixed duration, the amplitude of the UCR increased across training while the latency decreased. Mis and Moore (1973) report similar results for UCS pre-exposure and reflex augmentation with peripheral UCSs.
In a further attempt at localization, Shinkman et al. (1996) stimulated parallel fibers at the cerebellar cortical surface as a CS with underlying white matter stimulation as a UCS. The CS intensity was set well below threshold to elicit any behavioral response. Conditioning occurred normally, the CS now eliciting the response previously elicited by the UCS. Interestingly, the response elicited by strong CS stimulation before training was sometimes quite different from the UCS-evoked response that developed to the CS as a CR after paired CS-UCS training, an extraordinary example of plasticity in the cerebellum.

Frontiers in Behavioral Neuroscience, www.frontiersin.org

Clinical Studies
In recent years, a number of case histories and studies have reported concurrent aberrations in cerebellum and other parts of the brain (e.g., prefrontal cortex) in an array of clinical disorders including schizophrenia, autism, and attention deficit hyperactivity disorder, in addition to syndromes generally induced by damage or infarct (Martin and Albers, 1995; Rapaport et al., 2000; Rapoport, 2001). The aberrations include alterations in volume, cell number, and dendritic morphology (Bauman and Kemper, 1985; Courchesne et al., 1994; Schmahmann and Sherman, 1997; Fatemi et al., 2002; Andreasen and Pierson, 2008). In addition to the organic similarities between these disorders, there are also some commonalities in gross behavioral symptoms including changes in affect, attention, impulsivity, motivation, motor activity, social interactivity, and specific forms of learning. Schmahmann (2001) proposes that damage or malformation to the cerebellum in fact rises to the status of a syndrome that he calls "Cerebellar Cognitive Affective Syndrome" in which affected individuals experience "1) disturbances of executive function, which includes deficient planning, set-shifting, abstract reasoning, working memory, and decreased verbal fluency; 2) impaired spatial cognition, including visual-spatial disorganization and impaired visual-spatial memory; 3) personality change characterized by flattening or blunting of affect and disinhibited or inappropriate behavior; and 4) linguistic difficulties, including dysprosodia, agrammatism, and mild anomia" (p. 371). This clinical literature is too large and extended to discuss in its entirety here, but we will cite a few studies that describe a cerebellar role in motivation because of the relation of this construct to reinforcement. In its simplest conceptualization, motivation can be defined as the amount of effort an animal will expend to receive reinforcement (Hodos, 1961; Skjoldager et al., 1993). One way in which behavior analysts measure reinforcement strength and motivation is through the observation of breaking points. The breaking point is the maximum amount of effort an animal will engage in for reinforcement and is a function of both the strength of the reinforcer and the animal's state of deprivation (Hodos and Kalman, 1963). A decrease in the breaking point equates to decreased motivation.
There is a small body of literature that implicates the cerebellum in motivation in both hedonic and anhedonic paradigms. Bauer et al. (2011) report that lesions of the dentate nucleus alter reinforcement strength and decrease breaking points in rats. Animals were trained on an operant conditioning task with a progressive ratio schedule of reinforcement. A pre-surgery breaking point was established with a criterion performance of three consecutive days with consistent responding (within one ratio step difference). Upon reaching criterion performance, animals were subjected to either a sham surgery or bilateral electrolytic lesions of the dentate nuclei (output to non-motor regions of forebrain). After a week of recovery, animals were returned to Skinner boxes and a second breaking point was established. Bauer et al. (2011) found that cerebellar lesions significantly reduced breaking points compared to sham controls, indicating a decreased willingness to perform a physical task for reinforcement. It was concluded that cerebellar damage reduces motivation, resulting in depressed responding for appetitive reward. Similar effects of cerebellar damage have been reported in humans. Thoma et al. (2008) found that focal cerebellar lesions selectively impaired reward-based reversal learning in an associative learning paradigm.
Studies conducted in rodents indicate that decreased motivation following cerebellar disruption extends beyond hedonic paradigms. Reduced behavioral responses following cerebellar lesions have been reported in studies of exploratory motivation, active avoidance, and passive avoidance, in the absence of gross motor deficits (Lalonde et al., 1988a,b; D'Agata et al., 1993; Caston et al., 1998; Lalonde and Strazielle, 2003; Bauer et al., 2011).
Changes in motivation that affect operant learning and exploration may be the result of disruption of the connectivity between the cerebellum and frontal cortex. As described previously, the cerebellum and prefrontal cortex are associated through a network that includes reciprocal loops with motor and non-motor regions of the forebrain. It has been demonstrated in classical conditioning tasks that the convergence of climbing fibers and parallel fibers on the Purkinje cell is responsible for selecting behavior. In this paradigm, the inferior olive is engaged early in the learning process and habituates as task performance improves, increasing activity only in the instance of performance errors. In this way, the inferior olive serves as an evaluative feedback mechanism to promote optimal performance for maximum reinforcement. In the case of motivation, damage to the cerebellum may impact behavior by disrupting the connectivity between the cerebellum and forebrain, preventing information regarding the predictive value of the reinforcer from reaching the inferior olive and resulting in decreased behavioral output.
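
As a computational aside, the error-correcting account that runs through this review (Rescorla and Wagner, 1972; Kamin, 1968) can be illustrated with a minimal Python sketch. It assumes the standard delta-rule form of the Rescorla-Wagner model, in which every CS present on a trial changes its associative strength in proportion to the difference between the reinforcement delivered and the summed strength of all CSs present. The function names, the learning rate of 0.3, the asymptote of 1.0, and the trial counts are illustrative assumptions rather than values taken from the studies cited here, and the sketch is not intended as a model of olivary physiology.

```python
def rescorla_wagner(trials, alpha=0.3, lam=1.0):
    """Simulate associative strengths under a delta-rule Rescorla-Wagner model.

    trials: sequence of (cues, ucs_present) pairs, where cues is a tuple of
            CS names presented on that trial (hypothetical labels).
    alpha:  learning rate (assumed value, shared by all cues).
    lam:    maximum associative strength the UCS supports (assumed value).
    Returns the final strengths and the per-trial prediction errors.
    """
    strengths = {}
    errors = []
    for cues, ucs_present in trials:
        for cue in cues:
            strengths.setdefault(cue, 0.0)
        # Prediction is the summed strength of every CS present on the trial.
        predicted = sum(strengths[cue] for cue in cues)
        # Prediction error: reinforcement received minus reinforcement predicted.
        error = (lam if ucs_present else 0.0) - predicted
        for cue in cues:
            strengths[cue] += alpha * error
        errors.append(error)
    return strengths, errors


def blocking_demo(n_trials=50):
    """Compare a light CS trained after tone pretraining with a control light."""
    tone_only = [(("tone",), True)] * n_trials
    compound = [(("tone", "light"), True)] * n_trials
    blocked, _ = rescorla_wagner(tone_only + compound)
    control, _ = rescorla_wagner(compound)
    return blocked["light"], control["light"]


if __name__ == "__main__":
    _, errors = rescorla_wagner([(("tone",), True)] * 50)
    print("first and last prediction error:", errors[0], errors[-1])
    blocked_light, control_light = blocking_demo()
    print("light strength after tone pretraining:", blocked_light)
    print("light strength without pretraining:", control_light)
```

Under these assumptions, the prediction error is largest on the first paired trial and approaches zero as training continues, and a light added after tone training gains almost no associative strength, which is the pattern Kamin (1968) termed blocking.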