Associative Learning of Stimuli Paired and Unpaired With Reinforcement: Evaluating Evidence From Maggots, Flies, Bees, and Rats

Finding rewards and avoiding punishments are powerful goals of behavior. To maximize reward and minimize punishment, it is beneficial to learn about the stimuli that predict their occurrence, and decades of research have provided insight into the brain processes underlying such associative reinforcement learning. In addition, it is well known in experimental psychology, yet often unacknowledged in neighboring scientific disciplines, that subjects also learn about the stimuli that predict the absence of reinforcement. Here we evaluate evidence for both these learning processes. We focus on two study cases that both provide a baseline level of behavior against which the effects of associative learning can be assessed. Firstly, we report pertinent evidence from Drosophila larvae. A re-analysis of the literature reveals that through paired presentations of an odor A and a sugar reward (A+) the animals learn that the reward can be found where the odor is, and therefore show an above-baseline preference for the odor. In contrast, through unpaired training (A/+) the animals learn that the reward can be found precisely where the odor is not, and accordingly these larvae show a below-baseline preference for it (the same is the case, with inverted signs, for learning through taste punishment). In addition, we present previously unpublished data demonstrating that also during a two-odor, differential conditioning protocol (A+/B) both these learning processes take place in larvae, i.e., learning about both the rewarded stimulus A and the non-rewarded stimulus B (again, this is likewise the case for differential conditioning with taste punishment). Secondly, after briefly discussing published evidence from adult Drosophila, honeybees, and rats, we report an unpublished data set showing that relative to baseline behavior after truly random presentations of a visual stimulus A and punishment, rats exhibit memories of opposite valence upon paired and unpaired training. 
Collectively, the evidence conforms to classical findings in experimental psychology and suggests that across species animals associatively learn both through paired and through unpaired presentations of stimuli with reinforcement – with opposite valence. While the brain mechanisms of unpaired learning for the most part still need to be uncovered, the immediate implication is that using unpaired procedures as a mnemonically neutral control for associative reinforcement learning may be leading analyses astray.

This paper discusses both previously published and hitherto unpublished experiments, all of which follow established methods (Andreatta et al., 2012; Gerber et al., 2013; Michels et al., 2017). In this section, we first report the statistical approaches applied to all data, and then briefly summarize how the previously unpublished experiments were performed. The raw data underlying all figures can be found in the Supplementary Datasheet 1. Key parameters varying across the Drosophila learning experiments are summarized in Table 1 and Table 2.

Statistical Analyses
Two-tailed non-parametric tests were used (statistical assumptions for these tests were met throughout). Values were compared across multiple groups with Kruskal-Wallis tests (KW test). For subsequent pair-wise comparisons, Mann-Whitney U-tests (MW test) were used. To test whether values of a given group differ from chance, i.e. from zero, we used one-sample sign tests (OSS test). When multiple comparisons were performed within one analysis, a Bonferroni-Holm correction was applied to keep the experiment-wide error rate below 5% (Holm, 1979). For KW and MW tests, we used Statistica 11.0 (Statsoft); for OSS tests, R 3.4.0 (R Core Team, 2017). In all learning experiments on Drosophila (Fig. 1-3), the preference values were statistically indistinguishable between the training procedures when animals were tested under baseline conditions (MW tests, P > 0.05 corrected according to Bonferroni-Holm within each experiment); this is indicated by a common letter and a vertical bar above the box plots. Therefore, we pooled these data and compared the pooled data with the preferences after paired and after unpaired training when tested in non-baseline conditions (MW tests). The median of the pooled baseline data is displayed as a stippled line. Experimenters were blind to the testing conditions. We present our data as box plots, which represent the median as the middle line and the 25%/75% and 10%/90% quantiles as box boundaries and whiskers, respectively. Outliers are not displayed. Sample sizes are displayed below each box plot. In order to keep the main text and the figure legends concise, we report the results of statistical tests in Tables S1-S4.
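As an illustration of the Bonferroni-Holm step-down procedure mentioned above, the following is a minimal Python sketch. It is not the code used for the analyses (these were run in Statistica and R); the function name `holm_reject`, the use of "less than or equal" at the threshold, and the example P values are assumptions made for illustration only.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm (1979) step-down procedure (illustrative sketch).

    Returns a list of booleans in the original order of p_values:
    True where the null hypothesis is rejected at family-wise level alpha.
    """
    m = len(p_values)
    # Indices of the P values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest P value (k = rank + 1) is compared
        # against alpha / (m - k + 1) = alpha / (m - rank).
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            # Once one comparison fails, all larger P values fail too.
            break
    return reject


# Hypothetical P values from three pair-wise comparisons.
example = holm_reject([0.01, 0.04, 0.03])
```

In this hypothetical example only the first comparison survives the correction: 0.01 is below 0.05/3, but 0.03 exceeds 0.05/2, so the procedure stops there.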

Experiments with Drosophila Larvae
The methods employed for the previously unpublished experiments (Fig. 4) largely followed established procedures (Gerber et al., 2013; Michels et al., 2017), for the most part also matching those employed in the previous publications from which the data in Figure 1 were taken; for a synopsis of key parameters varying across learning experiments, see Table 1. In brief, we used third-instar feeding-stage larvae from the Canton-Special wild-type strain, aged 5 days after egg laying. Flies were maintained on standard medium, in mass culture at 25 °C, 60-70% relative humidity and a 12/12 hour light/dark cycle. Before each experiment, we removed a spoonful of food medium from a food vial, collected about 20 larvae, briefly rinsed them in distilled water and started the experiment.
Prior to experiments, odor containers were prepared: 10 µl of odor substance was filled into custom-made Teflon containers (5 mm inner diameter with a lid perforated with seven 0.5-mm diameter holes). Before the experiment started, Petri dishes were covered with modified lids perforated in the center by 15 holes of 1 mm diameter to improve aeration.
For training, approximately 20 larvae were placed in the middle of a FRU-containing dish with two odor containers on opposite sides, both filled with AM. After 2.5 min, the larvae were transferred onto an agarose-only dish with two containers filled with OCT, where they also spent 2.5 min. Three such AM+/OCT training cycles were performed, in each case using fresh dishes. Across repetitions of the experiment, training started with a FRU-containing dish (AM+/OCT) in half of the cases and with an agarose-only dish (OCT/AM+) in the other half. For each group of larvae trained AM+/OCT (or OCT/AM+, respectively), a second group was trained reciprocally, i.e. AM/OCT+ (or OCT+/AM, respectively).
Following training, the larvae were transferred to a test Petri dish that did or did not contain FRU and were given the choice between AM on one side and an empty odor container (EM) on the other side of the dish. After three minutes the larvae were counted and a preference score was calculated as:
$$\mathrm{AM}_{\mathrm{Pref}} = \frac{\#_{\mathrm{AM}} - \#_{\mathrm{EM}}}{\#_{\mathrm{Total}}} \qquad (1)$$
In this equation, # indicates the number of larvae on the respective half of the dish. Thus, AM Pref values are constrained between 1 and -1, with positive values indicating approach towards AM and negative values indicating avoidance of AM.
With quinine as the reinforcer, the experiments were performed and the preferences were calculated in an analogous way.
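The preference score of Eq. 1 can be sketched in a few lines of Python. This is only an illustration: the function name `preference_score` and the example counts are hypothetical, and the default of taking the total as the sum of the two halves is an assumption (the total can be passed explicitly if it is counted differently).

```python
def preference_score(n_am, n_em, n_total=None):
    """Preference for AM as in Eq. 1: (#AM - #EM) / #Total.

    n_am, n_em: number of larvae on the AM half and on the EM half.
    n_total: total number of larvae; ASSUMED here to default to
             n_am + n_em when it is not given explicitly.
    """
    if n_total is None:
        n_total = n_am + n_em
    if n_total <= 0:
        raise ValueError("total number of larvae must be positive")
    return (n_am - n_em) / n_total


# Hypothetical counts: 15 larvae on the AM side, 5 on the EM side.
score = preference_score(15, 5)
```

With these hypothetical counts the score is 0.5, i.e. a preference for AM; all larvae on the EM side would give -1.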

Experiments with Rats
Adult Sprague-Dawley rats were reared under standard conditions (i.e. group housing with ad libitum food and water in a temperature- and humidity-controlled vivarium) and, when 8-10 weeks of age, were admitted to experiments that used an 8-box startle system (SR-LAB, San Diego Instruments, San Diego, USA) equipped with floor grids (for applying the foot shock US), light bulbs (for applying the light CS) and loudspeakers (for presenting the startle probe). Rats were submitted to three different conditioning protocols. In these protocols, a 5-s light stimulus and a 0.5-mA foot shock were presented 15 times each with variable inter-trial intervals (90-150 s from shock to shock). In the 'Paired' group, the light stimulus preceded and co-terminated with the foot shock. In the 'Unpaired' group, the light stimulus was presented randomly, but never less than 12 s before or after the foot shock. In the 'Random' group, the light stimulus was presented randomly, i.e. also very shortly before or after the foot shock. One day later, a startle test was performed in which the startle probe (a noise of 96 dB SPL and 40 ms duration) was presented every 30 s, either in the presence or the absence of the light stimulus. The startle amplitude (SA) was measured by a motion sensor underneath the device. The percent difference between these two testing conditions was calculated as:
$$\text{Startle Difference Score} = \frac{(SA_{\text{in presence}} - SA_{\text{in absence}}) \times 100}{SA_{\text{in absence}}} \qquad (2)$$
Thus, positive scores reflect an increased startle in the presence of the light stimulus, which is indicative of negative valence, whereas negative scores reflect a decreased startle in the presence of the light stimulus, indicative of positive valence.

(B) At the beginning of an experiment of paired odor-reward training, no reward is predicted as the whole experimental context as well as the odor is novel to the larvae.
The presence of a reward (green fill) is thus unpredicted and results in a positive prediction error. This prediction error is associated with all stimuli present, i.e. the odor (red cloud) and the contextual cues of the Petri dish (e.g. light, physical properties of the agarose surface). If the larvae are then transferred to a second Petri dish, the context will predict a reward, but no reward is actually present. The resulting negative prediction error is associated with the context but not with the odor (because the odor is not present). Across repetitions of this cycle, the context is alternately associated with a positive and a negative prediction error. The odor, however, is only associated with positive prediction errors. Thus, at the moment of the test, the odor (and not the context) reliably predicts the reward and is therefore approached by the larvae. (C) At the beginning of unpaired training too, reward is present but not predicted. Importantly, the resulting positive prediction error is associated with the context but not the odor (because the odor is not present). After transferring the larvae to the second, odor-containing Petri dish, the context predicts a reward, but no reward is present. The resulting negative prediction error is associated with the context, as well as the odor. Across repetitions of this cycle, the context is alternately associated with a positive and a negative prediction error. The odor, however, is only associated with negative prediction errors. Thus, at the moment of the test, the odor reliably predicts no reward and is therefore avoided by the larvae.

Figure S2: Schematic of how innate and learned valence jointly determine behavioral output. Rewards (green) can be trained paired (A, C) or unpaired (B, D) with the odor (red cloud) to establish an associative memory (indicated by *).
Reward-paired training leads to a positive learned-valence bias, whereas reward-unpaired training leads to a negative learned-valence bias (indicated by + and -, respectively). Likewise, punishments (yellow) can be trained paired or unpaired with the odor, causing a negative learned-valence bias after punishment-paired training and a positive learned-valence bias after punishment-unpaired training. Learned-valence biases are summed with the innate valence of the odor to a common valence signal that determines olfactory behavior. Significantly, reward-related memories enter the summation with the innate valence only in the absence of a reward (A, B), but are prevented from affecting behavior in the presence of the reward (C, D) (indicated by ~, and reduced line strength). This suppression is independent of the valence of the memory (i.e. takes place for both paired and unpaired reward-memory), and specifically affects reward-related and not punishment-related memories. Punishment-related memories, in turn, are prevented from affecting behavior in the absence of punishment (A, B), and only enter the summation in the presence of it (C, D). The proposed gating processes (~) are suggested to occur upstream of the summation point indicated by ∑, and do not affect innate valence signaling. Please note that physiologically, paired-memories likely correspond to a depression of synaptic strength and that learned valence is likely computed by combining pathways with approach- and avoidance-promoting effects.

All results from the statistical tests performed on the data from the previously published learning experiments on Drosophila larvae that are presented in this study (Fig. 1) are documented. Significant results (corrected according to Bonferroni-Holm) are in bold.

All results from the statistical tests performed on the data from the previously unpublished learning experiments on Drosophila larvae that are presented in this study (Fig. 4) are documented. Significant results (corrected according to Bonferroni-Holm) are in bold.

All results from the statistical tests performed on the data from the previously unpublished learning experiments on rats that are presented in this study (Fig. 5) are documented. Significant results (corrected according to Bonferroni-Holm) are in bold.
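
The prediction-error account given in the legend to panels (B) and (C) above can be illustrated with a small simulation. The sketch below assumes a Rescorla-Wagner-style update in which the prediction error is shared by all stimuli present; the legend does not name a specific learning rule, so this choice, the learning rate `alpha`, the reward value, and all identifiers are assumptions made for illustration, not a model fitted to the data.

```python
def train(trials, alpha=0.3, reward_value=1.0, cycles=3):
    """Simulate associative strengths under an ASSUMED Rescorla-Wagner rule.

    trials: sequence of (cues_present, rewarded) tuples forming one cycle.
    alpha, reward_value: hypothetical parameters, chosen for illustration.
    cycles: number of training cycles (three, as in the larval protocol).
    """
    strength = {"odor": 0.0, "context": 0.0}
    for _ in range(cycles):
        for cues, rewarded in trials:
            # Reward predicted by all stimuli currently present.
            prediction = sum(strength[cue] for cue in cues)
            outcome = reward_value if rewarded else 0.0
            error = outcome - prediction  # prediction error
            # The error is associated with every stimulus present.
            for cue in cues:
                strength[cue] += alpha * error
    return strength


# Paired: odor and reward on the same dish, then a plain dish.
PAIRED = [(("context", "odor"), True), (("context",), False)]
# Unpaired: reward on a dish without odor, then odor without reward.
UNPAIRED = [(("context",), True), (("context", "odor"), False)]

paired = train(PAIRED)
unpaired = train(UNPAIRED)
```

Under these assumptions the odor ends up with a positive associative strength after paired training and a negative one after unpaired training, while the context receives alternating positive and negative updates in both cases. This mirrors the qualitative argument of the legend; the numerical values carry no empirical meaning.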