Information bottleneck-based Hebbian learning rule naturally ties working memory and synaptic updates

Deep feedforward neural networks are effective models for a wide array of problems, but training and deploying such networks presents a significant energy cost. Spiking neural networks (SNNs), which are modeled after biologically realistic neurons, offer a potential solution when deployed correctly on neuromorphic computing hardware. Still, many applications train SNNs offline, and running network training directly on neuromorphic hardware is an ongoing research problem. The primary hurdle is that back-propagation, which makes training such artificial deep networks possible, is biologically implausible. Neuroscientists are uncertain about how the brain would propagate a precise error signal backward through a network of neurons. Recent progress addresses part of this question, e.g., the weight transport problem, but a complete solution remains elusive. In contrast, novel learning rules based on the information bottleneck (IB) train each layer of a network independently, circumventing the need to propagate errors across layers. Instead, propagation is implicit due to the layers' feedforward connectivity. These rules take the form of a three-factor Hebbian update in which a global error signal modulates local synaptic updates within each layer. Unfortunately, the global signal for a given layer requires processing multiple samples concurrently, and the brain only sees a single sample at a time. We propose a new three-factor update rule where the global signal correctly captures information across samples via an auxiliary memory network. The auxiliary network can be trained a priori, independently of the dataset used with the primary network. We demonstrate comparable performance to baselines on image classification tasks. Interestingly, unlike back-propagation-like schemes where there is no link between learning and memory, our rule presents a direct connection between working memory and synaptic updates.
To the best of our knowledge, this is the first rule to make this link explicit. We explore these implications in initial experiments examining the effect of memory capacity on learning performance. Moving forward, this work suggests an alternate view of learning where each layer balances memory-informed compression against task performance. This view naturally encompasses several key aspects of neural computation, including memory, efficiency, and locality.


Introduction
The success of deep learning demonstrates the usefulness of large feedforward networks for solving a variety of tasks. Bringing the same results to spiking neural networks is challenging, since the driving factor behind deep learning's success, back-propagation, is not considered to be biologically plausible (Lillicrap et al., 2020). Specifically, it is unclear how neurons might propagate a precise error signal within a forward/backward pass framework like back-propagation. A large body of work has been devoted to establishing plausible alternatives or approximations for this error propagation scheme (Akrout et al., 2019; Balduzzi, Vanchinathan, and Buhmann, 2015; Scellier and Bengio, 2017; Lillicrap et al., 2020). While these approaches do address some of the issues with back-propagation, implausible elements, like separate inference and learning phases, still persist (Frémaux and Gerstner, 2016).
In this work, we rely on recent advances in deep learning that train feedforward networks by balancing the information bottleneck (Ma, Lewis, and Kleijn, 2019). Unlike back-propagation, where an error signal computed at the end of the network is propagated to the front (see Fig. 1A), this method, called the Hilbert-Schmidt Independence Criterion (HSIC) bottleneck, applies the information bottleneck to each layer in the network independently. Layer-wise optimization is biologically plausible, as shown in Fig. 1B. Our contributions include:
1. We show that optimizing the HSIC bottleneck via gradient descent yields a three-factor learning rule (Frémaux and Gerstner, 2016) composed of a local Hebbian component and a global layer-wise modulating signal.
2. The HSIC bottleneck depends on a batch of samples, and this is reflected in our update rule. Unfortunately, the brain only sees a single sample at a time. We show that the local component only requires the current sample, and that the global component can be accurately computed by an auxiliary reservoir network. The reservoir acts as a working memory, and the effective "batch size" corresponds to its capacity.
3. We demonstrate the empirical performance of our update rule by comparing it against baselines on synthetic datasets as well as MNIST (LeCun, Cortes, and Burges, 1998).
4. To the best of our knowledge, our rule is the first to make a direct connection between working memory and synaptic updates. We explore this connection in initial experiments on memory size and learning performance.
Layer-wise objectives (Belilovsky, Eickenberg, and Oyallon, 2019; Nøkland and Eidnes, 2019), like the one used in this work, offer an alternative that avoids the weight transport problem entirely. Moreover, our objective yields a biologically plausible three-factor learning rule which can be applied concurrently with inference. Pogodin and Latham (2020) draw on similar intuition in their work on the plausible HSIC (pHSIC) learning rule. But in order to make experiments with the pHSIC computationally feasible, the authors used an approximation where the network "sees" 256 samples at once. In contrast, their proposed biologically plausible rule only receives information from two samples: the current one and the previous one. As shown in Fig. 2, we find this limited memory capacity reduces the effectiveness of weight updates. This motivates our work, in which we derive an alternate rule where only the global component depends on past samples (the local component only requires the current pre- and post-synaptic activity). Furthermore, we show that this global component can be computed using a reservoir network. This allows us to achieve performance much closer to back-propagation without compromising the biological plausibility of the rule.

Figure 2: Comparison with Pogodin and Latham (2020). "pHSIC (bs = 256)" is the accuracy reported in that work, and "pHSIC (bs = 2)" is the accuracy we obtain when applying the same rule with a reduced batch size. The degradation in performance is our motivation for deriving an alternate rule that effectively captures the batch size crucial to the HSIC bottleneck.

Notation
First, we will introduce the notation used in the paper.
• Vectors are indicated in bold and lower-case (e.g. x).
• Matrices are indicated in bold and upper-case (e.g. W).
• Superscripts refer to different layers of a feedforward network (e.g. z^ℓ is the ℓ-th layer).
• Subscripts refer to individual samples (e.g. x_i is the i-th sample).
• Brackets refer to elements within a matrix or vector (e.g. [x]_i is the i-th element of x).

The information bottleneck
Given an input random variable, X, an output label random variable, Y, and a hidden representation, T, the information bottleneck (IB) principle is described by

min_T I(X; T) − γ I(T; Y),   (1)

where I(·; ·) is the mutual information between two random variables. Intuitively, this expression adjusts T to achieve a balance between information compression and output preservation.
Though computing the mutual information of two random variables requires knowledge of their distributions, Ma, Lewis, and Kleijn (2019) propose using the Hilbert-Schmidt Independence Criterion (HSIC) as a proxy for mutual information. Given a finite number of samples, N, a statistical estimator for the HSIC (Gretton et al., 2005) is

HSIC(X, Y) = (1 / (N − 1)²) tr(K̄_X K̄_Y),   (2)

where

K̄_X = H K_X H,   H = I − (1/N) 1 1ᵀ,   (3)
[K_X]_{pq} = k(x_p, x_q)   (4)

define the centered and uncentered kernel matrices, respectively, and k is a Gaussian kernel with width σ. Using these definitions, Ma, Lewis, and Kleijn (2019) define the HSIC objective, a loss function for training feedforward neural networks by balancing the IB at each layer. Consider a feedforward neural network with L layers where the output of layer ℓ is

z^ℓ = f^ℓ(W^ℓ z^{ℓ−1}).

We train the network to minimize

L_HSIC = Σ_{ℓ=1}^{L} [ HSIC(X, Z^ℓ) − γ HSIC(Y, Z^ℓ) ],   (5)

where Z = {Z^ℓ}_{ℓ=1}^{L} are the output distributions at each hidden layer. Note that there is a separate objective for each layer. As a result, there is no explicit error propagation across layers, and the error propagation is implicit due to forward connectivity as shown in Fig. 1.
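As a concrete reference, the empirical HSIC estimator can be sketched in a few lines of NumPy. The Gaussian kernel and the width σ = 1 are illustrative assumptions; this is a sketch, not the paper's implementation:

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # Uncentered kernel matrix: [K_X]_pq = exp(-||x_p - x_q||^2 / (2 sigma^2)).
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def center(K):
    # Centered kernel matrix: K_bar = H K H with H = I - (1/N) 1 1^T.
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def hsic(X, Y, sigma=1.0):
    # Biased empirical HSIC estimator (Gretton et al., 2005):
    # HSIC(X, Y) ~= tr(K_bar_X K_bar_Y) / (N - 1)^2.
    N = X.shape[0]
    Kx = center(gaussian_kernel(X, sigma))
    Ky = center(gaussian_kernel(Y, sigma))
    return np.trace(Kx @ Ky) / (N - 1) ** 2
```

Statistically dependent pairs of variables yield larger values than independent ones, which is the property the layer-wise objective exploits.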

Deriving a biologically-plausible rule for the HSIC bottleneck
In this work, we seek to optimize Eq. 5 for a feedforward network of leaky integrate-and-fire (LIF) neurons. Given the membrane potentials, u^ℓ, for layer ℓ, the dynamics are governed by

τ du^ℓ/dt = −u^ℓ + W^ℓ z^{ℓ−1},   z^ℓ = f(u^ℓ) + ζ,   (6)

where z^ℓ is the output activity (firing rate) and ζ ∼ Unif(−0.05, 0.05) emulates the firing rate noise. We optimize L_HSIC by taking the gradient with respect to W^ℓ and applying gradient descent. Doing this, we obtain the following update rule:

Δ[W^ℓ]_ij = −(η / (N − 1)²) Σ_{p,q=0}^{−(N−1)} ( k̄(x_p, x_q) − γ k̄(y_p, y_q) ) α_ij(z^ℓ_p, z^ℓ_q),   (7)

where α_ij(z^ℓ_p, z^ℓ_q) = ∂k̄(z^ℓ_p, z^ℓ_q) / ∂[W^ℓ]_ij is a Hebbian-like term between pre-synaptic neuron j and post-synaptic neuron i. Note that the indices, p and q, run from 0 to −(N − 1) to indicate samples at previous time steps (i.e. z_0 corresponds to the layer output now, z_{−1} corresponds to the layer output from the previously seen sample, etc.). We call N, the batch size in deep learning, the effective batch size in our work. A full derivation of Eq. 7 can be found in the appendix. This rule is similar to the basic rule in Pogodin and Latham (2020), except that they replace k(x_p, x_q) with k(z_p, z_q) and do not use a centered kernel matrix.
Without modifications, Eq. 7 is not biologically plausible. α_ij(z_p, z_q) cannot be called Hebbian when p and q are not zero, since it depends on non-local information from the past. We solve this by making a simplifying approximation. We assume that ∂z_p/∂[W^ℓ]_ij = 0 when p ≠ 0. In other words, the weights at the current time step do not affect past outputs. With this assumption, we find that α_ij(z_p, z_q) ≠ 0 if and only if (p = 0, q ≠ 0) or (p ≠ 0, q = 0). This gives us our final update rule:

Δ[W^ℓ]_ij = −η ξ^ℓ_i β^ℓ_ij,   β^ℓ_ij = f′([u^ℓ]_i) [z^{ℓ−1}]_j,
ξ^ℓ_i = (2 / (N − 1)²) Σ_{p=0}^{−(N−1)} ( k̄(x_0, x_p) − γ k̄(y_0, y_p) ) α_i(z^ℓ_p),   (8)

Details for deriving Eq. 8 from Eq. 7 are found in the appendix. Note that β_ij is now a Hebbian term that only depends on the current pre- and post-synaptic activity. ξ_i is a modulating term that adjusts the synaptic update layer-wise. This establishes a three-factor learning rule for Eq. 5. Yet, it is still not biologically plausible, since the global error, ξ_i, requires buffering layer activity over many samples. We discuss how to overcome this problem next.
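To make the factored structure concrete, the following sketch applies one update of the "global signal times local Hebbian term" form for a single layer. The tanh nonlinearity, the layer sizes, and the random stand-in for the reservoir-provided global signal ξ are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pre, n_post, eta = 6, 4, 1e-2

W = 0.1 * rng.standard_normal((n_post, n_pre))
z_pre = rng.random(n_pre)              # current pre-synaptic firing rates

u = W @ z_pre                          # membrane drive for the current sample
z_post = np.tanh(u)                    # post-synaptic rates (f = tanh assumed)

# Local Hebbian factor: depends only on current pre-/post-synaptic activity,
# beta_ij = f'(u_i) * [z_pre]_j.
beta = (1.0 - z_post ** 2)[:, None] * z_pre[None, :]

# Global modulating factor: one value per post-synaptic neuron. In the paper
# this is the reservoir readout xi; here it is a random placeholder.
xi = rng.standard_normal(n_post)

W_updated = W - eta * xi[:, None] * beta   # three-factor update
```

The key point is that only ξ carries cross-sample information; β is computable from the current sample alone.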

Computing the modulating signal with a reservoir network
In order to compute ξ_i in Eq. 8 biologically, we require a neural circuit capable of storing information for future use. Recurrent networks can provide such functionality, and Hoerzer, Legenstein, and Maass (2014) demonstrate how a reservoir network can be trained to compute complex signals using a binary teaching signal.
For each layer, we construct an auxiliary network of LIF neurons whose dynamics are governed by

τ_r du^r/dt = −u^r + λ W^r r + W^{in} r^{in},   r = tanh(u^r) + ζ^r,   r^o = W^o r + ζ^o,   (9)

where u^r are the recurrent neuron membrane potentials, r^{in} is the input signal activity, and r^o is the readout activity. ζ^r is the recurrent firing rate noise and ζ^o is the exploratory noise for the readout activity. λ is a hyper-parameter that controls the chaos level of the recurrent population.
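A minimal discrete-time simulation of reservoir dynamics of this kind might look as follows. The network size, weight scalings, noise amplitudes, and Euler integration step are assumptions chosen for illustration, not the paper's parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N_r, N_in, N_out = 200, 10, 1
tau_r, lam, dt = 0.05, 1.2, 0.001       # 50 ms time constant, 1 ms step

# Recurrent weights scaled by lam / sqrt(N_r); lam > 1 puts the network in a
# chaotic regime, which is the rich dynamical substrate the readout taps into.
W_rec = lam * rng.standard_normal((N_r, N_r)) / np.sqrt(N_r)
W_in = rng.standard_normal((N_r, N_in))
W_out = np.zeros((N_out, N_r))          # readout weights, trained separately

u = np.zeros(N_r)                       # recurrent membrane potentials
r_in = rng.random(N_in)                 # rate-encoded input signal

for _ in range(500):                    # 500 ms of Euler-integrated dynamics
    r = np.tanh(u) + rng.uniform(-0.05, 0.05, N_r)        # recurrent rate noise
    r_out = W_out @ r + rng.uniform(-0.05, 0.05, N_out)   # exploratory noise
    u = u + dt * (-u + W_rec @ r + W_in @ r_in) / tau_r
```

Because tanh bounds the rates, the membrane potentials stay finite even in the chaotic regime.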
As in Hoerzer, Legenstein, and Maass (2014), we train the auxiliary network using the three-factor Hebbian learning rule

ΔW^o = η_r M (r^o − r̄^o) rᵀ,   M = 1 if P > P̄ and 0 otherwise,   (10)

where ξ is the true global error signal in Eq. 8, P is the negative mean-squared error of the current network output vs. the true signal, M is a binary teaching signal derived from P, and P̄, r̄^o are low-pass filtered versions of P, r^o, respectively. The low-pass filter averages the signals over a moving window (based on the same function as Hoerzer, Legenstein, and Maass (2014)). Intuitively, this rule uses Hebbian updates to train the readout weights, but it gates the strength of the updates based on whether the network performance is improving or not.
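The gated update can be sketched on a toy problem where a scalar readout must match a fixed target value. The fixed rates r, learning rate, filter constant, and target are illustrative assumptions; in the paper the target is the time-varying signal ξ:

```python
import numpy as np

rng = np.random.default_rng(0)
N_r, dt, tau_avg, eta = 50, 0.001, 0.01, 2e-3

r = rng.random(N_r)                    # fixed recurrent rates for this toy case
w_out = np.zeros(N_r)                  # readout weights to learn
xi_true = 0.3                          # target value of the global signal
P_bar, ro_bar = -1.0, 0.0              # low-pass filtered performance / readout
a = dt / tau_avg

for _ in range(3000):
    ro = w_out @ r + rng.uniform(-0.05, 0.05)   # readout + exploratory noise
    P = -(ro - xi_true) ** 2                    # negative squared error
    M = 1.0 if P > P_bar else 0.0               # binary teaching signal
    w_out += eta * M * (ro - ro_bar) * r        # gated Hebbian update
    P_bar = (1.0 - a) * P_bar + a * P           # exponential low-pass filters
    ro_bar = (1.0 - a) * ro_bar + a * ro
```

The update only strengthens weights when performance exceeds its running average, which is how the rule converts the binary reward signal into a useful direction of change for the readout.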
Fig. 3 illustrates the full design of the proposed learning scheme. The reservoir serves as a working memory where the capacity of the memory determines the effective batch size. To the best of our knowledge, our rule is the first to modulate the Hebbian updates of a synapse based on past information stored in a working memory. Furthermore, having a controllable effective batch size means we can study the effect of memory capacity on learning convergence. In particular, we can compare performance against the N = 2 case, which matches the biologically plausible variant of the prior work (Pogodin and Latham, 2020).

Experiments
We tested our design through a series of experiments on the reservoir individually and on the full network with various synthetic and standard datasets. The code to reproduce each experiment, along with instructions, is available at https://github.com/darsnack/biological-hsic. Experiments were performed on a single-node machine with an AMD Ryzen Threadripper 1950X 16-core processor and an Nvidia Titan Xp GPU.
All data is normalized to [0, 1] to represent rate-encoded signals. Simulations use a time step of 1 ms.
The learned output of the final layer via Eq. 5 is not necessarily one-hot like the true labels. This is because the HSIC objective only attempts to match the predicted output distribution and the true output distribution based on similarity between representations (this behavior is explained later in Fig. 8). Like Ma, Lewis, and Kleijn (2019), we use a linear readout layer trained for 1000 epochs with gradient descent to map between the HSIC-learned output and the label encoding. This readout is only required for our experiments; biological circuitry would not require a specific label encoding.
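The readout step amounts to plain gradient descent on a mean-squared error. In the sketch below, the synthetic per-class "code words" stand in for the HSIC-learned outputs and are an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_out, n_classes = 200, 10, 4

# Hypothetical stand-in for HSIC-learned outputs: one noisy "code word" per class.
labels = rng.integers(0, n_classes, n_samples)
codes = rng.random((n_classes, n_out))
Z = codes[labels] + 0.01 * rng.standard_normal((n_samples, n_out))
Y = np.eye(n_classes)[labels]            # one-hot label encoding

W = np.zeros((n_out, n_classes))
lr = 0.1
for _ in range(1000):                    # 1000 epochs of gradient descent
    grad = Z.T @ (Z @ W - Y) / n_samples # gradient of the mean-squared error
    W -= lr * grad

accuracy = np.mean(np.argmax(Z @ W, axis=1) == labels)
```

As long as the class code words are distinct, a linear map recovers the labels, which is why the readout does not affect the biological plausibility claims.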

Reservoir experiments
First, we verify the reservoir's ability to reproduce the true signal ξ. We use a recurrent population of 2000 neurons with τ_r = 50 ms and λ = 1.2 (based on recommendations in Hoerzer, Legenstein, and Maass (2014)). One hundred random input samples, X ∈ R^{100×100}, Y ∈ R^{1×100}, and Z ∈ R^{10×100}, are drawn from Unif(0, 1). We train the reservoir for 500 s to learn ξ based on X, Y, Z, then stop the weight updates and test it for 100 s. The learning rate follows a decay schedule, and a complete list of parameters is in Table 1. The results can be seen in Fig. 4, which shows the reservoir output for the first element of the output signal. Prior to training, the output is pure noise, but it begins to match the true signal quickly after training begins. Even when the training is stopped, the reservoir continues to produce the correct output for the full testing period.

Small dataset experiments
In order to demonstrate that a reservoir readout can effectively modulate learning for a complete network, we perform a series of small-scale experiments. These experiments use simple synthetic datasets through which we can gain an intuitive understanding of the learning behavior. In addition to the parameters listed in Table 1, Table 2 contains shared parameters for all small-scale experiments.
First, we test a single layer perceptron on a linear binary classification task. We separate 100 uniformly sampled points from [−1, 1] × [−1, 1] using a line as shown in Fig. 5A. We pre-train the reservoir network for 550 s with γ = 5, then we perform 50 epochs of training on the dataset. The network reaches a final classification accuracy of 94%. Furthermore, the ratio of the learned weights converges towards the ratio of the true coefficients of the linear boundary.
Next, we test a multilayer perceptron (MLP) network with a reservoir in place for each layer. A synthetic 2D binary classification dataset is generated by sampling 100 random points in [−1, 1] × [−1, 1] uniformly, then separating the points by the decision boundary x_2 = tanh(3x_1). An example generated dataset is shown in Fig. 6A. The MLP architecture is two layers of size two and one, respectively. We warm up and pre-train the reservoir network for 550 s with γ = 20, then we run 50 epochs of training with our rule. On testing, the network achieves 98% accuracy.
In both small experiments, there is a noticeable oscillation in the objective value during training. We attribute this instability to the output of the reservoir falling out of phase with the true signal before returning in phase. This periodic behavior was noted in Hoerzer, Legenstein, and Maass (2014). In our experiments, we disable weight updates to the reservoir to demonstrate that the full framework can learn even with an imperfect modulating signal. In practice, the reservoir can be continuously updated, which will eliminate the phase shift (Hoerzer, Legenstein, and Maass, 2014).

Large dataset experiments
Finally, we test our rule on MNIST with an MLP against a back-propagation baseline. The architecture used for all learning frameworks is a 784 ⇒ 64 ⇒ 32 ⇒ 10 ⇒ 10 network. The back-propagation baseline uses ReLU activations and artificial neurons like those typically used in deep learning. Our method uses LIF neurons as described in Sec. 5.
Since our learning rule is designed to process a single sample at a time, we must iterate over MNIST one sample at a time. This becomes computationally infeasible due to the large dataset size. In order to make the experiments tractable, we do not simulate the reservoir neurons. Having already established their ability to correctly capture the global signal, ξ, we compute the signal analytically. Furthermore, we obtain a subset of MNIST by sampling 50% of the points in the training data via stratified sampling. We retain the complete test data for reporting final accuracy. The back-propagation network is trained with a batch size of 32, and our rule uses an effective batch size of 32. The networks are all trained for 75 epochs, and each sample is presented for 20 ms.
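The stratified subsampling step can be sketched as follows; the toy labels stand in for MNIST's, and the 50% fraction matches the text:

```python
import numpy as np

def stratified_subset(labels, frac=0.5, seed=0):
    # Keep `frac` of each class's indices so class proportions are preserved.
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        keep.append(idx[: int(len(idx) * frac)])
    return np.sort(np.concatenate(keep))

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, 1000)       # toy stand-in for MNIST training labels
subset = stratified_subset(labels, frac=0.5)
```

Unlike uniform subsampling, this guarantees every class remains represented in the same proportion, which matters because the global signal compares labels across samples.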
The HSIC balance parameter, γ, is set to 50, 100, 200, and 500 for each layer, respectively. We use a momentum optimizer (with momentum ρ = 0.9) purely to speed up convergence and make the experiment tractable. The learning rate starts at 5 × 10⁻² for 20 epochs, drops to 5 × 10⁻³ for 30 epochs, and finally 1 × 10⁻³ for the remaining training time. The remaining parameters match the previous experiments. We performed four trials to account for random seeds and show the average test accuracy across trials in Fig. 7.
Though our method does not reach the same accuracy as back-propagation, it is able to improve and learn the dataset. After 75 epochs, our objective value continued to decrease, but at a diminished rate. We expect that the gap between our method and back-propagation would be smaller with more epochs of training; however, our goal is not to establish a state-of-the-art method for training. Instead, we are demonstrating a biologically-plausible rule that can scale up to larger datasets.
In addition to examining the test accuracy, we show the average predicted output per class from one of the trials in Fig. 8. We see that the predictions do not match a one-hot vector, but each label is assigned a unique binary "code word." Note that we repeat these experiments with pHSIC (batch size = 2) and LIF neurons to produce the result in Fig. 2.
In order to demonstrate that our rule can scale up to larger datasets with sufficient iterations, we create a subset of MNIST by only selecting samples for the digits 0, 1, 2, and 4. We repeat our experiment on this subset of data. The results are shown in Fig. 9. With a slightly smaller, but still complex, dataset, our method achieves comparable performance to back-propagation. Our method improves in accuracy as training proceeds, but convergence quickly slows down. We expect the gap in performance to close with more epochs, but running for many epochs was computationally intractable for this work.

Effects of memory capacity
One of the novel features of our rule is the ability to control the memory capacity of the update. To explore this parameter, we repeat the same MNIST experiments as before for various effective batch sizes and numbers of training epochs. The results are shown in Fig. 10. The normalized final test accuracy is much lower for small batch sizes, independent of the number of training epochs. This is because for a given random variable, X ∈ R^n, the kernel matrix, K_X, is the basis for an estimate of how samples of X are distributed in R^n. When the effective batch size is small, this estimate is poor and provides an erroneous signal for modulating the weight updates. In particular, note that α_i(z_p) in Eq. 8 has an anti-Hebbian behavior: it tends to drive z^ℓ_0 and z^ℓ_p (the latent representations for layer ℓ) apart. This signal is flipped whenever k(y_0, y_p) is large. In other words, if y_0 is more similar to y_p than it is to other samples, the latent representations are driven towards each other. When the batch size is small, K_Y rarely contains samples of the same label, so the rule tends to over-drive the latent representations apart.
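The intuition that a small effective batch rarely contains same-label pairs can be checked with a quick simulation. Uniformly distributed labels over 10 classes are an assumption mirroring balanced MNIST:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, trials = 10, 20000

def same_label_pair_prob(batch_size):
    # Probability that a batch of uniformly drawn labels contains at least one
    # repeated class (i.e. some pair for which k(y_0, y_p) is large).
    hits = 0
    for _ in range(trials):
        labels = rng.integers(0, n_classes, batch_size)
        if len(np.unique(labels)) < batch_size:
            hits += 1
    return hits / trials

p2 = same_label_pair_prob(2)     # roughly 0.1 for 10 classes
p32 = same_label_pair_prob(32)   # exactly 1.0 by pigeonhole (32 > 10)
```

With an effective batch size of 2, only about one update in ten sees a same-class pair, so the anti-Hebbian pressure dominates; at 32 samples, every update sees repeats.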

Discussion
In this work, we proposed a three-factor learning rule for training feedforward networks based on the information bottleneck principle. The rule is biologically plausible, and we are able to scale up to reasonable performance on MNIST without compromising that plausibility. We do this by factoring our weight update into a local component and a global component. The local component depends only on the current synaptic activity, so it can be implemented via Hebbian learning. In contrast to prior work, our global component uses information across many samples seen over time. We show that this content can be stored in an auxiliary reservoir network, and that the readout of the reservoir can be used to modulate the local weight updates. To the best of our knowledge, this is the first biological learning rule to tightly couple synaptic updates with a working memory. We verified the efficacy of our rule on synthetic datasets and MNIST, and we explored the effect of working memory capacity on learning performance. We see that accuracy is degraded for small batch sizes, independent of the number of epochs.
Even though our rule performs reasonably well, there is room for improvement. The rule performs best when it is able to distinguish between different high-dimensional inputs. The resolution at which it separates inputs is controlled by the parameter, σ, in the kernel function (Eq. 4). The use of a fixed σ is partly responsible for the slowdown in convergence in Fig. 7. In Ma, Lewis, and Kleijn (2019), the authors propose using multiple networks trained with different values of σ and averaging the output across networks. This allows the overall network to separate the data at different resolutions. Future work can consider a population of networks with varying σ to achieve the same effect. Alternatively, the lower hierarchies of the visual cortex have built-in pre-processing at varying spatial resolutions (e.g. Gabor filters). We could consider adding a series of filters before the network and passing the concatenated output to the network. Addressing this resolution issue will be important for improving the speed and scalability of the learning method.
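The multi-resolution idea can be sketched by averaging kernel similarities over several σ values; the specific σ grid below is an assumption for illustration:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    # [K]_pq = exp(-||x_p - x_q||^2 / (2 sigma^2))
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def multi_scale_kernel(X, sigmas=(0.5, 1.0, 2.0, 5.0)):
    # Average kernels over several resolutions so that both fine and coarse
    # structure in the inputs contributes to the similarity estimate.
    return np.mean([gaussian_kernel(X, s) for s in sigmas], axis=0)
```

A small σ resolves fine differences but saturates for distant points, while a large σ does the opposite; the average retains sensitivity across scales.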
Additionally, our rule is strongly supervised. While the mechanism for synaptic updates is biologically plausible, the overall learning paradigm is not. Note that the purpose of the label information in the global signal is to indicate whether the output for the current sample should be the same as or different from previous samples. In other words, it might be possible to replace the term k(y_0, y_p) in Eq. 8 with a binary teaching signal. This would allow the rule to operate under weak supervision.
Most importantly, while our rule is certainly biologically plausible, it remains to be seen whether it is an accurate model for circuitry in the brain. Since rules based on the information bottleneck are relatively new, the corresponding experimental evidence must still be obtained. Yet, we note that our auxiliary reservoir serves a similar role to the "blackboard" circuit proposed in Mumford (1991). This circuit, present in the thalamus, receives projected connections from the visual cortex, similar to how each layer projects its output onto the reservoir. Furthermore, Mumford suggests that this circuit acts as a temporal buffer and sends signals that capture information over longer timescales back to the cortex, like our reservoir.
So, while it is uncertain whether our exact rule and memory circuit are present in biology, we suggest that an in-depth exploration of memory-modulated learning rules is necessary.We hope this work will be an important step in that direction.

A Derivation of learning rule
Below is a complete derivation of our update rule. First, we find the derivative of L_HSIC (Eq. 5) with respect to the layer weight, W^ℓ. Then, we show the assumptions necessary to make this update biologically plausible.
We adopt the convention that sample indices run over the current and past samples, p, q ∈ {0, −1, …, −(N − 1)}, so that

HSIC(X, Y) = (1 / (N − 1)²) Σ_{p,q=0}^{−(N−1)} k̄(x_p, x_q) k̄(y_q, y_p).   (11)

In other words, the first sample in the batch is the current sample, and the last sample in the batch is the one presented N − 1 samples ago. Now, taking the derivative of Eq. 5 with respect to [W^ℓ]_ij:

∂L_HSIC / ∂[W^ℓ]_ij = ∂/∂[W^ℓ]_ij [ (1 / (N − 1)²) Σ_{p,q=0}^{−(N−1)} ( k̄(z_p, z_q) k̄(x_q, x_p) − γ k̄(z_p, z_q) k̄(y_q, y_p) ) ]
= (1 / (N − 1)²) Σ_{p,q=0}^{−(N−1)} ( k̄(x_q, x_p) − γ k̄(y_q, y_p) ) ∂k̄(z_p, z_q) / ∂[W^ℓ]_ij.   (12)

Focusing on the derivative of k̄(z_p, z_q), for the Gaussian kernel k(z_p, z_q) = exp(−‖z_p − z_q‖² / (2σ²)),

∂k(z_p, z_q) / ∂[W^ℓ]_ij = −(1/σ²) k(z_p, z_q) ([z_p]_i − [z_q]_i) ( ∂[z_p]_i / ∂[W^ℓ]_ij − ∂[z_q]_i / ∂[W^ℓ]_ij ).

And finally, defining α_ij(z_p, z_q) = ∂k̄(z_p, z_q) / ∂[W^ℓ]_ij,

Δ[W^ℓ]_ij = −(η / (N − 1)²) Σ_{p,q=0}^{−(N−1)} ( k̄(x_q, x_p) − γ k̄(y_q, y_p) ) α_ij(z_p, z_q).

This gives us the update rule form in Eq. 7. The term ∂[z_0]_i / ∂[W^ℓ]_ij is the local component, while the rest can be globally computed per layer. But this rule is not biologically plausible, since the indices p and q sample neuron firing rates from past time steps.
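The Gaussian-kernel derivative in the derivation can be sanity-checked numerically. For simplicity this sketch uses the uncentered kernel and a tanh layer; the sizes and the chosen index (i, j) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pre, n_post, sigma = 5, 4, 1.0

W = rng.standard_normal((n_post, n_pre))
x0, xp = rng.random(n_pre), rng.random(n_pre)   # current and past inputs

def layer(W, x):
    return np.tanh(W @ x)

def k(za, zb):
    return np.exp(-np.sum((za - zb) ** 2) / (2.0 * sigma ** 2))

# Analytic derivative of k(z_0, z_p) w.r.t. [W]_ij, holding z_p fixed
# (the simplifying assumption that past outputs do not depend on current W).
z0, zp = layer(W, x0), layer(W, xp)
i, j = 1, 2
dz0_dWij = (1.0 - z0[i] ** 2) * x0[j]           # f'(u_i) * [x_0]_j for f = tanh
analytic = -(1.0 / sigma ** 2) * k(z0, zp) * (z0[i] - zp[i]) * dz0_dWij

# Finite-difference check, with z_p kept fixed to match the assumption.
eps = 1e-6
W2 = W.copy()
W2[i, j] += eps
numeric = (k(layer(W2, x0), zp) - k(z0, zp)) / eps
```

Agreement between the two values confirms the chain-rule step used to isolate the local factor.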
We make the assumption that ∂z_p / ∂[W^ℓ]_ij = 0 when p ≠ 0 (similarly when q ≠ 0). In other words, the past output does not depend on the current weights. With this assumption, we note that

α_ij(z_p, z_q) = 0 when p ≠ 0 and q ≠ 0.

Additionally, when both p = 0 and q = 0, the kernel value k(z_0, z_0) = 1 is constant, so α_ij(z_0, z_0) = 0 as well. Thus, we find that each term in the summation in Eq. 12 is non-zero only when (p = 0, q ≠ 0) or (p ≠ 0, q = 0). Furthermore, due to the symmetry of the terms, we can reduce the double summation to twice a single summation. This gives us our final form,

Δ[W^ℓ]_ij = −η ξ^ℓ_i β^ℓ_ij,   β^ℓ_ij = f′([u^ℓ]_i) [z^{ℓ−1}]_j,
ξ^ℓ_i = (2 / (N − 1)²) Σ_{p=0}^{−(N−1)} ( k̄(x_0, x_p) − γ k̄(y_0, y_p) ) α_i(z^ℓ_p),

where β^ℓ_ij is factored out of the summation since it no longer depends on p. This results in a local component, β_ij, that can be computed using only the current pre- and post-synaptic activity. The global component, ξ, requires memory, so we use an auxiliary network to compute it.

B Reservoir network details
Here we explain the low-pass filtering (LPF) details not covered by the description in Sec. 5, Eq. 9. As part of the learning rule in Eq. 10, the readout and error signals are low-pass filtered according to

P̄(t) = (1 − Δt/τ_avg) P̄(t − Δt) + (Δt/τ_avg) P(t),
r̄^o(t) = (1 − Δt/τ_avg) r̄^o(t − Δt) + (Δt/τ_avg) r^o(t),

where Δt is the simulation time step and τ_avg is the averaging time constant. This is the same filtering technique presented in Hoerzer, Legenstein, and Maass (2014).
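This filter is a first-order exponential moving average. A minimal sketch, assuming the same 1 ms step used in the simulations and an illustrative window constant τ_avg:

```python
import numpy as np

dt, tau_avg = 0.001, 0.02        # 1 ms step; tau_avg is an assumed constant

def lowpass(signal, dt, tau):
    # s_bar(t) = (1 - dt/tau) * s_bar(t - dt) + (dt/tau) * s(t)
    a = dt / tau
    out = np.empty(len(signal))
    acc = float(signal[0])       # initialize the filter at the first sample
    for i, s in enumerate(signal):
        acc = (1.0 - a) * acc + a * float(s)
        out[i] = acc
    return out
```

The ratio Δt/τ_avg sets the effective averaging window: smaller ratios average over more samples, trading responsiveness for smoothness.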

Figure 1 :
Figure 1: A. Sequential (explicit) error propagation requires precise information transfer backwards between layers. B. Parallel (implicit) error propagation uses only local information in combination with a global modulating signal. Biological rules of this form are known as three-factor learning rules (Frémaux and Gerstner, 2016).

Figure 3 :
Figure 3: The overall network architecture. Each layer has a corresponding auxiliary reservoir network. The synaptic update, β_ij in Eq. 8, is modulated by a layer-wise error signal, ξ, that is read out from the reservoir.

Figure 4 :
Figure 4: The reservoir output when learning ξ in Eq. 8. A. The output right before and after the start of training. When training begins, the output quickly begins to match the true signal. B. The output at the start of testing. Even without the feedback from the learning rule, the output matches the true signal. C. The output at the end of testing (100 s later) continues to maintain the correct behavior.

Figure 5 :
Figure 5: The results from training a single layer perceptron on a linear binary classification task. A. An example sample set from the synthetic distribution. B. The ratio of the two weights in the model converges to the true ratio. C. The HSIC objective converges due to the learning rule. The final test accuracy is 94%.

Figure 6 :
Figure 6: The results from training a two-layer MLP on a synthetic binary classification dataset. A. An example sample set from the synthetic distribution. B. The HSIC objective (Eq. 5) for each layer. The plain objective value is very noisy due to the noise in the neuron firing rate. Despite this noise, the network converges and reaches 98% accuracy.

Figure 7 :
Figure 7: Average test accuracy over four independent trials on MNIST for back-propagation and our method. The red error bars indicate the standard deviation across trials. Our method improves in accuracy as training proceeds, but convergence quickly slows down. We expect the gap in performance to close with more epochs, but running for many epochs was computationally intractable for this work.

Figure 8 :
Figure 8: The average predicted output across all test samples on MNIST, separated by true label (the shaded region indicates the standard deviation across test samples). Each subplot corresponds to a single true label, and the solid line is the mean output for the final HSIC-trained layer. As expected, the output is not one-hot, but the network does tend to learn a unique binary "code word" for each class.

Table 2: Shared parameters for all small-scale experiments.