Bayesian continual learning via spiking neural networks

Among the main features of biological intelligence are energy efficiency, capacity for continual adaptation, and risk management via uncertainty quantification. Neuromorphic engineering has been thus far mostly driven by the goal of implementing energy-efficient machines that take inspiration from the time-based computing paradigm of biological brains. In this paper, we take steps toward the design of neuromorphic systems that are capable of adaptation to changing learning tasks, while producing well-calibrated uncertainty quantification estimates. To this end, we derive online learning rules for spiking neural networks (SNNs) within a Bayesian continual learning framework. In it, each synaptic weight is represented by parameters that quantify the current epistemic uncertainty resulting from prior knowledge and observed data. The proposed online rules update the distribution parameters in a streaming fashion as data are observed. We instantiate the proposed approach for both real-valued and binary synaptic weights. Experimental results using Intel's Lava platform show the merits of Bayesian over frequentist learning in terms of capacity for adaptation and uncertainty quantification.


I. INTRODUCTION
Recent advances in machine learning and artificial intelligence systems have been largely driven by a pursuit of accuracy via resource-intensive pattern recognition algorithms run in a train-and-then-deploy fashion. In stark contrast, neuroscience paints a picture of intelligence that revolves around continual adaptation, uncertainty quantification, and resource budgeting (allostasis) for the parsimonious processing of event-driven information [1]- [4]. Taking inspiration from neuroscience, over the last decade, neuromorphic engineering has pursued the goal of implementing energy-efficient machines that process information with time via sparse inter-neuron binary signals -or spikes [5]. The main aim of this paper is to introduce algorithmic solutions to endow neuromorphic models, namely spiking neural networks (SNNs), with the capacity for adaptation to changing learning tasks, while ensuring the reliable quantification of uncertainty of the model's decisions.

A. Managing Uncertainty via Bayesian Learning
Training algorithms for SNNs have been overwhelmingly derived by following the frequentist approach which consists in minimizing the training loss with respect to the model parameter vector [6]- [9]. This is partly motivated by the dominance of frequentist learning, and associated software tools, in the literature on deep learning for conventional artificial neural networks (ANNs). Frequentist learning is well justified when enough data are available to make the training loss a good empirical approximation of the underlying population loss [10]. When this condition is not satisfied, while the model's average accuracy may be satisfactory on test data, the decisions made by the trained model can be badly calibrated, often resulting in overconfident predictions [11], [12]. The problem is particularly significant for decisions made on test data that differ significantly from the data observed during training -a common occurrence for applications such as self-driving vehicles. Furthermore, the inability of frequentist learning to account for uncertainty limits its capacity to adapt to new tasks while retaining the capacity to operate on previous tasks [13].
The main cause of the poor calibration of frequentist learning is the selection of a single parameter vector, which disregards any uncertainty on the best model to use for a certain task due to the availability of limited data. A more principled approach that has the potential to properly account for such epistemic uncertainty, i.e., for uncertainty related to the availability of limited data, is given by Bayesian learning [14] and by its generalized form known as information risk minimization (see, e.g., [15]- [19]). Bayesian learning maintains a distribution over the model parameter vector that represents the partial information available to the learner. This way, Bayesian models can provide well-calibrated decisions, which quantify accurately the associated degree of uncertainty and can be used to detect out-of-distribution inputs [20]. In the self-driving example provided earlier, the vehicle may hand back control to the driver when the certainty of its decision is below a certain threshold.
Bayesian reasoning is at the core of the Bayesian brain hypothesis in neuroscience, according to which biological brains constantly update an internal model of the world in an attempt to minimize their information-theoretic surprise. This hypothesis is formalized by the free energy principle, which measures surprise in terms of a variational free energy [21]. In this context, synaptic plasticity has been hypothesized to be well-modelled as Bayesian learning, which keeps track of the distributions of synaptic weights over time [22].
In the present paper, we propose (generalized) Bayesian learning rules for SNNs with binary and real-valued synaptic weights that can adapt over time to changing learning tasks. arXiv:2208.13723v2 [cs.NE] 1 Nov 2022

B. Related Work
Bayesian learning, and its application to deep ANNs, typically labelled as Bayesian deep learning, is receiving increasing attention in the literature. We refer to the following work for a recent overview [23]. natural gradient descent rule known as the Bayesian learning rule was introduced in [24], then applied in [25] to train binary ANNs, and to a variety of other scenarios in [26]. Reference [26] demonstrates that the Bayesian learning rule recovers many state-of-the-art machine learning algorithms in a principled fashion. We also point to the reference [27] that explores the use of natural gradient descent for frequentist learning in spiking neurons.
As mentioned, the choice of a Bayesian learning framework is in line with the importance of the Bayesian brain hypothesis in computational neurosciences [21]. The recent reference [22] explores a Bayesian paradigm to model biological synapses as an explanation of the capacity of the brain to perform learning in the presence of noisy observations. A Bayesian approach to neural plasticity was previously proposed for synaptic sampling, by modeling synaptic plasticity as sampling from a posterior distribution [28]. Apart from the conference version [29] of the present work, this paper is the first to explore the definition of Bayesian learning and Bayesian continual learning rules for general SNNs adopting the standard spike response model (SRM, see, e.g., [30]).
Continual learning is a key area of machine learning research, which is partly motivated by the goal of understanding how biological brains maintain previously acquired skills while adding new capabilities. Unlike traditional machine learning, whereby one performs training based on a single data source, in continual learning, several datasets, corresponding to different tasks, are sequentially presented to the learner. A challenge in continual learning is the ability of the learning algorithm to perform competitively on previous tasks after training on the subsequently observed datasets. In this context, catastrophic forgetting indicates the situation in which performance drops sharply on previously encountered tasks after learning new ones. Many continual learning techniques follow the principle of preserving synaptic connections that are deemed important to perform well on previously learned tasks via a regularization of the learning objective [31], [32]. Bayesian approaches have also been proposed for this purpose, whereby priors are selected as the posterior evaluated on the previous task to prevent the new posterior distribution from deviating too much from learned states. Biological mechanisms are explicitly leveraged in works such as [33] and [34], which combine a variety of neural mechanisms to obtain state-of-the-art performance for SNNs on standard continual learning benchmarks. Reference [35] also proposes a continual learning algorithm for SNNs in an unsupervised scenario by assuming limited precision for the weights. In the present paper, we demonstrate how Bayesian learning allows obtaining similar biologically inspired features by following a principled objective grounded in information risk minimization.
Traditionally, training of SNNs has relied on biologically realistic Hebbian rules, among which spike-timing dependent plasticity (STDP) is the most popular. STDP modulates the synaptic weight between two neurons based on the firing times of both neurons. A long-term potentiation (i.e., an increase in the weight) of the synapse occurs when the pre-synaptic neuron spikes right before the post-synaptic neuron, while long-term depression (i.e., a decrease in the weight) of a synapse happens when the pre-synaptic neuron spikes after the post-synaptic neuron. STDP implements a form of unsupervised learning, and can be leveraged to perform tasks such as clustering, while also supporting continual learning [36].
Supervised learning based on the minimization of the training loss is challenging in SNNs due to the activation function of spiking neurons, the derivative of which is always zero, except at the spike time, where it is not differentiable. Modern training algorithms [6]- [8] overcome this difficulty through the use of surrogate gradients, i.e., by replacing the true derivative with that of a well-defined differentiable function [37]. An alternative approach, reviewed in [38], is to view the SNN as a probabilistic model whose likelihood can be directly differentiated. Further extensions of the probabilistic modelling approach and associated training rules are presented in [39], [40].
An application of Bayesian principles to SNNs has first been proposed in the conference version of this paper [29]. Reference [29] focuses on SNNs with binary synaptic weights and offline learning, presenting limited experimental results. In contrast, the current paper provides all the necessary background, including frequentist learning; it covers frequentist and Bayesian continual learning; and it provides extensive experimental results on a variety of tasks.

C. Main Contributions
In this work, we derive online learning rules for SNNs within a Bayesian continual learning framework. In it, each synaptic weight is represented by parameters that quantify the current epistemic uncertainty associated with prior knowledge and data observed thus far. Bayesian methods are key to handling uncertainty over time, providing the model knowledge of what is to be retained, and what can be forgotten [13]. The main contributions are as follows. i) We introduce general frameworks for the definition of single-task and continual Bayesian learning problems for SNNs that are based on information risk minimization and variational inference. Following the desiderata formulated in [41], we focus on the standard formulation of continual learning in which there exist clear demarcations between subsequent tasks, but the learner is unaware of the identity of the current task. For example, in the typical example of an autonomous vehicle navigating in several environments, the vehicle may be aware that it is encountering a new terrain, while being a priori unaware of the type of new terrain. Furthermore, the model is not modified between tasks, and tasks may be encountered more than once; ii) We instantiate the general Bayesian learning frameworks for SNNs with real-valued synapses. To this end, we adopt a Gaussian variational distribution for the synaptic weights, and demonstrate learning rules that can adapt the parameters of the weight distributions online. This choice of variational posterior has been previously explored for ANNs, and can yield state-of-the-art performance on real-life datasets [42]; iii) We then introduce Bayesian single-task and continual learning rules for SNNs with binary weights, with the main goal of supporting more efficient hardware implementations [43], [44], including platforms based on beyond-CMOS memristors [45]; iv) Through experiments on both synthetic and real neuromorphic datasets, we demonstrate the advantage of the Bayesian learning paradigm in terms of accuracy and calibration for both single-task and continual learning. As neuromorphic algorithms are designed to be run on dedicated hardware, we run the experiments using Intel's Lava software emulator platform [46], accounting for the limited precision of synaptic weights in hardware.

II. METHODS
We first introduce the adopted SNN model, namely the standard spike response model (SRM), before giving a short overview of frequentist, Bayesian, continual, and biologically inspired learning. We then detail learning rules for offline and continual frequentist learning, and derive associated online Bayesian learning rules.
A. SNN Model 1) Spike Response Model: The architecture of an SNN is defined by a network of spiking neurons connected over an arbitrary graph, which possibly includes (directed) cycles. As illustrated in Fig. 1, the directed graph G = (N , E) is described by a set N of nodes, representing the neurons, and by a set E of directed edges i → j with i = j ∈ N , representing synaptic connections.
Focusing on a discrete-time implementation, each spiking neuron i ∈ N produces a binary value s i,t ∈ {0, 1} at discrete time t = 1, 2, . . ., with "1" denoting the firing of a spike. We collect in an |N | × 1 vector s t = (s i,t : i ∈ N ) the spikes emitted by all neurons N at time t, and denote by s t = (s 1 , . . . , s t ) the spike sequences of all neurons up to time t. Without loss of generality, we consider time-sequences of length T , and write s := s T . Each neuron i receives input spike signals {s j,t } j∈Pi = s Pi,t at time t from the set P i = {j ∈ N : (j → i) ∈ E} of parent, or pre-synaptic, neurons, which are connected to neuron i via directed links in the graph G. With some abuse of notations, this set is taken to include also exogeneous input signals.
Each neuron i maintains a scalar analog state variable u i,t , known as the membrane potential. Mathematically, neuron i outputs a binary signal s i,t , or spike, at time t when the membrane potential u i,t is above a threshold ϑ, i.e., with Θ(·) being the Heaviside step function and ϑ being the fixed firing threshold. Following the standard discrete-time SRM [30], the membrane potential u i,t is obtained by summing filtered contributions from pre-synaptic neurons in set P i and from the neuron's own output. In particular, the membrane potential evolves as where w ij is a learnable synaptic weight from pre-synaptic neuron j ∈ P i to post-synaptic neuron i; and we collect in vector w = {w i } i∈N the model parameters, with w i := {w ij } j∈Pi being the synaptic weights for each neuron i. We have denoted as α t and β t the spike responses of synapses and somas, respectively; while * denotes the convolution operator f t * g t = δ>0 f δ g t−δ . When implemented with autoregressive filters, the SRM is equivalent to leaky integrate-and-fire (LIF) neuron model [7], [30]. The techniques developed in this work can be directly generalized to other, more complex, neuron models, such as resonate-and-fire [47], but we leave an investigation of this point to future work.
2) Real-Valued and Binary-Valued Synapses: In this paper, we will consider two implementations of the SRM introduced in the previous subsection. In the first, the synaptic weights in vector w are real-valued, i.e., w ij ∈ R, with possibly limited resolution, as dictated by deployment on neuromorphic hardware (see Sec. III). In contrast, in the second implementation, the weights are binary, i.e., w ij ∈ {+1, −1}. The advantages of models with binary-valued synapses, which we call binary SNNs, include a reduced complexity for the computation of the membrane potential u i,t in (2). Furthermore, binary SNNs are particularly well suited for implementations on chips with nanoscale components that provide discrete conductance levels for the synapses [45]. In this regard, we note that the methods described in this paper can be generalized to models with weights having any discrete number of values.

B. Frequentist versus Bayesian Learning
With traditional frequentist learning, the vector of synaptic weights w is optimized by minimizing a training loss. The training loss is adopted as a proxy for the population loss, i.e., for the loss averaged over the true, unknown, distribution of the data. Therefore, frequentist learning disregards the inherent uncertainty caused by the availability of limited training data, which causes the training loss to be a potentially inaccurate estimate of the population loss. As a result, frequentist learning is known to potentially yield poorly calibrated, and overconfident decisions for ANNs [11].
In contrast, as seen in Fig. 2a, Bayesian learning optimizes over a distribution q(w) in the space of the synaptic weight vector w. The distribution q(w) captures the epistemic uncertainty induced by the lack of knowledge of the true distribution of the data. This is done by assigning similar values of q(w) to model parameters that fit equally well the data, while also being consistent with prior knowledge. As a consequence, Bayesian learning is known to produce better calibrated decisions, i.e., decisions whose associated confidence better reflects the actual accuracy of the decision [12]. Furthermore, models trained via Bayesian learning can better detect out-of-distribution data, i.e., data that is not covered by the distribution of the training set [20], [48]. Once distribution q(w) is optimized via Bayesian learning, at inference time a decision on any new test input is made by averaging the decisions of multiple models, with each being drawn from the distribution q(w). The average over multiple models can be realized in one of two ways. i) Ensemble predictor: Given a test input, as seen in Fig. 2b, one draws a new synaptic weight vector several times from the distribution q(w), and an ensemble decision is obtained by averaging the decisions produced by running the SNN with each sampled weight vector; ii) Committee machine: Alternatively, one can sample a number of realizations from the distribution q(w) that are kept fixed and reused for all test inputs. This solution foregoes the sampling step at inference time as illustrated in Fig. 2c. However, the approach generally requires a larger memory to store all samples w to be used for inference, while the ensemble predictor can make decisions using different weight vectors w ∼ q(w) sequentially over time.

C. Offline versus Continual Learning
Offline learning denotes the typical situation where the system is presented with a single training dataset D, which is used to measure a training loss. In offline learning, optimization of the training loss is carried out once and for all, resulting in a synaptic weight vector w or in a distribution q(w) for frequentist or Bayesian learning, respectively. Offline learning is hence, by construction, unable to adapt to changing conditions, and it is deemed to be a poor representation of how intelligence works in biological organisms [49].
In continual learning, the system is sequentially presented datasets D (1) , D (2) , . . . corresponding to distinct, but related, learning tasks, where each task is selected, possibly with replacement, from a pool of tasks, and its identity is unknown to the system. For each task k, the system is given a training set D (k) , and its goal is to learn to make predictions that generalize well on the new task, while causing minimal loss of accuracy on previous tasks 1, . . . , k − 1. In frequentist continual learning, the model parameter vector w is updated as data from successive tasks is collected. Conversely, in Bayesian continual learning, the distribution q(w) is updated over time as illustrated in Fig. 3. The updates should be sufficient to address the needs of the new task, while not disrupting performance on previous tasks, operating on a stability-plasticity trade-off.

D. Biological Principles of Learning
Many existing works on continual learning draw their inspiration from the mechanisms underlying the capability of biological brains to carry out life-long learning [34], [49]. Learning is believed to be achieved in biological systems by modulating the strength of synaptic links. In this process, a variety of mechanisms are at work to establish short-to intermediate-term and long-term memory for the acquisition of new information over time [50]. These mechanisms operate at different time and spatial scales.
One of the best understood mechanisms, long-term potentiation, contributes to the management of long-term memory through the consolidation of synaptic connections [51], [52]. Once established, these are rendered resistant to disruption by changing their capacity to change via metaplasticity [53], [54]. As a related mechanism, return to a base state is ensured after exposition to small, noisy changes by heterosynaptic plasticity, which plays a key role in ensuring the stability of neural systems [55]. Neuromodulation operates at the scale of neural populations to respond to particular events registered by the brain [56]. Finally, episodic replay plays a key role in the maintenance of long-term memory, by allowing biological brains to re-activate signals seen during previous active periods when inactive (i.e., sleeping) [49].

E. Frequentist Offline Learning
We now review frequentist offline training algorithms for SNNs, under the SRM model described in Sec. II-A1. This will provide the necessary background for Bayesian learning and its continual version, described in Sec. II-F and Sec. II-H, respectively.
1) Empirical Risk Minimization: To start, as illustrated in Fig. 1, we divide the set N of neurons of the SNN into two subsets Y and H with N = Y ∪ H: a set of read-out, or output, neurons Y and a set of hidden neurons H. The set of exogeneous inputs is defined as X . We focus on supervised learning, in which a dataset D is given by |D| pairs (x, y) of signals generated from an unknown distribution p(x, y), with x being exogeneous input signals, one for each element of the set X , and y the corresponding desired output signals. Both x and y are vector sequences of length T , with x comprising |X | signals, and y including |Y| signals. Each output samples y m,t in y dictates the desired behavior of the mth neuron in the read-out set Y. The sequences in x and y can generally take arbitrary real values (see Sec. III for specific examples).
In frequentist learning, the goal is to minimize the training loss over the parameter vector w using the training dataset D = {(x, y)}. To elaborate, we define the loss L x,y (w) measured with respect to a data (x, y) ∈ D as the error between the reference signals y and the output spiking signals produced by the SNN with parameters w, given the input x. Accordingly, the loss is written as a sum over time instants t = 1, . . . , T and over the |Y| read-out neurons as where function L y m,t , f m (w, x t ) is a local loss measure comparing the target output y m,t of neuron m at time t and the actual output f m (w, x t ) of the same neuron. The notations f m (w, x t ) and L x t ,yt (w) are used as a reminder that the output of the SNN and the corresponding loss at time t generally depend on the input x t up to time t, and on the target output y t at time t. Specifically, the notation f m (w, x t ) makes it clear that the output of neuron m ∈ Y is produced with the model parameters w from exogeneous input x t , consisting of all input samples up to time t, using the SRM (1)- (2). The training loss L D (w) is an empirical estimate of the population loss based on the data samples in the training dataset D, and is given as Frequentist learning addresses the empirical risk minimization (ERM) problem Problem (5) cannot be directly solved using standard gradient-based methods since: (i) the spiking mechanism (1) is not differentiable in w due to the presence of the threshold function Θ(·); and (ii) in the case of binary SNNs, the domain of the weight vector w is the discrete set of binary values.
To tackle the former problem, as detailed in Sec. II-E2, surrogate gradients (SG) methods replace the derivative of the threshold function Θ(·) in (1) with a suitable differentiable approximation [37]. In a similar manner, for the latter issue, optimization over binary weights is conventionally done via the straight-through estimator (STE) [29], [57], which is covered in Sec. II-E3.
2) Surrogate Gradient: As discussed in the previous subsections, the gradient ∇ w L x,y (w) is typically evaluated via SG methods. SG techniques approximate the Heaviside function Θ(·) in (1) when computing the gradient ∇ w L x,y (w). Specifically, the derivative Θ (·) is replaced with the derivative of a differentiable surrogate function, such as rectifier or sigmoid. For example, with a sigmoid surrogate, given by function ). Using the loss decomposition in (3), the partial derivative of the training loss L x t ,yt (w) at each time instant t with respect to a synaptic weight w ij can be accordingly approximated as where the first term e i,t is the derivative of the loss at time t with respect to the output s i,t of post-synaptic neuron i at time t; and the third term can be directly computed from (2) as the filtered pre-synaptic trace of neuron j. For simplicity of notation, we have defined f m,t := f m (w, x t ) and omitted the explicit dependence of s i,t and u i,t on exogeneous inputs x t and synaptic weights w. The second term is the source of the approximation, as the derivative of the threshold function Θ (·) from (1), which is zero almost everywhere, is replaced using the derivative of the sigmoid function. At every time instant t = 1, . . . , T , using (6), the online update is obtained via stochastic gradient descent (SGD) as where η > 0 is a learning rate, and B ⊆ D is a mini-batch of examples (x, y) from the training dataset. Note that the sequential implementation of the update (7) over time t requires running a number of copies of the SNN model equal to the size of the mini-batch B. In fact, each input x, with (x, y) ∈ B, generally causes the spiking neurons to follow distinct trajectories in the space of the membrane potentials. Henceforth, when referring to online learning rules, we will implicitly assume that parallel executions of the SNN are possible when the mini-batch size is larger than 1.
The weight update in the direction of the negative gradients in (7) implements a standard three-factor rule. Three-factor rules generalize two-factor Hebbian updates such as STDP [58], and can be implemented on hardware with similar complexity [7], [8], [59]. In fact, the partial derivative (6) can be written as where we distinguish three terms. The first is the per-neuron error signal e i,t , which can be in principle computed via backpropagation through time [60]. In practice, this term is approximated, e.g. via local signals [6], or via random projections [7]. The latter technique has previously been likened to the biological mechanisms behind short-term memory [61]. We will discuss a specific implementation in Sec. III-B. The second contribution is given by the local post-synaptic term σ (u i,t − ϑ), which measures the current sensitivity to changes in the membrane potential of the neuron i. Finally, the last term is the local pre-synaptic trace α t * s j,t that depends on the activity of the neuron j.
3) Straight-Through Estimator: As mentioned in Sec. II-E1, optimization over binary weights can be carried out using STE [29], [57], which maintains latent, real-valued weights to compute gradients during training. Binary weights, obtained via quantization of the real-valued latent weights, are used as the next iterate. To elaborate, in addition to the binary weight vector w ∈ {+1, −1} |w| , we define the real-valued weight vector w r ∈ R |w|×1 . We use |w| to denote the size of vector w. With STE, gradients are estimated by differentiating over the real-valued latent weights w r , instead of discrete binary weights w, to compute the gradient ∇ w r L x t ,yt (w r )| w r =w . The technique can be naturally combined with the SG method, detailed in Sec. II-E2, to obtain the gradients with respect to the real-valued latent weights.
The training algorithm proceeds iteratively by selecting a mini-batch B of examples (x, y) from the training dataset D at each iteration as in (7). Accordingly, the real-valued latent weight vector w r is updated via online SGD as and the next iterate for the binary weights w is obtained by quantization as where the sign function is defined as sign(x) = +1 for x ≥ 0 and sign(x) = −1 for x < 0.

F. Bayesian Offline Learning
In this section, we describe the formulation of Bayesian offline learning, and then develop two Bayesian training algorithms for SNNs with real-valued and binary synaptic weights.
1) Information Risk Minimization: Bayesian learning formulates the training problem as the optimization of a probability distribution q(w) in the space of synaptic weights, which is referred to as the variational posterior. To this end, the ERM problem (5) is replaced by the information risk minimization (IRM) problem where ρ > 0 is a "temperature" constant, p(w) is an arbitrary prior distribution over synaptic weights, and KL(·||·) is the Kullback-Leibler divergence Algorithm 1: Bayesian offline learning with real-valued synapses 1: Input: dataset D, learning rate η, temperature parameter ρ, prior (m 0 , p 0 ) 2: Output: learned parameters pair (m, p) 3: initialize parameters (m 1 , p 1 ) 4: repeat 5: select mini-batch B ⊆ D 6: for each time-step t = 1, . . . , T do 7: sample weights w as w ∼ N (w|m t , P −1 t ).

8:
for each (x, y) ∈ B do 9: compute the gradient ∇ w L x t ,yt (w) locally at each synapse using SG (see Sec. II-E2). 10: end for 11: update the mean and precision parameters (m ij,t , p ij,t ) for all synapses (i, j) ∈ E as The objective function in IRM problem (11) is known as (variational) free energy [18]. The problem of minimizing the free energy in (11) must strike a balance between fitting the data -i.e., minimizing the first term -and not deviating too much from the reference behavior defined by prior p(w) -i.e., keeping the second term small. Note that with ρ = 0, the IRM problem (11) reduces to the ERM problem (5) in the sense that the optimal solution of the IRM problem with ρ = 0 is a distribution concentrated at the solution of the ERM problem (assuming that the latter is unique). The KL divergence term in (11) is hence essential to Bayesian learning, and it is formally justified as a regularizing penalty that accounts for epistemic uncertainty due to the presence of limited data in the context of PAC Bayes theory [15]. It can also be used as a model of bounded rationality accounting for the complexity of information processing [18].
If no constraints are imposed on the variational posterior q(w), the optimal solution of (11) is given by the Gibbs posterior Due to the intractability of the normalizing constant in (13), we adopt a mean-field variational inference (VI) approximation that limits the optimization domain for problem (11) to a class of factorized distributions (see, e.g., [19], [62]). More specifically, we focus on Gaussian and Bernoulli variational approximations, targeting SNN models with real-valued and binary synaptic weights, respectively, which are detailed in the rest of this section.
2) Gaussian Mean-Field Variational Inference: In this subsection, we derive a Gaussian mean-field VI algorithm that approximately solves the IRM problem (11) by assuming variational posteriors of the form q(w) = N (w|m, P −1 ), where m is a mean vector and P is a precision diagonal matrix with positive vector p on the main diagonal. For the |w| × 1 weight vector w, the distribution of the parameters w is defined by the |w| × 1 mean vector m and |w| × 1 precision vector p = {p ij } (i,j)∈E with p ij > 0 for all (i, j) ∈ E. This variational model is well suited for real-valued synapses, which can be practically realized to the fixed precision allowed by the hardware implementation [63]. We choose the prior p(w) as the Gaussian distribution p(w) = N (w|m 0 , P −1 0 ) with mean m 0 and precision matrix P 0 with positive diagonal vector p 0 . We tackle the IRM problem (11) with respect to the so-called variational parameters (m, p) of the Gaussian variational posterior q(w) via the Bayesian learning rule [26]. The Bayesian learning rule is derived by applying natural gradient descent to the variational free energy F(q(w)) in (11). The derivation leverages the fact that the distribution q(w) is an exponentialfamily distribution with natural parameters λ = (P m, −1/2P ), sufficient statistics T = (w, ww T ) and mean parameters µ = (m, P −1 + mm T ). Updates to the mean m t and precision p t parameters at iteration t can be obtained as [26], [42] where where η > 0 is a learning rate; B ⊆ D is a mini-batch of examples (x, y) from the training dataset; and q t (w) = N (w|m t , P −1 t ) is the variational posterior at iteration t with m t and p t . In practice, the updates (14)- (15) are estimated by evaluating the expectation over distribution q t (w) via one or more randomly drawn samples w ∼ q t (w). Furthermore, the gradients ∇ w L x t ,yt (w) can be approximated using the online SG method described in Sec. II-E2. The overall training algorithm proceeds iteratively by selecting a mini-batch B ⊆ D of examples (x, y) from the training dataset at each iteration, and is summarized in Algorithm 1. Note that, as mentioned in Sec. II-E2, the implementation of a rule operating with mini-batches requires running |B| SNN models in parallel, where |B| is the cardinality of the mini-batch. When this is not possible, the rule can be applied with mini-batches of size |B| = 1.
3) Bernoulli Mean-Field Variational Inference: In this subsection, we turn to the case of binary synaptic weights w ij ∈ {+1, −1}. For this setting, we adopt the variational posterior q(w) = Bern w|p , with where the |w| × 1 vector p = {{p ij } j∈Pi } i∈N defines the variational posterior, with p ij being the probability that synaptic weights w ij equals +1.
The variational posterior (16) can be reparameterized in terms of the mean parameters µ = {{µ ij } j∈Pi } i∈N as by setting p ij = (µ ij + 1)/2, where 1 is the all-ones vector. It can also be expressed in terms of the logits, or natural parameters, w r = {{w r ij } j∈Pi } i∈N as q(w) = Bern w|σ(2w r ) by setting for all (i, j) ∈ E. The notation w r has been introduced to suggest a relationship with the STE method described in Sec. II-E3, as defined below. We assume that the prior distribution p(w) also follows a mean-field Bernoulli distribution of the form p(w) = Bern(w|σ(2w r 0 )), for some vector of w r 0 logits. For example, setting w r 0 = 0 indicates that the binary weights are equally likely to be either +1 or −1 a priori.
In a manner similar to the case of Gaussian VI developed in the previous subsection, we apply natural gradient descent to minimize the variational free energy in (11) with respect to the variational parameters w r defining the variational posterior q(w). Following [25], and applying the online SGD rule detailed in Sec. II-E2, this yields the update where η > 0 is a learning rate and q t (w) the variational posterior with w r t and µ t related through (18). Note that the gradient in (19) is with respect to the mean parameters µ t .
In order to estimate the gradient in (19), we leverage the reparameterization trick via the Gumbel-Softmax (GS) distribution [25], [64]. Accordingly, we first obtain a sample w that is approximately distributed according to q t (w) = Bern w|σ(2w r t ) . This is done by drawing a vector δ = {{δ ij } j∈Pi } i∈N of i.i.d. Gumbel variables, and computing where τ > 0 is a parameter. When τ in (20) tends to zero, the tanh(·) function tends to the sign(·) function, and the vector w follows distribution q t (w) [25]. To generate δ, one can set δ ij = 1 2 log ij 1− ij , with ij ∼ U(0, 1) being i.i.d. samples.

8:
for each (x, y) ∈ B do 9: compute the gradient ∇ w L x t ,yt (w) locally at each synapse using SG (see Sec. II-E2). 10: end for 11: update the real-valued weights w r ij,t for all synapses (i, j) ∈ E as With this sample, for each example (x, y), we then obtain an approximately unbiased estimate of the gradient in (19) by using the following approximation where the approximate equality (a) is exact when τ → 0 and the equality (b) follows the chain rule. We note that the gradient ∇ w L x t ,yt (w) can be computed as detailed in Sec. II-E2.
As summarized in Algorithm 2, the resulting rule proceeds iteratively by selecting a mini-batch B of examples (x, y) from the training dataset D at each iteration. Using the samples w ij from (20), we obtain at every time-step t the estimate of the gradient (19) as This is unbiased when the limit τ → 0 holds.

G. Frequentist Continual Learning
We now consider a continual learning setting, in which the system is sequentially presented datasets D (1) , D (2) , . . . corresponding to distinct, but related, learning tasks. Applying a frequentist approach, at every subsequent task k, the system minimizes a new objective based on dataset D (k) in order to update the model parameter vector w, where we have used superscript (k) to denote the quantities corresponding to the kth task. We first describe an algorithm based on coresets and regularization [65]. Then, we briefly review a recently proposed biologically inspired rule.
1) Regularization-based Continual Learning: In a similar manner to (4), let us first define as the training loss evaluated on dataset D (k) for the kth task. A general formulation of the continual learning problem in a frequentist framework is then obtained as the minimum of the objective where L C (k ) (w) is the training loss evaluated on a coreset, that is, a subset C (k ) ⊂ D (k ) of examples randomly selected from a previous task k < k and maintained for use when future tasks are encountered; α ≥ 0 determines the strength of the regularization; and R(w, {w (k ) } k−1 k =1 ) is a regularization function aimed at preventing the current weights from differing too much from previously learned weights {w (k ) } k−1 k =1 , hence mitigating the problem of catastrophic forgetting [66]. A popular choice for the regularization function, yielding the Elastic Weight Consolidation (EWC) method, proposes to estimate the relative importance of synapses for previous tasks via the Fisher information matrices (FIM) computed on datasets k < k [31]. This corresponds to the choice of the regularizer where F (k) (w) = diag (x,y)∈D (k) (∇ w L x,y (w)) 2 is an approximation of the FIM estimated on dataset D (k) . The square operation in vector (∇ w L x,y (w)) 2 is evaluated element-wise. Intuitively, a larger value of an entry in the diagonal of the matrix F (k) (w) indicates that the corresponding entry of the vector w plays a significant role for the kth task.
2) Biologically Inspired Continual Learning: The authors of [34] introduce a biologically inspired, frequentist, continual learning rule for SNNs, which we briefly review here. The approach operates online in discrete time t, and implements the mechanisms described in Sec. II-D. It considers a leaky integrate-and-fire (LIF) neuron model. The LIF is a special case of the SRM (1)- (2) in which the synaptic response α implemented as the alpha-function spike response α t = exp(−t/τ mem ) − exp(−t/τ syn ) and the exponentially decaying feedback filter β t = − exp(−t/τ ref ) for t ≥ 1 with some positive constants τ mem , τ syn , and τ ref . This choice enables an autoregressive update of the membrane potential [7], [67].
A metaplasticity parameter ν ij is introduced for each synapse (i, j) ∈ E that determines the degree to which the synapse is prone to change. This quantity is increased by a fixed step ∆ν as ν ij,t+1 ← ν ij,t + ∆ν (26) when the pre-and post-synaptic neurons spiking rates, i.e., the spiking rate of neuron i and j, respectively, pass a predetermined threshold. Furthermore, each synapse (i, j) ∈ E maintains a reference weight w ref ij to mimic heterosynaptic plasticity by adjusting the weight updates to drive synaptic weights towards this resting state. It is updated over time as where κ > 0, and serves as a reference to implement heterosynaptic plasticity. With these definitions, the update of each synaptic weight w is computed according to the online learning rule where η and γ are respectively learning and decay rates, and e i,t is an error signal from neuron i (see [34] for details). The rule (28) takes a form similar to that of three-factor rules (8), with the term e i,t · s j,t · σ (u i,t − ϑ) evaluating the product of error, post-synaptic, and pre-synaptic signals. The update (28) implements metaplasticity via the term exp − |ν ij · w ij,t | that decreases the magnitude of the updates during the training procedure for active synapses. It also accounts for heterosynaptic plasticity thanks to the term (w ij,t − w ref ij,t ), which drives the updates towards learned "resting" weight w ref ij,t when the presynaptic neuron is active.

H. Bayesian Continual Learning
In this section, we generalize the Bayesian formulation seen in Section II-F from the offline setting to continual learning.

1) Bayesian Continual Learning:
To allow the adaptation to task k without catastrophic forgetting, we consider the problem [41] min q (k) (w) F (k) q (k) (w) (29) of minimizing the free energy metric which combines the IRM formulation (11) with the use of coresets. Minimizing the free energy objective (30) must strike a balance between fitting the new training data D (k) , as well as the coresets {C (k ) } k−1 k =1 from the previous tasks, while not deviating too much from previously learned distribution q (k−1) (w). Comparing (30) with the free energy (11), we observe that the distribution q (k−1) (w) plays the role of prior for the current task k.
2) Continual Gaussian Mean-Field Variational Inference: Similarly to the approach for offline learning described in Sec. II-F, we first assume a Gaussian variational posterior q(w), and address the problem (30) via natural gradient descent. To this end, we adopt the variational posterior q (k) (w) = N (w|m (k) , (P (k) ) −1 ), with mean vector m (k) and diagonal precision matrix P (k) with positive diagonal vector p (k) of size |w| × 1 for every task k. We choose the prior p(w) for dataset D 1 as the Gaussian distribution p(w) = N (w|m 0 , P −1 0 ) with positive diagonal vector p 0 of size |w|×1. Applying the Bayesian learning rule [26] as in Sec. II-F2, updates to the mean and precision parameters can be obtained via online SGD as where mini-batch B is now selected at random from both dataset D (k) and coresets from previous tasks, i.e., B ⊆ D (k) ∪ k k =1 C (k ) . The rule can be directly derived by following the steps detailed in Sec. II-F2, and using for prior at every task k the mean m (k−1) and precision P (k−1) obtained at the end of training on the previous task.
3) On the Biological Plausibility of the Bayesian Learning Rule: The continual learning rule (31)-(32) exhibits some of the mechanisms thought to enable memory retention in biological brains as described in Sec. II-D. In particular, synaptic consolidation and metaplasticity for each synapse (i, j) ∈ E are modelled by the precision p ij . In fact, a larger precision p ij,t+1 effectively reduces the step size 1/p ij,t+1 of the synaptic weight update (32). This is a similar mechanism to the metaplasticity parameter ν ij,t introduced in the rule (28). Furthermore, by (31), the precision p ij is increased to a degree that depends on the relevance of the synapse (i, j) ∈ E as measured by the estimated FIM (∂L x t ,yt (w)/∂w ij ) 2 for the current mini-batch B of examples.
Heterosynaptic plasticity, which drives the updates towards previously learned and resting states to prevent catastrophic forgetting, is obtained from first principles via the IRM formulation with a KL regularization term, rather than from the addition of the reference weight w ref in the previous work [34]. This mechanism drives the updates of the precision p Finally, the use of coresets implements a form of replay, or reactivation, in biological brains [68]. 4) Continual Bernoulli Mean-Field Variational Inference: We now consider continual learning with a Bernoulli mean-field variational posterior, and force the synaptic weight w ij to be binary, i.e., w ij ∈ {+1, −1}. Following (16), the posterior is of the form q (k) (w) = Bern w|p (k) .
We leverage the Gumbel-softmax trick, and use the reparametrization in terms of the natural parameters at task k w r,(k) ij We then apply the Bayesian learning rule, and, following the results obtained in the offline learning case of Sec. II-F3, we obtain the learning rule at task k as where we denote as w r,(k−1) the logits obtained at the end of the previous task k − 1, and mini-batch B is selected at random as B ⊆ D (k) ∪ k k =1 C (k ) .

III. EXPERIMENTS
In this section, we compare the performance of frequentist and Bayesian learning schemes in a variety of experiments, using both synthetic and real neuromorphic datasets. All experiments consist of classification tasks with C classes. In each such task, we are given a dataset D consisting of spiking inputs x and label c x ∈ {0, 1, . . . , C − 1}. Each pair (x, c x ) is converted into a pair of spiking signals (x, y) to obtain the dataset D. To do this, the target signals y are such that each sample y t is the C × 1 one-hot encoding vector of label c x for all time-steps t = 1, . . . , T .
A. Datasets 1) Two-moons dataset: We first consider an offline 2D binary classification task on the two-moons dataset [69]. Training is done on 200 examples per class with added noise with standard deviation σ = 0.1 as proposed in reference [25] for 100 epochs. The inputs x are obtained via population encoding following [67] over T = 100 time-steps and via 10 neurons.
2) DVS-Gestures: Next, we consider a real-world neuromorphic dataset for offline classification, namely the DVS-Gestures dataset [70]. The dataset comprises 11 classes of hand movements, captured with a DVS camera. Movements are recorded from 30 different persons under 5 lighting conditions. To evaluate the calibration of Bayesian learning algorithms, we obtain in-and out-of-distribution dataset D id and D ood by partitioning the dataset by users and lighting conditions. We selected the first 15 users for the training set, while the remaining 15 users are used for testing. The first 4 lighting conditions are used for in-distribution testing; and the one left out from the training set is used for out-of-distribution testing. Images are of size 128 × 128 × 2, and preprocessed following [70] to yield inputs of size 32 × 32 × 2, with sequences of length 500 ms for training and 1500 ms for testing, with a sampling rate of 10 ms.
3) Split-MNIST and MNIST-DVS: For continual learning, we first conduct experiments on the 5-ways split-MNIST dataset [34], [41]. Examples from the MNIST dataset, of size 28 × 28 pixels, are hence rate-encoded over T = 50 time-steps [67], and training examples drawn from subsets of two classes are successively presented to the system for training. The order of the pairs is selected as {0, 1}, then {2, 3}, and so on. We restrict here our study to rate encoding, although the proposed methods are applicable to any spike encoding scheme. In a similar way, we also consider a neuromorphic continual learning setting based on the neuromorphic counterpart to the MNIST dataset, namely, the MNIST-DVS dataset [71]. Following the preprocessing adopted in references [72]- [74], we cropped images spatially to 26 × 26 pixels, capturing the active part of the image, and temporally to a duration of 2 seconds. For each pixel, positive and negative events are encoded as (unsigned) spikes over two different input channels, and the input x is of size 1352 spiking signals. Uniform downsampling over time is then carried out to restrict the length to T = 80 time-samples. The training dataset is composed of 900 examples per class, and the test dataset contains 100 examples per class. For continual learning, classes are presented to the network in pairs by following the lexicographical order, i.e., the classes {0, 1} are presented first, then {2, 3}, and so on.

B. Implementation
All schemes are implemented using the SG technique DECOLLE [7] to compute the gradients. In DECOLLE, the SNN is organized into L layers, with the first L − 1 layers encompassing the hidden neurons in set H, and the Lth layer containing the read-out neurons in set Y. To evaluate the partial derivative (8), we need to specify how to compute error signals e i,t for each neuron i ∈ N . To this end, at each time t, the spiking outputs s where B (l) ∈ R C×|l| are random, fixed weights, |l| is the cardinality of layer l, and Softmax m (a) = exp(a m )/ 1≤n≤C exp(a n ) is the ith element of the softmax of vector a with length C. The local losses (35) at every time-step t are then used to compute the error signals e i,t in (8) for every neuron i ∈ l as While the algorithms introduced in this work are valid for any SNN architecture as highlighted in Fig. 1, DECOLLE is limited to feedforward layered architectures, which we hence adopt for our experiments [7]. Furthermore, we consider autoregressive filters for the spike responses of synapses α t and somas β t in the membrane potential (2), as discussed in Sec. II-A1.
Results have been obtained by using Intel's Lava software framework [46], under Loihi-compatible fixed-point precision [63] 1 . We use as benchmark the frequentist algorithms detailed in Sec. II-E and II-G, for which gradients are as described in the previous paragraph. For Bayesian learning with real-valued (fixed-precision) synapses, we set the threshold of each neuron as ϑ = 64; while for binary synapses the threshold ϑ is selected as the square-root of the fan-in of the corresponding layer.
Implementation of the proposed methods on hardware is left for future work. While Loihi supports the injection of Gaussian noise to the membrane potential of the neurons [63], it does not provide mechanisms for the sampling of the model parameters. In contrast, recent work [75] has proposed leveraging the inherent noise of nanoscale devices in order to implement Bayesian inference.

C. Performance Measures
Apart from the test accuracy, performance metrics include calibration measures, namely reliability diagrams and expected calibration error (ECE), which are described next. We note that, as the hardware implementation of Bayesian SNNs is currently an open problem (see Sec. III-B), we are unable to provide measurements in terms of energy expenditure and computation time. As a general remark, as discussed in Sec. II-B, Bayesian learning requires a larger memory to store all samples for the weights distribution to be used for inference using a committee machine implementation, while an ensemble predictor implementation increases inference latency.
1) Confidence levels: For frequentist learning, predictive probabilities are obtained from a single pass through the network with parameter vector w as where f (w, x t ) is the output of read-out layer L for weights w, as detailed in the previous subsection. In contrast, for Bayesian learning, decisions and confidence levels are obtained by drawing N S samples {w s } N S s=1 from the distribution q(w), and by averaging the read-out outputs of the model to obtain the probability assigned to each class as Unless mentioned otherwise, the predictions (38) are obtained by using the committee machine approach, and hence the weights {w s } N S s=1 are kept fixed for all test inputs x (see Sec. II-B). All results presented are averaged over three repetitions of the experiments and 10 draws from the posterior distribution q(w), i.e., we set N S = 10 in all experiments.
For Bayesian learning, the hard prediction of the model is hence obtained as with 1(·) being the indicator function; while the average empirical confidence of the predictor for bin m is defined as Reliability diagrams plot the per-bin accuracy acc(B m ) versus confidence level conf(B m ) across all bins m. A model is said to be perfectly calibrated when, for all bins m, the equality acc(B m ) = conf(B m ) holds. If in the mth bin, the empirical accuracy and empirical confidence are different, the predictor is considered to be over-confident when the inequality acc(B m ) < conf(B m ) holds, and under-confident when the reverse inequality acc(B m ) > conf(B m ) holds.
3) Expected calibration error (ECE): While reliability diagrams offer a fine-grained description of calibration, the ECE provides a scalar measure of the global miscalibration of the model. This is done by computing the average difference between per-bin confidence and accuracy as [12] Models with a lower ECE are considered to be better calibrated. 4) Out-of-distribution empirical confidence: Reliability diagrams and ECE assume that the test data follows the same distribution as the training data. A well-calibrated model is also expected to assign lower probabilities to out-of-distribution data, i.e., data that does not follow the training distribution [76]. To gauge the capacity of a model to recognize out-of-distribution data, a common approach is to plot the histogram of the predictive probabilities p c * x |x, {w s } N S s=1 x∈Dood evaluated on a dataset D ood of out-of-distribution examples [20], [76]. Such examples may correspond, as discussed, to examples recorded in different lighting conditions with a neuromorphic camera.
D. Offline Learning 1) Two-moons dataset: We start by considering the two-moons dataset. For this experiment, the SNN comprises two fully connected layers with 256 neurons each. Bayesian learning is implemented with different values of the temperature parameter ρ in the free energy (11). In Fig. 4, triangles indicate training points for a class "0", while circles indicate training points for a class "1". The color intensity represents the predictive probabilities (37) for frequentist learning and (38) for Bayesian learning:  the more intense the color, the higher the prediction confidence determined by the model. Bayesian learning is observed to provide better calibrated predictions, that are more uncertain outside the input regions covered by training data points. For both real-valued and binary synapses, the temperature parameter ρ has an important role to play in preventing overfitting and underfitting of the training data, while also enabling uncertainty quantification. When the parameter ρ is too large, the model cannot fit the data correctly, resulting in inaccurate predictions; while when ρ is too small, the training data is fit more tightly, leading to a poor representation of the prediction uncertainty outside the training set. A well-chosen value of ρ strikes the best trade-off between faithfully fitting the training data and allowing for uncertainty quantification. Frequentist algorithms, obtained in the limit when ρ → 0, yield the most over-confident estimates.
2) DVS-Gestures: We now turn to the DVS-Gestures dataset, for which we plot the performance for real-valued and binaryvalued SNNs, in terms of accuracy, reliability diagrams [76], and ECE [12] in Fig. 5 and Fig. 6. In all cases, the SNNs have two fully connected layers comprising 4096 neurons each, and they are trained for 200 epochs. The architecture was chosen to highlight the benefits of Bayesian learning over frequentist learning in regimes characterized by epistemic uncertainty, and it was not optimized for maximal accuracy. The figures confirm that Bayesian SNNs generally produce better calibrated outcomes. In fact, reliability diagrams (top rows) demonstrate that frequentist learning algorithms produce overconfident decisions, while Bayesian learning outputs decisions whose confidence levels match well the test accuracies. This improvement is reflected, for models with real-valued synapses (with fixed precision), in a lower ECE of 0.064, as compared to 0.088 for frequentist SNNs; while, for binary SNNs, the reduction in ECE is from 0.085 for frequentist learning, to 0.069 for Bayesian learning. This benefit may come at the cost of a slight decrease in terms of accuracy, which is only observed here for binary synapses. The bottom parts of Fig. 5 and Fig. 6 also show that frequentist learning tends to output high-confidence decisions with a larger probability. We now turn to evaluate the performance in terms of robustness to out-of-distribution data. As explained in Sec. III-C, to this end, we evaluate the histogram of the confidence levels produced by frequentist and Bayesian learning, as shown in Fig. 7. From the figure, it is remarked that Bayesian learning correctly provides low confidence levels on out-of-distribution data, while frequentist learning outputs decisions with confidence levels similar to the case of in-distribution data, which are shown in Fig. 5 and Fig. 6.
This point is further illustrated in Fig. 8 by showing the three largest probabilities assigned by the different models on selected examples, considering real-valued synapses in the top row and binary synapses in the bottom row. In the left column, we observe that, when both models predict the wrong class, Bayesian SNNs tend to do so with a lower level of certainty, and typically rank the correct class higher than their frequentist counterparts. Specifically  models with both real-valued and binary synapses rank the correct class second, while the frequentist models rank it third. Furthermore, as seen in the middle column, in a number of cases, the Bayesian models manage to predict the correct class, while the frequentist models predict a wrong class with high certainty. Finally, in the right column, we show that even when frequentist models predict the correct class and Bayesian models fail to do so, they still assign lower probabilities (i.e., < 50%) to the predicted class.
A key advantage of SNNs is the possibility to obtain intermediate decisions during the observation of the T samples of a test example. To elaborate on this aspect, Fig. 9 reports the evolution of the mean test accuracy, ECE, and predictive probabilities (38)- (37) for all examples in the out-of-distribution dataset as a function of the discrete time-steps t = 1, 2, . . . , T . Although both Bayesian and frequentist methods show similar improvements in accuracy over time, frequentist algorithms remain poorly calibrated, even after the observation of many time samples. The bottom plots show that frequentist learning tends to be more confident in its decisions, especially when a few samples t have been observed. On the contrary, Bayesian algorithms offer better calibration and confidence estimates, even when only part of the input signal x has been observed.

E. Continual Learning
We now turn to continual learning benchmarks. Starting with the rate encoded MNIST dataset, we use coresets representing 7.5% of randomly selected training examples for each class. To establish a fair comparison with the protocol adopted in reference [34], we train SNNs comprising a single layer with 400 neurons for one epoch on each subtask. This choice was found to be advantageous for Bayesian techniques -a result that may be related to the known asymptotic behavior of Bayesian neural networks as non-parametric models [77]. In Table I  on the last task, as well as the average ECE at that point for real-valued synapses, enabling a comparison with [34]. Bayesian continual learning is seen to achieve the best accuracy and calibration across all the methods studied here, including the solution introduced in [34]. The latter incurs a 2.5× memory overhead as compared to standard frequentist methods. Considering that we performed training using the 8-bit precision imposed by the neuromorphic chip Loihi, our solution outperforms the stateof-the-art with a 5× memory consumption improvement. This saving can be leveraged, e.g., to store several samples of the weights for a committee machine implementation.
Next, for the MNIST-DVS dataset [71], we use coresets representing 10% of randomly selected training examples for each class, and implement multilayer SNNs with 2048 − 4096 − 4096 − 2048 − 1024 neurons per layer, that we train on each subtask for 100 epochs. This task requires a larger architecture and longer training time to allow for the processing of the richer spatio-temporal information recorded by neuromorphic cameras, as compared to the spatial information from static image datasets, such as MNIST, encoded into spikes via rate encoding [67].
We highlight the requirement for a larger architecture on the MNIST-DVS dataset in Fig. 11 Fig. 10). The horizontal bar represents the median value across the tasks, while the box extends from the first to the third quartile. The whiskers extend from the box by 1.5 times the inter-quartile range. Circles represent outliers.
of the mean parameter m at the end of training on the MNIST and MNIST-DVS datasets. For the larger network trained on the MNIST-DVS dataset, 83.5% of the mean parameters are non-zero, a larger proportion than that of the network trained on the MNIST dataset, for which only 80.1% of the mean weight parameters are non-zero. This demonstrates that the larger number of weights used for this task is important for the network to perform well. In Fig. 10, we show the evolution of the test accuracy and ECE on all tasks, represented with lines of different colors, during training. The performance on the current task is shown as a thicker line. We consider frequentist and Bayesian learning, with both real-valued and binary synapses. With Bayesian learning, the test accuracy on previous tasks does not decrease excessively when learning a new task, which shows the capacity of the technique to tackle catastrophic forgetting. Also, the ECE across all tasks is seen to remain more stable for Bayesian learning as compared to the frequentist benchmarks. For both real-valued and binary synapses, the final average accuracy and ECE across all tasks show the superiority of Bayesian over frequentist learning.
This point is further elaborated in Fig. 12, which shows test accuracy and ECE on all tasks at the final epoch -the 500thin Fig. 12. Bayesian learning can be seen to offer a better test accuracy and ECE on average across tasks, as well as a lower dispersion among tasks.

IV. CONCLUSION
In this work, we have introduced a Bayesian learning framework for SNNs with both real-valued and binary-valued synapses. Bayesian learning is particularly well suited for applications characterized by limited data -a situation that is likely to be encountered in use cases of neuromorphic computing such as edge intelligence. We have demonstrated the benefits of Bayesian learning in terms of calibration metrics that gauge the effectiveness of uncertainty quantification over a variety of offline and continual learning. We have also argued that the proposed rules exhibit mechanisms resembling those that enable lifelong learning in biological brains from a theoretically motivated information risk minimization framework. While this work focused on variational inference Bayesian learning methods, future research may explore Monte-Carlo based solutions. Finally, we recall the importance of investigating solutions for hardware design, adopting either ensemble predictors or committees of machines. As an example, consider ensemble predictions based on binary synapses. An implementation based on digital hardware would need to store the real-valued parameters of the parameter vector distribution, and to sample from the distribution using auxiliary circuitry, which incurs energy and memory overheads. Alternatively, one could leverage the inherent stochasticity of analog hardware for sampling [75], a line of research that we reserve for future work.

AUTHOR CONTRIBUTIONS
OS first proposed to train SNNs via Bayesian learning, HJ derived the rule for binary synapses, and NS extended it to real-valued synapses. NS designed and implemented the experiments. NS, HJ, and OS wrote the text.

FUNDING
This study received funding from Intel Labs through the Intel Neuromorphic Research Community (INRC). The work of O. Simeone has also been supported by the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Programme (Grant Agreement No. 725731). The funders were not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication. All authors declare no other competing interests.

DATA AVAILABILITY STATEMENT
The datasets analyzed for this study are freely available and can be found at references [69], [71], [78].