Noise Helps Optimization Escape From Saddle Points in Synaptic Plasticity

Numerous experimental studies suggest that noise is inherent in the human brain. However, the functional importance of this noise remains unknown. In particular, from a computational perspective, such stochasticity is potentially harmful to brain function. In machine learning, a large number of saddle points are surrounded by high-error plateaus and give the illusion of local minima. Being trapped at saddle points can therefore dramatically impair learning, while adding noise attacks the saddle-point problem in high-dimensional optimization, especially under the strict saddle condition. Motivated by these arguments, we propose a biologically plausible noise structure and demonstrate that noise can efficiently improve the optimization performance of spiking neural networks trained by stochastic gradient descent. We derive the strict saddle condition for synaptic plasticity and show that, under this condition, noise helps optimization escape from saddle points in high-dimensional domains. The theoretical results explain the stochasticity of synapses and guide us in how to exploit noise. In addition, we provide biological interpretations of the proposed noise structure from two perspectives: one based on the free energy principle in neuroscience, and another based on observations from in vivo experiments. Our simulation results show that, in both the learning and test phases, the accuracy of synaptic sampling with noise is almost 20% higher than without noise on a synthetic dataset, and the gain in accuracy from noise is at least 10% on the MNIST and CIFAR-10 datasets. Our study provides a new learning framework for the brain and sheds new light on deep noisy spiking neural networks.

The objective is the gradient of $\sum_n \Theta\{h^n_k\}\log p(z^n = k \mid x^n, w)$. Assuming that

$$A_{ki} = \mathrm{Poisson}(x^n_i \mid \alpha e^{w_{ki}}) = \frac{(\alpha e^{w_{ki}})^{x^n_i}\, e^{-\alpha e^{w_{ki}}}}{x^n_i!},$$

we get $p_N(x^n \mid z^n = k, w) = \prod_j \mathrm{Poisson}(x^n_j \mid \alpha e^{w_{kj}}) = \prod_j A_{kj}$. Therefore,

$$\frac{\partial \log p(x^n \mid z^n = k, w)}{\partial w_{ki}} = x^n_i - \alpha e^{w_{ki}},$$

and we further obtain the gradient of $\Theta\{h^n_k\}\log p(x^n \mid z^n = k, w)$. In the spike-based Winner-Take-All network, given the input $x^n$, the firing rate of neuron $z_k$ is proportional to the posterior distribution $p(h^n = k \mid x^n, w)$. In simulation, the network generates spike trains $S_k(t)$ with rate $\rho_k(t)$ for neuron $z_k$. The second derivative is

$$\frac{\partial^2 \log p(x^n \mid z^n = k, w)}{\partial w_{ki}^2} = -\alpha e^{w_{ki}}.$$

Due to the Gaussian property $p\{|x-\mu|<\sigma\} = 0.6826$ and $w_{ki} e^{w_{ki}} \sqrt{e^{w_{ki}}}\big|_{w_{ki}\to 0} = 0$, it is plausible to take $\big(\sum_n \alpha w_{ki} e^{w_{ki}} S_k(t) - \Theta\{h^n_k\}\big)\,dt$ as the general characteristic of the noise distribution $dW_{ki}$. We then obtain the Hessian information with the noise and, using the fact that the gradient is zero at a stationary point, find that the trace of the Hessian matrix consists of three terms. Whether the trace is positive will be analyzed in two steps.
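The Poisson gradient above can be sanity-checked numerically. Below is a minimal Python sketch (the helper names `log_poisson` and `analytic_grad` are ours, not from the paper) comparing the analytic derivative $x^n_i - \alpha e^{w_{ki}}$ against a central finite difference:

```python
import math

# Illustrative check (not from the paper): log-likelihood of a Poisson
# observation with rate lambda = alpha * e^w, and its gradient w.r.t. w,
# which reduces to x - alpha * e^w.
def log_poisson(x, w, alpha=1.0):
    lam = alpha * math.exp(w)
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def analytic_grad(x, w, alpha=1.0):
    return x - alpha * math.exp(w)

# central finite difference should agree with the analytic gradient
x, w, h = 3, 0.5, 1e-6
numeric = (log_poisson(x, w + h) - log_poisson(x, w - h)) / (2 * h)
```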
First, we show that the second term is much smaller than the third. When the network is stable, i.e., the gradient is zero, then by considering a large number of samples we obtain the stationarity condition. Under the theoretically optimal STDP learning rule, the corresponding equation can be derived (Habenschuss et al., 2013; Nessler et al., 2009). From these two points it follows that the second term is much smaller than the third.

Second, we discuss whether the third term is positive. For each sample in the Winner-Take-All network, the third term can be split into two parts: one part represents the neuron $z_k$ that actually fires, and the other represents the expected neuron $z_{label}$. According to the theoretically optimal STDP learning rule (Eq. S11), the expression becomes the approximate difference between the actual and expected neurons. When the network is trapped at a saddle point, the neuron that fires is not the expected one, so the membrane potential of the actual neuron is higher than that of the expected neuron. As a result, the third term is always positive when the network is trapped at a saddle point.
In summary, as long as the third term is large enough, the first and second terms can be ignored, the trace of the Hessian matrix is positive, and the strict saddle property is satisfied.
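The role of noise at a strict saddle can be illustrated on a toy objective. Below is a minimal sketch, using the classic example $f(x,y)=x^2-y^2$ (our choice, not the paper's network), where plain gradient descent started at the saddle never moves, while noisy gradient descent escapes along the negative-curvature direction:

```python
import random

# Toy illustration (not the paper's network): f(x, y) = x^2 - y^2 has a
# strict saddle at the origin -- the Hessian eigenvalue along y is -2.
def grad(x, y):
    return 2.0 * x, -2.0 * y

def descend(noise_std, steps=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0                      # start exactly at the saddle
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= lr * gx - noise_std * rng.gauss(0.0, 1.0)
        y -= lr * gy - noise_std * rng.gauss(0.0, 1.0)
    return x, y

_, y_plain = descend(noise_std=0.0)      # gradient is zero: never moves
_, y_noisy = descend(noise_std=0.01)     # noise kicks it onto the escape direction
```

Any perturbation along $y$ is amplified by the negative curvature, so even small noise drives the iterate far from the saddle, while the noiseless run stays stuck.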

PROOF OF THEOREM 2
Under the condition $x \mid \theta \sim \mathcal{N}(\mu, \sigma^2)$ and noise $\varepsilon \sim \mathcal{N}(0, 1)$, the aim is to obtain $p(\varepsilon \mid \theta)$. By introducing the hidden variable $x$, we can write $p(\varepsilon \mid \theta)$ as an integral over $x$. Given the input $x$, the distribution of the noise $\varepsilon$ can be regarded as that of a mixture of noise and input. The exponential term can then be rearranged: the second term is quadratic in $x$ and is in fact a constant after integrating over $x$, while the numerator and denominator of the last term are of the same order in $\varepsilon$ and can also be regarded as constant. We also verified in Matlab simulations the approximation between $e^{-\varepsilon^2/2\sigma^2}$ and $e^{-\varepsilon^2/2(\sigma^2-\varepsilon^2)}$; the results show that the two functions are nearly the same. We finally obtain

$$p(\varepsilon \mid \theta) \propto e^{-\varepsilon^2/2\sigma^2}. \quad \text{(S19)}$$

Previous work (Habenschuss et al., 2013; Kappel et al., 2015) shows that in spike-based WTA networks, one prominent motif of cortical microcircuits, $p(x \mid \theta)$ is the integral of $N$ Poisson distributions with mean $\alpha e^{w_{ki}}$, which approximates the normal distribution $\mathcal{N}(N\alpha e^{w_{ki}}, N\alpha e^{w_{ki}})$. Therefore, we obtain the desired result.
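The closeness of the two exponential functions can be checked in a few lines. The paper reports a Matlab simulation; the following Python sketch performs an analogous comparison over $|\varepsilon| \le \sigma/2$ (the range and grid are our assumptions):

```python
import math

# Numerical check of the approximation: compare
# f1(e) = exp(-e^2 / (2 sigma^2)) and f2(e) = exp(-e^2 / (2 (sigma^2 - e^2)))
# on a grid with |e| <= sigma / 2 (range chosen for illustration).
sigma = 1.0
eps_grid = [i / 100.0 for i in range(-50, 51)]
max_diff = max(
    abs(math.exp(-e * e / (2.0 * sigma ** 2))
        - math.exp(-e * e / (2.0 * (sigma ** 2 - e * e))))
    for e in eps_grid
)
# max_diff stays below 0.04 on this range, so the two curves nearly coincide
```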

BACK-PROPAGATION FOR SYNAPTIC SAMPLING ON THE THREE-LAYER NETWORK.
In this section we derive learning rules based on back-propagation for synaptic sampling. As shown in Fig. S1, we add one hidden layer $z^n$ to the two-layer network. In the two-layer model, given $y^n = k$, the input $x^n_i$ follows a Poisson distribution whose mean is modulated by the synaptic weight $w_{ki}$, and the firing rate is proportional to the posterior probability $p(z^n = k \mid x^n, w)$. In the three-layer network, this relationship becomes a product of Poisson distributions. Assuming that $A_{kj} = \mathrm{Poisson}(y^n_j \mid \alpha e^{w_{kj}})$ and $B_{ji} = \mathrm{Poisson}(x^n_i \mid \alpha e^{w_{ji}})$, one obtains the posterior probability of the corresponding spiking neuron $k$, and the likelihood function follows.

Figure S1. A three-layer neural network diagram.
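To make the product-of-Poissons posterior concrete, here is a minimal Python sketch for the two-layer case (a uniform prior over $k$ is assumed, and `posterior` is an illustrative helper, not from the paper) that evaluates $p(z=k \mid x, w) \propto \prod_i \mathrm{Poisson}(x_i \mid \alpha e^{w_{ki}})$ in log space:

```python
import math

# Illustrative two-layer posterior (names are ours, not the paper's):
# p(z = k | x, w)  ∝  prod_i Poisson(x_i | alpha * e^{w_ki}),
# evaluated via log-sum-exp with a uniform prior over k.
def log_poisson(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def posterior(x, W, alpha=1.0):
    log_lik = [sum(log_poisson(xi, alpha * math.exp(wki))
                   for xi, wki in zip(x, row))
               for row in W]
    m = max(log_lik)                       # shift for numerical stability
    unnorm = [math.exp(v - m) for v in log_lik]
    total = sum(unnorm)
    return [v / total for v in unnorm]

probs = posterior([3, 0, 1], [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]])
```

The first weight row puts high rate on the dimension where the input spikes most, so the first neuron's posterior dominates.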
Starting from the objective $\Theta\{h^n_k\}\log p(z^n = k \mid x^n, w)$, the derivative with respect to the synaptic weight $w_{kj}$ of the second layer is obtained from Eq. S21 by differentiating the numerator and denominator. When $M$ is large enough, taking the logarithm of the numerator and denominator yields the learning rule for the second layer. The derivative with respect to the synaptic weight $w_{ji}$ of the first layer factors into two parts.

First, we derive the second part, $\partial z^n_j / \partial w_{ji}$. In spike-based neural networks, the firing rate of stochastic spiking neurons depends exponentially on the membrane voltage; it has been proposed that this exponential relationship between membrane potential and firing rate is a good approximation to the firing characteristics of cortical pyramidal neurons (Jolivet et al., 2006). The membrane voltage $u_j(t)$ of neuron $j$ in the two-layer network determines the corresponding instantaneous firing rate

$$\rho_j(t) = \frac{e^{u_j(t)}}{I_{lat}(t)},$$

where $I_{lat}(t)$ is the divisive lateral inhibition, i.e., $I_{lat}(t) = \sum_l e^{u_l(t)}$. In the WTA neural circuit, when neuron $j$ spikes, $y^n_j(t) = 1$, which gives the approximation of the last term.

Next, we derive the first part, $\partial \log p_N(J \mid \theta) / \partial y^n_j$. Under the theoretically optimal STDP learning rule, the corresponding equation can be derived (Habenschuss et al., 2013; Nessler et al., 2009). Note that the synaptic weights of the two layers are of about the same order of magnitude and hence cancel each other, which yields the compact approximation to the learning rule in the first layer (Eq. S36). In fact, the experimental results show that the performance of the compact learning rule in Eq. S36 is as good as that of the exact one in Eq. S35.
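The exponential rate model with divisive lateral inhibition is exactly a softmax over membrane potentials, which a short sketch makes explicit (the function name is illustrative):

```python
import math

# rho_j = e^{u_j} / I_lat with I_lat = sum_l e^{u_l}: a softmax over
# membrane potentials (function name is ours, not the paper's).
def firing_rates(u):
    m = max(u)                              # shift for numerical stability
    exps = [math.exp(uj - m) for uj in u]
    i_lat = sum(exps)                       # divisive lateral inhibition
    return [e / i_lat for e in exps]

rho = firing_rates([2.0, 1.0, 0.1])
```

The rates sum to one and preserve the ordering of the membrane potentials, which is what implements the soft winner-take-all competition.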
Similarly, the learning rule for an L-layer spiking neural network can be stated as follows. Given an L-layer noisy spiking neural network, each layer computes a function $X_l = g_l(X_{l-1}, W_l)$, where $X_l$ is the output of the $l$-th layer, $X_{l-1}$ is its input, and $W_l$ is the vector of adjustable parameters between the $(l-1)$-th and the $l$-th layer. Note that the vector $X_1$ in the first layer is the input sample.
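The layer-wise composition $X_l = g_l(X_{l-1}, W_l)$ can be sketched as a simple forward loop. Purely for illustration, each $g_l$ below is taken to be the exponential rate model normalized by divisive lateral inhibition described in the text; the example weights are arbitrary:

```python
import math

# Sketch of the L-layer composition X_l = g_l(X_{l-1}, W_l).  Each g_l is
# (as an assumption) the exponential rate model with divisive lateral
# inhibition; W_l is a weight matrix for layer l.
def layer(x, W):
    u = [sum(w * xi for w, xi in zip(row, x)) for row in W]  # potentials
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    s = sum(exps)
    return [e / s for e in exps]

def forward(x, weights):
    for W in weights:                       # X_1 is the input sample
        x = layer(x, W)
    return x

out = forward([1.0, 0.0],
              [[[0.5, -0.5], [-0.5, 0.5], [0.0, 0.0]],
               [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
```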