Probabilistic Circuits for Autonomous Learning: A Simulation Study

Modern machine learning is based on powerful algorithms running on digital computing platforms and there is great interest in accelerating the learning process and making it more energy efficient. In this paper we present a fully autonomous probabilistic circuit for fast and efficient learning that makes no use of digital computing. Specifically we use SPICE simulations to demonstrate a clockless autonomous circuit where the required synaptic weights are read out in the form of analog voltages. This allows us to demonstrate a circuit that can be built with existing technology to emulate the Boltzmann machine learning algorithm based on gradient optimization of the maximum likelihood function. Such autonomous circuits could be particularly of interest as standalone learning devices in the context of mobile and edge computing.


INTRODUCTION
Machine learning, inference, and many other emerging applications (Schuman et al., 2017) make use of stochastic neural networks comprising (1) a binary stochastic neuron (BSN) (Ackley et al., 1985; Neal, 1992) and (2) a synapse that constructs the input $I_i$ to the $i$th BSN from the outputs $m_j$ of all other BSNs.
The output $m_i$ of the $i$th BSN fluctuates between +1 and −1 with a probability controlled by its input:
$$m_i(t + \tau_N) = \mathrm{sgn}\left[\tanh\big(I_i(t)\big) - r\right] \tag{1}$$
where $r$ represents a random number in the range [−1, +1], and $\tau_N$ is the time it takes for a neuron to provide a stochastic output $m_i$ in accordance with a new input $I_i$. Usually the synaptic function $I_i(\{m\})$ is linear and is defined by a set of weights $W_{ij}$ such that
$$I_i(t + \tau_S) = \sum_j W_{ij}\, m_j(t) \tag{2}$$
where $\tau_S$ is the time it takes to recompute the inputs $\{I\}$ every time the outputs $\{m\}$ change.
Typically Equations (1) and (2) are implemented in software, often with special accelerators for the synaptic function using GPUs/TPUs (Schmidhuber, 2015; Jouppi, 2016). The time constants $\tau_N$ and $\tau_S$ are not important when Equations (1) and (2) are implemented on a digital computer using a clock to ensure that neurons are updated sequentially and the synapse is updated between any two updates. But they play an important role in clockless operation of autonomous hardware that makes use of the natural physics of specific systems to implement Equations (1) and (2) approximately. A key advantage of using BSNs is that Equation (1) can be implemented compactly using stochastic magnetic tunnel junctions (MTJs), as shown in Camsari et al. (2017a,b), while resistive or capacitive crossbars can implement Equation (2) (Hassan et al., 2019a). It has been shown by Sutton et al. (2019) that such hardware implementations can operate autonomously without clocks if the BSN operates slower than the synapse, that is, if $\tau_N \gg \tau_S$.
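As an illustration of the clocked software baseline described above, the following minimal Python sketch emulates the BSN rule of Equation (1) in the form m_i = sgn[tanh(I_i) − r] and the linear synapse of Equation (2). It assumes that r is drawn uniformly from [−1, +1] and that there is no self-coupling; the function names and the two-neuron weight matrix are hypothetical and chosen only for illustration.

```python
import math
import random


def bsn_update(input_i, rng):
    """Equation (1): output +1 with probability (1 + tanh(I_i)) / 2, else -1."""
    r = rng.uniform(-1.0, 1.0)  # assumption: r is uniform in [-1, +1]
    return 1 if math.tanh(input_i) - r >= 0 else -1


def synapse(weights, m, i):
    """Equation (2): I_i = sum_j W_ij * m_j (assumption: no self-coupling)."""
    return sum(weights[i][j] * m[j] for j in range(len(m)) if j != i)


def sample(weights, sweeps, seed=0):
    """Clocked emulation: the input is recomputed before every neuron update."""
    rng = random.Random(seed)
    n = len(weights)
    m = [rng.choice((-1, 1)) for _ in range(n)]
    samples = []
    for _ in range(sweeps):
        for i in range(n):  # sequential updates, one neuron at a time
            m[i] = bsn_update(synapse(weights, m, i), rng)
        samples.append(tuple(m))
    return samples


# Hypothetical example: two neurons with a positive coupling tend to agree.
W = [[0.0, 1.5], [1.5, 0.0]]
states = sample(W, sweeps=2000)
agreement = sum(a * b for a, b in states) / len(states)
print(f"sampled correlation <m1 m2> = {agreement:.2f}")
```

The sequential loop plays the role of the clock here; in the autonomous circuit the same ordering is expected to emerge from the separation of time scales rather than from an explicit schedule.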
Stochastic neural networks defined by Equations (1) and (2) can be used for inference whereby the weights $W_{ij}$ are designed such that the system has a very high probability of visiting configurations defined by $\{m\} = \{v\}_n$, where $\{v\}_n$ represents a specified set of patterns. However, the most challenging and time-consuming part of implementing a neural network is not the inference function, but the learning required to determine the correct weights $W_{ij}$ for a given application. This is commonly done using powerful cloud-based processors and there is great interest in accelerating the learning process and making it more energy efficient so that it can become a routine part of mobile and edge computing.
In this paper we present a new approach to the problem of fast and efficient learning that makes no use of digital computing at all. Instead it makes use of the natural physics of a fully autonomous probabilistic circuit composed of standard electronic components like resistors, capacitors, and transistors along with stochastic MTJs.
We focus on a fully visible Boltzmann machine (FVBM), a form of stochastic recurrent neural network, for which the most common learning algorithm is based on the gradient ascent approach to optimize the maximum likelihood function (Carreira-Perpinan and Hinton, 2005; Koller and Friedman, 2009). We use a slightly simplified version of this approach, whereby the weights are changed incrementally according to
$$W_{ij}(t + \Delta t) = W_{ij}(t) + \epsilon\left[v_i v_j - m_i m_j - \lambda W_{ij}(t)\right]$$
where $\epsilon$ is the learning parameter and $\lambda$ is the regularization parameter (Ng, 2004). The term $v_i v_j$ is the correlation between the $i$th and the $j$th entry of the training vector $\{v\}_n$. The term $m_i m_j$ corresponds to the sampled correlation taken from the model's distribution. The advantage of this network topology is that the learning rule is local, since it only requires information from the two neurons $i$ and $j$ connected by weight $W_{ij}$. In addition, the learning rule can tolerate stochasticity, for example in the form of sampling noise, which makes it an attractive algorithm for hardware machine learning (Carreira-Perpinan and Hinton, 2005; Fischer and Igel, 2014; Ernoult et al., 2019).
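The incremental rule described above can be sketched in a few lines of Python, assuming the update has the form W_ij ← W_ij + ε(v_i v_j − m_i m_j − λW_ij) applied to every pair i ≠ j. The function name and the numerical values of ε and λ are hypothetical; the point of the sketch is that each weight update uses only the two entries i and j.

```python
def update_weights(weights, v, m, eps, lam):
    """One incremental step of the learning rule for all pairs i != j.

    v is one bipolar training vector, m is one sampled BSN configuration.
    """
    n = len(weights)
    new = [row[:] for row in weights]  # leave the input matrix untouched
    for i in range(n):
        for j in range(n):
            if i != j:
                gradient = v[i] * v[j] - m[i] * m[j] - lam * weights[i][j]
                new[i][j] += eps * gradient
    return new


# Hypothetical values for illustration only.
W0 = [[0.0] * 3 for _ in range(3)]
v = [1, -1, 1]   # training vector
m = [1, 1, 1]    # sample drawn from the model
W1 = update_weights(W0, v, m, eps=0.1, lam=0.01)
print(W1)
```

Because v_i v_j and m_i m_j are both symmetric in i and j, a symmetric starting matrix stays symmetric, which matches the reciprocal connections assumed for the Boltzmann machine.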
For our autonomous operation we replace the equation above with its continuous-time version ($\tau_L$: learning time constant)
$$\frac{dW_{ij}}{dt} = \frac{v_i v_j - m_i m_j - \lambda W_{ij}}{\tau_L} \tag{3}$$
which we translate into an RC circuit by associating $W_{ij}$ with the voltage on a capacitor $C$ driven by a voltage source $(V_{v,ij} - V_{m,ij})$ with a series resistance $R$ (Figure 1):
$$C\,\frac{dV_{ij}}{dt} = \frac{V_{v,ij} - V_{m,ij} - V_{ij}}{R} \tag{4}$$
with $v_i v_j = V_{v,ij}/(V_{DD}/2)$ and $m_i m_j = V_{m,ij}/(V_{DD}/2)$. From Figure 1 and comparing Equations (3) and (4), it is easy to see how the weights and the learning and regularization parameters are mapped into circuit elements: $W_{ij} = A_v V_{ij}/V_0$, $\lambda = V_0/(A_v V_{DD}/2)$, and $\tau_L = \lambda RC$, where $A_v$ is the voltage gain of OP3 in Figure 1 and $V_0$ is the reference voltage of the BSN. For proper operation the learning time scale $\tau_L$ has to be much larger than the neuron time $\tau_N$ so that enough statistics can be collected throughout the learning process.
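To make the RC picture concrete, the sketch below integrates the capacitor equation C dV_ij/dt = (V_v,ij − V_m,ij − V_ij)/R with a forward-Euler step. It is an open-loop toy case: the drive V_v,ij − V_m,ij is held at a hypothetical constant value, whereas in the circuit V_m,ij fluctuates with the BSN outputs. The values of R and C are those quoted for the full adder example later in the paper; the function name and the time step are assumptions.

```python
def integrate_weight_voltage(drive, r, c, dt, steps, v_start=0.0):
    """Forward-Euler integration of C dV/dt = (drive - V) / R."""
    v = v_start
    for _ in range(steps):
        v += dt * (drive - v) / (r * c)
    return v


R = 5e3       # ohm, value quoted for the full adder example
C = 1e-9      # farad, value quoted for the full adder example
DRIVE = 0.2   # volt, hypothetical constant value of V_v,ij - V_m,ij

# R * C = 5 microseconds, so 5,000 steps of 1 ns cover one RC constant.
v_one_rc = integrate_weight_voltage(DRIVE, R, C, dt=1e-9, steps=5000)
v_settled = integrate_weight_voltage(DRIVE, R, C, dt=1e-9, steps=50000)
print(f"after one RC: {v_one_rc:.4f} V, after ten RC: {v_settled:.4f} V")
```

Under a constant drive the voltage relaxes toward the drive value with time constant RC. In closed-loop operation the drive shrinks as the sampled correlations approach the target correlations, so the weight voltage settles where the two terms balance on average.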
A key element of this approach is the representation of the weights W with voltages rather than with programmable resistances for which memristors and other technologies are still in development (Li et al., 2018b). By contrast the charging of capacitors is a textbook phenomenon, allowing us to design a learning circuit that can be built today with established technology. The idea of using capacitor voltages to represent weights in neural networks has been presented by several authors for different network topologies in analog learning circuits (Schneider and Card, 1993;Card et al., 1994;Kim et al., 2017;Sung et al., 2018). The use of capacitors has the advantage of having a high level of linearity and symmetry for the weight updates during the training process (Li et al., 2018a).
In section 2, we will describe such a learning circuit that emulates Equations (1)-(3). The training images or patterns $\{v\}_n$ are fed in as electrical signals into the input terminals, and the synaptic weights $W_{ij}$ can then be read out in the form of voltages from the output terminals. Alternatively the values can be stored in a non-volatile memory from which they can subsequently be read and used for inference. In section 3, we will present SPICE simulations demonstrating the operation of this autonomous learning circuit.

METHODS
The autonomous learning circuit has three parts, each representing one of the three Equations (1)-(3). On the left-hand side of Figure 1, the training data are fed into the circuit by supplying a voltage $V_{v,ij}$, which is given by the $i$th entry $v_i$ of the bipolar training vector multiplied by the $j$th entry $v_j$ and scaled by $V_{DD}/2$, half the supply voltage. The training vectors can be fed in sequentially or as an average of all training vectors. The weight voltage $V_{ij}$ across capacitor $C$ follows Equation (4), where $V_{v,ij}$ is compared to the voltage $V_{m,ij}$, which represents the correlation of the outputs $m_i$ and $m_j$ of BSN $i$ and BSN $j$. The voltage $V_{m,ij}$ is computed in the circuit by an XNOR gate connected to the outputs of BSN $i$ and BSN $j$.
The synapse in the center of the circuit connects weight voltages to neurons according to Equation (2). The voltage $V_{ij}$ has to be multiplied by +1 or −1 depending on the current value of $m_j$. This is accomplished by a switch that connects either the positive or the negative node of $V_{ij}$ to the operational amplifiers OP1 and OP2. Here, OP1 accumulates all negative contributions and OP2 accumulates all positive contributions of the synaptic function. The differential amplifier OP3 takes the difference between the output voltages of OP2 and OP1 and amplifies it by the amplification factor $A_v$. This voltage conversion is used to control the voltage level of $V_{ij}$ in relation to the input voltage of each BSN. The voltage level at the input of the BSN is fixed by the reference voltage of the BSN, $V_0$. However, the voltage level of $V_{ij}$ can be adjusted and utilized to adjust the regularization parameter $\lambda$ in the learning rule (Equation 3). The functionality of the BSN is described by Equation (1), where the dimensionless input is given by $I_i(t) = V_{i,in}(t)/V_0$. This relates the voltage $V_{ij}$ to the dimensionless weight by $W_{ij} = A_v V_{ij}/V_0$.
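The signal path described above can be summarized in an idealized behavioral sketch. It assumes that the XNOR output is level-shifted to ±V_DD/2, that weight voltages of neighbors with m_j = +1 are routed to OP2 and those with m_j = −1 to OP1, and that all amplifiers are ideal. The function names and the values of V_DD and V_0 are hypothetical; A_v = 10 is the gain quoted for the full adder example.

```python
def xnor_voltage(m_i, m_j, v_dd):
    """V_m,ij = m_i * m_j * V_DD / 2: high when both BSN outputs agree."""
    return v_dd / 2 if m_i == m_j else -v_dd / 2


def bsn_input(weight_voltages, neighbor_outputs, a_v, v_0):
    """Dimensionless input I_i = A_v * (OP2 - OP1) / V_0 for one neuron."""
    pairs = list(zip(weight_voltages, neighbor_outputs))
    op2 = sum(v for v, m in pairs if m == 1)    # assumption: m_j = +1 branch
    op1 = sum(v for v, m in pairs if m == -1)   # assumption: m_j = -1 branch
    v_in = a_v * (op2 - op1)                    # ideal differential stage OP3
    return v_in / v_0


V_DD = 0.8   # volt, hypothetical supply voltage
V_0 = 0.05   # volt, hypothetical BSN reference voltage
A_V = 10     # gain quoted for the full adder example

v_weights = [0.01, 0.02, -0.03]   # hypothetical weight voltages V_ij
m_neighbors = [1, -1, -1]         # current outputs m_j of the neighbors
i_in = bsn_input(v_weights, m_neighbors, A_V, V_0)
print(f"I_i = {i_in:.2f}, V_m = {xnor_voltage(1, -1, V_DD):+.2f} V")
```

With these numbers the dimensionless weights A_v V_ij / V_0 are 2, 4, and −6, so the result agrees with evaluating the sum of W_ij m_j directly.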
The hardware implementation of the BSN uses a stochastic MTJ in series with a transistor as presented by Camsari et al. (2017b). Due to thermal fluctuations of the low-barrier magnet (LBM) of the MTJ, the output voltage of the MTJ fluctuates randomly but with the right statistics given by Equation (1). The time dynamics of the LBM can be obtained by solving the stochastic Landau-Lifshitz-Gilbert (LLG) equation. Due to the fast thermal fluctuations of the LBM in the MTJ, Equation (1) can be evaluated on a subnanosecond timescale, leading to fast generation of samples (Hassan et al., 2019b; Kaiser et al., 2019b). Figure 1 shows the hardware implementation of only one weight and one BSN. The size of the whole circuit depends on the size $N$ of the training vector. For every entry of the training vector one BSN is needed. The number of weights, which is the number of RC circuits, is given by $N(N - 1)/2$, where every connection between BSNs is assumed to be reciprocal. To learn biases another $N$ RC circuits are needed.
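The component count described above is simple arithmetic, shown here as a short Python helper. The function name is hypothetical; the two calls use the network sizes of the examples in this paper (5 nodes without biases for the full adder, 15 nodes with biases for the 5 × 3 digit images).

```python
def rc_circuit_count(n, with_biases=True):
    """Number of RC circuits for N BSNs with reciprocal connections."""
    weights = n * (n - 1) // 2          # one RC circuit per neuron pair
    biases = n if with_biases else 0    # one extra RC circuit per bias
    return weights + biases


print(rc_circuit_count(5, with_biases=False))   # full adder: weights only
print(rc_circuit_count(15, with_biases=True))   # digit images: weights + biases
```

The quadratic growth of the weight count with N is the main driver of circuit size, while the number of BSNs grows only linearly.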
The learning process is captured by Equations (3) and (4). The whole learning process resembles the software implementation of persistent contrastive divergence (PCD) (Tieleman, 2008), since the circuit takes samples from the model's distribution ($V_{m,ij}$) and compares them to the target distribution ($V_{v,ij}$) without reinitializing the Markov chain after a weight update. During the learning process the voltage $V_{ij}$ reaches a constant average value where $dV_{ij}/dt \approx 0$. This voltage $V_{ij} = V_{ij,learned}$ corresponds to the learned weight.
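The persistent-chain character of the learning process can be illustrated with a small software emulation of a two-neuron network. This is a sketch under stated assumptions, not the SPICE model: the BSN is the idealized rule m = sgn[tanh(I) − r] with uniform r, time is discretized in steps of 0.1 ns (matching the 100 ps scale quoted for the neuron time), and the weight follows dW/dt = (v_1 v_2 − m_1 m_2 − λW)/τ_L. The values τ_L = 62.5 ns and λ = τ_L/(RC) = 0.0125 follow from the parameters quoted for the full adder example; the target correlation of +1 and the function name are hypothetical.

```python
import math
import random


def learn_two_node_weight(target_corr, tau_l, dt, steps, lam, seed=1):
    """Euler emulation of the continuous learning rule with a persistent chain."""
    rng = random.Random(seed)
    w = 0.0
    m = [1, -1]  # the chain state is never reset between weight updates
    for _ in range(steps):
        for i in (0, 1):
            inp = w * m[1 - i]  # synapse: input from the other neuron
            m[i] = 1 if math.tanh(inp) - rng.uniform(-1.0, 1.0) >= 0 else -1
        # target correlation minus sampled correlation minus regularization
        w += (dt / tau_l) * (target_corr - m[0] * m[1] - lam * w)
    return w


# Times in nanoseconds; 20,000 steps of 0.1 ns correspond to 2,000 ns.
w_learned = learn_two_node_weight(
    target_corr=1.0, tau_l=62.5, dt=0.1, steps=20000, lam=0.0125
)
print(f"learned dimensionless weight: {w_learned:.2f}")
```

In this toy case the weight is expected to settle near the value where tanh(W) = 1 − λW, which is the software analogue of the condition that the capacitor voltage stops changing on average.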
For inference the capacitor $C$ is replaced by a voltage source of voltage $V_{ij,learned}$. Consequently, the autonomous circuit will compute the desired functionality given by the training vectors. In general, training and inference have to be performed on identical hardware in order to learn around variations (see Supplementary Material for more details). It is important to note that in inference mode this circuit can be used for optimization by performing electrical annealing. This is done by increasing all weight voltages $V_{ij}$ by the same factor over time. In this way the ground state of a Hamiltonian like the Ising Hamiltonian can be found (Sutton et al., 2017; Camsari et al., 2019).
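The effect of scaling all weights by a common factor can be checked exactly for a small network. The sketch below assumes that the stationary distribution of the network defined by Equations (1) and (2) is the Boltzmann distribution P({m}) ∝ exp(I_0 Σ_{i<j} W_ij m_i m_j), which is consistent with the conditional probability (1 + tanh I_i)/2 of the BSN. The function name and the two-neuron weight matrix are hypothetical.

```python
import math
from itertools import product


def boltzmann_distribution(weights, i_0=1.0):
    """Exact distribution P(m) proportional to exp(I_0 * sum_{i<j} W_ij m_i m_j)."""
    n = len(weights)
    states = list(product((-1, 1), repeat=n))
    scores = []
    for m in states:
        exponent = sum(
            weights[i][j] * m[i] * m[j]
            for i in range(n)
            for j in range(i + 1, n)
        )
        scores.append(math.exp(i_0 * exponent))
    z = sum(scores)  # partition function
    return {m: s / z for m, s in zip(states, scores)}


# Hypothetical two-neuron example with one positive weight.
W = [[0.0, 1.0], [1.0, 0.0]]
p_agree_1 = sum(p for m, p in boltzmann_distribution(W, 1.0).items() if m[0] == m[1])
p_agree_2 = sum(p for m, p in boltzmann_distribution(W, 2.0).items() if m[0] == m[1])
print(f"P(agree) at I_0 = 1: {p_agree_1:.3f}, at I_0 = 2: {p_agree_2:.3f}")
```

Doubling the scale factor concentrates more probability on the lowest-energy configurations, which is the mechanism that annealing exploits when the factor is increased gradually.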

RESULTS
In this section the autonomous learning circuit in Figure 1 is simulated in SPICE. We show how the proposed circuit can be used for both inference and learning. As examples, we demonstrate learning on a full adder (FA) and on 5 × 3 digit images. The BSN models are simulated in the framework developed by Camsari et al. (2015). For all SPICE simulations the following parameters are used for the stochastic MTJ in the BSN implementation: saturation magnetization $M_S$ = 1,100 emu/cc, LBM diameter $D$ = 22 nm, LBM thickness $l$ = 2 nm, TMR = 110%, damping coefficient $\alpha$ = 0.01, temperature $T$ = 300 K, and demagnetization field $H_D = 4\pi M_S$, with LBM volume $V = (D/2)^2 \pi l$. For the transistors, 14 nm HP-FinFET Predictive Technology Models (PTM, http://ptm.asu.edu/) are used with fin number fin = 1 for the inverters and fin = 2 for the XNOR gates. Ideal operational amplifiers and switches are used in the synapse. The characteristic time of the BSNs, $\tau_N$, is on the order of 100 ps (Hassan et al., 2019b) and much larger than the time it takes for the synaptic connections, namely the resistors and operational amplifiers, to propagate BSN outputs to neighboring inputs. In principle, other hardware implementations of the synapse for computing Equation (2) could be utilized as long as the condition $\tau_N \gg \tau_S$ is satisfied.

Learning Addition
As a first training example, we use the probability distribution of a full adder. The FA has 5 nodes and 10 weights that have to be learned. In the case of the FA training, no biases are needed. The probability distribution of a full adder with bipolar variables is shown in Table 1. To learn this distribution, the correlation terms $v_i v_j$ in the learning rule have to be fed into the voltage node $V_{v,ij}$. The correlation depends on which training vector (truth table line) is fed in. For the second line of the truth table, for example, $v_1 v_2 = (-1)\cdot(-1) = 1$ and $v_1 v_3 = (-1)\cdot 1 = -1$, with A being the first node, B the second node, and so on. In Figure 2B the correlation $v_1 v_5$ is shown. For the sequential case the value of $v_1 v_5$ is obtained by cycling through all lines of the truth table, where each training vector is shown for 1 ns. A and $C_{out}$ in Table 1 differ only in the fourth and fifth lines, for which $v_1 v_5 = -1$. For all other cases $v_1 v_5 = 1$. The average over all lines is shown as a red solid line. Figure 2A shows the weight voltage $V_{ij}$ with $i = 1$ and $j = 5$ for FA learning during the first 1,000 ns of training. The following learning parameters have been used for the FA: $\tau_L$ = 62.5 ns with $C$ = 1 nF and $R$ = 5 kΩ, $A_v$ = 10, and $R_f$ = 1 MΩ. This choice of learning parameters ensures that $\tau_L \gg \tau_N$. Due to the averaging effect of the RC circuit, both sequential and average feeding of the training vector result in similar learning behavior as long as the RC constant is much larger than the timescale of sequential feeding. Figure 2C shows an enlarged version of Figure 2A. For the sequential feeding, the voltage $V_{1,5}$ changes substantially every time $v_1 v_5$ switches to −1.
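The training signal for the full adder can be reproduced in a few lines of Python. The sketch assumes the node ordering (A, B, C_in, S, C_out) and the usual binary counting order of the truth table lines, which is consistent with the statements above about the second, fourth, and fifth lines; the function name is hypothetical. It also checks the regularization parameter implied by the quoted circuit values through τ_L = λRC.

```python
from itertools import product


def full_adder_patterns():
    """Bipolar truth table rows (A, B, C_in, S, C_out) of a full adder."""
    rows = []
    for a, b, c_in in product((0, 1), repeat=3):
        total = a + b + c_in
        s, c_out = total % 2, total // 2
        # map binary 0/1 to bipolar -1/+1
        rows.append(tuple(2 * bit - 1 for bit in (a, b, c_in, s, c_out)))
    return rows


patterns = full_adder_patterns()
v1v5 = [row[0] * row[4] for row in patterns]   # sequential feeding of v_1 v_5
average_v1v5 = sum(v1v5) / len(v1v5)           # average feeding of v_1 v_5

# Regularization parameter implied by tau_L = lambda * R * C.
TAU_L, R, C = 62.5e-9, 5e3, 1e-9
lam = TAU_L / (R * C)

print(v1v5, average_v1v5, lam)
```

The sequence of v_1 v_5 values is what the sequential input voltage cycles through, and its mean is the constant level applied in the averaged case.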
At the start of training all weight voltages are initialized to 0 V and the probability distribution is uniform. The training is performed for 5,500 ns. Figure 3A shows the ideal probability distribution of the FA, $P_{Ideal}$, together with the normalized histogram $P_{SPICE}$ of the sampled BSN configurations taken from the last 500 ns of learning. The training vector is fed in as an average. For $P_{SPICE}$ the eight trained configurations of Table 1 are the dominant peaks. To monitor the training process, the Kullback-Leibler divergence between the ideal and the trained probability distribution, $KL(P_{Ideal}\,\|\,P_{SPICE}(t))$, is plotted as a function of training time $t$ in Figure 3B, where $P_{SPICE}(t)$ is the normalized histogram taken over 500 ns. $P_{SPICE}$ at $t = 0$ corresponds to the histogram taken from $t = 0$ to $t = 500$ ns. During training the KL divergence decreases over time until it reaches a constant value of about 0.1. After the weight matrix is learned correctly for a fully visible Boltzmann machine, the KL divergence can be reduced further by increasing all weights uniformly by a factor $I_0$, which corresponds to the inverse temperature of the Boltzmann machine (Aarts and Korst, 1989). Figure 3 shows that the probability distribution of an FA can be learned very fast with the proposed autonomous learning circuit. In addition, the learning performance is robust when components of the circuit are subject to variation. In the Supplementary Material, additional figures show the learning performance when the diameter of the magnet and the resistances of the RC circuits are subject to variation. The robustness against variations can be explained by the fact that the circuit can learn around variations. BSNs using LBMs under variations have also been analyzed by Abeed and Bandyopadhyay (2019) and Drobitch and Bandyopadhyay (2019).
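The monitoring metric can be computed from a normalized histogram as sketched below. The sketch assumes the natural logarithm (the base is not stated in the text) and floors empty histogram bins to avoid division by zero; the function name, the floor value, and the example model distribution are hypothetical and do not reproduce the SPICE data.

```python
import math


def kl_divergence(p_ideal, p_model, floor=1e-12):
    """KL(P_ideal || P_model) = sum_k P_ideal[k] * ln(P_ideal[k] / P_model[k])."""
    return sum(
        p * math.log(p / max(q, floor))   # floor guards against empty bins
        for p, q in zip(p_ideal, p_model)
        if p > 0.0                        # terms with P_ideal = 0 contribute 0
    )


n_states = 2 ** 5            # five bipolar nodes of the full adder
trained = set(range(8))      # hypothetical indices of the 8 valid states

# Ideal distribution: probability 1/8 on each valid truth table line.
p_ideal = [1 / 8 if k in trained else 0.0 for k in range(n_states)]
# Hypothetical model: 80% of the mass on valid states, 20% spread elsewhere.
p_model = [0.8 / 8 if k in trained else 0.2 / 24 for k in range(n_states)]

kl = kl_divergence(p_ideal, p_model)
print(f"KL divergence = {kl:.4f}")
```

For this example the result equals ln(1/0.8), which shows that with a uniform ideal distribution over the valid states the metric directly reflects how much probability mass leaks into invalid configurations.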

Learning Image Completion
As a second example, the circuit is trained on the ten 5 × 3 pixel digit images shown in Figure 4A. Here, 105 reciprocal weights and 15 biases have to be learned. The network is trained for 3,000 ns and the bipolar training data is fed in as the average of the 10 $v_i v_j$ terms, one for every digit. The same learning parameters as in the previous section are used here. In Figure 4B, the KL divergence between the SPICE histogram and the ideal probability distribution is shown as a function of time, where the ideal distribution has 10 peaks, one per digit, each with a probability of 10%. Most of the learning happens in the first 1,500 ns of training; however, the KL divergence still decreases slightly during the later parts of learning. After 3,000 ns the KL divergence reaches a value of around 0.5.
For inference we replace each capacitor by a voltage source whose voltage is given by the previously learned voltage $V_{ij}$. The circuit is run for 10 instances, where every instance has a unique clamping pattern of 6 pixels representing one of the 10 digits. The clamped inputs are shown in Figure 4C. The input of a clamped BSN is set to $\pm V_{DD}/2$. Each instance is run for 100 ns and the outputs of the BSNs are monitored. The BSNs fluctuate between the configurations given by the learned probability distribution. In Figure 4D, the heat map of the output of the BSNs is shown. For every digit the most likely configuration is given by the trained digit image. To illustrate this point, the amount of BSN fluctuations is reduced by increasing the learned weight voltages by a factor of $I_0 = 2$. The circuit is again run in inference mode for 100 ns with the same clamping patterns. The resulting heat map is shown in Figure 4E. The circuit locks into the learned digit configuration. This shows that in inference mode the circuit can be utilized for image completion.

DISCUSSION
In this paper we have presented a framework for mapping a continuous version of the Boltzmann machine learning rule (Equation 3) to a clockless autonomous circuit. We have shown full SPICE simulations to demonstrate the feasibility of this circuit running without any digital component, with the learning parameters set by circuit parameters. Due to the fast BSN operation, samples are drawn at subnanosecond speeds, leading to fast learning. As such, the learning speed should be at least multiple orders of magnitude faster compared to other computing platforms (Adachi and Henderson, 2015; Korenkevych et al., 2016; Terenin et al., 2019). The advantage of this autonomous architecture is that it produces random numbers naturally and does not rely on pseudo-random number generators like linear-feedback shift registers (LFSRs), which are used, for example, in Bojnordi and Ipek (2016). These LFSRs have overhead and are not as compact and efficient as the hardware BSN used in this paper. As shown by Borders et al. (2019), typical LFSRs need about 10× more energy per flip and more than 100× more area than an MTJ-based BSN. Another advantage of this approach is that the interfacing with digital hardware only needs to be performed after the learning has been completed. Hence, no expensive analog-to-digital conversion has to be performed during learning. We believe this approach could be extended to other energy-based machine learning algorithms like equilibrium propagation, introduced by Scellier and Bengio (2017), to design autonomous circuits. Such standalone learning devices could be particularly of interest in the context of mobile and edge computing.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

AUTHOR CONTRIBUTIONS
JK and SD wrote the paper. JK performed the simulations. RF helped setting up the simulations. KC developed the simulation modules for the BSN. All authors discussed the results and helped refine the manuscript.

FUNDING
This work was supported in part by ASCENT, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. KC gratefully acknowledges support from Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant CCF-0939370.

ACKNOWLEDGMENTS
This manuscript has been released as a preprint at arXiv (Kaiser et al., 2019a).