Hardware Design for Autonomous Bayesian Networks

Bayesian networks, directed acyclic graphs that are widely used in AI for probabilistic inference and causal reasoning, can be mapped to probabilistic circuits built out of probabilistic bits (p-bits), analogous to the binary stochastic neurons of stochastic artificial neural networks. In order to satisfy standard statistical results, individual p-bits not only need to be updated sequentially but also in order from the parent to the child nodes, necessitating the use of sequencers in software implementations. In this article, we first use SPICE simulations to show that an autonomous hardware Bayesian network can operate correctly without any clocks or sequencers, but only if the individual p-bits are appropriately designed. We then present a simple behavioral model of the autonomous hardware illustrating the essential characteristics needed for correct sequencer-free operation. This model is also benchmarked against SPICE simulations and can be used to simulate large-scale networks. Our results could be useful in the design of hardware accelerators that use energy-efficient building blocks suited for low-level implementations of Bayesian networks. The autonomous, massively parallel operation of our proposed stochastic hardware has biological relevance, since neural dynamics in the brain are also stochastic and autonomous by nature.


I. INTRODUCTION
Bayesian networks (BN) or belief nets are probabilistic directed acyclic graphs (DAG) popular for reasoning under uncertainty and probabilistic inference in real-world applications such as medical diagnosis [1], genomic data analysis [2-4], forecasting [5,6], robotics [7], image classification [8,9], neuroscience [10], and so on. BNs are composed of probabilistic nodes and edges from parent to child nodes and are defined in terms of conditional probability tables (CPT) that describe how each child node is influenced by its parent nodes [11-14]. The CPTs can be obtained from expert knowledge and/or machine learned from data [15]. Computing probabilities from a BN becomes intractable as the network gets deeper and more complicated, with child nodes having many parent nodes. This has inspired various hardware implementations of BNs for efficient inference [16-26]. In this article we elucidate the design criteria for autonomous (clockless) hardware for BNs, unlike other implementations that typically use clocks.
Recently, a new hardware computing framework called Probabilistic Spin Logic (PSL) was proposed [27], based on a building block called the probabilistic bit (p-bit), which is analogous to the Binary Stochastic Neuron (BSN) [28,29] of the artificial neural network (ANN) literature. p-bits can be interconnected to solve a wide variety of problems such as optimization [30,31], inference [32], an enhanced type of Boolean logic that is invertible [27,33-35], quantum emulation [36], and in-situ learning from probability distributions [37].
Unlike conventional deterministic networks built out of deterministic, stable bits, stochastic or probabilistic networks are composed of p-bits (Fig. 1a), which can be correlated by interconnecting them to construct p-circuits defined by two equations [27-29]: (1) a p-bit/BSN equation and (2) a weight logic/synapse equation. The output of a p-bit, m_i, is related to its dimensionless input I_i by the equation:
$$m_i(t + \tau_N) = \mathrm{sgn}\left[\tanh\big(I_i(t)\big) - \mathrm{rand}(-1,+1)\right] \tag{1a}$$
where rand(−1, +1) is a random number uniformly distributed between −1 and +1, and τ_N is the neuron evaluation time.
The synapse generates the input I_i from a weighted sum of the states of other p-bits. In general the synapse can be a linear or non-linear function, although a common form is the linear synapse described by the equation:
$$I_i(t + \tau_S) = \sum_j J_{ij}\, m_j(t) + h_i \tag{1b}$$
where h_i is the on-site bias, J_ij is the weight of the coupling from the j-th p-bit to the i-th p-bit, and τ_S is the synapse evaluation time.
In sequential operation, only one p-bit is updated in each τ_S + τ_N time interval. This naturally implies the use of sequencers to ensure the sequential update of p-bits. For symmetrically connected networks (J_ij = J_ji) such as Boltzmann machines, the update order of p-bits does not matter and any random update order produces the standard probability distribution described by the equilibrium Boltzmann law
$$P(\{m\}) \propto \exp\Big(\tfrac{1}{2}\sum_{i,j} J_{ij}\, m_i m_j + \sum_i h_i m_i\Big)$$
as long as p-bits are updated sequentially. But for directed acyclic networks (J_ij ≠ 0, J_ji = 0) or Bayesian networks to be consistent with the expected conditional probability distribution, p-bits need to be updated not only sequentially, but also in a specific update order, from the parent to the child nodes [29], similar to the concept of forward sampling in belief networks [13,41,42]. As long as this parent-to-child update order is maintained, the network converges to the correct probability distribution described by the probability chain rule or Bayes rule. This effect of update order in a sequential p-circuit is shown on a three p-bit network in fig. 1b.
a) Electronic mail: rfaria@purdue.edu
b) Electronic mail: datta@purdue.edu
FIG. 1. Clocked versus Autonomous p-circuit: (a) A probabilistic (p-)circuit is composed of p-bits interconnected by a weight logic (synapse) that computes the input I_i to the i-th p-bit as a function of the outputs from other p-bits. Two p-bit designs (design 1 and 2) based on sMTJ using LBMs have been used to build a p-circuit. (b) Two types of p-circuits are built: a directed or Bayesian network and a symmetrically connected Boltzmann network. The p-circuits are sequential (labeled as SeqPSL), which means p-bits are updated sequentially, one at a time, using clock circuitry with a sequencer. For Boltzmann networks the update order does not matter and any random update order produces the correct probability distribution. But for Bayesian networks, a specific parent-to-child update order is necessary to converge to the correct probability distribution dictated by the Bayes rule. (c) The same Bayesian and Boltzmann p-circuits are implemented on autonomous hardware built with p-bit designs 1 and 2 without any clocks or sequencers. For Bayesian networks, design 2 fails to match the probabilities obtained from applying the Bayes rule, whereas design 1 works quite well as an autonomous Bayesian network.
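The effect of update order in a sequential p-circuit can be reproduced with a minimal software sketch. The example below is not the three p-bit network of fig. 1b: it is a hypothetical two-node network A → B with an assumed coupling J_BA = 1 and zero biases, and it assumes the common BSN form m_i = sgn[tanh(I_i) − rand(−1,+1)] for the p-bit equation together with a linear synapse. All function and variable names are illustrative. Updating the parent before the child should reproduce the chain-rule probabilities, whereas updating the child before the parent leaves A and B uncorrelated in this particular example.

```python
import math
import random


def pbit(I, rng):
    """Assumed BSN form: m = sgn[tanh(I) - rand(-1, +1)], returned as +1 or -1."""
    return 1 if math.tanh(I) - rng.uniform(-1.0, 1.0) > 0 else -1


def sample_joint(order, J, h, n_sweeps, seed=0):
    """Sequentially update p-bits in the given order and histogram the states.

    J[i][j] is the coupling from p-bit j to p-bit i (linear synapse).
    One sample is recorded after each full sweep over all p-bits.
    """
    rng = random.Random(seed)
    n = len(h)
    m = [1] * n
    counts = {}
    for _ in range(n_sweeps):
        for i in order:
            I = h[i] + sum(J[i][j] * m[j] for j in range(n))
            m[i] = pbit(I, rng)
        key = tuple(m)
        counts[key] = counts.get(key, 0) + 1
    return {k: v / n_sweeps for k, v in counts.items()}


def analytic_joint(a, b, J_BA):
    """Chain rule for A -> B: P(A=a) * P(B=b | A=a), with a, b in {-1, +1}."""
    return 0.5 * 0.5 * (1.0 + b * math.tanh(J_BA * a))


# Hypothetical network: node 0 = A (parent), node 1 = B (child).
J_BA = 1.0
J = [[0.0, 0.0], [J_BA, 0.0]]
h = [0.0, 0.0]

if __name__ == "__main__":
    forward = sample_joint([0, 1], J, h, n_sweeps=20000)  # parent to child
    reverse = sample_joint([1, 0], J, h, n_sweeps=20000)  # child to parent
    for a in (-1, 1):
        for b in (-1, 1):
            print((a, b),
                  round(analytic_joint(a, b, J_BA), 3),
                  round(forward.get((a, b), 0.0), 3),
                  round(reverse.get((a, b), 0.0), 3))
```

In the child-to-parent order, B is computed from the previous state of A, and A is then redrawn independently, so the recorded pair carries no correlation. This is a property of the two-node toy network; larger networks generally show a distorted rather than a flat histogram.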
Unlike sequential p-circuits in the ANN literature, the distinguishing feature of our probabilistic hardware is that it is autonomous: each p-bit runs in parallel without any clocks or sequencers. This autonomous p-circuit (ApC) allows massive parallelism, potentially providing petaflips per second sampling speed [43]. The complete sequencer-free operation of our "autonomous" p-circuit is very different from the "asynchronous" operation of spiking neural networks [44,45]. Although p-bits fluctuate in parallel in an ApC, it is very unlikely that two p-bits will update at exactly the same time, since random noise controls their dynamics. Persistent parallel updates are therefore extremely unlikely and are not a concern. Note that even if p-bits update sequentially, each update has to be informed: when a p-bit updates, it must have received the up-to-date input I_i based on the latest states m_j of the other p-bits it is connected to. This informed update is ensured as long as the synapse response time is much shorter than the neuron time (τ_S ≪ τ_N), and this is a key design rule for an ApC.
An ApC works properly for a Boltzmann network without any clock, since no specific update order is required in this case. It is not at all obvious, however, that an ApC would work for a Bayesian network, since a particular parent-to-child informed update order is required, as shown in fig. 1b, and a clockless autonomous circuit has no explicit mechanism to enforce it. In fig. 1c, we show that it is possible to design a hardware p-circuit that naturally ensures a parent-to-child informed update order in a Bayesian network without any clocks. Two p-bit designs are evaluated for implementing both Boltzmann and Bayesian networks. Design 1 is suitable for both Boltzmann and Bayesian networks, whereas design 2 is suitable for Boltzmann networks only and does not work for Bayesian networks in general. The synapse in both types of p-circuits is implemented using a resistive crossbar architecture [38,46]. In all the simulations τ_S is assumed to be negligible compared to other time scales in the circuit dynamics.
Further, in section II we provide behavioral models for both design 1 and design 2 that illustrate the essential characteristics needed for correct sequencer-free operation of BNs. Both models are benchmarked against state-of-the-art device/circuit models (SPICE) of the actual devices and can be used for the efficient simulation of large-scale autonomous networks.

II. BEHAVIORAL MODEL FOR AUTONOMOUS HARDWARE
A. Autonomous behavioral model: Design 1
The autonomous circuit behavior of design 1 can be explained by slightly modifying the two equations (eqns. 1a and 1b) stated in section I. The fluctuating resistance of the low barrier nanomagnet based MTJ is represented by a correlated random number r_MTJ with values between −1 and +1 and an average dwell time of the fluctuation denoted by τ_N. The NMOS transistor tunable resistance is denoted by r_T and the inverter is represented by a sgn function. Thus the normalized output m_i = V_OUT,i/(V_DD/2) of the i-th p-bit can be expressed as:
where Δt is the simulation time step, r_T,i is the NMOS transistor resistance tunable by the normalized input
where V_0 is a fitting parameter which is ≈ 50 mV for the chosen parameters and transistor technology, and r_MTJ,i is a correlated random number generator with an average retention time of τ_N. r_T,i as a function of the input I_i is approximated by a tanh function with a response time denoted by τ_T, modelled by the following equations:
The synapse delay τ_S in computing the input I_i can be modelled by:
For calculating r_MTJ,i at time t + Δt, a new random number is picked according to the following equations:
where rand[0,1] is a uniformly distributed random number between 0 and 1 and τ_N represents the average retention time of the fluctuating MTJ resistance. If r_flip is −1, a new random r_MTJ is chosen between −1 and +1. Otherwise the previous r_MTJ(t) is kept in the next time step (t + Δt), which can be expressed as:
The charge current flowing through the MTJ branch of p-bit design 1 can get polarized by the fixed layer of the MTJ and generate a spin current I_MTJ that can tune/pin r_MTJ by modifying τ_N according to:
where τ_N^0 is the retention time of r_MTJ when I_MTJ = 0. This pinning effect by I_MTJ is much smaller in in-plane magnets (IMA) than in perpendicular magnets (PMA) [47].
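The ingredients described above can be put together in a minimal time-stepped sketch. The discretization below is an assumption, not the model equations themselves: r_T is relaxed toward tanh(I) with a first-order time constant τ_T, r_MTJ is kept with probability exp(−Δt/τ_N) and otherwise redrawn uniformly in [−1, +1], and the output is taken as sgn(r_T − r_MTJ). The synapse delay and the pinning current are neglected (τ_S → 0, I_MTJ = 0), time is in arbitrary units, and the function name and default parameters are hypothetical.

```python
import math
import random


def simulate_design1(I, n_steps, dt=0.1, tau_T=0.01, tau_N=1.0, seed=1):
    """Time-averaged output of a single design-1-like p-bit at constant input I.

    Assumed update per time step:
      r_T   <- first-order relaxation toward tanh(I) with time constant tau_T
      r_MTJ <- kept with probability exp(-dt/tau_N), else redrawn in [-1, +1]
      m     <- sgn(r_T - r_MTJ)
    """
    rng = random.Random(seed)
    r_T = 0.0
    r_MTJ = rng.uniform(-1.0, 1.0)
    a_T = 1.0 - math.exp(-dt / tau_T)   # fraction of the gap closed per step
    p_keep = math.exp(-dt / tau_N)      # probability that r_MTJ is retained
    target = math.tanh(I)
    total = 0
    for _ in range(n_steps):
        r_T += a_T * (target - r_T)
        if rng.random() > p_keep:
            r_MTJ = rng.uniform(-1.0, 1.0)
        total += 1 if r_T > r_MTJ else -1
    return total / n_steps


if __name__ == "__main__":
    for I in (-1.0, 0.0, 0.5, 1.0):
        print(I, round(simulate_design1(I, 200000), 3), round(math.tanh(I), 3))
```

Under these assumptions r_MTJ is uniformly distributed in the steady state, so the time-averaged output should approach tanh(I), which is the sigmoidal response the behavioral model is meant to capture.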
Figure 2a shows the comparison of this behavioral model for p-bit design 1 with SPICE simulation of the actual hardware in terms of fluctuation dynamics, sigmoidal characteristic response, autocorrelation time (τ_corr) and step response time (τ_step); in all cases the behavioral model closely matches the SPICE simulations. The SPICE simulation involves experimentally benchmarked modules for different parts of the device, for example solving the stochastic Landau-Lifshitz-Gilbert (sLLG) equation for the LBM physics and the 14 nm Predictive Technology Model (PTM) for transistors. The autonomous behavioral model for design 1 is labeled as "PPSL: design 1". The benchmarking is done for two different LBMs: (1) Faster fluctuating magnet 1 with saturation magnetization M_s = 1100 emu/cc, diameter

In the steady state sigmoidal response, V_0 is a tanh fitting parameter that defines the width of the sigmoid. Values between 40 mV and 60 mV fit reasonably well, depending on which part of the sigmoid needs to be matched best. In fig. 2, a V_0 value of 50 mV is used to fit the sigmoid from the SPICE simulation.
There are two types of time responses: (1) the autocorrelation time under zero input, labeled τ_corr, and (2) the step response time τ_step. τ_corr is defined as the full width at half maximum (FWHM) of the autocorrelation function of the fluctuating output under zero input and is proportional to the retention time τ_N of the LBM. The step response time τ_step is obtained by averaging the p-bit output over many ensembles when the input I_i is stepped from a large negative value to zero at time t = 0. τ_step defines how fast the first statistically correct sample can be obtained after the input is changed. For p-bit design 1, τ_step is independent of the LBM retention time τ_N and is defined by the NMOS transistor response time τ_T, which is much shorter (a few picoseconds) than the LBM fluctuation time τ_N. The effect of these two very different time scales in design 1 (τ_step ≪ τ_corr) on an autonomous Bayesian network is described in section III.

B. Autonomous behavioral model: Design 2
The autonomous behavioral model for design 2 was proposed in [43]. In this article, we benchmark this model against the SPICE simulation of the single p-bit steady state and time responses shown in fig. 2b. According to this model, the normalized output m_i = V_OUT,i/(V_DD/2) can be expressed as:
where p_NOTflip,i(t + Δt) is the probability of retention of the i-th p-bit (or "not flipping") in the next time step, which is a function of the average neuron flip time τ_N, the input I_i and the current p-bit output m_i(t). Different time scales in p-bit designs 1 and 2 are also reported in [47] in an energy-delay analysis context. In this article, we explain the effect of these time scales in designing an autonomous Bayesian network (section III).
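A retention-probability model of this kind can be sketched in a few lines. The functional form used below is an assumption chosen for illustration and is not taken from the text: the p-bit keeps its state with probability p_NOTflip = exp(−Δt / (τ_N exp(I·m))), so that a state aligned with the input is retained longer, and otherwise flips sign. The function name, parameters and units are hypothetical.

```python
import math
import random


def simulate_design2(I, n_steps, dt=0.05, tau_N=1.0, seed=2):
    """Time-averaged output of a single design-2-like p-bit at constant input I.

    Assumed update per time step:
      p_not_flip = exp(-dt / (tau_N * exp(I * m)))
      m <- m if rand[0, 1] < p_not_flip, else -m
    """
    rng = random.Random(seed)
    m = 1
    total = 0
    # The retention probability depends only on the sign of m for fixed I.
    p_not_flip = {
        +1: math.exp(-dt / (tau_N * math.exp(+I))),
        -1: math.exp(-dt / (tau_N * math.exp(-I))),
    }
    for _ in range(n_steps):
        if rng.random() > p_not_flip[m]:
            m = -m
        total += m
    return total / n_steps


if __name__ == "__main__":
    for I in (-1.0, 0.0, 0.5, 1.0):
        print(I, round(simulate_design2(I, 400000), 3), round(math.tanh(I), 3))
```

With this assumed form, the two flip rates are exp(∓I)/τ_N, so for a small time step the steady-state average approaches tanh(I). Because the output changes only through flip events whose rate is set by τ_N, the response to a change in a moderate input is also limited by τ_N, in line with the observation that τ_step and τ_corr are both proportional to τ_N in design 2.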

III. DIFFERENCE BETWEEN DESIGN 1 AND DESIGN 2 IN IMPLEMENTING BN
The behavioral models introduced in section II are applied to implement a multi-layer belief/Bayesian network with 19 p-bits and random interconnection strengths between +1 and −1 (fig. 3a). For illustrative purposes, the interconnections are designed in such a way that, although there are no meaningful correlations between the blue and red colored nodes with random couplings, the pairs of intermediate nodes (A, M_1) and (M_1, B) get negatively correlated because of a net −r^2 type coupling through each branch connecting the pairs. It is therefore expected that the start and end nodes (A, B) get positively correlated. Fig. 3b shows histograms of the four configurations (00, 01, 10, 11) of the pair of nodes A and B obtained from different approaches: Bayes rule (labeled as Analytic), SPICE simulation of design 1 (SPICE: Design 1) and design 2 (SPICE: Design 2), and the autonomous behavioral model for design 1 (PPSL: Design 1) and design 2 (PPSL: Design 2). The results from the SPICE simulation and the behavioral model for design 1 match the standard analytical values reasonably well, showing the 00 and 11 states with the highest probability, whereas the design 2 autonomous hardware does not match the analytical results and shows approximately equal peaks. The analytical values are obtained by applying the standard joint probability rule for BNs [11,14], which is:
$$P(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} P\big(x_i \mid \mathrm{Parents}(x_i)\big)$$
The joint probability of two specific nodes x_i and x_j can be calculated from the above equation by summing over all configurations of the other nodes in the network, which becomes computationally expensive for larger networks. One major advantage of our probabilistic hardware is that the probabilities of specific nodes can be obtained just by looking at the nodes of interest and ignoring all other nodes in the system, similar to what Feynman stated about a probabilistic computer imitating the probabilistic laws of nature [48]. Indeed, in the Bayesian network example in fig. 3, the probabilities of the different configurations of nodes A and B were obtained just by looking at the fluctuating outputs of the two nodes, ignoring all other nodes. For the SPICE simulation of the design 1 hardware, a tanh fitting parameter V_0 = 57 mV is used, and the mapping principle from the dimensionless coupling terms J_ij to the coupling resistances in the hardware is described in [32].
Design 1 works for a BN and design 2 does not because of the two very different time responses of the two designs shown in fig. 2. It is the two different time scales in design 1 (τ_step ≪ τ_corr) that naturally ensure a parent-to-child informed update order in a Bayesian network. When τ_step is small, each child node can immediately respond to any change of its parent nodes, which evolve on a much longer time scale ∝ τ_corr, and thus becomes conditionally consistent with the parent nodes very quickly. Otherwise, if τ_corr gets comparable to τ_step, the child node will not be able to keep up with the fast changing parent nodes and will produce a substantial number of statistically incorrect samples over the entire time range, thus deviating from the correct probability distribution.
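The analytical reference values can be illustrated on a much smaller example than the 19 p-bit network. The sketch below uses a hypothetical three-node chain A → M_1 → B with an assumed coupling of −r on each edge and zero biases, takes each conditional probability from the BSN form as P(m_i | parents) = (1 + m_i tanh(I_i))/2, and obtains P(A, B) by brute-force summation over M_1. Function names and the value r = 1 are illustrative only.

```python
import itertools
import math


def joint_probability(state, J, h):
    """Product of conditionals P(m_i | parents) = (1 + m_i * tanh(I_i)) / 2.

    J[i][j] is the coupling from node j to node i; for a DAG, J[i][j] is
    nonzero only when j is a parent of i.
    """
    p = 1.0
    for i, m_i in enumerate(state):
        I = h[i] + sum(J[i][j] * state[j] for j in range(len(state)))
        p *= 0.5 * (1.0 + m_i * math.tanh(I))
    return p


def marginal(nodes, J, h):
    """Marginal over the listed nodes by summing over all 2^N configurations."""
    out = {}
    for state in itertools.product((-1, 1), repeat=len(h)):
        key = tuple(state[k] for k in nodes)
        out[key] = out.get(key, 0.0) + joint_probability(state, J, h)
    return out


# Hypothetical chain: node 0 = A, node 1 = M1, node 2 = B, coupling -r per edge.
r = 1.0
J = [[0.0, 0.0, 0.0],
     [-r, 0.0, 0.0],
     [0.0, -r, 0.0]]
h = [0.0, 0.0, 0.0]

if __name__ == "__main__":
    print("P(A, M1):", marginal([0, 1], J, h))
    print("P(A, B): ", marginal([0, 2], J, h))
```

For this chain the two negative couplings make (A, M_1) and (M_1, B) anti-correlated and (A, B) positively correlated, with P(A = +1, B = +1) = (1 + tanh²(r))/4. The cost of the summation doubles with every added node, which is the scaling problem that sampling only the nodes of interest avoids.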
The effect of the τ_step/τ_corr ratio is shown in fig. 4 for the same BN presented in fig. 3 by plotting the histogram of AB configurations for different τ_T/τ_N ratios. When the τ_T/τ_N ratio is small, the histogram converges to the correct distribution. As τ_T gets comparable to τ_N, the histogram begins to diverge from the correct distribution. Thus the very fast NMOS transistor response in design 1 makes it suitable for autonomous Bayesian network hardware. Note that under certain conditions, results from design 2 can also match the analytical results, namely if the input I_i to each p-bit in the network always fluctuates between large values, which ensures a fast step response time.
So apart from ensuring a synapse that is fast compared to the neuron fluctuation time (τ_S ≪ τ_N), which is the design rule for autonomous probabilistic hardware, the autonomous Bayesian network demands an additional p-bit design rule: the step response time of the p-bit must be much shorter than its fluctuation time (τ_step ≪ τ_N), as ensured in design 1. In all the simulations the LBM was a circular in-plane magnet whose magnetization spans all values between +1 and −1, with negligible pinning effect. If the LBM is a PMA magnet with bipolar fluctuations having just the two values +1 and −1, design 1 will not provide any sigmoidal response except with a substantial pinning effect [31]. Under this condition, τ_step of design 1 will again be comparable to τ_N and the system will not work as an autonomous Bayesian network in general. Therefore an LBM with continuous-range fluctuation is required for the design 1 p-bit to work properly in a Bayesian network.
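The role of the τ_T/τ_N ratio can be illustrated with a small autonomous simulation in which no update order is imposed. The sketch below is a toy model under stated assumptions, not the 19 p-bit network of fig. 3: a hypothetical two-node network A → B with J_BA = 1, zero biases, negligible synapse delay and no pinning, where both p-bits follow an assumed design-1-like update (r_MTJ retained with probability exp(−Δt/τ_N), r_T of the child relaxing toward tanh(J_BA m_A) with time constant τ_T, output sgn(r_T − r_MTJ)). Names, parameter values and time units are illustrative.

```python
import math
import random


def autonomous_two_node(tau_T, n_steps, J_BA=1.0, dt=0.02, tau_N=1.0, seed=3):
    """Time-averaged probability of (A, B) = (+1, +1) in a clockless A -> B pair."""
    rng = random.Random(seed)
    a_T = 1.0 - math.exp(-dt / tau_T)   # transistor relaxation per step
    p_keep = math.exp(-dt / tau_N)      # MTJ retention probability per step
    t_J = math.tanh(J_BA)
    r_mtj_A = rng.uniform(-1.0, 1.0)
    r_mtj_B = rng.uniform(-1.0, 1.0)
    r_T_B = 0.0
    hits = 0
    for _ in range(n_steps):
        # Both MTJs fluctuate independently at every step (no sequencer).
        if rng.random() > p_keep:
            r_mtj_A = rng.uniform(-1.0, 1.0)
        if rng.random() > p_keep:
            r_mtj_B = rng.uniform(-1.0, 1.0)
        # Parent A is unbiased, so its r_T is tanh(0) = 0.
        m_A = 1 if 0.0 > r_mtj_A else -1
        # Child B tracks tanh(J_BA * m_A) with response time tau_T.
        r_T_B += a_T * (t_J * m_A - r_T_B)
        m_B = 1 if r_T_B > r_mtj_B else -1
        if m_A == 1 and m_B == 1:
            hits += 1
    return hits / n_steps


def exact_p11(J_BA=1.0):
    """Chain-rule value P(A=+1) * P(B=+1 | A=+1)."""
    return 0.25 * (1.0 + math.tanh(J_BA))


if __name__ == "__main__":
    for ratio in (0.02, 0.1, 1.0):
        print(ratio, round(autonomous_two_node(ratio, 500000), 3),
              round(exact_p11(), 3))
```

In this toy model a small τ_T/τ_N keeps the child conditioned on the current state of its parent, so the sampled probability stays close to the chain-rule value, while τ_T comparable to τ_N makes the child lag behind and weakens the A-B correlation. The numbers are specific to the assumed discretization and are not meant to reproduce fig. 4.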

IV. DISCUSSION
In this article we have elucidated the design criteria for autonomous clockless hardware for Bayesian networks, which require a specific parent-to-child update order when implemented on a probabilistic circuit. By performing SPICE simulations of two autonomous probabilistic hardware designs built out of p-bits (design 1 and design 2 in fig. 1), we have shown that the autonomous hardware will naturally ensure a parent-to-child informed update order without any sequencers if the step response time (τ_step) of the p-bit is much smaller than its autocorrelation time (τ_corr). This criterion of having two different time scales is met in design 1, as τ_step comes from the NMOS transistor response time τ_T, which is a few picoseconds in this design. We have also proposed an autonomous behavioral model for design 1 and benchmarked it against SPICE simulation of the actual hardware. All the simulations using the behavioral model for design 1 are performed ignoring some non-ideal effects listed below:
• Pinning of the sMTJ fluctuation due to the spin transfer torque (STT) effect is ignored by assuming I_MTJ = 0 in eqn. 6. This is a reasonable assumption for circular in-plane magnets, which are very difficult to pin due to the large demagnetization field that is always present, irrespective of the energy barrier [47]. The pinning effect is more prominent in perpendicular anisotropy (PMA) magnets. It is important to include the pinning effect in p-bits with bipolar LBM fluctuations, since in this case the p-bit does not provide a sigmoidal response without the pinning current. This effect is also experimentally observed in [31] for PMA magnets. Such a p-bit design with bipolar PMA and STT pinning might not work for Bayesian networks in general, since in this case τ_step will depend on the magnet fluctuation time τ_N.
• In the proposed behavioral model, the step response time of the NMOS transistor τ_T in design 1 is assumed to be independent of the input I. In real hardware, however, τ_T has a functional dependence on I.
• The NMOS transistor resistance r_T is approximated as a tanh function for simplicity. To capture the hardware behavior more accurately, the tanh can be replaced by a more complicated function, and the weight matrix [J] will then have to be learnt around that function.
All the non-ideal effects listed above are expected to have minimal effect on the probability distributions shown in this article. Real LBMs may suffer from common fabrication defects, resulting in variations in the average magnet fluctuation time τ_N [49]. The autonomous BN is also quite tolerant to such variations in τ_N as long as τ_T ≪ min(τ_N).

FIG. 2. Autonomous behavioral model for p-bit: Design 1 and 2: (a) The behavioral model for the autonomous hardware with design 1 is benchmarked against SPICE simulations of the actual device involving experimentally benchmarked modules. The behavioral model (labeled as 'PPSL') shows good agreement with SPICE in terms of capturing fluctuation dynamics, steady state sigmoidal response, and two different time responses: the autocorrelation time of the fluctuating output under zero input, labeled τ_corr, which is proportional to the LBM retention time τ_N in the nanosecond range, and the step response time τ_step, defined by the transistor response time τ_T, which is a few picoseconds and much smaller than τ_N. The magnet parameters used in the simulations are mentioned in section II. (b) Similar benchmarking for p-bit design 2. In this case τ_step is proportional to τ_N.
Figure 2b shows that this simple autonomous behavioral model for design 2 matches the SPICE simulation of the device reasonably well in terms of fluctuation dynamics, sigmoidal characteristic response, autocorrelation time (τ_corr) and step response time (τ_step). In design 2, unlike design 1, τ_step and τ_corr are both proportional to the LBM fluctuation time τ_N.

FIG. 3. Difference between design 1 and design 2: (a) The behavioral models described in fig. 2 are applied to simulate a 19 p-bit BN with random J_ij between +1 and −1. The interconnections are designed in such a way that the pairs of intermediate nodes (A, M_1) and (M_1, B) get anti-correlated and (A, B) gets positively correlated. (b) The probability distribution of the four configurations of AB is shown in a histogram from different approaches (SPICE, behavioral model and analytic). The behavioral models for the two designs (labeled as PPSL) match the corresponding results from SPICE simulation of the actual hardware reasonably well. Note that while design 1 matches the standard analytical values quite well, design 2 does not work as an autonomous Bayesian network in general.

FIG. 4. Effect of step response time in design 1: The reason design 1 works accurately as an autonomous Bayesian network, as shown in fig. 3, is the two different time scales (τ_T and τ_N) in this design, with the condition that τ_T ≪ τ_N. The same histogram shown in fig. 3 is plotted using the proposed behavioral model for different τ_T/τ_N ratios and compared with the analytical values. As τ_T gets comparable to τ_N, the probability distribution diverges from the standard statistical values.