In-memory computing (IMC) is a non-von Neumann paradigm that has recently established itself as a promising approach for energy-efficient, high-throughput hardware for deep learning applications. One prominent application of IMC is performing matrix-vector multiplications in O(1) time complexity by mapping the synaptic weights of a neural-network layer onto the devices of an IMC core. However, because its pattern of execution differs significantly from that of previous computational paradigms, IMC requires a rethinking of the architectural design choices made when designing deep-learning hardware. In this work, we focus on application-specific IMC hardware for inference of Convolutional Neural Networks (CNNs), and provide methodologies for implementing the various architectural components of the IMC core. Specifically, we present methods for mapping synaptic weights and activations onto the memory structures and give evidence of the various trade-offs therein, such as the one between on-chip memory requirements and execution latency. Lastly, we show how to employ these methods to implement a pipelined dataflow that offers throughput and latency beyond the state of the art for image-classification tasks.
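A minimal numpy sketch of this weight-mapping idea, assuming an im2col-style unrolling of a convolutional layer onto a crossbar; the function names (map_conv_to_crossbar, imc_mvm) and shapes are illustrative placeholders, not the paper's actual mapping scheme:

```python
import numpy as np

def map_conv_to_crossbar(kernels):
    """Unroll (K, K, Cin, Cout) conv kernels so each output channel becomes one crossbar column."""
    K, _, c_in, c_out = kernels.shape
    return kernels.reshape(K * K * c_in, c_out)

def imc_mvm(crossbar, activation_patch):
    """Idealized O(1)-time analog matrix-vector multiply: all columns accumulate in parallel."""
    return activation_patch @ crossbar  # a single 'read' of the IMC core

# Example: a 3x3 convolution with 64 input and 128 output channels
kernels = np.random.randn(3, 3, 64, 128).astype(np.float32)
crossbar = map_conv_to_crossbar(kernels)           # shape (576, 128)
patch = np.random.randn(3 * 3 * 64).astype(np.float32)
out = imc_mvm(crossbar, patch)                     # 128 output activations in one step
```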
Compute-in-memory (CIM) is an attractive solution for processing the extensive multiply-and-accumulate (MAC) workloads of deep neural network (DNN) hardware accelerators. A simulator offering options for various mainstream and emerging memory technologies, architectures, and networks is a great convenience for fast early-stage design-space exploration of CIM hardware accelerators. DNN+NeuroSim is an integrated benchmark framework supporting flexible and hierarchical CIM array design options from the device level, to the circuit level, and up to the algorithm level. In this study, we validate and calibrate the predictions of NeuroSim against post-layout simulations of a 40-nm RRAM-based CIM macro. First, the parameters of the memory device and CMOS transistors are extracted from the foundry’s process design kit (PDK) and employed in the NeuroSim settings; the peripheral modules and operating dataflow are also configured to match the actual chip implementation. Next, the area, critical path, and energy consumption values from the SPICE simulations at the module level are compared with those from NeuroSim. Adjustment factors are introduced to account for transistor sizing and wiring area in the layout, gate switching activity, post-layout performance drop, etc. We show that the prediction from NeuroSim is accurate, with chip-level error under 1% after calibration. Finally, the system-level performance benchmark is conducted with various device technologies and compared with the results before the validation. The general conclusions remain the same after the validation, but the performance degrades slightly due to the post-layout calibration.
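A hedged sketch of the calibration flow described above: module-level adjustment factors are extracted by comparing analytical estimates with post-layout SPICE references, then reapplied when a new system-level estimate is produced. All module names and numbers below are illustrative placeholders, not the paper's measured values:

```python
def extraction_factor(model_value, spice_value):
    """Multiplicative adjustment factor for one module/metric."""
    return spice_value / model_value

# per-module energy estimates (pJ): (analytical model, post-layout SPICE) -- placeholder values
calibration_points = {
    "adc":         (1.00, 1.12),
    "wl_driver":   (0.40, 0.47),
    "accumulator": (0.25, 0.26),
}
factors = {m: extraction_factor(est, ref) for m, (est, ref) in calibration_points.items()}

def calibrated_energy(raw_module_energies):
    """Apply the per-module factors to a new, uncalibrated system-level estimate."""
    return sum(e * factors[m] for m, e in raw_module_energies.items())

system_estimate = {"adc": 52.0, "wl_driver": 21.5, "accumulator": 13.8}
print(f"calibrated total energy: {calibrated_energy(system_estimate):.1f} pJ")
```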
The development of brain-inspired neuromorphic computing architectures as a paradigm for Artificial Intelligence (AI) at the edge is a candidate solution that can meet strict energy and cost reduction constraints in the Internet of Things (IoT) application areas. Toward this goal, we present μBrain: the first digital yet fully event-driven (clockless) architecture, with co-located memory and processing capability, that exploits event-based processing to reduce an always-on system's overall energy consumption (μW dynamic operation). The chip area in a 40 nm Complementary Metal Oxide Semiconductor (CMOS) digital technology is 2.82 mm² including pads (1.42 mm² without pads). This small area footprint enables μBrain integration in re-trainable sensor ICs to perform various signal processing tasks, such as data preprocessing, dimensionality reduction, feature selection, and application-specific inference. We present an instantiation of the μBrain architecture in a 40 nm CMOS digital chip and demonstrate its efficiency in radar-based gesture classification, with a power consumption of 70 μW and an energy consumption of 340 nJ per classification. As a digital architecture, μBrain is fully synthesizable and lends itself to a fast development-to-deployment cycle in Application-Specific Integrated Circuits (ASICs). To the best of our knowledge, μBrain is the first tiny-scale digital, spike-based, fully parallel, non-von Neumann architecture (without schedules, clocks, or state machines). For these reasons, μBrain is ultra-low-power and offers software-to-hardware fidelity. μBrain enables always-on neuromorphic computing in IoT sensor nodes that require running on battery power for years.
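To illustrate the event-driven principle (neurons are updated only when a spike event arrives, so a silent input costs no dynamic energy), here is a minimal sketch using a generic integrate-and-fire model; the class, thresholds, and connectivity are illustrative assumptions and do not describe the μBrain micro-architecture itself:

```python
from collections import deque

class EventDrivenNeuron:
    def __init__(self, threshold=100):
        self.potential = 0
        self.threshold = threshold

    def receive(self, weight):
        """Integrate one incoming event; fire and reset if the threshold is crossed."""
        self.potential += weight
        if self.potential >= self.threshold:
            self.potential = 0
            return True          # emit an output spike event
        return False

# fan-out table: source neuron -> list of (target index, synaptic weight)
synapses = {0: [(1, 60), (2, 45)], 1: [(2, 80)], 2: []}
neurons = [EventDrivenNeuron() for _ in range(3)]

events = deque([0, 0])           # two input spikes arriving at neuron 0
while events:                    # processing is driven purely by the event queue, no clock
    src = events.popleft()
    for dst, w in synapses[src]:
        if neurons[dst].receive(w):
            events.append(dst)
```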
Equilibrium Propagation is a biologically-inspired algorithm that trains convergent recurrent neural networks with a local learning rule. This approach constitutes a major avenue toward learning-capable neuromorphic systems and comes with strong theoretical guarantees. Equilibrium Propagation operates in two phases, during which the network is first allowed to evolve freely and then “nudged” toward a target; the weights of the network are then updated based solely on the states of the neurons that they connect. The weight updates of Equilibrium Propagation have been shown mathematically to approach those provided by Backpropagation Through Time (BPTT), the mainstream approach to train recurrent neural networks, when nudging is performed with infinitely small strength. In practice, however, the standard implementation of Equilibrium Propagation does not scale to visual tasks harder than MNIST. In this work, we show that a bias in the gradient estimate of Equilibrium Propagation, inherent in the use of finite nudging, is responsible for this phenomenon and that canceling it allows training deep convolutional neural networks. We show that this bias can be greatly reduced by using symmetric nudging (a positive nudging and a negative one). We also generalize Equilibrium Propagation to the case of cross-entropy loss (as opposed to squared error). As a result of these advances, we are able to achieve a test error of 11.7% on CIFAR-10, which approaches that achieved by BPTT and provides a major improvement over standard Equilibrium Propagation, which gives an 86% test error. We also apply these techniques to train an architecture with unidirectional forward and backward connections, yielding a 13.2% test error. These results highlight Equilibrium Propagation as a compelling biologically-plausible approach to compute error gradients in deep neuromorphic systems.
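For concreteness, the symmetric estimator discussed above can be written as follows (a hedged restatement in the standard Equilibrium Propagation notation, where $\Phi$ denotes the network's primitive function, $\theta$ the parameters, $\beta$ the nudging strength, and $s_*^{\pm\beta}$ the equilibrium states reached under positive and negative nudging):

$$
\widehat{\nabla}_\theta \mathcal{L} \;=\; \frac{1}{2\beta}\left(\frac{\partial \Phi}{\partial \theta}\bigl(x,\, s_*^{+\beta},\, \theta\bigr) \;-\; \frac{\partial \Phi}{\partial \theta}\bigl(x,\, s_*^{-\beta},\, \theta\bigr)\right),
$$

in contrast with the one-sided estimate $\frac{1}{\beta}\bigl(\frac{\partial \Phi}{\partial \theta}(x, s_*^{+\beta}, \theta) - \frac{\partial \Phi}{\partial \theta}(x, s_*^{0}, \theta)\bigr)$, whose first-order bias in $\beta$ is what the symmetric version cancels.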