Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices: Design Considerations

In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, and pattern extraction. Training of large DNNs, however, is universally considered a time-consuming and computationally intensive task that demands datacenter-scale computational resources recruited for many days. Here we propose a concept of resistive processing unit (RPU) devices that can potentially accelerate DNN training by orders of magnitude while using much less power. The proposed RPU device can store and update the weight values locally, thus minimizing data movement during training and making it possible to fully exploit the locality and the parallelism of the training algorithm. We evaluate the effect of various RPU device features, non-idealities, and system parameters on performance in order to derive the device and system level specifications for implementation of an accelerator chip for DNN training in a realistic CMOS-compatible technology. For large DNNs with about 1 billion weights this massively parallel RPU architecture can achieve acceleration factors of 30,000× compared to state-of-the-art microprocessors while providing power efficiency of 84,000 GigaOps/s/W. Problems that currently require days of training on a datacenter-size cluster with thousands of machines can be addressed within hours on a single RPU accelerator. A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, for example natural speech recognition and translation between all world languages, real-time analytics on large streams of business and scientific data, and integration and analysis of multimodal sensory data flows from a massive number of IoT (Internet of Things) sensors.


Introduction
Deep Neural Networks (DNNs) [1] have demonstrated significant commercial success in recent years, with performance exceeding sophisticated prior methods in speech [2] and object recognition [3-5]. However, training DNNs is an extremely computationally intensive task that requires massive computational resources and enormous training time, which hinders their further application. For example, a 70% relative improvement has been demonstrated for a DNN with 1 billion connections that was trained on a cluster with 1000 machines for three days [6]. Training DNNs relies in general on the backpropagation algorithm, which is intrinsically local and parallel [7]. Various hardware approaches to accelerate DNN training that exploit this locality and parallelism have been explored with different levels of success, starting from the early 1990s [8,9] to current developments with GPUs [10,11], FPGAs [12], or specially designed ASICs [13]. Further acceleration is possible by fully utilizing the locality and parallelism of the algorithm. For a fully connected DNN layer that maps $N$ neurons to $N$ neurons, significant acceleration can be achieved by minimizing data movement using local storage and processing of the weight values on the same node, and by connecting nodes together into a massive $N \times N$ systolic array [8] in which the whole DNN can fit. Instead of the usual time complexity of $O(N^2)$, the problem can therefore be reduced to a constant time $O(1)$ independent of the array size. However, the addressable problem size is limited to the number of nodes in the array, which is challenging to scale up to billions even with the most advanced CMOS technologies.
Novel nano-electronic device concepts based on non-volatile memory (NVM) technologies, such as phase change memory (PCM) [14,15] and resistive random access memory (RRAM) [15-19], have been explored recently for implementing neural networks with a learning rule inspired by spike-timing-dependent plasticity (STDP) observed in biological systems [20]. Only recently has their implementation for acceleration of DNN training using the backpropagation algorithm been considered [21-25], with reported acceleration factors ranging from 27X [26] to 900X [21], and even 2140X [27], and a significant reduction in power and area. This bottom-up approach of using previously developed memory technologies looks very promising; however, the estimated acceleration factors are limited by device specifications intrinsic to their application as NVM cells. Device characteristics usually considered beneficial or irrelevant for memory applications, such as high on/off ratio, digital bit-wise storage, and asymmetrical set and reset operations, become limitations for acceleration of DNN training [26,28]. These non-ideal device characteristics can potentially be compensated with a proper design of peripheral circuits and of the whole system, but only partially and at the cost of significantly increased operational time [26]. In contrast, here we propose a top-down approach where ultimate acceleration of DNN training is achieved by designing a system and CMOS circuitry that impose specific requirements on the resistive devices. We propose and analyze a concept of Resistive Processing Unit (RPU) devices that can simultaneously store and process weights and are potentially scalable to billions of nodes with foundry CMOS technologies. Our estimates indicate that acceleration factors close to 30,000 are achievable on a single chip with realistic power and area constraints.

Definition of the RPU device concept
The backpropagation algorithm is composed of three cycles (forward, backward, and weight update) that are repeated many times until a convergence criterion is met. The forward and backward cycles mainly involve computing vector-matrix multiplication in the forward and backward directions. This operation can be performed on a 2D crossbar array of two-terminal resistive devices, as was proposed more than 50 years ago [29]. In the forward cycle, stored conductance values in the crossbar array form a matrix, whereas the input vector is transmitted as voltage pulses through each of the input rows. In the backward cycle, when voltage pulses are supplied from the columns as an input, the vector-matrix product is computed on the transpose of the matrix. These operations achieve the required $O(1)$ time complexity, but only for two out of the three cycles of the training algorithm.
In contrast to the forward and backward cycles, implementing the weight update on a 2D crossbar array of resistive devices locally and all in parallel, independent of the array size, is challenging. It requires calculating a vector-vector outer product, which consists of a multiplication operation and an incremental weight update to be performed locally at each cross-point, as illustrated in Fig. 1A. The corresponding update rule is usually expressed as
$$w_{ij} \leftarrow w_{ij} + \eta x_i \delta_j \qquad (1)$$
where $w_{ij}$ represents the weight value for the $i$-th row and the $j$-th column (for simplicity the layer index is omitted), $x_i$ is the activity at the input neuron, $\delta_j$ is the error computed by the output neuron, and $\eta$ is the global learning rate.
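As a point of reference for Eq. 1, the conventional update can be sketched as an explicit loop over all cross-points. This is a minimal illustration, not the authors' code; the function name `outer_product_update` and the small example values are assumptions made for the sketch.

```python
def outer_product_update(W, x, delta, eta):
    """Apply w_ij <- w_ij + eta * x_i * delta_j to every cross-point, in place."""
    for i, xi in enumerate(x):            # rows: input neuron activities
        for j, dj in enumerate(delta):    # columns: output neuron errors
            W[i][j] += eta * xi * dj
    return W


# Hypothetical 2 x 2 example.
W = [[0.0, 0.0], [0.0, 0.0]]
outer_product_update(W, x=[1.0, 2.0], delta=[0.5, -1.0], eta=0.1)
```

On a conventional processor the two nested loops visit every weight once per update, which is the work that the RPU array is meant to perform at all cross-points simultaneously.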
In order to implement a local and parallel update on an array of two-terminal devices that can perform both weight storage and processing (Resistive Processing Unit or RPU), we first propose to significantly simplify the multiplication operation itself by using stochastic computing techniques [30-32]. It has been shown that by using two stochastic streams the multiplication operation can be reduced to a simple AND operation [30-32]. Fig. 1B illustrates the stochastic update rule, where numbers that are encoded from neurons ($x_i$ and $\delta_j$) are translated to stochastic bit streams using stochastic translators (STR). They are then sent to the crossbar array, where each RPU device changes its conductance ($g_{ij}$) slightly when bits from $x_i$ and $\delta_j$ coincide. In this scheme we can write the update rule as follows:
$$w_{ij} \leftarrow w_{ij} \pm \Delta w_{min} \sum_{n=1}^{BL} A_i^n \wedge B_j^n \qquad (2)$$
where $BL$ is the length of the stochastic bit stream at the output of the STRs that is used during the update cycle, $\Delta w_{min}$ is the change in the weight value due to a single coincidence event, $A_i^n$ and $B_j^n$ are random variables that are characterized by a Bernoulli process, and the superscript $n$ represents the bit position in the trial sequence. The probabilities that $A_i^n$ and $B_j^n$ are equal to unity are controlled by $C x_i$ and $C \delta_j$, respectively, where $C$ is a gain factor in the STR.
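The stochastic rule of Eq. 2 can be sketched for a single cross-point as follows. This is a simplified software illustration under stated assumptions: `x` and `delta` are taken as non-negative with `C * x` and `C * delta` inside [0, 1], the sign handling of Eq. 2 is omitted, and the function name and test values are hypothetical.

```python
import random


def stochastic_update(w, x, delta, BL, dw_min, C, rng):
    """One stochastic update of a single weight (sketch of Eq. 2, positive sign only)."""
    p_a = C * x        # probability that a row bit A_i^n is 1
    p_b = C * delta    # probability that a column bit B_j^n is 1
    coincidences = 0
    for _ in range(BL):
        a = rng.random() < p_a
        b = rng.random() < p_b
        coincidences += a and b    # the AND replaces the multiplication
    return w + dw_min * coincidences


# Average many independent updates and compare with BL * dw_min * C^2 * x * delta.
rng = random.Random(0)
trials = 20000
mean_dw = sum(
    stochastic_update(0.0, 0.5, 0.4, BL=10, dw_min=0.001, C=1.0, rng=rng)
    for _ in range(trials)
) / trials
expected_dw = 10 * 0.001 * 1.0 ** 2 * 0.5 * 0.4
```

Each individual update is a multiple of `dw_min`, so only the average over many updates approaches the product that a floating point multiplication would give.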
One possible pulsing scheme that enables the stochastic update rule of Eq. 2 is presented in Fig. 1C. Voltage pulses with positive and negative amplitudes are sent from the corresponding STRs on rows ($A_i$) and columns ($B_j$), respectively. As opposed to a floating point number encoded into a binary stream, the corresponding number translated into a stochastic stream is represented by a whole population of such pulses. In order for a two-terminal RPU device to distinguish coincidence events at a cross-point, its conductance value should not change significantly when a single pulse $V_S/2$ is applied to a device from a row or a column. However, when two pulses coincide and the RPU device sees the full voltage ($V_S$), the conductance should change by a nonzero amount $\Delta g_{min}$. The parameter $\Delta g_{min}$ is proportional to $\Delta w_{min}$ through the amplification factor defined by peripheral circuitry. To enable both up and down changes in conductance, the polarity of the pulses can be switched during the update cycle as shown in Fig. 1D. The proposed pulsing scheme allows all the RPU devices in an array to work in parallel and perform the multiplication operation locally by simply relying on the statistics of the coincidence events, thus achieving the $O(1)$ time complexity for the weight update cycle of the training algorithm.

Network training with RPU array using stochastic update rule
To test the validity of this approach, we compare classification accuracies achieved with a deep neural network composed of fully connected layers with 784, 256, 128, and 10 neurons, respectively. This network is trained with the standard MNIST training dataset of 60,000 images of handwritten digits [33] using a cross-entropy objective function and the backpropagation algorithm [7]. Raw pixel values of each 28×28 pixel image are given as inputs, while sigmoid and softmax activation functions are used in the hidden and output layers, respectively. The temperature parameter for both activation functions is assumed to be unity. Fig. 2 shows a set of classification error curves for the MNIST test dataset of 10,000 images. The curve marked with open circles in Fig. 2A corresponds to a baseline model where the network is trained using the conventional update rule as defined by Eq. 1 with a floating point multiplication operation. In order to make a fair comparison between the baseline model and the stochastic model, in which the training uses the stochastic update rule of Eq. 2, the learning rates need to match. In the most general form the average change in the weight value for the stochastic model can be written as
$$E(\Delta w_{ij}) = BL \, \Delta w_{min} \, C^2 x_i \delta_j \qquad (3)$$
Therefore the learning rate for the stochastic model is controlled by three parameters, $BL$, $\Delta w_{min}$, and $C$, which should be adjusted to match the learning rates that are used in the baseline model.
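As a brief check on the average weight change, the expectation can be worked out under the assumption that the two bit streams are independent Bernoulli sequences with success probabilities $C x_i$ and $C \delta_j$ (both taken to lie in [0, 1]). Comparing the result with the conventional rule of Eq. 1 gives the matching condition for the learning rate:

```latex
\begin{align}
E\left(A_i^n \wedge B_j^n\right) &= P(A_i^n = 1)\, P(B_j^n = 1) = C^2 x_i \delta_j \\
E\left(\Delta w_{ij}\right) &= \Delta w_{min} \sum_{n=1}^{BL} E\left(A_i^n \wedge B_j^n\right)
  = BL\, \Delta w_{min}\, C^2 x_i \delta_j \\
\eta &= BL\, \Delta w_{min}\, C^2
\end{align}
```

With $C = 1$ this condition reduces to $\Delta w_{min} = \eta / BL$, and for a fixed $\Delta w_{min}$ it gives $C = \sqrt{\eta / (BL\, \Delta w_{min})}$.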
Although the stochastic update rule makes it possible to substitute the multiplication operation with a simple AND operation, the result of the operation is no longer exact, but probabilistic, with a standard deviation to mean ratio that scales with $1/\sqrt{BL}$. Increasing the stochastic bit stream length $BL$ would decrease the error, but in turn would increase the update time. In order to find an acceptable range of $BL$ values that allow classification errors similar to the baseline model to be reached, we performed training using different $BL$ values while setting $\Delta w_{min} = \eta/BL$ and $C = 1$ in order to match the learning rates used for the baseline model, as discussed above. As shown in Fig. 2A, $BL$ as small as 10 is sufficient for the stochastic model to become indistinguishable from the baseline model.
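The $1/\sqrt{BL}$ scaling can be illustrated by treating the number of coincidences as a binomial variable, which is an assumption that holds when the bits at different positions are independent. In the sketch below `p` stands for the per-bit coincidence probability ($C^2 x_i \delta_j$), and the function name is hypothetical.

```python
import math


def relative_std(BL, p):
    """Standard deviation to mean ratio of a Binomial(BL, p) coincidence count."""
    mean = BL * p
    std = math.sqrt(BL * p * (1.0 - p))
    return std / mean


# Quadrupling the stream length halves the relative spread.
ratio = relative_std(10, 0.25) / relative_std(40, 0.25)
```

The ratio evaluates to 2 up to rounding, which is the trade-off described above: a longer stream reduces the spread only as the square root of the update time.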
To determine how strong a non-linearity in the device switching characteristics is required for the algorithm to converge to classification errors comparable to the baseline model, a non-linearity factor is varied as shown in Fig. 2B. The non-linearity factor is defined as the ratio of the two conductance changes at half and full voltages, $k = \Delta g(V_S/2)/\Delta g(V_S)$. As shown in Fig. 2C, values of $k \approx 1$ correspond to a saturating type non-linear response, $k = 0.5$ corresponds to a linear response as typically considered for a memristor [34], and values of $k \approx 0$ correspond to a rectifying type non-linear response. As shown in Fig. 2B, the algorithm fails to converge for the linear response; however, a non-linearity factor $k$ below 0.1 is enough to achieve classification errors comparable to the baseline model.
These results validate that although the updates in the stochastic model are probabilistic, classification errors can become indistinguishable from those achieved with the baseline model. The implementation of the stochastic update rule on an array of analog RPU devices with non-linear switching characteristics effectively utilizes the locality and the parallelism of the algorithm. As a result the update time becomes independent of the array size and is a constant value proportional to $BL$, thus achieving the required $O(1)$ time complexity.

Derivation of RPU device specifications
Various materials, physical mechanisms, and device concepts have been analyzed in view of their potential implementation as crossbar arrays for neural network training [21-26]. These technologies were initially developed for storage class memory applications. It is not clear beforehand, however, whether intrinsic limitations of these technologies, when applied to the realization of the proposed RPU concept, would result in a significant acceleration or, in contrast, might limit the performance. For example, PCM devices can only increase the conductance during training, thus resulting in network saturation after a number of updates. This problem can be mitigated by a periodic serial reset of weights, however at the price of lengthening the training time [22,26], as it violates the $O(1)$ time complexity. In order to determine the device specifications required to achieve the ultimate acceleration when $O(1)$ time complexity is reached, we performed a series of trainings summarized in Fig. 3. Each panel corresponds to a specific "stress test" where a single parameter is scanned while all the others are fixed, which makes it possible to explore the acceptable RPU device parameters that the algorithm can tolerate without a significant error penalty. This includes variations in RPU device switching characteristics, such as the incremental conductance change due to a single coincidence event, asymmetry in up and down conductance changes, the tunable range of the conductance values, and various types of noise in the system. For all of the stochastic models illustrated in Fig. 3, $k = 0$ and $BL = 10$ are used. In order to match the learning rates used for the baseline model, $x_i$ and $\delta_j$ are translated to stochastic streams with $C$ defined as $C = \sqrt{\eta/(BL \, \Delta w_{min})}$. This allows the average learning rate to be the same as in the baseline model.
Ideally, the RPU device should be analog, i.e. the conductance change due to a single coincidence event $\Delta g_{min}$ should be arbitrarily small, thus continuously covering all the allowed conductance values. To determine the largest acceptable $\Delta w_{min}$ due to a single coincidence event that does not produce a significant error penalty, the parameter $\Delta w_{min}$ is scanned between 0.32 and 0.00032 while the other parameters are fixed, as shown in Fig. 3A. For large $\Delta w_{min}$ the convergence is poor, since it controls the standard deviation of the stochastic update rule, whereas for smaller $\Delta w_{min}$ the results approach the baseline model. A $\Delta w_{min}$ smaller than 0.01 produces an error penalty at the end of the 30th epoch as small as 0.3% above the 2.0% classification error of the baseline model.
To determine the minimum and maximum conductance values that RPU devices should support for the algorithm to converge, a set of training curves is calculated as shown in Fig. 3B. Each curve is defined by the weight range, where the absolute value of the weights $|w_{ij}|$ is kept below a certain bound that is varied between 0.1 and 3. The other parameters are identical to Fig. 3A, while $\Delta w_{min}$ is taken as 0.001 to ensure that the results are mostly defined by the choice of the weight range. The model with weights $|w_{ij}|$ bounded to values larger than 0.3 results in an acceptable error penalty of 0.3% as defined above. Since the parameter $\Delta g_{min}$ (and $g_{ij}$) is proportional to $\Delta w_{min}$ (and $w_{ij}$) through the amplification factor defined by peripheral circuitry, the number of coincidence events required to move the RPU device from its minimum to its maximum conductance value can be derived as $(\max(g_{ij}) - \min(g_{ij}))/\Delta g_{min} = (\max(w_{ij}) - \min(w_{ij}))/\Delta w_{min}$. This gives a lower estimate of 600 for the number of states that are required to be stored on an RPU device.
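The figure of 600 states follows from the numbers quoted above: a weight bound of 0.3 on either side of zero and a step of 0.001 per coincidence event. A short sketch of this arithmetic, with a hypothetical helper name:

```python
def num_states(w_bound, dw_min):
    """Coincidence events needed to sweep a weight from -w_bound to +w_bound."""
    return round((w_bound - (-w_bound)) / dw_min)


states = num_states(w_bound=0.3, dw_min=0.001)
```

The symmetric range from -0.3 to 0.3 is an assumption of the sketch, based on the bound being applied to the absolute value of the weights.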
In order to determine the tolerance of the algorithm to the variation in the incremental conductance change due to a single coincidence event $\Delta g_{min}$, the $\Delta w_{min}$ value used for each coincidence event is assumed to be a random variable with a Gaussian distribution. The corresponding results are shown in Fig. 3C, where the standard deviation is varied while the average $\Delta w_{min}$ value is set to 0.001. As can be seen, the algorithm is robust against the randomness in the weight change for each coincidence event, and models with a standard deviation below 150% of the mean value reach the acceptable 0.3% error penalty.
For the stochastic models illustrated in Fig. 3D, yet another source of randomness, a device-to-device variation in the incremental conductance change due to a single coincidence event $\Delta g_{min}$, is introduced. In this case the $\Delta w_{min}$ used for each RPU device is sampled from a Gaussian distribution at the beginning of the training, and this fixed value is then used throughout the training for each coincidence event. For all stochastic models in Fig. 3D, the average $\Delta w_{min}$ value of 0.001 is used while the standard deviation is varied for each model. The results show that the algorithm is also robust against the device-to-device variation, and an acceptable error penalty can be achieved for models with a standard deviation up to 110% of the mean value.
To determine the tolerance of the algorithm to the device-to-device variation in the upper and lower bounds of the conductance value, we assumed upper and lower bounds that are different for each RPU device for the models in Fig. 3E. The bounds used for each RPU device are sampled from a Gaussian distribution at the beginning of the training and are used throughout the training. For all of the stochastic models in Fig. 3E, a mean value of 1.0 for the upper bound (and −1.0 for the lower bound) is used to ensure that the results are mostly defined by the device-to-device variation in the upper and lower bounds. Fig. 3E shows that the algorithm is robust against the variation in the bounds, and models with a standard deviation up to 80% of the mean can achieve the acceptable 0.3% error penalty.
Fabricated RPU devices may also show different amounts of change in the conductance value due to positive ($\Delta g_{min}^+$) and negative ($\Delta g_{min}^-$) pulses, as illustrated in Figs. 1C and 1D. To determine how much asymmetry between up and down changes the algorithm can tolerate, the up ($\Delta w_{min}^+$) and down ($\Delta w_{min}^-$) changes in the weight value are varied as shown in Figs. 3F and 3G. In both cases this global asymmetry is considered to be uniform throughout the whole RPU device array. For each model in Fig. 3F, $\Delta w_{min}^+$ is fixed to 0.001 while $\Delta w_{min}^-$ is varied from 0.95 to 0.25 weaker than the up value. Similarly, Fig. 3G shows analogous results for $\Delta w_{min}^-$ fixed to 0.001 while $\Delta w_{min}^+$ is varied. The results show that up and down changes need to be closely balanced (within 5% of the mean) in order for the stochastic model to achieve an acceptable 0.3% error penalty.
In order to determine the tolerance of the algorithm to the device-to-device variation in asymmetry, as opposed to the global asymmetry considered in Figs. 3F and 3G, the curves in Fig. 3H are calculated for various values of the standard deviation of $\Delta w_{min}^+/\Delta w_{min}^-$. The parameters $\Delta w_{min}^+$ and $\Delta w_{min}^-$ for each RPU device are sampled from a Gaussian distribution at the beginning of the training and then used throughout the training for each coincidence event. All the models assume that the average value of $\Delta w_{min}^+$ and $\Delta w_{min}^-$ is 0.001. The standard deviation of $\Delta w_{min}^+/\Delta w_{min}^-$ needs to be less than 6% of the mean value to achieve an acceptable 0.3% error penalty.
Analog computation is sensitive to various noise sources, such as thermal noise and shot noise, that are all additive and can be modelled as a single unbiased Gaussian noise. The influence of the noise penalty during the weight update cycle is already considered in Figs. 3C-3H. In order to estimate the tolerance of the algorithm to noise during the forward and backward cycles, we modelled analog noise as a random error imposed on the results of the vector-matrix multiplication. As shown in Fig. 3I, an acceptable 0.3% error penalty is reached for a noise level of 10% normalized to the activation function temperature.
The radar diagram in Fig. 4A summarizes the specifications of RPU devices that are derived from the "stress tests" performed in Fig. 3. Axes C-I correspond to the experiments in Figs. 3C-3I, respectively. Solid line 1 connects the threshold values determined for these parameters for an acceptable 0.3% error penalty. Note that these specifications differ significantly from parameters typical for NVM technologies. The storage in NVM devices is digital and typically does not exceed a few bits. This constraint is imposed by the system requirement to achieve a high signal-to-noise ratio for read and write operations. In addition, the write operation does not depend on history, as it overwrites all previously stored values. In contrast, weight values in the neural network operation do not need to be written and resolved with a very high signal-to-noise ratio. In fact, the algorithm can withstand up to 150% of noise in the weight updates (parameter C) and can tolerate up to 10% reading noise on columns or rows (parameter I). However, as opposed to the few-bit storage capacity of NVM devices, a large number of coincidence events (over 600 from Fig. 3B) is required for the RPU device to keep track of the history of weight updates. In addition, in contrast to the high endurance of full swing writing between bit levels required for NVM devices, RPU devices need to have high endurance only to small incremental changes, $\Delta g_{min}$.
The combined contribution of all parameters considered in Fig. 4A can be additive and therefore exceed the acceptable 0.3% error penalty. Fig. 4B shows training results when the effects of more than one parameter are combined. When all parameters (C, D, E, F, G, H, and I) are combined at the threshold, the test error reaches 5.0%, which is 3.0% above the baseline model. Although this penalty can be acceptable for some applications, it is significantly higher than the 0.3% error penalty considered above. This 3.0% penalty is higher than a simple additive impact of uncorrelated contributions, indicating that at least some of these parameters are interacting. This opens the possibility of optimizing the error penalty by trading off tolerances between various parameters. For example, the model that combines only parameters C, D, and E at the threshold, as shown by curve 2 in Fig. 4B, gives a 0.9% error penalty, which is about the expected sum of the individual contributions. Note that these parameters are defined by imperfections in device operation and by device-to-device mismatch, which are all controlled by fabrication tolerances in a given technology. Even for deeply scaled CMOS technologies the fabrication tolerances do not exceed 30%, which is much smaller than the 150%, 110%, and 80% used for the calculation of curve 2. The contributions of C, D, and E to the error penalty can be eliminated by setting the corresponding tolerances to 30% (data not shown).
Among the parameters of Fig. 4A, the asymmetry between up and down changes in the conductance value of RPU devices (parameters F, G, and H) is the most restrictive. Parameter F (or G) is the global asymmetry, which can be compensated by controlling pulse voltages and/or the number of pulses in the positive and negative update cycles, and hence even asymmetries higher than the threshold value of 5% can be eliminated with proper design of peripheral circuits. In contrast, the parameter H, which is defined by device-to-device variation in the asymmetry, can be compensated by peripheral circuits only if each RPU device is addressed serially. To maintain the $O(1)$ time complexity, the device mismatch parameter H and the noise parameter I can be co-optimized to reduce the error penalty. The resulting model, illustrated by the gray shaded area bounded by curve 3 in Fig. 4B, achieves at most a 0.3% error penalty. For this model parameters C, D, and E are set to 30%, while F (or G) is set to zero, H is set to 2%, and I is set to 5%. Alternatively, the same result (data not shown) can be obtained by restricting the noise parameter I to 2.5% and increasing the device mismatch tolerance H to 4%, which can simplify the array fabrication at the expense of designing less noisy circuits.

Circuit and system level design considerations
The ultimate acceleration of DNN training with the backpropagation algorithm on an RPU array of size $N \times N$ can be approached when $O(1)$ time complexity operation is enforced. In this case the overall acceleration is proportional to $N^2$, which favors very large arrays. In general the design of the array, the peripheral circuits, and the whole system should be based on optimization of the network parameters for a specific workload and classification task. In order to develop a general methodology for such a design, we will use the results of the analysis presented above as an example, with the understanding, however, that the developed approach is valid for a larger class of more complicated cases than the relatively simple 3-layer network used to classify the MNIST dataset in Figs. 2-4.

RPU array design
For realistic technological implementations of the crossbar array, the array size will ultimately be limited by the resistance and parasitic capacitance of the transmission lines, resulting in significant RC delay and voltage drop. For further analysis we assume that RPU devices are integrated in the back-end-of-line (BEOL) stack in between intermediate metal levels. This allows the top thick metal levels to be used for power distribution, and the lower metal levels and the area under the RPU array for peripheral CMOS circuitry. Typical intermediate metal levels in a scaled CMOS technology have a thickness of 360 nm and a width of 200 nm. The corresponding typical line resistance is about $r_{line} = 0.36$ Ω/μm, with a parasitic capacitance of $c_{line} = 0.2$ fF/μm. Assuming a reasonable 1 GHz clock frequency for the pulses used during the update cycle, and allowing the RC delay to be at most 10% of the pulse width (0.1 ns), the longest line length should be $l_{line} = 1.64$ mm. Assuming a reasonable line spacing of 200 nm, this results in an array with 4096 × 4096 RPU devices. Since the conductance values of RPU devices can only be positive, we assume that a pair of identical RPU device arrays is used to encode positive ($g_{ij}^+$) and negative ($g_{ij}^-$) weight values. The weight value ($w_{ij}$) is proportional to the difference of the two conductance values stored in the two corresponding devices ($g_{ij}^+ - g_{ij}^-$) located in identical positions of a pair of RPU arrays. To minimize the area, these two arrays can be stacked on top of each other, occupying 4 consecutive metal levels and resulting in a total area of $A_{array} = 2.68$ mm². For this array size a full update cycle (both positive and negative) performed using 1 ns pulses can be completed in 20 ns for $BL = 10$.
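The array dimensions quoted above can be cross-checked with a few lines of arithmetic. The sketch below makes two assumptions that are not stated in the text: the line pitch is taken as width plus spacing (400 nm), and the RC delay is estimated with the distributed-line approximation of 0.5·R·C. All variable names are illustrative.

```python
N = 4096                       # devices per line
pitch_um = 0.2 + 0.2           # line width + line spacing, in micrometers (assumed pitch)
r_line = 0.36                  # line resistance, ohm per micrometer
c_line = 0.2e-15               # parasitic capacitance, farad per micrometer

l_line_um = N * pitch_um                   # line length in micrometers
area_mm2 = (l_line_um / 1000.0) ** 2       # footprint of one (stacked) array pair

R_line = r_line * l_line_um                # total line resistance, ohm
C_line = c_line * l_line_um                # total line capacitance, farad
delay_ns = 0.5 * R_line * C_line * 1e9     # distributed RC estimate (assumed model)

BL = 10
pulse_ns = 1
update_time_ns = 2 * BL * pulse_ns         # positive plus negative update cycles
```

Under these assumptions the line length comes out at about 1.64 mm, the area at about 2.68 mm², the delay just under the 0.1 ns budget, and the full update at 20 ns, in line with the values in the text.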
In order to estimate an average RPU device resistance, $R_{device}$, we assume at most a 10% voltage drop on the transmission line, which is defined by $N \times R_{line}/R_{device}$, where $R_{line}$ is the total line resistance equal to $r_{line} \, l_{line}$. The contribution of the output resistance of the line drivers to the total line resistance can be minimized by proper circuit design. For an array size of $N = 4096$ the average RPU device resistance is therefore $R_{device} = 24$ MΩ. Using this resistance value, and assuming an operating voltage of 1 V for all 3 training cycles and on average about 20% activity for each device, which is typical for the models of Figs. 2-4, the power dissipation on a pair of RPU arrays can be estimated as $P_{array} = 0.28$ W.
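The device resistance and array power can be reproduced from the stated inputs. The power model in this sketch is an assumption: every active device is taken to dissipate V²/R, and the 20% activity is applied as a simple duty factor across both arrays of the pair.

```python
N = 4096
r_line = 0.36                  # ohm per micrometer
l_line_um = 1640.0             # line length, micrometers
max_drop = 0.10                # allowed fractional voltage drop on the line

R_line = r_line * l_line_um                # total line resistance, ohm
R_device = N * R_line / max_drop           # from N * R_line / R_device = 10%

V = 1.0                        # operating voltage, volts
activity = 0.20                # average device activity
P_array = 2 * N * N * activity * V ** 2 / R_device   # pair of arrays, watts
```

This gives roughly 24 MΩ per device and roughly 0.28 W for the array pair.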

Design of peripheral circuits
Operation of a single column (or row) during the forward (or backward) cycle is illustrated in Fig. 5A. In contrast to the update cycle, stochastic translators are not needed. Here we assume that a time-encoding scheme is used, in which input vectors are represented by fixed-amplitude $V_{in} = 1$ V pulses with a tunable duration. The pulse duration is a multiple of 1 ns and is proportional to the value of the input vector. Currents generated at each RPU device are summed on the columns (or rows), and this total current is integrated over the measurement time, $t_{meas}$, by current readout circuits as illustrated in Fig. 5A. Positive and negative voltage pulses are supplied separately to each of the identical RPU arrays that are used to encode positive and negative weights. Currents from both arrays are fed into peripheral circuitry that consists of an op-amp, which integrates the differential current on the capacitor $C_{int}$, and an analog-to-digital converter (ADC). Note that for time-encoded pulses, the time-quantization error at the input to the RPU array scales inversely with the total number of pulses. For the models in Fig. 4B, a number of pulses larger than 20 (~5 bit resolution) is enough to eliminate the corresponding error penalty.
We define a single RPU tile as a pair of arrays with 4096 × 4096 devices, together with the peripheral circuits that support the parallel operation of the array in all 3 cycles. Peripheral circuitry includes ADCs, op-amps, STRs consisting of random number generators, and line drivers used to direct signals along the columns and rows. As shown in Fig. 5C, the signals from a tile are directed towards a non-linear function (NLF) circuit that calculates activation functions (e.g. sigmoid, softmax) and their derivatives, as well as arithmetical operations (e.g. multiplication), depending on the cycle type and on the position of the corresponding layer. At the tile boundary, input signals to the NLF are bounded to a certain threshold value to avoid signal saturation. Fig. 5B shows the test error for the network of model 3 in Fig. 4B, but with bounds $|\alpha|$ imposed on the results of the vector-matrix multiplication, which is equivalent to restricting the NLF input. For neurons in hidden layers the NLF circuit should compute a sigmoid activation function. When the input to this sigmoid NLF is restricted to $|\alpha| = 3$, the resulting error penalty does not exceed an additional 0.4%, as shown by curve 1 in Fig. 5B.
Neurons at the output layer perform a softmax NLF operation which, when the corresponding input is also restricted to $|\alpha| = 3$, results in an exceedingly large error, as shown by curve 2 in Fig. 5B. To make the design more flexible and programmable, it is desirable for the NLF in both hidden and output layers to have the same bounds. When the bounds on both softmax and sigmoid NLF are restricted to $|\alpha| = 12$, the total penalty is within an acceptable range, as shown by curve 3. Assuming the 5% acceptable noise level taken from the results of Fig. 4B and an operating voltage range between −1 V and 1 V at the input to the ADC, the corresponding bit resolution and voltage step required are 9 bits and 4 mV, respectively. These numbers imply that the acceptable total integrated RMS voltage noise at the input to the ADC (or at the output of the op-amp) should not exceed 4 mV.
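One way to arrive at the 9 bit and 4 mV figures is sketched below. It rests on an interpretation that the text does not spell out: the bound $|\alpha| = 12$ is assumed to map onto the 1 V full-scale ADC input, so that a 5% noise level (in units of the unity temperature) corresponds to 0.05/12 of full scale. Variable names are illustrative.

```python
import math

alpha_bound = 12.0       # NLF input bound |alpha|
noise_level = 0.05       # acceptable noise, relative to unity temperature
v_full_scale = 1.0       # ADC input spans -1 V to +1 V

v_step = noise_level / alpha_bound * v_full_scale   # smallest step to resolve, volts
levels = 2.0 * v_full_scale / v_step                # number of steps across the full range
bits = math.ceil(math.log2(levels))                 # ADC resolution in bits
```

With these inputs the step is about 4.2 mV and the range needs about 480 levels, which rounds up to 9 bits.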

Noise analysis
In order to estimate the acceptable level of the input referred noise, the integration function of the op-amp should be defined. The voltage at the output of the op-amp can be derived as
$$V_{out} = \frac{2 N V_{in} t_{meas}}{R_{device} C_{int}} \cdot \frac{\beta - 1}{\beta + 1} \qquad (4)$$
where $\beta$ is the conductance on/off ratio for an RPU device. This equation assumes that all $N$ devices contribute simultaneously, which makes the circuit hard to design, as it would require either a very large capacitor or a large voltage swing. However, for given bounds $|\alpha|$ imposed on the NLF transformation, the output voltage need not exceed the level corresponding to the simultaneous contribution of $|\alpha|$ devices. Since, as shown above, an acceptable bound $|\alpha| = 12$ is enough, the number $N$ in Eq. 4 can be replaced with 12. Assuming that the $V_{out}$ signal feeding into the ADC should not exceed 1 V, and that $R_{device}$ is 24 MΩ, the choice of integrating capacitor $C_{int}$ is dictated by the integration time $t_{meas}$ and the on/off ratio $\beta$. Fig. 5D presents estimates of acceptable noise levels for various device on/off ratios $\beta$ and integration times $t_{meas}$. This noise level corresponds to the input referred noise of the op-amp calculated using standard noise analysis in integrator-based circuits [35]. If $t_{meas}$ is taken as 20 ns, following the quantization error consideration discussed above, the acceptable noise levels are relatively low, of the order of just 10 nV/√Hz, as seen in curve 1 of Fig. 5D. Even an increase of the on/off ratio $\beta$ by several orders of magnitude does not help to accommodate higher noise. In order to accommodate higher noise, $t_{meas}$ needs to be increased, with the penalty, however, of increased overall calculation time. As seen from the curves in Fig. 5D, for a given noise level on/off ratios as small as 2 to 10 can be acceptable, which is, in fact, quite modest in comparison to the several orders of magnitude typical for NVM applications. When $t_{meas}$ and $\beta$ are chosen as 80 ns and 6, respectively, the corresponding level of acceptable input referred noise, shown by curve 2 in Fig. 5D, can be derived as 15.1 nV/√Hz. The corresponding capacitance $C_{int}$ can also be calculated as 57 fF using Eq. 4.
Various noise sources can contribute to the total acceptable input-referred noise level of an op-amp, including thermal noise, shot noise, and supply voltage noise. Thermal noise due to a pair of arrays with 4096 × 4096 RPU devices can be estimated as 7.0 nV/√Hz, which leaves about 13.4 nV/√Hz for other noise sources. Depending on the exact physical implementation of an RPU device and the type of non-linear I-V response, the shot noise levels produced by the RPU array can vary. Assuming a diode-like model, the total noise from a whole array scales as the square root of the number of active RPU devices in a column (or a row), and hence depends on the overall instantaneous activity of the network. For a pair of arrays with 4096 × 4096 RPU devices and assuming a moderate 20% activity of the network, which is typical for the models of Figs. 2-4, the shot noise contribution is about 13.4 nV/√Hz. A longer integration time is needed for higher workloads or for additional noise contributions, including noise on the voltage, amplifier noise, etc.
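The noise budget above is consistent with uncorrelated noise sources adding in quadrature: 7.0 and 13.4 nV/√Hz combine to roughly 15.1 nV/√Hz. The sketch below reproduces that arithmetic and extrapolates the shot noise with the square-root scaling stated for the diode-like model. The quadrature assumption, the function names, and the 40% activity case are illustrative and not taken from the paper.

```python
import math

TOTAL_NV = 15.1          # acceptable input-referred noise, nV/sqrt(Hz)
THERMAL_NV = 7.0         # thermal noise of the array pair, nV/sqrt(Hz)
SHOT_NV_AT_20PCT = 13.4  # shot noise at 20% network activity, nV/sqrt(Hz)

def quadrature_sum(*sources):
    """Combine uncorrelated noise densities (assumed to add in quadrature)."""
    return math.sqrt(sum(s * s for s in sources))

def remaining_budget(total, *known):
    """Noise density left for other sources once `known` ones are accounted for."""
    rem_sq = total ** 2 - sum(k * k for k in known)
    return math.sqrt(rem_sq) if rem_sq > 0 else 0.0

def shot_noise(activity, ref_activity=0.2, ref_noise=SHOT_NV_AT_20PCT):
    """Shot noise scaled as sqrt(number of active devices), per the diode-like model."""
    return ref_noise * math.sqrt(activity / ref_activity)

budget = remaining_budget(TOTAL_NV, THERMAL_NV)
total_at_20 = quadrature_sum(THERMAL_NV, shot_noise(0.2))
total_at_40 = quadrature_sum(THERMAL_NV, shot_noise(0.4))  # hypothetical higher workload
print(round(budget, 1), round(total_at_20, 1), round(total_at_40, 1))
```

With these assumptions, doubling the activity pushes the combined noise above the 15.1 nV/√Hz budget, which illustrates why a higher workload calls for a longer integration time.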

System level design considerations
The tile area occupied by peripheral circuitry and the corresponding dissipated power are dominated by the contribution from the 4096 ADCs. Assuming t_meas of 80 ns for the forward and backward cycles, ADCs operating with 9 bit resolution at 12.5 MSamples/sec are required. State-of-the-art SAR-ADCs [36, 37] that can provide this performance occupy an area of 0.0256 mm² and consume 0.24 mW of power, which results in a total area of 104 mm² and a total power of 1 W for an array of 4096 ADCs. This area is much larger than the RPU array itself; it is therefore reasonable to time-multiplex the ADCs between different columns/rows by increasing the sampling rate while keeping the total power unchanged. Assuming each ADC is shared by 64 columns (or rows), the total ADC area can be reduced to 1.64 mm², with each ADC running at about 800 MSamples/sec. Since we assume that RPU device arrays are built on the intermediate metal levels on top of peripheral CMOS circuitry, the total tile area is defined by the RPU array area of 2.68 mm², which leaves about 1.0 mm² for other circuitry that can also be area optimized. For example, the number of random number generators used to translate binary data to stochastic bit streams can be reduced to just 2, as no operations are performed on streams generated within columns (or rows), as evidenced by the absence of an additional error penalty in the corresponding classification test (data not shown). The total area of a single tile is therefore 2.68 mm², while the total power dissipated by both RPU arrays and all peripheral circuitry (ADCs, op-amps, STR) can be estimated as 2.0 W, assuming 0.7 W reserved for op-amps and STRs.
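The ADC budget above can be re-derived with a few lines of arithmetic. The sketch below is an illustration only: it assumes that an ADC's power scales linearly with its sampling rate, which is one way to read "keeping total power unchanged", and the variable names are not taken from the paper.

```python
N_COLUMNS = 4096       # columns (or rows) read out per tile
T_MEAS_NS = 80.0       # integration time per forward/backward cycle, ns
ADC_AREA_MM2 = 0.0256  # area of one SAR-ADC, mm^2
ADC_POWER_MW = 0.24    # power of one SAR-ADC at the base rate, mW
SHARING = 64           # columns served by one time-multiplexed ADC

# One sample per column per cycle: 1 / 80 ns = 12.5 MSamples/sec
base_rate_msps = 1e3 / T_MEAS_NS

# Dedicated ADC per column
dedicated_area_mm2 = N_COLUMNS * ADC_AREA_MM2
dedicated_power_w = N_COLUMNS * ADC_POWER_MW / 1e3

# Time-multiplexed ADCs: fewer converters, each running faster
n_shared_adcs = N_COLUMNS // SHARING
shared_area_mm2 = n_shared_adcs * ADC_AREA_MM2
shared_rate_msps = base_rate_msps * SHARING
# Assumption: per-ADC power grows linearly with rate, so the total is unchanged
shared_power_w = n_shared_adcs * ADC_POWER_MW * SHARING / 1e3

print(base_rate_msps, round(dedicated_area_mm2, 1), round(dedicated_power_w, 2))
print(n_shared_adcs, round(shared_area_mm2, 2), shared_rate_msps)
```

Under these assumptions the numbers match the text: about 104 mm² and 1 W for dedicated ADCs, and 1.64 mm² with 800 MSamples/sec per converter when each ADC serves 64 columns.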
The number of weight updates per second on a single tile can be estimated as 839 TeraUpdates/s given the 20 ns duration of the update cycle and the 4096 × 4096 array size. This translates into a power efficiency of 419 TeraUpdates/s/W and an area efficiency of 319 TeraUpdates/s/mm². The tile throughput during the forward and backward cycles can be estimated as 419 TeraOps/s given 80 ns for the forward (or backward) cycle, with power and area efficiencies of 210 TeraOps/s/W and 156 TeraOps/s/mm², respectively. These efficiency numbers are about 5 orders of magnitude better than state-of-the-art CPU and GPU performance metrics [38].
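The throughput figures above follow directly from the array size and the cycle times. The sketch below is a rough re-derivation that assumes each weight contributes one multiply and one add per forward (or backward) cycle, i.e. two operations; that counting convention and the variable names are assumptions, not statements from the paper.

```python
ARRAY_DIM = 4096       # the tile holds ARRAY_DIM x ARRAY_DIM weights
T_UPDATE_S = 20e-9     # duration of the update cycle, s
T_CYCLE_S = 80e-9      # duration of a forward (or backward) cycle, s
TILE_POWER_W = 2.0     # total tile power, W
TILE_AREA_MM2 = 2.68   # total tile area, mm^2
OPS_PER_WEIGHT = 2     # assumed: one multiply plus one accumulate

n_weights = ARRAY_DIM * ARRAY_DIM

# All weights are updated in parallel within one update cycle
tera_updates = n_weights / T_UPDATE_S / 1e12
# All multiply-accumulates of a vector-matrix product finish in one cycle
tera_ops = OPS_PER_WEIGHT * n_weights / T_CYCLE_S / 1e12

update_power_eff = tera_updates / TILE_POWER_W   # TeraUpdates/s/W
update_area_eff = tera_updates / TILE_AREA_MM2   # TeraUpdates/s/mm^2
ops_power_eff = tera_ops / TILE_POWER_W          # TeraOps/s/W
ops_area_eff = tera_ops / TILE_AREA_MM2          # TeraOps/s/mm^2

print(round(tera_updates), round(update_power_eff), round(update_area_eff))
print(round(tera_ops), round(ops_power_eff), round(ops_area_eff))
```

With the rounded inputs used here, the update area efficiency comes out near 313 rather than the quoted 319 TeraUpdates/s/mm², presumably because of rounding in the quoted tile area. The other values agree with the text to within rounding.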
The power and area efficiencies achieved for a single tile will inevitably degrade as multiple tiles are integrated together as a system-on-chip. As illustrated in Fig. 5C, additional power and area should be reserved for programmable NLF circuits, on-chip communication via a coherent bus or network-on-chip (NoC), off-chip I/O circuitry, etc. Increasing the number of tiles on a chip will at first increase the total chip throughput, which will eventually saturate as it becomes limited by power, area, communication bandwidth, or compute resources. A state-of-the-art high-performance CPU (IBM Power8 12-core CPU [39]) or GPU (NVidia Tesla K40 GPU [40]) can be taken as a reference for estimating the maximum area of 600 mm² and power of 250 W on a single chip. While the power and area per tile are not prohibitive for scaling the number of tiles up to 50 to 100, the communication bandwidth and compute resources needed for the system to be efficient might be challenging.
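The claim that 50 to 100 tiles fit within the chip-level limits can be checked against the per-tile figures of 2.68 mm² and 2.0 W. The sketch below counts tiles only and deliberately ignores the NLF circuits, bus or NoC, and I/O, which need additional budget. The function name is illustrative.

```python
TILE_AREA_MM2 = 2.68   # area of one RPU tile, mm^2
TILE_POWER_W = 2.0     # power of one RPU tile, W
CHIP_AREA_MM2 = 600.0  # reference maximum chip area, mm^2
CHIP_POWER_W = 250.0   # reference maximum chip power, W

def tile_budget(n_tiles):
    """Area and power taken by the tiles alone, plus their share of the chip limits."""
    area = n_tiles * TILE_AREA_MM2
    power = n_tiles * TILE_POWER_W
    return area, power, area / CHIP_AREA_MM2, power / CHIP_POWER_W

# Upper bounds if the whole chip were spent on tiles (not a realistic design point)
max_tiles_by_area = int(CHIP_AREA_MM2 // TILE_AREA_MM2)
max_tiles_by_power = int(CHIP_POWER_W // TILE_POWER_W)

for n in (50, 100):
    area, power, area_frac, power_frac = tile_budget(n)
    print(n, round(area, 1), round(power, 1), round(area_frac, 2), round(power_frac, 2))
print(max_tiles_by_area, max_tiles_by_power)
```

Even 100 simultaneously active tiles stay below both limits in this simple count, so the binding constraints are the communication bandwidth and compute resources, as stated above.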
Communication bandwidth for a single tile can be estimated assuming 5 bit input and 9 bit output per column (or row) for the forward (or backward) cycle, which gives in total about 90 GB/s of unidirectional bandwidth that will also satisfy the update cycle communication requirements. This number is about 3 times less than the communication bandwidth in the IBM Power8 CPU between a single core and a nearby L2 cache [39]. A state-of-the-art on-chip coherent bus (over 3 TB/s in the IBM Power8 CPU [39]) or NoC (2.5 TB/s in Ref. [41]) can provide sufficient communication bandwidth between distant tiles.
Compute resources needed to sustain O(1) time complexity for a single tile can be estimated as 51 GigaOps/s assuming an 80 ns cycle time and 4096 numbers generated at columns or rows. To support parallel operation of N tiles, compute resources need to be scaled by O(N), thus limiting the number of tiles that can be active at a given time in order to keep the total power envelope on a chip below 250 W. For example, a single core of the IBM Power8 CPU [39] can achieve about 50 GigaFLOP/s, which might be sufficient to support one tile; however, the maximum power is reached for just 12 tiles assuming 20 W per core. The corresponding power efficiency for this design point (Design 1 in Table 1) would be 20 TeraOps/s/W. The same compute resources can be provided by 32 cores of a state-of-the-art GPU [40], but with better power efficiency, thus allowing up to 50 tiles to work in parallel. The corresponding power efficiency for this design (Design 2 in Table 1) would be 84 TeraOps/s/W. A further increase in the number of tiles that can operate concurrently can be envisioned by designing specialized power- and area-efficient digital circuits that operate on fixed-point numbers with limited bit resolution. An alternative design (Design 3 in Table 1) can be based on just a few compute cores that process the tile data sequentially in order to fit larger numbers of tiles to deal with larger network sizes. For example, a chip with 100 tiles and a single 50 GigaOps/s compute core will be capable of dealing with networks with as many as 1.6 billion weights and dissipate only about 22 W, assuming 20 W from the compute core and communication bus and just 2 W for RPU tiles, since only one is active at any given time. This gives a power efficiency of 20 TeraOps/s/W, which is 4 orders of magnitude better than state-of-the-art CPU and GPU.

Discussion
We proposed a concept of resistive processing unit (RPU) devices that can simultaneously store and process data locally and in parallel, thus potentially providing significant acceleration for DNN training. The tolerance of the training algorithm to various RPU device and system parameters, as well as to technological imperfections and different sources of noise, has been explored. This analysis allows us to define a list of specifications for RPU devices, summarized in Table 2, that can be used as a guide for a systematic search for new physical mechanisms, materials, and device designs to realize the RPU device concept with realistic CMOS-compatible technology.
We also presented an analysis of various system designs based on the RPU array concept that can potentially provide many orders of magnitude acceleration of deep neural network training while significantly decreasing the required power and compute hardware resources. The results are summarized in Table 1. This analysis shows that, depending on the network size, different design choices for the RPU accelerator chip can be made that trade power and acceleration factor.
The proposed accelerator chip design of Fig. 5C is flexible and can accommodate different types of DNN architectures beyond fully connected layers. For example, convolutional layers can also be mapped to an RPU array in an analogous way. In this case, instead of performing a vector-matrix multiplication for the forward and backward cycles, an array needs to perform a matrix-matrix multiplication, which can be achieved by feeding the columns of the input matrix serially into the columns of the RPU array. In addition, peripheral NLF circuits need to be reprogrammed to perform not only the calculation of activation functions, but also max-pooling and sub-sampling. The update cycle operations are identical for both convolutional and fully connected layers and hence do not require reprogramming. The required connectivity between layers can be achieved by reprogramming tile addresses in a network. Most recent DNN architectures are based on a combination of many convolutional and fully connected layers [ref ref] with a number of parameters of the order of a billion. Our analysis demonstrates that a single RPU accelerator chip can be used to train such large deep neural networks. Problems of the size of ImageNet classification that currently require days of training on a datacenter-size cluster with thousands of machines [6] can take just a few hours on a single RPU accelerator chip.

Figures and Tables

Fig. 2. Test error of DNN with the MNIST dataset. Open white circles correspond to the baseline model with the training performed using the conventional update rule of Eq. 1. (A) Lines marked as 1, 2, and 3 correspond to the stochastic model of Eq. 2 with stochastic bit lengths BL = 1, 2, and 10, respectively. (B) Lines marked as 1, 2, and 3 correspond to the stochastic model with BL = 10 and the non-linearity ratio k = 0.5, 0.4, and 0.1, respectively. (C) Illustration of various non-linear responses of the RPU device with k = 0, 0.5, and 1.

Fig. 3. Test error of DNN with the MNIST dataset. Open white circles correspond to a baseline model with the training performed using the conventional update rule of Eq. 1. All solid lines assume a stochastic model with BL = 10 and k = 0. (A) Lines 1, 2, and 3 correspond to a stochastic model with Δw_min = 0.1, 0.032, and 0.01, respectively. All curves in B-I use Δw_min = 0.001. (B) Lines 1, 2, and 3 correspond to a stochastic model with weights bounded to 0.1, 0.2, and 0.3, respectively. (C) Lines 1, 2, and 3 correspond to a stochastic model with a coincidence-to-coincidence variation in Δw_min of 1000%, 320%, and 100%, respectively. (D) Lines 1, 2, and 3 correspond to a stochastic model with device-to-device variation in Δw_min of 1000%, 320%, and 100%, respectively. (E) Lines 1, 2, and 3 correspond to a stochastic model with device-to-device variation in the upper and lower bounds of 1000%, 320%, and 100%, respectively. All solid lines in E have a mean value of 1.0 for the upper bound (and −1.0 for the lower bound). (F) Lines 1, 2, and 3 correspond to a stochastic model where down changes are weaker by 0.5, 0.75, and 0.9, respectively. (G) Lines 1, 2, and 3 correspond to a stochastic model where up changes are weaker by 0.5, 0.75, and 0.9, respectively. (H) Lines 1, 2, and 3 correspond to a stochastic model with device-to-device variation in the up and down changes of 40%, 20%, and 6%, respectively. (I) Lines 1, 2, and 3 correspond to a stochastic model with a noise in vector-matrix multiplication of 100%, 60%, and 10%, respectively, normalized to the activation function temperature.

Fig. 4. (A) Line 1 shows threshold values for parameters from Fig. 3 assuming a 0.3% error penalty. Parameters C-I correspond to the experiments in Figs. 3C-3I, respectively. The gray shaded area bounded by line 3 results in at most a 0.3% error penalty when all parameters are combined. (B) Curve 1 corresponds to a model with all parameters combined at the threshold value, as shown in the radar diagram by line 1. Curve 2 corresponds to a model with only C, D, and E combined at the threshold. Curve 3 corresponds to a model with C, D, and E at 30%, F/G at 0%, H at 2%, and I at 5%, all combined as shown in the radar diagram by line 3.

Fig. 5. (A) Operation of a single column (or row) during the forward (or backward) cycle, showing an op-amp that integrates the differential current on the capacitor C_int and an analog-to-digital converter (ADC). (B) Test error for the network of model 3 in Fig. 4B with bounds |h| imposed on the results of vector-matrix multiplication. Curve 1 corresponds to a model with |h| = 3 imposed only on the sigmoid activation function in hidden layers. Curves 2 and 3 correspond to a model with |h| = 3 and 12, respectively, imposed on both sigmoid and softmax activation functions. (C) Schematic of the architecture of the accelerator RPU chip. RPU tiles are located on the bottom, NLF digital compute circuits are on the top, on-chip communication is provided by a bus or NoC, and off-chip communication relies on I/O circuits. (D) Acceptable input-referred noise levels for various on/off ratios m of the RPU devices and integration times t_meas. Curves 1, 2, and 3 correspond to t_meas of 20, 80, and 160 ns, respectively.