RescueSNN: enabling reliable executions on spiking neural network accelerators under permanent faults

To maximize the performance and energy efficiency of Spiking Neural Network (SNN) processing on resource-constrained embedded systems, specialized hardware accelerators/chips are employed. However, these SNN chips may suffer from permanent faults which can affect the functionality of weight memory and neuron behavior, thereby causing potentially significant accuracy degradation and system malfunctioning. Such permanent faults may come from manufacturing defects during the fabrication process, and/or from device/transistor damages (e.g., due to wear out) during the run-time operation. However, the impact of permanent faults in SNN chips and the respective mitigation techniques have not been thoroughly investigated yet. Toward this, we propose RescueSNN, a novel methodology to mitigate permanent faults in the compute engine of SNN chips without requiring additional retraining, thereby significantly cutting down the design time and retraining costs, while maintaining the throughput and quality. The key ideas of our RescueSNN methodology are (1) analyzing the characteristics of SNN under permanent faults; (2) leveraging this analysis to improve the SNN fault-tolerance through effective fault-aware mapping (FAM); and (3) devising lightweight hardware enhancements to support FAM. Our FAM technique leverages the fault map of SNN compute engine for (i) minimizing weight corruption when mapping weight bits on the faulty memory cells, and (ii) selectively employing faulty neurons that do not cause significant accuracy degradation to maintain accuracy and throughput, while considering the SNN operations and processing dataflow. The experimental results show that our RescueSNN improves accuracy by up to 80% while maintaining the throughput reduction below 25% in high fault rate (e.g., 0.5 of the potential fault locations), as compared to running SNNs on the faulty chip without mitigation. In this manner, the embedded systems that employ RescueSNN-enhanced chips can efficiently ensure reliable executions against permanent faults during their operational lifetime.


INTRODUCTION
In recent years, SNNs have shown a potential for achieving high accuracy with ultra-low power/energy consumption due to their sparse spike-based operations (Putra and Shafique, 2020).Moreover, SNNs can perform unsupervised learning with unlabeled data using spike-timing-dependent plasticity (STDP), which is highly desired for real-world applications (e.g., autonomous agents like UAVs and robotics), especially due to two reasons: these systems are typically subjected to unforeseen scenarios (Putra andShafique, 2021b, 2022a); and gathering unlabeled data is cheaper than labeled ones (Rathi et al., 2019).An SNN architecture supporting unsupervised learning is shown in Fig. 1(a).To maximize the performance and energy efficiency of SNN processing, specialized SNN accelerators/chips are employed (Painkras et al., 2013;Akopyan et al., 2015;Davies et al., 2018;Frenkel et al., 2019).However, these SNN chips may suffer from permanent faults, which can occur during: (1) chip fabrication process due to manufacturing defects, as fabricating an SNN chip with millions-to-billions of nano-scale transistors with 100% correct functionality is difficult, and even worsen due to the aggressive technology scaling (Zhang et al., 2018;Hanif et al., 2018;Hanif et al., 2021); and (2) run time operation due to device/transistor wear out and damages, that are caused by Hot Carrier Injection (HCI), Bias Temperature Instability (BTI), electromigration, or Time Dependent Dielectric Breakdown (TDDB) (Radetzki et al., 2013;Werner et al., 2016;Hanif et al., 2018;Baloch et al., 2019;Mercier et al., 2020;Hanif et al., 2021).
Permanent faults can affect the functionality of the compute engine of SNN accelerators/chips, including the local weight memory/registers and neurons, by corrupting the weight values and neuron behavior (i.e., membrane potential dynamics and spike generation), thereby degrading the accuracy, as shown in Fig. 2. For instance, permanent faults can change the weight values through stuck-at 0 and 1, as shown in Fig. 1(b).Simply stopping the executions on faulty chips at run time will lead to a short operational lifetime, while discarding the faulty chips at design time will lead to low yield and increase the per-unit cost of the non-faulty chip.Therefore, alternate low-cost solutions for mitigating permanent faults in the SNN compute engine1 are required.These solutions will prolong the operational lifetime of SNN chips.Moreover, such solutions also increase the applicability of wafer-scale chips for SNNs where embracing permanent faults is important to maintain the yield.
Targeted Problem: How can we efficiently mitigate permanent faults in the SNN compute engine (i.e., the local weight registers and neurons) on the accuracy, thereby improving the SNN fault tolerance and maintaining the throughput.The efficient solution to this problem will enable reliable SNN executions on faulty chips without the need for retraining for energy-constrained embedded systems, such as IoT-Edge devices and autonomous agents.

State-of-the-Art and Their Limitations
Besides discarding the faulty chips, the standard VLSI fault tolerance techniques like Dual Modular Redundancy (DMR) (Vadlamani et al., 2010), Triple Modular Redundancy (TMR) (Lyons and Vanderkulk, 1962), and Error Correction Code (ECC) (Sze, 2000), may be used for mitigating permanent faults.However, they require extra (redundant) hardware and/or executions which incur huge area and energy overheads.State-of-the-art works have studied different design aspects for understanding faults in SNNs and devising mitigation techniques, as follows.  1 0 1 1  0 1 0 1  0 1 0 0  0 0 1 0   1 0 1 1  0 1 0 0  0 1 0 0  0 0 1  • SNN fault tolerance: Previous works studied the impact of faults on SNN weights without considering the underlying hardware architectures and processing dataflows (Schuman et al., 2020;Rastogi et al., 2021) and each discussing a specific fault, like bit-flip or synapse removal.Recent works devised mitigation techniques for faults in the weight memories of an SNN hardware (Putra et al., 2021a(Putra et al., ,b, 2022a)), while work of Putra et al. (2022b) aimed at addressing transient faults.
In summary, the state-of-the-art works still focus on permanent fault modeling and injection only on the weight memories of SNN hardware.Hence, the impact of permanent faults in the SNN compute engine (i.e., synapses and neurons) and the respective fault mitigation techniques, especially with a focus on avoiding retraining costs, are still unexplored.To study the challenges of mitigating permanent faults, we perform an experimental case study in Section 1.2.

Case Study and Research Challenges
In this case study, we consider an SNN accelerator architecture in Fig. 2(a).We assume all neurons are not faulty, and inject permanent faults (i.e., stuck-at 0 or 1) on the weight registers with random distribution and different rates of faulty memory cells to see the significance of faulty registers on accuracy.Details on the experimental setup are discussed in Section 4. From the experimental results in Fig. 2(b), we make the following key observations.
• Classification accuracy decreases as the rate of faulty memory cells increases for both stuck-at 0 and stuck-at 1 scenarios, thereby showing the negative impact of permanent faults in the synapses.
• In the stuck-at 0 case, the stored weight value will either stay the same or decrease from the original value.In the case of decreased weight value, the corresponding neuron will require more stimulus (input spikes) to increase its membrane potential and reach the threshold potential for generating a spike, which represents recognition of a specific class.However, in an SNN model, multiple neurons may be responsible to recognize the same class.Therefore, if the neuron with faulty weight bits cannot recognize the input class, then other neurons may recognize it.As consequence, the accuracy degradation caused by stuck-at 0 in memory cells is relatively small and negligible in some cases.

Memory (DRAM)
Learning Unit

Weight Buffer
Neuron Buffer   (Basu et al., 2022).Synapses and neurons can be affected by permanent faults, whose detailed discussion is provided in Section 2.2.(b) The stuck-at faults in the local weight memory (synaptic weight registers) can decrease accuracy.

SNN Accelerator
• In the stuck-at 1 case, the stored weight value will either stay the same or increase from the original value.In the case of increased weight value, the corresponding neuron will require less stimulus (input spikes) to increase its membrane potential and reach the threshold potential for generating a spike, which represents recognition of a specific class.Therefore, this neuron may become more active to generate more spikes for any input classes, which leads to more misclassification.As consequence, the accuracy degradation caused by stuck-at 1 in memory cells is more significant/noticeable than the stuck-at 0 case.
• Combinations of fault types and fault rates lead to different accuracy, which represents different fault patterns in real-world chips, rendering it unpredictable at design time.
Based on these observations, we outline the following research challenges to devise an efficient solution for the targeted problem.
• The mitigation technique should not employ retraining, as retraining needs huge compute/memory costs, processing time, and a training dataset that may not be available in certain cases due to the restriction policies.Moreover, retraining is not a scalable approach considering the large number of fabricated chips, as it needs to consider a unique fault map from each chip thereby retraining per chip.
Note, the fault map information can be obtained through the standard wafer/chip test procedure after fabrication, hence this test does not introduce new cost and only incurs a typical cost for chip test (Xu et al., 2020;Fan et al., 2022).• The mitigation should have minimal performance/energy overhead at run time as compared to that of the baseline design without fault mitigation technique, thereby making it applicable for energyconstrained embedded systems.
• The technique should not avoid the use of faulty SNN components (i.e., synapses and neurons), as it means omitting the entire computations in the respective columns of the SNN compute engine, which leads to throughput reduction.
• SNNs require a specialized permanent fault mitigation technique as compared to other neural network computation models (e.g., deep neural networks), since SNNs have different operations and dataflows.

Our Novel Contributions
To address the above challenges, we propose RescueSNN, a novel methodology that enables reliable executions on SNN accelerators under permanent faults.To the best of our knowledge, this work is the first effort that mitigates permanent faults in the SNN accelerators/chips.Following are the key steps of the RescueSNN methodology (the overview is shown in Fig. 3).
• Analyzing the SNN fault tolerance to understand the impact of faulty components (i.e., synapses and neurons) on accuracy considering the given fault rates.
• Employing the fault-aware mapping (FAM) techniques to safely map SNN weights and neuron operations on faulty compute engine, thereby maintaining accuracy and throughput.Our FAM techniques leverage the fault map of the compute engine to perform the following key mechanisms.1. Mapping the significant weight bits on the non-faulty memory cells of the synapses (weight registers) to minimally pollute/change the weight values.
2. Selectively employing faulty neurons that do not cause significant accuracy degradation at inference, based on the behavior of their membrane potential dynamics and spike generation.
• Devising simple hardware enhancements to enable efficient FAM techniques.Our enhancements shuffle the weight bits from the synapses by employing simple combinational logic units, such as multiplexers, so that these weight bits can be used for SNN computations.
Key Results: We evaluate our RescueSNN methodology using Python-based simulations (Hazan et al., 2018) on a multi-GPU machine.The experimental results show that our RescueSNN improves the SNN accuracy without retraining by up to 80% and 47% on the MNIST and the Fashion MNIST respectively, from the SNN processing without fault mitigation.

Spiking Neural Networks (SNNs)
Overview: SNNs are the brain-inspired computational models that employ action potentials (spikes) to encode the information (Maass, 1997;Putra and Shafique, 2022b).In an SNN model, neurons and synapses are connected in a specific architecture (Tavanaei et al., 2019;Mozafari et al., 2019).In this work, we consider a fully-connected (FC)-based network architecture as shown in Fig. 1(a), since it has demonstrated a high accuracy when employing bio-plausible learning rules, e.g., the spike-timing-dependent plasticity (STDP).This network architecture connects each data of the input (e.g., a pixel of an image) to all excitatory neurons for generating spikes.Each spike is used for inhibiting the excitatory neurons except the one from which the spike is generated.Such bio-plausible learning rules offer efficient learning mechanisms as they perform local learning in each synapse by leveraging spike events without any global loss function, thereby enabling unsupervised and online learning capabilities, which are especially beneficial for autonomous agents (Putra and Shafique, 2022a;Pfeiffer and Pfeil, 2018).In this work, we employ the pair-based weight-dependent STDP and bound each weight value (wgh) within the defined range, i.e., wgh = [0, 1], because this approach has been widely used by previous works (Diehl and Cook, 2015;Srinivasan et al., 2017;Hazan et al., 2019).Here, the pair-wise weight-dependent STDP learning rule is used, as it defines the maximum allowed weights, which is suitable for fixed-point format (see Eq. 1).
∆wgh denotes the weight update, η pre and η post denote the learning rate for pre-and post-synaptic spike, while x pre and x post denote the pre-and post-synaptic traces, respectively.wgh m denotes the maximum allowed weight, wgh denotes the current weight, and µ denotes the weight dependence factor.We consider the Leaky Integrate-and-Fire (LIF) neuron model, as it has low computational complexity with high bio-plausible behavior (Izhikevich, 2004).Meanwhile, the synapse is modeled by weight value (wgh) which represents the strength of the synaptic connection between the corresponding neurons.Furthermore, an SNN model typically employs a specific spike coding scheme to encode/decode data information into/from spikes.In this work, we consider the rate coding scheme which employs the frequency of spikes to proportionally represent the data, i.e., a higher data value is represented by a higher number of spikes.The reason is that, the rate coding scheme achieves high accuracy using bio-plausible STDP-based learning rules, which perform efficient learning mechanisms under unsupervised settings, thereby enabling energyefficient and smart computing systems (Diehl and Cook, 2015;Rathi et al., 2019;Putra and Shafique, 2020).
SNN Accelerator Architecture: We consider the typical SNN accelerator and compute engine architectures shown in Fig. 2(a) and Fig. 4 respectively, which are adapted from the design in Frenkel et al. (2019).We focus on the compute engine, as it is responsible for generating spikes which determine the accuracy.The compute engine has a synapse crossbar, and each synapse has a register that stores a weight value.We use 8-bit weight precision as it has a good trade-off for accuracy-memory (Putra and Shafique, 2020;Putra and Shafique, 2021a).Each bit is stored in a weight memory cell.To optimize the chip area, each synapse adds its weight with an accumulated value from the previous synapse in the same column, and stores the cumulative value in a 32-bit register.In this manner, only a single connection is required between a neuron and the connected synapses.Furthermore, each column of the compute engine has a single LIF neuron.If a spike comes to a neuron, the membrane potential (V mem ) increases by the respective weight (wgh), otherwise, the V mem decreases/leaks.If the V mem reaches the threshold potential (V th ), a spike is generated, and then V mem goes back to the reset potential (V reset ).Hence, a LIF neuron has four main operations: (1) V mem increase, (2) V mem leak, (3) V mem reset, and (4) spike generation, as shown in Fig. 2(a).Neuron operations ( 1)-( 3) define the membrane potential dynamics.

Overview
An SNN compute engine consists of two main components, i.e., synapses and neurons, which have different hardware circuitry.Therefore, we need to define a fault model for each component to achieve fast design space exploration.
1. Synapses: Each synapse hardware uses a register to store a weight value.Therefore, each permanent fault in a synapse can affect a single weight bit in the form of either a stuck-at 0 or a stuck-at 1 fault.2. Neurons: Each neuron hardware depends on the neuron model to facilitate its operations.Therefore, permanent faults can manifest in different forms depending upon the type of operation being executed on the (digital) neuron hardware, as discussed in the following (see an overview in Fig. 5).
• Faults in the 'V mem increase' operation make the neuron unable to increase its membrane potential.As a result, this neuron cannot generate any spikes (i.e., a dormant neuron).
• Faults in the 'V mem leak' operation make the neuron unable to decrease its membrane potential.Hence, this neuron acts like the Integrate-and-Fire (IF) neuron model.
• Faults in the 'V mem reset' operation make the neuron unable to reset its membrane potential.As a result, this neuron will continuously generate spikes.
• Faults in the 'spike generation' make the neuron unable to generate spikes (i.e., dormant neuron).

Fault Generation and Distribution
Previous studies have shown that permanent faults occur in random locations of a chip, hence leading to a certain fault map (Zhang et al., 2018;Radetzki et al., 2013;Werner et al., 2016;Mercier et al., 2020;Stanisavljević et al., 2011).Following are the key steps to generate and distribute permanent faults on the SNN compute engine, as shown in Fig. 6.
1. We consider a weight memory cell and a neuron operation as the potential fault locations.
2. We generate permanent faults based on the given fault rate and distribute them randomly across the potential fault locations.The fault rate represents the ratio between the total number of faulty weight memory cells and neuron operations to the total number of potential fault locations (i.e., the total number of weight memory cells and neuron operations).3.If a fault occurs in a local weight memory cell, then we randomly select the type of stuck-at fault (i.e., either stuck-at 0 or stuck-at 1).Meanwhile, if a fault occurs in a neuron operation, then we randomly select the type of permanent faulty operation.

RESCUESNN METHODOLOGY
The overview of our RescueSNN methodology is shown in Fig. 7, and its key steps are discussed in the following subsections.

SNN Fault Tolerance Analysis under Permanent Faults
SNN fault tolerance analysis is important to understand how a given SNN model will behave considering a specific operating condition (e.g., a combination of certain fault rate, type of stuck-at fault, architecture in [1] in [2] in [3] in [4] in [5] in [6] in [7] sel out [7] out [6] out [5] out [4] out [3] out [2] out [1] out of the compute engine, etc.).This analysis provides information which can be leveraged for devising an efficient fault mitigation technique.Therefore, our RescueSNN methodology investigates the interaction between the faulty components (i.e., synapses and neurons) and the obtained accuracy.To do this, we perform the following experimental case studies, while considering a 400-neuron network with fullyconnected architecture like in Fig. 1(a).
• We study the accuracy under faulty weight registers, by injecting a specific stuck-at fault (i.e., either stuck-at 0 or stuck-at 1) in the weight registers, while considering fault-free neurons.Experimental results are shown in Fig. 2(b).We observe that both stuck-at 0 and stuck-at 1 faults can degrade accuracy.Therefore, the mitigation technique should address both stuck-at faults.
• We study the accuracy under faulty neuron operations, by injecting faults on the neuron operations, while considering fault-free weight registers.Experimental results are shown in Fig. 8, from which we make the following observations.1. Faulty 'spike generation', 'V mem increase', and 'V mem leak' operations have tolerable accuracy, since their faulty behavior does not dominate the spiking activity, and/or the function of the corresponding faulty neurons for classification may be substituted by other neurons that recognize the same class.Therefore, these neurons can still be used for SNN processing.2. Faulty 'V mem reset' operations cause significant accuracy degradation, as these operations make the corresponding neurons dominate classification.Therefore, these neurons should not be used for SNN processing.
Note, complex SNN models with multiple layers and different computational architectures (e.g., convolutional and fully-connected) may have different observation results as compared to results in Fig. 8.However, previous work has observed similar trends to our study, i.e., neurons with faulty 'V mem reset' operations continuously generate spikes (so-called saturated neurons) and cause the most significant accuracy degradation than other types of faulty neuron operations (Spyrou et al., 2021).It also identified that, saturated neurons affect classification accuracy at any layer of SNN models, as these faulty neurons always dominate the classification activity which results in a significant accuracy degradation.This finding is consistent with the insights provided by our study.However, it is still challenging to achieve high accuracy when employing STDP-based learning on complex SNN models (Rathi et al., 2023), thereby hindering their applicability for diverse applications, such as systems with online training and unsupervised learning requirements (e.g., autonomous mobile agents).Therefore, in this work, we consider the FC-based SNNs shown in Fig. 1(a) to enable multiple advantages, such as high accuracy, unsupervised learning capabilities, and efficient online training.

The Proposed Fault-Aware Mapping (FAM)
Permanent faults in SNN chips can be identified at the design time and at the run time.The post-fabrication test can be employed to find a set of locations of faults (due to manufacturing defects) in the SNN compute engine (i.e., fault map) at the design time (Zhang et al., 2018;Putra et al., 2021a).Meanwhile, the online test strategy like the Built-In Self-Test (BIST) techniques can be employed to obtain the fault map (due to device wear out or physical damages) at the run time (Baloch et al., 2019;Wang et al., 2019;Mercier et al., 2022).Our RescueSNN methodology leverages this fault map to safely map the SNN weights and operations on the compute engine, thereby minimizing the negative impact of permanent faults.To do this, the RescueSNN employs Fault-Aware Mapping (FAM) techniques that mitigate the faults in synapses and neurons through the following key mechanisms.
1. FAM for Synapses: The significant weight bits are placed on the non-faulty weight memory cells, while the insignificant bits are placed on the faulty ones, by performing a simple bit-shuffling technique.
The significance of weight bits can be identified by experiments that observe the accuracy after modifying a specific bit (Putra et al., 2021a).In general, previous studies have observed that the significance of a weight bit is proportional to its bit location.For instance, in 8-bit fixed-point precision, bit-7 has the highest significance than other bits.Furthermore, a synapse may have a single faulty bit or multiple faulty bits.Therefore, we propose a mapping strategy that can address both cases using the following steps (see an overview in Fig. 9).
• We identify the faulty weight bits (e.g., through the post-fabrication testing) to obtain information of the fault map and fault rate in each synapse hardware.
• We identify the maximum fault rate in each synapse hardware for safely storing a weight.In this work, we consider a maximum of 2 faulty bits from an 8-bit weight, based on the fault rates that offer tolerable accuracy from analysis in Section 3.1.
• We identify the segment in each synapse with the highest number of subsequent non-faulty memory cells.This information is leveraged for maximizing the possibility of storing the significant bits in the non-faulty cells.Hence, we also examine the corner case (i.e., the right-most and left-most cells) as possible subsequent non-faulty memory cells; see the third row of Fig. 9(b) with data-3.
• We perform a circular-shift technique for each data word to efficiently implement bit-shuffling.2. FAM for Neurons: The use of neurons should be avoided if they have faulty 'V mem reset' operations, as these faulty operations cause significant accuracy degradation.Meanwhile, neurons with other types of faults can still be used for SNN processing, as their faulty behavior does not dominate the spiking activity.Different SNN operations that aim at recognizing the same input class are mapped on both the faulty and fault-free neurons for maintaining throughput, while compensating the loss from the faulty 'V mem increase', 'V mem leak', and 'spike generation' operations.
Afterward, we leverage these mechanisms for devising three mapping strategies, as the variants of our FAM technique (i.e., FAM1, FAM2, and FAM3), which provide trade-offs between accuracy and mapping complexity, as discussed in the following.
• FAM1: It avoids mapping the SNN weights and operations on the columns of compute engine that have faulty neurons, as shown in Fig. 10(a), as faulty neurons can reduce the accuracy more than faulty registers, especially in the case of faulty 'V mem reset'.However, FAM1 does not mitigate the negative impact of faults in the registers, hence the accuracy improvement is sub-optimal.The benefit of FAM1 is due to its simple mechanism which enables a low-complexity control mechanism.
• FAM2: It maps the SNN weights and operations on the columns of compute engine that have fault-free neurons (just like FAM1) and employs a bit-shuffling technique to map the significant weight bits on the non-faulty memory cells, as shown in Fig. 10(b).Therefore, FAM2 can improve the SNN fault tolerance at the cost of a more complex control mechanism than FAM1.
• FAM3: It selectively maps the SNN weights and operations on the columns of compute engine that do not have faulty 'V mem reset' operations, as well as maps the significant weight bits on the non-faulty memory cells using a bit-shuffling technique, as shown in Fig. 10(c).Therefore, FAM3 can enhance the SNN fault tolerance as compared to FAM1 at the cost of a more complex control mechanism, and can improve the throughput as compared to FAM1 and FAM2.
Information regarding how to map the SNN weights and operations on the compute engine is provided through software program (e.g., firmware), thereby enabling the applicability and flexibility of the proposed FAM technique (e.g., FAM1, FAM2, or FAM3) for different possible fault maps on the compute engine.The meta data of this information is stored in the on-chip buffer, which can be accessed for operations in the compute engine.

EVALUATION METHODOLOGY
For evaluating the RescueSNN methodology, we employ the experimental setup shown in Fig. 12.We use the fully-connected network shown in Fig. 1(a) with a different number of neurons, to evaluate the generality of our RescueSNN methodology.For conciseness, we represent a network with i-number of neurons as Ni.We use the MNIST and Fashion MNIST as the workloads, and adopt the same test conditions as used widely by the SNN community (Diehl and Cook, 2015).For comparison, we consider the SNN without fault mitigation as the baseline.
Fault Generation and Injection: Permanent faults are generated based on the fault modeling in Section 2.2 of the revised manuscript.To do this, we first generate binary values (i.e., 0 and 1) based on the given fault rate while considering the potential fault locations (shown in Fig. 6).Here, '0' represents a non-faulty memory cell in synapses or a non-faulty operation in neurons; while '1' represents a faulty memory cell in synapses or a faulty operation in neurons.These binary values are then randomly distributed into an array that represents the potential fault locations, so that each value corresponds to a specific weight memory cell or a specific neuron operation.Fig. 13 shows the potential locations/components that can be affected by permanent faults to cause faulty memory cells as well as faulty 'V mem increase', 'V mem leak', 'V mem reset', and 'spike generation' operations.For each fault in the weight memory cells (synapses), we randomly determine the type of fault, i.e., either stuck-at 0 or stuck-at 1.In stuck-at 0 case, value 0 is injected to the corresponding memory cell; while in stuck-at 1 case, value 1 is injected.Meanwhile, each fault in neurons corresponds to either faulty 'V mem increase', 'V mem leak', 'V mem reset', or 'spike generation' operation (as shown in Fig. 5).Each faulty behavior in the corresponding neuron is realized through different approaches as described in the following.• Faulty 'V mem leak' operation: It is mainly caused by faulty subtraction in 'V mem leak' part, hence V mem is not decreased despite there are no incoming spikes.
• Faulty 'V mem reset' operation: It is mainly caused by faulty comparison in 'V mem reset' part, hence the spike generator is activated to continuously generate spikes.
• Faulty 'spike generation': It is mainly caused by faulty multiplexing in 'spike generation' part, hence the spike generator is always deactivated and no output spikes are produced.
Accuracy Evaluation: We use the Python-based simulations (Hazan et al., 2018), which run on Nvidia RTX 2080 Ti GPUs, while considering SNN accelerator architecture shown in Fig. 2(a) and 4.
Hardware Evaluation: We evaluate the area, energy consumption, and throughput of both the original compute engine (without enhancements) and the enhanced compute engine using our RescueSNN methodology.To do this, we design RTL codes for both (original and enhanced) compute engines, then synthesize them using the Cadence Genus tool considering a 65nm CMOS technology to obtain their area, power consumption, and timing (i.e., a clock cycle latency for SNN processing on the compute engine).Afterward, we calculate the required number of cycles and computation latency for processing an input sample (i.e., latency-per-sample), considering the timing from synthesis and the mapping strategy on active synapses and neurons in the compute engine.Then, we estimate the throughput by computing the number of samples that can be processed within one second of SNN inference based on the information of latencyper-sample.Furthermore, we also estimate the energy consumption by leveraging the information of power consumption from synthesis and latency-per-sample for SNN inference.The estimation of throughput and energy consumption is also performed using the Python-based simulation framework (Hazan et al., 2018).

RESULTS AND DISCUSSION
We evaluate different design aspects including accuracy, throughput, energy consumption, and area as discussed in the following.

Maintaining Accuracy
Fig. 14 presents the experimental results for the accuracy of different fault mitigation techniques, i.e., the baseline and our FAM-based strategies including FAM1, FAM2, and FAM3.We observe that the baseline suffers from a significant accuracy degradation as shown by 1 , because it does not mitigate Figure 13.The potential locations/components that can be affected by permanent faults to cause faulty memory cells as well as faulty 'V mem increase', 'V mem leak', 'V mem reset', and 'spike generation' operations.
faults in synapses and neurons, thereby leading to unreliable SNN executions.The significant accuracy degradation is mainly due to the fault model for faulty 'V mem reset' operation that makes the corresponding neuron generate spikes continuously once its membrane potential V mem reaches the threshold potential V th , thereby dominating the classification activity and leading to high misclassification.We also observe that FAM1 significantly improves the SNN fault tolerance as compared to the baseline, because FAM1 avoids the use of faulty neurons, especially for faulty 'V mem reset' operations, as shown by 2 .Our FAM2 improves the SNN fault tolerance even more as compared to FAM1, since FAM2 also mitigates faults in the weight registers in addition to avoiding the use of faulty neurons, as shown by 3 .Meanwhile, our FAM3 also significantly improves the SNN fault tolerance from baseline and FAM1, and obtains comparable accuracy to FAM2, since FAM3 mitigates faults in weight registers and selectively uses faulty neurons.It achieves up to 80% accuracy improvement compared to the baseline on the MNIST dataset, as shown by 4 .We also observe that the same reasons are also applicable to different workloads, thereby leading the accuracy profiles for Fashion MNIST to have similar trends to the accuracy profiles for MNIST.These results show that our FAM strategies (FAM1, FAM2, and FAM3) are effective for mitigating permanent faults in the compute engine without retraining, across different model sizes, fault rates, and workloads.

Maintaining Throughput
Fig. 15(a) presents the experimental results on the throughput of different mitigation techniques, i.e., the baseline and our FAM strategies (FAM1, FAM2, and FAM3).We observe that the baseline has the highest throughput across different model sizes and fault rates, as it uses all synapses and neurons for performing SNN executions, as shown by 1 .Meanwhile, FAM1 and FAM2 may suffer from throughput reduction because they avoid the use of faulty neurons, thereby omitting the corresponding columns of the SNN compute engine.For instance, FAM1 and FAM2 may suffer from 30% throughput reduction for N1600 with 0.1 fault rate, as shown by 2 .Meanwhile, our FAM3 can maintain the throughput close to the baseline (e.g., keeping the throughput reduction below 25% in a 0.5 fault rate), thereby improving the throughput significantly as compared to FAM1 and FAM2.The reason is that, FAM3 omits the columns of compute engine only if the corresponding neurons have faulty 'V mem reset' operations.For instance, FAM3 has less than 15% throughput reduction for N1600, as indicated by 3 .These results show that our FAM3 is effective for maintaining the throughput across different model sizes, fault rates, and workloads.

Energy Consumption and Area Overheads
Fig . 15(b) shows the experimental results on the energy consumption of different mitigation techniques, i.e., the baseline and our FAM strategies (FAM1, FAM2, and FAM3).We observe that different techniques have comparable energy for small fault rates, as shown by label-4 .The reason is that small fault rates have a low probability of faulty neurons, hence the resource utilization for different techniques is similar.For large fault rates, FAM1 and FAM2 have higher energy consumption than the baseline and FAM3, as shown by label-5 .The reason is that large fault rates have a high probability of faulty neurons, hence the resource utilization for different techniques is different, i.e., FAM1 and FAM2 avoid the use of faulty neurons, thereby incurring higher compute latency and energy consumption.The baseline and FAM3 have comparable energy since FAM3 employs simple hardware enhancements: (1) multiplexing operations in each HEB which are shared for all synapses in the same column of compute engine, and (2) registers accesses in ECU, thereby minimizing the energy consumption overhead for FAM3 (i.e., within 30%).For area footprint, the original compute engine consumes around 6.27 mm 2 of area, while the one with proposed enhancements consumes around 8.56 mm 2 of area.Therefore, the proposed enhancements incur about 36.5% of area overhead, which encompasses about 36.2% of ECU and about 0.3% of HEBs.The area of ECU dominates the total area of enhancements since it mainly employs a set of 3-bit registers (i.e., 256x256 registers), which incurs a larger area as compared to HEBs (i.e., 256x25 multiplexers).These results show that our FAM3 achieves minimum overheads in terms of energy consumption and area across different model sizes, fault rates, and workloads.
In summary, the above discussions show that our RescueSNN methodology can effectively mitigate permanent faults in the SNN chips without retraining.Since our RescueSNN addresses permanent faults during both the design time and the run time, it increases the yield of SNN chips, as well as enables efficient and reliable SNN executions during their operational lifetime.Furthermore, our RescueSNN also avoids carbon emission as it does not need any retraining, thereby offering an environment-friendly solution (Strubell et al., 2019(Strubell et al., , 2020)).

Further Discussion
In general, we observe that a faulty 'V mem reset' operation can cause significant accuracy degradation as it deteriorates the neuron from the expected behavior.The reason is that, the generated (faulty) spikes will affect how the SNN model understands the information, since an SNN model typically employs a certain spike coding scheme, i.e., rate coding in this work.Therefore, a neuron with faulty 'V mem reset' operation will generate a high number of spikes and dominate the classification activity, thereby leading to high misclassification and significant accuracy degradation.We also observe that, a higher number of spikes generated by faulty 'V mem reset' operation also indicates that the SNN model performs more frequent neuron operations that correspond to spike generation.This condition leads to higher power/energy consumption for SNN processing, which has been observed and studied in previous works (Krithivasan et al., 2019;Park et al., 2020;Putra and Shafique, 2023).
Comparisons with Retraining Technique: In a standard chip fabrication process, manufactured chips are evaluated in a wafer/chip test procedure (i.e., wafer acceptance test and chip probing test).This test procedure aims at evaluating the quality of each chip, including any faults in the chip (Xu et al., 2020;Fan et al., 2022).In this step, the permanent faults and the corresponding fault map information from manufacturing defects are identified.Therefore, this step does not introduce new cost, and only requires typical cost for a standard wafer/chip test procedure (Xu et 2020;Fan et al., 2022).In the retraining technique, the fault map information is then incorporated in the retraining process considering how the weights and neuron operations are mapped on the SNN compute engine, i.e., so-called fault-aware training (FAT).In this manner, the SNN model is expected to adapt to the presence of faults, hence maintaining high accuracy.This indicates that, the retraining technique requires (1) fault map information from the chip test procedure, and (2) a full training dataset, which may be unavailable due to restriction policy.Furthermore, each chip has a unique fault map which requires its own retraining process, thereby incurring huge time and energy costs.Otherwise, the retraining technique will not be effective.Meanwhile, our proposed FAM technique in RescueSNN methodology leverages the fault map information to safely map the weights and neuron operations on the SNN compute engine.It ensures that the SNN processing is not negatively affected by permanent faults, thereby maintaining high accuracy.Although each chip has a unique fault

Figure 1 .
Figure 1.(a) An SNN architecture that achieves high accuracy in unsupervised learning scenarios, i.e., a single-layer fully-connected (FC) network (Putra and Shafique, 2020).(b) Permanent faults in the weight memory of the SNN compute engine may exist in form of stuck-at 0 and stuck-at 1 faults.

Figure 2 .
Figure 2. (a) The typical SNN accelerator architecture employs crossbar-based synaptic connections(Basu et al., 2022).Synapses and neurons can be affected by permanent faults, whose detailed discussion is provided in Section 2.2.(b) The stuck-at faults in the local weight memory (synaptic weight registers) can decrease accuracy.

Figure 3 .
Figure 3. Overview of our novel contributions.

Figure 4 .
Figure 4.The architecture of the compute engine of SNN accelerators.

▪▪
It avoids faulty neurons ▪ It safely maps the weight bits on faulty registers FAM3: It avoids neurons with faulty 'V mem reset' ▪ It safely maps the weight bits on faulty registers

Figure 7 .
Figure 7. Overview of our RescueSNN methodology and its key steps for mitigating permanent faults in the SNN compute engine.

Figure 8 .
Figure 8. Impact of faulty neuron operations on accuracy.Different faulty neuron operations have a different impact on accuracy.Notable accuracy degradation happens when faulty 'V mem reset' operations are employed.

Figure 9 .
Figure 9. (a) Illustration of possible fault locations (fault map) in faulty weight registers.(b) The proposed circular-shift bit-shuffling technique for the corresponding fault map.

Figure 11 .
Figure11.The architecture of the proposed enhancements, including the hardware enhancement blocks (HEBs) and the enhancement control unit (ECU), for accommodating FAM strategies.

Figure 12 .
Figure 12.Overview of the experimental setup and tools flow.

Figure 15 .
Figure 15.(a) Throughput across different mitigation techniques, different model sizes, and different fault rates for both MNIST and Fashion MNIST, as they have a similar number of SNN weights and operations.(b) Energy consumption across different mitigation techniques, different model sizes, and different fault rates for both MNIST and Fashion MNIST, as they have a similar number of SNN weights and operations.