
Lead article

Front Sci, 16 December 2025

Volume 3 - 2025 | https://doi.org/10.3389/fsci.2025.1611658


Breaking the memory wall: next-generation artificial intelligence hardware

  • 1Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, United States
  • 2School of Electrical and Computer Engineering, College of Engineering, Georgia Institute of Technology, Atlanta, GA, United States

Abstract

The relentless advancement of artificial intelligence (AI) across sectors such as healthcare, the automotive industry, and social media necessitates the development of more efficient hardware solutions that can implement diverse learning algorithms. This lead article explores the evolution of AI learning algorithms and their computational demands, using autonomous drone navigation as a case study to highlight the limitations of traditional hardware. Traditional hardware, based on the von Neumann architecture, suffers from limited computational efficiency due to the separation of compute units and memory, also known as the “memory wall” problem. To overcome this barrier, this article discusses novel approaches to AI hardware design, focusing on compute-in-memory (CIM) techniques and stochastic hardware. CIM offers a promising solution to the memory wall problem by integrating computing capabilities directly into the memory system. This article details state-of-the-art developments in CIM for different memory types and at various levels of the memory hierarchy to support essential AI compute functions. We also discuss the use of CIM in developing neuromorphic hardware capable of accelerating biologically inspired algorithms, such as spiking neural networks. Furthermore, we highlight how stochastic hardware can exploit the error resilience of AI algorithms to enhance energy efficiency. Encompassing the full stack of AI systems, from learning algorithms to circuit and device-level techniques and architectures, this article provides a comprehensive roadmap for future research and development in AI hardware.

Key points

  • Efficient artificial intelligence (AI) hardware is crucial for resource-constrained applications such as healthcare and transportation, where it enhances performance, reduces costs, and supports real-time decision-making.
  • Overcoming the memory wall in traditional hardware is critical for enhancing AI computational efficiency, reducing latency, and enabling faster, more effective processing of complex algorithms.
  • Compute-in-memory paradigms using different memory technologies, such as embedded non-volatile memory, static random-access memory, dynamic random-access memory, and flash memory, help develop energy-efficient AI hardware by tackling the memory wall problem.
  • Stochasticity in AI algorithms (e.g., via spike timing-dependent plasticity or STDP) and hardware (e.g., via spin–orbit torque magnetic tunnel junctions or SOT-MTJs) can be leveraged to improve energy efficiency for diverse workloads and could unlock novel capabilities.
  • Co-designing hardware and algorithms to optimize energy, latency, and accuracy will lead to the development of a “converged platform” for artificial neural networks and spiking neural networks, suitable for diverse AI applications.

Introduction

Artificial intelligence (AI) has emerged as one of the most transformative technologies of the 21st century, reshaping numerous aspects of everyday life. Driven by the overarching goal of replicating human intelligence, developers have created multiple generations of AI algorithms. Machine learning (ML) algorithms are particularly notable, as they draw inspiration from human brain functions, enabling computers to learn and generalize from input data. Recent advances in ML aspire to equip computers with cognition, perception, and reasoning abilities that potentially match or exceed those of humans (1–9).

In 1989, LeCun and colleagues at Bell Labs made a significant breakthrough by training a neural network to classify handwritten digits (10). Since then, the development of neural network training algorithms has continued to evolve, resulting in architectures such as multi-layer perceptrons (MLPs), convolutional neural networks (CNNs) (1), long short-term memory models (LSTMs) (2), and transformers (3). This evolution has led to an explosion in both the parameters and computational demands of these models, culminating in bottlenecks during training on traditional central processing units (CPUs). In a pivotal move, in 2012, graphics processing units (GPUs) began to be utilized for their superior parallelism capabilities, specifically for efficient matrix-vector multiplication (MVM), the primary computation in neural networks (1). This, along with advances in transistor technology, has facilitated the training of multi-billion parameter networks using extensive server farms within data centers.

With the development of large language and vision models (LLMs and LVMs) based on the transformer architecture (3), AI has excelled in applications such as language translation (4), text prediction/generation (5), text-to-image synthesis (6), and image generation (7). However, these multi-billion parameter models require significant computational and energy resources, raising concerns about their sustainability in real-world scenarios. Moreover, they are less efficient than the human brain owing to their dense, synchronous, and high-precision computations and the need for extensive data movement between compute and memory units. To address these challenges, efforts have been made to achieve brain-like computation, such as spiking neural networks (SNNs), which can perform sparse and event-driven computations similar to the brain (11). Nevertheless, efficient implementation of fundamental operations required by ML workloads cannot be realized without considering optimization at the underlying hardware level.

Today’s AI hardware solutions are rooted in the von Neumann architecture, which distinctly separates computational and memory units. It has become evident that shuttling data between memory and compute units, known as the “memory wall” problem, is responsible for the majority of the energy consumption and latency in computing systems. Rethinking devices, circuits, and computing architectures to incorporate a “compute-in-memory” (CIM) paradigm, by performing compute operations within the memory array itself, holds significant promise for alleviating this issue (12). Concurrently, the error-resiliency of AI algorithms can be utilized to develop hardware that is approximate but enables faster operations with reduced energy consumption while maintaining system-level accuracy (13). As AI applications become more integrated into our daily lives, designing brain-inspired AI algorithms and hardware necessitates a mutual consideration of constraints and requirements, promoting an algorithm-hardware co-design approach. Figure 1 illustrates the evolution of AI models and hardware over time, highlighting the functional gap relative to the brain.


Figure 1. Evolution of artificial intelligence (AI) models and hardware over time. With the increasing complexity and demands of AI workloads, AI models also expand in size and complexity to maintain performance, resulting in high energy and latency implementations. To address this, advancements in AI hardware aim to offer novel solutions that are fast and efficient. However, this evolution in hardware also leads to a widening functional gap between the AI models and the hardware. The gap can be characterized by factors such as a variety of computational kernels, unpredictable and sparse access patterns, and high memory density requirements. This motivates the need to co-design AI algorithms and AI hardware in a converged platform across the entire stack, from devices and circuits to architecture and algorithms.

Existing surveys on autonomous navigation have provided comprehensive overviews of AI methods in the field. A notable survey by Nahavandi et al. (14) offers an in-depth examination of autonomous navigation for mobile robots, covering standard sensing technologies, a variety of robotic platforms, simulation environments, and navigation fundamentals, and presenting a thorough treatment of Simultaneous Localization and Mapping (SLAM) algorithms. Separately, another work by Rezwan et al. (15) focuses specifically on unmanned aerial vehicle (UAV) autonomous navigation, comparing traditional mathematical optimization approaches with recent learning-based methods, and delves into UAV-specific considerations such as navigation models and input modalities. In contrast, this article traces the evolution of AI applications and advances in AI algorithms to highlight the requirements for specialized AI hardware. Using autonomous drone navigation as an illustrative application, it contextualizes the discussion on efficient AI hardware design, directly linking algorithmic demands to hardware innovation.

On the hardware side, previous reviews (16) have primarily explored the material properties of devices employed in CIM crossbars without addressing the broader accelerator architecture or issues of application-level integration. Furthermore, in contrast to other works (17, 18), this lead article additionally surveys emerging low-power neural network architectures, such as spiking neural networks (SNNs), which are highly relevant for energy-constrained edge platforms like drones. Ultimately, this article explores new computing paradigms—namely, in-memory and approximate computing—that offer the potential for substantial performance enhancements over existing AI hardware. It also examines neuromorphic hardware for SNNs, highlighting the advancements necessary in hardware technologies such as CIM and stochastic devices. The discussion concludes with an examination of brain-inspired solutions as promising avenues for achieving efficient AI implementations.

Evolution of AI applications and algorithms

AI systems of today are incredibly versatile, capable of both simple tasks, such as voice-activated lighting, and complex ones, such as generating realistic videos of imagined scenes. This section explores the historical as well as ongoing development of AI applications and the algorithms that power them. We focus on autonomous drone navigation as an exemplary application to demonstrate AI’s potential and examine the learning algorithms best suited for fast and efficient deployment on edge devices.

AI applications

The cognitive capabilities of AI have recently experienced a significant leap, leading to unprecedented performance across various domains. Notably, these include computer vision (encompassing image and video analysis, captioning, denoising, and inpainting) and natural language processing (including language translation, text summarization, and chatbots). As a result, AI has been applied in a variety of industries, such as healthcare, finance, transportation, retail, and manufacturing. The dynamic landscape of AI applications is continually evolving as more tasks valuable to human society are consistently enhanced by AI.

In the current era, dominated by data, acquiring relevant information is essential for the effective operation of AI systems. This information usually comes from a vast corpus of language, image, and speech data on the Internet. However, it can also be derived in real-time, for example, from the activity history of end-users on various ML-backed social media platforms or directly from the physical world through an array of sensors, including cameras, global positioning system (GPS), radar, lidar, sonar, and inertial measurement units (IMUs). Among these, vision sensors are the most prominent across various real-world applications. In recent years, a new category of vision sensors known as event-based cameras has gained significant attention (19, 20). Unlike traditional frame-based cameras that synchronously sample dense frames, event-based cameras provide an asynchronous and sparse stream of binary events based on the pixel-wise change in log-scale intensity. This leads to better temporal resolution, dynamic range, and power consumption (21). These characteristics position event cameras as promising candidates for low-power and low-latency sensing elements in resource-constrained systems.

Consider an exemplary application of a UAV deployed in a hazardous environment for autonomous search and rescue (Figure 2) (22). The UAV is required to navigate seamlessly at high speed, plan, reason, and make decisions without any human supervision. To accomplish this, it needs to have a detailed understanding of its environment by carrying out several underlying perception tasks—such as optical flow, depth estimation, semantic segmentation, and object detection—to construct a perception base. This perception base is further used by high-level perception modules, planning subsystems, and neurosymbolic models to appropriately control the behavior of the drone (23, 24). This entire pipeline needs to be executed in real-time to enable high-speed navigation under all environmental conditions, which is quite challenging. From the algorithm standpoint, first, the architectures must be small enough to satisfy the resource constraints at the edge yet powerful enough to achieve satisfactory performance (25). Second, when considering real-world deployment for tasks such as optical flow and object tracking, it is crucial that the underlying algorithms and architectures are capable of effectively capturing the temporal dependencies in inputs over time. Third, since many of the subtasks are interdependent, where the output of one task serves as the input for another, it is important to carefully consider the design of the overall architecture and optimize it for efficient implementation. On the hardware front, optimized implementations capable of accelerating algorithms with these characteristics are vital, necessitating a shift from traditional CPU/GPU-based approaches to application-specific hardware.


Figure 2. An exemplary application involving a search-and-rescue mission carried out by an unmanned aerial vehicle (UAV). The UAV is equipped with a variety of sensors for accurate and rapid sensing, and it also employs efficient AI models on suitable brain-inspired hardware for perception, planning, and symbolic tasks. Given these capabilities, the UAV can navigate seamlessly in a previously unseen environment by determining an optimal motion path, while avoiding obstacles and hazardous areas.

AI deployment

AI models are trained using extensive real-world data in centralized data centers, often employing clusters of GPUs before they are deployed for repeated inference in the real world. AI models can be deployed either at the edge or in the cloud. Deploying AI models in the cloud has become a common strategy, primarily due to the unparalleled application accuracy achievable with the use of large, complex models. While practical for creating AI applications, this approach faces several challenges, such as the requirement for an active Internet connection, privacy concerns when transmitting sensitive data, and high energy and latency costs for extensive back-and-forth communications. Conversely, edge AI involves deploying AI algorithms entirely on edge systems, eliminating the need for communication with the cloud. However, due to their physical constraints, limited power budget, and computing capabilities, edge AI models must be scaled down, inevitably resulting in performance loss (26). As edge computing becomes more widespread, it is crucial to develop both algorithms and hardware together, with the goal of creating AI platforms that are both fast and efficient.

To facilitate the efficient deployment of AI models, quantization and sparsity have emerged as highly effective techniques, leading to models with substantially reduced size and computational demands. Quantization reduces the numerical precision of model parameters and activations from high-precision floating-point formats (e.g., FP32) to compact representations such as 8-bit integer (INT8), 4-bit integer, or even binary. Historically, early neural networks relied on handcrafted fixed-point arithmetic for embedded inference; however, modern quantization extends this principle with sophisticated calibration, error correction, and mixed-precision strategies (27, 28) that preserve accuracy at extreme compression ratios. Techniques like dynamic per-channel scaling, learned quantization parameters, and low-rank residual correction (29) demonstrate that quantization is no longer a heuristic post-processing step but an integral part of model design and training. By aligning model precision with hardware arithmetic capabilities, quantization enables dense compute units, such as tensor cores and systolic arrays, to operate near their theoretical throughput limits while dramatically lowering memory bandwidth and energy costs. Sparsity, on the other hand, exploits the empirical observation that many weights and activations in neural networks are redundant or contribute minimally to output quality (30). Structured sparsity (e.g., block or channel pruning) allows direct hardware acceleration, while unstructured sparsity achieves finer-grained compression through pruning and re-parameterization (31). Lately, hardware systems have added support for fine-grained unstructured sparsity (32). The introduction of sparse attention mechanisms and mixture-of-experts models reflects a paradigm shift: computation is no longer uniformly distributed but is dynamically allocated to the most informative aspects of inputs. Sparsity thus transforms inference from a static to an adaptive process, allowing AI systems to scale capacity without linearly scaling cost.
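As a concrete illustration of the quantization step described above, the following minimal sketch (our own hypothetical NumPy example, not drawn from any cited work) applies symmetric per-channel INT8 quantization to a weight matrix and reports the round-trip error:

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a weight matrix."""
    # One scale per output channel (row), chosen so the largest magnitude maps to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def dequantize(w_q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return w_q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 16)).astype(np.float32)   # FP32 weights
w_q, scale = quantize_int8_per_channel(w)
err = np.abs(w - dequantize(w_q, scale)).max()
print(f"max round-trip error: {err:.4f}")          # small relative to the weight range
```

Modern schemes layer calibration, learned quantization parameters, and mixed precision on top of this basic recipe, but the core idea of trading numerical precision for memory bandwidth and energy remains the same.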

When viewed through the lens of AI evolution, quantization and sparsity represent complementary pathways toward efficiency, minimizing both the representation and computation costs. As AI progresses toward neuromorphic and analog paradigms, these principles extend beyond digital optimization to form the foundation of energy-aware intelligence, where precision, density, and selectivity co-evolve as intrinsic properties of learning systems.

Learning algorithms: UAV navigation

Given the challenges in autonomous UAV navigation (Figure 2), and drawing parallels with efficient biological systems, such as the fruit fly (33), it is logical to seek inspiration from the brain. Tasks such as optical flow estimation require that we determine the movement of pixel intensities over time, while motion segmentation necessitates classifying pixels into object categories in an image sequence, and object tracking involves identifying and tracking a moving object. These tasks are inherently sequential. Thus, the underlying AI algorithm must possess the ability to learn temporal dependencies between successive inputs.

Taking optical flow as an example, traditional AI models such as CNNs (1, 34) prove inherently unsuitable owing to their inability to capture temporal dependencies. Architectures integrating memory, such as recurrent neural networks (RNNs) (35) and LSTM networks (2), are more fitting but suffer from heightened network complexity and an intricate training process. These shortcomings arise because these systems do not encompass the fundamental working principles of the brain, which operates in a sparse and event-driven manner, seamlessly integrating compute and memory within the same physical substrate.

In contrast, advancements in neuroscience have led to the development of bio-plausible algorithms such as SNNs, which can efficiently process sequential data through sparse and event-driven computations, similar to the brain. SNNs represent a computationally simpler alternative to RNNs or LSTMs, utilizing a unique mechanism—membrane potential—which serves as a lightweight form of internal memory (11). In SNNs, inputs to each neural network layer are spikes (0 or 1) over time, necessitating only an accumulation operation, unlike the multiply-and-accumulate operation in artificial neural networks (ANNs) (36). However, deep SNNs also face challenges such as vanishing spikes and non-differentiable activations (37, 38), making training difficult. Fortunately, there have been several successful efforts toward developing techniques such as ANN-to-SNN conversion (37, 39), learnable neuronal dynamics (36, 40), and surrogate gradient learning (38, 41), which have simplified SNN training and improved their application performance. Furthermore, SNNs excel at processing data from previously discussed asynchronous sensors (event-based cameras).
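To make the membrane-potential mechanism concrete, the sketch below implements a minimal leaky integrate-and-fire (LIF) layer (an illustrative simplification with assumed leak and threshold values, not the exact neuron model of any cited work). Because inputs are binary spikes, the weighted input reduces to accumulating the weight columns of active inputs rather than a full multiply-and-accumulate:

```python
import numpy as np

def lif_layer(spikes_in: np.ndarray, weights: np.ndarray,
              leak: float = 0.9, threshold: float = 1.0) -> np.ndarray:
    """Run a leaky integrate-and-fire layer over T timesteps.

    spikes_in: (T, n_in) binary spike trains; weights: (n_out, n_in).
    The membrane potential v is the lightweight internal memory carried across time.
    """
    T, _ = spikes_in.shape
    n_out = weights.shape[0]
    v = np.zeros(n_out)                                # membrane potentials
    spikes_out = np.zeros((T, n_out), dtype=np.int8)
    for t in range(T):
        active = spikes_in[t].astype(bool)
        v = leak * v + weights[:, active].sum(axis=1)  # accumulate-only update
        fired = v >= threshold
        spikes_out[t, fired] = 1
        v[fired] = 0.0                                 # hard reset after a spike
    return spikes_out

rng = np.random.default_rng(1)
inp = (rng.random((20, 8)) < 0.2).astype(np.int8)      # sparse input spike trains
out = lif_layer(inp, rng.normal(scale=0.5, size=(4, 8)))
print(out.mean())                                      # output spike rate
```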

Event-based optical flow estimation is typically carried out using an encoder-decoder multi-scale architecture based on U-Net, as introduced by Ronneberger et al. (42). Fully ANN models, inspired by Zhu et al. (43, 44), serve as the baselines and require representation of the event bins in the channel dimension (Figure 3A). This approach discards any temporal dependence between input events, resulting in suboptimal performance. In comparison, fully SNN models, such as Adaptive-SpikeNet (40) (Figure 3B), can capture temporal information effectively while utilizing layer-wise learnable neuronal dynamics to mitigate the challenges associated with SNN training. This results in improved application performance (~20% lower error) with the same model size or extremely lightweight (0.27 M parameters compared with 13 M) and efficient (~10× lower energy) models with similar performance, underscoring the efficacy of SNNs over ANNs in capturing temporal dynamics.


Figure 3. Architectures for optical flow estimation based on U-Net from Ronneberger et al. (42). (A) Fully artificial neural network (ANN) architecture operating on event bins over channels. (B) Fully spiking neural network (SNN) architecture—Adaptive-SpikeNet—operating on events over time with output spikes accumulated at the last decoder layer. Re-used from (40), with permission from IEEE. (C) Hybrid SNN–ANN architecture—Spike-FlowNet—operating on events over time. The output spikes are accumulated at the SNN-encoder. Re-used from (45), with permission from Springer Nature. (D) Sensor-fusion architecture—Fusion-FlowNet—utilizing data from events over time and grayscale frames over channels. Re-used from (46), with permission from IEEE.

In parallel, there have been efforts to explore hybrid SNN–ANN models to simplify training. Works such as Spike-FlowNet (45) and SSLN (47) fall into this category. Spike-FlowNet (Figure 3C) consists of just an SNN-encoder and outperforms the fully ANN approach (43), offering 1.21× lower energy consumption. There have also been efforts to combine information from frame- and event-based cameras into a sensor-fusion model. Works such as Fusion-FlowNet (46) (Figure 3D) use events over time as inputs to an SNN encoder and grayscale frames over channels as inputs to an ANN-encoder. This enables superior feature extraction, leading to significantly improved performance and smaller model sizes. In fact, Fusion-FlowNet, with 7.55 million parameters, attains 40% lower error with a 1.87× energy reduction compared to fully ANN methods (43). Along similar lines, DOTIE (48) offers a lightweight object detection pipeline, and HALSIE (49) uses sensor fusion for semantic segmentation. Although explained for specific tasks, the discussed principles can be applied to a broad spectrum of perception tasks. In the context of UAV navigation applications, all these approaches constitute notable advancements toward achieving brain-like edge intelligence.

While the above works demonstrate the unique potential of SNNs and hybrid architectures, it is important to note that these advantages are domain specific and should not be generalized. SNNs work well with tasks involving temporal information, and event-based sensors work well when sparse binary information is sufficient for carrying out the task at hand. In contrast, applications such as face recognition, image classification, image analysis, and 3D reconstruction, which are data intensive, are still dominated by ANNs.

In line with recent advancements in AI, including LLMs/LVMs (3), diffusion models (7), and neural architecture search (NAS) (50), which are much more powerful and power hungry than previously discussed traditional approaches, there have been a few initiatives toward low-cost implementations such as SNN transformers (51, 52). However, the energy efficiency of these initiatives remains suboptimal. This underscores the need for innovative approaches in developing brain-inspired architectures from the ground up, effectively utilizing the unique temporal dimension of SNNs rather than simply replicating ANN-based models.

Even with meticulously designed and highly optimized algorithms, achieving efficiency at the hardware level remains challenging. Although traditional hardware is advancing, a significant gap persists between AI hardware capabilities and the rigorous demands of AI applications across various domains. Beyond standard operations such as matrix-vector multiplications (MVMs) and transcendental functions, brain-inspired workloads require operations like frequent fetching and updating of membrane potential and modeling neuronal dynamics. These demands present significant challenges for von Neumann architectures and underscore the need to explore CIM approaches to provide an energy-efficient, converged platform for both SNN and ANN-based AI algorithms. The remainder of this article explores potential hardware solutions at various levels, from specialized CIM architectures and digital/analog memory technologies to stochastic hardware and emerging devices.

Compute in/near memory for efficient AI hardware design

The concept of CIM dates back to the 1990s (53, 54). However, with the advancements in AI applications, the limitations of traditional computing architectures have become increasingly apparent. Traditional systems, based on the von Neumann architecture, face the memory wall challenge (55), where the separation between processing and memory units leads to significant data movement overheads. As an alternative, CIM paradigms have been proposed to reduce the higher cost of memory accesses (56–58).

CIM, also known as processing-in-memory (PIM), fundamentally alters the computing paradigm by bringing computations closer to, or inside, synaptic memory where data reside, effectively addressing the memory wall challenge. CIM architectures can perform massively parallel multiply-and-accumulate (MAC) operations on inputs and synaptic weights, the predominant operation in AI models. Additionally, CIM technology shows promise for efficiently executing transcendental functions such as exponential, logarithmic, or trigonometric calculations. It is also particularly advantageous for SNN algorithms by significantly reducing the costs associated with storing and fetching neuronal membrane potentials.

Figure 4A illustrates an example of a CIM accelerator with its spatial architecture (59). This architecture consists of multiple tiles connected via a network on chip (NoC), each containing N CIM cores. A CIM core (Figure 4B) comprises CIM arrays, referred to as matrix-vector multiplication units (MVMUs), along with other functional units and an instruction execution pipeline (Fetch, Decode, Execute, and Memory). Following the fetching of an instruction from the instruction memory, it is decoded and executed in the appropriate functional unit, which could be the scalar functional unit (SFU), vector functional unit (VFU), or the pipelined MVMU. The output is stored in the memory unit (MU), which facilitates communication with other cores. Hence, CIM arrays provide a framework for designing energy-efficient hardware tailored for AI applications.


Figure 4. Design of compute-in-memory (CIM)-based artificial intelligence (AI) accelerators using analog and digital computing schemes. (A) An example of a CIM-based AI accelerator with spatially distributed cores connected through a network on chip (NoC) is depicted. (B) Each CIM core consists of CIM-based matrix vector multiplication units (MVMUs). CIMs can be either digital or analog. (C) An example of a digital CIM MVMU consisting of 6T SRAM cells modified to perform bit-wise multiplications near each memory cell. (D) An example of an analog CIM MVMU, with different choices for memory cells and details of peripherals. (E) Analog-to-digital converters (ADCs) present in the peripherals dominate the computation costs in analog CIM, consuming the highest area and power.

CIM can be integrated into memory through different methods, also known as computing schemes. These schemes are broadly categorized into analog and digital, depending on whether the values computed in the memory array are continuous or discrete, respectively. Additionally, the benefits of CIM vary with the memory technology based on their read–write energy, area, latency, and other specifications. The following subsection describes various types of CIM, particularly focusing on analog and digital CIM computing schemes, and reviews the differences in various memory types. The subsequent subsection details the advancements in CIM designs for different memory technologies across a memory hierarchy.

Types of CIM

Computing schemes

CIM can be broadly categorized into analog (or mixed-signal) and digital (binary) types. Figure 4D depicts an example of analog CIM (60) for performing the matrix-vector multiplication (MVM) operation.

Several methods are employed to perform such in situ MAC operations in an analog manner. The technique shown in Figure 4D involves current-mode computing, which takes advantage of the property of linear addition of currents in circuits, namely, Kirchhoff’s current law. Mathematically, this can be represented as Equation 1:

$I_{BL,i} = \sum_{j=0}^{N} V_j \times G_{i,j}$  (1)

The encoded inputs are supplied to each memory row through the source of the access device, contributing to Vj. The resistance of each cell (or its inverse, the conductance Gij) corresponds to the weights of the neural network and can be single or multi-bit, depending on the bit capacity of the memory cell. The choice of the memory cell in a CIM array could be embedded non-volatile memory (eNVM) devices or any type of static random-access memory (SRAM) cell (Figure 4D). By activating multiple wordlines (WLs) simultaneously in an array, the current from each row accumulates at the bitline (BL) (IBL,i) to produce the MAC output between inputs and weights. The analog current is then converted to the voltage domain via a transimpedance amplifier (TIA) and finally to the digital domain. Another method is charge-based computing (61), which exploits the capacitive properties of memory cells to accumulate charges, thereby obtaining the MAC output.
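A simple behavioral model makes Equation 1 concrete. In the sketch below (our own illustration; the conductance range and read voltage are assumed values, not taken from a specific design), weights are mapped to cell conductances, inputs are encoded as row voltages, and the per-cell currents sum along each bitline exactly as Kirchhoff’s current law dictates:

```python
import numpy as np

def analog_cim_mvm(x: np.ndarray, w: np.ndarray,
                   g_min: float = 1e-6, g_max: float = 1e-4,
                   v_read: float = 0.2) -> np.ndarray:
    """Behavioral model of a current-mode analog CIM MVM (Equation 1).

    x: input vector, normalized to [0, 1]; w: weight matrix in [0, 1], already
    quantized to the cell's bit capacity. Returns one bitline current per column.
    """
    g = g_min + w * (g_max - g_min)   # map weights to conductances G (rows x cols)
    v = x * v_read                    # inputs encoded as row voltages V_j
    return v @ g                      # Kirchhoff: I_BL,i = sum_j V_j * G_{i,j}

rng = np.random.default_rng(2)
i_bl = analog_cim_mvm(rng.random(64), rng.random((64, 16)))
print(i_bl[:4])                       # currents entering the TIA/ADC stage
```

In a real array, these bitline currents would then pass through the TIA and ADC stages described next, which is where most of the cost lies.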

The analog nature of communication is prone to errors; therefore, the MAC output is converted to the digital domain via an analog-to-digital converter (ADC) to facilitate noise-resilient and robust communication between different CIM cores. Similarly, inputs arrive in digital form and are encoded using a digital-to-analog converter (DAC) to convert the bit-streamed inputs into the voltage domain. However, ADCs often introduce significant area, power, and performance overheads in mixed-signal CIM systems, as their operation requires high precision and fast conversion rates to maintain overall system efficiency. The distribution of energy and area for a typical spatial architecture (Figure 4E), such as PUMA (59), reveals that the full-precision ADC occupies more than 80% of the area and consumes more than 50% of the energy in the CIM accelerator. While digital components rank second highest in terms of area and energy, the contribution from the crossbar is minimal. Here, the crossbar refers to the memory array modified for analog CIM.

In contrast, digital CIM (Figure 4C) provides a fully digital alternative to analog and mixed-signal methods. In this architecture, digital CIM macros perform MAC operations using digital arithmetic and logic operations either within or in close proximity to the memory array. In one variant of digital CIM, two wordlines (WLs) are activated, and their corresponding bitwise operations, such as NAND, XOR, and NOR, can be accessed at the periphery of the memory array through modified sense amplifier circuits. Such multiple bitwise operations can be combined to execute arithmetic operations such as multiply and accumulate. This approach, as noted by Wang et al. (62), necessitates storing inputs and weights in a bit-aligned manner within the memory array and incurs a small area overhead due to minor changes in the peripheral circuitry. Figure 4C illustrates another method for a digital CIM core, which involves extensively modifying the memory array to incorporate multipliers and adders adjacent to each memory cell (18). The partial sum outputs are subsequently combined using adder trees and accumulators. The details of this digital CIM macro are further elucidated in the SRAM-based CIM subsection.

When comparing digital and analog CIM approaches, numerous trade-offs become apparent. Beyond the high area, energy, and latency overheads introduced by ADCs, analog CIM is susceptible to functional errors due to various non-idealities, including IR drop within the memory array, device non-ideal characteristics, leakage currents, and others (57, 58, 63, 64). Conversely, a digital computation environment is more predictable and easier to control, leading to more accurate and reliable system behavior. However, digital CIM may not achieve the high energy efficiency of analog CIM. Moreover, the repetitive bitwise additions required to generate the final MAC output in digital CIM can increase the overall compute latency.

Memory technologies

CIM technology can be applied in various ways depending on the type of memory, each offering unique advantages and specific implementations. The different types of memories include SRAM, dynamic random-access memory (DRAM), eNVMs, and flash memory. SRAM is commonly used as cache memory in processors due to its high speed and stability. An SRAM cell typically consists of 6 to 10 transistors and stores data in a binary format. In contrast, a DRAM cell comprises a single transistor and a capacitor to store data, making it cost-effective for high-density and large-capacity main memory in systems. However, it requires periodic refreshes to retain data, which reduces its speed.

While both SRAM and DRAM are volatile memories, non-volatile memories such as hard disk drives (HDDs) or flash memory, found in solid-state drives (SSDs), can retain data when power is lost. Flash memory, devoid of moving parts, is faster and more durable compared to mechanical hard disks. Other alternatives, such as eNVMs—including resistive random-access memory (RRAM), phase-change memories (PCMs), spin–transfer torque magnetic RAM (STT-MRAM), and ferroelectric field-effect transistors (FeFETs)—are often noted for their low power consumption and high-density storage. These eNVMs, such as RRAM and PCM, can store more than 1 bit of data per cell and differentiate the state stored in the cell by switching between a high-resistance state (HRS) and a low-resistance state (LRS), triggered by an electrical stimulus such as a current or voltage pulse. The underlying physics of each technology is unique: STT-MRAM depends on the parallel and anti-parallel alignment of two magnetic layers separated by a thin tunneling insulator layer, PCM utilizes chalcogenide materials to switch between crystalline and amorphous phases, and RRAM relies on the formation and rupture of conductive filaments in the insulator between two electrodes. FeFET (65) relies on the polarization of the ferroelectric layer (e.g., hafnium oxide-based materials), thus storing different threshold voltages (VTH).

Each of these memory technologies presents a unique set of trade-offs with respect to power efficiency, performance, cost, and ease of integration, which should be considered when integrating computing into memory for specific AI applications (Supplementary Table 1).

CIM across memory hierarchy: opportunities and challenges

Every computing system includes a memory subsystem with different memory levels to hide the memory latency during compute. For instance, the memory closest to compute is small and fast (e.g., SRAM), while the farthest memory is large and slow (e.g., DRAM). The last memory level is generally non-volatile, such as flash or emerging non-volatile devices. The aforementioned differences in memory characteristics necessitate examination of the CIM technologies for each memory technology separately. This subsection details the developments in CIM implementations with different memory technologies for performing matrix-vector multiplications.

eNVM-based CIM

Recent advancements in eNVM technologies (66–70) have garnered considerable interest for their application in power-sensitive and data-heavy environments (71). These technologies are characterized by low energy consumption, high storage density, non-volatility, and the capacity for in situ parallel operations. eNVMs are employed in hardware specialized for various AI applications, including convolutional neural networks (59), graph neural networks (GNNs) (72), SNNs (8, 73), and transformers (74, 75). These studies reveal key insights regarding CIM-based application-specific fabrics: (a) in some cases, the intrinsic stochasticity of eNVM can be advantageous, as demonstrated in specific applications (76, 77); (b) although the analog crossbar architecture offers attractive benefits, notably its fast MVM capability, the associated challenges with full-precision analog CIM might impede its suitability for supporting large-scale applications in the future; and (c) carefully deciding the circuit and device parameters for the CIM array may help in alleviating some of these obstacles while designing energy-efficient edge computing hardware. The details of stochastic hardware are discussed in a later section, whereas the implications of (b) and (c) are discussed further here.

The strongest form of CIM with eNVM is the full-precision analog variant that performs an N-wide vector–vector dot product in each of the M active columns of cells during each read cycle. It offers the greatest potential advantages for compute efficiency and bandwidth but suffers from practical constraints on density and accuracy, as shown in the column view in Figure 5A. Practical considerations limit the feasibility of this full-precision analog approach in nano-scale nonvolatile array macros. For example, implementing precise analog inputs requires the cells and array metallization to support voltage biasing of individual storage elements. Typical 1T1R arrays natively support binary inputs to individual cells, using the WLs to toggle high-gain selector devices (78). In addition, the full-precision variant demands precise peripheral circuits, including TIA, ADC, and DAC, to maintain the accuracy of data bits.


Figure 5. Design trade-offs in variants of embedded non-volatile memory (eNVM)-based analog compute-in-memory (CIM). (A) Full-precision analog CIM paradigm with eNVM, where full-precision inputs are reproduced using wordline (WL) digital-to-analog converters (DACs), while full-precision weights are stored in memory cells in the conductance domain. Each column represents a full 4-wide vector/vector dot product, resulting in a 1 × 4 by 4 × 2 matrix product with a 1 × 2 vector result. (B) Quantized analog CIM paradigm with eNVM: single-bit inputs are driven by buffers in place of DACs, and single-bit cell states are programmed into the memory columns. In the 2b case, each column now stores a single bit of data per cell. After two cycles, the result is a 1 × 4 by 4 × 1 product at INT2 input/weight precision. In both scenarios, (A, B), a mapping between numerical values and voltage/conductance domain values must be chosen by designers. (C) Trade-off between compute efficiency [throughput per watt (TOPS/W, scaled to 1b/1b)] vs. estimated storage density (Mb/mm²) of reported CIM macros using resistive random-access memory (RRAM), showing how the analog CIM variants differ from each other.

To overcome the challenges of implementing full-precision analog CIM, three knobs can be tuned: reducing the input precision to $P_x$ bits, reducing the weight precision to $P_w$ bits, and reducing the kernel width to $P_{WL}$ wordlines per read. This “quantized” variant is illustrated in Figure 5B. Note that completing a full vector–vector dot product with arbitrary input and weight precisions $B_x$ and $B_w$ and width $N$ thus requires external shift-adding circuitry (Figure 4D) and $C$ read operations (Equation 2):

$C = \left\lceil \frac{N}{P_{WL}} \right\rceil \times \left\lceil \frac{B_w}{P_w} \right\rceil \times \left\lceil \frac{B_x}{P_x} \right\rceil$  (2)

In the most aggressively quantized mode, WLs and cell values are binary ($P_w = 1$ and $P_x = 1$). The read-out bitline current thus becomes Equation 3:

$I_{BL,i} = \sum_{j=1}^{N} 1\big|_{X_j = W_{ij} = 1}$  (3)
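The following sketch ties Equations 2 and 3 together behaviorally (a hypothetical model with assumed dimensions, not a description of any published macro): inputs and weights are decomposed into binary bit-planes, each read cycle activates $P_{WL}$ wordlines of one weight bit-plane against one input bit-plane, and the partial sums are shift-added externally. The loop’s cycle count reproduces Equation 2:

```python
import numpy as np
from math import ceil

def quantized_cim_mvm(x_bits: np.ndarray, w_bits: np.ndarray, p_wl: int):
    """Bit-sliced MVM on a binary crossbar (Equations 2 and 3).

    x_bits: (B_x, N) binary input bit-planes, MSB first.
    w_bits: (B_w, N, M) binary weight bit-planes, MSB first.
    p_wl: number of wordlines activated per read cycle.
    """
    b_x, n = x_bits.shape
    b_w, _, m = w_bits.shape
    acc = np.zeros(m, dtype=np.int64)
    cycles = 0
    for ix in range(b_x):                        # input bit-planes
        for iw in range(b_w):                    # weight bit-planes
            col_sum = np.zeros(m, dtype=np.int64)
            for start in range(0, n, p_wl):      # P_WL wordlines per read
                rows = slice(start, start + p_wl)
                # Equation 3: every cell with X_j = W_ij = 1 adds one unit of current.
                col_sum += x_bits[ix, rows] @ w_bits[iw, rows, :]
                cycles += 1
            # External shift-add restores the bit significance of this plane pair.
            acc += col_sum << ((b_w - 1 - iw) + (b_x - 1 - ix))
    return acc, cycles

rng = np.random.default_rng(3)
x_bits = (rng.random((4, 32)) < 0.5).astype(np.int64)      # B_x = 4, N = 32
w_bits = (rng.random((2, 32, 8)) < 0.5).astype(np.int64)   # B_w = 2, M = 8
acc, cycles = quantized_cim_mvm(x_bits, w_bits, p_wl=8)
print(cycles == ceil(32 / 8) * ceil(2 / 1) * ceil(4 / 1))  # True, matching Equation 2
```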

Several CIM macros using this family of partially “quantized” approaches with RRAM have been demonstrated in scaled nodes (79–89). Designers typically implement current-based readout schemes, as in traditional NVM designs, which operate by clamping the cell bias voltage, typically just the BL voltage, to an approximate target value and measuring the resulting current. Some outlier works avoid current-sensing and measure the BL voltage itself. Yoon et al. (83) used a voltage-based scheme with feedback, while Hung et al. (87) allowed the BL to slew downward through the selected memory cells without a pull-up transistor, reducing power at the cost of a carefully designed readout system.

Another approach to tackle the associated challenges with analog CIM includes co-designing crossbars to achieve the highest energy efficiency with minimal loss in accuracy. Chakraborty et al. (57, 58) and Kim et al. (90) showed how the analog nature of computing results in several non-idealities, such as non-linearity from access transistors, eNVM device non-linear characteristics, and parasitic resistance. They proposed a data-dependent approach and an analytical modeling approach, respectively, to capture such non-idealities, which lead to functional errors in the MVM computation output.

Sharma et al. (91) highlighted the impact of co-designing device-circuit parameters for reducing functional and quantization errors in crossbar arrays consisting of spin–orbit torque magnetic tunnel junctions (SOT-MTJs). SOT-MTJs have high endurance, similar to their variant STT-MRAM, but offer a higher Roff/Ron ratio due to their decoupled read and write paths. Their quantitative analysis shows that smaller crossbar sizes and an optimal value of Ron could lead to the least accuracy degradation in crossbars, depending on the input activation sparsity. In addition, higher sparsity can reduce energy consumption by leveraging a lower precision ADC without affecting accuracy. More approaches to reducing the ADC bottleneck are discussed in the next section.
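As a toy illustration of why the Roff/Ron ratio matters (our own simplified model, which ignores IR drop, access-device non-linearity, and the other non-idealities discussed above), the sketch below compares an ideal binary-weight MVM against one computed through cells whose off state still conducts:

```python
import numpy as np

def mvm_with_finite_off_state(x: np.ndarray, w_bin: np.ndarray,
                              r_on: float = 1e4, r_off_ratio: float = 10.0):
    """Binary-weight MVM where the off state leaks: G_off = G_on / r_off_ratio."""
    g_on = 1.0 / r_on
    g = np.where(w_bin == 1, g_on, g_on / r_off_ratio)  # finite Roff/Ron ratio
    return x @ g                                        # bitline currents

rng = np.random.default_rng(4)
x = (rng.random(128) < 0.5).astype(float)               # denser inputs leak more
w = (rng.random((128, 8)) < 0.5).astype(int)
ideal = x @ np.where(w == 1, 1e-4, 0.0)                 # perfectly insulating off state
for ratio in (10, 100, 1000):
    leaky = mvm_with_finite_off_state(x, w, r_off_ratio=ratio)
    rel_err = np.abs(leaky - ideal).max() / ideal.max()
    print(f"Roff/Ron = {ratio:>4}: worst-case relative error = {rel_err:.3f}")
```

The error grows with both the off-state leakage and the number of active rows, which is why input sparsity and array size enter the co-design analysis.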

Beyond macro-level works, several system-level CIM with RRAM designs have been demonstrated. Chang et al. (92) combined 288-unit CIM macros, each with eight channels and activating up to nine binary WLs in parallel, for a total of 2.25 MB of NVM storage on-die along with 768 KB of traditional SRAM. Power switching at each macro allows power to scale an order of magnitude as macros are enabled and disabled during intermittent computing. The power gating idea is extended by Chang et al. (93) and Lele et al. (94), who described two-tier power gating using 32 clusters of five macros each for a total of 1.25 MB of RRAM along with 1.25 MB of traditional SRAM. This second work also includes a lighter-weight SNN-based component utilized to trigger gated RRAM macros to wake up for heavier-weight CNN computing. Huang et al. (89) reported a relatively large die area (30.6 mm2) and a large 4 MB of on-die nonvolatile memory to support slightly larger networks. They focused on a configurable macro with the ability to implement both in-memory and near-memory computing and achieved a very high efficiency of 76.5 TOPS/W at INT8 precision at the macro level.

However, most of the above efforts have focused on specialized CIM designs tailored to specific workloads, which may diverge from the broader trajectory of hybrid SNN and ANN hardware development. To address the variety of operations used in such workloads, (i) Singh et al. (95) demonstrated architecture support for hybrid ANN and SNN workloads, (ii) Chen et al. (96) proposed a multifunctional CIM macro that uses the same memory array with combined peripherals to support logic-in-memory, content addressable memory (97), and dot-product computations, (iii) Wan et al. (98) presented a taped-out RRAM chip for multiple neural network architectures, and (iv) Shin et al. (99) demonstrated a taped-out RRAM chip for reconfigurable RNN-CNN workloads.

Nevertheless, CIM with NVM is not a fully mature design concept, and the possibility of MAC-level accuracy degradation poses a significant risk to the future of the paradigm. Beyond accuracy issues, a major risk of CIM with NVM is a reduction in data density. Compared with traditional memory macros with binary read-out channels, CIM designs require more complex peripheral designs, such as medium-precision ADCs (100), complex sensing circuits (101), and isolation circuitry for high-voltage domains (88), which diminish array area efficiency. Furthermore, the top-level NVM data densities of these CIM systems, ranging from 0.49 to 1.31 Mbit/mm², compare poorly with those of digital systems using NVM, reported at 1.92 and 2.80 Mbit/mm² (102, 103). Figure 5C depicts an efficiency versus data density frontier observed empirically in published works, mostly at the macro level.

SRAM-based CIM

Initial CIM efforts focused on modifying SRAM 6-transistor (6T) arrays, a primary type of memory storage, to act as classifiers or MLPs for tasks such as digit recognition (104, 105). While such efforts focused on approximate analog CIM inference accelerator designs, ADCs were incorporated later to maintain result accuracy, leading to more complex designs with higher data precision (106). Notably, SRAM 6T cells have coupled read–write paths, which can cause bit flips through a short circuit when all WLs are enabled in an array. To address these read-disturb issues, Kang et al. (104) employed staggered activation to avoid short circuits and asymmetric sense amplifiers to detect bitwise NAND, NOR, and other logic operations. Recently, Chih et al. (18) proposed a full-precision digital CIM macro using SRAM 6T, where the multipliers are connected directly to the BLs of each cell (Figure 6A). This approach overcomes the analog CIM overheads without any loss in computation accuracy. The entire array is divided into multiple sub-CIM units via the adder tree (Figure 6B). Each sub-CIM unit activates all input activations simultaneously (represented by a blue line), and the partial sum of 4-bit MAC from each row is sent to the adder tree. For the subsequent stream of inputs, the partial sums are combined using an accumulator (Figure 6C), which bit-shifts the last computed output and adds the current partial sum from the adder tree. Follow-up work (107) further increased the TOPS/W by including pipelining, bit-width flexibility, and simultaneous weight updating in the digital CIM macro.
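The digital macro’s dataflow can be sketched behaviorally in a few lines (an illustrative model of the scheme in Figure 6, with dimensions we chose for the example): each 1-bit input line gates the resident weight bits at the cells, an adder tree reduces the partial products, and an accumulator shift-adds the result as the input bits stream in serially:

```python
import numpy as np

def digital_cim_mac(x_vals: np.ndarray, w_vals: np.ndarray, bits: int = 4) -> int:
    """Behavioral model of a bit-serial digital CIM MAC (cf. Figure 6).

    x_vals, w_vals: unsigned integer vectors (< 2**bits). Weights stay resident
    in the SRAM cells; inputs are streamed one bit per cycle, MSB first.
    """
    acc = 0
    for b in range(bits - 1, -1, -1):
        in_bits = (x_vals >> b) & 1               # 1-bit input broadcast to all rows
        partial = int(np.sum(in_bits * w_vals))   # per-cell gating + adder tree
        acc = (acc << 1) + partial                # accumulator: shift, then add
    return acc

rng = np.random.default_rng(5)
x = rng.integers(0, 16, size=16)                  # 4-bit inputs
w = rng.integers(0, 16, size=16)                  # 4-bit weights
assert digital_cim_mac(x, w) == int(np.dot(x, w)) # bit-exact, unlike analog CIM
print(digital_cim_mac(x, w))
```

Because every step is digital, the result is bit-exact; the price is the extra adder-tree logic and the serialized input cycles noted earlier.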


Figure 6. Circuit details of a digital compute-in-memory (CIM) macro example with static random-access memory (SRAM)-6T cells. (A) Sub-unit in an SRAM-6T array where each cell, storing weights, has a multiplier (NOR gate) associated with it. (B) The partial sum of 4-bit inputs and 4-bit weights goes to the adder tree. (C) Eventually, multiple partial sums are accumulated in the accumulator. Thus, each sub-unit in the array performs a 4-bit × 4-bit multiply-and-accumulate (MAC) computation. Re-used from (18) with permission from IEEE.

As SRAM 6T-based CIM designs have evolved, SRAM-based CIM classifiers have expanded to include other types of SRAM cells, thereby eliminating the read-disturb issues associated with SRAM 6T. This expansion is evident in numerous simulation works and silicon prototypes consisting of 8T (60, 62, 108), 10T (109), and 12T (110) cells. Although larger SRAM cells compromise on storage density, they provide greater read-disturb stability. For instance, SRAM 8T has decoupled paths for reading and writing to the cell. In another effort, a charge-domain computing approach was shown to be more efficient for such analog CIM macro designs compared to the earlier current-based computing, which necessitates bulky TIAs for current-to-voltage domain conversion (61, 111). Additionally, Wang et al. (62) employed a digital CIM approach by activating two rows at a time in an SRAM-8T-based memory array to calculate logic operations such as NOR and AND at the end of the BL.

Subsequently, SRAM CIM inference accelerators or macros have expanded to processor-level design (18), programmable chip designs (112), and floating-point accelerators (113). To balance the accuracy of fully digital CIM and the energy efficiency of analog designs, researchers have proposed analog prototypes with reconfigurable ADC precision (114). These reconfigurable ADC designs employ high ADC precision for scenarios with lower data sparsity and lower ADC precision for high data sparsity. Thus, the reconfigurable ADC design can achieve optimal accuracy and energy efficiency, varying according to the signal-to-noise ratio and data sparsity.

Nevertheless, ADCs alone consume as much as 60% of the energy and occupy nearly 80% of the area in analog CIM accelerators (115). Due to the significant area requirements of ADCs, they are shared across multiple columns in the crossbar, which limits the throughput of the crossbar arrays. Researchers have explored various strategies to mitigate the limitations imposed by ADCs in CIM accelerators (28, 116–119). These studies have primarily focused on reducing the precision of ADCs through partial sum quantization (PSQ) to lower bit levels, thereby saving power and area. Despite these adjustments to lower-precision ADCs, the challenge of inefficiency persists. Building on the PSQ technique, recent research has demonstrated that partial sums can be quantized to just 1 (binary) or 1.5 bits (ternary) (28, 120). This method eliminates the need for ADCs by adopting learned-step quantization (121), utilizing trainable floating-point scale factors. The binary or ternary value (p) at the columns of crossbars is multiplied by a scale factor (s) to bring the quantized values to a similar range as the actual floating-point data. The binary and ternary quantization values for p are delineated in Equation 4, where α represents a trainable threshold.

$p_b = \begin{cases} 1 & \text{if } p_s \geq 0 \\ -1 & \text{if } p_s < 0 \end{cases} \qquad p_t = \begin{cases} 1 & \text{if } p_s \geq \alpha \\ 0 & \text{if } -\alpha < p_s < \alpha \\ -1 & \text{if } p_s \leq -\alpha \end{cases}$  (4)

The partial sum is shifted and accumulated across all the input bit streams to obtain the final partial sum value (PS) at a crossbar column. Experiments underline the importance of using scaling factors in PSQ and reveal that binary and ternary PSQ with scaling factors achieve higher accuracy than 2-bit quantized partial sums. This partial sum quantization approach has also been applied to SNNs when deployed on analog CIM accelerators (122). In this context, it has demonstrated up to 72× lower inference energy and 10× lower latency compared to deployments on the NVIDIA Jetson TX2 board.
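A minimal rendering of Equation 4 (an illustrative sketch; in practice the scale factor and threshold are learned during training, whereas here they are simply assumed values) shows how a column’s raw partial sum collapses to a binary or ternary code and is rescaled before the shift-and-accumulate step:

```python
import numpy as np

def quantize_partial_sum(p_s, scale, alpha=None):
    """Equation 4: binary (alpha=None) or ternary partial-sum quantization.

    p_s: raw partial sums at the crossbar columns.
    scale: trainable floating-point scale factor (assumed given here).
    alpha: trainable ternary threshold; None selects the binary variant p_b.
    """
    if alpha is None:
        p = np.where(p_s >= 0, 1, -1)                                  # p_b
    else:
        p = np.where(p_s >= alpha, 1, np.where(p_s <= -alpha, -1, 0))  # p_t
    return p * scale               # rescale toward the floating-point range

p_s = np.array([-3.2, -0.4, 0.1, 2.7])
print(quantize_partial_sum(p_s, scale=2.5))              # binary PSQ
print(quantize_partial_sum(p_s, scale=2.5, alpha=1.0))   # ternary PSQ
```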

DRAM-based CIM

In most ML applications, the deep neural network (DNN) models and the data they process are too large to fit on-chip, which implies that the majority of the model and data reside in DRAM. The movement of data to and from DRAM consumes significant time and energy in current AI hardware, including GPUs and specialized ML accelerators such as tensor processing units (TPUs) or neural processing units (NPUs). This makes processing inside or near the DRAM a promising approach to improving the efficiency of future AI hardware. There are different flavors of DRAM-based CIM; they differ based on where and how processing logic is integrated into the DRAM subsystem (Figure 7A).


Figure 7. Example demonstrating in/near memory computing for dynamic random-access memory (DRAM). (A) A variety of in/near memory computing is available for DRAM based on where or how processing logic is integrated. (i) One method involves processing near the DRAM subarray in the DRAM banks by feeding data from the subarrays to the logic units. (ii) An alternative approach does the computation in the subarray by activating two or multiple wordlines. (iii) In 3D architectures, processing is done with the help of computation units in the logic die. (B) The majority of DRAM cell operations involve the simultaneous activation of multiple wordlines and the subsequent activation of the sense amplifiers after charge sharing. (C) This figure shows the steps of a single bit in an n-bit bit-serial addition operation (128). The first two steps copy the operands to the compute rows, followed by a triple-row and a quintuple-row activation to obtain the carry-out and sum bits.

i. Computation is performed near DRAM subarrays [Figure 7A(i)]. Data are read from the subarrays and fed to logic (computation units) near the subarrays for processing. Although this logic is slower because it is realized in a DRAM process, overall processing efficiency improves because data travel shorter distances and the higher internal bandwidth available within the DRAM is utilized.

ii. Computation is performed in DRAM subarrays by activating two or more wordlines, as shown in Figure 7A(ii). The results are obtained at the columns of the subarray with the help of the local sense amplifiers. Not all functions may be suitable for realization within subarrays, in which case a combination of in-subarray and near-subarray computing may be used.

iii. Compute units are placed on logic dies that are integrated with 3D-stacked DRAM dies, as shown in the high-bandwidth memory architecture of Figure 7A(iii). The through-silicon via (TSV) connections provide high-bandwidth data transfer between the compute units and the DRAM.

Each approach has its benefits, and recent efforts have explored all three approaches for accelerating AI workloads.

In-subarray computing utilizes the maximum internal DRAM bandwidth, as computation occurs at the local sense amplifiers; ideally, this should lead to higher performance than the near-subarray and 3D DRAM variants. Initial efforts (123–127) focused on evaluating Boolean computations at local sense amplifiers by activating multiple rows of the DRAM subarray. Ambit (123) specifically accelerates workloads through bulk bitwise Boolean operations (AND, OR, and NOT). The DRAM-based Reconfigurable In-Situ Accelerator (DRISA) (124) features a reconfigurable in-situ accelerator designed for performing Boolean operations, while the DRAM-based Accelerator for Accurate CNN Inference (DrAcc) (125) implements ternary neural networks through in-DRAM bit operations. Efficient and Low Power Processing-in-Memory (ELP2IM) (126) presents techniques for the low-power realization of bitwise operations in DRAM, and Single Instruction Multiple Data DRAM (SIMDRAM) (127) introduces an end-to-end, flexible, general-purpose framework for bit-serial single-instruction, multiple-data (SIMD) computing in DRAM.

The novel approach to bit-serial addition within DRAM subarrays introduced by Ali et al. (128) serves as an illustrative example of in-subarray computing. This approach requires data within DRAM to be organized in a transposed format, aligning all bits of the two multi-bit operands in the same column. This arrangement facilitates parallel n-bit addition operations across all columns within the subarray, as well as across subarrays. In each column, an addition operation is realized bit-serially, with the full adder function decomposed into majority operations, realized by activating multiple wordlines (Figure 7B). Five DRAM cells store data (A, B, C, D, and E) in their initial state, and the corresponding five wordlines are simultaneously activated to allow charge sharing. Following this, the sense amplifiers activate to evaluate the majority. Both the sum and carry outputs for bit-serial addition are evaluated through majority operations. Due to the destructive nature of these operations, data are first copied into dedicated compute rows for processing (Figure 7C). An arbitrary bit-width addition requires nine additional compute rows alongside the data rows, incurring less than 1% area overhead. Figure 7C outlines all steps involved in single-bit addition, starting with copying operand bits to compute rows A and B to preserve the original data. The carry-out bit (Cout) is evaluated through a triple-row activation of A, B, and the carry-in bit (Cin). The sum bit then results from a quintuple-row activation involving A, B, Cin, and two copies of the complement of Cout. Each of the four steps in a single-bit addition operation requires one AAP (activate-activate-precharge) operation. For an n-bit bit-serial addition operation, 4n+1 AAP operations are necessary. A system-level evaluation of this approach demonstrated up to an 11.5-fold performance improvement compared with conventional von Neumann machines when running the k-nearest neighbor (kNN) algorithm on the MNIST handwritten digit classification dataset.
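The decomposition into majority operations can be checked with a short behavioral model. The sketch below emulates the triple-row and quintuple-row activations as majority votes; the LSB-first bit ordering and function names are illustrative assumptions, and the model ignores the row-copy and AAP timing details.

```python
def maj(*bits):
    # What a multi-row activation followed by sense amplification
    # evaluates: the majority value of the activated cells.
    return int(sum(bits) > len(bits) // 2)

def bit_serial_add(a_bits, b_bits):
    # n-bit bit-serial addition via majority operations, in the style of
    # the in-subarray scheme of Ali et al. (128). Operands are LSB first,
    # mirroring the transposed (column-aligned) data layout.
    cin, result = 0, []
    for a, b in zip(a_bits, b_bits):
        cout = maj(a, b, cin)                   # triple-row activation
        s = maj(a, b, cin, 1 - cout, 1 - cout)  # quintuple-row activation
        result.append(s)
        cin = cout
    result.append(cin)                          # final carry-out
    return result

# Example: 6 + 3 -> bit_serial_add([0, 1, 1], [1, 1, 0]) returns
# [1, 0, 0, 1], i.e., 9 in LSB-first binary.
```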

Building upon this approach, Roy et al. (129) proposed an accelerator for ML inference workloads that combines in-subarray computing with near-subarray computing. A novel multiplication primitive was introduced without significant changes to the DRAM subarray. Additionally, an architecture comprising reconfigurable adder trees and special function units was proposed to fully realize DNNs within DRAM itself. A mapping scheme was also introduced alongside the architecture to maximize utilization and, consequently, performance.

Near-subarray computing has been explored in several efforts (127, 130, 131). Newton, proposed by He et al. (130), is an in-DRAM accelerator architecture that incorporates multiply-and-accumulate units and buffers within the DRAM. RecNMP, by Ke et al. (131), accelerates personalized recommendation systems through near-memory processing.

3D DRAM computing has been explored in various efforts (132–134). Notably, Neurocube, by Kim et al. (133), introduced a programmable, digitally scalable accelerator built upon 3D high-density memory for efficient neural network acceleration, and Gao et al. (134) proposed a scalable neural network accelerator using 3D memory.

It is expected that some of the above compute-in-memory techniques will make their way into commercial DRAM products. For example, Samsung has recently incorporated compute-in-memory into its high-bandwidth memory (HBM) products (135), enabling parts of AI workloads to be accelerated in DRAM.

Flash-based CIM

Similar to other memory technologies, flash memory, a type of non-volatile storage, can be utilized to perform various computational operations, such as bitwise operations (AND, OR, and XOR), basic arithmetic operations, and more complex tasks, including searching and pattern matching. These operations are typically executed using the charge storage properties of flash cells. Predominantly available in two types, NAND and NOR, flash memory is structured around floating-gate transistors. NAND flash, characterized by its high density, is commonly used for data storage, while NOR flash, known for its faster read capabilities, is often employed in code storage and execution.

Xiang et al. (136) proposed an efficient, mixed-signal, NOR flash memory-based CIM design for SNNs. Since SNNs compute with spikes, which can be represented in binary format, such a design can eliminate ADCs/DACs, conserving energy and area. Similarly, Choi et al. (137) utilized two single-level NAND floating-gate cells to store complementary data for a binary neural network algorithm. In S-FLASH, Kang et al. (138) employed a fully mixed-signal CIM design and optimized energy by modifying the bit width of partial multiplications and exploiting the many zero partial-multiplication results. Additionally, works such as ParaBit by Gao et al. (139) and 3D-FPIM by Lee et al. (140) have performed end-to-end evaluations of NAND flash-based CIM architectures, demonstrating their potential to achieve higher performance and energy efficiency than RRAM-based accelerators.

Accelerating functions beyond matrix multiplication

In addition to matrix multiplication, memory can be repurposed to accelerate other operations in AI applications. The various neural network models discussed above (e.g., RNNs, SNNs, and transformers) often require repetitive evaluation of transcendental functions, such as tanh, exponential, and softmax. Evaluating these functions through a Maclaurin series expansion is impractical, as many terms are needed for acceptable accuracy. Consequently, previous studies have employed range-reduction techniques and mathematical tables (141). For better energy efficiency, it is preferable to store these tables on-chip rather than in off-chip DRAM. However, integrating these tables as on-chip read-only memory (ROM) requires additional silicon area, increasing cost. Lee et al. (142) proposed R-MRAM, a ROM-embedded STT-MRAM bit-cell that includes an additional bitline for ROM operation. This design combines ROM and RAM functionalities within the same layout, allowing both types of memory to be utilized simultaneously without increasing the array size; the memory configuration might contain, for example, X MB of RAM and an equal amount of ROM. The circuit features an extra bitline compared to standard STT-MRAM, as shown in Figure 8. ROM data are stored by selectively connecting the magnetic tunnel junction (MTJ) to one of the two Vread bitlines. Note that a typical STT-MRAM bit-cell layout has enough routing space for this extra bitline, since it requires a large access transistor to satisfy the bidirectional switching-current requirements of the MTJ. This additional bitline enables standard RAM read/write operations as well as ROM read operations in R-MRAM. In RAM mode, both bitlines are activated simultaneously with the proper wordline, while for ROM retrieval, only one of the bitlines is activated. Hence, during ROM retrieval, the value read from a particular cell is either the resistance of the MTJ (in its on or off state) or an open circuit if the cell is disconnected from the activated bitline. The requirements for the peripheral sensing circuitry thus differ slightly from those in standard RAM. Notably, such a ROM configuration could also be applied to other non-volatile memory technologies, such as PCM and RRAM, and to volatile memories, such as complementary metal-oxide semiconductor (CMOS)-based SRAM.
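For intuition, the sketch below shows the style of evaluation such embedded tables accelerate: range reduction of exp(x) into x = k·ln 2 + r, a lookup of exp(r) in a precomputed table (the data that would reside in ROM), and reconstruction by a shift. The 256-entry table and the floating-point indexing are illustrative assumptions; a hardware realization would use fixed-point indices into the ROM array.

```python
import numpy as np

# Precomputed table of exp(r) for r in [0, ln 2); in hardware this
# would be the mathematical table stored in the embedded ROM.
TABLE = np.exp(np.linspace(0.0, np.log(2.0), 256))

def exp_lut(x):
    # Range reduction: x = k*ln2 + r, with r in [0, ln 2)
    k = int(np.floor(x / np.log(2.0)))
    r = x - k * np.log(2.0)
    idx = min(int(r / np.log(2.0) * 256), 255)  # ROM table index
    return TABLE[idx] * (2.0 ** k)              # reconstruct exp(x)
```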


Figure 8. Depiction of read-only memory (ROM) embedded random-access memory (RAM) with spin–orbit transfer magnetic tunnel junction (SOT-MTJ) cells. The connection of the MTJ cell to the Vread bitline decides the value of the data stored in ROM. (A) RAM functionality is achieved by enabling both bitlines, and (B) ROM functionality by enabling only one bitline, in each case together with the proper wordline (shown horizontally).

For implementing ROM in SRAM technology, Lee and Roy (143) proposed using two wordlines per 6-T cell, with the ROM data determined by whether the left or the right access transistor is connected to wordline1 or wordline2. Note that during RAM mode, both wordlines are activated for read operations; for writing, however, only one of the wordlines is activated, effectively making the cell a 5-T cell during writes. Since the cells are volatile, during ROM data retrieval the RAM data are first stored in a buffer and written back to the cell after the ROM read.

Dutta et al. (144) suggested a novel modification to the standard 8T SRAM cell that enables it to store ROM data without affecting its normal functionality or requiring extra silicon area. Building upon this, they demonstrated the use of this modification in an 8T-SRAM-based MVM unit. Storing “1” and “0” in the MVM unit requires an additional terminal, RCON. For cells storing a value of “1”, the source line (SL) is connected to RCON, and for cells storing “0”, SL is connected to ground (GND). When RCON is connected to GND, the SL of all cells connects to GND, making the array function like a conventional SRAM array. However, if RCON is connected to the supply voltage (VDD) and all cells in a row store “1”, specific read bitlines (RBLs) are prevented from discharging, thereby activating the ROM mode. For ROM operation, the rows of the 8T SRAM array must store “1”, which erases any pre-existing data; the data from the SRAM array should therefore first be transferred to a temporary buffer and restored after the ROM data are read. Consequently, using an 8T SRAM for dual functionality (normal SRAM and ROM-embedded RAM) requires four cycles for a ROM data read: (i) copying the SRAM row data to a buffer, (ii) writing “1” to all cells in the row, (iii) reading the ROM data, and (iv) writing the data back from the buffer to the SRAM. This 8T SRAM-based design has also been evaluated at the system level to accelerate transformer models by Kim et al. (145), demonstrating up to a 10× improvement in throughput compared to the NVIDIA A40 GPU.
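The four-cycle ROM read can be captured in a small behavioral model. In the sketch below, the ROM pattern is fixed at design time (whether each cell’s SL is tied to RCON or GND), and a ROM read saves the RAM contents, writes all-ones, senses the connection pattern, and restores the data; the class and method names are illustrative assumptions.

```python
class RomEmbedded8TSRAM:
    def __init__(self, rom_pattern):
        # rom_pattern[r][c] = 1 if the cell's SL is tied to RCON,
        # 0 if tied to GND -- fixed at design time.
        self.rom = [list(row) for row in rom_pattern]
        self.ram = [[0] * len(row) for row in rom_pattern]

    def read_rom_row(self, r):
        buf = list(self.ram[r])          # (i) copy the row to a buffer
        self.ram[r] = [1] * len(buf)     # (ii) write "1" to all cells
        # (iii) with RCON at VDD, an RBL stays high (reads "1") only
        # where SL is tied to RCON; grounded cells discharge the RBL.
        rom_bits = list(self.rom[r])
        self.ram[r] = buf                # (iv) restore the RAM data
        return rom_bits
```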

Neuromorphic computing: designing brain-inspired AI hardware

As observed in the exemplary application of autonomous drone navigation, the network architecture plays a critical role in determining the parameters, accuracy, energy consumption, and latency of the system. In fact, the optimal architecture may be a hybrid of SNNs, ANNs, and LSTMs. Although some of the previously discussed ANN hardware can support SNNs, achieving optimal performance requires the development of specialized hardware for neuromorphic computing (or SNNs).

Neuromorphic computing draws inspiration from the brain’s ability to unify computation and memory storage within the same physical substrate, its highly dense and recurrent connectivity, and its sparse, spike-based computation and communication, all of which make the brain a remarkably efficient system. In recent years, SNNs, inspired by biological neurons, have emerged as promising candidates for processing sequential tasks with asynchronous and sparse event streams (36). Central to SNNs is a neuron model characterized by an internal state known as the membrane potential (8). The leaky integrate-and-fire (LIF) neuron is the most commonly used model: its membrane potential accumulates input and decays at a rate governed by the leak at each timestep, emitting a spike when the potential exceeds a threshold. This process enables SNNs to learn input timing information without any explicit temporal encoding, making them a special, less complex case of RNNs.
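The dynamics just described can be summarized in a few lines. The following is a minimal sketch of a discrete-time LIF neuron; the leak factor, threshold, and hard reset are illustrative assumptions (other variants, such as RMP neurons, subtract the threshold instead of resetting).

```python
def lif_neuron(inputs, leak=0.9, threshold=1.0):
    # Simulate one LIF neuron over a sequence of per-timestep inputs.
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x          # leaky integration of input current
        if v >= threshold:        # spike when the potential crosses
            spikes.append(1)
            v = 0.0               # hard reset of the membrane potential
        else:
            spikes.append(0)
    return spikes
```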

Earlier, SNNs were used for static tasks, such as image classification, where they lagged behind ANNs in accuracy. However, recent research has shown that SNNs excel in dynamic tasks, such as gesture recognition and optical flow estimation, surpassing ANNs and RNNs in accuracy (47). Their spike-based computation, focusing on accumulation operations, consumes less energy than the multiplications required by ANNs. However, the additional membrane potential parameter and temporal dimension introduce unique requirements and challenges that differentiate SNNs from standard ANNs, as summarized below:

i. Temporal processing requirements: SNNs operate in discrete timesteps, encoding information in the timing of spikes and processing weights and membrane potential data structures across several timesteps. Consequently, they require precise hardware to manage complex temporal processes, such as accurate spike timing and event-driven communication, while minimizing memory access.

ii. Neuron activation functions: Unlike ANNs, which employ continuous activation functions, such as sigmoid or rectified linear unit (ReLU), SNNs utilize complex activation functions that account for the temporal dynamics of spikes. This demands specialized hardware tailored to their unique time and frequency dependencies.

iii. High temporal and spatial sparsity: SNNs demonstrate inherent sparsity in time, with spikes occurring intermittently, and in space, with only a few neurons active at any given time. Accelerating SNNs necessitates optimizing computations for this dual sparsity, which can pose challenges for traditional hardware platforms.

Efficient SNN acceleration requires a comprehensive understanding of these challenges and characteristics. To that end, several hardware solutions have been proposed, tailored to the distinct requirements of these biologically plausible neural networks. Asynchronous event-driven architectures have emerged as a transformative approach in this domain. IBM’s TrueNorth (146), with a million neurons and 256 million synapses, exemplifies low-power, real-time processing. Its asynchronous communication with synchronous neurosynaptic cores, connected through a distributed 2D mesh architecture, emphasizes parallelism and efficiency in handling spiking events. Similarly, Intel’s Loihi (147) features a network of asynchronous neuromorphic cores interconnected through an asynchronous mesh network. Its second iteration enhances scalability and energy efficiency, further supporting advanced SNN research and applications (148). Several other designs have also explored asynchronous event-driven approaches for large-scale brain simulations (149–151) and ultra-low-power, resource-constrained edge applications (152–155).

As previously discussed in the CIM section, in-memory computing reduces weight and activation data movement in ANNs. In the case of SNNs, there is additional data movement resulting from the temporal dimension and the membrane potential data structure. In light of this, Agrawal et al. (156) proposed an SRAM-based digital in-memory computing macro designed specifically for SNNs. This macro integrates essential SNN inference operations, such as accumulation, thresholding, spike-check, and reset, within a fused weight and membrane potential memory. The proposed macro employs a staggered layout to support 6-bit weight to 11-bit Vmem additions using reconfigurable column peripherals. It also supports multiple neuron models, including IF, LIF, and residual membrane potential (RMP) neurons, using in-memory operations. Such a macro can serve as a computational unit for a large-scale hierarchical mesh architecture (59). While early SNN research primarily emphasized biologically inspired neuron models and circuit primitives, recent advances underscore the importance of system-level design to fully realize the benefits of temporal sparsity and event-driven computing. This includes dataflow-aware accelerators that align compute scheduling with sparse, asynchronous spike events (157, 158), algorithm-hardware co-design strategies that leverage CIM for efficient spike-based inference (90, 155), and hybrid architectures that integrate near- and in-memory computing with support for both frame- and event-based modalities (93). Recent efforts such as NeuroBench (159) aim to standardize the evaluation and benchmarking of neuromorphic systems across datasets, tasks, and hardware backends, further enabling fair comparison. These holistic approaches enable scalable and energy-efficient deployment of SNNs beyond small-scale benchmarks and represent a critical bridge between neuroscience-inspired models and practical machine intelligence.

Stochastic hardware

While traditional neuromorphic computing models have predominantly relied on deterministic neural and synaptic primitives, recent efforts have adapted these models to incorporate stochastic elements, leading to models that are compact while maintaining accuracy for a class of applications (160–163). The integration of stochasticity into algorithms has been shown to improve the energy efficiency of diverse AI workloads. One notable advancement involves leveraging stochasticity for information encoding over time, achieved through probabilistic synaptic or neural updates. This facilitates the state compression of neural and synaptic units, enabling their implementation with single-bit technologies. For example, Roy et al. (163) developed fully binary neural networks using stochastic activation functions. Srinivasan et al. (160) trained SNNs using a stochastic variant of a local learning rule, spike timing-dependent plasticity (STDP) (164). STDP updates synapses proportionately to the timing difference between the input spike and the output spike of the neuron: if the time difference is positive, the synapses are potentiated, and if it is negative, they are depressed. In stochastic STDP, this time difference instead determines the probability that a synapse switches. Stochastic binary multi-layer networks exhibit heightened representation capacity and excel in classification tasks compared with their deterministic counterparts (165), with the regularization effect induced by stochasticity playing a pivotal role. As researchers delve deeper into the interplay between stochasticity and algorithmic design, the prospect of unlocking novel capabilities in AI systems becomes increasingly promising.
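The following sketch illustrates one way a stochastic STDP rule for a binary synapse might look: the pre/post timing difference sets the switching probability, and the sign of the difference selects potentiation or depression. The exponential probability profile and the constants p_max and tau are illustrative assumptions, not the exact rule of (160).

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_stdp_update(w, dt, p_max=0.1, tau=20.0):
    # dt = post-spike time minus pre-spike time (in timesteps).
    # Closer spike pairs yield a higher switching probability.
    p_switch = p_max * np.exp(-abs(dt) / tau)
    if rng.random() < p_switch:
        w = 1 if dt > 0 else 0   # potentiate or depress the binary synapse
    return w
```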

At the hardware level, the core of stochastic computing circuits comprises controllable true random number generators (TRNGs), known as stochastic bits. The physics of various non-volatile memories exhibits inherent stochasticity, which can be leveraged to implement stochastic algorithms with high efficiency. This stochasticity is intimately tied to the unique switching mechanisms of these devices. For instance, spintronic devices utilize thermal noise for magnetic orientation switching (166). Similarly, ReRAM devices depend on conductive filament formation through the movement of oxygen ions/vacancies (167), and PCM devices switch through a process of heating and abrupt quenching (168). All these mechanisms inherently exhibit stochastic characteristics. Furthermore, the non-volatility of such devices eliminates the need for separate storage of stochastic bits and enables their use in training neural networks. One particularly promising approach is the use of SOT-MTJs, three-terminal, read–write separable devices distinguished by their stochastic sigmoid-like switching characteristics (162). Figure 9A illustrates an SOT-MTJ device, in which a magnetic tunnel junction comprising two nano-magnets (the bottom being the free layer and the top the fixed layer, separated by a very thin oxide layer) sits atop a heavy metal, depicted in gray. The current flowing through the heavy metal can switch the bottom nano-magnet. Depending on the magnitude and width of the current pulse, there is a probability that the SOT device (the free-layer magnet) switches, as shown in Figure 9B. Such a device can serve as a neuronal activation function, as previously mentioned. Figure 9C presents a circuit designed by Srinivasan et al. (160) that implements stochastic STDP updates for synapses in a crossbar array. The gate of the transistor MSTDP receives a voltage ramp with an input spike. The two bottom pass transistors activate only when the POST signal goes high with an output spike. Consequently, the current flowing through the heavy metal (through MSTDP and the two pass transistors) is proportional to the time difference between the input and output spikes, producing the required stochastic switching of the device and, hence, stochastic STDP. Figure 9D displays a crossbar-type array incorporating the stochastic learning circuit of Figure 9C. MTJs with high-barrier magnets (a barrier height greater than 40 kBT, where kB is the Boltzmann constant and T is the operating temperature) have a high retention time of approximately 7 years but require an external input (current) to assist switching. Conversely, low-barrier magnets (a barrier height less than 15 kBT) switch randomly due to thermal noise alone. Such low-barrier MTJs, used for random number generation (169, 170), exhibit high switching times on the order of milliseconds, leading to very slow operation and susceptibility to process variations (170). Additionally, owing to their low retention time, low-barrier magnets are unsuitable as synapses but can be used as neurons (170).
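A behavioral model of the stochastic bit itself is equally simple: a sigmoid-like switching probability as a function of the write current (cf. Figure 9B), sampled once per write pulse. The 50% switching current and slope below are illustrative assumptions for a fixed pulse width, not measured device parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def switching_probability(i_hm, i50=60e-6, k=5e5):
    # Sigmoid-like probability that the free layer switches for a
    # heavy-metal current i_hm (A); i50 is the 50% switching current.
    return 1.0 / (1.0 + np.exp(-k * (i_hm - i50)))

def sot_mtj_write(i_hm):
    # One write pulse of the controllable TRNG: returns the sampled state.
    return int(rng.random() < switching_probability(i_hm))
```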


Figure 9. Elements of stochastic hardware using spin–orbit transfer magnetic tunnel junctions (SOT-MTJs). (A) The structure of an SOT-MTJ with read and write circuitry to utilize it as a stochastic synapse. (B) The switching probability of a stochastic synapse between anti-parallel and parallel states with different current pulses. (C) A circuit consisting of SOT-MTJ for implementing spike timing-dependent plasticity (STDP) updates for a synapse in a crossbar-type array shown in (D).

Conclusion

The availability of large volumes of data from sensors all around us, coupled with an insatiable demand for handling new workloads for emerging applications, has transformed the landscape of AI. To meet this demand, AI models are growing at an alarming rate—as illustrated by the 5,000-fold increase in the size of natural language processing models over the last 4 years (Figure 10) (171).


Figure 10. Trend of language model sizes over the past four years, showing an increase by a factor of 5,000. Re-used from (171), with permission from NVIDIA.

The energy consumption of such implementations on today’s hardware far exceeds the limits that edge devices, constrained in computing and energy, can handle. Hence, there is an urgent need to rethink the hardware in ways that can yield quantum improvements in energy and latency while still maintaining accuracy. While today’s CPUs, GPUs, TPUs, and field-programmable gate arrays (FPGAs) have undergone updates to better handle matrix–matrix multiplication, sparsity in activations, and pruning and quantization of weights, they are still bottlenecked by the memory wall problem. Moreover, models such as transformers also require better and more effective handling of the attention mechanism in hardware, necessitating extensive calculations for softmax operations. Such operations require several trips to main memory to fetch weights (parameters) and mathematical tables, and these trips are costly in terms of both energy and latency. To this end, it is believed that in-memory computing at various hierarchies of memory needs to be explored to determine where, how, and when such technology can be utilized to alleviate the memory wall and achieve quantum improvements in energy and latency (172).

On the other hand, for the exemplary application of autonomous drone navigation, an integrated approach to sensing and computation is necessary to perform all computations onboard rather than in the cloud, as cloud computing incurs expensive communication costs and increased latency. To achieve the required SWAP (size, weight, and power) and accuracy, there is a need for “tiny” networks. To handle different types of inputs from various sensors and to efficiently compute both the sequential and static tasks required for drone navigation, tiny hybrid networks utilizing both standard ANNs and more biologically plausible SNNs can be crucial. As discussed, SNNs, with their asynchronous event-driven computing, show great potential for extracting spatiotemporal features from event streams, especially for sequential applications, while ANNs are very effective for static tasks. Hence, from a hardware perspective, there is a need for a “converged platform” that can efficiently implement both ANNs and SNNs. While SNNs can use the in-memory computing primitives described earlier, they involve an additional data structure, the membrane potential, which keeps track of temporal information and must be fetched and updated at every timestep. Computing in SNNs is asynchronous in nature, and hence hardware suitable for asynchronous or event-driven computation can reap the benefits of the sparse nature of spikes to achieve energy efficiency. While experimental hardware solutions such as SpiNNaker, TrueNorth, and Loihi have been developed, there is still a path ahead to meet the needs of both ANNs and SNNs in a converged hardware platform. We believe that significant improvements in SWAP are only possible by co-designing the algorithms and hardware.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fsci.2025.1611658/full#supplementary-material

Statements

Author contributions

KR: Conceptualization, Formal Analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing.

AK: Conceptualization, Formal Analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

TS: Conceptualization, Formal Analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

SN: Conceptualization, Formal Analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

DS: Conceptualization, Formal Analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

US: Conceptualization, Formal Analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

SR: Conceptualization, Formal Analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

AnR: Conceptualization, Formal Analysis, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing.

ZW: Conceptualization, Formal Analysis, Methodology, Resources, Validation, Visualization, Writing – original draft, Writing – review & editing.

SS: Software, Validation, Visualization, Writing – original draft, Writing – review & editing, Conceptualization, Formal Analysis, Methodology.

C-KL: Conceptualization, Formal Analysis, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing.

ArR: Conceptualization, Formal Analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author. The data that support the findings of this work are publicly available and free to download from their respective sources. Code written to produce the results of this work is available upon reasonable request to the corresponding author.

Funding

The authors declared that financial support was received for this work and/or its publication. All authors received funding from the Center for the Co-Design of Cognitive Systems (COCOSYS), a JUMP 2.0 center (AWD-004311-S4). KR’s work is also in part funded by the Semiconductor Research Corporation (2023-AI-3152) and National Science Foundation (2402983-CCF & 2023-AI-3152). The funders were not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

Conflict of interest

The authors declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declared that generative AI was used in the creation of this manuscript. The AI tools ChatGPT-4o and 4o-mini from OpenAI were used to improve the vocabulary and formatting of text in certain sections of the manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Krizhevsky A, Sutskever I, and Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, and Weinberger KQ, editors. Advances in Neural Information Processing Systems 25 (NIPS 2012). New York, NY: Curran Associates, Inc. (2012). 1097–105. Available at: https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html

2. Hochreiter S and Schmidhuber J. Long short-term memory. Neural Comput (1997) 9(8):1735–80. doi: 10.1162/neco.1997.9.8.1735

3. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, and Garnett R, editors. Advances in Neural Information Processing Systems: Proceedings of the 31st International Conference on Neural Information Processing Systems. New York, NY: Curran Associates, Inc. (2017). 6000–10. Available at: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

4. Devlin J, Chang MW, Lee K, and Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, and Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: human language technologies, volume 1 (long and short papers). Minneapolis, MN: Association for Computational Linguistics (2019). 4171–86. doi: 10.18653/v1/N19-1423

5. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners [article 159]. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). New York, NY: Curran Associates, Inc. (2020). 1877–901. Available at: https://papers.nips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html

6. Betker J, Goh G, Jing L, Brooks T, Wang J, Li L, et al. Improving image generation with better captions. Comput Sci (2023) 2(3):8. Available at: https://cdn.openai.com/papers/dall-e-3.pdf

7. Ho J, Jain A, and Abbeel P. Denoising diffusion probabilistic models [article 574]. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). New York, NY: Curran Associates, Inc. (2020) 33:6840–51. Available at: https://papers.nips.cc/paper_files/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html

8. Roy K, Jaiswal A, and Panda P. Towards spike-based machine intelligence with neuromorphic computing. Nature (2019) 575(7784):607–17. doi: 10.1038/s41586-019-1677-2

9. Rathi N, Chakraborty I, Kosta A, Sengupta A, Ankit A, Panda P, et al. Exploring neuromorphic computing based on spiking neural networks: algorithms to hardware. ACM Comput Surv (2023) 55(12):1–49. doi: 10.1145/3571155

10. LeCun Y. Generalization and network design strategies. In: Pfeifer R, Schreter Z, Fogelman-Soulie F, and Steels L, editors. Connectionism in perspective. New York, NY: Elsevier (1989). 143–55.

11. Maass W. Networks of spiking neurons: the third generation of neural network models. Neural Netw (1997) 10(9):1659–71. doi: 10.1016/S0893-6080(97)00011-7

12. Yu S, Jiang H, Huang S, Peng X, and Lu A. Compute-in-memory chips for deep learning: recent trends and prospects. IEEE Circuits Syst Mag (2021) 21(3):31–56. doi: 10.1109/MCAS.2021.3092533

13. Mrazek V, Sarwar SS, Sekanina L, Vasicek Z, and Roy K. Design of power-efficient approximate multipliers for approximate artificial neural networks [article 81]. In: Proceedings of the 35th International Conference on Computer-Aided Design (ICCAD); 2016 Nov 7–10; Austin, TX, USA. New York, NY: Association for Computing Machinery (2016). 1–7. doi: 10.1145/2966986.2967021

14. Nahavandi S, Alizadehsani R, Nahavandi D, Mohamed S, Mohajer N, Rokonuzzaman M, et al. A comprehensive review on autonomous navigation. ACM Comput Surv (2025) 57(9):1–67. doi: 10.1145/3727642

15. Rezwan S and Choi W. Artificial intelligence approaches for UAV navigation: recent advances and future challenges. IEEE Access (2022) 10:26320–39. doi: 10.1109/ACCESS.2022.3157626

16. Haensch W, Raghunathan A, Roy K, Chakrabarti B, Phatak CM, Wang C, et al. Compute in-memory with non-volatile elements for neural networks: a review from a co-design perspective. Adv Mater (2023) 35(37):e2204944. doi: 10.1002/adma.202204944

17. Mannocci P, Farronato M, Lepri N, Cattaneo L, Glukhov A, Sun Z, et al. In-memory computing with emerging memory devices: status and outlook. APL Mach Learn (2023) 1(1):010902. doi: 10.1063/5.0136403

18. Chih Y-D, Lee P-H, Fujiwara H, Shih Y-C, Lee C-F, Naous R, et al. 16.4 an 89TOPS/W and 16.3TOPS/mm2 all-digital SRAM-based full-precision compute-in memory macro in 22nm for machine-learning edge applications. In: Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC); 2021 Feb 13–22; San Francisco, CA, USA. New York, NY: Association for Computing Machinery (2021). 252–4. doi: 10.1109/ISSCC42613.2021.9365766

19. Lichtsteiner P, Posch C, and Delbruck T. A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE J Solid-State Circuits (2008) 43(2):566–76. doi: 10.1109/JSSC.2007.914337

20. Son B, Suh Y, Kim S, Jung H, Kim J-S, Shin C, et al. 4.1 A 640×480 dynamic vision sensor with a 9µm pixel and 300Meps address-event representation. In: 2017 IEEE International Solid-State Circuits Conference (ISSCC); 2017 Feb 5–9; San Francisco, CA, USA. New York, NY: Association for Computing Machinery (2017). 66–7. doi: 10.1109/ISSCC.2017.7870263

21. Gallego G, Delbrück T, Orchard G, Bartolozzi C, Taba B, Censi A, et al. Event-based vision: a survey. IEEE Trans Pattern Anal Mach Intell (2020) 44(1):154–80.

22. Boroujerdian B, Genc H, Krishnan S, Cui W, Faust A, and Reddi V. Mavbench: micro aerial vehicle benchmarking. In: Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2018); 2018 Oct 20–24; Fukuoka, Japan. New York, NY: IEEE (2018). 894–907. doi: 10.1109/MICRO.2018.00077

23. Gibaut W, Pereira L, Grassiotto F, Osorio A, Gadioli E, Munoz A, et al. Neurosymbolic AI and its taxonomy: a survey. arXiv [preprint] (2023). doi: 10.48550/arXiv.2305.08876

24. Wan Z, Liu CK, Yang H, Li C, You H, Fu Y, et al. Towards cognitive AI systems: a survey and prospective on neuro-symbolic AI. arXiv [preprint]. (2024). doi: 10.48550/arXiv.2401.01040

25. Wan Z, Chandramoorthy N, Swaminathan K, Chen PY, Reddi VJ, and Raychowdhury A. BERRY: bit error robustness for energy-efficient reinforcement learning-based autonomous systems. In: 2023 60th ACM/IEEE Design Automation Conference (DAC); 2023 July 9–13; San Francisco, CA, USA. New York, NY: Association for Computing Machinery (2023). 1–6. doi: 10.1109/DAC56929.2023.10247999

26. Wan Z, Swaminathan K, Chen PY, Chandramoorthy N, and Raychowdhury A. Analyzing and improving resilience and robustness of autonomous systems. In: Proceedings of the 41st IEEE/ACM International Conference on Computer-aided Design; 2022 Dec 22; San Diego, CA, USA. New York, NY: Association for Computing Machinery (2022). 1–9. doi: 10.1145/3508352.3561111

27. Saxena U and Roy K. McQueen: mixed precision quantization of early exit networks [abstract 511]. In: The 34th British Machine Vision Conference Proceedings. (BMVC) (2023). Available at: https://proceedings.bmvc2023.org/511/

28. Saxena U and Roy K. Partial-sum quantization for near ADC-less compute-in-memory accelerators. In: 2023 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED).; 2023 Aug 07–08; Vienna, Austria. New York, NY: IEEE (2023). 1–6. doi: 10.1109/ISLPED58423.2023.10244291

29. Saxena U, Sharify S, Roy K, and Wang X. ResQ: mixed-precision quantization of large language models with low-rank residuals. In: Proceedings of the ICLR 2025 Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models (SCOPE). Open Review (2025). Available at: https://openreview.net/forum?id=sROOmwTHyW

30. Frantar E and Alistarh D. Optimal brain compression: a framework for accurate post-training quantization and pruning [article 323]. Advances in Neural Information Processing Systems 35 (NeurIPS 2022). New York, NY: Curran Associates, Inc. (2022). 4475–88. Available at: https://papers.nips.cc/paper_files/paper/2022/hash/1caf09c9f4e6b0150b06a07e77f2710c-Abstract-Conference.html

31. Cheng H, Zhang M, and Shi JQ. A survey on deep neural network pruning: taxonomy, comparison, analysis, and recommendations. IEEE Trans Pattern Anal Mach Intell (2024) 46(12):10558–78. doi: 10.1109/TPAMI.2024.3447085

32. Bai H and Li Y. Structured sparsity in the NVIDIA ampere architecture and applications in search engines [online]. NVIDIA Developer (2023). Available at: https://developer.nvidia.com/blog/structured-sparsity-in-the-nvidia-ampere-architecture-and-applications-in-search-engines/

33. Zbikowski R. Fly like a fly [micro-air vehicle]. IEEE Spectr (2005) 42(11):46–51. doi: 10.1109/MSPEC.2005.1526905

34. He K, Zhang X, Ren S, and Sun J. Deep residual learning for image recognition. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. New York, NY: IEEE (2016). 770–8. doi: 10.1109/CVPR.2016.90

35. Medsker LR and Jain LC. Recurrent neural networks. Des Appl (2001) 5(64–67):2.

36. Rathi N and Roy K. DIET-SNN: a low-latency spiking neural network with direct input encoding and leakage and threshold optimization. IEEE Trans Neural Netw Learn Syst (2023) 34(6):3174–82. doi: 10.1109/TNNLS.2021.3111897

37. Rathi N, Srinivasan G, Panda P, and Roy K. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In: Proceedings of the ICLR 2020. The Eighth International Conference on Learning Representations. Open Review (2020). Available at: https://openreview.net/forum?id=B1xSperKvH&noteId=B1xSperKvH

38. Neftci EO, Mostafa H, and Zenke F. Surrogate gradient learning in spiking neural networks. IEEE Signal Process Mag (2019) 36(6):51–3. doi: 10.1109/MSP.2019.2931595

39. Sengupta A, Ye Y, Wang R, Liu C, and Roy K. Going deeper in spiking neural networks: VGG and residual architectures. Front Neurosci (2019) 13:95. doi: 10.3389/fnins.2019.00095

40. Kosta AK and Roy K. Adaptive-SpikeNet: event-based optical flow estimation using spiking neural networks with learnable neuronal dynamics. In: Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). London, United Kingdom. 29 May - 2 June 2023. New York, NY: IEEE (2023). 6021–7. doi: 10.1109/ICRA48891.2023.10160551

41. Lee C, Sarwar SS, Panda P, Srinivasan G, and Roy K. Enabling spike-based backpropagation for training deep neural network architectures. Front Neurosci (2020) 14:119. doi: 10.3389/fnins.2020.00119

42. Ronneberger O, Fischer P, and Brox T. U-net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, and Frangi A, editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Berlin: Springer (2015). 234–41. doi: 10.1007/978-3-319-24574-4_28

43. Zhu AZ, Yuan L, Chaney K, and Daniilidis K. EV-FlowNet: self-supervised optical flow estimation for event-based cameras. In: Kress-Gazit H, Srinivasa S, Howard S, and Atanasov N, editors. Proceedings of the Robotics: Science and Systems XIV.; 2018 Jun. Pittsburgh, Pennsylvania. Robotics: Science and Systems online proceedings (2018). doi: 10.15607/RSS.2018.XIV.062

44. Zhu AZ, Yuan L, Chaney K, and Daniilidis K. Unsupervised event-based learning of optical flow, depth, and egomotion. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York, NY: IEEE (2019). 989–97. doi: 10.1109/CVPR.2019.00108

45. Lee C, Kosta AK, Zhu AZ, Chaney K, Daniilidis K, and Roy K. Spike-flownet: event-based optical flow estimation with energy-efficient hybrid neural networks. In: Vedaldi A, Bischof H, Brox T, and Frahm JM, editors. Computer Vision – ECCV 2020 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX. Berlin: Springer (2020). 366–82. doi: 10.1007/978-3-030-58526-6_22

46. Lee C, Kosta AK, and Roy K. Fusion-FlowNet: energy-efficient optical flow estimation using sensor fusion and deep fused spiking-analog network architectures. In: Proceedings of the 2022 International Conference on Robotics and Automation (ICRA 2022) (ICRA); 2022 May 23–27; Philadelphia, PA, USA. New York, NY: IEEE (2022). 6504–10. doi: 10.1109/ICRA46639.2022.9811821

47. Negi S, Sharma D, Kosta AK, and Roy K. Best of both worlds: hybrid SNN-ANN architecture for event-based optical flow estimation. In: Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); 2024 Oct 14–18; Abu Dhabi, United Arab Emirates. New York, NY: IEEE (2023). 2696–703. doi: 10.1109/IROS58592.2024.10802844

48. Nagaraj M, Liyanagedera CM, and Roy K. Dotie-detecting objects through temporal isolation of events using a spiking architecture. In: Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA 2023) (ICRA); 2023 May 29–2023 Jun 02; London, United Kingdom. New York, NY: IEEE (2023). 4858–64. doi: 10.1109/ICRA48891.2023.10161164

49. Das Biswas S, Kosta A, Liyanagedera C, Apolinario M, and Roy K. HALSIE: hybrid approach to learning segmentation by simultaneously exploiting image and event modalities. In: Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2024 Jan 03–08; Waikoloa, HI, USA. New York, NY: IEEE (2024). 5952–62. doi: 10.1109/WACV57701.2024.00586

50. Zoph B, Vasudevan V, Shlens J, and Le QV. Learning transferable architectures for scalable image recognition. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18–23; Salt Lake City, UT, USA. New York, NY: IEEE (2018). 8697–710. doi: 10.1109/CVPR.2018.00907

51. Sabater A, Montesano L, and Murillo AC. Event transformer. A sparse-aware solution for efficient event data processing. In: Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022 Jun 19–20; New Orleans, LA, USA. New York, NY: IEEE (2022). 2677–86. doi: 10.1109/CVPRW56347.2022.00301

52. Gehrig M and Scaramuzza D. Recurrent vision transformers for object detection with event cameras. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2023 Jun 17–24; Vancouver, BC, Canada. New York, NY: IEEE (2023). 13884–93. doi: 10.1109/CVPR52729.2023.01334

53. Elliott DG, Snelgrove WM, and Stumm M. Computational RAM: a memory-SIMD hybrid and its application to DSP. In: Proceedings of the 1992 IEEE Custom Integrated Circuits Conference; 1992 May 03–06; Boston, MA, USA. New York, NY: IEEE (1992). 30.6.1–4. doi: 10.1109/CICC.1992.591879

54. Gokhale M, Holmes B, and Iobst K. Processing in memory: the Terasys massively parallel PIM array. Computer (1995) 28(4):23–31. doi: 10.1109/2.375174

55. Wulf WA and McKee SA. Hitting the memory wall: implications of the obvious. ACM SIGARCH Comput Archit News (1995) 23(1):20–4. doi: 10.1145/216585.216588

56. Horowitz M. 1.1 computing’s energy problem (and what we can do about it). In: Proceedings of the 2014 IEEE International Solid-state Circuits Conference Digest of Technical Papers (ISSCC); 2014 Feb 09–13; San Francisco, CA, USA. New York, NY: IEEE (2014). doi: 10.1109/ISSCC.2014.6757323

57. Chakraborty I, Mustafa FA, Kim DE, Ankit A, and Roy K. Geniex: a generalized approach to emulating non-ideality in memristive Xbars using neural networks. In: Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC); 2020 Jul 20–24; San Francisco, CA, USA. New York, NY: IEEE (2020). 1–6. doi: 10.1109/DAC18072.2020.9218688

58. Chakraborty I, Ali M, Ankit A, Jain S, Roy S, Sridharan S, et al. Resistive crossbars as approximate hardware building blocks for machine learning: opportunities and challenges. Proc IEEE (2020) 108(12):2276–310. doi: 10.1109/JPROC.2020.3003007

59. Ankit A, El Hajj IE, Chalamalasetti SR, Ndu G, Foltin M, Williams RS, et al. PUMA: a programmable ultra-efficient memristor-based accelerator for machine learning inference. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems; 2019 Apr 04; Providence, RI, USA. New York, NY: Association for Computing Machinery (2019). 715–31. doi: 10.1145/3297858.3304049

60. Jaiswal A, Chakraborty I, Agrawal A, and Roy K. 8T SRAM cell as a multibit dot-product engine for beyond von Neumann computing. IEEE Trans Very Large Scale Integr Syst (2019) 27(11):2556–67. doi: 10.1109/TVLSI.2019.2929245

61. Valavi H, Ramadge PJ, Nestler E, and Verma N. A 64-tile 2.4-MB in-memory-computing CNN accelerator employing charge-domain compute. IEEE J Solid-State Circuits (2019) 54(6):1789–99. doi: 10.1109/JSSC.2019.2899730

62. Wang J, Wang X, Eckert C, Subramaniyan A, Das R, Blaauw D, et al. A 28-nm compute sram with bit-serial logic/arithmetic operations for programmable in-memory vector computing. IEEE J Solid-State Circuits (2020) 55(1):76–86. doi: 10.1109/JSSC.2019.2939682

63. Crafton B, Spetalnick S, Yoon JH, and Raychowdhury A. Statistical optimization of compute in-memory performance under device variation. In: Proceedings of the 2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED); 2021 Jul 26–28; Boston, MA, USA. New York, NY: IEEE (2021). 1–6. doi: 10.1109/ISLPED52811.2021.9502484

64. Crafton B, Wan Z, Spetalnick S, Yoon JH, Wu W, Tokunaga C, et al. Improving compute in-memory ECC reliability with successive correction. In: Proceedings of the 59th ACM/IEEE Design Automation Conference; 2022 Aug 23; San Francisco, California. New York, NY: Association for Computing Machinery (2022). 745–50. doi: 10.1145/3489517.3530526

65. Khan AI, Keshavarzi A, and Datta S. The future of ferroelectric field-effect transistor technology. Nat Electron (2020) 3(10):588–97. doi: 10.1038/s41928-020-00492-7

66. Bae G, Bae D-I, Kang M, Hwang SM, Kim SS, Seo B, et al. 3nm GAA technology featuring multi-bridge-channel FET for low power and high performance applications. In: Proceedings of the 2018 International Electron Devices Meeting (IEDM 2018). San Francisco, CA, USA. 1-5 December 2018. New York, NY: IEEE (2018). 28.7.1–28.7.4. doi: 10.1109/IEDM.2018.8614629

67. Jain P, Arslan U, Sekhar M, Lin BC, Wei L, Sahu T, et al. 13.2 A 3.6 Mb 10.1 Mb/mm 2 embedded non-volatile ReRAM macro in 22nm FinFET technology with adaptive forming/set/reset schemes yielding down to 0.5 V with sensing time of 5ns at 0.7 V. In: Proceedings of the 2019 IEEE International Solid-State Circuits Conference (ISSCC); 2019 Feb 17–21; San Francisco, CA, USA. New York, NY: IEEE (2019). 212–4. doi: 10.1109/ISSCC.2019.8662393

68. Shih YC, Lee CF, Chang YA, Lee PH, Lin HJ, Chen YL, et al. A reflow-capable, embedded 8Mb STT-MRAM macro with 9ns read access time in 16nm FinFET logic CMOS process. In: Proceedings of the 2020 IEEE International Electron Devices Meeting (IEDM); 2020 Dec 12–18; San Francisco, CA, USA. New York, NY: IEEE (2020). 11.4.1–11.4.4. doi: 10.1109/IEDM13553.2020.9372115

69. Min D, Park J, Weber O, Wacquant F, Villaret A, Vandenbossche E, et al. 18nm FDSOI technology platform embedding PCM & innovative continuous-active construct enhancing performance for leading-edge MCU applications. In: Proceedings of the 2021 IEEE International Electron Devices Meeting (IEDM); 2021 Dec 11–16; San Francisco, CA, USA. New York, NY: IEEE (2021). 13.1.1–13.1.4. doi: 10.1109/IEDM19574.2021.9720542

70. Dünkel S, Trentzsch M, Richter R, Moll P, Fuchs C, Gehring O, et al. A FeFET based super-low-power ultra-fast embedded NVM technology for 22nm FDSOI and beyond. In: Proceedings of the 2017 IEEE International Electron Devices Meeting (IEDM); 2017 Dec 02-06; San Francisco, CA, USA. New York, NY: IEEE (2017). doi: 10.1109/IEDM.2017.8268425

71. Pentecost L, Hankin A, Donato M, Hempstead M, Wei G-Y, and Brooks D. NVMExplorer: a framework for cross-stack comparisons of embedded non-volatile memories. In: Proceedings of the 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA); 2022 Apr 02–06; Seoul, Korea. New York, NY: IEEE (2021). 938–56. doi: 10.1109/HPCA53966.2022.00073

72. Song L, Zhuo Y, Qian X, Li H, and Chen Y. GraphR: accelerating graph processing using ReRAM. In: Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA); 2018 Feb 24–28; Vienna, Austria. New York, NY: IEEE (2018). 531–43. doi: 10.1109/HPCA.2018.00052

73. Dutta S, Schafer C, Gomez J, Ni K, Joshi S, and Datta S. Supervised learning in all FeFET-based spiking neural network: opportunities and challenges. Front Neurosci (2020) 14:634. doi: 10.3389/fnins.2020.00634

74. Yang X, Yan B, Li H, and Chen Y. ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration. In: Proceedings of the 39th International Conference on Computer-Aided Design; 2020 Dec 17; Virtual Event, USA. New York, NY: Association for Computing Machinery (2020). 1–9. doi: 10.1145/3400302.3415640

75. Li W, Manley M, Read J, Kaul A, Bakir MS, and Yu S. H3DAtten: heterogeneous 3-D integrated hybrid analog and digital compute-in-memory accelerator for vision transformer self-attention. IEEE Trans Very Large Scale Integr (VLSI) Syst (2023) 31(10):1592–602. doi: 10.1109/TVLSI.2023.3299509

76. Langenegger J, Karunaratne G, Hersche M, Benini L, Sebastian A, and Rahimi A. In-memory factorization of holographic perceptual representations. Nat Nanotechnol (2023) 18(5):479–85. doi: 10.1038/s41565-023-01357-8

77. Wan Z, Liu CK, Ibrahim M, Yang H, Spetalnick S, Krishna T, et al. H3DFact: heterogeneous 3D integrated CIM for factorization with holographic perceptual representations. In: Proceedings of the 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE); 2024 Mar 25–27; Valencia, Spain. New York, NY: IEEE (2024). 1–6. doi: 10.23919/DATE58400.2024.10546582

78. Chen PY and Yu S. Compact modeling of RRAM devices and its applications in 1T1R and 1S1R array design. IEEE Trans Electron Devices (2015) 62(12):4022–8. doi: 10.1109/TED.2015.2492421

79. Chen WH, Li KX, Lin WY, Hsu KH, Li PY, Yang C-H, et al. A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors. In: Proceedings of the 2018 IEEE International Solid-State Circuits Conference (ISSCC); 2018 Feb 11–15; San Francisco, CA, USA. New York, NY: IEEE (2018). 494–6. doi: 10.1109/ISSCC.2018.8310400

80. Xue CX, Chen WH, Liu JS, Li JF, Lin WY, Lin WE, Wang JH, et al. 24.1 A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN-based AI edge processors. In: Proceedings of the 2019 IEEE International Solid-State Circuits Conference (ISSCC); 2019 Feb 17–21; San Francisco, CA, USA. New York, NY: IEEE (2019). 388–90. doi: 10.1109/ISSCC.2019.8662395

81. Xue CX, Huang TY, Liu JS, Chang TW, Kao HY, Wang JH, et al. 15.4 A 22nm 2Mb ReRAM compute-in-memory macro with 121–28TOPS/W for multibit MAC computing for tiny AI edge devices. In: Proceedings of the 2020 IEEE International Solid-State Circuits Conference (ISSCC); 2020 Feb 16–20; San Francisco, CA, USA. New York, NY: IEEE (2020). 244–6. doi: 10.1109/ISSCC19947.2020.9063078

82. Liu Q, Gao B, Yao P, Wu D, Chen J, Pang Y, et al. 33.2 A fully integrated analog ReRAM-based 78.4 TOPS/W compute-in-memory chip with fully parallel MAC computing. In: Proceedings of the 2020 IEEE International Solid-State Circuits Conference (ISSCC); 2020 Feb 16–20; San Francisco, CA, USA. New York, NY: IEEE (2020). 500–2. doi: 10.1109/ISSCC19947.2020.9062953

83. Yoon JH, Chang M, Khwa WS, Chih YD, Chang MF, and Raychowdhury A. 29.1 A 40nm 64Kb 56.67 TOPS/W read-disturb-tolerant compute-in-memory/digital RRAM macro with active-feedback-based read and in-situ write verification. In: Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC); 2021 Feb 13–22; San Francisco, CA, USA. New York, NY: IEEE (2021). 404–6. doi: 10.1109/ISSCC42613.2021.9365926

84. Xue CX, Hung J-M, Kao HY, Huang YH, Huang SP, Chang FC, et al. 16.1 A 22nm 4Mb 8b-precision ReRAM computing-in-memory macro with 11.91 to 195.7 TOPS/W for tiny AI edge devices. In: Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC); 2021 Feb 13–22; San Francisco, CA, USA. New York, NY: IEEE (2021). 245–7. doi: 10.1109/ISSCC42613.2021.9365769

85. Hung J-M, Wen TH, Huang YH, Huang SP, Chang FC, Su CI, et al. 8-b precision 8-mb ReRAM compute-in-memory macro using direct-current-free time-domain readout scheme for AI edge devices. IEEE J Solid-State Circuits (2023) 58(1):303–15. doi: 10.1109/JSSC.2022.3200515

86. Spetalnick SD, Chang M, Crafton B, Khwa WS, Chih YD, Chang MF, et al. A 40nm 64Kb 26.56 TOPS/W 2.37 Mb/mm2 RRAM binary/compute-in-memory macro with 4.23x improvement in density and >75% use of sensing dynamic range. In: Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC); 2022 Feb 20–26; San Francisco, CA, USA. New York, NY: IEEE (2022). 1–3. doi: 10.1109/ISSCC42614.2022.9731725

87. Correll JM, Jie L, Song S, Lee S, Zhu J, Tang W, et al. An 8-bit 20.7 TOPS/W multi-level cell ReRAM-based compute engine. In: Proceedings of the 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits); 2022 Jun 12–17; Honolulu, HI, USA. New York, NY: IEEE (2022). 264–5. doi: 10.1109/VLSITechnologyandCir46769.2022.9830490

88. Spetalnick SD, Chang M, Konno S, Crafton B, Lele AS, Khwa WS, et al. A 2.38 MCells/mm2 9.81–350 TOPS/W RRAM compute-in-memory macro in 40nm CMOS with hybrid offset/IOFF cancellation and ICELL RBLSL drop mitigation. In: Proceedings of the 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits); 2023 Jun 11–16; Kyoto, Japan. New York, NY: IEEE (2023). 1–2. doi: 10.23919/VLSITechnologyandCir57934.2023.10185424

89. Huang WH, Wen TH, Hung J-M, Khwa WS, Lo YC, Jhang CJ, et al. A nonvolatile AI-edge processor with 4MB SLC-MLC hybrid-mode ReRAM compute-in-memory macro and 51.4-251TOPS/W. In: Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC); 2023 Feb 19–23; San Francisco, CA, USA. New York, NY: IEEE (2023). 15–7. doi: 10.1109/ISSCC42615.2023.10067610

90. Kim S, Kim S, Um S, Kim S, Kim K, and Yoo H-J. Neuro-CIM: ADC-less neuromorphic computing-in-memory processor with operation gating/stopping and digital–analog networks. IEEE J Solid-State Circuits (2023) 58(10):2931–45. doi: 10.1109/JSSC.2023.3273238

91. Sharma T, Wang C, Agrawal A, and Roy K. Enabling robust SOT-MTJ crossbars for machine learning using sparsity-aware device-circuit co-design. In: Proceedings of the 2021 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED); 2021 Jul 26–28; Boston, MA, USA. New York, NY: IEEE (2021). 1–6. doi: 10.1109/ISLPED52811.2021.9502492

92. Chang M, Spetalnick SD, Crafton B, Khwa WS, Chih YD, Chang MF, et al. A 40nm 60.64 TOPS/W ECC-capable compute-in-memory/digital 2.25 MB/768KB RRAM/SRAM system with embedded cortex M3 microprocessor for edge recommendation systems. In: Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC); 2022 Feb 20–26; San Francisco, CA, USA. New York, NY: IEEE (2022). 1–3. doi: 10.1109/ISSCC42614.2022.9731679

93. Chang M, Lele AS, Spetalnick SD, Crafton B, Konno S, Wan Z, et al. A 73.53 TOPS/W 14.74 TOPS heterogeneous RRAM in-memory and SRAM near-memory SoC for hybrid frame and event-based target tracking. In: Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC); 2023 Feb 19–23; San Francisco, CA, USA. New York, NY: IEEE (2023). 426–8. doi: 10.1109/ISSCC42615.2023.10067544

94. Lele AS, Chang M, Spetalnick SD, Crafton B, Konno S, Wan Z, et al. A heterogeneous RRAM in-memory and SRAM near-memory SoC for fused frame and event-based target identification and tracking. IEEE J Solid-State Circuits (2024) 59(1):52–64. doi: 10.1109/JSSC.2023.3297411

95. Singh S, Sarma A, Jao N, Pattnaik A, Lu S, Yang K, et al. NEBULA: a neuromorphic spin-based ultra-low power architecture for SNNs and ANNs. In: Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA); 2020 May 30–2020 Jun 03; Valencia, Spain. New York, NY: IEEE (2020). 363–76. doi: 10.1109/ISCA45697.2020.00039

96. Chen Y, Lu L, Kim B, and Kim TT-H. Reconfigurable 2T2R ReRAM architecture for versatile data storage and computing in-memory. IEEE Trans Very Large Scale Integr (VLSI) Syst (2020) 28(12):2636–49. doi: 10.1109/TVLSI.2020.3028848

97. Shou S, Liu CK, Yun S, Wan Z, Ni K, Imani M, et al. SEE-MCAM: scalable multi-bit FeFET content addressable memories for energy efficient associative search. In: Proceedings of the 42nd IEEE/ACM International Conference on Computer-Aided Design; 2023 Oct 28–2023 Nov 02; San Francisco, CA, USA. New York, NY: IEEE (2023). 1–9. doi: 10.1109/ICCAD57390.2023.10323738

98. Wan W, Kubendran R, Eryilmaz SB, Zhang W, Liao Y, Wu D, et al. 33.1 A 74 TMACS/W CMOS-RRAM neurosynaptic core with dynamically reconfigurable dataflow and in-situ transposable weights for probabilistic graphical models. In: Proceedings of the 2020 IEEE International Solid-State Circuits Conference (ISSCC); 2020 Feb 16–20; San Francisco, CA, USA. New York, NY: IEEE (2020). 498–500. doi: 10.1109/ISSCC19947.2020.9062979

99. Shin D, Lee J, Lee J, and Yoo H-J. 14.2 DNPU: an 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks. In: Proceedings of the 2017 IEEE International Solid-State Circuits Conference (ISSCC); 2017 Feb 05–09; San Francisco, CA, USA. New York, NY: IEEE (2017). 240–1. doi: 10.1109/ISSCC.2017.7870350

100. Zahedi M, Mayahinia M, Lebdeh MA, Wong S, and Hamdioui S. Efficient organization of digital periphery to support integer datatype for memristor-based CIM. In: Proceedings of the 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI); 2020 Jul 06–08; Limassol, Cyprus. New York, NY: IEEE (2020). 216–21. doi: 10.1109/ISVLSI49217.2020.00047

101. Liu CK, Chen H, Imani M, Ni K, Kazemi A, Laguna AF, et al. COSIME: FeFET-based associative memory for in-memory cosine similarity search. In: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design; 2022 Dec 22; San Diego, California. New York, NY: Association for Computing Machinery (2022). 1–9. doi: 10.1145/3508352.3549412

102. Zhang Q, An H, Fan Z, Wang Z, Li Z, Wang G, et al. A 22nm 3.5 TOPS/W flexible micro-robotic vision SoC with 2MB eMRAM for fully-on-chip intelligence. In: Proceedings of the 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits); 2022 Jun 12–17; Honolulu, HI, USA. New York, NY: IEEE (2022). 72–3. doi: 10.1109/VLSITechnologyandCir46769.2022.9830340

103. Rossi D, Conti F, Eggiman M, Mach S, Di Mauro A, Guermandi M, et al. 4.4 A 1.3 TOPS/W @ 32GOPS fully integrated 10-core SoC for IoT end-nodes with 1.7 µW cognitive wake-up from MRAM-based state-retentive sleep mode. In: Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC); 2021 Feb 13–22; San Francisco, CA, USA. New York, NY: IEEE (2021). 60–2. doi: 10.1109/ISSCC42613.2021.9365939

104. Kang M, Keel MS, Shanbhag NR, Eilert S, and Curewitz K. An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2014 May 04–09; Florence, Italy. New York, NY: IEEE (2014). 8326–30. doi: 10.1109/ICASSP.2014.6855225

105. Zhang J, Wang Z, and Verma N. A machine-learning classifier implemented in a standard 6T SRAM array. In: Proceedings of the 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits); 2016 Jun 15–17; Honolulu, HI, USA. New York, NY: IEEE (2016). 1–2. doi: 10.1109/VLSIC.2016.7573556

106. Ali M, Jaiswal A, Kodge S, Agrawal A, Chakraborty I, and Roy K. IMAC: in-memory multi-bit multiplication and accumulation in 6T SRAM array. IEEE Trans Circuits Syst I (2020) 67(8):2521–31. doi: 10.1109/TCSI.2020.2981901

107. Mori H, Zhao WC, Lee CE, Lee CF, Hsu YH, Chuang CK, Hashizume T, et al. A 4nm 6163-TOPS/W/b 4790-TOPS/mm2/b SRAM-based digital-computing-in-memory macro supporting bit-width flexibility and simultaneous MAC and weight update. In: Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC); 2023 Feb 19–23; San Francisco, CA, USA. New York, NY: IEEE (2023). 132–4. doi: 10.1109/ISSCC42615.2023.10067555

108. Si X, Chen JJ, Tu YN, Huang WH, Wang JH, Chiu YC, et al. A twin-8T SRAM computation-in-memory unit-macro for multibit CNN-based AI edge processors. IEEE J Solid-State Circuits (2020) 55(1):189–202. doi: 10.1109/JSSC.2019.2952773

109. Biswas A and Chandrakasan AP. Conv-RAM: an energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications. In: Proceedings of the 2018 IEEE International Solid-State Circuits Conference (ISSCC); 2018 Feb 11–15; San Francisco, CA, USA. New York, NY: IEEE (2018). 488–90. doi: 10.1109/ISSCC.2018.8310397

110. Fujiwara H, Mori H, Zhao WC, Chuang MC, Naous R, Chuang CK, et al. A 5-nm 254-TOPS/W 221-TOPS/mm2 fully-digital computing-in-memory macro supporting wide-range dynamic-voltage-frequency scaling and simultaneous MAC and write operations. In: Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC); 2022 Feb 20–26; San Francisco, CA, USA. New York, NY: IEEE (2022). 1–3. doi: 10.1109/ISSCC42614.2022.9731754

111. Agrawal A, Jaiswal A, Roy D, Han B, Srinivasan G, Ankit A, et al. Xcel-RAM: accelerating binary neural networks in high-throughput SRAM compute arrays. IEEE Trans Circuits Syst I (2019) 66(8):3064–76. doi: 10.1109/TCSI.2019.2907488

112. Srivastava P, Kang M, Gonugondla SK, Lim S, Choi J, Adve V, et al. PROMISE: an end-to-end design of a programmable mixed-signal accelerator for machine-learning algorithms. In: Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA); 2018 Jun 01–06; Los Angeles, CA, USA. New York, NY: IEEE (2018). 43–56. doi: 10.1109/ISCA.2018.00015

113. Wu PC, Su JW, Hong LY, Ren JS, Chien C-H, Chen HY, et al. A 22nm 832Kb hybrid-domain floating-point SRAM in-memory-compute macro with 16.2-70.2 TFLOPS/W for high-accuracy AI-edge devices. In: Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC); 2023 Feb 19–23; San Francisco, CA, USA. New York, NY: IEEE (2023). 126–8. doi: 10.1109/ISSCC42615.2023.10067527

114. Ali M, Chakraborty I, Choudhary S, Chang M, Kim DE, Raychowdhury A, et al. A 65 nm 1.4-6.7 TOPS/W adaptive-SNR sparsity-aware CIM core with load balancing support for DL workloads. In: Proceedings of the 2023 IEEE Custom Integrated Circuits Conference (CICC); 2023 Apr 23–26; San Antonio, TX, USA. New York, NY: IEEE (2023). 1–2. doi: 10.1109/CICC57935.2023.10121243

115. Huang S, Ankit A, Silveira P, Antunes R, Chalamalasetti SR, El Hajj I, et al. Mixed precision quantization for ReRAM-based DNN inference accelerators. In: Proceedings of the 26th Asia and South Pacific Design Automation Conference; 2021 Jan 29; Tokyo, Japan. New York, NY: Association for Computing Machinery (2021). 372–7. doi: 10.1145/3394885.3431554

116. Kim Y, Kim H, and Kim J-J. Extreme partial-sum quantization for analog computing-in-memory neural network accelerators. J Emerg Technol Comput Syst (2022) 18(4):1–19. doi: 10.1145/3528104

117. Kim H, Kim Y, Ryu S, and Kim J-J. Algorithm/hardware co-design for in-memory neural network computing with minimal peripheral circuit overhead. In: Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC); 2020 Jul 20–24; San Francisco, CA, USA. New York, NY: IEEE (2020). 1–6. doi: 10.1109/DAC18072.2020.9218657

118. Azamat A, Asim F, and Lee J. Quarry: quantization-based ADC reduction for ReRAM-based deep neural network accelerators. In: Proceedings of the 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD); 2021 Nov 01–04; Munich, Germany. New York, NY: IEEE (2021). 1–7. doi: 10.1109/ICCAD51958.2021.9643502

119. Saxena U, Chakraborty I, and Roy K. Towards ADC-less compute-in-memory accelerators for energy efficient deep learning. In: Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE); 2022 Mar 14–23; Antwerp, Belgium. New York, NY: IEEE (2022). 624–7. doi: 10.23919/DATE54114.2022.9774573

120. Negi S, Saxena U, Sharma D, and Roy K. HCIM: ADC-less hybrid analog-digital compute in memory accelerator for deep learning workloads. In: Proceedings of the 30th Asia and South Pacific Design Automation Conference; 2025 Mar 04; Tokyo, Japan. New York, NY: Association for Computing Machinery (2025). 648–55. doi: 10.1145/3658617.3697572

121. Esser SK, McKinstry JL, Bablani D, Appuswamy R, and Modha DS. Learned step size quantization. In: Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020). OpenReview (2020). Available at: https://openreview.net/forum?id=rkgO66VKDS

122. Apolinario MPE, Kosta AK, Saxena U, and Roy K. Hardware/software co-design with ADC-less in-memory computing hardware for spiking neural networks. IEEE Trans Emerg Top Comput (2024) 12(1):35–47. doi: 10.1109/TETC.2023.3316121

123. Seshadri V, Lee D, Mullins T, Hassan H, Boroumand A, Kim J, et al. Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology. In: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO); 2017 Oct 14; Cambridge, Massachusetts. New York, NY: Association for Computing Machinery (2017). 273–87. doi: 10.1145/3123939.3124544

124. Li S, Niu D, Malladi KT, Zheng H, Brennan B, and Xie Y. DRISA: a DRAM-based reconfigurable in-situ accelerator. In: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO); 2017 Oct 14; Cambridge, Massachusetts. New York, NY: Association for Computing Machinery (2017). 288–301. doi: 10.1145/3123939.3123977

125. Deng Q, Jiang L, Zhang Y, Zhang M, and Yang J. DrAcc: a DRAM based accelerator for accurate CNN inference. In: Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC); San Francisco, CA, USA. New York, NY: Association for Computing Machinery (2018). 1–6. doi: 10.1109/DAC.2018.8465866

126. Xin X, Zhang Y, and Yang J. ELP2IM: efficient and low power bitwise operation processing in DRAM. In: Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA); 2020 Feb 22–26; San Diego, CA, USA. New York, NY: IEEE (2020). 303–14. doi: 10.1109/HPCA47549.2020.00033

127. Hajinazar N, Oliveira GF, Gregorio S, Ferreira JD, Ghiasi NM, Patel M, et al. SIMDRAM: a framework for bit-serial SIMD processing using DRAM. In: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021); 2021 Apr 17; Virtual, USA. New York, NY: Association for Computing Machinery (2021). 329–45. doi: 10.1145/3445814.3446749

128. Ali MF, Jaiswal A, and Roy K. In-memory low-cost bit-serial addition using commodity DRAM technology. IEEE Trans Circuits Syst I (2020) 67(1):155–65. doi: 10.1109/TCSI.2019.2945617

129. Roy S, Ali M, and Raghunathan A. PIM-DRAM: accelerating machine learning workloads using processing in commodity DRAM. IEEE J Emerg Sel Top Circuits Syst (2021) 11(4):701–10. doi: 10.1109/JETCAS.2021.3127517

130. He M, Song C, Kim I, Jeong C, Kim S, Park I, et al. Newton: a DRAM-maker’s accelerator-in-memory (AiM) architecture for machine learning. In: Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO); 2020 Oct 17–21; Athens, Greece. New York, NY: IEEE (2020). 372–85. doi: 10.1109/MICRO50266.2020.00040

131. Ke L, Gupta U, Cho BY, Brooks D, Chandra V, Diril U, et al. RecNMP: accelerating personalized recommendation with near-memory processing. In: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA); 2020 May 30–2020 June 03; Valencia, Spain. New York, NY: IEEE (2020). 790–803. doi: 10.1109/ISCA45697.2020.00070

132. Jeddeloh J and Keeth B. Hybrid memory cube new DRAM architecture increases density and performance. In: Proceedings of the 2012 Symposium on VLSI Technology (VLSIT); 2012 Jun 12–14; Honolulu, HI, USA. New York, NY: IEEE (2012). 87–8. doi: 10.1109/VLSIT.2012.6242474

133. Kim D, Kung J, Chai S, Yalamanchili S, and Mukhopadhyay S. Neurocube: a programmable digital neuromorphic architecture with high-density 3D memory. ACM SIGARCH Comput Archit News (2016) 44(3):380–92. doi: 10.1109/ISCA.2016.41

134. Gao M, Pu J, Yang X, Horowitz M, and Kozyrakis C. TETRIS: scalable and efficient neural network acceleration with 3D memory. ACM SIGARCH Comput Archit News (2017) 45(1):751–64. doi: 10.1145/3093337.3037702

135. Lee S, Kang SH, Lee J, Kim H, Lee E, Seo S, et al. Hardware architecture and software stack for PIM based on commercial DRAM technology: industrial product. In: Proceedings of the 2021 48th Annual International Symposium on Computer Architecture (ISCA); 2021 Jun 14–19; Valencia, Spain. New York, NY: IEEE (2021). 43–56. doi: 10.1109/ISCA52012.2021.00013

136. Xiang Y, Huang P, Han R, Li C, Wang K, Liu X, et al. Efficient and robust spike-driven deep convolutional neural networks based on NOR flash computing array. IEEE Trans Electron Devices (2020) 67(6):2329–35. doi: 10.1109/TED.2020.2987439

137. Choi WH, Chiu PF, Ma W, Hemink G, Hoang TT, Lueker-Boden M, et al. An in-flash binary neural network accelerator with SLC NAND flash array. In: Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS); 2020 Oct 12–14; Seville, Spain. New York, NY: IEEE (2020). 1–5. doi: 10.1109/ISCAS45731.2020.9180920

138. Kang M, Kim H, Shin H, Sim J, Kim K, and Kim LS. S-FLASH: a NAND flash-based deep neural network accelerator exploiting bit-level sparsity. IEEE Trans Comput (2022) 71(6). doi: 10.1109/TC.2021.3082003

139. Gao C, Xin X, Lu Y, Zhang Y, Yang J, and Shu J. ParaBit: processing parallel bitwise operations in NAND flash memory based SSDs. In: Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-54); 2021 Oct 17; Virtual Event, Greece. New York, NY: Association for Computing Machinery (2021). 59–70. doi: 10.1145/3466752.348007

140. Lee H, Kim M, Min D, Kim J, Back J, Yoo H, et al. 3D-FPIM: an extreme energy-efficient DNN acceleration system using 3D NAND flash-based in-situ PIM unit. In: Proceedings of the 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO); 2022 Oct 01–05; Chicago, IL, USA. New York, NY: IEEE (2022). 1359–76. doi: 10.1109/MICRO56248.2022.00093

141. Harrison J, Kubaska T, Story S, and Tang PTP. The computation of transcendental functions on the IA-64 architecture. Intel Technol J (1999) Q4:1–7. Available at: https://www.cl.cam.ac.uk/~jrh13/papers/itj.pdf

142. Lee D, Fong X, and Roy K. R-MRAM: a ROM-embedded STT MRAM cache. IEEE Electron Dev Lett (2013) 34(10):1256–8. doi: 10.1109/LED.2013.2279137

143. Lee D and Roy K. Area efficient ROM-embedded SRAM cache. IEEE Trans Very Large Scale Integr (VLSI) Syst (2012) 21(9):1583–95. doi: 10.1109/TVLSI.2012.2217514

144. Dutta T. Energy efficient hardware for neural network applications [Master’s thesis]. West Lafayette, IN: Purdue University School of Electrical and Computer Engineering (2023). doi: 10.25394/PGS.22779695.v1

145. Kim DE, Sharma T, and Roy K. Hardware-software co-design for accelerating transformer inference leveraging compute-in-memory. IEEE Trans Circuits Syst Artif Intell (2025). Early access. doi: 10.1109/TCASAI.2025.3601975

146. Akopyan F, Sawada J, Cassidy A, Alvarez-Icaza R, Arthur J, Merolla P, et al. TrueNorth: design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Trans Comput-Aid Des Integr Circuits Syst (2015) 34(10):1537–57. doi: 10.1109/TCAD.2015.2474396

147. Davies M, Srinivasa N, Lin T-H, Chinya G, Cao Y, Choday SH, et al. Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro (2018) 38(1):82–99. doi: 10.1109/MM.2018.112130359

148. Orchard G, Frady EP, Rubin DBD, Sanborn S, Shrestha SB, Sommer FT, et al. Efficient neuromorphic signal processing with Loihi 2. In: Proceedings of the 2021 IEEE Workshop on Signal Processing Systems (SiPS); 2021 Oct 19–21; Coimbra, Portugal. New York, NY: IEEE (2021). 254–9. doi: 10.1109/SiPS52927.2021.00053

149. Benjamin BV, Gao P, McQuinn E, Choudhary S, Chandrasekaran AR, Bussat JM, et al. Neurogrid: a mixed-analog-digital multichip system for large-scale neural simulations. Proc IEEE (2014) 102(5):699–716. doi: 10.1109/JPROC.2014.2313565

150. Schemmel J, Brüderle D, Grübl A, Hock M, Meier K, and Millner S. A wafer-scale neuromorphic hardware system for large-scale neural modeling. In: Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS); 2010 May 30–2010 Jun 2; Paris, France. New York, NY: IEEE (2010). 1947–50. doi: 10.1109/ISCAS.2010.5536970

151. Furber SB, Galluppi F, Temple S, and Plana LA. The SpiNNaker project. Proc IEEE (2014) 102(5):652–65. doi: 10.1109/JPROC.2014.2304638

152. Huo D, Zhang J, Dai X, Zhang J, Qian C, Tang KT, et al. ANP-G: A 28nm 1.04 pJ/SOP sub-mm2 spiking and back-propagation hybrid neural network asynchronous olfactory processor enabling few-shot class-incremental on-chip learning. In: Proceedings of the 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits); 2023 Jun 11–16; Kyoto, Japan. New York, NY: IEEE (2023). 1–2. doi: 10.23919/VLSITechnologyandCir57934.2023.10185410

153. Frenkel C and Indiveri G. ReckOn: a 28nm sub-mm2 task-agnostic spiking recurrent neural network processor enabling on-chip learning over second-long timescales. In: Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC); 2022 Feb 20–26; San Francisco, CA, USA. New York, NY: IEEE (2022). 1–3. doi: 10.1109/ISSCC42614.2022.9731734

154. Zhang J, Huo D, Zhang J, Qian C, Liu Q, Pan L, et al. 22.6 ANP-I: a 28nm 1.5 pJ/SOP asynchronous spiking neural network processor enabling sub-0.1 µJ/sample on-chip learning for edge-AI applications. In: Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC); 2023 Feb 19–23; San Francisco, CA, USA. New York, NY: IEEE (2023). 21–3.

155. Sharma D, Negi S, Dutta T, Agrawal A, and Roy K. A 65 nm 5 TOPS/W digital CIM accelerator with reconfigurable precision and temporal pipelining for spiking neural networks. In: Proceedings of the 2025 IEEE European Solid-State Electronics Research Conference (ESSERC); 2025 Sep 08–11; Munich, Germany. New York, NY: IEEE (2025). 5–8. doi: 10.1109/ESSERC66193.2025.11214115

156. Agrawal A, Ali M, Koo M, Rathi N, Jaiswal A, and Roy K. IMPULSE: a 65-nm digital compute-in-memory macro with fused weights and membrane potential for spike-based sequential learning tasks. IEEE Solid-State Circuits Lett (2021) 4:137–40. doi: 10.1109/LSSC.2021.3092727

157. Narayanan S, Taht K, Balasubramonian R, Giacomin E, and Gaillardon PE. SpinalFlow: an architecture and dataflow tailored for spiking neural networks. In: Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA); 2020 May 30–2020 Jun 3; Valencia, Spain. New York, NY: IEEE (2020). 349–62. doi: 10.1109/ISCA45697.2020.00038

158. Sharma D, Ankit A, and Roy K. Identifying efficient dataflows for spiking neural networks. In: Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design. New York, NY: Association for Computing Machinery (2022). 1–6. doi: 10.1145/3531437.3539704

159. Yik J, Van den Berghe K, den Blanken D, Bouhadjar Y, Fabre M, Hueber P, et al. The NeuroBench framework for benchmarking neuromorphic computing algorithms and systems. Nat Commun (2025) 16:1545. doi: 10.1038/s41467-025-56739-4

160. Srinivasan G, Lee C, Sengupta A, Panda P, Sarwar SS, and Roy K. Training deep spiking neural networks for energy-efficient neuromorphic computing. In: Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2020 May 4–8; Barcelona, Spain. New York, NY: IEEE (2020). 8549–53. doi: 10.1109/ICASSP40776.2020.9053914

161. Sengupta A, Parsa M, Han B, and Roy K. Probabilistic deep spiking neural systems enabled by magnetic tunnel junction. IEEE Trans Electron Devices (2016) 63(7):2963–70. doi: 10.1109/TED.2016.2568762

162. Sengupta A, Panda P, Wijesinghe P, Kim Y, and Roy K. Magnetic tunnel junction mimics stochastic cortical spiking neurons. Sci Rep (2016) 6(1):30039. doi: 10.1038/srep30039

163. Roy D, Chakraborty I, and Roy K. Scaling deep spiking neural networks with binary stochastic activations. In: Proceedings of the 2019 IEEE International Conference on Cognitive Computing (ICCC); 2019 July 8–13; Milan, Italy. New York, NY: IEEE (2019). 50–8. doi: 10.1109/ICCC.2019.00020

164. Bi GQ and Poo MM. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. J Neurosci (1998) 18(24):10464–72. doi: 10.1523/JNEUROSCI.18-24-10464.1998

165. Raiko T, Berglund M, Alain G, and Dinh L. Techniques for learning binary stochastic feedforward neural networks. In: Proceedings of the Third International Conference on Learning Representations (ICLR 2015) (2015). doi: 10.48550/arXiv.1406.2989

166. Jaiswal A, Fong X, and Roy K. Comprehensive scaling analysis of current induced switching in magnetic memories based on in-plane and perpendicular anisotropies. IEEE J Emerg Sel Top Circuits Syst (2016) 6(2):120–33. doi: 10.1109/JETCAS.2016.2547698

167. Sun W, Gao B, Chi M, Xia Q, Yang JJ, Qian H, et al. Understanding memristive switching via in situ characterization and device modeling. Nat Commun (2019) 10(1):3453. doi: 10.1038/s41467-019-11411-6

168. Tuma T, Pantazi A, Le Gallo M, Sebastian A, and Eleftheriou E. Stochastic phase-change neurons. Nat Nanotechnol (2016) 11(8):693–9. doi: 10.1038/nnano.2016.70

169. Camsari KY, Sutton BM, and Datta S. p-bits for probabilistic spin logic. Appl Phys Rev (2019) 6(1):011305. doi: 10.1063/1.5055860

170. Liyanagedera CM, Sengupta A, Jaiswal A, and Roy K. Stochastic spiking neural networks enabled by magnetic tunnel junctions: from nontelegraphic to telegraphic switching regimes. Phys Rev Appl (2017) 8(6):064017. doi: 10.1103/PhysRevApplied.8.064017

171. Kharya P and Alvi A. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, the world’s largest and most powerful generative language model [online]. (2021). Available at: https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/

172. Sharma T, Ali M, Chakraborty I, and Roy K. What, when, where to compute-in-memory for efficient matrix multiplication during machine learning inference. IEEE Trans Emerg Top Comput (2025) 13(3):1215–29. doi: 10.1109/TETC.2025.3574508

Keywords: artificial intelligence, neural network acceleration hardware, memory wall, spiking neural networks, hardware algorithm co-design, compute-in-memory, approximate computing

Citation: Roy K, Kosta A, Sharma T, Negi S, Sharma D, Saxena U, Roy S, Raghunathan A, Wan Z, Spetalnick S, Liu C-K and Raychowdhury A. Breaking the memory wall: next-generation artificial intelligence hardware. Front Sci (2025) 3:1611658. doi: 10.3389/fsci.2025.1611658

Received: 14 April 2025; Revised: 31 July 2025; Accepted: 13 November 2025;
Published: 16 December 2025.

Edited by:

Saptarshi Das, The Pennsylvania State University (PSU), United States

Reviewed by:

Khaled Nabil Salama, King Abdullah University of Science and Technology, Saudi Arabia
John Paul Strachan, Forschungszentrum Jülich GmbH, Germany

Copyright © 2025 Roy, Kosta, Sharma, Negi, Sharma, Saxena, Roy, Raghunathan, Wan, Spetalnick, Liu and Raychowdhury. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Kaushik Roy, kaushik@purdue.edu

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.