- Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA, United States
Introduction: Single Flux Quantum (SFQ) superconducting technology offers major performance and energy advantages over CMOS as Dennard scaling wanes. Yet, SFQ CPUs face key challenges: the Josephson Junction (JJ) budget limits manufacturability, and the lack of dense on-chip memory restricts scalability. Control and memory structures dominate JJ usage, motivating new architectures that exploit SFQ strengths while mitigating its limitations.
Methods: We introduce Icy-Hot, a hybrid CPU architecture that splits computation across two cryogenic zones. The 4 K Icy Zone, implemented with SFQ logic, performs high-speed execution, while the 77 K Hot Zone, built with CMOS, handles fetch, decode, and control. Compiler-inserted metadata and compact SFQ memory structures—a Hot-Driven Register File and a shift-register-based Dependency Buffer—enable decoupled pipeline execution and reduced cross-zone communication.
Results: Cycle-accurate simulations and SFQ circuit synthesis demonstrate that Icy-Hot achieves a 38% total power improvement over a 77 K CMOS baseline while using only ≈220,000 JJs, an 8× reduction from naïve SFQ cores. Loop-intensive workloads with high operand reuse achieve up to 4.5× performance speedup due to efficient local execution in the Icy Zone.
Discussion: By co-optimizing cryogenic CMOS and SFQ circuits, Icy-Hot demonstrates a practical path toward scalable, energy-efficient superconducting processors. The design captures SFQ’s intrinsic efficiency while addressing memory and control bottlenecks, marking a foundational step toward general-purpose SFQ CPU architectures.
1 Introduction
The decline of Dennard scaling (Mudge, 2015) has led to increasingly unsustainable power consumption, motivating the exploration of post-CMOS technologies. Among emerging alternatives, Single Flux Quantum (SFQ) technology, based on Josephson Junctions (JJs), has garnered significant interest as a next-generation solution. SFQ circuits, operating at cryogenic temperatures, exhibit superconductivity and zero resistance. This results in ultra-fast computation speeds (tens of GHz) (Likharev and Semenov, 1991; Nagaoka et al., 2019) and extremely low switching energy (on the order of 10⁻¹⁹ J per switching event).
While JJ-level efficiency is promising, realizing these benefits at the architectural level requires careful CPU design exploration. Prior work has established several foundational building blocks for SFQ CPUs (Zha et al., 2022; Katam et al., 2017; Dorojevets and Chen, 2015; Fujiwara et al., 2004; Tang et al., 2015; Ando et al., 2016; Kundu et al., 2019; Kirichenko et al., 2019; Qu et al., 2020), including SFQ-specific ALUs (Dorojevets et al., 2012; Filippov et al., 2011; Kundu et al., 2019; Ando et al., 2016; Tang et al., 2015), register files, and branch predictors (Fujiwara et al., 2003; Zha et al., 2022; Zha et al., 2023). A bit-serial SFQ CPU prototype has also been demonstrated (Ando et al., 2016).
One of the greatest challenges in scaling SFQ CPUs is the limited number of JJs available. Currently, MIT Lincoln Lab can fabricate SFQ chips with around 20,000 JJs, with projections of up to 100,000 by 2026 on a single chip.
To assess the architectural impact of JJ constraints, we synthesized an in-order CPU (with no on-chip cache) using open-source SFQ cell libraries (Fourie et al., 2019; Schindler et al., 2021). As described in Section 4, our synthesis showed that the CPU consumed nearly two million JJs, with 34% dedicated to decoding and dependency logic and about 50% spent on path balancing. Control structures, such as scoreboards for dependency tracking, are particularly expensive, due to SFQ’s lack of dense memory structures. All storage must be constructed with flip-flop-like cells, incurring additional path balancing and consuming excessive JJs.
Consequently, designs that require large on-chip memory—such as caches, large register files, or complex branch predictors—are difficult to implement. While ongoing research aims to build high-capacity SFQ memory (Ryazanov et al., 2012; Vernik et al., 2013), practical solutions remain out of reach. This motivates a new design approach that minimizes reliance on SFQ-based memory and control.
We propose Icy-Hot, a hybrid CPU architecture that splits computation across two temperature zones. The 77K “Hot Zone” implements memory and control using CMOS, while the 4K “Icy Zone” houses JJ-efficient SFQ ALUs. By isolating only the execution stage in the Icy Zone, the design maximizes SFQ efficiency while reducing JJ use for memory and control.
To support this thermal split, we introduce lightweight SFQ memory structures in the Icy Zone: a small Hot-Driven Register File cache (HDRFC) for frequently accessed data, and a larger shift-register-based Dependency Buffer for sparsely used operands (Ando et al., 2016; Kundu et al., 2019; Kawaguchi and Takagi, 2022). Instructions are decoded in the Hot Zone and transmitted as coarse-grained blocks (e.g., loops) to the Icy Zone for high-speed execution.
Since the Icy Zone can run significantly faster than the Hot Zone, decoupling minimizes communication overhead. Decoded instructions and operands are stored in compact shift registers. Random-access memory is avoided due to its high JJ cost.
1.1 The context and limitations of SFQ CPUs
Skeptics may question the feasibility of CPUs without complex predictors and large memory—features common in modern out-of-order designs. Yet, CMOS CPUs also began with simpler in-order architectures. Likewise, SFQ designs must evolve through incremental progress. Ongoing work in SFQ library design (Fourie et al., 2019), place-and-route (Katam et al., 2017), and major government investments (NSF, 2019) suggest that SFQ CPUs are a practical research direction, particularly for scaling quantum systems.
Our proposed Icy-Hot CPU is built using these tools, allowing exploration of a JJ-limited architecture with modest speculation support. Though not yet competitive with out-of-order CMOS CPUs in raw performance, our design demonstrates the viability of a hybrid CMOS-SFQ approach. Prior work on SFQ accelerators (Ishida et al., 2020; Zokaee and Jiang, 2021) has not addressed JJ limitations. In contrast, our work explicitly targets a JJ-constrained, general-purpose CPU architecture.
This paper makes the following contributions:
• We present the Icy-Hot architecture, which decouples execution from control and memory logic to optimize JJ usage and overall energy efficiency.
• We address challenges in control overhead, memory density, and thermal zone communication by leveraging compiler hints and shift-register-based memory structures in the Icy Zone.
• Our evaluation shows that the Icy-Hot design achieves a 36.9% power improvement over conventional CPU designs.
2 Background
2.1 Hardware components
2.1.1 SFQ logic technology
Single-flux-quantum (SFQ) technology uses a magnetic pulse, called a single flux quantum or fluxon, to transfer signals between logic gates. The fluxon is stored in a superconducting loop that consists of Josephson junctions (JJs); the presence of a fluxon is interpreted as "1", while its absence represents "0" within the circuit. Most SFQ logic gates perform their logic function only when an external clock is applied, transferring the output pulse, if any, to the next gate.
Figure 1b shows an exemplary SFQ logic circuit, which uses gate-level clocking in contrast to the stage-level clocking of the traditional CMOS pipeline shown on the left. Each gate in the circuit receives an explicit clock, which activates the gate to perform its function. Notice the presence of path-balancing DFFs in the circuit. Without them, the inputs to the last XNOR gate may traverse different numbers of gates and hence arrive at different times, causing the SFQ gates to generate wrong outputs. The added D flip-flops (DFFs) within the SFQ circuit guarantee that all input pulses to a gate arrive in the same clock period.
Figure 1. Comparison between (a) CMOS logic design and (b) SFQ logic design. SFQ logic requires path-balancing DFFs to utilize gate-level clocking.
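To make the path-balancing overhead concrete, the sketch below (our own illustrative example; the netlist, gate names, and levels are assumptions, not a circuit from this work) counts how many DFFs a naive balancing pass must insert so that every input of a gate crosses the same number of clocked stages.

```python
# Illustrative sketch: count path-balancing DFFs for a small gate-level-
# pipelined netlist. Every gate is one clocked stage, so a fanin arriving
# from a shallower logic level needs one DFF per missing stage.

def path_balancing_dffs(netlist, primary_inputs):
    """netlist: {gate: [fanin names]}; returns (levels, total DFFs inserted)."""
    level = {p: 0 for p in primary_inputs}

    def depth(node):
        if node not in level:
            level[node] = 1 + max(depth(f) for f in netlist[node])
        return level[node]

    for gate in netlist:
        depth(gate)

    dffs = sum(level[g] - 1 - level[f]
               for g, fanins in netlist.items() for f in fanins)
    return level, dffs

# Hypothetical three-gate circuit in the spirit of Figure 1b: the shorter
# paths into "or1" and "xnor1" must be padded (3 DFFs in this example).
netlist = {"and1": ["a", "b"], "or1": ["and1", "c"], "xnor1": ["or1", "d"]}
levels, dffs = path_balancing_dffs(netlist, primary_inputs=["a", "b", "c", "d"])
print(levels, "path-balancing DFFs:", dffs)
```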
Figure 2 shows a cryogenic cooler used in SFQ technology (Razmkhah and Bozbey, 2019). The cooler contains a superconducting chip carrier operating at 4.2 K. Data is transferred to the 4.2 K chip carrier through superconducting cables; the physical properties of the cables, such as their length, determine the transmission latency. Bandwidth is also constrained, since only a limited number of cables can traverse the thermal zones. In our design, we assume 64 input and 32 output wires entering and exiting the Icy zone.
2.1.2 Shift register
Random-access SFQ memories have low memory density since memory cells are primarily implemented as flip-flops. Random access also requires peripheral logic, such as a decoder (to select the memory cell for a given address) and a merger (to output data). Splitter and merger cells consume JJs and increase access latency: reaching a 32-entry register file requires a five-level splitter tree, which adds five cycles of access latency. Prior work (Fujiwara et al., 2003) demonstrates decoder implementations using 39 JJs and 620 JJs for one-to-two and one-to-four decoders, respectively. Furthermore, prior work (Tanaka et al., 2016) achieves 25 JJs/bit for NDRO-based RAM, while more recent work (Zha et al., 2022) that uses the higher-density HC-DRO memory cell reaches 15 JJs/bit. The JJ count per bit drops significantly when shift registers are used as a standalone memory structure: (Yuh and Mukhanov, 1992) proposes a design with five JJs/bit, and the most efficient shift-register memory design (Mukhanov, 1993) realizes a one-bit memory cell with only four JJs.
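As a concrete comparison using the per-bit figures cited above, the back-of-envelope arithmetic below (our own calculation; peripheral decoder and merger JJs are deliberately ignored) contrasts the JJ cost of a 32-entry by 32-bit operand store built from the three cell types.

```python
# Back-of-envelope JJ cost for a 32-entry x 32-bit operand store using the
# per-bit figures cited above (peripheral decoder/merger JJs not included).
bits = 32 * 32
jj_per_bit = {
    "NDRO RAM (Tanaka et al., 2016)": 25,
    "HC-DRO RAM (Zha et al., 2022)": 15,
    "shift register (Mukhanov, 1993)": 4,
}
for name, jj in jj_per_bit.items():
    print(f"{name}: {bits * jj:,} JJs")
# Roughly 4,100 JJs for the shift register versus ~25,600 for NDRO RAM,
# which is why Icy-Hot favors shift registers for Icy-zone operand storage.
```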
To exploit the JJ efficiency of shift registers over RAM, Icy-Hot relies on shift registers as the primary operand storage structure in the Icy zone. Our proposed design uses a parallel shift register design that allows for higher accessing efficiency and bandwidth, as has been demonstrated before (Xu et al., 2021).
A shift register is built from a column of D flip-flops, each holding one bit of data. When given a clock pulse, the entire column transfers all of its data to the next logic gate, eliminating the shift latency of a conventional serial design. The internal feedback loop in the shift register is controlled by non-destructive readout (NDRO) cells available in the open-source SFQ cell library. The maximum operating frequency of this parallel register has been demonstrated to be 34 GHz (Xu et al., 2021).
A 32×32 parallel shift register has an estimated access time of 910 ps (including decode time, clock delay, and reading time).
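The behavioral sketch below (a simplified model of our own, not the circuit netlist) captures the operation described above: one clock pulse advances every word, and the NDRO-controlled feedback loop recirculates the word read from the front, making readout non-destructive.

```python
from collections import deque

class ParallelShiftRegister:
    """Behavioral sketch of a bit-parallel shift register: one clock pulse
    moves every word one position; the NDRO feedback loop re-inserts the
    word read from the front so readout is non-destructive."""

    def __init__(self, depth, width=32):
        self.width = width
        self.entries = deque(maxlen=depth)

    def push(self, word):
        self.entries.appendleft(word & ((1 << self.width) - 1))

    def rotate_read(self):
        """One clock: read the front word and recirculate it to the back."""
        word = self.entries.popleft()
        self.entries.append(word)
        return word

reg = ParallelShiftRegister(depth=4)
for v in (0x11, 0x22, 0x33, 0x44):
    reg.push(v)
print([hex(reg.rotate_read()) for _ in range(4)])  # full non-destructive rotation
```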
2.1.3 Register file design
SFQ memory design conventionally uses a destructive readout (DRO) cell or a non-destructive readout (NDRO) cell to store a single pulse. Recently, a high-density High-Capacity Destructive Readout (HC-DRO) cell (Katam et al., 2020) has been proposed, along with a corresponding register file design, HiPerRF (Zha et al., 2022). Compared to the single bit stored in conventional cells, an HC-DRO cell can store up to three pulses, representing 2 bits per cell. HiPerRF adds NDRO-like capability, in which the cell retains its bits after readout through a custom-designed output port. Even with these capabilities, the achievable register file size is limiting when building a CPU with a large physical register file. Hence, our design may use a HiPerRF-like design to build a tiny register file cache within the Icy zone.
2.1.4 α-cell multiplexer
Since multiplexers are used extensively in a CPU, for example in the data-forwarding logic of the execution units, the Icy-Hot design uses a novel α-cell-based multiplexer (Karamuftuoglu and Pedram, 2023) to reduce the JJ count of these structures.
3 Related works
3.1 RSFQ microprocessors
Several previous works have aimed to design rapid single-flux-quantum (RSFQ) microprocessors. (Dorojevets and Chen, 2015; Fujiwara et al., 2004; Zha et al., 2022) aim to build register files and storage within a microprocessor design. (Kirichenko et al., 2019; Qu et al., 2020) implement 8-bit bit-parallel arithmetic logic units (ALUs), and (Dorojevets et al., 2012; Filippov et al., 2011) propose 8-bit asynchronous ALU designs. (Tang et al., 2015) proposes a 4-bit bit-slice ALU targeting a 32-bit microprocessor operating at 50 GHz. However, these works focus on the execution units of a microprocessor with a limited ISA and data length. (Ando et al., 2016) demonstrates an 8-bit bit-serial microprocessor with a custom-designed register file and ALU; the authors fabricated and tested the microprocessor system with a 256-bit memory.
(Dorojevets et al., 2010) explored microarchitectures for 32-bit RSFQ CPUs. They showed a cell-level design and simulation results for a 32-bit datapath using asynchronous hybrid wave-pipelining techniques. However, this work focused only on the datapath design: their goal was a datapath that could keep up with the high frequency of SFQ technology, and they did not take the CPU control path into consideration. As for memory, they only considered the interface needed so that their memory structure could keep up with the datapath; the size limit of SFQ memory was outside their exploration scope.
In CryoCore, Byun et al. (2020) demonstrate a 77 K-optimized CMOS CPU design that reduces both the size and number of cooling-unfriendly microarchitecture units. CryoCore uses voltage scaling to achieve both power efficiency and high clock frequency. However, that work is fundamentally a CMOS-based design exploration and differs from SFQ technology. While the paper demonstrates a power benefit, achieving 5.5 W per core compared to an optimized high-performance core (i7-6700) running at 77 K, SFQ-based microarchitecture designs achieve orders of magnitude higher power efficiency than CMOS designs (Mukhanov, 2011; Scott Holmes, 2020). In addition, the different physical properties of SFQ and CMOS enable different microarchitectural design choices, such as gate-level pipelining, which make it difficult to compare CryoCore and our proposed design quantitatively.
3.2 Decoupled access/execute computer architectures
In Decoupled Access/Execute Computer Architectures (Smith, 1982), the author introduces the idea of separating memory access and execution into two independent instruction streams managed by distinct processors. This decoupling enables asynchronous communication through queues, allowing the architecture to reduce the effects of memory latency and improve performance without relying on complex out-of-order issue logic.
This foundational idea has since been extended to various CMOS-based architectures, including CGRAs and GPUs, and has also inspired compiler frameworks that target such decoupled execution architectures. For CGRAs, Mage demonstrates a tailored DAE architecture for static control programs by offloading address generation to a dedicated AGU, improving area efficiency and reducing PE complexity (Naclerio et al., 2025). Similarly, the Stream-Dataflow architecture extends DAE to a CGRA substrate using explicit stream commands and a dataflow execution model to enable high parallelism with efficient memory coordination (Nowatzki et al., 2017). In the GPU domain, decoupled access/execute has been applied by restructuring fragment processors to separate memory and compute paths, allowing memory access to proceed independently and reducing stalls from latency-sensitive operations. This approach helps increase throughput in mobile GPUs, where energy and bandwidth constraints are significant (Arnau et al., 2012).
On the compiler side, frameworks such as those discussed in the DAE compilers work develop static and profiling-based transformations to extract and schedule decoupled memory and compute phases, targeting energy and performance improvements even on conventional CMOS platforms. These efforts demonstrate how compiler analysis can automate DAE transformations across different hardware targets, making the abstraction more practical and portable (Szafarczyk et al., 2025).
Our design also builds upon the core principle of decoupling access and execution found in DAE architectures, where separate instruction streams improve performance by overlapping memory and computation. However, unlike the CMOS-based DAE model, our Icy-Hot architecture must operate across a cryogenic and room-temperature boundary, fundamentally altering how data is accessed, stored, and transmitted. These constraints necessitate a different implementation strategy, relying on compact shift-register-based operand storage and minimizing communication across thermal zones to manage latency and bandwidth limitations.
4 Motivation
SFQ technology is considered a viable post-Dennard/post-Moore solution. However, the JJ-count constraints of current fabrication technology mean that realizing an SFQ CPU requires significant architectural innovation. To quantify these JJ limitations, we synthesized a partial 32-bit in-order RISC-V core (Ultraembedded, 2013) using the open-source SFQ cell libraries and the qPalace synthesis tool (Fourie et al., 2019). To make synthesis feasible, the instruction and data caches, as well as the multiplier and divider units, were removed; all of these proved to be JJ-intensive. Table 1 quantifies the total JJ count of each unit, showing that even a simple in-order CPU exceeds the fabrication budget. The results also show that around 40%–50% of the JJs are used for path balancing. We surmise that synthesizing the full 32-bit core with caches would use around 10× more JJs than the 1.8M used in our synthesis.
This data motivates us to pursue the Icy-Hot design, where the compute engine is decoupled from the memory- and control-intensive structures and uses only a few hundred thousand JJs. We measured the contribution of execution units to overall power consumption using the default Xeon out-of-order configuration template in Gem5 (Binkert et al., 2011). We ran several Parsec benchmarks (Bienia et al., 2008) (Swaptions, Canneal, Fluidanimate, and Freqmine) and fed the resulting execution statistics to McPAT (Li et al., 2009). Our results showed that, on average, 48.5% of the total power is consumed by the execution units while running these workloads. Note that the execution units would account for an even larger fraction of power in in-order processors. Thus, an ideal Icy-Hot system can eliminate much of the execution-unit power consumption.
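For illustration, the fraction reported above is obtained as shown below once per-component power numbers are available; the component breakdown in this sketch is hypothetical and is not the actual McPAT output of our Gem5 runs.

```python
# Illustrative only: how the execution-unit fraction of total core power is
# derived from per-component power. The numbers below are placeholders,
# not the actual McPAT output for our Gem5 runs.
component_power_w = {  # hypothetical per-component runtime power (watts)
    "execution_units": 9.7,
    "fetch_decode": 3.1,
    "load_store": 4.2,
    "caches_and_other": 3.0,
}
total = sum(component_power_w.values())
frac = component_power_w["execution_units"] / total
print(f"execution units: {frac:.1%} of {total:.1f} W")
```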
5 Icy-Hot CPU design
Figure 4 presents an overview of the Icy-Hot design. The left part of the figure, labeled Hot zone (at the top left corner), is where the instruction fetch, decode, dispatch, and commit operations are performed. External memory hierarchy, including a large cache structure, DRAM, etc., is also placed in the Hot zone. The Hot zone may be built using cryogenic CMOS, as has been demonstrated by recent works (Byun et al., 2020). The right half of the figure, labeled Icy zone (at the top right corner), is where the SFQ execution engines and the supporting operand storage structures are placed. The Icy zone executes all instructions and stores any state within these storage structures.
While the design principles for Icy-Hot apply to any general-purpose applications, our current design is targeted toward supporting loop-intensive applications. Such applications allow the execution of a loop body in the Icy zone without needing to interface with the Hot zone frequently. Loop codes are also amenable to efficient data preloading and static data dependence analysis, which we exploit in our design, as detailed later.
The microarchitecture blocks that are newly designed for the Icy-Hot CPU are highlighted in the figure using numbered boxes. We will describe the functions of these new blocks as we go through the design details in this section. We first describe the general operation of the two zones and then describe how memory and branch instructions are handled in more depth at the end of this section.
5.1 Hot zone
The fetch and decode stages of the core operate in the Hot zone. Every instruction, except for memory read instructions (such as load and mov), is fetched and decoded in the Hot zone and executed in the Icy zone. To efficiently transfer the decoded instruction, control bits, and instruction operands, the Hot zone uses the Transmission Buffer (labeled (1) in the figure).
5.1.1 Transmission Buffer
The Transmission Buffer is a map-like data structure used to ensure that dependencies between instructions are identified and the correct source value is obtained. The source operands of an instruction are available in one of three operand storage structures in the Icy zone: (1) the shift-register operand storage (labeled (2) in the figure), (2) the Hot-Driven Register File (labeled (3)), or (3) the Dependency Buffer (labeled (4)). The actual location of the source values is encoded into the instruction's opcode as metadata by the compiler. In our current implementation, we emulate this compiler functionality by marking the source operand location information through a post-binary analysis with the PIN tool (Luk et al., 2005). If the operand value is already available in the large register file in the Hot zone, that value is read, attached to the decoded instruction bits, and placed in the Transmission Buffer. If the operand value is produced by a prior parent instruction, then that operand will be available in the Hot-Driven Register File or the Dependency Buffer.
All decoded instructions enter the Transmission Buffer. The source operands are marked with 2-bit metadata indicating where the operands will be present in the Icy zone. If a source operand is already available in the Hot zone, its value is read and stored alongside the instruction opcode, and the operand's metadata is set to "00" to indicate that the data value is carried within the instruction packet. If the source operand is produced and available in the Hot-Driven Register File, the operand's metadata is set to "01". If the source operand is produced and available in the Dependency Buffer, the operand's metadata is set to "10". The destination register of each instruction is also tagged with metadata indicating where the data will eventually be placed within the Icy zone. Determining where an instruction deposits a register value that a dependent instruction will later use is done by a post-binary analysis process in our implementation, and it can also be easily implemented in a compiler during binary generation. The policies that govern operand placement are described later in this section.
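The sketch below illustrates the 2-bit source-operand metadata just described; the function and variable names are ours, and the placement sets stand in for the output of the post-binary analysis.

```python
# Sketch of the 2-bit source-operand metadata described above. Encoding:
#   "00" value shipped inside the instruction packet (read in the Hot zone)
#   "01" value will be in the Hot-Driven Register File
#   "10" value will be in the Dependency Buffer
IN_PACKET, IN_HDRF, IN_DEP_BUF = 0b00, 0b01, 0b10

def tag_sources(instr, producers_in_hdrf, producers_in_dep_buf, hot_regfile):
    """Return (metadata, inline value or None) for each source register.
    The placement sets come from the post-binary (compiler) analysis."""
    tagged = []
    for reg in instr["sources"]:
        if reg in producers_in_hdrf:
            tagged.append((IN_HDRF, None))
        elif reg in producers_in_dep_buf:
            tagged.append((IN_DEP_BUF, None))
        else:  # value already architecturally available in the Hot zone
            tagged.append((IN_PACKET, hot_regfile[reg]))
    return tagged

add = {"op": "add", "sources": ["x5", "x6"], "dest": "x7"}
print(tag_sources(add, producers_in_hdrf={"x5"},
                  producers_in_dep_buf=set(), hot_regfile={"x6": 42}))
```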
To reduce data movement between the Icy and Hot zones, the Transmission Buffer stores a block of instructions, such as a loop body (up to the Transmission Buffer capacity), and sends the entire block of decoded instructions to the Icy zone. The Icy zone then executes the loop code without needing to communicate with the Hot zone until the loop's end condition. This decoupling enables the Hot zone to keep pace with the Icy zone. For non-loop code, the decoupling may not be as effective; hence, as explained earlier, the Icy-Hot design benefits loop-oriented applications.
5.2 Icy zone
The execution units operate in the Icy zone. They read the decoded instructions and their operands from the shift-register buffers. Recall that these values are moved into these buffers from the Transmission Buffer in the Hot zone.
5.2.1 Operand and dependency buffers
Shift register operand storage (labeled (2) in the figure) is used to store the opcode, source, and destination values. The execution unit reads the instruction opcode and operand meta bits from the front of the opcode buffer. If the operand metadata indicates that the value is already available in the operand buffer, then the value is fed directly for execution. If the operand metadata indicates that the value is available in the Hot-Driven Register File, then the value is read from that register file.
If the operand data is available in the Dependency Buffer, then the process of reaching the data needed is more complex. The Dependency Buffer is a circular shift register design. Each shift register entry has a 5-bit register number and the corresponding register value that was produced by a prior parent instruction. Since the shift register can only be accessed from the front, the source operand is located by circulating through the entries until the source register number is matched. Hence, the reading of a source operand from the Dependency Buffer can take a variable number of cycles depending on where the source is present within the shift register queue.
While accessing data from a variable-delay shift register may seem inefficient, it is important to note that random-access register files are typically accessed through a tree of NDROC cells (Non-Destructive ReadOut with Complementary output). NDROC cells were proposed in prior work (Tanaka et al., 2016; Ando et al., 2016) and were later used to build an SFQ-specific register file (Zha et al., 2022). Each level in the NDROC tree requires a cycle to traverse, so each access to the Hot-Driven Register File also incurs multiple cycles. In fact, accessing a single register from the random-access register file may take longer than accessing the front few elements of a shift register. We performed a detailed simulation analysis to identify how far a shift register has to be rotated to reach the desired data: across all our benchmarks, on average, the data was available within the top three entries of the shift register. Note that a 32-entry register file would consume five cycles to access a register through the NDROC tree. These are unique constraints imposed by gate-level SFQ designs. Hence, even though the shift-register-based Dependency Buffer may appear to take multiple cycles to rotate through its values, its average access latency is similar to the multi-cycle register file access.
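The latency argument above can be summarized with a small model (our own illustrative sketch, not measured data): the register file cost grows with the depth of the NDROC splitter tree, roughly the base-2 logarithm of the entry count, while the Dependency Buffer cost is the number of rotations needed to reach the matching register tag.

```python
import math

def regfile_access_cycles(num_entries):
    """NDROC splitter-tree depth: one cycle per tree level."""
    return math.ceil(math.log2(num_entries))

def dep_buffer_access_cycles(buffer_regs, wanted_reg):
    """Cycles to rotate the circular Dependency Buffer until the tag matches."""
    return buffer_regs.index(wanted_reg) + 1

print(regfile_access_cycles(32))                                  # 5 cycles for a 32-entry RF
print(dep_buffer_access_cycles(["x7", "x3", "x9", "x1"], "x3"))   # 2 cycles
# When the operand usually sits in the top few entries (top-3 on average in
# our benchmarks), the rotation is no slower than the tree-based register file.
```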
5.2.2 Handling Suzuki stack delay with the Output Buffer
The Output Buffer is another parallel shift register that stores values that need to be sent to the Hot zone. The Output Buffer is necessary because of a frequency limitation on the Suzuki stack (Mustafa and Köse, 2022), which is used to transmit data out of the 4 K zone. In particular, transmitting data from the Icy zone to the Hot zone requires a Suzuki stack, which amplifies the pulse voltages to levels compatible with CMOS operation. This process is slow; hence, the Output Buffer supplies values to the Suzuki stack so that computation in the Icy zone does not need to stall. Only values to be transmitted to the Hot zone enter the Output Buffer. If a value is not going to be transmitted to the Hot zone, it is sent for immediate reuse by the ALU, the Dependency Buffer, or the Hot-Driven Register File. The destination of the output value is determined by metadata generated by the compiler based on the distance of the data dependency.
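A minimal sketch of this result-routing policy is shown below; the metadata labels, buffer names, and the Suzuki-stack drain interval are our own placeholders rather than values from the implemented design.

```python
from collections import deque

# Destination metadata (our labels): where the compiler says the result is
# next needed. Only TO_HOT values pay the Suzuki-stack transmission cost.
TO_FORWARD, TO_DEP_BUF, TO_HDRF, TO_HOT = range(4)

output_buffer = deque()          # results waiting for the Suzuki stack
SUZUKI_INTERVAL = 10             # assumed Icy cycles between transmissions

def route_result(dest_meta, value, dep_buffer, hdrf, dest_reg):
    if dest_meta == TO_HOT:
        output_buffer.append(value)          # drained every SUZUKI_INTERVAL cycles
    elif dest_meta == TO_DEP_BUF:
        dep_buffer.appendleft((dest_reg, value))
    elif dest_meta == TO_HDRF:
        hdrf[dest_reg] = value
    # TO_FORWARD: value is consumed immediately by the next ALU operation

dep_buf, hdrf = deque(), {}
route_result(TO_HDRF, 123, dep_buf, hdrf, "x7")
route_result(TO_HOT, 456, dep_buf, hdrf, "x8")
print(hdrf, list(output_buffer))
```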
5.3 Handling memory operations
In this design, we assume that the necessary data (that fits within our SFQ memory structures) has already been transferred onto the superconducting fabric prior to execution in the Icy zone. This assumption is grounded in the use of established data movement strategies, such as prefetching and preloading, which proactively bring data closer to the point of computation. Prefetching techniques, especially stride-based schemes, anticipate upcoming memory accesses by analyzing access patterns in loop-heavy workloads and load the corresponding data into the Hot zone memory hierarchy. Subsequently, the preloaded data is explicitly pushed into the Icy zone, ensuring its availability before execution begins. By relying on this upfront data movement, the Icy zone can operate with minimal stalls and reduced dependency on real-time memory access, which is critical given the latency and bandwidth limitations of the cryogenic interconnect between thermal zones. However, due to the limited storage capacity within the Icy zone, some data will need to be transmitted close to its execution, resulting in processor stalls that are accounted for in our simulation.
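As a simplified illustration of this preloading strategy (the stride detector and capacity check below are our own sketch, not the mechanism used in the evaluated system), a constant-stride load stream can be detected and the next iterations' data pushed into the Icy zone before the loop block is dispatched.

```python
# Simplified sketch: detect a constant stride in a loop's address stream and
# plan which addresses to preload into the Icy-zone buffers before the block
# is dispatched (capacity permitting).
def detect_stride(addresses):
    deltas = {b - a for a, b in zip(addresses, addresses[1:])}
    return deltas.pop() if len(deltas) == 1 else None

def preload_plan(last_addr, stride, iterations, icy_capacity_words):
    n = min(iterations, icy_capacity_words)
    return [last_addr + stride * (i + 1) for i in range(n)]

history = [0x1000, 0x1008, 0x1010, 0x1018]       # observed load addresses
stride = detect_stride(history)                  # 8 bytes
print(preload_plan(history[-1], stride, iterations=6, icy_capacity_words=4))
# Data beyond the Icy-zone capacity is shipped later and may stall execution,
# as accounted for in our simulation.
```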
6 Methodology
6.1 Icy-Hot simulator
We implemented all the Icy-Hot functionality in a cycle-accurate simulator based on the PIN binary instrumentation tool (Luk et al., 2005). The simulator is driven by instruction traces obtained from the instrumented binary. The front end of our simulator attaches the compiler-generated metadata to the source and destination registers. For instance, the simulator considers a loop block of instructions to determine which destination registers should be mapped to the Hot-Driven Register File and which should be mapped to the Dependency Buffer. In summary, our simulator's front end emulates the compiled-code behavior, including the metadata information required for Icy zone execution.
The code is then executed on the Icy-Hot design. The Execute stage of the Icy zone ALU has multiple gate-level pipeline stages, and the ALU pipeline depth is a simulation parameter. Since SFQ processors can run at a much higher frequency than CMOS (Scott Holmes, 2020), we assume the Icy zone runs at ten times the frequency of the Hot zone. Therefore, when simulating the Icy-Hot design, we execute the Fetch, Decode, Memory, and Writeback stages only every ten cycles, while the Execute stage runs every cycle.
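A minimal sketch of this 10:1 clocking assumption is shown below; the stage functions are placeholders for the simulator's actual logic.

```python
# Minimal sketch of the simulator's 10:1 clock-ratio assumption: the Icy-zone
# Execute stage ticks every cycle, the Hot-zone stages once every 10 Icy
# cycles. Stage functions are placeholders for the real simulator logic.
ICY_PER_HOT = 10

def run(n_icy_cycles, stages):
    for cycle in range(n_icy_cycles):
        stages["execute"](cycle)                 # Icy zone, every cycle
        if cycle % ICY_PER_HOT == 0:             # Hot zone, every 10th cycle
            for name in ("fetch", "decode", "memory", "writeback"):
                stages[name](cycle)

trace = []
stages = {s: (lambda c, s=s: trace.append((c, s)))
          for s in ("execute", "fetch", "decode", "memory", "writeback")}
run(21, stages)
print(sum(1 for _, s in trace if s == "execute"),   # 21 Execute ticks
      sum(1 for _, s in trace if s == "fetch"))      # 3 Hot-zone ticks
```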
The simulation parameters (summarized in Table 2), such as the ALU pipeline depth, are based on physics- and device-level characterization data for SFQ ALU designs from the GitHub repository (Schindler, 2013), which is validated as part of an IARPA superconducting design tools initiative. The library provides the usual CMOS-like characterization data, such as power consumption, the number of JJs used, and path-balancing costs in terms of JJ counts. This tool was also used to determine the simulated frequency of the Icy zone: our 32-bit integer adder achieved post-route timing closure at 8.5 GHz. Given that the tool's assumptions are intentionally conservative, we view this 8.5 GHz as a lower-bound estimate and, as such, extrapolate the Icy zone to be 10× faster than the Hot zone (assuming the Hot zone runs at 1 GHz). Similarly, the number of bits that can be transmitted between the two thermal zones is based on the capabilities of the Suzuki stack (Mustafa and Köse, 2022), which amplifies and transmits the SFQ pulses to CMOS voltages for use in the Hot zone. Note that the clock cycles listed are measured in Icy zone cycles. Thus, our simulator accounts for the communication overhead between the two zones, the impact of the multi-stage gate-level pipeline used in the ALUs, and the cost of accessing data in the Icy zone memory structures.
The simulator marks dependent instructions within an instruction window and keeps track of which data structure their values are stored in, as described in our microarchitecture. The access latencies of the various structures, such as the Hot-Driven Register File access time in cycles, are provided as input parameters. The access latency for each Dependency Buffer entry and the cycle penalty of shifting through Dependency Buffer values to obtain the correct value are also computed in the simulator.
All the latencies of various microarchitecture blocks are derived from the synthesized blocks available in the qPalace tool (Fourie et al., 2019; Schindler et al., 2021). Thus, all the design blocks, including the ALUs, Suzuki stacks, and wiring, are accurately modeled and provided as simulation parameters to the simulator. The power consumption is also modeled based on the JJ counts, wiring costs, and other circuit-level hardware implementation details.
7 Evaluation
7.1 Power benefit
Table 3 demonstrates that the total static power used by the Icy-Hot design is 27,083
To put this power consumption into perspective, we modeled the power consumption of a Xeon processor using a combination of McPAT (Li et al., 2009) and Gem5 (Binkert et al., 2011). McPAT is an integrated power, area, and timing modeling framework with pre-existing microarchitectural configurations; it can be paired with performance simulators, such as Gem5, to obtain an accurate power model for a given application on a specified core. We used Gem5 to obtain performance statistics for an x86 in-order core similar to Tulsa (Gilbert et al., 2006), and these statistics were then used to generate the power consumption. The Icy-Hot design has an overall power improvement of about 36.9% compared to this in-order core. Since the Icy-Hot design still runs its non-execution pipeline stages at 77 K, the power improvement is bounded by what can be achieved by moving the execution units into SFQ while staying within the JJ limit. As more JJs become available, the fraction of CPU functionality that can be brought into the Icy zone will increase, leading to greater power efficiency over time.
7.2 Hardware evaluation on JJ usage
The JJ counts for the hardware evaluation are obtained using the RSFQ cell library (Schindler, 2013) and synthesis results from the qPalace tool (Schindler et al., 2021; Fourie et al., 2019). Chip area is not an appropriate metric because the JJ count is the limiting factor in SFQ designs. Also, since our paper aims at architectural innovation for future SFQ technology (Holmes et al., 2018; Scott Holmes, 2020), we prioritize JJ count over chip area in our hardware evaluation. Prior works have also presented JJ counts as hardware estimates (Zha et al., 2022; Zha et al., 2023).
Table 4 shows the JJ count of individual components of the proposed Icy-Hot design. Buffers include bit-parallel shift registers used for the Input Buffer (opcode, source register value, and destination register ID), the Output Buffer, and the Dependency buffer.
The total number of JJs used for the design is 221,790, which is well within projected future JJ fabrication limits (Scott Holmes, 2020). Compared to the naive in-order core synthesis shown in Table 1, the total number of JJs used has decreased by roughly 8× (from about 1.8M to about 220K).
7.3 Overall speedups
We compare the performance of Icy-Hot against a baseline where all stages reside in the Hot zone. Figure 5 shows the speedup over the baseline for each benchmark; on average, performance improves by nearly 30%. Blackscholes and Libquantum show limited CPI improvement. Blackscholes uses the Black-Scholes partial differential equation to calculate the prices of a portfolio of European options. The program iterates through all the derivatives for each option and computes the price. The runtime scales linearly with the input set size, as the options are stored in an initialization array that must be loaded before computation occurs. Therefore, this program's performance is limited by the amount of memory available in the Icy zone. Though this program is an ideal loop-oriented target application, its lack of speedup is due to limited SFQ memory.
Canneal, on the other hand, is a cache-aware simulated annealing algorithm that pseudorandomly picks pairs of elements and tries to swap them. The algorithm is optimized for data reuse and only discards one element during each iteration, which reduces cache misses. Since the need for new data is minimal and the new data point can be preloaded, Icy-Hot can perform several swaps at a time compared to the baseline. Similarly, Streamcluster finds a predetermined number of medians for a stream of input points such that each point is assigned to its nearest center. This algorithm uses static partitioning of data so that data accesses are predictable, enabling our design to preload data and minimize loads.
7.4 Design space exploration: Memory structures
The Hot-Driven Register File and the Dependency buffer are critical components in our design as they continually allow the computation in the Icy zone to execute with minimal delays. We evaluate the performance benefit obtained by changing the sizing of these data structures. Our goal in this evaluation is to determine which data structure has the most positive impact, the ideal sizing of each data structure, and which is critical during different applications.
7.4.1 Hot-Driven Register File vs. dependency buffer
Table 5 shows the speedup over the baseline for varying sizes of the Icy zone memory structures. We also evaluate a design with zero entries in the Dependency Buffer and zero entries in the Hot-Driven Register File. This is the worst-case scenario, in which instruction sources must be fetched from the Hot zone; the resulting large transmission delay motivates the need for some memory in the Icy zone, even if CMOS-sized caches cannot be replicated. We varied the sizes of both the Hot-Driven Register File and the Dependency Buffer to evaluate which structure leads to better performance. The Dependency Buffer provides the most benefit, even after accounting for shifting delays.
When the Dependency Buffer and Register File are of equal size (16 or 32 entries each), performance improves significantly compared to the design without any Icy zone memory structures. However, with 64 entries in each structure, the improvement is only marginally better than the 32-entry design: an increase of around 15,000 JJs for a speedup gain of only 0.06.
The limited additional speedup is attributed to the fact that, as more registers are used, the access latency to each register increases. Because of the splitter trees used to build a register file, the access latency is proportional to the logarithm of the number of entries.
7.5 Benchmark characteristics
The Icy-Hot design is beneficial for most programs and shows a significantly higher benefit when a program operates on available data within the Icy zone. Based on our results, we note that in such programs, data is repeatedly manipulated as part of the computation kernels. Such applications show a greater benefit as data can be preloaded into the Icy zone once, and the data gets reused–decreasing the communication overhead between the two zones.
7.5.1 Parsec benchmarks
For the smaller, loop-based Parsec benchmarks, the performance benefit did not vary significantly with changes to the sizes of the Hot-Driven Register File or Dependency Buffer. These applications do not spill and fill registers often, which avoids unneeded memory accesses. However, for benchmarks that require larger input data sets, such as Blackscholes, increasing the sizes of the Dependency Buffer and Register File has a positive impact up to a threshold. As we increased the sizes from (16, 16) to (32, 32), Blackscholes showed a speedup of 2.34x. Though this speedup is significant, a Dependency Buffer of 128 entries with a 32-entry Register File (128, 32) showed a speedup of only 0.99x over (32, 32), showing the impact of the longer access latency of deep shift registers and register files.
7.5.2 Other benchmarks
We have chosen Atax from Polybench (Pouchet and Grauer-Gray, 2012) and Libquantum from the SPEC (SPEC, 2000) benchmarks. Atax focuses on matrix computation, representing a simple linear algebra application, while Libquantum simulates Shor's factorization algorithm for quantum computers. These benchmarks show modest performance improvements of 1.17x and 1.3x for Atax and Libquantum, respectively, as depicted in Figure 5. The improvement is small compared to the Parsec benchmarks because data reuse is minimal. Instead of benefiting from the Icy zone memory structures, these applications suffer from transmission delay, resulting in minimal performance improvement even with 10x the frequency in the Icy-Hot execution stage.
7.6 Future work
This work exposes two critical components that, if improved by the broader community, would significantly increase the performance of our system.
7.6.1 Co-packaged CryoCMOS to Nb integration
Our simulations assume a 10-cycle transmission delay between the Hot and Icy zones. This delay dominates runtime for reuse-poor workloads. Recent demonstrations of superconducting–CMOS co-integration (Numata et al., 2024) suggest that the physical separation and interface penalty could be reduced substantially. A conservative reduction of the hop from 10 to 3 Icy cycles yields an additional 1.5 to 2x speedup in our models for communication-bound kernels.
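A back-of-envelope, Amdahl-style estimate (our own illustrative model, not simulator output) shows how such a hop reduction translates into speedup for kernels that spend much of their time on cross-zone traffic.

```python
# Back-of-envelope estimate (illustrative, not simulator output): if a
# communication-bound kernel spends fraction f of its Icy cycles waiting on
# cross-zone hops, shrinking the hop from 10 to 3 cycles bounds the speedup.
def hop_speedup(comm_fraction, old_hop=10, new_hop=3):
    compute = 1.0 - comm_fraction
    return 1.0 / (compute + comm_fraction * new_hop / old_hop)

for f in (0.4, 0.6, 0.7):
    print(f"comm fraction {f:.0%}: {hop_speedup(f):.2f}x")
# Kernels that spend roughly 60-70% of their time on cross-zone traffic land
# in the 1.5-2x range quoted above.
```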
7.6.2 Hybrid Icy-zone memory to eliminate most cross-domain traffic
An equally promising path is expanding "Icy-resident" storage so that active data rarely traverses the Hot/Icy boundary. Our current register-file and shift-register hierarchy already shows that benchmarks with high operand reuse benefit markedly when data remains in the Icy zone, whereas low-reuse codes are limited. Introducing a hybrid memory, such as multi-banked SFQ shift registers combined with compact NDRO or CryoCMOS SRAM banks (Damsteegt et al., 2024; Parihar et al., 2023), would allow a much larger working set to reside locally. Because access latency in the existing data buffer scales only modestly with size, these benefits would come at moderate JJ cost.
7.6.3 Broader architectural scaling
Both of these improvements also expand the fraction of the CPU that can operate efficiently at 4 K. As the hot control and fetch logic progressively migrates into the superconducting region, the aggregate power advantage will grow beyond the current 37%–39% once cooling overhead is normalized. With these advancements, more than just the Execution stage can operate within the Icy region, allowing for some of the cost of control transmission to be mitigated as well.
These directions correspond closely to ongoing experimental progress in CryoCMOS and hybrid superconducting memory technology, and thus represent realistic near-term paths to move the IcyHot architecture from a proof-of-concept toward a high-impact computing platform.
8 Conclusion
With the slowdown of Moore's Law and Dennard scaling, and the evident increase in research activity on single flux quantum (SFQ) devices and logic design tools, now is the time to explore SFQ-based CPU designs. However, building a full-fledged CPU is currently limited by the lack of large SFQ memory and by path-balancing flip-flops. One may ask why build SFQ CPUs at all without large memory and complex predictors. The answer is that SFQ device-level power efficiency is two to three orders of magnitude better than CMOS; hence, it is imperative to capture even a fraction of that efficiency while building new SFQ-centric CPUs. CMOS CPUs faced similar problems in their early incarnations, which led first to in-order CPUs with limited memory. In the same vein, we believe the Icy-Hot design is a stepping stone on the path to eventually building complex designs.
In this work, we propose Icy-Hot, an SFQ CPU design paradigm that decouples the compute engine from the rest of the pipeline, a stepping stone in overcoming the memory capacity limitation that challenges the current SFQ CPU design. By decoupling the Execution stage and running it in the Icy zone, we are able to capitalize on the processing speed and power advantages of superconducting logic while alleviating the burden of providing control logic and memory from the superconducting design. To evaluate the Icy-Hot design, a cycle-accurate simulator that implements the Icy zone and Hot zone functionality and transmission line functionality has been developed. We test the various components of our design and evaluate the overall benefits and individual benefits provided by each data structure.
We execute a range of benchmarks using this simulator to show that our design, even with the cost of data transmission, achieves about a 36.9% improvement in power efficiency (even when accounting for cooling costs) and provides nearly 30% performance improvement over baseline designs. This demonstrates that SFQ CPUs can be built with limited JJs by relying on a hybrid design, achieving a power improvement without sacrificing performance.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author contributions
TR: Writing – review and editing, Writing – original draft. JL: Writing – review and editing, Writing – original draft. HZ: Writing – review and editing. MQ: Data curation, Writing – original draft. MA: Writing – review and editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This work has been funded by the National Science Foundation (NSF) under the Expedition: DISCoVER (Design and Integration of Superconducting Computation for Ventures beyond Exascale Realization) project grant number 2124453.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Ando, Y., Sato, R., Tanaka, M., Takagi, K., Takagi, N., and Fujimaki, A. (2016). Design and demonstration of an 8-bit bit-serial rsfq microprocessor: core e4. IEEE Trans. Appl. Supercond. 26 (5), 1–5. doi:10.1109/tasc.2016.2565609
Arnau, J.-M., Parcerisa, J.-M., and Xekalakis, P. (2012). Boosting Mobile gpu performance with a decoupled access/execute fragment processor. ACM SIGARCH Comput. Archit. News 40 (3), 84–93. doi:10.1109/isca.2012.6237008
Bienia, C., Kumar, S., Singh, J. P., and Li, K. (2008). “The parsec benchmark suite: characterization and architectural implications,” in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, 72–81.
Binkert, N., Beckmann, B., Black, G., Reinhardt, S. K., Saidi, A., Basu, A., et al. (2011). The gem5 simulator. ACM SIGARCH Comput. Archit. news 39 (2), 1–7. doi:10.1145/2024716.2024718
Byun, I., Min, D., Lee, G.-h., Na, S., and Kim, J. (2020). “Cryocore: a fast and dense processor architecture for cryogenic computing,” in 2020 ACM/IEEE 47th annual international symposium on computer Architecture (ISCA) (IEEE), 335–348.
Damsteegt, R. A., Overwater, R. W., Babaie, M., and Sebastiano, F. (2024). A benchmark of cryo-cmos embedded sram/drams in 40-nm cmos. IEEE J. Solid-State Circuits 59 (7), 2042–2054. doi:10.1109/jssc.2024.3385696
Dorojevets, M., Ayala, C. L., Yoshikawa, N., and Fujimaki, A. (2012). 8-bit asynchronous sparse-tree superconductor rsfq arithmetic-logic unit with a rich set of operations. IEEE Trans. Appl. Supercond. 23 (1), 1700104. doi:10.1109/tasc.2012.2229334
Dorojevets, M., and Chen, Z. (2015). “Fast pipelined storage for high-performance energy-efficient computing with superconductor technology,” in 2015 12th international conference and expo on emerging technologies for a smarter world (CEWIT) (IEEE), 1–6.
Dorojevets, M., Ayala, C. L., and Kasperek, A. K. (2010). Data-flow microarchitecture for wide datapath rsfq processors: design study. IEEE Trans. Appl. Supercond. 21 (3), 787–791. doi:10.1109/tasc.2010.2087410
Filippov, T., Dorojevets, M., Sahu, A., Kirichenko, A., Ayala, C., and Mukhanov, O. (2011). 8-bit asynchronous wave-pipelined rsfq arithmetic-logic unit. IEEE Trans. Appl. Supercond. 21 (3), 847–851. doi:10.1109/tasc.2010.2103918
Fourie, C. J., Jackman, K., Botha, M. M., Razmkhah, S., Febvre, P., Ayala, C. L., et al. (2019). Coldflux superconducting eda and tcad tools project: overview and progress. IEEE Trans. Appl. Supercond. 29 (5), 1–7. doi:10.1109/tasc.2019.2892115
Fujiwara, K., Hoshina, H., Yamashiro, Y., and Yoshikawa, N. (2003). Design and component test of sfq shift register memories. IEEE Trans. Appl. Supercond. 13 (2), 555–558. doi:10.1109/tasc.2003.813945
Fujiwara, K., Yamashiro, Y., Yoshikawa, N., Hashimoto, Y., Yorozu, S., Terai, H., et al. (2004). High-speed test of sfq-shift register files using ptl wiring. Phys. C. Supercond. 412, 1586–1590. doi:10.1016/j.physc.2004.01.173
Gilbert, J. D., Hunt, S. H., Gunadi, D., and Srinivas, G. (2006). The tulsa processor: a dual core large shared-cache.
Holmes, D. S., Debenedictis, E., Fagaly, R., Febvre, P., Gupta, D., Herr, A., et al. (2018). “Superconductor electronics technology roadmap for irds 2018,” in Applied Superconductivity Conference (ASC 2018).
Ishida, K., Byun, I., Nagaoka, I., Fukumitsu, K., Tanaka, M., Kawakami, S., et al. (2020). “Supernpu: an extremely fast neural processing unit using superconducting logic devices,” in 2020 53rd annual IEEE/ACM international symposium on microarchitecture (MICRO) IEEE, 58–72.
Karamuftuoglu, M. A., and Pedram, M. (2023). α-soma: single flux quantum threshold cell for spiking neural network implementations. IEEE Trans. Appl. Supercond. doi:10.1109/TASC.2023.3264703
Katam, N., Shahsavani, S. N., Lin, T. R., Pasandi, G., Shafaei, A., and Pedram, M. (2017). Sport lab sfq logic circuit benchmark suite. Los Angeles, CA: University Southern California, Technical Report.
Katam, N. K., Zha, H., Pedram, M., and Annavaram, M. (2020). Multi fluxon storage and its implications for microprocessor design. J. Phys. Conf. Ser. 1559 (1), 012004. doi:10.1088/1742-6596/1559/1/012004
Kawaguchi, T., and Takagi, N. (2022). 32-bit alu with clockless gates for rsfq bit-parallel processor. IEICE Trans. Electron. 105 (6), 245–250. doi:10.1587/transele.2021sep0005
Kirichenko, A. F., Vernik, I. V., Kamkar, M. Y., Walter, J., Miller, M., Albu, L. R., et al. (2019). Ersfq 8-bit parallel arithmetic logic unit. IEEE Trans. Appl. Supercond. 29 (5), 1–7. doi:10.1109/tasc.2019.2904484
Kundu, S., Datta, G., Beerel, P. A., and Pedram, M. (2019). “Qbsa: logic design of a 32-bit block-skewed rsfq arithmetic logic unit,” in 2019 IEEE International Superconductive Electronics Conference (ISEC) (IEEE), 1–3.
Li, S., Ahn, J. H., Strong, R. D., Brockman, J. B., Tullsen, D. M., and Jouppi, N. P. (2009). “Mcpat: an integrated power, area, and timing modeling framework for multicore and manycore architectures,” in Proceedings of the 42nd annual ieee/acm international symposium on microarchitecture, 469–480.
Likharev, K. K., and Semenov, V. K. (1991). Rsfq logic/memory family: a new josephson-junction technology for sub-terahertz-clock-frequency digital systems. IEEE Trans. Appl. Supercond. 1 (1), 3–28. doi:10.1109/77.80745
Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., et al. (2005). Pin: building customized program analysis tools with dynamic instrumentation. Acm sigplan Not. 40 (6), 190–200. doi:10.1145/1064978.1065034
Mudge, T. (2015). The specialization trend in computer hardware: technical perspective. Commun. ACM 58 (4), 84. doi:10.1145/2735839
Mukhanov, O. A. (1993). Rsfq 1024-bit shift register for acquisition memory. IEEE Trans. Appl. Supercond. 3 (4), 3102–3113. doi:10.1109/77.251810
Mukhanov, O. A. (2011). Energy-efficient single flux quantum technology. IEEE Trans. Appl. Supercond. 21 (3), 760–769. doi:10.1109/tasc.2010.2096792
Mustafa, Y., and Köse, S. (2022). Optimization of suzuki stack circuit to reduce power dissipation. IEEE Trans. Appl. Supercond. 32 (8), 1–7. doi:10.1109/tasc.2022.3192202
Naclerio, A., Riente, F., Turvani, G., Vacca, M., Zamboni, M., and Graziano, M. (2025). “Mage: a decoupled access-execute cgra tailored for static control applications,” in 2025 IEEE international symposium on circuits and systems (ISCAS) (IEEE), 1–5.
Nagaoka, I., Tanaka, M., Inoue, K., and Fujimaki, A. (2019). “29.3 a 48ghz 5.6 mw gate-level-pipelined multiplier using single-flux quantum logic,” in 2019 IEEE international solid-state circuits Conference-(ISSCC) (IEEE), 460–462.
Nowatzki, T., Gangadhar, V., Ardalani, N., and Sankaralingam, K. (2017). “Stream-dataflow acceleration,” in 2017 ACM/IEEE 44th annual international symposium on computer Architecture (SCA), 416–429.
NSF (2019). Discover expedition. Available online at: https://www.nsf.gov/news/special_reports/announcements/042222.jsp.
Numata, H., Iguchi, N., Tanaka, M., Okamoto, K., Miura, S., Uchida, K., et al. (2024). Superconducting nb interconnects for cryo-cmos and superconducting digital logic applications. Jpn. J. Appl. Phys. 63, 04SP73. doi:10.35848/1347-4065/ad37c1
Parihar, S. S., Thomann, S., Pahwa, G., Chauhan, Y. S., and Amrouch, H. (2023). Cryogenic in-memory computing for quantum processors using commercial 5-nm finfets. IEEE Open J. Circuits Syst. 4, 258–270. doi:10.1109/ojcas.2023.3309478
Qu, P.-Y., Tang, G.-M., Yang, J.-H., Ye, X.-C., Fan, D.-R., Zhang, Z.-M., et al. (2020). Design of an 8-bit bit-parallel rsfq microprocessor. IEEE Trans. Appl. Supercond. 30 (7), 1–6. doi:10.1109/tasc.2020.3017527
Razmkhah, S., and Bozbey, A. (2019). Heat flux capacity measurement and improvement for the test of superconducting logic circuits in closed-cycle cryostats. Turkish J. Electr. Eng. Comput. Sci. 27 (5), 3912–3922. doi:10.3906/elk-1903-164
Ryazanov, V. V., Bol’ginov, V. V., Sobanin, D. S., Vernik, I. V., Tolpygo, S. K., Kadin, A. M., et al. (2012). Magnetic josephson junction technology for digital and memory applications. Phys. Procedia 36, 35–41. doi:10.1016/j.phpro.2012.06.126
Schindler, L. (2013). Rsfq cell library. Available online at: https://github.com/sunmagnetics/RSFQlib.
Schindler, L., Delport, J. A., and Fourie, C. J. (2021). The coldflux rsfq cell library for mit-ll sfq5ee fabrication process. IEEE Trans. Appl. Supercond. 32 (2), 1–7. doi:10.1109/tasc.2021.3135905
Scott Holmes, I. C. o. S. (2020). “Superconductor electronics technology roadmap for irds,” in Applied superconductivity conference.
Smith, J. E. (1982). Decoupled access/execute computer architectures. ACM SIGARCH Comput. Archit. News 10 (3), 112–119. doi:10.1145/1067649.801719
SPEC (2000). Spec cpu 2006. Available online at: https://www.spec.org/cpu2006/.
Szafarczyk, R., Nabi, S. W., and Vanderbauwhede, W. (2025). “Compiler support for speculation in decoupled access/execute architectures,” in Proceedings of the 34th ACM SIGPLAN international conference on compiler construction, 192–204.
Tanaka, M., Sato, R., Hatanaka, Y., and Fujimaki, A. (2016). High-density shift-register-based rapid single-flux-quantum memory system for bit-serial microprocessors. IEEE Trans. Appl. Supercond. 26 (5), 1–5. doi:10.1109/tasc.2016.2555905
Tang, G.-M., Takata, K., Tanaka, M., Fujimaki, A., Takagi, K., and Takagi, N. (2015). 4-bit bit-slice arithmetic logic unit for 32-bit rsfq microprocessors. IEEE Trans. Appl. Supercond. 26 (1), 1–6.
Ultraembedded (2013). Risc-v core. Available online at: https://github.com/ultraembedded/riscvreadme.
Vernik, I. V., Bol’ginov, V. V., Bakurskiy, S. V., Golubov, A. A., Kupriyanov, M. Y., Ryazanov, V. V., et al. (2013). Magnetic josephson junctions with superconducting interlayer for cryogenic memory. IEEE Trans. Appl. Supercond. 23 (3), 1701208. doi:10.1109/tasc.2012.2233270
Xu, W., Ying, L., Lin, Q., Ren, J., and Wang, Z. (2021). Design and implementation of bit-parallel rsfq shift register memories. Supercond. Sci. Technol. 34 (8), 085002. doi:10.1088/1361-6668/ac086e
Yuh, P., and Mukhanov, O. A. (1992). Design and testing of rapid single flux quantum shift registers with magnetically coupled readout gates. IEEE Trans. Appl. Supercond. 2 (4), 214–221. doi:10.1109/77.182733
Zha, H., Katam, N. K., Pedram, M., and Annavaram, M. (2022). “Hiperrf: a dual-bit dense storage sfq register file,” in 2022 IEEE international symposium on high-performance computer architecture (HPCA) (IEEE), 415–428.
Zha, H., Tannu, S., and Annavaram, M. (2023). “Superbp: design space exploration of perceptron-based branch predictors for superconducting cpus,” in 2023 IEEE international symposium on microarchitecture (MICRO) (IEEE).
Keywords: superconducting, architecture, CPU (central processing unit), thermal zone model, design, optimization
Citation: Renduchintala T, Lee J, Zha H, Qi M and Annavaram M (2025) Icy-hot: decoupled compute paradigm towards a general-purpose superconducting CPU design. Front. Mater. 12:1618454. doi: 10.3389/fmats.2025.1618454
Received: 26 April 2025; Accepted: 20 October 2025;
Published: 20 November 2025.
Edited by:
Mark Law, University of Florida, United States
Reviewed by:
Timothy Sherwood, University of California, Santa Barbara, United States
John Shalf, E. O. Lawrence Berkeley National Laboratory Computing Sciences Research, United States
Copyright © 2025 Renduchintala, Lee, Zha, Qi and Annavaram. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Tara Renduchintala, trenduch@usc.edu
Michael Qi