In-Memory Computation Based Mapping of Keccak-f Hash Function

Cryptographic hash functions play a central role in data security for applications such as message authentication, data verification, and detection of malicious or illegal modification of data. However, such functions typically require intensive computation with a high volume of memory accesses. Novel computing architectures such as logic-in-memory (LIM)/in-memory computing (IMC) have been investigated in the literature to address the limitations of intensive computation and the memory bottleneck. In this work, we present an implementation of Keccak-f (a state-of-the-art secure hash algorithm) using a variant of simultaneous logic-in-memory (SLIM) that utilizes emerging non-volatile memory (NVM) devices. Detailed operation and instruction mapping on SLIM-based digital gates is presented. Through simulations, we benchmark the proposed approach using LIM cells based on four different emerging NVM devices (OxRAM, CBRAM, PCM, and FeRAM). When used with state-of-the-art emerging NVM devices, the proposed mapping strategy offers energy-delay product (EDP) savings of up to 300× compared to conventional methods.


INTRODUCTION
Hashing algorithms are primarily used to create a compressed and unique representation of data, which facilitates verification of large amounts of information stored in the cloud or on network-connected devices. However, such data is prone to attacks and interventions by third parties, leading to corruption or illegal modification. In these situations, secure hashing algorithms (SHAs) enable detection of such attacks and also allow authentication of data origin for downloads from sources over the internet (Debnath et al., 2017). SHA-3 is the current NIST standard for cryptographically secure hashing (CSH) applications (Dang, 2015). It is implemented using Keccak-f, which is based on the sponge construction (Bertoni et al., 2013) and allows generation of a fixed-size output from an arbitrary number of input bits. However, a primary limitation of Keccak-f is its additional computation cost, which has been mitigated in the literature by parallel computing algorithms (Kishore and Raina, 2019) and custom hardware accelerators (Michail et al., 2012; Khalil-Hani et al., 2010). While compute latency can be addressed by parallel or custom architectures, overall latency remains constrained by the limited memory bandwidth available when accessing data from storage. Furthermore, since explicit data transactions between the processor and memory are performed extensively (see Figure 1A), security may be compromised through data leakage (Skorobogatov, 2017) or side-channel attacks (SCAs) (Zohner et al., 2012). An intuitive way to address the issues of memory bottleneck, intensive compute, and data security for this application is to use "in-memory computing" (IMC)/logic-in-memory (LIM) approaches (Linn et al., 2012; You et al., 2014; Gao et al., 2015; Wang et al., 2017; Zhou et al., 2017; Sun et al., 2018; Kim et al., 2019).
These approaches offer a clear advantage through in situ computation, i.e., computing exactly where the data is located. Furthermore, since minimal or negligible data transfer occurs between the CPU and storage in the case of LIM (see Figure 1B), security is enhanced. An advanced form of LIM referred to as "simultaneous LIM" (SLIM) was recently proposed (Kingra et al., 2020), in which the same memory cell is used for both logic and storage functions simultaneously in space (silicon area) and time (clock cycles) (see Figure 1B).
In this work, we present an efficient hardware mapping of the Keccak hash function (described later) using SLIM-based methodology. The basic operations of the Keccak-f algorithm are XOR/AND/NOT, which can be efficiently implemented using SLIM. As shown in Figure 1B, in the proposed SLIM-based architecture, data transfer to the CPU is no longer part of the pipeline, and only instructions are transferred between the CPU and storage to complete the hashing operation. Hence, SLIM would offer enhanced security and resistance to SCA (Skorobogatov, 2017) due to negligible data movement. In the literature, LIM methodologies have been proposed for different cryptographic applications such as SHA-3 (Bhattacharjee et al., 2017; Nagarajan et al., 2019; Yang and Chen, 2019) and the Advanced Encryption Standard (AES) (Angizi et al., 2018; Xie et al., 2018). Compared to the literature, the novel contributions of this work are as follows:
• A modified SLIM bank design (incorporating SHIFT registers) adapted specifically to execute CSH applications.
• A new mapping methodology for realizing Keccak-f (using XOR, AND, and NOT gates) on a SLIM MAT (matrix).
• Performance analysis (energy and latency) w.r.t.
Section 2 summarizes basics of the Keccak-f algorithm, presents the SLIM methodology and its hardware realization, and describes the proposed hardware mapping methodology with operation and instruction level details. Section 3 summarizes the key analysis results, and Section 4 provides the conclusion.

Basics of Keccak-f
CSH refers to security-oriented usage of hashing functions that ensure very high difficulty for inverse transformations, making them strongly unidirectional (Chi and Zhu, 2017). The primary properties associated with CSH are the following:
1. Pre-image resistance: The input data (message) is difficult to find if only the output data (message digest) is known.
2. Second pre-image resistance: Given a message m_i and its hash output hash(k, m_i), where k is the hash key, it is difficult to find another message m_j satisfying hash(k, m_i) = hash(k, m_j).
3. Collision resistance: It is difficult to find any two distinct messages m_i and m_j that have the same hash result, which protects against birthday attacks (Bellare and Kohno, 2003).
While the length of the input data (the message) for a CSH function is arbitrary, the output (the message digest) has a fixed length. Fixed-size hashes are used to represent the original input for validation. For security, any small change in the input must produce a significant change in the output. CSH can be further divided into two categories: 1) keyed cryptographic hashing and 2) unkeyed cryptographic hashing. The current standard for unkeyed cryptographic hashing is SHA-3, also known as Keccak (Bertoni et al., 2013). A permutation block (Keccak-f) is used as the core operation. Keccak-f is realized by multiple rounds (here 24) of five steps (θ, ρ, π, χ, and ι) consisting of logic operations and bit-wise permutations. It operates on a fixed number of bits "b", i.e., the width of the permutation or bit state. In this study, we consider the Keccak-f implementation with block size "b" = 1,600. The input and output of the Keccak-f round function are 5 × 5 matrices of 64-bit words. The complete hashing process with its steps is shown in Algorithm 1. Message block "A" and round constant "RC" are provided as inputs. Variables "x" and "y" represent the matrix indices, and operations on "x" and "y" are modulo 5 (i.e., the maximum value is 4, and any further increase wraps around to 0). Variables "B", "C", and "D" hold intermediate values. In the ρ step, r(x, y) represents the rotation matrix, and "A" is rotated according to the "r" matrix values. Round and rotation constants are given in Dworkin (2015). The final computed hash is stored in "A".
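As a software-level illustration of the five steps described above, the following is a minimal Python sketch of Keccak-f with b = 1,600 acting on a 5 × 5 array of 64-bit lanes. It reuses the names "A", "C", and "D" from the text; "B" is handled implicitly inside the combined ρ/π loop. As an assumption made for brevity, the rotation offsets r(x, y) and the round constants RC are generated on the fly (triangular-number offsets and a small LFSR) rather than read from the tables in Dworkin (2015). The function names and the SHA3-256 wrapper are illustrative only; the wrapper exists so that the permutation can be compared against Python's hashlib.

```python
import hashlib

MASK = (1 << 64) - 1


def rol64(a, n):
    """Rotate a 64-bit word left by n positions."""
    n %= 64
    return ((a << n) | (a >> (64 - n))) & MASK


def keccak_f1600(A):
    """Apply 24 rounds of theta, rho, pi, chi, iota to lanes A[x][y]."""
    A = [column[:] for column in A]
    lfsr = 1  # state of the LFSR that generates the round constants
    for _ in range(24):
        # theta: column parities C, then D is XORed into every lane
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rol64(C[(x + 1) % 5], 1) for x in range(5)]
        A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to its new position
        x, y = 1, 0
        current = A[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, A[x][y] = A[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: the only non-linear step (uses NOT and AND)
        for y in range(5):
            T = [A[x][y] for x in range(5)]
            for x in range(5):
                A[x][y] = T[x] ^ (~T[(x + 1) % 5] & T[(x + 2) % 5])
        # iota: XOR the round constant RC into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                A[0][0] ^= 1 << ((1 << j) - 1)
    return A


def sha3_256(msg):
    """SHA3-256 sponge around keccak_f1600 (rate 136 bytes, suffix 0x06)."""
    rate = 136
    padded = bytearray(msg) + b"\x06"
    padded += bytes(-len(padded) % rate)
    padded[-1] ^= 0x80
    state = bytearray(200)
    for off in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[off + i]
        # load the 200-byte state into lanes (little-endian), permute, store back
        lanes = [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
                  for y in range(5)] for x in range(5)]
        lanes = keccak_f1600(lanes)
        for x in range(5):
            for y in range(5):
                state[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return bytes(state[:32])


# Compare the sketch against the standard library implementation.
print(sha3_256(b"abc") == hashlib.sha3_256(b"abc").digest())
```

Only XOR, AND, NOT, and rotations appear in the round function, which is why the mapping discussed in this work needs only those gates plus SHIFT registers.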

SLIM: Concept and Methodology
SLIM methodology (Kingra et al., 2020) relies on exploiting the multilevel capability (MLC) of emerging NVM devices to use the same memory cell for both "storage" and "logic" functions simultaneously. In SLIM, rather than storing input variables in the bitcell, the aim is to preserve the initial memory state. For that, at least four distinct resistance states are essential (states S1, S2, S3, and S4 in Figure 2B). Hence, MLC NVM devices that exhibit repeatable analog conductance tuning behavior can be utilized to realize the SLIM approach. Each of the four selected states is assigned both a logic ("1"/"0") and a memory (LRS/HRS) definition. The sense amplifier (SA) threshold/memory state-sensing window is defined such that two states lie in the memory LRS sense region and the other two in the memory HRS sense region. The two states in the memory LRS region (here S1 and S2) are assigned Logic "1" and Logic "0" values, respectively.
Similarly, the memory HRS region states (here S3 and S4) are assigned Logic "1" and Logic "0" values. Each resistance state thus has two representations; for example, state S1 carries both the Logic "1" and the Memory "LRS" label. While executing any logic operation, the SLIM programming scheme permits state transitions only within the Logic ("1"/"0") levels of a particular memory sense region (i.e., Logic "1" ↔ "0" transitions within the HRS or within the LRS sense region are permitted, but Logic "1" ↔ "0" transitions that cross HRS ↔ LRS are not). Thus, any memory state initially stored on the bitcell can be preserved even after executing a logic operation.

Device Fabrication for Experimental Validation
Analog resistive switching OxRAM stacks with the Ni/3-nm HfO2/7-nm Al-doped-TiO2 (ATO)/TiN (top to bottom) structure were fabricated following a CMOS-compatible process. The transmission electron microscopy (TEM) cross-section image of the device stack is shown in Figure 3A, where the amorphous dielectric bilayer is seen deposited on the TiN bottom electrode (BE) film. The device fabrication flow is as follows: First, the 100-nm-thick TiN BE film was deposited on the thermal-SiO2 (500 nm)/Si wafer by physical vapor deposition (PVD) using RF magnetron sputtering. The BEs were then patterned by optical photolithography (first mask) and dry etching using inductively coupled plasma (ICP). The bottom 7-nm-thick ATO dielectric was then deposited by alternating varying numbers of TiO2 and Al2O3 plasma-enhanced atomic layer deposition (PE-ALD) cycles. The top 3-nm-thick HfO2 dielectric film was deposited using TDMAHf (tetrakis(dimethylamido)hafnium) and O2 plasma. Next, the 100-nm-thick Ni top electrode (TE) film was deposited by DC sputtering and patterned using the lift-off technique. A final photolithography (third mask) and ICP dry etching step was performed to open the contact windows (etch the dielectrics) to the BE contact pads.

Experimental Validation of SLIM
An OxRAM-device (Figure 3A) based 1T-1R SLIM bitcell configuration (Figure 3B) is adopted for the proposed method (Kingra et al., 2020). The DC-IV curve of the characterized OxRAM device is shown in Figure 3C. The experimental setup (including the integrated 1T-1R SLIM bitcell, CMOS chip, OxRAM chip, and parameter analyzer) used in this study is shown in Figure 3D. From the continuum of attainable OxRAM resistance levels (Figure 3E), four distinct resistance states are selected. These states are labeled "11", "10", "01", and "00". The resistance distribution for the selected states is presented in Figure 3F. As shown in Figure 3F, each of these four selected states is allotted both a logic ("1"/"0") and a memory ("LRS"/"HRS") definition. Input variables are mapped onto the gate terminal (V_G) and source terminal (V_2) of the selector (an NMOS transistor in this case). Before executing any logic operation, the data stored on the bitcell can be either in Memory State "1" (i.e., absolute Memory state "11", "LRS") or Memory State "0" (i.e., absolute Memory state "01", "HRS"). While performing a logic operation, the bitcell may make a transition from "11" → "10" or "01" → "00". To perform a consecutive logic operation on the same bitcell, the bitcell is assumed to be in an absolute memory state ("11" for Memory "LRS" and "01" for Memory "HRS"); hence, the bitcell needs to be reinitialized to an absolute memory state if a state transition happens while executing a logic operation. The SLIM programming signals and memory-logic state transitions for the 1T-1R bitcell are shown in Figure 3G. As observed, the initially stored memory state of the bitcell is preserved even after executing a logic operation. Computation mapping is performed for realizing Keccak-f using XOR/AND/NOT gates and SHIFT registers as building blocks. SLIM bitcells are used for realizing the XOR/AND/NOT gate functionalities. A single 1T-1R bitcell can realize NOT/NAND gates.
XOR/AND logic is realized using 4/2 SLIM NAND logic gates, respectively. The programming signal mapping for the possible 1-bit, 2-input (a, b) operand combinations is shown in Figure 4A. As discussed earlier, before executing SLIM logic, the bitcell should be in absolute Memory State "11" or "01". Operands a/b are mapped to the V_G/V_2 terminals of the bitcell, respectively, keeping V_1 grounded (V_1, V_2, and V_G are shown in Figure 3B). The voltage conditions for operand mapping are as follows: a = "0"/"1" corresponds to V_G = 0/V_DD, and b = "0"/"1" corresponds to V_2 = 0 V/5.5 V, respectively. When both a = b = "1" (Logic high), the applied programming voltage drops across the OxRAM device, and it undergoes RESET switching, i.e., its device resistance increases and the current flowing through the device decreases. Figure 4B shows experimental results for the NAND logic operation on a SLIM bitcell for the initial device state "11" (i.e., stored Memory State = LRS/"1"). The truth table for the 2-input NAND gate using a single 1T-1R SLIM bitcell is shown in Figure 4C. An example of mapping the basic logic gates used in the study (NOT, AND, XOR) using SLIM is shown in Figure 5.
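The bitcell behavior described above can be summarized in a small behavioral model. The sketch below is a hedged illustration, not the authors' simulator: states are the two-character labels "11", "10", "01", "00" from the text (first character read as the stored memory bit, second as the logic bit), and all function names are hypothetical. Realizing NOT by tying both NAND inputs together, and the grouping of the four XOR NANDs into three cycles, are assumptions that are merely consistent with the gate and cycle counts stated in the text.

```python
ABSOLUTE_STATES = {"11", "01"}  # "11" = Memory LRS, "01" = Memory HRS


def refresh(state):
    """Restore a bitcell to the absolute state of its memory region."""
    return state[0] + "1"


def slim_nand_cell(state, a, b):
    """Apply operands (a, b) to one bitcell and return its new state.

    The first character (stored memory bit) never changes; the second
    (logic bit) drops from "1" to "0" only when a = b = 1 (RESET).
    """
    if state not in ABSOLUTE_STATES:
        raise ValueError("bitcell must be refreshed before a logic operation")
    logic = "0" if (a == 1 and b == 1) else "1"
    return state[0] + logic


def nand(a, b, stored="11"):
    """Logic output (0 or 1) of a SLIM NAND on a cell holding `stored`."""
    return int(slim_nand_cell(stored, a, b)[1])


def slim_not(a):
    return nand(a, a)                    # 1 bitcell (tied inputs: an assumption)


def slim_and(a, b):
    return slim_not(nand(a, b))          # 2 bitcells: NAND followed by NOT


def slim_xor(a, b):
    n1 = nand(a, b)                      # cycle 1
    n2, n3 = nand(a, n1), nand(b, n1)    # cycle 2 (two cells in parallel)
    return nand(n2, n3)                  # cycle 3; 4 bitcells in total


# The stored memory bit survives the logic operation in both regions.
print(slim_nand_cell("11", 1, 1), slim_nand_cell("01", 1, 1))
print([slim_xor(a, b) for a in (0, 1) for b in (0, 1)])
```

In this model a cell left in "10" or "00" must pass through `refresh` before it can be reused, which mirrors the reinitialization requirement stated above.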

SLIM Operation Mapping
To perform computation mapping on SLIM, an estimation model is developed with the following inputs: 1) operation breakdown in terms of Intel-based instructions, 2) memory interface parameters (bandwidth and data width for cache and DRAM, etc.), 3) SLIM MAT size, and 4) MLC device parameters (energy and latency for read/write operations). Using these inputs, a dataflow graph is constructed to evaluate possible optimization strategies for pipelining and parallel computation. Once a dataflow is selected by the user, energy and latency estimates are generated for the selected MLC device used in the SLIM implementation (details provided in Eqs 1, 2). The complete process is summarized in Figure 6.

Proposed Keccak-f Mapping
Using the customized SLIM operation mapping method, the CSH function SHA-3 is mapped on the SLIM MAT. First, the base Keccak-f in SHA-3 is broken down into a series of steps (see Algorithm 1, i.e., θ, ρ, π, χ, and ι). The message is read from the main memory with block size = 1,600 bits. Each message block ("A") is mapped onto SLIM MATs in the form of a 5 × 5 matrix with a cell size of 64 bits. Since variables/matrices "A" and "B" both need to be present simultaneously starting from step ρ, the maximum memory required would be double the message block size (i.e., 64 × 5 × 5 × 2). Therefore, a MAT size of 64 × 64 is selected.
FIGURE 5 | Illustration of XOR/NOT operation mapping on SLIM MAT with each SLIM bitcell performing NAND logic depending on the inputs. XOR operation takes three cycles of SLIM operations. Intermediate logic outputs are read and stored in buffer memory (see Figure 1A) for further signal application. AND operation can be realized by using a NAND gate followed by a NOT gate and thus requires 2 SLIM bitcells.
Frontiers in Nanotechnology | www.frontiersin.org March 2022 | Volume 4 | Article 841756
Table 1 lists the number of logic operations (NOT, AND, SHIFT, and XOR) and the SLIM MATs required for each computation step in Keccak-f. Figure 5 illustrates operand mapping and XOR/NOT logic realization using the SLIM methodology, highlighting the cycle count and SLIM bitcell requirement. The complete flow for operation mapping and energy estimation is described in Figure 6. Figure 7A illustrates data-flow mapping on the 64 × 64 SLIM MAT. We map the original data matrix "A" (5 × 5 × 64-bit words) onto the first 25 rows of the MAT, where entries from the same row of the matrix are arranged in neighboring rows. The row below those allocated to matrix "A" holds the round constant RC (64 bits). Consecutive steps show the data allocation in the MAT evolving during execution of the five steps for a single round of Keccak-f (see Table 1). An important point to note here is the ability to reuse the MAT storage for variables "B", "C", and "D", since intermediate data are stored as the logic state of the bitcell. For calculating matrix "C", XOR operations are mapped on bitcells already storing matrix "A". This minimizes the additional bitcell requirement for logic computation while preserving the original data matrix "A" for further use. To reuse the selected MAT for multiple compute runs, intermediate refresh operations are carried out by programming the device to its absolute memory state. In the refresh scheme, the initial state of the bitcell is read first. If the bitcell is in an absolute memory state (i.e., "11" or "01"), the logic operation can be performed directly.
However, if the bitcell is in a non-absolute memory state (i.e., "10" or "00"), a refresh signal is applied, and the bitcell is restored to its corresponding absolute memory state ("11" or "01") before a logic operation is executed on it. To estimate energy dissipation for combinational logic operations realized using SLIM bitcells, an empirical model is built based on Eq. 1. The total energy is given by Eq. 2:

$$\mathrm{Total}_{energy} = \mathrm{SLIM}_{energy} + N_{shift} \times \mathrm{SHIFT}_{energy} \tag{2}$$

Here, N denotes the total number of possible combinations of the inputs x_i. Switch_events is an empirical model to determine worst-case state transition events at each node of the NAND logic-based computation graph considering all input combinations for the specified logic operation. For example, when performing an AND operation, two SLIM bitcells are used (the first SLIM bitcell acting as a NAND gate and the other acting as a NOT gate). Depending on the input combination/operands, one of the two SLIM bitcells will undergo switching (Logic "1" → Logic "0"). For instance, if the logic output of the NAND is "1" (for input combinations "00", "01", "10"), the NOT gate will undergo Logic "1" → Logic "0" switching, whereas for input combination "11", the NAND gate undergoes the Logic "1" → Logic "0" transition, and the NOT gate remains in the Logic "1" state. Refresh_energy accounts for the number of refresh operations carried out over the SLIM MAT during execution of the CSH workload (Kingra et al., 2020). Read_energy accounts for reading the output state in order to chain operations. However, in comparison to Switch_energy and Refresh_energy, the Read_energy component of device dissipation is insignificant. Decoder_energy is estimated based on the literature (Viveka and Amrutur, 2014). Since the SHIFT/rotate operation is not possible with the proposed SLIM bitcell, additional CMOS SHIFT registers are included in the SLIM bank design. The reference energy value for the SHIFT registers in the periphery is estimated from Woo et al. (2019) (denoted by SHIFT_energy). N_shift represents the number of SHIFT operations performed for the CSH workload. To estimate latency values, the worst-case path from the input to the output node of the NAND logic-based computation graph is used.
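The AND example above can be checked by enumerating all input combinations. The following Python sketch is an illustration under stated assumptions, not the authors' empirical Switch_events model: every bitcell is assumed to start at Logic "1", a switching event is counted as one Logic "1" → Logic "0" transition, the NOT gate is modeled as a NAND with tied inputs, and the numeric values passed to `total_energy` are placeholders rather than measured data. The helper names are hypothetical.

```python
from itertools import product


def nand(a, b):
    """Logic output of a NAND bitcell that starts at logic 1."""
    return 0 if (a == 1 and b == 1) else 1


def and_switch_events(a, b):
    """Return (events at NAND cell, events at NOT cell) for one input pair.

    A cell that ends at logic 0 has undergone one 1 -> 0 switching event.
    """
    nand_out = nand(a, b)
    not_out = nand(nand_out, nand_out)  # NOT modeled as NAND with tied inputs
    return 1 - nand_out, 1 - not_out


def worst_case_events():
    """Largest number of switching events over all input combinations."""
    return max(sum(and_switch_events(a, b)) for a, b in product((0, 1), repeat=2))


def total_energy(slim_energy, n_shift, shift_energy):
    """Eq. 2: total energy = SLIM energy + N_shift * SHIFT energy."""
    return slim_energy + n_shift * shift_energy


for a, b in product((0, 1), repeat=2):
    print((a, b), and_switch_events(a, b))
print(total_energy(slim_energy=10.0, n_shift=3, shift_energy=2.0))  # placeholder values
```

Under these assumptions exactly one of the two bitcells switches for every input pair, which agrees with the description in the text.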
Energy dissipation per operation and latency estimates for the desired logic gates using the SLIM mapping strategy are shown in Figures 7B,C, respectively. A cycle-wise execution timechart for the CSH workload is shown in Figure 7D. In step θ, the 5-input XOR operation is split into three steps, since the SLIM mapping realizes a 2-input XOR logic gate. All bitcells in a row of the SLIM MAT can be operated in parallel. For the current study, a SLIM MAT of size 64 × 64 is selected, which can perform 4,096 NAND operations. Exploiting the maximum possible parallelism, two independent streams of 2-input XOR operations can be carried out in Step1a, resulting in 640 XOR operations. Hence, out of 1,280 XOR operations, 640 are completed in Step1a, followed by 320 each in Step1b and Step1c. Similarly, the cycle count has been reported for the remaining Keccak-f steps. Each XOR operation requires three compute/write cycles and four read cycles. The overall execution of a single round of Keccak-f requires ≈24k NAND logic operations, which exceeds the overall capacity of the two SLIM MATs, i.e., ≈8k. Hence, the refresh operation is carried out three times, as depicted in Figure 7D.
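The operation counts quoted above follow from simple arithmetic, which the short Python sketch below reproduces. It is an illustration only: the pairing of lanes within Step1a to Step1c is an assumption (the text gives only the totals), the figure of 24,000 stands in for the approximate "≈24k" value, and the function names are hypothetical.

```python
import math

LANE_BITS = 64            # bits per word of the 5 x 5 state
COLUMNS = 5               # one column parity word per column in step theta
MAT_ROWS = MAT_COLS = 64  # SLIM MAT dimensions
NUM_MATS = 2              # MATs used per Keccak-f round


def theta_parity_schedule():
    """Bit-level 2-input XOR counts for Step1a, Step1b, Step1c."""
    per_stream = COLUMNS * LANE_BITS  # 320 XORs for one 2-input lane XOR per column
    step_1a = 2 * per_stream          # two independent pairs XORed in parallel
    step_1b = per_stream              # combine the two partial results
    step_1c = per_stream              # fold in the remaining fifth word
    return step_1a, step_1b, step_1c


def mat_passes(total_nand_ops):
    """Number of full passes over the available bitcells for a workload."""
    capacity = NUM_MATS * MAT_ROWS * MAT_COLS  # 8,192 NAND operations
    return math.ceil(total_nand_ops / capacity)


print(theta_parity_schedule(), sum(theta_parity_schedule()))
print(mat_passes(24_000))
```

If each full pass over the two MATs is accompanied by one refresh, the three passes computed here are consistent with the three refresh operations reported in Figure 7D.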

Performance Benchmarking for Proposed SLIM-Based Keccak-f
A comparison of the conventional architecture with the proposed SLIM-based architecture for the CSH workload is shown in Table 2. Here, it is assumed that the data/message is fetched from the main memory (RAM) to the CPU in the case of the conventional architecture (Intel i5 CPU with DRAM). This results in increased data transfer, which costs both energy and latency compared to the proposed SLIM implementation. Logic operations are carried out inside the CPU, and the message digest is stored back in the main memory. For data transfer, a 128-bit-wide bus is used for all memory interfaces. In the case of the SLIM-based architecture, all combinational logic operations are performed using SLIM bitcells, and the corresponding energy is estimated using the approximate model described in Eq. 1. To maximize the possible parallel SLIM operations in Step3 and Step6 (from Table 1), two SLIM MATs are used, each of size 64 × 64 (i.e., 512 B). This results in a maximum of 8,192 (2 × 64 × 64) parallel SLIM operations. The SLIM-based architecture is realized using four different emerging NVM device technologies (Choi et al., 2020), i.e., CBRAM, OxRAM, PCM, and FeRAM. By performing all computations inside the SLIM MAT, the total EDP (energy-delay product) for the CSH workload is reduced by 300× in the case of FeRAM-based SLIM. The total EDP for SLIM-based implementations can be broken down into four operation types (i.e., data transfer, SLIM-logic, Shift, and SLIM-refresh), as illustrated in Figure 8. The Shift operation contribution remains constant across technologies. Data transfer relies on the read operation performance of each device technology. Device programming energy/switching energy determines the cost of SLIM operations (Logic, Refresh). In the case of FeRAM-based SLIM, it can be observed that dissipation due to data transfer and shift operations is higher than that due to the SLIM operations, leading to better EDP savings.
However, for PCM, SLIM operations show a significant energy contribution due to higher switching energy of the PCM device from the literature (see Table 2), thus leading to higher EDP. A significant advantage of the proposed hardware mapping scheme is the possibility to exploit parallelism based on high density of NVM devices. Considering a 1-Gb SLIM memory bank with MAT size 64 × 64, it would be possible to perform 128k Keccak-f round functions in parallel over different blocks of a large message, i.e., 200-Mb data can be processed in parallel. In case of CPU-based approaches, the maximum parallelism is limited to total number of cores available, i.e., 32 for typical modern-day computers.
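The parallelism figures for a 1-Gb bank can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative and rests on two assumptions: "1 Gb" is read as 2^30 bits, and each Keccak-f round function occupies two 64 × 64 MATs, as in the benchmarking setup described earlier. The function name is hypothetical.

```python
BANK_BITS = 2**30      # 1-Gb SLIM memory bank (binary gigabit, an assumption)
MAT_BITS = 64 * 64     # bitcells per SLIM MAT
MATS_PER_ROUND = 2     # MATs used by one Keccak-f round function
STATE_BITS = 1600      # message block size b


def bank_parallelism(bank_bits=BANK_BITS):
    """Return (parallel round functions, message bits processed in parallel)."""
    mats = bank_bits // MAT_BITS
    rounds = mats // MATS_PER_ROUND
    return rounds, rounds * STATE_BITS


rounds, data_bits = bank_parallelism()
print(rounds // 1024, "k round functions,", data_bits // 2**20, "Mb of message data")
```

With these assumptions the bank holds 262,144 MATs, giving 131,072 (128k) concurrent round functions over 209,715,200 bits (200 Mb) of message data.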

SLIM Implementation and NVM Device Reliability
The reliability of the SLIM bitcell is primarily dependent on device endurance, programming variability, and the conductance values of the selected SLIM states. SLIM bitcell operation relies on device state switching while performing a logic operation and hence faces a fundamental limitation due to the limited endurance of NVM devices. As shown in Table 2, NVM device endurance is typically observed in the range of 10^4 cycles. One way to address this would be to map the application over a larger SLIM bitcell array, thus reducing the number of switching events required per device at the cost of additional latency due to data movement. However, through design optimizations, improved device endurance (≥10^10) has been reported for each of the aforementioned device technologies (CBRAM, OxRAM, PCM, and FeRAM), as shown in Table 3. Aspects of programming variability, conductance state distribution, and on-off ratio are closely linked to the choice of NVM device technology and its material stack. For instance, in the case of OxRAM devices, optimization can be performed through the choice of switching oxide, layer stacking, electrode material, and doping. Devices based on metal oxides such as TaOx (Lee et al., 2011; Goux et al., 2014) and HfO2 (Hudec et al., 2016) have been shown to demonstrate excellent endurance and switching characteristics compared to oxides such as NiO2, TiO2, or Al2O3 (Yang et al., 2008; Kwon et al., 2010). However, Al2O3 has been shown to improve the analog tuning capabilities of other HfOx-based devices when used for material stacking (Yu et al., 2011; Chen et al., 2016). Using a combination of a less-reactive electrode and an inert electrode has been shown to improve both device endurance and switching speed (Goux et al., , 2013). Doping of the switching oxide layer can be used to further enhance the device properties of OxRAM.
Ti doping in HfOx was found to yield forming-free devices with improved endurance, while Al and Si as dopants resulted in improved retention time (Chakrabarti et al., 2013; Chen et al., 2014). Hence, OxRAM characteristics can be tuned by material optimization. The distribution of conductance states, programming variability, and on-off ratio are also key parameters that determine the circuit complexity of the periphery, since sensing higher resistances accurately would require a larger area dedicated to the SA. Separability of conductance states even after including programming variability is essential to ensure reliable operation of the SLIM functionality. A comparison of the various device technologies in terms of MLC parameters (MLC states, on-off ratio) is shown in Table 3. The on-off ratio of two orders of magnitude (100) with at least eight states exhibited by FeRAM would be ideal to ensure reliable operation, since the additional states would act as separation buffers between the four states actually used for mapping, improving reliability.

Comparison With Other LIM Methods
SHA-3 computation using NVM-based LIM has been proposed in recent works (Bhattacharjee et al., 2017; Nagarajan et al., 2019; Yang and Chen, 2019). Table 4 compares the current implementation with recent studies in terms of the LIM operating principle. As shown in Table 4, the SLIM methodology helps in realizing the universal NAND logic gate using any two-terminal NVM device exhibiting MLC. This enables realization of complex logic operations by breaking them down into a series of NAND gates. In this study, architecture exploration has been confined to 2-input XOR logic gates for the CSH workload; however, SLIM offers the flexibility to map n-input XOR logic by chaining universal 1T-1R SLIM NAND logic gates. Another advantage of SLIM over other NVM-based LIM realizations is the reduction in area overhead, since SLIM bitcells can be used simultaneously for storage as well as to map logic gates (such as XOR/AND/NOT). A performance comparison (in terms of energy dissipation/operation cycles) with prior works is not shown, since they rely on theoretically estimated NVM device parameters, whereas the current work uses experimentally measured device parameters.

CONCLUSION
In this work, we presented a step-by-step implementation of the Keccak-f function using the NVM-based SLIM methodology. Experimental validation of the SLIM methodology was performed using bilayer OxRAM devices. A new mapping strategy was proposed, and simulation-based estimation of EDP was performed using state-of-the-art emerging NVM devices. A detailed discussion of the trade-offs between device technologies in terms of operation reliability was also presented. Based on the analysis, FeRAM demonstrates the best performance (EDP savings of ~300×) and reliability, emerging as a promising candidate. We also presented a comparison of the proposed method with other LIM techniques for the same application. The proposed method demonstrates improved generalization capability to optimize logic mapping (universal gate) and better area savings (co-location of logic and memory) while minimizing the write operations required per 1-bit XOR computation. Beyond EDP and endurance improvement, the proposed method also enhances security for hash computations due to reduced data movement by taking advantage of the co-location of logic and memory. This further results in better immunity to security exploits such as SCA.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

AUTHOR CONTRIBUTIONS
SK and VP performed experimental characterization and simulations. MS planned and supervised the project. All authors participated in data analysis and writing of the manuscript.