AUTHOR=Laguna Ann Franchesca, Sharifi Mohammed Mehdi, Kazemi Arman, Yin Xunzhao, Niemier Michael, Hu X. Sharon
TITLE=Hardware-Software Co-Design of an In-Memory Transformer Network Accelerator
JOURNAL=Frontiers in Electronics
VOLUME=3
YEAR=2022
URL=https://www.frontiersin.org/journals/electronics/articles/10.3389/felec.2022.847069
DOI=10.3389/felec.2022.847069
ISSN=2673-5857
ABSTRACT=Transformer networks have outperformed recurrent and convolutional neural networks in accuracy on various sequential tasks. However, memory and compute bottlenecks prevent transformer networks from scaling to long sequences because of their high execution time and energy consumption. Various neural attention mechanisms have been proposed to lower the computational load, but they still suffer from the memory-bandwidth bottleneck. In-memory processing can alleviate memory bottlenecks by reducing the transfer overhead between the memory and compute units, allowing transformer networks to scale to longer sequences. We propose an in-memory transformer network accelerator that uses a combination of crossbars and content-addressable memories (CAMs). The accelerator speeds up transformer networks by (1) computing in memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, and (3) exploiting the parallelism available in the attention mechanism. To reduce energy consumption, we introduce the following techniques: (1) a configurable attention selector chooses among different sparse attention patterns, (2) CAM-aided locality-sensitive hashing filters sequence elements by their importance, and (3) FeFET-based crossbars store the projection weights, while CMOS-based crossbars serve as an attentional cache that holds attention scores for later reuse. When implementing the vanilla transformer with CMOS-based crossbars and CAMs at a sequence length of 512, we achieve an 8x latency and 84x energy improvement over the GPU approach. Further gains are possible when FeFETs store the learned weights and CMOS-based devices are used only for the attentional caches and peripherals: the CMOS-FeFET hybrid configuration achieves a 7.89x latency and 1264x energy improvement for the vanilla transformer at a sequence length of 512 compared to the GPU approach. For bidirectional transformers with a sequence length of 4098, our in-memory transformer network inference accelerator in the CMOS-FeFET hybrid configuration achieves a 173x speedup and 494x energy improvement over the GPU approach.
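As a rough illustration of the filtering idea mentioned in the abstract (CAM-aided locality-sensitive hashing that keeps only the most relevant sequence elements), the following is a minimal software sketch, not the paper's hardware design: keys are hashed with random hyperplane projections, a key participates in attention only if its signature is within a small Hamming distance of the query's signature (mimicking an approximate CAM match), and attention is computed over the surviving subset. All names and parameters here (filtered_attention, n_bits, max_hamming) are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of LSH-filtered attention; not the paper's implementation.
import numpy as np

def lsh_signatures(x, planes):
    """Binary signatures: sign of projections onto random hyperplanes."""
    return (x @ planes.T > 0).astype(np.uint8)            # shape (n, n_bits)

def filtered_attention(q, K, V, n_bits=16, max_hamming=2, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, q.shape[-1]))   # shared hash planes
    q_sig = lsh_signatures(q[None, :], planes)[0]
    k_sig = lsh_signatures(K, planes)
    # Keep keys whose signature is close to the query's (CAM-style approximate match).
    keep = np.count_nonzero(k_sig != q_sig, axis=1) <= max_hamming
    if not np.any(keep):                                   # fall back to full attention
        keep = np.ones(K.shape[0], dtype=bool)
    scores = (K[keep] @ q) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V[keep]

# Toy usage: one query attending over a sequence of 512 keys/values.
d, n = 64, 512
rng = np.random.default_rng(1)
q, K, V = rng.standard_normal(d), rng.standard_normal((n, d)), rng.standard_normal((n, d))
out = filtered_attention(q, K, V)
print(out.shape)  # (64,)
```

In the hardware described by the abstract, the match step would be performed by content-addressable memories in a single parallel lookup rather than by the explicit Hamming-distance loop shown here; the sketch only conveys the functional effect of reducing the number of sequence elements that reach the attention computation.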