AUTHOR=Gan Yao, Fu Yanyun, Wang Deyong, Li Yongming
TITLE=A novel approach to attention mechanism using kernel functions: Kerformer
JOURNAL=Frontiers in Neurorobotics
VOLUME=17
YEAR=2023
URL=https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2023.1214203
DOI=10.3389/fnbot.2023.1214203
ISSN=1662-5218
ABSTRACT=Artificial Intelligence (AI) is a technological science that researches and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. With the rise of AI, the Transformer has been highly successful in various natural language processing (NLP) tasks. However, its attention mechanism incurs a quadratic computational cost with respect to the input sequence length, which limits its efficiency and scalability on long-sequence tasks. To address this challenge, we propose a linear Transformer based on the kernel approach, named Kerformer. Our method simplifies the attention operation by leveraging a nonlinear re-weighting mechanism that transforms the traditional softmax attention into dot-product attention based on feature mapping. The Kerformer algorithm focuses on two key properties of the softmax computation: non-negativity and non-linear re-weighting. To satisfy these properties, we apply a non-negativity operation separately to the Query (Q) and Key (K) matrices and make their computations separable. In addition, we incorporate an SE block to re-weight the non-negativity-processed K matrices and improve the performance of the model. Our approach reduces the time complexity of the attention matrix from O(N^2) to O(N), where N is the sequence length, resulting in significantly improved efficiency and scalability on long-sequence tasks. In our simulation experiments, Kerformer outperformed other methods with lower time and memory consumption. On NLP and vision tasks, Kerformer achieved higher average accuracy (83.39%) and performed better on long-sequence tasks (average accuracy of 58.94%). It also demonstrated superior efficiency and convergence speed on visual tasks compared with other models.
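
The abstract describes the general recipe of kernel-based linear attention: apply a non-negative feature map to Q and K, re-weight K with a squeeze-and-excitation (SE) gate, and exploit associativity so that phi(K)^T V is computed first, giving O(N) cost in sequence length. Below is a minimal PyTorch sketch of that recipe. It is not the paper's implementation: the ELU(x)+1 feature map (a common choice in linear-attention work), the SE gate layout, and the names LinearKernelAttention and SEBlock are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-Excitation gate over the feature dimension (a sketch;
    the paper's exact SE placement and reduction ratio may differ)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)
        self.fc2 = nn.Linear(dim // reduction, dim)

    def forward(self, x):                 # x: (batch, seq, dim)
        s = x.mean(dim=1)                 # squeeze: pool over the sequence
        s = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))
        return x * s.unsqueeze(1)         # excite: per-channel re-weighting

class LinearKernelAttention(nn.Module):
    """Kernel-style linear attention: phi(Q) (phi(K)^T V), normalized,
    computed in O(N) rather than O(N^2)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.se = SEBlock(dim)
        self.eps = eps                    # avoids division by zero

    def feature_map(self, x):
        # Non-negativity: ELU(x) + 1 > 0 everywhere (an assumed choice).
        return F.elu(x) + 1.0

    def forward(self, q, k, v):           # each: (batch, seq, dim)
        q = self.feature_map(q)
        k = self.se(self.feature_map(k))  # SE re-weighting of non-negative K
        # Associativity trick: build the (dim x dim) summary phi(K)^T V once,
        # so cost is O(N d^2) instead of the O(N^2 d) of full attention.
        kv = torch.einsum('bnd,bne->bde', k, v)
        # Row-wise normalizer: phi(q_i) . sum_n phi(k_n).
        z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + self.eps)
        return torch.einsum('bnd,bde,bn->bne', q, kv, z)

# Usage: output shape matches standard attention, but memory and time
# grow linearly with sequence length N.
attn = LinearKernelAttention(dim=64)
q = torch.randn(2, 1024, 64)
k = torch.randn(2, 1024, 64)
v = torch.randn(2, 1024, 64)
out = attn(q, k, v)                       # (2, 1024, 64)
```

The key design point is the order of multiplication: because the feature map makes the attention kernel separable, (phi(Q) phi(K)^T) V can be regrouped as phi(Q) (phi(K)^T V), replacing the N x N attention matrix with a d x d summary and yielding the O(N) complexity the abstract claims.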