AUTHOR=Gao Xiangrui, Cao Changling, He Chenfeng, Lai Lipeng
TITLE=Pre-training with a rational approach for antibody sequence representation
JOURNAL=Frontiers in Immunology
VOLUME=15
YEAR=2024
URL=https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2024.1468599
DOI=10.3389/fimmu.2024.1468599
ISSN=1664-3224
ABSTRACT=Antibodies are a specific class of proteins produced by the adaptive immune system in response to pathogens. Mining the information embedded in antibody amino acid sequences can benefit both antibody property prediction and novel therapeutic development. Protein-specific pre-training models have been used to extract latent representations from protein sequences that capture structural, functional, and homology information. However, compared to other proteins, antibodies possess unique features that should be incorporated through specifically designed training methods, so there is still room for improvement in pre-training models for antibody sequences. In particular, existing protein pre-training models primarily rely on language models without fully considering the differences between protein sequences and natural-language sequences. In this study, we present PARA, a Pre-trained model of Antibody sequences trained with a Rational Approach for antibodies, which employs a training strategy conforming to antibody sequence patterns together with an advanced natural language processing self-encoding model architecture. We demonstrate PARA's performance on several tasks by comparing it to various published antibody pre-training models. The results show that PARA significantly outperforms existing models on these tasks, suggesting that PARA has an advantage in capturing antibody sequence information. We believe that the antibody latent representations provided by PARA can substantially facilitate studies in relevant areas. PARA is available at https://github.com/xtalpi-xic.