AUTHOR=Gao Xiangrui, Cao Changling, He Chenfeng, Lai Lipeng
TITLE=Pre-training with a rational approach for antibody sequence representation
JOURNAL=Frontiers in Immunology
VOLUME=15
YEAR=2024
URL=https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2024.1468599
DOI=10.3389/fimmu.2024.1468599
ISSN=1664-3224
ABSTRACT=Antibodies are a specific class of proteins produced by the adaptive immune system in response to pathogens. Mining the information embedded in antibody amino acid sequences can benefit both antibody property prediction and novel therapeutic development. Protein-specific pre-training models have been used to extract latent representations from protein sequences that capture structural, functional, and homology information. However, compared to other proteins, antibodies possess unique features that should be incorporated through specifically designed training methods, so there is still room for improvement in pre-training models for antibody sequences. In particular, existing protein pre-training models primarily rely on language models without fully considering the differences between protein sequences and natural-language sequences. In this study, we present PARA, a Pre-trained model of Antibody sequences trained with a Rational Approach for antibodies, which employs a training strategy conforming to antibody sequence patterns together with an advanced natural language processing self-encoding model architecture. We demonstrate PARA's performance on several tasks by comparing it to various published antibody pre-training models. The results show that PARA significantly outperforms existing models on these tasks, suggesting that PARA has an advantage in capturing antibody sequence information. We believe that the antibody latent representations provided by PARA can substantially facilitate studies in relevant areas. PARA is available at https://github.com/xtalpi-xic.