AUTHOR=Li Kuan, Zhong Yue, Lin Xuan, Quan Zhe
TITLE=Predicting the Disease Risk of Protein Mutation Sequences With Pre-training Model
JOURNAL=Frontiers in Genetics
VOLUME=Volume 11 - 2020
YEAR=2020
URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2020.605620
DOI=10.3389/fgene.2020.605620
ISSN=1664-8021
ABSTRACT=Accurately identifying missense mutations helps alleviate loss of protein function and structural changes, which may greatly reduce the disease risk associated with tumor suppressor genes (e.g., BRCA1 and PTEN). In this paper, we propose a hybrid framework, called BertVS, that predicts the disease risk of missense mutations in proteins. The framework learns sequence representations from protein domains by pre-training a BERT model, and it also incorporates the hydrophilic properties of amino acids to obtain sequence representations of biochemical characteristics. The concatenation of the two learned representations is then passed to a classifier to predict the disease risk of missense mutations in protein sequences. Specifically, we use the protein family database (Pfam) as a corpus to train the BERT model to learn the contextual information of protein sequences, and our pre-trained BERT model achieves an accuracy of 0.984 on the masked language model prediction task. Compared with the baselines, BertVS achieves higher performance, with an AUROC of 0.920 and an AUPR of 0.915 in the functionally critical domain of the BRCA1 gene. Additionally, an extended experiment on the ClinVar dataset shows that gene variants with known clinical significance can also be classified efficiently by our method. Therefore, BertVS can learn the functional information of protein sequences and effectively predict the disease risk of variants with uncertain clinical significance.
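
The hybrid representation described in the abstract (a learned sequence embedding concatenated with hydrophilicity-based features) can be illustrated with a minimal sketch. It is not the BertVS implementation. It assumes the Kyte-Doolittle hydropathy scale as the hydrophilicity descriptor, since the abstract does not name one, and it uses a fixed toy vector in place of a real BERT embedding. The sequence and the function names `apply_missense`, `hydropathy_features`, and `combine_representations` are hypothetical and chosen only for illustration.

```python
from typing import List

# Kyte-Doolittle hydropathy index per amino acid.
# Assumption: the abstract does not say which hydrophilicity scale BertVS uses.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def apply_missense(seq: str, pos: int, ref: str, alt: str) -> str:
    """Return seq with the residue at 1-based position pos changed from ref to alt."""
    if not 1 <= pos <= len(seq):
        raise ValueError(f"position {pos} is outside the sequence")
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + alt + seq[pos:]


def hydropathy_features(seq: str) -> List[float]:
    """Map each residue to its hydropathy value (one feature per position)."""
    return [KYTE_DOOLITTLE[aa] for aa in seq]


def combine_representations(embedding: List[float], hydro: List[float]) -> List[float]:
    """Concatenate a learned embedding with the biochemical feature vector."""
    return list(embedding) + list(hydro)


# Toy example: a made-up 8-residue sequence and a made-up 3-dimensional embedding
# standing in for the output of a pre-trained protein language model.
wild_type = "MKTAYIAK"
mutant = apply_missense(wild_type, 3, "T", "I")
toy_embedding = [0.12, -0.05, 0.33]
features = combine_representations(toy_embedding, hydropathy_features(mutant))
print(mutant, len(features))  # 3 embedding values + 8 hydropathy values
```

In a real pipeline the combined vector would be the input to a classifier, and the toy embedding would be replaced by the representation produced by the pre-trained model.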