COVIEdb: A Database for Potential Immune Epitopes of Coronaviruses

The 2019 novel coronavirus (2019-nCoV) has caused the worldwide COVID-19 pandemic. Identifying which parts of the 2019-nCoV sequence are recognized by the human immune system is essential for vaccine development. To help prevent potential outbreaks of similar coronaviruses in the future, vaccines targeting immunogenic epitopes shared by different human coronaviruses are also needed. Here we predict all the potential B/T-cell epitopes for SARS-CoV, MERS-CoV, 2019-nCoV, and RaTG13-CoV based on their protein sequences. We found that YFKYWDQTY in the ORF1ab protein, and VYDPLQPEL and TVYDPLQPEL in the spike (S) protein, might be pan-coronavirus targets for vaccine development. All predicted results are stored in the database COVIEdb (http://biopharm.zju.edu.cn/coviedb/).


INTRODUCTION
Two coronaviruses, severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV), have caused two large-scale pandemics in the past two decades (Drosten et al., 2003; Zaki et al., 2012). Now, a third coronavirus-caused pandemic is ongoing (Liu and Saif, 2020; Zhang N. et al., 2020). The causative agent is the 2019 novel coronavirus (2019-nCoV), which was first identified in Wuhan, China, in December 2019 in patients with pneumonia (Zhu et al., 2020). Analysis of the viral genome has revealed that 2019-nCoV is phylogenetically close to SARS-CoV (Lu et al., 2020), and it was therefore named SARS-CoV-2. As of June 5, 2020, 6,640,960 people worldwide have been confirmed to have COVID-19, including 391,285 deaths (∼5.89% fatality rate).
Because they are more cost-effective than treatment and reduce morbidity and mortality without long-lasting effects, vaccines are the most effective strategy for preventing infectious diseases. However, there are still no approved vaccines for human coronaviruses (hCoV). Several types of vaccines are under pre-clinical testing or in clinical trials, including inactivated vaccines, recombinant subunit vaccines, recombinant vector vaccines, and nucleic acid vaccines. In general, modern vaccines, such as recombinant subunit, peptide, and nucleic acid vaccines, offer higher safety and fewer side effects than other vaccine types because they stimulate the immune system without introducing whole infectious viruses (Graham et al., 2013). Nucleic acid vaccines, such as DNA vaccines and mRNA vaccines, represent an innovative approach in which plasmids or mRNAs encoding the antigens are injected directly and elicit a wide range of immune responses (Yang et al., 2004; Pardi et al., 2018). These advantages apply to both prophylactic and therapeutic vaccines against infectious diseases and cancers. For the development of modern vaccines, it is critically important to identify potential immune epitopes of 2019-nCoV, as well as of other infectious pathogens.
Considering the seriousness of the recent outbreaks of zoonotic coronaviruses, pan-coronavirus therapeutic agents and vaccines should be developed to cope with present and future hCoV outbreaks. Here, we predict all the potential B/T-cell epitopes for SARS-CoV, 2019-nCoV, and MERS-CoV to provide potential targets for pan-coronavirus vaccine development. The predictions are based on their protein sequences. RaTG13-CoV is included because of its high homology with 2019-nCoV (96% whole-genome identity). All predicted results are stored in a database named COVIEdb (http://biopharm.zju.edu.cn/coviedb/).
The human leukocyte antigen (HLA) alleles used for T-cell epitope prediction are derived from Zhou et al., who analyzed the HLA distribution of 20,635 individuals of Han Chinese ancestry (Zhou et al., 2016). The top 20 HLA I alleles of the A, B, and C subtypes and the HLA II alleles with a frequency of more than 5% make up the final HLA dataset (Supplementary Table 1).
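The allele selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that "top 20" is applied separately to each of the A, B, and C subtypes, and the function name, input layout, and allele labels are hypothetical placeholders rather than values from Zhou et al.

```python
def select_hla_alleles(class1_by_locus, class2_freqs, top_n=20, min_class2_freq=0.05):
    """Pick HLA I alleles (top_n per locus) and HLA II alleles above a frequency cutoff.

    class1_by_locus: dict mapping a locus label (e.g. "A") to {allele: frequency}.
    class2_freqs:    dict mapping each HLA II allele to its frequency.
    Assumption: "top 20" is taken per locus; the source text does not spell this out.
    """
    class1 = {}
    for locus, freqs in class1_by_locus.items():
        # Sort by descending frequency; break ties alphabetically for reproducibility.
        ranked = sorted(freqs, key=lambda allele: (-freqs[allele], allele))
        class1[locus] = ranked[:top_n]
    # Keep HLA II alleles whose frequency is strictly greater than the cutoff.
    class2 = sorted(a for a, f in class2_freqs.items() if f > min_class2_freq)
    return class1, class2


# Placeholder labels and frequencies, for illustration only.
toy_class1 = {"A": {"A_x": 0.30, "A_y": 0.10, "A_z": 0.20}, "B": {"B_x": 0.05}}
toy_class2 = {"D_x": 0.06, "D_y": 0.05, "D_z": 0.50}
print(select_hla_alleles(toy_class1, toy_class2, top_n=2))
```

With real data, the frequency tables would be filled from the published HLA distribution and `top_n` left at its default of 20.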

B-Cell Epitope Prediction
The B-cell epitopes were predicted with the seven tools embedded in the Immune Epitope Database and Analysis Resource (IEDB) (Vita et al., 2015). More specifically, BepiPred-1.0 (Larsen et al., 2006), BepiPred-2.0 (Jespersen et al., 2017), Chou and Fasman beta turn prediction (Chou and Fasman, 1978), Emini surface accessibility scale (Emini et al., 1985), Karplus and Schulz flexibility scale (Karplus and Schulz, 1985), Kolaskar and Tongaonkar antigenicity scale (Kolaskar and Tongaonkar, 1990), and Parker hydrophilicity prediction (Parker et al., 1986) were used to predict the amino acid sites belonging to B-cell epitopes. All parameters were set to their defaults. The thresholds of each tool are listed in Supplementary Table 2A. In this study, only amino acids supported by at least five of the seven tools are considered part of B-cell epitopes.
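The "at least five tools" rule amounts to a per-residue vote. The sketch below assumes each tool's output has already been converted to a per-residue yes/no call using its own threshold; the function names and the toy calls are hypothetical and do not come from IEDB or from the study's data.

```python
def count_supporting_tools(tool_calls):
    """Return, for each residue, how many tools call it part of a B-cell epitope.

    tool_calls: dict mapping a tool name to a list of booleans, one per residue.
    """
    lengths = {len(calls) for calls in tool_calls.values()}
    if len(lengths) != 1:
        raise ValueError("all tools must cover the same number of residues")
    # zip(*...) walks the sequence residue by residue across all tools.
    return [sum(column) for column in zip(*tool_calls.values())]


def consensus_residues(tool_calls, min_tools=5):
    """Return 0-based indices of residues supported by at least min_tools tools."""
    votes = count_supporting_tools(tool_calls)
    return [i for i, n in enumerate(votes) if n >= min_tools]


# Toy calls for a 4-residue stretch (made-up values, for illustration only).
calls = {
    "BepiPred-1.0":        [True,  True,  False, False],
    "BepiPred-2.0":        [True,  True,  False, True],
    "Chou-Fasman":         [True,  False, False, True],
    "Emini":               [True,  True,  False, True],
    "Karplus-Schulz":      [True,  True,  True,  True],
    "Kolaskar-Tongaonkar": [False, True,  False, False],
    "Parker":              [True,  False, False, True],
}
print(count_supporting_tools(calls))  # votes per residue
print(consensus_residues(calls))      # residues passing the five-tool rule
```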
All of these tools give a score that indicates whether an amino acid is part of a B-cell epitope, but not whether a peptide is a B-cell epitope. Here, we define B_score to quantify the likelihood that a peptide is a B-cell epitope, calculated as follows:

\[ \mathrm{B\_score} = \frac{1}{L} \sum_{a \in \mathrm{peptide}} n_a \]

where L is the length of the peptide, a is an amino acid that belongs to the peptide, and n_a is the number of tools that call that amino acid part of a B-cell epitope.
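As a worked illustration, the sketch below computes a B_score under the assumption that it is the mean of n_a over the L residues of the peptide, which is the reading suggested by the definitions of L, a, and n_a. The function name and the vote values are hypothetical.

```python
def b_score(votes, start, length):
    """Mean number of supporting tools over a peptide.

    votes:  list of n_a values, one per residue of the protein.
    start:  0-based position of the peptide's first residue.
    length: peptide length L.
    Assumption: B_score is the per-residue average of n_a.
    """
    if start < 0 or length <= 0 or start + length > len(votes):
        raise ValueError("peptide does not fit inside the protein")
    window = votes[start:start + length]
    return sum(window) / length


# Made-up votes for a 9-residue peptide scored by seven tools.
toy_votes = [7, 6, 5, 7, 6, 7, 6, 6, 7]
print(round(b_score(toy_votes, 0, 9), 2))
```

Under this reading the score ranges from 0 to 7 when seven tools are used, so a cutoff such as "B_score larger than 4" asks that, on average, more than four tools support each residue of the peptide.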

T-Cell Epitope Prediction
The T-cell epitope prediction was divided into two parts. The first type of T-cell epitope is presented by HLA I alleles and induces the activation of CD8+ T cells. These epitopes were predicted by NetMHCpan 4.0 (Jurtz et al., 2017), MHCflurry (O'Donnell et al., 2018), and DeepHLApan (Wu et al., 2019). The second type is presented by HLA II alleles and induces the activation of CD4+ T cells. These epitopes were predicted by MixMHC2pred (Racle et al., 2019) and NetMHCIIpan (Karosiene et al., 2013). The thresholds used by each tool to define potential T-cell epitopes are listed in Supplementary Table 2B.
For the prediction of T-cell epitopes presented by HLA I alleles, all peptides of 8 to 11 amino acids were extracted and combined with the selected HLA I alleles to create HLA-peptide pairs. The prediction for HLA II alleles followed the same procedure, except that the peptide length ranged from 15 to 28. Only HLA-peptide pairs that satisfied the thresholds of all tools used were considered potential T-cell epitopes in this study.
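The pairing and filtering steps can be sketched as follows. This is an illustrative outline only: the prediction tools themselves are not called, the tool names `tool_A` and `tool_B` and their cutoffs are hypothetical stand-ins (the actual thresholds are in Supplementary Table 2B), and the input sequence is simply the 10-mer TVYDPLQPEL mentioned in this article.

```python
from itertools import product


def enumerate_peptides(protein, min_len=8, max_len=11):
    """Return every sliding-window peptide with min_len <= length <= max_len."""
    peptides = []
    for k in range(min_len, max_len + 1):
        for i in range(len(protein) - k + 1):
            peptides.append(protein[i:i + k])
    return peptides


def hla_peptide_pairs(alleles, peptides):
    """Combine every allele with every peptide."""
    return list(product(alleles, peptides))


def passes_all_thresholds(scores, cutoffs):
    """True only if the pair satisfies the cutoff of every tool.

    scores:  dict mapping tool name to the score predicted for one HLA-peptide pair.
    cutoffs: dict mapping tool name to a function returning True for a hit.
    """
    return all(is_hit(scores[tool]) for tool, is_hit in cutoffs.items())


# HLA I setting: peptides of 8 to 11 residues (use 15 and 28 for HLA II).
peptides = enumerate_peptides("TVYDPLQPEL")
pairs = hla_peptide_pairs(["HLA-C07:02", "HLA-A02:06"], peptides)

# Hypothetical cutoffs: one rank-based (lower is better), one score-based.
cutoffs = {
    "tool_A": lambda rank: rank <= 2.0,
    "tool_B": lambda score: score >= 0.5,
}
print(len(peptides), len(pairs))
print(passes_all_thresholds({"tool_A": 1.0, "tool_B": 0.7}, cutoffs))
```

Requiring agreement from every tool trades sensitivity for specificity: a pair rejected by a single predictor is dropped even if the others score it highly.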

DATA DESCRIPTION
Genome Organization of Four Coronaviruses
All selected coronaviruses have a similar genome organization, with genes encoding the spike (S protein), envelope (E protein), membrane (M protein), and nucleoprotein (N protein), as well as several open reading frames. SARS-CoV, 2019-nCoV, MERS-CoV, and RaTG13-CoV express 9, 8, 10, and 9 non-redundant protein-coding genes, respectively (Figure 1A). In SARS-CoV, orf3b overlaps with orf3a and the E gene, orf7b overlaps with orf7a, orf8b overlaps with orf8a, and orf9b is part of orf9a (the N gene). In 2019-nCoV, only orf7b overlaps with orf7a, and the other genes are separate. In MERS-CoV, orf4b overlaps with orf4a, and orf8b is part of the N gene. In RaTG13-CoV, ns7b and ns7a overlap.

Characterization of Predicted B/T-Cell Epitopes
Although some genes overlap, we predicted the potential B/T-cell epitopes for all genes because overlapping genes encode different proteins. The results show that the numbers of predicted epitopes differ but are similar among the homologous proteins of the four coronaviruses (Figure 1B and Supplementary Table 3). Taking the S protein as an example, an average of 444 peptides are predicted as epitopes presented by HLA I alleles across the four coronaviruses, ranging from 423 in RaTG13-CoV to 482 in MERS-CoV. An average of 1,615 peptides are predicted as epitopes presented by HLA II alleles, ranging from 1,471 in 2019-nCoV to 1,804 in MERS-CoV. An average of 323 amino acids are predicted to be part of B-cell epitopes, ranging from 279 in SARS-CoV to 359 in 2019-nCoV. The differences in predicted B/T-cell epitopes among the S proteins are therefore minor, and a similar pattern is seen in the other homologous genes.
In general, the number of predicted B/T-cell epitopes is positively correlated with the length of the encoded protein (Figure 1C). However, there are exceptions in which a longer protein has fewer predicted epitopes, such as the N protein compared with the M protein in 2019-nCoV (Figure 1D). Although it is only about half the length of the N protein, the M protein has more predicted T-cell epitopes presented by both HLA I and HLA II alleles, which suggests that the M protein is more readily recognized by T cells than the N protein.
In addition, all proteins have predicted epitopes presented by HLA II alleles except ORF8a in SARS-CoV, which might be ascribed to its short length and lower immunogenicity.
For better visualization of the predicted B/T-cell epitopes, we created a database named COVIEdb (http://biopharm.zju.edu.cn/coviedb/). Its four main pages, "B-epitope", "T-epitope", "Peptide", and "Validated", let researchers find useful information easily and quickly. The predicted B-cell epitopes can be searched on the "B-epitope" page: once the virus and gene are selected, the corresponding predicted B-cell epitopes appear. The predicted T-cell epitopes can be searched on the "T-epitope" page. As on the "B-epitope" page, the coronavirus and protein must be selected, and the type of T-cell epitope must also be chosen. Only the peptide-HLA pairs that satisfy the thresholds of all tools are shown on this page. The searchable data on the "Peptide" page combine the predicted B-cell epitopes and T-cell epitopes, and the only selectable parameter on this page is the protein. The "Validated" page contains the predicted B/T-cell epitopes that have been validated in the recent literature (Le Bert et al., 2020; Zhang B. Z. et al., 2020). To date, the "Validated" page holds only 116 validated epitopes, but more validated data will be added as research on coronaviruses grows.

Shared B/T-Cell Epitopes
Although human coronaviruses evolve rapidly, we sought B/T-cell epitopes that are conserved and shared among different coronaviruses for pan-coronavirus vaccine development. Based on the predicted B-cell and T-cell epitopes, we found 77 peptides that exist in all four coronaviruses and have the potential to induce T-cell activation, 10 of which have a B_score larger than 4 (Table 1 and Supplementary Table 4). In particular, the peptide YFKYWDQTY from ORF1ab could be presented in 7.33% of people, which might make it a good candidate for vaccine design.
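Finding peptides shared by all coronaviruses is, at its core, a set intersection followed by a B_score filter. The sketch below shows that idea under stated assumptions: the function name is hypothetical, and apart from YFKYWDQTY (reported above as shared) the peptides and all B_score values are made-up placeholders.

```python
def shared_epitopes(epitopes_by_virus, b_scores=None, min_b_score=None):
    """Return peptides predicted as T-cell epitopes in every virus.

    epitopes_by_virus: dict mapping a virus name to an iterable of peptides.
    b_scores:          optional dict mapping a peptide to its B_score.
    min_b_score:       if given, keep only peptides with B_score above it.
    """
    sets = [set(peptides) for peptides in epitopes_by_virus.values()]
    if not sets:
        return set()
    shared = set.intersection(*sets)
    if b_scores is not None and min_b_score is not None:
        shared = {p for p in shared if b_scores.get(p, 0.0) > min_b_score}
    return shared


# Placeholder inputs for illustration only.
toy = {
    "SARS-CoV":   ["YFKYWDQTY", "AAAAAAAAA", "CCCCCCCCC"],
    "2019-nCoV":  ["YFKYWDQTY", "AAAAAAAAA"],
    "MERS-CoV":   ["YFKYWDQTY", "AAAAAAAAA", "DDDDDDDDD"],
    "RaTG13-CoV": ["YFKYWDQTY", "AAAAAAAAA"],
}
toy_b_scores = {"YFKYWDQTY": 4.5, "AAAAAAAAA": 2.0}  # made-up scores
print(sorted(shared_epitopes(toy)))
print(sorted(shared_epitopes(toy, toy_b_scores, min_b_score=4)))
```

Exact string matching is a strict notion of "shared"; peptides that differ by a single residue between viruses are treated as different epitopes.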
All the T-cell epitopes shared by the four coronaviruses are located in ORF1ab. However, the S protein is the most important coronavirus protein because it contains the receptor binding domain (RBD), so we further investigated the shared epitopes located in the S protein. There are 265 potential epitopes in the S protein shared by three coronaviruses, 35 of which have a B_score larger than 5 (Supplementary Table 5). The peptides VYDPLQPEL and TVYDPLQPEL even have a B_score larger than 6. Notably, although these two peptides differ by only one amino acid, the HLA alleles that can bind them are different. VYDPLQPEL can be presented by HLA-C07:02, HLA-C04:01, and HLA-C14:02, with an overall frequency of 8.26% in the Chinese Han population, while TVYDPLQPEL can be presented by HLA-A02:06 and HLA-C12:03, with a frequency of 2.44%. The two peptides are distinct epitopes, but they can be treated as a single target when choosing a vaccine target, which supports their feasibility as a potential pan-coronavirus vaccine target. We believe that these results and the developed database could benefit vaccine development (especially multiple-epitope vaccines that could protect against various coronaviruses) and also provide targets for drug design, such as neutralizing antibodies, against 2019-nCoV and possible future coronavirus outbreaks.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: COVIEdb (http://biopharm.zju.edu.cn/coviedb/).

AUTHOR CONTRIBUTIONS
ZZ and JJ conceived of the idea and supervised the study. JW, WC, and JZ performed the epitope prediction. JW constructed and maintained the database and web interface. WZ, YS, HZ, PY, and SC participated in the data analysis. JW and ZZ wrote the manuscript. All authors contributed to the article and approved the submitted version.

ACKNOWLEDGMENTS
This manuscript has been released as a pre-print at bioRxiv.