BBPpredict: A Web Service for Identifying Blood-Brain Barrier Penetrating Peptides

The blood-brain barrier (BBB) is a major obstacle to delivering drugs into the brain in the treatment of central nervous system (CNS) diseases. Blood-brain barrier penetrating peptides (BBPs), a class of peptides that can cross the BBB through various mechanisms without damaging it, are promising drug candidates for CNS diseases. However, identifying BBPs by experimental methods is time-consuming and laborious. To discover more BBPs as drugs for CNS diseases, computational methods that can quickly and accurately distinguish BBPs from non-BBPs are urgently needed. In the present study, we created a training dataset consisting of 326 BBPs derived from previous databases and published manuscripts and 326 non-BBPs collected from UniProt, and used it to construct a BBP predictor based on sequence information. We also constructed an independent testing dataset with 99 BBPs and 99 non-BBPs. Multiple machine learning methods were compared on the training dataset via nested cross-validation. The results showed that the random forest (RF) method outperformed the other classification algorithms on both the training and independent testing datasets, and the final BBP predictor was constructed on the training dataset. The RF-based predictor, named BBPpredict, performs considerably better than previous state-of-the-art BBP predictors. BBPpredict is expected to contribute to the discovery of novel BBPs, or at least to be a useful complement to the existing methods in this area. BBPpredict is freely available at http://i.uestc.edu.cn/BBPpredict/cgi-bin/BBPpredict.pl.


CTDC
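CTDC describes the amino acid composition of a sequence with respect to a given property. It consists of three values, the proportions of polar, neutral and hydrophobic residues in the protein, and can be defined as follows:

$$C(r) = \frac{N(r)}{N}, \quad r \in \{\text{polar}, \text{neutral}, \text{hydrophobic}\}$$

where $N(r)$ is the number of amino acids of type $r$ in the sequence and $N$ is the sequence length.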
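As a minimal sketch of the composition calculation, the Python snippet below counts the residues falling in each class and divides by the sequence length. The class assignment shown (a hydrophobicity-based grouping) and the function name `ctdc` are assumptions made for illustration; the text above does not list which amino acids belong to each class.

```python
# Assumed hydrophobicity-based grouping (illustrative only; not given in the text).
GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def ctdc(sequence):
    """Return the fraction of residues in each class, C(r) = N(r) / N."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    # Residues outside the 20 standard amino acids count toward N
    # but not toward any class.
    return {
        name: sum(aa in members for aa in sequence) / n
        for name, members in GROUPS.items()
    }


composition = ctdc("RKGACL")  # two residues from each assumed class
```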

CTDT
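CTDT describes the frequency with which a residue of one type is followed by a residue of another type, and also consists of three values. It is given as:

$$T(r,s) = \frac{N(r,s) + N(s,r)}{N-1}, \quad (r,s) \in \{(\text{polar}, \text{neutral}), (\text{neutral}, \text{hydrophobic}), (\text{hydrophobic}, \text{polar})\}$$

where $N(r,s)$ and $N(s,r)$ are the numbers of adjacent residue pairs of types 'rs' and 'sr' in the sequence, respectively, and $N$ is the sequence length.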
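One common reading of the transition descriptor (following the definition in Chen et al., 2018) counts adjacent residue pairs whose classes differ, in either order, and divides by the number of adjacent pairs, N - 1. The sketch below follows that reading. The class assignment and the function name `ctdt` are assumptions for illustration and are not specified in the text.

```python
# Assumed hydrophobicity-based grouping (illustrative only; not given in the text).
GROUP_OF = {
    aa: group
    for group, members in {
        "polar": "RKEDQN",
        "neutral": "GASTPHY",
        "hydrophobic": "CLVIMFW",
    }.items()
    for aa in members
}

TRANSITIONS = [
    ("polar", "neutral"),
    ("neutral", "hydrophobic"),
    ("hydrophobic", "polar"),
]


def ctdt(sequence):
    """Return T(r, s) = (N(r,s) + N(s,r)) / (N - 1) for the three class pairs."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    # Raises KeyError for residues outside the 20 standard amino acids.
    classes = [GROUP_OF[aa] for aa in sequence]
    counts = {pair: 0 for pair in TRANSITIONS}
    for a, b in zip(classes, classes[1:]):
        if (a, b) in counts:
            counts[(a, b)] += 1
        elif (b, a) in counts:
            counts[(b, a)] += 1
    return {pair: c / (n - 1) for pair, c in counts.items()}


transitions = ctdt("RGCK")  # polar, neutral, hydrophobic, polar
```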

CTDD
CTDD consists of five values for each of the three groups (polar, neutral and hydrophobic). Details of the CTDD features are available in (Dubchak et al., 1995; Dubchak et al., 1999; Chen et al., 2018).

Dipeptide Deviation from Expected Mean (DDE)
The DDE feature vector is constructed from the following three parameters (Saravanan and Gautham, 2015).

Dc(r,s), the frequency of dipeptide 'rs' in the sequence, is given as:

$$D_c(r,s) = \frac{N_{rs}}{N-1}, \quad r,s \in \{A, C, D, \ldots, Y\}$$

where $N_{rs}$ is the number of dipeptides consisting of amino acids r and s in the peptide sequence and $N$ is the sequence length.

Tm(r,s), the theoretical mean, is given by:

$$T_m(r,s) = \frac{C_r}{C_N} \times \frac{C_s}{C_N}$$

where $C_r$ is the number of codons that code for amino acid r in dipeptide 'rs', $C_s$ is the number of codons that code for amino acid s in dipeptide 'rs', and $C_N$ is the number of all possible codons excluding the three stop codons (61).

Tv(r,s), the theoretical variance of the dipeptide 'rs', is defined as:

$$T_v(r,s) = \frac{T_m(r,s)\,\bigl(1 - T_m(r,s)\bigr)}{N-1}$$

Finally, DDE(r,s) is given by:

$$DDE(r,s) = \frac{D_c(r,s) - T_m(r,s)}{\sqrt{T_v(r,s)}}$$
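The three DDE parameters can be combined in a few lines of code. The sketch below assumes the formulation of Saravanan and Gautham (2015): the observed dipeptide frequency is compared with a theoretical mean derived from codon counts and scaled by the square root of a binomial-style variance. The codon counts are those of the standard genetic code (61 sense codons); the function name `dde` is hypothetical.

```python
import math
from itertools import product

# Number of codons per amino acid in the standard genetic code.
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
    "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
    "T": 4, "V": 4, "W": 1, "Y": 2,
}
C_N = 61  # all codons except the three stop codons


def dde(sequence):
    """Return the 400 DDE values keyed by dipeptide, e.g. 'AC'."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {}
    for r, s in zip(sequence, sequence[1:]):
        counts[r + s] = counts.get(r + s, 0) + 1
    features = {}
    for r, s in product(CODONS, repeat=2):
        dc = counts.get(r + s, 0) / (n - 1)          # observed frequency
        tm = (CODONS[r] / C_N) * (CODONS[s] / C_N)   # theoretical mean
        tv = tm * (1 - tm) / (n - 1)                 # theoretical variance
        features[r + s] = (dc - tm) / math.sqrt(tv)
    return features


example = dde("AA")
```

Dipeptides that never occur in the sequence receive negative values under this definition, because their observed frequency of zero lies below the theoretical mean.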

Grouped Di-Peptide Composition (GDPC)
The GDPC encoding is similar to the DPC descriptor. It is composed of a total of 25 descriptors, which can be calculated as:

$$f(r,s) = \frac{N_{rs}}{N-1}, \quad r,s \in \{g1, g2, g3, g4, g5\}$$

where $N_{rs}$ is the number of dipeptides in which a residue of group r is followed by a residue of group s, and $N$ is the sequence length. g1, g2, g3, g4 and g5 represent the amino acid groups (GAVLMI), (FYW), (KRH), (DE) and (STCPNQ), respectively.
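A minimal sketch of the GDPC calculation is given below. It maps each residue to one of the five groups listed above, counts ordered pairs of adjacent groups, and divides by the number of adjacent pairs, N - 1. The function name `gdpc` is hypothetical.

```python
from itertools import product

# The five amino acid groups named in the text.
GROUPS = {"g1": "GAVLMI", "g2": "FYW", "g3": "KRH", "g4": "DE", "g5": "STCPNQ"}
GROUP_OF = {aa: g for g, members in GROUPS.items() for aa in members}


def gdpc(sequence):
    """Return the 25 grouped dipeptide frequencies f(r, s) = N_rs / (N - 1)."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    # Raises KeyError for residues outside the 20 standard amino acids.
    labels = [GROUP_OF[aa] for aa in sequence]
    counts = {pair: 0 for pair in product(GROUPS, repeat=2)}
    for pair in zip(labels, labels[1:]):
        counts[pair] += 1
    return {pair: c / (n - 1) for pair, c in counts.items()}


features = gdpc("GFKD")  # groups g1, g2, g3, g4
```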

Moran correlation (Moran)
The Moran feature describes the distribution of amino acid properties along a peptide or protein sequence (Horne, 1988; Feng and Zhang, 2000; Sokal and Thomson, 2006; Xiao et al., 2015). The amino acid properties are taken from the amino acid indices available at http://www.genome.jp/dbget/aaindex.html/. The computation of Moran is described in (Chen et al., 2018).

Geary correlation (Geary)
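Geary is also a descriptor of the distribution of amino acid properties along a protein or peptide sequence (Sokal and Thomson, 2006; Chen et al., 2018). It can be calculated as:

$$C(d) = \frac{\dfrac{1}{2(N-d)} \sum_{i=1}^{N-d} \left(P_i - P_{i+d}\right)^2}{\dfrac{1}{N-1} \sum_{i=1}^{N} \left(P_i - \bar{P}\right)^2}, \quad d = 1, 2, \ldots, nlag$$

where $d$ is the lag of the autocorrelation, $nlag$ is the maximum value of the lag (default value: 30), $N$ is the sequence length, $P_i$ is the property value of the amino acid at position $i$, and $P_{i+d}$ is the property value of the amino acid at position $i+d$. $\bar{P}$ is the average of the considered property $P$ over the entire sequence, calculated as:

$$\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$$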
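The Geary autocorrelation at a single lag can be sketched as follows, assuming the usual form of the statistic: the mean squared difference between property values d positions apart, halved, divided by the sample variance of the property over the sequence. The function operates on a list of per-residue property values; mapping residues to amino acid index values is assumed to have been done beforehand, and the function name `geary` is hypothetical.

```python
def geary(values, d):
    """Geary autocorrelation of a per-residue property at lag d."""
    n = len(values)
    if not 1 <= d < n:
        raise ValueError("lag d must satisfy 1 <= d < len(values)")
    mean = sum(values) / n
    # Halved mean squared difference between positions i and i + d.
    numerator = sum(
        (values[i] - values[i + d]) ** 2 for i in range(n - d)
    ) / (2 * (n - d))
    # Sample variance of the property over the whole sequence.
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    if denominator == 0:
        raise ValueError("property is constant along the sequence")
    return numerator / denominator


lag1 = geary([1.0, 2.0, 3.0, 4.0], 1)
```

Because the lag must be smaller than the sequence length, peptides shorter than nlag + 1 residues cannot yield all nlag values.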

Normalized Moreau-Broto Autocorrelation (NMBroto)
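The MBroto descriptors (Horne, 1988) are defined as follows:

$$AC(d) = \sum_{i=1}^{N-d} P_i \, P_{i+d}, \quad d = 1, 2, \ldots, nlag$$

The normalized descriptors are thus calculated as:

$$ATS(d) = \frac{AC(d)}{N-d}, \quad d = 1, 2, \ldots, nlag$$

where $N$ is the sequence length and the definitions of $d$, $P_i$ and $P_{i+d}$ are the same as those given above.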
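A short sketch of the Moreau-Broto autocorrelation and its normalized form is shown below, assuming the standard definition: the sum of products of property values d positions apart, divided by the number of such pairs, N - d, for the normalized version. Both function names are hypothetical, and the input is assumed to be a list of per-residue property values.

```python
def moreau_broto(values, d):
    """Sum of products of property values d positions apart."""
    n = len(values)
    if not 1 <= d < n:
        raise ValueError("lag d must satisfy 1 <= d < len(values)")
    return sum(values[i] * values[i + d] for i in range(n - d))


def normalized_moreau_broto(values, d):
    """Moreau-Broto autocorrelation divided by the number of pairs, N - d."""
    return moreau_broto(values, d) / (len(values) - d)


raw = moreau_broto([1.0, 2.0, 3.0, 4.0], 1)
normalized = normalized_moreau_broto([1.0, 2.0, 3.0, 4.0], 1)
```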

SOCNumber (Sequence-Order-Coupling Number)
The d-th rank sequence-order-coupling number is calculated as:

$$\tau_d = \sum_{i=1}^{N-d} \left(d_{i,i+d}\right)^2, \quad d = 1, 2, \ldots, nlag$$

where $d_{i,i+d}$ is the entry in a given distance matrix describing the distance between the amino acid at position $i$ and the amino acid at position $i+d$, $N$ is the sequence length, and $nlag$ has the same definition as above.
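The coupling number can be sketched as a sum of squared pairwise distances between residues d positions apart. In the snippet below, the distance table `TOY_DISTANCES` is a made-up three-letter example used only to make the code runnable; it is not a physicochemical distance matrix, and the function names are hypothetical. Some implementations may additionally divide the sum by N - d; that normalization is not applied here.

```python
# Made-up distances over a three-letter alphabet (illustration only).
TOY_DISTANCES = {
    frozenset("AC"): 1.0,
    frozenset("AD"): 2.0,
    frozenset("CD"): 3.0,
}


def toy_distance(a, b):
    """Look up the symmetric toy distance; identical residues have distance 0."""
    return 0.0 if a == b else TOY_DISTANCES[frozenset((a, b))]


def soc_number(sequence, d, distance):
    """Sum of squared distances between residues d positions apart."""
    n = len(sequence)
    if not 1 <= d < n:
        raise ValueError("lag d must satisfy 1 <= d < len(sequence)")
    return sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))


tau_1 = soc_number("ACDA", 1, toy_distance)
tau_2 = soc_number("ACDA", 2, toy_distance)
```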

QSOrder (Quasi-sequence-order)
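A quasi-sequence-order descriptor can be calculated for each amino acid type. It is defined as:

$$X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w \sum_{d=1}^{nlag} \tau_d}, \quad r = 1, 2, \ldots, 20$$

where $f_r$ is the normalized occurrence of amino acid type $r$, $\tau_d$ is the d-th rank sequence-order-coupling number, and $w$ is a weighting factor ($w = 0.1$).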
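The sketch below illustrates one way to compute the per-residue quasi-sequence-order values, assuming the form given in Chen et al. (2018): each normalized residue frequency is divided by the sum of all frequencies plus w times the sum of the coupling numbers over lags 1 to nlag. The distance table is a made-up three-letter example rather than a real physicochemical matrix, and all names in the code are hypothetical.

```python
# Made-up distances over a three-letter alphabet (illustration only).
TOY_DISTANCES = {
    frozenset("AC"): 1.0,
    frozenset("AD"): 2.0,
    frozenset("CD"): 3.0,
}


def toy_distance(a, b):
    """Look up the symmetric toy distance; identical residues have distance 0."""
    return 0.0 if a == b else TOY_DISTANCES[frozenset((a, b))]


def qsorder(sequence, nlag, distance, alphabet, w=0.1):
    """Return X_r = f_r / (sum(f) + w * sum(tau_d)) for each residue type."""
    n = len(sequence)
    if not 1 <= nlag < n:
        raise ValueError("nlag must satisfy 1 <= nlag < len(sequence)")
    # Coupling numbers tau_1 .. tau_nlag.
    tau = [
        sum(distance(sequence[i], sequence[i + d]) ** 2 for i in range(n - d))
        for d in range(1, nlag + 1)
    ]
    # Normalized occurrence of each residue type.
    freq = {aa: sequence.count(aa) / n for aa in alphabet}
    denominator = sum(freq.values()) + w * sum(tau)
    return {aa: f / denominator for aa, f in freq.items()}


x = qsorder("ACDA", nlag=2, distance=toy_distance, alphabet="ACD")
```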

APAAC (Amphiphilic Pseudo-Amino Acid Composition)
APAAC was proposed in (Chou, 2005; Jiao and Du, 2016) and is similar to the PAAC descriptors. Details of the APAAC features can be found in (Chou, 2001; Chen et al., 2018). In total, 1428 features were obtained for each BBP/non-BBP sequence in this study.

Nested cross-validation
A nested five-fold cross-validation was applied to the training dataset (326 BBPs and 326 non-BBPs) to evaluate prediction performance. Nested cross-validation has an inner and an outer loop. The inner loop performs model and parameter selection, while the outer loop estimates the quality of the models trained in the inner loop. In this work, the training dataset was divided into five equal subsets in the outer loop. In each outer iteration, one subset was used as the testing-set and the other four subsets as the training-set. In the inner loop, the training-set constructed in the outer loop was regrouped into five subsets of the same size, of which four were used for tuning parameters (feature number and classifier parameters; details can be found in Tables S1 and S2) and one for evaluating models. It should be noted that the F-scores were calculated on the training-set of the inner loop.
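The nested splitting scheme can be sketched with the standard library alone. The snippet below only generates index sets for the outer and inner loops; it does not train any model. It assumes plain random partitioning without class stratification, which the text does not specify, and the function names and the seed are hypothetical.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, k_outer=5, k_inner=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits is a list of (tuning_indices, validation_indices) pairs
    drawn only from outer_train, so the outer test fold is never seen
    during parameter tuning.
    """
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), k_outer, rng)
    for o, test_idx in enumerate(outer_folds):
        train_idx = [
            i for j, fold in enumerate(outer_folds) if j != o for i in fold
        ]
        inner_folds = k_fold_indices(train_idx, k_inner, rng)
        inner_splits = []
        for v, val_idx in enumerate(inner_folds):
            tune_idx = [
                i for j, fold in enumerate(inner_folds) if j != v for i in fold
            ]
            inner_splits.append((tune_idx, val_idx))
        yield train_idx, test_idx, inner_splits


# 652 samples correspond to the 326 BBPs and 326 non-BBPs of the training dataset.
splits = list(nested_cv_splits(652))
```

In a full pipeline, each candidate parameter setting would be fitted on the tuning indices and scored on the validation indices, and the best setting would then be refitted on the outer training-set and scored once on the outer testing-set.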

Results of the reproducibility analysis
The results of the reproducibility analysis are listed in Table S9. Across the 100 data-sets, the RF algorithm achieved an accuracy of 76.25% ± 3.56%, an MCC of 0.5264 ± 0.0710, an AUC of 0.8563 ± 0.0309, a sensitivity of 75.36% ± 5.54% and a specificity of 77.14% ± 4.62%. These results are highly consistent with the results in Table 3.