ACP-DRL: an anticancer peptides recognition method based on deep representation learning

Cancer, a significant global public health issue, caused about 10 million deaths in 2022. Anticancer peptides (ACPs), a category of bioactive peptides, have become a focal point in clinical cancer research due to their potential to inhibit tumor cell proliferation with minimal side effects. However, the recognition of ACPs through wet-lab experiments still suffers from low efficiency and high cost. Our work proposes ACP-DRL, a recognition method for ACPs based on deep representation learning, to address these challenges. ACP-DRL marks an initial exploration of integrating protein language models into ACPs recognition, employing in-domain further pre-training to strengthen the learned deep representations. Simultaneously, it employs bidirectional long short-term memory networks to extract amino acid features from sequences. Consequently, ACP-DRL eliminates constraints on sequence length and the dependence on manual features, showing remarkable competitiveness in comparison with existing methods.


Introduction
Cancer is a major public health problem worldwide and one of the leading causes of death (Siegel et al., 2023). Typical treatment options to reduce the burden of cancer on human health involve surgery, radiotherapy, and/or systemic therapy. However, the toxicities associated with traditional treatment methods present considerable challenges for tolerability and adherence, making it difficult for patients to complete their prescribed treatment regimens (Mun et al., 2018). Therefore, the development of new anticancer drugs with higher efficacy, low resistance, and fewer adverse effects is necessary. Anticancer peptides (ACPs) potentially offer new perspectives for achieving this goal (Gabernet et al., 2016). Owing to their intrinsic nature as cationic amphiphiles, ACPs exhibit unique, receptor-independent mechanisms. These peptides display an exceptional capacity to selectively target and eliminate cancer cells via folding-dependent membrane disruption (Aronson et al., 2018). On the one hand, ACP therapy has been extensively researched and applied in preclinical and various stages of clinical trials against tumors (Pelliccia et al., 2019; Liu et al., 2024). On the other hand, the time-consuming and costly process of identifying ACPs through biological experiments, as well as the limited number of available ACPs, has hindered its development.
Fortunately, with the tremendous progress made in the field of machine learning over the past decades, employing computational methods to predict typical peptides has become feasible. As a result, various recognition methods for ACPs based on amino acid sequences have emerged, such as iACP (Chen et al., 2016), PEPred-Suite (Wei et al., 2019), ACPred-Fuse (Rao et al., 2020), iACP-DRLF (Lv et al., 2021), AntiCP 2.0 (Agrawal et al., 2021), ACP-check (Zhu et al., 2022), and ACP-BC (Sun et al., 2023). These methods adopt diverse approaches to convert amino acid sequences into numerical representations and use machine learning algorithms to uncover patterns within these features. Among them, AntiCP 2.0 relies on common feature extraction techniques such as dipeptide composition and an ETree classifier model. In contrast, iACP-DRLF leverages two deep representation learning techniques alongside LGBM for refined feature selection. ACP-check integrates a bidirectional long short-term memory (Bi-LSTM) network with a fully connected network, facilitating predictions based on both raw amino acid sequences and handcrafted features. ACP-BC is a three-channel end-to-end model that employs data augmentation techniques integrated in various combinations.
Despite the numerous informatics approaches proposed for ACPs recognition, there is still room for improvement. For instance, AntiCP 2.0 requires the target peptide sequence length to be between 4 and 50 residues, while iACP-DRLF introduces a complex feature extraction strategy. More importantly, the scarcity of experimentally annotated ACP datasets significantly constrains the utilization and performance of machine learning. In light of these considerations, this study proposes ACP-DRL. ACP-DRL incorporates advanced language models that can efficiently exploit vast unlabeled datasets and extend sequence length through positional encoding, while its Bi-LSTM operates without imposing restrictions on sequence length. In ACP-DRL, we shift the focus to deep representation learning, alleviating the scarcity of ACP datasets through the application of extensive unlabeled data. This allows predictions on longer sequences and reduces dependence on feature engineering based on expert knowledge. At the same time, ACP-DRL demonstrates exceptional performance in comparison with existing methods.

Datasets
To ensure a fair comparison, the main and alternate datasets supplied by AntiCP 2.0 (Agrawal et al., 2021) were employed in this research. These consolidated datasets incorporate data harvested from numerous databases, including DADP (Novković et al., 2012), CAMP (Waghu et al., 2014), APD (Wang and Wang, 2004), APD2 (Wang et al., 2009), CancerPPD (Tyagi et al., 2015), UniProt (Consortium, 2015), and SwissProt (Gasteiger et al., 2001). The positive dataset, enriched with experimentally validated ACPs, was derived from a conjoined compilation of the antimicrobial peptide (AMP) database and the CancerPPD database. In contrast, the main negative dataset consisted of AMPs lacking anticancer activity, sourced solely from the AMP database, while the alternate negative dataset encompassed random peptides extracted from proteins within the SwissProt database. The main dataset includes 861 ACPs and an equal number of non-ACPs, while the alternate dataset holds 970 of both ACPs and non-ACPs.
We additionally created an imbalanced dataset (comprising 845 ACPs and 3,800 non-ACPs) for five-fold cross-validation, which encompasses all data from both the main and alternate datasets. Additional sequence data were obtained from Rao et al. (2020), and we used the CD-HIT algorithm to construct non-redundant sequences.
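CD-HIT itself is a standalone clustering tool; as a rough illustration of the greedy redundancy-removal idea it implements, the sketch below keeps the longest sequence of each similarity cluster and discards near-duplicates. It uses Python's stdlib `difflib` similarity in place of CD-HIT's word-based sequence identity, an assumption for illustration only.

```python
from difflib import SequenceMatcher

def remove_redundant(seqs, threshold=0.9):
    """Greedy CD-HIT-style filtering: the longest sequences become
    cluster representatives; any sequence whose similarity to an
    existing representative reaches the threshold is discarded."""
    reps = []
    for seq in sorted(seqs, key=len, reverse=True):
        if all(SequenceMatcher(None, seq, r).ratio() < threshold for r in reps):
            reps.append(seq)
    return reps
```

The threshold plays the role of CD-HIT's identity cutoff: raising it keeps more, closely related sequences; lowering it prunes more aggressively.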
Furthermore, we collected approximately 1.5 million peptide sequences from PeptideAtlas (Omenn et al., 2022) as an unlabeled dataset for in-domain further pre-training of the protein language model.
We assessed the amino acid composition (AAC) of peptides and generated six sample sequence logos (Supplementary Figure S1) for the in-domain further pre-training dataset (IFPT) and the main and alternate datasets. This was done to gain insight into the residue preferences at the N-terminus and C-terminus in these three datasets.
The results indicate that both the main and alternate datasets show a high predominance of 'K', 'L', and 'A' residues at the N-terminus, and 'K' and 'L' at the C-terminus (Supplementary Figure S1A), consistent with previous studies (Agrawal et al., 2021). However, no particular amino acid type dominated at the N-terminus (Supplementary Figure S1A) in the IFPT dataset, suggesting little to no conservation. As for the C-terminus (Supplementary Figure S1B), it often concluded with either 'K' or 'R', most likely influenced by specific enzyme cleavage sites, as the C-terminus is the last part to form during protein synthesis. The presence of amino acids such as lysine or arginine could significantly shape this cleavage process, with enzymes like trypsin specifically cleaving after these residues, thereby affecting their prevalence at the C-termini. It can be discerned that the dataset used for in-domain further pre-training does not exhibit substantial similarity with the datasets used for anticancer peptide recognition.
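Amino acid composition is simply the per-residue frequency vector of a peptide; a minimal sketch of how such AAC values can be computed:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Return the fraction of each standard residue in a peptide."""
    counts = Counter(seq)
    total = len(seq)
    return {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}
```

Averaging these vectors over a dataset (or over fixed N-terminal/C-terminal windows, as sequence logos do) yields the residue-preference profiles discussed above.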

Framework of ACP-DRL
As depicted in Figure 1, the framework of ACP-DRL consists of three main modules. The first section delineates the representation of peptide sequences. The second elucidates the further pre-training of the protein language model. The third explains the process of extracting peptide sequence features using a Bi-LSTM and subsequently classifying peptides based on the extracted features.

Tokenized peptides representation
The initial section of Figure 1 illustrates the process by which peptide sequences are tokenized, that is, each amino acid is converted into its corresponding numerical ID. These IDs are subsequently used as inputs for our peptide language model. The vocabulary of our language model contains a total of 26 tokens: five special tokens and 20 tokens representing the standard amino acids in their single-letter abbreviations. Additionally, the token "X" has been specifically designated to denote non-generic or unresolved amino acids. This allocation of "X" accommodates non-standard amino acids, enhancing the model's adaptability and flexibility.
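A minimal sketch of this tokenization, assuming BERT-style names for the five special tokens ([PAD], [UNK], [CLS], [SEP], [MASK]), which the text does not spell out:

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # assumed names
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues
# 5 special + 20 amino acids + "X" = 26 tokens
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + AMINO_ACIDS + ["X"])}

def tokenize(seq):
    """Map a peptide to IDs, wrapping with [CLS]/[SEP] and sending
    any non-standard residue to the catch-all token 'X'."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(aa, VOCAB["X"]) for aa in seq]
    ids.append(VOCAB["[SEP]"])
    return ids
```

For example, a residue such as 'B' (not one of the 20 standard amino acids) is mapped to the ID of "X" rather than raising an error.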

Language model with in-domain further pre-training
There is a perspective within the community that proteins can be represented by amino acids and thus approximated as a unique form of natural language (Ofer et al., 2021). Recent academic research has reinforced this perspective with the release of numerous protein language models. Elnaggar et al. (2021) put forward the BERT-BFD model, trained on the BFD dataset (Steinegger et al., 2019), which comprises an impressive 2,122 million protein sequences. Concurrently, OntoProtein was put forth by Zhang et al. (2022), employing knowledge graphs and gene ontology. These efforts have yielded excellent protein language models. Upon consideration, we selected OntoProtein as the foundational model and conducted further training on top of it in our work.
Pre-training broadly involves the initial training of a model on a large dataset, which enables it to acquire universal features. In-domain further pre-training adds a layer of refinement to the pre-trained model using task-relevant data from a specific field or task. This additional step aims to bolster model performance on its designated tasks (Grambow et al., 2022).
In the context of our research, we collected and employed the IFPT dataset (about 1.5 million peptide sequences) to incrementally adapt OntoProtein so that it approximates the peptide-level feature space more closely. Through this strategy, we obtained the OntoProtein within Peptides (OPP) model, whose learnable deep representations can be continuously refined during the training of downstream tasks. The imperative behind this step is to facilitate OntoProtein's adaptation to the transition from protein sequences to peptide sequences. It is worth noting that, although OntoProtein jointly trains knowledge embedding (KE) and masked language modeling (MLM) objectives, during the in-domain further pre-training stage we trained only the MLM objective.
As shown in Figure 1, during the in-domain further pre-training stage, a subset of amino acids is masked, and the language model must predict the masked amino acids from contextual information. This prompts the language model to learn the underlying information of peptide sequences. In our approach, each token (amino acid) is selected with a probability of 15%, and we use cross-entropy loss to estimate the predictions for these selected tokens. Following the standard masked language modeling recipe, the selected tokens are handled in a fixed ratio: 80% are replaced with the mask token, 10% are replaced with random tokens, and the remaining 10% keep their original identity.
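The masking procedure described above follows the standard BERT recipe; a minimal sketch (function and argument names are our own, not taken from the ACP-DRL code):

```python
import random

def mask_tokens(ids, mask_id, vocab_size, special_ids, p=0.15, rng=None):
    """BERT-style masking: each non-special token is selected with
    probability p; of the selected tokens, 80% become the mask token,
    10% become a random token, and 10% stay unchanged. Returns
    (inputs, labels) with labels = -100 where no prediction is needed,
    the conventional ignore-index for cross-entropy loss."""
    rng = rng or random.Random()
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if tok in special_ids or rng.random() >= p:
            continue
        labels[i] = tok           # the model must recover this token
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id   # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # else: 10% keep the original token
    return inputs, labels
```

Cross-entropy is then computed only at positions where the label is not -100, exactly the 15% of tokens selected above.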

Fine-tuning layer and classifier
BERT has demonstrated significant potential in text classification, with researchers commonly acknowledging that the "[CLS]" token is expected to capture information from the entire sequence (Sun et al., 2019; Wang and Kuo, 2020). Consequently, early classification tasks often relied solely on the information from the "[CLS]" token; however, this practice is not considered optimal (Jiang et al., 2021; Kim et al., 2021). To further extract sequence features from the peptides, we added an extra fine-tuning layer rather than connecting the "[CLS]" output directly to a fully connected layer. Bi-LSTM is particularly suitable for handling sequence data, as it captures both preceding and following contextual information.
The LSTM comprises four components: the forget gate f_t, the input gate i_t, the cell state C_t, and the output gate o_t. The forget gate f_t takes a value between 0 and 1. When an element of f_t is 0, it blocks the corresponding value of the previous cell state C_{t-1}, achieving selective forgetting. Meanwhile, the input gate i_t contributes new information to the cell state C_t, thereby updating it. This selective interplay of remembering and forgetting effectively addresses gradient explosion, gradient vanishing, and long-range dependency issues commonly encountered in traditional RNNs. The whole process is as follows:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

where σ denotes the sigmoid function, ⊙ element-wise multiplication, x_t the input at step t, and h_t the hidden state.

In this study, we employed Bi-LSTM to extract contextual information. As illustrated in the third section of Figure 1, our OPP model furnishes a high-dimensional encoding for each amino acid in a peptide sequence. We feed this encoding into two LSTMs (forward and backward) and combine their final state vectors to obtain a feature vector for each peptide. We then apply a fully connected layer and the Softmax function for classification, with a default threshold of 0.5: if the predicted probability of the positive class exceeds 0.5, the target peptide sequence is classified as an ACP; otherwise, it is designated a non-ACP.
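As a sketch of how the gate equations above combine into a bidirectional feature extractor, the following simplified NumPy illustration runs one LSTM in each direction and concatenates the final states. It is an illustration of the mechanism under our own weight layout, not the actual ACP-DRL implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(xs, W, b):
    """Run a single-direction LSTM over xs of shape (T, d_in).
    W has shape (4*h, d_in + h), stacking the gate weights [f; i; C; o];
    b has shape (4*h,). Returns the hidden states, shape (T, h)."""
    hidden = W.shape[0] // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    hs = []
    for x in xs:
        z = W @ np.concatenate([h, x]) + b
        f = sigmoid(z[:hidden])               # forget gate f_t
        i = sigmoid(z[hidden:2 * hidden])     # input gate i_t
        g = np.tanh(z[2 * hidden:3 * hidden]) # candidate cell state
        o = sigmoid(z[3 * hidden:])           # output gate o_t
        c = f * c + i * g                     # cell state C_t
        h = o * np.tanh(c)                    # hidden state h_t
        hs.append(h)
    return np.stack(hs)

def bilstm_features(xs, Wf, bf, Wb, bb):
    """Concatenate the last forward state and last backward state
    into a single sequence-level feature vector."""
    fwd = lstm_forward(xs, Wf, bf)
    bwd = lstm_forward(xs[::-1], Wb, bb)
    return np.concatenate([fwd[-1], bwd[-1]])
```

The resulting 2h-dimensional vector is what a fully connected layer with Softmax would then map to the ACP/non-ACP probabilities.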

Performance evaluation
The evaluation in this study is conducted using four metrics, namely accuracy (Acc), sensitivity (Sen), specificity (Sp), and Matthews correlation coefficient (MCC), in line with previous studies. The metrics are defined as follows:

Acc = (TP + TN) / (TP + TN + FP + FN)
Sen = TP / (TP + FN)
Sp = TN / (TN + FP)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
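These four metrics follow directly from the confusion-matrix counts; a minimal sketch:

```python
from math import sqrt

def classification_metrics(tp, tn, fp, fn):
    """Compute Acc, Sen, Sp, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)          # true positive rate
    sp = tn / (tn + fp)           # true negative rate
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sen": sen, "Sp": sp, "MCC": mcc}
```

Unlike accuracy, MCC stays informative when the classes are imbalanced, which is why it is reported alongside the other three.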

Results and discussion
We ran ACP-DRL on a single node of the GPU cluster at the National Center for Protein Sciences (Beijing). We trained for 20 epochs with a learning rate of 2e-5 on eight Tesla V100 GPUs, using Adafactor as the optimizer and a cosine-with-restarts learning rate schedule. The batch size was set to 32 for each training iteration. In this section, we commence with the evaluation of the language model, followed by an assessment of various fine-tuning layers. Finally, we compare the results of ACP-DRL with existing methods.

Evaluation of different language models
The development of Artificial Intelligence for Science has provided scholars with readily available protein language models. To assess the feasibility of current typical protein language models for ACPs recognition, we gathered a randomly initialized BERT, BERT-BFD trained on 2,122 million protein sequences, and OntoProtein, which incorporates joint training with KE and gene ontology, for training and evaluation. We designed a common vocabulary for these three models, encoded peptide sequences from the main dataset with each model, and then employed a fully connected layer for classification on the encoded results. The evaluation results (Figure 2) suggest that the three language models perform similarly in terms of sensitivity. However, regarding specificity, the initialized BERT performs worse, which may account for its lower accuracy. BERT-BFD and OntoProtein, both pre-trained on a substantial volume of protein sequences, demonstrate roughly equivalent performance. Overall, OntoProtein is slightly inferior to BERT-BFD in sensitivity but achieves advantages in accuracy, specificity, and MCC, with the benefit in specificity being more pronounced. Furthermore, considering its slightly higher MCC than BERT-BFD, we propose that OntoProtein has more potential for the task at hand.

Performance of different finetuning layers
We can obtain the encoding of each amino acid in a peptide sequence through language models. To further extract sequence features, we used OntoProtein as the base model and experimented with various fine-tuning layers on the main dataset. Specifically, the fully connected layer used only the encoding of the "[CLS]" token for classification, while Text-CNN, forward LSTM, and Bi-LSTM used the encoding information of the entire sequence. Figure 3 illustrates the experimental results under different fine-tuning layers. It can be observed that using only the encoding of the "[CLS]" token for classification is not satisfactory, corroborating the findings of Kim et al. (2021). The performance improves somewhat with a simple forward LSTM, and a significant leap is observed when a backward LSTM is incorporated. Text-CNN demonstrates a certain level of competitiveness in this task but falls short of Bi-LSTM, reaffirming our confidence in choosing Bi-LSTM as the fine-tuning layer.

Evaluation of in-domain further pre-training
In-domain further pre-training is a primary approach for enhancing language models using additional in-domain datasets. We gathered a dataset comprising approximately 1.5 million peptide sequences to adapt OntoProtein to the peptide domain, ultimately obtaining the OPP model used for actual training. To better understand the distribution changes of feature information in language models, we employed t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize model features. We discussed three stages of the language model: (a) the randomly initialized BERT model, (b) the OntoProtein model released by Zhang et al. (2022), and (c) the OPP model obtained through our additional pre-training. Figures 4A, B present the t-SNE visualization results of the test sets from the main and alternate datasets at these three stages. It can be observed that the initialized BERT exhibits considerable overlap of points on both datasets, confirming the subpar testing results shown in Figure 2. This phenomenon may be attributed to the model's excessive parameter count relative to the small training dataset. OntoProtein and our OPP model perform well in the t-SNE visualizations, displaying distinct sample clusters. On the main dataset, the ACPs of our OPP model are more tightly clustered than those of OntoProtein, while on the alternate dataset, the OPP model has fewer non-ACPs mixed in among its ACPs. Therefore, it is reasonable to conclude that the in-domain further pre-training strategy implemented in this study, utilizing the IFPT dataset, augments the model's performance on both the main and alternate datasets. This enhancement is observable notwithstanding the significant differences in amino acid composition and positional preference between the unlabeled IFPT dataset and the datasets used for downstream tasks, as depicted in Supplementary Figure S1.

Evaluation on the main and alternate dataset
After confirming the superior performance of our OPP model, we adopted it to construct the ACP-DRL model. To evaluate the performance of ACP-DRL, we conducted a comparison with other machine learning or deep learning models, including iACP, PEPred-Suite, ACPred-Fuse, AntiCP 2.0, iACP-DRLF, ACP-check, and ACP-BC. In this evaluation, we used the benchmark datasets (main and alternate datasets) proposed by AntiCP 2.0.
The original ACP-BC paper does not furnish performance metrics on our benchmark datasets. Hence, using the source code available in their GitHub repository (https://github.com/shunmengfan/ACP-BC), we conducted experiments on our benchmark datasets with the best parameters stated in their paper: data augmentation factor (R) set to 1.0, LSTM hidden layer (C) with 256 nodes, number of neurons in the embedding layer (D) set to 512, and a learning rate of 1e-3. For ACP-check and iACP-DRLF, we adopted the metric values reported in their respective papers; hence there are slight differences in numerical precision, as ACP-check reports metrics to 1% and iACP-DRLF to 0.1%. The performance of the remaining methods came from the metric values reported by AntiCP 2.0 after evaluations on the main and alternate datasets.
Table 1 and Table 2 respectively display the performance on the main and alternate datasets, with the best performance for each metric highlighted in bold. As shown in Table 1, our model achieved the highest accuracy, specificity, and MCC on the main dataset, with a sensitivity close to that of ACP-check. The advantage is even more pronounced on the alternate dataset (Table 2), where our model reached an accuracy of 94.43%. Although our sensitivity was slightly lower than that of ACP-check, our specificity exceeded ACP-check's by 3.64%. Overall, compared with existing advanced methods, the ACP-DRL model proposed in this study is highly competitive.

Five-fold cross-validation on imbalanced dataset
To further illustrate the effectiveness of our model, we constructed an imbalanced dataset (comprising 845 ACPs and 3,800 non-ACPs) for five-fold cross-validation. For this validation, we chose ACP-BC, the most recent model, and ACP-check, which offers competitive performance on the main and alternate datasets, for comparison.
We downloaded the source code for ACP-check from its open-source project (https://github.com/ystillen/ACP-check) and tailored a version for our five-fold cross-validation. For ACP-check, we chose the parameters best suited to the main dataset (lr = 1e-3, batch size = 50, epoch = 30). For ACP-BC, we again used the previously mentioned code and optimal parameters.
In this cross-validation, we included two additional evaluation metrics, the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPR), to further assess the model. While AUC serves as a common indicator of classification performance across different thresholds (with a score close to 1.0 indicating strong performance), AUPR focuses more on classifier performance when positive and negative samples are imbalanced.
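Both areas can be computed directly from predicted scores; a minimal sketch, where the AUPR uses the step-wise average-precision approximation (one of several common conventions) rather than interpolation:

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a
    random negative (ties count as 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def aupr_score(labels, scores):
    """Area under the precision-recall curve via step-wise summation
    over descending score thresholds (average-precision style)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    area, prev_recall = 0.0, 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area
```

On an imbalanced set, a classifier can reach a high AUC while its AUPR remains low, which is why both are reported here.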
Supplementary Tables S1-S3 present the performance of the three models on the imbalanced dataset, while Table 3 presents the average performance over the five-fold cross-validation, with the best results highlighted in bold. The results suggest that ACP-DRL achieved top-tier performance across five evaluation metrics: Acc, Sp, MCC, AUC, and AUPR.
Although ACP-BC attained the highest sensitivity, ACP-DRL achieved similar results whilst surpassing ACP-BC in specificity. This may suggest that ACP-DRL adopts a more conservative approach when classifying positive instances, avoiding misclassifying non-ACPs as ACPs. Perhaps owing to ACP-check's sensitivity to the data distribution, it did not demonstrate competitive performance in this test.
Paired t-tests were conducted on the results of the five-fold cross-validation (Supplementary Table S4). The results indicated statistically significant differences in the Sp, AUC, and AUPR metrics (p < 0.05) and a marginal difference in Acc (p = 0.05) when comparing our ACP-DRL model with ACP-BC. Meanwhile, when comparing with ACP-check, Acc, Sen, MCC, AUC, and AUPR all manifested significant differences (p < 0.05).
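A paired t-test compares two models fold by fold on the same splits; a minimal sketch of the test statistic (in practice a library routine such as SciPy's `ttest_rel` also supplies the p-value):

```python
from math import sqrt

def paired_t_statistic(a, b):
    """Paired t statistic for two matched samples of metric values,
    e.g. per-fold accuracies of two models on identical folds."""
    d = [x - y for x, y in zip(a, b)]      # per-fold differences
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / sqrt(var / n)
```

With five folds the statistic has n − 1 = 4 degrees of freedom, from which the p-value is read off the t distribution.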
Conclusion

In this work, we have proposed a novel ACPs recognition method called ACP-DRL. ACP-DRL enhances an existing protein language model through in-domain further pre-training to approximate the peptide-level feature space more closely, continuously refines learnable deep representations during the training of downstream tasks, and learns features at the amino acid level through a Bi-LSTM, which is combined with a fully connected layer to complete the recognition of ACPs. This design introduces a BERT-based protein language model and further pre-training techniques into ACPs recognition for the first time, eliminates constraints on sequence length and the dependence on manual features, and shows remarkable competitiveness in comparison with existing methods. In recent years, the recognition of various functional peptides, as in MFTP (Fan et al., 2023), MLBP (Tang et al., 2022), and PrMFTP (Yan et al., 2022), has seen significant advances. These methods universally use encoders to transform peptide sequences into vectors. Believing that our OPP model is notably adept at this encoding task, we plan to apply it next to research on recognizing multifunctional peptides.

FIGURE 1
FIGURE 1 Framework of ACP-DRL. (A) Tokenized peptides representation. (B) Language model with in-domain further pre-training. (C) Fine-tuning layer and classifier.

FIGURE 2
FIGURE 2 Evaluation of language models on main dataset.

FIGURE 3
FIGURE 3 Performance of different fine-tuning layers on main dataset.

FIGURE 4
FIGURE 4 Visualization results of t-SNE for language models at different stages on (A) main dataset and (B) alternate dataset.

TABLE 1
Comparison with existing methods on main dataset.

TABLE 2
Comparison with existing methods on alternate dataset.

TABLE 3
Comparison of ACP-BC, ACP-check, and ACP-DRL in five-fold cross validation on an imbalanced dataset.