
METHODS article

Front. Microbiol., 06 February 2026

Sec. Systems Microbiology

Volume 17 - 2026 | https://doi.org/10.3389/fmicb.2026.1736391

Multi-feature fusion for gene prediction and functional peptide identification


Chenjing Ma1,2, Qianran Wei1, Guohua Wang2,3, Yan Miao2* and Lei Yuan1*
  • 1Department of Hepatobiliary Surgery, The Quzhou Affiliated Hospital of Wenzhou Medical University, Quzhou People's Hospital, Quzhou, Zhejiang, China
  • 2College of Computer and Control Engineering, Northeast Forestry University, Harbin, Heilongjiang, China
  • 3Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China

Anticancer peptides (ACPs) have demonstrated potent antitumor activity and low toxicity, offering considerable potential in cancer therapeutics. Meanwhile, antimicrobial peptides (AMPs) serve as key components of the innate immune defense system. Owing to their broad-spectrum antimicrobial activity and low propensity for inducing resistance, AMPs have attracted considerable attention in the fields of infection control and immunotherapy. Accurate identification of ACPs and AMPs is critical for the discovery of novel therapeutic agents. However, wet-lab identification is often time-consuming, costly, and inefficient, falling short of the demands of high-throughput drug screening. Furthermore, existing computational methods exhibit limitations in feature representation and cross-task prediction capability. To address these challenges, a tool for functional peptide prediction is proposed, namely GP2FI, which consists of two sequential stages: a gene prediction model (MHA-preconv) and a functional peptide identification model (FuncPred-CB). MHA-preconv integrates CNNs with Transformer encoder layers to form a two-stage deep architecture, effectively capturing both local sequence patterns and long-range dependencies. Based on the coding regions identified by MHA-preconv, FuncPred-CB incorporates a pre-trained BERT language model to automatically extract contextual semantic features from amino acid sequences. Experimental results on multiple benchmark datasets demonstrate that MHA-preconv and GP2FI consistently outperform state-of-the-art methods in terms of accuracy and other performance metrics. The code for GP2FI can be found at https://github.com/ma999-mxl/maLBX.git.

1 Introduction

Functional peptides, particularly anticancer peptides (ACPs) and antimicrobial peptides (AMPs), have emerged as prominent research topics in recent years due to their critical roles in cancer therapy and immune defense. ACPs exhibit remarkable anti-tumor potential by selectively targeting cancer cells through unique membrane-disruptive mechanisms. AMPs, on the other hand, are widely distributed in living organisms and possess broad-spectrum antimicrobial activity with low risk of resistance development. They have been extensively applied in medicine, food safety, and agriculture. However, the experimental identification of such functional peptides is time-consuming and costly, which significantly hinders their large-scale development and practical application.

With the rapid development of artificial intelligence technologies, sequence-based functional peptide prediction has emerged as a feasible and efficient alternative approach. This predictive process generally involves two key stages: first, deep learning methods are employed to accurately identify open reading frames (ORFs) from raw genomic sequences; second, the predicted gene sequences are translated into protein sequences, which are subsequently analyzed by machine learning or deep learning models to identify potential ACPs or AMPs. Therefore, constructing an end-to-end framework that integrates efficient gene prediction and functional peptide identification is of great importance for the discovery of novel functional peptides and the advancement of precision medicine and anti-infective therapeutics.

For gene prediction, a variety of algorithms have been proposed to identify ORFs with protein-coding potential in genomic sequences. These methods can generally be categorized into three main types: statistical learning-based, traditional machine learning-based, and deep learning-based approaches. Early tools such as Prodigal (Hyatt et al., 2010) employed Hidden Markov Models (HMMs) combined with statistical scoring schemes to rank and evaluate ORFs. While these approaches are computationally efficient, they often struggle to capture complex sequence patterns and long-range dependencies (Larsen and Krogh, 2003; Delcher et al., 2007; Kelley et al., 2011). Traditional machine learning-based methods, such as Orphelia (Hoff et al., 2009) and MGC (El Allali and Rose, 2013), integrated neural networks with discriminative classifiers. MetaGUN (Liu et al., 2013), MetaGeneAnnotator (Noguchi et al., 2008), and mRMR-SVM (Al-Ajlan and El Allali, 2018a) utilized Support Vector Machines (SVMs) for gene classification. FragGeneScan (Rho et al., 2010) combined sequencing error models with codon usage preferences, enhancing its robustness on low-quality data. Additionally, m5C-Seq (Abbas et al., 2024), an ensemble-learning model for predicting RNA 5-methylcytosine modification sites, and a machine-learning model for rare genetic diseases (Abbas et al., 2025a), which handles high-dimensional genomic data beyond the traditional single-gene prediction paradigm, have also been introduced. Nevertheless, these methods still exhibit limitations in deep feature extraction and modeling of sequence-level dependencies. In response, a growing number of deep learning-based models have been established in recent years. Meta-MFDL (Zhang et al., 2017) employs a multi-layer stacked architecture for feature extraction and classification. CNN-MGP (Al-Ajlan and El Allali, 2018b) constructs a multi-branch CNN ensemble for gene prediction.
CNN-RAI (Karagöz and Nalbantoglu, 2021) leverages k-mer features in a CNN framework. Although these models improve prediction accuracy, gene prediction commonly faces challenges such as short read lengths, incomplete sequences, and fragmentation, leading to loss of sequence information. Additionally, limitations in capturing global sequence dependencies further increase the difficulty of accurate gene identification.

For functional peptide identification, research has primarily focused on the independent prediction of two major categories: ACPs and AMPs. For ACP prediction, various models have been developed by transforming peptide sequences into numerical representations and applying classification algorithms. Representative methods include ACP-DRL (Xu et al., 2024), PEPred-Suite (Wei et al., 2019), ACPred-Fuse (Rao et al., 2019), iACP-DRLF (Lv et al., 2021), AntiCP2.0 (Agrawal et al., 2020), ACP-check (Zhu et al., 2022), and ACP-BC (Sun et al., 2023), which utilize dipeptide composition, deep representation learning, Bi-LSTM architectures, or multi-channel data augmentation strategies for modeling. In the field of AMP prediction, the construction of large-scale AMP databases such as CAMP (Waghu et al., 2015), APD3 (Wang, 2004), dbAMP (Jhong et al., 2018), and DRAMP 2.0 (Kang et al., 2019) has provided essential data resources for computational model development. Existing tools include CS-AMPPred (Porto et al., 2012) (SVM-based classification), PEP-FOLD (Bhadra et al., 2018) (random forest models), Ensemble-AMPPred (Lertampaiporn et al., 2021), and AMPpred-EL (Lv et al., 2022) (ensemble learning approaches). Recently, deep learning-based models such as AMPScanner (Veltri et al., 2018), BERT-AMP (Ma et al., 2022), sAMPpred-GAT (Yan et al., 2022), and AMPpred-MFA (Li et al., 2023) have demonstrated excellent performance in AMP identification tasks. However, most of these approaches suffer from several limitations, including data scarcity, sequence length constraints, limited semantic representation capability, and task specificity. Notably, they generally support only single-task classification and lack a unified framework for predicting multiple functional peptide types such as ACPs and AMPs. Recently, AI models applicable to intelligent healthcare (Abbas et al., 2025b) have also begun to emerge.
These models are task-specific, designed to fulfill well-defined functions, yet they can only handle the designated task and are unable to adapt to different types of problems.

To overcome these limitations and further enhance the accuracy of ORF prediction and functional peptide identification, a novel deep learning-based gene prediction and functional peptide identification method, namely GP2FI, is proposed. It consists of two components: a gene prediction model, MHA-preconv, and a peptide prediction model, FuncPred-CB. MHA-preconv first extracts candidate ORFs from raw genomic sequences and encodes them with a set of features. It then adopts a two-stage deep learning architecture to integrate both local and global sequence features. FuncPred-CB is a peptide prediction model, which integrates a pre-trained BERT language model with a dual-channel CNN–BiLSTM architecture to effectively handle longer sequences and reduce reliance on manual feature engineering. It is capable of simultaneously predicting ACPs and AMPs within a single unified framework, significantly enhancing the adaptability and accuracy of functional peptide recognition across diverse tasks.

The performance of the MHA-preconv model was evaluated against five widely used tools across multiple genomic datasets. MHA-preconv achieved a gene prediction accuracy of 0.98, outperforming Prodigal (Hyatt et al., 2010) by 0.02, Orphelia (Hoff et al., 2009) by 0.13, FragGeneScan (Rho et al., 2010) by 0.15, Tiberius (Gabriel et al., 2024) by 0.07, and Helixer (Holst et al., 2025) by 0.08. The FuncPred-CB model was applied to multiple ACP and AMP datasets and systematically compared with state-of-the-art prediction methods. FuncPred-CB achieved a maximum accuracy of 0.93 in ACP prediction, exceeding that of ACP-DRL (Xu et al., 2024) by 0.02, ACP-check (Zhu et al., 2022) by 0.15, iACP-DRLF (Lv et al., 2021) by 0.12, and AntiCP2.0 (Agrawal et al., 2020) by 0.22. FuncPred-CB also achieved a competitive accuracy of 0.96 and the highest AUC of 0.99 in AMP prediction, with its accuracy surpassing the deep stacked model AMPpred-MFA (Li et al., 2023) by 0.0113. These experimental results demonstrate the superior performance of GP2FI. MHA-preconv achieves higher accuracy and enhanced sequence modeling capability in gene prediction, while FuncPred-CB balances precision and task generalization in unified ACP and AMP identification, providing a powerful computational foundation for functional peptide discovery and downstream bio-activity research.

2 Methods

GP2FI is a two-stage deep learning framework for gene prediction and functional peptide identification. It consists of two stages: a gene prediction model (MHA-preconv) and a functional peptide identification model (FuncPred-CB). MHA-preconv integrates CNNs with Transformer encoder layers to form a two-stage deep architecture, effectively capturing both local sequence patterns and long-range dependencies within ORF sequences. Based on the coding regions identified by MHA-preconv, FuncPred-CB incorporates a pre-trained BERT language model to automatically extract contextual semantic features from amino acid sequences. It also adopts a dual-channel feature extraction mechanism combining CNN and Bi-LSTM, enabling it to simultaneously capture local structural features and global dependencies. Final classification is performed using a multi-layer perceptron. The overall workflow of GP2FI is illustrated in Figure 1.

Diagram illustrating two workflows for DNA and protein sequence analysis. The left side shows a DNA sequence analysis pipeline, including feature extraction with open reading frame identification, fusion of features, CNN processing, and a Transformer encoder. The right side depicts a protein sequence analysis process, including peptide tokenization using a domain-specific pre-trained model, CNN and Bi-LSTM processing, and an MLP for final feature fusion. Both workflows involve layers of neural networks and specific sequence handling techniques.

Figure 1. Architecture of the GP2FI framework. (1) MHA-preconv model: (a) Feature extraction, (b) Feature fusion, (c) CNN model, and (d) Transformer encoder model. (2) FuncPred-CB model: (a) Peptide sequence tokenization and domain-specific pretrained language model, (b) CNN model, (c) Bi-LSTM model, and (d) MLP model.

MHA-preconv comprises four main stages: feature extraction, multi-feature fusion, CNN model, and Transformer encoder model. First, all candidate ORFs are extracted from the genomic sequence and encoded using one-hot encoding, with each sequence standardized to a fixed length of 700 base pairs. In parallel, six handcrafted features are computed. Then the one-hot encoded sequences and handcrafted features are jointly input into the CNN module to extract local feature patterns. The resulting feature maps are flattened and passed into a Transformer encoder to capture long-range dependencies and global contextual relationships. Finally, a fully connected layer followed by a softmax layer outputs the probability of each ORF being a protein-coding region.

FuncPred-CB also consists of four main stages: peptide sequence tokenization and a domain-specific pretrained language model, CNN model, Bi-LSTM model, and MLP model. First, the protein sequences translated from MHA-preconv are tokenized, where each amino acid is mapped to a corresponding token ID. The tokenized sequence is then fed into a pretrained BERT language model to obtain contextual semantic embeddings for each residue. The output of BERT is subsequently passed through two parallel channels: a CNN channel for capturing local structural patterns, and a Bi-LSTM channel for modeling long-range dependencies within the sequence. The outputs from both channels are concatenated along the feature dimension and fed into a multi-layer perceptron (MLP) for classification, enabling unified prediction of both ACPs and AMPs.

2.1 Feature extraction

Each ORF is characterized by six types of effective features, as detailed in Supplementary File S1.1, including monocodon usage, dicodon usage, translation initiation site (TIS), ORF length, GC content, and basic nucleotide composition. These features are designed to enhance the discriminative power of the model in identifying protein-coding regions. In general, ORFs can be categorized as either complete or incomplete. A complete ORF is defined as one that contains both a start codon (ATG, CTG, GTG, or TTG) and a stop codon (TAG, TGA, or TAA). In contrast, incomplete ORFs lack upstream or downstream regions, or both. In cases where both ends are truncated, the ORF spans the entire sequence fragment without any identifiable start or stop codon. A complete prokaryotic gene, as illustrated in Figure 2, typically begins at the 5′ promoter region and ends at the 3′ terminator region. Transcription occurs between the transcription start site and the transcription termination site, encompassing the 5′ untranslated region (5′ UTR), the ORF, and the 3′ untranslated region (3′ UTR), with only the ORF being translated into protein. Given that the translation initiation site can be located up to 30 base pairs upstream of the canonical start codon, the ORF start offset was set to 30 bp in the search procedure (Hoff et al., 2008). During both training and testing phases, only ORFs with a minimum length of 60 base pairs were considered to ensure reliability.
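As an illustration of this search procedure, the sketch below scans the forward strand for complete ORFs (the first in-frame start codon to the first in-frame stop codon) and applies the 60 bp minimum length; handling of the reverse strand, incomplete ORFs, and the 30 bp start offset is omitted for brevity, so this is a simplified stand-in rather than the exact routine used by MHA-preconv.

```python
START_CODONS = {"ATG", "CTG", "GTG", "TTG"}
STOP_CODONS = {"TAG", "TGA", "TAA"}

def find_complete_orfs(seq, min_len=60):
    """Return (start, end) spans of complete ORFs on the forward strand:
    the first in-frame start codon up to the first in-frame stop codon,
    keeping only ORFs of at least min_len base pairs."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3  # include the stop codon
                if end - start >= min_len:
                    orfs.append((start, end))
                start = None
    return orfs
```

A 66 bp fragment such as `"ATG" + "AAA" * 20 + "TAA"` yields a single complete ORF spanning the whole fragment.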

Figure 2
Diagram of a gene structure showing regions from left to right: Promoter with TATAAT boxes, 5' UTR, initiation codon, CDS (coding sequence), termination codon, 3' UTR, and Terminator. Conserved sequence: TTGACA.

Figure 2. Structure of a prokaryotic open reading frame (ORF).

2.2 Multi-feature fusion

The fixed-length ORFs are subjected to one-hot encoding, where each nucleotide is represented by a one-hot vector, so that each ORF of length L is represented as an L×4 matrix. The encoded ORF and the manually extracted six features are then fused as input for further processing by the subsequent CNN and Transformer encoder layers. The entire feature set, encompassing the encoded ORF and the six handcrafted features, is concatenated into a one-dimensional feature vector to represent the input sequence fragment, expressed as:

X = [X_ORF, X_MC, X_DC, X_TIS, X_ORFL, X_GC, X_baseC]

where X_ORF, X_MC, X_DC, X_TIS, X_ORFL, X_GC, and X_baseC denote the feature vectors described above.
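A minimal sketch of this fusion step is shown below. The 700 bp fixed length and the L×4 one-hot layout follow the text; the handcrafted feature vectors are passed in as precomputed arrays, and their dimensions in the usage example are illustrative only.

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq, fixed_len=700):
    """One-hot encode an ORF as a fixed_len x 4 matrix
    (truncating longer sequences, zero-padding shorter ones)."""
    mat = np.zeros((fixed_len, 4), dtype=np.float32)
    for i, base in enumerate(seq[:fixed_len].upper()):
        if base in NUC:  # ambiguous bases stay all-zero
            mat[i, NUC[base]] = 1.0
    return mat

def fuse_features(seq, handcrafted):
    """Concatenate the flattened one-hot ORF with the handcrafted
    feature vectors into one 1-D input,
    X = [X_ORF, X_MC, X_DC, X_TIS, X_ORFL, X_GC, X_baseC]."""
    parts = [one_hot(seq).ravel()]
    parts += [np.asarray(f, dtype=np.float32).ravel() for f in handcrafted]
    return np.concatenate(parts)
```

For example, fusing a sequence with two toy feature vectors of lengths 1 and 2 produces a vector of 700 × 4 + 3 = 2,803 entries.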

2.3 Feature extraction for gene detection

2.3.1 CNN model

A CNN model pre-trained on 10 mutually exclusive datasets, each constructed based on predefined GC content ranges, was employed. The concatenated one-dimensional array obtained from multi-feature fusion is input into the appropriate pre-trained CNN model, which is then fine-tuned using our target dataset. The final CNN architecture consists of six layers. The first layer is a convolutional layer with 64 filters and a filter window size of 3. The second layer is a max-pooling layer with a pool size of 2. The third layer is another convolutional layer with 200 filters and a filter window size of 3. The fourth layer is a second max-pooling layer with a pool size of 2. This is followed by a dropout layer to mitigate overfitting. The output from the convolutional layers is then flattened into a one-dimensional vector and is fed into the first fully connected layer, where the dimensionality is reduced from 35,000 to 4,096.
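The layer stack described above can be sketched as follows. Two details are assumptions: 'same' padding in the convolutions (which makes a 700 bp input flatten to exactly 200 × 175 = 35,000 features, matching the stated dimensionality) and the dropout rate, which the text does not specify here.

```python
import torch
import torch.nn as nn

class GeneCNN(nn.Module):
    """Sketch of the six-layer CNN: conv(64, k=3) -> maxpool(2) ->
    conv(200, k=3) -> maxpool(2) -> dropout -> FC(35,000 -> 4,096).
    'Same' padding is assumed: 700 -> 350 -> 175 positions."""
    def __init__(self, seq_len=700, in_channels=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 200, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Dropout(0.2),  # dropout rate is an assumption
        )
        self.fc = nn.Linear(200 * (seq_len // 4), 4096)  # 35,000 -> 4,096

    def forward(self, x):               # x: (batch, 4, 700)
        h = self.features(x)            # (batch, 200, 175)
        return self.fc(h.flatten(1))    # (batch, 4096)
```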

2.3.2 Transformer encoder model

To incorporate global contextual information, a Transformer encoder model was incorporated after the CNN model, allowing long-distance relationships within the sequence to be considered while preserving sequential information. The output from the CNN layers is fed into the Transformer encoder with an 8-head attention mechanism to extract global contextual dependencies across the entire ORF sequence. The output is then passed through a flattening layer to convert the multi-dimensional attention output into a one-dimensional vector. This vector is subsequently fed into a fully connected layer with an input dimension of 4,096 and an output dimension of 128, followed by a dropout layer with a dropout rate of 0.2. The resulting vector is then passed into another fully connected layer with 128 neurons, producing a single scalar output. A Sigmoid activation function is applied to obtain the probability that an ORF encodes a protein-coding gene. As a post-processing step, a greedy algorithm (Zhou and Troyanskaya, 2015) is applied to ensure that only one gene is retained among overlapping predictions. The candidate ORF with the highest probability score is selected, and any other ORF overlapping more than 60 base pairs with it is discarded. The final set of predicted genes is then produced.
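The greedy post-processing step can be sketched as follows, where each candidate is a (start, end, probability) triple:

```python
def greedy_filter(candidates, max_overlap=60):
    """Greedy overlap removal: repeatedly keep the highest-scoring ORF
    and discard any remaining candidate overlapping a kept ORF by more
    than max_overlap base pairs."""
    kept = []
    for start, end, score in sorted(candidates, key=lambda c: -c[2]):
        overlaps_kept = any(
            min(end, ke) - max(start, ks) > max_overlap
            for ks, ke, _ in kept
        )
        if not overlaps_kept:
            kept.append((start, end, score))
    return sorted(kept)  # return in genomic order
```

For instance, a candidate spanning 50–250 overlaps a kept 0–200 candidate by 150 bp and is dropped, while a non-overlapping 300–400 candidate is retained.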

During model training, the binary cross-entropy loss (Hoff et al., 2008) was used to compute the error between predicted probabilities and ground-truth labels. The model was trained with a batch size of 32 using the Adam optimizer, with a learning rate of 0.001. Multiple hyperparameter configurations were explored to optimize performance.
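A sketch of one training epoch under these settings is shown below; the model is assumed to output sigmoid probabilities (as described above), so plain binary cross-entropy applies, and the batch size of 32, Adam optimizer, and learning rate of 0.001 are set when constructing the data loader and optimizer.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer, device="cpu"):
    """Run one epoch of binary cross-entropy training and return the
    mean loss over the dataset. The model is assumed to end in a
    sigmoid, so nn.BCELoss is applied directly to its output."""
    criterion = nn.BCELoss()
    model.train()
    total = 0.0
    for x, y in loader:
        x, y = x.to(device), y.to(device).float()
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item() * x.size(0)
    return total / len(loader.dataset)
```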

2.4 Peptide sequence tokenization and domain-specific pre-trained language model

The FuncPred-CB model for functional peptide identification begins by tokenizing peptide sequences, converting each amino acid into its corresponding numerical ID as input to the language model. The vocabulary comprises 26 tokens, including the single-letter codes of the 20 standard amino acids, an unknown residue represented by X, and special tokens such as [CLS] and [SEP]. Subsequently, a BERT-based protein language model pre-trained in the ACP-DRL framework is employed to map each amino acid to a vector representation enriched with contextual semantics. Trained on large-scale protein sequence databases using a masked language modeling (MLM) strategy, this model demonstrates strong capabilities in biological sequence modeling and semantic representation. By incorporating BERT-derived contextual embeddings, the model is better equipped to capture critical patterns and long-range dependencies within functional peptides, thereby enhancing the performance of downstream classification tasks.
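The tokenization step can be sketched as below. The token IDs and the exact special-token set here are illustrative: the actual ID assignment comes from the pretrained model's own tokenizer, and this toy vocabulary does not reproduce the full 26-token layout described above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative vocabulary; the real pretrained tokenizer assigns its
# own IDs and may include additional special tokens.
VOCAB = {tok: i for i, tok in enumerate(
    ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "X", *AMINO_ACIDS])}

def tokenize(peptide, max_len=64):
    """Map a peptide to [CLS] aa_1 ... aa_n [SEP] token IDs, replacing
    unknown residues with X and padding/truncating to max_len."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(aa, VOCAB["X"]) for aa in peptide.upper()]
    ids.append(VOCAB["[SEP]"])
    ids += [VOCAB["[PAD]"]] * (max_len - len(ids))
    return ids[:max_len]
```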

2.5 Feature extraction for functional peptide identification

2.5.1 CNN model

A dual-channel feature extraction architecture based on the pre-trained language model BERT is employed. The sequence representations output by BERT are embedded as high-dimensional, context-sensitive vectors and fed into one of the CNN channels to extract both local structural features and global dependencies. In the CNN channel, the BERT output is transposed to match the input format required by one-dimensional convolution, and then passed through three consecutive convolutional layers. Each layer uses a kernel size of 3 and contains 256, 128, and 64 filters, respectively. After ReLU activation, average pooling is applied to extract stable local pattern features.

2.5.2 Bi-LSTM model

The other feature extraction channel, Bi-LSTM, processes the original sequence outputs from BERT to capture long-range dependencies within the sequence through a Bi-LSTM network. The final hidden state at the last time step is taken as the global representation of the peptide sequence. The features obtained from both CNN and Bi-LSTM channels are concatenated and passed through a batch normalization layer before being fed into a three-layer MLP for classification. This architecture effectively integrates the strengths of CNNs in capturing local structural features with the contextual modeling capabilities of Bi-LSTM, thereby enhancing the overall performance of functional peptide identification. Besides, we further conducted statistical and visual analyses of amino acid composition to explore underlying data characteristics and enhance model interpretability. Specifically, we analyzed the frequency distribution of the 20 standard amino acids in the ACP and AMP datasets. Detailed analysis is provided in Supplementary File S3.2.
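Putting the two channels together, a sketch of the dual-channel head is shown below. The CNN filter counts (256/128/64, kernel size 3) and average pooling follow the text; the BERT embedding size, LSTM hidden size, and MLP widths are assumptions, as the text does not specify them.

```python
import torch
import torch.nn as nn

class DualChannelHead(nn.Module):
    """Sketch of the dual-channel head over BERT embeddings: a CNN
    branch (256/128/64 filters, kernel 3, average pooling) for local
    patterns and a Bi-LSTM branch for long-range dependencies, fused
    by batch normalization and a three-layer MLP."""
    def __init__(self, embed_dim=768, lstm_hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(embed_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 128, 3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),       # average pooling over positions
        )
        self.lstm = nn.LSTM(embed_dim, lstm_hidden,
                            batch_first=True, bidirectional=True)
        fused_dim = 64 + 2 * lstm_hidden
        self.norm = nn.BatchNorm1d(fused_dim)
        self.mlp = nn.Sequential(          # three-layer MLP classifier
            nn.Linear(fused_dim, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, bert_out):           # (batch, seq_len, embed_dim)
        local = self.cnn(bert_out.transpose(1, 2)).squeeze(-1)  # (batch, 64)
        _, (h, _) = self.lstm(bert_out)
        glob = torch.cat([h[-2], h[-1]], dim=1)  # final fwd+bwd hidden states
        fused = self.norm(torch.cat([local, glob], dim=1))
        return torch.sigmoid(self.mlp(fused)).squeeze(-1)  # (batch,)
```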

During the training phase, binary cross-entropy loss was used as the objective function. The model was optimized using the Adafactor optimizer with an initial learning rate of 2 × 10−5. Training was conducted for 20 epochs with a batch size of 4. Early stopping was applied to prevent overfitting, and evaluation metrics including accuracy, F1 score, and Matthews correlation coefficient (MCC) were recorded after each epoch. Comprehensive evaluation metrics are provided in Supplementary File S2.

3 Results

3.1 Datasets

MHA-preconv was trained and evaluated using four datasets. Dataset_1 contains 164 complete genomes (including bacteria and archaea) and is used for training and validation, with the data split into training and testing sets at a 7:3 ratio. Dataset_2 consists of 10 complete genomes for model tuning. Dataset_3 includes complete genomes from 9 independent species and is used for independent testing. Dataset_4 encompasses 100 newly collected genomes covering broad taxonomic diversity (including Gram-negative bacteria, Staphylococcus spp., etc.) and was divided into five equal subsets for incremental testing. Stratified sampling was applied to both training and test splits to ensure equal expected numbers of positive and negative instances in every mini-batch, preventing the majority class (NCS) from overwhelming the minority class (CDS). In addition, a weighted random-sampling strategy was employed during training to oversample the minority class and further alleviate class bias. All genomic sequences and annotations were downloaded from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) and NCBI RefSeq (https://www.ncbi.nlm.nih.gov/refseq/). The databases used and detailed dataset-construction procedures are described in Supplementary File S3.1.

FuncPred-CB was trained and evaluated on two task-specific datasets. For the ACP task, Dataset_1 contains 970 ACPs and 970 non-ACPs, while Dataset_2 includes 861 ACPs and an equal number of non-ACPs. Positive sequences were collected from the AMP and CancerPPD databases and experimentally verified to possess anticancer activity; negatives consist of AMPs without anticancer activity and random peptides extracted from Swiss-Prot. For the AMP task, the dataset contains 10,322 non-redundant AMP sequences whose antimicrobial activity has been experimentally validated, together with 3,029,894 non-AMP sequences. To avoid potential bias, any peptides known to exhibit anticancer activity were excluded from the AMP-positive class, ensuring that positives possess only antimicrobial activity. To mitigate distribution shifts that could arise from random splitting, stratified sampling was applied and a fixed random seed (seed = 702) was set to guarantee reproducibility. Training and test sets are stored in separate physical files to prevent data leakage at the source; the test set is used exclusively for final evaluation and is never involved in model development or hyperparameter tuning. Class imbalance is addressed by weighted random oversampling of the minority class during training. All datasets were randomly split into training and test sets at an 8:2 ratio. The databases used and detailed dataset-construction procedures are described in Supplementary File S3.1.
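The stratified 8:2 split with a fixed seed can be sketched as follows; this is a pure-Python stand-in for the library routine actually used, which the text does not name.

```python
import random

def stratified_split(samples, labels, test_frac=0.2, seed=702):
    """Stratified train/test split with a fixed seed (seed=702, as
    described), so each class contributes the same train/test
    proportion. Returns two lists of (sample, label) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    train, test = [], []
    for y, items in by_class.items():
        rng.shuffle(items)                       # reproducible shuffle
        n_test = round(len(items) * test_frac)
        test += [(s, y) for s in items[:n_test]]
        train += [(s, y) for s in items[n_test:]]
    return train, test
```

On a balanced toy set of 100 peptides (50 per class), this yields 80 training and 20 test items, with 10 positives in the test set.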

3.2 Performance of MHA-preconv on the gene dataset Dataset_1 and of FuncPred-CB on the ACP dataset Dataset_1 and the AMP dataset Dataset_3

Comprehensive training and testing were performed for the gene-finding, ACP, and AMP tasks with three dedicated models: MHA-preconv, FuncPred-CB-ACP, and FuncPred-CB-AMP. MHA-preconv was evaluated on the gene-prediction portion of Dataset_1, whereas FuncPred-CB was assessed on the functional-peptide portions (ACP Dataset_1 and AMP Dataset_3). After 48 epochs, MHA-preconv achieved 98% test accuracy; FuncPred-CB reached 92% and 96% test accuracy on the ACP and AMP datasets, respectively, within 10 epochs. The results are displayed in Figure 3.

Line chart comparing the performance of three models (MHA-preconv, FuncPred-CB-ACP, and FuncPred-CB-AMP) across five metrics. MHA-preconv (blue) shows consistently high scores above 0.95. FuncPred-CB-ACP (orange) scores range from 0.840 to 0.920. FuncPred-CB-AMP (green) performs similarly to MHA-preconv. Metrics on the x-axis are test accuracy, sensitivity, specificity, F1 score, and MCC; the y-axis shows score levels.

Figure 3. Comparative performance metrics of MHA-preconv and FuncPred-CB on test datasets.

3.3 Comparison of MHA-preconv with five benchmark methods on Dataset_3 and Dataset_4

The MHA-preconv model was compared with five well-established gene prediction tools, Prodigal, Orphelia, FragGeneScan, Tiberius, and Helixer, using Dataset_3 and Dataset_4. The results are shown in Tables 1, 2. The highest accuracy was achieved by MHA-preconv on 8 out of 9 species in Dataset_3. The only exception was N. pharaonis (Species No. 7), where Prodigal slightly outperformed our method. For the remaining species, Acc values of 97.56%, 96.18%, 93.96%, 97.49%, 76.47%, 96.52%, 96.12%, and 96.67% were achieved by MHA-preconv, respectively. On Dataset_4, MHA-preconv outperformed all benchmark methods, achieving the highest Acc values across all five subsets: 97.33%, 96.40%, 97.78%, 95.88%, and 95.74%. These results demonstrate the strong generalization ability and classification performance of our method across a wide range of microbial genomes.


Table 1. Results obtained by the six methods on 9 different strains.


Table 2. Results obtained by the six methods on the five divided subsets.

3.4 Comparison with state-of-the-art ACP methods on Dataset_1 and Dataset_2

To comprehensively evaluate the proposed FuncPred-CB model for functional-peptide recognition, we conducted comparative experiments on two benchmark datasets against six state-of-the-art ACP predictors: ACP-DRL, ACP-check, iACP-DRLF, AntiCP 2.0, ACP-CLB (Geng et al., 2025) and ACP-GCN (Rao et al., 2020). As illustrated in Figure 4 and Table 3, FuncPred-CB achieves 92.49% accuracy, 91.19% sensitivity, 93.78% specificity, 86.78% MCC and 94.58% AUC on Dataset_1, and 73.19% accuracy, 71.01% sensitivity, 77.20% specificity, 46.42% MCC and 82.60% AUC on Dataset_2. Except for a marginally lower accuracy than ACP-CLB, FuncPred-CB surpasses all baseline methods on the remaining metrics, demonstrating robust adaptability to complex and heterogeneous data. Collectively, these results validate the superior overall performance, generalizability and competitiveness of FuncPred-CB as a functional-peptide predictor for anticancer-peptide identification.

Three ROC curve graphs comparing model performance on different datasets. The first graph shows various models with AUC scores, highlighting ACP-DRL (0.9225) and Our method (0.9858) among others. The second graph includes models like ACP-DRL (0.7625) and ACF-CNN (0.8492). The third graph compares AMPred-HLF (0.9833), Our method (0.9853), and DeepAMPpred (0.9635). Each graph plots True Positive Rate against False Positive Rate.

Figure 4. Comparison of ROC curves for different models across three functional peptide datasets.


Table 3. Performance comparison of various models on Dataset_1 and Dataset_2.

3.5 Comparison with the latest AMP method on Dataset_3

To assess the performance of FuncPred-CB on AMP recognition, we benchmarked it against the recent models AMPpred-MFA and deep-AMPpred (Zhao et al., 2025) on Dataset_3. The results, provided in Supplementary File S3.3 and Table 4, show that FuncPred-CB attained 95.9% accuracy, 97.3% sensitivity, 96.3% specificity, 91.7% MCC, and 98.7% AUC. Although its accuracy is marginally lower than that of deep-AMPpred, FuncPred-CB delivers superior overall performance relative to both AMPpred-MFA and deep-AMPpred. These findings underscore that FuncPred-CB not only excels in anticancer-peptide prediction but also exhibits strong discriminative power and generalizability in AMP classification tasks.


Table 4. Comparison of performance metrics between FuncPred-CB, AMPpred-MFA, and deep-AMPpred on Dataset_3.

3.6 The impact of basic nucleotides on MHA-preconv

Since the differences in base frequencies between coding and non-coding regions can reflect structural and functional characteristics of the genome, the data distribution of coding and non-coding regions was plotted (detailed in Supplementary File S3.4), and the impact of including nucleotide composition as a feature on model performance was compared. As shown in Table 5, the incorporation of nucleotide composition features improved the model's prediction accuracy, sensitivity, specificity, and other metrics across different genomes.


Table 5. Comparison of the impact of nucleotide composition features on the performance of MHA-preconv.

Accordingly, the use of standalone CNN and Transformer modules was also compared, different configurations of CNN and encoder layers were tested, and the effect of the Bi-LSTM model on model performance was evaluated. All these modifications significantly enhanced the model's performance. Detailed analyses and results are provided in Supplementary Files S3.5–S3.8.

3.7 Physicochemical property analysis of functional peptides

To evaluate the potential of the predicted functional peptides in the fields of ACPs and AMPs, we employed the unified prediction framework GP2FI. Coding genes predicted from Dataset_1 were translated into protein sequences, which were then subjected to functional peptide identification using FuncPred-CB and comprehensive multiparametric physicochemical characterization. As shown in Figures 5–7, the large number of sequences limits the resolution of the heatmaps, which are therefore intended only as an overview of the global physicochemical landscape of ACPs and AMPs. Detailed sequence information and peptide classifications, together with in-silico mapping against the CancerPPD database, are provided in Appendix S3.10; wet-lab functional assays will be required for definitive validation.


Figure 5. Heatmap of physicochemical property profiles for functional peptides predicted by the GP2FI model.

Figure 5 illustrates the normalized distribution of eight physicochemical properties—GRAVY (hydrophobicity), molecular weight, aromaticity, instability index, isoelectric point, net charge at pH 7, Boman index, and tryptophan content (W_Content)—across multiple sequences predicted as functional peptides. The heatmap reveals substantial variation in these properties among different sequences, indicating their potential functional divergence. For example, the sequence in row 9 shows high values for isoelectric point, net charge, and Boman index, suggesting strong cationic affinity and protein interaction capability, making it a promising ACP candidate. Similarly, the sequence in row 19 exhibits elevated isoelectric point and positive charge, along with moderate aromaticity and tryptophan content, indicating considerable anticancer potential. Sequences in rows 13 and 41 also display favorable profiles across several ACP-associated features, highlighting their structural suitability as ACPs. In contrast, the sequence in row 35 exhibits high GRAVY and aromaticity values, reflecting strong hydrophobicity and structural stability—characteristics well-suited for AMP candidates. The sequence in row 16 also scores high in GRAVY and instability index, indicating favorable membrane affinity and antimicrobial stability. Row 43 shows marked enrichment in aromaticity and tryptophan content, aligning with classic physicochemical traits of AMPs.
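Several of the descriptors above can be computed directly from sequence. The sketch below uses the Kyte–Doolittle scale for GRAVY, the F/W/Y fraction for aromaticity, and a textbook Henderson–Hasselbalch approximation for net charge at pH 7. The pKa values are common textbook choices, the charge model ignores cysteine and tyrosine side chains and local structural context, and results may differ slightly from those of dedicated packages, so the values are indicative only.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def gravy(pep: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value."""
    return sum(KD[aa] for aa in pep) / len(pep)

def aromaticity(pep: str) -> float:
    """Fraction of aromatic residues (F, W, Y)."""
    return sum(pep.count(aa) for aa in "FWY") / len(pep)

def net_charge(pep: str, ph: float = 7.0) -> float:
    """Approximate net charge via Henderson-Hasselbalch, using
    common textbook pKa values and including free termini."""
    pos_pka = {"K": 10.5, "R": 12.5, "H": 6.0}
    neg_pka = {"D": 3.9, "E": 4.1}
    charge = 1.0 / (1.0 + 10 ** (ph - 9.0))      # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (2.0 - ph))     # free C-terminus
    for aa in pep:
        if aa in pos_pka:
            charge += 1.0 / (1.0 + 10 ** (ph - pos_pka[aa]))
        elif aa in neg_pka:
            charge -= 1.0 / (1.0 + 10 ** (neg_pka[aa] - ph))
    return charge

# Hypothetical cationic, tryptophan-containing peptide for illustration.
pep = "KWKLFKKI"
print(round(gravy(pep), 3), round(aromaticity(pep), 3), round(net_charge(pep), 3))
```

A strongly positive net charge combined with moderate hydrophobicity is the kind of profile flagged above as ACP- or AMP-favorable.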

To more intuitively reveal the numerical differences in key properties and their association with sequence characteristics, Figure 6 presents bar charts of GRAVY, Boman index, and tryptophan content across the predicted functional peptide sequences. Compared to the heatmap, the bar plots offer a clearer depiction of the magnitude of each attribute, facilitating the identification of representative sequences. For instance, these visualizations further confirm that the sequence in row 35 exhibits notably high GRAVY and aromaticity scores, indicating strong hydrophobicity and high structural stability, traits that suggest strong potential as an AMP candidate.


Figure 6. Bar chart of key physicochemical properties for functional peptide sequences predicted by the GP2FI model.

Furthermore, Figure 7 presents a scatter plot illustrating the relationship between GRAVY and net charge, with the Boman index encoded as the color gradient. This visualization reveals the multidimensional interaction among these physicochemical properties. Analysis shows that most AMP-like sequences cluster in the region characterized by high GRAVY and low charge, whereas sequences exhibiting moderate hydrophobicity, medium to high charge, and elevated Boman index tend to group in the ACP-favored region. Brighter colors (yellow) represent higher Boman index values, indicating stronger potential for protein–protein interactions and possibly higher biological activity. For instance, sequences located in the upper-right region of the plot exhibit both high charge and high Boman index, suggesting strong structural characteristics associated with ACPs, while those in the lower-left region with low charge and high hydrophobicity are more typical of AMP-like features.
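The qualitative regions described above can be expressed as a simple rule of thumb. The thresholds below are illustrative placeholders chosen only to mirror the described clusters; they are not values derived in this study and should not be used for actual screening.

```python
def region_label(gravy: float, charge: float, boman: float) -> str:
    """Assign a coarse region label in GRAVY/charge/Boman space.

    Thresholds are illustrative only (not from GP2FI): they merely
    encode the qualitative clusters seen in the scatter plot.
    """
    if gravy > 0.5 and charge < 1.0:
        return "AMP-like"      # hydrophobic, weakly charged
    if charge >= 2.0 and boman > 1.5:
        return "ACP-like"      # cationic, interaction-prone (high Boman)
    return "unclassified"

print(region_label(1.2, 0.0, 0.5))   # lower-left region  -> AMP-like
print(region_label(-0.1, 3.0, 2.2))  # upper-right region -> ACP-like
```

Such coarse partitions are useful only for triage; the actual functional calls in this study come from FuncPred-CB, not from threshold rules.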


Figure 7. Scatter plot of GRAVY and charge properties for functional peptide sequences predicted by the GP2FI model.

4 Discussion

The main innovations of GP2FI are reflected in the following aspects. First, to address the challenges of gene prediction, the MHA-preconv model integrates CNN and Transformer architectures, enabling the effective extraction of local patterns in ORFs while simultaneously capturing long-range dependencies in nucleotide sequences. Compared with previous models that rely heavily on handcrafted features, MHA-preconv offers a more streamlined architecture, higher prediction efficiency, and reduced dependency on manual feature inputs. Second, the FuncPred-CB model was proposed for dual-task functional peptide prediction. It leverages a pretrained BERT language model to automatically extract contextual semantic representations of amino acid sequences and employs a dual-channel architecture combining CNN and Bi-LSTM to deeply fuse local and global features. Experimental results demonstrate that FuncPred-CB achieves superior performance across multiple metrics. Finally, a physicochemical property analysis of the predicted functional peptide sequences was conducted. The results further validate the predictive effectiveness of the model and reveal functional tendencies of different sequences in anticancer and antimicrobial directions, offering valuable insights for subsequent experimental validation and drug development.

Despite its strong predictive performance, GP2FI has several limitations. First, the framework only provides "precursor-level" activity scores for small ribosomally encoded peptides (sREPs) translated via the canonical ribosome; it ingests complete metagenomic ORF sequences and currently does not model signal-peptide removal, proteolytic cleavage, post-translational modifications, or non-ribosomal peptide synthetase (NRPS) pathways. Dedicated maturation modules and wet-lab validation will be incorporated in future work to bridge this gap. In addition, the pipeline remains a two-stage system that requires manual hand-off between gene prediction and peptide function prediction. We plan to introduce joint training to establish a truly end-to-end workflow. The current model also exhibits suboptimal efficiency when handling very long sequences and in deployment scenarios, while physicochemical property analysis is not yet embedded within the predictive loop. Upcoming efforts will focus on integrating structural information to build more biologically interpretable models with enhanced explainability. We acknowledge that, although our study has significantly reduced reliance on manual features compared with traditional approaches, a subset of handcrafted prior features is still employed. In future work, we aim to achieve high-efficiency prediction without any handcrafted features.

Indeed, as Mc Neil and Lee (2025) emphasized, advancing functional-peptide research requires "novel immunomodulatory molecules to overcome resistance", a niche for which ACPs are ideal candidates. Rapid and accurate genome-wide identification of ACPs equips clinicians with a readily deployable "peptide arsenal" that can be combined with immune checkpoint inhibitors (ICIs), aligning with that review's vision of "personalized ICI-combination therapy." In future work, we plan to adopt the multi-model voting ensemble framework proposed by Abbas et al. (2025c), originally designed for multi-class peptide tasks, to further boost the robustness and generalizability of our prediction system.

5 Conclusion

In this study, a unified prediction framework, GP2FI, was proposed, which integrates two deep learning models: MHA-preconv for metagenomic gene prediction and FuncPred-CB for the identification of ACPs and AMPs. As a multitask integrated deep learning framework, GP2FI exhibits excellent performance and practical application potential in both gene prediction and functional peptide recognition. Experimental results across multiple datasets demonstrate that, compared with traditional methods, GP2FI offers strong performance advantages and broad adaptability in both coding gene detection and functional peptide screening. Ongoing improvements in data integration, model efficiency, and biological interpretability will further strengthen its utility, providing comprehensive computational support for the efficient discovery of functional peptide-based therapeutics.

Data availability statement

Gene prediction-related datasets can be obtained from NCBI RefSeq (https://www.ncbi.nlm.nih.gov/refseq/) and GenBank (https://www.ncbi.nlm.nih.gov/genbank/). The CAMI dataset is available for download at https://data.cami-challenge.org/. The Sharon real metagenomic dataset can be accessed from the NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra). Cancer peptide datasets can be obtained from the following databases: DADP (http://webs.iiitd.edu.in/raghava/dadp/), CAMP (http://www.camp.bicnirrh.res.in/), APD/APD2 (https://aps.unmc.edu/), CancerPPD (http://crdd.osdd.net/raghava/cancerppd/), UniProt (https://www.uniprot.org/), and SwissProt (https://www.uniprot.org/help/swiss-prot). Antimicrobial peptide datasets can be obtained from ADAM (http://bioinform.info/adam/), APD (https://aps.unmc.edu/), CAMP (http://www.camp.bicnirrh.res.in/), and LAMP (http://biotechlab.fudan.edu.cn/database/lamp/).

Author contributions

CM: Writing – original draft, Software. QW: Data curation, Formal analysis, Writing – review & editing. GW: Data curation, Supervision, Formal analysis, Writing – review & editing. LY: Data curation, Writing – review & editing, Funding acquisition. YM: Project administration, Methodology, Writing – review & editing, Formal analysis, Funding acquisition.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the National Natural Science Foundation of China (NSFC) [62573111 and 62301139], the Heilongjiang Provincial Natural Science Foundation of China [JJ2025QC0185], the Quzhou municipal Science and Technology Project Foundation [2022K55], and the Zhejiang Provincial Natural Science Foundation of China [LTGY23H070004].

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2026.1736391/full#supplementary-material

References

Abbas, S. R., Abbas, Z., Zahir, A., and Lee, S. W. (2025a). Advancing genome-based precision medicine: a review on machine learning applications for rare genetic disorders. Brief. Bioinform. 26:bbaf329. doi: 10.1093/bib/bbaf329

Abbas, S. R., Seol, H., Abbas, Z., and Lee, S. W. (2025b). Exploring the role of artificial intelligence in smart healthcare: a capability and function-oriented review. Healthcare 13:1642. doi: 10.3390/healthcare13141642

Abbas, Z., Kim, S., Lee, N., Kazmi, S. A. W., and Lee, S. W. (2025c). A robust ensemble framework for anticancer peptide classification using multi-model voting approach. Comput. Biol. Med. 188:109750. doi: 10.1016/j.compbiomed.2025.109750

Abbas, Z., Rehman, M. U., Tayara, H., Lee, S. W., and Chong, K. T. (2024). m5C-Seq: machine learning-enhanced profiling of RNA 5-methylcytosine modifications. Comput. Biol. Med. 182:109087. doi: 10.1016/j.compbiomed.2024.109087

Agrawal, P., Bhagat, D., Mahalwal, M., Sharma, N., and Raghava, G. P. S. (2020). AntiCP 2.0: an updated model for predicting anticancer peptides. Brief. Bioinform. 22:bbaa153. doi: 10.1093/bib/bbaa153

Al-Ajlan, A., and El Allali, A. (2018a). “The effect of machine learning algorithms on metagenomics gene prediction,” in Proceedings of the 2018 5th International Conference on Bioinformatics Research and Applications, 16–21. doi: 10.1145/3309129.3309136

Al-Ajlan, A., and El Allali, A. (2018b). CNN-MGP: convolutional neural networks for metagenomics gene prediction. Interdisc. Sci. 11, 628–635. doi: 10.1007/s12539-018-0313-4

Bhadra, P., Yan, J., Li, J., Fong, S., and Siu, S. W. I. (2018). AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci. Rep. 8:1697. doi: 10.1038/s41598-018-19752-w

Delcher, A. L., Bratke, K. A., Powers, E. C., and Salzberg, S. L. (2007). Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23, 673–679. doi: 10.1093/bioinformatics/btm009

El Allali, A., and Rose, J. R. (2013). MGC: a metagenomic gene caller. BMC Bioinform. 14:S6. doi: 10.1186/1471-2105-14-S9-S6

Gabriel, L., Becker, F., Hoff, K. J., and Stanke, M. (2024). Tiberius: end-to-end deep learning with an HMM for gene prediction. Bioinformatics 40:btae685. doi: 10.1093/bioinformatics/btae685

Geng, A., Luo, Z., Li, A., Zhang, Z., Zou, Q., Wei, L., et al. (2025). ACP-CLB: an anticancer peptide prediction model based on multichannel discriminative processing and integration of large pretrained protein language models. J. Chem. Inf. Model. 65, 2336–2349. doi: 10.1021/acs.jcim.4c02072

Hoff, K. J., Lingner, T., Meinicke, P., and Tech, M. (2009). Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 37, W101–W105. doi: 10.1093/nar/gkp327

Hoff, K. J., Tech, M., Lingner, T., Daniel, R., Morgenstern, B., and Meinicke, P. (2008). Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinform. 9:217. doi: 10.1186/1471-2105-9-217

Holst, F., Bolger, A. M., Kindel, F., Günther, C., Maß, J., Triesch, S., et al. (2025). Helixer: ab initio prediction of primary eukaryotic gene models combining deep learning and a hidden Markov model. Nat. Methods 2025, 1–8. doi: 10.1038/s41592-025-02939-1

Hyatt, D., Chen, G.-L., LoCascio, P. F., Land, M. L., Larimer, F. W., and Hauser, L. J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 1–11. doi: 10.1186/1471-2105-11-119

Jhong, J.-H., Chi, Y.-H., Li, W.-C., Lin, T.-H., Huang, K.-Y., and Lee, T.-Y. (2018). dbAMP: an integrated resource for exploring antimicrobial peptides with functional activities and physicochemical properties on transcriptome and proteome data. Nucleic Acids Res. 47, D285–D297. doi: 10.1093/nar/gky1030

Kang, X., Dong, F., Shi, C., Liu, S., Sun, J., Chen, J., et al. (2019). DRAMP 2.0, an updated data repository of antimicrobial peptides. Sci. Data 6:148. doi: 10.1038/s41597-019-0154-y

Karagöz, M. A., and Nalbantoglu, O. U. (2021). Taxonomic classification of metagenomic sequences from Relative Abundance Index profiles using deep learning. Biomed. Signal Process. Control 67:102539. doi: 10.1016/j.bspc.2021.102539

Kelley, D. R., Liu, B., Delcher, A. L., Pop, M., and Salzberg, S. L. (2011). Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res. 40:e9. doi: 10.1093/nar/gkr1067

Larsen, T. S., and Krogh, A. (2003). EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinform. 4:21. doi: 10.1186/1471-2105-4-21

Lertampaiporn, S., Vorapreeda, T., Hongsthong, A., and Thammarongtham, C. (2021). Ensemble-AMPPred: robust AMP Prediction and recognition using the ensemble learning method with a new hybrid feature for differentiating AMPs. Genes 12:137. doi: 10.3390/genes12020137

Li, C., Zou, Q., Jia, C., and Zheng, J. (2023). AMPpred-MFA: an interpretable antimicrobial peptide predictor with a stacking architecture, multiple features, and multihead attention. J. Chem. Inf. Model. 64, 2393–2404. doi: 10.1021/acs.jcim.3c01017

Liu, Y., Guo, J., Hu, G., and Zhu, H. (2013). Gene prediction in metagenomic fragments based on the SVM algorithm. BMC Bioinform. 14:S5. doi: 10.1186/1471-2105-14-S5-S12

Lv, H., Yan, K., Guo, Y., Zou, Q., Hesham, A. E.-L., and Liu, B. (2022). AMPpred-EL: an effective antimicrobial peptide prediction model based on ensemble learning. Comput. Biol. Med. 146:105577. doi: 10.1016/j.compbiomed.2022.105577

Lv, Z., Cui, F., Zou, Q., Zhang, L., and Xu, L. (2021). Anticancer peptides prediction with deep representation learning features. Brief. Bioinform. 22:bbab008. doi: 10.1093/bib/bbab008

Ma, Y., Guo, Z., Xia, B., Zhang, Y., Liu, X., Yu, Y., et al. (2022). Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat. Biotechnol. 40, 921–931. doi: 10.1038/s41587-022-01226-0

Mc Neil, V., and Lee, S. W. (2025). Advancing cancer treatment: a review of immune checkpoint inhibitors and combination strategies. Cancers 17:1408. doi: 10.3390/cancers17091408

Noguchi, H., Taniguchi, T., and Itoh, T. (2008). MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 15, 387–396. doi: 10.1093/dnares/dsn027

Porto, W. F., Pires, Á. S., and Franco, O. L. (2012). CS-AMPPred: an updated SVM model for antimicrobial activity prediction in cysteine-stabilized peptides. PLoS ONE 7:e51444. doi: 10.1371/journal.pone.0051444

Rao, B., Zhang, L., and Zhang, G. (2020). ACP-GCN: the identification of anticancer peptides based on graph convolution networks. IEEE Access 8, 176005–176011. doi: 10.1109/ACCESS.2020.3023800

Rao, B., Zhou, C., Zhang, G., Su, R., and Wei, L. (2019). ACPred-Fuse: fusing multi-view information improves the prediction of anticancer peptides. Brief. Bioinform. 21, 1846–1855. doi: 10.1093/bib/bbz088

Rho, M., Tang, H., and Ye, Y. (2010). FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 38:e191. doi: 10.1093/nar/gkq747

Sun, M., Hu, H., Pang, W., and Zhou, Y. (2023). ACP-BC: a model for accurate identification of anticancer peptides based on fusion features of bidirectional long short-term memory and chemically derived information. Int. J. Mol. Sci. 24:15447. doi: 10.3390/ijms242015447

Veltri, D., Kamath, U., and Shehu, A. (2018). Deep learning improves antimicrobial peptide recognition. Bioinformatics 34, 2740–2747. doi: 10.1093/bioinformatics/bty179

Waghu, F. H., Barai, R. S., Gurung, P., and Idicula-Thomas, S. (2015). CAMPR3: a database on sequences, structures and signatures of antimicrobial peptides: Table 1. Nucleic Acids Res. 44, D1094–D1097. doi: 10.1093/nar/gkv1051

Wang, Z. (2004). APD: the antimicrobial peptide database. Nucleic Acids Res. 32, D590–D592. doi: 10.1093/nar/gkh025

Wei, L., Zhou, C., Su, R., and Zou, Q. (2019). PEPred-Suite: improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinformatics 35, 4272–4280. doi: 10.1093/bioinformatics/btz246

Xu, X., Li, C., Yuan, X., Zhang, Q., Liu, Y., Zhu, Y., et al. (2024). ACP-DRL: an anticancer peptides recognition method based on deep representation learning. Front. Genet. 15:1376486. doi: 10.3389/fgene.2024.1376486

Yan, K., Lv, H., Guo, Y., Peng, W., and Liu, B. (2022). sAMPpred-GAT: prediction of antimicrobial peptide by graph attention network and predicted peptide structure. Bioinformatics 39:btac715. doi: 10.1093/bioinformatics/btac715

Zhang, S.-W., Jin, X.-Y., and Zhang, T. (2017). Gene prediction in metagenomic fragments with deep learning. Biomed Res. Int. 2017, 1–9. doi: 10.1155/2017/4740354

Zhao, J., Liu, H., Kang, L., Gao, W., Lu, Q., Rao, Y., et al. (2025). deep-AMPpred: a deep learning method for identifying antimicrobial peptides and their functional activities. J. Chem. Inf. Model. 65, 997–1008. doi: 10.1021/acs.jcim.4c01913

Zhou, J., and Troyanskaya, O. G. (2015). Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934. doi: 10.1038/nmeth.3547

Zhu, L., Ye, C., Hu, X., Yang, S., and Zhu, C. (2022). ACP-check: An anticancer peptide prediction model based on bidirectional long short-term memory and multi-features fusion strategy. Comput. Biol. Med. 148:105868. doi: 10.1016/j.compbiomed.2022.105868

Keywords: ACP, AMPs, deep learning, functional peptide prediction, gene prediction

Citation: Ma C, Wei Q, Wang G, Miao Y and Yuan L (2026) Multi-feature fusion for gene prediction and functional peptide identification. Front. Microbiol. 17:1736391. doi: 10.3389/fmicb.2026.1736391

Received: 31 October 2025; Revised: 14 January 2026;
Accepted: 15 January 2026; Published: 06 February 2026.

Edited by:

Bin Wei, Zhejiang University of Technology, China

Reviewed by:

Seung Won Lee, Sungkyunkwan University, Republic of Korea
Wenxuan Xing, Beijing Institute of Technology, China

Copyright © 2026 Ma, Wei, Wang, Miao and Yuan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Lei Yuan, senxiu99@163.com; Yan Miao, miaoyan@nefu.edu.cn
