CNNeoPP: a large language model-enhanced deep learning pipeline for personalized neoantigen prediction and liquid biopsy applications

Cai, Yu; Chen, Rui; Song, Mingming; Wang, Lei; Huo, Zirong; Yang, Dongyan; Zhang, Sitong; Gao, Shenghan; Hwang, Seungyong; Bai, Ling; Lv, Yonggang; Cui, Yali; Zhang, Xi

doi:10.3389/fimmu.2026.1722117

ORIGINAL RESEARCH article

Front. Immunol., 04 February 2026

Sec. Cancer Immunity and Immunotherapy

Volume 17 - 2026 | https://doi.org/10.3389/fimmu.2026.1722117

Yu Cai¹†

Rui Chen¹†

Mingming Song¹

Lei Wang²

Zirong Huo¹

Dongyan Yang¹

Sitong Zhang¹

Shenghan Gao³

Seungyong Hwang⁴

Ling Bai⁵

Yonggang Lv³*

Yali Cui¹*

Xi Zhang⁵*

¹College of Life Sciences, Northwest University, Xi’an, Shaanxi, China
²Institute of Nuclear and New Energy Technology, Tsinghua University, Beijing, China
³Department of Thyroid Breast Surgery, Xi’an NO.3 Hospital, the Affiliated Hospital of Northwest University, Xi’an, Shaanxi, China
⁴Department of Statistics and Research Institute of Applied Statistics, Chonbuk National University, Jeonju, Jeonbuk-do, Republic of Korea
⁵Department of Ophthalmology, The Second Affiliated Hospital of Xi’an Jiaotong University, Xi’an, Shaanxi, China

Neoantigens have emerged as promising targets for personalized cancer immunotherapy. However, accurate identification of immunogenic neoantigens remains a challenge due to limitations in existing predictive models. Here, we present CNNeo, a novel deep learning-based neoantigen prediction model, and CNNeoPP, an integrated computational pipeline for neoantigen discovery. CNNeo employs large language model-derived sequence representations and multi-modal feature integration, demonstrating superior predictive performance compared to existing tools. CNNeoPP was rigorously validated using independent datasets, including the TESLA dataset, and by experimental validation with ELISpot T-cell assays. Additionally, we conducted a proof-of-concept study utilizing plasma cell-free DNA to explore the feasibility of non-invasive neoantigen prediction. We found that increased sequencing depth enhances neoantigen detectability, an effect further amplified by the prioritization strategy of CNNeoPP. CNNeoDB, a publicly accessible database compiling neoantigen data from multiple sources, was also developed. This study establishes robust tools for neoantigen prediction, with implications for optimizing cancer immunotherapy and liquid biopsy-based tumor monitoring. CNNeoPP is available at https://github.com/AaronChen007/neoantigen.

1 Introduction

Neoantigens are tumor-specific antigens that arise due to somatic mutations in cancer cells or from viral proteins. These novel peptide sequences are presented by major histocompatibility complex (MHC) molecules and can be recognized by T lymphocytes (1). Due to their high immunogenicity, exclusive expression in tumor cells, and poor tolerance by the immune system, neoantigens can elicit patient-specific immune responses, making them an attractive target for cancer immunotherapy (2, 3). Neoantigen-based therapies, including personalized vaccines and adoptive T cell therapies, have gained momentum in recent years, with rapid advancements accelerating their clinical applications (4, 5). Moreover, neoantigens are increasingly being utilized to assess responses to immunotherapy (6), and their detection through liquid biopsy approaches, such as tumor cell-free DNA (cfDNA) analysis, offers a promising non-invasive strategy for monitoring tumor evolution and guiding treatment decisions (7, 8).

Somatic single nucleotide variants (SNVs), which account for approximately 80% of neoantigens (9), can be efficiently detected using next-generation sequencing (NGS) technologies like whole-exome sequencing (WES) and RNA sequencing (RNA-seq). Despite reductions in NGS costs and advancements in sequencing technologies, neoantigen identification through NGS still faces challenges (1, 10). Many computational models predominantly rely on peptide-MHC binding affinity predictions (11), often neglecting immunogenicity factors such as antigen processing, peptide stability, T cell receptor recognition, and immune regulation (12). Traditional machine learning models and simplistic scoring methods fail to capture the complex sequence patterns underpinning immune recognition, limiting their predictive accuracy. To address these shortcomings, we have proposed deep learning-based approaches incorporating advanced sequence encoding techniques to enhance neoantigen prediction accuracy (10). Additionally, validation of neoantigen prediction models has largely relied on in silico analyses or limited experimental data, lacking rigorous independent validation, highlighting the critical need for robust experimental validation. Furthermore, existing prediction models are predominantly optimized for bulk tumor tissue data, leaving their applicability to liquid biopsy approaches uncertain, despite the potential of cfDNA-based neoantigen identification to improve treatment strategies (13).

Convolutional Neural Networks (CNNs) are a class of deep learning models designed to capture hierarchical features and spatial patterns within sequential or image-like data. Natural Language Processing (NLP), a subfield of artificial intelligence, processes textual or sequence-based data using tokenization and embedding techniques. To overcome the limitations described above, we developed a deep learning-based neoantigen prediction model, CNNeo (“CNN and NLP-based Neoantigen Prediction Model”), optimized for performance by integrating neoantigen peptide sequences and human leukocyte antigen (HLA) allele information using NLP-based encoders. Additionally, we assembled and validated CNNeoPP (“CNNeo Pipeline”), a comprehensive pipeline integrating NGS data processing with neoantigen classification. These advancements facilitate neoantigen discovery from plasma cfDNA, supporting the development of personalized cancer immunotherapies.

2 Methods

2.1 Study participants and sample preparation

The study was approved by the Medical Ethics Review Committee of Northwest University with the approval number 24093085 and conducted per the Declaration of Helsinki. Breast cancer samples, including tumor tissue and PBMCs, were collected from four patients at Xi’an NO.3 Hospital, the Affiliated Hospital of Northwest University, while lung cancer samples from two patients were obtained from our biobank, as described in our previous study (14). HLA-matched PBMCs were sourced from Milestone Biological Science & Technology Co., Ltd (Shanghai, China). Plasma samples were collected from healthy donors recruited at Northwest University. Genomic DNA (gDNA) and RNA from tumor tissues (either frozen or paraffin-embedded) and PBMCs were extracted using commercially available isolation kits (Tiangen Biotech Co., Ltd., Beijing, China). Noncancer cell-free DNA (cfDNA) was extracted from healthy plasma using the Magnetic Serum/Plasma DNA Kit (Tiangen Biotech Co., Ltd.) and quantified using a Qubit 4.0 Fluorometer (Thermo Fisher, Waltham, MA, USA). The fragment size of cfDNA was assessed using the Qsep 100 system (Bioptic Inc., Taiwan, China).

2.2 Construction of training and validation datasets

Paired HLA-peptide data used for training were curated and consolidated from published literature and existing neoantigen databases to develop predictive models (Supplementary Figure S1). A comprehensive search through Web of Science and PubMed identified studies reporting immunogenic and non-immunogenic tumor-specific peptides. Data extraction included mutated peptide sequences, wild-type sequences, HLA types, SNV mutation details, gene names, peptide lengths, and immunogenicity validation methods. Only peptides of 8–11 amino acids were included, while incomplete or duplicate entries were excluded. Additionally, neoantigen data were retrieved from Neodb and NEPdb databases (http://nep.whu.edu.cn). While Neodb contains only immunogenic peptides, NEPdb includes both immunogenic and non-immunogenic peptides. Each entry was cross-verified with the original publications to ensure accuracy.

An independent dataset from a published study (15), containing experimentally validated neoantigens and HLA pairs from 12 advanced lung cancer patients, was used for CNNeo model validation. Additionally, the TESLA dataset (1), which includes raw DNA-seq and RNA-seq data from melanoma and NSCLC patients, along with experimentally validated neoantigens, was utilized for CNNeoPP pipeline validation. Samples from all four breast cancer and two lung cancer patients collected in-house were processed using CNNeoPP, with one breast cancer and two lung cancer samples selected for experimental validation. All training datasets, validation datasets, and in-house collected samples and data were consolidated into the CNNeoDB database.

2.3 Cancer cell line and contrived empirical cfDNA sample

The A549 lung cancer cell line was obtained from Procell Biotechnology Co., Ltd. (Wuhan, China) and cultured in RPMI-1640 medium supplemented with 10% FBS at 37°C in a 5% CO₂ incubator. gDNA was extracted from A549 cells using the TIANamp Genomic DNA Kit (Tiangen Biotech Co., Ltd.). DNA fragmentation was performed using NEBNext® dsDNA Fragmentase® (New England Biolabs, Ipswich, MA, USA) to generate fragments of approximately 170 bp. The sheared gDNA (sgDNA) was then purified (size selection) using AMPure XP beads (Beckman Coulter Inc., Brea, CA, USA). To optimize yield, the incubation time was set to 40 minutes, and the AMPure XP bead-to-sample ratio was adjusted to 0.8× to selectively remove larger DNA fragments (Supplementary Figure S6A). To create an empirical 15% cancer cfDNA sample, the purified 100% cancer sgDNA was mixed with non-cancer cfDNA, resulting in a sample with approximately 15% tumor-derived DNA. The final DNA concentration was measured using the Qubit 4.0 Fluorometer (Thermo Fisher), and fragment size was assessed using the Qsep 100 system (Bioptic Inc.).

2.4 Next generation sequencing

Whole-exome DNA sequencing (DNA-seq) and transcriptomic RNA sequencing (RNA-seq) of tumor tissue DNA, PBMC DNA, and tumor RNA were performed using the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA) at a sequencing depth of 100× coverage, with an input of 200 ng of DNA or 300 ng of RNA per sample. DNA-seq of cell line-derived sgDNA and cfDNA samples, including the empirical 15% cancer cfDNA, was conducted on the MGI DNB-seq T7 platform (MGI Tech Co., Ltd., Shenzhen, China) at a sequencing depth of 200× or 1000× coverage, with an input of 50 ng DNA per sample.

2.5 Feature engineering and evaluation

To identify key features for model training, a comprehensive literature review was performed. Candidate features were first collected from published neoantigen prediction studies that explicitly evaluated experimentally validated immunogenic and non-immunogenic peptides. From these candidates, features were retained only if they satisfied all of the following criteria: 1) Biological relevance: the feature reflects known mechanistic determinants of neoantigen immunogenicity, including antigen processing and presentation and T-cell recognition. 2) Experimental support: the feature has been evaluated in at least one study using experimentally validated neoantigen datasets. 3) Non-redundancy: the feature provides complementary biological information rather than a trivial transformation of another retained feature. 4) General applicability: the feature can be computed consistently across datasets and HLA alleles without requiring tumor- or cohort-specific assumptions. Based on these criteria, 11 structured immunogenicity-related features (F1–F11) were selected and grouped into three mechanistic categories (Supplementary Table S2). The calculations of these immunogenicity features are detailed in the Supplementary File.

In addition, peptide sequence (F12) and HLA allele (F13) (16) were incorporated as unstructured inputs to enrich data-driven learning of sequence-level patterns. Specifically, HLA alleles were represented using both the HLA pseudo-sequence and the full HLA allele name (e.g., HLA-A*01:01). While the HLA pseudo-sequence encodes amino acid residues forming the peptide-binding groove, it does not fully capture allele-level immunological heterogeneity across HLA supertypes and subtypes. Previous studies have shown that HLA alleles sharing highly similar or even identical binding-pocket residues may exhibit distinct immunogenic outcomes, potentially arising from factors beyond the binding groove, such as allele lineage, evolutionary divergence, differential expression levels, and allele-specific biases in T-cell receptor (TCR) recognition. The full HLA allele name implicitly encodes supertype membership and subtype-level structure defined by HLA nomenclature, thereby providing orthogonal allelic information that is not recoverable from the pseudo-sequence alone.

To evaluate the independence of immunogenicity features, Spearman correlation analysis was conducted between each feature pair within the consolidated training dataset (n = 1,498). Relative feature importance was determined using a random forest classifier, and the resulting importance scores were used to rank the features.
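The pairwise correlation screen described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the feature names and values are hypothetical, and Spearman's coefficient is computed as the Pearson correlation of average ranks (the usual treatment of ties).

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_spearman(features):
    """Return {(name_a, name_b): rho} for every pair of feature columns."""
    return {
        (a, b): spearman(features[a], features[b])
        for a, b in combinations(features, 2)
    }


# Hypothetical feature columns (one value per peptide), for illustration only.
features = {
    "binding_affinity": [50, 120, 300, 800, 1500],
    "tap_efficiency": [0.9, 0.7, 0.8, 0.3, 0.1],
    "hydrophobicity": [0.2, 0.5, 0.1, 0.4, 0.3],
}
correlations = pairwise_spearman(features)
```

Feature pairs with a coefficient close to ±1 would be candidates for the non-redundancy check; the cutoff applied in the study is not stated here, so none is assumed.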

2.6 Structured normalization and sequence embedding

To ensure uniform data distribution, Z-score normalization was applied to structured features using scikit-learn’s StandardScaler. This transformation centers the data at a mean of 0 and scales it to a standard deviation of 1. The scaler was fitted on the training dataset and applied consistently across both training and testing datasets to maintain data integrity throughout the analysis. To convert HLA names (or pseudosequences) and peptide sequences into numerical representations, NLP techniques were employed including one-hot encoding for categorical binary representation, TF-IDF (Term Frequency-Inverse Document Frequency) for quantifying sequence importance, and BioBERT, a pre-trained biomedical large language model (LLM) that tokenizes sequences and maps them to embeddings for downstream learning tasks. To mitigate class imbalance in the training dataset, SMOTE (Synthetic Minority Over-sampling Technique) (17) was applied to generate synthetic samples for the minority class by interpolating between existing instances and their k-nearest neighbors. SMOTE-generated synthetic samples were used exclusively within the training folds and were not carried forward to model testing or performance evaluation, ensuring that all reported results are based entirely on the original, unchanged dataset.
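The fit-on-training, apply-everywhere pattern for Z-score normalization can be illustrated without any dependencies. The class below is a hypothetical stand-in written for this sketch; it mimics the standard behaviour of a per-column scaler (population standard deviation) and is not the scikit-learn implementation itself. The numbers are made up.

```python
from statistics import fmean, pstdev


class ZScoreScaler:
    """Minimal per-column scaler: subtract the mean, divide by the SD."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [fmean(col) for col in columns]
        # Population SD (ddof = 0); constant columns fall back to 1.0
        # so that transform never divides by zero.
        self.sds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(value - m) / s for value, m, s in zip(row, self.means, self.sds)]
            for row in rows
        ]


# Hypothetical structured features: two columns, three training peptides.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

scaler = ZScoreScaler().fit(train)     # statistics come from training data only
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)   # reuses the training mean and SD
```

Fitting on the training rows alone keeps information from the test rows out of the scaling parameters, which is the point of the procedure described above.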

2.7 Development and integration of submodels into CNNeo and CNNeoPP

A total of 24 model combinations (Supplementary Table S4) were trained using three encoders and four deep learning algorithms, optimized based on the characteristics of the input data and feature sets. Model training was conducted using Scikit-learn for traditional machine learning models and PyTorch for deep learning architectures. The dataset was split into training (60%), validation (20%), and test (20%) sets for deep learning models, while 5-fold cross-validation was applied for traditional machine learning algorithms. Typically, Random Forest (RF) and Fully Connected Neural Networks (FCNN) were utilized to process structured features (F1-F11), while Convolutional Neural Networks (CNN) were applied to encode unstructured features (F12, F13), which were transformed using NLP techniques, as described previously. The FCNN architecture consists of an initial fully connected layer, followed by a ReLU activation function, a dropout layer to prevent overfitting, and an additional fully connected layer, with classification performed using nn.CrossEntropyLoss, which applies softmax internally. The CNN architecture employs three parallel convolutional layers with kernel sizes of 3×3, 4×4, and 5×5, each followed by a ReLU activation function and max pooling to retain the most salient features. The outputs are concatenated, passed through a dropout layer to further reduce overfitting, and processed by a fully connected layer for classification. This parallel convolutional design enables the model to effectively capture multi-scale patterns within the data.
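A parallel-kernel CNN of the kind described above might look as follows in PyTorch. This is a sketch under stated assumptions rather than the released model: the class name is hypothetical, the convolutions are taken to be one-dimensional over the token axis, the embedding width of 768 is assumed, and the filter count (120) and dropout rate (0.1) are borrowed from the values quoted for the CNN_BioBERT submodel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelKernelCNN(nn.Module):
    """Three parallel Conv1d branches, max-pooled over time and concatenated."""

    def __init__(self, embed_dim=768, n_filters=120, kernel_sizes=(3, 4, 5),
                 n_classes=2, dropout=0.1):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, kernel_size=k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim).
        # Conv1d expects (batch, channels, seq_len), hence the transpose.
        x = embeddings.transpose(1, 2)
        # ReLU, then global max pooling over the sequence axis for each branch.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        # Raw logits are returned; nn.CrossEntropyLoss applies log-softmax itself.
        return self.classifier(features)


model = ParallelKernelCNN()
batch = torch.randn(4, 20, 768)          # random stand-in for token embeddings
labels = torch.tensor([0, 1, 0, 1])      # made-up binary labels
logits = model(batch)
loss = nn.CrossEntropyLoss()(logits, labels)
```

Each branch sees the same input with a different window width, so the concatenated vector mixes short and slightly longer sequence motifs before classification.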

Three top-performing submodels were integrated into the CNNeo model: 1) FCNN_TF: Peptide sequences and HLA alleles were merged and tokenized using 6-mer slicing. The resulting k-mer tokens were encoded with TF-IDF and input into an FCNN consisting of hidden layers (hidden size = 64) with ReLU activation and dropout (0.2). Training settings included batch size = 32, epochs = 45, and the Adam optimizer (learning rate = 1×10⁻⁴). 2) CNN_BioBERT: Peptide-HLA pairs were first processed into 4-mer subsequences and then tokenized using the BioBERT tokenizer. BioBERT was used as a fixed Transformer-based embedding extractor, producing contextual embeddings that were subsequently passed into a CNN consisting of kernel sizes [3, 4, 5] with 120 filters, followed by max-pooling, dropout (0.1), and a fully connected classifier. Training used batch size = 32, epochs = 19, and the Adam optimizer (learning rate = 1×10⁻⁴). 3) FCNN_BioBERT: Peptide-HLA strings were processed into 2-mer subsequences and encoded using the BioBERT tokenizer. The resulting embeddings were concatenated with 11 structured immunogenicity-related features, creating a multimodal representation. This fused representation was fed into an FCNN (hidden size = 53), with ReLU activation and dropout (0.5). Training used batch size = 32, epochs = 31, and the Adam optimizer (learning rate = 1×10⁻⁴). The three top-performing submodels (FCNN_TF, CNN_BioBERT, and FCNN_BioBERT) were integrated into CNNeo using a rank-level ensemble (late-fusion) consensus approach. Specifically, each submodel (e.g., FCNN_TF) is trained independently and generates a ranked list of peptide–HLA pairs based on predicted immunogenicity scores. The top 50 candidates from each submodel are first selected. Peptides that appear in at least two submodels are identified as consensus peptides and are prioritized at the top of the final CNNeo ranking. The remaining peptides are then appended in descending order according to their individual model-specific scores.
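The rank-level consensus step can be sketched as below. The function name, the peptide labels, and the scores are hypothetical. Two details are assumptions of this sketch because the text does not specify them: consensus peptides are ordered by the number of submodels that selected them and then by their highest score, and the remaining peptides are ordered by the highest score they received from any submodel.

```python
from collections import defaultdict


def consensus_rank(model_scores, top_k=50, min_votes=2):
    """Fuse {model: {pair: score}} into a single ranked list of pairs."""
    votes = defaultdict(int)
    best_score = {}
    for scores in model_scores.values():
        # Keep only this submodel's top_k highest-scoring candidates.
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        for pair in top:
            votes[pair] += 1
            best_score[pair] = max(best_score.get(pair, float("-inf")), scores[pair])

    def order(pairs):
        # Assumed tie-break: more votes first, then the best submodel score.
        return sorted(pairs, key=lambda p: (votes[p], best_score[p]), reverse=True)

    consensus = [p for p in votes if votes[p] >= min_votes]
    remaining = [p for p in votes if votes[p] < min_votes]
    return order(consensus) + order(remaining)


# Made-up scores for five placeholder peptide-HLA pairs.
example_scores = {
    "FCNN_TF": {"pepA": 0.90, "pepB": 0.80, "pepC": 0.30},
    "CNN_BioBERT": {"pepB": 0.95, "pepD": 0.70, "pepA": 0.20},
    "FCNN_BioBERT": {"pepE": 0.85, "pepB": 0.60, "pepC": 0.50},
}
ranking = consensus_rank(example_scores, top_k=2)
```

With `top_k=2`, only `pepB` is selected by at least two submodels, so it leads the list; the single-vote candidates follow by score. Because the fusion works on ranks and votes, the submodels do not need to share a calibrated score scale for the consensus tier, although the ordering of the remaining peptides still compares raw scores across models.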

Subsequently, CNNeo was integrated into CNNeoPP, a comprehensive neoantigen prediction pipeline, detailed in the Supplementary File. All CNNeo- and CNNeoPP-based analyses reported in this study were performed using the tagged GitHub release v1.0.0. The performance of CNNeo and CNNeoPP was evaluated using independent datasets, as described above. The assessment was benchmarked against four existing neoantigen prediction tools: Seq2Neo-CNN (16), Immuno-GNN (18), DeepImmuno-CNN (19), and INeo-Epp (20). Each tool was tested using default parameters, and a unified evaluation metric (Top N) was applied to ensure fair and consistent comparison.
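The Top N metric amounts to counting experimentally validated immunogenic peptides among the first N ranked candidates. A minimal sketch follows; the function name and the placeholder identifiers are hypothetical.

```python
def top_n_hits(ranked_pairs, validated_immunogenic, n_values=(10, 20, 50)):
    """Count validated immunogenic pairs within the first n ranked candidates."""
    validated = set(validated_immunogenic)
    return {
        n: sum(1 for pair in ranked_pairs[:n] if pair in validated)
        for n in n_values
    }


# Placeholder ranking of 60 candidates, best first, with five validated hits.
ranked = [f"pep{i}" for i in range(60)]
validated = {"pep0", "pep5", "pep15", "pep40", "pep55"}
hits = top_n_hits(ranked, validated)
```

Applying the same function to each tool's ranking of an identical candidate set gives directly comparable counts, which is what makes the metric usable across tools with different score scales.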

2.8 Experimental validation by IFN-γ T-cell assays

The clinical utility of CNNeoPP was evaluated using breast and lung cancer samples for neoantigen prediction. The IFN-γ ELISpot assays were performed by Baizhen Biotechnologies (Wuhan, China), an external research service provider specializing in immunological assays. GenScript Biotech Corporation (Nanjing, China) was responsible for peptide synthesis, and Baizhen Biotechnologies was responsible for assay setup, execution, and data acquisition following standardized protocols. The experimental workflow included PBMC QC and preparation, peptide synthesis and peptide stimulation, ELISpot plate processing, and spot count analysis, as outlined below. The top-ranked HLA-neoantigen pairs were compared with existing models, selecting 27 CNNeo-exclusive candidates and 23 overlapping candidates, totaling 50 neoantigens for validation. Peptides (95% purity) were synthesized by Milestone Biological Science & Technology Co., Ltd. and tested using HLA-matched PBMCs in an IFN-γ ELISpot (enzyme-linked immunospot) assay. Frozen PBMCs were rapidly thawed, washed to remove DMSO, and resuspended in complete medium. After cell counting, PBMCs were rested for 1 hour in a 37 °C, 5% CO₂ incubator, then 4×10⁴ cells per well were seeded into pre-blocked ELISpot plates. Peptides were added at a final concentration of 20 µg/mL, alongside blank (RPMI-1640), positive (CEF peptide pool), and negative (RPMI-1640 with 0.5% DMSO and 10% FBS) controls. CEF is a control peptide pool derived from Cytomegalovirus (CMV), Epstein-Barr virus (EBV), and Influenza virus. Plates were incubated for 16–20 hours, washed, and incubated with biotin-labeled IFN-γ detection antibodies for 2 hours at room temperature, followed by streptavidin-horseradish peroxidase (HRP) incubation for 1 hour. 3,3’,5,5’-tetramethylbenzidine (TMB) substrate (Sigma-Aldrich, PA, USA) was applied for color development, and the reaction was stopped using deionized water.
Plates were dried overnight in the dark before spot counts were recorded using an ELISpot reader. A weak positive response was defined as 8–81 spots after subtracting the background mean spot count, while a strong positive response was classified as ≥ 81 spots (21). This threshold was derived from the 75th percentile of the spot count distribution, ensuring that strong responses were distinguished from weaker, non-significant responses in the dataset.
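The response calls described above reduce to a background subtraction followed by two cutoffs. The sketch below is illustrative only: the function name and the "negative" label are hypothetical, and because the stated ranges meet at 81 spots, this sketch assigns a net count of exactly 81 to the strong category.

```python
from statistics import fmean


def classify_elispot(spot_count, background_counts, weak_cutoff=8, strong_cutoff=81):
    """Label a well from its spot count after subtracting the mean background."""
    net = spot_count - fmean(background_counts)
    if net >= strong_cutoff:   # assumption: the shared boundary (81) counts as strong
        return "strong positive"
    if net >= weak_cutoff:
        return "weak positive"
    return "negative"          # assumed label for net counts below the weak cutoff


# Made-up wells with background counts of 2 and 4 spots (mean 3).
background = [2, 4]
calls = [classify_elispot(count, background) for count in (100, 20, 5)]
```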

2.9 cfDNA sample preparation and sequencing strategy

To evaluate the feasibility of CNNeoPP for cfDNA-based neoantigen prediction, four different sample types were prepared. A tumor-only sample (100% cancer sgDNA) was generated from cancer cell-derived sgDNA and sequenced at 200× depth. Two empirical 15% cancer cfDNA samples, containing 15% tumor content, were prepared by spiking 100% cancer sgDNA with noncancer cfDNA at a 15:85 ratio, followed by sequencing at 200× and 1000× depth, respectively. Lastly, an in silico simulated 15% cancer cfDNA sample was created by computationally mixing 100% cancer sgDNA reads (200× coverage) with noncancer cfDNA reads at the same 15:85 ratio, generating a simulated tumor cfDNA dataset. CNNeoPP was then applied to all samples to predict candidate neoantigens.
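The in silico mixture can be pictured as random subsampling of two read pools at a fixed ratio. The sketch below works on in-memory lists of placeholder read identifiers; the function name is hypothetical, and mixing by read count assumes that both pools have similar read lengths so that the read fraction approximates the DNA fraction. Real data would normally be subsampled from alignment or FASTQ files with dedicated tools, which is not shown.

```python
import random


def mix_reads(tumour_reads, normal_reads, tumour_fraction=0.15,
              total_reads=None, seed=0):
    """Sample reads without replacement to reach the requested tumour fraction."""
    rng = random.Random(seed)
    if total_reads is None:
        # Largest mixture both pools can supply without reusing any read.
        total_reads = int(min(len(tumour_reads) / tumour_fraction,
                              len(normal_reads) / (1 - tumour_fraction)))
    n_tumour = round(total_reads * tumour_fraction)
    n_normal = total_reads - n_tumour
    mixed = rng.sample(tumour_reads, n_tumour) + rng.sample(normal_reads, n_normal)
    rng.shuffle(mixed)  # remove the block structure left by concatenation
    return mixed


# Placeholder read identifiers: "T" marks tumour-derived, "N" non-cancer cfDNA.
tumour_pool = [f"T{i}" for i in range(300)]
normal_pool = [f"N{i}" for i in range(1000)]
simulated = mix_reads(tumour_pool, normal_pool, tumour_fraction=0.15, total_reads=1000)
tumour_share = sum(read.startswith("T") for read in simulated) / len(simulated)
```

Fixing the seed makes the simulated sample reproducible, which helps when the same mixture is compared against the empirical spike-in samples.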

2.10 Construction of CNNeoDB

To establish CNNeoDB, neoantigen data were compiled from multiple sources and categorized into two main datasets. Dataset 1 integrates (1) a publicly available dataset including training data for model development (Supplementary Figure S1A), (2) two validation datasets, comprising the lung cancer dataset for CNNeo validation (Supplementary Figure S1B) and the TESLA dataset for CNNeoPP validation (Supplementary Figure S1C), and (3) one new dataset of experimentally validated neoantigens developed from three cancer patients in this study. Dataset 2 consists of candidate neoantigens identified by CNNeoPP in additional cancer patients. These datasets provide a comprehensive resource for neoantigen prediction and validation.

2.11 Statistical analysis

All statistical analyses and data visualization were performed using Python-based libraries. Spearman correlation analysis was applied to assess relationships between immunogenicity features using correlation coefficients, while chi-square tests evaluated differences in peptide length distributions between immunogenic and non-immunogenic peptides. Differences in amino-acid frequencies between immunogenic and non-immunogenic peptides at each sequence position were assessed using chi-square tests (positions with sparse counts were filtered before testing). Shapiro-Wilk tests were conducted to assess whether immunogenicity features followed a normal distribution, revealing that most features did not meet normality assumptions (Supplementary Table S3). Consequently, Mann-Whitney U tests were applied to compare the distribution of immunogenicity features across peptide groups. P-values were reported where applicable to indicate statistical significance.

3 Results

3.1 Study design and overall strategy

This study (Figure 1) details the development and validation of CNNeo, a novel deep learning-based neoantigen prediction model, and its integration into CNNeoPP, a comprehensive computational pipeline for neoantigen discovery. CNNeoPP incorporates sequencing data processing, mutation calling, feature extraction, and immunogenicity prediction using NGS data from tumor tissue and PBMCs. Independent validation was performed using the TESLA dataset, T-cell assays, and a cfDNA-based proof-of-concept study. Additionally, we developed CNNeoDB, a curated neoantigen database to facilitate data sharing and support future research.

Figure 1

Flowchart illustrating six steps in a neoantigen prediction pipeline: 1) Model Development involving dataset consolidation and validation. 2) Pipeline Construction, including DNA/RNA input, mutation calling, and neoantigen prediction. 3) Independent Validation using the TESLA dataset. 4) Experimental Validation with cancer specimens and T cell assays. 5) Proof-of-Concept Study comparing gDNA and cfDNA. 6) Building a new database with public data and novel candidates.

Figure 1. Overall study design. This study follows six key steps: (1) development of CNNeo, a neoantigen prediction model trained on immunogenic and non-immunogenic peptide datasets; (2) implementation of CNNeoPP, a computational pipeline integrating CNNeo for neoantigen prediction from sequencing data; (3) independent validation using the TESLA dataset; (4) experimental validation through ELISpot assays; (5) a proof-of-concept study comparing neoantigen identification from gDNA and cfDNA using CNNeoPP; and (6) development and release of CNNeoDB, a comprehensive neoantigen database integrating both public and newly identified candidates from this study.

3.2 Data collection and feature evaluation

The training dataset was curated from literature and public databases, including Neodb and NEPdb, yielding a dataset of 1,498 validated peptides (Supplementary Figure S1A; Supplementary Table S1). An independent validation dataset consisting of 153 neoantigens was compiled (Supplementary Figure S1B). The TESLA dataset, which included paired tumor-normal samples and experimentally validated peptides, was used for further validation (Supplementary Figure S1C).

In the training dataset, immunogenic peptides exhibited distinct biological characteristics compared to non-immunogenic peptides, with a notable enrichment of shorter peptide lengths (9-mers being the most common) (Figure 2A) and differences in amino acid composition (Figures 2B, C). Statistical comparison of residue frequencies confirmed significant differences at multiple positions between immunogenic and non-immunogenic peptides (p < 0.05 for 10 out of 11 positions). Key residues such as glutamic acid (E) and aspartic acid (D) are more enriched in immunogenic peptides at positions P2, P3 and P4, and lysine (K) and arginine (R) are predominantly seen at position P1. Analysis of HLA-I allele distribution revealed that immunogenic peptides were associated with a more diverse range of HLA alleles (Supplementary Figure S2). A statistical comparison of 11 immunogenicity-related features (Supplementary Tables S2, S3) showed significant differences in 7 features (Supplementary Figure S3), and most features exhibited low statistical correlation with each other (Supplementary Figure S4), although biological dependencies among certain features are naturally expected. Random forest analysis identified TAP transport efficiency, NetCTLpan score, and peptide-HLA binding affinity as the top 3 predictive features of immunogenicity (Figure 2D).

Figure 2

Four-panel image showing peptide data visualizations: A: Bar chart comparing peptide counts by length, distinguishing immunogenic and non-immunogenic peptides. The longest peptides have higher counts for immunogenic types. B: Frequency plots for immunogenic and non-immunogenic peptides, showing amino acid occurrences at different positions. C: Word clouds for immunogenic and non-immunogenic peptides, highlighting prominent amino acids. D: Bar chart displaying relative feature importance, with highlights on TAP transport efficiency and peptide-HLA binding affinity as significant factors.

Figure 2. Characterization of peptides in the training dataset and feature importance. (A) Peptide length distribution comparing immunogenic and non-immunogenic peptides. The statistical significance of differences in peptide length distributions was evaluated using a chi-square test, with the p-value displayed in the figure title. (B) Sequence logos representing amino acid preferences in immunogenic (left) and non-immunogenic (right) peptides. Amino acids are color-coded based on their properties: polar (red), nonpolar/hydrophobic (blue), positively charged (yellow), and negatively charged (green). The y-axis represents the normalized relative frequency of each amino acid, ensuring the sum equals 1. (C) Word clouds visualizing amino acid composition of immunogenic (left) and non-immunogenic (right) peptides. The relative size of each amino acid reflects its frequency within each group, illustrating compositional differences between immunogenic and non-immunogenic peptides. (D) Feature importance analysis using a random forest model. The relative contribution of 11 immunogenicity features is ranked to assess their impact on immunogenicity classification.

3.3 Model training, testing and independent validation

A total of 24 individual models were trained utilizing a variety of supervised learning algorithms incorporating peptide sequences, HLA allele information, and immunogenicity features (Supplementary Table S4; Figure 3A), with three models outperforming the others (Figure 3B). The FCNN_TF and CNN_BioBERT models demonstrated the highest performance, achieving an AUC of 0.81 on the independent test set by leveraging TF-IDF or BioBERT encoding to extract semantic relationships in peptide sequences. The FCNN_BioBERT model exhibited robustness by integrating deep learning with combined immunogenicity features, peptide sequences, and HLA features, improving predictive performance compared to models using only immunogenicity features. The resulting CNNeo model was then tested using an independent dataset of 69 immunogenic and 84 non-immunogenic mutant peptides (Figure 3C). The CNNeo model, along with its individual components (FCNN_TF, CNN_BioBERT, and FCNN_BioBERT), was benchmarked against four existing tools (DeepImmuno-CNN, Seq2Neo-CNN, Immuno-GNN, and INeo-Epp). Despite prediction variability among models, performance evaluation at different ranking thresholds (top 10, top 20, and top 50) revealed that the CNNeo framework outperformed all existing tools (Figure 3C; Supplementary Table S5), further demonstrating the benefits of integrating combined features and algorithms.

Figure 3

Panel A shows a table comparing different models based on features such as peptide information, HLA allele names, and immunogenicity, with measurements of Precision, Recall, F1-score, Accuracy, and AUC presented in a heatmap. Panel B illustrates the framework for CNN models, including FCNN_TF, CNN_BioBERT, and FCNN_BioBERT, structured with layers like Linear, Conv1D, and ReLU. Panel C depicts a bar chart comparing the count of accurately predicted neoantigens among various models, highlighting differences in top 10, 20, and 50 results.

Figure 3. Architecture and performance evaluation of CNNeo. (A) Heatmap summarizing key model components (e.g., feature type, algorithm, encoding method) alongside performance metrics of representative models, evaluated on the independent test set. Color transitions from teal to pink indicate increasing performance values across evaluation metrics. (B) Schematic representation of three top-performing neural network architectures integrated into the CNNeo framework, illustrating the model workflow from encoding method (top) to final predictions (bottom). (C) Comparative analysis of predicted immunogenic neoantigen counts across different models in Top N rankings using an independent dataset. CNNeo (integrating FCNN_TF, CNN_BioBERT, and FCNN_BioBERT) is compared against its individual components and four existing tools (DeepImmuno-CNN, Seq2Neo-CNN, Immuno-GNN, INeo-Epp), evaluating their relative predictive performance and the advantages of ensemble integration.

3.4 Full pipeline-level validation using TESLA dataset

CNNeo was then integrated into CNNeoPP, a comprehensive pipeline designed to streamline neoantigen discovery by performing DNA-seq on tumor tissue and PBMC samples, as well as RNA-seq on tumor tissue, followed by analyzing sequencing data, somatic SNVs, RNA expression levels, and HLA typing (Figure 4A). The performance of CNNeoPP was further evaluated using the TESLA consortium dataset (Supplementary Table S6) and compared with pipelines that integrate existing tools, including seq2neo-CNN, Immuno-GNN and INeo-Epp. To ensure a fair comparison, all tools were provided with and evaluated on the same set of peptide-HLA pairs generated by CNNeoPP’s upstream modules. Of the six TESLA patient samples originally analyzed by the consortium, sequencing data were publicly released for only five samples (Supplementary Table S6). Therefore, our analysis was restricted to these five publicly available samples, which contain 532 non-redundant SNV-derived peptides. Among these 532 peptides, 34 (6.39%) were experimentally confirmed as immunogenic, indicating that only a small fraction of somatic mutations generates neoantigens capable of eliciting a T-cell response (Figure 4B). CNNeoPP successfully captured 3, 5, and 8 unique immunogenic peptides within the top 10, 20, and 50 candidates, respectively, demonstrating a clear and reproducible advantage over existing approaches (Figure 4C; Supplementary Table S7). This consistent performance underscores the robustness of the multi-model ensemble pipeline and its superior accuracy in identifying neoantigens in real-world and independent applications.
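The counts reported above can be turned into per-cutoff rates with a short calculation. This is a sketch for orientation only: the terms "precision", "recall", and "enrichment" are labels chosen for this example, and the helper `precision_recall_at_n` is hypothetical. All input numbers come from the paragraph above.

```python
def precision_recall_at_n(hits, n, total_positives):
    """Return (precision, recall) in percent for the top-n candidates."""
    precision = 100.0 * hits / n              # share of selected candidates that are immunogenic
    recall = 100.0 * hits / total_positives   # share of all immunogenic peptides that were selected
    return precision, recall


TOTAL_PEPTIDES = 532   # non-redundant SNV-derived peptides in the five TESLA samples
TOTAL_POSITIVES = 34   # experimentally confirmed immunogenic peptides
HITS_AT_N = {10: 3, 20: 5, 50: 8}  # immunogenic peptides captured by CNNeoPP at each cutoff

prevalence = 100.0 * TOTAL_POSITIVES / TOTAL_PEPTIDES  # background rate of immunogenic peptides

for n, hits in HITS_AT_N.items():
    precision, recall = precision_recall_at_n(hits, n, TOTAL_POSITIVES)
    # Enrichment compares precision in the top n with the background prevalence.
    print(f"top {n}: precision {precision:.2f}%, recall {recall:.2f}%, "
          f"enrichment {precision / prevalence:.2f}x")
```

Reading precision against the 6.39% background rate shows how much a ranking concentrates confirmed neoantigens near the top of the list.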

Figure 4

Diagram illustrating a neoantigen prediction pipeline labeled CNNeoPP. A detailed flowchart (A) shows input from tumor tissue and PBMCs through DNA and RNA sequencing, leading to mutation calling, peptide sequencing, and immunogenicity assessment, filtered by gene expression, before entering the CNNeo model to output neoantigen candidates. Table (B) displays patient data, cancer types, and counts related to SNV and peptide-HLA testing results. Bar chart (C) compares counts of accurately predicted neoantigens across different models and ranks.

Figure 4. Independent validation and performance assessment of CNNeoPP. (A) Schematic workflow of CNNeoPP. DNA-seq and RNA-seq data are processed to identify somatic SNVs, HLA types, and gene expression levels. Mutations are translated into amino acid changes, and 11 immunogenicity features are computed. These inputs are used by CNNeo to generate neoantigen predictions. (B) Summary of the five TESLA patients’ data used for independent validation of CNNeoPP, including cancer type, total identified neoantigen candidates, number of experimentally tested peptides, number of confirmed immunogenic neoantigens, and the percentage of confirmed immunogenic neoantigens. (C) Comparative performance evaluation of CNNeoPP against pipelines incorporating existing tools (seq2neo-CNN, Immuno-GNN, INeo-Epp) using TopN ranking analysis in the TESLA cohort.

3.5 Experimental validation by T-cell assay

This study enrolled one breast cancer (BCa) and two lung cancer (LC1 and LC2) patients, from whom tumor tissue and PBMC samples were collected (Figure 5A). CNNeoPP was applied to process sequencing data and predict candidate neoantigens. To optimize the selection of peptide candidates for experimental validation, 50 top-ranked neoantigens were chosen from CNNeoPP predictions, with reference to existing model predictions. Of these, approximately half (n=27) were exclusively identified by CNNeoPP, while the remaining half overlapped with predictions from existing models and were prioritized in the final selection process. ELISpot assays with HLA-matched healthy PBMCs (Supplementary Figure S5A) were conducted to evaluate whether these neoantigens could elicit a T-cell response (Figure 5B; Supplementary Figure S5B). After removing outliers, the average spot count for each peptide was calculated by subtracting the negative control background (Figure 5C). Among the 50 peptides selected for experimental validation, 29 (58%) demonstrated immunogenicity, including 24 (48%) with a weak T-cell response and 5 (10%) with a strong response (Figure 5D; Supplementary Table S8). Notably, CNNeoPP-only predicted peptides exhibited a higher positive rate compared to peptides co-predicted by CNNeoPP and other models, with 59.3% vs. 34.8% showing a weak response and 11.1% vs. 8.7% showing a strong response. These findings further highlight the enhanced performance of CNNeoPP, as demonstrated by the higher proportion of positive responses in ELISpot assays relative to other prediction approaches.
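The response calls described above can be sketched as a simple thresholding step on background-subtracted spot counts, using the cutoffs reported for Figure 5C (≥ 8 spots for a positive response, ≥ 81 for a strong one). The function `classify_elispot` and the replicate values are hypothetical. The outlier-removal step is not modeled because its rule is not given in this section.

```python
from statistics import mean

WEAK_THRESHOLD = 8     # minimum response threshold reported for Figure 5C
STRONG_THRESHOLD = 81  # strong-response threshold reported for Figure 5C


def classify_elispot(replicate_spots, negative_control_spots):
    """Classify one peptide from replicate spot counts after background subtraction."""
    # Background subtraction: mean of peptide wells minus mean of negative-control wells.
    net_spots = mean(replicate_spots) - mean(negative_control_spots)
    if net_spots >= STRONG_THRESHOLD:
        return "strong"
    if net_spots >= WEAK_THRESHOLD:
        return "weak"
    return "negative"


# Hypothetical replicate wells for three peptides and a shared negative control.
negative_control = [3, 5, 4]
print(classify_elispot([20, 24, 22], negative_control))
print(classify_elispot([100, 110, 105], negative_control))
print(classify_elispot([6, 8, 7], negative_control))
```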

Figure 5

Flowchart illustrating a study on neoantigen prediction using tumor and PBMC samples from BCa and LC patients. Part A shows the workflow; tumor tissue and PBMC are processed using CNNeoPP for neoantigen synthesis. Part B displays ELISpot assay results with numbered spots for different peptides. Part C presents bar graphs comparing background-subtracted spot counts for BCa and LC patients. Part D is a bar chart showing the count and proportion of accurately predicted neoantigens with weak and strong responses.

Figure 5. Experimental validation of CNNeoPP via T-cell assays. (A) Schematic workflow of experimental validation of CNNeoPP. Tumor tissue and PBMCs from breast and lung cancer patients were processed for DNA-seq and RNA-seq, followed by neoantigen prediction using CNNeoPP and other existing tools. A total of 50 selected neoantigen candidates were synthesized, and peptide-HLA complexes were tested for response in HLA-matched T cell assays. (B) Representative ELISpot results illustrate immune responses for blank control (BC), positive control (PC), negative control (NC), and tested neoantigen peptides. Peptides confirmed as immunogenic are highlighted in bold. (C) Background-subtracted IFN-γ ELISpot responses for predicted neoantigens. Each bar represents a tested neoantigen, with PC exhibiting the highest response. The pink dashed line indicates the minimum response threshold (spot count ≥ 8), while the teal dashed line represents the strong positive response threshold (spot count ≥ 81). Neoantigens exceeding these thresholds were classified as immunogenic. (D) Performance evaluation of CNNeoPP, assessed by weakly positive (pink) and strongly positive (teal) responses across three groups: Total (50 peptides), CNNeoPP-only predicted peptides (27 peptides), and peptides predicted by combining CNNeoPP with other existing tools (23 peptides).

3.6 Proof-of-concept study in cfDNA

To evaluate the feasibility of applying CNNeoPP to cfDNA-based neoantigen prediction, contrived empirical cfDNA was generated by mixing cancer cell line-derived sheared gDNA (sgDNA) with healthy plasma to achieve a controlled tumor content of 15% (Figure 6A; Supplementary Figure S6A). Fragmentation analysis confirmed that the contrived cfDNA exhibited size distribution and fragmentation patterns consistent with natural cfDNA, ensuring its suitability as a representative cfDNA sample for feasibility validation of cfDNA-based prediction (Figure 6B; Supplementary Figure S6B). After sequencing, CNNeoPP analyzed four different sample conditions: (I) a cancer cell line-derived sheared gDNA (sgDNA) sample with 100% tumor content at 200× coverage, (II & III) empirical cfDNA samples with 15% tumor content sequenced at 200× and 1000× coverage, and (IV) an in silico simulated cfDNA sample with 15% tumor content at 200× coverage (Figure 6A).
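A rough expectation for how tumor content and coverage interact in these conditions can be sketched as follows. The calculation assumes a clonal heterozygous SNV in a diploid region (half of the tumor alleles carry the variant) and ignores sequencing error, copy-number changes, and capture bias. These assumptions are illustrative and are not taken from the study. The helper `expected_alt_reads` is hypothetical.

```python
def expected_alt_reads(depth, tumor_fraction, allelic_fraction=0.5):
    """Expected number of reads supporting a variant at a given coverage."""
    # Expected variant allele fraction = tumor content x share of tumor alleles carrying the variant.
    expected_vaf = tumor_fraction * allelic_fraction
    return depth * expected_vaf


# Coverage and tumor content of the sequenced conditions described above.
CONDITIONS = {
    "I (sgDNA, 100% tumor, 200x)": (200, 1.00),
    "II (cfDNA, 15% tumor, 200x)": (200, 0.15),
    "III (cfDNA, 15% tumor, 1000x)": (1000, 0.15),
}

for name, (depth, tumor_fraction) in CONDITIONS.items():
    reads = expected_alt_reads(depth, tumor_fraction)
    print(f"{name}: ~{reads:.0f} expected variant-supporting reads")
```

Under these simplifying assumptions, a variant in the 15% sample is supported by about 15 reads at 200× and about 75 reads at 1000×, compared with about 100 reads in the undiluted sample at 200×.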

Figure 6

Diagram illustrating a cancer research study. Panel A displays an experimental flowchart involving cancer cell line sgDNA, normal plasma, and cancer/noncancer cfDNA, processed through the CNNeoPP model. Panel B shows fluorescence graphs comparing noncancer cfDNA to empirical cancer cfDNA. Panel C presents scatter plots with UMAP dimensions, highlighting dataset groupings. Panel D includes a Venn diagram and table detailing predicted neoantigens, SNV counts, and concordance percentages for four sample groups.

Figure 6. Proof-of-concept study for neoantigen prediction from cfDNA. (A) Experimental design for evaluating CNNeoPP performance from cfDNA. Cancer cell line-derived sheared gDNA (sgDNA) and non-cancer cfDNA were collected to generate an empirical contrived 15% cancer cfDNA sample, and all three were then sequenced at 200× coverage or additionally at 1000× coverage. Four conditions (I-IV) of samples were processed using CNNeoPP as annotated, including an in silico simulated 15% cancer cfDNA condition. (B) Fragmentation profiles of non-cancer cfDNA and empirical 15% cancer cfDNA samples. (C) UMAP visualization of 100% cancer sgDNA at 200× coverage and empirical 15% cancer cfDNA at 1000× and 200×, illustrating overlay patterns based on sequence features. Points are color-coded according to experimental conditions to highlight data distribution. (D) Summary of SNV and neoantigen predictions across different sample conditions. A Venn diagram (left) depicts the concordant prediction count across the other three conditions compared to the top 50 predicted neoantigens from condition I (100% cancer). The accompanying table (right) details total SNV counts, total predicted neoantigen counts, and the number of concordant neoantigens among the top 50 predicted from condition I, with the concordance percentage also reported.

UMAP visualization revealed similar clustering patterns between Sample I (tumor-only ground-truth scenario, red) and Sample II (contrived cfDNA, blue) (Figure 6C, left). As the sequencing depth increased in Sample III, a higher concentration of blue dots appeared in the center of the distribution, leading to greater overlap (purple dots) and a reduction in non-overlapping red dots (Figure 6C, right). Comparative analysis showed the in silico simulated 15% cancer sample (Sample IV) yielded a 16% (8 out of 50) overlap with “ground-truth” neoantigens as expected, whereas the empirical 15% cancer cfDNA sample (Sample II) demonstrated a higher 38% (19 out of 50) overlap and better clustering consistency with “ground-truth” (Figure 6D; Supplementary Figure S7). Notably, increasing sequencing depth from 200× to 1000× (Sample II vs. III) led to a 14% increase in SNV calls and predicted neoantigens, while the “ground-truth” neoantigen prediction increased by 21% (Figure 6D). These findings suggest that while neoantigen signals are weaker in cfDNA, this limitation may be partially compensated by increased sequencing depth, with CNNeoPP further amplifying this effect by identifying a higher proportion of “ground-truth” neoantigens beyond the expected increase in SNV detection.
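The concordance figures above amount to a set overlap between the top 50 predictions from Sample I and the predictions from another condition. The following minimal sketch assumes that peptides are matched by exact identifier. The function `concordance` and the placeholder identifiers are hypothetical, and the study may match on peptide-HLA pairs instead.

```python
def concordance(reference_top, candidate_predictions):
    """Return (shared count, percentage of the reference list recovered in another condition)."""
    reference = set(reference_top)
    shared = reference & set(candidate_predictions)
    return len(shared), 100.0 * len(shared) / len(reference)


# Placeholder identifiers standing in for the top 50 "ground-truth" neoantigens from Sample I.
reference_top50 = [f"PEP{i:03d}" for i in range(50)]

# Hypothetical condition that recovers 19 reference peptides plus unrelated predictions.
other_condition = reference_top50[:19] + [f"OTHER{i:03d}" for i in range(30)]

shared_count, shared_pct = concordance(reference_top50, other_condition)
print(f"{shared_count} of {len(reference_top50)} shared ({shared_pct:.0f}%)")
```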

3.7 Establishment of CNNeoDB: a neoantigen database

To facilitate neoantigen research and predictive model development, a publicly accessible tumor neoantigen database, CNNeoDB (http://www.cnneodb.cn/), was released. CNNeoDB integrates the multi-source datasets used or reported in the present study and allows users to search and download peptide entries based on criteria such as HLA allele, tumor type, and immunogenicity classification results. Researchers can contribute to CNNeoDB by submitting novel neoantigen candidates, including peptide sequences, genomic coordinates, and validation evidence.

4 Discussion

A major challenge in advancing personalized cancer immunotherapy is the identification of individualized immunogenic neoepitopes, which remains a critical barrier to translating clinical studies into effective treatments. In this study, we developed and validated a comprehensive neoantigen prediction pipeline, CNNeoPP, that outperforms existing tools in distinguishing immunogenic neoantigens capable of eliciting T cell responses from non-immunogenic peptides in individual patients. Furthermore, our proof-of-concept study revealed that while higher sequencing depth can partially recover suppressed neoantigen signals in cfDNA, additional strategies may be required for optimal detection.

Various physicochemical properties were found to differ significantly between immunogenic and non-immunogenic peptide sequences. First, we demonstrated that 9-mers were the most abundant length in immunogenic peptides (59%), followed by 10-mers (28%), whereas the distribution was more comparable (49% vs. 46%) in non-immunogenic peptides. This result aligns with previous findings that 9-mer peptide binders are the most common binding peptides compared to 10-, 11-, and 8-mers across various cancer types and HLA alleles (22, 23). Additionally, our analysis highlighted unique residue preferences in immunogenic peptides, where negatively charged, hydrophilic residues (E, D) were enriched at positions P2, P3, and P4, and positively charged residues (K, R) at position P1. In contrast, non-immunogenic peptides exhibited a more generalized distribution of hydrophobic residues (e.g., L, V) across all positions. While some aspects of our observations are novel, others align with previous studies indicating that neoantigen positions P2 and P3 tend to contain fewer hydrophobic residues (24), and that negatively charged residues at P4 enhance binding affinity with TCR (25, 26). Furthermore, our data indicate that explicitly computed features are key contributors to immunogenicity classification, in agreement with previous findings (27, 28). Given the demonstrated importance of both the raw peptide sequence and explicitly computed features, both were incorporated as input data during CNNeo training and are expected to enhance its performance. It should be noted that the non-immunogenic peptide dataset used in this study is primarily derived from a limited number of source studies, which may introduce biases in peptide length selection and HLA allele representation. In particular, single-epitope validation studies tend to preferentially report minimal 9-mer peptides, whereas more systematic screening approaches often include peptides spanning broader length ranges and HLA backgrounds.
To mitigate the potential impact of these biases, we performed explicit analyses of peptide length and HLA allele distributions and observed stable model performance across peptide length strata and held-out HLA alleles, indicating that CNNeo’s predictive accuracy is not dominated by length or HLA frequency artifacts. Nevertheless, expanding the diversity and scale of experimentally validated non-immunogenic neoepitopes will be critical for further improving model generalizability and reducing residual dataset-specific biases. Future studies incorporating larger and more systematically generated negative datasets will help refine and further validate the robustness of CNNeo.

While various computational pipelines for neoantigen prediction have been developed, most focus predominantly on peptide-MHC binding affinity predictions. Recent approaches have started to incorporate additional features such as raw peptide sequences and HLA alleles to improve assessment of neoantigen immunogenicity (29, 30). Many existing neoantigen prediction tools and peptide analysis models still employ pre-deep-learning encoding methods, such as One-hot encoding (orthogonal encoding) (31). The key innovation of CNNeo lies in its specialized architecture, which uniquely integrates peptide sequence data using advanced LLM-derived embedding techniques with structured immunogenicity features, enabling a more comprehensive and accurate neoantigen prediction framework. BioBERT, a pre-trained biomedical LLM, was used in this study as a fixed Transformer-based embedding extractor for peptide sequences (32), while TF-IDF, a traditional NLP embedding method, quantifies the importance of sequence elements (33). Though rarely applied to peptide analysis for neoantigen prediction, their integration with modern machine learning models offers a novel approach (34). Convolutional layers in modern deep learning models have been extensively used to predict protein function from amino acid sequences (35). FCNNs utilize dense layers to learn high-dimensional representations and improve classification tasks by consolidating both sequence-derived and additional structured data (36). CNNeo leverages NLP-driven sequence processing, CNN-based feature extraction, and FCNN-based classification to synthesize both unstructured sequence data and structured features, maximizing neoantigen prediction accuracy. In addition, the size of the training dataset used in this study is relatively limited compared to typical deep learning applications. 
To address potential concerns regarding data efficiency, we compared our neural-network framework with a classical machine-learning approach (random forest). This classical method consistently showed inferior predictive performance compared with CNNeo, as reflected in the benchmarking results (Figure 3A). This observation suggests that, even under limited sample size conditions, the neural-network architecture employed in CNNeo provides superior modeling capacity for neoantigen immunogenicity prediction.
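To make the idea of TF-IDF encoding of peptide sequences more concrete, the sketch below treats overlapping k-mers as "words" and each peptide as a "document". The tokenization (k = 2), the smoothing formula, and the function names `kmers` and `tfidf_encode` are assumptions chosen for illustration. The TF-IDF configuration actually used in CNNeo is not specified in this section.

```python
import math
from collections import Counter


def kmers(peptide, k=2):
    """Split a peptide into overlapping k-mers, treated here as 'words'."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]


def tfidf_encode(peptides, k=2):
    """Return the sorted k-mer vocabulary and one TF-IDF vector per peptide."""
    docs = [Counter(kmers(p, k)) for p in peptides]
    vocab = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: number of peptides that contain each k-mer.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    # Smoothed inverse document frequency, so that no term gets a zero or infinite weight.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1.0 for term in vocab}
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1  # guard against peptides shorter than k
        vectors.append([doc[term] / total * idf[term] for term in vocab])
    return vocab, vectors


# Toy peptides (illustrative only) that differ at the last residue.
vocab, vectors = tfidf_encode(["SIINFEKL", "SIINFEKV"])
print(vocab)
print([round(value, 3) for value in vectors[0]])
```

In this toy case the k-mers unique to one peptide ("KL", "KV") receive a higher weight than the k-mers shared by both, which is the property that lets TF-IDF emphasize distinguishing sequence elements.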

Our study demonstrates the robustness of CNNeo and CNNeoPP through development with a diverse training set and comprehensive validations across independent datasets and experimental assays. CNNeoPP was developed using multiple curated datasets, ensuring broad applicability across different cancer types and patient HLA backgrounds (18, 37). CNNeoPP consistently outperforms existing tools (16, 18–20), maintaining high predictive accuracy across a series of ranking thresholds (top 10, 20, and 50), highlighting its reliability in prioritizing immunogenic candidates. Both model- and pipeline-level validation confirm its superiority, with CNNeo’s deep learning framework outperforming state-of-the-art prediction tools. In the TESLA validation dataset (1), CNNeoPP underscored its performance by identifying 23.53% (8 out of 34) of the immunogenic peptides, while only requiring a total candidate selection rate of 9.40% (50 out of 532) within the dataset. We note that this level of performance falls within the range reported by other TESLA-based predictors, reflecting the stringent nature of the TESLA benchmark based on experimentally validated T cell responses.

Beyond computational validation, IFN-γ ELISpot assays confirmed that neoantigens uniquely predicted by CNNeoPP exhibited a higher proportion of strong T cell responses compared to those co-predicted by CNNeoPP and other tools. In the TESLA dataset, the positive rate among the top 50 candidates was 16% (8/50) across 5 patients. In contrast, in the experimental validation (3 patients), the positive rate was 58% (29/50) using predictions from CNNeoPP and other tools, based on a standard ELISpot threshold of 8 spots per well (21). This rate further increased to 70% (19/27) for peptides exclusively predicted by CNNeoPP. When applying a more stringent threshold of 81 spot-forming cells per 4 × 10⁴ PBMCs (38), the positive rate remained at 11.1% for CNNeoPP predictions, exceeding the 8.7% observed for peptides predicted by CNNeoPP and other tools combined. Notably, this 11.1% positive rate closely aligns with findings from a recent clinical trial on individualized mRNA vaccines encoding neoantigens, in which 11% of tested neoantigens induced a high-magnitude T cell response (39). However, this comparison is intended only as contextual reference, as clinical vaccine response percentages are influenced by multiple biological and therapeutic factors beyond peptide-level immunogenicity prediction. During the design and development of CNNeo, immunogenicity features relevant to TCR recognition, such as “Hydrophobicity of TCR contact residues”, were systematically integrated to enhance prediction accuracy, contributing to the high experimental validation positive rate observed here (1, 40–42). These findings underscore CNNeoPP’s ability to integrate features reflective of T cell recognition, establishing it as a powerful framework for guiding clinically relevant neoantigen selection.
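The ELISpot percentages quoted above can be cross-checked with a short calculation. The per-group counts below are back-calculated from the reported percentages and group sizes (27 and 23 peptides), which is an assumption of this sketch. Supplementary Table S8 remains the authoritative source. The helper `rates` is hypothetical.

```python
# Counts back-calculated from the reported percentages (an assumption of this sketch).
GROUPS = {
    "CNNeoPP-only": {"n": 27, "weak": 16, "strong": 3},
    "co-predicted": {"n": 23, "weak": 8, "strong": 2},
}


def rates(group):
    """Return weak, strong, and overall positive rates in percent, rounded to one decimal."""
    n = group["n"]
    weak = round(100.0 * group["weak"] / n, 1)
    strong = round(100.0 * group["strong"] / n, 1)
    overall = round(100.0 * (group["weak"] + group["strong"]) / n, 1)
    return weak, strong, overall


total_positive = sum(g["weak"] + g["strong"] for g in GROUPS.values())
total_tested = sum(g["n"] for g in GROUPS.values())

for name, group in GROUPS.items():
    print(name, rates(group))
print(f"overall: {total_positive}/{total_tested} positive")
```

With these counts, the two groups sum to the 29 positive peptides out of 50 reported for the full panel, so the group-level and overall figures are mutually consistent.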

Identifying neoantigens from plasma cfDNA offers a non-invasive, real-time approach to track tumor evolution and guide personalized immunotherapy and immune monitoring. High-depth sequencing is essential to improve sensitivity due to the significantly lower tumor content in cfDNA. Typically, only a small subset of somatic mutations (around 1-2%) gives rise to neoantigens capable of eliciting T-cell responses (43–46). Therefore, our proof-of-concept study evaluated the feasibility of CNNeoPP in detecting highly ranked SNVs (top 0.64%, 50/7792) from a cell line-contrived sgDNA sample. Our results demonstrate that higher sequencing depth can partially recover suppressed neoantigen signals in cfDNA as identified by CNNeoPP. Moreover, our findings indicate that when analyzing high-depth samples, CNNeoPP further enhances this effect by prioritizing top-ranked neoantigens beyond the expected increase in SNV detection. Together, these results provide proof-of-concept evidence supporting the use of CNNeoPP for cfDNA-based neoantigen prediction. However, further optimization of library preparation is needed to maximize sensitivity and ensure robust detection of circulating neoantigens in liquid biopsy applications.

5 Conclusion

This study presents CNNeo and CNNeoPP, a deep learning-based model and computational pipeline that significantly enhance neoantigen prediction accuracy. Through robust validation in independent patient cohorts, CNNeoPP outperformed existing tools in identifying immunogenic neoantigens, with experimental results confirming higher T-cell response rates. Furthermore, our proof-of-concept study established the feasibility of cfDNA-based neoantigen prediction, supporting its potential for non-invasive tumor monitoring and personalized immunotherapy applications.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Ethics statement

The studies involving humans were approved by the Medical Ethics Review Committee of Northwest University. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.

Author contributions

YCa: Conceptualization, Software, Writing – original draft, Investigation, Visualization, Validation, Methodology, Formal analysis, Writing – review & editing, Data curation, Supervision. RC: Formal analysis, Supervision, Data curation, Methodology, Validation, Visualization, Software, Writing – original draft, Writing – review & editing, Investigation. MS: Investigation, Data curation, Writing – review & editing. LW: Validation, Methodology, Visualization, Writing – review & editing. ZH: Investigation, Writing – review & editing. DY: Investigation, Writing – review & editing. SZ: Investigation, Writing – review & editing. SG: Resources, Writing – review & editing. SH: Visualization, Writing – review & editing. LB: Supervision, Writing – review & editing. YLv: Supervision, Writing – review & editing. YCu: Supervision, Writing – review & editing. XZ: Funding acquisition, Conceptualization, Project administration, Resources, Visualization, Formal analysis, Validation, Supervision, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. The primary research activities in this study were supported by the National Natural Science Foundation of China (Grant Nos. 81672627 and 82071863). Additional funding sources provided support for contributing authors and related academic training. These include the Youth Training Program grant funded by the Xi’an Municipal Health Commission (Grant No. 2024qn02) and the Global Learning & Academic Research Institution for Master’s/PhD Students and Postdocs (LAMP) Program of the National Research Foundation of Korea (NRF), funded by the Ministry of Education (Grant No. RS-2024-00443714).

Acknowledgments

We sincerely acknowledge and express our gratitude to the patients and donors for their invaluable contributions of samples and data. Their generosity has been instrumental in advancing research on neoantigen prediction and personalized cancer immunotherapy.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fimmu.2026.1722117/full#supplementary-material

References

1. Wells DK, van Buuren MM, Dang KK, Hubbard-Lucey VM, Sheehan KCF, Campbell KM, et al. Key parameters of tumor epitope immunogenicity revealed through a consortium approach improve neoantigen prediction. Cell. (2020) 183:818–34.e13. doi: 10.1016/j.cell.2020.09.015

2. Schumacher TN and Schreiber RD. Neoantigens in cancer immunotherapy. Sci (New York NY). (2015) 348:69–74. doi: 10.1126/science.aaa4971

3. Xie N, Shen G, Gao W, Huang Z, Huang C, and Fu L. Neoantigens: promising targets for cancer therapy. Signal Transduct Target Ther. (2023) 8:9. doi: 10.1038/s41392-022-01270-x

4. Blass E and Ott PA. Advances in the development of personalized neoantigen-based therapeutic cancer vaccines. Nat Rev Clin Oncol. (2021) 18:215–29. doi: 10.1038/s41571-020-00460-2

5. Peng M, Mo Y, Wang Y, Wu P, Zhang Y, Xiong F, et al. Neoantigen vaccine: an emerging tumor immunotherapy. Mol Cancer. (2019) 18:128. doi: 10.1186/s12943-019-1055-6

6. Zhou JG, Ding Z, Shi H, and Cheng M. Editorial: the clinical application of neoantigens. Front Immunol. (2022) 13:842633. doi: 10.3389/fimmu.2022.842633

7. Chen F, Zou Z, Du J, Su S, Shao J, Meng F, et al. Neoantigen identification strategies enable personalized immunotherapy in refractory solid tumors. J Clin Invest. (2019) 129:2056–70. doi: 10.1172/jci99538

8. Jia Q, Chiu L, Wu S, Bai J, Peng L, Zheng L, et al. Tracking neoantigens by personalized circulating tumor DNA sequencing during checkpoint blockade immunotherapy in non-small cell lung cancer. Adv Sci. (2020) 7:1903410. doi: 10.1002/advs.201903410

9. Shi Y, Jing B, and Xi R. Comprehensive analysis of neoantigens derived from structural variation across whole genomes from 2528 tumors. Genome Biol. (2023) 24:169. doi: 10.1186/s13059-023-03005-9

10. Cai Y, Chen R, Gao S, Li W, Liu Y, Su G, et al. Artificial intelligence applied in neoantigen identification facilitates personalized cancer immunotherapy. Front Oncol. (2022) 12:1054231. doi: 10.3389/fonc.2022.1054231

11. Jurtz V, Paul S, Andreatta M, Marcatili P, Peters B, and Nielsen M. Netmhcpan-4.0: improved peptide-mhc class I interaction predictions integrating eluted ligand and peptide binding affinity data. J Immunol. (2017) 199:3360–8. doi: 10.4049/jimmunol.1700893

12. Kalaora S, Wolf Y, Feferman T, Barnea E, Greenstein E, Reshef D, et al. Combined analysis of antigen presentation and T-cell recognition reveals restricted immune responses in melanoma. Cancer Discov. (2018) 8:1366–75. doi: 10.1158/2159-8290.cd-17-1418

13. Pereira B, Chen CT, Goyal L, Walmsley C, Pinto CJ, Baiev I, et al. Cell-free DNA captures tumor heterogeneity and driver alterations in rapid autopsies with pre-treated metastatic cancer. Nat Commun. (2021) 12:3199. doi: 10.1038/s41467-021-23394-4

14. Zhang X, Chen R, Huo Z, Li W, Jiang M, Su G, et al. Blood-based molecular and cellular biomarkers of early response to neoadjuvant pd-1 blockade in patients with non-small cell lung cancer. Cancer Cell Int. (2024) 24:225. doi: 10.1186/s12935-024-03412-3

15. Ding Z, Li Q, Zhang R, Xie L, Shu Y, Gao S, et al. Personalized neoantigen pulsed dendritic cell vaccine for advanced lung cancer. Signal Transduct Target Ther. (2021) 6:26. doi: 10.1038/s41392-020-00448-5

16. Diao K, Chen J, Wu T, Wang X, Wang G, Sun X, et al. Seq2neo: A comprehensive pipeline for cancer neoantigen immunogenicity prediction. Int J Mol Sci. (2022) 23:11624. doi: 10.3390/ijms231911624

17. Blagus R and Lusa L. Smote for high-dimensional class-imbalanced data. BMC Bioinf. (2013) 14:106. doi: 10.1186/1471-2105-14-106

18. Wu T, Chen J, Diao K, Wang G, Wang J, Yao H, et al. Neodb: A comprehensive neoantigen database and discovery platform for cancer immunotherapy. Database (Oxford). (2023) 2023:baad041. doi: 10.1093/database/baad041

19. Li G, Iyer B, Prasath VBS, Ni Y, and Salomonis N. Deepimmuno: deep learning-empowered prediction and generation of immunogenic peptides for T-cell immunity. Brief Bioinform. (2021) 22:bbab160. doi: 10.1093/bib/bbab160

20. Wang G, Wan H, Jian X, Li Y, Ouyang J, Tan X, et al. Ineo-epp: A novel T-cell hla class-I immunogenicity or neoantigenic epitope prediction method based on sequence-related amino acid features. BioMed Res Int. (2020) 2020:5798356. doi: 10.1155/2020/5798356

21. Shou J, Mo F, Zhang S, Lu L, Han N, Liu L, et al. Combination treatment of radiofrequency ablation and peptide neoantigen vaccination: promising modality for future cancer immunotherapy. Front Immunol. (2022) 13:1000681. doi: 10.3389/fimmu.2022.1000681

22. Müller M, Huber F, Arnaud M, Kraemer AI, Altimiras ER, Michaux J, et al. Machine learning methods and harmonized datasets improve immunogenic neoantigen prediction. Immunity. (2023) 56:2650–63.e6. doi: 10.1016/j.immuni.2023.09.002

23. Teku GN and Vihinen M. Pan-cancer analysis of neoepitopes. Sci Rep. (2018) 8:12735. doi: 10.1038/s41598-018-30724-y

24. Riley TP, Keller GLJ, Smith AR, Davancaze LM, Arbuiso AG, Devlin JR, et al. Structure based prediction of neoantigen immunogenicity. Front Immunol. (2019) 10:2047. doi: 10.3389/fimmu.2019.02047

25. Jackson KR, Antunes DA, Talukder AH, Maleki AR, Amagai K, Salmon A, et al. Charge-based interactions through peptide position 4 drive diversity of antigen presentation by human leukocyte antigen class I molecules. PNAS nexus. (2022) 1:pgac124. doi: 10.1093/pnasnexus/pgac124

26. Ma J, Ayres CM, Brambley CA, Chandran SS, Rosales TJ, Corcelli SA, et al. Dynamic allostery in the peptide/MHC complex enables TCR neoantigen selectivity. Res Square. (2024) 16:849. doi: 10.21203/rs.3.rs-4457195/v1

27. Pyke RM, Mellacheruvu D, Dea S, Abbott C, Zhang SV, Phillips NA, et al. Precision neoantigen discovery using large-scale immunopeptidomes and composite modeling of MHC peptide presentation. Mol Cell Proteomics: MCP. (2023) 22:100506. doi: 10.1016/j.mcpro.2023.100506

28. Stranzl T, Larsen MV, Lundegaard C, and Nielsen M. NetCTLpan: pan-specific MHC class I pathway epitope predictions. Immunogenetics. (2010) 62:357–68. doi: 10.1007/s00251-010-0441-4

29. De Mattos-Arruda L, Vazquez M, Finotello F, Lepore R, Porta E, Hundal J, et al. Neoantigen prediction and computational perspectives towards clinical benefit: recommendations from the ESMO Precision Medicine Working Group. Ann Oncol. (2020) 31:978–90. doi: 10.1016/j.annonc.2020.05.008

30. Nibeyro G, Baronetto V, Folco JI, Pastore P, Girotti MR, Prato L, et al. Unraveling tumor specific neoantigen immunogenicity prediction: A comprehensive analysis. Front Immunol. (2023) 14:1094236. doi: 10.3389/fimmu.2023.1094236

31. ElAbd H, Bromberg Y, Hoarfrost A, Lenz T, Franke A, and Wendorff M. Amino acid encoding for deep learning applications. BMC Bioinf. (2020) 21:235. doi: 10.1186/s12859-020-03546-x

32. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. (2020) 36:1234–40. doi: 10.1093/bioinformatics/btz682

33. Richard E and Reddy B. Text classification for clinical trial operations: evaluation and comparison of natural language processing techniques. Ther Innov Regul Sci. (2021) 55:447–53. doi: 10.1007/s43441-020-00236-x

34. Gelman S, Fahlberg SA, Heinzelman P, Romero PA, and Gitter A. Neural networks to learn protein sequence-function relationships from deep mutational scanning data. Proc Natl Acad Sci U.S.A. (2021) 118:e2104878118. doi: 10.1073/pnas.2104878118

35. Gligorijević V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun. (2021) 12:3168. doi: 10.1038/s41467-021-23303-9

36. Hu W, Guan L, and Li M. Prediction of DNA methylation based on multi-dimensional feature encoding and double convolutional fully connected convolutional neural network. PLoS Comput Biol. (2023) 19:e1011370. doi: 10.1371/journal.pcbi.1011370

37. Xia J, Bai P, Fan W, Li Q, Li Y, Wang D, et al. NEPdb: A database of T-cell experimentally-validated neoantigens and pan-cancer predicted neoepitopes for cancer immunotherapy. Front Immunol. (2021) 12:644637. doi: 10.3389/fimmu.2021.644637

38. Keskin DB, Anandappa AJ, Sun J, Tirosh I, Mathewson ND, Li S, et al. Neoantigen vaccine generates intratumoral T cell responses in phase Ib glioblastoma trial. Nature. (2019) 565:234–9. doi: 10.1038/s41586-018-0792-9

39. Rojas LA, Sethna Z, Soares KC, Olcese C, Pang N, Patterson E, et al. Personalized RNA neoantigen vaccines stimulate T cells in pancreatic cancer. Nature. (2023) 618:144–50. doi: 10.1038/s41586-023-06063-y

40. Kim S, Kim HS, Kim E, Lee MG, Shin EC, Paik S, et al. Neopepsee: accurate genome-level prediction of neoantigens by harnessing sequence and amino acid immunogenicity information. Ann Oncol. (2018) 29:1030–6. doi: 10.1093/annonc/mdy022

41. Zhou C, Wei Z, Zhang Z, Zhang B, Zhu C, Chen K, et al. pTuneos: prioritizing tumor neoantigens from next-generation sequencing data. Genome Med. (2019) 11:67. doi: 10.1186/s13073-019-0679-x

42. Gartner JJ, Parkhurst MR, Gros A, Tran E, Jafferji MS, Copeland A, et al. A machine learning model for ranking candidate HLA class I neoantigens based on known neoepitopes from multiple human tumor types. Nat Cancer. (2021) 2:563–74. doi: 10.1038/s43018-021-00197-6

43. Yadav M, Jhunjhunwala S, Phung QT, Lupardus P, Tanguay J, Bumbaca S, et al. Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing. Nature. (2014) 515:572–6. doi: 10.1038/nature14001

44. McGranahan N, Furness AJ, Rosenthal R, Ramskov S, Lyngaa R, Saini SK, et al. Clonal neoantigens elicit T cell immunoreactivity and sensitivity to immune checkpoint blockade. Sci (New York NY). (2016) 351:1463–9. doi: 10.1126/science.aaf1490

45. Tran E, Robbins PF, Lu YC, Prickett TD, Gartner JJ, Jia L, et al. T-cell transfer therapy targeting mutant KRAS in cancer. N Engl J Med. (2016) 375:2255–62. doi: 10.1056/NEJMoa1609279

46. Parkhurst MR, Robbins PF, Tran E, Prickett TD, Gartner JJ, Jia L, et al. Unique neoantigens arise from somatic mutations in patients with gastrointestinal cancers. Cancer Discov. (2019) 9:1022–35. doi: 10.1158/2159-8290.cd-18-1494

Keywords: cancer immunotherapy, computational pipeline, deep learning, liquid biopsy, neoantigen prediction

Citation: Cai Y, Chen R, Song M, Wang L, Huo Z, Yang D, Zhang S, Gao S, Hwang S, Bai L, Lv Y, Cui Y and Zhang X (2026) CNNeoPP: a large language model-enhanced deep learning pipeline for personalized neoantigen prediction and liquid biopsy applications. Front. Immunol. 17:1722117. doi: 10.3389/fimmu.2026.1722117

Received: 10 October 2025; Revised: 05 January 2026; Accepted: 15 January 2026; Published: 04 February 2026.

Edited by:

Don J Diamond, City of Hope National Medical Center, United States

Reviewed by:

Richard A. Schäfer, Northwestern University, United States
Bruno Fant, myNEO Therapeutics, Belgium

Copyright © 2026 Cai, Chen, Song, Wang, Huo, Yang, Zhang, Gao, Hwang, Bai, Lv, Cui and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yonggang Lv, lvygwytt@163.com; Yali Cui, yalicui@nwu.edu.cn; Xi Zhang, xzhangtx@gmail.com

^†These authors have contributed equally to this work and share first authorship

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.