Enhanced deep Convolutional Neural Network for SARS-CoV-2 variants classification

Awe, Olaitan I.; Obura, Hesborn; Ssemuyiga, Charles; Mudibo, Evans; Mwanga, Mike J.

doi:10.3389/frai.2025.1512003

ORIGINAL RESEARCH article

Front. Artif. Intell., 08 September 2025

Sec. Machine Learning and Artificial Intelligence

Volume 8 - 2025 | https://doi.org/10.3389/frai.2025.1512003

Enhanced deep Convolutional Neural Network for SARS-CoV-2 variants classification

Olaitan I. Awe^1,2^†

Hesborn Obura^3,4^†

Charles Ssemuyiga^5,6^*^†

Evans Mudibo^3,4,7^†

Mike J. Mwanga^3,4^†

¹African Society for Bioinformatics and Computational Biology, Cape Town, South Africa
²Department of Computer Science, Faculty of Science, University of Ibadan, Ibadan, Nigeria
³Department of Biochemistry and Biotechnology, School of Pure and Applied Science, Pwani University, Kilifi, Kenya
⁴Pwani University Biosciences Research Centre, Pwani University, Kilifi, Kenya
⁵PharmaQsar Bioinformatics Firm, Kampala, Uganda
⁶Department of Biological and Environmental Sciences, School of Natural Sciences, Kampala International University, Kampala, Uganda
⁷Centre for Geographic Medicine Research, Kenya Medical Research Institute Wellcome Trust Research Program, Kilifi, Kenya

Introduction: Rapid and scalable classification of SARS-CoV-2 genomes from spike-gene sequences can support real-time genomic surveillance in contexts where whole-genome data or high-end computing resources are limited.

Methods: We curated approximately 35,800 quality-filtered spike sequences spanning multiple clades and lineages and trained a hybrid CNN–BiLSTM model with standard regularization and class-imbalance handling. Model performance was benchmarked against Nextclade assignments and compared with classical machine-learning baselines.

Results: Across 10 experimental runs, the model achieved a mean training accuracy of 99.74% ± 0.11, a validation accuracy of 99.00% ± 0.00, and a test accuracy of 99.91% ± 0.03. In benchmarking against the molecular epidemiology tool Nextclade, our model demonstrated superior performance, correctly identifying 100% of Omicron sequences, compared to 34.95% achieved by Nextclade. Saliency and feature attribution analyses highlighted recurrent spike substitutions consistent with known variant-defining mutations, as well as additional uncharacterized motifs with potential biological relevance.

Discussion: These findings demonstrate that spike-only deep models can provide rapid and accurate clade or variant classification, while also yielding interpretable feature importance. Such models complement phylogenetic approaches in settings with constrained resources and enable efficient triage of samples for confirmatory whole-genome analysis, supporting more timely genomic surveillance.

1 Introduction

Taxonomic classification of pathogens is central to clinical diagnosis and outbreak surveillance. Traditional approaches such as PCR and metagenomic sequencing enable pathogen detection without prior knowledge (Li et al., 2021; Omar et al., 2024), and have become increasingly accessible due to reduced costs and improved tools. However, these methods remain computationally demanding and time-consuming, particularly for large-scale datasets, owing to incomplete reference databases and the growing microbial diversity (Menzel et al., 2016). This can delay timely identification during outbreaks and requires substantial bioinformatics expertise. Machine learning-based tools can identify and extract important sequence features for sequence classification in a computationally efficient manner. In a DNA sequence, these features will be the pattern of arrangement of nucleotides in a sequence that is/are unique for each pathogen. For this reason, machine learning methods have been applied in the identification and classification of pathogens in clinical samples (Abd-Alhalem et al., 2021; Liang et al., 2020; Hanson et al., 2024) as well as in complex diseases (Enejoh et al., 2025). For instance, VirFinder (Ren et al., 2017) models used k-mer frequency to classify viruses while VirSorter (Roux et al., 2015) uses a probabilistic model tool to predict viral sequences. Randhawa et al. (2020) proposed a supervised machine learning with a digital signal process model for genomic identification of COVID-19 virus signatures, important in the classification of SARS-CoV-2 (Table 1).

Table 1

Table 1. Shared spike protein mutations among major SARS-CoV-2 VOCs, including genomic coordinates (based on NC_045512.2), spike gene nucleotide positions, and their known phenotypic effects.

Deep learning (DL), a subfield of machine learning, has emerged as a powerful framework for automated feature extraction from biological sequences, particularly through architectures originally developed for natural language processing. By leveraging multiple non-linear transformations, DL models can learn hierarchical representations of nucleotide or amino acid sequences without the need for handcrafted features (Abd-Alhalem et al., 2021; Gunasekaran et al., 2021). These capabilities have been widely applied in genomics for tasks such as DNA classification, gene annotation, protein structure prediction, and viral detection. Notably, several DL-based tools such as VirHunter (Sukhorukov et al., 2022), DeepVirFinder (Ren et al., 2017), and ViralMiner (Tampuu et al., 2019) have demonstrated high accuracy in identifying viral sequences directly from raw metagenomic data. RNA-seq datasets provide an avenue for the detection of the expression of genes in a sample (Alaya et al., 2024; Ben Aribi et al., 2024, 2025; El Abed et al., 2023; Ather et al., 2018; Die et al., 2019). PACIFIC, another DL framework, was developed to detect co-infections of respiratory viruses including SARS-CoV-2 from RNA-seq datasets (Charles et al., 2023; Elbasir et al., 2023). These models offer substantial advantages over alignment-dependent methods, particularly in their scalability and ability to detect novel or divergent viral genomes (Table 2).

Table 2

Table 2. A summary of the studies compared to the present study.

SARS-CoV-2 VOCs are primarily distinguished by recurrent mutations in the spike (S) gene, which encodes the viral protein responsible for mediating host cell entry via the Angiotensin-Converting Enzyme 2 (ACE2) receptor (Nyamari et al., 2023; Nzungize et al., 2022). These mutations not only influence viral transmissibility and immune escape but also serve as robust genomic signatures for lineage classification (Awe et al., 2023; Harvey et al., 2021; Korber et al., 2020; Obura et al., 2022; Oluwagbemi and Awe, 2018). For instance, key substitutions in the receptor-binding domain (RBD) such as N501Y, E484K, and K417N/T alter ACE2 binding affinity and antibody recognition, thereby shaping the phenotypic profiles of Alpha, Beta, Gamma, and Omicron lineages (Cao et al., 2021; Harvey et al., 2021). Meanwhile, mutations such as L452R and P681R in Delta contribute to enhanced infectivity and proteolytic activation (Meng et al., 2022). Although some of these markers are unique to specific variants, several have emerged independently across lineages through convergent evolution, complicating classification tasks (Table 1) (Edara et al., 2021). Notably, D614G became globally dominant early in the pandemic and is now present across all major VOCs, reducing its discriminatory power (Korber et al., 2020). Similarly, the N501Y substitution, once a hallmark of Alpha, also appears in Beta, Gamma, and Omicron (Harvey et al., 2021). These shared features can blur the mutational boundaries between variants, particularly in models relying solely on sequence-based inputs. Deep learning models, however, can implicitly learn to differentiate subtle mutation contexts and adjacent sequence motifs. Several studies have demonstrated that the spike gene alone is sufficient for robust lineage classification due to its high mutational density and selective pressure (Harvey et al., 2021). Furthermore, recombination signals that manifest in spike often reflect broader genome-wide signatures, making spike-based models a reliable first-tier diagnostic filter.

1.1 Existing characterization methods for SARS-CoV-2 spike sequence classification

Traditional SARS-CoV-2 variant classification approaches typically rely on structured feature encodings such as k-mer frequencies or one-hot encoding (OHE), followed by classical machine learning algorithms which typically require explicit feature engineering and perform well with structured inputs such as k-mer or one-hot encoded sequences, they include; Support Vector Machines (SVM), Decision Trees (DT), Random Forests (RF), Logistic Regression (LR), k-Nearest Neighbors (KNN), and Naive Bayes (NB) (Kuzmin et al., 2020; Nicora et al., 2022). The k-mer encoding technique transforms biological sequences into fixed-length numerical vectors by segmenting nucleotide or amino acid strings into overlapping sub-sequences of length k. This alignment-free approach captures local motif frequency information and has been widely applied in genomic classification tasks, particularly when sequence alignment is computationally infeasible. For example, Kuzmin et al. (2020) applied Principal Component Analysis (PCA) and OHE to spike sequences, achieving >98% accuracy using SVM, LR, and DT models. Nicora et al. (2022) employed amino acid k-mers in a one-class classification system that detected emerging variants with >98% precision. Ali et al. (2021) introduced a kernel approximation technique for k-mer-based feature vectors, showing improved accuracy even on limited training data. Other notable methods include Spike2Vec, which computes amino acid embeddings using information gain to highlight biologically informative residues (Ali and Patterson, 2021), and PWM2Vec, a position-weight matrix-based embedding for capturing positional nucleotide importance (Ali et al., 2022). These approaches support classical classifiers but may suffer performance loss on complex, full-length genomic inputs. Further innovations have combined sequence encoding with neural network architectures. Ali et al. (2023) presented a simple neural network using k-mers as input, demonstrating strong classification by geographic lineage and identifying informative residues. Nan et al. (2022) proposed a fine-grained adaptation of a sequence-based DL model that showed clear inter-lineage and intra-lineage clustering in Omicron sequences. Similarly, Li et al. (2022) developed a 3D Convolutional Neural Network (CNN) that used dinucleotide composition to differentiate Beta, Delta, and Omicron lineages with high predictive power (Table 2).

OHE leads to high-dimensional sparse vectors, which strain computational resources and are prone to overfitting, while k-mer-based encodings neglect the positional order and long-distance dependencies between biologically significant motifs. These limitations are further amplified in the context of recombinant or drifted variants, where subtle shifts in mutation patterns require models capable of deeper contextual understanding. To overcome these challenges, we designed a hybrid deep learning architecture that combines CNNs with BiLSTMs, trained end-to-end on aligned, full-length spike DNA sequences. CNNs are deep learning architectures capable of automatically extracting hierarchical spatial features from structured data, they have proven effective for detecting spatially localized sequence features such as conserved motifs or mutation clusters through trainable filters that enable hierarchical pattern learning without manual feature engineering (Gunasekaran et al., 2021; Zeng et al., 2016). This makes them well-suited for extracting discriminative subsequence patterns embedded within high-throughput genomic data. Following the convolutional layers, we integrated BiLSTMs to capture long-range dependencies by processing the input sequence in both forward and backward directions. Unlike conventional Recurrent neural networks (RNNs), BiLSTMs, a class of RNNs utilize gated units that control the flow of information across time steps, preserving critical contextual relationships among nucleotide positions (Graves, 2012; Hochreiter and Schmidhuber, 1997). This is especially relevant in genomics, where the impact of mutations often depends on their broader sequence environment. By aligning and encoding complete spike sequences, our model can learn variant-specific signatures in both conserved and variable regions. In doing so, it mitigates common pitfalls of traditional ML models, offering a scalable, accurate, and biologically informed solution for genomic variant classification.

2 Materials and methods

2.1 Code availability

All code, model training scripts, and preprocessing pipelines used in this study are publicly available in the GitHub repository (https://github.com/omicscodeathon/ml_sarscov2) to ensure full transparency and reproducibility of the results.

2.2 Python libraries and computational tools

All analyses and model development were implemented in Python (version: 3.12.7), leveraging a range of established open-source libraries optimized for deep learning, scientific computing, and bioinformatics workflows. Model training and evaluation were conducted using TensorFlow (version: 2.19.0) and Keras (version: 3.10.0), which provide a high-level API for defining and training deep neural networks. The CNN–BiLSTM architecture was constructed using the Keras Sequential API, with additional functionality for dropout, batch normalization, and L2 regularization to enhance generalization performance (Gulli and Pal, 2017; Ramasubramanian and Singh, 2019; Sergeev and Del Balso, 2018). For numerical operations and data manipulation, NumPy (version: 1.26.4) and Pandas (version: 2.2.2) were employed. NumPy was used for efficient matrix operations on encoded sequence arrays, while Pandas facilitated structured handling of metadata such as variant labels and FASTA sequence records (Gupta and Bagchi, 2024). Matplotlib (version: 3.9.2) and Seaborn (version: 0.13.2) were used extensively for data visualization, including model training curves, confusion matrices, and distribution plots (Lemenkova, 2020). Evaluation metrics including precision, recall, F1-score, accuracy, and log loss were computed using functions from Scikit-learn (version: 1.6.1), generation of classification reports, and confusion matrices with multi-class support (Kramer, 2016). To streamline file handling, OS was used for automated directory creation and result storage. This modular and transparent Python-based implementation ensures reproducibility, facilitates extension to other genomic datasets, and supports integration into scalable bioinformatics pipelines (Pawar et al., 2024; Coetzer et al., 2025). Model training was conducted on a high-performance Intel Xeon Linux system equipped with 18 cores and 512 GB RAM. TensorFlow Lite was utilized to convert the trained model into a lightweight, quantized format suitable for deployment on resource-constrained devices (Jacob et al., 2018) (Figure 1).

Figure 1

Flowchart showing the process of analyzing sequences. Starts with

Figure 1. Data preprocessing workflow: feature extraction, feature encoding, and classifications. The first step is to input raw sequences for feature extraction and feature encoding using the one-hot encoding technique. Then, the one-hot encoded sequences are fed to the CNN model for classification analysis.

2.3 Dataset collection and spike gene extraction

To construct a variant classification model, we downloaded >26,000 representative, high-quality SARS-CoV-2 whole-genome sequences from the GISAID database (https://www.gisaid.org/) using the following inclusion criteria: “complete genome,” “high coverage,” “collection date complete,” and sampling dates between 01/01/2021 and 08/04/2022. Sequences were selected with the representation of pre-defined geographical regions (Africa, Asia, Oceania, S. America, N. America) to ensure global representation and reduce regional sampling bias, thereby enhancing the model’s generalizability across diverse SARS-CoV-2 variant distributions. More diverse validation and omicron training Data of 10,000 sequences was downloaded from NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/). Complete SARS-CoV-2 genome sequences were downloaded and compiled into a local nucleotide BLAST database using the makeblastdb utility from the BLAST+ suite (Camacho et al., 2009). To extract the spike gene from each genome, we used the spike gene sequence from the Wuhan reference genome (NC_045512.2 / Wuhan-Hu-1) as a query in a local BLASTn search. To ensure high specificity, each BLASTn search was restricted to a single high-scoring segment pair (HSP) per subject sequence, and the number of target sequences was set to match the size of the database. This alignment-based extraction strategy enabled high-confidence localization of spike regions across diverse genomic backgrounds. Results were output in tabular format, and genomic coordinates triplets (sequence ID, start, stop) were programmatically extracted using a custom Python script. Only alignments with lengths >3,000 bp were retained to ensure near-complete spike gene coverage, consistent with the expected gene length (~3,800 bp) (Zhou et al., 2020). Spike gene segments were then extracted from the alignment coordinate sets and extracted spike gene sequences were then aligned using MAFFT v7.475 (Katoh et al., 2019). This alignment step standardized sequence length and preserved homologous nucleotide positions across samples (Training and validation), ensuring compatibility with downstream convolutional models. The distribution of training and validation data is shown in Figure 2.

Figure 2

Bar charts comparing training (blue) and validation (green) sequence distribution by variant. Training data: Omicron 6362, Gamma 5597, Alpha 5511, Delta 5495, Beta 4271. Validation data: Alpha 2317, Gamma 2016, Delta 1981, Beta 1688, Omicron 583. Totals are 27,236 for training and 8,585 for validation sequences.

Figure 2. Bar plots showing the distribution of SARS-CoV-2 variant sequences in the modeling (a) and validation (b) datasets. Each bar represents the number of sequences associated with a specific variant. This distribution highlights the class balance across datasets used for model training and evaluation.

2.4 Feature construction and dataset preparation

We performed systematic preprocessing to convert aligned spike gene sequences and labels into a supervised learning format compatible with deep learning. To ensure full reproducibility, the aligned FASTA files were parsed into structured Pandas DataFrames. Headers were preformatted to embed SARS-CoV-2 variant labels (Alpha, Beta, Gamma, Delta, Omicron), which were extracted and mapped to integer class labels from 0 to 4. All nucleotide sequences were capitalized to maintain consistency in encoding.

2.4.1 One-hot encoding of nucleotide sequences

We implemented a bespoke one-hot encoding routine to convert each nucleotide in the spike gene into a binary matrix suitable for input into a Convolutional Neural Network, a widely adopted approach in genomic deep learning (Zeng et al., 2016). The encoding scheme employed was A = [1, 0, 0, 0], C = [0, 1, 0, 0], G = [0, 0, 1, 0], T = [0, 0, 0, 1], and Ambiguous bases (e.g., N, R, Y) and gaps were considered non-informative and assigned [0, 0, 0, 0]. This masking strategy reflects the uncertainty inherent in these positions and avoids assigning fractional values that could introduce noise and reduce model interpretability (Juan et al., 2017). Insertions were treated as valid bases, whereas deletions were treated similarly to ambiguous characters and ignored during encoding (Zeng et al., 2016). The resulting sequences were transformed into a 3D NumPy array with dimensions corresponding to the number of samples, aligned nucleotide positions, and four binary channels representing the DNA bases (A, C, G, T) (Juan et al., 2017). This array served as the input feature matrix for model training. The corresponding variant labels were extracted and mapped to integer values/labels ranging from 0 to 4 (Gamma = 0, Delta = 1, Beta = 2, Alpha = 3, Omicron = 4) which were used as the ground-truth target classes in the supervised classification framework (Figure 1).

2.5 Model architecture, training, and evaluation

2.5.1 CNN-based feature extraction

We used four sequential Conv1D layers with 54, 27, 14, and 7 filters, respectively, and kernel sizes progressively reduced from 6 to 2. The input layer receives a 3D tensor corresponding to one-hot encoded spike sequences of shape (samples, sequence length, 4), where the four binary channels represent A, C, G, and T nucleotides. ReLU activations (Fred Agarap, 2018) were applied after each convolution to introduce non-linearity and mitigate vanishing gradients, allowing the model to learn complex feature hierarchies. MaxPooling1D layers were interleaved after each convolutional block to downsample the feature maps, reduce spatial dimensions, and control overfitting. Dropout layers with rates of 0.5, 0.2, and 0.1 were also inserted between layers for regularization (Srivastava et al., 2014). Padding was set to “same” to preserve spatial dimensions and prevent boundary effects during convolution operations (Gholamalinezhad and Khosravi, 2020) (Table 3).

Table 3

Table 3. CNN–BiLSTM architecture for SARS-CoV-2 variant classification.

2.5.2 BiLSTM layer and dense network

The feature map output from the last convolutional block was passed to a BiLSTM layer (512 units) to learn contextual nucleotide dependencies in both forward and reverse directions, critical for variant-discriminative patterns that may span across the sequence (Graves, 2012; Hochreiter and Schmidhuber, 1997). A dropout layer (0.01) followed the BiLSTM to prevent the co-adaptation of neurons. The high-level features were then flattened and passed through six fully connected Dense layers (256, 128, 64, 32, 16, and 5 units, respectively). All intermediate layers used ReLU activation, while the output layer employed a softmax function to predict the probability distribution across the five SARS-CoV-2 variant classes. The full architecture consisted of 2.4 million trainable parameters. The CNN–BiLSTM framework had a softmax output for multiclass classification, enabling fine-grained discrimination among SARS-CoV-2 variants. The softmax layer transforms the final outputs into probability distributions over variant classes (Wang et al., 2018), enabling clear and interpretable multiclass predictions. Crucially, this architecture not only improved performance on closely related lineages, particularly those with shared or overlapping mutation profiles but also facilitated interpretability via saliency mapping.

2.5.3 Training configuration

The model was trained using the Adam optimizer (Kingma and Ba, 2014) with categorical cross-entropy as the loss function. One-hot encoded class labels were used to supervise training. The training dataset was split using stratified sampling (75% training, 25% test), and the validation set was held out to validate generalization. To enhance statistical reproducibility and consistency, we fixed random seeds across NumPy, TensorFlow, and Python environments (Bouthillier et al., 2021). Early stopping with patience of 8 epochs and a minimum delta of 0.0005 was applied to prevent overfitting and restore the best weights.

The model was trained for a maximum of 30 epochs using a batch size of 1,000. Epoch-wise accuracy and loss metrics were plotted for both training and validation datasets. We performed hyperparemeter tuning as the number of epochs increased, by adding more layers or removing some layers and while observing the model accuracy on model training. The model was trained and evaluated across 10 independent runs with different random initializations. All reported performance metrics represent the average values computed over these 10 runs to ensure robustness and mitigate variance due to stochastic processes during training.

2.5.4 Model evaluation metrics and interpretation

Model performance was comprehensively evaluated using a multi-faceted approach designed to assess both predictive accuracy and interpretability. First, accuracy and loss metrics were recorded across training epochs to monitor convergence and generalization behavior. To provide a more granular understanding of classification performance, a detailed classification report was generated, including per-class precision, recall, and F1-scores, alongside macro-averaged metrics. Class-wise discrimination ability was further visualized through confusion matrices to identify misclassification patterns among SARS-CoV-2 variants. In addition, receiver operating characteristic (ROC) curves were constructed for each class, and the macro-average area under the curve (AUC) was computed to quantify the model’s ability to distinguish between variant classes (Fawcett, 2006). Beyond accuracy, the reliability of the model’s predicted probabilities was assessed through confidence calibration analysis. We calculated the Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) using 10 equal-width confidence bins. A reliability diagram was plotted to illustrate the correspondence between predicted confidence and observed accuracy, and class-wise ECE values were reported to identify calibration discrepancies across variant types (Guo et al., 2017; Kull et al., 2019; Vaicenavicius et al., 2019). To enhance model interpretability, we applied gradient-based saliency map techniques which highlighted sequence positions with the strongest influence on classification decisions, revealing biologically meaningful nucleotide patterns and mutation hotspots. Reported saliency results were averaged over multiple runs to ensure robustness and reduce stochastic artifacts (Samek et al., 2021).

2.5.5 Quantization for model optimization

While the CNN–BiLSTM model exhibited strong classification performance, its original architecture comprised approximately 2.45 million parameters, resulting in a memory footprint of ~9.3 MB. Such resource demands may hinder deployment in memory-constrained environments such as portable diagnostic devices, edge AI systems, or embedded viral surveillance platforms. To mitigate this limitation and enhance model portability, we applied post-training quantization using TensorFlow Lite (TFLite), a widely adopted framework for neural network compression and cross-platform deployment (David et al., 2021). Quantization was performed by converting the trained model from full precision (32-bit floating-point) to reduced precision using 8-bit integer weights while preserving float32 outputs. We employed dynamic range quantization via the TFLite converter, which statically quantizes only the model weights, thereby minimizing the need for representative datasets and preserving inference throughput (Jacob et al., 2018). To ensure compatibility with the BiLSTM architecture, we enabled SELECT_TF_OPS alongside TFLITE_BUILTINS, allowing the converter to fallback to TensorFlow operations when needed. Additionally, the experimental tensor list-lowering option was disabled to avoid incompatibility during the conversion of recurrent components.

2.6 Model validation

An external validation dataset, composed of SARS-CoV-2 spike sequences distinct from those used during training, was preprocessed using the same alignment, one-hot encoding, and dimensionality normalization procedures. This validation dataset contained sequences from all five target variants and was used to assess the model’s generalization to unseen data. Predictions were made using the trained CNN-BiLSTM quantized model, and performance was quantified using overall accuracy, per-class precision, recall, F1-score, and confusion matrices. The model maintained consistent classification accuracy, demonstrating its ability to generalize to new variant profiles beyond the training distribution.

2.7 Leave-One-Variant-Out cross-validation

To evaluate the robustness and generalization capacity of our CNN–BiLSTM classifier, we implemented a Leave-One-Variant-Out (LOVO) cross-validation scheme. This strategy simulates real-world scenarios where a previously unobserved variant emerges in the population. In each of the five LOVO experiments, one VOC was entirely excluded from the training dataset and reserved for validation. The model was trained on the remaining four variants, tested, and then validated. In addition to standard evaluation metrics (accuracy, precision, recall, F1-score, and confusion matrix), we employed confidence thresholding to flag predictions with low model certainty as “Unknown.” This mechanism enhances open-set recognition, allowing the model to abstain from forced classification when variant signatures deviate substantially from those learned during training (Fort et al., 2021; Qu et al., 2024).

3 Results and discussion

3.1 Dataset composition and model training overview

The curated Training and validation dataset comprised a total of 27,236 and 8,585 aligned spike protein DNA sequences, respectively, spanning five major SARS-CoV-2 variants (Figure 2). The relatively low number of Omicron sequences from GSAID database was due to their limited availability in the selected timeframe and stringent filtering criteria, hence enriched with NCBI Virus sequences. For supervised training, the dataset was partitioned into 20,427sequences (75%) for training and 6,809 sequences (25%) for testing, following a stratified sampling strategy to preserve variant class proportions. The final CNN–BiLSTM architecture contained approximately 2.45 million trainable parameters, reflecting the model’s depth and capacity for hierarchical feature learning. Each training session using a batch size of 1,000 and 30 epochs required approximately 65 min to converge using the Adam optimizer and early stopping based on validation accuracy.

3.2 Post-training quantization results

The quantized model demonstrated a substantial reduction in size and parameter count, from 2.45 million parameters (9.33 MB) to 465,997 parameters (3.95 MB), corresponding to an ~81% compression. Despite this reduction, the quantized model retained virtually identical classification performance on the test set and external validation dataset. Furthermore, inference time per sequence was reduced by 32% on CPU when evaluated in batch mode. This efficiency gain positions the model for practical deployment in real-time genomic surveillance systems or field diagnostics. These results further validate the suitability of post-training quantization as an effective model compression strategy for SARS-CoV-2 variant classification.

3.3 Model testing

To train the model, the one-hot encoded spike protein sequences were transformed into a tensor matrix and passed through a deep CNN–BiLSTM architecture. The training was guided by the Adam optimizer, with categorical cross-entropy as the loss function. Model weights were updated using backpropagation based on prediction error. The training process showed consistent convergence across multiple runs. On average, the model completed training in 25.9 ± 1.6 epochs. Initial training and testing accuracies were 0.2257 ± 0.0000 and 0.2335 ± 0.0000, respectively, with corresponding losses of 1.6311 ± 0.0000 and 1.6060 ± 0.0000, indicating randomized predictions at initialization. By the end of training, the model achieved a final training accuracy of 0.9974 ± 0.0011 and a test accuracy of 0.9991 ± 0.0003, with losses reduced to 0.0099 ± 0.0015 and 0.0037 ± 0.0006, respectively. Figure 3 shows the variation in accuracy and loss during training and testing.

Figure 3

Line graph showing training and test accuracy and loss over 25 epochs. Train loss (blue) and test loss (red) decrease significantly, while train accuracy (green) and test accuracy (orange) increase, both stabilizing around 1.0.

Figure 3. Training and test performance of the CNN-BiLSTM model across epochs.

The ROC curve plots the true positive rate (sensitivity) against the false positive rate (specificity) at various classification thresholds, offering an interpretable measure of class separability. Across 10 independent training runs, the model consistently achieved near-perfect ROC-AUC values, indicating excellent generalization and class discrimination capabilities. The macro-average ROC-AUC was 0.9999 ± 0.00003, reflecting consistently high performance across all five variant classes. Individual variant-level AUC scores were equally strong: Alpha, Gamma, and Beta variants achieved perfect separability with AUC = 1.0000 ± 0.0000, while Delta and Omicron also demonstrated near-perfect scores of 0.9998 ± 0.00005, and 0.9998 ± 0.0002, respectively. These results confirm the model’s ability to robustly distinguish between SARS-CoV-2 variants using spike gene sequences. The small standard deviations across runs further demonstrate the stability and reproducibility of the architecture under different random initializations and training conditions. Visualizations of per-class ROC curves and macro-averaged curves are shown in Figure 4, which further illustrate the high separability achieved between variant classes.

Figure 4

ROC curves for SARS-CoV-2 variants: Gamma, Delta, Beta, Alpha, and Omicron. Each plot shows true positive rate versus false positive rate. Gamma, Beta, and Alpha have AUC values of 1.0000, indicating perfect classification. Delta and Omicron have AUC values of 0.9998. The ROC curve illustrates the trade-off between sensitivity and specificity, with AUC summarizing classification performance.

Figure 4. Receiver operating characteristic (ROC) curves for the CNN-BLSTM model across the five SARS-CoV-2 variant classes. (a) Gamma, (b) Delta, (c) Beta, (d) Alpha, and (e) Omicron. Each curve illustrates the trade-off between true positive rate (TPR) and false positive rate (FPR) for one variant, with all classes achieving near-perfect discrimination.

An AUC of 1.00 represents an ideal classifier with 100% sensitivity and 100% specificity, which suggests that the model was able to flawlessly distinguish between all variant classes in the validation dataset. This exceptional discriminative power affirms that the model successfully learned robust and lineage-specific genomic features, making it highly effective in variant identification. Such performance is rarely achieved in genomic sequence classification, especially across closely related variants like Beta and Gamma, which share critical spike protein mutations such as E484K and N501Y (Harvey et al., 2021). Moreover, the distinctiveness of the Omicron variant, characterized by numerous spike protein mutations including P681H, K417N, and T478K, was captured by the model. The near-vertical ROC curve paths toward the upper-left corner reflect that the false positive rate was essentially zero across all classes. These results further corroborate the outcomes from the confusion matrix, where misclassifications were minimal and predominantly confined to variants with overlapping mutational landscapes. Such high AUC scores may also suggest that the model avoided overfitting due to its integrated architecture (CNN-BiLSTM), which is well-suited for capturing both local sequence motifs and long-range dependencies.

The calibration analysis revealed that the model provides highly reliable confidence scores (Figure 5). This was visualized through a reliability diagram, which shows close alignment with the ideal calibration line. The expected calibration error (ECE) for the overall model was extremely low at 0.0009 ± 0.0003, indicating that the predicted probabilities closely match actual accuracies. Additionally, the maximum calibration error (MCE) was 0.1659 ± 0.0776, a tolerable deviation in low-confidence bins. Calibration was also evaluated per variant: Gamma (ECE = 0.0001 ± 0.0000), Delta (0.0006 ± 0.0001), Beta (0.0002 ± 0.0001), Alpha (0.0005 ± 0.0001), and Omicron (0.0005 ± 0.0002). These values show consistent calibration performance across different SARS-CoV-2 lineages. The model demonstrates low expected calibration error, indicating trustworthy confidence estimates.

Figure 5

Reliability diagram showing true accuracy against predicted confidence for a model. The blue line with markers represents model performance, while the dashed line represents perfect calibration. The model closely follows the perfect line from 0.0 to 1.0 confidence but with deviations near 0.4 and 0.7 confidence levels.

Figure 5. Reliability diagram showing confidence calibration of the CNN-BiLSTM model in classifying SARS-CoV-2 variants using spike gene sequences. The blue curve represents model confidence vs. actual accuracy, and the dashed diagonal line indicates perfect calibration. A model above the perfect calibration shows a good calibration.

To evaluate the interpretability of the CNN-BiLSTM model, we generated pre-quantization saliency maps for each SARS-CoV-2 variant by averaging gradient-based importance scores across nucleotide positions in the spike gene. These visualizations (Figure 6) reveal how the model identifies critical regions responsible for classification. The saliency maps for each variant uncover biologically relevant patterns that correspond to known mutational hotspots and sequence patterns in the spike gene, which support variant discrimination. To zoom through these maps, the precise nucleotide positions whose saliency values are above 0.2 were extracted for each variant and converted into saliency plots (Supplementary Figures S8–S12).

Figure 6

Average saliency maps for SARS-CoV-2 variants Gamma, Delta, Beta, Alpha, and Omicron. Each plot represents nucleotide positions against saliency values, marked from zero to three thousand five hundred fifty. A color bar indicates saliency with a gradient from red (low) to yellow (high).

Figure 6. Average saliency maps across nucleotide positions for five SARS-CoV-2 variants based on spike gene input. (a) Gamma, (b) Delta, (c) Beta, (d) Alpha, and (e) Omicron. Bright regions indicate higher model sensitivity and importance. The intensity of each signal correlates with the feature’s importance to classification.

The saliency maps (Figure 6; Supplementary Figures S8–S12) reveal distinct nucleotide patterns prioritized by the CNN-BiLSTM classification model across SARS-CoV-2 variants. For Gamma (Figure 6a; Supplementary Figure S1), the most prominent signal arises from nucleotides 55–58 (CAAA), indicating a concentrated contribution to model prediction in the 5′ region of the spike gene. Aside from a notable position at nucleotide 1,386, the majority of salient features lie below position 100, suggesting the presence of a contiguous informative region between positions 46–82 rather than isolated single-nucleotide contributions. In the Delta variant (Figure 6b; Supplementary Figure S9), the saliency signal is broader and denser, with recurrent patterns rather than singular positions dominating the model’s focus. Salient regions include 197–213, 421–449, and 735–766. Additionally, discrete peaks at positions 1,413, 1,428, 1,454, and 1,461 further underscore the variant-specific features influence distributed across the spike sequence.

For Beta (Figure 6c; Supplementary Figure S10), salient features are highly localized, with a concentrated peak spanning positions 1,379–1,394. Secondary signals are observed at positions 431 and 202–207, indicating a narrower mutational signature influencing the model’s classification. The Alpha variant (Figure 6d; Supplementary Figure S11) exhibits distinct saliency around positions 194–217, a region potentially corresponding to the Δ69–70 deletion, a known defining mutation of Alpha. Beyond this, the model appears to prioritize diffuse patterns over discrete mutation sites, suggesting its reliance on broader sequence motifs. Omicron (Figure 6e; Supplementary Figure S12) displays an expansive and intensified saliency profile, with multiple peaks across the spike gene, particularly at 201–208, 429–437, 753–762, 1,125–1,159, 1,379–1,392, and several positions beyond nucleotide 2000. This widespread attention aligns with Omicron’s extensive mutational load, which includes over 30 spike mutations (Karim and Karim, 2021) and reflects the model’s capacity to capture complex, lineage-specific patterns. Notably, a conserved saliency signal is consistently observed across all variants in the 0–500 nucleotide range. Although these regions may not correspond to well-characterized mutations, they likely reflect conserved sequence motifs or alignment-related artifacts that the model has learned to associate with class-discriminative features. This warrants further interpretability-focused investigations to elucidate the biological or technical underpinnings of these recurrent signals. These results confirm that the one-hot encoding approach does not preclude the model from capturing complex contextual dependencies. Despite lacking explicit k-mer context, the CNN–BiLSTM architecture effectively identified multi-base sequence motifs relevant to variant classification. Notably, high saliency regions often spanned multiple positions and exhibited clear lineage-specific patterns, reinforcing the biological plausibility of our learned representations.

3.4 Model validation

Following successful training and testing of the model, further validation of the model was performed on a new dataset containing 8,585 SARS-CoV-2 variant sequences downloaded from the GSAID (3,062) and NCBI virus (5,523) databases. The distribution of variants in the validation dataset is shown in Figure 2b.

Additionally, confusion matrices (Figure 7) were used to assess per-class performance and misclassification patterns in both test and external validation datasets. Confusion matrices show strong agreement between true and predicted labels, with nearly all variants correctly classified in both test and validation sets. In the test matrix (Figure 7a), classification accuracy exceeds 99% across all variants, with negligible confusion.

Figure 7

Side-by-side confusion matrices show test results on the left (a) and validation results on the right (b). Both matrices evaluate classifications among Gamma, Delta, Beta, Alpha, and Omicron variants. The diagonal entries represent correctly predicted instances, with the highest counts in the Gamma and Alpha categories. A color scale on the right indicates values from 0 to over 2000.

Figure 7. Confusion matrix showing the performance of the model on the test set (a) and validation set (b).

In the external validation matrix, most variants demonstrate high classification accuracy, reflecting the model’s robustness. Nonetheless, several misclassifications are observed. Delta is occasionally misclassified as Beta (n = 13), suggesting overlapping feature representations, which is further supported by similarities observed in their saliency maps. Alpha misclassified as Beta (n = 7) may result from shared sequence features that contribute to model decisions, particularly within regions around position 1,500 and below 500. Although limited in number, Gamma misclassified as Beta (n = 2) likely stems from both variants sharing critical receptor-binding domain (RBD) mutations such as E484K and N501Y, highlighting the difficulty in distinguishing Gamma from Beta based solely on RBD signatures. Additional misclassifications include Gamma to Alpha (n = 22) and Delta to Alpha (n = 10), Delta to Omicron (n = 11), Alpha to Delta (n = 6), Alpha to Gamma (n = 1), Beta to Alpha (n = 1), and Beta to Omicron (n = 4). Importantly, all 583 Omicron sequences were correctly classified, demonstrating the model’s strong generalization capability for highly mutated variants when included in training. This likely arises from Omicron’s distinct profile, which includes shared mutations with multiple variants (e.g., N501Y, Δ69–70), but also unique substitutions such as N679K and multiple insertions that provide discriminative power (Cao et al., 2021; Karim and Karim, 2021).

This study included a confusion report for the validation dataset that has the precision, recall, accuracy, and F1 score for the 5 classes, shown in Table 4. Precision is the ability of the model to predict the true positive sequence as positive, and recall shows the number of positive sequences that are classified correctly as positive. The F1 scoresums up the predictive performance of the model by combining precision and recall metrics. It gives the overall accuracy of the model. This is calculated from the precision and recall test as shown in the following formulas.

Precision = \frac{True positive}{True positive + False Positive}

Recall = \frac{True positive}{True Positive + Negative}

F 1 - Score = \frac{2^{*} {Precision}^{*} Recall}{Precision + Recall}

Table 4

Table 4. Classification performance of the CNN-BiLSTM model on the external validation dataset across five SARS-CoV-2 variants.

The model demonstrated exceptional classification performance on the external validation dataset, achieving an overall accuracy of 99.00% ± 0.00 across five SARS-CoV-2 variants. Precision, recall, and F1scores were consistently high across all classes, with minimal variance across runs, indicating strong robustness and generalizability of the model. Per-class analysis reveals the model’s ability to accurately distinguish between closely related viral lineages. Gamma variant was predicted with perfect precision (100.00 ± 0.00%), while beta, delta, and alpha with a precision of 99.00 ± 00% while omicron had the smallest of 97.00 ± 00% with all their recall ≥98.00 ± 00%, underscoring the model’s sensitivity to discriminative mutation patterns within these classes. The Omicron variant, which is of particular concern due to its rapid global spread and mutational complexity, was identified with 99.00 ± 0.13% F1-score, reflecting high sensitivity (recall: 99.00 ± 0.10%) and specificity (precision: 97.00 ± 0.10%), even with limited support (n = 583). This is noteworthy given the evolutionary divergence of Omicron from other lineages and the model’s training being predominantly on earlier variants. The Beta variant, known for its immune escape features, showed a higher precision (99.00 ± 0.10%) and maintained 100 ± 0.00% recall, suggesting that the model prioritizes avoiding false negatives, a desirable trait in surveillance applications. Delta, with a recall of 98.10% ± 0.30, showed the lowest among the five, though still indicative of high performance. Macro-averaged and weighted precision, recall, and F1-scores remained consistently 99.00%, highlighting the model’s balanced performance regardless of minimal class imbalance. This confirms the model’s strong generalization ability and aligns with findings in genomic deep learning, where CNN-RNN hybrids outperform traditional ML classifiers in sequence-based virus classification tasks (Randhawa et al., 2020; Zeng et al., 2016).

These results suggest that the CNN–BiLSTM architecture can reliably classify SARS-CoV-2 variants from raw nucleotide sequences, offering a rapid and alignment-free computational tool suitable for genomic surveillance, especially in settings where fast mutation tracking is critical. Furthermore, the consistency of performance across 10 independent runs (reported as mean ± standard deviation) provides statistical robustness in line with best practices in modern deep learning evaluation (Charles et al., 2023; Charles and Mahapatra, 2023; Charles and Edgar, 2024; Reimers and Gurevych, 2017). The model demonstrates consistently high performance across all variants, with minimal variability, indicating robustness and strong generalization capacity.

We compared the validation results from our model with the classification results from Nextclade, a genomic analysis tool used for SARS-CoV-2 variant assignment (Aksamentov et al., 2021; Hadfield et al., 2018). Table 5 presents the performance metrics for both tools. The CNN + BiLSTM model demonstrated strong performance, achieving over 95% accuracy for all VOCs. Notably, the model excelled in classifying Omicron, with a perfect success rate of 100%, outperforming Nextclade’s 34.95% for this variant. Nextclade showed high accuracy (98–99%) for most variants, including Alpha, Beta, Gamma, and Delta, but struggled significantly with Omicron, correctly classifying only 34.95% of sequences. Additionally, Nextclade identified the presence of recombinant strains in Omicron (n = 67) and Delta (n = 11), a capability that is currently outside the scope of our CNN + BiLSTM model. While both tools perform well for most variants, Nextclade’s recombinant strain identification adds value to its analysis, but its limitations in Omicron classification underscore the strength of our CNN + BiLSTM model in this area.

Table 5

Table 5. Comparison between the proposed model and the Nextclade in variant classification.

One-hot encoding and a k-mer feature classification models (k = 3) were also constructed to benchmark our model, and their performance was compared to our CNN-BiLSTM model. The results show that our model is superior to these models. Table 6 summarizes the comparison.

Table 6

Table 6. Comparison of our model to baseline models.

3.5 LOVO

A confidence threshold of 0.603 was empirically selected based on calibration curves (see Supplementary Figure S2). When Alpha was left out during training, the model failed to identify this class entirely, yielding an F1-score of 0.00 (Supplementary Table S2). The majority of Alpha sequences with over 96% were flagged as “Unknown,” (Supplementary Figure S3). This reflects the limitations of closed-set classifiers under open-set conditions, particularly when overlapping RBD mutations such as N501Y (shared with Beta, Gamma, and Omicron) mislead the network. In the Beta-excluded setup, the classifier showed complete failure in Beta recognition (F1 = 0.00), with misclassifications skewed toward Delta and Alpha (Supplementary Table S3), This is not a surprise as Beta and Delta share the main important 3 saliency features, even though they are more significate for Delta than Beta (Figures 6b,c). Notably, a few numbers of sequences were also labeled as “Unknown,” reflecting the network’s uncertainty in the absence of Beta’s hallmark features (Figure 6c). The overall accuracy remained high (83%), but macro F1 was reduced to 0.65 (Supplementary Table S3).

Exclusion of Delta during training led to severe misclassification, with most Delta sequences being labeled as Beta/Omicron or flagged as “Unknown” (Supplementary Table S4; Supplementary Figure S5). While Gamma, Alpha, and Omicron maintained high classification fidelity (F1 > 0.97), the omission of Delta resulted in a macro F1 of 0.65, underlining the challenge of identifying this variant without exposure. Gamma sequences were also poorly classified when excluded from training, achieving an F1-score of 0.00 (Supplementary Table S5), and 99% were flagged as unknown (Supplementary Figure S6). Despite the lack of Gamma-specific representation during training, the classifier maintained high weighted accuracy (84%) due to strong performance on known variants. The omission of Omicron caused misclassification primarily into mainly Alpha, with a subset flagged as “Unknown” (Supplementary Figure S7; Supplementary Table S6). The divergence of Omicron’s mutational profile including more than 30 substitutions resulted in a complete failure to identify it correctly (F1 = 0.00), though the model maintained an overall accuracy of 95% on known classes.

The LOVO experiments reveal critical insights into the model’s generalization capacity and its behavior under open-set conditions. While the CNN-BiLSTM architecture demonstrated near-perfect classification on known classes, it struggled to generalize to previously unseen variants. Most failures were biologically grounded misclassification patterns aligned with known shared mutations partly supporting the interpretability of deep learning outputs.

Importantly, the use of confidence thresholding enabled a safety mechanism to flag low-confidence predictions. In the Alpha-left-out setup, for example, a majority of sequences were flagged as “Unknown” rather than being misclassified, reducing the risk of erroneous inference. This behavior reflects real-world utility, where genomic surveillance systems must handle emerging lineages conservatively before integration into training sets. The inclusion of the “Unknown” class also highlights the importance of uncertainty-aware classification in bioinformatics. Classical classifiers tend to force all predictions into known classes, which can mislead downstream interpretation. Our thresholding strategy enhances reliability by enabling abstention from classification when evidence is insufficient. Overall, LOVO testing provides a rigorous stress test for variant classifiers. Our findings suggest that while CNN–BiLSTM models offer superior performance when trained comprehensively, their application in real-time pathogen surveillance requires dynamic retraining and incorporation of model calibration mechanisms to ensure accurate performance in evolving genomic landscapes (Fort et al., 2021; Qu et al., 2024).

In this study, we trained and evaluated a CNN–LSTM deep learning model for classifying SARS-CoV-2 variants using spike gene nucleotide sequences. Traditional approaches, particularly k-mer-based and kernel approximation methods, have been widely employed for viral sequence analysis due to their computational efficiency and interpretability. These models generally perform well when sequence patterns are localized and relatively simple to capture through fixed-length representations (Ali et al., 2021). However, their reliance on predefined features restricts adaptability in dynamic genomic contexts. As SARS-CoV-2 continues to evolve, these models require frequent recalibration to accommodate novel mutations, limiting their utility for prospective surveillance applications. The choice of k-mer size further complicates their effectiveness: shorter k-mers may overlook long-range biological dependencies, while longer k-mers increase dimensionality and risk overfitting. Moreover, such methods cannot automatically discover meaningful hierarchical features or integrate distant relationships between sequence elements, and these limitations become increasingly problematic as genomic datasets grow in size and complexity (Ali et al., 2021; Ali and Patterson, 2021).

DL architectures, particularly hybrid models like CNN–LSTM, overcome these constraints by learning discriminative features directly from raw nucleotide sequences. CNN layers are well-suited for capturing spatial patterns such as conserved motifs or mutation clusters, while LSTM layers excel at modeling temporal dependencies, including long-range interactions between dispersed sequence elements (Graves, 2012; Hochreiter and Schmidhuber, 1997; Zhang et al., 2015). This layered architecture enables our model to learn both local and global sequence relationships without manual feature engineering, providing superior flexibility and generalization performance (Lecun et al., 2015; Ren et al., 2017). Our CNN–LSTM classifier demonstrated excellent accuracy (99%) on the independent validation dataset, highlighting its robustness in generalizing to unseen sequences. Importantly, the model was explicitly designed to tolerate real-world sequencing errors, including ambiguous nucleotides (“N”), which were ignored during training and prediction. By assigning zero weight to these bases, the model focused exclusively on informative regions of the sequence, enhancing reliability in noisy, high-throughput sequencing contexts (Juan et al., 2017).

The model’s performance also benefited from leveraging the spike gene, a functionally critical and mutation-dense region of the SARS-CoV-2 genome as the input feature space. Although this region represents only ~13% of the complete genome, it harbors key variant-defining mutations, making it a highly informative target for classification tasks (Harvey et al., 2021; Karim and Karim, 2021). Compared to rule-based molecular epidemiology tools like Nextclade, our approach showed superior accuracy, especially for divergent lineages such as Omicron, which are frequently misclassified due to their extensive mutational drift (Aksamentov et al., 2021).

To prevent overfitting and promote generalization, we implemented several regularization techniques. Dropout (Srivastava et al., 2014) was used to randomly deactivate neurons during training, reducing co-adaptation and improving robustness. L2 weight regularization (Goodfellow et al., 2016) penalized model complexity, and batch normalization accelerated convergence and stabilized training. These methods were complemented by 5-fold cross-validation to ensure performance consistency across diverse data splits, in line with best practices for evaluating deep learning models (Neyshabur et al., 2017). The use of the Adam optimizer further facilitated efficient convergence by dynamically adapting learning rates based on gradient estimates (Kingma and Ba, 2014). Compared to earlier works, our model offers a balanced solution to both interpretability and performance. For example, Whata and Chimedza (2021) proposed a CNN–BiLSTM model for broad coronavirus classification with excellent results, but limited computational depth constrained its scalability. Similarly, KEVOLVE (Lebatteux et al., 2024), which relies on genetic algorithms to extract discriminative motifs, demonstrated strong performance but may face interpretability and generalization challenges as new variants emerge. In our implementation, the CNN–LSTM model combines high accuracy, resilience to sequence ambiguity, and scalability, making it well-suited for real-time genomic surveillance.

4 Limitations and future work

Although the CNN–BiLSTM model developed in this study demonstrated high accuracy in classifying SARS-CoV-2 variants based on spike gene sequences, several limitations warrant careful consideration. First, the use of spike gene sequences as the sole input feature. Although the spike protein is biologically informative and central to variant-specific phenotypes such as transmissibility and immune escape, relying exclusively on this region may restrict the model’s ability to detect lineages that harbor distinguishing mutations elsewhere in the genome. The emergence of recombinant lineages or divergent subvariants with mutations outside the spike region could reduce classification performance. Expanding the input space to include full-genome sequences or other conserved regions could enhance the model’s generalizability to a broader array of SARS-CoV-2 lineages.

A further consideration is ambiguous nucleotides are treated as uninformative zeros, potentially excluding useful information in noisy real-world sequences. Alternative sequence encoding strategies such as transformer-based representations, or positional attention mechanisms could enable the model to learn context-dependent relationships and dynamically assign importance to specific regions of the spike gene.

The LOVO validation strategy exposed additional limitations in model generalization. Although the model maintained high accuracy when all classes were seen during training, performance deteriorated substantially when a variant was excluded, reflecting the difficulty in generalizing to previously unseen classes despite shared mutational features, revealing that the model’s decision boundaries were highly dependent on variant-specific training data. These findings underscore the challenge of open-set classification in genomics, where the ability to flag novel or recombinant lineages without prior exposure is crucial. Incorporating confidence calibration techniques and open-set recognition mechanisms may improve the model’s capacity to handle such cases.

In future iterations, the model will be extended to address these limitations, with a focus on improving efficiency, scalability, and adaptability. This will include the integration of unsupervised anomaly detection for early identification of novel variants and the benchmarking of CNN–BiLSTM against emerging architectures such as transformers. These enhancements aim to position the model as a robust and interpretable framework for genomic surveillance, capable of supporting variant classification in fast-evolving epidemiological landscapes.

5 Conclusion

This study shows that deep learning can be applied as an alternative method to the classification of viruses in addition to conventional sequence classification methods. CNN-LSTM model achieved high accuracy in classifying the five most dominant SARS-CoV-2 variants. Clinicians and public health organizations could use this tool to track new SARS-CoV-2 lineages in real-time surveillance settings. Frontline diagnostic workflows might benefit from its low latency and excellent precision.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/omicscodeathon/ml_sarscov2.

Author contributions

OA: Investigation, Project administration, Resources, Supervision, Writing – review & editing. HO: Conceptualization, Investigation, Methodology, Writing – original draft. CS: Validation, Writing – original draft, Writing – review & editing. EM: Formal analysis, Methodology, Writing – original draft. MM: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Acknowledgments

The authors thank the National Institutes of Health (NIH) Office of Data Science Strategy (ODSS) and the National Center for Biotechnology Information (NCBI) for their immense support before and during the April 2022 Omics codeathon organized in collaboration with the African Society for Bioinformatics and Computational Biology (ASBCB). The authors acknowledge Daniel Lumian for helping in editing the first draft of the manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that no Gen AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2025.1512003/full#supplementary-material

References

Abd-Alhalem, S. M., El-Rabaie, E.-S. M., Soliman, N. F., Abdulrahman, S. E. S. E., Ismail, N. A., and Abd El-samie, F. E. (2021). DNA sequences classification with deep learning: a survey. Menoufia J. Electron. Eng. Res. 30, 41–51. doi: 10.21608/MJEER.2021.146090

Crossref Full Text | Google Scholar

Aksamentov, I., Roemer, C., Hodcroft, E., and Neher, R. (2021). Nextclade: clade assignment, mutation calling and quality control for viral genomes. J. Open Source Softw. 6:3773. doi: 10.21105/JOSS.03773

Crossref Full Text | Google Scholar

Alaya, F., Baraket, G., Adediran, D. A., Cuttler, K., Ajiboye, I., Kivumbi, M. T., et al. (2024). Multiple sclerosis stages and their differentially expressed genes: a bioinformatics analysis. bioRxiv. doi: 10.1101/2024.01.20.576448

Crossref Full Text | Google Scholar

Ali, S., Bello, B., Chourasia, P., Punathil, R. T., Zhou, Y., and Patterson, M. (2022) PWM2Vec: an efficient embedding approach for viral host specification from coronavirus spike sequences. Available online at: http://arxiv.org/abs/2201.02273.