
ORIGINAL RESEARCH article

Front. Oncol., 04 December 2025

Sec. Cancer Imaging and Image-directed Interventions

Volume 15 - 2025 | https://doi.org/10.3389/fonc.2025.1703772

Cervical cancer classification using a novel hybrid approach

  • 1School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT) Deemed to be University, Bhubaneswar, Odisha, India
  • 2Department of Mathematics, Pandit Deendayal Energy University, Gandhinagar, Gujarat, India
  • 3Faculty of Economics and Administrative Sciences, Universidad Católica de la Santísima Concepción, Concepción, Chile

Objective: Cervical cancer is among the most frequently diagnosed malignancies in women and the fourth most prevalent malignancy in women worldwide. Pap smear tests, a popular and effective medical procedure, enable the early detection and screening of cervical cancer. Expert physicians perform the smear analysis, which is laborious, time-consuming and prone to mistakes. The main objective of our work is to distinguish, or classify, healthy and malignant cervical cells using our proposed CASPNet model.

Methods: This study proposes a novel technique that combines feature extraction by multi-head self-attention blocks and a cross-stage partial network with feature fusion integration by a spatial pyramid pooling fast layer to identify healthy and cancerous cervical cells. Based on the comprehensive ablation study results, the proposed CASPNet architecture shows optimal performance, achieving superior test accuracy with comparable computational efficiency.

Results: In our experimental study, the proposed CASPNet (Contextual Attention and Spatial Pooling Network) model achieves an accuracy of 97.07% on the widely used benchmark SIPAKMED dataset.

Conclusion: Compared with CNN models, the self-attention blocks of vision transformer models are generally better at capturing global contextual information within an input image and yield higher accuracy in classification tests. The architecture’s CSP blocks are well suited to classification tasks with constrained resources, balancing efficiency and speed, which makes them appropriate for local feature extraction. In addition, objects in cervical cell images vary in size, so the SPPF layer records contextual information at different receptive fields and performs multi-scale feature extraction. By incorporating all these benefits in our suggested CASPNet model, the images can be understood more precisely and reliably.

1 Introduction

The fourth most common cancer among women is cervical cancer. Worldwide, cervical cancer has a serious impact on women’s lives and health. The World Health Organization reports that in 2022, almost 350,000 women globally lost their lives to cervical cancer, while about 660,000 women received new cervical cancer diagnoses. Cervical cancer continues to rank among the top causes of cancer-related mortality for women, especially in low- and middle-income nations (1). The Pap smear test is a commonly used and effective medical procedure that enables early identification and screening of cervical cancer. Accurate classification of cervical cancer cells is crucial for determining the stage and severity of the disease, which directly impacts therapeutic decisions and patient outcomes. However, traditional diagnostic methods can be time-consuming and prone to human error. To overcome these challenges, current studies have concentrated on combining machine learning and deep learning methods to improve the precision, effectiveness and consistency of classification. By automating the identification and categorization of abnormal cervical cells, these technologies offer promising solutions to improve screening programs and reduce mortality rates. As a result, various deep-learning-based methods have been created. Convolutional Neural Networks (CNNs), when thoroughly trained on extensive, well-annotated natural image datasets, have proved beneficial for disease diagnostic tasks, in spite of the differences between natural and medical images. Reusing the knowledge gained from one task for another is known as transfer learning, and it has been applied by various researchers to cervical cancer classification (2, 3). Recently, vision transformer (ViT) models have shown more remarkable performance than CNNs, and transformer-based models and their variants are consequently employed in numerous applications, including image classification (4). By switching from local convolutions to a global, data-driven attention mechanism, ViTs excel at feature extraction. Another essential component of the You Only Look Once (YOLO) architecture is multi-scale feature learning, which enables models to accurately recognize objects of various sizes within an image; feature fusion integration is therefore a fundamental and distinguishing idea in YOLO models. In our approach, we combine the above concepts and create a new hybrid model that merges the ViT blocks of the vision transformer with the CSP/SPPF blocks of YOLO to achieve the classification task. The model is then evaluated on the publicly available cervical cancer benchmark dataset SIPAKMED (5).

The remainder of the paper is organized as follows: Section 2 reviews the literature on cervical cancer cell classification, Section 3 describes the materials and methods used in the experimental study, Section 4 presents the results and analysis and compares the proposed work with related approaches, and Section 5 covers the conclusion and future work.

2 Literature review

To improve the accuracy of the AlexNet CNN architecture, a padding strategy is implemented (1). The model’s accuracy is found to improve from 84.88% to 87.32%. The classification model performs well in identifying diseased cells of the koilocytotic and dyskeratotic classes and normal cells of the superficial-intermediate and parabasal classes; nevertheless, its accuracy in identifying benign cell images of the metaplastic class is low. Using the ResNet-152 architecture, a deep learning classification method achieves 94.89% accuracy on the SIPAKMED dataset. Prior studies have employed both machine learning and deep learning methodologies; since machine learning approaches extract features manually, which consumes valuable time, that study focuses on deep learning classification techniques only. In the future, multiple datasets with a comparable Pap smear modality can be incorporated to make the model more robust, which also avoids limiting environmental features, and since time is crucial for diagnosing and determining the best course of treatment for cervical cancer, training time can be further optimized (2). (3) have used three pretrained models, InceptionV3, ResNet-50 and VGG19, to form the classification network. Of these, InceptionV3 achieves the highest accuracy of 96.1%. The authors also introduce visualization techniques for classification that highlight the areas of an image that discriminate for a given class by computing an image-specific class saliency map; systematic integration of these saliency maps into learning formulations is left for future work. (4) have applied an alternative approach based on feature selection, using Principal Component Analysis (PCA) and Grey Wolf Optimization (GWO). This work presents a two-level feature reduction technique that optimally selects feature sets by exploiting the benefits of both approaches. There is room for improvement through the use of hybrid metaheuristic feature selection algorithms and various classification models, and the study paves the way for multi-domain adaptation and further research in this area; several classification problems, such as those in computer vision and biomedical applications, can be tested using the suggested process (10). An incremental deep tree (IDT) architecture is used by researchers to overcome the catastrophic forgetting of CNNs in biological image classification. This framework enables CNNs to learn new classes while preserving accuracy on previously learned ones, and three well-known incremental methods are compared to the IDT framework to evaluate the efficacy of this strategy. The accuracy attained by this method on the SIPAKMED dataset is 93.00%. Since the suggested approach only affects the classifier’s output layer and ignores its internal architecture, experimenting with new deep models that optimize hyperparameters would enhance the outcomes (5).

A method is proposed that uses five polynomial support vector machine (SVM) classifier models in an integrated cascade approach. The overall accuracy for all seven classes of the Herlev dataset is 97.3%, and the test accuracy across all classes is close to 92%. The study proposes a method for automatically extracting features without the need for image processing techniques; while skipping this step reduces computing time, it can result in some loss of image information, which would lower the accuracy of the proposed system. The method also offers a high degree of confidentiality in order to quickly differentiate between various Pap smear images (6). (7) have used a vision transformer module for global feature extraction and a convolutional neural network module for local feature extraction. The classification task is then completed by fusing the local and global features with a multilayer perceptron module. Using the SIPAKMED dataset, this method reaches a maximum accuracy of 91.72%. To increase the model’s effectiveness, further model combinations can be investigated in the future, and feature extraction capability can be improved by altering the module architecture.

(8) have conducted a thorough investigation of two categories of the most sophisticated and promising deep learning techniques using over 20 vision transformer (ViT) models and 40 convolutional neural network (CNN) models. The study also uses data augmentation methods to transform the data and ensemble learning methods to improve model accuracy, and the methodology is tested on the SIPAKMED dataset. For the classification task, EfficientNet-B6 achieves the best accuracy of 89.95% among the CNN models, while ViT-B16 achieves the highest accuracy of 91.93% among the ViT models. The ensemble approach that combines EfficientNet-B6 and ViT-B16 with a max-voting process achieves the highest accuracy of 92.95%. (9) have used the Vision Transformer as the base model for cervical cell classification and then incorporated an optimized pretrained MobileNet model to improve class prediction accuracy; on the SIPAKMED dataset, the accuracy obtained is 97.65%. (10) have created the Cerviformer model, which automatically classifies cervical cells using a cross-attention mechanism and a latent transformer model. The model can handle very large-scale inputs because it continuously folds the input data into a small latent transformer module through cross-attention. On the SIPAKMED dataset, the maximum accuracy achieved with this method is 96.67%. (11) have conducted a comparative analysis between vision transformer and CNN model variants for classifying cervical cell images, observing that the vision transformer models outperform the CNN models in terms of test accuracy. Using the SIPAKMED dataset, the highest accuracy of 93% is achieved with a regularized ViT variant named LeViT. To improve accuracy further, an ensemble technique blending ViT and CNN models with reduced resource usage can be investigated, and the model may also be evaluated on other cervical cancer datasets.

A more detailed analysis of the surveyed literature is presented in Table 1.


Table 1. Analysis of some research articles.

3 Materials and methods

The resources and techniques used to categorize cervical cancer images are covered in this section. The experimental setup consists of the following steps: (i) the dataset used for this study, (ii) dataset visualization (sample images of the dataset belonging to the various classes), (iii) dataset augmentation and preprocessing, (iv) splitting of the dataset (preparing the data to feed to the model), (v) the proposed hybrid model (the model first processes the image patches through transformer blocks, providing global feature learning through self-attention, then reshapes the output for the CSP and SPPF components of YOLO for multi-scale processing and feature extraction, and finally uses a classification head to predict the image class), (vi) the software and hardware used for experimentation and (vii) the parameter values used in the hybrid model.

3.1 Dataset description

According to recent studies and our literature survey, SIPAKMED is a standardized, publicly accessible dataset with balanced classes and rich morphology that is widely used in the community. Because it contains high intra-class variability and background artifacts, it is comparatively more challenging than other datasets. The experimental study is implemented using the publicly available SIPAKMED dataset (6). This large benchmark dataset comprises 4049 isolated cell images in .bmp format, cropped from 966 cluster cell images of Pap smear slides. Expert cytopathologists have separated the cells into five different types based on their morphology and appearance, namely Parabasal, Superficial-Intermediate, Dyskeratotic, Koilocytotic and Metaplastic. The Parabasal and Superficial-Intermediate classes are considered normal categories, while Dyskeratotic, Koilocytotic and Metaplastic are considered abnormal categories. The number of images available in each class of the SIPAKMED dataset is shown in Table 2 below.


Table 2. Class name distribution from the original SIPAKMED dataset.

3.2 Dataset visualization

The following Figure 1 depicts the samples from each class in the widely used benchmark SIPAKMED dataset (12).

Images (a) through (e) show various magnified microscopic views of cells with different levels of staining. Each image displays a distinct shape and coloration pattern, indicating potential variations in cell types or states.

Figure 1. Sample images from each class in SIPAKMED dataset. (a) Dyskeratotic (b) Parabasal (c) Dyskeratotic (d) Superficial-Intermediate (e) Metaplastic.

3.3 Data augmentation and image processing

Data augmentation is used to resolve the class imbalance issue and helps distribute the data in a balanced manner. As Table 2 above illustrates, the number of images in each class varies. Data augmentation creates new images from existing ones using a range of transformations, such as cropping, flipping and rotating. In this manner, the training dataset is made larger and of higher quality. The parameters considered for data augmentation are given in Table 3.
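The exact augmentation settings are those listed in Table 3; the torchvision pipeline below is therefore only an illustrative sketch with assumed parameter values.

```python
from torchvision import transforms

# Illustrative pipeline only; the exact values used in the paper are in Table 3.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),                         # match the model input size
    transforms.RandomHorizontalFlip(p=0.5),                # flipping
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),                 # rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # cropping
    transforms.ToTensor(),
])

# Validation and test images are only resized and converted, never augmented.
eval_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```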


Table 3. Applied parameters for data augmentation.

3.4 Splitting data

To make sure that each set is representative of the entire dataset, it is essential to randomize the data before splitting it; this reduces the possibility of bias in the model’s evaluation and training. Stratification guarantees that each set retains the same class proportions as the original dataset, which is especially crucial when working with imbalanced datasets. Therefore, for training, validation and testing purposes, the dataset is split in an 80:10:10 ratio using the holdout method as part of a uniform data partitioning strategy to guarantee experimental consistency. Additionally, to preserve the reproducibility of the outcomes in a single run, a fixed random seed value of 1337 is used.
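A minimal sketch of this stratified 80:10:10 holdout split with the fixed seed is shown below; the folder layout (one sub-directory per class) is an assumption based on the SIPAKMED release, and scikit-learn is used only as one possible tool.

```python
from pathlib import Path
from sklearn.model_selection import train_test_split

SEED = 1337  # fixed seed used for reproducibility

# Hypothetical layout: one sub-folder per class, e.g. sipakmed/Dyskeratotic/*.bmp
samples = [(str(p), label)
           for label, cls_dir in enumerate(sorted(Path("sipakmed").iterdir()))
           for p in cls_dir.glob("*.bmp")]
paths, labels = zip(*samples)

# Split off 20% first, then halve it, stratifying both times so that every
# subset keeps the original class proportions (80:10:10 overall).
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.20, stratify=labels, random_state=SEED)
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.50, stratify=rest_y, random_state=SEED)
```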

3.5 Proposed hybrid classification model

The following Figure 2 shows the proposed overall workflow of our experimental study for classifying cervical cancer. The proposed classification model is a hybrid architecture that combines a Vision Transformer (ViT) with YOLO components. As outlined in Algorithm 1, the key components of this hybrid classifier are the following:

Flowchart depicting a cervical cancer dataset analysis process. The SIPAKMED dataset undergoes image resizing and splitting into unbalanced and balanced datasets through image augmentation. These datasets form a testing set for a hybrid classifier model consisting of patch embedding, vision transformer processing, reshape transformer output, CSP block processing, and spatial pyramid pooling. The classification head outputs class probabilities. Results and analysis include evaluation parameters like accuracy, F1 score, precision, and recall.

Figure 2. Proposed methodology workflow for classifying cervical cells.

1) Patch Embedding Layer: The input image is divided into patches and projected into an embedding space in this layer.

2) Vision Transformer blocks: These provide global feature learning through the self-attention mechanism.

3) Cross Stage Partial (CSP) blocks: This is a YOLO component that is used for efficient feature extraction. It is achieved by splitting the feature map of a base layer into two parts, which are then combined using a cross-stage hierarchy.

4) Spatial Pyramid Pooling Fast (SPPF): This YOLOv5 component provides multi-scale feature map representation and is used to enhance the model’s capacity to capture features at various levels of abstraction.

5) Classification head: After the feature extraction layers, layer normalization is applied to stabilize training, and a linear transformation then maps the high-dimensional feature space to the number of classes, producing the logits for each class.

Hence, this new hybrid model first processes the image patches through transformer blocks. It then reshapes the output to operate with the CNN-based YOLO components. Lastly, a classification head is used to predict the class probabilities.

Algorithm 1. Hybrid CASPNet model for cervical cancer image classification.

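Since Algorithm 1 is rendered as a figure, the following minimal PyTorch sketch outlines the same forward path described above. The internal ViT, CSP and SPPF definitions are simplified placeholders (for instance, the number of attention heads and the exact head layers are assumptions); only the data flow and tensor shapes follow the description in this section.

```python
import torch
import torch.nn as nn

class CASPNetSketch(nn.Module):
    """Sketch of the CASPNet forward path: patch embedding -> ViT blocks ->
    reshape -> CSP/SPPF -> global average pooling -> classification head."""

    def __init__(self, num_classes=5, embed_dim=768, patch=16, img=224, depth=12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, (img // patch) ** 2 + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
        self.vit_blocks = nn.TransformerEncoder(layer, num_layers=depth)  # 12 blocks
        self.csp_sppf = nn.Identity()  # placeholder for the CSP and SPPF blocks
        self.head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))

    def forward(self, x):                                     # x: (B, 3, 224, 224)
        B = x.size(0)
        t = self.patch_embed(x).flatten(2).transpose(1, 2)    # (B, 196, 768)
        t = torch.cat([self.cls_token.expand(B, -1, -1), t], dim=1) + self.pos_embed
        t = self.vit_blocks(t)                                # (B, 197, 768)
        f = t[:, 1:].transpose(1, 2).reshape(B, -1, 14, 14)   # inverse patch embedding
        f = self.csp_sppf(f)                                  # (B, 768, 14, 14)
        f = f.mean(dim=(2, 3))                                # global average pooling
        return self.head(f)                                   # class logits (B, 5)

logits = CASPNetSketch()(torch.randn(2, 3, 224, 224))         # torch.Size([2, 5])
```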

The specification tables for our hybrid classifier are given below in Tables 4–7. Table 4 contains the structure of the input processing, Table 5 describes the structure of the transformer blocks, and Tables 6 and 7 detail the structure of the CSP block and SPPF block, respectively.


Table 4. Structure of input processing.


Table 5. Structure of transformer blocks.


Table 6. Structure of CSP block.


Table 7. Structure of SPPF block.

3.6 Detailed architecture of the hybrid classification model

Figure 3 depicts the detailed architecture of the proposed CASPNet hybrid model for cervical cancer image classification. CASPNet is a hybrid model in which vision transformer blocks are integrated with CNN components, namely Cross-Stage Partial (CSP) and Spatial Pyramid Pooling Fast (SPPF) blocks. Bridging the dimensional gap between the spatial feature maps required by convolutional layers and the sequence-based representations of transformers is a crucial problem in such a hybrid system. The model uses a patch size of 16 pixels. The patch embedding layer transforms the 224x224 input image with 3 RGB channels into a flattened output of shape (B, 196, 768). Following patch embedding, inside the vision transformer processing, learnable positional embeddings are added to encode spatial information and a learnable class token is appended to the sequence. This sequence passes through 12 transformer blocks, keeping the output shape (B, 197, 768), where B is the batch size, the number of tokens N is 197 and the embedding dimension D is 768. After this, the model does not introduce any learned projection layers; instead, it uses an inverse patch embedding (geometric reshape) operation to convert the transformer’s sequence representation into the spatial feature maps needed by the CSP and SPPF blocks. Since no trainable parameters are added in this step, it also reduces model complexity and avoids overfitting. The reshaped tensor (B, 768, 14, 14) is an appropriate 4D tensor format for the convolutional operations inside the CSP and SPPF blocks. Local-global feature fusion is enabled in the CSP block, and the SPPF block uses cascaded max pooling to aggregate multi-scale information while preserving spatial dimensions. As a spatial aggregation technique, global average pooling preserves the channel-wise feature representations; the output becomes (B, 768) by calculating the arithmetic mean of each channel separately across the two spatial dimensions, height and width. Finally, the classification head uses a hierarchical, fully connected network with non-linear activations and regularization to transform the 768-dimensional feature vector into class probabilities.

Diagram illustrating a machine learning model for image classification. It begins with an input image, moving through stages like patch embedding, transformer blocks, CSP and SPPF blocks, classification head, and output image. The classification head has seven layers, including layer normalization and linear transformations. The process applies softmax for inference, leading to a predicted label, “Koilocytic Class.” The flow includes various operations such as convolutional layers, attention mechanisms, and pooling layers. Each step is associated with specific parameters and descriptions, detailing the data transformations at each stage.

Figure 3. Detailed diagram of the proposed CASPNet model for cervical cancer diagnosis comprising basic three components: patch embedding and transformer blocks, CSP block, and SPPF block.

Table 8 below lists the hyperparameters used by the ViT blocks, whereas Table 9 shows the dimensionality transformation strategy applied by the CSP block to the ViT output. Our model does not introduce any learned projection layers, since convolutional or linear projection layers such as nn.Conv2d, nn.ConvTranspose2d or nn.Linear are not included between the ViT and CSP blocks. Rather, it employs geometric reshaping, or inverse patch embedding. This transformation is parameter-free and relies on the spatial structure of the patch tokens being preserved.
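A minimal sketch of this parameter-free reshape, using the tensor shapes stated above, is given below.

```python
import torch

B, N, D = 8, 197, 768             # batch, tokens (196 patches + class token), embedding dim
tokens = torch.randn(B, N, D)     # output of the final transformer block

# Drop the class token and reinterpret the 196 patch tokens as a 14x14 grid.
# This is a pure transpose/reshape: no nn.Conv2d, nn.ConvTranspose2d or
# nn.Linear is involved, so no trainable parameters are added.
patch_tokens = tokens[:, 1:, :]                                      # (B, 196, 768)
feature_map = patch_tokens.transpose(1, 2).reshape(B, D, 14, 14)     # (B, 768, 14, 14)
```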


Table 8. ViT block.


Table 9. ViT output to CSP input (dimensionality transformation strategy).

3.7 Software and hardware used

The experiments have been implemented with PyTorch (2.5.1) and TorchVision (0.20.1) compiled with CUDA version 12.4 on a Google Colab notebook. The hardware included an i5 11th-generation processor and 16 GB of RAM. We have used an NVIDIA T4 GPU, a powerful and energy-efficient GPU that accelerates computer vision workloads.

3.8 Model parameters used

The proposed model is trained from scratch for 450 epochs using a learning rate of 0.001 and a weight decay of 0.05. CrossEntropyLoss is used as the loss function. The OneCycleLR scheduler is used in the experimental setup to increase training speed and efficiency. For better performance, the optimal batch size considered is 128. These parameters, along with the other parameters used in our proposed CASPNet model, are listed in Table 10 and discussed below.


Table 10. Hyperparameters used for experiments.

The learning rate is a scalar that controls the size of the step taken in the direction of the negative gradient during backpropagation. Backpropagation is the technique of updating the weights of a neural network by propagating the error between the expected and actual outputs backward through the network. A one-cycle learning rate policy is used by the scheduler in the experimental study. Using a cosine annealing technique, it begins with a low learning rate (0.0001), raises it to a maximum of 0.001 during the first 10% of training and then progressively lowers it to a very low value (0.00001) by the end of training.
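A sketch of the corresponding PyTorch configuration is shown below; the learning-rate values and schedule follow the description above, while the number of steps per epoch is an assumption that depends on the training loader.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 5)       # stand-in module, only to make the sketch runnable
epochs = 450
steps_per_epoch = 26            # assumption: len(train_loader) at batch size 128

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,                # peak learning rate
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    pct_start=0.1,              # ramp up over the first 10% of training
    anneal_strategy="cos",      # cosine annealing after the peak
    div_factor=10,              # initial lr = max_lr / 10 = 1e-4
    final_div_factor=10,        # final lr = initial lr / 10 = 1e-5
)

# OneCycleLR is stepped once per optimization step (i.e., once per batch).
for _ in range(epochs * steps_per_epoch):
    optimizer.step()            # forward/backward omitted in this sketch
    scheduler.step()
```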

The CrossEntropyLoss function is commonly used in neural networks for multi-class classification. It evaluates how well the predicted class probabilities match the actual class labels. The multi-class classification loss is given in the following Equation 1.

Loss = −Σ_{c=1}^{C} y_{i,c} · log(p_{i,c})    (1)

Here, C represents the total number of classes, y_{i,c} denotes the true label of sample i for class c and p_{i,c} denotes the predicted probability of class c for sample i.
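As a toy illustration of Equation 1, PyTorch’s nn.CrossEntropyLoss takes raw logits and integer class labels and applies the softmax internally; the numbers below are arbitrary.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Toy batch of 2 samples over C = 5 classes (the five SIPAKMED categories).
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.3],   # raw, unnormalized scores
                       [0.2, 0.1, 3.0,  0.0, -0.5]])
targets = torch.tensor([0, 2])                        # true class indices

# nn.CrossEntropyLoss applies log-softmax internally, so Equation 1 is
# evaluated on the softmax probabilities and averaged over the batch.
print(criterion(logits, targets).item())
```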

Neural networks employ weight decay as a regularization strategy to prevent overfitting. To keep the model’s weights as small as possible, a penalty term is added to the loss function; usually, the squared L2 norm of the weights serves as this penalty. The weight decay formulation is given in the following Equation 2.

Loss_total = Loss_original + λ · ||W||²    (2)

Here, Loss_total is the sum of the original loss and the weight decay penalty, Loss_original denotes the original loss function, λ represents the weight decay coefficient, W is the vector of model weights and ||W||² is the squared L2 norm of the weights, i.e., the sum of the squared weights.

Because it directly affects the accuracy and computational efficiency of the training process, batch size is one of the most crucial hyperparameters in deep learning. It indicates the amount of data used in a single forward and backward pass through the network. In our study, a batch size of 128 is used to train the neural model.

The image size refers to the dimensions of the input images fed into the deep learning model during training. The height and width of each image are 224 pixels, and 3 is the number of color channels, which typically corresponds to RGB color images.

Another important aspect is the epoch, where an epoch corresponds to a complete pass through the entire dataset. The model sees each training example once during each epoch and updates its parameters based on the loss function. In our experimental study, we observe that the test accuracy increases over the iterations, producing a curve that shows the scratch-trained model gradually learning the dataset with each epoch. A total of 450 epochs is used for training, and results are reported for a single run with a seed value of 1337. To check the robustness of the model, we have performed strict cross-validation.

Lastly, optimizers play an important role while training the model. In deep learning, optimizers are algorithms that modify the weights and biases of the model in order to minimize the loss function; they control the network’s data-driven learning process. The AdamW optimizer is a modification of Adam that implements weight decay correctly, decoupled from the gradient update, and often generalizes better; it is particularly effective for transformer models. Given parameters θ, loss function L(θ), learning rate η, weight decay coefficient λ, exponential decay rates β1, β2 ∈ (0, 1) and a small constant ϵ to prevent division by zero, the AdamW optimizer’s general parameter update is given below in Equation 3.

θ(t+1) = θ(t) − η · m̂(t) / (√v̂(t) + ϵ) − η · λ · θ(t)    (3)

Here, θ(t) denotes the model parameters at time step t, η is the learning rate, m̂(t) is the bias-corrected first-moment estimate, v̂(t) is the bias-corrected second-moment estimate (its square root appears in the denominator), ϵ is a small constant added to the denominator for numerical stability and λ is the weight decay coefficient. We have performed rigorous experiments with different optimizers and found that AdamW performs much better on the SIPAKMED dataset. The reason is that SIPAKMED is a complex dataset with high inter-class similarity, and the AdamW optimizer excels with such complex, similar classes.
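A minimal sketch of the corresponding optimizer setup follows; the learning rate and weight decay match the values reported above, whereas the β and ϵ values are left at the PyTorch defaults, which is an assumption.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 5)        # stand-in for the CASPNet parameters

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                     # eta in Equation 3
    betas=(0.9, 0.999),          # beta1, beta2: decay rates of the moment estimates (PyTorch defaults)
    eps=1e-8,                    # epsilon: numerical-stability constant (PyTorch default)
    weight_decay=0.05,           # lambda: decoupled weight-decay coefficient
)
# Unlike Adam with L2 regularization, AdamW applies the -eta*lambda*theta term of
# Equation 3 directly to the weights rather than folding it into the gradient.
```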

4 Results and analysis

This section discusses the ablation analysis and conclusions pertaining to the experimental findings. The most widely used and recognized evaluation metrics in classification task studies are accuracy, recall, precision and F1-score (18, 19).

4.1 Ablation study

The suggested CASPNet (ViT+CSP+SPPF) architecture exhibits optimal performance across several assessment parameters based on the thorough ablation study findings, confirming the contribution of each of its components. Table 11 displays the ablation study results. With a test accuracy of 97.07%, the complete model outperforms all ablated variants: by 30.62 percentage points over the standalone ViT architecture (66.45%), 75.85 points over CSP+SPPF alone (21.22%) and 14.84 points over ViT+CSP without SPPF (82.23%). These findings show that each architectural element provides crucial complementary capabilities: the SPPF module aggregates multi-scale spatial features through pyramidal pooling, the CSP block improves gradient flow and feature reuse through cross-stage partial connections, and the vision transformer offers global context modeling through self-attention mechanisms. The ViT-only configuration shows poor test accuracy and severe underfitting despite having the fewest parameters (46,277) and the lowest computational cost (23.081M FLOPs), suggesting that pure transformer architectures lack sufficient inductive biases for effective visual feature extraction in this domain. The CSP+SPPF configuration, on the other hand, shows catastrophic overfitting despite moderate computational requirements (4.765G FLOPs), achieving only 21% test accuracy while requiring a total training time of 2273.911 secs; this suggests that the long-range dependencies required for robust classification cannot be captured by convolutional components alone. Although the ViT+CSP configuration requires the longest training time (9831.581 secs) and the largest computational cost (17.442G FLOPs) among the partial architectures, it achieves only fair performance (82.23% test accuracy) and still falls short of the complete CASPNet model.


Table 11. Ablation study results.

The complete ViT+CSP+SPPF model requires a total training time of 9972.122 secs and achieves superior test accuracy with comparable computational efficiency (17.731G FLOPs) and 90.76M total trainable parameters, indicating that the SPPF module improves convergence properties in addition to representational capacity. To ensure a fair comparison and to attribute performance differences solely to the architectural modules, all setups used the same hyperparameters (a learning rate of 0.001, the AdamW optimizer and 450 epochs). Hence, the complete ViT+CSP+SPPF architecture is the best configuration for this cervical cell classification task, striking the ideal balance between global context modeling, hierarchical feature extraction and multi-scale spatial aggregation, as demonstrated by the ablation study results.
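The trainable parameter counts in Table 11 can be reproduced by summing the model’s trainable tensors; the paper does not state which tool was used for FLOPs, so the profiler mentioned in the comment below is only one possible choice, not the authors’ method.

```python
import torch
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> int:
    """Total number of trainable parameters, as reported in Table 11."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

model = nn.Linear(768, 5)                 # stand-in for one of the ablated variants
print(count_trainable_params(model))      # 3845 for this toy module

# FLOPs can be estimated with a profiler such as fvcore, e.g.:
#   from fvcore.nn import FlopCountAnalysis
#   FlopCountAnalysis(model, torch.randn(1, 768)).total()
```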

4.2 Our hybrid model performance on SIPAKMED dataset

In our experimental study, we consider the performance metrics listed below in Table 12, which contains the equations corresponding to the evaluation parameters used. The first important measure of a classification model’s performance is accuracy, a simple way to evaluate the model’s overall performance. It is the proportion of correct predictions to the total number of predictions; hence, higher counts of true positives (TP) and true negatives (TN) lead to higher accuracy.


Table 12. Performance metrics used for experiments.

The next crucial performance metric of a classification model is precision, which focuses on the accuracy of positive predictions. It is the percentage of correctly predicted positives out of all predicted positives and is especially crucial when the cost of false positives is significant. In medical diagnosis, a false positive case might lead to improper treatment decisions, so precision plays an important role in assessing the model’s performance. Precision is also extremely important for imbalanced datasets in which one class greatly outnumbers the other, since accuracy alone cannot capture the overall performance of the model.

Recall, or sensitivity, is also a crucial performance parameter, especially in classification tasks. Recall focuses on the model’s ability to locate all real positive cases. In medical diagnosis problems, missing a positive instance (a false negative) has a significant cost, hence recall is essential: if a person who actually has cervical cancer is not detected as a cancer patient, the consequences can be severe. Because positive cases are uncommon in imbalanced datasets, recall is very important for judging a model’s performance; in these situations accuracy alone can be misleading, hence recall is also taken into account.

Recall and precision are often traded off in real-world situations: high precision can result in low recall and vice versa. The F1-score helps find a balance between the two. For imbalanced datasets, even a highly accurate model would be of little use if it consistently predicted the majority class; in these situations, the F1-score provides a more faithful representation of the model’s performance. It is defined as the harmonic mean of precision and recall.

A confusion matrix provides an overview of a classification algorithm’s performance. It reports the number of true positives, true negatives, false positives and false negatives, showing how well a classification model performs by comparing its predicted labels to the actual labels. Figure 4 below displays the confusion matrix generated by our proposed model on the SIPAKMED dataset.
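As an illustration of how these metrics and the confusion matrix can be computed from predicted and true labels, a short scikit-learn sketch follows; the paper does not state the exact tooling, so this is an assumption, and the two label arrays are purely illustrative.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# y_true / y_pred would be collected over the 409-image test set; two short
# illustrative arrays are used here instead.
y_true = np.array([0, 1, 2, 3, 4, 2, 1, 0])
y_pred = np.array([0, 1, 2, 3, 4, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)        # rows: actual class, columns: predicted class
acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)

print(cm)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} F1={f1:.4f}")
```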

Confusion matrix for classification showing five classes: abnormal dyskeratotic, abnormal koilocytic, benign metaplastic, normal parabasal, and normal superficial intermediate. The diagonal values, indicating correct predictions, are 82, 80, 75, 78, and 82 respectively. Off-diagonal values represent misclassifications. A color gradient from light blue to dark blue indicates increasing values.

Figure 4. Confusion Matrix (CM) on SIPAKMED dataset.

It shows 397 correctly classified cases out of 409 test cases from the SIPAKMED dataset. Our proposed CASPNet model demonstrates its robustness by overcoming challenges such as limited data, image variability and image quality problems, which indicates its usefulness for real-time cervical cancer diagnosis. As observed in Figure 4, our scratch-trained CASPNet classifier produces 3 false negatives for the Koilocytotic class, 5 for the Metaplastic class, 2 for the Parabasal class and 2 for the Superficial-Intermediate class, while correctly diagnosing all images of the Dyskeratotic class. Table 8 shows the performance metrics of the proposed model on the SIPAKMED dataset.

Figure 5 displays the ROC-AUC curve, which plots the True Positive Rate (TPR), or sensitivity, against the False Positive Rate (FPR), or (1 − specificity), at various classification thresholds. It is a technique for evaluating a model’s ability to discriminate between classes.

ROC curve graph for multi-class classification showing True Positive Rate versus False Positive Rate. It includes curves for classes 0 to 4. The overall AUC is 0.983, with individual class AUCs ranging from 0.95 to 0.99. A diagonal line represents random chance.

Figure 5. ROC-AUC curve on proposed CASPNet model using SIPAKMED dataset.

The following Equations 4, 5 denote the formulas corresponding to TPR and FPR respectively.

TPR = TP / (TP + FN)    (4)

Here, true positives are denoted by TP and false negatives by FN.

FPR = FP / (FP + TN)    (5)

Here, FP denotes false positives and TN denotes true negatives.
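The per-class ROC curves and AUC values of Figure 5 can be obtained from the model’s per-class probabilities in a one-vs-rest fashion; the sketch below uses scikit-learn and random scores purely for illustration, and the averaging scheme is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(1337)
y_true = np.repeat(np.arange(5), 20)              # 100 labels covering the five classes
y_score = rng.dirichlet(np.ones(5), size=100)     # stand-in per-class softmax probabilities

# Macro-averaged one-vs-rest AUC over the five classes.
auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")

# Per-class ROC curve (TPR vs FPR), e.g., for class 0, as plotted in Figure 5.
y_bin = label_binarize(y_true, classes=np.arange(5))
fpr, tpr, _ = roc_curve(y_bin[:, 0], y_score[:, 0])
print(f"macro one-vs-rest AUC: {auc:.3f}")
```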

The following Table 13 displays the per-class accuracy achieved by the proposed scratch-trained model on the SIPAKMED dataset. The Metaplastic category has the lowest accuracy due to its morphological ambiguity and its inter-class similarity with the Parabasal and Koilocytotic classes. Even with deep learning architectures, metaplastic cells are intrinsically difficult to discriminate because of their fluctuating nuclear-to-cytoplasmic ratio and irregular chromatin patterns.


Table 13. Per-class accuracy using the proposed scratch model on SIPAKMED dataset.

4.3 Explainability and interpretability

In computer vision, Grad-CAM (Gradient-weighted Class Activation Mapping) is a potent method for understanding and visualizing the reasons behind a convolutional neural network (CNN) prediction. Grad-CAM creates a heatmap, a data visualization in which values are represented by colors: red shows areas of high interest or high activation, whereas blue represents areas of low importance or low activation. This is accomplished by calculating the gradients of the target class score with respect to the feature maps of the final convolutional layer; these gradients indicate how significant each feature map is to the prediction. Grad-CAM therefore helps determine whether the model is concentrating on background noise or on pertinent features. In our approach, the SPPF layer satisfies the fundamental requirements for applying Grad-CAM, since it maintains spatial feature representations and differentiable paths during backpropagation even though it uses max-pooling operations. The figures below indicate questionable spots or abnormalities on Pap smear slide images, allowing medical professionals to identify potential cervical malignancies. In this sense, explainability is essential so that medical professionals may review the logic of the model and ensure that it aligns with medical knowledge.
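A minimal Grad-CAM sketch targeting the SPPF block is shown below; the hook-based implementation and the way the target layer is reached are assumptions rather than the authors’ exact code. For instance, a call such as grad_cam(model, image, model.sppf) (the sppf attribute name is hypothetical) would return a 224x224 heatmap that can be overlaid on the input cell image.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Hook the chosen layer, weight its activations by the pooled gradients of
    the target class score, and return a normalized heatmap plus the class."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(value=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(value=go[0]))

    logits = model(image)                                  # image: (1, 3, 224, 224)
    if class_idx is None:
        class_idx = int(logits.argmax(dim=1))
    model.zero_grad()
    logits[0, class_idx].backward()

    weights = grads["value"].mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
    cam = F.relu((weights * acts["value"]).sum(dim=1))         # weighted sum over channels
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False)  # upsample to input resolution
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]

    h1.remove(); h2.remove()
    return cam.squeeze(), class_idx
```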

Figure 6 shows the Grad-CAM results for sample input images from different classes of the SIPAKMED dataset. Figure 6A shows a single dyskeratotic squamous cell from the SIPAKMED dataset, with a high nucleus-to-cytoplasm (N:C) ratio, an irregular nuclear membrane and a hyperchromatic nucleus. The heatmap pays particular attention to the aberrant features of the cell’s nucleus and the surrounding dense, aberrantly keratinized cytoplasm, indicating that the CASPNet model focuses on abnormal nuclear features. Figures 6B–E present the other class images of the dataset and their corresponding Grad-CAM results with high confidence scores. It is evident from the outputs that our model successfully exploits the fundamental diagnostic cues that human pathologists use to detect cancerous cells. This robust spatial alignment demonstrates that the CASPNet model does not rely on artifacts or spurious background characteristics.

(A) Two images labeled Dyskeratotic Image and Grad-CAM, showing a cellular structure and a heat map overlay with high confidence of 0.9982. (B) Metaplastic Image and Grad-CAM, depicting a cellular structure and heat map with a confidence of 0.9805. (C) Koilocytotic Image and Grad-CAM, showcasing a cellular image and corresponding heat map, confidence score 0.9007. (D) Parabasal Image and Grad-CAM, illustrating a cellular structure and heat map, with 0.9815 confidence. (E) Superficial Intermediate Image and Grad-CAM, displaying a cell image and heat map, confidence score 0.9998.

Figure 6. Grad-CAM results on various class images of SIPAKMED dataset.

4.4 State-of-the-art study

On the SIPAKMED dataset, the suggested CASPNet model achieves an accuracy of 97.07% and a weighted average F1-score of 97% at epoch 402.

Furthermore, several previously proposed methods for categorizing Pap smear cervical cell images have been compared with the approach detailed in this paper. The experimental results of the current study are contrasted with those of previous approaches in Table 14. The effectiveness of the suggested approach is demonstrated by its strong performance across a variety of metrics; as indicated in Table 14, it performs better than, or is competitive with, current state-of-the-art techniques.


Table 14. Comparison with related work using cutting edge methods for SIPAKMED dataset.

4.5 Discussion

Our suggested architecture, which combines ViT, CSP and SPPF components, shows exceptional efficacy in capturing the morphological characteristics of cervical cells for the classification task, achieving 97.07% accuracy. Given that the model is trained from scratch, this performance is noteworthy and shows how well our architectural design decisions work. Our model attains a commendable accuracy of 97.07%, only a marginal 0.58% below the pretrained baseline of Maurya et al. (15) and 0.80% below the pretrained baseline of Basak et al. (10). Maurya et al. used the pretrained ViT/L32 and MobileNetV1 models, whereas Basak et al. applied an ensemble of the pretrained VGG16, ResNet50, InceptionV3 and DenseNet121 models; such pretrained models are already trained on the huge ImageNet database. Our carefully constructed scratch model, which includes ViT, CSP and SPPF elements, shows a strong capacity to recognize and categorize cervical cell morphological characteristics. Building the model architecture meticulously is itself a novel contribution, involving considerable human effort and time. In terms of computational complexity measured in GFLOPs, the suggested CASPNet model (17.731 GFLOPs) is comparable to the approach of Maurya et al. (15), which requires 15.581 GFLOPs, while needing significantly fewer computations than the ensemble-based method of Basak et al. (10), which requires a noticeably larger 28.261 GFLOPs. Table 15 summarizes the comparative experimental results across several important factors.


Table 15. Comparison with best SOTA works.

5 Conclusion and future work

Our experimental study represents a significant step forward in cervical cancer image classification using state-of-the-art deep learning techniques and vision transformer models. By combining self-attention, cross-stage partial network (CSP) blocks and spatial pyramid pooling fast (SPPF) layer components, our model architecture is adapted for image classification, reflecting the innovative changes made in this study and spurring an advancement in this area. Self-attention blocks of vision transformer models excel at capturing global contextual information within an input image and improve classification accuracy compared with CNN models. The CSP blocks in the architecture are well suited to classification tasks with limited resources, where speed and effectiveness are balanced; hence, they are appropriate for local feature extraction and real-time applications. In addition, objects in cervical cell images vary in size, so the SPPF layer performs multi-scale feature extraction and records contextual information at various receptive fields. By combining all these advantages in our proposed CASPNet model, we are able to comprehend the images more accurately and robustly. According to the results achieved by our proposed approach, it shows great promise for using Pap smear images to determine the extent of dysplasia present in cervical lesions as well as to reduce the time needed for manual observation.

In the future, we will focus on expanding the dataset further, including images that contain varying degrees of dysplasia while maintaining the overall accuracy. Since the metaplastic category has the lowest accuracy rate, we will concentrate on improving its accuracy. We will also consider combining our work on regularized ViT model variants with feature fusion integration components to build a robust yet efficient model. Additionally, investigating methods for utilizing multimodal data, such as clinical data and patient history, may enhance the accuracy and efficacy of cervical cancer diagnosis.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://www.kaggle.com/datasets/akshaykrishnan/sipakmed5.

Author contributions

JM: Conceptualization, Data curation, Formal Analysis, Writing – original draft, Writing – review & editing. RC: Conceptualization, Data curation, Formal Analysis, Writing – original draft, Writing – review & editing. MK: Formal Analysis, Methodology, Project administration, Writing – original draft, Writing – review & editing. EL-C: Formal Analysis, Methodology, Validation, Writing – original draft, Writing – review & editing. MS: Formal Analysis, Methodology, Validation, Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This work is supported by the Universidad Católica de la Santísima Concepción (UCSC) 2025 through APC.

Acknowledgments

Research supported by Red Sistemas Inteligentes y Expertos Modelos Computacionales Iberoamericanos (SIEMCI), project number 522RT0130 in Programa Iberoamericano de Ciencia y Tecnologia para el Desarrollo (CYTED).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Haryanto T, Sitanggang IS, Agmalaro MA, and Rulaningtyas R. 2020 International conference on computer engineering, network, and intelligent multimedia (CENIM). IEEE (2020). p. 34–8.

Google Scholar

2. Tripathi A, Arora A, and Bhan A. Classification of cervical cancer using Deep Learning Algorithm. In: 2021 5th international conference on intelligent computing and control systems (ICICCS). IEEE (2021). p. 1210–8.

Google Scholar

3. Dhawan S, Singh K, and Arora M. Cervix image classification for prognosis of cervical cancer using deep neural network with transfer learning. EAI Endorsed Trans Pervasive Health Technol. (2021) 7. doi: 10.4108/eai.12-4-2021.169183

Crossref Full Text | Google Scholar

4. Basak H, Kundu R, Chakraborty S, and Das N. Cervical cytology classification using PCA and GWO enhanced deep features selection. SN Comput Sci. (2021) 2:369. doi: 10.1007/s42979-021-00741-2

Crossref Full Text | Google Scholar

5. Mousser W, Ouadfel S, Taleb-Ahmed A, and Kitouni I. IDT: an incremental deep tree framework for biological image classification. Artif Intell Med. (2022) 134:102392. doi: 10.1016/j.artmed.2022.102392

PubMed Abstract | Crossref Full Text | Google Scholar

6. Alquran H, Mustafa WA, Qasmieh IA, Yacob YM, Alsalatie M, Al-Issa Y, et al. Cervical cancer classification using combined machine learning and deep learning approach. Comput Mater Contin. (2022) 72:5117–34. doi: 10.32604/cmc.2022.025692

Crossref Full Text | Google Scholar

7. Liu W, Li C, Xu N, Jiang T, Rahaman MM, Sun H, et al. CVM-Cervix: A hybrid cervical Pap-smear image classification framework using CNN, visual transformer and multilayer perceptron. Pattern Recognition. (2022) 130:108829. doi: 10.1016/j.patcog.2022.108829

Crossref Full Text | Google Scholar

8. Pacal I and Kılıcarslan S. Deep learning-based approaches for robust classification of cervical cancer. Neural Computing Appl. (2023) 35:18813–28. doi: 10.1007/s00521-023-08757-w

Crossref Full Text | Google Scholar

9. Maurya R, Pandey NN, and Dutta MK. VisionCervix: Papanicolaou cervical smears classification using novel CNN-Vision ensemble approach. Biomed Signal Process Control. (2023) 79:104156. doi: 10.1016/j.bspc.2022.104156

Crossref Full Text | Google Scholar

10. Deo BS, Pal M, Panigrahi PK, and Pradhan A. CerviFormer: A pap smear-based cervical cancer classification method using cross-attention and latent transformer. Int J Imaging Syst Technol. (2024) 34:e23043. doi: 10.1002/ima.23043

Crossref Full Text | Google Scholar

11. Mondal J, Chatterjee R, and Gourisaria MK. Vision Transformer based approach for accurately detecting cervical cancer. In: 2025 5th international conference on expert clouds and applications (ICOECA). IEEE (2025). p. 500–6.

Google Scholar

12. SIPAKMED dataset. Available online at: https://www.kaggle.com/datasets/akshaykrishnan/sipakmed5 (Accessed April 11, 2025).

Google Scholar

14. Win KY, Choomchuay S, Hamamoto K, Raveesunthornkiat M, Rangsirattanakul L, and Pongsawat S. Computer aided diagnosis system for detection of cancer cells on cytological pleural effusion images. BioMed Res Int. (2018) 1:6456724. doi: 10.1155/2018/6456724

PubMed Abstract | Crossref Full Text | Google Scholar

15. Hussain E, Mahanta LB, Das CR, and Talukdar RK. A comprehensive study on the multi-class cervical cancer diagnostic prediction on pap smear images using a fusion-based decision from ensemble deep convolutional neural network. Tissue Cell. (2020) 65:101347. doi: 10.1016/j.tice.2020.101347

PubMed Abstract | Crossref Full Text | Google Scholar

16. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv. (2020) 7:5. doi: 10.48550/arXiv.2010.11929

Crossref Full Text | Google Scholar

17. Plissiti ME, Dimitrakopoulos P, Sfikas G, Nikou C, Krikoni O, and Charchanti A. Sipakmed: A new dataset for feature and image based classification of normal and pathological cervical cells in pap smear images. In: 2018 25th IEEE international conference on image processing (ICIP). IEEE (2018). p. 3144–8.

Google Scholar

18. Behera B, Kumaravelan G, and Kumar P. Performance evaluation of deep learning algorithms in biomedical document classification. In: 2019 11th international conference on advanced computing (ICoAC). IEEE (2019). p. 220–4.

Google Scholar

19. Vujović Ž. Classification model evaluation metrics. Int J Advanced Comput Sci Appl. (2021) 12:599–606. doi: 10.14569/IJACSA.2021.0120670

Crossref Full Text | Google Scholar

Keywords: medical image analysis, cervical cancer, vision transformer, hybrid model, classification of images

Citation: Mondal J, Chatterjee R, Kumar Gourisaria M, Sahni M and León-Castro E (2025) Cervical cancer classification using a novel hybrid approach. Front. Oncol. 15:1703772. doi: 10.3389/fonc.2025.1703772

Received: 11 September 2025; Accepted: 19 November 2025; Revised: 17 November 2025;
Published: 04 December 2025.

Edited by:

Haoyuan Chen, ShanghaiTech University, China

Reviewed by:

Xuanshuo Fu, Autonomous University of Barcelona, Spain
Wenhe Bai, Macao Polytechnic University, Macao SAR, China

Copyright © 2025 Mondal, Chatterjee, Kumar Gourisaria, Sahni and León-Castro. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Mahendra Kumar Gourisaria, mkgourisaria2010@gmail.com; Ernesto León-Castro, eleon@ucsc.cl

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.