Machine learning-based prediction of nasopharyngeal carcinoma risk: a clinical approach

Yang, Wenhui; Zhou, Chengyan; Tang, Minzhong; Huang, Zhiqiang; Zhu, Haiqing; Li, Shangyang; Huang, Huipin; Liang, Yujuan; Pan, Wenting; Yuan, Yulin

doi:10.3389/fimmu.2025.1648648

ORIGINAL RESEARCH article

Front. Immunol., 27 November 2025

Sec. Cancer Immunity and Immunotherapy

Volume 16 - 2025 | https://doi.org/10.3389/fimmu.2025.1648648

This article is part of the Research Topic: Advances in the Treatment of Nasopharyngeal Cancer

Wenhui Yang¹

Chengyan Zhou²

Minzhong Tang³

Zhiqiang Huang⁴

Haiqing Zhu⁵

Shangyang Li¹

Huipin Huang¹

Yujuan Liang¹

Wenting Pan¹

Yulin Yuan¹*

¹Department of Laboratory Medicine, The People’s Hospital of Guangxi Zhuang Autonomous Region, Nanning, Guangxi Zhuang Autonomous Region, China
²Department of Dermatology, The People’s Hospital of Guangxi Zhuang Autonomous Region, Nanning, Guangxi Zhuang Autonomous Region, China
³Key Laboratory of Nasopharyngeal Carcinoma Molecular Epidemiology, Wuzhou Red Cross Hospital, Wuzhou, Guangxi Zhuang Autonomous Region, China
⁴Department of Laboratory Medicine, The First People’s Hospital of Fangchenggang City, Fangchenggang, Guangxi Zhuang Autonomous Region, China
⁵Department of Laboratory Medicine, The People’s Hospital of Yongning District, Nanning, Guangxi Zhuang Autonomous Region, China

Background: Early screening and risk assessment of nasopharyngeal carcinoma (NPC) are essential for timely diagnosis and improved treatment outcomes. This study aimed to develop and evaluate predictive models using logistic regression and machine learning (ML) techniques to identify significant risk factors for NPC across various healthcare settings.

Methods: A total of 569 participants were enrolled in the internal training and validation cohorts, and 160 were enrolled in the independent external validation cohort. Several Epstein-Barr virus (EBV)-related antibodies and serological and hematological markers were assessed to identify discriminatory features between NPC and non-NPC individuals. Feature selection was performed using least absolute shrinkage and selection operator (LASSO) regression, recursive feature elimination cross-validation (REFCV), and support vector machine recursive feature elimination cross-validation (SVMREFCV). The performance of nine machine learning (ML) models (logistic regression (LR), eXtreme Gradient Boosting (XGBoost), light gradient boosting machine (LightGBM), random forest (RF), AdaBoost, multilayer perceptron (MLP), decision tree (DT), gradient boosting decision tree (GBDT), and Gaussian Naïve Bayes (GNB)) was evaluated using the area under the curve (AUC), accuracy (ACC), sensitivity (SE), and specificity (SP) in both the training and validation cohorts. Model calibration was assessed using calibration plots and clinical utility was evaluated through decision curve analysis (DCA).

Results: Five key predictors (nuclear antigen 1 immunoglobulin A (NTA1-IgA), viral capsid antigen immunoglobulin A (VCA-IgA), Rta protein immunoglobulin A (Rta-IgA), platelet (PLT) count, and lymphocyte (LM) count) were consistently identified across the three feature selection algorithms. The XGBoost model achieved the highest performance in the internal training (AUC = 0.999) and validation cohorts (AUC = 0.995); it also outperformed the other models in the independent external validation cohort, with an AUC of 0.956. Calibration and DCA for both the internal and independent external cohorts confirmed the strong clinical utility of the XGBoost model. An online tool also enabled real-time NPC risk prediction based on the five selected biomarkers.

Conclusion: This study presents a robust and interpretable ML-based approach for NPC risk prediction, integrating EBV serology and hematological markers. The model demonstrated high predictive accuracy and potential for population-based screening, providing an efficient tool for early NPC detection and intervention planning.

1 Introduction

Nasopharyngeal carcinoma (NPC) is a malignant tumor originating in the nasopharynx, an anatomical region located posterior to the nasal cavity and superior to the oropharynx (1). Although NPC is relatively rare in Western countries, it exhibits a markedly elevated incidence in certain parts of Asia, particularly Southeast Asia and Southern China (2), where it constitutes a significant public health concern (1). Specific geographic regions, such as Guangdong Province in southern China, demonstrate some of the highest incidence rates globally (3). Considerable variation in NPC risk has been observed among different ethnic subpopulations within Asia (4). Genetic susceptibility (1, 5–9), Epstein-Barr virus (EBV) infection (1, 5, 7, 10–12), environmental and lifestyle factors (1, 3, 7, 13), and socioeconomic and demographic factors (1, 13, 14) appear to be the principal drivers of the high incidence of NPC in certain Asian populations. Therefore, understanding how genetic variability modulates EBV susceptibility and NPC development could pave the way for personalized risk assessment and preventive measures.

Many cases are diagnosed at advanced stages due to the asymptomatic nature of the condition in its early stages, which limits treatment options and leads to poor prognoses. The deep anatomical location of the nasopharynx, coupled with the nonspecific nature of early symptoms such as mild nasal congestion, headaches, or neck swelling, often delays diagnosis of NPC. Given these challenges, routine screening is crucial for improving early detection, facilitating timely intervention, and enhancing clinical outcomes (15).

Current screening approaches for NPC include serological biomarker detection, imaging techniques, and nasal endoscopy. Epstein-Barr virus (EBV) deoxyribonucleic acid (DNA) levels (8), along with viral capsid antigen-immunoglobulin A (VCA-IgA) and early antigen-immunoglobulin A (EA-IgA) antibodies, are widely recognized as key biomarkers for NPC risk assessment, offering a noninvasive approach to early detection (10, 16). In addition, alternative screening technologies, such as polarographic analysis, have been explored for their potential in detecting biochemical changes associated with NPC (17). Polarography, an electrochemical technique, has been employed to assess oxidative stress markers (18, 19), metabolic alterations (20, 21), and tumor-related enzymatic changes in biological fluids (22, 23), and may serve as an additional source of biomarkers for NPC screening.

Studies have shown that the polarographic reduction of metal ions and nitro compounds can help identify redox activity changes in cancerous cells, potentially aiding in NPC detection. Ghorbian and Ghorbian (2023) reported that electrochemical signals in blood samples from patients with NPC exhibited distinct redox behaviors compared to those of healthy individuals, suggesting that polarographic methods could complement existing serological screening techniques (24). Furthermore, polarographic detection of EBV-associated metabolic changes may enhance the sensitivity and specificity of NPC risk assessment models, providing a novel avenue for noninvasive diagnostics (25).

Given the growing role of machine learning (ML) in NPC prediction, integrating polarographic biomarker analysis with traditional serological and epidemiological data could further refine risk stratification models (26). This study aimed to develop a comprehensive predictive model that incorporates EBV-DNA, IgA antibodies, smoking history, family history of NPC, and emerging electrochemical screening techniques to enhance early NPC detection and support clinical decision making (27).

A major challenge in NPC diagnosis is the reliance on nasal endoscopy and biopsy, which are invasive procedures typically performed only in symptomatic individuals. However, serological markers such as EBV-DNA and IgA antibodies (VCA-IgA and EA-IgA) have emerged as promising noninvasive biomarkers for early risk assessment (11, 28, 29). Studies have demonstrated that elevated EBV-DNA levels strongly correlate with NPC development; however, these markers remain underutilized in clinical practice, particularly in primary healthcare settings. Additionally, lifestyle factors such as smoking, genetic predisposition, and environmental exposure influence NPC risk, but their precise contributions require further quantification through predictive modeling (30).

Despite advancements in NPC screening, no widely adopted risk stratification model currently integrates serological, clinical, and epidemiological variables to guide early interventions. Existing screening strategies often lack sensitivity and specificity, resulting in missed cases or unnecessary invasive procedures (31). Developing an accurate and clinically interpretable NPC risk prediction model could significantly enhance screening protocols by identifying high-risk individuals who would benefit the most from further diagnostic testing, including endoscopy and imaging (30).

Accurate risk prediction is essential for the early detection of NPC, timely intervention, and improved patient outcomes. Prognosis for early-stage NPC is generally favorable, with significantly higher survival rates compared to advanced-stage disease. Risk prediction models are designed to identify individuals at elevated risk, thereby facilitating targeted screening and preventive counseling efforts. To address this gap, our study leveraged machine learning (ML) to develop a predictive model for NPC risk assessment. By utilizing least absolute shrinkage and selection operator (LASSO) regression for feature selection and logistic regression for modeling, we analyzed a comprehensive dataset of patients from diverse healthcare settings (32). This approach aims to identify key risk predictors and provide clinicians with a practical, data-driven screening tool to facilitate early NPC detection. Additionally, the integration of decision curve analysis (DCA) enhances the clinical relevance of the model by assessing its net benefit in guiding clinical decision-making. Through this research, we aimed to bridge the gap in NPC early screening, offering a robust framework for risk stratification, improved diagnostic precision, and timely intervention.

2 Materials and methods

2.1 Study design and eligibility criteria

The present retrospective study was conducted to develop and validate an ML-based predictive model for the early screening and risk assessment of NPC. A total of 1,000 individuals were initially enrolled for the internal training and validation cohorts from The People’s Hospital of Guangxi Zhuang Autonomous Region between January 2018 and June 2022. The study population included individuals undergoing routine health checkups, those presenting with NPC-related symptoms, and high-risk individuals, such as those with a family history of NPC or heavy smoking habits.

Participants were included if they met the following criteria: (1) aged ≥ 18 years, (2) provided informed consent for data collection and analysis, and (3) had complete clinical and serological records. The exclusion criteria were as follows: (1) a prior diagnosis of NPC, (2) the presence of severe comorbid conditions (e.g., advanced cancer, chronic liver disease, or autoimmune disorders), and (3) incomplete or missing key clinical data. After applying the inclusion and exclusion criteria, 431 participants were excluded, resulting in a final study population of 569 participants in the internal training and validation cohorts. The external validation cohort included 160 participants from Wuzhou Red Cross Hospital, the First People’s Hospital of Fangchenggang City, and the People’s Hospital of Yongning District, Nanning, enrolled between October 2022 and December 2024.

2.2 Data collection

Participants were categorized into two groups based on diagnostic findings: NPC (n = 234, 41.12%) and non-NPC (n = 335, 58.88%). Demographic data, including age and gender, were recorded for all participants. Clinical assessments included EBV-related antibodies (nuclear antigen 1 immunoglobulin A (NTA1-IgA), Rta protein immunoglobulin A (Rta-IgA), viral capsid antigen immunoglobulin A (VCA-IgA), early antigen immunoglobulin A (EA-IgA), Zta protein immunoglobulin A (Zta-IgA)), liver function tests (alanine aminotransferase (ALT), aspartate aminotransferase (AST), total protein (TP), albumin (ALB), total bilirubin (TB), direct bilirubin (DB), cholinesterase (CHE), adenosine deaminase (ADA), alkaline phosphatase (ALP), total bile acid (TBA), gamma-glutamyl transferase (GGT), prealbumin (PA)), renal function markers (urea, creatinine (CREA), uric acid (UA)), glucose and lipid metabolism (glucose (GLU), triglycerides (TG), high density lipoprotein (HDL), low density lipoprotein (LDL), apolipoprotein A-I (APOA1), apolipoprotein B (APOB), lipoprotein A (LPA)), and hematology (white blood cell (WBC), red blood cell (RBC), hemoglobin (HB), hematocrit (HCT), mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), platelet count (PLT), red cell distribution width (RDW), platelet distribution width (PDW), and lymphocyte (LM)). Feature selection was performed using least absolute shrinkage and selection operator (LASSO) regression, recursive feature elimination cross-validation (REFCV), and support vector machine recursive feature elimination cross-validation (SVMREFCV).

2.3 Machine learning

To identify significant predictors, least absolute shrinkage and selection operator (LASSO) regression was applied using the ‘glmnet’ package in R. This method filtered out irrelevant or redundant features, retaining only the most informative variables for NPC risk prediction. Following feature selection, predictive models were developed using both logistic regression and multiple machine learning (ML) algorithms. The dataset was randomly split into a training set (70%) and a validation set (30%) using RStudio (version 2025.05.0 + 496) with the ‘ggplot2’ package. Because the class distribution between the NPC and non-NPC participants was moderately imbalanced (41.12% vs 58.88%), class-weight adjustments were applied during training to ensure balanced learning across classes, and sensitivity and specificity were evaluated jointly to assess balanced performance. In future studies, advanced resampling methods, such as the Synthetic Minority Oversampling Technique (SMOTE) and stratified cross-validation, will be explored to further mitigate potential class imbalance effects. The following ML models were evaluated: logistic regression (for interpretability and clinical application), eXtreme Gradient Boosting (XGBoost), light gradient boosting machine (LightGBM), random forest (RF), decision tree (DT), AdaBoost, multilayer perceptron (MLP), Gaussian Naïve Bayes (GNB), and gradient boosting decision tree (GBDT). Model calibration was conducted, and predictive performance was evaluated using area under the curve (AUC), accuracy (ACC), sensitivity (SE), and specificity (SP) in both training and validation cohorts. All statistical analyses were performed using R version 4.2.3 and Python version 3.11.4.
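The 70/30 split and the class-weight adjustment described above can be illustrated with a minimal Python sketch. It is not the authors' code: the stratification by class, the helper names `stratified_split` and `balanced_class_weights`, and the "balanced" weighting rule (n_samples / (n_classes × class_count)) are assumptions made for illustration, since the text does not specify the weighting scheme. Only the class counts (234 NPC, 335 non-NPC) and the seed value 1234 come from the article.

```python
import random
from collections import Counter


def stratified_split(labels, train_fraction=0.7, seed=1234):
    """Return (train_idx, valid_idx) with about the same class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        valid_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(valid_idx)


def balanced_class_weights(labels):
    """Assumed 'balanced' rule: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Class counts from the text: 234 NPC (label 1) and 335 non-NPC (label 0).
labels = [1] * 234 + [0] * 335
train_idx, valid_idx = stratified_split(labels)
train_labels = [labels[i] for i in train_idx]
class_weights = balanced_class_weights(train_labels)

print(len(train_idx), len(valid_idx))
print(class_weights)  # the minority class (NPC) receives the larger weight
```

Under this rule the NPC class is weighted more heavily than the non-NPC class, so misclassifying an NPC case costs more during training; many libraries accept such a dictionary through a class-weight or sample-weight argument.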

2.4 Sample size and power analysis

Our primary performance metric was the area under the ROC curve (AUC). A post hoc power analysis (two-sided α = 0.05) using R indicated that the internal cohort (n = 569; NPC cases = 234, non-NPC = 335) provides >99% power to detect an AUC ≥ 0.80 versus the null hypothesis AUC = 0.50. The independent external cohort (n = 160; NPC = 67, non-NPC = 93) provides ~95% power to detect an AUC ≥ 0.80 versus AUC = 0.50. In addition, model complexity was constrained relative to events: with five final predictors, the events-per-variable (EPV) was ~47 (234/5), exceeding conventional EPV recommendations (≥10–20) for logistic/ML models and reducing the risk of overfitting. For reproducibility, we fixed the random seed (set.seed(1234)) and reported the package versions and key hyperparameters in Section 2.3. In future prospective studies, we plan to determine the sample size a priori based on the anticipated AUC and desired confidence interval width to pre-specify precision.
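The events-per-variable figure quoted above is simple arithmetic, shown here as a short Python sketch. The function name `events_per_variable` is hypothetical, and the second calculation (restricting to an assumed 164 NPC cases in a stratified 70% training split) is an illustrative assumption rather than a number reported in the article.

```python
def events_per_variable(n_events, n_predictors):
    """EPV = number of outcome events divided by number of candidate predictors."""
    return n_events / n_predictors


# Values from the text: 234 NPC cases and five final predictors.
epv_internal = events_per_variable(234, 5)

# Assumption for illustration: about 164 NPC cases fall in a stratified 70% training split.
epv_training_only = events_per_variable(164, 5)

print(round(epv_internal, 1), round(epv_training_only, 1))
```

Both values stay above the conventional range of 10 to 20 cited in the text.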

2.5 Statistical analysis

Statistical analyses and ML model training were performed using RStudio (version 2025.05.0 + 496) with the ggplot2 package. Descriptive statistics were reported as means ± standard deviations (SD) for continuous variables and proportions for categorical variables. Between-group comparisons were conducted using independent t-tests for continuous variables and chi-square tests for categorical variables. Statistical significance was set at p < 0.05.
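As a rough illustration of the two between-group tests named above, the following Python sketch computes a pooled-variance t statistic and a Pearson chi-square test for a 2×2 table using only the standard library. All numbers are hypothetical and are not study data; the equal-variance form of the t-test and the absence of a continuity correction are assumptions, because the article does not state which variants were used.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical platelet counts (10^9/L) for two small groups.
t_stat = pooled_t_statistic([310, 295, 330, 280, 305], [250, 240, 265, 255, 245])

# Hypothetical 2x2 table: rows = NPC / non-NPC, columns = male / female.
chi2, p_value = chi_square_2x2(160, 74, 180, 155)

print(round(t_stat, 2), round(chi2, 2), p_value < 0.05)
```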

3 Results

3.1 The baseline characteristics of enrolled participants

A total of 569 participants were included in the internal training and validation cohort, comprising 234 (41.12%) individuals in the NPC-positive group and 335 (58.88%) in the non-NPC group. All five EBV-related antibodies (NTA1, Rta-IgA, VCA-IgA, EA-IgA, and Zta-IgA) showed statistically significant differences between the two groups (p < 0.001). Similarly, the levels of total protein (TP), albumin (ALB), prealbumin (PA), and uric acid (UA) differed significantly (p < 0.001). Additional biomarkers, including HDL, APOA1, APOB, RBC, HB, HCT, PLT, RDW, and LM, also demonstrated significant differences between the two groups (p < 0.001) (Table 1).

Table 1

Table 1. The baseline clinical characteristics of included patients.

3.2 Feature selection correlated with NPC

Key predictors associated with NPC risk were identified using LASSO regression combined with 10-fold cross-validation (Figure 1). The most relevant predictors included NTA1, VCA, Rta, PLT, and LM. ROC analysis demonstrated that VCA had the highest AUC value (0.977), followed by NTA1 (AUC = 0.972). Rta (AUC = 0.83) and LM (AUC = 0.84) had moderate AUC values, while PLT had the lowest AUC value (0.633) (Figure 2).
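The single-marker ROC analysis can be understood through the rank interpretation of the AUC: the probability that a randomly chosen NPC case has a higher marker value than a randomly chosen non-NPC participant, with ties counted as one half. The Python sketch below applies this definition to two hypothetical markers; the values are invented for illustration and do not reproduce the AUCs reported above.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie), over all case/control pairs."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical marker values (not study data): first list = NPC, second = non-NPC.
strong_marker = ([3.1, 2.8, 4.0, 1.9], [0.4, 0.9, 1.2, 2.0])
weak_marker = ([310, 250, 280, 230], [240, 260, 220, 290])

auc_strong = auc_from_scores(*strong_marker)
auc_weak = auc_from_scores(*weak_marker)
print(auc_strong, auc_weak)  # 0.9375 0.625
```

A marker whose distributions overlap heavily between groups, like the second one, yields an AUC close to 0.5 even when its group means differ.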

Figure 1

Panel A shows a coefficient path plot for multiple predictors, with each line representing a predictor's coefficient as lambda changes. The plot indicates varying impacts of predictors on the response variable. Panel B displays a binomial deviance versus log(lambda) plot, with red dots marking deviance values and vertical dashed lines suggesting optimal lambda choices. The graph demonstrates the trade-off between complexity and model fit.

Figure 1. Least absolute shrinkage and selection operator (LASSO) regression analysis and 10-fold cross-validation for selecting factors associated with NPC. (A) LASSO coefficient path; (B) LASSO cross-validation curve.

Figure 2

Receiver Operating Characteristic (ROC) curve illustrating the performance of five models: LM (AUC=0.84), NTA1 (AUC=0.972), PLT (AUC=0.633), Rta (AUC=0.83), and VCA (AUC=0.977). Sensitivity is plotted against 1-specificity. VCA exhibits the highest AUC, indicating superior performance.

Figure 2. ROC curves for the 5 selected predictors: NTA1, Rta, VCA, PLT, and LM.

3.3 Features identified using ML algorithms

Three ML algorithms - LASSOCV, REFCV, and SVMREFCV - were used to identify biomarkers associated with NPC (Figures 3A–C, respectively). A Venn diagram was generated using RStudio to illustrate the overlapping features (Figure 3D). Five features (NTA1, VCA, Rta, PLT, and LM) were consistently selected across all three methods, indicating that they were the factors most strongly associated with NPC risk.
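The consensus step behind the Venn diagram amounts to a set intersection, sketched below in Python. The five shared features come from the text; the sixth LASSOCV feature is not named in this section, so `OTHER_MARKER` is a hypothetical placeholder.

```python
# Features selected by each method; "OTHER_MARKER" is a hypothetical placeholder
# for the sixth LASSOCV feature, which is not named in the text.
selected = {
    "LASSOCV": {"NTA1", "VCA", "Rta", "PLT", "LM", "OTHER_MARKER"},
    "REFCV": {"NTA1", "VCA", "Rta", "PLT", "LM"},
    "SVMREFCV": {"NTA1", "VCA", "Rta", "PLT", "LM"},
}

# Consensus features are those chosen by every method.
consensus = set.intersection(*selected.values())

# Features chosen by a method but not shared by all methods.
method_specific = {name: feats - consensus for name, feats in selected.items()}

print(sorted(consensus))
print(method_specific)
```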

Figure 3

Panel A: Bar chart showing coefficients in a Lasso model for six features, with VCA having the highest positive value. Panel B and C: Line graphs depicting cross-validation scores against the number of selected features, both improving with more features. Panel D: Venn diagram comparing feature selection across LASSO, REFCV, and SVMRFE, showing shared and unique features among the methods.

Figure 3. Identification of characteristic markers. (A) Six markers were identified using the LASSOCV algorithm; (B) five markers were identified using the recursive feature elimination cross-validation (REFCV) algorithm; (C) five markers were identified using the support vector machine recursive feature elimination with cross-validation (SVMREFCV) algorithm; (D) Venn plot of markers for three machine-learning algorithms.

3.4 Model evaluations

The predictive performance of nine ML models was assessed in both training and validation cohorts (Table 2; Figure 4). The XGBoost model demonstrated superior performance, achieving an AUC of 0.999, sensitivity of 0.985, and specificity of 0.999 in the training cohort. In the validation cohort, the XGBoost model maintained the highest AUC (0.995), ACC (0.959), SE (0.94), and SP (0.973). Calibration plots and decision curve analysis further supported the outstanding performance and clinical utility of the XGBoost model. Further analysis demonstrated the robust generalization of the XGBoost model. As shown in Figures 5A–D, the AUC values for the test cohort (AUC = 0.993) and validation cohort (AUC = 0.994) were only slightly lower than that of the training cohort (AUC = 1.000), suggesting high model fidelity. Although these near-perfect AUC values indicate excellent discriminative ability, they may also reflect potential overfitting. To minimize this risk, we applied rigorous feature selection (LASSO, REFCV, and SVMREFCV), used an independent external validation cohort, and conducted calibration and decision curve analyses to evaluate model reliability. The calibration plot and decision curve analysis of the XGBoost model (Figures 5E, F) further confirmed the strong fit and high clinical utility of the model.
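For readers less familiar with the threshold-based metrics reported above, the following Python sketch derives accuracy, sensitivity, and specificity from a confusion matrix. The counts are hypothetical, chosen only to give values of a similar magnitude to those in the text; they are not taken from Table 2.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy (ACC), sensitivity (SE), and specificity (SP) from confusion counts."""
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "SE": tp / (tp + fn),   # true-positive rate among NPC cases
        "SP": tn / (tn + fp),   # true-negative rate among non-NPC participants
    }


# Hypothetical validation-set counts: 70 NPC cases and 101 non-NPC participants.
metrics = classification_metrics(tp=66, fn=4, tn=98, fp=3)
print({name: round(value, 3) for name, value in metrics.items()})
```

Unlike the AUC, these three metrics depend on the probability cut-off used to call a participant positive, so they should be read together with the chosen threshold.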

Table 2

Table 2. Diagnostic efficacy of nine classifiers in the internal training and validation cohorts.

Figure 4

Five graphs present model performance metrics. Graph A (ROC curve for training) and Graph B (ROC curve for validation) show model sensitivity vs. one-specificity. Graph C illustrates decision curve analysis by threshold probability. Graph D shows a calibration curve with fraction of positives vs. mean predicted value. Graph E is a forest plot of AUC scores for different models. Each graph includes distinct colored lines corresponding to different learning models, like XGBoost, Logistic, and others, displaying statistical performance.

Figure 4. Performance comparison between multiple models. (A) Receiver operating characteristic (ROC) curve of the training cohort; (B) ROC curve of the validation cohort; (C) decision curve of multiple machine-learning (ML) models; (D) calibration curve of different ML models; (E) forest plot of each area under the curve (AUC) score.

Figure 5

Panel A shows the ROC curve for training with an AUC of 1.0. Panel B displays the ROC curve for validation with an AUC of 0.994. Panel C presents the ROC curve for testing with an AUC of 0.993. Panel D illustrates the GridSearchCV learning curve, with consistent performance for training and validation sets. Panel E shows calibration plots with the XGBoost curve and perfect calibration line. Panel F displays a test decision curve comparing XGBoost, Treat None, and Treat All strategies, focusing on mean net benefit and threshold probability.

Figure 5. Performance of the XGBoost prediction model across cohorts. (A–C) Receiver operating characteristic (ROC) curves for the training, validation, and testing cohorts, respectively, showing sensitivity vs. 1 – specificity. (D) GridSearchCV learning curve illustrating AUC convergence for training and validation sets across sample sizes. (E) Calibration (reliability) curve comparing predicted probabilities with observed outcomes; dashed line indicates perfect calibration. (F) Decision curve analysis (DCA) demonstrating the net clinical benefit across threshold probabilities. All axes are consistently labeled; units represent proportions or probabilities (0–1).

3.5 External validation of the XGBoost model

A total of 160 participants from three other centers were included in the independent external validation cohort, comprising 93 (58.13%) individuals in the non-NPC group and 67 (41.88%) in the NPC group (Table 3). The XGBoost model maintained the highest predictive performance with an AUC of 0.956 (Table 4; Figure 6), and the decision curve indicated a strong clinical benefit (Figure 6).
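The decision curve compares the model's net benefit with the "Treat All" and "Treat None" strategies at each threshold probability. A minimal Python sketch of the standard net-benefit formula, NB = TP/n − (FP/n) × pt/(1 − pt), is given below. The labels and predicted probabilities are hypothetical toy values, not data from the external cohort, and the function names are illustrative.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on predictions with probability >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every participant is treated as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical toy data: 4 NPC cases (1) and 6 non-NPC participants (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.95, 0.80, 0.65, 0.30, 0.40, 0.20, 0.10, 0.05, 0.15, 0.02]

threshold = 0.25
nb_model = net_benefit(y_true, y_prob, threshold)
nb_all = net_benefit_treat_all(y_true, threshold)
nb_none = 0.0  # "Treat None" has zero net benefit by definition

print(round(nb_model, 4), round(nb_all, 4), nb_none)
```

A model shows clinical benefit over a range of thresholds when its curve lies above both reference strategies across that range.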

Table 3

Table 3. Characteristics of external validation cohort.

Table 4

Table 4. Diagnostic efficacy of nine classifiers in the external validation cohorts.

Figure 6

Panel A shows a decision curve with mean net benefit versus threshold probability. It compares XGBoost, Treat None, and Treat All strategies. Panel B displays an ROC curve plotting sensitivity versus 1-specificity, with a test set ROC curve achieving an AUC of 0.956 and confidence interval of 0.922 to 0.991.

Figure 6. Performance of the XGBoost model in the external validation cohort. (A) Decision curve analysis (DCA) illustrating the net clinical benefit of the XGBoost model compared with “Treat None” and “Treat All” strategies across threshold probabilities (%). (B) Receiver operating characteristic (ROC) curve showing model discrimination with an AUC of 0.956 (95% CI 0.922–0.991); axes display sensitivity vs. 1 – specificity.

3.6 Model interpretation with SHAP

To quantify the contribution of each biomarker to the XGBoost model, SHAP (SHapley Additive exPlanations) values were computed for all participants in the validation cohort. The global mean SHAP plot (Figures 7A, B) ranked VCA-IgA and NTA1-IgA as the two most influential features, followed by lymphocyte count, platelet count, and Rta-IgA. Dependence plots revealed clear dose–response relationships: higher VCA-IgA titers systematically increased the predicted probability, whereas elevated lymphocyte counts and Rta-IgA exerted a protective effect. A dependence plot for an exemplar high-risk participant (Figure 7C) showed that the VCA-IgA antibodies alone contributed +4.74 log-odds, accounting for ~75% of the final risk score. These analyses confirm that the model relies on biologically plausible drivers and provide clinician-readable explanations for each prediction.
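SHAP values for a gradient-boosted classifier are typically additive on the log-odds scale: the base value plus the per-feature contributions gives the log-odds of the prediction, which a logistic function converts to a probability. The Python sketch below illustrates this for one participant. Only the VCA contribution of +4.74 comes from the text; the base value and the other four contributions are hypothetical, and defining a feature's share as its fraction of the total absolute contribution is an assumption about how a figure such as ~75% could be derived.

```python
import math

# Hypothetical base value (average model output on the log-odds scale).
base_value = -0.40

# Per-feature SHAP contributions in log-odds; only the VCA value is from the text.
contributions = {"VCA": 4.74, "NTA1": 0.90, "LM": -0.30, "PLT": 0.20, "Rta": -0.18}

# Additivity: prediction (log-odds) = base value + sum of feature contributions.
log_odds = base_value + sum(contributions.values())
probability = 1.0 / (1.0 + math.exp(-log_odds))

# Assumed definition of "share": fraction of the total absolute contribution.
total_abs = sum(abs(v) for v in contributions.values())
vca_share = abs(contributions["VCA"]) / total_abs

print(round(log_odds, 2), round(probability, 3), round(vca_share, 2))
```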

Figure 7

Panel A displays a scatter plot of SHAP values indicating the impact of features (NTA1, VCA, LM, PLT, Rta) on model output, with colors representing high to low feature values. Panel B is a horizontal bar chart showing average impact magnitude of the same features, with NTA1 and VCA having the highest impact. Panel C is a SHAP summary plot showing how individual features (VCA, LM, Rta) contribute to a prediction output of 0.33, visualized in red (higher) to blue (lower).

Figure 7. SHAP-based interpretability of the XGBoost model. (A) Global Mean Absolute SHAP Values Plot. (B) Feature Importance Bar Chart. (C) Dependence Plot for an Exemplary High-Risk Participant.

3.7 Online prediction tool

An online prediction tool was also developed to facilitate clinician-friendly interpretation of NPC risk (https://www.xsmartanalysis.com/model/list/predict/model/html?mid=25653&symbol=51PY7477364ykEV6Jk11). The five selected predictors (NTA1, VCA, Rta, PLT, and LM) were used as input variables. Model explainability was enhanced using SHAP values to quantify the individual contribution of each predictor to the overall model performance. VCA (SHAP = 5.06) and NTA1 (SHAP = 4.68) exerted the strongest positive influence on NPC risk, reflecting their critical diagnostic relevance, whereas LM and PLT contributed modestly, suggesting an auxiliary role in the systemic immune response. The integration of SHAP visualization within the online interface allows clinicians to intuitively interpret how each biomarker drives individual risk predictions (Figures 8A, B).

Figure 8

Panel A shows an interface from the People's Hospital of Guangxi Zhuang Autonomous Region for predicting nasopharyngeal carcinoma risk, featuring input fields for the parameters VCA, Rta, NA1, L/M, and PLT. Panel B displays the same interface with filled input fields, calculates a disease occurrence probability of 90.0%, and includes guidance for cases where the probability exceeds the threshold. A graphical bar indicates parameter values, with details on model prediction and treatment recommendations.

Figure 8. An online prediction tool to predict the risk of NPC. (A) An online page based on the XGBoost algorithm. (B) An online page to predict the risk of NPC based on five predictors.

4 Discussion

The present study leveraged a large-scale cohort of 569 participants and integrated serological and hematological markers with advanced ML techniques to develop a high-performing model for NPC risk prediction. The model development process involved multiple stages, including LASSO analysis, feature selection, ROC analysis, model evaluation, calibration plots and decision curve analysis, to demonstrate the efficacy of the nine ML models (logistic regression, XGBoost, LightGBM, RF, DT, AdaBoost, MLP, GNB, and GBDT) in identifying key risk factors for NPC and in providing a reliable, clinically applicable tool for early detection and decision-making across varying healthcare settings.

Through LASSO, REFCV, and SVMREFCV analysis, we selected five predictors: three EBV-related antibodies (NTA1, VCA, and Rta), which reinforce the central role of EBV serology in NPC pathogenesis, and two hematological indicators (PLT and LM), reflecting host immune and inflammatory dynamics (Figures 1, 3). To enhance interpretability and clinical usability, we further employed SHAP (SHapley Additive exPlanations) analysis to quantify each feature’s contribution to model output. The results (Figure 7C) indicated that VCA and NTA1 exhibited the strongest positive influence on NPC risk prediction, whereas LM, PLT, and Rta contributed modestly but consistently to the overall model discrimination. These insights help bridge the gap between algorithmic prediction and clinical reasoning, enabling end-users to understand how each variable drives individual risk estimates. More importantly, we found that VCA and NTA1 had the highest AUC values in the ROC analysis, indicating that they are the most relevant predictors of NPC risk. These results are consistent with previous research highlighting the role of EBV-related antibodies, especially VCA-IgA and NTA1, in the pathogenesis and early detection of NPC (25, 33, 34). The selection of Rta-IgA further corroborates the role of EBV reactivation in NPC development, as Rta is a lytic protein expressed during EBV replication (10, 35). In addition to serological markers, we observed that hematological parameters such as PLT and LM were still significantly associated with NPC risk, although PLT had the lowest AUC value. These findings are consistent with Wu et al., who first constructed a predictive model using baseline LM subpopulations to estimate immunotherapy responses in NPC patients (36), and they further support the critical role of systemic inflammation and immune dynamics in NPC progression and prognosis. Moreover, these findings are in line with prior work by Long et al., who incorporated blood-based markers into ML development and demonstrated that including the platelet-lymphocyte ratio (PLR), particularly within a logistic regression framework, significantly enhanced predictive performance and clinical utility (37). These insights are crucial as they guide clinicians in identifying high-risk individuals, particularly those who may benefit from more intensive screening, and emphasize the importance of ML models in clinical practice (38). Thus, SHAP transforms the ensemble model into a transparent, instance-level decision aid that can be displayed alongside the risk score, increasing clinician trust and facilitating patient counseling.

Moreover, among the nine ML models evaluated, the XGBoost model consistently outperformed the others across all metrics, achieving an AUC of 1.000 in the training cohort and 0.995 in the internal validation cohort (Table 2), as well as 0.956 in the independent external validation cohort (Table 4). While these near-perfect AUC values highlight the strong discriminative ability of the model, they may also raise concerns regarding potential overfitting. To mitigate this risk and ensure generalizability, we adopted several strategies, including independent external validation, careful feature selection using three algorithms (LASSO, REFCV, and SVMREFCV) and model calibration assessment. In future studies, nested cross-validation, enhanced regularization tuning, and complexity reduction will be considered to further confirm the robustness of the XGBoost model. Decision curve analysis for both internal and external validation cohort also confirmed its clinical applicability, suggesting that the XGBoost-based model could be a valuable tool for early NPC screening and risk stratification in large populations. Our study highlights the utility of incorporating EBV-related serological markers, such as NTA1, Rta, and VCA, alongside hematological parameters (e.g., PLT and LM) for risk prediction in NPC. In contrast, Chen et al. (2024) developed a novel XGBoost model based on hospital electronic medical records (EMR) and the patient graph connection delta ratio (CDR), deliberately excluding EBV-related antibodies from their model construction process (39). Despite this exclusion, their model achieved strong predictive performance (AUC = 0.87) (39), suggesting they non-serological features derived from EMR and patient network structures can also effectively capture NPC risk. While another study conducted by Chen et al. (2024). 
However, our findings suggest that integrating virological biomarkers may provide added biological relevance and potentially improve early detection, especially in high-risk populations. The complementary nature of these approaches underscores the need for multi-modal data integration in NPC risk modeling. Future studies should explore hybrid models that combine clinical, serological, and network-based features to enhance predictive accuracy and clinical applicability across diverse healthcare settings. Furthermore, the implementation of an online prediction tool based on the five selected features (NTA1, Rta, VCA, PLT, and LM) offers a practical approach for clinicians to assess NPC risk in real time. To enhance interpretability and facilitate clinical adoption, SHAP-based model explainability was applied, enabling the visualization of each biomarker’s individual and combined influence on NPC risk. This approach bridges the gap between model accuracy and clinical transparency, allowing practitioners to better understand the biological rationale behind the predictions and make informed decisions. This tool could facilitate personalized screening strategies, especially in high-incidence regions, thereby improving early diagnosis and outcomes.
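As a rough illustration of the quantity plotted in a decision curve, the sketch below computes net benefit at a single risk threshold using the standard definition: true positives per person minus false positives per person, weighted by the odds of the threshold. The labels and predicted risks are made-up toy values, and the function names are hypothetical; this is not the code used in the study.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Net benefit of acting on everyone whose predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true: Sequence[int], threshold: float) -> float:
    """Reference strategy: refer every individual for further work-up."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data for illustration only: 3 cases and 7 non-cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.90, 0.80, 0.40, 0.60, 0.30, 0.20, 0.20, 0.10, 0.10, 0.05]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)   # 2/10 - 1/10 * 1, about 0.10
nb_all = net_benefit_treat_all(y_true, threshold=0.5)   # 0.3 - 0.7 * 1, about -0.40
print(f"model: {nb_model:.2f}, treat all: {nb_all:.2f}, treat none: 0.00")
```

A full decision curve repeats this calculation over a range of thresholds and compares the model against the treat-all and treat-none strategies; a model is considered clinically useful at thresholds where its net benefit exceeds both.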

The model developed in this study holds significant promise for improving NPC risk prediction across a range of healthcare settings. By integrating clinical data and laboratory biomarkers, the model is adaptable to varying resource settings, from township hospitals to tertiary medical centers. This flexibility is essential, as it allows the model to be implemented in regions with differing levels of diagnostic infrastructure. Clinicians can use this tool to identify individuals at high risk of NPC, potentially reducing the need for invasive procedures such as nasoendoscopy in low-risk individuals. While the results are promising, there are several areas for future work. Further validation of the model in independent and multi-center cohorts is necessary to assess its generalizability across diverse populations. Although the present study focused primarily on serological and hematological predictors due to their accessibility and non-invasive nature, the absence of radiological data (e.g., CT, MRI, or PET imaging) represents a limitation. Integrating radiological or radiomic features in future hybrid models could significantly enhance diagnostic comprehensiveness by capturing both anatomical and molecular information. Such multimodal frameworks combining imaging, serological, and genomic data may improve predictive accuracy, interpretability, and clinical applicability. In addition, the inclusion of other potentially relevant biomarkers or clinical variables, especially in high-burden settings, should be explored. The AdaBoost model was excluded from calibration analysis due to computational instability during curve generation; future studies should revisit its integration. Finally, embedding the XGBoost model within clinical decision support systems, along with real-time model updates, could further strengthen its role in NPC early detection and management.
Although class imbalance in the dataset was moderate, class-weight adjustments were applied to counter potential bias toward the majority group. Future work will systematically evaluate the impact of data balancing strategies such as SMOTE and stratified sampling to ensure equitable performance across classes.
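The class-weight adjustment mentioned above can be sketched as follows. The text does not specify the weighting scheme, so this example assumes the common inverse-frequency ("balanced") heuristic, in which each class weight equals n_samples / (n_classes × class count). The class counts are hypothetical, and the function name is illustrative.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Inverse-frequency weights: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical cohort for illustration: 300 non-NPC (0) and 100 NPC (1).
labels = [0] * 300 + [1] * 100
class_weights = balanced_class_weights(labels)
print(class_weights)

# After weighting, both classes carry the same total weight in the loss.
total_weight = {cls: class_weights[cls] * labels.count(cls) for cls in class_weights}

# For gradient-boosted trees, an analogous single-number adjustment is the
# negative-to-positive ratio (the typical value suggested for XGBoost's
# `scale_pos_weight` parameter); whether the study used it is an assumption.
neg_pos_ratio = labels.count(0) / labels.count(1)
```

With these toy counts the minority class receives three times the weight of the majority class, which is the same ratio expressed by `neg_pos_ratio`.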

5 Conclusion

In this large-scale study, we identified a sensitive panel of serological and hematological biomarkers (NTA1, VCA, Rta, PLT, and LM) associated with NPC risk. By integrating these predictors into the ML framework, we developed and validated an XGBoost-based prediction model that achieved near-perfect performance in the internal training and validation cohorts and strong performance in the external validation cohort. The XGBoost model’s accuracy, coupled with its clinical interpretability and integration into an online tool, offers a scalable solution for early NPC risk stratification. These findings underscore the potential of combining population-based serological profiling with ML to enhance NPC screening strategies. Prospective, multi-center validation is warranted to support clinical implementation.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.

Ethics statement

The studies involving humans were approved by the Ethics Committee of The People’s Hospital of Guangxi Zhuang Autonomous Region. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.

Author contributions

WY: Methodology, Data curation, Writing – review & editing, Writing – original draft, Software, Visualization, Resources. CZ: Writing – original draft, Writing – review & editing. MT: Data curation, Writing – review & editing, Conceptualization. ZH: Formal analysis, Data curation, Writing – review & editing. HZ: Writing – review & editing, Visualization, Supervision. SL: Writing – review & editing, Data curation, Methodology, Visualization. HH: Data curation, Conceptualization, Writing – review & editing. YL: Formal analysis, Writing – review & editing, Supervision. WP: Writing – review & editing, Supervision, Investigation. YY: Conceptualization, Funding acquisition, Writing – review & editing, Project administration, Data curation, Writing – original draft, Methodology.

Funding

The author(s) declare financial support was received for the research and/or publication of this article. This work was supported by the Natural Science Foundation of Guangxi (2023GXNSFAA026065), the Project for the Development and Promotion of Appropriate Medical and Health Technologies in Guangxi (S2023008) and the Central Government Guidance Fund for Local Science and Technology Development of Guangxi Zhuang Autonomous Region (Guike ZY24212050).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Mayo Clinic. Nasopharyngeal carcinoma (2024). Available online at: https://www.mayoclinic.org/diseases-conditions/nasopharyngeal-carcinoma/symptoms-causes/syc-20375529 (Accessed May 6, 2025).

2. Ao X, Luo C, Zhang M, Liu L, and Peng S. The efficacy of natural products for the treatment of nasopharyngeal carcinoma. Chem Biol Drug Des. (2024) 103:e14411. doi: 10.1111/cbdd.14411

3. Her C. Nasopharyngeal cancer and the Southeast Asian patient. Am Family Physician. (2001) 63:1776–83.

4. American Association for Cancer Research. Nasopharyngeal Cancer Incidence Varies Widely Among Different Ethnic Subgroups of Asian Americans. Philadelphia, PA, USA: American Association for Cancer Research (AACR) (2021).

5. Hsu W-L, Yu KJ, Chien YC, Chiang CJ, Cheng YJ, Chen JY, et al. Familial tendency and risk of nasopharyngeal carcinoma in Taiwan: effects of covariates on risk. Am J Epidemiol. (2010) 173:292–9. doi: 10.1093/aje/kwq358

6. American Cancer Society. Risk Factors for Nasopharyngeal Cancer. Atlanta, GA, USA: American Cancer Society (2022).

7. Chang ET, Ye W, Zeng YX, and Adami HO. The evolving epidemiology of nasopharyngeal carcinoma. Cancer Epidemiol Biomarkers Prev. (2021) 30:1035–47. doi: 10.1158/1055-9965.EPI-20-1702

8. Ruan HL, Qin HD, Shugart YY, Bei JX, Luo FT, Zeng YX, et al. Developing genetic epidemiological models to predict risk for nasopharyngeal carcinoma in high-risk population of China. PLoS One. (2013) 8:e56128. doi: 10.1371/journal.pone.0056128

9. World Cancer Research Fund International. Dietary and genetic factors and risk of nasopharyngeal cancer in south-east Asia. London, United Kingdom: World Cancer Research Fund International (2025).

10. Shao H, Chen M, Xiao Y, Xu L, Cao H, Hong B, et al. Establishing a risk prediction model for nasopharyngeal carcinoma based on anti-BNLF2b serological biomarkers: A retrospective study. Int J Med Sci. (2025) 22:2165–73. doi: 10.7150/ijms.110758

11. Yu X, Chen H, and Ji M. Epstein-Barr virus-based nasopharyngeal carcinoma population screening. Ann Nasopharynx Cancer. (2022) 6:3. doi: 10.21037/anpc-21-6

12. Cui Q, Feng FT, Xu M, Liu WS, Yao YY, Xie SH, et al. Nasopharyngeal carcinoma risk prediction via salivary detection of host and Epstein-Barr virus genetic variants. Oncotarget. (2017) 8:95066–74. doi: 10.18632/oncotarget.11144

13. Okekpa SI, Mydin SMN RB, Mangantig E, Azmi NSA, Zahari SNS, Kaur G, et al. Nasopharyngeal carcinoma (NPC) risk factors: A systematic review and meta-analysis of the association with lifestyle, diets, socioeconomic and sociodemographic in Asian Region. Asian Pac J Cancer Prev. (2019) 20:3505–14. doi: 10.31557/APJCP.2019.20.11.3505

14. Hutajulu SH, Howdon D, Taroeno-Hariadi KW, Hardianti MS, Purwanto I, Indrasari SR, et al. Survival outcome and prognostic factors of patients with nasopharyngeal cancer in Yogyakarta, Indonesia: A hospital-based retrospective study. PLoS One. (2021) 16:e0246638. doi: 10.1371/journal.pone.0246638

15. Wang R and Kang M. Guidelines for radiotherapy of nasopharyngeal carcinoma. Precis Radiat Oncol. (2021) 5:122–59. doi: 10.1002/pro6.1123

16. Yoshizaki T, Kondo S, Dochi H, Kobayashi E, Mizokami H, Komura S, et al. Recent advances in assessing the clinical implications of Epstein-Barr virus infection and their application to the diagnosis and treatment of nasopharyngeal carcinoma. Microorganisms. (2024) 12:14. doi: 10.3390/microorganisms12010014

17. Jiang W, Zheng B, and Wei H. Recent advances in early detection of nasopharyngeal carcinoma. Discov Oncol. (2024) 15:365. doi: 10.1007/s12672-024-01242-3

18. Musatova I, Dzyuba B, Boryshpolets S, Iqbal A, Sotnikov A, Kholodnyy V, et al. Evaluation of carp sperm respiration: fluorometry with optochemical oxygen sensor versus polarography. Fish Physiol Biochem. (2025) 51:1–14. doi: 10.1007/s10695-024-01418-2

19. Shakya R. Markers of oxidative stress in plants. In: Ecophysiology of tropical plants. Boca Raton, FL, USA: CRC Press (2024). p. 298–310.

20. Lavorato M, Iadarola D, Remes C, Kaur P, Broxton C, Mathew ND, et al. dldhcri3 zebrafish exhibit altered mitochondrial ultrastructure, morphology, and dysfunction partially rescued by probucol or thiamine. JCI Insight. (2024) 9:e178973. doi: 10.1172/jci.insight.178973

21. Rajkumar R, Vedhi C, and Sreeharsha N. Future of voltammetry for biosensing applications. In: Advancements in Voltammetry for Biosensing Applications. Singapore: Springer (2025). p. 217–25.

22. Zhang J, Tang K, Fang R, Liu J, Liu M, Ma J, et al. Nanotechnological strategies to increase the oxygen content of the tumor. Front Pharmacol. (2023) 14:1140362. doi: 10.3389/fphar.2023.1140362

23. Jiang H. Tumor Hypoxic Microenvironment and Infiltration and Metastasis. Amsterdam, the Netherlands: Elsevier (2024), SSRN 4959920.

24. Ghorbian M and Ghorbian S. Usefulness of machine learning and deep learning approaches in screening and early detection of breast cancer. Heliyon. (2023) 9:e22427. doi: 10.1016/j.heliyon.2023.e22427

25. Liu W, Chen G, Gong X, Wang Y, Zheng Y, Liao X, et al. The diagnostic value of EBV-DNA and EBV-related antibodies detection for nasopharyngeal carcinoma: a meta-analysis. Cancer Cell Int. (2021) 21:164. doi: 10.1186/s12935-021-01862-7

26. Chen JW, Lin ST, Lin YC, Wang BS, Chien YN, and Chiou HY. Early detection of nasopharyngeal carcinoma through machine-learning-driven prediction model in a population-based healthcare record database. Cancer Med. (2024) 13:e7144. doi: 10.1002/cam4.7144

27. Hu D, Wang Y, Ji G, and Liu Y. Using machine learning algorithms to predict the prognosis of advanced nasopharyngeal carcinoma after intensity-modulated radiotherapy. Curr Probl Cancer. (2024) 48:101040. doi: 10.1016/j.currproblcancer.2023.101040

28. Hsieh H-T, Zhang XY, Wang Y, and Cheng XQ. Biomarkers for nasopharyngeal carcinoma. Clinica Chimica Acta. (2025) 572:120257. doi: 10.1016/j.cca.2025.120257

29. Krishnan M and Babu S. Biomarkers in Nasopharyngeal Carcinoma (NPC): Clinical relevance and prognostic potential. Oral Oncol Rep. (2024) 11:100640. doi: 10.1016/j.oor.2024.100640

30. Su ZY, Siak PY, Leong CO, and Cheah SC. The role of Epstein-Barr virus in nasopharyngeal carcinoma. Front Microbiol. (2023) 14:1116143. doi: 10.3389/fmicb.2023.1116143

31. Siak PY, Heng WS, Teoh SSH, Lwin YY, and Cheah SC. Precision medicine in nasopharyngeal carcinoma: comprehensive review of past, present, and future prospect. J Transl Med. (2023) 21:786. doi: 10.1186/s12967-023-04673-8

32. Berloco F, Marvulli PM, Suglia V, Colucci S, Pagano G, Palazzo L, et al. Enhancing survival analysis model selection through XAI(t) in healthcare. Appl Sci. (2024) 14:6084. doi: 10.3390/app14146084

33. Liu H, Lei L, Song S, Geng X, Lin K, Li N, et al. The serological diagnostic value of EBV-related IgA antibody panels for nasopharyngeal carcinoma: a diagnostic test accuracy meta-analysis. BMC Cancer. (2024) 24:1115. doi: 10.1186/s12885-024-12878-3

34. Lian M. Combining Epstein–Barr virus antibodies for early detection of nasopharyngeal carcinoma: A meta-analysis. Auris Nasus Larynx. (2023) 50:430–9. doi: 10.1016/j.anl.2022.09.010

35. Yuan Y, Ye F, Wu JH, Fu XY, Huang ZX, and Zhang T. Early screening of nasopharyngeal carcinoma. Head Neck. (2023) 45:2700–9. doi: 10.1002/hed.27466

36. Wu Y-X, Tian BY, Ou XY, Wu M, Huang Q, Han RK, et al. A novel model for predicting prognosis and response to immunotherapy in nasopharyngeal carcinoma patients. Cancer Immunol Immunother. (2024) 73:14. doi: 10.1007/s00262-023-03626-w

37. Long L, Tao Y, Yu W, Hou Q, Liang Y, Huang K, et al. Multiparameter diagnostic model using S100A9, CCL5 and blood biomarkers for nasopharyngeal carcinoma. Sci Rep. (2025) 15:7502. doi: 10.1038/s41598-025-92518-3

38. Shick AA, Webber CM, Kiarashi N, Weinberg JP, Deoras A, Petrick N, et al. Transparency of artificial intelligence/machine learning-enabled medical devices. NPJ Digital Med. (2024) 7:21. doi: 10.1038/s41746-023-00992-8

39. Chen A, Lu R, Han R, Huang R, Qin G, Wen J, et al. Building practical risk prediction models for nasopharyngeal carcinoma screening with patient graph analysis and machine learning. Cancer Epidemiol Biomarkers Prev. (2023) 32:274–80. doi: 10.1158/1055-9965.EPI-22-0792

Keywords: nasopharyngeal carcinoma (NPC), NPC screening, predictive modeling, machine learning (ML), Epstein-Barr virus (EBV), logistic regression

Citation: Yang W, Zhou C, Tang M, Huang Z, Zhu H, Li S, Huang H, Liang Y, Pan W and Yuan Y (2025) Machine learning-based prediction of nasopharyngeal carcinoma risk: a clinical approach. Front. Immunol. 16:1648648. doi: 10.3389/fimmu.2025.1648648

Received: 17 June 2025; Revised: 30 September 2025; Accepted: 05 November 2025;
Published: 27 November 2025.

Edited by:

Claudine Kieda, Military Institute of Medicine, Poland

Reviewed by:

Daniela Messineo, Sapienza University of Rome, Italy
Vasileios Papanikos, University of Patras, Greece

Copyright © 2025 Yang, Zhou, Tang, Huang, Zhu, Li, Huang, Liang, Pan and Yuan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yulin Yuan
