Development and validation of an explainable machine learning and nomogram model for early detection and risk stratification of polycystic ovary syndrome: a multicenter study

Yao, Bihua; Yu, Xingyu; Zhang, Yunyan; Chen, Jiayan; Zhu, Xiaotong; Zhang, Cheng; Jijun, Tong

doi:10.3389/fendo.2025.1719631

ORIGINAL RESEARCH article

Front. Endocrinol., 17 December 2025

Sec. Reproduction

Volume 16 - 2025 | https://doi.org/10.3389/fendo.2025.1719631


Bihua Yao¹,²†

Xingyu Yu³†

Yunyan Zhang⁴

Jiayan Chen²

Xiaotong Zhu²

Cheng Zhang⁵*

Tong Jijun³*

¹Laboratory Medicine Center, Department of Clinical Laboratory, The First People’s Hospital of Jiashan, Jiashan Hospital Affiliated to Jiaxing University, Jiaxing, Zhejiang, China
²School of Laboratory Medicine and Life Sciences, Key Laboratory of Laboratory Medicine, Ministry of Education, Wenzhou Medical University, Wenzhou, Zhejiang, China
³The Key Laboratory of Intelligent Unmanned Systems Software Technology and Applications, Zhejiang Sci-Tech University, Hangzhou, Zhejiang, China
⁴Department of Gynecology, The First People’s Hospital of Jiashan, Jiashan Hospital Affiliated to Jiaxing University, Jiaxing, Zhejiang, China
⁵Laboratory Medicine Center, Department of Clinical Laboratory, Jiaxing Hospital of Traditional Chinese Medicine Affiliated to Zhejiang Chinese Medical University, Jiaxing, Zhejiang, China

Background: Polycystic ovary syndrome (PCOS) is a common endocrine–metabolic condition in reproductive-aged women, linked to infertility and long-term cardiometabolic risk. Early identification remains challenging because current diagnosis relies on hormone testing and imaging. This research sought to develop and evaluate an interpretable machine learning (ML) model and a simplified nomogram for the early detection of PCOS.

Methods: Data from 1,600 women at the First People’s Hospital of Jiashan were used for model training, with 283 external cases from Jiaxing Hospital of Traditional Chinese Medicine for validation. Twenty-three routine laboratory indicators were analyzed. After LASSO feature selection, seven ML algorithms were compared. The best-performing XGBoost model was interpreted using Shapley Additive exPlanations (SHAP). A logistic regression–based nomogram was developed from the key predictors.

Results: The XGBoost model showed excellent discrimination (AUC = 0.919 internal; 0.923 external). SHAP identified DHEAS, AMH, TG, and age as key contributors. The nomogram also performed well (AUC = 0.901 train; 0.887 test).

Conclusions: This interpretable “XGBoost + SHAP” and nomogram framework provides an accurate, transparent, and practical tool for early PCOS screening and individualized management.

Introduction

Polycystic ovary syndrome (PCOS) is one of the most common endocrine and metabolic disorders in reproductive-aged women (1) and is estimated to affect 8%–13% of this population worldwide (2–4). Its clinical and metabolic manifestations are highly heterogeneous. The syndrome is typically defined by hyperandrogenism, ovulatory dysfunction, and polycystic ovarian morphology, and is often associated with insulin resistance (5) and metabolic syndrome (6, 7), posing long-term reproductive and metabolic health risks for women (8). The Rotterdam 2003 criteria remain the most widely accepted diagnostic standard, requiring at least two of the three features for diagnosis (9, 10). The 2023 international evidence-based guideline further emphasized the integration of clinical, biochemical, and imaging evidence, recognizing anti-Müllerian hormone (AMH) as an alternative marker for defining polycystic ovarian morphology (PCOM), but not as a standalone diagnostic test (11). Owing to the heterogeneity of PCOS, current diagnosis relies heavily on hormonal assays, menstrual history, and imaging evaluations, which are influenced by timing, instrument variability, and operator experience, resulting in missed or delayed diagnoses, particularly in early or atypical cases (12, 13). Therefore, there is an urgent need for objective, stable, and easily accessible serum-based indicators. Recent systematic reviews have identified AMH, androgens, insulin-resistance-related indices, and lipid-metabolism markers (14–16) as potential key biomarkers, providing new directions for the early identification and risk stratification of PCOS.

The rapid advancement of artificial intelligence (AI) and machine learning (ML) has shifted clinical research paradigms from experience-driven to data-driven (17–23), offering innovative strategies for disease prediction and early diagnosis (23). However, most existing ML studies on PCOS have been limited by small sample sizes or single-center data and have rarely achieved a balance between predictive performance and clinical applicability, hindering clinical translation. Meanwhile, although diagnostic and therapeutic approaches continue to evolve, the etiology and management of PCOS remain complex, and current interventions mainly focus on symptomatic control (24). Hence, novel intelligent diagnostic and decision-support tools are needed to facilitate early risk detection and personalized management.

In this study, we integrated multicenter clinical data encompassing routine hormonal and metabolic indicators to propose a dual-model strategy. First, we developed a high-performance ML-based screening model, PCOS-XGBoost, and applied Shapley Additive exPlanations (SHAP) (25) to elucidate key predictors and their threshold effects, thereby enhancing model interpretability and transparency. Second, we constructed a simplified logistic-regression nomogram to translate complex algorithms into a clinically intuitive tool. Through this complementary framework, the study aimed to strike a balance between predictive accuracy and clinical usability, providing a feasible approach for early PCOS screening and individualized management.

Materials and methods

Data source

This retrospective study enrolled 1,600 women who first visited the Department of Gynecology at the First People’s Hospital of Jiashan between January 2021 and January 2025 as the training cohort, and 283 women from the Jiaxing Hospital of Traditional Chinese Medicine between January 2024 and January 2025 as the external validation cohort.

All data were de-identified before analysis. The study followed the Declaration of Helsinki and was approved by both institutional ethics committees (JZYLUN2025–034 and JSYIRB2024-103). As this was a retrospective study, the requirement for informed consent was waived.

Participants

Eligible participants were women aged 18–45 years who had not used medications affecting hormone levels (e.g., oral contraceptives) within the past three months, met at least two Rotterdam 2003 criteria—oligo/anovulation, clinical or biochemical hyperandrogenism, and polycystic ovarian morphology—and had complete clinical and laboratory data.

Patients were excluded if they had missing key data, duplicate records, other endocrine disorders (e.g., congenital adrenal hyperplasia, thyroid dysfunction, or Cushing’s syndrome), severe organic diseases (e.g., endometriosis), were pregnant or lactating, or had recently received hormonal therapy.

Data extraction and processing

To ensure data quality and model reliability, the raw dataset was thoroughly cleaned by removing records with excessive missing values, outliers, or measurements below the detection limit. In total, 23 clinical parameters were retained for model development, including age, AD, DHT, 17α-OHP, E1, LH, P, T, FSH, E2, PRL, DHEAS, AMH, INS-0h, INS-0.5h, INS-1h, INS-2h, INS-3h, TCH, TG, ApoE, Lp(a), and Glu.

To improve the consistency and reliability of hormone measurements, all participants were sampled during the follicular phase of the menstrual cycle, which standardized the testing conditions. Testosterone and other hormones were measured using mass spectrometry (MS), providing high sensitivity and accuracy. All hormone measurements were performed by accredited laboratories following strict standard operating procedures.

Missing data were imputed using multiple imputation by chained equations (MICE) implemented in Python. Highly correlated features (Spearman r > 0.6) were excluded, and all variables were standardized using Z-scores. To address class imbalance, random undersampling was applied, which reduced bias toward the majority class.
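The correlation filter and Z-score standardization described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the feature values are invented, and the use of the absolute Spearman coefficient, the greedy keep-first rule for correlated pairs, and the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def drop_correlated(columns, threshold=0.6):
    """Keep a feature only if |rho| <= threshold against every kept feature.

    Assumption: the first feature of a correlated pair is retained.
    """
    kept = []
    for name in columns:
        if all(abs(spearman(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical values for three features in five participants.
columns = {
    "INS-1h": [1.0, 2.0, 3.0, 4.0, 5.0],
    "INS-2h": [2.0, 4.0, 5.0, 8.0, 11.0],  # monotone in INS-1h, so rho = 1
    "TG": [3.0, 1.0, 4.0, 2.0, 5.0],       # rho = 0.5 against INS-1h
}
kept = drop_correlated(columns)
z = zscore(columns["INS-1h"])
print(kept)
```

In this toy example INS-2h is dropped because it is perfectly rank-correlated with INS-1h, while TG is retained because its coefficient (0.5) stays below the 0.6 cut-off.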

Feature selection and data partitioning

LASSO regression was used for feature selection, with 10-fold cross-validation and the λ.1se criterion to determine the final variables included in the model. The dataset was randomly split in a 7:3 ratio, creating a training set (n = 1,120) and an internal validation set (n = 480). The model’s generalizability was evaluated using an independent external validation set (n = 283).

Statistical analysis and model development

All analyses were performed in R 4.3.3. Normality was assessed with the Kolmogorov–Smirnov test. Non-normally distributed continuous data are presented as median (IQR) and were compared using the Mann–Whitney U test; categorical data are presented as n (%) and were compared using the Pearson χ² test. Class imbalance was addressed by random undersampling, and data were split 7:3 into training and internal validation sets. LASSO with 10-fold cross-validation and the λ.1se criterion was used for feature selection. The objective function was:

(\hat{\beta}_{0}, \hat{\beta}) = \arg\min_{\beta_{0}, \beta} \sum_{i=1}^{n} \left( y_{i} - \beta_{0} - \sum_{j=1}^{p} \beta_{j} x_{ij} \right)^{2}

\text{subject to} \quad \sum_{j=1}^{p} |\beta_{j}| \leq \lambda

where λ controls regularization and drives some coefficients to zero.
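To make the shrinkage behavior of the LASSO objective above concrete, the following Python sketch solves a tiny problem by coordinate descent. It is an illustration under stated assumptions, not the study's implementation: it uses the equivalent penalized form, (1/2n)·RSS + lam·Σ|β_j|, in which a larger penalty `lam` means stronger shrinkage, and the function names and toy data are hypothetical.

```python
def soft_threshold(value, gamma):
    """Shrink value toward zero by gamma; return exactly 0 inside [-gamma, gamma]."""
    if value > gamma:
        return value - gamma
    if value < -gamma:
        return value + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2n) * sum of squared residuals + lam * sum(|beta_j|)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - intercept
                           - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / scale
        # The intercept is not penalized: refit it as the mean residual.
        intercept = sum(
            y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)
        ) / n
    return intercept, beta


# Toy data: y depends on the first column only (true slope 3).
X = [[-2.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [2.0, 1.0]]
y = [-6.0, -3.0, 3.0, 6.0]

_, beta_weak = lasso_coordinate_descent(X, y, lam=0.5)
_, beta_strong = lasso_coordinate_descent(X, y, lam=8.0)
print(beta_weak, beta_strong)
```

With the weak penalty the informative coefficient is shrunk from 3 to about 2.8 and the uninformative one is set exactly to zero; with the strong penalty both coefficients are zero, which is the variable-selection behavior exploited in this study.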

Using the variables selected by LASSO, we trained Light Gradient Boosting Machine (LGBM) (26), Random Forest (RF) (27), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Decision Tree (DT) (28), K-Nearest Neighbors (KNN), and Logistic Regression (LR) models with tenfold cross-validation. The area under the ROC curve (AUC) was used as the primary performance metric. Among these models, XGBoost achieved the best performance and was further evaluated in the internal and external validation cohorts.
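Because AUC is the primary metric throughout, a short Python sketch of its rank-based definition may be helpful: the AUC equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen non-case, with ties counted as one half. The scores and labels below are invented for illustration, and the function name is hypothetical.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative.")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half a concordant pair
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and true labels (1 = PCOS).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
auc = auc_from_scores(scores, labels)
print(auc)
```

Three of the four case/non-case pairs are ordered correctly here, so the sketch returns 0.75. The pairwise loop is quadratic in sample size and is meant only to show the definition.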

Model evaluation

The model’s performance was comprehensively evaluated using various metrics, including accuracy, precision, recall, F1 score, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). To assess clinical applicability, calibration curves, clinical impact curves (CIC), and decision curve analysis (DCA) were constructed to identify the optimal decision threshold. The generalizability of each model was further tested using both internal and external validation cohorts.

Model interpretability

Predictions of the XGBoost model were explained using the Shapley Additive exPlanations (SHAP) approach. Feature importance plots illustrated the overall contribution of each variable, while individual case analyses visualized how key features influenced specific predictions, providing insight into the model’s decision-making process.

Nomogram construction and validation

Among all models, XGBoost showed the best predictive performance and was selected as the primary model. However, due to its “black-box” nature, direct clinical application may be limited. To improve interpretability, a simplified nomogram was developed using logistic regression based on the key features identified by XGBoost, translating complex model outputs into an intuitive clinical tool for individualized risk assessment.

The nomogram’s performance was evaluated using ROC curves, compared with single predictors, and further verified with DCA and calibration curves. Results demonstrated that the nomogram retained the predictive value of core variables while showing strong robustness and clinical utility.

Results

Baseline characteristics

A total of 1,600 participants were enrolled, comprising 800 cases of PCOS and 800 non-PCOS cases. The cohort was randomly split into a training set (n = 1,120) and a testing set (n = 480) at a 7:3 ratio. Table 1 presents the baseline characteristics, while Figure 1 illustrates the study workflow.

Table 1

Table 1. Statistics of characteristics of the PCOS and non-PCOS individuals.

Figure 1

Flowchart detailing a dual-model strategy. It begins with data processing and preparation, including population collection, missing value imputation, collinearity removal, and normalization. LASSO regression is used for feature selection, identifying nine key variables. Model development compares seven machine learning algorithms, with XGBoost selected as best. Evaluation includes ROC, calibration, and more. Research utility involves SHAP analysis for model explainability. Clinical utility is addressed through a nomogram evaluated by ROC, DCA, and calibration. Conclusion highlights a dual-model strategy: research with XGBoost and SHAP, and clinical with a nomogram.

Figure 1. Study workflow of model development and validation.

Feature selection

Least absolute shrinkage and selection operator (LASSO) regression was used for feature selection. As shown in Figure 2A, the coefficient profiles indicate that most variable coefficients progressively shrank toward zero as the penalty parameter (λ) increased. Figure 2B displays the tenfold cross-validation results, where the two dashed lines represent the minimum error criterion (λ.min) and the one–standard error criterion (λ.1se). Under the λ.min criterion, nine variables with nonzero coefficients (AD, T, AMH, TG, DHEAS, ApoE, Lp(a), FSH, and age) were finally selected as key features strongly associated with the diagnosis of polycystic ovary syndrome (PCOS) (Figure 2D).
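The two cross-validation criteria can be stated compactly in code. The Python sketch below follows the usual definitions: λ.min is the penalty with the lowest mean cross-validated error, and λ.1se is the largest penalty whose mean error lies within one standard error of that minimum. The penalty grid, errors, and standard errors are hypothetical and are not taken from Figure 2B.

```python
def select_lambdas(lambdas, cv_error, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_error[i])
    limit = cv_error[i_min] + cv_se[i_min]
    # Largest penalty whose mean error is still within one SE of the minimum.
    lambda_1se = max(
        lam for lam, err in zip(lambdas, cv_error) if err <= limit
    )
    return lambdas[i_min], lambda_1se


# Hypothetical penalty grid with mean CV deviance and its standard error.
lambdas = [0.001, 0.005, 0.01, 0.05, 0.1]
cv_error = [0.92, 0.90, 0.91, 0.925, 1.10]
cv_se = [0.03, 0.03, 0.03, 0.03, 0.04]

lambda_min, lambda_1se = select_lambdas(lambdas, cv_error, cv_se)
print(lambda_min, lambda_1se)
```

Because λ.1se is never smaller than λ.min, it yields a model with the same number of nonzero coefficients or fewer, trading a small increase in cross-validated error for a sparser predictor set.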

Figure 2

Panel A shows a line graph with various colored lines representing coefficients against log(lambda). Panel B is a plot of binomial deviance versus log(lambda), highlighting variables with vertical dashed lines. Panel C features a bar chart comparing coefficients of different variables, with AD having the highest and age the lowest. Panel D, another bar chart, shows slightly different coefficient values for the same variables, with AD still the highest and age negative.

Figure 2. LASSO regression feature selection. (A) Coefficient profile plot; (B) Ten-fold cross-validation curve with designated λ.min and λ.1se; (C, D) Variables having nonzero coefficients. For model development, the following variables were chosen: AD, T, AMH, TG, DHEAS, ApoE, Lp(a), FSH, and age.

Rad-score comparison

The Rad-score was calculated for each participant and compared between groups. As shown in Figure 3, patients with PCOS (label = 1) had significantly higher Rad-scores than those without PCOS (label = 0) (P < 0.0001), indicating good discriminative ability. The Rad-score may serve as a reliable marker for distinguishing PCOS from non-PCOS and holds potential value for early diagnosis and risk stratification.

Figure 3

Two scatter plots labeled A and B compare training and testing data, respectively. Both plots show values on the y-axis and groups “0” and “1” on the x-axis. Red and teal points represent groups, with teal points generally having larger spread and higher values. Box plots indicate the range and distribution, with a significant difference marked by asterisks between the groups.

Figure 3. Comparison of Rad-scores between PCOS and non-PCOS groups. Rad-scores were significantly higher in the PCOS group (P < 0.0001). (A) Training Data; (B) Testing Data.

Waterfall plot

To further evaluate the model’s discrimination at the individual level, waterfall plots of Rad-scores were generated for the training and validation cohorts (Figure 4). Distinct distributions of Rad-scores were observed between patients with PCOS and those without PCOS in both cohorts, indicating that the model can consistently distinguish between the two groups.

Figure 4

Two waterfall plots showing Rad_Score distribution for training and testing data. Both charts display participants on the x-axis and Rad_Score on the y-axis, with scores divided into categories zero (red) and one (blue). The training data graph has more participants, extending beyond 1500, while the testing data has fewer, around 900. Both plots show a similar pattern of distribution with scores ranging from negative to positive values.

Figure 4. Waterfall plots of Rad-scores in the training and validation cohorts. Each bar represents an individual participant; blue indicates those with PCOS, and red indicates those without PCOS. Overall, PCOS patients showed higher Rad-scores, demonstrating the model’s strong discriminative ability. (A) Training Data; (B) Testing Data.

Model performance comparison

The predictive performance of seven machine learning models was compared (Figure 5, Table 2). Among them, the XGBoost model achieved the best overall performance (AUC = 0.919, 95% CI: 0.896–0.942), demonstrating a better balance across metrics than the other algorithms. The Random Forest model ranked second (AUC = 0.918), while the K-Nearest Neighbors (KNN) model performed worst. Detailed performance metrics for all models are summarized in Table 2.

Figure 5

ROC curve comparing the performance of seven algorithms: Decision Tree (DT), K-Nearest Neighbors (KNN), LightGBM (LGBM), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and XGBoost (XGB). The XGB has the highest AUC (0.919), while KNN has the lowest (0.834). The x-axis represents 1-Specificity and the y-axis represents Sensitivity.

Figure 5. ROC curves for the machine learning models. LGBM, Light Gradient Boosting Machine; XGB (XGBoost), extreme gradient boosting; RF, random forest; DT, decision tree; SVM, support vector machine; KNN, k-nearest neighbors; LR, logistic regression; ROC, receiver operating characteristic; AUC, area under the curve.

Table 2

Table 2. PCOS prediction machine learning model performance.

Calibration and clinical utility analysis

Following the comparison of overall predictive performance, we further evaluated the calibration and clinical utility of the seven models (Figure 6).

Figure 6

Four-panel chart with various visualizations. Top left: Calibration plot comparing predicted and observed probabilities for several models, labeled with different colors. Top right: Decision curve analysis showing net benefit across thresholds for the same models. Bottom left: Risk plot showing high-risk numbers against a cost-benefit ratio. Bottom right: Confusion matrix with counts of 197 and 47 for true negatives, and 37 and 199 for true positives, with a color gradient bar indicating value intensity.

Figure 6. Calibration performance and clinical utility of the models. (A) calibration curves; (B) decision curve analysis (DCA); (C) clinical impact curve (CIC); (D) confusion matrix of the best-performing XGB model. DT, Decision Tree; XGB, Extreme Gradient Boosting; LGBM, Light Gradient Boosting Machine; RF, Random Forest; SVM, Support Vector Machine; KNN, K-Nearest Neighbors; LR, Logistic Regression.

(A) The calibration curves show how well the predicted probabilities align with the observed outcomes. Overall, the XGB model (black line) showed the closest fit to the ideal diagonal, indicating superior calibration. The RF, SVM, LGBM, DT, and LR models also demonstrated acceptable consistency, whereas the KNN model exhibited larger deviations across risk intervals, suggesting less reliable predictions. These findings confirm that XGB achieved the best balance between accuracy and consistency.

(B) The decision curve analysis (DCA) compared the net clinical benefit of the models across various threshold probabilities. All models outperformed the “Treat None” strategy (purple line), and most performed better than the “Treat All” strategy (blue curve) in the low-to-intermediate risk range, suggesting positive clinical value. Among them, the XGB model consistently maintained the highest net benefit across almost all thresholds, indicating superior capability in balancing false positives and false negatives. In contrast, KNN and SVM showed greater variability, reflecting lower stability. Collectively, these results support the XGB model as the most clinically applicable algorithm.

(C) The clinical impact curve (CIC) further assessed the XGB model’s ability to identify high-risk individuals at different thresholds. The red solid line represents the number of individuals predicted as high-risk, and the blue dashed line represents the actual number of true-positive cases, with 95% confidence intervals shown as shaded boundaries. As the threshold increased, the number of predicted high-risk cases declined, reducing false positives but potentially missing actual cases. Notably, at a threshold of 0.5, the two curves nearly overlapped, and the confidence interval narrowed, indicating optimal agreement between predicted and observed outcomes. This threshold provided the optimal trade-off between sensitivity and specificity, thereby maximizing the benefits of clinical intervention. The CIC thus quantitatively demonstrates the clinical applicability of the XGB model for individualized risk assessment of PCOS.

The confusion matrix of the XGB model is presented in Figure 6D, offering a visual evaluation of its classification performance.
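The quantities behind the confusion matrix and the decision curve can be reproduced with a few lines of Python. The sketch below is illustrative only. The four counts are those listed in the description of Figure 6D, but their assignment to cells (TN = 197, FP = 47, FN = 37, TP = 199) and the 0.5 threshold are assumptions, so the derived values should not be read as reported results. Net benefit follows the standard decision-curve definition, TP/n − (FP/n) × pt/(1 − pt).

```python
def confusion_metrics(tp, fp, tn, fn):
    """Basic classification metrics from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
    }


def net_benefit(tp, fp, total, threshold):
    """Decision-curve net benefit at a given threshold probability."""
    return tp / total - (fp / total) * threshold / (1.0 - threshold)


# Assumed cell layout for the counts shown in Figure 6D (see lead-in).
tn, fp, fn, tp = 197, 47, 37, 199
total = tn + fp + fn + tp

metrics = confusion_metrics(tp, fp, tn, fn)
nb_model = net_benefit(tp, fp, total, threshold=0.5)
# "Treat all": every participant is flagged, so all non-cases are false positives.
nb_treat_all = net_benefit(tp + fn, fp + tn, total, threshold=0.5)
nb_treat_none = 0.0

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"net benefit (model): {nb_model:.3f}")
print(f"net benefit (treat all): {nb_treat_all:.3f}")
```

Under these assumptions the model's net benefit at a threshold of 0.5 exceeds both reference strategies, which is the comparison that a decision curve repeats across the full range of thresholds.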

External validation

The proposed XGBoost model demonstrated strong generalizability. In Figure 7, the ROC curve for the external validation cohort showed an AUC of 0.923 (95% CI: 0.893–0.953), indicating strong predictive ability.

Figure 7

Receiver Operating Characteristic (ROC) curve comparing test and train performances. The turquoise line represents the training data with an Area Under the Curve (AUC) of 0.952, while the red line depicts the testing data with an AUC of 0.923. Both lines show high sensitivity and specificity.

Figure 7. External validation in an independent cohort.

Model interpretability analysis

With the growing application of artificial intelligence in medicine, model interpretability has become an important indicator for evaluating the usability and safety of predictive models (29). Doshi-Velez and Kim emphasized that interpretability should be regarded as a core component of machine learning science and systematically assessed in high-risk domains (30). To measure the contribution of each feature to model predictions, we employed the Shapley Additive exPlanations (SHAP) approach (Figure 8). SHAP, a game theory–based feature attribution method, is one of the most widely used interpretability algorithms (31).

Figure 8

Four-panel visualization depicting SHAP values. Panel A shows a beeswarm plot with multiple features affecting SHAP values. Panel B presents a bar chart ranking features by mean SHAP values, where DHEAS has the highest impact. Panel C illustrates a decision plot with stepwise contributions of features to the final prediction, showing positive and negative impacts. Panel D contains scatter plots for various features like AD, T, AMH, TG, and others against SHAP values, indicating relationships and trends.

Figure 8. Global and local model interpretation using SHAP analysis. (A) SHAP beeswarm summary plot showing the distribution and direction of feature contributions across all samples. Each dot represents one patient, colored by the feature value (yellow = high, purple = low). Features at the top have a greater overall impact on the model output. (B) Bar plot of mean absolute SHAP values indicating the overall importance of each variable. Together, (A, B) show that DHEAS and AMH were the most influential positive predictors, followed by TG and age, whereas Lp(a) had a substantial negative contribution. The effects of T, ApoE, AD, and FSH were relatively minor. (C) SHAP waterfall plot for an individual case illustrating that AMH, TG, and age were major positive contributors, while Lp(a) was the primary negative contributor. (D) SHAP main effect dependence plots of dominant features predicting PCOS. DHEAS and AMH exerted the most potent positive effects on PCOS prediction, while metabolic markers (ApoE and Lp(a)) showed a moderate influence. FSH and age were negatively associated with PCOS, consistent with its hormonal and epidemiological characteristics. Each point represents a single patient, where the x-axis indicates the actual feature value and the y-axis represents its SHAP value. Positive SHAP values push the prediction toward PCOS.

At the global level, the beeswarm and bar plots (Figures 8A, B) revealed the relative importance of nine key predictors. DHEAS and AMH were the most influential positive contributors, with higher values significantly increasing SHAP scores and the predicted probability of PCOS. TG and age also showed positive associations, but with lower and more heterogeneous contributions across individuals. In contrast, elevated Lp(a) values were predominantly distributed in the negative SHAP region, indicating an inhibitory effect on prediction. FSH showed an overall negative association, consistent with the typical hormonal profile of PCOS, characterized by an elevated LH/FSH ratio. These findings suggest that PCOS development involves not only reproductive hormonal dysregulation but also metabolic disturbances, reflecting its potential pathophysiological heterogeneity.

At the individual level, the waterfall plot (Figure 8C) illustrated the cumulative contribution of features to a single patient’s prediction. The baseline prediction value of 0.00657 increased to 0.962 after integrating all feature effects, indicating that the case was classified as high risk. AMH (+0.593), TG (+0.197), age (+0.15), and DHEAS (+0.102) were the main positive drivers, while Lp(a) (–0.279) acted as the primary negative factor. This suggests that multiple hormonal and metabolic factors jointly contributed to the patient’s high-risk prediction.
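SHAP explanations are additive: the model output for one case equals the baseline value plus the sum of that case's feature contributions. The Python sketch below applies this identity to the numbers quoted above. It assumes that the waterfall plot is additive on the displayed scale, which the text does not state, so the remainder attributed to the features not itemized in the text is an arithmetic consequence of that assumption and not a reported result.

```python
# Values quoted in the text for the individual case in Figure 8C.
base_value = 0.00657
final_prediction = 0.962
reported_contributions = {
    "AMH": 0.593,
    "TG": 0.197,
    "age": 0.15,
    "DHEAS": 0.102,
    "Lp(a)": -0.279,
}

# Additivity: final = base + sum of all feature contributions.
reported_sum = sum(reported_contributions.values())
remainder = final_prediction - base_value - reported_sum

print(f"sum of itemized contributions: {reported_sum:.3f}")
print(f"left for the features not itemized in the text: {remainder:.5f}")
```

The five itemized contributions sum to 0.763, so under the additivity assumption the features not itemized (T, ApoE, AD, and FSH) would together account for roughly 0.19 of the shift from baseline to final prediction.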

The dependence plots (Figure 8D) revealed several threshold effects. Specifically, AMH values above 5 ng/mL shifted from neutral to strongly positive contributions, and levels exceeding 10 ng/mL almost deterministically indicated PCOS. DHEAS exhibited a substantial positive contribution around 200 μg/dL, which further increased with concentration, suggesting a dose–response relationship. FSH showed positive SHAP values at lower levels but negative contributions at higher levels, aligning with the characteristic lower FSH levels in patients with PCOS. Moreover, a younger age (<30 years) was positively associated with PCOS prediction, consistent with its known epidemiological pattern.

Nomogram construction and validation

SHAP analysis not only enhanced model transparency but also helped identify the most clinically relevant predictors. However, SHAP plots are primarily suited for research interpretation, whereas clinicians require intuitive and straightforward tools for individualized risk assessment. Therefore, based on the key variables identified by the XGBoost model, we developed a simplified logistic regression–based nomogram (Figure 9A). Using a few routine indicators, clinicians can easily calculate the total score and estimate an individual’s likelihood of having PCOS.

Figure 9

Panel A shows a nomogram predicting diagnostic possibilities using variables like age, DHEAS, FSH, and others. Panel B presents ROC curves with sensitivity vs. 1-specificity for various markers. Panels C and D display net benefit curves against threshold probability, evaluating different models using markers like AD and ApoE.

Figure 9. Nomogram and evaluation of its diagnostic performance and clinical utility. (A) Nomogram for predicting PCOS diagnosis. (B) Comparison of ROC curves between individual predictors and the nomogram model. (C) Decision curve analysis (DCA) for the train cohort. (D) Decision curve analysis (DCA) for the test cohort.

In clinical practice, the nomogram provides physicians with an intuitive and easy-to-use tool to quickly assess a patient’s risk of having PCOS based on routine clinical data. For example, suppose a patient’s clinical data are as follows: age 30 years, AD 2, T 6, AMH 10 ng/mL, FSH 60, triglycerides (TG) 5, ApoE 12, DHEAS 150, and Lp(a) 1000. Reading these values off the nomogram gives the corresponding points for each variable: AD 2, points = 3; T 6, points = 20; AMH 10, points = 12; TG 5, points = 10.5; ApoE 12, points = 6.5; FSH 60, points = 25; Age 30 years, points = 16.75; DHEAS 150, points = 15; Lp(a) 1000, points = 7.5. Summing these points gives a total of 116.25 points, which corresponds to a PCOS risk of 82.5% for this patient. This result indicates that the patient has a high risk of PCOS, and the clinician can use this information to decide whether to perform further diagnostic tests, such as ultrasound or hormonal assessments, or adjust the treatment plan. With the nomogram, physicians can quickly and intuitively assess the patient’s risk, assist in developing personalized treatment plans, and enhance clinical decision-making efficiency, especially in resource-limited settings.
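The arithmetic of the worked example can be checked with a short Python sketch. The per-variable points are copied from the example above. The mapping from total points to predicted risk depends on the fitted logistic model, whose coefficients are not given in the text, so the sketch deliberately stops at the total score.

```python
# Points read from the nomogram for the example patient (values from the text).
points = {
    "AD": 3,
    "T": 20,
    "AMH": 12,
    "TG": 10.5,
    "ApoE": 6.5,
    "FSH": 25,
    "Age": 16.75,
    "DHEAS": 15,
    "Lp(a)": 7.5,
}

total_points = sum(points.values())
print(f"total points: {total_points}")  # total points: 116.25
```

The nine contributions sum to 116.25 points, matching the total used to read the predicted risk from the bottom axis of the nomogram.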

The ROC curves showed that the nomogram achieved better discrimination than individual predictors (Figure 9B), demonstrating strong predictive performance (AUC = 0.901 for the train set and 0.887 for the test set; Table 3). The decision curve analysis (DCA) further indicated higher net clinical benefit across most threshold probabilities in both cohorts (Figures 9C, D). Other performance metrics, including sensitivity, specificity, PLR, NLR, and Kappa values, remained stable between datasets (Table 3), supporting the model’s robustness and clinical utility.

Table 3

Table 3. Performance metrics of the nomogram in the train and test sets.

To further assess the model’s goodness of fit, calibration curves were plotted (Figure 10). The predicted probabilities showed good agreement with the observed outcomes in both the train and test cohorts, with the curves closely following the ideal diagonal line. The Brier score, Emax, and Eavg were 0.129, 0.038, and 0.016 in the train cohort, and 0.140, 0.068, and 0.026 in the test cohort, respectively, demonstrating excellent calibration performance.
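For context on the calibration statistics, the Brier score is the mean squared difference between the predicted probability and the observed binary outcome; lower is better, and a model that always predicts 0.5 scores 0.25. A minimal Python sketch with invented predictions (not study data) is:

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have equal length")
    return sum(
        (p - y) ** 2 for p, y in zip(probabilities, outcomes)
    ) / len(outcomes)


# Hypothetical predicted risks and observed outcomes (1 = PCOS).
predicted = [0.9, 0.2, 0.7, 0.4]
observed = [1, 0, 1, 0]

score = brier_score(predicted, observed)
uninformative = brier_score([0.5] * len(observed), observed)
print(f"Brier score: {score:.3f}; constant 0.5 prediction: {uninformative:.3f}")
```

Against the 0.25 benchmark of an uninformative constant prediction, the Brier scores of 0.129 and 0.140 reported above indicate that the nomogram's probabilities carry substantial information in both cohorts.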

Figure 10

Calibration plots A and B compare actual versus predicted probabilities. Both plots include ideal, logistic calibration, and nonparametric lines. Plot A shows better model performance with higher C (ROC) and Dxy values. Plot B has a lower slope and higher S:z, indicating potential model miscalibration. Visualized data includes prediction errors and metrics like Brier score and Emax.

Figure 10. Calibration curves of the nomogram: (A) train cohort; (B) test cohort. The diagonal line represents perfect prediction, the solid line represents logistic calibration, and the dotted line shows the nonparametric calibration.

Together, the XGBoost-based framework and its simplified nomogram demonstrated consistent and reliable performance across datasets, laying the groundwork for further clinical validation and broader application.

Conclusion

An interpretable XGBoost model and a simplified nomogram were developed using routine hormonal and metabolic indicators. This dual-model approach shows strong potential for early detection of PCOS, pending further validation in larger prospective studies.

Discussion

In this study, we developed an interpretable XGBoost model based on routine hormonal and metabolic indicators to predict PCOS. The model demonstrated excellent and stable performance across datasets, achieving an AUC of 0.919 in internal validation and 0.923 in the external validation cohort, outperforming all other machine learning algorithms. The Rad-score effectively discriminated between PCOS and non-PCOS cases and maintained stability across cohorts, while waterfall plots confirmed a strong individual-level discriminative ability. Furthermore, the decision curve analysis (DCA) and clinical impact curve (CIC) consistently indicated a higher net clinical benefit over a wide range of thresholds, and calibration curves demonstrated strong agreement between predicted and observed probabilities. Collectively, these findings confirm that the proposed model possesses robust predictive accuracy, good calibration, and high clinical reliability.

Using SHAP analysis, we quantified feature contributions and identified biologically meaningful threshold effects. AMH, DHEAS, and testosterone emerged as the most influential positive predictors, consistent with the hormonal profile characteristic of PCOS. Triglycerides (TG) and ApoE also contributed positively, suggesting the presence of concurrent metabolic abnormalities. Conversely, FSH, Lp(a), and age contributed negatively to predicted risk, aligning with the clinical pattern that PCOS predominantly affects younger women with lower FSH levels. The inverse association between age and PCOS risk is consistent with previous epidemiological findings showing a decline in androgen levels and an increase in metabolic disturbances with advancing age (32). These results emphasize that PCOS development involves a complex interplay between reproductive hormones and metabolic dysregulation.
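To make the notion of a signed feature contribution concrete, the sketch below computes exact Shapley values for a single sample by brute-force enumeration of feature subsets. It is not the tree-based algorithm that SHAP applies to XGBoost models, and `toy_score`, its coefficients, and the sample and baseline values are hypothetical choices made only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample.

    Features outside a coalition are set to their baseline value.
    """
    names = list(x)
    n = len(names)

    def value(subset):
        mixed = {k: (x[k] if k in subset else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_score(f):
    # Hypothetical linear score; NOT the fitted model from the study
    return 0.4 * f["AMH"] + 0.2 * f["T"] - 0.1 * f["FSH"]


sample = {"AMH": 6.0, "T": 2.0, "FSH": 5.0}    # hypothetical patient
baseline = {"AMH": 3.0, "T": 1.0, "FSH": 7.0}  # hypothetical reference values

phi = shapley_values(toy_score, sample, baseline)
print(phi)
```

The contributions sum to the difference between the score for the sample and the score for the baseline. In this toy case an FSH value below the baseline yields a positive contribution because its coefficient is negative, which mirrors the direction of effect described above.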

Model interpretability is fundamental for fostering clinicians’ trust and ensuring the safe integration of AI-based tools in medical practice. Although SHAP provides valuable insight into feature-level contributions, its visualization and analytical depth are better suited to research contexts than to routine clinical workflows. Prior studies have highlighted that clinicians prioritize clarity, usability, and clinical relevance when interpreting AI outputs; however, current explainable AI (XAI) frameworks remain limited in bridging this gap (33). Future work should focus on refining interpretable visualization approaches and developing user-oriented explanations that better integrate with clinical reasoning, thereby facilitating the broader adoption of AI-assisted decision support systems (34).

In comparison to existing literature, our study offers several notable strengths. Previous work has predominantly focused on the diagnostic value of AMH alone. Van der Ham et al. in Fertility and Sterility and Gomes et al. in AJOG both confirmed AMH as a reliable biomarker for PCOS diagnosis (35, 36). The latest international guideline (Teede et al., 2023, JCEM) further incorporated AMH as an alternative to PCOM in adult diagnostic criteria, while cautioning against its use as a standalone test, especially in adolescents (10). However, most of these studies relied on single-variable or traditional regression models. Attempts to integrate multiple biomarkers, such as the model proposed by Tong et al. (37), were limited by small sample sizes and a lack of external validation. In imaging studies, Moral et al. achieved high diagnostic accuracy using deep learning on ultrasound images; however, their model’s reliance on imaging quality limits its clinical generalizability (38). By contrast, our multicenter, large-sample study employed multi-model comparison and independent validation, addressing these shortcomings. Moreover, SHAP analysis identified clinically interpretable thresholds, such as an AMH level greater than 5 ng/mL, which is consistent with current international guidelines and evidence.

The proposed dual-model strategy, combining an interpretable XGBoost model with a simplified logistic regression–based nomogram, provides a practical balance between predictive accuracy and clinical usability. This approach is particularly advantageous for identifying early PCOS risk in resource-limited settings, enabling frontline clinicians to conduct personalized risk assessments and interventions. However, it is important to acknowledge that anthropometric parameters, such as body mass index (BMI) and waist circumference, which are key determinants of PCOS, were not included in this study. These variables, commonly used in clinical practice, could further enhance the model’s predictive accuracy, particularly in assessing metabolic risks and obesity-related complications. Future research should consider incorporating these anthropometric measurements to improve the comprehensiveness and precision of PCOS screening tools.

The nomogram, as an intuitive risk assessment tool, can quickly calculate disease risk based on routine clinical data from patients. However, to effectively integrate the nomogram into clinical practice, it must first be connected to the electronic health record (EHR) system, which will automatically retrieve patient data and calculate risk scores. This tool can serve as a decision support system, helping doctors quickly assess patient risk and develop personalized treatment plans. To ensure effective application, it is necessary to train doctors and regularly evaluate and optimize the model to improve its clinical adaptability and accuracy. In this study, we demonstrated how to use the nomogram to assess the risk of PCOS by inputting clinical data such as age, AMH, FSH, and calculating the score for each variable to determine the individual PCOS risk. This method provides clinicians with a simple and effective tool that helps them make faster and more accurate decisions.
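The scoring logic behind a logistic regression–based nomogram can be sketched as follows: each variable's contribution to the linear predictor is rescaled to points, the points are summed, and the total is mapped back to a probability through the logistic function. The intercept, coefficients, and value ranges below are hypothetical placeholders and are not the fitted values reported in this study.

```python
from math import exp

# Hypothetical intercept, coefficients, and axis ranges (NOT from the study)
INTERCEPT = -1.0
COEFS = {"age": -0.05, "AMH": 0.30, "FSH": -0.10}
RANGES = {"age": (18.0, 45.0), "AMH": (0.0, 15.0), "FSH": (1.0, 15.0)}


def _reference(name):
    """Axis end with the lowest risk contribution (scores 0 points)."""
    lo, hi = RANGES[name]
    return lo if COEFS[name] > 0 else hi


def _span(name):
    lo, hi = RANGES[name]
    return abs(COEFS[name]) * (hi - lo)


# The variable with the widest effect span is assigned 100 points
POINTS_PER_UNIT = 100.0 / max(_span(name) for name in COEFS)


def variable_points(name, value):
    """Points for one variable, measured from its lowest-risk reference."""
    return COEFS[name] * (value - _reference(name)) * POINTS_PER_UNIT


def predicted_risk(patient):
    """Return (total points, predicted probability) for one patient."""
    total = sum(variable_points(name, patient[name]) for name in COEFS)
    # Linear predictor when every variable sits at its reference value
    lp_ref = INTERCEPT + sum(COEFS[name] * _reference(name) for name in COEFS)
    lp = lp_ref + total / POINTS_PER_UNIT
    return total, 1.0 / (1.0 + exp(-lp))


patient = {"age": 25.0, "AMH": 8.0, "FSH": 5.0}  # hypothetical inputs
points, risk = predicted_risk(patient)
print(round(points, 1), round(risk, 3))
```

Because the point scale is only a linear rescaling of the regression terms, the probability obtained from the total points is identical to the one obtained by evaluating the logistic model directly, which is what would allow an EHR-integrated implementation to skip the graphical step.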

Nevertheless, several limitations should be acknowledged. First, as a retrospective study, potential selection bias cannot be excluded. Although validated externally, as Van Calster et al. (39) noted, validation is not the endpoint, and temporal or population-based generalizability warrants further investigation. Second, despite the inclusion of multicenter data, the overall sample size remains moderate; future large-scale, prospective studies are needed to confirm model stability. Third, this study did not incorporate imaging or multi-omics data (e.g., metabolomics, microbiomics). Recent studies have reported significant alterations in the gut microbiota and its metabolites among women with PCOS, suggesting that the gut–metabolic axis may play a key role in disease pathogenesis (40). Future research integrating multi-dimensional data in prospective designs may further enhance the robustness, interpretability, and translational potential of this dual-model framework.

Since all participants in this study were from China, the generalizability of the findings to other ethnic or geographical populations may be limited. Ethnic and geographical factors, such as genetic variations, environmental influences, and healthcare access, may contribute to differences in disease presentation and risk factors. Future studies should consider incorporating multi-ethnic and multi-regional cohorts to validate the applicability and generalizability of the model across diverse populations.

Additionally, it is important to consider the impact of menstrual cycle phase and assay timing on the accuracy of hormonal measurements. Hormones such as FSH, AMH, and LH fluctuate at different stages of the menstrual cycle, especially during the follicular and luteal phases, where FSH and LH levels can vary significantly. To minimize these fluctuations, we recommend that hormonal measurements be performed during standardized phases of the menstrual cycle, such as days 2 to 5 of menstruation. Future studies should further standardize assay timing to ensure that data are collected during consistent menstrual cycle phases, thus improving the stability and accuracy of predictive models.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding authors.

Ethics statement

The studies involving humans were approved by institutional ethics committees (JZYLUN2025-034 and JSYIRB2024-103). The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and institutional requirements.

Author contributions

BY: Conceptualization, Writing – review & editing, Resources, Supervision, Formal Analysis, Writing – original draft, Software, Data curation, Methodology. XY: Writing – original draft, Visualization, Methodology, Validation, Project administration. YZ: Writing – original draft, Investigation, Methodology, Data curation. JC: Writing – original draft, Methodology, Investigation. XZ: Methodology, Investigation, Writing – original draft, Project administration, Resources. CZ: Supervision, Writing – review & editing. TJ: Writing – review & editing, Supervision.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The authors declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Abbreviations

ML, machine learning; SHAP, SHapley Additive exPlanations; XGBoost, extreme gradient boosting; SVM, support vector machine; DT, decision tree; RF, random forest; LR, logistic regression; LGBM, light gradient boosting machine; KNN, K-Nearest Neighbors algorithm; AUC, area under the curve; F1, F1 score; NPV, Negative Predictive Value; PPV, Positive Predictive Value; ACC, Accuracy; PLR, Positive Likelihood Ratio; SEN, Sensitivity; NLR, Negative Likelihood Ratio; SPE, Specificity; PPA, Positive Predictive Agreement; NPA, Negative Predictive Agreement; TPA, Total Predictive Agreement; KAPPA, Cohen’s Kappa coefficient; AI, Artificial Intelligence; PCOS, Polycystic Ovary Syndrome; AD, Androstenedione; DHT, Dihydrotestosterone; 17α-OHP, 17α-Hydroxyprogesterone; E1, Estrone; LH, Luteinizing Hormone; P, Progesterone; T, Testosterone; FSH, Follicle-Stimulating Hormone; E2, Estradiol; PRL, Prolactin; DHEAS, Dehydroepiandrosterone Sulfate; AMH, Anti-Müllerian Hormone; INS-0h, Insulin-0h; INS-0.5h, Insulin-0.5h; INS-1h, Insulin-1h; INS-2h, Insulin-2h; INS-3h, Insulin-3h; TCH, Total Cholesterol; TG, Triglyceride; Glu, Glucose; ApoE, Apolipoprotein E; Lp(a), Lipoprotein(a).

References

1. Sadeghi HM, Adeli I, Calina D, Docea AO, Mousavi T, Daniali M, et al. Polycystic ovary syndrome: A comprehensive review of pathogenesis, management, and drug repurposing. Int J Mol Sci. (2022) 23:2. doi: 10.3390/ijms23020583

2. Yasmin A, Roychoudhury S, Paul Choudhury A, Ahmed ABF, Dutta S, Mottola F, et al. Polycystic ovary syndrome: an updated overview foregrounding impacts of ethnicities and geographic variations. Life (Basel). (2022) 12. doi: 10.3390/life12121974

3. Safiri S, Noori M, Nejadghaderi SA, Karamzad N, Carson-Chahhoud K, Sullman MJM, et al. Prevalence, incidence and years lived with disability due to polycystic ovary syndrome in 204 countries and territories, 1990-2019. Hum Reprod. (2022) 37:1919–31. doi: 10.1093/humrep/deac091

4. Wolf WM, Wattick RA, Kinkade ON, and Olfert MD. Geographical prevalence of polycystic ovary syndrome as determined by region and race/ethnicity. Int J Environ Res Public Health. (2018) 15. doi: 10.3390/ijerph15112589

5. Zehravi M, Maqbool M, and Ara I. Polycystic ovary syndrome and infertility: an update. Int J Adolesc Med Health. (2021) 34:1–9. doi: 10.1515/ijamh-2021-0073

6. Cassar S, Misso ML, Hopkins WG, Shaw CS, Teede HJ, and Stepto NK. Insulin resistance in polycystic ovary syndrome: a systematic review and meta-analysis of euglycaemic–hyperinsulinaemic clamp studies. Hum Reprod. (2016) 31:2619–31. doi: 10.1093/humrep/dew243

7. Joham AE, Norman RJ, Stener-Victorin E, Legro RS, Franks S, Moran LJ, et al. Polycystic ovary syndrome. Lancet Diabetes Endocrinol. (2022) 10:668–80. doi: 10.1016/s2213-8587(22)00163-2

8. Helvaci N and Yildiz BO. Polycystic ovary syndrome as a metabolic disease. Nat Rev Endocrinol. (2025) 21:230–44. doi: 10.1038/s41574-024-01057-w

9. Rotterdam ESHRE/ASRM-Sponsored PCOS Consensus Workshop Group. Revised 2003 consensus on diagnostic criteria and long-term health risks related to polycystic ovary syndrome. Fertil Steril. (2004) 81:19–25. doi: 10.1016/j.fertnstert.2003.10.004

10. Teede HJ, Tay CT, Laven JJE, Dokras A, Moran LJ, Piltonen TT, et al. Recommendations from the 2023 international evidence-based guideline for the assessment and management of polycystic ovary syndrome. J Clin Endocrinol Metab. (2023) 108:2447–69. doi: 10.1210/clinem/dgad463

11. Teede HJ, Tay CT, Laven J, Dokras A, Moran LJ, Piltonen TT, et al. Recommendations from the 2023 international evidence-based guideline for the assessment and management of polycystic ovary syndrome†. Hum Reprod. (2023) 38:1655–79. doi: 10.1093/humrep/dead156

12. Copp T, Muscat DM, Hersch J, McCaffery KJ, Doust J, Mol BW, et al. Clinicians' perspectives on diagnosing polycystic ovary syndrome in Australia: a qualitative study. Hum Reprod. (2020) 35:660–8. doi: 10.1093/humrep/deaa005

13. Joham AE, Piltonen T, Lujan ME, Kiconco S, and Tay CT. Challenges in diagnosis and understanding of natural history of polycystic ovary syndrome. Clin Endocrinol (Oxf). (2022) 97:165–73. doi: 10.1111/cen.14757

14. Witchel SF, Oberfield SE, and Peña AS. Polycystic ovary syndrome: pathophysiology, presentation, and treatment with emphasis on adolescent girls. J Endocr Soc. (2019) 3:1545–73. doi: 10.1210/js.2019-00078

15. Greenwood EA and Huddleston HG. Insulin resistance in polycystic ovary syndrome: concept versus cutoff. Fertil Steril. (2019) 112:827–8. doi: 10.1016/j.fertnstert.2019.08.100

16. Walford H, Tyler B, Abbara A, Clarke S, Talaulikar V, and Wattar BA. Biomarkers to inform the management of polycystic ovary syndrome: A review of systematic reviews. Clin Endocrinol (Oxf). (2024) 101:535–48. doi: 10.1111/cen.15101

17. Rajkomar A, Dean J, and Kohane I. Machine learning in medicine. N Engl J Med. (2019) 380:1347–58. doi: 10.1056/NEJMra1814259

18. Holzinger A, Biemann C, Pattichis CS, and Kell DB. What do we need to build explainable AI systems for the medical domain? arXiv preprint arXiv:1712.09923. (2017).

19. Wang J, Zeng Z, Li Z, Liu G, Zhang S, Luo C, et al. The clinical application of artificial intelligence in cancer precision treatment. J Transl Med. (2025) 23:120. doi: 10.1186/s12967-025-06139-5

20. Wang H, Fu T, Du Y, Gao W, Huang K, Liu Z, et al. Scientific discovery in the age of artificial intelligence. Nature. (2023) 620:47–60. doi: 10.1038/s41586-023-06221-2

21. Bhinder B, Gilvary C, Madhukar NS, and Elemento O. Artificial intelligence in cancer research and precision medicine. Cancer Discov. (2021) 11:900–15. doi: 10.1158/2159-8290.Cd-21-0090

22. Zhao T, Wang S, Ouyang C, Chen M, Liu C, Zhang J, et al. Artificial intelligence for geoscience: Progress, challenges, and perspectives. Innovation (Camb). (2024) 5:100691. doi: 10.1016/j.xinn.2024.100691

23. Beam AL and Kohane IS. Big data and machine learning in health care. JAMA. (2018) 319:1317–8. doi: 10.1001/jama.2017.18391

24. Kulkarni S, Gupta K, Ratre P, Mishra PK, Singh Y, Biharee A, et al. Polycystic ovary syndrome: Current scenario and future insights. Drug Discov Today. (2023) 28:103821. doi: 10.1016/j.drudis.2023.103821

25. Wang H, Liang Q, Hancock JT, and Khoshgoftaar TM. Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods. J Big Data. (2024) 11:44. doi: 10.1186/s40537-024-00905-w

26. Yang H, Qin G, Liu Z, Hu Y, and Dai Q. (2024). LightGBM robust optimization algorithm based on topological data analysis, in: Proceedings of the 2024 International Conference on Computer and Multimedia Technology. pp. 574–82.

27. Breiman L. Random forests. Mach Learn. (2001) 45:5–32. doi: 10.1023/A:1010933404324

28. Karalis G. Decision trees and applications. Adv Exp Med Biol. (2020) 1194:239–42. doi: 10.1007/978-3-030-32622-7_21

29. Carvalho DV, Pereira EM, and Cardoso JS. Machine learning interpretability: A survey on methods and metrics. Electronics. (2019) 8:832. doi: 10.3390/electronics8080832

30. Doshi-Velez F and Kim B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. (2017).

31. Guidotti R, Monreale A, Ruggieri S, Turini F, Giannotti F, and Pedreschi D. A survey of methods for explaining black box models. ACM Comput Surv. (2018) 51:1–42. doi: 10.1145/3236009

32. de Medeiros SF, Yamamoto MMW, Souto de Medeiros MA, Barbosa BB, Soares JM, and Baracat EC. Changes in clinical and biochemical characteristics of polycystic ovary syndrome with advancing age. Endocr Connect. (2020) 9:74–89. doi: 10.1530/ec-19-0496

33. Lesley U and Kuratomi Hernández A. (2024). Improving XAI explanations for clinical decision-making–Physicians’ perspective on local explanations in healthcare, in: International Conference on Artificial Intelligence in Medicine. pp. 296–312.

34. Shortliffe EH and Sepúlveda MJ. Clinical decision support in the era of artificial intelligence. JAMA. (2018) 320:2199–200. doi: 10.1001/jama.2018.17163

35. van der Ham K, Laven JSE, Tay CT, Mousa A, Teede H, and Louwers YV. Anti-müllerian hormone as a diagnostic biomarker for polycystic ovary syndrome and polycystic ovarian morphology: a systematic review and meta-analysis. Fertil Steril. (2024) 122:727–39. doi: 10.1016/j.fertnstert.2024.05.163

36. Gomes MO, Gomes JO, Ananias LF, Lombardi LA, da Silva FS, and Espindula AP. Anti-Müllerian hormone as a diagnostic marker of polycystic ovary syndrome: a systematic review with meta-analysis. Am J Obstet Gynecol. (2025) 232:506–23.e7. doi: 10.1016/j.ajog.2025.01.044

37. Tong C, Wu Y, Zhuang Z, and Yu Y. A diagnostic model for polycystic ovary syndrome based on machine learning. Sci Rep. (2025) 15:9821. doi: 10.1038/s41598-025-92630-4

38. Moral P, Mustafi D, Mustafi A, and Sahana SK. CystNet: An AI driven model for PCOS detection using multilevel thresholding of ultrasound images. Sci Rep. (2024) 14:25012. doi: 10.1038/s41598-024-75964-3

39. Van Calster B, Steyerberg EW, Wynants L, and van Smeden M. There is no such thing as a validated prediction model. BMC Med. (2023) 21:70. doi: 10.1186/s12916-023-02779-w

40. da Silva TR, Marchesan LB, Rampelotto PH, Longo L, de Oliveira TF, Landberg R, et al. Gut microbiota and gut-derived metabolites are altered and associated with dietary intake in women with polycystic ovary syndrome. J Ovarian Res. (2024) 17:232. doi: 10.1186/s13048-024-01550-w

Keywords: early screening, machine learning, nomogram, polycystic ovary syndrome, SHAP, XGBoost

Citation: Yao B, Yu X, Zhang Y, Chen J, Zhu X, Zhang C and Jijun T (2025) Development and validation of an explainable machine learning and nomogram model for early detection and risk stratification of polycystic ovary syndrome: a multicenter study. Front. Endocrinol. 16:1719631. doi: 10.3389/fendo.2025.1719631

Received: 06 October 2025; Revised: 27 November 2025; Accepted: 01 December 2025; Published: 17 December 2025.

Edited by:

Marco Bonomi, University of Milan, Italy

Reviewed by:

Elisa Maseroli, University of Florence, Italy
Valeria Lanzi, University of Milan, Italy

Copyright © 2025 Yao, Yu, Zhang, Chen, Zhu, Zhang and Jijun. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Cheng Zhang, zhangcheng024@126.com; Tong Jijun, jijuntong@zstu.edu.cn

^†These authors have contributed equally to this work
