ORIGINAL RESEARCH article

Front. Endocrinol., 10 February 2026

Sec. Obesity

Volume 17 - 2026 | https://doi.org/10.3389/fendo.2026.1772106

Machine learning–based prediction of IVF/ICSI outcomes in male factor infertility highlighting couple-level BMI

Hu Li1†, Jie Gao1†, Yiran Li2*
  • 1Shanghai Key Laboratory of Maternal Fetal Medicine, Shanghai Institute of Maternal-Fetal Medicine and Gynecologic Oncology, Shanghai First Maternity and Infant Hospital, School of Medicine, Tongji University, Shanghai, China
  • 2Centre for Assisted Reproduction, Shanghai Key Laboratory of Maternal Fetal Medicine, Shanghai Institute of Maternal-Fetal Medicine and Gynecologic Oncology, Shanghai First Maternity and Infant Hospital, School of Medicine, Tongji University, Shanghai, China

Background: Most clinical prediction models for assisted reproductive technology focus primarily on female ovarian reserve markers and often under-represent male factors and the metabolic status of both partners. Additionally, traditional parametric models may have limited ability to capture nonlinear patterns within reproductive data. This study aimed to develop and validate a machine learning (ML)–based model to predict clinical pregnancy outcomes in couples with male factor infertility undergoing IVF/ICSI, and to explore model interpretability using Shapley Additive exPlanations (SHAP).

Methods: This retrospective study analyzed 2,565 couples undergoing their first IVF/ICSI cycle for male factor infertility at Shanghai First Maternity and Infant Hospital between 2019 and 2025. The cohort was partitioned according to embryo transfer date, with the first 70% of cases assigned to the training set and the remaining 30% reserved as a temporal internal validation set. Feature selection was conducted using LASSO regression within the training set. Seven ML models, including LightGBM and Logistic Regression, were developed and optimized through 5-fold cross-validation. Model performance was evaluated using the area under the curve (AUC), accuracy, Brier score, and decision curve analysis. SHAP was employed to provide a visual interpretation of the optimal model.

Results: Five predictors were selected in the training set: female BMI, male BMI, basal FSH, AMH, and female age. In the temporal validation set, all models demonstrated comparable discriminative performance (AUC range: 0.840–0.857). LightGBM achieved an AUC of 0.857 (95% CI: 0.830–0.882), with an accuracy of 0.775 and specificity of 0.909. DeLong tests indicated no statistically significant differences in AUC between LightGBM and Random Forest (P = 0.918), XGBoost (P = 0.985), or logistic regression (P = 0.067). Based on its overall stability across discrimination, calibration (Brier score = 0.145), and clinical utility, LightGBM was selected for interpretability analysis.

Conclusions: A LightGBM-based prediction model demonstrated reasonable performance for predicting IVF/ICSI outcomes in couples with male factor infertility. Within this dataset, couple-level metabolic features were strongly associated with model predictions alongside traditional ovarian reserve markers. These findings reflect predictive associations rather than causal effects and suggest that metabolic characteristics may warrant consideration in risk stratification and counseling. Prospective studies are needed to determine whether targeted interventions can improve clinical outcomes.

1 Introduction

Infertility has become a global public health concern and affects approximately 15% of couples of reproductive age (1, 2). Male factors account for 40%–50% of these cases (3). For patients with severe oligozoospermia, asthenozoospermia, or teratozoospermia, in vitro fertilization (IVF) and intracytoplasmic sperm injection (ICSI) remain the most effective treatment options (4, 5). Nevertheless, despite advances in assisted reproductive technology, the clinical pregnancy rate per IVF/ICSI cycle is still only 40%–60% (6–8). Failed cycles place a substantial financial burden on patients and are frequently associated with considerable psychological distress (9, 10). Accordingly, accurate pre-treatment prediction of pregnancy success is important for individualized treatment planning and for setting realistic expectations.

Traditional prediction models, including the Templeton model and the Nelson model, are largely based on logistic regression (11–13). Although these models have broad applicability in general populations, they have several limitations. First, these models largely center on female age and ovarian reserve markers such as AMH and FSH, while giving limited consideration to partner-related characteristics and their potential interactions, including male BMI and age (14). Second, although logistic regression is a type of generalized linear model, prespecified regression models may still be limited in their ability to flexibly capture complex nonlinear relationships and high-dimensional structures that are commonly observed in reproductive datasets, unless nonlinear terms or interactions are explicitly modeled (15). Third, generic models are often not tailored to the male factor infertility subgroup (16), which can compromise predictive accuracy in this population.

In recent years, rapid progress in artificial intelligence and machine learning (ML) has introduced new approaches to clinical prediction. Compared with traditional statistical methods, ML algorithms—including random forest and gradient boosting trees—offer advantages in modeling complex, nonlinear, and high-dimensional data (17, 18). Prior studies have reported promising performance of ML approaches in polycystic ovary syndrome (PCOS) (19). However, high-precision ML models based on large samples and incorporating couple-level characteristics remain limited in the male factor infertility population. In addition, many ML models are considered “black boxes” with limited clinical interpretability, which hinders their broader implementation in practice (20, 21).

This study aims to develop an IVF/ICSI pregnancy outcome prediction model for couples with male factor infertility using a single-center, large-sample retrospective dataset and multiple ML algorithms. Particular attention is given to quantifying the contribution of spousal BMI within the prediction framework using SHAP analysis. The findings may offer additional insight to support clinical decision-making in this setting.

2 Materials and methods

2.1 Subjects and design

This retrospective cohort study used data from couples who underwent IVF or ICSI at the Reproductive Medical Center of Shanghai First Maternity and Infant Hospital between January 2019 and January 2025. This study was approved by the Research Ethics Committee of Shanghai First Maternity and Infant Hospital (KS25468). The inclusion criteria were: (1) a primary diagnosis of male factor infertility, including oligozoospermia, asthenozoospermia, or teratozoospermia, defined according to the WHO 5th edition criteria (22); (2) treatment with conventional IVF or ICSI; (3) complete follow-up records for pregnancy outcomes; (4) only the first IVF/ICSI treatment cycle was included for each couple. The exclusion criteria were: (1) severe uterine malformations or intrauterine adhesions in the female partner; (2) chromosomal karyotype abnormalities in either partner; (3) cycles involving donor sperm or oocytes; and (4) missing values in non-imputable administrative or eligibility variables (none in the final analytic cohort). Ultimately, 2,565 couples were included. To enhance the methodological rigor and better reflect real-world clinical application, the cohort was partitioned strictly according to the date of embryo transfer (23). The earliest 70% of cases (n = 1,797) were assigned to the training set, and the most recent 30% (n = 768) were reserved as an internal validation set. The training set was used exclusively for feature selection, model development, and hyperparameter tuning, whereas the validation set was used only for final model evaluation.
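
The chronological partition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the `transfer_date` key, and the rounding of the cut point are assumptions, and the toy cohort is invented.

```python
from datetime import date

def temporal_split(records, date_key="transfer_date", train_fraction=0.70):
    """Sort records by date; the earliest fraction goes to training."""
    ordered = sorted(records, key=lambda r: r[date_key])
    # Assumption: the cut index is rounded; handling of ties on the same
    # date is not specified in the text.
    cut = int(round(train_fraction * len(ordered)))
    return ordered[:cut], ordered[cut:]

# Hypothetical toy cohort of 10 couples with illustrative transfer dates.
cohort = [
    {"id": i, "transfer_date": date(2019 + i % 6, 1 + i, 15)}
    for i in range(10)
]
train, valid = temporal_split(cohort)

latest_train = max(r["transfer_date"] for r in train)
earliest_valid = min(r["transfer_date"] for r in valid)
print(len(train), len(valid), latest_train <= earliest_valid)
```

Because every training record precedes every validation record in time, the validation set mimics applying the model to later patients, which is the rationale given for the split.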

2.2 Data collection

Variables were extracted from the electronic medical record (EMR) system. Demographic characteristics included female age, male age, female body mass index (BMI), male BMI, infertility duration, infertility type, female education, and male education. Clinical characteristics included menstrual regularity; basal follicle-stimulating hormone (FSH), luteinizing hormone (LH), estradiol (E2), progesterone (P), testosterone (T), prolactin (PRL), and anti-Müllerian hormone (AMH). The outcome was clinical pregnancy, defined as the presence of a gestational sac with fetal cardiac activity in the uterine cavity on transvaginal ultrasound 28–35 days after embryo transfer (24). Absence of a gestational sac or biochemical pregnancy was classified as non-clinical pregnancy.

2.3 Data preprocessing and feature selection

For missing data, multiple imputation was performed using the mice package in R (25). In this dataset, all candidate predictors exhibited low levels of missingness (<5% in both the training and validation sets; Supplementary Table S1 and Supplementary Figure S1) (25). No participants were excluded due to missing non-imputable administrative or eligibility-defining variables (e.g., outcome follow-up), and therefore the final analytic cohort contained complete information on all eligibility-defining variables. Given the low proportion of missingness across predictors, multiple imputation was applied to all predictors to avoid unnecessary case deletion while minimizing potential instability associated with highly incomplete variables. Five imputed datasets were generated (m = 5, seed = 123). To avoid information leakage, imputation was conducted after the temporal split and performed separately within the training set and the temporal validation set. In the training set, the imputation model included all candidate predictors and the outcome variable in order to preserve predictor–outcome associations for model development. In the validation set, the imputation model included only predictors and explicitly excluded the outcome variable, thereby preventing outcome-informed imputation during model evaluation. Predictive mean matching was used for continuous variables and logistic regression for binary variables, with 20 iterations per imputation. The predictor matrix followed the default mice setting in which predictors were allowed to inform each other, except where structurally inappropriate; specifically, the outcome variable was excluded from all validation-set imputation models, and administrative or eligibility-defining variables were not imputed. The full imputation methods and predictor matrix for the training set are reported in Supplementary Table S2 and Supplementary Figure S2. 
Feature selection was performed using the least absolute shrinkage and selection operator (LASSO) within the training set only (26). LASSO was implemented in R using the glmnet package with internal standardization (standardize = TRUE). LASSO was fitted separately within each of the five imputed training datasets, and predictors were retained in the final feature set if they showed stable selection at λ_1SE across imputations, operationalized as non-zero coefficients in at least four of the five imputed datasets (27). The penalty parameter λ was selected using 10-fold cross-validation: the minimum deviance occurred at λ_min = 0.00406, and the 1-standard-error criterion selected λ_1SE = 0.01801. For downstream machine learning analyses, continuous predictors were standardized to Z-scores using the StandardScaler class in the Python scikit-learn library (28), with scaling parameters learned from the training set and then applied to the validation set. Model development proceeded separately within each imputed training dataset. Each fitted model was then applied to each imputed validation dataset to generate predicted probabilities; for each individual, predicted probabilities were averaged across imputations to obtain pooled predictions. All performance metrics were calculated using these pooled predicted probabilities.
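
Two steps in this paragraph lend themselves to a short sketch: the stability rule (non-zero coefficient in at least four of five imputations) and Z-score scaling with parameters learned from the training set only. The code below is an illustration in plain Python rather than glmnet or scikit-learn; the coefficient values and BMI numbers are invented, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def stable_features(coefs_per_imputation, min_count=4):
    """Keep features whose coefficient is non-zero in >= min_count imputations."""
    counts = {}
    for coefs in coefs_per_imputation:
        for name, value in coefs.items():
            if value != 0.0:
                counts[name] = counts.get(name, 0) + 1
    return sorted(name for name, c in counts.items() if c >= min_count)

def fit_scaler(train_values):
    """Learn mean and population SD from training data only."""
    return mean(train_values), pstdev(train_values)

def apply_scaler(values, mu, sd):
    """Apply previously learned parameters (no refitting on new data)."""
    return [(v - mu) / sd for v in values]

# Invented LASSO coefficients for five imputed training datasets.
coefs = [
    {"female_bmi": -0.21, "male_bmi": -0.25, "menstrual_regularity": 0.0},
    {"female_bmi": -0.20, "male_bmi": -0.24, "menstrual_regularity": 0.01},
    {"female_bmi": -0.22, "male_bmi": 0.0, "menstrual_regularity": 0.0},
    {"female_bmi": -0.19, "male_bmi": -0.26, "menstrual_regularity": 0.0},
    {"female_bmi": -0.21, "male_bmi": -0.23, "menstrual_regularity": -0.02},
]
selected = stable_features(coefs)

# Scaling parameters come from the training values and are reused as-is.
mu, sd = fit_scaler([20.0, 22.0, 24.0])
valid_z = apply_scaler([22.0, 26.0], mu, sd)
print(selected, valid_z)
```

Reusing the training mean and SD on the validation data keeps information from the validation set out of model development.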

2.4 Model construction and hyperparameter tuning

Seven commonly used machine learning algorithms were developed to predict pregnancy outcomes: logistic regression (LR), decision tree (DT), random forest (RF), support vector machine (SVM), artificial neural network (ANN), XGBoost, and LightGBM. Hyperparameters for each model were optimized using 5-fold cross-validation with grid search within the training set. To ensure comparability and avoid optimistic bias under multiple imputation, hyperparameter tuning was performed only once using the first imputed training dataset, with a fixed random seed (seed = 123) (29). The resulting optimal hyperparameters were then held constant and applied to all five imputed training datasets for model fitting. This strategy ensured that model complexity and tuning degrees of freedom were consistent across imputations while allowing uncertainty due to imputation to be reflected in model estimation.

2.5 Model evaluation and interpretation

Model performance was evaluated in the validation set using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and F1 score. For all classification metrics, predicted probabilities were dichotomized using a fixed operating threshold of 0.5, which was applied consistently across all models and datasets to ensure comparability.
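
As a minimal sketch of the threshold-based metrics listed above, the function below dichotomizes predicted probabilities at 0.5 and derives each metric from the resulting confusion matrix. The labels and probabilities are invented, and treating a probability exactly equal to the threshold as positive is an assumption.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Confusion-matrix metrics after dichotomizing probabilities."""
    # Assumption: probabilities equal to the threshold count as positive.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Invented outcomes (1 = clinical pregnancy) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]
metrics = classification_metrics(y_true, y_prob)
print(metrics)
```

A fixed threshold makes models directly comparable, but sensitivity and specificity trade off against each other as the threshold moves, so these values describe one operating point only.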

Calibration was evaluated using the Brier score (30) and calibration curves constructed by grouping predicted probabilities into deciles and plotting observed versus predicted outcome probabilities. Clinical utility was examined using decision curve analysis (DCA) over a clinically plausible threshold probability range of 20–80%, reflecting the range in which clinicians may reasonably consider counseling or intervention in the context of IVF/ICSI outcome prediction.
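
For reference, the Brier score is the mean squared difference between predicted probability and observed outcome, and the net benefit at threshold probability pt is TP/n − (FP/n) × pt/(1 − pt). The sketch below implements both in plain Python on invented data; it is not the authors' implementation, and the 10-point threshold grid is an assumption.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def net_benefit(y_true, y_prob, pt):
    """Net benefit of acting on predictions at threshold probability pt."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 0)
    return tp / n - fp / n * pt / (1 - pt)

def net_benefit_treat_all(y_true, pt):
    """Reference strategy: classify everyone as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * pt / (1 - pt)

# Invented outcomes and predicted probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.8, 0.3, 0.6, 0.4]
brier = brier_score(y_true, y_prob)

# Threshold range of 20-80%, as in the text; the step size is assumed.
thresholds = [t / 100 for t in range(20, 81, 10)]
curve = [
    (pt, net_benefit(y_true, y_prob, pt), net_benefit_treat_all(y_true, pt))
    for pt in thresholds
]
print(brier, curve)
```

A model is considered useful at a given threshold when its net benefit exceeds both the treat-all reference and the treat-none reference, whose net benefit is zero.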

Models were fitted separately within each of the five imputed datasets. For each individual and each model, predicted probabilities were averaged across imputations to obtain a single pooled predicted probability. All discrimination metrics (including AUC), classification metrics, calibration analyses, and decision curve analyses were computed based on these pooled predicted probabilities.
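
The pooling step amounts to a per-individual average across imputations. A minimal sketch, assuming each imputation yields one probability per individual in the same order (the values below are invented):

```python
def pool_predictions(prob_by_imputation):
    """Average each individual's predicted probability across imputations."""
    m = len(prob_by_imputation)
    n = len(prob_by_imputation[0])
    return [sum(probs[i] for probs in prob_by_imputation) / m for i in range(n)]

# Invented predictions: 5 imputations (rows) x 3 individuals (columns).
prob_by_imputation = [
    [0.60, 0.20, 0.80],
    [0.62, 0.18, 0.79],
    [0.58, 0.22, 0.81],
    [0.61, 0.19, 0.80],
    [0.59, 0.21, 0.80],
]
pooled = pool_predictions(prob_by_imputation)
print(pooled)
```

Averaging yields one probability per individual, so every downstream metric is computed once; the spread between imputations is not carried into those metrics.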

The 95% confidence intervals for AUC were estimated using nonparametric bootstrap resampling (1,000 replications) applied to the pooled predicted probabilities within each dataset. Pairwise comparisons of AUCs between models were conducted using DeLong’s test for correlated receiver operating characteristic curves based on ROC curves constructed from the pooled predicted probabilities. Specifically, the AUCs of each model were compared against those of LightGBM and logistic regression, respectively, and the corresponding P values were reported.
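
A bootstrap confidence interval for the AUC can be sketched as below. The text specifies nonparametric resampling with 1,000 replications; the percentile method, the seed, and the rule of redrawing resamples that contain only one class are assumptions of this sketch, and the data are simulated.

```python
import random

def auc(y_true, y_prob):
    """AUC as P(random positive scores above random negative); ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, y_prob, n_boot=1000, alpha=0.05, seed=123):
    """Percentile bootstrap CI (assumed method) for the AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # AUC needs both classes; otherwise redraw
            stats.append(auc(ys, [y_prob[i] for i in idx]))
    stats.sort()
    lower = stats[int(round(alpha / 2 * n_boot))]
    upper = stats[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper

# Simulated outcomes and probabilities, for illustration only.
rng = random.Random(0)
y_true = [1 if rng.random() < 0.5 else 0 for _ in range(80)]
y_prob = [min(1.0, max(0.0, 0.35 + 0.3 * y + rng.gauss(0, 0.15))) for y in y_true]
point = auc(y_true, y_prob)
ci_low, ci_high = bootstrap_auc_ci(y_true, y_prob)
print(point, ci_low, ci_high)
```

Resampling individuals together with their pooled probabilities captures sampling variability in the validation set, but not the variability between imputations.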

The representative model was then interpreted using SHAP to quantify the marginal contribution of each feature (31). SHAP summary plots, beeswarm plots, and dependence plots were generated to illustrate the model’s decision patterns. SHAP values were computed using the standard interventional SHAP implementation provided by the SHAP Python package. We acknowledge that when predictors are correlated, feature attributions may be influenced by feature dependence, and therefore SHAP results should be interpreted as model-based associations rather than causal effects.
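
To make the idea of interventional attribution concrete, the sketch below computes exact Shapley values by brute force for a small, invented scoring function. It does not use the SHAP package or the fitted LightGBM model: `toy_predict`, the background rows, and the example patient are all hypothetical. Features outside a coalition are replaced by values from background rows, and the predictions are averaged.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact interventional Shapley values for one instance (brute force)."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take x's values; others come from background.
        total = 0.0
        for row in background:
            z = [x[j] if j in coalition else row[j] for j in range(n)]
            total += predict(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi

def toy_predict(z):
    """Hypothetical score: penalties above BMI 24, small bonus for AMH."""
    female_bmi, male_bmi, amh = z
    return (0.6 - 0.02 * max(female_bmi - 24.0, 0.0)
            - 0.02 * max(male_bmi - 24.0, 0.0) + 0.03 * amh)

background = [[21.0, 23.0, 2.0], [26.0, 25.0, 3.5], [23.0, 28.0, 1.0]]
x = [29.0, 27.0, 2.5]
phi = shapley_values(toy_predict, x, background)
baseline = sum(toy_predict(b) for b in background) / len(background)
print(phi, baseline, toy_predict(x))
```

The attributions sum to the difference between the prediction for `x` and the average background prediction, which is the additivity property that bar, beeswarm, and force plots rely on. Enumerating every coalition is feasible only for a handful of features.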

2.6 Statistical analysis

All analyses were performed using R (version 4.2.0) and Python (version 3.9.0). In R, data preprocessing, multiple imputation, descriptive analyses, dataset partitioning, and LASSO feature selection were conducted using standard statistical packages, including mice, caret, tableone, and glmnet. All machine learning model development, hyperparameter tuning, and performance evaluation were conducted in Python. Graphical analyses, including ROC curves, calibration curves, and decision curve analysis, were generated using matplotlib and ggplot2.

3 Results

3.1 Baseline characteristics

Baseline characteristics of the total cohort, training set, and validation set are presented in Table 1. The distributions of outcomes, demographic variables, and clinical characteristics were highly comparable between the training and validation sets. All standardized mean differences (SMDs) were below 0.25, with most variables showing SMDs < 0.10, indicating good balance between the two datasets. Slight imbalances were observed for female age (SMD = 0.218) and AMH (SMD = 0.113), but the overall clinical characteristics remained broadly similar across the two subsets, supporting the appropriateness of the temporal split for model development and validation.
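
For a continuous variable, a commonly used definition of the SMD divides the difference in means by the square root of the average of the two group variances. The sketch below assumes that definition (the text does not state the exact formula used) and runs on invented ages.

```python
from statistics import mean, variance

def smd_continuous(a, b):
    """Standardized mean difference for a continuous variable."""
    pooled_sd = ((variance(a) + variance(b)) / 2) ** 0.5
    return abs(mean(a) - mean(b)) / pooled_sd

# Invented female ages in two small groups.
training_ages = [30.0, 32.0, 34.0]
validation_ages = [31.0, 33.0, 35.0]
smd = smd_continuous(training_ages, validation_ages)
print(smd)
```

Unlike a P value, the SMD does not shrink or grow with sample size, which makes it a convenient balance measure when comparing two subsets of a large cohort.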

Table 1. Baseline characteristics of the overall cohort and comparison between the training and validation sets.

3.2 LASSO regression feature selection

LASSO regression was applied to reduce multicollinearity and identify key predictors. As shown in Figures 1A, B, the penalty parameter was selected using 10-fold cross-validation. The minimum cross-validated deviance occurred at λ_min = 0.00406, while the 1-standard-error criterion selected λ_1SE = 0.01801, yielding a parsimonious set of five predictors with non-zero coefficients. Across the five imputed training datasets, five predictors were consistently selected by LASSO at λ_1SE in at least four imputations: female BMI, male BMI, basal FSH, AMH, and female age. Variables such as menstrual regularity, infertility type, and education level were not stably selected and were therefore excluded. This stability-based selection suggests that spousal BMI and ovarian reserve markers constituted the most robust predictors within the available feature set.
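
The relationship between λ_min and λ_1SE can be sketched as follows: λ_min minimizes the mean cross-validated deviance, and λ_1SE is the largest λ whose mean deviance lies within one standard error of that minimum. The grid and deviance values below are invented and do not reproduce the cross-validation curve in Figure 1A.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    cutoff = cv_mean[i_min] + cv_se[i_min]
    # Largest penalty whose mean deviance is still within one SE of the minimum.
    lambda_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= cutoff)
    return lambdas[i_min], lambda_1se

# Invented penalty grid with mean deviance and standard error per value.
lambdas = [0.001, 0.004, 0.008, 0.018, 0.040]
cv_mean = [1.012, 1.000, 1.004, 1.009, 1.050]
cv_se = [0.010, 0.010, 0.010, 0.010, 0.010]
lambda_min, lambda_1se = select_lambdas(lambdas, cv_mean, cv_se)
print(lambda_min, lambda_1se)
```

Because a larger penalty shrinks more coefficients to zero, λ_1SE gives a sparser model than λ_min at a cost in deviance that is within the noise of the cross-validation estimate.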

Figure 1
Panel A shows a graph of binomial deviance against log lambda, with red points along a curve that increases sharply. Panel B illustrates coefficients plotted against log lambda, with multiple colored lines diverging as lambda increases.

Figure 1. Feature selection using the LASSO regression model. (A) LASSO regression factor selection: the left dashed line marks the optimal lambda value (λ_min), and the right dashed line marks the largest lambda value within one standard error of the optimum (λ_1SE). (B) Coefficient trajectories of the candidate variables in the LASSO regression model.

3.3 Machine learning model performance evaluation

The predictive performance of the seven models in the training and validation sets is summarized in Table 2 and Supplementary Table S3, with ROC curves, calibration plots, and decision curve analyses shown in Figure 2, and confusion matrices presented in Figure 3. In the training set, ensemble models (Random Forest, XGBoost, and LightGBM) achieved higher AUCs than other algorithms (AUCs 0.903–0.923), whereas model performance became more comparable in the temporal validation set, with AUCs ranging narrowly from 0.840 to 0.857 across all models (Table 2, Figure 2B). LightGBM, XGBoost, and Random Forest demonstrated almost identical discriminative ability (all AUC = 0.857). DeLong tests indicated no statistically significant differences in AUC between LightGBM and Random Forest (P = 0.918), XGBoost (P = 0.985), or Logistic Regression (P = 0.067).

Table 2. Predictive performance of seven models in the validation set.

Figure 2
Panel A shows ROC curves for training set model comparison with multiple models plotted, where logistic regression achieves the highest AUC of 0.836. Panel B displays ROC curves for validation set model comparison with logistic regression also having the highest AUC of 0.841. Panel C presents decision curve analysis for training set models, while Panel D shows it for validation set models. Panels E and F illustrate calibration curves for training and validation sets, respectively, highlighting the logistic model with a Brier score of 0.044 for training and 0.048 for validation. Each panel compares several machine learning models.

Figure 2. Performance evaluation of seven machine learning models in the training and validation sets. Receiver operating characteristic (ROC) curves for the training set (A) and validation set (B). Decision curve analysis (DCA) for the training set (C) and validation set (D). Calibration curves for the training set (E) and validation set (F).

Figure 3
Seven confusion matrices comparing different machine learning models for clinical pregnancy prediction. A: LightGBM, B: Logistic, C: DecisionTree, D: ANN, E: SVM, F: XGBoost, G: RandomForest. Each matrix displays true positives, true negatives, false positives, and false negatives with percentage values, visualized in a blue color gradient.

Figure 3. Confusion matrix heatmaps of machine learning models in the validation set. (A) LightGBM; (B) Logistic Regression; (C) XGBoost; (D) Random Forest; (E) Decision Tree; (F) Support Vector Machine (SVM); (G) Artificial Neural Network (ANN).

Beyond discrimination, LightGBM showed a balanced performance profile in the validation set, with an accuracy of 0.775, high specificity (0.909), and moderate sensitivity (0.596). Calibration analysis suggested reasonable agreement between predicted and observed risks (Brier score = 0.145; Figure 2F), and decision curve analysis indicated net clinical benefit across a range of clinically plausible threshold probabilities (Figure 2D). Taken together, LightGBM was selected as the representative model for subsequent interpretability analyses because of its overall stability across discrimination, calibration, and clinical utility, not because of a statistically superior AUC.

Analysis of confusion matrices (Figure 3) further illustrated model behavior. LightGBM achieved a high true-negative rate (92.20%), reflecting strong specificity, while maintaining a sensitivity comparable to other models. From a clinical perspective, this tendency to limit false-positive predictions may be advantageous for avoiding overly optimistic prognostic assessments in couples with low likelihood of pregnancy.

3.4 Model interpretability analysis

To enhance interpretability of the selected model, SHAP was applied to visualize the contribution of individual predictors to model output (Figure 4). The SHAP bar plot (Figure 4A), based on mean absolute SHAP values, indicated that male BMI and female BMI showed the highest average contributions within the fitted model and the available feature set, followed by basal FSH, AMH, and female age. The SHAP beeswarm plot (Figure 4B) further illustrated the direction of these associations: higher BMI values in either partner (red points) were predominantly located on the negative side of the x-axis, suggesting that higher BMI was associated with lower predicted probability of clinical pregnancy in the model output.

Figure 4
Charts showing SHAP analysis results. A: Bar chart of mean SHAP values highlighting feature importance, with female and male BMI as major factors. B: Bee swarm plot displaying impact on the model output by BMI, FSH, AMH, and female age. C: Waterfall chart illustrating cumulative contribution of features like AMH and BMI to the model output. D: Scatter plots depicting SHAP values against individual features for female age, male and female BMI, AMH, and FSH, showing non-linear relationships and feature value colors.

Figure 4. LightGBM model explanation by the SHAP method. (A) Bar chart of all features. (B) Beeswarm plot. (C) Force plot for one non-pregnant patient. (D) SHAP dependence plots of features in the LightGBM model.

The SHAP force plot (Figure 4C) presents an illustrative individual case, showing how each feature contributed to shifting the prediction from the baseline toward a lower probability of pregnancy. In this example, elevated male BMI and older female age exerted negative contributions that outweighed the positive contribution of AMH. This visualization demonstrates how the model integrates multiple features to generate a personalized prediction, while reflecting model behavior rather than biological causation.

Finally, the SHAP dependence plots (Figure 4D) suggested nonlinear relationships between predictors and model output. Both female BMI and male BMI showed a threshold-like pattern: SHAP values remained relatively neutral within the lower range but declined sharply once BMI exceeded approximately the upper-normal range. AMH demonstrated a modest positive association at low-to-moderate levels, while higher FSH and increasing female age were associated with progressively negative SHAP values. These patterns reflect how the fitted model utilizes these predictors and should be interpreted as model-based associations rather than evidence of specific biological thresholds.

4 Discussion

This study developed and validated a prediction model for IVF/ICSI pregnancy outcomes in couples with male factor infertility using a single-center, large-sample retrospective cohort and the LightGBM algorithm. Compared with conventional logistic regression (Supplementary Table S4), ensemble tree–based methods may offer theoretical flexibility, although discrimination was comparable across models in our study. In the validation set, LightGBM showed an AUC of 0.857 and a specificity of 90.9%. SHAP-based interpretation suggested that, within the fitted model and conditional on the available predictors, spousal body mass index (BMI) exhibited relatively larger contributions to model predictions than traditional ovarian reserve indicators such as basal FSH, AMH, and female age. Importantly, these findings reflect model-based associations rather than causal effects. Nevertheless, the results suggest that, in the clinical context of impaired sperm quality, couple-level metabolic characteristics may represent an underappreciated dimension in prognostic assessment. These findings complement existing frameworks that traditionally emphasize ovarian reserve (32, 33).

This observation is biologically plausible within the pathophysiological context of male factor infertility and may generate hypotheses for future research. Based on existing literature, we propose a conceptual “two-hit” hypothesis as a possible interpretive framework rather than a conclusion supported directly by our data. First, patients with oligozoospermia, asthenozoospermia, or teratozoospermia frequently show increased sperm DNA fragmentation and aberrant epigenetic alterations, and obesity-associated systemic oxidative stress in males may further exacerbate these abnormalities (34, 35). Although ICSI can bypass physical barriers to fertilization, it does not rectify molecular defects carried by sperm, which may lead to embryos with reduced developmental competence, constituting the first hit (36). Second, when female BMI exceeds a threshold, obesity-related chronic low-grade inflammation may alter endometrial gene expression and compromise receptivity and decidualization, representing the second hit (37–39). Importantly, the present study did not directly measure sperm DNA fragmentation, epigenetic alterations, or endometrial receptivity. Therefore, this conceptual framework should be regarded as hypothesis-generating and requires validation in future mechanistic and experimental studies.

Although multiple imputation was used to address missing data, performance metrics and statistical tests were primarily derived from pooled predicted probabilities rather than from fully Rubin-combined estimates across imputations. This approach may underestimate uncertainty because between-imputation variability is not fully propagated. However, given the very low proportion of missingness (<5% for all predictors), the impact of this limitation is likely modest. Future studies with higher levels of missingness should consider fully nested bootstrap–imputation procedures to provide more rigorous uncertainty quantification.

In the SHAP dependence plots, a gradual decline in SHAP values was observed as BMI increased, with a more apparent decrease beyond approximately 24–25 kg/m². This apparent threshold should be interpreted with caution for several reasons and does not imply a clinically actionable cutoff or an intervention threshold. First, the value was derived from visual inspection of SHAP-based plots and reflects model behavior under correlated predictors rather than a clinically or statistically validated boundary. We did not apply formal methods for threshold identification, such as spline-based regression, uncertainty-aware partial dependence analysis, or analyses based on prespecified BMI categories. Second, the observed value closely corresponds to the Chinese definition of overweight (BMI ≥24 kg/m²), indicating that this pattern may partly reflect population-specific characteristics. Therefore, the generalizability of this threshold beyond the present cohort remains uncertain and warrants validation in external populations using alternative BMI classification standards.

These findings may have potential clinical implications, but they should be interpreted with appropriate caution. Rather than advocating a change in clinical practice, our results highlight the possible value of considering couple-level metabolic health alongside traditional ovarian-centered assessments (40). In current practice, clinical efforts often focus on optimizing ovarian stimulation to increase oocyte yield. Our model suggests that metabolic factors may contribute to prognostic stratification and may be useful during patient counseling. However, whether targeted preconception interventions, such as weight reduction—particularly in the male partner—lead to improved ART outcomes remains uncertain and requires confirmation in prospective interventional studies. Therefore, BMI should be regarded as a potentially informative predictive marker in this dataset rather than a basis for mandatory treatment delay or universal prioritization of weight intervention (41).

5 Conclusion

A LightGBM-based model demonstrated reasonable predictive performance for IVF/ICSI pregnancy outcomes in couples with male factor infertility, with relatively high specificity in the validation set. Model interpretation suggested that, within the fitted model and available feature set, couple-level metabolic characteristics were associated with predicted outcomes alongside traditional ovarian reserve markers. These findings represent predictive associations rather than causal effects. BMI may serve as a potentially informative prognostic feature for counseling and risk stratification in this population, while the clinical benefit of targeted metabolic interventions requires confirmation in prospective and interventional studies.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by the institutional ethics committee of Shanghai First Maternity and Infant Hospital. The studies were conducted in accordance with the local legislation and institutional requirements. The ethics committee/institutional review board waived the requirement of written informed consent for participation from the participants or the participants’ legal guardians/next of kin. The requirement for informed consent was waived due to the retrospective nature of the study design and the use of de-identified patient information.

Author contributions

HL: Investigation, Software, Writing – original draft. JG: Data curation, Validation, Visualization, Writing – original draft. YL: Conceptualization, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the Shanghai Health System Outstanding Talents Program (Grant No. 20234Z0019).

Acknowledgments

The authors would like to acknowledge the helpful suggestions concerning this study received from their colleagues.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fendo.2026.1772106/full#supplementary-material

References

1. Serafini S and O’Flaherty C. Dysregulation of sphingolipid and cholesterol homeostasis imposes oxidative stress in human spermatozoa. Redox Biol. (2025) 84:103669. doi: 10.1016/j.redox.2025.103669

2. Jin Z-R, Fang D, Liu B-H, Cai J, Tang W-H, Jiang H, et al. Roles of CatSper channels in the pathogenesis of asthenozoospermia and the therapeutic effects of acupuncture-like treatment on asthenozoospermia. Theranostics. (2021) 11:2822–44. doi: 10.7150/thno.51869

3. Minhas S, Bettocchi C, Boeri L, Capogrosso P, Carvalho J, Cilesiz NC, et al. European association of urology guidelines on male sexual and reproductive health: 2021 update on male infertility. Eur Urol. (2021) 80:603–20. doi: 10.1016/j.eururo.2021.08.014

4. Kobayashi N, Miyauchi N, Tatsuta N, Kitamura A, Okae H, Hiura H, et al. Factors associated with aberrant imprint methylation and oligozoospermia. Sci Rep. (2017) 7:42336. doi: 10.1038/srep42336

5. Kirkegaard K, Sundvall L, Erlandsen M, Hindkjær JJ, Knudsen UB, and Ingerslev HJ. Timing of human preimplantation embryonic development is confounded by embryo origin. Hum Reprod (Oxford England). (2015) 31:324–31. doi: 10.1093/humrep/dev296

6. Adeniyi T, Horne G, Ruane PT, Brison DR, and Roberts SA. Clinical efficacy of hyaluronate-containing embryo transfer medium in IVF/ICSI treatment cycles: a cohort study. Hum Reprod Open. (2021) 2021:hoab004. doi: 10.1093/hropen/hoab004

7. Mantikou E, Youssef MAFM, van Wely M, van der Veen F, Al-Inany HG, Repping S, et al. Embryo culture media and IVF/ICSI success rates: a systematic review. Hum Reprod Update. (2013) 19:210–20. doi: 10.1093/humupd/dms061

8. Foong SC, Fleetham JA, O’Keane JA, Scott SG, Tough SC, and Greene CA. A prospective randomized trial of conventional in vitro fertilization versus intracytoplasmic sperm injection in unexplained infertility. J Assisted Reprod Genet. (2006) 23:137–40. doi: 10.1007/s10815-005-9008-y

9. Connolly MP, Hoorens S, and Chambers GM. The costs and consequences of assisted reproductive technology: an economic perspective. Hum Reprod Update. (2010) 16:603–13. doi: 10.1093/humupd/dmq013

10. Zou K, Wang J, Bi H, Zhang Y, Tian X, Tian N, et al. Comparison of different in vitro differentiation conditions for murine female germline stem cells. Cell Prolif. (2018) 52:e12530. doi: 10.1111/cpr.12530

11. Leushuis E, van der Steeg JW, Steures P, Bossuyt PMM, Eijkemans MJC, van der Veen F, et al. Prediction models in reproductive medicine: a critical appraisal. Hum Reprod Update. (2009) 15:537–52. doi: 10.1093/humupd/dmp013

12. Templeton A, Morris JK, and Parslow W. Factors that affect outcome of in-vitro fertilisation treatment. Lancet. (1996) 348:1402–6. doi: 10.1016/S0140-6736(96)05291-9

13. Nelson SM and Lawlor DA. Predicting live birth, preterm delivery, and low birth weight in infants born from in vitro fertilisation: a prospective study of 144,018 treatment cycles. PloS Med. (2011) 8:e1000386. doi: 10.1371/journal.pmed.1000386

14. Campbell JM, Lane M, Owens JA, and Bakos HW. Paternal obesity negatively affects male fertility and assisted reproduction outcomes: a systematic review and meta-analysis. Reprod Biomedicine Online. (2015) 31:593–604. doi: 10.1016/j.rbmo.2015.07.012

15. Huang S, Tuerganbayi K, Wang J, Saad SH, Zhang J, Zou J, et al. Machine learning-based preliminary screening tool for clinical pregnancy prediction: towards management of IVF/ICSI stages. Ann Med. (2025) 57:2582245. doi: 10.1080/07853890.2025.2582245

16. Leijdekkers JA, Eijkemans MJC, van Tilborg TC, Oudshoorn SC, McLernon DJ, Bhattacharya S, et al. Predicting the cumulative chance of live birth over multiple complete cycles of in vitro fertilization: an external validation study. Hum Reprod (Oxford England). (2018) 33:1684–95. doi: 10.1093/humrep/dey263

17. Bzdok D, Altman N, and Krzywinski M. Statistics versus machine learning. Nat Methods. (2018) 15:233–4. doi: 10.1038/nmeth.4642

18. Deo RC. Machine learning in medicine. Circulation. (2015) 132:1920–30. doi: 10.1161/CIRCULATIONAHA.115.001593

19. Wang J, Chen R, Long H, He J, Tang M, Su M, et al. Artificial intelligence in polycystic ovarian syndrome management: past, present, and future. Radiol Med. (2025) 130:1409–41. doi: 10.1007/s11547-025-02032-9

20. London AJ. Artificial intelligence and black-box medical decisions: accuracy versus explainability. Hastings Cent Rep. (2019) 49:15–21. doi: 10.1002/hast.973

21. Amann J, Blasimme A, Vayena E, Frey D, and Madai VI. Explainability for artificial intelligence in healthcare: a multidisciplinary perspective. BMC Med Inf Decision Making. (2020) 20:310. doi: 10.1186/s12911-020-01332-6

22. Cooper TG, Noonan E, von Eckardstein S, Auger J, Baker HWG, Behre HM, et al. World Health Organization reference values for human semen characteristics. Hum Reprod Update. (2009) 16:231–45. doi: 10.1093/humupd/dmp048

23. Collins GS, Reitsma JB, Altman DG, and Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Ann Internal Med. (2015) 162:55–63. doi: 10.7326/M14-0697

24. Zhai J, Li S, Zhu Y, Sun Y, Chen Z-J, and Du Y. Serum sex hormone binding globulin concentration as a predictor of ovarian response during controlled ovarian hyperstimulation. Front In Med. (2021) 8:719818. doi: 10.3389/fmed.2021.719818

25. van Buuren S and Groothuis-Oudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Software. (2011) 45:1–67. doi: 10.18637/jss.v045.i03

26. Wang K, Xiong W, Duan X, Li Q, Ren P, Ye H, et al. A nomogram based on autoantibodies for noninvasive detection of AFP-negative hepatocellular carcinoma: a multicenter study. Br J Cancer. (2025) 133:1896–906. doi: 10.1038/s41416-025-03215-x

27. Rhodes CJ, Otero-Núñez P, Wharton J, Swietlik EM, Kariotis S, Harbaum L, et al. Whole-blood RNA profiles associated with pulmonary arterial hypertension and clinical outcome. Am J Respir Crit Care Med. (2020) 202:586–94. doi: 10.1164/rccm.202003-0510OC

28. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. (2011) 12:2825–30. doi: 10.5555/1953048.2078195

29. Hu D, Li Y, Zhang D, Ding J, Song Z, Min J, et al. Genetic trade-offs between complex diseases and longevity. Aging Cell. (2022) 21:e13654. doi: 10.1111/acel.13654

30. Bai Y, Lei C, Zhang N, Liu Y, Hu Z, Li Y, et al. Peri-ulcerative mucosal inflammation appearance is an independent risk factor for 30-day rebleeding in patients with gastric ulcer bleeding: A multicenter retrospective study. J Inflammation Res. (2022) 15:4951–61. doi: 10.2147/JIR.S378263

31. Giordano G, Mastrantoni L, and Landi F. Development and validation of quantile regression forests for prediction of reference quantiles in handgrip and chair-stand test. J Cachexia Sarcopenia Muscle. (2025) 16:e13868. doi: 10.1002/jcsm.13868

32. Provost MP, Acharya KS, Acharya CR, Yeh JS, Steward RG, Eaton JL, et al. Pregnancy outcomes decline with increasing body mass index: analysis of 239,127 fresh autologous in vitro fertilization cycles from the 2008–2010 Society for Assisted Reproductive Technology registry. Fertil Steril. (2016) 105:663–9. doi: 10.1016/j.fertnstert.2015.11.008

33. Campbell JM, Lane M, Owens JA, and Bakos HW. Paternal obesity negatively affects male fertility and assisted reproduction outcomes: a systematic review and meta-analysis. Reprod Biomedicine Online. (2015) 31:593–604. doi: 10.1016/j.rbmo.2015.07.012

34. Leisegang K, Sengupta P, Agarwal A, and Henkel R. Obesity and male infertility: Mechanisms and management. Andrologia. (2020) 53:e13617. doi: 10.1111/and.13617

35. Donkin I and Barrès R. Sperm epigenetics and influence of environmental factors. Mol Metab. (2018) 14:1–11. doi: 10.1016/j.molmet.2018.02.006

36. Simon L, Zini A, Dyachenko A, Ciampi A, and Carrell DT. A systematic review and meta-analysis to determine the effect of sperm DNA damage on in vitro fertilization and intracytoplasmic sperm injection outcome. Asian J Andrology. (2017) 19:80–90. doi: 10.4103/1008-682X.182822

37. Rhee JS, Saben JL, Mayer AL, Schulte MB, Asghar Z, Stephens C, et al. Diet-induced obesity impairs endometrial stromal cell decidualization: a potential role for impaired autophagy. Hum Reprod (Oxford England). (2016) 31:1315–26. doi: 10.1093/humrep/dew048

38. Broughton DE and Moley KH. Obesity and female infertility: potential mediators of obesity’s impact. Fertil Steril. (2017) 107:840–7. doi: 10.1016/j.fertnstert.2017.01.017

39. Bellver J, Martínez-Conejero JA, Labarta E, Alamá P, Melo MAB, Remohí J, et al. Endometrial gene expression in the window of implantation is altered in obese women especially in association with polycystic ovary syndrome. Fertil Steril. (2011) 95:2335–41, 2391.e1–8. doi: 10.1016/j.fertnstert.2011.03.021

40. Best D, Avenell A, and Bhattacharya S. How effective are weight-loss interventions for improving fertility in women and men who are overweight or obese? A systematic review and meta-analysis of the evidence. Hum Reprod Update. (2017) 23:681–705. doi: 10.1093/humupd/dmx027

41. Obesity and reproduction: a committee opinion. Fertil Steril. (2021) 116:1266–85. doi: 10.1016/j.fertnstert.2021.08.018

Keywords: body mass index, clinical pregnancy, LightGBM, machine learning, male infertility

Citation: Li H, Gao J and Li Y (2026) Machine learning–based prediction of IVF/ICSI outcomes in male factor infertility highlighting couple-level BMI. Front. Endocrinol. 17:1772106. doi: 10.3389/fendo.2026.1772106

Received: 20 December 2025; Revised: 23 January 2026; Accepted: 26 January 2026; Published: 10 February 2026.

Edited by:

Luca Busetto, University of Padua, Italy

Reviewed by:

Keyan Wang, Zhengzhou University, China
Arash Ziaee, Mashhad University of Medical Sciences, Iran

Copyright © 2026 Li, Gao and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yiran Li, liyiran2007@gmail.com

†These authors have contributed equally to this work
