Predicting the unexpected total fertilization failure in conventional in vitro fertilization cycles: What is the role of semen quality?

Background: Both male and female gamete factors may contribute to total fertilization failure (TFF). In first in vitro fertilization (IVF) cycles, the choice of insemination protocol is based mainly on semen quality because the contribution of female clinical characteristics to TFF remains obscure. The purpose of the study was to evaluate the role of semen quality in predicting unexpected TFF.
Methods: A single-center retrospective cohort analysis was performed on 19539 cycles between 2013 and 2021. Two algorithms, Least Absolute Shrinkage and Selection Operator (LASSO) and Extreme Gradient Boosting (Xgboost), were used to create models from cycle characteristics. The contribution of semen parameters to model performance was evaluated by comparing models with and without them. The area under the curve (AUC), the calibration, and the net reclassification index (NRI) were used to evaluate the performance of the models.
Results: The prevalence of TFF was .07 (95%CI:0.07-0.08) and .08 (95%CI:0.07-0.09) in the development and validation groups, respectively. With all characteristics included, the LASSO and Xgboost models predicted TFF with AUCs of .74 (95%CI:0.72-0.77) and .75 (95%CI:0.72-0.77) in the validation group. Without semen parameters, the AUCs of the LASSO and Xgboost models were .72 (95%CI:0.69-0.74) and .73 (95%CI:0.7-0.75). With semen parameters only, the AUCs were .58 (95%CI:0.55-0.61) and .57 (95%CI:0.55-0.6). For the overall validation cohort, the event NRI values were −5.20 for the LASSO model and −.71 for the Xgboost model, while the non-event NRI values were 10.40 for the LASSO model and 0.64 for the Xgboost model. In the subgroup of poor responders, the prevalence was .21 (95%CI:0.18-0.24). With refitted LASSO and Xgboost models, the AUCs were .72 (95%CI:0.67-0.77) and .69 (95%CI:0.65-0.74), respectively.
Conclusion: In unselected patients, semen parameters contribute limited value in predicting TFF. However, oocyte yield is an important predictor of TFF, and the prevalence of TFF in poor responders was high. Because reasonable predictive power for TFF could be achieved in poor responders, further study to prevent TFF in these patients may be warranted.


Introduction
While the use of intracytoplasmic sperm injection (ICSI) is increasing, conventional in vitro fertilization (IVF) remains an important part of contemporary assisted reproductive technology (ART) practice (Adamson et al., 2018), with an average fertilization rate of 76% (ESHRE Special Interest Group of Embryology and Alpha Scientists in Reproductive Medicine, 2017). However, unexpected total fertilization failure (TFF) still occurs in 5%-20% of IVF treatment cycles (Combelles et al., 2010;Huang et al., 2015). TFF refers to the failure of all retrieved mature oocytes to form two-pronuclear zygotes 15-18 h after insemination (Shinar et al., 2014). With TFF, no embryos are available for transfer and the treatment is canceled. This causes emotional distress and an additional financial burden for infertile couples, who must repeat the treatment cycle.
ICSI has been shown to increase fertilization rates when TFF has previously occurred with conventional insemination (Practice, 2020). Therefore, identifying potential clinical risk factors and establishing prediction models might help in choosing the appropriate insemination protocol for patients at high risk of TFF. A few studies have attempted to predict the risk of fertilization failure among patients receiving conventional IVF (Repping et al., 2002;Krog et al., 2015;Li J. et al., 2021;Tian et al., 2022), reporting discriminatory power in the form of AUCs ranging from .72 to .80. However, there is still no consensus on the key predictors of TFF. For instance, Repping et al. (2002) achieved a notable AUC of .72 with total motile sperm count (TMC) alone, while Tian et al. (2022) reported a similar AUC of .74 with a comprehensive set of predictors in which sperm quality diagnoses, rather than TMC, were used. In addition, the lack of external validation and the heterogeneity of the outcomes measured may limit the applicability of these models in clinical practice.
In the first cycle of ART, the decision for selecting ICSI over conventional IVF in patients without absolute indications (Babayev et al., 2014) is generally based on the semen parameters measured during the infertility workup or on the day of ovum pickup (OPU) (Li J. et al., 2021;Bjorndahl et al., 2022;Dcunha et al., 2022). However, the criteria are often arbitrary. While patients with less than 2 million motile spermatozoa might be recommended for ICSI (Dcunha et al., 2022), some authors propose more rigorous criteria (Repping et al., 2002). The work of Repping et al. (2002) might suggest that increasing the minimal TMC requirement for conventional IVF could effectively prevent TFF. Paradoxically, however, TFF may also occur in the conventional IVF treatment cycles with normal semen parameters (Wang and Swerdloff, 2014;Esteves et al., 2017). It is known that female gamete quality might also contribute to TFF (ESHRE Special Interest Group of Embryology and Alpha Scientists in Reproductive Medicine, 2017). However, unlike the male factor, the contribution of female clinical characteristics to TFF remains obscure. As current evidence does not support the use of ICSI solely according to the female characteristics, such as female age and poor response (Practice, 2020), evaluating the relative importance of male and female characteristics in the prediction of TFF may facilitate the decision making on the selection of insemination protocol in the first ART cycle.
In the present study, we retrospectively analyzed 19539 first IVF treatment cycles performed in our center from January 2013 to December 2021. The aim was to evaluate the value of basic semen parameters in predicting the occurrence of TFF and to create models that predict unexpected TFF for patients receiving their first IVF treatment cycle.

Study subjects
A retrospective cohort analysis was performed on patients who underwent their first IVF treatment cycle in the Center for Reproduction Medicine of the affiliated Chenggong Hospital of Xiamen University, China, between January 2013 and December 2021. Data from cycles between January 2013 and December 2018 were used to create models to predict total fertilization failure (development group). Data from cycles between January 2019 and December 2021 were used to validate the models (validation group). The inclusion criterion was a first IVF treatment. The exclusion criteria were cancellation of ovum pickup (OPU), no oocytes retrieved, and no mature oocyte at the fertilization check.
This retrospective study was approved by Institutional Review Board from the Ethical Committee of the Medical College Xiamen University. Informed consent was not necessary, because the research was based on non-identifiable records as approved by the ethics committee.

Treatment protocol and fertilization check
Conventional agonist or antagonist stimulation protocols were used for ovarian stimulation as previously described (Cai et al., 2017). The initial and ongoing dosage was determined according to the patient's age, antral follicle count (AFC), BMI, and ovarian response. An intramuscular injection of human chorionic gonadotropin (4000-6000 IU, hCG; Livzen, China) or a subcutaneous injection of recombinant human chorionic gonadotropin (250 μg, Ovidrel, Merck-Serono, Switzerland) was administered for final triggering when at least one follicle reached a mean diameter of 18 mm. Oocyte retrieval by follicle puncture under transvaginal ultrasound guidance was performed 34-36 h after hCG injection.
The routine IVF protocol of our center was carried out (Jiang et al., 2022). Cumulus-oocyte complexes were co-cultured with approximately 1.5-3 × 10^5 progressively motile spermatozoa in pre-equilibrated fertilization culture medium (K-SIFM, Cook) under mineral oil in traditional incubators (C200, Labotect) at 37°C, 6% CO2 and 5% O2 in a humidified atmosphere. After 4 h of co-culture, oocytes were denuded and cultured individually in pre-equilibrated Cleavage Medium (K-SICM, Cook). The culture system and the semen preparation procedure were kept unchanged throughout the study period. Fertilization was determined according to the presence of two pronuclei (2 PN) about 17 h post insemination. If no obvious pronuclei could be observed, the check was repeated 2 h later.
Frontiers in Cell and Developmental Biology frontiersin.org

Statistical analyses
The endpoint was total fertilization failure which was defined as the failure of all available oocytes to be fertilized in one IVF cycle. Considering the continuous variables were not normally distributed, they were presented as medians (first quartile, third quartile), while absolute frequencies and percentages (n, %) were used to present the categorical variables.
Two algorithms, Least Absolute Shrinkage and Selection Operator (LASSO) and Extreme Gradient Boosting (Xgboost), were used to create models. With all variables, without semen parameters, and with semen parameters only, six different models were established in total. The female characteristics were female age, duration of infertility, female primary infertility, previous IUI failure, female height, female weight, female BMI, PCOS, endometriosis, female basal FSH, female basal LH, female basal PRL, female basal E2, female basal P, female basal T, and antral follicle count (AFC). The male characteristics were male age, male primary infertility, male height, male weight, male BMI, semen volume, sperm concentration, normal morphology, sperm motility, sperm progressive motility, sperm non-progressive motility, total motile sperm count, and normozoospermia. The remaining variables were the couple's secondary infertility and the ovarian stimulation characteristics, which included total gonadotropin, Gn duration, total HMG dose, HMG duration, starting dose, FSH/LH/E2 levels on the day of stimulation, total hCG dose, E2/LH/P levels on the day of triggering, follicle count over 14 mm on the day of triggering, follicle count less than 14 mm on the day of triggering, oocyte yield, count of punctured follicles, normal responder, and ovarian stimulation protocol. The variables were selected according to the associations between each factor and total fertilization failure (Supplementary Figure S1 and Supplementary Figure S2). For both LASSO and Xgboost, three sets of models were established: 1) the models with all the variables included (Lasso and Xgboost with all features), 2) the models with all variables except semen parameters (Lasso and Xgboost without semen parameters), 3) the models with only semen parameters included (Lasso and Xgboost with semen parameters only).
The coefficients and intercepts of the models created with the LASSO algorithm are shown in Supplementary Table S1. The features showing a non-linear association with TFF underwent a restricted cubic spline (RCS) transformation with five knots, using the rms package for R software, to give a better fit. Each transformed feature generates three independent spline variables following the formulas below. Each spline variable was labeled with the original feature name in combination with the numbers 1, 2, or 3. The spline variables, in addition to the original feature, were used to construct the models.
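As an illustration of how a five-knot RCS transformation yields three spline variables per feature, the following Python sketch evaluates the truncated-power form of the restricted cubic spline basis. It is a sketch under stated assumptions rather than the code used in the study: the scaling by the squared distance between the outer knots is assumed here to correspond to the default used by the rms package, and the function name `rcs_basis` and the knot locations are hypothetical.

```python
def rcs_basis(x, knots):
    """Return the k-2 nonlinear restricted cubic spline terms for a value x.

    Assumption: truncated-power form, scaled by the squared distance
    between the first and last knot.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("at least three knots are required")

    def cube_plus(u):
        # Positive-part cubic: u^3 when u > 0, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2
    gap = t[-1] - t[-2]
    terms = []
    for j in range(k - 2):
        term = (cube_plus(x - t[j])
                - cube_plus(x - t[-2]) * (t[-1] - t[j]) / gap
                + cube_plus(x - t[-1]) * (t[-2] - t[j]) / gap)
        terms.append(term / scale)
    return terms


# Hypothetical knot locations for a count-type feature such as oocyte yield.
example_knots = [2, 5, 9, 14, 22]
example_terms = rcs_basis(10, example_knots)  # three spline variables
```

With five knots the function returns three terms, which would be used alongside the untransformed feature. The terms are zero below the first knot and linear beyond the last knot, which is the property that distinguishes a restricted cubic spline from an unrestricted one.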
The receiver operating characteristic (ROC) curve with area under the curve (AUC) was calculated to quantify the predictive power of the models. A 95% confidence interval (95% CI) was calculated for the AUC. The optimal cut-off points of the ROC curves were determined according to Youden's index. The sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated accordingly. The agreement between predictions and clinical observations was assessed with calibration curves.
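The discrimination measures described above can be sketched in a few lines of Python. The example computes the AUC as the probability that a TFF cycle receives a higher predicted risk than a fertilized cycle, selects the cut-off that maximizes Youden's index (sensitivity + specificity − 1), and reports the classification measures at that cut-off. The function names and the toy data are hypothetical, and the study's own implementation may differ.

```python
def auc(scores, labels):
    """AUC as the fraction of (event, non-event) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(scores, labels, cutoff):
    """Counts (tp, fp, tn, fn) when a score >= cutoff is predicted as an event."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    return tp, fp, tn, fn


def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1."""
    best_cutoff, best_j = None, None
    for c in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, c)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if best_j is None or j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j


def summary_at(scores, labels, cutoff):
    """Sensitivity, specificity, PPV and NPV at a given cut-off."""
    tp, fp, tn, fn = confusion(scores, labels, cutoff)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else None,
        "npv": tn / (tn + fn) if tn + fn else None,
    }


# Toy predicted risks and outcomes (1 = TFF); not data from the study.
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90]
toy_labels = [0, 0, 0, 1, 0, 1, 0, 1]
toy_auc = auc(toy_scores, toy_labels)
toy_cutoff, toy_j = youden_cutoff(toy_scores, toy_labels)
toy_summary = summary_at(toy_scores, toy_labels, toy_cutoff)
```

Because PPV and NPV depend on prevalence, the same cut-off can give quite different predictive values in subgroups with different TFF rates, even when sensitivity and specificity are unchanged.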
For the optimal cut-off points, the Net Reclassification index (NRI) was calculated to evaluate the contribution of semen parameters to the prediction models (Ref.). The net reclassification index (z) is defined as the difference between the correct reclassification and incorrect reclassification cases in either event (z+) or non-event (z−) patients when models were compared.
Additive NRI was defined as follows.

$$\text{Additive NRI} = \frac{z^{+}}{\text{Total number of patients with event}} \times 100 + \frac{z^{-}}{\text{Total number of patients without event}} \times 100$$

Absolute NRI was defined as follows.

$$\text{Absolute NRI} = 100 \times \frac{z^{+} + z^{-}}{\text{Total number of patients}}$$

Because the consequences of the prediction of the event (TFF) and non-event (fertilization) may be different, we also reported the Event-NRI and Non-Event-NRI, respectively.

$$\text{Event-NRI} = \frac{z^{+}}{\text{Total number of patients with an event}} \times 100$$

$$\text{Non-Event-NRI} = \frac{z^{-}}{\text{Total number of patients without an event}} \times 100$$

The performance of the models was also evaluated in subgroups that represent different clinical scenarios. The secondary infertility subgroup comprised cycles in which both partners had secondary infertility. The normozoospermia subgroup comprised cycles in which all semen parameters met the WHO reference values. The normal responder subgroup comprised cycles with more than four retrieved oocytes.
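The NRI definitions given earlier in this section can be illustrated with a minimal Python sketch. It assumes binary predictions at each model's cut-off, treats the model without semen parameters as the reference model, and takes the denominator of the absolute NRI to be the total number of patients. The function name `nri_summary` and the toy data are hypothetical and are not taken from the study.

```python
def nri_summary(labels, ref_pred, new_pred):
    """Event, non-event, additive and absolute NRI, all in percent.

    labels   : 1 for an event (TFF), 0 for a non-event
    ref_pred : binary predictions of the reference model
    new_pred : binary predictions of the model being compared
    """
    up_event = down_event = up_nonevent = down_nonevent = 0
    for y, old, new in zip(labels, ref_pred, new_pred):
        if new > old:            # moved from predicted negative to positive
            if y == 1:
                up_event += 1    # correct reclassification of an event
            else:
                up_nonevent += 1 # incorrect reclassification of a non-event
        elif new < old:          # moved from predicted positive to negative
            if y == 1:
                down_event += 1
            else:
                down_nonevent += 1

    n_event = sum(labels)
    n_nonevent = len(labels) - n_event
    z_pos = up_event - down_event          # net correct among events
    z_neg = down_nonevent - up_nonevent    # net correct among non-events

    event_nri = 100.0 * z_pos / n_event
    nonevent_nri = 100.0 * z_neg / n_nonevent
    return {
        "event_nri": event_nri,
        "nonevent_nri": nonevent_nri,
        "additive_nri": event_nri + nonevent_nri,
        "absolute_nri": 100.0 * (z_pos + z_neg) / len(labels),
    }


# Toy example: 4 TFF cycles and 6 fertilized cycles (not data from the study).
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_ref = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
toy_new = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
toy_nri = nri_summary(toy_labels, toy_ref, toy_new)
```

In the toy data, one more event and a net of one more non-event are classified correctly by the new model, so the event NRI is 25% and the non-event NRI is about 16.7%, while the absolute NRI is 20% because it is expressed relative to all ten patients. This shows why the additive and absolute NRI can differ considerably when events are rare.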

Result
A total of 20401 initiated cycles were identified during the study period. After the exclusion of 341 cycles with cancelled OPU, 172 cycles with no oocyte, 163 cycles resulting in no mature oocyte, and 186 cycles with missing values, 19539 cycles were included.
The importance of each variable in the different models was computed with the Xgboost algorithm. Oocyte yield was the most important variable in predicting the occurrence of TFF, both in the model with semen parameters and in the model without them. In the model with semen parameters only, TMC, motility, and sperm concentration had the highest importance (Supplementary Figure S4).
The prediction of TFF was stratified by patient subgroup. The TFF occurrence rates ranged from 6% to 21% across subgroups. A relatively low prevalence of 6% was observed in the secondary infertility, normozoospermia, and normal responder subgroups. The prevalence was higher in the remaining subgroups and highest, at 21%, in the poor responder subgroup. The models with or without semen parameters had comparable predictive values and were better than the models with semen parameters only according to the AUCs (Table 3, Supplementary Figure S5).
To improve the performance of the models in the poor responder group, we updated the models by recalibrating them according to the slope and intercept of the calibration curve. We also created refitted models, which were constructed from the development data of poor responders only. The coefficients of the LASSO models refitted for poor responders are presented in Supplementary Table S2. After updating and refitting, the models provided moderate performance according to AUCs and calibration curves (Supplementary Table S3 and Supplementary Figure S6).
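One common way to update a model by the slope and intercept of its calibration curve is logistic recalibration, in which the outcome is regressed on the logit of the original predicted risk and the fitted intercept and slope are then applied to every prediction. The following Python sketch illustrates that idea. It is an assumption that this corresponds to the updating procedure used here, and the function names and the toy data are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_recalibration(pred, outcome, iters=50):
    """Fit outcome ~ intercept + slope * logit(pred) by Newton-Raphson."""
    x = [logit(p) for p in pred]
    a, b = 0.0, 1.0  # start from "no recalibration needed"
    for _ in range(iters):
        mu = [expit(a + b * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(y - m for y, m in zip(outcome, mu))
        g1 = sum((y - m) * xi for y, m, xi in zip(outcome, mu, x))
        # Information matrix (negative Hessian), 2 x 2.
        w = [m * (1.0 - m) for m in mu]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def recalibrate(p, intercept, slope):
    """Apply the fitted intercept and slope to an original predicted risk."""
    return expit(intercept + slope * logit(p))


# Toy data: the original model over-predicts risk in both risk groups.
toy_pred = [0.2] * 10 + [0.5] * 10
toy_outcome = [1] + [0] * 9 + [1, 1, 1] + [0] * 7
toy_a, toy_b = fit_recalibration(toy_pred, toy_outcome)
```

Recalibration of this kind is a monotone transformation of the predicted risks whenever the fitted slope is positive, so it changes calibration but leaves the ranking of patients, and therefore the AUC, unchanged. Refitting on poor responders only, in contrast, can change both.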
The NRIs for the overall validation cohort and subgroups suggested that including semen parameters may have a positive effect on the classification of patients, as the NRI values were positive (Supplementary Table S4). However, the negative event NRI values suggested that adding semen parameters to the models harmed the correct classification of patients with TFF, while the positive non-event NRI indicated that it improved the classification of patients without TFF. Nevertheless, in the models refitted for poor responders, semen parameters added little or nothing to the absolute NRI and non-event NRI.

Main finding
In the present study, we used two algorithms, Least Absolute Shrinkage and Selection Operator (LASSO) and Extreme Gradient Boosting (Xgboost), to develop predictive models for TFF in a retrospective cohort of 19539 first IVF treatment cycles performed in our center from January 2013 to December 2021. Including semen parameters improved the predictive power of the models for TFF only to a limited extent. According to the AUCs, the models without semen parameters achieved predictive power comparable to that of the models with semen parameters. On the other hand, the models with only semen parameters had almost no meaningful predictive value. The count of retrieved oocytes was the most important factor associated with the occurrence of TFF. Furthermore, in the cycles of poor responders, we observed a higher prevalence of TFF, and our models could give reasonable predictions.

Interpretation
With or without semen parameters, our models yielded discriminatory power comparable to that of previous studies (Repping et al., 2002;Li J. et al., 2021;Tian et al., 2022). However, heterogeneity in the inclusion criteria, outcomes measured, and study design might hamper further comparison between our models and previous ones. For instance, Tian et al. (2022) reported an AUC for TFF in both IVF and ICSI cycles, and Li J. et al. (2021) aimed to predict the combined incidence of TFF and low fertilization rate. Importantly, several studies were not limited to the first cycle (Repping et al., 2002;Li J. et al., 2021;Tian et al., 2022). As ICSI could be used to overcome the failure of fertilization in previous cycles, including multiple cycles in model development risks selection bias. Also, among the studies mentioned (Repping et al., 2002;Li J. et al., 2021;Tian et al., 2022), ours was the only one to include a temporal validation following model development. The validation warned of a risk of overfitting in certain algorithms.
With the same predictors, the Xgboost models underwent a dramatic decrease in discriminatory power while the discriminatory power of the LASSO models remained stable. Because a minimal standard deviation of the residuals had been secured in the internal validation during the development of Xgboost, our data also suggest a risk of overfitting in models that lack external validation. Our findings appear to conflict with the work of Repping et al. (2002): with semen parameters alone, we failed to demonstrate meaningful discriminatory power. This could be explained by a difference in the criteria for ICSI. Repping et al. (2002) used a criterion of .2 million post-wash progressive sperm for ICSI, which was much lower than ours. Their cohort may therefore have included more patients with poor semen parameters and thus more TFF cases due to insufficient sperm input. Supporting this hypothesis, their data also showed a TFF rate (110/892) higher than both our study and the Vienna criteria (ESHRE Special Interest Group of Embryology and Alpha Scientists in Reproductive Medicine, 2017).
A further question is whether a prediction model is suitable for different clinical scenarios. While most of the previous models were based on unselected IVF patients (Repping et al., 2002;Li J. et al., 2021;Tian et al., 2022), the work of Li J. et al. (2021) is based on patients with borderline semen parameters. Because the selection of insemination protocol depends on semen quality (Dcunha et al., 2022), the population in which a predictive model is developed may differ from the patients who are presumed to be at high risk, which might introduce bias. A one-size-fits-all model developed in an unselected population might not suit every clinical scenario. In our subgroup analyses, TFF occurred with a prevalence of 6% and 9% in the normozoospermia patients and the patients with sub-optimal semen parameters, respectively. Nevertheless, the discriminatory power and the calibration-in-the-large remained similar between the two subgroups, suggesting a similar overall performance of the model.
Besides the clinical characteristics of the patients, TFF may also result from genetic deficiencies (Koler et al., 2009;Litzky and Marsit, 2019;Jin et al., 2021;Li M. et al., 2021;Liu et al., 2021;Wang et al., 2021), which are not necessarily related to clinical predictors such as semen parameters. It could be argued that a clinical prediction is useless if a significant proportion of TFF patients suffer from undetected genetic deficiencies. However, the proportion of TFF patients affected by genetic deficiencies is largely unknown. A diagnosis of secondary infertility implies at least one previous successful fertilization in vivo. Therefore, patients with secondary infertility are presumably unaffected by genetic deficiencies that impair the fertilization process. The similar AUCs in primary and secondary infertility patients might suggest that absolute deficiencies affecting fertilization are rare even among TFF patients. This also echoes earlier studies suggesting that patients suffering from TFF might have an increased fertilization rate in the second cycle when conventional IVF continues.
Previous studies have demonstrated a significant association between oocyte yield and TFF (Sarikaya et al., 2011;Xia et al., 2020;Tian et al., 2022). By adding the oocyte yield to the model, Repping et al. (2002) increased the discriminatory power of their model to .8. In the present study, the Xgboost model suggested that oocyte yield was the predictor of the highest importance. The prevalence of TFF in poor responders was as high as .21 (95%CI:0.18-0.24). These data highlight the role of the female factor in the occurrence of TFF. Although one might argue that fertilization is a probabilistic event for an individual oocyte and TFF may occur in poor responders just by chance, the models refitted for poor responders showed moderate discriminatory power. This may encourage future studies focusing on the prediction of TFF for poor responders.

Clinical significance
In practice, TFF threatens a small fraction of infertility patients. The reported prevalence was 5%-20% in previous studies (Combelles et al., 2010;Huang et al., 2015). In the present study, we observed a prevalence of less than 8% across all IVF treatment cycles in our center. Naïvely guessing that every cycle will achieve fertilization would therefore give an accuracy higher than 90%. From this point of view, predicting TFF appears to have limited benefit. However, the high prevalence of TFF in poor responders cannot be neglected. As the evidence does not support the use of ICSI for all poor responders (6), it is worth the effort to identify the poor responders at higher risk of TFF. While a predictive model provides a positive or negative prediction for individual patients, the consequences of a positive prediction may not be the same as those of a negative prediction. In the case of TFF, if the clinical decision were made according to the prediction, patients misclassified as TFF might be incorrectly assigned to ICSI, increasing the potential treatment burden, but the treatment continues. On the other hand, for patients misclassified as "healthy", the cycle is canceled. If the goal of a prediction model is to prevent the latter situation, discriminability or accuracy might not be the only target to pursue. According to our data, a hypothetical poor responder

Strengths and limitations
Because only first IVF treatment cycles were included in our study, the results are directly applicable to decision-making on the insemination protocol, and the large sample size supports the reliability of the predictions. We also assessed the calibration of the models in a separate data set and in different clinical scenarios, such as normozoospermia or male subfertility, normal or poor responders, and primary or secondary infertility. Some limitations must be acknowledged, such as the retrospective study design and the use of data from a single IVF center. It is therefore necessary to test the performance of the models in more IVF centers with different treatment systems. The study also did not include certain clinical parameters, such as DNA fragmentation and male endocrine parameters, which might further improve the discriminatory power of the models.

Conclusion
Our study showed that including basic semen parameters improved the prediction of total fertilization failure in first IVF cycles only to a limited extent. On the other hand, oocyte yield was significantly associated with TFF, and the prevalence of TFF in poor responders was high. Our models achieved reasonable predictive value, which may be especially useful in the treatment cycles of poor responders.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.

Ethics statement
This study was reviewed and approved by the Ethical Committee of the Xiamen University Affiliated Chenggong Hospital.