Assessment and quantification of ovarian reserve on the basis of machine learning models

Background Early detection of ovarian aging is of huge importance, although no ideal marker or acknowledged evaluation system exists. The purpose of this study was to develop a better prediction model to assess and quantify ovarian reserve using machine learning methods.
Methods This is a multicenter, nationwide population-based study including a total of 1,020 healthy women. For these healthy women, their ovarian reserve was quantified in the form of ovarian age, which was assumed equal to their chronological age, and least absolute shrinkage and selection operator (LASSO) regression was used to select features to construct models. Seven machine learning methods, namely artificial neural network (ANN), support vector machine (SVM), generalized linear model (GLM), K-nearest neighbors regression (KNN), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM) were applied to construct prediction models separately. Pearson’s correlation coefficient (PCC), mean absolute error (MAE), and mean squared error (MSE) were used to compare the efficiency and stability of these models.
Results Anti-Müllerian hormone (AMH) and antral follicle count (AFC) were detected to have the highest absolute PCC values of 0.45 and 0.43 with age and held similar age distribution curves. The LightGBM model was thought to be the most suitable model for ovarian age after ranking analysis, combining PCC, MAE, and MSE values. The LightGBM model obtained PCC values of 0.82, 0.56, and 0.70 for the training set, the test set, and the entire dataset, respectively. The LightGBM method still held the lowest MAE and cross-validated MSE values. Further, in two different age groups (20–35 and >35 years), the LightGBM model also obtained the lowest MAE value of 2.88 for women between the ages of 20 and 35 years and the second lowest MAE value of 5.12 for women over the age of 35 years.
Conclusion Machine learning methods combining multi-features were reliable in assessing and quantifying ovarian reserve, and the LightGBM method turned out to be the approach with the best result, especially in the child-bearing age group of 20 to 35 years.


Introduction
Ovarian reserve represents the number of oocytes remaining in the ovary; both the number and quality of oocytes impact reproductive potential and aging (1,2). Ovarian aging is due to a variety of causative factors, such as chromosomal, genetic, mitochondrial, and cytoplasmic changes in oocyte quantity and quality (3-8). Evaluating current ovarian reserve and the degree of ovarian aging could help women assess their reproductive potential and prevent early menopause or related disorders, which matters because few treatments are effective in preventing ovarian aging.
So far, the most classical and commonly used evaluation system for ovarian aging is the Stages of Reproductive Aging Workshop criteria (STRAW+10), which is widely considered the gold standard for characterizing reproductive aging through menopause. STRAW classified the stages of a woman's adult life into three general categories: reproductive, menopausal transition, and postmenopause. However, the STRAW staging approach lacks specific diagnostic criteria for evaluating ovarian reserve, and the assessment system is too generalized to reliably assess each individual's ovarian aging degree. In addition, the current evaluation of ovarian reserve can draw on clinical indicators, such as biochemical tests and ultrasound imaging of the ovaries (2). Biochemical tests include follicle-stimulating hormone (FSH), estradiol (E2), or inhibin B in early-follicular-phase, cycle-day-independent anti-Müllerian hormone (AMH), and provocative tests, while ultrasonographic measures include antral follicle count (AFC) and ovarian volume. Among these indicators, AMH is regarded as the most sensitive and reliable marker of ovarian reserve because it is independent of the menstrual cycle and tends to decline before FSH rises (9). However, several studies have reported the limited use of these markers. In reproductive-aged women without a history of infertility, markers of lower ovarian reserve were found to be unrelated to reduced fertility, and in women with a history of one to two previous miscarriages, AMH levels were found to be unrelated to clinical pregnancy loss (10,11). These findings highlight the limitations of these single markers.
Machine learning holds considerable advantages for analyzing and integrating large amounts of medical data (12,13). Machine learning can fully account for the interactions between characteristics and incorporate new data to update models, in contrast to traditional statistical analysis approaches, which rely on a preset equation (14). In the realm of assisted reproduction, machine learning methods have previously been applied to evaluate and predict pregnancy rates (15-17). Researchers have also attempted to construct regression models to assess ovarian reserve by integrating single biochemical and ultrasound markers (18-21). However, a wider range of machine learning methods should be compared to determine a suitable evaluation model. The main aim of this study is to develop a more accurate machine learning model to estimate and quantify ovarian reserve in terms of predicting reproductive possibility and time to menopause.

Methods

Population selection
This is a multicenter, nationwide population-based study. The participants were recruited from seven centers in six different cities of China: Shenyang (northern China), Foshan (southern China), Chengdu (western China), and Zhengzhou, Yichang, and Wuhan (central China). From October 2011 to December 2014, a total of 2,055 women, aged 20 to 55, were recruited through advertisements. Of the initial recruits, 1,020 women met the following strict inclusion criteria for the healthy population used for modeling: 1) regular menstrual cycles of 21 to 35 days for women under 40 years old (women over 40 were not required to have regular menstrual cycles, considering that they may be in normal perimenopause or menopause); 2) no hormone use in the past 6 months; 3) no history of radiotherapy or chemotherapy; 4) no history of hysterectomy, oophorectomy, or any other type of ovarian surgery; 5) no ovarian cysts or ovarian tumors, as confirmed by ultrasound; and 6) no known chronic, systemic, metabolic, or endocrine diseases such as hyperandrogenism or hyperprolactinemia.
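The six inclusion criteria can be read as a single eligibility check. The following is a minimal sketch, not the screening procedure actually used in the study: the `Candidate` record, its field names, and the `is_eligible` function are hypothetical, and the handling of women aged exactly 40 (treated here like the older group) is an assumption, since the criteria only mention women under and over 40.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical screening record; field names are illustrative only."""
    age: int
    cycle_length_days: Optional[int]  # None if cycles are absent or irregular
    hormone_use_past_6_months: bool = False
    radiotherapy_or_chemotherapy: bool = False
    hysterectomy_or_ovarian_surgery: bool = False
    ovarian_cyst_or_tumor: bool = False
    chronic_or_endocrine_disease: bool = False


def is_eligible(c: Candidate) -> bool:
    """Apply criteria 1 to 6 as described in the text."""
    # Criterion 1: only women under 40 must have regular 21-35 day cycles.
    # Assumption: age 40 itself is treated like the older group.
    if c.age < 40:
        if c.cycle_length_days is None or not 21 <= c.cycle_length_days <= 35:
            return False
    # Criteria 2 to 6: any single exclusion flag rules the candidate out.
    exclusions = (
        c.hormone_use_past_6_months,
        c.radiotherapy_or_chemotherapy,
        c.hysterectomy_or_ovarian_surgery,
        c.ovarian_cyst_or_tumor,
        c.chronic_or_endocrine_disease,
    )
    return not any(exclusions)
```

For example, `is_eligible(Candidate(age=45, cycle_length_days=None))` passes criterion 1 because no cycle requirement applies to that age group.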
All volunteers were interviewed one-on-one using prepared questionnaires that included questions about their demographic, geographic, and reproductive characteristics. The participants were physically examined and received free hormone and ultrasound testing. The study was approved by the Tongji Ethics Committee, and written informed consent was obtained from each woman for the anonymous use of clinical data for statistical evaluation and research purposes.

Blood sample collection
All blood samples were taken from the participants' antecubital vein between 7:00 AM and 11:00 AM, after a 12 h overnight fast, on days 2 to 5 of a spontaneous menstrual cycle or any day if amenorrhea had lasted more than 3 months in those aged over 40 years. The samples were then centrifuged using standard conditions within 2 h of venipuncture. After centrifugation, serums were obtained, aliquoted, transported to the central laboratory, and stored at −80°C for no more than 2 weeks until the assays were performed. To avoid the potential bias produced by differences between laboratory test results, we chose the gynecologic endocrine laboratory of Tongji Hospital as the central laboratory; all serums were transported to the central laboratory using dry ice within 48 h of collection, and all serum hormones were tested in the central laboratory.

Hormone detection
Serum concentrations of AMH at the time of recruitment were measured using the AMH Gen II ELISA kit (Beckman Coulter, Inc., Brea, CA, USA) and Ultra-Sensitive AMH ELISA assays (AL-105, Ansh Labs, Webster, TX, USA). The mean of the values from the two commercial assays was taken as the final AMH level, and all serum AMH measurements were performed in the same laboratory using the above kits. Controls at two concentrations were used to monitor the accuracy of the assay. The intra- and interassay coefficients of variation (CVs) were 3.6% and 4.5%, respectively. The lowest amount of AMH that could be detected with a 95% probability in a sample was 0.08 ng/ml for Gen II ELISA and 0.04 ng/ml for Ansh Labs; therefore, we replaced all values recorded as <min (undetectable) with a value of 0.08 or 0.04 ng/ml, respectively, for the purpose of this analysis. Serum FSH, luteinizing hormone (LH), E2, testosterone (T), prolactin (PRL), and progesterone (PRG) levels were measured using a chemiluminescence-based immunometric assay on an ADVIA Centaur immunoassay system (Siemens Healthcare Diagnostics Inc., Tarrytown, NY, USA). All the serum hormone levels were measured in the same laboratory using the same kit. The intra- and interassay coefficients of variation were all <15%. Due to missing values, three hormones (T, PRL, and PRG) were not included in the analysis.
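The AMH preprocessing described above (substituting the detection limit for undetectable readings, then averaging the two assays) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `final_amh`, the use of `None` to encode an undetectable reading, and the choice to substitute before averaging are all assumptions.

```python
from statistics import mean

# Detection limits reported in the text (ng/ml).
LOD_GEN2 = 0.08  # Beckman Coulter AMH Gen II ELISA
LOD_ANSH = 0.04  # Ansh Labs Ultra-Sensitive AMH ELISA


def final_amh(gen2, ansh):
    """Return the mean of the two assay readings in ng/ml.

    A reading of None stands for a result reported as undetectable and is
    replaced with that assay's detection limit before averaging.
    """
    gen2_value = LOD_GEN2 if gen2 is None else gen2
    ansh_value = LOD_ANSH if ansh is None else ansh
    return mean([gen2_value, ansh_value])
```

With both readings undetectable, the sketch returns the midpoint of the two detection limits.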

Ultrasound examination
A transvaginal ultrasound scan of the ovaries was performed to determine the AFC. This ultrasound examination was performed at the seven centers. All participating research institutes were modernized large comprehensive hospitals and received our regular supervision and verification. The unified standard for this examination was formulated in the beginning, and all ultrasound doctors were strictly trained to test AFCs according to the same standard. In this study, the AFC was defined as the total number of visible round or oval structures with diameters of 2 to 10 mm in both ovaries. All ultrasound examinations were performed on days 2 to 5 of a spontaneous menstrual cycle or in the follicular phase for non-menstruating women. None of the eligible participants had follicles larger than 10 mm. No significant differences were found between each center. The intra-analysis coefficient of variation for the follicle diameter measurements was <5%, and the lower limit of detection was 0.1 mm.
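The AFC definition above amounts to counting structures within a diameter window across both ovaries. A minimal sketch follows; the function name, the list-of-diameters input format, and the inclusive treatment of the 2 mm and 10 mm bounds are assumptions made for illustration.

```python
def antral_follicle_count(left_diameters_mm, right_diameters_mm,
                          min_mm=2.0, max_mm=10.0):
    """Count follicles with diameters of 2 to 10 mm across both ovaries.

    Both bounds are treated as inclusive, which is an assumption; the text
    only gives the range "2 to 10 mm".
    """
    all_diameters = list(left_diameters_mm) + list(right_diameters_mm)
    return sum(1 for d in all_diameters if min_mm <= d <= max_mm)
```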

Establishment and assessment of models
In this study, ovarian reserve was quantified in the form of ovarian age for healthy women, and ovarian age was regarded as equal to their chronological age. The least absolute shrinkage and selection operator (LASSO) regression was used for data regularization and feature selection (22). Seven features (AMH, body mass index (BMI), inhibin B, FSH, E2, LH, and AFC) were used for quantification. To construct the prediction models, seven different machine learning algorithms were used, namely artificial neural network (ANN), support vector machine (SVM), generalized linear model (GLM), K-nearest neighbors regression (KNN), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM) (23-29); in each case, chronological age was regarded as the ovarian age of healthy women, who were assumed to have normal ovarian function. All the above-mentioned models were trained and tested on a 50/50 split of the dataset obtained by stratified random sampling. Parameter tuning was based on the grid search method and 10-fold cross-validation on the training dataset (30). The parameters of the machine learning models are listed in Table S1. For model assessment, Pearson's correlation coefficient (PCC) and mean absolute error (MAE) values were applied to indicate how well a model explains the variation in the dependent variables. The mean squared error (MSE) value was calculated to measure the stability of the model. All machine learning techniques were programmed in R (version 3.6.3) using packages including neuralnet, e1071, kknn, gbm, xgboost, and lightgbm.
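The three assessment metrics can be written out directly from their standard definitions. The sketch below uses plain Python rather than the R packages named above, and the function names are illustrative; `actual` stands for chronological age and `predicted` for the model's ovarian age.

```python
from math import sqrt


def pearson_cc(actual, predicted):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


def mean_absolute_error(actual, predicted):
    """Average absolute difference, in the same unit as the input (years)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference, which penalizes large errors more heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

For chronological ages `[25, 30, 35, 40]` and predictions `[27, 29, 38, 39]`, the absolute errors are 2, 1, 3, and 1 years, so the MAE is 1.75 years and the MSE is 3.75.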

Results
From October 2011 to December 2014, a total of 2,055 women, aged from 18 to 55, were recruited through advertisements. According to the exclusion and inclusion criteria, a total of 1,020 women from seven centers were enrolled and analyzed (Figure 1). Table 1 summarizes the statistics of the included women for age, AMH, inhibin B, BMI, FSH, LH, E2, and AFC. Holding the assumption that ovarian age was equal to chronological age in healthy women, we performed LASSO regression analysis on the total data to select those features suitable for constructing the models (Figure S1). Finally, all seven features were retained at the lowest CP value for the follow-up study (Table S2). We randomly chose half of the dataset as training data to make the prediction and half as test data, and we then checked the results on different datasets.
The flowchart of the study design.

Ding et al. 10.3389/fendo.2023.1087429 Frontiers in Endocrinology frontiersin.org

The results of the prediction analyses were compared in terms of PCC and MAE values for the seven machine learning models (ANN, SVM, GLM, KNN, GBDT, XGBoost, and LightGBM) (Table 2). The table shows the PCC values for the training dataset, the test dataset, and the entire dataset, as well as the MAE values for the entire dataset. Focusing on the PCC values, it can be observed that the XGBoost, LightGBM, and ANN methods performed better. While PCC just describes the correlation trend, the MAE value represents the detailed difference, which reflects the prediction bias. The LightGBM model had the lowest MAE value for all the data of 3.41 years. As there were five datasets with more than 90 women, the seven models were also tested in the datasets of Chengdu, Foshan, Tongji, Shenyang, and Zhengzhou. The XGBoost and LightGBM models also obtained the highest PCC value in all center-based datasets. While the XGBoost model had the highest PCC value on the Shenyang dataset, at 0.90, the GLM model had the lowest value on the Foshan and Tongji datasets, at just 0.43 (Figure 3A). As for the MAE value, the GBDT model had the highest MAE value on the Shenyang dataset, at 5.52 years, and the LightGBM model had the lowest value on the Foshan dataset, at just 3.05 years (Figure 3B). Further, we cross-validated the models using the 10-fold method in which we randomly chose 90% of the entire dataset as the training dataset and 10% of the data as the test dataset. We iterated the method 100 times and obtained a mean MSE value. Figure 4 shows the MSE value broken down for the seven different methods. The lowest mean MSE was obtained with the LightGBM technique, which showed the stability of this method.
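The resampling procedure described above (a random 90%/10% split, repeated 100 times, averaging the test MSE) can be sketched as follows. Taken literally, the description is a repeated random split rather than a strict partition into ten folds, and the sketch follows the literal reading. The one-feature least-squares line is only a stand-in for the seven models, and all function and parameter names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (placeholder model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def repeated_holdout_mse(xs, ys, repeats=100, test_fraction=0.1, seed=0):
    """Mean test MSE over repeated random train/test splits."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    indices = list(range(len(xs)))
    n_test = max(1, round(len(xs) * test_fraction))
    mses = []
    for _ in range(repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        squared_errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2
                          for i in test_idx]
        mses.append(sum(squared_errors) / len(squared_errors))
    return sum(mses) / len(mses)
```

Averaging over many random splits reduces the dependence of the MSE estimate on any single partition, which is why a low mean value is read here as a sign of stability.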
To evaluate the performance of the models and select the most suitable one, we combined the three indexes of PCC value, MAE value, and MSE value. The models that ranked in the top three under each index were retained. As shown in Table 3, the LightGBM model was the only one that ranked in the top three in all the lists. Though the PCC value of the XGBoost model was a little higher than that of the LightGBM model, the MSE and MAE values were much better in the LightGBM model.
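The ranking rule described above (keep the top three models under each index, then look for models present in every list) can be sketched with set intersection. The function names are hypothetical, and any numbers passed in are the caller's own; the sketch does not reproduce Table 3.

```python
def top_three(scores, higher_is_better):
    """Return the names of the three best models under one index."""
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    return set(ranked[:3])


def models_in_every_top_three(pcc, mae, mse):
    """Models ranked in the top three for PCC, MAE, and MSE simultaneously.

    PCC is ranked from high to low; MAE and MSE from low to high.
    """
    return (top_three(pcc, higher_is_better=True)
            & top_three(mae, higher_is_better=False)
            & top_three(mse, higher_is_better=False))
```

Each argument is a dictionary mapping a model name to its value for that index, so the result is the set of model names that survive all three filters.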
As 35 years is the boundary age of childbearing, here, we divided the datasets into two different age groups (20-35 and >35 years) and analyzed the mean prediction errors by age groups.
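The age-stratified error analysis can be sketched as a grouped MAE. This is an illustration only: the function name is hypothetical, grouping by chronological (not predicted) age is an assumption, and women aged exactly 35 are placed in the younger group to match the "20-35" label.

```python
def mae_by_age_group(actual_ages, predicted_ages, boundary=35):
    """Mean absolute prediction error in years for the two age groups."""
    errors = {"20-35": [], ">35": []}
    for actual, predicted in zip(actual_ages, predicted_ages):
        # Group by chronological age; age 35 itself counts as "20-35".
        key = "20-35" if actual <= boundary else ">35"
        errors[key].append(abs(actual - predicted))
    # A group with no women yields None rather than a division by zero.
    return {group: (sum(e) / len(e) if e else None)
            for group, e in errors.items()}
```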

Discussion
In this study, we collected data on clinical, biochemical, and basic ultrasonographic features in a population of healthy women with the aim of constructing a quantitative system for ovarian reserve. We compared different machine learning models with respect to their prediction accuracy and stability in order to find a better one to reflect the ovarian reserve status.
In recent years, mathematical methods have been used by researchers to evaluate ovarian reserve. Younis et al. developed a multivariable scoring system, combining biochemical tests, imaging measures, and BMI to assess ovarian reserve and pregnancy rate (21). Xu and colleagues developed two models to evaluate ovarian reserve, clinical pregnancy rate, and live-birth rate (18,19). Although these models are simple and easy to use, they apply only to infertile patients who require fertility treatments and in vitro fertilization (IVF) cycles, which means that they do not adequately reflect the majority of women of childbearing age. Additionally, the output of these models is categorical, which makes it impossible to quantify ovarian function. Even though they could evaluate the reproductive prognosis, it would be challenging for these models to forecast the timing of menopause. To address this, Roberta's study attempted to measure and describe ovarian function using the quantitative variable OvAge, a numeric variable intended to reflect ovarian reserve in terms of both reproductive potential and time to menopause (20). Because their study was the first to use a multi-factor model to assess and quantify ovarian age, they employed only a single generalized linear model; other characteristics that affect ovarian reserve, such as BMI, should also be included in the model (31). Moreover, their model included many ultrasonic measurement indicators, which require special hardware, and the subjective judgment of different ultrasound staff might introduce human error. In contrast to their study, we developed assessment models using a variety of machine learning methods and straightforward, objective indicators. Furthermore, seven machine learning models were constructed and analyzed to choose the most effective model for ovarian reserve quantification. In our study, we first calculated the PCC value between different indicators and age.
AFC and AMH obtained the highest absolute PCC values, which is in accordance with studies reporting that AFC and AMH are the two most important single tests in evaluating ovarian reserve. The PCC value between AMH and AFC was as high as 0.67, indicating the effect of AMH on the stage of preantral and small antral follicles (32). We also examined the AFC and AMH distributions with respect to age and obtained fitting curves. Because the prevailing age-specific reference values for AMH levels were obtained from samples of an American population in 2011, age-specific AMH reference values for Chinese women are needed (33). Our age-specific AMH distribution curve is also similar to that of a Japanese study that established an age-specific AMH reference for Japanese women to evaluate reproductive potential (34).
We used the assumption that ovarian age corresponds to chronological age in healthy women to investigate this novel variable of ovarian reserve. The key finding of this research is that clinical variables, blood biomarkers, and ultrasonographic characteristics may all be used to estimate ovarian reserve. After ranking analysis, including PCC, MAE, and MSE values, we determined the LightGBM model to be the most appropriate of the seven prediction models we constructed. The LightGBM approach, which uses histogram-based algorithms, was designed to be distributed and efficient, with the benefits of faster training speed, higher efficiency, and better accuracy. In our study, the LightGBM model, which had the second-best PCC value of all the models, obtained PCC values of 0.82, 0.56, and 0.70 for the training set, the test set, and the entire dataset, respectively. MAE measures the average absolute difference between ovarian age and predicted ovarian age, and the LightGBM model obtained the lowest MAE value, indicating better accuracy. Moreover, the MSE value of the LightGBM model was the lowest, which showed the better stability of this method. Other models, such as XGBoost and ANN, also exhibited good prediction accuracy but did not perform as well in terms of stability. Although a previous study used the GLM method to construct a predictive system for ovarian reserve evaluation, the results here showed that the predictive power of this method was lower than that of other methods (20). Although it has been suggested that a model combining markers would not be superior to a model with a single marker, we found that the PCC values of the seven models were all higher in absolute value than those of single markers, such as AMH (−0.45) and AFC (−0.43), which indicated that machine learning methods may lessen the influence of correlated markers in combined ovarian reserve marker models (35, 36). Further, we performed an age-stratified analysis, and we found that the LightGBM model was the most suitable model for women under the age of 35, with the lowest MAE value of 2.88 years. This model could distinguish the ovarian age for women under 35 years old with an accuracy of 99.49%. As 35 years is regarded as the boundary for childbearing age, the model has the potential to be used for evaluating reproductive function and guiding childbearing (37).

FIGURE 3
Radar plot showing correlation values (A) and mean absolute prediction errors (MAE) values (B) for the five datasets using the seven different prediction models.

FIGURE 4
Mean squared error (MSE) after 10-fold cross-validation for the seven methods.
Our study has several strengths. First, we assessed and quantified ovarian reserve in terms of ovarian age in a way that could be easily implemented in the clinic. For example, as the recognized natural menopause age is around 51, it is easy to evaluate the distance to an individual's menopause (38). Further, as 35 years is the boundary for childbearing age, it is easy to predict ovarian age and compare it to this boundary, then design individual reproductive plans (37). Second, the data included in this study came from multiple centers covering several geographical regions of China; this made the study population more representative and improved the credibility of the results. Third, we compared the performances of seven models and selected the most effective one. Our result may therefore be more robust than that of the previous study, which used only one method.
This study has several limitations. First, our model regards ovarian age as chronological age in healthy women, which requires stricter inclusion criteria for the population. Second, due to incomplete information, limited features were used in this study. As ovarian aging is associated with additional features including lifestyle and genetic factors, these features should also be incorporated into future studies (3,4,39). Third, this is a cross-sectional study involving healthy population data; an external validation test should be conducted in patients with polycystic ovarian syndrome and diminished ovarian reserve, and a longitudinal follow-up study should be performed to assess the predictive ability of the model. Additionally, though this is a nationwide study, the sample size from some centers was too small, which could potentially cause bias. More samples are needed to further test the model and explore more clinical applications.

Conclusion
Taken together, machine learning methods combining multiple features, including simple and easily obtained clinical, biochemical, and ultrasonographic parameters, were reliable in quantifying ovarian reserve and performed better than a single indicator, providing another possible measurement to reflect ovarian reserve accurately and predict the degree of ovarian aging in individual women. After comparison, the LightGBM method proved to be the approach with the best quantitative effect and stability, especially in the specific age group of 20 to 35 years. In the future, this model should be tested and improved on a larger cohort.

FIGURE 5
Mean absolute prediction errors (MAE) broken down for the seven different prediction models and different age groups: (A) 20-35 (B) >35 years.

Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement
The studies involving human participants were reviewed and approved by Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology. The patients/participants provided their written informed consent to participate in this study.

Author contributions
SW and YL conceived and designed the study. TD and TW collected the data and developed the analytic pipeline. WR, YH, and WM led the analysis, generated the tables and figures, and wrote the manuscript. MW and FF verified and processed the underlying data. All authors contributed to the article and approved the submitted version.

Funding
This work was supported by the National Natural Science Foundation of China (grant numbers 81873824 and 81902669).