- 1 Nuffield Department of Population Health, University of Oxford, Oxford, United Kingdom
- 2 School of Preventive Medicine, Wenzhou Medical University, Wenzhou, Zhejiang, China
- 3 Wenzhou Women and Children Health Guidance Center, Wenzhou, Zhejiang, China
Background: Accurate antenatal prediction of SGA at birth is essential to improve development and delivery of preventative and therapeutic interventions. This study aimed to assess the performance of machine learning (ML) models to predict SGA at birth among Chinese pregnancies classified according to the Chinese birthweight standard and three international birthweight standards.
Methods: We collected multimodal, longitudinal, antenatal surveillance data on 350,135 singleton pregnancies in Wenzhou City, China, between Jan 1, 2014 and Dec 31, 2016. For three pregnancy intervals we developed ML prediction models for newborns classified as SGA using the China, Intergrowth 21st, Fetal Medicine Foundation (FMF), and Gestation-related Optimal Weight (GROW) standards. We applied lasso regression for feature selection, and CatBoost, XGBoost, LightGBM, Artificial Neural Networks, Random Forest, a Stacked Ensemble model, and logistic regression for predictive modeling in training data sets, with validation in testing data sets.
Results: Among 22,603 singleton pregnancies with complete data, the rate of SGA using the China standard was 6.1%, compared to 4.3, 6.0, and 9.7% for the Intergrowth 21st, GROW, and FMF standards, respectively. This pattern was maintained in the imputed data set (n = 225,523), with corresponding SGA rates of 6.8, 4.8, 7.4, and 10.7%. Late pregnancy models (<37 weeks) had the best power to predict SGA, compared to middle (<26 weeks) and early pregnancy (<18 weeks) models. With the China standard, the logistic regression model in late pregnancy performed best with an area under the receiver operating characteristic curve (ROC-AUC) of 0.74. Logistic regression also performed better than ML algorithms with the Intergrowth-21st and GROW standards at each pregnancy interval, although differences were small. The Random Forest model with the FMF standard achieved superior performance at each pregnancy interval, reaching a ROC-AUC of 0.79 in late pregnancy. Notably, the middle pregnancy Random Forest model with the FMF standard already attained a ROC-AUC of 0.72 at 26 weeks’ gestation. Symphysis-fundal height, maternal abdominal circumference, maternal age, maternal height and weight, and parity were consistently identified as key predictors of SGA across the different standards.
Conclusion: There are important differences in the classification of SGA at birth between national and international birthweight standards. Both machine learning models and traditional logistic regression demonstrated comparable predictive performance for SGA identification. These findings hold promise for guiding risk-stratified prenatal care and optimizing resource allocation in clinical settings.
Introduction
Small-for-gestational-age (SGA) is defined as birthweight for gestational age below the 10th centile according to a birthweight chart (American College of Obstetricians and Gynecologists' Committee on Practice Bulletins—Obstetrics, 2021). SGA newborns are a major cause of global neonatal and child mortality and morbidity, especially in low- and middle-income countries (LMICs) (Lee et al., 2013). An estimated 23.3 million infants (19.3% of live births) per year are born SGA in LMICs, which contribute to 21.9% of neonatal deaths (Lee et al., 2017). The highest rates and numbers of SGA infants are born in Asia, and China has the fifth highest number of SGA newborns annually (Lee et al., 2017). Sustainable Development Goal 3 (SDG3) target 3.2 aims to reduce neonatal and child mortality to 12 and 25 per 1,000 live births, respectively, in all countries by 2030 (Liu et al., 2016). However, many LMICs are not on track to meet these targets, highlighting an urgent need to address the adverse perinatal outcomes that contribute to neonatal and child mortality (Sharrow et al., 2022; GBD 2019 Under-5 Mortality Collaborators, 2021).
Crucially, SGA classification depends on the birthweight charts used, which include reference charts, prescriptive standards, and customized growth charts (Capital Institute of Pediatrics and Coordinating Study Group of Nine Cities on the Physical Growth and Development of Children, 2020; Gardosi et al., 2018; Nicolaides et al., 2018; Villar et al., 2014). Many countries use charts derived from their own population. For example, the Chinese newborn chart is a population-based chart based on healthy pregnant women from nine cities across China (Capital Institute of Pediatrics and Coordinating Study Group of Nine Cities on the Physical Growth and Development of Children, 2020). The Intergrowth 21st birthweight standard is a prescriptive international population-based standard derived from multi-ethnic urban populations in eight countries and selected healthy, well-nourished women receiving adequate antenatal care and at low risk of fetal growth impairment (Villar et al., 2014). The Fetal Medicine Foundation (FMF) chart is based on fetal estimated weight and birthweight data from unselected singleton pregnancies at two UK hospitals, including pregnancies at risk of complications and preterm babies in utero (Nicolaides et al., 2018). Unlike these universal charts, the customized Gestation-related Optimal Weight (GROW) chart adjusts for maternal weight, height, parity, ethnicity or country of origin, and fetal sex (Gardosi et al., 2018). Each birthweight chart classifies different populations of newborn babies as SGA. To our knowledge, few studies have compared SGA classification among Chinese pregnancies according to different birthweight standards.
It is crucial to improve antenatal prediction of SGA to enable development and implementation of preventative and therapeutic interventions. The traditional approach to risk prediction has been logistic regression based on known risk factors. However, this approach has shown poor predictive power for SGA (Bai et al., 2022; Bai et al., 2022). Given this limitation, there is a pressing need for more sophisticated analytical approaches. The field of perinatal epidemiology is now leveraging artificial intelligence (AI) to harness complex datasets for public health impact. AI promises a paradigm shift by uncovering subtle, non-linear interactions within routine clinical data that elude conventional methods (Mennickent et al., 2023). Large-scale, multimodal, longitudinal electronic health records facilitate the use of AI for predicting the risk of clinical outcomes (Hunter and Holmes, 2023). To date, studies using machine learning (ML) to predict SGA at birth have had important limitations, including small sample sizes, highly selected patient groups, and design or analysis biases (Bai et al., 2022; Bai et al., 2022; Vicoveanu et al., 2022). Some popular ML methods, such as a Stacked Ensemble model that combines predictions from multiple base models using a meta-model, have not been applied to SGA prediction (Naimi and Balzer, 2018), and the predictive performance of these methods compared to other ML methods, such as Random Forest and CatBoost, is unknown (Cho et al., 2022; Choi et al., 2021). In addition, a review of perinatal outcome prediction found that many ML models failed to explain their decision-making process in a way that lets clinicians understand the importance of input features (Ramakrishnan et al., 2021).
The development of accurate antenatal models for predicting SGA at birth requires high-performing ML algorithms. However, the accuracy of any such model is fundamentally dependent on the birthweight standards used to define SGA. Each standard identifies a different neonatal subpopulation, leading to substantial variation in clinical management. For example, infants classified as SGA by a customized standard (e.g., GROW) but not by a population standard (e.g., Intergrowth-21st) may miss essential hypoglycemia or hypothermia monitoring, whereas misclassifying a constitutionally small infant as SGA may prompt unnecessary investigations and parental anxiety. Thus, the choice of standard directly shapes risk stratification, resource use, and quality of care.
Therefore, this study aims to compare six machine learning (ML) models and logistic regression in predicting SGA based on four birthweight standards—the Chinese national standard, Intergrowth-21st, FMF, and GROW—and to evaluate how standard selection influences prediction accuracy.
Methods
Study design
The Wenzhou maternal and child health information management platform covers 51 midwifery clinics and hospitals in Wenzhou City in Zhejiang Province, China, and was used to collect maternal and perinatal health records. We included all 350,135 singleton pregnancies registered from 1 January 2014 to 31 December 2016. Of these, 225,523 pregnancies were registered, had antenatal follow-up, and had delivery records (Supplementary Figure S1). The data analysis workflow, encompassing data engineering, feature selection, prediction modeling, and model performance and interpretation, is illustrated in Figure 1.
Figure 1. Study methodology. Steps to develop the machine learning models to predict small for gestational age are shown. Each step consists of several processes, as indicated. ANN, Artificial Neural Networks; FMF, Fetal Medicine Foundation; GROW, Gestation-Related Optimal Weight; SGA, small for gestational age (birthweight <10th centile for gestational age); SHAP, Shapley Additive Explanations; SMOTE, Synthetic Minority Over-sampling Technique; PR-AUC, Area Under the Precision-Recall Curve; ROC-AUC, Area Under the Receiver Operating Characteristic curve.
Participant features
A prospective pregnancy health survey was conducted at registration at around 12 weeks’ gestation, collecting information on demographics, social, medical, obstetric and gynecological history, anthropometric measurements, and laboratory analyses. Gestational age at birth was determined from first-trimester ultrasound dating (standard practice). Birthweight was measured within 1 h of birth. Symphysis fundal height (SFH), maternal abdominal circumference (MAC), systolic blood pressure (SBP), diastolic blood pressure (DBP) and weight were measured at each antenatal care visit. Pregnancy was divided into three intervals, determined from a combination of clinical practice and the distribution of our dataset: early pregnancy (< 18 weeks’ gestation), middle pregnancy (18 to 25+6 weeks), and late pregnancy (26 to 36+6 weeks) (Supplementary Table S1). Fifteen variables were created by dividing follow-up measurements into separate variables according to the three pregnancy intervals. If there were multiple visits during a given pregnancy interval, the average value of measurements was used for analysis. In total, 43 features from registration and follow-up visits, as well as four features from delivery data, are shown in Supplementary Tables S2, S3. Additional variables were derived from the differences between pregnancy intervals (e.g., diffSBP12).
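The interval-level variables described above can be illustrated with a minimal Python sketch. It assumes gestational age is available in decimal weeks for each visit; the function names, the sign convention for the difference variables (later interval minus earlier interval), and the choice to derive only adjacent-interval differences are our assumptions for illustration, not details confirmed by the study.

```python
from statistics import mean


def pregnancy_interval(gest_weeks):
    """Return 1, 2 or 3 for early, middle or late pregnancy, else None.

    Boundaries follow the text: early < 18 weeks, middle 18 to 25+6,
    late 26 to 36+6.
    """
    if gest_weeks < 18:
        return 1
    if gest_weeks < 26:
        return 2
    if gest_weeks < 37:
        return 3
    return None  # visits at or after 37 weeks are not assigned to an interval


def interval_features(visits, measure="SBP"):
    """Average one measurement per interval and derive adjacent differences.

    `visits` is a list of (gestational_weeks, value) pairs for one pregnancy.
    """
    by_interval = {1: [], 2: [], 3: []}
    for weeks, value in visits:
        k = pregnancy_interval(weeks)
        if k is not None and value is not None:
            by_interval[k].append(value)

    features = {}
    for k, values in by_interval.items():
        # Multiple visits within an interval are averaged; no visit gives None.
        features[f"{measure}{k}"] = mean(values) if values else None

    # Difference variables, e.g. diffSBP12 = SBP2 - SBP1 (sign convention assumed).
    for a, b in [(1, 2), (2, 3)]:
        first, second = features[f"{measure}{a}"], features[f"{measure}{b}"]
        name = f"diff{measure}{a}{b}"
        features[name] = None if first is None or second is None else second - first
    return features


visits = [(12.0, 110), (16.5, 114), (22.0, 116), (30.0, 120), (34.0, 124)]
print(interval_features(visits))
```

In this toy record the two early visits are averaged into a single SBP1 value, and a pregnancy with no visit in an interval yields a missing value that would later be handled by imputation or exclusion.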
Birthweight for gestational age standards
Singleton newborns with birthweight less than the 10th centile were classified as SGA based on four birthweight standards. SGA classification was according to newborn sex, except for the FMF standard (Nicolaides et al., 2018). Birthweight centiles for the China standard were based on the national reference, which was used for the primary endpoint of SGA classification in this study (Capital Institute of Pediatrics and Coordinating Study Group of Nine Cities on the Physical Growth and Development of Children, 2020). The GROW standard applied maternal height and weight at registration, parity, country of origin (China), fetal sex, and gestational age to calculate the birthweight centiles (www.gestation.net). Birthweight centiles for the Intergrowth 21st standard were calculated through its dedicated software (Villar et al., 2014). Therefore, four separate data sets were generated with SGA at birth classified according to each of the four birthweight standards, with the China Standard serving as the primary classification method for defining SGA and the other three standards as secondary classification methods.
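As a rough illustration of the classification rule, the sketch below labels a newborn as SGA when birthweight falls below a 10th-centile cut-off for sex and completed gestational week. The cut-off values in `P10_GRAMS` are hypothetical placeholders and are not taken from any of the four standards; the function and variable names are likewise illustrative. For a standard that is not sex-specific (such as FMF), the table would be keyed by gestational week alone.

```python
# HYPOTHETICAL 10th-centile birthweights in grams, keyed by (sex, completed weeks).
# Placeholder numbers for illustration only; NOT values from the China,
# Intergrowth 21st, FMF, or GROW standards.
P10_GRAMS = {
    ("M", 38): 2700, ("F", 38): 2600,
    ("M", 39): 2850, ("F", 39): 2750,
    ("M", 40): 2950, ("F", 40): 2850,
}


def is_sga(birthweight_g, gest_weeks, sex, p10_table=P10_GRAMS):
    """True if birthweight is below the 10th centile for sex and gestational age."""
    key = (sex, int(gest_weeks))  # truncate decimal weeks to completed weeks
    if key not in p10_table:
        raise KeyError(f"no centile available for {key}")
    return birthweight_g < p10_table[key]


newborns = [
    {"id": 1, "bw": 2500, "ga": 39.3, "sex": "F"},
    {"id": 2, "bw": 3300, "ga": 40.0, "sex": "M"},
    {"id": 3, "bw": 2690, "ga": 38.6, "sex": "M"},
]
labels = {n["id"]: is_sga(n["bw"], n["ga"], n["sex"]) for n in newborns}
sga_rate = sum(labels.values()) / len(labels)
print(labels, round(sga_rate, 3))
```

Running the same newborns through four different cut-off tables would produce the four outcome variables, one per standard, that the separate data sets are built on.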
Data preprocessing
Data preprocessing for the cohort of 225,523 singleton pregnancies with registration, follow-up, and delivery records involved a staged process. Prior to imputation, variables with over 30% missing data were removed, reducing the feature set from 53 to 25. The MICE algorithm was then applied to these 25 variables to generate an imputed data set (n = 225,523), with the fifth iteration retained. In parallel, a complete data set (n = 22,603) was formed by excluding all pregnancy records with missing values from the 53 variables. The datasets were subsequently processed as follows: a 70%/30% stratified split was performed, using individual pregnancy records as the sampling unit. This approach was necessitated by the anonymized nature of the data, which precluded the identification of women with multiple pregnancies and ensured complete separation between training and testing sets. Following the split, all numeric features underwent normalization via the Yeo-Johnson method, followed by standardization (centering and scaling to achieve zero mean and unit variance). To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was subsequently applied exclusively to the training sets.
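Two of the steps above, removal of variables with more than 30% missing values and the stratified 70%/30% split on individual pregnancy records, can be sketched as follows. This is a simplified standard-library illustration on a toy cohort, not the R code used in the study; the helper names and the toy variables are assumptions, and the imputation, Yeo-Johnson, standardization, and SMOTE steps are omitted.

```python
import random
from collections import defaultdict


def drop_sparse_columns(rows, max_missing=0.30):
    """Keep only columns whose share of missing (None) values is at most max_missing."""
    columns = list(rows[0].keys())
    n = len(rows)
    keep = [c for c in columns
            if sum(r[c] is None for r in rows) / n <= max_missing]
    return [{c: r[c] for c in keep} for r in rows], keep


def stratified_split(rows, label, test_fraction=0.30, seed=0):
    """Split rows into train/test so each class keeps the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[label]].append(r)
    train, test = [], []
    for members in by_class.values():
        members = members[:]          # copy so the caller's list is untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy cohort: 10% SGA, one sparse column ("rare_lab") missing in 60% of records.
rows = [{"id": i, "sga": int(i % 10 == 0), "weight": 50 + i % 30,
         "rare_lab": None if i % 5 < 3 else 1.0} for i in range(1000)]
rows, kept = drop_sparse_columns(rows)
train, test = stratified_split(rows, "sga")
print(kept, len(train), len(test), sum(r["sga"] for r in test))
```

Because the split is made per class, the SGA proportion in the testing set matches that of the full cohort, which matters when the outcome is as infrequent as it is here.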
Feature selection
For each birthweight standard, Lasso regression was used to select important features for SGA prediction at three different time points: early (<18 weeks), middle (<26 weeks), and late pregnancy (<37 weeks). This analysis was performed on the training data sets of the imputed data and complete data using 10-fold cross-validation. Lasso regression was chosen for its advantage in handling multicollinearity among predictors. By applying an L1 penalty to the coefficients, lasso regression automatically identifies relevant predictors—shrinking the coefficients of less informative variables to zero—to yield a sparse subset of features.
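The way an L1 penalty sets coefficients exactly to zero can be seen in a small coordinate-descent sketch. For simplicity it uses the squared-error lasso on synthetic data, because that update has a closed form; for a binary outcome such as SGA the penalty would usually be attached to a logistic loss, and the study does not report which implementation or loss was used. The penalty value and all names below are illustrative assumptions.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every feature's contribution except feature j.
            residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual / n
            beta[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return beta


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # zero mean, unit variance
# Only the first two columns carry signal; the other four are pure noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=500)
y = y - y.mean()

beta = lasso_coordinate_descent(X, y, lam=0.2)
selected = [j for j in range(X.shape[1]) if beta[j] != 0.0]
print(np.round(beta, 2), selected)
```

In practice the penalty strength would be chosen by cross-validation, as in the 10-fold procedure described above, rather than fixed in advance.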
Design and development of prediction models
For each birthweight standard, we developed distinct prediction models for the early, middle, and late pregnancy intervals. The primary analysis was based on the complete data, while the imputed data were used in sensitivity analyses to evaluate the robustness of the models to missing data. In both analyses, the features were selected from variables available at each gestational interval using Lasso regression. The selected features were used to train the following algorithms: CatBoost, XGBoost, LightGBM, Random Forest, Artificial Neural Networks (ANN), a Stacked Ensemble model, and logistic regression (for baseline comparison). Hyperparameters for all individual models except the Stacked ensemble model were optimized via a random search (Supplementary Table S4). The Stacked Ensemble model was then constructed using these individually tuned models (CatBoost, XGBoost, LightGBM, Random Forest, ANN, and logistic regression) as base learners. Their predictions were combined using a logistic regression meta-learner with a regularization strength (C) of 0.1, a fixed random state for reproducibility, and a maximum iteration limit of 500. The tuning was guided by the area under the receiver operating characteristic curve (ROC-AUC) value, which was evaluated using 5-fold cross-validation on the training sets.
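A stacked ensemble of this kind could be assembled with scikit-learn's `StackingClassifier`, as in the hedged sketch below. It uses synthetic data and only two base learners to stay small, whereas the study used six tuned base learners. The meta-learner settings (C = 0.1, fixed random state, 500 iterations) follow the text; the use of 5-fold out-of-fold predictions inside the stack and the base-learner hyperparameters are our assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the antenatal data (about 10% positives).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Two base learners only, to keep the sketch light; the study used six.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("lr", LogisticRegression(max_iter=500)),
]

# Meta-learner settings follow the text: C = 0.1, fixed random state, max_iter = 500.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(C=0.1, random_state=0, max_iter=500),
    cv=5,                           # out-of-fold base predictions feed the meta-learner
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)
auc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])
print(f"stacked ROC-AUC on held-out data: {auc:.2f}")
```

Training the meta-learner on out-of-fold predictions keeps it from simply rewarding whichever base learner overfits the training data most.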
Model performance and interpretation
For each prediction model developed on the training data sets, performance metrics, including the ROC-AUC, accuracy, sensitivity, specificity, balanced accuracy (the average of sensitivity and specificity), positive predictive value (PPV), negative predictive value (NPV), and F1 score (harmonic mean of PPV and sensitivity), were evaluated on the testing data sets using optimal threshold values. These metrics and their corresponding 95% confidence intervals were estimated using bootstrap resampling with 1,000 replicates. The optimal probability threshold for classifying a case as SGA was determined as the point on the ROC curve closest to the top-left corner (0,1). All threshold-dependent metrics (e.g., sensitivity, specificity) are reported at this single, consistent threshold to facilitate model comparison. The best-fitting model was selected based on the following criteria: if all models had a precision-recall AUC (PR-AUC) below 0.2, the model with the highest ROC-AUC was chosen; otherwise, the model with the highest PR-AUC was selected. Calibration curves with Brier scores were plotted to compare predicted and observed outcomes for the final optimal predictive model based on each birthweight standard. Model interpretation was performed by calculating Shapley Additive Explanation (SHAP) values on the testing data sets, employing a global approach to assess population-level feature importance. The mean absolute SHAP value was used to rank features by their overall impact on the model output, while the distribution and central tendency of individual SHAP values (positive or negative) for each feature indicated its directional association with SGA risk. Clinical relevance was assessed by checking whether the top-ranked features aligned with existing medical knowledge.
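The threshold rule and the model-selection rule described above can be written down compactly. The sketch below is a plain-Python illustration on a ten-record toy example; the function names are ours, the candidate thresholds are simply the observed scores, and the AUC values passed to `select_best_model` in practice would come from the fitted models.

```python
import math


def confusion(y_true, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called SGA."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted_sga = s >= threshold
        if predicted_sga and y:
            tp += 1
        elif predicted_sga:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics_at(y_true, scores, threshold):
    """Threshold-dependent metrics; assumes both classes are present in y_true."""
    tp, fp, tn, fn = confusion(y_true, scores, threshold)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    f1 = 2 * ppv * sens / (ppv + sens) if ppv + sens else 0.0
    return {"sensitivity": sens, "specificity": spec, "ppv": ppv, "npv": npv,
            "balanced_accuracy": (sens + spec) / 2, "f1": f1}


def closest_to_corner_threshold(y_true, scores):
    """Threshold whose ROC point (1 - specificity, sensitivity) is nearest to (0, 1)."""
    def distance(t):
        m = metrics_at(y_true, scores, t)
        return math.hypot(1 - m["specificity"], 1 - m["sensitivity"])
    return min(sorted(set(scores)), key=distance)


def select_best_model(results):
    """results maps model name -> {'roc_auc': ..., 'pr_auc': ...}."""
    if all(r["pr_auc"] < 0.2 for r in results.values()):
        return max(results, key=lambda name: results[name]["roc_auc"])
    return max(results, key=lambda name: results[name]["pr_auc"])


y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.90]
t = closest_to_corner_threshold(y_true, scores)
print(t, metrics_at(y_true, scores, t))
```

Confidence intervals would then be obtained by repeating `metrics_at` on bootstrap resamples of the testing set at the same fixed threshold.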
To further evaluate model generalizability, an additional analysis was conducted using the complete data with a more stringent SGA definition. In this analysis, SGA status was defined by the overlap of all four birthweight standards: a newborn was considered SGA only if classified as such by every standard, and all other births were defined as non-SGA. The optimal models identified under each individual standard were then evaluated on their ability to identify SGA under this stringent, overlapping criterion. The DeLong test was used to compare ROC-AUCs between the best-fitting models, with a P value < 0.001 considered statistically significant.
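The overlapping definition amounts to a logical AND across the four per-standard labels. A minimal sketch, with illustrative key names and made-up labels for four newborns:

```python
STANDARDS = ("china", "intergrowth21", "fmf", "grow")


def stringent_sga(flags):
    """SGA only if every one of the four standards classifies the newborn as SGA."""
    return all(flags[s] for s in STANDARDS)


def any_sga(flags):
    """SGA according to at least one of the four standards."""
    return any(flags[s] for s in STANDARDS)


# Made-up labels for illustration; not study data.
newborns = [
    {"china": True,  "intergrowth21": True,  "fmf": True,  "grow": True},
    {"china": True,  "intergrowth21": False, "fmf": True,  "grow": True},
    {"china": False, "intergrowth21": False, "fmf": True,  "grow": False},
    {"china": False, "intergrowth21": False, "fmf": False, "grow": False},
]
stringent = [stringent_sga(n) for n in newborns]
n_any = sum(any_sga(n) for n in newborns)
share_all_four = sum(stringent) / n_any
print(stringent, n_any, round(share_all_four, 3))
```

Newborns flagged by only some standards are treated as non-SGA under this definition, so the positive class is smaller than under any single standard.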
Software and implementation
The analytical workflow was conducted using a dual-software approach. Data preprocessing and engineering were performed in R (version 3.6.1), which included multiple imputation via the MICE package to handle missing data, normalization and standardization using the recipes package with Yeo-Johnson transformation, and addressing class imbalance through the SMOTE algorithm implemented in the DMwR package. Subsequent predictive modeling and evaluation were implemented in Python 3.6 within the Spyder 6 environment, utilizing pandas and numpy for data manipulation, scikit-learn for machine learning algorithms and performance assessment, matplotlib and seaborn for visualization, SHAP for model interpretability, and scipy for statistical computations.
Results
SGA classification
Among 22,603 singleton pregnancies with complete data, the rate of SGA with the China standard was 6.1%, which was similar to the GROW standard (6.0%), higher than the Intergrowth 21st standard (4.3%) and lower than the FMF standard (9.7%) (Table 1). Multiple imputation was performed for the cohort of 225,523 singleton pregnancies, with the distribution of variables before and after imputation compared in Supplementary Table S5. A similar trend in SGA rates across standards was observed in the larger imputed data (n = 225,523), with the China, GROW, Intergrowth 21st, and FMF standards yielding SGA rates of 6.8%, 7.4%, 4.8%, and 10.7%, respectively. SGA rates according to gestational age for each birthweight standard in 22,603 singleton pregnancies with complete data are shown in Figure 2A. There were similar proportions of SGA with the China and Intergrowth 21st standards at 28 to 37 weeks’ gestation, but a higher proportion of SGA with the China standard after 37 weeks. The GROW standard had intermediate rates of SGA before 37 weeks, but rates similar to the China standard after 37 weeks. The FMF standard classified the highest proportion of infants as SGA at all gestations (Figure 2A). A total of 2,345 newborns were classified as SGA by at least one of the four standards, of which 845 (36.0%) were classified as SGA by all four standards (Figure 2B). Thirty-seven infants (1.6%) were classified as SGA only by the China standard and not by any other standard (Figure 2B). The overlap of SGA cases classified by pairs of standards ranged from 44.0% to 100% (Figure 2C). SGA cases classified by the China standard were frequently also classified as SGA by the other three standards (67.6–95.7%) (Figure 2C). The overlap of non-SGA at birth classified by the four birthweight standards is shown in Supplementary Figure S2.
Figure 2. SGA classification according to different birthweight standards. (A) The proportion of SGA at birth at different gestational ages (in weeks) according to different birthweight for gestational age standards: China standard, Fetal Medicine Foundation (FMF) standard, Gestation-related Optimal Weight (GROW) standard, Intergrowth-21st standard. (B) Overlap of SGA cases classified according to the four birthweight standards. (C) Overlap of SGA cases classified according to pairs of birthweight standards. SGA, Small for gestational age (birthweight <10th centile for gestational age).
Significant differences in maternal age, weight, age at menarche, education and albumin at registration were observed between pregnancies with SGA and non-SGA infants for the China standard and the three other birthweight standards (Table 1). Blood pressure values (SBP3, DBP1, DBP2, DBP3), blood pressure change values (diffDBP23), and all maternal anthropometric measurements (maternal weight, MAC, and SFH) and their change values between each two pregnancy intervals differed significantly between pregnancies with SGA infants compared to non-SGA infants for all birthweight standards (Table 1).
SGA prediction modeling
For each birthweight standard, feature selection was conducted using lasso regression, separately across three pregnancy intervals: early (<18 weeks), middle (<26 weeks), and late pregnancy (<37 weeks). The analysis was performed on both imputed and complete datasets, with the optimal λ value selected using the one-standard-error criterion. The number of predictors retained for each standard and pregnancy interval, along with the corresponding λ values, are summarized in Supplementary Table S6. The variable selection paths and importance rankings across all four birthweight standards are illustrated in Supplementary Figures S3–S6, which present coefficient shrinkage plots and variable importance bar charts for each pregnancy interval.
ROC curves and precision-recall (PR) curves for different pregnancy intervals in the testing sets are illustrated in Figure 3 for the complete data and Supplementary Figure S7 for the imputed data. Late pregnancy prediction models achieved higher ROC-AUCs than early and middle pregnancy models for all birthweight standards (Figure 3; Table 2). The China standard had intermediate predictive ROC-AUCs for SGA across the three pregnancy intervals and ML models, with ROC-AUCs similar to the Intergrowth 21st standard, better than the GROW standard, but not as good as the FMF standard (Figure 3; Table 2). Among the late pregnancy models, the highest ROC-AUC was 0.74 for logistic regression with the China standard, while ROC-AUCs for the other standards ranged from 0.64 to 0.79 (Table 2). For the China standard, the late pregnancy logistic regression model had the highest F1 score (0.30). Based on the predefined criteria, the best-performing model for the China standard was the late pregnancy logistic regression model (ROC-AUC 0.74, PR-AUC 0.16), and the best-performing model for the FMF standard was the late pregnancy Random Forest model (ROC-AUC 0.79, PR-AUC 0.28), with sensitivity of 0.78, PPV of 0.20, and F1 score of 0.45. Their calibration curves and hyper-parameter settings are shown in Supplementary Figure S8 and Supplementary Table S7, respectively. The calibration curves of the top-performing models (Supplementary Figure S8) demonstrated systematic overestimation, deviating above the line of perfect calibration. The corresponding Brier scores were 0.2281 (China standard, logistic regression), 0.2359 (Intergrowth 21st standard, logistic regression), 0.2325 (GROW standard, logistic regression), and 0.1949 (FMF standard, Random Forest). The model for the FMF standard exhibited the best calibration.
The ROC curves of the training and testing sets of the complete data indicated consistent predictive performance for the China, Intergrowth 21st, and GROW standards, with minimal AUC differences. However, a more notable performance gap was observed for the FMF standard (training AUC: 0.982, testing AUC: 0.789), suggesting a degree of overfitting for this specific model (Supplementary Figure S9). The predictive performance of models developed using the imputed dataset was largely consistent with that observed in the complete dataset, showing similar trends across pregnancy intervals and birthweight standards. In the imputed dataset, logistic regression was the optimal model for the China, Intergrowth 21st, and FMF standards, while XGBoost performed best for the GROW standard; the performance, hyper-parameter settings, and ROC curves of the training and testing sets are provided in Supplementary Tables S8, S9 and Supplementary Figure S10, respectively.
Figure 3. Receiver operating characteristic and precision-recall curves for prediction of small for gestational age at three pregnancy intervals in the testing set of complete data according to four birthweight standards using six machine learning algorithms and logistic regression prediction models. PR-AUC, Area under the precision-recall curve; ROC, Receiver Operating Characteristic.
Table 2. Bootstrap validation of prediction model performance using the testing data set from the complete data.
To further assess model generalizability, we evaluated the optimal models using a more stringent SGA definition in which a newborn was classified as SGA only when identified as such by all four birthweight standards (n = 846; Figure 2B). Under this overlapping criterion, performance was comparable across standards, with mean AUCs of 0.741, 0.741, 0.729, and 0.721 for models based on the China, Intergrowth-21st, GROW, and FMF standards, respectively (Supplementary Figures S11A–D). However, bootstrap analysis revealed important differences in model performance: the FMF standard model demonstrated markedly superior discriminative ability, achieving a mean AUC of 0.981 (95% CI: 0.978–0.985). This substantially exceeded the performance of models based on the China (mean AUC: 0.749), Intergrowth-21st (0.753), and GROW (0.739) standards under their respective original definitions. The performance advantage of the FMF-based model was consistent across multiple metrics, including sensitivity (0.923 vs. 0.715–0.764), specificity (0.928 vs. 0.612–0.679), and accuracy (0.927 vs. 0.617–0.680) (Supplementary Figure S11E). The statistical superiority of the FMF-based model was further confirmed by DeLong tests, which revealed significant differences in ROC-AUC between the FMF standard model and all other models (all p < 0.001), while no significant difference was observed between the China and INTERGROWTH-21st standards (p = 0.739) (Supplementary Table S10).
Model interpretation
Variable importance, ranked by the mean absolute SHAP value for the best-performing model under each birthweight standard, is presented in Figure 4. The analysis identified consistent key predictors across the standards. Late-pregnancy symphysis-fundal height (SFH3) was the most important predictor for the China and FMF standards, and ranked fourth and fifth for the Intergrowth-21st and GROW standards, respectively. Similarly, late-pregnancy maternal abdominal circumference (MAC3) was the second-ranked predictor for the China and FMF standards and ranked within the top five for the other two standards. Maternal age was also identified as a highly influential variable, ranking within the top eight predictors for all standards. Furthermore, maternal height and weight, and parity were among the most important predictors for the China, Intergrowth-21st, and FMF standards. The connected lines in the figure visually demonstrate the variation in the relative ranking of these key predictors across the different standards. Based on the mean SHAP values from the best-fitting models for each birthweight standard, older maternal age was consistently associated with an increased risk of SGA, as indicated by positive mean SHAP values across all four standards. In contrast, features including SFH3, MAC3, maternal height and weight, and parity showed inconsistent directional associations with SGA risk, with positive influences under some standards and negative under others (Supplementary Table S11; Supplementary Figure S12). Based on the analysis of the imputed dataset, the ranking of predictor importance was largely consistent with that observed in the complete dataset, with late-pregnancy SFH, MAC, maternal age, height and weight, and parity remaining among the most influential features across the four standards (Supplementary Figure S13). 
In the imputed dataset, the direction of association for key predictors, as indicated by the mean SHAP values, also showed patterns similar to those in the complete data (Supplementary Table S12; Supplementary Figure S14).
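The distinction between the mean absolute SHAP value (used for ranking) and the signed mean SHAP value (used for direction) can be illustrated with a short sketch. The SHAP matrix below is hypothetical and is not drawn from the study results; in practice it would be produced by a SHAP explainer applied to the testing set, and the function name is ours.

```python
from statistics import mean


def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean |SHAP| and report the signed mean SHAP value."""
    summary = []
    for j, name in enumerate(feature_names):
        column = [row[j] for row in shap_rows]
        summary.append({
            "feature": name,
            "mean_abs_shap": mean(abs(v) for v in column),
            "mean_shap": mean(column),
        })
    return sorted(summary, key=lambda s: s["mean_abs_shap"], reverse=True)


# HYPOTHETICAL SHAP values (rows = pregnancies, columns = features); not study results.
features = ["SFH3", "MAC3", "maternal_age"]
shap_rows = [
    [-0.30, 0.10, 0.05],
    [0.40, -0.20, 0.02],
    [-0.50, 0.15, 0.08],
    [0.20, -0.05, 0.01],
]
ranking = rank_by_mean_abs_shap(shap_rows, features)
for s in ranking:
    print(f'{s["feature"]:>12}  mean|SHAP|={s["mean_abs_shap"]:.3f}  mean={s["mean_shap"]:+.3f}')
```

In this toy example the top-ranked feature has a signed mean close to zero because its contributions point in both directions, which is one way a feature can be highly important overall yet show no stable directional association.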
Figure 4. Predictor importance ranking for the optimal models across the four birthweight standards using the testing set of the complete data. Feature importance is ranked vertically by the mean absolute SHAP value, with the specific value labeled to the right of each bar. The China, Intergrowth-21st, and GROW standards used logistic regression as the optimal model, while the FMF standard used Random Forest. Common predictors across standards are connected by solid lines of the same color to facilitate comparison of rankings. ALB, albumin; ALT, alanine aminotransferase; AST, aspartate aminotransferase; BUN, serum urea nitrogen; DBP, diastolic blood pressure; FBG, fasting blood glucose; MAC, maternal abdominal circumference; Scr, serum creatinine; SBP, systolic blood pressure; SFH, symphysis fundal height; SGA, small for gestational age (birthweight <10th centile for gestational age); SHAP, Shapley Additive Explanations; TBil, total bilirubin. 1st pregnancy interval is the period before 18 gestational weeks. 2nd pregnancy interval is the period between 18 and 25+6 gestational weeks. 3rd pregnancy interval is the period between 26 and 36+6 gestational weeks.
Discussion
The SGA rate among Chinese newborns based on the China standard was 6.1%, which was similar to the GROW standard (6.0%), higher than the Intergrowth 21st standard (4.3%) and lower than the FMF standard (9.7%). Late pregnancy models had the best power to predict SGA, compared to middle and early pregnancy models, which is likely due to the additional relevant predictors that become available during the course of pregnancy, such as further MAC and SFH measurements. Our analysis revealed that optimal model performance was standard-dependent: logistic regression achieved the best performance for the China standard (ROC-AUC 0.74, PR-AUC 0.16), while Random Forest demonstrated superior performance for the FMF standard (ROC-AUC 0.79, PR-AUC 0.29, with sensitivity of 0.78, PPV of 0.20 and F1 score of 0.45). Symphysis fundal height, maternal abdominal circumference, maternal age, maternal height and weight, and parity were identified as key predictors of SGA.
In our study, the FMF standard classified the highest proportion of newborns as SGA, in line with two previous publications (Kabiri et al., 2020; Savirón-Cornudella et al., 2021). This elevated SGA rate can be attributed to the standard’s methodology, which integrates term birth data with estimated preterm birthweights based on the assumption that estimated fetal weight and birthweight share the same median across gestational ages (Nicolaides et al., 2018). Since preterm births are often associated with pathological conditions and fetal growth restriction, the FMF standard tends to classify more preterm infants as SGA (Nicolaides et al., 2018). The customized GROW standard classified an intermediate proportion of infants as SGA in our Chinese data set compared to population-based standards. This contrasts with reports from high-income countries, where the GROW standard typically identifies the highest proportion of SGA at birth (Odibo et al., 2018; Fernández-Alba et al., 2022; Zhang et al., 2007). The GROW standard used here was constructed with China selected as the country of origin, resulting in SGA proportions similar to those identified by the China standard. Moreover, we observed substantial overlap in SGA classification between the Intergrowth 21st and China standards. This may partly be explained by their similar study designs, as both were developed using data from low-risk, well-nourished women. Additionally, the Intergrowth 21st project included participants from Beijing, China, whose socioeconomic context aligns closely with that of the population used to construct the China standard (Villar et al., 2014).
Our ML models showed better predictive power for SGA than models reported in previous studies (Bai et al., 2022; Cho et al., 2022; Kuhle et al., 2018), especially the Random Forest model based on the FMF standard. A big data study comparing ML methods showed that a logistic regression model using predictors available at 26 weeks was the best-fitting tool for predicting SGA birth, with a ROC-AUC of 0.66 for primiparous women (Kuhle et al., 2018). Our models developed at 26 weeks achieved better prediction, with a ROC-AUC of 0.70 based on the China standard and 0.72 based on the FMF standard. The superior performance of the Random Forest model with the FMF standard, relative to the China standard, may stem from the algorithm’s ability to capture non-linear relationships and complex interactions among predictive features, which allows it to exploit more of the information in the input variables and thereby discriminate SGA more effectively (Couronné et al., 2018). In contrast, logistic regression remained the optimal or non-inferior model for the China, Intergrowth 21st, and GROW standards, suggesting predominantly linear predictor-outcome relationships, as supported by SHAP analysis. Furthermore, its structural simplicity mitigates overfitting and enhances generalizability (Christodoulou et al., 2019; Deo, 2015), while its inherent interpretability, which provides transparent, quantifiable risk associations, offers a distinct advantage for potential clinical implementation (Rudin, 2019). Collectively, these findings indicate that model superiority is context-dependent, hinging on the alignment between the model’s form and the requirements of the prediction task.
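The ROC-AUC values compared above can be read as the probability that a randomly chosen SGA pregnancy receives a higher predicted risk than a randomly chosen non-SGA pregnancy. The sketch below illustrates this rank-based interpretation in plain Python; the function name `roc_auc` and the toy labels and scores are assumptions for illustration, not the study’s evaluation code or data.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for SGA, 0 for non-SGA.
    scores: predicted risk for each pregnancy (higher = more likely SGA).
    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example (illustrative values only).
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.90, 0.40, 0.35, 0.20, 0.10]
auc = roc_auc(toy_labels, toy_scores)
print(f"ROC-AUC: {auc:.3f}")
```

The pairwise loop is quadratic in cohort size and is meant only to make the definition concrete; for large data sets a sort-based implementation from an established library would be the practical choice.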
Our study demonstrates that the best-fitting models show potential for predicting SGA at birth, with performance varying by gestational age and birthweight standard. Specifically, the Random Forest model under the FMF standard achieved a ROC-AUC of 0.72 at 26 weeks of gestation, which represents a clinically promising performance for early risk stratification, and further improved to 0.79 in late pregnancy, which falls within the ‘acceptable’ to ‘excellent’ range according to common diagnostic benchmarks. Beyond prediction, our analysis also identified several key clinical predictors, including symphysis-fundal height, maternal abdominal circumference, maternal age, maternal height and weight, and parity, which may inform opportunities for individualized antenatal management. Within the framework of the Chinese tiered prenatal care system, this level of predictive capability in middle pregnancy could enable practical clinical triage by identifying high-risk pregnancies for intensified monitoring, such as through frequent serial symphysis-fundal height measurements and third-trimester ultrasound biometry, while maintaining standard care for lower-risk women, thereby optimizing resource allocation. These models thus provide a quantitative tool for improving SGA detection and management.
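As a minimal sketch of the triage idea described above, predicted risks could be mapped to care pathways with a single cut-off. The function name `triage`, the pregnancy identifiers, and the 0.10 threshold are hypothetical; any operational threshold would need to be chosen and validated clinically.

```python
def triage(predicted_risks, threshold):
    """Map each pregnancy's predicted SGA risk to a care pathway.

    predicted_risks: dict of pregnancy ID -> predicted probability of SGA.
    threshold: risk cut-off at or above which monitoring is intensified
               (hypothetical value; not derived from this study).
    """
    pathways = {}
    for pregnancy_id, risk in predicted_risks.items():
        if not 0.0 <= risk <= 1.0:
            raise ValueError(f"risk for {pregnancy_id} must lie in [0, 1]")
        if risk >= threshold:
            pathways[pregnancy_id] = "intensified monitoring"
        else:
            pathways[pregnancy_id] = "standard care"
    return pathways


# Illustrative predicted risks at middle pregnancy (made-up values).
example_risks = {"P001": 0.03, "P002": 0.18, "P003": 0.10}
pathways = triage(example_risks, threshold=0.10)
for pregnancy_id, pathway in pathways.items():
    print(pregnancy_id, "->", pathway)
```

Lowering the threshold would flag more pregnancies for intensified monitoring, trading a higher sensitivity for a lower PPV and a greater monitoring workload.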
An important next step will be the validation of our ML prediction models in independent data sets. Further improvements to our prediction models may be achieved by incorporating additional variables, such as previous pregnancy outcome details, glucose monitoring data, and ultrasound measurements. Once validated and refined, the ML prediction models need to be assessed prospectively, ideally in the context of an RCT, to ascertain whether they improve prediction of SGA at birth. Ultimately, improved SGA prediction combined with interventions needs to be shown to reduce perinatal morbidity and mortality.
This study has several strengths. To our knowledge, this is the first study to compare four population-based and customized birthweight standards using a large population-based data set, and to develop ML models to predict SGA at birth at different stages of pregnancy. It is also the first study to compare the CatBoost method with the widely used Random Forest, Stacked ensemble, and ANN methods to determine the best-fitting model for each birthweight standard. Finally, our study is transparent in the methodology used for data processing, feature selection, prediction model development, assessment, and interpretation, thereby reducing the potential for analytical bias.
This study has some limitations. First, while routine pregnancy surveillance data have the advantages of scale and inclusivity, they have the disadvantage of a relatively limited number of variables, which may have limited model performance. Second, the pregnancies in our study were from a single city in China and may not be representative of all Chinese singleton pregnancies, which may limit the generalizability of the proposed models. Third, we assessed four representative birthweight standards. Other standards, such as the World Health Organization Fetal Growth Charts and the NICHD fetal growth standard, assess antenatal fetal weight based on ultrasound scans rather than newborn size, which may lead to more accurate estimation of SGA risk (Grantz et al., 2018; Kiserud et al., 2017). However, a comparison of different fetal and birthweight standards showed that all standards assessed had poor performance for predicting adverse perinatal outcomes in an Australian population (Choi et al., 2021). Fourth, we did not compare the performance of the ML models to existing prediction methods that may currently be in use in Wenzhou and therefore cannot comment on ML performance relative to those methods. Finally, we did not have access to antenatal fetal ultrasound measurements, which might have improved our prediction models. However, routine access to antenatal ultrasound is not available in many LMICs, which have the highest burden of SGA and may benefit most from improved antenatal SGA prediction models.
In conclusion, this study reveals substantial variation in SGA classification across birthweight standards. Sophisticated machine learning algorithms and conventional logistic regression demonstrated comparable predictive performance for SGA identification. These findings highlight the potential to enhance prenatal care through computational approaches that enable risk-stratified management.
Data availability statement
The data analyzed in this study are subject to the following licenses/restrictions: privacy and ethical restrictions. Requests to access these datasets should be directed to qiuyan_yu@wmu.edu.cn.
Ethics statement
The studies involving humans were approved by Second Hospital Affiliated to Wenzhou Medical University. The studies were conducted in accordance with the local legislation and institutional requirements. The ethics committee/institutional review board waived the requirement of written informed consent for participation from the participants or the participants’ legal guardians/next of kin because anonymized data was derived from the regional pregnancy surveillance data system.
Author contributions
Q-YY: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing. YL: Data curation, Writing – review & editing. Y-RZ: Data curation, Writing – review & editing. X-JY: Conceptualization, Investigation, Supervision, Writing – review & editing. JH: Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Supervision, Validation, Writing – original draft.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This study was funded by the Natural Science Foundation of Zhejiang Province (LQ22H260001) for data collection and the China Scholarship Council (no. 202108330205) for overseas research.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that Generative AI was not used in the creation of this manuscript.
Supplementary material
The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2025.1679979/full#supplementary-material
References
American College of Obstetricians and Gynecologists' Committee on Practice Bulletins—Obstetrics (2021). Prediction and prevention of spontaneous preterm birth: ACOG practice bulletin, number 234. Obstet. Gynecol. 138, e65–e90. doi: 10.1097/AOG.0000000000004479
Bai, X., Zhou, Z., Luo, Y., Yang, H., Zhu, H., Chen, S., et al. (2022). Development and evaluation of a machine learning prediction model for small-for-gestational-age births in women exposed to radiation before pregnancy. J. Pers. Med. 12:550. doi: 10.3390/jpm12040550
Bai, X., Zhou, Z., Su, M., Li, Y., Yang, L., Liu, K., et al. (2022). Predictive models for small-for-gestational-age births in women exposed to pesticides before pregnancy based on multiple machine learning algorithms. Front. Public Health 10:940182. doi: 10.3389/fpubh.2022.940182
Capital Institute of Pediatrics and Coordinating Study Group of Nine Cities on the Physical Growth and Development of Children (2020). Growth standard curves of birth weight, length and head circumference of Chinese newborns of different gestation. Zhonghua Er Ke Za Zhi 58, 738–746. doi: 10.3760/cma.j.cn112140-20200316-00242
Cho, H., Lee, E. H., Lee, K. S., and Heo, J. S. (2022). Machine learning-based risk factor analysis of adverse birth outcomes in very low birth weight infants. Sci. Rep. 12:12119. doi: 10.1038/s41598-022-16234-y
Choi, S. K. Y., Gordon, A., Hilder, L., Henry, A., Hyett, J. A., Brew, B. K., et al. (2021). Performance of six birth-weight and estimated-fetal-weight standards for predicting adverse perinatal outcome: a 10-year nationwide population-based study. Ultrasound Obstet. Gynecol. 58, 264–277. doi: 10.1002/uog.22151
Christodoulou, E., Ma, J., Collins, G. S., Steyerberg, E. W., Verbakel, J. Y., and Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J. Clin. Epidemiol. 110, 12–22. doi: 10.1016/j.jclinepi.2019.02.004
Couronné, R., Probst, P., and Boulesteix, A. L. (2018). Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinformatics 19:270. doi: 10.1186/s12859-018-2264-5
Deo, R. C. (2015). Machine learning in medicine. Circulation 132, 1920–1930. doi: 10.1161/CIRCULATIONAHA.115.001593
Fernández-Alba, J. J., Castillo Lara, M., Sánchez Mera, R., Aragón Baizán, S., González Macías, C., Quintero Prado, R., et al. (2022). INTERGROWTH-21st versus a customized method for the prediction of neonatal nutritional status in hypertensive disorders of pregnancy. BMC Pregnancy Childbirth 22:136. doi: 10.1186/s12884-022-04450-3
Gardosi, J., Francis, A., Turner, S., and Williams, M. (2018). Customized growth charts: rationale, validation and clinical benefits. Am. J. Obstet. Gynecol. 218, S609–S618. doi: 10.1016/j.ajog.2017.12.011
GBD 2019 Under-5 Mortality Collaborators (2021). Global, regional, and national progress towards sustainable development goal 3.2 for neonatal and child health: all-cause and cause-specific mortality findings from the global burden of disease study 2019. Lancet 398, 870–905. doi: 10.1016/S0140-6736(21)01207-1
Grantz, K. L., Hediger, M. L., Liu, D., and Buck Louis, G. M. (2018). Fetal growth standards: the NICHD fetal growth study approach in context with INTERGROWTH-21st and the World Health Organization multicentre growth reference study. Am. J. Obstet. Gynecol. 218, S641–S655.e28. doi: 10.1016/j.ajog.2017.11.593
Hunter, D. J., and Holmes, C. (2023). Where medical statistics meets artificial intelligence. N. Engl. J. Med. 389, 1211–1219. doi: 10.1056/NEJMra2212850
Kabiri, D., Romero, R., Gudicha, D. W., Hernandez-Andrade, E., Pacora, P., Benshalom-Tirosh, N., et al. (2020). Prediction of adverse perinatal outcome by fetal biometry: comparison of customized and population-based standards. Ultrasound Obstet. Gynecol. 55, 177–188. doi: 10.1002/uog.20299
Kiserud, T., Piaggio, G., Carroli, G., Widmer, M., Carvalho, J., Neerup Jensen, L., et al. (2017). The World Health Organization fetal growth charts: a multinational longitudinal study of ultrasound biometric measurements and estimated fetal weight. PLoS Med. 14:e1002220. doi: 10.1371/journal.pmed.1002220
Kuhle, S., Maguire, B., Zhang, H., Hamilton, D., Allen, A. C., Joseph, K. S., et al. (2018). Comparison of logistic regression with machine learning methods for the prediction of fetal growth abnormalities: a retrospective cohort study. BMC Pregnancy Childbirth 18:333. doi: 10.1186/s12884-018-1971-2
Lee, A. C., Katz, J., Blencowe, H., Cousens, S., Kozuki, N., Vogel, J. P., et al. (2013). National and regional estimates of term and preterm babies born small for gestational age in 138 low-income and middle-income countries in 2010. Lancet Glob. Health 1, e26–e36. doi: 10.1016/S2214-109X(13)70006-8
Lee, A. C., Kozuki, N., Cousens, S., Stevens, G. A., Blencowe, H., Silveira, M. F., et al. (2017). Estimates of burden and consequences of infants born small for gestational age in low and middle income countries with INTERGROWTH-21(st) standard: analysis of CHERG datasets. BMJ 358:j3677. doi: 10.1136/bmj.j3677
Liu, L., Oza, S., Hogan, D., Chu, Y., Perin, J., Zhu, J., et al. (2016). Global, regional, and national causes of under-5 mortality in 2000-15: an updated systematic analysis with implications for the sustainable development goals. Lancet 388, 3027–3035. doi: 10.1016/S0140-6736(16)31593-8
Mennickent, D., Rodríguez, A., Opazo, M. C., Riedel, C. A., Castro, E., Eriz-Salinas, A., et al. (2023). Machine learning applied in maternal and fetal health: a narrative review focused on pregnancy diseases and complications. Front. Endocrinol. (Lausanne) 14:1130139. doi: 10.3389/fendo.2023.1130139
Naimi, A. I., and Balzer, L. B. (2018). Stacked generalization: an introduction to super learning. Eur. J. Epidemiol. 33, 459–464. doi: 10.1007/s10654-018-0390-z
Nicolaides, K. H., Wright, D., Syngelaki, A., Wright, A., and Akolekar, R. (2018). Fetal Medicine Foundation fetal and neonatal population weight charts. Ultrasound Obstet. Gynecol. 52, 44–51. doi: 10.1002/uog.19073
Odibo, A. O., Nwabuobi, C., Odibo, L., Leavitt, K., Obican, S., and Tuuli, M. G. (2018). Customized fetal growth standard compared with the INTERGROWTH-21st century standard at predicting small-for-gestational-age neonates. Acta Obstet. Gynecol. Scand. 97, 1381–1387. doi: 10.1111/aogs.13394
Ramakrishnan, R., Rao, S., and He, J. R. (2021). Perinatal health predictors using artificial intelligence: a review. Womens Health (Lond) 17:17455065211046132. doi: 10.1177/17455065211046132
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215. doi: 10.1038/s42256-019-0048-x
Savirón-Cornudella, R., Esteban, L. M., Aznar-Gimeno, R., Dieste-Pérez, P., Pérez-López, F. R., Campillos, J. M., et al. (2021). Prediction of late-onset small for gestational age and fetal growth restriction by fetal biometry at 35 weeks and impact of ultrasound-delivery interval: comparison of six fetal growth standards. J. Clin. Med. 10:2984. doi: 10.3390/jcm10132984
Sharrow, D., Hug, L., You, D., Alkema, L., Black, R., Cousens, S., et al. (2022). Global, regional, and national trends in under-5 mortality between 1990 and 2019 with scenario-based projections until 2030: a systematic analysis by the UN Inter-agency Group for Child Mortality Estimation. Lancet Glob. Health 10, e195–e206. doi: 10.1016/S2214-109X(21)00515-5
Vicoveanu, P., Vasilache, I. A., Scripcariu, I. S., Nemescu, D., Carauleanu, A., Vicoveanu, D., et al. (2022). Use of a feed-forward back propagation network for the prediction of small for gestational age newborns in a cohort of pregnant patients with thrombophilia. Diagnostics (Basel) 12:1009. doi: 10.3390/diagnostics12041009
Villar, J., Cheikh Ismail, L., Victora, C. G., Ohuma, E. O., Bertino, E., Altman, D. G., et al. (2014). International standards for newborn weight, length, and head circumference by gestational age and sex: the newborn cross-sectional study of the INTERGROWTH-21st project. Lancet 384, 857–868. doi: 10.1016/S0140-6736(14)60932-6
Keywords: artificial intelligence, birthweight standards, feature selection, machine learning, prediction models, small-for-gestational-age
Citation: Yu Q-Y, Lin Y, Zhou Y-R, Yang X-J and Hemelaar J (2026) Antenatal prediction of small for gestational age at birth based on four birthweight standards using machine learning algorithms. Front. Artif. Intell. 8:1679979. doi: 10.3389/frai.2025.1679979
Edited by:
Deepanjali Vishwakarma, University of Limerick, Ireland
Reviewed by:
Umida Ganieva, Rosalind Franklin University of Medicine and Science, United States
Kingsley Wong, University of Western Australia, Australia
Copyright © 2026 Yu, Lin, Zhou, Yang and Hemelaar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Xin-Jun Yang, xjyang@wmu.edu.cn; Joris Hemelaar, joris.hemelaar@ndph.ox.ac.uk
†These authors have contributed equally to this work
Ying Lin3