Predicting risk of preterm birth in singleton pregnancies using machine learning algorithms

We aimed to develop, train, and validate machine learning models for predicting preterm birth (<37 weeks' gestation) in singleton pregnancies at different gestational intervals. Models were developed based on complete data from 22,603 singleton pregnancies from a prospective population-based cohort study conducted in 51 midwifery clinics and hospitals in Wenzhou City, China, between 2014 and 2016. We applied CatBoost, Random Forest, Stacked Model, Deep Neural Network (DNN), and Support Vector Machine (SVM) algorithms, as well as logistic regression, to conduct feature selection and predictive modeling. Feature selection was implemented based on permutation-based feature importance lists derived from the machine learning models including all features, using a balanced training data set. To develop prediction models, the top 10%, 25%, and 50% most important predictive features were selected. Prediction models were developed with the training data set with 5-fold cross-validation for internal validation. Model performance was assessed using area under the receiver operating characteristic curve (AUC) values. The CatBoost-based prediction model after 26 weeks' gestation performed best, with an AUC of 0.70 (95% CI: 0.67, 0.73), accuracy of 0.81, sensitivity of 0.47, and specificity of 0.83. Number of antenatal care visits before 24 weeks' gestation, aspartate aminotransferase level at registration, symphysis fundal height, maternal weight, abdominal circumference, and blood pressure emerged as strong predictors after 26 completed weeks. The application of machine learning to pregnancy surveillance data is a promising approach to predicting preterm birth, and we identified several modifiable antenatal predictors.


Introduction
Preterm birth (PTB) is the leading cause of neonatal and child mortality globally (Liu et al., 2016). United Nations Sustainable Development Goal 3, target 3.2, aims to reduce neonatal and child mortality to 12 per 1,000 live births and 25 per 1,000 live births, respectively (United Nations, 2016). A recent study estimated that 10.6% of all babies worldwide are born prematurely, with Asia accounting for 7.84 million (52.9%) PTBs. In particular, China accounts for an estimated 1.17 million PTBs annually, highlighting an urgent public health issue (Chawanpaiboon et al., 2019).
Early detection of pregnant women at risk of preterm birth enables timely preventative interventions to reduce the risk of PTB (ACOG, 2021). Imaging tests and invasive screening have potential as effective screening methods, but remain experimental because of high cost, possible harm, and low accessibility (Bahado-Singh et al., 2019; Considine et al., 2019; Wang et al., 2019). Non-invasive screening using machine learning (ML) algorithms, based on large-scale pregnancy surveillance data with multilevel information linkage to delivery records, promises to support clinical decision making by predicting adverse pregnancy outcomes and guiding pregnancy management without any extra physiological or imaging tests (Gao et al., 2019; Sharifi-Heris et al., 2022).
Prediction models using ML algorithms to quantify the risk of PTB have been proposed in recent years, with predictive powers ranging from 0.6 to 0.9 (Weber et al., 2018; Koivu and Sairanen, 2020; Arabi Belaghi et al., 2021; Raja et al., 2021; Shields et al., 2021; Lee et al., 2022; Nieto-Del-Amor et al., 2022; Sun et al., 2022). Some ML prediction models using uterine electrohysterographic (EHG) signals and multi-omics data in the middle trimester reported a good ability to differentiate between preterm and term birth (Tarca et al., 2021; Mohammadi Far et al., 2022; Nieto-Del-Amor et al., 2022; Romero-Morales et al., 2022; Espinosa et al., 2023). However, these predictors are time-consuming and costly to obtain, and are impractical to measure in routine antenatal care in low-resource settings. Instead, prediction models using maternal features available from routine pregnancy care are more likely to be widely applicable and to improve pregnancy outcomes. To improve the predictive power for PTB, many popular ML algorithms have been employed, compared with traditional regression methods, and achieved high areas under the receiver operating characteristic curve (AUC) (Fazzari et al., 2022; Park et al., 2022; Nsugbe et al., 2023). A number of studies found that logistic regression provided quicker and better classification performance, and easier interpretability, than ML models in other disease settings (Kuhle et al., 2018; Song et al., 2023). However, a study comparing deep learning with logistic regression found that neural networks showed slightly better predictive power for PTB than logistic regression (Goldsztejn and Nehorai, 2023). To achieve an efficient prediction model, feature selection is an important process to reduce dimensionality and computing complexity, and to facilitate clinical practice. There are two conventional ways to conduct feature selection: one is applying univariate analysis to select features that are highly associated with the outcome (Park et al., 2022; Nsugbe et al., 2023); another is 
relying on feature importance derived from ML algorithms (Sharifi-Heris et al., 2022; Espinosa et al., 2023). However, some known important features might be ignored when relying only on ML-based feature importance lists (Bose et al., 2019; Liverani et al., 2023). Moreover, predictive models are at risk of overestimation or underestimation bias, due to inappropriate data sources (Sun et al., 2022), confounding factors used as predictors (Raja et al., 2021), poor definition of predictors (Lee et al., 2022), incomplete reporting of modeling processes (Shields et al., 2021; Lee et al., 2022; Sun et al., 2022), inappropriate statistical approaches to feature selection (Weber et al., 2018; Koivu and Sairanen, 2020; Arabi Belaghi et al., 2021), and failure to handle imbalanced data (AlSaad et al., 2022; Fazzari et al., 2022).
To overcome the limitations of previous studies, we designed and applied a detailed methodology to perform data pre-processing and to select predictors using feature importance derived from ML algorithms, combined with clinical knowledge. We aimed to develop and validate PTB prediction models at different gestational intervals to support application in clinical practice.

Study design and population
A prospective population-based cohort study was conducted in 51 midwifery clinics and hospitals in Wenzhou City, Zhejiang Province, China, recruiting 355,062 pregnant women at around 12 weeks' gestation. We included all singleton pregnancies delivered at <42 weeks' gestation from 1 January 2014 to 31 December 2016. Exclusion criteria were: absence of follow-up antenatal records or birth records; multiple pregnancy (e.g., twins); missing values for any of the features listed in Supplementary Table S1; deliveries at <24 weeks with birthweight over 1,000 g; and deliveries at >24 weeks with weight Z scores beyond the range of −3 to 3 according to the INTERGROWTH-21st standard for newborn weight (Supplementary Figure S1) (Villar et al., 2014). Supplementary Figure S2 shows the selection process of participants; a total of 22,603 singleton pregnancies with complete data were included in the analysis. The data sets were de-identified, and we were authorized to access them. The study was approved by the ethics committee of the Second Hospital Affiliated to Wenzhou Medical University.

Outcomes
PTB was defined as birth occurring between 24 and 36 + 6 weeks' gestation, regardless of whether the PTB was spontaneous or medically indicated. Gestational age at birth was determined by ultrasound estimation at the first antenatal care visit.

Data collection
The Wenzhou maternal and child health information management platform was used to collect health records of pregnancy care before, during, and after delivery. At registration, each pregnant woman was interviewed by a trained obstetric doctor using a standardized questionnaire to gather demographic and lifestyle information, pregnancy history, and medical history, and laboratory tests were taken after an overnight fast. We collected maternal age, height, weight, education, occupation, and ethnicity, as well as parity, maternal heart rate, gynecological history, clinically confirmed disease history, behaviors in the last 3 months (smoking, medicine use, alcohol use, and contraception), and menstruation characteristics (length of menstrual cycle, length of a menstrual period, and age at menarche). During antenatal care visits, vital signs, including blood pressure, maternal abdominal circumference (MAC), and symphysis fundal height (SFH), together with pregnancy-associated laboratory tests at registration, performed after an overnight fast of more than 8 hours, were extracted as potential markers of PTB. These tests comprised hemoglobin, leukocyte count, platelet count, fasting blood glucose (FBG), alanine aminotransferase (ALT), aspartate aminotransferase (AST), albumin (AIB), total bilirubin (TBil), serum creatinine (Scr), serum urea nitrogen (BUN), urine acetone bodies, urine occult blood (ERY), urine white blood cells (LEU), urine glucose (UGLU), and blood type. The features ERY, LEU, UGLU, and blood type were removed because they had >50% missing values.
Given the percentiles of gestational weeks at the first antenatal visit (Supplementary Table S2), the pregnancy period before 37 gestational weeks was divided into early pregnancy (<18 weeks), middle pregnancy (18 to 25 + 6 weeks), and late pregnancy (26 to 36 + 6 weeks). Antenatal measurements were encoded in line with these gestational intervals. For example, if there were two antenatal visits before 18 weeks, the feature SBP1 was created and assigned the average of the two SBP values. The differences in SBP, DBP, SFH, MAC, and maternal weight between pregnancy periods were calculated to represent the absolute change between periods. Supplementary Table S1 lists all 49 maternal features included as candidate predictors of PTB.
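This interval encoding can be sketched in a few lines. The visit records, cut-points, and variable names below are our own illustrative assumptions, not the study's actual data structures:

```python
# Hypothetical antenatal visit records: (gestational week, SBP in mmHg).
visits = [(12, 110.0), (16, 114.0), (22, 118.0), (30, 124.0), (34, 126.0)]

def interval(week):
    """Map a gestational week to the study's three intervals:
    1 = early (<18), 2 = middle (18 to 25+6), 3 = late (26 to 36+6)."""
    if week < 18:
        return 1
    elif week < 26:
        return 2
    return 3

def encode(visits):
    """Average repeated measurements within each interval,
    yielding features analogous to SBP1, SBP2, SBP3."""
    sums, counts = {}, {}
    for week, value in visits:
        k = interval(week)
        sums[k] = sums.get(k, 0.0) + value
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

sbp = encode(visits)          # {1: 112.0, 2: 118.0, 3: 125.0}
diff_sbp13 = sbp[3] - sbp[1]  # absolute change between late and early pregnancy
```

Difference features such as diffSBP13 then fall out directly as subtractions between the interval averages.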

Statistical methodology
As illustrated in Figure 1, the process to construct models for predicting PTB was performed in a number of steps:

Data cleaning, splitting, and resampling
Firstly, data cleaning involved data merging, removing cases with missing values for the features of interest, and deleting cases with abnormal birthweights. Secondly, we divided our data of 22,603 pregnancies with complete data into training and testing data sets according to a 70%/30% split, using stratified sampling. The training data set was then further divided into training (80%) and validation (20%) data sets, again using stratified sampling. The training data set was used to implement feature selection and to develop the final prediction models with 5-fold cross validation and hyperparameter tuning. The validation data set was used to assess the performance of the full-feature prediction models that were used to select a subset of optimal features. The testing data set was used to assess the performance of the prediction models developed on the training data set with the selected optimal features. Thirdly, data sets of singleton pregnancies typically contain <10% preterm births; such imbalanced data lead to decreased prediction performance with ML approaches, because prediction models tend to favor the majority class to achieve high accuracy. We therefore applied an over/undersampling hybrid method and a K-nearest neighbors undersampling method to balance the training data set, while keeping the testing and validation data sets imbalanced (Zhang et al., 2010; Nieto-Del-Amor et al., 2022). With these resampling methods, the training data set was resampled into balanced data in which preterm and term births each constituted 50% of cases.
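The balancing idea can be illustrated with a minimal sketch. Note this uses plain random over/undersampling for simplicity, whereas the study's undersampling step was K-nearest-neighbors-based; the class lists and target size are invented for illustration:

```python
import random

def hybrid_resample(majority, minority, target_size, seed=0):
    """Balance a binary training set to a 50/50 class ratio by
    undersampling the majority class (without replacement) and
    oversampling the minority class (with replacement), each to
    target_size rows."""
    rng = random.Random(seed)
    under = rng.sample(majority, target_size)                  # undersample
    over = [rng.choice(minority) for _ in range(target_size)]  # oversample
    return under, over

term = list(range(1000))   # stand-in for term-birth rows
preterm = list(range(42))  # stand-in for preterm rows (~4%, as in this cohort)
under, over = hybrid_resample(term, preterm, 500)
# 500 term + 500 preterm: each class now occupies 50% of the training set
```

In practice a library such as R's "themis" or Python's "imbalanced-learn" would perform the neighbor-aware variants; only the training set is balanced, never validation or test.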

Data engineering
Normalization (minimizing the skewness of numeric variables with the Yeo-Johnson method) and standardization (centering and scaling numeric variables to zero mean and unit variance) of all continuous features were performed to improve model performance for the training, testing, and validation data sets (Boehmke and Greenwell, 2019; Raju et al., 2020).
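The two transforms can be sketched as follows. For illustration we apply the Yeo-Johnson formula at a fixed lambda; in practice lambda is chosen by maximum likelihood (e.g., by scipy.stats.yeojohnson or R's recipes package), and the example values are invented:

```python
import math

def yeo_johnson(x, lam):
    """Yeo-Johnson power transform for one value and a fixed lambda.
    Handles negative inputs, unlike the Box-Cox transform."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    return -math.log1p(-x) if lam == 2 else -(((-x + 1) ** (2 - lam) - 1) / (2 - lam))

def standardize(values):
    """Center to zero mean and scale to unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

raw = [0.5, 1.0, 4.0, 9.0, 20.0]                  # a right-skewed feature
transformed = [yeo_johnson(v, 0.0) for v in raw]  # lambda = 0 reduces to log1p
z = standardize(transformed)                       # zero mean, unit variance
```

A subtlety worth noting: the transform parameters (lambda, mean, standard deviation) should be estimated on the training set only and then applied unchanged to validation and test sets.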

Algorithms
Random Forest (Biau, 2012), CatBoost (Prokhorenkova et al., 2018; Wang et al., 2018), Support Vector Machine (SVM) (Noble, 2006; Gao et al., 2019), and Stacked Models (Van der Laan et al., 2007) were used to construct prediction models with all features to obtain permutation-based feature importance lists for feature selection. These four algorithms, plus the Deep Neural Network (DNN) and logistic regression, were used to develop the PTB prediction models. The prediction models based on the balanced training data set were developed with 5-fold cross validation (CV). The full-feature prediction models used for feature selection were validated in the validation data set, and the final prediction models with optimal subsets of features were validated in the testing data set.
Random Forest, an improvement over bagged decision trees, is an ensemble learning algorithm that fits many independent trees on bootstrapped random subsets of the data (Biau, 2012). It introduces additional randomness by splitting nodes on a random subset of the features. The bagging method combines the multiple decision trees to achieve a more accurate and stable result (Biau, 2012).
CatBoost is one of the most efficient gradient-boosted decision tree algorithms, taking advantage of automatic handling of categorical features and missing values to decrease overfitting. Compared with other gradient boosting algorithms such as XGBoost, it builds symmetric decision trees, enabling efficient CPU (Central Processing Unit) implementation, reducing training time, and acting as a regularizer to reduce overfitting (Prokhorenkova et al., 2018).
SVM is an effective supervised machine learning algorithm that finds an optimal hyperplane in multidimensional data to act as a class boundary separating cases into different classes (Noble, 2006). The hyperplane is chosen to achieve the maximum margin between the closest points of the different classes. We used an SVM with a radial basis function kernel for predictive modeling.
The Stacked Model, also called a super learner, is an ensemble algorithm that stacks multiple ML base learners, such as random forests and gradient boosting, to find the optimal combination of diverse learning algorithms that makes a prediction as good as or better than any single ML model (Polley and Van Der Laan, 2010). We applied a generalized linear model, Deep Learning, Random Forest, XGBoost, and Gradient Boosting Machine as base learners, and used a generalized linear model with non-negative weights to combine the base learners with the "h2o" package in R.
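The final combining step can be sketched as a non-negative weighted average of base-learner probabilities. In the actual super learner the weights come from fitting the non-negative GLM on cross-validated base-learner predictions; the probabilities and weights below are invented for illustration:

```python
def stack_predict(base_preds, weights):
    """Combine base-learner predicted probabilities using non-negative
    weights normalized to sum to one (standing in for the non-negative
    GLM meta-learner)."""
    w = [max(0.0, x) for x in weights]   # clip any negative coefficients
    total = sum(w)
    w = [x / total for x in w]           # normalize to a convex combination
    n = len(base_preds[0])
    return [sum(wi * preds[i] for wi, preds in zip(w, base_preds))
            for i in range(n)]

# Hypothetical PTB probabilities from three base learners for two pregnancies.
base = [[0.2, 0.8],   # e.g., random forest
        [0.4, 0.6],   # e.g., gradient boosting
        [0.3, 0.9]]   # e.g., GLM
ensemble = stack_predict(base, [0.5, 0.2, 0.3])  # meta-learner weights
```

The non-negativity constraint keeps the ensemble interpretable: each base learner can only pull the prediction toward its own estimate, never away from it.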
The DNN, a type of deep learning model, is a multi-layer neural network that learns from data sets with numeric feature dimensions (Chollet and Allaire, 2018). DNNs have widespread applications in image classification and voice recognition (Moreira et al., 2018; Chen and Xu, 2020). The feed-forward DNN we used here consists of densely connected layers in which the inputs feed each successive layer, which in turn affects the final output layer. To build the feed-forward DNN, we defined a network architecture with 4 hidden layers of 16 to 128 nodes each, followed by an output layer with 2 nodes. Each hidden layer is activated by a rectified linear unit (ReLU) function, which takes the summed weighted inputs from the previous layer and outputs 0 (not fire) or a value >0 (fire) if there is enough signal; we used the sigmoid activation function for the output layer. A binary cross-entropy loss function and a Keras optimizer were used to assess the DNN accuracy and automatically adjust the weights across all node connections to improve overall predictive accuracy. The specific hyper-parameter settings are shown in Supplementary Table S3.
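The feed-forward mechanics described above can be shown with a toy forward pass in pure Python. The study's actual network had 4 hidden layers and was built in Keras; the sizes and weights here are arbitrary illustrations, not trained values:

```python
import math

def relu(v):
    """ReLU activation: fires (>0) only with enough summed signal."""
    return [max(0.0, x) for x in v]

def sigmoid(x):
    """Map the final score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One densely connected layer: summed weighted inputs plus bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Toy network: 3 inputs -> 2-node ReLU hidden layer -> 1 sigmoid output.
x = [0.5, -1.2, 0.3]  # three standardized maternal features
h = relu(dense(x, [[0.4, 0.1, -0.2], [-0.3, 0.2, 0.5]], [0.1, 0.0]))
p = sigmoid(dense(h, [[0.7, -0.6]], [0.0])[0])  # predicted probability of PTB
```

Training then amounts to adjusting the weight matrices so that the binary cross-entropy between p and the observed outcome decreases, which Keras automates via backpropagation.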
Logistic regression (LR) is one of the most common statistical models for predicting the probabilities of binary responses. Its parameters are estimated by maximum likelihood.

Feature selection
Four ML algorithms were applied to construct 49-feature prediction models, from which permutation-based feature importance scores were calculated to generate feature ranking lists (Altmann et al., 2010). The final ranking of each feature was obtained by averaging its rank across the four models. The discrimination performance of the 49-feature prediction models was validated in the validation data set. To achieve a model with the fewest predictors and the best predictive power, the number of candidate predictors was selected according to top 10%, top 25%, and top 50% rules based on the final ranking list. The 49-feature prediction models used for feature selection, as well as the final prediction models built under the different feature selection rules using the training data set, were all tuned by a random grid search over hyper-parameters (Supplementary Table S3).
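The rank-averaging step can be sketched as follows; the per-model rankings below are hypothetical, using feature names from this study only as labels:

```python
def average_rank(importance_lists):
    """Combine per-model permutation-importance rankings by averaging
    each feature's rank across models (lower average = more important).
    Each list is ordered from most to least important and must contain
    the same features."""
    total = {f: 0 for f in importance_lists[0]}
    for ranking in importance_lists:
        for rank, feature in enumerate(ranking, start=1):
            total[feature] += rank
    k = len(importance_lists)
    avg = {f: total[f] / k for f in total}
    return sorted(avg, key=avg.get)  # consensus ranking

# Hypothetical rankings from the four full-feature models.
rankings = [
    ["SFH3", "diffMAC23", "DBP3", "FBG"],
    ["diffMAC23", "SFH3", "FBG", "DBP3"],
    ["SFH3", "DBP3", "diffMAC23", "FBG"],
    ["SFH3", "diffMAC23", "FBG", "DBP3"],
]
final = average_rank(rankings)
```

Taking the top 10%, 25%, or 50% of the consensus list then yields the candidate predictor sets for each selection rule.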

Predictive modeling
Under each rule for selecting the number of candidate predictors, five additional predictors (maternal weight and height at registration, parity, maternal age, and neonatal sex) were added to the prediction models if they were not already in the predictor list, considering their crucial contributions to perinatal health (Gardosi et al., 2018). Three sequential prediction models were developed to discriminate preterm from term birth, according to stage of pregnancy: early pregnancy models aimed to predict the risk of PTB with data available before 18 weeks; middle pregnancy models evaluated the risk with data available before 26 weeks; and late pregnancy models assessed the risk of PTB after 26 weeks. Five-fold CV accuracy in the balanced training data set was used to assess model performance for internal validation.

Performance assessment and interpretation
For each ML predictive model developed with the balanced training set, we assessed the AUC value, accuracy (Acc) (Equation 1), sensitivity (Sen) (Equation 2), and specificity (Spec) (Equation 3) using optimal threshold values in the validation data set. The optimal threshold value on the receiver operating characteristic (ROC) curve was the point closest to a true positive rate of 1 and a false positive rate of 0. The Acc, Sen, and Spec are computed as follows: Acc = (TP + TN) / (TP + TN + FP + FN) (Equation 1); Sen = TP / (TP + FN) (Equation 2); Spec = TN / (TN + FP) (Equation 3). TP and FN refer to the numbers of true positives (PTB classified as PTB) and false negatives (PTB classified as term birth), respectively. TN and FP refer to the numbers of true negatives (term births classified as term births) and false positives (term births classified as PTB), respectively.
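These definitions translate directly into code. The confusion-matrix counts and ROC points below are hypothetical, chosen only to illustrate the computation:

```python
def metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)  # Equation 1
    sen = tp / (tp + fn)                   # Equation 2
    spec = tn / (tn + fp)                  # Equation 3
    return acc, sen, spec

def optimal_threshold(points):
    """Pick the threshold whose ROC point is closest (in squared
    Euclidean distance) to the ideal corner (FPR=0, TPR=1).
    points: list of (threshold, fpr, tpr)."""
    return min(points, key=lambda p: p[1] ** 2 + (1 - p[2]) ** 2)[0]

# Hypothetical counts for a rare-outcome test set (~4% preterm).
acc, sen, spec = metrics(tp=44, fn=50, tn=1840, fp=380)

# Hypothetical ROC points: the first lies closest to the (0, 1) corner.
thr = optimal_threshold([(0.2, 0.40, 0.90), (0.5, 0.17, 0.47), (0.8, 0.05, 0.20)])
```

With imbalanced outcomes, accuracy alone is misleading (predicting "term" for everyone already scores ~96% here), which is why sensitivity and specificity are reported at the chosen threshold.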
We selected the best-fitting model as the one with the highest AUC value, or the highest accuracy in cases of similar AUC values (±0.02), in the testing data set. Calibration curves were plotted for the final optimal predictive model, developed with the balanced training set, to show predicted vs. observed outcomes in the testing set.
To interpret the best-fitting model, we applied Shapley Additive Explanations (SHAP) values to evaluate each predictive feature using the testing data set (Williamson and Feng, 2020). As a tool for visualizing the effect of individual features on the model results, SHAP values enable clinical practitioners to distinguish the key factors contributing to the risk of disease. Odds ratios (OR) and 95% confidence intervals (CI) were calculated using univariate logistic regression in the cohort of 22,603 pregnancies to indicate the association of the predictors derived from the best-fitting model with PTB.

Software used
Data pre-processing was conducted in R (version 4.3.1). DNN models were run in Python (version 3.1, using the "tensorflow" and "keras" packages), while the other ML models were run in R (version 3.6.1; "h2o" package for Random Forest and the Stacked Model, "catboost" package for CatBoost, "e1071" package for SVM). Cross validation was performed with the "caret" package, and SHAP values were calculated using the "fastshap" package. ROC curves were plotted with the "pROC" and "ROCR" packages.

Maternal characteristics at registration
After matching registration data, antenatal visit data, and birth data, 225,523 singleton pregnancies had health records in all three data sources (Supplementary Figure S2). After removing pregnancies with missing values for the features of interest (n = 202,873) and births with abnormal birthweight (n = 47), 22,603 singleton pregnancies with 946 (4.2%) PTBs were retained. Table 1 shows the maternal features, including demographics, laboratory tests, and clinical history, collected at registration.

Maternal measurements during antenatal care visits
Systolic blood pressure (SBP) in middle and late pregnancy was higher in the PTB group than in the full-term group (all P < 0.001), with larger increases in SBP during pregnancy (diffSBP13, 4.0 vs. 3.0, P = 0.005) (Table 1). Diastolic blood pressure (DBP) at all three gestational intervals was higher in the PTB group than in the full-term group (all P < 0.01), with larger increases in DBP during pregnancy (diffDBP13, 1.45 vs. 1.00, P = 0.007). The increases in symphysis fundal height (SFH) and maternal abdominal circumference (MAC) in late pregnancy were greater in women who delivered at term than in those who delivered preterm (all P < 0.001) (Table 1). The maternal characteristics of the PTB group and the full-term group in the training set balanced by hybrid resampling or under-sampling are shown in Supplementary Table S5.

Feature selection
Supplementary Table S6 presents the performance of the four ML algorithms with all 49 features using the balanced training data set. The highest AUC values were achieved by the CatBoost model developed on the training data set balanced with hybrid resampling (AUC: 0.679), and by the Stacked Model developed on the training data set balanced with under-sampling (AUC: 0.692). Supplementary Tables S7, S8 list the permutation-based feature importance with hybrid resampling and under-sampling, respectively. The top five features derived from models using the training data set balanced by hybrid resampling were SFH3, diffMAC23, diffSFH23, DBP3, and FBG. The top five features derived from models using the training data set balanced by under-sampling were SFH3, diffSFH23, diffMAC23, neonatal sex, and hemoglobin. The predictors of PTB selected according to the different inclusion rules were combined with the five additional features (maternal weight and height at registration, maternal age, neonatal sex, and parity) to develop the prediction models (Supplementary Table S9).

Model performance
Under the different rules (top 10%, top 25%, top 50%) for selecting the number of included predictors, early, middle, and late pregnancy models were constructed using the predictors available at the corresponding gestational intervals (Table 2, Supplementary Table S10). The ROC curves for all models in the training and validation data sets are illustrated in Figure 2 and Supplementary Figures S3-S5. Among all predictive models, the late pregnancy models performed best, with the highest AUC achieved by the CatBoost model (AUC: 0.703, 95% CI: 0.672, 0.733; accuracy: 0.811) with predictors selected by the top 50% rule (Figure 3, Table 2). The hyper-parameter settings for the best-fitting CatBoost model are shown in Supplementary Table S11. All models based on early and middle pregnancy predictors performed less well, irrespective of the rule by which the predictors were selected. Among the prediction models with AUC values over 0.680, the highest sensitivity was achieved by the LR-based late pregnancy model (Sen: 0.617) with predictors selected by the top 10% rule. The calibration curves for the best-fitting late pregnancy CatBoost models using either resampling method are shown in Supplementary Figure S6. SHAP values represent the contribution of each included predictor to an individual's prediction of PTB. SHAP values based on the CatBoost late pregnancy model with predictors selected by the top 50% rule, with median absolute SHAP values used to rank feature importance, are shown in Figure 4. The ten highest-impact features were DBP3 (median absolute SHAP value: 0.075), diffSFH23 (0.066), diffMAC23 (0.065), diffWEIGHT23 (0.050), AST (0.032), SFH3 (0.030), length of menstrual cycle in days (0.026), diffWEIGHT13 (0.021), number of antenatal visits before 24 weeks (0.019), and diffSBP13 (0.016) (Supplementary Table S12). Among the ten highest-impact predictors, the level of DBP in the late pregnancy period (DBP3), AST level at registration, 
more than 4 antenatal visits before 24 weeks, and the increase in SBP over the whole pregnancy (diffSBP13) were associated with increased risk of PTB, whereas higher values of SFH3, diffSFH23, diffMAC23, diffWEIGHT23, and diffWEIGHT13 were associated with decreased risk of PTB (Supplementary Table S13).

Discussion
This study used ML algorithms to establish prediction models for PTB across three gestational intervals, applying feature selection that synthesized multiple rankings of feature importance derived from the ML models. We developed a CatBoost-based model with predictors from routine antenatal care available after 26 completed weeks, achieving an accuracy of over 0.8. Adding features strongly associated with preterm birth, such as previous PTB history, pregnancy hypertension syndromes, gestational diabetes, and ultrasound measurements including cervical length, into the current best-fitting model is likely to further improve it and could lead to better evaluation of the risk of PTB during pregnancy. The high-impact predictors found by our study could feed future, more efficient ML algorithms to achieve better predictive power. In addition, our findings indicate that obstetric doctors should particularly monitor the frequency of antenatal visits before 24 weeks, and changes in maternal blood pressure, weight, symphysis fundal height, and abdominal circumference in late pregnancy.
We used electronic health records to feed the ML models, adding new models for predicting PTB. Compared with previous similar studies, some other prediction models with AUC values >0.80 outperformed the model we developed (AUC: 0.703), but the study designs, predictors, and analysis processes used in those studies are more likely to lead to overestimation of model performance and clinical implications. Arabi Belaghi et al. (2021) proposed a prediction model for the second trimester which achieved an AUC of 0.80 with artificial neural networks, but their predictors were selected by univariate logistic regression instead of state-of-the-art ML algorithms, and their predictor of maternal complications was not defined. Sun et al. developed a prediction model using Random Forest and obtained an AUC value of approximately 0.90, but the non-nested 1:1 case-control design they used could produce substantial overestimation of model performance (Sun et al., 2022). Another study added preconception thyroid-stimulating hormone (TSH) levels into prediction models, obtaining an AUC value of 0.812; however, measurement of preconception TSH is not a routine item of preconception examination, which prevented its inclusion in our models (Sun et al., 2021).
This study found similar performance for predicting PTB at different pregnancy periods across the different ML algorithms. The Stacked Model did not perform better than the individual algorithms. Some important predictors, such as gestational diabetes and previous PTB, were not included, which possibly resulted in loss of model performance. The AUC value of the CatBoost model was the highest among all ML algorithms. CatBoost outperforms earlier state-of-the-art implementations of gradient-boosted decision trees by handling categorical features without converting them into numbers, and by modifying classical gradient boosting to obtain unbiased gradients that mitigate overfitting (Dorogush et al., 2018). Some researchers have reported that CatBoost models achieved outstanding predictive power for gestational diabetes mellitus, suggesting that this algorithm has advantages in the field of neonatal and pregnancy science (Kumar et al., 2022; Zhang and Wang, 2022). Moreover, we found that the predictive power of CatBoost was best after 26 weeks' gestation, with an AUC of 0.70, suggesting that there may be room for providing preventative and therapeutic interventions to reduce the risk of PTB after 26 gestational weeks.
We found that the risk of PTB was associated with more than four antenatal care visits before 24 weeks. Given that the frequency of antenatal care visits before delivery may be a confounding influence, we analyzed the frequency of antenatal visits before 24 weeks' gestation, the minimum gestational age at birth in our study data set, to assess its relationship with PTB. Compared with 2 to 4 antenatal care visits, singleton pregnancies with >4 visits before 24 weeks were associated with increased risk of PTB. Notably, >4 antenatal care visits before 24 weeks' gestation might be attributable to maternal complications or mental health problems (Nath et al., 2017; Kumar and Dhillon, 2021).
We found that the level of AST at registration was associated with PTB. It has been previously reported that serum AST, a hematological measurement used to evaluate liver function, is associated with PTB (Zhuang et al., 2017). A high AST level during pregnancy may reflect specific or non-specific liver disease, indicating a potential risk of abnormal liver function, which is highly related to adverse perinatal outcomes (Liu et al., 2022). A retrospective study that investigated trajectories of AST levels during normal pregnancy found that AST mostly remained unchanged during pregnancy, and indicated that monitoring AST levels during pregnancy could help early recognition and diagnosis of impaired liver function (Ushida et al., 2022).
Our study has a number of strengths. We analyzed over 20,000 singleton pregnancies with complete data to develop ML models to predict preterm birth. The large sample from one region allowed us to develop robust predictive models with less bias, owing to a consistent procedure of data collection. We report the whole process of data pre-processing (data cleaning, splitting, and resampling) and data engineering to reduce statistical bias and improve predictive power. The use of multiple ML algorithms for feature importance ranking and feature selection is a distinctive strength of this study, compared with most studies that used a single ML algorithm or univariate correlation to perform feature selection (Koivu and Sairanen, 2020; Arabi Belaghi et al., 2021; Speiser, 2021; Zhao et al., 2021). To our knowledge, we are the first to use multiple ML algorithms to conduct feature selection in this field, drawing on the principle from ensemble modeling that an ensemble approach outperforms any individual model (Dietterich, 2000). The feature importance list is not identical across different ML algorithms, so producing the final ranking by averaging feature importance rankings across multiple algorithms reduces the ranking variance among algorithms. Finally, to our knowledge, this is the first study to use the CatBoost algorithm to predict PTB, and CatBoost performed better overall than the other ML algorithms and logistic regression.
Our study has some limitations. Firstly, the predictive ability of the optimal CatBoost model was modest, with an AUC value only slightly over 0.70, mainly due to the lack of some important predictors such as previous PTB history and gestational diabetes mellitus. Second, the indices collected from maternal blood and urine tests at registration had limited power to predict PTB, apart from AST. Third, we divided the pregnancy period before 37 weeks into three time intervals, which limited the ability of the DNN to achieve its maximum predictive performance (Zhang et al., 2022). Fourth, we did not have information on the length of the uterine cervix, which is a known predictor of PTB. Fifth, although we used hybrid and under-sampling methods on the training data set to improve model performance, we did not balance the validation and testing sets used to assess model performance, as some previous studies did (Nieto-Del-Amor et al., 2022; Kyparissidis Kokkinidis et al., 2023). Finally, there may have been misclassification and selection bias in our electronic health record-based study. However, preterm birth was defined according to the gestational age determined by ultrasound at the first antenatal visit, limiting the potential for outcome misclassification.

Conclusion
The CatBoost-based PTB prediction model is a promising tool to support physicians' decision making in clinical practice, including decisions regarding referral to a preterm birth clinic, ultrasound assessment of cervical length, and administration of preventative interventions such as progesterone. The number of antenatal care visits before 24 weeks' gestation, AST at registration, symphysis fundal height, maternal weight, abdominal circumference, and blood pressure were identified as strong predictors after 26 completed weeks. The model may be improved and developed further with additional strong predictors.

FIGURE 1
Diagram of the statistical methodology. ML, machine learning; CV, cross validation; gw, gestational weeks.

FIGURE 2
Receiver operating characteristic curves of six prediction models, developed using the training data set balanced by hybrid resampling, validated in the testing data set. DNN, deep neural networks; SVM, support vector machine.

FIGURE 3
Bar plots comparing area under the receiver operating characteristic curve (AUC) values across all prediction models, stratified by different rules for selecting the number of predictors and by pregnancy interval. (A) Early pregnancy, (B) Middle pregnancy, (C) Late pregnancy. The horizontal red dotted line indicates a reference AUC value.

FIGURE 4
CatBoost-based SHAP values of the late pregnancy model using the top 50% predictors in the testing data set. (A) SHAP values for individuals in the testing data set, plotted for each predictor. (B) Bar plot of the median absolute SHAP values for each predictor. SHAP, Shapley additive explanations.