Predictors of micronutrient deficiency among children aged 6–23 months in Ethiopia: a machine learning approach

Introduction Micronutrient (MN) deficiencies are a major public health problem in developing countries, including Ethiopia, and contribute to childhood morbidity and mortality. Effective implementation of programs aimed at reducing MN deficiencies requires an understanding of the important drivers of suboptimal MN intake. Therefore, this study aimed to identify important predictors of MN deficiency among children aged 6–23 months in Ethiopia using machine learning algorithms. Methods This study employed data from the 2019 Ethiopia Mini Demographic and Health Survey (2019 EMDHS) and included a sample of 1,455 children aged 6–23 months for analysis. Machine learning (ML) methods, including Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Neural Network (NN), and Naïve Bayes (NB), were used to prioritize risk factors for MN deficiency prediction. Performance metrics including accuracy, sensitivity, specificity, and the Area Under the Receiver Operating Characteristic curve (AUROC) were used to evaluate model prediction performance. Results The RF model performed best in predicting child MN deficiency, with an AUROC of 80.01% and an accuracy of 72.41% on the test data. The RF algorithm identified the eastern region of Ethiopia, poorest wealth index, no maternal education, lack of media exposure, home delivery, and younger child age as the top risk factors, in order of importance, for MN deficiency prediction. Conclusion The RF algorithm outperformed the other ML algorithms in predicting child MN deficiency in Ethiopia. Based on the findings of this study, improving women’s education, increasing exposure to mass media, introducing MN-rich foods in early childhood, enhancing access to health services, and targeting interventions to the eastern region are recommended to reduce child MN deficiency.


Introduction
Micronutrient (MN) deficiencies are a major public health problem around the world, contributing to childhood morbidity and mortality. The burden of this problem is disproportionately high in low- and middle-income countries, particularly in Sub-Saharan Africa, including Ethiopia (1, 2). MN deficiencies mainly occur when people lack access to MN-rich foods such as fruits, vegetables, animal products, and fortified foods. MN deficiencies lower immune capabilities and increase the overall risk of infection-related mortality, particularly from diarrhea, measles, malaria, and pneumonia, which are among the world's top ten leading causes of death (1, 3). MNs are required only in small amounts; however, their lack in the diet has a severe impact on the survival and development of children. Furthermore, MN deficiency contributes to stunting, wasting, weakened immunity, and delays in cognitive development (1, 3, 4).
Vitamin A (VA) and iron are essential micronutrients that are crucial for the growth and development of children, and their deficiency is a significant public health problem in children (5). Iron deficiency is a primary cause of anemia and has serious health consequences for both women and children. VA plays an important role in maintaining the epithelial tissue in the body. Severe VA deficiency causes eye damage and is the leading cause of preventable childhood blindness. Moreover, VA deficiency increases the severity of infections such as measles and diarrheal disease in children and slows recovery from illness. It is common in dry environments where fresh fruits and vegetables are not readily available (3).
According to the 2019 United Nations Children's Fund report, 340 million children globally suffered from hidden hunger as a result of MN deficiency (6). In Africa, less than one-third and one-half of children aged between 6 and 23 months met the minimum criteria for dietary diversity and meal frequency, respectively. According to the 2019 Ethiopian Mini Demographic Health Survey (EMDHS) report, the consumption of foods rich in VA and iron, which are the major MN deficiency indicators, remains low among young children in Ethiopia. Thirty-nine percent of children aged 6-23 months consumed foods rich in VA during the 24 h before the interview, whereas 24% consumed iron-rich foods (3).
Empirical studies have identified several factors associated with insufficient minimum dietary diversity, including limited access to media such as newspapers, magazines, and radio; lower education level of fathers; fewer antenatal care visits; younger child age; working in agriculture; and poorest household wealth index (1, 4, 6-8). However, the typical logistic and multilevel models employed in these studies were unable to identify the most important predictors. Identifying predictors of MN deficiency and taking corrective action are critical in reducing MN deficiency. Prioritizing predictors based on their contribution to predicting MN deficiency would be cost effective and simple to implement but has not yet been considered. Machine learning (ML) algorithms, which lie at the intersection of statistical learning and artificial intelligence research, are used to explore large amounts of data to discover unknown patterns or relationships and to show the share of each predictor for a particular problem (9, 10). In addition, ML helps to develop predictive models and to select the most important predictors.
Hence, ML algorithms are well suited to addressing these statistical modeling issues. These models have demonstrated high performance in solving classification problems compared to the conventional statistical models applied to select the most important predictors. The availability of diverse alternative models from which the best-fitting predictive model can be selected is one of the most important features behind the use of ML algorithms. The five widely used ML models considered in this study are Support Vector Machine (SVM), Logistic Regression (LR), Neural Network (NN), Random Forest (RF), and Naïve Bayes (NB) (9-14).
The most significant predictors of MN deficiency were determined after evaluating these multiple models and choosing the model that best fit the data under consideration in this study. This enables health professionals, policy designers and implementers, and those delivering interventions against MN deficiency to concentrate their efforts on the most reliable predictors and take corrective actions. To the best of our knowledge, no previous study has used ML modeling to determine the factors that predict MN deficiency in Ethiopia and other East African nations. The main objective of this study was to identify the most important predictors of childhood MN deficiency in Ethiopia by evaluating various ML algorithms and selecting the one that most accurately and efficiently predicts micronutrient deficiency.

Data source and sampling procedure
This analysis used data from the Ethiopia Mini Demographic and Health Survey (EMDHS), a nationally representative, cross-sectional, household-based survey conducted in Ethiopia in 2019. The data collection used a two-stage cluster sampling design with stratification into urban and rural areas. Twenty-one sampling strata were obtained after stratifying each region into urban and rural areas. In the first stage, 305 Enumeration Areas (EAs) (93 urban EAs and 212 rural EAs) were chosen with a probability proportional to the EA size in each stratum. In the second stage, 30 households were randomly selected from each EA using an equal probability method from the fresh list of households, resulting in a total of 8,663 households with 1,463 children aged 6-23 months (3).
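The two-stage design described above can be illustrated with a minimal sketch. It assumes systematic probability-proportional-to-size (PPS) selection in the first stage, which is one common way to implement PPS; the source states only that EAs were chosen with probability proportional to size. The EA sizes, the seed, and the function name `pps_systematic_sample` are hypothetical.

```python
import random

def pps_systematic_sample(sizes, n_select, rng):
    """Select n_select units with probability proportional to size (systematic PPS).

    A unit whose size exceeds the sampling step can be selected more than once.
    """
    total = sum(sizes)
    step = total / n_select
    start = rng.random() * step                      # random start in [0, step)
    targets = [start + k * step for k in range(n_select)]
    chosen, cumulative, idx = [], 0, 0
    for unit, size in enumerate(sizes):
        cumulative += size
        # Every target that falls inside this unit's cumulative range selects it.
        while idx < n_select and targets[idx] < cumulative:
            chosen.append(unit)
            idx += 1
    return chosen

rng = random.Random(2019)                            # hypothetical seed
# Hypothetical household counts for 20 enumeration areas in one stratum.
ea_sizes = [rng.randint(100, 300) for _ in range(20)]

# Stage 1: choose 5 EAs with probability proportional to size.
selected_eas = pps_systematic_sample(ea_sizes, 5, rng)

# Stage 2: choose 30 households per selected EA with equal probability.
households = {ea: rng.sample(range(ea_sizes[ea]), 30) for ea in selected_eas}
```

Because larger EAs are more likely to be chosen in the first stage while a fixed number of households is taken in the second, designs of this kind are usually analysed with sampling weights.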

Outcome variable
The outcome variable in this study was the MN deficiency status of children aged 6-23 months, which was derived from the MN intake status reported by respondents. It was computed from the VA-rich and iron-rich foods consumed by children aged 6-23 months in the 24 h prior to data collection. We classified children's MN deficiency status into two groups: "Yes" if the respondent reported that the child did not consume any of the minimum recommended MNs, and "No" if the child had consumed at least one of the minimum recommended MNs (1).
A child was classified as MN deficient in VA if he or she had not consumed any of the seven VA-rich foods in the 24 h prior to data collection. The seven VA-rich foods include: i. eggs; ii. meat (beef, pork, lamb, or chicken); iii. pumpkin, carrots, and squash; iv. any dark green leafy vegetables; v. mangoes, papayas, and other fruits
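The classification rule above can be written as a short function. This is a minimal sketch that assumes two boolean indicators per child (any VA-rich food and any iron-rich food in the previous 24 h); the function and argument names are hypothetical and do not correspond to EMDHS variable names.

```python
def mn_deficiency_status(consumed_va_rich, consumed_iron_rich):
    """Return "Yes" (MN deficient) when neither food group was consumed, else "No"."""
    # "No" as soon as at least one of the recommended MN sources was consumed.
    return "No" if (consumed_va_rich or consumed_iron_rich) else "Yes"

# Hypothetical children: (ate VA-rich food, ate iron-rich food) in the last 24 h.
examples = [(False, False), (True, False), (False, True), (True, True)]
statuses = [mn_deficiency_status(va, iron) for va, iron in examples]
```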

Predictors in the model
The MN deficiency predictor variables or features included in the models were child age in months, age of mothers, number of children under five, mother's education, antenatal care (ANC) visit, postnatal care (PNC) visit, health check after delivery, place of delivery, current pregnancy status, currently breastfeeding, wealth index, region, place of residence, and media exposure (see details in Table 1). Moreover, the administrative region shapefiles were used to investigate the spatial variation in the prevalence of child MN deficiency.

Feature selection
Feature selection is a critical step in predicting and interpreting high-dimensional datasets. We employed Recursive Feature Elimination (RFE), a wrapper-based feature selection technique that selects the most relevant features for a given ML model by recursively removing features from the dataset and retraining the model on the remaining features until the desired number of features is reached (15). RFE is a valuable tool for identifying the most important features of MN deficiency in children and for improving the predictive power of our ML models. The ML algorithms were then applied to the selected features to determine their predictive power and to identify the most important determinants of child MN deficiency.

Machine learning methods
The Machine Learning (ML) methods used in this study were SVM, LR, NN, RF, and NB. ML models have been used to rank relevant predictors of MN deficiency and to identify important predictors of health outcomes and other variables of interest.
We used the R programming language (version 4.2.2) and the R packages sf (16), caret (17), and pROC (18) for data preprocessing and analysis. The performance of the ML algorithms was evaluated using metrics such as accuracy and the Area Under the Receiver Operating Characteristic curve (AUROC).
In this study, we randomly divided the dataset into two sets: 80% for the training set and 20% for the test set. The training set was used to train the models and the test set was used to evaluate their performance. Standard ML accuracy measures were used to evaluate the prediction power of popular supervised ML algorithms, including SVM (13), LR (11, 14), NN (19), RF (10-14, 20, 21), and NB (19). The ML algorithms were trained with 10-fold cross-validation to tune the models. The overall pipeline of this study is shown in Figure 1, which depicts the ML approach for predicting MN deficiency using EMDHS data. The approach involves several steps, including data collection, preprocessing, data cleaning and encoding, feature selection, building and evaluating ML algorithms, and comparing the performance of different models. The best-performing model was then used to predict MN deficiency. Following this approach, this study aimed to develop accurate and reliable predictive models that can inform public health policies and promote child development in Ethiopia.
Support Vector Machine (SVM) is a supervised ML model used for regression and classification that creates a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. The objective is to maximize the margin between the separating hyperplane and the nearest training points of each class, known as support vectors. The best separation boundary is the hyperplane with the largest available margin. When the classes are not linearly separable in the original space, kernel functions map the data into a higher-dimensional space where a linear separation becomes possible. SVM can therefore handle non-linear classification tasks and, because of its generalization capabilities, performs well on complex problems with limited training data (22).
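The splitting scheme described above can be sketched with plain index bookkeeping. This is an illustration only: the study used R with the caret package, whose partitioning routine may produce slightly different set sizes, and the seed and function names here are hypothetical.

```python
import random

def train_test_split_indices(n, test_fraction, seed):
    """Shuffle row indices and split them into training and test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k):
    """Return k (training, validation) pairs; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        pairs.append((training, folds[i]))
    return pairs

# 1,455 children, 80% training and 20% test, then 10-fold CV inside the training set.
train_idx, test_idx = train_test_split_indices(1455, 0.20, seed=1)
cv_pairs = k_fold_indices(train_idx, 10)
```

The test indices are never used during cross-validation, so the test-set metrics remain an independent check on the tuned models.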
Logistic Regression (LR) is a statistical machine learning algorithm for binary classification problems that models the probability of an input data point belonging to a particular class. LR applies a logistic sigmoid function to the weighted sum of input predictors to estimate the probabilities, then thresholds the output to make a binary prediction. Moreover, it assumes a linear relationship between the log-odds of the outcome and the input predictors and can handle numerous predictor variables. It does not require linear relationships between dependent and independent variables, and penalization can control overfitting. The interpretability of model coefficients and probabilities makes logistic regression a popular starting classifier for machine learning applications involving binary prediction (23, 24).
Random Forest (RF) is a popular supervised ML algorithm used to solve classification and regression problems. It builds decision trees from randomly chosen data samples, obtains a prediction from each tree, and uses a majority vote to determine the final prediction. RF also ranks the importance of each predictor using the mean decrease in accuracy (24-26).
The Neural Network (NN), also known as an Artificial Neural Network (ANN), is an ML model that uses a network of functions to recognize and translate a data input of one form into a desired output. The notion of NN was based on human biology and the way neurons work together in the human brain to understand information from the senses. NNs learn from labeled training data by adjusting the connection weights between layers of simple processing units, which enables them to model complex nonlinear relationships for applications in prediction, classification, and clustering (24, 27).
Naive Bayes (NB) is a supervised ML classifier based on Bayes' theorem with an assumption of independence between the features. This assumption simplifies the computation needed to estimate the likelihood and posterior probability, making NB a fast, scalable classifier that tends to perform well on a variety of data despite its simplicity and restrictive assumptions (28).

Model performance evaluation
Different model performance metrics, including precision, recall or sensitivity, specificity, accuracy, F1 score, Receiver Operating Characteristics (ROC) curves, and ROC Area Under the Curve (ROC AUC) scores, were used to compare the performance of ML models or classifiers (24,29).
A confusion matrix for binary classification is a two-by-two matrix that displays the counts of True Positives (TP), False Negatives (FN), False Positives (FP), and True Negatives (TN) resulting from the predicted classes of data. From the confusion matrix, we can calculate various performance metrics such as recall (or sensitivity), which measures the proportion of positive samples that are correctly identified by the model:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Precision, also called positive predictive value (PPV), measures how many of the samples predicted as positive are actually positive:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Specificity is another performance metric used in binary classification that measures the proportion of negative samples that are correctly identified by the model, that is, the ability of the model to correctly predict negative samples as negative:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
Accuracy is a commonly used performance metric in binary classification that measures the proportion of samples that are correctly classified by the model out of all the samples it has predicted. It is calculated as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
The F1 score is the harmonic mean of precision and recall:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The Receiver Operating Characteristic (ROC) curve is another standard tool used with binary classifiers, which plots sensitivity versus (1 − specificity). Measuring the Area Under the Curve (AUC) is one method of comparing classifiers. AUC provides an aggregated value that can be read as the probability that the classifier assigns a higher score to a randomly chosen positive sample than to a randomly chosen negative one. The better the classifier, the more closely the ROC curve will hug the top left corner (24, 30).
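As a minimal sketch, the metrics described in this section can be computed directly from the four confusion-matrix counts, and the AUC can be computed from its rank interpretation (the share of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half). The counts and scores below are toy values, not results from the study, and the function names are hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)                 # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)              # positive predictive value
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

toy_metrics = classification_metrics(tp=8, fn=2, fp=4, tn=6)
toy_auc = auc(scores=[0.9, 0.8, 0.4, 0.3], labels=[1, 0, 1, 0])
```

The pairwise AUC is quadratic in the sample size, which is acceptable for illustration; dedicated packages such as pROC use faster algorithms.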

Descriptive results
Data from 1,455 children aged 6 to 23 months were included in the analysis to assess the MN deficiency status in Ethiopia. Overall, 62.1% of them had not received any of the minimum recommended micronutrients and were therefore MN deficient. According to Table 2, the prevalence of MN deficiency was significantly higher among children whose mothers had no education (70.53%) compared to those with higher education (36.53%).
The prevalence of MN deficiency decreases as the child's age increases, with the lowest percentage of deficiency found in the 18-23 month age group (47.97%). MN deficiency is also significantly more prevalent among children whose mothers have no media exposure (67.95%) compared to those with media exposure (47.43%). The results also suggest that as the wealth quintile increases, the prevalence of MN deficiency decreases, with the lowest percentage of deficiency found in the richest wealth quintile (47.12%) and the highest in the poorest (80.3%). The prevalence of MN deficiency also varies widely across regions, with the highest percentage of deficiency found in the Somali region (98.20%) and the lowest in the Gambela region (42.94%) (Table 2).
According to Table 2, children whose mothers did not attend any ANC visits were more likely to have a MN deficiency (73.49%), compared to children whose mothers attended 1-3 ANC visits (60.70%) or 4 or more visits (53.85%). Additionally, households with three or more children are more likely to experience a MN deficiency (78.84%) than households with one or two children (59.81 and 57.3%, respectively).

Spatial distribution of childhood MN deficiency
As shown in Figure 2, childhood MN deficiency was most prevalent in the Somali, Afar, and Amhara regions, while Gambela, Addis Ababa, and Southern Nations, Nationalities, and Peoples (SNNP) were the least affected regions. The findings suggest that the eastern part of Ethiopia, which includes the Somali and Afar regions, and the Amhara region were severely affected by MN deficiency.

Predictive algorithms for child micronutrient deficiency
The Recursive Feature Elimination (RFE) method was used to identify the features required to develop the ML algorithms on the training dataset. The results showed that RF had the highest accuracy, 72.41% (95% CI: 66.89, 77.48), indicating its ability to correctly classify positive and negative cases. RF also achieved an AUROC of 80.01%, suggesting good discriminative ability in distinguishing between positive and negative cases. The negative predictive value (NPV) of RF was 69.23%, meaning that 69.23% of the children it predicted as not MN deficient were in fact not deficient. Additionally, the F1 score of RF was 79.59%, indicating a balanced performance in terms of precision and recall, while NN had a slightly lower AUROC (79.84%) and accuracy (71.03%) compared to RF. Moreover, RF had the highest sensitivity (86.67%), meaning that 86.67% of the children who were actually MN deficient were correctly identified by the model. Logistic regression, fitted as a Generalized Linear Model (GLM), had a slightly lower accuracy (70.69%) than RF, NN, and SVM and a relatively high AUROC of 79.53%, third after RF and NN. However, its sensitivity of 80% was lower than those of RF and SVM. Finally, RF had the highest AUROC (80.01%), whereas NB had the lowest (78.18%) (Figure 3). RF, NN, and SVM were, in that order, the top-performing algorithms in terms of accuracy (Table 3). Thus, among all the algorithms used in our investigation, the RF algorithm performed best in predicting the MN-deficient status of the cases, as evidenced by the performance measures.
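As a worked check, the RF metrics reported above are mutually consistent with a single confusion matrix. The counts below are a reconstruction, not values reported in the source: they assume a test set of 290 children (roughly 20% of 1,455), of whom 180 are MN deficient, and were chosen so that the reported accuracy, sensitivity, NPV, and F1 score are all reproduced.

```python
# Assumed (reconstructed) RF confusion matrix on the test data; not reported in the source.
tp, fn = 156, 24      # MN-deficient children: correctly flagged / missed
fp, tn = 56, 54       # non-deficient children: wrongly flagged / correctly cleared

n_test = tp + fn + fp + tn
accuracy = (tp + tn) / n_test
sensitivity = tp / (tp + fn)
npv = tn / (tn + fn)                       # negative predictive value
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

# Express as percentages rounded to two decimals, as in the text.
reported_style = {name: round(100 * value, 2) for name, value in
                  [("accuracy", accuracy), ("sensitivity", sensitivity),
                   ("npv", npv), ("f1", f1)]}
```

Under these assumed counts the four values round to 72.41, 86.67, 69.23, and 79.59, matching the figures quoted for RF.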

The important predictors of micronutrient deficiency
The model evaluation findings, as discussed above, demonstrated that the random forest classifier was the best classifier in terms of accuracy and area under the receiver operating characteristic (AUROC) curve. Based on the most accurate classifier (RF), the top important predictors are presented according to their mean decrease accuracy (MDA) (Figure 4). Among the proposed predictors, the Somali region, the poorest wealth index, no maternal education, no media exposure, home delivery, the Afar region, and children aged 6-8 months were the top predictors, in order of importance, for MN deficiency among children aged 6-23 months in Ethiopia.
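The idea behind mean decrease accuracy can be sketched as a permutation test: shuffle one predictor, re-score the model, and record how much accuracy falls. This is a simplified illustration on a fixed toy classifier; random forest implementations typically compute MDA per tree on out-of-bag samples, which this sketch does not reproduce. The toy model, the feature names `poorest` and `noise`, and the data are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def mean_decrease_accuracy(model, rows, labels, feature, n_repeats, rng):
    """Average drop in accuracy after randomly permuting one feature's values."""
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [{**r, feature: v} for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats

rng = random.Random(0)
# Toy data: the label depends only on "poorest"; "noise" is irrelevant.
rows = [{"poorest": i % 2, "noise": rng.randint(0, 1)} for i in range(50)]
labels = [r["poorest"] for r in rows]

def toy_model(row):
    # Stand-in for a fitted classifier that relies on a single predictor.
    return row["poorest"]

mda_poorest = mean_decrease_accuracy(toy_model, rows, labels, "poorest", 20, rng)
mda_noise = mean_decrease_accuracy(toy_model, rows, labels, "noise", 20, rng)
```

A predictor the model depends on shows a clear accuracy drop when permuted, while an irrelevant one shows none, which is the basis for ranking predictors as in Figure 4.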

Spatial mapping of actual vs. predicted childhood MN deficiency prevalence
Figures 5A,B depict the actual and predicted prevalence of childhood MN deficiency for each region in the test data, respectively. To predict the regional prevalence of MN deficiency, our best predictive model (RF) was employed. Upon visual inspection of the map, we observed that while discrepancies existed for a few regions, the overall patterns of the observed prevalence were consistent with the predicted prevalence of child MN deficiency. This suggests that our predictive model (RF) was reliable and can be used to predict the childhood MN deficiency prevalence in areas where data are lacking.

Classical logistic regression analysis
In contrast to the machine learning models, the traditional logistic regression model provides interpretable odds ratios for each predictor. Based on the results presented in Table 4, the region where the child lives, wealth index, maternal education level, and child age in months were found to be significant predictors of micronutrient deficiency among children aged 6-23 months in Ethiopia. Specifically, children living in the Somali and Afar regions had 31.20 and 4.75 times higher odds of MN deficiency, respectively, compared to children in the SNNP region. Children in the poorest wealth index category had 4.75 times higher odds of micronutrient deficiency compared to children in the richest wealth index category. Moreover, the study found that a lower maternal education level and a younger child age were significantly associated with higher odds of micronutrient deficiency in children. Specifically, no education, primary education, and secondary education in mothers were associated with 2.50, 1.96, and 1.91 times higher odds, respectively, compared to higher education. Children aged 6-11 months had 1.78 times higher odds of MN deficiency compared to those aged 18-23 months (Table 4).
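For readers less familiar with odds ratios, the relation between a logistic regression coefficient and the odds ratios quoted above can be written out as follows. The symbols $\beta_j$ and $x_j$ are generic notation introduced here for illustration; the coefficient values are back-calculated from the reported odds ratios rather than taken from Table 4.

```latex
% Logistic regression models the log-odds of MN deficiency as a linear function:
\log\frac{p}{1-p} = \beta_0 + \sum_{j} \beta_j x_j
% The odds ratio for a one-unit change (or an indicator category) is the
% exponentiated coefficient:
\mathrm{OR}_j = e^{\beta_j}
% Back-calculated examples from the reported odds ratios:
\mathrm{OR}_{\text{Somali}} = 31.20 \;\Rightarrow\; \beta_{\text{Somali}} = \ln 31.20 \approx 3.44
\qquad
\mathrm{OR}_{\text{Afar}} = 4.75 \;\Rightarrow\; \beta_{\text{Afar}} = \ln 4.75 \approx 1.56
```

An odds ratio above 1 therefore corresponds to a positive coefficient, that is, higher odds of MN deficiency relative to the reference category (SNNP for region).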

Discussion
In this study, we found a high prevalence of MN deficiency among children aged 6-23 months in Ethiopia (62.1%). This prevalence is higher than that reported in other studies conducted in East Africa (31), including Ethiopia (1). The difference in results may be explained by the influence of sample size, because the current survey was a mini demographic survey. Moreover, we found strong associations between certain demographic and socio-economic factors and the prevalence of micronutrient deficiency, such as poverty, lack of media exposure, young age, low maternal education, and larger household size. This finding is consistent with other studies in this area (1, 32, 33).
FIGURE 3
ROC curve for machine learning models in predicting childhood micronutrient deficiency. ROC, receiver operating characteristic; AUROC, area under the receiver operating characteristic curve.
The findings of this study also showed considerable variations in MN deficiency among children across Ethiopian regions, as illustrated in the spatial map. MN deficiency is most prevalent in the eastern regions, such as Somali and Afar, and in the Amhara region, but least prevalent in the south-western, southern, and central regions of Gambela, SNNP, and Addis Ababa, respectively. Evidence of similar geographical variability in MN deficiency has been shown (1, 31, 34). These findings highlight the need for targeted interventions that address the specific needs of different population groups in the eastern regions of Ethiopia.
FIGURE 4
Variable importance from random forest.
FIGURE 5
The spatial distribution of the actual (A) and predicted (B) MN deficiency prevalence on the test data. MND, micronutrient deficiency.
In terms of predictive ML algorithms, the random forest algorithm was found to have the highest accuracy and AUROC score for predicting micronutrient deficiency. However, it is worth noting that while the logistic regression algorithm (GLM) had slightly lower accuracy compared to other algorithms such as NN, RF, and SVM, its advantage lies in producing more interpretable results in terms of the predictors estimated in the algorithm. Numerous machine learning (ML) approaches have been applied to health issues, including nutritional status (11, 14, 21, 35), asthma risk prediction (20), and childhood anemia (9). These studies have demonstrated high-quality and valid predictions, highlighting the potential of the ML approach in predicting health outcomes. Findings from the RF classifier reveal that the Somali region, the poorest wealth index, no maternal education, no maternal media exposure, home delivery, the Afar region, and child age of 6-8 months were the top variables, in order of importance, for predicting MN deficiency among children aged 6-23 months in Ethiopia (1, 31, 32).
The findings of this study indicated that the poorest household wealth index was an important predictor of child MN deficiency. This aligns with evidence that poverty and the poorest wealth index status contribute to childhood MN deficiency (31, 33). Children from low-income households often have limited access to nutritious food, which can lead to deficiencies in essential micronutrients. These findings highlight the need for targeted interventions aimed at addressing MN deficiency in low-income households. In addition, this study found that home delivery was an important risk factor for micronutrient deficiency. This suggests that women who give birth at home may not receive the same level of support and education on proper nutrition and infant care that they would receive in a healthcare facility (36).
Likewise, the significance of a child's age in predicting micronutrient deficiency has been well documented in the literature (1, 31, 33), which supports the results of this study. In particular, children aged 6 to 11 months appear to be more vulnerable to micronutrient deficiencies. These findings suggest that there is a strong association between child age and micronutrient deficiency, with younger children being at a higher risk of deficiency. This highlights the importance of early interventions to promote optimal nutrition and prevent micronutrient deficiency in infants and young children in Ethiopia.
Furthermore, the results indicate that a lack of maternal education increases the risk of childhood micronutrient deficiency. Conversely, children of educated women have significantly lower rates of micronutrient deficiency (31, 33). These findings have important implications for addressing child micronutrient deficiency and further emphasize the need to improve women's education in developing countries to promote better outcomes for children's micronutrient status. Moreover, the findings indicate that a lack of parental media exposure is also an important predictor of childhood micronutrient deficiency, which is consistent with previous research conducted in India (35). This indicates that parental access to media can play a significant role in promoting good nutritional outcomes for children.
Additionally, this study compared the spatial variation of the actual and predicted prevalence of MN deficiency using the RF model, and the overall patterns of the observed prevalence were consistent with the predicted prevalence of MN deficiency in children. This suggests that our predictive model (RF) was reliable and can be used to predict the prevalence of childhood MN deficiency in areas where data are lacking.
Moreover, the findings from the best-performing ML model (RF) are largely consistent with the traditional logistic regression analysis. In both analyses, the eastern region where the child lives, the wealth index, maternal education level, and child age in months were found to be significant predictors of micronutrient deficiency among children aged 6-23 months in Ethiopia. However, home delivery and media exposure emerged as important predictors in the ML models but not in conventional logistic regression. This suggests that the ML models may reveal insights beyond traditional logistic regression approaches. Specifically, ML models could identify influential variables for policy decision making that are missed by standard statistical methods (37). While the core findings aligned, ML provided the additional benefit of highlighting potentially crucial MN deficiency factors not captured by traditional logistic regression.

Conclusion
The aim of this study was to evaluate the effectiveness of various ML algorithms and identify the most accurate and efficient algorithm for predicting micronutrient deficiencies. Accuracy and AUROC were used to evaluate the predictive power of the ML algorithms. The random forest algorithm was identified as the best model, achieving an accuracy of 72.41% and an AUROC of 80.01% on the test data. Based on this model, the Somali region, the poorest wealth index, no maternal education, no parental media exposure, home delivery, the Afar region, and child age of 6-8 months were found to be the most important predictors of child MN deficiency, in order of importance. Furthermore, the findings demonstrated considerable regional variations in the prevalence of child MN deficiency, particularly in Ethiopia's eastern region. Although the RF model and the traditional logistic regression model identified largely similar important predictors, the RF model was able to discover some crucial predictors that the conventional logistic regression model had missed. As a result, our model may provide better policy suggestions for children with MN deficiency. These findings underscore the importance of socioeconomic and spatial factors in the occurrence of micronutrient deficiencies among Ethiopian children. Addressing these issues may result in better health outcomes for children aged 6-23 months. The regional variation in the prevalence of MN deficiency emphasizes the need for targeted interventions that account for differences in the prevalence and risk factors of micronutrient deficiencies across different regions in Ethiopia.

FIGURE 1
Flow chart of the machine learning approach.

FIGURE 2
Spatial variations in MN deficiency by administrative regions in Ethiopia, EMDHS, 2019.

TABLE 1
The description of the predictor variables considered in the analysis.

TABLE 2
Weighted prevalence and chi-square statistics of MN deficiency by demographic and other characteristics among children aged 6-23 months in Ethiopia (n = 1,455).

TABLE 3
Model evaluation metrics for all ML models as evaluated on the test data.

TABLE 4
Logistic regression model results for factors associated with child MN deficiency (based on training data).