Feature selection and association rule learning identify risk factors of malnutrition among Ethiopian schoolchildren

Introduction: Previous studies have sought to identify risk factors for malnutrition in populations of schoolchildren, relying on traditional logistic regression methods. However, holistic machine learning (ML) approaches are emerging that may provide a more comprehensive analysis of risk factors.
Methods: This study employed feature selection and association rule learning ML methods in conjunction with logistic regression on epidemiological survey data from 1,036 Ethiopian schoolchildren. We first analyzed the entire dataset and then repeated the analysis on age, residence, and sex population subsets.
Results: Both logistic regression and ML methods identified older childhood age as a significant risk factor, while females and vaccinated individuals showed reduced odds of stunting. Our ML analyses provided additional insights into the data: feature selection identified age, school latrine cleanliness, large family size, and nail trimming habits as significant risk factors for stunting, underweight, and thinness. Association rule learning revealed associations between co-occurring hygiene and socioeconomic variables and malnutrition that were otherwise missed by traditional statistical methods.
Discussion: Our analysis supports the benefit of integrating feature selection methods, association rule learning, and logistic regression to identify comprehensive risk factors associated with malnutrition in young children.


Introduction
Current epidemiological efforts targeted toward low- and middle-income countries focus on the double burden of malnutrition, which is defined as the simultaneous occurrence of underweight and overweight outcomes in a population (1). While undernutrition and overnutrition have historically affected separate populations, their co-occurrence has become more frequent within lower-income populations as economic development and urbanization occur (2-5). Despite increases in overweight and obesity (5), undernutrition remains a significant concern in developing countries, affecting more than 150 million children (6). This emphasizes the need to better understand the factors shaping undernutrition in these developing contexts so that more effective interventions can be established. The WHO uses several metrics to assess malnutrition, describing low height-for-age as stunting, low body-mass index (BMI)-for-age as thinness (7), and low weight-for-age as underweight. Stunting reveals chronic nutritional deficiencies (8), while thinness is an acute form of malnutrition, and underweight is a composite metric that captures both acute and chronic malnutrition. Nutritional deficiencies in early and intermediate childhood have consequences for physical development (9), cognitive performance (10,11), and lifespan (12), resulting in a greater risk of morbidity, mortality, and reproductive hindrance (12-14).
Africa accounts for over one-third of global stunting cases and is one of the few regions where stunting prevalence is not significantly decreasing (6,15). Despite a reduction in malnutrition-related disability-adjusted life years (DALYs), the continent still accounts for the largest share of malnutrition DALYs and has the highest age-standardized death rate from malnutrition (16). Child undernutrition is most highly concentrated in the eastern and southern regions of Africa (17). Ethiopia, located in eastern Africa, has one of the highest prevalences of undernutrition, with an estimated 40% of children stunted and 25% underweight (17). Ethiopia is still struggling to meet the United Nations (UN) Sustainable Development Goal 2 ("Zero Hunger") by 2030 (18), highlighting the need to implement more effective interventions. Therefore, it is crucial to analyze risk factors for undernutrition in this region.
Several previous studies have demonstrated the complexity of the etiology of malnutrition in Ethiopia (19-23). Two systematic reviews and meta-analyses of epidemiological studies identified a combination of environmental, demographic, and socioeconomic factors as significant determinants of malnutrition (24,25). For example, access to nutritious food, sanitation, and hygiene have been identified as important environmental variables; residential area, child age, and sex as important demographic factors; and maternal education status and socioeconomic status as important economic factors (24,25). Despite these insights, there is still a great need for research to further analyze these trends and clarify the mechanisms by which they may be associated with undernutrition (24,25).
Additionally, these studies have all relied solely on logistic regression methods to examine the determinants of undernutrition (24,25). However, traditional logistic regression models alone may struggle to produce accurate findings, since models fitted to large datasets with many variables are prone to overfitting (26). This could be improved by integrating machine learning techniques that use advanced mathematical and statistical methods to discern patterns within a dataset (27). These approaches appear to be a promising alternative to logistic regression since they can help limit overfitting (28,29). ML techniques have previously been used to identify risk factors for parasitic infections, congestive heart failure, diabetes, overweight/obesity, and dementia (30-34). Moreover, several recent studies have used ML approaches to effectively predict undernutrition outcomes in Bangladesh, India, and Ethiopia (35-39). These studies indicate that ML methods such as feature selection may be able to identify risk factors effectively (36). Despite these advances, most studies have not implemented association rule learning, another promising ML method that may facilitate the identification of risk factors. Association rule learning has previously shown promise in predicting disease co-occurrences and risk factors for parasitic infection (34,40). Altogether, these ML methods could identify important risk factors for undernutrition and potentially provide crucial insights into how the co-occurrence of variables may lead to undernutrition.
To the best of our knowledge, no study has used multiple ML techniques in conjunction with logistic regression to investigate risk factors for undernutrition outcomes. Such analysis could improve targeted public health interventions by uncovering novel variables associated with stunting, underweight, or thinness in Ethiopian schoolchildren. In this study, we used ML feature selection and association rule learning with logistic regression to identify risk factors for stunting, underweight, and thinness in Ethiopian school-aged children.

Data collection
This project was carried out in Jimma Town, Ethiopia. Jimma is located 352 km southwest of Addis Ababa at an altitude of 2,450 m above sea level, with average temperatures ranging from 15°C to 18°C (Figure 1). In this area, sanitation levels are low and many households lack access to clean drinking water. The study catchment area included 14 public elementary schools, all of which were surveyed, resulting in a total of 1,036 participants (498 males, 538 females) (Figure 1). Students were randomly selected and participated with parental consent, obtained after a complete explanation of the study and its objective. Each parent was given adequate time to think about and discuss whether they wanted their child to participate. Data were collected in a 2021 survey titled "Understanding gut-microbiome interactions following mass deworming against soil-transmitted helminths (STHs) among young Ethiopian schoolchildren". The survey included a series of yes/no questions relating to sociodemographic and behavioral factors, in addition to pertinent medical history including, but not limited to, sickness, receipt of a deworming drug, and BCG vaccination status. The survey also included anthropometric measures used to calculate malnutrition outcomes.

The list of risk factors
From the administered survey, we obtained data on the study population's demographic, socioeconomic, biological, and behavioral characteristics. The complete list of risk factors used in this study is provided in Supplementary Table S2.

Outcome definition
Stunting, underweight, and thinness were determined according to the World Health Organization's Growth Reference Study Group guidelines using the AnthroPlus package in R (41). This package uses an individual's sex, age, height, and weight to compute height-for-age (HAZ) (41), weight-for-age (WAZ) (42), and BMI-for-age (BAZ) metrics. It then compares each metric to the WHO Reference 2007 for 5-19 years and returns a z-score indicating the deviation of the individual from the mean of the reference population. These z-scores are continuous variables and are used in the multivariate linear regression described below. To determine the stunting, underweight, and thinness classes for feature selection, the z-scores were converted into a binary measure, where "0" indicates a z-score of −2 or greater and "1" indicates a z-score less than −2 (an individual with stunting, underweight, or thinness).
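The conversion from continuous z-scores to binary outcome classes described above can be sketched as follows. This is a minimal Python illustration rather than the study's R code; the function names are hypothetical, and the z-scores are assumed to have already been computed (for example by the AnthroPlus package).

```python
def classify_undernutrition(z_score, cutoff=-2.0):
    """Return 1 if the z-score is below the cutoff (undernourished), else 0."""
    return 1 if z_score < cutoff else 0


def classify_child(haz, waz, baz):
    """Map the three z-scores to binary outcomes (hypothetical helper).

    haz: height-for-age z-score  -> stunting
    waz: weight-for-age z-score  -> underweight
    baz: BMI-for-age z-score     -> thinness
    """
    return {
        "stunting": classify_undernutrition(haz),
        "underweight": classify_undernutrition(waz),
        "thinness": classify_undernutrition(baz),
    }


# A child exactly at the cutoff (baz = -2.0) is not classified as thin,
# because only z-scores strictly below -2 are coded as 1.
print(classify_child(haz=-2.5, waz=-1.0, baz=-2.0))
```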

Data imputation
Survey data often have incomplete responses, depending on a participant's willingness or ability to answer certain questions. Out of 41,280 total values, 674 values were missing for stunting and underweight, and 427 values were missing for thinness. Even though there were relatively few missing values, minimizing them as much as possible is necessary to produce the most effective ML model, as missing values can introduce selection bias (43). To reduce this bias, the data were initially preprocessed using K-nearest neighbors (K-NN) imputation with five nearest neighbors, which is a standard number of neighbors used for data imputation (44). K-NN imputation was chosen over other single imputation techniques, such as mean or median substitution, because it estimates missing data points from similar samples in the dataset (44). Since the dataset contains multiple types of data (categorical and numerical), the K-NN imputation was conducted once to impute numerical variables and once to impute categorical variables. Imputation was applied only to samples with less than 5% missing values. Four participants were either too young or too old for undernutrition metrics to be calculated; thus, 1,032 out of 1,036 entries could be used in this study.
To check the performance of K-NN imputation, we also performed random forest imputation and compared the overlap in values between the two methods. There was a 90.4% overlap in values for stunting, a 90.1% overlap for underweight, and an 88.4% overlap for imputed thinness data (Supplementary Table S1), showing that K-NN imputation was consistent with an alternative imputation method.
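To make the K-NN imputation step concrete, the following is a minimal Python sketch for numerical variables only. It is not the study's R implementation: the function name, the use of a root-mean-squared distance over jointly observed columns, and the averaging of donor values are all assumptions made for illustration. The default of five neighbors follows the text.

```python
import math


def knn_impute(rows, k=5):
    """Fill None entries in numeric rows with the mean of the k nearest donors.

    A donor is any other row that observed the missing column. Distance is the
    root mean squared difference over columns observed in both rows.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue  # no common columns, distance undefined
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            donors = [v for _, v in candidates[:k]]
            if donors:
                imputed[i][j] = sum(donors) / len(donors)
    return imputed


# Toy data: the second row is missing its second value.
toy = [[1.0, 2.0], [1.0, None], [1.1, 2.2], [9.0, 20.0]]
print(knn_impute(toy, k=2))
```

Categorical variables would need a different rule, such as taking the most common value among the neighbors, which is one reason the text describes running the imputation separately for the two data types.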

One-hot encoding
The data consist of a mix of categorical, numerical, and binary variables. The categorical variables were encoded using one-hot encoding, which takes each potential risk factor with more than two groups and creates a separate binary factor for each of its classes. One-hot encoding is necessary before performing logistic regression since regression models require numerical inputs, so categorical variables must be encoded before they can be analyzed (45). After one-hot encoding, the reference category was removed to avoid multicollinearity and redundancy in the dataset.
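A minimal Python sketch of one-hot encoding with the reference category dropped is given below. The function name and the example categories are hypothetical and are not taken from the study's code or survey.

```python
def one_hot_encode(values, reference=None):
    """One-hot encode a categorical column, omitting the reference category.

    Returns a dict mapping each non-reference category to a 0/1 indicator list.
    If no reference is given, the alphabetically first category is used.
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]
    kept = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Three residence classes become two indicator columns; "urban" is the
# reference and is represented by zeros in both columns.
residence = ["urban", "rural", "suburban", "urban"]
print(one_hot_encode(residence, reference="urban"))
```

Dropping one category keeps the indicator columns from summing to a constant, which would otherwise be perfectly collinear with the regression intercept.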

Logistic regression
Univariate and multivariate logistic regressions were performed for stunting, underweight, and thinness outcomes. The p-values for risk factors were adjusted for multiple testing using the Benjamini-Hochberg procedure (46). Risk factors identified by logistic regression were compared with the significant variables identified by three feature selection methods. Multivariate linear regression was also performed in this comparative analysis to identify other potentially important variables.
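The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines of Python. This is an illustrative implementation of the standard procedure, not the study's R code; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.5]
print(benjamini_hochberg(raw))
```

A risk factor is then reported as significant when its adjusted p-value falls below the chosen false discovery rate.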

Feature selection
Feature selection algorithms were used to identify key determinants of different malnutrition outcomes. Chi-square and Monte-Carlo were used as ranking-based feature selection techniques alongside the subset-based minimum redundancy maximum relevance (mRMR) and joint mutual information (JMI) methods. The chi-square method calculates the association between risk factors and undernutrition outcomes using the chi-squared score (47). Monte-Carlo is a commonly used feature selection method (48-50) that ranks features by their contribution to the undernutrition outcome according to a relative importance metric (51). Subset-based methods choose a subset of a dataset's original features that collectively possess good predictive ability (52). mRMR and JMI both select features that have the strongest relationships with an outcome and the weakest relationships with other risk factors by evaluating and comparing these two types of interaction. The top 10 features with the highest importance level from the ranking- and subset-based feature selection methods and multivariate linear regression were used in further analysis.
Based on a literature search, feature selection was also performed for subgroups of the dataset defined by characteristics suspected to have a large influence on the lifestyle of individuals. This subset analysis was performed across residence (urban vs. suburban/rural), age (up to 10 vs. older than 10), and sex (male vs. female) subsets to explore how population subsets are exposed to different undernutrition risk factors.
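As an illustration of ranking-based feature selection, the sketch below scores binary risk factors against a binary outcome with the Pearson chi-squared statistic and returns the top-ranked names. It is a simplified Python example under stated assumptions: all variables are coded 0/1, the function and feature names are hypothetical, and the study's R implementation may compute or normalize the score differently.

```python
def chi_square_score(feature, outcome):
    """Pearson chi-squared statistic for two binary (0/1) lists of equal length."""
    n = len(feature)
    observed = [[0, 0], [0, 0]]  # observed[feature value][outcome value]
    for f, y in zip(feature, outcome):
        observed[f][y] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][c] + observed[1][c] for c in range(2)]
    score = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:
                score += (observed[r][c] - expected) ** 2 / expected
    return score


def rank_features(features, outcome, top_n=10):
    """Return feature names ordered by descending chi-squared score."""
    scores = {name: chi_square_score(col, outcome) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


stunted = [1, 1, 0, 0]
candidates = {
    "older_age": [1, 1, 0, 0],   # perfectly aligned with the outcome
    "owns_pet": [1, 0, 1, 0],    # unrelated to the outcome
}
print(rank_features(candidates, stunted))
```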

Association rule learning
Association rule learning was used to discern risk factor combinations strongly associated with stunting, underweight, or thinness (53). Association rule learning gives support, confidence, and lift values for variable combinations. Support measures how frequently a rule occurs in the dataset, and confidence indicates the proportion of cases containing the risk factor combination in which the outcome also occurs. Lift approximates the strength of an association rule and is defined as the ratio of the observed support to the support expected if the risk factors and the outcome were unrelated. P-values were also calculated for each rule using Fisher's Exact test to gauge the effect and significance of the rule's association with stunting, underweight, or thinness outcomes (54). We selected the rules that had the greatest significance values. This analysis was performed on the entire dataset, as well as on age, residence, and sex population subsets. We also used the CART method to supplement the rules provided by association rule learning (55). The CART method was used to obtain a decision tree for undernutrition outcomes, and this decision tree was transcribed into the format of our association rules for interpretability.
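The support, confidence, and lift measures defined above can be computed directly for a single candidate rule, as in the following Python sketch. The function name, the record layout (one dict of 0/1 values per child), and the variable names are hypothetical and chosen only for illustration; a full association rule learner would additionally enumerate candidate rules.

```python
def rule_metrics(records, antecedent, outcome):
    """Compute support, confidence, and lift for the rule antecedent -> outcome.

    records: list of dicts with 0/1 values.
    antecedent: list of variable names that must all equal 1.
    outcome: name of the outcome variable.
    """
    n = len(records)
    has_lhs = [all(r.get(item) == 1 for item in antecedent) for r in records]
    has_rhs = [r.get(outcome) == 1 for r in records]
    n_lhs = sum(has_lhs)
    n_rhs = sum(has_rhs)
    n_both = sum(1 for a, b in zip(has_lhs, has_rhs) if a and b)
    support = n_both / n
    confidence = n_both / n_lhs if n_lhs else 0.0
    # Support expected if the antecedent and the outcome were independent.
    expected = (n_lhs / n) * (n_rhs / n)
    lift = support / expected if expected else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


children = [
    {"open_defecation": 1, "walks_barefoot": 1, "underweight": 1},
    {"open_defecation": 1, "walks_barefoot": 1, "underweight": 0},
    {"open_defecation": 0, "walks_barefoot": 1, "underweight": 0},
    {"open_defecation": 0, "walks_barefoot": 0, "underweight": 0},
]
print(rule_metrics(children, ["open_defecation", "walks_barefoot"], "underweight"))
```

A lift above 1 indicates that the outcome occurs more often alongside the risk factor combination than would be expected if they were unrelated.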

Code and data availability
The R programming language was used to write the study's code since it is user-friendly and contains advanced statistical learning libraries. Our source code is available on GitHub at https://github.com/CJPIV/SR2022Malnutrition. The principal investigator can provide access to the survey data upon reasonable request.

Machine learning risk factor analysis
ML feature selection complemented logistic regression, while also providing a novel approach to risk factor identification. For each feature selection method, variables were considered important if they appeared in the top 10 features for 90% of performed feature selection runs. Features that appeared in three or more feature selection methods across stunting, underweight, and/or thinness outcomes were considered important (Table 2). In line with regression findings, age was selected for stunting and underweight by mRMR, JMI, Monte-Carlo, univariate logistic regression, multivariate logistic regression, and multivariate linear regression (Table 2). Sex and vaccination status were selected for stunting by mRMR, JMI, multivariate logistic regression, and multivariate linear regression (Table 2). School latrine cleanliness, family size, and nail trimming habits were novel variables identified solely by the chi-square or JMI feature selection methods (Table 2). However, the directionality of these associations is unclear, as these variables were not significant in the multivariate logistic regression.
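The stability criterion described above, in which a variable counts as important when it appears in the top 10 features for 90% of runs, can be sketched as follows. This Python example is an assumption-laden illustration: the function name and toy rankings are hypothetical, and "for 90% of runs" is interpreted here as "in at least 90% of runs".

```python
from collections import Counter


def stable_features(runs, top_n=10, min_fraction=0.9):
    """Return features found in the top_n of at least min_fraction of runs.

    runs: list of rankings, each a list of feature names ordered by importance.
    """
    if not runs:
        return []
    counts = Counter()
    for ranking in runs:
        counts.update(set(ranking[:top_n]))  # count each feature once per run
    n_runs = len(runs)
    return sorted(f for f, c in counts.items() if c / n_runs >= min_fraction)


# Ten toy runs with top_n=3: "age" appears in all ten, "sex" in nine,
# and "pet" in only eight, so "pet" misses the 90% threshold.
toy_runs = [["age", "sex", "pet"]] * 8 + [["age", "sex", "x"], ["age", "y", "z"]]
print(stable_features(toy_runs, top_n=3))
```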

Association rule learning
We used association rule learning to examine how variable co-occurrences may be associated with increased odds of stunting, underweight, and thinness. We observed unique trends in rules for stunting, underweight, and thinness (Table 3). Owning an animal frequently appeared alongside hygiene-related variables in stunting rules. For underweight, antibiotic use and open defecation were frequently found in rules that also contained related hygiene variables (i.e., walking barefoot, washing raw vegetables infrequently, cleaning oneself or clothes in a river). Similarly, open defecation was found with hygiene-related variables, including handwashing with water only and nail trimming habits, in multiple rules for thinness. Association rule learning using the CART method

Subset-adjusted findings
To analyze how risk factors may be constrained by larger determinants of health, we performed logistic regression and feature selection analyses for sex, residence, and age subsets. Using multivariate logistic regression, we observed an uneven distribution of risk factors in the population (Table 4). Older childhood age (10-18 years old) was again a risk factor, while having received vaccinations was a protective factor for the urban population subset. Interestingly, owning a household pet was a risk factor for the female, urban, and under 10 years old subsets, although findings remained significant only for the female subset following p-value correction. For underweight, older age was a significant risk factor exclusive to urban and male populations. Owning a household pet, having a house floor made of dust, open defecation, and older age also displayed subset-specific significance for underweight, although this significance was lost after p-value correction. Similarly, subset-specific findings for thinness, which identified older age and owning a sheep (or goat) as risk factors specific to males, did not remain significant after adjustment.
Feature selection was also performed for sex, residence, and age subsets to examine how variables predictive of stunting, underweight, or thinness vary by these subsets. Again, variables were considered important for a feature selector if they appeared in the top 10 features for 95% of performed feature selection runs. In addition to identifying the same subset-specific variables observed in logistic regression, feature selection displayed interesting variable trends for different population subsets (Figure 2). Again, household pet ownership and sex were variables important to the up to 10 years old subset (Figure 2A). In addition, family size, toilet paper usage, school latrine cleanliness, and sex were important factors exclusive to children up to 10. Vaccination status, kitchen placement, and a maternal [...] (Figure 2A). For the urban subset, age, household pet ownership, and vaccination status were again present, while school latrine cleanliness was also identified (Figure 2B). Farm animals, a house floor made of dust, and deworming status were factors specific to suburban/rural residence (Figure 2B). Household access to potable water, large family size, open defecation, owning a farm animal, and school latrine cleanliness were important factors for males, while vaccination status and having a house floor made of dust were important factors for females (Figure 2C). Age, owning a household pet, and maternal education status were important for both male and female subsets (Figure 2C). Subset association rule learning was also performed for age, residence, and sex subsets, and variables identified by these previous methods also tended to appear in association rules (Supplementary Table S6).
As in the association rule learning analysis applied to the entire dataset, hygiene variables and behaviors such as spending time in rivers, antibiotic use, and owning animals appeared in several rules (Supplementary Table S6).

Discussion
In our study, we applied ML approaches in conjunction with logistic regression to explore potential risk factors associated with malnutrition. Both multivariate logistic regression and feature selection identified older age, sex, and vaccination status as important risk factors for stunting, underweight, and/or thinness. In addition, feature selection identified novel factors for these detrimental outcomes, such as a larger family, nail trimming habits, and school latrine cleanliness. Association rule learning displayed interesting hygiene and animal variable co-occurrences. Our subset analysis revealed noticeable differences in significant factors for different population subsets. Collectively, these findings offer new insights into the complexity of stunting, underweight, and thinness in Ethiopian schoolchildren, suggesting that future studies should seek to use these complementary ML methods to provide a more comprehensive analysis.
Our finding that older age is associated with greater odds of stunting and underweight is consistent with previous literature (56-59) and is likely the result of the increased nutritional demands of children as they transition into adolescence (60). The reduced odds of stunting among females may reflect both biological and cultural differences between the sexes. Males tend to require greater nutritional intake to maintain muscle mass, increasing their sensitivity to environmental conditions (61). These biological differences may also be exacerbated by cultural norms in which males often engage in strenuous physical labor, whereas females perform household-related tasks and are often in closer proximity to food (56). Additionally, vaccination appeared to have a significant protective role against stunting, which has also been reported in previous studies (62,63) and suggests that unvaccinated children face long-term health consequences owing to their greater susceptibility to infection. Nandi et al. (2019) found that Ethiopian, Indian, and Vietnamese school-aged children who received early-life measles vaccination (6-18 mo.) had greater anthropometric measurements than matched unvaccinated children, suggesting long-term benefits of vaccination (64).
Feature selection complemented logistic regression while identifying novel factors predictive of stunting, underweight, and thinness, revealing its value when used alongside traditional statistical methods. Large family size, school latrine cleanliness, and nail trimming habits were identified as novel predictors of stunting, underweight, and thinness. Bazie et al. (2021) suggested that increases in family size spread resources more thinly among a greater number of children, predisposing them to smaller and less diverse diets (65). Reinforcing this conclusion, high physiological density, or the number of persons per unit of agricultural land, has previously been associated with a greater likelihood of undernutrition (66).
School latrine cleanliness and nail trimming habits are likely representative of sanitation and hygiene access. Previously, having access to a clean latrine has been identified as an important determinant of child malnutrition (67,68). Repeated exposure to fecal matter in an unclean school latrine may result in undernutrition through enteric dysfunction, as pathogenic bacteria present in the feces, such as E. coli, may damage the intestinal mucosa and prevent nutrient absorption (69,70). Additionally, parasites such as Ascaris lumbricoides shed eggs in feces. Exposure to these eggs can lead to subsequent parasitic infection, which is strongly associated with malnutrition (71,72). Long nails can collect soil or fecal material, inadvertently leading to the collection of parasites. In fact, previous research in Jimma, Ethiopia has found that dirt trapped in nails contributed to helminth infection (73,74), and that the odds of parasitic infection increased as nail trimming decreased (75). Once parasites are on the nails, infection may occur through the fecal-oral route as children bring their hands close to their mouths. As parasitic infection occurs, children may have a decreased appetite (76), experience greater nutrient loss due to vomiting (77), or exhibit malabsorption of nutrients (78).

FIGURE 2
Feature selection for age and residence subsets identifies noticeable differences in important determinants of undernutrition. Feature selection was performed for (A) ≤10 vs. >10, (B) urban vs. suburban/rural, and (C) male vs. female population subsets, and summary tables were created to display variables important to each population subset.
Association rule learning was performed to identify variable co-occurrences associated with a greater likelihood of stunting, underweight, and thinness. For stunting, owning a chicken or house pet was frequently found with hygiene variables, including toilet paper usage status and handwashing habits before eating. Chickens and house pets may serve as reservoirs for enteric bacteria such as E. coli, Campylobacter, and Salmonella (79). Infection with enteric bacteria has been associated with diarrheal disease, iron-deficiency anaemia, and growth impairment (80,81). Poor hygienic behaviors facilitate the fecal-oral route by which many parasitic infections occur (71,82). A similar pattern of co-occurring hygiene-related behavioral variables appeared using association rule learning for underweight and thinness. Furthermore, association rule learning was able to identify sets of co-occurring variables related to malnutrition that were otherwise missed using traditional statistical methods. In this study, association rule learning highlighted the importance of different sets of hygiene variables as risk factors for malnutrition by grouping related variables together.
In subgroup analysis by residence, we observed older age, owning a household pet, and vaccination status as important factors for stunting, underweight, and thinness in the urban population, while a house floor made of dust, farm animals, and deworming were important in the suburban/rural population. These differences are corroborated by earlier reports (83-85), which documented rural-urban differentials in factors associated with nutritional outcomes and linked this divide to socioeconomic inequality and differences in the standards of living of residents in the two settings. Previously in Ethiopia, higher odds of undernutrition have been associated with older children in urban locations (58,86,87). Older children in urban areas may spend more time outside of their household, increasing their exposure to potential infectious diseases. Owning a household pet may be associated with increased odds of undernutrition through increased exposure to helminth infection. Indeed, Misikir et al. (2020) found a positive association between living with domesticated animals and hookworm infection in a nearby region of Ethiopia (88). Having received vaccinations may be especially important in the urban context, where individuals frequently come into contact with each other (89). This risk may be exacerbated by the high degree of vaccine hesitancy reported among Ethiopians (90,91), since low levels of vaccination allow for a greater prevalence of infectious diseases.
In the suburban/rural context, house floor material is an indicator of socioeconomic status. Previous studies in Ethiopia have found that socioeconomic status indicators are associated with increased odds of undernutrition in rural settings (92,93). In rural Ethiopia, lower socioeconomic status has been associated with a lower dietary diversity score (94), suggesting a potential mechanism by which Ethiopian children of lower socioeconomic status are more likely to be malnourished. Owning farm animals may result in increased odds of undernutrition among children since these animals serve as a reservoir for many infectious agents such as Echinococcus (95), Cryptosporidium, and Giardia (96). The importance of deworming in suburban/rural areas may be a result of increased opportunities for helminth infection. For example, individuals living in these areas are more likely to engage in agriculture, which increases one's susceptibility to hookworm infection (97). Additionally, evidence suggests that Ethiopians living in more rural regions may have poor latrine quality (98), which has been associated with helminth infection since increased exposure to helminths in feces is more likely to result in infection (74). Therefore, long durations without deworming are more likely to have negative consequences in rural contexts where helminth infection and re-infection rates may be higher. The subset analysis demonstrates the need to investigate how these larger determinants of health may give rise to specific predictors of undernutrition. Identifying the unique sets of factors contributing to malnutrition in population subgroups defined by residence, age, or sex provides insights for planning targeted public health interventions tailored to specific subgroups.

Limitations
Our findings must be understood in light of the following limitations. First, given the cross-sectional design of this study, we are unable to attribute causality between associated exposure and outcome variables. There may also be unmeasured confounding factors for variables that were significantly associated with undernutrition outcomes. Also, the relatively low prevalence of stunting, underweight, and thinness in our study population may have interfered with our ability to detect associations between features and these outcomes. Moreover, computation times for subset-based feature selection and association rule learning increase exponentially with the number of variables in a dataset, which limits the applicability of our code to datasets with a large number of risk factors.
Feature selection and association rule learning are two commonly used techniques in machine learning and data analysis; however, both have limitations. Since feature selection reduces the size of the data to enable a more efficient analysis, this method can suffer from overfitting and may not generalize well to new data. Association rule learning can generate a large number of rules, which makes it difficult to extract useful information. Since we focused on association rules of three variables in the interest of computation time and interpretability, it is possible that we overlooked important rules of other sizes. These rules also lacked statistical power since a small proportion of the overall population exhibited undernutrition. This resulted in a relatively small number of individuals from which rules could be generated. However, this lack of power is unlikely to affect our main findings since these association rules supported our feature selection analysis, which had high statistical power. Also, association rule learning assumes independence between features, which may not be true in certain instances. Overall, both feature selection and association rule learning are useful tools, but they must be used with caution and in conjunction with other techniques to ensure accurate and robust results.

Conclusion
In this study, we found that feature selection, association rule learning, and subset analysis substantiated traditional logistic regression findings. Logistic regression identified older age (10-18 years old) as a risk factor for stunting and underweight, and being female and having received vaccinations as protective factors for stunting; feature selection likewise found age to be important for stunting and underweight and sex to be important for stunting. Feature selection also identified school latrine cleanliness, large family size, and nail trimming habits as novel variables important to stunting, underweight, and thinness outcomes. Association rule learning showed that co-occurring hygiene and socioeconomic variables were related to malnutrition, a relationship otherwise missed using traditional statistical methods. We also demonstrate the need to analyze different population subsets, showing the promise of feature selection for uncovering unique sets of malnutrition risk factors that could be used to plan targeted public health interventions.

TABLE 1
Multivariate logistic regression identifies significant risk and protective factors for undernutrition in Ethiopian school-aged children. Multivariate logistic regression was performed for each value for stunting, underweight, and thinness outcomes and then the p-value was corrected using the Benjamini-Hochberg correction.
a Stunting was defined as HAZ < −2. b Underweight was defined as WAZ < −2. c Thinness was defined as BMIZ < −2.

TABLE 2
Feature selection identifies school latrine cleanliness, a larger family size, and nail trimming habits as potential factors associated with undernutrition.

TABLE 3
Association rule learning identifies co-occurring variables associated with stunting, underweight, and thinness. Association rules were computed using three or fewer variables on the left-hand side, and rules with the highest lift values were selected. Odds ratios and p-values were also calculated for association rules.

TABLE 4
Multivariate logistic regression for residence and age population subsets reveals significant risk and protective factors. Thinness was defined as BMIZ < −2.