Factors Associated With Lower Respiratory Tract Infection Among Chinese Students Aged 6–14 Years

Aims: We employed machine-learning methods to explore data from a large survey of students, with the goal of identifying and validating a thrifty panel of important factors associated with lower respiratory tract infection (LRTI).
Methods: A cross-sectional cluster-sampling survey was conducted in January 2022 among students aged 6–14 years who attended primary or junior high school in Beijing. Data were collected via electronic questionnaires. Statistical analyses were completed using PyCharm (Edition 2018.1 x64) and Python (Version 3.7.6).
Results: Data from 11,308 students (5,527 girls and 5,781 boys) were analyzed, and 909 of them had LRTI, a prevalence of 8.01%. After a comprehensive evaluation, the Gaussian naive Bayes (gNB) algorithm outperformed the other machine-learning algorithms, with accuracy of 0.856, precision of 0.140, recall of 0.165, F1 score of 0.151, and area under the receiver operating characteristic curve (AUROC) of 0.652. Under the optimal gNB algorithm, the top five important factors (age, rhinitis, sitting time, dental caries, and food or drug allergy) had decent prediction performance, comparable to that of all factors modeled. For example, under the sequential deep-learning model, the accuracy and loss were 92.26% and 25.62%, respectively, when incorporating the top five factors, and 92.22% and 25.52% when incorporating all factors.
Conclusions: Our findings showed that the top five important factors modeled by the gNB algorithm can sufficiently represent all involved factors in predicting LRTI risk among Chinese students aged 6–14 years.


INTRODUCTION
Lower respiratory tract infection (LRTI) is a common infectious disease in pediatric clinics, and it ranks as a leading cause of pediatric deaths (1)(2)(3). LRTI places a heavy burden on individuals and public health systems. Pneumonia is the most common form of LRTI, and worldwide it takes the lives of three children every 2 min on average (4). Global statistics have shown that hospitalization due to pneumonia increased by 2.9 times from 2000 to 2015 (5), and each year 0.65 million children die of LRTI (6,7). Given the complex pathogenesis of LRTI, there is increasing interest in understanding the causes of LRTI, proposing effective prediction algorithms, and risk-stratifying children who might benefit from close monitoring and timely interventions.
To date, numerous studies have been conducted to identify factors that can better predict LRTI occurrence. Notably, Shi et al. have written an excellent systematic review and meta-analysis focusing on factors associated with respiratory syncytial virus-associated acute LRTI among young children, and they found that comorbidity, congenital heart disease, prematurity, and younger age were associated with poor outcomes (8). More recently, among 7,222 preschool-aged children, we identified five significant factors that were associated in a synergistic manner with recurrent respiratory tract infection (9). Thus far, no consensus exists on how many LRTI-susceptibility factors are actually involved and how they act. This is partly due to the difficulty of delineating complicated and nuanced relationships among factors when predicting LRTI with traditional statistical methods (such as logistic regression analysis), which involve only one input-output layer and accommodate relatively small amounts of variation. To overcome this challenge, more advanced machine-learning methods have been developed and successfully applied in a variety of clinical settings (10)(11)(12). To our knowledge, there is to date no application of machine-learning methods in the field of LRTI.
To fill this gap in knowledge and generate more information for future studies, we attempted to employ machine-learning methods to manage data from a large survey on students 6-14 years of age, with the goal of identifying and validating a thrifty panel of important factors associated with LRTI, and meanwhile selecting the optimal algorithm for possible clinical application.

METHODS
Study Design
This survey followed a cross-sectional cluster sampling design and was performed in January 2022 in Pinggu district, Beijing. The Ethics Committee of Beijing University of Chinese Medicine reviewed and approved the protocols of this survey, which was implemented according to the Declaration of Helsinki.

Study Participants
Study participants were students aged 6-14 years who were attending primary school or junior high school at the time of the survey. The parents or guardians of participating students provided electronic signatures consenting to participation in this survey, and importantly the consent form included opt-out clauses.
A total of 26 schools in Pinggu district were randomly selected, including eight primary schools and 18 junior high schools. The total number of registered students in these 26 schools was 11,633. The self-designed questionnaire was generated in the form of a QR code on the "Wenjuanxing" website (https://www.wenjuan.com/), an online platform in mainland China. The QR-coded questionnaire can be easily recognized by common smartphones on the market, and it was sent to the parents or guardians of the 11,633 students by their teachers-in-charge via the "WeChat" social media app.

Data Collection
The questionnaire used in this survey was first circulated to the parents or guardians of a small number of students (N = 120), and the reliability coefficient (alpha) was over 0.85. The items in the questionnaire related to both the students themselves and their parents from multiple aspects, and the responses were downloaded into a Microsoft Office Excel™ spreadsheet.
From students, information was collected on age, sex, nationality, waist-hip ratio, body mass index (BMI), pregnancy order, delivery order, twin birth, delivery mode, gestational age, birth weight, birth body length, infancy feeding, breastfeeding duration, pure breastfeeding, pure breastfeeding time, time of adding solid food, stool frequency, and stool consistency, as well as lifestyle-related factors including eating speed, fall-asleep time, sleep duration, sitting time, screen time (time of watching TV or playing video games), daily time of outdoor activities, sleeping with the light on, using plastic tableware, using makeup, the weekly intake frequencies of dietary fiber, out-of-season fruit, animal protein, soy protein, milk, dietary supplement, food containing preservative, fast food, snacks, sweet food, and night meals, and picky eating frequency per week. In addition, the episodes of LRTI over the past year, chronic diseases, dental caries, rhinitis, and allergy (including foods or drugs) were also recorded.
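A reliability coefficient of this kind is commonly computed as Cronbach's alpha. The following is a minimal sketch under that assumption; the function name and the pilot responses are hypothetical and are not taken from the survey.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                      # number of questionnaire items
    items = list(zip(*responses))              # transpose: one tuple per item
    item_var_sum = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical pilot data: 5 respondents x 4 items (illustrative only).
pilot = [
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(pilot)
print(f"alpha = {alpha:.3f}")
```

Population variance is used for both the items and the total score; using the sample variance for both would give the same ratio and therefore the same alpha.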
From parents, information on BMI, bearing age and education of both parents, family income (RMB per year), number of relatives with hypertension and diabetes was collected.

Quality Control
The quality of survey data was strictly controlled. Specifically, school-healthcare physicians and teachers in charge of class were trained to understand the detailed procedure of this survey and each item in the questionnaire. They were responsible for assisting the parents or guardians of participating students in filling out the questionnaire. When the survey ended, data were downloaded from the "Wenjuanxing" platform, and each item was rigorously checked. In the case of missing values and obvious outliers, school-healthcare physicians and teachers in charge of class were requested to contact the parents or guardians of participating students to provide or confirm relevant information.

LRTI Definition
Clinically, LRTI refers to infection of the lung tissue or tracheobronchial tree below the throat, and it is usually caused by viruses or bacteria spreading from the mouth and upper respiratory tract down the respiratory tract (13). In this survey, LRTI refers to the occurrence of LRTI diagnosed by doctors in the past year, for which the hospital or outpatient clinic diagnosis and related information were confirmed by teachers in charge of class. If there was any disagreement, our team carefully verified the reported information, including symptoms and the related diagnosis and treatment.

Definitions of Other Items
Allergic rhinitis was diagnosed based on previous medical records, and food/drug allergy was identified by questions related to physician diagnosis in accordance with the International Study of Asthma and Allergies in Childhood (ISAAC) questionnaire (14). Dental caries was recorded, and medical history of children referred to chronic kidney diseases, congenital heart disease, hypothyroidism, and other chronic diseases.
BMI was calculated as body weight divided by height squared (kg/m²). Body weight and height were measured by school-healthcare physicians. Infancy feeding included pure breastfeeding, partial breastfeeding, and non-breastfeeding. Gestational age, breastfeeding duration and solid food consumption age were recorded in months. Delivery mode included vaginal delivery and cesarean section. Stool frequency was classified into 1-2 times per day, 3-4 times per day, more than 4 times per day, 2-3 times per week and 0 or once per week. Stool consistency was classified into four categories according to the Bristol Stool Form Scale (BSFS). Lifestyle-related factors included sleep habits, daily activity habits, sitting habits, and eating habits. Specifically, sleep duration, sitting time, screen time, and daily duration of outdoor activities were recorded in hours, and each was calculated as a weighted daily average, that is, the sum of the workday value ×5 and the weekend value ×2, divided by 7. The weekly intake frequency of the following foods (dietary fiber, out-of-season fruit, animal protein, soy protein, milk, dietary supplement, food containing preservative, fast food, snacks, and sweet food) was classified as every day, three or more times per week, once or twice per week, and hardly. The frequency of the following behaviors (sleeping with the light on, using plastic tableware, using make-up, night meals, and picky eating) was categorized into the same four groups, that is, every day, three or more times per week, once or twice per week, and hardly.
For parents or guardians, maternal and paternal BMI was calculated from self-reported body weight and height. Education was categorized as middle school degree or below, high school degree, and college degree or above. Family income (RMB per year) was categorized as <100,000, 100,000-300,000, and ≥300,000. Diseases of relatives referred to diabetes mellitus or hypertension diagnosed by doctors from tertiary hospitals.
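The two derived quantities described above (BMI and the workday/weekend weighted daily average) can be sketched as follows. The function names and the example inputs are illustrative assumptions, not values from the survey.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2: body weight divided by height squared."""
    return weight_kg / height_m ** 2


def weighted_daily_hours(workday_hours, weekend_hours):
    """Average hours per day over a week: 5 workdays and 2 weekend days."""
    return (workday_hours * 5 + weekend_hours * 2) / 7


# Hypothetical student: 40 kg, 1.50 m, sitting 8 h on workdays and 10 h on weekends.
print(round(bmi(40, 1.5), 2))
print(round(weighted_daily_hours(8, 10), 2))
```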

Statistical Analyses
If the percentage of missing values for an item in the questionnaire exceeded 30%, the item was removed from the final analysis. Continuous factors are expressed as mean (standard deviation) if no deviation from normal distribution was observed, and as median (interquartile range) otherwise. Categorical factors are expressed as count (percent). Two-group comparisons (students with and without LRTI within the last year) were done using the t-test for normally distributed factors, the rank-sum test for skewed factors, and the χ²-test for categorical factors.
To ensure the reproducibility of machine-learning models, data from 11,308 students were randomly divided into the training set (60%, N = 6,785 students) and the testing set (40%, N = 4,523 students). The training set was used to construct the machine-learning algorithms, and the testing set was used to test the reproducibility of these algorithms. In this study, 11 machine-learning algorithms were trained, including logistic regression, random forest, support vector machine (SVM), decision tree, K-nearest neighbors (KNN), gradient boosting machine (GBM), light gradient boosting machine (LGBM), extreme gradient boosting machine (XGBoost), Gaussian naive Bayes (gNB), multinomial naive Bayes (mNB), and Bernoulli naive Bayes (bNB). Meanwhile, both hard and soft voting classifications were calculated based on the 11 machine-learning algorithms. The performance of each algorithm was evaluated from five aspects, that is, accuracy, precision, recall, F1 score and AUROC. By definition, accuracy is the proportion of all predictions that are correct, precision is the proportion of predicted positives that are truly positive, and recall is the proportion of actual positives that are correctly predicted. The F1 score, calculated as the harmonic mean of precision and recall, takes both false positives and false negatives into account. AUROC is a summary index of discrimination, equal to the probability that a randomly chosen student with LRTI receives a higher predicted score than a randomly chosen student without LRTI, with a higher value indicating better discrimination. The optimal algorithm was selected after comprehensive evaluation of the above five aspects.
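As an illustration of the categorical two-group comparison, the sketch below computes the Pearson χ² statistic for a 2×2 table in pure Python. It is an assumption-laden example: the counts are hypothetical, no continuity correction is applied (the text does not say whether one was used), and the p-value uses the closed form for one degree of freedom.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = factor present/absent, columns = LRTI yes/no.
stat, p = chi2_2x2(30, 70, 20, 180)
print(f"chi2 = {stat:.2f}, p = {p:.2g}")
```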
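The five evaluation metrics can be sketched in a few lines of pure Python. The helper names and the toy confusion counts and scores below are hypothetical; only the precision and recall in the final check (0.140 and 0.165 for gNB) are taken from this paper, and they reproduce the reported F1 score.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical toy counts and scores (not study data).
metrics = classification_metrics(tp=3, fp=2, fn=1, tn=14)
auc = auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

# Consistency check with the reported gNB precision and recall.
f1_reported = 2 * 0.140 * 0.165 / (0.140 + 0.165)
print(round(f1_reported, 3))  # 0.151
```

With a prevalence of about 8%, accuracy can be high even when precision and recall are low, which is why the five metrics are read together rather than individually.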
To narrow the range of contributing factors, the importance of each factor was calculated using the SHAP (SHapley Additive exPlanation) tool. After ordering the importance of all variables from the highest to the lowest, the prediction performance of an increasing number of top factors was appraised by accuracy, precision, and AUROC, upon which the minimal number of important variables was determined. Further, the contribution of these variables was compared with that of all variables in terms of model accuracy and model loss under study by using the deep-learning sequential model with three types of optimizers (adaptive moment estimation, root mean square prop, and stochastic gradient descent).
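SHAP importance is built on Shapley values. The sketch below computes exact Shapley values by enumerating all coalitions for a toy value function; it is not the SHAP library's API, and the feature gains are made-up numbers chosen only to show the mechanics of ranking.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley value of each feature for a set function value(frozenset)."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_value(subset):
    """Hypothetical model gain for a feature subset (illustrative numbers)."""
    base = {"age": 0.05, "rhinitis": 0.03, "sitting_time": 0.02}
    gain = sum(base[f] for f in subset)
    if "age" in subset and "rhinitis" in subset:
        gain += 0.02  # assumed interaction, shared equally by both features
    return gain


phi = shapley_values(["age", "rhinitis", "sitting_time"], toy_value)
ranking = sorted(phi, key=phi.get, reverse=True)
print(ranking)
```

Exact enumeration grows exponentially with the number of features, so practical tools approximate these values; the ranking step (sorting by importance from highest to lowest) is the same.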
The statistical analyses were performed in PyCharm Community (Edition 2018.1 x64) on the Windows 10 system with Python (Python Software Foundation, Version 3.7.6). Missing data were imputed using the multiple imputation procedure, which was implemented with the MICE package in the R programming environment (Version 4.1.1).

RESULTS
Baseline Characteristics
After excluding invalid questionnaires, data from 11,308 students (5,527 girls and 5,781 boys) were included in the final analysis, with a response rate of 98%. A total of 909 students had experienced LRTI during the last year, giving a prevalence of LRTI in this student population of 8.01%.
The baseline characteristics of all participating students are presented in Table 1 according to the presence or absence of LRTI. [...] Table 2, including precision, recall, F1 score, and [...]

Importance Ranking and Appraisal
To evaluate the contribution of all factors to LRTI prediction, the importance of each factor was gauged and ranked. The importance of the top 20 factors is illustrated in Figure 2.
Using the optimal gNB algorithm, the cumulative performance of the top 10 factors, added in order of descending importance, was calculated (Table 3). By comparison, the top five important variables (age, rhinitis, sitting time, dental caries, and allergy) had decent prediction performance.
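The cumulative appraisal of top-ranked factors can be sketched as follows: a small Gaussian naive Bayes classifier written from scratch is refit on the first k ranked features and scored on held-out data. Everything here is an assumption for illustration, including the synthetic data, the assumed importance order, the variance floor, and the helper names; it is not the study's pipeline.

```python
import math
import random


def fit_gnb(X, y):
    """Estimate the prior, feature means and variances for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9  # small floor
                     for c, m in zip(cols, means)]
        model[label] = (len(rows) / len(y), means, variances)
    return model


def predict_gnb(model, x):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(params):
        prior, means, variances = params
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=lambda label: log_posterior(model[label]))


def accuracy_with_top_k(X_train, y_train, X_test, y_test, ranked, k):
    """Fit on the k highest-ranked feature columns and return test accuracy."""
    cols = ranked[:k]

    def take(X):
        return [[row[j] for j in cols] for row in X]

    model = fit_gnb(take(X_train), y_train)
    preds = [predict_gnb(model, x) for x in take(X_test)]
    return sum(p == t for p, t in zip(preds, y_test)) / len(y_test)


# Synthetic data: feature 0 is informative, features 1 and 2 are noise.
rng = random.Random(0)
y = [i % 2 for i in range(400)]
X = [[rng.gauss(3.0 * t, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)] for t in y]
X_train, y_train = X[:240], y[:240]   # 60% training
X_test, y_test = X[240:], y[240:]     # 40% testing
ranked = [0, 1, 2]                    # assumed importance order (hypothetical)
results = [accuracy_with_top_k(X_train, y_train, X_test, y_test, ranked, k)
           for k in (1, 2, 3)]
print(results)
```

When the added features carry little information, the scores for larger k stay close to the score for small k, which is the pattern used to pick a minimal panel.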

Confirmation of Top Important Factors
To further ascertain the contribution of these top five factors, the deep-learning sequential model was employed to compare their performance with that of all factors under study.

DISCUSSION
In this cross-sectional analysis of 11,308 Chinese students aged 6-14 years, we attempted to identify and validate a thrifty panel of important variables after comparing the performance of multiple machine-learning algorithms. Importantly, we identified the optimal machine-learning model, the gNB algorithm, and five top important variables that can predict the occurrence of LRTI with performance comparable to that of all variables under study. Moreover, the contribution of the five top important variables to model prediction was further validated by a deep-learning model, supporting the robustness and reliability of our findings. To the best of our knowledge, this is the first report in the medical literature to explore the risk profiles of LRTI in Chinese students over 5 years of age.
More recently, artificial intelligence techniques represented by machine/deep learning have been applied in a growing number of studies to assist or partly replace clinicians in decision making (10)(11)(12). As an extension of our previous work, which adopted traditional statistical methods (linear regression and logistic regression) for modeling, we in this study employed more advanced machine-learning methods to select the optimal algorithm, and deep-learning models to validate the contribution of the thrifty panel of important LRTI-susceptibility factors selected by the machine-learning methods. Notably, we narrowed down the list of potential candidate factors, and found that five of these factors, including age, rhinitis, sitting time, dental caries, and allergy, were sufficient to predict the likelihood of LRTI, with decent performance. The modeling of the five factors using the gNB algorithm can be applied in practical settings to help parents and school-healthcare physicians monitor the likelihood of LRTI for early prevention and timely intervention. Our findings are clinically and biologically plausible. It is reasonable to expect that young age is often linked to less mature function, which makes younger students more susceptible to the development of LRTI and associated symptoms. Moreover, allergy to foods and drugs was also identified as a risk-conferring factor for LRTI, and this issue deserves special attention, as the prevalence of allergy in children is steadily increasing around the globe (15). Currently, there is no direct evidence for the association of food allergy with LRTI; however, some studies have shown that a variety of respiratory symptoms triggered by foods occurred in up to half of patients (16,17).
Respiratory manifestations of food allergy, an immunoglobulin E-mediated immune response, arise from damage to the epithelial surfaces of the lungs, because the epithelium of the lungs acts as a sensor of environmental stimuli (18)(19)(20)(21)(22). Given the important contribution of food allergy to LRTI prediction in the present study, it is reasonable to speculate that susceptibility to respiratory infection might be due to damage of the respiratory epithelium caused by food allergy. As demonstrated by James et al. (23) and Larsen et al. (24), increased adherence of pathogens to inflamed respiratory epithelium, increased mucosal permeability, or altered immune response to certain viral and bacterial pathogens can increase the vulnerability to respiratory infection. In separate studies, Vermeulen and Kuehn found that food allergens were among the allergens involved in allergic rhinitis, and that young children who were sensitized to foods were more likely than their non-allergic peers to develop allergic rhinitis afterwards (25,26). As such, it is highly recommended that parents take children who are allergic to foods to see a pediatrician or allergy specialist for regular intervention as they grow older. The independent or combined pathogenicity of rhinitis and food allergy for LRTI remains unclear, and more investigations are needed to fully understand the mechanisms of LRTI pertaining to food allergy with or without rhinitis, challenging as they may be.
Further, our study indicated that dental caries was a significant contributor to LRTI, and it is notable that more than half of students (57.9%) who were once diagnosed with LRTI by clinicians had one or more dental caries. This finding was in agreement with that of Mehtonen et al. (27), who found that dental caries was associated with an increased occurrence of LRTI based on a 20-year follow-up of a prospective cohort including children born in Espoo. Dental caries arises in the mouth, at the entrance of the respiratory system, whereas lower respiratory infections occur deeper in the respiratory tract, and the oral cavity harbors one of the most complex microbiomes in the body. Possible mechanisms behind the association between oral health and pneumonia have been described by many researchers (28,29). For example, Thoden van Velzen et al. (30) identified dental plaque as one of the important causes of dental caries and as a persistent reservoir for potential pathogens, including both oral and respiratory bacteria. Another two studies also reported that oral bacteria in the dental plaque could shed into the saliva and be aspirated into the lower respiratory tract, influencing the initiation or progression of LRTI conditions such as pneumonia (31,32). Hence, for practical reasons, it is necessary to highlight the importance of maintaining dental health for reducing LRTI risk.
It is also worth noting that sitting time was found to be associated with the occurrence of LRTI in this study. Sedentary behaviors are predominant in modern life, but the adverse effects of these behaviors have not been completely understood in students. Prolonged sitting time can reduce physical activity, which can affect multiple aspects of immune response (33). Evidence from a prospective US cohort indicated that prolonged sitting time increased the chance of pneumonitis due to solids and liquids (34).
Other studies showed that regular physical activity was conducive to decreasing mortality and morbidity for influenza and pneumonia (35-37), strengthening the findings of this study. On this basis, students should be encouraged to spend more time on physical exercise and outdoor activities by reducing sitting time, which can, at least in part, prevent the development of LRTI.

Strengths and Limitations
Strengths of this study include a large-scale student population from 26 schools in Beijing, a high questionnaire response rate, a wide coverage of potential candidate factors associated with LRTI, and a comprehensive analysis of contributing predictors for LRTI in students aged 6-14 years using advanced artificial intelligence techniques. Some limitations should be acknowledged when interpreting our findings. Firstly, due to the cross-sectional design of this survey, causality cannot be established. Secondly, our study was based on data from students 6-14 years of age living in one district of Beijing, and extrapolation of our findings to other regions or races should be made with caution. Thirdly, in this survey, data were collected via parent-reported electronic questionnaires, which might carry a risk of recall or reporting bias, although strict quality control was implemented. Additionally, the items analyzed are relatively general, and some transient factors, such as quick weather change from warm to cold and severe air pollution, that were found to be associated with respiratory infection (38,39) were not collected in this survey. We agree that further incorporation of more factors is necessary to improve model precision and recall, which are relatively low, even under the optimal gNB algorithm. Our findings presented here are preliminary, and future work will entail refining our model by incorporating more data from other independent groups.

CONCLUSIONS
Our findings showed that the gNB algorithm outperformed other machine-learning algorithms, and that the top five important factors can sufficiently represent all involved factors in predicting the risk of LRTI in Chinese students aged 6-14 years. We believe that collective action is required to ensure students have access to immediate and effective treatment, with routine prevention and intervention as joint strategies. Last but not least, we must value, foster, and commit to shedding light on the interaction of food allergy and rhinitis, explore more carefully the differences among prediction models of risk factors for LRTI, and validate and improve the model in larger samples and more populations.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committee of Beijing University of Chinese Medicine. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS
ZZ planned and designed the study and directed its implementation. ZZ and WN drafted the protocol. MX, QW, YZ, BP, and MY obtained statutory and ethics approvals. MX and QW contributed to data acquisition. MX and WN conducted statistical analyses and wrote the manuscript. MX, QW, YZ, BP, MY, and XD did the data preparation and quality control. All authors read and approved the final manuscript prior to submission.