“GENYAL” Study to Childhood Obesity Prevention: Methodology and Preliminary Results

Objective: This article describes the methodology and summarizes some preliminary results of the GENYAL study, which aims to design and validate a predictive model, considering both environmental and genetic factors, that identifies children who would benefit most from actions aimed at reducing the risk of obesity and its complications.
Design: The study is a cluster randomized clinical trial with a 5-year follow-up. The initial evaluation was carried out in 2017. The schools were randomly split into intervention (nutritional education) and control schools. Anthropometric measurements and social, health, dietary and physical activity data of schoolchildren and their families are collected annually. A total of 26 single nucleotide polymorphisms (SNPs) were assessed. Machine learning models are being designed to predict obesity phenotypes after the 5-year follow-up.
Setting: Six schools in Madrid.
Participants: A total of 221 schoolchildren (6–8 years old).
Results: The prevalence of excess weight was 19.0, 25.4, and 32.2% according to Orbegozo Foundation, International Obesity Task Force and World Health Organization criteria, respectively. Associations were found between the nutritional status of children and maternal BMI [β = 0.21 (0.13–0.3), p (adjusted) <0.001], geographical location of the school [OR = 2.74 (1.24–6.22), p (adjusted) = 0.06], dairy servings per day [OR = 0.48 (0.29–0.75), p (adjusted) = 0.05] and 8 SNPs [rs1260326, rs780094, rs10913469, rs328, rs7647305, rs3101336, rs2568958, rs925946; p (not adjusted) <0.05].
Conclusions: These baseline data support the evidence that environmental and genetic factors play a role in the development of childhood obesity. After the 5-year follow-up, the GENYAL study aims to validate the predictive model as a new strategy to fight obesity.
Clinical Trial Registration: This study has been registered in ClinicalTrials.gov with the identifier NCT03419520, https://clinicaltrials.gov/ct2/show/NCT03419520.


INTRODUCTION
Obesity is a complex, chronic and multifactorial disease that arises from the interaction between genetic and environmental factors (1). The prevalence of overweight and obesity in children is rising every year. Specifically, according to the WHO, the number of overweight and obese children aged 0-5 years increased globally from 32 million in 1990 to 41 million in 2016, and it is expected to reach 70 million by 2025 if these trends continue (2). The situation in Spain is also alarming, with a prevalence of overweight of 23.2% (22.4% in boys and 23.9% in girls) and of obesity of 18.1% (20.4% in boys and 15.8% in girls), according to data from the ALADINO study carried out by the Spanish Agency for Consumer Affairs, Food Safety and Nutrition (3).
Childhood obesity usually leads to adulthood obesity, which increases the risk of developing certain diseases, such as hypertension, type 2 diabetes and cardiovascular diseases, prematurely (4)(5)(6). This early age has been identified as a key point for the implementation of healthy dietary and lifestyle patterns. Thus, the home and schools provide a useful environment to develop educational and lifestyle interventions for school-age children (7).
There is no doubt about the multifactorial etiology of obesity, in which socio-cultural, dietary, environmental and genetic factors are involved (8)(9)(10)(11). However, current knowledge is still insufficient to determine the relative importance of these factors, which are linked by a complex network of associations (12). In this regard, machine learning techniques are a powerful prediction tool because of their capacity to analyze large datasets. Machine learning comprises a set of algorithms that can characterize, adapt to, learn from and analyze data and make predictions from them, which may increase the knowledge of obesity and improve the precision with which the disease can be predicted. These techniques have been proposed as a potential tool to predict future excess body weight and its comorbidities. There are several predictive machine learning algorithms, such as neural networks, decision trees and random forests, and the choice among them should depend on the purpose of the study and the nature of its variables (13).
Considering all the above-mentioned aspects, the main objective of the GENYAL study is to design and validate a machine learning-based predictive model that identifies children who would benefit most from actions aimed at reducing the risk of obesity and its complications, considering both environmental and genetic factors, and applicable at the beginning of the school stage. The nutritional education delivered in the intervention schools will also be evaluated as part of the predictive model. This article describes the methods and analyses that will be applied. In addition, it summarizes some preliminary results obtained after the first year of data collection.

Type of Study and Duration
The present study is a cluster randomized clinical trial with a 5-year follow-up. The intervention is based on nutritional education, and the follow-up consists of annual anthropometric evaluations and data collection through questionnaires. Saliva samples were collected from all the schoolchildren at the initial evaluation (2017) in order to obtain genetic information. The final evaluation will be carried out 5 years after the initial evaluation, which corresponds to the end of primary school (Figure 1). The study is therefore expected to last 5 years, from the 2016-2017 academic year to 2021-2022. Table 1 provides a schedule of activities and interventions throughout the study.

Recruitment, Sample Size, and Sample Characteristics
Due to the nature of the study as a clinical trial, the large number of variables necessary for the design of the preventive model with machine learning (each of them of a very different nature), and the duration of 5 years, a statistically robust sample size could not be achieved. The Consejería de Educación e Investigación de la Comunidad de Madrid was responsible for the selection of six representative schools of the Autonomous Community of Madrid (ACM) (Spain) (two in the north, two in the center and two in the south of the city), considering the number of students per center and the average socioeconomic level of the districts and neighborhoods. The selection was therefore representative of the average income of the ACM households (14). All the School Boards approved participation in the study, which gave a total of 569 potential child participants from different districts of Madrid: Chamberí, Hortaleza, Carabanchel, Puente de Vallecas, and Moncloa-Aravaca.

Inclusion and Exclusion Criteria
The inclusion criteria to participate in the study were: being in 1st or 2nd grade of primary school and having an informed consent signed by at least one of the parents. Exclusion criteria were not attending school during the evaluation days or having planned not to stay at the school the following years.

Randomization
In order to avoid cross-contamination between intervention and control subjects, randomization was carried out by school center instead of individually. Thus, participating schools were randomly and proportionally stratified into two groups: intervention schools and control schools, considering the number of participants per center, their geographic area and their socioeconomic status.
The randomization procedure was carried out with the statistical software R version 3.4 (www.r-project.org).
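The school-level allocation described above can be sketched as follows. This is a minimal illustration in Python, not the R procedure used in the study: the school labels, the seed, and the use of geographic zone as the only stratum are assumptions made for the example.

```python
import random

def randomize_schools(schools_by_stratum, seed=2017):
    """Assign whole schools (clusters) to arms, balancing within each stratum.

    schools_by_stratum: dict mapping a stratum label to a list of school IDs.
    Returns a dict mapping school ID -> "intervention" or "control".
    """
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    allocation = {}
    for stratum in sorted(schools_by_stratum):
        schools = list(schools_by_stratum[stratum])
        rng.shuffle(schools)
        half = len(schools) // 2
        for i, school in enumerate(schools):
            allocation[school] = "intervention" if i < half else "control"
    return allocation

# Hypothetical labels: two schools per geographic zone, as in the study design.
strata = {
    "north": ["N1", "N2"],
    "center": ["C1", "C2"],
    "south": ["S1", "S2"],
}
allocation = randomize_schools(strata)
```

Because the unit of randomization is the school, every child in a school shares the same arm, which is what limits cross-contamination between intervention and control subjects.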

Ethical Aspects and Data Processing
Protocols and methodology used in the present study comply with the ethical principles for research involving human subjects laid down in the Declaration of Helsinki (1964) and its modifications. The study was approved by the Research Ethics Committee of the IMDEA Food Foundation (PI:IM024; Approval date: March 29th, 2016) and it has been registered in ClinicalTrials.gov with the identifier NCT03419520. School centers and families were informed in detail about the different stages of the project, both orally and in writing. Signed informed consent from at least one of the parents was collected by the researchers prior to the first evaluation. This document included a specific consent to DNA extraction and the evaluation of polymorphisms from the saliva samples. In addition, it included a section on the storage of the remaining samples as a registered collection, according to Spanish legislation (Royal Decree 1716/2011, of November 18th).
Data compiled throughout the study will be processed using a web application that applies dissociation criteria to anonymize the volunteers' data, in compliance with the current Spanish legislation (Organic Law 15/1999 of December 13th, on the protection of Personal Data), and may be used for scientific purposes such as publications and conferences. Only the researchers directly involved in the study will be allowed to access the data.

Selection of Single Nucleotide Polymorphisms
A total of 26 single nucleotide polymorphisms (SNPs) associated with a higher risk of early-age onset of obesity and its comorbidities were selected. The selection was made considering the biological activity of each SNP, Caucasian allele frequencies and the scientific evidence that supports the association between the presence of the polymorphism and the risk of developing overweight, obesity or its complications. The sum of the risk alleles will further be used to design a genetic risk score.
Different databases such as 1000 Genomes, HapMap, PubMed, GWAS Central, GWAS Catalog or Ensembl were used. Table 2 shows the selected 26 SNPs, which will be included in the predictive model.
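An unweighted genetic risk score of the kind mentioned above, computed as the sum of risk alleles, could look like the sketch below. The rs identifiers appear in this article, but the risk alleles and genotypes shown are placeholders for illustration only and are not the alleles used in the study.

```python
def genetic_risk_score(genotypes, risk_alleles):
    """Unweighted score: number of risk alleles carried (0, 1 or 2 per SNP).

    genotypes: dict rsID -> two-letter genotype such as "CT".
    risk_alleles: dict rsID -> single-letter risk allele.
    SNPs without a genotype for the child are skipped.
    """
    score = 0
    for rsid, risk in risk_alleles.items():
        genotype = genotypes.get(rsid)
        if genotype is None:
            continue  # not genotyped for this child
        score += genotype.upper().count(risk.upper())
    return score

# Placeholder risk alleles and genotypes, NOT taken from the study.
risk_alleles = {"rs1260326": "T", "rs780094": "T", "rs328": "C"}
child = {"rs1260326": "CT", "rs780094": "TT", "rs328": "CG"}
score = genetic_risk_score(child, risk_alleles)  # 1 + 2 + 1 = 4
```

A weighted score, in which each allele is multiplied by an effect size, would be a natural extension, but the article does not state that weights will be used.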

Questionnaires
Different questionnaires were designed based on other surveys used in similar studies to facilitate the comparison of the results. All of them are sent annually to families by email or on paper, according to the parents' preference, and are completed by at least one of the parents. The information collected is summarized in Table 3.
Regarding social, health and demographic data, parents annually complete a self-reported questionnaire that includes different personal questions based on the surveys used in the ALADINO and ELOIN studies (3,112).
Dietary information is gathered using a 48-h food record of 2 non-consecutive days, a weekday and a weekend day, as recommended by the European Food Safety Authority guidelines (113). Afterwards, the data are tabulated and analyzed using the DIAL software (Alce Ingeniería, Madrid, Spain) (114) in order to obtain information about macro and micronutrients.
Moreover, the adherence to the Mediterranean diet pattern is assessed using the "KIDMED Mediterranean Diet Quality Index" in addition to general questions about the dietary habits of the children and their parents. The KIDMED questionnaire consists of a total of 16 dichotomous questions that must be answered affirmatively or negatively to obtain a score (115).
Physical activity and free time data about the children and their parents are gathered using a questionnaire with different sections adapted and modified from the ALADINO (3) and the ELOIN (112) studies. In addition, a 48-h physical activity record is collected, corresponding to 24 h of a weekday and a complete weekend day (116). In the physical activity record, parents specify the time that their children spend on different activities during 24 h of a weekday and 24 h of a weekend day, including resting hours and activities of varying intensity (very light, light, moderate and intense). The time spent on each activity is multiplied by the corresponding activity coefficient defined by the WHO (117), and the products are added and divided by 24 to obtain the Individual Physical Activity Coefficient (IPAC). The weekday IPAC is then multiplied by 5 and the weekend IPAC by 2, and the two results are added and divided by 7, which gives the weighted mean physical activity per individual. The IPAC is then converted into a sex-specific Physical Activity Coefficient (PAC) using an equivalence between the IPAC and the PAC proposed by the Institute of Medicine (118). Finally, participants are classified as sedentary, lightly active, active or very active according to their PAC.
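The IPAC arithmetic can be illustrated with a short sketch. The activity coefficients below are placeholders chosen for the example; the study uses the values defined by the WHO (117), which are not reproduced here, and the recorded hours are invented.

```python
def daily_ipac(hours_by_activity, coefficients):
    """Time-weighted activity coefficient over one 24-h record."""
    total_hours = sum(hours_by_activity.values())
    if abs(total_hours - 24) > 1e-9:
        raise ValueError("a daily record must account for exactly 24 h")
    weighted = sum(hours * coefficients[activity]
                   for activity, hours in hours_by_activity.items())
    return weighted / 24

def weekly_ipac(weekday_ipac, weekend_ipac):
    """Weighted mean over the week: 5 weekdays and 2 weekend days."""
    return (5 * weekday_ipac + 2 * weekend_ipac) / 7

# Placeholder coefficients for illustration; not the WHO values.
coefficients = {"rest": 1.0, "very_light": 1.5, "light": 2.5,
                "moderate": 5.0, "intense": 7.0}
# Hypothetical records (hours per activity) for one child.
weekday = {"rest": 10, "very_light": 8, "light": 4, "moderate": 1.5, "intense": 0.5}
weekend = {"rest": 10, "very_light": 6, "light": 5, "moderate": 2, "intense": 1}

ipac = weekly_ipac(daily_ipac(weekday, coefficients),
                   daily_ipac(weekend, coefficients))
```

The conversion from IPAC to PAC is omitted because it depends on the sex-specific equivalence tables cited in the text.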
All these data and information are collected every year on equal terms.

Anthropometric and Blood Pressure Measurements
These data are collected in the school centers, early in the morning, by previously trained nutritionists, following standardized protocols and WHO international instructions for this age group (117). For the anthropometric measurements, children had to wear a T-shirt and gym shorts. All measures are taken twice, and the average is used for the analyses.
Height is determined using a Leicester height rod with an accuracy of 1 mm (Biological Medical Technology SL, Barcelona, Spain). Body weight and fat mass percentage are assessed using a BF511 Body Composition Monitor (BF511-OMRON HEALTHCARE UK, LT, Kyoto, Japan). Furthermore, fat mass percentage is classified according to the tables offered by OMRON Healthcare (119). Waist and brachial circumference measurements are taken using a non-elastic tape (KaWe Kirchner & Wilhelm GmbH, Asperg, Germany; range 0-150 cm, 1 mm of precision). The waist circumference measurements obtained are classified by percentiles according to Fernández et al. (120). Triceps skinfolds are taken following the International Society for the Advancement of Kinanthropometry guidelines (121) using a mechanical caliper (Holtain Ltd., Crymych, UK; 10 g/mm² constant pressure; range 0-39 mm; 0.1 mm of precision), and the results obtained are ranked according to the percentiles proposed by Frisancho (122).
Using these data, other variables of interest are calculated. In particular, BMI is calculated as the body weight divided by the squared height (kg/m²). There is no universal standard for classifying BMI values in the pediatric population; therefore, the results are ranked according to the percentiles of the Faustino Orbegozo Eizaguirre Foundation, reviewed in 2011 (120), the International Obesity Task Force, reviewed in 2000 (123), and the WHO, reviewed in 2007 (124). Overweight and obesity rates are combined into a single category called excess weight (EW). The arm muscular and fat areas are obtained using the equations proposed by Mataix Verdú and López Jurado (125) and López-Sobaler and Quintas Herrero (126), respectively. The protein and caloric reserves are calculated with the Frisancho equations (122). Waist/height ratio is calculated as waist circumference (cm)/height (cm) and is classified according to Panjikkaran et al. and Ashwell (127,128). Height/age index is rated in percentiles according to Fernández et al. (120).
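As a small worked illustration of the derived variables, the sketch below averages duplicate measurements and computes BMI and the waist/height ratio. The measurement values are hypothetical, and no classification cut-offs are included because those come from the cited reference tables.

```python
def mean_of_duplicates(first, second):
    """Average of the two readings taken for each measurement."""
    return (first + second) / 2

def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2; height is recorded in cm and converted to m."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

def waist_to_height(waist_cm, height_cm):
    """Dimensionless waist/height ratio, with both values in cm."""
    return waist_cm / height_cm

# Hypothetical duplicate readings for one child.
height_cm = mean_of_duplicates(124.8, 125.2)
weight_kg = mean_of_duplicates(25.1, 24.9)
child_bmi = bmi(weight_kg, height_cm)
child_whtr = waist_to_height(56.0, height_cm)
```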
For blood pressure monitoring, an automatic digital monitor is used (OMRON M3-Intellisense) using a cuff suitable for

Compiling Saliva Samples, DNA Extraction and Genotyping
Buccal smears were collected for DNA extraction following standardized protocols. For this purpose, a sterile swab free of human RNAse, DNAse and DNA (300263DNA-Hisopos Deltalab polystyrene and polyester) was used. Children had to have a clean mouth and avoid eating or drinking for 30 min prior to collection. Three samples were taken per child, each one identified with the number corresponding to the order of extraction, to ensure traceability. As the samples were collected, they were placed directly in refrigeration until all the children had been evaluated. Immediately afterwards, they were frozen at −80 °C until processing.
Genomic DNA was extracted from the buccal swabs using the Invisorb® Spin Tissue Mini Kit (Stratec), according to the manufacturer's instructions. Samples were lysed in the presence of proteinase K and a specific lysis buffer. The lysate was then purified and finally eluted in an EDTA-free solution. For genotyping, the DNA samples were loaded onto TaqMan® OpenArray® Real-Time PCR plates (Life Technologies Inc., Carlsbad, CA) preconfigured with the selected SNPs, with a specific probe for each allele labeled with a different fluorophore to determine the genotype. Loading was performed using the OpenArray® AccuFill™ System (Life Technologies Inc., Carlsbad, CA). Once the plates were loaded, PCR was performed and the chips were read in the QuantStudio® 12K Flex Real-Time PCR Instrument (Life Technologies Inc., Carlsbad, CA). Results were analyzed using the TaqMan® Genotyper software (Life Technologies Inc., Carlsbad, CA), which automatically assigns the genotype to each sample according to the amount of signal detected for each fluorophore.
Duplicate analysis was used to validate the genotyping results.

Design of Educational Tools and Implementation of the Nutritional Education Programme
For the implementation of the nutritional education programme in the "intervention schools", three different kinds of guides were designed aimed at parents, children and teachers. All this information was developed and adapted to the participants' age by the nutritionists from the IMDEA Food Foundation. This material is sent to parents and educational centers in different modules adapted to parents, students, and teachers. The same modules include different activities and topics each year according to the children's growth. The sending strategy follows a protocol, and it will be maintained until the end of the study, through email or regular delivery, as the receiver may prefer.
Moreover, some workshops are being carried out and are summarized in Table 4.
The validation of this tool is expected to be carried out through the impact generated over the years of the study, measured as the evolution of anthropometric variables and dietary habits in intervention schools compared with control schools. Moreover, parents and teachers will evaluate all the material throughout the study.

Statistical Analysis
Descriptive analyses of the baseline data were performed by computing absolute and relative class frequencies for the categorical variables, and the mean, median, standard deviation, interquartile range, maximum and minimum for the quantitative variables. To check the homogeneity of the two groups, t-tests were used for normally distributed quantitative variables, with the Mann-Whitney U-test as a nonparametric alternative, and Chi-square or Fisher's exact tests were used for categorical variables. Associations between anthropometric variables and dietary, social, health and SNP variables were assessed by linear or logistic regression. The Bonferroni correction was applied for multiple tests. In addition, for the SNP variables, Hardy-Weinberg equilibrium was tested by means of Chi-square tests. All analyses were conducted with R Statistical Software version 3.4.1. Statistical tests were two-tailed and used a 0.05 significance level.
Machine learning models will be derived to predict BMI from all the analyzed variables after the 5-year follow-up. Both classification (after dichotomization of the BMI) and regression models will be considered, and Random Forest will be applied. Random Forest has been reported to yield better predictive performance than decision tree or logistic regression techniques (13). The predictive power of the models will be evaluated, and both internal cross-validation and external validation with independent datasets will be implemented. Variable importance analyses will be performed in order to quantify the relative weights of the different variables in the prediction of BMI. The model will be iteratively improved by refitting it with the new data obtained in the successive yearly evaluations during the study.
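The Hardy-Weinberg check and the Bonferroni correction mentioned above can be sketched in a few lines. This is an illustrative Python version under simple assumptions (a biallelic SNP, a 1-degree-of-freedom Chi-square test, invented genotype counts and p-values), not the R code used for the study.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (1 df)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical genotype counts for one SNP in 221 children.
chi2, p_value = hwe_chi_square(n_aa=100, n_ab=90, n_bb=31)
# Hypothetical raw p-values from three association tests.
adjusted = bonferroni([0.001, 0.02, 0.04])
```

With 26 SNPs, the Bonferroni factor for the genetic associations would be 26 if all SNPs were tested, which helps explain why associations reported as unadjusted may not remain significant after correction.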
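The internal cross-validation step can be illustrated with the sketch below. To stay dependency-free it uses a mean-only baseline as a stand-in for the Random Forest model, so it shows only the evaluation loop; the BMI values, the number of folds and the error metric (mean absolute error) are assumptions made for the example.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_mae(y, k=5, seed=0):
    """Mean absolute error of a mean-only baseline, estimated by k-fold CV."""
    folds = k_fold_indices(len(y), k, seed)
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        # Stand-in for fitting a model on the training folds and predicting.
        prediction = statistics.mean(train)
        errors = [abs(y[i] - prediction) for i in test_idx]
        fold_errors.append(statistics.mean(errors))
    return statistics.mean(fold_errors)

# Hypothetical follow-up BMI values (kg/m^2), for illustration only.
bmi_values = [15.2, 16.8, 17.5, 14.9, 21.3, 18.0, 16.1, 19.4, 15.7, 22.6]
folds = k_fold_indices(len(bmi_values), k=5)
mae = cross_validated_mae(bmi_values, k=5)
```

In a cluster randomized design it may be preferable to keep all children from the same school in the same fold; that choice is not specified in the article and is noted here only as a consideration.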

RESULTS
Parents of 224 children agreed to participate in the study and signed the informed consent, which corresponds to a participation rate of 39% of the 569 potential participants. Finally, 221 children (116 girls and 105 boys) were evaluated, since three did not attend the initial evaluation (Figure 2). Of the children evaluated, 115 belonged to intervention schools and 106 to control schools. Tables 5 and 6 show the main baseline characteristics of the study sample by control or intervention group.
According to the preliminary results obtained after the first year of evaluation, 32.2% of the students presented EW according to the WHO criteria. This figure was lower when the IOTF standard (25.4%) or the national criteria of the Orbegozo Foundation (19.0%) were applied.

DISCUSSION
The GENYAL study to prevent childhood obesity is, to our knowledge, the first interventional trial in Spanish schoolchildren that aims to provide, through machine learning, an evidence-based preventive and therapeutic approach to early obesity.
The baseline data and the associations observed after the first analysis support the evidence that environmental and genetic factors play a role in the development of childhood obesity.
According to the results, each additional point in the mother's BMI was associated with an increase of 0.21 kg/m² in the child's BMI. This suggests that, among the multiple risk factors for the development of obesity in children, parental obesity is one of the most impactful, as a result of both genetic and environmental interactions. Children imitate their parents; therefore, the parents' dietary habits and PA are likely to be reproduced by their descendants (130).
The location of the school, which reflects the socioeconomic level, was closely related to the presence of EW, with risk decreasing from the southern to the northern area. These results are consistent with other studies showing that the socioeconomic status of the school correlates with the prevalence of overweight and obesity, since a lower status increases the likelihood that schoolchildren will follow a diet rich in energy-dense, low-cost foods and have fewer opportunities to practice sport (131). Regarding dietary aspects, dairy servings per day showed a protective effect against EW. This could be related to several factors, such as dairy being a source of calcium, peptides and other bioactive compounds, which have been studied for their role in appetite control and other mechanisms involved in weight regulation (132,133). These results highlight the role that this food group could play in the prevention of excess weight.
These factors have rarely been studied in the child population and social context of Madrid; we found a single study with similar sociodemographic characteristics (134). However, that study did not include an intervention or the collection of genetic data, so the GENYAL project is novel in this regard.
In this study, a significant association (data adjusted for sex and age) between the nutritional status of schoolchildren and 8 SNPs was found (rs1260326, rs780094, rs10913469, rs328, rs7647305, rs3101336, rs2568958, rs925946). In previous studies, these polymorphisms have been associated with adiposity traits and their related comorbidities (Table 2). Genetic factors play an essential role in the development of obesity (135). Thus, knowing which variants are associated with excess weight early in life could contribute to the early detection of obesity.
However, it is important to note that current knowledge is insufficient to determine the relative importance of these different factors. Therefore, new techniques are needed as predictive instruments (12). Currently, machine learning is considered an extremely valuable tool in the medical field, since it can provide diagnostic and early detection strategies for diseases through the analysis of large datasets (136). Prevention plays a crucial role in controlling the high prevalence of obesity, and machine learning techniques have already been used to predict BMI in children (137). However, the current predictive model would be the first to include obesity-related SNPs as genetic information, together with anthropometric, social and lifestyle variables.
According to the latest report of the Commission on Ending Childhood Obesity, the implementation of comprehensive programs that promote a healthy environment in schools is recommended, with the objective of ensuring that children grow well and develop healthy habits (138). Nevertheless, although Spain is one of the countries where the most intervention studies to prevent obesity have been carried out (139), policy strategies to prevent chronic conditions such as overweight and obesity are not well defined, and show very low evidence of effectiveness according to the latest data reviewed in the Cochrane Database (140).
According to the latest scientific research, the intervention studies in schools which include family and community spheres, implementing actions to promote healthy food and physical activity, are the most effective (5,46). This study has been designed to elaborate strategies and to work as a multidisciplinary team reinforcing the educational sphere of the participating children and their environment, school, and family.
One of the strengths of our study is the implementation, and subsequent validation, of educational tools for students, their parents and teachers, applying a nutritional education method that promotes healthy dietary habits and physical exercise, both in and outside schools. Validating these educational strategies is important because a large number of studies have reported contradictory results, which might be partly explained by a lack of statistical power to detect changes in outcomes related to adiposity or children's nutritional status (5). In the present study, the educational approach in the intervention schools will be maintained for 5 years, in line with the annual anthropometric assessments. Its utility will be evaluated taking into account the evolution of the anthropometric and dietary results collected annually. Moreover, the presence or absence of the educational program (intervention or control school) will be included as a dichotomous input variable in the predictive model, to evaluate its influence on the predicted BMI. This will allow us to detect differences in body composition between intervention and control schools and thus to assess the impact of educational support.
Another strength of the present study is the selection of 26 SNPs for the early prevention of obesity, based on an extensive literature review, as shown in Table 2. Nutritional genomic tools could be very useful in the research and prevention of obesity and could provide important support for public health applications. Obesity is a multifactorial illness in which the genetic variants involved are dispersed along the whole genome. Although SNPs have been cataloged as the best indicators to predict obesity risk (141), several studies suggest that much remains to be discovered. There is a lot of interest in predicting the appearance of chronic diseases at an early age (142). According to a recent review on precision nutrition (143), a genetic risk score may make it possible to determine the risk of developing obesity or other chronic pathologies related to the individual genetic component, and even to predict the expected weight gain resulting from exposure to different variables, such as specific diets.
On the other hand, we consider that the sample size used is one of the weak aspects in our study, pointing to the necessity to include new schools to increase the number of children. Nevertheless, from this first phase of the study, we expect to calculate the sample size needed to increase the statistical power and, consequently, to find solid associations between the studied variables. Furthermore, the classification of schools according to the area and the socioeconomic level widens the scope of the research, making it more representative of the city of Madrid. Similarly, the use of dietary and physical activity questionnaires may lead to reporting bias, but in the absence of better tools with low cost and high throughput, these records can offer valuable information, although it should be interpreted with caution.
After 5 years of follow-up, the GENYAL study aims to validate the machine learning predictive model that considers environmental and genetic factors in the obesity development, as well as the educational tools, to obtain new and potentially valuable data to increase our knowledge of the precipitants of childhood obesity and their relative importance to design preventive protocols at an early age based on machine learning models. With a view to the future perspective of continuity of this study, in addition to increasing the sample size to validate the results obtained, the possibility of implementing personalized nutritional education interventions is proposed to improve adherence and efficacy by applying the novel concepts provided by studies on precision nutrition.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary materials, further inquiries can be directed to the corresponding author/s.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Ethics Committee of Fundación IMDEA-Food. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS
VL-K was the principal investigator and was responsible for the study and protocol design. HM-P helped design the protocol and draft the manuscript. HM-P, EA-A, RI, and IE-S were responsible for data collection. GC conducted the statistical analysis of the data. SM contributed to the management of genetic samples. JM, GR, and AR supervised the final compilation of the manuscript and provided scientific advice and consultation. All authors read and approved the final manuscript.

FUNDING
This study was supported by Consejería de Educación, Universidades y Ciencia de la Comunidad de Madrid, Dirección General de Educación Infantil, Primaria y Secundaria.