Validation of a Tool to Evaluate Drug Prevention Programs Among Students

Background: School-based prevention programs have been implemented worldwide with the intention of reducing or delaying the onset of alcohol and drug use among adolescents. However, their effects need to be evaluated, and validated and reliable questionnaires are essential for this purpose. This study aimed to verify the semantic validity and reliability of an instrument developed to evaluate the results of a government drug prevention program for schoolchildren called #Tamojunto2.0. Methods: This is a mixed methods study with quantitative (test-retest, confirmatory factor analysis and non-response evaluation) and qualitative analyses (focus group and field cards). The self-administered questionnaires were administered to a sample of 262 eighth-grade students (elementary school II) in 11 classes of four public schools in the city of São Paulo. Results: The level of agreement was substantial (Kappa 0.60–0.79) or almost perfect (Kappa > 0.8) for almost all questions about the use of marijuana, alcohol, cigarettes, cocaine, crack, and binge drinking. The model fit indices, for almost all secondary outcomes, indicated that the models underlying each scale, constituted by observed and latent variables, had a good fit. The focus groups and field cards provided high-quality information that helped the researchers identify the main difficulties in applying and understanding the questions. Conclusion: The questionnaire showed high factorial validity, reliability and understanding by adolescents. After the necessary changes, identified in this study, the questionnaire will be suitable to evaluate the results of the #Tamojunto2.0 program in a randomized controlled trial.

School-based prevention programs have been implemented worldwide with the intention of reducing or delaying the onset of drug use among adolescents, with the most successful models being the ones working on the development of life skills and normative belief changes in relation to drug use (Strøm et al., 2014). To follow international prevention guidelines, the Brazilian Ministry of Health, in partnership with the United Nations Office on Drugs and Crime, decided to invest in the cross-cultural adaptation of a school program called Unplugged (Faggiano et al., 2010), which was renamed to #Tamojunto. Unplugged is a drug use prevention program for adolescents between 12 and 14 years of age, consisting of 12 classes that use interactive methods to work with social and personal skills, knowledge about drugs, and normative beliefs (Van Der Kreeft et al., 2009).
The randomized controlled trial (RCT), conducted in 2014 and 2015 to evaluate the effectiveness of #Tamojunto in the Brazilian context, showed an iatrogenic result for the onset of alcohol consumption, suggesting inadequate cultural adaptations in the components about alcohol (Sanchez et al., 2018). Considering these negative results, the material was readapted so that it would once again reflect the central elements of the European Unplugged (Faggiano et al., 2010). This new version, called #Tamojunto2.0, will be evaluated in the Brazilian population in 2019 and 2020 to substantiate the government's decision to expand it to all federation units.
Although the RCT that evaluated the effect of the first version of the program used data collection tools adapted and validated for the Brazilian population (Prado et al., 2016), it was decided to readapt the instrument after the content adaptations and new international publications showing that drug programs should also be tested for mental health outcomes (Newton et al., 2018). Considering the scarcity of validated instruments to evaluate prevention programs in schools (Strøm et al., 2014) and the need to guarantee the validity of the study, it is necessary to have a consistent data collection methodology, which depends on adequate and precise measurement instruments (Alexandre and Coluci, 2011).
This study aimed to check the semantic validity and reliability of an instrument developed to evaluate the results of the #Tamojunto2.0 program to prevent drug use in students.

METHOD
Study Design
This is a mixed methods study (Creswell, 2009) with quantitative and qualitative analyses to evaluate the semantic validity, construct validity and reliability of a self-administered questionnaire applied in the classroom for eighth-grade adolescents (Prado et al., 2016). The data were obtained through three procedures: (1) administering the questionnaire (74 questions) to students in the classroom, (2) focus groups with students after they filled out the questionnaire, and (3) field notes regarding questionnaire administration in the classroom. The qualitative analyses emphasized the content of the questions raised by students and classroom procedures, while the quantitative analysis used (a) a method of measurement stability, (b) confirmatory factor analysis, and (c) identification of the most difficult questions to answer through "non-response." The stability of repeated responses was analyzed using two data collections with the same population and instrument, 15-20 days apart, through a "test-retest" evaluation. The collections occurred between September and November 2018. This study was approved by the Ethics Committee of the Federal University of São Paulo, number 2.806.301, and registered in the REBEC (Clinical Trials Registry of the Ministry of Health), RBR-8cnkwq. Written consent to participate in the study was obtained from the schools' directors before randomization and from students after randomization. All participants took part voluntarily after having given their free and informed consent based on the autonomy of adolescents guaranteed by the Brazilian Statute of the Child and Adolescent (Law No. 8069/1990). Moreover, parents were informed of the study by the directors and could recommend non-participation in data collection if they preferred. However, participation in the intervention was part of the school curriculum and was mandatory for all the students in the participating schools.

Sample
The questionnaires were administered to a sample of 262 eighth-grade students distributed in 11 classes of four public schools in the city of São Paulo. The public schools were selected from different regions of the city to intentionally represent different socioeconomic, school performance, and activity length realities (part-time and full-time schools).
Sample size calculation considered the reproducibility analysis, taking as reference a 5% error and considering a type I error of 5%, a type II error of 10%, and a possibility of detectable disagreement of up to 20%. According to Temel and Erdogan (2017), the required sample would be 246 subjects for these defined characteristics.

Procedures
The self-administered paper questionnaire was distributed to students from all classes by trained researchers. On the first page of the questionnaire, the students created a secret code by generating letters and numbers from personal information, which can only be decoded by the students themselves. These codes allow researchers to pair individual questionnaires at different study times and provide the participants with anonymity and confidentiality, which are essential in a study on illicit behavior.
The administrator filled out a field form for each class, totaling 11 completed forms, which consisted of a semi-structured instrument on the school's general information, group behavior, conflict between students, class size, number of students present and absent in the classroom during administration, and refusals to participate. They also included an open field to describe any difficulties faced during the administration process and the students' doubts and questions.
Five focus groups were conducted after data collection using questionnaires in the classroom. Each focus group was composed of 8-12 students, totaling 50 students in five groups, and two mediators (Stewart et al., 2007). The focus groups were audio-recorded and lasted a mean of 50 min. The classes were randomly chosen to participate in these focus groups with the objective of identifying difficulties in understanding the questions and the use of misinterpreted words or terms. The group mediator read the questions, and everyone discussed the understood meaning, unknown words, difficulties in answering, the layout, and the group's suggestions to improve the questionnaire.
The questionnaire data were entered into a custom typing system. The platform could be accessed by several typists simultaneously and allowed real-time control of each typist's work and typing quality. The internal coherence of the responses was assessed through consistency analyses to identify questionnaires that were incomplete or deliberately completed incorrectly.

Instrument and Measurements
One of the basic instruments used to formulate this questionnaire was tested by the EU-DAP (European Drug Addiction Prevention Trial) and used in previous studies on the effectiveness of Unplugged (Faggiano et al., 2010). The EU-DAP questionnaire collects information, knowledge, and opinions on substance use, emphasizing alcohol, tobacco, and other drug use. In Brazil, this questionnaire was translated, adapted (Prado et al., 2016) and used to assess the first version of the #Tamojunto program (Sanchez et al., 2018) with some questions replaced by questions elaborated from two questionnaires widely used in several studies in Brazil: the questionnaire of the World Health Organization used by CEBRID (Brazilian Information Center on Psychotropic Drugs) in the VI Survey of drug use among students (Carlini et al., 2010), and the PENSE questionnaire (National School Health Survey) used by the Ministry of Health (IBGE, 2016).
Our instrument has 74 questions. It begins with a question about gender (male and female) and three questions about age, weight, and height. The socioeconomic status was assessed by the scale of the Brazilian Association of Research Companies (Associação Brasileira de Empresas de Pesquisa, ABEP), a questionnaire with 15 dichotomous items (no/yes) (ABEP, 2018), which considers the head-of-household's education and the goods and services used, with scores ranging from 1 to 100 or in categories from A to E. Higher scores indicate a better economic status, and socioeconomic classes are ranked from A (highest) to E (lowest).
For the primary outcomes, there are 30 questions about the prevalence of substance use, as detailed below: 1) Four items for alcohol use: lifetime use (no/yes), age of first drink (never drank or space to enter age), use in the past 12 months (no/yes), and use in the past 30 days with four response options (no, 1-5 days in the month, 6-19 days in the month and 20 days or more in the month). 2) Three items for binge drinking: lifetime episode (no/yes), episode in the past 12 months (no/yes) and episode in the past 30 days with four response options (no, 1 time, 2 times, and 3-5 times). 3) Four items for tobacco use: lifetime use (no/yes), use in the past 12 months (no/yes), use in the past 30 days with four response options (no, 1-5 days in the month, 6-19 days in the month and 20 days or more in the month) and how many cigarettes are smoked per day with four response options (I don't smoke, 1-10 cigarettes a day, 11-20 cigarettes a day or more than 20 cigarettes a day). 4) Three items for marijuana use: lifetime use (no/yes), use in the past 12 months (no/yes) and use in the past 30 days with four response options (no, 1-5 days in the month, 6-19 days in the month and 20 days or more in the month). 5) Three items for cocaine use: lifetime use (no/yes), use in the past 12 months (no/yes) and use in the past 30 days with four response options (no, 1-5 days in the month, 6-19 days in the month and 20 days or more in the month). 6) Three items for crack use: lifetime use (no/yes), use in the past 12 months (no/yes) and use in the past 30 days with four response options (no, 1-5 days in the month, 6-19 days in the month and 20 days or more in the month). 7) Three items for the use of weight loss medication without a medical prescription: lifetime use (no/yes), use in the past 12 months (no/yes) and use in the past 30 days with four response options (no, 1-5 days in the month, 6-19 days in the month and 20 days or more in the month).
8) Three items for the use of tranquilizers without a medical prescription: lifetime use (no/yes), use in the past 12 months (no/yes) and use in the past 30 days with four response options (no, 1-5 days in the month, 6-19 days in the month and 20 days or more in the month). 9) Three items about inhalants (inhaling a product such as glue, ether, paint remover, gasoline, benzine, enamel, acetone, thinner, turpentine, paint, or aerosol deodorant to get a good sensation): lifetime use (no/yes), use in the past 12 months (no/yes) and use in the past 30 days with four response options (no, 1-5 days in the month, 6-19 days in the month and 20 days or more in the month). 10) One item about the use of other drugs not mentioned in the questionnaire (never used or space to enter the name of the drug).
In addition, the adaptation of this instrument to evaluate #Tamojunto2.0 included other scales to measure seven secondary outcomes: 1) The Olweus Bully/Victim Questionnaire, composed of two numerical scales, bullying suffered with eight dichotomous items (no/yes) and bullying practiced with nine dichotomous items (no/yes), to investigate school violence episodes (Solberg and Olweus, 2003). Bullying victimization ranges from 0 to 8 (the higher the score, the more intense the bullying victimization) and bullying perpetration ranges from 0 to 9 (the higher the score, the more intense the bullying perpetration). Each "yes" response is given 1 point, and a total score is computed by summing. Scores above 1 are considered a positive screen for bullying. 2) The SCOFF (Sick/Control/One stone/Fat/Food) questionnaire comprises five dichotomous items (no/yes) to measure eating disorder symptoms (Morgan et al., 1999). The SCOFF assesses five core aspects of anorexia (AN) and bulimia (BN). Each "yes" response is given 1 point, and a total score is computed by summing. Scores above 2 are considered a positive screen for AN or BN.
3) The Stunkard Silhouette Scale consists of nine female figures and nine male figures, numbered 1-9, ranging from very thin to very obese (Scagliusi et al., 2006), to collect data on body satisfaction. Participants were asked to choose one figure that they thought represented their current body and one that they thought to be the ideal body. An ideal discrepancy score was calculated as the difference between the two answers. 4) The Strengths and Difficulties Questionnaire (SDQ) comprises 25 items with three answer options (false, more or less true and true) to collect data on psychiatric symptoms (Goodman, 1997). The SDQ is a behavioral screening questionnaire, and the 25 self-report SDQ items are divided between five scales of five items each: emotional symptoms, conduct problems, hyperactivity-inattention, peer relationship problems, and prosocial behaviors. 5) Excessive use of electronic games was measured through nine dichotomous questions (no/yes), adapted from the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) (American Psychiatric Association, 2013). 6) Psychosis screening was assessed using four questions with three answer options (not true, a little bit true and very true), adapted from the DSM-5 (American Psychiatric Association, 2013). 7) The data relating to damage due to alcohol use were collected with a scale adapted from the EU-DAP questionnaire, with 11 dichotomous items (no/yes) about problems attributed to the use of alcohol, such as fights, accidents, and theft, among other problems.
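The scoring rules described above for the dichotomous scales and the silhouette scale can be illustrated with a minimal sketch. The function names are hypothetical, blank answers are represented as None, and the sign of the discrepancy (current minus ideal) is an assumption, since the text only states that the score is a difference between the two answers.

```python
def sum_score(answers):
    """Count the 'yes' answers (True) of a dichotomous scale.

    Blank answers (None) and 'no' answers (False) add nothing.
    """
    return sum(1 for answer in answers if answer is True)


def positive_screen(score, cutoff):
    """Return True when the score is above the cutoff, as worded in the text
    (above 1 for bullying, above 2 for the SCOFF)."""
    return score > cutoff


def silhouette_discrepancy(current_figure, ideal_figure):
    """Ideal discrepancy score for figures numbered 1-9.

    Assumption: current minus ideal; the direction is not stated in the text.
    """
    return current_figure - ideal_figure


# Hypothetical SCOFF answers of one student (five items).
scoff_answers = [True, False, True, True, False]
scoff_score = sum_score(scoff_answers)
print(scoff_score, positive_screen(scoff_score, cutoff=2))
print(silhouette_discrepancy(current_figure=6, ideal_figure=4))
```

Keeping the cutoff as a parameter lets the same helper serve the bullying scales and the SCOFF, which use different thresholds.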
The questionnaire also contains questions that evaluate the effect of program mediators, such as attitude, a scale with 10 dichotomous items (no/yes), decision making, a scale with nine dichotomous items (disagree/agree), beliefs about drugs, a scale with 11 items and four answer options (very likely, likely, unlikely and very unlikely), knowledge about drugs, a scale with six items and three answer options (no, yes and I do not know) and refusal skills, a scale with three items and four answer options (very likely, likely, unlikely, and very unlikely), all adapted from the EU-DAP questionnaire (Faggiano et al., 2010). However, it should be highlighted that the mediators were not analyzed regarding the reliability of the measurements in this study, but only regarding semantic understanding. More information on the structure and source of the instrument is presented in Table 1.

Quantitative Analysis
The reliability of the instrument regarding primary outcomes (frequency of drug use) was analyzed using the Kappa statistic, which measures the consistency level of categorical responses at two moments of questionnaire administration beyond the agreement that would be expected by chance. Therefore, the questionnaire was applied to the same students, with a 2-3 week interval, and the answers were matched using the Levenshtein algorithm, which measures the similarity between character strings for comparative purposes. Agreement between the two administration times was classified as almost perfect (> 0.79), substantial (0.60-0.79), moderate (0.40-0.59), regular (0.20-0.39), or low (0.0-0.19) (Landis and Koch, 2012).
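As an illustration of the test-retest agreement measure, the following minimal sketch computes Cohen's Kappa for one question answered at two moments and labels it with the bands listed above. It is not the software used in the study; the function names and the example answers are hypothetical, and values below 0.20 (including negative values, which the bands do not cover) are all labeled "low" here by assumption.

```python
from collections import Counter


def cohen_kappa(test, retest):
    """Cohen's Kappa for paired categorical answers from two administrations."""
    if len(test) != len(retest) or not test:
        raise ValueError("need two non-empty answer lists of equal length")
    n = len(test)
    # Observed proportion of identical answers.
    observed = sum(a == b for a, b in zip(test, retest)) / n
    # Agreement expected by chance from the marginal frequencies.
    first, second = Counter(test), Counter(retest)
    expected = sum(first[c] * second[c] for c in first) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when everyone gives the same single answer
    return (observed - expected) / (1 - expected)


def classify_kappa(kappa):
    """Agreement bands as listed in the text."""
    if kappa > 0.79:
        return "almost perfect"
    if kappa >= 0.60:
        return "substantial"
    if kappa >= 0.40:
        return "moderate"
    if kappa >= 0.20:
        return "regular"
    return "low"  # assumption: negative values are also treated as low


# Hypothetical lifetime-use answers of ten paired students.
test_answers = ["no"] * 8 + ["yes"] * 2
retest_answers = ["no"] * 7 + ["yes"] * 3
kappa = cohen_kappa(test_answers, retest_answers)
print(round(kappa, 3), classify_kappa(kappa))
```

In this example nine of ten answers agree (0.90 observed) while 0.62 agreement is expected by chance, which shows why Kappa is lower than raw agreement.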
Using the Mplus 7.4 statistical modeling program, we conducted confirmatory factor analysis (CFA) to provide evidence for the construct validity of the secondary outcome scales (bullying, eating disorder symptoms, psychiatric symptoms, psychosis screening, excessive use of electronic games and damage due to alcohol use). To assess the quality of fit, the comparative fit index (CFI), the Tucker-Lewis index (TLI), the weighted root mean square residual (WRMR), and the root mean square error of approximation (RMSEA) were used. The cutoff criteria used to determine the quality of fit were CFI and TLI near or above 0.90, WRMR near 1.0 and RMSEA near or below 0.08 (Little et al., 2013). The reliability of secondary outcomes (bullying, eating disorder symptoms, psychological symptoms, and excessive use of electronic games) was assessed through Cronbach's alpha at both data collection moments, since these measurements used continuous outcomes collected through scales. This method is considered conservative, especially in cases where the items on the scale are heterogeneous, dichotomous, or define multifactorial structures. This test estimates internal consistency from the item variances and the variance of the subjects' total scores, and the result is classified as excellent (> 0.9), good (> 0.8), acceptable (> 0.7), questionable (> 0.6), low (> 0.5), or unacceptable (<0.5) (Maroco and Garcia-Marques, 2006).
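To make the internal consistency measure concrete, the sketch below computes Cronbach's alpha from item variances and the variance of the total scores, then labels it with the bands listed above. It is only an illustration under stated assumptions: the function names and the example matrix are hypothetical, rows are assumed to be complete (no blank answers), and the study itself does not describe its implementation.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    items = list(zip(*rows))  # transpose: one tuple of scores per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def classify_alpha(alpha):
    """Consistency bands as listed in the text."""
    if alpha > 0.9:
        return "excellent"
    if alpha > 0.8:
        return "good"
    if alpha > 0.7:
        return "acceptable"
    if alpha > 0.6:
        return "questionable"
    if alpha > 0.5:
        return "low"
    return "unacceptable"


# Hypothetical answers (1 = yes, 0 = no) of five students to four items.
answers = [
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
alpha = cronbach_alpha(answers)
print(round(alpha, 3), classify_alpha(alpha))
```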
The proportion of answers left blank was evaluated in all questionnaires as a suggestive measure of the difficulty of the questions, that is, each question was subjected to a descriptive analysis to assess the proportion of adolescents not answering it.
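The descriptive non-response analysis amounts to a per-question proportion of blank answers. A minimal sketch follows, assuming each questionnaire is stored as a dictionary in which a blank answer is None or missing; the question labels and data are hypothetical.

```python
def non_response_rate(questionnaires, question):
    """Proportion of questionnaires in which the given question was left blank."""
    if not questionnaires:
        raise ValueError("no questionnaires to analyze")
    blanks = sum(1 for q in questionnaires if q.get(question) is None)
    return blanks / len(questionnaires)


# Hypothetical data: four questionnaires, two questions.
data = [
    {"weight": 52, "alcohol_lifetime": "no"},
    {"weight": None, "alcohol_lifetime": "yes"},
    {"alcohol_lifetime": "no"},  # weight missing entirely
    {"weight": 48, "alcohol_lifetime": "no"},
]
for question in ("weight", "alcohol_lifetime"):
    print(question, non_response_rate(data, question))
```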

Qualitative Analysis
Qualitative data analysis aimed at checking the semantic and operational validity of the #Tamojunto2.0 questionnaire through the analysis of field cards and focus groups. The analysis strategy chosen was grounded theory, because it allows categories of analysis to be constructed from the studied context (Strauss and Corbin, 2008), which is appropriate to clarify difficulties reported by students in answering the questionnaires. After one researcher transcribed the groups' audio recordings, another researcher read the transcripts in depth and summarized each focus group, and another member of the research group analyzed these summaries, highlighting common situations between groups. Thus, it was possible to determine three main categories of analysis: (1) semantic difficulties, (2) difficulties related to the structure and layout of the questions, and (3) environment and logistics characteristics that facilitated or hindered questionnaire answering.

Study Population Profile
During the first collection, 262 students answered the questionnaire, with 257 students answering it during the second collection. A total of 207 questionnaires were paired on both occasions using the Levenshtein algorithm. Considering as base only the students present on the first day of data collection, the proportion of paired questionnaires was 79%. The participants had a mean age of 13.74 years (SE 0.05), with a predominance of females (53.3%) and socioeconomic classes B and C (37.02 and 46.56%, respectively) (Table 2).
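The pairing of questionnaires by secret code can be sketched with a plain Levenshtein (edit) distance. This is only an illustration: the study does not report its matching threshold or tie-breaking rule, so the maximum distance of 1, the greedy closest-match strategy, the function names, and the example codes are all assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pair_codes(first_wave, second_wave, max_distance=1):
    """Greedily pair each first-wave code with its closest unused second-wave code.

    Assumption: codes further apart than max_distance are left unpaired.
    """
    remaining = list(second_wave)
    pairs = {}
    for code in first_wave:
        if not remaining:
            break
        best = min(remaining, key=lambda other: levenshtein(code, other))
        if levenshtein(code, best) <= max_distance:
            pairs[code] = best
            remaining.remove(best)
    return pairs


# Hypothetical secret codes; one student mistyped a character at retest.
print(pair_codes(["AB12C", "XY34Z"], ["AB12D", "QQ99Q"]))

# Proportion of paired questionnaires reported in the text: 207 of 262.
print(round(207 / 262 * 100))
```

A tolerance above zero lets small transcription slips in the self-generated code still be paired, at the cost of a small risk of false matches between similar codes.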

Non-response Data
The proportion of non-response had a systematic increase from question 60 onward. Simple, clear questions with only two answer options (yes and no) had a non-response ratio between 1 and 6%. However, when the respondent was confronted with a longer list of options, such as very unlikely, unlikely, likely, or very likely, this ratio varied between 2 and 15%. The questions about height and weight had the highest percentage of non-responders in the entire questionnaire, 18 and 19%, respectively.
Table 3 shows the internal agreement level by the Kappa index for primary outcomes. The agreement level was substantial or almost perfect for almost all questions about the use of marijuana, alcohol, cigarettes, cocaine, crack, and binge drinking.
The model fit indices of the secondary outcome scales are shown in Table 4. The model fit indices for bullying, electronic games, psychosis screening, and damage due to alcohol use indicated that our model, constituted by observed and latent variables, had a good fit. However, in the initial model for the eating disorder symptoms, some of the fit indices did not show adequate values. By inspecting the modification indices (MI), we found that the model's fit would be improved by allowing an association between the following observed measures: S5 with S1 (S5: Would you say that Food dominates your life?/S1: Do you make yourself Sick because you feel uncomfortably full?). These modifications improved the unidimensional model. The same occurred with the psychiatric symptoms scale, but, in this case, even with the modification indices, the final model's fit was not good.

Agreement Data
Questions related to secondary outcomes had good reliability (Cronbach's alpha higher than 0.7) for all outcomes in both segments, except for the SCOFF, which had a weak consistency (Cronbach's alpha lower than 0.600) in both segments (Table 5).

Qualitative Analysis Data
Direct Participatory Observation in the Classroom (Recorded on Field Sheets)
As for environment and logistics characteristics that facilitated or hindered questionnaire answering, potential barriers to questionnaire administration were identified in some classes. The students' misconduct and the use of cell phones in the classroom were notable factors related to the non-fulfillment of the task within the stipulated time (50 min). As for difficulties related to the structure and layout of the questions, the creation of secret codes in some classes lasted longer than expected (more than 5 min) due to understanding difficulties reported by some students during the task. In addition, semantic difficulties were also reported by students and registered in the field card by the administrators, such as ambiguous language, lack of concept or word understanding, and complex argumentation in some questions. For example, some questions about bullying caused doubts because the students did not understand whether they referred to the perpetrator or the victim. In addition, some students delivered the questionnaire with unanswered questions, reporting "fatigue." The mean absence rate among enrolled students was 30%. According to the report of one of the teachers, "a large number of absences is common," especially on Fridays, when absence can exceed 50% of the students enrolled in the eighth grade.

Focus Groups
Focus group data confirmed most of the problems recorded on field cards. As for the semantic barriers, it was observed that some words generated interpretation difficulties, such as the term "silhouette," which was interpreted by the students as both "the whole body" and "just the part of the waist." The name SABESP (Basic Sanitation Company of the State of São Paulo), which appeared as an example of water and sewage network in the ABEP socioeconomic module, also raised questions, and many students did not understand that it was related to "piped water." In the question related to drug beliefs, a negative statement ("Smoking marijuana is not addictive") also raised questions among many students, since they had to choose "yes" or "no" according to whether they agreed or disagreed with the statement; that is, answering "no" to a negatively worded statement seemed to confuse the students. As for difficulties related to the layout and structure of the questions, the four- to five-point Likert scale used in many questions was an important source of doubts because the students could hardly differentiate between "unlikely" and "very unlikely," suggesting a reduction to make the options more objective; "I thought it could have used another word, like right or wrong," reported one of the students.
The time used to complete the questionnaire was 70 min, 20 min more than the time recommended by the directors, who authorized questionnaire application for one class period (50 min).
Finally, the most inconsistent questions were the ones in the psychosis module, since the students wrongly understood the statements about seeing and hearing what others do not as simply being about the ability to see and hear.

DISCUSSION
This study evaluated the quality of a questionnaire built from other instruments already in use nationally and internationally, which will be applied in the RCT to evaluate the effectiveness of the #Tamojunto2.0 program and could be applied in the evaluation of other school-based drug use prevention programs. The semantic validity, construct validity and reliability of this questionnaire were investigated using qualitative and quantitative research methods with three data sources (structured questionnaire, focus groups, and field cards). The study showed that, as in previous studies (Beaton et al., 2000), this data source triangulation had high consistency levels between results and greater information complementarity, allowing a more meaningful understanding and clarification of difficulties related to the development process of the instrument and facilitating the adequacy of the final version of the questionnaire to be applied.
The findings related to response reliability, assessed by the respondents' ability to give the same answers in different instrument administrations, proved to be positive in most questions. Questions related to the main outcome, drug use by adolescents, presented a Kappa index that showed a substantial or almost perfect agreement level, corroborating Komro et al. (2004). In addition, the questionnaire provides a better understanding of drug-related issues than the one analyzed by Prado et al. (2016), which aimed to evaluate the Unplugged Program and presented a moderate agreement level. Only the question that referred to drug use in the previous month had moderate reliability. However, this finding may have occurred due to the period described in the question (30 days) and the interval between one collection and another (between 15 and 20 days). Therefore, part of the retest period does not overlap with the test period and may have generated different, although correct, responses.
The dimensional models of the secondary outcomes had a good fit, supporting construct validity. Only for the psychiatric symptoms scale was the final model's fit not good, but as it is a widely used scale with many validations, both in Brazil and internationally (Goodman, 1997; Goodman et al., 2003; Goodman and Goodman, 2009), we decided to keep it and analyze it further when we apply it in a larger sample. Moreover, the results related to the reliability of the scales used to evaluate the secondary outcomes, analyzed by Cronbach's alpha, showed good consistency, corroborating other studies assessing similar scales (Lemmens et al., 2015). This shows that the questionnaires included to evaluate these outcomes were well understood by the students and are therefore reliable for this evaluation. Only the SCOFF questionnaire showed weak consistency in both segments. However, some authors argue that a Cronbach's alpha between 0.5 and 0.6 can be acceptable, provided that the results obtained with the instrument are cautiously treated (Maroco and Garcia-Marques, 2006). Therefore, it was decided to keep this scale in the final instrument because the model fit indices were good and this questionnaire has already been validated in several countries (Kutz et al., 2020); in addition, it is a short and practical instrument.
The analysis of the focus group audios confirmed the difficulties observed in the field cards and showed the need to replace some words. The word "silhouette" in the question about body image will need to be replaced by "figure" to make it clear what it refers to. Likert-scale options, such as "very unlikely," "unlikely," "likely," and "very likely," need to be replaced by "no" and "yes," since high-quality questionnaires must prioritize simple sentences and avoid using difficult words or asking more than one question per sentence (Terwee et al., 2007). The low level of literacy in Brazil can, to some degree, be related to poor understanding of the questions, since the data from the Programme for International Student Assessment (PISA) showed that the performance of Brazilian students was below average in the reading competence test compared to students from other countries (INEP, 2015).
A high rate of non-responses, generally above 20%, generates results that are not representative of the population (Stavseth et al., 2019). Therefore, the final version of the questionnaire excluded questions about height and weight, as they had the highest percentage of non-responses of the entire questionnaire due to students' lack of information regarding these data. It is important to note that even with this exclusion, body dissatisfaction can still be assessed using the Stunkard Silhouette scale (Scagliusi et al., 2006). In addition, there was a systematic increase in non-responses from question 60 onward, and due to criticisms about the length of the questionnaire and students' tiredness, which led them to answer without attention and care, the number of questions in the new questionnaire should be reduced by 20%.
The combination of field card and focus group evaluations with Kappa analyses, CFA and Cronbach's alpha test/retest was especially important to identify the necessary changes to this questionnaire. After these changes are made, the questionnaire will be suitable to be used in the #Tamojunto2.0 program effectiveness study.