Assessing Executive Function in Adolescence: A Scoping Review of Existing Measures and Their Psychometric Robustness

Background: There is much research examining adolescents' executive function (EF), but there is little information about the tools that measure EF, in particular their preference of use, reliability, and validity. This information is important for helping both researchers and practitioners select the most relevant and reliable measure of EF to use with adolescents in their context. Aims: We conducted a scoping review to: (a) identify the measures of EF that have been used in studies conducted among adolescents in the past 15 years; (b) identify the most frequently used measures of EF; and (c) establish the psychometric robustness of existing EF measures used with adolescents. Methods: We searched three bibliographic databases (PsycINFO, Ovid MEDLINE, and Web of Science) using the key terms “Adolescents,” “Executive Functions,” and “measures”. The search covered research articles published between 1st January 2002 and 31st July 2017. Results: We identified a total of 338 individual measures of EF from 705 eligible studies. The vast majority of these studies (95%) were conducted in high income countries. Of the identified measures, 10 were the most frequently used, together accounting for nearly half (44%) of the total frequency of usage of all reported measures of EF. These are: Digit Span (count = 160), Trail Making Test (count = 158), Behavior Rating Inventory of Executive Function (count = 148), Wisconsin Card Sorting Test (count = 140), Verbal Fluency Tasks (count = 88), Stroop Color-Word Test (count = 78), Classical Stroop Task (count = 63), Color-Word Interference Test from the Delis-Kaplan battery (count = 62), Rey-Osterrieth Complex Figure Test (count = 62), and Original Continuous Performance Test (count = 58). In terms of paradigms, tasks from the Span (count = 235), Stroop (count = 216), Trails (count = 171), Card sorting (count = 166), Continuous performance (count = 99), and Tower (count = 94) paradigms were frequently used. Only 48 of the 705 included studies reported the reliability and/or validity of measures of EF used with adolescents, and these were conducted almost exclusively in high income countries. Conclusion: We conclude that there is a wide array of measures for assessing EF among adolescents. Ten of these measures are frequently used. However, evidence of the psychometric robustness of measures of EF used with adolescents remains too limited to support the validity of their usage across different contexts.

INTRODUCTION

General Background
Executive function (EF), also known as executive control or cognitive control, is an umbrella term that describes a set of interrelated but distinct cognitive abilities. These cognitive abilities, mediated by the prefrontal cortex (Siddiqui et al., 2008), include, but are not limited to: planning, shifting (i.e., flexibility of thought and action), fluency (i.e., generation of new responses), problem solving, decision making, self-regulation, attentional control, working memory (i.e., concurrent remembering and processing), inhibitory control, and cognitive flexibility (Miyake and Friedman, 2012; Burnett et al., 2013; Costanzo et al., 2013).
Currently, consensus is lacking as to the precise components of EF since it is a multi-faceted construct. Converging research (e.g., Collins and Koechlin, 2012; Lunt et al., 2012; Miyake and Friedman, 2012; Hall and Marteau, 2014; Karbach and Unger, 2014) suggests that EF may be conceptualized best as comprising three distinct yet related "core" dimensions: working memory, inhibitory control, and cognitive flexibility. Other authors (e.g., Tsermentseli and Poland, 2016; Zimmerman et al., 2016; Poon, 2017) have described EF in terms of "cool" and "hot" components. Cool cognitive skills are elicited under relatively abstract, decontextualized, and non-emotional conditions (Peterson and Welsh, 2014) and require logic and critical analysis (Rubia, 2011). Examples include planning, verbal reasoning, problem-solving, sequencing, cognitive flexibility, working memory, the ability to sustain attention, behavioral monitoring, and inhibition. Hot cognitive skills, in contrast, are elicited in contexts that require personal interpretation where emotions, motivation, or a tension between instant gratification and long-term rewards are generated (Zelazo and Carlson, 2012). Affective cognitive abilities such as social cognition, emotional regulation, affective decision making, and the ability to delay gratification are posited as aspects of hot EF.
Instead of focusing on individual elements of EF, other investigators in the field have chosen to organize their research around theoretical frameworks of EF. For instance, Burnett et al. (2013), in reviewing the literature on EF and everyday behavior, adopted the Executive Control System conceptual framework (Anderson, 2002; Anderson and Reidy, 2012). This conceptual framework categorizes EF into four broad domains: (i) information processing; (ii) attentional control; (iii) cognitive flexibility; and (iv) goal setting, each consisting of various components tapping into EF. On one hand, such a broad categorization removes the need to focus on individual components when studying EF. On the other hand, this breadth may sacrifice precision about the exact construct of EF being studied.
Despite the current lack of consensus about the precise components of EF, it is generally agreed that EF is important for enabling an individual not only to control their emotions and interact socially (Anderson, 2002; Xanthopoulos et al., 2015) but also to engage in independent, purposeful, and goal-oriented behavior (Lezak et al., 2004).

Executive Function in Adolescence
EF skills play an important role in shaping an adolescent's behavior and promoting her/his socio-emotional and educational competencies (Riggs et al., 2006; Bierman et al., 2008). An important aspect of EF is the ability to respond adaptively in circumstances that prime inappropriate and/or prepotent responses, which can lead to impetuous acts or errors in judgment (Prencipe et al., 2011). Adolescence, a period of increasing autonomy, may be a time of particular vulnerability to such errors, partly because EF continues to develop throughout this period (Best and Miller, 2010; Taylor et al., 2015). Moreover, the transition to adolescence is often followed by a new set of challenging responsibilities and self-regulatory demands, for example in educational and social spheres (Burnett et al., 2013), requiring greater reliance on emerging cognitive control.
It is noteworthy that EF in childhood and adolescence is a predictor of adult productivity and future life outcomes (Diamond, 2013). Therefore, the need to monitor, screen and intervene for EF problems early in life cannot be overemphasized.

Tools Used to Assess Executive Function at Adolescence
The study of EF in youth populations has received special attention in recent years, given that EF influences their behavioral, social, emotional, and academic outcomes (Arán Filippetti and Richaud, 2015). A wide range of performance-based measures of EF exists and has been used with the adolescent subpopulation for years. These include tasks such as the Wisconsin Card Sorting Test (WCST; Heaton, 1981); the Trail Making Test (TMT; Reitan, 1992); and the Stroop Color-Word Test (SCWT; Golden and Freshwater, 1978). To assess aspects of EF comprehensively, some performance-based EF tests are administered as a set, in neuropsychological batteries such as the Delis-Kaplan Executive Function System (D-KEFS; Delis et al., 2001), the Cambridge Neuropsychological Test Automated Battery (CANTAB; Cambridge Cognition Ltd, 2006), the Behavioral Assessment of the Dysexecutive Syndrome for Children (BADS-C; Emslie et al., 2003), and a developmental neuropsychological assessment battery (NEPSY; Korkman et al., 1998).
In the new millennium, researchers have also begun to broaden ways of assessing EF among children and adolescents by including self- or informant-reported questionnaires designed to index children's everyday EF skills. Examples of such EF rating scales include the Behavior Rating Inventory of Executive Function (BRIEF; Gioia et al., 2000; Guy et al., 2004), the Dysexecutive Questionnaire for Children (DEX-C; Emslie et al., 2003), the Amsterdam Executive Function Inventory (AEFI; Van der Elst et al., 2012), and most recently the Dynamic Occupation Assessment of Executive Function (DOAEF; Chubarov et al., 2015).

Rationale for the Present Study
Despite the existence of a wide array of neuropsychological measures of EF for use with adolescents, little is known about the most preferred (frequently used) measures for this subpopulation. Where literature reviews have been reported, they have been limited to a specific adolescent sub-population, such as those living with cerebral palsy (see Pereira et al., 2018). Furthermore, synthesized and summarized information about the psychometric robustness of existing EF measures for use with adolescents, i.e., their reliability and ecological validity, is lacking, although such information is available for child (Henry and Bettenay, 2010) and adult populations (Pickens et al., 2010). For researchers, neuropsychologists, and related practitioners selecting a measure of EF for research or clinical purposes, current information on the preference, reliability, and validity of EF measures for use with adolescents is essential for making an informed decision.
The aim of this study is therefore to address the abovementioned knowledge gaps. Specifically, the study examines the following research questions:
1. Which measures have been used to assess EF among adolescents within the past 15 years (between January 2002 and July 2017)?
2. Among the identified measures of EF, which are the most preferred or frequently utilized?
3. What is the psychometric robustness of the existing EF measures for use with adolescents?

METHODS
A scoping review was undertaken. Scoping reviews are useful for mapping out and summarizing existing literature on a specific topic in order to assist researchers in identifying the extent, range, and nature of the current research evidence (Levac et al., 2010). Our focus was on measures of EF used in studies conducted among adolescents and reported within the last one and a half decades, the aim being to capture the most recent evidence in the field. The methodologically rigorous scoping review framework proposed by Arksey and O'Malley (2005) was applied in the current study. This framework involves five key phases: (i) identifying the research question(s); (ii) identifying relevant studies; (iii) study selection; (iv) extracting and charting the data; and (v) collating, summarizing, and reporting the results. Our research questions are listed in the introduction section.

Identifying Relevant Studies
A search was conducted in three electronic bibliographic databases: PsycINFO, Ovid MEDLINE, and Web of Science. The search terms comprised the key words "Adolescents," "Executive Functions," and "measures," which were combined using the AND Boolean operator. Synonyms for each of the key terms were combined using the OR Boolean operator (see Appendix 1 in Supplementary material). The search was limited to peer-reviewed articles in the English language published between 1st January 2002 and 31st July 2017. Where a database allowed, we restricted the search to the adolescent age group of interest (13-17 years). Dissertations and book chapters were filtered out. Table 1 summarizes the criteria used in the selection of eligible studies for our scoping review. Four authors (MKN, DS, AM, and EC) screened the titles, abstracts, and full-text articles for study eligibility. Where disagreement or doubt arose concerning inclusion or exclusion of an article, the four re-evaluated the article and reached a consensus. Such consensus decisions were needed on 25 occasions. Figure 1 illustrates the selection process of the articles.
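The Boolean structure of the search (OR within a concept, AND across concepts) can be sketched as follows. This is a minimal illustration, not the actual search strategy: the three key terms come from the review, but the extra synonyms and the function name `build_query` are hypothetical placeholders, since the real synonym lists are in Appendix 1 and each database has its own query syntax.

```python
def build_query(concepts):
    """Join synonyms with OR inside each concept, then join concepts with AND."""
    groups = []
    for synonyms in concepts:
        quoted = [f'"{term}"' for term in synonyms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# First term in each list is a key term from the review;
# the second term is a hypothetical synonym used only for illustration.
concepts = [
    ["Adolescents", "Teenagers"],
    ["Executive Functions", "Cognitive control"],
    ["measures", "assessment"],
]

query = build_query(concepts)
print(query)
```

Keeping each concept as its own list makes it easy to add or remove synonyms without touching the AND/OR logic.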

Data Extraction
From each study fulfilling the inclusion criteria, we extracted the author, year, and journal of publication; the country in which the study was conducted; the measure of EF used in the study; and the psychometric properties of the measure (if documented). These data were charted in Microsoft Office Excel (version 2016). For psychometric data, we extracted Cronbach's alpha, the intraclass correlation coefficient (ICC), or any other correlation coefficient, if reported, when documenting the reliability of a given measure. Where a study explored the validity of a given measure of EF, we documented the type of validity examined, such as construct, content, criterion, concurrent, divergent, or convergent validity, alongside supporting statistics.
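For reference, Cronbach's alpha, the internal-consistency statistic extracted here, is conventionally defined for a scale of k items as shown below. The notation is introduced for illustration and is not taken from the included studies: \sigma_{i}^{2} is the variance of item i and \sigma_{X}^{2} is the variance of the total score.

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{X}^{2}}\right)
```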

Analysis
We first counted the number of individual measures of EF identified from the review. Then, we computed the frequency of use of each individual measure and summed the frequencies.
To describe the most preferred or frequently used measures of EF, we developed an a priori working definition: a measure of EF should have accounted for ≥2.5% of the summed frequency of usage of all individual measures of EF (equivalent to a frequency count ≥58 from different included publications). For this analysis, percentage frequencies and cumulative percentage frequencies of measures of EF were computed in Microsoft Office Excel (version 2016). We performed the computations using two approaches. First, we computed the frequency of usage of individual measures of EF (i.e., how frequently an individual measure was used across all the included studies). Second, we grouped measures of EF that we deemed to have a similar underlying principle of measurement (similar underlying latent factor) and computed the frequencies based on this grouping. Data on psychometric properties from the eligible studies were abstracted into a table in Microsoft Office Word (2016). We also coded the countries (and their respective continents) in which the included studies were conducted. These countries were categorized according to the World Bank's income ranking (World Bank, 2017). Data were then imported into the STATA (version 14.0) statistical software package (StataCorp, 2015) for univariate analyses (frequency and percentage distribution).
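The frequency computation described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' Excel workbook: the ten counts and the summed frequency of 2,328 are those reported in this review, while the names `frequency_table`, `TOTAL_COUNT`, and `MIN_COUNT` are hypothetical. Because 2.5% of 2,328 is 58.2, the sketch applies the count cutoff (≥58) that the review states as equivalent, rather than the raw percentage.

```python
TOTAL_COUNT = 2328  # summed frequency of usage of all 338 measures (as reported)
MIN_COUNT = 58      # count cutoff the review gives for "frequently used"

# Usage counts of the ten most frequently used measures, as reported in the review
counts = {
    "Digit Span": 160,
    "Trail Making Test": 158,
    "Behavior Rating Inventory of Executive Function": 148,
    "Wisconsin Card Sorting Test": 140,
    "Verbal Fluency Tasks": 88,
    "Stroop Color-Word Test": 78,
    "Classical Stroop Task": 63,
    "D-KEFS Color-Word Interference Test": 62,
    "Rey-Osterrieth Complex Figure Test": 62,
    "Original Continuous Performance Test": 58,
}


def frequency_table(counts, total):
    """Return (name, count, percent, cumulative percent) rows, most used first."""
    rows = []
    cumulative = 0
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for name, count in ordered:
        cumulative += count
        rows.append((name, count, 100 * count / total, 100 * cumulative / total))
    return rows


table = frequency_table(counts, TOTAL_COUNT)
frequent = [row for row in table if row[1] >= MIN_COUNT]

for name, count, pct, cum_pct in frequent:
    print(f"{name:<50} {count:>4} {pct:5.1f}% {cum_pct:5.1f}%")
```

The last row's cumulative percentage corresponds to the share of total usage accounted for by the ten measures together.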

Measures of EF Used With Adolescents
This scoping review identified 705 eligible studies, as shown in the study selection flowchart (Figure 1). From these studies, we identified a total of 338 individual measures that have been used to assess an aspect of EF among adolescents aged 13-17 years (see Appendix 2 in Supplementary material). Most studies used multiple measures, and the total frequency count of all the identified individual measures of EF was 2,328. The majority of these measures were performance based, with only 13 of the 338 identified being self- or informant-reported rating scales of EF.
Of the 338 measures of EF identified, 10 were most frequently used (Table 2). The cumulative percent frequency of these 10 frequently used individual measures of EF was 43.7%, nearly half of the total frequency count of usage of all identified measures (see Table 2). Appendix 3 in Supplementary material presents a summary of the administration procedures of these 10 most frequent measures of EF.
Of the 338 individual measures identified, we grouped 72 tasks into 12 paradigms of EF, namely Cancellation, Card sorting, Continuous Performance Tests (CPT), Go/No-go, Flanker, Hayling & Brixton, Maze, N-back, Span, Stroop, Tower, and Trails tasks (Table 3). These paradigms consisted of individual measures of EF that assess a common underlying latent factor. The frequency count for these paradigms was 1,116 (47.9% of the total frequency count of all identified measures). Tasks from the Span (count = 235), Stroop (count = 216), Trails (count = 171), Card sorting (count = 166), CPT (count = 99), and Tower (count = 94) paradigms were frequently used. Figure 2 shows how the identified measures of EF were categorized. A breakdown of the results from the analysis of the regional and income ranking of countries where the EF measures were utilized is presented in Appendix 4 in Supplementary material. In summary, the majority of the EF assessments among adolescents aged 13-17 years were conducted either in North America (n = 325, 46.1%) or Europe (n = 277, 39.3%). Consequently, the included studies were mainly conducted in high income countries (n = 667, 94.6%), with low income countries accounting for a meager 0.3%.
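The percentages in this paragraph follow directly from the reported counts, and the arithmetic can be checked with a short script. This is only a sanity-check sketch: all numbers are those reported above, while the helper `pct` and the variable names are hypothetical.

```python
TOTAL_COUNT = 2328    # summed usage count of all identified measures
TOTAL_STUDIES = 705   # eligible studies in the review


def pct(part, whole):
    """Percentage rounded to one decimal place, matching the reporting style."""
    return round(100 * part / whole, 1)


paradigm_total = 1116  # usage count across all 12 paradigms
named_paradigms = {
    "Span": 235,
    "Stroop": 216,
    "Trails": 171,
    "Card sorting": 166,
    "CPT": 99,
    "Tower": 94,
}

named_sum = sum(named_paradigms.values())
other_paradigms = paradigm_total - named_sum  # left for the six remaining paradigms

paradigm_share = pct(paradigm_total, TOTAL_COUNT)
north_america = pct(325, TOTAL_STUDIES)
europe = pct(277, TOTAL_STUDIES)
high_income = pct(667, TOTAL_STUDIES)

print(paradigm_share, named_sum, other_paradigms)
print(north_america, europe, high_income)
```

The difference between the paradigm total and the six named paradigms is the usage count left for the six paradigms whose counts are not listed in the text.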

Psychometric Robustness of Identified Measures of EF
Of the 705 studies included in the scoping review, only 48 reported an aspect of reliability and/or validity of a measure of EF used with the adolescent sub-population in a given study setting. These study settings were all high-income countries except for two studies (Wong et al., 2012; Malek et al., 2013) that were conducted in the upper middle-income settings of Cuba and Iran. More than half of these 48 studies (n = 28) reported the psychometric characteristics of self- or informant-reported rating scales, with n = 22 specifically reporting the psychometrics of the BRIEF (Gioia et al., 2000; Guy et al., 2004; Table 4). Reported internal consistency of the BRIEF ranged from 0.65 to 0.98 in the context of high income countries (both informant- and self-reports; see Table 4 for details). Only one study, from the United States (Rose and Holmbeck, 2007), reported the inter-rater reliability for the BRIEF (informant report version), which was excellent (0.96 to 0.98). The test-retest reliability for the BRIEF ranged between 0.81 and 0.86. Validity aspects of the BRIEF that were examined across studies included construct, concurrent, and discriminant validity. The presented results, though from the context of high income countries, indicate that the BRIEF is a valid measure of EF (see Table 4). The only exception was one study from the Netherlands (Huizinga and Smidts, 2011), in which the Root Mean Square Error of Approximation (RMSEA) for the parent-report version of the BRIEF (an estimate of construct validity) was 0.11, which is greater than the recommended value of <0.06 (Thompson, 2004). However, an alternative estimate, the Non-Normed Fit Index (NNFI), was excellent at 0.92.
The six additional rating scales with reported psychometrics were: the Dynamic Occupation Assessment of Executive Function (DOAEF; Chubarov et al., 2015); the Diabetes Related Executive Functioning Scale (DREFS; Duke et al., 2014); the Behavioral Inhibition System and Behavioral Activation System (BIS/BAS; Carver and White, 1994); EpiTrack Junior® (Kadish et al., 2013); the Amsterdam Executive Function Inventory (AEFI; Van der Elst et al., 2012); and the Ballet Executive Scale (BES; Wong et al., 2012). The reported reliability and validity of each of these six measures of EF are presented in Table 4. In summary, their internal consistency ranged between 0.60 and 0.97 (acceptable to excellent); only one study (Chubarov et al., 2015) reported inter-rater reliability and test-retest reliability (for DOAEF), as 0.97 and 0.91, respectively; and only one study reported inter-informant reliability (for DREFS), as 0.73. Validity aspects that were examined for some of these six measures included construct, convergent, discriminant, and concurrent validity. The presented results indicated that these measures were valid (see Table 4).
Psychometric characteristics of performance based measures, reported from the remaining 20 studies, are shown in Table 4. Unlike for some EF rating scales (e.g., the BRIEF, DOAEF, DREFS, and AEFI), where both reliability and validity aspects were reported (Table 4), studies reporting on psychometric characteristics of performance based measures of EF reported either reliability or validity, not both. The exception was three studies (Chevignard et al., 2010; Malek et al., 2013; Pesce et al., 2016) in which both reliability and validity aspects of the children's cooking task (CCT), SCWT, and the random number generation (RNG) task were reported (Table 4). These were also the only 3 of the 20 studies that reported on validity. The discriminant and concurrent validity of the CCT were established in the study by Chevignard et al. (2010); discriminant validity of the SCWT was established in the study by Malek et al. (2013); and construct validity of the RNG task was established in the study by Pesce et al. (2016). The performance based measures that were among the 10 frequently used measures of EF (Table 2) had some psychometric characteristics presented (though not extensively, and more often reliability than validity), except for the CPT and the D-KEFS Color-Word Interference Test (see Table 4 for details). Poor reliabilities were also reported for some complex executive tasks, such as the SCWT (test-retest as low as 0.37) in the study by Malek et al. (2013) and the Tower test (internal consistency of 0.48) in the study by McAuliffe et al. (2008).

DISCUSSION
We carried out a scoping review of measures of EF covering the period between 1st January 2002 and 31st July 2017. We wanted to know three things. First, which measures have been used to assess executive function of adolescents aged 13-17 years. Second, of the identified measures of EF, we were interested in knowing which measures stood out or dominated the field in terms of preference. Lastly, we wanted to establish evidence on the psychometric robustness of measures of EF currently used with the adolescent sub-population.

Preferred Measures of Adolescent EF
We observed that a range of individual EF measures (largely performance based) is currently in use with young people aged 13-17 years, although 10 of these measures seem to dominate. In addition, there is a range of individual tasks of EF with a similar underlying principle of measurement (similar latent factor) that have been used to assess EF among adolescents. We grouped these into paradigms to get a better understanding of which groups of tasks are frequently selected. We found that tasks from 12 paradigms were often used and that tasks from the Card sorting, CPT, Span, Stroop, Tower, and Trails paradigms met our criterion of being frequently used, although in different variations, to assess adolescent EF.
In this review, most measures of EF currently in use with the adolescent sub-population were performance based. Only 13 of the 338 identified individual measures of EF were rating scales. The same pattern held among the 10 dominant measures of EF, where the Behavior Rating Inventory of Executive Function (BRIEF; Gioia et al., 2000; Guy et al., 2004) was the only rating scale. The preference for using performance based measures may reflect either the perceived higher reliability or validity of these measures, or the absence of a wide range of informant/self-report based measures. We favor the latter explanation for several reasons, drawn from our findings and from the wider literature.
Firstly, from our findings, existing evidence on psychometric robustness of such performance based measures remains scanty (see Table 4 and subheading on psychometric robustness of measures below). Relatedly, the extensive use of experimental paradigms for EF assessment has been criticized because they capture mainly performance at either the pathological or impairment level (Whyte et al., 1996). As a result, most end up having limited functional and ecological validity (Chan et al., 2008). Previous research has also found rating scales such as the BRIEF to be sensitive to changes in executive function even in the absence of changes in performance based measures (Cummings et al., 2002).
Secondly, our findings show that the range of EF rating scales currently available for use with adolescents is limited. Out of 338 identified measures of EF, only 13 were rating scales. These findings suggest a need for further development or adaptation and validation of measures of EF that are informant/self-report based. The availability of many validated EF rating scales will give researchers and clinicians a range of options to choose from and, most importantly, enhance the functional and ecological validity of their measurement (Chan et al., 2008).
Another reason why researchers and practitioners may prefer performance based measures of EF is that some assess multiple components of EF. Among the identified dominant measures, most of the performance based measures tap into more than one aspect of EF. For example, the Trail Making Test (TMT) assesses domains such as psychomotor speed, cognitive flexibility, and working memory (Reitan, 1992), while the Wisconsin Card Sorting Test (WCST) is believed to examine aspects such as perseveration, abstract reasoning, working memory, and cognitive flexibility (Heaton, 1981). The brief administration time of some performance based measures of EF may also attract some researchers and clinicians in the field. For instance, both the Digit Span (Blackburn and Benton, 1957) and the Stroop Color-Word Test (Golden and Freshwater, 1978) take 5 min or less to administer, whereas the TMT takes 5-12 min (Reitan, 1992).

Psychometric Robustness of Measures Currently Being Used to Assess Adolescent EF
Only 48 of the 705 studies included in this scoping review reported the reliability and/or validity of a measure of EF used with adolescents. Almost all of the 48 studies were conducted in high income countries, except for two (Wong et al., 2012; Malek et al., 2013) conducted in upper middle-income countries. This is not surprising because, in this scoping review, we observed that most of the studies assessing adolescent EF have been conducted in North America and Europe (see Appendix 4 in Supplementary material). These are the same settings in which the majority, if not all, of the existing measures of EF have been developed and validated. Therefore, the reported psychometrics largely reflect the performance of EF measures only within a restricted context. For adolescent EF to be accurately assessed across contexts, validation work should be extended to cover low-to-middle income countries, where there is hardly any contextually appropriate measure of EF, yet where a great majority of the world's adolescents live (WHO, 2014). The reported psychometric characteristics were mainly for EF rating scales (28 of the 48 studies). These rating scales include the BRIEF (Gioia et al., 2000; Guy et al., 2004), DOAEF (Chubarov et al., 2015), DREFS (Duke et al., 2014), AEFI (Van der Elst et al., 2012), EpiTrack Junior® (Kadish et al., 2013), the Ballet Executive Scale (BES; Wong et al., 2012), and the Behavioral Inhibition System and Behavioral Activation System (BIS/BAS; Carver and White, 1994). All except the BIS/BAS have been developed and validated within the last decade. The psychometric characteristics of these rating scales, even though from a confined context, are good. A step forward in research will be to see more reports of how these tests perform when adapted and used in low-to-middle income countries.
Being one of the dominant measures used to assess adolescent EF, the BRIEF predominated in terms of reported psychometrics. Twenty-two of the 48 studies in this scoping review reported the psychometric characteristics of the BRIEF. Apart from two studies (Burton et al., 2016; Owens et al., 2016) where the reported sub-scale internal consistency of the BRIEF was below the recommended cut-off standard of 0.70 (Cicchetti, 1994), all the other reported reliabilities (Cronbach's alphas, inter-rater, and test-retest) ranged from good to excellent. The BRIEF appears to be a valid measure of EF in terms of its construct, concurrent, and discriminant validity. Generally, these patterns of results re-affirm the good psychometric properties of the BRIEF as originally reported (Gioia et al., 2000; Guy et al., 2004). More work needs to be done in settings other than high income countries before it can be confidently concluded that the BRIEF can be validly used across contexts.
Evidence of the psychometric robustness of performance based measures, even in the restricted context of upper middle-income and high-income countries, remains scanty. The majority of the studies reporting on the psychometrics of performance based measures focused on reliability, though not extensively. Limited psychometrics, mostly reliability, are also presented for the dominant measures of EF we identified. For some complex executive tasks, such as the Stroop Color-Word Test in the study by Malek et al. (2013) and the Tower test in the study by McAuliffe et al. (2008), poor reliabilities are presented (see Table 4). Such findings mirror the observation that "complex executive tasks tend to suffer from relatively low internal and/or test-retest reliability" (Miyake et al., 2000). A major weakness is noticeable in terms of validity, where only 3 studies (Chevignard et al., 2010; Malek et al., 2013; Pesce et al., 2016) reported on an aspect of the validity of performance based measures. Among the dominant performance based measures of EF, only the Stroop Color-Word Test (Golden and Freshwater, 1978) has some evidence of validity for use with adolescents. Despite the lack of concrete evidence about the psychometric robustness of performance based measures of EF beyond that provided by the original test developers, these measures have remained a preferred option over time, for two potential reasons. First, we think that the relatively long history of development and use of performance based measures of EF, such as the Stroop task (Stroop, 1935), provides some degree of confidence in using them. Second, in light of the findings from this review, it could be that because some performance based measures of EF are used in contexts similar to those in which they were originally developed and tested, researchers rarely focus on exploring their psychometric stability.
In summary, a notable conclusion from this review is that there is not enough validity and reliability data to support the use of measures of EF among adolescents across different national and cultural contexts. Similar observations were made in a recent review (Pereira et al., 2018). Of concern here, therefore, is the transfer and use of these measures of EF in different cultural contexts without adequate adaptation and standardization. This can lead to significant limitations in interpreting findings (Greenfield, 1997), and can constrict the within-group variance or mask true differences between study groups (Grantham-McGregor, 1993). Assessment bias can also arise because of a lack of familiarity with test demands or content (Baddeley et al., 1995; Vijver, 1997). Findings from the assessment of adolescent EF using unstandardized measures, especially in low-to-middle income countries, may not be a true reflection of adolescents' EF performance and can misguide policy or intervention efforts. Given the current scenario, it is essential that researchers adapt and/or develop context-sensitive measures of EF that possess adequate psychometric characteristics for use with adolescents.

STRENGTHS AND LIMITATIONS OF THE REVIEW
We chose to conduct a scoping review because this approach is recommended where the research questions are anticipated to be broad, complex, or highly heterogeneous and hence not amenable to a more precise systematic review (Peters et al., 2015). We conducted this scoping review in a systematic and rigorous manner following the recommended scoping review framework (Arksey and O'Malley, 2005). However, the review has some limitations that need to be highlighted. Although the search strategy was thorough, we searched only three major electronic databases (MEDLINE, PsycINFO, and Web of Science). Therefore, we cannot be certain that all important data were captured and reported. We also included only articles published in journals and did not search the gray literature. We included articles published in English only because this was the main language with which the authors are familiar. We therefore acknowledge that the published literature may not be representative of the entirety of work examining EF among adolescents. Because the included studies were inconsistent in reporting the EF domain assessed by the tasks/measures of EF we identified, it was difficult to collate information on tasks/measures by EF domain assessed.

CONCLUSIONS
There is a range of measures currently in use to evaluate different aspects of executive function among the adolescent sub-population, although 10 measures appear to dominate the field. Unfortunately, there is very limited evidence generated to support the validity and usage of these measures of EF among adolescents across different national and cultural contexts.