Portuguese Physical Literacy Assessment - Observation (PPLA-O) for adolescents (15–18 years) from grades 10–12: Development and initial validation through item response theory

Introduction: The aims of these studies were to develop the Portuguese Physical Literacy Assessment Observation instrument (PPLA-O) to assess the physical and part of the cognitive domain of Physical Literacy (PL) through data collected routinely by Physical Education (PE) teachers; and to assess the construct validity (dimensionality, measurement invariance, and convergent and discriminant validity) and score reliability of one of its modules [Movement Competence, Rules, and Tactics (MCRT)]. Methods: Content analysis of the Portuguese PE syllabus and a literature review were used for PPLA-O domain identification. Multidimensional Item Response Theory (MIRT) models were used to assess construct validity and reliability, along with bivariate correlations, in a sample of 515 Portuguese grade 10–12 students (M age = 16, SD = 1). Results: PPLA-O development resulted in an instrument with two modules: MCRT (22 physical activities) and Health-Related Fitness (HRF; 5 protocols); both are assessed with teacher-reported data entered in a spreadsheet. A Graded Response Model with two correlated dimensions (Manipulative-based Activities [MA] and Stability-based Activities [SA]) showed the best fit to the MCRT data, with results suggesting measurement invariance across sexes and adequate to good score reliabilities (MA = .89, and SA = .73). There was a moderate to high correlation (r = .68) between dimensions, and boys had higher scores in both dimensions. Correlations among MCRT scores and HRF variables were similar in magnitude to previous reports in meta-analyses and systematic reviews. Conclusions: PPLA-O is composed of two modules that integrate observational data collected by PE teachers into a common frame of criterion-referenced PL assessment. The HRF module uses data collected through widely validated FITescola® assessment protocols. The MCRT module makes use of teacher-reported data collected in a wide range of activities and movement pursuits to measure movement competence and inherent cognitive skills (Tactics and Rules). We also gathered initial evidence supporting construct validity and score reliability of the MCRT module. This highly feasible instrument can provide Portuguese grade 10–12 (15–18 years) PE students with feedback on their PL journey, along with the other instrument of PPLA (PPLA-Questionnaire). Further studies should assess inter- and intra-rater reliability and criterion-related validity of its two modules.

Introduction
Physical literacy (PL) is a holistic concept composed of four interrelated domains: physical, emotional/psychological, cognitive, and social. It comprises skills and attributes that individuals show through physical activity (PA) and movement throughout their lives (1,2). This concept is also at the heart of quality Physical Education (PE) for school-aged children and adolescents (3,4).
Two crucial elements within the physical domain of PL are movement competence (MC) and health-related fitness (HRF), as they are conceptualized as part of a spiral of engagement that leads to increased PA participation in children, which might strengthen into adolescence (5,6). We focus on adolescence, given the concerningly low levels of PA at this stage of life (7). However, if the goal is meaningful and involved PA participation, its decision-making and tactical aspects (elements of the cognitive domain of PL) also need to be considered (2,8–10).
Development of MC, HRF, and decision-making is an explicit or implicit part of some PE syllabi (11), as is the case in Portugal (12–15), where PE teachers routinely collect data on students' HRF and MC, the latter through an authentic assessment lens that integrates movement and decision-making skills (16). These teachers are qualified movement professionals who observe students in various settings (17,18), and may be in a privileged position to assess multiple aspects of student development (19,20). While HRF assessment makes use of standardized protocols (FITescola®; 21) that produce generalizable and interpretable data for educational and research stakeholders, within and outside of schools, this has not been the case for the assessment of MC.
One option to solve this issue would be the use of MC assessment batteries; however, these suffer from multiple drawbacks: (1) they require additional training and/or lesson time for correct application (22), which lowers their feasibility in PE settings; (2) they focus mostly on children (23); (3) those available for adolescents are generally product-oriented (24), providing assessment only in discrete, low-generalization tasks (25) that lack the ecological validity (6) needed to understand engagement in advanced physical experiences in a variety of domains and environmental constraints (25,26), a characteristic that defines motor development in adolescence (27,28); and (4) they neglect the decision-making aspects previously mentioned, requiring separate use of other instruments, which are, however, limited to formalized games (29,30).
This issue motivated the development of a criterion-referenced instrument that could frame observational data collected by teachers in the physical and cognitive domains into the Portuguese Physical Literacy Assessment (PPLA) tool, which already includes measures to assess all other domains of PL in adolescents (aged 15–18) (31–33).
Our aims for the following studies were to (a) develop the PPLA-Observation (PPLA-O) based on a review of relevant conceptual frameworks and the Portuguese PE syllabus, resulting in two modules: the Movement Competence, Rules, and Tactics (MCRT) module and the Health-Related Fitness (HRF) module; (b) investigate the dimensionality of the MCRT module through Item Response Theory (IRT) methods; (c) test this structure for differential item functioning (DIF) according to sex, as comparisons between sexes are likely in the future, due to suggested differences in object-control/manipulative skills (34); and (d) establish support for convergent and discriminant validity, and score reliability, for this module. A secondary aim was to draw inferences for scoring and criterion-referenced cut-score mechanisms. We did not focus on validation of the HRF module, as it comprises measures (i.e., FITescola® protocols) with already published evidence supporting their validity and reliability; further details are given in the Results section.

Overview
The development and testing of the PPLA-O followed the same philosophy (centered on providing a criterion-referenced and feasible tool for PE use) and multiple-phase methodology as the other part of the PPLA, the PPLA-Questionnaire (PPLA-Q; 31). It was inspired by the physical and cognitive domains of the PL model proposed in the Australian Physical Literacy Framework (APLF; 2,35), and by the Portuguese PE syllabus (12–15).
These studies entailed domain identification and measure selection, resulting in an instrument with two modules (HRF and MCRT), followed by content analysis of the Portuguese PE syllabus (PPES) according to chosen taxonomies to ensure content validity. A pilot test evaluated the feasibility of data entry for PE teachers. Finally, we assessed the dimensionality and reliability of the Movement Competence, Rules, and Tactics module. Since the HRF module is grounded in widely used and reported protocols (i.e., FITescola®; 21), no validation was done. In all phases, adherence to standards for instrument development and validation was sought (36,37).

Domain identification and measure selection
Similar to the procedures conducted for the development of the PPLA-Q (31), a theoretical framework was established for each of the nine selected elements in the physical and cognitive domains based on a literature review of relevant theories in the fields of motor development, physical fitness, and PE, supported by previous review efforts by the APLF team (35) and by analysis of the Portuguese PE syllabus (PPES; 12–14). Afterward, each selected element was mapped into the two-level PPLA framework (31). This framework establishes a Foundation level (initial development that enables participation in movement and PA) and a Mastery level (relational understanding and application of skills) of development for each element, based on the original APLF work and the structure of observed learning outcomes taxonomy (SOLO; 38). Operational definitions per element and level were based on the APLF (2). Then, based on the PPES and its assessment norms, measures or instruments for each element were selected to maximize feasibility and ecological validity.
Since, as we will detail in the Results section, the PPES uses an integrated criterion-referenced assessment of movement competencies, along with rules' knowledge and tactical development, a summative content analysis of the syllabus was conducted (39) to study possible factorial structures that would allow disentangling these various elements from each other. Coding was performed by the lead investigator, using a deductive categorization (40) with categories extracted from the respective theories or models; as no specific taxonomy existed for the Rules element, an inductive approach was taken. For the Movement Competence skills, sport/specialized skills in each chosen activity were assessed for the diversity of movement skills required in their execution, based on Gallahue's (27) taxonomy of Locomotion, Manipulative, and Stability movement skills, along with Dudley's (9) taxonomy for Moving with equipment (or Object Locomotion). For the Tactics element, the diversity of tactical actions was counted according to the Game Performance Assessment Instrument (30).

Pilot testing
Concurrent with the pilot test of the PPLA-Q (31) in November 2020, two PE teachers from the involved classes were asked to complete the PPLA-O resulting from the previous phase. PPLA-O took the form of a spreadsheet file (Supplementary Material S1) where teachers could enter, along with demographic information for each student, all results from the selected (1) proficiency levels for MCRT (ordinally coded), and (2) HRF protocols (continuously coded, except for Shoulder Stretch, which was coded as a binary variable). Feasibility was assessed through qualitative comments on the clarity of the provided instructions for data insertion, and identification of bugs in the automated spreadsheet files used to generate unique codes for each student (to assure anonymity) and insert data.
IRT analysis of the movement competence, rules, and tactics module

Participants
This study used the same sample as previous PPLA-Q validation studies. Sampling procedures are fully described in previous work (32). Briefly, a convenience sample of 521 grade 10–12 students from 25 classes in 6 public schools in the Lisbon metropolitan area was used. Recruitment was stratified by grade and course major according to population percentage quotas. Schools from diverse socioeconomic backgrounds were chosen to increase sample representativeness. Student sample characteristics are summed up in Table 1. Data about students was reported by 22 PE teachers. The sample size conformed to recommendations for multidimensional graded response models (GRM) (41).

Measures and procedures
PPLA-O was completed by the PE teachers (N = 22) of each class from January to March 2021. Data collection for this tool was concurrent with that of the PPLA-Q validation studies (32,33). Upon agreeing to participate, teachers were sent the PPLA-O matrix and were asked to return it once PPLA-Q data collection was complete. Since a lockdown was in effect due to the COVID-19 pandemic for most of the data collection period, teachers were asked to provide the most recent data from before the lockdown, according to the levels provided in the PPES and the FITescola® protocols. Despite not being part of the PPLA-O, height and weight information were collected to calculate body mass index (BMI) for each student. This measure was used for testing relevant correlations with measures in the MCRT module.
Mota et al. 10.3389/fspor.2022.1033648 Frontiers in Sports and Active living

Analysis
All analyses were performed in RStudio (42) with R 4.1.0 (43). Partial PE proficiency levels (e.g., partial Elementary level) were collapsed into the adjacent lower category to equalize assessment across schools-since it is common for each school to define their criteria for these partial levels to motivate students.
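The collapsing rule for partial levels can be sketched as follows. This is a minimal Python illustration, not the authors' R code: the label "Below Introductory" and the "partial X" spelling are assumptions made for the example, since the source only names "partial Elementary" as an instance.

```python
# Ordered full proficiency levels. "Below Introductory" is a hypothetical
# label (not from the source) for students not yet at the Introductory level.
FULL_LEVELS = ["Below Introductory", "Introductory", "Elementary", "Advanced"]


def collapse_level(raw: str) -> str:
    """Map a teacher-reported rating to a full proficiency level.

    A rating written as "partial X" is collapsed into the full level
    directly below X (the adjacent lower category).
    """
    label = raw.strip()
    if label.lower().startswith("partial "):
        target = label[len("partial "):].strip().title()
        idx = FULL_LEVELS.index(target)  # ValueError if the level is unknown
        if idx == 0:
            raise ValueError(f"no category below {target!r}")
        return FULL_LEVELS[idx - 1]
    full = label.title()
    if full not in FULL_LEVELS:
        raise ValueError(f"unknown level: {raw!r}")
    return full


def ordinal_code(raw: str) -> int:
    """Ordinal code (0-based) of the collapsed level."""
    return FULL_LEVELS.index(collapse_level(raw))
```

Under these assumptions, `collapse_level("partial Elementary")` yields "Introductory", so a partial rating never counts as attainment of the level it is approaching.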
Descriptive statistics were generated using the psych (44), naniar (45), and summarytools (46) packages. Students with no collected data (n = 6; non-participation in PE because of injury) were then removed from the dataset. Little's test was used to assess tenability of data missing completely at random (MCAR; 47). Results of χ²(766) = 1,681, p < .001 (with missing patterns = 91) provided evidence against MCAR. The assumption of missing at random (MAR) was plausible based on the results of a sensitivity analysis of missing data grouped by class. Two items (Rhythmic Gymnastics, and Modern Dance) were eliminated prior to further analysis due to low observed frequency (n = 1, and 0, respectively).

Dimensionality
All IRT models were estimated using Marginal Maximum Likelihood with the expectation-maximization algorithm in mirt (version 1.34.11; 48), which is robust to high degrees of missing data (49). A two-stage analysis was performed. First, sequentially more complex models were estimated until there was no improvement in model-data fit, or convergence issues occurred due to over-factoring. We fitted (1) a unidimensional partial credit model (1d-PCM), (2) a unidimensional graded response model (1d-GRM), and (3) exploratory multidimensional correlated GRMs (2d-GRM and 3d-GRM). Comparison between models used the likelihood-ratio test (LRT; 50) based on the −2LL statistic for each model (significance level of .05) to assess whether adding parameters (i.e., discrimination) and extra dimensions improved the fit of the model. The Akaike Information Criterion (AIC; 51) and sample-adjusted Bayesian information criterion (SABIC; 52) provided additional insights, with lower values indicating better model fit.
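The comparison indices described above can be computed directly from each model's −2LL and parameter count. The sketch below is a Python illustration under stated assumptions, not the mirt output: the function names and all numbers in the example are hypothetical, and the SABIC form used is the common sample-size-adjusted penalty k·ln((N + 2)/24).

```python
import math


def aic(neg2ll: float, n_params: int) -> float:
    """Akaike Information Criterion from -2 log-likelihood."""
    return neg2ll + 2 * n_params


def sabic(neg2ll: float, n_params: int, n_obs: int) -> float:
    """Sample-size-adjusted BIC: the BIC penalty with N replaced by (N + 2) / 24."""
    return neg2ll + n_params * math.log((n_obs + 2) / 24)


def lrt(neg2ll_simple: float, k_simple: int,
        neg2ll_complex: float, k_complex: int) -> tuple[float, int]:
    """Likelihood-ratio statistic and its degrees of freedom.

    The statistic is the drop in -2LL when moving to the more complex model;
    it is referred to a chi-square distribution with df equal to the number
    of added parameters (the p-value lookup is left out of this sketch).
    """
    return neg2ll_simple - neg2ll_complex, k_complex - k_simple


# Hypothetical values for illustration only (not taken from Table 6).
stat, df = lrt(10000.0, 40, 9900.0, 59)
```

With these made-up inputs the statistic is 100.0 on 19 degrees of freedom; lower AIC and SABIC for the more complex model would point in the same direction.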
Then, after an optimal exploratory solution was attained, its standardized loadings (oblimin rotated) were assessed to identify non-salient items with a threshold of λ < .30 (53) or communality < .40. Cross-loadings were assessed using a variance explained ratio ($\lambda_1^2 / \lambda_2^2$), with values lower than 1.5 (54) considered for elimination depending on factor interpretability. These items were then removed one by one (with model re-estimation) until simple structure was achieved. For the second stage, all previous models were rerun to detect whether the sequential improvement in fit held after removal of items. Finally, each item was constrained to load only on its salient factor, and a confirmatory GRM was fit.
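The three screening rules can be expressed as a small check per item. This is a hedged Python sketch: the function name, flag labels, and example loadings are hypothetical, and communality is taken as an input rather than computed, because with an oblique rotation it also depends on the factor correlation.

```python
def screen_item(loadings, communality,
                salient_min=0.30, communality_min=0.40, ratio_min=1.5):
    """Return the list of screening flags raised for one item.

    loadings    -- standardized loadings of the item on each factor
    communality -- communality reported for the item by the fitted model
    """
    squared = sorted((l * l for l in loadings), reverse=True)
    flags = []
    # Non-salient: even the largest absolute loading is below the threshold.
    if max(abs(l) for l in loadings) < salient_min:
        flags.append("non-salient")
    if communality < communality_min:
        flags.append("low communality")
    # Cross-loading: ratio of the two largest squared loadings is below 1.5.
    if len(squared) > 1 and squared[1] > 0 and squared[0] / squared[1] < ratio_min:
        flags.append("cross-loading")
    return flags
```

For example, hypothetical loadings of .60 and .50 give a ratio of .36/.25 = 1.44, which falls below 1.5 and is flagged; a flagged item is a candidate for removal, with the final decision resting on factor interpretability as described above.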

Differential item functioning (DIF)
Before DIF analysis, five cases had to be removed to equalize categories in the Throws and Jumps (both from Athletics) activities. DIF analysis was performed between sexes using a two-stage approach. First, a multiple-group IRT version of the final model was fit with no equality constraints across groups and used as a reference to run the DIF function in mirt, which adds, and tests via LRT, equality constraints for one item at a time, returning multiplicity-controlled (57) p-values. The three items with the highest p-values were selected as anchors (i.e., assumed invariant) and a final additive sequential analysis was run in the anchored model (i.e., three invariant items constrained to equality), with freely estimated means and variances. Adjusted p-values < .05 were used as the threshold for existence of DIF.
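The anchor-selection and flagging logic of this two-stage approach can be sketched as below. This Python fragment only illustrates the decision rules applied to adjusted p-values that have already been produced by the model; the p-values shown are hypothetical, and alphabetical tie-breaking is an assumption not stated in the source.

```python
def select_anchors(adjusted_p: dict, n_anchors: int = 3) -> list:
    """Pick the items with the highest adjusted p-values as anchors.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    ranked = sorted(adjusted_p.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n_anchors]]


def flag_dif(adjusted_p: dict, alpha: float = 0.05) -> list:
    """Items whose adjusted p-value falls below the DIF threshold."""
    return sorted(name for name, p in adjusted_p.items() if p < alpha)


# Hypothetical first-stage results, for illustration only.
stage_one = {"Throws": 1.00, "Climbing": 1.00, "Rollerskating": 1.00,
             "Handball": 0.40, "Basketball": 0.62}
anchors = select_anchors(stage_one)
```

In the second stage, the same `flag_dif` rule would be applied to the p-values obtained from the anchored model.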

Discriminant and convergent validity
Bivariate Pearson and polyserial correlations (and 95% CI) were calculated using the polycor (58) and piercer (59) packages using all pairwise complete observations. These were used to evaluate discriminant validity (threshold of r = .85 to discern whether resulting variables were statistically different) and convergent validity based on magnitude reported in similar studies. Magnitudes were interpreted as: very high, high, moderate, and low correlations, when r > .90, >.70, >.50, >.30, respectively (60). Inter-factor discriminant validity was assessed via correlation in the final MCRT model, using the same .85 threshold.
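The interpretation rules above amount to a simple lookup on the absolute correlation. The following Python sketch is illustrative only: the label "negligible" for values at or below .30 and the strict inequality at the .85 threshold are assumptions, since the source does not specify either.

```python
def magnitude(r: float) -> str:
    """Verbal label for a correlation, using the cut-offs stated in the text."""
    size = abs(r)
    if size > 0.90:
        return "very high"
    if size > 0.70:
        return "high"
    if size > 0.50:
        return "moderate"
    if size > 0.30:
        return "low"
    return "negligible"  # hypothetical label; the source names none below .30


def supports_discriminant_validity(r: float, threshold: float = 0.85) -> bool:
    """True when two variables correlate below the discriminant threshold."""
    return abs(r) < threshold
```

Applied to a correlation of .68, these rules return "moderate" and indicate that the two variables remain distinguishable under the .85 criterion.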

Reliability and scoring
Marginal reliability (61), using expected a posteriori (EAP) scores (62), was calculated to quantify average reliability across the θ continuum. Reliabilities were evaluated as acceptable ($\rho_{xx} > .70$; 63) or good ($\rho_{xx} > .80$; 64). Thresholds for each item ($d_k$, or intercept parameters) were transformed into difficulty parameters ($b_k$) using $b_k = -(d_k / a_k)$ (65) for easier interpretation.
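The intercept-to-difficulty transformation is a one-line computation per threshold. A minimal Python sketch follows, assuming an item that loads on a single dimension (as in the confirmatory solution); the function name and the parameter values in the example are hypothetical.

```python
def intercepts_to_difficulties(a: float, d: list) -> list:
    """Convert intercepts d_k into difficulties b_k = -(d_k / a).

    a -- discrimination (slope) of the item on its single dimension
    d -- intercepts, one per boundary between adjacent response categories
    """
    if a == 0:
        raise ValueError("discrimination must be non-zero")
    return [-(dk / a) for dk in d]


# Hypothetical item: slope 2.0 with intercepts 1.0 and -3.0.
difficulties = intercepts_to_difficulties(2.0, [1.0, -3.0])
```

Here the resulting difficulties are −0.5 and 1.5, read on the θ scale: each is the trait level at which a student has a 50% probability of being rated at or above the corresponding proficiency level.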

Results
Given the initial focus on the development of the PPLA-O, this section will first describe the results of domain identification and measure selection-including relevant definitions, and a summary literature review of its theoretical framework and relationships with PA participation or other relevant outcomes. It will then present the results of the remaining studies: content analysis, pilot testing, and IRT analysis of the MCRT module.

Domain identification and measure selection
Health-related fitness (HRF) module
Physical fitness can be interpreted as the capacity to perform PA and/or physical exercise that integrates most bodily functions involved in movement (66,67). Some authors suggest it as a predictor of PA in youth (6,68), with active youth presenting healthier physical fitness profiles (69). However, this is disputed by other authors (66,70).
More robust evidence, however, correlates fitness with various health outcomes throughout the life span (71). Among these, cardiovascular endurance is linked with diverse metabolic markers (72), mental health (73,74), and cognitive benefits including academic performance (75,76). Musculoskeletal fitness is linked with increased bone density (72) and positive self-perceptions (77). And, despite there being no compelling link between flexibility and health, the former is suggested to be central to correct posture and increased functional capacity (78).
Given its prominent role in a healthy and active life, HRF is an integral part of the PPES, as one of its three major areas, along with physical activities and knowledge. Its assessment is operationalized through the FITescola® test battery (21). This battery, analogous to FitnessGram® (78), offers a set of protocols to assess whether children and adolescents meet evidence-based criteria for health-related benefits. From these, we selected the protocols most widely used in PE teachers' practice that simultaneously adhere to international recommendations (72,79) (Table 2, column 5) and have extensive validity and reliability evidence (80–85). Attainment of the Healthy Fitness Zone was mapped as the transition point between the Foundation and Mastery levels for elements in this module, with the Athletic Profile values used as a reference for maximum points. The latter is a zone designed to assess athletic potential in youth (86).

Movement competence, rules, and tactics (MCRT) module

Movement competence
Movement competence (MC) can be defined as the development of sufficient movement skills to assure successful performance in a variety of physical activities, be that work or play (26,87). This concept is employed by Whitehead (88) in allusion to a "bank" that enables individuals to respond automatically and meaningfully to movement situations. Most commonly, these skills are divided into (1) fundamental movement skills, and (2) specialized movement skills (27). Fundamental movement skills are organized series of basic movements that involve combinations of two or more body segments (27), and form the building blocks for specialized movement skills (89), which represent the application of these fundamental movement skills to specific physical activity or sports contexts with increased refinement (e.g., fielding a ground ball; 27,28). Different, yet analogous, taxonomies include the subdivision into general, refined, and specific movement patterns (90). All these movement skills can be categorized into different movement skill sets according to their function (26) as locomotor, stability, or manipulative movement skills (27), and present multiple phases and stages of development throughout the lifespan. Other sources add a fourth category that includes movement skills with equipment (e.g., bike, surfboard, roller skates; 2,9).
MC has a suspected cause-effect relationship with PA (91), with multiple reviews identifying a positive association between the two across childhood (92). This association also seems to be higher with object control/manipulative movement skills (93,94). However, few studies have examined this correlation among adolescents (92). Similarly, positive correlations have been identified with perceived competence (95) and health-related fitness (5,96).
In the PPES, MC is developed within the physical activities area, which includes subareas for diverse physical activities (i.e., Team sports, Gymnastics, Athletics, Racquets, Combat, Rollerskating, Swimming, Rhythmic-Expressive, Traditional Games, and Nature exploration). In each of these subareas, multiple physical activities (to which we will refer simply as activities, from now on) are used as a means of development and assessment of each student through three levels: Introductory, Elementary, and Advanced. The Introductory level frames multiple foundational skills and knowledge needed for participation in each activity, in reduced or constrained gameplay, or in pedagogical progressions leading to the formal setting of the activity. The Elementary level refers to the mastery of the main elements of each activity in the full formal setting of the activity. The Advanced level establishes skills and knowledge needed for higher-degree participation in the activities (e.g., performance settings). This assessment uses a set of rubrics that establish (1) the skill, knowledge, or attitude to be observed, (2) the context (e.g., 2 × 2 reduced gameplay of volleyball, or a gymnastics sequence composed of predetermined movements), and (3) multiple qualitative criteria that describe the action. Given the above frame, we mapped the Introductory and Elementary levels in these activities to the Foundation and Mastery levels of the PPLA, respectively, in all elements of movement competence (i.e., locomotion, manipulative, stability, moving with equipment).

Rules
Although framed within the realm of team sports and games, most literature on rules readily generalizes to other movement contexts. Rules provide a structure that manages and guides practitioners' actions (97). These can be considered primary, or fundamental, when they act as constraints that regulate and apply restrictions on the mode of action available to the individual (e.g., scoring rules); or as secondary when they represent written or unwritten rules that facilitate participation [e.g., safety and ethical rules of organized PA; (9)]. Both contribute to the form of the activity as we know it (16). Understanding rules and their application is therefore an essential part of every activity-something that Bunker and Thorpe frame as "Game Appreciation" (8).
Within the PPES, rules' knowledge and understanding are integrated holistically within each activity proficiency level previously mentioned. Thus, all activities promote the learning of safety codes and equipment management, while activities like Team Sports and Athletics allow learning of more closed scoring and playing rules. These outcomes are framed into the Foundation level of this element. At higher levels (mostly Advanced), students are asked to be officials and referees, which works as a powerful learning tool to reinforce rule knowledge and conditional application of all aspects of the activity (16). This skill is proposed as part of the Mastery level.

Tactics
Tactics can be framed as time-sensitive responses to problems posed in movement and PA contexts, be that inherent to game participation (i.e., acquiring advantage), or informal PA (i.e., maximizing quality and efficiency) (9,98). These contexts act as eventful dynamic systems (99) that require participants to develop and apply higher-level cognitive skills (e.g., comparing, contrasting, analyzing, evaluating) required for thoughtful decision-making (100), in interaction with others and the environment (9). Despite being separated here into two different elements, tactical knowledge and application are mostly conceived as the next (higher-order) level of rules' knowledge, in a learning continuum that frames decision-making within PA (8,9,97): only after participants can identify the constraints imposed by rules can they acknowledge the degrees of freedom available to act.
Game sense approaches, which propose teaching of PA through reduced or adapted forms of the formal activity [e.g., Teaching Games for Understanding (TGfU); 8], recognize that the learning of specific skills and the learning of tactics constrain each other (101), while traditional, skill-centered approaches (i.e., analytical) focus on the former as the main constrainer of the capacity to participate in PA. The TGfU approach recognizes the similarity between tactical actions among the various games by categorizing them into (1) target games, (2) net/wall games, (3) striking/fielding games, and (4) invasion games (8). Based on this taxonomy, the Game Performance Assessment Instrument typifies tactical actions into six transversal categories, skill execution excluded: (1) decision-making, (2) adjust, (3) cover, (4) support, (5) guard/mark, and (6) base (30,102).
Benefits of using these approaches might include increased engagement, enjoyment, and motivation in PE classes (103). Also, some authors argue that awareness and decision-making skills might transfer to contexts outside of movement (2,9), being central to critical thinking as a general education outcome (100).
As mentioned above, the PPES frames tactical skills within the learning of activities and into the diverse levels of learning. Assessment is made in-context, through a combination of skills and decision-making, coherent with principles of authentic assessment (16,104). We framed a more constrained application of tactics (i.e., reproduction of descriptive tactics) to the Foundation level, while a more critical, relational stance on decision-making was framed at the Mastery level.
Given the integrated nature of the Movement Competence, Rules, and Tactics elements, the specification levels for each activity were selected as holistic, process-oriented measures of these elements. A set of 22 physical activities that represent the full breadth of subareas within the syllabus were chosen, with the possibility for teachers to include any other activity assessed. Chosen activities spanned all movement forms (90,105) and two of the four game types according to TGfU (Table 3). Target and striking games are not commonly developed in Portuguese PE and were not included.

Content analysis
Table 3 presents the summary of the content analysis of the PPES. Higher levels of proficiency in each activity entailed a higher diversity of movement skills in all typologies; however, this tendency only emerged between the Introductory and Elementary levels, with almost no new movement skills required when transitioning to the Advanced level. Locomotor skills were required with similar diversity across all types of activities, with two clusters emerging according to manipulative (mostly Team Sports) and stability (Gymnastics and Rollerskating) movement skills: while Team Sports required mostly dynamic balancing, twisting, turning, landing, and dodging movement skills, Gymnastics uniquely required skills combining inverted support, rolling, and diverse bending and stretching movement skills.
Tactics-wise, a similar pattern was noted, with increasing levels requiring a higher diversity of tactical actions, but without the plateau observed for movement skills. As expected, tactical actions were mostly required by Team Sports and Racquets activities.
Finally, regarding rules, four general categories emerged from the analysis. Knowledge and application of safety rules and specific activity rules were mostly observed at the Introductory level, while identification of referee signals and officiating were mostly skills required for the Elementary and Advanced levels, respectively.

Pilot testing
Teachers had no difficulties with data insertion and regarded the instructions as clear. As expected, data collection implied no further efforts, as activities and HRF protocols were already part of their lessons. They highlighted errors in the code generator spreadsheet and PPLA-O spreadsheet, which were corrected for the next phase.

Preliminary analysis
Seven activities had an assessment rate below 90% (Modern Dance, Rhythmic Gymnastics, Rugby, Wrestling, Judo, Acrobatic Gymnastics, and Tennis; Table 4). The most prevalent level of proficiency was Introductory, with the Advanced level attaining only residual prevalence (0 to 5.1% of assessed students). Flexibility protocols had lower percentages of assessed students compared to other protocols (Table 5).

IRT analysis of the movement competence, rules, and tactics module

Dimensionality
In the first stage of analysis, the 2d-GRM presented the best fit according to information criteria (AIC, SABIC, and −2LL; Table 6). According to the likelihood-ratio test (LRT), freely estimating discrimination (slope) parameters improved the fit from the 1d-PCM to the 1d-GRM, and estimating an additional dimension also improved fit from the 1d-GRM to the 2d-GRM. A 3d-GRM was estimated; however, its information matrix could not be inverted, signaling an empirically unidentified model (estimates are not presented). Item standardized loadings and parameters were analyzed based on the 2d-GRM exploratory solution. Reasons for item removal are presented in Table 6. As a note, the Wrestling item had a borderline variance ratio (1.66), and we opted initially for non-removal based on its added value as a unique item concerning Combat activities. However, estimation of the subsequent second-stage confirmatory 2d-GRM (with items constrained to load on their salient factor) did not converge. Removal of this item allowed the solution to converge.
The second stage comprised sequential re-estimation of all models, without removed items, to assess whether results obtained in the first stage were robust. Improvement in fit between models was equivalent to those observed during the first stage. Finally, a confirmatory 2d-GRM was fit, resulting in decreased fit (according to all indices) vs. its exploratory counterpart, which was expected since the former imposes more constraints on item loadings (cross-loadings constrained to 0).
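As a rough illustration of how the model comparisons described above are typically computed, the sketch below derives −2LL, AIC, and SABIC from a model log-likelihood, plus the likelihood-ratio statistic between two nested models. The log-likelihoods and parameter counts are hypothetical placeholders, not the values in Table 6, and the SABIC formula assumes the usual sample-size adjustment of (N + 2)/24.

```python
import math

def information_criteria(log_lik, n_params, n_obs):
    """Return -2LL, AIC and sample-size adjusted BIC for one fitted model."""
    deviance = -2.0 * log_lik
    return {
        "-2LL": deviance,
        "AIC": deviance + 2 * n_params,
        # SABIC replaces N in the BIC penalty with (N + 2) / 24
        "SABIC": deviance + n_params * math.log((n_obs + 2) / 24),
    }

def likelihood_ratio(log_lik_restricted, k_restricted, log_lik_full, k_full):
    """LRT statistic and degrees of freedom for two nested models."""
    statistic = 2.0 * (log_lik_full - log_lik_restricted)
    df = k_full - k_restricted
    return statistic, df

# Hypothetical (log-likelihood, number of parameters) pairs, for illustration only
models = {
    "1d-PCM": (-9000.0, 67),
    "1d-GRM": (-8900.0, 88),
    "2d-GRM": (-8800.0, 109),
}
N_OBS = 515  # sample size reported in the abstract

criteria = {name: information_criteria(ll, k, N_OBS)
            for name, (ll, k) in models.items()}
best_by_aic = min(criteria, key=lambda name: criteria[name]["AIC"])

# Compare the one- and two-dimensional graded response models
lrt_stat, lrt_df = likelihood_ratio(*models["1d-GRM"], *models["2d-GRM"])
```

Lower values of each criterion indicate better relative fit; the LRT statistic would then be referred to a chi-square distribution with `lrt_df` degrees of freedom.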
Loadings in the final confirmatory solution ranged from very good to excellent (.75 to .92, and .64 to .91, for dimensions 1 and 2, respectively; Table 7, Figure 1). An equivalent pattern of moderate (a > .65) to very good (a > 1.70) discrimination parameters (56) indicates that items are performing correctly in their respective dimension (i.e., providing information to separate students with different levels of θ). These two moderately correlated dimensions (r = .68) are coherent with items (i.e., PA) being better measures of either Manipulative skills or Stability skills; as such, we named them Manipulative-based Activities (MA) and Stability-based Activities (SA), respectively (Table 7). Usage of Locomotion skills is likely prevalent across all activities, and thus no third factor emerged based on it. Surprisingly, all Athletics disciplines had higher loadings on the Manipulative factor than on the Stability factor; also, loading patterns do not suggest that tactical skills might be a source of covariation among tactically alike activities (e.g., Handball and Basketball). Interpretations for these occurrences are provided in the Discussion.

Differential item functioning (DIF)
In the first stage of the analysis, the Throws (Athletics), Climbing, and Rollerskating indicators were selected as anchors (adjusted p-values = 1.00). Subsequent sequential analysis with these indicators constrained to equality across groups revealed no DIF according to sex.

Discriminant and convergent validity
Inter-factor correlation between MA and SA was moderate to high (r = .68; Table 7). Table 9 displays the bivariate correlations between all variables in both PPLA-O modules, along with an additional BMI variable. These results will be discussed and compared further in the Discussion.

Reliability and scoring
Both dimensions of the MCRT attained acceptable marginal reliability in the final solution (ρxx = .89 and .73, respectively; Table 7). Table 8 presents transformed intercept parameters (category thresholds), which can be interpreted as transition points between levels of proficiency for each activity (i.e., the θ point at which there is a 50% probability of being scored in that category or higher; 109). Median values represent a heuristic cut-score between general proficiency levels (θ) in each dimension. For example, a student with θ = −1.68 is likely transitioning from the Non-Introductory to the Introductory level in most Manipulative activities.
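The threshold interpretation above can be made concrete with a minimal sketch of a single graded response item. It assumes the common slope-intercept form, P(X ≥ k | θ) = logistic(aθ + d_k), with the threshold obtained as b_k = −d_k / a; the exact transformation used for Table 8 is not spelled out in this section, and the slope and intercept values below are hypothetical.

```python
import math

def prob_at_or_above(theta, slope, intercept):
    """P(X >= k | theta) for one category boundary (slope-intercept form)."""
    return 1.0 / (1.0 + math.exp(-(slope * theta + intercept)))

def threshold(slope, intercept):
    """Theta at which P(X >= k) = .50, i.e. the transformed intercept."""
    return -intercept / slope

def category_probs(theta, slope, intercepts):
    """Probability of each ordered category; intercepts must be decreasing."""
    cumulative = ([1.0]
                  + [prob_at_or_above(theta, slope, d) for d in intercepts]
                  + [0.0])
    return [cumulative[k] - cumulative[k + 1]
            for k in range(len(cumulative) - 1)]

# Hypothetical item with four levels: Non-Introductory, Introductory,
# Elementary, Advanced (three boundaries between them)
SLOPE = 2.0
INTERCEPTS = [3.0, -1.0, -4.0]

thresholds = [threshold(SLOPE, d) for d in INTERCEPTS]
# Probability of being scored Introductory or higher exactly at the first threshold
p_at_first = prob_at_or_above(thresholds[0], SLOPE, INTERCEPTS[0])
# Distribution over the four levels for a student at theta = 0
probs_at_zero = category_probs(0.0, SLOPE, INTERCEPTS)
```

At each threshold the probability of being scored in that category or higher is .50 by construction, which is what makes the median thresholds usable as heuristic cut-scores.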

Discussion
Our aims for these studies were to (a) develop the PPLA-Observation based on a review of relevant conceptual frameworks and Portuguese PE syllabus practices; (b) investigate the dimensionality structure of one of its modules (the Movement Competence, Rules, and Tactics module) through Item Response Theory (IRT) methods; (c) test this structure for differential item functioning according to sex; and (d) establish support for convergent and discriminant validity, and score reliability, for this module. A secondary aim was to draw inferences for scoring and criterion-referenced cut-score mechanisms.
IRT analysis of the movement competence, rules, and tactics module

Dimensionality
Our results, based on exploratory and confirmatory IRT analysis, provide evidence in favor of a solution with two correlated factors for assessing Movement Competence, Rules, and Tactics, with evidence of measurement invariance (no DIF) across sexes. This is contrary to our initial conceptualization, which proposed that seven latent variables could be responsible for the variance in observed proficiency levels of activities: Locomotion, Manipulative, Stability, and Movement skills using Objects, Rules, and Tactics. Items (activities) did not cluster according to different tactical typologies, movement forms, or subareas. Instead, our results suggest that their variance is driven by competence in two types of movement skills: Manipulative movement skills and Stability movement skills. Competence in Locomotor movement skills did not emerge as a latent factor explaining variance. This might be due to locomotor skills being transversally required in specialized skills in all evaluated activities (e.g., sliding to hit a falling shuttlecock, or running and then jumping onto a trampoline), as can also be seen in our content analysis of movement skills (Table 3).
Another unexpected finding was that two Athletics disciplines that were expected to load on the SA dimension (i.e., Running, and Jumps), as specific skills for these activities are mostly locomotor and stability-based, presented higher loadings on MA. This might originate from a disconnect in how this group of activities (Athletics) is conceived and assessed within the PPES: rubrics for all disciplines are grouped and assessed as a single activity; however, throughout the syllabi (12), the three disciplines are mentioned as different activities. It is possible that this led to teachers reporting according to different standards. This requires scrutiny and caution in further developments of this tool.
Regarding Tactics, content analysis of the PPES revealed that until the Elementary proficiency level, movement skills and tactical requisites increase simultaneously. It is during the transition to the Advanced level that tactical indicators take precedence (Table 3). It is plausible that skill and tactical factors co-vary closely until the Elementary level, and that only when students transition into Advanced levels does the tactical factor singularly drive variance in items, since movement skill factors cease or lower their effect at this level. However, in our sample, almost all students were at, or below, the Elementary level in all activities (Table 4), which could preclude disentanglement of variance between these factors. Also, since most tactical-heavy activities are those requiring manipulative skills, the MA factor might be accounting for variance in tactical knowledge and application. Further studies with large-scale samples, with a higher proportion of students in Advanced stages, could test these hypotheses and offer insights into this factorial structure.
Regarding Rules, variance caused by differing degrees of rule knowledge and application might be similarly overshadowed by movement skills and tactics: a student might know and apply all rules of an activity, but absence of the required skill and tactical factors might prevent them from advancing in proficiency level. Albeit aligned with an authentic assessment perspective, this invalidates measurement of this element using only observed activity levels; isolating it will likely require an external instrument (e.g., a scale).

Differential item functioning (DIF)
Items seem to function similarly for both sexes (i.e., no DIF). Results can therefore be meaningfully compared, despite suggestions in the literature pointing to bias when teachers observe MC (18, 107), namely considering girls' competence in PA to be below average compared to boys of the same age.

Discriminant and convergent validity
The moderate to high correlation between MA and SA (r = .68; Table 7) is similar to results from another movement skill battery, using the same conceptualization, in older children and adolescents in a Portuguese sample (r = .64; 108). Given the strength of this correlation, a general motor ability underlying results in both factors is tenable (26) and could be further investigated through second-order or bifactorial modeling (109,110). Despite this, discriminant validity is still ensured, with inter-factor correlations below .85 (109). Correlations observed in our study between MA and SA and correlates such as sex, age, BMI, and fitness (Table 9) were coherent with those found in the literature regarding movement skills in adolescents, strengthening the evidence for construct validity of the MCRT. Boys had higher scores than girls in both dimensions (Table 10), with the difference being smaller in stability skills (111,112). Values for the correlation of age with scores on both dimensions (r = .23 [.15, .31], and r = .18 [.09, .26], for MA and SA, respectively) were similar to those reported in a meta-analysis by Barnett and colleagues (93)-including an  . Cardiovascular and muscular endurance were also correlated with both scores, in similar magnitude as in previous studies (92,111). Finally, despite inconclusive results in reviews (92, 96), we observed a negative correlation between all flexibility indicators and scores in both dimensions; this correlation was lower regarding SA, which is consistent with the idea that stability-based activities require greater ranges of motion. The role of flexibility warrants further scrutiny, since our results pointed to a mostly negative correlation with other fitness indicators; in particular, the sit-and-reach indicators might be collapsed, since their correlation suggested they are statistically equivalent (r > .85).
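For readers who want to see where a bracketed interval around a correlation can come from, the sketch below computes a 95% confidence interval with the Fisher z transformation. This is only an assumption for illustration: the section does not state how the reported intervals were obtained, and the pairwise sample size may differ from the full sample of 515 used here.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

# Age and MA correlation as reported in the text; n = 515 is an assumption
# (the full sample size), since pairwise n is not given in this section
lower_ma, upper_ma = fisher_ci(0.23, 515)

# Discriminant validity check against the .85 cut-off cited in the text
INTERFACTOR_R = 0.68
discriminant_ok = INTERFACTOR_R < 0.85
```

Under these assumptions, the interval for r = .23 rounds to the reported [.15, .31]; intervals obtained by other methods (e.g., bootstrapping) or with a different pairwise n need not match exactly.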

Reliability and scoring
Use of a sub-score for each of the identified dimensions of the MCRT seems plausible given the evidence of sub-score reliability. We suggest a transformation so that these scores provide an intuitive 0 to 100 interpretation, like other scores in PPLA. For this transformation, the median θ score estimated for the transition from Elementary to Advanced level (θ = 1.95 and 2.96, for MA and SA, respectively; Table 8) can be used as the upper bound, and the estimated θ score for a student with the lowest possible levels in all activities as the lower bound (θMA = −2.38, and θSA = −2.27, not shown). As an example, with x being the new 0 to 100 score, and θ the estimated θMA score.
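The transformation formula itself is not reproduced in this text. One plausible reading, given the stated bounds, is a linear rescaling, x = 100 × (θ − θlower) / (θupper − θlower). The sketch below implements that reading; truncating scores outside the bounds to 0 and 100 is an additional assumption, and the function name is hypothetical.

```python
def rescale_theta(theta, lower, upper):
    """Map a theta estimate to a 0-100 score by linear rescaling.

    Values beyond the bounds are truncated to 0 or 100 (an assumption;
    the source does not say how such cases are handled).
    """
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    score = 100.0 * (theta - lower) / (upper - lower)
    return min(100.0, max(0.0, score))

# Bounds reported in the text: (lowest-possible theta, Elementary-to-Advanced median)
MA_BOUNDS = (-2.38, 1.95)
SA_BOUNDS = (-2.27, 2.96)

# A student at the Non-Introductory to Introductory transition in MA
score_ma = rescale_theta(-1.68, *MA_BOUNDS)
# A student above the SA upper bound is capped at 100 under this sketch
score_sa_capped = rescale_theta(3.5, *SA_BOUNDS)
```

Under this reading, a student reaching the median Elementary-to-Advanced threshold scores 100, so the score is criterion-referenced rather than relative to the sample distribution.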
Since these scores require complex computations, the effectiveness and precision of simpler options (e.g., sum scores) should be investigated in the future, given our concern for feasibility.
Reliability has been widely established for the HRF module protocols. We suggest that results from each protocol should be similarly transformed using the values reported by FITescola® Athletic Profile, based on sex and age, as the upper bound. In this manner, a 0 to 100 criterion-referenced score can be obtained.

Strengths and limitations
One of the major strengths of the PPLA-O is its feasibility: it uses data routinely collected by PE teachers to frame the evaluated elements into a common reference frame of Physical Literacy. Its content validity is also maximized by making use of (1) HRF protocols that have been chosen and adapted with the PE context in mind (FITescola®), and (2) data on proficiency levels in diverse physical activities that were chosen to figure in the Portuguese syllabus by curriculum design experts. It also evaluates movement skills, and inherent tactical actions, within task and environmental constraints that will be common to activities practiced outside of PE, providing a chance for an authentic, ecologically valid, and highly feasible assessment. Further efforts could study content and face validity with students and other educational stakeholders, as well as with motor development specialists, to provide another layer of validity evidence.
Another strength rests in using IRT methodologies to analyze construct validity and reliability. Due to the intended ecological approach, missing data will always assume large proportions, since different students' needs will dictate that each class will work on and assess different activities. IRT algorithms were specifically designed to work with categorical data and are robust to missing data, using all information available to estimate parameters that also have higher degrees of invariance from sample to sample (53,113). As such, students with just a few assessed activities will still be able to be scored. However, large amounts of missing data still posed a limitation regarding assessment of absolute fit of the models-through statistical tests equivalent to chi-square (i.e., C2; 113) and derived relative fit indexes (root mean square error of approximation).
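To illustrate why students with only a few assessed activities can still be scored, the sketch below computes an expected a posteriori (EAP) θ estimate for a unidimensional graded response model, simply skipping activities with no recorded level. This is a simplified stand-in under stated assumptions: the study used two-dimensional models, the scoring method is not specified in this section, and all item parameters here are hypothetical.

```python
import math

def grm_category_prob(theta, slope, thresholds, category):
    """P(X = category | theta) under a unidimensional graded response model."""
    def at_or_above(k):
        if k <= 0:
            return 1.0
        if k > len(thresholds):
            return 0.0
        return 1.0 / (1.0 + math.exp(-slope * (theta - thresholds[k - 1])))
    return at_or_above(category) - at_or_above(category + 1)

def eap_score(responses, items, n_points=81, bound=4.0):
    """EAP estimate of theta on an evenly spaced grid with a N(0, 1) prior."""
    grid = [-bound + 2.0 * bound * i / (n_points - 1) for i in range(n_points)]
    numerator = denominator = 0.0
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)  # unnormalised normal prior
        for response, (slope, thresholds) in zip(responses, items):
            if response is None:
                continue  # unassessed activity: contributes no information
            weight *= grm_category_prob(theta, slope, thresholds, response)
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

# Hypothetical (slope, thresholds) for four activities; levels coded 0 to 3
ITEMS = [
    (1.8, [-1.5, 0.5, 2.0]),
    (1.2, [-1.0, 0.8, 2.5]),
    (2.0, [-1.8, 0.2, 1.9]),
    (1.5, [-1.2, 0.6, 2.2]),
]

theta_partial_high = eap_score([2, None, None, 2], ITEMS)  # two activities assessed
theta_partial_low = eap_score([0, None, None, 0], ITEMS)
theta_no_data = eap_score([None, None, None, None], ITEMS)  # falls back to prior mean
```

With fewer assessed activities the estimate relies more heavily on the prior and is therefore less precise, but it remains defined for any response pattern.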
One limitation of this study lies in the unknown inter- and intra-observer reliability of PE teachers while assessing both the fitness protocols and activity levels. We would argue that numerous factors could contribute to higher reliability, including (1) extensive training during initial teacher education, (2) clear and task-specific rubrics for each activity and level available in the syllabus (115), (3) specific fitness protocols with detailed instructions and resources for application, (4) collaborative training and observation opportunities within schools, and (5) assessment based on multiple in-context observations. Despite this, these inferences require further scrutiny and empirical validation, since process-oriented assessments are more susceptible to bias caused by different levels of observer expertise (e.g., 115,116). As part of this effort, demographic data on PE teachers, along with teaching experience and other relevant variables, should also be collected to better understand assessment patterns, which we did not do during these studies.
A final, more general limitation concerns the timeframe of this study. All data collection took place amid lockdowns imposed by the COVID-19 pandemic. This limited the number and quality of activities assessed by PE teachers (especially those involving physical contact, like wrestling or acrobatic gymnastics) and might have imposed additional unforeseen limitations on these results. As such, these results should be replicated in a larger, more representative sample of students in regular PE circumstances, which will likely enable a deeper insight into the Tactics element.

Conclusion
Throughout this article, we detailed the development of the PPLA-O, an instrument that assesses the physical and part of the cognitive domains of PL in grade 10 to 12 adolescents (15-18 years). It is composed of two modules, (1) Health-Related Fitness (HRF), and (2) Movement Competence, Rules, and Tactics (MCRT), that integrate observational data from PE teachers into a common frame of criterion-referenced PL (Figure 1). The former makes use of data collected through widely validated FITescola® assessment protocols, while the latter makes use of teacher-reported data collected in a wide range of activities and movement pursuits to measure movement competence and inherent cognitive skills (Tactics and Rules). We also gathered initial evidence supporting construct validity and score reliability of the MCRT module through multidimensional IRT models. A final two-dimensional graded response model solution (Manipulative-based Activities and Stability-based Activities) showed the best fit to the data. The absence of Differential Item Functioning allows meaningful comparison of scores between sexes. Further studies should assess inter- and intra-rater reliability and criterion-related validity. This highly feasible instrument can be used routinely, alongside the other instrument of PPLA (PPLA-Q), to provide students with feedback on their PL journey and support pedagogical decisions at multiple levels (e.g., class, school, municipality, country).

Data availability statement
The datasets presented in this article are not readily available because participants of this study did not explicitly agree for their data to be shared publicly. Requests to access the datasets should be directed to João Mota, joao.mota@ucc.ie.

Ethics statement
The studies involving human participants were reviewed and approved by the Ethics Council of the Faculty of Human Kinetics. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.