- 1Faculty of Education, Shaanxi Normal University, Xi’an, China
- 2College of Teacher Education, Hebei Normal University, Shijiazhuang, China
- 3School of Teacher Development, Shaanxi Normal University, Xi’an, China
- 4College of Life Sciences, Shaanxi Normal University, Xi’an, China
Introduction: Comprehensive assessment of scientific practices (SPs) requires measuring both their epistemic and non-epistemic dimensions. However, existing classroom observation protocols systematically overlook non-epistemic practices; lack disciplinary specificity, especially for secondary biology; have psychometric limitations that hinder holistic assessment of proficiency in SPs; and give little consideration to cultural adaptability. This study aimed to develop and validate the Scientific Practices Observation Protocol (SPOP), which, following the family resemblance approach, encompasses both epistemic practices aligned with the Next Generation Science Standards and non-epistemic practices.
Methods: The SPOP was validated using 883 students across 127 videotaped Grades 10–11 biology lessons in classrooms from diverse regions in China. Multi-dimensional Rasch measurement was employed to analyze dimensionality, validity, reliability, and measurement invariance.
Results: The results confirmed a robust five-dimensional structure: investigating, sensemaking, critiquing, constraints, and interaction. The SPOP demonstrated strong psychometric properties, including excellent model fit, high reliability, and measurement invariance across gender, effectively capturing non-epistemic practices often overlooked by prior tools.
Discussion: As a culturally adapted, psychometrically robust protocol that integrates both dimensions, the SPOP can equip educators to determine proficiency in scientific practices, support researchers in tracking the development of these practices, and provide a foundation for practice-oriented teaching reform.
1 Introduction
To help students understand the epistemic basis of science, it is imperative to study and engage in foundational practices of science (Duschl and Grandy, 2013). The Framework for K–12 Science Education and the Next Generation Science Standards (NGSS) jointly identify eight core scientific practices (SPs) as cornerstones of science literacy. These practices include asking questions (AQ), developing and using models (DUM), planning and carrying out investigations (PCOI), analyzing and interpreting data (AID), using mathematics and computational thinking (UMCT), constructing explanations (CE), engaging in argument from evidence (EAFE), and obtaining, evaluating, and communicating information (OECI) (National Research Council [NRC], 2012; NGSS Lead States, 2013). Originating in the United States, these practices have gained global relevance by reframing scientific inquiry as an integrated process encompassing epistemic, social, and physical dimensions, moving beyond the simplistic view of science as isolated experimentation (Ford, 2015; Osborne, 2014).
This holistic perspective of inquiry inherently requires that assessments of SPs encompass two interconnected dimensions: epistemic practices (EPs), which focus on knowledge construction, and non-epistemic practices (N-EPs), which cover sociological and contextual aspects (García-Carmona, 2022, 2024).
Classroom observation protocols are indispensable for evaluating SPs. They uniquely capture dynamic interactions, such as peer discussions on data interpretation or teacher-student dialogue on experimental ethics, which cannot be fully captured by standardized tests or self-report surveys (Xu and Clarke, 2018). Therefore, valid and reliable observation tools are crucial for educators to diagnose proficiency gaps and for researchers to track the impact of instructional reforms. However, existing observation protocols for SPs have four limitations.
First, they systematically overlook N-EPs. Second, methodological flaws undermine their accuracy. Most tools rely excessively on holistic Likert scales that fail to distinguish fine-grained differences in performance of SPs, and many lack rigorous validation, offering insufficient evidence of whether they accurately measure the intended construct or yield consistent scores (Capps et al., 2016; Chen and Terada, 2021). Furthermore, gender-based differential item functioning (DIF) analysis has been overlooked, affecting instruments' validity and fairness across groups (Berrío et al., 2020). Third, disciplinary specificity is lacking, especially for biology, leaving high school biology SPs, such as those involving animal experimentation ethics or collaborative ecological data collection, underassessed. Fourth, existing observational protocols designed to assess SPs often overlook cultural adaptability during their development. Specifically, when assessment frameworks rooted in Western science education systems are applied within non-Western cultural settings, such as classrooms in China, they frequently require modification and contextual refinement to remain valid and appropriate (Shao et al., 2025).
To address these limitations, a theoretical framework that comprehensively encompasses epistemic and non-epistemic dimensions is required. The family resemblance approach (FRA) to science, which conceptualizes science as an integrated system of cognitive-epistemic and social-institutional components, is well suited to this task (Erduran and Dagher, 2014; Irzik and Nola, 2023). Critically, the dual-system structure of the FRA aligns with secondary biology's unique demands, where cognitive practices intertwine with non-epistemic considerations. Guided by the FRA, this study develops the Scientific Practices Observation Protocol (SPOP), an instrument culturally contextualized for Chinese high school biology classrooms. The SPOP integrates NGSS-derived EPs and the N-EPs identified by García-Carmona (2020), offering a comprehensive and refined measure of students' proficiency levels in SPs.
2 Literature review
2.1 Biological SPs and the FRA
While the NGSS and the Framework for K–12 Science Education have established eight core practices, they systematically neglect non-epistemic dimensions (García-Carmona, 2020). A definition of scientific inquiry that excludes non-epistemic factors cannot fully capture the complexity of real-world science (Erduran and Dagher, 2014). Biology, in particular, exhibits distinct social embeddedness and value-ladenness, as knowledge production is not objective and neutral but socially constructed and shaped by social biases (Potochnik, 2017). In teaching contexts, this trait manifests as a deep interweaving of cognitive inquiry with social and ethical considerations.
Compared to traditional frameworks that focus narrowly on cognitive or epistemological aspects, the FRA incorporates two interrelated systems: the cognitive-epistemic system (encompassing goals, values, methods, practices, and knowledge) and the social-institutional system (including social certification, communication, professional norms, funding, etc.) (Cheung and Erduran, 2023; Irzik and Nola, 2023). These systems are “distinct yet inseparable.” For instance, a scientist’s experimental design is inevitably shaped by constraints like funding. Thus, the FRA is advantageous in defining the nature of science and guiding science education.
2.2 Existing observation protocols for SPs
We systematically evaluated existing observation protocols for SPs by screening against two criteria: application in K-12 science classrooms and empirical evidence of reliability/validity. This screening revealed 14 protocols—the Science Teacher Inquiry Rubric (STIR) (Bodzin and Beerer, 2003), Electronic Quality of Inquiry Protocol (EQUIP) (Marshall et al., 2010), Assessment of Scientific Argumentation in the Classroom Observation Protocol (ASAC) (Sampson et al., 2012), Practices of Science Observation Protocol (P-SOP) (Forbes et al., 2013), Systematic Characterization of Inquiry Instruction in Early Learning Classroom Environments (SCIIENCE) (Kaderavek et al., 2015), Science Discourse Instrument (SDI) (Fishman et al., 2017), Modeling Observation Protocol (MOP) (Baumfalk et al., 2019), Scholastic Inquiry Observation Instrument (SIO) (Turner et al., 2018), the Interactive-Constructive-Active-Passive (ICAP) to Measure NGSS Science Practice Implementation (IONIC) (Chen and Terada, 2021), Modeling-Based Teaching Observation Protocol (MBTOP) (Shi et al., 2021), STEM Observation Protocol (STEM-OP) (Dare et al., 2021), Scientific Inquiry-Supported Classroom Observation Protocol (SICOP) (Unver et al., 2024), Integrated STEM Classroom Observation Protocol (iSTEM) (Ong et al., 2024), and Quantitative Modeling Observation Protocol (QMOP) (Lucas et al., 2025). A comparison of their key features is provided in Supplementary Material 1.
2.2.1 Content and framework
The majority focus on specific dimensions such as inquiry, argumentation, or modeling. Only the SCIIENCE and IONIC address all eight EPs. However, the IONIC infers proficiency indirectly through student engagement levels, while the SCIIENCE focuses on teacher behaviors. Moreover, existing protocols systematically neglect N-EPs. This omission is significant given that non-epistemic factors frequently determine scientific assessment outcomes (Elliott and McKaughan, 2014). Without assessing N-EPs, tools cannot capture the full complexity of authentic SPs. In addition, few instruments consider the cultural adaptability of assessment frameworks during development, which can undermine the measurement validity of core constructs (Shao et al., 2025).
2.2.2 School stage and subject
Current observation protocols reveal an uneven distribution across educational stages, with high school being underrepresented. Among the 14 protocols, seven target elementary and middle school levels, one targets the university level, four cover K-12, one includes both middle and high school levels, and one spans high school and university. Few address discipline-specific SPs. Specifically, 12 are domain-general: ten for general science and two for STEM fields. Only two target individual disciplines: the MOP for geography and the QMOP for biology. The influential Framework for K–12 Science Education strongly advocates "science and engineering practices" as cross-cutting, domain-general elements. However, SPs are inherently domain-specific, shaped by disciplinary norms and goals (Ford, 2008). This view is supported by evidence that the progression of SPs emerges through context-dependent interactions between knowledge, epistemology, and task constraints (Neumann et al., 2013), underscoring domain-general tools' inability to capture discipline-specific practices.
2.2.3 Measurement methodology
The 14 protocols employ various methodologies, including Likert scales, behavioral code frequencies, and descriptive protocols. These tools largely lack a theoretical framework to guide raters in distinguishing the characteristics of learning environments, so ratings rest on subjective impressions rather than educational theory. For instance, the ASAC uses a 0–3 Likert scale, but inconsistencies persist in result interpretation across applications despite reported interrater reliability (Capps et al., 2016). Furthermore, their scoring and leveling schemes make the protocols difficult to employ for peer and self-assessment, as they fail to address the breadth of student-teacher behavior needed to provide continuous feedback (Miller et al., 2014). A Likert scale-based design also fails to capture fine-grained differences in the quality of SPs.
2.2.4 Reliability and validity
Validity evidence is widely insufficient across the 14 protocols. Only eight protocols verified content validity through expert review. Three reported face validity, two reported translation validity, and one protocol provided no validity evidence beyond basic coding scheme descriptions. Construct validity was assessed by only five protocols, with inconsistent methodological rigor: the P-SOP used principal component analysis (PCA)—an unsuitable method for construct validation because it assumes no measurement error (Schmitt, 2011)—and three protocols employed confirmatory factor analysis (CFA), but with small sample sizes, limiting generalizability. Only the MBTOP adopted item response theory, a relatively rigorous approach to construct validation. For other validity types, few protocols reported supplementary evidence. Moreover, none of the 14 protocols performed DIF analysis.
While all 14 protocols reported interrater reliability, their methodological approaches and rigor varied considerably, raising concerns about comparability: eight studies used the standard Cohen's kappa for categorical data; the STIR used a less robust teacher-observer correlation; the MBTOP relied on Spearman's rank correlation; the STEM-OP applied Krippendorff's alpha for small samples; the QMOP used the gold-standard intraclass correlation for continuous data; and the iSTEM was inconsistent, reporting weak percent agreement alongside the Miles-Huberman equation. Internal reliability was documented for only six protocols: five used Cronbach's α, and the MBTOP used expected a posteriori/plausible value (EAP/PV) reliability estimates.
A final critical issue was insufficient sample sizes for validation. Boomsma (1985) recommends a minimum of 100 lessons for robust psychometric testing, but only three protocols met this standard. Small samples reduce statistical power, preventing confirmation of whether tools work consistently across contexts.
2.3 Conceptualization and framework
Although the Framework for K–12 Science Education and the NGSS outline eight EPs, their interpretation in research and practice remains highly flexible. To ensure conceptual clarity, we precisely defined each EP dimension and clarified the original meanings by reviewing the NGSS Framework and its source references (Table 1). For instance, AQ refers to both the generation of new questions and the reformulation of given questions. Examples are as follows: What exists and what happens? Why does it happen? How does one know?
The operationalization of N-EP constructs was primarily based on García-Carmona's (2020, 2021) theoretical framework (Table 2). Drawing on the FRA's EP and N-EP systems, an initial dimension system for the SPs framework was integrated from 18 existing studies.
Following McNeill et al. (2018) and Ko and Krist (2019), the eight EPs from the NGSS were grouped into three core dimensions corresponding to the FRA cognitive-epistemic system: investigating (IV) encompasses AQ, PCOI, and UMCT, focusing on the process of knowledge acquisition; sensemaking (SM) encompasses DUM, AID, and CE, focusing on the process of knowledge comprehension and transformation; and critiquing (CQ) encompasses EAFE and OECI, focusing on the process of knowledge verification and application. Based on the eight N-EPs proposed by García-Carmona (2020) and combined with the FRA social-institutional system, the eight N-EPs were merged into three core dimensions as follows: utilize resources (UR), includes professional and personal relationships among scientists (PPR) and fundraising for scientific investigation (FD); ethics (E), includes gender in scientific investigation (G) and moral and ethical issues in scientific investigation (ME); and interactions (IA), includes the role of the scientific community in the acceptance of scientific theories (RSC), rhetorical skills and semantic strategies to persuade through one’s ideas (RS), scientific collaboration and cooperation (SCC), and social perspective of scientific communication (SPSC) (Figure 1).
Figure 1. Preliminarily proposed conceptual framework for SPs proficiency. IV, investigating; AQ, asking questions; DUM, developing and using models; PCOI, planning and carrying out investigations; SM, sensemaking; AID, analyzing and interpreting data; UMCT, using mathematics and computational thinking; CE, constructing explanations; CQ, critiquing; EAFE, engaging in argument from evidence; OECI, obtaining, evaluating, and communicating information; UR, utilize resources; PPR, professional and personal relationships among scientists; FD, fundraising for scientific investigation; E, ethics; G, gender in scientific investigation; ME, moral and ethical issues in scientific investigation; IA, interactions; RSC, the role of the scientific community in the acceptance of scientific theories; RS, rhetorical skills and semantic strategies to persuade through one's ideas; SCC, scientific collaboration and cooperation; SPSC, social perspective of scientific communication.
However, the direct application of Western cultural adaptation theories often leads observational scales to overlook regional cultural diversity (Aronson et al., 2020; Franco et al., 2023). For instance, observation indicators developed based on individualistic Western cultures may not adequately capture classroom interactions within collectivist cultural contexts like China. Therefore, in developing the SPOP, it is essential to consider the cultural adaptability of the SPs assessment framework and make necessary adjustments.
3 Research objectives
Based on the literature review and theoretical framework presented above, this study aimed to address three research objectives:
(1) Develop a culture-adapted observation protocol encompassing EPs and N-EPs for secondary biology classrooms in China.
(2) Establish the reliability and validity of the SPOP.
(3) Examine the DIF of the SPOP across genders to ensure measurement invariance and equity.
4 Methods
4.1 Data sources
We used a stratified purposive sampling strategy: we first stratified by China's seven major geographical regions to ensure socio-economic and educational diversity, then by course type (compulsory, elective) within each region to ensure variation in course content (Table 3). The final sample consisted of complete video recordings of 127 high school biology lessons, each lasting 40 min. For each video-recorded class, 6–8 students were selected through simple random sampling (covering different seating areas) to ensure representativeness, yielding a total of 883 students in Grades 10–11. Although the sampling unit was the lesson, the core analytical unit was students' observed performance in SPs in the classroom. That is, students' practical behaviors during specific teaching tasks were independently coded and used to calibrate the measurement model.
Ethical compliance was strictly maintained throughout the data collection process. All teachers and students provided voluntary, informed consent, with the data used solely for SPOP validation while ensuring three guarantees: participant anonymity in SP analysis; access to de-identified data upon request; and video usage only for research purposes.
4.2 Development of the SPOP
4.2.1 Performance levels and characteristics
To translate abstract constructs of EPs and N-EPs into measurable indicators, we developed hierarchical performance levels through a synthesis of literature and evidence-based adaptation. For the practice of AQ, a four-tier complexity framework (Huang et al., 2017) was extended to incorporate Level 0 (students failed to raise questions or pose scientific inquiries). Similarly, the PCOI employs a five-level evidence-grounded taxonomy: Level 0 denotes the inability to propose plans; Level 1 reflects inadequate task-solving designs; Level 2 requires teacher scaffolding for adequate plans; Level 3 involves partially adequate independent plans permitting investigation but not full conclusions; and Level 4 represents fully adequate autonomous solutions (Crujeiras-Pérez and Jiménez-Aleixandre, 2017). Accordingly, we established hierarchical performance levels for each of the EPs, with all levels grounded in empirical or theoretical literature (Table 4).
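To illustrate how such hierarchical levels can be operationalized for raters or coding software, the sketch below encodes the PCOI taxonomy as a simple lookup structure; the dictionary and function are our own illustration (with paraphrased descriptors), not part of the published protocol.

```python
# Illustrative sketch (not part of the published protocol): the PCOI
# performance levels encoded as a machine-readable rubric for coding.
PCOI_LEVELS = {
    0: "Unable to propose an investigation plan",
    1: "Proposes a design inadequate for solving the task",
    2: "Produces an adequate plan only with teacher scaffolding",
    3: "Independently produces a partially adequate plan permitting "
       "investigation but not full conclusions",
    4: "Independently produces a fully adequate solution",
}

def describe_pcoi(level: int) -> str:
    """Return the rubric descriptor for an observed PCOI level (0-4)."""
    return PCOI_LEVELS[level]
```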
For the N-EPs, we developed hierarchical framework levels based on García-Carmona's (2020, 2024) N-EP learning standards, tailored to secondary biology contexts. For example, the categorization of PPR into five levels is based on two criteria: autonomy in organizing inquiry tasks and the rationality of role assignments. The levels progress from full teacher control (Level 0) to fully autonomous and rational student collaboration (Level 4). ME uses a five-tier scale focused on recognition of stakeholder interests and adherence to ethical guidelines.
4.2.2 Development of scoring examples
To help raters quickly understand how to assess students' performance in the relevant SPs according to the scoring criteria in Supplementary Material 2, we developed an example scoring case drawn from high school biology lessons on diabetes dietary management, in which students were divided into six subgroups to design experiments investigating the reduction of sugar, protein, and fat via colorimetric reactions. Each subgroup formulated research questions and experimental protocols, with subsequent teacher-facilitated discussions to refine the designs (Supplementary Material 3).
4.2.3 Testing by the research team
This phase established translation validity by examining the extent to which the theoretical concepts were translated into observable classroom behaviors. The first two authors independently applied the SPOP to 25 Grade 10–11 biology lesson videos. The results indicated that four dimensions exhibited excessively low observation rates across the 25 trial lessons: PPR (4.3%), RSC (2.1%), SPSC (3.8%), and G (1.9%). This suggests that the behaviors corresponding to these dimensions occur infrequently in Chinese classrooms, resulting in insufficient measurement variance. Consequently, these dimensions required integration or deletion.
Given that PPR focuses on "task organization and role assignment" and, together with SCC, falls within the core category of "scientific ethos" in the FRA's "social-institutional system," PPR was integrated into SCC. This merger continues to cover the core construct of that system and addressed the practical need to simplify the assessment framework for classroom use while preserving the essential elements of scientific teamwork (García-Carmona, 2024). Guided by García-Carmona's (2020) principle that N-EPs can be integrated into higher-order dimensions according to cultural contexts, and noting that gender equality in Chinese classrooms manifests chiefly as "equal participation opportunities" rather than as a standalone dimension, we integrated the G dimension into the "balanced participation" indicator of IA. This integration also serves to prevent the reinforcement of gender stereotypes (García-Carmona, 2020).
In Chinese classrooms, scientific theories are often treated as "absolute truths" rather than as outcomes of negotiation within the scientific community. Teachers tend to emphasize "the authority of scientists" (Zhu and Li, 2020), thereby overlooking the need to guide students in the RSC dimension. Furthermore, scientific communication in Chinese classrooms is often teacher-centered, with little attention paid to science-society interaction, which makes the SPSC dimension virtually unobservable in assessment. These cultural factors result in the absence of behavioral manifestations related to these dimensions, constituting a "culture-specific measurement blind spot." Retaining them would diminish the instrument's practicality, while their removal does not compromise the core constructs of the FRA's dual-system framework. Accordingly, the RSC and SPSC dimensions were omitted. This decision reflects the core philosophy of the FRA—that "science education must balance theoretical comprehensiveness with practical utility" (Irzik and Nola, 2023).
Finally, issues were resolved through team discussions, leading to two key revisions. First, four dimensions (PPR, G, RSC, and SPSC) of N-EPs were integrated or removed due to classroom cultural differences or the need to simplify the instrument. Second, FD (the original UR dimension) and ME (the original E dimension) were merged to form a new “constraints (CT)” dimension. After the aforementioned revisions, the number of sub-practices in the SPOP was streamlined to twelve, and the number of N-EP dimensions was reduced from three (UR, E, IA) to two (Constraints, CT; Interaction, IA) (Table 5). In sum, the revised framework strikes a balance between theoretical integrity and practical utility by preserving the complete cognitive-epistemic system while capturing the essential aspects of the social-institutional system.
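The revised structure can be summarized compactly. The following sketch (our illustration, using the abbreviations defined above) records the final mapping of the twelve sub-practices onto the five dimensions:

```python
# Illustrative summary of the revised SPOP structure: five dimensions
# spanning twelve sub-practices (abbreviations as defined in the text).
SPOP_DIMENSIONS = {
    # Cognitive-epistemic system (EPs)
    "IV": ["AQ", "PCOI", "UMCT"],   # investigating
    "SM": ["DUM", "AID", "CE"],     # sensemaking
    "CQ": ["EAFE", "OECI"],         # critiquing
    # Social-institutional system (N-EPs)
    "CT": ["FD", "ME"],             # constraints
    "IA": ["SCC", "RS"],            # interaction
}
assert sum(len(subs) for subs in SPOP_DIMENSIONS.values()) == 12
```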
4.2.4 Advisory board feedback
This phase involved an advisory board of five educational researchers, two senior teachers, one master teacher, and two biology teaching specialists, tasked with enhancing content, face, and translation validity through expert evaluation of the protocol. Each member independently assessed the students in three standardized video lessons using the instrument, enabling external validation beyond the core research team. Collected feedback primarily addressed specific sub-practice descriptors, which the team systematically synthesized to implement targeted revisions and refinements to the protocol. To address the lack of biological characteristics in the performance-level descriptions under DUM, we revised the sub-dimensions' descriptions and merged Levels 3 and 4, which were difficult to distinguish because of vague descriptions, into a single Level 3: "Students construct, use, evaluate, and modify biological models to improve alignment with observed data, experimental results, or predictions, while incorporating additional evidence or phenomena." The revised SPOP was used for the pilot test.
4.2.5 Pilot test and revision
We evaluated the SPOP's preliminary psychometric properties and refined problematic items through a pilot test. Applying Wright and Tennant's (1996) Rasch sampling guideline, we randomly selected 50 secondary biology lessons from the 127 lessons, ensuring stable parameter estimation with 50 respondents each encountering ≥ 10 items. This guideline was applied by treating the students as respondents and the observed practices as items. Two raters (a senior high school teacher with over 20 years of classroom experience and a doctoral-level biology education expert) underwent eight hours of training on the SPOP's theoretical framework, coding cases, and individual case analysis. After multiple rounds of collaborative discussion, the raters achieved excellent inter-rater reliability (Cohen's κ = 0.747 > 0.61), indicating a high level of agreement between the two raters (Cohen, 1968).
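As a sketch of this agreement check, assuming both raters' level codes are stored as parallel lists, Cohen's kappa can be computed with scikit-learn; the rating data below are placeholders, not the study's codes.

```python
# Sketch of the inter-rater agreement check; rating data are hypothetical.
from sklearn.metrics import cohen_kappa_score

rater_a = [0, 2, 3, 1, 4, 2, 2, 3]  # hypothetical level codes, rater A
rater_b = [0, 2, 3, 2, 4, 2, 1, 3]  # hypothetical level codes, rater B

# Unweighted kappa; for ordinal levels a weighted variant can be requested
# via weights="linear" or weights="quadratic" (Cohen, 1968).
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.3f}")  # > 0.61 indicates substantial agreement
```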
We implemented Rasch measurement to transform ordinal observations into interval-level measures (Liu, 2012) and analyzed the latent SP proficiency construct through dimensionality and reliability diagnostics using Winsteps 3.74.0. Resultant fit statistics guided iterative scale refinements—a validation methodology empirically supported in science education research (He et al., 2016; Zhang and Browne, 2023). The revised SPOP was used for formal testing.
4.2.6 Formal testing
We targeted a minimum sample of 100 instructional units to ensure statistical robustness (Boomsma, 1985); the final sample comprised 127 lessons. Furthermore, we implemented systematic rater training protocols to establish instrument feasibility and ensure measurement consistency. Two qualified raters independently evaluated a 20% subsample of the videos, achieving inter-rater reliability of Cohen's κ = 0.833 and confirming robust interrater consistency in assessing secondary students' proficiency in SPs.
4.3 Data analysis
The Partial Credit Rasch Model (PCM) (Wright and Masters, 1982) was employed to validate the psychometric properties of the SPOP. As a core member of the Rasch model family (Masters, 1982), the PCM features a "step difficulty" parameter design that accurately aligns with the construct essence of "stepwise performance in SPs" in the SPOP. By assuming a fixed item discrimination parameter of 1, it strictly upholds the "measurement invariance" of the Rasch model (Bond and Fox, 2015), ensuring the fairness of ability comparisons among students across genders, grades, and regions—a core advantage unattainable by the Generalized Partial Credit Model (GPCM) and Graded Response Model (GRM). Regarding the GPCM, its proposer Muraki (1992) explicitly noted that its advantage of variable item discrimination only manifests when item discrimination is significantly heterogeneous. However, pilot testing revealed high homogeneity in item discrimination (estimates ranged from 0.93 to 1.04, near the PCM's fixed value of 1.00), rendering the GPCM's additional discrimination parameter redundant and unnecessarily increasing model complexity and interpretation barriers. For the GRM, the "cumulative probability logic" established by Samejima (1969) requires strictly progressive scoring levels. In contrast, the SPOP's performance levels exhibit "non-monotonic difficulty" (e.g., significant variations in difficulty may exist across different dimensions within the same scoring level). The GRM's fixed increasing thresholds distort the logic of real performance, and its parameter interpretation is far less concise and intuitive than that of the PCM, which conflicts with the SPOP's intended use as a classroom observation tool for frontline educators. In summary, the PCM meets three key objectives with fewer parameters: it captures step difficulty accurately, ensures measurement fairness, and supports easy use and dissemination. This makes it well-suited to the core research needs.
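For reference, the PCM (Masters, 1982) models the probability that student $n$ scores in category $x$ on sub-practice $i$ as a function of ability $\theta_n$ and step difficulties $\delta_{ik}$:

```latex
P(X_{ni} = x) =
  \frac{\exp\sum_{k=0}^{x}(\theta_n - \delta_{ik})}
       {\sum_{h=0}^{m_i}\exp\sum_{k=0}^{h}(\theta_n - \delta_{ik})},
  \qquad x = 0, 1, \ldots, m_i,
```

where $m_i$ is the highest level of sub-practice $i$ and the $k = 0$ term of each sum is defined as zero. The $\delta_{ik}$ are the step difficulties whose calibration is examined via the category probability curves in Section 5.2.3.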
Given the inherently multifaceted nature of proficiency in SPs in secondary biology classrooms, we conducted multi-dimensional Rasch analyses using ConQuest 5.0 (Adams et al., 2020) to identify the optimal measurement model. To balance fit and complexity, we used three fit indices for our model selection: deviance, Akaike information criterion (AIC) (Akaike, 1974), and Bayesian information criterion (BIC) (Schwarz, 1978). We further tested the correlation among the five dimensions of SPOP using ConQuest 5.0.
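A minimal sketch of this model-selection logic, assuming deviance values and parameter counts exported from ConQuest (the numbers below are placeholders, not the study's estimates):

```python
# Compare nested Rasch models via likelihood-ratio test, AIC, and BIC.
import math
from scipy.stats import chi2

def compare_models(dev0, k0, dev1, k1, n):
    """dev0/k0: simpler model; dev1/k1: more complex model; n: sample size."""
    lr = dev0 - dev1                  # deviance difference is chi-square
    df = k1 - k0                      # with df = number of extra parameters
    p = chi2.sf(lr, df)
    aic = dev1 + 2 * k1               # AIC = deviance + 2k (Akaike, 1974)
    bic = dev1 + k1 * math.log(n)     # BIC = deviance + k*ln(n) (Schwarz, 1978)
    return lr, df, p, aic, bic

# Hypothetical usage: unidimensional vs. five-dimensional model, n = 883.
print(compare_models(dev0=30000.0, k0=40, dev1=28040.0, k1=53, n=883))
```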
Rating scale validation followed established methodology and confirmed monotonic step calibrations (Linacre, 2002); Linacre specifies a minimum interval of 1.4 logits between adjacent thresholds for three-category scales, while allowing lower thresholds for four- to five-category instruments. Individual item fit was assessed via Infit and Outfit MNSQ, evaluated against a 0.7–1.3 range (Liu, 2012). Item-person maps assessed alignment between the 883 students' proficiency in SPs and the difficulty thresholds of the 12 sub-dimensions. To enable criterion-referenced interpretation of the SPOP results, we established proficiency levels following the Programme for International Student Assessment (PISA) methodological framework (OECD, 2017). The continuum of SPs was partitioned into 1.0-logit intervals, deliberately exceeding PISA's 0.8-logit standard, to accommodate the manifestation patterns of epistemic operations. Item and person reliability were evaluated via Rasch-specific metrics, and multiple methods were used to verify internal reliability and item and person separation reliability. To establish criterion-related validity, a Pearson correlation analysis was conducted between a sample of 102 students' Rasch ability estimates on the SPOP and their scores on recent Gaokao biology test items designed to assess core competencies. Correlation analyses were performed with SPSS 27.0. To ensure fairness across genders, we conducted gender-based DIF analysis using the Mantel-Haenszel method (Adams et al., 2018).
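As one concrete piece of this pipeline, the PISA-style banding can be expressed as a simple function; the cut point below is illustrative, since the study derived its own boundaries from the item-person map.

```python
# Sketch: map a Rasch ability estimate (logits) to a proficiency level,
# using 1.0-logit-wide bands; lowest_cut is an illustrative boundary.
def proficiency_level(theta: float, lowest_cut: float = -2.0) -> int:
    band = int((theta - lowest_cut) // 1.0) + 1  # 1.0-logit intervals
    return max(1, min(band, 5))                  # clamp to Levels 1-5

assert proficiency_level(-2.5) == 1 and proficiency_level(1.7) == 4
```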
5 Results
5.1 Reliability of the SPOP indicators
The instrument demonstrated strong psychometric adequacy across multiple reliability metrics. Rasch model indices showed high item (0.959) and person (0.813) reliability. The overall item separation reliability was 0.841 (> 0.80), indicating a well-differentiated item difficulty hierarchy; the person separation reliability was 0.762 (> 0.70), demonstrating that the SPOP can stably and effectively distinguish between students with different levels of SPs. Dimension-specific EAP/PV reliability estimates further demonstrated internal consistency, with IV (0.735), SM (0.851), CQ (0.778), CT (0.801), and IA (0.830) all exceeding the 0.70 threshold (Bond and Fox, 2015).
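For transparency, the separation reliabilities reported here follow the standard Rasch definition (true variance over observed variance). The sketch below shows how they could be recomputed from exported person measures and standard errors, although Winsteps and ConQuest report them directly; the input values are hypothetical.

```python
# Sketch: Rasch separation reliability from measures and standard errors.
import numpy as np

def separation_reliability(measures, std_errors):
    obs_var = np.var(measures, ddof=1)               # observed variance
    err_var = float(np.mean(np.square(std_errors)))  # mean error variance
    true_var = max(obs_var - err_var, 0.0)           # adjusted (true) variance
    return true_var / obs_var

# Hypothetical person measures (logits) and their standard errors.
print(separation_reliability([-1.2, -0.4, 0.1, 0.6, 1.3], [0.3] * 5))
```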
5.2 Validity of the SPOP indicators
5.2.1 Model-data fit
The standardized residual variance table for the formal test showed a total observed variance of 56.4%, of which the measures explained 44.4%; this was within the acceptable range but lower than in the previous round of testing (71.5%) and near the 50% lower limit, suggesting scale multi-dimensionality. We therefore conducted concurrent multi-dimensional Rasch analyses using ConQuest 5.0 to identify the optimal measurement model (Figure 2).
The five-dimensional model exhibited the least deviance among the three models. The change in deviance from the unidimensional to the two-dimensional model was statistically significant [χ2(2) = 1097.77, p < 0.05]. The change in deviance from the two-dimensional to the five-dimensional model was also statistically significant [χ2(5) = 862.49, p < 0.001]. Further confirmation was derived from the information criteria: the five-dimensional model achieved the minimal values, with metrics that balanced model fit against complexity. Collectively, these results established the multi-dimensional Rasch framework as the optimal representation of our data structure, confirming robust psychometric properties for the instrument (Table 6).
5.2.2 Correlations between the five dimensions
Correlation analysis showed strong interconnections among the dimensions of EPs (IV, SM, CQ; r > 0.75). Beyond these intra-epistemic connections, moderate correlations (all coefficients > 0.5) emerged between IA, a dimension of N-EPs, and the three dimensions of EPs (IV, SM, CQ). This finding mirrors existing research, which has highlighted interactions between EPs and N-EPs. However, CT showed negligible associations with IA (r = −0.060) and all EP dimensions (r = −0.054 to −0.033) (Table 7).
These findings show meaningful interconnections within the EPs and between the IA dimension of N-EPs and the EPs. However, the non-overlapping measurement dimensions and content confirmed that these constituted distinct yet related competencies.
5.2.3 Category probability curves
After empirical refinement across 25 classroom trials, we found that dimensions such as AQ and FD showed unobservable levels and were temporarily retained, while DUM and CE had ambiguous category probability peaks. Following expert consultation addressing insufficient biological representation and ambiguous level descriptors, 50 classroom tests confirmed that the hypothesized levels of specific SP dimensions remained unobservable, prompting the consolidation of items. The finalized scale demonstrated optimal psychometric properties, characterized by distinct probability curves with adjacent peak separations greater than 1.4 logits (e.g., the PCOI items) (Figures 3, 4). For the ME dimension, only Levels 0 and 1 were observed in the data, contrary to the initial hypothesis of Levels 0–2. Supplementary Material 4 presents the response thresholds corresponding to the categories of each dimension in the five-dimensional Rasch model output by ConQuest 5.0.
5.2.4 Item fit
When MNSQ values fall outside the confidence interval, the absolute values of their corresponding T-statistics exceed ± 2.0, indicating significant item misfit (Adams et al., 2020). The parameter estimation table for the 12 dimensions of SPs shows that the unweighted MNSQ values of CE (0.84) and OECI (1.36) fall outside the confidence interval (CI), with corresponding absolute T values greater than 2. In contrast, the weighted MNSQ values are 0.88 and 1.00, respectively, both within the corresponding CI ranges, with absolute T values less than 2 (Table 8). Additionally, both the unweighted and weighted MNSQ for the AQ dimension slightly exceeded the upper limit, yet their confidence intervals encompassed the acceptable range, indicating that the fit was close to meeting the criteria. Given its foundational role in EPs, AQ was retained for further investigation in subsequent research, despite exhibiting low frequency and a comparatively high error estimate (0.420) in the current sample.
Table 8. Fit statistics for 12 items in the revised Scientific Practices Observation Protocol (SPOP).
5.2.5 Wright map
The Wright map confirmed construct validity by visualizing the alignment between student ability and item difficulty across the dimensions (Figure 5). With mean item difficulty calibrated to 0 logits and each graphical symbol representing roughly eight students, clustered distributions signify constrained differentiation, whereas dispersed configurations indicate robust discrimination. The empirical data demonstrated that students typically clustered near 0 logits (a range of −2 to +2), while all 12 sub-dimensions maintained robust discrimination despite varying difficulty levels. Furthermore, item difficulties skewed right as foundational elements representing simpler tasks progressed to more challenging constructs. Crucially, the item thresholds comprehensively covered the observed ability range, with dimensionally distinct difficulty spans, confirming the psychometric adequacy of the SPOP through appropriate difficulty targeting, discriminative power, and multi-dimensional sensitivity.
The resultant five-tier classification system organizes student proficiency into progressive mastery levels (Figure 5). Items 1–12 correspond to AQ, PCOI, UMCT, DUM, AID, CE, EAFE, OECI, FD, ME, SCC, and RS, respectively. For example, "1.1" represents Level 1 of AQ. The five-tier proficiency framework progresses from Level 1 mastery of foundational behaviors, including fundamental questioning (AQ-1) and funding awareness (FD-1), to Level 2 execution of targeted practices, exemplified by biological model development post-data collection (DUM-2) and budget-constrained investigation planning without sustainability considerations (FD-2). Level 3 achievement encompasses items of moderate difficulty, featuring teacher-assisted investigation design (PCOI-2) and high-interaction biological collaboration (SCC-2). Contrastingly, Level 4 mastery of advanced items demonstrates synthesis-driven biological questioning (AQ-2) and scientifically persuasive reporting (RS-3). Culminating at Level 5 expertise across complex items, students construct principle-based biological explanations (CE-3) and generate coherent rebuttals in socio-scientific contexts (EAFE-4), reflecting progressive sophistication from fundamental to expert performance tiers.
5.2.6 Criterion-related validity
The results revealed a significant positive correlation (r = 0.61, p < 0.05) (Cohen, 1988) between the SPs measured by the SPOP and scores on recent Gaokao test items, which supports the criterion-related validity of the instrument.
5.3 DIF analysis
DIF analysis examines both effect size and statistical significance (p < 0.05). Effect size categories (Zwick et al., 1999) are: Category C (moderate to large, |DIF| ≥ 0.64 logits); Category B (slight to moderate, |DIF| ≥ 0.43 logits); and Category A (negligible, |DIF| < 0.43 logits). DIF analysis by gender revealed that OECI was the most challenging item for both boys and girls. Contrastingly, CE was the easiest item for boys, whereas AID was the easiest for girls (Table 9). Although the absolute DIF values of some items (AQ, UMCT, AID, CE, EAFE, SF, ME) exceeded 0.43 logits, all Mantel-Haenszel probabilities surpassed the 0.05 significance threshold. Subsequently, we selected eight items with relatively small absolute DIF contrast values and relatively large p-values from the initial test results as a "composite anchor," fixing their difficulty parameters. The four non-anchored items (AQ, DUM, OECI, ME) were then re-examined for DIF using the same method over two rounds of iteration. Both rounds showed uniformly non-significant results (p > 0.05). Post-hoc power analysis further confirmed these results: with a sample size of 883, the statistical power exceeded 0.99 for detecting a small effect size, well above the 0.8 threshold. These findings provide effective evidence supporting the conclusion of measurement invariance.
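A sketch of this screening rule, combining the Zwick et al. (1999) effect-size thresholds with the Mantel-Haenszel significance test; the inputs would come from the ConQuest DIF output, and the function is our paraphrase of the ETS convention, not the study's code.

```python
# Sketch: classify gender DIF into ETS-style categories (Zwick et al., 1999).
def dif_category(contrast_logits: float, mh_p: float) -> str:
    size = abs(contrast_logits)
    if size >= 0.64 and mh_p < 0.05:
        return "C (moderate to large DIF)"
    if size >= 0.43 and mh_p < 0.05:
        return "B (slight to moderate DIF)"
    return "A (negligible DIF)"

# E.g., an item with |DIF| = 0.50 logits but p = 0.21 remains Category A,
# matching the non-significant pattern reported above.
print(dif_category(0.50, 0.21))
```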
6 Discussion
6.1 Developing a valid and reliable protocol to assess SPs covering N-EPs
Existing SP observation protocols systematically overlook N-EPs’ assessment, often focusing on specific aspects of EPs and prioritizing teachers’ instructional performance in classrooms. Furthermore, most observation tools use Likert scales or behavioral code frequency to measure students’ levels of SPs, using classical test theory to conduct validation. Therefore, a valid and reliable instrument for comprehensively measuring students’ performance in SPs is needed. Based on FRA theory, this study integrated EPs and N-EPs into an observation scale, using the partial credit Rasch model for instrument validation, making item difficulty and measures of students’ SPs directly comparable (Bond and Fox, 2015), while addressing existing limitations. Compared to interdisciplinary observational protocols such as the STEM-OP and iSTEM, the SPOP comprises refined indicators that reflect biological characteristics.
Correlational analysis uncovered strong interlinkages among the dimensions of EPs (IV, SM, and CQ), consistent with existing research indicating that EPs operate as interconnected processes (Chen and Terada, 2021; McNeill et al., 2018). Such interrelationships highlight the synergistic cognitive processes at the core of SPs. IA and the three EP dimensions showed moderate correlations, mirroring existing research, which has highlighted interactions between EPs and N-EPs. EPs provide the methodological foundation for knowledge production, while N-EPs shape the social conditions for such production (Wu and Erduran, 2024). This can also be supported by epistemic network analysis: while connections within the cognitive system are strong, cross-dimensional links between the cognitive and social systems are also significant (Cheung and Erduran, 2023), confirming their inseparability in real-world SPs. The negligible associations between CT (an N-EP) and IA or the EP dimensions suggest that EPs are linked to only some dimensions of N-EPs, confirming SPs' multidimensional nature. Combined with the results of the final deviance, AIC, and BIC tests, this study verified that the five-dimensional structural model of SPs achieved the best fit.
The instrument showed robust psychometric adequacy across various reliability indicators. An initial level of criterion-related validity was established by correlating instrument scores with student performance on a set of recent Gaokao items. Moreover, the DIF test further clarified the validity and fairness of SPOP, addressing the methodological flaws in the development process of existing observation protocols.
6.2 Establishing performance indicators and levels of SPs
Although the eight dimensions of EPs have been proposed in existing observation protocol studies (e.g., McNeill et al., 2018) or partially validated for their sub-dimension performance (e.g., Mooney, 2002; Shi et al., 2021), they lack systematicity. Additionally, the practice and evaluation of the eight dimensions of N-EPs in science classrooms remain at an initial stage (García-Carmona, 2022, 2024). This study empirically validated the hierarchical relationships among the indicators within each dimension of the SPs. Based on the Rasch modeling results, we further established students' performance levels in SPs—these levels can effectively characterize students' SPs in science classrooms and are accessible to both teachers and researchers. Compared with previous protocols, this observation protocol enables more accurate measurement of students' performance in SPs.
6.3 Guiding students’ advanced development in SPs
The integration of N-EPs into the SPOP is likely to help teachers better understand and implement N-EPs’ instruction. The downplaying of N-EPs in the NGSS hampers teachers’ understanding of complete SPs (Dagher and Erduran, 2016; García-Carmona, 2020). The present findings provide empirical evidence for applying SPs-related assessments in high school biology classrooms, which is conducive to facilitating SPs’ incorporation, especially N-EPs, into school curricula.
The SPOP directly provides feedback for teaching optimization. Accordingly, researchers can assess students’ levels in SPs, further identify their specific performance levels, and conduct comparative studies to determine differences in SP levels across genders, grade levels, and districts. The five-level competence framework of the SPOP provides standards for differentiated instruction. For instance, if most students demonstrate Level 2 performance in EPs—such as only describing data without logical reasoning during data interpretation—targeted instruction should be strengthened to enhance the connection between “evidence and conclusion.”
Validated via the Rasch model, the SPOP exhibited good reliability and validity, supporting long-term tracking of students' competencies. The instrument can serve as a promising observation tool in programs for improving students' scientific literacy, helping determine whether interventions enhance high school students' SPs. Teachers can conduct observations at the beginning, middle, and end of a semester to map competence development curves.
6.4 Cultural adaptation and applicability boundaries of the SPOP
In developing the SPOP, the decision to reduce certain N-EP dimensions from the FRA does not negate their theoretical value but represents a necessary filtering based on pedagogical feasibility and cultural adaptation. Consequently, the SPOP emerges as an organic integration of the FRA theory and the culture of Chinese high-school biology classrooms, which emphasizes collective collaboration, teacher authority, and examination-oriented instruction. This filtering is theoretically supported by García-Carmona (2024), who argues that effective teaching of N-EP dimensions requires teachers to possess “explicit intention” and that educators should select dimensions from a broad NOS framework that align with instructional goals, content, and student characteristics.
In current Chinese high school biology classrooms, dimensions such as RSC and SPSC lack corresponding instructional guidance and an observational basis, making them difficult to assess effectively. This reality provides the practical rationale for their removal. Furthermore, consensus in cross-cultural educational research underscores the importance of cultural sensitivity in coding and analytical frameworks. Clarke et al. (2012) caution against imposing teaching ideals from a single cultural perspective, while Xu and Clarke (2018) argue that observational tools should balance cross-cultural comparability with cultural specificity—that is, they should be grounded in local educational goals, teacher-student interaction patterns, and the broader cultural context. In line with García-Carmona's (2024) observation that full NOS integration is a long-term endeavor, the SPOP, designed as an introductory tool for teaching non-epistemic practices, intentionally incorporates only those dimensions that are observable, actionable, and culturally aligned in local classrooms, thereby enabling a practical transition from theory to practice. Notably, the exclusion of RSC and SPSC may itself indicate a systemic underemphasis on the "social-institutional" layer of science in Chinese science education. Thus, the SPOP documents the existing classroom "reality," rather than prescribing an idealized model of practice.
7 Limitations and future direction
Some issues must be addressed in future research. First, sampling was restricted to Grade 10–11 high school biology classrooms; future studies could extend the SPOP to Grade 12 or junior school biology by adapting level descriptors and items. According to the junior school biology curriculum, the difficulty of the corresponding dimension statements for SPs should be appropriately reduced. Second, the SPOP’s design is culturally specific to Chinese classrooms, necessitating the removal or integration of four N-EP dimensions. Future studies should therefore enhance cross-cultural validity through two means: validating the instrument in diverse settings (e.g., Western or East Asian classrooms) to locally adapt dimensions, and conducting prior pilot testing to recover or adjust relevant dimensions, such as reinstating RSC and SPSC in Western contexts. Third, while this study provides initial evidence for the concurrent validity of the SPOP, it does not address its predictive validity. Future research should therefore extend the range of criterion measures and establish the instrument’s predictive validity. Fourth, traditional manual classroom observation and scoring are relatively inefficient; future studies could introduce artificial intelligence scoring tools.
Data availability statement
The data analyzed in this study are subject to the following licenses/restrictions: access is restricted to protect the privacy of participants (students). Requests to access these datasets should be directed to zhaohui@snnu.edu.cn.
Ethics statement
The studies involving humans were approved by Shaanxi Normal University Ethics Committee. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants’ legal guardians/next of kin.
Author contributions
GL: Methodology, Supervision, Writing – review & editing. HZ: Conceptualization, Investigation, Visualization, Writing – original draft, Writing – review & editing. JY: Methodology, Supervision, Writing – review & editing. WL: Writing – review & editing. HM: Writing – review & editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. This work was supported by the Ministry of Education (grant number DHA220396).
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2026.1749066/full#supplementary-material
References
Adams, D., Sumintono, B., Mohamed, A., and Nur, S. (2018). E-learning readiness among students of diverse backgrounds in a leading Malaysian higher education institution. Malaysian J. Learn. Instruct. 15, 227–256. doi: 10.32890/mjli2018.15.2.9
Adams, R. J., Wu, M. L., Cloney, D., and Wilson, M. R. (2020). ACER ConQuest: Generalised item response modelling software (version 5). Camberwell, Vic: Australian Council for Educational Research.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automatic Control 19, 716–723. doi: 10.1109/TAC.1974.1100705
Aronson, B., Meyers, L., and Winn, V. (2020). “Lies my teacher [educator] still tells”: Using critical race counternarratives to disrupt whiteness in teacher education. Teach. Educ. 55, 300–322. doi: 10.1080/08878730.2020.1759743
Baumfalk, B., Bhattacharya, D., Vo, T., Forbes, C., Zangori, L., and Schwarz, S. (2019). Impact of model-based science curriculum and instruction on elementary students’ explanations for the hydrosphere. J. Res. Sci. Teach. 56, 570–597. doi: 10.1002/tea.21514
Berrío, ÁI., Gómez-Benito, J., and Arias-Patiño, E. M. (2020). Developments and trends in research on methods of detecting differential item functioning. Educ. Res. Rev. 31:100340. doi: 10.1016/j.edurev.2020.100340
Bodzin, A. M., and Beerer, K. M. (2003). Promoting inquiry-based science instruction: The validation of the Science Teacher Inquiry Rubric (STIR). J. Elementary Sci. Education 15, 39–49. doi: 10.1007/BF03173842
Bond, T. G., and Fox, C. M. (2015). Applying the Rasch Model: Fundamental measurement in the human sciences, 3rd Edn. Milton Park: Routledge.
Boomsma, A. (1985). Nonconvergence, improper solutions, and starting values in LISREL maximum likelihood estimation. Psychometrika 50, 229–242. doi: 10.1007/BF02294248
Braaten, M., and Windschitl, M. (2011). Working toward a stronger conceptualization of scientific explanation for science education. Sci. Educ. 95, 639–669. doi: 10.1002/sce.20449
Brigandt, I. (2016). Why the difference between explanation and argument matters to science education. Sci. Educ. 25, 251–275. doi: 10.1007/s11191-016-9826-6
Cai, J., and Hwang, S. (2002). Generalized and generative thinking in US and Chinese students’ mathematical problem solving and problem posing. J. Mathemat. Behav. 21, 401–421. doi: 10.1016/S0732-3123(02)00142-6
Capps, D. K., Shemwell, J. T., and Young, A. M. (2016). Over reported and misunderstood? A study of teachers’ reported enactment and knowledge of inquiry-based science teaching. Intern. J. Sci. Educ. 38, 934–959. doi: 10.1080/09500693.2016.1173261
Chen, Y.-C., and Terada, T. (2021). Development and validation of an observation-based protocol to measure the eight scientific practices of the next generation science standards in K-12 science classrooms. J. Res. Sci. Teach. 58, 1489–1526. doi: 10.1002/tea.21716
Cheung, K. K. C., and Erduran, S. (2023). A systematic review of research on family resemblance approach to nature of science in science education. Sci. Educ. 32, 1637–1673. doi: 10.1007/s11191-022-00379-3
Clarke, D. J., Wang, L., Xu, L., Aizikovitsh-Udi, E., and Cao, Y. (2012). “International comparisons of mathematics classrooms and curricula: The validity-comparability compromise,” in Proceedings of the 36th Conference of the International Group for the Psychology of Mathematics Education (PME36), ed. T. Y. Tso (Taiwan).
Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213–220. doi: 10.1037/H0026256
Cohen, J. (1988). Statistical power analysis for the behavioral sciences, 2nd Edn. Mahwah, NJ: Lawrence Erlbaum Associates.
Crujeiras-Pérez, B., and Jiménez-Aleixandre, M. P. (2017). High school students’ engagement in planning investigations: Findings from a longitudinal study in Spain. Chem. Educ. Res. Pract. 17, 659–669. doi: 10.1039/C6RP00185H
Curzon, P., Dorling, M., Ng, T., Selby, C., and Woollard, J. (2014). Developing computational thinking in the classroom: A framework. United Kingdom: Southampton Education School.
Dagher, Z. R., and Erduran, S. (2016). Reconceptualizing the nature of science for science education. Sci. Edu. 25, 147–164. doi: 10.1007/S11191-015-9800-8
Dare, E. A., Hiwatig, B. M. R., Keratithamkul, K., Ellis, J. A., Roehrig, G., Ring-Whalen, E. A., et al. (2021). “Improving integrated STEM education: The design and development of a K-12 STEM observation protocol (STEM-OP) (RTP),” in Proceedings of the ASEE virtual annual conference content access, virtual conference. doi: 10.18260/1-2-37307
Duschl, R. A., and Grandy, R. (2013). Two views about explicitly teaching nature of science. Sci. Educ. 22, 2109–2139. doi: 10.1007/s11191-012-9539-4
Elliott, K. C., and McKaughan, D. (2014). Nonepistemic values and the multiple goals of science. Philos. Sci. 81, 1–21. doi: 10.1086/674345
Erduran, S., and Dagher, Z. R. (2014). Reconceptualizing the nature of science for science education: Scientific knowledge, practices and other family categories. Berlin: Springer Academic Publishers.
Fishman, E. J., Borko, H., Osborne, J., Gomez, F., Rafanelli, S., Reigh, E., et al. (2017). A practice-based professional development program to support scientific argumentation from evidence in the elementary classroom. J. Sci. Teach. Educ. 28, 222–249. doi: 10.1080/1046560X.2017.1302727
Forbes, C. T., Biggers, M., and Zangori, L. (2013). Investigating essential characteristics of scientific practices in elementary science learning environments: The practices of science observation protocol (P-SOP). School Sci. Mathemat. 113, 180–190. doi: 10.1111/ssm.12014
Ford, M. (2008). Disciplinary authority and accountability in scientific practice and learning. Sci. Educ. 92, 404–423. doi: 10.1002/sce.20263
Ford, M. J. (2015). Educational implications of choosing ‘practice’ to describe science in the next generation science standards. Sci. Educ. 99, 1041–1048. doi: 10.1002/sce.21188
Franco, M. P., Bottiani, J. H., and Bradshaw, C. P. (2023). Assessing teachers’ culturally responsive classroom practice in PK–12 schools: A systematic review of teacher-, student-, and observer-report measures. Rev. Educ. Res. 93, 745–779. doi: 10.3102/00346543231208720
García-Carmona, A. (2020). From inquiry-based science education to the approach based on scientific practices. Sci. Educ. 29, 443–463. doi: 10.1007/s11191-020-00108-8
García-Carmona, A. (2021). Prácticas no-epistémicas: Ampliando la mirada en el enfoque didáctico basado en prácticas científicas [Non-epistemic practices: Broadening the perspective in the didactic approach based on scientific practices]. Rev. Eureka Sobre Enseñanza Divulgación Ciencias 18, 1–18. doi: 10.25267/Rev_Eureka_ensen_divulg_cienc.2021.v18.i1.1108
García-Carmona, A. (2022). La comprensión de aspectos epistémicos de la naturaleza de la ciencia en el nuevo currículo de educación secundaria obligatoria, tras la LOMLOE [Understanding epistemic aspects of the nature of science in the new compulsory secondary education curriculum, following the LOMLOE]. Rev. Española Pedagogía 80, 433–450. doi: 10.22550/REP80-3-2022-01
García-Carmona, A. (2024). The non-epistemic dimension, at last a key component in mainstream theoretical approaches to teaching the nature of science. Sci. Educ. 34, 1149–1165. doi: 10.1007/s11191-024-00495-2
He, P., Liu, X., Zheng, C., and Jia, M. (2016). Using Rasch measurement to validate an instrument for measuring the quality of classroom teaching in secondary chemistry lessons. Chem. Educ. Res. Pract. 17, 381–393. doi: 10.1039/C6RP00004E
Huang, X., Lederman, N. G., and Cai, C. (2017). Improving Chinese junior high school students’ ability to ask critical questions. J. Res. Sci. Teach. 54, 963–987. doi: 10.1002/tea.21390
Irzik, G., and Nola, R. (2023). Revisiting the foundations of the family resemblance approach to nature of science: Some new ideas. Sci. Educ. 32, 1227–1245. doi: 10.1007/s11191-022-00375-7
Kaderavek, J. N., North, T., Rotshtein, R., Dao, H., Liber, N., Milewski, G., et al. (2015). SCIIENCE: The creation and pilot implementation of an NGSS-based instrument to evaluate early childhood science teaching. Stud. Educ. Eval. 45, 27–36. doi: 10.1016/j.stueduc.2015.03.003
Ko, M. L. M., and Krist, C. (2019). Opening up curricula to redistribute epistemic agency: A framework for supporting science teaching. Sci. Educ. 103, 979–1010. doi: 10.1002/sce.21511
Linacre, J. M. (2002). Optimizing rating scale category effectiveness. J. Appl. Measurement 3, 85–106.
Liu, X. (2012). “Developing measurement instruments for science education research,” in Second international handbook of science education, eds B. J. Fraser, K. Tobin, and C. J. McRobbie (Berlin: Springer), 651–665. doi: 10.1007/978-1-4020-9041-7_43
Lucas, L., Khushal, Mayes, Couch, B., and Dauer, J. (2025). Development of the quantitative modelling observation protocol (QMOP) for undergraduate biology courses: Validity evidence for score interpretation and uses. Intern. J. Sci. Educ. 47, 282–306. doi: 10.1080/09500693.2024.2320060
Marshall, J. C., Smart, J., and Horton, R. M. (2010). The design and validation of EQUIP: An instrument to assess inquiry-based instruction. Intern. J. Sci. Mathemat. Educ. 8, 299–321. doi: 10.1007/s10763-009-9174-y
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika 47, 149–174. doi: 10.1007/BF02296272
McNeill, K. L., Katsh-Singer, R., and Pelletier, P. (2015). Assessing science practices: Moving your class along a continuum. Sci. Scope 39, 21–28. doi: 10.2505/4/ss15_039_04_21
McNeill, K. L., Lowenhaupt, R. J., and Katsh-Singer, R. (2018). Instructional leadership in the era of the NGSS: Principals’ understandings of science practices. Sci. Educ. 102, 452–473. doi: 10.1002/sce.21336
Miller, K., Brickman, P., and Oliver, J. S. (2014). Enhancing teaching assistants’ (TAs’) inquiry teaching by means of teaching observations and reflective discourse. School Sci. Mathemat. 114, 178–190. doi: 10.1111/ssm.12065
Mooney, E. S. (2002). A framework for characterizing middle school students’ statistical thinking. Mathemat. Think. Learn. 4, 23–63. doi: 10.1207/S15327833MTL0401_2
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Appl. Psychol. Measurement 16, 159–176. doi: 10.1002/j.2333-8504.1992.tb01436.x
National Research Council [NRC] (2012). A framework for K-12 science education: Practices, cross-cutting concepts, and core ideas. Washington, DC: The National Academies Press.
Neumann, K., Viering, T., Boone, W. J., and Fischer, H. E. (2013). Towards a learning progression of energy. J. Res. Sci. Teach. 50, 162–188. doi: 10.1002/tea.21061
NGSS Lead States (2013). Next generation science standards: For states, by states. Washington, DC: The National Academies Press.
Ong, Y. S., Koh, J., Tan, G. A. L., and Ng, Y. S. (2024). Developing an integrated STEM classroom observation protocol using the productive disciplinary engagement framework. Res. Sci. Educ. 54, 101–118. doi: 10.1007/s11165-023-10110-z
Osborne, J. (2014). Teaching scientific practices: Meeting the challenge of change. J. Sci. Teach. Educ. 25, 177–196. doi: 10.1007/s10972-014-9384-1
Potochnik, A. (2017). Idealization and the aims of science. Chicago, IL: University of Chicago Press, doi: 10.7208/chicago/9780226507194.001.0001
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika 34, 1–97. doi: 10.1007/BF03372160
Sampson, V., Enderle, P., and Walker, J. P. (2012). “The development and validation of the assessment of scientific argumentation in the classroom (ASAC) observation protocol: A tool for evaluating how students participate in scientific argumentation,” in Perspectives on scientific argumentation: Theory, practice and research, ed. M. S. Khine (Berlin: Springer), 235–264. doi: 10.1007/978-94-007-2470-9_12
Schmitt, T. A. (2011). Current methodological considerations in exploratory and confirmatory factor analysis. J. Psychoeduc. Assess. 29, 304–321. doi: 10.1177/0734282911406653
Schwarz, C. V., Reiser, B. J., Davis, E. A., Kenyon, L., Achér, A., Fortus, D., et al. (2009). Developing a learning progression for scientific modeling: Making scientific modeling accessible and meaningful for learners. J. Res. Sci. Teach. 46, 632–654. doi: 10.1002/tea.20311
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–464. doi: 10.1214/aos/1176344136
Selby, C. C. (2015). “Relationships: Computational thinking, pedagogy of programming, and Bloom’s taxonomy,” in WiPSCE ’15: Proceedings of the Workshop in primary and secondary computing education, (New York, NY: ACM), 80–87.
Shao, M., Muhamad, M. M., Razali, F., Nasiruddin, N. J. M., Sha, X., and Yin, G. (2025). The Chinese adaptation of the teachers’ sense of efficacy scale in early childhood pre-service teachers: Validity, measurement invariance, and reliability. Behav. Sci. 15:329. doi: 10.3390/bs15030329
Shi, F., Wang, L., Liu, X., and Chiu, H. (2021). Development and validation of an observation protocol for measuring science teachers’ modeling-based teaching performance. J. Res. Sci. Teach. 58, 1359–1388. doi: 10.1002/tea.21712
Turner, R. C., Keiffer, E. A., and Salamo, G. J. (2018). Observing inquiry-based learning environments using the Scholastic Inquiry Observation instrument. Intern. J. Sci. Mathemat. Educ. 16, 1455–1478. doi: 10.1007/s10763-017-9843-1
Unver, A. O., Okulu, H. Z., Bektas, O., Yilmaz, Y. O., Muslu, N., Senler, B., et al. (2024). Designing an observation protocol for professional development providers and mentors working with scientific inquiry-supported classroom settings. School Sci. Mathemat. 124, 217–231. doi: 10.1111/ssm.12657
Wu, J. Y., and Erduran, S. (2024). Investigating scientists’ views about the utility of the family resemblance approach to nature of science in science education. Sci. Educ. 33, 73–102. doi: 10.1007/s11191-021-00313-z
Xu, L., and Clarke, D. (2018). “Validity and comparability in cross-cultural video studies of classrooms,” in Video-based research in education, eds L. Xu, G. Aranda, W. Widjaja, and D. Clarke (Milton Park: Routledge), 19–33. doi: 10.4324/9781315109213-3
Zhang, J., and Browne, W. (2023). Exploring Chinese high school students’ performance and perceptions of scientific argumentation by understanding it as a three-component progression of competencies. J. Res. Sci. Teach. 60, 847–884. doi: 10.1002/tea.21819
Zhu, X., and Li, J. (2020). Classroom culture in China: Collective individualism learning model. Berlin: Springer, doi: 10.1007/978-981-15-1827-0
Keywords: classroom observation, non-epistemic dimension, Rasch measurement, scientific practices, secondary biology classroom
Citation: Zhao H, Yu J, Liu W, Ma H and Li G (2026) Development and validation of a culture-adapted observation protocol to measure students’ scientific practices in secondary biology classrooms: encompassing epistemic and non-epistemic dimensions. Front. Educ. 11:1749066. doi: 10.3389/feduc.2026.1749066
Received: 18 November 2025; Revised: 02 January 2026; Accepted: 02 January 2026;
Published: 28 January 2026.
Edited by:
Kuan-Yu Jin, Hong Kong Examinations and Assessment Authority, Hong Kong SAR, China
Reviewed by:
Jose Antonio López-Pina, University of Murcia, Spain
Rivo Panji Yudha, Surabaya State University, Indonesia
Copyright © 2026 Zhao, Yu, Liu, Ma and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Gaofeng Li, ligaofeng@snnu.edu.cn