Analyzing Large-Scale Studies: Benefits and Challenges

Ertl, Bernhard; Hartmann, Florian G.; Heine, Jörg-Henrik

doi:10.3389/fpsyg.2020.577410

OPINION article

Front. Psychol., 09 December 2020

Sec. Educational Psychology

Volume 11 - 2020 | https://doi.org/10.3389/fpsyg.2020.577410

This article is part of the Research Topic: What Big Data Can Tell Us About the Psychology of Learning and Teaching

Bernhard Ertl¹*

Florian G. Hartmann²

Jörg-Henrik Heine³

¹Department of Human Sciences, Learning and Teaching With Media, Institute for Education, Universität der Bundeswehr München, Neubiberg, Germany
²Department of Human Sciences, Methodology in the Social Sciences, Institute for Education, Universität der Bundeswehr München, Neubiberg, Germany
³Center for International Student Assessment, TUM School of Education, Technical University of Munich, Munich, Germany

Introduction

The analysis of (inter)national large-scale assessments (LSAs) promises representative results and statistical power high enough to reveal even minor effects. The international grounding of LSAs can also help verify previous findings that might have been biased by a focus on Western and industrialized countries. This contribution discusses these promises and contextualizes them with methodological challenges and interpretation caveats that have to be considered in order to tap the potential of LSAs for educational psychology. The evidence in this contribution is grounded in previous analyses of the Program for International Student Assessment (PISA; Schleicher, 2019) and the Program for the International Assessment of Adult Competencies (PIAAC; OECD, 2013), two internationally repeated cross-sectional studies. Many of the aspects we raise also apply to other international large-scale studies, such as TIMSS, PIRLS, and ICILS.¹ We also refer to the German National Educational Panel Study (NEPS; Blossfeld et al., 2011), a national longitudinal study, to include a perspective on longitudinal studies. Implications for large-scale studies in the context of learning and teaching conclude the paper.

Promises

Representativity and Impact

LSAs aim to survey representative (sub)samples of defined populations (e.g., OECD, 2013, section Caveats). This representativity can make them more informative and provide stronger evidence for policymaking than traditional educational or psychological studies, which often rely on convenience samples. Wagemaker (2014) discusses changes in educational policies as one of the impacts of LSAs. Fischman et al. (2019) took a closer look at the direct impact of LSAs on educational policy and found that several countries worldwide have established PISA-based educational goals (p. 12). They further report that LSA results are often used as triggers or levers for educational reforms, although several stakeholders mentioned that these kinds of studies can actually hinder reforms when the focus is too much on simply reaching the stated indicators (see Rutkowski and Rutkowski, 2018).

Longitudinal Perspective

A second benefit of LSAs is their long-term perspective. They are either repeated cross-sectionally in several cycles (e.g., the PISA study takes place every 3 years; Schleicher, 2019) or follow a longitudinal panel design, as in NEPS, which has surveyed six starting cohorts in Germany over the past 10 years (Blossfeld and Roßbach, 2019). While the trend-study approach of PISA makes it possible to measure how changes in educational policy or society may affect a defined sample (e.g., 15-year-old students in PISA; Schleicher, 2019), the longitudinal approach of NEPS makes it possible to examine background variables and thus sheds light on how an individual's characteristics affect educational trajectories (Blossfeld and Roßbach, 2019). These analyses can be especially informative if a study like NEPS follows several cohorts that overlap at a certain point in time.

Standardization

Besides representativity and the longitudinal perspective, LSAs provide standardized procedures, instruments, item pools, and test booklets (e.g., OECD, 2013). These standardizations ensure a survey setting and data that allow international comparisons (PIAAC and PISA) as well as comparisons between survey cycles (PIAAC and PISA) or waves (NEPS). An essential prerequisite for supporting these comparisons is the international cooperation for developing competency and performance measures as well as questionnaires (see, e.g., OECD, 2013). Furthermore, the standardized coding of survey data allows a certain level of matching to contextual and/or official data, e.g., labor market data, national examination statistics, or even geodata from microcom in NEPS (Schönberger and Koberg, 2018).²

Statistical Power

Finally, the large sample sizes of LSAs provide statistical power that allows even small effects to be detected at the individual level, even when only subsamples of the original population are analyzed. This helps to reveal effects that would be overlooked in traditional educational or psychological studies. However, statistical power decreases when analyses go beyond the individual level and focus on the class, school, or national level.

Challenges

Complexity of Analysis

These promises come with analysis and interpretation challenges. Achieving representativity with economical sample sizes requires a complex weighting of each case. Consequently, all further analyses must include these weights in order to maintain representativity. Using stratification variables for sampling that differ across the participating countries to reflect different (educational) structures in their populations requires complex variance estimation procedures. These are typically based on replication or bootstrap procedures (Rust, 1985; Lin et al., 2013) to support statements about significance. In addition, the principle of item sampling (e.g., Lord, 1965) typically used in competence assessment (see Rutkowski et al., 2013) results in design-related missing data points (see below), which are compensated for by plausible value (PV) techniques (e.g., von Davier et al., 2009; von Davier, 2013; Marsman et al., 2016). Here, analysis procedures have to take into account not just one but multiple variables (e.g., five, ten, or even more PVs) as competence measures. However, these kinds of procedures are rarely available in traditional statistics programs,³ meaning that representative analyses need either add-ons such as the IDB Analyzer⁴ or specifically developed packages for R (e.g., survey, BIFIEsurvey, or intsvy; see Heine and Reiss, 2019).
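To make the combination of weights, replicate-based variance estimation, and plausible values more concrete, the following is a minimal sketch in Python rather than a reproduction of any study's official procedure. It estimates a weighted mean once per PV, derives a sampling variance from replicate weights, and pools the results across PVs with Rubin's combination rules. All data are synthetic, the function names are hypothetical, and the delete-one-group jackknife with its multiplier (G - 1)/G is an assumption made only for the demonstration; the replication scheme and multiplier of a real study must be taken from its technical documentation.

```python
import numpy as np

def weighted_mean(values, weights):
    """Weighted mean of one variable under one set of weights."""
    return float(np.sum(values * weights) / np.sum(weights))

def pv_mean_with_se(pvs, full_weight, rep_weights, factor):
    """Pool a weighted mean over plausible values (PVs).

    pvs         : array (n_cases, n_pv), one column per plausible value
    full_weight : array (n_cases,), final sampling weight
    rep_weights : array (n_cases, n_rep), replicate weights
    factor      : multiplier for the summed squared replicate deviations;
                  depends on the replication scheme (assumption of the caller)
    """
    n_pv = pvs.shape[1]
    n_rep = rep_weights.shape[1]
    estimates = np.empty(n_pv)
    sampling_vars = np.empty(n_pv)
    for m in range(n_pv):
        est = weighted_mean(pvs[:, m], full_weight)
        reps = np.array([weighted_mean(pvs[:, m], rep_weights[:, r])
                         for r in range(n_rep)])
        estimates[m] = est
        sampling_vars[m] = factor * np.sum((reps - est) ** 2)
    point = float(estimates.mean())           # pooled point estimate
    within = sampling_vars.mean()             # average sampling variance
    between = estimates.var(ddof=1)           # variance between PVs
    total_var = within + (1 + 1 / n_pv) * between  # Rubin's rules
    return point, float(np.sqrt(total_var))

# Synthetic demonstration data (not taken from any real study).
rng = np.random.default_rng(0)
n_cases, n_pv, n_groups = 600, 5, 20
ability = rng.normal(500, 100, n_cases)
pvs = ability[:, None] + rng.normal(0, 30, (n_cases, n_pv))
full_weight = rng.uniform(0.5, 2.0, n_cases)

# Delete-one-group jackknife replicate weights (illustrative assumption).
group = np.arange(n_cases) % n_groups
rep_weights = np.empty((n_cases, n_groups))
for g in range(n_groups):
    rep_weights[:, g] = np.where(
        group == g, 0.0, full_weight * n_groups / (n_groups - 1))

point, se = pv_mean_with_se(
    pvs, full_weight, rep_weights, factor=(n_groups - 1) / n_groups)
print(f"pooled mean = {point:.1f}, standard error = {se:.2f}")
```

The sketch shows why a single unweighted analysis of one PV is not sufficient: the standard error has a sampling component, obtained from the replicate weights, and an imputation component, obtained from the spread of the estimates across PVs.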

Test Time

Another aspect relates to the length of the questionnaires. Respondents can offer only a limited amount of time. LSAs typically compensate for this with one of two approaches. A pragmatic and easily implemented approach is to apply very short scales for measuring traits and competencies. The NEPS panel, for example, measures the Big Five⁵ personality domains with only two items per dimension and vocational interests (the Big Six) with three items per dimension (see Wohlkinger et al., 2011). The low reliability to be expected from such short scales, and its consequences for validity, are increasingly being discussed in psychological research (Rammstedt and Beierlein, 2014). A more demanding approach in terms of both implementation and later analysis is to use rotated booklet designs (e.g., Frey et al., 2009; Heine et al., 2016). For computer-based assessments, adaptive test scenarios can usually further reduce the number of items (e.g., Kubinger, 2017). In both test designs, the items are distributed across different test booklets or test scenarios. Test takers here often do not answer every item, which inevitably results in missing data points. With a suitable test design, these data are typically missing completely at random, although they may still require data imputation methods, which can be complicated to apply.⁶
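How a rotated booklet design produces missing data by design can be illustrated with a toy example. The sketch below is a hypothetical layout, not the design of any actual study: twelve items are split into three clusters, each of three booklets contains two clusters, and booklets are assigned in rotation, independently of the responses. Operational designs are considerably larger and usually also balance the position of clusters within booklets.

```python
import numpy as np

# Hypothetical item clusters and booklets (illustrative assumption).
clusters = {"A": [0, 1, 2, 3], "B": [4, 5, 6, 7], "C": [8, 9, 10, 11]}
booklets = [("A", "B"), ("B", "C"), ("C", "A")]

rng = np.random.default_rng(1)
n_persons, n_items = 9, 12

# Responses every person would give if all items were administered.
responses = rng.integers(0, 2, (n_persons, n_items)).astype(float)

# Booklets are rotated across persons, independent of their responses.
assigned = np.arange(n_persons) % len(booklets)

# Only items in the assigned booklet are observed; the rest are NaN.
observed = np.full((n_persons, n_items), np.nan)
for person in range(n_persons):
    for cluster_name in booklets[assigned[person]]:
        items = clusters[cluster_name]
        observed[person, items] = responses[person, items]

missing_rate = np.isnan(observed).mean()
per_item = (~np.isnan(observed)).sum(axis=0)
print(f"share of missing responses: {missing_rate:.2f}")
print("respondents per item:", per_item)
```

Because the missing entries depend only on the booklet assignment and not on what a person would have answered, each person answers two thirds of the items while every item is still answered by the same number of persons.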

Missing Data and Imputation

In line with the construction of short scales or within-scale⁷ booklet designs, LSAs often require general design decisions for the assessment of competencies. The NEPS data set, for instance, surveyed competencies for only about a third of the student cohort (FDZ-LIfBi, 2018), while PIAAC assessed the competency of problem solving in technology-rich environments for only parts of the sample (OECD, 2013), using the booklet designs described above. This means that there is no discrete competency value for an individual; the competency estimate is based on PVs (e.g., von Davier et al., 2009), which in turn rest on the theory of data imputation (see Rubin, 1987). Modeling longitudinal effects, e.g., by structural equation modeling, furthermore requires that the target variables be available at specific waves in order to construct valid models.

Invariance of Measurement

A recent OECD conference on the cross-country comparability of questionnaire scales (see Avvisati et al., 2019) identified measurement invariance as a core challenge for LSAs in general and for PISA studies as well (Van de Vijver et al., 2019). Among other methodological topics, participants from different countries discussed typical forms of analysis for verifying measurement invariance. A classical approach uses multigroup confirmatory factor analysis (MGCFA). Based on this, a widely accepted taxonomy distinguishes configural, metric, scalar, and residual measurement invariance (e.g., Putnick and Bornstein, 2016). However, the MGCFA approach also has critical aspects, ranging from insufficient subgroup sizes (even for LSA data) and reduced statistical power to unknown distribution properties of the test statistics, especially when global model tests are used to assess the relative fit of the successively nested MGCFA models for the levels of measurement invariance. Moreover, MGCFA rests on the assumption of a continuous scale for both the latent variable of interest and the response scales of the manifest indicators. When these strong assumptions of interval scales can be seriously questioned, models from the item response theory (IRT) domain can be used for ordinal scales, or classification methods such as (multigroup) latent class analysis (MG-LCA; Eid et al., 2003; Eid, 2019) for nominal scales. Some recent approaches in the LSA framework are founded upon Bayesian IRT models (e.g., Fox, 2010) or IRT residual fit statistics (see, e.g., Buchholz and Hartig, 2017). To establish an invariant scale at the item level, there are some promising approaches to automated item selection that identify a scale fulfilling predefined target criteria such as invariance across subsamples and cultures (e.g., Schultze and Eid, 2018).

Item Formats and Response Sets

Extreme and middle response endorsement, cheating, socially desirable responding, and flat-lined response behavior are phenomena closely related to the issue of invariant measurement (see Heine, 2020). A critical discussion is currently taking place regarding whether innovative item formats (Kyllonen, 2013) such as forced choice measures (e.g., Bürkner et al., 2019) or anchoring vignettes to adjust distorted responses (e.g., Stankov et al., 2018) might lead to improved measurement when compared to classical rating scales.

Classification Issues and Different Standards

Standardization and international comparability require the classification of responses, e.g., of vocational aspirations, by standardized classification schemes such as the ISCO-08. However, standardization is always subject to national practice and legislation, and although these schemes are well-defined, they usually cannot map national peculiarities unambiguously; i.e., they often capture national differences only partially. Nursing is widely discussed as a prototypical challenge when it comes to international classification issues (see, e.g., Baumann, 2013; Palmer and Miles, 2019) because it differs between countries with respect to the educational path (vocational vs. university background) as well as in terms of the scope of medical treatment a nurse is allowed to perform (see, e.g., Currie and Carr-Hill, 2013; Gunn et al., 2019).

Caveats

Significance Does Not Mean Big Effects

Along with these challenges, LSAs also come with some interpretation caveats. The large sample sizes of large-scale studies yield high statistical power (at the level of the individual) and, as a result, frequently significance levels of p < 0.001 (or lower). Although this is a strength when it comes to detecting even marginal differences, it also allows marginal effect sizes (near-zero effects) to become significant. Merely showing that differences are significant is therefore not sufficient (e.g., Cohen, 1994; Hunter, 1997) when analyzing large-scale studies; it is also necessary to discuss effect sizes (e.g., Snyder and Lawson, 1993).
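The gap between significance and effect size can be shown with a small simulation. The following sketch uses purely synthetic data and assumes simple random sampling, so it deliberately ignores the weights and replicate-based variance estimation that a real LSA analysis would require. Two groups of 500,000 simulated cases each differ by only 0.02 standard deviations; the p value comes from a large-sample normal approximation, and the effect size is Cohen's d with a pooled standard deviation. The function names are hypothetical.

```python
import math
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1)
                  + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return float((np.mean(x) - np.mean(y)) / math.sqrt(pooled_var))

def two_sided_p(x, y):
    """Two-sided p value from a large-sample z test for a mean difference."""
    se = math.sqrt(np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y))
    z = (np.mean(x) - np.mean(y)) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Synthetic groups with a true difference of 0.02 standard deviations.
rng = np.random.default_rng(42)
n = 500_000
group_a = rng.normal(0.02, 1.0, n)
group_b = rng.normal(0.00, 1.0, n)

d = cohens_d(group_a, group_b)
p = two_sided_p(group_a, group_b)
print(f"Cohen's d = {d:.3f}, p = {p:.2e}")
```

With samples of this size, the difference is significant far beyond p < 0.001 although the standardized difference remains close to zero, which is why effect sizes should be reported alongside significance.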

Horse Race Communication

Countries and states participating in international large-scale studies differ in both their schooling systems and general societal aspects. Examples include socioeconomic background variables and basic political and social convictions. Different immigration policies in different countries (see, e.g., Entorf and Minoiu, 2005; Hunger and Krannich, 2015) can lead to different compositions of so-called “non-native speaker groups” or of groups of people with low socioeconomic status, which in turn might influence (bias) the outcomes of these studies in cross-country comparisons much more than differences between school systems. Many international large-scale studies have very complex designs and analyses, and as a result, local or national aspects might be the most illustrative ones to communicate, even if they are not the most relevant ones when considering other educational factors. This often leads to a horse race discussion that focuses on rank positions rather than on the peculiarities of the respective systems. While Rutkowski and Rutkowski (2018) describe how to deal with these peculiarities, the NEPS data use agreement prohibits comparisons between the German federal states⁸ to avoid precisely these issues.

Implications for Learning and Teaching

We have discussed the promises, challenges, and caveats of LSAs. Benefits such as representativity and the long-term perspective come with challenges such as the complexity of analysis and limited information (e.g., information loss due to classification issues, missing values, constructs not covered, and panel loss) as well as with further caveats for interpretation. This reflects a general issue of these studies, i.e., that their results might have the power to influence educational policies (see Fischman et al., 2019) while at the same time being difficult to communicate appropriately to teachers, principals, and policymakers due to their complexity. This makes it essential to communicate and transfer LSA evidence into practice in a manner that is appropriate and understandable for a non-scientific audience, without trivializing the results.

The international perspective of many large-scale studies makes it possible to reflect on stereotypes and preconditions that national studies cannot overcome (see also Else-Quest et al., 2010). These include, for example, stereotyped gender differences in mathematics and science that in the Western world often favor boys, whereas PISA results have disclosed that several countries show scores favoring girls in mathematics and an almost even distribution in science scores (OECD, 2015, p. 28f.). The study design thereby allows an analysis of the extent to which phenomena develop over time and between different countries, which is an essential aspect for evaluating changes in any educational system. Moreover, education always targets the development of individuals, so longitudinal follow-up surveys and analyses of cohorts may increase the benefits of these studies as they relate to learning and teaching.

To sum up, (inter)national large-scale studies can provide several benefits for research on learning and teaching because they offer a solid data basis for investigating relevant effects. However, the formal comparability of study scores does not exactly reflect actual differences between states or educational systems unless background variables and national social and educational specifics are considered. Although these studies may mitigate the methodological shortcomings of traditional studies, especially the focus on Western white populations, they may at the same time reveal methodological challenges of their own.

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Funding

Conceptual analyses resulting in this article were partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - project ER470/2-1. The publication of this article was funded by the Open Access Fund of the Bundeswehr Universität München.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

1. See, e.g., Lenkeit and Schwippert (2018), Gustafsson (2018), von Maurice et al. (2017), and Rutkowski et al. (2010) for an overview of international large-scale studies.

2. Matching to contextual data is typically required to preserve the anonymity of individuals and schools. Here, different levels of anonymization, starting from a segment of households up to the municipality level, may be observable (see Schönberger and Koberg, 2018). This kind of matching is usually implemented by the provider of the data set and may require further data access restrictions, e.g., that access is granted only in rooms with specific security precautions. Microcom enrichment may be restricted in some countries and for some studies.

3. Analyses would be supported by multilevel structural equation modeling, e.g., in MPLUS, if the correct weights are appropriately used and the plausible values are correctly applied. However, the usability of this modeling is dependent on the complexity of the data set and decreases dramatically when nested plausible values are used, for example.

4. https://www.iea.nl/data-tools/tools

5. The Big Five is a set of personality variables including the dimensions of openness, conscientiousness, extraversion, agreeableness, and neuroticism (see Goldberg, 1990 and McCrae and John, 1992).

6. The use of rotated booklet designs and/or adaptive testing usually leads to the imputation of data by the provision of plausible values for estimating test results (see next section). This increases the complexity of analyses (as mentioned in the previous section).

7. The term within-scale booklet design describes a design in which all constructs or scales are represented in all booklets, albeit with different items and a reduced number of them.

8. https://www.neps-data.de/Portals/0/NEPS/Datenzentrum/Datenzugangswege/Vertraege/NEPS_DataUseAgreement_en.pdf

References

Avvisati, F., Le Donné, N., and Paccagnella, M. (2019). A meeting report: cross-cultural comparability of questionnaire measures in large-scale international surveys. Meas. Instrum. Soc. Sci. 1:8. doi: 10.1186/s42409-019-0010-z