In the last decade, a growing number of researchers have become interested in applying new tools to verify the equivalence of measurements in comparative political science using mass surveys (Davidov, 2009; Ariely and Davidov, 2012; Coromina and Davidov, 2013; Alemán and Woods, 2016; Welzel and Inglehart, 2016; Sokolov, 2018). The increasingly available cross-national datasets offer tremendous possibilities for comparative survey analysis, including cross-sectional comparative analyses, analysis of cross-national repeated cross-sections, and analysis of cross-national panels. Many of these datasets can be linked to information about contextual attributes of the different countries and important economic, social, or political information (such as GNP, social spending, migration flow data, or religious composition) that facilitates multi-level analyses. A similar comparative logic can be applied to a lower level of aggregation as well, for example when regions or even smaller units within countries are compared.
In all these types of comparative analysis, whatever the kind of data, comparability of the measurements is a necessary condition for obtaining valid results. There is a steadily growing literature on measurement equivalence that specifies the statistical prerequisites for unbiased comparisons of covariances, regression coefficients, and latent means in regression analysis, structural equation models, Item Response Theory (IRT) approaches, multi-level models, and latent class and mixture models (Jöreskog, 1971; Meredith, 1993; Steenkamp and Baumgartner, 1998; Vandenberg and Lance, 2000; Davidov et al., 2014, 2018; Kim et al., 2016; Verhagen et al., 2016; van de Vijver et al., 2019; Roover, 2021; Pokropek and Pokropek, 2022). This rather technical literature, which often focuses on statistical details and pays less attention to theoretical validity, has more recently been complemented by new approaches that investigate how respondents interpret particular items, for example by asking probing questions about the content of the items. This framework has been expanded in recent years by implementing the probing technique in web surveys (web probing), which allows much larger sample sizes than traditional face-to-face cognitive interviewing (Behr et al., 2017, 2020; Meitinger, 2017).
Since the necessity of testing for measurement invariance was established, multigroup confirmatory factor analysis (MGCFA) has arguably been the most widely used technical tool to evaluate the various levels of measurement invariance (configural, metric, and scalar) for continuous variables (Brown, 2015). In the case of ordered-categorical items with few categories and a high degree of skewness, the ordinal approach to MGCFA is more appropriate (Brown, 2015; Liu et al., 2017). However, more recently, these tests of exact equivalence have been criticized for being too restrictive, often leading to the conclusion that comparisons should not be made even when cross-cultural differences are negligible (Zercher et al., 2015). To address this criticism, more liberal approaches, known as approximate invariance, have been developed for continuous variables; they allow comparisons across many groups and countries that would not be possible with the traditional approaches. A notable step in the direction of approximate rather than exact invariance is the application of Bayesian estimation in measurement models (Muthén and Asparouhov, 2012; van de Schoot et al., 2013; Davidov et al., 2015). In the case of dichotomous items, Item Response Theory (IRT) models have predominantly been used for this purpose. Additionally, Exploratory and Confirmatory Latent Class Analyses for multiple groups have been applied for the purpose of testing measurement equivalence. Another promising development is the use of multilevel regression models and multilevel structural equation models, which combine individual-level data with higher-level data to explain why metric or scalar invariance does not hold (Davidov et al., 2012; Jak et al., 2013; Jak and Jorgensen, 2017). All these procedures are grounded in the latent variable approach and make specific assumptions concerning the direction of relationships between latent variables and items.
The models just mentioned assume reflective indicators, that is, indicators conceptualized as consequences (reflections) of the underlying latent variable. While this is an appropriate assumption in many cases, the literature also discusses examples of formative constructs (i.e., assuming that items determine the latent variable) (Sokolov, 2018; Stadler et al., 2021). This issue of reflective vs. formative indicators and the necessity of testing measurement invariance has been a point of critical discussion among political scientists (Alemán and Woods, 2016; Welzel and Inglehart, 2016; Sokolov, 2018; Welzel et al., 2021; Meuleman et al., 2022).
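For readers less familiar with the notation, the invariance levels and the reflective versus formative distinction can be sketched in a few equations. The symbols below follow common textbook usage and are our assumptions, not notation taken from the contributions to this Research Topic: x is the vector of observed items for respondent i in group g, ξ the reflective latent variable, τ the item intercepts, Λ the loadings, δ the residuals, η a formative construct with weights γ and disturbance ζ, and σ² the variance of a zero-mean prior placed on cross-group parameter differences in the Bayesian approximate approach.

```latex
% Reflective multigroup measurement model: the latent variable determines the items
\begin{align}
  x_{ig} &= \tau_g + \Lambda_g \, \xi_{ig} + \delta_{ig} \\
  \text{configural:}  &\quad \text{same pattern of fixed and free loadings in every } \Lambda_g \\
  \text{metric:}      &\quad \Lambda_g = \Lambda \quad \text{for all } g \\
  \text{scalar:}      &\quad \Lambda_g = \Lambda, \;\; \tau_g = \tau \quad \text{for all } g \\
  \text{approximate:} &\quad \lambda_{jg} - \lambda_{jg'} \sim N(0, \sigma^2), \;\;
                             \tau_{jg} - \tau_{jg'} \sim N(0, \sigma^2), \;\; \sigma^2 \text{ small}
\end{align}
% Formative specification: the items determine the construct
\begin{equation}
  \eta_{ig} = \sum_{j} \gamma_{j} \, x_{ijg} + \zeta_{ig}
\end{equation}
```

Under partial invariance, the metric and scalar equalities are imposed only on a subset of items rather than on all of them.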
This Research Topic aims to inform political scientists about the state of the art in this fast-developing branch of survey methodology and statistics, where much of the basic research has been done outside political science (e.g., in psychometrics and statistics). To this end, it presents a collection of studies that apply the different techniques to central political science concepts (see Table 1 for a summary).
In the article “The Relationship Between (sub)national Identity, Citizenship Conceptions, and Perceived Ethnic Threat in Flanders and Wallonia for the Period 1995–2020: A Measurement Invariance Testing Strategy”, Billiet et al. study the relationship between (sub)national identity and perceived ethnic threat in Belgium and relate it to ethnic and civic citizenship conceptions in Flanders and Wallonia. They assess measurement invariance over time (1995–2020) using data from the Belgian National Election Studies. They find that conceptualization and measurement of (sub)national identity had to be adjusted in Wallonia, and illustrate how deviations from measurement invariance can be useful sources of information on social reality.
In their article “The EURO-D Measure of Depressive Symptoms in the Aging Population: Comparability Across European Countries and Israel”, Maskileyson et al. assess the measurement equivalence of self-reported depressive symptoms among the elderly in 17 European countries and Israel. They test measurement invariance of the EURO-D scale in the sixth wave (2015) of the Survey of Health, Ageing and Retirement in Europe (SHARE) using multigroup confirmatory factor analysis (MGCFA) as well as alignment, and conclude that partial equivalence is present.
Scotto et al. discuss the issue of measurement invariance testing of ordinal scales with the example of political efficacy (“Alternative Measures of Political Efficacy: The Quest for Cross-Cultural Invariance With Ordinally Scaled Survey Items”). They propose to distinguish between internal and external efficacy. In representative samples of respondents in the United States and Great Britain, they find equivalence of loadings and thresholds for their measurement model and thus conclude that differences in latent variable means can be interpreted meaningfully. Concretely, British respondents are found to have lower levels of internal and external efficacy than American respondents.
Sokolov tests for measurement invariance of two recently introduced measures of attitudes toward democracy, the liberal and authoritarian notions of democracy, in the sixth round of the World Values Survey. His analyses show that both measures can be considered reliable comparative measures of democratic attitudes, although for different reasons. Sokolov points out that some survey-based constructs, e.g., authoritarian notions of democracy, do not follow the reflective logic of construct development. Instead, he argues that these notions should be regarded as formative measures.
Heyder et al. propose a revised version of the Group Focused Enmity (GFE) syndrome as a two-dimensional concept: an ideology of inequality (generalized attitudes) and social prejudice (specific attitudes). The measurement models are empirically tested using data from the GFE panel (waves 2006, 2008) as well as the representative GFE surveys (cross-sections 2003, 2011) conducted in Germany. To test for external validity, they include social dominance orientation (SDO). The methodological focus of the study is on testing several forms of measurement invariance in the context of higher-order factor models, taking into account the multidimensionality of latent variables. The empirical results support the idea that GFE is a two-dimensional concept consisting of an ideology of inequality and social prejudice. Moreover, SDO is shown to be empirically distinct from both dimensions and to correlate more strongly with the ideology of inequality than with social prejudice. The two-dimensional GFE conceptualization proves to be at least metric invariant both between and within individuals. Finally, the implications of the proposed conceptualization and the empirical findings are discussed in the context of international research on ideologies, attitudes, and prejudices.
Finally, Lomazzi's study offers a theoretical overview of the key issues concerning the measurement and comparison of socio-political values and aims to answer questions of what, why, and when they must be evaluated, and how measurement equivalence can be assessed in practice. Furthermore, she discusses the implications of formative and reflective approaches to the measurement of socio-political values. Exact and approximate approaches to equivalence are described as well as their empirical translation into multigroup confirmatory factor analysis (MGCFA) and the frequentist alignment method. Her study investigates the construct of solidarity as measured by the European Values Study (EVS) and uses data collected in 34 countries in the last wave of the EVS (2017–2020). The concept is captured through a battery of nine items reflecting three dimensions of solidarity: social, local, and global. Two measurement models are hypothesized: a first-order factor model, in which the three independent dimensions of solidarity are correlated, and a second-order factor model, in which solidarity is conceived according to a hierarchical principle and the construct of solidarity is reflected in the three sub-factors. The MGCFA results indicate that metric invariance is achieved. The alignment method supports approximate equivalence only when the model is reduced to two factors, excluding global solidarity. The second-order factor model fits the data for only seven of the 34 countries.
In a nutshell, our conclusions from these studies and previous research are as follows:
1. Contrary to the position of Welzel and Inglehart (2016) and Welzel et al. (2021), these studies argue that one needs to reach at least partial metric invariance to get unbiased regression coefficients in comparative research involving two or more groups of countries (Meuleman et al., 2022; Pokropek and Pokropek, 2022).
2. If one wants to compare means, one must use latent means to correct for measurement error and must reach at least partial scalar invariance, with equal loadings and equal intercepts for at least two items (Meuleman et al., 2022; Pokropek and Pokropek, 2022).
3. If partial invariance fails, one can use more liberal approximate techniques such as Bayesian CFA (Seddig and Leitgöb, 2018) when there are few groups (<10) and alignment when there are many groups (>10) (Muthén and Asparouhov, 2014; Cieciuch et al., 2018).
4. The choice of model specification in the case of measurement models as reflective or formative must be founded on theoretical arguments as one cannot test them against each other in cross-sectional models. The reason is that the two model specifications are neither nested nor equivalent (Asparouhov and Muthén, 2019).
5. As measurement invariance is a necessary but not a sufficient condition for valid comparisons, it is advisable to conduct additional cognitive interviews or web probing before the main study is fielded (Meitinger, 2017; Behr et al., 2020; Meitinger et al., 2020).
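Conclusions 1 to 3 amount to a simple decision rule, which can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and goal labels are hypothetical, and because the text gives the thresholds as "<10" and ">10", the case of exactly 10 groups is deliberately left open rather than assigned to either technique.

```python
# Hypothetical helpers: names, labels, and return strings are illustrative
# assumptions, not part of the studies summarized above.

def required_invariance(goal: str) -> str:
    """Minimum invariance level for a comparison goal (conclusions 1 and 2)."""
    levels = {
        "regression_coefficients": "partial metric",
        "latent_means": "partial scalar",
    }
    if goal not in levels:
        raise ValueError(f"unknown goal: {goal!r}")
    return levels[goal]


def fallback_method(n_groups: int) -> str:
    """Approximate technique to try when partial invariance fails (conclusion 3)."""
    if n_groups < 2:
        raise ValueError("a comparison needs at least two groups")
    if n_groups < 10:
        return "Bayesian CFA"
    if n_groups > 10:
        return "alignment"
    # The text specifies only "<10" and ">10"; exactly 10 is not assigned.
    return "Bayesian CFA or alignment (boundary case not specified)"


def recommend(goal: str, n_groups: int, partial_invariance_holds: bool) -> str:
    """Combine both rules into a one-line recommendation."""
    level = required_invariance(goal)
    if partial_invariance_holds:
        return f"{level} invariance holds: proceed with the MGCFA model"
    return f"{level} invariance fails: try {fallback_method(n_groups)}"
```

For example, `recommend("latent_means", 34, False)` points to alignment, which matches the many-group setting of the EVS analysis described above. Conclusions 4 and 5 concern theoretical justification and pretesting and are not captured by such a rule.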
All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.