The EURO-D Measure of Depressive Symptoms in the Aging Population: Comparability Across European Countries and Israel

Most of the countries in Europe are experiencing a rapid aging of their populations and, with it, an increase in mental health challenges due to aging. Comparative research may help countries to assess the promotion of healthy aging in general, and mentally healthy aging in particular, and explore ways of adapting mental health policy measures. However, the comparative study of mental health indicators requires that the groups understand the survey questions inquiring about their mental health in the same way and display similar response patterns. Otherwise, observed differences in perceived mental health may not reflect true differences but rather cultural bias in the health measures. To date, the cross-country equivalence of depression measures among older populations has received very limited attention. Thus, there is a growing need for the cross-country validation of existing depression measures using samples of the older population and for establishing measurement equivalence of the assessment tools. Indeed, insights on mental health outcomes and how they compare across societies are paramount to inform policy makers seeking to improve the mental health conditions of their populations. This study, therefore, aims to examine measurement equivalence of self-reported depressive symptoms among older populations in 17 European countries and Israel. The data for the current analysis are from the sixth wave (2015) of the Survey on Health, Ageing and Retirement in Europe (SHARE) and consist of the population of respondents 50 years of age and older. The measurement of depression is based on the EURO-D scale, which was developed by a European consortium. It identifies existing depressive symptoms and consists of 12 items: depression, pessimism, suicidality, guilt, sleep, interest, irritability, appetite, fatigue, concentration, enjoyment, and tearfulness.
We examine the cross-country comparability of these data by testing for measurement equivalence using multigroup confirmatory factor analysis (MGCFA) and alignment. Our findings reveal partial equivalence thus allowing us to draw meaningful conclusions on similarities and differences among the older population across 18 countries on the EURO-D measure of depression. Findings are discussed in light of policy implications for universal access to mental health care across countries.


INTRODUCTION
Most of the countries in Europe are experiencing a rapid aging of their populations, which is accompanied by an increase in mental health challenges due to aging (Chiu et al., 2017). Indeed, depression is one of the predominant mental disorders in old age (Blazer et al., 1987; Blazer, 2003). Therefore, a coherent and focused public health response is required to promote healthy aging across nations (Beard et al., 2016). Since the examination of mental health requires the exposure of personal feelings and emotions, the concept of depression for older adults may vary greatly across cultures. Thus, one of the biggest methodological challenges encountered in cross-national studies is to ensure the equivalence of mental health measurements across different national or cultural samples. Namely, the comparative study of depressive symptoms requires that the various groups understand the survey questions inquiring about their mental health in the same way and respond to them in a similar manner. Otherwise, observed differences in depression may not reflect true differences but rather cultural bias in the underlying measures. To date, only a few researchers have considered this issue (e.g., Castro-Costa et al., 2008).
While several studies have tried to assess and compare mental health across older populations in Europe (e.g., Castro-Costa et al., 2008; Fried, 2015), findings on the incidence of mental health disorders among older adults are inconsistent (Alonso et al., 2004; Copeland et al., 2004; Andreas et al., 2017). Such inconsistency can be attributed not only to cultural differences but also to a lack of measurement equivalence across different groups of older adults (e.g., Castro-Costa et al., 2008). That is to say, essential questions on precisely how to assess the mental health and depression of older populations are still unresolved, making policy evaluation and implementation difficult (Graeff-Buhl-Nielsen et al., 2020). Indeed, mental health indicators must provide policy makers seeking to improve the mental health of their populations with meaningful and relevant information. Therefore, there is a growing need for a cross-country validation of existing mental health measures using samples of the older population and for establishing measurement equivalence of the assessment tools. Moreover, the number of general cross-national health surveys that include mental health measures is constantly growing (e.g., Harpham et al., 2003). There is a need for valid mental health measures that provide policy makers and health care providers with the information they require to address the potential gaps among population groups at the local and national levels.
This study aims to bridge this gap by examining the comparability of self-reported depressive symptoms among the older population in 17 European countries and Israel. The data for the analysis derive from the sixth wave (2015) of the Survey on Health, Ageing and Retirement in Europe (SHARE) and consist of the population of respondents aged 50 years and older. For our analysis, we selected the EURO-D scale developed by a European consortium (Prince et al., 1999b), because it is one of the commonly used measures of depression among older adults (Copeland et al., 2004; Castro-Costa et al., 2008). The EURO-D scale identifies existing depressive symptoms and consists of 12 items assessing depression, pessimism, suicidality, guilt, sleep, interest, irritability, appetite, fatigue, concentration, enjoyment, and tearfulness.
The current study contributes to the literature on mental health in older age by examining whether one of the most widely accepted tools for assessing depression displays equivalent measurement characteristics across 18 countries, thus enabling researchers to draw valid comparisons of mental health among culturally diverse members of the older population. Specifically, the results of this study present policy makers and health care providers with valid information on the comparability of the EURO-D scale. This can help them to apply effective strategies to improve health care provision and reduce mental health disparities among groups.
In the following, we first review previous research on the comparability of measures of mental health. Second, we discuss the data sources, measurements of depression, and methods used in this article to assess measurement invariance. Third, we provide results from exploratory factor analysis (EFA), multigroup confirmatory factor analysis (MGCFA), and alignment to test the comparability of the mental health measures. Finally, we discuss the findings in light of national policy implications.

THEORETICAL BACKGROUND
Mental health experts have long been addressing late-life depression and its consequences for the quality of life (Blazer, 2003). They emphasize that depression includes a large and heterogeneous number of symptoms that have direct causal effects on each other (Fried, 2015). For example, sleep disturbances may cause tiredness, which may then lead to a condition of poor psychomotor fitness, rendering the patients susceptible to a low level of concentration elicited by their sleep disturbances (Fried, 2015). In depression research, depressive symptoms are usually estimated using rating scales and added together to create sum-score indices. The EURO-D is an example of a frequently used and validated scale to measure depressive symptoms in adults (Marques et al., 2020; Santini et al., 2020). For example, one study used the EURO-D scale to compare the presence of depressive symptoms across populations in 15 European countries. Its authors found that having a poorer self-perception of health, being female, experiencing economic difficulties and widowhood, maintaining low levels of activity and exercise, and having a lower educational level were associated with higher depressive symptomatology. Similarly, Belvederi Murri et al. (2020) used the EURO-D scale to examine depressive symptoms in later life in 19 European countries. Richardson et al. (2020) explored cross-national variations in sociodemographic inequalities in depression among older populations in 18 countries using the EURO-D scale.
To allow a meaningful interpretation of similarities and differences in the scores of the scale in cross-country comparative studies, it must measure a single construct and be equivalent across different country samples (Castro-Costa et al., 2008; Fried, 2015). Therefore, it is necessary to establish that it measures the same concept in different cultural contexts (Castro-Costa et al., 2008). Indeed, various authors have emphasized that culturally determined differences in norms or expressions of depression may have a large influence on self-reported symptoms (Jürges, 2007; Castro-Costa et al., 2008). Even though measurement invariance is a prerequisite for cross-country comparative studies, only a few researchers have actually taken this issue into consideration (Jang et al., 2001; Castro-Costa et al., 2008; Fried et al., 2016; Graeff-Buhl-Nielsen et al., 2020).
For example, Fried et al. (2016) analyzed whether unidimensionality and temporal invariance are tenable assumptions in typical studies of depression. They tested these two conditions in two large datasets with a total sample of 3,509 participants, in four widely used depression rating scales (one self-report and three clinician-reports), with varying intervals between measurement points (ranging from 6 weeks to 2 years). These researchers found neither unidimensionality nor temporal invariance. Specifically, they found that the analyzed instruments do not assess a single underlying construct, and they do not measure the same set of constructs in the same way across time (Fried et al., 2016). In another study, by Jang and colleagues (2001), the structure and validity of the Geriatric Depression Scale-Short Form (GDS-SF) were examined in South Korean and American samples of older adults. The participants included 153 and 459 older adults living in South Korea and the United States, respectively. All participants completed the original English version and the Korean translation of the GDS-SF, as well as additional demographic and health-related measures. The results revealed that the GDS-SF exhibited good reliability in both samples. However, the results of a principal components analysis indicated that the structure was not well replicated across the two countries. The authors concluded that despite the efforts to produce equivalent questionnaires, the concept of depression for older adults might vary greatly between South Korea and the United States (Jang et al., 2001). Graeff-Buhl-Nielsen et al. (2020) expanded on Huppert and So's (2013) multidimensional subjective well-being framework by testing the replicability of the model in Brazil, Colombia, Uganda, and the United Kingdom. The authors applied Bayesian approximate measurement invariance on a sample of 381 young adult participants to test for measurement consistency across countries.
The results showed that the Huppert and So (2013) model was comparable across non-European regions, where meaningful differences in well-being patterns across regions were observed. Graeff-Buhl-Nielsen et al. (2020) suggested that the 10-item measure proposed by Huppert and So (2013) is useful for assessing mental health outside of Europe (Graeff-Buhl-Nielsen et al., 2020).
Another example that is particularly relevant for the present study is provided by Castro-Costa et al. (2008), who investigated the psychometric properties of the EURO-D scale across 10 European countries in the first wave of the SHARE data (2004). The results revealed a two-factor solution, with affective suffering and motivation as two subdimensions (similar to the findings of Prince et al., 1999a), in nine of the 10 countries after employing a principal component analysis (PCA) and in all countries after employing a confirmatory factor analysis (CFA). However, only the affective suffering subscale was equivalent across countries, while the motivation subscale was not. In conclusion, there is evidence to suggest that the EURO-D reflects two dimensions of depressive symptoms in late life across European countries, with the affective suffering subdimension showing more robust cross-cultural validity than the motivation subdimension (Castro-Costa et al., 2008).
Notably, in the current study we examine whether findings are similar for the same scale but across a larger set of countries and at a later time point (2015). Moreover, we employ various robustness tests that take not only the categorical character of the data into account but also allow for a stricter or more liberal examination of measurement invariance.

Data
The data for the analysis derive from the sixth wave of SHARE (2015) (Börsch-Supan et al., 2013; Malter and Börsch-Supan, 2017; Börsch-Supan, 2019). The SHARE project is the largest pan-European panel data infrastructure that collects information on the health and well-being of the aging population in Europe and Israel. It collects comparable and longitudinal information at the individual level on diverse topics such as income, work, assets, pension plans, health insurance, disability, mental health, and physical health. In addition, SHARE's focus on older populations (50+) offers a unique opportunity to compare health in general and depression symptoms in particular among these populations. The data were gathered by means of face-to-face interviews conducted in respondents' homes using a computer-based questionnaire. In addition to face-to-face interviews, respondents provided additional detailed information about their assets by filling out a short questionnaire. For more information on the data collection documentation, see http://www.share-project.org/specialdata-sets.html. Our data consisted of samples of the population aged 50 years and older from 18 countries: Austria (

Variables
The dependent variable in the current study is depression. We view depression as a mental disorder that cannot be observed or measured directly but can be assessed by measuring its symptoms (Fried, 2015). Thus, our conceptualization of depression resembles a reflective latent variable model (Bollen and Lennox, 1991) where different observed indicators (i.e., depressive symptoms) are reflective of an unobserved underlying and subjective latent construct (i.e., depression). Following this notion, the latent construct is assumed to determine any correlations between the observed indicators.
Frontiers in Political Science | www.frontiersin.org August 2021 | Volume 3 | Article 665004
The measurement of self-reported depressive symptoms (i.e., the observed indicators) in this study is based on the EURO-D scale that was developed by a European consortium (Prince et al., 1999b). This scale contains 12 items tapping into depression, pessimism, suicidality, guilt, sleep quality, interest, irritability, appetite, fatigue, concentration (on reading or entertainment), enjoyment, and tearfulness (for the question formulations and response categories, see Table 1). The scale yields a potential range from 0 to 12, with the number of depressive symptoms denoting the score. Thus, a higher score implies a higher degree of depression. Each single item measures the self-reported presence of a particular symptom. The EURO-D scale was shown to correlate well with other well-known health measures (Prince et al., 1999b), and its validity has been examined and confirmed by several studies (Larraga et al., 2006).
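The scoring rule described above (one point per reported symptom, giving a range from 0 to 12) can be illustrated with a minimal Python sketch. The item labels and the function name `euro_d_score` are illustrative assumptions and do not correspond to the SHARE variable names.

```python
from typing import Mapping

# Illustrative labels for the 12 EURO-D symptoms (not the SHARE variable names).
EURO_D_ITEMS = (
    "depression", "pessimism", "suicidality", "guilt", "sleep", "interest",
    "irritability", "appetite", "fatigue", "concentration", "enjoyment",
    "tearfulness",
)


def euro_d_score(responses: Mapping[str, int]) -> int:
    """Return the composite score: the number of symptoms coded 1 (reported)."""
    missing = [item for item in EURO_D_ITEMS if item not in responses]
    if missing:
        raise ValueError(f"missing items: {missing}")
    for item in EURO_D_ITEMS:
        if responses[item] not in (0, 1):
            raise ValueError(f"item {item!r} must be coded 0 or 1")
    return sum(responses[item] for item in EURO_D_ITEMS)
```

A respondent reporting only depression, fatigue, and tearfulness would thus receive a score of 3. How incomplete responses should be handled is left open here; the sketch simply rejects them.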

Method
Our analytical strategy consists of three steps. First, we use EFA (Barendse et al., 2015) to investigate the dimensionality of the 12 depressive symptom items across countries. Following Worthington and Whittaker (2006), we retain factors if the eigenvalue is larger than 1.00 and items if the factor loading is larger than 0.30 (Brown, 2015). Moreover, items are deleted if they load on two or more factors with a loading larger than 0.30.
We also deleted an item if, in addition to its main loading, it had a cross-loading whose difference to the main loading was smaller than 0.15. We considered both cases as an indication of the lack of discriminant validity. Second, we used MGCFA (Reise et al., 1993) to assess whether our a priori formulated common measurement model of depression assessed in the previous step exists in all countries and whether the measurement characteristics of this model are invariant across countries. Measurement invariance refers to "whether or not, under different conditions of observing and studying phenomena, measurement operations yield measures of the same attribute" (Horn and McArdle, 1992, p. 117), which is essential to ensure that a latent variable (in this case "depression") measures the same construct in different groups (Davidov et al., 2014;Vandenberg and Lance, 2000). When measurement invariance is absent, comparisons of relationships among variables (e.g., correlations, regression coefficients) and comparisons of scores (e.g., means) may be biased (Chen, 2008). The MGCFA approach for binary indicators allows testing measurement invariance by successively constraining the measurement parameters (i.e., factor loadings, thresholds, and residual variances) in the measurement model across countries. 1 The hierarchy of constraints reflects that group differences are increasingly attributed to differences in the latent factor and not to differences in the measurement characteristics. We test three levels of invariance (Millsap and Yun-Tein, 2004). Configural invariance refers to a model where only the number of factors, indicators, and the pattern of nonzero and zero factor loadings is invariant across countries. Strong invariance requires that the factor loadings and thresholds are held equal across countries. 2 Strict invariance additionally requires that the residual variances are held equal across countries. 
When strict invariance holds, researchers may even compare variances, covariances, regression coefficients, and means of the observed indicators or of composite scores. If only strong invariance holds, then only the means of the factors (i.e., latent means) may be used for a meaningful comparison (Liu et al., 2017). For estimating the model we used the variance-adjusted weighted least squares (WLSMV) estimator (Muthén et al., 1997) and the software program Mplus (Muthén and Muthén, 1998-2017). The WLSMV estimator works reasonably well even with small sample sizes, and its estimates are considered unbiased and efficient (Li et al., 2017). Missing data were treated pairwise (Asparouhov and Muthén, 2010a). Moreover, we use the theta parameterization approach that allows specifying the residual covariances of the latent response variables as parameters in the model (Asparouhov and Muthén, 2010a). 3 Models can be assessed and compared using the model chi-square test statistic (Asparouhov and Muthén, 2010b) and alternative indices such as the comparative fit index (CFI), root mean square error of approximation (RMSEA), and standardized root mean residual (SRMR). We follow common practice to recognize acceptable model fit when CFI ≥ 0.90 and RMSEA/SRMR < 0.08 (West et al., 2012). Moreover, for testing measurement invariance constraints, we use the following guidelines in this study: Differences in model fit were considered irrelevant if the deterioration in CFI was smaller than 0.004 and the deterioration in RMSEA was smaller than 0.01 when moving from less to more constrained models (Svetina et al., 2020).
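The fit guidelines stated above can be written down as a small decision helper. The following Python sketch encodes only the cutoffs named in the text; the class and function names are hypothetical, and the fit values would have to be taken from the output of the estimation software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitIndices:
    """Fit indices of one estimated model (values come from the SEM software)."""
    cfi: float
    rmsea: float
    srmr: float


def acceptable_fit(fit: FitIndices, cfi_min: float = 0.90,
                   error_max: float = 0.08) -> bool:
    """Acceptable fit: CFI >= 0.90 and both RMSEA and SRMR < 0.08."""
    return fit.cfi >= cfi_min and fit.rmsea < error_max and fit.srmr < error_max


def constraints_tenable(less_constrained: FitIndices,
                        more_constrained: FitIndices,
                        max_cfi_drop: float = 0.004,
                        max_rmsea_rise: float = 0.01) -> bool:
    """Deterioration is treated as irrelevant if the CFI drops by less than
    0.004 and the RMSEA rises by less than 0.01."""
    cfi_drop = less_constrained.cfi - more_constrained.cfi
    rmsea_rise = more_constrained.rmsea - less_constrained.rmsea
    return cfi_drop < max_cfi_drop and rmsea_rise < max_rmsea_rise
```

In use, one would call `constraints_tenable(configural, strong)` and then `constraints_tenable(strong, strict)`, stopping at the first comparison that fails.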
Third, we tested for approximate measurement invariance using the alignment optimization procedure. Compared to the classical MGCFA method for testing measurement invariance across groups, alignment is a less strict approach with regard to the requirement of equality constraints of measurement parameters across groups. Whereas MGCFA assumes that measurement parameters are equal across groups, the alignment procedure does not rely on such a strict assumption but rather allows for many small and a few large differences in measurement parameters across groups while still guaranteeing that factor means may be compared without bias. Alignment uses an unconstrained (configural) model in which all parameters are estimated without equality constraints, for example, with maximum likelihood. In the next step, the parameter estimation follows a procedure that minimizes a component loss function that finds the most optimal arrangement of measurement parameters, in which parameter differences across groups are usually very small and larger differences are restricted to a minimum. 4 Thus, the amount of measurement noninvariance is minimized without having to constrain any parameters to be exactly equal across groups (for technical details, see Asparouhov and Muthén, 2014). The final aligned factor means can be used for comparison if the degree of noninvariance in the alignment model is still tolerable. The degree of noninvariance is assessed with regard to the amount of noninvariant parameters in the model. When the amount of noninvariant parameters is smaller than 25% or 29% (Flake and McCoach, 2018), the aligned factor means and the measurement parameters are considered trustworthy. 5 In sum, alignment identifies the most comparable means even in the absence of full measurement invariance.

1 We used a threshold model for the binary measures in the SHARE study assuming that the dichotomy in an observed response $y$ (i.e., 0 = symptom not reported, 1 = symptom reported) is determined by an underlying latent response $y^*$ that follows a normal distribution, so that $y = 0$ if $y^* \le \tau$ and $y = 1$ if $y^* > \tau$, where $\tau$ is a threshold (Forero et al., 2009; Wu and Estabrook, 2016). That is, a respondent will report a symptom if the latent response is above the threshold and not report a symptom if the latent response is equal to or below the threshold, where the relation between the latent factor and the latent response $y^*$ follows a regular factor model for continuous normal variables.
2 The step of testing factor loading invariance separately is omitted and conducted in tandem with testing threshold invariance to ensure model identification (Muthén and Muthén, 1998-2017).
3 The residual variances were fixed at one for all variables in a reference group and freely estimated in all other groups. Only when testing for strict invariance were all residual variances fixed at one in all groups.
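To make the idea of the alignment loss more concrete, the following Python sketch shows how a component loss can reward solutions with a few large parameter differences over many medium-sized ones. It is a simplified illustration, not the estimation routine used in this study: the functional form (the fourth root of the squared difference plus a small constant), the value `eps=0.01`, and the weighting by the square root of the product of group sizes are assumptions based on the description usually given for alignment, and readers should consult Asparouhov and Muthén (2014) for the exact definition. The helper `share_noninvariant` merely computes the proportion that is compared against the 25% guideline.

```python
import math
from itertools import combinations
from typing import Dict, Sequence


def component_loss(diff: float, eps: float = 0.01) -> float:
    """Assumed component loss: fourth root of (diff^2 + eps)."""
    return math.sqrt(math.sqrt(diff * diff + eps))


def total_loss(params: Dict[str, Sequence[float]],
               sizes: Dict[str, int], eps: float = 0.01) -> float:
    """Sum the component loss over all group pairs and all parameters.

    params maps a group label to its parameter values (same item order in
    every group); sizes maps a group label to its sample size.
    """
    total = 0.0
    for g1, g2 in combinations(sorted(params), 2):
        weight = math.sqrt(sizes[g1] * sizes[g2])  # assumed pair weight
        for a, b in zip(params[g1], params[g2]):
            total += weight * component_loss(a - b, eps)
    return total


def share_noninvariant(flags: Sequence[bool]) -> float:
    """Proportion of parameters flagged as noninvariant."""
    return sum(flags) / len(flags)
```

Under these assumptions, a pattern with one large difference and otherwise identical parameters yields a lower total loss than a pattern in which the same total difference is spread evenly over all items, which is the behavior the text describes.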

Descriptive Statistics
Table 1 reveals considerable variations across countries in the reported levels of depressive symptoms as indicated by the single indicators (euro1-euro12) and the composite score (EURO-D). Older adults in Scandinavian countries (Denmark and Sweden), Southern Europe (Italy, Spain, and Greece), and Israel reported fewer depressive symptoms than individuals living in other European countries in the sample. While this pattern was similar for several depressive symptoms, mean differences for the EURO-D score were somewhat less consistent, although Sweden and Denmark are still representative of countries with low depression scores. Our examination of measurement invariance in the following sections will determine whether and to what extent we may rely on these reported cross-country score differences.

Exploratory Factor Analysis
We performed EFA for the 12 depressive symptom items across 18 countries and within each country separately. The rotated solution from the EFA across all countries is shown in Table 2.
According to the eigenvalue criterion, two factors (eigenvalues 4.849 and 1.360) emerged that represented affective suffering (depression, sleep, guilt, irritability, tearfulness) and motivation (pessimism, interest, concentration, enjoyment). These two factors were measured by the same items as in previous research (Castro-Costa et al., 2008), and they corroborate the findings of earlier analyses of the EURO-D scale (Castro-Costa et al., 2007, 2008; Guerra et al., 2015). However, the items reflecting suicidality, appetite, and fatigue show considerable cross-loadings that do not allow allocating them to either of the two factors. This analysis on the full sample is used as a benchmark for screening the data and its factorial structure. When the EFA is performed for each country separately, the eigenvalue criterion again suggested a two-dimensional structure for most countries, which was in line with previous findings (Prince et al., 1999a; Castro-Costa et al., 2008). 6 In three countries, a third factor was suggested, which was, however, substantively meaningless and therefore ignored. According to the selection criteria described above, items measuring suicidality, sleep, appetite, and fatigue were dropped from further analysis because they either failed to load substantially on any factor or loaded on both factors in more than 25% of the countries (see Supplementary Appendix). We chose this cutoff value for the share of countries because we considered 25% to be indicative of a substantial number of countries in which the items did not operate well. Obviously, other researchers may choose a higher or a lower cutoff for item selection. However, we would like to note that keeping these items in our case would likely result in misspecifications of the factor structure.
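The item selection rules used here (a main loading above 0.30, no second loading above 0.30, a gap of at least 0.15 between the main loading and any cross-loading, and exclusion of items that fail in more than 25% of the countries) can be sketched as follows. The function names are hypothetical, and the use of absolute loadings is an assumption made for the illustration; the loadings themselves would come from the country-specific EFA solutions.

```python
from typing import Mapping, Sequence


def item_is_problematic(loadings: Sequence[float],
                        min_loading: float = 0.30,
                        min_gap: float = 0.15) -> bool:
    """Flag an item whose loadings in one country violate the selection rules."""
    ranked = sorted((abs(value) for value in loadings), reverse=True)
    main = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0
    if main <= min_loading:      # fails to load substantially on any factor
        return True
    if second > min_loading:     # loads on two or more factors
        return True
    return main - second < min_gap  # cross-loading too close to main loading


def drop_item(loadings_by_country: Mapping[str, Sequence[float]],
              max_share: float = 0.25) -> bool:
    """Drop the item if it is flagged in more than 25% of the countries."""
    flagged = sum(item_is_problematic(values)
                  for values in loadings_by_country.values())
    return flagged / len(loadings_by_country) > max_share
```

With four hypothetical countries, an item flagged in one of them (25%) would be kept, whereas an item flagged in two (50%) would be dropped.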

Multigroup Confirmatory Factor Analysis
Next, we retained the two-dimensional structure obtained with EFA and tested whether it can be supported in each of the countries and whether it displays measurement invariance across countries using MGCFA. The general model structure is depicted in Figure 1.
First, we tested the model separately in each country. Results indicated good model fit, satisfactory factor loadings (higher than 0.3 in standardized terms; see Brown, 2015), and correlations between the two factors below 0.80 (indicating discriminant validity; Brown, 2015) in all countries with the exception of Denmark. In Denmark, the standardized factor loading of item euro2 (pessimism) on the motivation factor was low (0.18). Since we aimed at finding a model that applies to all countries, we omitted Denmark from further analysis.
Second, we examined the measurement invariance properties of the two subdimensions for the remaining 17 countries. These results are shown in Table 3. The fit indices indicated that the configural model fit the data well, suggesting that the same two-dimensional structure existed in all countries. The strong invariance model with cross-country equality constraints on the factor loadings and thresholds also fit the data well. However, the deterioration in model fit was outside the range of the recommended cutoff criteria. The modification indices suggested that the thresholds for items euro1 (depression), euro2 (pessimism), euro4 (guilt), euro7 (irritability), euro10 (concentration), euro11 (enjoyment), and euro12 (tearfulness) were not equal across countries. Austria, Czech Republic, Estonia, Germany, Greece, Italy, Sweden, and Switzerland contributed most to the noninvariance, indicating that people used the item categories differently in these countries. 7 Finally, the strict invariance model also showed a considerable deterioration in model fit compared to the strong invariance model, with the CFI value falling below 0.90.

Alignment
We tested whether comparisons of factor means are nevertheless trustworthy using the alignment procedure. The alignment procedure is more lenient, and it could suggest that means may be compared after all, even when exact strong or strict measurement invariance is not supported by the data. We ran the procedure separately for each latent dimension. 8 The number of noninvariant parameter estimates is presented in Table 4. The table demonstrates that the percentages of noninvariant parameters are far below the recommended cutoff criteria, and therefore, we conclude that the factor means may be trustworthy after all. However, items euro2 (pessimism), euro7 (irritability), and euro12 (tearfulness) were still significantly noninvariant in Austria, Germany, Sweden, Italy, France, Greece, Belgium, Israel, Poland, Luxembourg, Portugal, and Estonia (Table 5). Figure 2 shows the estimated factor means and the commonly used composite scores. The country rankings are quite different when using the more trustworthy aligned factor means compared to the composite score means. Moreover, the correlations of the latent and composite score means were only as high as 0.93 for the motivation subdimension and 0.80 for the affective suffering subdimension, suggesting that comparisons based on composite scores may be misleading. For example, based on the composite scores, older populations are the least depressed in Sweden, Austria, and Spain for both dimensions. However, based on the aligned means, it is Israel, Spain, and Switzerland (affective suffering) as well as the Czech Republic, Austria, and Germany (motivation) where older populations display the lowest depression scores. The picture becomes more troubling when one relies on the general EURO-D composite score that includes all items in one dimension. In this case, the correlation of this composite score is only 0.64 with the affective suffering dimension and 0.90 with the motivation dimension.
In other words, the bias in mean rankings is even larger when the general depression score is used as a single measure rather than considering the aligned means and the two-dimensionality of the construct. 9
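The comparison between composite score means and aligned factor means rests on two simple quantities: the correlation of the two sets of country means and the country rankings they imply. A minimal Python sketch is given below; the function names are hypothetical, and any values passed to them are placeholders rather than the estimates reported in Figure 2.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences of means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranking(means: Dict[str, float]) -> List[str]:
    """Countries ordered from the lowest to the highest mean."""
    return sorted(means, key=means.get)


def compare(composite: Dict[str, float], aligned: Dict[str, float]):
    """Return the correlation of the two sets of means and both rankings."""
    countries = sorted(composite)
    r = pearson([composite[c] for c in countries],
                [aligned[c] for c in countries])
    return r, ranking(composite), ranking(aligned)
```

A correlation well below 1 together with diverging rankings indicates that conclusions about which countries score lowest depend on whether composite or aligned means are used.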

SUMMARY AND CONCLUSION
The principal objective of the current study was to examine measurement equivalence of self-reported depressive symptoms among older populations in 17 European countries and Israel as measured by the EURO-D scale in the SHARE data. Indeed, existing literature on cross-country validation of depression measures using representative samples of the older population is lacking. This lacuna is unfortunate, since comparative research of depressive symptoms requires that the groups under study understand the survey questions in the same way and display similar measurement characteristics. Otherwise, observed differences in perceived depressive symptoms may not reflect true differences but rather methodological artefacts or other similar types of bias (e.g., cultural bias in response behavior). Accurate and reliable information on depression scores across populations is crucial to the evidence-based formulation of effective mental health policies and their successful implementation. Therefore, we aimed to fill this gap by examining the measurement equivalence of the cross-cultural assessment of depressive symptoms by the EURO-D scale in older European and Israeli adults aged 50 and over. We used different approaches to examine measurement invariance (stricter and more liberal, i.e., alignment) complemented by a series of robustness tests for the findings. By doing this, we attempted to provide researchers with reliable scores for conducting meaningful comparative analyses of depression across populations.
First, our results from the EFA indicated a two-dimensional structure of the depression scale across countries. The items measuring depression, guilt, irritability, and tearfulness represented the factor affective suffering, and the items measuring pessimism, interest, concentration, and enjoyment represented the factor motivation. However, the remaining items measuring suicidality, sleep, appetite, and fatigue were not clearly related to one of the factors and were therefore omitted from further analysis. Denmark had to be dropped from further analysis, because the item measuring pessimism did not load on its corresponding factor motivation in a satisfactory way. The two-dimensional structure we identified was concordant with previous findings (Prince et al., 1999a; Castro-Costa et al., 2007, 2008; Guerra et al., 2015). For example, the pioneering study by Prince et al. (1999a) that tested the EURO-D scale in 14 European centers reported that it can be reduced to two factors: affective suffering and motivation. The results of Castro-Costa et al. (2008) supported the EURO-D as either a unidimensional or a bidimensional measure of depressive symptoms in late life across European countries. Guerra et al. (2015) also found a two-factor structure (affective and motivation) of the EURO-D scale using large population-based survey samples of older people living in Latin America, India, China, and Nigeria. Finally, a more recent study analyzed the factor structure of the EURO-D depression scale in 15 European countries in an earlier wave (Wave 5, 2013) of SHARE. These authors also identified two factors. Second, we tested the measurement invariance across the remaining 17 countries. Results showed that strong invariance may be given if one is willing to accept a certain drop in the MGCFA fit statistics. However, strict invariance was clearly rejected by the data. Thus, we then tested for the more liberal approximate invariance using the alignment procedure. This procedure revealed that the aligned latent factor means are comparable after all. These are encouraging results, as they imply that researchers may confidently draw meaningful and valid conclusions in cross-national comparative research on depression using the SHARE data and the EURO-D scale. However, the findings also imply that the current practice of performing comparative analysis based on sum scores of the scale should be viewed with skepticism.

(Sass & Schmitt, 2010).
7 The modification indices are presented in the Supplementary Appendix.
8 Alignment is estimated using maximum likelihood and requires numerical integration when categorical indicators are analyzed. In the current analysis we experienced negative values in the absolute change of the loglikelihood across iterations even when we increased the number of integration points to improve numerical precision. Therefore, we decided to run the analysis for each latent variable separately to allow the models to converge more easily.
9 We would like to note that a different choice of countries and items may result in different findings. Thus, researchers interested in comparing different sets of countries and/or different measures are encouraged to perform analyses similar to those presented here. We include, in an online appendix, the syntax for the models we examined. Furthermore, we conducted robustness tests without the pessimism item but including Denmark in the analysis. The findings were quite similar, and strict invariance was still not supported by the data, whereas approximate invariance was (see Online Supplementary Table S1, S2).
The strict requirements for sum scores (unidimensionality, strict measurement invariance) are not met with the current data (see also Fried et al., 2016; McNeish and Wolf, 2020). Contrary to the findings of Castro-Costa et al. (2008), for example, our results illustrate that even the sum scores of the single subdimensions are biased when compared to the aligned means. Thus, aligned factor means should be used instead. However, the findings we reached are specific to the data at hand; when researchers use other data or different sets of countries, the measurement equivalence properties of the scores should be reevaluated. Indeed, our findings suggest that measurement invariance may still hold under more liberal approaches such as alignment, even when MGCFA fails to demonstrate it. The knowledge obtained from this study may help policy makers to base their decisions on true evidence of mental health prevalence across different European countries and Israel rather than on methodological artefacts.

Despite its contribution, the present study is not without limitations. First, while our study examined one particular depression scale, the literature discusses several other self-reported measures of depression that may be subject to noninvariance across different countries. Second, our study was limited to the European context and Israel and to the older population of these countries. Whereas measurement equivalence was established for the data at hand, this does not necessarily mean that it would also hold in other countries and across other age groups. Consequently, one needs to keep in mind that this may affect the results and potential comparability with other datasets. Future studies could address these important issues by further analyzing data in other countries, covering a diverse range of age groups, and using additional scales measuring respondents' mental health.
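The caution about sum scores raised above can be illustrated with a small numerical sketch. It is not the model estimated in this study: it assumes a simple one-factor probit model for binary items, and the loadings, thresholds, and the function name `expected_sum_score` are hypothetical values chosen only for illustration. Two countries are given the same latent mean and variance, but one item threshold differs in the second country, which is enough to make the expected sum scores diverge.

```python
from math import erf, sqrt


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def expected_sum_score(loadings, thresholds, latent_mean=0.0, latent_var=1.0):
    """Expected sum of binary items under a one-factor probit model.

    Assumed model (illustrative only):
      y*_j = loading_j * eta + e_j,  e_j ~ N(0, 1),
      eta ~ N(latent_mean, latent_var),
      item j is endorsed when y*_j exceeds threshold_j.
    """
    total = 0.0
    for lam, tau in zip(loadings, thresholds):
        # Marginal standard deviation of the latent response y*_j.
        sd = sqrt(lam ** 2 * latent_var + 1.0)
        # Probability that y*_j exceeds its threshold.
        total += normal_cdf((lam * latent_mean - tau) / sd)
    return total


# Hypothetical parameters for four items of one subdimension.
loadings = [0.8, 0.7, 0.9, 0.6]
thresholds_a = [1.0, 1.2, 0.9, 1.1]  # country A
thresholds_b = [1.0, 1.2, 0.9, 0.5]  # country B: last item is easier to endorse

# Both countries share the same latent mean (0) and variance (1).
score_a = expected_sum_score(loadings, thresholds_a)
score_b = expected_sum_score(loadings, thresholds_b)

print(f"Expected sum score, country A: {score_a:.3f}")
print(f"Expected sum score, country B: {score_b:.3f}")
```

In this sketch, country B obtains the higher expected sum score although the latent level of depressive symptoms is identical by construction. A comparison of raw sum scores would therefore suggest a difference that reflects only the noninvariant item, which is the kind of bias that comparing aligned factor means is intended to avoid.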
Notwithstanding these limitations, the current study, to the best of our knowledge, offers the most comprehensive examination of the measurement invariance properties of the depression scale in the SHARE data across participating European countries and Israel. It suggests that the aligned country means of depression may be used in comparative studies with confidence. At a time when the reduction of mental health disorders is of utmost importance in many countries with a growing older population, unbiased country scores of depression are important for the development of informed European health policy and interventions to reduce the prevalence of depression and increase quality of life for older members of the population. Based on our empirical findings, the SHARE cross-cultural depression data are a reliable information source to include in efforts to achieve this goal.

Note: The FIXED alignment procedure was used with Portugal (PT) as the reference group, where the factor means and variances were fixed to 0 and 1, respectively.

AUTHOR CONTRIBUTIONS
First and second author share equal contribution to the work and are presented in alphabetical order. DM, DS, and ED contributed to conception and design of the study. DM organized the database. DS performed the statistical analysis. DM and DS wrote sections of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.