A Wilcoxon–Mann–Whitney Test for Latent Variables

We propose an extension of the Wilcoxon–Mann–Whitney test to compare two groups when the outcome variable is latent. We empirically demonstrate that the test can have superior power properties relative to tests based on Structural Equation Modeling for a variety of settings. In addition, several other advantages of the Wilcoxon–Mann–Whitney test are retained such as robustness to outliers and good small sample performance. We demonstrate the proposed methodology on a case study.


INTRODUCTION
Consider a study where the interest is in the association between employment (yes/no) and the construct depression. The latter is quantified by the score on three questionnaires: the Patient Health Questionnaire-9 (PHQ-9; Spitzer et al., 1999; Kroenke et al., 2001; Kroenke and Spitzer, 2002), the Center for Epidemiological Studies Depression Scale-10 (CESD-10; Andresen et al., 1994), and the eight-item PROMIS Depression Short Form (PROMIS D-8; 8b short form; Pilkonis et al., 2011). The data originate from Amtmann et al. (2014), where the psychometric properties of these questionnaires were examined given that they all aim to screen for high depressive symptoms or major depressive disorder (MDD). Figure 1 displays these scores for 455 employed and unemployed individuals living with multiple sclerosis (MS). Scores for the PHQ-9 can range from 0 to 27, scores for the CESD-10 from 0 to 30, and scores for the PROMIS D-8 are reported on a standardized scale with mean 50 and standard deviation 10.
At first sight, the Wilcoxon-Mann-Whitney (WMW) test seems an appropriate choice to compare these two groups given the skewness and presence of outliers depicted in Figure 1. A WMW test can be carried out for each questionnaire; for the sake of illustration, the test is introduced below for the PHQ-9 scores. The test considers the null hypothesis that the distribution of the PHQ-9 scores is the same for both groups against the alternative that the probability that a subject of the unemployed group has a higher PHQ-9 score than a subject of the employed group differs from 50%. The test statistic reflects this probability, also denoted as the probabilistic index, and the sole use of ranks results in robustness against outliers. Under the null hypothesis and in the absence of ties, the sampling distribution of the test statistic only depends on the sample sizes of the two groups, hence resulting in a distribution-free test (Thas, 2010). The WMW test often has superior power compared with the t-test for heavy-tailed distributions (Blair and Higgins, 1980). For an exponential distribution, for example, the t-test needs approximately three times more observations than the WMW test to attain the same power (Van der Vaart, 2000; Hollander et al., 2013). Although both the WMW test and the t-test aim to compare two samples, they test different hypotheses with corresponding effect sizes (i.e. probabilistic index vs. comparison of means), hence leading to different test properties. Although the WMW test seems an appropriate choice in the current example, there is a complication: the outcome of interest, depression, is a latent variable. Stated otherwise, it is a variable that is not directly observed, but rather theoretically postulated or empirically inferred from observed variables (i.e. the indicators or proxies).
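For a single observed questionnaire, the WMW test and its probabilistic index can be computed directly. The following is a minimal sketch in Python with simulated, hypothetical scores (not the study data); the real analysis in this paper is carried out in R.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Hypothetical PHQ-9-like scores; the real data are from Amtmann et al. (2014)
employed = rng.integers(0, 15, size=40)     # group 1 (m = 40)
unemployed = rng.integers(3, 20, size=35)   # group 2 (n = 35)

# U counts the pairs in which the unemployed score exceeds the employed score
res = mannwhitneyu(unemployed, employed, alternative="two-sided")

# Probabilistic index: estimate of P(employed score < unemployed score),
# with ties counted as 1/2
pi_hat = res.statistic / (len(employed) * len(unemployed))
```

Under the null hypothesis of equal distributions, pi_hat fluctuates around 0.5; values far from 0.5 yield small p-values.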
Switching to latent variables, one cannot directly apply the methods that were designed for observed variables, because a measurement model that connects the latent variable to observed variables has to be postulated. Typically, continuous data with latent variables are analyzed via Factor Analysis (FA) or Structural Equation Modeling (SEM). Classical SEM has optimal performance with respect to hypothesis testing when the data are multivariate normally distributed. Given the skewness of the data, it is our interest to study whether the WMW test can be used in the context of latent variables while maintaining the attractive properties mentioned before. Applying the WMW test naively to each questionnaire results in a significant difference between the two groups on the first and second questionnaire (p = 0.003 and p = 0.04, respectively) but not on the third questionnaire (p = 0.12), making it difficult to reach a global conclusion on whether there is a significant association between depression and employment. In addition, these p-values do not take the measurement error into account. To the best of our knowledge, an extension of the WMW test that takes these two essential aspects into account (i.e. combining the information of multiple indicators for which measurement error is inherent) is not yet available. The objective of this paper is therefore to extend the WMW test to the context of latent variables, with the main focus on hypothesis testing. The paper is organized as follows. In Section 2 we propose a WMW test for latent variables. Section 3 empirically contrasts the new methodology with existing methods in different settings through a simulation study. Section 4 presents a case study and Section 5 provides a conclusion and discussion.

METHOD
We first introduce a measurement model that relates the latent variable to multiple indicators (or proxies). We then formulate the hypotheses of interest and demonstrate how they can be tested.

Measurement Model
The measurement model that relates the latent variable η to a set of indicators Y_p (p = 1, . . . , P) is given by

Y_p = h_p(η) + ε_p,    (1)

where h_p(·) denotes a strictly monotone function and ε_p denotes the measurement error. To distinguish between the two groups, we use the notation (Y_p, η, ε_p) for the first group and (Y*_p, η*, ε*_p) for the second. We define the reliability of an indicator Y_p as the proportion of variance in the indicator Y_p that is not explained by the measurement error ε_p, thus reflecting the quality of that indicator (Nunnally and Bernstein, 1994; Kline, 2015). Specifically,

ρ_p = 1 − Var(ε_p)/Var(Y_p).    (2)

The variance of the measurement error ε_p can be replaced by its empirical counterpart in order to obtain an estimate of the reliability of an indicator. We further elaborate on this estimation in the subsequent subsection. Similar to classic factor analysis and SEM, comparing the latent means between two groups by the use of indicators is only feasible when assuming measurement invariance. Hence, the function h_p(·) is assumed to be equal for the two groups and this needs to be confirmed by using e.g. SEM. In contrast with FA and SEM, where h_p(·) is assumed to be linear, we impose a less stringent specification, i.e. monotone without imposing linearity. Garcia-Marques et al. (2014) pointed out that a curvilinear trend between a latent variable and its indicator may exist in psychological research due to e.g. ceiling or floor effects. An example of a floor effect is observed by Amtmann et al. (2014) for the PROMIS D-8 questionnaire and is portrayed in the right panel of Figure 1. This demonstrates that assuming linearity can sometimes be an incorrect representation of reality.
The focus of this paper does not lie on estimating the measurement model; in contrast with classic FA and SEM, the measurement model will thus be treated as a nuisance.

Hypotheses and Testing Procedure
We are interested in testing the following hypotheses, which are expressed in terms of the distribution of the latent variable η (i.e. F_η):

H_0: F_η = F_η* vs. H_A: F_η ≠ F_η*.    (3)

If η_i (i = 1, . . . , m) and η*_j (j = 1, . . . , n) were observable, the WMW test statistic (in the absence of ties) would be given by (π̂ − 0.5)/σ_0, where σ_0 = √((m + n + 1)/(12mn)) and

π̂ = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} I(η_i < η*_j),    (4)

with I(·) the indicator function (Wilcoxon, 1945; Mann and Whitney, 1947). Here, π̂ is an unbiased estimator of P(η < η*), i.e. the probability that a randomly selected subject from group 1 has a lower outcome than a randomly selected subject from group 2. In practice, η_i and η*_j are unobservable and therefore Equation (4) cannot be computed. Instead, we only observe Y_pi and Y*_pj, which are related to η_i and η*_j via measurement model (1). Let WMW(Y_p, Y*_p; α) denote the WMW test applied to the indicators Y_p and Y*_p, where α denotes the level of significance. Under location-shift, i.e. Y_p equal in distribution to Y*_p + Δ, this test is unbiased, meaning that the rejection level does not exceed α when F_Yp = F_Y*p and that it is at least α when F_Yp ≠ F_Y*p (Lehmann, 1951). Under the additional assumption of distributionally homogeneous measurement errors,

ε_p equal in distribution to ε*_p (p = 1, . . . , P),    (5)

WMW(Y_p, Y*_p; α) is also an unbiased test for H_0: F_η = F_η* vs. H_A: F_η ≠ F_η*. Indeed, when assuming equal distributions for the measurement error over the two groups per indicator, it follows that F_η = F_η* implies F_Yp = F_Y*p, so that the rejection level of WMW(Y_p, Y*_p; α) does not exceed α under H_0. Secondly, when F_η ≠ F_η* then, under the location-shift assumption, F_Yp ≠ F_Y*p, so that the rejection level is at least α. Consequently, the test statistic results in an unbiased test for H_0: F_η = F_η* vs. H_A: F_η ≠ F_η*. In summary, we can apply the WMW test on the observed data to test hypotheses concerning the latent variable. This testing procedure requires neither that h_p(·), var(ε_p), or var(ε*_p) be estimated, nor that h_p(·) be linear.
The strong assumption of equal distributions for the measurement error is required for the theoretical validation of this method. However, violations of this assumption will appear to have no detrimental consequences with respect to hypothesis testing in our simulation study (see Section 3).
When multiple indicators are available, we propose to aggregate them to obtain a new indicator for which the abovementioned test rationale still holds. This aggregated indicator can be superior in terms of its reliability in comparison with the original indicators, but it requires estimates of the measurement error variance. The aggregated indicator is obtained by making a linear combination of the original indicators while preserving measurement invariance as mentioned in Section 2.1. Therefore, the construction of this linear combination is based on all data of an indicator p, i.e. both Y_p and Y*_p. Let Z_pk denote an observation from the pooled sample of Y_p and Y*_p, where k = 1, . . . , (m + n). We define the aggregated indicator as

Y_k^AGG = Σ_{p=1}^{P} a_p Z_pk,    (7)

where the weights a_p in Equation (7) can be chosen according to two strategies. A first strategy simplifies the aggregation to an unweighted mean of the standardized indicators. However, treating all indicators as equally important is a reduction of the complexity in reality, where some indicators are more reliable than others. Therefore, we propose a second strategy where the weights a_p in Equation (7) are chosen so that the estimated reliability of Y_k^AGG is maximized (Bentler, 1968; Li, 1997; Penev and Raykov, 2006). In other words, a maximally reliable composite is constructed and the estimated reliability of Y_k^AGG is by construction at least equal to the highest estimated reliability of the separate indicators used in the aggregation.
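The idea of the maximally reliable composite can be sketched numerically. Given the covariance matrix S of the (pooled, standardized) indicators and estimated error variances ψ_p, weights maximizing the composite reliability 1 − (a'Ψa)/(a'Sa) solve a generalized eigenvalue problem. This is an illustration of the idea on toy data with known error variances, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
# Toy data: three indicators of one latent variable with different error sds
eta = rng.normal(size=2000)
Y = np.column_stack([eta + rng.normal(scale=s, size=2000) for s in (0.5, 0.8, 1.2)])
Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)          # standardized indicators

S = np.cov(Z, rowvar=False)                        # indicator covariance
# Error variances of the standardized indicators (known here; estimated in practice)
err_var = np.array([s**2 / (1 + s**2) for s in (0.5, 0.8, 1.2)])
Psi = np.diag(err_var)

# Maximizing 1 - a'Psi a / a'S a  <=>  largest generalized eigenvalue of (S - Psi, S)
vals, vecs = eigh(S - Psi, S)
a = vecs[:, -1]
a = a * np.sign(a.sum())        # eigenvector sign is arbitrary; make weights positive
rel_composite = vals[-1]        # estimated reliability of the composite
```

By construction, rel_composite is at least as large as the estimated reliability of any single indicator, since a basis vector (weight 1 on one indicator) is a feasible choice of weights.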
In order to obtain the weights a_p, one needs to estimate the variance of the measurement error of the accompanying indicators, using the data of both groups. Given the data structure of our simulation study (Section 3) and case study (Section 4), we briefly elaborate on how this variance can be estimated when at least three indicators are available. For a more exhaustive explanation on how to estimate the variance of the measurement error under different settings, we refer to De Neve and Dehaene (2021). The estimation imposes a linear relationship among the indicators, which allows the error variances to be expressed in terms of the variances and pairwise covariances of the indicators. Although the assumption of a linear relationship among the indicators is required for these formulas, the results of our simulation study show no detrimental consequences of violations.
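For three indicators, a standard covariance-based identity under a linear one-factor model yields the error variances. The sketch below illustrates this idea and may differ from the exact formulas of De Neve and Dehaene (2021):

```python
import numpy as np

def error_variances(Y):
    """Estimate Var(eps_p) for three indicators under a linear one-factor model
    Y_p = alpha_p + lambda_p * eta + eps_p: the variance of Y_p explained by eta
    equals cov(Yp, Yq) * cov(Yp, Yr) / cov(Yq, Yr), with q, r the other two."""
    S = np.cov(Y, rowvar=False)
    est = np.empty(3)
    for p in range(3):
        q, r = [i for i in range(3) if i != p]
        est[p] = S[p, p] - S[p, q] * S[p, r] / S[q, r]
    return est

# Check on simulated data with known error variances 0.25, 0.64, 1.44
rng = np.random.default_rng(42)
eta = rng.normal(size=20000)
Y = np.column_stack([eta + rng.normal(scale=s, size=20000) for s in (0.5, 0.8, 1.2)])
est = error_variances(Y)
```

In large samples the estimates recover the true error variances; the ratio of covariances isolates the variance shared with the latent variable.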
The test rationale put forward earlier, i.e. to employ the WMW test on observed data in order to make inference with respect to latent variables, is still valid when using this aggregated indicator Y_k^AGG. Moreover, this aggregation does not complicate the justification, thanks to the flexibility of the measurement model: the aggregated indicator can be rewritten as

Y_k^AGG = Σ_{p=1}^{P} a_p h_p(η_k) + Σ_{p=1}^{P} a_p ε_pk,

and hence the same structure as described in measurement model (1) still holds. Therefore, we obtain the test statistic

π̂_AGG = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} I(Y_i^AGG < Y_j^AGG),

where the data of the two groups for the aggregated indicator are distinguished by using the notation Y_i^AGG and Y_j^AGG for the first and second group, respectively. As for the standard WMW test, a p-value can be obtained by either using a permutation null distribution or a standard normal approximation (Wilcoxon, 1945; Mann and Whitney, 1947; Thas, 2010).

Sample Size and Power Calculation
An approximate total sample size N (N = m + n) for a one-sided test with significance level α can be determined by using the formula from Hollander et al. (2013):

N = (z_{1−α} + z_{1−β})² / (12 c (1 − c) (δ − 0.5)²),

where c reflects the ratio of the sample sizes of the two groups, i.e. c = m/(m + n), and δ denotes the effect size under the alternative hypothesis, i.e. P(Y_1^AGG < Y_2^AGG). Subsequently, the expected power can be deduced via

1 − β = Φ((δ − 0.5) √(12 N c (1 − c)) − z_{1−α}).

SIMULATION STUDY
In order to assess the finite sample performance of the extended Wilcoxon-Mann-Whitney test, a simulation study is performed. All simulations and analyses are performed with R version 3.5.1 (R Core Team, 2020). Different scenarios are explored, all based on the same data generating process: the latent variables η and η* are generated from centered random variables ζ and ζ*, with a mean shift β* for the second group (Equation 11), and the indicators are subsequently generated via measurement model (1). To explore the performance of the suggested methodology under different scenarios, the simulation study covers normal, heavy-tailed, and skewed distributions for ζ and ζ* (and consequently η and η*): N(0, 1), t_5, Laplace(0, 1.25), and the standard exponential distribution centered around zero. In the wide range of possible non-normal distributions, these distributions correspond to kurtosis and skewness values that can be encountered in practice and that are leptokurtic (Chou et al., 1991). This positive kurtosis enables the examination of whether the superior power of the WMW test for heavier tailed distributions carries over to the context of latent variables (Van der Vaart, 2000; Hollander et al., 2013). The superiority for heavier tailed distributions is also observed in small samples and therefore the sample sizes in this simulation study are varied between 15, 50, and 100 observations in each group. As a result, the properties of the different methods can be explored without running into a ceiling effect with respect to the empirical power. By varying the variances of the measurement error, the influence of the reliability of the indicators on the different methods can be studied. We consider reliabilities of 60% and 80%, which correspond to indicators that can be considered as having a relatively weak and an adequate reliability, respectively (Nunnally and Bernstein, 1994; Jackson, 2001). The function h_p(·) is either linear or non-linear. In the non-linear case, the transformation h_p(·) equals Φ^{−1}[F(·)], where F is the cumulative distribution function of a t-distribution with 1 or 3 degrees of freedom.
This relationship can be seen as a modified inverse logit function and was chosen since it recreates a curvilinear trend between a latent variable and its indicator, in accordance with the trend mentioned in Garcia-Marques et al. (2014). Figure 2 shows an example of such a curvilinear trend between a latent variable η and the function h_p(η) as used in the simulation study. Taking into account all parameters discussed up to now, the simulation study involves linear or non-linear functions h_p(·), four error distributions, three sample sizes, and two reliabilities, resulting in 48 simulation scenarios. For all these 48 combinations, four settings are considered to gain insight into the empirical consequences of violating the assumption of equality in distribution of the measurement error as postulated under model (5) in Section 2.2.
• Setting 1: ε_p equal in distribution to ε*_p, i.e. the model is correctly specified and the measurement error has the same distribution across groups.
• Setting 2: cε_p equal in distribution to ε*_p with c ≠ 1, i.e. the variance of the measurement error differs across groups, where c is chosen in such a way that the reliability of all indicators in the second group is consistently about 5% lower than in group 1.
• Setting 3: ε_p not equal in distribution to ε*_p but Var(ε_p) = Var(ε*_p), i.e. the distribution of the measurement error differs across groups (a normal and a Laplace distribution, respectively), but the variance is equal.
• Setting 4: the model is correctly specified, but one of the three indicators has a very low reliability of only 20%.

In order to assess the impact of all these parameters on both the empirical Type I error rate and power, the parameter β* as defined in Equation (11) is varied. The exact parameter value depends on the error distribution, but it is chosen so that the probabilistic index P(η < η*) equals 50% and 65% under the null and alternative hypotheses, respectively. The rationale for this probabilistic index can be traced back to the simple relationship between a probabilistic index and the standardized difference as described in Cohen (1988) and De Schryver and De Neve (2019). By exploiting this relationship, a probabilistic index of 65% coincides with a standardized effect of 0.55 and is hence a medium effect according to Cohen (1988).
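The relationship between the probabilistic index and the standardized difference is easy to verify numerically: for normal latent variables with equal variances, the probabilistic index equals Φ(d/√2), with d the standardized (Cohen's d) effect. A small check of the correspondence used here:

```python
from math import sqrt
from scipy.stats import norm

# For normal variables with equal variances, P(eta < eta*) = Phi(d / sqrt(2));
# a medium standardized effect d = 0.55 then corresponds to roughly 65%
d = 0.55
pi = norm.cdf(d / sqrt(2))
```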
For each setting, 1,000 Monte Carlo simulation runs are used to evaluate and contrast the performance of six methods. In the first method, further referred to as WMW-max rel, a maximally reliable composite is used as input for the WMW test. In this simulation study, the weights a_p are obtained via an optimization function, but an analytic solution based on Bentler (1968) is also possible. The variance of the measurement error of each indicator, needed for this optimization, was estimated by using the formulas presented in Section 2.2. The second method, further referred to as WMW-mean, also applies the WMW test on an aggregated indicator, but here the aggregation is simplified to the mean of the standardized indicators. This comparison enables us to study the influence of estimating weights according to the quality of the individual indicators. The third method uses the maximally reliable composite as input for a Welch t-test, while the fourth method uses the unweighted mean as input for a Welch t-test. These two methods are further referred to as t-test-max rel and t-test-mean and form a parametric alternative to the first and second method. The fifth and sixth methods are based on SEM with equal loadings and intercepts across groups and are currently the most common methods to compare a latent variable between two groups. We include SEM both with and without a correction for non-normal data, further referred to as SEM-correction and SEM, respectively.
The correction for non-normal data refers to the use of robust Satorra-Bentler standard errors (Satorra and Bentler, 1994). SEM and SEM-correction can also be seen as parametric counterparts of the proposed WMW extension, but unlike the Welch t-test, SEM does take the measurement error of the indicators into account. The SEMs are implemented by using lavaan (Rosseel, 2012).
For sample size m = n = 15, p-values for the methods based on SEM and the Welch t-test were obtained by using a permutation null distribution, to ensure a fair comparison between the six methods. For larger samples, inference was conducted by relying on the asymptotic distributions of the respective test statistics.
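A permutation null distribution for a two-sample statistic can be sketched generically; the following illustration applies it to the WMW U statistic with small, simulated groups (an illustration of the principle, not the authors' implementation):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def wmw_permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value: permute the group labels and compare the
    observed U statistic with its permutation null, centered at m*n/2."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    m, n = len(x), len(y)
    center = m * n / 2                       # null mean of U
    obs = mannwhitneyu(x, y).statistic
    null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(pooled)
        null[b] = mannwhitneyu(perm[:m], perm[m:]).statistic
    return np.mean(np.abs(null - center) >= abs(obs - center))

rng = np.random.default_rng(1)
x = rng.normal(size=15)
y = rng.normal(loc=2.0, size=15)    # clear shift between two small groups
p = wmw_permutation_pvalue(x, y)
```

For m = n = 15 the permutation approach avoids relying on the asymptotic normal approximation.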
The R code to recreate the simulation study is available in the Supplementary Material.

Results
Because the Type I error rate is correctly controlled for almost all methods in almost all settings, we do not display these tables in the main text but provide them in the Supplementary Material (i.e., Tables 1-4 in Supplementary Data Sheet 2). In three cases, the methods relying on SEM are too liberal, with an empirical Type I error rate of 6.9%. On the other hand, the methods relying on the WMW test are too conservative in four cases for setting 2, with an empirical Type I error rate of 3.0 or 3.2%. These seven deviations are indicated in bold in the corresponding tables in the Supplementary Material. It is noteworthy that even under model misspecification (i.e. settings 2 and 3) the Type I error rate is controlled by all six methods. Hence, the assumption of distributionally homogeneous errors as postulated in Equation (5) appears less stringent in practice. The results with respect to the empirical power are summarized in Tables 1-4. Regardless of the setting, sample size, reliability, or type of function h_p(·), the following trend can be observed with respect to the empirical power. When the latent variable is normally distributed, the six methods have similar power, although the methods based on SEM and the t-test are slightly more powerful. The more heavily tailed the distribution becomes, the more superior the WMW methods become. This pattern is in accordance with observations in the context of observed outcome variables, as mentioned by Van der Vaart (2000) and Hollander et al. (2013).
When one of the three indicators has an extremely low reliability, i.e. setting 4, a slightly different pattern can be seen. SEM now has a more pronounced superior power for the normal distribution. For the more heavily tailed distributions, the superiority of the WMW methods overall re-emerges. The methods WMW-max rel and WMW-mean differ in terms of the aggregated indicator that is used as input for the WMW test. The added value of the optimization process is demonstrated in setting 4. The ability to fine-tune the weights in line with the reliability of each indicator separately, and hence to give less influence to an indicator of poor quality, results in a higher empirical power for WMW-max rel in comparison with WMW-mean (in Tables 1-4, the highest power per setting is indicated in bold). However, it should be noted that when the reliabilities of the indicators are equal to one another and/or the sample size is small (i.e. m = n = 15), a small reduction in empirical power for WMW-max rel in comparison with WMW-mean is observed. Hence, estimating the weights in order to optimize the estimated reliability of the aggregated indicator only has added value when the data require such adaptation. Comparing the methods SEM and SEM-correction, the results with respect to the empirical power show no remarkable differences. Hence, the results suggest that the added value of robust Satorra-Bentler standard errors is limited in our simulation setup.
The lower the reliability of all three indicators, the lower the power; this holds for all settings and all methods. To guard the readability of the tables, the point estimates and standard errors are not listed, but based on these results the reason for the decrease in power appears to differ between the WMW methods and the SEM methods. For SEM and SEM-correction, there is barely an effect on the accuracy of the estimation, but an increased empirical standard error is observed, which reduces the power. In contrast, for WMW-max rel and WMW-mean, the empirical standard error remains relatively stable over the different levels of reliability, but the estimation is less accurate, which likewise reduces the power.
Overall, the simulation results suggest that the attractive properties of the original WMW method, as discussed in the introduction, are preserved in the context of latent variables. The extended WMW method is thus robust against outliers, suitable for skewed data, and has superior power for skewed and heavy-tailed distributions.

ILLUSTRATION
We now reconsider the example of the introduction, where we want to examine the association between employment and depression in 455 patients with multiple sclerosis (MS). Figure 3 shows the pairwise scatterplots of the three depression questionnaires. The relation between PROMIS-D-8 and the other two indicators seems non-linear. This is in accordance with Amtmann et al. (2014), who argue that this questionnaire is subject to floor effects. Taking into account the characteristics of the data, i.e. the skewness depicted in Figure 1 and the observed non-linearity in Figure 3, the proposed extension of the WMW method is an appropriate analysis.
Visual inspection via Q-Q plots (see Figure 1 in Supplementary Data Sheet 2) shows that the assumption of location shift for the indicators is reasonably met. For the sake of completeness, the data are analyzed with all six methods mentioned in the simulation study (Section 3): WMW-max rel, WMW-mean, t-test-max rel, t-test-mean, SEM, and SEM-correction. Before comparing the two groups by using our proposed WMW extension or SEM, model comparison tests were used to establish measurement invariance across the two groups. After a data-driven model evaluation, a residual covariance between CESD-10 and PROMIS-D-8 was added for both groups in order to further enhance the fit of the model when using SEM. Table 5 lists the p-values of all six methods; almost all methods lead to the conclusion that there is a significant difference in levels of depression between employed and unemployed people living with MS. Inference based on t-test-max rel is only marginally significant at the 5% significance level. The interpretation of this difference does vary according to the method used. All parametric methods, i.e. the methods based on SEM and the t-test, estimate the mean difference between the two groups with the employed patients as reference group. Hence, a positive difference indicates that the unemployed patients have on average a higher score. Both SEM and SEM-correction give an estimated difference in latent means for depression of 1.444. Analyses based on the t-test with a weighted sum or a mean of the standardized indicators result in a difference of 1.14 and 0.19, respectively. Given that these modified t-tests do not take a measurement model into account, in contrast with SEM, their effect sizes merely reflect a difference in an overall outcome that aims to measure depression. Nevertheless, both analyses indicate that the group of unemployed patients scores on average higher for depression than the group of employed patients.
For the extended WMW methods, the interpretation of the effect sizes is slightly different. Here, the effect size represents the probability that an unemployed patient has a higher score for depression than an employed patient. These probabilities are 56.48% and 56.92% for WMW-max rel and WMW-mean, respectively. Consequently, patients who are unemployed have a significantly higher probability (around 56%) of having higher depression scores than patients who are employed. The results thus show that, in this specific context, the conclusions of all methods point in the same direction. The R code for all analyses and figures is made available in the Supplementary Material.

DISCUSSION
In this paper, we proposed an extension of the Wilcoxon-Mann-Whitney test in the context of latent variables with a main focus on hypothesis testing. We introduced two strategies: one where the mean of the standardized indicators is used and one where a maximally reliable composite is created to form the input of the WMW test. The statistical properties of these proposed testing procedures were examined and compared with the performance of SEMs and two adapted t-tests in a simulation study. By comparing the two proposed strategies to each other in this simulation study, the costs and benefits of this maximization procedure were explored.
Even though SEM is omnipresent in the context of analyzing latent variables, we believe that a valuable alternative is provided in this paper. Concerning measurement model (1), the difference with the classical theory in SEM is the amount of knowledge that is required concerning the function h p (·). Typically, h p (·) is assumed to be linear and is estimated during the analysis. In our proposed methodology where the mean of the standardized indicators is used, we relax this requirement since we only assume monotonicity and we do not need to estimate h p (·) as this is not our main interest. Some flexibility has thus been introduced in comparison with the traditional FA and SEM. Additionally, the variances of the measurement error do not need to be estimated in this strategy. For the theoretical validation of our methodology, we do need to impose the assumption of equal distributions for the measurement error per indicator across two groups. However, the simulation results suggest that violations have no detrimental consequences with respect to both the empirical Type I error rate and the power.
When a maximally reliable composite is used, i.e. the second strategy of the extended WMW test, the variance of the measurement error needs to be estimated in order to obtain the weights for this aggregated indicator. In the proposed formulas for the estimation of the variance, linearity among the indicators was imposed. Also here, deviations from this linearity assumption did not lead to problems with respect to hypothesis testing in our simulation study as long as the function h p (·) is monotone. It may be clear that the use of a maximally reliable composite requires additional assumptions in comparison with the use of a simple mean, but based on our simulation study, these assumptions seem to be flexible in practice.
Using an aggregated indicator based on the data-driven maximization procedure entails an improvement compared with the use of an unweighted mean when the reliability of the indicators differs. These results confirm the theoretical expectation that fine-tuning the weights in line with the reliability of each indicator separately can result in an increase in empirical power. However, one should not needlessly use this maximization procedure as it can result in a small loss of power when e.g. the reliability of indicators is equal to one another and/or the sample size is small (i.e. m = n = 15).
Most importantly for this paper, the results show that the attractive properties of the original WMW method are transferred to the context of latent variables. The results confirm that the procedures based on the WMW test have superior power when the distribution is heavily tailed. This pattern is also observed when simulating data in the context of observed outcome variables, as mentioned by Van der Vaart (2000) and Hollander et al. (2013).
A possible limitation for the user might be the effect size of the extended WMW test. Where SEM provides an estimate of the difference between two groups on the scale of the latent variable which can be standardized, the effect size of our proposed method is a probabilistic index. On the other hand, from a theoretical point of view, this latter effect size has attractive properties. A probabilistic index is scale invariant and robust to outliers.
A second limitation of this study is the sole focus on hypothesis testing. It is known that using the standard Wilcoxon-Mann-Whitney test in the context of measurement error leads to an underestimation of the true effect size (Sukhatme, 1996, 1997; Faraggi, 2000; Schisterman et al., 2001; Tosteson et al., 2005; Fuller, 2009). In future research, adaptations to the proposed methodology can be studied to enhance the point estimation. Other possible directions for research are extensions for paired groups, multiple group comparisons, or comparing groups over time.
A third limitation of the suggested methodology is that it is inherently impossible to model the monotonic relation between the indicator and the latent variable, in contrast to SEM. The methodology presented in this paper extends the standard Wilcoxon-Mann-Whitney method and hence only uses the rank of the data. Closely related is the remark that the extended WMW test can only be applied after measurement invariance is determined by using SEM. The additional use of the extended WMW test is especially justified when the distribution is heavily tailed in order to profit from the attractive properties of the extended WMW test as discussed earlier.
To conclude, this paper validated the use of the WMW method in the context of latent variables by implementing some small adaptations, i.e. the creation of an aggregated indicator. The use of existing concepts not only facilitates the practical implementation for researchers and practitioners, the advantages of the original WMW method are also carried over into the new context. We believe that the combination of flexibility in the measurement model, the ability to allocate weights reflecting the reliability of indicators and the superiority in heavily tailed distributions results in a valuable methodology.

DATA AVAILABILITY STATEMENT
Requests to access data with respect to the simulation study should be directed to Heidelinde Dehaene, heidelinde.dehaene@ugent.be. Some previously collected existing data was used in this article and is not publicly available. Therefore, requests to access these data should be directed to the corresponding authors [i.e. Amtmann et al. (2014)].

AUTHOR CONTRIBUTIONS
HD, JDN, and YR: conceptualization and methodology of the presented idea. HD: implementation of the simulation studies, data analysis and writing of the manuscript. JDN and YR: review and editing of the manuscript and supervision. All authors approved the submitted version.

FUNDING
This work was financially supported by a Special Research Fund (BOF) Starting Grant 01N00717 from Ghent University. The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by Ghent University, FWO and the Flemish Government-department EWI.