
ORIGINAL RESEARCH article

Front. Educ., 12 January 2026

Sec. Assessment, Testing and Applied Measurement

Volume 10 - 2025 | https://doi.org/10.3389/feduc.2025.1645911

A comparison of three approaches for clustering polytomous data in the presence of masking variables


Sijia Huang1*, Preston Botter1 and Alexandra Sturm2
  • 1School of Education, Indiana University Bloomington, Bloomington, IN, United States
  • 2Psychological Science, Loyola Marymount University, Los Angeles, CA, United States

To uncover the heterogeneity in a population, it is common yet important to partition individuals into distinct subgroups based on their responses to items in measurement tools. Various approaches have been introduced to tackle this clustering problem in psychology and education. To provide more guidance to practitioners, in this study we compared the performance of three widely-applied approaches, namely latent class analysis (LCA), k-means, and k-medians, in clustering polytomous items in the presence of masking variables. In the simulation conditions considered, we found that LCA coupled with the Bayesian information criterion (BIC) outperformed the other approaches and methods for determining the optimal number of subgroups. We also applied the three approaches to an empirical data set and obtained different conclusions regarding the number of subgroups. Additionally, we discuss the limitations of this study and future research directions.

1 Introduction

To uncover and describe the heterogeneity in a population, it is a common yet important task to partition many individuals into a handful of distinct subgroups based on their responses to items in measurement tools. For instance, many studies in developmental psychology (e.g., Althoff et al., 2006; Basten et al., 2013; Nozadi et al., 2016; Wadsworth et al., 2001) have investigated the phenotype profiles of anxiety, depression, and attention-deficit/hyperactivity disorder (ADHD) using responses to items in the Child Behavior Checklist (CBCL; Achenbach, 1999), a popular caregiver-report measure for assessing children's behavior problems. Additionally, in education, recent studies have identified subgroups of students with distinct online learning patterns (e.g., Araka et al., 2022; Barnard-Brak et al., 2010; Broadbent and Fuller-Tyszkiewicz, 2018) and school principals with different leadership types (e.g., Agasisti et al., 2019; Angela and Alex, 2014) using students' and principals' responses to survey questionnaires.

Various approaches have been introduced to tackle this problem of clustering a large number of individuals into non-overlapping subgroups. Based on whether an explicit statistical model is invoked, these clustering approaches can be roughly divided into two categories: model-based and non-model-based approaches (Brusco et al., 2017). One of the most extensively applied model-based clustering approaches in psychology and education is latent class analysis (LCA; e.g., Lazarsfeld, 1950a,b; Goodman, 1974; McCutcheon, 1987; Vermunt and Magidson, 2004), which can be viewed as a special case of finite mixture models (McLachlan and Peel, 2000) when data are categorical. LCA, along with other model-based clustering approaches, relies on statistical models and assumes that the observed data (i.e., item responses) are a mixture of several probability distributions. In contrast, non-model-based approaches, such as k-means clustering (Bishop and Nasrabadi, 2006; MacQueen, 1967), k-medians clustering (Kaufman and Rousseeuw, 1990), and hierarchical cluster analysis (HCA; Johnson, 1967), determine how individuals are clustered based on measures of distance/dissimilarity between them.

Studies have been conducted to compare various clustering approaches and offer insight into their applications in different research contexts. For instance, Brusco et al. (2017) evaluated the performance of LCA, k-means, and k-medians algorithms in clustering dichotomous data. They found that all three approaches could recover the cluster structures well in the simulations but yielded different results when applied to a real data set. Magidson and Vermunt (2002), also through a simulation study, compared LCA with k-means in clustering two continuous variables. Schreiber and Pekarik (2014) applied LCA, k-means, and HCA to empirical data sets and concluded that LCA was statistically more rigorous than the other two approaches. Papachristou et al. (2016) compared LCA with k-means and found that the two approaches identified similar clinical profiles based on a cancer symptoms data set.

To further improve the understanding of clustering approaches and provide guidance to practitioners, two issues need to be more closely investigated. First, measures in psychology and education typically consist of a large number of items; however, not all of them are essential for identifying subgroups in the population. Following Brusco (2004) and Fowlkes and Mallows (1983), we use the term true variables for items in a measure that define the true subgroup structures, and masking variables for those that are not relevant. While several methods for selecting true variables for subsequent analysis (e.g., Brusco and Cradit, 2001; Carmone et al., 1999) have been developed, it remains unclear whether and how masking variables impact the performance of both model-based and non-model-based approaches for clustering individuals.

Secondly, most of the existing studies on clustering approaches have focused on dichotomous or continuous variables. However, items in psychological measures and educational survey questionnaires generally have more than two response categories (i.e., polytomous variables). For instance, items in the CBCL are statements of children and youth's behaviors (e.g., [if a child] acts too young for his/her age) and are rated on a 3-point Likert scale (0 = Not true, 1 = Somewhat or sometimes true, and 2 = Very true or often true). To be able to choose among various clustering approaches, researchers and practitioners in psychology and education must know how these approaches perform when data are polytomous.

To address these two issues, the present study examined the performance of clustering approaches for assigning individuals to subgroups based on polytomous data in the presence of masking variables. Specifically, through a comprehensive simulation study and an empirical data analysis, we compared three popular approaches, LCA, k-means, and k-medians, along with methods for selecting the optimal number of subgroups. The remainder of this paper is organized as follows: in Section 2, we review the three clustering approaches discussed in this study and the associated methods for selecting the number of clusters. In Section 3, we present a simulation study comparing the performance of the three approaches in recovering subgroup structures under various conditions. In Section 4, we apply the three clustering approaches to an empirical data set and compare the results. Lastly, in Section 5, we discuss the limitations of the present study and point to future research directions.

2 Clustering approaches

In this section, we briefly review the three popular clustering approaches compared in the present study, LCA, k-means, and k-medians, along with methods for selecting the optimal number of subgroups for each approach.

2.1 Latent class analysis

Since it was first introduced by Lazarsfeld (1950a,b), LCA has undergone significant development (e.g., Goodman, 1974; Vermunt and Magidson, 2004). Typically, LCA assumes that the population under study is a mixture of two or more mutually exclusive subgroups (i.e., classes). The subgroup memberships of individuals are unobserved and lead to different distributions of the observed variables. Let $\mathbf{x}_i = (x_{i1}, \ldots, x_{iv}, \ldots, x_{iV})$, $1 \le v \le V$, denote the observed responses of individual $i$ ($1 \le i \le N$) to $V$ polytomous items/variables, where $N$ indicates the number of individuals in the sample. Invoking the conditional independence assumption, we have that for an individual $i$ who belongs to subgroup $k$ ($1 \le k \le K$), the probability of observing the response pattern $\mathbf{x}_i$ is

$P(\mathbf{x}_i \mid c_i = k) = \prod_{j=1}^{V} P(x_{ij} \mid c_i = k)$,    (1)

where $P(x_{ij} \mid c_i = k)$ is the probability that an individual in subgroup $k$ scores $x_{ij}$ on item $j$. Let $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)$ be the vector of probabilities that an individual belongs to each of the $K$ subgroups. The multivariate density of individual $i$ can then be specified as a mixture distribution,

$P(\mathbf{x}_i) = \sum_{k=1}^{K} \lambda_k P(\mathbf{x}_i \mid c_i = k) = \sum_{k=1}^{K} \lambda_k \prod_{j=1}^{V} P(x_{ij} \mid c_i = k)$.    (2)

Then the overall likelihood L is the product of the contribution of each individual,

$L = \prod_{i=1}^{N} P(\mathbf{x}_i) = \prod_{i=1}^{N} \left[ \sum_{k=1}^{K} \lambda_k \prod_{j=1}^{V} P(x_{ij} \mid c_i = k) \right]$.    (3)

Estimates of model parameters are generally obtained through maximum likelihood (ML)-based estimation procedures, such as the expectation–maximization (EM) algorithm.

2.1.1 Determining K

To determine the number of subgroups K in the population, a series of LCA models with different numbers of subgroups is fitted to the observed data and compared. Because absolute fit indices are lacking, in practice two information criteria, the Akaike information criterion (AIC; Akaike, 1973) and the Bayesian information criterion (BIC; Schwarz, 1978), are often adopted to aid in selecting the number of subgroups. Both AIC and BIC add penalty terms to the maximized likelihood of the model, denoted by $\hat{L}$. Specifically, AIC factors in the number of parameters to reduce overfitting:

$\mathrm{AIC} = -2\ln(\hat{L}) + 2f$,    (4)

where f refers to the number of free parameters in the model. BIC, on the other hand, penalizes both the number of model parameters and the number of individuals in the sample N:

$\mathrm{BIC} = -2\ln(\hat{L}) + f\ln(N)$.    (5)
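To make the selection procedure concrete, the following sketch fits LCA models with 1 to 6 classes and picks the K that minimizes BIC. It assumes the poLCA package used later in Section 3.2; the data frame dat and the item names y1–y10 are hypothetical placeholders, not the study's data.

```r
library(poLCA)

set.seed(1)
# Hypothetical data: 10 three-category items coded 1, 2, 3 (poLCA requires
# categories to be consecutive positive integers starting at 1)
dat <- as.data.frame(matrix(sample(1:3, 250 * 10, replace = TRUE), ncol = 10))
names(dat) <- paste0("y", 1:10)

f <- as.formula(paste0("cbind(", paste(names(dat), collapse = ","), ") ~ 1"))

# Fit LCA with 1 to 6 classes; nrep restarts EM to guard against local optima
fits <- lapply(1:6, function(k)
  poLCA(f, data = dat, nclass = k, nrep = 10, maxiter = 1000, verbose = FALSE))

ic <- data.frame(K   = 1:6,
                 AIC = sapply(fits, function(m) m$aic),
                 BIC = sapply(fits, function(m) m$bic))
ic$K[which.min(ic$BIC)]  # the K minimizing BIC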

2.2 k-means and k-medians

The k-means clustering is probably the most widely-applied unsupervised machine learning technique, and k-medians clustering can be viewed as one of its variants. Unlike LCA, k-means and k-medians do not assume any statistical model but assign individuals to a pre-specified number of subgroups by minimizing the sum of within-cluster variation. Specifically, with k-means, the variation within each cluster is computed as the sum of the squared Euclidean distances between each individual and its cluster center,

$W_k = \sum_{\mathbf{x}_i \in C_k} (\mathbf{x}_i - \bar{\mathbf{x}}_k)^2$,    (6)

where $\bar{\mathbf{x}}_k$ represents the mean of all individuals within subgroup $k$ (i.e., the cluster center). The k-medians clustering, on the other hand, uses medians as cluster centers and is therefore more robust to outliers. A smaller $W_k$ indicates that the individuals in subgroup $k$ are more homogeneous. Note that while the Euclidean distance is often considered the default for k-means and k-medians, other distance measures, such as the Manhattan distance, can also be used.

This minimization problem for k-means and k-medians is solved by applying the iterative procedure shown in Algorithm 1. The algorithm starts by randomly assigning individuals to subgroups and updates the subgroup memberships based on individuals' distances to the cluster centers. The algorithm stops when the membership assignments stop changing. This optimization may stop at a local rather than the global optimum. Therefore, in practice, one needs to run the algorithm multiple times with different initial configurations and then select the clustering solution with the smallest sum of within-cluster variation.

Algorithm 1. An iterative algorithm for k-means and k-medians.
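A bare-bones R version of Algorithm 1 is sketched below; it is a simplified illustration (empty clusters are not handled), not the implementation of the packages used in Section 3.2. Passing a column-median update instead of colMeans turns the k-means iteration into k-medians.

```r
kcenters <- function(X, K, center_fun = colMeans, max_iter = 100) {
  X <- as.matrix(X)
  # Step 0: randomly assign individuals to the K subgroups
  assign <- sample(rep(1:K, length.out = nrow(X)))
  for (iter in 1:max_iter) {
    # Update centers: colMeans gives k-means; column medians give k-medians
    centers <- t(sapply(1:K, function(k)
      center_fun(X[assign == k, , drop = FALSE])))
    # Reassign each individual to its nearest center (squared Euclidean)
    d <- sapply(1:K, function(k) rowSums(sweep(X, 2, centers[k, ])^2))
    new_assign <- max.col(-d)
    if (all(new_assign == assign)) break  # memberships stopped changing
    assign <- new_assign
  }
  list(cluster = assign, centers = centers)
}

col_medians <- function(M) apply(M, 2, median)
# k-means:   kcenters(X, K = 3)
# k-medians: kcenters(X, K = 3, center_fun = col_medians)
```

In practice, such a routine would be wrapped in multiple restarts, keeping the solution with the smallest sum of within-cluster variation.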

2.2.1 Determining K

Various methods have been proposed to determine the optimal number of subgroups for both k-means and k-medians clustering. In this study, we focused on the widely-applied Maximum Ratio of Percentage Changes (MRPC) for k-means, and the Calinski-Harabasz (CH) pseudo-F statistic and the Silhouette method for both k-means and k-medians.

Maximum ratio of percentage changes (MRPC). The MRPC determines the optimal number of subgroups by calculating the ratio of the percentage changes in the sum of within-cluster variation as the pre-specified subgroup number increases. Let W(K) denote the sum of within-cluster variation when the number of subgroups for k-means is specified as K, i.e., $W(K) = \sum_{k=1}^{K} W_k$. The MRPC associated with K subgroups is then computed as:

$\mathrm{MRPC}(K) = \dfrac{[W(K-1) - W(K)]/W(K-1)}{[W(K) - W(K+1)]/W(K)}$.    (7)

A larger MRPC means that the relative change from K−1 to K subgroups is larger than the relative change from K to K+1 subgroups. Therefore, the optimal number of subgroups is typically selected as the K associated with the largest MRPC value.
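As an illustration, assuming a vector W holding W(1), …, W(Kmax) collected from k-means runs (e.g., from kmeans(...)$tot.withinss), Equation 7 can be evaluated as follows.

```r
# MRPC is only defined for K = 2, ..., Kmax - 1
mrpc <- function(W) {
  Kmax <- length(W)
  out <- rep(NA_real_, Kmax)
  for (K in 2:(Kmax - 1)) {
    num <- (W[K - 1] - W[K]) / W[K - 1]  # relative change from K-1 to K
    den <- (W[K] - W[K + 1]) / W[K]      # relative change from K to K+1
    out[K] <- num / den
  }
  out
}
# which.max(mrpc(W)) then gives the K with the largest MRPC
```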

Calinski-Harabasz (CH) pseudo-F statistic. For k-means and k-medians, the CH pseudo-F statistic (Caliński and Harabasz, 1974) assesses the quality of clustering by comparing the between-cluster variance to the within-cluster variance. Specifically, for K subgroups, it is computed as:

$\mathrm{CH}(K) = \dfrac{B(K)/(K-1)}{W(K)/(N-K)}$,    (8)

where B(K) represents the between-cluster variance, computed as the weighted sum of the squared Euclidean distances between the cluster centers and the center of all individuals; a larger B(K) value suggests a more distinct separation of the K subgroups. Therefore, the number of subgroups K with the largest CH pseudo-F statistic, indicating K homogeneous and well-separated subgroups, is selected as the optimal number of subgroups.
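Equation 8 maps directly onto fields of a stats::kmeans fit, as the hypothetical helper below shows; fit is assumed to be a kmeans object and N the sample size.

```r
ch_index <- function(fit, N) {
  K <- nrow(fit$centers)
  # betweenss is B(K); tot.withinss is W(K)
  (fit$betweenss / (K - 1)) / (fit$tot.withinss / (N - K))
}
# e.g., fit <- kmeans(X, centers = 3, nstart = 100); ch_index(fit, nrow(X))
```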

Silhouette method. The Silhouette method (Rousseeuw, 1987) is another widely-applied way to evaluate the clustering quality for k-means and k-medians. Specifically, for each of the N individuals, the Silhouette width $s_i$, which measures how similar an individual is to its assigned subgroup compared to other subgroups, is calculated as:

$s_i = \dfrac{b_i - a_i}{\max(a_i, b_i)}$.    (9)

In Equation 9, $a_i$ represents the average dissimilarity of individual $i$ to all other individuals in its assigned subgroup,

$a_i = \dfrac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \ne i} d(i, j)$,    (10)

where $C_i$ denotes the collection of individuals assigned to the same subgroup as individual $i$, $|C_i|$ is the number of individuals in that subgroup, and $d(i, j)$ represents the distance between two individuals $i$ and $j$. On the other hand, $b_i$ is the minimum average dissimilarity of individual $i$ to the individuals in any other subgroup, and it is calculated as:

$b_i = \min_{k:\, i \notin C_k} \left\{ \dfrac{1}{|C_k|} \sum_{j \in C_k} d(i, j) \right\}$,    (11)

where $C_k$ represents the set of individuals assigned to subgroup $k$, and $|C_k|$ is the number of individuals in it.

A Silhouette score $s_i$ close to 1 (Kaufman and Rousseeuw, 1990; Rousseeuw, 1987) indicates that individual $i$ is well clustered, in the sense that it is closer to other individuals in the same subgroup than to those in other subgroups. In contrast, an $s_i$ value close to 0 suggests that the individual lies on or near the decision boundary between two neighboring subgroups. Additionally, an $s_i$ near −1 indicates that the individual may have been assigned to the wrong subgroup, since it is closer to individuals in other subgroups than to those in its own.

The average Silhouette score $\bar{s}$ reflects the overall quality of the clustering solution and is often used to determine the optimal number of subgroups. It is computed as the average of the individual $s_i$:

$\bar{s} = \dfrac{1}{N} \sum_{i=1}^{N} s_i$.    (12)

Common rules of thumb from empirical studies that examine the absolute fit of clustering models suggest that an $\bar{s}$ greater than 0.7 indicates good clustering quality (Kaufman and Rousseeuw, 1990). In this study, our primary concern is relative fit; therefore, the optimal number of subgroups is selected as the one that maximizes $\bar{s}$.
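A minimal sketch of this selection rule, assuming the cluster package and an N × V response matrix X; the candidate range 2 to 6 and the Euclidean distance are illustrative choices.

```r
library(cluster)

# X is assumed to be the N x V matrix of item responses
avg_sil <- sapply(2:6, function(K) {
  fit <- kmeans(X, centers = K, nstart = 100)
  sil <- silhouette(fit$cluster, dist(X))  # per-individual s_i (Equation 9)
  mean(sil[, "sil_width"])                 # average score (Equation 12)
})
(2:6)[which.max(avg_sil)]  # K maximizing the average Silhouette width
```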

2.3 Other methods for determining K

In addition to AIC and BIC for LCA, and the MRPC, CH pseudo-F statistic, and Silhouette method for k-means and k-medians, there are other methods for determining the optimal number of subgroups in clustering tasks.

Sample-size adjusted BIC. The sample-size adjusted BIC (SABIC; Sclove, 1987) is another popular information criterion and is used for choosing the number of subgroups when LCA is applied. SABIC is similar to BIC; however, the penalty on N is reduced,

$\mathrm{SABIC} = -2\ln(\hat{L}) + f \ln\left(\dfrac{N+2}{24}\right)$.    (13)
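poLCA does not report SABIC directly; assuming its fitted object exposes the maximized log-likelihood (llik), the number of free parameters (npar), and the sample size (N), Equation 13 can be computed as below.

```r
# fit is assumed to be a fitted poLCA object
sabic <- function(fit) {
  -2 * fit$llik + fit$npar * log((fit$N + 2) / 24)  # Equation 13
}
```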

Gap statistic. For k-means and k-medians, the Gap statistic (Tibshirani et al., 2002) is another option for determining the optimal number of subgroups. It compares the curve of log(W(K)) to a reference curve obtained from data uniformly distributed over a rectangle containing the observed data. The optimal number of subgroups is then chosen as the K associated with the largest difference between the two curves.
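A minimal sketch using cluster::clusGap, which implements this reference-curve comparison for k-means; B = 50 reference data sets and the standard-error selection rule maxSE are illustrative choices rather than the procedure used in this study.

```r
library(cluster)

# X is assumed to be the N x V response matrix
gap <- clusGap(X, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 100)
# maxSE applies a standard-error rule to the Gap curve;
# which.max(gap$Tab[, "gap"]) would instead pick the largest Gap value
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])
```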

3 Simulation study

To compare the performance of LCA, k-means and k-median approaches, along with their associated methods for determining the optimal number of subgroups, we conducted a comprehensive simulation. Specifically, we considered items with three response categories, which represent a wide range of items that are often seen in psychological assessments and educational survey questionnaires. Examples of three-category items include items that measure the frequencies of certain behaviors (0 = Never, 1 = Sometimes, and 2 = Usually) and items that inquire about the attitude toward certain statements (0 = Disagree, 1 = Neutral, and 2 = Agree).

3.1 Data generation

To simulate data, we followed the process used in previous simulation studies (e.g., Brusco, 2004; Brusco et al., 2017; Dimitriadou et al., 2002), and manipulated six factors, including (1) the number of individuals N, (2) the number of subgroups K, (3) the number of true variables V, (4) the number of masking variables P, (5) the error level in the subgroup structure ϵ, and (6) the densities of subgroups.

Motivated by previous studies, we generated item responses in the following three steps:

• First, we obtained values of the V true variables based on the perfect subgroup structure (shown in Tables 1–3), along with the number of individuals N and the cluster probabilities. For instance, in conditions with V = 10, K = 2, N = 250, and equal subgroup probabilities, 125 (250/2) individuals would be assigned the responses presented in the first row of Table 1, while the other 125 individuals had the responses in the second row.

• In Step 2, we added errors to the item responses from Step 1 by randomly selecting ϵ of the data points and changing their values. For instance, in conditions with ϵ = 5%, we would randomly select 5% × N × V item responses and change their values from the perfect subgroup structure. For selected responses with values of 1, we changed their values to either 0 or 2 with equal probability of 0.5.

• In the last step, we generated values of the P masking variables, when applicable. That is, additional N × P item responses were generated by randomly sampling from the three possible values, 0, 1, and 2. The values of the masking variables were then appended to the noise-corrupted data from Step 2. A condensed R sketch of these three steps is given after this list.
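The sketch below compresses the three steps for one hypothetical condition (K = 2, V = 10, N = 250, ϵ = 5%, P = 5, equal subgroup probabilities). The patterns matrix stands in for the structures in Tables 1–3 and is invented here, and the Step 2 rule for responses other than 1 (replacement by one of the other two categories at random) is one plausible reading of the text; the full script is in the Supplementary material.

```r
set.seed(2025)
N <- 250; V <- 10; K <- 2; eps <- 0.05; P <- 5

# Step 1: perfect structure -- one response pattern (row) per subgroup
patterns <- rbind(rep(c(0, 2), length.out = V),
                  rep(c(2, 0), length.out = V))  # hypothetical Table 1 rows
membership <- sample(rep(1:K, length.out = N))   # equal subgroup sizes
X <- patterns[membership, ]

# Step 2: corrupt eps of the N*V responses
idx <- sample(N * V, size = round(eps * N * V))
X[idx] <- sapply(X[idx], function(v) sample(setdiff(0:2, v), 1))

# Step 3: append P masking variables drawn uniformly from {0, 1, 2}
X <- cbind(X, matrix(sample(0:2, N * P, replace = TRUE), nrow = N))
```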

Table 1. Cluster structures for V = 10.

Table 2. Cluster structures for V = 20.

Table 3. Cluster structures for V = 30.

Sample R code for data generation is included in the online Supplementary material.

Number of individuals N. We considered three levels of the number of individuals, N = 250, 500, and 1,000. These numbers are generally viewed as moderately large/large sample sizes for LCA, k-means and k-medians clustering.

Number of subgroups K. While generating values of true variables, we considered three numbers of subgroups, K = 2, 3, and 4. The cluster numbers were chosen based on previous simulation studies (e.g., Brusco, 2004; Brusco et al., 2017; Dimitriadou et al., 2002).

Number of true variables V. In this simulation study, we considered three numbers of true variables (i.e., variables that define the subgroup structure), V = 10, 20, and 30. These numbers cover a wide range of lengths of psychological assessments and educational survey questionnaires.

Number of masking variables P. In practice, it is possible that not all variables in a measure define the subgroup structure. Therefore, in addition to conditions with no masking variables (i.e., P = 0), we considered two levels of the number of masking variables, P = 5 and 10.

Level of error ϵ. Following previous simulation studies, we considered three levels of errors in Step 2 of the data generation process, ϵ = 5%, 10%, and 20%.

Subgroup probabilities. We considered three types of cluster probabilities: one representing equal-probability conditions and two representing unequal-probability conditions. Specifically, in the equal probability conditions, all subgroups were of the same size, so the probability that an individual belonged to any one of the K subgroups was 1/K. In the first type of unequal probability conditions, the first subgroup had a probability of 0.6, and the other K−1 subgroups shared the remaining probability equally, (1 − 0.6)/(K−1) each. In the second type, the first subgroup had a probability of 0.1, and the other clusters each had a probability of (1 − 0.1)/(K−1).

In sum, these six manipulated factors led to a total of 3 × 3 × 3 × 3 × 3 × 3 = 729 simulation conditions. For each of the conditions, we simulated and analyzed 50 data sets in R (R Core Team, 2023).

3.2 Data analysis

We analyzed each simulated data set using the three approaches, LCA, k-means, and k-medians clustering, extracting 1 to 6 subgroups, and chose the optimal number of subgroups based on the different methods. Specifically, we performed LCA using the poLCA package (Linzer and Lewis, 2011) and extracted the AIC and BIC indices. For k-means, we used the R package cluster (Maechler et al., 2013) for clustering and computed the MRPC, CH pseudo-F statistic, and Silhouette statistic based on the outputs. For k-medians clustering, we used the Kmedians package (Godichon-Baggioni, 2022) for subgroup extraction and obtained the CH pseudo-F statistic and Silhouette statistic using the cluster package. For all three approaches and all numbers of subgroups, the number of random starting values/configurations was fixed at 100 and the maximum number of iterations was set to 1,000; defaults were used unless otherwise specified.

3.3 Evaluation criteria

In this study, we focused on the key indicator of subgroup partition recovery to evaluate the three clustering approaches and their associated methods for choosing the number of subgroups. Following previous simulation studies, for each simulation condition, we computed the Adjusted Rand Index (ARI; Hubert, 1974; Steinley, 2004) between the true subgroup memberships and the predicted partitions for each combination of clustering approach and subgroup-number selection method, using the pdfCluster package (Azzalini and Menardi, 2014). The ARI quantifies the agreement between two partitions; it takes a maximum value of 1, with larger values indicating a higher level of agreement and values near 0 indicating chance-level agreement. We then performed an analysis of variance (ANOVA) to investigate how the ARI was impacted by the clustering approaches and the manipulated factors. We considered the main effects of the six manipulated factors and all two-way interactions. To further describe the performance of these approaches, we also plotted the determined number of subgroups for each combination of clustering approach and subgroup-number selection method.
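For reference, a hypothetical sketch of this computation, with membership and X taken from the generation sketch in Section 3.1; mclust::adjustedRandIndex is shown as one widely available ARI implementation (pdfCluster::adj.rand.index is an equivalent alternative).

```r
library(mclust)

fit <- kmeans(X, centers = 2, nstart = 100)  # estimated partition
adjustedRandIndex(membership, fit$cluster)   # 1 indicates perfect recovery
```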

3.4 Results

As shown in Table 4, the ARIs of the seven combinations of clustering approaches and methods for selecting the optimal subgroup number (i.e., Methods) differed significantly (F = 375.6, df = 6, p < 0.01, partial η2 = 0.31). The post hoc Tukey's HSD test indicated that LCA outperformed the k-means and k-medians clustering, with BIC leading to a greater ARI than AIC. Additionally, k-means coupled with the MRPC performed significantly better than the other four combinations, namely k-means and k-medians coupled with the CH pseudo-F statistic and the Silhouette method.

Table 4. Analysis of variance (ANOVA) for ARI.

In addition, five of the six manipulated simulation factors showed significant main effects on the ARI: the number of clusters K (F = 1352.7, df = 2, p < 0.01, partial η2 = 0.35), the number of true variables V (F = 299.2, df = 2, p < 0.01, partial η2 = 0.11), the number of masking variables P (F = 99.8, df = 2, p < 0.01, partial η2 = 0.04), the error level E (F = 568.2, df = 2, p < 0.01, partial η2 = 0.18), and the cluster density D (F = 65.9, df = 2, p < 0.01, partial η2 = 0.03). Lastly, ten two-way interactions were significant; however, the effect sizes were all smaller than 0.04.

Figure 1 presents the aggregated ARI values by the three of the six manipulated factors with effect sizes greater than 0.1: the number of clusters K, the number of true variables V, and the error level E. The general trend was that the ARI decreased as K increased, V decreased, and E increased. For instance, as shown in Figure 1b, when V increased, the ARI of k-means coupled with MRPC improved from 0.7 to 0.93. As another example, as shown in Figure 1c, when E increased, the ARI of k-medians coupled with the CH pseudo-F decreased from 0.81 to 0.66. It is also worth noting that the LCA approach had strong subgroup recovery performance in the V = 10 and E = 0.2 conditions, among which the lowest ARI value was 0.81.

Figure 1. Aggregated ARI values by different factors. (a) By number of clusters K. (b) By number of true variables V. (c) By error level E.

Figure 2 shows boxplots of the determined numbers of subgroups by combinations of clustering approaches and associated methods, aggregated over all simulation factors other than K. As shown in Figure 2a, when K = 2, LCA coupled with BIC, as well as k-means and k-medians, tended to choose 2 as the optimal number of subgroups, while LCA coupled with AIC overestimated K. When the true number of subgroups was 3 or 4, LCA coupled with BIC was still able to select the correct subgroup number; however, k-means and k-medians coupled with MRPC and the CH pseudo-F tended to underestimate K and selected 2 as the number of subgroups, while LCA coupled with AIC favored more subgroups.

Figure 2. Boxplots of the numbers of subgroups selected based on different methods. (a) True K = 2. (b) True K = 3. (c) True K = 4.

4 Empirical illustration

To further compare the three approaches, LCA, k-means, and k-medians, along with the methods for selecting the number of subgroups, we analyzed an empirical data set from the Simons Foundation Powering Autism Research for Knowledge (SPARK; SPARK Consortium, 2018), a national research initiative funded by the Simons Foundation Autism Research Initiative (SFARI).

4.1 Data

We examined a parent-rated measurement instrument, the Vineland Adaptive Behavior Scales, Third Edition (Vineland-3; Sparrow et al., 2016), in the SPARK database. The Vineland-3 is used as a measure of adaptive skills and is often part of diagnostic evaluations for intellectual and developmental disabilities. Participant data were drawn from the SPARK data repository, which includes phenotype and genotype data for individuals with an autism spectrum diagnosis. Thus, participants included in the empirical illustration all had a diagnosis of an autism spectrum disorder (ASD) and were under the age of 18.

In this analysis, we focused on one of the 13 subdomains in the Vineland-3, the Domestic subdomain, which includes 30 items. Items in this subdomain measure a child's ability to perform household tasks (e.g., [The individual] is careful when using sharp objects, for example, scissors or knives). Parents rated the children's behaviors on a 3-point Likert-type scale (0 = Never, 1 = Sometimes, and 2 = Usually or often). A sample of N = 500 individuals was randomly selected from the complete observations for the analysis.

4.2 Analysis

We performed LCA, k-means, and k-medians clustering on the sample data in R (R Core Team, 2023). Specifically, for each of the three approaches, we ran a series of analyses with different numbers of subgroups, ranging from 1 to 10. Based on the outputs, we chose the optimal number of subgroups using the various methods: AIC and BIC for LCA, and the MRPC, CH pseudo-F, and Silhouette statistics for k-means and k-medians. We then computed the agreement between the partitions with the subgroup numbers determined by the different methods.

4.3 Results

Table 5 summarizes various indices for the three approaches with different numbers of subgroups. The first observation is that the number of subgroups selected varied across approaches. LCA coupled with AIC selected 9 subgroups, whereas LCA coupled with BIC selected 3. On the other hand, the numbers of subgroups determined by k-means and k-medians coupled with the MRPC, CH pseudo-F, and Silhouette statistics were all 2.

Table 5. Summary of empirical data analysis results.

Table 6 summarizes the ARIs between the subgroup partitions determined by the different methods. The agreement between LCA coupled with AIC and the other methods was low, with ARI values ranging from 0.15 to 0.24. The agreement between LCA coupled with BIC and the k-means and k-medians solutions was moderate. The ARI between k-means and k-medians was 0.98, suggesting a high level of agreement between the two partitions.

Table 6. ARI between partitions based on different methods.

5 Discussion

In psychology and education, assigning a large number of individuals to a handful of subgroups based on their responses to items in measurement scales has long been an important task to understand the population heterogeneity. To better choose between various model-based and non-model-based clustering approaches, it is vital for researchers to investigate the performance of these approaches, along with the methods for determining the optimal number of subgroups, under different scenarios.

In this study, we compared the performance of three widely-applied approaches, LCA, k-means, and k-medians, in clustering individuals based on polytomous item responses in the presence of masking variables. Previous studies that compared different clustering approaches have primarily focused on either binary (e.g., Brusco et al., 2017) or continuous variables (e.g., Magidson and Vermunt, 2002). Our study extended this line of work by focusing on polytomous variables, specifically items with three categories, which are common in psychological and educational measurement but less frequently researched. Recent work by Haslbeck et al. (2023) showed that when ordinal responses are treated as continuous, Gaussian mixture models (GMMs) can still recover the correct number of components if there are enough categories and variables; however, parameter estimates (particularly means and covariances) remain biased regardless of the sample size. This finding highlights that results obtained with continuous or binary indicators may not directly generalize to polytomous data. Moreover, while latent profile analysis (LPA; a GMM with a constrained covariance matrix) is typically applied to continuous indicators and LCA to categorical indicators, more research is needed to determine how consistently these approaches perform across items with different numbers of categories and study conditions.

Via the simulation, we found that LCA coupled with BIC generally provided the most accurate subgroup recovery. In contrast, k-means and k-medians were more sensitive to the simulation design factors, such as the true number of subgroups, the presence of masking variables, and noise in the true variables. For instance, in the K = 2 conditions, the average ARIs of k-means and k-medians with their associated methods for choosing the subgroup number were all above 0.9. When the number of true variables was relatively large (e.g., V = 30), k-medians coupled with the CH pseudo-F could still achieve an average ARI of 0.78. Our findings were not consistent with studies that examined the performance of clustering approaches for dichotomous data. For instance, Brusco et al. (2017) showed that LCA and k-medians were competitive for clustering dichotomous data, while our study suggests that LCA outperforms the other two approaches when items have three categories.

Additionally, we analyzed empirical data from a database on autistic children using the three approaches and chose the optimal number of subgroups based on the different methods. Similar to the simulation results, we found that the numbers of subgroups determined by k-means and k-medians were the same and that the ARI between their partitions was high. On the other hand, the number of subgroups and the partition determined by LCA coupled with BIC differed from those of LCA with AIC and the two non-model-based clustering approaches. Such inconsistency between the clustering solutions of different approaches was also observed in Brusco et al. (2017).

Based on the results of this study, we recommend that psychological and educational practitioners consider LCA for clustering tasks when the items in the measurement scales are polytomous. Additionally, in practice, while applying k-means and k-medians, researchers may encounter the issue of local optima, i.e., the tendency of the algorithm to converge to a local rather than the global optimum. To address this issue, we suggest adopting the widely known multiple restart strategy (e.g., Shireman et al., 2016), where the algorithm is fit multiple times with different starting values/configurations. Further, to better understand the heterogeneity in the population and interpret the subgroup structure, we suggest determining the optimal number of subgroups based on both statistical evidence and substantive theories. This is because the global optimum may be achieved at a large number of subgroups, which can be less useful with regard to interpretation.
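For instance, in stats::kmeans the multiple restart strategy amounts to a single argument; the values below simply mirror the settings used in our simulation and are not a prescription.

```r
# nstart reruns k-means from 100 random initial configurations and keeps
# the solution with the smallest total within-cluster sum of squares
fit <- kmeans(X, centers = 3, nstart = 100, iter.max = 1000)
```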

5.1 Limitations

As with other simulations, one of the limitations of the present study concerns the manipulated factors and the chosen levels of these factors. For instance, the sample sizes N we considered in this simulation (250, 500, and 1,000) are generally viewed as moderately large and large sample sizes in psychological research. Therefore, our results may not be generalizable to cases where the sample sizes are relatively small. We also designed only one subgroup structure for each combination of the number of true variables V and the number of subgroups K. In reality, the distances between subgroups tend to differ. In addition, we simulated responses to items with three categories, while many psychological assessments and educational surveys consist of items with four or more response categories. Therefore, for future research, we suggest further investigation of the impact of true subgroup structures and measurement scale properties on the performance of clustering approaches.

Second, while it maintains consistency with prior studies, the process we followed to generate data is prone to several issues. For instance, we introduced randomness in Steps 2 and 3 of the process, but duplicates may exist and result in numerical instability of the optimization algorithms. We advocate for more discussion in the psychological and educational measurement community on how to generate data that better reflect reality. In addition, when different approaches are to be compared via a simulation, the data generation process should be carefully designed so that the simulated data do not favor any of the approaches.

Last but not least, some other popular clustering approaches were not considered due to the scope of this study. These approaches include but are not limited to HCA, spectral clustering (e.g., Ng et al., 2002; Rohe et al., 2011), and autoencoders for clustering (e.g., Klingler et al., 2017; Zhang et al., 2022). In addition, to better align with existing studies (e.g., Brusco et al., 2017), we did not include all possible methods for determining the optimal number of subgroups for the three clustering approaches in our simulation and empirical data analysis, such as SABIC and the Gap statistic.

5.2 Future directions

To enhance the understanding of LCA, k-means and k-medians when applied to psychological and educational research, we suggest expanding the scope of the simulation to account for various item types, true subgroup structures, and sample sizes. Future studies can also include factors related to psychometric properties of measurement scales, such as their reliabilities and item properties, in the simulation.

Secondly, to provide practitioners with a more comprehensive picture, we recommend comparing the three approaches considered in this study with other approaches for clustering tasks and other methods for choosing the subgroup numbers. Moreover, characteristics of the approaches, such as available software programs and the computation time, should be discussed in addition to their performance in recovering the subgroup partition. We also suggest that practitioners compare the solutions of different approaches (e.g., numbers of subgroups and assignment of subgroup memberships) when completing clustering tasks and choose the more interpretable results.

Data availability statement

Publicly available datasets were analyzed in this study. These data can be found here: Simons Foundation Powering Autism Research for Knowledge (SPARK).

Author contributions

SH: Conceptualization, Formal analysis, Methodology, Visualization, Writing – original draft. PB: Formal analysis, Methodology, Visualization, Writing – original draft. AS: Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Acknowledgments

The authors acknowledge the individuals and their families who participated in the SPARK (Simons Foundation Powering Autism Research for Knowledge) study for their contribution to research. Data analysis was performed using resources from the SPARK consortium.

Conflict of interest

The author(s) declared that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2025.1645911/full#supplementary-material

References

Achenbach, T. M. (1999). "The Child Behavior Checklist and related instruments," in The Use of Psychological Testing for Treatment Planning and Outcomes Assessment, 2nd Edn., ed. M. E. Maruish (Lawrence Erlbaum Associates Publishers), 429–466.

Agasisti, T., Bowers, A. J., and Soncin, M. (2019). School principals' leadership types and student achievement in the Italian context: empirical results from a three-step latent class analysis. Educ. Manag. Admin. Leader. 47, 860–886. doi: 10.1177/1741143218768577

Akaike, H. (1973). "Information theory and an extension of the maximum likelihood principle," in Proceedings of the Second International Symposium on Information Theory, eds. B. N. Petrov and F. Csaki (Budapest: Akademiai Kiado), 267–281.

Althoff, R. R., Copeland, W. E., Stanger, C., Derks, E. M., Todd, R. D., Neuman, R. J., et al. (2006). The latent class structure of ADHD is stable across informants. Twin Res. Hum. Genet. 9, 507–522. doi: 10.1375/twin.9.4.507

Angela, U., and Alex, J. B. (2014). What are the different types of principals across the United States? A latent class analysis of principal perception of leadership. Educ. Admin. Quart. 50, 96–134. doi: 10.1177/0013161X13489019

Araka, E., Oboko, R., Maina, E., and Gitonga, R. (2022). Using educational data mining techniques to identify profiles in self-regulated learning: an empirical evaluation. Int. Rev. Res. Open Distr. Learn. 23, 131–162. doi: 10.19173/irrodl.v22i4.5401

Azzalini, A., and Menardi, G. (2014). Clustering via nonparametric density estimation: the R package pdfCluster. J. Stat. Softw. 57, 1–26. doi: 10.18637/jss.v057.i11

Barnard-Brak, L., Paton, V. O., and Lan, W. Y. (2010). Profiles in self-regulated learning in the online learning environment. Int. Rev. Res. Open Distr. Learn. 11, 61–80. doi: 10.19173/irrodl.v11i1.769

Basten, M. M., Althoff, R. R., Tiemeier, H., Jaddoe, V. W., Hofman, A., Hudziak, J. J., et al. (2013). The dysregulation profile in young children: empirically defined classes in the Generation R study. J. Am. Acad. Child Adoles. Psychiat. 52, 841–850.e2. doi: 10.1016/j.jaac.2013.05.007

Bishop, C. M., and Nasrabadi, N. M. (2006). Pattern Recognition and Machine Learning. Cham: Springer.

Broadbent, J., and Fuller-Tyszkiewicz, M. (2018). Profiles in self-regulated learning and their correlates for online and blended learning students. Educ. Technol. Res. Dev. 66, 1435–1455. doi: 10.1007/s11423-018-9595-9

Brusco, M. J. (2004). Clustering binary data in the presence of masking variables. Psychol. Methods 9, 510–523. doi: 10.1037/1082-989X.9.4.510

Brusco, M. J., and Cradit, J. D. (2001). A variable-selection heuristic for k-means clustering. Psychometrika 66, 249–270. doi: 10.1007/BF02294838

Brusco, M. J., Shireman, E., and Steinley, D. (2017). A comparison of latent class, k-means, and k-median methods for clustering dichotomous data. Psychol. Methods 22, 563–580. doi: 10.1037/met0000095

Caliński, T., and Harabasz, J. (1974). A dendrite method for cluster analysis. Commun. Statist. 3, 1–27. doi: 10.1080/03610917408548446

Carmone, F. J. J., Kara, A., and Maxwell, S. (1999). HINoV: a new model to improve market segment definition by identifying noisy variables. J. Market. Res. 36, 501–509. doi: 10.1177/002224379903600408

Dimitriadou, E., Dolničar, S., and Weingessel, A. (2002). An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67, 137–159. doi: 10.1007/BF02294713

Fowlkes, E. B., and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78, 553–569. doi: 10.1080/01621459.1983.10478008

Godichon-Baggioni, A. (2022). Kmedians: K-Medians. R package version 2.2.0. doi: 10.32614/CRAN.package.Kmedians

Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61, 215–231. doi: 10.1093/biomet/61.2.215

Haslbeck, J. M., Vermunt, J. K., and Waldorp, L. J. (2023). The impact of ordinal scales on Gaussian mixture recovery. Behav. Res. Methods 55, 2143–2156. doi: 10.3758/s13428-022-01883-8

Hubert, L. (1974). Problems of seriation using a subject by item response matrix. Psychol. Bull. 81, 976–983. doi: 10.1037/h0037348

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika 32, 241–254. doi: 10.1007/BF02289588

Kaufman, L., and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons. doi: 10.1002/9780470316801

Klingler, S., Wampfler, R., Käser, T., Solenthaler, B., and Gross, M. (2017). "Efficient feature embeddings for student classification with variational auto-encoders," in Proceedings of the International Conference on Educational Data Mining (EDM). International Educational Data Mining Society.

Lazarsfeld, P. F. (1950a). "The interpretation and mathematical foundation of latent structure analysis," in Measurement and Prediction (Princeton University Press), 413–472.

Lazarsfeld, P. F. (1950b). "The logical and mathematical foundation of latent structure analysis," in Measurement and Prediction (Princeton University Press), 362–412.

Linzer, D. A., and Lewis, J. B. (2011). poLCA: an R package for polytomous variable latent class analysis. J. Stat. Softw. 42, 1–29. doi: 10.18637/jss.v042.i10

MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (University of California Press), 281–297.

Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., and Hornik, K. (2013). cluster: Cluster Analysis Basics and Extensions. Available online at: https://svn.r-project.org/R-packages/trunk/cluster/

Magidson, J., and Vermunt, J. (2002). Latent class models for clustering: a comparison with k-means. Canad. J. Market. Res. 20, 36–43.

McCutcheon, A. L. (1987). Latent Class Analysis. London: Sage. doi: 10.4135/9781412984713

McLachlan, G., and Peel, D. (2000). Finite Mixture Models. London: Wiley-Interscience. doi: 10.1002/0471721182

Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002). "On spectral clustering: analysis and an algorithm," in Advances in Neural Information Processing Systems, 849–856.

Nozadi, S. S., Troller-Renfree, S., White, L. K., Frenkel, T., Degnan, K. A., Bar-Haim, Y., et al. (2016). The moderating role of attention biases in understanding the link between behavioral inhibition and anxiety. J. Exp. Psychopathol. 7, 451–465. doi: 10.5127/jep.052515

Papachristou, N., Miaskowski, C., Barnaghi, P., Maguire, R., Farajidavar, N., Cooper, B., et al. (2016). "Comparing machine learning clustering with latent class analysis on cancer symptoms' data," in 2016 IEEE Healthcare Innovation Point-Of-Care Technologies Conference (HI-POCT), 162–166. doi: 10.1109/HIC.2016.7797722

R Core Team (2023). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.

Rohe, K., Chatterjee, S., and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist. 39, 1878–1915. doi: 10.1214/11-AOS887

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65. doi: 10.1016/0377-0427(87)90125-7

Schreiber, J. B., and Pekarik, A. J. (2014). Using latent class analysis versus k-means or hierarchical clustering to understand museum visitors. Curator Mus. J. 57, 45–59. doi: 10.1111/cura.12050

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Stat. 6, 461–464. doi: 10.1214/aos/1176344136

Sclove, S. L. (1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika 52, 333–343. doi: 10.1007/BF02294360

Shireman, E. M., Steinley, D., and Brusco, M. J. (2016). Local optima in mixture modeling. Multivariate Behav. Res. 51, 466–481. doi: 10.1080/00273171.2016.1160359

SPARK Consortium (2018). SPARK: a US cohort of 50,000 families to accelerate autism research. Neuron 97, 488–493.

Sparrow, S. S., Cicchetti, D. V., and Saulnier, C. A. (2016). Vineland Adaptive Behavior Scales, Third Edition. Pearson.

Steinley, D. (2004). Properties of the Hubert-Arabie adjusted Rand index. Psychol. Methods 9, 386–396. doi: 10.1037/1082-989X.9.3.386

Tibshirani, R., Walther, G., and Hastie, T. (2002). Estimating the number of clusters in a data set via the Gap statistic. J. R. Stat. Soc. Series B 63, 411–423. doi: 10.1111/1467-9868.00293

Vermunt, J. K., and Magidson, J. (2004). "Latent class analysis," in The Sage Encyclopedia of Social Sciences Research Methods, eds. M. S. Lewis-Beck, A. Bryman, and T. F. Liao (Thousand Oaks, CA: Sage Publications), 549–553.

Wadsworth, M. E., Hudziak, J. J., Heath, A. C., and Achenbach, T. M. (2001). Latent class analysis of Child Behavior Checklist anxiety/depression in children and adolescents. J. Am. Acad. Child Adoles. Psychiat. 40, 106–114. doi: 10.1097/00004583-200101000-00023

Zhang, M., Du, X., Rice, K., Hung, J.-L., and Li, H. (2022). Revealing at-risk learning patterns and corresponding self-regulated strategies via LSTM encoder and time-series clustering. Inf. Disc. Deliv. 50, 206–216. doi: 10.1108/IDD-12-2020-0160

Keywords: latent class analysis, k-means, k-medians, clustering, polytomous data

Citation: Huang S, Botter P and Sturm A (2026) A comparison of three approaches for clustering polytomous data in the presence of masking variables. Front. Educ. 10:1645911. doi: 10.3389/feduc.2025.1645911

Received: 12 June 2025; Revised: 22 October 2025;
Accepted: 02 December 2025; Published: 12 January 2026.

Edited by:

Kuan-Yu Jin, Hong Kong Examinations and Assessment Authority, Hong Kong SAR, China

Reviewed by:

Max Hahn-Klimroth, Goethe University Frankfurt, Germany
Rajesh Kumar, Kansas State University Olathe, United States
Yi-Jhen Wu, TU Dortmund University, Germany

Copyright © 2026 Huang, Botter and Sturm. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Sijia Huang, sijhuang@iu.edu
