Semi-supervised spectral clustering with application to detect population stratification.

In genetic association studies, unaccounted population stratification can cause spurious associations in the discovery of disease-associated genetic markers. In such situations, prior information is often available on the population identities of some subjects. To leverage this additional information, we propose a semi-supervised clustering approach for detecting population stratification. The approach retains the advantages of spectral clustering while integrating the additional identity information, leading to sharper clustering performance. To demonstrate the utility of our approach, we analyze a whole-genome sequencing dataset from the 1000 Genomes Project, consisting of the genotypes of 607 individuals sampled from three continental groups involving 10 subpopulations. The proposed method is compared with an existing semi-supervised spectral clustering method and with an unsupervised spectral clustering method, using the Rand index and an adjusted Rand (ARand) index against the known subpopulation information. The numerical results suggest that the proposed method outperforms its competitors in detecting population stratification.


INTRODUCTION
With the rapid advance of high-throughput technologies, genome-wide association studies (GWAS) and whole-exome or whole-genome sequencing studies have become popular (International HapMap Consortium, 2003). However, in a population-based association study, the presence of undetected population stratification, also referred to as population structure, is a potential issue leading to false discoveries (Marchini et al., 2004). Population stratification occurs in the presence of a systematic difference in allele frequencies between cases and controls due to different ancestries. One direct consequence of ignoring population stratification is an inflated number of false positives and false negatives (Lander and Schork, 1994; Hirschhorn and Daly, 2005; Thomas et al., 2005).
Clustering has been an effective means to detect and describe known or cryptic population stratification (Paschou et al., 2010). For detecting or adjusting for population stratification, three major classes of methods have been proposed: genomic control (Devlin and Roeder, 1999; Devlin et al., 2004); structured association mapping and other clustering methods (Pritchard et al., 2000; Satten et al., 2001); and principal component analysis (PCA; Patterson et al., 2006; Zhang et al., 2012) and spectral methods (Lee et al., 2009; Zhang et al., 2009). As argued in Lee et al. (2009), different methods may be applicable in different situations; for instance, a combination of PCA and a clustering method may be preferable as a preprocessing step in association studies. Despite this progress, issues remain. One important issue is how to utilize additional prior information to enhance clustering performance when adjusting for population stratification. When some subjects' population identities are known a priori, a semi-supervised approach is more suitable. Towards this end, we propose two methods to detect population stratification, that is, semi-supervised clustering methods that are integrated with PCA and another clustering method, respectively. These methods are developed to (1) integrate the prior information into clustering, (2) avoid collapsing dense clusters into a single group while dividing their sparser counterparts into multiple clusters, and (3) utilize the prior information to separate highly overlapping subpopulations.
For (1), we incorporate prior information through constraints, as in Grira et al. (2004). The constraints are expressed in terms of pairwise must-links and cannot-links imposed over a subset of the subjects with known population identities, where a must-link connects two subjects from the same subpopulation, whereas a cannot-link connects two subjects from different subpopulations.
For (2), we develop our semi-supervised clustering method based on a local scale spectral clustering method (Zelnik-Manor and Perona, 2004). In some situations, subpopulations may not be on the same scale; we therefore consider a spectral clustering method involving a local scaling parameter to guard against the potentially disruptive influence of the different densities of the different subpopulations.
For (3), we introduce a continuous parameter to adjust similarities between subjects with cannot-links and those without any cannot-link, in addition to adjusting similarities between must-link pairs and cannot-link pairs. As indicated by the numerical results in Section 3.1, many pairs of subjects with cannot-links in two different subpopulations were assigned to one cluster by an existing semi-supervised method, in contrast to the proposed semi-supervised spectral clustering method.
The paper is organized as follows. Section 2 gives a motivating data example and introduces the proposed methods. Section 3 presents our analysis of a low-coverage whole-genome sequencing dataset published on the 1000 Genomes Project website. This is followed by a discussion in Section 4.

DATA
In this study, we used a low-coverage whole-genome sequencing dataset to evaluate the performance of our semi-supervised spectral clustering algorithm. The processed data were downloaded from the 1000 Genomes Project (1000 Genomes Project Consortium 2010) web site http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G-2010-08.html. The phased data contain the DNA sequences of n = 607 individuals of three continental groups: Africans (AFR), Europeans (EUR) and Asians (ASN); there are 3, 4, and 3 subgroups in the three continental groups respectively (Table 1) after we removed three subgroups (2 PUR and 1 MXL) from the downloaded data due to their small sample sizes.
We used all the p = 7,459,664 SNVs appearing in all the three continental groups on chromosomes 1 to 22. Among the 7,459,664 SNVs, there are 343,782 rare variants (RVs, with minor allele frequencies, MAFs < 1%), 1,189,061 low frequency variants (LFVs, 1% ≤ MAFs < 5%) and 5,926,821 common variants (CVs, MAFs ≥ 5%). There are 132,742, 525,440 and 1,107,080 monomorphic variants in the three continental groups AFR, EUR and ASN, respectively, and there are 18,559 variants that are monomorphic in all the three continental groups. Furthermore, there are 101,279 variants that are monomorphic in AFR but polymorphic in EUR, and 67,661 variants that are monomorphic in AFR but polymorphic in ASN; there are 493,977 variants that are monomorphic in EUR but polymorphic in AFR, and 133,388 variants that are monomorphic in EUR but polymorphic in ASN; there are 1,041,999 variants that are monomorphic in ASN but polymorphic in AFR, and 715,028 variants that are monomorphic in ASN but polymorphic in EUR.
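As a small illustration of the frequency classes used above, the following sketch assigns a single SNV to the RV, LFV, or CV class from its genotype column and checks that the three reported counts add up to p. The function name `maf_category` is hypothetical and is not taken from the authors' software.

```python
def maf_category(genotypes):
    """Classify one SNV by its minor allele frequency (MAF).

    genotypes: allele counts in {0, 1, 2}, one entry per individual.
    Thresholds follow the text: RV < 1%, 1% <= LFV < 5%, CV >= 5%.
    """
    genotypes = list(genotypes)
    freq = sum(genotypes) / (2 * len(genotypes))  # frequency of the counted allele
    maf = min(freq, 1 - freq)                      # fold onto the minor allele
    if maf < 0.01:
        return "RV"
    if maf < 0.05:
        return "LFV"
    return "CV"


# Consistency check on the counts reported in the text.
reported = {"RV": 343_782, "LFV": 1_189_061, "CV": 5_926_821}
total = sum(reported.values())  # should equal p = 7,459,664
```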
Denote the data by an n × p matrix Z, with rows indexing the n individuals and columns indexing the p SNVs. For each SNV, we chose the minor allele as the reference allele. Let $Z_{ij} \in \{0, 1, 2\}$ be the number of minor alleles for SNV j of individual i. We centered each column (SNV) to have mean 0; denote the centered data matrix by $Z_c = AZ$, where $A = I - \frac{1}{n}\mathbf{1}\mathbf{1}^t$ is an n × n centering matrix, I denotes the n × n identity matrix, and $\mathbf{1}$ denotes the length-n vector with each entry equal to 1. Then, we used PCA for dimension reduction (Menozzi et al., 1978; Cavalli-Sforza et al., 1994): we computed the n × n sample covariance matrix $H = Z_c Z_c^t$, and then used the re-scaled eigenvectors of H as coordinates for subject i, $x_i = (\sqrt{\lambda_1}\,u_1(i), \ldots, \sqrt{\lambda_J}\,u_J(i))$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_J \ge 0$ are the J largest eigenvalues of H and $u_j = (u_j(1), \ldots, u_j(n))^t$, $j \in \{1, \ldots, J\}$, are the corresponding eigenvectors. Typically, eigenvectors that correspond to large eigenvalues reveal important ancestry axes.
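The dimension-reduction step can be sketched in a few lines of NumPy. This is only a minimal illustration under the definitions above (column centering, $H = Z_c Z_c^t$, coordinates scaled by the square roots of the eigenvalues); the function name `pca_coordinates` is hypothetical, and for genome-scale p one would in practice accumulate H in blocks of SNVs rather than hold Z in memory.

```python
import numpy as np


def pca_coordinates(Z, J):
    """Return an n x J array whose i-th row is x_i.

    Z: n x p genotype matrix with entries in {0, 1, 2}.
    J: number of principal components to keep.
    """
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean(axis=0)               # same as A @ Z with A = I - (1/n) 1 1^t
    H = Zc @ Zc.T                         # n x n matrix H = Zc Zc^t
    eigvals, eigvecs = np.linalg.eigh(H)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:J]  # indices of the J largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative round-off
    return eigvecs[:, order] * np.sqrt(lam)   # column j is sqrt(lambda_j) * u_j
```

Working with the n × n matrix H rather than the p × p covariance keeps the eigenproblem small when n is in the hundreds and p is in the millions, as is the case here.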

SEMI-SUPERVISED SPECTRAL CLUSTERING
Existing semi-supervised clustering methods fall into two categories: search-based and similarity-based. The former modifies the clustering method itself, so that the prior constraints are used to yield appropriate partitions (Demiriz et al., 1999; Wagstaff et al., 2001; Basu et al., 2002). The latter is a clustering method based on a modified similarity metric (Bilenko and Mooney, 2003; Xing et al., 2003; Yang et al., 2008). We think that the latter may be more efficient, since it embeds the prior constraints simply by modifying the similarity metric, while the former may need to use the prior constraints to yield appropriate partitions in each iteration.
With this in mind, in this paper we developed a semi-supervised spectral clustering method to infer population structure. Before presenting this method in detail, we first review some spectral clustering algorithms, which were developed from studies of weighted graph partitioning problems (Shi and Malik, 2000; Meila and Shi, 2001; Ng et al., 2001; Kannan et al., 2004). Spectral clustering algorithms are similarity-based. A popular choice for defining the similarity $W_{ij}$ between a pair of subjects (i, j) is a Gaussian kernel of the Euclidean distance $\|x_i - x_j\|$, where the scale parameter $\sigma_{ij}$ controls the size of local neighborhoods in the weighted graph. Although a global scale $\sigma_{ij} = \sigma$ is often used, as mentioned in Zelnik-Manor and Perona (2004), using a local scale parameter, $\sigma_{ij} = (\sigma_i \sigma_j)^{1/2}$ with $\sigma_i, \sigma_j > 0$, for each pair (i, j) may give better performance, especially when the clusters of the data have different volumes. Below we review the local scale spectral clustering algorithm proposed by Zelnik-Manor and Perona (2004).
Frontiers in Genetics | Statistical Genetics and Methodology
October 2013 | Volume 4 | Article 215

Given a set of n points $\{x_1, \ldots, x_n\}$ in the J-dimensional Euclidean space $R^J$ and the neighborhood parameter T, cluster them into K clusters as follows:
[...]
3. Define D to be a diagonal matrix with $D_{ii} = \sum_{j=1}^{n} W_{ij}$ and construct the normalized Laplacian matrix $L = I - D^{-1/2} W D^{-1/2}$.
4. Find $u_1, \ldots, u_K$, the eigenvectors of L corresponding to its K smallest eigenvalues, and let U be the matrix containing the vectors $u_1, \ldots, u_K$ as columns.
5. For i = 1, ..., n and j = 1, ..., K, let [...]
6. [...] let $y_i \in R^K$ be the vector corresponding to the ith row of U, and then cluster the points $\{y_1, \ldots, y_n\}$ with the K-means algorithm into K clusters $\{S_1, \ldots, S_K\}$.
7. Assign the original point $x_i$ to cluster $S_j$ if and only if $y_i$ is assigned to cluster $S_j$.
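The algorithm can be sketched as follows. Because the first steps and the row-wise definition before the K-means step are not fully legible in this copy, the sketch fills them with labeled assumptions: the local scale $\sigma_i$ is taken as the distance from $x_i$ to its T-th nearest neighbor and the weights as $W_{ij} = \exp(-\|x_i - x_j\|^2 / (\sigma_i \sigma_j))$ with $W_{ii} = 0$, as in Zelnik-Manor and Perona (2004), and the rows of U are rescaled to unit length, as in Ng et al. (2001). All function names are hypothetical, and the small K-means routine is a plain Lloyd iteration with a deterministic start, not the authors' implementation.

```python
import numpy as np


def local_scale_affinity(X, T):
    """ASSUMED first steps: sigma_i = distance to the T-th nearest neighbour,
    W_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j)), W_ii = 0.
    Duplicate points can make sigma_i = 0; they are not handled here."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sigma = np.sort(d, axis=1)[:, T]  # column 0 is the point itself
    W = np.exp(-d ** 2 / np.outer(sigma, sigma))
    np.fill_diagonal(W, 0.0)
    return W


def spectral_embedding(W, K):
    """Normalized Laplacian, eigenvectors of the K smallest eigenvalues,
    then (ASSUMED) rescaling of each row of U to unit length."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    U = vecs[:, :K]
    return U / np.linalg.norm(U, axis=1, keepdims=True)


def kmeans(Y, K, n_iter=100):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [Y[0]]
    for _ in range(1, K):
        dist = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(
            np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2), axis=1
        )
        new_centers = np.array([
            Y[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def local_scale_spectral_clustering(X, K, T):
    """Cluster the rows of X into K groups; x_i inherits the label of y_i."""
    X = np.asarray(X, dtype=float)
    Y = spectral_embedding(local_scale_affinity(X, T), K)
    return kmeans(Y, K)
```

The local scale is what lets a tight cluster and a diffuse cluster coexist: each pairwise distance is judged relative to the neighborhood sizes of the two points involved rather than against one global bandwidth.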
Let M denote the must-link matrix and C denote the cannot-link matrix for clustering n points $\{x_1, \ldots, x_n\}$: $M_{ij} = 1$ ($C_{ij} = 1$) means that $x_i$ and $x_j$ are already known to be in the same (or different) cluster(s); $M_{ij} = 0$ ($C_{ij} = 0$) means that we do not know whether $x_i$ and $x_j$ are in the same cluster. Given the must-link matrix M and the cannot-link matrix C, Yang et al. (2008) proposed a semi-supervised algorithm by modifying the second step of the local scale spectral clustering algorithm above as follows: the similarity of each must-link pair is set to $W_{ij} = 1$ and that of each cannot-link pair is set to $W_{ij} = 0$. By letting $W_{ij} = 1$ for a must-link pair (i, j), the algorithm forces the pair (i, j) to be clustered into the same cluster. However, in general, letting $W_{ij} \equiv 0$ for a cannot-link pair (i, j) may not force the pair (i, j) to be clustered into two different clusters. $W_{ij} \equiv 0$ only means that observations i and j are far away; if there exists an observation k such that $W_{ik}$ and $W_{kj}$ are large enough, i and j may still be clustered into one cluster. Thus, for embedding cannot-link information into a spectral clustering algorithm, setting the weights of all cannot-link pairs to zero is not enough. To avoid clustering a cannot-link pair (i, j) into one cluster, we adjust $W_{ik}$ and $W_{kj}$ for each k without any cannot-link information, based on which we propose a new semi-supervised spectral clustering algorithm.

Before introducing the algorithm, to make best use of the semi-supervised information, we may first adjust the must-link matrix and the cannot-link matrix as follows:
(1) Adjust the must-link matrix M such that, for each pair (i, j) ($i \ne j$), $M_{ij} = 1$ whenever there exists a $k \ne i, j$ such that $M_{ik} = 1$ and $M_{kj} = 1$;
(2) Adjust the cannot-link matrix C such that, for each pair (i, j) ($i \ne j$), $C_{ij} = 1$ whenever there exists a $k \ne i, j$ such that $M_{ik} = 1$ and $C_{kj} = 1$.
After the adjustment, if there exists any contradictory pair (i, j) ($i \ne j$) with $C_{ij} = 1$ and $M_{ij} = 1$, to avoid being misled we let $C_{ij} = 0$ and $M_{ij} = 0$.
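The adjustment of the two constraint matrices amounts to a closure computation, which the following sketch illustrates. It assumes symmetric 0/1 matrices, repeats each rule until nothing changes (so that chains of links are fully propagated), and reads a contradictory pair as one with both a must-link and a cannot-link between the same two subjects. The function name `adjust_constraints` is hypothetical.

```python
import numpy as np


def adjust_constraints(M, C):
    """Propagate must-links and cannot-links, then drop contradictory pairs.

    M, C: symmetric n x n 0/1 matrices (must-link and cannot-link).
    Returns the adjusted (M, C) as integer arrays.
    """
    M = np.asarray(M, dtype=bool)
    C = np.asarray(C, dtype=bool)
    off_diag = ~np.eye(len(M), dtype=bool)
    M, C = M & off_diag, C & off_diag

    # Rule (1): M_ij = 1 whenever M_ik = 1 and M_kj = 1 for some k.
    while True:
        Mi = M.astype(int)
        new_M = (M | ((Mi @ Mi) > 0)) & off_diag
        if np.array_equal(new_M, M):
            break
        M = new_M

    # Rule (2): C_ij = 1 whenever M_ik = 1 and C_kj = 1 for some k
    # (applied symmetrically, since both matrices are symmetric).
    Mi = M.astype(int)
    while True:
        MC = (Mi @ C.astype(int)) > 0
        new_C = (C | MC | MC.T) & off_diag
        if np.array_equal(new_C, C):
            break
        C = new_C

    # Contradictory pairs carry both labels; discard both to avoid being misled.
    conflict = M & C
    return (M & ~conflict).astype(int), (C & ~conflict).astype(int)
```

For example, with must-links 0-1 and 1-2 and a cannot-link 2-3, the adjusted matrices also contain the must-link 0-2 and the cannot-links 0-3 and 1-3.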
In fact, though there has been much reported success with using pairwise constraints for clustering, there are two limitations (Davidson and Ravi, 2005; Davidson et al., 2006). First, if the constraints are poorly specified, using cannot-link constraints may make the feasibility problem intractable (Davidson and Ravi, 2005); second, some constraints may have adverse effects on semi-supervised clustering (Davidson et al., 2006). There have been some discussions about how to deal with these limitations, and accordingly some methods were specifically designed to overcome them. Because these limitations are not the focus of this paper, we will not introduce these methods in detail.
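Let $V_M = \bigcup_{M_{ij} = 1} \{i, j\}$ and $V_C = \bigcup_{C_{ij} = 1} \{i, j\}$, that is, the sets of subjects involved in at least one must-link and in at least one cannot-link, respectively. Now we are ready to show our semi-supervised spectral clustering (SSSC) algorithm.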
Algorithm SSSC. Given a set of n points $D = \{x_1, \ldots, x_n\}$ in $R^J$, a must-link matrix M, a cannot-link matrix C, the parameters $\alpha \in \{0, 1\}$ and $\beta \ge 1$, and the neighborhood parameter T, cluster the points into K clusters as follows:
[...]
3. Define D to be a diagonal matrix with $D_{ii} = \sum_{j} W_{ij}$ and construct the normalized Laplacian matrix $L = I - D^{-1/2} W D^{-1/2}$.
4. Find $u_1, \ldots, u_K$, the first K eigenvectors of L, and let U be the matrix containing the vectors $u_1, \ldots, u_K$ as columns.
[...]
6. [...] let $y_i \in R^K$ be the vector corresponding to the ith row of U, and then cluster the points $\{y_1, \ldots, y_n\}$ with the K-means algorithm into K clusters $\{S_1, \ldots, S_K\}$.
7. Assign the original point $x_i$ to cluster $S_j$ if and only if $y_i$ is assigned to cluster $S_j$.
Note that in Step 2.c of our new algorithm above, we reason that for each $k \in V \setminus V_C$ and each cannot-link pair (i, j), if $x_k$ is nearer to $x_i$, then it should be much farther away from $x_j$, because the distance between $x_i$ and $x_j$ has already been set to the maximum. Thus, we penalize the similarity between $x_k$ and $x_j$ by replacing $W_{jk} = W_{kj}$ with $W_{jk}/\beta$ ($\beta > 1$). On the other hand, the parameter α determines whether we force the similarities between a sample $k \in V \setminus V_M$ and the two members of a must-link pair (i, j) to be the same. In fact, if α = 0 and β = 1, then our algorithm reduces to that of Yang et al. (2008).
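The cannot-link penalty described in this paragraph can be illustrated with a short sketch. It is not the authors' Step 2, whose exact statement is not reproduced here; it only follows the verbal description: must-link weights are set to 1, cannot-link weights to 0, and for every subject k outside $V_C$ and every cannot-link pair (i, j), the weight between k and the farther member of the pair is divided by β. The function name `adjust_weights`, the tie-breaking rule, and the choice to let penalties from several pairs accumulate are assumptions, and the α adjustment is deliberately left out.

```python
import numpy as np


def adjust_weights(W, X, M, C, beta):
    """Hypothetical sketch of the beta penalty on a similarity matrix W.

    W: n x n similarity matrix; X: n x J coordinates;
    M, C: symmetric 0/1 must-link / cannot-link matrices; beta >= 1.
    """
    W = np.array(W, dtype=float)  # work on a copy
    X = np.asarray(X, dtype=float)
    M = np.asarray(M, dtype=bool)
    C = np.asarray(C, dtype=bool)

    W[M] = 1.0  # must-link pairs: maximal similarity
    W[C] = 0.0  # cannot-link pairs: zero similarity

    free = np.flatnonzero(~C.any(axis=1))  # k in V \ V_C
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    for i, j in zip(*np.nonzero(np.triu(C))):  # visit each cannot-link pair once
        for k in free:
            # ASSUMPTION: ties are resolved by penalizing the link to i.
            far = j if dist[k, i] < dist[k, j] else i
            W[far, k] = W[k, far] = W[far, k] / beta
    return W
```

With β = 1 the loop leaves W unchanged, so only the must-link and cannot-link entries are modified, which is consistent with the reduction to the method of Yang et al. (2008) noted above.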

CHOOSING THE PARAMETERS
We develop a cross-validation procedure to choose the parameters for Algorithm SSSC, modified from a criterion used in Tibshirani and Walther (2005) for K-means clustering. In addition, we borrow the idea of the cluster reproducibility index (RI) (Shen et al., 2009) to define a new prediction strength. We summarize the procedure as follows. Suppose we are given a data set D and a candidate set of parameters $\Theta = \mathcal{K} \times \mathcal{A} \times \mathcal{B} \times \mathcal{T}$, where $\mathcal{K}$ and $\mathcal{T}$ are sets of positive integers, $\mathcal{A} = \{0, 1\}$, and $\mathcal{B}$ is a set of real numbers equal to or larger than 1. Randomly permute the sample index set V = [N] of D, and then partition the permuted sample index set into two roughly equal parts. Select one part as the test index subset $V_{te}$ for the test data $D_{te} = \{X_n : n \in V_{te}\}$ and take the remaining part as the training index subset $V_{tr}$ for the training data $D_{tr} = \{X_n : n \in V_{tr}\}$.
For each candidate $(K, \alpha, \beta, T) \in \Theta$, apply Algorithm SSSC to divide $D_{tr}$ into K clusters with parameters α, β, T, the must-link matrix $M_{tr}$ and the cannot-link matrix $C_{tr}$; likewise, apply Algorithm SSSC to divide $D_{te}$ into K clusters with parameters α, β, T, the must-link matrix $M_{te}$ and the cannot-link matrix $C_{te}$. Let $l_{tr}$ and $l_{te}$ denote the corresponding clustering assignments. Divide the test data $D_{te}$ into K clusters under the guidance of $l_{tr}$, that is, assign each sample in $D_{te}$ to the closest cluster of $D_{tr}$ characterized by $l_{tr}$ in the sense of the Euclidean distance, and let $l_{te|tr}$ denote the corresponding clustering assignment. Here the distance between a sample and a cluster is defined as the minimum distance between this sample and each sample in the cluster. Next, compute the adjusted Rand index (Hubert and Arabie, 1985) between $l_{te|tr}$ and $l_{te}$ as the prediction strength. Repeat the above steps a number of times with different randomly permuted samples, and finally choose $\hat{\theta} = (\hat{K}, \hat{\alpha}, \hat{\beta}, \hat{T}) \in \Theta$ with the highest average prediction strength.
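The two ingredients of the prediction strength, the nearest-cluster assignment and the adjusted Rand index, can be sketched in plain Python as follows. Since the distance from a sample to a cluster is the minimum distance to its members, assigning a test sample to the closest training cluster is the same as giving it the label of its nearest training sample. The function names are hypothetical; the adjusted Rand index follows the standard Hubert and Arabie (1985) formula and assumes at least two samples.

```python
from collections import Counter
from math import comb, dist


def assign_to_nearest_cluster(test_points, train_points, train_labels):
    """Compute l_te|tr: each test point takes the label of the training
    cluster whose nearest member is closest to it."""
    assigned = []
    for x in test_points:
        nearest = min(range(len(train_points)),
                      key=lambda t: dist(x, train_points[t]))
        assigned.append(train_labels[nearest])
    return assigned


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. two trivial partitions
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def prediction_strength(test_points, l_te, train_points, l_tr):
    """Agreement between the guided labels l_te|tr and the direct labels l_te."""
    l_te_given_tr = assign_to_nearest_cluster(test_points, train_points, l_tr)
    return adjusted_rand_index(l_te_given_tr, l_te)
```

In the full procedure this quantity would be averaged over repeated random splits for every candidate parameter combination, and the combination with the highest average would be selected.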
Note that while using PCA for dimension reduction in Section 2.1, we did not mention how to choose an appropriate number of PCs. There are many studies of this problem for traditional PCA, such as Jackson (1991), Jolliffe (2002) and Pedro et al. (2005). Because in this paper we focus only on the performance of a clustering algorithm, we propose using a special procedure that is related to the clustering performance. In fact, we view the number of PCs as a parameter and then decide it in the above cross-validation procedure. Specifically, we first choose $\hat{\theta}_J \in \Theta$ using the above cross-validation procedure for each J in a set $\mathcal{J}$ of candidate numbers of PCs, and then choose the $\hat{J} \in \mathcal{J}$ with the highest average prediction strength as the best fitted number of PCs.

MAIN RESULTS
We used all the SNVs appearing in all the three continental groups on chromosomes 1-22 to extract the top t principal components (PCs). As shown in the left panel of Figure 1, the top 2 PCs could completely separate the three continental groups. However, some subgroups could not be completely separated. We used the local scale spectral clustering algorithm introduced in the Methods section to cluster the 607 t-dimensional vectors into 10 clusters. As shown in Table 2 and Figure 1, subgroup GBR ('4') cannot be completely distinguished from CEU ('6'); CHS ('8') cannot be completely distinguished from CHB ('9').
The spectral clustering algorithm used above is an unsupervised clustering algorithm that does not use any additional clustering information. However, in many cases, partial knowledge is available in the form of pairwise (must-link or cannot-link) constraints among a subset of subjects. Thus, we propose a semi-supervised local scale spectral clustering algorithm to make use of the pre-known constraints. We show the performance of our algorithm by varying the number of available must-link or cannot-link constraints. We let SSR denote the semi-supervised ratio, and randomly selected a fraction SSR of individuals from each subgroup. Then we obtained a must-link matrix and a cannot-link matrix according to the selected individuals and their subgroup identities, which were input to our semi-supervised algorithm. We used the algorithm in Yang et al. (2008) and our newly proposed algorithm to cluster the 607 individuals into 10 clusters with the top 10, 20, and 30 PCs respectively, and then compared the Rand index (Rand, 1971) and an adjusted Rand (ARand) index (Hubert and Arabie, 1985) between the true subgroups and the clustering results. We repeated this process 100 times, each time randomly selecting the individuals used to build the pre-known must-link and cannot-link matrices by setting a different seed in the R software. The average results of these 100 simulations are reported in Figure 2. From Figure 2, we can see that when using the top 10, 20 and 30 PCs for clustering, our algorithm performed much better than the existing one with (α = 0, β = 1) (Yang et al., 2008) in terms of the Rand index and adjusted Rand index for almost all the values of SSR. It is clear that the blue vertical lines for our new algorithm appeared at smaller SSR values, indicating that our algorithm made use of the given semi-supervised information more efficiently. In fact, when using other numbers of top PCs, we also obtained similar results (not shown).
Additionally, in our new algorithm we used $\alpha = \hat{\alpha}$ and $\beta = \hat{\beta}$.
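The construction of the constraint matrices from a semi-supervised ratio can be sketched as below. The authors worked in R; this Python sketch is only illustrative, and the function name `sample_constraints`, the use of `round` to turn SSR into a per-subgroup sample size, and the seed handling are assumptions.

```python
import random


def sample_constraints(labels, ssr, seed=0):
    """Draw a fraction `ssr` of individuals from each subgroup and build
    the must-link matrix M and the cannot-link matrix C among them.

    labels: true subgroup label of every individual.
    Returns (M, C, known), where `known` lists the sampled indices.
    """
    rng = random.Random(seed)
    n = len(labels)

    by_group = {}
    for idx, group in enumerate(labels):
        by_group.setdefault(group, []).append(idx)

    known = []
    for members in by_group.values():
        size = round(ssr * len(members))  # ASSUMPTION: rounding rule
        known.extend(rng.sample(members, size))

    M = [[0] * n for _ in range(n)]
    C = [[0] * n for _ in range(n)]
    for a in known:
        for b in known:
            if a == b:
                continue
            if labels[a] == labels[b]:
                M[a][b] = 1  # same subgroup: must-link
            else:
                C[a][b] = 1  # different subgroups: cannot-link
    return M, C, sorted(known)
```

Individuals that were not sampled have all-zero rows in both matrices, which encodes that nothing is known about them.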
Tables 3 and 4 present the numbers of subjects assigned to each of the 10 clusters based on the top 10 PCs using the existing SSSC algorithm (α = 0, β = 1) (Yang et al., 2008) and our Algorithm SSSC with SSR = 0.5. We can see that our new algorithm performed much better than the existing one.
To further illustrate the difference among the unsupervised local scale spectral clustering algorithm, the existing SSSC algorithm (α = 0, β = 1) (Yang et al., 2008) and our new SSSC algorithm ($\alpha = \hat{\alpha}$, $\beta = \hat{\beta}$), we plotted the first two co-ordinates (of the $y_i$'s in Step 6) for each of the three algorithms (see Figure 3). To better observe the separation between the two subgroups CHS ('8') and CHB ('9'), we plotted these two subgroups separately, keeping the subjects in CHB red but changing those in CHS to black (see the right three sub-figures of Figure 3). The top two sub-figures are for the unsupervised local scale spectral clustering algorithm, the middle two are for the existing SSSC algorithm (α = 0, β = 1) and the bottom two are for our SSSC algorithm ($\alpha = \hat{\alpha}$, $\beta = \hat{\beta}$). From the first two sub-figures of Figure 3 and Table 2, we see that the two pairs of subgroups, GBR-CEU and CHS-CHB, were each inseparable. In the middle two sub-figures of Figure 3 and Table 3, after adjusting the similarities between must-link pairs to be 1 and those between cannot-link pairs to be 0, the subjects in GBR and CEU were a little more separable; however, the subjects in CHS and CHB were still inseparable. Finally, in the last two sub-figures of Figure 3 and Table 4, we can see that the subjects in all the subgroups were more separable, and in particular, in the bottom right sub-figure the subjects in CHS and CHB were more separable.

FIGURE 3 | The top two sub-figures are for the unsupervised local scale spectral clustering algorithm, the middle two are for the SSSC algorithm (α = 0, β = 1) and the bottom two are for our SSSC algorithm ($\alpha = \hat{\alpha}$, $\beta = \hat{\beta}$). Note that in the right three sub-figures, the colors of the subjects in CHS ('8') were changed from red to black. Note that the y(1) and y(2) axes are just the first two co-ordinates of the data points $\{y_1, \ldots, y_n\}$ in Step 6 of Algorithm SSSC.

SEMI-SUPERVISED CLUSTERING VERSUS CLASSIFICATION
We also did some numerical experiments to compare classification (supervised learning) with our semi-supervised clustering. For illustration, and to have an easier problem for classification, we only took the individuals in the EUR continental group and used common variants (CVs) with minor allele frequencies (MAFs) greater than 5% on chromosome 1. Furthermore, we used PLINK (Purcell et al., 2007) to prune out correlated SNVs with a sliding window of size 50 (shifted by 5) and a threshold of r² ≤ 0.05, after which we had 11,840 CVs. First, we randomly chose a fraction SSR of individuals from each of the above four subgroups as semi-supervised information for our clustering algorithm and as the training data for a classification algorithm. We used penalized multinomial logistic regression with the Lasso or the Ridge penalty for classification; the penalization parameter was chosen by 5-fold cross-validation. We used the trained classifier to predict the subgroup labels for the remaining data, and combined the known labels in the training data with the predicted labels in the test data to compare with the true labels in terms of the Rand indices and adjusted Rand indices. The top three sub-figures in Figure 4 summarize the corresponding results based on 100 simulations; in our experiments, our semi-supervised clustering algorithm performed much better than both Lasso- and Ridge-penalized regression, especially for cases with low SSRs.
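The evaluation step for the classifiers, combining known training labels with predicted test labels and scoring the result against the truth, can be sketched as follows. Only the plain Rand index (Rand, 1971) is shown; the function names and the toy labels are hypothetical, and the classifier itself is not reproduced.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of sample pairs on which two labelings agree, i.e. pairs that
    are together in both or apart in both. Requires at least two samples."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(a)), 2):
        agree += (a[i] == a[j]) == (b[i] == b[j])
        total += 1
    return agree / total


def combine_labels(true_labels, train_idx, predicted):
    """Known labels for training individuals, predicted labels for the rest.

    predicted: dict mapping each non-training index to its predicted label.
    """
    train = set(train_idx)
    return [true_labels[i] if i in train else predicted[i]
            for i in range(len(true_labels))]


# Toy example: two training individuals, two predicted ones (one of them wrong).
truth = ["CEU", "CEU", "GBR", "GBR"]
combined = combine_labels(truth, train_idx=[0, 2], predicted={1: "CEU", 3: "CEU"})
score = rand_index(truth, combined)
```

Because the training labels are copied into the combined labeling, the score can only be lowered by errors on the individuals whose labels were actually predicted.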
On the other hand, in some cases the given semi-supervised information may not involve all the four subgroups. For example, suppose we only had information about a subset of subjects from the CEU and GBR subgroups, but none from the other subgroups. As before, we randomly selected a fraction (SSR) of individuals from the CEU and GBR subgroups respectively; they were used as semi-supervised information for our algorithm and as training data for a classification algorithm. The bottom three sub-figures in Figure 4 show the corresponding comparisons, indicating that our semi-supervised clustering algorithm performed overwhelmingly better than Lasso- and Ridge-penalized regression, because the classification algorithm assigned all the individuals of the unknown TSI or FIN subgroup to either the CEU or the GBR subgroup. This illustrates an obvious advantage of a semi-supervised clustering approach for the discovery of novel classes.

DIMENSION REDUCTION
In Section 3, we demonstrated good performance of our algorithm with a few top PCs. In addition, we obtained similar results (not shown) with other methods for dimension reduction, such as the spectral graph approach used in SpectralR and SpectralGEM (Lee et al., 2010). We used the spectral method of Lee et al. (2010) for dimension reduction and then clustered the data into 10 clusters; we followed the same procedure as for Figure 2 to compare the Rand index and adjusted Rand index values.

ALL OR A SUBSET OF SNVs?
In the previous section, we used all 7,459,664 SNVs appearing in all the three continental groups on chromosomes 1 to 22 without pruning out SNVs in linkage disequilibrium. Here, we used common variants (CVs) with minor allele frequencies (MAFs) greater than 5% on chromosomes 1 to 22, and used PLINK (Purcell et al., 2007) to prune out correlated SNVs with a sliding window of size 50 (shifted by 5) and a threshold of r² ≤ 0.5, after which we had 1,022,090 CVs. Next, we used PCA for dimension reduction and then used the three algorithms to analyze the resulting data. Tables 5-7 present the numbers of subjects assigned to each of the 10 clusters based on the top 10 PCs using the unsupervised spectral clustering algorithm, the existing semi-supervised spectral clustering algorithm and our new algorithm with SSR = 0.5. From these results and those in Tables 2-4, we see that using all the SNVs was better than using the pruned data in terms of the performance of the unsupervised spectral clustering. For the two semi-supervised spectral clustering algorithms, we find that with the pruned data the new semi-supervised spectral clustering algorithm still performed better than the existing one (Yang et al., 2008), as in Section 3.1 with all the SNVs.

FIGURE 4 | Rand indices (black) and adjusted Rand indices (red) between the true subgroups and the SSSC or classification results. A vertical blue line indicates the point of SSR at which the Rand and adjusted Rand indices are both bigger than 0.90 for the first time. The top three sub-figures compare Lasso, Ridge and SSSC for the cases in which the semi-supervised information involves all the four subgroups, while the bottom three are for the cases in which the semi-supervised information involves two subgroups.

Table 7 | The numbers of subjects assigned to each of the 10 clusters based on the top 10 PCs using our new SSSC algorithm ($\alpha = \hat{\alpha}$, $\beta = \hat{\beta}$) with SSR = 0.5.

LOCAL SCALE SPECTRAL CLUSTERING
Our semi-supervised spectral clustering algorithm is based on local scale spectral clustering (Zelnik-Manor and Perona, 2004), because we believe that local scales work better than a single global scale for all pairs of subjects. In some situations the subgroups might not have the same scale; in our experience, given a fixed number of clusters, the subjects in a sparser group are more likely to be divided into several clusters, and the individuals in a denser group are more likely to be merged together. In these cases, it is difficult to choose a suitable single global scale. In contrast, using local scales automatically adjusts for the heterogeneous scales in the subgroups. We did some experiments to compare the spectral clustering algorithms with a global scale and with local scales. We used several candidate values for a global scale, and found that even the best clustering result (in terms of the Rand indices and adjusted Rand indices) was almost the same as that obtained by using local scales. Because it is not the main point of this study, we do not show the detailed comparisons here.

CONCLUSIONS
We have proposed a new semi-supervised spectral clustering algorithm based on a more efficient use of the cannot-link constraints in prior data. A whole-genome sequencing dataset from the 1000 Genomes Project was analyzed to compare the performance of our algorithm with that of other algorithms. In our experiments, unsupervised clustering algorithms could not completely separate some subgroups, such as the CEU-GBR and CHB-CHS subgroups; our semi-supervised spectral clustering algorithm, along with a subset of individuals with known subgroup identities, distinguished these subgroups much better. Our proposed method may be useful in genetic association studies. Its extensions to other clustering (Thalamuthu et al., 2006) and dimension reduction approaches are to be studied.