Adaptive Initialization Method for K-Means Algorithm

The K-means algorithm is a widely used clustering algorithm that offers simplicity and efficiency. However, the traditional K-means algorithm uses a random method to determine the initial cluster centers, which make clustering results prone to local optima and then result in worse clustering performance. In this research, we propose an adaptive initialization method for the K-means algorithm (AIMK) which can adapt to the various characteristics in different datasets and obtain better clustering performance with stable results. For larger or higher-dimensional datasets, we even leverage random sampling in AIMK (name as AIMK-RS) to reduce the time complexity. 22 real-world datasets were applied for performance comparisons. The experimental results show AIMK and AIMK-RS outperform the current initialization methods and several well-known clustering algorithms. Specifically, AIMK-RS can significantly reduce the time complexity to O (n). Moreover, we exploit AIMK to initialize K-medoids and spectral clustering, and better performance is also explored. The above results demonstrate superior performance and good scalability by AIMK or AIMK-RS. In the future, we would like to apply AIMK to more partition-based clustering algorithms to solve real-life practical problems.


Introduction
The clustering algorithm is a classic algorithm in the field of data mining.It is used in virtually all natural and social sciences and has played a central role in various fields such as biology, astronomy, psychology, medicine, and chemistry (Shah and Koltun, 2017).For example, in the commercial field, Horng-Jinh Chang et al. proposed an anticipation model of potential customers' purchasing behavior based on clustering analysis (Chang et al., 2007).In the biology field, clustering is of central importance for the analysis of genetic data, as it is used to identify putative cell types (Kiselev et al., 2019).In addition, the applications of the clustering algorithm also include image segmentation, object or character recognition (Dorai and Jain, 1995), (Connell and Jain, 1998) and data reduction (Huang, 1997;Jiang et al., 2014).The clustering algorithm mainly includes hierarchy-based algorithms, partition-based algorithms, density-based algorithms, model-based algorithms and grid-based algorithms (Saxena et al., 2017).
The K-means algorithm is widely used because of its simplicity and efficiency (MacQueen, 1967).It is a classic partition-based clustering algorithm.However, the traditional K-means algorithm uses the random method to determine the initial cluster centers, which make clustering results prone to local optima and then result in worse clustering performance.To overcome this disadvantage, many improved methods have been proposed.However, providing an optimal partition is an N-P hard problem under a specific metric (Redmond and Heneghan, 2007).
Forgy randomly selected K points from the data as the initial cluster centers without a theoretical basis, and the final clustering results more easily fell into a local optimum (FORGY, 1965).Jancey's method assigned a randomly generated synthetic point from the data space to each initial clustering center (Jancey, 1966).However, some of these centers may be quite distant from any of the points, which might lead to the formation of empty clusters.Mac Queen proposed using the first K points in the dataset as the initial centers.The disadvantage of this approach is that the method is extremely sensitive to data order (MacQueen, 1967).In addition, the above methods do not take into account the characteristics of data distribution, using randomly generated points or synthetic points as the initial cluster centers, resulting in poor and unstable clustering results (Yang et al., 2017).Selecting clustering centers is actually selecting the representative points for specific classes.The density of data points can be used to measure the representativeness of points.Redmond et al. estimated the density distribution of the dataset by constructing a Kd-tree (Redmond and Heneghan, 2007), but its density calculation method was unreasonable (Wang et al., 2009).Geng Zhang et al. proposed an initialization method based on density Canopy with complexity O(n^2) (Zhang et al., 2018).In addition, Cao et al. used the neighborhood-based rough set mode to measure the representativeness of the points to generate the initial cluster centers, but the method was sensitive to parameters (Cao et al., 2009).Khan et al. calculated the representative points from the dimensions of the data points based on the principle of data compression (Khan and Ahmad, 2004).The overall effect of this method is good, but its complexity is positively related to the dimensionality of the data and is not applicable to high-dimensional data.Based on the minimum spanning tree (MST), Jie et al.
selected representative points, which are also called skeleton points, from the datasets and then regarded some skeleton points that are far away from each other as the final initial cluster centers (Yang et al., 2016).
However, the complexity of this method is quadratic.In addition to the density of the data points, the distance between the data points is also regarded as one of the criteria for selecting the initial cluster centers.Gonzalez proposed a maximin method; the idea is to select the data points, which are as far as possible from each other, as the initial cluster centers, to make the cluster centers more evenly dispersed in each class (Gonzalez, 1985).However, this method has strong randomness, resulting in unstable clustering results.David Arthur et al. proposed K-means++ (Arthur and Vassilvitskii, 2007), which has disadvantages similar to the maximin method.For example, K-means++ will result in unstable clustering results because of the randomly selected first cluster center, or it may generate no representative initial cluster centers.To obtain better clustering results, some methods consider both the representativeness of data points and the distance between data points.A. Rodriguez et al. proposed a new clustering algorithm based on density peaks and proposed a method to generate cluster centers based on both density and distance (Rodriguez and Laio, 2014).Although this method is not leveraged to initialize the K-means algorithm, it has an extremely important enlightening significance.Numerous improved initialization methods have not yet been widely applied.Moreover, all the above mentioned methods cannot dynamically adapt to datasets with various characteristics (Yang et al., 2017).
Jie et al. proposed a K-means initialization method based on hybrid distance, which has been proven to dynamically adapt to datasets with various characteristics (Yang et al., 2017).The method considers both the density and the distance and uses a parameter to adjust the proportion of the two factors.They also proposed an internal clustering validation index, named the clustering validation index based on the neighbors (CVN), to select the optimal clustering results.However, this method also has shortcomings, such as (a) when calculating density, the threshold cannot be uniquely determined, resulting in unstable results.(b) Heavily depending on adjusting the parameter, the parameter must be adjusted five times to obtain better clustering results.(c) In some cases, the CVN index values calculated using different parameter settings are equal.At this time, CVN cannot be used to select better clustering results.(d) The time complexity of the algorithm is O(n^2), which is difficult to apply to large datasets.
In this paper, we proposed an adaptive initialization method for the K-means algorithm (AIMK), which not only adapts to datasets with various characteristics but also requires only two runs to obtain better clustering results.In addition, we proposed the AIMK-RS based on random sampling to reduce the time complexity of the AIMK to O(n).AIMK-RS is easily applied to large and high-dimensional datasets.
First, we proposed a new threshold to calculate the density of the data points based on MST.In addition, after using the new threshold, we found that we only need to adjust the parameter twice, and we can obtain better clustering results.Finally, we applied random sampling to AIMK to obtain the AIMK-RS, whose time complexity is only O(n).This paper is organized as follows.In the Proposed Algorithm section, the new initialization method for K-means based on skeleton points is presented.In the Experiments and Results section, the simulation and the results are presented and discussed.Finally, in the Conclusion section, the relevant conclusions are drawn.

Adaptive initialization method
In this section, we described the algorithm for selecting the initial cluster centers in detail.First, several concepts involving this algorithm are presented.

Skeleton point
In a previous study, Jie et al. proposed a new compressed representation, named skeleton points, from the original datasets based on an MST and regarded them as candidates for cluster centers (Yang et al., 2016).In contrast, we leveraged the skeleton points to determine the threshold for calculating the density of data points because they can reflect the characteristics of the datasets to some extent.In the beginning, we introduce how to construct an MST using the original dataset.

2
. Each data point   in dataset X corresponds to a vertex   ∈  in graph G, and there is a one-to-one correspondence between data point   (i = 1, 2..., n) and vertex   (i = 1, 2..., n).The number of vertices in graph G is equal to the number of data points   in dataset X. Edge weights between any two vertices are distance between the corresponding two data points.
The MST of G can be generated by using the Prim algorithm (Prim, 1957), which can be described as performing the following steps: Step 1: Pick any vertex   from graph G to be the root of the tree.
Step 2: Grow the tree by one edge: of the edges that connect the tree to vertices not yet in the tree, find the minimum-weight edge from G and transfer it to the tree.
Step 3: Repeat Step 2 (until the tree contains all vertices in graph G).
Through the above steps, we obtained an MST from the original datasets, and then we introduced how to obtain skeleton points from the MST.First, we introduce a concept, named the number of adjacent data points, and then we use this concept to obtain skeleton points.
Definition 2.1 (Number of adjacent data points) Let   be the set of vertices of T with degree i and   be the complementary set of   , that is,   = \  .For   , the number of adjacent data points, denoted as   , is the number of vertex in   being adjacent to vertex in   .
Note that only add 1 to   under the circumstance of one vertex in   being adjacent to more than one vertex in   .
Lemma 2.1 If anyone vertex in   is adjacent to one and only one vertex in  1 , then The proof of this lemma is obvious.Now, we introduce how to leverage the number of adjacent data points   to obtain the skeleton points.
Definition 2.2 (Skeleton Point) Suppose the maximum degree of T be m; then, V =  1 ∪  2 ∪ … ∪   .Let F = arg max    .The skeleton points, denoted as S, are the vertices of T with the degree being greater than or equal to F. Therefore, S =   ∪  +1 ∪ … ∪   .
We generated a synthetic dataset, then constructed the MST and calculated the skeleton points according to the Definition 2.1-2.3, which are enclosed by the triangles, as shown in Fig. 1.Next, we introduced the threshold for calculating the density of data points.

Threshold
Definition 2.3 (Threshold) In  = (,   ), suppose the number of skeleton points S is s; if the maximum weights of adjacent edges of each skeleton point can be denoted as { 1 ,  2 , … ,   }, then we defined a threshold as In an MST, the adjacent edge weights of vertices can reflect the distribution characteristics of the area where the vertices are located.While vertices contain a large number of unimportant points or outliers, we only focus on the skeleton points.In summary, when calculating the threshold, we only consider the adjacent edge weights of the skeleton points, and the mean value of the maximum weights of adjacent edges of each skeleton point is taken as the threshold.

Density of vertices
In the following section, we introduce how to calculate the density of data points using the threshold Thr.We first constructed a Thr-Connected Graph.
Definition 2.4 (Thr-Connected Graph) In dataset X, if d(  ,   ) ≤ ℎ, then add an edge between data points   and   ; this is called a Thr-Connected Graph (TCG), where d(  ,   ) is the distance between data point   and   .Each data point   in dataset X corresponds to a vertex   ∈  in graph TCG.
Definition 2.5 (density of   ) In TCG, the mean distance between the vertex   and the vertices adjacent to vertex   , denoted as   , is where k is the number of vertices   . Suppose To make <1, we added infinite decimal  to its denominator, where

Hybrid distance
If the distance among the initial cluster centers is small, it is easy to make the K-means algorithm fall into a local optimum.However, if only the distance factor is considered, it is possible to use the outlier as the initial cluster center.Jie.et al. proposed a new distance, a hybrid distance, to solve this problem (Yang et al., 2017).Hybrid distance considers the distance and density of the cluster centers at the same time so that the selected cluster centers are far away from each other and have a higher density.
where λ is a hyperparameter, normally set by 0 or 1; this is explained in detail in the following section.

Algorithm for determining the initial cluster centers
Now we present the algorithm for determining the initial cluster centers based on the above-defined concepts.The details are as follows: Step 1: Input: the dataset X and the number of clusters NC.
Step 2: Calculate distances d(  ,   ) between all pairs of data points.Represent dataset X as the undirected complete weighted graph  = (, ) and construct the MST using the Prim algorithm, then calculate the threshold Thr by Definition 2.1-2.3.
Step 3: Construct the TCG by Definition 2.4, calculate the density of every vertex   by Definition 2.5 and the sum of densities   +   between all pairs of vertices.
Step 4: Calculate the hybrid distance H(  ,   ) between all pairs of vertices according to Definition 2.6.

Datasets
We evaluated AIMK on 16 normal and six large and high-dimensional real-world datasets from the

Validation Indices
To evaluate the performance of clustering algorithms, we exploit three widely used external clustering where n denotes the number of objects.NC is the number of clusters.Pi is the number of objects that are correctly assigned.TP means true positive, FP means false positive, FN means false negative, and TN means true negative (Powers, 2011).

Sensitivity of λ
To analyze the sensitivity of the parameter λ, we ran AIMK on 16 datasets: Breast-cancer, Shuttle, Pendigits, Colon-cancer, Zoo, Haberman, Svmguide2, Wine, Ionosphere, Leukemia, Gisette, Splice, Svmguide4, Liver-disorders, Soybean-small, and Balance-scale while λ is set as 0, 0.25, 0.5, 0.75, 1.Then, we used ACC, RI, and F-measure to evaluate the performance of AIMK on each dataset.The results are listed in Tables 2-4.The optimal results for the corresponding index are denoted in bold.As the results show, when λ is set as 0 or 1, the optimal results in each validation index can be always obtained in each dataset.The HD algorithm is required to run five times to obtain a better clustering result (Yang et al., 2017), but AIMK can obtain a better result with only two runs.Therefore, in subsequent experiments, we only consider the results of AIMK when λ equals 0 or 1.

Impact of threshold Thr
To explain more clearly how to use skeleton points to determine the threshold Thr, we perform experiments on four representative datasets: Pendigits, Shuttle, Wine, and Gisette, whose data size and dimensions are from small to large and low to high, respectively.We run AIMK when Thr is set as the mean value of the maximum weights, mean weights, and minimum weights of adjacent edges of each skeleton point.Meanwhile, λ is set as 0 or 1, and the final results are shown as both sides of the slash "/'', respectively.We used ACC to evaluate the results of each run.The results are shown in Table 5.For each dataset, the optimal results can be obtained only when Thr is set as the maximum weights of adjacent edges of each skeleton point.Therefore, it is more reasonable to set Thr as the maximum weights of adjacent edges of each skeleton point.

Comparison with baselines
We compared AIMK (λ is set as 0 or 1) with 10 baselines on 16 normal real-world datasets.ACC, RI, and F-measure are exploited to evaluate the performance of each baseline on each dataset.The results are listed in Tables 6-8.The optimal results for the corresponding dataset are denoted in bold.We use the average rank to measure the final performance of each baseline across datasets.The rank means the rank number of each row sorted in descending order.If there are the same results from two different algorithms, their ranks are equal.
According to Tables 6-8, AIMK (set λ as 0 or 1) achieves the best performance on 13, 11, and 8 of the 14 datasets when measured by ACC, RI, and F-measure, respectively.Moreover, it can be seen from the ranks that AIMK is obviously superior to the other 10 baselines, no matter which validation index we use.
Furthermore, according to

Reducing complexity by sampling
Due to the time complexity ( 2 ), it is difficult to apply AIMK to large and high-dimensional datasets.To solve this problem, we consider random sampling to extract √ samples from the original datasets, where  means the number of samples of the dataset, and then use these samples as the input for AIMK.It is worth mentioning that to make the √ samples fully express the characteristics of the original datasets, we recommend using random sampling to reduce complexity only when the number of clusters K ≪ .In this way, the time complexity of AIMK will be reduced to ().AIMK after random sampling, denoted as AIMK-RS, is compared with two widely used initialization methods, K-means and K-means++, whose time complexity is also (), on six large and high-dimensional datasets.ACC, RI, and F-measure are also exploited to evaluate the results.In addition, we take the average performance of 100 runs as the real performance of the AIMK-RS because it provides for more even sampling and can fully express the characteristics of the original datasets.The optimal results for the corresponding datasets are denoted in bold.The results are listed in Tables 9-11, and we can conclude that after random sampling, compared with two baselines, AIMK still achieves better performance.

Choice of λ
After the above experiments, we can see that the parameter λ is crucial for the final clustering results.
To further illustrate the impact of parameter λ, we generated two types of datasets with different distributions from a mixture of three bivariate Gaussian densities.where Gaussian(, ) is a Gaussian normal distribution with the mean X and the covariance matrix Y.
We stimulate three clusters, namely, class 1, class 2 and class 3, which are represented by different shapes: circle (20 points), cross (20 points), and triangle (20 points), respectively.As shown in Fig. 2(a)-Fig.2(d), we used AIMK to determine the initial cluster centers marked with star when λ is set as 0 and 1.In Fig. 2(a)-Fig.2(b), when λ is equal to 0, the three cluster centers happen to be the centroid of three classes.
When λ is equal to 1, only one cluster center is the centroid of class 1, and the other two cluster centers are just outliers in class 2 and class 3, respectively.In Fig. 2(c)-Fig.2(d), when λ is equal to 0, two cluster centers are dropped in class 1, one cluster center is dropped in class 2, and no cluster center is dropped in class 3.However, when λ is equal to 1, three cluster centers happen to be dropped in three classes, and two of the three are outliers.
According to formula (4), when λ is equal to 0, only the top K points with higher density are selected as initial cluster centers.At this time, if all or most of these K initial cluster centers fall in K different classes, as shown in Fig. 2(a), then the initialization effect is better.However, for some datasets, such as overlapping datasets, shown as Fig. 2(c), the top K points with higher density cannot be distributed relatively evenly among K classes.Therefore, at this time, we need to consider the distance factor.
According to formula (4), when λ is equal to 1, we only select the K points that are far apart from each other as initial cluster centers.At this time, all or most of these K initial cluster centers are more likely to be relatively evenly distributed among the K classes, as shown in Fig. 2(d).

Algorithm analysis
According to the 2.8 Algorithm for determining the initial cluster centers, the time complexity of AIMK was analyzed as follows.In Step 2, the time complexity of computation of the distance between all pairs of vertices and the Prim algorithm is 2 () On , and the time complexity of calculation of the threshold Thr is ().Construction of the TCG and calculation of the density of every vertex   requires () in Step 3, and the computation of the sum of densities   +   between all pairs of vertices requires ().In Step 4, because the distance and the sum of densities between all pairs of vertices have been calculated in Step 2 and Step 3, the time complexity of the calculation of the hybrid distance H(  ,   ) between all pairs of vertices is ().Determination of the first and second initial cluster centers requires () in Step 5 and Step 6.In Steps 7 and 8, the remaining initial cluster centers are selected, in which the time complexity is less than  *  and approximately equal to ( * ).Because normally the number of clusters NC ≪ , the entire time complexity of AIMK is O( 2 ).However, according to the 3.7 Reducing complexity by sampling part, the time complexity of AIMK can be reduced to () after random sampling, denoted as AIMK-RS.The time complexities of all baselines, AIMK and AIMK-RS are listed in Table 12.

Conclusion
In this paper, we proposed an adaptive initialization method for the K-means algorithm, which not only adapts to datasets with various characteristics but also requires only two runs to obtain better clustering results.In addition, we proposed the AIMK-RS based on random sampling to reduce the time complexity of the AIMK to O(n).AIMK-RS is easily applied to large and high-dimensional datasets.First, we proposed a new threshold to calculate the density of the data points based on the skeleton points of the MST.In addition, after using the new threshold, we found that we only need to adjust the parameter twice, and we can obtain better clustering results.Finally, we applied random sampling to AIMK to obtain the AIMK-RS, whose time complexity is only O(n).
In the future, we will use AIMK or AIMK-RS to initialize other variants of the K-means algorithm, such as the K-medoids algorithm, K-modes algorithm, fuzzy C-means algorithm, etc.In addition, the threshold Thr proposed in this paper can be used to help density-based clustering algorithms, such as DBSCAN (Ester et al., 1996), OPTICS (Ankerst et al., 1999), and SFDP, calculate the density of points without any extra adjusting parameters.Furthermore, we will continue to explore the new method to estimate the characteristics of datasets to further determine the parameter λ specifically.

Fig. 1 .
Fig. 1.We generated a synthetic dataset, then constructed the MST and calculated the skeleton points according to Definitions 2.1-2.3;they are enclosed by the triangles.As shown, the skeleton points are a type of compressed representation based on the characteristic of the dataset.
validation indices: Accuracy (ACC), Rand Index (RI), and F-measure.These indices are defined as follows:

Table 5 .
The impact of threshold Thr on clustering performance.

Table 6 ,
AIMK achieves the highest ACC rank compared with the other 10 baselines.The rank of AIMK 1.125 is much higher than the rank of HD 4.188, which achieves the second-highest ACC rank.FCM achieves the lowest ACC rank, at just 7.438.HD is the best-performing initialization method for K-means in addition to AIMK in Table7, whose rank is 4.188.According to

Table 7 ,
AIMK achieves the highest RI rank compared with the other 10 baselines.The rank of AIMK 1.312 is much higher than the rank of HD 4.312, which achieves the second-highest RI rank.SH achieves the lowest RI, at just 7.562.HD is still the best-performing initialization method for K-means in addition to AIMK in Table8, whose rank is 4.312.According to Table8, AIMK still achieves the highest F-measure rank compared with the other 10 baselines.The rank of AIMK 1.500 is higher than the rank of SH 2.875, which achieves the second-highest F-measure rank.FCM achieves the lowest F-measure, which is just 9.062.MSTI is the best-performing initialization method for K-means in addition to AIMK in Table9, whose rank is 5.125.

Table 6 .
Results of all algorithms on 16 real-world datasets, measured by ACC.

Table 7 .
Results of all algorithms on 16 real-world datasets, measured by RI.

Table 8 .
Results of all algorithms on 16 real-world datasets, measured by F-measure.

Table 9 .
Large and high-dimensional datasets, measured by ACC.

Table 10 .
Large and high-dimensional datasets, measured by RI.

Table 11 .
Large and high-dimensional datasets, measured by F-measure.

Table 12 .
Comparison of time complexity of different clustering algorithms.