
On Lloyd’s k-means Method

Sariel Har-Peled
Bardia Sadri

June 30, 2004

The most updated version of this paper is available from the author’s web page: http://www.uiuc.edu/~sariel/papers/03/lloyd_kmeans
Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801, USA; sariel@cs.uiuc.edu; http://www.uiuc.edu/~sariel/. Work on this paper was partially supported by an NSF CAREER award CCR-0132901.
Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL 61801, USA; http://www.uiuc.edu/~sadri/; sadri@cs.uiuc.edu.


Abstract

We present polynomial upper and lower bounds on the number of iterations performed by Lloyd’s method for k-means clustering. Our upper bounds are polynomial in the number of points, the number of clusters, and the spread of the point set. We also present a lower bound, showing that in the worst case the k-means heuristic needs to perform Ω(n) iterations, for n points on the real line and two centers. Surprisingly, the spread of our construction is polynomial. This is the first construction showing that the k-means heuristic requires more than a polylogarithmic number of iterations. Furthermore, we present two alternative algorithms, with guaranteed performance, which are simple variants of Lloyd’s method. Results of our experimental studies on these algorithms are also presented.

1 Introduction

In a (geometric) clustering problem, we are given a finite set $X \subseteq \mathbb{R}^d$ of $n$ points and an integer $k \ge 2$, and we seek a partition (clustering) $S = (S_1, \ldots, S_k)$ of $X$ into $k$ disjoint nonempty subsets along with a set $C = \{c_1, \ldots, c_k\}$ of $k$ corresponding centers, that minimizes a suitable cost function among all such $k$-clusterings of $X$. The cost function typically represents how tightly each cluster is packed and how separated different clusters are. A center $c_i$ serves the points in its cluster $S_i$.

We consider the k-means clustering cost function $\varphi(S, C) = \sum_{i=1}^{k} \psi(S_i, c_i)$, where $\psi(S, c) = \sum_{x \in S} \|x - c\|^2$, in which $\|\cdot\|$ denotes the Euclidean norm. It can be easily observed that for any cluster $S_i$, the point $c$ that minimizes the sum $\sum_{x \in S_i} \|x - c\|^2$ is the centroid of $S_i$, denoted by $c(S_i)$, and therefore in an optimal clustering, $c_i = c(S_i)$. Thus the above cost function can be written as $\varphi(S) = \sum_{i=1}^{k} \sum_{x \in S_i} \|x - c(S_i)\|^2$.
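The following standard expansion (stated here for completeness; it is the identity quoted later as Lemma 3.1) shows why the centroid is the minimizer:

$$\psi(S, z) = \sum_{x \in S} \|x - z\|^2 = \sum_{x \in S} \|x - c(S)\|^2 + 2\Big\langle c(S) - z,\; \sum_{x \in S} \big(x - c(S)\big)\Big\rangle + |S|\,\|c(S) - z\|^2 = \psi(S, c(S)) + |S|\,\|c(S) - z\|^2,$$

since $\sum_{x \in S}(x - c(S)) = 0$; the extra term $|S|\,\|c(S) - z\|^2$ is nonnegative and vanishes only at $z = c(S)$.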

It can also be observed that in an optimal k-clustering, each point of Si is closer to ci, the center corresponding to Si, than to any other center. Thus, an optimal k-clustering is imposed by a Voronoi diagram whose sites are the centroids of the clusters. Such partitions are related to centroidal Voronoi tessellations (see [DFG99]).

A k-means clustering algorithm that is used widely because of its simplicity is the k-means heuristic, also called Lloyd’s method. This algorithm starts with an arbitrary k-clustering S0 of X with the initial k centers chosen to be the centroids of the clusters of S0. Then it repeatedly performs local improvements by applying the following “hill-climbing” step.

Definition 1.1 Given a clustering $S = (S_1, \ldots, S_k)$ of $X$, a k-MEANS step returns a clustering $S' = (S'_1, \ldots, S'_k)$ obtained by letting $S'_i$ equal the intersection of $X$ with the cell of $c(S_i)$ in the Voronoi partition imposed by the centers $c(S_1), \ldots, c(S_k)$. The (new) center of $S'_i$ will be $c(S'_i)$.

In a clustering $S = (S_1, \ldots, S_k)$ of $X$, a point $x \in X$ is misclassified if there exist $1 \le i \ne j \le k$ such that $x \in S_i$ but $\|x - c(S_j)\| < \|x - c(S_i)\|$. Thus a k-MEANS step can be broken into two stages: (i) every misclassified point is assigned to its closest center, and (ii) centers are moved to the centroids of their newly formed clusters. Lloyd’s algorithm, to which we shall refer as “k-MEANSMTD” throughout this paper, performs the k-MEANS step repeatedly and stops when the assignment of the points to the centers does not change from that of the previous step. This happens when no misclassified points remain and consequently in the last k-MEANS step $S' = S$. Clearly the clustering cost is reduced when each point is mapped to its closest center and also when each center moves to the centroid of the points it serves. Thus, the clustering cost is strictly reduced in each of the two stages of a k-MEANS step. This in particular implies that no clustering can be seen twice during the course of execution of k-MEANSMTD. Since there are only finitely many k-clusterings, the algorithm terminates in finite time.
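For concreteness, the following is a minimal Python/NumPy sketch of k-MEANSMTD as described above; the function names are ours, Voronoi ties are broken by index, and a center whose cluster becomes empty is simply left in place (one of the strategies discussed later in Remark 3.4).

    import numpy as np

    def kmeans_step(X, centers):
        """One k-MEANS step (Definition 1.1): assign each point of X to its
        nearest center, then move every center to the centroid of its cluster."""
        # Stage (i): every misclassified point is assigned to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Stage (ii): centers move to the centroids of their new clusters.
        new_centers = centers.copy()
        for j in range(len(centers)):
            if np.any(labels == j):          # an empty cluster keeps its old center
                new_centers[j] = X[labels == j].mean(axis=0)
        return labels, new_centers

    def kmeans_mtd(X, centers):
        """Lloyd's method: repeat k-MEANS steps until the assignment stabilizes."""
        labels = np.full(len(X), -1)
        steps = 0
        while True:
            new_labels, centers = kmeans_step(X, centers)
            steps += 1
            if np.array_equal(new_labels, labels):
                return labels, centers, steps
            labels = new_labels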

The algorithm k-MEANSMTD and its variants are widely used in practice [DHS01]. It is known that the output of k-MEANSMTD is not necessarily a global minimum, and it can be arbitrarily bad compared to the optimal clustering. Furthermore, the answer returned by the algorithm and the number of steps depend on the initial choice of the centers, i.e., the initial clustering [KMN+02]. These shortcomings of k-MEANSMTD have led to the development of efficient polynomial approximation schemes for the k-means clustering problem both in low [Mat00, ES03, HM03] and high dimensions [dlVKKR03]. Unfortunately, those algorithms have had little impact in practice, as they are complicated and probably impractical because of large constants. A more practical local search algorithm, which guarantees a constant factor approximation, is described by Kanungo et al. [KMN+02].

Up to this point, no meaningful theoretical bound was known for the number of steps k-MEANSMTD can take to terminate in the worst case. Inaba et al. [IKI94] observe that the number of distinct Voronoi partitions of a given n-point set $X \subseteq \mathbb{R}^d$ induced by $k$ sites is at most $O(n^{kd})$, which gives the same trivial upper bound on the number of steps of k-MEANSMTD (by observing that the clustering cost monotonically decreases and thus no k-clustering can be seen twice). However, the fact that k in typical applications can be in the hundreds, together with the relatively fast convergence of k-MEANSMTD observed in practice, makes this bound somewhat meaningless. The difficulty of proving any super-linear lower bound further suggests the looseness of this bound.

Our contribution. It thus appears that the combinatorial behavior of k-MEANSMTD is far from being well understood. Motivated by this, in this paper we provide a lower bound and upper bounds on the number of iterations performed by k-MEANSMTD. To our knowledge, our lower bound is the first that is super-polylogarithmic. Our upper bounds are polynomial in the spread Δ of the input point set, k, and n (the spread of a point set is the ratio between its diameter and the distance between its closest pair). The bounds are meaningful for most inputs. In Section 2, we present an Ω(n) lower bound on the number of iterations performed by k-MEANSMTD. More precisely, we show that for adversarially chosen initial two centers and a set of n points on the line, k-MEANSMTD takes Ω(n) steps. Note that this matches the straightforward upper bound on the number of Voronoi partitions in one dimension with two centers, which is O(n).

In Section 3, we provide a polynomial upper bound for the one-dimensional case. In Section 4, we provide an upper bound for the case where the points lie on a grid. In Section 5, we investigate two alternative algorithms, and provide polynomial upper bounds on the number of iterations they perform. Those algorithms are minor modifications of the k-MEANSMTD algorithm, and we believe that their analysis provides insight into the behavior of k-MEANSMTD. Some experimental results are presented in Section 6. In Section 7, we conclude by mentioning a few open problems and a discussion of our results.

2 Lower Bound Construction for Two Clusters in One Dimension

In this section, we describe a set of 2n points, along with an initial pair of centers, on which k-MEANSMTD takes Ω(n) steps to terminate, for $n \ge 2$.

Fix $n \ge 2$. Our set $X$ will consist of $2n$ numbers $y_1 < \cdots < y_n < x_n < \cdots < x_1$ with $y_i = -x_i$, for $i = 1, \ldots, n$.

At the $i$th iteration, we denote by $l_i$ and $r_i$ the current left and right centers, respectively, and by $L_i$ and $R_i$ the new sets of points assigned to $l_i$ and $r_i$, respectively. Furthermore, for each $i \ge 0$, we denote by $\alpha_i$ the Voronoi boundary $\frac{1}{2}(l_i + r_i)$ between the centers $l_i$ and $r_i$. Thus

$$L_i = \{x \in X \mid x < \alpha_i\} \quad\text{and}\quad R_i = \{x \in X \mid x \ge \alpha_i\}.$$

Let $x_1$ be an arbitrary positive real number and let $x_2 < x_1$ be a positive real number to be specified shortly. Initially, we let $l_1 = x_2$ and $r_1 = x_1$ and consequently $\alpha_1 = \frac{1}{2}(x_1 + x_2)$. Thus in the first iteration, $L_1 = \{y_1, \ldots, y_n, x_n, \ldots, x_2\}$ and $R_1 = \{x_1\}$. We will choose $x_2, \ldots, x_n$ such that at the end of the $i$th step we have $L_i = \{y_1, \ldots, y_n, x_n, \ldots, x_{i+1}\}$ and $R_i = \{x_i, \ldots, x_1\}$. Suppose for the inductive hypothesis that at the $(i-1)$th step we have

$$L_{i-1} = \{y_1, \ldots, y_n, x_n, \ldots, x_{i+1}, x_i\} \quad\text{and}\quad R_{i-1} = \{x_{i-1}, \ldots, x_1\}.$$

Thus we can compute $l_i$ and $r_i$ as follows:

$$l_i = \frac{y_1 + \cdots + y_n + x_n + \cdots + x_i}{2n - i + 1} \quad\text{and}\quad r_i = \frac{x_{i-1} + \cdots + x_1}{i - 1}.$$

Since $y_1 + \cdots + y_n + x_n + \cdots + x_i = -(x_{i-1} + \cdots + x_1)$, we get for $\alpha_i$:

$$\alpha_i = \frac{1}{2}(l_i + r_i) = \frac{1}{2}\left(\frac{x_{i-1} + \cdots + x_1}{i - 1} - \frac{x_{i-1} + \cdots + x_1}{2n - i + 1}\right) = \frac{n - i + 1}{(i - 1)(2n - i + 1)}\,(x_{i-1} + \cdots + x_1) = \frac{n - i + 1}{(i - 1)(2n - i + 1)}\, s_{i-1},$$

where $s_{i-1} = \sum_{j=1}^{i-1} x_j$.

To guarantee that only $x_i$ deserts from $L_{i-1}$ to $R_i$ in the $i$th iteration, we need that $x_{i+1} < \alpha_i < x_i$. Thus, it is natural to set $x_i = \tau_i \alpha_i$, where $\tau_i > 1$, for $i = 1, \ldots, n$. Picking the coefficients $\tau_1, \ldots, \tau_n$ is essentially the only part of this construction that is under our control. We set

$$\tau_i = 1 + \frac{1}{n - i + 1} = \frac{n - i + 2}{n - i + 1},$$

for $i = 1, \ldots, n$. Since $\tau_i > 1$, we have $x_i = \tau_i \alpha_i > \alpha_i$, for $i = 1, \ldots, n$. Next, we verify that $x_{i+1} < \alpha_i$. By definition,

$$x_{i+1} = \tau_{i+1} \alpha_{i+1} = \tau_{i+1}\, \frac{n - i}{i(2n - i)}\, s_i = \tau_{i+1}\, \frac{n - i}{i(2n - i)}\, (x_i + s_{i-1}) = \tau_{i+1}\, \frac{n - i}{i(2n - i)} \left(\tau_i + \frac{(i - 1)(2n - i + 1)}{n - i + 1}\right) \alpha_i = \frac{n - i + 1}{i(2n - i)} \left(\frac{n - i + 2}{n - i + 1} + \frac{(i - 1)(2n - i + 1)}{n - i + 1}\right) \alpha_i.$$

It can be verified through elementary simplifications that the coefficient of $\alpha_i$ above is always less than 1, implying that $x_{i+1} < \alpha_i < x_i$, for $i = 1, \ldots, n - 1$.

We can compute a recursive formula for $x_{i+1}$ in terms of $x_i$, as follows:

$$x_{i+1} = \tau_{i+1} \alpha_{i+1} = \frac{n - i + 1}{n - i} \cdot \frac{n - i}{i(2n - i)}\, s_i = \frac{n - i + 1}{i(2n - i)}\, (x_i + s_{i-1}) = \frac{n - i + 1}{i(2n - i)} \left(x_i + \frac{(i - 1)(2n - i + 1)}{n - i + 1}\, \alpha_i\right) = \frac{n - i + 1}{i(2n - i)} \left(x_i + \frac{(i - 1)(2n - i + 1)}{n - i + 1} \left(1 + \frac{1}{n - i + 1}\right)^{-1} x_i\right) = \frac{n - i + 1}{i(2n - i)} \left(1 + \frac{(i - 1)(2n - i + 1)}{n - i + 1} \cdot \frac{n - i + 1}{n - i + 2}\right) x_i = \frac{n - i + 1}{i(2n - i)} \left(1 + \frac{(i - 1)(2n - i + 1)}{n - i + 2}\right) x_i,$$

for $i = 1, \ldots, n - 1$. Thus, letting $c_i = \frac{n - i + 1}{i(2n - i)} \left(1 + \frac{(i - 1)(2n - i + 1)}{n - i + 2}\right)$, we get that

$$x_{i+1} = c_i x_i, \qquad\qquad (1)$$

for $i = 1, \ldots, n - 1$.

Theorem 2.1 For each $n \ge 2$, there exists a set of $2n$ points on a line with two initial center positions for which k-MEANSMTD takes exactly $n$ steps to terminate.
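To make the construction concrete, the following Python sketch (function names and the choice $n = 20$ are ours) builds the $2n$ points from Eq. (1) together with the mirrored points $y_i = -x_i$, and then runs Lloyd's method with two centers from the initial positions $l_1 = x_2$, $r_1 = x_1$, counting iterations until the two-cluster partition stops changing; for small $n$ one can observe the points $x_i$ crossing the Voronoi boundary essentially one per iteration.

    import numpy as np

    def lower_bound_instance(n, x1=1.0):
        """Build the 2n points of Section 2: x_1 > ... > x_n > 0 via
        x_{i+1} = c_i x_i (Eq. (1)), plus the mirrored points y_i = -x_i."""
        x = [x1]
        for i in range(1, n):                      # i = 1, ..., n-1
            c_i = (n - i + 1) / (i * (2 * n - i)) * \
                  (1 + (i - 1) * (2 * n - i + 1) / (n - i + 2))
            x.append(c_i * x[-1])
        x = np.array(x)
        return np.concatenate([-x, x]), x          # all 2n points, and the x_i's

    def two_means_1d(points, l, r):
        """Lloyd's method with two centers on the line; counts iterations."""
        steps, prev = 0, None
        while True:
            alpha = 0.5 * (l + r)                  # Voronoi boundary
            left, right = points[points < alpha], points[points >= alpha]
            steps += 1
            if prev is not None and np.array_equal(right, prev):
                return steps
            prev = right
            l, r = left.mean(), right.mean()

    X, x = lower_bound_instance(20)
    print(two_means_1d(X, l=x[1], r=x[0]))         # initial centers l_1 = x_2, r_1 = x_1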

2.1 The Spread of the Point Set

It is interesting to examine the spread of the above construction. In particular, somewhat surprisingly, the spread of this construction is polynomial, hinting (at least intuitively) that “bad” inputs for k-MEANSMTD are not that contrived.

By Eq. (1), we have $x_{i+1} = c_i x_i$. Notice that by the given construction $c_i < 1$ for all $i = 1, \ldots, n - 1$, since $x_{i+1} < x_i$. In the sequel we will show that $x_n$ is only polynomially smaller than $x_1$, namely $x_n = \Omega(x_1/n^4)$. We then derive a bound on the distance between any consecutive pair $x_i$ and $x_{i+1}$. These two assertions combined imply that the point set has a spread bounded by $O(n^5)$. The following lemma follows from elementary algebraic simplifications.

Lemma 2.2 For each $1 \le i \le n/2$, $c_i \ge 1 - 1/i^2$, and for each $n/2 < i \le n - 1$, $c_i \ge 1 - 1/(n - i + 1)^2$. Furthermore, for $i \ge 2$, we have $c_i \le 1 - 1/(2i)$.

Corollary 2.3 For any $n > 0$ we have $x_n = \Omega(x_1/n^4)$. Proof:

$$x_n = \left(c_1 \prod_{i=2}^{n-1} c_i\right) x_1 \ge c_1 x_1 \prod_{i=2}^{n/2} \left(1 - \frac{1}{i^2}\right) \prod_{i=n/2+1}^{n-1} \left(1 - \frac{1}{(n - i + 1)^2}\right)$$

$$= c_1 x_1 \left(1 - \frac{1}{2^2}\right) \cdots \left(1 - \frac{1}{(n/2)^2}\right)\left(1 - \frac{1}{(n/2)^2}\right) \cdots \left(1 - \frac{1}{2^2}\right) \ge c_1 x_1 \left(\prod_{i=2}^{n/2} \frac{(i - 1)^2}{i^2}\right)^2 = c_1 x_1 \left(\frac{1}{n/2}\right)^4.$$

The claim follows as $c_1 = n/(2n - 1) = \Theta(1)$. _

Lemma 2.4 For each $i = 1, \ldots, n - 1$, $x_i - x_{i+1} \ge x_i/(3i)$. Proof: Since $x_{i+1} = c_i x_i$, we have $x_i - x_{i+1} = x_i(1 - c_i)$. For $i = 1$, we have $c_1 = n/(2n - 1) \le 2/3$ when $n \ge 2$. Thus, we have $x_1 - x_2 \ge x_1/3$. For $i = 2, \ldots, n - 1$, using Lemma 2.2 we get $1 - c_i \ge 1/(2i)$. Thus, $x_i - x_{i+1} = x_i(1 - c_i) \ge x_i/(2i) > x_i/(3i)$, as claimed. _

Theorem 2.5 The spread of the point set constructed in Theorem 2.1 is $O(n^5)$.

Proof: By Lemma 2.4, for each $i = 1, \ldots, n - 1$, $x_i - x_{i+1} \ge x_i/(3i)$. Since $x_i > x_n$ and, by Corollary 2.3, $x_n = \Omega(x_1/n^4)$, it follows that $x_i - x_{i+1} = \Omega(x_1/n^5)$. This lower bound on the distance between two consecutive points also holds for the $y_i$’s, due to the symmetric construction of the point set around 0. On the other hand, since $x_n = \Omega(x_1/n^4)$, we have $x_n - y_n = 2x_n = \Omega(x_1/n^4)$. Thus every pair of points is at distance $\Omega(x_1/n^5)$. Since the diameter of the point set is $2x_1$, we get a bound of $O(n^5)$ for the spread of the point set. _

3 An Upper Bound for One Dimension

In this section, we prove an upper bound on the number of steps of k-MEANSMTD in one-dimensional Euclidean space. As we shall see, the bound does not involve k but is instead related to the spread Δ of the point set X. Without loss of generality we can assume that the closest pair of points in X is at distance 1 and thus the diameter of the set X is Δ. Before proving the upper bound, we mention a technical lemma from [KMN+02].

Lemma 3.1 ([KMN+02]) Let $S$ be a set of points in $\mathbb{R}^d$ with centroid $c = c(S)$ and let $z$ be an arbitrary point in $\mathbb{R}^d$. Then $\psi(S, z) - \psi(S, c) = \sum_{x \in S} \left(\|x - z\|^2 - \|x - c\|^2\right) = |S| \cdot \|c - z\|^2$.

The above lemma quantifies the contribution of a center $c_i$ to the cost improvement in a k-MEANS step as a function of the distance it moves. More formally, if in a k-MEANS step a k-clustering $S = (S_1, \ldots, S_k)$ is changed to another k-clustering $S' = (S'_1, \ldots, S'_k)$, then

$$\varphi(S) - \varphi(S') \ge \sum_{i=1}^{k} |S'_i| \cdot \|c(S_i) - c(S'_i)\|^2.$$

Note that in the above analysis we only consider the improvement resulting from the second stage of k-MEANS step in which the centers are moved to the centroids of their clusters. There is an additional gain from reassigning the points in the first stage of a k-MEANS step that we currently ignore.

In all our upper bound arguments we use the fact that if the initial set of centers is chosen from inside the convex hull of the input point set X (even if this is not the case, all centers move inside the convex hull of X after one step), the initial clustering cost is no more than $n\Delta^2$. This simply follows from the fact that each of the n points in X is at distance no more than Δ from its assigned center.
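As a quick numerical sanity check of Lemma 3.1 (the variable names and random data below are ours), the following snippet compares $\psi(S, z) - \psi(S, c)$ with $|S|\cdot\|c - z\|^2$ on arbitrary points:

    import numpy as np

    rng = np.random.default_rng(0)
    S = rng.random((50, 3))          # an arbitrary point set S in R^3
    z = rng.random(3)                # an arbitrary point z
    c = S.mean(axis=0)               # the centroid c = c(S)

    def psi(points, center):
        """psi(S, c): sum of squared distances from the points to the center."""
        return np.sum(np.linalg.norm(points - center, axis=1) ** 2)

    lhs = psi(S, z) - psi(S, c)
    rhs = len(S) * np.linalg.norm(c - z) ** 2
    print(lhs, rhs)                  # the two quantities agree up to rounding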

Theorem 3.2 The number of steps of k-MEANSMTD on a set $X \subseteq \mathbb{R}$ of $n$ points with spread Δ is at most $O(n\Delta^2)$.

Proof: Consider a k-MEANS step that changes a k-clustering $S$ into another k-clustering $S'$. The crucial observation is that in this step, there exists a cluster that is only extended or shrunk at its right end. To see this, consider the leftmost cluster $S_1$. Either $S_1$ is modified in this step, in which case this modification can only happen in the form of an extension or a shrinking at its right end, or it remains the same. In the latter case, the same argument can be made about $S_2$, and so on.

Thus assume that $S_1$ is extended on the right by receiving a set $T$ from the cluster directly to its right, namely $S_2$ ($S_2$ cannot lose all its points to $S_1$, as it has at least one point to the right of $c_2$, and this point is closer to $c_2$ than to $c_1$ and cannot go to $S_1$). Notice that $c(T)$ is to the right of the leftmost point in $T$ and at distance at least $(|T| - 1)/2$ from this leftmost point (because every pair of points in $T$ is at distance one or more, and $c(T)$ gets closest to its leftmost point when every pair of consecutive points in $T$ is placed at the minimum distance of 1 from each other). Similarly, the centroid of $S_1$ is to the left of the rightmost point of $S_1$ and at distance at least $(|S_1| - 1)/2$ from it. Thus, $c(T) - c(S_1) \ge (|T| - 1)/2 + (|S_1| - 1)/2 + 1 = (|T| + |S_1|)/2$, where the extra 1 is added because the distance between the leftmost point in $T$ and the rightmost point in $S_1$ is at least 1. The centroid of $S_1 \cup T$ will therefore be at distance

$$\frac{|T|}{|S_1| + |T|}\,\big(c(T) - c(S_1)\big) \ge \frac{|T|}{|S_1| + |T|} \cdot \frac{|T| + |S_1|}{2} = \frac{|T|}{2} \ge \frac{1}{2}$$

from $c(S_1)$ and to its right. Consequently, by Lemma 3.1, the improvement in clustering cost is at least $1/4$.

Similar analysis implies a similar improvement in the clustering cost for the case where we remove points from $S_1$. Since the initial clustering cost is at most $n\Delta^2$, the number of steps is no more than $n\Delta^2/(1/4) = 4n\Delta^2$. _

Remark 3.3 The upper bound of Theorem 3.2, as well as all other upper bounds proved later in this paper, can be slightly improved by observing that at the end of any k-MEANS step (or a substitute step used in the alternative algorithms considered later), we have a clustering $S = (S_1, \ldots, S_k)$ of the input point set $X$ with centers $c_1, \ldots, c_k$, respectively, where for each $i = 1, \ldots, k$, $c_i = c(S_i)$. Let $\hat{c} = c(X)$. By Lemma 3.1, we can write

$$\psi(S_i, c_i) = \psi(S_i, \hat{c}) - |S_i| \cdot \|\hat{c} - c_i\|^2,$$

for $1 \le i \le k$. Summing this equation, for $i = 1, \ldots, k$, we have

$$\varphi(S) = \sum_{x \in X} \|x - \hat{c}\|^2 - \sum_{i=1}^{k} |S_i| \cdot \|\hat{c} - c_i\|^2 < \sum_{x \in X} \|\hat{c} - x\|^2 = \frac{1}{n} \sum_{\{x, y\} \subseteq X} \|x - y\|^2.$$

Thus, we get an upper bound of $\frac{1}{n} \sum_{\{x, y\} \subseteq X} \|x - y\|^2$ that can replace the trivial bound of $n\Delta^2$. Note that depending on the input, this improved upper bound can be smaller by a factor of $O(n)$ than the bound we use (i.e., $n\Delta^2$). Nevertheless, in all our upper bound results we employ the weaker bound for the purpose of readability, while all those bounds can be made more precise by applying the above-mentioned improvement.

Remark 3.4 A slight technical detail in the implementation of the k-MEANSMTD algorithm involves the event of a center losing all the points it serves. The original k-means heuristic does not specify a particular solution to this problem. Candidate strategies used in practice include: placing the lonely center somewhere else arbitrarily or randomly, leaving it where it is to perhaps acquire some points in future steps, or completely removing it. For the sake of convenience in our analysis, we adopt the last strategy, namely, whenever a center is left serving no points, we remove that center permanently and continue with the remaining centers.

4 Upper Bound for Points on a d-Dimensional Grid

In this section, we prove an upper bound on the number of steps of k-MEANSMTD when the input points belong to the integer grid $\{1, \ldots, M\}^d$. This is the case in many practical applications where every data point has a large number of fields, with each field having values in a small discrete range. For example, this includes clustering of pictures, where every pixel forms a single coordinate (or three coordinates, corresponding to the RGB values) and the value of every coordinate is restricted to be an integer in the range 0–255.

The main observation is that the centroids of any two subsets of $\{1, \ldots, M\}^d$ are either equal or suitably far away. Since each step of k-MEANSMTD moves at least one center or else stops, this observation guarantees a certain amount of improvement to the clustering cost in each step.

Lemma 4.1 Let $S_1$ and $S_2$ be two nonempty subsets of $\{1, \ldots, M\}^d$ with $|S_1| + |S_2| \le n$. Then, either $c(S_1) = c(S_2)$ or $\|c(S_1) - c(S_2)\| \ge 1/n^2$.

Proof: If $c(S_1) \ne c(S_2)$, then they differ in at least one coordinate. Let $u_1$ and $u_2$ be the values of $c(S_1)$ and $c(S_2)$ in one such coordinate, respectively. By definition, $u_1 = s_1/|S_1|$ and $u_2 = s_2/|S_2|$, where $s_1$ and $s_2$ are integers in the range $\{1, \ldots, nM\}$. In other words, $u_1 - u_2$ is the difference of two distinct fractions, both with denominators less than $n$. It follows that $|u_1 - u_2| \ge 1/n^2$ and consequently $\|c(S_1) - c(S_2)\| \ge |u_1 - u_2| \ge 1/n^2$. _
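The following brute-force sketch (the grid size, subset sizes, and sampling are our own arbitrary choices) illustrates the $1/n^2$ separation of Lemma 4.1 on a tiny two-dimensional grid:

    import numpy as np

    rng = np.random.default_rng(1)
    M, n = 4, 6                              # tiny grid {1,...,M}^2, |S1| + |S2| <= n
    grid = np.array([[a, b] for a in range(1, M + 1)
                            for b in range(1, M + 1)], dtype=float)

    gap = np.inf
    for _ in range(20000):
        a = rng.integers(1, n)               # |S1| in {1, ..., n-1}
        b = rng.integers(1, n - a + 1)       # |S2| in {1, ..., n-a}
        S1 = grid[rng.choice(len(grid), a, replace=False)]
        S2 = grid[rng.choice(len(grid), b, replace=False)]
        dist = np.linalg.norm(S1.mean(axis=0) - S2.mean(axis=0))
        if dist > 0:
            gap = min(gap, dist)

    print(gap, 1 / n ** 2)    # smallest nonzero centroid gap found vs. 1/n^2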

Theorem 4.2 The number of steps of k-MEANSMTD when executed on a point set X taken from the grid $\{1, \ldots, M\}^d$ is at most $dn^5M^2$.

Proof: Note that $U = n(\sqrt{d}\,M)^2 = ndM^2$ is an upper bound for the clustering cost of any k-clustering of a point set in $\{1, \ldots, M\}^d$, and that at each step at least one center moves by at least $1/n^2$. Therefore, by Lemma 3.1, at every step the cost function decreases by at least $1/n^4$ and the overall number of steps can be no more than $U/(1/n^4) = dn^5M^2$. _

5 Arbitrary Point Sets and Alternative Algorithms

Unfortunately proving any meaningful bounds for the general case of k-MEANSMTD, namely with points in IRd with d > 1 and no further restrictions, remains elusive. However, in this section, we present two close relatives of k-MEANSMTD for which we can prove polynomial bounds on the number of steps. The first algorithm differs from k-MEANSMTD in that it moves a misclassified point to its correct cluster, as soon as the misclassified point is discovered (rather than first finding all misclassified points and then reassigning them to their closest centers as is the case in k-MEANSMTD). The second algorithm is basically the same as k-MEANSMTD with a naturally generalized notion of misclassified points. Our experimental results (Section  6) further support the kinship of these two algorithms with k-MEANSMTD.

As was the case with our previous upper bounds, our main approach in bounding the number of steps in both these algorithms is through showing substantial improvements in the clustering cost at each step.

5.1 The SINGLEPNT Algorithm

We introduce an alternative to the k-MEANS step which we shall call a SINGLEPNT step.

Definition 5.1 In a SINGLEPNT step on a k-clustering $S = (S_1, \ldots, S_k)$, a misclassified point $x$ is chosen, such that $x \in S_i$ and $\|x - c(S_j)\| < \|x - c(S_i)\|$, for some $1 \le i \ne j \le k$, and a new clustering $S' = (S'_1, \ldots, S'_k)$ is formed by removing $x$ from $S_i$ and adding it to $S_j$. Formally, for each $1 \le l \le k$,

$$S'_l = \begin{cases} S_l & l \notin \{i, j\},\\ S_l \setminus \{x\} & l = i,\\ S_l \cup \{x\} & l = j.\end{cases}$$

The centers are updated to the centroids of the clusters, and therefore only the centers of $S'_i$ and $S'_j$ change. Note that updating the centers takes constant time.
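A minimal Python sketch of SINGLEPNT (names ours), maintaining per-cluster sums and sizes so that moving one point updates the two affected centers in constant time, as noted above:

    import numpy as np

    def single_pnt(X, labels, k):
        """SINGLEPNT: repeatedly move one misclassified point to its closest
        center, updating only the two switch centers (constant time per move).
        labels is an initial assignment of the points (modified in place)."""
        sums = np.array([X[labels == j].sum(axis=0) for j in range(k)])
        sizes = np.array([np.sum(labels == j) for j in range(k)])
        centers = sums / sizes[:, None]
        moves = 0
        while True:
            moved = False
            for p, x in enumerate(X):
                i = labels[p]
                d = np.linalg.norm(centers - x, axis=1)
                j = int(np.argmin(d))
                # x is misclassified (Definition 5.1); the size guard only avoids
                # emptying a singleton cluster, a degenerate case handled in
                # Remark 3.4.
                if d[j] < d[i] and sizes[i] > 1:
                    sums[i] -= x
                    sizes[i] -= 1
                    centers[i] = sums[i] / sizes[i]
                    sums[j] += x
                    sizes[j] += 1
                    centers[j] = sums[j] / sizes[j]
                    labels[p] = j
                    moved = True
                    moves += 1
            if not moved:                    # no misclassified point remains
                return labels, centers, moves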

In a SINGLEPNT step, if the misclassified point is far away from at least one of c(Si) and c(Sj), then the improvement in clustering cost made in the SINGLEPNT step cannot be too small.

Lemma 5.2 Let $S$ and $T$ be two point sets of sizes $n$ and $m$, respectively, and let $s = c(S)$ and $t = c(T)$. Suppose that $x$ is a point in $T$ with distances $d_S$ and $d_T$ from $s$ and $t$, respectively, and such that $d_S < d_T$. Let $S' = S \cup \{x\}$ and $T' = T \setminus \{x\}$, and let $s' = c(S')$ and $t' = c(T')$. Then $\psi(S, s) + \psi(T, t) - \psi(S', s') - \psi(T', t') \ge (d_S + d_T)^2/(2(n + m))$.

Proof: Indeed, $c(S') = \frac{n}{n+1}\,c(S) + \frac{1}{n+1}\,x$. Thus

$$\|s' - s\| = \|c(S') - c(S)\| = \left\|\frac{1}{n+1}\,x - \frac{1}{n+1}\,c(S)\right\| = \frac{1}{n+1}\,\|x - c(S)\| = \frac{d_S}{n+1}.$$

Similarly, $\|t' - t\| = d_T/(m - 1)$. Thus, using Lemma 3.1 we get

$$\psi(S', s) - \psi(S', s') = (n + 1)\left(\frac{d_S}{n+1}\right)^2 = \frac{d_S^2}{n+1},$$

and similarly $\psi(T', t) - \psi(T', t') = \frac{d_T^2}{m-1}$.

Since $d_S < d_T$, we have that $\psi(S', s) + \psi(T', t) \le \psi(S, s) + \psi(T, t)$, and

$$\psi(S, s) + \psi(T, t) - \psi(S', s') - \psi(T', t') \ge \psi(S', s) + \psi(T', t) - \psi(S', s') - \psi(T', t') \ge \frac{d_S^2}{n+1} + \frac{d_T^2}{m-1} \ge \frac{d_S^2}{n+m} + \frac{d_T^2}{n+m} = \frac{d_S^2 + d_T^2}{n+m} \ge \frac{(d_S + d_T)^2}{2(n+m)}.$$

_

Our modified version of k-MEANSMTD, to which we shall refer as “SINGLEPNT”, replaces k-MEANS steps with SINGLEPNT steps. Starting from an arbitrary clustering of the input point set, SINGLEPNT repeatedly performs SINGLEPNT steps until no misclassified points remain. Notice that unlike the k-MEANS step, the SINGLEPNT step does not maintain the property that the clustering achieved at the end of the step is imposed by some Voronoi diagram. However, when the algorithm stops, no misclassified points are left, and the final clustering therefore does have this property, since otherwise further steps would be possible.

Theorem 5.3 On any input $X \subseteq \mathbb{R}^d$, SINGLEPNT makes at most $O(kn^2\Delta^2)$ steps before termination.

Proof: Once again, we assume that no two points in X are less than unit distance apart. Call a SINGLEPNT step weak if the misclassified point it considers is at distance less than $1/8$ from both involved centers, i.e., its current center and the center closest to it. We call a SINGLEPNT step strong if it is not weak. Lemma 5.2 shows that in a strong SINGLEPNT step the clustering cost improves by at least $1/(128n)$. In the sequel we shall show that the algorithm cannot take more than k consecutive weak steps, and thus at least one out of every k + 1 consecutive steps must be strong and thus results in an improvement of at least $1/(128n)$ to the clustering cost; hence the upper bound of $O(kn^2\Delta^2)$.

For a certain point in time, let $c_1, \ldots, c_k$ denote the current centers, and let $S_1, \ldots, S_k$ denote the corresponding clusters; namely, $S_i$ is the set of points served by $c_i$, for $i = 1, \ldots, k$. Consider the balls $B_1, \ldots, B_k$ of radius $1/8$ centered at $c_1, \ldots, c_k$, respectively. Observe that since every pair of points in X is at distance at least 1 from each other, each ball $B_i$ can contain at most one point of X. Moreover, the intersection of any subset of the balls $B_1, \ldots, B_k$ can contain at most one point of X. For a point $x \in X$, let $B(x)$ denote the set of balls among $B_1, \ldots, B_k$ that contain the point x. We refer to $B(x)$ as the batch of x.

By the above observation, the balls (and the corresponding centers) are classified according to the point of X they contain (if they contain such a point at all). Let $\mathcal{B}_X$ be the set of batches of balls that are induced by X and contain more than one ball. Formally, $\mathcal{B}_X = \{B(x) : x \in X,\ |B(x)| > 1\}$. The set of balls appearing in $\mathcal{B}_X$ is the set of active balls.

A misclassified point x can participate in a weak SINGLEPNT step only if it belongs to more than one ball, i.e., when $|B(x)| > 1$. Observe that if we perform a weak step, and one of the centers moves such that the corresponding ball $B_i$ no longer contains any point of X in its interior, then for $B_i$ to contain a point again, the algorithm must perform a strong step. To see this, observe that (weakly) losing a point x may cause a center to move a distance of at most $1/8$. Therefore, once a center $c_i$ loses a point x, and thus moves away from x, it does not move far enough for the ball $B_i$ to contain a different point of X.

Hence, in every weak iteration a point x changes the cluster it belongs to within $B(x)$. This might result in a shrinking of the set of active balls. On the other hand, while only weak SINGLEPNT steps are being taken, any cluster $S_j$ can change only by winning or losing the single point that stabs the corresponding ball $B_j$. It follows that once a cluster $S_j$ loses the point x, it can never get it back, since that would correspond to an increase in the clustering cost. Therefore the total number of possible consecutive weak SINGLEPNT steps is bounded by $\sum_{x \in X,\ |B(x)| > 1} |B(x)| \le k$. _

5.2 The LAZY-k-MEANS algorithm

Our second variant of k-MEANSMTD, which we name “LAZY-k-MEANS”, results from a natural generalization of the notion of misclassified points (Definition 1.1). Intuitively, the difference between LAZY-k-MEANS and k-MEANSMTD is that at each step LAZY-k-MEANS reassigns to their closest centers only those misclassified points that are substantially misclassified, namely the points that benefit from reclassification by at least a constant factor.

Definition 5.4 Given a clustering $S = (S_1, \ldots, S_k)$ of a point set X, if for a point $x \in S_i$ there exists a $j \ne i$ such that $\|x - c(S_i)\| > (1 + \varepsilon)\,\|x - c(S_j)\|$, then x is said to be $(1 + \varepsilon)$-misclassified for the center pair $(c(S_i), c(S_j))$. The centers $c(S_i)$ and $c(S_j)$ are referred to as switch centers for x. We also say that $c(S_i)$ is the losing center and $c(S_j)$ is the winning center for x.

Thus LAZY-k-MEANS with parameter ɛ starts with an arbitrary k-clustering. In each step, it (i) reassigns every (1 + ɛ)-misclassified point to its closest center and (ii) moves every center to the centroid of its new cluster. Indeed, k-MEANSMTD is simply LAZY-k-MEANS with parameter ɛ = 0. Naturally, the algorithm stops when no (1 + ɛ)-misclassified points are left.
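A sketch of one LAZY-k-MEANS step (names ours), which differs from the k-MEANS step only in the $(1+\varepsilon)$ test deciding which points get reassigned:

    import numpy as np

    def lazy_kmeans_step(X, labels, centers, eps):
        """One LAZY-k-MEANS step: reassign only the (1+eps)-misclassified points
        (Definition 5.4), then move every center to the centroid of its cluster."""
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        cur = d[np.arange(len(X)), labels]
        best = d[np.arange(len(X)), nearest]
        switch = cur > (1 + eps) * best            # (1+eps)-misclassified points
        new_labels = np.where(switch, nearest, labels)
        new_centers = centers.copy()
        for j in range(len(centers)):
            if np.any(new_labels == j):            # empty clusters keep their center
                new_centers[j] = X[new_labels == j].mean(axis=0)
        return new_labels, new_centers, int(switch.sum())

Repeating this step until the returned count is zero gives the full algorithm; with eps = 0 the step coincides with the k-MEANS step.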

In the sequel we bound the maximum number of steps taken by LAZY-k-MEANS. We shall use the following fact from elementary Euclidean geometry.

Fact 5.5 Given two points $c$ and $c'$ with $\|c - c'\| = \ell$, the locus of the points $x$ with $\|x - c\| > (1 + \varepsilon)\,\|x - c'\|$ is an open ball of radius $R = (1 + \varepsilon)\ell/(\varepsilon(2 + \varepsilon))$, called the $\varepsilon$-Apollonius ball for $c'$ with respect to $c$. This ball is centered on the line containing the segment $cc'$ at distance $R + \ell\varepsilon/(2(2 + \varepsilon))$ from the bisector of $cc'$, and on the same side of the bisector as $c'$.
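For intuition, the radius and offset in Fact 5.5 can be recovered by a one-dimensional computation (our own sketch, not taken from the paper): place $c$ at the origin and $c'$ at distance $\ell$ on a line; the two boundary points of the locus on that line satisfy $|x| = (1+\varepsilon)|x - \ell|$, giving

$$x_{\mathrm{near}} = \frac{(1+\varepsilon)\ell}{2+\varepsilon}, \qquad x_{\mathrm{far}} = \frac{(1+\varepsilon)\ell}{\varepsilon}, \qquad R = \frac{x_{\mathrm{far}} - x_{\mathrm{near}}}{2} = \frac{(1+\varepsilon)\ell}{\varepsilon(2+\varepsilon)},$$

and the center of the ball lies at $(x_{\mathrm{near}} + x_{\mathrm{far}})/2 = (1+\varepsilon)^2\ell/(\varepsilon(2+\varepsilon))$, which is at distance $(1+\varepsilon)^2\ell/(\varepsilon(2+\varepsilon)) - \ell/2 = R + \ell\varepsilon/(2(2+\varepsilon))$ from the bisector at $\ell/2$.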

Lemma 5.6 For any three points $x$, $c$, and $c'$ in $\mathbb{R}^d$ with $\|x - c\| \ge \|x - c'\|$, we have $\|x - c\|^2 - \|x - c'\|^2 = 2h\,\|c - c'\|$, where $h$ is the distance from $x$ to the bisector of $c$ and $c'$.

Proof: Let $y$ be the intersection point of the segment $cc'$ with the $(d-1)$-dimensional hyperplane parallel to the bisector of $c$ and $c'$ and containing $x$. By the Pythagorean equality we have $\|x - c\|^2 = \|x - y\|^2 + \|y - c\|^2$ and $\|x - c'\|^2 = \|x - y\|^2 + \|y - c'\|^2$. Subtracting the second equality from the first, we obtain

$$\|x - c\|^2 - \|x - c'\|^2 = \|y - c\|^2 - \|y - c'\|^2 = \big(\|y - c\| + \|y - c'\|\big)\big(\|y - c\| - \|y - c'\|\big) = 2h\,\|c - c'\|,$$

since $\|y - c\| + \|y - c'\| = \|c - c'\|$ and $\|y - c\| - \|y - c'\| = 2h$. _

Theorem 5.7 The number of steps of LAZY-k-MEANS with parameter ε is $O(n\Delta^2/\varepsilon^3)$.

Proof: We will show that every two consecutive steps of LAZY-k-MEANS with parameter ɛ make an improvement of at least

$$\lambda = \frac{\varepsilon^3(2 + \varepsilon)}{256(1 + \varepsilon)^2} \ge \frac{\varepsilon^3}{512} = \Omega(\varepsilon^3).$$

Let $\ell_0 = \varepsilon(2 + \varepsilon)/(16(1 + \varepsilon))$. Notice that $\ell_0 < 1/8$ for $0 < \varepsilon \le 1$. We call a misclassified point x strongly misclassified if its switch centers c and c′ are at distance at least $\ell_0$ from each other, and weakly misclassified otherwise.

If at the beginning of a LAZY-k-MEANS step there exists a strongly misclassified point x for a center pair $(c, c')$, then since every point in the $\varepsilon$-Apollonius ball for $c'$ with respect to $c$ is at distance at least $\ell_0\varepsilon/(2(2 + \varepsilon))$ from the bisector of $cc'$, by Lemma 5.6 the reclassification improvement in clustering cost resulting from assigning x to $c'$ is

$$\|x - c\|^2 - \|x - c'\|^2 \ge 2 \cdot \frac{\ell_0\varepsilon}{2(2 + \varepsilon)} \cdot \ell_0 = \frac{\ell_0^2\,\varepsilon}{2 + \varepsilon} = \frac{\varepsilon^3(2 + \varepsilon)}{256(1 + \varepsilon)^2} = \lambda.$$

Thus we assume that all misclassified points are weakly misclassified. Let x be one such point for a center pair $(c, c')$. By our assumption, $\|c - c'\| < \ell_0$. Observe that in such a case, the radius of the $\varepsilon$-Apollonius ball for $c'$ with respect to $c$ is $(1 + \varepsilon)\|c - c'\|/(\varepsilon(2 + \varepsilon)) < 1/16$. In particular, since there exists a ball of radius $1/16$ containing both x and $c'$, the ball of radius $1/8$ centered at $c'$, which we denote by $B(c', 1/8)$, includes x. Also, since $\|c - c'\| < \ell_0 < 1/8$ as verified above, we get $c \in B(c', 1/8)$ as well. In other words, both switch centers $c$ and $c'$ are at distance less than $1/4$ from x. Now, since every pair of points in X is at distance 1 or more, any center can be a switch center for at most one weakly misclassified point. This in particular implies that in the considered LAZY-k-MEANS step, no cluster is modified by more than a single point.

When the misclassified points are assigned to their closest centers, the centers that do not lose or win any points stay at their previous locations. A center $c'$ that wins a point x moves closer to x, since x is the only point it wins while losing no other points. Similarly, a center c that loses a point x moves away from x, since x is the only point it loses without winning any other points. A losing center c moves away from its lost point x by a distance of at most $\|c - x\| < 1/4$, since its previous number of served points was at least 2 (otherwise, we would have $c = x$ and thus x could not be misclassified). Therefore, when c moves to the centroid of its cluster (now missing x), $\|x - c\| < 1/2$ and consequently $\|c - y\| > 1/2$ for any $y \in X$ with $y \ne x$. As a result, c cannot be a switch center for any weakly misclassified point in the subsequent LAZY-k-MEANS step.

On the other hand, the winning center $c'$, to whose cluster x is added, moves closer to x, and since no center other than c and $c'$ in $B(x, 1/4)$ moves (as there is no point other than x they can win or lose), x will not be misclassified in the next LAZY-k-MEANS step.

It follows from the above discussion that the next LAZY-k-MEANS step cannot have any weakly misclassified points, and thus either the algorithm stops or some strongly misclassified point will exist, resulting in an improvement of at least λ. Thus the total number of steps taken by LAZY-k-MEANS with parameter ε is at most $2n\Delta^2/\lambda = O(n\Delta^2/\varepsilon^3)$. _

6 Experimental Results

We introduced both SINGLEPNT and LAZY-k-MEANS as alternatives to k-MEANSMTD: similar, equally easy to implement algorithms that are simpler to analyze than k-MEANSMTD itself. However, as mentioned in the introduction, k-MEANSMTD is of interest mainly in practice, because of its ease of implementation and its relatively fast termination (small number of steps). This raises the question of how our alternative algorithms perform in practice in comparison to k-MEANSMTD.

We performed a series of experiments analogous to those done in [KMN+02], as described below, to compare the number of rounds, the number of reclassified points, and the quality of the final clustering produced by these two alternative algorithms with those of k-MEANSMTD. We use the same inputs used by Kanungo et al. for our experiments. See [KMN+02] for a detailed description of those inputs. We have tried to implement each of the algorithms in the simplest possible way and avoided using any advanced point location or nearest neighbor search structure or algorithm. Due to the great similarity between the three algorithms considered here, it is expected that any technique used for improving the performance of one of these algorithms would be suitable for improving the other two variants in a somewhat similar way.

The algorithms k-MEANSMTD and LAZY-k-MEANS iterate over the points and assign each point to its closest center. While doing this, the new set of centers is calculated and the existence of a (1 + ε)-misclassified point is checked. SINGLEPNT examines the points one by one, returning to the first point when reaching the end of the list, and checks whether each point is misclassified. When a misclassified point is discovered, it is assigned to its closest center and the locations of the two switch centers are updated. The algorithm stops when it cannot find a misclassified point for n consecutive steps.

The input used in these experiments together with the source-code of our implementation is available at [Sad04].

Our experimental results are summarized in Table 1 and Table 2. In conformance with [KMN+02], the costs referred to in these tables are the total final clustering cost divided by the number of points. In that sense we report the “average” cost per point. Table 1 is produced by running, only once, each of the four algorithms with the same set of randomly chosen centers for each combination of point set and number of centers considered. By studying several such tables, it seems that the total number of reclassified points and the quality of clustering found by SINGLEPNT tend to be very close to those of k-MEANSMTD. Notice that in Table 1, the number of steps of SINGLEPNT is left blank, as it is equal to the number of reclassified points and cannot be compared with the number of steps of k-MEANSMTD or LAZY-k-MEANS.

Table 2 summarizes the results of running 100 tests similar to the one reported in Table 1, each with a different initial set of centers picked randomly from the bounding box of the given point set. The best, worst, and average final clustering costs are reported in each case.

We have not discussed the running times, as we made no effort to optimize our implementations. It is, however, interesting that both of the two alternative algorithms tend to be faster than k-MEANSMTD in a typical implementation such as ours. SINGLEPNT seems to be typically more than 20% faster than Lloyd’s method. In particular, we emphasize that our simple implementation is considerably slower than the implementation of Kanungo et al. [KMN+02], which uses a data structure similar to a kd-tree to speed up the computation of the Voronoi partitions. We believe that we would get similar performance gains by using their data structure.

7 Conclusions

We presented several results on the number of iterations performed by the k-MEANSMTD clustering algorithm. To our knowledge, our results are the first to provide combinatorial bounds on the performance of k-MEANSMTD. We also suggested related variants of the k-MEANSMTD algorithm, and proved upper bounds on their performance. We implemented those algorithms and compared their performance in practice [Sad04]. We conjecture that the upper bounds we proved for SINGLEPNT hold also for k-MEANSMTD. Perhaps the most surprising aspect of those bounds is the lack of dependence on the dimension of the data in the bound on the number of iterations performed.

We consider this paper to be a first step in understanding Lloyd’s method. It is our belief that both our lower and upper bounds are loose, and one might need to use other techniques to improve them. In particular, we mention some open problems:

  1. There is still a large gap between our lower and upper bounds. In particular, a super-linear lower bound would be interesting even in high-dimensional space.
  2. Our current upper bounds include the spread as a parameter. It would be interesting to prove (or disprove) that this is indeed necessary.
  3. We have introduced alternative, easy-to-analyze algorithms that are comparable to k-MEANSMTD both in their description and their behavior in practice. It would be interesting to show provable connections between these algorithms and compare the bounds on the number of steps they require to terminate.

7.1 Dependency on the spread

A shortcoming of our results is the dependency on the spread of the point set in the bounds presented. However:

  1. This can be resolved by doing a preprocessing stage, snapping together points close to each other, and breaking the input into several parts to be further clustered separately. This is essentially what fast provable approximation algorithms for TSP, k-means, and k-median do [Aro98, HM03]. This results in point sets with polynomial spread, which can be used instead of the original input to compute a good clustering. This is outside the scope of our analysis, but it can be used in practice to speed up the k-MEANSMTD algorithm.
  2. In high dimensions, it seems that in many natural cases the spread tends to shrink and be quite small. As such, we expect our bounds to be meaningful in such cases.

    To see an indication of this shrinkage in the spread, imagine picking n points randomly from a unit hypercube in $\mathbb{R}^d$ with volume one. It is easy to see that the minimum distance between any pair of points is going to be at least $L = 1/n^{3/d}$, with high probability, since if we center around each such point a hypercube of side length L, it would have volume $1/n^3$ of the unit hypercube. As such, the probability of a second point falling inside this region is polynomially small.

    However, L tends to 1 as d increases. Thus, for $d = \Theta(\log n)$ the spread of such a random point set is $\Theta(\sqrt{d}/(L/2)) = \Theta(\sqrt{\log n})$; a small numerical sketch is given after this list. (An alternative way to demonstrate this is by picking points randomly from the unit hypersphere. By using a concentration of mass argument [Mat02] on a hypersphere, we get a point set with spread O(1) with high probability.)
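The following small Monte Carlo sketch (entirely ours; the parameters are arbitrary) illustrates this shrinkage by estimating the spread of random points in the unit hypercube for a few dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    for d in (3, 10, 30):
        P = rng.random((n, d))                      # n random points in [0, 1]^d
        D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
        closest = D[np.triu_indices(n, k=1)].min()  # closest-pair distance
        print(d, D.max() / closest)                 # spread = diameter / closest pair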

7.2 Dependency on the initial solution

The initial solution fed into k-MEANSMTD is critical both to the time it takes to converge and to the quality of the final clustering generated. This is clearly suggested by Table 2, where trying many different initial solutions has yielded a considerable improvement in the best solution found. Of course, one can use a (rough) approximation algorithm [HM03] to come up with a better starting solution. While this approach might be useful in practice, it again falls outside the scope of our analysis.

7.3 Similar results

Recently, and independently of our results, Sanjoy Dasgupta [Das03] announced results that are similar to a subset of our results. In particular, he mentions the one-dimensional lower bound, and a better upper bound for k < 5, but only in one dimension. This work of Sanjoy Dasgupta and Howard Karloff seems to use arguments similar to ours (personal communication), although to our knowledge it has not been written up or published yet.

Acknowledgments

The authors would like to thank Pankaj K. Agarwal, Boris Aronov and David Mount for useful discussions of problems studied in this paper and related problems. In particular, David Mount provided us with the test point sets used in [KMN+02].

References

[Aro98]    S. Arora. Polynomial time approximation schemes for Euclidean TSP and other geometric problems. J. Assoc. Comput. Mach., 45(5):753–782, Sep 1998.

[Das03]    S. Dasgupta. How fast is k-means? In Proc. 16th Annu. Comp. Learn. Theo., number 2777 in Lect. Notes in Comp. Sci., page 735, 2003.

[DFG99]    Q. Du, V. Faber, and M. Gunzburger. Centroidal Voronoi tessellations: Applications and algorithms. SIAM Review, 41(4):637–676, 1999.

[DHS01]    R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, New York, 2nd edition, 2001.

[dlVKKR03]   W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proc. 35th Annu. ACM Sympos. Theory Comput., pages 50–58, 2003.

[ES03]    M. Effros and L. J. Schulman. Rapid clustering with a deterministic data net. Manuscript, 2003.

[HM03]    S. Har-Peled and S. Mazumdar. Coresets for k-means and k-median clustering and their applications. In Proc. 36th Annu. ACM Sympos. Theory Comput., 2003. To appear. http://www.uiuc.edu/~sariel/papers/03/kcoreset/.

[IKI94]    M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 332–339, 1994.

[KMN+02]    T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for k-means clustering. In Proc. 18th Annu. ACM Sympos. Comput. Geom., pages 10–18, 2002.

[Mat00]    J. Matoušek. On approximate geometric k-clustering. Discrete Comput. Geom., 24:61–84, 2000.

[Mat02]    J. Matoušek. Lectures on Discrete Geometry. Springer, 2002.

[Sad04]    B. Sadri. Lloyd’s method and variants implementation together with inputs, 2004. http://www.uiuc.edu/~sariel/papers/03/lloyd_kmeans.








Data Set      k     Method                   Steps   Reclassified   Final Cost

ClusGauss     25    k-MEANSMTD                 24        4748         0.081615
n = 10,000          SINGLEPNT                   -        4232         0.081622
d = 3               LAZY-k-MEANS, ε = 0.05     17        2377         0.082702
                    LAZY-k-MEANS, ε = 0.20     18        1554         0.089905
              50    k-MEANSMTD                 20        4672         0.031969
                    SINGLEPNT                   -        4391         0.031728
                    LAZY-k-MEANS, ε = 0.05     16        2244         0.032164
                    LAZY-k-MEANS, ε = 0.20     22        1974         0.034661
              100   k-MEANSMTD                 22        5377         0.009639
                    SINGLEPNT                   -        4958         0.009706
                    LAZY-k-MEANS, ε = 0.05     15        2512         0.010925
                    LAZY-k-MEANS, ε = 0.20     19        1748         0.013092

MultiClus     50    k-MEANSMTD                 21        2544         0.033870
n = 10,000          SINGLEPNT                   -        2419         0.033941
d = 3               LAZY-k-MEANS, ε = 0.05     16        1121         0.034622
                    LAZY-k-MEANS, ε = 0.20     25         722         0.038042
              100   k-MEANSMTD                 18        1744         0.009248
                    SINGLEPNT                   -        1732         0.008854
                    LAZY-k-MEANS, ε = 0.05     11         740         0.009902
                    LAZY-k-MEANS, ε = 0.20     15         584         0.010811
              500   k-MEANSMTD                 12        1768         0.002495
                    SINGLEPNT                   -        1694         0.002522
                    LAZY-k-MEANS, ε = 0.05      9         528         0.002757
                    LAZY-k-MEANS, ε = 0.20     11         444         0.002994

Lena22        8     k-MEANSMTD                 36       62130       335.408625
n = 65,536          SINGLEPNT                   -       57357       335.440866
d = 4               LAZY-k-MEANS, ε = 0.05     27       50298       338.594668
                    LAZY-k-MEANS, ε = 0.20     21       44040       355.715258
              64    k-MEANSMTD                211      111844        94.098422
                    SINGLEPNT                   -       81505        94.390640
                    LAZY-k-MEANS, ε = 0.05     88       55541        97.608823
                    LAZY-k-MEANS, ε = 0.20     24       30201       120.274428
              256   k-MEANSMTD                167      111110        48.788216
                    SINGLEPNT                   -      101522        48.307815
                    LAZY-k-MEANS, ε = 0.05     92       57575        51.954810
                    LAZY-k-MEANS, ε = 0.20     79       32348        61.331614

Lena44        8     k-MEANSMTD                 63       18211      2700.589245
n = 16,384          SINGLEPNT                   -       16467      2700.587691
d = 16              LAZY-k-MEANS, ε = 0.05     20        9715      2889.747540
                    LAZY-k-MEANS, ε = 0.20     27        9201      3008.783333
              64    k-MEANSMTD                 61       21292      1525.846646
                    SINGLEPNT                   -       16422      1615.667299
                    LAZY-k-MEANS, ε = 0.05     45       13092      1555.520952
                    LAZY-k-MEANS, ε = 0.20     16        7527      1907.962692
              256   k-MEANSMTD                 43       21394      1132.746162
                    SINGLEPNT                   -       28049      1122.407317
                    LAZY-k-MEANS, ε = 0.05     28       12405      1156.884049
                    LAZY-k-MEANS, ε = 0.20     27        7993      1320.303278

Kiss          8     k-MEANSMTD                 18        5982       687.362264
n = 10,000          SINGLEPNT                   -        7026       687.293930
d = 3               LAZY-k-MEANS, ε = 0.05     18        3277       690.342895
                    LAZY-k-MEANS, ε = 0.20     23        2712       720.891998
              64    k-MEANSMTD                202       29288       202.044849
                    SINGLEPNT                   -       35228       185.519927
                    LAZY-k-MEANS, ε = 0.05     92       12471       221.936175
                    LAZY-k-MEANS, ε = 0.20     44        6080       263.497185
              256   k-MEANSMTD                144       17896       105.438490
                    SINGLEPNT                   -       16992       106.112133
                    LAZY-k-MEANS, ε = 0.05     61        7498       120.317362
                    LAZY-k-MEANS, ε = 0.20     27        3479       150.156231






Table 1: Number of steps, number of reclassified points, and final average clustering cost in a typical execution of each of the four algorithms on data sets mentioned in [KMN+02].








Data Set      k     Method                   Minimum Cost   Maximum Cost   Average Cost

ClusGauss     25    k-MEANSMTD                   0.068462       0.087951     0.07501276
n = 10,000          SINGLEPNT                    0.067450       0.083194     0.07486010
d = 3               LAZY-k-MEANS, ε = 0.20       0.074667       0.100035     0.08510598
                    LAZY-k-MEANS, ε = 0.05       0.070011       0.092658     0.07803375
              50    k-MEANSMTD                   0.028841       0.040087     0.03335312
                    SINGLEPNT                    0.028376       0.040623     0.03308624
                    LAZY-k-MEANS, ε = 0.20       0.031175       0.046528     0.03719264
                    LAZY-k-MEANS, ε = 0.05       0.029626       0.040811     0.03384180
              100   k-MEANSMTD                   0.011425       0.016722     0.01401549
                    SINGLEPNT                    0.010106       0.017986     0.01365492
                    LAZY-k-MEANS, ε = 0.20       0.011928       0.022015     0.01565268
                    LAZY-k-MEANS, ε = 0.05       0.011730       0.020600     0.01442575

MultiClus     50    k-MEANSMTD                   0.027563       0.034995     0.03051698
n = 10,000          SINGLEPNT                    0.027412       0.034167     0.03083110
d = 3               LAZY-k-MEANS, ε = 0.20       0.029507       0.055160     0.03620397
                    LAZY-k-MEANS, ε = 0.05       0.028457       0.046314     0.03260643
              100   k-MEANSMTD                   0.002477       0.004324     0.00308144
                    SINGLEPNT                    0.002390       0.004179     0.00303798
                    LAZY-k-MEANS, ε = 0.20       0.002758       0.005175     0.00356282
                    LAZY-k-MEANS, ε = 0.05       0.002331       0.004789     0.00322593
              500   k-MEANSMTD                   0.002142       0.002731     0.00240768
                    SINGLEPNT                    0.002136       0.002805     0.00244548
                    LAZY-k-MEANS, ε = 0.20       0.002539       0.003567     0.00292354
                    LAZY-k-MEANS, ε = 0.05       0.002206       0.002890     0.00254321

Lena22        8     k-MEANSMTD                 263.644420     348.604787   299.78905632
n = 65,536          SINGLEPNT                  263.659829     348.527023   307.12394164
d = 4               LAZY-k-MEANS, ε = 0.20     278.337133     414.679356   345.07986265
                    LAZY-k-MEANS, ε = 0.05     271.041374     409.802396   322.99259307
              64    k-MEANSMTD                  82.074376     102.327255    88.53558757
                    SINGLEPNT                   82.190945     104.574941    89.24323986
                    LAZY-k-MEANS, ε = 0.20     100.601485     147.170657   111.93562151
                    LAZY-k-MEANS, ε = 0.05      82.798308     106.231864    94.20319250
              256   k-MEANSMTD                  44.637740      51.482531    47.66542537
                    SINGLEPNT                   44.699224      51.685618    47.81799127
                    LAZY-k-MEANS, ε = 0.20      56.906620      71.491475    62.00216985
                    LAZY-k-MEANS, ε = 0.05      47.178425      54.946136    50.82872342

Lena44        8     k-MEANSMTD                2699.721266    3617.282065  2903.30164756
n = 16,384          SINGLEPNT                 2699.663310    3216.854024  2894.42713876
d = 16              LAZY-k-MEANS, ε = 0.20    2834.438965    4452.875383  3293.73084140
                    LAZY-k-MEANS, ε = 0.05    2725.907276    3649.518829  2977.33094524
              64    k-MEANSMTD                1305.357406    1694.965827  1503.17431782
                    SINGLEPNT                 1345.821487    1811.663769  1515.08195678
                    LAZY-k-MEANS, ε = 0.20    1564.252624    2385.794013  1785.93841955
                    LAZY-k-MEANS, ε = 0.05    1410.883673    1793.704755  1565.18092988
              256   k-MEANSMTD                1044.017122    1311.942456  1151.64441691
                    SINGLEPNT                 1055.788028    1308.459754  1168.30843808
                    LAZY-k-MEANS, ε = 0.20    1262.487865    1653.820840  1400.49905496
                    LAZY-k-MEANS, ε = 0.05    1094.884884    1385.345314  1219.27000492

Kiss          8     k-MEANSMTD                 687.278119     714.789442  700.352315760
n = 10,000          SINGLEPNT                  687.279479     714.731416  697.292832560
d = 3               LAZY-k-MEANS, ε = 0.20     727.017538     947.779405  802.256735040
                    LAZY-k-MEANS, ε = 0.05     689.779010     861.853344  719.140385820
              64    k-MEANSMTD                 158.607749     208.946701   178.21703676
                    SINGLEPNT                  151.642447     203.102940   177.17793706
                    LAZY-k-MEANS, ε = 0.20     222.646398     324.435479   259.62118455
                    LAZY-k-MEANS, ε = 0.05     170.571861     248.648363   208.64482062
              256   k-MEANSMTD                  96.272602     115.294309   105.30212380
                    SINGLEPNT                   97.141907     125.009357   107.08187899
                    LAZY-k-MEANS, ε = 0.20     124.378185     158.922757   140.72908431
                    LAZY-k-MEANS, ε = 0.05     103.672482     129.685819   116.73971102






Table 2: Minimum, maximum, and average clustering cost on 100 executions of each of the algorithms on each of the data sets with initial centers picked randomly.