
## On Lloyd’s $k$-means Method$\ast$

June 30, 2004

$\ast$The most updated version of this paper is available from the author’s web page: http://www.uiuc.edu/~sariel/papers/03/lloyd_kmeans
$†$Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL 61801, USA; sariel@cs.uiuc.edu; http://www.uiuc.edu/~sariel/. Work on this paper was partially supported by an NSF CAREER award CCR-0132901.
$‡$Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL 61801, USA; sadri@cs.uiuc.edu; http://www.uiuc.edu/~sadri/.
### Abstract

We present polynomial upper and lower bounds on the number of iterations performed by Lloyd’s method for $k$-means clustering. Our upper bounds are polynomial in the number of points, the number of clusters, and the spread of the point set. We also present a lower bound, showing that in the worst case the $k$-means heuristic needs to perform $\Omega \left(n\right)$ iterations, for $n$ points on the real line and two centers. Surprisingly, the spread of our construction is polynomial. This is the first construction showing that the $k$-means heuristic requires more than a polylogarithmic number of iterations. Furthermore, we present two alternative algorithms, with guaranteed performance, which are simple variants of Lloyd’s method. Results of our experimental studies of these algorithms are also presented.

### 1 Introduction

In a (geometric) clustering problem, we are given a finite set $X\subset {\mathbb{R}}^{d}$ of $n$ points and an integer $k\ge 2$, and we seek a partition (clustering) $\mathcal{S}=\left({S}_{1},\dots ,{S}_{k}\right)$ of $X$ into $k$ disjoint nonempty subsets along with a set $C=\left\{{c}_{1},\dots ,{c}_{k}\right\}$ of $k$ corresponding centers, that minimizes a suitable cost function among all such $k$-clusterings of $X$. The cost function typically represents how tightly each cluster is packed and how separated different clusters are. A center ${c}_{i}$ serves the points in its cluster ${S}_{i}$.

We consider the $k$-means clustering cost function $\phi \left(\mathcal{S},C\right)={\sum }_{i=1}^{k}\psi \left({S}_{i},{c}_{i}\right)$, where $\psi \left(S,c\right)={\sum }_{x\in S}{∥x-c∥}^{2}$, in which $∥\cdot ∥$ denotes the Euclidean norm. It can be easily observed that for any cluster ${S}_{i}$, the point $c$ that minimizes the sum ${\sum }_{x\in {S}_{i}}{∥x-c∥}^{2}$ is the centroid of ${S}_{i}$, denoted by $c\left({S}_{i}\right)$, and therefore in an optimal clustering, ${c}_{i}=c\left({S}_{i}\right)$. Thus the above cost function can be written as $\phi \left(\mathcal{S}\right)={\sum }_{i=1}^{k}{\sum }_{x\in {S}_{i}}{∥x-c\left({S}_{i}\right)∥}^{2}$.
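The fact that the centroid minimizes $\psi$ is easy to check numerically. The following sketch (illustrative sample points, not part of the paper's analysis) compares the cost of the centroid against a few perturbed centers:

```python
# Numeric sanity check that the centroid minimizes psi(S, c), the sum of
# squared distances from the points of S to the center c.

def psi(S, c):
    """Clustering cost of serving point set S from center c."""
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for x in S)

def centroid(S):
    """Coordinate-wise mean of the points in S."""
    return tuple(sum(x[i] for x in S) / len(S) for i in range(len(S[0])))

S = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
c = centroid(S)                                    # (1.0, 1.0)
for z in [(0.9, 1.0), (1.0, 0.8), (1.5, 1.5), (0.0, 0.0)]:
    assert psi(S, z) >= psi(S, c)                  # the centroid is never beaten
```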

It can also be observed that in an optimal $k$-clustering, each point of ${S}_{i}$ is closer to ${c}_{i}$, the center corresponding to ${S}_{i}$, than to any other center. Thus, an optimal $k$-clustering is imposed by a Voronoi diagram whose sites are the centroids of the clusters. Such partitions are related to centroidal Voronoi tessellations (see [DFG99]).

A $k$-means clustering algorithm that is used widely because of its simplicity is the $k$-means heuristic, also called Lloyd’s method. This algorithm starts with an arbitrary $k$-clustering ${\mathcal{S}}_{0}$ of $X$ with the initial $k$ centers chosen to be the centroids of the clusters of ${\mathcal{S}}_{0}$. Then it repeatedly performs local improvements by applying the following “hill-climbing” step.

Definition 1.1 Given a clustering $\mathcal{S}=\left({S}_{1},\dots ,{S}_{k}\right)$ of $X$, a $k$-MEANS step returns a clustering ${\mathcal{S}}^{\prime }=\left({S}_{1}^{\prime },\dots ,{S}_{k}^{\prime }\right)$ by letting ${S}_{i}^{\prime }$ be the intersection of $X$ with the cell of $c\left({S}_{i}\right)$ in the Voronoi partitioning imposed by the centers $c\left({S}_{1}\right),\dots ,c\left({S}_{k}\right)$. The (new) center of ${S}_{i}^{\prime }$ will be $c\left({S}_{i}^{\prime }\right)$.

In a clustering $\mathcal{S}=\left({S}_{1},\dots ,{S}_{k}\right)$ of $X$, a point $x\in X$ is misclassified if there exists $1\le i\ne j\le k$, such that $x\in {S}_{i}$ but $∥x-c\left({S}_{j}\right)∥<∥x-c\left({S}_{i}\right)∥$. Thus a $k$-MEANS step can be broken into two stages: (i) every misclassified point is assigned to its closest center, and (ii) centers are moved to the centroids of their newly formed clusters. Lloyd’s algorithm, to which we shall refer as “$k$-MEANSMTD” throughout this paper, performs the $k$-MEANS step repeatedly and stops when the assignment of the points to the centers does not change from that of the previous step. This happens when there remain no misclassified points, and consequently in the last $k$-MEANS step ${\mathcal{S}}^{\prime }=\mathcal{S}$. Clearly the clustering cost is reduced when each point is mapped to the closest center and also when each center moves to the centroid of the points it serves. Thus, the clustering cost is strictly reduced in each of the two stages of a $k$-MEANS step. This in particular implies that no clustering can be seen twice during the course of execution of $k$-MEANSMTD. Since there are only finitely many $k$-clusterings, the algorithm terminates in finite time.
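For concreteness, the two-stage $k$-MEANS step and the stopping rule can be sketched as follows. This is an illustrative Python implementation, not the authors' code; ties are broken toward the lower-indexed center, and a center that loses all its points is simply left in place:

```python
def dist2(p, q):
    """Squared Euclidean distance between points p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(S):
    """Coordinate-wise mean of the points in S."""
    return tuple(sum(p[i] for p in S) / len(S) for i in range(len(S[0])))

def lloyd(points, centers):
    """Repeat k-MEANS steps until the assignment stops changing.

    Returns (final centers, final assignment, number of changing steps)."""
    centers = list(centers)
    assign, steps = None, 0
    while True:
        # Stage (i): assign every point to its closest center.
        new_assign = [min(range(len(centers)),
                          key=lambda i: dist2(p, centers[i]))
                      for p in points]
        if new_assign == assign:          # no point is misclassified: stop
            return centers, assign, steps
        assign, steps = new_assign, steps + 1
        # Stage (ii): move each center to the centroid of its cluster.
        for i in range(len(centers)):
            cluster = [p for p, a in zip(points, assign) if a == i]
            if cluster:                   # an emptied cluster keeps its center
                centers[i] = centroid(cluster)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
centers, assign, steps = lloyd(points, [(0.0, 0.0), (5.0, 0.0)])
```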

The algorithm $k$-MEANSMTD and its variants are widely used in practice [DHS01]. It is known that the output of $k$-MEANSMTD is not necessarily a global minimum, and it can be arbitrarily bad compared to the optimal clustering. Furthermore, the answer returned by the algorithm and the number of steps depend on the initial choice of the centers, i.e., the initial clustering. These shortcomings of $k$-MEANSMTD have led to the development of efficient polynomial approximation schemes for the $k$-means clustering problem both in low [Mat00, ES03, HM03] and high dimensions [dlVKKR03]. Unfortunately, those algorithms have had little impact in practice, as they are complicated and probably impractical because of large constants. A more practical local search algorithm, which guarantees a constant factor approximation, is described by Kanungo et al.

Up to this point, no meaningful theoretical bound was known for the number of steps $k$-MEANSMTD can take to terminate in the worst case. Inaba et al. [IKI94] observe that the number of distinct Voronoi partitions of a given $n$-point set $X\subset {\mathbb{R}}^{d}$ induced by $k$ sites is at most $O\left({n}^{kd}\right)$, which gives a trivial upper bound of the same order on the number of steps of $k$-MEANSMTD (by observing that the clustering cost monotonically decreases and thus no $k$-clustering can be seen twice). However, the fact that $k$ in typical applications can be in the hundreds, together with the relatively fast convergence of $k$-MEANSMTD observed in practice, makes this bound somewhat meaningless. The difficulty of proving any super-linear lower bound further suggests the looseness of this bound.

Our contribution. It thus appears that the combinatorial behavior of $k$-MEANSMTD is far from being well understood. Motivated by this, in this paper we provide a lower bound and upper bounds on the number of iterations performed by $k$-MEANSMTD. To our knowledge, our lower bound is the first that is super-polylogarithmic. Our upper bounds are polynomial in the spread $\Delta$ of the input point set, $k$, and $n$ (the spread of a point set is the ratio between its diameter and the distance between its closest pair). The bounds are meaningful for most inputs. In Section  2, we present an $\Omega \left(n\right)$ lower bound on the number of iterations performed by $k$-MEANSMTD. More precisely, we show that for an adversarially chosen set of $n$ points on the line and an adversarial initial placement of two centers, $k$-MEANSMTD takes $\Omega \left(n\right)$ steps. Note that this matches the straightforward upper bound on the number of Voronoi partitions in one dimension with two centers, which is $O\left(n\right)$.

In Section  3, we provide a polynomial upper bound for the one-dimensional case. In Section  4, we provide an upper bound for the case where the points lie on a grid. In Section  5, we investigate two alternative algorithms, and provide polynomial upper bounds on the number of iterations they perform. Those algorithms are minor modifications of the $k$-MEANSMTD algorithm, and we believe that their analysis provides insight into the behavior of $k$-MEANSMTD. Some experimental results are presented in Section  6. In Section  7, we conclude by mentioning a few open problems and a discussion of our results.

### 2 Lower Bound Construction for Two Clusters in One Dimension

In this section, we describe a set of $2n$ points, along with an initial pair of centers, on which $k$-MEANSMTD takes $\Omega \left(n\right)$ steps to terminate for $n\ge 2$.

Fix $n\ge 2$. Our set $X$ will consist of $2n$ numbers ${y}_{1}<\cdots <{y}_{n}<{x}_{n}<\cdots <{x}_{1}$ with ${y}_{i}=-{x}_{i}$, for $i=1,\dots ,n$.

At the $i$th iteration, we denote by ${l}_{i}$ and ${r}_{i}$ the current left and right centers, respectively, and by ${L}_{i}$ and ${R}_{i}$ the new sets of points assigned to ${l}_{i}$ and ${r}_{i}$, respectively. Furthermore, for each $i\ge 1$, we denote by ${\alpha }_{i}$ the Voronoi boundary $\frac{1}{2}\left({l}_{i}+{r}_{i}\right)$ between the centers ${l}_{i}$ and ${r}_{i}$. Thus ${L}_{i}$ and ${R}_{i}$ consist of the points of $X$ lying to the left and to the right of ${\alpha }_{i}$, respectively.

Let ${x}_{1}$ be an arbitrary positive real number and let ${x}_{2}<{x}_{1}$ be a positive real number to be specified shortly. Initially, we let ${l}_{1}={x}_{2}$ and ${r}_{1}={x}_{1}$ and consequently ${\alpha }_{1}=\frac{1}{2}\left({x}_{1}+{x}_{2}\right)$. Thus in the first iteration, ${L}_{1}=\left\{{y}_{1},\dots ,{y}_{n},{x}_{n},\dots ,{x}_{2}\right\}$ and ${R}_{1}=\left\{{x}_{1}\right\}$. We will choose ${x}_{2},\dots ,{x}_{n}$ such that at the end of the $i$th step we have ${L}_{i}=\left\{{y}_{1},\dots ,{y}_{n},{x}_{n},\dots ,{x}_{i+1}\right\}$ and ${R}_{i}=\left\{{x}_{i},\dots ,{x}_{1}\right\}$. Suppose for the inductive hypothesis that at the end of the $\left(i-1\right)$th step we have

${L}_{i-1}=\left\{{y}_{1},\dots ,{y}_{n},{x}_{n},\dots ,{x}_{i}\right\}\quad\text{and}\quad {R}_{i-1}=\left\{{x}_{i-1},\dots ,{x}_{1}\right\}.$

Thus we can compute ${l}_{i}$ and ${r}_{i}$ as follows:

${l}_{i}=\frac{{y}_{1}+\cdots +{y}_{n}+{x}_{n}+\cdots +{x}_{i}}{2n-i+1},\qquad {r}_{i}=\frac{{x}_{i-1}+\cdots +{x}_{1}}{i-1}.$

Since ${y}_{1}+\cdots +{y}_{n}+{x}_{n}+\cdots +{x}_{i}=-\left({x}_{i-1}+\cdots +{x}_{1}\right)$, we get for ${\alpha }_{i}$:

$\begin{array}{rcll}{\alpha }_{i}=\frac{1}{2}\left({l}_{i}+{r}_{i}\right)& =& \frac{1}{2}\phantom{\rule{0em}{0ex}}\left(\frac{{x}_{i-1}+\cdots +{x}_{1}}{i-1}-\frac{{x}_{i-1}+\cdots +{x}_{1}}{2n-i+1}\right)& \text{}\\ & =& \frac{n-i+1}{\left(i-1\right)\left(2n-i+1\right)}\left({x}_{i-1}+\cdots +{x}_{1}\right)=\frac{n-i+1}{\left(i-1\right)\left(2n-i+1\right)}\cdot {s}_{i-1},& \text{}\end{array}$

where ${s}_{i-1}={\sum }_{j=1}^{i-1}{x}_{j}$.

To guarantee that only ${x}_{i}$ deserts from ${L}_{i-1}$ to ${R}_{i}$ in the $i$th iteration, we need ${x}_{i+1}<{\alpha }_{i}<{x}_{i}$. Thus, it is natural to set ${x}_{i}={\tau }_{i}{\alpha }_{i}$, where ${\tau }_{i}>1$, for $i=1,\dots ,n$. Picking the coefficients ${\tau }_{1},\dots ,{\tau }_{n}$ is essentially the only part of this construction that is under our control. We set

${\tau }_{i}=1+\frac{1}{n-i+1}=\frac{n-i+2}{n-i+1},$

for $i=1,\dots ,n$. Since ${\tau }_{i}>1$, ${x}_{i}={\tau }_{i}{\alpha }_{i}>{\alpha }_{i}$, for $i=1,\dots ,n$. Next, we verify that ${x}_{i+1}<{\alpha }_{i}$. By definition,

$\begin{array}{rcll}{x}_{i+1}& =& {\tau }_{i+1}{\alpha }_{i+1}={\tau }_{i+1}\cdot \frac{n-i}{i\left(2n-i\right)}\cdot {s}_{i}& \text{}\\ & =& {\tau }_{i+1}\cdot \frac{n-i}{i\left(2n-i\right)}\cdot \left({x}_{i}+{s}_{i-1}\right)={\tau }_{i+1}\cdot \frac{n-i}{i\left(2n-i\right)}\phantom{\rule{0em}{0ex}}\left({\tau }_{i}^{-1}+\frac{\left(i-1\right)\left(2n-i+1\right)}{n-i+1}\right){\alpha }_{i}& \text{}\\ & =& \frac{n-i+1}{i\left(2n-i\right)}\phantom{\rule{0em}{0ex}}\left(\frac{n-i+1}{n-i+2}+\frac{\left(i-1\right)\left(2n-i+1\right)}{n-i+1}\right){\alpha }_{i}.& \text{}\end{array}$

It can be verified through elementary simplifications that the coefficient of ${\alpha }_{i}$ above is always less than $1$ implying that ${x}_{i+1}<{\alpha }_{i}<{x}_{i}$, for $i=1,\dots ,n-1$.

We can compute a recursive formula for ${x}_{i+1}$ in terms of ${x}_{i}$, as follows

$\begin{array}{rcll}{x}_{i+1}& =& {\tau }_{i+1}{\alpha }_{i+1}=\frac{n-i+1}{n-i}\cdot \frac{n-i}{i\left(2n-i\right)}\cdot {s}_{i}=\frac{n-i+1}{i\left(2n-i\right)}\cdot \left({x}_{i}+{s}_{i-1}\right)& \text{}\\ & =& \frac{n-i+1}{i\left(2n-i\right)}\phantom{\rule{0em}{0ex}}\left({x}_{i}+\frac{\left(i-1\right)\left(2n-i+1\right)}{n-i+1}\cdot {\alpha }_{i}\right)& \text{}\\ & =& \frac{n-i+1}{i\left(2n-i\right)}\phantom{\rule{0em}{0ex}}\left({x}_{i}+\frac{\left(i-1\right)\left(2n-i+1\right)}{n-i+1}\phantom{\rule{0em}{0ex}}{\left(1+\frac{1}{n-i+1}\right)}^{-1}{x}_{i}\right)& \text{}\\ & =& \frac{n-i+1}{i\left(2n-i\right)}\phantom{\rule{0em}{0ex}}\left(1+\frac{\left(i-1\right)\left(2n-i+1\right)}{n-i+1}\phantom{\rule{0em}{0ex}}\left(\frac{n-i+1}{n-i+2}\right)\right){x}_{i}.& \text{}\\ & =& \frac{n-i+1}{i\left(2n-i\right)}\phantom{\rule{0em}{0ex}}\left(1+\frac{\left(i-1\right)\left(2n-i+1\right)}{n-i+2}\right)\cdot {x}_{i},& \text{}\end{array}$

for $i=1,\dots ,n-1$. Thus letting ${c}_{i}=\frac{n-i+1}{i\left(2n-i\right)}\phantom{\rule{0em}{0ex}}\left(1+\frac{\left(i-1\right)\left(2n-i+1\right)}{n-i+2}\right)$ we get that

 ${x}_{i+1}={c}_{i}{x}_{i},$ (1)

for $i=1,\dots ,n-1$.
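Equation (1) makes the construction easy to generate and test. The following sketch (illustrative code, not from the paper) builds the point set with exact rational arithmetic and runs the two-center method on it; a point lying exactly on the Voronoi boundary is assumed to stay in its current cluster:

```python
from fractions import Fraction

def build_instance(n):
    """Build x_1 > x_2 > ... > x_n > 0 via Eq. (1); the point set is {±x_i}."""
    x = [Fraction(1)]
    for i in range(1, n):
        c_i = Fraction(n - i + 1, i * (2 * n - i)) * \
              (1 + Fraction((i - 1) * (2 * n - i + 1), n - i + 2))
        x.append(c_i * x[-1])
    points = sorted(-v for v in x) + x[::-1]   # y_1 < ... < y_n < x_n < ... < x_1
    return points, x

def two_means_steps(points, left, right):
    """Count the assignment-changing steps of Lloyd's method with two
    centers on the line; a boundary point stays with the left cluster."""
    assign, steps = None, 0
    while True:
        boundary = (left + right) / 2
        new_assign = [p > boundary for p in points]   # True = right cluster
        if new_assign == assign:
            return steps
        assign, steps = new_assign, steps + 1
        L = [p for p, a in zip(points, assign) if not a]
        R = [p for p, a in zip(points, assign) if a]
        if L:
            left = sum(L) / len(L)
        if R:
            right = sum(R) / len(R)

for n in (2, 3, 4, 6):
    points, x = build_instance(n)
    assert all(x[i] > x[i + 1] > 0 for i in range(n - 1))
    # Starting centers l_1 = x_2, r_1 = x_1: exactly n changing steps.
    assert two_means_steps(points, x[1], x[0]) == n
```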

Theorem 2.1 For each $n\ge 2$, there exists a set of $2n$ points on a line with two initial center positions for which $k$-MEANSMTD takes exactly $n$ steps to terminate.

#### 2.1 The Spread of the Point Set

It is interesting to examine the spread of the above construction. In particular, somewhat surprisingly, the spread of this construction is polynomial, hinting (at least intuitively) that “bad” inputs for $k$-MEANSMTD are not that contrived.

By Eq. (1), we have ${x}_{i+1}={c}_{i}{x}_{i}$. Notice that by the given construction ${c}_{i}<1$ for all $i=1,\dots ,n-1$, since ${x}_{i+1}<{x}_{i}$. In the sequel we will show that ${x}_{n}$ is only polynomially smaller than ${x}_{1}$, namely ${x}_{n}=\Omega \left({x}_{1}∕{n}^{4}\right)$. We then derive a bound on the distance between any consecutive pair ${x}_{i}$ and ${x}_{i+1}$. These two assertions combined imply that the point set has a spread bounded by $O\left({n}^{5}\right)$. The following lemma follows from elementary algebraic simplifications.

Lemma 2.2 For each $1\le i\le n∕2$, ${c}_{i}\ge {\left(1-1∕i\right)}^{2}$, and for each $n∕2<i\le n-1$, ${c}_{i}\ge {\left(1-1∕\left(n-i+1\right)\right)}^{2}$. Furthermore, for $i\ge 2$, we have ${c}_{i}\le 1-1∕\left(2i\right)$.

Corollary 2.3 For any $n\ge 2$ we have ${x}_{n}=\Omega \left({x}_{1}∕{n}^{4}\right)$. Proof: ${x}_{n}={c}_{1}\cdot {\prod }_{i=2}^{n-1}{c}_{i}\cdot {x}_{1}\ge {c}_{1}{x}_{1}\cdot {\prod }_{i=2}^{⌊n∕2⌋}\phantom{\rule{0em}{0ex}}{\left(1-\frac{1}{i}\right)}^{2}\cdot {\prod }_{i=⌊n∕2⌋+1}^{n-1}\phantom{\rule{0em}{0ex}}{\left(1-\frac{1}{n-i+1}\right)}^{2}$

$\begin{array}{rcll}& =& {c}_{1}{x}_{1}\cdot \phantom{\rule{0ex}{0ex}}\phantom{\rule{0em}{0ex}}{\left(1-\frac{1}{2}\right)}^{2}\dots \phantom{\rule{0ex}{0ex}}\phantom{\rule{0em}{0ex}}{\left(1-\frac{1}{⌊n∕2⌋}\right)}^{2}\cdot \phantom{\rule{0em}{0ex}}{\left(1-\frac{1}{⌊n∕2⌋}\right)}^{2}\dots \phantom{\rule{0ex}{0ex}}\phantom{\rule{0em}{0ex}}{\left(1-\frac{1}{2}\right)}^{2}& \text{}\\ & =& {c}_{1}{x}_{1}\cdot \phantom{\rule{0em}{0ex}}{\left({\prod }_{i=2}^{⌊n∕2⌋}\frac{{\left(i-1\right)}^{2}}{{i}^{2}}\right)}^{2}={c}_{1}{x}_{1}\cdot \phantom{\rule{0em}{0ex}}{\left(\frac{1}{⌊n∕2⌋}\right)}^{4}& \text{}\end{array}$

The claim follows as ${c}_{1}=n∕\left(2n-1\right)=\Theta \left(1\right)$. _

Lemma 2.4 For each $i=1,\dots ,n-1$, ${x}_{i}-{x}_{i+1}\ge {x}_{i}∕\left(3i\right)$. Proof: Since ${x}_{i+1}={c}_{i}{x}_{i}$, we have ${x}_{i}-{x}_{i+1}={x}_{i}\left(1-{c}_{i}\right)$. For $i=1$, we have ${c}_{1}=n∕\left(2n-1\right)\le 2∕3$, when $n\ge 2$. Thus, we have ${x}_{1}-{x}_{2}\ge {x}_{1}∕3$. For $i=2,\dots ,n-1$, using Lemma  2.2 we get $1-{c}_{i}\ge 1∕\left(2i\right)$. Thus, ${x}_{i}-{x}_{i+1}={x}_{i}\left(1-{c}_{i}\right)\ge {x}_{i}∕\left(2i\right)>{x}_{i}∕\left(3i\right)$, as claimed. _

Theorem 2.5 The spread of the point set constructed in Theorem  2.1 is $O\left({n}^{5}\right)$.

Proof: By Lemma  2.4, for each $i=1,\dots ,n-1$, ${x}_{i}-{x}_{i+1}\ge {x}_{i}∕\left(3i\right)$. Since ${x}_{i}>{x}_{n}$ and by Corollary  2.3, ${x}_{n}=\Omega \left({x}_{1}∕{n}^{4}\right)$, it follows that ${x}_{i}-{x}_{i+1}=\Omega \left({x}_{1}∕{n}^{5}\right)$. This lower bound on the distance between two consecutive points is also true for the ${y}_{i}$’s due to the symmetric construction of the point set around $0$. On the other hand, since ${x}_{n}=\Omega \left({x}_{1}∕{n}^{4}\right)$, ${x}_{n}-{y}_{n}=2{x}_{n}=\Omega \left({x}_{1}∕{n}^{4}\right)$. Thus every pair of points is at distance $\Omega \left({x}_{1}∕{n}^{5}\right)$. Since the diameter of the point set is $2{x}_{1}$, we get a bound of $O\left({n}^{5}\right)$ for the spread of the point set. _

### 3 An Upper Bound for One Dimension

In this section, we prove an upper bound on the number of steps of $k$-MEANSMTD in one-dimensional Euclidean space. As we shall see, the bound does not involve $k$ but is instead related to the spread $\Delta$ of the point set $X$. Without loss of generality we can assume that the closest pair of points in $X$ are at distance $1$ and thus the diameter of the set $X$ is $\Delta$. Before proving the upper bound, we mention a well-known technical lemma.

Lemma 3.1 Let $S$ be a set of points in ${\mathbb{R}}^{d}$ with centroid $c=c\left(S\right)$ and let $z$ be an arbitrary point in ${\mathbb{R}}^{d}$. Then $\psi \left(S,z\right)-\psi \left(S,c\right)={\sum }_{x\in S}\phantom{\rule{0em}{0ex}}\left({∥x-z∥}^{2}-{∥x-c∥}^{2}\right)=\left|S\right|\cdot {∥c-z∥}^{2}$.

The above lemma quantifies the contribution of a center ${c}_{i}$ to the cost improvement in a $k$-MEANS step as a function of the distance it moves. More formally, if a $k$-MEANS step changes a $k$-clustering $\mathcal{S}=\left({S}_{1},\dots ,{S}_{k}\right)$ into another $k$-clustering ${\mathcal{S}}^{\prime }=\left({S}_{1}^{\prime },\dots ,{S}_{k}^{\prime }\right)$, then

$\phi \left(\mathcal{S}\right)-\phi \left({\mathcal{S}}^{\prime }\right)\ge {\sum }_{i=1}^{k}\mid {S}_{i}^{\prime }\mid \cdot {∥c\left({S}_{i}^{\prime }\right)-c\left({S}_{i}\right)∥}^{2}.$

Note that in the above analysis we only consider the improvement resulting from the second stage of $k$-MEANS step in which the centers are moved to the centroids of their clusters. There is an additional gain from reassigning the points in the first stage of a $k$-MEANS step that we currently ignore.
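The identity of Lemma 3.1 is easy to verify numerically. The following one-dimensional check (illustrative values, exact arithmetic; not part of the paper) is a sketch:

```python
from fractions import Fraction

def psi(S, c):
    """Clustering cost of serving the one-dimensional set S from center c."""
    return sum((x - c) ** 2 for x in S)

S = [Fraction(v) for v in (0, 1, 3, 8)]
c = sum(S) / len(S)                  # the centroid of S, here 3
z = Fraction(7, 2)                   # an arbitrary alternative center
# Lemma 3.1: psi(S, z) - psi(S, c) = |S| * ||c - z||^2
assert psi(S, z) - psi(S, c) == len(S) * (c - z) ** 2
```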

In all our upper bound arguments we use the fact that if the initial set of centers is chosen from inside the convex hull of the input point set $X$ (even if this is not the case, all centers move inside the convex hull of $X$ after one step), the initial clustering cost is no more than $n{\Delta }^{2}$. This simply follows from the fact that each of the $n$ points in $X$ is at distance no more than $\Delta$ from its assigned center.

Theorem 3.2 The number of steps of $k$-MEANSMTD on a set $X\subset \mathbb{R}$ of $n$ points with spread $\Delta$ is at most $O\left(n{\Delta }^{2}\right)$.

Proof: Consider a $k$-MEANS step that changes a $k$-clustering $\mathcal{S}$ into another $k$-clustering ${\mathcal{S}}^{\prime }$. The crucial observation is that in this step, there exists a cluster that is only extended or shrunk at its right end. To see this, consider the leftmost cluster ${S}_{1}$. Either ${S}_{1}$ is modified in this step, in which case this modification can only happen in the form of an extension or shrinking at its right end, or it remains the same. In the latter case, the same argument can be made about ${S}_{2}$, and so on.

Thus assume that ${S}_{1}$ is extended on the right by receiving a set $T$ from the cluster directly to its right, namely ${S}_{2}$ (${S}_{2}$ cannot lose all its points to ${S}_{1}$ as it has at least one point to the right of ${c}_{2}$ and this point is closer to ${c}_{2}$ than to ${c}_{1}$ and cannot go to ${S}_{1}$). Notice that $c\left(T\right)$ is to the right of the leftmost point in $T$ and at distance at least $\left(\mid T\mid -1\right)∕2$ from this leftmost point (because every pair of points in $T$ are at distance one or more, and $c\left(T\right)$ gets closest to its leftmost point when every pair of consecutive points in $T$ are placed at the minimum distance of $1$ from each other). Similarly, the centroid of ${S}_{1}$ is to the left of the rightmost point of ${S}_{1}$ and at distance at least $\left(\mid {S}_{1}\mid -1\right)∕2$ from it. Thus, $∥c\left({S}_{1}\right)-c\left(T\right)∥\ge \left(\mid T\mid -1\right)∕2+\left(\mid {S}_{1}\mid -1\right)∕2+1=\left(\mid T\mid +\mid {S}_{1}\mid \right)∕2$, where the extra $1$ is added because the distance between the leftmost point in $T$ and the rightmost point in ${S}_{1}$ is at least $1$. The centroid of ${S}_{1}^{\prime }$ will therefore be at distance

$\frac{\mid T\mid }{\mid {S}_{1}\mid +\mid T\mid }∥c\left({S}_{1}\right)-c\left(T\right)∥\ge \frac{\mid T\mid }{\mid {S}_{1}\mid +\mid T\mid }\cdot \frac{\mid T\mid +\mid {S}_{1}\mid }{2}=\frac{\mid T\mid }{2}\ge \frac{1}{2}$

from $c\left({S}_{1}\right)$ and to its right. Consequently, by Lemma  3.1, the improvement in clustering cost is at least $1∕4$.

Similar analysis implies a similar improvement in the clustering cost for the case where we remove points from ${S}_{1}$. Since the initial clustering cost is at most $n{\Delta }^{2}$, the number of steps is no more than $n{\Delta }^{2}∕\left(1∕4\right)=4n{\Delta }^{2}$. _

Remark 3.3 The upper bound of Theorem  3.2 as well as all other upper bounds proved later in this paper can be slightly improved by observing that at the end of any $k$-MEANS step (or a substitute step used in the alternate algorithms considered later), we have a clustering $\mathcal{S}=\left({S}_{1},\dots ,{S}_{k}\right)$ of the input point set $X$ with centers ${c}_{1},\dots ,{c}_{k}$, respectively, where for each $i=1,\dots ,k$, ${c}_{i}=c\left({S}_{i}\right)$. Let $ĉ=c\left(X\right)$. By Lemma  3.1, we can write

$\psi \left({S}_{i},{c}_{i}\right)=\psi \left({S}_{i},ĉ\right)-\left|{S}_{i}\right|\cdot {∥ĉ-{c}_{i}∥}^{2},$

for $1\le i\le k$. Summing this equation, for $i=1,\dots ,k$, we have

$\phi \left(\mathcal{S}\right)={\sum }_{x\in X}{∥x-ĉ∥}^{2}-{\sum }_{i=1}^{k}\left|{S}_{i}\right|\cdot {∥ĉ-{c}_{i}∥}^{2}<{\sum }_{x\in X}{∥ĉ-x∥}^{2}=\frac{1}{n}{\sum }_{x,y\in X}{∥x-y∥}^{2}.$

Thus, we get an upper bound of $\frac{1}{n}{\sum }_{x,y\in X}{∥x-y∥}^{2}$ that can replace the trivial bound of $n{\Delta }^{2}$. Note that depending on the input, this improved upper bound can be smaller by a factor of $O\left(n\right)$ than the trivial bound. Nevertheless, in all our upper bound results we employ the weaker bound for the sake of readability, while all those bounds can be made more precise by applying the above-mentioned improvement.

Remark 3.4 A slight technical detail in the implementation of the $k$-MEANSMTD algorithm involves the event of a center losing all the points it serves. The original $k$-means heuristic does not specify a particular solution to this problem. Candidate strategies used in practice include: placing the lonely center somewhere else arbitrarily or randomly, leaving it where it is to perhaps acquire some points in future steps, or completely removing it. For the sake of convenience in our analysis, we adopt the last strategy, namely, whenever a center is left serving no points, we remove that center permanently and continue with the remaining centers.

### 4 Upper Bound for Points on a $d$-Dimensional Grid

In this section, we prove an upper bound on the number of steps of $k$-MEANSMTD when the input points belong to the integer grid ${\left\{1,\dots ,M\right\}}^{d}$. This is the case in many practical applications where every data point has a large number of fields with each field having values in a small discrete range. For example, this includes clustering of pictures, where every pixel forms a single coordinate (or three coordinates, corresponding to the RGB values) and the value of every coordinate is restricted to be an integer in the range 0–255.

The main observation is that the centroids of any two subsets of ${\left\{1,\dots ,M\right\}}^{d}$ are either equal or are suitably far away. Since each step of $k$-MEANSMTD moves at least one center or else stops, this observation guarantees a certain amount of improvement to the clustering cost in each step.

Lemma 4.1 Let ${S}_{1}$ and ${S}_{2}$ be two nonempty subsets of ${\left\{1,\dots ,M\right\}}^{d}$ with $\mid {S}_{1}\mid +\mid {S}_{2}\mid \le n$. Then, either $c\left({S}_{1}\right)=c\left({S}_{2}\right)$ or $∥c\left({S}_{1}\right)-c\left({S}_{2}\right)∥\ge 1∕{n}^{2}$.

Proof: If $c\left({S}_{1}\right)\ne c\left({S}_{2}\right)$ then they differ in at least one coordinate. Let ${u}_{1}$ and ${u}_{2}$ be the values of $c\left({S}_{1}\right)$ and $c\left({S}_{2}\right)$ in one such coordinate, respectively. By definition, ${u}_{1}={s}_{1}∕\mid {S}_{1}\mid$ and ${u}_{2}={s}_{2}∕\mid {S}_{2}\mid$ where ${s}_{1}$ and ${s}_{2}$ are integers in the range $\left\{1,\dots ,nM\right\}$. In other words $\mid {u}_{1}-{u}_{2}\mid$ is the difference of two distinct fractions, both with denominators less than $n$. It follows that $\mid {u}_{1}-{u}_{2}\mid \ge 1∕{n}^{2}$ and consequently $∥c\left({S}_{1}\right)-c\left({S}_{2}\right)∥\ge \mid {u}_{1}-{u}_{2}\mid \ge 1∕{n}^{2}$. _
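The separation bound of Lemma 4.1 can be checked exhaustively on a small one-dimensional grid. The following sketch (illustrative parameters $M = n = 4$, not from the paper) compares the centroids of all small subsets of $\{1,\dots,M\}$:

```python
from fractions import Fraction
from itertools import chain, combinations

M, n = 4, 4
grid = range(1, M + 1)
# All nonempty subsets of the grid of size at most n.
subsets = chain.from_iterable(combinations(grid, r) for r in range(1, n + 1))
centroids = {Fraction(sum(S), len(S)) for S in subsets}
# Distinct centroids are fractions with denominators at most n, so any two
# of them are at least 1/n^2 apart.
gap = min(abs(a - b) for a in centroids for b in centroids if a != b)
assert gap >= Fraction(1, n * n)
```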

Theorem 4.2 The number of steps of $k$-MEANSMTD when executed on a point set $X$ taken from the grid ${\left\{1,\dots ,M\right\}}^{d}$ is at most $d{n}^{5}{M}^{2}$.

Proof: Note that $U=n\cdot {\left(\sqrt{d}M\right)}^{2}=nd{M}^{2}$ is an upper bound on the clustering cost of any $k$-clustering of a point set in ${\left\{1,\dots ,M\right\}}^{d}$, and that at each step at least one center moves by at least $1∕{n}^{2}$. Therefore, by Lemma  3.1, at every step the cost function decreases by at least $1∕{n}^{4}$, and the overall number of steps can be no more than $U∕\left(1∕{n}^{4}\right)=d{n}^{5}{M}^{2}$. _

### 5 Arbitrary Point Sets and Alternative Algorithms

Unfortunately, proving any meaningful bound for the general case of $k$-MEANSMTD, namely with points in ${\mathbb{R}}^{d}$ with $d>1$ and no further restrictions, remains elusive. However, in this section, we present two close relatives of $k$-MEANSMTD for which we can prove polynomial bounds on the number of steps. The first algorithm differs from $k$-MEANSMTD in that it moves a misclassified point to its correct cluster as soon as the misclassified point is discovered (rather than first finding all misclassified points and then reassigning them to their closest centers, as is the case in $k$-MEANSMTD). The second algorithm is basically the same as $k$-MEANSMTD with a naturally generalized notion of misclassified points. Our experimental results (Section  6) further support the kinship of these two algorithms with $k$-MEANSMTD.

As was the case with our previous upper bounds, our main approach in bounding the number of steps in both these algorithms is through showing substantial improvements in the clustering cost at each step.

#### 5.1 The SINGLEPNT Algorithm

We introduce an alternative to the $k$-MEANS step which we shall call a SINGLEPNT step.

Definition 5.1 In a SINGLEPNT step on a $k$-clustering $\mathcal{S}=\left({S}_{1},\dots ,{S}_{k}\right)$, a misclassified point $x$ is chosen, such that $x\in {S}_{i}$ and $∥x-c\left({S}_{j}\right)∥<∥x-c\left({S}_{i}\right)∥$, for some $1\le i\ne j\le k$, and a new clustering ${\mathcal{S}}^{\prime }=\left({S}_{1}^{\prime },\dots ,{S}_{k}^{\prime }\right)$ is formed by removing $x$ from ${S}_{i}$ and adding it to ${S}_{j}$. Formally, for each $1\le l\le k$,

${S}_{l}^{\prime }=\begin{cases}{S}_{l}& l\ne i,j,\\ {S}_{l}\setminus \left\{x\right\}& l=i,\\ {S}_{l}\cup \left\{x\right\}& l=j.\end{cases}$

The centers are updated to the centroids of the clusters, and therefore only the centers of ${S}_{i}$ and ${S}_{j}$ change. Note that updating the centers takes constant time.

In a SINGLEPNT step, if the misclassified point is far away from at least one of $c\left({S}_{i}\right)$ and $c\left({S}_{j}\right)$, then the improvement in clustering cost made in the SINGLEPNT step cannot be too small.

Lemma 5.2 Let $S$ and $T$ be two point sets of sizes $n$ and $m$, respectively, and let $s=c\left(S\right)$ and $t=c\left(T\right)$. Suppose that $x$ is a point in $T$ with distances ${d}_{S}$ and ${d}_{T}$ from $s$ and $t$, respectively, and such that ${d}_{S}<{d}_{T}$. Let ${S}^{\prime }=S\cup \left\{x\right\}$ and ${T}^{\prime }=T\setminus \left\{x\right\}$ and let ${s}^{\prime }=c\left({S}^{\prime }\right)$ and ${t}^{\prime }=c\left({T}^{\prime }\right)$. Then $\psi \left(S,s\right)+\psi \left(T,t\right)-\psi \left({S}^{\prime },{s}^{\prime }\right)-\psi \left({T}^{\prime },{t}^{\prime }\right)\ge {\left({d}_{S}+{d}_{T}\right)}^{2}∕\left(2\left(n+m\right)\right)$. Proof: Indeed, $c\left({S}^{\prime }\right)=\frac{n}{n+1}c\left(S\right)+\frac{1}{n+1}x$. Thus

$∥s-{s}^{\prime }∥=∥c\left(S\right)-c\left({S}^{\prime }\right)∥=∥\frac{1}{n+1}c\left(S\right)-\frac{1}{n+1}x∥=\frac{1}{n+1}∥c\left(S\right)-x∥=\frac{{d}_{S}}{n+1}.$

Similarly, $∥t-{t}^{\prime }∥={d}_{T}∕\left(m-1\right)$. Thus using Lemma  3.1 we get

$\psi \left({S}^{\prime },s\right)-\psi \left({S}^{\prime },{s}^{\prime }\right)=\left(n+1\right)\phantom{\rule{0em}{0ex}}{\left(\frac{{d}_{S}}{n+1}\right)}^{2}=\frac{{d}_{S}^{2}}{n+1},$

and similarly $\psi \left({T}^{\prime },t\right)-\psi \left({T}^{\prime },{t}^{\prime }\right)={d}_{T}^{2}∕\left(m-1\right)$.

Since ${d}_{S}<{d}_{T}$, we have that $\psi \left(S,s\right)+\psi \left(T,t\right)\ge \psi \left({S}^{\prime },s\right)+\psi \left({T}^{\prime },t\right)$, and

$\begin{array}{rcll}& & \phantom{\rule{-56.9055pt}{0ex}}\psi \left(S,s\right)+\psi \left(T,t\right)-\psi \left({S}^{\prime },{s}^{\prime }\right)-\psi \left({T}^{\prime },{t}^{\prime }\right)\ge \psi \left({S}^{\prime },s\right)+\psi \left({T}^{\prime },t\right)-\psi \left({S}^{\prime },{s}^{\prime }\right)-\psi \left({T}^{\prime },{t}^{\prime }\right)& \text{}\\ & \ge & \frac{{d}_{S}^{2}}{n+1}+\frac{{d}_{T}^{2}}{m-1}\ge \frac{{d}_{S}^{2}}{n+m}+\frac{{d}_{T}^{2}}{n+m}=\frac{{d}_{S}^{2}+{d}_{T}^{2}}{n+m}\ge \frac{{\left({d}_{S}+{d}_{T}\right)}^{2}}{2\left(n+m\right)}.& \text{}\end{array}$

_

Our modified version of $k$-MEANSMTD, to which we shall refer as “SINGLEPNT”, replaces $k$-MEANS steps with SINGLEPNT steps. Starting from an arbitrary clustering of the input point set, SINGLEPNT repeatedly performs SINGLEPNT steps until no misclassified points remain. Notice that unlike the $k$-MEANS step, the SINGLEPNT step does not maintain the property that the clustering achieved at the end of the step is imposed by some Voronoi diagram. However, when the algorithm stops, no misclassified points remain, and hence this property does hold for the final clustering, since otherwise further steps would be possible.
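SINGLEPNT can be sketched as follows (an illustrative Python implementation, not the authors' code; for simplicity, an emptied cluster keeps its stale center rather than being removed, and the misclassified point is found by a scan rather than in constant time):

```python
def dist2(p, q):
    """Squared Euclidean distance between points p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(S):
    """Coordinate-wise mean of the points in S."""
    return tuple(sum(p[i] for p in S) / len(S) for i in range(len(S[0])))

def single_pnt(clusters):
    """Move one misclassified point at a time to its closest center,
    updating only the two affected centroids, until none remains.

    `clusters` is a list of lists of points; returns (clusters, steps)."""
    centers = [centroid(S) for S in clusters]
    steps = 0
    while True:
        move = None
        for i, S in enumerate(clusters):
            for p in S:
                j = min(range(len(centers)), key=lambda t: dist2(p, centers[t]))
                if dist2(p, centers[j]) < dist2(p, centers[i]):
                    move = (i, j, p)      # p is misclassified: S_i -> S_j
                    break
            if move:
                break
        if move is None:                  # no misclassified point remains
            return clusters, steps
        i, j, p = move
        clusters[i].remove(p)
        clusters[j].append(p)
        if clusters[i]:                   # an emptied cluster keeps its center
            centers[i] = centroid(clusters[i])
        centers[j] = centroid(clusters[j])
        steps += 1

clusters, steps = single_pnt([[(0.0,), (1.0,), (5.0,)], [(6.0,)]])
```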

Theorem 5.3 On any input $X\subset {\mathbb{R}}^{d}$, SINGLEPNT makes at most $O\left(k{n}^{2}{\Delta }^{2}\right)$ steps before termination.

Proof: Once again, we assume that no two points in $X$ are less than unit distance apart. Call a SINGLEPNT step weak if the misclassified point it considers is at distance less than $1∕8$ from both involved centers, i.e., its current center and the center closest to it. We call a SINGLEPNT step strong if it is not weak. Lemma  5.2 shows that in a strong SINGLEPNT step the clustering cost improves by at least $1∕\left(128n\right)$. In the sequel we shall show that the algorithm cannot take more than $k$ consecutive weak steps, and thus at least one out of every $k+1$ consecutive steps must be strong, and thus results in an improvement of $1∕\left(128n\right)$ in the clustering cost; hence the upper bound of $O\left(k{n}^{2}{\Delta }^{2}\right)$.

Consider an arbitrary point in time during the execution of the algorithm: let ${c}_{1},\dots ,{c}_{k}$ denote the current centers, and let ${S}_{1},\dots ,{S}_{k}$ denote the corresponding clusters; namely, ${S}_{i}$ is the set of points served by ${c}_{i}$, for $i=1,\dots ,k$. Consider the balls ${B}_{1},\dots ,{B}_{k}$ of radius $1∕8$ centered at ${c}_{1},\dots ,{c}_{k}$, respectively. Observe that since every pair of points in $X$ are at distance at least $1$ from each other, each ball ${B}_{i}$ can contain at most one point of $X$. Moreover, the intersection of any subset of the balls ${B}_{1},\dots ,{B}_{k}$ can contain at most one point of $X$. For a point $x\in X$, let $\mathcal{B}\left(x\right)$ denote the set of balls among ${B}_{1},\dots ,{B}_{k}$ that contain the point $x$. We refer to $\mathcal{B}\left(x\right)$ as the batch of $x$.

By the above observation, the balls (and the corresponding centers) are classified according to the point of $X$ they contain (if they contain such a point at all). Let ${\mathcal{B}}_{X}$ be the set of batches of balls that are induced by $X$ and contain more than one ball. Formally, ${\mathcal{B}}_{X}=\left\{\mathcal{B}\left(x\right):x\in X,\left|\mathcal{B}\left(x\right)\right|>1\right\}$. The set of balls $\bigcup {\mathcal{B}}_{X}$ is the set of active balls.

A misclassified point $x$ can participate in a weak SINGLEPNT step only if it belongs to more than one ball, i.e., when $\left|\mathcal{B}\left(x\right)\right|>1$. Observe that if we perform a weak step, and one of the centers moves such that the corresponding ball ${B}_{i}$ no longer contains any point of $X$ in its interior, then for ${B}_{i}$ to contain a point again, the algorithm must perform a strong step. To see this, observe that (weakly) losing a point $x$ may cause a center to move a distance of at most $1∕8$. Therefore, once a center ${c}_{i}$ loses a point $x$, and thus moves away from $x$, it does not move far enough for the ball ${B}_{i}$ to contain a different point of $X$.

Hence, in every weak iteration a point $x$ changes the cluster it belongs to within its batch $\mathcal{B}\left(x\right)$. This might result in a shrinking of the set of active balls. On the other hand, while only weak SINGLEPNT steps are being taken, any cluster ${S}_{j}$ can change only by winning or losing the single point $x$ that stabs the corresponding ball ${B}_{j}$. It follows that once a cluster ${S}_{j}$ loses the point $x$, it can never win it back, since that would correspond to an increase in the clustering cost. Therefore the total number of possible consecutive weak SINGLEPNT steps is bounded by ${\sum }_{x\in X,\left|\mathcal{B}\left(x\right)\right|>1}\left|\mathcal{B}\left(x\right)\right|\le k$. □

#### 5.2 The LAZY-$k$-MEANS algorithm

Our second variant of $k$-MEANSMTD, which we name “LAZY-$k$-MEANS”, results from a natural generalization of the notion of a misclassified point (Definition 1.1). Intuitively, the difference between LAZY-$k$-MEANS and $k$-MEANSMTD is that at each step LAZY-$k$-MEANS reassigns to their closest centers only those points that are substantially misclassified, namely the points that benefit from reclassification by at least a constant factor.

Definition 5.4 Given a clustering $\mathcal{S}=\left({S}_{1},\dots ,{S}_{k}\right)$ of a point set $X$, if for a point $x\in {S}_{i}$ there exists a $j\ne i$, such that $∥x-c\left({S}_{i}\right)∥>\left(1+\varepsilon \right)∥x-c\left({S}_{j}\right)∥$, then $x$ is said to be $\left(1+\varepsilon \right)$-misclassified for center pair $\left(c\left({S}_{i}\right),c\left({S}_{j}\right)\right)$. The centers $c\left({S}_{i}\right)$ and $c\left({S}_{j}\right)$ are referred to as switch centers for $x$. We also say that $c\left({S}_{i}\right)$ is the losing center and $c\left({S}_{j}\right)$ is the winning center for $x$.

LAZY-$k$-MEANS with parameter $\varepsilon$ starts with an arbitrary $k$-clustering. In each step, it (i) reassigns every $\left(1+\varepsilon \right)$-misclassified point to its closest center and (ii) moves every center to the centroid of its new cluster. Indeed, $k$-MEANSMTD is simply LAZY-$k$-MEANS with parameter $\varepsilon =0$. Naturally, the algorithm stops when no $\left(1+\varepsilon \right)$-misclassified points are left.
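A single step of this procedure can be sketched as follows; this is our own illustrative Python (the name `lazy_kmeans_step` is ours), a sketch rather than the authors' implementation. Note that the $(1+\varepsilon)$ test on distances becomes a $(1+\varepsilon)^2$ test on squared distances, and that with `eps = 0` the step degenerates to an ordinary Lloyd ($k$-MEANSMTD) step.

```python
def lazy_kmeans_step(points, assignment, centers, eps):
    """One Lazy-k-means step (sketch): reassign every (1+eps)-misclassified
    point to its closest center, then move each center to its centroid.
    Returns (new_assignment, new_centers, number_of_points_reassigned)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    k = len(centers)
    new_assignment = list(assignment)
    moved = 0
    for i, p in enumerate(points):
        cur = assignment[i]
        best = min(range(k), key=lambda j: dist2(p, centers[j]))
        # (1+eps)-misclassified: we compare squared distances, hence (1+eps)^2
        if dist2(p, centers[cur]) > (1 + eps) ** 2 * dist2(p, centers[best]):
            new_assignment[i] = best
            moved += 1
    new_centers = []
    d = len(points[0])
    for j in range(k):
        members = [points[i] for i in range(len(points)) if new_assignment[i] == j]
        if members:
            new_centers.append(tuple(sum(q[t] for q in members) / len(members)
                                     for t in range(d)))
        else:
            new_centers.append(centers[j])   # empty cluster: keep the old center
    return new_assignment, new_centers, moved
```

With `eps = 0.2`, a point only marginally closer to another center stays put, while a strongly misclassified point is reassigned.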

In the sequel we bound the maximum number of steps taken by LAZY-$k$-MEANS. We shall use the following fact from elementary Euclidean geometry.

Fact 5.5 Given two points $c$ and ${c}^{\prime }$ with $∥c-{c}^{\prime }∥=\ell$, the locus of the points $x$ with $∥x-{c}^{\prime }∥>\left(1+\varepsilon \right)∥x-c∥$ is an open ball of radius $R=\ell \left(1+\varepsilon \right)∕\left(\varepsilon \left(2+\varepsilon \right)\right)$ called the $\varepsilon$-Apollonius ball for $c$ with respect to ${c}^{\prime }$. This ball is centered on the line containing the segment $c{c}^{\prime }$ at distance $R+\ell \varepsilon ∕\left(2\left(2+\varepsilon \right)\right)$ from the bisector of $c{c}^{\prime }$, and on the same side of the bisector as $c$.
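The radius and center location in Fact 5.5 can be sanity-checked numerically on the real line. The snippet below (our own illustration; the name `apollonius_ball` is ours) places $c$ at $0$ and ${c}^{\prime}$ at $\ell$ and verifies that both boundary points of the stated ball satisfy $∥x-{c}^{\prime}∥=\left(1+\varepsilon\right)∥x-c∥$.

```python
def apollonius_ball(ell, eps):
    """Radius and (signed) center position of the eps-Apollonius ball for c
    with respect to c', with c at 0 and c' at ell on the real line."""
    R = ell * (1 + eps) / (eps * (2 + eps))
    # The center lies at distance R + ell*eps/(2*(2+eps)) from the bisector
    # (which sits at ell/2), on the same side of the bisector as c:
    center = ell / 2 - (R + ell * eps / (2 * (2 + eps)))
    return R, center

ell, eps = 3.0, 0.25
R, center = apollonius_ball(ell, eps)
for x in (center - R, center + R):    # the two boundary points of the ball
    # on the boundary the Apollonius ratio is exactly 1 + eps
    assert abs(abs(x - ell) - (1 + eps) * abs(x)) < 1e-9
```

The boundary point between the two centers comes out at $\ell∕\left(2+\varepsilon\right)$ and the far one at $-\ell∕\varepsilon$, as direct computation of the ratio condition predicts.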

Lemma 5.6 For any three points $x$, $c$, and ${c}^{\prime }$ in ${\mathbb{R}}^{d}$ with $∥x-c∥\le ∥x-{c}^{\prime }∥$ we have ${∥x-{c}^{\prime }∥}^{2}-{∥x-c∥}^{2}=2h∥c-{c}^{\prime }∥$, where $h$ is the distance from $x$ to the bisector of $c$ and ${c}^{\prime }$.

Proof: Let $y$ be the intersection point of the segment $c{c}^{\prime }$ with the $\left(d-1\right)$-dimensional hyperplane parallel to the bisector of $c$ and ${c}^{\prime }$ and containing $x$. By the Pythagorean theorem we have ${∥x-c∥}^{2}={∥x-y∥}^{2}+{∥y-c∥}^{2}$ and ${∥x-{c}^{\prime }∥}^{2}={∥x-y∥}^{2}+{∥y-{c}^{\prime }∥}^{2}$. Subtracting the first equality from the second, we obtain

$\begin{array}{rcll}{∥x-{c}^{\prime }∥}^{2}-{∥x-c∥}^{2}& =& {∥y-{c}^{\prime }∥}^{2}-{∥y-c∥}^{2}& \text{}\\ & =& \left(∥y-{c}^{\prime }∥+∥y-c∥\right)\left(∥y-{c}^{\prime }∥-∥y-c∥\right)& \text{}\\ & =& 2h∥c-{c}^{\prime }∥,& \text{}\end{array}$

since ${∥y-{c}^{\prime }∥+∥y-c∥}=∥c-{c}^{\prime }∥$ and, as $y$ lies on the same side of the bisector as $c$, we have $∥y-{c}^{\prime }∥-∥y-c∥=2h$. □
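As a quick numeric sanity check of the identity (our own illustration, not part of the proof), one can sample random configurations in the plane:

```python
import math
import random

def check_bisector_identity(trials=100, seed=1):
    """Numerically check ||x-c'||^2 - ||x-c||^2 = 2*h*||c-c'||, where h is the
    (signed) distance from x to the bisector of c and c', positive on c's side.
    The signed form holds in general and matches the lemma when ||x-c|| <= ||x-c'||."""
    rng = random.Random(seed)
    for _ in range(trials):
        c = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        cp = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        x = (rng.uniform(-5, 5), rng.uniform(-5, 5))
        ell = math.dist(c, cp)
        if ell < 1e-6:          # degenerate pair; skip
            continue
        mid = ((c[0] + cp[0]) / 2, (c[1] + cp[1]) / 2)
        u = ((c[0] - cp[0]) / ell, (c[1] - cp[1]) / ell)  # unit vector toward c
        h = (x[0] - mid[0]) * u[0] + (x[1] - mid[1]) * u[1]
        lhs = math.dist(x, cp) ** 2 - math.dist(x, c) ** 2
        assert abs(lhs - 2 * h * ell) < 1e-9
    return True
```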

Theorem 5.7 The number of steps of LAZY-$k$-MEANS with parameter $\varepsilon$ is $O\left(n{\Delta }^{2}{\varepsilon }^{-3}\right)$.

Proof: We will show that every two consecutive steps of LAZY-$k$-MEANS with parameter $\varepsilon$ make an improvement of at least

${\lambda }^{\ast }=\frac{{\varepsilon }^{3}\left(2+\varepsilon \right)}{256{\left(1+\varepsilon \right)}^{2}}\ge \frac{{\varepsilon }^{3}}{512}=\Omega \left({\varepsilon }^{3}\right).$

Let ${\ell }_{0}=\varepsilon \left(2+\varepsilon \right)∕\left(16\left(1+\varepsilon \right)\right)$. Notice that ${\ell }_{0}<1∕8$ for $0<\varepsilon \le 1$. We call a misclassified point $x$ strongly misclassified if its switch centers $c$ and ${c}^{\prime }$ are at distance at least ${\ell }_{0}$ from each other, and weakly misclassified otherwise.
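A quick check of the constants (our own illustration; the names `lam_star` and `ell0` are ours) confirms that they behave as claimed for $0<\varepsilon \le 1$, and that the strong-step improvement ${\ell }_{0}^{2}\varepsilon ∕\left(2+\varepsilon \right)$ used below equals ${\lambda }^{\ast }$ exactly:

```python
def lam_star(eps):
    """Improvement bound lambda* from the proof of Theorem 5.7."""
    return eps ** 3 * (2 + eps) / (256 * (1 + eps) ** 2)

def ell0(eps):
    """The distance threshold ell_0 separating strong and weak misclassification."""
    return eps * (2 + eps) / (16 * (1 + eps))

for eps in (0.01, 0.1, 0.5, 1.0):
    assert lam_star(eps) >= eps ** 3 / 512       # lambda* >= eps^3 / 512
    assert ell0(eps) < 1 / 8                     # ell_0 < 1/8 for 0 < eps <= 1
    # the strong-step bound ell_0^2 * eps / (2 + eps) equals lambda* exactly
    assert abs(ell0(eps) ** 2 * eps / (2 + eps) - lam_star(eps)) < 1e-12
```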

If at the beginning of a LAZY-$k$-MEANS step there exists a strongly misclassified point $x$ for a center pair $\left(c,{c}^{\prime }\right)$, then since every point in the $\varepsilon$-Apollonius ball for ${c}^{\prime }$ with respect to $c$ is at distance at least ${\ell }_{0}\varepsilon ∕\left(2\left(2+\varepsilon \right)\right)$ from the bisector of $c{c}^{\prime }$, by Lemma  5.6 the reclassification improvement in clustering cost resulting from assigning $x$ to ${c}^{\prime }$ is

${∥x-c∥}^{2}-{∥x-{c}^{\prime }∥}^{2}\ge \frac{{\ell }_{0}^{2}\varepsilon }{2+\varepsilon }=\frac{{\varepsilon }^{3}\left(2+\varepsilon \right)}{256{\left(1+\varepsilon \right)}^{2}}={\lambda }^{\ast }.$

Thus we assume that all misclassified points are weakly misclassified. Let $x$ be one such point for center pair $\left(c,{c}^{\prime }\right)$. By our assumption $\ell =∥c-{c}^{\prime }∥<{\ell }_{0}$. Observe that in this case the radius of the $\varepsilon$-Apollonius ball for ${c}^{\prime }$ with respect to $c$ is $\ell \left(1+\varepsilon \right)∕\left(\varepsilon \left(2+\varepsilon \right)\right)<{\ell }_{0}\left(1+\varepsilon \right)∕\left(\varepsilon \left(2+\varepsilon \right)\right)=1∕16$. In particular, since there exists a ball of radius $1∕16$ containing both $x$ and ${c}^{\prime }$, the ball of radius $1∕8$ centered at ${c}^{\prime }$, which we denote by $B\left({c}^{\prime },1∕8\right)$, includes $x$. Also, since $∥c-{c}^{\prime }∥<{\ell }_{0}<1∕8$, we get $c\in B\left({c}^{\prime },1∕8\right)$ as well. In other words, both switch centers $c$ and ${c}^{\prime }$ are at distance less than $1∕4$ from $x$. Now, since every pair of points in $X$ are at distance $1$ or more, any center can be a switch center for at most one weakly misclassified point. This in particular implies that in the considered LAZY-$k$-MEANS step, no cluster is modified by more than a single point.

When the misclassified points are assigned to their closest centers, the centers that do not lose or win any points stay at their previous locations. A center ${c}^{\prime }$ that wins a point $x$ moves closer to $x$, since $x$ is the only point it wins while losing no other points. Similarly, a center $c$ that loses a point $x$ moves away from $x$, since $x$ is the only point it loses without winning any other points. A losing center $c$ moves away from its lost point $x$ by a distance of at most $∥c-x∥<1∕4$, since its previous number of served points was at least $2$ (otherwise, we would have $c=x$ and thus $x$ could not be misclassified). Therefore, when $c$ moves to the centroid of its cluster (now missing $x$), we still have $∥x-c∥<1∕2$, and consequently $∥c-y∥>1∕2$ for any $y\in X$ with $y\ne x$. As a result, $c$ cannot be a switch center for any weakly misclassified point in the subsequent LAZY-$k$-MEANS step.

On the other hand, the winning center ${c}^{\prime }$ to whose cluster $x$ is added, moves closer to $x$ and since no center other than $c$ and ${c}^{\prime }$ in $B\left(x,1∕4\right)$ moves (as there is no point other than $x$ they can win or lose), $x$ will not be misclassified in the next LAZY-$k$-MEANS step.

It follows from the above discussion that the next LAZY-$k$-MEANS step cannot have any weakly misclassified points; thus either the algorithm stops or some strongly misclassified point exists, resulting in an improvement of at least ${\lambda }^{\ast }$. Thus the total number of steps taken by LAZY-$k$-MEANS with parameter $\varepsilon$ is at most $2n{\Delta }^{2}∕{\lambda }^{\ast }=O\left(n{\Delta }^{2}{\varepsilon }^{-3}\right)$. □

### 6 Experimental Results

We introduced SINGLEPNT and LAZY-$k$-MEANS as alternatives to $k$-MEANSMTD: similar, equally easy-to-implement algorithms that are simpler to analyze than $k$-MEANSMTD itself. However, as mentioned in the introduction, $k$-MEANSMTD is of interest mainly in practice, because of its ease of implementation and its relatively fast termination (small number of steps). This raises the question of how our alternative algorithms perform in practice in comparison to $k$-MEANSMTD.

We performed a series of experiments analogous to those done in [KMN${}^{+}$02], as described below, to compare the number of rounds, the number of reclassified points, and the quality of the final clustering produced by these two alternative algorithms with those of $k$-MEANSMTD. We use the same inputs used by Kanungo et al. for our experiments; see [KMN${}^{+}$02] for a detailed description of those inputs. We have tried to implement each of the algorithms in the simplest possible way, avoiding any advanced point-location or nearest-neighbor search structure. Due to the great similarity between the three algorithms considered here, we expect any technique used for improving the performance of one of these algorithms to be suitable for improving the other two in a similar way.

The algorithms $k$-MEANSMTD and LAZY-$k$-MEANS iterate over the points and assign each point to its closest center. While doing this, the new set of centers is calculated and the existence of a $\left(1+\varepsilon \right)$-misclassified point is checked. SINGLEPNT examines the points one by one, moving back to the first point upon reaching the end of the list, and checks whether each is misclassified. When a misclassified point is discovered, it is assigned to its closest center and the locations of the two switch centers are updated. The algorithm stops when it cannot find a misclassified point for $n$ consecutive steps.

The input used in these experiments together with the source-code of our implementation is available at [Sad04].

Our experimental results are summarized in Table 1 and Table 2. The cost reported in these tables is the total final clustering cost divided by the number of points; in that sense we report the “average” cost per point. Table 1 is produced by running each of the four algorithms once, with the same set of randomly chosen centers, for each combination of point set and number of centers considered. By studying several such tables, it seems that the total number of reclassified points and the quality of the clustering found by SINGLEPNT tend to be very close to those of $k$-MEANSMTD. Notice that in Table 1 the number of steps of SINGLEPNT is left blank, as it equals the number of reclassified points and cannot be compared with the number of steps of $k$-MEANSMTD or LAZY-$k$-MEANS.

Table 2 summarizes the results of running 100 tests similar to the one reported in Table 1, each with a different initial set of centers picked randomly from the bounding box of the given point set. The best, worst, and average final clustering costs are reported in each case.

We have not discussed running times, as we made no effort to optimize our implementations. It is however interesting that both alternative algorithms tend to be faster than $k$-MEANSMTD in a typical implementation such as ours; SINGLEPNT seems to be typically more than 20% faster than Lloyd's method. In particular, we emphasize that our simple implementation is considerably slower than the implementation of Kanungo et al., which uses a data structure similar to a $kd$-tree to speed up the computation of the Voronoi partitions. We believe that we would get similar performance gains by using their data structure.

### 7 Conclusions

We presented several results on the number of iterations performed by the $k$-MEANSMTD clustering algorithm. To our knowledge, our results are the first to provide combinatorial bounds on the performance of $k$-MEANSMTD. We also suggested related variants of the $k$-MEANSMTD algorithm, and proved upper bounds on their performance. We implemented those algorithms and compared their performance in practice [Sad04]. We conjecture that the upper bounds we proved for SINGLEPNT hold also for $k$-MEANSMTD. Perhaps the most surprising aspect of these bounds is the lack of dependence on the dimension of the data in the bounds on the number of iterations performed.

We consider this paper to be a first step in understanding Lloyd's method. It is our belief that both our lower and upper bounds are loose, and one might need other techniques to improve them. In particular, we mention some open problems:

1. There is still a large gap between our lower and upper bounds. In particular, a super-linear lower bound would be interesting even in high-dimensional space.
2. Our current upper bounds include the spread as a parameter. It would be interesting to prove (or disprove) that this is indeed necessary.
3. We have introduced alternative, easy to analyze algorithms, that are comparable to $k$-MEANSMTD both in their description and their behavior in practice. It would be interesting to show provable connections between these algorithms and compare the bounds on the number of steps they require to terminate.

#### 7.1 Dependency on the spread

A shortcoming of our results is the dependency of the presented bounds on the spread of the point set. However:

1. This can be resolved by a preprocessing stage that snaps together points close to each other and breaks the input into several parts to be clustered separately. This is essentially what fast provable approximation algorithms for TSP, $k$-means, and $k$-median do [Aro98, HM03]. This results in point sets with polynomial spread, which can be used instead of the original input to compute a good clustering. This is outside the scope of our analysis, but it can be used in practice to speed up the $k$-MEANSMTD algorithm.
2. In high dimensions, it seems that in many natural cases the spread tends to shrink and be quite small. As such, we expect our bounds to be meaningful in such cases.

To see an indication of this shrinkage in the spread, imagine picking $n$ points randomly from a unit hypercube in ${\mathbb{R}}^{d}$ with volume one. It is easy to see that the minimum distance between any pair of points is at least $L=1∕{n}^{3∕d}$, with high probability: if we center at each such point a hypercube of side length $L$, it has volume $1∕{n}^{3}$ of the unit hypercube, and as such the probability of a second point falling inside this region is polynomially small.

However, $L$ tends to $1$ as $d$ increases. Thus, for $d=\Theta \left(logn\right)$ the spread of such a random point set is $\Theta \left(\sqrt{d}∕\left(L∕2\right)\right)=\Theta \left(\sqrt{logn}\right)$. (An alternative way to demonstrate this is by picking points randomly from the unit hypersphere. By using a concentration of mass argument [Mat02] on a hypersphere, we get a point set with spread $O\left(1\right)$ with high probability.)
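A small randomized experiment (our own, illustrative only, with names of our choosing) reflects this calculation: for $n$ random points in the unit hypercube, the closest pair is typically well above $L=1∕{n}^{3∕d}$ in moderate dimension.

```python
import itertools
import math
import random

def min_pairwise_distance(n, d, seed=0):
    """Minimum pairwise distance among n uniformly random points in [0,1]^d."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n)]
    return min(math.dist(p, q) for p, q in itertools.combinations(pts, 2))

n, d = 200, 8
L = n ** (-3 / d)                     # the threshold 1/n^(3/d) from the text
closest = min_pairwise_distance(n, d)
# With high probability closest is not far below L; a single random trial
# illustrates the calculation, it does not prove it.
```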

#### 7.2 Dependency on the initial solution

The initial starting solution fed into $k$-MEANSMTD is critical in the time it takes to converge, and in the quality of the final clustering generated. This is clearly suggested by Table  2, where trying many different initial solutions has yielded a considerable improvement in the best found solution. Of course, one can use a (rough) approximation algorithm [HM03] to come up with a better starting solution. While this approach might be useful in practice, it again falls outside the scope of our analysis.

#### 7.3 Similar results

Recently, and independently of our results, Sanjoy Dasgupta [Das03] announced results similar to a subset of ours. In particular, he mentions the one-dimensional lower bound, and a better upper bound for $k<5$, but only in one dimension. This work of Sanjoy Dasgupta and Howard Karloff seems to use arguments similar to ours (personal communication), although to our knowledge it has not been written up or published yet.

### Acknowledgments

The authors would like to thank Pankaj K. Agarwal, Boris Aronov and David Mount for useful discussions of problems studied in this paper and related problems. In particular, David Mount provided us with the test point sets used in [KMN${}^{+}$02].

### References

[Aro98]    S. Arora. Polynomial time approximation schemes for Euclidean TSP and other geometric problems. J. Assoc. Comput. Mach., 45(5):753–782, Sep 1998.

[Das03]    S. Dasgupta. How fast is $k$-means? In Proc. 16th Annu. Comp. Learn. Theo., number 2777 in Lect. Notes in Comp. Sci., page 735, 2003.

[DFG99]    Q. Du, V. Faber, and M. Gunzburger. Centroidal Voronoi tessellations: Applications and algorithms. SIAM Review, 41(4):637–676, 1999.

[DHS01]    R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, New York, 2nd edition, 2001.

[dlVKKR03]   W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proc. 35th Annu. ACM Sympos. Theory Comput., pages 50–58, 2003.

[ES03]    M. Effros and L. J. Schulman. Rapid clustering with a deterministic data net. Manuscript, 2003.

[HM03]    S. Har-Peled and S. Mazumdar. Coresets for $k$-means and $k$-median clustering and their applications. In Proc. 36th Annu. ACM Sympos. Theory Comput., 2004. To appear. http://www.uiuc.edu/~ sariel/papers/03/kcoreset/.

[IKI94]    M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based $k$-clustering. In Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 332–339, 1994.

[KMN${}^{+}$02]    T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation algorithm for $k$-means clustering. In Proc. 18th Annu. ACM Sympos. Comput. Geom., pages 10–18, 2002.

[Mat00]    J. Matoušek. On approximate geometric $k$-clustering. Discrete Comput. Geom., 24:61–84, 2000.

[Mat02]    J. Matoušek. Lectures on Discrete Geometry. Springer, 2002.

[Sad04]    B. Sadri. Lloyd’s method and variants implementation together with inputs, 2004. http://www.uiuc.edu/~ sariel/papers/03/lloyd_kmeans.

| Data Set | $k$ | Method | Steps | Reclassified | Final Cost |
|----------|-----|--------|-------|--------------|------------|
| ClusGauss ($n=10{,}000$, $d=3$) | 25 | $k$-MEANSMTD | 24 | 4748 | 0.081615 |
| | | SINGLEPNT | - | 4232 | 0.081622 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 17 | 2377 | 0.082702 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 18 | 1554 | 0.089905 |
| | 50 | $k$-MEANSMTD | 20 | 4672 | 0.031969 |
| | | SINGLEPNT | - | 4391 | 0.031728 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 16 | 2244 | 0.032164 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 22 | 1974 | 0.034661 |
| | 100 | $k$-MEANSMTD | 22 | 5377 | 0.009639 |
| | | SINGLEPNT | - | 4958 | 0.009706 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 15 | 2512 | 0.010925 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 19 | 1748 | 0.013092 |
| MultiClus ($n=10{,}000$, $d=3$) | 50 | $k$-MEANSMTD | 21 | 2544 | 0.033870 |
| | | SINGLEPNT | - | 2419 | 0.033941 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 16 | 1121 | 0.034622 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 25 | 722 | 0.038042 |
| | 100 | $k$-MEANSMTD | 18 | 1744 | 0.009248 |
| | | SINGLEPNT | - | 1732 | 0.008854 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 11 | 740 | 0.009902 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 15 | 584 | 0.010811 |
| | 500 | $k$-MEANSMTD | 12 | 1768 | 0.002495 |
| | | SINGLEPNT | - | 1694 | 0.002522 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 9 | 528 | 0.002757 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 11 | 444 | 0.002994 |
| Lena22 ($n=65{,}536$, $d=4$) | 8 | $k$-MEANSMTD | 36 | 62130 | 335.408625 |
| | | SINGLEPNT | - | 57357 | 335.440866 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 27 | 50298 | 338.594668 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 21 | 44040 | 355.715258 |
| | 64 | $k$-MEANSMTD | 211 | 111844 | 94.098422 |
| | | SINGLEPNT | - | 81505 | 94.390640 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 88 | 55541 | 97.608823 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 24 | 30201 | 120.274428 |
| | 256 | $k$-MEANSMTD | 167 | 111110 | 48.788216 |
| | | SINGLEPNT | - | 101522 | 48.307815 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 92 | 57575 | 51.954810 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 79 | 32348 | 61.331614 |
| Lena44 ($n=16{,}384$, $d=16$) | 8 | $k$-MEANSMTD | 63 | 18211 | 2700.589245 |
| | | SINGLEPNT | - | 16467 | 2700.587691 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 20 | 9715 | 2889.747540 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 27 | 9201 | 3008.783333 |
| | 64 | $k$-MEANSMTD | 61 | 21292 | 1525.846646 |
| | | SINGLEPNT | - | 16422 | 1615.667299 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 45 | 13092 | 1555.520952 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 16 | 7527 | 1907.962692 |
| | 256 | $k$-MEANSMTD | 43 | 21394 | 1132.746162 |
| | | SINGLEPNT | - | 28049 | 1122.407317 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 28 | 12405 | 1156.884049 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 27 | 7993 | 1320.303278 |
| Kiss ($n=10{,}000$, $d=3$) | 8 | $k$-MEANSMTD | 18 | 5982 | 687.362264 |
| | | SINGLEPNT | - | 7026 | 687.293930 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 18 | 3277 | 690.342895 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 23 | 2712 | 720.891998 |
| | 64 | $k$-MEANSMTD | 202 | 29288 | 202.044849 |
| | | SINGLEPNT | - | 35228 | 185.519927 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 92 | 12471 | 221.936175 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 44 | 6080 | 263.497185 |
| | 256 | $k$-MEANSMTD | 144 | 17896 | 105.438490 |
| | | SINGLEPNT | - | 16992 | 106.112133 |
| | | LAZY-$k$-MEANS, $\epsilon =0.05$ | 61 | 7498 | 120.317362 |
| | | LAZY-$k$-MEANS, $\epsilon =0.20$ | 27 | 3479 | 150.156231 |

 Table 1: Number of steps, number of reclassified points, and final average clustering cost in a typical execution of each of the four algorithms on the data sets mentioned in [KMN${}^{+}$02].

| Data Set | $k$ | Method | Minimum Cost | Maximum Cost | Average Cost |
|----------|-----|--------|--------------|--------------|--------------|
| ClusGauss ($n=10{,}000$, $d=3$) | 25 | $k$-MEANSMTD | 0.068462 | 0.087951 | 0.07501276 |
| | | SINGLEPNT | 0.067450 | 0.083194 | 0.07486010 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 0.074667 | 0.100035 | 0.08510598 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 0.070011 | 0.092658 | 0.07803375 |
| | 50 | $k$-MEANSMTD | 0.028841 | 0.040087 | 0.03335312 |
| | | SINGLEPNT | 0.028376 | 0.040623 | 0.03308624 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 0.031175 | 0.046528 | 0.03719264 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 0.029626 | 0.040811 | 0.03384180 |
| | 100 | $k$-MEANSMTD | 0.011425 | 0.016722 | 0.01401549 |
| | | SINGLEPNT | 0.010106 | 0.017986 | 0.01365492 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 0.011928 | 0.022015 | 0.01565268 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 0.011730 | 0.020600 | 0.01442575 |
| MultiClus ($n=10{,}000$, $d=3$) | 50 | $k$-MEANSMTD | 0.027563 | 0.034995 | 0.03051698 |
| | | SINGLEPNT | 0.027412 | 0.034167 | 0.03083110 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 0.029507 | 0.055160 | 0.03620397 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 0.028457 | 0.046314 | 0.03260643 |
| | 100 | $k$-MEANSMTD | 0.002477 | 0.004324 | 0.00308144 |
| | | SINGLEPNT | 0.002390 | 0.004179 | 0.00303798 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 0.002758 | 0.005175 | 0.00356282 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 0.002331 | 0.004789 | 0.00322593 |
| | 500 | $k$-MEANSMTD | 0.002142 | 0.002731 | 0.00240768 |
| | | SINGLEPNT | 0.002136 | 0.002805 | 0.00244548 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 0.002539 | 0.003567 | 0.00292354 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 0.002206 | 0.002890 | 0.00254321 |
| Lena22 ($n=65{,}536$, $d=4$) | 8 | $k$-MEANSMTD | 263.644420 | 348.604787 | 299.78905632 |
| | | SINGLEPNT | 263.659829 | 348.527023 | 307.12394164 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 278.337133 | 414.679356 | 345.07986265 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 271.041374 | 409.802396 | 322.99259307 |
| | 64 | $k$-MEANSMTD | 82.074376 | 102.327255 | 88.53558757 |
| | | SINGLEPNT | 82.190945 | 104.574941 | 89.24323986 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 100.601485 | 147.170657 | 111.93562151 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 82.798308 | 106.231864 | 94.20319250 |
| | 256 | $k$-MEANSMTD | 44.637740 | 51.482531 | 47.66542537 |
| | | SINGLEPNT | 44.699224 | 51.685618 | 47.81799127 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 56.906620 | 71.491475 | 62.00216985 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 47.178425 | 54.946136 | 50.82872342 |
| Lena44 ($n=16{,}384$, $d=16$) | 8 | $k$-MEANSMTD | 2699.721266 | 3617.282065 | 2903.30164756 |
| | | SINGLEPNT | 2699.663310 | 3216.854024 | 2894.42713876 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 2834.438965 | 4452.875383 | 3293.73084140 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 2725.907276 | 3649.518829 | 2977.33094524 |
| | 64 | $k$-MEANSMTD | 1305.357406 | 1694.965827 | 1503.17431782 |
| | | SINGLEPNT | 1345.821487 | 1811.663769 | 1515.08195678 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 1564.252624 | 2385.794013 | 1785.93841955 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 1410.883673 | 1793.704755 | 1565.18092988 |
| | 256 | $k$-MEANSMTD | 1044.017122 | 1311.942456 | 1151.64441691 |
| | | SINGLEPNT | 1055.788028 | 1308.459754 | 1168.30843808 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 1262.487865 | 1653.820840 | 1400.49905496 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 1094.884884 | 1385.345314 | 1219.27000492 |
| Kiss ($n=10{,}000$, $d=3$) | 8 | $k$-MEANSMTD | 687.278119 | 714.789442 | 700.352315760 |
| | | SINGLEPNT | 687.279479 | 714.731416 | 697.292832560 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 727.017538 | 947.779405 | 802.256735040 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 689.779010 | 861.853344 | 719.140385820 |
| | 64 | $k$-MEANSMTD | 158.607749 | 208.946701 | 178.21703676 |
| | | SINGLEPNT | 151.642447 | 203.102940 | 177.17793706 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 222.646398 | 324.435479 | 259.62118455 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 170.571861 | 248.648363 | 208.64482062 |
| | 256 | $k$-MEANSMTD | 96.272602 | 115.294309 | 105.30212380 |
| | | SINGLEPNT | 97.141907 | 125.009357 | 107.08187899 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.20$ | 124.378185 | 158.922757 | 140.72908431 |
| | | LAZY-$k$-MEANS, $\varepsilon =0.05$ | 103.672482 | 129.685819 | 116.73971102 |

 Table 2: Minimum, maximum, and average clustering cost on 100 executions of each of the algorithms on each of the data sets with initial centers picked randomly.