Coresets for k-Means and k-Median
Clustering and their Applications
$\renewcommand{\Re}{{\rm I\!\hspace{-0.025em} R}}
\newcommand{\eps}{{\varepsilon}}%
\newcommand{\Coreset}{{\mathcal{S}}}
$
Sariel Har-Peled and
Soham Mazumdar
In this paper, we show the existence of small coresets for the
problems of computing $k$-median and $k$-means clustering for
points in low dimension. In other words, we show that given a
point set $P$ in $\Re^d$, one can compute a weighted set
$\Coreset \subseteq P$, of size $O(k \eps^{-d} \log{n})$, such
that one can compute the $k$-median/means clustering on $\Coreset$
instead of on $P$, and get an $(1+\eps)$-approximation.
As a result, we improve the fastest known algorithms for
$(1+\eps)$-approximate $k$-means and $k$-median. Our
algorithms have \emph{linear} running time for a fixed
$k$ and $\eps$. In addition, we can maintain the
$(1+\eps)$-approximate $k$-median or $k$-means
clustering of a stream when points are being \emph{only
inserted}, using polylogarithmic space and update
time.
@inproceedings{hm-ckmkm-04,
author = {S. {Har-Peled} and S.~{Mazumdar}},
booktitle = STOC_2004,
title = {Coresets for $k$-Means and $k$-Median Clustering
and their Applications},
year = 2004,
pages = {291--300}
}