The problem is that k-median (and more generally, k-medoid as well) lacks
what are called sufficient statistics. Informally, sufficient statistics
are a summary of the data that you have seen so far that allows you to
process new data and compute the value you like. For the mean and variance,
it is sufficient to keep the sum of all points, the sum of squared values,
and the number of points seen (although this is not quite the best
algorithm; see Welford's algorithm for improvements:
<http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance>).
For medians, especially in high dimensional spaces, this is much harder to
do.
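As a concrete illustration of sufficient statistics for the mean and
variance, here is a minimal Python sketch of Welford's algorithm; the class
and attribute names are illustrative, not from any particular library. Only
three numbers (count, running mean, and a running sum of squared deviations)
need to be carried along:

```python
class RunningMoments:
    """Welford's online algorithm: (n, mean, m2) are sufficient statistics
    for the mean and variance of everything seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # population variance of the points seen so far
        return self.m2 / self.n if self.n > 1 else 0.0
```

Note that two such summaries can also be merged cheaply, which is exactly
what a combiner needs and exactly what the median lacks.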
There are a variety of online approximation algorithms for medians which
provide what you need for the one dimensional case. For higher dimensions,
I don't know of a good answer offhand, but a variation on the one
dimensional algorithms might suffice.
The simplest algorithm for online computation of the median consists of
simply adding or subtracting a small constant amount to the current estimate
of the median according to whether the current sample is above or below your
current estimate. Making the amount that you add or subtract depend on how
many samples you have seen improves the convergence of this algorithm.
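A minimal Python sketch of that step-based estimator follows; the
1/sqrt(n) step schedule is one assumed choice of decay that makes the step
depend on the number of samples seen, not the only reasonable one:

```python
import math

def online_median(stream, step0=1.0):
    """Estimate the median of a stream by nudging the estimate up or down
    by a small step whenever a sample falls above or below it."""
    est = None
    for n, x in enumerate(stream, start=1):
        if est is None:
            est = float(x)  # initialize at the first sample
            continue
        step = step0 / math.sqrt(n)  # shrinking step improves convergence
        if x > est:
            est += step
        elif x < est:
            est -= step
    return est
```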
Estimates from separate subsamples can be combined in various ways. One
simple method is to take the median of the weighted subsample estimates (the
so-called remedian). A more refined approach is to take a mean of the
estimates weighted by a local estimate of the probability density in the
neighborhood of each median estimate. All of these techniques should be
extensible to the high dimensional case.
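The remedian combination step could be sketched as below; each
mapper/subsample would emit its own median estimate, and the reducer takes
the median of those. The weighted variant shown is one simple assumption
about how to handle subsamples of unequal size:

```python
import statistics

def remedian(estimates, weights=None):
    """Combine per-subsample median estimates into one overall estimate.

    With no weights this is the plain remedian: the median of the
    subsample medians. With weights (e.g. subsample sizes), it returns
    the weighted median instead.
    """
    if weights is None:
        return statistics.median(estimates)
    # weighted median: sort the estimates and return the first value
    # where the cumulative weight reaches half the total
    pairs = sorted(zip(estimates, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for value, w in pairs:
        acc += w
        if acc >= half:
            return value
```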
Good luck with this problem. We would like to hear of your progress.
On Wed, Mar 31, 2010 at 8:41 AM, Guohua Hao <haoguohua@gmail.com> wrote:
> Hello All,
>
> Sorry for the cross posting of this question.
>
> I would like to implement the k-median clustering algorithm using map
> reduce. The k-median clustering algorithm mentioned here is very close to
> k-means, except that the centroid of a cluster in the k-median algorithm
> is the median of the points belonging to this cluster. Please refer to the
> following link for a formal definition of the median in high-dimensional
> feature spaces.
>
> http://en.wikipedia.org/wiki/Geometric_median
>
> Unlike in the k-means algorithm, where the centroids of clusters can be
> updated incrementally in combiners and reducers (since we are computing the
> average of those points belonging to the same cluster), in the k-median
> algorithm it seems that we have to collect all the points falling into the
> same cluster before we can compute its centroid. This process of computing
> centroids requires a lot of memory and is computationally expensive
> (O(n^2) complexity for n points in the cluster).
>
> Could you please give me some advice on how to implement this k-median
> clustering algorithm more efficiently using map reduce?
>
> Thanks,
> Guohua
>
