Hello All,
Sorry for the cross posting of this question.
I would like to implement the kmedian clustering algorithm using map
reduce. The kmedian clustering algorithm mentioned here is very close to
the kmeans, except that the centroid of a cluster in the kmedian algorithm
is the median of the points belonging to this cluster. Please refer to the
following link for a formal definition of the median in high dimensional
feature spaces.
http://en.wikipedia.org/wiki/Geometric_median
Unlike in the kmeans algorithm, where the centroids of clusters can be
updated incrementally in combiners and reducers (since we are computing the
average of those points belonging to the same cluster), in the kmedian
algorithm, it seems to me that we have to collect all the points falling
into the same cluster before we can compute its centroid. This process of
computing centroids requires lots of memory space and is computationally
expensive ( O(n^2) complexity if n points in this cluster ).
Could you please give me some advice on how to implement this kmedian
clustering algorithm more efficiently using map reduce?
Thanks,
Guohua
