From: Ted Dunning <ted.dunning@gmail.com>
Date: Wed, 31 Mar 2010 11:08:43 -0700
To: mahout-user@lucene.apache.org
Cc: mahout-dev@lucene.apache.org
Subject: Re: question about implementing the k-median clustering algorithm based on Map Reduce

The problem is that k-median (and, more generally, k-medoid) lacks what are called sufficient statistics. Informally, sufficient statistics are a summary of the data you have seen so far that allows you to process new data and compute the value you want. For the mean and variance, it is sufficient to keep the sum of all points, the sum of squared values, and the number of points seen (although this is not quite the best algorithm numerically; see Welford's algorithm for improvements).
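For concreteness, here is a minimal sketch of Welford's update, together with the merge step that a combiner or reducer needs. This is illustrative Java; the class and method names are mine, not part of Mahout.

// A minimal sketch of Welford's on-line mean/variance update, plus the
// merge step that a combiner or reducer needs. Illustrative only; not
// Mahout code.
public class OnlineMeanVariance {
  private long n = 0;
  private double mean = 0.0;
  private double m2 = 0.0;    // sum of squared deviations from the mean

  // Incorporate one new sample.
  public void add(double x) {
    n++;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean); // second factor uses the updated mean
  }

  public double mean() { return mean; }

  public double sampleVariance() {
    return n > 1 ? m2 / (n - 1) : 0.0;
  }

  // Because (n, mean, m2) are sufficient statistics, two partial
  // summaries can be merged exactly. This is what lets k-means-style
  // centroids be accumulated incrementally; k-median has no analogue.
  public void merge(OnlineMeanVariance other) {
    if (other.n == 0) return;
    long total = n + other.n;
    double delta = other.mean - mean;
    mean += delta * other.n / total;
    m2 += other.m2 + delta * delta * ((double) n * other.n) / total;
    n = total;
  }
}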
For medians, especially in high-dimensional spaces, this is much harder to do. There are a variety of on-line approximation algorithms for medians that provide what you need in the one-dimensional case. For higher dimensions I don't know of a good answer off-hand, but a variation on the one-dimensional algorithms might suffice.

The simplest algorithm for on-line computation of the median adds or subtracts a small constant to the current estimate of the median according to whether the current sample is above or below that estimate. Making the amount that you add or subtract depend on how many samples you have seen improves the convergence of this algorithm. (A rough sketch of this update appears after your quoted message below.)

Estimates from separate sub-samples can be combined in various ways. One simple method is to take the (weighted) median of the sub-sample estimates, the so-called remedian. A more refined approach is to take a mean of the estimates weighted by a local estimate of the probability density in the neighborhood of each median estimate. All of these techniques should be extensible to the high-dimensional case.

Good luck with this problem. We would like to hear of your progress.

On Wed, Mar 31, 2010 at 8:41 AM, Guohua Hao wrote:

> Hello All,
>
> Sorry for the cross-posting of this question.
>
> I would like to implement the k-median clustering algorithm using map
> reduce. The k-median clustering algorithm mentioned here is very close to
> k-means, except that the centroid of a cluster in the k-median algorithm
> is the median of the points belonging to that cluster. Please refer to the
> following link for a formal definition of the median in high-dimensional
> feature spaces:
>
> http://en.wikipedia.org/wiki/Geometric_median
>
> Unlike in the k-means algorithm, where the centroids of clusters can be
> updated incrementally in combiners and reducers (since we are computing
> the average of the points belonging to the same cluster), in the k-median
> algorithm it seems that we have to collect all the points falling into the
> same cluster before we can compute its centroid. This process of computing
> centroids requires a lot of memory and is computationally expensive
> (O(n^2) complexity for n points in a cluster).
>
> Could you please give me some advice on how to implement this k-median
> clustering algorithm more efficiently using map reduce?
>
> Thanks,
> Guohua
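P.S. Here is the rough sketch of the simple on-line median update promised above. This is illustrative Java, not Mahout code; the Robbins-Monro-style step schedule c / n is one conventional choice, and the constant c is a tuning knob that must match the scale of the data.

// A rough sketch of the simple stochastic one-dimensional median
// update described above. Illustrative only; not Mahout code.
public class OnlineMedian {
  private double estimate = 0.0;  // current estimate of the median
  private long n = 0;
  private final double c;         // base step size (assumed tuning knob)

  public OnlineMedian(double c) { this.c = c; }

  public void add(double x) {
    n++;
    double step = c / n;          // shrinking steps improve convergence
    if (x > estimate) {
      estimate += step;
    } else if (x < estimate) {
      estimate -= step;
    }
  }

  public double estimate() { return estimate; }
}

To combine estimates from separate splits in a reducer, one could take the median of the per-split estimates, in the spirit of the remedian. For the geometric median in higher dimensions, a plausible variant steps from the current estimate toward each sample along the unit vector to that sample, which amounts to stochastic descent on the sum of distances.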