From: Jeff Eastman <jeastman@narus.com>
To: user@mahout.apache.org
Date: Tue, 20 Sep 2011 11:05:48 -0700
Subject: RE: Clustering : Number of Reducers
As all the Mahout clustering implementations keep their clusters in memory, I don't believe any of them will handle that many clusters. I'm a bit skeptical, however, that 5 million clusters over a billion 300-d vectors will produce anything useful by way of analytics. You've got the curse of dimensionality working against you, and your vectors will be nearly equidistant from each other. This means that very small (= noise) differences in distance will be driving the clustering.

-----Original Message-----
From: Paritosh Ranjan [mailto:pranjan@xebia.com]
Sent: Tuesday, September 20, 2011 10:41 AM
To: user@mahout.apache.org
Subject: Re: Clustering : Number of Reducers

The max load I expect is 1 billion vectors, with around 300 dimensions per
vector. The number of clusters with more than one vector inside can be
around 5 million, with an average of 10-20 vectors per cluster.
But when most of the vectors are really far away in the worst case
(apart from the similar ones, which will be inside the canopy), most of
the canopies might contain only one vector. So the number of canopies
will be really high (as lots of canopies will result in clusters
containing a single vector).

On 20-09-2011 22:56, Jeff Eastman wrote:
> I guess it depends upon what you expect from your HUGE data set: how many clusters do you believe it contains? A hundred? A thousand? A million? A billion? With the right T-values I believe Canopy can handle the first three but not the last. It will also depend upon the size of your vectors. This is because, as canopy centroids are calculated, the centroid vectors become more dense, and these take up more space in memory. So a million really wide clusters might have trouble fitting into a 4GB reducer memory. But what are you really going to do with a million clusters? This number seems vastly larger than one might find useful in summarizing a data set. I would think a couple hundred clusters would be the limit of human-understandable clustering. Canopy can do that with no problem.
>
> MeanShiftCanopy, as its name implies, is really just an iterative canopy implementation. It allows the specification of an arbitrary number of initial reducers, but it counts them down to 1 in each iteration in order to properly process all the input. It is an agglomerative clustering algorithm, and the clusters it builds contain the indices of each of the input points that have been agglomerated. This makes the mean shift canopy larger in memory than vanilla canopies, since the list of points is maintained too. It is possible to avoid the points accumulation, and it won't happen unless the -cl option is provided; in this case the memory consumption will be about the same as vanilla canopy.
>
> Bottom line: how many clusters do you expect to find?
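Jeff's description of canopy formation, and of why every canopy must stay resident in the single reducer, can be sketched in a few lines. This is a toy single-process illustration of the loose/tight T1/T2 threshold idea only, not Mahout's actual MapReduce implementation; the point representation and distance metric here are arbitrary choices for the sketch:

```python
import math

def canopy(points, t1, t2, dist=math.dist):
    """Toy canopy clustering sketch. T1 = loose threshold, T2 = tight
    threshold, with T1 > T2. Every canopy built stays in memory, which
    mirrors the memory pressure discussed in the thread."""
    canopies = []            # all canopies accumulate here
    candidates = list(points)
    while candidates:
        center = candidates.pop(0)      # pick an arbitrary seed point
        members = [center]
        remaining = []
        for p in candidates:
            d = dist(center, p)
            if d < t1:
                members.append(p)       # within loose threshold: joins this canopy
            if d >= t2:
                remaining.append(p)     # outside tight threshold: may seed another
        canopies.append((center, members))
        candidates = remaining
    return canopies
```

With T2 small relative to the data's spread, few candidates are consumed per pass, so the number of canopies (and the memory they occupy) explodes; that is the single-vector-canopy scenario described above.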
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> Sent: Tuesday, September 20, 2011 9:46 AM
> To: user@mahout.apache.org
> Subject: Re: Clustering : Number of Reducers
>
> "but all the canopies gotta fit in memory."
>
> If this is true, then CanopyDriver would not be able to cluster HUGE
> data (as the memory might blow up).
>
> I am using MeanShiftCanopyDriver of 0.6-SNAPSHOT, which can use any
> number of reducers. Will it also need all the canopies in memory?
>
> Or, which clustering technique would you suggest to cluster really big
> data (considering performance and big size as parameters)?
>
> Thanks and Regards,
> Paritosh Ranjan
>
> On 20-09-2011 21:35, Jeff Eastman wrote:
>> Well, while it is true that the CanopyDriver writes all its canopies to the file system, they are written at the end of the reduce method. The mappers all output the same key, so the one reducer gets all the mapper pairs, and these must fit into memory before they can be output. With T1/T2 values that are too small given the data, there will be a very large number of clusters output by each mapper and a corresponding deluge of clusters at the reducer. T3/T4 may be used to supply different thresholds in the reduce step, but all the canopies gotta fit in memory.
>>
>> -----Original Message-----
>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>> Sent: Tuesday, September 20, 2011 12:31 AM
>> To: user@mahout.apache.org
>> Subject: Re: Clustering : Number of Reducers
>>
>> "The limit is that all the canopies need to fit into memory."
>> I don't think so. I think you can use CanopyDriver to write canopies to
>> a filesystem. This is done as a mapreduce job. Then the KMeansDriver
>> needs these canopy points as input to run KMeans.
>>
>> On 20-09-2011 01:39, Jeff Eastman wrote:
>>> Actually, most of the clustering jobs (including DirichletDriver) accept the -Dmapred.reduce.tasks=n argument as noted below.
>>> Canopy is the only job which forces n=1, and this is so the reducer will see all of the mapper outputs. Generally, by adjusting T2 & T1 to suitably large values you can get canopy to handle pretty large datasets. The limit is that all the canopies need to fit into memory.
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>>> Sent: Sunday, September 18, 2011 10:03 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Clustering : Number of Reducers
>>>
>>> So, does this mean that Mahout cannot support clustering for large data?
>>>
>>> Even in DirichletDriver the number of reducers is hardcoded to 1. And we
>>> need canopies to run KMeansDriver.
>>>
>>> Paritosh
>>>
>>> On 19-09-2011 01:47, Konstantin Shmakov wrote:
>>>> For most of the tasks one can force the number of reducers with
>>>> mapred.reduce.tasks=<n>, where <n> is the desired number of reducers.
>>>>
>>>> It will not necessarily increase performance, though: with kmeans and
>>>> fuzzykmeans, combiners do the reducers' job, and increasing the number of
>>>> reducers won't usually affect performance.
>>>>
>>>> With canopy, the distributed algorithm has no combiners and has 1 reducer
>>>> hardcoded; trying to increase the number of reducers won't have any effect,
>>>> as the algorithm doesn't work with >1 reducer. My experience is that canopy
>>>> won't scale to large data and needs improvement.
>>>>
>>>> -- Konstantin
>>>>
>>>> On Sun, Sep 18, 2011 at 10:50 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have been trying to cluster some hundreds of millions of records using
>>>>> Mahout clustering techniques.
>>>>>
>>>>> The number of reducers is always one, which I am not able to change. This is
>>>>> affecting the performance. I am using Mahout 0.5.
>>>>>
>>>>> In 0.6-SNAPSHOT, I see that the MeanShiftCanopyDriver has been changed to use any number of reducers.
Will other ClusterDrivers also get changed to
>>>>> use any number of reducers in 0.6?
>>>>>
>>>>> Thanks and Regards,
>>>>> Paritosh Ranjan
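As a footnote to Jeff's curse-of-dimensionality point at the top of the thread: the concentration of pairwise distances he describes is easy to demonstrate. The sketch below (plain Python, uniform random vectors; the sample sizes and seed are arbitrary choices) measures how the relative spread of pairwise Euclidean distances shrinks as dimensionality grows:

```python
import math
import random

def relative_distance_spread(n_points, dims, seed=1):
    """(max - min) / mean over all pairwise Euclidean distances of
    n_points uniform random vectors in [0, 1]^dims."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    mean = sum(dists) / len(dists)
    return (max(dists) - min(dists)) / mean

# In 2-D, pairwise distances vary widely; in 300-D they crowd around the
# mean, so tiny (noise-level) differences in distance end up deciding
# which cluster a vector lands in.
```

This is why, at 300 dimensions, very small differences in distance drive the cluster assignments regardless of which clustering driver is used.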