From: Jeff Eastman <jeastman@narus.com>
To: user@mahout.apache.org
Date: Tue, 20 Sep 2011 11:05:48 -0700
Subject: RE: Clustering : Number of Reducers
As all the Mahout clustering implementations keep their clusters in memory, I don't believe any of them will handle that many clusters. I'm a bit skeptical, however, that 5 million clusters over a billion 300-d vectors will produce anything useful by way of analytics. You've got the curse of dimensionality working against you, and your vectors will be nearly equidistant from each other. This means that very small (= noise) differences in distance will be driving the clustering.

-----Original Message-----
From: Paritosh Ranjan [mailto:pranjan@xebia.com]
Sent: Tuesday, September 20, 2011 10:41 AM
To: user@mahout.apache.org
Subject: Re: Clustering : Number of Reducers

The max load I expect is 1 billion vectors, with around 300 dimensions per
vector. The number of clusters with more than one vector inside can be
around 5 million, with an average of 10-20 vectors per cluster.
But when most of the vectors are really far away in the worst case
(apart from the similar ones, which will be inside the canopy), most of
the canopies might contain only one vector. So the number of canopies
will be really high (as lots of canopies will result in clusters
containing a single vector).

On 20-09-2011 22:56, Jeff Eastman wrote:
> I guess it depends upon what you expect from your HUGE data set: how many clusters do you believe it contains? A hundred? A thousand? A million? A billion? With the right T-values I believe Canopy can handle the first three but not the last. It will also depend upon the size of your vectors. This is because, as canopy centroids are calculated, the centroid vectors become more dense, and these take up more space in memory. So a million really wide clusters might have trouble fitting into a 4GB reducer memory. But what are you really going to do with a million clusters? This number seems vastly larger than one might find useful in summarizing a data set. I would think a couple hundred clusters would be the limit of human-understandable clustering. Canopy can do that with no problem.
>
> MeanShiftCanopy, as its name implies, is really just an iterative canopy implementation. It allows the specification of an arbitrary number of initial reducers, but it counts them down to 1 in each iteration in order to properly process all the input. It is an agglomerative clustering algorithm, and the clusters it builds contain the indices of each of the input points that have been agglomerated. This makes the mean shift canopy larger in memory than vanilla canopies, since the list of points is maintained too. It is possible to avoid the points accumulation, and it won't happen unless the -cl option is provided; in this case the memory consumption will be about the same as vanilla canopy.
>
> Bottom line: how many clusters do you expect to find?
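Jeff's description of canopy formation, and of why every canopy must stay resident in the single reducer, can be sketched in a few lines. This is a toy single-process illustration of the loose/tight T1/T2 threshold idea only, not Mahout's actual MapReduce implementation; the point representation and distance metric here are arbitrary choices for the sketch:

```python
import math

def canopy(points, t1, t2, dist=math.dist):
    """Toy canopy clustering sketch. T1 = loose threshold, T2 = tight
    threshold, with T1 > T2. Every canopy built stays in memory, which
    mirrors the memory pressure discussed in the thread."""
    canopies = []            # all canopies accumulate here
    candidates = list(points)
    while candidates:
        center = candidates.pop(0)      # pick an arbitrary seed point
        members = [center]
        remaining = []
        for p in candidates:
            d = dist(center, p)
            if d < t1:
                members.append(p)       # within loose threshold: joins this canopy
            if d >= t2:
                remaining.append(p)     # outside tight threshold: may seed another
        canopies.append((center, members))
        candidates = remaining
    return canopies
```

With T2 small relative to the data's spread, few candidates are consumed per pass, so the number of canopies (and the memory they occupy) explodes; that is the single-vector-canopy scenario described above.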
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> Sent: Tuesday, September 20, 2011 9:46 AM
> To: user@mahout.apache.org
> Subject: Re: Clustering : Number of Reducers
>
> "but all the canopies gotta fit in memory."
>
> If this is true, then CanopyDriver would not be able to cluster HUGE
> data (as the memory might blow up).
>
> I am using MeanShiftCanopyDriver of 0.6-SNAPSHOT, which can use any
> number of reducers. Will it also need all the canopies in memory?
>
> Or, which clustering technique would you suggest to cluster really big
> data (considering performance and big size as parameters)?
>
> Thanks and Regards,
> Paritosh Ranjan
>
> On 20-09-2011 21:35, Jeff Eastman wrote:
>> Well, while it is true that the CanopyDriver writes all its canopies to the file system, they are written at the end of the reduce method. The mappers all output the same key, so the one reducer gets all the mapper pairs, and these must fit into memory before they can be output. With T1/T2 values that are too small given the data, there will be a very large number of clusters output by each mapper and a corresponding deluge of clusters at the reducer. T3/T4 may be used to supply different thresholds in the reduce step, but all the canopies gotta fit in memory.
>>
>> -----Original Message-----
>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>> Sent: Tuesday, September 20, 2011 12:31 AM
>> To: user@mahout.apache.org
>> Subject: Re: Clustering : Number of Reducers
>>
>> "The limit is that all the canopies need to fit into memory."
>> I don't think so. I think you can use CanopyDriver to write canopies to
>> a filesystem. This is done as a mapreduce job. Then the KMeansDriver
>> needs these canopy points as input to run KMeans.
>>
>> On 20-09-2011 01:39, Jeff Eastman wrote:
>>> Actually, most of the clustering jobs (including DirichletDriver) accept the -Dmapred.reduce.tasks=n argument as noted below.
>>> Canopy is the only job which forces n=1, and this is so the reducer will see all of the mapper outputs. Generally, by adjusting T2 & T1 to suitably large values you can get canopy to handle pretty large datasets. The limit is that all the canopies need to fit into memory.
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
>>> Sent: Sunday, September 18, 2011 10:03 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Clustering : Number of Reducers
>>>
>>> So, does this mean that Mahout cannot support clustering for large data?
>>>
>>> Even in DirichletDriver the number of reducers is hardcoded to 1. And we
>>> need canopies to run KMeansDriver.
>>>
>>> Paritosh
>>>
>>> On 19-09-2011 01:47, Konstantin Shmakov wrote:
>>>> For most of the tasks one can force the number of reducers with
>>>> mapred.reduce.tasks=<n>, where <n> is the desired number of reducers.
>>>>
>>>> It will not necessarily increase performance, though: with kmeans and
>>>> fuzzykmeans, combiners do the reducers' job, and increasing the number of
>>>> reducers won't usually affect performance.
>>>>
>>>> With canopy, the distributed algorithm has no combiners and has 1 reducer
>>>> hardcoded; trying to increase the number of reducers won't have any effect,
>>>> as the algorithm doesn't work with >1 reducer. My experience is that canopy
>>>> won't scale to large data and needs improvement.
>>>>
>>>> -- Konstantin
>>>>
>>>> On Sun, Sep 18, 2011 at 10:50 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have been trying to cluster some hundreds of millions of records using
>>>>> Mahout clustering techniques.
>>>>>
>>>>> The number of reducers is always one, which I am not able to change. This is
>>>>> affecting the performance. I am using Mahout 0.5.
>>>>>
>>>>> In 0.6-SNAPSHOT, I see that the MeanShiftCanopyDriver has been changed to use any number of reducers.
Will other ClusterDrivers also get changed to
>>>>> use any number of reducers in 0.6?
>>>>>
>>>>> Thanks and Regards,
>>>>> Paritosh Ranjan
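As a footnote to Jeff's curse-of-dimensionality point at the top of the thread: the concentration of pairwise distances he describes is easy to demonstrate. The sketch below (plain Python, uniform random vectors; the sample sizes and seed are arbitrary choices) measures how the relative spread of pairwise Euclidean distances shrinks as dimensionality grows:

```python
import math
import random

def relative_distance_spread(n_points, dims, seed=1):
    """(max - min) / mean over all pairwise Euclidean distances of
    n_points uniform random vectors in [0, 1]^dims."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    mean = sum(dists) / len(dists)
    return (max(dists) - min(dists)) / mean

# In 2-D, pairwise distances vary widely; in 300-D they crowd around the
# mean, so tiny (noise-level) differences in distance end up deciding
# which cluster a vector lands in.
```

This is why, at 300 dimensions, very small differences in distance drive the cluster assignments regardless of which clustering driver is used.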