mahout-user mailing list archives

From Jeff Eastman <jeast...@Narus.com>
Subject RE: Clustering : Number of Reducers
Date Tue, 20 Sep 2011 16:05:14 GMT
Well, while it is true that the CanopyDriver writes all its canopies to the file system, they
are written at the end of the reduce method. The mappers all output the same key, so the one
reducer gets all the mapper pairs and these must fit into memory before they can be output.
With T1/T2 values that are too small given the data, there will be a very large number of
clusters output by each mapper and a corresponding deluge of clusters at the reducer. T3/T4
may be used to supply different thresholds in the reduce step, but all the canopies still
have to fit in memory.
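
For reference, here is a minimal sketch of driving canopy programmatically with larger
thresholds. The paths and threshold values are made up, and the run() signature shown is
the 0.5-era one, so verify it against your Mahout version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class CanopyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path("vectors");         // SequenceFile of VectorWritable (example path)
    Path output = new Path("canopy-output");

    // Larger T1/T2 mean fewer canopies per mapper, so fewer canopies reach the
    // single reducer and the in-memory set stays manageable.
    double t1 = 8.0;  // example values only; tune them to the distance scale of the data
    double t2 = 4.0;

    // Assumed 0.5-era signature: conf, input, output, measure, t1, t2,
    // runClustering, runSequential. Newer builds add overloads taking T3/T4
    // for the reduce-side thresholds mentioned above.
    CanopyDriver.run(conf, input, output, new EuclideanDistanceMeasure(),
        t1, t2, false, false);
  }
}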

-----Original Message-----
From: Paritosh Ranjan [mailto:pranjan@xebia.com] 
Sent: Tuesday, September 20, 2011 12:31 AM
To: user@mahout.apache.org
Subject: Re: Clustering : Number of Reducers

"The limit is that all the canopies need to fit into memory."
I don't think so. I think you can use CanopyDriver to write canopies in 
a filesystem. This is done as a mapreduce job. Then the KMeansDriver 
needs these canopy points as input to run KMeans.
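
A minimal sketch of that chaining, assuming the 0.5-era KMeansDriver.run() signature
and that the canopy job wrote its clusters under a clusters-0 subdirectory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class KMeansFromCanopies {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path vectors = new Path("vectors");                   // same input vectors the canopy job used
    Path canopies = new Path("canopy-output/clusters-0"); // assumed location of the canopy output
    Path output = new Path("kmeans-output");

    // Assumed 0.5-era signature: conf, input, clustersIn, output, measure,
    // convergenceDelta, maxIterations, runClustering, runSequential
    KMeansDriver.run(conf, vectors, canopies, output,
        new EuclideanDistanceMeasure(), 0.001, 10, true, false);
  }
}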

On 20-09-2011 01:39, Jeff Eastman wrote:
> Actually, most of the clustering jobs (including DirichletDriver) accept the
> -Dmapred.reduce.tasks=n argument as noted below. Canopy is the only job which forces n=1,
> and this is so the reducer will see all of the mapper outputs. Generally, by adjusting
> T2 & T1 to suitably large values you can get canopy to handle pretty large datasets. The
> limit is that all the canopies need to fit into memory.
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:pranjan@xebia.com]
> Sent: Sunday, September 18, 2011 10:03 PM
> To: user@mahout.apache.org
> Subject: Re: Clustering : Number of Reducers
>
> So, does this mean that Mahout cannot support clustering for large data?
>
> Even in DirichletDriver the number of reducers is hardcoded to 1. And we
> need canopies to run KMeansDriver.
>
> Paritosh
>
> On 19-09-2011 01:47, Konstantin Shmakov wrote:
>> For most of the tasks one can force the number of reducers with
>> mapred.reduce.tasks=<N>
>> where <N> is the desired number of reducers.
>>
>> It will not necessarily increase performance though - with kmeans and
>> fuzzykmeans the combiners do the reducers' job, and increasing the number of
>> reducers usually won't affect performance.
>>
>> With canopy, the distributed algorithm
>> <http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java?revision=1134456&view=markup>
>> has no combiners and has 1 reducer hardcoded - trying to increase the number of
>> reducers won't have any effect as the algorithm doesn't work with >1 reducer.
>> My experience is that canopy won't scale to large data and needs improvement.
>>
>> -- Konstantin
>>
>>
>>
>> On Sun, Sep 18, 2011 at 10:50 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:
>>
>>> Hi,
>>>
>>> I have been trying to cluster some hundreds of millions of records using
>>> Mahout Clustering techniques.
>>>
>>> The number of reducers is always one, which I am not able to change. This is
>>> affecting the performance. I am using Mahout 0.5.
>>>
>>> In 0.6-SNAPSHOT, I see that the MeanShiftCanopyDriver has been changed to
>>> use any number of reducers. Will other ClusterDrivers also get changed to
>>> use any number of reducers in 0.6?
>>>
>>> Thanks and Regards,
>>> Paritosh Ranjan
>>>
>>>
>>>
>
>
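
PS: for the -Dmapred.reduce.tasks=<N> option discussed above, the same setting can be
placed directly in the Hadoop Configuration when the drivers are called from Java rather
than from the command line. A minimal sketch, assuming the driver builds its jobs from
the Configuration it is given (canopy still ignores it, since its single reducer is
hardcoded):

import org.apache.hadoop.conf.Configuration;

public class ReducerCountExample {
  public static void main(String[] args) {
    // Equivalent of passing -Dmapred.reduce.tasks=10 on the command line.
    Configuration conf = new Configuration();
    conf.setInt("mapred.reduce.tasks", 10);
    // ... then hand conf to e.g. KMeansDriver.run(...) as in the sketches above.
  }
}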

