mahout-user mailing list archives

From Konstantin Shmakov <>
Subject Re: Fewer reducers leading to better distributed k-means job performance?
Date Tue, 06 Sep 2011 21:33:30 GMT
K-means has mappers, combiners, and reducers, and in my experience the
mappers and combiners are responsible for most of the performance. In
fact, most k-means jobs will use 1 reducer by default.

Did you verify that multiple reducers are actually doing something?

Mappers write each point with its cluster assignment, combiners read
these points and produce intermediate centroids, and the reducer does a
minimal amount of work, producing the final centroids from the
intermediate ones. One possibility is that by specifying #reducers you
partially disable the combiner optimization that runs on the mapper
nodes - this could result in more shuffling and more data sent between
nodes.
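To make the division of labor concrete, here is a minimal standalone sketch (not Mahout's actual classes - the names `combine`/`reduce` and the (sum, count) encoding are my illustration) of why the combiner leaves the reducer so little to do: each mapper's points collapse to one (sum, count) pair per cluster, so only a handful of small records get shuffled.

```java
import java.util.*;

public class CombinerSketch {
    // Combiner: collapse one mapper's points into a single
    // (sum..., count) array per cluster id.
    static Map<Integer, double[]> combine(Map<Integer, List<double[]>> assigned) {
        Map<Integer, double[]> partial = new HashMap<>();
        for (Map.Entry<Integer, List<double[]>> e : assigned.entrySet()) {
            int dim = e.getValue().get(0).length;
            double[] sumAndCount = new double[dim + 1]; // last slot = count
            for (double[] p : e.getValue()) {
                for (int i = 0; i < dim; i++) sumAndCount[i] += p[i];
                sumAndCount[dim] += 1;
            }
            partial.put(e.getKey(), sumAndCount);
        }
        return partial;
    }

    // Reducer: merge the partial (sum, count) pairs for one cluster
    // and divide to get the final centroid - cheap, whatever #reducers is.
    static double[] reduce(List<double[]> partials) {
        int dim = partials.get(0).length - 1;
        double[] merged = new double[dim + 1];
        for (double[] p : partials)
            for (int i = 0; i <= dim; i++) merged[i] += p[i];
        double[] centroid = new double[dim];
        for (int i = 0; i < dim; i++) centroid[i] = merged[i] / merged[dim];
        return centroid;
    }

    public static void main(String[] args) {
        // Two "mappers", all points assigned to cluster 0.
        Map<Integer, List<double[]>> m1 = new HashMap<>();
        m1.put(0, Arrays.asList(new double[]{1, 2}, new double[]{3, 4}));
        Map<Integer, List<double[]>> m2 = new HashMap<>();
        m2.put(0, Collections.singletonList(new double[]{5, 6}));

        double[] c = reduce(Arrays.asList(combine(m1).get(0), combine(m2).get(0)));
        System.out.println(Arrays.toString(c)); // prints "[3.0, 4.0]"
    }
}
```

If the combiner is bypassed, every raw point crosses the network instead of one pair per (mapper, cluster), which is exactly the extra shuffle cost described above.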

-- Konstantin
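P.S. For the load-average and cache-miss checks Ted mentions below, a minimal sketch of what I'd run on each node (the perf invocation is an assumption - it needs the linux-tools package and a task-tracker child pid, so adjust for your distro):

```shell
# First step: confirm load average stays below the core count.
uptime            # e.g. "load average: 3.12, 2.98, 2.75" on a 4-core xlarge
cat /proc/loadavg # same numbers, plus running/total task counts

# If load looks fine, dig into memory behaviour (hypothetical example;
# replace <pid> with a task-tracker child pid):
# perf stat -e cache-misses,context-switches -p <pid> sleep 30
```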

On Tue, Sep 6, 2011 at 9:40 AM, Ted Dunning <> wrote:
> It could also have to do with context switch rate, memory set size, or
> memory bandwidth contention.  Having too many threads of certain kinds
> can cause contention on all kinds of resources.
> Without detailed internal diagnostics, these can be very hard to tease out.
> Looking at load average is a good first step.  If you can get to some
> memory diagnostics about cache miss rates, you might be able to get further.
> On Tue, Sep 6, 2011 at 10:50 AM, Timothy Potter <>wrote:
>> Hi,
>> I'm running a distributed k-means clustering job on a 16-node EC2 cluster
>> (xlarge instances). I've experimented with 3 reducers per node
>> (mapred.reduce.tasks=48) and 2 reducers per node (mapred.reduce.tasks=32).
>> In one of my runs (k=120) on 6m vectors with roughly 20k dimensions, I've
>> seen a 25% improvement in job performance using 2 reducers per node instead
>> of 3 (~45 mins to do 10 iterations with 32 reducers vs. ~1 hour with 48
>> reducers). The input data and initial clusters are the same in both cases.
>> My sense was that maybe I was over-utilizing resources with 3 reducers per
>> node, but in fact the load average remains healthy (< 4 on xlarge instances
>> with 4 virtual cores) and the nodes do not swap or do anything obvious like
>> that. So the only other variable I can think of here is that the number and
>> size of the output files from one iteration, which are sent as input to the
>> next iteration, may have something to do with the performance difference.
>> As further evidence for this hunch, running the job with 11k-dimension
>> vectors, the improvement was only about 10% -- so the performance gains are
>> larger with more data.
>> Does anyone else have any insights to what might be leading to this result?
>> Cheers,
>> Tim