mahout-user mailing list archives

From Timothy Potter <>
Subject Fewer reducers leading to better distributed k-means job performance?
Date Tue, 06 Sep 2011 15:50:30 GMT

I'm running a distributed k-means clustering job on a 16-node EC2 cluster
(xlarge instances). I've experimented with 3 reducers per node
(mapred.reduce.tasks=48) and 2 reducers per node (mapred.reduce.tasks=32).
In one of my runs (k=120) on 6M vectors with roughly 20k dimensions, I've
seen a 25% improvement in job performance using 2 reducers per node instead
of 3 (~45 mins to do 10 iterations with 32 reducers vs. ~1 hour with 48
reducers). The input data and initial clusters are the same in both cases.

My sense was that maybe I was over-utilizing resources with 3 reducers per
node, but in fact the load average remains healthy (< 4 on xlarge instances
with 4 virtual cores) and the nodes do not swap or show anything else
obvious like that. So the only other variable I can think of is that the
number and size of the output files from one iteration, which are fed as
input to the next iteration, may have something to do with the performance
difference. As further evidence for this hunch: running the same job with
11k-dimension vectors, the improvement was only about 10% -- so the
performance gains grow with the amount of data.
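To put rough numbers on that hunch, here is a back-of-envelope sketch. The figures are hypothetical and assume dense double-precision vectors and Mahout-style k-means, where each reducer writes its share of the k updated centroids as one part file per iteration; real SequenceFile overhead and vector sparsity would change the absolute sizes, but the point is that the centroid payload is fixed while the part-file count scales with the reducer count:

```python
# Hedged estimate: total centroid data per iteration is the same either
# way, but more reducers means more, smaller part files, and every map
# task in the NEXT iteration must open all of them.
K = 120                 # clusters (from the run described above)
DIMS = 20_000           # vector dimensionality
BYTES_PER_DOUBLE = 8    # dense double-precision assumption

centroid_bytes = K * DIMS * BYTES_PER_DOUBLE  # total centroid payload

for reducers in (32, 48):
    per_file = centroid_bytes / reducers
    print(f"{reducers} reducers -> {reducers} part files, "
          f"~{per_file / 1e6:.2f} MB each, "
          f"{centroid_bytes / 1e6:.1f} MB total")
```

Under these assumptions the total payload (~19.2 MB) is identical for both runs; what differs is 48 file opens per map task versus 32, so any per-file overhead is multiplied across every map task in every iteration.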

Does anyone else have any insight into what might be leading to this result?

