hadoop-mapreduce-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: Distributing Keys across Reducers
Date Fri, 20 Jul 2012 14:45:57 GMT
On 07/20/2012 09:20 AM, Dave Shine wrote:
> I have a job that is emitting over 3 billion rows from the map to the reduce.  The job
> is configured with 43 reduce tasks.  A perfectly even distribution would amount to about 70
> million rows per reduce task.  However I actually got around 60 million for most of the tasks,
> one task got over 100 million, and one task got almost 350 million.  This uneven distribution
> caused the job to run exceedingly long.
>
> I believe this is referred to as a "key skew problem", which I know is heavily dependent
> on the actual data being processed.  Can anyone point me to any blog posts, white papers,
> etc. that might give me some options on how to deal with this issue?

Hadoop lets you override the default partitioner and replace it with 
your own.  This lets you write a custom partitioning scheme that 
distributes your data more evenly across the reducers.
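To make the idea concrete, here is a minimal pure-Java sketch (not a real Hadoop `Partitioner` subclass, which would extend `org.apache.hadoop.mapreduce.Partitioner` and override `getPartition`). It mirrors the default hash-partitioning logic and shows one common skew workaround: "salting" a known hot key so its records spread round-robin across all partitions. The key name `"hot-key"` and the method names are made-up placeholders; note that salting means each reducer sees only part of the hot key's data, so the results usually need an associative merge or a second pass.

```java
import java.util.HashMap;
import java.util.Map;

public class Main {
    // Mirrors the logic of Hadoop's default HashPartitioner:
    //   (key.hashCode() & Integer.MAX_VALUE) % numPartitions
    static int defaultPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical skew-aware scheme: records for a known hot key are
    // spread round-robin across all partitions via a per-record
    // sequence number ("salting"); all other keys fall through to the
    // default hash. ("hot-key" is a made-up placeholder.)
    static int saltedPartition(String key, long recordSeq, int numPartitions) {
        if ("hot-key".equals(key)) {
            return (int) (recordSeq % numPartitions);
        }
        return defaultPartition(key, numPartitions);
    }

    // Count how many records of the hot key land on each partition.
    static Map<Integer, Integer> distribute(long records, int numPartitions) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (long seq = 0; seq < records; seq++) {
            counts.merge(saltedPartition("hot-key", seq, numPartitions),
                         1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Under the default hash, all 1000 hot-key rows would hit a
        // single reducer; with salting they cover all 43 partitions.
        System.out.println(distribute(1000, 43).size());  // prints 43
    }
}
```

In a real job the same routing decision would live in `getPartition`, configured on the job via `job.setPartitionerClass(...)`.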


