hadoop-mapreduce-user mailing list archives

From Dave Shine <Dave.Sh...@channelintelligence.com>
Subject Distributing Keys across Reducers
Date Fri, 20 Jul 2012 13:20:19 GMT
I have a job that is emitting over 3 billion rows from the map to the reduce.  The job is configured
with 43 reduce tasks, so a perfectly even distribution would be about 70 million rows per reduce
task.  In practice, most tasks received around 60 million rows, one received over 100 million, and
one received almost 350 million.  This uneven distribution caused the job to run exceedingly long.

I believe this is referred to as a "key skew problem", which I know is heavily dependent on
the actual data being processed.  Can anyone point me to any blog posts, white papers, etc.
that might give me some options on how to deal with this issue?
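[Not part of the original message, but for context: one common mitigation for key skew is "salting" — appending a suffix to hot keys in the mapper so their records spread across several reducers, then merging the per-salt partial results for each original key in a second pass. The sketch below is plain Java, no Hadoop dependency; the `SaltDemo` class, key name, and salt count are illustrative. The `partition` method mimics the modulo-of-hash scheme used by Hadoop's default HashPartitioner.]

```java
import java.util.HashSet;
import java.util.Set;

public class SaltDemo {
    // Same idea as Hadoop's default HashPartitioner:
    // mask off the sign bit, then take the hash modulo the reducer count.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 43;   // matches the job described above

        // Without salting, every record for this hot key lands on ONE reducer.
        int unsalted = partition("hotkey", reducers);
        System.out.println("unsalted: all rows go to reducer " + unsalted);

        // With salting, the mapper emits "hotkey#0" .. "hotkey#9" (salt chosen
        // randomly or round-robin per record), spreading the load.
        int salts = 10;
        Set<Integer> targets = new HashSet<>();
        for (int s = 0; s < salts; s++) {
            targets.add(partition("hotkey" + "#" + s, reducers));
        }
        System.out.println("salted: rows spread over " + targets.size()
                + " reducers");
    }
}
```

The trade-off is an extra aggregation step: because a salted key's partial results land on multiple reducers, a second job (or a follow-up combine) has to strip the salt and merge the partials back into one result per original key.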

Dave Shine
Sr. Software Engineer
321.939.5093 direct |  407.314.0122 mobile

CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com<http://www.ciboost.com/>
facebook platform | where-to-buy | product search engines | shopping engines

The information contained in this email message is considered confidential and proprietary
to the sender and is intended solely for review and use by the named recipient. Any unauthorized
review, use or distribution is strictly prohibited. If you have received this message in error,
please advise the sender by reply email and delete the message.
