hadoop-mapreduce-user mailing list archives

From Tim Broberg <Tim.Brob...@exar.com>
Subject Re: Distributing Keys across Reducers
Date Wed, 25 Jul 2012 16:10:23 GMT
Good to know. Thanks for the update.

    - Tim.

On Jul 25, 2012, at 5:21 AM, "Dave Shine" <Dave.Shine@channelintelligence.com> wrote:

> Just wanted to follow up on this issue.  It turned out I was overlooking the obvious: over 8% of the mapper output had exactly the same key, which was actually an invalid value.  Changing the mapper not to emit records with an invalid key made the problem go away.
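[Editor's sketch: the fix described above, dropping records with an invalid key before they leave the mapper, might look roughly like this. The sentinel value and helper names are hypothetical; the thread does not say what the invalid key was, and a real Hadoop mapper would call context.write() for valid records.]

```java
// Hypothetical sketch of filtering invalid keys in the map phase.
// INVALID_KEY and isValidKey are illustrative names only.
public class KeyFilterSketch {
    static final String INVALID_KEY = "";  // assumed sentinel for a bad key

    static boolean isValidKey(String key) {
        return key != null && !key.equals(INVALID_KEY);
    }

    public static void main(String[] args) {
        String[] keys = {"user123", "", "user456", null};
        int emitted = 0;
        for (String k : keys) {
            if (isValidKey(k)) {
                emitted++;  // a real mapper would context.write(key, value) here
            }
        }
        System.out.println("emitted " + emitted + " of " + keys.length + " records");
    }
}
```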
> 
> Moral of the story, verify the data before you blame the software.
> 
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct |  407.314.0122 mobile
> CI Boost™ Clients  Outperform Online™  www.ciboost.com
> 
> 
> -----Original Message-----
> From: Dave Shine [mailto:Dave.Shine@channelintelligence.com] 
> Sent: Friday, July 20, 2012 1:13 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Distributing Keys across Reducers
> 
> Yes, that is a possibility, but it will take some significant rearchitecture.  I was assuming that was what I was going to have to do, until I saw the key distribution problem and thought I might be able to buy some relief by addressing that.
> 
> The job runs once per day, starting at 1:00AM EDT.  I have changed it to use fewer reducers just to see how that affects the distribution.
> 
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct |  407.314.0122 mobile
> CI Boost™ Clients  Outperform Online™  www.ciboost.com
> 
> 
> -----Original Message-----
> From: Tim Broberg [mailto:Tim.Broberg@exar.com]
> Sent: Friday, July 20, 2012 1:03 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Distributing Keys across Reducers
> 
> Just a thought, but can you deal with the problem by increasing granularity — that is, by simply making the jobs smaller?
> 
> If you have enough jobs, when one takes twice as long there will be plenty of other small
jobs to employ the other nodes, right?
> 
>    - Tim.
> 
> ________________________________________
> From: David Rosenstrauch [darose@darose.net]
> Sent: Friday, July 20, 2012 7:45 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Distributing Keys across Reducers
> 
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I have a job that is emitting over 3 billion rows from the map to the reduce.  The
job is configured with 43 reduce tasks.  A perfectly even distribution would amount to about
70 million rows per reduce task.  However I actually got around 60 million for most of the
tasks, one task got over 100 million, and one task got almost 350 million.  This uneven distribution
caused the job to run exceedingly long.
>> 
>> I believe this is referred to as a "key skew problem", which I know is heavily dependent
on the actual data being processed.  Can anyone point me to any blog posts, white papers,
etc. that might give me some options on how to deal with this issue?
> 
> Hadoop lets you override the default partitioner and replace it with your own, so you can write a custom partitioning scheme that distributes your data more evenly.
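[Editor's sketch: the first method below reproduces the arithmetic of Hadoop's default HashPartitioner; the salting variant is one common remedy for key skew, in which the mapper fans a known hot key out across several reducers and the reducer strips the salt before aggregating. The salting scheme is illustrative only — the thread does not specify one.]

```java
import java.util.HashSet;
import java.util.Set;

public class PartitionSketch {
    // Same arithmetic as Hadoop's default HashPartitioner.
    static int defaultPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical remedy: the mapper appends a small salt to known hot
    // keys ("key#0" .. "key#3"), so one hot key spreads over up to
    // `fanout` reducers; the reducer strips the salt before aggregating.
    static String salt(String key, Set<String> hotKeys, int fanout, int recordNum) {
        return hotKeys.contains(key) ? key + "#" + (recordNum % fanout) : key;
    }

    public static void main(String[] args) {
        Set<String> hot = new HashSet<>();
        hot.add("HOT");
        Set<Integer> partitions = new HashSet<>();
        for (int i = 0; i < 8; i++) {
            // With 43 reducers (as in the job described above), the salted
            // hot key now lands on up to 4 distinct partitions.
            partitions.add(defaultPartition(salt("HOT", hot, 4, i), 43));
        }
        System.out.println("hot key spread over " + partitions.size() + " partitions");
    }
}
```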
> 
> HTH,
> 
> DR
> 
