hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Harsh J <ha...@cloudera.com>
Subject Re: Distributing Keys across Reducers
Date Fri, 20 Jul 2012 15:56:25 GMT
Does applying a combiner make any difference? Or are these numbers
with the combiner included?

On Fri, Jul 20, 2012 at 8:46 PM, Dave Shine
<Dave.Shine@channelintelligence.com> wrote:
> Thanks John.
>
> The key is my own WritableComparable object, and I have created custom Comparator, Partitioner,
and KeyValueGroupingComparator.  However, they are all pretty generic.  The Key class is has
two properties, a boolean and a string.  I'm grouping on just the string, but comparing on
both properties to ensure that the reducer receives the "true" values before the "false" values.
>
> My partitioner does the basic hash of just the string portion of the key class.  I'm
hoping to find some guidance on how to make that partitioner smarter and avoid this problem.
>
> Dave Shine
> Sr. Software Engineer
> 321.939.5093 direct |  407.314.0122 mobile
> CI Boost(tm) Clients  Outperform Online(tm)  www.ciboost.com
>
>
> -----Original Message-----
> From: John Armstrong [mailto:jrja@ccri.com]
> Sent: Friday, July 20, 2012 10:20 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Distributing Keys across Reducers
>
> On 07/20/2012 09:20 AM, Dave Shine wrote:
>> I believe this is referred to as a "key skew problem", which I know is
>> heavily dependent on the actual data being processed.  Can anyone
>> point me to any blog posts, white papers, etc. that might give me some
>> options on how to deal with this issue?
>
> I don't know about blog posts or white papers, but the canonical answer here is usually
using a different Partitioner.
>
> The default one takes the .hash() of each Mapper output key and reduces it modulo the
number of Reducers you've specified (43, here).  So the first place I'd look is to see if
there's some reason you're getting so many more outputs with one key-hash-mod-43 than the
others.
>
> A common answer here is that one key alone has a huge number of outputs, in which case
it's hard to do anything better with it.  Another case is that your key class' hash function
is bad at telling apart a certain class of keys that occur with some regularity.  Since 43
is an odd prime, I would not expect a moderately evenly distributed hash to suddenly get spikes
at certain values mod-43.
>
> So if you want to (and can) rejigger your hashes to spread things more evenly, great.
 If not, you're down to writing your own partitioner.
> It's slightly different depending on which API you're using, but either way you basically
have to write a function called getPartition that takes a mapper output record (key and value)
and the number of reducers and returns the index (from 0 to numReducers-1) of the reducer
that should handle that record.  And unless you REALLY know what you're doing, the function
should probably only depend on the key.
>
> Good luck.
>
> The information contained in this email message is considered confidential and proprietary
to the sender and is intended solely for review and use by the named recipient. Any unauthorized
review, use or distribution is strictly prohibited. If you have received this message in error,
please advise the sender by reply email and delete the message.



-- 
Harsh J

Mime
View raw message