hadoop-pig-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gang Luo <lgpub...@yahoo.com.cn>
Subject Re: skew join in pig
Date Mon, 21 Jun 2010 13:47:15 GMT
OK. I konw how many reducers to allocate for one hot key. But how will the reducers allocated
for key1 overlap the reducers for key2? Say key1 needs 3 reducers and is allocated reducer
0, 1, 2. Key2 needs 4 reducers, can we allocate reducers 0, 1, 2, 3 for key2? Or 3, 4, 5,
6? What is the ideas behind the allocation for all of the hot keys? 


----- 原始邮件 ----
发件人: Alan Gates <gates@yahoo-inc.com>
收件人: pig-dev@hadoop.apache.org
发送日期: 2010/6/18 (周五) 2:46:09 下午
主   题: Re: skew join in pig

Are you asking how many reducers are used to split a hot key?  If so, the answer is as many
as we estimate it will take to make the the records for the key fit into memory.  For example,
if we have a key which we estimate has 10 million records, each record being about 100 bytes
and for each reduce task we have 400M available, then we will allocate 3 reducers for that
hot key.  We do not need to take into account any other keys sent to this reducer because
reducers process rows one key at a time.


On Jun 16, 2010, at 11:51 AM, Gang Luo wrote:

> Thanks for replying. It is much clear now. One more thing to ask about the third question
is, how to allocate reducers to several hot keys? Hashing? Further, Pig doesn't divide the
reducers into hot-key reducers and non-hot-key reducers, is it right?
> Thanks,
> -Gang
> ----- 原始邮件 ----
> 发件人: Alan Gates <gates@yahoo-inc.com>
> 收件人: pig-dev@hadoop.apache.org
> 发送日期: 2010/6/16 (周三) 12:16:13 下午
> 主   题: Re: skew join in pig
> On Jun 16, 2010, at 8:36 AM, Gang Luo wrote:
>> Hi,
>> there is something confusing me in the skew join (http://wiki.apache.org/pig/PigSkewedJoinSpec)
>> 1. does the sampling job sample and build histogram on both tables, or just one table
(in this case, which one) ?
> Just the left one.
>> 2. the join job still take the two table as inputs, and shuffle tuples from partitioned
table to particular reducer (one tuple to one reducer), and shuffle tuples from streamed table
to all reducers associative to one partition (one tuple to multiple reducers). Is that correct?
> Keys with small enough values to fit in memory are shuffled to reducers as normal.  Keys
that are too large are split between reducers on the left side, and replicated to all of those
reducers that have the splits (not all reducers) on the right side.  Does that answer your
>> 3. Hot keys need more than one reducers. Are these reducers dedicated to this key
only? Could they also take other keys at the same time?
> They take other keys at the same time.
>> 4. for non-hot keys, my understanding is that they are shuffled to reducers based
on default hash partitioner. However, it could happen all the keys shuffled to one reducers
incurs skew even none of them is skewed individually.
> This is always the case in map reduce, though a good hash function should minimize the
occurrences of this.
>> Can someone give me some ideas on these? Thanks.
>> -Gang
> Alan.


View raw message