hadoop-pig-dev mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject skew join in pig
Date Wed, 16 Jun 2010 15:36:48 GMT
A few things about the skewed join (http://wiki.apache.org/pig/PigSkewedJoinSpec) confuse me:
1. Does the sampling job sample and build a histogram on both tables, or on just one table
(and in that case, which one)?
2. The join job still takes the two tables as inputs. It shuffles each tuple from the
partitioned table to one particular reducer (one tuple to one reducer), and shuffles each
tuple from the streamed table to all reducers associated with one partition (one tuple to
multiple reducers). Is that correct?
3. A hot key needs more than one reducer. Are these reducers dedicated to that key only, or
could they also take other keys at the same time?
4. For non-hot keys, my understanding is that they are shuffled to reducers by the default
hash partitioner. However, many keys could end up hashed to the same reducer and cause skew
there, even though none of them is skewed individually.
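
To make questions 2 and 4 concrete, here is a minimal Python sketch of the routing as I
understand it. It is not Pig's actual implementation. The hot-key table, the reducer count,
the round-robin choice, and the toy hash function are all assumptions made up for
illustration.

```python
from collections import defaultdict
from itertools import count

# Hypothetical output of the sampling job: hot key -> reducers assigned to it.
HOT_KEY_REDUCERS = {"k_hot": [0, 1, 2]}
NUM_REDUCERS = 4  # assumed number of reducers in the join job

# One round-robin counter per hot key (an assumed way to spread its tuples).
_next_slot = defaultdict(count)


def stable_hash(key):
    """Toy deterministic hash standing in for the default hash partitioner."""
    return sum(key.encode("utf-8"))


def route_partitioned(key):
    """Partitioned table: each tuple goes to exactly one reducer."""
    reducers = HOT_KEY_REDUCERS.get(key)
    if reducers is None:
        return [stable_hash(key) % NUM_REDUCERS]
    slot = next(_next_slot[key])
    return [reducers[slot % len(reducers)]]


def route_streamed(key):
    """Streamed table: a hot-key tuple is copied to every reducer of that key."""
    reducers = HOT_KEY_REDUCERS.get(key)
    if reducers is None:
        return [stable_hash(key) % NUM_REDUCERS]
    return list(reducers)
```

In this sketch, successive partitioned-table tuples for "k_hot" rotate over reducers 0, 1
and 2, while every streamed-table tuple for "k_hot" is sent to all three. The non-hot keys
"a", "e" and "i" all hash to reducer 1 under the toy hash, which is the situation question 4
asks about.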

Can someone give me some ideas on these? Thanks.


