hive-user mailing list archives

From Adrian Popescu
Subject Re: handling joins in Hive 0.11.0
Date Fri, 13 Dec 2013 19:17:58 GMT


I found that the dependency graph among task stages is incorrect in 
the plan produced by the skew join optimization.

In particular, the conditional task in the optimized plan keeps no 
dependency on the child tasks of the common join task in the original 
plan. The conditional task wraps the map join task, which does carry 
all these dependencies, but when the map join task is filtered out at 
runtime, those dependencies are lost along with it.
Hence, all the remaining task stages of the query are skipped.

The bug resides in "ql/optimizer/physical/", in the 
processSkewJoin() function,
immediately after the ConditionalTask is created and its dependencies 
are set.

For now I have fixed the issue by adding dependencies between the 
ConditionalTask and all the child tasks of the common
join task of the original plan.

From the original design I see that only tasks included in the 
ConditionalTask are allowed to have dependencies,
so I am wondering what the correct alternative implementation would be. 
Maybe adding a "nop" task inside the
ConditionalTask (in addition to the map join task), so that the 
dependencies are maintained when the
map join task is filtered out?
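
To make the problem and the two options above concrete, here is a minimal 
toy model. It is only a sketch under stated assumptions: Stage, 
ConditionalStage, Variant, buildPlan() and run() are hypothetical names 
invented for illustration and are not Hive's Task or ConditionalTask 
classes, and the traversal is a simplification of how stages get scheduled.

```java
import java.util.ArrayList;
import java.util.List;

public class SkewJoinDependencySketch {

    // Three plan shapes: the buggy one, the direct-dependency workaround,
    // and the proposed "nop" candidate inside the conditional stage.
    enum Variant { UNPATCHED, DIRECT_DEPENDENCY, NOP_CANDIDATE }

    /** Toy stand-in for a plan stage (hypothetical, not Hive's Task). */
    static class Stage {
        final String name;
        final List<Stage> children = new ArrayList<>();
        Stage(String name) { this.name = name; }
    }

    /** Toy conditional stage: picks at most one wrapped stage at runtime. */
    static class ConditionalStage extends Stage {
        Stage skewCandidate; // runs only when skew is detected
        Stage fallback;      // hypothetical nop stage; null if absent
        ConditionalStage(String name) { super(name); }
    }

    static Stage buildPlan(Variant variant) {
        Stage commonJoin = new Stage("common-join");
        ConditionalStage conditional = new ConditionalStage("conditional");
        Stage mapJoin = new Stage("skew-map-join");
        Stage downstream = new Stage("downstream");

        commonJoin.children.add(conditional);
        conditional.skewCandidate = mapJoin;
        // Downstream work hangs only off the map join in the optimized plan.
        mapJoin.children.add(downstream);

        if (variant == Variant.DIRECT_DEPENDENCY) {
            // Workaround: the conditional stage itself depends on downstream.
            conditional.children.add(downstream);
        } else if (variant == Variant.NOP_CANDIDATE) {
            // Alternative: a do-nothing candidate carries the dependency.
            Stage nop = new Stage("nop");
            nop.children.add(downstream);
            conditional.fallback = nop;
        }
        return commonJoin;
    }

    /** Returns the names of the stages that actually get executed. */
    static List<String> run(Stage root, boolean skewTriggered) {
        List<String> executed = new ArrayList<>();
        visit(root, skewTriggered, executed);
        return executed;
    }

    private static void visit(Stage stage, boolean skewTriggered,
                              List<String> executed) {
        if (stage == null || executed.contains(stage.name)) {
            return; // filtered-out candidate, or stage already executed
        }
        executed.add(stage.name);
        if (stage instanceof ConditionalStage) {
            ConditionalStage c = (ConditionalStage) stage;
            visit(skewTriggered ? c.skewCandidate : c.fallback,
                  skewTriggered, executed);
        }
        for (Stage child : stage.children) {
            visit(child, skewTriggered, executed);
        }
    }

    public static void main(String[] args) {
        for (Variant v : Variant.values()) {
            System.out.println(v + ", skew not triggered: "
                    + run(buildPlan(v), false));
        }
    }
}
```

In this model the UNPATCHED plan stops after the conditional stage when 
the skew is not triggered, while both other variants still reach the 
downstream stage. The nop variant keeps every dependency inside the 
conditional stage, which seems closer to the original design.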


On 11/15/2013 10:20 PM, Adrian Popescu wrote:
> 2. In my experiments I also evaluate skewed joins. I enable skew joins 
> through "hive.optimize.skewjoin" and I run the same
> tpch query 5. The skew join is not actually triggered as the number of 
> rows with the same key is less than "hive.skewjoin.key".
> Hence, the map join corresponding to the skewed join is filtered out 
> at runtime, but unfortunately all the other stages
> are also filtered out. Thus, no result is actually generated. If I 
> disable the skew join optimization, the query running only with
> common joins returns the result correctly.
> I believe this is a bug when the skew join operator is enabled but not 
> triggered. Has anyone experienced the same problem with
> skew joins on queries with multiple map reduce joins? I attach the 
> explain plan. Essentially only stages 6 and 22 are executed.
> Everything else is skipped silently with no output result being 
> generated, nor error in "hive.log". Similar behaviour is observed
> for other TPCH queries.
> Many thanks,
> Adrian

