hive-user mailing list archives

From Adrian Popescu <adrian.pope...@epfl.ch>
Subject handling joins in Hive 0.11.0
Date Fri, 15 Nov 2013 21:20:43 GMT

Hello everyone,

I have two questions on join optimizations in Hive, one of which I 
believe is a bug in the 0.11.0 release.

1. From the Hive online documentation I see that multiple map joins can be 
grouped into a single MapReduce job if the input tables are joined on the 
same join key. From my experiments on TPC-H I see that multiple map joins 
are also grouped into the same MapReduce job when all the input for the 
tables involved fits into a single mapper. For instance, query 5 first 
joins nation with region, and the result is then joined with supplier. 
From the explain plan I see that both joins run in a single MapReduce job 
as "nested map join operators". The explain plan is attached (please see 
Stage 21 in "q5_explained_MJ.txt" for the nested MapJoin operators).

Is it possible to disable this feature (i.e., to have one MR job for each 
join)? The setting I use to trigger map joins is 
"hive.auto.convert.join=true".
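
For reference, a minimal sketch of the setup described above. The table 
and column names assume the standard TPC-H schema, and the query is a 
simplified stand-in for the nation/region/supplier part of query 5, not 
the full benchmark query:

```sql
-- Enable automatic conversion of common joins to map joins.
set hive.auto.convert.join=true;

-- Simplified stand-in for the first two joins of TPC-H query 5
-- (assumes tables nation, region, supplier with standard TPC-H columns).
EXPLAIN
SELECT n.n_name, s.s_suppkey
FROM nation n
JOIN region r ON n.n_regionkey = r.r_regionkey
JOIN supplier s ON s.s_nationkey = n.n_nationkey;
```

In the plan produced this way, the thing to look for is whether both 
MapJoin operators appear inside the same stage.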


2. In my experiments I also evaluate skewed joins. I enable skew joins 
through "hive.optimize.skewjoin" and run the same TPC-H query 5. The skew 
join is not actually triggered because the number of rows with the same 
key is below "hive.skewjoin.key". Hence, the map join corresponding to the 
skewed join is filtered out at runtime, but unfortunately all the other 
stages are filtered out as well. Thus, no result is actually generated. If 
I disable the skew join optimization, the query runs with common joins 
only and returns the result correctly.
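
A minimal sketch of the session settings involved in this second case. 
The threshold value shown is Hive's documented default and is only an 
assumption here; the value used in the actual runs is not stated above:

```sql
-- Enable the runtime skew join optimization.
set hive.optimize.skewjoin=true;

-- Number of rows per join key above which the key is treated as skewed.
-- 100000 is the documented default, assumed here for illustration.
set hive.skewjoin.key=100000;
```

With these settings, query 5 is run unchanged; setting 
hive.optimize.skewjoin back to false gives the common-join baseline that 
returns the correct result.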

I believe this is a bug that occurs when the skew join operator is enabled 
but not triggered. Has anyone experienced the same problem with skew joins 
on queries with multiple MapReduce joins? I attach the explain plan. 
Essentially only stages 6 and 22 are executed. Everything else is skipped 
silently: no output is generated and no error appears in "hive.log". 
Similar behaviour is observed for other TPC-H queries.

Many thanks,
Adrian

