hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hadi Moshayedi <>
Subject map join, bucket map join, and sort merge bucket map join
Date Sat, 06 Oct 2012 10:24:10 GMT
 When I want to do bucket map join between tables A and B (A has small
buckets), which should load matching buckets of A into each mapper of B,
the things I do is (1) bucket A and B on the join keys with equal number of
buckets, (2) set hive.optimize.bucketmapjoin = true; (3) run the following

FROM a JOIN B ON  a.join_key = b.join_key;

 Question 1: Is my sequence of actions enough for triggering bucket map

After running the above query, what I observe is that a local task starts
hashing A, and creating multiple hashed buckets and then loads it into
mapper of B, and starts a job which doesn't have any reducers.

 Question 2: Why is it hashing it in the local task and not in the mapper
stage? This seems to take a lot of time, which can be avoided. I mean why
it doesn't upload the row bucket instead of hashed bucket, and do the
hashing in the mapper (and parallelize the hashing, instead of doing it

 If I have tables A and B sorted and bucketed by join keys and have the
same number of buckets and setting the 3 flags specified in the
and again running the above query, I see that this time it doesn't do the
hashing, but it does the join in just 1 mapper.

 Question 3: Why is it doing it in one mapper? Joins between different
buckets seem to be independent, so why it doesn't doing it in multiple


View raw message