hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pala M Muthaia <mchett...@rocketfuelinc.com>
Subject Why does SMB join generate hash table locally, even if input tables are large?
Date Tue, 29 Jul 2014 20:56:16 GMT
Hi,

I am testing SMB join for 2 large tables. The tables are bucketed and
sorted on the join column. I notice that even though the table is large,
Hive attempts to generate hash table for the 'small' table locally,
 similar to map join. Since the table is large in my case, the client runs
out of memory and the query fails.

I am using Hive 0.12 with the following settings:

set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format =
org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

My test query does a simple join and a select, no subqueries/nested queries
etc.

I understand why a (bucket) map join requires hash table generation, but
why is that included for an SMB join? Shouldn't a SMB join just spin up one
mapper for each bucket and perform a sort merge join directly on the mapper?


Thanks,
pala

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message