hive-user mailing list archives

From Pala M Muthaia <mchett...@rocketfuelinc.com>
Subject Re: Why does SMB join generate hash table locally, even if input tables are large?
Date Wed, 30 Jul 2014 19:08:12 GMT
+hive-users


On Tue, Jul 29, 2014 at 1:56 PM, Pala M Muthaia <mchettiar@rocketfuelinc.com> wrote:

> Hi,
>
> I am testing an SMB join of 2 large tables. Both tables are bucketed and
> sorted on the join column. I notice that even though both tables are large,
> Hive attempts to generate a hash table for the 'small' table locally,
> similar to a map join. Since that table is large in my case, the client runs
> out of memory and the query fails.
>
> I am using Hive 0.12 with the following settings:
>
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>
> My test query does a simple join and a select, no subqueries/nested
> queries etc.
>
> I understand why a (bucket) map join requires hash table generation, but
> why is that included for an SMB join? Shouldn't an SMB join just spin up one
> mapper for each bucket and perform a sort merge join directly on the mapper?
>
>
> Thanks,
> pala
>
>
>
>
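
For illustration only, here is a minimal Python sketch of the per-bucket merge
that the question above expects from an SMB join. It assumes both buckets are
already sorted on the join key; the function name and the sample rows are
hypothetical, and this is not Hive's implementation. The point is that the
merge only needs to track the current run of equal keys on each side, so no
hash table of the whole 'small' table is built.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) rows, each pre-sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key has no match yet, advance left
        elif lk > rk:
            j += 1          # right key has no match yet, advance right
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


# Hypothetical rows standing in for one bucket of each table.
bucket_a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
bucket_b = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = sort_merge_join(bucket_a, bucket_b)
```

In this sketch each input is scanned once from start to end, which is why the
approach depends on both sides being bucketed and sorted on the join column.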
