hive-user mailing list archives

From murali parimi <muralikrishna.par...@icloud.com>
Subject Re: Partitioned table and Bucket Map Join
Date Thu, 29 Jan 2015 17:46:24 GMT
Hello, I apologize for the confusion. Let me restate the problem.

I have two tables, A and B, which are partitioned on column X and bucketed (same number of
buckets) on column Y. Table A is the larger one (~135 GB) and table B is the smaller one
(33 GB). Both tables have around 3.1 billion records. The storage format is ORC.
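
For concreteness, a layout like this could be declared roughly as below. This is only a
sketch: the table names, the column names and types, and the bucket count of 32 are
assumptions, not the real schema. As far as I understand, the sort merge variant also
expects the buckets to be declared SORTED BY the join key.

```sql
-- Hypothetical schema: names, types and the bucket count are placeholders.
CREATE TABLE a (
  y   BIGINT,
  val STRING
)
PARTITIONED BY (x STRING)
CLUSTERED BY (y) SORTED BY (y ASC) INTO 32 BUCKETS
STORED AS ORC;

-- Table B uses the same bucketing column and the same number of buckets.
CREATE TABLE b (
  y     BIGINT,
  other STRING
)
PARTITIONED BY (x STRING)
CLUSTERED BY (y) SORTED BY (y ASC) INTO 32 BUCKETS
STORED AS ORC;
```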

I intended to do a sort merge bucket map join, hoping that no reducers would be spawned and
the join would happen on the map side. I used the following settings.

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.enforce.sorting=true;
 
Hive version 13.

Any thoughts?

Thanks,
Murali


On Jan 29, 2015, at 07:44 PM, matshyeq <matshyeq@gmail.com> wrote:

My hunch is that, while partitioning is in fact very similar to bucketing (arguably superior,
as you have some control over which file the data goes to), the Hive optimizer only applies
bucket joins if your tables are bucketed. So a join condition of
   t1.bucketed_column = t2.bucketed_column
triggers the bucket map join,
but
   t1.partitioned_column = t2.partitioned_column
doesn't.
I'm hoping someone with deeper Hive knowledge will be able to confirm this.
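
One way to check this would be to compare the plans for the two join conditions with
EXPLAIN. A sketch, assuming hypothetical tables a and b that are partitioned on x and
bucketed on y (all names here are placeholders):

```sql
-- Join on the bucketed column: this is the plan where a bucket map join
-- (or sort merge bucket map join) operator would show up, if one is chosen.
EXPLAIN
SELECT a.y, b.y
FROM a JOIN b ON a.y = b.y;

-- Join on the partition column only: compare this plan with the one above.
EXPLAIN
SELECT a.x, b.x
FROM a JOIN b ON a.x = b.x;
```

If a plan still shows the join inside a reduce stage, the map-side conversion did not take
place for that query.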

Thank you,
Kind Regards 
~Maciek

On Thu, Jan 29, 2015 at 1:51 PM, murali parimi <muralikrishna.parimi@icloud.com> wrote:
I faced the same situation: two tables with 3 billion records on each side, partitioned and
sorted on the same key. I set the following parameters in the Hive query, assuming the join
would happen in the map phase.

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.enforce.sorting=true;

I am using Hive version 13 and the storage format is ORC. One of the tables is small in size,
but I haven't checked whether it can fit in the cache, as we have huge memory. But the
map-side join didn't happen. What could be the reason?

Sent from my iPhone

> On Jan 29, 2015, at 7:38 AM, matshyeq <matshyeq@gmail.com> wrote:
>
> I do have two tables partitioned on the same criteria.
> Could I still take advantage of Bucket Map Join or better, Sort Merge Bucket Map Join?
> How?
>
> ~Maciek

