hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matthew Dixon <matthew.di...@jagex.com>
Subject RE: COMMERCIAL:Re: Partitioned table and Bucket Map Join
Date Fri, 30 Jan 2015 09:17:58 GMT
Not sure if this is going to solve your problems and I agree with your point about partition
join optimisation but if your query is indeed an inner join (and not A LEFT OUTER JOIN B)
then you should arrange your table in order from smallest to biggest.  See this section on
the hive wiki:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

In every map/reduce stage of the join, the last table in the sequence is streamed through
the reducers where as the others are buffered. Therefore, it helps to reduce the memory needed
in the reducer for buffering the rows for a particular value of the join key by organizing
the tables such that the largest tables appear last in the sequence. e.g. in

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)


I wonder also whether manually distributing and sorting on the partition column x for each
table in a subquery (before joining) might encourage hive to join them efficiently in the
subsequent stage...

From: murali parimi [mailto:muralikrishna.parimi@icloud.com]
Sent: 29 January 2015 18:56
To: user@hive.apache.org
Cc: user
Subject: COMMERCIAL:Re: Partitioned table and Bucket Map Join

agreed!

On Jan 29, 2015, at 11:42 PM, matshyeq <matshyeq@gmail.com<mailto:matshyeq@gmail.com>>
wrote:
no confusion here.
My use case is exactly the same.
1. What I was saying is my/your join condition looks like (or should look like, in your terms):

FROM A JOIN B
ON A.X = B.X
AND A.Y = B.Y

which should trigger merge bucket map join in my opinion:
Data locality information is full - you may look at the partitioning here as just another
bucketing level - data should be joined within the SAME partitions and the SAME buckets, 1:1!
Apparently Hive optimizer is not (yet?) considering partitioning for such optimization.
To me it should. Especially for cases where no bucketing is done on tables and partitioning
columns are used in join from both sides (FROM A JOIN B ON A.X= B.X).

2. If your query join is only based on a bucketing condition:
FROM A JOIN B
ON A.Y = B.Y

then the mappers wouldn't know which partition to join data from particular bucket. Could
still potentially only look for SAME bucket files in ALL available partitions but it's not
1:1 relation anymore so probably wouldn't gain that much by such optimization. Anyway that
optimization doesn't seem to be there either.

This thread is only to get a confirmation about the above (or an idea what I am/we are doing
wrong)

~Maciek

On Thu, Jan 29, 2015 at 5:46 PM, murali parimi <muralikrishna.parimi@icloud.com<mailto:muralikrishna.parimi@icloud.com>>
wrote:

Hello apologize for the confusion. Here I will iterate the problem again.

I have two tables A, B which are partitioned on column X and bucketed (Same number of buckets)
based on column Y. Table A is huge in terms of size (~135GB) and Table B is smaller table
in terms of size (33GB). Both the tables has around 3.1 billion records.Storage format is
ORC.

I intended to a sort merger bucket map join hoping there no reducers will be spawned and the
join will happen on map side. I have used the following settings.

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;set hive.enforce.sorting=true;

Hive version 13.

Any thoughts!

Thanks,
Murali


On Jan 29, 2015, at 07:44 PM, matshyeq <matshyeq@gmail.com<mailto:matshyeq@gmail.com>>
wrote:
My hunch is while partitioning is in fact very similar to bucketing (actually superior as
you have some control over what file data goes to) the hive optimizer only applies bucket
joins if your tables are bucketed so your join condition
   t1.bucketed_column = t2.bucketed_column
triggers the bucketed map join
but
   t1.partitioned_column = t2.partitioned_column
doesn't.
I'm hoping someone with deeper Hive knowledge would be able to confirm this.

Thank you,
Kind Regards
~Maciek

On Thu, Jan 29, 2015 at 1:51 PM, murali parimi <muralikrishna.parimi@icloud.com<mailto:muralikrishna.parimi@icloud.com>>
wrote:

I faced the same situation where two tables with 3 billion records on each side and partitioned,
sorted on same key. Set the following parameters in the hive query assuming the join will
happen in the map phase.

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.enforce.sorting=true;

I am using hive version 13 and the storage format is Orc. One of the table is small in size
but I haven't checked whether irfan fit in the cache as we have huge memory. But the map sided
join didn't happen. What could be the reason?

Sent from my iPhone

> On Jan 29, 2015, at 7:38 AM, matshyeq <matshyeq@gmail.com<mailto:matshyeq@gmail.com>>
wrote:
>
> I do have two tables partitioned on the same criteria.
> Could I still take advantage of Bucket Map Join or better, Sort Merge Bucket Map Join?
> How?
>
> ~Maciek


-- 

IMPORTANT NOTICE

The sender does not guarantee that this message, including any attachment, is secure or virus
free. Also, it is confidential and may be privileged or otherwise protected from disclosure.
If you are not the intended recipient, do not disclose or copy it or its contents. Please
telephone or email the sender and delete the message entirely from your system. No binding
obligations or payment commitments are to be derived from the contents of this email unless
and until a clear written agreement containing all the necessary terms and conditions is properly
executed.

Jagex Limited is a company registered in England & Wales with company number 03982706
and a registered office at St John's Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.

Mime
View raw message