hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From matshyeq <matsh...@gmail.com>
Subject Re: Partitioned table and Bucket Map Join
Date Fri, 30 Jan 2015 11:35:20 GMT
Raised enhancement request: HIVE-9523
<https://issues.apache.org/jira/browse/HIVE-9523>

As for table join order - wasn't aware of that behaviour (always read the
small print! ;). Not the same magnitude of possible improvement but still
something. Thanks.
~Macek

Subject:RE: COMMERCIAL:Re: Partitioned table and Bucket Map Join
<http://markmail.org/message/5vwyo5wfleozqzjh>[image: permalink]
<http://markmail.org/message/5vwyo5wfleozqzjh>From:Matthew Dixon (matt...@
jagex.com)Date:Jan 30, 2015 1:17:35 amList:*org.apache.hadoop.hive-user*

Not sure if this is going to solve your problems and I agree with your point
about partition join optimisation but if your query is indeed an inner join
(and
not A LEFT OUTER JOIN B) then you should arrange your table in order from
smallest to biggest. See this section on the *hive* wiki:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

In every map/reduce stage of the join, the last table in the sequence is
streamed through the reducers where as the others are buffered. Therefore,
it
helps to reduce the memory needed in the reducer for buffering the rows for
a
particular value of the join key by organizing the tables such that the
largest
tables appear last in the sequence. e.g. in

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON
(c.key =
b.key1)

I wonder also whether manually distributing and sorting on the partition
column
x for each table in a subquery (before joining) might encourage *hive* to
join
them efficiently in the subsequent stage...

On Thu, Jan 29, 2015 at 6:55 PM, murali parimi <
muralikrishna.parimi@icloud.com> wrote:

> agreed!
>
> On Jan 29, 2015, at 11:42 PM, matshyeq <matshyeq@gmail.com> wrote:
>
> no confusion here.
> My use case is exactly the same.
> 1. What I was saying is my/your join condition looks like (or should look
> like, in your terms):
>
> FROM A JOIN B
> ON A.X = B.X
> AND A.Y = B.Y
>
> which should trigger merge bucket map join in my opinion:
> Data locality information is full - you may look at the partitioning here
> as just another bucketing level - data should be joined within the SAME
> partitions and the SAME buckets, 1:1!
> Apparently Hive optimizer is not (yet?) considering partitioning for such
> optimization.
> To me it should. Especially for cases where no bucketing is done on tables
> and partitioning columns are used in join from both sides (FROM A JOIN B ON
> A.X= B.X).
>
> 2. If your query join is only based on a bucketing condition:
> FROM A JOIN B
> ON A.Y = B.Y
>
> then the mappers wouldn't know which partition to join data from
> particular bucket. Could still potentially only look for SAME bucket files
> in ALL available partitions but it's not 1:1 relation anymore so probably
> wouldn't gain that much by such optimization. Anyway that optimization
> doesn't seem to be there either.
>
> This thread is only to get a confirmation about the above (or an idea what
> I am/we are doing wrong)
>
> ~Maciek
>
> On Thu, Jan 29, 2015 at 5:46 PM, murali parimi <
> muralikrishna.parimi@icloud.com> wrote:
>
>> Hello apologize for the confusion. Here I will iterate the problem again.
>>
>> I have two tables A, B which are partitioned on column X and bucketed
>> (Same number of buckets) based on column Y. Table A is huge in terms of
>> size (~135GB) and Table B is smaller table in terms of size (33GB). Both
>> the tables has around 3.1 billion records.Storage format is ORC.
>>
>> I intended to a sort merger bucket map join hoping there no reducers will
>> be spawned and the join will happen on map side. I have used the following
>> settings.
>>
>> set
>>> hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>>> set hive.optimize.bucketmapjoin=true;
>>> set hive.optimize.bucketmapjoin.sortedmerge=true;set
>>> hive.enforce.sorting=true;
>>
>>
>> Hive version 13.
>>
>> Any thoughts!
>>
>> Thanks,
>> Murali
>>
>>
>> On Jan 29, 2015, at 07:44 PM, matshyeq <matshyeq@gmail.com> wrote:
>>
>> My hunch is while partitioning is in fact very similar to bucketing
>> (actually superior as you have some control over what file data goes to)
>> the hive optimizer only applies bucket joins if your tables are bucketed so
>> your join condition
>>    t1.bucketed_column = t2.bucketed_column
>> triggers the bucketed map join
>> but
>>    t1.partitioned_column = t2.partitioned_column
>> doesn't.
>> I'm hoping someone with deeper Hive knowledge would be able to confirm
>> this.
>>
>> Thank you,
>> Kind Regards
>> ~Maciek
>>
>> On Thu, Jan 29, 2015 at 1:51 PM, murali parimi <
>> muralikrishna.parimi@icloud.com> wrote:
>>
>>> I faced the same situation where two tables with 3 billion records on
>>> each side and partitioned, sorted on same key. Set the following parameters
>>> in the hive query assuming the join will happen in the map phase.
>>>
>>> set
>>> hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>>> set hive.optimize.bucketmapjoin=true;
>>> set hive.optimize.bucketmapjoin.sortedmerge=true;
>>> set hive.enforce.sorting=true;
>>>
>>> I am using hive version 13 and the storage format is Orc. One of the
>>> table is small in size but I haven't checked whether irfan fit in the cache
>>> as we have huge memory. But the map sided join didn't happen. What could be
>>> the reason?
>>>
>>> Sent from my iPhone
>>>
>>> > On Jan 29, 2015, at 7:38 AM, matshyeq <matshyeq@gmail.com> wrote:
>>> >
>>> > I do have two tables partitioned on the same criteria.
>>> > Could I still take advantage of Bucket Map Join or better, Sort Merge
>>> Bucket Map Join?
>>> > How?
>>> >
>>> > ~Maciek
>>>
>>
>>
>

Mime
View raw message