hive-user mailing list archives

From Lefty Leverenz <leftylever...@gmail.com>
Subject Re: STREAMTABLE And MAPJOIN
Date Tue, 24 Dec 2013 09:14:26 GMT
This seems useful, so I added a sentence to the explanation of STREAMTABLE
in the Joins wikidoc
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins#LanguageManualJoins-Examples>:


>    In every map/reduce stage of the join, the table to be streamed can be
>    specified via a hint, e.g. in
>
>    SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON
>    (a.key = b.key1) JOIN c ON (c.key = b.key1)
>
>    all three tables are joined in a single map/reduce job, and the values
>    for a particular value of the key for tables b and c are buffered in
>    memory in the reducers. Then for each row retrieved from a, the join is
>    computed with the buffered rows. If the STREAMTABLE hint is omitted,
>    Hive streams the rightmost table in the join.
>
>
But I didn't specify inner joins.  Should that be made clear?

Thanks.  -- Lefty


On Tue, Dec 3, 2013 at 1:40 AM, Nitin Pawar <nitinpawar432@gmail.com> wrote:

> This is my understanding of both. Wait for the Hive gurus to correct me
> if I made any mistakes.
>
>
> In Hive, when an inner join query runs, the rows of the table in the last
> position on the right are streamed through the reducers rather than
> buffered in memory. This is the default behavior.
>
> So say you have a query: select blah blah from t1 join t2 join t3 join t4
> on (blah blah).
> The maps emit key/value pairs for all four tables. In the reducers, the
> rows from t1, t2 and t3 are buffered in memory, while the rows from t4 are
> streamed for better memory management. That's why it's advised to put the
> largest table on the right.
>
> This default behavior is changed by STREAMTABLE(t1), which lets you tell
> Hive which table's data you want streamed.
>
> On the other hand, mapjoin is a join in which no reducers are involved.
> The smaller table is loaded into the memory of each map task, and the
> join is performed by the maps themselves. As the smaller table's data is
> available in memory and the reduce step is removed completely, map joins
> are very fast.
>
>
> On Tue, Dec 3, 2013 at 2:47 PM, Baahu <bahubali@gmail.com> wrote:
>
>> Hi,
>> What is the difference between the hints STREAMTABLE and MAPJOIN?
>>
>> Thanks,
>> Baahu
>>
>>
>
>
> --
> Nitin Pawar
>

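To make the contrast in this thread concrete, here is a minimal HiveQL sketch of the two hints side by side. The table names big_fact and small_dim and their key/val columns are hypothetical, chosen only for illustration; the sketch assumes small_dim is small enough to fit in memory.

```sql
-- Hypothetical tables: big_fact (large) and small_dim (small).

-- Reduce-side join with STREAMTABLE: rows of small_dim are buffered in the
-- reducers, and rows of big_fact are streamed even though big_fact is not
-- the rightmost table in the join.
SELECT /*+ STREAMTABLE(f) */ f.val, d.val
FROM big_fact f JOIN small_dim d ON (f.key = d.key);

-- Map-side join with MAPJOIN: small_dim is loaded into memory in each
-- mapper, so the join itself needs no reduce phase.
SELECT /*+ MAPJOIN(d) */ f.val, d.val
FROM big_fact f JOIN small_dim d ON (f.key = d.key);
```

Depending on the Hive version and settings (for example hive.ignore.mapjoin.hint and hive.auto.convert.join), the MAPJOIN hint may be ignored in favor of automatic map-join conversion, so it is worth checking the query plan with EXPLAIN.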