hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HIVE-428) Implement Map-side Hash-Join in Hive
Date Sat, 18 Apr 2009 11:39:15 GMT

     [ https://issues.apache.org/jira/browse/HIVE-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang resolved HIVE-428.
-------------------------------

      Resolution: Duplicate
    Release Note: Duplicate of HIVE-195

Closing this issue as a duplicate of HIVE-195.

> Implement Map-side Hash-Join in Hive
> ------------------------------------
>
>                 Key: HIVE-428
>                 URL: https://issues.apache.org/jira/browse/HIVE-428
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>
> There are many situations in which a join performs much better when a map-side hash join
> is used. In a small test with a simple equi-join of two tables, a plain MapReduce join with
> no map-side hash join took about 50 seconds on a 6-node cluster (each node with 8 cores and
> 4 GB of memory). With the map-side hash join applied, it took only about 15 seconds.
> The map-side hash join can only be used when there are small files that can be replicated
> to each map. The map-side hash join can be co-executed with the map-side filter.
> For example, 
> select A.a, A.c, B.b from A,B where A.a=B.d and A.a < 12 and B.b=10
> In our experiment, this statement can be translated into three different plans if both
> A and B are plain data files (with no special compression).
> Plan 1
> Map-Reduce
> Both A and B are inputs to the map. The shuffle data involved is very large.
> Plan 2
> 1) First filter B on B.b into a temp file B1; this is a separate map-only job.
> 2) Replicate B1 to each map; each map filters A and joins it with B1.
> No reduce is used.
> Plan 3
> Produce a job in which each mapper filters A (so the mappers are assigned with regard
> to A only), and replicate B directly to each mapper.
> Before each mapper starts filtering A, it filters B and loads the rows of B that pass
> into memory. The mapper then starts and performs the join in memory.
> Plan 3 performed better in our experiment because it saves a separate map-only job. But
> Plan 2 is suitable when B's original file is very large but its filtered file is much
> smaller.
> This is the basic idea of Map side hash join.
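
The Plan 3 flow described above can be sketched roughly as follows. This is only an illustration in plain Python, not Hive code: the function names (build_hash_table, map_side_hash_join), the row layout as dictionaries, and the toy data are all assumptions made for the example. Each mapper is assumed to receive one split of A plus a full copy of the small table B.

```python
from collections import defaultdict


def build_hash_table(b_rows, b_filter, key):
    """Filter the small table B and index the surviving rows by join key.

    In Plan 3 this would run once per mapper, before it starts reading A.
    """
    table = defaultdict(list)
    for row in b_rows:
        if b_filter(row):
            table[row[key]].append(row)
    return table


def map_side_hash_join(a_split, b_table, a_filter, a_key):
    """Filter one split of A and probe the in-memory table; no reduce phase."""
    for a in a_split:
        if not a_filter(a):
            continue
        # .get() avoids inserting empty entries into the defaultdict
        for b in b_table.get(a[a_key], []):
            yield (a["a"], a["c"], b["b"])


# Hypothetical toy data standing in for tables A and B.
A = [{"a": 1, "c": "x"}, {"a": 5, "c": "y"}, {"a": 20, "c": "z"}]
B = [{"d": 1, "b": 10}, {"d": 5, "b": 7}, {"d": 20, "b": 10}]

# select A.a, A.c, B.b from A,B where A.a=B.d and A.a < 12 and B.b=10
b_table = build_hash_table(B, lambda r: r["b"] == 10, "d")
result = list(map_side_hash_join(A, b_table, lambda r: r["a"] < 12, "a"))
print(result)
```

Under the same assumptions, Plan 2 would differ only in where the B filter runs: a separate map-only job would write the filtered rows to B1, and each mapper would build its table from B1 instead of from B.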

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

