hadoop-hive-dev mailing list archives

From "Namit Jain (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-917) Bucketed Map Join
Date Tue, 16 Feb 2010 20:37:28 GMT

    [ https://issues.apache.org/jira/browse/HIVE-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834448#action_12834448 ]

Namit Jain commented on HIVE-917:
---------------------------------

Added some more tasks in the follow-up jira after talking to Yongqiang.
Will commit this if the tests pass.

> Bucketed Map Join
> -----------------
>
>                 Key: HIVE-917
>                 URL: https://issues.apache.org/jira/browse/HIVE-917
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: He Yongqiang
>         Attachments: hive-917-2010-2-15.patch, hive-917-2010-2-16.patch, hive-917-2010-2-3.patch, hive-917-2010-2-8.patch
>
>
> Hive already has support for map-join. Map-join treats the big table as job input, and in each mapper, it loads all data from a small table.
> In case the big table is already bucketed on the join key, we don't have to load the whole small table in each of the mappers. This will greatly alleviate the memory pressure, and make map-join work with medium-sized tables.
> There are 4 steps we can improve:
> S0. This is what the user can already do now: create a new bucketed table and insert all data from the small table into it; submit BUCKETNUM jobs, each doing a map-side join of "bigtable TABLEPARTITION(BUCKET i OUT OF NBUCKETS)" with "smallbucketedtable TABLEPARTITION(BUCKET i OUT OF NBUCKETS)".
> S1. Change the code so that when map-join is loading the small table, we automatically drop the rows with keys that are NOT in the same bucket as the big table. This should alleviate the memory problem, but we might still have thousands of mappers reading the whole of the small table.
> S2. Let's say the user has already bucketed the small table on the join key into exactly the same number of buckets (or a factor of the number of buckets of the big table); then map-join can choose to load only the buckets that are useful.
> S3. Add a new hint (e.g. /*+ MAPBUCKETJOIN(a) */), so that Hive automatically does S2, without requiring the user to create a temporary bucketed table for the small table.
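
The bucket arithmetic behind S1 and S2 can be sketched as follows. This is a minimal illustration in Python, not Hive's implementation: the function names are hypothetical, and crc32 stands in for Hive's own hash function purely because it is deterministic. It assumes that a row's bucket is hash(key) modulo the bucket count, and that the small table's bucket count is a factor of the big table's, as described in S2.

```python
import zlib


def bucket_of(key, num_buckets):
    """Hypothetical bucketing function: a stable hash of the key, modulo the bucket count.

    Hive's real hash function differs; crc32 is used here only because it is deterministic.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def filter_small_table(small_rows, big_bucket, big_num_buckets):
    """S1: while loading the small table, drop rows whose key is not in the mapper's bucket."""
    return [
        (key, value)
        for key, value in small_rows
        if bucket_of(key, big_num_buckets) == big_bucket
    ]


def small_bucket_for(big_bucket, big_num_buckets, small_num_buckets):
    """S2: return the one small-table bucket that can hold matches for big_bucket.

    Valid only when the small table's bucket count divides the big table's,
    because then (h % big) % small == h % small for any hash value h.
    """
    if big_num_buckets % small_num_buckets != 0:
        raise ValueError("small bucket count must be a factor of the big bucket count")
    return big_bucket % small_num_buckets
```

Under these assumptions, a mapper working on bucket i of a big table with 8 buckets would read only bucket i % 4 of a small table with 4 buckets, rather than scanning the whole small table and discarding most rows as in S1.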

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

