hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zheng Shao (JIRA)" <>
Subject [jira] Created: (HIVE-917) Bucketed Map Join
Date Fri, 06 Nov 2009 22:26:32 GMT
Bucketed Map Join

                 Key: HIVE-917
             Project: Hadoop Hive
          Issue Type: New Feature
            Reporter: Zheng Shao

Hive already have support for map-join. Map-join treats the big table as job input, and in
each mapper, it loads all data from a small table.

In case the big table is already bucketed on the join key, we don't have to load the whole
small table in each of the mappers. This will greatly alleviate the memory pressure, and make
map-join work with medium-sized tables.

There are 4 steps we can improve:

S0. This is what the user can already do now: create a new bucketed table and insert all data
from the small table to it; Submit BUCKETNUM jobs, each doing a map-side join of "bigtable

S1. Change the code so that when map-join is loading the small table, we automatically drop
the rows with the keys that are NOT in the same bucket as the big table. This should alleviate
the problem on memory, but we might still have thousands of mappers reading the whole of the
small table.

S2. Let's say the user already bucketed the small table on the join key into exactly the same
number of buckets (or a factor of the buckets of the big table), then map-join can choose
to load only the buckets that are useful.

S3. Add a new hint (e.g. /*+ MAPBUCKETJOIN(a) */), so that Hive automatically does S2, without
the need of asking the user to create temporary bucketed table for the small table.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message