hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-964) handle skewed keys for a join in a separate job
Date Fri, 11 Dec 2009 22:44:18 GMT

    [ https://issues.apache.org/jira/browse/HIVE-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789586#action_12789586 ]

He Yongqiang commented on HIVE-964:

Here is the idea, according to offline discussions with Namit and Ning.

1. The number of MR jobs needed to handle skewed keys is the number of tables minus 1 (we can
stream the last table, so big keys in the last table will not be a problem).
2. At runtime in the join, we write the big keys of one table into one corresponding directory,
and the rows with the same keys in the other tables into separate directories (one for each table).
The directories will look like:
dir-T1-bigkeys (containing big keys in T1), dir-T2-keys (containing keys which are big in T1), dir-T3-keys (containing keys which are big in T1), ...
dir-T1-keys (containing keys which are big in T2), dir-T2-bigkeys (containing big keys in T2), dir-T3-keys (containing keys which are big in T2), ...
dir-T1-keys (containing keys which are big in T3), dir-T2-keys (containing keys which are big in T3), dir-T3-bigkeys (containing big keys in T3), ...
3. For each table, we launch one mapjoin job, taking the directory containing the big keys of
this table and the corresponding directories of the other tables as input. (Actually, one job
per row in the listing above.)
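
The runtime routing in step 2 can be sketched as below. This is only an illustration, not Hive code: the function name, the row-count `threshold` used to decide that a key is "big", and the rule that a key which is big in several tables is assigned to the first such table are all assumptions that the comment does not specify.

```python
from collections import Counter, defaultdict


def route_skewed_keys(tables, threshold):
    """Group rows of skewed keys by the table in which the key is big.

    tables: ordered list of (table_name, rows), where rows are (key, value) tuples.
    threshold: hypothetical cutoff; a key is "big" in a table if it occurs
               more than `threshold` times there.
    Returns {skewed_table: {table_name: [rows]}}, i.e. one group of
    directories per skewed table.
    """
    names = [name for name, _ in tables]
    counts = {name: Counter(key for key, _ in rows) for name, rows in tables}

    # The last table is streamed, so its big keys need no special handling.
    owner = {}
    for name in names[:-1]:
        for key, n in counts[name].items():
            # Assumption: a key big in several tables goes to the first one.
            if n > threshold and key not in owner:
                owner[key] = name

    dirs = defaultdict(lambda: defaultdict(list))
    for name, rows in tables:
        for key, value in rows:
            if key in owner:
                # Rows of a skewed key are set aside, one bucket per table.
                dirs[owner[key]][name].append((key, value))
    return {skewed: dict(per_table) for skewed, per_table in dirs.items()}
```

With three tables where key "a" is big in T1 and key "b" is big in T2, the result has two groups, "T1" and "T2", each holding the matching rows from all three tables. Keys that are big only in the last table are not routed anywhere.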

This strategy helps keep the plan fixed at compile time.
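
To illustrate why the plan can be fixed at compile time, here is a small sketch that derives the follow-up jobs from the table names alone, without looking at any data. The function name and the "skew-<table>/" path prefix are hypothetical; only the dir-Tx-bigkeys / dir-Tx-keys naming comes from the listing above.

```python
def plan_skew_jobs(table_names):
    """Return one follow-up mapjoin job description per non-last table.

    Each job reads the big-key directory of its own table plus the
    matching-key directories of every other table.
    """
    jobs = []
    # The last table is streamed, so it gets no follow-up job.
    for skewed in table_names[:-1]:
        inputs = {
            t: f"skew-{skewed}/dir-{t}-{'bigkeys' if t == skewed else 'keys'}"
            for t in table_names
        }
        jobs.append({"skewed_table": skewed, "inputs": inputs})
    return jobs
```

For example, `plan_skew_jobs(["T1", "T2", "T3"])` yields two jobs (number of tables minus 1). Which keys actually end up in the directories is only known at runtime, but the set of jobs and their input paths does not depend on it.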

> handle skewed keys for a join in a separate job
> -----------------------------------------------
>                 Key: HIVE-964
>                 URL: https://issues.apache.org/jira/browse/HIVE-964
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: He Yongqiang
> The skewed keys can be written to a temporary table or file, and a followup conditional task can be used to perform the join on those keys.
> As a first step, JDBM can be used for those keys

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
