hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-964) handle skewed keys for a join in a separate job
Date Wed, 30 Dec 2009 05:05:29 GMT

     [ https://issues.apache.org/jira/browse/HIVE-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

He Yongqiang updated HIVE-964:

    Attachment: hive-964-2009-12-29-4.patch

Attached a new patch.
Changes include:
1) Updated the patch against trunk.
2) Following an offline discussion with Namit, Ning, and Ashish, this patch uses the original
groupKey in the reducer as the dummy join key for follow-up map joins. The previous patch
used Java's UUID and a random number generator to generate the dummy join keys.
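
To illustrate point 2, here is a minimal sketch, not code from the patch. It assumes the
rows spilled from each join input need to share a key so that the follow-up map join can
pair them, which a deterministic key such as the original group key provides. The names
SkewKeySketch, spill, and followUpMapJoin are hypothetical, rows are plain strings, and
in-memory maps stand in for the files written to HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Hive.
class SkewKeySketch {
    // Group one join input's skewed rows under their original group key.
    // Each element of keyedRows is {groupKey, rowValue}.
    static Map<String, List<String>> spill(String[][] keyedRows) {
        Map<String, List<String>> spilled = new HashMap<>();
        for (String[] kv : keyedRows) {
            spilled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return spilled;
    }

    // Follow-up map join: keep the small side in memory as a hash table,
    // stream the big side, and match rows on the original group key.
    static List<String> followUpMapJoin(Map<String, List<String>> small,
                                        Map<String, List<String>> big) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : big.entrySet()) {
            List<String> matches = small.get(e.getKey());
            if (matches == null) {
                continue; // no rows for this key on the small side
            }
            for (String bigRow : e.getValue()) {
                for (String smallRow : matches) {
                    out.add(e.getKey() + ":" + bigRow + "," + smallRow);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> big =
            spill(new String[][] {{"k1", "a1"}, {"k1", "a2"}});
        Map<String, List<String>> small =
            spill(new String[][] {{"k1", "b1"}});
        System.out.println(followUpMapJoin(small, big));
    }
}
```

With a random key per spilled row, the lookup in followUpMapJoin would have nothing stable
to match on; reusing the group key keeps both sides addressable by the same value.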

HIVE-963 introduced a RowContainer to handle skewed join keys; it serializes the value parts
into a local file in case of OOM. Right now this patch just gets each value object from the
RowContainer and serializes it to HDFS. The drawback is that a row already serialized into
the local file has to be deserialized to an object and then reserialized to HDFS, which
costs two serializations and one deserialization.

It would be better to read the RowContainer's local file directly and write it to HDFS. That
requires the RowContainer to serialize the key part together with the join values (right now
it serializes only the join values), and ideally this would be enabled only when skew join
is enabled. I suggest doing this in a follow-up JIRA because the changes for handling skew
join are already very complicated.

Some comments to make the code easier to understand:
1) 'tag' is obtained differently on the mapper side and the reducer side, so it differs
between MapJoinOp and JoinOp.
2) The join's tag order is reordered in JoinReorder, but the operator tree is not changed.
3) Map join does not actually use the tag order array, but the array is used in MapJoinOp's
parent, CommonJoinOp.
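
The serialization cost discussed above can be sketched as follows. This is only an
illustration under stated assumptions, not Hive code: SpillCopySketch and its methods are
hypothetical, byte arrays stand in for the local spill file and the HDFS file, and
java.io object serialization stands in for whatever serializer RowContainer uses. The
counters show the difference between going through an object and copying bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Hive.
class SpillCopySketch {
    static int serializeCalls = 0;
    static int deserializeCalls = 0;

    static byte[] serialize(ArrayList<String> row) {
        serializeCalls++;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(row);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    @SuppressWarnings("unchecked")
    static ArrayList<String> deserialize(byte[] bytes) {
        deserializeCalls++;
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (ArrayList<String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Path described for the current patch: the row was spilled locally,
    // is read back as an object, and is serialized again for HDFS.
    static byte[] viaObject(ArrayList<String> row) {
        byte[] localFile = serialize(row);              // spill to local file
        ArrayList<String> obj = deserialize(localFile); // read back as object
        return serialize(obj);                          // write to "HDFS"
    }

    // Suggested follow-up: the spilled bytes already hold key and values,
    // so they are copied as-is with no further (de)serialization.
    static byte[] viaByteCopy(ArrayList<String> keyAndValues) {
        byte[] localFile = serialize(keyAndValues);     // spill to local file
        return localFile.clone();                       // raw copy to "HDFS"
    }

    public static void main(String[] args) {
        ArrayList<String> row = new ArrayList<>(Arrays.asList("k1", "v1"));
        viaObject(row);
        System.out.println(serializeCalls + " serializations, "
            + deserializeCalls + " deserializations");
    }
}
```

The byte-copy path only works if the spilled bytes are complete on their own, which is why
the key would need to be written alongside the join values.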

> handle skewed keys for a join in a separate job
> -----------------------------------------------
>                 Key: HIVE-964
>                 URL: https://issues.apache.org/jira/browse/HIVE-964
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: He Yongqiang
>         Attachments: hive-964-2009-12-17.txt, hive-964-2009-12-28-2.patch, hive-964-2009-12-29-4.patch
> The skewed keys can be written to a temporary table or file, and a followup conditional
> task can be used to perform the join on those keys.
> As a first step, JDBM can be used for those keys

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
