hadoop-hive-dev mailing list archives

From "He Yongqiang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1093) Add a "skew join map join size" variable to control the input size of skew join's following map join job.
Date Mon, 25 Jan 2010 08:13:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804438#action_12804438
] 

He Yongqiang commented on HIVE-1093:
------------------------------------

>>Do you have performance numbers for the testcase ?
Yes. In my test case, one split joining 256M against 100K is now taking more than 5 hours.
(The join values can be ignored, so 256M and 100K are roughly the pure key sizes.)
Also, the 'map join size' should not be determined only by the big side (e.g. 256M). The size
of the small side matters more in this case.

The point is that KEY1 ("256M join 100K") should use a much smaller split size than KEY2
("256M join 1K"). The problem is that we are now processing KEY1 and KEY2 in the same job,
so if we choose a split size to suit KEY1, it may be a bit small for KEY2.
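
To make the KEY1/KEY2 trade-off concrete, here is a minimal sketch in Python. It assumes (my assumption, not something measured in the test) that the work done by one mapper for a skew key grows with the product of its big-side split size and the small-side size, since every row of a skew key matches every row on the other side. The function name, the work budget, and the clamp limits are all hypothetical.

```python
KB = 1024
MB = 1024 * KB

def split_size_for_key(small_side_bytes, work_budget,
                       min_split=1, max_split=256 * MB):
    """Pick a big-side split size so that split * small_side <= work_budget.

    Assumed cost model: per-mapper work ~ split size * small-side size.
    """
    if small_side_bytes <= 0:
        return max_split
    size = work_budget // small_side_bytes
    # Clamp to a sane range (limits are illustrative only).
    return max(min_split, min(max_split, size))

# Hypothetical budget: the work of joining a 1 MB split against 100 KB.
work_budget = (1 * MB) * (100 * KB)

key1_split = split_size_for_key(100 * KB, work_budget)  # "256M join 100K"
key2_split = split_size_for_key(1 * KB, work_budget)    # "256M join 1K"

# When both keys run in the same job, one split size must serve both,
# so the job is forced down to the smaller of the two.
shared_split = min(key1_split, key2_split)
```

Under this model the small side drives the choice: KEY2 could tolerate a split 100 times larger than KEY1, but a shared job has to use KEY1's size.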

If we choose to use a bucket join for the follow-up map join job, we will be able to choose
the split size independently for different keys (because we would process them in different
jobs).
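
As a rough illustration of what independent split sizes would buy, the sketch below counts map tasks for the two example keys. The split sizes (1 MB for KEY1, 100 MB for KEY2) are hypothetical values chosen for illustration, not numbers from the test, and the helper name is made up.

```python
MB = 1024 * 1024

def mappers_needed(big_side_bytes, split_bytes):
    """Number of map tasks needed to cover the big side (integer ceiling)."""
    return -(-big_side_bytes // split_bytes)

big_side = 256 * MB          # big-side size of each skew key
key1_split = 1 * MB          # hypothetical split for "256M join 100K"
key2_split = 100 * MB        # hypothetical split for "256M join 1K"

# One job for both keys: both must use the smaller split size.
shared_split = min(key1_split, key2_split)
single_job_mappers = (mappers_needed(big_side, shared_split)
                      + mappers_needed(big_side, shared_split))

# One job per key: each key gets its own split size.
per_key_mappers = (mappers_needed(big_side, key1_split)
                   + mappers_needed(big_side, key2_split))
```

With these illustrative numbers, KEY2 needs only a handful of map tasks when it runs in its own job, instead of as many as KEY1.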

> Add a "skew join map join size" variable to control the input size of skew join's following map join job.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1093
>                 URL: https://issues.apache.org/jira/browse/HIVE-1093
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>
> In a test, many skew join keys are each >250M in size, and the following map join takes
> several hours to process those big skew keys.
> This can be improved by using a smaller map input size for the following map join job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

