Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Wed, 19 Oct 2016 00:56:58 +0000 (UTC)
From: "Xuefu Zhang (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13006144.1474360584000.23746.1476838618411@Atlassian.JIRA>
In-Reply-To: <JIRA.13006144.1474360584000@Atlassian.JIRA>
References: <JIRA.13006144.1474360584000@Atlassian.JIRA> <JIRA.13006144.1474360584575@arcas>
Subject: [jira] [Commented] (HIVE-14797) reducer number estimating may lead
 to data skew
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Wed, 19 Oct 2016 00:57:00 -0000


    [ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587229#comment-15587229 ] 

Xuefu Zhang commented on HIVE-14797:
------------------------------------

[~lirui] Choosing a different seed for determining the bucket number seems a little risky for FS if it's assumed that a certain key always lands in a certain bucket, as in the case of transactions. However, I'm not sure about that at all.

Looking at the problem again, I'm not sure we need to deal with the data skew in the way this patch proposes. The original data is already skewed. I'm wondering if it's actually better to adjust the reducer number; I understand that this was the original approach. If a user happens to specify 31 reducers, then let it be. I'd think that's a rare case, and I don't think solving it justifies introducing a new seed, which seems a little more risky.

Thoughts?

> reducer number estimating may lead to data skew
> -----------------------------------------------
>
>                 Key: HIVE-14797
>                 URL: https://issues.apache.org/jira/browse/HIVE-14797
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: roncenzhao
>            Assignee: roncenzhao
>         Attachments: HIVE-14797.2.patch, HIVE-14797.3.patch, HIVE-14797.4.patch, HIVE-14797.patch
>
>
> HiveKey's hash code is generated by multiplying by 31, key by key, which is implemented in the method `ObjectInspectorUtils.getBucketHashCode()`:
> for (int i = 0; i < bucketFields.length; i++) {
>       int fieldHash = ObjectInspectorUtils.hashCode(bucketFields[i], bucketFieldInspectors[i]);
>       hashCode = 31 * hashCode + fieldHash;
>     }
> The following example will lead to data skew:
> I have two tables called tbl1 and tbl2, and they have the same columns: a int, b string. The values of column 'a' in both tables are not skewed, but the values of column 'b' in both tables are skewed.
> When my SQL is "select * from tbl1 join tbl2 on tbl1.a=tbl2.a and tbl1.b=tbl2.b" and the estimated reducer number is 31, it will lead to data skew.
> As we know, the HiveKey's hash code is generated by `hash(a)*31 + hash(b)`. When the reducer number is 31, the `hash(a)*31` term is a multiple of 31 (ignoring integer overflow), so the reducer No. of each row is `hash(b)%31`. As a result, the job will be skewed.
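
The effect described in the issue can be reproduced with a small standalone sketch. This is not Hive code: the class `ReducerSkewDemo` and its methods are hypothetical names, `hash(a)` is assumed to equal `a`, `hash(b)` is assumed to be one fixed small value standing in for a hot key, and the reducer is assumed to be chosen as the non-negative hash modulo the reducer count. Only the `31 * hashCode + fieldHash` step comes from the snippet quoted above.

```java
import java.util.Arrays;

public class ReducerSkewDemo {
    // Mirrors the combining step quoted in the issue:
    // hashCode = 31 * hashCode + fieldHash.
    static int combinedHash(int[] fieldHashes) {
        int hashCode = 0;
        for (int fieldHash : fieldHashes) {
            hashCode = 31 * hashCode + fieldHash;
        }
        return hashCode;
    }

    // Assumption: the reducer index is the non-negative hash modulo the reducer count.
    static int reducerFor(int hashCode, int numReducers) {
        return (hashCode & Integer.MAX_VALUE) % numReducers;
    }

    // Counts rows per reducer for keys (a, b), where a runs over 0..rows-1
    // (not skewed) and b always hashes to hotB (a single hot value).
    static int[] distribution(int rows, int hotB, int numReducers) {
        int[] counts = new int[numReducers];
        for (int a = 0; a < rows; a++) {
            // Assumption: hash(a) == a and hash(b) == hotB.
            int hash = combinedHash(new int[] {a, hotB});
            counts[reducerFor(hash, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("31 reducers: " + Arrays.toString(distribution(3100, 7, 31)));
        System.out.println("32 reducers: " + Arrays.toString(distribution(3100, 7, 32)));
    }
}
```

Under these assumptions, with 31 reducers every one of the 3100 rows lands on reducer 7, because the `31 * a` term vanishes modulo 31. With 32 reducers the same rows spread out to 96 or 97 per reducer, which is why adjusting the reducer number is one way to avoid the skew without changing the hash seed.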

