Date: Wed, 19 Oct 2016 02:40:59 +0000 (UTC)
From: "Rui Li (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

    [ https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587451#comment-15587451 ]

Rui Li commented on HIVE-14797:
-------------------------------

Hi [~xuefuz], for the example in the description, B is skewed, but ideally (A, B) shouldn't be skewed.
Other than shuffling, bucketed tables should also suffer from this if the number of buckets happens to be 31, and we can't adjust the number of reducers in that case. I think the problem is valid, but more research is needed to find out how the hash code is used and whether the solution here is correct.

> reducer number estimating may lead to data skew
> -----------------------------------------------
>
>                 Key: HIVE-14797
>                 URL: https://issues.apache.org/jira/browse/HIVE-14797
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: roncenzhao
>            Assignee: roncenzhao
>         Attachments: HIVE-14797.2.patch, HIVE-14797.3.patch, HIVE-14797.4.patch, HIVE-14797.patch
>
>
> HiveKey's hash code is generated by multiplying by 31 key by key, which is implemented in the method `ObjectInspectorUtils.getBucketHashCode()`:
>     for (int i = 0; i < bucketFields.length; i++) {
>       int fieldHash = ObjectInspectorUtils.hashCode(bucketFields[i], bucketFieldInspectors[i]);
>       hashCode = 31 * hashCode + fieldHash;
>     }
> The following example will lead to data skew:
> I have two tables called tbl1 and tbl2, and they have the same columns: a int, b string. The values of column 'a' in both tables are not skewed, but the values of column 'b' in both tables are skewed.
> When my SQL is "select * from tbl1 join tbl2 on tbl1.a=tbl2.a and tbl1.b=tbl2.b" and the estimated reducer number is 31, it will lead to data skew.
> As we know, the HiveKey's hash code is generated by `hash(a)*31 + hash(b)`. When the reducer number is 31, the reducer No. of each row is `hash(b)%31`. As a result, the job will be skewed.
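
The arithmetic in the description can be checked with a small standalone sketch. It is not Hive code. `BucketSkewDemo`, `combinedHash`, `reducerFor`, and `distinctReducers` are hypothetical names, and the sketch assumes that an int column hashes to its own value, that the reducer is chosen as the non-negative hash modulo the reducer count, and that the one dominant value of `b` hashes to 7. The values are kept small so that `31 * hash(a) + hash(b)` does not overflow; once int overflow occurs, the result is no longer exactly `hash(b) % 31`.

```java
import java.util.HashSet;
import java.util.Set;

public class BucketSkewDemo {
    // Mirrors the 31-multiplier loop quoted above, unrolled for two fields (a, b).
    static int combinedHash(int hashA, int hashB) {
        int hashCode = 0;
        hashCode = 31 * hashCode + hashA;
        hashCode = 31 * hashCode + hashB;
        return hashCode;
    }

    // Assumed partitioning rule: clear the sign bit, then take the remainder.
    static int reducerFor(int hashCode, int numReducers) {
        return (hashCode & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many reducers receive rows when 'a' is uniform and 'b' is fully skewed.
    static int distinctReducers(int numReducers) {
        final int skewedHashB = 7; // hypothetical hash of the dominant value of b
        Set<Integer> used = new HashSet<>();
        for (int a = 0; a < 1000; a++) { // assumed: hash(a) == a for an int column
            used.add(reducerFor(combinedHash(a, skewedHashB), numReducers));
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println("31 reducers -> " + distinctReducers(31) + " reducer(s) used");
        System.out.println("32 reducers -> " + distinctReducers(32) + " reducer(s) used");
    }
}
```

Under these assumptions, all 1000 rows land on a single reducer when there are 31 reducers, because `31 * hash(a)` contributes nothing modulo 31, while 32 reducers are all used.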