Return-Path: X-Original-To: apmail-drill-issues-archive@minotaur.apache.org Delivered-To: apmail-drill-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 05FF418033 for ; Tue, 24 Nov 2015 16:40:12 +0000 (UTC) Received: (qmail 79200 invoked by uid 500); 24 Nov 2015 16:40:11 -0000 Delivered-To: apmail-drill-issues-archive@drill.apache.org Received: (qmail 79075 invoked by uid 500); 24 Nov 2015 16:40:11 -0000 Mailing-List: contact issues-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list issues@drill.apache.org Received: (qmail 78862 invoked by uid 99); 24 Nov 2015 16:40:11 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 24 Nov 2015 16:40:11 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 39B132C1F76 for ; Tue, 24 Nov 2015 16:40:11 +0000 (UTC) Date: Tue, 24 Nov 2015 16:40:11 +0000 (UTC) From: "Aman Sinha (JIRA)" To: issues@drill.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024802#comment-15024802 ] Aman Sinha commented on DRILL-4119: ----------------------------------- I did some more testing with the sample data. Here are 3 hash values: - hash64 is the native hash64 computed by XXHash.hash64() - hash64_downcast is the same value downcast to int - newhash is the the new 32 bit hash value computed by the proposed fix (combining the first and last 4 bytes of hash64). {noformat} 0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0]) as hash64, castInt(hash64(columns[0])) as hash64_downcast, hash32(columns[0]) as newhash from dfs.`/Users/asinha/data/sample.csv`; +-----------------------------------+----------------------+------------------+--------------+ | id | hash64 | hash64_downcast | newhash | +-----------------------------------+----------------------+------------------+--------------+ | 1a883d005e0ce003b918d737ac697e7c | 6695077304582944118 | 594687350 | 2140898336 | | e4b4388e8865819126cb0e4dcaa7261d | 2614721709087477964 | -2136387380 | -1528820922 | | 639a06fb09c70cc397666d38a8134af5 | 3943910117127083836 | 359520060 | 601244263 | | ae03f853f40c307aa24894e414a6dfdc | 4320987148691340574 | 214334750 | 925976565 | | 2dd3fdace36431e3810437bee1c7e3f1 | 5657579594883017754 | -1719653350 | -687608144 | | 00abdb137380e6ea8cb3e67df40c30dd | 5039129256017100358 | 573406790 | 1740892954 | | d65d4e30ec96a588e82847aca619e4a0 | 550451582126160076 | 716077260 | 755884032 | | 956f968866b3151ad472edfcafb579fa | 39366413145792912 | 1336074640 | 1328101915 | | 75577f830d12c86fd1de94d45cfa0715 | 6480730101791620276 | -226984780 | -1417129724 | | 298aa703dbee9e5f303372fe7a764975 | 7844015280248941602 | -2013696990 | -350034316 | +-----------------------------------+----------------------+------------------+--------------+ 10 rows selected (0.228 seconds) {noformat} A key observation is that all hash64 values are even numbers. This is not a good thing. I confirmed the behavior over a larger sample of 100 rows. However, this seems specific to strings that are 32 chars (or maybe longer, although a simple test for a 64 char string did not show the same pattern). I then modified the seed value to 1 (default is 0). This time I got better distribution for the hash64: {noformat} 0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0], 1) as hash64, castInt(hash64(columns[0], 1)) as hash64_downcast, hash32(columns[0], 1) as newhash from dfs.`/Users/asinha/data/sample.csv`; +-----------------------------------+----------------------+------------------+-------------+ | id | hash64 | hash64_downcast | newhash | +-----------------------------------+----------------------+------------------+-------------+ | 1a883d005e0ce003b918d737ac697e7c | 3877569168361489241 | 1211204441 | 2113824708 | | e4b4388e8865819126cb0e4dcaa7261d | 5555510472474498931 | 567154547 | 1826042916 | | 639a06fb09c70cc397666d38a8134af5 | 6160367672898924663 | 1827713143 | 965653941 | | ae03f853f40c307aa24894e414a6dfdc | 5573714012720216212 | 533608596 | 1385691081 | | 2dd3fdace36431e3810437bee1c7e3f1 | 4742615352245986962 | 284141202 | 1363050779 | | 00abdb137380e6ea8cb3e67df40c30dd | 5870154798330275502 | 185067182 | 1517362206 | | d65d4e30ec96a588e82847aca619e4a0 | 5469776233948339425 | 828202209 | 2058735712 | | 956f968866b3151ad472edfcafb579fa | 8671446365158603789 | -1675527155 | -462006645 | | 75577f830d12c86fd1de94d45cfa0715 | 3369914886384026207 | 238584415 | 553440739 | | 298aa703dbee9e5f303372fe7a764975 | 3765901389360033496 | 1811181272 | 1605846404 | +-----------------------------------+----------------------+------------------+-------------+ 10 rows selected (0.263 seconds) {noformat} I am thinking we should put in the proposed fix I sent earlier since it improves things. Separately, I think we need to investigate the quality of the XXHash.hash64 implementation. BTW, I also downloaded the original XXHash's C implementation and based on an initial analysis that one produces different hash value than our implementation and does not seem to have the same 'even number' pattern. > Skew in hash distribution for varchar (and possibly other) types of data > ------------------------------------------------------------------------ > > Key: DRILL-4119 > URL: https://issues.apache.org/jira/browse/DRILL-4119 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill > Affects Versions: 1.3.0 > Reporter: Aman Sinha > Assignee: Aman Sinha > Fix For: 1.4.0 > > > We are seeing substantial skew for an Id column that contains varchar data of length 32. It is easily reproducible by a group-by query: > {noformat} > Explain plan for SELECT SomeId From table GROUP BY SomeId; > ... > 01-02 HashAgg(group=[{0}]) > 01-03 Project(SomeId=[$0]) > 01-04 HashToRandomExchange(dist0=[[$0]]) > 02-01 UnorderedMuxExchange > 03-01 Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))]) > 03-02 HashAgg(group=[{0}]) > 03-03 Project(SomeId=[$0]) > {noformat} > The string id happens to be of the following type: > {noformat} > e4b4388e8865819126cb0e4dcaa7261d > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)