Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Tue, 24 Nov 2015 16:40:11 +0000 (UTC)
From: "Aman Sinha (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.12915115.1448132758000.177992.1448383211233@Atlassian.JIRA>
In-Reply-To: <JIRA.12915115.1448132758000@Atlassian.JIRA>
References: <JIRA.12915115.1448132758000@Atlassian.JIRA>
 <JIRA.12915115.1448132758443@arcas>
Subject: [jira] [Commented] (DRILL-4119) Skew in hash distribution for
 varchar (and possibly other) types of data
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024802#comment-15024802 ] 

Aman Sinha commented on DRILL-4119:
-----------------------------------

I did some more testing with the sample data.  Here are 3 hash values: 
 - hash64 is the native hash64 computed by XXHash.hash64()
 - hash64_downcast is the same value downcast to int
 - newhash is the the new 32 bit hash value computed by the proposed fix (combining the first and last 4 bytes of hash64).  

{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0]) as hash64, castInt(hash64(columns[0])) as hash64_downcast, hash32(columns[0]) as newhash from dfs.`/Users/asinha/data/sample.csv`;
+-----------------------------------+----------------------+------------------+--------------+
|                id                 |        hash64        | hash64_downcast  |   newhash    |
+-----------------------------------+----------------------+------------------+--------------+
| 1a883d005e0ce003b918d737ac697e7c  | 6695077304582944118  | 594687350        | 2140898336   |
| e4b4388e8865819126cb0e4dcaa7261d  | 2614721709087477964  | -2136387380      | -1528820922  |
| 639a06fb09c70cc397666d38a8134af5  | 3943910117127083836  | 359520060        | 601244263    |
| ae03f853f40c307aa24894e414a6dfdc  | 4320987148691340574  | 214334750        | 925976565    |
| 2dd3fdace36431e3810437bee1c7e3f1  | 5657579594883017754  | -1719653350      | -687608144   |
| 00abdb137380e6ea8cb3e67df40c30dd  | 5039129256017100358  | 573406790        | 1740892954   |
| d65d4e30ec96a588e82847aca619e4a0  | 550451582126160076   | 716077260        | 755884032    |
| 956f968866b3151ad472edfcafb579fa  | 39366413145792912    | 1336074640       | 1328101915   |
| 75577f830d12c86fd1de94d45cfa0715  | 6480730101791620276  | -226984780       | -1417129724  |
| 298aa703dbee9e5f303372fe7a764975  | 7844015280248941602  | -2013696990      | -350034316   |
+-----------------------------------+----------------------+------------------+--------------+
10 rows selected (0.228 seconds)
{noformat}

A key observation is that all hash64 values are even numbers.   This is not a good thing.  I confirmed the behavior over a larger sample of 100 rows.  However, this seems specific to strings that are 32 chars (or maybe longer, although a simple test for a 64 char string did not show the same pattern).  

I then modified the seed value to 1 (default is 0).  This time I got better distribution for the hash64:
{noformat}
0: jdbc:drill:zk=local> select columns[0] as id, hash64(columns[0], 1) as hash64, castInt(hash64(columns[0], 1)) as hash64_downcast, hash32(columns[0], 1) as newhash from dfs.`/Users/asinha/data/sample.csv`;
+-----------------------------------+----------------------+------------------+-------------+
|                id                 |        hash64        | hash64_downcast  |   newhash   |
+-----------------------------------+----------------------+------------------+-------------+
| 1a883d005e0ce003b918d737ac697e7c  | 3877569168361489241  | 1211204441       | 2113824708  |
| e4b4388e8865819126cb0e4dcaa7261d  | 5555510472474498931  | 567154547        | 1826042916  |
| 639a06fb09c70cc397666d38a8134af5  | 6160367672898924663  | 1827713143       | 965653941   |
| ae03f853f40c307aa24894e414a6dfdc  | 5573714012720216212  | 533608596        | 1385691081  |
| 2dd3fdace36431e3810437bee1c7e3f1  | 4742615352245986962  | 284141202        | 1363050779  |
| 00abdb137380e6ea8cb3e67df40c30dd  | 5870154798330275502  | 185067182        | 1517362206  |
| d65d4e30ec96a588e82847aca619e4a0  | 5469776233948339425  | 828202209        | 2058735712  |
| 956f968866b3151ad472edfcafb579fa  | 8671446365158603789  | -1675527155      | -462006645  |
| 75577f830d12c86fd1de94d45cfa0715  | 3369914886384026207  | 238584415        | 553440739   |
| 298aa703dbee9e5f303372fe7a764975  | 3765901389360033496  | 1811181272       | 1605846404  |
+-----------------------------------+----------------------+------------------+-------------+
10 rows selected (0.263 seconds)
{noformat}

I am thinking we should put in the proposed fix I sent earlier since it improves things.  Separately, I think we need to investigate the quality of the XXHash.hash64 implementation.   BTW, I also downloaded the original XXHash's C implementation and based on an initial analysis that one produces different hash value than our implementation and does not seem to have the same 'even number' pattern. 

> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
>                 Key: DRILL-4119
>                 URL: https://issues.apache.org/jira/browse/DRILL-4119
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of length 32.   It is easily reproducible by a group-by query: 
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02          HashAgg(group=[{0}])
> 01-03            Project(SomeId=[$0])
> 01-04              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02                    HashAgg(group=[{0}])
> 03-03                      Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)