drill-issues mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-4119) Skew in hash distribution for varchar (and possibly other) types of data
Date Tue, 24 Nov 2015 17:33:11 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024913#comment-15024913 ]

Aman Sinha edited comment on DRILL-4119 at 11/24/15 5:32 PM:
-------------------------------------------------------------

Our hash64 implementation looks similar to the original one, but I haven't done enough analysis
to say they are exactly the same.  The only way to check is through testing.  Here are two values
and their corresponding hashes from the original implementation (note: the command-line utility
xxh64sum produces one hash per file rather than one per line, so I had to put each value in a
separate file):
{noformat}
$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c

$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d

$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv

$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These hashes don't match the values I am getting from Drill after converting the long to hex
(I used the Long.toHexString() method in the debugger), so it is possible that something got
lost in translation.
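
A comparison like this can be sensitive to formatting details that are independent of the hash function itself. The sketch below is a hypothetical helper (the class and method names are illustrative and not part of Drill or xxHash) showing two of them. First, Long.toHexString() omits leading zeros, so a zero-padded format is safer when comparing against the 16-digit digests printed by xxh64sum. Second, a file created with cat or an editor usually ends with a newline, so the utility may have hashed 33 bytes rather than the 32-character value. Whether either detail explains the mismatch here is an assumption and would need testing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for comparing a Java long hash with an xxh64sum digest.
class HashHexCompare {

    // Parse the 16-digit hex digest printed by xxh64sum into a Java long.
    // Digests with the high bit set become negative longs.
    static long parseDigest(String hex) {
        return Long.parseUnsignedLong(hex, 16);
    }

    // Format a long as exactly 16 lowercase hex digits.
    // Unlike Long.toHexString, this keeps leading zeros.
    static String toPaddedHex(long value) {
        return String.format("%016x", value);
    }

    // Bytes a file would contain if the value was saved with a trailing newline.
    static byte[] withTrailingNewline(String value) {
        return (value + "\n").getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String id = "e4b4388e8865819126cb0e4dcaa7261d";

        // Round-trip a digest from the xxh64sum output through a long.
        long digest = parseDigest("e0658433041ce9aa");
        System.out.println(toPaddedHex(digest));

        // Leading zeros are dropped by toHexString but kept by the padded form.
        long small = 0x0fL;
        System.out.println(Long.toHexString(small));
        System.out.println(toPaddedHex(small));

        // The value alone is 32 bytes; with a trailing newline it is 33.
        System.out.println(id.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(withTrailingNewline(id).length);
    }
}
```

If the value inside Drill is hashed without the newline, the input file for xxh64sum would need to be written without one as well before the two digests can be compared meaningfully.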


was (Author: amansinha100):
Our hash64 implementation looks similar to the original one, but I haven't done enough analysis
to say they are exactly the same.  The only way to check is through testing.  Here are two values
and their corresponding hashes from the original implementation (note: the command-line utility
xxh64sum produces one hash per file rather than one per line, so I had to put each value in a
separate file):
{noformat}
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat > sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample1.csv
1a883d005e0ce003b918d737ac697e7c
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ cat sample2.csv
e4b4388e8865819126cb0e4dcaa7261d
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample1.csv
1213a50f060e0659  sample1.csv
Administrators-MacBook-Pro-144:xxHash-r42 asinha$ ./xxh64sum sample2.csv
e0658433041ce9aa  sample2.csv
{noformat}

These hashes don't match the values I am getting from Drill after converting the long to hex
(I used the Long.toHexString() method in the debugger), so it is possible that something got
lost in translation.

> Skew in hash distribution for varchar (and possibly other) types of data
> ------------------------------------------------------------------------
>
>                 Key: DRILL-4119
>                 URL: https://issues.apache.org/jira/browse/DRILL-4119
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.4.0
>
>
> We are seeing substantial skew for an Id column that contains varchar data of length 32.  It is easily reproducible by a group-by query:
> {noformat}
> Explain plan for SELECT SomeId From table GROUP BY SomeId;
> ...
> 01-02          HashAgg(group=[{0}])
> 01-03            Project(SomeId=[$0])
> 01-04              HashToRandomExchange(dist0=[[$0]])
> 02-01                UnorderedMuxExchange
> 03-01                  Project(SomeId=[$0], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($0))])
> 03-02                    HashAgg(group=[{0}])
> 03-03                      Project(SomeId=[$0])
> {noformat}
> The string id happens to be of the following type: 
> {noformat}
> e4b4388e8865819126cb0e4dcaa7261d
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
