drill-issues mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-5816) Hash function produces skewed results on String values with same leading prefix
Date Tue, 26 Sep 2017 22:16:00 GMT

     [ https://issues.apache.org/jira/browse/DRILL-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aman Sinha updated DRILL-5816:
------------------------------
    Labels: ready-to-commit  (was: )

> Hash function produces skewed results on String values with same leading prefix
> -------------------------------------------------------------------------------
>
>                 Key: DRILL-5816
>                 URL: https://issues.apache.org/jira/browse/DRILL-5816
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Sorabh Hamirwasia
>            Assignee: Sorabh Hamirwasia
>              Labels: ready-to-commit
>             Fix For: 1.12.0
>
>
> Reported by [~amansinha100]
> Hashing of string values (for the hash exchange) could produce substantial skew for certain types of strings that have the same leading prefix.
> Here's the sample data: (note all strings begin with 'mscId=' followed by numeric values)
> 0: jdbc:drill:drillbit=10.10.103.111> select a from dfs.tmp.vv3 limit 20;
> +---------------------+
> |          a          |
> +---------------------+
> | mscId=100139170495  |
> | mscId=100103806655  |
> | mscId=100229137840  |
> | mscId=100362859440  |
> | mscId=100032583600  |
> | mscId=100125021360  |
> | mscId=100243775920  |
> | mscId=100152820405  |
> | mscId=100084724405  |
> | mscId=100297398970  |
> | mscId=100059560890  |
> | mscId=100106108090  |
> | mscId=100032092090  |
> | mscId=100029460410  |
> | mscId=100110390995  |
> | mscId=100019105235  |
> | mscId=100354644435  |
> | mscId=100288523475  |
> | mscId=100214507475  |
> | mscId=100296418515  |
> +---------------------+
> 20 rows selected (0.33 seconds)
> Here are the hash values from the hash function that Drill uses for the HashToRandomExchange (note that they are all even numbers):
> 0: jdbc:drill:drillbit=10.10.103.111> select hash32AsDouble(a, 1301011) from dfs.tmp.vv3 limit 20;
> +--------------+
> |    EXPR$0    |
> +--------------+
> | 1180062632   |
> | -1322734784  |
> | 2096701320   |
> | 2075007536   |
> | -1970336592  |
> | 1614574192   |
> | 1592743936   |
> | -1053691072  |
> | -689805200   |
> | 1893061072   |
> | 1660328376   |
> | 1852126136   |
> | 1927731344   |
> | 616840056    |
> | -1997249184  |
> | 1588717872   |
> | 193019624    |
> | 880839008    |
> | 1879415496   |
> | 1726850216   |
> +--------------+
> 20 rows selected (0.311 seconds)
> Taking the hash values mod 56 produces only 1 distinct value, which indicates the skew:
> 0: jdbc:drill:drillbit=10.10.103.111> select distinct mod(hash32AsDouble(a, 1301011), 56) from dfs.tmp.vv3 limit 20;
> +---------+
> | EXPR$0  |
> +---------+
> | 0       |
> +---------+
> 1 row selected (1.041 seconds)
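
The skew check in the report can be reproduced offline on the 20 hash values listed above. The following is a minimal Python sketch, not Drill code: it assumes that 56 stands for the number of buckets the hash is reduced into (the report does not say what 56 represents), and the names SAMPLE_HASHES and bucket_counts are made up for illustration.

```python
from collections import Counter

# Hash values copied from the 20-row hash32AsDouble output in the report.
SAMPLE_HASHES = [
    1180062632, -1322734784, 2096701320, 2075007536, -1970336592,
    1614574192, 1592743936, -1053691072, -689805200, 1893061072,
    1660328376, 1852126136, 1927731344, 616840056, -1997249184,
    1588717872, 193019624, 880839008, 1879415496, 1726850216,
]


def bucket_counts(hashes, num_buckets):
    """Count how many hash values fall into each bucket under hash mod num_buckets.

    Hypothetical helper. Python's % yields a non-negative result for a
    positive modulus, which can differ in sign from SQL mod() on negative
    inputs, but a remainder of 0 is the same in both.
    """
    return Counter(h % num_buckets for h in hashes)


# 56 is taken from the report's query; treating it as a bucket count is an assumption.
counts = bucket_counts(SAMPLE_HASHES, 56)
all_even = all(h % 2 == 0 for h in SAMPLE_HASHES)

print("distinct buckets used:", len(counts), "of 56")
print("all hash values even:", all_even)
```

With a well-distributed hash, 20 values would be expected to spread over many of the 56 buckets; here they all share one bucket, matching the single distinct value returned by the query.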



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
