hadoop-common-issues mailing list archives

From "Steve Scaffidi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-12217) hashCode in DoubleWritable returns same value for many numbers
Date Sat, 11 Jul 2015 17:53:04 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Scaffidi updated HADOOP-12217:
------------------------------------
    Attachment: HADOOP-12217.1.patch

This is the simplest fix that computes a correct hashCode without creating a Double object.
I have not yet tested it in a production-level environment, though. I can add some tests
to show how well the new hashCode distributes values, if desired.
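
To illustrate the kind of collision at issue, here is a minimal sketch. It assumes the
problematic hashCode() simply truncates Double.doubleToLongBits(value) to an int; that is an
assumption, since this message does not quote the existing code, and the sketch is not
necessarily what the attached patch does. The class and method names (DoubleHashSketch,
truncatingHash, foldingHash, distinctHashes) are hypothetical. The folding variant uses the
same formula as java.lang.Double#hashCode, applied to the primitive so no Double is allocated.

```java
import java.util.HashSet;
import java.util.Set;

final class DoubleHashSketch {
    // ASSUMPTION: the problematic hash keeps only the low 32 bits of the
    // IEEE 754 representation. For many common values (small integers,
    // short fractions) those bits are all zero.
    static int truncatingHash(double value) {
        return (int) Double.doubleToLongBits(value);
    }

    // Fold the high 32 bits into the low 32 bits. This is the formula
    // documented for java.lang.Double#hashCode, computed without boxing.
    static int foldingHash(double value) {
        long bits = Double.doubleToLongBits(value);
        return (int) (bits ^ (bits >>> 32));
    }

    // Count the distinct hash codes produced for the doubles 1.0 .. count.
    static int distinctHashes(boolean folding, int count) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 1; i <= count; i++) {
            double d = i;
            seen.add(folding ? foldingHash(d) : truncatingHash(d));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("truncating: " + distinctHashes(false, 1000));
        System.out.println("folding:    " + distinctHashes(true, 1000));
    }
}
```

Under that assumption, every whole number from 1.0 to 1000.0 lands on a single hash code with
the truncating variant, while the folding variant yields 1000 distinct codes, which is the
difference between one overloaded HashMap bucket and an even spread.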

> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>
>                 Key: HADOOP-12217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.20.0, 0.20.1,
>                      0.20.2, 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0, 1.0.1, 1.0.2, 1.0.3,
>                      1.0.4, 1.1.0, 1.1.1, 1.2.0, 0.21.0, 0.22.0, 0.23.0, 0.23.1, 0.23.3,
>                      2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 0.23.4, 2.0.3-alpha, 0.23.5,
>                      0.23.6, 1.1.2, 0.23.7, 2.1.0-beta, 2.0.4-alpha, 0.23.8, 1.2.1,
>                      2.0.5-alpha, 0.23.9, 0.23.10, 0.23.11, 2.1.1-beta, 2.0.6-alpha, 2.2.0,
>                      2.3.0, 2.4.0, 2.5.0, 2.4.1, 2.5.1, 2.5.2, 2.6.0, 2.7.0, 2.7.1
>            Reporter: Steve Scaffidi
>              Labels: easyfix
>         Attachments: HADOOP-12217.1.patch
>
>
> Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the keys in
> a HashMap results in abysmal performance, due to hash code collisions.
> I discovered this when testing the latest version of Hive and certain mapjoin queries
> were exceedingly slow.
> Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable that
> used to override hashCode() with a correct implementation, but for some reason they recently
> removed that code, so it now uses the incorrect hashCode() method inherited from Hadoop's
> DoubleWritable.
> It appears that this bug has been there since DoubleWritable was created (wow!), so I can
> understand if fixing it is impractical due to the possibility of breaking things downstream,
> but I can't think of anything that *should* break, off the top of my head.
> Searching JIRA, I found several related tickets, which may be useful for some historical
> perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629, HIVE-7041



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
