hadoop-common-issues mailing list archives

From "Steve Scaffidi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-12217) hashCode in DoubleWritable returns same value for many numbers
Date Fri, 10 Jul 2015 17:48:04 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Scaffidi updated HADOOP-12217:
------------------------------------
    Description: 
Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the keys in a HashMap
results in abysmal performance, due to hash code collisions.
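
To illustrate how such collisions can arise, here is a minimal sketch. It assumes the
inherited method simply truncates the 64-bit IEEE 754 bit pattern to its low 32 bits; the
issue text does not quote the actual implementation, so treat truncatingHash() as a
hypothetical stand-in. DoubleHashSketch and its methods are illustrative names, not Hadoop
code. For comparison, foldingHash() uses the XOR-fold that java.lang.Double.hashCode() is
documented to use.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; not part of Hadoop or Hive.
class DoubleHashSketch {

    // Assumed shape of the problematic hash: keep only the low 32 bits
    // of the IEEE 754 bit pattern and discard the rest.
    static int truncatingHash(double value) {
        return (int) Double.doubleToLongBits(value);
    }

    // The fold used by java.lang.Double.hashCode(): XOR the high and low halves.
    static int foldingHash(double value) {
        long bits = Double.doubleToLongBits(value);
        return (int) (bits ^ (bits >>> 32));
    }

    // Number of distinct truncating hashes for the whole numbers 0.0 .. n-1.
    static int distinctTruncating(int n) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(truncatingHash((double) i));
        }
        return seen.size();
    }

    // Number of distinct folding hashes for the whole numbers 0.0 .. n-1.
    static int distinctFolding(int n) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(foldingHash((double) i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("truncating: " + distinctTruncating(n) + " distinct hashes"); // 1
        System.out.println("folding:    " + distinctFolding(n) + " distinct hashes");    // 1000
    }
}
```

Small whole numbers stored as doubles have all-zero low mantissa bits, so under the assumed
truncation every one of them hashes to 0 and lands in the same HashMap bucket.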

I discovered this while testing the latest version of Hive, where certain mapjoin queries
were exceedingly slow.

Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable that used to
override hashCode() with a correct implementation, but for some reason they recently
removed that code, so it now uses the incorrect hashCode() method inherited from Hadoop's
DoubleWritable.

It appears that this bug has been there since DoubleWritable was created (wow!), so I can
understand if fixing it is impractical because of the risk of breaking things downstream, but
off the top of my head I can't think of anything that *should* break.

Searching JIRA, I found several related tickets, which may be useful for some historical perspective:
HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629, HIVE-7041


> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>
>                 Key: HADOOP-12217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.20.0, 0.20.1,
>                      0.20.2, 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.1.0, 1.1.1,
>                      1.2.0, 0.21.0, 0.22.0, 0.23.0, 0.23.1, 0.23.3, 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 0.23.4,
>                      2.0.3-alpha, 0.23.5, 0.23.6, 1.1.2, 0.23.7, 2.1.0-beta, 2.0.4-alpha, 0.23.8, 1.2.1, 2.0.5-alpha,
>                      0.23.9, 0.23.10, 0.23.11, 2.1.1-beta, 2.0.6-alpha, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.4.1, 2.5.1,
>                      2.5.2, 2.6.0, 2.7.0, 2.7.1
>            Reporter: Steve Scaffidi
>              Labels: easyfix



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
