hadoop-common-issues mailing list archives

From "Steve Scaffidi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-12217) hashCode in DoubleWritable returns same value for many numbers
Date Sat, 11 Jul 2015 16:51:04 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623484#comment-14623484 ]

Steve Scaffidi commented on HADOOP-12217:

Looks like the MapJoinKeyBytes class was removed as part of HIVE-9331 (git commit c8ba0f96,
2015-01-15). In my testing, I found Cloudera's distro of Hive 1.1 was using MapJoinKeyObject,
which makes sense, but looking through both their patched Hive code and upstream master/trunk,
I don't see any significant change on their side related to this.

I'm still trying to suss out another part of the issue that led me to the bug I reported
here. My affected Hive queries join a STRING column (from the large table) with an INT column
(from the small table used to build the mapjoin hashtable), and Hive converts both the STRING
and the INT to DOUBLE for the purposes of the join, which, AFAICT, is a change in behavior
since Hive 0.13. Because the values I'm joining on are all fairly small integers (about 160,000
values, ranging from 1 to 999,999), the bad hashCode implementation for DoubleWritable causes
the HashMap Hive builds in the local task to degenerate into a linked list that is exceedingly
slow both to build and to load in the subsequent map tasks. :(
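To make the degeneration concrete, here is a minimal sketch (assuming, as the report implies, that the old DoubleWritable.hashCode() was a plain truncating cast of Double.doubleToLongBits). Every integral double below 2^21 keeps its significant bits entirely in the high word of the IEEE-754 pattern, so the truncated hash is 0 for all of them:

```java
import java.util.HashSet;
import java.util.Set;

public class DoubleWritableHashDemo {
    // Assumed old implementation: truncate the raw IEEE-754 bits to 32 bits.
    static int oldHash(double value) {
        return (int) Double.doubleToLongBits(value);
    }

    public static void main(String[] args) {
        Set<Integer> hashes = new HashSet<>();
        for (int i = 1; i <= 999_999; i++) {
            hashes.add(oldHash((double) i));
        }
        // Every integral double below 2^21 stores its significant bits in
        // the high word, leaving the low 32 bits (the truncated hash) zero,
        // so all ~1M keys collide into a single HashMap bucket.
        System.out.println("distinct hash codes: " + hashes.size()); // prints 1
    }
}
```

With a single hash value, each HashMap bucket lookup degrades from O(1) to a linear scan of the whole chain, which matches the observed slowness.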

On the other hand, the conversion to a DOUBLE to do the comparison makes sense given the table
of implicit conversions in the documentation - it seems to me that the old behavior must have
been incorrect and has since been "fixed" :) Unfortunately I have too many users with too
many queries that depend on the performance of the old behavior - it's easier for me to patch
Hadoop or Hive!

Once I figure out where/why Hive's behavior changed, I'll file a ticket there, too, if necessary,
hopefully with useful patches :)
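For what it's worth, the usual remedy, and the one java.lang.Double.hashCode(double) uses, is to fold the high 32 bits into the low 32 before truncating. Whether the eventual Hadoop patch takes exactly this form is an assumption on my part; a sketch:

```java
public class DoubleHashFix {
    // Standard XOR-fold, as in java.lang.Double.hashCode(double) (Java 8+):
    // mixes the sign/exponent/high-mantissa word into the result so that
    // small integral values no longer all collapse to the same bucket.
    static int fixedHash(double value) {
        long bits = Double.doubleToLongBits(value);
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        System.out.println(fixedHash(1.0)); // 1072693248 (0x3FF00000)
        System.out.println(fixedHash(2.0)); // 1073741824 (0x40000000)
    }
}
```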

> hashCode in DoubleWritable returns same value for many numbers
> --------------------------------------------------------------
>                 Key: HADOOP-12217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.19.0, 0.19.1, 0.20.0, 0.20.1,
0.20.2, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.1.0, 1.1.1,
1.2.0, 0.21.0, 0.22.0, 0.23.0, 0.23.1, 0.23.3, 2.0.0-alpha, 2.0.1-alpha, 2.0.2-alpha, 0.23.4,
2.0.3-alpha, 0.23.5, 0.23.6, 1.1.2, 0.23.7, 2.1.0-beta, 2.0.4-alpha, 0.23.8, 1.2.1, 2.0.5-alpha,
0.23.9, 0.23.10, 0.23.11, 2.1.1-beta, 2.0.6-alpha, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.4.1, 2.5.1,
2.5.2, 2.6.0, 2.7.0, 2.7.1
>            Reporter: Steve Scaffidi
>              Labels: easyfix
> Because DoubleWritable.hashCode() is incorrect, using DoubleWritables as the keys in
a HashMap results in abysmal performance, due to hash code collisions.
> I discovered this when testing the latest version of Hive and certain mapjoin queries
were exceedingly slow.
> Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable that used to
override hashCode() with a correct implementation, but for some reason they recently removed
that code, so it now uses the incorrect hashCode() method inherited from Hadoop's DoubleWritable.
> It appears that this bug has been there since DoubleWritable was created (wow!), so I can
understand if fixing it is impractical due to the possibility of breaking things downstream,
but I can't think of anything that *should* break, off the top of my head.
> Searching JIRA, I found several related tickets, which may be useful for some historical
perspective: HADOOP-3061, HADOOP-3243, HIVE-511, HIVE-1629, HIVE-7041

This message was sent by Atlassian JIRA
