hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1385) MD5Hash has a bad hash function
Date Fri, 18 May 2007 16:03:16 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12496922 ]

Owen O'Malley commented on HADOOP-1385:

> Shouldn't closeHash1 and closeHash2 share the same first four bytes but differ in (some
> of) the others? Then they would not be equal (according to the equals method) but the
> hash codes would be equal.

Maybe I should remove that test. I basically wanted to check that the current bug is gone.
With the new hash function the two hash codes are different. With the old code they are
the same.

> Also, this looks like a 0.14.0 fix since the existing code isn't broken, just inefficient
> in some cases.

It is really badly broken if you have more than 256 reduces. You'll basically have extremely
heavy loading on at most 256 of the reduces (depending on the precise number of reduces). If
you run with 2000 reduces, the majority of your workload will be done by about 1/8 of your
cluster. Certainly for my cluster, I'll fix this for 0.13. If the majority of people feel
this can be pushed to 0.14, that is fine.
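
To make the 1/8 figure concrete, here is a rough sketch of the arithmetic. It assumes keys are assigned to reduces as (hashCode & Integer.MAX_VALUE) % numReduces, which is a common way to do hash partitioning but is not stated in this message. The class and method names are made up for illustration, and the sketch only covers keys whose hash has the degenerate 0xFFFFFFxx form.

```java
import java.util.BitSet;

final class ReduceSkewDemo {
    // Hypothetical partitioner (an assumption, not quoted from Hadoop):
    // clear the sign bit, then take the remainder.
    static int partition(int hashCode, int numReduces) {
        return (hashCode & Integer.MAX_VALUE) % numReduces;
    }

    // Counts how many distinct reduces can receive keys when every
    // hash code has the form 0xFFFFFFxx (only the low byte varies).
    static int reducesUsed(int numReduces) {
        BitSet used = new BitSet(numReduces);
        for (int low = 0; low < 256; low++) {
            int hashCode = 0xFFFFFF00 | low;
            used.set(partition(hashCode, numReduces));
        }
        return used.cardinality();
    }

    public static void main(String[] args) {
        int numReduces = 2000;
        System.out.println(reducesUsed(numReduces) + " of " + numReduces
                + " reduces receive the degenerate keys");
    }
}
```

Under that assumption, 256 of 2000 reduces (12.8%, roughly 1/8) would receive every key whose hash collapsed to 0xFFFFFFxx.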

> MD5Hash has a bad hash function
> -------------------------------
>                 Key: HADOOP-1385
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1385
>             Project: Hadoop
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.12.3
>            Reporter: Owen O'Malley
>         Assigned To: Owen O'Malley
>             Fix For: 0.13.0
>         Attachments: 1385.patch
> The MD5Hash class has a really bad hash function that will cause most md5s to hash
to 0xFFFFFFxx, leaving only the low order byte as meaningful. The problem comes from the automatic
sign extension when promoting from byte to int.
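
The sign-extension effect described above can be reproduced with a small sketch. This is not the actual MD5Hash code; the class, the method names, and the byte order are assumptions chosen only to show the mechanism. When a negative byte is promoted to int, its upper 24 bits become 1s, so OR-ing it into the result overwrites whatever the other bytes contributed. Masking each byte with 0xFF before shifting avoids that.

```java
final class SignExtensionDemo {
    // Hypothetical buggy combiner: each byte is promoted to int with
    // sign extension, so a negative low byte sets the top 24 bits.
    static int buggyCombine(byte[] b) {
        return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
    }

    // Masking with 0xFF keeps only the low 8 bits of each promoted byte.
    static int fixedCombine(byte[] b) {
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] digestPrefix = {0x12, 0x34, 0x56, (byte) 0x9A};
        System.out.println(Integer.toHexString(buggyCombine(digestPrefix)));
        System.out.println(Integer.toHexString(fixedCombine(digestPrefix)));
    }
}
```

With the sample bytes 12 34 56 9A, the unmasked version yields 0xFFFFFF9A while the masked version yields 0x1234569A.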

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
