hadoop-general mailing list archives

From "Ankur C. Goel" <gan...@yahoo-inc.com>
Subject Re: Hash Partitioner
Date Tue, 25 May 2010 18:21:24 GMT
Another thing I would look at critically is whether there are any bugs in the write() and
readFields() methods of my key class, as those can cause hard-to-identify serialization /
de-serialization issues.
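
A quick way to rule that out is a serialization round trip: write the key to a byte buffer,
read it back into a fresh instance, and check that equals() and hashCode() still agree.
A rough sketch along those lines (Text just stands in for your own key class):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;

    public class KeyRoundTripCheck {
        public static void main(String[] args) throws IOException {
            // Text stands in for your key class here -- substitute your own
            // Writable and populate it the way your mapper would.
            Text original = new Text("some representative key");

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));

            Text copy = new Text();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));

            // If either check fails, write()/readFields() are not symmetric and
            // keys that look equal can hash and partition differently.
            System.out.println("equals after round trip:   " + original.equals(copy));
            System.out.println("hashCode after round trip: "
                    + (original.hashCode() == copy.hashCode()));
        }
    }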

-@nkur


On 5/25/10 8:39 PM, "Eric Sammer" <esammer@cloudera.com> wrote:

On Mon, May 24, 2010 at 6:32 PM, Deepika Khera <Deepika.Khera@avg.com> wrote:
> Thanks for your response Eric.
>
> I am using hadoop 0.20.2.
>
> Here is what the hashCode() implementation looks like (I actually had the IDE generate it for me)
>
> Main key (for mapper & reducer):
>
> public int hashCode() {
>        int result = kVersion;
>        result = 31 * result + (aKey != null ? aKey.hashCode() : 0);
>        result = 31 * result + (gKey != null ? gKey.hashCode() : 0);
>        result = 31 * result + (int) (date ^ (date >>> 32));
>        result = 31 * result + (ma != null ? ma.hashCode() : 0);
>        result = 31 * result + (cl != null ? cl.hashCode() : 0);
>        return result;
>    }
>
>
> aKey : AKey class
>
>
>    public int hashCode() {
>        int result = kVersion;
>        result = 31 * result + (v != null ? v.hashCode() : 0);
>        result = 31 * result + (s != null ? s.hashCode() : 0);
>        result = 31 * result + (o != null ? o.hashCode() : 0);
>        result = 31 * result + (l != null ? l.hashCode() : 0);
>        result = 31 * result + (e ? 1 : 0); //boolean
>        result = 31 * result + (li ? 1 : 0); //boolean
>        result = 31 * result + (aut ? 1 : 0); //boolean
>        return result;
>    }
>

Both of these look fine, assuming all the other hashCode()s return the
same value every time.
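
For context, the default HashPartitioner does little more than mask the sign bit of
hashCode() and take it modulo the number of reducers, so equal keys can only end up on
different reducers if hashCode() itself varies. Roughly:

    // Roughly what the stock HashPartitioner does with your key's hashCode()
    // (this mirrors org.apache.hadoop.mapreduce.lib.partition.HashPartitioner):
    public class HashPartitionerSketch<K, V> {
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask off the sign bit, then take the hash modulo the reducer count.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }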

> When this happens, I do see the same values for the key. Also I am not using a grouping
> comparator.

So you see two reduce methods getting the same key with the same
values? That's extremely odd. If this is the case, there's a bug in
Hadoop. Can you find the relevant logs from the reducers where Hadoop
fetches the map output? Does it look like it's fetching the same output
twice? Do the two tasks where you see the duplicates have the same
task ID? Can you confirm the reduce tasks are from the same job ID for
us?

> I was wondering: since the call to HashPartitioner.getPartition() is done from a map task,
> several of which are running on different machines, is it possible that they get a different
> hashcode and hence are assigned different reducers even when the key is the same?

The hashCode() result should *always* be the same given the same
internal state. In other words, it should be consistent and stable. If
I create new String("hello world"), it will always have the exact
same hashCode(). If this isn't true, you will get wildly
unpredictable results not just with Hadoop but with Java's
comparators, collections, etc.
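
As a quick sanity check you can run on any of the machines (the partition math here just
mirrors the default partitioner for illustration):

    public class StableHashCheck {
        public static void main(String[] args) {
            // Two distinct objects with identical content...
            String a = new String("hello world");
            String b = new String("hello world");

            // ...always produce the same hashCode(), on any machine or JVM,
            System.out.println(a.hashCode() + " == " + b.hashCode());

            // ...and therefore the same partition under the default partitioner.
            int numReduceTasks = 10; // arbitrary, just for illustration
            System.out.println((a.hashCode() & Integer.MAX_VALUE) % numReduceTasks);
            System.out.println((b.hashCode() & Integer.MAX_VALUE) % numReduceTasks);
        }
    }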

--
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com

