hadoop-general mailing list archives

From Deepika Khera <Deepika.Kh...@avg.com>
Subject RE: RE: Re: Re: Hash Partitioner
Date Fri, 28 May 2010 00:57:26 GMT
This is just to close this one. I finally resolved my issue. The problem was that my key contained some enums, whose hash codes are not constant across JVMs. Instead of calling myEnum.hashCode(), I should have first converted the enum to a string and taken the hash code of that (i.e. myEnum.name().hashCode()). I had been relying on hashCode() being correct because the IDE had generated it for me; in my case I just had to be more careful.
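
In case it helps anyone else, here is a minimal sketch of the fix (the enum and field names are illustrative, not my actual key classes):

enum Status { ACTIVE, INACTIVE }

class MyKey {
    private Status status;
    private String aKey;

    public int hashCode() {
        int result = aKey != null ? aKey.hashCode() : 0;
        // Wrong: Enum.hashCode() is the identity hash, which differs across
        // JVM instances, so mappers on different machines can hash the same
        // logical key differently:
        //   result = 31 * result + (status != null ? status.hashCode() : 0);
        // Right: hash the stable constant name instead:
        result = 31 * result + (status != null ? status.name().hashCode() : 0);
        return result;
    }
}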

Thanks for all the help!

Deepika



-----Original Message-----
From: Deepika Khera [mailto:Deepika.Khera@avg.com] 
Sent: Tuesday, May 25, 2010 2:03 PM
To: general@hadoop.apache.org
Subject: RE: Re: Re: Hash Partitioner

So I ran my process again with some more logging, and here is what I see.

I used my own HashPartitioner (basically a copy of Hadoop's partitioner with some logging added for analysis). It prints the key and the reducer assigned to that key (based on its hash code).
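
The sketch below is roughly what I used (reconstructed for this message, not the exact code). Hadoop's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; I only added the log line:

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Partitioner;

public class LoggingHashPartitioner<K, V> extends Partitioner<K, V> {
    private static final Log LOG = LogFactory.getLog(LoggingHashPartitioner.class);

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Same arithmetic as Hadoop's HashPartitioner: clear the sign bit,
        // then take the remainder over the number of reduce tasks.
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        LOG.info("key=" + key + " hashCode=" + key.hashCode()
                + " -> reducer " + partition);
        return partition;
    }
}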

My process launched 2 mappers (running on 2 different Hadoop machines), so both of them partition the keys in the split assigned to them. What I see is that for the same key, emitted by both mappers, the partitioner allocates 2 different reducers.

In the reducers I see:

1) 2 different reducers (the ones the partitioner assigned the key to) printing out the same key (I did not print the value, as I thought that wouldn't matter).
2) Here are the logs from where the reducers copy data from the mappers:

Reducer1:

2010-05-25 11:34:49,810 INFO org.apache.hadoop.mapred.ReduceTask: Read 1002612 bytes from map-output for attempt_201005251129_0001_m_000001_0
2010-05-25 11:34:49,831 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201005251129_0001_m_000001_0 -> (127, 36) from hadoop-49.c.a.com
2010-05-25 11:34:50,797 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201005251129_0001_r_000006_0: Got 1 new map-outputs
2010-05-25 11:34:54,835 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201005251129_0001_r_000006_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2010-05-25 11:34:54,841 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201005251129_0001_m_000000_0, compressed len: 1553902, decompressed len: 1553898
2010-05-25 11:34:54,841 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 1553898 bytes (1553902 raw bytes) into RAM from attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,924 INFO org.apache.hadoop.mapred.ReduceTask: Read 1553898 bytes from map-output for attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,944 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201005251129_0001_m_000000_0 -> (143, 36) from hadoop-25.c.a.com


Reducer2: 
 
2010-05-25 11:34:49,822 INFO org.apache.hadoop.mapred.ReduceTask: Read 637657 bytes from map-output for attempt_201005251129_0001_m_000001_0
2010-05-25 11:34:49,911 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201005251129_0001_m_000001_0 -> (125, 36) from hadoop-49.c.a.com
2010-05-25 11:34:50,806 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201005251129_0001_r_000008_0: Got 1 new map-outputs
2010-05-25 11:34:54,915 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201005251129_0001_r_000008_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
2010-05-25 11:34:54,920 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_201005251129_0001_m_000000_0, compressed len: 1462335, decompressed len: 1462331
2010-05-25 11:34:54,920 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 1462331 bytes (1462335 raw bytes) into RAM from attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,937 INFO org.apache.hadoop.mapred.ReduceTask: Read 1462331 bytes from map-output for attempt_201005251129_0001_m_000000_0
2010-05-25 11:34:54,937 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_201005251129_0001_m_000000_0 -> (147, 36) from hadoop-25.c.a.com


The 2 reduce tasks have different task IDs and belong to the same job.

Thanks,
Deepika

-----Original Message-----
From: Eric Sammer [mailto:esammer@cloudera.com] 
Sent: Tuesday, May 25, 2010 8:10 AM
To: general@hadoop.apache.org
Subject: Re: Re: Hash Partitioner

On Mon, May 24, 2010 at 6:32 PM, Deepika Khera <Deepika.Khera@avg.com> wrote:
> Thanks for your response Eric.
>
> I am using hadoop 0.20.2.
>
> Here is what the hashCode() implementation looks like (I actually had the IDE generate it for me)
>
> Main key (for mapper & reducer):
>
>    public int hashCode() {
>        int result = kVersion;
>        result = 31 * result + (aKey != null ? aKey.hashCode() : 0);
>        result = 31 * result + (gKey != null ? gKey.hashCode() : 0);
>        result = 31 * result + (int) (date ^ (date >>> 32));
>        result = 31 * result + (ma != null ? ma.hashCode() : 0);
>        result = 31 * result + (cl != null ? cl.hashCode() : 0);
>        return result;
>    }
>
>
> aKey : AKey class
>
>
>    public int hashCode() {
>        int result = kVersion;
>        result = 31 * result + (v != null ? v.hashCode() : 0);
>        result = 31 * result + (s != null ? s.hashCode() : 0);
>        result = 31 * result + (o != null ? o.hashCode() : 0);
>        result = 31 * result + (l != null ? l.hashCode() : 0);
>        result = 31 * result + (e ? 1 : 0); //boolean
>        result = 31 * result + (li ? 1 : 0); //boolean
>        result = 31 * result + (aut ? 1 : 0); //boolean
>        return result;
>    }
>

Both of these look fine, assuming all the other hashCode()s return the
same value every time.

> When this happens, I do see the same values for the key. Also I am not using a grouping comparator.

So you see two reduce methods getting the same key with the same
values? That's extremely odd. If this is the case, there's a bug in
Hadoop. Can you find the relevant logs from the reducers where Hadoop
fetches the map output? Does it look like it's fetching the same output
twice? Do the two tasks where you see the duplicates have the same
task ID? Can you confirm the reduce tasks are from the same job ID for
us?

> I was wondering: since the call to HashPartitioner.getPartition() is made from a map task, several of which are running on different machines, is it possible that they get different hash codes, and hence are assigned different reducers, even when the key is the same?

The hashCode() result should *always* be the same given the same
internal state. In other words, it should be consistent and stable. If
I have the string new String("hello world"), it will always have the
exact same hashCode(). If this isn't true, you will get wildly
unpredictable results, not just with Hadoop but with Java's
comparators, collections, etc.
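
For example, here is a quick check you can run in two separate JVM instances (a minimal sketch, just for illustration):

public class HashStability {
    public static void main(String[] args) {
        // String.hashCode() is specified by the language, so this prints
        // 1794106052 on every run, on every machine.
        System.out.println(new String("hello world").hashCode());
        // Object.hashCode() defaults to an identity hash, so this will
        // almost certainly differ between JVM instances.
        System.out.println(new Object().hashCode());
    }
}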

-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
