hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Shuffle/sort
Date Wed, 06 Jun 2012 16:38:48 GMT
No (sorry if I confused you), the outputs are pulled from the TaskTrackers'
HTTP server, which accesses the local (mapred.local.dir) map output files
and serves them to the requester (the reduce process). There is no
'push' in MR in this phase.
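
To make the pull model concrete, here is a minimal toy sketch in Python. It is not Hadoop's code: the handler class, the /<map_id>/<partition> URL layout, and the file names are all assumptions made up for illustration. It only shows the shape of the flow: map outputs sit on local disk, an HTTP server on the map host serves them, and the reduce side asks for its own partition from each finished map.

```python
import http.server
import tempfile
import threading
import urllib.request
from pathlib import Path


class MapOutputHandler(http.server.BaseHTTPRequestHandler):
    """Toy stand-in for the TaskTracker's HTTP server (not Hadoop's servlet)."""

    local_dir = None  # hypothetical analogue of mapred.local.dir

    def do_GET(self):
        # Hypothetical URL layout: /<map_id>/<reduce_partition>
        _, map_id, partition = self.path.split("/")
        data = (Path(self.local_dir) / f"{map_id}.part{partition}").read_bytes()
        self.send_response(200)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_partition(port, map_ids, partition):
    """Reduce side: pull this reducer's partition from every finished map."""
    # Bypass any proxy configured in the environment for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    records = []
    for map_id in map_ids:
        url = f"http://127.0.0.1:{port}/{map_id}/{partition}"
        with opener.open(url) as resp:
            records.extend(resp.read().decode().splitlines())
    return sorted(records)  # crude stand-in for the reduce-side merge


def demo():
    with tempfile.TemporaryDirectory() as local_dir:
        # Two finished maps, each with output already split per reducer.
        Path(local_dir, "map0.part0").write_text("apple\t1\ncat\t1\n")
        Path(local_dir, "map1.part0").write_text("bee\t1\napple\t1\n")
        MapOutputHandler.local_dir = local_dir
        server = http.server.HTTPServer(("127.0.0.1", 0), MapOutputHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            return fetch_partition(server.server_port, ["map0", "map1"], 0)
        finally:
            server.shutdown()
            server.server_close()


if __name__ == "__main__":
    print(demo())
```

Note that the map side never opens a connection to the reducer; every transfer starts with a request from the reduce side, which is what "pull" means here.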

On Wed, Jun 6, 2012 at 10:06 PM, Barry, Sean F <sean.f.barry@intel.com> wrote:
> So I'm assuming that there is a push side also? Is it part of the map output?
> -sb
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Wednesday, June 06, 2012 9:33 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Shuffle/sort
> Sean,
> Yes, that's the one for the shuffle that happens on the reduce side (pull model). You can drill
> down from that class into how the fetchers operate, etc.
> On Wed, Jun 6, 2012 at 9:54 PM, Barry, Sean F <sean.f.barry@intel.com> wrote:
>> Thanks Harsh!
>> And is this the right source code for the shuffling that is done in the reduce task?
>> http://search-hadoop.com/c/Hadoop:/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java%7C%7Cshuffle+sort
>> -sb
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: Tuesday, June 05, 2012 7:43 PM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Shuffle/sort
>> Hey Sean,
>> Check out
>> http://www.slideshare.net/jhammerb/hadoop-map-reduce-arch-106883,
>> a slightly dated and MR1-oriented presentation from Owen O'Malley that goes into enough
>> depth to give an overview of how things work (including how reduces pull data).
>> After that, check out Chris Douglas'
>> http://www.slideshare.net/hadoopusergroup/ordered-record-collection
>> that goes in-depth into the evolution of the implementations of that layer. This
>> is pretty much the state of 0.20/1.0 today too. In 2.0 we have had Netty replacing Jetty,
>> among other improvements, but I don't have a public document link to share on this yet. Others
>> may share the 2.0 change docs if they have a link to one (or I'll respond back as soon
>> as I have one).
>> I hope this helps!
>> On Wed, Jun 6, 2012 at 4:16 AM, Barry, Sean F <sean.f.barry@intel.com> wrote:
>>> "I was always wondering after mapping, how each reduce task get its
>>> input. It is said in google's paper and hadoop's documentation that a
>>> sort is done to aggregate the same key of the map output. But there
>>> is no detailed explanation of how it is implemented and my intuition
>>> is that perhaps a global hashing will work better than sorting. So I
>>> really want to know the details and see whether my intuition is right. If I can
>>> find that out in the source code, where should I start?"
>>> I saw this question online and no one replied to it. Does anyone know where I
>>> can go to study the source code for the shuffle and sort?
>>> -sean
>> --
>> Harsh J
> --
> Harsh J

Harsh J
