hadoop-mapreduce-user mailing list archives

From Kevin Burton <burtona...@gmail.com>
Subject Re: Performance of direct vs indirect shuffling
Date Wed, 21 Dec 2011 01:59:38 GMT
On Tue, Dec 20, 2011 at 4:53 PM, Todd Lipcon <todd@cloudera.com> wrote:

> The advantages of the "pull" based shuffle is fault tolerance - if you
> shuffle to the reducer and then the reducer dies, you have to rerun
> *all* of the earlier maps in the "push" model.

You would have the same situation if you aren't replicating the blocks in
the mapper.

In my situation I'm replicating the shuffle data, so it should be zero-sum.

The map jobs are just re-run from where the last one failed, since the
shuffle data has already been written.
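
As a rough illustration of the trade-off being argued here, the sketch below computes which map tasks have to be re-run after a reducer dies. It is not code from Hadoop or from the implementation mentioned in this message; the function name `maps_to_rerun` and the `surviving_outputs` argument are hypothetical, and the model assumes each map task's shuffle output is either fully readable somewhere or fully lost.

```python
def maps_to_rerun(completed_maps, surviving_outputs):
    """Return the map task ids whose shuffle output must be regenerated.

    completed_maps: ids of map tasks that finished before the reducer failed.
    surviving_outputs: ids whose shuffle output is still readable somewhere
    (a replica in a push model, or the mapper's local disk in a pull model).
    """
    return sorted(set(completed_maps) - set(surviving_outputs))


completed = [0, 1, 2, 3]

# Push without replication: the dead reducer held the only copy of everything,
# so every earlier map has to run again.
push_unreplicated = maps_to_rerun(completed, surviving_outputs=[])

# Push with replicated shuffle data: every completed map's output survives,
# so only work that had not finished yet needs to be redone.
push_replicated = maps_to_rerun(completed, surviving_outputs=completed)
```

Under these assumptions the replicated push case behaves like the pull case with intact mapper disks: nothing that already completed is repeated.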

(I should note that I'm working on another MapReduce implementation that
I'm about to open-source.)

There are a LOT of problems in the MapReduce space that could themselves be
research papers, and it would be nice to see more published in this area.

> The advantage of writing to disk is of course that you can have more
> intermediate output than fits in RAM.

Well, if you're shuffling across the network and you back up due to network
I/O, then your map jobs would just run slower.

> In practice, for short jobs, the output might stay entirely in buffer
> cache and never actually hit disk (RHEL by default configures the
> writeback period to 30 seconds when there isn't page cache pressure).

Or just start to block when memory is exhausted.
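
The "block when memory is exhausted" idea can be sketched with a bounded in-memory buffer: the producer (standing in for a map task) stalls whenever the buffer is full, so a slow network side simply slows the mapper down instead of spilling to disk. This is a minimal sketch using Python's standard `queue.Queue`, not how Hadoop's shuffle works; `run_shuffle`, `buffer_slots`, and the sender thread are hypothetical names made up for the example.

```python
import queue
import threading


def run_shuffle(records, buffer_slots=4):
    """Push records through a bounded buffer; return (received, peak_depth)."""
    buf = queue.Queue(maxsize=buffer_slots)  # bounded: put() blocks when full
    received = []
    peak = 0
    done = object()  # sentinel telling the sender to stop

    def sender():
        # Stands in for the network side draining the buffer.
        while True:
            item = buf.get()
            if item is done:
                return
            received.append(item)

    t = threading.Thread(target=sender)
    t.start()
    for r in records:
        buf.put(r)  # the "mapper" waits here while the buffer is full
        peak = max(peak, buf.qsize())
    buf.put(done)
    t.join()
    return received, peak


received, peak = run_shuffle(list(range(100)), buffer_slots=4)
```

The buffer depth never exceeds `buffer_slots`, so memory use stays bounded; the cost is that map throughput is capped by how fast the sender drains the queue.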
