hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arun C Murthy <...@hortonworks.com>
Subject Re: Performance of direct vs indirect shuffling
Date Wed, 21 Dec 2011 00:59:58 GMT

On Dec 20, 2011, at 3:55 PM, Kevin Burton wrote:

> The current hadoop implementation shuffles directly to disk and then those disk files
are eventually requested by the target nodes which are responsible for doing the reduce()
on the intermediate data.
> However, this requires more 2x IO than strictly necessary.
> If the data were instead shuffled DIRECTLY to the target host, this IO overhead would
be removed.

We've discussed 'push' v/s 'pull' shuffle multiple times and each time turned away due to
complexities in MR1. With MRv2 (YARN) this would be much more doable.


A single reducer, in typical (well-designed?) applications, process multiple gigabytes of
data across thousands of maps.

So, to really not do any disk i/o during the shuffle you'd need very large amounts of RAM...

Also, currently, the shuffle is effected by the reduce task. This has two major benefits :
# The 'pull' can only be initiated after the reduce is scheduled. The 'push' model would be
hampered if the reduce hasn't been started.
# The 'pull' is more resilient to failure of a single reduce. In the push model, it's harder
to deal with a reduce failing after a push from the map.

Again, with MR2 we could experiment with push v/s pull where it makes sense (small jobs etc.).
I'd love to mentor/help someone interested in putting cycles into it.


View raw message