hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Lilley <john.lil...@redpoint.net>
Subject Shuffle design: optimization tradeoffs
Date Tue, 11 Jun 2013 16:00:45 GMT
I am curious about the tradeoffs that drove design of the partition/sort/shuffle (Elephant
book p 208).  Doubtless this has been tuned and measured and retuned, but I'd like to know
what observations came about during the iterative optimization process to drive the final
design.  For example:

*         Why does the mapper output create a single ordered file containing all partitions,
as opposed to a file per group of partitions (which would seem to lend itself better to multi-core
scaling), or even a file per partition?

*         Why does the max number of streams to merge at once (is.sort.factor) default to
10?  Is this obsolete?  In my experience, so long as you have memory to buffer each input
at 1MB or so, the merger is more efficient as a single phase.

*         Why does the mapper do a final merge of the spill files do disk, instead of having
the auxiliary process (in YARN) merge and stream data on the fly?

*         Why do mappers sort the tuples, as opposed to only partitioning them and letting
the reducers do the sorting?
Sorry if this is overly academic, but I'm sure a lot of people put a lot of time into the
tuning effort, and I hope they left a record of their efforts.

View raw message