hadoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Albert Chu <ch...@llnl.gov>
Subject Re: Shuffle design: optimization tradeoffs
Date Tue, 11 Jun 2013 21:31:45 GMT
On Tue, 2013-06-11 at 16:00 +0000, John Lilley wrote:
> I am curious about the tradeoffs that drove design of the
> partition/sort/shuffle (Elephant book p 208).  Doubtless this has been
> tuned and measured and retuned, but I’d like to know what observations
> came about during the iterative optimization process to drive the
> final design.  For example:
> ·        Why does the mapper output create a single ordered file
> containing all partitions, as opposed to a file per group of
> partitions (which would seem to lend itself better to multi-core
> scaling), or even a file per partition?

I researched this awhile back wondering the same thing, and found this



> ·        Why does the max number of streams to merge at once
> (is.sort.factor) default to 10?  Is this obsolete?  In my experience,
> so long as you have memory to buffer each input at 1MB or so, the
> merger is more efficient as a single phase.
> ·        Why does the mapper do a final merge of the spill files do
> disk, instead of having the auxiliary process (in YARN) merge and
> stream data on the fly?
> ·        Why do mappers sort the tuples, as opposed to only
> partitioning them and letting the reducers do the sorting?
> Sorry if this is overly academic, but I’m sure a lot of people put a
> lot of time into the tuning effort, and I hope they left a record of
> their efforts.
> Thanks
> John
Albert Chu
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

View raw message