hadoop-mapreduce-dev mailing list archives

From Ed Mazur <ma...@cs.umass.edu>
Subject Map output partitioning/sorting
Date Thu, 13 May 2010 20:54:18 GMT
On the map side, partitioning and sorting currently work like this:
each <key, value> output record is transformed into <partition, key,
value>, and the whole set is then sorted on partition first and key
second.
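
To make the current behavior concrete, here is a rough sketch of that
single composite sort. It is not the actual map-side code: the class
name, the Rec type, and tagAndSort are made up for illustration, and
partitionOf only assumes a hash-style partitioner (non-negative hash
modulo the number of partitions).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CompositeSortSketch {
    // Hypothetical stand-in for a map output record tagged with its partition.
    static final class Rec {
        final int partition;
        final String key;
        final String value;

        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return "<" + partition + ", " + key + ", " + value + ">";
        }
    }

    // Assumed hash-style partitioner: non-negative hash modulo partition count.
    static int partitionOf(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Tag every <key, value> with its partition, then run ONE sort over all
    // records, comparing on partition first and key second.
    static List<Rec> tagAndSort(List<String[]> output, int numPartitions) {
        List<Rec> tagged = new ArrayList<>();
        for (String[] kv : output) {
            tagged.add(new Rec(partitionOf(kv[0], numPartitions), kv[0], kv[1]));
        }
        tagged.sort(Comparator.comparingInt((Rec r) -> r.partition)
                              .thenComparing(r -> r.key));
        return tagged;
    }

    public static void main(String[] args) {
        List<String[]> output = Arrays.asList(
            new String[] {"pear", "1"},
            new String[] {"apple", "2"},
            new String[] {"fig", "3"},
            new String[] {"kiwi", "4"});
        for (Rec r : tagAndSort(output, 3)) {
            System.out.println(r);
        }
    }
}
```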

I assume it would be more efficient to just group the records into
partitions and then sort each of these individually, so why do we do a
global sort? My only guess is that this simplifies things when there
are buffer spills to disk and not all records are in memory.
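
The alternative I have in mind could be sketched roughly as below. Again
this is only an illustration under assumptions: the class and method
names are hypothetical, keys are plain strings, everything fits in
memory, and partitionOf assumes a hash-style partitioner. It buckets the
keys by partition, sorts each bucket on key alone, and checks that the
concatenated buckets match what a single sort on (partition, key) gives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PerPartitionSortSketch {
    // Assumed hash-style partitioner: non-negative hash modulo partition count.
    static int partitionOf(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Proposed approach: group keys into per-partition buckets in one pass,
    // then sort each bucket independently by key only.
    static List<List<String>> bucketThenSort(List<String> keys, int numPartitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            buckets.add(new ArrayList<>());
        }
        for (String k : keys) {
            buckets.get(partitionOf(k, numPartitions)).add(k);
        }
        for (List<String> bucket : buckets) {
            Collections.sort(bucket);
        }
        return buckets;
    }

    // Current approach, for comparison: one sort over all keys on (partition, key).
    static List<String> globalSort(List<String> keys, int numPartitions) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort((a, b) -> {
            int c = Integer.compare(partitionOf(a, numPartitions),
                                    partitionOf(b, numPartitions));
            return c != 0 ? c : a.compareTo(b);
        });
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("pear", "apple", "fig", "kiwi", "plum", "lime");
        List<String> flattened = new ArrayList<>();
        for (List<String> bucket : bucketThenSort(keys, 3)) {
            flattened.addAll(bucket);
        }
        // Both approaches should yield the same final order.
        System.out.println(flattened.equals(globalSort(keys, 3)));
    }
}
```

With n records spread evenly over p partitions, the per-partition
version needs on the order of n log(n/p) comparisons instead of n log n,
and each comparison looks at the key only. The sketch says nothing about
spills, which is where I suspect the single sort is easier to handle.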

That being said, I've not yet run into a case where buffer spills
couldn't be avoided by properly configuring the buffer parameters. I
haven't run any benchmarks on this; I'm just curious what others think
and whether the difference would be negligible.

Ed
