hadoop-mapreduce-user mailing list archives

From "Parimi, Nagender" <par...@amazon.com>
Subject How does the framework sort data?
Date Mon, 02 Aug 2010 18:22:26 GMT
Hi,

This is an admittedly naïve question, but I've been unable to find a comprehensive answer
online. I have gone through the tutorial (http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html)
a few times, and my question is simple: who or what performs the sort in MapReduce? The
tutorial states the following in a couple of places:

"The framework sorts the outputs of the maps, which are then input to the reduce tasks"

"The Mapper outputs are sorted and then partitioned per Reducer"

But this glosses over an important detail: which component sorts the mappers' outputs, and
how? Sorting huge amounts of data isn't cheap, hence my interest.

The tutorial mentions that the Reducer goes through three or four phases: Shuffle, Sort, a
possible Secondary Sort on values, and lastly Reduce. It then states:

"The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they
are merged"

I've been told that mappers sort their own outputs and that a merge sort is then used to
combine them. If so, which tasks perform the merges? Is it some randomly chosen reducers?
I would love to know more about the details. If you know of a doc that explains this, I'd
appreciate it if you could pass it along!
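
To make my mental model concrete, here is a rough Java sketch of the kind of merge I am
imagining: several map outputs, each already sorted by key, are combined into one sorted
stream with a min-heap. This is only my assumption about the general technique, not the
framework's actual code; the class name MergeSketch and the method mergeSortedRuns are made
up for illustration, and plain strings stand in for real keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: a k-way merge of pre-sorted runs.
// This is NOT Hadoop code; it just shows the merge step as I understand it.
public class MergeSketch {

    // One cursor per sorted run: the run plus the index of its next unread key.
    private static final class Cursor {
        final List<String> run;
        int pos = 0;

        Cursor(List<String> run) {
            this.run = run;
        }

        String head() {
            return run.get(pos);
        }
    }

    // Merges already-sorted runs into one sorted list using a min-heap of run heads.
    public static List<String> mergeSortedRuns(List<List<String>> runs) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }

        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // run whose next key is smallest
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c);             // re-insert with its new head key
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three pretend map outputs, each sorted by key on its own.
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("apple", "kiwi", "pear"),
            Arrays.asList("banana", "kiwi"),
            Arrays.asList("cherry"));
        System.out.println(mergeSortedRuns(runs));
    }
}
```

If the mappers really do hand over sorted runs, a merge like this never needs to re-sort
the data from scratch, which is why I'd like to know where in the pipeline it happens.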

thanks,
Nagender