hadoop-mapreduce-user mailing list archives

From psdc1978 <psdc1...@gmail.com>
Subject Where duplicated data is ignored?
Date Wed, 17 Feb 2010 10:34:44 GMT

In Hadoop MapRed, when I define the number of reduce tasks to run,


I've noticed that during the execution of a MapRed example, the reduce
threads request the MapOutputServlet on the TaskTracker 9 times. The value 9
comes from the 3 reduce tasks times the 3 splits that have map output. The
purpose of the MapOutputServlet is to give the map output data to a reduce
thread.
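
To make the arithmetic concrete, here is a rough sketch of the fetch pattern
as I understand it. It is not Hadoop source code: the class and method names
are made up for illustration, and the sample words are invented. The only
part taken from Hadoop is the partition formula, which matches the default
HashPartitioner. It assumes each map output is divided into one partition per
reduce task and that each reduce asks each map only for its own partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the shuffle fetch pattern; not Hadoop code.
class ShuffleFetchSketch {

    // Same formula as Hadoop's default HashPartitioner.
    static int partitionFor(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // Divide one map's output into one bucket per reduce task.
    static List<List<String>> partitionMapOutput(List<String> words, int numReduces) {
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < numReduces; r++) {
            buckets.add(new ArrayList<>());
        }
        for (String word : words) {
            buckets.get(partitionFor(word, numReduces)).add(word);
        }
        return buckets;
    }

    // Each reduce fetches its own bucket from every map output.
    // Returns {number of fetches, total records transferred}.
    static int[] simulateShuffle(List<List<String>> splits, int numReduces) {
        int fetches = 0;
        int records = 0;
        for (int r = 0; r < numReduces; r++) {
            for (List<String> split : splits) {
                List<String> bucket = partitionMapOutput(split, numReduces).get(r);
                fetches++;                // one request per (map, reduce) pair
                records += bucket.size(); // only partition r is transferred
            }
        }
        return new int[] {fetches, records};
    }

    public static void main(String[] args) {
        // Three invented splits holding 4 + 3 + 3 = 10 words in total.
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("the", "cat", "the", "dog"),
            Arrays.asList("a", "cat", "sat"),
            Arrays.asList("the", "end", "a"));
        int[] result = simulateShuffle(splits, 3);
        // Prints: fetches=9 records=10
        System.out.println("fetches=" + result[0] + " records=" + result[1]);
    }
}
```

Under these assumptions there are 9 requests, but each word is transferred
exactly once, because a word belongs to exactly one partition of its map
output.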

Since the merge result from my example (the word-count example, by the way)
doesn't contain duplicated data, where is the duplicated data ignored?

- Is it by the MapOutputServlet that detects that the split was already
- Is it done by the reduce task after retrieving data from the MapOutputServlet
and before the merging phase?
- Is it during the merging phase?

Thanks for the help,

