hadoop-mapreduce-user mailing list archives

From Chris Douglas <cdoug...@apache.org>
Subject Re: Best approach for accessing secondary map task outputs from reduce tasks?
Date Mon, 14 Feb 2011 04:05:27 GMT
If these assumptions are correct:

0) Each map outputs one result, a few hundred bytes
1) The map output is deterministic, given an input split index
2) Every reducer must see the result from every map

Then just output the result N times, where N is the number of
reducers, using a custom Partitioner that assigns each copy to
(records_seen++ % N), where records_seen is an int field on the
Partitioner.
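
A minimal sketch of that round-robin idea (class and method names here are illustrative; a real implementation would extend org.apache.hadoop.mapreduce.Partitioner and take the key, value, and numReduceTasks in getPartition):

```java
// Sketch: each map task emits its single summary record N times, and
// this partitioner routes the i-th copy to reducer i, so every reducer
// receives one copy of every map's summary.
// A plain class is used here to show just the assignment logic; in a
// real job this would extend org.apache.hadoop.mapreduce.Partitioner.
public class RoundRobinPartitioner {
    // Counts records seen by this partitioner instance within one map task.
    private int recordsSeen = 0;

    // Returns the reducer index for the next emitted copy.
    public int getPartition(int numReduceTasks) {
        return recordsSeen++ % numReduceTasks;
    }
}
```

Since the counter lives on the partitioner instance, this only works if each map task emits exactly N copies in sequence; the i-th copy then lands on reducer i.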

If (1) does not hold, then write the first stage as a job with a single
(optional) reduce, and the second stage as a map-only job processing
the result. -C

On Sun, Feb 13, 2011 at 12:18 PM, Jacques <whshub@gmail.com> wrote:
> I'm outputting a small amount of secondary summary information from a map
> task that I want to use in the reduce phase of the job.  This information is
> keyed on a custom input split index.
> Each map task outputs this summary information (less than a hundred bytes per
> input task).  Note that the summary information isn't ready until the
> completion of the map task.
> Each reduce task needs to read this information (for all input splits) to
> complete its task.
> What is the best way to pass this information to the Reduce stage?  I'm
> working in Java using cdhb2.   Ideas I had include:
> 1. Output this data to MapContext.getWorkOutputPath().  However, that data
> is not available anywhere in the reduce stage.
> 2. Output this data to "mapred.output.dir".  The problem here is that the
> map task writes to this directory immediately, so failed tasks and speculative
> execution could cause collision issues.
> 3. Output this data as in (1) and then use Mapper.cleanup() to copy these
> files to "mapred.output.dir".  Could work but I'm still a little concerned
> about collision/race issues as I'm not clear about when a Map task becomes
> "the" committed map task for that split.
> 4. Use an external system to hold this information and then just call that
> system from both phases.  This is basically an alternative of #3 and has the
> same issues.
> Are there suggested approaches of how to do this?
> It seems like (1) might make the most sense if there is a defined way to
> stream secondary outputs from all the mappers within the Reduce.setup()
> method.
> Thanks for any ideas.
> Jacques
