hadoop-mapreduce-user mailing list archives

From Jacques <whs...@gmail.com>
Subject Re: Best approach for accessing secondary map task outputs from reduce tasks?
Date Mon, 14 Feb 2011 03:35:55 GMT
It was my understanding, based on the FAQ and my own experience, that
using the MultipleOutputs class, or just relying on the OutputCommitter, only
works for the final phase of the job (e.g. the reduce phase in a
map+reduce job, or the map phase only when reducer=NONE).  In the
case I'm talking about, I want the map output to be committed and available
to the reducers.  If I understand the intricacies of MapReduce correctly, the
map output of a full map+reduce job is never put onto HDFS but is instead
streamed directly from the mapper to the requesting reducers.  To use (2)
effectively, I want to commit the secondary output to HDFS only if the map
task completes successfully.

This seems to require either:
a) Assuming that the first call to Mapper.cleanup() for a particular
split is the definitive call for that split (and thus committing the
secondary information at that point), or
b) Always writing the secondary map output to directories named for the
task attempt, and then hooking in a delete of the output from those map
task attempts which weren't committed.
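
A rough sketch of what (b) could look like, using plain java.nio on a local
filesystem rather than the Hadoop FileSystem API. The class name, directory
layout ("_attempts", "committed") and file naming are all hypothetical and
chosen only for illustration. Each attempt writes into its own directory, and
the first attempt to move its file into the committed location wins; later
attempts for the same split discard their copy. Note that the existence check
and the move are not guaranteed to be atomic here, so this only illustrates
the shape of the idea, not a race-free implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Hadoop: a local-filesystem analogue of option (b).
class SideOutputCommit {

    // Write the summary for one split into a directory private to this attempt.
    static Path writeAttempt(Path root, String splitId, String attemptId, String summary)
            throws IOException {
        Path attemptDir = root.resolve("_attempts").resolve(attemptId);
        Files.createDirectories(attemptDir);
        Path file = attemptDir.resolve("summary-" + splitId);
        Files.write(file, summary.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    // Promote an attempt's file to the committed location.
    // Returns true if this attempt won, false if another attempt already committed.
    static boolean commit(Path root, String splitId, Path attemptFile) throws IOException {
        Path committedDir = root.resolve("committed");
        Files.createDirectories(committedDir);
        Path target = committedDir.resolve("summary-" + splitId);
        try {
            // No REPLACE_EXISTING option, so the move fails if the target already exists.
            Files.move(attemptFile, target);
            return true;
        } catch (IOException e) {
            if (!Files.exists(target)) {
                throw e; // failed for some other reason; do not hide it
            }
            // Another attempt committed first: delete this attempt's output.
            Files.deleteIfExists(attemptFile);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("side-output");
        // Two attempts for the same split, as with speculative execution.
        Path first = writeAttempt(root, "0007", "attempt_0", "rows=10");
        Path second = writeAttempt(root, "0007", "attempt_1", "rows=10");
        System.out.println("attempt_0 committed: " + commit(root, "0007", first));
        System.out.println("attempt_1 committed: " + commit(root, "0007", second));
    }
}
```

A reducer would then read only from the "committed" directory and ignore
anything under "_attempts". Whether the same first-writer-wins behaviour can
be relied on in HDFS is an assumption that would need checking.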

Am I missing something and/or over-complicating things?

Thanks for your help

On Sun, Feb 13, 2011 at 6:54 PM, Harsh J <qwertymaniac@gmail.com> wrote:

> With just HDFS, IMO the good approach would be (2). See this FAQ on
> task-specific HDFS output directories you can use:
> http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
> .
> It'd also be much easier to use the MultipleOutputs class (or other
> such utilities) for writing the extra data, as they also prefix -m- or
> -r- in the filenames, based on the task type.
> On Mon, Feb 14, 2011 at 1:48 AM, Jacques <whshub@gmail.com> wrote:
> > I'm outputting a small amount of secondary summary information from a map
> > task that I want to use in the reduce phase of the job.  This information is
> > keyed on a custom input split index.
> >
> > Each map task outputs this summary information (less than a hundred bytes per
> > input task).  Note that the summary information isn't ready until the
> > completion of the map task.
> >
> > Each reduce task needs to read this information (for all input splits) to
> > complete its task.
> >
> > What is the best way to pass this information to the Reduce stage?  I'm
> > working in Java using cdhb2.   Ideas I had include:
> >
> > 1. Output this data to MapContext.getWorkOutputPath().  However, that data
> > is not available anywhere in the reduce stage.
> > 2. Output this data to "mapred.output.dir".  The problem here is that the
> > map task writes immediately to this so failed jobs and speculative execution
> > could cause collision issues.
> > 3. Output this data as in (1) and then use Mapper.cleanup() to copy these
> > files to "mapred.output.dir".  Could work but I'm still a little concerned
> > about collision/race issues as I'm not clear about when a Map task becomes
> > "the" committed map task for that split.
> > 4. Use an external system to hold this information and then just call that
> > system from both phases.  This is basically an alternative of #3 and has the
> > same issues.
> >
> > Are there suggested approaches of how to do this?
> >
> > It seems like (1) might make the most sense if there is a defined way to
> > stream secondary outputs from all the mappers within the Reduce.setup()
> > method.
> >
> > Thanks for any ideas.
> >
> > Jacques
> >
> --
> Harsh J
> www.harshj.com
