hadoop-mapreduce-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: Best approach for accessing secondary map task outputs from reduce tasks?
Date Mon, 14 Feb 2011 02:54:47 GMT
With just HDFS, IMO the best approach would be (2). See this FAQ entry on
the task-specific HDFS output directories you can use for that:
http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
It'd also be much easier to use the MultipleOutputs class (or another
such utility) to write the extra data, since it also puts -m- or
-r- in the filenames, based on the task type.
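
To illustrate why the task-specific directories avoid the collision problem,
here is a rough sketch of the idea. It is NOT Hadoop code: it uses plain
java.nio on a local filesystem as a stand-in for HDFS, and every name in it
(SideOutputSketch, writeAttemptSummary, commit, readAllSummaries, the
"summary-m-NNNNN" file name, the "attempt_..." directory name) is made up for
the example. In a real job the framework does the promotion step for you, and
MultipleOutputs would pick the file names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Local-filesystem illustration only; all names here are hypothetical. */
final class SideOutputSketch {

    /** Creates a scratch directory that plays the role of the job output dir. */
    static Path newTempOutputDir() {
        try {
            return Files.createTempDirectory("job-output");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Map side: write the summary into a directory private to one task attempt. */
    static Path writeAttemptSummary(Path outputDir, int split, int attempt, String summary) {
        try {
            Path attemptDir = outputDir.resolve("_temporary")
                                       .resolve("attempt_" + split + "_" + attempt);
            Files.createDirectories(attemptDir);
            // "-m-" marks a map-side file, mirroring the naming mentioned above.
            Path file = attemptDir.resolve(String.format("summary-m-%05d", split));
            Files.write(file, summary.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Commit: promote one attempt's file into the output directory.
     * Returns false if another attempt for the same split got there first.
     * (Stand-in for the framework promoting only the successful attempt.)
     */
    static boolean commit(Path outputDir, Path attemptFile) {
        try {
            // Without REPLACE_EXISTING, move fails if the target already exists.
            Files.move(attemptFile, outputDir.resolve(attemptFile.getFileName()));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reduce side (e.g. from setup()): load every committed map summary, keyed by split. */
    static Map<Integer, String> readAllSummaries(Path outputDir) {
        Map<Integer, String> bySplit = new TreeMap<>();
        // The glob skips _temporary, so uncommitted attempts are never read.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(outputDir, "summary-m-*")) {
            for (Path f : files) {
                String name = f.getFileName().toString();
                int split = Integer.parseInt(name.substring("summary-m-".length()));
                bySplit.put(split, new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bySplit;
    }

    public static void main(String[] args) {
        Path out = newTempOutputDir();
        Path first = writeAttemptSummary(out, 3, 0, "rows=120");
        Path speculative = writeAttemptSummary(out, 3, 1, "rows=120");
        System.out.println("first committed: " + commit(out, first));
        System.out.println("speculative committed: " + commit(out, speculative));
        System.out.println(readAllSummaries(out));
    }
}
```

The point of the sketch is only that each attempt writes where no other
attempt can see it, and a reader looks solely at promoted files, so failed or
speculative attempts never collide in the output directory.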

On Mon, Feb 14, 2011 at 1:48 AM, Jacques <whshub@gmail.com> wrote:
> I'm outputting a small amount of secondary summary information from a map
> task that I want to use in the reduce phase of the job.  This information is
> keyed on a custom input split index.
>
> Each map task outputs this summary information (less than a hundred bytes per
> input task).  Note that the summary information isn't ready until the
> completion of the map task.
>
> Each reduce task needs to read this information (for all input splits) to
> complete its task.
>
> What is the best way to pass this information to the Reduce stage?  I'm
> working in Java using cdhb2.   Ideas I had include:
>
> 1. Output this data to MapContext.getWorkOutputPath().  However, that data
> is not available anywhere in the reduce stage.
> 2. Output this data to "mapred.output.dir".  The problem here is that the
> map task writes immediately to this so failed jobs and speculative execution
> could cause collision issues.
> 3. Output this data as in (1) and then use Mapper.cleanup() to copy these
> files to "mapred.output.dir".  Could work but I'm still a little concerned
> about collision/race issues as I'm not clear about when a Map task becomes
> "the" committed map task for that split.
> 4. Use an external system to hold this information and then just call that
> system from both phases.  This is basically an alternative of #3 and has the
> same issues.
>
> Are there suggested approaches of how to do this?
>
> It seems like (1) might make the most sense if there is a defined way to
> stream secondary outputs from all the mappers within the Reducer.setup()
> method.
>
> Thanks for any ideas.
>
> Jacques
>



-- 
Harsh J
www.harshj.com
