hadoop-mapreduce-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: Best approach for accessing secondary map task outputs from reduce tasks?
Date Mon, 14 Feb 2011 04:11:49 GMT
In my experience, MultipleOutputs (MO) can write data on both the map
and reduce sides of a single MR job. All data written to an MO named
output on the map side is committed just as it would be in a map-only
job. There is no difference, since a map task does not wait for reduce
tasks to begin; it runs independently of the rest of the job plan. Note
that MO uses direct record writers instead of the MapOutputBuffer
class, which the default collector in a Map+Reduce job uses to write to
the local filesystem for the ReduceTask to pick up. Your data should
therefore be available on the reduce side, provided the framework
guarantees that the reduce operation never starts until all map tasks
have finished (which is the case right now).
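
To make the commit-on-success idea concrete, here is a minimal sketch in
plain Java. It is not Hadoop code: a local temporary directory stands in
for the job output directory on HDFS, and the class name, method names,
and the "summary-m-NNNNN" file naming are all hypothetical, chosen only
to echo the -m- prefix mentioned in this thread. Each attempt writes to
its own private directory, only one attempt per split is promoted, and
the reduce side reads whatever was promoted. In a real job the framework
decides which attempt commits; the existence check here is only an
illustration and is not race-free.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Local-filesystem analogy for attempt-scoped side output (hypothetical names, not Hadoop API). */
class SideOutputSketch {

    /** Creates a scratch directory standing in for the job output directory. */
    static Path newJobDir() {
        try {
            return Files.createTempDirectory("job-output");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A map attempt writes its summary under a directory private to that attempt. */
    static Path writeAttempt(Path jobDir, int split, int attempt, String summary) {
        try {
            Path dir = Files.createDirectories(
                    jobDir.resolve("_temporary").resolve("split-" + split + "_attempt-" + attempt));
            return Files.write(dir.resolve("summary"), summary.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Promotes an attempt's file; returns false if the split already has a committed summary. */
    static boolean commit(Path jobDir, int split, Path attemptFile) {
        Path target = jobDir.resolve("summary-m-" + String.format("%05d", split));
        try {
            // Without REPLACE_EXISTING, the move fails when the target already exists.
            Files.move(attemptFile, target);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false; // a duplicate (e.g. speculative) attempt lost; its file stays uncommitted
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** What a reducer's setup step might do: read every committed summary, keyed by file name. */
    static Map<String, String> readCommitted(Path jobDir) {
        Map<String, String> summaries = new TreeMap<>();
        try (Stream<Path> files = Files.list(jobDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                String name = p.getFileName().toString();
                if (name.startsWith("summary-m-")) { // skips the _temporary directory
                    summaries.put(name, new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return summaries;
    }

    public static void main(String[] args) {
        Path job = newJobDir();
        Path first = writeAttempt(job, 0, 0, "rows=10");
        Path duplicate = writeAttempt(job, 0, 1, "rows=10"); // simulated speculative attempt
        System.out.println("first committed: " + commit(job, 0, first));
        System.out.println("duplicate committed: " + commit(job, 0, duplicate));
        System.out.println("visible to reducers: " + readCommitted(job));
    }
}
```

The point of the sketch is only that uncommitted attempts never appear
under the names the reduce side looks for. Cleanup of the _temporary
directory is left out for brevity.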

On Mon, Feb 14, 2011 at 9:05 AM, Jacques <whshub@gmail.com> wrote:
> It was my understanding, based on the FAQ and my personal experience, that
> using the MultipleOutputs class, or just relying on OutputCommitter, only
> works for the final phase of the job.  (E.g. the reduce phase in a
> map+reduce job and the map phase only in the case of reducer=NONE).  In the
> case I'm talking about, I want the map output to be committed and available
> to the reducers.  If I understand the intricacies of MapReduce, the map
> output of a full map+reduce job is never put onto HDFS but is rather
> streamed directly from the mapper to the requesting reducers.  To use (2)
> effectively, I only want to commit the secondary output to HDFS if the map
> task is completed successfully.
> This seems to either require:
> a) Assuming that the first time map.cleanup is called for a particular
> split, that it is the definitive call for that split (and thus commit the
> secondary information at that point)
> b) Or, somehow always commit the map output to directories named for that
> task attempt and then hook a delete of the map task output for those map
> tasks which weren't committed.
> Am I missing something and/or over-complicating things?
> Thanks for your help
> Jacques
> On Sun, Feb 13, 2011 at 6:54 PM, Harsh J <qwertymaniac@gmail.com> wrote:
>> With just HDFS, IMO the good approach would be (2). See this FAQ on
>> task-specific HDFS output directories you can use:
>> http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F.
>> It'd also be much easier to use the MultipleOutputs class (or other
>> such utilities) for writing the extra data, as they also prefix -m- or
>> -r- in the filenames, based on the task type.
>> On Mon, Feb 14, 2011 at 1:48 AM, Jacques <whshub@gmail.com> wrote:
>> > I'm outputting a small amount of secondary summary information from
>> > a map task that I want to use in the reduce phase of the job.  This
>> > information is keyed on a custom input split index.
>> >
>> > Each map task outputs this summary information (less than a hundred
>> > bytes per input task).  Note that the summary information isn't ready
>> > until the completion of the map task.
>> >
>> > Each reduce task needs to read this information (for all input
>> > splits) to complete its task.
>> >
>> > What is the best way to pass this information to the Reduce stage?
>> > I'm working in Java using cdhb2.  Ideas I had include:
>> >
>> > 1. Output this data to MapContext.getWorkOutputPath().  However, that
>> > data is not available anywhere in the reduce stage.
>> > 2. Output this data to "mapred.output.dir".  The problem here is that
>> > the map task writes immediately to this so failed jobs and
>> > speculative execution could cause collision issues.
>> > 3. Output this data as in (1) and then use Mapper.cleanup() to copy
>> > these files to "mapred.output.dir".  Could work but I'm still a
>> > little concerned about collision/race issues as I'm not clear about
>> > when a Map task becomes "the" committed map task for that split.
>> > 4. Use an external system to hold this information and then just call
>> > that system from both phases.  This is basically an alternative of #3
>> > and has the same issues.
>> >
>> > Are there suggested approaches of how to do this?
>> >
>> > It seems like (1) might make the most sense if there is a defined way
>> > to stream secondary outputs from all the mappers within the
>> > Reduce.setup() method.
>> >
>> > Thanks for any ideas.
>> >
>> > Jacques
>> >
>> --
>> Harsh J
>> www.harshj.com

Harsh J
