hadoop-common-user mailing list archives

From Amogh Vasekar <am...@yahoo-inc.com>
Subject Re: Saving Intermediate Results from the Mapper
Date Wed, 25 Nov 2009 06:50:01 GMT
I'm not sure whether this applies to your case, since I don't know what job2:mapper and
job3:mapper have in common, but I'd like to give it a shot.
The whole process can be combined into a single mapred job. The mapper reads a record and
processes it up to the "saved data" part, then outputs two records for each input record,
one each for the job2 and job3 mappers. The keys of these records are tagged ( <tag,key> )
according to the reducer processing you want. In reduce() you can use this tag to decide
which processing to apply. Depending on the key types, a custom partitioner might be needed
to ensure each reducer gets a unique set of keys.
Ignore if this doesn't fit your bill :)
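
To make the tagging idea concrete, here is a minimal sketch in plain Java. It deliberately uses no Hadoop classes, so it only shows the shape of the logic. The tag names, the tab separator, the class name TaggedKeySketch, and the placeholder "job2"/"job3" processing are all assumptions made for illustration; they are not from the original thread. In a real job the same logic would live in your Mapper, Partitioner, and Reducer implementations.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of tagged keys; no Hadoop classes are used. */
final class TaggedKeySketch {
    // Hypothetical tags standing in for the job2 and job3 reduce logic.
    static final String TAG_JOB2 = "job2";
    static final String TAG_JOB3 = "job3";
    // Assumed separator; the tag itself must not contain it.
    static final char SEP = '\t';

    /** Builds the composite <tag,key> as a single string. */
    static String tag(String tag, String key) {
        return tag + SEP + key;
    }

    /** Returns the tag part (everything before the first separator). */
    static String tagOf(String taggedKey) {
        return taggedKey.substring(0, taggedKey.indexOf(SEP));
    }

    /** Returns the original key part (everything after the first separator). */
    static String keyOf(String taggedKey) {
        return taggedKey.substring(taggedKey.indexOf(SEP) + 1);
    }

    /** Map step: one input record becomes two tagged {key, value} pairs. */
    static List<String[]> map(String record) {
        // Placeholder for the shared work up to the "saved data" part.
        String saved = record.trim();
        List<String[]> out = new ArrayList<>();
        // Placeholder for what the job2 mapper would emit.
        out.add(new String[] {tag(TAG_JOB2, saved), saved.toUpperCase()});
        // Placeholder for what the job3 mapper would emit.
        out.add(new String[] {tag(TAG_JOB3, saved), String.valueOf(saved.length())});
        return out;
    }

    /** Hashes the whole tagged key, so one <tag,key> always maps to one partition. */
    static int partition(String taggedKey, int numPartitions) {
        return (taggedKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Reduce step: the tag decides which processing is applied. */
    static String reduce(String taggedKey, List<String> values) {
        String t = tagOf(taggedKey);
        if (TAG_JOB2.equals(t)) {
            // Placeholder "reducer 2" logic: join the values.
            return String.join(",", values);
        } else if (TAG_JOB3.equals(t)) {
            // Placeholder "reducer 3" logic: sum the values.
            int sum = 0;
            for (String v : values) {
                sum += Integer.parseInt(v);
            }
            return String.valueOf(sum);
        }
        throw new IllegalArgumentException("Unknown tag: " + t);
    }
}
```

Because the tag is part of the key, records meant for the two kinds of reduce processing never land in the same reduce() group, even when the underlying key is identical. Whether hashing the whole tagged key is enough, or a partitioner that treats the tags differently is needed, depends on your key types.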


On 11/25/09 9:35 AM, "Gordon Linoff" <glinoff@gmail.com> wrote:

Does anyone have a pointer to code that allows the map to save data in
intermediate files, for use in a later map/reduce job?  I have been looking
for an example and cannot find one.

I have investigated MultipleOutputFormat and MultipleOutputs.  Because I am
using version 0.18.3, I don't have MultipleOutputs.  The problem with
MultipleOutputFormat is that the data I want to save is a different format
from the data I want to pass to the Reducer.  I have also tried opening a
sequence file directly from the mapper, but I am concerned that this is not
fault tolerant.

The process currently is:

Job1:  Mapper:  reads complicated data, saves out data structure.
Job2:  Mapper:  reads saved data, processes and sends data to Reducer 2.
Job3:  Mapper:  reads saved data, processes and sends data to Reducer 3.

I would like to combine the first two steps, so the process is:

Job1:  Mapper:  reads complicated data, saves out data structure, and passes
processed data to Reducer 2.
Job2:  Mapper:  reads saved data, processes and sends to Reducer 3.


On Sun, Nov 22, 2009 at 9:27 PM, Jason Venner <jason.hadoop@gmail.com> wrote:

> You can manually write the map output to a new file, there are a number of
> examples of opening a sequence file and writing to it on the web or in the
> example code for various hadoop books.
> You can also disable the removal of intermediate data, which will result in
> potentially large amounts of data being left in the mapred.local.dir.
> On Sun, Nov 22, 2009 at 3:56 PM, Gordon Linoff <glinoff@gmail.com> wrote:
>> I am starting to learn Hadoop, using the Yahoo virtual machine with
>> version 0.18.
>> My question is rather simple.  I would like to execute a map/reduce job.
>> In addition to getting the results from the reduce, I would also like to
>> save the intermediate results from the map in another HDFS file.  Is this
>> possible?
>> --gordon
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals
