hadoop-mapreduce-user mailing list archives

From Niels Basjes <ni...@basj.es>
Subject Re: How to Create an effective chained MapReduce program.
Date Tue, 06 Sep 2011 05:57:12 GMT
Hi,

In the past I've had the same situation, where I needed the data for
debugging. Back then I chose to create a second job with simply
SequenceFileInputFormat, IdentityMapper, IdentityReducer and finally
TextOutputFormat.

That worked great for my purpose.
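
A minimal sketch of such a conversion job, assuming the old
org.apache.hadoop.mapred API (the one that provides IdentityMapper and
IdentityReducer) and assuming the intermediate files hold Text keys and
IntWritable values, as in Roger's example below. The class name
SequenceFileToText and the job name are hypothetical; adjust the key and
value classes to whatever your reducer actually writes.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical driver: copies a SequenceFile directory to plain text.
public class SequenceFileToText {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SequenceFileToText.class);
        conf.setJobName("seqfile-to-text");

        // Read binary <key,value> records, pass them through unchanged,
        // and write them out as "key<TAB>value" lines.
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Assumption: must match the types stored in the SequenceFile.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // args[0]: output directory of the intermediate job
        // args[1]: new directory for the readable text copy
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

The chain itself keeps using the SequenceFile formats; this extra job only
makes a readable copy of one intermediate stage. For a quick look without
running a job, `hadoop fs -text <path>` should also print a SequenceFile as
text.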

-- 
Kind regards,
Niels Basjes

On 6 Sep 2011 01:54, "ilyal levin" <nipponilyal@gmail.com> wrote:
>
> O.k., so now I'm using SequenceFileInputFormat and SequenceFileOutputFormat
> and it works fine, but the output of the reducer is now a binary file (not
> txt), so I can't understand the data. How can I solve this? I need the data
> (in txt form) of the intermediate stages in the chain.
>
> Thanks
>
>
> On Tue, Sep 6, 2011 at 1:33 AM, ilyal levin <nipponilyal@gmail.com> wrote:
>>
>> Thanks for the help.
>>
>>
>> On Mon, Sep 5, 2011 at 10:50 PM, Roger Chen <rogchen@ucdavis.edu> wrote:
>>>
>>> The binary file will allow you to pass the output from the first reducer
>>> to the second mapper. For example, if you output Text, IntWritable from
>>> the first one in SequenceFileOutputFormat, then you are able to retrieve
>>> Text, IntWritable input at the head of the second mapper. The idea of
>>> chaining is that you already know what kind of output the first reducer is
>>> going to give, and that you want to perform some secondary operation on it.
>>>
>>> One last thing on chaining jobs: it's often worth looking to see if you
>>> can consolidate all of your separate map and reduce tasks into a single
>>> map/reduce operation. There are many situations where it is more intuitive
>>> to write a number of map/reduce operations and chain them together, but more
>>> efficient to have just a single operation.
>>>
>>>
>>>
>>> On Mon, Sep 5, 2011 at 12:21 PM, ilyal levin <nipponilyal@gmail.com> wrote:
>>>>
>>>> Thanks for the reply.
>>>> I tried it, but it creates a binary file which I cannot understand (I
>>>> need the result of the first job).
>>>> The other thing is: how can I use this file in the next chained mapper?
>>>> I.e., how can I retrieve the keys and the values in the map function?
>>>>
>>>>
>>>> Ilyal
>>>>
>>>>
>>>> On Mon, Sep 5, 2011 at 7:41 PM, Joey Echeverria <joey@cloudera.com> wrote:
>>>>>
>>>>> Have you tried SequenceFileOutputFormat and SequenceFileInputFormat?
>>>>>
>>>>> -Joey
>>>>>
>>>>> On Mon, Sep 5, 2011 at 11:49 AM, ilyal levin <nipponilyal@gmail.com> wrote:
>>>>> > Hi
>>>>> > I'm trying to write a chained MapReduce program. I'm doing so with a
>>>>> > simple loop where in each iteration I create a job and execute it, and
>>>>> > every time the current job's output is the next job's input.
>>>>> > How can I configure the OutputFormat of the current job and the
>>>>> > InputFormat of the next job so that I do not use TextInputFormat
>>>>> > (TextOutputFormat)? If I do use it, I need to parse the input file in
>>>>> > the map function.
>>>>> > I.e., if possible I want the next job to "consider" the input file as
>>>>> > <key,value> and not plain Text.
>>>>> > Thanks a lot.
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Joseph Echeverria
>>>>> Cloudera, Inc.
>>>>> 443.305.9434
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Roger Chen
>>> UC Davis Genome Center
>>
>>
>
