hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: job taking input file, which "is being" written by its preceding job's map phase
Date Thu, 09 Feb 2012 07:15:55 GMT
Vamshi,

What problem exactly are you trying to solve by attempting this? If
you are only interested in records being streamed from one mapper into
another, why can't they be chained together? Remember that map-only
jobs do not sort their output -- so I still see no benefit in
consuming record-by-record from a whole new task when it could be done
within the very same one.

Btw, ChainMapper is an API abstraction to run several mapper
implementations in sequence (chain) for each record input and
transform them all along (helpful if you have several utility mappers
and want to build composites). It does not touch disk.
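To make the distinction concrete, here is a minimal conceptual sketch (plain Python, not the actual Hadoop API) of what ChainMapper does: each input record is pushed through every mapper function in sequence, entirely in memory, with no intermediate file between stages. The function and mapper names are illustrative, not part of Hadoop.

```python
def chain_mappers(mappers, records):
    """Feed each input record through every mapper in order.

    Each mapper takes one record and yields zero or more records,
    mirroring how a Hadoop Mapper may emit any number of outputs
    per input. Nothing is written to disk between stages.
    """
    for record in records:
        batch = [record]
        for mapper in mappers:
            # Apply this stage to every record the previous stage emitted.
            batch = [out for rec in batch for out in mapper(rec)]
        for out in batch:
            yield out

# Two toy "utility mappers": split a line into tokens, then lowercase each.
def tokenize(line):
    yield from line.split()

def lowercase(token):
    yield token.lower()

result = list(chain_mappers([tokenize, lowercase], ["Hello World", "Foo"]))
# result == ["hello", "world", "foo"]
```

The second mapper starts consuming each record as soon as the first emits it, within the same task -- which is the in-process "pipelining" effect, without a second job reading a half-written file.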

On Thu, Feb 9, 2012 at 12:15 PM, Vamshi Krishna <vamshi2105@gmail.com> wrote:
> Thank you Harsh for your reply. What ChainMapper does is: only once the
> first mapper finishes does the second map start, using the file written by
> the first mapper. It is just a chain. But what I want is pipelining, i.e.
> after the first map starts and before it finishes, the second map has to
> start and keep on reading from the same file that is being written by the
> first map. It is almost a producer-consumer scenario, where the first map
> writes into the file and the second map keeps reading the same file, so
> that a pipelining effect is seen between the two maps.
> Hope you got what I am trying to tell.
>
> please help..
>
>
> On Wed, Feb 8, 2012 at 12:47 PM, Harsh J <harsh@cloudera.com> wrote:
>>
>> Vamsi,
>>
>> Is it not possible to express your M-M-R phase chain as a simple, single
>> M-R?
>>
>> Perhaps look at the ChainMapper class @
>>
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/ChainMapper.html
>>
>> On Wed, Feb 8, 2012 at 12:28 PM, Vamshi Krishna <vamshi2105@gmail.com>
>> wrote:
>> > Hi all,
>> > I have an important question about MapReduce.
>> > I have 2 Hadoop MapReduce jobs. Job1 has only a mapper but no reducer.
>> > Job1 starts, and in its map() it writes to a "file1" using
>> > context.write(Arg1, Arg2). I want to start job2 (immediately after
>> > job1), which should take "file1" (output still being written by the
>> > above job's map phase) as input and do processing in its own map/reduce
>> > phases, and job2 should keep on taking the newly written data from
>> > "file1" until job1 finishes. What should I do?
>> >
>> > How can I do that? Please, can anybody help?
>> >
>> > --
>> > Regards
>> >
>> > Vamshi Krishna
>> >
>>
>>
>>
>> --
>> Harsh J
>> Customer Ops. Engineer
>> Cloudera | http://tiny.cloudera.com/about
>
>
>
>
> --
> Regards
>
> Vamshi Krishna
>



-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
