hadoop-common-user mailing list archives

From Adrian CAPDEFIER <chivas314...@gmail.com>
Subject Re: chaining (the output of) jobs/ reducers
Date Tue, 17 Sep 2013 13:23:40 GMT
I've just seen your email, Vinod. This is the behaviour that I'd expect and
similar to other data integration tools; I will keep an eye out for it as a
long term option.

On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli <vinodkv@apache.org> wrote:

> Other than the short term solutions that others have proposed, Apache Tez
> solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers
> and reducers, and your own custom processors - all without persisting the
> intermediate outputs to HDFS.
> It works on top of YARN, though the first release of Tez is yet to happen.
> You can learn more about it here: http://tez.incubator.apache.org/
> HTH,
> +Vinod
> On Sep 12, 2013, at 6:36 AM, Adrian CAPDEFIER wrote:
> Howdy,
> My application requires 2 distinct processing steps (reducers) to be
> performed on the input data. The first operation changes the key values,
> so records that had different keys in step 1 can end up having the same
> key in step 2.
> The heavy lifting of the operation is in step 1; step 2 only combines
> records whose keys were changed.
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
> To implement this in Hadoop, it seems that I need to create a separate job
> for each step.
> I assumed there would be some sort of job management in Hadoop to link
> Job 1 and Job 2, but the only thing I could find was related to job
> scheduling, with nothing on how to synchronize the input/output of the
> linked jobs.
> The only crude solution that I can think of is to use a temporary file
> under HDFS, but even so I'm not sure if this will work.
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
> (key2, value3)] => output.
> Is there a better way to pass the output from Job A as input to Job B
> (e.g. using network streams or some built-in Java classes that don't do
> disk I/O)?
> The temporary file solution will work in a single node configuration, but
> I'm not sure about an MPP config.
> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
> both jobs run on all 4 nodes - will HDFS be able to automagically
> redistribute the records between nodes, or does this need to be coded
> somehow?
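
On the last question: as far as MapReduce semantics go, it is the shuffle
between a job's map and reduce phases, not HDFS itself, that brings records
with equal keys together, so Job B regroups the records from the temporary
file whichever nodes wrote them. The sketch below imitates the two steps in
plain Java, without any Hadoop classes, only to show that regrouping. The
class name, the sample records, and the re-keying rule (keep the part of the
key before the first '-') are assumptions made up for illustration, and
summing stands in for whatever the real reducers do.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free imitation of "Job A => temporary file => Job B".
class TwoStepPipeline {

    // Step 1 (stand-in for Job A's reducer): group by the original key,
    // sum the values, then emit each result under a new key.
    // Assumed re-keying rule: keep only the text before the first '-'.
    static List<Map.Entry<String, Integer>> stepOne(List<Map.Entry<String, Integer>> input) {
        Map<String, Integer> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> rec : input) {
            grouped.merge(rec.getKey(), rec.getValue(), Integer::sum);
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : grouped.entrySet()) {
            String key2 = e.getKey().split("-")[0];
            out.add(new SimpleEntry<>(key2, e.getValue()));
        }
        return out; // plays the role of the temporary file
    }

    // Step 2 (stand-in for Job B): regroup by the new key, which is what
    // the second job's shuffle would do, and combine the values.
    static Map<String, Integer> stepTwo(List<Map.Entry<String, Integer>> intermediate) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> rec : intermediate) {
            combined.merge(rec.getKey(), rec.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input = new ArrayList<>();
        input.add(new SimpleEntry<>("a-1", 1));
        input.add(new SimpleEntry<>("a-1", 2));
        input.add(new SimpleEntry<>("a-2", 4));
        input.add(new SimpleEntry<>("b-1", 5));
        // "a-1" and "a-2" are distinct in step 1 but share key "a" in step 2.
        System.out.println(stepTwo(stepOne(input))); // prints {a=7, b=5}
    }
}
```

In a real cluster the list returned by stepOne would be Job A's output
directory on HDFS, which Job B would then read as its input path.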
