hadoop-hdfs-user mailing list archives

From Chris Curtin <curtin.ch...@gmail.com>
Subject Re: chaining (the output of) jobs/ reducers
Date Thu, 12 Sep 2013 13:39:26 GMT
If you want to stay in Java, look at Cascading. Pig is also helpful. I think
there are others (Spring integration maybe?), but I'm not familiar enough
with them to make a recommendation.

Note that with Cascading and Pig you don't write 'map reduce' code directly;
you write your logic and they map it to the various mapper/reducer steps
automatically.
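
As a rough illustration of the logic in the question below, independent of any framework, here is a minimal plain-Java sketch with no Hadoop, Cascading or Pig involved. The class name, the re-keying rule (strip trailing digits from the key) and the integer sum are hypothetical stand-ins chosen only to show how records with different keys in step 1 can share a key in step 2; the in-memory list stands in for whatever sits between the two jobs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: two processing steps chained in memory.
final class TwoStepSketch {

    // Step 1 (stands in for Job A): parse "key value" and change the key.
    // Assumed rule, for illustration only: drop trailing digits from the key.
    static Map.Entry<String, Integer> step1(String line) {
        String[] parts = line.trim().split("\\s+");
        String newKey = parts[0].replaceAll("\\d+$", "");
        return Map.entry(newKey, Integer.parseInt(parts[1]));
    }

    // Step 2 (stands in for Job B): combine records that now share a key.
    static Map<String, Integer> step2(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : records) {
            out.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return out;
    }

    // Input -> Step 1 -> intermediate records -> Step 2 -> output.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.add(step1(line));
        }
        return step2(intermediate);
    }

    public static void main(String[] args) {
        // "a1" and "a2" have different keys going in, the same key after step 1.
        System.out.println(run(List.of("a1 3", "a2 4", "b1 5"))); // prints {a=7, b=5}
    }
}
```

With Cascading or Pig you would describe these two steps and leave it to the tool to plan the underlying mapper/reducer stages.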

Hope this helps,


On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <chivas314159@gmail.com> wrote:

> Howdy,
> My application requires 2 distinct processing steps (reducers) to be
> performed on the input data. The first operation changes the key
> values, so records that had different keys in step 1 can end up having the
> same key in step 2.
> The heavy lifting of the operation is in step 1, and step 2 only combines
> records where keys were changed.
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
> To implement this in hadoop, it seems that I need to create a separate job
> for each step.
> Now I assumed there would be some sort of job management under hadoop to
> link Job 1 and 2, but the only thing I could find was related to job
> scheduling and nothing on how to synchronize the input/output of the linked
> jobs.
> The only crude solution that I can think of is to use a temporary file
> under HDFS, but even so I'm not sure if this will work.
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
> (key2, value3)] => output.
> Is there a better way to pass the output from Job A as input to Job B
> (e.g. using network streams or some built in java classes that don't do
> disk i/o)?
> The temporary file solution will work in a single node configuration, but
> I'm not sure about an MPP config.
> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or
> both jobs run on all 4 nodes - will HDFS be able to redistribute
> automagically the records between nodes or does this need to be coded
> somehow?
