hadoop-user mailing list archives

From Bryan Beaudreault <bbeaudrea...@hubspot.com>
Subject Re: chaining (the output of) jobs/ reducers
Date Thu, 12 Sep 2013 17:38:08 GMT
It really comes down to the following:

In Job A set mapred.output.dir to some directory X.
In Job B set mapred.input.dir to the same directory X.

For Job A, call context.write() as normal, and each reducer will create an
output file in mapred.output.dir.  Then in Job B each of those files will
be read by at least one mapper.
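
The handoff through directory X can be sketched locally without a cluster.
This is only a stand-in that uses java.nio.file, not the Hadoop API: the class
name, the method names, the "first letter" re-keying rule and the sample
records are all hypothetical. It mimics the part-r-NNNNN files that reducers
write, and treats every file found in X as the input of one simulated mapper.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Local stand-in for two chained jobs; this is NOT the Hadoop API. */
class ChainedJobsSketch {

    /** "Job A": each simulated reducer writes one part file into directory X. */
    static void runJobA(List<List<String>> recordsPerReducer, Path dirX) throws IOException {
        Files.createDirectories(dirX);
        for (int r = 0; r < recordsPerReducer.size(); r++) {
            List<String> lines = new ArrayList<>();
            for (String record : recordsPerReducer.get(r)) {
                // Records are "key<TAB>value". Step 1 changes the key; the
                // first-letter rule is a made-up example of such a change.
                String[] kv = record.split("\t", 2);
                lines.add(kv[0].substring(0, 1) + "\t" + kv[1]);
            }
            Path part = dirX.resolve(String.format("part-r-%05d", r));
            Files.write(part, lines, StandardCharsets.UTF_8);
        }
    }

    /** "Job B": reads every part file in X and sums the values per new key. */
    static Map<String, Integer> runJobB(Path dirX) throws IOException {
        Map<String, Integer> totals = new TreeMap<>();
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(dirX, "part-r-*")) {
            for (Path part : parts) {
                // Each part file plays the role of one mapper's input.
                for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                    String[] kv = line.split("\t", 2);
                    totals.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        Path dirX = Files.createTempDirectory("job-a-output");
        runJobA(Arrays.asList(
                Arrays.asList("apple\t1", "avocado\t2"),
                Arrays.asList("banana\t3", "apricot\t4")), dirX);
        System.out.println(runJobB(dirX)); // prints {a=7, b=3}
    }
}
```

Note that "apple", "avocado" and "apricot" start out with different keys, and
that "apricot" is written by a different simulated reducer than the other two.
All three still end up combined under the key "a" in the second step, because
Job B reads every file in X.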

Of course you need to make sure your input and output formats, as well as
input and output keys/values, match up between the two jobs as well.
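
The "match up" requirement can be illustrated with plain Java generics. This
is a sketch under assumptions, not Hadoop code: Stage and chain are
hypothetical names, and each stage stands in for a whole job. The point is
only that the key/value types Job A emits have to be the types Job B consumes.
Here the compiler enforces that; in a real job a mismatch tends to surface
only at run time.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;
import java.util.function.Function;

class TypedChainSketch {

    /** One (key, value) record in, one record out; a hypothetical stand-in for a job. */
    interface Stage<KI, VI, KO, VO> extends Function<Entry<KI, VI>, Entry<KO, VO>> {}

    /** Compiles only if jobA's output key/value types equal jobB's input types. */
    static <K1, V1, K2, V2, K3, V3> Stage<K1, V1, K3, V3> chain(
            Stage<K1, V1, K2, V2> jobA, Stage<K2, V2, K3, V3> jobB) {
        return record -> jobB.apply(jobA.apply(record));
    }

    public static void main(String[] args) {
        // Job A: (String, String) -> (String, Integer)
        Stage<String, String, String, Integer> jobA =
                r -> new SimpleEntry<String, Integer>(r.getKey().toLowerCase(), r.getValue().length());
        // Job B: (String, Integer) -> (String, Integer); its input matches Job A's output
        Stage<String, Integer, String, Integer> jobB =
                r -> new SimpleEntry<String, Integer>(r.getKey(), r.getValue() * 2);

        Entry<String, Integer> out =
                chain(jobA, jobB).apply(new SimpleEntry<String, String>("Key", "abc"));
        System.out.println(out.getKey() + " -> " + out.getValue()); // prints key -> 6
    }
}
```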

If you are using HDFS, which it seems you are, the directories specified
can be HDFS directories.  In that case, with a replication factor of 3,
each of these output files will exist on 3 nodes.  Hadoop and HDFS will
then try to schedule the mappers of the second job so that they are
data-local or, failing that, rack-local.

On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER wrote:

> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
> prefer to keep, if possible, everything as close to the hadoop libraries.
> I am sure I am overlooking something basic as repartitioning is a fairly
> common operation in MPP environments.
> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <curtin.chris@gmail.com> wrote:
>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>> think there are others (Spring integration maybe?) but I'm not familiar
>> enough with them to make a recommendation.
>> Note that with Cascading and Pig you don't write 'map reduce' you write
>> logic and they map it to the various mapper/reducer steps automatically.
>> Hope this helps,
>> Chris
>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <chivas314159@gmail.com> wrote:
>>> Howdy,
>>> My application requires 2 distinct processing steps (reducers) to be
>>> performed on the input data. The first operation changes the key
>>> values, and records that had different keys in step 1 can end up having
>>> the same key in step 2.
>>> The heavy lifting of the operation is in step1 and step2 only combines
>>> records where keys were changed.
>>> In short the overview is:
>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>> To implement this in hadoop, it seems that I need to create a separate
>>> job for each step.
>>> Now I assumed there would be some sort of job management under hadoop to
>>> link Job 1 and 2, but the only thing I could find was related to job
>>> scheduling and nothing on how to synchronize the input/output of the linked
>>> jobs.
>>> The only crude solution that I can think of is to use a temporary file
>>> under HDFS, but even so I'm not sure if this will work.
>>> The overview of the process would be:
>>> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
>>> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) =>
>>> (key2, value3)] => output.
>>> Is there a better way to pass the output from Job A as input to Job B
>>> (e.g. using network streams or some built in java classes that don't do
>>> disk i/o)?
>>> The temporary file solution will work in a single node configuration,
>>> but I'm not sure about an MPP config.
>>> Let's say Job A runs on nodes 0 and 1 and job B runs on nodes 2 and 3 or
>>> both jobs run on all 4 nodes - will HDFS be able to redistribute
>>> automagically the records between nodes or does this need to be coded
>>> somehow?
