hadoop-user mailing list archives

From Venkata K Pisupat <krishna.pisu...@gmail.com>
Subject Re: chaining (the output of) jobs/ reducers
Date Thu, 12 Sep 2013 20:07:32 GMT
Cascading would be a good option if you have a complex flow. However, in your case you are
only chaining two jobs, so I would suggest you follow these steps.

1. Set the output directory of Job1 as the input directory for Job2.
2. Launch Job1 using the new API. In the launcher program, use the Job class instead of JobConf and
JobClient to run the job, and invoke Job.waitForCompletion(true) on Job1. This blocks the program
until Job1 has finished.
3. Optionally, you can combine the individual output files generated by each reducer (if you
have more than 1 reducer task) into one or more files. 
4. Finally, launch Job2 in the same way.

The output of Job1 is written to HDFS, and therefore you will not have any issues when Job2
reads its input (Job1's output).
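
A rough sketch of such a launcher with the new (org.apache.hadoop.mapreduce) API; the driver
class name, the paths, and the Step1*/Step2* mapper and reducer classes are placeholders for
whatever classes you already have, not anything prescribed in this thread:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path input = new Path(args[0]);
      Path intermediate = new Path(args[1]);   // Job1's output directory == Job2's input directory
      Path output = new Path(args[2]);

      // Job1 does the heavy lifting and re-keys the records.
      Job job1 = Job.getInstance(conf, "step 1");   // on older releases: new Job(conf, "step 1")
      job1.setJarByClass(ChainedJobsDriver.class);
      job1.setMapperClass(Step1Mapper.class);       // placeholder
      job1.setReducerClass(Step1Reducer.class);     // placeholder
      job1.setOutputKeyClass(Text.class);
      job1.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job1, input);
      FileOutputFormat.setOutputPath(job1, intermediate);

      // Block until Job1 completes; stop if it fails.
      if (!job1.waitForCompletion(true)) {
        System.exit(1);
      }

      // Job2 combines the records that now share a key.
      Job job2 = Job.getInstance(conf, "step 2");
      job2.setJarByClass(ChainedJobsDriver.class);
      job2.setMapperClass(Step2Mapper.class);       // placeholder
      job2.setReducerClass(Step2Reducer.class);     // placeholder
      job2.setOutputKeyClass(Text.class);
      job2.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job2, intermediate);
      FileOutputFormat.setOutputPath(job2, output);

      System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
  }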





On Sep 12, 2013, at 12:02 PM, Adrian CAPDEFIER <chivas314159@gmail.com> wrote:

> Thanks Bryan.
> 
> Yes, I am using hadoop + hdfs.
> 
> If I understand your point, hadoop tries to start the mapping processes on nodes where
the data is local and if that's not possible, then it is hdfs that replicates the data to
the mapper nodes? 
> 
> I expected to have to set this up in the code and I completely ignored HDFS; I guess
it's a case of not seeing the forest for the trees!
> 
> 
> On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:
> It really comes down to the following:
> 
> In Job A set mapred.output.dir to some directory X.
> In Job B set mapred.input.dir to the same directory X.
> 
> For Job A, do context.write() as usual, and each reducer will create an output file
in mapred.output.dir.  Then in Job B each of those files will be read by a mapper.
> 
> Of course you need to make sure your input and output formats, as well as input and output
keys/values, match up between the two jobs as well.
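
For illustration, one way to make the two jobs line up is to have Job A write a SequenceFile
and have Job B read it back with the matching input format. The <Text, IntWritable> types and
the /tmp path below are assumptions, not details from this thread; jobA and jobB are the two
org.apache.hadoop.mapreduce.Job instances in the driver, and the SequenceFile*Format and
FileInputFormat/FileOutputFormat classes come from org.apache.hadoop.mapreduce.lib.{input,output}:

  Path x = new Path("/tmp/jobA-output");                     // the shared directory "X"

  // Job A: reducer output becomes a SequenceFile of <Text, IntWritable>.
  jobA.setOutputFormatClass(SequenceFileOutputFormat.class);
  jobA.setOutputKeyClass(Text.class);
  jobA.setOutputValueClass(IntWritable.class);
  FileOutputFormat.setOutputPath(jobA, x);

  // Job B: read the same directory with the matching input format;
  // its mapper must then be declared as Mapper<Text, IntWritable, ...>.
  jobB.setInputFormatClass(SequenceFileInputFormat.class);
  FileInputFormat.addInputPath(jobB, x);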
> 
> If you are using HDFS, which it seems you are, the directories specified can be HDFS
directories.  In that case, with a replication factor of 3, each of these output files will
exist on 3 nodes.  Hadoop and HDFS will do the work to ensure that the mappers in the second
job do as good a job as possible to be data or rack-local.
> 
> 
> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <chivas314159@gmail.com> wrote:
> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd prefer to keep
everything as close to the Hadoop libraries as possible.
> 
> I am sure I am overlooking something basic as repartitioning is a fairly common operation
in MPP environments.
> 
> 
> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <curtin.chris@gmail.com> wrote:
> If you want to stay in Java, look at Cascading. Pig is also helpful. I think there are
others (Spring Integration, maybe?) but I'm not familiar enough with them to make a recommendation.
> 
> Note that with Cascading and Pig you don't write 'map reduce'; you write logic and they
map it to the various mapper/reducer steps automatically.
> 
> Hope this helps,
> 
> Chris
> 
> 
> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <chivas314159@gmail.com> wrote:
> Howdy,
> 
> My application requires 2 distinct processing steps (reducers) to be performed on the
input data. The first operation changes the key values, and records that had different
keys in step 1 can end up having the same key in step 2.
> 
> The heavy lifting of the operation is in step 1; step 2 only combines records whose
keys were changed.
> 
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
> 
> 
> To implement this in hadoop, it seems that I need to create a separate job for each step.

> 
> Now, I assumed there would be some sort of job management under hadoop to link Job 1 and
Job 2, but the only thing I could find was related to job scheduling and nothing on how to synchronize
the input/output of the linked jobs.
> 
> 
> 
> The only crude solution that I can think of is to use a temporary file under HDFS, but
even so I'm not sure if this will work.
> 
> The overview of the process would be:
> Sequential Input (lines) => Job A [Mapper (key1, value1) => ChainReducer (key2, value2)]
> => Temporary file => Job B [Mapper (key2, value2) => Reducer (key2, value3)] => output.
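
For illustration, a minimal sketch of managing that temporary HDFS directory from the driver,
assuming a made-up /tmp path; the recursive FileSystem.delete() at the end is the usual cleanup
once Job B has finished:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class IntermediateDirCleanup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Hypothetical intermediate directory: Job A writes here, Job B reads from here.
      Path tmp = new Path("/tmp/jobA-to-jobB");

      // ... run Job A with tmp as its output path, then Job B with tmp as its input path ...

      // Once Job B has succeeded, remove the intermediate data (recursive delete).
      if (fs.exists(tmp)) {
        fs.delete(tmp, true);
      }
    }
  }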
> 
> Is there a better way to pass the output from Job A as input to Job B (e.g. using network
streams or some built-in Java classes that don't do disk I/O)?
> 
> 
> 
> The temporary file solution will work in a single node configuration, but I'm not sure
about an MPP config.
> 
> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or both jobs run
on all 4 nodes - will HDFS be able to redistribute the records between nodes automagically,
or does this need to be coded somehow?
> 
> 
> 
> 

