hadoop-common-user mailing list archives

From Aaron Baff <Aaron.B...@telescope.tv>
Subject RE: Running multiple MR Job's in sequence
Date Thu, 29 Sep 2011 17:53:25 GMT
Yea, we don't want it to sit there waiting for the Job to complete, even if it's just a few

-----Original Message-----
From: turbocodr@gmail.com [mailto:turbocodr@gmail.com] On Behalf Of John Conwell
Sent: Thursday, September 29, 2011 10:50 AM
To: common-user@hadoop.apache.org
Subject: Re: Running multiple MR Job's in sequence

After you kick off a job, say JobA, your client doesn't need to sit and ping
Hadoop to see if it finished before it starts JobB.  You can have the client
block until the job is complete with "Job.waitForCompletion(boolean
verbose)".  Using this you can create a "job driver" that chains jobs
together easily.
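A minimal sketch of such a driver (the job names, paths, and argument layout here are placeholders for illustration, not anything from this thread; this uses the 0.20-era `org.apache.hadoop.mapreduce.Job` API and needs a configured Hadoop client to actually run):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job jobA = new Job(conf, "jobA");
    // ... set mapper/reducer/output classes for jobA here ...
    FileInputFormat.addInputPath(jobA, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobA, new Path(args[1]));

    // Blocks until jobA finishes; true = print progress to the console.
    if (!jobA.waitForCompletion(true)) {
      System.exit(1); // don't start jobB if jobA failed
    }

    Job jobB = new Job(conf, "jobB");
    // ... set mapper/reducer/output classes for jobB here ...
    // jobB consumes jobA's output directory as its input.
    FileInputFormat.addInputPath(jobB, new Path(args[1]));
    FileOutputFormat.setOutputPath(jobB, new Path(args[2]));

    System.exit(jobB.waitForCompletion(true) ? 0 : 1);
  }
}
```

The catch, as noted below, is that the chaining logic lives in this client process, so the process must stay alive for the whole sequence.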

Now, if your job takes 2 weeks to run, you can't kill your driver process.
If you do, JobA will finish running, but JobB will never start.
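The JobControl route Harsh mentions below looks roughly like this (a sketch only; the group name and conf variables are placeholders, and note that JobControl also runs client-side, in its own thread, so it doesn't give you fire-and-forget either):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

// confA and confB are fully configured JobConf objects
// (mapper, reducer, input/output paths set as usual).
Job jobA = new Job(confA);
Job jobB = new Job(confB);
jobB.addDependingJob(jobA); // jobB is only submitted after jobA succeeds

JobControl control = new JobControl("jobA-then-jobB");
control.addJob(jobA);
control.addJob(jobB);

// JobControl implements Runnable and submits jobs as their
// dependencies complete, so it runs in its own thread.
Thread runner = new Thread(control);
runner.start();
while (!control.allFinished()) {
  Thread.sleep(1000);
}
control.stop();
```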


On Thu, Sep 29, 2011 at 9:51 AM, Aaron Baff <Aaron.Baff@telescope.tv> wrote:

> I saw this, but wasn't sure if it was something that ran on the client and
> just submitted the Job's in sequence, or if that gave it all to the
> JobTracker, and the JobTracker took care of submitting the Jobs in sequence
> appropriately.
> Basically, I'm looking for a completely stateless client, that doesn't need
> to ping the JobTracker every now and then to see if a Job has completed, and
> then submit the next one. The ideal flow would be the client gets in a
> request to run the series of Jobs, it preps them all, gets them all
> configured, and then passes them off to the JobTracker which runs them all
> in order without the client application needing to do anything further.
> Sounds like that doesn't really exist as part of Hadoop framework, and
> needs something like Oozie (or a home-built system) to do this.
> --Aaron
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Wednesday, September 28, 2011 9:37 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Running multiple MR Job's in sequence
> Within the Hadoop core project, there is JobControl you can utilize
> for this. You can view its API at
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/jobcontrol/package-summary.html
> and it is fairly simple to use (Create jobs in regular java API, build
> a dependency flow using JobControl atop these jobconf objects).
> Apache Oozie and other such tools offer higher abstractions on
> controlling a workflow, and can be considered when your needs get
> a bit more complex than a simple series (easier handling of failure
> scenarios between dependent jobs, minor fs operations in pre/post
> processing, etc.).
> On Thu, Sep 29, 2011 at 5:26 AM, Aaron Baff <Aaron.Baff@telescope.tv>
> wrote:
> > Is it possible to submit a series of MR Jobs to the JobTracker to run in
> sequence (one finishes, take the output of that if successful and feed it
> into the next, etc), or does it need to run client side by using the
> JobControl or something like Oozie, or rolling our own? What I'm looking for
> is a fire & forget, and occasionally check back to see if it's done. So
> client-side doesn't need to really know anything or keep track of anything.
> Does something like that exist within the Hadoop framework?
> >
> > --Aaron
> >
> --
> Harsh J


John C
