hadoop-common-user mailing list archives

From Chris K Wensel <ch...@wensel.net>
Subject Re: how to run jobs every 30 minutes?
Date Wed, 15 Dec 2010 00:02:38 GMT

I see it this way.

You can glue together a bunch of discrete command-line apps, which may or may not have dependencies
between one another, using a new syntax. Which is darn nice if you already have a bunch of discrete,
ready-to-run command-line apps sitting around that need to be strung together and that can't
be used as libraries and instantiated through their APIs.

Or, you can string all your work together through the APIs with a Turing-complete language
and run it all from a single command line interface (and hand that to cron, or some other
scheduler).

In this case you can use Java, or easier languages like JRuby, Groovy, Jython, Clojure, etc.,
which were designed for this purpose. (They don't run on the cluster, they only run Hadoop
client side.)
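
For the original every-30-minutes question, the client-side driver can be as small as this. A minimal, JDK-only sketch; the class name and the workflow body are made-up placeholders for code that would submit Hadoop jobs through their client APIs:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class RecurringDriver {

    // Re-run the given workflow at a fixed period, entirely client side.
    // The workflow would submit Hadoop jobs through their APIs.
    static ScheduledFuture<?> schedule(ScheduledExecutorService scheduler,
                                       Runnable workflow,
                                       long period, TimeUnit unit) {
        return scheduler.scheduleAtFixedRate(workflow, 0, period, unit);
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // A real deployment would use 30 and TimeUnit.MINUTES; a short period
        // here just shows the mechanics before shutting down.
        ScheduledFuture<?> handle =
            schedule(scheduler, () -> System.out.println("submitting jobs..."),
                     100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        handle.cancel(false);
        scheduler.shutdown();
    }
}
```

The simpler alternative, as above, is a driver that runs the workflow once per invocation and a crontab line like `*/30 * * * * hadoop jar my-driver.jar` (jar name hypothetical) to handle the recurrence.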

Think Ant vs. Gradle (or any other build tool that uses a scripting language and not a configuration
file) if you want a concrete example.

Cascading itself is a query API (and query planner). But it also gives the user the ability
to run discrete 'processes' in dependency order: either Cascading (Hadoop) Flows or
Riffle-annotated process objects. They can all be intermingled and managed by the same dependency
scheduler (Cascading has one, and so does Riffle).

So you can run Flow -> Mahout -> Pig -> Mahout -> Flow -> shell -> whattheheckever,
all from the same application.
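
The dependency-order idea can be sketched with nothing but the JDK. This illustrates what such a scheduler does, not Riffle's or Cascading's actual API; all names are made up, every process is assumed to have a body, and the graph is assumed acyclic:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DependencyRunner {

    // A "process" here is just a name plus a Runnable body; real ones would be
    // Cascading Flows, Pig scripts, Mahout jobs, shell steps, and so on.
    // Returns the order in which processes actually ran.
    static List<String> runInOrder(Map<String, List<String>> deps,
                                   Map<String, Runnable> bodies) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        for (String p : bodies.keySet()) {
            visit(p, deps, bodies, done, order);
        }
        return order;
    }

    private static void visit(String p, Map<String, List<String>> deps,
                              Map<String, Runnable> bodies,
                              Set<String> done, List<String> order) {
        if (!done.add(p)) return;                       // already ran (or running)
        for (String dep : deps.getOrDefault(p, List.of())) {
            visit(dep, deps, bodies, done, order);      // run dependencies first
        }
        bodies.get(p).run();
        order.add(p);
    }
}
```

Each process runs exactly once, after everything it depends on, which is all a dependency scheduler fundamentally promises.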

Cascading also has the ability to run only 'stale' processes. Think 'make' files. When re-running
a job where only one file of many has changed, this is a big win.
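
The 'make'-style staleness test boils down to comparing timestamps. A minimal sketch under that assumption (the helper is made up; Cascading's actual check works against its own source/sink abstractions rather than raw files):

```java
import java.io.File;

public class StaleCheck {

    // make-style rule: a step is stale (needs re-running) if its output is
    // missing or older than any of its inputs.
    static boolean isStale(File output, File... inputs) {
        if (!output.exists()) return true;
        for (File in : inputs) {
            if (in.lastModified() > output.lastModified()) return true;
        }
        return false;
    }
}
```

A scheduler that applies this check before each process skips everything whose outputs are already up to date, which is exactly the one-changed-file-of-many win described above.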

I personally like parameterizing my applications via the command line and letting my CLI options
drive the workflows. For example, my testing, integration, and production environments are much
different, so it's very easy to drive specific runs of the jobs by changing a CLI arg. (args4j
makes this darn simple.)
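
A minimal, JDK-only sketch of that environment-switching idea (the flag name, paths, and helpers are all made up for illustration; args4j would replace the hand-rolled parsing with an annotated field, e.g. an `@Option`-annotated `String env`):

```java
public class EnvConfig {

    // Hypothetical per-environment settings; real ones might be JDBC URLs,
    // HDFS paths, parallelism, and so on.
    static String inputPathFor(String env) {
        switch (env) {
            case "test":        return "file:///tmp/test-input";
            case "integration": return "hdfs:///int/input";
            case "production":  return "hdfs:///prod/input";
            default: throw new IllegalArgumentException("unknown env: " + env);
        }
    }

    // Hand-rolled stand-in for what args4j does declaratively:
    // pick the environment off the command line, defaulting to "test".
    static String parseEnv(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("--env")) return args[i + 1];
        }
        return "test";
    }

    public static void main(String[] args) {
        String env = parseEnv(args);
        System.out.println("running against " + inputPathFor(env));
    }
}
```

One flag flips the whole run from test data to production data, which is the point: the workflow code stays identical across environments.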

If I am chaining multiple CLI apps into a bigger production app, I suspect parameterizing that
will be error prone, especially if the input/output data points (JDBC vs. file) differ between
environments.

You can find Riffle here: https://github.com/cwensel/riffle (it's Apache licensed, contributions
welcome).


On Dec 14, 2010, at 1:30 AM, Alejandro Abdelnur wrote:

> Ed,
> Actually Oozie is quite different from Cascading.
> * Cascading allows you to write 'queries' using a Java API and they get
> translated into MR jobs.
> * Oozie allows you compose sequences of MR/Pig/Hive/Java/SSH jobs in a DAG
> (workflow jobs) and has timer+data dependency triggers (coordinator jobs).
> Regards.
> Alejandro
> On Tue, Dec 14, 2010 at 1:26 PM, edward choi <mp2893@gmail.com> wrote:
>> Thanks for the tip. I took a look at it.
>> Looks similar to Cascading I guess...?
>> Anyway thanks for the info!!
>> Ed
>> 2010/12/8 Alejandro Abdelnur <tucu@cloudera.com>
>>> Or, if you want to do it in a reliable way you could use an Oozie
>>> coordinator job.
>>> On Wed, Dec 8, 2010 at 1:53 PM, edward choi <mp2893@gmail.com> wrote:
>>>> My mistake. Come to think about it, you are right, I can just make an
>>>> infinite loop inside the Hadoop application.
>>>> Thanks for the reply.
>>>> 2010/12/7 Harsh J <qwertymaniac@gmail.com>
>>>>> Hi,
>>>>> On Tue, Dec 7, 2010 at 2:25 PM, edward choi <mp2893@gmail.com>
>>>>>> Hi,
>>>>>> I'm planning to crawl a certain web site every 30 minutes.
>>>>>> How would I get it done in Hadoop?
>>>>>> In pure Java, I used Thread.sleep() method, but I guess this won't
>>> work
>>>>> in
>>>>>> Hadoop.
>>>>> Why wouldn't it? You need to manage your post-job logic mostly, but
>>>>> sleep and resubmission should work just fine.
>>>>>> Or if it could work, could anyone show me an example?
>>>>>> Ed.
>>>>> --
>>>>> Harsh J
>>>>> www.harshj.com

Chris K Wensel

-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading
