hadoop-common-user mailing list archives

From James Kennedy <james.kenn...@troove.net>
Subject Re: Examples of chained MapReduce?
Date Mon, 25 Jun 2007 16:30:21 GMT
Ok guys, thank you for your responses.
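
For the archives: the single-driver approach that came up earlier in the
thread can be as small as running the jobs back to back. A minimal sketch
against the 0.1x mapred API (IdentityMapper/IdentityReducer and the paths
are just placeholders for real job logic):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    // Pass 1: reads the raw input, writes to an intermediate directory.
    JobConf first = new JobConf(ChainDriver.class);
    first.setJobName("pass-1");
    first.setMapperClass(IdentityMapper.class);    // stand-in for real logic
    first.setReducerClass(IdentityReducer.class);  // stand-in for real logic
    first.setInputPath(new Path("input"));         // placeholder path
    first.setOutputPath(new Path("intermediate")); // placeholder path
    JobClient.runJob(first); // blocks until done, throws IOException on failure

    // Pass 2: only reached if pass 1 succeeded; consumes its output.
    JobConf second = new JobConf(ChainDriver.class);
    second.setJobName("pass-2");
    second.setMapperClass(IdentityMapper.class);
    second.setReducerClass(IdentityReducer.class);
    second.setInputPath(new Path("intermediate"));
    second.setOutputPath(new Path("output"));
    JobClient.runJob(second);
  }
}

The obvious limitation, which the replies below get into, is that the chain
only survives as long as the driver process does.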

Devaraj Das wrote:
> I haven't confirmed this, but I vaguely remember that the resource schedulers
> (Torque, Condor) provide a feature that lets one submit a DAG of jobs. The
> resource manager doesn't invoke a node in the DAG until all nodes pointing to
> it have finished successfully (or something like that), and the resource
> scheduler framework does the bookkeeping to take care of failed jobs, etc.
> In Hadoop there is a related effort, "Integration of Hadoop with batch
> schedulers":
> https://issues.apache.org/jira/browse/HADOOP-719
> I am not sure whether it handles the use case where one could submit a chain
> of jobs, but I think it potentially can.
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com] 
> Sent: Sunday, June 24, 2007 6:10 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Examples of chained MapReduce?
> We are still evaluating Hadoop for use in our main-line analysis systems,
> but we already have the problem of workflow scheduling.
> Our solution was to implement a simpler version of Amazon's Simple Queue
> Service. This allows us to run multiple redundant workers for some tasks,
> or to throttle a task down for others.
> The basic idea is that queues contain XML tasks. Tasks are read from the
> queue by workers, but are kept in a holding pen for a queue-specific time
> period after they are read. If the task completes normally, the worker
> deletes it; if the timeout expires before the worker completes the task, it
> is added back to the queue.
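
A toy rendering of those holding-pen semantics, with String bodies standing
in for the XML tasks (all names here are made up):

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

/** Toy queue with SQS-style holding-pen (lease) semantics. */
public class LeaseQueue {
  private final Queue<String> pending = new ArrayDeque<String>();
  private final Map<String, Long> leased = new HashMap<String, Long>(); // task -> lease expiry
  private final long leaseMillis; // the queue-specific holding-pen period

  public LeaseQueue(long leaseMillis) { this.leaseMillis = leaseMillis; }

  public synchronized void put(String task) { pending.add(task); }

  /** A worker takes a task; it stays invisible to others until its lease expires. */
  public synchronized String take() {
    requeueExpired();
    String task = pending.poll();
    if (task != null) leased.put(task, System.currentTimeMillis() + leaseMillis);
    return task; // null if nothing is available right now
  }

  /** Called by the worker on normal completion: the task is gone for good. */
  public synchronized void delete(String task) { leased.remove(task); }

  /** Tasks whose lease expired go back on the queue (the worker presumably died). */
  private void requeueExpired() {
    long now = System.currentTimeMillis();
    for (Iterator<Map.Entry<String, Long>> it = leased.entrySet().iterator(); it.hasNext();) {
      Map.Entry<String, Long> e = it.next();
      if (e.getValue() <= now) { pending.add(e.getKey()); it.remove(); }
    }
  }
}

The useful property is that a crashed worker needs no special handling: its
lease simply expires and the task becomes visible again.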
> Workers are structured as a triple of scripts executed by a manager process:
> a pre-condition that determines whether any work should be done at all
> (usually a check for available local disk space or CPU cycles), an item
> qualification (run against a particular item, in case the work is subject to
> resource reservation), and the worker script itself.
> Even this tiny little framework suffices for quite complex workflows and
> work constraints.  It would be very easy to schedule map-reduce tasks via a
> similar mechanism.
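
A sketch of that manager loop, reusing the LeaseQueue above; precondition.sh,
qualify.sh, and worker.sh are hypothetical script names:

/** Toy manager process driving the three-script worker triple. */
public class WorkerManager {
  public static void main(String[] args) throws Exception {
    LeaseQueue queue = new LeaseQueue(10 * 60 * 1000L); // 10-minute holding pen
    while (true) {
      // 1. Pre-condition: should this node do any work at all (disk, CPU, ...)?
      if (run("precondition.sh") != 0) { Thread.sleep(30000); continue; }
      String task = queue.take();
      if (task == null) { Thread.sleep(30000); continue; }
      // 2. Qualification: may this worker take this particular item?
      if (run("qualify.sh", task) != 0) continue; // lease expires, task requeues
      // 3. The actual work; delete from the queue only on success.
      if (run("worker.sh", task) == 0) queue.delete(task);
    }
  }

  private static int run(String... cmd) throws Exception {
    return new ProcessBuilder(cmd).inheritIO().start().waitFor();
  }
}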
> On 6/23/07 5:34 AM, "Andrzej Bialecki" <ab@getopt.org> wrote:
>> James Kennedy wrote:
>>> But back to my original question... Doug suggests that dependence on 
>>> a driver process is acceptable.  But has anyone needed true MapReduce 
>>> chaining or tried it successfully?  Or is it generally accepted that 
>>> a multi-MapReduce algorithm should always be driven by a single process?
>> I would argue that this functionality is outside the scope of Hadoop.
>> As far as I understand your question, you need orchestration, which
>> involves the ability to record the state of previously executed
>> map-reduce jobs and to start the next map-reduce jobs based on that
>> state, possibly a long time after the first job completes and from a
>> different process.
>> I'm frequently facing this problem, and so far I've been using a
>> poor-man's workflow system consisting of a bunch of cron jobs, shell
>> scripts, and simple marker files that record the current state of the
>> data. In a similar way you can implement advisory application-level
>> locking, using lock files.
>> Example: adding a new batch of pages to a Nutch index involves many
>> steps, starting with fetchlist generation, then fetching, parsing,
>> updating the db, extraction of link information, and indexing. Each of
>> these steps consists of one (or several) map-reduce jobs, and the input
>> to the next job depends on the output of the previous ones. What you
>> referred to in your previous email was a single-app driver for this
>> workflow, called Crawl. But I'm using slightly modified individual
>> tools, which on successful completion create marker files (e.g.
>> fetching.done). Other tools check for the existence of these files and
>> either perform their function or exit (e.g. if I try to run updatedb on
>> a segment that has been fetched but not yet parsed).
>> To summarize this long answer - I think this functionality belongs in
>> the application layer built on top of Hadoop, and IMHO we are better
>> off not implementing it in Hadoop proper.
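
A sketch of the marker-file gating Andrzej describes, using the Hadoop
FileSystem API; fetching.done is the marker named above, while parsing.done
and the job-submission stub are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Toy step runner: do a stage only if its input's marker exists and its own doesn't. */
public class MarkerGate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path segment = new Path(args[0]);                  // segment directory to process
    Path fetched = new Path(segment, "fetching.done"); // marker named in the message
    Path parsed  = new Path(segment, "parsing.done");  // hypothetical next marker

    if (!fs.exists(fetched)) {
      System.err.println("segment not fetched yet, exiting");
      return;
    }
    if (fs.exists(parsed)) {
      System.err.println("already parsed, nothing to do");
      return;
    }
    runParseJob(segment);     // stand-in for the real map-reduce step
    fs.createNewFile(parsed); // record completion for downstream tools
  }

  private static void runParseJob(Path segment) { /* submit the job here */ }
}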
