From: James Kennedy
Date: Mon, 25 Jun 2007 09:30:21 -0700
To: hadoop-user@lucene.apache.org
Subject: Re: Examples of chained MapReduce?

Ok guys, thank you for your responses.

Devaraj Das wrote:
> I haven't confirmed this, but I vaguely remember that the resource
> schedulers (Torque, Condor) provide a feature with which one can submit
> a DAG of jobs. The resource manager doesn't invoke a node in the DAG
> unless all the nodes pointing to it have successfully finished (or
> something like that), and the resource scheduler framework does the
> bookkeeping to take care of failed jobs, etc.
> In Hadoop there is an effort, "Integration of Hadoop with batch
> schedulers": https://issues.apache.org/jira/browse/HADOOP-719
> I am not sure whether it handles the use case where one submits a chain
> of jobs, but I think it potentially can.
>
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Sunday, June 24, 2007 6:10 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Examples of chained MapReduce?
>
> We are still evaluating Hadoop for use in our main-line analysis
> systems, but we already have the problem of workflow scheduling.
>
> Our solution for that was to implement a simpler version of Amazon's
> Simple Queue Service. This allows us to have multiple redundant workers
> for some tasks, or to throttle some tasks down.
>
> The basic idea is that queues contain XML tasks. Tasks are read from
> the queue by workers, but are kept in a holding pen for a
> queue-specific time period after they are read. If the task completes
> normally, the worker deletes it, but if the timeout expires before the
> worker completes the task, it is added back to the queue.
>
> Workers are structured as a triple of scripts that are executed by a
> manager process. These are a pre-condition that determines whether any
> work should be done at all (usually a check for available local disk
> space or available CPU cycles), an item qualification (run against a
> particular item, in case the work is subject to resource reservation),
> and a worker script.
>
> Even this tiny little framework suffices for quite complex workflows
> and work constraints. It would be very easy to schedule map-reduce
> tasks via a similar mechanism.
>
> On 6/23/07 5:34 AM, "Andrzej Bialecki" wrote:
>
>> James Kennedy wrote:
>>
>>> But back to my original question... Doug suggests that dependence on
>>> a driver process is acceptable. But has anyone needed true MapReduce
>>> chaining or tried it successfully? Or is it generally accepted that
>>> a multi-MapReduce algorithm should always be driven by a single
>>> process?
>>
>> I would argue that this functionality is outside the scope of Hadoop.
>> As far as I understand your question, you need orchestration, which
>> involves the ability to record the state of previously executed
>> map-reduce jobs and to start the next map-reduce jobs based on that
>> state, possibly a long time after the first job completes and from a
>> different process.
>>
>> I'm frequently facing this problem, and so far I've been using a
>> poor-man's workflow system consisting of a bunch of cron jobs, shell
>> scripts, and simple marker files that record the current state of the
>> data. In a similar way you can implement advisory application-level
>> locking, using lock files.
>>
>> Example: adding a new batch of pages to a Nutch index involves many
>> steps, starting with fetchlist generation, fetching, parsing, updating
>> the db, extraction of link information, and indexing. Each of these
>> steps consists of one (or several) map-reduce jobs, and the input to
>> the next jobs depends on the output of previous jobs. What you
>> referred to in your previous email was a single-app driver for this
>> workflow, called Crawl. But I'm using slightly modified individual
>> tools, which on successful completion create marker files (e.g.
>> fetching.done). Other tools check for the existence of these files,
>> and either perform their function or exit (if, say, I try to run
>> updatedb on a segment that is fetched but not parsed).
>>
>> To summarize this long answer - I think that this functionality
>> belongs in the application layer built on top of Hadoop, and IMHO we
>> are better off not implementing it in Hadoop proper.
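A rough Java sketch of the holding-pen queue Ted describes, just to make
the mechanics concrete: a task handed to a worker stays invisible for a
queue-specific timeout and goes back on the queue if the worker has not
deleted it by then. Every class and method name below is made up for
illustration; this is not Veoh's actual implementation, and a real
version would persist the queue and serve it over the network rather
than keep it in memory.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of an SQS-like queue with a "holding pen":
 * tasks read by a worker are hidden for a timeout and re-queued
 * if the worker never deletes them.
 */
public class HoldingPenQueue {

    /** A task is just an id plus an opaque (e.g. XML) payload. */
    public static class Task {
        final String id;
        final String payload;
        Task(String id, String payload) { this.id = id; this.payload = payload; }
    }

    private final Deque<Task> queue = new ArrayDeque<Task>();
    private final Map<String, Long> deadlines = new HashMap<String, Long>();
    private final Map<String, Task> inFlight = new HashMap<String, Task>();
    private final long visibilityTimeoutMs;

    public HoldingPenQueue(long visibilityTimeoutMs) {
        this.visibilityTimeoutMs = visibilityTimeoutMs;
    }

    public synchronized void put(Task t) {
        queue.addLast(t);
    }

    /** Hand the next task to a worker and start its timeout clock. */
    public synchronized Task read() {
        requeueExpired();
        Task t = queue.pollFirst();
        if (t != null) {
            deadlines.put(t.id, System.currentTimeMillis() + visibilityTimeoutMs);
            inFlight.put(t.id, t);
        }
        return t;
    }

    /** Called by the worker on successful completion. */
    public synchronized void delete(String taskId) {
        deadlines.remove(taskId);
        inFlight.remove(taskId);
    }

    /** Any task whose timeout has expired goes back on the queue. */
    private void requeueExpired() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e
                : new HashMap<String, Long>(deadlines).entrySet()) {
            if (e.getValue() < now) {
                deadlines.remove(e.getKey());
                queue.addLast(inFlight.remove(e.getKey()));
            }
        }
    }
}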
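And a minimal sketch of Andrzej's marker-file chaining against the
2007-era org.apache.hadoop.mapred API: a small driver that exits unless
the previous step's marker exists, runs its job, and drops its own
marker only on success. The marker names (fetching.done, parsing.done)
follow his Nutch example, but the driver itself is hypothetical, not one
of the actual Nutch tools.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

/**
 * Hypothetical driver for one step of a marker-file workflow:
 * run only when the previous step's marker exists, and drop our
 * own marker when the job succeeds.
 */
public class ParseStepDriver {

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(ParseStepDriver.class);
    FileSystem fs = FileSystem.get(conf);

    Path segment = new Path(args[0]);                     // e.g. a Nutch segment dir
    Path fetchDone = new Path(segment, "fetching.done");  // marker from the previous step
    Path parseDone = new Path(segment, "parsing.done");   // marker this step will create

    // Precondition: the previous step must have finished, and this step
    // must not have run already. Otherwise exit quietly; cron will try
    // again on its next run.
    if (!fs.exists(fetchDone) || fs.exists(parseDone)) {
      System.out.println("Nothing to do for " + segment);
      return;
    }

    conf.setJobName("parse " + segment);
    // ... set mapper/reducer classes, input and output paths here ...

    // runJob() blocks until the job finishes and throws on failure,
    // so the marker is only written when the step really succeeded.
    JobClient.runJob(conf);

    fs.create(parseDone).close();   // advisory marker for the next step
  }
}

Chained this way, each step can be fired from cron on any machine; the
marker files are the only shared state, which is also how the advisory
application-level locking he mentions would work.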