From: James Kennedy
Date: Mon, 25 Jun 2007 09:30:21 -0700
To: hadoop-user@lucene.apache.org
Subject: Re: Examples of chained MapReduce?

Ok guys, thank you for your responses.

Devaraj Das wrote:
> I haven't confirmed this, but I vaguely remember that the resource
> schedulers (Torque, Condor) provide a feature with which one can submit
> a DAG of jobs. The resource manager doesn't invoke a node in the DAG
> unless all the nodes pointing to it have successfully finished (or
> something like that), and the resource scheduler framework does the
> bookkeeping to take care of failed jobs, etc.
> In Hadoop there is an effort, "Integration of Hadoop with batch
> schedulers": https://issues.apache.org/jira/browse/HADOOP-719
> I am not sure whether it handles the use case where one submits a chain
> of jobs, but I think it potentially can.
>
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Sunday, June 24, 2007 6:10 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Examples of chained MapReduce?
>
> We are still evaluating Hadoop for use in our main-line analysis
> systems, but we already have the problem of workflow scheduling.
>
> Our solution for that was to implement a simpler version of Amazon's
> Simple Queue Service. This allows us to have multiple redundant workers
> for some tasks, or to throttle some tasks down.
>
> The basic idea is that queues contain XML tasks. Tasks are read from
> the queue by workers, but are kept in a holding pen for a
> queue-specific time period after they are read. If the task completes
> normally, the worker deletes it, but if the timeout expires before the
> worker completes the task, it is added back to the queue.
>
> Workers are structured as a triple of scripts that are executed by a
> manager process. These are a pre-condition that determines whether any
> work should be done at all (usually a check for available local disk
> space or available CPU cycles), an item qualification (run against a
> particular item, in case the work is subject to resource reservation),
> and a worker script.
>
> Even this tiny little framework suffices for quite complex workflows
> and work constraints. It would be very easy to schedule map-reduce
> tasks via a similar mechanism.
>
> On 6/23/07 5:34 AM, "Andrzej Bialecki" wrote:
>
>> James Kennedy wrote:
>>
>>> But back to my original question... Doug suggests that dependence on
>>> a driver process is acceptable. But has anyone needed true MapReduce
>>> chaining or tried it successfully? Or is it generally accepted that
>>> a multi-MapReduce algorithm should always be driven by a single
>>> process?
>>
>> I would argue that this functionality is outside the scope of Hadoop.
>> As far as I understand your question, you need orchestration, which
>> involves the ability to record the state of previously executed
>> map-reduce jobs and to start the next map-reduce jobs based on that
>> state, possibly a long time after the first job completes and from a
>> different process.
>>
>> I'm frequently facing this problem, and so far I've been using a
>> poor-man's workflow system consisting of a bunch of cron jobs, shell
>> scripts, and simple marker files that record the current state of the
>> data. In a similar way you can implement advisory application-level
>> locking, using lock files.
>>
>> Example: adding a new batch of pages to a Nutch index involves many
>> steps, starting with fetchlist generation, fetching, parsing, updating
>> the db, extraction of link information, and indexing. Each of these
>> steps consists of one (or several) map-reduce jobs, and the input to
>> the next jobs depends on the output of previous jobs. What you
>> referred to in your previous email was a single-app driver for this
>> workflow, called Crawl. But I'm using slightly modified individual
>> tools, which on successful completion create marker files (e.g.
>> fetching.done). Other tools check for the existence of these files,
>> and either perform their function or exit (if, say, I try to run
>> updatedb on a segment that is fetched but not parsed).
>>
>> To summarize this long answer - I think that this functionality
>> belongs in the application layer built on top of Hadoop, and IMHO we
>> are better off not implementing it in Hadoop proper.
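A rough Java sketch of the holding-pen queue Ted describes, just to make
the mechanics concrete: a task handed to a worker stays invisible for a
queue-specific timeout and goes back on the queue if the worker has not
deleted it by then. Every class and method name below is made up for
illustration; this is not Veoh's actual implementation, and a real
version would persist the queue and serve it over the network rather
than keep it in memory.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of an SQS-like queue with a "holding pen":
 * tasks read by a worker are hidden for a timeout and re-queued
 * if the worker never deletes them.
 */
public class HoldingPenQueue {

    /** A task is just an id plus an opaque (e.g. XML) payload. */
    public static class Task {
        final String id;
        final String payload;
        Task(String id, String payload) { this.id = id; this.payload = payload; }
    }

    private final Deque<Task> queue = new ArrayDeque<Task>();
    private final Map<String, Long> deadlines = new HashMap<String, Long>();
    private final Map<String, Task> inFlight = new HashMap<String, Task>();
    private final long visibilityTimeoutMs;

    public HoldingPenQueue(long visibilityTimeoutMs) {
        this.visibilityTimeoutMs = visibilityTimeoutMs;
    }

    public synchronized void put(Task t) {
        queue.addLast(t);
    }

    /** Hand the next task to a worker and start its timeout clock. */
    public synchronized Task read() {
        requeueExpired();
        Task t = queue.pollFirst();
        if (t != null) {
            deadlines.put(t.id, System.currentTimeMillis() + visibilityTimeoutMs);
            inFlight.put(t.id, t);
        }
        return t;
    }

    /** Called by the worker on successful completion. */
    public synchronized void delete(String taskId) {
        deadlines.remove(taskId);
        inFlight.remove(taskId);
    }

    /** Any task whose timeout has expired goes back on the queue. */
    private void requeueExpired() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e
                : new HashMap<String, Long>(deadlines).entrySet()) {
            if (e.getValue() < now) {
                deadlines.remove(e.getKey());
                queue.addLast(inFlight.remove(e.getKey()));
            }
        }
    }
}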
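And a minimal sketch of Andrzej's marker-file chaining against the
2007-era org.apache.hadoop.mapred API: a small driver that exits unless
the previous step's marker exists, runs its job, and drops its own
marker only on success. The marker names (fetching.done, parsing.done)
follow his Nutch example, but the driver itself is hypothetical, not one
of the actual Nutch tools.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

/**
 * Hypothetical driver for one step of a marker-file workflow:
 * run only when the previous step's marker exists, and drop our
 * own marker when the job succeeds.
 */
public class ParseStepDriver {

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(ParseStepDriver.class);
    FileSystem fs = FileSystem.get(conf);

    Path segment = new Path(args[0]);                     // e.g. a Nutch segment dir
    Path fetchDone = new Path(segment, "fetching.done");  // marker from the previous step
    Path parseDone = new Path(segment, "parsing.done");   // marker this step will create

    // Precondition: the previous step must have finished, and this step
    // must not have run already. Otherwise exit quietly; cron will try
    // again on its next run.
    if (!fs.exists(fetchDone) || fs.exists(parseDone)) {
      System.out.println("Nothing to do for " + segment);
      return;
    }

    conf.setJobName("parse " + segment);
    // ... set mapper/reducer classes, input and output paths here ...

    // runJob() blocks until the job finishes and throws on failure,
    // so the marker is only written when the step really succeeded.
    JobClient.runJob(conf);

    fs.create(parseDone).close();   // advisory marker for the next step
  }
}

Chained this way, each step can be fired from cron on any machine; the
marker files are the only shared state, which is also how the advisory
application-level locking he mentions would work.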