From: Eric Baldeschwieler
Subject: Re: Job scheduling (Re: Unable to run more than one job concurrently)
Date: Sun, 21 May 2006 21:11:23 -0700
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org

?? You have no guarantee that your time-sensitive data is safe / committed
until after your reduce has completed. If you care about reliability or data
integrity, simply run a full map-reduce job in your collection window and
store the result in HDFS. Do the expensive post-processing, which you have a
quarter to complete, as another job.

Being able to preempt a long job with a time-sensitive short job seems to
really be your requirement.
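In concrete terms, here is a minimal sketch of that two-job split, using the
classic org.apache.hadoop.mapred API (the paths and job names are made up,
the identity mapper/reducer stand in for whatever collection and rollup
classes you actually use, and the exact helper method names have shifted
between Hadoop versions):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class CollectThenRollup {
        public static void main(String[] args) throws IOException {
            // Job 1: time-sensitive collection.  Run it to completion inside
            // the nightly window; once runJob() returns, the reduce output is
            // committed to HDFS and the collected data is safe.
            JobConf collect = new JobConf(CollectThenRollup.class);
            collect.setJobName("nightly-collect");
            collect.setMapperClass(IdentityMapper.class);   // your collection mapper here
            collect.setReducerClass(IdentityReducer.class);
            FileInputFormat.setInputPaths(collect, new Path("/feeds/incoming"));
            FileOutputFormat.setOutputPath(collect, new Path("/warehouse/raw/2006-05-21"));
            JobClient.runJob(collect);        // blocks until the job completes

            // Job 2: expensive post-processing over the committed output.  It
            // has until the end of the quarter to finish, so it runs as a
            // separate job outside the collection window.
            JobConf rollup = new JobConf(CollectThenRollup.class);
            rollup.setJobName("quarterly-rollup");
            rollup.setMapperClass(IdentityMapper.class);    // your rollup mapper here
            rollup.setReducerClass(IdentityReducer.class);  // your rollup reducer here
            FileInputFormat.setInputPaths(rollup, new Path("/warehouse/raw/2006-05-21"));
            FileOutputFormat.setOutputPath(rollup, new Path("/warehouse/rollup/2006-Q2"));
            JobClient.runJob(rollup);
        }
    }

Because JobClient.runJob() blocks until the submitted job completes, the
collected data is committed before the window closes, and the rollup job can
be run, killed, or re-run at leisure afterwards.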
On May 21, 2006, at 11:22 AM, Rod Taylor wrote:

>>> (2) Have a per-job total task count limit. Currently, we establish
>>> the number of tasks each node runs, and how many map or reduce
>>> tasks we have in total for a given job. But it would be great if we
>>> could set a ceiling on the number of tasks that run concurrently
>>> for a given job. This may help with Andrzej's fetcher (since it is
>>> bandwidth constrained, maybe fewer concurrent tasks would be fine?).
>>
>> I like this idea. So if the highest-priority job is already running
>> at its task limit, then tasks can be run from the next
>> highest-priority job. Should there be separate limits for maps and
>> reduces?
>
> Limits for maps and reduces are useful for a job class, not so much
> for a specific job instance. Data collection may be best achieved
> with 15 parallel maps pulling data from remote data sources, but the
> fact that 3 of them belong to one job and 12 to another isn't
> important. What is important is that the 15 make the best use of the
> available resources.
>
> A different priority for map and reduce would also be useful. Many
> times data collection within a set timeframe is far more important
> than reducing the data for storage or post-processing, particularly
> when collection means retrieving it from a remote resource.
>
> Data warehousing activities often require that data collection occur
> once a night between set hours (very high priority), but processing
> of the collected data can happen any time before the end of the
> quarter.
>
> For Nutch, with both of the above you should be able to keep N Fetch
> map tasks running at all times, with everything else being secondary
> within the remaining resources. This could make use of 100% of the
> available remote bandwidth.
>
> --
> Rod Taylor
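For reference, the per-job ceiling and per-class priority behaviour being
asked for above could be sketched as the following slot-assignment rule.
This is purely illustrative -- it is not existing JobTracker code, and the
JobInfo fields are hypothetical bookkeeping, not a real Hadoop API:

    import java.util.List;

    class JobInfo {
        int priority;            // higher value = more urgent (e.g. the fetch job)
        int runningMaps;         // map tasks currently executing for this job
        int maxConcurrentMaps;   // proposed per-job ceiling, e.g. 15
        int pendingMaps;         // map tasks still waiting to be scheduled
    }

    class CeilingScheduler {
        /** Pick the job allowed to use a newly freed map slot, or null to leave it idle. */
        static JobInfo pickJobForMapSlot(List<JobInfo> jobs) {
            JobInfo best = null;
            for (JobInfo j : jobs) {
                if (j.pendingMaps == 0) continue;                   // nothing left to run
                if (j.runningMaps >= j.maxConcurrentMaps) continue; // already at its ceiling
                if (best == null || j.priority > best.priority) {
                    best = j;
                }
            }
            return best;   // highest-priority job below its ceiling with work pending
        }
    }

With the fetch job given top priority and a ceiling of, say, 15 maps, it
would keep 15 fetch maps running whenever it has work, while the remaining
slots fall through to lower-priority jobs; separate ceilings and priorities
for reduces would follow the same pattern.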