Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
To: hadoop-user@lucene.apache.org
From: Lance Amundsen
Date: Mon, 22 Oct 2007 20:31:32 -0700

What I am trying to do is see what it would take to modify the Hadoop framework for transaction-based processing, so right now I am just trying to get to the point where I can start looking at the hard stuff. I am still blocked by simple control issues. The workload I am measuring at this point is nothing more than a print statement, i.e. nothing. Startup costs are currently linear with respect to the number of nodes, depending on the input model. I first need to find or create a model that has a flat startup cost for n nodes, at which point I can start to tackle the actual path-length and latency issues of the startup itself. This is all just investigative at this point, but I can already envision changes that would allow process startup and teardown in less than 1 second, growing nearly flatly as nodes increase.

Ted Dunning wrote on 10/22/2007 07:29 PM (please respond to hadoop-user@lucene.apache.org):

You probably have determined by now that there is a parameter that determines how many concurrent maps there are:

  mapred.tasktracker.tasks.maximum = 3
  "The maximum number of tasks that will be run simultaneously by a
  task tracker."

Btw, I am still curious about your approach. Isn't it normally better to measure marginal costs such as this startup cost by linear regression as you change parameters?
It seems that otherwise you will likely be misled by what happens at the boundaries, when what you really want is what happens in the normal operating region.

On 10/22/07 5:53 PM, "Lance Amundsen" wrote:
> ...
>
> Next I want to increase the concurrent # of tasks being executed for each
> node... currently it seems like 2 or 3 is the upper limit (at least on the
> earlier binaries I was running).
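For anyone following along: the parameter Ted mentions is normally overridden per node in the site configuration file. A minimal sketch, assuming the single combined limit used by Hadoop releases of this era (the value 8 is an arbitrary example, not a recommendation):

```xml
<!-- hadoop-site.xml (sketch): raise the per-TaskTracker task limit -->
<configuration>
  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>8</value>
    <description>The maximum number of tasks that will be run
      simultaneously by a task tracker.</description>
  </property>
</configuration>
```

Each TaskTracker reads this at startup, so the daemons need a restart for the new limit to take effect.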
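Ted's regression suggestion can be sketched as follows: time the same no-op job at several cluster sizes, fit a line, and read the marginal per-node startup cost off the slope rather than off any single boundary point. The (nodes, seconds) samples below are invented purely for illustration:

```python
# Ordinary least-squares fit of job wall time vs. cluster size.
# slope  ~ marginal startup cost added per node
# intercept ~ fixed overhead independent of cluster size

def fit_line(points):
    """Least-squares fit of y = intercept + slope * x over (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Hypothetical measurements: wall time of a do-nothing job at each size.
samples = [(1, 12.0), (2, 19.5), (4, 35.0), (8, 66.0), (16, 128.0)]

fixed, per_node = fit_line(samples)
print("fixed overhead ~ %.1f s" % fixed)      # ~ 4.1 s
print("cost per node  ~ %.1f s" % per_node)   # ~ 7.7 s
```

A flat-startup model would show up here as a slope near zero; the linear-with-nodes behavior Lance describes shows up as a large, stable slope across the operating region.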