In-Reply-To: <6683EEA3-F8A4-49EF-88F9-375BE4D5F6A1@yahoo-inc.com>
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
To: hadoop-user@lucene.apache.org
From: Lance Amundsen
Date: Thu, 25 Oct 2007 10:19:59 -0700

So I managed to get my fast InputFormat working. It does still use the FS, but in such a way that it improves mapper startup by over 2X. And last night I got a prototype working that allows the map task to run under the JVM of the TaskTracker, rather than spawning a new JVM.

The initial performance looks really, really good. I just ran a 1000-map, single-input-record job (the mappers doing no work, however) in a one-master, one-slave setup on my laptop. It completed in a couple thousand seconds, or a couple of seconds per map. Earlier I did a smaller 100-map job on a stable, quiesced system and it came in at about 130 seconds. So this prototype can start and end map tasks in 1-2 seconds, and should scale flatly with respect to the number of nodes in the setup.

"Owen O'Malley"
10/24/2007 01:05 PM
Please respond to hadoop-user@lucene.apache.org

To: hadoop-user@lucene.apache.org
cc:
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base

On Oct 24, 2007, at 12:42 PM, Doug Cutting wrote:

> Lance Amundsen wrote:
>> OK, that is encouraging. I'll take another pass at it.
>> I succeeded yesterday with an in-memory only InputFormat, but only
>> after I commented out some of the split referencing code, like the
>> following in MapTask.java
>>
>>     if (instantiatedSplit instanceof FileSplit) {
>>       FileSplit fileSplit = (FileSplit) instantiatedSplit;
>>       job.set("map.input.file", fileSplit.getPath().toString());
>>       job.setLong("map.input.start", fileSplit.getStart());
>>       job.setLong("map.input.length", fileSplit.getLength());
>>     }
>
> Yes, that code should not exist, but it shouldn't affect you
> either. You should be subclassing InputSplit, not FileSplit, so
> this code shouldn't operate on your splits.

That code doesn't do anything if they are non-file splits, so it absolutely shouldn't break anything. Applications depend on those attributes to know which split they are working on, and there isn't a better fix until we move to context objects. I know that non-file splits work because there are unit tests to make sure they don't break anything.

-- Owen
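
For readers following the thread, the guard discussed above can be sketched in isolation. The classes below are simplified, hypothetical stand-ins, not the real Hadoop InputSplit and FileSplit types, and a plain map stands in for the job configuration. Only the three map.input.* keys come from the quoted MapTask.java snippet. The sketch shows that a split which does not extend the file split type passes through the guard without any attributes being set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SplitConfigSketch {

    /** Hypothetical, simplified stand-in for Hadoop's InputSplit. */
    interface InputSplit {
        long getLength();
    }

    /** Hypothetical stand-in for a file-backed split. */
    static final class FileSplit implements InputSplit {
        private final String path;
        private final long start;
        private final long length;

        FileSplit(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        String getPath() { return path; }
        long getStart() { return start; }
        public long getLength() { return length; }
    }

    /** Hypothetical in-memory split: implements InputSplit, not FileSplit. */
    static final class MemorySplit implements InputSplit {
        private final long length;

        MemorySplit(long length) { this.length = length; }

        public long getLength() { return length; }
    }

    /**
     * Mirrors the shape of the quoted guard: the map.input.* attributes
     * are set only when the split is a FileSplit. A plain map stands in
     * for the job configuration (an assumption made for this sketch).
     */
    static Map<String, String> configure(InputSplit split) {
        Map<String, String> job = new LinkedHashMap<>();
        if (split instanceof FileSplit) {
            FileSplit fileSplit = (FileSplit) split;
            job.put("map.input.file", fileSplit.getPath());
            job.put("map.input.start", Long.toString(fileSplit.getStart()));
            job.put("map.input.length", Long.toString(fileSplit.getLength()));
        }
        return job;
    }

    public static void main(String[] args) {
        // File-backed split: all three attributes are populated.
        System.out.println(configure(new FileSplit("/data/part-00000", 0L, 1024L)));
        // In-memory split: the guard is skipped and nothing is set.
        System.out.println(configure(new MemorySplit(1L)));
    }
}
```

The design point is the one made in the replies: a custom split that subclasses the base split type rather than the file split type never enters the guarded block, so the guard cannot interfere with it.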