Message-ID: <4475E53F.9010807@apache.org>
Date: Thu, 25 May 2006 10:11:27 -0700
From: Doug Cutting
To: hadoop-user@lucene.apache.org
Subject: Re: Help with MapReduce
References: <4475CDC4.4070703@dragonflymc.com> <4475D201.1030006@apache.org>
 <4475D4BD.7010700@dragonflymc.com> <4475DA4D.9000707@apache.org> <4475E218.5000308@dragonflymc.com>
In-Reply-To: <4475E218.5000308@dragonflymc.com>

Dennis Kubes wrote:
> Ok. This is a little different in that I need to start thinking about
> my algorithms in terms of sequential passes and multiple jobs instead of
> direct access. That way I can use the input directories to get the data
> that I need. Couldn't I also do it through the MapRunnable interface
> that creates a reader shared by an inner mapper class, or is that hacking
> the interfaces when I should be thinking about this in terms of sequential
> processing?

You can do it however you like! I don't know enough about your problem
to say definitively which is the best approach.

We're working hard on Hadoop so that we can scalably stream data through
MapReduce at megabytes/second per node. So you might do some
back-of-the-envelope calculations. Figure at least 10ms per random
access, so your maximum random access rate might be around 100/second
per drive. Figure a 10MB/second transfer rate, so if randomly accessed
data is 100kB each, every access costs about 20ms (10ms seek plus 10ms
transfer) and your maximum random access rate drops to
50 items/drive/second. Since these are over the network, real
performance will probably be much worse. Also, MapFile requires a scan
per entry, so you might really end up scanning 1MB per access, which
would slow random accesses to about 10 items/drive/second.

You might benchmark your random access performance to get a better
estimate, then compare that to processing the whole collection through
MapReduce.

Doug
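The back-of-the-envelope estimate above can be written out as a small calculation. This is only a sketch under the figures quoted in the message (10ms per seek, 10MB/second transfer, 100kB items, a 1MB scan per MapFile lookup). The function names and the 100GB-per-drive collection size are hypothetical, chosen for illustration, and network overhead is ignored.

```python
def max_random_access_rate(seek_s, transfer_bytes_per_s, bytes_per_access):
    """Upper bound on random accesses per second for a single drive.

    Each access is modeled as one seek plus the time to transfer
    bytes_per_access at the sustained transfer rate.
    """
    time_per_access = seek_s + bytes_per_access / transfer_bytes_per_s
    return 1.0 / time_per_access


def breakeven_accesses(collection_bytes, transfer_bytes_per_s, access_rate):
    """Number of random accesses that take as long as one full sequential scan."""
    full_scan_s = collection_bytes / transfer_bytes_per_s
    return full_scan_s * access_rate


SEEK_S = 0.010          # 10 ms per random access (figure from the message)
TRANSFER = 10_000_000   # 10 MB/second, decimal units (figure from the message)

seek_only = max_random_access_rate(SEEK_S, TRANSFER, 0)
with_100kb = max_random_access_rate(SEEK_S, TRANSFER, 100_000)
with_1mb_scan = max_random_access_rate(SEEK_S, TRANSFER, 1_000_000)

print(f"seek only:        {seek_only:.1f} accesses/drive/second")      # 100.0
print(f"100 kB per item:  {with_100kb:.1f} accesses/drive/second")     # 50.0
print(f"1 MB scan/access: {with_1mb_scan:.1f} accesses/drive/second")  # 9.1

# Hypothetical collection size: 100 GB on one drive (an assumption, not from the message).
COLLECTION_BYTES = 100_000_000_000
print(f"breakeven: {breakeven_accesses(COLLECTION_BYTES, TRANSFER, with_100kb):.0f} accesses")
```

Under these assumptions, if a job needs more lookups than the breakeven count, streaming the whole collection sequentially should take less time than random access. A measured random access rate can be substituted for the modeled one.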