From: Doug Cutting
Date: Thu, 25 May 2006 09:24:45 -0700
To: hadoop-user@lucene.apache.org
Subject: Re: Help with MapReduce
Message-ID: <4475DA4D.9000707@apache.org>
Dennis Kubes wrote:
> The problem is that I have a single url. I get the inlinks to that url
> and then I need to go access content from all of its inlink urls that
> have been fetched.
> I was doing this through random access. But then I went back and
> re-read the Google MapReduce paper and saw that it was designed for
> sequential access, and saw that Hadoop is implemented the same way. But
> so far I haven't found a way to efficiently solve this kind of problem
> in a sequential format.

If your input urls are only a small fraction of the collection, then random access might be appropriate, or you might instead use two (or more) MapReduce passes, something like:

1. url -> inlink urls (using the previously inverted link db)
2. inlink urls -> inlink content

In each case the mapping might look like it's doing random access, but if the input keys are sorted, and the "table" you're "selecting" from (the link db in the first case and the content in the second) is sorted, then the accesses will actually be sequential, scanning each table only once.

But these will generally be remote DFS accesses. MapReduce can usually arrange to place tasks on a node where the input data is local, but when the map task then accesses other files this optimization cannot be made.

In Nutch, things are slightly more complicated, since the content is organized by segment, each sorted by URL. So you could either add another MapReduce pass so that the inlink urls are sorted by segment then url, or you could append all of your segments into a single segment.
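The sorted-scan idea can be sketched outside of Hadoop in plain Java. This is only an illustration of the principle, not Nutch code: the urls and link-db entries below are made up, but because both inputs are sorted by url, a single merge pass consumes each of them strictly in order, scanning each table exactly once:

```java
import java.util.*;

public class SortedJoin {
    // Merge a sorted list of query urls against a sorted "link db"
    // (url -> inlinks), emitting matches. Each input is scanned once,
    // sequentially -- no random seeks.
    static List<String> join(List<String> urls,
                             List<Map.Entry<String, List<String>>> linkDb) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < urls.size() && j < linkDb.size()) {
            int cmp = urls.get(i).compareTo(linkDb.get(j).getKey());
            if (cmp == 0) {            // url present in the link db
                out.add(urls.get(i) + " <- " + linkDb.get(j).getValue());
                i++; j++;
            } else if (cmp < 0) {      // no link db entry for this url
                i++;
            } else {                   // link db entry was not queried
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.com/", "http://c.com/");
        List<Map.Entry<String, List<String>>> linkDb = Arrays.asList(
            Map.entry("http://a.com/", Arrays.asList("http://x.com/")),
            Map.entry("http://b.com/", Arrays.asList("http://y.com/")),
            Map.entry("http://c.com/", Arrays.asList("http://x.com/",
                                                     "http://y.com/")));
        for (String line : join(urls, linkDb))
            System.out.println(line);
    }
}
```

The same invariant is what makes the two MapReduce passes above efficient: as long as both sides stay sorted on the join key, the "lookup" degenerates into a sequential scan.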
But if you're performing the calculation over the entire collection, or even a substantial fraction of it, then you might be able to use a single MapReduce pass, with the content and the link db as inputs, performing your required computations in reduce. For anything larger than a small fraction of your collection this will likely be fastest.

> If I were to do it in the configure and close, wouldn't that still open
> a single reader per map call?

configure() and close() are only called once per map task, not once per call to map().

Doug
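The single-pass, reduce-side join described above can also be sketched in plain Java, with no Hadoop dependency. The tagging scheme ("L:" / "C:" prefixes) and the sample records are hypothetical; the point is that map emits (url, tagged value) pairs from both inputs, the shuffle groups them by url, and reduce then sees all of a url's link-db and content records together:

```java
import java.util.*;

public class ReduceSideJoin {
    // Simulate the shuffle of one MapReduce pass: group tagged
    // (key, value) pairs by key. A TreeMap keeps keys sorted, as
    // the framework would deliver them to reduce.
    static Map<String, List<String>> groupByKey(List<String[]> mapOutput) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] kv : mapOutput)
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = new ArrayList<>();
        // "map" over the link db: values tagged with L:
        mapOutput.add(new String[]{"http://a.com/", "L:http://x.com/"});
        // "map" over the content: values tagged with C:
        mapOutput.add(new String[]{"http://a.com/", "C:<html>page a</html>"});
        mapOutput.add(new String[]{"http://b.com/", "C:<html>page b</html>"});

        // "reduce": all tagged values for a url arrive together,
        // so the join happens here, in one pass over the data.
        for (Map.Entry<String, List<String>> e : groupByKey(mapOutput).entrySet())
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}
```

In real Hadoop the grouping and sorting are done by the framework between map and reduce; the sketch only collapses that machinery into a single method to show where the join logic would live.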