Mailing-List: contact hadoop-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-user@lucene.apache.org
Received-SPF: pass (asf.osuosl.org: local policy)
Message-ID: <4475D4BD.7010700@dragonflymc.com>
Date: Thu, 25 May 2006 11:01:01 -0500
From: Dennis Kubes <nutch-dev@dragonflymc.com>
User-Agent: Thunderbird 1.5.0.2 (Windows/20060308)
MIME-Version: 1.0
To: hadoop-user@lucene.apache.org
Subject: Re: Help with MapReduce
References: <4475CDC4.4070703@dragonflymc.com> <4475D201.1030006@apache.org>
In-Reply-To: <4475D201.1030006@apache.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

The problem is that I start with a single URL.  I get the inlinks to
that URL, and then I need to access the content of every inlink URL
that has been fetched.

I was doing this through random access.  But then I went back and
re-read the Google MapReduce paper and saw that it was designed for
sequential access, and that Hadoop is implemented the same way.  So
far I haven't found a way to solve this kind of problem efficiently
with sequential access.

If I were to open the reader in configure() and close it in close(),
wouldn't that still open a single reader per map call?
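
To make the question concrete, here is a minimal sketch of the
lifecycle being discussed.  It assumes configure() runs once per task
before any map() call and close() runs once after the last one.
FakeReader and LookupMapper are hypothetical stand-ins written for this
example; they are not Hadoop classes, and the URLs are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReaderLifecycleSketch {

    /** Hypothetical stand-in for a MapFile reader; counts how often it is opened. */
    static class FakeReader {
        static int opens = 0;
        private final Map<String, String> data = new HashMap<>();
        private boolean closed = false;

        FakeReader() {
            opens++; // every construction models one "open" of the file
            data.put("http://a.example/", "content of a");
            data.put("http://b.example/", "content of b");
        }

        /** Random-access lookup by key; returns null if the key is absent. */
        String get(String key) {
            if (closed) {
                throw new IllegalStateException("reader is closed");
            }
            return data.get(key);
        }

        void close() {
            closed = true;
        }
    }

    /** Hypothetical mapper with the three lifecycle hooks under discussion. */
    static class LookupMapper {
        private FakeReader reader;

        // Assumed to be called once per task, before the first map() call.
        void configure() {
            reader = new FakeReader();
        }

        // Called once per input record; reuses the reader opened in configure().
        String map(String inlinkUrl) {
            return reader.get(inlinkUrl);
        }

        // Assumed to be called once per task, after the last map() call.
        void close() {
            reader.close();
        }
    }

    /** Simulates one task and returns how many times the reader was opened. */
    static int runTask(List<String> inlinkUrls) {
        FakeReader.opens = 0;
        LookupMapper mapper = new LookupMapper();
        mapper.configure();
        try {
            for (String url : inlinkUrls) {
                mapper.map(url);
            }
        } finally {
            mapper.close();
        }
        return FakeReader.opens;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.example/", "http://b.example/", "http://a.example/");
        int opens = runTask(urls);
        System.out.println("map calls: " + urls.size() + ", reader opens: " + opens);
    }
}
```

Under that assumption the reader is opened once per task rather than
once per map call.  This only removes the per-call open cost; each
lookup inside map() is still a random access.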

Dennis

Doug Cutting wrote:
> Dennis Kubes wrote:
>> I am trying to read a MapFile inside mapper and reducer 
>> implementations.  So far the only way I have found to do it is by 
>> opening a new reader for each map and reduce call.  Is anybody doing 
>> something similar and if so is there a way to open a single reader 
>> and reuse it across multiple map or reduce calls?
>
> Can't you open it in the configure() implementation?  And close it in 
> the close() implementation?
>
> Are you randomly accessing a MapFile from a map() implementation? 
> That's not going to scale very well.  MapReduce is designed for 
> sequential access.
>
> Doug