hadoop-common-user mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: Help with MapReduce
Date Thu, 25 May 2006 16:58:00 GMT
Ok.  This is a little different in that I need to start thinking about 
my algorithms in terms of sequential passes and multiple jobs instead of 
direct access.  That way I can use the input directories to get the data 
that I need.  Couldn't I also do it through the MapRunnable interface 
that creates a reader shared by an inner mapper class, or is that hacking 
the interfaces when I should be thinking about this in terms of sequential 
processing?

Dennis

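(The MapRunnable route Dennis asks about would look roughly like the sketch 
below, written against the old org.apache.hadoop.mapred API.  Everything in it 
is illustrative: the class name, the "content.mapfile" property, and the 
assumption that keys and values are Text.  It is not code from Nutch or Hadoop.)

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a MapRunnable that opens one content reader per task and
// shares it across every record in the split.
public class InlinkContentRunner implements MapRunnable {

  private JobConf job;

  public void configure(JobConf job) {
    this.job = job;
  }

  public void run(RecordReader input, OutputCollector output, Reporter reporter)
      throws IOException {
    FileSystem fs = FileSystem.get(job);
    // Hypothetical location of a url-sorted content MapFile.
    MapFile.Reader content =
        new MapFile.Reader(fs, job.get("content.mapfile"), job);
    try {
      Text url = new Text();
      Text value = new Text();
      while (input.next(url, value)) {          // one pass over the split
        Text page = new Text();
        if (content.get(url, page) != null) {   // lookup via the shared reader
          output.collect(url, page);
        }
      }
    } finally {
      content.close();
    }
  }
}
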
Doug Cutting wrote:
> Dennis Kubes wrote:
>> The problem is that I have a single url.  I get the inlinks to that 
>> url and then I need to go access content from all of its inlink urls 
>> that have been fetched.
>> I was doing this through random access.  But then I went back and 
>> re-read the Google MapReduce paper and saw that it was designed for 
>> sequential access, and that Hadoop is implemented the same way.  But so 
>> far I haven't found a way to efficiently solve this kind of problem 
>> sequentially.
>
> If your input urls are only a small fraction of the collection, then 
> random access might be appropriate, or you might instead use two (or 
> more) MapReduce passes, something like:
>
> 1. url -> inlink urls (using previously inverted link db)
> 2. inlink urls -> inlink content
>
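(A hypothetical driver for those two passes, against the old JobConf/JobClient 
API, might look like the sketch below.  All paths and mapper/reducer class 
names are placeholders, and the JobConf input/output methods shown were later 
deprecated in favour of FileInputFormat/FileOutputFormat.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical two-pass driver: the output of pass 1 feeds pass 2.
public class InlinkContentDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Pass 1: url -> inlink urls, using the previously inverted link db.
    JobConf pass1 = new JobConf(conf, InlinkContentDriver.class);
    pass1.addInputPath(new Path("crawl/linkdb"));          // placeholder path
    pass1.setOutputPath(new Path("tmp/inlink-urls"));
    pass1.setMapperClass(InlinkUrlMapper.class);           // hypothetical
    pass1.setReducerClass(InlinkUrlReducer.class);         // hypothetical
    JobClient.runJob(pass1);

    // Pass 2: inlink urls -> inlink content, against the sorted content.
    JobConf pass2 = new JobConf(conf, InlinkContentDriver.class);
    pass2.addInputPath(new Path("tmp/inlink-urls"));
    pass2.addInputPath(new Path("crawl/content"));         // placeholder path
    pass2.setOutputPath(new Path("out/inlink-content"));
    pass2.setMapperClass(InlinkContentMapper.class);       // hypothetical
    pass2.setReducerClass(InlinkContentReducer.class);     // hypothetical
    JobClient.runJob(pass2);
  }
}
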
> In each case the mapping might look like it's doing random access, 
> but, if input keys are sorted, and the "table" you're "selecting" from 
> (the link db in the first case and the content in the second) are 
> sorted, then the accesses will actually be sequential, scanning each 
> table only once.  But these will generally be remote DFS accesses.  
> MapReduce can usually arrange to place tasks on a node where the input 
> data is local, but when the map task then accesses other files this 
> optimization cannot be made.
>
> In Nutch, things are slightly more complicated, since the content is 
> organized by segment, each sorted by URL.  So you could either add 
> another MapReduce pass so that the inlink urls are sorted by segment 
> then url, or you could append all of your segments into a single segment.
>
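(One illustrative way to get that extra sorting pass, not necessarily how Nutch 
does it: assuming an earlier step has attached the segment name to each inlink 
url record, a map can emit a composite "segment <tab> url" key so the 
framework's sort delivers the records segment by segment, url-sorted within 
each segment.)

// Illustrative map() for the extra pass.  SegmentedUrl is a hypothetical
// value type carrying the name of the segment that fetched the url.
public void map(WritableComparable key, Writable value,
                OutputCollector output, Reporter reporter) throws IOException {
  String url = key.toString();
  String segment = ((SegmentedUrl) value).getSegment();
  output.collect(new Text(segment + "\t" + url), value);
}
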
> But if you're performing the calculation over the entire collection, 
> or even a substantial fraction, then you might be able to use a single 
> MapReduce pass, with the content and link db as inputs, performing 
> your required computations in reduce.  For anything larger than a 
> small fraction of your collection this will likely be fastest.
>
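(In that single-pass shape the work lands in reduce, which sees the link db 
record and the content record for a given url together.  The sketch below 
assumes ObjectWritable-wrapped values and uses the Nutch Inlinks/Content 
classes purely for illustration; it is not the actual Nutch job.)

// Illustrative reduce for a one-pass job whose inputs are the link db and
// the content, both keyed by url.  Values are assumed to be wrapped in
// ObjectWritable so the two record types can share one job.
public void reduce(WritableComparable key, Iterator values,
                   OutputCollector output, Reporter reporter) throws IOException {
  Inlinks inlinks = null;   // record from the link db
  Content content = null;   // record from a content segment
  while (values.hasNext()) {
    Object value = ((ObjectWritable) values.next()).get();
    if (value instanceof Inlinks) {
      inlinks = (Inlinks) value;
    } else if (value instanceof Content) {
      content = (Content) value;
    }
  }
  if (inlinks != null && content != null) {
    // Whatever per-url computation is needed happens here.
    output.collect(key, new ObjectWritable(content));
  }
}
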
>> If I were to do it in configure() and close(), wouldn't that still 
>> open a single reader per map call?
>
> configure() and close() are only called once per map task.
>
> Doug
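
(Which is why opening the reader there works: one reader per task, reused by 
every map() call.  A minimal sketch, again assuming the old 
org.apache.hadoop.mapred API, Text keys/values, and a made-up 
"content.mapfile" property:)

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// One MapFile.Reader per map *task*, not per map() call.
public class InlinkLookupMapper extends MapReduceBase implements Mapper {

  private MapFile.Reader reader;    // opened once, in configure()

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      reader = new MapFile.Reader(fs, job.get("content.mapfile"), job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    // If the input keys arrive sorted, these lookups only move forward
    // through the file, so the "random" access is really one sequential scan.
    Text content = new Text();
    if (reader.get(key, content) != null) {
      output.collect(key, content);
    }
  }

  public void close() throws IOException {
    reader.close();                 // called once, at the end of the task
  }
}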
