hadoop-common-user mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: Help with MapReduce
Date Thu, 25 May 2006 17:26:28 GMT
I wasn't really asking about a better approach.  I was more interested in 
how you are thinking about Hadoop in terms of problem solving.  I see 
now that it is more about sustained throughput (I should have picked 
that up from the GFS paper) and that algorithms need to be coded for 
sustained throughput.  This is a different type of thinking than coding 
an algorithm for a single machine, so I am learning as I go.  Thanks for 
your help.


Doug Cutting wrote:
> Dennis Kubes wrote:
>> Ok.  This is a little different in that I need to start thinking 
>> about my algorithms in terms of sequential passes and multiple jobs 
>> instead of direct access.  That way I can use the input directories 
>> to get the data that I need.  Couldn't I also do it through the 
>> MapRunnable interface that creates a reader shared by an inner mapper 
>> class, or is that hacking the interfaces when I should be thinking 
>> about this in terms of sequential processing?
> You can do it however you like!  I don't know enough about your 
> problem to say definitively which is the best approach.  We're working 
> hard on Hadoop so that we can scalably stream data through MapReduce 
> at megabytes/second per node.  So you might do some back-of-the-envelope 
> calculations.  Figure at least 10ms per random access, so your maximum 
> random access rate might be around 100/second per drive.  Figure a 
> 10MB/second transfer rate, so if randomly accessed items are 100kB 
> each, then your maximum random access rate drops to 50 
> items/drive/second.  Since these are over the network, real performance 
> will probably be much worse.  Also, MapFile requires a scan per entry, 
> so you might really end up scanning 1MB per access, which would slow 
> random accesses to roughly 10 items/drive/second.  You might benchmark 
> your random access performance to get a better estimate, then compare 
> that to processing the whole collection through MapReduce.
> Doug
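
Doug's back-of-the-envelope numbers can be reproduced with a quick sketch. The constants below (10ms seek, 10MB/s transfer) are the figures quoted in his message, not measured values; the function name is only illustrative:

```python
# Rough drive throughput model from the figures in Doug's message.
SEEK_SECONDS = 0.010       # ~10ms per random access (seek + rotation)
TRANSFER_BPS = 10e6        # ~10MB/second sustained transfer rate

def accesses_per_second(item_bytes):
    """Estimate how many items one drive can randomly fetch per second."""
    time_per_item = SEEK_SECONDS + item_bytes / TRANSFER_BPS
    return 1.0 / time_per_item

print(accesses_per_second(0))      # seek-bound ceiling: 100.0/second
print(accesses_per_second(100e3))  # 100kB items: 50.0/second
print(accesses_per_second(1e6))    # 1MB scanned per MapFile access: ~9/second
```

The point of the comparison is that a sequential MapReduce pass pays the seek cost once per split rather than once per item, so its effective rate approaches the raw transfer rate instead of the seek-bound figures above.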
