hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Help with MapReduce
Date Thu, 25 May 2006 17:11:27 GMT
Dennis Kubes wrote:
> Ok.  This is a little different in that I need to start thinking about 
> my algorithms in terms of sequential passes and multiple jobs instead of 
> direct access.  That way I can use the input directories to get the data 
> that I need.  Couldn't I also do it through the MapRunnable interface 
> that creates a reader shared by an inner mapper class, or is that hacking 
> the interfaces when I should be thinking about this in terms of 
> sequential processing?

You can do it however you like!  I don't know enough about your problem 
to say definitively which is the best approach.  We're working hard on 
Hadoop so that we can scalably stream data through MapReduce at 
megabytes/second per node.  So you might do some back-of-the-envelope 
calculations.  Figure at least 10ms per random access.  So your maximum 
random access rate might be around 100/second per drive.  Figure a 
10MB/second transfer rate, so if randomly accessed data is 100kB each, 
then your maximum random access rate drops to 50 items/drive/second. 
Since these are over the network, real performance will probably be much 
worse.  Also, MapFile requires a scan per entry, so you might really end 
up scanning 1MB per access, which would slow random accesses to 10 
items/drive/second.  You might benchmark your random access performance 
to get a better estimate, then compare that to processing the whole 
collection through MapReduce.
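The estimates above can be sketched as a small calculation. This is a minimal Python sketch of the back-of-the-envelope arithmetic in the message; the seek time, transfer rate, and item sizes are the figures quoted there, not measured values:

```python
# Back-of-the-envelope estimate of random-access throughput per drive,
# using the figures from the message above.

SEEK_S = 0.010           # assume ~10 ms per random seek
TRANSFER_BPS = 10e6      # assume ~10 MB/second sequential transfer rate

def accesses_per_second(bytes_read_per_access):
    """Items/second per drive: each access pays one seek, then a
    sequential read of bytes_read_per_access."""
    return 1.0 / (SEEK_S + bytes_read_per_access / TRANSFER_BPS)

seek_bound = accesses_per_second(0)        # seek-limited: 100 items/s
items_100kb = accesses_per_second(100e3)   # 100kB items: 50 items/s
mapfile_scan = accesses_per_second(1e6)    # ~1MB scanned per MapFile lookup:
                                           # roughly 9 items/s (message rounds to ~10)
```

Multiply any of these per-drive rates by the number of drives for a rough cluster-wide ceiling, and remember the message's caveat that network overhead will push real numbers lower.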

Doug
