hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Very weak mapred performance on small clusters with a massive amount of small files
Date Tue, 06 Nov 2007 22:24:20 GMT
André Martin wrote:
> I was thinking of a similar solution/optimization but I have the 
> following problem:
> We have a large distributed system that consists of several 
> spider/crawler nodes - pretty much like a web crawler system - and every
> node writes its gathered data directly to the DFS. So there is no real
> possibility of bundling the data while it is written to the DFS, since
> two spiders may write some data for the same logical unit concurrently -
> if the DFS supported synchronized append writes, it would make our
> lives a little bit easier.
> However, our files are still organized in thousands of directories / a 
> pretty large directory tree since I need only certain branches for a 
> mapred operation in order to do some data mining...

Instead of organizing output into many directories, you might consider 
using keys that encode that directory structure.  MapReduce can then 
use those keys to partition output.  If you wish to mine only a subset 
of your data, you can process just those partitions that contain the 
portions of the keyspace you're interested in.
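As a rough illustration (a minimal sketch against the classic 
org.apache.hadoop.mapred API; the key layout "siteA/2007/11/06/page42" 
and the class name KeyPrefixPartitioner are assumptions for the example, 
not anything from this thread), a custom Partitioner could route records 
by the first path component encoded in the key:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  // Sketch: partition records by the directory-like prefix encoded in
  // the key, e.g. "siteA/2007/11/06/page42".  Records sharing a prefix
  // land in the same partition, so a later job can read only the output
  // files covering the branches it cares about.
  public class KeyPrefixPartitioner implements Partitioner<Text, Text> {

    public void configure(JobConf job) {
      // no configuration needed for this sketch
    }

    public int getPartition(Text key, Text value, int numPartitions) {
      String k = key.toString();
      int slash = k.indexOf('/');
      // use only the first path component of the key for partitioning
      String prefix = (slash >= 0) ? k.substring(0, slash) : k;
      return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

You would then register it on the job with something like 
conf.setPartitionerClass(KeyPrefixPartitioner.class); how fine-grained 
the prefix should be depends on which branches you typically mine together.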

Doug
