hadoop-common-user mailing list archives

From André Martin <m...@andremartin.de>
Subject Re: Very weak mapred performance on small clusters with a massive amount of small files
Date Tue, 06 Nov 2007 10:12:32 GMT
Hi Ted & Hadoopers,
I was thinking of a similar solution/optimization, but I have the
following problem:
We have a large distributed system that consists of several
spider/crawler nodes - pretty much like a web crawler system - and every
node writes its gathered data directly to the DFS. So there is no real
possibility of bundling the data while it is written to the DFS, since
two spiders may write data for the same logical unit concurrently -
if the DFS supported synchronized append writes, that would make our
lives a little bit easier.
However, our files are still organized in thousands of directories - a
pretty large directory tree - since I need only certain branches for a
mapred operation in order to do some data mining...

Cu on the 'net,
                       Bye - bye,

                                  <<<<< André <<<< >>>> èrbnA >>>>>

Ted Dunning wrote:
> If your larger run is typical of your smaller run then you have lots and
> lots of small files.  This is going to make things slow even without the
> overhead of a distributed computation.
>
> In the sequential case, enumerating the files and inefficient read patterns
> will be what slows you down.  The inefficient reads come about because the
> disk has to seek every 100KB of input.  That is bad.
>
> In the hadoop case, things are worse because opening a file takes much
> longer than with local files.
>
> The solution is for you to package your data more efficiently.  This fixes a
> multitude of ills.  If you don't mind limiting your available parallelism a
> little bit, you could even use tar files (tar isn't usually recommended
> because you can't split a tar file across maps).
>
> If you were to package 1000 files per bundle, you would get average file
> sizes of 100MB instead of 100KB and your file opening overhead in the
> parallel case would be decreased by 1000x.  Your disk read speed would be
> much higher as well because your disks would mostly be reading contiguous
> sectors.
>
> I have a system similar to yours with lots and lots of little files (littler
> than yours even).  With aggressive file bundling I can routinely process
> data at a sustained rate of 100MB/s on ten really crummy storage/compute
> nodes.  Moreover, that rate is probably not even bounded by I/O since my
> data takes a fair bit of CPU to decrypt and parse.
>   
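
(One common way to do the bundling Ted describes, without the tar-file
splitting limitation, is to pack the small files into a SequenceFile keyed
by original file name; SequenceFiles can be split across maps. The sketch
below is illustrative only - Ted doesn't say which container format he
uses - and it is written against the current SequenceFile API; the
2007-era createWriter signature takes a FileSystem argument instead of
Writer options.)

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BundleSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // One bundle holding on the order of 1000 small files: key = original
    // file name, value = raw file contents. Reading a bundle touches one
    // DFS file instead of a thousand, so the per-file open overhead and
    // the constant disk seeking largely disappear.
    Path bundle = new Path("/bundles/crawl-00001.seq");
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(bundle),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      // args[0] is a local directory of small files to bundle (illustrative).
      for (File f : new File(args[0]).listFiles()) {
        byte[] data = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(data));
      }
    }
  }
}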


