hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Very weak mapred performance on small clusters with a massive amount of small files
Date Sun, 04 Nov 2007 23:40:21 GMT

If your larger run is typical of your smaller run then you have lots and
lots of small files.  This is going to make things slow even without the
overhead of a distributed computation.

In the sequential case, enumerating the files and inefficient read patterns
will be what slows you down.  The inefficient reads come about because the
disk has to seek every 100KB of input.  That is bad.
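
A back-of-the-envelope sketch of that seek penalty follows. The 10 ms seek time
and 50 MB/s sequential transfer rate are assumed, illustrative figures for an
ordinary disk, not measurements from this thread; substitute your own. The
function name is made up for the example.

```python
def effective_throughput_mb_s(chunk_mb, seek_s=0.010, transfer_mb_s=50.0):
    """Throughput when the disk seeks once before reading each chunk.

    seek_s and transfer_mb_s are assumed values, not measured ones.
    """
    time_per_chunk = seek_s + chunk_mb / transfer_mb_s
    return chunk_mb / time_per_chunk


small = effective_throughput_mb_s(0.1)    # one seek per 100KB of input
large = effective_throughput_mb_s(100.0)  # one seek per 100MB of contiguous input
print(f"100KB chunks: {small:.1f} MB/s, 100MB chunks: {large:.1f} MB/s")
```

Under those assumptions the 100KB case gets only a fraction of the disk's
sequential rate, while the 100MB case gets nearly all of it.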

In the hadoop case, things are worse because opening a file takes much
longer than with local files.

The solution is for you to package your data more efficiently.  This fixes a
multitude of ills.  If you don't mind limiting your available parallelism a
little bit, you could even use tar files (tar isn't usually recommended
because you can't split a tar file across maps).
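
As a rough sketch of the tar approach (Python purely for illustration; the
function name, the `root` parameter, and the bundle naming scheme are all
hypothetical, not a Hadoop convention), small files could be grouped into
fixed-size bundles before loading them:

```python
import os
import tarfile


def bundle_files(paths, root, out_dir, files_per_bundle=1000):
    """Pack `paths` into tar archives of at most `files_per_bundle` members.

    Member names are stored relative to `root`. Returns the archive paths
    that were written, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    bundles = []
    for start in range(0, len(paths), files_per_bundle):
        index = start // files_per_bundle
        bundle_path = os.path.join(out_dir, f"bundle-{index:05d}.tar")
        with tarfile.open(bundle_path, "w") as tar:
            for path in paths[start:start + files_per_bundle]:
                # Keep the directory layout below root so names stay unique.
                tar.add(path, arcname=os.path.relpath(path, root))
        bundles.append(bundle_path)
    return bundles
```

Each bundle then has to be consumed whole by a single map, which is the
parallelism limit mentioned above.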

If you were to package 1000 files per bundle, you would get average file
sizes of 100MB instead of 100KB and your file opening overhead in the
parallel case would be decreased by 1000x.  Your disk read speed would be
much higher as well because your disks would mostly be reading contiguous

I have a system similar to yours with lots and lots of little files (littler
than yours even).  With aggressive file bundling I can routinely process
data at a sustained rate of 100MB/s on ten really crummy storage/compute
nodes.  Moreover, that rate is probably not even bounded by I/O since my
data takes a fair bit of CPU to decrypt and parse.

On 11/4/07 4:02 PM, "André Martin" <mail@andremartin.de> wrote:

> Hi Enis & Hadoopers,
> thanks for the hint. I created/modified my RecordReader so that it uses
> MultiFileInputSplit and reads 30 files at once (by spawning several
> threads and using a bounded buffer à la producer/consumer). The
> accumulated throughput is now about 1MB/s on my 30 MB test data (spread
> over 300 files).
> However, I noticed some other bottlenecks during job submissions - a job
> submission of 53,000 files spread over 18,150 folders takes about 1hr
> and 45 mins.
> Since all the files are spread over several thousand directories -
> listing/traversing of those directories using the listpath / globpaths
> method generates several thousand RPC calls. I think it would be more
> efficient to send the regex/path expression (the parameters) of the
> globpaths method to the server and traverse the directory tree on the
> server side instead of client side, or is there another way to retrieve
> all the file paths?
> Also, for each of my thousand files, a getBlockLocation RPC call is/was
> generated - I implemented/added a getBlockLocations[] method that
> accepts an array of paths etc. and returns a String[][][] matrix instead
> which is much more efficient than generating thousands of RPC calls
> when calling getBlockLocation in the MultiFileSplit class...
> Any thoughts/comments are much appreciated!
> Thanks in advance!
>  Cu on the 'net,
>                        Bye - bye,
>                                   <<<<< André <<<< >>>>
èrbnA >>>>>
> Enis Soztutar wrote:
>> Hi,
>> I think you should try using MultiFileInputFormat/MultiFileInputSplit
>> rather than FileSplit, since the former is optimized for processing
>> large number of files. Could you report your numMaps and numReduces and
>> the average time the map() function is expected to take?
