hadoop-common-user mailing list archives

From André Martin <m...@andremartin.de>
Subject Re: Very weak mapred performance on small clusters with a massive amount of small files
Date Sun, 04 Nov 2007 23:02:32 GMT
Hi Enis & Hadoopers,
thanks for the hint. I modified my RecordReader so that it uses 
MultiFileInputSplit and reads 30 files at once (by spawning several 
threads and using a bounded buffer à la producer/consumer). The 
aggregate throughput is now about 1 MB/s on my 30 MB of test data 
(spread over 300 files).
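
For readers who want to try the same approach, here is a minimal sketch 
of a producer/consumer reader with a bounded buffer. It is not the code 
used in the job above: the class ParallelSmallFileReader, the DONE 
sentinel, and the use of local java.nio files instead of a 
MultiFileInputSplit are all assumptions made to keep the example 
self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Hadoop: several producer threads read
// small files and one consumer drains their lines from a bounded buffer.
class ParallelSmallFileReader {
    // Sentinel compared by identity, so file content can never collide with it.
    private static final String DONE = new String("DONE");

    static List<String> readAll(List<Path> files, int threads, int bufferSize)
            throws InterruptedException, ExecutionException {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(bufferSize);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (Path file : files) {
                Callable<Void> producer = () -> {
                    try {
                        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                            buffer.put(line); // blocks while the buffer is full
                        }
                    } finally {
                        buffer.put(DONE);     // exactly one marker per file
                    }
                    return null;
                };
                futures.add(pool.submit(producer));
            }
            // Consumer: stop once every file has reported completion.
            List<String> records = new ArrayList<>();
            int finished = 0;
            while (finished < files.size()) {
                String item = buffer.take();
                if (item == DONE) {
                    finished++;
                } else {
                    records.add(item);
                }
            }
            for (Future<Void> f : futures) {
                f.get(); // surfaces any read error as an ExecutionException
            }
            return records;
        } finally {
            pool.shutdown();
        }
    }
}
```

The buffer size bounds memory use: producers stall when the consumer 
falls behind instead of loading all files at once.
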
However, I noticed some other bottlenecks during job submission - 
submitting a job with 53,000 files spread over 18,150 folders takes 
about 1 hr 45 min.
Since all the files are spread over several thousand directories, 
listing/traversing those directories using the listpath / globpaths 
methods generates several thousand RPC calls. I think it would be more 
efficient to send the regex/path expression (the parameters) of the 
globpaths method to the server and traverse the directory tree on the 
server side instead of the client side. Or is there another way to 
retrieve all the file paths?
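
To illustrate the idea, here is a minimal sketch of a glob that is 
resolved entirely on the side that owns the directory tree, so the 
client would need a single request instead of one listing per 
directory. Everything in it is an assumption for illustration: 
ServerSideGlob is a hypothetical class, and a local java.nio directory 
tree stands in for the DFS namespace. It is not an existing Hadoop API.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical server-side handler: receives the pattern, walks the tree
// locally, and returns only the matching file paths.
class ServerSideGlob {
    static List<String> glob(Path root, String pattern) throws IOException {
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + pattern);
        try (Stream<Path> tree = Files.walk(root)) {
            return tree.filter(Files::isRegularFile)
                       .map(root::relativize)     // match against paths relative to root
                       .filter(matcher::matches)
                       .map(Path::toString)
                       .sorted()
                       .collect(Collectors.toList());
        }
    }
}
```

With this shape, a call such as ServerSideGlob.glob(root, "*/*.log") 
costs one round trip regardless of how many directories are traversed.
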
Also, for each of my thousands of files, a getBlockLocation RPC call 
was generated. I added a getBlockLocations[] method that accepts an 
array of paths etc. and returns a String[][][] matrix instead, which is 
much more efficient than generating thousands of RPC calls by calling 
getBlockLocation in the MultiFileSplit class...
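
A minimal sketch of the batching idea follows. It is not the actual 
patch: FakeNameNode is a hypothetical in-memory stand-in for the name 
node, the per-file signature is an assumption, and the rpcCalls counter 
only simulates round trips. The String[][][] layout is taken to be 
[path][block][host].

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the name node; not a Hadoop class.
class FakeNameNode {
    private final Map<String, String[][]> hostsByPath = new HashMap<>();
    int rpcCalls = 0; // simulated number of client/server round trips

    void addFile(String path, String[][] hostsPerBlock) {
        hostsByPath.put(path, hostsPerBlock);
    }

    // Per-file variant: one round trip for every path.
    String[][] getBlockLocation(String path) {
        rpcCalls++;
        return lookup(path);
    }

    // Batched variant: one round trip for the whole array of paths.
    // result[i][j][k] is the k-th host of the j-th block of paths[i].
    String[][][] getBlockLocations(String[] paths) {
        rpcCalls++;
        String[][][] result = new String[paths.length][][];
        for (int i = 0; i < paths.length; i++) {
            result[i] = lookup(paths[i]);
        }
        return result;
    }

    private String[][] lookup(String path) {
        String[][] hosts = hostsByPath.get(path);
        return hosts != null ? hosts : new String[0][]; // unknown path: no blocks
    }
}
```

For n files the per-file variant costs n round trips and the batched 
variant costs one, at the price of a larger single response.
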
Any thoughts/comments are much appreciated!
Thanks in advance!

 Cu on the 'net,
                       Bye - bye,

                                  <<<<< André <<<< >>>>
èrbnA >>>>>

Enis Soztutar wrote:
> Hi,
>
> I think you should try using MultiFileInputFormat/MultiFileInputSplit 
> rather than FileSplit, since the former is optimized for processing 
> a large number of files. Could you report your numMaps and numReduces 
> and the average time the map() function is expected to take?

