hadoop-common-user mailing list archives

From Enis Soztutar <enis.soz.nu...@gmail.com>
Subject Re: Very weak mapred performance on small clusters with a massive amount of small files
Date Thu, 01 Nov 2007 10:02:49 GMT

I think you should try using MultiFileInputFormat/MultiFileInputSplit
rather than FileSplit, since the former is optimized for processing
a large number of files. Could you report your numMaps and numReduces,
and the average time the map() function is expected to take?
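
To illustrate the idea of packing many small files into a few splits, so
that each map task handles a batch of files instead of one 74kB file, here
is a minimal plain-Java sketch. It is not the actual MultiFileInputFormat
code; the class and method names (SmallFileGrouping, FileGroup, group) are
hypothetical, and it assumes the file lengths are already known from a
directory listing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: pack many small files into a bounded number of groups. */
public class SmallFileGrouping {

    /** One group of files that a single map task would process. */
    public static final class FileGroup {
        public final List<String> paths = new ArrayList<>();
        public long totalBytes = 0;
    }

    /**
     * Packs files, in listing order, into at most numGroups groups of
     * roughly equal total size. This is an assumption about how such a
     * grouping could work, not Hadoop's implementation.
     */
    public static List<FileGroup> group(List<String> paths, List<Long> lengths, int numGroups) {
        if (paths.size() != lengths.size()) {
            throw new IllegalArgumentException("paths and lengths must have the same size");
        }
        if (numGroups < 1) {
            throw new IllegalArgumentException("numGroups must be at least 1");
        }

        long total = 0;
        for (long len : lengths) {
            total += len;
        }
        // Target bytes per group, rounded up; at least 1 so empty files still pack.
        long target = Math.max(1, (total + numGroups - 1) / numGroups);

        List<FileGroup> groups = new ArrayList<>();
        FileGroup current = new FileGroup();
        for (int i = 0; i < paths.size(); i++) {
            current.paths.add(paths.get(i));
            current.totalBytes += lengths.get(i);
            // Close the group once it reaches the target, but keep the last
            // allowed group open so the result never exceeds numGroups.
            if (current.totalBytes >= target && groups.size() < numGroups - 1) {
                groups.add(current);
                current = new FileGroup();
            }
        }
        if (!current.paths.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

With something like this, the number of map tasks is bounded by numGroups
rather than by the number of files, so the per-task startup cost is paid
far less often.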

André Martin wrote:
> Hi everyone,
> we are experiencing very weak map-reduce performance on the following
> mapred cluster setup:
> - Hadoop - nightly build from 2007-10-25_17-03-53
> - 5 tasktracker-/datanodes + 1 jobtracker-/namenode
> - 3.7GB (53,050 files in 18,150 folders - avg. file size: 74kB)
> A mapred job takes up to 24 hours to complete on our cluster.
> We've measured and monitored network bandwidth, disk I/O, paging,
> swapping and CPU utilization in order to exclude those things as
> bottlenecks on our host machines / network itself. However, a closer
> look into the log files and the source code revealed the following
> things:
> During the mapping stage, we observed that each task tracker processes
> only one or (at most) two file splits at a time, which amounts
> pretty much to sequential reading/processing of 10,000+ files.
> (Reading a file off the DFS takes only a second or less, and we measured
> a relatively high throughput of ~1.5MB/s.) Increasing the
> "mapred.tasktracker.tasks.maximum" parameter from the given default
> value of 2 to 10 didn't work - each node still processes only two (at
> most) map tasks at a time...
> Another major performance gap seems to lie in the reduce copy phase:
> The load balancing / anti-swamping policy on line 972 (in
> ReduceTask.java), which guarantees that each tasktracker copies/fetches
> only one map output at a time (from the "neighboring" tasktrackers),
> causes a very low average throughput of 20kB/s or less. We disabled /
> commented out the "duplicate hosts" check and then reached a throughput
> of up to ~1MB/s on average.
> It seems like Hadoop only scales well when processing large files on
> large clusters, whereas we would like to use it for a huge number of
> small files on a small cluster... Does anyone have similar
> experiences/cluster setups?
> Any thoughts and ideas are much appreciated!
> Thanks in advance!
> Cu on the 'net,
>                        Bye - bye,
>                                   <<<<< André <<<< >>>> èrbnA >>>>>
