hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: small files and number of mappers
Date Tue, 30 Nov 2010 18:27:05 GMT
On Tue, Nov 30, 2010 at 3:21 AM, Harsh J <qwertymaniac@gmail.com> wrote:
> Hey,
> On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese <marc.sturlese@gmail.com> wrote:
>> Hey there,
>> I am doing some tests and wondering what the best practices are for dealing
>> with very small files that are continuously being generated (1 MB or even
>> less).
> Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>> I see that if I have hundreds of small files in HDFS, Hadoop automatically
>> creates A LOT of map tasks to consume them. Each map task takes 10
>> seconds or less... I don't know if it's possible to change the number of map
>> tasks from Java code using the new API (I know it can be done with the old
>> one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3.
>> This way, fewer map tasks would be instantiated and each would run
>> for longer.
> Perhaps you need to use MultiFileInputFormat:
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
> --
> Harsh J
> www.harshj.com

MultiFileInputFormat and CombineFileInputFormat help.
JVM reuse helps.
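
As a rough illustration of why the combining input formats help with the
"NumMapTasksCalculatedByHadoop * 0.3" request: once several small files can
share one split, the number of map tasks is driven mostly by the maximum
split size rather than by the file count. The sketch below only does that
arithmetic. SplitSizing and maxSplitSizeFor are made-up names for this
example, not Hadoop APIs, and the file count and sizes are assumed values.

```java
public class SplitSizing {
    // Hypothetical helper (not part of Hadoop): choose a maximum combined-split
    // size so that roughly `desiredMaps` map tasks cover `totalBytes` of input.
    static long maxSplitSizeFor(long totalBytes, int desiredMaps) {
        if (totalBytes < 0 || desiredMaps <= 0) {
            throw new IllegalArgumentException("need totalBytes >= 0 and desiredMaps > 0");
        }
        // Ceiling division, so the last split is not left over as an extra task.
        return (totalBytes + desiredMaps - 1) / desiredMaps;
    }

    public static void main(String[] args) {
        int fileCount = 300;                    // assumed: 300 small files
        long fileSize = 1024L * 1024L;          // assumed: 1 MB each
        long totalBytes = fileCount * fileSize;

        int defaultMaps = fileCount;            // one map task per small file
        // 30% of the default, using integer math to avoid rounding surprises.
        int desiredMaps = Math.max(1, defaultMaps * 3 / 10);

        long maxSplit = maxSplitSizeFor(totalBytes, desiredMaps);
        System.out.println("default maps: " + defaultMaps);
        System.out.println("desired maps: " + desiredMaps);
        System.out.println("max split size (bytes): " + maxSplit);
    }
}
```

The resulting number would then be fed to the input format's maximum split
size setting. The real task count can still differ, since the input format
also considers block locality when it builds splits.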

The larger problem is that an average NameNode with 4 GB of RAM will start
hitting JVM pauses at a relatively low number of files/blocks, say
10,000,000. Ten million is not a large number when you are generating
thousands of files a day.
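
A back-of-the-envelope check of that figure, using the rule of thumb from
the Cloudera post linked above (about 150 bytes of NameNode memory per
file, directory, or block object). This is only a sketch: the class name is
made up, directories are ignored, and it assumes every small file occupies
exactly one block.

```java
public class NameNodeHeapEstimate {
    // Rule-of-thumb assumption, not a measured value: ~150 bytes per
    // namespace object (file or block) held in NameNode memory.
    static final long BYTES_PER_OBJECT = 150L;

    // Estimate heap used by `files` files with `blocksPerFile` blocks each.
    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // file objects + block objects
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long files = 10_000_000L;                 // the 10 million from the message
        long bytes = estimateBytes(files, 1);     // assumed: one block per small file
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        System.out.printf("%d files -> about %.2f GiB of NameNode heap%n", files, gib);
    }
}
```

Under these assumptions 10 million single-block files come to about
3,000,000,000 bytes, which leaves very little headroom in a 4 GB heap.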

We open sourced a tool to deal with this problem.

Essentially it takes a pass over a directory and combines multiple
files into one. On 'hourly' directories we run it after the hour is
closed out.

V2 (which we should throw over the fence in a week or so) uses the
same techniques but will be optimized for dealing with very large
directories and/or subdirectories of varying sizes by doing more
intelligent planning and grouping of which files an individual mapper
or reducer is going to combine.
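
To make the "planning and grouping" idea concrete, here is a minimal greedy
sketch: walk the file listing in order and start a new group whenever
adding the next file would push the group past a target size. This is not
the algorithm the tool uses (the message does not describe it).
CombinePlanner, plan, and the sizes in main are hypothetical, chosen only
for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinePlanner {
    // Group file sizes (in bytes, in listing order) into batches whose total
    // stays at or below targetBytes. A single file larger than the target
    // ends up alone in its own batch.
    static List<List<Long>> plan(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current batch if this file would overflow it.
            if (!current.isEmpty() && currentBytes + size > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumed example sizes; each batch would be combined into one output file.
        List<Long> sizes = Arrays.asList(40L, 40L, 40L, 200L, 10L);
        List<List<Long>> groups = plan(sizes, 100L);
        for (int i = 0; i < groups.size(); i++) {
            System.out.println("group " + i + ": " + groups.get(i));
        }
    }
}
```

Each group would then be handed to one mapper or reducer to write out as a
single combined file. A smarter planner could sort by size or balance
groups across subdirectories.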
