hadoop-common-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: small files and number of mappers
Date Tue, 30 Nov 2010 08:21:31 GMT

On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese <marc.sturlese@gmail.com> wrote:
> Hey there,
> I am doing some tests and wondering what the best practices are for dealing
> with very small files that are continuously being generated (1 MB or even
> less).

Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

> I see that if I have hundreds of small files in HDFS, Hadoop automatically
> creates A LOT of map tasks to consume them. Each map task takes 10
> seconds or less... I don't know if it's possible to change the number of map
> tasks from Java code using the new API (I know it can be done with the old
> one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3.
> This way, fewer map tasks would be instantiated and each would run
> longer.

Perhaps you need to use MultiFileInputFormat:
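
The general idea behind a multi-file input format is to pack many small files into fewer splits, so that each map task has more work to do. Here is a rough sketch of that packing step in plain Java. It does not use any Hadoop classes; the class name SmallFilePacker, the pack method, and the 64 MB cap are all hypothetical and only there for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: shows how grouping small files
// into larger splits reduces the number of map tasks.
class SmallFilePacker {

    /** Greedily groups file sizes (in bytes) into splits of at most maxSplitBytes. */
    public static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the cap.
            // A single file larger than the cap still gets a split of its own.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] files = new long[300];   // assumed workload: 300 files of 1 MB each
        Arrays.fill(files, mb);
        List<List<Long>> splits = pack(files, 64 * mb);
        // One split per file would mean 300 map tasks; packing gives far fewer.
        System.out.println(files.length + " files -> " + splits.size() + " splits");
    }
}
```

With one split per file this workload would need 300 map tasks; with the assumed 64 MB cap it needs 5. A real input format would also have to take block locations into account, which this sketch ignores.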

Harsh J
