hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: How does FileInputFormat sub-classes handle many small files
Date Thu, 01 Sep 2011 16:58:56 GMT
Hello Per,

On Thu, Sep 1, 2011 at 2:27 PM, Per Steffensen <steff@designware.dk> wrote:
> Hi
> FileInputFormat sub-classes (TextInputFormat and SequenceFileInputFormat)
> are able to take all files in a folder and split the work of handling them
> into several sub-jobs (map-jobs). I know they can split a very big file into
> several sub-jobs, but how do they handle many small files in a folder? If
> there are 10000 small files, each with 100 data records, I would not like my
> sub-jobs to become too small (due to the overhead of starting a JVM for each
> sub-job, etc.). I would like e.g. 100 sub-jobs each handling about 10000
> data records, or maybe 10 sub-jobs each handling about 100000 data records,
> but I would not like 10000 sub-jobs each handling 100 data records. For
> this to be possible, one split (the work to be done by one sub-job) will
> have to span more than one file. My question is whether FileInputFormat
> sub-classes are able to make such splits, or whether they always create at
> least one split = sub-job = map-job per file?

You need CombineFileInputFormat for your case, not the vanilla
FileInputFormat, which creates at least one split per file.
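As a rough, untested sketch of what that looks like against the newer
(org.apache.hadoop.mapreduce) API -- CombineTextFormat and
SingleFileReader are names I made up here, the rest is the stock API,
and availability depends on your version -- you subclass
CombineFileInputFormat and hand it a record reader that delegates to a
plain LineRecordReader for each file packed into the combined split:

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputSplit;
  import org.apache.hadoop.mapreduce.RecordReader;
  import org.apache.hadoop.mapreduce.TaskAttemptContext;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
  import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;
  import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

  // Packs many small text files into each split, reading them line by line.
  public class CombineTextFormat
      extends CombineFileInputFormat<LongWritable, Text> {

    public CombineTextFormat() {
      // Cap each combined split at 64 MB so thousands of small files
      // collapse into a handful of map tasks. Tune to taste.
      setMaxSplitSize(64 * 1024 * 1024);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
        InputSplit split, TaskAttemptContext context) throws IOException {
      // CombineFileRecordReader walks the files of the combined split,
      // creating one SingleFileReader per file via reflection.
      return new CombineFileRecordReader<LongWritable, Text>(
          (CombineFileSplit) split, context, SingleFileReader.class);
    }

    // Reads the idx-th file of a combined split with a plain
    // LineRecordReader. The (split, context, index) constructor
    // signature is required by CombineFileRecordReader.
    public static class SingleFileReader
        extends RecordReader<LongWritable, Text> {
      private final LineRecordReader delegate = new LineRecordReader();
      private final CombineFileSplit split;
      private final int idx;

      public SingleFileReader(CombineFileSplit split,
          TaskAttemptContext context, Integer idx) {
        this.split = split;
        this.idx = idx;
      }

      @Override
      public void initialize(InputSplit ignored, TaskAttemptContext ctx)
          throws IOException, InterruptedException {
        // Present this one file as an ordinary FileSplit to the delegate.
        delegate.initialize(new FileSplit(split.getPath(idx),
            split.getOffset(idx), split.getLength(idx),
            split.getLocations()), ctx);
      }

      @Override
      public boolean nextKeyValue() throws IOException, InterruptedException {
        return delegate.nextKeyValue();
      }

      @Override
      public LongWritable getCurrentKey() throws IOException,
          InterruptedException {
        return delegate.getCurrentKey();
      }

      @Override
      public Text getCurrentValue() throws IOException, InterruptedException {
        return delegate.getCurrentValue();
      }

      @Override
      public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();
      }

      @Override
      public void close() throws IOException {
        delegate.close();
      }
    }
  }

Set it on the job with job.setInputFormatClass(CombineTextFormat.class).
The setMaxSplitSize() call in the constructor is what caps each combined
split, so your 10000 small files collapse into a much smaller number of
map tasks; tune the byte limit until each split carries roughly the
number of records you want.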

> Another thing is: I expect that FileInputFormat has to somehow list the
> files in the folder. How does this listing handle many, many files in the
> folder? Most OS's are bad at listing files in folders when there are a lot
> of files - at some point it becomes worse than O(n), where n is the number
> of files. Windows of course really sucks, and even Linux has problems with
> a very high number of files. How does HDFS handle listing of files in a
> folder with many, many files? Or maybe I should address this question to
> the hdfs mailing list?

Listing is a single RPC call in HDFS, so it might take some time for
millions of files under a single directory. The listing operation
itself, being done in the NameNode's memory, is fast enough; it is the
transfer of the results to the client that may take some time for a
large number of items -- and there is no way to page through the
results either. I do not think the complexity is worse than O(n) for
files under the same dir, though - only the transfer cost should be
your worry. But I have not measured these things, so I cannot give you
concrete statements on this. Might be a good exercise?
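For illustration, what the job client's directory listing boils down to
is a single listStatus() call against the NameNode (ListingCost is just
a made-up harness name; the input path comes from the command line):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListingCost {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // One RPC: the NameNode answers from memory, but the entire
      // FileStatus[] for the directory comes back in a single response,
      // which is where the transfer cost for huge directories shows up.
      FileStatus[] files = fs.listStatus(new Path(args[0]));
      System.out.println("Listed " + files.length + " files in one call");
    }
  }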

Also note that the listing is only done on the front end (at job client
submission) and not again later on, so it is a one-time cost.

Harsh J
