hadoop-mapreduce-user mailing list archives

From Per Steffensen <st...@designware.dk>
Subject Re: How does FileInputFormat sub-classes handle many small files
Date Thu, 01 Sep 2011 19:03:51 GMT
Harsh J wrote:
> Hello Per,
> On Thu, Sep 1, 2011 at 2:27 PM, Per Steffensen <steff@designware.dk> wrote:
>> Hi
>> FileInputFormat sub-classes (TextInputFormat and SequenceFileInputFormat)
>> are able to take all files in a folder and split the work of handling them
>> into several sub-jobs (map-jobs). I know it can split a very big file into
>> several sub-jobs, but how does it handle many small files in the folder. If
>> there are 10000 small files each with 100 datarecords, I would not like my
>> sub-jobs to become too small (due to the overhead of starting a JVM for each
>> sub-job etc.). I would like e.g. 100 sub-jobs each about handling 10000
>> datarecords, or maybe 10 sub-jobs each about handling 100000 datarecords,
>> but I would not like 10000 sub-jobs each about handling 100 datarecords. For
>> this to be possible one split (the work to be done by one sub-job) will have
>> to span more than one file. My question is, if FileInputFormat sub-classes
>> are able to make such splits, or if they always create at least one
>> split=sub-job=map-job per file?
> You need CombineFileInputFormat for your case, not the vanilla
> FileInputFormat which gives at least one split per file.
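For reference, plugging a combining input format into a driver might look roughly like this. This is a sketch against the old `org.apache.hadoop.mapred` API the thread is discussing; `MyCombineInputFormat` is a hypothetical subclass (see below in the thread), and the `mapred.max.split.size` property name should be verified against your Hadoop release:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Sketch of a job driver that uses a CombineFileInputFormat subclass
// (a hypothetical MyCombineInputFormat) so that many small files are
// packed into fewer, larger splits.
public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SmallFilesDriver.class);
    conf.setInputFormat(MyCombineInputFormat.class);

    // Cap the combined split size so each map task handles a bounded
    // amount of data, e.g. ~128 MB per split. The property name is an
    // assumption -- check it against your Hadoop version's docs.
    conf.setLong("mapred.max.split.size", 128L * 1024 * 1024);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    JobClient.runJob(conf);
  }
}
```

With a cap like this, 10000 files of 100 records each end up grouped into however many splits fit under the size limit, rather than 10000 one-file splits.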
Yes, I found CombineFileInputFormat. It worries me a little, though, to 
see that it extends the deprecated (old-API) FileInputFormat instead of 
the new FileInputFormat. Is that a problem?
Also I notice that CombineFileInputFormat is abstract. Why is that? Is 
the extension shown on the following webpage a good way out of this:
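(For what it's worth, the usual way to make it concrete is to supply the one abstract piece yourself: a `getRecordReader` that wraps an ordinary per-file reader in a `CombineFileRecordReader`. A rough sketch against the old `org.apache.hadoop.mapred` API follows; the exact signatures, and the convention that the delegate reader class must expose a `(CombineFileSplit, Configuration, Reporter, Integer)` constructor, are assumptions to be checked against the javadoc. `PerFileLineReader` is a hypothetical wrapper you would write around `LineRecordReader`.)

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Minimal concrete subclass of the (old-API) CombineFileInputFormat.
// It packs many small text files into larger splits and lets
// CombineFileRecordReader walk the files of a split one at a time.
public class MyCombineInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  @SuppressWarnings({"unchecked", "rawtypes"})
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter)
      throws IOException {
    // CombineFileRecordReader instantiates one delegate reader per
    // file in the combined split; PerFileLineReader is the per-file
    // reader class you provide.
    return new CombineFileRecordReader<LongWritable, Text>(
        job, (CombineFileSplit) split, reporter,
        (Class) PerFileLineReader.class);
  }
}
```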
>> Another thing is: I expect that FileInputFormat has to somehow list the
>> files in the folder. How does this listing handle many, many files in
>> the folder? Most OSes are bad at listing files in folders when there
>> are a lot of files - at some point it becomes worse than O(n), where n
>> is the number of files. Windows of course really sucks at it, and even
>> Linux has problems with very high numbers of files. How does HDFS
>> handle listing the files in a folder with many, many files? Or maybe I
>> should address this question to the hdfs mailing list?
> Listing would be one RPC call in HDFS, so it might take some time for
> millions of files under a single directory. Although the listing
> operation itself, done in the NameNode's memory, would be fast enough,
> transferring the results to the client for a large number of items
> may take some time -- and there is no way to page through them either.
> I do not think the complexity is worse than O(n), though, for files
> under the same dir - only transfer costs should be your worry. But I
> have not measured these things, so I cannot give you concrete numbers.
> Might be a good exercise?
It would be nice to have some numbers on this, yes. But we will run our 
own performance tests anyway.
> Also know that listing is only done on the front end (job client
> submission) and not later on. So it is just a one time cost.
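Measuring that one-time cost is straightforward, since the client-side listing boils down to a single FileSystem call. A minimal sketch (timings will be dominated by the RPC transfer, as noted above; the input path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Time the one directory listing the job client performs at submission.
public class ListingTimer {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    long start = System.currentTimeMillis();
    FileStatus[] files = fs.listStatus(new Path(args[0])); // single RPC
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(files.length + " files listed in " + elapsed + " ms");
  }
}
```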
