hadoop-common-user mailing list archives

From Ian Soboroff <ian.sobor...@nist.gov>
Subject Re: FileInputFormat directory traversal
Date Tue, 03 Feb 2009 21:43:04 GMT
Hmm.  Based on your reasons, an extension to FileInputFormat for the  
lib package seems more in order.

I'll try to hack something up and file a Jira issue.
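
For illustration only, here is a minimal sketch of the opt-in recursive enumeration discussed in this thread. It is not the eventual patch: it uses plain java.nio on the local filesystem rather than the Hadoop FileSystem API, and the class name RecursiveInputLister and the recursive flag are hypothetical. The filter copies the rule FileInputFormat uses for hidden files (names starting with "_" or "."). As a simplification, the non-recursive mode just skips subdirectories.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop: lists input files on the local filesystem.
final class RecursiveInputLister {

    // Same rule as FileInputFormat's hidden-file filter: skip "_" and "." names.
    static boolean isVisible(Path p) {
        String name = p.getFileName().toString();
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Returns the visible files under root. With recursive == false only the
    // first level is listed (the backward-compatible default); with
    // recursive == true subdirectories are walked depth-first.
    static List<Path> listInputs(Path root, boolean recursive) throws IOException {
        List<Path> out = new ArrayList<>();
        collect(root, recursive, out);
        Collections.sort(out); // directory order is unspecified, so sort for stable output
        return out;
    }

    private static void collect(Path dir, boolean recursive, List<Path> out) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (!isVisible(entry)) {
                    continue; // hidden files and directories are never inputs
                }
                if (Files.isDirectory(entry)) {
                    if (recursive) {
                        collect(entry, true, out); // descend before moving on
                    }
                } else {
                    out.add(entry);
                }
            }
        }
    }
}
```

Given an input directory foo holding a.txt and bar/b.txt, listInputs(foo, false) returns only foo/a.txt, while listInputs(foo, true) also returns foo/bar/b.txt. Keeping the flag off by default preserves the existing behaviour.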


On Feb 3, 2009, at 4:28 PM, Doug Cutting wrote:

> Hi, Ian.
> One reason is that a MapFile is represented by a directory  
> containing two files named "index" and "data".   
> SequenceFileInputFormat handles MapFiles too: if an input path is  
> a directory containing a data file, it uses that file.
> Another reason is that this is what reduces generate: a single flat
> directory of output files.
> Neither reason implies that this is the best or only way of doing  
> things.  It would probably be better if FileInputFormat optionally  
> supported recursive file enumeration.  (It would be incompatible and  
> thus cannot be the default mode.)
> Please file an issue in Jira for this and attach your patch.
> Thanks,
> Doug
> Ian Soboroff wrote:
>> Is there a reason FileInputFormat only traverses the first level of  
>> directories in its InputPaths?  (i.e., given an InputPath of 'foo',  
>> it will get foo/* but not foo/bar/*).
>> I wrote a full depth-first traversal in my custom InputFormat which  
>> I can offer as a patch.  But to do it I had to duplicate the  
>> PathFilter classes in FileInputFormat which are marked private, so  
>> a mainline patch would also touch FileInputFormat.
>> Ian
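
Doug's point about MapFiles can be sketched in a few lines. This is a simplified illustration on the local filesystem, not the SequenceFileInputFormat source: the class MapFileInputResolver and its resolve method are hypothetical names, and only the "data" file name comes from the message above.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Hadoop: maps an input path to the file to read.
final class MapFileInputResolver {

    // A MapFile is a directory holding two files named "index" and "data".
    static final String DATA_FILE_NAME = "data";

    // If input is a directory containing a "data" file, return that file;
    // otherwise return the input path unchanged.
    static Path resolve(Path input) {
        if (Files.isDirectory(input)) {
            Path data = input.resolve(DATA_FILE_NAME);
            if (Files.isRegularFile(data)) {
                return data;
            }
        }
        return input;
    }
}
```

This shows why a blanket recursive default would be incompatible: a directory that looks like a MapFile has to be treated as one input, not descended into and split into "index" and "data".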
