hadoop-common-dev mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: Input paths
Date Mon, 16 Oct 2006 16:43:30 GMT
InputFormatBase is used by some of the other input formats such as 
SequenceFileInputFormat so changing it there will affect those other 
classes as well.  I don't know if that is what you want or not.  I would 
probably extend TextInputFormat (assuming the files are text logs 
such as Apache logs and not XML files) and override 
areValidInputDirectories to check for files in the directories, and 
getSplits to return splits for only the files that you want to process.
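
As a rough illustration of the filtering step only, here is a minimal sketch in plain Java. It deliberately uses java.nio.file rather than Hadoop's own file system classes, because the exact signatures of areValidInputDirectories and getSplits are not shown in this thread. The class InputFileSelector, its methods selectFiles and areValidInputs, and the idea of passing the wanted file names as a set are all hypothetical names made up for the example. An actual TextInputFormat subclass would run equivalent logic inside its overrides.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper, NOT part of Hadoop. It only illustrates the selection
// logic that overridden areValidInputDirectories/getSplits methods could apply.
class InputFileSelector {

    // Returns the regular files directly under inputDirs whose names appear
    // in wantedNames, in sorted order. Subdirectories are not descended into.
    static List<Path> selectFiles(List<Path> inputDirs, Set<String> wantedNames)
            throws IOException {
        List<Path> selected = new ArrayList<>();
        for (Path dir : inputDirs) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    String name = entry.getFileName().toString();
                    if (Files.isRegularFile(entry) && wantedNames.contains(name)) {
                        selected.add(entry);
                    }
                }
            }
        }
        Collections.sort(selected);
        return selected;
    }

    // Validation analogue: each input path must be a directory that holds
    // at least one of the wanted files.
    static boolean areValidInputs(List<Path> inputDirs, Set<String> wantedNames)
            throws IOException {
        for (Path dir : inputDirs) {
            if (!Files.isDirectory(dir)) {
                return false;
            }
            if (selectFiles(Collections.singletonList(dir), wantedNames).isEmpty()) {
                return false;
            }
        }
        return true;
    }
}
```

The job's input path stays the parent directory, so the existing directory check still passes, and only the selected files would be turned into splits.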


Vetle Roeim wrote:
> On Mon, 16 Oct 2006 16:24:15 +0200, Dennis Kubes 
> <nutch-dev@dragonflymc.com> wrote:
>> You could write your own InputFormat implementation that would check 
>> files instead of directories (perhaps passing in the parent directory 
>> of the files).
> Oh, so this restriction is in the InputFormat? I see in 
> InputFormatBase.getSplits that the code just goes through input 
> directories and gets all the files there. Would it be ok to just 
> modify the code to handle files as well?
> The use case for this is if you have a directory containing multiple 
> files, but only want to operate on a few of those. In my case I have 
> log files from several servers, and while jobs are usually run on all 
> log files, this time I only want to run jobs on a subset.
>> We just did something similar to this for reading index files as an 
>> InputFormat.
>> Dennis
>> Vetle Roeim wrote:
>>> It seems that input to jobs is restricted to directories, and it is 
>>> impossible to add individual files -- JobConf calls 
>>> InputFormatBase.areValidInputDirectories, which checks that each 
>>> input path is a directory.
>>> Why is this required? Is it possible to change it or work around it 
>>> (without copying the files into a separate directory)?
>>> Thanks,
>>> --Vetle Roeim
>>> Opera Software ASA <URL: http://www.opera.com/ >
> --Vetle Roeim
> Team Manager, Information Systems
> Opera Software ASA <URL: http://www.opera.com/ >
