hadoop-common-dev mailing list archives

From "Vetle Roeim" <vet...@opera.com>
Subject Re: Input paths
Date Mon, 16 Oct 2006 14:41:13 GMT
On Mon, 16 Oct 2006 16:24:15 +0200, Dennis Kubes  
<nutch-dev@dragonflymc.com> wrote:

> You could write your own InputFormat implementation that would check  
> files instead of directories (perhaps passing in the parent directory of  
> the files).

Oh, so this restriction is in the InputFormat? I see in  
InputFormatBase.getSplits that the code just goes through input  
directories and gets all the files there. Would it be ok to just modify  
the code to handle files as well?
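
A minimal sketch of that change, written against java.nio.file on the local file system rather than Hadoop's own file system classes. The class name InputPathExpander and the method expand are hypothetical and are not taken from InputFormatBase. A directory input contributes the files directly inside it, and a file input is accepted as-is:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: it uses java.nio.file on the local
// file system to illustrate accepting both directories and single files.
class InputPathExpander {

    // Returns the regular files named by the inputs. A directory contributes
    // the regular files directly inside it (no recursion, sorted by path);
    // a regular file contributes itself.
    static List<Path> expand(List<Path> inputs) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Path input : inputs) {
            if (Files.isDirectory(input)) {
                // Close the directory stream once its entries are collected.
                try (Stream<Path> children = Files.list(input)) {
                    children.filter(Files::isRegularFile)
                            .sorted()
                            .forEach(files::add);
                }
            } else if (Files.isRegularFile(input)) {
                files.add(input);
            } else {
                throw new IOException("Input path does not exist: " + input);
            }
        }
        return files;
    }
}
```

With this shape, passing a directory keeps the current behaviour, while passing individual file paths selects just those files without copying them anywhere.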

The use case is a directory that contains multiple files, of which you
only want to operate on a few. In my case I have log files from several
servers, and while jobs are usually run on all log files, this time I
only want to run jobs on a subset.

> We just did something similar to this for reading index files as an  
> InputFormat.
> Dennis
> Vetle Roeim wrote:
>> It seems that input to jobs is restricted to directories, and it is  
>> impossible to add individual files -- JobConf calls  
>> InputFormatBase.areValidInputDirectories, which checks that each input  
>> path is a directory.
>> Why is this required? Is it possible to change it or work around it  
>> (without copying the files into a separate directory)?
>> Thanks,
>> --Vetle Roeim
>> Opera Software ASA <URL: http://www.opera.com/ >

Vetle Roeim
Team Manager, Information Systems
Opera Software ASA <URL: http://www.opera.com/ >
