hadoop-common-dev mailing list archives

From "Vetle Roeim" <vet...@opera.com>
Subject Re: Input paths
Date Fri, 20 Oct 2006 08:58:59 GMT
On Mon, 16 Oct 2006 18:43:30 +0200, Dennis Kubes  
<nutch-dev@dragonflymc.com> wrote:

> InputFormatBase is used by some of the other input formats such as  
> SequenceFileInputFormat so changing it there will affect those other  
> classes as well.  I don't know if that is what you want or not.  I would  
> probably extend TextInputFormat (assuming the files are in text logs  
> such as apache logs and not xml files) and override the  
> areValidInputDirectories to check for files in the directories and the  
> getSplits to return splits with only the files that you want to process.

Thanks for your suggestions ... Here's how I did it:

* When configuring the job, individual files are added with  
* The input format is set to TextFileInputFormat, a subclass of
  TextInputFormat
* TextFileInputFormat overrides the following methods:

   - areValidInputDirectories: overridden to return true even if one of
     the input paths is a file.
   - listPaths: in InputFormatBase, this method simply returns a list of
     all the files in the input directories. I overrode it to test whether
     each input path is a directory or a file, and to add the input path
     directly if it's a file.

This lets the input paths be both files and directories, and it seems to
work like a charm.

Vetle Roeim
Team Manager, Information Systems
Opera Software ASA <URL: http://www.opera.com/ >
