lucene-dev mailing list archives

From Erik Hatcher <>
Subject Re: IFilter
Date Tue, 29 Apr 2003 01:18:17 GMT
On Monday, April 28, 2003, at 05:03  PM, <> 
> We _could_ have one which returns a Document, but I'm thinking
> something even more specific to the nature of an IFilter, ie
> returning a Reader. Since the Field.Text(field, Reader) method really
> had the notion of adding the contents of a File in mind, I feel an
> IFilter should return a Reader too, so

The thing about returning a Reader is that it's specific to a 
particular field.  An actual Word, PDF, HTML, XML, or other document 
type really needs to be broken into fields to be super useful, rather 
than having the text contents all jammed into a single field.  Pulling 
out the file system date, author, directory path perhaps, and other 
semantic data would be more robust.

How are you proposing that this concept be wrapped to include multiple 
fields that might exist within a single "document"?
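
To make the multi-field point concrete, here is a minimal sketch of pulling several fields out of one file. SimpleDoc is a hypothetical stand-in for a Lucene Document, and the field names ("path", "directory", "modified", "contents") are illustrative assumptions, not anything proposed in this thread. A real handler for Word or PDF would also need a format-specific parser to get at author and similar metadata.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Lucene Document: an ordered map of field name to value.
final class SimpleDoc {
    private final Map<String, String> fields = new LinkedHashMap<>();

    void add(String name, String value) { fields.put(name, value); }
    String get(String name) { return fields.get(name); }
    int size() { return fields.size(); }
}

final class MultiFieldExtractor {
    // Splits one file into several fields instead of a single "contents" blob.
    static SimpleDoc extract(File file) throws IOException {
        SimpleDoc doc = new SimpleDoc();
        doc.add("path", file.getPath());
        doc.add("directory", file.getParent() == null ? "" : file.getParent());
        doc.add("modified", Long.toString(file.lastModified()));
        // Assumes a small UTF-8 text file; binary formats need a real parser here.
        doc.add("contents",
                new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8));
        return doc;
    }
}
```

With this shape, the body text is just one field among several, and each of the others can be searched or filtered on its own.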

> My code uses a ContentHandlerPicker to determine which ContentHandler
> to use. This is pluggable. It's a simple interface.
> public interface ContentHandlerPicker
> {
>     ContentHandler getContentHandler(File f);
> }

Sure, no problems here.  My Ant contribution to the sandbox allows the 
build file writer to pick a different document handler, and the default 
example one is a FileExtensionDocumentHandler that simply picks the 
HtmlDocument implementation for .htm/.html files, and the TextDocument 
implementation otherwise.  It's similar to what I implemented:

public interface DocumentHandler {
     public Document getDocument(File file)
                        throws DocumentHandlerException;
}
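
The extension-based selection described above can be sketched roughly as follows. This is not the sandbox code: ExtensionDispatchingHandler is a hypothetical name, a document is simplified to a Map of field name to value, and the checked DocumentHandlerException is left out to keep the example short.

```java
import java.io.File;
import java.util.Locale;
import java.util.Map;

// Hypothetical simplification of the DocumentHandler idea:
// a "document" here is just a map of field name to value.
interface DocumentHandler {
    Map<String, String> getDocument(File file);
}

// Picks a delegate by file extension: the HTML handler for .htm/.html,
// the plain-text handler for everything else.
final class ExtensionDispatchingHandler implements DocumentHandler {
    private final DocumentHandler htmlHandler;
    private final DocumentHandler textHandler;

    ExtensionDispatchingHandler(DocumentHandler htmlHandler, DocumentHandler textHandler) {
        this.htmlHandler = htmlHandler;
        this.textHandler = textHandler;
    }

    @Override
    public Map<String, String> getDocument(File file) {
        // Compare case-insensitively so INDEX.HTML is treated like index.html.
        String name = file.getName().toLowerCase(Locale.ROOT);
        boolean isHtml = name.endsWith(".htm") || name.endsWith(".html");
        return (isHtml ? htmlHandler : textHandler).getDocument(file);
    }
}
```

Because the dispatcher is itself a DocumentHandler, it can be swapped for any other implementation without the caller changing.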

> So usage is something like
> document.add(Field.Text("fileContents",
> ContentHandlerFacade.getReader(file, aContentHandlerPicker)));

Again, this is so field specific though.  Perhaps your implementation 
is something that goes in a finer tuned space under my more generic 
idea for capturing an entire Document from a File (or VFS or URL, or 
wherever).  Also, I think the location of a "file" needs to be made 
generic, and Commons VFS seems to be the likely candidate currently.
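
One way the two ideas might fit together, as a sketch only: a field-level piece that yields a Reader sits underneath a document-level handler, which adds that Reader as one field among several. All names here (ContentReaderSource, ResourceDocumentHandler, ReaderBackedDocumentHandler) are hypothetical, the document is simplified to a Map, and a plain String location stands in for whatever generic abstraction (File, URL, or a VFS handle) would eventually be chosen.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical field-level piece: yields a Reader over the body text of a resource.
interface ContentReaderSource {
    Reader open(String location) throws IOException;
}

// Hypothetical document-level piece: yields every field for a resource.
interface ResourceDocumentHandler {
    Map<String, Object> getDocument(String location) throws IOException;
}

// Wraps the field-level source so its Reader becomes one field among several.
final class ReaderBackedDocumentHandler implements ResourceDocumentHandler {
    private final ContentReaderSource contents;

    ReaderBackedDocumentHandler(ContentReaderSource contents) {
        this.contents = contents;
    }

    @Override
    public Map<String, Object> getDocument(String location) throws IOException {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("location", location);                     // metadata field
        doc.put("fileContents", contents.open(location));  // Reader-valued field
        return doc;
    }
}
```

The Reader-returning code stays reusable on its own, while callers that want a whole document only deal with the outer handler.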

More thoughts?

