lucene-dev mailing list archives

From <kelvin-li...@relevanz.com>
Subject Re: IFilter
Date Mon, 28 Apr 2003 21:03:15 GMT
On Mon, 28 Apr 2003 05:27:13 -0400, Erik Hatcher wrote:
>On Sunday, April 27, 2003, at 09:53  PM, Kelvin Tan wrote:
>>Anyone think there's potential in something like MS Index Server's
>>IFilter concept for lucene?
>
>Absolutely.
>
>The indyo project in the sandbox as well as my ant code have the
>concept of a DocumentHandler that is pluggable.

Yeah, the trouble with Indyo was that it tried to be much more than 
that by also including the indexing mechanism itself. And the code 
(in the Sandbox at least) didn't have an elegant way of handling 
archives (zip, tar, gzip). That's mostly because I've been too lazy 
to update it, and there hasn't been a great deal of interest in Indyo.

That's changed now: I added some supporting code that handles 
archives somewhat more gracefully by decompressing the archive into a 
temp directory and indexing that directory.
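For illustration, the temp-directory approach could look roughly like the sketch below. This is not the actual Indyo code: the class name ArchiveUnpacker and the method unpackZip are hypothetical, it only covers zip (tar and gzip would need their own readers), and it uses java.nio.file, which is newer than the Java of this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Indyo or Lucene.
final class ArchiveUnpacker {
    private ArchiveUnpacker() {}

    /** Extracts a zip archive into a fresh temp directory and returns that directory. */
    static Path unpackZip(Path archive) throws IOException {
        Path target = Files.createTempDirectory("unpacked-archive")
                           .toAbsolutePath().normalize();
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = target.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would land outside the temp directory.
                if (!out.startsWith(target)) {
                    throw new IOException("Entry outside target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies the bytes of the current entry only; the zip stream stays open.
                    Files.copy(zip, out);
                }
            }
        }
        return target;
    }
}
```

The returned directory can then be walked and indexed like any other directory, and deleted afterwards.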

>
>I think this idea has been discussed on the list a long while back
>too.
>
>We really only need an interface that has a method which returns a
>Document, right?  In my ant project (in the sandbox also), it takes
>a java.io.File, but this should be made more generic (perhaps using
>Commons VFS API?).  Thoughts on what that interface should look
>like?

We _could_ have one which returns a Document, but I'm thinking of 
something even more specific to the nature of an IFilter, i.e. 
returning a Reader. Since the Field.Text(field, Reader) method really 
had the notion of adding the contents of a File in mind, I feel an 
IFilter should return a Reader too, so:

import java.io.Reader;

public interface ContentHandler
{
    boolean isContainer();

    // Reader over the content, to be passed to Field.Text(field, Reader).
    Reader getReader();
}

My code uses a ContentHandlerPicker to determine which ContentHandler 
to use. The picker is pluggable, and it's a simple interface:

import java.io.File;

public interface ContentHandlerPicker
{
    ContentHandler getContentHandler(File f);
}

So usage is something like 

document.add(Field.Text("fileContents", 
ContentHandlerFacade.getReader(file, aContentHandlerPicker)));
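To make the wiring concrete, here is a minimal sketch of how the facade and a picker might fit together. Only the two interfaces and the ContentHandlerFacade.getReader(file, picker) call shape come from this message; PlainTextHandler, ExtensionPicker, the extension-based lookup, the UTF-8 assumption and the error handling are all hypothetical, and container handling is left out.

```java
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Locale;

interface ContentHandler {
    boolean isContainer();
    Reader getReader();
}

interface ContentHandlerPicker {
    ContentHandler getContentHandler(File f);
}

// Hypothetical handler: treats the file as plain UTF-8 text.
final class PlainTextHandler implements ContentHandler {
    private final File file;

    PlainTextHandler(File file) { this.file = file; }

    public boolean isContainer() { return false; }

    public Reader getReader() {
        try {
            return Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // The interface declares no checked exception, so wrap it.
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical picker: chooses a handler by file extension only.
final class ExtensionPicker implements ContentHandlerPicker {
    public ContentHandler getContentHandler(File f) {
        String name = f.getName().toLowerCase(Locale.ROOT);
        return name.endsWith(".txt") ? new PlainTextHandler(f) : null;
    }
}

// Assumed shape of the facade: ask the picker, then delegate to the handler.
final class ContentHandlerFacade {
    private ContentHandlerFacade() {}

    static Reader getReader(File file, ContentHandlerPicker picker) {
        ContentHandler handler = picker.getContentHandler(file);
        if (handler == null) {
            throw new IllegalArgumentException("No ContentHandler for " + file);
        }
        return handler.getReader();
    }
}
```

Supporting a new file type then means adding one handler and teaching the picker about it; the indexing code that calls the facade stays the same.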

Right now, I wish I could accept an InputStream in addition to a 
File. But that invariably involves some intelligent algorithm or 
3rd-party lib (like NGramJ? :-) to determine which IFilter to use 
based on the InputStream, e.g. by detecting magic numbers, and 
that's over my head, I think. I'm assuming, of course, that clients 
are not explicitly specifying which IFilter to use, which doesn't 
necessarily have to be true. We could have 
ContentHandlerFacade.getReader(inputStream, aContentHandler) I guess.
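For what it's worth, basic magic-number detection on a stream is not much code. The sketch below peeks at the first four bytes and rewinds, so the chosen handler still sees the whole stream. MagicSniffer and Kind are hypothetical names; the three signatures are the well-known ones for PDF ("%PDF"), zip ("PK" 0x03 0x04) and gzip (0x1F 0x8B). It needs Java 9+ for readNBytes, and formats without a fixed leading signature (plain text, for one) would need something smarter.

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Hypothetical helper, not part of Lucene or Indyo.
final class MagicSniffer {
    enum Kind { PDF, ZIP, GZIP, UNKNOWN }

    private static final int[] PDF  = {0x25, 0x50, 0x44, 0x46}; // "%PDF"
    private static final int[] ZIP  = {0x50, 0x4B, 0x03, 0x04}; // "PK\3\4"
    private static final int[] GZIP = {0x1F, 0x8B};

    private MagicSniffer() {}

    /** Peeks at the leading bytes and resets the stream to where it started. */
    static Kind sniff(BufferedInputStream in) throws IOException {
        byte[] head = new byte[4];
        in.mark(head.length);
        int n = in.readNBytes(head, 0, head.length); // may be < 4 on short input
        in.reset();
        if (startsWith(head, n, PDF))  return Kind.PDF;
        if (startsWith(head, n, ZIP))  return Kind.ZIP;
        if (startsWith(head, n, GZIP)) return Kind.GZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] head, int available, int[] magic) {
        if (available < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare as unsigned values (Java bytes are signed).
            if ((head[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A picker that accepts an InputStream could wrap it in a BufferedInputStream, call sniff, and map the resulting Kind to a ContentHandler.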

Kelvin




