Message-Id: <200304282049.h3SKn7wv003907@smtp12.singnet.com.sg>
To: Lucene Developers List
Reply-To: "Lucene Developers List"
List-Id: "Lucene Developers List"
Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Date: Tue, 29 Apr 2003 05:03:15 +0800
In-Reply-To: <9855C37A-795B-11D7-9DE9-000393A564E6@ehatchersolutions.com>
Subject: Re: IFilter

On Mon, 28 Apr 2003 05:27:13 -0400, Erik Hatcher wrote:
>On Sunday, April 27, 2003, at 09:53 PM, Kelvin Tan wrote:
>>Anyone think there's potential in something like MS Index Server's
>>IFilter
>>concept for lucene?
>
>Absolutely.
>
>The indyo project in the sandbox as well as my ant code have the
>concept of a DocumentHandler that is pluggable.

Yeah, the trouble with Indyo was that it was trying to be much more than
that, by actually including the indexing mechanism as well. And the code
(in the Sandbox at least) didn't have an elegant way of handling archives
(zip, tar, gzip). But that's mostly because I'm too lazy to update it, and
there hasn't been a great deal of interest in Indyo.

That's changed now, because I added some supporting code to handle
archives somewhat more gracefully by decompressing the archive into a
temp directory and indexing that directory.

>
>I think this idea has been discussed on the list a long while back
>too.
>
>We really only need an interface that has a method which returns a
>Document, right? In my ant project (in the sandbox also), it takes
>a
>java.io.File, but this should be made more generic (perhaps using
>Commons VFS API?). Thoughts on what that interface should look like?

We _could_ have one which returns a Document, but I'm thinking of
something even more specific to the nature of an IFilter, i.e. returning
a Reader. Since the Field.Text(field, Reader) method really had the
notion of adding the contents of a File in mind, I feel an IFilter should
return a Reader too, so

public interface ContentHandler
{
    boolean isContainer();
    Reader getReader();
}

My code uses a ContentHandlerPicker to determine which ContentHandler to
use. This is pluggable. It's a simple interface.

public interface ContentHandlerPicker
{
    ContentHandler getContentHandler(File f);
}

So usage is something like

document.add(Field.Text("fileContents",
    ContentHandlerFacade.getReader(file, aContentHandlerPicker)));

Right now, I wish I could accept an InputStream in addition to a File,
but that invariably involves using some intelligent algorithm/3rd-party
lib (like NGramJ?
:-) to determine which IFilter to use based on the InputStream, like
detecting magic numbers or something, and that's something over my head,
I think. I'm assuming, of course, that clients are not explicitly
specifying which IFilter to use, which doesn't necessarily have to be
true. We could have

ContentHandlerFacade.getReader(inputStream, aContentHandler)

I guess.

Kelvin
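
For what it's worth, the magic-number idea mentioned above might be sketched roughly like this. It is only an illustration under stated assumptions: MagicSniffer and detect() are hypothetical names, not part of Lucene, Indyo, or NGramJ, and only three well-known signatures (ZIP, GZIP, PDF) are checked. The stream is peeked at with mark/reset, so whichever handler gets picked afterwards still sees the content from the first byte.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not in Lucene or Indyo): guesses a content type from
// the leading "magic" bytes of a stream without consuming them.
final class MagicSniffer {

    // Well-known file signatures.
    private static final byte[] ZIP  = {0x50, 0x4B, 0x03, 0x04}; // "PK\3\4"
    private static final byte[] GZIP = {0x1F, (byte) 0x8B};
    private static final byte[] PDF  = {0x25, 0x50, 0x44, 0x46}; // "%PDF"

    // Number of leading bytes to inspect (length of the longest signature).
    private static final int PEEK = 4;

    /** Returns "zip", "gzip", "pdf" or "unknown". The stream must support mark/reset. */
    static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PEEK);
        byte[] head = new byte[PEEK];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (n < PEEK) {
            int r = in.read(head, n, PEEK - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        in.reset(); // rewind so the chosen handler sees the whole stream

        if (startsWith(head, n, ZIP)) return "zip";
        if (startsWith(head, n, GZIP)) return "gzip";
        if (startsWith(head, n, PDF)) return "pdf";
        return "unknown";
    }

    // True if the first len bytes of head begin with the given signature.
    private static boolean startsWith(byte[] head, int len, byte[] magic) {
        if (len < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        InputStream pdf = new BufferedInputStream(
            new ByteArrayInputStream("%PDF-1.4".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(detect(pdf));        // prints pdf
        System.out.println((char) pdf.read());  // prints %, the stream was rewound
    }
}
```

A stream-based picker could map the returned label to a ContentHandler the same way the File-based picker maps a file, and fall back to an explicitly supplied handler when the label is "unknown".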