lucene-dev mailing list archives

From Kelvin Tan <kelvin-li...@relevanz.com>
Subject Re: IFilter
Date Tue, 29 Apr 2003 02:18:45 GMT


On Mon, 28 Apr 2003 21:18:17 -0400, Erik Hatcher said:
>On Monday, April 28, 2003, at 05:03  PM, <kelvin-lists@relevanz.com> 
>wrote:
>>
>>We _could_ have one which returns a Document, but I'm thinking
>>something even more specific to the nature of an IFilter, ie
>>returning a Reader. Since the Field.Text(field, Reader) method
>>really had the notion of adding the contents of a File in mind,
>>I feel an IFilter should return a Reader too, so
>
>The thing about this returning a Reader is that it's specific to a
>particular field.  An actual Word, PDF, HTML, XML, or other document
>type really needs to be broken into fields to be super useful,
>rather than just having the text contents all jammed into a single
>field.  Pulling out the file system date, author, directory path
>perhaps, and other semantic data would be more robust.
>
>How are you proposing that this concept be wrapped to include
>multiple fields that might exist within a single "document"?

Ah, you raise a very valid point. I had overlooked this because I had no need 
for it in my own usage. I see three ways we can approach this:

1. Document getDocument(), as you've implemented/suggested.
2. void addToDocument(Document doc): the caller passes the created Document in, 
so the IFilter is focused on extracting the requisite fields and adding them 
to the Document, and is not concerned with Document creation (important in my 
usage).
3. Metadata getMetadata(), which returns the file's metadata. This is the most 
flexible in my opinion, because clients can control what they want to name the 
fields without coding a separate IFilter just for that purpose, or do whatever 
else they want with the metadata. This assumes that we have both document 
metadata and document contents (which may or may not be valid for XML 
documents at least). The downside of this approach is that it's less efficient 
(as a result of its flexibility).
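
To make the three options concrete, here is a rough Java sketch of how the 
signatures might look side by side, plus a toy client for option 3. 
Everything in it is an assumption for illustration: SketchDocument is a 
stand-in for a Lucene Document (not the real class), the interface names, 
the getContents() method and the use of a plain Map for the metadata are 
all hypothetical, and the hard-coded values are made up.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Lucene Document: just named text fields.
class SketchDocument {
    final Map<String, String> fields = new LinkedHashMap<>();
    void add(String name, String value) { fields.put(name, value); }
}

// Option 1: the filter creates and returns the whole document.
interface DocumentFilter {
    SketchDocument getDocument();
}

// Option 2: the caller owns the document; the filter only adds fields.
interface AppendingFilter {
    void addToDocument(SketchDocument doc);
}

// Option 3: the filter exposes raw metadata plus a content Reader,
// and the caller decides how to name and store the fields.
interface MetadataFilter {
    Map<String, String> getMetadata();
    Reader getContents();
}

// A toy option-3 filter with hard-coded values, for illustration only.
class ToyFilter implements MetadataFilter {
    public Map<String, String> getMetadata() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("author", "Jane Doe");
        m.put("path", "/docs/report.txt");
        return m;
    }
    public Reader getContents() { return new StringReader("body text"); }
}

class FilterSketch {
    // Client-side mapping: copy metadata into a document, renaming keys
    // where the caller supplied a field name and keeping the key otherwise.
    static SketchDocument toDocument(MetadataFilter filter,
                                     Map<String, String> fieldNames) {
        SketchDocument doc = new SketchDocument();
        for (Map.Entry<String, String> e : filter.getMetadata().entrySet()) {
            String name = fieldNames.getOrDefault(e.getKey(), e.getKey());
            doc.add(name, e.getValue());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> names = new LinkedHashMap<>();
        names.put("author", "creator"); // index "author" under "creator"
        SketchDocument doc = toDocument(new ToyFilter(), names);
        System.out.println(doc.fields);
    }
}
```

The point of the toy client is that with option 3 the field naming lives in 
the caller, so the same filter can serve clients with different index 
schemas. The cost is the extra copy from the metadata into the Document.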

What do you think?

<snip>
>is something that goes in a finer tuned space under my more generic
>idea for capturing an entire Document from a File (or VFS or URL, or
>wherever)  

+1

>Also, I think the location of a "file" needs to be made
>generic, and Commons VFS seems to be the likely candidate currently.
>

I like the idea of using Commons VFS, although my feeling is that it's not 
necessary at the moment. Ideally, I'd want to have minimal (or no!) 
dependencies on external libs, since this is really quite a simple API. Do you 
feel really strongly about this?

Kelvin

