lucene-dev mailing list archives

From Erik Hatcher <>
Subject Re: IFilter
Date Tue, 29 Apr 2003 13:07:28 GMT
On Monday, April 28, 2003, at 10:18  PM, Kelvin Tan wrote:
>> How are you proposing that this concept be wrapped to include multiple
>> fields that might exist within a single "document"?
> Ah, you raise a very valid point. I had overlooked this because in my
> usage I had no need for it. I feel there are 3 ways we can approach this:
> 1. Document getDocument(), as you've implemented/suggested.
> 2. void addToDocument(Document doc): pass the created Document in, so the
>    IFilter is at least focused on extracting the requisite fields and
>    adding them to the Document, and not concerned with Document creation
>    (important in my usage).
> 3. Metadata getMetadata(), which returns this file's metadata. Most
>    flexible in my opinion, because then clients can control what they want
>    to name the fields without coding a separate IFilter just for that
>    purpose, or do whatever else they want with the metadata. This assumes
>    that we have document metadata and document contents (which may or may
>    not be valid for XML documents at least). The downside of this approach
>    is that it's less efficient (as a result of its flexibility).

I like your third option here.  Rather than fabricate another class 
called Metadata, we could simply return a Map.
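
To make the Map idea concrete, here is a rough sketch of what I have in mind. Nothing here is settled: the method names, the Reader-returning getContents(), and the PlainTextFilter and FieldMapper helpers are all made up for illustration, and I've left Lucene's Document out of it entirely.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface; the exact signatures are an assumption, not an agreed API.
interface IFilter {
    // Field name -> value pairs extracted from the file (option 3, as a plain Map).
    Map<String, String> getMetadata();

    // The body text as a Reader, so a large file need not be held in memory.
    Reader getContents();
}

// Toy implementation for illustration only; a real filter would parse a file format.
class PlainTextFilter implements IFilter {
    private final String title;
    private final String body;

    PlainTextFilter(String title, String body) {
        this.title = title;
        this.body = body;
    }

    public Map<String, String> getMetadata() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", title);
        metadata.put("length", String.valueOf(body.length()));
        return metadata;
    }

    public Reader getContents() {
        return new StringReader(body);
    }
}

// Client-side helper (hypothetical): the caller, not the filter, chooses field names.
class FieldMapper {
    static Map<String, String> rename(Map<String, String> metadata,
                                      Map<String, String> newNames) {
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            // Keep the original key when no replacement name is configured.
            String name = newNames.getOrDefault(entry.getKey(), entry.getKey());
            renamed.put(name, entry.getValue());
        }
        return renamed;
    }
}
```

The client would then copy the renamed entries into whatever Document it has already created, which also covers the concern behind option 2.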

Are you processing large documents?  In a previous mail you were 
returning a Reader, which I'd assume means a large file could be read.  
I'm not sure what compromises there would be if we needed to have 
information streamed (like a SAX parser) and event driven as fields are 
picked off the incoming stream.
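
For comparison, a streamed variant might look something like the sketch below. This is only an assumption about the shape such an API could take: FieldHandler and StreamingHeaderFilter are hypothetical names, and the "name: value" header format is a stand-in for a real file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical callback, loosely modelled on SAX-style handlers.
interface FieldHandler {
    void field(String name, String value);
}

// Toy parser for illustration: reads "name: value" lines up to the first blank
// line and reports each field as soon as it is read, without buffering them all.
class StreamingHeaderFilter {
    static void parse(Reader in, FieldHandler handler) {
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    handler.field(line.substring(0, colon).trim(),
                                  line.substring(colon + 1).trim());
                }
                // Lines without a usable "name:" prefix are skipped.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is that the caller no longer gets a complete Map to inspect up front; it has to act on each field as it arrives.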

> What do you think?

Looks good.  I think we are making progress :)

>> Also, I think the location of a "file" needs to be made generic, and
>> Commons VFS seems to be the likely candidate currently.
> I like the idea of using Commons VFS, although my feeling is it's not
> necessary at the moment? Ideally, I'd want to have minimal (or no!)
> dependencies on external libs since this is really quite a simple API. Do
> you feel really strongly about this?

No strong feelings here, as I'm currently only dealing with local files.

I think the event-driven document reader use case should be considered.
POI does this, I believe.  How would that impact our design?

