lucene-dev mailing list archives

From <>
Subject Re: IFilter
Date Tue, 29 Apr 2003 16:13:14 GMT

On Tue, 29 Apr 2003 09:07:28 -0400, Erik Hatcher wrote:
>>3. Metadata getMetadata(), which returns this file's metadata. Most
>>flexible in my opinion, because then clients can control what they
>>want to name the fields without coding a separate IFilter just for
>>that purpose, or even whatever else they want to do with the
>>metadata. This is assuming that we have
>>metadata and document contents (which may or may not be valid for
>>at least). The - side of this approach is that its less efficient
>>a result
>>of its flexibility).
>I like your third option here.  Rather than fabricate another class
>called Metadata, we could simply return a Map.

We'll probably want an idiom where each IFilter declares its metadata
keys/fields as constants, so there are no magic keys in the map.
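
To make the idiom concrete, here is a rough sketch. The interface shape,
the SimpleTextFilter class and both key names are assumptions made up for
illustration; nothing here is an agreed design or an existing Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface shape: getMetadata() returns a plain Map
// instead of a dedicated Metadata class.
interface IFilter {
    Map<String, String> getMetadata();
}

// Hypothetical filter that publishes its metadata keys as constants,
// so callers never have to type a key as a string literal.
class SimpleTextFilter implements IFilter {
    public static final String TITLE = "title";
    public static final String CHAR_COUNT = "charCount";

    private final String text;

    SimpleTextFilter(String text) {
        this.text = text;
    }

    @Override
    public Map<String, String> getMetadata() {
        Map<String, String> metadata = new HashMap<>();
        // Treat the first line as the title (an arbitrary choice for this sketch).
        metadata.put(TITLE, text.split("\n", 2)[0]);
        metadata.put(CHAR_COUNT, String.valueOf(text.length()));
        return metadata;
    }
}
```

A client would then look values up with SimpleTextFilter.TITLE and store
them under whatever field name it likes.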

>Are you processing large documents?  In a previous mail you were
>returning a Reader, which I'd assume means a large file could be

Well, I haven't tried anything larger than 15MB...
To pre-empt any problems with having too many open file handles, I use a
LazyFileReader which doesn't open the file until read is first called.
I don't know whether this is actually necessary, though.
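
For reference, the lazy-open idea looks roughly like the sketch below. It
is a simplified illustration written for this message, not the actual
class; the isOpen() helper exists only to make the behaviour observable,
and FileReader here uses the platform default charset.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

// Sketch of a Reader that defers opening the file until the first read,
// so an unread instance holds no file handle.
class LazyFileReader extends Reader {
    private final File file;
    private Reader delegate; // stays null until the first read
    private boolean closed;

    LazyFileReader(File file) {
        this.file = file;
    }

    // Illustrative helper: true once the underlying file has been opened.
    boolean isOpen() {
        return delegate != null;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (closed) {
            throw new IOException("Reader is closed");
        }
        if (delegate == null) {
            delegate = new FileReader(file); // the file handle is acquired here
        }
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        if (delegate != null) {
            delegate.close();
        }
    }
}
```

Constructing many of these costs no file handles; only the readers that
are actually read from open their files.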

>I'm not sure what compromises there would be if we needed to have
>information streamed (like a SAX parser) and event driven as fields
>picked off the incoming stream.
>I think the event-driven document reader use case should be
>POI does this, I believe.  How would that impact our design?

So, just to make sure I understand you: if there's, say, a
Reader getReader(InputStream is, String mimeType)
method in the interface, you're wondering how to obtain the Reader
without processing the entire InputStream first, or rather how to
process it through an event-driven mechanism?

No, that doesn't make sense. Wouldn't an event-driven document reader 
use case only apply to retrieval of metadata (Map), not file contents 
(Reader)? In which case you'd almost _have_ to finish processing the 
entire stream before returning the Map, no?
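
Here is roughly what I have in mind, as a sketch only. It assumes the
metadata arrives as XML in a made-up <meta name="...">value</meta>
format, and the class names are hypothetical. The handler picks fields
off the stream as events fire, but the Map is only complete once parse()
has returned.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects <meta name="key">value</meta> elements (a hypothetical format)
// into a Map as SAX events arrive.
class MetadataHandler extends DefaultHandler {
    final Map<String, String> metadata = new HashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentKey; // non-null only while inside a <meta> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("meta".equals(qName)) {
            currentKey = atts.getValue("name");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentKey != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("meta".equals(qName) && currentKey != null) {
            metadata.put(currentKey, text.toString().trim());
            currentKey = null;
        }
    }
}

class SaxMetadataReader {
    // Returns only after the whole stream has been parsed, because a
    // <meta> element may appear anywhere in the input.
    static Map<String, String> getMetadata(InputStream in) throws Exception {
        MetadataHandler handler = new MetadataHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.metadata;
    }
}
```
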

Maybe I'm missing your point.


>    Erik
