lucene-dev mailing list archives

From <kelvin-li...@relevanz.com>
Subject Re: IFilter
Date Tue, 29 Apr 2003 16:13:14 GMT

On Tue, 29 Apr 2003 09:07:28 -0400, Erik Hatcher wrote:
>>3. Metadata getMetadata(), which returns this file's metadata. Most
>>flexible in my opinion, because then clients can control what they
>>want to name the fields without coding a separate IFilter just for
>>that purpose, or even whatever else they want to do with the
>>metadata. This is assuming that we have document metadata and
>>document contents (which may or may not be valid for XML documents
>>at least). The - side of this approach is that its less efficient
>>(as a result of its flexibility).
>
>I like your third option here.  Rather than fabricate another class
>called Metadata, we could simply return a Map.

+1
We'll probably want an idiom where each IFilter declares its metadata
keys/fields as constants, so there are no magic keys in the map.
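
A minimal sketch of that idiom, assuming a hypothetical IFilter
interface whose getMetadata() returns a Map. Neither the interface
shape nor the PdfFilter class and its key names have been agreed on
in this thread; they are only there for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface: the thread has not settled on a final shape.
interface IFilter {
    Map<String, String> getMetadata();
}

// Hypothetical filter; the class and key names are illustrative only.
class PdfFilter implements IFilter {
    // Keys are declared once, so clients never type the raw strings.
    public static final String KEY_TITLE = "title";
    public static final String KEY_AUTHOR = "author";

    private final Map<String, String> metadata = new LinkedHashMap<>();

    PdfFilter(String title, String author) {
        metadata.put(KEY_TITLE, title);
        metadata.put(KEY_AUTHOR, author);
    }

    @Override
    public Map<String, String> getMetadata() {
        // Read-only view, so callers cannot alter the filter's state.
        return Collections.unmodifiableMap(metadata);
    }
}
```

A client would then write filter.getMetadata().get(PdfFilter.KEY_TITLE)
and decide for itself which Lucene field name to store the value under.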

>
>Are you processing large documents?  In a previous mail you were
>returning a Reader, which I'd assume means a large file could be
>read.  

Well, I haven't tried anything larger than 15MB...
To pre-empt any problems of having too many open filehandles, I use a 
LazyFileReader which doesn't open the file until the first read is 
called. Don't know if this is actually necessary though.
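
For reference, the idea looks roughly like the sketch below. This is
a reconstruction under my own assumptions (the isOpen() helper in
particular is only there to make the behaviour visible), not the
exact class I'm using.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

// Sketch of a Reader that defers opening the file until the first read.
class LazyFileReader extends Reader {
    private final File file;
    private Reader delegate;   // stays null until the first read()
    private boolean closed;

    LazyFileReader(File file) {
        this.file = file;
    }

    // Hypothetical helper: true while a file handle is actually held.
    boolean isOpen() {
        return delegate != null;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (closed) {
            throw new IOException("Reader closed");
        }
        if (delegate == null) {
            delegate = new FileReader(file); // file handle acquired here
        }
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        if (delegate != null) {
            delegate.close();
            delegate = null;
        }
    }
}
```

Constructing many of these costs no filehandles; one is taken only
when a consumer starts reading, and released on close().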

>I'm not sure what compromises there would be if we needed to have
>information streamed (like a SAX parser) and event driven as fields
>are picked off the incoming stream.
>
>I think the event-driven document reader use case should be
>considered. POI does this, I believe.  How would that impact our
>design?

So, just to make sure I understand you: if there's, say, a
Reader getReader(InputStream is, String mimeType)
method in the interface, are you wondering how to obtain the Reader
without processing the entire InputStream first, or rather how to
process it through an event-driven mechanism?

No, that doesn't make sense. Wouldn't an event-driven document reader 
use case only apply to retrieval of metadata (Map), not file contents 
(Reader)? In which case you'd almost _have_ to finish processing the 
entire stream before returning the Map, no?
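
To make that concrete, here is a rough sketch using the JDK's SAX
parser. The SaxMetadataExtractor class and the <meta name="..."
content="..."/> convention are assumptions made up for this example,
not anything POI or an existing IFilter does.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects metadata through SAX callbacks.
class SaxMetadataExtractor {
    static Map<String, String> getMetadata(InputStream in) throws Exception {
        final Map<String, String> metadata = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Assumed convention: each <meta> element is one entry.
                if ("meta".equals(qName)) {
                    metadata.put(atts.getValue("name"),
                                 atts.getValue("content"));
                }
            }
        };
        // parse() returns only once the whole stream has been consumed,
        // so the Map is complete only at that point.
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return metadata;
    }
}
```

The events arrive incrementally, but since a <meta> element could
appear anywhere in the document, the caller cannot get a complete Map
until the parse has finished.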

Maybe I'm missing your point.

Kelvin

>
>    Erik
>
>
>---------------------------------------------------------------------

>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>





