lucene-dev mailing list archives

From Kelvin Tan <>
Subject Re: IFilter
Date Wed, 30 Apr 2003 00:25:40 GMT

On Tue, 29 Apr 2003 14:14:57 -0400, Erik Hatcher said:
>On Tuesday, April 29, 2003, at 12:13  PM, <kelvin-
>>>I like your third option here.  Rather than fabricate another
>>>called Metadata, we could simply return a Map.
>>We'll probably want an idiom where each IFilter declares its
>>keys/fields as constants, so there's no magic keys in the map.
>I'm not following you here.  What do you mean by magic keys?

Sorry. It's a casual abuse of the phrase "magic numbers". In other words, I 

Map m = aHTMLFilter.getMetadata();
doc.add(Field.Text(HTMLFilter.TITLE, (String) m.get(HTMLFilter.TITLE)));


doc.add(Field.Text("title", (String) m.get("title")));

It may not seem significant if all you're going to do with the map is iterate 
over the keyset and add each entry as a field, but I don't think it's good 
practice in any event.
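
To make the difference concrete, here is a minimal sketch of the constants idiom. HtmlFilterSketch is a hypothetical stand-in for HTMLFilter, the AUTHOR key is made up for illustration, and the Lucene Document is left out so the snippet runs on its own. None of these names are settled API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an HTML IFilter; all names are illustrative only.
class HtmlFilterSketch {
    // Keys are declared once as constants, so callers never type raw strings.
    static final String TITLE = "title";
    static final String AUTHOR = "author";

    private final Map<String, Object> metadata = new HashMap<>();

    HtmlFilterSketch(String title, String author) {
        metadata.put(TITLE, title);
        metadata.put(AUTHOR, author);
    }

    Map<String, Object> getMetadata() {
        return metadata;
    }
}

class MagicKeyDemo {
    public static void main(String[] args) {
        Map<String, Object> m = new HtmlFilterSketch("Example page", "someone").getMetadata();
        // A typo in the constant name would fail at compile time.
        String title = (String) m.get(HtmlFilterSketch.TITLE);
        // A typo in a raw string key compiles fine and silently yields null.
        Object missing = m.get("titel");
        System.out.println(title + " / " + missing);
    }
}
```

The point is only that a misspelt constant is caught by the compiler, while a misspelt string key is not caught at all.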

>Who gets to name the fields that end up in the Document is where I'm
>not clear yet.  You would likely want some consistency in the field
>names among documents, or at least an overlap on "contents" or
>"keywords" or something like that.

That's exactly what I'm talking about. By providing a metadata map + fields as 
constants, clients can name the fields any way they wish.

String title = (String) aHTMLFilter.getMetadata().get(HTMLFilter.TITLE);
doc.add(Field.Text("myTitleField", title));

We can also provide a higher level API which does this for them by supplying a 
Document as arg, like so 

addToDoc(Document doc); 
where the implementation is simply 
doc.add(Field.Text(TITLE, (String) metadata.get(TITLE)));

(which is the second option I mentioned) if they're not too concerned about 

By the way, as you can probably see with the code snippets, we'll be casting 
quite a lot when dealing with maps. That's one + for creating a small utility 
Metadata class which provides convenient typed accessor methods like 
getString() and getInt() instead of get().
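
A rough sketch of what such a class might look like. Only getString() and getInt() come from the suggestion above; the put() method, the null handling and the string-to-int fallback are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical utility class wrapping a map so callers do not have to cast.
class Metadata {
    private final Map<String, Object> values = new HashMap<>();

    // Assumed mutator; a real filter might populate the values differently.
    void put(String key, Object value) {
        values.put(key, value);
    }

    // Returns the value as a String, or null if the key is absent.
    String getString(String key) {
        Object v = values.get(key);
        return v == null ? null : v.toString();
    }

    // Returns the value as an int; throws if the key is absent or not numeric.
    int getInt(String key) {
        Object v = values.get(key);
        if (v == null) {
            throw new IllegalArgumentException("No value for key: " + key);
        }
        if (v instanceof Number) {
            return ((Number) v).intValue();
        }
        // Falls back to parsing, which throws NumberFormatException on bad input.
        return Integer.parseInt(v.toString());
    }
}
```

With this, client code reads md.getString(HTMLFilter.TITLE) instead of (String) m.get(HTMLFilter.TITLE).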

>>>I think the event-driven document reader use case should be
>>>POI does this, I believe.  How would that impact our design?
>>So, just to ensure I get you, if there's say a
>>Reader getReader(InputStream is, String mimeType)
>>method in the interface, you're wondering how to obtain the Reader
>>without processing the entire inputstream first, or rather
>>it from an event-driven mechanism?
>I'm not thinking that detail oriented just yet.  If we are ok with
>Map being returned for all fields by some specific document handler
>implementation, then the details of the event-driven option would be
>hidden under there, except that the 15MB file would then live in
>field(s) of the Map in memory and then transferred to a Document in

Not exactly. It would be referenced as a in the FileReader (or 
whatever other Reader you're using). Whether it will be in the metadata map, or 
obtained with a separate getReader() method, is something I'm wondering.

I guess for _some_ XML documents, the getReader() method may not be relevant. 
However, for most other documents, there's almost always document metadata 
(title, author, keywords, etc) and actual contents of that document. Right now, 
I'm tending towards both a getMetadata() and a getReader() method. Thoughts?
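
Roughly, the two-method shape could look like the sketch below. IFilterSketch, PlainTextFilterSketch and the readAll() helper are hypothetical names, and the constructor taking a ready-made title is an assumption, since a real filter would extract it from the document. The idea is that the small metadata lives in the map while the contents are only streamed through the Reader, so a large file never has to sit in the map.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: metadata and contents are obtained separately.
interface IFilterSketch {
    Map<String, Object> getMetadata();
    Reader getReader();
}

// Toy plain-text implementation; the title is passed in rather than parsed.
class PlainTextFilterSketch implements IFilterSketch {
    static final String TITLE = "title";

    private final InputStream in;
    private final Map<String, Object> metadata = new HashMap<>();

    PlainTextFilterSketch(InputStream in, String title) {
        this.in = in;
        metadata.put(TITLE, title);
    }

    public Map<String, Object> getMetadata() {
        return metadata;
    }

    // Contents are streamed from the underlying input, not copied into the map.
    public Reader getReader() {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // Demo helper that drains a Reader into a String.
    static String readAll(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class TwoMethodDemo {
    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("hello body".getBytes(StandardCharsets.UTF_8));
        IFilterSketch f = new PlainTextFilterSketch(in, "Greeting");
        System.out.println(f.getMetadata().get(PlainTextFilterSketch.TITLE));
        System.out.println(PlainTextFilterSketch.readAll(f.getReader()));
    }
}
```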

>Maybe its a non-issue, just was curious if the different
>a document could be processed should factor into the equation or not.

From how I'm seeing it, it could be just an implementation detail on the part 
of the IFilter.

>Nope, that is my point.  Dunno why I even brought it up since its
>at all one of my use cases!  :)  Carry on.

I'm glad you brought it up, nonetheless! :-)

