lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Fieldable, AbstractField, Field
Date Fri, 21 Mar 2008 10:45:37 GMT

Chris Hostetter wrote:
> : Wouldn't subclassing ReadOnlyDocument also work in this case, if  
> you override
> : the getField* to do your own new logic if it applies else  
> fallback to super?
> Sure, but how will IndexReader (or really FieldsReader) know which
> subclass to instantiate?

IndexReader would not instantiate a subclass in this example.  I was  
just saying your wrapping example doesn't require an interface: it  
works equally well if ReadOnlyDocument is a concrete class.  I think  
moving away from interfaces for 3.0 makes sense.

> I think in LUCENE-778 the notion of passing a
> DocumentFactory to IndexReader was brought up; in another reply to  
> this
> thread robert suggests having a setter that takes in a Class which  
> has a
> noarg constructor and using reflect -- both of these could work, but
> frankly having a really simple API that allows for object  
> decoration seems
> cleaner to me.

I think whether IndexReader should be able to instantiate an app- 
specific subclass for a returned document is a somewhat orthogonal  
question here.  I agree it would be nice to be able to do this,  
though I do wonder if we are "going too far in being flexible".   
Would we also offer app-specific per-field classes?  Lucene is a  
search engine and if you want to do neat custom application classes  
above it it seems like it should remain above the API with  
decoration.  Ie I do like the simplicity of Yonik's document()  
returning Map<String,String[]>.

> : approach.  People who are careful (store enough fields, don't use  
> boosting or
> : have separate store for their boosting) could pull Documents from  
> a reader,
> : tweak them, and build a new index.
> this to me is the strongest reason to have seperate classes, where  
> neither
> is the base class for the other -- we want to discourage the casual
> observer from assuming this is simple.  we want to make sure that  
> users
> who want to do this have to go out of their way to have to write the
> translation from SearchResultDocument to IndexableDocument so it's  
> clear
> to them what they might be losing in the translation.  (incidently:  
> this
> is another topic that came up in LUCENE-778...
> focusedCommentId=12478169#action_12478169
> )

Agreed, it's a dangerous trap to make this look simple.  EG, I just  
discovered that if you call "isIndexed()" on a Fieldable in the  
document returned by an IndexReader, it can lie, because it falls  
back to FieldInfos to check that field.  But this is not right since  
on adding the one doc you could have specified NOT to index that one  
field and only store it.  Also, getBoost() doesn't survive indexing.   
In short, it's buggy, today, that we try to act like a retrieved  
document has faithfully preserved these detailed index-time settings.

> fundementally, i believe
>   * the API for objects that are passed to "indexing" should only  
> require
> "getters" that are essential for understanding how to indexing
> that document. ie: if indexing happens by calling
> IndexWriter.addDocument(IndexableDoc) then the IndexableDoc API should
> only require that there is a way to "get" IndexableFields and other
> options, IndexableField should only describe how to "get" the field  
> name
> and field value and; the client providing the IndexableDoc should  
> be free
> to provide an object that implements those methods any way it sees  
> fit --
> we will most likely want to provde "simple" concrete subclasses of  
> these
> APIs that have simple setters to meet the common cases -- but the
> IndexWriter methods should know only about the "API" classes.
>   * the API for objects that are returned from "reading/searching"  
> should
> only specify "getters" that are essential for the client to understand
> the data the index knows about those documents.  "setters" used by
> IndexReader (or it's internals) shouldn't be exposed by this API  
> unless we
> *truely* want to let clients treat those objects as mutabl "Beans" for
> their own purposes ... personally, i think it's safer to make it easy
> to decorate those objects then to allow arbitrary modifications.

I like this approach.

So the IndexableDocument abstract base class would only define  
getters.  Document (in oal.index package) would subclass this, adding  
basic setters.  Likewise for Field, along with the getter/setters for  
indexing details like different term vector, stored, index settings.

StoredDocument abstract base class would only define getters.  There  
would be a (package-)private class in  that subclasses  
this and adds private setters.  Likewise for Field, or, maybe  
StoredDocument deals directly in String and doesn't make Field  
classes (Yonik's idea).

> : Actually I think they do share alot more than just name of the  
> field?
> : Accessing the "stored" value of a document is exactly what  
> indexing needs to
> : do when it indexes the document in the first place?  Ie, a  
> "stored" document
> : "looks alot like" the document at indexing time that had been  
> stored.  And
> i think the stored value looks alot like the value stored during index
> because of the perspective we look at them from ... but a Document
> returned from a search just has Fields with "Values" while a  
> document that
> needs to to be indexed will have Fields with "Values To Store" and
> Fields with "Values To Index" and Fields with both ... there's  
> really no
> reason for a single method in a common base class to describe those
> things.

OK I now agree with this.  I don't think Lucene should be trying to  
preserve all the index-time settings through to search-time, because  
it's buggy now and it's not really necessary.

> : things like isTokenized, isTermVectorStored,  
> isStoreOffsetWithTermVector,
> : isBinary are actually preserved in the index and known to the  
> reader, so it's
> : worth having these methods available at search time?
> I admit, some of those is*() methods might make sense in a common base
> class -- but not all of them are specific to each instance of  
> Field. Some
> are the least common denominator of all Fields indexed with the  
> same name
> added -- it's convinient to return that info as part of each Field
> instance so thatthe client doesn't have to go call some method in
> IndexReader (ie: reader.isTermVectorStored(fieldName)) but  
> providing it to
> the client using the same method (not just a method with a similar, or
> identical name, but the *exact* same method as defined by a common
> baseclass) is very confusing when that method returns something  
> different
> then what is was when the client orriginally indexed that document.

I've flipped here and would be happy not to expose any of this API in  
the returned document.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message