lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: Fieldable, AbstractField, Field
Date Wed, 19 Mar 2008 19:50:10 GMT

Chris Hostetter wrote:
> : I do like moving towards a separation of Document for indexing vs
> : searching for 3.0.
> :
> : Disregarding for starters how we get there from here...
> :
> : Wouldn't we just want a base class (not an interface), say
> : ReadOnlyField, that is used in documents retrieved by a reader?  This
> : class would also have Index.*, Store.*, TermVector.*, and
> : isStored/Indexed/Tokenized/Compressed, etc, as these are recoverable
> : from an index.  Couldn't this be a concrete class, ie, the actual
> : class instantiated when a Document is loaded from a reader?
> Yes, but one of the peeves I've heard lots of people express over
> the years is that they want to "decorate" the Documents returned by a
> search, so that they can make those documents access alternate field
> stores and metadata not in the index.  (LUCENE-778 started out being a
> discussion of wanting to pass custom subclasses of Document to
> writer.addDocument(), but it also mentions wanting to get custom
> documents back from IndexReader.)
> Imagine you're writing an app that does a search with Lucene, and then
> returns a List<Document> ...
>   public List<Document> myMethod(options) {
>     List<Document> docs = doSomeSearchStuff(indexreader, query, options);
>     return docs;
>   }
> you've got a lot of downstream code that calls myMethod and
> uses/propagates this List<Document> ... and then one day you decide
> that for each document you want to also include some metadata that
> Lucene doesn't know anything about; your downstream client code is
> happy to treat this new metadata just like any other field.  You could
> change the API of myMethod and jump through a lot of hoops changing
> all of your other code; or if "Document" is a simple interface, you
> could do something like...
>   public class MyDocumentWrapper implements Document {
>     public MyDocumentWrapper(Document, otherData) {...}
>     public static List<Document> wrapList(List<Document>, otherData) {...}
>   }
>   public List<Document> myMethod(options) {
>     List<Document> docs = doSomeSearchStuff(indexreader, query, options);
>     return MyDocumentWrapper.wrapList(docs, getOtherData(options));
>   }
> (If I remember right, there are some comments to this effect in
> LUCENE-778 as well)

Wouldn't subclassing ReadOnlyDocument also work in this case, if you
override the getField* methods to do your own new logic where it
applies and otherwise fall back to super?
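
A minimal sketch of that subclassing idea, in Java. ReadOnlyDocument here is a hypothetical stand-in for the class proposed in this thread, not an existing Lucene class; DecoratedDocument, its extra-metadata map, and the get(String) signature are likewise assumptions made only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed ReadOnlyDocument; not a real Lucene class.
class ReadOnlyDocument {
    private final Map<String, String> stored = new HashMap<>();

    ReadOnlyDocument(Map<String, String> storedFields) {
        stored.putAll(storedFields);
    }

    // Returns the stored value for the field, or null if the field is absent.
    public String get(String name) {
        return stored.get(name);
    }
}

// Hypothetical subclass that adds metadata the index knows nothing about.
class DecoratedDocument extends ReadOnlyDocument {
    private final Map<String, String> extra;

    DecoratedDocument(Map<String, String> storedFields, Map<String, String> extra) {
        super(storedFields);
        this.extra = new HashMap<>(extra);
    }

    @Override
    public String get(String name) {
        // Use the external metadata when it applies, else fall back to super.
        String value = extra.get(name);
        return value != null ? value : super.get(name);
    }
}
```

Downstream code that only calls get(name) cannot tell index-backed fields from decorated ones. How a reader would be made to hand back the subclass (or how an existing document would be copied into one) is left open in this sketch.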

Alternatively, we back away from distinguishing read-only vs
index-time Document (and go back to a single concrete Field class).
This way you can alter the fields of a Document returned from a
reader.  I agree it's not clear that forcing "read only" on a
Document returned by a reader is the right approach.  People who are
careful (store enough fields, and either don't use boosting or keep a
separate store for their boosts) could pull Documents from a reader,
tweak them, and build a new index.

> : And then a subclass, IndexableField, that adds reader & tokenStream
> : values, get/set boost, setters to change a field's value, etc.
> IndexableField really shouldn't be a subclass of whatever class is
> returned after a search is done ... the methods used for accessing the
> "stored" value of a returned document make as little sense in the
> context of IndexableField as the setBoost/Reader/TokenStream functions
> of Document currently make when a search is executed.
> When all is said and done: an IndexableField and a SearchResultField
> shouldn't have anything in common except *maybe* that they both have a
> fieldName.

Actually I think they share a lot more than just the name of the
field.  Accessing the "stored" value of a document is exactly what
indexing needs to do when it indexes the document in the first
place.  Ie, a "stored" document "looks a lot like" the document at
indexing time that had been stored.  And things like isTokenized,
isTermVectorStored, isStoreOffsetWithTermVector, isBinary are
actually preserved in the index and known to the reader, so isn't it
worth having these methods available at search time?
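
As a rough sketch of the hierarchy being argued for here (in Java): a base class carrying what is recoverable from the index, and a subclass adding the index-time extras. The class names ReadOnlyField and IndexableField come from this thread; the constructors, the stringValue()/setValue() accessors, and the reduced set of flags are assumptions, and neither class exists in Lucene as written.

```java
// Hypothetical base class: only what a reader could recover from the index.
class ReadOnlyField {
    private final String name;
    protected String value;
    private final boolean stored;
    private final boolean indexed;
    private final boolean tokenized;

    ReadOnlyField(String name, String value,
                  boolean stored, boolean indexed, boolean tokenized) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    public String name() { return name; }
    public String stringValue() { return value; }
    public boolean isStored() { return stored; }
    public boolean isIndexed() { return indexed; }
    public boolean isTokenized() { return tokenized; }
}

// Hypothetical subclass: adds what only makes sense at indexing time.
class IndexableField extends ReadOnlyField {
    private float boost = 1.0f;

    IndexableField(String name, String value,
                   boolean stored, boolean indexed, boolean tokenized) {
        super(name, value, stored, indexed, tokenized);
    }

    public float getBoost() { return boost; }
    public void setBoost(float boost) { this.boost = boost; }

    // Setter to change the field's value before (re)indexing.
    public void setValue(String newValue) { this.value = newValue; }
}
```

Under this layout the is* methods and the stored value are shared by both sides, while boost and the value setter are visible only on the index-time type.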

> I think Yonik once argued that the ideal API for getting a Document
> out of an IndexReader would be...
>    /** @return map of field name to field values */
>    public Map<String,String[]> getDocument(int id)

But that would lose the above is* methods and often would force  
applications to wrap that returned result in a new class anyway...
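
To illustrate the wrapping concern, here is a small Java sketch of what an application might end up writing around a map-style result. FieldFlags and WrappedResult are hypothetical names invented for this example, and the idea that the flags arrive through some separate channel is an assumption.

```java
import java.util.Map;

// Hypothetical per-field flags; a bare Map<String,String[]> carries only values.
class FieldFlags {
    final boolean tokenized;
    final boolean binary;

    FieldFlags(boolean tokenized, boolean binary) {
        this.tokenized = tokenized;
        this.binary = binary;
    }
}

// Hypothetical application-side wrapper pairing the values with their flags.
class WrappedResult {
    private final Map<String, String[]> values;
    private final Map<String, FieldFlags> flags;

    WrappedResult(Map<String, String[]> values, Map<String, FieldFlags> flags) {
        this.values = values;
        this.flags = flags;
    }

    // Returns the values for the field, or null if the field is absent.
    String[] getValues(String name) {
        return values.get(name);
    }

    // Reports false when no flags are known for the field.
    boolean isTokenized(String name) {
        FieldFlags f = flags.get(name);
        return f != null && f.tokenized;
    }
}
```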

