lucene-dev mailing list archives

From: Michael McCandless
Subject: Re: Per-document Payloads
Date: Mon, 29 Oct 2007 08:27:18 GMT

> Michael Busch wrote:
> > Doug Cutting wrote:
> > 
> > If this is really required, perhaps it ought to appear as an
> > attribute for stored fields, indicating that the field should be
> > stored in a separate "column store".  This would permit efficient
> > enumeration of values of just that field.
> Yes I was thinking about this too. I'm just not sure if this is
> confusing for the users, because it will be conceptually different
> how to retrieve "normal" stored fields vs. "column-stored"
> fields. The former via getDocument() (multiple field values at a
> time), but the latter via an Iterator similar to TermDocs (one value
> at a time).  Do you think this would be confusing? Or do you have
> other ideas for the retrieval API?

Michael, are you thinking that the storage would/could be non-sparse
(like norms), and loaded/cached once in memory, especially for fixed
size fields?  EG a big array of ints of length maxDocID?  In John's
original case, every doc has this UID int field; I think this is
fairly common.

I think many apps have no trouble loading the array-of-ints entirely
into RAM, either because there are not that many docs or because
throwing RAM at the problem is fine (eg on a 64-bit JVM).
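
To make the "big array of ints" idea concrete, here is a minimal sketch of a dense, fixed-width column: one 4-byte int per document, addressed by docID, decoded in a single bulk read. The class DenseIntColumn and its methods are hypothetical names for illustration only. They are not part of Lucene, and the byte layout is an assumption rather than a proposed file format.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: a dense (non-sparse) int column
// holding exactly one value per document, indexed by docID.
final class DenseIntColumn {

    // Encode one 4-byte big-endian int per document, in docID order.
    public static byte[] encode(int[] valuesByDocId) {
        ByteBuffer buffer = ByteBuffer.allocate(valuesByDocId.length * Integer.BYTES);
        buffer.asIntBuffer().put(valuesByDocId);
        return buffer.array();
    }

    // Decode the whole column into one int[] of length maxDoc in a single bulk copy.
    public static int[] decode(byte[] bytes, int maxDoc) {
        if ((long) bytes.length != (long) maxDoc * Integer.BYTES) {
            throw new IllegalArgumentException(
                "expected " + ((long) maxDoc * Integer.BYTES) + " bytes, got " + bytes.length);
        }
        int[] values = new int[maxDoc];
        ByteBuffer.wrap(bytes).asIntBuffer().get(values);
        return values;
    }
}
```

With this layout the in-memory cost is 4 bytes per document, so 10 million docs need about 40 MB, and looking up the UID for a hit is a plain array access (values[docId]).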

From John's tests, the "load int[] directly from disk" approach took
186 msec, whereas the payload approach (using today's payloads API)
took 430 msec.

This is a sizable performance difference (about 2.3x faster), and for
interactive indexing apps, where minimizing the cost of re-opening
readers is critical, it is significant.  Combining this with the
ideas from LUCENE-831 (incrementally updating the FieldCache; maybe
distributing the FieldCache down into sub-readers) should make
re-opening + re-warming much faster than today.

If so, wouldn't this API just fit under FieldCache?  Ie "getInts(...)"
would look at FieldInfo, determine that this field is stored
column-stride, and load it as one big int array?
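
A rough sketch of what that dispatch could look like follows. Everything in it is hypothetical: ColumnAwareIntCache and FieldMeta are stand-ins I made up for FieldCache and FieldInfo, the columnStride flag is an assumed attribute, and the per-document lookup merely represents whatever slower path exists today. It is not Lucene's actual API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch, not Lucene's FieldCache: all names here are invented.
final class ColumnAwareIntCache {

    // Stand-in for per-field metadata such as FieldInfo.
    static final class FieldMeta {
        final boolean columnStride;          // assumed flag: field is stored as a dense column
        final int[] denseColumn;             // values by docID, set only when columnStride is true
        final IntUnaryOperator perDocLookup; // slower one-value-at-a-time path otherwise

        private FieldMeta(boolean columnStride, int[] denseColumn, IntUnaryOperator perDocLookup) {
            this.columnStride = columnStride;
            this.denseColumn = denseColumn;
            this.perDocLookup = perDocLookup;
        }

        public static FieldMeta columnStride(int[] denseColumn) {
            return new FieldMeta(true, denseColumn, null);
        }

        public static FieldMeta perDocument(IntUnaryOperator lookup) {
            return new FieldMeta(false, null, lookup);
        }
    }

    private final int maxDoc;
    private final Map<String, FieldMeta> fieldInfos = new HashMap<>();
    private final Map<String, int[]> cache = new HashMap<>();

    public ColumnAwareIntCache(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    public void addField(String name, FieldMeta meta) {
        fieldInfos.put(name, meta);
    }

    // Return one int per document, loading and caching the array on first use.
    public int[] getInts(String field) {
        int[] cached = cache.get(field);
        if (cached != null) {
            return cached;
        }
        FieldMeta meta = fieldInfos.get(field);
        if (meta == null) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        int[] values;
        if (meta.columnStride) {
            // Column-stride field: take the whole column in one bulk copy.
            values = Arrays.copyOf(meta.denseColumn, maxDoc);
        } else {
            // Any other field: fall back to visiting documents one at a time.
            values = new int[maxDoc];
            for (int docId = 0; docId < maxDoc; docId++) {
                values[docId] = meta.perDocLookup.applyAsInt(docId);
            }
        }
        cache.put(field, values);
        return values;
    }
}
```

The point of the sketch is that callers only ever see getInts(field); whether the array came from a bulk column read or from a per-document walk is decided by the field's metadata, so no separate retrieval API is exposed.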

