lucene-dev mailing list archives

From: Michael McCandless
Subject: Re: Per-document Payloads
Date: Mon, 29 Oct 2007 14:47:46 GMT

"Michael Busch" <> wrote:
> Michael McCandless wrote:
> > 
> > Michael, are you thinking that the storage would/could be non-sparse
> > (like norms), and loaded/cached once in memory, especially for fixed
> > size fields?  EG a big array of ints of length maxDocID?  In John's
> > original case, every doc has this UID int field; I think this is
> > fairly common.
> >
> Yes I agree, this is a common use case. In my first mail in this thread
> I suggested to have a flexible format. Non-sparse, like norms, in case
> every document has one value and all values have the same fixed size.
> Sparse and with a skip list if one or both conditions are false.
> The DocumentsWriter would have to check whether both conditions are
> true, in which case it would store the values non-sparse. The
> SegmentMerger would only write the non-sparse format for the new segment
> if all of the source segments also had the non-sparse format with the
> same value size.
> This would provide the most flexibility for the users I think.
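The selection rule described above can be sketched roughly as follows. This is only an illustration: ColumnFormatChooser, SegmentColumn, chooseForNewSegment and chooseForMerge are hypothetical names, not Lucene classes or methods, and the "-1 means no value" convention is an assumption made for the example.

```java
import java.util.List;

/** Hypothetical sketch of the format decision; none of these names exist in Lucene. */
final class ColumnFormatChooser {
    enum Format { NON_SPARSE, SPARSE }

    /** Per-segment summary: the chosen format and, for NON_SPARSE, the fixed value size in bytes. */
    static final class SegmentColumn {
        final Format format;
        final int valueSize; // -1 when the segment is stored sparse

        SegmentColumn(Format format, int valueSize) {
            this.format = format;
            this.valueSize = valueSize;
        }
    }

    /**
     * Writer-side check. valueLengths[doc] holds the value length in bytes,
     * or -1 (assumed convention) when the document has no value.
     */
    static SegmentColumn chooseForNewSegment(int[] valueLengths) {
        if (valueLengths.length == 0) {
            return new SegmentColumn(Format.SPARSE, -1);
        }
        int first = valueLengths[0];
        for (int len : valueLengths) {
            // Condition 1: every document has a value. Condition 2: all sizes match.
            if (len < 0 || len != first) {
                return new SegmentColumn(Format.SPARSE, -1);
            }
        }
        return new SegmentColumn(Format.NON_SPARSE, first);
    }

    /** Merge-side check: non-sparse only if every source is non-sparse with the same value size. */
    static SegmentColumn chooseForMerge(List<SegmentColumn> sources) {
        if (sources.isEmpty()) {
            return new SegmentColumn(Format.SPARSE, -1);
        }
        int size = sources.get(0).valueSize;
        for (SegmentColumn s : sources) {
            if (s.format != Format.NON_SPARSE || s.valueSize != size) {
                return new SegmentColumn(Format.SPARSE, -1);
            }
        }
        return new SegmentColumn(Format.NON_SPARSE, size);
    }
}
```

With this rule, one document missing the field, or one value of a different length, is enough to push the segment (and any merge that includes it) to the sparse format.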

OK, got it.  So in the case where I always put a field "UID" on every
document, always a 4-byte binary field, then Lucene will "magically"
store this as a non-sparse column-stride field for every segment.

But I still have to mark the Field as "column-stride storage", right?

Even if some docs do not have the field, it is still beneficial to
store it non-sparse up until a point.  EG the logic in
BitVector.isSparse() is doing a similar calculation.  This is only
possible when the field, when set on the document, is always the same
length in bytes.
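As a rough illustration of that trade-off (not a copy of what BitVector.isSparse() does), one could compare the estimated size of the two layouts. DensityHeuristic and its methods are hypothetical, and the 4 bytes of per-entry overhead assumed for a sparse entry is just a placeholder for whatever doc ID encoding is really used.

```java
/** Hypothetical size comparison for a fixed-length field; not Lucene code. */
final class DensityHeuristic {
    /** Assumed cost of recording the doc ID next to each sparse value. */
    static final int DOC_ID_BYTES = 4;

    /** Non-sparse layout: one slot per document, whether or not it has a value. */
    static long denseBytes(int maxDoc, int valueSize) {
        return (long) maxDoc * valueSize;
    }

    /** Sparse layout: only documents with a value, each paying the doc ID overhead. */
    static long sparseBytes(int docsWithValue, int valueSize) {
        return (long) docsWithValue * (valueSize + DOC_ID_BYTES);
    }

    /** Prefer non-sparse while it is no larger than the sparse estimate. */
    static boolean preferNonSparse(int maxDoc, int docsWithValue, int valueSize) {
        return denseBytes(maxDoc, valueSize) <= sparseBytes(docsWithValue, valueSize);
    }
}
```

Under these assumptions a 4-byte field in a 1000-document segment breaks even at 500 documents with a value: the non-sparse layout costs 4000 bytes either way, while the sparse one costs 8 bytes per present value. A non-sparse layout with gaps would also need some way to mark the missing slots (a reserved value or a bit set), which this sketch ignores.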

Maybe we should also allow users to explicitly state that they wish
for this field to be stored in this way (sparse or non-sparse) rather
than having Lucene choose?

New question: how would we handle a "boolean" type column-stride
stored field?  It seems like we should always use BitVector since it
already handles the sparse/non-sparse storage decision "under the
hood"?

> > I think many apps have no trouble loading the array-of-ints entirely
> > into RAM, either because there are not that many docs or because
> > throwing RAM at the problem is fine (eg on a 64-bit JVM).
> > 
> > From John's tests, the "load int[] directly from disk" took 186 msec
> > vs the payload approach (using today's payloads API) took 430 msec.
> > 
> > This is a sizable performance difference (2.3 X faster) and for
> > interactive indexing apps, where minimizing cost of re-opening readers
> > is critical, this is significant.  Especially combining this with the
> > ideas from LUCENE-831 (incrementally updating the FieldCache; maybe
> > distributing the FieldCache down into sub-readers) should make
> > re-opening + re-warming much faster than today.
> > 
> Yes definitely. I was planning to add a FieldCache implementation that
> uses these per-doc payloads - it's one of the most obvious use-cases.
> However, I think providing an iterator in addition, like TermDocs, makes
> sense too. People might have very big indexes, store longer values than
> 4-byte ints, or use more than one per-doc payload. In some tests I
> found out that the performance is still often acceptable, even if the
> values are not cached. (It's like having one AND-term more in the query,
> as one more "posting list" has to be processed).
> > If so, wouldn't this API just fit under FieldCache?  Ie "getInts(...)"
> > would look at FieldInfo, determine that this field is stored
> > column-stride, and load it as one big int array?
> > 
> So I think a TermDocs-like iterator plus a new FieldCache implementation
> would make sense?
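For the cached, non-sparse case, the "one big int array" load could look something like the sketch below. DenseIntColumn, getInts and encode are hypothetical names, and the on-disk layout (maxDoc consecutive big-endian 4-byte values) is an assumption for illustration, not a format anyone has agreed on in this thread.

```java
import java.nio.ByteBuffer;

/** Hypothetical decoder for a non-sparse column of 4-byte values; not Lucene's FieldCache. */
final class DenseIntColumn {
    /** Decodes maxDoc big-endian ints into an array indexed by doc ID. */
    static int[] getInts(byte[] raw, int maxDoc) {
        if (raw.length != (long) maxDoc * Integer.BYTES) {
            throw new IllegalArgumentException("expected exactly 4 bytes per document");
        }
        int[] values = new int[maxDoc];
        // ByteBuffer is big-endian by default; the bulk get fills the whole array.
        ByteBuffer.wrap(raw).asIntBuffer().get(values);
        return values;
    }

    /** Inverse of getInts, used here only to build example input. */
    static byte[] encode(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        buf.asIntBuffer().put(values);
        return buf.array();
    }
}
```

Because the value for a document sits at a fixed offset (doc ID times 4), nothing has to be parsed per document, which is the property that makes this layout cheap to load.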

OK, I agree, we should have an iterator API as well so that you can
process this posting list "document at a time" just like all other
terms in the query.
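A minimal sketch of what such an iterator might look like, modelled loosely on TermDocs. PerDocInts, ArrayPerDocInts and DocAtATime are hypothetical names, the in-memory arrays stand in for whatever is actually read from the index, and skipTo is a plain linear scan here where a real implementation would presumably use the skip list.

```java
/** Hypothetical TermDocs-like cursor over per-document int values; not a Lucene interface. */
interface PerDocInts {
    boolean next();             // move to the next document that has a value
    boolean skipTo(int target); // move beyond the current entry to the first doc >= target
    int doc();                  // current doc ID (valid after next/skipTo returned true)
    int value();                // value stored for the current document
}

/** Array-backed implementation used only for illustration. */
final class ArrayPerDocInts implements PerDocInts {
    private final int[] docs;   // ascending doc IDs that have a value
    private final int[] values; // values[i] belongs to docs[i]
    private int pos = -1;

    ArrayPerDocInts(int[] docs, int[] values) {
        this.docs = docs;
        this.values = values;
    }

    public boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    public boolean skipTo(int target) {
        do {
            if (!next()) {
                return false;
            }
        } while (docs[pos] < target);
        return true;
    }

    public int doc() { return docs[pos]; }

    public int value() { return values[pos]; }
}

final class DocAtATime {
    /** Sums the values of the matching docs (ascending IDs), skipping docs without a value. */
    static long sumValues(int[] hits, PerDocInts column) {
        long sum = 0;
        boolean more = column.next();
        for (int hit : hits) {
            while (more && column.doc() < hit) {
                more = column.next();
            }
            if (!more) {
                break;
            }
            if (column.doc() == hit) {
                sum += column.value();
            }
        }
        return sum;
    }
}
```

sumValues walks the query hits and the column in lockstep, so nothing is cached and the column is read once, front to back, much like one more posting list in a conjunction.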

> We could further make these fields updateable, like norms?

Agreed, though how would the API work (if indeed we are just adding
"column-stride[-non]-sparse" options to Field)?  Because if the Field
is also indexed, we can't update that.  I think I can see why you
wanted to make a new API here :)

