lucene-dev mailing list archives

From Michael Busch <>
Subject Re: Per-document Payloads
Date Mon, 29 Oct 2007 08:57:04 GMT
Michael McCandless wrote:
> Michael, are you thinking that the storage would/could be non-sparse
> (like norms), and loaded/cached once in memory, especially for fixed
> size fields?  EG a big array of ints of length maxDocID?  In John's
> original case, every doc has this UID int field; I think this is
> fairly common.

Yes, I agree, this is a common use case. In my first mail in this thread
I suggested having a flexible format: non-sparse, like norms, in case
every document has exactly one value and all values have the same fixed
size; sparse, with a skip list, if one or both conditions are false.

The DocumentsWriter would have to check whether both conditions are
true, in which case it would store the values non-sparse. The
SegmentMerger would only write the non-sparse format for the new segment
if all of the source segments also had the non-sparse format with the
same value size.

This would, I think, provide the most flexibility for users.
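The format decision described above can be sketched as a small predicate. The class and method names below are purely illustrative, not actual Lucene code:

```java
// Illustrative sketch only, not Lucene code: the two checks that decide
// whether a per-doc payload column can use the non-sparse (norms-like)
// layout, both at flush time and at merge time.
class PayloadFormatChooser {

    /** Non-sparse storage is only safe when every document has exactly
     *  one value and all values share a fixed byte length. */
    static boolean useNonSparse(int numDocs, int numValues, boolean fixedSize) {
        return numValues == numDocs && fixedSize;
    }

    /** A merged segment may stay non-sparse only if every source segment
     *  is non-sparse and all of them use the same value size. */
    static boolean mergeStaysNonSparse(boolean[] sourceNonSparse, int[] valueSizes) {
        for (boolean nonSparse : sourceNonSparse) {
            if (!nonSparse) return false;
        }
        for (int size : valueSizes) {
            if (size != valueSizes[0]) return false;
        }
        return true;
    }
}
```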

> I think many apps have no trouble loading the array-of-ints entirely
> into RAM, either because there are not that many docs or because
> throwing RAM at the problem is fine (eg on a 64-bit JVM).
> From John's tests, the "load int[] directly from disk" took 186 msec
> vs the payload approach (using today's payloads API) took 430 msec.
> This is a sizable performance difference (2.3 X faster) and for
> interactive indexing apps, where minimizing cost of re-opening readers
> is critical, this is significant.  Especially combining this with the
> ideas from LUCENE-831 (incrementally updating the FieldCache; maybe
> distributing the FieldCache down into sub-readers) should make
> re-opening + re-warming much faster than today.

Yes, definitely. I was planning to add a FieldCache implementation that
uses these per-doc payloads - it's one of the most obvious use cases.
However, I think providing an iterator in addition, like TermDocs, makes
sense too. People might have very big indexes, store values longer than
4-byte ints, or use more than one per-doc payload. In some tests I found
that the performance is often still acceptable even if the values are
not cached; it's like having one more AND term in the query, since one
more "posting list" has to be processed.
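A minimal in-memory sketch of such a TermDocs-style iterator over a non-sparse int column (the class is hypothetical, not an existing Lucene API; a real implementation over a sparse column would consult the skip list in skipTo):

```java
// Hypothetical sketch, not an existing Lucene API: a TermDocs-style
// iterator over a dense int column, one value per docID (like norms).
class IntColumnIterator {
    private final int[] values; // one value per docID
    private int doc = -1;       // current docID, -1 before first next()

    IntColumnIterator(int[] values) {
        this.values = values;
    }

    /** Advance to the next document; false when exhausted. */
    boolean next() {
        return ++doc < values.length;
    }

    /** Advance to the first doc >= target; a sparse implementation
     *  would use the skip list here instead of jumping directly. */
    boolean skipTo(int target) {
        doc = Math.max(doc + 1, target);
        return doc < values.length;
    }

    int doc()   { return doc; }
    int value() { return values[doc]; }
}
```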

> If so, wouldn't this API just fit under FieldCache?  Ie "getInts(...)"
> would look at FieldInfo, determine that this field is stored
> column-stride, and load it as one big int array?

So I think a TermDocs-like iterator plus a new FieldCache implementation
would make sense?
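For the non-sparse, fixed-size case, such a getInts(...) could be as simple as filling a maxDoc-sized array straight from the stored column. A rough sketch (the class name and I/O details are assumptions, not the actual FieldCache code):

```java
import java.io.DataInput;
import java.io.IOException;

// Rough sketch only, not the actual FieldCache code: for the non-sparse,
// fixed-size (4-byte int) format, getInts(...) degenerates to reading
// maxDoc consecutive ints into one big array, one slot per docID.
class ColumnLoader {
    static int[] getInts(DataInput in, int maxDoc) throws IOException {
        int[] cache = new int[maxDoc];
        for (int docID = 0; docID < maxDoc; docID++) {
            cache[docID] = in.readInt(); // column is dense: one value per doc
        }
        return cache;
    }
}
```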

We could further make these fields updateable, like norms?

