lucene-java-user mailing list archives

From Ravikumar Govindarajan <>
Subject Re: App supplied docID in lucene possible?
Date Tue, 06 Nov 2012 06:04:03 GMT
Looks far more complex than I had assumed!!!

If an invariant of "non-decreasing docID per flush" is pushed to the app, that
could save Lucene from handling the complex sparse-data logic, no?

Lucene can keep its existing logic without major changes, detect any
out-of-order doc before every flush, and emit an error.
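
To make the proposal concrete, here is a minimal sketch of such a pre-flush check. It is not Lucene code: the class DocIdOrderCheck and its methods are hypothetical names, and it assumes the app hands over one int docID per added document and that the invariant only has to hold within a single flush.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Lucene: tracks app-supplied docIDs
// buffered since the last flush and checks that they never decrease.
class DocIdOrderCheck {
    private final List<Integer> buffered = new ArrayList<>();

    // Record the app-supplied docID of a newly buffered document.
    void add(int docId) {
        buffered.add(docId);
    }

    // Returns the position of the first docID that is smaller than its
    // predecessor, or -1 if the buffered docIDs are non-decreasing.
    int firstOutOfOrder() {
        for (int i = 1; i < buffered.size(); i++) {
            if (buffered.get(i) < buffered.get(i - 1)) {
                return i;
            }
        }
        return -1;
    }

    // Intended to run just before a flush: throws if the invariant is
    // violated, otherwise clears the buffer for the next flush.
    void checkBeforeFlush() {
        int bad = firstOutOfOrder();
        if (bad >= 0) {
            throw new IllegalStateException(
                "docID " + buffered.get(bad) + " at position " + bad
                + " is smaller than the previous docID " + buffered.get(bad - 1));
        }
        buffered.clear();
    }
}
```

The check itself is a single linear pass, so its cost is small next to the flush. It does nothing for the sparse storage of stored fields, doc values and deleted docs.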

I understand that multi-threaded indexing and similar concerns would also
need to be handled by the app, but that's what apps get when trying to
control docIDs.


On Mon, Nov 5, 2012 at 9:11 PM, Michael McCandless <> wrote:

> On Mon, Nov 5, 2012 at 4:37 AM, Ravikumar Govindarajan
> <> wrote:
> > Thanks Mike,
> >
> > Joins could be slower than docID based approach, no?
> Yes: slower at search time but faster at update time (generally not a
> good tradeoff... but it seems like in your case slow updates are the
> problem).
> > It would be great if Lucene could incorporate an external docID after
> > weighing the pros & cons. Many like us would be willing to trade off
> > search latency to some extent in return for the low-hanging fruit.
> I think this would be very hard, for stored fields / term vectors /
> doc values / field cache / deleted docs, which cannot store documents
> "sparsely" today.
> Postings can store sparsely, but, when we write the postings in
> IndexWriter's RAM buffer, we rely on docIDs being assigned "in order".
>  So if the app specified the docID, we'd have to change how we buffer
> postings in RAM, and then fix flush to re-sort the docIDs before
> writing the segment.
> We have discussed such sort-docIDs-on-flush before, eg you can reduce
> postings size if you sort similar documents "together", but I don't
> know of anyone implementing that.
> Also lots of places at search time rely on a docID being the sum of a
> segment's docBase and the docID within the segment ... that would have
> to change to just use the decoded docID directly.
> Mike McCandless
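
For readers unfamiliar with the docBase arithmetic mentioned in the quoted reply, a minimal sketch follows. It is an illustration, not Lucene's implementation: the class DocBaseResolver and its method names are hypothetical, and it assumes every segment holds at least one document so that all docBase values are distinct.

```java
import java.util.Arrays;

// Hypothetical illustration, NOT Lucene code: maps between an index-wide
// docID and a (segment, segment-local docID) pair using per-segment docBase.
class DocBaseResolver {
    // docBases[i] = total number of docs in segments 0..i-1
    private final int[] docBases;

    // segmentMaxDocs[i] is the number of docs in segment i (assumed > 0).
    DocBaseResolver(int[] segmentMaxDocs) {
        docBases = new int[segmentMaxDocs.length];
        int base = 0;
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            docBases[i] = base;
            base += segmentMaxDocs[i];
        }
    }

    // Index-wide docID = segment's docBase + docID within the segment.
    int globalDocId(int segment, int localDocId) {
        return docBases[segment] + localDocId;
    }

    // Finds the segment whose docBase is the largest one <= globalDocId.
    int segmentOf(int globalDocId) {
        int idx = Arrays.binarySearch(docBases, globalDocId);
        // Exact hit: first doc of that segment. Otherwise binarySearch
        // returns -(insertionPoint) - 1, and we want insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }

    // Segment-local docID = index-wide docID - segment's docBase.
    int localDocId(int globalDocId) {
        return globalDocId - docBases[segmentOf(globalDocId)];
    }
}
```

With app-supplied docIDs this mapping no longer holds, because a segment's docIDs would not form one contiguous range starting at its docBase. That is why the search-time code would have to use the decoded docID directly.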
