lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Wang <>
Subject Re: IndexWriter.deleteDocuments(Query query)
Date Thu, 02 Apr 2009 18:26:24 GMT
Hi Michael:
    Thanks for looking into this.

    Approach 2 has a dependency on how fast the delete set performs a check
on a given id, approach one doesn't. After replacing my delete set with a
simple bitset, approach 2 gets a 25-30% improvement.

   I understand if the delete set is small, approach 1 would be faster,
while approach two has a more constant/deterministic performance. I would
also save from indexing the UID term into the index if going with approach

   I don't however see how column-stride fields would help here, isn't it a
generalization of what I am doing?

  BTW, can you shine some light on why would IndexWriter move docids around
when it is opened and no docs has been added to it?



On Thu, Apr 2, 2009 at 2:20 AM, Michael McCandless <> wrote:

> On Wed, Apr 1, 2009 at 6:37 PM, John Wang <> wrote:
> > a code snippet is worth 1000 words :)
> Here here!
> OK, now I understand the difference.
> With approach 1, for each of N UIDs you use a TermDocs to find the
> postings for that UID, and retrieve the one docID corresponding to
> that UID.  You retrieve UID -> docID.
> With approach 2, you iterate through all docs in the index, using a
> single full walk through the single TermPositions instance for your
> special UID_TERM, and retrieve the UID stored in the 4-byte payload.
> You retrieve docID -> UID.
> Approach 1 is expected to be more costly, per UID - Lucene must
> consult the terms dict (binary search on the terms index, followed by
> scan on disk within the 128 term block) to find the posting, then seek
> to the posting and read that.
> Approach 2 is an efficient "bulk" walk, but it loads all docID -> UIDs
> into RAM (ie, you cannot be selective about which UIDs you load).
> So if the number of UIDs you need to process is small, approach 1
> should win; but after that number crosses X (apparently X < 10000 for
> you), approach 2's "bulk walk" will win.
> Approach 1 will get faster with the "pulsing" approach for inlining
> low-frequency postings directly into the terms dict (discussed on
> java-dev and implemented as a codec in the experimental flexible
> indexing patch on LUCENE-1458), because we save the second seek.
> Approach 2 will get much faster with column-stride fields
> (LUCENE-1231).
> Though we may want to take this even further and allow inversion for
> special fields ("primary key int" field, ie your UID) to be stored as
> a column-stride field.  Probably this could simply be another codec in
> LUCENE-1458.  Then, delete-by-Term would be exceptionally fast for
> such fields.
> Mike
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message