lucene-java-user mailing list archives

From: Michael McCandless
Subject: Re: IndexWriter.deleteDocuments(Query query)
Date: Thu, 02 Apr 2009 09:20:37 GMT

On Wed, Apr 1, 2009 at 6:37 PM, John Wang wrote:
> a code snippet is worth 1000 words :)

Hear, hear!

OK, now I understand the difference.

With approach 1, for each of N UIDs you use a TermDocs to find the
postings for that UID, and retrieve the one docID corresponding to
that UID.  You retrieve UID -> docID.

With approach 2, you iterate through all docs in the index, using a
single full walk through the single TermPositions instance for your
special UID_TERM, and retrieve the UID stored in the 4-byte payload.
You retrieve docID -> UID.
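
To make the two directions concrete, here is a toy sketch in plain Java.
It does not use the Lucene API: a Map stands in for the inverted UID
field, and a byte[][] stands in for the per-document payloads on
UID_TERM.  The class and method names are made up for illustration, and
the big-endian payload encoding is an assumption; the thread does not
say how the 4 bytes are laid out.

```java
import java.nio.ByteBuffer;
import java.util.Map;

// Toy model only: plain Java collections stand in for Lucene postings and payloads.
class UidLookupSketch {

    // Approach 1: one lookup per requested UID (UID -> docID).
    // 'postings' plays the role of the inverted UID field, one docID per UID.
    static int[] docIdsForUids(Map<Integer, Integer> postings, int[] uids) {
        int[] docIds = new int[uids.length];
        for (int i = 0; i < uids.length; i++) {
            // Stands in for the terms dict lookup plus the posting read.
            Integer docId = postings.get(uids[i]);
            docIds[i] = (docId == null) ? -1 : docId; // -1 means "UID not in the index"
        }
        return docIds;
    }

    // Approach 2: one pass over every document (docID -> UID).
    // payloads[docId] plays the role of the 4-byte payload stored for that doc.
    static int[] loadAllUids(byte[][] payloads) {
        int[] uidByDoc = new int[payloads.length];
        for (int docId = 0; docId < payloads.length; docId++) {
            // Assumed encoding: 4 bytes, big-endian (ByteBuffer's default order).
            uidByDoc[docId] = ByteBuffer.wrap(payloads[docId]).getInt();
        }
        return uidByDoc;
    }

    // Helper for building sample data: encode a UID as a 4-byte big-endian payload.
    static byte[] encode(int uid) {
        return ByteBuffer.allocate(4).putInt(uid).array();
    }
}
```

Note that docIdsForUids does work proportional to the number of UIDs
requested, while loadAllUids always touches every document regardless
of how many UIDs you actually care about.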

Approach 1 is expected to be more costly per UID: Lucene must
consult the terms dict (binary search on the terms index, followed by
a scan on disk within the 128-term block) to find the posting, then
seek to the posting and read it.
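
The two-step terms dict lookup can be sketched roughly as follows.
This is a simplified in-memory model, not Lucene's actual reader code:
the class name, the String[] layout, and the find method are all
assumptions made for illustration.  The index array holds every 128th
term (the part kept in RAM), and the full sorted array stands in for
the on-disk terms dict.

```java
import java.util.Arrays;

// Simplified model of a sparse terms index over a sorted terms dict.
class TermsDictSketch {
    static final int INDEX_INTERVAL = 128; // one index entry per 128 terms

    final String[] sortedTerms; // full terms dict (stands in for the on-disk file)
    final String[] indexTerms;  // every 128th term (stands in for the in-RAM index)

    TermsDictSketch(String[] sortedTerms) {
        this.sortedTerms = sortedTerms;
        int n = (sortedTerms.length + INDEX_INTERVAL - 1) / INDEX_INTERVAL;
        indexTerms = new String[n];
        for (int i = 0; i < n; i++) {
            indexTerms[i] = sortedTerms[i * INDEX_INTERVAL];
        }
    }

    // Returns the position of 'term' in the dict, or -1 if it is absent.
    int find(String term) {
        // Step 1: binary search the sparse index for the last index term <= term.
        int pos = Arrays.binarySearch(indexTerms, term);
        int block = (pos >= 0) ? pos : -pos - 2; // insertion point minus one
        if (block < 0) {
            return -1; // term sorts before the first term in the dict
        }
        // Step 2: linear scan inside that one block of up to 128 terms.
        int start = block * INDEX_INTERVAL;
        int end = Math.min(start + INDEX_INTERVAL, sortedTerms.length);
        for (int i = start; i < end; i++) {
            int cmp = sortedTerms[i].compareTo(term);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break; // passed the spot where the term would be
            }
        }
        return -1;
    }
}
```

In the real index the scan in step 2 reads from disk, and finding the
term is still followed by a separate seek into the postings file; that
per-UID overhead is what makes approach 1 expensive when repeated many
times.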

Approach 2 is an efficient "bulk" walk, but it loads every docID -> UID
mapping into RAM (i.e., you cannot be selective about which UIDs you load).

So if the number of UIDs you need to process is small, approach 1
should win; but after that number crosses X (apparently X < 10000 for
you), approach 2's "bulk walk" will win.
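
One rough way to reason about where X falls is a linear cost model:
approach 1 costs about (number of UIDs) x (cost per lookup), and
approach 2 costs about (number of docs) x (cost per doc walked).  The
sketch below just encodes that comparison.  The model, the names, and
any cost values you plug in are assumptions, not measurements from this
thread; real costs depend on caching, disk seeks, and index size.

```java
// Back-of-the-envelope cost model; all inputs are caller-supplied estimates.
class BreakEvenSketch {

    // Estimated UID count X at which both approaches cost the same
    // (rounded up to a whole number of UIDs).
    static long breakEvenUidCount(long numDocs, double costPerLookup, double costPerDocWalked) {
        return (long) Math.ceil(numDocs * costPerDocWalked / costPerLookup);
    }

    // True when the bulk walk (approach 2) is estimated to be cheaper
    // than doing numUids individual lookups (approach 1).
    static boolean preferBulkWalk(long numUids, long numDocs,
                                  double costPerLookup, double costPerDocWalked) {
        return numUids * costPerLookup > numDocs * costPerDocWalked;
    }
}
```

Under this model X grows with the index size and shrinks as individual
lookups get more expensive relative to a sequential step, so the
cheapest way to find your own X is to measure both approaches at a few
values of N.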

Approach 1 will get faster with the "pulsing" approach for inlining
low-frequency postings directly into the terms dict (discussed on
java-dev and implemented as a codec in the experimental flexible
indexing patch on LUCENE-1458), because we save the second seek.

Approach 2 will get much faster with column-stride fields.

Though we may want to take this even further and allow inversion for
special fields ("primary key int" field, ie your UID) to be stored as
a column-stride field.  Probably this could simply be another codec in
LUCENE-1458.  Then, delete-by-Term would be exceptionally fast for
such fields.

