lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) Tombstone deletions in IndexReader
Date Sat, 07 Nov 2009 14:19:32 GMT


Michael McCandless commented on LUCENE-1526:

This looks like good progress!  So it's a copy-on-write, by fixed page
size (8 kbits, by default), BitVector.

One danger of MultiBitVector is if one does a deletion against an
already cloned instance, right?  Because each instance only tracks
refs back to the instance it was cloned from, not forward refs of
other instances that have cloned it?  So a clone of myself would
incorrectly see changes that I make.

That said, Lucene's internal use shouldn't ever do that -- when we
clone the deleted docs, the previous instance should never be touched
again (the "write lock" moves to the new clone, just like when
reopening a reader that's holding the write lock).  Can you add
assertions into MultiBitVector to verify this?  And explain in
javadocs that once you clone it, it's frozen.

Also, I don't think we should force SegmentReader to always use MBV,
unless we're sure the perf hit is negligible?  Can we somehow
conditionalize that?

What remains here?  Ie what tests fail & why?  (Or, why isn't it
committable?).  If you can get it to a committable state, I can run
some perf tests...

In LUCENE-1458, the new flex API uses a simple interface (called
"Bits") to represent docs that should be skipped, and when you ask for
the DocsEnum, you pass in your "Bits skipDocs".  This will be
important for LUCENE-1536, but also important for this issue because
it'll make swapping in different Bits impls easy.

> Tombstone deletions in IndexReader
> ----------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message