lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Thu, 12 Nov 2009 13:34:39 GMT


Michael McCandless commented on LUCENE-1526:

bq. Lucene NRT makes a clone of the BitVector for every reader that has new deletions. Once
this is done, searching is "normal" - it's as if the reader were a disk reader. There's no
extra checking of deleted docs (unlike Zoie), no OR'ing of 2 BitVectors, etc.

bq. Ok, so if this is copy-on-write, it's done every time there is a new delete for that segment?
If the disk index is optimized that means it would happen on every update, a clone of the
full numDocs sized BitVector? I'm still a little unsure of how this happens.

Right.  Actually, is the index optimized in your tests?  My current
correctness testing (for the "lost deletes") isn't optimized... I'll
try optimizing it.

bq. * somebody calls getReader() - they've got all the SegmentReaders for the disk segments, and
each of them has a BitVector for deletions.
bq. * IW.update() gets called - the BitVector for the segment which now has a deletion is cloned,
and set on a new pooled SegmentReader as its deletedSet

Actually, the IW.updateDocument call merely buffers the Term to be
deleted.  It does not resolve that term to the corresponding docID
until getReader (same as reopen) is called again.  But it would be
better if Lucene did the resolution in the FG (during the
updateDocument call); this is what LUCENE-2047 will fix.  From
reopen's point of view this backgrounds the resolution, i.e., reopen
no longer resolves all deletes in the FG.
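
The buffering behaviour described above can be sketched roughly as follows. This is a toy model, not Lucene code: `BufferingWriter`, `getReaderDeletes`, and `pendingDeletes` are hypothetical names, and recording a docID bound with each buffered term (so that the replacement document is not deleted along with the old one) is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy writer: deletes are buffered by term and resolved only on reopen. */
final class BufferingWriter {
    // term -> docIDs containing that term (a stand-in for the real postings)
    private final Map<String, List<Integer>> postings = new LinkedHashMap<>();
    // buffered delete term -> only docs with an ID below this bound are deleted (assumption)
    private final Map<String, Integer> bufferedDeletes = new LinkedHashMap<>();
    private final Set<Integer> deletedDocs = new HashSet<>();
    private int nextDocId = 0;

    void addDocument(String idTerm) {
        postings.computeIfAbsent(idTerm, k -> new ArrayList<>()).add(nextDocId++);
    }

    /** Only buffers the delete term; no term -> docID resolution happens here. */
    void updateDocument(String idTerm) {
        bufferedDeletes.put(idTerm, nextDocId);
        addDocument(idTerm);
    }

    /** Stand-in for getReader/reopen: resolves every buffered term in the caller's thread. */
    Set<Integer> getReaderDeletes() {
        for (Map.Entry<String, Integer> e : bufferedDeletes.entrySet()) {
            List<Integer> docs =
                postings.getOrDefault(e.getKey(), Collections.<Integer>emptyList());
            for (int docId : docs) {
                if (docId < e.getValue()) {
                    deletedDocs.add(docId);
                }
            }
        }
        bufferedDeletes.clear();
        return new HashSet<>(deletedDocs); // snapshot of the deleted docIDs
    }

    int pendingDeletes() {
        return bufferedDeletes.size();
    }
}
```

In this model the cost of resolving deletes is paid entirely inside `getReaderDeletes`, which is the foreground cost on reopen that the comment says LUCENE-2047 is meant to remove.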

But, yes, the clone happens on the first delete to arrive against a
SegmentReader after it had been cloned in the NRT reader.
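
A minimal copy-on-write sketch of that behaviour, assuming a single "shared" flag in place of whatever bookkeeping Lucene actually uses. `CowDeletes` and its methods are hypothetical names, and `java.util.BitSet` stands in for Lucene's BitVector.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a SegmentReader's deleted-docs state; not Lucene's class. */
final class CowDeletes {
    private BitSet bits;     // one bit per document in the segment
    private boolean shared;  // true while another reader may still reference `bits`
    int copies = 0;          // number of full-vector copies this instance has made

    CowDeletes(int maxDoc) {
        this.bits = new BitSet(maxDoc);
        this.shared = false;
    }

    private CowDeletes(BitSet sharedBits) {
        this.bits = sharedBits;
        this.shared = true;
    }

    /** Cheap clone for a newly opened reader: both sides share the bits for now. */
    CowDeletes cloneForReader() {
        this.shared = true;
        return new CowDeletes(bits);
    }

    /** The first delete after a clone copies the whole vector; later deletes mutate in place. */
    void delete(int docId) {
        if (shared) {
            bits = (BitSet) bits.clone();
            shared = false;
            copies++;
        }
        bits.set(docId);
    }

    boolean isDeleted(int docId) {
        return bits.get(docId);
    }
}
```

With this shape, a reader handed out by `cloneForReader` keeps seeing the deletions as they were when it was opened, and the writer-side instance pays for one full copy on the next delete and none on the deletes after that, until it is cloned again.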

bq. * maybe IW.update() gets called a bunch more - do these modify the pooled but as-yet-unused
SegmentReader? New readers in the pool? What?

Just more buffering right now, but after LUCENE-2047, it will mark
further bits in the already cloned vector.  I.e., the clone happens only
after getReader has returned a reader using that SegmentReader.

bq. * another call to getReader() comes in, and gets an IndexReader wrapping the pooled SegmentReaders.

Each SegmentReader is cloned, and referenced by the reader returned by
getReader.  And then the next delete to arrive against those segments will
force the bit vector to clone.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
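
The tombstone idea in the issue description above might look roughly like this. It is a sketch under stated assumptions, not the attached patch: `TombstoneDeletes` and its methods are hypothetical, `java.util.BitSet` stands in for BitVector, a document is treated as deleted if it appears in either the base vector or the tombstones, and the merge rule is a simple count threshold, since the description leaves the actual merge policy open.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical tombstone-backed deletes: an immutable base vector plus a small sorted int array. */
final class TombstoneDeletes {
    private final BitSet base;       // shared between readers, never mutated by this class
    private final int[] tombstones;  // sorted docIDs deleted since `base` was built

    TombstoneDeletes(BitSet base, int[] sortedTombstones) {
        this.base = base;
        this.tombstones = sortedTombstones;
    }

    boolean isDeleted(int docId) {
        return base.get(docId) || Arrays.binarySearch(tombstones, docId) >= 0;
    }

    /** Returns a new view with one more delete; copies only the tombstone array, not the base. */
    TombstoneDeletes withDelete(int docId) {
        if (isDeleted(docId)) {
            return this;
        }
        int pos = -Arrays.binarySearch(tombstones, docId) - 1; // insertion point
        int[] next = Arrays.copyOf(tombstones, tombstones.length + 1);
        System.arraycopy(tombstones, pos, next, pos + 1, tombstones.length - pos);
        next[pos] = docId;
        return new TombstoneDeletes(base, next);
    }

    /** Assumed merge policy: fold tombstones into a fresh vector once there are too many. */
    TombstoneDeletes mergeIfNeeded(int maxTombstones) {
        if (tombstones.length <= maxTombstones) {
            return this;
        }
        BitSet merged = (BitSet) base.clone();
        for (int docId : tombstones) {
            merged.set(docId);
        }
        return new TombstoneDeletes(merged, new int[0]);
    }

    int tombstoneCount() {
        return tombstones.length;
    }
}
```

The trade-off this illustrates is the one the description names: each new delete copies a short int array instead of the full vector, at the price of a second lookup per document until a merge folds the tombstones back into a single vector.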

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.