lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Thu, 12 Nov 2009 13:42:39 GMT


Michael McCandless commented on LUCENE-1526:

So we re-ran some of our tests last night, commenting out our deleted check to measure its
cost in the most extreme case possible: a dead easy query (in that it's only one term), but
one which, yes, hits the entire index (a MatchAllDocs query is actually special-cased
in our code, and is perfectly fast, so not a good worst case to check). As the index grows
above a million documents, zoie could shave somewhere from 22-28% of its time off by not
doing the extra check.

OK, thanks for running that test...

So in the worst case (dead easy query, matching many many docs) Zoie's
search slowdown is 22-28%.  It's presumably quite a bit less
(approaching zero) for hard queries that match few docs.  So the
search slowdown is app dependent.

I think it'd be possible (though, complex!) to do a hybrid approach.
Meaning you use Zoie to get the insanely fast reopen, but, to avoid
the search slowdown, in the background you convert the buffered UIDs
to the docID bit vector, such that once all conversion is done, you
stop checking the int set.

I guess you'd have to throttle the conversion so that in the unnatural
(100s per sec) reopen test, with many queries in flight at once, you
don't exhaust the heap.
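A minimal sketch of that hybrid idea, with hypothetical names (neither Zoie nor Lucene expose this class): after reopen, the per-hit deleted check consults an in-memory set of deleted docIDs; a background step folds those IDs into a bit vector and publishes it, after which the check degenerates to a plain bit test and the int-set lookup cost goes away:

```java
import java.util.BitSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch only: fast-reopen deletes start life in a hash set
// (the slow per-hit check); a background conversion publishes a BitSet,
// after which isDeleted() is a single bit test.
class HybridDeletes {
    private final Set<Integer> pendingDeletes = ConcurrentHashMap.newKeySet();
    private final AtomicReference<BitSet> converted = new AtomicReference<>(null);

    void delete(int docID) {
        pendingDeletes.add(docID);
    }

    // Run by a (throttled) background thread once per reopened reader.
    void convert(int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int docID : pendingDeletes) {
            bits.set(docID);
        }
        converted.set(bits);  // publish: searches stop checking the int set
    }

    // Per-hit deleted check used while collecting results.
    boolean isDeleted(int docID) {
        BitSet bits = converted.get();
        if (bits != null) {
            return bits.get(docID);         // fast path after conversion
        }
        return pendingDeletes.contains(docID);  // slower path before
    }
}
```

The throttling concern above would apply to how often convert() runs: each call allocates a maxDoc-sized BitSet, so hundreds of reopens per second with many queries in flight could pile up heap.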

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
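
The tombstone scheme in the description above can be sketched roughly as follows, with hypothetical names and java.util.BitSet standing in for Lucene's BitVector; here a doc counts as deleted if it is in either the shared bit vector or the sorted tombstone array, and the merge policy is simplified to a plain size threshold:

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch only: new deletions accumulate as sorted int
// "tombstones"; older deletions live in a shared bit vector. Cloning a
// reader never copies the full bit vector -- only when tombstones grow
// past the merge threshold are they folded into a fresh copy.
class TombstoneDeletes {
    private BitSet base;                 // shared deleted-docs bits
    private int[] tombstones = new int[0];  // sorted incremental deletions
    private final int mergeThreshold;

    TombstoneDeletes(BitSet base, int mergeThreshold) {
        this.base = base;
        this.mergeThreshold = mergeThreshold;
    }

    boolean isDeleted(int docID) {
        return base.get(docID) || Arrays.binarySearch(tombstones, docID) >= 0;
    }

    void addDeletes(int[] docIDs) {
        int[] merged = Arrays.copyOf(tombstones, tombstones.length + docIDs.length);
        System.arraycopy(docIDs, 0, merged, tombstones.length, docIDs.length);
        Arrays.sort(merged);
        if (merged.length > mergeThreshold) {
            // Simplified merge policy: too many tombstones, so fold them
            // into a cloned bit vector (the one copy-on-write we pay for).
            BitSet copy = (BitSet) base.clone();
            for (int d : merged) {
                copy.set(d);
            }
            base = copy;
            tombstones = new int[0];
        } else {
            tombstones = merged;
        }
    }
}
```

A real merge policy would presumably weigh tombstone count against total document count rather than use a fixed threshold, as the description suggests.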

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

