lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Wed, 11 Nov 2009 23:00:41 GMT


Michael McCandless commented on LUCENE-1526:

bq. These sound serious - if you can provide any details, that'd help. I'll do some stress
testing too. Thanks for testing and reporting

Out of these, the specific issue of incorrectly applied deletes is the
easiest to see: we hit it by indexing up to a million docs, then
continuing to add docs, each add preceded by a delete on that doc's UID,
where the UID, instead of increasing, wraps around mod 1 million.
Once the UIDs wrap, calling numDocs (not maxDoc) on the reader always
returns 1M with Zoie, but with NRT it slowly grows above 1M.
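The invariant being checked by that wraparound test can be sketched in
plain Java (a hypothetical simulation of the delete-then-add loop, not
the actual Zoie or Lucene test code): after the UIDs wrap, every add is
preceded by a delete of the same UID, so the live doc count must stay
pinned at 1M; the reported bug is numDocs creeping above that bound.

```java
import java.util.HashSet;
import java.util.Set;

class UidLoopInvariant {
    static final int MOD = 1_000_000; // UID space wraps at 1M

    // Returns the live-doc count after 'total' add operations,
    // each preceded by a delete of uid = i % MOD (the delete is a
    // no-op on the first pass, before any UID has wrapped).
    static int liveDocsAfter(int total) {
        Set<Integer> live = new HashSet<>();
        for (int i = 0; i < total; i++) {
            int uid = i % MOD;
            live.remove(uid); // delete-by-UID
            live.add(uid);    // add the replacement doc
        }
        return live.size();
    }
}
```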

So far I've had no luck repro'ing this.  I have a 5M doc wikipedia
index.  Then I created an alg with 2 indexing threads (each replacing
docs at 100 docs/sec), and reopening ~ 60 times per second.  Another
thread then verifies that the docCount is always 5M.  It's run fine
for quite a while now...

Hmm maybe I need to try the balanced merge policy?  That would be
spooky if it caused the issue...

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy-on-write of the BitVector can become costly because
> the entire underlying byte array must be allocated and copied. A way
> to make this clone/delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
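The scheme described above can be sketched in self-contained Java (not
Lucene's actual classes; the class name, the MERGE_THRESHOLD constant,
and the use of java.util.BitSet/TreeSet are illustrative stand-ins for
the real BitVector, DocIdSet, and merge policy): a shared, immutable
deleted-docs bitset plus a small per-reader tombstone set of new
deletions. A doc is deleted if it appears in either set (equivalently,
the live-doc set is the AND of the two complements), and cloning a
reader copies only the tombstones, not the full bitset. When the
tombstones grow past the threshold, they are folded into a fresh
bitset, leaving the old one untouched for readers still holding it.

```java
import java.util.BitSet;
import java.util.TreeSet;

class TombstoneDeletes {
    static final int MERGE_THRESHOLD = 1024; // illustrative only

    private BitSet merged;                        // shared with prior readers
    private final TreeSet<Integer> tombstones = new TreeSet<>();

    TombstoneDeletes(BitSet merged) {
        this.merged = merged;
    }

    boolean isDeleted(int doc) {
        // Deleted if in either set.
        return merged.get(doc) || tombstones.contains(doc);
    }

    void delete(int doc) {
        tombstones.add(doc);
        if (tombstones.size() > MERGE_THRESHOLD) {
            mergeTombstones();
        }
    }

    // Fold tombstones into a new bitset; the old bitset is never
    // mutated, so readers still referencing it are unaffected
    // (copy-on-write).
    private void mergeTombstones() {
        BitSet fresh = (BitSet) merged.clone();
        for (int doc : tombstones) {
            fresh.set(doc);
        }
        merged = fresh;
        tombstones.clear();
    }

    int tombstoneCount() {
        return tombstones.size();
    }
}
```

Deleting is O(log n) into the small tombstone set rather than an O(maxDoc) byte-array copy per clone; the merge amortizes that copy across many deletes, which is the trade-off the proposed merge policy would tune.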

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

