lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Tue, 10 Nov 2009 20:51:27 GMT


Michael McCandless commented on LUCENE-1526:

bq. we need to see it in the real-world context of running actual worst case queries.

Isn't checking every document in the corpus for deletes the worse case? e.g. first test?

Zoie must do the IntSet check plus the BitVector check (done by
Lucene), right?

Ie comparing IntSet lookup vs BitVector lookup isn't the comparison
you want to do.  You should compare the IntSet lookup (Zoie's added
cost) to 0.

So, for a query that hits 5M docs, Zoie will take 64 msec longer than
Lucene, due to the extra check.  What I'd like to know is what
pctg. slowdown that works out to be, eg for a simple TermQuery that
hits those 5M results -- that's Zoie's worst case search slowdown.

bq. at the expense of slower query time

According to the test, Zoie's query time is faster.

The tests so far are really testing Zoie's reopen time vs Lucene's.

To test the query time, you should set up Zoie w/ some pending
deletions, then turn off all indexing, then run the query test.

That would give us both extreme datapoints -- how much slower Lucene
is at reopening (which the current tests show), and how much slower
Zoie is during searching.

bq. it must double-check the deletions.

True, this double-check is only done for a candidate for a hit from the underlying query.

Right, Zoie's search slowdown is in proportion to the size of the
result set.  So eg for hard queries that produce very few results, the
impact will be negligible.  For simple queries that produce lots of
results, the relative cost is highest (but we don't yet know what it
actually is in practice).

It could still be neglible, eg since the "if" rarely triggers, the CPU
should be able to predict it just fine.

Net/net, Zoie has faster reopen time than Lucene, but then pays a
higher price (double check for deletion) for every result of every
search.  Users need to know what that price really is, in order to
make informed decision about which approach is best for their

bq. Normally result set is much smaller than the corpus, the overhead is not large.

Well, this is generally app dependent, and it's the net/net worst case
queries that apps need to worry about.  Lucene can't [yet] take
avantage of concurrency within a single query, so you're forced to
shard (= big step up in deployment complexity) once your worst case
queries get too slow.

bq. Can you describe the setup of the "indexing only "test?

starting off with an empty index and keep on adding documents, at the same time, for each
search request, return a reader for the current state of the indexing. Our test assumes 10
concurrent threads making search calls.

Oh I see: that test is just adding documents, vs the 2nd test which is
doing updateDocument.  Got it.

So, that's interesting... because, with no deletions, thus no
resolving of Term -> docID, and no cloning of the BitVector, Lucene's
reopen is still quite a bit slower.

What differences remain at this point?  Just the fact that the RAM dir
is being used to flush the new [tiny] segments?  Hmm what about the
merge policy?

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message