lucene-dev mailing list archives

From "Michael McCandless (JIRA)"
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Wed, 11 Nov 2009 10:58:39 GMT


Michael McCandless commented on LUCENE-1526:

1) file handle leak - Our prod-quality machine fell over after 1 hr of running using NRT due
to file handle leaking.
2) cpu and memory starvation - monitoring cpu and memory usage, the machine seems very starved,
and I think that leads to performance differences more than the extra array lookup.
3) I am also seeing correctness issues, e.g. deletes don't get applied correctly.
I am not sure about the unit test coverage for NRT to comment specifically.

These sound serious -- if you can provide any details, that'd help.
I'll do some stress testing too.  Thanks for testing and reporting ;)

bq. Yes, so how does Lucene NRT deal with new deletes?

Lucene NRT makes a clone of the BitVector for every reader that has
new deletions.  Once this is done, searching is "normal" -- it's as if
the reader were a disk reader.  There's no extra checking of deleted
docs (unlike Zoie), no OR'ing of 2 BitVectors, etc.
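
To make the clone-per-reopen approach concrete, here is a minimal sketch.
It is not Lucene's code: SketchReader is a hypothetical stand-in for a
segment reader, and java.util.BitSet stands in for Lucene's BitVector.
The point is only that the copy cost is paid once at reopen, and the
search-time check stays a single bit lookup.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a segment reader; not Lucene's SegmentReader. */
final class SketchReader {
    // One bit per doc, set = deleted. This class never mutates it after construction.
    private final BitSet deleted;

    SketchReader(BitSet deleted) {
        this.deleted = deleted;
    }

    /** Reopen with new deletions: copy the whole bit set once, up front. */
    SketchReader reopen(int... newlyDeletedDocs) {
        if (newlyDeletedDocs.length == 0) {
            return this; // nothing changed, so the old reader can be shared
        }
        BitSet copy = (BitSet) deleted.clone(); // cost grows with the number of docs
        for (int doc : newlyDeletedDocs) {
            copy.set(doc);
        }
        return new SketchReader(copy);
    }

    /** Search-time check: one bit lookup, no second structure to consult. */
    boolean isDeleted(int doc) {
        return deleted.get(doc);
    }
}
```

Readers opened earlier keep their own bit set, so they continue to see
the deletions as of their own reopen.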

Yes, this makes Lucene's reopen more costly.  But, then there's no
double checking for deletions.  That's the tradeoff, and this is why
the 64 msec is added to Zoie's search time.  Zoie's searches are

The fact that Zoie on the pure indexing case (ie no deletions) was 10X
faster than Lucene is very weird -- that means something else is up,
besides how deletions are carried in RAM.  It's entirely possible it's
the fact that Lucene doesn't flush the tiny segments to a RAMDir
(which LUCENE-1313 addresses).  Or, maybe there's another difference
in that test (eg, MergePolicy?).  Jake or John, if you could shed some
light on any other specific differences in that test, that would help.

bq. This is simply a question of trade-offs.

Precisely: Zoie has faster reopen time, but slower search time.  But
we haven't yet measured how much slower Zoie's searches are.

Actually I thought of a simple way to run the "search only" (not
reopen) test -- I'll just augment TopScoreDocCollector to optionally
check the IntSetAccelerator, and measure the cost in practice, for
different numbers of docs added to the IntSet.
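
A rough sketch of what such a measurement could look like follows. It
makes several assumptions: it does not touch TopScoreDocCollector or
Zoie's IntSetAccelerator; a sorted int[] probed with binary search
stands in for the extra deleted-docs check, and DeletedCheckCost,
collect and timeNanos are hypothetical names. The timing is crude
(no JIT warm-up), so it only shows the shape of the experiment.

```java
import java.util.Arrays;

final class DeletedCheckCost {
    /** Counts docs in [0, maxDoc) that survive the optional extra deleted-set check. */
    static int collect(int maxDoc, int[] sortedDeleted, boolean checkSet) {
        int hits = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            // The extra per-hit check whose cost we want to isolate.
            if (checkSet && Arrays.binarySearch(sortedDeleted, doc) >= 0) {
                continue;
            }
            hits++;
        }
        return hits;
    }

    /** Wall-clock nanoseconds for one pass over all docs. */
    static long timeNanos(int maxDoc, int[] sortedDeleted, boolean checkSet) {
        long start = System.nanoTime();
        int hits = collect(maxDoc, sortedDeleted, checkSet);
        long elapsed = System.nanoTime() - start;
        if (hits < 0) {
            throw new IllegalStateException(); // keeps the result live for the JIT
        }
        return elapsed;
    }
}
```

Comparing timeNanos with checkSet on and off, for deleted sets of
different sizes, would give a first estimate of the per-hit overhead.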

bq. BTW, is there a performance benchmark/setup for lucene NRT?

In progress -- see LUCENE-2050.

bq. Aiming for maxing out indexing speed and query throughput at the same time is what we're
testing here, and this is a reasonable extreme limit to aim for when stress-testing real-time

But your test is missing a dimension: frequency of reopen.  If you
reopen once per second, how do Zoie/Lucene compare?  Twice per second?
Once every 5 seconds?  Etc.

It sounds like LinkedIn has a hard requirement that the reopen must
happen hundreds of times per second, which is perfectly fine.  That's
what LinkedIn needs.  But other apps have different requirements, and
so to make an informed decision they need to see the full picture.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
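
The tombstone idea described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the attached
patch: TombstoneDeletes and its methods are hypothetical,
java.util.BitSet stands in for BitVector, a sorted int[] stands in for
the tombstone DocIdSet, and the merge policy is reduced to a single
count threshold. The sketch treats a document as deleted when either
structure marks it.

```java
import java.util.Arrays;
import java.util.BitSet;

final class TombstoneDeletes {
    private final BitSet base;      // shared across readers, never mutated by this class
    private final int[] tombstones; // sorted doc ids deleted since base was built

    TombstoneDeletes(BitSet base, int[] sortedTombstones) {
        this.base = base;
        this.tombstones = sortedTombstones;
    }

    /** Returns a new view with extra deletions; folds into a fresh bit set past the threshold. */
    TombstoneDeletes withDeletes(int maxTombstones, int... docs) {
        int[] merged = Arrays.copyOf(tombstones, tombstones.length + docs.length);
        System.arraycopy(docs, 0, merged, tombstones.length, docs.length);
        Arrays.sort(merged);
        if (merged.length <= maxTombstones) {
            return new TombstoneDeletes(base, merged); // cheap path: base is shared, not copied
        }
        // Simplified merge policy: too many tombstones, so pay for the full copy now.
        BitSet folded = (BitSet) base.clone();
        for (int doc : merged) {
            folded.set(doc);
        }
        return new TombstoneDeletes(folded, new int[0]);
    }

    /** Search-time check consults both structures. */
    boolean isDeleted(int doc) {
        return base.get(doc) || Arrays.binarySearch(tombstones, doc) >= 0;
    }

    int tombstoneCount() {
        return tombstones.length;
    }
}
```

The trade-off is the one discussed in the comment: reopen avoids
copying the full bit set most of the time, while every search-time
check may have to consult a second structure.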

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
