lucene-dev mailing list archives

From "John Wang (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Tue, 10 Nov 2009 21:29:27 GMT


John Wang commented on LUCENE-1526:

bq.  Zoie will take 64 msec longer than Lucene, due to the extra check.

That is not true. If you look at the report closely, the difference is 20 ms; 64 ms is the
total. (After I turned on -server, the difference is about 10 ms.) This is running on my laptop,
hardly a production server.

This also assumes the entire corpus is returned, whereas we should really take the average
result-set size from the query log.

However, to save this "overhead", BitVector wastes a lot of memory, which is expensive to
allocate, clone, and garbage-collect. In a running system, much of that cost is hard to measure.
This is simply a question of trade-offs.
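
To put a rough number on the cloning cost, here is a back-of-the-envelope sketch. It assumes one bit per document and a full copy on every clone; the segment size and reopen rate are hypothetical values chosen for illustration, not measurements from either system.

```java
// Rough arithmetic only: assumes a deletion set that stores one bit per document
// and is copied in full on every clone.
final class CloneCost {
    /** Bytes copied when a one-bit-per-doc deletion set is cloned in full. */
    static long bytesPerClone(long maxDoc) {
        return (maxDoc + 7) / 8; // round up to whole bytes
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L; // hypothetical segment size
        int clonesPerSecond = 10;  // hypothetical reopen rate
        long perClone = bytesPerClone(maxDoc);
        System.out.println(perClone + " bytes per clone");
        System.out.println(perClone * clonesPerSecond + " bytes allocated per second");
    }
}
```

Under those assumptions each clone copies about 1.25 MB, so the allocation and GC pressure grows with both segment size and reopen rate.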

Again, I would suggest running the tests yourself (after all, it is open source :) ) and making
decisions for yourself. That way, we can get a better understanding from concrete numbers
and scenarios.

BTW, is there a performance benchmark/setup for Lucene NRT?

bq. The tests so far are really testing Zoie's reopen time vs Lucene's

That is not true either. This test simply measures search with indexing turned on; it is not
specific to reopen. I don't think the statement that the performance difference is solely
due to reopen is substantiated. I am seeing the following with NRT:

1) File handle leak - our prod-quality machine fell over after 1 hr of running NRT due to
leaked file handles.
2) CPU and memory starvation - monitoring CPU and memory usage, the machine seems very starved,
and I think that leads to performance differences more than the extra array lookup.
3) I am also seeing correctness issues, e.g. deletes don't get applied correctly.
I don't know enough about the unit test coverage for NRT to comment specifically.

Again, this can all be specific to my usage of NRT or the test setup. That is why I urge you
guys to run our tests yourself and correct us if you see areas we are missing to make a fair

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
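
For readers following along, the tombstone idea in the description above might be sketched roughly as follows. This is not the attached patch and does not use Lucene's BitVector or DocIdSet classes: it substitutes java.util.BitSet and a sorted int array, and the class name, method names, and ratio-based merge threshold are all hypothetical. It also treats a document as deleted if either structure marks it (a union), which is my reading of how the two would be combined.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative only: a shared base bit set plus a small sorted array of newer
// deletions ("tombstones"). All names here are hypothetical.
final class TombstoneDeletes {
    private final BitSet base;      // shared between clones, never mutated here
    private final int[] tombstones; // sorted doc ids deleted since base was built

    TombstoneDeletes(BitSet base, int[] sortedTombstones) {
        this.base = base;
        this.tombstones = sortedTombstones;
    }

    /** A doc counts as deleted if either the base set or the tombstones mark it. */
    boolean isDeleted(int doc) {
        return base.get(doc) || Arrays.binarySearch(tombstones, doc) >= 0;
    }

    /** Copy-on-write delete: copies only the small tombstone array, not the base. */
    TombstoneDeletes delete(int doc) {
        if (isDeleted(doc)) {
            return this;
        }
        int pos = -Arrays.binarySearch(tombstones, doc) - 1; // insertion point
        int[] next = new int[tombstones.length + 1];
        System.arraycopy(tombstones, 0, next, 0, pos);
        next[pos] = doc;
        System.arraycopy(tombstones, pos, next, pos + 1, tombstones.length - pos);
        return new TombstoneDeletes(base, next);
    }

    int tombstoneCount() {
        return tombstones.length;
    }

    /** Hypothetical merge policy: merge once tombstones exceed a fraction of maxDoc. */
    boolean shouldMerge(int maxDoc, double maxRatio) {
        return tombstones.length > maxDoc * maxRatio;
    }

    /** Folds the tombstones into a fresh copy of the base set. */
    TombstoneDeletes merge() {
        BitSet merged = (BitSet) base.clone();
        for (int doc : tombstones) {
            merged.set(doc);
        }
        return new TombstoneDeletes(merged, new int[0]);
    }
}
```

The point of the design is that a delete copies only the tombstone array, while the full-size copy of the base set is deferred until the merge policy fires.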

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
