lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Thu, 12 Nov 2009 13:47:39 GMT


Michael McCandless commented on LUCENE-1526:

bq. The fact that Zoie on the pure indexing case (ie no deletions) was 10X faster than Lucene
is very weird - that means something else is up, besides how deletions are carried in RAM.
It's entirely possible it's the fact that Lucene doesn't flush the tiny segments to a RAMDir
(which LUCENE-1313 addresses).

Yeah, if you call getReader() a bunch of times per second, each one does a flush(true,true,true),
right? Without having LUCENE-1313, this kills the indexing performance if querying is going
on. If no getReader() is being called at all, Zoie is about 10% slower than pure Lucene IndexWriter.add()
(that's the cost of doing it in two steps - index into two RAMDirs [so they are hot-swappable]
and then writing segments to disk with addIndexesNoOptimize() periodically).
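The double-buffering scheme described above can be sketched roughly like this (a minimal illustration only: plain lists stand in for the two RAMDirs and the on-disk index, and `DoubleBufferedIndexer` and its methods are hypothetical names, not Zoie's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Zoie-style double buffering: adds always go into the "active"
// in-memory buffer; on flush the two buffers are hot-swapped, so indexing
// never blocks while the drained buffer is written to the main index
// (the step Zoie does with addIndexesNoOptimize).
public class DoubleBufferedIndexer {
    private List<String> active = new ArrayList<>();    // receives new docs
    private List<String> draining = new ArrayList<>();  // being flushed to disk
    private final List<String> diskIndex = new ArrayList<>(); // stand-in for the FSDirectory index

    public synchronized void add(String doc) {
        active.add(doc);                                // cheap in-memory add
    }

    // Called periodically: swap buffers, then persist the drained one.
    public void flush() {
        List<String> toFlush;
        synchronized (this) {
            toFlush = active;    // hot swap: concurrent adds go to the other buffer
            active = draining;
            draining = toFlush;
        }
        diskIndex.addAll(toFlush);                      // stand-in for addIndexesNoOptimize
        toFlush.clear();
    }

    public synchronized int diskSize() { return diskIndex.size(); }
}
```

The point of the hot swap is that `add()` only ever touches the active buffer, so the periodic flush to disk never stalls indexing; the ~10% cost comes from every document being indexed twice (once into RAM, once to disk).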

It'll be great if LUCENE-1313 nets us a 10X improvement in indexing
rate.  With the improvements to benchmark (LUCENE-2050), I'm hoping
this'll be easy to confirm...

Ahh I see, so with very rare reopens, Zoie's indexing rate is also
slower than Lucene's (because of the double buffering).

So the big picture tradeoff here is that Zoie has wicked fast reopen times
compared to Lucene, but pays for it with a slightly slower (10%) indexing
rate and slower searches (22-28% in the worst case).

It seems like we need to find the "break even" point.  Ie, if you
never reopen, then on fixed hardware, Lucene is faster at indexing and
searching than Zoie.  If you reopen at an insane rate (100s per sec),
Zoie is much faster than Lucene on both indexing and searching.  But
what if you reopen 2x or 1x per second?  Or once every 2 seconds, etc.?
At some point the crossover will happen.
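One way to make that crossover concrete is a back-of-the-envelope model (the numbers and the shape of the model here are assumptions for illustration, not measurements): suppose each reopen costs Lucene a fixed `t` seconds of lost indexing time (the flush), while Zoie pays a flat fractional slowdown `s` (~10%) for its double buffering regardless of reopen rate. Then Lucene's throughput at reopen rate `r` is `base * (1 - r*t)`, Zoie's is `(1 - s) * base`, and the break-even rate is simply `s / t`:

```java
// Hypothetical break-even model for indexing throughput vs. reopen rate.
public class BreakEven {
    // Returns the reopen rate (reopens/sec) above which Zoie's flat slowdown
    // beats Lucene's per-reopen flush cost, from:
    //   base * (1 - r*t) == (1 - s) * base   =>   r = s / t
    public static double crossoverReopensPerSec(double secondsLostPerReopen,
                                                double zoieSlowdown) {
        return zoieSlowdown / secondsLostPerReopen;
    }
}
```

For example, if a flush costs 50 ms of indexing time and Zoie's slowdown is 10%, the crossover lands at 2 reopens/sec: reopen more often than that and Zoie wins on indexing rate, less often and Lucene does. The real curve also has to fold in the search-time slowdown, which this toy model ignores.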

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
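As a rough illustration of the paged copy-on-write idea in the issue title (a from-scratch sketch, not the attached LUCENE-1526.patch; all names here are hypothetical): the deleted-docs bits are split into pages shared between a reader and its clone, and a write copies only the one page it touches instead of the entire underlying byte array:

```java
// Sketch of a paged copy-on-write deleted-docs bit vector. A clone shares
// all pages with its parent; set() copies just the affected page on the
// clone's first write to it, leaving the parent's view unchanged.
public class PagedBitVector {
    private static final int PAGE_BITS = 10;           // 1024 bits per page
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private final long[][] pages;                      // shared until written
    private final boolean[] owned;                     // true once this instance copied the page

    public PagedBitVector(int numBits) {
        int numPages = (numBits + PAGE_SIZE - 1) / PAGE_SIZE;
        pages = new long[numPages][];
        owned = new boolean[numPages];
        for (int i = 0; i < numPages; i++) {
            pages[i] = new long[PAGE_SIZE / 64];
            owned[i] = true;
        }
    }

    private PagedBitVector(PagedBitVector other) {
        pages = other.pages.clone();                   // share page references
        owned = new boolean[pages.length];             // clone owns no pages yet
    }

    public PagedBitVector copyOnWriteClone() {
        return new PagedBitVector(this);
    }

    public void set(int bit) {
        int page = bit >>> PAGE_BITS;
        if (!owned[page]) {                            // first write: copy only this page
            pages[page] = pages[page].clone();
            owned[page] = true;
        }
        int idx = (bit & (PAGE_SIZE - 1)) >>> 6;
        pages[page][idx] |= 1L << (bit & 63);
    }

    public boolean get(int bit) {
        int page = bit >>> PAGE_BITS;
        int idx = (bit & (PAGE_SIZE - 1)) >>> 6;
        return (pages[page][idx] & (1L << (bit & 63))) != 0;
    }
}
```

With 1024-bit pages, cloning a reader and marking one new deletion copies 128 bytes instead of the whole vector, which is the cost the tombstone/paging proposals above are trying to avoid.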

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
