lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Rutherglen (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Thu, 12 Nov 2009 03:46:48 GMT


Jason Rutherglen commented on LUCENE-1526:

As far as the implementation of this patch goes... I'm thinking
we can increase the page size, and do a simulated setNextPage
type of deal in SegmentTermDocs. We'll start at the first page,
if we hit an ArrayIndexOutOfBoundsException, we figure out what
page they're trying to access and return true|false for the bit.
We can continue this process of accessing iteratively on a page,
until we hit the AIOOBE, then figure out again. I think this is
a good approach because Java arrays already perform the access
bounds check, hitting an exception hopefully shouldn't be too
costly if it only happens a handful of times per posting
iteration, and then we're avoiding the small but extra array
access lookup for eac bit. We'll be executing the same access pattern as
today (i.e. random access on the byte array with about the same
overhead, with the AIOOBE occurring when a page is completed).
We can benchmark and see the performance difference.

I think in general, we'll simply benchmark the options we've
discussed 1) NRT 2) 1313 3) 1313 + pooling 4) 1313 + 1526 5)
1313 + 1526 w/iterative sequential page access.

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to by ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message