lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Thu, 12 Nov 2009 13:24:39 GMT


Michael McCandless commented on LUCENE-1526:

bq. if you're indexing 300 documents a second (possibly all of which are delete+re-add), and
querying at a thousand queries a second, how many of these BitVectors are you going to end
up making?

Hopefully not much more than a few per second?

We should be careful what we measure to ensure that we're targeting the right use cases.

Requirements calling for zero-latency updates (all index changes always visible) are often
in error (i.e. it's usually not a true requirement).

Right, I think testing reopening 100s of times per second isn't all
that useful (most apps don't really need to do this).

I think seeing results broken out according to reopen frequency is
more helpful.

Seems like almost all apps should be well served by one-second reopen resolution on average
(with the ability to immediately reopen on demand). The only case that would seem to need
lower latency is when an automated client does an add and then immediately runs a query that
needs to see it. In that case, that client could request an immediate reopen (either during
the add or the query).

To guard against accidental or intentional denial-of-service from
clients that do the add + immediate query, one could also sync such
clients up to the reopen frequency.

This would also provide the clean semantics (like GData) of "once
the 'update document' request returns, it's in the index", which I
agree makes for very convenient API semantics.

Ie, if your reopen rate is 4x per second (once every 250 msec), then
you could hold all incoming add requests until the next reopen completes,
then let those requests return.

So the API can still build well-defined semantics on top of
Lucene, even if the reopen is rate-limited under the hood.
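The hold-until-reopen idea above can be sketched as follows. This is a minimal illustration with hypothetical names (RateLimitedReopener, addAndWait), not the actual Lucene API; the reopen() body is a stand-in for refreshing a real reader/searcher:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch: add requests block until the next scheduled "reopen"
 * completes, so "once addAndWait() returns, the document is
 * searchable" holds even though reopens are rate-limited
 * (here: 4x per second, i.e. every 250 msec).
 */
public class RateLimitedReopener {
    private volatile CountDownLatch nextReopen = new CountDownLatch(1);
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::reopen, 250, 250, TimeUnit.MILLISECONDS);
    }

    private void reopen() {
        // In a real app this is where the reader/searcher would be refreshed.
        CountDownLatch finished = nextReopen;
        nextReopen = new CountDownLatch(1); // later adds wait for the *next* cycle
        finished.countDown();               // release callers waiting on this cycle
    }

    /** Index a document, then block until a reopen has made it visible. */
    public void addAndWait(String doc) throws InterruptedException {
        // ... hand doc to the IndexWriter here, *before* capturing the latch,
        // so a reopen between the two steps only costs an extra wait cycle ...
        nextReopen.await();
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
```

The ordering matters: the document is handed to the writer before the caller parks on the latch, so a reopen that slips in between merely makes the caller wait one extra cycle rather than return too early.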

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>                 Key: LUCENE-1526
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>   Original Estimate: 168h
>  Remaining Estimate: 168h
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
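The tombstone scheme in the description can be sketched as below. All names here are hypothetical (this is not the Lucene API, and java.util.BitSet stands in for BitVector): the clone inherits a shared, never-mutated bit set, accumulates new deletions in a small tombstone set, and a doc counts as deleted if it appears in either structure (equivalently, live docs are the AND of the two complements). A crude merge-policy trigger is included:

```java
import java.util.BitSet;
import java.util.NavigableSet;
import java.util.TreeSet;

/**
 * Sketch: a reader's deleted docs as an inherited, shared BitSet plus
 * a per-reader tombstone set of newly deleted doc ids. Cloning is
 * cheap because the underlying bit array is never copied on delete.
 */
public class TombstoneDeletedDocs {
    private final BitSet inherited;                // shared, never mutated
    private final NavigableSet<Integer> tombstones = new TreeSet<>();

    public TombstoneDeletedDocs(BitSet inherited) {
        this.inherited = inherited;
    }

    public void delete(int docId) {
        tombstones.add(docId);                     // no byte-array copy
    }

    public boolean isDeleted(int docId) {
        // Deleted if in the inherited bits OR in the new tombstones.
        return inherited.get(docId) || tombstones.contains(docId);
    }

    /**
     * Merge-policy trigger (sketch): once tombstones exceed some
     * fraction of the segment's docs, fold them into a fresh BitSet
     * so lookups go back to a single bit test.
     */
    public TombstoneDeletedDocs maybeMerge(int maxDoc, double ratio) {
        if (tombstones.size() < maxDoc * ratio) {
            return this;                           // too few tombstones to bother
        }
        BitSet merged = (BitSet) inherited.clone();
        for (int doc : tombstones) {
            merged.set(doc);
        }
        return new TombstoneDeletedDocs(merged);   // tombstones start empty again
    }
}
```

The trade-off the issue describes is visible here: deletes become O(log n) set inserts instead of a full array copy, at the cost of a second lookup per doc until the merge policy folds the tombstones back in.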

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
