lucene-dev mailing list archives

From "Jake Mannix (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Date Wed, 11 Nov 2009 05:48:27 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776294#action_12776294
] 

Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. Zoie must do the IntSet check plus the BitVector check (done by
Lucene), right?

Yes, so how does Lucene NRT deal with new deletes?  The disk-backed IndexReader still does
its internal check for deletions, right?  I haven't played with the latest patches on LUCENE-1313,
so I'm not sure what has changed, but if you're leaving the disk index alone (to preserve
point-in-time status of the index without writing to disk all the time), you've got your in-memory
BitVector of newly uncommitted deletes, and then the SegmentReaders from the disk have their
own internal deletedDocs BitVector.  Are these two OR'ed with each other somewhere?  What
is done in NRT to minimize the time of checking both of these without modifying the read-only
SegmentReader?  In the current 2.9.0 code, the segment is reloaded completely on getReader()
if there are new add/deletes, right?
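
To make the "two checks" in question concrete, here is a minimal sketch of a layered delete check: a set of new, uncommitted deletes consulted alongside an untouched on-disk bit vector. It is not Zoie's or Lucene's actual code. The class LayeredDeletes is hypothetical, java.util.BitSet stands in for Lucene's BitVector, and a HashSet stands in for Zoie's IntSet.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch only: these are not Lucene or Zoie classes. */
final class LayeredDeletes {
    // Stand-in for the SegmentReader's internal deletedDocs BitVector (never modified here).
    private final BitSet committed;
    // Stand-in for the in-memory set of newly uncommitted deletes.
    private final Set<Integer> uncommitted = new HashSet<>();

    LayeredDeletes(BitSet committedDeletes) {
        // Defensive copy so the caller's bit set is left alone, like a read-only reader.
        this.committed = (BitSet) committedDeletes.clone();
    }

    /** Records a new delete without touching the committed bits. */
    void delete(int docId) {
        uncommitted.add(docId);
    }

    /** A doc is deleted if either layer says so: the two checks are OR'ed. */
    boolean isDeleted(int docId) {
        return committed.get(docId) || uncommitted.contains(docId);
    }

    public static void main(String[] args) {
        BitSet onDisk = new BitSet(8);
        onDisk.set(2);
        LayeredDeletes deletes = new LayeredDeletes(onDisk);
        deletes.delete(5);
        System.out.println(deletes.isDeleted(2) + " " + deletes.isDeleted(5) + " " + deletes.isDeleted(3));
    }
}
```

The second lookup in isDeleted is the added per-document cost being discussed; the open question above is whether NRT avoids paying for both layers.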

bq. Ie comparing IntSet lookup vs BitVector lookup isn't the comparison
you want to do. You should compare the IntSet lookup (Zoie's added
cost) to 0.

If you've got that technique for resolving new deletes against the disk-based ones while maintaining
the point-in-time nature of the reader, and can completely amortize the reopen cost so that it doesn't
affect performance, then yeah, that would be the comparison.  I'm not sure I understand how the NRT
implementation is doing this currently - I tried to step through it in the debugger while running
the TestIndexWriterReader test, but I'm still not quite sure what is going on during the reopen.

bq.  So, for a query that hits 5M docs, Zoie will take 64 msec longer than
Lucene, due to the extra check. What I'd like to know is what
pctg. slowdown that works out to be, eg for a simple TermQuery that
hits those 5M results - that's Zoie's worst case search slowdown.

Yes, this is a good check to run.  It is still really a micro-benchmark, since it would be done
in isolation, with no other production tasks going on (rapid indexing, the consequent flushes
to disk, and reader reopening), but it would be useful to see.
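
For scale, the quoted figures (64 msec of extra time over 5M hits) can be turned into a per-hit overhead, and into a percentage once a baseline query time is known. The sketch below only does that arithmetic. The class SlowdownMath is hypothetical, and the 200 msec baseline is a placeholder, not a measurement from this thread.

```java
/** Hypothetical helper for the arithmetic only; no benchmark is run here. */
final class SlowdownMath {
    /** Extra cost per matching document, in nanoseconds. */
    static double perDocOverheadNanos(double extraMillis, long hits) {
        return extraMillis * 1_000_000.0 / hits;
    }

    /** Relative slowdown in percent, given a measured baseline latency. */
    static double slowdownPercent(double baselineMillis, double extraMillis) {
        return 100.0 * extraMillis / baselineMillis;
    }

    public static void main(String[] args) {
        // Figures quoted above: 64 msec extra for a query hitting 5M docs.
        System.out.println(perDocOverheadNanos(64.0, 5_000_000L) + " ns per hit");
        // Placeholder baseline: substitute the measured TermQuery time.
        double hypotheticalBaselineMillis = 200.0;
        System.out.println(slowdownPercent(hypotheticalBaselineMillis, 64.0) + " % slowdown");
    }
}
```

The percentage depends entirely on the baseline TermQuery time, which is exactly the number the benchmark would have to supply.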

What would be even better, however, would be a running system with continual updates to the
index and many concurrent requests coming in that each hit all 5M documents.  Measure the mean
latency for Zoie in that setting, in comparison both to NRT and to Lucene when you *don't*
reopen the index (i.e. you do things the pre-Lucene-2.9 way, where the CPU is still being
consumed by indexing, but the reader is out of date until the next time the application
schedules a reopen).  This would measure the effective latency and throughput costs of Zoie
and NRT vs. non-NRT Lucene.  I'm not really sure it's terribly helpful to see "what is Zoie's
latency when you're not indexing at all" - why on earth would you use either NRT or Zoie if
you're not doing lots of indexing?

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
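
The tombstone idea in the issue description can be sketched roughly as follows. This is not the attached patch: the class TombstoneDeletes and the maxTombstones threshold are hypothetical, java.util.BitSet stands in for BitVector, and the merge policy shown (merge once the tombstone count exceeds a fixed limit) is only one of the options the description leaves open. The sketch treats a doc as deleted if either the base bits or the tombstones contain it.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical sketch of tombstones layered over a deleted-docs bit set. */
final class TombstoneDeletes {
    private BitSet base;                    // stand-in for the segment's deleted-docs BitVector
    private int[] tombstones = new int[0];  // sorted doc IDs deleted since the last merge
    private final int maxTombstones;        // hypothetical merge-policy threshold

    TombstoneDeletes(BitSet base, int maxTombstones) {
        this.base = (BitSet) base.clone();
        this.maxTombstones = maxTombstones;
    }

    /** Adds a tombstone by copying only the small int array, not the whole bit set. */
    void delete(int docId) {
        if (base.get(docId)) return;              // already deleted in the base bits
        int pos = Arrays.binarySearch(tombstones, docId);
        if (pos >= 0) return;                     // already tombstoned
        int insertAt = -pos - 1;
        int[] next = new int[tombstones.length + 1];
        System.arraycopy(tombstones, 0, next, 0, insertAt);
        next[insertAt] = docId;
        System.arraycopy(tombstones, insertAt, next, insertAt + 1, tombstones.length - insertAt);
        tombstones = next;
        if (tombstones.length > maxTombstones) merge();
    }

    boolean isDeleted(int docId) {
        return base.get(docId) || Arrays.binarySearch(tombstones, docId) >= 0;
    }

    /** Folds all tombstones into a fresh copy of the base bits and clears them. */
    private void merge() {
        BitSet merged = (BitSet) base.clone();
        for (int id : tombstones) merged.set(id);
        base = merged;
        tombstones = new int[0];
    }

    int tombstoneCount() {
        return tombstones.length;
    }
}
```

Each delete allocates a new small array and each merge a new bit set, so anything still holding the old references keeps its point-in-time view; the full copy is paid only when the policy triggers a merge.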

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

