From: "Jake Mannix (JIRA)"
To: java-dev@lucene.apache.org
Date: Mon, 9 Nov 2009 21:20:32 +0000 (UTC)
Subject: [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Message-ID: <780186514.1257801632417.JavaMail.jira@brutus>
In-Reply-To: <1260489736.1232557799649.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775151#action_12775151 ]

Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. But how many msec does this clone add in practice? Note that it's only done if there is a new deletion against that segment. I do agree it's silly wasteful, but searching should then be faster than using AcceleratedIntSet or MultiBitSet. It's a tradeoff of the turnaround time for search perf.

I actually don't know for sure if this is the majority of the time, as I haven't run either the AcceleratedIntSet or 2.9 NRT through a profiler, but if you're indexing at high speed (which is what our load/perf tests do), you're going to be cloning these things hundreds of times per second (look at the indexing throughput we're forcing the system through), and even if each clone is fast, that's costly.

bq. I'd love to see how the worst-case queries (matching millions of hits) perform with each of these three options.

It's pretty easy to change the index and query files in our test to do that - that's a good idea. Feel free to check out our load testing framework too: it will let you monkey with various parameters, monitor the whole thing via JMX, and so forth, both for the full zoie-based setup and for the case where the zoie API is wrapped purely around Lucene 2.9 NRT. The instructions for setting it up are on the zoie wiki.

bq. When a doc needs to be updated, you index it immediately into the RAMDir, and reopen the RAMDir's IndexReader. You add its UID to the AcceleratedIntSet, and all searches are "AND NOT"'d against that set. You don't tell Lucene to delete the old doc, yet.

Yep, basically. The IntSetAccelerator (of UIDs) is set on the (long-lived) IndexReader for the disk index - this is why it's done as a ThreadLocal: everybody shares that IndexReader, but different threads have different point-in-time views of how much of it has been deleted.
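In rough pseudo-Java, the shape of that pattern is something like the sketch below - the class and method names here are purely illustrative, not zoie's actual API, and a plain Set stands in for the accelerated int set:

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    /**
     * Illustrative sketch (not zoie's real classes): one long-lived reader over
     * the disk index is shared by every thread, while each search thread carries
     * its own point-in-time set of UIDs that have been re-indexed into the RAM
     * segment and so must be treated as deleted on the disk segment.
     */
    public class PointInTimeDeletes {

        // Per-thread snapshot of "deleted" UIDs; each thread sees its own point in time.
        private final ThreadLocal<Set<Long>> modSet =
                ThreadLocal.withInitial(() -> Collections.<Long>emptySet());

        /** Called at the start of a search: pin this thread to the current delete snapshot. */
        public void attachSnapshot(Set<Long> deletedUids) {
            modSet.set(new HashSet<>(deletedUids));
        }

        /** Hit collection consults the thread-local set: "AND NOT" the real-time deletes. */
        public boolean isDeleted(long uid) {
            return modSet.get().contains(uid);
        }

        /** Called when the search finishes, so the snapshot can be collected. */
        public void detachSnapshot() {
            modSet.remove();
        }
    }

The point of the ThreadLocal is exactly the isolation described above: setting a new snapshot in one thread never perturbs a search already in flight on another thread.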
bq. These are great results! If I'm reading them right, it looks like generally you get faster query throughput, and roughly equal indexing throughput, on upgrading from 2.4 to 2.9?

That's about right. Of course, the comparison of zoie (on either 2.4 or 2.9) against Lucene 2.9 NRT is an important one to look at: zoie is pushing about 7-9x better throughput for both queries and indexing than NRT. I'm sure the performance numbers would change if we relaxed the realtime requirement, yes - that's one of the many dimensions to consider here (along with the percentage of indexing events which are deletes, how many of those are from really old segments vs. newer ones, how big the queries are, etc.).

bq. One optimization you could make with Zoie is, if a real-time deletion (from the AcceleratedIntSet) is in fact hit, it could mark the corresponding docID, to make subsequent searches a bit faster (and save the bg CPU when flushing the deletes to Lucene).

That sounds interesting - how would that work? We don't really touch the disk indexReader, other than to set this modSet on it in the ThreadLocal; where would this mark live?

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.
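For reference, a toy sketch of the tombstone idea described in the issue - this is just one reading of the description, not the attached patch; java.util.BitSet stands in for Lucene's internal BitVector and all names here are hypothetical:

    import java.util.Arrays;
    import java.util.BitSet;

    /**
     * Toy sketch of tombstone-style deletes. New deletions accumulate in a small
     * sorted int[] instead of copying the whole bit set on every clone; a reader's
     * deleted-docs view is the combination of the shared bit set and its tombstones.
     */
    public class TombstoneDeletes {

        private final BitSet flushedDeletes;   // shared, never mutated after flush
        private final int[] tombstones;        // sorted doc ids deleted since the flush
        private final int maxDoc;

        public TombstoneDeletes(BitSet flushedDeletes, int[] tombstones, int maxDoc) {
            this.flushedDeletes = flushedDeletes;
            this.tombstones = tombstones;
            this.maxDoc = maxDoc;
        }

        /** Deleted if it was already deleted at flush time, or if it has a tombstone. */
        public boolean isDeleted(int docId) {
            return flushedDeletes.get(docId)
                    || Arrays.binarySearch(tombstones, docId) >= 0;
        }

        /** Cheap "clone": the big bit set is shared, only the small tombstone array grows. */
        public TombstoneDeletes withDeletion(int docId) {
            int[] next = Arrays.copyOf(tombstones, tombstones.length + 1);
            next[next.length - 1] = docId;
            Arrays.sort(next);
            return new TombstoneDeletes(flushedDeletes, next, maxDoc);
        }

        /**
         * A toy merge policy: once tombstones exceed some fraction of the index,
         * fold them back into a fresh bit set so lookups stay a single get().
         */
        public TombstoneDeletes maybeMerge(double maxTombstoneRatio) {
            if (tombstones.length <= maxTombstoneRatio * maxDoc) {
                return this;
            }
            BitSet merged = (BitSet) flushedDeletes.clone();
            for (int docId : tombstones) {
                merged.set(docId);
            }
            return new TombstoneDeletes(merged, new int[0], maxDoc);
        }
    }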