From: "Jake Mannix (JIRA)"
To: java-dev@lucene.apache.org
Date: Mon, 9 Nov 2009 21:20:32 +0000 (UTC)
Subject: [jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl
Message-ID: <780186514.1257801632417.JavaMail.jira@brutus>
In-Reply-To: <1260489736.1232557799649.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775151#action_12775151 ]

Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. But how many msec does this clone add in practice? Note that it's only done if there is a new deletion against that segment. I do agree it's silly wasteful, but searching should then be faster than using AcceleratedIntSet or MultiBitSet. It's a tradeoff of the turnaround time for search perf.

I actually don't know for sure if this is the majority of the time, as I haven't run either the AcceleratedIntSet or 2.9 NRT through a profiler, but if you're indexing at high speed (which is what our load/perf tests do), you're going to be cloning these things hundreds of times per second (look at the indexing throughput we're forcing the system through), and even if each clone is fast, that's costly.

bq. I'd love to see how the worst-case queries (matching millions of hits) perform with each of these three options.

It's pretty easy to change the index and query files in our test to do that - that's a good idea. Feel free to check out our load testing framework too: it will let you monkey with various parameters, monitor the whole thing via JMX, and so forth, both for the full zoie-based setup and for the case where the zoie API is wrapped purely around Lucene 2.9 NRT. The instructions for setting it up are on the zoie wiki.

bq. When a doc needs to be updated, you index it immediately into the RAMDir, and reopen the RAMDir's IndexReader. You add its UID to the AcceleratedIntSet, and all searches are "AND NOT"'d against that set. You don't tell Lucene to delete the old doc, yet.

Yep, basically. The IntSetAccelerator (of UIDs) is set on the (long-lived) IndexReader for the disk index - this is why it's done as a ThreadLocal: everybody shares that IndexReader, but different threads have different point-in-time views of how much of it has been deleted.
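In rough pseudo-Java, the shape of that pattern is something like the sketch below - the class and method names here are purely illustrative, not zoie's actual API, and a plain Set stands in for the accelerated int set:

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    /**
     * Illustrative sketch (not zoie's real classes): one long-lived reader over
     * the disk index is shared by every thread, while each search thread carries
     * its own point-in-time set of UIDs that have been re-indexed into the RAM
     * segment and so must be treated as deleted on the disk segment.
     */
    public class PointInTimeDeletes {

        // Per-thread snapshot of "deleted" UIDs; each thread sees its own point in time.
        private final ThreadLocal<Set<Long>> modSet =
                ThreadLocal.withInitial(() -> Collections.<Long>emptySet());

        /** Called at the start of a search: pin this thread to the current delete snapshot. */
        public void attachSnapshot(Set<Long> deletedUids) {
            modSet.set(new HashSet<>(deletedUids));
        }

        /** Hit collection consults the thread-local set: "AND NOT" the real-time deletes. */
        public boolean isDeleted(long uid) {
            return modSet.get().contains(uid);
        }

        /** Called when the search finishes, so the snapshot can be collected. */
        public void detachSnapshot() {
            modSet.remove();
        }
    }

The point of the ThreadLocal is exactly the isolation described above: setting a new snapshot in one thread never perturbs a search already in flight on another thread.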
bq. These are great results! If I'm reading them right, it looks like generally you get faster query throughput, and roughly equal indexing throughput, on upgrading from 2.4 to 2.9?

That's about right. Of course, the comparison of zoie (on either 2.4 or 2.9) against Lucene 2.9 NRT is an important one to look at: zoie is pushing about 7-9x better throughput for both queries and indexing than NRT. I'm sure the performance numbers would change if we relaxed the realtime requirement, yes - that's one of the many dimensions to consider here (along with the percentage of indexing events which are deletes, how many of those are from really old segments vs. newer ones, how big the queries are, etc.).

bq. One optimization you could make with Zoie is, if a real-time deletion (from the AcceleratedIntSet) is in fact hit, it could mark the corresponding docID, to make subsequent searches a bit faster (and save the bg CPU when flushing the deletes to Lucene).

That sounds interesting - how would that work? We don't really touch the disk indexReader, other than to set this modSet on it in the ThreadLocal; where would this mark live?

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader.
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs.
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader.
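For reference, a toy sketch of the tombstone idea described in the issue - this is just one reading of the description, not the attached patch; java.util.BitSet stands in for Lucene's internal BitVector and all names here are hypothetical:

    import java.util.Arrays;
    import java.util.BitSet;

    /**
     * Toy sketch of tombstone-style deletes. New deletions accumulate in a small
     * sorted int[] instead of copying the whole bit set on every clone; a reader's
     * deleted-docs view is the combination of the shared bit set and its tombstones.
     */
    public class TombstoneDeletes {

        private final BitSet flushedDeletes;   // shared, never mutated after flush
        private final int[] tombstones;        // sorted doc ids deleted since the flush
        private final int maxDoc;

        public TombstoneDeletes(BitSet flushedDeletes, int[] tombstones, int maxDoc) {
            this.flushedDeletes = flushedDeletes;
            this.tombstones = tombstones;
            this.maxDoc = maxDoc;
        }

        /** Deleted if it was already deleted at flush time, or if it has a tombstone. */
        public boolean isDeleted(int docId) {
            return flushedDeletes.get(docId)
                    || Arrays.binarySearch(tombstones, docId) >= 0;
        }

        /** Cheap "clone": the big bit set is shared, only the small tombstone array grows. */
        public TombstoneDeletes withDeletion(int docId) {
            int[] next = Arrays.copyOf(tombstones, tombstones.length + 1);
            next[next.length - 1] = docId;
            Arrays.sort(next);
            return new TombstoneDeletes(flushedDeletes, next, maxDoc);
        }

        /**
         * A toy merge policy: once tombstones exceed some fraction of the index,
         * fold them back into a fresh bit set so lookups stay a single get().
         */
        public TombstoneDeletes maybeMerge(double maxTombstoneRatio) {
            if (tombstones.length <= maxTombstoneRatio * maxDoc) {
                return this;
            }
            BitSet merged = (BitSet) flushedDeletes.clone();
            for (int docId : tombstones) {
                merged.set(docId);
            }
            return new TombstoneDeletes(merged, new int[0], maxDoc);
        }
    }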