Return-Path: Delivered-To: apmail-lucene-java-dev-archive@www.apache.org Received: (qmail 71658 invoked from network); 13 Mar 2010 10:27:29 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 13 Mar 2010 10:27:29 -0000 Received: (qmail 32409 invoked by uid 500); 13 Mar 2010 10:26:49 -0000 Delivered-To: apmail-lucene-java-dev-archive@lucene.apache.org Received: (qmail 32356 invoked by uid 500); 13 Mar 2010 10:26:49 -0000 Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-dev@lucene.apache.org Delivered-To: mailing list java-dev@lucene.apache.org Received: (qmail 32343 invoked by uid 99); 13 Mar 2010 10:26:48 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 13 Mar 2010 10:26:48 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 13 Mar 2010 10:26:47 +0000 Received: from brutus.apache.org (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 238C2234C4BE for ; Sat, 13 Mar 2010 10:26:27 +0000 (UTC) Message-ID: <1608219619.243791268475987130.JavaMail.jira@brutus.apache.org> Date: Sat, 13 Mar 2010 10:26:27 +0000 (UTC) From: "Michael McCandless (JIRA)" To: java-dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency In-Reply-To: <396950764.46741267649907121.JavaMail.jira@brutus.apache.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844848#action_12844848 ] Michael McCandless commented on LUCENE-2293: -------------------------------------------- I think this issue has these steps: * Allow the 5 to be changed (trivial first step) -- I'll do this after LUCENE-2294 is in * Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and "normal" segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU. * Enable NRT readers to directly search these RAM segments. This entails recording deletes on the RAM segments as an int[]. We need to solve the Term sorting issue... (b-tree, or, simply sort-on-demand the first time a query needs it, though that cost increases the larger your RAM segments get, ie, not incremental to the # docs you just added). Also, we have to solve what happens to a reader using a RAM segment that's been flushed. Perhaps we don't reuse RAM at that point, ie, rely on GC to reclaim once all readers using that RAM segmeent have closed. We should do this part under a separate issue (LUCENE-2312). > IndexWriter has hard limit on max concurrency > --------------------------------------------- > > Key: LUCENE-2293 > URL: https://issues.apache.org/jira/browse/LUCENE-2293 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Reporter: Michael McCandless > Assignee: Michael McCandless > Fix For: 3.1 > > > DocumentsWriter has this nasty hardwired constant: > {code} > private final static int MAX_THREAD_STATE = 5; > {code} > which probably I should have attached a //nocommit to the moment I > wrote it ;) > That constant sets the max number of thread states to 5. This means, > if more than 5 threads enter IndexWriter at once, they will "share" > only 5 thread states, meaning we gate CPU concurrency to 5 running > threads inside IW (each thread must first wait for the last thread to > finish using the thread state before grabbing it). > This is bad because modern hardware can make use of more than 5 > threads. So I think an immediate fix is to make this settable > (expert), and increase the default (8?). > It's tricky, though, because the more thread states, the less RAM > efficiency you have, meaning the worse indexing throughput. So you > shouldn't up and set this to 50: you'll be flushing too often. > But... I think a better fix is to re-think how threads write state > into DocumentsWriter. Today, a single docID stream is assigned across > threads (eg one thread gets docID=0, next one docID=1, etc.), and each > thread writes to a private RAM buffer (living in the thread state), > and then on flush we do a merge sort. The merge sort is inefficient > (does not currently use a PQ)... and, wasteful because we must > re-decode every posting byte. > I think we could change this, so that threads write to private RAM > buffers, with a private docID stream, but then instead of merging on > flush, we directly flush each thread as its own segment (and, allocate > private docIDs to each thread). We can then leave merging to CMS > which can already run merges in the BG without blocking ongoing > indexing (unlike the merge we do in flush, today). > This would also allow us to separately flush thread states. Ie, we > need not flush all thread states at once -- we can flush one when it > gets too big, and then let the others keep running. This should be a > good concurrency gain since is uses IO & CPU resources "throughout" > indexing instead of "big burst of CPU only" then "big burst of IO > only" that we have today (flush today "stops the world"). > One downside I can think of is... docIDs would now be "less > monotonic", meaning if N threads are indexing, you'll roughly get > in-time-order assignment of docIDs. But with this change, all of one > thread state would get 0..N docIDs, the next thread state'd get > N+1...M docIDs, etc. However, a single thread would still get > monotonic assignment of docIDs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org For additional commands, e-mail: java-dev-help@lucene.apache.org