From: Daniel Noll
Organization: Nuix Pty Ltd
To: java-user@lucene.apache.org
Subject: Re: Document ID shuffling under 2.3.x (on merge?)
Date: Thu, 13 Mar 2008 10:42:50 +1100

On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote:
> OK, I think very likely this is the issue: when IndexWriter hits an
> exception while processing a document, the portion of the document
> already indexed is left in the index, and then its docID is marked
> for deletion. You can see these deletions in your infoStream:
>
>    flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments
>
> This means you have deletions in your index, by docID, and so when
> you optimize the docIDs are then compacted.

Aha. Under 2.2, a failure would result in nothing being added to the
text index, so this would explain the problem. It would also explain
why smaller data sets are less likely to cause the problem (it's less
likely for there to be an error in them).

Workarounds?
  - flush() after any IOException from addDocument() (overhead?)
  - use ++ to determine the next document ID instead of
    index.getWriter().docCount() (out of sync after an error but
    fixes itself on optimize())
  - Use a field for a separate ID (slower later when reading the index)
  - ???

Daniel
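P.S. To make the behaviour described above concrete, here is a toy model in plain Java. It is not Lucene code and does not call any Lucene API; the class DocIdShuffleDemo and its methods add(), optimize() and idOf() are made up for illustration. It only assumes what the quoted explanation says: a document that fails part way still occupies a docID and is marked deleted, and optimize compacts the deleted slots away.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: not Lucene code, just the docID behaviour described in this thread. */
public class DocIdShuffleDemo {
    // One entry per docID slot, in insertion order.
    private final List<String> docs = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();

    /** Adds a document and returns its docID. A failed add still consumes a docID. */
    public int add(String name, boolean failsPartWay) {
        docs.add(name);
        deleted.add(failsPartWay); // the partial document stays, marked for deletion
        return docs.size() - 1;
    }

    /** Drops deleted slots, so every later document slides down to a lower docID. */
    public void optimize() {
        for (int i = docs.size() - 1; i >= 0; i--) {
            if (deleted.get(i)) {
                docs.remove(i);
                deleted.remove(i);
            }
        }
    }

    /** Current docID of a live document, or -1 if it is not present. */
    public int idOf(String name) {
        return docs.indexOf(name);
    }

    public static void main(String[] args) {
        DocIdShuffleDemo index = new DocIdShuffleDemo();
        index.add("a", false);
        index.add("broken", true);          // simulated exception during indexing
        int before = index.add("c", false); // lands after the failed slot
        index.optimize();
        // prints "c: docID 2 before optimize, 1 after"
        System.out.println("c: docID " + before + " before optimize, " + index.idOf("c") + " after");
    }
}
```

In this model, an application-side counter that is only incremented on success would predict 1 for "c", which is wrong until optimize() runs and right afterwards. That is my reading of the "out of sync after an error but fixes itself" remark in the second workaround.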