From: Michael McCandless
To: java-user@lucene.apache.org
Subject: Re: Flex API - Debugging Segment Merge
Date: Thu, 25 Mar 2010 15:15:47 -0400
Message-ID: <9ac0c6aa1003251215q1792b4c8jf0ef7e37bae5d9df@mail.gmail.com>
In-Reply-To: <4BABB3B0.3090303@deri.org>
References: <4BAB956B.3050009@deri.org> <9ac0c6aa1003251145h530a1e55ma738ae9fad473bb0@mail.gmail.com> <4BABB3B0.3090303@deri.org>

On Thu, Mar 25, 2010 at 3:04 PM, Renaud Delbru wrote:
> Hi Michael,
>
> On 25/03/10 18:45, Michael McCandless wrote:
>>
>> Hi Renaud,
>>
>> It's great that you're pushing flex forward so much :) You're making
>> some cool sounding codecs!  I'm really looking forward to seeing
>> indexing/searching performance results on Wikipedia...
>>
>
> I'll share them for sure whenever the results are ready ;o).

I'll be waiting eagerly :)

>> It sounds most likely there's a bug in the PFor impl? (Since you don't
>> hit this exception with the others...).
>>
>
> It seems so, but I found strange also that I cannot reproduce it with
> synthetic data.

Hmmm.

>> During merge, each segment's docIDs are rebased according to how many
>> non-deleted docs there are in all prior segments.  One possibility
>> here is a given segment thought it had N deletions but in fact
>> encountered fewer than N while iterating its docs.  This would cause
>> the next segment to have too-low a base which can cause this exact
>> exception on crossing from one segment to the next.  (Ie the very
>> first doc of the next segment will suddenly be <= prior doc(s)).
>>
>> But... if that's happening (ie, bug is in Lucene not in PFor impl),
>> you'd expect the other codecs to hit it too.
>>
>> Are you using multiple threads for indexing?  Are you also mixing in
>> deletions (or updateDocument calls)?
>>
>
> There is no deletion, I just create the index from scratch, and each
> document I am adding as a unique identifier.

Hmmm.

> I am using one single thread for indexing: reading sequentially the list of
> wikipedia articles, putting the content into a single field, and add the
> document to the index. Commit is done every 10K documents.

Are you using contrib/benchmark for this?  That makes it very easy to
run tests like this... hmm though we need to extend it so you can
specify which Codec to use...

> I have tried with different mergeFactors (2, or 20), but whenever the first
> merge occurs, I got this CorruptIndexException.

It's that consistent?
Is it always that the docID is == to one prior?  Or is the next
docID sometimes < the prior one?  And, is it always on the 1st docID
of a new segment?

> I will try to continue to debug, but if I could have at least the faulty
> segment, and the faulty term (or even better, the index of the faulty
> block), I will be able to display the content of the blocks, and see if
> there is some problems in the PFor encoding.

You can instrument the code (or catch the exc in a debugger) to see
all these details?

Or... if you can post a patch of where you are, I can dig, if I can
repro the issue...

Mike
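
The docID rebasing discussed in this message (each segment's base is the number of non-deleted docs in all prior segments) can be illustrated with a small model. This is only a sketch in Python, not Lucene's merge code: the segment layout (`max_doc`, `deleted`, `claimed_deletions`, `postings`) and both function names are hypothetical, chosen just to show how an over-counted deletion total yields a too-low base and a first docID in the next segment that is <= the prior one.

```python
def merge_postings(segments):
    """Rebase each segment's local docIDs into one merged docID stream.

    Each segment is a dict (hypothetical layout, not a Lucene structure):
      max_doc           - number of docs in the segment, deleted ones included
      deleted           - set of local docIDs actually skipped during iteration
      claimed_deletions - deletion count the segment reports
      postings          - ascending local docIDs for one term
    """
    merged = []
    base = 0
    for seg in segments:
        # Compact the live docs: each surviving local docID gets the next slot.
        doc_map = {}
        next_id = 0
        for local in range(seg["max_doc"]):
            if local not in seg["deleted"]:
                doc_map[local] = next_id
                next_id += 1
        for local in seg["postings"]:
            if local in doc_map:
                merged.append(base + doc_map[local])
        # The next base trusts the *claimed* count, not what was iterated.
        base += seg["max_doc"] - seg["claimed_deletions"]
    return merged


def first_out_of_order(doc_ids):
    """Return the index of the first docID that is <= its predecessor, or None."""
    for i in range(1, len(doc_ids)):
        if doc_ids[i] <= doc_ids[i - 1]:
            return i
    return None


seg_a = {"max_doc": 4, "deleted": {1}, "claimed_deletions": 1, "postings": [0, 2, 3]}
seg_b = {"max_doc": 3, "deleted": set(), "claimed_deletions": 0, "postings": [0, 1]}

# Consistent deletion count: merged docIDs stay strictly increasing.
merged_ok = merge_postings([seg_a, seg_b])    # [0, 1, 2, 3, 4]

# Segment A claims 2 deletions but only 1 doc is actually skipped.
seg_a_bad = dict(seg_a, claimed_deletions=2)
merged_bad = merge_postings([seg_a_bad, seg_b])    # [0, 1, 2, 2, 3]
bad_index = first_out_of_order(merged_bad)         # 3, the first doc of segment B
```

In this model the violation shows up exactly on the first docID of the following segment, with a docID equal to the prior one, which is the pattern the questions above try to confirm or rule out. With no deletions at all, as reported in the thread, this particular cause would not apply.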