Date: Thu, 25 Mar 2010 15:15:47 -0400
Subject: Re: Flex API - Debugging Segment Merge
From: Michael McCandless <lucene@mikemccandless.com>
To: java-user@lucene.apache.org

On Thu, Mar 25, 2010 at 3:04 PM, Renaud Delbru <renaud.delbru@deri.org> wrote:
> Hi Michael,
>
> On 25/03/10 18:45, Michael McCandless wrote:
>>
>> Hi Renaud,
>>
>> It's great that you're pushing flex forward so much :) You're making
>> some cool sounding codecs!  I'm really looking forward to seeing
>> indexing/searching performance results on Wikipedia...
>>
>
> I'll share them for sure whenever the results are ready ;o).

I'll be waiting eagerly :)

>> It sounds most likely there's a bug in the PFor impl? (Since you don't
>> hit this exception with the others...).
>>
>
> It seems so, but I also find it strange that I cannot reproduce it with
> synthetic data.

Hmmm.

>> During merge, each segment's docIDs are rebased according to how many
>> non-deleted docs there are in all prior segments.  One possibility
>> here is a given segment thought it had N deletions but in fact
>> encountered fewer than N while iterating its docs.  This would cause
>> the next segment to have too-low a base which can cause this exact
>> exception on crossing from one segment to the next.  (Ie the very
>> first doc of the next segment will suddenly be <= prior doc(s)).
>>
>> But... if that's happening (ie, bug is in Lucene not in PFor impl),
>> you'd expect the other codecs to hit it too.
>>
>> Are you using multiple threads for indexing?  Are you also mixing in
>> deletions (or updateDocument calls)?
>>
>
> There is no deletion, I just create the index from scratch, and each
> document I am adding has a unique identifier.

Hmmm.

> I am using one single thread for indexing: reading sequentially the list
> of Wikipedia articles, putting the content into a single field, and adding
> the document to the index. Commit is done every 10K documents.

Are you using contrib/benchmark for this?  That makes it very easy to
run tests like this... hmm though we need to extend it so you can
specify which Codec to use...

> I have tried with different mergeFactors (2, or 20), but whenever the first
> merge occurs, I get this CorruptIndexException.

It's that consistent?  Is it always that the docID is == to one prior?
 Or is the next docID sometimes < the prior one?  And, is it always on
the 1st docID of a new segment?
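
To make the rebasing scenario from earlier in the thread concrete, here
is a minimal standalone sketch. It is not Lucene's merge code: the class
and method names are made up for illustration, and it assumes no docs are
actually deleted, so local docIDs are only shifted by a per-segment base.
It shows how a segment that claims more deletions than it really has
gives the next segment a base that is too low, so that the first docID of
the next segment comes out == or < the last docID of the prior one.

```java
import java.util.ArrayList;
import java.util.List;

public class DocIdRebaseSketch {

    /**
     * Shifts each segment's local docIDs by a running base.
     * The base advances by (maxDoc - claimedDeletes) per segment, which is
     * only correct if claimedDeletes matches the real number of deletions.
     * Assumption for this sketch: no doc is really deleted.
     */
    public static List<Integer> merge(int[][] segmentDocs, int[] maxDoc, int[] claimedDeletes) {
        List<Integer> merged = new ArrayList<>();
        int base = 0;
        for (int seg = 0; seg < segmentDocs.length; seg++) {
            for (int localDoc : segmentDocs[seg]) {
                merged.add(base + localDoc);
            }
            base += maxDoc[seg] - claimedDeletes[seg];
        }
        return merged;
    }

    /** Returns the index of the first docID that is <= its predecessor, or -1 if none. */
    public static int firstOutOfOrder(List<Integer> merged) {
        for (int i = 1; i < merged.size(); i++) {
            if (merged.get(i) <= merged.get(i - 1)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[][] docs = { {0, 1, 2, 3}, {0, 1, 2} };
        int[] maxDoc = { 4, 3 };

        // Correct deletion counts: merged docIDs stay strictly increasing.
        List<Integer> ok = merge(docs, maxDoc, new int[] {0, 0});
        System.out.println(ok + " firstOutOfOrder=" + firstOutOfOrder(ok));

        // Segment 0 wrongly claims one deletion: segment 1's base is one too low.
        List<Integer> bad = merge(docs, maxDoc, new int[] {1, 0});
        System.out.println(bad + " firstOutOfOrder=" + firstOutOfOrder(bad));
    }
}
```

If the failure always lands on the first docID of a new segment, that
points at a wrong base; if it lands in the middle of a segment, the
decoded docIDs themselves are more likely at fault.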

> I will try to continue to debug, but if I could have at least the faulty
> segment, and the faulty term (or even better, the index of the faulty
> block), I will be able to display the content of the blocks, and see if
> there is some problems in the PFor encoding.

You can instrument the code (or catch the exc in a debugger) to see
all these details?
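
One way to do that instrumentation is roughly as follows. This is a
hypothetical helper, not an existing Lucene class: the names
MergeDocOrderChecker, startTerm, startSegment and check are invented
here, and blockSize is an assumed fixed number of postings per encoded
block. The idea is to call it from wherever the merge iterates docs, so
that the failure message carries the segment, the term, and the
approximate block index.

```java
public class MergeDocOrderChecker {
    private final int blockSize;      // assumed postings per encoded block
    private String term = "?";
    private String segment = "?";
    private int docBase = 0;
    private int lastMergedDoc = -1;
    private int posInSegment = 0;     // postings seen for this term in this segment

    public MergeDocOrderChecker(int blockSize) {
        this.blockSize = blockSize;
    }

    /** Call when the merge moves to a new term; resets the ordering check. */
    public void startTerm(String termText) {
        term = termText;
        lastMergedDoc = -1;
    }

    /** Call when the merge moves to the next segment's postings for the current term. */
    public void startSegment(String segmentName, int base) {
        segment = segmentName;
        docBase = base;
        posInSegment = 0;
    }

    /** Call for every local docID read; throws with full context on the first violation. */
    public void check(int localDoc) {
        int mergedDoc = docBase + localDoc;
        if (mergedDoc <= lastMergedDoc) {
            throw new IllegalStateException("docs out of order:"
                + " term=" + term
                + " segment=" + segment
                + " localDoc=" + localDoc
                + " mergedDoc=" + mergedDoc
                + " prevMergedDoc=" + lastMergedDoc
                + " posting#=" + posInSegment
                + " block#=" + (posInSegment / blockSize));
        }
        lastMergedDoc = mergedDoc;
        posInSegment++;
    }
}
```

With the segment name and block index in hand, the corresponding block
can be dumped and decoded in isolation.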

Or... if you can post a patch of where you are, I can dig, if I can
repro the issue...

Mike
