lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Flex API - Debugging Segment Merge
Date Thu, 25 Mar 2010 19:15:47 GMT
On Thu, Mar 25, 2010 at 3:04 PM, Renaud Delbru <renaud.delbru@deri.org> wrote:
> Hi Michael,
>
> On 25/03/10 18:45, Michael McCandless wrote:
>>
>> Hi Renaud,
>>
>> It's great that you're pushing flex forward so much :) You're making
>> some cool sounding codecs!  I'm really looking forward to seeing
>> indexing/searching performance results on Wikipedia...
>>
>
> I'll share them for sure whenever the results are ready ;o).

I'll be waiting eagerly :)

>> It sounds most likely there's a bug in the PFor impl? (Since you don't
>> hit this exception with the others...).
>>
>
> It seems so, but I also find it strange that I cannot reproduce it with
> synthetic data.

Hmmm.

>> During merge, each segment's docIDs are rebased according to how many
>> non-deleted docs there are in all prior segments.  One possibility
>> here is a given segment thought it had N deletions but in fact
>> encountered fewer than N while iterating its docs.  This would cause
>> the next segment to have too-low a base which can cause this exact
>> exception on crossing from one segment to the next.  (Ie the very
>> first doc of the next segment will suddenly be <= prior doc(s)).
>>
>> But... if that's happening (ie, bug is in Lucene not in PFor impl),
>> you'd expect the other codecs to hit it too.
>>
>> Are you using multiple threads for indexing?  Are you also mixing in
>> deletions (or updateDocument calls)?
>>
>
> There are no deletions; I just create the index from scratch, and each
> document I am adding has a unique identifier.

Hmmm.

> I am using a single thread for indexing: reading the list of Wikipedia
> articles sequentially, putting the content into a single field, and adding
> the document to the index. A commit is done every 10K documents.

Are you using contrib/benchmark for this?  That makes it very easy to
run tests like this... hmm though we need to extend it so you can
specify which Codec to use...

> I have tried different mergeFactors (2 or 20), but whenever the first
> merge occurs, I get this CorruptIndexException.

It's that consistent?  Is the docID always == to the prior one?  Or is
the next docID sometimes < the prior one?  And does it always happen on
the 1st docID of a new segment?

> I will continue to debug, but if I could at least get the faulty
> segment and the faulty term (or even better, the index of the faulty
> block), I would be able to display the content of the blocks and see if
> there are problems in the PFor encoding.

You can instrument the code (or catch the exc in a debugger) to see
all these details?
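
For example, here is a rough standalone sketch of the kind of check I
mean.  It is not Lucene code: the Segment class, its fields and
findFirstViolation are hypothetical names made up for illustration, and
it assumes no deletions (as in your case), so each segment's base is
just the sum of the doc counts claimed by the prior segments.  It
rebases segment-local docIDs the way described above and reports the
segment and posting position of the first docID that is <= the prior
one:

```java
import java.util.Arrays;
import java.util.List;

public class MergeRebaseSketch {

    /** Hypothetical stand-in for one segment (not a Lucene class). */
    static final class Segment {
        final int claimedDocCount;  // doc count the segment reports
        final int[] postingDocIDs;  // segment-local docIDs decoded for one term

        Segment(int claimedDocCount, int[] postingDocIDs) {
            this.claimedDocCount = claimedDocCount;
            this.postingDocIDs = postingDocIDs;
        }
    }

    /**
     * Rebases each segment's docIDs by the doc counts of all prior segments
     * (assumes no deletions) and returns a description of the first docID
     * that is <= the previous merged docID, or null if the order is valid.
     */
    static String findFirstViolation(List<Segment> segments) {
        int base = 0;
        int lastMerged = -1;
        for (int seg = 0; seg < segments.size(); seg++) {
            Segment s = segments.get(seg);
            for (int i = 0; i < s.postingDocIDs.length; i++) {
                int merged = base + s.postingDocIDs[i];
                if (merged <= lastMerged) {
                    return "segment=" + seg + " posting=" + i
                        + " docID=" + merged + " <= prior=" + lastMerged;
                }
                lastMerged = merged;
            }
            base += s.claimedDocCount;
        }
        return null;
    }

    public static void main(String[] args) {
        // Base too low: segment 0 claims 3 docs but its postings reach docID 3.
        List<Segment> badBase = Arrays.asList(
            new Segment(3, new int[] {0, 1, 2, 3}),
            new Segment(2, new int[] {0, 1}));
        System.out.println(findFirstViolation(badBase));

        // Decoder problem: a docID repeats in the middle of segment 0.
        List<Segment> badDecode = Arrays.asList(
            new Segment(3, new int[] {0, 2, 2}),
            new Segment(2, new int[] {0, 1}));
        System.out.println(findFirstViolation(badDecode));
    }
}
```

If the violation always lands on posting 0 of a segment, that points at
the base; if it lands mid-segment, that points at the decoded block
itself.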

Or... if you can post a patch of where you are, I can dig, if I can
repro the issue...

Mike


