lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Flex API - Debugging Segment Merge
Date Thu, 25 Mar 2010 18:45:18 GMT
Hi Renaud,

It's great that you're pushing flex forward so much :) You're making
some cool-sounding codecs!  I'm really looking forward to seeing
indexing/searching performance results on Wikipedia...

It sounds like, most likely, there's a bug in the PFor impl (since
you don't hit this exception with the other codecs...).

During merge, each segment's docIDs are rebased according to how many
non-deleted docs there are in all prior segments.  One possibility
here is that a given segment thought it had N deletions but in fact
encountered fewer than N while iterating its docs.  This would give
the next segment too low a base, which can cause this exact exception
on crossing from one segment to the next (i.e., the very first doc of
the next segment will suddenly be <= the prior segment's docs).
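
In rough pseudo-Java (invented names, not Lucene's actual merge
code), the rebasing works like this:

    // Sketch only: how docIDs get rebased across segments during a merge.
    int docBase = 0;
    for (SegmentReader seg : segmentsToMerge) {
      int numDocsWritten = 0;
      for (int docID = 0; docID < seg.maxDoc(); docID++) {
        if (seg.isDeleted(docID)) continue;   // deleted docs are squeezed out
        writeDoc(docBase + numDocsWritten);   // compact renumbering; must increase
        numDocsWritten++;
      }
      // Advance the base by the doc count the segment *claims* to have
      // after deletions:
      docBase += seg.maxDoc() - seg.numDeletedDocs();
      // If numDeletedDocs() reported N deletions but fewer than N docs
      // were actually skipped above, docBase is now lower than the last
      // docID written, so the first doc of the next segment shows up
      // <= a prior doc -- the exact "docs out of order" exception.
    }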

But... if that's happening (i.e., the bug is in Lucene, not in your
PFor impl), you'd expect the other codecs to hit it too.

Are you using multiple threads for indexing?  Are you also mixing in
deletions (or updateDocument calls)?

Mike

On Thu, Mar 25, 2010 at 12:55 PM, Renaud Delbru <renaud.delbru@deri.org> wrote:
> Hi,
>
> I am currently benchmarking various compression algorithms using the Sep
> Codec, but I am hitting an index corruption exception during the merge
> process, and I need your help to debug it.
>
> I have reimplemented various algorithms (FOR, Simple9, VInt, PFor) for
> the Sep IntBlock Codec, and I am now benchmarking them on the Wikipedia
> dataset. With some of the algorithms (FOR, Simple9, etc.) I don't
> encounter any problems, but with the PFor algorithm I get a
> CorruptIndexException during the merge process (in
> SepPostingsWriterImpl#startDoc) because documents are out of order:
>
> Exception in thread "Lucene Merge Thread #0"
> org.apache.lucene.index.MergePolicy$MergeException:
> org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153)
>        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:471)
>        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:435)
> Caused by: org.apache.lucene.index.CorruptIndexException: docs out of order (153 <= 153)
>        at org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl.startDoc(SepPostingsWriterImpl.java:177)
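>
> Looking at startDoc, the check that fires seems to be a simple
> monotonicity guard on the incoming docIDs; roughly like this (my
> paraphrase, reconstructed from the exception message, not the exact
> source; lastDocID and df are writer state):
>
>     // Approximate shape of the guard in SepPostingsWriterImpl.startDoc:
>     void startDoc(int docID, int termDocFreq) throws CorruptIndexException {
>       final int delta = docID - lastDocID;
>       if (docID < 0 || (df > 0 && delta <= 0)) {
>         // equal or decreasing docID within one term's postings
>         throw new CorruptIndexException("docs out of order ("
>             + docID + " <= " + lastDocID + ")");
>       }
>       lastDocID = docID;
>       df++;
>     }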
>
> However, this happens only when I index the Wikipedia dataset using the
> PFor algorithm. I have tried to reproduce the error in a unit test,
> creating random documents and performing a merge, but in that case the
> error does not appear.
>
> After some debugging, I have noticed that the document id at the end of
> a segment is the same as (or lower than) the document id at the start of
> the next segment to be merged. However, even with Codec.DEBUG=true
> activated, I am unable to tell which segments are faulty, or which terms
> inside those segments. Could you point me to an easy way to get this
> information, so that I can inspect these segments and their encoded
> blocks to find and understand the problem?
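>
> The only tool I am aware of is CheckIndex; I assume something like the
> following would at least report which segment fails, though I am not
> sure it will also name the faulty terms, nor how it interacts with my
> custom codecs (untested sketch):
>
>     import java.io.File;
>     import org.apache.lucene.index.CheckIndex;
>     import org.apache.lucene.store.Directory;
>     import org.apache.lucene.store.FSDirectory;
>
>     public class CheckMyIndex {
>       public static void main(String[] args) throws Exception {
>         Directory dir = FSDirectory.open(new File(args[0]));
>         CheckIndex checker = new CheckIndex(dir);
>         checker.setInfoStream(System.out);  // print per-segment diagnostics
>         CheckIndex.Status status = checker.checkIndex();
>         System.out.println(status.clean ? "index is ok" : "index is corrupt");
>       }
>     }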
>
> Thanks in advance,
> --
> Renaud Delbru
>
