lucene-java-user mailing list archives

From Peter Keegan <peterlkee...@gmail.com>
Subject Re: IO exception during merge/optimize
Date Wed, 28 Oct 2009 14:58:35 GMT
The only change I made to the source code was the patch for PayloadNearQuery
(LUCENE-1986).
It's possible that our content contains U+FFFF. I will run it in a debugger
and see.
The data is 'sensitive', so I may not be able to provide a bad segment,
unfortunately.
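
A debugger is one way to check; another is to scan the raw field text for
U+FFFF before it reaches the indexer. The following is only a minimal sketch
in plain Java with no Lucene dependency. The class name, the method names, and
the sample string are hypothetical and are not taken from the actual
application or data. U+FFFF lies in the Basic Multilingual Plane, so comparing
individual char values is enough to find it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: finds U+FFFF in text before indexing.
final class EndOfTermMarkerScan {

    // U+FFFF is a Unicode noncharacter. Per the thread, Lucene uses it
    // internally to mark the end of a term.
    static final char END_OF_TERM_MARKER = '\uFFFF';

    // Returns the char offsets at which U+FFFF occurs (empty list if none).
    static List<Integer> findMarkerOffsets(CharSequence text) {
        List<Integer> offsets = new ArrayList<Integer>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == END_OF_TERM_MARKER) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    // Returns a copy of the text with every U+FFFF replaced by the given char.
    // Whether replacing is appropriate depends on the content; this is an
    // assumption, not a recommendation from the thread.
    static String replaceMarker(String text, char replacement) {
        return text.replace(END_OF_TERM_MARKER, replacement);
    }

    public static void main(String[] args) {
        // Made-up sample value that embeds U+FFFF for demonstration only.
        String sample = "literals:cfid196$" + END_OF_TERM_MARKER + "on";
        System.out.println("offsets: " + findMarkerOffsets(sample));
        System.out.println("clean:   " + replaceMarker(sample, ' '));
    }
}
```

Running findMarkerOffsets over each field value as documents are read, and
logging the document id when the list is non-empty, would show whether the
broken segments line up with documents that contain the character.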

Peter

On Wed, Oct 28, 2009 at 10:43 AM, Michael McCandless <lucene@mikemccandless.com> wrote:

> OK... when you exported the sources & built yourself, you didn't make
> any changes, right?
>
> It's really odd how many of the errors are due to the term
> "literals:cfid196$", or some variation (one time with "on" appended,
> another time with "microsoft").  Do you know what documents typically
> contain that term, and what the context is around it?  Maybe try to
> index only those documents and see if this happens?  (It could
> conceivably be caused by bad data, if this is some weird bug).  One
> question: does your content ever use the [invalid] unicode character
> U+FFFF?  (Lucene uses this internally to mark the end of the term).
>
> Would it be possible to zip up all files starting with _1c (should be
> ~22 MB) and post somewhere that I could download?  That's the smallest
> of the broken segments I think.
>
> I don't need the full IW output just yet, thanks.
>
> Mike
>
> On Wed, Oct 28, 2009 at 10:21 AM, Peter Keegan <peterlkeegan@gmail.com> wrote:
> > Yes, I used JDK 1.6.0_16 when running CheckIndex and it reported the same
> > problems when run multiple times.
> >
> >>Also, what does Lucene version "2.9 exported - 2009-10-27 15:31:52" mean?
> > This appears to be something added by the ant build, since I built Lucene
> > from the source code.
> >
> > I rebuilt the index using mergeFactor=50, ramBufferSize=200MB,
> > maxBufferedDocs=1000000
> > This produced 49 segments, 9 of which are broken. The broken segments are in
> > the latter half, similar to my previous post with 3 segments. Do you think
> > this could be caused by 'bad' data, for example bad unicode characters?
> >
> > Here is the output from CheckIndex:
