lucene-java-user mailing list archives

From "Michael McCandless" <>
Subject Re: lucene-core-2.2.0.jar broken? CorruptIndexException?
Date Thu, 29 Nov 2007 10:10:25 GMT

"Bill Janssen" <> wrote:

> > Hmmm ... how many chunks of "about 50 pages" do you do before
> > hitting this?  Roughly how many docs are in the index when it
> > happens?
> Oh, gosh, not sure.  I'm guessing it's about half done.

Ugh, OK.  If we could boil this down to a smaller set that is easily
reproducible (and transferable to me) then I could try to track it
down.

Do you have another PPC machine to reproduce this on?  (To rule out
bad RAM/hard-drive on the first one).

Can you try running with the trunk version of Lucene (2.3-dev) and see
if the error still occurs?  E.g., you can download this morning's build here:

Another thing to try is to turn on the infoStream
(IndexWriter.setInfoStream(...)), then capture and post the resulting log.
It will be very large since it takes quite a while for the error to
occur.

> So, I ran the same codebase with lucene-core-2.2.0.jar on an Intel
> Mac Pro, OS X 10.5.0, Java 1.5, and no exception is raised.
> Different corpus, about 50000 pages instead of 20000.  This is
> reinforcing my thinking that it's a big-endian issue.

That's a good question.

Lucene is endian independent: all writes to files eventually boil down
to writeByte/writeBytes calls, such that the byte
ordering is controlled by Lucene, not the underlying CPU.
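
To illustrate the idea, here is a minimal sketch in plain Java. It is
not Lucene's actual code; the class and method names
(EndianIndependentWrite, writeInt, readInt) are made up for this
example. It shows how emitting a multi-byte value one byte at a time,
in an order fixed by the writer, gives the same bytes in the file no
matter what the CPU's native byte order is:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteOrder;

public class EndianIndependentWrite {

    // Write a 32-bit int as four single-byte writes, high-order byte first.
    // The shifts operate on the int's value, not its memory layout, so the
    // result does not depend on the CPU's native byte order.
    static void writeInt(ByteArrayOutputStream out, int value) {
        out.write((value >>> 24) & 0xFF);
        out.write((value >>> 16) & 0xFF);
        out.write((value >>> 8) & 0xFF);
        out.write(value & 0xFF);
    }

    // Read the int back using the same fixed byte order.
    static int readInt(byte[] b, int offset) {
        return ((b[offset] & 0xFF) << 24)
             | ((b[offset + 1] & 0xFF) << 16)
             | ((b[offset + 2] & 0xFF) << 8)
             | (b[offset + 3] & 0xFF);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeInt(out, 0x0A0B0C0D);
        byte[] bytes = out.toByteArray();

        // The native order differs between PPC and Intel...
        System.out.println("native order: " + ByteOrder.nativeOrder());

        // ...but the bytes written are 0A 0B 0C 0D on both.
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();

        System.out.println("round trip ok: " + (readInt(bytes, 0) == 0x0A0B0C0D));
    }
}
```

Running this on a big-endian and a little-endian machine should print
a different "native order" line but the same four bytes.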

That said, it is clearly a difference in your test so it seems like a
compelling lead... is it possible to run this different corpus back on
the PPC machine, to rule out a corpus difference leading to the
exception?

> I've got 1735 documents, 18969 pages -- average page size 10.9, max
> page size 1235 (a physics textbook), 578 one-page documents.  These
> are Web pages, PDFs, articles, photos, scanned stuff, technical
> papers, etc.  I index six documents at a time, so I guess I'm
> averaging about 65 pages per chunk.  For each document, I index the
> whole text of the document as a Lucene Document, and I index the
> text of each page separately as a Document.  I use the "contents"
> fields and "pagecontents" fields for those two uses.  I also add
> metadata information to each: "title", multiple "author" fields,
> "date", "abstract", etc.

OK, sounds like a nice rich corpus :) Are you using term vectors,
stored fields, payloads on any of these?

