lucene-java-user mailing list archives

From Bill Janssen
Subject Re: lucene-core-2.2.0.jar broken? CorruptIndexException?
Date Wed, 28 Nov 2007 21:06:40 GMT
> Hmmm ... how many chunks of "about 50 pages" do you do before hitting this?
> Roughly how many docs are in the index when it happens?

Oh, gosh, not sure.  I'm guessing it's about half done.

> Can you describe the docs/fields you're adding?

I've got 1735 documents totaling 18969 pages: the average document is
10.9 pages long, the longest is 1235 pages (a physics textbook), and
578 are one-page documents.  These are Web pages, PDFs, articles,
photos, scanned stuff, technical papers, etc.  I index six documents
at a time, so I guess I'm averaging about 65 pages per chunk.  For
each document, I index the whole text of the document as one Lucene
Document, and I index the text of each page separately as its own
Document.  The whole-document text goes in a "contents" field and the
per-page text goes in a "pagecontents" field.  I also add metadata
fields to each Document: "title", multiple "author" fields, "date",
"abstract", etc.
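
To make that layout concrete, here is a rough sketch of the records that
get built per source document.  It deliberately does not use the Lucene
API: each record is a plain map from field name to values, standing in for
a Lucene Document.  The class name PageIndexSketch and the helpers
newRecord, buildRecords and averagePagesPerChunk are made up for
illustration; only the field names "contents", "pagecontents", "title" and
"author" come from the description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PageIndexSketch {

    // Hypothetical stand-in for a Lucene Document: field name -> values.
    // A list of values models repeated fields such as multiple "author" entries.
    static Map<String, List<String>> newRecord(String title, List<String> authors) {
        Map<String, List<String>> rec = new LinkedHashMap<>();
        rec.put("title", Collections.singletonList(title));
        rec.put("author", new ArrayList<>(authors));
        return rec;
    }

    // Builds one whole-document record ("contents") followed by one record
    // per page ("pagecontents"). Metadata is copied onto every record.
    static List<Map<String, List<String>>> buildRecords(
            String title, List<String> authors, List<String> pages) {
        List<Map<String, List<String>>> out = new ArrayList<>();

        Map<String, List<String>> whole = newRecord(title, authors);
        whole.put("contents", Collections.singletonList(String.join("\n", pages)));
        out.add(whole);

        for (String page : pages) {
            Map<String, List<String>> rec = newRecord(title, authors);
            rec.put("pagecontents", Collections.singletonList(page));
            out.add(rec);
        }
        return out;
    }

    // Average pages per chunk = (total pages / total documents) * documents per chunk.
    static double averagePagesPerChunk(int totalPages, int totalDocs, int docsPerChunk) {
        return (double) totalPages / totalDocs * docsPerChunk;
    }

    public static void main(String[] args) {
        List<String> authors = new ArrayList<>();
        authors.add("First Author");
        authors.add("Second Author");
        List<String> pages = new ArrayList<>();
        pages.add("text of page one");
        pages.add("text of page two");

        List<Map<String, List<String>>> records =
                buildRecords("Example Title", authors, pages);
        System.out.println("records built: " + records.size());
        System.out.println("avg pages per chunk: "
                + averagePagesPerChunk(18969, 1735, 6));
    }
}
```

A document with N pages yields N + 1 records under this scheme, so the
index ends up with roughly one entry per document plus one per page.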