lucene-java-user mailing list archives

From Daniel Noll <dan...@nuix.com>
Subject Re: Document ID shuffling under 2.3.x (on merge?)
Date Wed, 12 Mar 2008 23:42:50 GMT
On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote:
> OK, I think very likely this is the issue: when IndexWriter hits an
> exception while processing a document, the portion of the document
> already indexed is left in the index, and then its docID is marked
> for deletion.  You can see these deletions in your infoStream:
>
>    flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments
>
> This means you have deletions in your index, by docID, and so when
> you optimize the docIDs are then compacted.

Aha.  Under 2.2, a failure would result in nothing being added to the text
index, so this would explain the problem.  It would also explain why smaller
data sets are less likely to trigger it (a smaller data set is less likely to
contain a document that causes an error).

Workarounds?
  - flush() after any IOException from addDocument()  (overhead?)
  - keep our own counter and use ++ to determine the next document ID instead
    of index.getWriter().docCount()  (out of sync after an error, but fixes
    itself on optimize())
  - Use a field for a separate ID (slower later when reading the index)
  - ???
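
To make the second workaround concrete, here is a toy model in Java.  It uses
no Lucene classes at all; it only assumes the behaviour described above, i.e.
that a failed add still consumes a docID which is then marked deleted, and
that optimize() compacts the deleted docIDs away.  The class and method names
(DocIdSketch, add, rawIdOf) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT Lucene code. It assumes a failed add still
// occupies a docID slot that is marked deleted, as described in the thread.
class DocIdSketch {
    // One slot per add attempt; true means the slot is marked deleted.
    private final List<Boolean> deleted = new ArrayList<>();
    // Application-side counter, incremented only on successful adds.
    private int nextAppId = 0;

    /** Simulates one add attempt. Returns the application ID, or -1 on failure. */
    int add(boolean fails) {
        deleted.add(fails);
        return fails ? -1 : nextAppId++;
    }

    /** Slot currently occupied by the surviving document with this application ID. */
    int rawIdOf(int appId) {
        int live = -1;
        for (int slot = 0; slot < deleted.size(); slot++) {
            if (!deleted.get(slot) && ++live == appId) {
                return slot;
            }
        }
        throw new IllegalArgumentException("no such document: " + appId);
    }

    /** Simulates optimize(): deleted slots are dropped, so IDs become contiguous. */
    void optimize() {
        deleted.removeIf(d -> d);
    }
}
```

With one good add, one failed add and another good add, the second good
document gets application ID 1 but sits in slot 2 until optimize() is called;
afterwards it sits in slot 1 and the two numberings agree again.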

Daniel
