lucene-java-user mailing list archives

From Antony Bowesman
Subject Re: How to not overwrite a Document if it 'already exists'?
Date Tue, 05 May 2009 23:24:52 GMT
Michael McCandless wrote:
> Lucene doesn't provide any way to do this, except opening a reader.
> Opening a reader is not "that" expensive if you use it for this
> purpose.  EG neither norms nor FieldCache will be loaded if you just
> enumerate the term docs.

Thanks for that info.  These indexes will be large, in the tens of millions of 
documents.  The id field is unique and is 29 bytes long, so I guess that's still 
a lot of term data to trawl through to get to the term.

> But, you can let Lucene do the same thing for you by just always using
> updateDocument, which'll remove the old doc if it's present.

That's precisely what I don't want to occur.  I have two forms of a Document, 
both of which represent mail items.  One is a 'full' version containing all 
indexed and stored data, which represents a searchable mail item.  The other is 
a 'base' version, which is simply a marker Document representing a mail in a 
forwarded mail chain, with just a couple of stored fields containing the mail 
metadata.

Under normal circumstances there are no problems as mails arrive in sequence and 
are never handled twice, but there is one case, during a reindex op, when the 
arrival of those mails can come out of sequence, i.e. a full mail is indexed 
first, but that mail is later processed as part of a forwarded mail chain of 
another mail.

It is on that second pass, when the mail is handled as a base mail, that I do 
not want it to overwrite the full version.
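
To make the behaviour I am after concrete, here is a minimal in-memory sketch. 
It is not Lucene code: the class MailIndexSketch and its methods are 
hypothetical names invented for illustration, and a plain map stands in for the 
index.  It also assumes that a full mail should always replace whatever is 
stored under its id, which is what updateDocument would do.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the desired semantics; not a Lucene API.
class MailIndexSketch {
    enum Kind { FULL, BASE }

    // Maps the unique mail id to the kind of Document currently stored for it.
    private final Map<String, Kind> docs = new HashMap<>();

    // A full mail always replaces any existing entry for the id
    // (assumed to mirror what updateDocument does).
    void indexFull(String id) {
        docs.put(id, Kind.FULL);
    }

    // A base marker is stored only when no Document has this id yet.
    // Returns true if the marker was added, false if an entry already existed.
    boolean indexBaseIfAbsent(String id) {
        return docs.putIfAbsent(id, Kind.BASE) == null;
    }

    // Returns the kind stored for the id, or null if the id is unknown.
    Kind kindOf(String id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        MailIndexSketch index = new MailIndexSketch();
        index.indexFull("mail-1");                 // full mail arrives first
        boolean added = index.indexBaseIfAbsent("mail-1"); // later seen in a chain
        System.out.println("base added: " + added);
        System.out.println("stored kind: " + index.kindOf("mail-1"));
    }
}
```

The point of the sketch is only the asymmetry: the full form overwrites, the 
base form is an 'add if absent'.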

Would it be technically difficult to support something like this in the 
IndexWriter API and, if not, would it end up being more efficient than using a 
reader and enumerating the term docs to check this?

