lucene-dev mailing list archives

From: Christoph Goller
Subject: Re: Deleting a document with an IndexWriter open
Date: Mon, 19 Jul 2004 09:46:56 GMT

Dmitry Serebrennikov wrote:
> Another solution that works well in some applications is to rely on 
> document number. This number will remain the same for the life of an 
> IndexReader. This number is also always larger for documents added 
> later. So given two documents with the same ID, the one with the highest 
> document number is the latest one. The rest can be deleted. One way to 
> store a list of documents easily is to use a filter (which could also be 
> serialized to disk if needed). This filter would only be valid for the 
> IndexReader used to create it.
> So here's a modified sequence of operations, perhaps a bit more 
> efficient than proposed by Christoph:
> 1) Open an IndexReader for searching - S. Keep it open until the 
> transaction is committed.
> 2) Open a second IndexReader for deletions - D.
> 3) Create a filter bitset F (or use any other mechanism for storing 
> document numbers to be deleted)
> 4) Open an IndexWriter for new documents - W.
> 5) As documents come in, add them using W. Find their old versions in D 
> and record their document numbers in F. D will not show any new 
> documents, only documents present at the time D was created.
> 6) Close W.
> 7) Use D to delete all documents marked in F.
> 8) Close D.
> Step 8 commits the transaction. At this point, another IndexReader S2 
> can be created and all new searches can go to that. Once all searches 
> using S are done, S can be closed.
> Would this work? I think it might. Does anyone see any holes in this? This 
> can even allow multiple Ws to be used concurrently, and perhaps even 
> multiple machines can be utilized that write to the same index, but I'm 
> not sure if this is desirable.

The proposed mechanism could indeed be made thread-safe, so efficient
multithreaded updates would be possible. That's probably what you have in
mind. However, having more than one IndexWriter open on the same index is
neither possible nor required, since IndexWriter is already optimized for
multithreading. Well, I think you know this anyway; I add it just for other
listeners.
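
For other listeners, here is a rough sketch of the quoted sequence. It uses a
toy in-memory index rather than the real Lucene classes: ToyIndex, applyBatch
and the "snapshot" limit that stands in for reader D are made-up names for
illustration only, and the list position plays the role of the document number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Toy model of the quoted update sequence; hypothetical, not the Lucene API. */
final class UpdateSketch {

    /** In-memory stand-in for an index: list position = document number. */
    static final class ToyIndex {
        private final List<String> ids = new ArrayList<>();
        private final BitSet deleted = new BitSet();

        /** Appends a document; later additions always get larger numbers. */
        int add(String id) {
            ids.add(id);
            return ids.size() - 1;
        }

        int size() {
            return ids.size();
        }

        /** Live document numbers below 'limit' that carry the given ID. */
        List<Integer> find(String id, int limit) {
            List<Integer> hits = new ArrayList<>();
            for (int n = 0; n < Math.min(limit, ids.size()); n++) {
                if (!deleted.get(n) && ids.get(n).equals(id)) {
                    hits.add(n);
                }
            }
            return hits;
        }

        /** Step 7: delete every document number marked in the filter. */
        void deleteAll(BitSet marked) {
            deleted.or(marked);
        }
    }

    /** Runs steps 2 to 8 of the quoted sequence for one batch of IDs. */
    static void applyBatch(ToyIndex index, List<String> incomingIds) {
        int snapshot = index.size();   // D: sees only documents present now
        BitSet f = new BitSet();       // F: document numbers to delete
        for (String id : incomingIds) {
            index.add(id);             // W: add the new version
            for (int old : index.find(id, snapshot)) {
                f.set(old);            // record old versions visible to D
            }
        }
        index.deleteAll(f);            // delete everything marked in F
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a");                // document 0
        index.add("b");                // document 1
        applyBatch(index, Arrays.asList("a", "c"));
        System.out.println(index.find("a", index.size())); // prints [2]
    }
}
```

In this toy model, two incoming documents that share an ID within the same
batch both survive, because the snapshot never sees documents added after it
was taken.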

> Yea, this would be a great thing to have available in Lucene...
> Dmitry.

One could add a class called IndexUpdate that handles all of that.
It should be possible to specify a field or set of fields for
identifying duplicate documents.
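
To make the idea of key fields concrete, here is a small sketch of how
duplicates might be identified. No such class exists in Lucene;
DuplicateKeySketch, keyOf, superseded and the map-based documents are all
assumptions made for illustration. Two documents count as duplicates when
they agree on every configured key field, and the one with the highest
document number is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical duplicate detection by key fields; not a Lucene class. */
final class DuplicateKeySketch {
    private final List<String> keyFields;

    DuplicateKeySketch(String... keyFields) {
        this.keyFields = Arrays.asList(keyFields);
    }

    /** Values of the key fields, in order; equal keys mean duplicates. */
    List<String> keyOf(Map<String, String> doc) {
        List<String> key = new ArrayList<>();
        for (String field : keyFields) {
            key.add(doc.get(field));
        }
        return key;
    }

    /** Marks every document number that a later duplicate supersedes. */
    BitSet superseded(List<Map<String, String>> docs) {
        Map<List<String>, Integer> latest = new HashMap<>();
        BitSet marked = new BitSet();
        for (int n = 0; n < docs.size(); n++) {
            Integer previous = latest.put(keyOf(docs.get(n)), n);
            if (previous != null) {
                marked.set(previous);  // an older version with the same key
            }
        }
        return marked;
    }

    /** Small helper that builds a three-field document for the example. */
    static Map<String, String> doc(String id, String lang, String body) {
        Map<String, String> d = new HashMap<>();
        d.put("id", id);
        d.put("lang", lang);
        d.put("body", body);
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
                doc("1", "en", "old text"),   // document 0
                doc("1", "de", "alter Text"), // document 1, different key
                doc("1", "en", "new text"));  // document 2 supersedes 0
        DuplicateKeySketch sketch = new DuplicateKeySketch("id", "lang");
        System.out.println(sketch.superseded(docs)); // prints {0}
    }
}
```

The key is kept as a list of values rather than one joined string, so field
values that happen to contain a separator character cannot collide.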

