lucene-dev mailing list archives

From Dmitry Serebrennikov <>
Subject Re: Deleting a document with an IndexWriter open
Date Fri, 16 Jul 2004 18:41:05 GMT
Another solution that works well in some applications is to rely on 
the document number. A document's number remains the same for the life of an 
IndexReader. This number is also always larger for documents added 
later. So given two documents with the same ID, the one with the highest 
document number is the latest one. The rest can be deleted. One way to 
store a list of documents easily is to use a filter (which could also be 
serialized to disk if needed). This filter would only be valid for the 
IndexReader used to create it.
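
As a rough illustration of the "highest document number wins" rule, here is a minimal sketch in plain Java. It does not use the Lucene API: the array idsByDocNumber is a hypothetical stand-in for reading the ID field of each document number from a single IndexReader, and the class and method names are made up for this example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class StaleDocFilter {
    /**
     * idsByDocNumber[n] is the application-level ID of document number n,
     * as seen by one reader (an assumption standing in for a field lookup).
     * Returns a BitSet with a bit set for every document that is followed
     * by a later document carrying the same ID.
     */
    public static BitSet staleDocuments(String[] idsByDocNumber) {
        BitSet stale = new BitSet(idsByDocNumber.length);
        // Highest document number seen so far for each ID.
        Map<String, Integer> latest = new HashMap<>();
        for (int docNum = 0; docNum < idsByDocNumber.length; docNum++) {
            Integer previous = latest.put(idsByDocNumber[docNum], docNum);
            if (previous != null) {
                // An older version of this ID exists; mark it for deletion.
                stale.set(previous);
            }
        }
        return stale;
    }
}
```

The resulting BitSet plays the role of the filter described above, so it is only meaningful for the reader whose document numbers were used to build it.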

So here's a modified sequence of operations, perhaps a bit more 
efficient than proposed by Christoph:
1) Open an IndexReader for searching - S. Keep it open until the 
transaction is committed.
2) Open a second IndexReader for deletions - D.
3) Create a filter bitset F (or use any other mechanism for storing 
document numbers to be deleted)
4) Open an IndexWriter for new documents - W.
5) As documents come in, add them using W. Find their old versions in D 
and record their document numbers in F. D will not show any new 
documents, only documents present at the time D was created.
6) Close W.
7) Use D to delete all documents marked in F.
8) Close D.

Step 8 commits the transaction. At this point, another IndexReader S2 
can be created and all new searches can go to that. Once all searches 
using S are done, S can be closed.
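
To make the sequence easier to reason about, here is a toy in-memory model of steps 2 through 7 in plain Java. It is only a sketch under simplifying assumptions, not Lucene code: ToyIndex, applyBatch and liveDocNumbers are hypothetical names, the "reader" D is modeled as nothing more than the document count at the time it was opened, and locking and the search reader S are left out entirely.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class ToyIndex {
    private final List<String> ids = new ArrayList<>(); // ID per document number
    private final BitSet deleted = new BitSet();        // committed deletions

    /** Appends a document and returns its document number. */
    public int add(String id) {
        ids.add(id);
        return ids.size() - 1;
    }

    /** Runs one update transaction; returns F, the set of deleted doc numbers. */
    public BitSet applyBatch(List<String> incomingIds) {
        int maxDocD = ids.size();        // step 2: D sees only docs 0..maxDocD-1
        BitSet f = new BitSet(maxDocD);  // step 3: document numbers to delete
        for (String id : incomingIds) {
            add(id);                     // step 5: add the new version via W
            for (int n = 0; n < maxDocD; n++) {
                // step 5: record old versions visible in D
                if (!deleted.get(n) && ids.get(n).equals(id)) {
                    f.set(n);
                }
            }
        }
        deleted.or(f);                   // steps 6-8: apply deletions and commit
        return f;
    }

    /** Document numbers that a freshly opened reader would see as live. */
    public List<Integer> liveDocNumbers() {
        List<Integer> live = new ArrayList<>();
        for (int n = 0; n < ids.size(); n++) {
            if (!deleted.get(n)) {
                live.add(n);
            }
        }
        return live;
    }
}
```

For example, after adding "a", "b" and "c" and then applying a batch containing "b" and "d", F holds only document number 1 and the live documents are 0, 2, 3 and 4. One thing the toy model makes visible: if the same ID arrives twice within one batch, both new versions survive, because D never shows documents added after it was opened.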

Would this work? I think it might. Does anyone see any holes in this? It 
could even allow multiple Ws to be used concurrently, and perhaps even 
multiple machines writing to the same index, but I'm not sure whether 
that is desirable.

Yea, this would be a great thing to have available in Lucene...

Christoph Goller wrote:

> Giulio Cesare Solaroli wrote:
>> I have been thinking about this for a while, but could not find out a
>> reasonable solution.
>> The basic problems are:
>> - where do I (safely) store the index of the documents that need to
>> be deleted?
>> - how can I uniquely identify the Lucene documents that I have to
>> delete, given that there are different Lucene documents matching a
>> single "real" document?
>> The second problem could be "easily" solved adding a kind of version
>> field (stored in the Lucene index) that is incremented every time a
>> new version of a document is inserted. In this way, when searching for
>> duplicated documents (using the "real" document ID) I will find a set
>> of Lucene documents and I could delete all but the one with the
>> highest version number.
> You need unique document ids. They may either be produced by the
> fulltext index (example 1) or they may come from outside (example 2):
> 1) You could use a unique id for every document added to the Lucene index
> (a kind of counter for the number of added documents). You have to provide
> this number yourself. It is not provided by Lucene! We are doing this
> in some applications. This unique id is stored in a dedicated field, and in
> your database you associate this unique id with your document. If you change
> your document in the database, you find the unique id there and thus you know
> which document to delete in the Lucene index. If the changed document is added
> to the Lucene index, you get a new unique id and store this one with the
> changed document in your database.
> 2) In another application we store a url of each document in the Lucene index.
> If the document underlying the url has changed, we know which document to
> delete in the Lucene index simply via the url, and we store the new version of
> the document again with a url field.
> Christoph
