lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: Delete is not multi-thread safe
Date Thu, 31 Jan 2002 18:43:36 GMT
Doug Cutting wrote:

>>From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
>>
>>>It seems that either a) deletes should be write-through, or 
>>>b) deletes should 
>>>be done by the writer, or c) writer should not optimize 
>>>non-RAM segments unless 
>>>asked to. As a client, I like option b) the best, though, 
>>>this is not the easiest option to implement. My $0.02
>>>
>>Or maybe
>>d) when merging, a writer should share an in-memory image of segment1 
>>and prohibit any deletes on segment one while merge is in progress?
>>
>
>Or maybe:
>e) Deleting from a reader while an IndexWriter is open on the same index
>should throw an exception.  This just requires the delete code to obtain the
>write.lock.
>
I don't think this would address the reported problem. I have not 
verified it, but per the bug report it seems that the IndexReader caches 
deletes, so an IndexWriter can perform an optimization while an 
IndexReader is still holding delete information in memory (even though, 
from the application's point of view, the delete was performed before 
the addition).
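
To make the suspected sequence concrete, here is a toy model in Java. It 
is only a sketch of the behaviour as the bug report describes it: the 
classes ToySegment, ToyReader and ToyWriter are hypothetical and are not 
Lucene's real code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical classes, not Lucene's implementation.
class ToySegment {
    final List<String> docs = new ArrayList<>();
    final Set<Integer> deleted = new HashSet<>(); // deletions that reached the segment
}

class ToyReader {
    private final ToySegment segment;
    private final Set<Integer> pending = new HashSet<>(); // deletes cached in memory

    ToyReader(ToySegment segment) { this.segment = segment; }

    // The caller considers the delete done, but it is only buffered here.
    void delete(int docNum) { pending.add(docNum); }

    // Only close() writes the buffered deletes through to the segment.
    void close() {
        segment.deleted.addAll(pending);
        pending.clear();
    }
}

class ToyWriter {
    // Merging copies every document the segment does not yet list as deleted.
    static ToySegment optimize(ToySegment source) {
        ToySegment merged = new ToySegment();
        for (int i = 0; i < source.docs.size(); i++) {
            if (!source.deleted.contains(i)) {
                merged.docs.add(source.docs.get(i));
            }
        }
        return merged;
    }
}

class LostDeleteDemo {
    public static void main(String[] args) {
        ToySegment seg = new ToySegment();
        seg.docs.add("a");
        seg.docs.add("b");

        ToyReader reader = new ToyReader(seg);
        reader.delete(0);                            // buffered, not yet written
        ToySegment merged = ToyWriter.optimize(seg); // merge runs before close()
        reader.close();                              // too late for the merged segment

        System.out.println(merged.docs);             // prints [a, b]
    }
}
```

In this model the "deleted" document survives in the merged segment 
because the merge never saw the buffered delete.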

If there is one "user" performing additions and deletions, then the two 
can be ordered. But if an application allows multiple people to initiate 
index updates of various kinds, it may be much harder to order additions 
and deletions.

>
>Deletions and additions must happen serially.  In particular, the intended
>order of operations is:
>  reader.open();
>  reader.deleteDocument(...);
>  reader.close();
>  writer.open();
>  writer.addDocument(...);
>  writer.close();
>
>The bug is that this is not enforced, nor is it well documented.  Let's fix
>that first.  Another bug might be that IndexWriter is a misnomer: it should
>really be called something like DocumentAdder.
>
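
One way to picture enforcing this order through the write.lock, as 
suggested in (e), is the toy sketch below. Everything in it is an 
assumption made for illustration: ToyIndexLock, DeletingReader and 
AddingWriter are hypothetical names, and the lock is an in-memory flag 
rather than a lock file.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the index's single write.lock.
class ToyIndexLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void obtain(String who) {
        if (!held.compareAndSet(false, true)) {
            throw new IllegalStateException(who + ": write.lock is already held");
        }
    }

    void release() { held.set(false); }
}

// Hypothetical reader that takes the lock on its first delete.
class DeletingReader {
    private final ToyIndexLock lock;
    private boolean holdsLock = false;

    DeletingReader(ToyIndexLock lock) { this.lock = lock; }

    void deleteDocument(int docNum) {
        if (!holdsLock) {
            lock.obtain("reader");
            holdsLock = true;
        }
        // A real reader would mark docNum as deleted here.
    }

    void close() {
        if (holdsLock) {
            lock.release();
            holdsLock = false;
        }
    }
}

// Hypothetical writer that holds the lock for its whole lifetime.
class AddingWriter {
    private final ToyIndexLock lock;

    AddingWriter(ToyIndexLock lock) {
        this.lock = lock;
        lock.obtain("writer");
    }

    void addDocument(String doc) {
        // A real writer would create a new segment here.
    }

    void close() { lock.release(); }
}

class SerialOrderDemo {
    public static void main(String[] args) {
        ToyIndexLock lock = new ToyIndexLock();

        // Intended order: batch of deletes, close, then batch of adds.
        DeletingReader reader = new DeletingReader(lock);
        reader.deleteDocument(0);
        reader.close();

        AddingWriter writer = new AddingWriter(lock);
        writer.addDocument("new");

        // A delete attempted while the writer is open is rejected.
        DeletingReader late = new DeletingReader(lock);
        try {
            late.deleteDocument(1);
            System.out.println("delete allowed");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        writer.close();
    }
}
```

In this sketch the reader takes the lock only on its first delete, so 
readers that are used purely for searching never contend for it.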
>>Personally, I would also like to see deletion moved into the writer. 
>>
>
>And I'd like to see cars outlawed.
>
Cars?!?! Ok, so we'll all ride the Ginger! :)

>
>Yes, this would be a cleaner API, but it would also encourage folks to write
>less efficient index updating code.  The most efficient approach is to batch
>deletions and additions separately.  Intermingling them will never be as
>fast.  The current API encourages one to do things this way.  Also,
>currently the deletion code is very simple and easy to maintain.  Optimizing
>intermingled additions and deletions would require adding a lot of new code,
>substantially complicating Lucene, and likely introducing bugs.
>
>Some background:  To delete a document we need an IndexReader to find its
>document number.  To add a document we just need to add a new segment,
>opening no readers.  Periodically a subset of the segments are opened by a
>reader to merge them.
>
Yes, this is one of the more ingenious design ideas in Lucene, I have to 
say! It makes a world of difference that segments are read-only and that 
document additions never have to update anything - only create new files.

>
>
>If deletion were added to an IndexWriter it would need to have an
>IndexReader opened on all segments, in order to find the document number and
>mark it as deleted.  Each time a document is added or segments are merged
>this reader must be invalidated.  It would be very inefficient to re-open
>this IndexReader each time a document is deleted, so code would need to be
>added to incrementally update a SegmentsReader in light of document
>additions and merges.  Such a reader could also be optimized to only open
>those files that are required for deletion.  Still, intermingling inserts
>and deletes would be less efficient, since it would require the dictionaries
>for each altered segment to be re-read in order to find the document number.
>
These are all excellent points, and I had not realized most of them. I 
also agree that DocumentAdder would be a clearer name for the 
IndexWriter. And +1 on documenting the preferred operation order and 
enforcing it if possible.

However, there are applications where this becomes very awkward. I think 
the main need for doing delete + add in one operation is when replacing 
documents with more up-to-date copies. Wouldn't it be great if 
IndexWriter provided a way to not simply add a document, but to add it 
as a replacement for another document? (Yes, I know that the document 
numbers are unstable, but let's forget that for the moment.) What would 
happen is that the existing IndexReaders would continue to use the 
document from the older segment where it still exists, but new 
IndexReaders would perform an in-memory merge in which they would 
discover that the document is now deleted. The same thing would happen 
during optimization.
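
Here is a rough Java sketch of the semantics I have in mind. It is a toy 
model, not a proposal for the actual code: ReplaceIndex and 
SnapshotReader are hypothetical, and it simply assumes that every 
document carries some stable application-supplied key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: each segment is a read-only map from key to content.
class ReplaceIndex {
    private final List<Map<String, String>> segments = new ArrayList<>();

    // A replacement never touches older segments; it only appends a new one.
    synchronized void replace(String key, String content) {
        Map<String, String> seg = new HashMap<>();
        seg.put(key, content);
        segments.add(Collections.unmodifiableMap(seg));
    }

    // A reader sees the segments that existed at the moment it was opened.
    synchronized SnapshotReader openReader() {
        return new SnapshotReader(new ArrayList<>(segments));
    }
}

class SnapshotReader {
    private final Map<String, String> view = new HashMap<>();

    // In-memory merge at open time: a newer segment's copy hides older copies.
    SnapshotReader(List<Map<String, String>> segments) {
        for (Map<String, String> seg : segments) {
            view.putAll(seg);
        }
    }

    String get(String key) { return view.get(key); }

    int size() { return view.size(); }
}

class ReplaceDemo {
    public static void main(String[] args) {
        ReplaceIndex index = new ReplaceIndex();
        index.replace("doc-1", "v1");
        SnapshotReader existing = index.openReader();

        index.replace("doc-1", "v2");
        SnapshotReader fresh = index.openReader();

        // The existing reader keeps the old copy; the new reader sees only the new one.
        System.out.println(existing.get("doc-1") + " " + fresh.get("doc-1")); // prints v1 v2
    }
}
```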

I think this would make replacing documents very easy, atomic, and 
probably thread-safe. The problem is how to identify the document in an 
older segment that is being replaced.

Document ids are, of course, segment-specific and change during a merge. 
This makes searches fast, but it makes it impossible to identify a 
document reliably over time. But what if we add a "special" field, or 
add a unique document id in some other way? Searches would still use 
the segment-specific ids and remain fast, but there would be a unique id 
assigned to each document that applications could use if needed and 
that the replace operation in the IndexWriter could use as well. 
Obviously, we would have to make sure that these ids can be created 
quickly by multiple writers without any possibility of duplicate ids.
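
As one possible scheme, sketched in Java: each writer combines its own 
id with a local counter. DocIdGenerator is a hypothetical class, and the 
sketch assumes that writer ids are already unique and that the counter 
does not need to survive a restart. How to guarantee either of those is 
left open.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical id source: one instance per writer.
class DocIdGenerator {
    private final String writerId;                       // assumed unique per writer
    private final AtomicLong counter = new AtomicLong(); // safe across threads

    DocIdGenerator(String writerId) { this.writerId = writerId; }

    // No shared state between writers, so no cross-writer locking is needed.
    String next() {
        return writerId + "-" + counter.incrementAndGet();
    }
}

class DocIdDemo {
    public static void main(String[] args) throws InterruptedException {
        Set<String> seen = ConcurrentHashMap.newKeySet();
        DocIdGenerator a = new DocIdGenerator("writerA");
        DocIdGenerator b = new DocIdGenerator("writerB");

        Runnable useA = () -> {
            for (int i = 0; i < 1000; i++) {
                seen.add(a.next());
            }
        };
        Runnable useB = () -> {
            for (int i = 0; i < 1000; i++) {
                seen.add(b.next());
            }
        };

        // Two threads share writer A's generator; a third uses writer B's.
        Thread[] threads = { new Thread(useA), new Thread(useA), new Thread(useB) };
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }

        System.out.println(seen.size()); // prints 3000: no duplicates
    }
}
```

Random ids from java.util.UUID.randomUUID() would avoid having to 
coordinate writer ids at all, at the cost of longer ids.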

Would this work?



