lucene-dev mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: Delete is not multi-thread safe
Date Thu, 31 Jan 2002 19:02:23 GMT
> From: Dmitry Serebrennikov [mailto:dmitrys@earthlink.net]
> 
> If there is one "user" performing additions and deletions, then the two
> can be ordered. But if an application is such that it allows multiple
> people initiate index updates of various kinds, it may be much harder to
> order additions and deletions.

Only one "user" is currently permitted to perform additions at once.  This
is enforced by the "write.lock" file.  It would be easy to extend this
restriction so that only a single user is permitted to perform additions or
deletions at once.  Lucene does not support simultaneous index modification
by multiple processes.  This restriction is just not yet properly enforced
by the deletion code.
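
To illustrate the kind of enforcement meant here, a minimal sketch follows.
It assumes the lock is taken by atomically creating a "write.lock" file in
the index directory. The class and method names are hypothetical, and this
is not Lucene's actual locking code; the point is only that additions and
deletions would both go through the same lock.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Lucene: a single lock file guards
// every kind of index modification (additions and deletions alike).
final class IndexModificationLock {
    private final Path lockFile;

    IndexModificationLock(Path indexDir) {
        // The file name follows the "write.lock" mentioned above.
        this.lockFile = indexDir.resolve("write.lock");
    }

    /** Tries to take the lock; returns false if another modifier holds it. */
    boolean obtain() throws IOException {
        try {
            // createFile checks for existence and creates atomically,
            // so two callers cannot both succeed.
            Files.createFile(lockFile);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    /** Releases the lock so the next modifier can proceed. */
    void release() throws IOException {
        Files.deleteIfExists(lockFile);
    }
}
```

A caller that wants to delete documents would call obtain() first and give
up (or retry later) when it returns false, exactly as an adder would.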

> I agree that the DocumentAdder would be a clearer name for the
> IndexWriter. Also, +1 on documenting the preferred operation order and
> enforcing it if possible.

Cool.  I think this is the approach that best keeps Lucene "lean and mean".

> However, there are applications where this becomes very awkward. I think
> the main need for doing delete + add in one operation is when replacing
> documents with more up-to-date copies.

How awkward is it to open a reader, delete a document, close it, open a
writer, add a document, and then close the writer?  If that's really too
much work, we could add a utility method to encapsulate it.  However, if
you're updating more than a single document, it's much more efficient to
first do all the deletions, then do all the additions.  So adding that
utility method might then encourage folks to write inefficient code.

Perhaps the utility method to add is something like:
  void updateDocs(Document[] docs, String idField);
This would delete any documents currently in an index that have the same
value for 'idField' as a document in 'docs', then add all the documents in
docs.  This API would encourage batching.  Its implementation would be to
open a reader, do the deletions, close the reader, open a writer, do the
additions, then close the writer.
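
The batching behaviour described above can be sketched with a toy model.
This is not Lucene code: the "index" here is just an in-memory list and each
"document" is a map from field name to value, both assumptions made so the
example is self-contained. The two phases stand in for the reader (deletes)
and the writer (adds).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: illustrates "all deletions first, then all additions".
final class UpdateDocsSketch {
    static void updateDocs(List<Map<String, String>> index,
                           List<Map<String, String>> docs,
                           String idField) {
        // Phase 1 (stands in for the reader): collect the ids being
        // replaced, then delete every matching document in one pass.
        Set<String> ids = new HashSet<>();
        for (Map<String, String> doc : docs) {
            String id = doc.get(idField);
            if (id != null) {
                ids.add(id);
            }
        }
        index.removeIf(old -> {
            String id = old.get(idField);
            return id != null && ids.contains(id);
        });

        // Phase 2 (stands in for the writer): add all new documents.
        index.addAll(docs);
    }
}
```

Documents whose id does not appear in 'docs' are left untouched, and a
document in 'docs' with a new id is simply added.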
 
> Document ids are, of course, segment-specific and change during merge.
> This makes searches fast, but it makes it impossible to identify a
> document. But what if we add a "special" field, or add a unique document
> id in some other way? The searches will still use the segment-specific
> ids and remain fast, but there would be a unique id assigned to each
> document that applications could use if needed and also the replace
> operation could use in the IndexWriter. Obviously, we would have to make
> sure that these ids can be created quickly by multiple writers without a
> possibility of duplicate ids.
> 
> Would this work?

Sure, it *could* work.  But we'd need to add a new special dictionary for
document ids that is written to disk.  This would be smaller and hence
faster to access than the term dictionary that is now used for document ids.
All of the indexing code (creating, merging, reading) would have to be
modified to support this id dictionary.  And still, batched deletions would
be faster than intermingled insertion/deletion, just not as much.  Is it
worth it?  The current use of document fields for unique ids builds on
existing code, which is nice.

Doug

--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>