lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Re: Delete is not multi-thread safe
Date Thu, 31 Jan 2002 20:17:31 GMT
Doug Cutting wrote:

>
>How awkward is it to open a reader, delete a document, close it, open a
>writer, add a document, and then close the writer?  If that's really too
>much work, we could add a utility method to encapsulate it.  However, if
>you're updating more than a single document, it's much more efficient to
>first do all the deletions, then do all the additions.
>
That's just it - while you are busy re-crawling a web site (which can
take substantial time), there will be a window during which a user
will find no documents from that web site at all, neither old nor new.
Maybe the answer is to re-crawl into a different index directory and
then move the files...
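
To make the "re-crawl elsewhere, then move" idea concrete, here is a
minimal sketch using only java.nio.file. The class name IndexSwap, the
method swap, and the three directory parameters are hypothetical names
for illustration; nothing here is part of Lucene. It assumes all three
paths are on the same file system and that retiredDir does not exist yet.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: swaps a freshly built index directory into place. */
final class IndexSwap {

    /**
     * Moves liveDir aside to retiredDir, then moves freshDir to liveDir.
     * Each rename is atomic on its own, but the pair is not: between the
     * two calls liveDir briefly does not exist.
     */
    static void swap(Path liveDir, Path freshDir, Path retiredDir) throws IOException {
        if (Files.exists(liveDir)) {
            // Step 1: get the old index out of the way without deleting it.
            Files.move(liveDir, retiredDir, StandardCopyOption.ATOMIC_MOVE);
        }
        // Step 2: publish the re-crawled index under the live name.
        Files.move(freshDir, liveDir, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The gap between the two renames is far shorter than a re-crawl, but it
is still a gap, and searchers would have to reopen their readers to see
the new files, so this narrows the problem rather than removing it.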

>So adding that
>utility method might then encourage folks to write inefficient code.
>
>Perhaps the utility method to add is something like:
>  void updateDocs(Document[] docs, String idField);
>This would delete any documents currently in an index that have the same
>value for 'idField' as a document in 'docs', then add all the documents in
>docs.  This API would encourage batching.  Its implementation would be to
>open a reader, do the deletions, close the reader, open a writer, do the
>additions, then close the writer.
> 
>
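For reference, the two-phase shape of that method can be sketched
against a toy in-memory index. ToyIndex and its methods are made-up
names, documents are modeled as plain field-to-value maps, and the
signature takes a List instead of Document[]; this is an illustration
of the batching, not Lucene's reader/writer code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for an index; a document is a map of field name to value. */
final class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Deletes every stored doc whose idField matches a new doc, then adds the new docs. */
    void updateDocs(List<Map<String, String>> newDocs, String idField) {
        // Phase 1 (the "reader" phase): collect the ids once, delete in one pass.
        Set<String> ids = new HashSet<>();
        for (Map<String, String> doc : newDocs) {
            String id = doc.get(idField);
            if (id != null) {
                ids.add(id);
            }
        }
        docs.removeIf(doc -> ids.contains(doc.get(idField)));

        // Phase 2 (the "writer" phase): add the whole batch.
        docs.addAll(newDocs);
    }

    int size() {
        return docs.size();
    }

    /** Returns the non-null values of one field, in index order. */
    List<String> values(String field) {
        List<String> out = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (doc.get(field) != null) {
                out.add(doc.get(field));
            }
        }
        return out;
    }
}
```

Note that between phase 1 and phase 2 neither the old nor the new
documents are present, which is exactly the window described earlier.
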
>>Document ids are, of course, segment-specific and change during merge.
>>This makes searches fast, but it makes it impossible to identify a
>>document. But what if we add a "special" field, or add a unique document
>>id in some other way? The searches will still use the segment-specific
>>ids and remain fast, but there would be a unique id assigned to each
>>document that applications could use if needed and also the replace
>>operation could use in the IndexWriter. Obviously, we would have to make
>>sure that these ids can be created quickly by multiple writers without a
>>possibility of duplicate ids.
>>
>>Would this work?
>>
>
>Sure, it *could* work.  But we'd need to add a new special dictionary for
>document ids that is written to disk.  This would be smaller and hence
>faster to access than the term dictionary that is now used for document ids.
>All of the indexing code (creating, merging, reading) would have to be
>modified to support this id dictionary.  And still, batched deletions would
>be faster than intermingled insertion/deletion, just not as much.  Is it
>worth it?  The current use of document fields for unique ids builds on
>existing code, which is nice.
>
Hm. I'm afraid I don't follow. I'm not sure what the extra dictionary
would be needed for (I think I know, but I'm not sure). Also, your
proposal above seems almost the same as mine (with the
updateDocs(Document[], String) method). What's missing is the ability to
record deletions in the segment containing the replacing document rather
than in the segment containing the document being replaced. If we can do
that, the deletions will become atomic with the additions of the
replacing documents, which I think would be great. Does this still
require an extra dictionary and extra work? It would be sort of a way to
record pending deletions, which become effective during a merge.
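
To show what I mean, here is a rough toy model, not a proposal for the
actual file format. All names (SegmentedToyIndex, Segment, replacedIds,
addSegment, lookup, mergeAll) are invented for this sketch. It assumes
documents are keyed by an application-level unique id, and that a lookup
honors the pending deletions recorded in newer segments, so the
replacement and the deletion become visible together when the new
segment is published. The old copy is only physically dropped at merge.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model: each segment carries the ids it deletes from OLDER segments. */
final class SegmentedToyIndex {

    static final class Segment {
        final Map<String, String> docs = new LinkedHashMap<>(); // unique id -> body
        final Set<String> replacedIds = new HashSet<>();        // pending deletions
    }

    private final List<Segment> segments = new ArrayList<>();   // oldest first

    /** Publishes new docs and the deletions they imply as one unit. */
    void addSegment(Map<String, String> docsById, Set<String> replacedIds) {
        Segment s = new Segment();
        s.docs.putAll(docsById);
        s.replacedIds.addAll(replacedIds);
        segments.add(s);
    }

    /** Newest segment wins; a pending deletion hides copies in older segments. */
    String lookup(String id) {
        for (int i = segments.size() - 1; i >= 0; i--) {
            Segment s = segments.get(i);
            if (s.docs.containsKey(id)) {
                return s.docs.get(id);
            }
            if (s.replacedIds.contains(id)) {
                return null;
            }
        }
        return null;
    }

    /** Merges everything into one segment, applying pending deletions for real. */
    void mergeAll() {
        Segment merged = new Segment();
        for (Segment s : segments) {
            // Drop older docs that this segment deletes, then add its own docs.
            merged.docs.keySet().removeAll(s.replacedIds);
            merged.docs.putAll(s.docs);
        }
        segments.clear();
        segments.add(merged);
    }

    /** Number of document copies physically stored, including hidden ones. */
    int storedDocCount() {
        int n = 0;
        for (Segment s : segments) {
            n += s.docs.size();
        }
        return n;
    }
}
```

In this model there is never a moment when a lookup finds neither the
old nor the new document: until the new segment is added the old one is
visible, and afterwards the new one is. Whether the per-segment
replacedIds set is the "extra dictionary" you had in mind is the part I
am unsure about.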

Dmitry