lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1705) Add deleteAllDocuments() method to IndexWriter
Date Fri, 19 Jun 2009 21:02:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722006#action_12722006 ]

Shai Erera commented on LUCENE-1705:
------------------------------------

My search app has such a scenario, and currently we just delete all the documents matching a
certain criterion (something similar to the MatchAllDocsQuery above). But I actually think that's
the wrong approach. If you want to delete all the documents from the index, you'd better create
a new one. The main reason is that if your index has, say, 10M documents, a deleteAll() will
keep those 10M in the index, and when you re-index, the index size will double. Worse
still, the deleted documents may belong to segments which will not be merged/optimized right
away (depending on your mergeFactor setting), and will therefore stick around for a long time
(until you call optimize() or expungeDeletes()).

But creating a new IndexWriter right away, overwriting the current index, is not so smart,
because your users will be left w/ no search results until the new index has accumulated enough
documents. Therefore I think the solution for such a scenario should be:
# Call writer.rollback() - abort all current operations, cancel everything until the last
commit.
# Create a new IndexWriter in a new directory and re-index everything.
# In the meantime, all your search operations go against the current index, which you know
is not going to change until the other one is re-built. You can therefore also optimize
things by opening an IndexReader once, stopping any accounting your code may do, and just
leaving it open.
# When re-indexing has completed, sync all your code and:
#* Define your workDir to be the new index dir. That way new searches can begin right away
on the new index.
#* Safely delete the old index dir (probably need to do something here to ensure no readers
are open against this dir etc.).
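
The steps above can be sketched roughly as follows. This is plain Java with no Lucene calls, just to show the order of operations around the directory switch. IndexDirSwapper, Rebuilder and rebuildAndSwap are hypothetical names made up for this sketch; the actual re-indexing (a new IndexWriter on the new directory) is assumed to happen inside the callback, and the check that no readers are still open on the old dir is left as an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy model of "rebuild in a new directory, then swap"; makes no Lucene calls. */
class IndexDirSwapper {

    /** Hypothetical callback that fills a fresh directory with the re-built index. */
    interface Rebuilder {
        void rebuildInto(Path newDir) throws IOException;
    }

    // Directory that searches should currently use (the "workDir" from the list above).
    private final AtomicReference<Path> workDir;

    IndexDirSwapper(Path initialDir) {
        this.workDir = new AtomicReference<>(initialDir);
    }

    Path currentDir() {
        return workDir.get();
    }

    /**
     * Rebuilds into a sibling directory while currentDir() keeps pointing at the
     * old index, then switches workDir and deletes the old directory.
     */
    synchronized Path rebuildAndSwap(Rebuilder rebuilder) throws IOException {
        Path oldDir = workDir.get();
        Path parent = oldDir.toAbsolutePath().getParent();
        Path newDir = Files.createTempDirectory(parent, "index-");
        try {
            rebuilder.rebuildInto(newDir); // old index stays searchable during this call
        } catch (IOException | RuntimeException e) {
            deleteRecursively(newDir);     // failed rebuild: keep serving the old index
            throw e;
        }
        workDir.set(newDir);               // new searches now go to the new index
        // Assumption: nothing still has files open in oldDir; real code must check first.
        deleteRecursively(oldDir);
        return newDir;
    }

    private static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so files come before the directories that contain them.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Search code would call currentDir() whenever it opens a reader, so the switch only affects searches started after the rebuild has finished. If the rebuild fails, the half-built directory is removed and workDir is left untouched.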

That's a high-level description and I realize it may have some holes here and there, but you
get the point.

If we were to create a deleteAll() method, I'd expect it to work that way. I.e., the solution
you proposed above (write a new segments file referencing no segments) would prevent all searches
until something new is actually re-indexed, right?

I have to admit though, that I don't have an idea yet on how it can be done inside Lucene,
such that new readers will see the old segments, while when I finish re-indexing and call
commit, the previous segments will just be deleted.

A wild shot (and then I'll go to sleep on it) - how about if you re-index everything, not
committing during that time at all. Readers that are open against the current directory will
see all the documents, EXCEPT the new ones you're adding (same for new readers that you may
open). When you're done re-indexing, you call a commitNewOnly, which creates an empty
segments file and then calls commit. That way, assuming you're using KeepOnlyLastCommitDeletionPolicy,
after the existing readers close, any new reader that is opened will see the new
segments only, and the next time you commit, the old segments will be deleted.
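
To make the visibility rules in that idea concrete, here is a toy in-memory model in plain Java. It is not Lucene code: CommitModel and its methods are hypothetical, segment names are just strings, and commitNewOnly is the method imagined above, not an existing IndexWriter call. The model also ignores when the old segment files actually get deleted; it only shows what each reader would see.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of commit points; segment names stand in for real segments. Not Lucene code. */
class CommitModel {
    // Segments referenced by the latest commit point (what a newly opened reader sees).
    private List<String> committed = Collections.emptyList();
    // Segments written since the last commit; invisible to all readers.
    private final List<String> pending = new ArrayList<>();

    synchronized void addSegment(String name) {
        pending.add(name);
    }

    /** Ordinary commit: the new commit point references old and new segments. */
    synchronized void commit() {
        List<String> next = new ArrayList<>(committed);
        next.addAll(pending);
        committed = Collections.unmodifiableList(next);
        pending.clear();
    }

    /** Hypothetical commitNewOnly: the new commit point references only uncommitted segments. */
    synchronized void commitNewOnly() {
        committed = Collections.unmodifiableList(new ArrayList<>(pending));
        pending.clear();
    }

    /** A reader is a point-in-time snapshot of the commit point current when it was opened. */
    synchronized List<String> openReader() {
        return committed;
    }
}
```

In this model a reader opened before commitNewOnly keeps seeing the old segments, a reader opened during re-indexing also sees only the old segments, and a reader opened afterwards sees only the new ones.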

That will move the deleteAll() method to the application side, since it knows when it can
safely delete all the current segments. If you don't have such a requirement (keeping an index
for searches until re-indexing is complete), then I think you can safely close() the index
and re-create it?

> Add deleteAllDocuments() method to IndexWriter
> ----------------------------------------------
>
>                 Key: LUCENE-1705
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1705
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Tim Smith
>
> Ideally, there would be a deleteAllDocuments() or clear() method on the IndexWriter
> This method should have the same performance and characteristics as:
> * currentWriter.close()
> * currentWriter = new IndexWriter(..., create=true,...)
> This would greatly optimize a delete all documents case. Using deleteDocuments(new MatchAllDocsQuery())
> could be expensive given a large existing index.
> IndexWriter.deleteAllDocuments() should have the same semantics as a commit(), as far
> as index visibility goes (new IndexReader opening would get the empty index)
> I see this was previously asked for in LUCENE-932, however it would be nice to finally
> see this added such that the IndexWriter would not need to be closed to perform the "clear"
> as this seems to be the general recommendation for working with an IndexWriter now
> deleteAllDocuments() method should:
> * abort any background merges (they are pointless once a deleteAll has been received)
> * write new segments file referencing no segments
> This method would remove one of the final reasons i would ever need to close an IndexWriter
> and reopen a new one

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

