lucene-dev mailing list archives

From Christoph Goller <gol...@detego-software.de>
Subject Re: Deleting a document with an IndexWriter open
Date Fri, 16 Jul 2004 11:50:51 GMT
Giulio Cesare Solaroli wrote:
> Dear developers,
> 
> is there any architectural reason why an IndexWriter could not
> delete a document?

There are such reasons. Maybe Doug can give additional
insight. Here is what I think:

One reason I see is that there is no such thing as a unique
document id in Lucene. An index is accessed through an IndexReader, and
search is also done through a reader. The document ids used by one
IndexReader/IndexSearcher instance are unique/valid only with regard to
that instance, and a reader/searcher has no way of changing document
ids. However, calling optimize on an index with deletions does change
document ids: some documents will have different ids after the optimize
than before. This has no effect on an existing reader instance, only on
IndexReader instances opened after the optimize.
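
To make the renumbering concrete, here is a toy model in plain Java. It
is not Lucene code: ToyIndex and its methods are made up for
illustration, and it simply assumes that a document id is a position in
a list and that optimize compacts away the deleted positions. It does
not model the fact that an already open reader keeps its old view.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only, NOT Lucene code: a document id is a position in a list.
class ToyIndex {
    private final List<String> docs = new ArrayList<>();
    private final List<Boolean> deleted = new ArrayList<>();

    // Adds a document and returns the id (position) it was given.
    int add(String doc) {
        docs.add(doc);
        deleted.add(false);
        return docs.size() - 1;
    }

    // Marks a document as deleted; its slot stays occupied until optimize().
    void delete(int id) {
        deleted.set(id, true);
    }

    // Returns the current id of a live document, or -1 if it is not found.
    int idOf(String doc) {
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i).equals(doc)) {
                return i;
            }
        }
        return -1;
    }

    // Drops deleted slots, so every later document moves to a lower id.
    void optimize() {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                kept.add(docs.get(i));
            }
        }
        docs.clear();
        docs.addAll(kept);
        deleted.clear();
        for (int i = 0; i < kept.size(); i++) {
            deleted.add(false);
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("a");
        index.add("b");
        index.add("c");
        index.delete(0);
        System.out.println(index.idOf("c")); // prints 2: deletion alone keeps ids
        index.optimize();
        System.out.println(index.idOf("c")); // prints 1: optimize renumbered it
    }
}
```

An id remembered before the optimize (2 for "c" here) would point at the
wrong slot afterwards, which is why such ids cannot be handed around as
stable keys.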

Of course an application can take care of unique document ids and store
them in a dedicated field. The ids could e.g. be URLs. Whether this
unique id or some other term is used to specify the document(s) for
deletion, index access as provided by a reader is needed to do the
deletion. IndexWriter currently does not have these capabilities.

So the only solution to the update problem is to build a wrapper around
Lucene that handles reading, writing, and updating. And this is what you
are actually doing :-)

> I understand that the IndexReader (besides its strange naming for this
> feature) is the right class to use to delete a document, but this
> raises a huge problem for me.
> 
> We add almost 50.000 documents a day, while deleting a similar amount
> of old documents over the same period.
> We index new documents in batch every 5 minutes while deleting the old
> ones and optimize the index twice a day, in order to keep good
> performance for the queries and the number of index files under
> control.
> 
> In this situation, I try to keep the same IndexWriter open as much as
> possible, in order to avoid any unnecessary fragmentation of the
> index.
> Before indexing any document, I can check to see if the document has
> already been inserted, but I am not able to delete it without closing
> the IndexWriter, opening an IndexReader, deleting the document,
> closing the IndexReader and opening the IndexWriter again.
> 
> This arrangement seems reasonable if updated documents are scarce, but
> doesn't seem feasible to work with a high rate of updated documents.
> 
> I would prefer to avoid deleting all updated documents from the index
> before opening the IndexWriter because the updating and indexing
> procedure would get much more complex, and because I will introduce a
> significant time gap where a previously available document is no longer
> available in the index.

If you want to do several updates at the same time, the most efficient
way would be to:

1) Keep an IndexReader/Searcher open on your index in order to guarantee
read access and a consistent index during the whole process.

2) Open a new IndexReader and delete all the documents that you want to
update.

3) Close the IndexReader (makes the deletions visible to any new
readers/writers but not to the Searcher/Reader that is still open).

4) Open an IndexWriter and add all modified documents.

5) Close the IndexWriter (makes the insertions visible to any new
readers/writers but not to the Searcher/Reader that is still open).

6) Substitute the IndexReader/Searcher with a new one to make
the changes visible.
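
The visibility rules in steps 1) to 6) can be sketched with a toy model
in plain Java. Again this is not Lucene code: ToyStore, openReader,
commitDeletes and commitAdds are hypothetical names, and the model
assumes only that a reader is a point-in-time view and that closing a
deleting reader or a writer publishes a new state.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy model only, NOT Lucene code: maps an application-level id to content.
class ToyStore {
    private Map<String, String> committed = Collections.emptyMap();

    // "Opening a reader": a point-in-time view that later commits never change.
    Map<String, String> openReader() {
        return committed;
    }

    // Steps 2-3: delete a batch, then publish the result as a new state.
    void commitDeletes(Iterable<String> ids) {
        Map<String, String> next = new HashMap<>(committed);
        for (String id : ids) {
            next.remove(id);
        }
        committed = Collections.unmodifiableMap(next);
    }

    // Steps 4-5: add a batch, then publish the result as a new state.
    void commitAdds(Map<String, String> docs) {
        Map<String, String> next = new HashMap<>(committed);
        next.putAll(docs);
        committed = Collections.unmodifiableMap(next);
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.commitAdds(Collections.singletonMap("url1", "old text"));

        Map<String, String> searcher = store.openReader();   // step 1
        Map<String, String> updates =
                Collections.singletonMap("url1", "new text");
        store.commitDeletes(updates.keySet());                // steps 2-3
        store.commitAdds(updates);                            // steps 4-5

        System.out.println(searcher.get("url1")); // prints "old text"
        searcher = store.openReader();                        // step 6
        System.out.println(searcher.get("url1")); // prints "new text"
    }
}
```

In this model a view opened between the two commits would not contain
"url1" at all. The view kept open from step 1 is what keeps the old
version searchable until the swap in step 6.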

> Do you confirm my idea that keeping an IndexWriter open as much as
> possible while indexing batches of documents is a "good thing"?

Yes. IndexWriter works with a RAMDirectory as cache. If you close
it after each document and open a new one, you force unnecessary
write operations to your hard disk.
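
A rough way to see the cost is to count flushes in a toy model. This is
plain Java and not Lucene code: ToyWriter is a made-up class, and the
buffer size of 10 documents is an assumed value chosen only for the
example. The one assumption carried over from above is that a writer
buffers documents in memory and must write out whatever it holds when
it is closed.

```java
// Toy model only, NOT Lucene code: counts simulated disk writes.
class ToyWriter {
    private final int bufferSize;   // documents held in memory before a flush
    private int buffered = 0;
    private int diskWrites = 0;

    ToyWriter(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    // Buffers one document and flushes once the buffer is full.
    void add(String doc) {
        buffered++;
        if (buffered >= bufferSize) {
            flush();
        }
    }

    // Closing flushes whatever is left and reports this writer's disk writes.
    int close() {
        flush();
        return diskWrites;
    }

    private void flush() {
        if (buffered > 0) {
            diskWrites++;
            buffered = 0;
        }
    }

    // One writer kept open for the whole batch.
    static int writesWithOneWriter(int docs, int bufferSize) {
        ToyWriter writer = new ToyWriter(bufferSize);
        for (int i = 0; i < docs; i++) {
            writer.add("doc" + i);
        }
        return writer.close();
    }

    // A new writer opened and closed around every single document.
    static int writesWithWriterPerDocument(int docs, int bufferSize) {
        int total = 0;
        for (int i = 0; i < docs; i++) {
            ToyWriter writer = new ToyWriter(bufferSize);
            writer.add("doc" + i);
            total += writer.close();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(writesWithOneWriter(100, 10));         // prints 10
        System.out.println(writesWithWriterPerDocument(100, 10)); // prints 100
    }
}
```

The absolute numbers mean nothing for a real index; the point is only
that closing after every document turns each add into its own write.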

> Is there any chance of ever seeing a deleteDocument method in the
> IndexWriter class?

Probably not. I guess you either have to update every document separately
as described in your email (open and close a reader and writer for each
document), or do it in the way I describe above (more efficient).

Christoph