lucene-java-user mailing list archives

From John Wang <john.w...@gmail.com>
Subject Re: IndexWriter.deleteDocuments(Query query)
Date Wed, 01 Apr 2009 18:04:56 GMT
Thanks Michael for the info.
I do guarantee there are no modifications between when
"MySpecialIndexReader" is loaded and when I iterate and find the deleted
docids. I was, however, not aware that docids can move once an IndexWriter
is opened. I thought they only moved when docs are added and committed.

With this information, I agree I cannot reuse the uid mapping array from
"MySpecialIndexReader". I would have to load this mapping using the
IndexReader given to me by the IndexWriter.
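
To make the staleness concrete, here is a toy model in plain Java (not Lucene
code). It assumes a merge drops deleted docs and packs the survivors in order,
so every docid after a deleted doc shifts down. The class and method names are
hypothetical and only for illustration.

```java
final class MergeRenumberDemo {
    /**
     * Toy merge: returns a new docid -> uid array that omits deleted docs
     * and packs the remaining docs in their original order.
     */
    static int[] merge(int[] uidByDocId, boolean[] deleted) {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) live++;
        }
        int[] merged = new int[live];
        int next = 0;
        for (int docId = 0; docId < uidByDocId.length; docId++) {
            if (!deleted[docId]) {
                merged[next++] = uidByDocId[docId];
            }
        }
        return merged;
    }

    /** Linear search for the docid that currently holds the given uid, or -1. */
    static int docIdOf(int[] uidByDocId, int uid) {
        for (int docId = 0; docId < uidByDocId.length; docId++) {
            if (uidByDocId[docId] == uid) return docId;
        }
        return -1;
    }
}
```

In this model, a uid that sat at docid 3 before the merge sits at docid 2 once
an earlier doc has been merged away, so a mapping array loaded before the merge
would point at the wrong document.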

My test is essentially this. I took out the reader.deleteDocuments call from
both scenarios. I used an index of 5m docs and a batch of 10000 randomly
generated uids.

Compared the following scenarios:
1)
* open index reader
* for each uid in the batch, find the corresponding docid and add to an
IntList.
* close reader

2)
* open index reader
* load uid array from payload field
* iterate uid array, and check to see if uid is in deleted set, and add to
an IntList

The data structure holding the deleted set is IntOpenHashSet from fastutil.
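
Scenario 2 boils down to one linear pass over the docid -> uid array. A minimal
sketch, assuming int uids and using java.util.Set<Integer> as a stand-in for
fastutil's IntOpenHashSet (which avoids the boxing shown here). The class and
method names are hypothetical, and the array stands for the mapping loaded from
the payload field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class UidScan {
    /**
     * uidByDocId[docId] is the uid stored for that document.
     * Returns the docids whose uid is in deletedUids, in increasing docid order.
     */
    static List<Integer> findDocIds(int[] uidByDocId, Set<Integer> deletedUids) {
        List<Integer> docIds = new ArrayList<>();
        for (int docId = 0; docId < uidByDocId.length; docId++) {
            // One hash lookup per document, no per-uid term lookup.
            if (deletedUids.contains(uidByDocId[docId])) {
                docIds.add(docId);
            }
        }
        return docIds;
    }
}
```

The cost is proportional to the number of docs in the index rather than to the
batch size, which is the trade-off against scenario 1's per-uid lookups.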

1) took about 3500 - 4500 ms
2) took about 815 ms

-John

On Wed, Apr 1, 2009 at 10:17 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> > For me at least, IndexWriter.deleteDocument(int) would be useful.
>
> I completely agree: delete-by-docID in IndexWriter would be a great
> feature.  Long ago I became convinced of that.
>
> Where this feature always gets stuck (search the lists -- it's gotten
> stuck a lot) is how to implement it.  At any time, a merge can commit,
> which invalidates all docIDs stored anywhere.  We need to solve that
> before we can delete by docID.
>
> I don't see a clean solution.  Do you?
>
> > I have a customized index reader that stores a mapping of docid -> uid in
> > the payload (something Michael Bush and Ning Li suggested a while back). And
> > that mapping is loaded at IndexReader load time and is shared by searchers.
>
> OK
>
> > I do realtime update, so I get a batch of updates with a uid associated with
> > each batch. So I do a delete on the uid and add the document. And I
> > implemented this using IndexWriter.deleteDocuments(Term[])
>
> OK
>
> > I realized I have an IndexReader around already with a docId->uid mapping, so I
> > can just find out the docid from that list and simply call
> > IndexReader.deleteDocument(int). So out of curiosity, I compared the times
> > doing deletes with these two mechanisms with 1 batch of 10000 deletes. And
> > on my macbook pro, I see a difference/overhead of 3-4 seconds (with various
> > runs and how much term table is cached etc.) And that is something I would
> > expect because we are essentially doing a "query" per element in the batch,
> > albeit posting list length is only 1, but still...
>
> But, your mapping is stale (you can't trust the docIDs) as soon as you
> open an IndexWriter on the same index, so this isn't really a valid
> test.
>
> 3-4 seconds out of how much total time?
>
> Can you give more details on this test?  Are you including time to
> open IndexReader, time to load your docID/uid mapping, and time to
> commit the changes (to be apples/apples)?
>
> > Now to me that is significant enough to move away from
> > IndexWriter.deleteDocuments().
> >
>
> > However, to actually implement the delete with IndexWriter on
> > docids, I have to create a customized Query object that iterates my
> > int[] of docids.
>
> That won't work (the docIDs might be invalid by the time your Query is
> visited).  The point of delete-by-Query is IW hands you a reader,
> which you must use right then to find the docIDs; only the docIDs
> from that reader are valid.
>
> > Having IndexWriter.deleteDocuments take a Filter than DocIdSet makes
> > sense.
>
> Well... when IW deletes-by-Query, it's already using the Query as a
> Filter (ie, not doing any scoring).  Changing the API to
> delete-by-Filter won't change the performance.
>
> Mike
>
