lucene-java-user mailing list archives

From Sean Bridges <sean.brid...@gmail.com>
Subject Re: delete by docid in lucene 4
Date Thu, 12 Jul 2012 16:55:56 GMT
Thanks for the tip.

Does using updateDocument instead of addDocument affect
indexing/search performance?

Sean

On Thu, Jul 12, 2012 at 9:27 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> The trick is to index not with addDocument(Document) but instead with
> updateDocument(Term, Document). Lucene then adds the document atomically
> while deleting any previous documents with the given term (which is your
> unique ID). If the key does not exist it simply indexes without deleting
> anything.
> This way you always have only one document per Term (== your unique
> ID).
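
A minimal sketch of the add-versus-update behaviour described above. This is
a toy in-memory model in plain Java, not Lucene code: ToyIndex and its methods
are hypothetical names, and a string id stands in for the unique-ID Term.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only (assumption): mimics the semantics from the thread,
// it does not call or reproduce any Lucene class.
class ToyIndex {
    // Each entry is {id, body}; list order stands in for index order.
    private final List<String[]> docs = new ArrayList<>();

    // Like addDocument(Document): appends blindly, so resent documents
    // with the same id pile up as duplicates.
    synchronized void addDocument(String id, String body) {
        docs.add(new String[] {id, body});
    }

    // Like updateDocument(Term, Document): removes every entry with the
    // same id, then adds the new one. If the id is unknown, nothing is
    // removed and the document is simply added. The synchronized keyword
    // stands in for the atomicity mentioned in the thread.
    synchronized void updateDocument(String id, String body) {
        docs.removeIf(d -> d[0].equals(id));
        docs.add(new String[] {id, body});
    }

    // Number of entries currently stored under the given id.
    synchronized int countById(String id) {
        int n = 0;
        for (String[] d : docs) {
            if (d[0].equals(id)) {
                n++;
            }
        }
        return n;
    }

    // Total number of entries, duplicates included.
    synchronized int size() {
        return docs.size();
    }
}
```

In this model, sending the same id twice through addDocument leaves two
entries, while a later updateDocument with that id leaves exactly one.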
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Sean Bridges [mailto:sean.bridges@gmail.com]
>> Sent: Thursday, July 12, 2012 5:42 PM
>> To: java-user@lucene.apache.org; simon.willnauer@gmail.com
>> Subject: Re: delete by docid in lucene 4
>>
>> We have indexer machines which are fed documents by other machines.
>> If an error occurs (machine crashing etc.), the same document may be sent
>> to an indexer multiple times.  Serial ids are assigned before documents
>> reach the indexer, so a document may be in the index multiple times, each
>> time with the same serial id.
>>
>> When the index gets large enough, the indexer will stop writing to the
>> index, and upload it to another machine, which keeps the index forever.
>> Before we upload the index, we forceMerge(1) on it, and gather some stats
>> about the index like max, min serial id, total documents.  While
>> calculating max and min serial id, if we see a duplicate serial id, we
>> call IndexReader.deleteByDocId(...) .
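
The stats pass described above can be sketched in plain Java over an array of
serial ids instead of a real index. SerialIdScan is a hypothetical helper, not
part of Lucene; the array position stands in for the document's place in the
merged index, and the first occurrence of each serial id is the one kept.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (assumption): models the pre-upload scan only.
class SerialIdScan {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    int total = 0;
    // Positions of documents whose serial id was already seen earlier.
    final List<Integer> duplicatePositions = new ArrayList<>();

    // serialIds[i] is the serial id of the document at position i.
    static SerialIdScan scan(long[] serialIds) {
        SerialIdScan result = new SerialIdScan();
        Set<Long> seen = new HashSet<>();
        for (int pos = 0; pos < serialIds.length; pos++) {
            long id = serialIds[pos];
            result.total++;
            result.min = Math.min(result.min, id);
            result.max = Math.max(result.max, id);
            // Set.add returns false when the id is already present,
            // which marks this position as a duplicate to drop.
            if (!seen.add(id)) {
                result.duplicatePositions.add(pos);
            }
        }
        return result;
    }
}
```

The sketch only identifies which positions are duplicates; how they are then
removed depends on the Lucene version, which is the question of this thread.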
>>
>> We could check for duplicate serial ids while indexing, but that is racy,
>> and not as efficient.
>>
>> Thanks,
>>
>> Sean
>>
>>
>> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
>> <simon.willnauer@gmail.com> wrote:
>> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.bridges@gmail.com> wrote:
>> >> Is it possible to delete by docId in lucene 4?  I can delete by docid
>> >> in lucene 3 using IndexReader.deleteDocument(int docId), but that
>> >> method is gone in lucene 4, and IndexWriter only allows deleting by
>> >> Term or Query.
>> >
>> > that is correct. In lucene 4 IndexReader is really just a reader!
>> >>
>> >> This is our use case -  In our system, each document is identified by
>> >> a unique serial id.  If an error occurs, we may index the same
>> >> message multiple times.  When an index grows large enough, we stop
>> >> adding to it, and optimize the index.  During optimization, if we see
>> >> multiple docs with the same serialid, we delete all but the first, as
>> >> all documents with the same serialid are the same.
>> >
>> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
>> > method? Do you rely on multiple versions of the same doc? With Lucene
>> > 4 relying on the doc id can become very tricky. If you use multiple
>> > threads you create a lot of segments which can be merged in any order.
>> > You can't tell if a document ID maintains happened-before semantics at
>> > all.
>> >
>> > Can you tell us more about your use case and why you are using
>> > deleteByDocID?
>> >
>> > simon
>> >
>> >
>> >>
>> >> Thanks,
>> >>
>> >> Sean
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: java-user-help@lucene.apache.org

