Date: Thu, 12 Jul 2012 20:53:30 +0200
Message-ID: 
 <CAAHmpkjvnLqq6uMSOD5ajqu0S2vaFuMDF2U1KaqD-NMPd3zx+Q@mail.gmail.com>
Subject: Re: delete by docid in lucene 4
From: Simon Willnauer <simon.willnauer@gmail.com>
To: java-user@lucene.apache.org
Content-Type: text/plain; charset=UTF-8

On Thu, Jul 12, 2012 at 6:55 PM, Sean Bridges <sean.bridges@gmail.com> wrote:
> Thanks for the tip.
>
> Does using updateDocument instead of addDocument affect
> indexing/search performance?

It does affect indexing performance compared to addDocument, but the
cost may well be minor compared to your analysis chain. I wouldn't worry
about updateDocument; it's really the only sensible way to use Lucene.
Is there a reason you didn't use it before? What is your ingest rate /
doc throughput, and at what point would you get concerned?
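
To make the difference concrete, here is a minimal sketch in plain Java.
It is a toy model and not Lucene code: ToyIndex, Doc, numDocs and
countWithId are made-up names, and the only thing it mimics is the
"delete everything with this key, then add" behaviour Uwe described for
updateDocument(Term, Document). It says nothing about the real cost.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an index: a plain list of documents. Hypothetical, not Lucene. */
class ToyIndex {
    /** One stored document: a serial id (the unique key) plus a payload. */
    static final class Doc {
        final String serialId;
        final String body;

        Doc(String serialId, String body) {
            this.serialId = serialId;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Models addDocument: always appends, so a resent document becomes a duplicate. */
    synchronized void addDocument(Doc doc) {
        docs.add(doc);
    }

    /**
     * Models updateDocument(Term, Document): remove every document with the
     * given key, then add the new one. The method is synchronized only to
     * mimic that delete and add appear as a single step.
     */
    synchronized void updateDocument(String serialId, Doc doc) {
        docs.removeIf(d -> d.serialId.equals(serialId));
        docs.add(doc);
    }

    synchronized int numDocs() {
        return docs.size();
    }

    synchronized long countWithId(String serialId) {
        return docs.stream().filter(d -> d.serialId.equals(serialId)).count();
    }

    public static void main(String[] args) {
        ToyIndex viaAdd = new ToyIndex();
        ToyIndex viaUpdate = new ToyIndex();

        // Serial id 7 is delivered twice, as it might be after a feeder crash.
        String[] delivered = {"7", "8", "7"};
        for (String id : delivered) {
            Doc doc = new Doc(id, "message " + id);
            viaAdd.addDocument(doc);
            viaUpdate.updateDocument(id, doc);
        }

        System.out.println("addDocument:    " + viaAdd.numDocs() + " docs");
        System.out.println("updateDocument: " + viaUpdate.numDocs() + " docs");
    }
}
```

With the three deliveries above, the add-only index ends up holding three
documents and the update-based one holds two, so there are no duplicates
left to clean up after forceMerge.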

simon
>
> Sean
>
> On Thu, Jul 12, 2012 at 9:27 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>> The trick is to index not with addDocument(Document) but with
>> updateDocument(Term, Document). Lucene then adds the document atomically
>> while deleting any previous documents with the given term (which is your
>> unique ID). If the key does not exist, it simply indexes without deleting
>> anything.
>> This way you always have only one document with the same Term (== your
>> unique ID).
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>>
>>> -----Original Message-----
>>> From: Sean Bridges [mailto:sean.bridges@gmail.com]
>>> Sent: Thursday, July 12, 2012 5:42 PM
>>> To: java-user@lucene.apache.org; simon.willnauer@gmail.com
>>> Subject: Re: delete by docid in lucene 4
>>>
>>> We have indexer machines which are fed documents by other machines.
>>> If an error occurs (machine crashing etc.) the same document may be sent
>>> to an indexer multiple times.  Serial ids are assigned before documents
>>> reach the indexer, so a document may be in the index multiple times, each
>>> time with the same serial id.
>>>
>>> When the index gets large enough, the indexer will stop writing to the
>>> index, and upload it to another machine, which keeps the index forever.
>>> Before we upload the index, we forceMerge(1) on it, and gather some stats
>>> about the index like max,min serial id, total documents.  While
>>> calculating max and min serial id, if we see a duplicate serial id, we
>>> call IndexReader.deleteByDocId(...) .
>>>
>>> We could check for duplicate serial ids while indexing, but that is racy,
>>> and not as efficient.
>>>
>>> Thanks,
>>>
>>> Sean
>>>
>>>
>>> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
>>> <simon.willnauer@gmail.com> wrote:
>>> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.bridges@gmail.com> wrote:
>>> >> Is it possible to delete by docId in lucene 4?  I can delete by docid
>>> >> in lucene 3 using IndexReader.deleteDocument(int docId), but that
>>> >> method is gone in lucene 4, and IndexWriter only allows deleting by
>>> >> Term or Query.
>>> >
>>> > that is correct. In lucene 4 IndexReader is really just a reader!
>>> >>
>>> >> This is our use case -  In our system, each document is identified by
>>> >> a unique serial id.  If an error occurs, we may index the same
>>> >> message multiple times.  When an index grows large enough, we stop
>>> >> adding to it, and optimize the index.  During optimization, if we see
>>> >> multiple docs with the same serialid, we delete all but the first, as
>>> >> all documents with the same serialid are the same.
>>> >
>>> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
>>> > method? Do you rely on multiple versions of the same doc? With Lucene
>>> > 4, relying on the doc id can become very tricky. If you use multiple
>>> > threads, you create a lot of segments, which can be merged in any order.
>>> > You can't tell if a document ID maintains happened-before semantics at
>>> > all.
>>> >
>>> > Can you tell us more about your use case and why you are using
>>> > deleteByDocID?
>>> >
>>> > simon
>>> >
>>> >
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Sean
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>>> >>
