lucene-java-user mailing list archives

From Sean Bridges <sean.brid...@gmail.com>
Subject Re: delete by docid in lucene 4
Date Thu, 12 Jul 2012 19:50:33 GMT
I never used updateDocument() due to ignorance.

We are indexing several hundred documents per second, and most of the
analysis takes place on the non-indexer machines to reduce load on
the indexers.  For our use case, deleteDocument(int docId) will be
faster as there are very few duplicates, but I don't know if the
difference is significant.

It would be nice to have a deleteDocument(int docId) in IndexWriter.
It seems like it would be easy to add as DocumentsWriter already has a
deletedDocID.  I can file a JIRA issue and submit a patch if this is
something that you guys would accept.

Sean
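
The two approaches discussed in this thread can be compared with a small sketch. This is a toy in-memory model in plain Java, not the Lucene API: the class DedupDemo, its Doc type, and the methods add, update, and removeDuplicates are all hypothetical names made up for illustration. add stands in for addDocument, update stands in for the delete-then-add behaviour of updateDocument(Term, Document) described below, and removeDuplicates stands in for the post-hoc pass that keeps the first document per serial id.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Toy in-memory stand-in for an index. This is NOT the Lucene API. */
public class DedupDemo {
    /** Hypothetical document: a unique serial id plus a body. */
    static final class Doc {
        final long serialId;
        final String body;

        Doc(long serialId, String body) {
            this.serialId = serialId;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Models addDocument: appends blindly, so a resent document is stored twice. */
    void add(Doc doc) {
        docs.add(doc);
    }

    /** Models update-by-key: drop every doc with this serial id, then add the new one. */
    void update(long serialId, Doc doc) {
        docs.removeIf(d -> d.serialId == serialId);
        docs.add(doc);
    }

    /** Models the post-hoc pass: keep the first doc per serial id, remove the rest. */
    int removeDuplicates() {
        Set<Long> seen = new HashSet<>();
        int removed = 0;
        Iterator<Doc> it = docs.iterator();
        while (it.hasNext()) {
            // Set.add returns false when the serial id was already seen.
            if (!seen.add(it.next().serialId)) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        // Approach 1: append everything, clean up duplicates afterwards.
        DedupDemo appendOnly = new DedupDemo();
        appendOnly.add(new Doc(7, "hello"));
        appendOnly.add(new Doc(8, "world"));
        appendOnly.add(new Doc(7, "hello")); // resent after a simulated crash
        System.out.println("docs after add: " + appendOnly.size());
        System.out.println("duplicates removed: " + appendOnly.removeDuplicates());

        // Approach 2: update by key, so duplicates never accumulate.
        DedupDemo keyed = new DedupDemo();
        keyed.update(7, new Doc(7, "hello"));
        keyed.update(8, new Doc(8, "world"));
        keyed.update(7, new Doc(7, "hello")); // resent after a simulated crash
        System.out.println("docs after update: " + keyed.size());
    }
}
```

In this model both approaches end with one document per serial id. The difference is when the work happens: update pays a lookup on every write, while removeDuplicates pays once at the end. That is the trade-off being weighed above; the sketch says nothing about the actual cost of either in Lucene.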

On Thu, Jul 12, 2012 at 11:53 AM, Simon Willnauer
<simon.willnauer@gmail.com> wrote:
> On Thu, Jul 12, 2012 at 6:55 PM, Sean Bridges <sean.bridges@gmail.com> wrote:
>> Thanks for the tip.
>>
>> Does using updateDocument instead of addDocument affect
>> indexing/search performance?
>
> It does affect indexing performance compared to addDocument, but that
> might be minor compared to your analysis chain. I wouldn't worry about
> updateDocument; it's really the only sensible way to use Lucene. Why
> didn't you use this before, any reason? What is your ingest rate / doc
> throughput, and where would you get concerned?
>
> simon
>>
>> Sean
>>
>> On Thu, Jul 12, 2012 at 9:27 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>> The trick is to index not with addDocument(Document) but instead with
>>> updateDocument(Term, Document). Lucene then adds the document atomically
>>> while deleting any previous documents with the given term (which is your
>>> unique ID). If the key does not exist, it simply indexes without deleting
>>> anything.
>>> This way you always have only one document with the same Term (== your
>>> unique ID).
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: uwe@thetaphi.de
>>>
>>>
>>>> -----Original Message-----
>>>> From: Sean Bridges [mailto:sean.bridges@gmail.com]
>>>> Sent: Thursday, July 12, 2012 5:42 PM
>>>> To: java-user@lucene.apache.org; simon.willnauer@gmail.com
>>>> Subject: Re: delete by docid in lucene 4
>>>>
>>>> We have indexer machines which are fed documents by other machines.
>>>> If an error occurs (machine crashing etc.) the same document may be sent
>>>> to an indexer multiple times.  Serial ids are assigned before documents
>>>> reach the indexer, so a document may be in the index multiple times,
>>>> each time with the same serial id.
>>>>
>>>> When the index gets large enough, the indexer will stop writing to the
>>>> index and upload it to another machine, which keeps the index forever.
>>>> Before we upload the index, we forceMerge(1) on it and gather some stats
>>>> about the index, like max and min serial id and total documents.  While
>>>> calculating max and min serial id, if we see a duplicate serial id, we
>>>> call IndexReader.deleteByDocId(...).
>>>>
>>>> We could check for duplicate serial ids while indexing, but that is racy
>>>> and not as efficient.
>>>>
>>>> Thanks,
>>>>
>>>> Sean
>>>>
>>>>
>>>> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
>>>> <simon.willnauer@gmail.com> wrote:
>>>> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.bridges@gmail.com> wrote:
>>>> >> Is it possible to delete by docId in lucene 4?  I can delete by docid
>>>> >> in lucene 3 using IndexReader.deleteDocument(int docId), but that
>>>> >> method is gone in lucene 4, and IndexWriter only allows deleting by
>>>> >> Term or Query.
>>>> >
>>>> > that is correct. In lucene 4 IndexReader is really just a reader!
>>>> >>
>>>> >> This is our use case: in our system, each document is identified by
>>>> >> a unique serial id.  If an error occurs, we may index the same
>>>> >> message multiple times.  When an index grows large enough, we stop
>>>> >> adding to it, and optimize the index.  During optimization, if we see
>>>> >> multiple docs with the same serialid, we delete all but the first, as
>>>> >> all documents with the same serialid are the same.
>>>> >
>>>> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
>>>> > method? Do you rely on multiple versions of the same doc? With Lucene
>>>> > 4 relying on the doc id can become very tricky. If you use multiple
>>>> > threads you create a lot of segments which can be merged in any order.
>>>> > You can't tell if a document ID maintains happened-before semantics at
>>>> > all.
>>>> >
>>>> > Can you tell us more about your use case and why you are using
>>>> > deleteByDocID?
>>>> >
>>>> > simon
>>>> >
>>>> >
>>>> >>
>>>> >> Thanks,
>>>> >>
>>>> >> Sean
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>> >>