lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: delete by docid in lucene 4
Date Thu, 12 Jul 2012 20:08:29 GMT
Hi Sean,

Without measuring the performance in your case, it makes no sense to discuss
this. Lucene 4.0 changed a lot, and there are several improvements. Please
read the following:

- Because of the new term dictionary, Term lookups on non-existing terms are
fail-fast; they don't do any disk IO in most cases. You can do tens of
thousands of those per second on a simple laptop.
- DocumentsWriter uses internal Lucene DocIDs, but those are not global and
therefore not useful for you. They are only valid for one index segment and
only temporarily, until IndexWriter merges segments again (possibly in
another thread).
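
To illustrate the second point, here is a minimal sketch in plain Java. It is a
toy model, not Lucene code: the class DocIdShiftSketch and its methods are
made up for this example, and real merging is considerably more involved. It
only shows why a position-based doc ID cannot serve as a stable identifier
once deleted documents are dropped during a merge.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of segment-local doc IDs (hypothetical, NOT Lucene's implementation). */
final class DocIdShiftSketch {
    /** Each inner list is one segment; a null entry marks a deleted document. */
    private final List<List<String>> segments = new ArrayList<>();

    /** Adds a new segment holding the given application keys, in order. */
    void addSegment(String... keys) {
        List<String> seg = new ArrayList<>();
        for (String k : keys) {
            seg.add(k);
        }
        segments.add(seg);
    }

    /** Marks the first live document with this key as deleted. */
    void delete(String key) {
        for (List<String> seg : segments) {
            int i = seg.indexOf(key);
            if (i >= 0) {
                seg.set(i, null);
                return;
            }
        }
    }

    /** Position of the key counted across all segments, or -1 if absent. */
    int docIdOf(String key) {
        int base = 0;
        for (List<String> seg : segments) {
            int i = seg.indexOf(key);
            if (i >= 0) {
                return base + i;
            }
            base += seg.size();
        }
        return -1;
    }

    /** Merges all segments into one and drops deleted slots, renumbering the rest. */
    void mergeAll() {
        List<String> merged = new ArrayList<>();
        for (List<String> seg : segments) {
            for (String k : seg) {
                if (k != null) {
                    merged.add(k);
                }
            }
        }
        segments.clear();
        segments.add(merged);
    }
}
```

In this model, a document's position stays the same after a delete but changes
after the merge, while its application key never changes.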

So: Always use updateDocument when you put your new documents into the index
and give every document the unique ID from your pool. Lucene's document IDs
are purely internal and, especially in 4.0's IndexWriter, no longer constant
(they can easily change after reopening an index, depending on merge policy,
or after getting a new realtime reader). To uniquely identify documents later
you *have* to use your own key field.
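
The difference between adding blindly and updating by key can be sketched in
plain Java. This is a toy in-memory model under stated assumptions, not Lucene
code: KeyedIndexSketch, add, update, count and size are names invented for the
illustration. The update method mimics the delete-then-add behaviour described
for updateDocument(Term, Document).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of update-by-key (hypothetical, NOT Lucene code). */
final class KeyedIndexSketch {
    /** A stored document: the application's unique key plus a body. */
    static final class Doc {
        final String key;
        final String body;

        Doc(String key, String body) {
            this.key = key;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Appends without checking the key, so a resent document becomes a duplicate. */
    void add(String key, String body) {
        docs.add(new Doc(key, body));
    }

    /** Removes every document with this key, then adds the new one. */
    void update(String key, String body) {
        docs.removeIf(d -> d.key.equals(key));
        docs.add(new Doc(key, body));
    }

    /** Number of stored documents carrying this key. */
    int count(String key) {
        int n = 0;
        for (Doc d : docs) {
            if (d.key.equals(key)) {
                n++;
            }
        }
        return n;
    }

    /** Total number of stored documents. */
    int size() {
        return docs.size();
    }
}
```

With this model, sending the same serial id twice through add leaves two
copies, while sending it twice through update leaves exactly one, so no
clean-up pass by doc ID is needed afterwards.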

Lucene 4.0 is different from previous versions; deleting by internal Lucene
docId will not come back.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Sean Bridges [mailto:sean.bridges@gmail.com]
> Sent: Thursday, July 12, 2012 9:51 PM
> To: java-user@lucene.apache.org; simon.willnauer@gmail.com
> Subject: Re: delete by docid in lucene 4
> 
> I never used updateDocument() due to ignorance.
> 
> We are indexing several hundred documents per second, and most of the
> analysis takes place on the non-indexer machines to reduce load on the
> indexers.  For our use case, deleteDocument(int docId) will be faster as
> there are very few duplicates, but I don't know if the difference is
> significant.
> 
> It would be nice to have a deleteDocument(int docId) in IndexWriter.
> It seems like it would be easy to add as DocumentsWriter already has a
> deletedDocID.  I can file a jira and submit a patch if this is something
> that you guys would accept.
> 
> Sean
> 
> On Thu, Jul 12, 2012 at 11:53 AM, Simon Willnauer
> <simon.willnauer@gmail.com> wrote:
> > On Thu, Jul 12, 2012 at 6:55 PM, Sean Bridges <sean.bridges@gmail.com> wrote:
> >> Thanks for the tip.
> >>
> >> Does using updateDocument instead of addDocument affect
> >> indexing/search performance?
> >
> > It does affect indexing performance compared to addDocument, but that
> > might be minor compared to your analysis chain. I wouldn't worry about
> > updateDocument; it's the only sensible way to use lucene really. Why
> > didn't you use this before, any reason? What is your ingest rate / doc
> > throughput and where would you get concerned?
> >
> > simon
> >>
> >> Sean
> >>
> >> On Thu, Jul 12, 2012 at 9:27 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >>> The trick is to index not with addDocument(Document) but instead
> >>> with updateDocument(Term, Document). Lucene then adds the document
> >>> atomically while deleting any previous documents with the given term
> >>> (which is your unique ID). If the key does not exist it simply
> >>> indexes without deleting anything.
> >>> This way you always have only one document with the same Term (==your
> >>> unique ID).
> >>>
> >>> Uwe
> >>>
> >>> -----
> >>> Uwe Schindler
> >>> H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de
> >>> eMail: uwe@thetaphi.de
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Sean Bridges [mailto:sean.bridges@gmail.com]
> >>>> Sent: Thursday, July 12, 2012 5:42 PM
> >>>> To: java-user@lucene.apache.org; simon.willnauer@gmail.com
> >>>> Subject: Re: delete by docid in lucene 4
> >>>>
> >>>> We have indexer machines which are fed documents by other machines.
> >>>> If an error occurs (machine crashing etc) the same document may be
> >>>> sent to an indexer multiple times.  Serial ids are assigned before
> >>>> documents reach the indexer, so a document may be in the index
> >>>> multiple times, each time with the same serial id.
> >>>>
> >>>> When the index gets large enough, the indexer will stop writing to
> >>>> the index, and upload it to another machine, which keeps the index
> >>>> forever.  Before we upload the index, we forceMerge(1) on it, and
> >>>> gather some stats about the index like max,min serial id, total
> >>>> documents.  While calculating max and min serial id, if we see a
> >>>> duplicate serial id, we call IndexReader.deleteByDocId(...) .
> >>>>
> >>>> We could check for duplicate serial ids while indexing, but that is
> >>>> racy, and not as efficient.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Sean
> >>>>
> >>>>
> >>>> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
> >>>> <simon.willnauer@gmail.com> wrote:
> >>>> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.bridges@gmail.com> wrote:
> >>>> >> Is it possible to delete by docId in lucene 4?  I can delete by
> >>>> >> docid in lucene 3 using IndexReader.deleteDocument(int docId), but
> >>>> >> that method is gone in lucene 4, and IndexWriter only allows
> >>>> >> deleting by Term or Query.
> >>>> >
> >>>> > that is correct. In lucene 4 IndexReader is really just a reader!
> >>>> >>
> >>>> >> This is our use case -  In our system, each document is identified
> >>>> >> by a unique serial id.  If an error occurs, we may index the same
> >>>> >> message multiple times.  When an index grows large enough, we stop
> >>>> >> adding to it, and optimize the index.  During optimization, if we
> >>>> >> see multiple docs with the same serialid, we delete all but the
> >>>> >> first, as all documents with the same serialid are the same.
> >>>> >
> >>>> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
> >>>> > method? Do you rely on multiple versions of the same doc? With
> >>>> > Lucene 4 relying on the doc id can become very tricky. If you use
> >>>> > multiple threads you create a lot of segments which can be merged in
> >>>> > any order. You can't tell if a document ID maintains happened-before
> >>>> > semantics at all.
> >>>> >
> >>>> > Can you tell us more about your use case and why you are using
> >>>> > deleteByDocID?
> >>>> >
> >>>> > simon
> >>>> >
> >>>> >
> >>>> >>
> >>>> >> Thanks,
> >>>> >>
> >>>> >> Sean
> >>>> >>
> >>>> >>
> >>>> >> ---------------------------------------------------------------------
> >>>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>> >>



