lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: delete by docid in lucene 4
Date Thu, 12 Jul 2012 16:27:24 GMT
The trick is to index with updateDocument(Term, Document) instead of
addDocument(Document). Lucene then adds the document atomically while
deleting any previous documents with the given term (which is your
unique ID). If the key does not exist, it simply indexes the document
without deleting anything.
This way you always have only one document per Term (== your unique
ID).
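
To make the difference concrete, here is a small toy model of the two
behaviours in plain Java. It is not Lucene code: ToyIndex and its fields are
hypothetical names made up for this sketch, and the list stands in for the
real index. Only the method names mirror addDocument(Document) and
updateDocument(Term, Document), with the serial id playing the role of the
Term.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index (hypothetical, NOT Lucene).
// It only models add-versus-update semantics for a unique key.
final class ToyIndex {
    // Each entry is {serialId, body}.
    private final List<String[]> docs = new ArrayList<>();

    // Models addDocument(Document): always appends,
    // so a resent document becomes a duplicate.
    synchronized void addDocument(String serialId, String body) {
        docs.add(new String[] {serialId, body});
    }

    // Models updateDocument(Term, Document): removes every document
    // with the same key, then adds the new one, as a single step.
    synchronized void updateDocument(String serialId, String body) {
        docs.removeIf(d -> d[0].equals(serialId));
        docs.add(new String[] {serialId, body});
    }

    // Number of stored documents carrying the given serial id.
    synchronized long count(String serialId) {
        return docs.stream().filter(d -> d[0].equals(serialId)).count();
    }

    public static void main(String[] args) {
        ToyIndex withAdd = new ToyIndex();
        withAdd.addDocument("42", "hello");
        withAdd.addDocument("42", "hello"); // same document resent after a crash

        ToyIndex withUpdate = new ToyIndex();
        withUpdate.updateDocument("42", "hello");
        withUpdate.updateDocument("42", "hello"); // resend replaces the first copy

        System.out.println("add: " + withAdd.count("42")
                + ", update: " + withUpdate.count("42"));
    }
}
```

With the real IndexWriter, the serial id would be indexed as a field of the
document, and the same field name and value would be passed as the Term.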

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Sean Bridges [mailto:sean.bridges@gmail.com]
> Sent: Thursday, July 12, 2012 5:42 PM
> To: java-user@lucene.apache.org; simon.willnauer@gmail.com
> Subject: Re: delete by docid in lucene 4
> 
> We have indexer machines which are fed documents by other machines.
> If an error occurs (machine crashing etc.) the same document may be sent to an
> indexer multiple times.  Serial ids are assigned before documents reach the
> indexer, so a document may be in the index multiple times, each time with the
> same serial id.
> 
> When the index gets large enough, the indexer will stop writing to the index,
> and upload it to another machine, which keeps the index forever.  Before we
> upload the index, we forceMerge(1) on it, and gather some stats about the
> index like max, min serial id, total documents.  While calculating max and min
> serial id, if we see a duplicate serial id, we call
> IndexReader.deleteByDocId(...).
> 
> We could check for duplicate serial ids while indexing, but that is racy, and not
> as efficient.
> 
> Thanks,
> 
> Sean
> 
> 
> On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
> <simon.willnauer@gmail.com> wrote:
> > On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <sean.bridges@gmail.com> wrote:
> >> Is it possible to delete by docId in lucene 4?  I can delete by docid
> >> in lucene 3 using IndexReader.deleteDocument(int docId), but that
> >> method is gone in lucene 4, and IndexWriter only allows deleting by
> >> Term or Query.
> >
> > that is correct. In lucene 4 IndexReader is really just a reader!
> >>
> >> This is our use case -  In our system, each document is identified by
> >> a unique serial id.  If an error occurs, we may index the same
> >> message multiple times.  When an index grows large enough, we stop
> >> adding to it, and optimize the index.  During optimization, if we see
> >> multiple docs with the same serialid, we delete all but the first, as
> >> all documents with the same serialid are the same.
> >
> > I am wondering why you don't use the IW#updateDocument(Term,Doc)
> > method? do you rely on multiple versions of the same doc? With Lucene
> > 4 relying on the doc id can become very tricky. If you use multiple
> > threads you create a lot of segments which can be merged in any order.
> > You can't tell if a document ID maintains happened-before semantics at
> > all.
> >
> > Can you tell us more about your use case and why you are using
> > deleteByDocID?
> >
> > simon
> >
> >
> >>
> >> Thanks,
> >>
> >> Sean
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >