lucene-java-user mailing list archives

From Sean Bridges <>
Subject Re: delete by docid in lucene 4
Date Thu, 12 Jul 2012 15:41:57 GMT
We have indexer machines which are fed documents by other machines.
If an error occurs (a machine crashing, etc.), the same document may be
sent to an indexer multiple times.  Serial ids are assigned before
documents reach the indexer, so a document may be in the index
multiple times, each time with the same serial id.

When the index gets large enough, the indexer will stop writing to the
index, and upload it to another machine, which keeps the index
forever.  Before we upload the index, we call forceMerge(1) on it and
gather some stats about the index, such as the max and min serial id
and the total document count.  While calculating the max and min serial
id, if we see a duplicate serial id, we call
IndexReader.deleteDocument(docId).
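
To make the scan concrete, here is a minimal sketch in plain Java with no Lucene calls. It assumes the serial ids have already been read into an array indexed by doc id; the names SerialIdScan, Stats and scan are hypothetical and only for illustration. It tracks min, max and total in one pass and records the doc id of every repeat after the first, which are the docs we would then delete.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: one pass over serial ids in doc id order.
final class SerialIdScan {

    // Stats gathered before upload, plus the doc ids that should be deleted.
    static final class Stats {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        int total = 0; // counts every doc scanned, duplicates included
        final List<Integer> duplicateDocIds = new ArrayList<>();
    }

    // Assumption: serialIdsByDocId[docId] holds the serial id stored for that doc.
    static Stats scan(long[] serialIdsByDocId) {
        Stats stats = new Stats();
        Set<Long> seen = new HashSet<>();
        for (int docId = 0; docId < serialIdsByDocId.length; docId++) {
            long serialId = serialIdsByDocId[docId];
            stats.total++;
            stats.min = Math.min(stats.min, serialId);
            stats.max = Math.max(stats.max, serialId);
            // Set.add returns false when the serial id was already seen,
            // so only the first doc with a given serial id is kept.
            if (!seen.add(serialId)) {
                stats.duplicateDocIds.add(docId);
            }
        }
        return stats;
    }
}
```

For example, SerialIdScan.scan(new long[] {5, 3, 5, 9, 3}) reports min 3, max 9, total 5, and marks doc ids 2 and 4 as duplicates.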

We could check for duplicate serial ids while indexing, but that is
racy, and not as efficient.

On Thu, Jul 12, 2012 at 12:42 AM, Simon Willnauer
<> wrote:
> On Thu, Jul 12, 2012 at 3:09 AM, Sean Bridges <> wrote:
>> Is it possible to delete by docId in lucene 4?  I can delete by docid
>> in lucene 3 using IndexReader.deleteDocument(int docId), but that
>> method is gone in lucene 4, and IndexWriter only allows deleting by
>> Term or Query.
> that is correct. In lucene 4 IndexReader is really just a reader!
>> This is our use case -  In our system, each document is identified by
>> a unique serial id.  If an error occurs, we may index the same message
>> multiple times.  When an index grows large enough, we stop adding to
>> it, and optimize the index.  During optimization, if we see multiple
>> docs with the same serialid, we delete all but the first, as all
>> documents with the same serialid are the same.
> I am wondering why you don't use the IW#updateDocument(Term,Doc)
> method? Do you rely on multiple versions of the same doc? With Lucene
> 4, relying on the doc id can become very tricky. If you use multiple
> threads, you create a lot of segments, which can be merged in any order.
> You can't tell whether a document ID maintains happened-before semantics
> at all.
> Can you tell us more about your use case and why you are using deleteByDocID?
> simon
>> Thanks,
>> Sean
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
